0.1.3 • Published 7 years ago
weibo-crawler v0.1.3
Weibo Crawler / 微博爬虫
A simple Weibo crawler.
Features
- Crawl all the weibos of a user.
Result
http://weibo.com/p/1005051736338681/home?from=page_100505_profile&wvr=6&mod=data#place
[
{
"scheme": "http://m.weibo.cn/status/EquqEyH0v?mblogid=EquqEyH0v&luicode=10000011&lfid=1076031736338681",
"createdAt": "01-12 16:38",
"id": "4063135008268475",
"text": "想玩物丧志,请问玩什么物能比较轻松愉快无负担地丧志 ",
"repostsCount": 3,
"commentsCount": 47,
"likesCount": 59
},
{
"scheme": "http://m.weibo.cn/status/EpMbwy2dU?mblogid=EpMbwy2dU&luicode=10000011&lfid=1076031736338681",
"createdAt": "01-08 00:00",
"id": "4061434268111702",
"text": "分享图片 ",
"repostsCount": 5,
"commentsCount": 7,
"likesCount": 71,
"pics": [
"http://wx4.sinaimg.cn/orj360/677e6cf9ly1fbiie692r0j20ku09ywg7.jpg",
"http://wx3.sinaimg.cn/orj360/677e6cf9ly1fbiie60h1nj20ku0a9ab9.jpg",
"http://wx2.sinaimg.cn/orj360/677e6cf9ly1fbiie6hl5aj20ku04pdgb.jpg"
]
},
{
"scheme": "http://m.weibo.cn/status/EpLXkAI8Q?mblogid=EpLXkAI8Q&luicode=10000011&lfid=1076031736338681",
"createdAt": "01-07 23:25",
"id": "4061425468749492",
"text": "Repost",
"repostsCount": 14,
"commentsCount": 3,
"likesCount": 29,
"retweetedStatus": {
"id": "4061419093386276",
"text": "如果你讨厌文青,那这是一个好时代。我们有能者多劳的大大,有始终关怀引导年轻人文化思想的团团。文艺行业腊月十八,而你的孩子,再也不用看那些把文青们脑子弄乱的东西了。 ",
"userName": "alitha",
"userId": 3171360847
}
},
{
"...": "...."
}
]
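As a rough illustration of working with this output, the sketch below tallies a few fields from an array shaped like the result above. The `summarize` helper is hypothetical (it is not part of weibo-crawler), and the picture file names in the sample are placeholders; only the field names `likesCount`, `pics`, and `retweetedStatus` come from the sample output.

```javascript
// Hypothetical helper: count posts, likes, reposts and pictures
// in an array shaped like the crawler's result.
function summarize(weibos) {
  return weibos.reduce(
    (acc, w) => {
      acc.total += 1
      acc.likes += w.likesCount || 0
      // A post that carries `retweetedStatus` is a repost.
      if (w.retweetedStatus) acc.reposts += 1
      // `pics` is only present on posts with images.
      if (Array.isArray(w.pics)) acc.pics += w.pics.length
      return acc
    },
    { total: 0, likes: 0, reposts: 0, pics: 0 }
  )
}

// Trimmed-down sample using the ids and like counts shown above;
// the picture names are placeholders.
const sample = [
  { id: '4063135008268475', likesCount: 59 },
  { id: '4061434268111702', likesCount: 71, pics: ['a.jpg', 'b.jpg', 'c.jpg'] },
  { id: '4061425468749492', likesCount: 29, retweetedStatus: { id: '4061419093386276' } }
]

console.log(summarize(sample))
// { total: 3, likes: 159, reposts: 1, pics: 3 }
```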
Quick Start
git clone https://github.com/fytriht/weibo-crawler.git
cd weibo-crawler
npm install
npm run test
Basic usage
Installing
npm i weibo-crawler
Get URL & Start crawling
- Go to the home page of the user you want to crawl
- Click the "他/她的主页" ("His/Her home page") button
- Copy the URL from the address bar
const weiboCrawler = require('weibo-crawler')
const url = 'http://weibo.com/p/1005051736338681/home?from=page_100505_profile&wvr=6&mod=data#place'
weiboCrawler(url)
.then(data => {
console.log(JSON.stringify(data, null, 2))
/*
* or you can, for example:
*
* const fs = require('fs')
*
* fs.writeFile('data.json', JSON.stringify(data, null, 2), (err) => {
* if (err) throw err
* })
*
*/
})
.catch(err => console.error('Something went wrong:', err))
You can also set the concurrency:
...
weiboCrawler(url, concurrency) // concurrency defaults to 5
...
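To make the idea of a concurrency limit concrete, here is a minimal sketch of running async tasks with at most `limit` in flight at once. This is not weibo-crawler's actual implementation, which the README does not describe; `mapWithConcurrency` and its parameters are hypothetical names chosen for illustration.

```javascript
// Hypothetical sketch: run `worker` over `items`, keeping at most
// `limit` calls in flight. Results keep the order of `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0 // index of the next item to pick up

  // Each runner pulls items off the shared index until none are left.
  async function run() {
    while (next < items.length) {
      const i = next++
      results[i] = await worker(items[i], i)
    }
  }

  // Start up to `limit` runners and wait for all of them to drain the queue.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run())
  await Promise.all(runners)
  return results
}

// Usage sketch: fetch at most 5 pages at a time.
// `pageUrls` and `fetchPage` are placeholders, not library functions.
// mapWithConcurrency(pageUrls, 5, fetchPage).then(pages => { /* ... */ })
```

A lower limit means fewer simultaneous requests, which is the reason the default value is suggested when the remote site throttles crawlers.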
Limitation
- Because of Sina Weibo's anti-crawling measures, the default concurrency is recommended. If the concurrency is too high, you may get an error like the one below. Wait a few minutes and try again.
node index.js
undefined:1
<!DOCTYPE html>
^
SyntaxError: Unexpected token < in JSON at position 0
at JSON.parse (<anonymous>)
at superagent.get.end (/*****************************/node_modules/weibo-crawler/getApiUrls.js:21:32)
at Request.callback (/*****************************/node_modules/superagent/lib/node/index.js:672:3)
at Stream.<anonymous> (/*****************************/node_modules/superagent/lib/node/index.js:866:18)
at emitNone (events.js:86:13)
at Stream.emit (events.js:185:7)
at Unzip.<anonymous> (/*****************************/node_modules/superagent/lib/node/unzip.js:53:12)
at emitNone (events.js:91:20)
at Unzip.emit (events.js:185:7)
at endReadableNT (_stream_readable.js:974:12)
- If the user has a lot of weibos (3000+), you might get unexpected results. I am working on this issue.
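The "wait and try again" advice can be automated with a small retry wrapper. The sketch below assumes the failure reaches your code as a rejected promise; the stack trace above shows the error thrown inside a superagent callback, so in practice it may crash the process instead, in which case this wrapper will not help. `retry`, `attempts`, and `delayMs` are hypothetical names, not part of weibo-crawler.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Hypothetical helper: call `task` up to `attempts` times,
// waiting `delayMs` between failed attempts.
async function retry(task, { attempts = 3, delayMs = 5 * 60 * 1000 } = {}) {
  let lastError
  for (let i = 1; i <= attempts; i++) {
    try {
      return await task()
    } catch (err) {
      lastError = err
      // Only wait if another attempt is coming.
      if (i < attempts) await sleep(delayMs)
    }
  }
  throw lastError
}

// Usage sketch, with `weiboCrawler` and `url` as in "Basic usage":
// retry(() => weiboCrawler(url))
//   .then(data => console.log(data.length))
//   .catch(err => console.error('Still failing:', err))
```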
TODO
- test cases
- error handling
- unescaped text content
Support
If you have any problems or suggestions, please open an issue on the GitHub repository (https://github.com/fytriht/weibo-crawler).