# simple-node-crawler v1.0.5
A simple web crawler for Node.js.
## Features

- Persist crawled pages to the local disk or a database (currently MongoDB is supported)
- Resume and continue an interrupted crawl
- Multiple paths with content-extraction patterns
- Convert pages to Markdown
- Auto-detect HTML encoding
- Save images
## Install

```sh
npm install simple-node-crawler
```
## Usage

```js
var Crawler = require('simple-node-crawler');

var c = new Crawler({
  host: 'developer.51cto.com',
  patterns: [{ path: 'art/', pattern: '.m_l' }],
  usedb: true,
  saveImage: true
}).start('http://developer.51cto.com/col/1308/');
```
## Configuration

### host
- Host constraint for the crawl; only URLs on this host are followed.

### patterns
- To crawl a specific path, specify the path name, or leave it as `''` to match all paths. `pattern` is a CSS selector for the main body of the page; id, class, and tag-name selectors are supported. If you need the whole HTML body, specify `'body'`.

### usedb
- Set to `false` to use the local file system. If you have MongoDB installed and want to use it, set to `true`.

### saveImage
- Whether to save images to the local file system.

### dbConnectionString
- MongoDB connection string. Defaults to `'mongodb://localhost/test'`.

### utf8
- Whether pages need to be converted to UTF-8. Defaults to `true`.

### crawlerNumber
- How many crawler threads to run. Defaults to `5`.
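Putting the options above together, a full configuration object might look like the following sketch. The `host` and `patterns` values are illustrative placeholders; the `dbConnectionString`, `utf8`, and `crawlerNumber` values are the documented defaults, while the `usedb` and `saveImage` values are just example choices (no defaults are documented for them):

```js
// Illustrative configuration object for simple-node-crawler.
var options = {
  host: 'developer.51cto.com',            // restrict crawling to this host (example value)
  patterns: [
    { path: 'art/', pattern: '.m_l' }     // CSS selector for the page's main body (example value)
  ],
  usedb: false,                           // false = local file system, true = MongoDB (example choice)
  saveImage: false,                       // whether to save images locally (example choice)
  dbConnectionString: 'mongodb://localhost/test', // documented default
  utf8: true,                             // convert pages to UTF-8 (documented default)
  crawlerNumber: 5                        // number of crawler threads (documented default)
};
```

You would pass this object to `new Crawler(options)` and then call `.start(url)` as shown in the usage section.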
## Features to be implemented

- Keyword analysis & extraction
## License

MIT