# simple-node-crawler v1.0.5
A simple web crawler for Node.js.
## Features

- Persist crawled pages to the local disk or a database (currently MongoDB is supported)
- Resume and continue an interrupted crawl
- Multiple paths with content-extraction patterns
- Convert pages to Markdown
- Auto-detect HTML encoding
- Save images
## Install

```sh
npm install simple-node-crawler
```
## Usage

```js
var Crawler = require('simple-node-crawler');

var c = new Crawler({
  host: 'developer.51cto.com',
  patterns: [{ path: 'art/', pattern: '.m_l' }],
  usedb: true,
  saveImage: true
}).start('http://developer.51cto.com/col/1308/');
```
## Configuration

### host
- Host constraint for the crawl; only URLs on this host are followed.

### patterns
- To crawl a specific path, specify the path name, or leave it as `''` to match all paths. `pattern` is a CSS selector for the main body of the page; id, class, and tag-name selectors are supported. If you need the whole HTML body, specify `'body'`.

### usedb
- Set to `false` to use the local file system. If you have MongoDB installed and want to use it, set to `true`.

### saveImage
- Whether to save images to the local file system.

### dbConnectionString
- MongoDB connection string. Defaults to `'mongodb://localhost/test'`.

### utf8
- Whether pages need to be converted to UTF-8. Defaults to `true`.

### crawlerNumber
- How many crawler threads to run. Defaults to `5`.
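Putting the options above together, a full configuration object might look like the following sketch. The `host` and `patterns` values are illustrative placeholders; the `dbConnectionString`, `utf8`, and `crawlerNumber` values are the documented defaults, while the `usedb` and `saveImage` values are just example choices (no defaults are documented for them):

```js
// Illustrative configuration object for simple-node-crawler.
var options = {
  host: 'developer.51cto.com',            // restrict crawling to this host (example value)
  patterns: [
    { path: 'art/', pattern: '.m_l' }     // CSS selector for the page's main body (example value)
  ],
  usedb: false,                           // false = local file system, true = MongoDB (example choice)
  saveImage: false,                       // whether to save images locally (example choice)
  dbConnectionString: 'mongodb://localhost/test', // documented default
  utf8: true,                             // convert pages to UTF-8 (documented default)
  crawlerNumber: 5                        // number of crawler threads (documented default)
};
```

You would pass this object to `new Crawler(options)` and then call `.start(url)` as shown in the usage section.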
## Features to be implemented

- Keyword analysis & extraction
## License

MIT