
spiderer

description

A lightweight spider (web crawler) framework implemented in Node.js.

You can crawl many pages at the same time to make full use of network I/O. Be careful with the minimum interval, though: if it is too small, your IP address may be blocked.
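The trade-off between concurrency and a minimum interval can be sketched as follows. This is only an illustration of the idea, not spiderer's internal code: `createIntervalGate` and `reserve` are hypothetical names, and times are assumed to be in milliseconds. Each request is told how long to wait so that consecutive requests start at least one interval apart, while earlier requests may still be in flight.

```javascript
// Hypothetical helper, not part of spiderer: computes how long each request
// must wait so that consecutive requests start at least `intervalMs` apart.
function createIntervalGate(intervalMs) {
  let nextSlot = 0; // earliest timestamp (ms) at which the next request may start
  return function reserve(now) {
    const startAt = Math.max(now, nextSlot);
    nextSlot = startAt + intervalMs;
    return startAt - now; // delay (ms) before this request may start
  };
}

// Three requests arrive at t = 0, a fourth at t = 10000, with a 4000 ms interval.
const reserve = createIntervalGate(4000);
const delays = [reserve(0), reserve(0), reserve(0), reserve(10000)];
console.log(delays); // [ 0, 4000, 8000, 2000 ]
```

The fourth request still waits 2000 ms because the third one was scheduled to start at t = 8000, so the next free slot is t = 12000.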

example

var Spider = require('spiderer');

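function filter(err, res, $) {
	// Skip pages that failed to load; $ may not be usable in that case.
	if (err) {
		console.error(err);
		return;
	}
	console.log($('title').text());
	// Select every element with an href attribute.
	// Named `links` so it does not redeclare the `res` parameter.
	var links = $('[href]');
	if (links.map) {
		return links.map(function() {
			return $(this).attr('href');
		});
	}
}
var config = {
	startURLs: ['http://wanghuanming.com'],
	interval: 4 * 1000,
	filter: filter,
	log: true
};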

var spider = new Spider(config);
spider.start();

configuration

filter: The function that processes each page. As in the example, it receives an error, the response, and $ (a jQuery-style selector function), and returns a selection from $. If no filter is provided, the spider crawls all URLs in the HTML. This is the most important function; it is where the useful work should happen.
startURLs: The spider starts from these URLs.
interval: The spider's working interval. Defaults to 2 * 1000.
log: Whether to log. If true, log output is stored in log/file.
timeout: The request timeout.
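Because the filter decides which links are followed, it is a natural place to narrow the crawl. The sketch below is one possible approach, not part of spiderer: `sameHostLinks` is a hypothetical helper, and the example.com URLs are placeholders. It uses Node's built-in `URL` class to resolve relative links and keep only http(s) links on the same host as the page.

```javascript
// Hypothetical helper, not part of spiderer: resolve hrefs against the page URL
// and keep only http(s) links on the same host, without duplicates.
function sameHostLinks(hrefs, pageURL) {
  const base = new URL(pageURL);
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, base); // resolves relative links such as "/about"
    } catch (e) {
      continue; // skip hrefs that cannot be parsed
    }
    if (!/^https?:$/.test(url.protocol)) continue; // drop mailto:, javascript:, ...
    if (url.host !== base.host) continue; // drop links to other hosts
    url.hash = ''; // "#section" anchors point at the same page
    seen.add(url.href);
  }
  return Array.from(seen);
}

const links = sameHostLinks(
  ['/about', 'post/1#comments', 'http://example.org/', 'mailto:me@example.com', '/about'],
  'http://example.com/blog/'
);
console.log(links);
// [ 'http://example.com/about', 'http://example.com/blog/post/1' ]
```

Inside a filter, the hrefs could be collected from the page with cheerio-style calls such as `$('[href]').map(...).get()`. Whether spiderer accepts a plain array as the filter's return value is not stated here, so treat that part as an assumption to verify.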

