0.0.0 • Published 10 years ago

smart-crawler v0.0.0

Weekly downloads
2
License
MIT
Repository
github
Last release
10 years ago

Smart Crawler

What is this

This module is propose to scrapy website pages and extract information from doms which selected by jQuery-like selectors

Features

  • Batch page scrapy support
  • jQuery-like selector supported for extracting dom infomation
  • Promises/A style support

How to use

  • Scrapy website: new Crawler([domain1, domain2], callback)
var Crawler = require("crawler");
var domain = "http://example.com";

new Crawler(domain, function (err, result, mergedResult) {
	var $body = result[domain];
	console.log($body.html());
});
  • Using selector: new Crawler([domain1, domain2], queryString, callback)

Note: not all jQuery query style is supported, details on cheerio

var Crawler = require("crawler");
var domain = "http://example.com";

new Crawler(domain, "p", function (err, result, mergedResult) {

	if (err) return;

	var paragraphs = result[domain];
	paragraphs.forEach(function ($p) {
		console.log($p.text());
	});
});
  • Custom request options: new Crawler([domain1, domain2], requestOptions,queryString, callback)
var Crawler = require("crawler");
var domain = "http://example.com";

new Crawler(domain, {
	'timeout': 2000,
    'headers': {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'
    }
} "p", function (err, result, mergedResult) {

	if (err) return;
});

Request Options available on Request: Custom HTTP Headers

More examples please visit: examples

External API

  • .refetch(): to refetch the same sites with same parameters;