dcrawler v0.0.8
node-distributed-crawler
Features
- Distributed crawler
- Configurable URL parser and data parser
- jQuery selectors using cheerio
- Parsed data insertion into a MongoDB collection
- Domain-wise interval configuration in a distributed environment
- node 0.8+ support
Note: update to the latest version (0.0.4+); don't use 0.0.1
I am actively updating this library; feature suggestions and fork requests are welcome :)
Installation
$ npm install dcrawler
Usage
var DCrawler = require("dcrawler");
var options = {
mongodbUri: "mongodb://0.0.0.0:27017/crawler-data",
profilePath: __dirname + "/" + "profile"
};
var logs = {
dbUri: "mongodb://0.0.0.0:27017/crawler-log",
storeHost: true
};
var dc = new DCrawler(options, logs);
dc.start();
Note: The MongoDB connection URIs (mongodbUri and dbUri) should be the same across all workers, since the queueing of URLs must be centralized.
The DCrawler constructor takes options and logs: 1. options with the following properties *:
- mongodbUri: MongoDB connection URI. (Eg: 'mongodb://0.0.0.0:27017/crawler') *
- profilePath: Location of the profile directory which contains the config files. (Eg: /home/crawler/profile) *
2. logs to store logs in a centralized location using winston-mongodb, with the following properties:
- dbUri: MongoDB connection URI. (Eg: 'mongodb://0.0.0.0:27017/crawler')
- storeHost: Boolean, true or false to store the worker's hostname in the log collection or not.
Note: logs is only required when you want to store centralized logs in MongoDB. If you don't want to store logs, there is no need to pass the logs argument to the DCrawler constructor:
var dc = new DCrawler(options);
Create a config file for each domain inside the profilePath directory. Check the example profile example.com, which contains a config with the following properties (a sketch of a complete profile file follows the list below):
- collection: Name of the collection in which to store parsed data in MongoDB. (Eg: 'products') *
- url: URL to start crawling from. String or Array of URLs. (Eg: 'http://example.com' or ['http://example.com']) *
- interval: Interval between requests in milliseconds. Default is 1000. (Eg: for a 2 second interval: 2000)
- followUrl: Boolean, true or false to fetch further URLs from the crawled page and crawl those URLs as well.
- resume: Boolean, true or false to resume crawling from previously crawled data.
- beforeStart: Function to execute before crawling starts. The function has a config param which contains the particular profile's config object. Example function:
beforeStart: function (config) {
console.log("started crawling example.com");
}
- parseUrl: Function to get further URLs from the crawled page. The function has error, response object, and $ jQuery object params, and returns an Array of URL strings. Example function:
parseUrl: function (error, response, $) {
    var _url = [];
    try {
        // Collect every product link found on the page
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                // Make relative links absolute
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
    } catch (e) {
        console.log(e);
    }
    return _url;
}
- parseData: Function to extract information from the crawled page. The function has error, response object, and $ jQuery object params, and returns a data Object to insert into the collection. Example function:
parseData: function (error, response, $) {
    var _data = null;
    try {
        // Extract product fields using jQuery selectors
        var _id = $("h1#productId").html();
        var name = $("span#productName").html();
        var price = $("label#productPrice").html();
        var url = response.uri;
        _data = {
            _id: _id,
            name: name,
            price: price,
            url: url
        };
    } catch (e) {
        console.log(e);
    }
    return _data;
}
- onComplete: Function to execute on completing crawling. The function has a config param which contains the particular profile's config object. Example function:
onComplete: function (config) {
console.log("completed crawling example.com");
}
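For reference, here is a minimal sketch of what a complete profile file might look like, combining the properties above. It assumes the profile is a Node module that exports the config object, and the file name example.com.js and the selectors are illustrative only; check the example.com profile in the repository for the exact convention.
// profile/example.com.js — hypothetical complete profile (sketch)
module.exports = {
    collection: "products",      // MongoDB collection for parsed data
    url: "http://example.com",   // start URL (string or array)
    interval: 2000,              // 2 seconds between requests
    followUrl: true,             // crawl URLs found on crawled pages
    resume: false,               // start fresh instead of resuming
    beforeStart: function (config) {
        console.log("started crawling example.com");
    },
    parseUrl: function (error, response, $) {
        var _url = [];
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                _url.push(href);
            }
        });
        return _url;
    },
    parseData: function (error, response, $) {
        return {
            _id: $("h1#productId").html(),
            name: $("span#productName").html(),
            price: $("label#productPrice").html(),
            url: response.uri
        };
    },
    onComplete: function (config) {
        console.log("completed crawling example.com");
    }
};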