dcrawler v0.0.8
node-distributed-crawler
Features
- Distributed crawler
- Configurable URL parser and data parser
- jQuery selectors using cheerio
- Parsed data insertion into a MongoDB collection
- Domain-wise interval configuration in a distributed environment
- node 0.8+ support
Note: update to the latest version (0.0.4+); don't use 0.0.1
I am actively updating this library; feature suggestions and fork requests are welcome :)
Installation
$ npm install dcrawler
Usage
var DCrawler = require("dcrawler");
var options = {
mongodbUri: "mongodb://0.0.0.0:27017/crawler-data",
profilePath: __dirname + "/" + "profile"
};
var logs = {
dbUri: "mongodb://0.0.0.0:27017/crawler-log",
storeHost: true
};
var dc = new DCrawler(options, logs);
dc.start();
Note: The MongoDB connection URIs (mongodbUri and dbUri) should be the same across all workers, since the queueing of URLs must be centralized.
The DCrawler constructor takes options and logs: 1. options with the following properties *:
- mongodbUri: MongoDB connection URI. (Eg: 'mongodb://0.0.0.0:27017/crawler') *
- profilePath: Location of the profile directory which contains the config files. (Eg: /home/crawler/profile) *
2. logs to store logs in a centralized location using winston-mongodb, with the following properties:
- dbUri: MongoDB connection URI. (Eg: 'mongodb://0.0.0.0:27017/crawler')
- storeHost: Boolean, true or false to store the worker's hostname in the log collection or not.
Note: logs is only required when you want to store centralized logs in MongoDB. If you don't want to store logs, there is no need to pass the logs argument to the DCrawler constructor:
var dc = new DCrawler(options);
Create a config file for each domain inside the profilePath directory. Check the example profile example.com, which contains a config with the following properties (a sketch of a complete profile file follows the list below):
- collection: Name of the collection in which to store parsed data in MongoDB. (Eg: 'products') *
- url: URL to start crawling from. String or Array of URLs. (Eg: 'http://example.com' or ['http://example.com']) *
- interval: Interval between requests in milliseconds. Default is 1000. (Eg: for a 2 second interval: 2000)
- followUrl: Boolean, true or false to fetch further URLs from the crawled page and crawl those URLs as well.
- resume: Boolean, true or false to resume crawling from previously crawled data.
- beforeStart: Function to execute before crawling starts. The function has a config param which contains the particular profile's config object. Example function:
beforeStart: function (config) {
console.log("started crawling example.com");
}
- parseUrl: Function to get further URLs from the crawled page. The function has error, response object, and $ jQuery object params, and returns an Array of URL strings. Example function:
parseUrl: function (error, response, $) {
    var _url = [];
    try {
        // Collect every product link found on the page
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                // Make relative links absolute
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
    } catch (e) {
        console.log(e);
    }
    return _url;
}
- parseData: Function to extract information from the crawled page. The function has error, response object, and $ jQuery object params, and returns a data Object to insert into the collection. Example function:
parseData: function (error, response, $) {
    var _data = null;
    try {
        // Extract product fields using jQuery selectors
        var _id = $("h1#productId").html();
        var name = $("span#productName").html();
        var price = $("label#productPrice").html();
        var url = response.uri;
        _data = {
            _id: _id,
            name: name,
            price: price,
            url: url
        };
    } catch (e) {
        console.log(e);
    }
    return _data;
}
- onComplete: Function to execute on completing crawling. The function has a config param which contains the particular profile's config object. Example function:
onComplete: function (config) {
console.log("completed crawling example.com");
}
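For reference, here is a minimal sketch of what a complete profile file might look like, combining the properties above. It assumes the profile is a Node module that exports the config object, and the file name example.com.js and the selectors are illustrative only; check the example.com profile in the repository for the exact convention.
// profile/example.com.js — hypothetical complete profile (sketch)
module.exports = {
    collection: "products",      // MongoDB collection for parsed data
    url: "http://example.com",   // start URL (string or array)
    interval: 2000,              // 2 seconds between requests
    followUrl: true,             // crawl URLs found on crawled pages
    resume: false,               // start fresh instead of resuming
    beforeStart: function (config) {
        console.log("started crawling example.com");
    },
    parseUrl: function (error, response, $) {
        var _url = [];
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                _url.push(href);
            }
        });
        return _url;
    },
    parseData: function (error, response, $) {
        return {
            _id: $("h1#productId").html(),
            name: $("span#productName").html(),
            price: $("label#productPrice").html(),
            url: response.uri
        };
    },
    onComplete: function (config) {
        console.log("completed crawling example.com");
    }
};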