0.0.3 • Published 10 years ago

node-ckan-crawler v0.0.3

node-ckan-crawler

A simple, fast Node.js-based crawler for sites powered by CKAN (http://ckan.org).

  • Uses the CKAN package_search Action.Get API to crawl packages / datasets
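To illustrate what crawling via package_search involves, here is a minimal sketch of how paged request URLs for that action could be built. It assumes the standard CKAN Action API path (/api/3/action/package_search) and its start and rows paging parameters. The helper names packageSearchUrl and pageUrls are hypothetical and are not part of node-ckan-crawler, whose internal paging logic may differ.

```javascript
// Hypothetical helper (not part of node-ckan-crawler): builds the URL for one
// page of CKAN's package_search action.
function packageSearchUrl(baseUrl, start, rows) {
  var url = new URL('api/3/action/package_search', baseUrl);
  url.searchParams.set('start', String(start)); // offset of the first dataset
  url.searchParams.set('rows', String(rows));   // number of datasets per page
  return url.toString();
}

// Lists the page URLs needed to cover `count` datasets, `rows` per page.
function pageUrls(baseUrl, count, rows) {
  var urls = [];
  for (var start = 0; start < count; start += rows) {
    urls.push(packageSearchUrl(baseUrl, start, rows));
  }
  return urls;
}

// Example: 250 datasets at 100 per page need three requests.
console.log(pageUrls('http://datahub.io/', 250, 100));
```

In a real crawl the total would typically come from the result.count field of the first response, so the number of pages is only known after the first request.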

Install

npm install node-ckan-crawler

Usage

var CKANCrawler = require('node-ckan-crawler');

var crawler = new CKANCrawler();

// Queue a CKAN site by its base URL
crawler.queueSite('http://datahub.io/');

// Log the URI and content length of each parsed response
crawler.on('content', function(response, content){
  console.log('content', response.uri, content.length);
});

More examples

More examples can be found in the examples/ directory.

API

Events

Event: 'content'

Emitted when a response from the site has been parsed and the results are ready for consumption.

response: an http.IncomingMessage object returned by mikeal's request()

body: the response body, parsed into a JSON object

crawler.on('content', function(response, body) {
    ...
});
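As a sketch of what a 'content' handler might do with body, the function below extracts dataset names. It assumes body has the usual CKAN Action API response shape (success, result.count, result.results), which this README does not confirm. datasetNames is a hypothetical helper, not part of the crawler.

```javascript
// Assumption: `body` looks like a CKAN package_search response:
// { success: true, result: { count: N, results: [ { name: '...' }, ... ] } }
function datasetNames(body) {
  // Return an empty list for failed or unexpectedly shaped responses.
  if (!body || body.success !== true || !body.result) {
    return [];
  }
  var results = Array.isArray(body.result.results) ? body.result.results : [];
  return results.map(function (pkg) {
    return pkg.name;
  });
}

// Possible usage with the crawler (not executed here):
// crawler.on('content', function (response, body) {
//   console.log(datasetNames(body));
// });
```

Guarding against a missing result keeps the handler from throwing on error responses, which would otherwise interrupt the crawl.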

Event: 'beforeQueue'

Emitted when the next link is ready to be added to the crawler queue. Call next with a non-true value to skip the link.

url: a string containing the next link ready to be added to the crawler queue

next: a callback function; pass true to queue the link

crawler.on('beforeQueue', function(url, next) {
    next(true); // to add the link to the queue
    // next(false) // to skip link
});
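One way to use 'beforeQueue' is to skip links that were already seen or that point to another host. The following is a minimal sketch under that assumption. makeQueueFilter is a hypothetical helper, and whether the crawler already de-duplicates links itself is not stated in this README.

```javascript
// Hypothetical helper: returns a 'beforeQueue' handler that accepts each URL
// on `allowedHost` once and skips everything else.
function makeQueueFilter(allowedHost) {
  var seen = new Set();
  return function (url, next) {
    var host;
    try {
      host = new URL(url).host;
    } catch (e) {
      return next(false); // not a valid absolute URL: skip it
    }
    if (host !== allowedHost || seen.has(url)) {
      return next(false); // foreign host or duplicate: skip it
    }
    seen.add(url);
    next(true); // first time we see this link: queue it
  };
}

// Possible usage with the crawler (not executed here):
// crawler.on('beforeQueue', makeQueueFilter('datahub.io'));
```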

Event: 'queued'

Emitted after a link has been added to the crawler queue.

url: a string containing the link that was added to the crawler queue

crawler.on('queued', function(url) {
    ...
});

Event: 'drain'

Emitted when the crawler has drained its queue and has no more links to crawl.

crawler.on('drain', function() {
    ...
});

Event: 'error'

Emitted when an error has occurred.

crawler.on('error', function(err) {
    ...
});
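The events above can be combined to keep simple crawl statistics and report them once the queue drains. This is only a sketch: attachStats is a hypothetical helper that works with any object exposing on(event, handler), and it assumes the events fire as described in this section.

```javascript
// Hypothetical helper: counts queued links, parsed pages and errors, then
// hands the totals to `onDone` when the crawler emits 'drain'.
function attachStats(crawler, onDone) {
  var stats = { queued: 0, pages: 0, errors: [] };

  crawler.on('queued', function (url) {
    stats.queued += 1;
  });
  crawler.on('content', function (response, body) {
    stats.pages += 1;
  });
  crawler.on('error', function (err) {
    stats.errors.push(err); // keep crawling, but remember what failed
  });
  crawler.on('drain', function () {
    onDone(stats);
  });

  return stats;
}

// Possible usage with the crawler (not executed here):
// attachStats(crawler, function (stats) {
//   console.log('done', stats.queued, stats.pages, stats.errors.length);
// });
```

Registering an 'error' listener matters for Node.js event emitters in general: an 'error' event with no listener is thrown as an exception.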

Methods

queueSite(url)

Queue a CKAN-powered site for crawling by specifying its base API URL.

Example:

crawler.queueSite('http://datahub.io')

Known Issues

Credits

Links

License

Copyright (c) 2014 Hafiz Ismail. This software is licensed under the MIT License.
