xspider v0.2.0

spider

A spider in node.js.

features

1. can set the maximum number of requests per second

2. can use extend to create subclasses

install

npm install xspider

usage

```js
var Spider = require('xspider').Spider,
    Crawler = require('xspider').Crawler;

var s = new Spider('http://www.sina.com.cn/');
s.start(new Crawler());
```

See examples/v2ex.js for a complete example.

API

Spider

options

Create a spider with options:

var s = new Spider(option)

options:

- maxConnections: maximum number of concurrent connections
- rps: maximum requests per second; if less than 1 (for example 0.5), the spider makes one request every two seconds
- maxPages: maximum number of pages to crawl
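For example, a throttled spider (the option values here are illustrative):

```js
var s = new Spider({
    maxConnections: 5,  // at most 5 requests in flight at once
    rps: 0.5,           // one request every two seconds
    maxPages: 100       // stop after crawling 100 pages
});
```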

methods:

start

start(crawler)

Start the spider crawling. Before starting, you must set the spider's crawler instance, either with the crawler() method or by passing a crawler instance to start().
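Both ways of supplying the crawler, sketched:

```js
var c = new Crawler();

// Option 1: set the crawler first, then start.
s.crawler(c);
s.start();

// Option 2: pass the crawler directly to start().
// s.start(c);
```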

stop

stop()

Stop the spider.

cycle

cycle(crawler, interval)
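The name and signature suggest a periodic crawl: start the spider with crawler and repeat every interval. A minimal sketch, assuming the interval is in milliseconds (the unit is not documented):

```js
// Re-crawl once an hour; the millisecond unit is an assumption.
s.cycle(new Crawler(), 60 * 60 * 1000);
```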

crawler

crawler(crawl)

Set or get the spider's crawler instance.
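Used as a setter and a getter:

```js
s.crawler(new Crawler());  // set the crawler instance
var c = s.crawler();       // get it back
```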

pause

pause()

Pause the spider.

resume

resume()

Resume a paused spider.
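For example, backing off for a few seconds before continuing:

```js
s.pause();            // stop issuing new requests for a while
setTimeout(function() {
    s.resume();       // continue crawling where the spider left off
}, 5000);
```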

static method

extend

Extend the Spider class:

```js
var MySpider = Spider.extend({
    start: function(crawler) {
        // ... custom start logic ...
    }
});
```

Crawler

methods

setRoute

Set crawler callbacks for specific URLs.

router

router(url)

Return the crawler's method that handles the given URL. A crawler has two basic methods: index and detail.

fetch

fetch(url)

Used internally. Returns a promise.

handle

handle(url)

This method is the interface for handling a URL. Internally, it first calls router() to find a method that can handle the URL; if one is found, it calls fetch() and then invokes that method with the URL and its response.
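A conceptual sketch of that flow, not the library's actual source:

```js
// Roughly what handle(url) does, per the description above.
function handle(url) {
    var method = this.router(url);   // find a handler (index, detail, ...)
    if (!method) return;             // no route matched: skip this URL
    var self = this;
    return this.fetch(url).then(function(resp) {
        return method.call(self, url, resp);  // e.g. index(url, resp)
    });
}
```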

static method

extend

Extend the Crawler class:

```js
var MyCrawler = Crawler.extend({
    index: function(url, resp) { return []; },
    detail: function(url, resp) { return []; }
});
```

Each callback should return an array of URLs to be crawled next.
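Putting it together, a custom crawler driving a spider; the start URL and the returned links are illustrative:

```js
var Spider = require('xspider').Spider,
    Crawler = require('xspider').Crawler;

var MyCrawler = Crawler.extend({
    // Collect links to follow from a listing page
    // (the link extraction logic is omitted here).
    index: function(url, resp) {
        return ['http://www.sina.com.cn/news/article.html'];
    },
    // Handle a single page; return no further URLs.
    detail: function(url, resp) {
        return [];
    }
});

var s = new Spider('http://www.sina.com.cn/');
s.start(new MyCrawler());
```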