0.0.1 • Published 6 years ago

tspider v0.0.1

Weekly downloads
3
License
MIT
Repository
github

tspider

A web spider tool written in TypeScript that classifies requests by type.

USAGE

Basic Usage

// es6 or typescript
import { Spider } from 'tspider'
// commonjs
// const { Spider } = require('tspider')

const spider = new Spider({
  keepSession: true,
  timeout: 5000,
  limit: 2,
  interval: 1000,
  defaultParser(res, body, spider) {
    console.log(body)
  },
  defaultErrorHandler(err, res, spider) {
    console.error(err)
  }
})

// add new urls or requestOption objects to the spider
spider.add(['url1', 'url2', 'url3'])

spider.start()

The response object passed to parsers and error handlers is a superagent response object.

Define Different Parser By TypeName

const spider = new Spider(options)

spider.type('typeName').parse((res, body, spider) => {
  console.log(body)
}).error((err, res) => {
  console.error(err)
})

// Pass the request type as the add method's first parameter
spider.add('typeName', ['url1', 'url2', 'url3'])

// Or pass the request type inside each requestOption object
spider.add([{
  type: 'typeName',
  url: 'url1'
}, {
  type: 'typeName',
  url: 'url2'
}])

spider.start()

When a type has no parser or error handler of its own, the spider falls back to its default parser or default error handler.
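The fallback rule above can be pictured as a lookup keyed by type name. The following is a minimal sketch of that idea, not tspider's actual source; `ParserRegistry`, `BodyParser`, `register`, and `resolve` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical sketch: type-keyed parser lookup with a default fallback.
// None of these names come from tspider itself.
type BodyParser = (body: string) => string

class ParserRegistry {
  private parsers = new Map<string, BodyParser>()
  private fallback: BodyParser

  constructor(fallback: BodyParser) {
    this.fallback = fallback
  }

  // Register a parser for one type name.
  register(type: string, parser: BodyParser): void {
    this.parsers.set(type, parser)
  }

  // Return the parser for the type, or the default when the type is
  // missing or has no parser registered.
  resolve(type?: string): BodyParser {
    if (type !== undefined) {
      const parser = this.parsers.get(type)
      if (parser) return parser
    }
    return this.fallback
  }
}

const registry = new ParserRegistry(body => `default:${body}`)
registry.register('article', body => `article:${body}`)

const resolved = [
  registry.resolve('article')('a'), // type with its own parser
  registry.resolve('unknown')('b'), // type without a parser
  registry.resolve()('c'),          // no type at all
]
console.log(resolved)
```

The same lookup shape would apply to error handlers.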

PreRequest Handler

const spider = new Spider(options)

// global preRequest handler
spider.preRequest(function (reqOption) {
  reqOption.headers['from'] = 'tspider'
  return reqOption
})

// type-specific preRequest handler
spider.type('ajax').preRequest(function (reqOption) {
  reqOption.headers['X-Requested-With'] = 'XMLHttpRequest'
  return reqOption
}).parse((res, body, spider) => {
  console.log(body)
}).error((err, res) => {
  console.error(err)
})

spider.add('ajax', ['url1', 'url2', 'url3'])

spider.start()
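One way to think about the two handler levels is as a small pipeline applied to each request option before it is sent. The sketch below assumes the global handler runs first and the type-specific handler second; the source does not state the order, so treat that as an assumption. `PreRequestOption`, `PreRequestHandler`, and `applyPreRequest` are hypothetical names, not part of tspider.

```typescript
// Hypothetical sketch: run global handlers, then type-specific handlers.
// The ordering is an assumption; tspider's real order is not documented here.
interface PreRequestOption {
  url: string
  type?: string
  headers: Record<string, string>
}

type PreRequestHandler = (opt: PreRequestOption) => PreRequestOption

function applyPreRequest(
  opt: PreRequestOption,
  globalHandlers: PreRequestHandler[],
  typeHandlers: Map<string, PreRequestHandler[]>
): PreRequestOption {
  const typed = opt.type !== undefined ? typeHandlers.get(opt.type) || [] : []
  // Copy the headers so the caller's object is left untouched.
  const initial: PreRequestOption = { ...opt, headers: { ...opt.headers } }
  return [...globalHandlers, ...typed].reduce(
    (current, handler) => handler(current),
    initial
  )
}

const globalHandlers: PreRequestHandler[] = [
  opt => { opt.headers['from'] = 'tspider'; return opt },
]
const typeHandlers = new Map<string, PreRequestHandler[]>([
  ['ajax', [opt => { opt.headers['X-Requested-With'] = 'XMLHttpRequest'; return opt }]],
])

const ajaxRequest = applyPreRequest(
  { url: 'url1', type: 'ajax', headers: {} }, globalHandlers, typeHandlers
)
const plainRequest = applyPreRequest(
  { url: 'url2', headers: {} }, globalHandlers, typeHandlers
)
console.log(ajaxRequest.headers, plainRequest.headers)
```

Note that the sketch always starts from a defined `headers` object. In a real handler it may be worth checking that `reqOption.headers` exists before assigning to it.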

Use Plugin

const spider = new Spider(options)

spider.plugin({
  // emitted when start() is called
  start(spider) {
    console.log('spider start execute')
  },
  // emitted when add() is called; requestOption is the RequestOption object about to be pushed into the request pool
  addRequestOption(requestOption, spider) {
    console.log(requestOption)
  },
  // emitted before a new request object is pushed into the set of in-flight requests; the request parameter is a superagent request
  sendRequest(request, spider) {
    request.use(someSuperagentPlugin)
  },
  // emitted when all in-flight requests have been parsed and the request pool has no more requestOptions
  end(spider) {
    console.log('spider has no more new request')
  }
})

spider.type('typeName').parse((res, body, spider) => {
  console.log(body)
}).error((err, res) => {
  console.error(err)
})

// Pass the request type as the add method's first parameter
spider.add('typeName', ['url1', 'url2', 'url3'])

spider.start()
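A plugin, as used above, is an object whose optional methods are called at fixed points in the spider's lifecycle. The following is a rough sketch of such a hook dispatcher, assuming every hook is optional; `PluginHost`, `PluginSketch`, and `emit` are hypothetical names and do not describe tspider's internals.

```typescript
// Hypothetical sketch of a plugin host: each plugin may implement any
// subset of the hooks, and emit() calls whichever ones are present.
type HookName = 'start' | 'addRequestOption' | 'sendRequest' | 'end'
type PluginSketch = Partial<Record<HookName, (payload?: unknown) => void>>

class PluginHost {
  private plugins: PluginSketch[] = []

  plugin(p: PluginSketch): void {
    this.plugins.push(p)
  }

  // Call the named hook on every plugin that defines it, in registration order.
  emit(hook: HookName, payload?: unknown): void {
    for (const p of this.plugins) {
      const handler = p[hook]
      if (handler) handler(payload)
    }
  }
}

const calls: string[] = []
const host = new PluginHost()
host.plugin({
  start: () => { calls.push('start') },
  end: () => { calls.push('end') },
})
host.plugin({
  addRequestOption: payload => { calls.push(`add:${String(payload)}`) },
})

host.emit('start')
host.emit('addRequestOption', 'url1')
host.emit('end')
console.log(calls)
```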

Parameter Structure

SpiderInitOption

Passed to the constructor when you create a new Spider instance.

  • timeout: number the request timeout, in ms
  • retry: number how many times to retry a failed request
  • headers: { [headerName: string]: string | number } headers sent with every request
  • cookies: { [cookieName: string]: string | number } | string cookies sent with every request
  • keepSession: boolean whether to keep response cookies between requests (uses superagent.agent())
  • limit: number the maximum number of concurrent requests
  • interval: number how long to wait before sending the next request, in ms
  • defaultParser: ResponseParser the default response parser, used when a request has no type or its type has no parser
  • defaultErrorHandler: ErrorHandler the default error handler, used when a request has no type or its type has no error handler
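To make the interaction between limit and interval concrete, here is a minimal sketch of a scheduler that keeps at most `limit` tasks in flight and pauses `interval` ms before a worker picks up its next task. This is one plausible reading of the two options, not tspider's implementation; `runLimited` and `sleep` are hypothetical helpers.

```typescript
// Hypothetical sketch: at most `limit` tasks run at once, and each worker
// waits `interval` ms between finishing one task and starting the next.
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

async function runLimited<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
  interval: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length)
  let next = 0

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++ // claim the next task; safe because JS is single-threaded
      const task = tasks[index]
      if (!task) continue
      results[index] = await task()
      if (next < tasks.length) await sleep(interval)
    }
  }

  const workerCount = Math.min(limit, tasks.length)
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}

// Demo: five fake requests, limit 2, with a counter tracking peak concurrency.
let inFlight = 0
let maxInFlight = 0
const fakeRequests = [1, 2, 3, 4, 5].map(n => async () => {
  inFlight++
  maxInFlight = Math.max(maxInFlight, inFlight)
  await sleep(10)
  inFlight--
  return n * 2
})

const limitedRun = runLimited(fakeRequests, 2, 5)
limitedRun.then(results => console.log(results, 'peak concurrency:', maxInFlight))
```

Results are stored by task index, so the output order matches the input order even though tasks finish at different times.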

RequestOption

Passed when you call the add method of a Spider instance.

  • type: string the task type this requestOption belongs to
  • url: string the url the request is sent to
  • method: 'get' | 'post' | 'put' | 'head' | 'delete' | 'patch' | 'options' the request method
  • query: object | string the url query parameters
  • body: any the request body to send
  • contentType: string the type of the request body, sent as the Content-Type header
  • accept: string the content type the response should have, sent as the Accept header
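The fields above can be written down as a TypeScript interface. The sketch below mirrors the list; which fields are optional is an assumption (only `url` is treated as required here), and `RequestOptionShape` and `HttpMethod` are hypothetical names rather than types exported by tspider. The example URLs are placeholders.

```typescript
// Hypothetical typing of the RequestOption fields listed above.
// Optionality is assumed; check the package's own type definitions.
type HttpMethod = 'get' | 'post' | 'put' | 'head' | 'delete' | 'patch' | 'options'

interface RequestOptionShape {
  type?: string            // task type the request belongs to
  url: string              // target url
  method?: HttpMethod      // request method
  query?: object | string  // url query parameters
  body?: unknown           // request body
  contentType?: string     // sent as the Content-Type header
  accept?: string          // sent as the Accept header
}

const exampleRequests: RequestOptionShape[] = [
  {
    type: 'search',
    url: 'https://example.com/search',
    method: 'get',
    query: { q: 'tspider' },
    accept: 'application/json',
  },
  {
    type: 'submit',
    url: 'https://example.com/items',
    method: 'post',
    body: { name: 'demo' },
    contentType: 'application/json',
  },
]
console.log(exampleRequests.map(r => `${r.method} ${r.url}`))
```

Objects shaped like these are what the add method accepts in place of plain url strings.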

Built-in Parser Generator

tspider provides several built-in parser generator functions for common tasks.

combineParser

Combines several parsers into a single parser. It receives an array of parsers. If you pass async functions, use asyncCombineParser instead, unless you want them to execute in parallel.

spider.type('combine').parse(Parser.combineParser([function (res, body, spider) {
  console.log(res.header)
}, function (res, body, spider) {
  console.log(res.text)
}]))

asyncCombineParser

Combines several parsers into a single parser. It receives an array of async parser functions and executes them serially.

spider.type('asyncCombine').parse(Parser.asyncCombineParser([async function (res, body, spider) {
  // async operation 1...
}, async function (res, body, spider) {
  // async operation 2, runs after operation 1 has finished...
}]))
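The difference between the two generators comes down to whether each parser is awaited before the next one starts. The sketch below shows both strategies side by side; it illustrates the idea under that assumption and is not the code behind combineParser or asyncCombineParser. `combineParallel`, `combineSerial`, and `delay` are hypothetical names.

```typescript
// Hypothetical sketch: two ways to merge an array of parsers into one.
type AsyncParser<R> = (res: R) => void | Promise<void>

// Start every parser at once and wait for all of them to settle.
function combineParallel<R>(parsers: AsyncParser<R>[]): AsyncParser<R> {
  return async res => {
    await Promise.all(parsers.map(parser => parser(res)))
  }
}

// Await each parser before starting the next one.
function combineSerial<R>(parsers: AsyncParser<R>[]): AsyncParser<R> {
  return async res => {
    for (const parser of parsers) {
      await parser(res)
    }
  }
}

const delay = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// Build a slow parser followed by a fast one, both recording when they finish.
const makeParsers = (log: string[]): AsyncParser<string>[] => [
  async () => { await delay(20); log.push('slow') },
  async () => { await delay(1); log.push('fast') },
]

async function compareStrategies(): Promise<{ serial: string[]; parallel: string[] }> {
  const serial: string[] = []
  await combineSerial(makeParsers(serial))('body')
  const parallel: string[] = []
  await combineParallel(makeParsers(parallel))('body')
  return { serial, parallel }
}

const comparison = compareStrategies()
comparison.then(result => console.log(result))
```

In the serial run the slow parser finishes before the fast one starts, so completion order follows array order. In the parallel run the fast parser finishes first.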

fileSaveParser

If the response is a file, this parser saves it to your local disk.

spider.type('fileSave').parse(Parser.fileSaveParser(function (res) {
  // file name generator function; returns a string used as the file name
  return 'filename'
}, function(err, savePath) {
  console.log('file save to: ' + savePath)
}))

charsetConvertParser

Uses the npm library iconv-lite to convert the content of res.text or res.body to the encoding you specify.

spider.type('charsetConvert').parse(Parser.charsetConvertParser('utf8', function(err, res, converted, spider) {
  // converted can be a string (from res.text) or an object (from res.body)
}))