tspider v0.0.1
tspider
A web spider tool written in TypeScript that classifies requests by type
USAGE
Basic Usage
// es6 or typescript
import { Spider } from 'tspider'
// commonjs
// const { Spider } = require('tspider')
const spider = new Spider({
keepSession: true,
timeout: 5000,
limit: 2,
interval: 1000,
defaultParser(res, body, spider) {
console.log(body)
},
defaultErrorHandler(err, res, spider) {
console.error(err)
}
})
// add new url or requestOption object to spider
spider.add(['url1', 'url2', 'url3'])
spider.start()
The response object (res) is a superagent response object.
Define Different Parsers By TypeName
const spider = new Spider(options)
spider.type('typeName').parse((res, body, spider) => {
console.log(body)
}).error((err, res) => {
console.error(err)
})
// Pass the request type as the first parameter of add()
spider.add('typeName', ['url1', 'url2', 'url3'])
// Or pass the request type in each requestOption object
spider.add([{
type: 'typeName',
url: 'url1'
}, {
type: 'typeName',
url: 'url2'
}])
spider.start()
When a type has no parser or error handler, the spider falls back to its default parser or default error handler.
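The fallback rule can be pictured with a small self-contained sketch. This is not tspider's source: `ParserRegistry`, `register`, and `resolve` are hypothetical names, and the parser signature is simplified to a single `body` argument. It only illustrates looking up a parser by type name and falling back to a default.

```typescript
// Hypothetical illustration only: not tspider's actual implementation.
type Parser = (body: string) => string

class ParserRegistry {
  private parsers = new Map<string, Parser>()

  constructor(private defaultParser: Parser) {}

  // Register a parser for one type name.
  register(typeName: string, parser: Parser): this {
    this.parsers.set(typeName, parser)
    return this
  }

  // Return the parser for the type, or the default parser when no type
  // is given or the type has no parser registered.
  resolve(typeName?: string): Parser {
    if (typeName === undefined) return this.defaultParser
    return this.parsers.get(typeName) ?? this.defaultParser
  }
}

const registry = new ParserRegistry(body => 'default:' + body)
registry.register('ajax', body => 'ajax:' + body)

const ajaxResult = registry.resolve('ajax')('x')       // uses the 'ajax' parser
const unknownResult = registry.resolve('unknown')('x') // falls back to the default
const untypedResult = registry.resolve()('x')          // no type: default
```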
PreRequest Handler
const spider = new Spider(options)
// global preRequest handler
spider.preRequest(function (reqOption) {
reqOption.headers['from'] = 'tspider'
return reqOption
})
// type-specific preRequest handler
spider.type('ajax').preRequest(function (reqOption) {
reqOption.headers['X-Requested-With'] = 'XMLHttpRequest'
return reqOption
}).parse((res, body, spider) => {
console.log(body)
}).error((err, res) => {
console.error(err)
})
spider.add('ajax', ['url1', 'url2', 'url3'])
spider.start()
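To make the handler chain concrete, here is a minimal, self-contained sketch of how a global handler and a type-specific handler might both be applied to one request option. The order (global first, then the type's own handler) is an assumption, since it is not stated here. `ReqOption` and `applyHandlers` are hypothetical names, not part of tspider.

```typescript
// Hypothetical illustration only: not tspider's actual implementation.
interface ReqOption {
  url: string
  headers: { [headerName: string]: string | number }
}

type PreRequestHandler = (reqOption: ReqOption) => ReqOption

// Run each handler in order, feeding the result of one into the next.
function applyHandlers(reqOption: ReqOption, handlers: PreRequestHandler[]): ReqOption {
  return handlers.reduce((current, handler) => handler(current), reqOption)
}

const globalHandler: PreRequestHandler = reqOption => {
  reqOption.headers['from'] = 'tspider'
  return reqOption
}

const ajaxHandler: PreRequestHandler = reqOption => {
  reqOption.headers['X-Requested-With'] = 'XMLHttpRequest'
  return reqOption
}

// Assumed order: the global handler first, then the type-specific one.
const prepared = applyHandlers({ url: 'url1', headers: {} }, [globalHandler, ajaxHandler])
```

Each handler returns the request option, as in the examples above, so the chain keeps working even if a handler returns a new object instead of mutating its argument.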
Use Plugin
const spider = new Spider(options)
spider.plugin({
// emitted when start() is called
start(spider) {
console.log('spider start execute')
},
// emitted when add() is called; requestOption is the RequestOption object that will be pushed into the request pool
addRequestOption(requestOption, spider) {
console.log(requestOption)
},
// emitted before a new request is pushed into the set of in-flight requests; the request parameter is a superagent request
sendRequest(request, spider) {
request.use(someSuperagentPlugin)
},
// emitted when all in-flight requests have been parsed and the request pool has no more requestOptions
end(spider) {
console.log('spider has no more new request')
}
})
spider.type('typeName').parse((res, body, spider) => {
console.log(body)
}).error((err, res) => {
console.error(err)
})
// Pass the request type as the first parameter of add()
spider.add('typeName', ['url1', 'url2', 'url3'])
spider.start()
Parameter Structure
SpiderInitOption
Passed when you create a new Spider instance.
timeout: number
request timeout, in ms
retry: number
number of retries when a request fails
headers: { [headerName: string]: string | number }
headers sent with every request
cookies: { [cookieName: string]: string | number } | string
cookies sent with every request
keepSession: boolean
whether to save response cookies (uses superagent.agent())
limit: number
maximum number of concurrent requests
interval: number
wait time before sending the next request, in ms
defaultParser: ResponseParser
default response parser, used when a request has no type or its type has no matching parser
defaultErrorHandler: ErrorHandler
default error handler, used when a request has no type or its type has no matching error handler
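For readers who prefer type notation, the fields above can be summarized as a TypeScript interface. This is a sketch reconstructed from the list and the usage examples, not copied from tspider's type declarations. Marking every field optional is an assumption, and the `res`, `body`, `err`, and `spider` parameters are typed as `unknown` because their exact types are not given here.

```typescript
// Sketch reconstructed from this README; not tspider's published typings.
type ResponseParser = (res: unknown, body: unknown, spider: unknown) => void
type ErrorHandler = (err: unknown, res: unknown, spider: unknown) => void

interface SpiderInitOption {
  timeout?: number             // request timeout, in ms
  retry?: number               // number of retries when a request fails
  headers?: { [headerName: string]: string | number }
  cookies?: { [cookieName: string]: string | number } | string
  keepSession?: boolean        // whether to save response cookies
  limit?: number               // maximum number of concurrent requests
  interval?: number            // wait before sending the next request, in ms
  defaultParser?: ResponseParser
  defaultErrorHandler?: ErrorHandler
}

// The same values as in Basic Usage, plus an illustrative header.
const exampleOptions: SpiderInitOption = {
  keepSession: true,
  timeout: 5000,
  limit: 2,
  interval: 1000,
  headers: { from: 'tspider' },
  defaultParser(_res, body) {
    console.log(body)
  },
  defaultErrorHandler(err) {
    console.error(err)
  }
}
```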
RequestOption
Passed when you call the add method of a Spider instance.
type: string
the task type this requestOption belongs to
url: string
the URL the request is sent to
method: 'get' | 'post' | 'put' | 'head' | 'delete' | 'patch' | 'options'
the request method
query: object | string
the URL query parameters
body: any
the request body to send
contentType: string
the type of the request body, sent as the Content-Type header
accept: string
the response content type to accept, sent as the Accept header
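The same fields, written as a TypeScript interface with two example objects. This is a sketch based on the list above rather than tspider's own typings. Treating `url` as required and every other field as optional is an assumption, and the JSON content types in the second example are illustrative values only.

```typescript
// Sketch reconstructed from this README; not tspider's published typings.
type RequestMethod = 'get' | 'post' | 'put' | 'head' | 'delete' | 'patch' | 'options'

interface RequestOption {
  type?: string          // task type this request belongs to
  url: string            // URL the request is sent to
  method?: RequestMethod
  query?: object | string
  body?: any             // request body to send
  contentType?: string   // sent as the Content-Type header
  accept?: string        // sent as the Accept header
}

const exampleRequests: RequestOption[] = [
  { type: 'typeName', url: 'url1' },
  {
    type: 'ajax',
    url: 'url2',
    method: 'post',
    body: { page: 1 },
    contentType: 'application/json',
    accept: 'application/json'
  }
]
```

An array like `exampleRequests` has the same shape as the one passed to `spider.add([...])` in the examples above.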
Built-in Parser Generator
tspider provides some useful parser generator functions.
combineParser
Combines several parsers into a single parser. It receives an array of parsers. If you pass async functions, use asyncCombineParser instead, unless you want them to execute in parallel.
spider.type('combine').parse(Parser.combineParser([function (res, body, spider) {
console.log(res.header)
}, function (res, body, spider) {
console.log(res.text)
}]))
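As a rough mental model, a combining helper like this can be written in a few lines. The sketch below is a hypothetical reimplementation for illustration, not tspider's source: `combine` and the simplified `SimpleParser` signature are assumptions. It calls every parser in array order and does not wait for any returned promise.

```typescript
// Hypothetical illustration only: not tspider's actual combineParser.
type SimpleParser = (res: { text: string }, body: string) => void

// Build one parser that calls each given parser in array order.
// Return values (including promises) are ignored, so async parsers
// would all be started without waiting for one another.
function combine(parsers: SimpleParser[]): SimpleParser {
  return (res, body) => {
    for (const parser of parsers) {
      parser(res, body)
    }
  }
}

const seen: string[] = []
const combined = combine([
  (res) => { seen.push('text:' + res.text) },
  (_res, body) => { seen.push('body:' + body) }
])

// Both parsers receive the same response and body.
combined({ text: 'hello' }, 'hello')
```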
asyncCombineParser
Combines several parsers into a single parser. It receives an array of async parser functions and executes them serially.
spider.type('asyncCombine').parse(Parser.asyncCombineParser([async function (res, body, spider) {
// async operation 1...
}, async function (res, body, spider) {
// async operation 2...
}]))
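Serial execution here means each parser is awaited before the next one starts. The following self-contained sketch shows that idea with hypothetical names (`combineSerial`, `wait`) and a simplified one-argument parser signature. It is an illustration under those assumptions, not tspider's implementation.

```typescript
// Hypothetical illustration only: not tspider's actual asyncCombineParser.
type AsyncParser = (body: string) => Promise<void> | void

// Build one parser that awaits each parser before starting the next.
function combineSerial(parsers: AsyncParser[]): (body: string) => Promise<void> {
  return async (body) => {
    for (const parser of parsers) {
      await parser(body)
    }
  }
}

const order: string[] = []
const wait = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

const serial = combineSerial([
  async (body) => { await wait(20); order.push('slow:' + body) },
  async (body) => { await wait(1); order.push('fast:' + body) }
])

// The slow parser completes before the fast one is started, so the
// recorded order follows the array order rather than the delays.
const done = serial('x').then(() => order.join(','))
```

If the parsers were started without `await`, the fast one would record first. That difference is the reason to choose a serial combiner for dependent async steps.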
fileSaveParser
If the response is a file, this parser saves it to disk.
spider.type('fileSave').parse(Parser.fileSaveParser(function (res) {
// file name generator function; returns a string used as the file name
return 'filename'
}, function(err, savePath) {
// report a failed save instead of logging an undefined path
if (err) return console.error(err)
console.log('file saved to: ' + savePath)
}))
charsetConvertParser
Uses the npm library iconv-lite to convert the content of res.text or res.body to your encoding.
spider.type('charsetConvert').parse(Parser.charsetConvertParser('utf8', function(err, res, converted, spider) {
// converted can be a string (res.text) or an object (res.body)
}))