spindel v2.0.2
A web crawler/spider
"spindel" is the Swedish word for spider.
Installation
Install spindel using npm:
npm install --save spindel
Usage
Module usage
Start with a single url
const spindel = require('spindel');
// Start a crawler at http://example.com:
const stream = spindel('http://example.com');
stream.on('data', res => {
  // see response object format below
});
Start with multiple urls
// Start a crawler with an initial queue consisting of two urls:
const stream = spindel([
  'http://example.com',
  'http://another.com'
]);
stream.on('data', res => {
  // see response object format below
});
Use a database as url queue
// Start a crawler with a custom queue:
const redisQueue = {
  popUrl() {
    return getNextUrlFromRedisAndReturnAPromise();
  },
  pushUrl(url) {
    return pushUrlToRedisAndReturnAPromise(url);
  }
};
const stream = spindel(redisQueue);
stream.on('data', res => {
  // see response object format below
});
API
spindel(urlsOrQueue, options)
Name | Type | Description |
---|---|---|
urlsOrQueue | String, Array or Object | A single url, an array of urls or a queue implementation |
options | Object | The options object |
Returns: stream.Readable which emits response objects on the 'data' event.
Options
options.transformHtml
Type: Function
Default: noop
Params:
Name | Type | Description |
---|---|---|
body | String | The response body |
url | String | The url for the page being crawled |
res | Object | The full response object |
Return value: Any or Promise<Any>.
For responses containing HTML (i.e. having a content-type which begins with text/ and ends with html), this function is run and its return value is stored as transformedHtml in the response object.
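As an illustration, a transformHtml function might pull the page title out of the body. This is only a minimal sketch: extractTitle is a hypothetical helper, not part of spindel, and the regular expression is a deliberately naive stand-in for a real HTML parser.

```javascript
// Hypothetical transformHtml function: returns the text of the first
// <title> element, or null if none is found. A regex is used only to keep
// the sketch dependency-free; a real HTML parser would be more robust.
function extractTitle(body, url, res) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(body);
  return match ? match[1].trim() : null;
}

// Intended use (requires spindel to be installed):
//
//   const spindel = require('spindel');
//   spindel('http://example.com', { transformHtml: extractTitle })
//     .on('data', res => console.log(res.url, res.transformedHtml));
```

With this function in place, transformedHtml would hold the title string (or null) rather than the full body.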
options.gotOptions
Type: Object
Default: {}
Options passed to got.
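For example, a custom user agent could be sent with every request. The sketch below relies only on got's headers option; the user-agent string is a made-up placeholder.

```javascript
// Options object for spindel; gotOptions is passed on to got.
const options = {
  gotOptions: {
    headers: {
      // Placeholder value, chosen for illustration only.
      'user-agent': 'my-crawler/1.0'
    }
  }
};

// Intended use (requires spindel to be installed):
//
//   const spindel = require('spindel');
//   const stream = spindel('http://example.com', options);
```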
Streamed response objects
A response object has the format:
{
  url: String, // the crawled url
  statusCode: Number, // the HTTP status code
  statusMessage: String, // the HTTP status message
  body: String, // the response body
  headers: Object, // the HTTP response headers
  hrefs: Array(String), // found <a href /> urls in the body if content is HTML
  transformedHtml: String // if content is HTML this contains the `body` after applying the `transformHtml` option function
}
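A 'data' handler typically reads a few of these fields. As a minimal sketch, the hypothetical helper summarize below turns a response object into a one-line log entry. It assumes hrefs may be missing for non-HTML responses, which the format above does not confirm either way.

```javascript
// Hypothetical helper: formats one response object as a log line.
function summarize(res) {
  // Assumption: hrefs may be absent for non-HTML responses, so fall back to 0.
  const links = Array.isArray(res.hrefs) ? res.hrefs.length : 0;
  return `${res.statusCode} ${res.url} (${links} links)`;
}

// Intended use (requires spindel to be installed):
//
//   const spindel = require('spindel');
//   spindel('http://example.com').on('data', res => console.log(summarize(res)));
```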
Queue implementation
A queue implementation consists of two functions, popUrl and pushUrl.
queue.popUrl
Type: Function
Params:
Name | Type | Description |
---|---|---|
lastUrl | String | The last crawled url, or null for the first url |
Should return: String or Promise<String> to continue crawling, or null or Promise<null> to stop crawling.
queue.pushUrl
Type: Function
Params:
Name | Type | Description |
---|---|---|
href | String | A found href in the currently crawled response body |
referral | String | The url for the current crawl |
Should return: nothing or a Promise.
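To show how the two functions fit together, here is a sketch of an in-memory queue that visits each url at most once and stays on the origin of the start url. The name sameOriginQueue is hypothetical, and the code assumes that found hrefs may be relative, so it resolves them against referral.

```javascript
// Hypothetical queue: visits each url at most once and stays on one origin.
function sameOriginQueue(startUrl) {
  const start = new URL(startUrl); // normalizes e.g. a missing trailing slash
  const origin = start.origin;
  const seen = new Set([start.href]);
  const pending = [start.href];

  return {
    // Called with the last crawled url (null on the first call).
    popUrl(lastUrl) {
      // Returning null signals that crawling should stop.
      return pending.length > 0 ? pending.shift() : null;
    },
    // Called for every href found in the page crawled at `referral`.
    pushUrl(href, referral) {
      let target;
      try {
        // Assumption: hrefs may be relative, so resolve against the referral.
        target = new URL(href, referral);
      } catch (err) {
        return; // skip hrefs that cannot be parsed as urls
      }
      target.hash = ''; // treat /page and /page#section as the same url
      if (target.origin !== origin || seen.has(target.href)) return;
      seen.add(target.href);
      pending.push(target.href);
    }
  };
}

// Intended use (requires spindel to be installed):
//
//   const spindel = require('spindel');
//   const stream = spindel(sameOriginQueue('http://example.com'));
```

Both functions here are synchronous. A queue backed by a database would return Promises instead, as the signatures above allow.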
Example of the internal ArrayQueue
function arrayQueue(initialUrls) {
  const urls = initialUrls.slice();
  return {
    pushUrl(url) {
      urls.push(url);
    },
    popUrl() {
      return urls.pop();
    }
  };
}
The queue implementation above is used if spindel's urlsOrQueue parameter is a String or Array.
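Because popUrl uses Array.prototype.pop, the implementation shown above hands out the most recently pushed url first (last in, first out). The sketch below repeats that function to make the order visible and adds a hypothetical first-in-first-out variant, fifoQueue, which is not part of spindel.

```javascript
// Copied from the example above: pop() takes from the end of the array.
function arrayQueue(initialUrls) {
  const urls = initialUrls.slice();
  return {
    pushUrl(url) { urls.push(url); },
    popUrl() { return urls.pop(); }
  };
}

// Hypothetical variant: shift() takes from the front of the array instead.
function fifoQueue(initialUrls) {
  const urls = initialUrls.slice();
  return {
    pushUrl(url) { urls.push(url); },
    // Return null explicitly once the queue is empty.
    popUrl() { return urls.length > 0 ? urls.shift() : null; }
  };
}

const initial = ['http://example.com', 'http://another.com'];
const lifoFirst = arrayQueue(initial).popUrl(); // 'http://another.com'
const fifoFirst = fifoQueue(initial).popUrl();  // 'http://example.com'
```

A first-in-first-out queue crawls urls in the order they were found, which may suit crawls that should cover the start pages before following deeper links.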
License
MIT © Joakim Carlstein