2.0.2 • Published 8 years ago

spindel

A web crawler/spider

"spindel" is the Swedish word for spider.

Installation

Install spindel using npm:

npm install --save spindel

Usage

Module usage

Start with a single url

const spindel = require('spindel');

// Start a crawler at http://example.com:
const stream = spindel('http://example.com');

stream.on('data', res => {
	// see response object format below
});

Start with multiple urls

// Start a crawler with an initial queue consisting of two urls:
const stream = spindel([
	'http://example.com',
	'http://another.com'
]);

stream.on('data', res => {
	// see response object format below
});

Use a database as the url queue

// Start a crawler with a custom queue:
const redisQueue = {
	popUrl() {
		return getNextUrlFromRedisAndReturnAPromise();
	},
	pushUrl(url) {
		return pushUrlToRedisAndReturnAPromise(url);
	}
};
const stream = spindel(redisQueue);

stream.on('data', res => {
	// see response object format below
});

API

spindel(urlsOrQueue, options)

| Name | Type | Description |
| --- | --- | --- |
| urlsOrQueue | String, Array or Object | A single url, an array of urls or a queue implementation |
| options | Object | The options object |

Returns: stream.Readable which emits response objects on the 'data' event.

Options

options.transformHtml

Type: Function
Default: noop

Params:

| Name | Type | Description |
| --- | --- | --- |
| body | String | The response body |
| url | String | The url for the page being crawled |
| res | Object | The full response object |

Return value: Any or Promise<Any>.

For responses containing HTML (i.e. with a content-type that begins with text/ and ends with html), this function is run and its return value is stored as transformedHtml on the response object.
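
As a rough illustration of the parameters described above, here is a minimal sketch of a transform function. The name extractTitle and the regular expression are assumptions made for this example, not part of spindel; the commented-out wiring only restates the documented options.transformHtml usage.

```javascript
// Hypothetical transform: pull the <title> text out of an HTML body.
// The (body, url, res) signature follows the Params table above.
function extractTitle(body, url, res) {
	const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(body);
	// Fall back to the url when the page has no <title> element.
	return match ? match[1].trim() : url;
}

// Assumed wiring, based on the options described above:
// const stream = spindel('http://example.com', {transformHtml: extractTitle});
// stream.on('data', res => console.log(res.transformedHtml));

// Prints "Example Domain"
console.log(extractTitle('<html><title> Example Domain </title></html>', 'http://example.com'));
```

Because the return value may be anything (or a Promise of anything), a transform like this can hand back a string, a parsed object or the result of an asynchronous lookup.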

options.gotOptions

Type: Object
Default: {}

Options passed to got.
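
A small sketch of what such an options object might look like. The values are made up for illustration; timeout and headers are common got options, but which options are honoured depends on the got version that spindel bundles.

```javascript
// Hypothetical values; adjust to the got version in use.
const options = {
	gotOptions: {
		timeout: 5000, // request timeout in milliseconds
		headers: {
			'user-agent': 'my-crawler/1.0' // made-up identifier for this example
		}
	}
};

// Passed as the second argument: spindel('http://example.com', options)
```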

Streamed response objects

A response object has the format:

{
	url: String, // the crawled url
	statusCode: Number, // the HTTP status code
	statusMessage: String, // the HTTP status message
	body: String, // the response body
	headers: Object, // the HTTP response headers
	hrefs: Array(String), // found <a href /> urls in the body if content is HTML
	transformedHtml: Any // if content is HTML this contains the (resolved) return value of the `transformHtml` option function
}
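
To make the format concrete, the sketch below condenses a response object into a single log line. summarizeResponse is a hypothetical helper and the sample object is hand-written in the documented shape, not real crawler output.

```javascript
// Hypothetical helper: one-line summary of a response object.
function summarizeResponse(res) {
	const type = res.headers['content-type'] || 'unknown';
	// hrefs is only expected for HTML responses, so guard against its absence.
	const links = Array.isArray(res.hrefs) ? res.hrefs.length : 0;
	return `${res.statusCode} ${res.url} (${type}, ${links} links)`;
}

// Hand-written sample following the format above.
const sample = {
	url: 'http://example.com',
	statusCode: 200,
	statusMessage: 'OK',
	body: '<a href="/a">a</a><a href="/b">b</a>',
	headers: {'content-type': 'text/html'},
	hrefs: ['/a', '/b']
};

// Prints "200 http://example.com (text/html, 2 links)"
console.log(summarizeResponse(sample));
```

A function like this could be used directly as the 'data' handler: stream.on('data', res => console.log(summarizeResponse(res))).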

Queue implementation

A queue implementation is an object with two functions: popUrl and pushUrl.

queue.popUrl

Type: Function

Params:

| Name | Type | Description |
| --- | --- | --- |
| lastUrl | String | The last crawled url, or null for the first url |

Should return: String or Promise<String> to continue crawling, or null or Promise<null> to stop crawling.

queue.pushUrl

Type: Function

Params:

| Name | Type | Description |
| --- | --- | --- |
| href | String | A found href in the currently crawled response body |
| referral | String | The url for the current crawl |

Should return: nothing or Promise.

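Example of the internal ArrayQueue

function arrayQueue(initialUrls) {
	// Copy the input so the caller's array is never mutated.
	const urls = initialUrls.slice();

	return {
		pushUrl(url) {
			urls.push(url);
		},
		popUrl() {
			// pop() takes the most recently added url (last in, first out)
			// and returns undefined once the array is empty.
			return urls.pop();
		}
	};
}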

The queue implementation above is used if spindel's urlsOrQueue parameter is a String or Array.
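
For comparison, here is a minimal sketch of a custom queue that crawls first in, first out, visits each url once and stays on the starting origin. createDedupQueue and maxPages are hypothetical names for this example. The sketch also assumes that hrefs may be relative and resolves them against referral, which the documentation above does not specify.

```javascript
// Hypothetical queue: FIFO order, no repeat visits, same-origin only.
function createDedupQueue(startUrl, maxPages = 100) {
	const start = new URL(startUrl);
	const pending = [start.href];
	const seen = new Set([start.href]);
	let popped = 0;

	return {
		popUrl() {
			// Returning null stops the crawl (see queue.popUrl above).
			if (pending.length === 0 || popped >= maxPages) {
				return null;
			}
			popped++;
			return pending.shift();
		},
		pushUrl(href, referral) {
			let resolved;
			try {
				// Assumption: hrefs may be relative, so resolve against the referring page.
				resolved = new URL(href, referral);
			} catch (err) {
				return; // skip hrefs that cannot be parsed
			}
			resolved.hash = ''; // treat /page and /page#section as the same url
			if (resolved.origin === start.origin && !seen.has(resolved.href)) {
				seen.add(resolved.href);
				pending.push(resolved.href);
			}
		}
	};
}

// Usage, following the custom queue example earlier in this document:
// const stream = spindel(createDedupQueue('http://example.com'));
```

Both functions here are synchronous; per the contract above they could equally return Promises, for example when the seen set lives in a database.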

License

MIT © Joakim Carlstein
