icrawler
A tool for easy scraping of data from websites
Features
- Clean and simple API
- Persistent error-proof crawling
- State saving for continuous crawling
- jQuery-like server-side DOM parsing with Cheerio
- Parallel requests
- Proxy list and user-agent list support
- HTTP headers and cookies setup
- Automatic charset detection and conversion
- Console progress indicator
- Node.js 0.10 to 6.0 support
Install
npm install icrawler
Usage
icrawler(startData, opts, parse, done);

startData - a task (or an array of tasks) to start crawling with. A single icrawler task can be a url (of a page or an API resource) or an object with a url field. Optionally you can use a data field with an object for a POST request (the default method is GET). You can use any other fields for custom data. For example, you can mark different types of tasks so they are parsed in different ways, or you can store partial data in a task when one result record needs more than one request.

opts (optional) - options (see the sketch after this list for a combined example):
- concurrency - a positive number of parallel requests, or a negative number of milliseconds of delay between requests with no parallelism. Defaults to 1.
- delay - time in milliseconds to wait after an error before trying to crawl again. Defaults to 10000 (10 secs).
- errorsFirst - if true, failed requests will be repeated before all others; if false, they will be pushed to the tail of the queue. Defaults to false.
- allowedStatuses - a number or an array of numbers of HTTP response codes that are not errors. Defaults to 200.
- skipDuplicates - if true, parse every URL only once. Defaults to true.
- objectTaskParse - if true, the task object will be sent to parse instead of the url string. Defaults to false.
- decode_response (or decode) - whether to decode text responses to UTF-8 if the Content-Type header shows a different charset. Defaults to true.
- noJquery - if true, send the response body string to the parse function (as the $ parameter) as is, without jQuery-like parsing. Defaults to false.
- noResults - if true, don't save parsed items to the results array (no save field in the _ parameter of the parse function). Defaults to false.
- quiet - if true, don't write anything to the console; no log and step fields in the _ parameter of the parse function. Defaults to false.
- open_timeout (or timeout) - returns an error if the connection takes longer than X milliseconds to establish. Defaults to 10000 (10 secs). 0 means no timeout.
- read_timeout - returns an error if data transfer takes longer than X milliseconds after the connection is established. Defaults to 10000 milliseconds (not like in needle).
- proxy - forwards the request through an HTTP(S) proxy. E.g. proxy: 'http://user:pass@proxy.server.com:3128'. If an array of strings, proxies from the list are used.
- proxyRandom - if true, use a random proxy from the list for every request; if false, switch to a new proxy from the list after each error. Defaults to true. If proxy is not an array, the proxyRandom option is ignored.
- reverseProxy - replaces part of the url before the request, for using a reverse proxy. If reverseProxy is a string, it is used instead of the protocol and domain of the original url. If reverseProxy is an object, the substring reverseProxy.to in the original url will be replaced by reverseProxy.from. If an array of strings or objects, reverse proxies from the list are used.
- reverseProxyRandom - if true, use a random reverse proxy from the list for every request; if false, switch to a new reverse proxy from the list after each error. Defaults to true. If reverseProxy is not an array, the reverseProxyRandom option is ignored.
- headers - an object containing custom HTTP headers for the request. Overrides the defaults described below.
- cookies - sets a {key: 'val'} object as the 'Cookie' header.
- connection - sets the 'Connection' HTTP header. Defaults to 'close'.
- compressed - if true, sets the 'Accept-Encoding' HTTP header to 'gzip, deflate'. Defaults to false.
- agent - sets a custom http.Agent.
- user_agent - sets the 'User-Agent' HTTP header. If an array of strings, 'User-Agent' headers from the list are used. Defaults to Needle/{version} (Node.js {nodeVersion}).
- agentRandom - if true, use a random 'User-Agent' from the list for every request; if false, switch to a new 'User-Agent' from the list after each error. Defaults to true. If user_agent is not an array, the agentRandom option is ignored.
- onError - function (err, task) for doing something on the first error, before the pause.
- init - function (needle, log, callback) for preparing cookies and headers for crawling. Must run callback(err) on error or callback(null, cookies, headers) on success.
- initOnError - if true, run init on every resume after errors; if false, run init only on start. If init is not set, the initOnError option is ignored. Defaults to true.
- cleanCookiesOnInit - if true, clean old cookies on each init run. Defaults to false.
- cleanHeadersOnInit - if true, clean old headers on each init run. Defaults to false.
- save - function (tasks, results) for saving the crawler state. tasks is an object containing the arrays waiting, finished and failed with tasks from the queue. results is an array of already fetched data. Ignored if file is set.
- results - results previously saved by save, for continuing crawling after a crash or manual break. Ignored if file is set.
- tasks - tasks previously saved by save, for continuing crawling after a crash or manual break. tasks is an object containing the arrays waiting, finished and failed. Ignored if file is set.
- file - name of a file for saving the crawler state, for continuing crawling after a crash or manual break. Use it instead of save, tasks and results for auto saving.
- saveOnError - if true, runs save every time the crawler pauses on error. Defaults to true.
- saveOnFinish - if true, runs save when crawling is finished. Defaults to true.
- saveOnExit - if true, runs save when the user aborts the script with Ctrl+C. Defaults to true.
- saveOnCount - if a number, runs save every saveOnCount requests.
- asyncParse - if true, runs parse in asynchronous mode. Defaults to false.
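As an illustration, an opts object combining several of these settings might look like the following sketch (the proxy addresses, user-agent strings and state file name are made up for the example):

var opts = {
    concurrency: 5,                  // 5 parallel requests
    delay: 30000,                    // wait 30 secs after an error
    errorsFirst: true,               // retry failed tasks before new ones
    allowedStatuses: [200, 404],     // treat 404 responses as non-errors too
    proxy: [                         // hypothetical proxy list
        'http://user:pass@proxy1.example.com:3128',
        'http://user:pass@proxy2.example.com:3128'
    ],
    proxyRandom: true,               // pick a random proxy for every request
    user_agent: [                    // hypothetical user-agent list
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0'
    ],
    file: 'crawler-state.json'       // auto-save state for resuming
};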
parse - page-parsing function(task, $, _, res) that runs for every crawled page and gets these params:
- task - url of the parsed page. If objectTaskParse is set, task is an object with a url property.
- $ - jQuery-like (cheerio powered) object for an html page, a parsed object for json, or the raw response body if noJquery is true.
- _ - an object with helper functions:
  - _.push(task) - adds a new task (or an array of tasks) to the crawler queue (it will be parsed later). Every task can be a url string or an object with a url property.
  - _.save(item) - adds a parsed item to the results array.
  - _.step() - increments the progress indicator.
  - _.log(message /*, ... */) - safe logging (use it instead of console.log).
  - _.cb - callback function for asynchronous mode. Is undefined if asyncParse is false.
- res (optional) - full response object (needle powered).

done (optional) - function(result) that runs once with the result of crawling/parsing.
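Asynchronous mode can look like this minimal sketch; it assumes _.cb is simply called once the asynchronous work for a page is finished (the setTimeout only stands in for real async work such as a database write):

var icrawler = require('icrawler');

icrawler('http://example.com/', { asyncParse: true }, function(url, $, _) {
    // parse runs in asynchronous mode, so icrawler waits for _.cb
    setTimeout(function() {
        _.save({ title: $('h1').text() });
        _.cb(); // signal that this page is fully processed
    }, 100);
}, function(result) {
    console.log(result.length + ' items crawled');
});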
Example
var icrawler = require('icrawler');

var opts = {
    concurrency: 10,
    errorsFirst: true
};

icrawler('http://example.com/', opts, function(url, $, _){
    // queue the next page of the listing, if there is one
    if($('#next').length > 0){
        _.push($('#next').attr('href'));
        _.log('PAGE');
    }
    // save every news link on the current page
    $('.news>a').each(function() {
        _.step();
        _.save({
            title: $(this).text(),
            href: $(this).attr('href')
        });
    });
}, function(result){
    console.log(result);
});
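Another sketch uses objectTaskParse together with custom task fields to parse list pages and item pages differently (the URL, selectors and the type field are hypothetical):

var icrawler = require('icrawler');

var startData = { url: 'http://example.com/news', type: 'list' };

icrawler(startData, { objectTaskParse: true }, function(task, $, _) {
    if (task.type === 'list') {
        // a list page: queue every article link as an item task
        $('.news>a').each(function() {
            _.push({ url: $(this).attr('href'), type: 'item' });
        });
    } else {
        // an item page: save one parsed record
        _.save({ title: $('h1').text() });
    }
}, function(result) {
    console.log(result);
});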
License
MIT