
W.R.E.C.K. Web Crawler

WRECK is a fast, reliable, flexible web crawling kit.

Project Status

This project is in the design and prototype phase.

Roadmap

  • Initial design and request for feedback
  • First prototype for testing
    • Multi process crawling
    • Configurable per-process concurrency
    • HTTP and HTTPS support
    • HTTP retries
    • HEAD and GET requests
    • Shared work queue
    • Request rate limiting
    • Crawl depth
    • Limit to original domain
    • URL normalization
    • Exclude patterns
    • Persistent state across runs
    • Maximum request limit
    • Output levels
    • Simple reporting
    • Basic unit testing
    • Nofollow patterns
    • Include patterns
  • Domain whitelist
  • Reporting
  • Unit testing
  • Functional testing
  • Incorporate design feedback
  • Code clean-up
  • Performance and memory profiling and improvements
  • Implement all core features
  • Add to npm registry

Installing

npm i -g @lucascaro/wreck

Invoke it by running wreck.
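
If you would rather not install the package globally, npx can fetch and run the published package on demand; this sketch assumes the wreck binary the package exposes, as described above:

npx -p @lucascaro/wreck wreck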

Running

Show available commands

$ wreck

wreck v0.0.1

Usage:
 wreck               [options] [commands]   Reliable and Efficient Web Crawler

Options:
    -v --verbose                  Make operation more talkative.
    -s --silent                   Make operation silent (Only errors and warnings will be shown).
    -f --state-file    <fileName> Path to status file.

Available Subcommands:
   crawl

   report


 run wreck help <subcommand> for more help.

Crawl

$ wreck help crawl

crawl

Usage:
 crawl               [options]

Options:
    -u --url           <URL>      Crawl starting from this URL
    -R --retries       <number>   Maximum retries for a URL
    -t --timeout       <number>   Maximum seconds to wait for requests
    -m --max-requests  <number>   Maximum requests for this run.
    -n --no-resume                Force the command to restart crawling from scratch, even if there is saved state.
    -w --workers       <nWorkers> Start this many workers. Defaults to one per CPU.
    -d --max-depth     <number>   Maximum link depth to crawl.
    -r --rate-limit    <number>   Number of requests that will be made per second.
    -e --exclude       <regex>    Do not crawl URLs that match this regex. Can be specified multiple times.
    -c --concurrency   <concurrency> How many requests can be active at the same time.
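
For example, a crawl that caps depth, retries, and timeouts while skipping PDF and image URLs could combine the options above (the values and patterns here are illustrative, not recommendations):

wreck crawl -u https://example.com -d 3 -R 5 -t 30 -e '\.pdf$' -e '\.(png|jpg|gif)$'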

Crawl an entire website

Default operation:

wreck crawl -u https://example.com

This will use the default operation mode:

  • 1 worker process per CPU
  • 100 maximum concurrent requests
  • save state to ./wreck.run.state.json
  • automatically resume work if state file is present
  • unlimited crawl depth
  • limit crawling to the provided main domain
  • no rate limit
  • 3 maximum retries for URLs that return a 429 status code
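
Spelled out with the documented flags, that default run is roughly equivalent to the following sketch (the worker count is shown for a 4-CPU machine, and placing the global -f option before the subcommand is an assumption based on the usage line above):

wreck -f ./wreck.run.state.json crawl -u https://example.com -w 4 -c 100 -R 3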

Minimal operation (useful for debugging):

wreck crawl -u https://example.com --concurrency=1 --workers=1 --rate-limit=1
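
To keep a debugging run even smaller, the documented --max-requests option can cap the crawl outright (the limit of 10 is only an example value):

wreck crawl -u https://example.com --concurrency=1 --workers=1 --rate-limit=1 --max-requests=10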

Debug

Clone the repository:

git clone git@github.com:lucascaro/wreck.git
cd wreck
npm link

This project uses the debug package for logging. Set the DEBUG environment variable to * to see all output:

DEBUG=* wreck crawl -u https://example.com
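
The debug package also accepts comma-separated, namespaced patterns, so output can be narrowed once you know which namespaces the crawler logs under (wreck:* below is only a guessed placeholder, not a documented namespace):

DEBUG=wreck:* wreck crawl -u https://example.com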

Contributing

Please feel free to add questions, comments, and suggestions via GitHub issues.