
W.R.E.C.K. Web Crawler

WRECK is a fast, reliable, flexible web crawling kit.

Project Status

This project is in the design and prototype phase.

Roadmap

  • Initial design and request for feedback
  • First prototype for testing
    • Multi process crawling
    • Configurable per-process concurrency
    • HTTP and HTTPS support
    • HTTP retries
    • HEAD and GET requests
    • Shared work queue
    • Request rate limiting
    • Crawl depth
    • Limit to original domain
    • URL normalization
    • Exclude patterns
    • Persistent state across runs
    • Maximum request limit
    • Output levels
    • Simple reporting
    • Basic unit testing
    • Nofollow patterns
    • Include patterns
  • Domain whitelist
  • Reporting
  • Unit testing
  • Functional testing
  • Incorporate design feedback
  • Code clean-up
  • Performance and memory profiling and improvements
  • Implement all core features
  • Add to npm registry

Installing

npm i -g @lucascaro/wreck

Invoke it by running wreck.
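
If you would rather not install the package globally, npx can fetch and run the published package on demand; this sketch assumes the wreck binary the package exposes, as described above:

npx -p @lucascaro/wreck wreck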

Running

Show available commands

$ wreck

wreck v0.0.1

Usage:
 wreck               [options] [commands]   Reliable and Efficient Web Crawler

Options:
    -v --verbose                  Make operation more talkative.
    -s --silent                   Make operation silent (Only errors and warnings will be shown).
    -f --state-file    <fileName> Path to status file.

Available Subcommands:
   crawl

   report


 run wreck help <subcommand> for more help.

Crawl

$ wreck help crawl

crawl

Usage:
 crawl               [options]

Options:
    -u --url           <URL>      Crawl starting from this URL
    -R --retries       <number>   Maximum retries for a URL
    -t --timeout       <number>   Maximum seconds to wait for requests
    -m --max-requests  <number>   Maximum requests for this run.
    -n --no-resume                Force the command to restart crawling from scratch, even if there is saved state.
    -w --workers       <nWorkers> Start this many workers. Defaults to one per CPU.
    -d --max-depth     <number>   Maximum link depth to crawl.
    -r --rate-limit    <number>   Number of requests that will be made per second.
    -e --exclude       <regex>    Do not crawl URLs that match this regex. Can be specified multiple times.
    -c --concurrency   <concurrency> How many requests can be active at the same time.
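
For example, a crawl that caps depth, retries, and timeouts while skipping PDF and image URLs could combine the options above (the values and patterns here are illustrative, not recommendations):

wreck crawl -u https://example.com -d 3 -R 5 -t 30 -e '\.pdf$' -e '\.(png|jpg|gif)$'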

Crawl an entire website

Default operation:

wreck crawl -u https://example.com

This will use the default operation mode:

  • 1 worker process per CPU
  • 100 maximum concurrent requests
  • save state to ./wreck.run.state.json
  • automatically resume work if state file is present
  • unlimited crawl depth
  • limit crawling to the provided main domain
  • no rate limit
  • 3 maximum retries for URLs that return a 429 status code
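
Spelled out with the documented flags, that default run is roughly equivalent to the following sketch (the worker count is shown for a 4-CPU machine, and placing the global -f option before the subcommand is an assumption based on the usage line above):

wreck -f ./wreck.run.state.json crawl -u https://example.com -w 4 -c 100 -R 3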

Minimal operation (useful for debugging):

wreck crawl -u https://example.com --concurrency=1 --workers=1 --rate-limit=1
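
To keep a debugging run even smaller, the documented --max-requests option can cap the crawl outright (the limit of 10 is only an example value):

wreck crawl -u https://example.com --concurrency=1 --workers=1 --rate-limit=1 --max-requests=10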

Debug

Clone the repository:

git clone git@github.com:lucascaro/wreck.git
cd wreck
npm link

This project uses the debug package for logging. Set the DEBUG environment variable to * to see all output:

DEBUG=* wreck crawl -u https://example.com
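
The debug package also accepts comma-separated, namespaced patterns, so output can be narrowed once you know which namespaces the crawler logs under (wreck:* below is only a guessed placeholder, not a documented namespace):

DEBUG=wreck:* wreck crawl -u https://example.com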

Contributing

Please feel free to add questions, comments, and suggestions via GitHub issues.