@a11ywatch/crawler v0.9.9
crawler
A gRPC web indexer turbocharged for performance.
This project is capable of handling millions of pages per second efficiently.
Getting Started
Make sure you have Rust or Docker installed.
This project requires that you start up another gRPC server on port 50051 following the proto spec.
Each crawl spoofs a random user agent, and the indexer extends spider as its base.
cargo run
or
docker compose up
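Since the crawler expects a receiving gRPC server on port 50051, a quick pre-flight check can rule out the most common startup problem. The sketch below is not part of this crate: `is_listening` is a hypothetical helper, and the host 127.0.0.1 is an assumption (the text above only names the port). It confirms that something accepts TCP connections on that port, not that it implements the proto spec.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if a TCP connection to `addr`
/// succeeds within `timeout`. This only shows that something is
/// listening, not that it speaks gRPC or follows the proto spec.
fn is_listening(addr: &str, timeout: Duration) -> bool {
    match addr.parse::<SocketAddr>() {
        Ok(sock) => TcpStream::connect_timeout(&sock, timeout).is_ok(),
        // An unparsable address is treated as "not reachable".
        Err(_) => false,
    }
}

fn main() {
    // Assumption: the receiving server runs on the local machine.
    let addr = "127.0.0.1:50051";
    if is_listening(addr, Duration::from_millis(500)) {
        println!("receiver reachable at {addr}");
    } else {
        eprintln!("nothing is listening at {addr}; start the gRPC server first");
    }
}
```

Running this before `cargo run` or `docker compose up` gives a clear message when the receiver is missing.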
Installation
You can install easily with the following:
Cargo
The crate is available to set up a gRPC server within Rust projects.
cargo install website_crawler
Docker
You can also use the Docker image at a11ywatch/crawler.
Set the CRAWLER_IMAGE env var to darwin-arm64 to get the native M1 Mac image.
crawler:
  container_name: crawler
  image: "a11ywatch/crawler:${CRAWLER_IMAGE:-latest}"
  ports:
    - 50055
Node / Bun
We also release the package to npm as @a11ywatch/crawler.
npm i @a11ywatch/crawler
Afterwards, import it at the top of your project to start the gRPC server, or run node directly against the module.
import "@a11ywatch/crawler";
Example
This is a basic example of crawling a web page. Add website_crawler to your Cargo.toml:
[dependencies]
website_crawler = "0.9.4"
A basic example can also be run as follows.
In one terminal, run the server:
cargo run --example server --release
In another terminal, run the client:
cargo run --example client --release
https://user-images.githubusercontent.com/8095978/221221122-cfed83aa-6ca1-4122-a1db-0d9948e9f9d9.mov
Dependencies
To build crawler >= 0.5.0 locally, you need protoc, the Protocol Buffers compiler, along with the Protocol Buffers resource files.
Ubuntu
The proto compiler must be at v3 in order to compile. Ubuntu 18+ installs it automatically.
sudo apt update && sudo apt upgrade -y
sudo apt install -y protobuf-compiler libprotobuf-dev
Alpine Linux
sudo apk add protoc protobuf-dev
macOS
Assuming Homebrew is already installed. (If not, see instructions for installing Homebrew on the Homebrew website.)
brew install protobuf
Features
jemalloc - use the jemalloc memory allocator (disabled by default).
regex - use the regex crate for blacklist URL validation.
ua_generator - use the ua_generator crate to spoof a random user agent.
smart - use smart mode to run an HTTP request first and Chrome when JS is needed.
chrome - enable Chrome headless rendering; use the env var CHROME_URL to connect remotely.
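To illustrate the CHROME_URL setting, here is a minimal sketch of how a program might decide between a remote Chrome endpoint and a local one. It is not the crate's implementation: `ChromeTarget` and `chrome_target` are hypothetical names, and falling back to a local instance when the variable is unset or empty is an assumption.

```rust
use std::env;

/// Hypothetical type: where headless Chrome should be reached.
#[derive(Debug, PartialEq)]
enum ChromeTarget {
    /// Connect to an already running Chrome at this URL.
    Remote(String),
    /// Assumption: with no URL configured, use a local instance.
    Local,
}

/// Hypothetical helper: maps the raw env value to a target.
/// Unset, empty, and whitespace-only values all mean "local".
fn chrome_target(value: Option<String>) -> ChromeTarget {
    match value {
        Some(url) if !url.trim().is_empty() => ChromeTarget::Remote(url.trim().to_string()),
        _ => ChromeTarget::Local,
    }
}

fn main() {
    // CHROME_URL is the variable named in the feature list above.
    let target = chrome_target(env::var("CHROME_URL").ok());
    println!("{target:?}");
}
```

Taking the value as an `Option<String>` rather than reading the environment inside the helper keeps the decision easy to test.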
About
This crawler is optimized for reduced latency and uses isolation-based concurrency; it can handle over 10,000 pages within several milliseconds. To receive the links the crawler finds, you need to add website.proto to your server. This is required since every request spawns a thread. Isolating the context drastically improves performance by avoiding shared resources and cross-thread communication.
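The isolation idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crawler's code: `CrawlContext`, `crawl`, and `crawl_isolated` are hypothetical names, and the crawl step is a stand-in that fabricates links instead of touching the network. Each request gets its own thread and owns its context, so no locks or shared state are involved while it runs.

```rust
use std::thread;

/// Hypothetical per-request state; each thread owns its own copy.
struct CrawlContext {
    url: String,
    links: Vec<String>,
}

/// Stand-in for the real crawl: no network access, it only derives
/// two placeholder links from the starting URL.
fn crawl(mut ctx: CrawlContext) -> CrawlContext {
    for path in ["/about", "/contact"] {
        ctx.links.push(format!("{}{}", ctx.url, path));
    }
    ctx
}

/// Spawns one thread per URL. The context is moved into the thread,
/// so nothing is shared between requests while they run.
fn crawl_isolated(urls: Vec<String>) -> Vec<CrawlContext> {
    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| thread::spawn(move || crawl(CrawlContext { url, links: Vec::new() })))
        .collect();
    // Results are handed back only when each thread finishes.
    handles
        .into_iter()
        .map(|h| h.join().expect("crawl thread panicked"))
        .collect()
}

fn main() {
    let results = crawl_isolated(vec![
        "https://a.example".to_string(),
        "https://b.example".to_string(),
    ]);
    for ctx in &results {
        println!("{} -> {} links", ctx.url, ctx.links.len());
    }
}
```

In the real crawler the results go to your gRPC server rather than back through a join handle; the sketch only shows why isolated contexts need no synchronization.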
Help
If you need help implementing the gRPC server to receive the pages or links when found, check out the gRPC node example for a starting point.
LICENSE
Check the license file in the root of the project.