redis-web-crawler v1.0.0


Web Crawler with Redis Graph

Read the blog post.

A web crawler built with Node.js. It fetches site data from a given URL and recursively follows links across the web.

Search the sites with either breadth-first search or depth-first search.
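
As a rough illustration (this is not the package's internal code), the two orders differ only in how the frontier of pending URLs is consumed: a queue for breadth-first, a stack for depth-first.

  // Sketch only: BFS takes from the front of the frontier (FIFO),
  // DFS takes from the back (LIFO). getLinks stands in for whatever
  // fetches a page and extracts its URLs.
  function crawl(startUrl, getLinks, algorithm) {
    const frontier = [startUrl];
    const visited = new Set();
    while (frontier.length > 0) {
      const url = algorithm === 'breadthFirstSearch'
        ? frontier.shift()   // queue: oldest URL first
        : frontier.pop();    // stack: newest URL first
      if (visited.has(url)) continue;
      visited.add(url);
      for (const link of getLinks(url)) {
        if (!visited.has(link)) frontier.push(link);
      }
    }
    return visited;
  }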

Every URL is saved to a graph (represented as an adjacency list). The graph is stored in Redis.
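
For intuition, an adjacency list in Redis can be modeled with one set per URL holding its outbound links. Here is a minimal sketch using the node-redis client; the key scheme is an assumption for illustration, not necessarily what redis-web-crawler uses internally.

  // Assumed key scheme: "links:<url>" -> set of outbound URLs.
  import { createClient } from 'redis';

  const client = createClient(); // defaults to redis://127.0.0.1:6379
  await client.connect();

  // Record directed edges from a crawled page to the links found on it.
  await client.sAdd('links:https://example.com', [
    'https://example.com/about',
    'https://example.com/blog',
  ]);

  // Read the adjacency list back.
  console.log(await client.sMembers('links:https://example.com'));

  await client.quit();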

Installation

npm install --save redis-web-crawler

Usage

Run a local Redis server to store the output:

  $ redis-server

Create a new crawler instance and pass in a configuration object. Call the run method to begin crawling.

  import WebCrawler from 'redis-web-crawler';

  const crawlerSettings = {
    startUrl: 'https://en.wikipedia.org/wiki/Main_Page',
    followInternalLinks: false,
    searchDepthLimit: null,
    searchAlgorithm: 'breadthFirstSearch',
  };

  const crawler = new WebCrawler(crawlerSettings);
  crawler.run();

Configuration Properties

  Name                 Type     Description
  startUrl             string   A valid URL of a page with links.
  followInternalLinks  boolean  Toggle searching through internal site links.
  searchDepthLimit     integer  Set a limit on the recursive URL requests.
  searchAlgorithm      string   "breadthFirstSearch" or "depthFirstSearch"
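
For example, a depth-limited, depth-first crawl that also follows internal links might be configured like this (the values are illustrative):

  const crawlerSettings = {
    startUrl: 'https://example.com',      // any page with links
    followInternalLinks: true,            // also follow same-site links
    searchDepthLimit: 3,                  // stop recursing after 3 levels
    searchAlgorithm: 'depthFirstSearch',
  };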

Exporting the Redis Graph

  • clone the Redis Dump repo
  • run the commands to install its gem dependencies (refer to the redis-dump README)
  • with the redis server up and running:
    • note the host and port of the redis-server (e.g. 6371)
    • in the project root folder, run ./bin/redis-dump -u 127.0.0.1:6371 > db_full.json
    • view the Redis export in db_full.json
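
The redis-dump gem typically writes one JSON object per line; a quick way to skim the exported keys (assuming that line-per-object format) is:

  // Assumes db_full.json holds one JSON object per line, as produced
  // by the redis-dump gem; field names may vary by version.
  import { readFileSync } from 'fs';

  const lines = readFileSync('db_full.json', 'utf8')
    .split('\n')
    .filter(Boolean);

  for (const line of lines) {
    const entry = JSON.parse(line);
    console.log(entry.key, entry.type);
  }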

spencerlepine.com  ·  GitHub @spencerlepine  ·  Twitter @spencerlepine