redis-web-crawler v1.0.0


Web Crawler with Redis Graph

Read the blog post.

A web crawler built with Node.js. It fetches site data from a given URL and recursively follows links across the web.

Search the sites with either breadth-first search or depth-first search.
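
As a rough illustration (this is not the package's internal code), the two orders differ only in how the frontier of pending URLs is consumed: a queue for breadth-first, a stack for depth-first.

  // Sketch only: BFS takes from the front of the frontier (FIFO),
  // DFS takes from the back (LIFO). getLinks stands in for whatever
  // fetches a page and extracts its URLs.
  function crawl(startUrl, getLinks, algorithm) {
    const frontier = [startUrl];
    const visited = new Set();
    while (frontier.length > 0) {
      const url = algorithm === 'breadthFirstSearch'
        ? frontier.shift()   // queue: oldest URL first
        : frontier.pop();    // stack: newest URL first
      if (visited.has(url)) continue;
      visited.add(url);
      for (const link of getLinks(url)) {
        if (!visited.has(link)) frontier.push(link);
      }
    }
    return visited;
  }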

Every URL is saved to a graph (represented as an adjacency list). The graph is stored in Redis.
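
For intuition, an adjacency list in Redis can be modeled with one set per URL holding its outbound links. Here is a minimal sketch using the node-redis client; the key scheme is an assumption for illustration, not necessarily what redis-web-crawler uses internally.

  // Assumed key scheme: "links:<url>" -> set of outbound URLs.
  import { createClient } from 'redis';

  const client = createClient(); // defaults to redis://127.0.0.1:6379
  await client.connect();

  // Record directed edges from a crawled page to the links found on it.
  await client.sAdd('links:https://example.com', [
    'https://example.com/about',
    'https://example.com/blog',
  ]);

  // Read the adjacency list back.
  console.log(await client.sMembers('links:https://example.com'));

  await client.quit();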

Installation

npm install --save redis-web-crawler

Usage

Run a local Redis server to store the output:

  $ redis-server

Create a new crawler instance and pass in a configuration object. Call the run method to begin crawling.

  import WebCrawler from 'redis-web-crawler';

  const crawlerSettings = {
    startUrl: 'https://en.wikipedia.org/wiki/Main_Page',
    followInternalLinks: false,
    searchDepthLimit: null,
    searchAlgorithm: 'breadthFirstSearch',
  };

  const crawler = new WebCrawler(crawlerSettings);
  crawler.run();

Configuration Properties

  Name                 Type     Description
  startUrl             string   A valid URL of a page with links.
  followInternalLinks  boolean  Toggle searching through internal site links.
  searchDepthLimit     integer  Set a limit on the recursive URL requests.
  searchAlgorithm      string   "breadthFirstSearch" or "depthFirstSearch"
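
For example, a depth-limited, depth-first crawl that also follows internal links might be configured like this (the values are illustrative):

  const crawlerSettings = {
    startUrl: 'https://example.com',      // any page with links
    followInternalLinks: true,            // also follow same-site links
    searchDepthLimit: 3,                  // stop recursing after 3 levels
    searchAlgorithm: 'depthFirstSearch',
  };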

Exporting the Redis Graph

  • clone the Redis Dump repo
  • run the commands to install its gem dependencies (refer to the redis-dump README)
  • with the redis server up and running:
    • note the host and port of the redis-server (e.g. 6371)
    • in the project root folder, run ./bin/redis-dump -u 127.0.0.1:6371 > db_full.json
    • view the Redis export in db_full.json
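
The redis-dump gem typically writes one JSON object per line; a quick way to skim the exported keys (assuming that line-per-object format) is:

  // Assumes db_full.json holds one JSON object per line, as produced
  // by the redis-dump gem; field names may vary by version.
  import { readFileSync } from 'fs';

  const lines = readFileSync('db_full.json', 'utf8')
    .split('\n')
    .filter(Boolean);

  for (const line of lines) {
    const entry = JSON.parse(line);
    console.log(entry.key, entry.type);
  }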

spencerlepine.com  ·  GitHub @spencerlepine  ·  Twitter @spencerlepine