0.2.0 • Published 5 years ago

crawlbot v0.2.0

License: ISC
Repository: github

Crawlbot

See crawlbot-server to interact with crawlbot through your browser.

Crawlbot is an easy-to-use library for crawling websites that use Sitemaps. It is inspired by SitemapSpider from the popular Python library Scrapy.

Usage

crawlbot.crawl(domain, since, onHTML, onExit): ChildProcess

const crawlbot = require("crawlbot")

// Called once for each page the crawler downloads
const onHTML = (html, url) => {
  console.log("Got html for", url)
}

// Called when the crawler process exits
const onExit = (code, signal) => {
  console.log("Crawler exited with code", code)
}

// crawl() returns the forked ChildProcess; keep the reference if you need to stop it
const forkedProcess = crawlbot.crawl("qz.com", "2019-10-11", onHTML, onExit)
/* forkedProcess.kill() // This will kill the crawler */

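crawl forks a Node.js process that parses the sitemap of domain and calls onHTML(html, url) each time it downloads a page. onExit(code, signal) is called when the forked process exits. The returned ChildProcess can be used to stop the crawler early.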
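
Because crawl returns the forked process, a crawl can be cut short once enough pages have arrived. The following is a minimal sketch of that idea, not part of crawlbot itself: crawlWithLimit and maxPages are hypothetical names, and the crawl function is passed in as an argument so the helper relies only on the signature documented above. It assumes onHTML is called asynchronously, after crawl has returned.

```javascript
// Sketch: collect at most maxPages pages, then stop the crawler.
// `crawl` must follow the documented signature:
//   crawl(domain, since, onHTML, onExit) -> ChildProcess
// crawlWithLimit and maxPages are illustrative names, not crawlbot API.
function crawlWithLimit(crawl, domain, since, maxPages) {
  return new Promise((resolve) => {
    const pages = []
    let child = null

    const onHTML = (html, url) => {
      if (pages.length >= maxPages) return // ignore pages arriving after the limit
      pages.push({ url, html })
      if (pages.length === maxPages) {
        if (child) child.kill() // stop the forked crawler
        resolve(pages)
      }
    }

    // If the crawler finishes before the limit, resolve with what was collected.
    // Calling resolve a second time after the limit was hit has no effect.
    const onExit = () => resolve(pages)

    child = crawl(domain, since, onHTML, onExit)
  })
}

// Possible usage, assuming crawlbot is installed:
//   crawlWithLimit(require("crawlbot").crawl, "qz.com", "2019-10-11", 10)
//     .then((pages) => console.log("Collected", pages.length, "pages"))
```

Resolving as soon as the limit is reached, rather than waiting for onExit, keeps the sketch independent of whether onExit fires after kill().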

Notes

Crawlbot will only crawl websites that have a valid sitemap. It will also only crawl webpages that have been modified since the date provided in the since parameter. If the date of a page cannot be inferred from the sitemap, the page will not be crawled.
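
To illustrate the filtering rule above, here is a rough sketch of how sitemap entries could be selected by date. It is not crawlbot's actual implementation: urlsModifiedSince is a hypothetical helper, it assumes the date comes from each entry's <lastmod> element, and it treats the since date as inclusive. A regular expression stands in for a real XML parser to keep the example dependency-free.

```javascript
// Sketch: return the <loc> of every sitemap entry whose <lastmod> is on or
// after `since`. Entries with a missing or unparseable date are skipped,
// mirroring the rule described in the note above.
// urlsModifiedSince is an illustrative name, not crawlbot API.
function urlsModifiedSince(sitemapXml, since) {
  const sinceTime = new Date(since).getTime()
  const urls = []
  for (const [, entry] of sitemapXml.matchAll(/<url>([\s\S]*?)<\/url>/g)) {
    const loc = /<loc>\s*([^<]*?)\s*<\/loc>/.exec(entry)
    const lastmod = /<lastmod>\s*([^<]*?)\s*<\/lastmod>/.exec(entry)
    if (!loc || !lastmod) continue // no date available: do not crawl
    const modified = new Date(lastmod[1]).getTime()
    if (Number.isNaN(modified)) continue // date present but not parseable
    if (modified >= sinceTime) urls.push(loc[1])
  }
  return urls
}
```

With a since value of "2019-10-11", an entry dated 2019-10-12 would be kept, while an entry dated 2019-10-01 or one with no <lastmod> would be dropped.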

0.2.0 • 5 years ago
0.1.3 • 5 years ago
0.1.2 • 5 years ago
0.1.1 • 5 years ago
0.1.0 • 5 years ago