@robertsvendsen/node-crawler NPM

node-crawler

Crawls web URLs from a list.

A very simple wrapper for Puppeteer that includes the most basic requirements for a crawler.

Install

npm install @robertsvendsen/node-crawler

If you get the following error:

Error: Failed to launch the browser process! undefined Fontconfig error: No writable cache directories

see https://pptr.dev/troubleshooting#could-not-find-expected-browser-locally

If you had that problem and fixed it by setting an environment variable during install, you must keep that variable set every time the crawler runs: PUPPETEER_CACHE_DIR=$(pwd)
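Since the variable has to be present at runtime as well as at install time, a small startup check can surface a missing value early. The sketch below is only an illustration: `requirePuppeteerCacheDir` is a hypothetical helper, not part of node-crawler or Puppeteer, and it takes the environment as a plain object so it stays self-contained.

```typescript
// Hypothetical helper (not part of node-crawler): fail fast when the cache
// directory variable that was used at install time is missing at runtime.
type Env = Record<string, string | undefined>;

function requirePuppeteerCacheDir(env: Env): string {
  const dir = env.PUPPETEER_CACHE_DIR;
  if (!dir || dir.trim() === '') {
    throw new Error(
      'PUPPETEER_CACHE_DIR is not set; use the same value that was used during install.',
    );
  }
  return dir;
}

// Usage in Node.js, before creating the crawler:
// const cacheDir = requirePuppeteerCacheDir(process.env);
```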

Example

import Crawler, { CrawlerOptions, CrawlerPageOptions } from '@robertsvendsen/node-crawler/src/crawler'

const options = new CrawlerOptions({
  name: 'node-crawler-agent',
  concurrency: 1,
  readRobotsTxt: true,
  dataPath: 'data/crawler',
});

const crawler = new Crawler(options);
const links = [{ url: "https://www.google.com" }];

init().then(async () => {
  console.info('Crawling complete');
  // await delay(10000); // If the script exits before crawling has completed, add a delay here: the queue can be empty while pages are still being crawled.
  await crawler.close();
  process.exit();
});

async function init() {
  const pageOptions = new CrawlerPageOptions({ downloadImages: true });
  
  for (const link of links) {
    crawler.add(link.url, pageOptions).then((result) => {
      if (result) {
        console.info('Crawled', link.url);
      }
    });
    
    // To avoid saturating the CPU immediately on startup we don't fill the queue up all the way.
    await crawler.queue.onSizeLessThan(options.concurrency * 2);
  }
  
  await crawler.queue.onEmpty();
}
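The commented-out `delay(10000)` call in the example is not defined anywhere in the snippet. A minimal sketch of such a helper, assuming it is meant to be a plain promise-based sleep (the name comes from the comment; the implementation here is an assumption):

```typescript
// Hypothetical helper assumed by the commented-out line in the example:
// returns a promise that resolves after `ms` milliseconds.
function delay(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Usage inside an async function:
// await delay(10000); // wait 10 seconds before closing the crawler
```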

Options

CrawlerBrowserOptions

width = 1920; // 3840
height = 1080; // 2160
isLandscape = false;
isMobile = false;
hasTouch = false;

CrawlerPageOptions

downloadImages = false;
returnPageInstance = false; // If true, you must close it yourself.
timeout = 10000; // Page load timeout in ms.
waitUntil = 'networkidle2';

CrawlerOptions

concurrency = 1;
readRobotsTxt = true;
name = 'node-crawler'; // This should just be the name, no version or anything.
version = '0.1';
email = ''; // contact email for this crawler.
dataPath = 'data';
saveAsPDF = false; // Enable PDF generation (printing) of the site.
saveFiles = true; // Handle this yourself? set to true.
headless = true;
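The example passes only `{ downloadImages: true }` to `CrawlerPageOptions`, which suggests that a partial object is merged over the defaults listed above. The sketch below illustrates that idea with a local mirror of the documented `CrawlerPageOptions` defaults. It is an assumption about the behavior, not the package's actual implementation; `PageOptionsSketch`, `pageDefaults` and `withDefaults` are hypothetical names.

```typescript
// Local mirror of the documented CrawlerPageOptions defaults.
// Assumption: the real class merges a partial object over these values.
interface PageOptionsSketch {
  downloadImages: boolean;
  returnPageInstance: boolean;
  timeout: number; // Page load timeout in ms.
  waitUntil: string;
}

const pageDefaults: PageOptionsSketch = {
  downloadImages: false,
  returnPageInstance: false,
  timeout: 10000,
  waitUntil: 'networkidle2',
};

// Hypothetical merge: fields not given in `overrides` keep their defaults.
function withDefaults(overrides: Partial<PageOptionsSketch>): PageOptionsSketch {
  return { ...pageDefaults, ...overrides };
}

const merged = withDefaults({ downloadImages: true });
console.info(merged.downloadImages, merged.timeout);
```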

Roadmap

- Concurrency using threads or processes. It might be enough to just increase the concurrency option, because Puppeteer should be able to handle more tabs.
- Recursive crawling options
  - When crawling recursively, it should honor the robots.txt delay as well.
  - Callback function in options to determine whether a link should be queued (for recursive search)
- Own database (SQLite3)
  - table: sites (site_id, domain, url)
  - table: site_options
  - table: pages (page_id, site_id, path, querystring, last_visited, status_code, redirect_location)
- Logo fetcher (upper left corner, name contains 'logo'?)
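The callback idea from the recursive crawling item could look roughly like the sketch below. Nothing here exists in the package yet: `ShouldQueue`, `sameHostOnly` and `selectLinks` are hypothetical names, and the same-host rule is just one possible policy.

```typescript
// Hypothetical callback shape for the roadmap item: decide per link whether
// it should be queued during a recursive crawl.
type ShouldQueue = (link: string, sourceUrl: string) => boolean;

// Example policy: only follow links that stay on the host of the source page.
const sameHostOnly: ShouldQueue = (link, sourceUrl) => {
  try {
    // Relative links are resolved against the page they were found on.
    return new URL(link, sourceUrl).hostname === new URL(sourceUrl).hostname;
  } catch {
    return false; // Unparseable links are skipped.
  }
};

// Applies the callback to every link discovered on a page.
function selectLinks(links: string[], sourceUrl: string, shouldQueue: ShouldQueue): string[] {
  return links.filter((link) => shouldQueue(link, sourceUrl));
}
```

A caller would then queue only the returned links, for example by passing each one to `crawler.add` as in the example above.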