@robertsvendsen/node-crawler v0.5.8 (ISC license)

node-crawler

Crawls web URLs from a list.

A very simple wrapper for Puppeteer, with the most basic requirements for a crawler included.

Install

npm install @robertsvendsen/node-crawler

If browser launch fails with: Error: Failed to launch the browser process! undefined Fontconfig error: No writable cache directories

See the Puppeteer troubleshooting guide: https://pptr.dev/troubleshooting#could-not-find-expected-browser-locally

If you fixed the problem by setting an environment variable during install, you must keep that variable set whenever the crawler runs: PUPPETEER_CACHE_DIR=$(pwd)
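
If exporting the variable in your shell or service definition is inconvenient, you can also set it from code before the crawler launches the browser. A minimal sketch, assuming node-crawler launches the browser lazily so the variable is in the process environment in time (process.cwd() stands in for the $(pwd) used during install):

// Assumption: the browser is launched after this line runs.
process.env.PUPPETEER_CACHE_DIR = process.env.PUPPETEER_CACHE_DIR ?? process.cwd();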

Example

import Crawler, { CrawlerOptions, CrawlerPageOptions } from '@robertsvendsen/node-crawler/src/crawler'

const options = new CrawlerOptions({
  name: 'node-crawler-agent',
  concurrency: 1,
  readRobotsTxt: true,
  dataPath: 'data/crawler',
});

const crawler = new Crawler(options);
const links = [{ url: "https://www.google.com" }];

init().then(async () => {
  console.info('Crawling complete');
  // await delay(10000); // If the script exits before crawling has completed, add a delay here. The queue can be empty while crawling is still in progress.
  await crawler.close();
  process.exit();
});

async function init() {
  const pageOptions = new CrawlerPageOptions({ downloadImages: true });
  
  for (const link of links) {
    crawler.add(link.url, pageOptions).then((result) => {
      if (result) {
        console.info('Crawled', link.url);
      }
    });

    // To avoid saturating the CPU immediately on startup, don't fill the queue all the way up.
    await crawler.queue.onSizeLessThan(options.concurrency * 2);
  }
  
  await crawler.queue.onEmpty();
}
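
The delay referenced in the comment above is not part of the library. A minimal helper, if you need it:

// Minimal promise-based delay helper (not provided by node-crawler).
function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}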

Options

CrawlerBrowserOptions

width = 1920; // 3840
height = 1080; // 2160
isLandscape = false;
isMobile = false;
hasTouch = false;
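
For example, a portrait phone viewport could look like this. The fields are the documented ones above, but the import and the constructor-with-object pattern are assumptions based on the other options classes:

import { CrawlerBrowserOptions } from '@robertsvendsen/node-crawler/src/crawler' // assumed export

const browserOptions = new CrawlerBrowserOptions({
  width: 390,
  height: 844,
  isLandscape: false,
  isMobile: true,
  hasTouch: true,
});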

CrawlerPageOptions

downloadImages = false;
returnPageInstance = false; // If true, you must close it yourself.
timeout = 10000; // Page load timeout in ms.
waitUntil = 'networkidle2';

CrawlerOptions

concurrency = 1;
readRobotsTxt = true;
name = 'node-crawler'; // This should just be the name, no version or anything.
version = '0.1';
email = ''; // contact email for this crawler.
dataPath = 'data';
saveAsPDF = false; // Enable PDF generation (a print of the page).
saveFiles = true; // Set to false if you want to handle saving files yourself.
headless = true;
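
Putting the identity options together. That name, version, and email end up in the crawler's user-agent string is an assumption; the comments above only say to keep name bare:

import Crawler, { CrawlerOptions } from '@robertsvendsen/node-crawler/src/crawler'

const options = new CrawlerOptions({
  name: 'my-crawler',           // just the name, no version
  version: '1.0',
  email: 'crawler@example.com', // contact address for site operators
  concurrency: 2,
  readRobotsTxt: true,
  dataPath: 'data/my-crawler',
  headless: true,
});

const crawler = new Crawler(options);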

Roadmap

  • Concurrency using threads or processes. It may be enough to simply raise the concurrency option, since Puppeteer should be able to handle more tabs.
  • Recursive crawling options
    • When crawling recursively, it should respect the robots.txt crawl delay as well.
    • Callback function in the options to decide whether a link should be queued (for recursive crawling; see the sketch after this list)
  • Own database (sqlite3)
    • table: sites (site_id, domain, url)
      • table: site_options
      • table: pages (page_id, site_id, path, querystring, last_visited, status_code, redirect_location)
  • Logo fetcher (upper left corner, name contains 'logo'?)
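
To illustrate the callback idea from the roadmap, a hypothetical shouldQueue option. Nothing here exists yet; the option names and the signature are made up for illustration:

// Hypothetical future API, sketched for illustration only.
const options = new CrawlerOptions({
  name: 'my-crawler',
  recursive: true, // roadmap idea, not a current option
  shouldQueue: (url: string, fromUrl: string): boolean => {
    // Only follow links that stay on the same host.
    return new URL(url).host === new URL(fromUrl).host;
  },
} as any); // cast because these options are not in the current typings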