evalscraper

v0.6.1 · MIT License · last published 2 years ago

evalscraper is middleware for scraping web pages with Puppeteer.

Installation

npm install evalscraper

Usage

ESM

import { Scraper, ScrapeTask } from "evalscraper";

CJS

const { Scraper, ScrapeTask } = require("evalscraper");

Create a new Scraper instance.

const scraper = new Scraper();

A ScrapeTask's first parameter is the URL of the page to scrape, followed by one or more arrays, each describing one scrape of that page. The pageFunction in each array is evaluated in the browser context.

const scrapeTask =
  new ScrapeTask(
    'https://url-to-scrape/',
    [
      'key',                   // property to hold returned value of this scrape

      'selector',              // element to select on page

      pageFunction(selectors), // a function passed an array containing all
                               // instances of 'selector' found on the page;
                               // pageFunction evaluates in browser context

      callback(array)          // optional callback that is passed an
                               // array returned by pageFunction
    ],
    // ...[Next scrape]
);
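As a concrete sketch of that shape, here is a task that would collect the text of every h1 heading on a page. The URL and selector are placeholders, and TaskShape is a hypothetical stand-in that only records its arguments, so the structure can be inspected without launching a browser; with the real library you would pass the same arguments to ScrapeTask.

```javascript
// TaskShape is a stand-in for ScrapeTask: it just stores the url and
// the scrape arrays so the task structure can be inspected.
class TaskShape {
  constructor(url, ...scrapes) {
    this.url = url;
    this.scrapes = scrapes;
  }
}

const headingTask = new TaskShape("https://example.com/", [
  "headings",                               // key for this scrape's result
  "h1",                                     // CSS selector to collect
  (els) => els.map((el) => el.textContent), // would run in browser context
  (texts) => texts.slice(0, 5),             // optional callback, runs in Node
]);
```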

Pass the ScrapeTask to the .scrape() method. It returns a Promise that resolves to an object with key: value pairs determined by the ScrapeTask.

const scrapeOfPage = await scraper.scrape(scrapeTask);

Close the scraper.

await scraper.close();

Multiple Scraper instances can be created.

const scraperFoo = new Scraper();
const scraperBar = new Scraper();

const resultsFoo = await scraperFoo.scrape(taskFoo);
const resultsBar = await scraperBar.scrape(taskBar);

await scraperFoo.close();
await scraperBar.close();

Or a single Scraper instance can be reused.

const scraperFoo = new Scraper();

const resultsFoo = await scraperFoo.scrape(taskFoo);
const resultsBar = await scraperFoo.scrape(taskBar);

await scraperFoo.close();

The number of concurrent scrapes you can run will be limited by your hardware.
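Concurrent scrapes follow the usual Promise.all pattern. A minimal sketch of that pattern, where fakeScrape is a stand-in for scraper.scrape so the example runs without a browser:

```javascript
// fakeScrape stands in for scraper.scrape(task): it resolves to an
// object keyed by the task name, like a scrape result.
const fakeScrape = (task) => Promise.resolve({ [task]: [] });

// Start every scrape at once and wait for all the results; hardware
// (CPU and memory per browser instance) caps how many are practical.
async function scrapeAll(tasks) {
  return Promise.all(tasks.map((task) => fakeScrape(task)));
}
```

With the real library, replace fakeScrape with calls to one or more Scraper instances and close each scraper when its work is done.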

Configuration

A Scraper instance can be configured by passing an object to the constructor.

const scraper = new Scraper({
  // default values
  throwError: true,
  noisy: false, // when true, progress is logged to console
  timeout: 30000,
  maxRetries: 2,
});

Example

Scrape Hacker News and return the titles and links of the first ten stories.

const { Scraper, ScrapeTask } = require("evalscraper");

const scraper = new Scraper({
  throwError: true,
  noisy: true,
  timeout: 30000,
  maxRetries: 2,
});

// returns the titles and links of
// the first ten Hacker News stories
const newsScrape = new ScrapeTask("https://news.ycombinator.com/", [
  "stories",
  "a.titlelink",
  (anchors) =>
    anchors.map((a) => {
      const story = [];
      story.push(a.textContent);
      story.push(a.href);
      return story;
    }),
  (stories) => stories.slice(0, 10),
]);

async function logStories(scrapeTask) {
  try {
    const hackerNews = await scraper.scrape(scrapeTask);
    hackerNews.stories.forEach((story) =>
      console.log(story[0], story[1], "\n")
    );
    await scraper.close();
  } catch (err) {
    console.log(err);
  }
}

logStories(newsScrape);