@cherastain/scraper

An easy-to-use scraper that uses Puppeteer under the hood to evaluate page content and to support further page processing.

Usage

// IScraperSelector is assumed to be exported from the package root.
import { Scraper, IScraperSelector } from "@cherastain/scraper";

const scraper = new Scraper();
const url = "https://www.npmjs.com/";
const selector: IScraperSelector = {
  links: { selector: "a", format: { attr: "href" } },
};
const result = await scraper.process(url, selector);

// expected result : { links: [...] }

Selector basics

Each property defined on the selector produces a property with the same name in the result. For a result such as the following:

{
  title:"Title of page",
  subTitles:[...],
  links:[...]
}

the selector should be defined as:

const selector: IScraperSelector = {
  title: { selector: "h1", format: ["unique"] },
  subTitles: "h2",
  links: { selector: "a", format: { attr: "href" } },
};

Remark: the default format is the DOM element's innerText.

Documentation

Scraper class

Constructor

Scraper(options)

process method

scraper.process(url, selector, options);

Parameter	Type	Description	Default
url	string	URL to scrape
selector	IScraperSelector	(optional) selector model used to build the scrape result	As defined in librarySettings `{ hrefs: { selector: "a", format: { attr: "href" } } }`
options	IScraperOptions	(optional) scraper options, see the IScraperOptions interface	undefined
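
The fallback described for the selector parameter can be pictured with a small sketch. The types below are simplified local stand-ins, and resolveSelector is a hypothetical helper, not a function of the library; only the default value itself is taken from the table above.

```typescript
// Simplified local stand-ins for the library's contracts, so the sketch compiles on its own.
type FormatSketch = { attr: string } | "html" | "unique";
interface IdentifierSketch {
  selector: string;
  format?: FormatSketch | FormatSketch[];
}
type SelectorSketch = Record<string, string | IdentifierSketch>;

// Default value copied from the parameter table.
const defaultSelector: SelectorSketch = {
  hrefs: { selector: "a", format: { attr: "href" } },
};

// Hypothetical helper: falls back to the default when no selector is passed.
function resolveSelector(selector?: SelectorSketch): SelectorSketch {
  return selector ?? defaultSelector;
}

console.log(Object.keys(resolveSelector()));                 // [ 'hrefs' ]
console.log(Object.keys(resolveSelector({ titles: "h1" }))); // [ 'titles' ]
```

Under this reading, calling scraper.process(url) without a selector would yield a result with a single hrefs property.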

Contracts

IScraperSelector interface

A selector property can be:

- a string that is
  - an HTML tag name
  - or a CSS class (prefixed with .)
  - or an XPath
- an IScraperSelectorIdentifier

Example 1

{
  links: "a", // HTML tag
}

Example 2

{
  links: ".link", // CSS class
}

Example 3

{
  links: "/html/body/a", // XPath
}

Example 4

{
  links: {
    selector: "a",
  }, // IScraperSelectorIdentifier equivalent to Example 1
}
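
The four examples above suggest two simple rules: a bare string is shorthand for an identifier holding that string, and the first character of the string hints at its kind. The sketch below illustrates both. The names toIdentifier and classifySelector are hypothetical and not exported by the library, and the first-character test is an assumption drawn from the examples, not the library's actual detection logic.

```typescript
// Hypothetical helpers for illustration; neither is part of @cherastain/scraper.
interface IdentifierLike {
  selector: string;
}

type SelectorKind = "tag" | "class" | "xpath";

// A bare string is treated as shorthand for { selector: <string> }.
function toIdentifier(value: string | IdentifierLike): IdentifierLike {
  return typeof value === "string" ? { selector: value } : value;
}

// Assumption: XPath starts with "/", a CSS class with ".", anything else is a tag name.
function classifySelector(selector: string): SelectorKind {
  if (selector.startsWith("/")) return "xpath";
  if (selector.startsWith(".")) return "class";
  return "tag";
}

console.log(toIdentifier("a"));                // { selector: 'a' }
console.log(classifySelector("a"));            // tag
console.log(classifySelector(".link"));        // class
console.log(classifySelector("/html/body/a")); // xpath
```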

IScraperSelectorIdentifier interface

Property	Type	Description
selector	string	an HTML tag name, a CSS class (prefixed with .), or an XPath
format	IScraperSelector or ScraperValueFormater[]	(optional)

IScraperOptions interface

Property	Type	Description
isConsoleEnabled	boolean	(optional) enable console output from page evaluation
isRobotIgnored	boolean	(optional) ignore robots.txt on the scraped domain
isVerboseEnabled	boolean	(optional) enable messages from the scraper
preProcess	(((page: Page) => Promise) or "scrollBottom")[]	(optional) functions, or the built-in "scrollBottom" step, called before scraping occurs
userAgent	string	(optional) set the user agent as seen by the scraped site
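
To make the preProcess contract more concrete, here is a minimal sketch of how such a list of steps could be run. It assumes the steps execute in array order and that "scrollBottom" scrolls to the end of the document; both are assumptions, and runPreProcess and PageLike are hypothetical names used so the sketch runs without a browser. The only Puppeteer behaviour relied on is that page.evaluate accepts a script string.

```typescript
// Minimal stand-in for the one Puppeteer Page method this sketch needs.
interface PageLike {
  evaluate(script: string): Promise<unknown>;
}

type PreProcessStep = ((page: PageLike) => Promise<void>) | "scrollBottom";

// Hypothetical runner: executes each step in array order before scraping starts.
async function runPreProcess(page: PageLike, steps: PreProcessStep[]): Promise<void> {
  for (const step of steps) {
    if (step === "scrollBottom") {
      // Assumed behaviour of the built-in step: scroll to the end of the document.
      await page.evaluate("window.scrollTo(0, document.body.scrollHeight)");
    } else {
      await step(page);
    }
  }
}

// Fake page that records what it was asked to evaluate, so no browser is needed.
const evaluated: string[] = [];
const fakePage: PageLike = {
  evaluate: async (script) => {
    evaluated.push(script);
    return undefined;
  },
};

const done = runPreProcess(fakePage, [
  "scrollBottom",
  async (page) => {
    await page.evaluate("document.title = 'ready'");
  },
]);

done.then(() => console.log(evaluated.length)); // 2
```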

ScraperValueFormater type

By default, result values are the DOM element's innerText, but they can be formatted using:

Format	Description
{ attr: string }	value will be the given attribute of the DOM container
"html"	value will be the innerHTML of the DOM container
"unique"	value will be unique (instead of an array)
(value: any) => string	the final value is produced during post-processing by applying the given function to the value set for the element by the other formatters
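
The formatter rules in the table can be modelled on plain objects. The sketch below is one plausible reading, not the library's implementation: it assumes that functions are applied to each extracted value and that "unique" keeps the first match. FakeElement and applyFormat are hypothetical names.

```typescript
// Local stand-ins, declared so the sketch runs without the library or a browser.
type ValueFormatter = { attr: string } | "html" | "unique" | ((value: any) => string);

interface FakeElement {
  innerText: string;
  innerHTML: string;
  attributes: Record<string, string>;
}

// Hypothetical model of the documented formatting rules.
function applyFormat(
  elements: FakeElement[],
  format: ValueFormatter[] = []
): string | string[] | undefined {
  // 1. Extraction: innerText by default, innerHTML for "html", an attribute for { attr }.
  let extract = (el: FakeElement): string => el.innerText;
  for (const f of format) {
    if (f === "html") {
      extract = (el) => el.innerHTML;
    } else if (typeof f === "object") {
      const name = f.attr;
      extract = (el) => el.attributes[name] ?? "";
    }
  }
  // 2. Post-processing: functions run, in order, on each extracted value (assumption).
  const fns = format.filter(
    (f): f is (value: any) => string => typeof f === "function"
  );
  const values = elements.map(extract).map((v) => fns.reduce((acc, fn) => fn(acc), v));
  // 3. "unique" returns a single value (here the first match) instead of an array.
  return format.includes("unique") ? values[0] : values;
}

const elements: FakeElement[] = [
  { innerText: "Home", innerHTML: "<b>Home</b>", attributes: { href: "foo" } },
  { innerText: "Docs", innerHTML: "Docs", attributes: { href: "bar" } },
];

console.log(applyFormat(elements));                   // [ 'Home', 'Docs' ]
console.log(applyFormat(elements, [{ attr: "href" }])); // [ 'foo', 'bar' ]
console.log(applyFormat(elements, [{ attr: "href" }, "unique", (x) => `-${x}-`])); // -foo-
```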

Example

The following example uses the preProcess option to:

- scroll to the bottom of the page
- change every href to "foo"

It then uses a selector to get a unique link href formatted as -${x}-.

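// Page comes from puppeteer; the @cherastain/scraper type exports are assumed.
import { Page } from "puppeteer";
import { Scraper, IScraperSelector, IScraperOptions } from "@cherastain/scraper";

const s = new Scraper();
const url = "https://www.npmjs.com/";
const selectors: IScraperSelector = {
  firstLinkHref: {
    selector: "a",
    format: [{ attr: "href" }, "unique", (x) => `-${x}-`],
  },
};
const options: IScraperOptions = {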
  preProcess: [
    "scrollBottom",
    async (page: Page) => {
      await page.evaluate(() => {
        const links = [...document.getElementsByTagName("a")];
        links.forEach((link) => {
          link.href = "foo";
        });
      });
    },
  ],
};
const result = await s.process(url, selectors, options);

// expected result : { firstLinkHref: "-foo-"}
