1.1.2 • Published 1 year ago

@cherastain/scraper v1.1.2

License
MIT

@cherastain/scraper

An easy-to-use scraper that uses Puppeteer under the hood to evaluate page content and to support further page processing.

Usage

// Assumes IScraperSelector is exported alongside Scraper.
import { Scraper, IScraperSelector } from "@cherastain/scraper";

const scraper = new Scraper();
const url = "https://www.npmjs.com/";
const selector: IScraperSelector = {
  links: { selector: "a", format: { attr: "href" } },
};
const result = await scraper.process(url, selector);

// expected result : { links: [...] }

Selector basics

Each property defined on the selector produces a property with the same name in the result. For a result such as the following:

{
  title:"Title of page",
  subTitles:[...],
  links:[...]
}

the selector should be defined as:

const selector: IScraperSelector = {
  title: { selector: "h1", format: ["unique"] },
  subTitles: "h2",
  links: { selector: "a", format: { attr: "href" } },
};

Remark: the default format is the DOM element's innerText.

Documentation

Scraper class

Constructor

Scraper(options)

process method

scraper.process(url, selector, options);
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| url | string | URL to scrape | |
| selector | IScraperSelector | (optional) selector model to use for the scrape result | As defined in librarySettings: { hrefs: { selector: "a", format: { attr: "href" } } } |
| options | IScraperOptions | (optional) | undefined |
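
The default above suggests that calling process with only a URL yields a result with an hrefs property. A minimal sketch of that fallback, assuming simplified local copies of the types and a hypothetical resolveSelector helper (neither is the library's actual code):

```typescript
// Local stand-ins for the library's contracts (assumed, simplified shapes).
type Formatter = { attr: string } | "html" | "unique" | ((value: any) => string);
interface SelectorIdentifier {
  selector: string;
  format?: Formatter | Formatter[];
}
type Selector = { [property: string]: string | SelectorIdentifier };

// Default selector as documented for the process method.
const DEFAULT_SELECTOR: Selector = {
  hrefs: { selector: "a", format: { attr: "href" } },
};

// Hypothetical helper: use the caller's selector when given, else the default.
function resolveSelector(selector?: Selector): Selector {
  return selector ?? DEFAULT_SELECTOR;
}

console.log(Object.keys(resolveSelector())); // no selector: keys are ["hrefs"]
console.log(Object.keys(resolveSelector({ title: "h1" }))); // custom: keys are ["title"]
```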

Contracts

IScraperSelector interface

A selector property can be:

  • a string that is
    • an HTML tag name
    • or a CSS class (prefixed with .)
    • or an XPath
  • an IScraperSelectorIdentifier

Example 1

{
  links: "a", // HTML tag
}

Example 2

{
  links: ".link", // CSS class
}

Example 3

{
  links: "/html/body/a", // XPath
}

Example 4

{
  links: {
    selector: "a",
  }, // IScraperSelectorIdentifier equivalent to Example 1
}
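
The four examples can be read as one rule: a bare string is shorthand for an identifier with that string as its selector, and the string's first character hints at its kind. A minimal sketch under those assumptions; classify and normalize are hypothetical helpers, and the first-character check is only a heuristic, not the library's actual detection logic:

```typescript
// Assumed, simplified shape of IScraperSelectorIdentifier (format omitted).
interface SelectorIdentifier {
  selector: string;
}

type SelectorKind = "tag" | "class" | "xpath";

// Hypothetical classifier following the three string forms listed above.
function classify(selector: string): SelectorKind {
  if (selector.charAt(0) === "/") return "xpath"; // e.g. "/html/body/a"
  if (selector.charAt(0) === ".") return "class"; // e.g. ".link"
  return "tag"; // e.g. "a"
}

// Hypothetical normalizer: a bare string becomes { selector: <string> }.
function normalize(value: string | SelectorIdentifier): SelectorIdentifier {
  return typeof value === "string" ? { selector: value } : value;
}

console.log(classify("a"), classify(".link"), classify("/html/body/a"));
// prints: tag class xpath
console.log(JSON.stringify(normalize("a")) === JSON.stringify(normalize({ selector: "a" })));
// prints: true (Example 1 and Example 4 normalize to the same identifier)
```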

IScraperSelectorIdentifier interface

| Property | Type | Description |
| --- | --- | --- |
| selector | string | can be an HTML tag name, a CSS class (prefixed with .) or an XPath |
| format | IScraperSelector or ScraperValueFormater[] | (optional) |

IScraperOptions interface

| Property | Type | Description |
| --- | --- | --- |
| isConsoleEnabled | boolean | (optional) enable console output from page evaluation |
| isRobotIgnored | boolean | (optional) ignore robots.txt on the scraped domain |
| isVerboseEnabled | boolean | (optional) enable messages from the scraper |
| preProcess | (((page: Page) => Promise) or "scrollBottom")[] | (optional) functions called before scraping occurs |
| userAgent | string | (optional) set the user-agent as seen by the scraped site |
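
To make the preProcess option concrete, here is a minimal sketch of how such steps could be run before scraping. It assumes the steps run sequentially in array order, and it uses a hypothetical PageLike stand-in instead of Puppeteer's Page; runPreProcess is illustrative and not the library's implementation:

```typescript
// Minimal stand-in for Puppeteer's Page (assumption: just enough for the sketch).
interface PageLike {
  log: string[];
}

type PreProcessStep = ((page: PageLike) => Promise<void>) | "scrollBottom";

// Assumed subset of IScraperOptions.
interface Options {
  isVerboseEnabled?: boolean;
  userAgent?: string;
  preProcess?: PreProcessStep[];
}

// Hypothetical runner: execute each step in order before scraping starts.
async function runPreProcess(page: PageLike, options: Options = {}): Promise<void> {
  for (const step of options.preProcess ?? []) {
    if (step === "scrollBottom") {
      page.log.push("scrolled to bottom"); // a real page would scroll here
    } else {
      await step(page); // custom function receives the page
    }
  }
}

const page: PageLike = { log: [] };
const done = runPreProcess(page, {
  preProcess: [
    "scrollBottom",
    async (p) => {
      p.log.push("custom step");
    },
  ],
});
done.then(() => console.log(page.log.join(" -> ")));
// prints: scrolled to bottom -> custom step
```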

ScraperValueFormater type

By default, result values are the DOM element's innerText, but they can be formatted using:

| Format | Description |
| --- | --- |
| { attr: string } | value will be the given attribute of the DOM container |
| "html" | value will be the innerHTML of the DOM container |
| "unique" | value will be unique (a single value instead of an array) |
| ((value: any) => string) | the final value is produced during post-processing by applying the given function to the value set for the element by the other formatters |
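
The way these formats combine can be sketched with plain objects in place of DOM elements. This is an illustration only: FakeElement and applyFormat are hypothetical, and treating "unique" as "take the first value" is an assumption rather than documented behavior:

```typescript
// Minimal stand-in for a DOM element (assumption: only what the formats need).
interface FakeElement {
  innerText: string;
  innerHTML: string;
  attributes: Record<string, string>;
}

type Formatter = { attr: string } | "html" | "unique" | ((value: any) => string);

// Hypothetical pipeline: extract a raw value per element, then post-process.
function applyFormat(elements: FakeElement[], format: Formatter[] = []): string | string[] | undefined {
  const attrs = format.filter((f): f is { attr: string } => typeof f === "object");
  const fns = format.filter((f): f is (value: any) => string => typeof f === "function");
  const asHtml = format.some((f) => f === "html");
  const unique = format.some((f) => f === "unique");
  const attr = attrs[0];

  // Step 1: raw value (attribute, innerHTML, or the default innerText).
  let values = elements.map((el) =>
    attr ? el.attributes[attr.attr] ?? "" : asHtml ? el.innerHTML : el.innerText
  );
  // Step 2: custom functions run on the value set by the other formatters.
  for (const fn of fns) values = values.map((v) => fn(v));
  // Step 3: "unique" yields a single value instead of an array (first one, by assumption).
  return unique ? values[0] : values;
}

const links: FakeElement[] = [
  { innerText: "Home", innerHTML: "<b>Home</b>", attributes: { href: "/home" } },
  { innerText: "Docs", innerHTML: "Docs", attributes: { href: "/docs" } },
];
console.log(applyFormat(links)); // default innerText: Home, Docs
console.log(applyFormat(links, ["html"])); // innerHTML: <b>Home</b>, Docs
console.log(applyFormat(links, [{ attr: "href" }, "unique", (x) => `-${x}-`])); // -/home-
```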

Example

The following example uses the preProcess option to:

  • scroll to the bottom of the page
  • change every href to "foo"

and uses a selector to get a unique link href formatted as -${x}-

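import { Page } from "puppeteer";
// Assumes the package exports these types alongside Scraper.
import { Scraper, IScraperOptions, IScraperSelector } from "@cherastain/scraper";

const s = new Scraper();
const url = "https://www.npmjs.com/";
const selectors: IScraperSelector = {
  firstLinkHref: {
    selector: "a",
    format: [{ attr: "href" }, "unique", (x) => `-${x}-`],
  },
};
// Typed explicitly so "scrollBottom" is not widened to string.
const options: IScraperOptions = {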
  preProcess: [
    "scrollBottom",
    async (page: Page) => {
      await page.evaluate(() => {
        const links = [...document.getElementsByTagName("a")];
        links.forEach((link) => {
          link.href = "foo";
        });
      });
    },
  ],
};
const result = await s.process(url, selectors, options);

// expected result : { firstLinkHref: "-foo-"}