1.1.2 • Published 1 year ago
@cherastain/scraper v1.1.2
An easy-to-use scraper that uses Puppeteer under the hood to evaluate page content and to support further page processing.
Usage
import { Scraper } from "@cherastain/scraper";
const scraper = new Scraper();
const url = "https://www.npmjs.com/";
const selector: IScraperSelector = {
links: { selector: "a", format: { attr: "href" } },
};
const result = await scraper.process(url, selector);
// expected result : { links: [...] }
Selector basics
Each property defined on the selector produces a property with the same name in the result. For a result such as:
{
title:"Title of page",
subTitles:[...],
links:[...]
}
the selector should be defined as:
const selector: IScraperSelector = {
title: { selector: "h1", format: ["unique"] },
subTitles: "h2",
links: { selector: "a", format: { attr: "href" } },
};
Remark: the default format is the DOM element's innerText.
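To illustrate the naming rule above, here is a minimal sketch that builds an empty result skeleton from a selector. `SelectorModel` and `resultSkeleton` are hypothetical names used for illustration only; they are not part of @cherastain/scraper, and the real library fills these properties with scraped values.

```typescript
// Hypothetical, simplified stand-in for IScraperSelector (assumption, not the library's type).
type SelectorModel = Record<string, string | { selector: string; format?: unknown }>;

// Builds an object with exactly the same property names as the selector.
// Each value is a null placeholder; the real scraper would put scraped data here.
function resultSkeleton(selector: SelectorModel): Record<string, null> {
  const out: Record<string, null> = {};
  for (const key of Object.keys(selector)) {
    out[key] = null;
  }
  return out;
}

const skeleton = resultSkeleton({
  title: { selector: "h1", format: ["unique"] },
  subTitles: "h2",
  links: { selector: "a", format: { attr: "href" } },
});
console.log(Object.keys(skeleton)); // [ 'title', 'subTitles', 'links' ]
```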
Documentation
Scraper class
Constructor
Scraper(options)
process method
scraper.process(url, selector, options);
Parameter | Type | Description | Default |
---|---|---|---|
url | string | URL to scrape | |
selector | IScraperSelector | (optional) selector model used to build the scrape result | As defined in librarySettings { hrefs: { selector: "a", format: { attr: "href" } } } |
options | IScraperOptions | (optional) | undefined |
Contracts
IScraperSelector interface
A selector property can be:
- a string that is
  - an HTML tag name
  - or a CSS class (prefixed with .)
  - or an XPath
- an IScraperSelectorIdentifier
Example 1
{
links: "a", // HTML tag
}
Example 2
{
links: ".link", // CSS class
}
Example 3
{
links: "/html/body/a", // XPath
}
Example 4
{
links: {
selector: "a",
}, // IScraperSelectorIdentifier, equivalent to Example 1
}
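The four examples above can be summarized in a short sketch: one helper guesses the kind of a selector string from its first character, and another expands the string shorthand into the object form. All names here (`SelectorIdentifier`, `classifySelector`, `normalize`) are hypothetical, and the first-character rule is an assumption; the library may detect selector kinds differently.

```typescript
// Hypothetical simplified shapes (assumptions; the library's real types may differ).
type SelectorIdentifier = { selector: string; format?: unknown };
type SelectorValue = string | SelectorIdentifier;
type SelectorKind = "xpath" | "cssClass" | "tag";

// Guesses the kind of a selector string from its first character.
function classifySelector(selector: string): SelectorKind {
  if (selector.startsWith("/")) return "xpath";
  if (selector.startsWith(".")) return "cssClass";
  return "tag";
}

// Expands the string shorthand into the object form, so "a" becomes { selector: "a" }.
function normalize(value: SelectorValue): SelectorIdentifier {
  return typeof value === "string" ? { selector: value } : value;
}

console.log(classifySelector("a")); // tag
console.log(classifySelector(".link")); // cssClass
console.log(classifySelector("/html/body/a")); // xpath
console.log(normalize("a")); // { selector: 'a' }
```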
IScraperSelectorIdentifier interface
Property | Type | Description |
---|---|---|
selector | string | an HTML tag name, a CSS class (prefixed with .), or an XPath |
format | IScraperSelector or ScraperValueFormater[] | (optional) |
IScraperOptions interface
Property | Type | Description |
---|---|---|
isConsoleEnabled | boolean | (optional) enable console output from page evaluation |
isRobotIgnored | boolean | (optional) ignore robots.txt on the scraped domain |
isVerboseEnabled | boolean | (optional) enable messages from the scraper |
preProcess | (((page: Page) => Promise) or "scrollBottom")[] | (optional) functions, or the built-in "scrollBottom" step, called before scraping occurs |
userAgent | string | (optional) set user-agent as seen by the scraped site |
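As a rough illustration of how a preProcess array could be consumed, the sketch below runs each step in order against a fake page. `PageLike`, `PreProcessStep`, and `runPreProcess` are hypothetical names, and sequential execution in array order is an assumption; this is not the library's implementation, and the real steps receive a Puppeteer Page.

```typescript
// Minimal stand-in for the one Puppeteer Page method this sketch uses (assumption).
interface PageLike {
  evaluate(fn: () => void): Promise<void>;
}

// Mirrors the documented preProcess element type, with PageLike replacing Page.
type PreProcessStep = ((page: PageLike) => Promise<void>) | "scrollBottom";

// Runs the steps one after another, in array order (assumed behavior).
async function runPreProcess(page: PageLike, steps: PreProcessStep[]): Promise<void> {
  for (const step of steps) {
    if (step === "scrollBottom") {
      // A real implementation would scroll inside the browser page.
      await page.evaluate(() => {});
    } else {
      await step(page);
    }
  }
}

// Fake page that only records what was asked of it.
const calls: string[] = [];
const fakePage: PageLike = {
  evaluate: async () => {
    calls.push("evaluate");
  },
};

const finished = runPreProcess(fakePage, [
  "scrollBottom",
  async () => {
    calls.push("custom step");
  },
]);
finished.then(() => console.log(calls)); // [ 'evaluate', 'custom step' ]
```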
ScraperValueFormater type
By default, each result value is the DOM element's innerText, but it can be formatted using:
Format | Description |
---|---|
{ attr: string } | value will be the given attribute of the DOM container |
"html" | value will be the innerHTML of the DOM container |
"unique" | value will be unique (instead of an array) |
(value: any) => string | the final value is produced during post-processing by applying the given function to the value set for the element by the other formatters |
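The sketch below models how these formatters might combine, using plain objects in place of DOM elements. `FakeElement`, `Formatter`, and `applyFormat` are hypothetical names, and the way the formatters are combined is an assumption based on the table above, not the library's actual code.

```typescript
// Hypothetical in-memory stand-in for a DOM element (assumption for this sketch).
interface FakeElement {
  innerText: string;
  innerHTML: string;
  attrs: Record<string, string>;
}

type Formatter = { attr: string } | "html" | "unique" | ((value: any) => string);

function applyFormat(
  elements: FakeElement[],
  format: Formatter | Formatter[] = []
): string | string[] | undefined {
  const formats = Array.isArray(format) ? format : [format];
  // Extraction defaults to innerText and is overridden by { attr } or "html".
  let read = (el: FakeElement): string => el.innerText;
  let unique = false;
  const post: ((value: any) => string)[] = [];
  for (const f of formats) {
    if (f === "html") read = (el) => el.innerHTML;
    else if (f === "unique") unique = true;
    else if (typeof f === "function") post.push(f);
    else {
      const name = f.attr;
      read = (el) => el.attrs[name] ?? "";
    }
  }
  // Functions run last, on the value produced by the other formatters.
  const values = elements.map(read).map((v) => post.reduce((acc, fn) => fn(acc), v));
  return unique ? values[0] : values;
}

const links: FakeElement[] = [
  { innerText: "Home", innerHTML: "<b>Home</b>", attrs: { href: "foo" } },
  { innerText: "Docs", innerHTML: "Docs", attrs: { href: "foo" } },
];
console.log(applyFormat(links)); // [ 'Home', 'Docs' ]
console.log(applyFormat(links, [{ attr: "href" }, "unique", (x: string) => `-${x}-`])); // -foo-
```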
Example
The following example uses the preProcess option to:
- scroll to the bottom of the page
- change every href to "foo"
and uses a selector to get a unique link href formatted as -${x}-
const s = new Scraper();
const url = "https://www.npmjs.com/";
const selectors: IScraperSelector = {
firstLinkHref: {
selector: "a",
format: [{ attr: "href" }, "unique", (x) => `-${x}-`],
},
};
const options: IScraperOptions = {
preProcess: [
"scrollBottom",
async (page: Page) => {
await page.evaluate(() => {
const links = [...document.getElementsByTagName("a")];
links.forEach((link) => {
link.href = "foo";
});
});
},
],
};
const result = await s.process(url, selectors, options);
// expected result : { firstLinkHref: "-foo-"}