@skypilot/scraper v1.0.0-alpha.23
@skypilot/scraper
Node-based scriptable web scraper
How to use
- Create a database adapter
const dbFilePath = 'tmp/demo.json';
const database = new LowDb(dbFilePath);
- Create a scraper that uses the database
import { PlaywrightScraper } from './src/PlaywrightScraper';
const scraper = new PlaywrightScraper({ database });
- Use ScriptBuilder to build a script:
import { ScriptBuilder } from './src/ScriptBuilder';
const builder = new ScriptBuilder()
  .goTo('https://www.iana.org/domains/reserved') // start at a page
  .runOnAll({ // runs the nested `commands` on each element that matches `query`
    query: 'table#arpa-table > tbody > tr > td > span.domain.label',
    commands: new ScriptBuilder()
      .follow('a') // follow the href in the first `a` tag
      .query({ // gather this data for each iteration of the elements matching the `runOnAll` query
        title: 'head > title',
        sponsor: '//h2[contains(text(), "Sponsoring Organisation")]/following-sibling::b',
        adminContact: '//h2[contains(text(), "Administrative Contact")]/following-sibling::b',
        techContact: '//h2[contains(text(), "Technical Contact")]/following-sibling::b',
      })
      .write() // writes to the database
  });
- Pass the script into the scraper's run method:
const result = scraper.run(builder);
Query
There are two ways to write a query:
1. A Query or ShorthandQuery object
A Query object is the standard way to write a selector:
interface Query {
  selector: string; // a CSS or XPath selector
  attributeName?: string; // if specified, select this attribute's value; otherwise, select the element's text content
  scope?: 'one' | 'all'; // default = 'one'; when used with `runOnAll`, `scope: 'all'` is automatically set
  limit?: Integer; // limits the selection to `limit` elements
  nthOfType?: Integer; // select the `nth` element matching the selector
}
A ShorthandQuery is the same as a Query object, but uses shorthand names for some of the keys:
interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}
See CSS and XPath selectors. Support for text selectors will be added soon.
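The README does not say how a CSS selector is told apart from an XPath one. Playwright itself treats a selector that starts with `//` or `..` (or carries an explicit `xpath=` prefix) as XPath. The sketch below mirrors that rule, on the assumption that this package passes selectors through to Playwright unchanged; `selectorKind` is a hypothetical helper, not part of the package.

```typescript
// Hypothetical helper (not part of @skypilot/scraper): guesses the selector kind
// using the same prefix rule Playwright applies to plain selector strings.
type SelectorKind = 'xpath' | 'css';

function selectorKind(selector: string): SelectorKind {
  const trimmed = selector.trim();
  // Assumption: selectors are handed to Playwright as-is, so its prefix rule applies.
  if (trimmed.startsWith('xpath=') || trimmed.startsWith('//') || trimmed.startsWith('..')) {
    return 'xpath';
  }
  return 'css';
}

// The two selector styles used in the "How to use" script:
const titleKind = selectorKind('head > title');
const sponsorKind = selectorKind('//h2[contains(text(), "Sponsoring Organisation")]/following-sibling::b');
```

Under this rule, `titleKind` is classified as CSS and `sponsorKind` as XPath, which matches how the two styles are mixed in the example script.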
A query matches the first element matching the selector, with two exceptions:
- When used with runOnAll or when scope: 'all', the selector selects all matching elements up to the limit (if any)
- When nthOfType is set, the selector selects the nth matching element
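The matching rules above can be sketched as a pure function over an already-matched list of elements. This is an illustration only, not the package's implementation: `applySelectionRules` and `insideRunOnAll` are hypothetical names, `Integer` is written as `number`, and two details the README does not state are assumed here: `nthOfType` is 1-based, and it takes precedence over `scope`.

```typescript
interface Query {
  selector: string;
  attributeName?: string;
  scope?: 'one' | 'all';
  limit?: number;     // typed as `Integer` in the README
  nthOfType?: number; // typed as `Integer` in the README
}

// Hypothetical helper: given every element the selector matched (in document order),
// return the subset the query would select.
function applySelectionRules<T>(matches: T[], query: Query, insideRunOnAll = false): T[] {
  if (query.nthOfType !== undefined) {
    // Assumption: 1-based index, and nthOfType wins over scope.
    const index = query.nthOfType - 1;
    return index >= 0 && index < matches.length ? [matches[index]] : [];
  }
  if (insideRunOnAll || query.scope === 'all') {
    // All matches, capped at `limit` when one is given.
    return query.limit === undefined ? matches.slice() : matches.slice(0, query.limit);
  }
  // Default: the first matching element only.
  return matches.slice(0, 1);
}

const rows = ['row1', 'row2', 'row3', 'row4'];
const first = applySelectionRules(rows, { selector: 'tr' });
const firstTwo = applySelectionRules(rows, { selector: 'tr', scope: 'all', limit: 2 });
const third = applySelectionRules(rows, { selector: 'tr', nthOfType: 3 });
```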
2. A string query
When a string value is used as the query, that value is treated as the selector param.
E.g., if the argument is 'h2', it is understood to mean { selector: 'h2' }.
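To make the equivalence between the three query forms concrete, here is a minimal sketch that normalizes a string, a ShorthandQuery, or a Query into a full Query. `normalizeQuery` and `QueryInput` are hypothetical names introduced for illustration; the package may handle this differently internally, and `Integer` is written as `number` here.

```typescript
interface Query {
  selector: string;
  attributeName?: string;
  scope?: 'one' | 'all';
  limit?: number;
  nthOfType?: number;
}

interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: number;
  nth?: number;
}

type QueryInput = string | Query | ShorthandQuery;

// Hypothetical helper: expands any accepted query form into a full Query object.
function normalizeQuery(input: QueryInput): Query {
  if (typeof input === 'string') {
    // A bare string is treated as the selector.
    return { selector: input };
  }
  if ('sel' in input) {
    // Map the shorthand keys onto their long names, skipping unset options.
    const query: Query = { selector: input.sel };
    if (input.attr !== undefined) query.attributeName = input.attr;
    if (input.scope !== undefined) query.scope = input.scope;
    if (input.limit !== undefined) query.limit = input.limit;
    if (input.nth !== undefined) query.nthOfType = input.nth;
    return query;
  }
  return input; // already a full Query
}

const fromString = normalizeQuery('h2');
const fromShorthand = normalizeQuery({ sel: 'a', attr: 'href', nth: 2 });
```

Here `fromString` carries only `selector: 'h2'`, and `fromShorthand` has `selector`, `attributeName`, and `nthOfType` set from `sel`, `attr`, and `nth`.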