1.0.0-alpha.23 • Published 3 years ago

@skypilot/scraper v1.0.0-alpha.23

License
MIT

@skypilot/scraper

Node-based scriptable web scraper

How to use

  1. Create a database adapter:
const dbFilePath = 'tmp/demo.json';
const database = new LowDb(dbFilePath);
  2. Create a scraper that uses the database:
import { PlaywrightScraper } from './src/PlaywrightScraper';
const scraper = new PlaywrightScraper({ database });
  3. Use ScriptBuilder to build a script:
import { ScriptBuilder } from './src/ScriptBuilder';
const builder = new ScriptBuilder()
  .goTo('https://www.iana.org/domains/reserved') // start at a page
  .runOnAll({ // Runs the nested `commands` on each element that matches `query`
    query: 'table#arpa-table > tbody > tr > td > span.domain.label',
    commands: new ScriptBuilder()
      .follow('a') // follow the href in the first `a` tag
      .query({ // gather this data for each iteration of the elements matching the `runOnAll` query
        title: 'head > title',
        sponsor: '//h2[contains(text(), "Sponsoring Organisation")]/following-sibling::b',
        adminContact: '//h2[contains(text(), "Administrative Contact")]/following-sibling::b',
        techContact: '//h2[contains(text(), "Technical Contact")]/following-sibling::b',
      })
      .write() // writes to the database
  });
  4. Pass the script into the scraper's run method:
const result = scraper.run(builder);

Query

There are two ways to write a query:

1. A Query or ShorthandQuery object

A Query object is the standard way to write a selector:

interface Query {
  selector: string; // a CSS or XPath selector
  attributeName?: string; // if specified, select this attribute's value; otherwise, select the element's text content
  scope?: 'one' | 'all'; // default = 'one'; when used with `runOnAll`, `scope: 'all'` is automatically set
  limit?: Integer; // limits the selection to `limit` elements
  nthOfType?: Integer; // select the `nth` element matching the selector
}

A ShorthandQuery is the same as a Query object, but uses shorter names for some of the keys:

interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}
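The mapping between the two forms can be sketched as a small conversion function. This is an illustration only: `expandShorthand` is a hypothetical helper, not a documented part of the package, and `Integer` is assumed here to be an alias for `number`.

```typescript
// Assumption: `Integer` is an alias for `number`; the package may define it differently.
type Integer = number;

interface Query {
  selector: string;
  attributeName?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nthOfType?: Integer;
}

interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}

// Hypothetical helper: maps each shorthand key to its long-form counterpart
// (sel -> selector, attr -> attributeName, nth -> nthOfType).
function expandShorthand(shorthand: ShorthandQuery): Query {
  const query: Query = { selector: shorthand.sel };
  if (shorthand.attr !== undefined) query.attributeName = shorthand.attr;
  if (shorthand.scope !== undefined) query.scope = shorthand.scope;
  if (shorthand.limit !== undefined) query.limit = shorthand.limit;
  if (shorthand.nth !== undefined) query.nthOfType = shorthand.nth;
  return query;
}

const expanded = expandShorthand({ sel: 'a.external', attr: 'href', nth: 2 });
console.log(expanded);
```

Keys that are absent from the shorthand object are left out of the result, so the defaults described for Query still apply.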

The selector may be a CSS or an XPath selector. Support for text selectors will be added soon.

A query matches the first element matching the selector, with two exceptions:

  • When used with runOnAll, or when scope is 'all', the query selects all matching elements, up to the limit (if any)
  • When nthOfType is set, the query selects the nth matching element
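The selection rules above can be modelled on a plain array of matches. The sketch below is not the package's implementation: `selectMatches` and `SelectionOptions` are hypothetical names, and it assumes that `nthOfType` is 1-based and takes precedence over `scope`, neither of which the documentation states.

```typescript
interface SelectionOptions {
  scope?: 'one' | 'all';
  limit?: number;
  nthOfType?: number;
}

// Hypothetical helper: given every element matching a selector (in document order),
// return the subset that the rules above would select.
// Assumptions: `nthOfType` counts from 1 and wins over `scope`.
function selectMatches<T>(
  matches: T[],
  options: SelectionOptions = {},
  insideRunOnAll = false,
): T[] {
  const { scope = 'one', limit, nthOfType } = options;

  // Exception 2: pick only the nth matching element.
  if (nthOfType !== undefined) {
    const match = matches[nthOfType - 1];
    return match === undefined ? [] : [match];
  }

  // Exception 1: all matches, capped at `limit` when one is given.
  if (insideRunOnAll || scope === 'all') {
    return limit === undefined ? matches.slice() : matches.slice(0, limit);
  }

  // Default: the first matching element only.
  return matches.slice(0, 1);
}

const rows = ['row1', 'row2', 'row3', 'row4'];
console.log(selectMatches(rows)); // first match only
console.log(selectMatches(rows, { scope: 'all', limit: 3 })); // first three matches
console.log(selectMatches(rows, { nthOfType: 2 })); // second match only
console.log(selectMatches(rows, {}, true)); // every match, as inside runOnAll
```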

2. A string query

When a string value is used as the query, that value is treated as the selector param.

E.g., if the argument is 'h2', it is understood to mean { selector: 'h2' }.
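As a minimal sketch, the string form can be thought of as being normalized into a Query object before use. `normalizeQuery` is a hypothetical name chosen for this illustration, and the Query interface is trimmed to the two fields needed here; the package may handle this differently internally.

```typescript
// Trimmed copy of the Query interface, for illustration only.
interface Query {
  selector: string;
  attributeName?: string;
}

// Hypothetical helper: a bare string becomes `{ selector: <string> }`;
// a Query object is passed through unchanged.
function normalizeQuery(query: string | Query): Query {
  return typeof query === 'string' ? { selector: query } : query;
}

console.log(normalizeQuery('h2'));
console.log(normalizeQuery({ selector: 'a', attributeName: 'href' }));
```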
