@open-automaton/automaton

A web scraping/RPA solution for ease of use, maintenance and (soon™) deployment. It uses an XML-based DSL which defines both the scraping process and the structure of the returned data. It compares favorably to UiPath, Blue Prism ALM, Kapow (now Kofax RPA) and Apify. These solutions make the work of building and maintaining scrapers far easier than directly using a primary scraping library (like Playwright, Puppeteer, jsdom, Cheerio, Selenium, Windmill, BeautifulSoup or others).

Usage

Here we're going to do a simple scrape of unprotected data on craigslist (you should use their available RSS feed instead, but it serves as an excellent example of how to harvest results, and it works in all the engines):

<go url="https://sfbay.craigslist.org/search/apa">
    <set xpath="//li[@class='result-row']" variable="matches">
        <set
            xpath="//time[@class='result-date']/text()"
            variable="time"
        ></set>
        <set
            xpath="//span[@class='result-price']/text()"
            variable="price"
        ></set>
        <set
            xpath="//span[@class='housing']/text()"
            variable="housing"
        ></set>
        <set
            xpath="string(//img/@src)"
            variable="link"
        ></set>
    </set>
    <emit variables="matches"></emit>
</go>

automaton definitions can be used in whatever context they are needed: from the command line, from your own code, or from a GUI (Soon™).

const Automaton = require('@open-automaton/automaton');
  • Cheerio
    const MiningEngine = require(
        '@open-automaton/cheerio-mining-engine'
    );
    let myEngine = new MiningEngine();
  • Puppeteer
    const MiningEngine = require(
        '@open-automaton/puppeteer-mining-engine'
    );
    let myEngine = new MiningEngine();
  • Playwright: Chromium
    const MiningEngine = require(
        '@open-automaton/playwright-mining-engine'
    );
    let myEngine = new MiningEngine({type:'chromium'});
  • Playwright: Firefox
    const MiningEngine = require(
        '@open-automaton/playwright-mining-engine'
    );
    let myEngine = new MiningEngine({type:'firefox'});
  • Playwright: Webkit
    const MiningEngine = require(
        '@open-automaton/playwright-mining-engine'
    );
    let myEngine = new MiningEngine({type:'webkit'});
  • JSDom
    const MiningEngine = require(
        '@open-automaton/jsdom-mining-engine'
    );
    let myEngine = new MiningEngine();
let results = await Automaton.scrape(
    'definition.xml',
    myEngine
);
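
For reference, here is one way those pieces might fit together end to end; a minimal sketch, assuming the JSDom engine and the craigslist definition above saved as definition.xml:

const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/jsdom-mining-engine');

(async () => {
    // the engine does the fetching/parsing; the definition drives the scrape
    let myEngine = new MiningEngine();
    let results = await Automaton.scrape('definition.xml', myEngine);
    console.log(results); // the emitted `matches` variable
})();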

That's all it takes. If you need a different usage pattern, such as the CLI, that is supported as well:

    npm install -g automaton-cli
    auto --help

TBD

Scraper Actions

go: a progression from page to page, either by loading a URL, submitting a form, or clicking a UI element; it requires either url or form.

The type attribute accepts json, application/json or form.

Engines that use a real browser will only submit using the form configuration already on the page and will ignore the method and type options.

<go
    url="https://domain.com/path/"
    form="form-name"
    method="post"
    type="application/json"
></go>

set either fills a target input on a form from a variable, or sets a variable from the page using an xpath or regex. Lists are extracted by nesting set elements inside another set:

<set
    variable="variable-name"
    xpath="//xpath/expression"
    regex="[regex]+.(expression)"
    form="form-name"
    target="input-element-name"
></set>

emit a value to the returned result and optionally post that value to a remote URL:

<emit
    variables="some,variables"
    remote="https://domain.com/path/"
></emit>

Maintaining Scrapers

Here's a basic process for scraping data that sits behind a simple form.

First, you'll want to understand XPath (and probably the DOM, regular expressions and CSS selectors) before proceeding, as most of the selectors in a good definition are XPath expressions kept as general as possible.

Once you're done with that, the auto command (installed with the CLI) has a few operations we'll be using.

auto fetch https://domain.com/path/ > page.html

The first thing you might do against the HTML you've captured is pull all the forms out of the page, like this:

auto xpath "//form" page.html

Assuming you've identified the form name you are targeting as my-form-name, you then want to get all the inputs out of it with something like:

auto xpath-form-inputs "//form[@name='my-form-name']" page.html

Then you need to write selectors for the inputs that need to be set (all of them in the case of Cheerio; otherwise the browser abstraction usually handles those that are prefilled):

<set
    form="<form-selector>"
    target="<input-name>"
    variable="<incoming-value-name>"
></set>
<go form="<form-selector>">
    <!-- extraction logic to go here -->
</go>

Here you'll need to manually use your browser to go to the submitted page and save the HTML: open the inspector, copy the HTML from the root element, then paste it into a file.

Now we need to look for rows with something like:

auto xpath "//ul|//ol|//tbody" page.html

Once you settle on a selector for the correct element, add it to the definition:

<set xpath="<xpath-selector>" variable="matches">
    <!--more selected fields here -->
</set>

Last, we need to look for individual fields using something like:

auto xpath "//li|//tr" page_fragment.html

Once you settle on selectors for the correct fields, add them inside the row set in the definition:

<set xpath="<xpath-selector>" variable="matches">
    <set
        xpath="<xpath-selector>"
        variable="<field-name>"
    ></set>
    <!--more selected fields here -->
</set>

To target the output, emit the variables you want; otherwise everything in the environment will be dumped.

auto scrape my-definition.auto.xml --data '{"JSON":"data"}'
# --data is still TODO on the CLI, but it already works in the API call options
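
Until --data lands in the CLI, input data can be supplied through the API call instead. The sketch below is hypothetical: the exact name and position of the data argument is an assumption, not a documented signature.

// Hypothetical sketch: passing incoming data via the API rather than the CLI.
// The third argument's shape is an assumption; consult the automaton API docs.
const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/jsdom-mining-engine');

(async () => {
    let results = await Automaton.scrape(
        'my-definition.auto.xml',
        new MiningEngine(),
        { JSON: 'data' } // values consumed by <set> targets in the definition
    );
    console.log(results);
})();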

The most frustrating thing about scrapers is that, because they are tied to the structural representation of the presentation, which is designed to change, they will inevitably break. While this is frustrating, using the provided tools on fresh fetches of the pages in question will quickly highlight what's failing. Usually:

  1. The URL has changed, requiring an update to the definition,
  2. The page structure has changed, requiring one or more selectors to be rewritten,
  3. The page has changed its delivery architecture, requiring you to use a more expensive engine (computationally: cheerio < jsdom < puppeteer, playwright).

Examples of building scrapers:

Deploying a Scraper

TBD

Publishing a Definition (Soon™)

First, create a directory whose name describes the site we're fetching and the work we're doing, and ends with .auto; let's call this one some-site-register.auto. Once created, go into that directory.

Automaton definitions are published as usable Node.js npm modules, though making and maintaining them does not require any JavaScript. You'll need your own npm credentials to publish.

Once in the directory, run:

auto init ../some/path/some-site-register.auto

If a definition is not provided, a blank one will be initialized.

You'll need to install the engine you want to use by default:

# we are choosing to default to JSDOM
npm install @open-automaton/jsdom-mining-engine

Then add an entry to package.json for the default engine:

{
    "defaultAutomatonEngine" : "@open-automaton/jsdom-mining-engine"
}

Publishing is the standard:

npm publish

Before publishing, please consider updating the README to describe your incoming data requirements.

Once it's all set up, you have a bunch of features out of the box.

Testing

npm run test

Scraping

You can run your definition with:

npm run scrape '{"JSON":"data"}'

You can reference the definition directly (in parent projects) with:

let xmlPath = require('some-site-register.auto').xml;

which is short for:

path.join(
    path.dirname(require.resolve('some-site-register.auto')),
    'src',
    'some-site-register.auto.xml'
)
// ./node_modules/some-site-register.auto/src/some-site-register.auto.xml

The top-level Automaton.scrape() function knows how to transform some-site-register.auto into that path, so you can just use the shorthand there.
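
In other words, once the definition is published you can hand the module name straight to the top-level call; a short sketch, assuming the JSDom engine:

const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/jsdom-mining-engine');

(async () => {
    // the module name is resolved to src/some-site-register.auto.xml
    // inside the installed package, as described above
    let results = await Automaton.scrape(
        'some-site-register.auto',
        new MiningEngine()
    );
    console.log(results);
})();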

You can include your scraper (once published) with:

let MyScraper = require('some-site-register.auto');
MyScraper.scrape(automatonEngine);
// or MyScraper.scrape(); to use the default engine

About Automaton

View the development roadmap.

Read a little about where this came from.

Testing

You can run the mocha test suite with:

    npm run test

Enjoy,

-Abbey Hawk Sparrow