@open-automaton/automaton v0.1.0
A web scraping/RPA solution built for ease of use, maintenance and (soon™) deployment. It uses an XML-based DSL which defines both the scraping process and the structure of the returned data. It compares favorably to UiPath, Blue Prism ALM, Kapow (now Kofax RPA) and Apify. These solutions make the work of building and maintaining scrapers much easier than directly using a primary scraping solution (like playwright, puppeteer, jsdom, cheerio, selenium, windmill, beautifulsoup or others).
Usage
Here we're going to do a simple scrape of unprotected data on craigslist (you should use their available RSS feed instead, but it serves as an excellent example of how to harvest results and works in all the engines):
<go url="https://sfbay.craigslist.org/search/apa">
<set xpath="//li[@class='result-row']" variable="matches">
<set
xpath="//time[@class='result-date']/text()"
variable="time"
></set>
<set
xpath="//span[@class='result-price']/text()"
variable="price"
></set>
<set
xpath="//span[@class='housing']/text()"
variable="housing"
></set>
<set
xpath="string(//img/@src)"
variable="link"
></set>
</set>
<emit variables="matches"></emit>
</go>
Automaton definitions can be used in whatever context they are needed: from the command line, your own code or from a GUI (Soon™).
const Automaton = require('@open-automaton/automaton');
- Cheerio
const MiningEngine = require('@open-automaton/cheerio-mining-engine');
let myEngine = new MiningEngine();
- Puppeteer
const MiningEngine = require('@open-automaton/puppeteer-mining-engine');
let myEngine = new MiningEngine();
- Playwright: Chromium
const MiningEngine = require('@open-automaton/playwright-mining-engine');
let myEngine = new MiningEngine({type:'chromium'});
- Playwright: Firefox
const MiningEngine = require('@open-automaton/playwright-mining-engine');
let myEngine = new MiningEngine({type:'firefox'});
- Playwright: Webkit
const MiningEngine = require('@open-automaton/playwright-mining-engine');
let myEngine = new MiningEngine({type:'webkit'});
- JSDom
const MiningEngine = require('@open-automaton/jsdom-mining-engine');
let myEngine = new MiningEngine();
let results = await Automaton.scrape(
'definition.xml',
myEngine
);
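Putting the pieces together, a minimal end-to-end script might look like this (a sketch, assuming the craigslist definition above is saved as definition.xml and the Playwright engine from the list above is installed):

const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/playwright-mining-engine');

(async ()=>{
    // any engine from the list above works here; playwright's chromium is assumed
    let myEngine = new MiningEngine({type:'chromium'});
    // run the XML definition and collect the emitted variables
    let results = await Automaton.scrape('definition.xml', myEngine);
    console.log(results);
})();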
That's all it takes; if you need a different usage pattern, that is supported as well. For example, to work from the command line, install the CLI:
npm install -g automaton-cli
auto --help
Scraper Actions
The go action progresses from page to page, either by loading a url, submitting a form or clicking a UI element; it requires either a url or a form. The type attribute accepts json, application/json or form. Some engines that use the browser will only submit using the form configuration on the page and ignore the method and type options.
<go
url="https://domain.com/path/"
form="form-name"
method="post"
type="application/json"
></go>
The set action either uses a variable to fill a targeted input on a form, or sets a variable from the page using an xpath or regex. Lists are extracted by nesting set elements inside another set.
<set
variable="variable-name"
xpath="//xpath/expression"
regex="[regex]+.(expression)"
form="form-name"
target="input-element-name"
></set>
The emit action emits a value to the returned results and optionally posts that value to a remote url.
<emit
variables="some,variables"
remote="https://domain.com/path/"
></emit>
Maintaining Scrapers
Here's a basic process for scraping data behind a simple form.
First you'll want to understand xpath (and probably DOM, regex and css selectors) before we proceed, as most of the selectors in a good definition are xpath expressions written as generally as possible.
Once you're done with that, the auto command (which you get by installing the CLI) has a few operations we'll be using. Start by fetching the page you're working against:
auto fetch https://domain.com/path/ > page.html
The first thing you might do against the HTML you've captured is pull all the forms out of the page, like this:
auto xpath "//form" page.html
Assuming you've identified the form name you are targeting as my-form-name, you then want to get all the inputs out of it with something like:
auto xpath-form-inputs "//form[@name='my-form-name']" page.html
Then you need to write selectors for the inputs that need to be set (all of them in the case of cheerio; otherwise the browser abstraction usually handles those that are prefilled):
<set
form="<form-selector>"
target="<input-name>"
variable="<incoming-value-name>"
></set>
<go form="<form-selector>">
<!-- extraction logic to go here -->
</go>
Here you'll need to manually use your browser to go to the submitted page and save the HTML: open the inspector, copy the HTML from the root element, then paste it into a file.
Now we need to look for rows with something like:
auto xpath "//ul|//ol|//tbody" page.html
Once you settle on a selector for the correct element add a selector in the definition:
<set xpath="<xpath-selector>" variable="matches">
<!--more selected fields here -->
</set>
Last, we need to look for individual fields using something like:
auto xpath "//li|//tr" page_fragment.html
Once you settle on a selector for the correct element add a selector in the definition:
<set xpath="<xpath-selector>" variable="matches">
<set
xpath="<xpath-selector>"
variable="<field-name>"
></set>
<!--more selected fields here -->
</set>
To narrow the output, emit only the variables you want; otherwise it will dump everything in the environment.
auto scrape my-definition.auto.xml --data '{"JSON":"data"}'
# TODO on the CLI, but already working in the API call options
The most frustrating thing about scrapers is that, because they are tied to the structural representation of the presentation, which is designed to change, they will inevitably break. While this is frustrating, using the provided tools on fresh fetches of the pages in question will quickly highlight what's failing. Usually:
- The url has changed, requiring an update to the definition,
- The page structure has changed, requiring one or more selectors to be rewritten,
- The page has changed its delivery architecture, requiring you to use a more computationally expensive engine (cheerio < jsdom < puppeteer/playwright).
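Since every engine exposes the same constructor-plus-scrape interface, upgrading to a heavier engine is usually just a matter of swapping the require and the constructor options. A minimal sketch, assuming the definition itself needs no changes:

const Automaton = require('@open-automaton/automaton');

// before: the cheap, static-HTML engine
// const MiningEngine = require('@open-automaton/cheerio-mining-engine');
// let myEngine = new MiningEngine();

// after: a full browser engine for pages that render client-side
const MiningEngine = require('@open-automaton/playwright-mining-engine');
let myEngine = new MiningEngine({type:'chromium'});

(async ()=>{
    // the definition itself stays the same
    let results = await Automaton.scrape('my-definition.auto.xml', myEngine);
    console.log(results);
})();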
Examples of building scrapers:
Deploying a Scraper
Publishing a Definition (Soon™)
First, create a directory whose name describes the site we're fetching and the work we're doing, and ends with .auto; let's call this one some-site-register.auto. Once created, let's go into that directory.
Automaton definitions are published as usable nodejs npm modules, though making and maintaining them does not require any javascript. You'll need your own npm credentials to publish.
Once in the directory, let's run:
auto init ../some/path/some-site-register.auto
If a definition is not provided, a blank one will be initialized.
You'll need to import the engine you want to use by default:
# we are choosing to default to JSDOM
npm install @open-automaton/jsdom-mining-engine
then add an entry to package.json for the default engine:
{
"defaultAutomatonEngine" : "@open-automaton/jsdom-mining-engine"
}
Publishing is the standard:
npm publish
Before publishing, please consider updating the README to describe your incoming data requirements.
Once it's all set up, you have a bunch of features out of the box.
Testing
npm run test
Scraping
You can run your definition with:
npm run scrape '{"JSON":"data"}'
You can reference the definition directly (in parent projects) with:
let xmlPath = require('some-site-register.auto').xml;
which is short for:
path.join(
    path.dirname(require.resolve('some-site-register.auto')),
    'src',
    'some-site-register.auto.xml'
)
// ./node_modules/some-site-register.auto/src/some-site-register.auto.xml
The top level Automaton.scrape() function knows how to transform some-site-register.auto into that, so you can just use the shorthand there.
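In other words (a sketch, assuming the published module and the JSDom engine are installed in the parent project), these two calls point at the same definition:

const Automaton = require('@open-automaton/automaton');
const MiningEngine = require('@open-automaton/jsdom-mining-engine');

(async ()=>{
    let myEngine = new MiningEngine();
    // long form: resolve the packaged xml path yourself
    let fromPath = await Automaton.scrape(require('some-site-register.auto').xml, myEngine);
    // shorthand: let Automaton.scrape() transform the module name into that path
    let fromName = await Automaton.scrape('some-site-register.auto', myEngine);
})();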
You can include your scraper (once published) with:
let MyScraper = require('some-site-register.auto');
MyScraper.scrape(automatonEngine);
// or MyScraper.scrape(); to use the default engine
About Automaton
View the development roadmap.
Read a little about where this came from.
Testing
You can run the mocha test suite with:
npm run test
Enjoy,
-Abbey Hawk Sparrow