import-io-cli v0.0.1
This toolchain allows import.io users and managed service providers to build out scalable extractor definitions by creating a modular Extractor Library.
To jump straight into the browser context methods available, see IContext.
- Getting started
- For Extractor Architects
- For Extractor Implementors
- Integration into CI systems
- Usage
- Commands Reference
Getting started
Concepts
An import.io Extractor Library is a git repository that contains a library of modules and extractors for one or more organizations.
There are multiple types of module:
- Action
- A browser control and logic building block
- Uses a browser context to control the browser - see IContext
- Action may be used as an interface
- A default definition may be provided
- Named parameters (e.g. domain)
- Schema
- A definition of what columns are expected to be returned
- Extraction
- A definition of what to extract on the page
- Template
- An extractor template that is inherited from when creating an org extractor
Each instance of a module maps to a file within the git repository, under the top-level src/library folder, and has a URI composed of the type and path, e.g. "template:product/details".
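As a rough illustration of the URI form, the sketch below splits a module URI into its type and path. The function name parseModuleUri is hypothetical and not part of the CLI, and the list of types is simply the four named in this section; how the CLI itself parses URIs or maps them to files is not shown here.

```javascript
// Module types named in the Concepts section (an assumption: the CLI may know more).
const MODULE_TYPES = ['action', 'schema', 'extraction', 'template'];

// Split a module URI such as "template:product/details" into type and path.
// parseModuleUri is a hypothetical helper, not a CLI function.
function parseModuleUri(uri) {
  const separator = uri.indexOf(':');
  if (separator === -1) {
    throw new Error(`Not a module URI: ${uri}`);
  }
  const type = uri.slice(0, separator);
  const path = uri.slice(separator + 1);
  if (!MODULE_TYPES.includes(type)) {
    throw new Error(`Unknown module type: ${type}`);
  }
  return { type, path };
}

console.log(parseModuleUri('template:product/details'));
// { type: 'template', path: 'product/details' }
```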
Integration with current SAAS platform
This workflow currently publishes extractors to app.import.io, but in future it will deploy to workbench.import.io.
It deploys at least two extractors - a staging extractor and a production extractor. You can also deploy a dev extractor, an extractor just for you, etc. The SAAS account will be resolved from the workbench org slug, but for now is configured in the repository.
These extractors can be run for testing purposes in the SAAS platform, but many features may not be functional.
⚠️ Part of deployment is setting policies, and as such the user that is deploying these will need to have the correct permissions.
Conversion of existing extractors
There is a feature in the backlog to be able to bring current extractors into the library, but it is not available yet.
Getting started
Install Google Chrome
Download and install Google Chrome if you don't already have it.
Install the client
Download and install the import-io client via a pkg file (macOS), installer (Windows) or tarball (Linux).
Configure
To configure, run:
> import-io config
It will write a file .import-io.apikey in your home directory.
Have a hack
You can start the browser up and get a REPL to control it by running:
> import-io browser:launch
If the CLI cannot find your Chrome instance, set the CHROME_PATH environment variable.
This will start a browser and give you a REPL interface to control the browser, as well as the dev tools for the browser.
For Extractor Architects
Setting up a new library
> mkdir my-library && cd my-library && git init && npm init && import-io init --org <workbench org slug>
⚠️ Set private: true in the package.json to avoid accidentally publishing the repository.
This will create a new git repo and add the basic scaffolding for the repository.
You need to set the SAAS user account id in the config.yaml file until the integration to read the platform user id from workbench is complete.
Setting up a new extractor template
An extractor template is what developers need in order to build out instances of extractors for orgs.
It and its dependencies contain all the information require
Create a branch
Create a branch in git for your extractor template and the new interfaces - you should use standard
Create a schema
> import-io schema:new product/details
There is no typing information in the schema: everything is extracted as text, unless it is extracted from JSON, in which case it is a JSON primitive.
Create an entry point action and its dependency interfaces
An entry point action is the first action that an extractor runs.
Actions are parametrized so that we can load in different dependencies depending on, for example, what domain the extractor is targeting.
Creating the entry point
> import-io action:new --path product/details --parameters country domain --inputs sku
Now you can edit the action and add the dependencies and control code, e.g.
module.exports = {
  parameters: [
    {
      name: "country",
      description: "ISO-2 country code"
    },
    {
      name: "domain",
      description: "Top private domain"
    }
  ],
  inputs: [
    {
      name: "sku",
      description: "Retailer unique SKU",
      type: "string"
    }
  ],
  dependencies: {
    skuToUrl: 'action:product/sku2url/${country}/${domain}',
    gotoUrl: 'action:gotourl/${country}/${domain}',
    productDetails: 'action:product/details/extract/${country}/${domain}',
  },
  path: "${country}/${domain}",
  implementation: async ({ sku }, { country, domain }, context, { skuToUrl, gotoUrl, productDetails }) => {
    const url = await skuToUrl({ sku }, { country, domain });
    await gotoUrl({ url }, { domain });
    await productDetails({ url }, { domain });
  }
}
There is also a unit test added in test/ which should be completed.
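The format of the scaffolded test is not shown here, so the following is only a plain Node.js sketch of one way to exercise the entry point: the three dependencies are replaced with stubs that record how they were called. The stub URL, the SKU value and the helper name runWithStubs are all made up for illustration.

```javascript
// Entry-point implementation, copied from the action above.
const implementation = async ({ sku }, { country, domain }, context, { skuToUrl, gotoUrl, productDetails }) => {
  const url = await skuToUrl({ sku }, { country, domain });
  await gotoUrl({ url }, { domain });
  await productDetails({ url }, { domain });
};

// Run the entry point against stub dependencies and record each call.
// runWithStubs is a hypothetical helper, not something the CLI generates.
async function runWithStubs() {
  const calls = [];
  const stubs = {
    skuToUrl: async ({ sku }) => {
      calls.push(['skuToUrl', sku]);
      return `https://example.com/p/${sku}`; // hypothetical URL
    },
    gotoUrl: async ({ url }) => { calls.push(['gotoUrl', url]); },
    productDetails: async ({ url }) => { calls.push(['productDetails', url]); },
  };
  const context = {}; // this entry point never touches the context directly
  await implementation({ sku: 'ABC123' }, { country: 'us', domain: 'example.com' }, context, stubs);
  return calls;
}

runWithStubs().then((calls) => console.log(calls));
```

Because the dependencies are passed in as the fourth argument, the control flow can be checked without launching a browser.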
The dependencies are of the URI form <module type>:<path>.
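To make the parametrized dependency URIs concrete, here is a minimal sketch of the placeholder substitution, assuming each ${name} is simply replaced by the matching parameter value. The helper resolveTemplate is hypothetical; the CLI's real resolution logic (for example, how it falls back to a default implementation) is not covered.

```javascript
// Replace each ${name} placeholder with the matching parameter value.
// resolveTemplate is a hypothetical helper, not a CLI function.
function resolveTemplate(template, parameters) {
  return template.replace(/\$\{(\w+)\}/g, (match, name) => {
    if (!(name in parameters)) {
      throw new Error(`Missing parameter: ${name}`);
    }
    return parameters[name];
  });
}

// Single quotes keep ${...} literal, so this is not a JavaScript template string.
const uri = resolveTemplate('action:product/sku2url/${country}/${domain}', {
  country: 'us',
  domain: 'amazon.com',
});
console.log(uri); // action:product/sku2url/us/amazon.com
```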
Creating the dependencies
Now we need to similarly create the dependencies:
> import-io action:new --path product/sku2url --parameters country domain --inputs sku
module.exports = {
  parameters: [
    {
      name: "country",
      description: "ISO-2 country code"
    },
    {
      name: "domain",
      description: "Top private domain"
    }
  ],
  inputs: [
    {
      name: "sku",
      description: "Retailer unique SKU",
      type: "string"
    }
  ],
  dependencies: {
  },
  path: "${country}/${domain}",
  implementation: async ({ sku }, { country, domain }, context, dependencies) => {
    throw new Error('No default implementation');
  }
}
Note this does not have a default implementation as it would not make sense.
> import-io action:new --path gotourl --parameters country domain --inputs url
module.exports = {
  parameters: [
    {
      name: "country",
      description: "ISO-2 country code"
    },
    {
      name: "domain",
      description: "Top private domain"
    }
  ],
  inputs: [
    {
      name: "url",
      description: "URL to visit",
      type: "string"
    }
  ],
  dependencies: {
  },
  path: "${country}/${domain}",
  implementation: async ({ url }, { country, domain }, context, dependencies) => {
    await context.goto(url, {timeout: 10000, waitUntil: 'load', checkBlocked: true});
  }
}
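If you want to check the gotourl action above without a browser, one option is to pass in a stand-in context. The sketch below assumes only what the action itself shows, namely that the context has a goto(url, options) method; createFakeContext and visit are hypothetical names, and the fake does not reproduce anything else from IContext.

```javascript
// gotourl implementation, copied from the action above.
const implementation = async ({ url }, { country, domain }, context, dependencies) => {
  await context.goto(url, { timeout: 10000, waitUntil: 'load', checkBlocked: true });
};

// Hypothetical stand-in for the browser context: it only records goto calls.
function createFakeContext() {
  const visited = [];
  return {
    visited,
    goto: async (url, options) => { visited.push({ url, options }); },
  };
}

// Run the action against the fake context and return what it tried to visit.
async function visit(url) {
  const context = createFakeContext();
  await implementation({ url }, { country: 'us', domain: 'example.com' }, context, {});
  return context.visited;
}

visit('https://example.com/').then((visited) => console.log(visited));
```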
> import-io action:new --path product/details/extract --parameters country domain --inputs url
module.exports = {
  description: 'Extract the product details when on a product details page already',
  parameters: [
    {
      name: "country",
      description: "ISO-2 country code"
    },
    {
      name: "domain",
      description: "Top private domain"
    }
  ],
  // The action was created with --inputs url and is called with { url } by the entry point.
  inputs: [
    {
      name: "url",
      description: "Product details page URL",
      type: "string"
    }
  ],
  dependencies: {
    productDetails: 'extraction:product/details/extract/${country}/${domain}',
  },
  path: "${country}/${domain}",
  implementation: async ({ url }, { country, domain }, context, { productDetails }) => {
    // The browser is already on the page, so the url input is not used here.
    await context.extract(productDetails);
  }
}
This clearly has a default implementation - do a single extraction - but it can be overridden for more complicated extractions.
Creating an extractor template
> import-io template:new --schema product/details --parameters domain country --entryPoint product/details --path product/details
Now you can edit the generated template:
proxy:
zone: USA
type: DATA_CENTER
policy:
numberRetries: 3
priority: MEDIUM
retryDelay: 60
backoffPolicy: EXPONENTIAL
retryWithResidentialProxyAfter: 99
honorRobots: false
schema: product/details
parameters:
- domain
- country
entryPoint: product/details
pathTemplate: "product/details/${country}/${domain}"
The pathTemplate is where the extractors generated from the template will be placed for each organization.
For Extractor Implementors
Create an extractor from the template
Creating an extractor will also bootstrap all its dependencies if they do not already exist.
> import-io extractor:new --org my_org --parameters country=us domain=amazon.com --template product/details
This creates an extractor.yaml file at the path specified in the template in the org directory:
template: product/details
parameters:
  country: us
  domain: amazon.com
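The relationship between the key=value parameters on the command line and the template's pathTemplate can be sketched as follows. This assumes plain ${name} substitution; parseParameters and extractorPath are hypothetical helpers rather than CLI internals.

```javascript
// Turn ["country=us", "domain=amazon.com"] into { country: "us", domain: "amazon.com" }.
function parseParameters(pairs) {
  const parameters = {};
  for (const pair of pairs) {
    const index = pair.indexOf('=');
    if (index <= 0) {
      throw new Error(`Expected key=value, got: ${pair}`);
    }
    parameters[pair.slice(0, index)] = pair.slice(index + 1);
  }
  return parameters;
}

// Fill the ${name} placeholders of a pathTemplate with parameter values.
function extractorPath(pathTemplate, parameters) {
  return pathTemplate.replace(/\$\{(\w+)\}/g, (match, name) => {
    if (!(name in parameters)) {
      throw new Error(`Missing parameter: ${name}`);
    }
    return parameters[name];
  });
}

const parameters = parseParameters(['country=us', 'domain=amazon.com']);
console.log(extractorPath('product/details/${country}/${domain}', parameters));
// product/details/us/amazon.com
```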
You now need to update the scaffolded files:
(1) Sku2Url implementation
module.exports = {
  implements: "product/sku2url",
  implementation: async ({ sku }, { country, domain }, context, dependencies) => {
    return `https://www.amazon.com/dp/${sku}`;
  }
}
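Since this implementation uses neither the context nor any dependencies, it can be sanity-checked by calling it directly. The snippet below is only a sketch of that idea; urlForSku is a hypothetical wrapper, and the SKU is the one used in the local run example further down.

```javascript
// The sku2url implementation from above, held in a local object for the check.
const sku2url = {
  implements: 'product/sku2url',
  implementation: async ({ sku }, { country, domain }, context, dependencies) => {
    return `https://www.amazon.com/dp/${sku}`;
  },
};

// Call the implementation directly; no context or dependencies are needed.
async function urlForSku(sku) {
  return sku2url.implementation({ sku }, { country: 'us', domain: 'amazon.com' }, null, {});
}

urlForSku('B008OZS41U').then((url) => console.log(url));
// https://www.amazon.com/dp/B008OZS41U
```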
(2) Goto URL can just use the standard implementation - in future we might want to add special CAPTCHA handling, etc. You can either leave the file like this, or remove it if the default implementation is in a folder above.
module.exports = {
  implements: "gotourl",
}
(3) Extract action - use the default; if we in future have some kind of branching logic to use different extractions depending upon the page template we could add this in.
Note that currently, if you re-extract data, the logic will not re-run to choose a different extraction.
(4) Extraction - needs to be completed with an initial set of rules.
(5) Sample inputs - need to be created.
Run the action locally
> import-io action:run:local --parameters country=us domain=amazon.com --action product/details --inputs sku=B008OZS41U
This will open a local browser and allow you to then control the browser via the REPL and use the Chrome dev tools.
Deploy a dev version of the extractor(s) to the platform
> import-io extractor:deploy -o my_org -p product/details/us -b dev
Create unit tests from the sample inputs
You can run the sample inputs server side in order to generate a set of unit tests in the test/ directory.
> import-io extractor:tests:update -o my_org -e product/details/us/amazon.com -b dev
Run unit tests from the sample inputs
You can run the unit tests in the test/ directory to check that the extractions still give the expected data.
> import-io extractor:tests:unit -o my_org -p .
You need to have the Import.io Code application installed.
Integration into CI systems
TODO: currently cannot run the unit tests for all orgs
- CI=1 import-io extractor:tests:unit -o my_org -p .
- npm test
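One way to wire the two steps above into a CI job is a small Node.js script. This is a sketch under assumptions: the file name ci.js and the helpers ciSteps and runSteps are hypothetical, the import-io binary is assumed to be on the PATH, and only the commands listed above are used.

```javascript
const { spawnSync } = require('child_process');

// The two CI steps listed above, for one org.
function ciSteps(org) {
  return [
    { command: 'import-io', args: ['extractor:tests:unit', '-o', org, '-p', '.'], env: { CI: '1' } },
    { command: 'npm', args: ['test'], env: {} },
  ];
}

// Run each step in order and stop at the first failure.
// Returns the exit code the CI job should finish with.
function runSteps(steps) {
  for (const step of steps) {
    const result = spawnSync(step.command, step.args, {
      stdio: 'inherit',
      env: { ...process.env, ...step.env },
    });
    if (result.status !== 0) {
      // status is null when the command could not be started or was killed
      return result.status === null ? 1 : result.status;
    }
  }
  return 0;
}

// Usage, e.g. from a hypothetical ci.js:
//   process.exit(runSteps(ciSteps('my_org')));
```

Running the steps per org keeps the script compatible with the current limitation noted in the TODO.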
Usage
$ npm install -g import-io-cli
$ import-io COMMAND
running command...
$ import-io (-v|--version|version)
import-io-cli/0.0.1 darwin-x64 node-v12.14.1
$ import-io --help [COMMAND]
USAGE
$ import-io COMMAND
...
Commands Reference
import-io action:compile
import-io action:implement
import-io action:interface
import-io action:new
import-io action:run:local
import-io action:run:remote
import-io browser:launch [FILE]
import-io cache:clear
import-io config [FILE]
import-io extraction:new PATH
import-io extractor:build
import-io extractor:deploy
import-io extractor:new
import-io extractor:run
import-io extractor:tests:functional
import-io extractor:tests:unit
import-io extractor:tests:update
import-io function:new PATH
import-io help [COMMAND]
import-io init
import-io schema:new PATH
import-io template:new
import-io action:compile
Compile an action to JS
USAGE
$ import-io action:compile
OPTIONS
-P, --parameters=parameters (required) parameter values, key=value
-a, --action=action (required) action path
-h, --help show CLI help
See code: src/commands/action/compile.ts
import-io action:implement
Implement an action interface
USAGE
$ import-io action:implement
OPTIONS
-I, --interface=interface (required) path to where interface is
-P, --parameters=parameters (required) parameter values, key=value
-h, --help show CLI help
See code: src/commands/action/implement.ts
import-io action:interface
Create a new interface with default implementation
USAGE
$ import-io action:interface
OPTIONS
-P, --parameters=parameters parameter name
-h, --help show CLI help
-i, --inputs=inputs input name
-p, --path=path (required) new action path
See code: src/commands/action/interface.ts
import-io action:new
Create a new action (not an interface impl)
USAGE
$ import-io action:new
OPTIONS
-P, --parameters=parameters parameter name
-h, --help show CLI help
-i, --inputs=inputs input name
-p, --path=path (required) new action path
See code: src/commands/action/new.ts
import-io action:run:local
Run an action locally
USAGE
$ import-io action:run:local
OPTIONS
-P, --parameters=parameters (required) parameter values, key=value
-a, --action=action (required) action path
-c, --compile compile down to the code action (cannot use breakpoints)
-h, --help show CLI help
-i, --inputs=inputs (required) input values, key=value
--proxy=proxy proxy, e.g. http://10.10.10.10:8000
--proxyauth=proxyauth proxyauth, e.g. user:password
See code: src/commands/action/run/local.ts
import-io action:run:remote
Run the action locally using an import.io remote browser
USAGE
$ import-io action:run:remote
OPTIONS
-C, --country=country country code for proxy to checkout
-P, --parameters=parameters (required) parameter values, key=value
-a, --action=action (required) action path
-h, --help show CLI help
-i, --inputs=inputs (required) input values, key=value
-p, --private check out a private browser
-r, --residential Use residential proxies
-v, --virtualBrowserId=virtualBrowserId virtual browser id to check out
See code: src/commands/action/run/remote.ts
import-io browser:launch [FILE]
Launch a browser for testing
USAGE
$ import-io browser:launch [FILE]
OPTIONS
-h, --help show CLI help
--proxy=proxy proxy, e.g. http://10.10.10.10:8000
--proxyauth=proxyauth proxyauth, e.g. user:password
See code: src/commands/browser/launch.ts
import-io cache:clear
Clear cache
USAGE
$ import-io cache:clear
See code: src/commands/cache/clear.ts
import-io config [FILE]
Initialize developer configuration
USAGE
$ import-io config [FILE]
OPTIONS
-h, --help show CLI help
See code: src/commands/config.ts
import-io extraction:new PATH
Create a new extraction
USAGE
$ import-io extraction:new PATH
OPTIONS
-h, --help show CLI help
-s, --schema=schema (required) schema path
See code: src/commands/extraction/new.ts
import-io extractor:build
Build extractor(s)
USAGE
$ import-io extractor:build
OPTIONS
-h, --help show CLI help
-o, --org=org (required) org slug
-p, --prefix=prefix path prefix to search in
See code: src/commands/extractor/build.ts
import-io extractor:deploy
Deploy an extractor
USAGE
$ import-io extractor:deploy
OPTIONS
-b, --branch=branch (required) branch to deploy
-h, --help show CLI help
-o, --org=org (required) org slug
-p, --prefix=prefix path prefix to search in
See code: src/commands/extractor/deploy.ts
import-io extractor:new
Create a new extractor from a template
USAGE
$ import-io extractor:new
OPTIONS
-P, --parameters=parameters (required) parameter values, key=value
-h, --help show CLI help
-o, --org=org (required) org slug
-p, --path=path path override
-t, --template=template (required) path to template
See code: src/commands/extractor/new.ts
import-io extractor:run
Run an extractor
USAGE
$ import-io extractor:run
OPTIONS
-b, --branch=branch (required) branch to run
-d, --deploy deploy before running, if false will run the currently saved extractor
-e, --extractor=extractor (required) path to extractor directory
-h, --help show CLI help
-i, --inputs=inputs input values, key=value (NOT SUPPORTED)
-o, --org=org (required) org slug
-w, --wait whether or not to wait until the crawl run completes
See code: src/commands/extractor/run.ts
import-io extractor:tests:functional
Carry out a crawl run then compare the data
USAGE
$ import-io extractor:tests:functional
OPTIONS
-b, --branch=branch (required) branch to run
-d, --deploy deploy before running, if false will run the currently saved extractor
-e, --extractor=extractor (required) path to extractor directory
-h, --help show CLI help
-o, --org=org (required) org slug
See code: src/commands/extractor/tests/functional.ts
import-io extractor:tests:unit
Check that the extractions still give the expected data
USAGE
$ import-io extractor:tests:unit
OPTIONS
-h, --help show CLI help
-o, --org=org (required) org slug
-p, --prefix=prefix path prefix to search in
See code: src/commands/extractor/tests/unit.ts
import-io extractor:tests:update
Import a crawl run as a set of test cases
USAGE
$ import-io extractor:tests:update
OPTIONS
-b, --branch=branch extractor branch
-c, --crawlRunId=crawlRunId id of crawl run to import
-e, --extractor=extractor (required) path to extractor directory
-h, --help show CLI help
-o, --org=org (required) org slug
See code: src/commands/extractor/tests/update.ts
import-io function:new PATH
Create a new function
USAGE
$ import-io function:new PATH
OPTIONS
-h, --help show CLI help
See code: src/commands/function/new.ts
import-io help [COMMAND]
display help for import-io
USAGE
$ import-io help [COMMAND]
ARGUMENTS
COMMAND command to show help for
OPTIONS
--all see all commands in CLI
See code: @oclif/plugin-help
import-io init
Initialize a library
USAGE
$ import-io init
OPTIONS
-h, --help show CLI help
-o, --org=org org slug
See code: src/commands/init.ts
import-io schema:new PATH
Create a new schema
USAGE
$ import-io schema:new PATH
OPTIONS
-h, --help show CLI help
See code: src/commands/schema/new.ts
import-io template:new
Create a new template
USAGE
$ import-io template:new
OPTIONS
-P, --parameters=parameters (required) parameters
-e, --entryPoint=entryPoint (required) action entrypoint
-h, --help show CLI help
-s, --schema=schema (required) path to schema
-t, --path=path (required) path to template
See code: src/commands/template/new.ts