0.0.1 • Published 4 years ago

import-io-cli v0.0.1

License
MIT

import-io-cli

This toolchain allows import.io users and managed service providers to build out scalable extractor definitions by creating a modular Extractor Library.

To jump straight into the browser context methods available, see IContext.

Getting started

Concepts

An import.io Extractor Library is a git repository that contains a library of modules and extractors for one or more organizations.

There are multiple types of module:

  • Action
    • A browser control and logic building block
    • Uses a browser context to control the browser - see IContext
    • Action may be used as an interface
      • A default definition may be provided
      • Named parameters (e.g. domain)
  • Schema
    • A definition of what columns are expected to be returned
  • Extraction
    • A definition of what to extract on the page
  • Template
    • An extractor template that is inherited from when creating an org extractor

Each instance of a module maps to a file within the git repository within the top-level src/library folder, and has a URI composed of the type and path, e.g. "template:product/details".
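As a rough illustration of the URI form described above, a module URI can be split into its type and path at the first colon. This is only a sketch: `parseModuleUri` is a hypothetical helper, not part of the CLI, and how the CLI maps the path onto a file name is not shown here.

```javascript
// Hypothetical helper (not part of import-io-cli): split a module URI
// of the form <module type>:<path> into its two parts.
function parseModuleUri(uri) {
  const separator = uri.indexOf(':');
  if (separator <= 0 || separator === uri.length - 1) {
    throw new Error(`Not a module URI: ${uri}`);
  }
  return {
    type: uri.slice(0, separator),  // e.g. "template"
    path: uri.slice(separator + 1), // e.g. "product/details"
  };
}

console.log(parseModuleUri('template:product/details'));
```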

Integration with current SAAS platform

This workflow currently publishes extractors to app.import.io; in future it will deploy to workbench.import.io.

It deploys at least two extractors - a staging extractor and a production extractor. You can also deploy a dev extractor, an extractor just for you, etc. The SAAS account will be resolved from the workbench org slug, but for now is configured in the repository.

These extractors can be run for testing purposes in the SAAS platform, but many features may not be functional.

⚠️ Part of deployment is setting policies, and as such the user that is deploying these will need to have the correct permissions.

Conversion of existing extractors

There is a feature in the backlog to be able to bring current extractors into the library, but it is not available yet.

Getting started

Install Google Chrome

Download and install Google Chrome if you don't already have it.

Install the client

Download and install the import-io client via a pkg file (macOS), installer (windows) or tarball (linux).

Configure

To configure, run:

> import-io config

It will write a file .import-io.apikey in your home directory.

Have a hack

You can start the browser up and get a REPL to control it by running:

> import-io browser:launch

If the CLI cannot find your Chrome instance, set the CHROME_PATH environment variable.

This will start a browser and give you a REPL interface to control the browser, as well as the dev tools for the browser.

For Extractor Architects

Setting up a new library

> mkdir my-library && cd my-library && git init && npm init && import-io init --org <workbench org slug>

⚠️ Set private: true in the package.json to avoid accidentally publishing the repository.

This will create a new git repo and add the basic scaffolding for the repository.

You need to set the SAAS user account id in the config.yaml file until the integration to read the platform user id from workbench is complete.

Setting up a new extractor template

An extractor template is what developers need in order to build out instances of extractors for orgs.

It and its dependencies contain all the information require

Create a branch

Create a branch in git for your extractor template and the new interfaces - you should use standard

Create a schema

> import-io schema:new product/details

The schema carries no type information: everything is extracted as text, except values extracted from JSON, which are returned as JSON primitives.

Create an entry point action and its dependency interfaces

An entry point action is the first action that an extractor runs.

Actions are parametrized so that we can load in different dependencies depending on, for example, what domain the extractor is targeting.

Creating the entry point

> import-io action:new --path product/details --parameters country domain --inputs sku

Now you can edit the action and add the dependencies and control code, e.g.

module.exports = {
  parameters: [
    {
      name: "country",
      description: "ISO-2 country code"
    },
    {
      name: "domain",
      description: "Top private domain"
    }
  ],
  inputs: [
    {
      name: "sku",
      description: "Retailer unique SKU",
      type: "string"
    }
  ],
  dependencies: {
    skuToUrl: 'action:product/sku2url/${country}/${domain}',
    gotoUrl: 'action:gotourl/${country}/${domain}',
    productDetails: 'action:product/details/extract/${country}/${domain}',
  },
  path: "${country}/${domain}",
  implementation: async ({ sku }, { country, domain }, context, { skuToUrl, gotoUrl, productDetails}) => {
    const url = await skuToUrl({sku}, { country, domain });
    await gotoUrl({ url }, { domain });
    await productDetails({ url }, { domain });
  }
}

There is also a unit test added in test/ which should be completed.

The dependencies are of the URI form <module type>:<path>.
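The ${country} and ${domain} placeholders in the dependency URIs sit inside ordinary quoted strings, so JavaScript does not interpolate them itself; presumably the toolchain fills them in from the parameter values. A minimal sketch of that kind of substitution, assuming a hypothetical helper named `resolvePlaceholders` (this is not the CLI's own code):

```javascript
// Hypothetical helper: fill ${name} placeholders from a parameter map.
function resolvePlaceholders(template, parameters) {
  return template.replace(/\$\{(\w+)\}/g, (match, name) => {
    if (!(name in parameters)) {
      throw new Error(`Missing parameter: ${name}`);
    }
    return parameters[name];
  });
}

const uri = resolvePlaceholders(
  'action:product/sku2url/${country}/${domain}',
  { country: 'us', domain: 'amazon.com' }
);
console.log(uri); // action:product/sku2url/us/amazon.com
```

Throwing on a missing parameter is a design choice of this sketch; it surfaces a misconfigured extractor early instead of producing a URI with a literal placeholder in it.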

Creating the dependencies

Now we need to similarly create the dependencies:

> import-io action:new --path product/sku2url --parameters country domain --inputs sku

module.exports = {
  parameters: [
    {
      name: "country",
      description: "ISO-2 country code"
    },
    {
      name: "domain",
      description: "Top private domain"
    }
  ],
  inputs: [
    {
      name: "sku",
      description: "Retailer unique SKU",
      type: "string"
    }
  ],
  dependencies: {
  },
  path: "${country}/${domain}",
  implementation: async ({ sku }, { country, domain }, context, dependencies) => {
    throw new Error('No default implementation');
  }
}

Note this does not have a default implementation as it would not make sense.

> import-io action:new --path gotourl --parameters country domain --inputs url

module.exports = {
  parameters: [
    {
      name: "country",
      description: "ISO-2 country code"
    },
    {
      name: "domain",
      description: "Top private domain"
    }
  ],
  inputs: [
    {
      name: "url",
      description: "URL to visit",
      type: "string"
    }
  ],
  dependencies: {
  },
  path: "${country}/${domain}",
  implementation: async ({ url }, { country, domain }, context, dependencies) => {
    await context.goto(url, {timeout: 10000, waitUntil: 'load', checkBlocked: true});
  }
}

> import-io action:new --path product/details/extract --parameters country domain --inputs url

module.exports = {
  description: 'Extract the product details when on a product details page already',
  parameters: [
    {
      name: "country",
      description: "ISO-2 country code"
    },
    {
      name: "domain",
      description: "Top private domain"
    }
  ],
  inputs: [
    {
      name: "url",
      description: "URL of the product details page",
      type: "string"
    }
  ],
  dependencies: {
    productDetails: 'extraction:product/details/extract/${country}/${domain}',
  },
  path: "${country}/${domain}",
  implementation: async ({ url }, { country, domain }, context, {productDetails}) => {
    await context.extract(productDetails);
  }
}

This clearly has a default implementation - do a single extraction - but can be overridden for more complicated extractions.

Creating an extractor template

> import-io template:new --schema product/details --parameters domain country --entryPoint product/details --path product/details

Now you can edit the generated template:

proxy:
  zone: USA
  type: DATA_CENTER
policy:
  numberRetries: 3
  priority: MEDIUM
  retryDelay: 60
  backoffPolicy: EXPONENTIAL
  retryWithResidentialProxyAfter: 99
honorRobots: false
schema: product/details
parameters:
  - domain
  - country
entryPoint: product/details
pathTemplate: "product/details/${country}/${domain}"

The pathTemplate is where the extractors generated from the template will be placed for each organization.

For Extractor Implementors

Create an extractor from the template

Creating an extractor will also bootstrap all its dependencies if they do not already exist.

> import-io extractor:new --org my_org --parameters country=us domain=amazon.com --template product/details

This creates an extractor.yaml file at the path specified in the template in the org directory:

template: product/details
parameters:
  country: us
  domain: amazon.com
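The parameters in extractor.yaml are expected to line up with the parameters declared by the template. Whether the CLI performs such a check is not stated here; the sketch below only illustrates the relationship, using plain objects that mirror the two YAML files (no YAML parser is assumed) and a hypothetical function `checkParameters`.

```javascript
// Hypothetical check: compare an extractor's parameters with those its template declares.
function checkParameters(template, extractor) {
  const supplied = Object.keys(extractor.parameters);
  return {
    missing: template.parameters.filter((name) => !supplied.includes(name)),
    unexpected: supplied.filter((name) => !template.parameters.includes(name)),
  };
}

// Plain objects mirroring the template and extractor YAML shown above.
const template = { parameters: ['domain', 'country'] };
const extractor = {
  template: 'product/details',
  parameters: { country: 'us', domain: 'amazon.com' },
};

console.log(checkParameters(template, extractor)); // { missing: [], unexpected: [] }
```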

You now need to update the scaffolded files:

(1) Sku2Url implementation

module.exports = {
    implements: "product/sku2url",
    implementation: async ({ sku }, { country, domain }, context, dependencies) => {
        return `https://www.amazon.com/dp/${sku}`;
    }
}

(2) Goto URL can just use the standard implementation - in future we might want to add special CAPTCHA handling, etc. You can either leave the file like this, or remove it if the default implementation is in a folder above.

module.exports = {
    implements: "gotourl",
}

(3) Extract action - use the default. If branching logic is needed later to choose between extractions depending on the page template, it can be added here.

Note that currently if you re-extract data, logic will not re-run to choose a different extraction.

(4) Extraction - needs to be completed with an initial set of rules.

(5) Sample inputs - need to be created.
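Item (2) above says an implementation file can be removed when the default implementation lives in a folder above. One way to picture that is a search from the most specific folder up to the root. The sketch below is an assumption about the idea, not the CLI's actual resolution logic, and `candidateFolders` is a hypothetical name.

```javascript
// Hypothetical: list the folders to search for an implementation, most specific first.
function candidateFolders(path) {
  const segments = path.split('/').filter(Boolean);
  const folders = [];
  for (let depth = segments.length; depth >= 0; depth--) {
    // An empty join means the root of the interface's folder, written here as ".".
    folders.push(segments.slice(0, depth).join('/') || '.');
  }
  return folders;
}

console.log(candidateFolders('us/amazon.com')); // [ 'us/amazon.com', 'us', '.' ]
```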

Run the action locally

> import-io action:run:local --parameters country=us domain=amazon.com --action product/details --inputs sku=B008OZS41U

This will open a local browser and allow you to then control the browser via the REPL and use the Chrome dev tools.
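The --parameters and --inputs flags take key=value pairs. As an illustration of how such pairs map onto the objects an action's implementation receives, here is a small sketch; `parseKeyValues` is a hypothetical helper and not taken from the CLI source.

```javascript
// Hypothetical helper: turn ["country=us", "domain=amazon.com"] into an object.
function parseKeyValues(pairs) {
  const result = {};
  for (const pair of pairs) {
    // Split on the first "=" only, so values may themselves contain "=".
    const index = pair.indexOf('=');
    if (index <= 0) {
      throw new Error(`Expected key=value, got: ${pair}`);
    }
    result[pair.slice(0, index)] = pair.slice(index + 1);
  }
  return result;
}

console.log(parseKeyValues(['country=us', 'domain=amazon.com']));
console.log(parseKeyValues(['sku=B008OZS41U']));
```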

Deploy a dev version of the extractor(s) to the platform

> import-io extractor:deploy -o my_org -p product/details/us -b dev

Create unit tests from the sample inputs

You can run the sample inputs server side in order to generate a set of unit tests in the test/ directory.

> import-io extractor:tests:update -o my_org -e product/details/us/amazon.com -b dev

Run unit tests from the sample inputs

You can run the unit tests in the test/ directory to check that the extractions still give the expected data.

> import-io extractor:tests:unit -o my_org -p .

You need to have the Import.io Code application installed.

Integration into CI systems

TODO: currently cannot run the unit tests for all orgs

  • CI=1 import-io extractor:tests:unit -o my_org -p .
  • npm test

Usage

$ npm install -g import-io-cli
$ import-io COMMAND
running command...
$ import-io (-v|--version|version)
import-io-cli/0.0.1 darwin-x64 node-v12.14.1
$ import-io --help [COMMAND]
USAGE
  $ import-io COMMAND
...

Commands Reference

import-io action:compile

Compile an action to JS

USAGE
  $ import-io action:compile

OPTIONS
  -P, --parameters=parameters  (required) parameter values, key=value
  -a, --action=action          (required) action path
  -h, --help                   show CLI help

See code: src/commands/action/compile.ts

import-io action:implement

Implement an action interface

USAGE
  $ import-io action:implement

OPTIONS
  -I, --interface=interface    (required) path to where interface is
  -P, --parameters=parameters  (required) parameter values, key=value
  -h, --help                   show CLI help

See code: src/commands/action/implement.ts

import-io action:interface

Create a new interface with default implementation

USAGE
  $ import-io action:interface

OPTIONS
  -P, --parameters=parameters  parameter name
  -h, --help                   show CLI help
  -i, --inputs=inputs          input name
  -p, --path=path              (required) new action path

See code: src/commands/action/interface.ts

import-io action:new

Create a new action (not an interface impl)

USAGE
  $ import-io action:new

OPTIONS
  -P, --parameters=parameters  parameter name
  -h, --help                   show CLI help
  -i, --inputs=inputs          input name
  -p, --path=path              (required) new action path

See code: src/commands/action/new.ts

import-io action:run:local

Run an action locally

USAGE
  $ import-io action:run:local

OPTIONS
  -P, --parameters=parameters  (required) parameter values, key=value
  -a, --action=action          (required) action path
  -c, --compile                compile down to the code action (cannot use breakpoints)
  -h, --help                   show CLI help
  -i, --inputs=inputs          (required) input values, key=value
  --proxy=proxy                proxy, e.g. http://10.10.10.10:8000
  --proxyauth=proxyauth        proxyauth, e.g. user:password

See code: src/commands/action/run/local.ts

import-io action:run:remote

Run the action locally using an import.io remote browser

USAGE
  $ import-io action:run:remote

OPTIONS
  -C, --country=country                    country code for proxy to checkout
  -P, --parameters=parameters              (required) parameter values, key=value
  -a, --action=action                      (required) action path
  -h, --help                               show CLI help
  -i, --inputs=inputs                      (required) input values, key=value
  -p, --private                            check out a private browser
  -r, --residential                        Use residential proxies
  -v, --virtualBrowserId=virtualBrowserId  virtual browser id to check out

See code: src/commands/action/run/remote.ts

import-io browser:launch [FILE]

Launch a browser for testing

USAGE
  $ import-io browser:launch [FILE]

OPTIONS
  -h, --help             show CLI help
  --proxy=proxy          proxy, e.g. http://10.10.10.10:8000
  --proxyauth=proxyauth  proxyauth, e.g. user:password

See code: src/commands/browser/launch.ts

import-io cache:clear

Clear cache

USAGE
  $ import-io cache:clear

See code: src/commands/cache/clear.ts

import-io config [FILE]

Initialize developer configuration

USAGE
  $ import-io config [FILE]

OPTIONS
  -h, --help  show CLI help

See code: src/commands/config.ts

import-io extraction:new PATH

Create a new extraction

USAGE
  $ import-io extraction:new PATH

OPTIONS
  -h, --help           show CLI help
  -s, --schema=schema  (required) schema path

See code: src/commands/extraction/new.ts

import-io extractor:build

Build extractor(s)

USAGE
  $ import-io extractor:build

OPTIONS
  -h, --help           show CLI help
  -o, --org=org        (required) org slug
  -p, --prefix=prefix  path prefix to search in

See code: src/commands/extractor/build.ts

import-io extractor:deploy

Deploy extractor(s)

USAGE
  $ import-io extractor:deploy

OPTIONS
  -b, --branch=branch  (required) branch to deploy
  -h, --help           show CLI help
  -o, --org=org        (required) org slug
  -p, --prefix=prefix  path prefix to search in

See code: src/commands/extractor/deploy.ts

import-io extractor:new

Create a new extractor from a template

USAGE
  $ import-io extractor:new

OPTIONS
  -P, --parameters=parameters  (required) parameter values, key=value
  -h, --help                   show CLI help
  -o, --org=org                (required) org slug
  -p, --path=path              path override
  -t, --template=template      (required) path to template

See code: src/commands/extractor/new.ts

import-io extractor:run

Run an extractor

USAGE
  $ import-io extractor:run

OPTIONS
  -b, --branch=branch        (required) branch to run
  -d, --deploy               deploy before running, if false will run the currently saved extractor
  -e, --extractor=extractor  (required) path to extractor directory
  -h, --help                 show CLI help
  -i, --inputs=inputs        input values, key=value (NOT SUPPORTED)
  -o, --org=org              (required) org slug
  -w, --wait                 whether or not to wait until the crawl run completes

See code: src/commands/extractor/run.ts

import-io extractor:tests:functional

Carry out a crawl run then compare the data

USAGE
  $ import-io extractor:tests:functional

OPTIONS
  -b, --branch=branch        (required) branch to run
  -d, --deploy               deploy before running, if false will run the currently saved extractor
  -e, --extractor=extractor  (required) path to extractor directory
  -h, --help                 show CLI help
  -o, --org=org              (required) org slug

See code: src/commands/extractor/tests/functional.ts

import-io extractor:tests:unit

Check that the extractions still give the expected data

USAGE
  $ import-io extractor:tests:unit

OPTIONS
  -h, --help           show CLI help
  -o, --org=org        (required) org slug
  -p, --prefix=prefix  path prefix to search in

See code: src/commands/extractor/tests/unit.ts

import-io extractor:tests:update

Import a crawl run as a set of test cases

USAGE
  $ import-io extractor:tests:update

OPTIONS
  -b, --branch=branch          extractor branch
  -c, --crawlRunId=crawlRunId  id of crawl run to import
  -e, --extractor=extractor    (required) path to extractor directory
  -h, --help                   show CLI help
  -o, --org=org                (required) org slug

See code: src/commands/extractor/tests/update.ts

import-io function:new PATH

Create a new function

USAGE
  $ import-io function:new PATH

OPTIONS
  -h, --help  show CLI help

See code: src/commands/function/new.ts

import-io help [COMMAND]

display help for import-io

USAGE
  $ import-io help [COMMAND]

ARGUMENTS
  COMMAND  command to show help for

OPTIONS
  --all  see all commands in CLI

See code: @oclif/plugin-help

import-io init

Initialize a library

USAGE
  $ import-io init

OPTIONS
  -h, --help     show CLI help
  -o, --org=org  org slug

See code: src/commands/init.ts

import-io schema:new PATH

Create a new schema

USAGE
  $ import-io schema:new PATH

OPTIONS
  -h, --help  show CLI help

See code: src/commands/schema/new.ts

import-io template:new

Create a new template

USAGE
  $ import-io template:new

OPTIONS
  -P, --parameters=parameters  (required) parameters
  -e, --entryPoint=entryPoint  (required) action entrypoint
  -h, --help                   show CLI help
  -s, --schema=schema          (required) path to schema
  -t, --path=path              (required) path to template

See code: src/commands/template/new.ts