DataHorde

A data gathering and manipulation kit with support for a variety of database platforms, including MSSQL, MySQL, PostgreSQL, MariaDB, and SQLite.

A toolkit which allows a user to easily and programmatically crawl a site for data. It includes a tool for extracting data to pipe-delimited csv for easy import and manipulation. The source is available on GitLab, and documentation on the components is available at https://rdfedor.gitlab.io/datahorde.

Table of Contents

  1. Installation
  2. Available Commands
    1. Crawler
    2. Devour
    3. Digest
    4. Egress
    5. Mimic
  3. Development Setup
  4. Release History
  5. Contributing

1. Installation

OS X & Linux:

To install the CLI globally so it is accessible from the terminal,

npm install -g datahorde

To install as part of a project,

npm install --save datahorde
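
When installed as a project dependency, the dh binary is placed in node_modules/.bin, so it can also be run without a global install, for example through npx or an npm script:

npx dh --version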

2. Available Commands

Usage: dh [options] [command]

Options:
  -V, --version                                       output the version number
  -h, --help                                          output usage information

Commands:
  crawl|c [options] <targetUrl> [otherTargetUrls...]  Searches a collection of urls, collects data and writes to csv
  devour|d [options] <inputFile>                      Allows for the easy ingestion of csvs to various datasources
  digest|di [options] <action>                        Processes digest action [execute,migrate,rollback] against a particular database
  mimic|m [options] <templateFile> <outputFile>       Templating engine which uses ejs to build customized scripts
  egress|v [options] <query>                          Exports data from a datasource using a query to csv

2.1 Crawler

Allows a user to crawl a website, collecting data as defined by the row and column selectors.

Usage: dh crawl|c [options] <outputFile> <targetUrl> [otherTargetUrls...]

Searches a collection of urls, collects data and writes to csv

Options:
  -l, --load                                           load targetUrls from local csv
  -k, --loadUrlKey <loadUrlKey>                        the header that contains the url (default: "url")
  -e, --loadDelimiter <loadDelimiter>                  delimiter separating columns in the csv (default: "|")
  -a, --loadHeaders <column1>[,column2...]             header name of columns in the csv (default: true)
  -m, --loadRenameHeader                               map column data to headers (default: false)
  -p, --profile <name>                                 use a specific crawler profile (Available: generic) (default: "generic")
  -r, --rowSelector <selector>                         css selector to the rows to select the data from (default: "body")
  -c, --columnSelector <selector>[,otherSelectors...]  css selector to the values inside the rows (default: "a")
  -n, --nextPageLinkSelector <selector>                css selector to url for next page
  -o, --outputHeaders <name>[,otherColumns...]         displays headers at the top of the export
  -h, --headers <header>=<value>[,otherHeaders...]     set header values in the requests made
  -x, --proxy <proxy>                                  path to the socks proxy
  -t, --threads <threads>                              run multiple threads at the same time (default: 1)
  -b, --browser                                        emulate a browser request
  -h, --help                                           output usage information

As an example, let's write to csv a list of the 42nd weekend's top release titles, each title's link, the movie studio that published it, and the link to that studio's page,

dh crawl -r 'table.mojo-body-table tr' -c 'td.mojo-field-type-release a,td.mojo-field-type-release a:attr(href),td.mojo-field-type-release_studios a,td.mojo-field-type-release_studios a:attr(href)' -o 'release_title,release_title_link,movie_studio,movie_studio_link' mojo_movies_weekend_24.csv https://www.boxofficemojo.com/weekend/2019W42/

Then, taking the output of the above, we can collect details from each page, like the total domestic and international gross, genre, and feature length,

dh crawl -l -k release_title_link -o 'title,domestic_gross,international_gross,genre,feature_length' -r main -c 'h1,.mojo-performance-summary-table div:nth-child(2) span:nth-child(3) span:text,.mojo-performance-summary-table div:nth-child(3) span:nth-child(3) span:text,.mojo-summary-values > div:nth-child(4) > span:nth-child(2):text,.mojo-summary-values > div:nth-child(2) > span:nth-child(2):text' mojo_movie_details.csv mojo_movies_weekend_24.csv
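
If many detail pages need fetching, the crawl can in principle be parallelized with the documented --threads option and routed through a SOCKS proxy with --proxy. A hedged sketch, where the proxy address is only a placeholder and its exact format is an assumption:

# sketch: four concurrent workers; socks5://127.0.0.1:1080 is a placeholder proxy address
dh crawl -l -k release_title_link -t 4 -x socks5://127.0.0.1:1080 -r main -c 'h1' -o 'title' mojo_detail_titles.csv mojo_movies_weekend_24.csv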

2.2 Devour

Ingests a csv file into a database table.

Usage: dh devour|d [options] <inputFile>

Allows for the easy ingestion of csvs to various datasources

Options:
  -V, --version                                         output the version number
  -s, --schema <name>=<columnType>,[<otherColumns>...]  describe the csv file and the import table model. (Types: int, str, bigint, double, float, date, uuid)
  -d, --delimiter <delimiter>                           Delimiter separating columns in csv. (default: "|")
  -u, --databaseUri <uri>                               DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
  -t, --tableName <tableName>                           Name of the table to create at the databaseUri. (default: "importTable")
  -h, --noHeaders                                       Csv file has no headers
  -r, --renameHeaders                                   Rename headers to match schema on import
  -h, --help                                            output usage information

Building on the output from the crawler, we can then ingest the csv using devour into a file-based SQLite database,

dh devour --databaseUri='sqlite:./mojo_movies.sqlite' --schema 'release_title=str(75),release_title_link=str,movie_studio=str(75),movie_studio_link=str' mojo_movies_weekend_24.csv

We could also rename the columns as we import them if they don't line up with what we're aiming for,

dh devour --databaseUri='sqlite:mojo_movies.sqlite' --schema 'title=str(75),titleUrl=str,studio=str(75),studioUrl=str' --renameHeaders ./mojo_movies_weekend_24.csv
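
If a csv arrives without a header row, the documented --noHeaders flag can be combined with --schema, which presumably names the columns in schema order. A minimal sketch, where the input filename is hypothetical:

# sketch: headerless csv; the input file below is a placeholder
dh devour --databaseUri='sqlite:./mojo_movies.sqlite' --schema 'title=str(75),titleUrl=str,studio=str(75),studioUrl=str' --noHeaders ./mojo_movies_no_headers.csv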

2.3 Digest

Digest executes a series of sql statements against a data source. It supports database migrations and allows for the ingestion and processing of data from csvs into a data source.

Usage: dh digest|di [options] <action> <basePath>

Processes digest action [execute,crawl,seed,egress,migrate,rollback] against a particular database

Options:
  -V, --version                               output the version number
  -d, --data <name>=<value>,[<otherData>...]  digest data used when generating migrations and sql
  -u, --databaseUri <uri>                     DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
  -h, --help                                  output usage information

Action Definitions

  • Execute (ex) - Runs crawl, migrate, and seed, processes the path's files, then egresses the data back to csv.
  • Crawl (cr) - Collects data from a given set of urls and exports them to csv.
  • Seed (se) - Imports data from a csv to a particular table location. Supports the same parameters as defined in the cli.
  • Egress (eg) - Exports data to csv from a file based query. It can be templatized to use the values passed by the data parameter.
  • Migrate (mi) - Applies schema migrations to a particular data source.
  • Rollback (ro) - Reverses schema migrations applied to a data source.

A basic example is available to show how to build a digest, where the parameters are based on the options from the respective CLI implementations. Migrations are supported by Sequelize.

dh digest exec ./examples/digest-basic-example/

A more complex example with crawlers is also available. Change the year or week by passing them via the cli like this, and they will override the default values defined in the configuration.

dh digest exec -d 'year=2019,week=43' ./examples/digest-movie-studio-report/

2.4 Egress

Executes a given series of commands against a database and exports data from a table to a file-based csv.

Usage: dh egress|v [options] <query>

Exports data from a datasource using a query to csv

Options:
  -V, --version                               output the version number
  -o, --output <filepath>                     write data to csv
  -d, --delimeter <delimeter>                 delimiter separating column data (default: "|")
  -u, --databaseUri <uri>                     DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
  -h, --headers <headers>[, otherHeaders...]  comma separated list of headers in the csv (default: true)
  -h, --help                                  output usage information

Then export the same data using egress,

dh egress --databaseUri='sqlite:./mojo_movies.sqlite' --output ./mojo_movies_output.csv 'select * from importTable'
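
The documented --headers option can also supply custom column names for the export. A hedged sketch against the importTable created by devour above:

# sketch: export two columns under custom headers from the table built earlier
dh egress --databaseUri='sqlite:./mojo_movies.sqlite' --output ./mojo_titles.csv --headers 'title,studio' 'select release_title, movie_studio from importTable'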

2.5 Mimic

Allows for files to be templatized so they can be customized based on the parameters passed.

Usage: dh mimic|m [options] <templateFile> <outputFile>

Templating engine which uses ejs to build customized scripts

Options:
  -V, --version                               output the version number
  -d, --data <name>=<value>,[<otherData>...]  generates template using the given values
  -n, --noOverwrite                           does not overwrite the file if the outputFile already exists
  -h, --help                                  output usage information
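
As a quick illustration, a sql script could reference an EJS variable such as <%= year %> and be rendered with a concrete value. A minimal sketch, where the template and output filenames are hypothetical:

# sketch: render a hypothetical EJS template, substituting year=2019
dh mimic -d 'year=2019' ./templates/report.sql.ejs ./report.sql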

For more documentation, please refer to the jsdoc.

3. Development Setup

To develop locally, simply check out the repository and run,

npm install
npm test

4. Release History

  • 0.0.9 - 0.0.14
    • Updates to documentation
  • 0.0.8
    • Extract common database components
    • Add extensions to common database components
    • Add digest tool to automate commands
    • Add better error handling
  • 0.0.7
    • Update version for npm release
  • 0.0.2
    • Refactor CLI into a single command
    • Refactor library to support csv export
    • Add csv export utility "vomit"
  • 0.0.1
    • Work in progress

5. Contributing

  1. Fork it (https://gitlab.com/rdfedor/datahorde/-/forks/new)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request