datahorde v0.0.14
DataHorde
A data gathering and manipulation kit with support for a variety of database platforms like MsSQL, MySQL, PSQL, MariaDB and SQLite.
A toolkit which allows a user to easily, and programmatically, crawl a site for data. It contains a tool for extracting to pipe delimited csv for easy import and manipulation. Source is available in GitLab and documentation on components is available at https://rdfedor.gitlab.io/datahorde.
Table of Contents
1. Installation
OS X & Linux:
To install the CLI globally and accessible from terminal,
npm install -g datahorde
To install as part of a project,
npm install --save datahorde
2. Available Commands
Usage: dh [options] [command]
Options:
-V, --version output the version number
-h, --help output usage information
Commands:
crawl|c [options] <targetUrl> [otherTargetUrls...] Searches a collection of urls, collects data and writes to csv
devour|d [options] <inputFile> Allows for the easy ingestion of csvs to various datasources
digest|di [options] <action> Processes digest action [execute,migrate,rollback] against a particular database
mimic|m [options] <templateFile> <outputFile> Templating engine which uses ejs to build customized scripts
egress|v [options] <query> Exports data from a datasource using a query to csv
2.1 Crawler
Allows a user to crawl a website collecting data as defined by the definition.
Usage: dh crawl|c [options] <outputFile> <targetUrl> [otherTargetUrls...]
Searches a collection of urls, collects data and writes to csv
Options:
-l, --load load targetUrls from local csv
-k, --loadUrlKey <loadUrlKey> the header that contains the url (default: "url")
-e, --loadDelimiter <loadDelimiter> delimeter separating columns in the csv (default: "|")
-a, --loadHeaders <column1>[,column2...] header name of columns in the csv (default: true)
-m, --loadRenameHeader map column data to headers (default: false)
-p, --profile <name> use a specific crawler profile (Available: generic) (default: "generic")
-r, --rowSelector <selector> css selector to the rows to select the data from (default: "body")
-c, --columnSelector <selector>[,otherSelectors...] css selector to the values inside the rows (default: "a")
-n, --nextPageLinkSelector <selector> css selector to url for next page
-o, --outputHeaders <name>[,otherColumns...] displays headers at the top of the export
-h, --headers <header>=<value>[,otherHeaders...] set header values in the requests made
-x, --proxy <proxy> path to the socks proxy
-t, --threads <threads> run multiple threads at the same time (default: 1)
-b, --browser emulate a browser request
-h, --help output usage information
As an example, lets write to csv and display a list of the 42nd weekend's top release titles, release title link,movie studio that published it and the link to the imdb movie studio page,
dh crawl -r 'table.mojo-body-table tr' -c 'td.mojo-field-type-release a,td.mojo-field-type-release a:attr(href),td.mojo-field-type-release_studios a,td.mojo-field-type-release_studios a:attr(href)' -o 'release_title,release_title_link,movie_studio,movie_studio_link' mojo_movies_weekend_24.csv https://www.boxofficemojo.com/weekend/2019W42/
Then taking the output of the above, we can collect details from each page like the total demoestic and internation gross, genre and feature length,
dh crawl -l -i release_title_link -h 'title,domestic_gross,international_gross,genre,feature_length' -r main -c 'h1,.mojo-performance-summary-table div:nth-child(2) span:nth-child(3) span:text,.mojo-performance-summary-table div:nth-child(3) span:nth-child(3) span:text,.mojo-summary-values > div:nth-child(4) > span:nth-child(2):text,.mojo-summary-values > div:nth-child(2) > span:nth-child(2):text' mojo_movie_details.csv mojo_movies_weekend_24.csv
2.2 Devour
Ingests a file based csv into a database table.
Usage: dh devour|d [options] <inputFile>
Allows for the easy ingestion of csvs to various datasources
Options:
-V, --version output the version number
-s, --schema <name>=<columnType>,[<otherColumns>...] describe the csv file and the import table model. (Types: int, str, bigint, double, float, date, uuid)
-d, --delimiter <delimiter> Delimiter separating columns in csv. (default: "|")
-u, --databaseUri <uri> DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
-t, --tableName <tableName> Name of the table to create at the databaseUri. (default: "importTable")
-h, --noHeaders Csv file has no headers
-r, --renameHeaders Rename headers to match schema on import
-h, --help output usage information
Builing on the output from the crawler, we can then ingest the csv using devour into a file based sqlite database,
dh devour --databaseUri='sqlite:./mojo_movies.sqlite' --schema 'release_title=str(75),release_title_link=str,movie_studio=str(75),movie_studio_link=str' mojo_movies_weekend_24.csv
One could also rename the columns as we import them if they don't line up with what we're aiming for,
dh devour --databaseUri='sqlite:mojo_movies.sqlite' --schema 'title=str(75),titleUrl=str,studio=str(75),studioUrl=str' --renameHeaders ./mojo_movies_weekend_24.csv
2.3 Digest
Digest executes a series of sql against a data source. It supports database migrations and allows for the ingestion and processing of data from csvs to a data source.
Usage: dh digest|di [options] <action> <basePath>
Processes digest action [execute,crawl,seed,egress,migrate,rollback] against a particular database
Options:
-V, --version output the version number
-d, --data <name>=<value>,[<otherData>...] digest data used when generating migrations and sql
-u, --databaseUri <uri> DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
-h, --help output usage information
Action Definitions
- Execute (ex) - Runs crawl, migrate, seeds, process the path files then egresses the data back to csv.
- Crawl (cr) - Collects data from a given set of urls and exports them to csv.
- Seed (se) - Imports data from a csv to a particular table location. Supports same parameters as defined in the cli.
- Egress (eg) - Exports data to csv from a file based query. It can be templatized to use the values passed by the data parameter.
- Migrate (mi) - Applies schema migrations to a particular data source.
- Rollback (ro) - Reverses schema migrations applied to a data source.
A basic example is available to show how to build the digest where the parameters are based on the options from their respective CLI implemtations. Migrations are supported by Sequelize.
dh digest exec ./examples/digest-basic-example/
A more complex example with crawlers is available. Change year or week by passing via the cli like this and they will override the default values defined in the configuration.
dh digest exec -d 'year=2019,week=43' ./examples/digest-movie-studio-report/
2.4 Egress
Executes a given series of commands against a database and exports data from a tabl to a file based csv.
Usage: dh egress|v [options] <query>
Exports data from a datasource using a query to csv
Options:
-V, --version output the version number
-o, --output <filepath> write data to csv
-d, --delimeter <delimeter> delimeter separating column data (default: "|")
-u, --databaseUri <uri> DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
-h, --headers <headers>[, otherHeaders...] comma separated list of headers in the csv (default: true)
-h, --help output usage information
Then export the same data using egress,
dh egress --databaseUri='sqlite:./mojo_movies.sqlite' --output ./mojo_movies_output.csv 'select * from importTable'
2.5 Mimic
Allows for files to be templatized so they can be customized based on the parameters passed.
Usage: dh mimic|m [options] <templateFile> <outputFile>
Templating engine which uses ejs to build customized scripts
Options:
-V, --version output the version number
-d, --data <name>=<value>,[<otherData>...] generates template using the given values
-n, --noOverwrite does not overwrite the file if the outputFile already exists
-h, --help output usage information
For more documentation, please refer to the jsdoc.
3. Development Setup
To develop locally, simple checkout the repository and run,
npm install
npm test
4. Release History
- 0.0.9 - 0.0.14
- Updates to documentation
- 0.0.8
- Extract common database components
- Add extensions to common database components
- Add digest tool to automate commands
- Add better error handling
- 0.0.7
- Update version for npm release
- 0.0.2
- Refactor CLI into a single command
- Refactor library to support csv export
- Add csv export utility "vomit"
- 0.0.1
- Work in progress
5. Contributing
- Fork it (< https://gitlab.com/rdfedor/datahorde/-/forks/new >)
- Create your feature branch (
git checkout -b feature/fooBar
) - Commit your changes (
git commit -am 'Add some fooBar'
) - Push to the branch (
git push origin feature/fooBar
) - Create a new Pull Request