load-balance-lines v1.0.5
Parallelize newline-delimited data processing by load balancing lines between multiple processes
Install
# Install locally: the executable is then available in your project's npm scripts as load-balance-lines
# or, outside npm scripts, as ./node_modules/.bin/load-balance-lines
npm i load-balance-lines
# or globally
npm i -g load-balance-lines
Basic use
Take a large dataset in which each record sits on its own line, typically NDJSON, and let load-balance-lines spread those lines across several processes running your executable.
# Make sure your executable is... executable
chmod +x /path/to/my/executable
# and let's go!
cat data.ndjson | load-balance-lines /path/to/my/executable some args
Or, without the cat command, using input redirection (<):
load-balance-lines /path/to/my/executable some args for the executable < data.ndjson
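The executable can be any program that consumes lines. Below is a minimal sketch of one, assuming each sub-process receives its share of lines on stdin and writes its results to stdout. The script name line_length.sh and the sample records are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical worker: prints the length of every line it receives on stdin.
workdir=$(mktemp -d)
cat > "$workdir/line_length.sh" <<'EOF'
#!/bin/sh
# IFS= and -r keep leading whitespace and backslashes intact.
while IFS= read -r line; do
  printf '%s\n' "${#line}"
done
EOF
chmod +x "$workdir/line_length.sh"

# Check the worker on its own before putting it behind the load balancer.
worker_output=$(printf '{"id":1}\n{"id":22}\n' | "$workdir/line_length.sh")
echo "$worker_output"

# Run the same worker through load-balance-lines, only if it is installed.
if command -v load-balance-lines >/dev/null 2>&1; then
  printf '{"id":1}\n{"id":22}\n' | load-balance-lines "$workdir/line_length.sh"
fi
```

Since the lines are split between several processes, it is safer not to rely on the output order matching the input order.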
Simple demo
see test
Real case demo
wikidata-rank needs to parse a full dump of Wikidata:
- Get the latest dump (currently 31G gzipped)
wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
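- Use nice to take as much CPU as is available while yielding priority to other processes
- Use pigz, a drop-in replacement for gzip, to decompress it. pigz decompresses on a single thread but uses extra threads for reading, writing, and checksum calculation, which can still speed things up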
nice pigz -d < latest-all.json.gz | nice load-balance-lines /path/to/wikidata-rank/scripts/calculate_base_scores
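The Wikidata JSON dump is a single JSON array with one entity per line, so most lines end with a comma and the first and last lines are bare brackets. A worker has to cope with that. The sketch below shows one way to turn such lines into plain NDJSON. The function name normalize_dump_lines is hypothetical, and this is not necessarily how wikidata-rank handles it.

```shell
# Turn dump-style lines into plain NDJSON:
# drop the lone "[" and "]" lines, strip the trailing comma from the others.
normalize_dump_lines() {
  sed -e '/^[[]$/d' -e '/^]$/d' -e 's/,$//'
}

# Tiny synthetic sample in the same layout as the dump.
sample=$(printf '[\n{"id":"Q1"},\n{"id":"Q2"}\n]\n' | normalize_dump_lines)
echo "$sample"
```

Such a filter could run inside the worker itself, so that the cleanup is parallelized along with the rest of the processing.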
Options
Number of processes
By default, load-balance-lines starts as many processes as there are CPU cores. Set the LBL_PROCESSES environment variable to change that number:
export LBL_PROCESSES=4 ; cat data.ndjson | load-balance-lines ./my/script
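To keep one core free for the rest of the system, the value can be derived from the core count. This is a sketch only: the helper pick_processes is hypothetical, and getconf _NPROCESSORS_ONLN is assumed to be available (it is on Linux and macOS).

```shell
# Return one less than the given core count, but never less than 1.
pick_processes() {
  if [ "$1" -gt 1 ]; then
    echo $(( $1 - 1 ))
  else
    echo 1
  fi
}

cores=$(getconf _NPROCESSORS_ONLN)
processes=$(pick_processes "$cores")
echo "Would run $processes processes on $cores cores"

# Setting the variable inline scopes it to this one command, unlike export:
# LBL_PROCESSES="$processes" load-balance-lines ./my/script < data.ndjson
```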
Verbose
By default, the load balancer is silent, leaving stdout free for the sub-processes' output, but you can get some basic information by setting LBL_VERBOSE:
export LBL_VERBOSE=true ; cat data.ndjson | load-balance-lines ./my/script