1.0.5 • Published 6 years ago

load-balance-lines v1.0.5

License
MIT

Parallelize newline-delimited data processing by load balancing lines between multiple processes

Install

# Install locally: the executable is then available in your project's npm scripts as load-balance-lines
# or, outside of npm scripts, as ./node_modules/.bin/load-balance-lines
npm i load-balance-lines
# or globally
npm i -g load-balance-lines

Basic use

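Take a large dataset whose atomic elements are separated by newlines, typically NDJSON, and pipe it to load-balance-lines, followed by the executable that should process the lines and its arguments.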

# Make sure your executable is... executable
chmod +x /path/to/my/executable
# and let's go!
cat data.ndjson | load-balance-lines /path/to/my/executable some args

Or, without the cat command, using input redirection (<):

load-balance-lines /path/to/my/executable some args for the executable < data.ndjson
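As an illustration of what the executable might look like, here is a minimal sketch of a worker written in shell. It assumes that load-balance-lines feeds each subprocess its share of lines on stdin and that results are written to stdout; the README does not spell this out, so treat it as an assumption. The function name process_lines and the "print the line length" placeholder are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical worker: reads lines from stdin until EOF and writes one
# result per line to stdout. Assumes each subprocess receives its share
# of lines on stdin (an assumption, not confirmed by the README).
process_lines() {
  local line
  # `|| [ -n "$line" ]` also handles a final line without a trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip empty lines
    [ -n "$line" ] || continue
    # Placeholder for real work: print the line's length in characters
    printf '%s\n' "${#line}"
  done
}
```

To use it as the executable, the file would end with a call to process_lines and be made executable with chmod +x. Since several workers run at once, the order of output lines across workers should not be relied on.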

Simple demo

see test

Real case demo

For wikidata-rank, we need to parse a full dump of Wikidata:

  • Get the latest dump (currently 31G gzipped)
wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
  • Use nice to use as much CPU as possible while leaving priority to other processes
  • Use pigz to decompress the dump using threads (a drop-in replacement for the single-threaded gzip)
nice pigz -d < latest-all.json.gz | nice load-balance-lines /path/to/wikidata-rank/scripts/calculate_base_scores

Options

Number of processes

By default, there are as many processes as CPU cores, but this can be changed by setting the LBL_PROCESSES environment variable

export LBL_PROCESSES=4 ; cat data.ndjson | load-balance-lines ./my/script
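If you would rather derive the process count than hard-code it, one possible approach is sketched below: a small helper that returns the number of online cores minus one, leaving a core free for other work. The helper name spare_one_core is hypothetical, and the sketch assumes getconf _NPROCESSORS_ONLN is available on your system (it falls back to 1 otherwise).

```shell
# Hypothetical helper: number of online CPU cores minus one, never below 1.
spare_one_core() {
  local cores
  # Assumes getconf supports _NPROCESSORS_ONLN; falls back to 1 if it fails
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  if [ "$cores" -gt 1 ]; then
    echo $((cores - 1))
  else
    echo 1
  fi
}

# Usage (assumes load-balance-lines is installed):
#   LBL_PROCESSES=$(spare_one_core) load-balance-lines ./my/script < data.ndjson
```

Prefixing the command with the variable assignment sets it for that single invocation only, instead of exporting it for the rest of the shell session.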

Verbose

By default, the load balancer is silent, to leave stdout free for the subprocesses' output, but you can get some basic information by setting LBL_VERBOSE

export LBL_VERBOSE=true ; cat data.ndjson | load-balance-lines ./my/script