H-digest NPM | npm.io

h-digest

takes a large and continuous data stream and continuously only retain a reduced empirical CDF approximation

Moved on to sample-distribution, about 10x more accurate and about just as fast

small, simple, no dependencies

• Example • Features • Limitations • Why • API • License

Example

HD = require('hdigest') // or import HD from 'hdigest'

// recorder with 7 retained samples with build-in weighting
var hd0 = HD(7)

hd0.push(4)
hd0.push([5,3,6,2,7,1,8])
hd0.push(0)

console.log(hd0.min) // 0
console.log(hd0.max) // 8
console.log(hd0.quantile(0.5)) // 4
console.log(hd0.quantile([0, 0.5, 1])) // [0, 4, 8]

Features

very small code and footprint for large number of instances
less than 200 sloc, no dependencies, 2kb minified
constant memory use, no compression steps and/or triggered garbage collection
significantly faster than other implementations (about 3-5x faster)
tested with random floats, discrete values, sorted values, repeated values and skewed distribution

Limitations

no other utility methods (use npm lazy-stats for mean and variance)
the remaining selected values do not preserve the mean of the inputs

Background

Based on the work of Dunning with significant changes to the algorithm:

Samples retained represent the maximum for a given rank (instead of value weighted centroid of fixed rank)
No value interpolation. Values are kept as-is, but ranks are interpolated.
Fixed length, every new value discards an old one

The above points are thought to yield the following benefits:

The use of maxima instead of weighted centroids is closer to the underlying math (F(x) = P(X<=x))
Use of simpler, faster fixed arrays instead of a tree structure
No need for tree compresion steps
Better handling of sorted data, discrete data and repeated identical values
Faster, smaller footprint for hundreds of instances to measure hundreads of instruments
No garbage collection required

Related projects

tdigest) based on the work of Dunning
sample-distribution, 5-10x faster and more accurate (by some measure) than the above

API

Creation

Samples are retained depending on how close they are to the target CDF probability points to be retained.

The main function has 3 different input types:

{number} length: length of the internal target CDF to be generated
{Array<number>} CDF: if strictly increasing from 0 to 1, the array will be used the target CDF
{Array<number>} PDF: if not a CDF, the array will be treated as a PDF to be summed and normalized into a CDF

Note that to preserve the maxima, a PDF will be padded with 0s at both ends if not already the case. This will result in a recorder length that is greater than the input PDF

Properties

.N number: total samples received
.probs array: internal sigmoid/cdf used for selecting retained samples
.values array: selected retained sample values
.ranks array: interpolated ranks of retained samples
.min number: the minimum of all samples. Same as quantile(0)
.max number: the minimum of all samples. Same as quantile(1)
.ave number: an approximation of the sample average from the derived cdf

Methods

.push(number | array) void: sample value(s) to be added
.quantile(number | array) {number | array}: value(s) for specified probabilitie(s)
.percentile(number | array) same as quantile, different name (percentile values from the quantile function)