@jrc03c/tf-k-means

Intro

This library implements the K-means algorithm in TensorFlow.js. Note that it is a fresh rewrite of willfind/tf-kmeans, which should be considered deprecated.

Installation

npm install --save @jrc03c/tf-k-means

Usage

Clustering the data

The model's API follows the scikit-learn conventions of having fit, predict, and score methods.

import { KMeansMeta } from "@jrc03c/tf-k-means"

const x = getDataSomehow()

const model = new KMeansMeta({
  finalMaxIterations: 100,
  finalMaxRestarts: 50,
  finalMaxTime: 1000,
  initialization: KMeansMeta.Initialization.PLUS_PLUS,
  ks: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
  maxIterations: 50,
  maxRestarts: 25,
  maxTime: 3000,
  method: KMeansMeta.Method.SILHOUETTE,
  testSize: 0.25,
  tolerance: 1e-4,
})

model.fit(x, progress => {
  console.log("k-means progress:", (progress * 100).toFixed(2) + "%")
})

const labels = model.predict(x)
const score = model.score(x)

Visualizing the data

I've included a t-SNE model for visualization purposes. It allows you to project data down into 2 (or 3 or however many) dimensions.

To continue the example above, we'd stack the data points (x) and the model's learned centroids (i.e., cluster centers) and project them all at once down into (e.g.) 2 dimensions.

import { TSNE } from "@jrc03c/tf-k-means"
import * as tf from "@tensorflow/tfjs"

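// Concatenate along the first axis so that the centroids follow the data points.
// (tf.stack would require both tensors to have exactly the same shape.)
const stacked = tf.concat([x, model.centroids], 0)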

const projector = new TSNE({
  dimensions: 2,
  learningRate: 100,
  maxIterations: 1000,
  maxTime: 2000,
  perplexity: 30,
})

const projected = projector
  .fitTransform(stacked, progress => {
    console.log("t-sne progress:", (progress * 100).toFixed(2) + "%")
  })
  .arraySync()

const xProjected = projected.slice(0, x.shape[0])
const centroidsProjected = projected.slice(x.shape[0])

// then visualize!

Demo

The /demo folder contains a full example. Run the demo using:

npm run demo

And then visit http://localhost:1234 in your browser.

Caveats

I tried to implement a time limit for each model's fit method (or fitTransform in the case of the t-SNE model) using the maxTime and finalMaxTime properties described below. However, some operations are large, synchronous, and uninterruptible, so the time constraint can't always be honored. The time limits are therefore approximations rather than hard boundaries. I may add functionality in the future that moves parts of the fit methods into Web Workers, which would make them cancellable at any time, even in the middle of a large synchronous operation.
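
To illustrate why such a limit can be overshot, here is a minimal sketch in plain JavaScript. It is not this library's code: the function runWithBudget, the fake clock, and the 40 ms step duration are all assumptions made for the example. The budget is only checked between synchronous steps, so a step that has already started always runs to completion.

```javascript
// Hypothetical sketch (not this library's code): a time budget that is only
// checked between synchronous steps, so a long step can overshoot it.
function runWithBudget(steps, maxTime, now) {
  const start = now()
  let completed = 0
  let timedOut = false

  for (const step of steps) {
    if (now() - start >= maxTime) {
      timedOut = true
      break
    }

    step() // cannot be interrupted once it has started
    completed++
  }

  return { completed, timedOut, elapsed: now() - start }
}

// A fake clock keeps the example deterministic: each step "takes" 40 ms.
let fakeTime = 0
const now = () => fakeTime
const steps = [0, 1, 2, 3, 4].map(() => () => {
  fakeTime += 40
})

const result = runWithBudget(steps, 100, now)
console.log(result) // { completed: 3, timedOut: true, elapsed: 120 }
```

With a 100 ms budget, the third step starts at 80 ms and finishes at 120 ms, so the run ends 20 ms over budget.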

API

NOTE: In the documentation that follows, I refer to the K-means objective function — the within-cluster sum of squared errors — as "WCSSE".
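
For reference, the WCSSE can be sketched in plain JavaScript as follows. This only illustrates the objective, assuming squared Euclidean distance; the function name wcsse and the sample data are hypothetical and not part of this library's API.

```javascript
// Hypothetical helper (not part of this library): computes the within-cluster
// sum of squared errors, assuming squared Euclidean distance.
function wcsse(points, centroids, labels) {
  let total = 0

  for (let i = 0; i < points.length; i++) {
    const centroid = centroids[labels[i]]

    for (let j = 0; j < points[i].length; j++) {
      const diff = points[i][j] - centroid[j]
      total += diff * diff
    }
  }

  return total
}

const points = [[0, 0], [0, 2], [10, 10], [10, 12]]
const centroids = [[0, 1], [10, 11]]
const labels = [0, 0, 1, 1]

// Every point is exactly 1 unit away from its centroid.
console.log(wcsse(points, centroids, labels)) // 4
```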

Models

`KMeansMeta`

Properties

`Initialization` (static)

An object with this definition:

{
  PLUS_PLUS: "PLUS_PLUS",
  RANDOM: "RANDOM",
}

`centroids`

A 2-dimensional tf.Tensor representing the centroids learned by the final model fitted at the end of the fit method (after the best K-value has been determined). The default is null.

`finalMaxIterations`

A non-negative integer representing the maximum number of iterations allowed for the final model fitted at the end of the fit method. The default is 100.

`finalMaxRestarts`

A non-negative integer representing the maximum number of restarts allowed for the final model fitted at the end of the fit method. The default is 50.

`finalMaxTime`

A non-negative float representing the maximum amount of time in milliseconds allowed for the final model fitted at the end of the fit method. The default is 1000.

`fittedModel`

A KMeans instance representing the final model fitted at the end of the fit method. The default is null.

`initialization`

A string representing the method to use for initializing centroids at the beginning of each restart of a KMeans model. Can either be "RANDOM" or "PLUS_PLUS". (NOTE: These can be accessed as static properties on the class. For example: KMeansMeta.Initialization.PLUS_PLUS) The default is "PLUS_PLUS".

`k`

A non-negative integer representing the number of clusters learned by the final model fitted at the end of the fit method. The default is 0.

`ks`

A 1-dimensional array of non-negative integers representing K-values to test during the fit method. The default is [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].

`maxIterations`

A non-negative integer representing the maximum number of iterations allowed for each KMeans instance used to find the best K-value in the fit method. The default is 50.

`maxRestarts`

A non-negative integer representing the maximum number of restarts allowed for each KMeans instance used to find the best K-value in the fit method. The default is 25.

`maxTime`

A non-negative float representing the maximum amount of time in milliseconds allowed across all KMeans instances used to find the best K-value in the fit method. (In other words, all of the K-values must be tested before maxTime has elapsed.) The default is 3000.

`method`

A string representing the method used to determine the best K-value. Can either be "ELBOW" or "SILHOUETTE". (NOTE: These can be accessed as static properties on the class. For example: KMeansMeta.Method.SILHOUETTE) The default is "SILHOUETTE".

`shouldNormalizeInputData`

A boolean representing whether or not data should be normalized when passed into the fit, predict, and score methods. "Normalized" in this context means scaling and centering each column of data to have a mean of 0 and a standard deviation of 1. The default is true.

`testSize`

A float between 0 and 1 representing the fraction of the data to set aside for testing the validity of each KMeans instance during the fit method. The default is 0.25.

`timedOut`

A boolean representing whether or not any of the KMeans instances ran out of time while trying to run their fit methods. The default is false.

`tolerance`

A float representing the early-stopping threshold: if the difference between two consecutive WCSSE scores is no greater than this value, a KMeans instance skips the remainder of its iterations during its fit method. The default is 0.0001.

Methods

`constructor(options)`

Accepts an options object with these properties, all of which correspond to properties of the same names on the instance:

{
  finalMaxIterations: 100,
  finalMaxRestarts: 50,
  finalMaxTime: 1000,
  fittedModel: null,
  initialization: KMeansMeta.Initialization.PLUS_PLUS,
  ks: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
  maxIterations: 50,
  maxRestarts: 25,
  maxTime: 3000,
  method: KMeansMeta.Method.SILHOUETTE,
  shouldNormalizeInputData: true,
  testSize: 0.25,
  tolerance: 1e-4
}

`destroy()`

Calls the fittedModel.destroy() method, which disposes of any relevant TensorFlow tensors or variables stored in memory. Returns the KMeansMeta instance.

`fit(x, fn)`

Accepts a 2-dimensional array or tensor, x, and a callback function, fn. Determines the best K-value from ks and then fits a final KMeans model, which it stores as the fittedModel value. The callback function, fn, which receives a float between 0 and 1, can be used to track the progress of the fitting process. Returns the KMeansMeta instance.

`predict(x)`

Accepts a 2-dimensional array or tensor, x. Returns a 1-dimensional tf.Tensor of non-negative integers representing the indices into centroids to which each data point in x is assigned.

`printReport(x, centroidsTrue, labelsTrue, fn)`

Accepts a 2-dimensional array or tensor, x; another 2-dimensional array or tensor, centroidsTrue; a 1-dimensional array or tensor, labelsTrue; and a callback function fn. Calls the fit method using x, sorts the learned centroids so that they best align with centroidsTrue, predicts some labels using the predict method, scores the accuracy of those predictions, and then prints a report to the command line or the browser console. Returns the KMeansMeta instance.

`score(x)`

Accepts a 2-dimensional tensor, x. Returns a float representing the WCSSE.

`KMeans`

Properties

`centroids`

A 2-dimensional tf.Tensor representing the centroids learned during the fit method. The default is null.

`initialization`

A string representing the method to use for initializing centroids at the beginning of each restart. Can either be "RANDOM" or "PLUS_PLUS". (NOTE: These can be accessed as static properties on the class. For example: KMeans.Initialization.PLUS_PLUS) The default is "PLUS_PLUS".

`k`

A non-negative integer representing the number of clusters into which to group the data during the fit method. The default is 0.

`maxIterations`

A non-negative integer representing the maximum number of iterations allowed within each restart in the fit method. The default is 100.

`maxRestarts`

A non-negative integer representing the maximum number of restarts allowed in the fit method. The default is 50.

`maxTime`

A non-negative float representing the maximum amount of time in milliseconds allowed for the model to complete its fit method. The default is 3000.

`scaler`

A StandardScaler instance. The default is null.

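`shouldNormalizeInputData`

A boolean representing whether or not data should be normalized when passed into the fit, predict, and score methods. The default is true.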

`timedOut`

A boolean representing whether or not the model ran out of time while trying to run its fit method. The default is false.

`tolerance`

A float representing the early-stopping threshold: if the difference between two consecutive WCSSE scores is no greater than this value, the model skips the remainder of its iterations during its fit method. The default is 0.0001.

Methods

`constructor(options)`

Accepts an options object with these properties, all of which correspond to properties of the same names on the instance:

{
  centroids: null,
  initialization: KMeans.Initialization.PLUS_PLUS,
  k: 0,
  maxIterations: 100,
  maxRestarts: 50,
  maxTime: 3000,
  shouldNormalizeInputData: true,
  tolerance: 1e-4
}

`destroy()`

Disposes of any relevant TensorFlow tensors or variables stored in memory. Returns the KMeans instance.

`fit(x, fn)`

Accepts a 2-dimensional array or tensor, x, and a callback function, fn. Finds the centroids that best represent the data. The callback function, fn, which receives a float between 0 and 1, can be used to track the progress of the fitting process. Returns the KMeans instance.
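
As a rough picture of what a single restart involves, here is a sketch of the standard K-means (Lloyd's) iteration on plain arrays. It is an assumption about the general technique, not this library's TensorFlow.js implementation; the name lloydSketch, its parameters, and the sample data are hypothetical, and restarts, time limits, and progress callbacks are left out.

```javascript
// Hypothetical sketch of one K-means run (Lloyd's algorithm) on plain arrays.
function lloydSketch(points, initialCentroids, maxIterations = 100, tolerance = 1e-4) {
  let centroids = initialCentroids.map(c => c.slice())
  let labels = []
  let previousScore = Infinity

  for (let iteration = 0; iteration < maxIterations; iteration++) {
    // Assignment step: label each point with its nearest centroid and
    // accumulate the WCSSE for the current centroids.
    let score = 0

    labels = points.map(p => {
      let best = 0
      let bestDist = Infinity

      centroids.forEach((c, i) => {
        const d = c.reduce((sum, v, j) => sum + (v - p[j]) ** 2, 0)

        if (d < bestDist) {
          bestDist = d
          best = i
        }
      })

      score += bestDist
      return best
    })

    // Update step: move each centroid to the mean of its assigned points.
    centroids = centroids.map((c, i) => {
      const members = points.filter((_, n) => labels[n] === i)
      if (members.length === 0) return c // leave an empty cluster's centroid in place
      return c.map((_, j) => members.reduce((s, m) => s + m[j], 0) / members.length)
    })

    // Stop early once the score has effectively stopped changing.
    if (Math.abs(previousScore - score) <= tolerance) break
    previousScore = score
  }

  return { centroids, labels }
}

const data = [[0, 0], [0, 2], [10, 10], [10, 12]]
const fitted = lloydSketch(data, [[0, 0], [1, 1]])
console.log(fitted.centroids) // [ [ 0, 1 ], [ 10, 11 ] ]
console.log(fitted.labels) // [ 0, 0, 1, 1 ]
```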

`initializeCentroids(x)`

Accepts a 2-dimensional array or tensor, x. Returns a 2-dimensional tf.Tensor representing a new set of centroids to be used at the beginning of a restart in the fit method.

`initializeCentroidsPlusPlus(x)`

Accepts a 2-dimensional array or tensor, x. Returns a 2-dimensional tf.Tensor representing a new set of centroids selected using the K-means++ algorithm to be used at the beginning of a restart in the fit method.
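
For readers unfamiliar with K-means++, the seeding idea can be sketched on plain arrays as below. This is a generic illustration of the algorithm and not this library's implementation; kMeansPlusPlusSketch, squaredDistance, and the injectable random argument are hypothetical names introduced for the example.

```javascript
// Hypothetical sketch of K-means++ seeding (not this library's implementation).
// `random` is any function returning a float in [0, 1).
function squaredDistance(a, b) {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2
  return sum
}

function kMeansPlusPlusSketch(points, k, random = Math.random) {
  // 1. Pick the first centroid uniformly at random from the data.
  const centroids = [points[Math.floor(random() * points.length)]]

  while (centroids.length < k) {
    // 2. Weight each point by its squared distance to the nearest chosen centroid.
    const weights = points.map(p =>
      Math.min(...centroids.map(c => squaredDistance(p, c)))
    )

    // 3. Sample the next centroid with probability proportional to its weight.
    const total = weights.reduce((a, b) => a + b, 0)
    let threshold = random() * total
    let index = weights.length - 1

    for (let i = 0; i < weights.length; i++) {
      threshold -= weights[i]

      if (threshold < 0) {
        index = i
        break
      }
    }

    centroids.push(points[index])
  }

  return centroids
}

const samplePoints = [[0, 0], [0, 1], [10, 10], [10, 11]]

// A constant "random" source makes the example reproducible.
const seeded = kMeansPlusPlusSketch(samplePoints, 2, () => 0.5)
console.log(seeded) // [ [ 10, 10 ], [ 0, 0 ] ]
```

Because far-away points receive larger weights, the second centroid tends to land in a different region of the data than the first.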

`initializeCentroidsRandom(x)`

Accepts a 2-dimensional array or tensor, x. Returns a 2-dimensional tf.Tensor representing a new set of randomly-selected centroids to be used at the beginning of a restart in the fit method.

`predict(x)`

Accepts a 2-dimensional array or tensor, x. Returns a 1-dimensional tf.Tensor of non-negative integers representing the indices into centroids to which each data point in x is assigned.

`printReport(x, centroidsTrue, labelsTrue, fn)`

`score(x)`

Accepts a 2-dimensional tensor, x. Returns a float representing the WCSSE.

`TSNE`

Properties

`dimensions`

A non-negative integer representing the number of dimensions into which to project the data. The default is 2.

`learningRate`

A non-negative float representing the learning step size. The default is 100.

`maxIterations`

A non-negative integer representing the maximum number of iterations allowed for the fitTransform method. The default is 1000.

`maxTime`

A non-negative float representing the maximum amount of time in milliseconds allowed for the model to complete its fitTransform method. The default is 2000.

`perplexity`

According to Wikipedia, perplexity is "a measure of uncertainty in the value of a sample from a discrete probability distribution". According to Andrej Karpathy's t-SNE library, which is the library on top of which this library's t-SNE model is built, perplexity is "roughly how many neighbors each point influences". The default is 30.

`timedOut`

A boolean representing whether or not the model ran out of time while trying to run its fitTransform method. The default is false.

Methods

`constructor(options)`

Accepts an options object with these properties, all of which correspond to properties of the same names on the instance:

{
  dimensions: 2,
  learningRate: 100,
  maxIterations: 1000,
  maxTime: 2000,
  perplexity: 30,
}

`fitTransform(x, fn)`

Accepts a 2-dimensional tensor or array, x, and a callback function, fn. Projects the data in x down into the desired number of dimensions. The callback function, fn, which receives a float between 0 and 1, can be used to track the progress of the fitting and transforming process. Returns a 2-dimensional tf.Tensor representing the projected data.

Helper classes & functions

`accuracy(a, b)`

Accepts two 1-dimensional tensors or arrays, a and b. Returns a float between 0 and 1 representing the fraction of the values in a that match their corresponding values in b.
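
On plain arrays, the same idea looks like the sketch below. The name accuracySketch is hypothetical and is used only to keep it distinct from the library's own accuracy function, which also accepts tensors.

```javascript
// Hypothetical plain-array stand-in for the idea behind `accuracy`:
// the fraction of positions at which the two arrays hold the same value.
function accuracySketch(a, b) {
  let matches = 0

  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) matches++
  }

  return matches / a.length
}

console.log(accuracySketch([0, 1, 1, 2], [0, 1, 2, 2])) // 0.75
```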

`computeParameters(x)`

Accepts a 2-dimensional tensor or array, x. Returns an object with these properties:

{
  finalMaxIterations: ...,
  finalMaxRestarts: ...,
  maxIterations: ...,
  maxRestarts: ...,
  testSize: ...,
}

This function can be used to help estimate parameters to pass into KMeansMeta based on the shape of x.

`getMemoryStatus()`

Returns an object with these properties:

{
  tensorCount: tf.memory().numTensors,
  variableCount: Object.keys(tf.engine().state.registeredVariables).length,
}

`pairwiseDistances(a, b)`

Accepts two 2-dimensional tensors or arrays, a and b. Returns a 2-dimensional tf.Tensor representing all of the pair-wise distances between all of the rows of a and all of the rows of b.
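
The shape of the result can be seen in this plain-array sketch. It assumes Euclidean distance, which the description above does not state explicitly, and pairwiseDistancesSketch is a hypothetical name rather than the library's function.

```javascript
// Hypothetical sketch: entry [i][j] is the distance between row i of `a` and
// row j of `b`. Euclidean distance is an assumption made for this example.
function pairwiseDistancesSketch(a, b) {
  return a.map(rowA =>
    b.map(rowB =>
      Math.sqrt(rowA.reduce((sum, v, k) => sum + (v - rowB[k]) ** 2, 0))
    )
  )
}

const distances = pairwiseDistancesSketch([[0, 0], [3, 4]], [[0, 0], [6, 8]])
console.log(distances) // [ [ 0, 10 ], [ 5, 5 ] ]
```

The result has one row per row of a and one column per row of b.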

`rSquared(xtrue, xpred)`

Accepts two tensors or arrays with the same shapes, xtrue and xpred. Returns a float value representing an R^2 score less than or equal to 1.
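
As a reminder of what the score means, here is a sketch using the usual definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. Whether the library computes it in exactly this way is an assumption; rSquaredSketch is a hypothetical name, and nested arrays are simply flattened before comparison.

```javascript
// Hypothetical sketch of an R^2 score on (possibly nested) plain arrays,
// using the common definition 1 - SS_residual / SS_total.
function rSquaredSketch(xtrue, xpred) {
  const t = xtrue.flat(Infinity)
  const p = xpred.flat(Infinity)
  const mean = t.reduce((a, b) => a + b, 0) / t.length

  let ssResidual = 0
  let ssTotal = 0

  for (let i = 0; i < t.length; i++) {
    ssResidual += (t[i] - p[i]) ** 2
    ssTotal += (t[i] - mean) ** 2
  }

  return 1 - ssResidual / ssTotal
}

console.log(rSquaredSketch([1, 2, 3, 4], [1, 2, 3, 4])) // 1
console.log(rSquaredSketch([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5])) // 0
```

A perfect prediction scores 1, always predicting the mean scores 0, and worse predictions score below 0.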

`silhouette(distances, labels)`

Accepts a 2-dimensional tensor or array, distances (which represents pair-wise distances among all points in a dataset), and a 1-dimensional tensor or array, labels. Returns a float value between -1 and 1 representing, on average, how much better each data point matches its assigned cluster than its next-best cluster. Higher scores are better.
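
The sketch below shows the standard silhouette computation on plain arrays. It is an illustration of the general technique under the usual definition, not necessarily how this library computes it; silhouetteSketch and the sample data are hypothetical.

```javascript
// Hypothetical sketch of the mean silhouette score. For each point:
//   a = mean distance to the other members of its own cluster
//   b = smallest mean distance to the members of any other cluster
//   s = (b - a) / max(a, b)
function silhouetteSketch(distances, labels) {
  const clusters = [...new Set(labels)]
  let total = 0

  for (let i = 0; i < labels.length; i++) {
    // Mean distance from point i to each cluster, excluding point i itself.
    const means = new Map()

    for (const c of clusters) {
      let sum = 0
      let count = 0

      for (let j = 0; j < labels.length; j++) {
        if (j !== i && labels[j] === c) {
          sum += distances[i][j]
          count++
        }
      }

      means.set(c, count > 0 ? sum / count : null)
    }

    const a = means.get(labels[i])
    if (a === null) continue // a singleton cluster contributes a score of 0

    let b = Infinity

    for (const [c, m] of means) {
      if (c !== labels[i] && m !== null && m < b) b = m
    }

    if (b === Infinity) continue // only one cluster exists

    const denominator = Math.max(a, b)
    total += denominator === 0 ? 0 : (b - a) / denominator
  }

  return total / labels.length
}

// Four points on a line at 0, 1, 10, and 11.
const xs = [0, 1, 10, 11]
const lineDistances = xs.map(p => xs.map(q => Math.abs(p - q)))

const goodScore = silhouetteSketch(lineDistances, [0, 0, 1, 1])
const badScore = silhouetteSketch(lineDistances, [0, 1, 0, 1])
console.log(goodScore > badScore) // true
```

Grouping the nearby points together gives a score close to 1, while the mismatched labelling gives a negative score.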

`sortByColumn(x, j)`

Accepts a 2-dimensional tensor or array, x, and a non-negative integer, j (which represents the column number of x by which to sort the dataset). Returns a 2-dimensional tf.Tensor sorted by column j.
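
On plain arrays, the operation amounts to the sketch below. Ascending order is an assumption, since the description above does not state the sort direction, and sortByColumnSketch is a hypothetical name.

```javascript
// Hypothetical sketch: returns a sorted copy of the rows of `x`, ordered by
// the values in column `j`. Ascending order is assumed for this example.
function sortByColumnSketch(x, j) {
  return x.slice().sort((rowA, rowB) => rowA[j] - rowB[j])
}

const unsorted = [[3, 30], [1, 10], [2, 20]]
console.log(sortByColumnSketch(unsorted, 0)) // [ [ 1, 10 ], [ 2, 20 ], [ 3, 30 ] ]
```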

`sortCentroids(ctrue, cpred)`

Accepts two 2-dimensional tensors or arrays, ctrue and cpred. Returns a 2-dimensional tf.Tensor with the same shape as cpred, containing the rows of cpred reordered so that they best align with the rows of ctrue.
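
One simple way to align two sets of centroids is greedy nearest-neighbor matching, sketched below on plain arrays. This is only an assumption about how such an alignment could work; the library's actual matching strategy is not documented here, and sortCentroidsSketch is a hypothetical name. The sketch also assumes that both inputs contain the same number of centroids.

```javascript
// Hypothetical sketch: for each true centroid in turn, take the closest
// predicted centroid that has not been used yet (greedy matching).
function sortCentroidsSketch(ctrue, cpred) {
  const remaining = cpred.map(c => c.slice())

  return ctrue.map(t => {
    let best = 0
    let bestDist = Infinity

    remaining.forEach((p, i) => {
      const d = p.reduce((sum, v, j) => sum + (v - t[j]) ** 2, 0)

      if (d < bestDist) {
        bestDist = d
        best = i
      }
    })

    // Remove the matched centroid so that it cannot be matched twice.
    return remaining.splice(best, 1)[0]
  })
}

const aligned = sortCentroidsSketch([[0, 0], [10, 10]], [[9, 11], [1, -1]])
console.log(aligned) // [ [ 1, -1 ], [ 9, 11 ] ]
```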

`StandardScaler`

Properties

`means`

A 1-dimensional tf.Tensor representing the average value of each column of a dataset. The default is null.

`stdevs`

A 1-dimensional tf.Tensor representing the standard deviation of each column of a dataset. The default is null.

Methods

`fit(x)`

Accepts a 2-dimensional tensor or array, x. Finds and stores the means and standard deviations of each column of x as means and stdevs respectively. Returns the StandardScaler instance.

`transform(x1, x2, x3, ...)`

Accepts any number of 2-dimensional tensors or arrays. Returns the same number of 2-dimensional tf.Tensor instances, all of which have been scaled and centered such that each column has a mean of 0 and a standard deviation of 1.

`untransform(x1, x2, x3, ...)`

Accepts any number of 2-dimensional tensors or arrays. Returns the same number of 2-dimensional tf.Tensor instances, all of which have been scaled and translated using the stored means and stdevs.
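
The relationship between fit, transform, and untransform can be sketched on plain arrays as follows. StandardScalerSketch is a hypothetical stand-in, not the library's class: it handles a single array per call, uses the population standard deviation (an assumption), and does not guard against columns with a standard deviation of 0.

```javascript
// Hypothetical plain-array stand-in for the idea behind StandardScaler.
class StandardScalerSketch {
  constructor() {
    this.means = null
    this.stdevs = null
  }

  // Stores the mean and (population) standard deviation of each column.
  fit(x) {
    const n = x.length
    const columns = x[0].map((_, j) => x.map(row => row[j]))
    this.means = columns.map(col => col.reduce((a, b) => a + b, 0) / n)
    this.stdevs = columns.map((col, j) =>
      Math.sqrt(col.reduce((a, v) => a + (v - this.means[j]) ** 2, 0) / n)
    )
    return this
  }

  // Centers and scales each column using the stored statistics.
  transform(x) {
    return x.map(row => row.map((v, j) => (v - this.means[j]) / this.stdevs[j]))
  }

  // Reverses `transform`, mapping values back to the original units.
  untransform(x) {
    return x.map(row => row.map((v, j) => v * this.stdevs[j] + this.means[j]))
  }
}

const raw = [[1, 10], [3, 30]]
const scaler = new StandardScalerSketch().fit(raw)
const scaled = scaler.transform(raw)
console.log(scaled) // [ [ -1, -1 ], [ 1, 1 ] ]
console.log(scaler.untransform(scaled)) // [ [ 1, 10 ], [ 3, 30 ] ]
```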

