0.2.0 • Published 3 years ago

arquero-arrow v0.2.0

License: BSD-3-Clause

arquero-arrow

Arrow serialization support for Arquero. The toArrow(data) method encodes either an Arquero table or an array of objects into the Apache Arrow format. This package provides a convenient interface to the apache-arrow JavaScript library, while also providing more performant encoders for standard integer, float, date, boolean, and string dictionary types.

API Documentation

# aq.toArrow(input[, options])

Create an Apache Arrow table for an input dataset. The input data can be either an Arquero table or an array of standard JavaScript objects. This method will throw an error if type inference fails or if the generated columns have differing lengths.
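To make the row-to-column conversion and the length check concrete, here is a minimal plain-JavaScript sketch. The helpers rowsToColumns and assertEqualLengths are hypothetical names introduced for illustration; they are not part of arquero-arrow, and the library's internal logic may differ.

```javascript
// Hypothetical helper (not part of arquero-arrow): pivot an array of row
// objects into named column arrays.
function rowsToColumns(rows, names) {
  const columns = {};
  for (const name of names) {
    // a missing property becomes null so that column lengths stay aligned
    columns[name] = rows.map(row => (name in row ? row[name] : null));
  }
  return columns;
}

// Hypothetical helper: throw if the columns do not all share one length,
// mirroring the kind of error described above.
function assertEqualLengths(columns) {
  const lengths = new Set(Object.values(columns).map(col => col.length));
  if (lengths.size > 1) {
    throw new Error('Columns have differing lengths');
  }
}

const columns = rowsToColumns(
  [{ x: 1, y: 3.4 }, { x: 2, y: 1.6 }, { x: 3 }],
  ['x', 'y']
);
assertEqualLengths(columns); // passes: both columns have three entries
// columns.x is [1, 2, 3]; columns.y is [3.4, 1.6, null]
```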

  • input: An input dataset to convert to Arrow format. If array-valued, the data should consist of an array of objects where each entry represents a row and named properties represent columns. Otherwise, the input data should be an Arquero table.
  • options: Options for Arrow encoding.

    • columns: Ordered list of column names to include. If function-valued, the function should accept the input data as a single argument and return an array of column name strings.
    • limit: The maximum number of rows to include (default Infinity).
    • offset: The row offset indicating how many initial rows to skip (default 0).
    • types: An optional object indicating the Arrow data type to use for named columns. If specified, the input should be an object with column names for keys and Arrow data types for values. If a column's data type is not explicitly provided, type inference will be performed.

      Type values can either be instantiated Arrow DataType instances (for example, new Float64(), new DateMillisecond(), etc.) or type enum codes (Type.Float64, Type.Date, Type.Dictionary). For convenience, arquero-arrow re-exports the apache-arrow Type enum object (see examples below). High-level types map to specific data type instances as follows:

      • Type.Date → new DateMillisecond()
      • Type.Dictionary → new Dictionary(new Utf8(), new Int32())
      • Type.Float → new Float64()
      • Type.Int → new Int32()
      • Type.Interval → new IntervalYearMonth()
      • Type.Time → new TimeMillisecond()

      Types that require additional parameters (including List, Struct, and Timestamp) cannot be specified using type codes. Instead, use data type constructors from apache-arrow, such as new List(new Int32()).
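The interplay of the columns, limit, and offset options can be illustrated with a small plain-JavaScript sketch over an object array. The selectRows helper below is a hypothetical name used only for illustration; it follows the option descriptions above and is not the library's implementation.

```javascript
// Hypothetical helper (not the library's code): apply columns, offset, and
// limit to an array of row objects as the option descriptions suggest.
function selectRows(rows, { columns, limit = Infinity, offset = 0 } = {}) {
  // columns may be an array of names or a function of the input data;
  // when omitted, this sketch falls back to the keys of the first row
  const names = typeof columns === 'function'
    ? columns(rows)
    : columns || Object.keys(rows[0] || {});

  // skip `offset` initial rows, then keep at most `limit` rows
  const end = Math.min(rows.length, offset + limit);
  return rows.slice(offset, end).map(row => {
    const out = {};
    for (const name of names) out[name] = row[name];
    return out;
  });
}

const rows = [
  { x: 1, y: 3.4, z: 'a' },
  { x: 2, y: 1.6, z: 'b' },
  { x: 3, y: 5.4, z: 'c' },
  { x: 4, y: 7.1, z: 'd' }
];

// keep only x and y, skip the first row, and take at most two rows
const picked = selectRows(rows, { columns: ['x', 'y'], offset: 1, limit: 2 });

// function-valued columns: derive the column names from the data itself
const firstOnly = selectRows(rows, { columns: d => Object.keys(d[0]).slice(0, 1) });
```

With these inputs, picked contains the second and third rows restricted to x and y, while firstOnly keeps all four rows but only the x column.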

Examples

Encode Arrow data from an input Arquero table:

const { table } = require('arquero');
const { toArrow, Type } = require('arquero-arrow');

// create Arquero table
const dt = table({
  x: [1, 2, 3, 4, 5],
  y: [3.4, 1.6, 5.4, 7.1, 2.9]
});

// encode as an Arrow table (infer data types)
// here, infers Uint8 for 'x' and Float64 for 'y'
const at1 = toArrow(dt);

// encode into Arrow table (set explicit data types)
const at2 = toArrow(dt, {
  types: {
    x: Type.Uint16,
    y: Type.Float32
  }
});

// serialize Arrow table to a transferable byte array
const bytes = at1.serialize();
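The comments in the example above note that inference yields Uint8 for 'x' and Float64 for 'y'. One plausible way to reach that result is to choose the narrowest integer type that fits the observed value range and fall back to Float64 otherwise. The sketch below assumes that strategy; inferNumberType is a hypothetical function, and the actual inference rules in arquero-arrow may differ.

```javascript
// Hypothetical sketch (not arquero-arrow's actual inference code): pick a
// type name for a numeric column from the range of its values.
function inferNumberType(values) {
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (v == null) continue;                    // ignore missing values
    if (!Number.isInteger(v)) return 'Float64'; // any fraction forces a float
    if (v < min) min = v;
    if (v > max) max = v;
  }
  if (min >= 0) {
    // non-negative integers: try unsigned widths from narrow to wide
    if (max < 2 ** 8) return 'Uint8';
    if (max < 2 ** 16) return 'Uint16';
    if (max < 2 ** 32) return 'Uint32';
  } else {
    // negative values present: try signed widths from narrow to wide
    if (min >= -(2 ** 7) && max < 2 ** 7) return 'Int8';
    if (min >= -(2 ** 15) && max < 2 ** 15) return 'Int16';
    if (min >= -(2 ** 31) && max < 2 ** 31) return 'Int32';
  }
  return 'Float64'; // fall back for integers outside the 32-bit range
}

const xType = inferNumberType([1, 2, 3, 4, 5]);           // 'Uint8'
const yType = inferNumberType([3.4, 1.6, 5.4, 7.1, 2.9]); // 'Float64'
```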

Register a toArrow() method for all Arquero tables:

const { internal: { ColumnTable }, table } = require('arquero');
const { toArrow } = require('arquero-arrow');

// add new method to Arquero tables
ColumnTable.prototype.toArrow = function(options) {
  // forward any encoding options (columns, limit, offset, types)
  return toArrow(this, options);
};

// create Arquero table, encode as an Arrow table (infer data types)
const at = table({
  x: [1, 2, 3, 4, 5],
  y: [3.4, 1.6, 5.4, 7.1, 2.9]
}).toArrow();

Encode Arrow data from an input object array:

const { toArrow } = require('arquero-arrow');

// encode object array as an Arrow table (infer data types)
const at = toArrow([
  { x: 1, y: 3.4 },
  { x: 2, y: 1.6 },
  { x: 3, y: 5.4 },
  { x: 4, y: 7.1 },
  { x: 5, y: 2.9 }
]);

Build Instructions

To build and develop locally:
