teafile v0.3.0

Warning: WIP, not functional yet

Usage

Install

npm install teafile

Read from buffer

import { Teafile } from 'teafile'

// binary data could come from an api endpoint
import axios from 'axios'
let { data } = await axios.get(`/api/teafile/AAPL.tea`, { responseType: 'arraybuffer' })

// data is a buffer
let file = Teafile.fromBuffer(data)
console.log(file)
/*
{
    itemDescription: "Time Price Flag",
    nameValues: {
        "Ticker": "AAPL,
        "Decimals": 2
    },
    timeScale: {
        epoch: 719162 (1970),
        ticksPerDay: 86400000
    },
    data: {
        Time: [...],
        Price: [...],
        Flag: [...]
    }
}
*/

Write to buffer

import { Teafile } from 'teafile'

let file = new Teafile()

file.itemDescription = "Time Price Flag"
file.epoch(719162)
file.nameValue("Ticker", "AAPL")
file.nameValue("Decimals", 2)
file.write(1581447223000, 23.51, 1) // Tuesday, February 11, 2020 6:53:43.000 PM GMT+00:00
file.write(1581447223001, 23.52, 1) // Tuesday, February 11, 2020 6:53:43.001 PM GMT+00:00

let data = file.toBuffer()
console.log(data)
// [ <binary data> ]


// You could upload the data to a server
import axios from 'axios'
axios.post(`/endpoint/teafile`, {data})

Pretty-print a summary

import { Teafile } from 'teafile'

// assume you have the binary data in a buffer
let binaryData = new ArrayBuffer(...)

let data = Teafile.fromBuffer(binaryData)
console.log(data.summary())

/*
data = {
    itemDescription: "Time Price Flag",
    nameValues: {
        "Ticker": "AAPL,
        "Decimals": 2
    },
    timeScale: {
        epoch: 719162 (1970),
        ticksPerDay: 86400000
    },
    data: {
        Time: [...],
        Price: [...],
        Flag: [...]
    }
}

*/

What is the teafile format

Teafile is a file format for storing time series as binary flat files.

  • An optional header holds a description of the file contents: the item type layout (schema), metadata as key/value pairs, and a description of how the datetime part of each item is encoded.
  • The file format is designed to be simple, so APIs can easily be written in any language.
  • DiscreteLogics publishes the format and releases APIs for C#, C++ and Python under the GPL.

I'll describe, at a high level, how this format stores your data and what you might want to use it for.

High-level overview of the format

I highly encourage anyone to read the exact specification for the Teafile format - check it out here: http://discretelogics.com/resources/teafilespec/. It is clearly and concisely written.

Teafiles start with a header followed by optional sections, and finally the item area holding the time series data in binary format.

Header (mandatory)

The header is mandatory. Any teafile that doesn't contain the information below in the first 32 bytes of the file is ill-formed.

Bytes | Name          | Description
------|---------------|------------
8     | Magic value   | Mandatory: 0x0d0e0a0402080500; also used to determine the endianness of the file
8     | Item Start    | The byte offset at which the items start
8     | Item End      | The byte offset at which the items end
8     | Section Count | The number of sections that follow
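
To make that layout concrete, here is a minimal sketch (not part of the teafile package; the function name and return shape are my own) of reading these four fields with a DataView:

// A minimal sketch of reading the 32-byte header with a DataView.
function readHeader(arrayBuffer) {
  const view = new DataView(arrayBuffer)
  // The magic value doubles as an endianness probe: in a little-endian file the
  // first four bytes read back as 0x02080500, the low half of the magic value.
  // (A full reader would also verify the big-endian case.)
  const littleEndian = view.getUint32(0, true) === 0x02080500
  return {
    littleEndian,
    itemStart: Number(view.getBigUint64(8, littleEndian)),     // byte offset where items start
    itemEnd: Number(view.getBigUint64(16, littleEndian)),      // byte offset where items end
    sectionCount: Number(view.getBigUint64(24, littleEndian))  // number of sections that follow
  }
}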

Sections (optional)

The sections are optional. There are four predefined section types (plus custom sections), and a given section type cannot be repeated:

Name                | Hexadecimal representation | What is it for?
--------------------|----------------------------|----------------
Item Section        | 0x0a                       | Describes the types (int64, float, etc.) of the fields in the binary data
Time Section        | 0x40                       | Describes the format the timestamps use
Description Section | 0x80                       | Describes the contents of the file
NameValue Section   | 0x81                       | Arbitrary metadata; Key: Value pairs is the only style supported
Custom              | larger than 0xffff         | You can do whatever you want

According to the spec, a teafile doesn't strictly require metadata sections. Does it make sense to have zero sections describing the data? If you had zero sections, your file would simply be the header described above followed by the binary data, and you would need to know exactly how you stored it. That might be the case for applications that store their data in exactly the same format every time a file is saved.

A section always looks like this:

Bytes | Description         | What is it for?
------|---------------------|----------------
4     | Section Type        | Tells the program reading the file what type of section follows
4     | Next Section Offset | The byte offset from the current byte location to the next section
0 - N | The section's data  | The actual metadata for the section

The Next Section Offset is very useful because it allows an application to jump from section to section, parsing only the metadata it's interested in and skipping anything it doesn't know how to parse (for example, section types it hasn't implemented a parser for).
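
As a sketch of that skip-ahead behaviour (again, not the package's API; I'm assuming the offset is counted from the byte right after the offset field itself, so check the spec before relying on it):

// Walk the section list, yielding each section's type and data position and
// jumping over any section type we don't know how to parse.
function* readSections(view, sectionCount, littleEndian) {
  let pos = 32 // sections start right after the 32-byte header
  for (let i = 0; i < sectionCount; i++) {
    const type = view.getInt32(pos, littleEndian)
    const nextSectionOffset = view.getInt32(pos + 4, littleEndian)
    yield { type, dataStart: pos + 8 }
    // assumed: the offset is relative to the byte right after this field
    pos = pos + 8 + nextSectionOffset
  }
}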

Item Section

The Item Section lets you describe your binary data and how it is laid out in memory.

  1. You can specify the field type. The types have the following values:

// platform agnostic
Int8   = 1
Int16  = 2
Int32  = 3
Int64  = 4
UInt8  = 5
UInt16 = 6
UInt32 = 7
UInt64 = 8
Float  = 9  // IEEE 754
Double = 10 // IEEE 754

// platform specific
NetDecimal = 0x200

// private extensions must have integer identifiers above 0x1000
Custom = 0x1000
  2. The field's offset within the item.

Let's say you had the following item:

struct Tick {
    int64 timestamp; // occupies 8 bytes
    int32 flag; // occupies 4 bytes
    float price; // occupies 4 bytes
    int64 volume; // occupies 8 bytes
}

If you were describing the price field, the offset value would be 12.

  3. Field Name

The string representation for this field. Useful as a key in a map.
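
Putting those three pieces together, a hypothetical description of the Tick item above might look like this (purely illustrative, not the package's API; the kind codes come from the list above):

const tickItemDescription = {
  itemSize: 24, // 8 + 4 + 4 + 8 bytes per item
  fields: [
    { name: 'timestamp', kind: 4, offset: 0 },  // Int64
    { name: 'flag',      kind: 3, offset: 8 },  // Int32
    { name: 'price',     kind: 9, offset: 12 }, // Float (IEEE 754, 4 bytes)
    { name: 'volume',    kind: 4, offset: 16 }  // Int64
  ]
}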

Time Section

Here you specify how the timestamps are encoded in the binary data: the epoch and the number of ticks per day (the precision of your timestamps).
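
For example, with the epoch 719162 (which corresponds to 1970-01-01, as used in the Usage examples above) and 86400000 ticks per day (millisecond precision), converting a stored tick to a JavaScript Date is a short sketch (not the package's API):

function tickToDate(tick, epoch = 719162, ticksPerDay = 86400000) {
  // 719162 is the day number of 1970-01-01, so the difference from it
  // shifts the tick count onto the Unix epoch
  const daysSince1970 = tick / ticksPerDay + (epoch - 719162)
  return new Date(daysSince1970 * 86400000) // Date wants milliseconds since 1970
}

tickToDate(1581447223000) // Tue, 11 Feb 2020 18:53:43 GMT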

Description Section

Just text describing what is in the file.

NameValue Section

You can include a bunch of {Key: Value} pairs to describe your content. This is where you would jam in all your metadata about the files.

You have to specify the type of the value.

One of: Int32, Double, Text, Uuid

For example, if this was a stock ticker file you might have something like this:

Name        | Value                                 | Kind
------------|---------------------------------------|-----
Ticker      | IBM                                   | Text
DisplayName | International Business Machines Corp. | Text
Resolution  | Day                                   | Text
Feed        | Interactive Brokers                   | Text
Decimals    | 2                                     | Int32
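
Using the nameValue API shown in the Usage section above, attaching that metadata could look like this (I'm assuming the kind is inferred from the JavaScript value type):

import { Teafile } from 'teafile'

let file = new Teafile()
file.nameValue('Ticker', 'IBM')
file.nameValue('DisplayName', 'International Business Machines Corp.')
file.nameValue('Resolution', 'Day')
file.nameValue('Feed', 'Interactive Brokers')
file.nameValue('Decimals', 2) // stored as Int32 rather than Text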

Item data

This is the bulk of the file. It's binary data. You can read the Item Section in the header to know how to parse it. Or, if your application already knows exactly how the data is laid out, it can skip that part and just start parsing the data right away.
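
As a sketch of that second path (not the package's API), reading the Tick items from the earlier example out of a DataView would look something like this:

function readTicks(view, itemStart, itemEnd, littleEndian) {
  const ticks = []
  for (let pos = itemStart; pos + 24 <= itemEnd; pos += 24) { // 24 bytes per Tick
    ticks.push({
      timestamp: view.getBigInt64(pos, littleEndian),
      flag: view.getInt32(pos + 8, littleEndian),
      price: view.getFloat32(pos + 12, littleEndian),
      volume: view.getBigInt64(pos + 16, littleEndian)
    })
  }
  return ticks
}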

Use Cases

This file format allows for different workflows and access patterns. You might have an application that leans heavily on the NameValue section to store useful info about the data. Or you might have an application that reads the first 32 bytes to find the offset of the data, seeks straight to that offset, and efficiently rips through millions of .tea files (for example, stock market backtesting).
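
For that second pattern, a rough Node.js sketch (the file name and the little-endian assumption are mine) would read only the 32-byte header, pull out Item Start, and then read the item area directly:

import { open } from 'node:fs/promises'

const fh = await open('AAPL.tea', 'r')

// read just the 32-byte header
const header = Buffer.alloc(32)
await fh.read(header, 0, 32, 0)
const itemStart = Number(header.readBigUInt64LE(8)) // assumes a little-endian file

// seek straight to the item area and read it in one go
const { size } = await fh.stat()
const items = Buffer.alloc(size - itemStart)
await fh.read(items, 0, items.length, itemStart)
await fh.close()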

Author

William Hoyle - williamhoyle.ca