0.0.1 • Published 6 months ago
@travisennis/datasets v0.0.1
@travisennis/datasets
A TypeScript library for interacting with the HuggingFace Datasets API.
Installation
npm install @travisennis/datasets
Usage
import { datasets } from '@travisennis/datasets';
import { z } from 'zod';
// Define your dataset schema
const schema = z.object({
question: z.string(),
answer: z.string(),
});
// Initialize the dataset
const ds = datasets({
dataset: 'hotpot_qa',
schema,
});
// Get dataset information
const info = await ds.info();
// Get dataset splits
const splits = await ds.splits();
// Get rows from a specific split
const rows = await ds.getRows({
split: 'train',
config: 'distractor',
options: { offset: 0, length: 100 }
});
// Stream data using iterator
for await (const batch of ds.data({
split: 'train',
config: 'distractor',
})) {
console.log(batch);
}
API Reference
datasets(options)
Creates a new dataset instance.
Options
dataset
: The name of the HuggingFace datasetschema
: A Zod schema defining the structure of the dataset
Methods
validate()
Checks if the dataset is valid and accessible.
splits()
Returns available splits in the dataset.
info(config = 'default')
Returns dataset information including description, citation, and statistics.
listParquetFiles()
Lists available parquet files for the dataset.
downloadParquetFiles(options)
Downloads parquet files locally.
options.downloadDir
: Optional custom download directory
getRows(options)
Returns a batch of rows from the dataset.
options.split
: Dataset split (e.g., 'train', 'test')options.config
: Dataset configurationoptions.options.offset
: Starting row indexoptions.options.length
: Number of rows to fetch
data(options)
Returns an async iterator for streaming dataset rows.
options.split
: Dataset splitoptions.config
: Dataset configurationoptions.options.offset
: Starting row indexoptions.options.length
: Batch size
Reference
License
MIT
0.0.1
6 months ago