# @flexpilot-ai/tokenizers

v0.0.1 • Published 1 year ago
## Main Features
- Fast and Efficient: Leverages Rust's performance for rapid tokenization.
- Versatile: Supports various tokenization models including BPE, WordPiece, and Unigram.
- Easy Integration: Seamlessly use pre-trained tokenizers in your Node.js projects.
- Customizable: Fine-tune tokenization parameters for your specific use case.
- Production-Ready: Designed for both research and production environments.
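As a rough intuition for what a BPE model does (an illustrative sketch of a single merge step, not this library's implementation), BPE repeatedly merges the most frequent adjacent symbol pair in the corpus:

```javascript
// Illustrative single BPE merge step: find the most frequent adjacent
// symbol pair and merge every occurrence. Not this library's code.
function bpeMergeStep(tokens) {
  // Count adjacent pairs
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const pair = tokens[i] + "\u0000" + tokens[i + 1];
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }

  // Pick the most frequent pair (first seen wins ties)
  let best = null;
  let bestCount = 0;
  for (const [pair, c] of counts) {
    if (c > bestCount) {
      best = pair;
      bestCount = c;
    }
  }
  if (!best) return tokens;

  // Merge every occurrence of the winning pair
  const [a, b] = best.split("\u0000");
  const merged = [];
  for (let i = 0; i < tokens.length; i++) {
    if (i < tokens.length - 1 && tokens[i] === a && tokens[i + 1] === b) {
      merged.push(a + b);
      i++; // skip the second half of the merged pair
    } else {
      merged.push(tokens[i]);
    }
  }
  return merged;
}

// "low low er" as characters: the pair "l"+"o" is most frequent and merges
console.log(bpeMergeStep(["l", "o", "w", "l", "o", "w", "e", "r"]).join(" "));
// → lo w lo w e r
```

Repeating this step until a vocabulary budget is reached is the essence of BPE training; WordPiece and Unigram use different selection criteria over the same idea of a learned subword vocabulary.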
## Installation

Install the package using npm:

```bash
npm install @flexpilot-ai/tokenizers
```

## Usage Example
Here's an example demonstrating how to use the Tokenizer class:
```typescript
import { Tokenizer } from "@flexpilot-ai/tokenizers";
import fs from "fs";

// Read the tokenizer configuration file
const fileBuffer = fs.readFileSync("path/to/tokenizer.json");
const byteArray = Array.from(fileBuffer);

// Create a new Tokenizer instance
const tokenizer = new Tokenizer(byteArray);

// Encode a string
const text = "Hello, y'all! How are you 😁 ?";
const encoded = tokenizer.encode(text, true);
console.log("Encoded:", encoded);

// Decode the tokens
const decoded = tokenizer.decode(encoded, false);
console.log("Decoded:", decoded);

// Use the fast encoding method
const fastEncoded = tokenizer.encodeFast(text, true);
console.log("Fast Encoded:", fastEncoded);
```

## API Reference
### Tokenizer

The main class for handling tokenization.

#### Constructor

```typescript
constructor(bytes: Array<number>)
```

Creates a new `Tokenizer` instance from a configuration provided as an array of bytes.

- `bytes`: An array of numbers representing the tokenizer configuration.
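Any Node.js `Buffer` converts to the expected byte array. A minimal sketch (the in-memory JSON below is a stand-in for a real `tokenizer.json` file, and no `Tokenizer` is constructed here):

```javascript
// A tokenizer.json config is just bytes; Array.from on a Buffer yields
// the Array<number> the constructor expects. The JSON here is a
// placeholder, not a valid tokenizer configuration.
const configJson = JSON.stringify({ model: { type: "BPE" } });
const fileBuffer = Buffer.from(configJson, "utf8");
const byteArray = Array.from(fileBuffer);

// Every element is an integer byte value in 0..255
console.log(byteArray.every((b) => Number.isInteger(b) && b >= 0 && b <= 255));
```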
#### Methods

##### encode

```typescript
encode(input: string, addSpecialTokens: boolean): Array<number>
```

Encodes the input text into token IDs.

- `input`: The text to tokenize.
- `addSpecialTokens`: Whether to add special tokens during encoding.
- Returns: An array of numbers representing the token IDs.
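As a rough illustration of what `addSpecialTokens` controls (the IDs below are hypothetical, BERT-style values; the real special tokens and IDs come from the loaded `tokenizer.json`):

```javascript
// Hypothetical special-token IDs (BERT-style), for illustration only
const CLS_ID = 101;
const SEP_ID = 102;
const rawIds = [7592, 1010, 2017]; // hypothetical IDs for the raw text

// With addSpecialTokens = true, the raw sequence is wrapped in the
// tokenizer's special tokens; with false, only rawIds would be returned.
const withSpecial = [CLS_ID, ...rawIds, SEP_ID];
console.log(withSpecial.join(","));
// → 101,7592,1010,2017,102
```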
##### decode

```typescript
decode(ids: Array<number>, skipSpecialTokens: boolean): string
```

Decodes the token IDs back into text.

- `ids`: An array of numbers representing the token IDs.
- `skipSpecialTokens`: Whether to skip special tokens during decoding.
- Returns: The decoded text as a string.
##### encodeFast

```typescript
encodeFast(input: string, addSpecialTokens: boolean): Array<number>
```

A faster version of the `encode` method for tokenizing text.

- `input`: The text to tokenize.
- `addSpecialTokens`: Whether to add special tokens during encoding.
- Returns: An array of numbers representing the token IDs.
## Contributing
We welcome contributions! Please see our Contributing Guide for more details.
## License
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
## Acknowledgments
- This library is based on the HuggingFace Tokenizers Rust implementation.
- Special thanks to the Rust and Node.js communities for their invaluable resources and support.