Brainnet-tokenizer

Tokenizer

A simple and efficient tokenizer for natural language processing tasks. It supports multiple languages and treats special characters as separate tokens.

Features

Tokenizes text into words and special characters.
Encodes text into token IDs.
Decodes token IDs back into text.
Saves and loads vocabulary from a file.
Supports multiple languages.

Installation

Install the package from npm to use the Tokenizer class:

npm install @brainnet/tokenizer

Usage

Here is an example of how to use the Tokenizer class:

const Tokenizer = require('@brainnet/tokenizer');

// Create a Tokenizer instance
const tokenizer = new Tokenizer();

// Tokenize the text
const text = "Tokenization is a fundamental step in natural language processing.";
const tokens = tokenizer.tokenize(text);
console.log("Tokens:", tokens);

// Encode the text
const encodedResult = tokenizer.encode(text);
console.log("Encoding Result:", encodedResult);

// Decode the token IDs back into text
const decodedResult = tokenizer.decode(encodedResult.encodedArray);
console.log("Decoding Result:", decodedResult);

// Save the vocabulary to a file
const vocabularyFilePath = 'vocabulary.json';
tokenizer.saveVocabulary(vocabularyFilePath);
console.log("Vocabulary saved to", vocabularyFilePath);

// Create a new Tokenizer instance and load the vocabulary
const newTokenizer = new Tokenizer();
newTokenizer.loadVocabulary(vocabularyFilePath);
console.log("Vocabulary loaded from", vocabularyFilePath);

// Encode and decode the text using the loaded vocabulary
const newEncodedResult = newTokenizer.encode(text);
const newDecodedResult = newTokenizer.decode(newEncodedResult.encodedArray);
console.log("New Encoding Result:", newEncodedResult);
console.log("New Decoding Result:", newDecodedResult);

// Look up a token's ID, then convert the ID back to the token
const tokenId = tokenizer.getTokenId("Tokenization");
const token = tokenizer.getToken(tokenId);
console.log(`Token ID ${tokenId} corresponds to token: "${token}"`);

API

`Tokenizer`

`constructor()`

Creates an instance of Tokenizer.

`tokenize(text: string): string[]`

Tokenizes the input text into words and special characters.
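
To give a rough idea of what splitting text into words and special characters can look like, here is a minimal sketch. It is not the package's implementation: the function name `sketchTokenize` and the regular expression are assumptions made for illustration, and the real `tokenize` may split text differently.

```javascript
// Hypothetical sketch, NOT the code used by @brainnet/tokenizer.
// Words are runs of letters, combining marks, and digits in any script;
// every other non-whitespace character becomes its own token.
function sketchTokenize(text) {
  const pattern = /[\p{L}\p{M}\p{N}]+|[^\s\p{L}\p{M}\p{N}]/gu;
  // String.prototype.match returns null when nothing matches.
  return text.match(pattern) || [];
}

console.log(sketchTokenize("Hello, world! 42"));
// [ 'Hello', ',', 'world', '!', '42' ]
console.log(sketchTokenize("Привет, мир!"));
// [ 'Привет', ',', 'мир', '!' ]
```

The Unicode property escapes (`\p{L}`, `\p{M}`, `\p{N}`) are what let a single pattern handle more than one script.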

`getTokenId(token: string): number`

Adds a token to the vocabulary if it doesn't exist, and returns its ID.
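
The add-if-missing behaviour can be pictured with a small vocabulary class. This is a sketch under assumptions: the class name `SketchVocabulary`, the internal `Map` and array, and IDs starting at 0 are all hypothetical and may not match how the package stores its vocabulary.

```javascript
// Hypothetical sketch of an add-if-missing vocabulary; not the package's code.
class SketchVocabulary {
  constructor() {
    this.tokenToId = new Map(); // token -> ID
    this.idToToken = [];        // ID -> token (the array index is the ID)
  }

  // Return the existing ID, or assign the next free one to a new token.
  getTokenId(token) {
    if (!this.tokenToId.has(token)) {
      this.tokenToId.set(token, this.idToToken.length);
      this.idToToken.push(token);
    }
    return this.tokenToId.get(token);
  }

  // Return the token for an ID, or null when the ID is unknown.
  getToken(tokenId) {
    return this.idToToken[tokenId] ?? null;
  }

  getVocabularySize() {
    return this.idToToken.length;
  }
}

const vocab = new SketchVocabulary();
const first = vocab.getTokenId("hello"); // new token, gets a fresh ID
const again = vocab.getTokenId("hello"); // known token, same ID
console.log(first === again);            // true
console.log(vocab.getVocabularySize());  // 1
console.log(vocab.getToken(99));         // null
```

Because the lookup also inserts, calling `getTokenId` with an unseen token grows the vocabulary as a side effect.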

`getToken(tokenId: number): string | null`

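Converts a token ID back to its corresponding token. The `string | null` return type suggests that `null` is returned for an ID that is not in the vocabulary.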

`getVocabularySize(): number`

Returns the size of the vocabulary.

`encode(text: string): Object`

Encodes the input text into token IDs. The returned object exposes the IDs as an array in its `encodedArray` property.

`decode(encodedArray: number[]): Object`

Decodes an array of token IDs, such as the `encodedArray` returned by `encode`, back into text. The result is returned as an object.
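
The encode and decode round trip can be sketched in a few lines. Everything here is an assumption made for illustration: the helper names `sketchEncode` and `sketchDecode`, the whitespace split standing in for the real tokenizer, and the `decodedText` property. Only `encodedArray` is taken from the usage example; the shape of the object returned by the real `decode` is not documented here.

```javascript
// Hypothetical sketch of an encode/decode round trip; not the package's code.
const tokenToId = new Map(); // token -> ID
const idToToken = [];        // ID -> token

function sketchEncode(text) {
  // Whitespace split is a stand-in for the real tokenization step.
  const tokens = text.split(/\s+/).filter(Boolean);
  const encodedArray = tokens.map((token) => {
    if (!tokenToId.has(token)) {
      tokenToId.set(token, idToToken.length);
      idToToken.push(token);
    }
    return tokenToId.get(token);
  });
  return { encodedArray };
}

function sketchDecode(encodedArray) {
  // Unknown IDs decode to an empty string in this sketch.
  const tokens = encodedArray.map((id) => idToToken[id] ?? "");
  return { decodedText: tokens.join(" ") }; // `decodedText` is an invented name
}

const { encodedArray } = sketchEncode("to be or not to be");
console.log(encodedArray);                            // [ 0, 1, 2, 3, 0, 1 ]
console.log(sketchDecode(encodedArray).decodedText);  // to be or not to be
```

Repeated tokens map to the same ID, which is why the array above contains only four distinct values for six words.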

`saveVocabulary(filePath: string): void`

Saves the vocabulary to a file.

`loadVocabulary(filePath: string): void`

Loads the vocabulary from a file.
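
Saving and loading a vocabulary can be as simple as writing the token list to JSON and rebuilding the lookup on load. The sketch below assumes a hypothetical file format (a JSON array in which the index is the token ID) and hypothetical helper names; the actual layout of the file written by `saveVocabulary` may differ.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical sketch; the on-disk format is assumed, not the package's.
// The vocabulary is stored as a JSON array where index = token ID.
function sketchSaveVocabulary(idToToken, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(idToToken), "utf8");
}

function sketchLoadVocabulary(filePath) {
  const idToToken = JSON.parse(fs.readFileSync(filePath, "utf8"));
  // Rebuild the token -> ID lookup from the stored order.
  const tokenToId = new Map(idToToken.map((token, id) => [token, id]));
  return { idToToken, tokenToId };
}

const filePath = path.join(os.tmpdir(), "sketch-vocabulary.json");
sketchSaveVocabulary(["hello", "world", "!"], filePath);
const loaded = sketchLoadVocabulary(filePath);
console.log(loaded.tokenToId.get("world")); // 1
fs.unlinkSync(filePath); // remove the temporary file
```

Keeping the stored order intact means a token receives the same ID after loading as it had before saving.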

License

This project is licensed under the Apache-2.0 License.