1.0.0 • Published 11 months ago

brainnet-tokenizer v1.0.0

Tokenizer

A simple tokenizer for natural language processing tasks. It splits text into words and special characters, maps tokens to numeric IDs and back, and works with text in multiple languages.

Features

  • Tokenizes text into words and special characters.
  • Encodes text into token IDs.
  • Decodes token IDs back into text.
  • Saves and loads vocabulary from a file.
  • Supports multiple languages.
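
The first and last features can be illustrated with a short sketch. It is not the package's implementation, which is not shown here; it only demonstrates one common way to split text into words and special characters regardless of script, using Unicode property escapes. The function name sketchTokenize and the regular expression are assumptions made for illustration.

```javascript
// Hypothetical sketch, NOT the package's actual tokenizer.
// A "word" is a run of letters, combining marks, or digits in any script;
// every other non-whitespace character becomes a token of its own.
function sketchTokenize(text) {
  // The `u` flag enables \p{...} Unicode property escapes.
  const pattern = /[\p{L}\p{M}\p{N}]+|[^\s\p{L}\p{M}\p{N}]/gu;
  // String.prototype.match returns null when nothing matches.
  return text.match(pattern) || [];
}

console.log(sketchTokenize("Hello, world!"));
// [ 'Hello', ',', 'world', '!' ]
console.log(sketchTokenize("¿Qué tal?"));
// [ '¿', 'Qué', 'tal', '?' ]
```

Whitespace is discarded in this sketch, so the original spacing cannot be recovered from the tokens alone.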

Installation

Install the package from npm:

npm install @brainnet/tokenizer

Usage

Here is an example of how to use the Tokenizer class:

const Tokenizer = require('@brainnet/tokenizer');

// Create a Tokenizer instance
const tokenizer = new Tokenizer();

// Tokenize the text
const text = "Tokenization is a fundamental step in natural language processing.";
const tokens = tokenizer.tokenize(text);
console.log("Tokens:", tokens);

// Encode the text
const encodedResult = tokenizer.encode(text);
console.log("Encoding Result:", encodedResult);

// Decode the token IDs back into text
const decodedResult = tokenizer.decode(encodedResult.encodedArray);
console.log("Decoding Result:", decodedResult);

// Save the vocabulary to a file
const vocabularyFilePath = 'vocabulary.json';
tokenizer.saveVocabulary(vocabularyFilePath);
console.log("Vocabulary saved to", vocabularyFilePath);

// Create a new Tokenizer instance and load the vocabulary
const newTokenizer = new Tokenizer();
newTokenizer.loadVocabulary(vocabularyFilePath);
console.log("Vocabulary loaded from", vocabularyFilePath);

// Encode and decode the text using the loaded vocabulary
const newEncodedResult = newTokenizer.encode(text);
const newDecodedResult = newTokenizer.decode(newEncodedResult.encodedArray);
console.log("New Encoding Result:", newEncodedResult);
console.log("New Decoding Result:", newDecodedResult);

// Look up a token's ID (the token is added to the vocabulary if missing),
// then convert that ID back to the token
const tokenId = tokenizer.getTokenId("Tokenization");
const token = tokenizer.getToken(tokenId);
console.log(`Token ID ${tokenId} corresponds to token: "${token}"`);

API

Tokenizer

constructor()

Creates an instance of Tokenizer.

tokenize(text: string): string[]

Tokenizes the input text into words and special characters.

getTokenId(token: string): number

Adds a token to the vocabulary if it doesn't exist, and returns its ID.

getToken(tokenId: number): string | null

Converts a token ID back to its corresponding token, or returns null when the ID has no corresponding token.

getVocabularySize(): number

Returns the size of the vocabulary.
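
The three vocabulary methods above fit together in a way that is easy to miss: getTokenId is not a pure lookup, because it grows the vocabulary when it sees a new token. The sketch below mirrors the documented signatures with a hypothetical SketchVocabulary class. Starting IDs at 0 and assigning them in insertion order are assumptions, not documented behaviour of the package.

```javascript
// Hypothetical sketch of "look up or add" vocabulary behaviour.
// ID numbering (0-based, insertion order) is an assumption.
class SketchVocabulary {
  constructor() {
    this.tokenToId = new Map(); // token -> ID
    this.idToToken = [];        // ID (array index) -> token
  }

  // Returns the token's ID, adding the token first if it is unknown.
  getTokenId(token) {
    if (!this.tokenToId.has(token)) {
      this.tokenToId.set(token, this.idToToken.length);
      this.idToToken.push(token);
    }
    return this.tokenToId.get(token);
  }

  // Returns the token for an ID, or null when the ID is unknown.
  getToken(tokenId) {
    return this.idToToken[tokenId] ?? null;
  }

  getVocabularySize() {
    return this.idToToken.length;
  }
}

const vocab = new SketchVocabulary();
console.log(vocab.getTokenId("hello")); // 0 (added)
console.log(vocab.getTokenId("world")); // 1 (added)
console.log(vocab.getTokenId("hello")); // 0 (already present)
console.log(vocab.getVocabularySize()); // 2
console.log(vocab.getToken(99));        // null
```

Because a lookup can add entries, calling getTokenId with a misspelled token silently enlarges the vocabulary in this sketch; checking getVocabularySize before and after a call is one way to detect that.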

encode(text: string): Object

Encodes the input text into token IDs. The returned object exposes the IDs as an array in its encodedArray property, as used in the example above.

decode(encodedArray: number[]): Object

Decodes an array of token IDs, such as the encodedArray returned by encode, back into text.
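
To make the encode and decode round trip concrete, here is a minimal self-contained sketch. Only the encodedArray property name comes from the usage example; the createSketchCodec helper, the decodedText property, the regular expression, and the choice to join tokens with single spaces are all assumptions, and the package itself may behave differently.

```javascript
// Hypothetical sketch of an encode/decode round trip.
// Only `encodedArray` is taken from the README; everything else is assumed.
function createSketchCodec() {
  const tokenToId = new Map();
  const idToToken = [];

  function encode(text) {
    // Words (any script) or single non-whitespace symbols.
    const tokens = text.match(/[\p{L}\p{M}\p{N}]+|[^\s\p{L}\p{M}\p{N}]/gu) || [];
    const encodedArray = tokens.map((token) => {
      if (!tokenToId.has(token)) {
        tokenToId.set(token, idToToken.length);
        idToToken.push(token);
      }
      return tokenToId.get(token);
    });
    return { encodedArray };
  }

  function decode(encodedArray) {
    // IDs that are not in the vocabulary are skipped in this sketch.
    const tokens = encodedArray
      .map((id) => idToToken[id])
      .filter((token) => token !== undefined);
    return { decodedText: tokens.join(" ") };
  }

  return { encode, decode };
}

const codec = createSketchCodec();
const { encodedArray } = codec.encode("to be, or not to be");
console.log(encodedArray);
// [ 0, 1, 2, 3, 4, 0, 1 ]
console.log(codec.decode(encodedArray).decodedText);
// to be , or not to be
```

Repeated tokens share an ID, and in this sketch the decoded text differs from the input in spacing around punctuation, because whitespace is not stored in the token stream.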

saveVocabulary(filePath: string): void

Saves the vocabulary to a file.

loadVocabulary(filePath: string): void

Loads the vocabulary from a file.
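
Saving and reloading matters because a fresh instance would otherwise have to rebuild its vocabulary from whatever text it sees first. The sketch below shows one plausible way to persist a vocabulary with Node's built-in fs module. The on-disk format (a JSON array whose index is the token ID) and the function names are assumptions; the actual format written by saveVocabulary is not documented here.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical format: a JSON array where the array index is the token ID.
function saveVocabularySketch(filePath, idToToken) {
  fs.writeFileSync(filePath, JSON.stringify(idToToken, null, 2), "utf8");
}

function loadVocabularySketch(filePath) {
  const idToToken = JSON.parse(fs.readFileSync(filePath, "utf8"));
  if (!Array.isArray(idToToken)) {
    throw new TypeError(`Expected a JSON array in ${filePath}`);
  }
  // Rebuild the reverse lookup (token -> ID) from the array.
  const tokenToId = new Map(idToToken.map((token, id) => [token, id]));
  return { idToToken, tokenToId };
}

// Write to the OS temp directory so the example leaves the project untouched.
const sketchFilePath = path.join(os.tmpdir(), "sketch-vocabulary.json");
saveVocabularySketch(sketchFilePath, ["Tokenization", "is", "."]);
const loaded = loadVocabularySketch(sketchFilePath);
console.log(loaded.tokenToId.get("is")); // 1
console.log(loaded.idToToken[0]);        // Tokenization
```

Both functions are synchronous and throw on I/O or parse errors, which is consistent with the void return types listed above but is not confirmed behaviour of the package.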

License

This project is licensed under the Apache-2.0 License.