tokensize v1.0.0

Published 5 months ago • License: ISC

NPM Module Documentation

The tokenizer function takes a string as input and returns an object with the following properties:

  • count: the number of tokens in the input string
  • characters: the number of characters in the input string
  • text: the original input string
  • tokens: an array of objects, where each object represents a token and its position in the input string. Each token object has the following properties:
    • token: the token string
    • start: the starting index of the token in the input string
    • end: the ending index of the token in the input string

The tokenizer function uses the js-tiktoken library to encode the input string into tokens using the GPT-2 encoding scheme. It then decodes the tokens back into strings, maps the tokens to their positions in the input string using the mapTokensToChunks function, and returns the resulting object.
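The position-mapping step can be sketched as follows. This is not the package's actual mapTokensToChunks implementation (which is internal); it is a minimal illustration that assumes the decoded GPT-2 token strings concatenate back to exactly the original input, so each token's span can be found by walking a cursor through the string:

```javascript
// Hypothetical sketch of the position-mapping step. Assumes the decoded
// token strings, joined in order, reproduce the original input, so a
// running cursor gives each token's inclusive start/end offsets.
function mapTokensToChunks(text, tokenStrings) {
  const chunks = [];
  let cursor = 0;
  for (const token of tokenStrings) {
    const start = cursor;
    const end = cursor + token.length - 1; // inclusive end index
    chunks.push({ token, start, end });
    cursor += token.length;
  }
  return chunks;
}

// Decoded GPT-2 tokens for "This is" are ["This", " is"]:
console.log(mapTokensToChunks('This is', ['This', ' is']));
// → [ { token: 'This', start: 0, end: 3 }, { token: ' is', start: 4, end: 6 } ]
```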

Usage

To use this module, you can import the tokenizer function and call it with a string argument. Here's an example:

import { tokenizer } from 'tokensize';

const input = 'This is a sample input string.';
const result = await tokenizer(input);

console.log(result);
/*
{
  count: 7,
  characters: 30,
  text: 'This is a sample input string.',
  tokens: [
    { token: 'This', start: 0, end: 3 },
    { token: 'Ġis', start: 4, end: 6 },
    { token: 'Ġa', start: 7, end: 8 },
    { token: 'Ġsample', start: 9, end: 15 },
    { token: 'Ġinput', start: 16, end: 21 },
    { token: 'Ġstring', start: 22, end: 28 },
    { token: '.', start: 29, end: 29 }
  ]
}
*/
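A quick way to sanity-check the offsets, assuming start and end are inclusive character indices and that the 'Ġ' prefix in a GPT-2 token stands for a leading space: slicing the input with an inclusive end should recover each token's raw text.

```javascript
// Assumes inclusive start/end offsets, with 'Ġ' standing in for a
// leading space in GPT-2 token strings (an assumption, not verified
// against the package internals).
const input = 'This is a sample input string.';
const span = { token: 'Ġis', start: 4, end: 6 };

// slice() is end-exclusive, so add 1 to the inclusive end offset:
const piece = input.slice(span.start, span.end + 1);
console.log(piece);                                   // ' is'
console.log(piece === span.token.replace('Ġ', ' '));  // true
```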