
Tokeniz3r

A module that tokenizes strings!

What is a tokenizer?

Tokenization is often the first step when interpreting strings, for example by a compiler. A tokenizer breaks a string down into tokens, a process formally known as "lexical analysis". Tokens can be identified in various ways, for example using regular expressions (the method used by Tokeniz3r). A collection of token types and their descriptions (referred to as grammar rules in Tokeniz3r) forms a "lexical grammar" that describes all valid token types.
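
For example, given a lexical grammar with NUMBER and ADD token types, a tokenizer would turn an arithmetic string into a stream of typed tokens (a hand-worked illustration):

Input string:     '3 + 12'
Resulting tokens: NUMBER("3"), ADD("+"), NUMBER("12")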

How does Tokeniz3r work?

The Tokeniz3r node module, once imported into your own npm project, exposes a single Tokenizer class with a number of public methods and properties. You create instances of Tokenizer from a string that you want to break into tokens and a grammar that you specify yourself, which uses regular expressions to find matches in the input string. When you instantiate a Tokenizer object, an active token is identified at the start of the string (i.e. the best match according to the grammar rules provided). You can then get information about the active token, as well as move the active token to the next and previous tokens within the boundaries of the string.

Prerequisites

  • Node.js installed
  • NPM installed (this comes with Node.js)

Using Tokeniz3r

  1. Install the tokeniz3r npm package in your project.
npm install tokeniz3r --save
  2. Import the Tokenizer (ES6**) module.
import Tokenizer from 'tokeniz3r'
  3. Create an input string.
const inputStr = 'test string.'
  4. Create a grammar object.
const WordAndDotGrammar = [
  {
    tokenType: 'WORD',
    regex: /^[\wåäöÅÄÖ]+/i
  },
  {
    tokenType: 'DOT',
    regex: /^\./
  }
]
  5. Create an instance of the Tokenizer class.
const tokenizer = new Tokenizer(inputStr, WordAndDotGrammar)
  6. Use the instance's public methods and properties.
console.log(tokenizer.getActiveToken().toString())

** Note that due to time constraints I have not been able to publish a build of this module to npm that is backwards compatible. Tokeniz3r uses ES Modules and must be imported as such, i.e. the CommonJS 'require' method will not work. Your project should therefore use ES Modules (at least ES6).
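
If your project is not already set up for ES Modules, a minimal package.json along these lines makes Node treat your .js files as ES Modules (the name here is just a placeholder; "type": "module" is the relevant part):

{
  "name": "my-tokenizer-demo",
  "type": "module",
  "dependencies": {
    "tokeniz3r": "^2.0.1"
  }
}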

Grammar formatting

The grammar argument must be an array of objects. Each object represents a grammar rule and has a 'tokenType' property (with any string value) and a 'regex' property that is used to apply the rule to strings and find matching tokens. Another example (ArithmeticGrammar) is shown below.

const ArithmeticGrammar = [
  {
    tokenType: 'NUMBER',
    regex: /^[0-9]+(\.([0-9])+)?/
  },
  {
    tokenType: 'ADD',
    regex: /^[+]/
  },
  {
    tokenType: 'MUL',
    regex: /^[*]/
  }
]
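
As a quick illustration (a sketch continuing from the ArithmeticGrammar definition above, following the usage steps already shown), instantiating a Tokenizer with this grammar should make NUMBER the first active token of an arithmetic string:

import Tokenizer from 'tokeniz3r'

const tokenizer = new Tokenizer('3+2*4', ArithmeticGrammar)
console.log(tokenizer.getActiveToken().toString()) // NUMBER("3")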

Some important things to note when forming the grammar argument are:

  • The caret symbol ^ must be used at the start of each regex pattern. This ensures that, as Tokeniz3r moves forward through your input string, it always finds a match at the start of the remaining string (see the sketch after this list).
  • While flags can be used in your regex properties, the global flag should NOT be used, as token matching is based on single matches. Global matching is not handled by the app.
  • When using capturing groups in your regex patterns, be aware that the app only uses the first item (the whole match) from the returned match array.
  • regex properties can be formatted in one of three ways.
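
Here is a minimal sketch (not Tokeniz3r's internals) of why the anchor matters: a rule is applied to the not-yet-consumed part of the input, and the ^ guarantees the match starts exactly at the current position rather than somewhere further along:

const rule = { tokenType: 'WORD', regex: /^[\wåäöÅÄÖ]+/ }
const remainingInput = 'two.'
const match = rule.regex.exec(remainingInput)
if (match !== null) {
  // match[0] is the whole match; any capturing groups (match[1]...) are ignored
  console.log(`${rule.tokenType}("${match[0]}")`) // WORD("two")
}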

Other things you should be aware of:

  • The input string is trimmed to remove spaces at the beginning and end.
  • Spaces between tokens are not included in tokens.
  • The only whitespace handled by Tokeniz3r is the regular word space (the common whitespace character U+0020 ' ' SPACE, also ASCII 32). You should therefore NOT use other whitespace characters in your string, such as tabs and hard line breaks.

The END token

This token is automatically added to your grammar rules by the app. It is used to show you when you have reached the end of your input string. The rule is formatted as follows:

    {
        tokenType: 'END',
        regex: /^\s*$/
    }
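
As a sketch of what this looks like in practice (assuming a simple two-rule grammar; the comments show the expected progression):

import Tokenizer from 'tokeniz3r'

const grammar = [
  { tokenType: 'WORD', regex: /^[a-z]+/i },
  { tokenType: 'DOT', regex: /^\./ }
]
const tokenizer = new Tokenizer('hi.', grammar)

tokenizer.setActiveTokenToNext() // active token is now the DOT token
tokenizer.setActiveTokenToNext() // past the last real token: END becomes active
console.log(tokenizer.getActiveToken().getType()) // END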

Active, next and previous tokens

Tokeniz3r instances provide three token-navigation functions: getActiveToken, setActiveTokenToNext and setActiveTokenToPrevious. The rules you define in the grammar argument determine the next active token when moving forward through the input string. When moving backwards, however, the app generates a comparable set of rules from your input grammar that finds regex matches at the end of strings instead of the beginning.
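
A minimal sketch of that idea (an illustration, not the module's actual implementation; toEndAnchored is a hypothetical helper):

// Mirror a ^-anchored rule into a $-anchored one, so matches are
// found at the end of a string instead of the start.
const toEndAnchored = (rule) => {
  const source = rule.regex.source.replace(/^\^/, '') // drop the leading ^
  return { tokenType: rule.tokenType, regex: new RegExp(source + '$', rule.regex.flags) }
}

console.log(toEndAnchored({ tokenType: 'DOT', regex: /^\./ }).regex) // /\.$/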

The longest match rule (a.k.a. Maximal Munch)

This rule is applied every time token matching takes place. If multiple matching tokens are found, the token with the longest matched value (i.e. the most characters) is considered best. When two or more best matches have the same length, a LexicalError is thrown.
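
The rule can be sketched like this (illustrative only; Tokeniz3r's real implementation and its LexicalError class are internal to the module, so plain Error is used here):

// Pick the rule whose match on the remaining input is longest (maximal munch).
function findBestMatch (remainingInput, grammar) {
  let best = null
  let tied = false
  for (const rule of grammar) {
    const match = rule.regex.exec(remainingInput)
    if (match === null) continue
    if (best === null || match[0].length > best.value.length) {
      best = { tokenType: rule.tokenType, value: match[0] }
      tied = false
    } else if (match[0].length === best.value.length) {
      tied = true
    }
  }
  if (best === null) throw new Error(`No grammar rule matches string '${remainingInput}'`)
  if (tied) throw new Error('Cannot get a longest match: two or more rules tie')
  return best
}

console.log(findBestMatch('3.14+2', [
  { tokenType: 'NUMBER', regex: /^[0-9]+(\.([0-9])+)?/ },
  { tokenType: 'ADD', regex: /^[+]/ }
])) // { tokenType: 'NUMBER', value: '3.14' }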

Tokenizer's public methods and properties

The following methods can be accessed on an instance of the Tokenizer class:

getActiveToken() - Returns the Tokenizer's active token. The returned object has two main public methods, getType() and getValue(), as well as a tailored toString() method that prettifies the Token instance's details for you.

getInputStrCurrentIndex() - Returns the index in the input string at which the first character of the active token is located.

setActiveTokenToNext() - Moves the active token to the next best matching token. Should not be used when the END token is active.

setActiveTokenToPrevious() - Moves the active token to the previous best matching token. Should not be used when the current index is 0.

Usage example

import Tokenizer from 'tokeniz3r'

const inputStr = 'one two.'
const WordAndDotGrammar = [
  {
    tokenType: 'WORD',
    regex: /^[\wåäöÅÄÖ]+/
  },
  {
    tokenType: 'DOT',
    regex: /^\./
  }
]

const tokenizer = new Tokenizer(inputStr, WordAndDotGrammar)

const token = tokenizer.getActiveToken()

console.log(`Token value: ${token.getValue()}`)
console.log(`Token type: ${token.getType()}`)
console.log(`Prettify token: ${token.toString()}`)
console.log(`Token starts at index ${tokenizer.getInputStrCurrentIndex()}`)

tokenizer.setActiveTokenToNext()
const token2 = tokenizer.getActiveToken()

console.log(`Prettify token2: ${token2.toString()}`)
console.log(`Token2 starts at index ${tokenizer.getInputStrCurrentIndex()}`)

Result:

Token value: one
Token type: WORD
Prettify token: WORD("one")
Token starts at index 0
Prettify token2: WORD("two")
Token2 starts at index 4
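
Continuing the example, moving backwards should make the first token active again (expected output shown in the comments):

tokenizer.setActiveTokenToPrevious()
console.log(`Prettify token: ${tokenizer.getActiveToken().toString()}`) // WORD("one")
console.log(`Token starts at index ${tokenizer.getInputStrCurrentIndex()}`) // 0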

Common Exceptions

The table below lists some exceptions that may be thrown when interacting with Tokenizer's interface.

Type                     Message
MethodCallError          setActiveTokenToNext should not be called when END token is active
MethodCallError          setActiveTokenToPrevious should not be called when first token is active
LexicalError             No grammar rule matches string 'x'
LexicalError             Cannot get a longest match from tokens 'x' and 'y'
GrammarValidationError   Grammar argument is not an array of expected objects
GrammarValidationError   Grammar rule found with missing property
GrammarValidationError   Grammar rule property of wrong type used
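
Since these problems surface as thrown exceptions, a small defensive sketch like the one below can help during development (the error classes themselves are internal to the module, so this simply inspects the thrown error's message):

import Tokenizer from 'tokeniz3r'

try {
  // Deliberately invalid: the grammar argument is not an array of rule objects
  const tokenizer = new Tokenizer('abc', { tokenType: 'WORD' })
  console.log(tokenizer.getActiveToken().toString())
} catch (error) {
  console.error(`Tokenizing failed: ${error.message}`)
}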