1.0.7 • Published 3 years ago
@tamim.jabr/tokenizer v1.0.7
Tokenizer
A package that helps you detect tokens in strings based on the grammar you choose.
How to install it?
npm i @tamim.jabr/tokenizer
How to import it?
import tokenizer from '@tamim.jabr/tokenizer'
How to use it?
A tokenizer is created by passing a grammar object and the string to tokenize:
import tokenizer from '@tamim.jabr/tokenizer'
const {
Tokenizer,
Grammar,
WordAndDotGrammar,
ArithmeticGrammar,
ExclamationGrammar,
MaximalMunchGrammar
} = tokenizer
const grammar = new WordAndDotGrammar()
const newTokenizer = new Tokenizer(grammar, 'hello World .')
newTokenizer.getActiveToken()
// expected : { tokenType: 'WORD', tokenValue: 'hello' }
newTokenizer.moveActiveTokenToNext()
newTokenizer.getActiveToken()
// expected : { tokenType: 'WORD', tokenValue: 'World' }
newTokenizer.moveActiveTokenToNext()
newTokenizer.getActiveToken()
// expected : { tokenType: 'DOT', tokenValue: '.' }
newTokenizer.hasNext()
// expected : true
newTokenizer.moveActiveTokenToNext()
newTokenizer.getActiveToken()
// expected : { tokenType: 'END', tokenValue: '' }
newTokenizer.hasNext()
// expected : false
newTokenizer.moveActiveTokenToNext()
// expected : Error: Invalid index for active token
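Conceptually, a grammar-based tokenizer like this repeatedly matches its regex list against the start of the remaining input. A minimal self-contained sketch of that idea (illustrative only, not the package's actual source; `matchNext` and `tokenize` are hypothetical helpers):

```javascript
// Sketch of grammar-based token matching (illustrative, not the real implementation).
const wordAndDotRules = [
  { tokenType: 'WORD', regex: /^[\w|åäöÅÄÖ]+/ },
  { tokenType: 'DOT', regex: /^\./ }
]

// Hypothetical helper: returns the first rule that matches the start of the input.
function matchNext (rules, input) {
  for (const { tokenType, regex } of rules) {
    const match = input.match(regex)
    if (match) {
      return { tokenType, tokenValue: match[0] }
    }
  }
  return null
}

// Consume the input token by token, finishing with an END token.
function tokenize (rules, input) {
  const tokens = []
  let rest = input.trim()
  while (rest.length > 0) {
    const token = matchNext(rules, rest)
    if (token === null) break
    tokens.push(token)
    rest = rest.slice(token.tokenValue.length).trim()
  }
  tokens.push({ tokenType: 'END', tokenValue: '' })
  return tokens
}

console.log(tokenize(wordAndDotRules, 'hello World .'))
// [ { tokenType: 'WORD', tokenValue: 'hello' },
//   { tokenType: 'WORD', tokenValue: 'World' },
//   { tokenType: 'DOT', tokenValue: '.' },
//   { tokenType: 'END', tokenValue: '' } ]
```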
Available Grammars:
- Word and dot grammar: detects words (including the Swedish characters åäöÅÄÖ) and dots using the following regex list:
[
{
tokenType: 'WORD',
regex: /^[\w|åäöÅÄÖ]+/
},
{
tokenType: 'DOT',
regex: /^\./
}
]
const wordAndDotGrammar = new WordAndDotGrammar()
- Arithmetic grammar: detects arithmetic expressions using the following regex list:
[
{
tokenType: 'NUMBER',
regex: /^[0-9]+(\.([0-9])+)?/
},
{
tokenType: 'ADD',
regex: /^[+]/
},
{
tokenType: 'MUL',
regex: /^[*]/
},
{
tokenType: 'DIV',
regex: /^[/]/
},
{
tokenType: 'SUB',
regex: /^[-]/
},
{
tokenType: 'LEFT_PARENTHESES',
regex: /^[\(]/
},
{
tokenType: 'RIGHT_PARENTHESES',
regex: /^[\)]/
}
]
const arithmeticGrammar = new ArithmeticGrammar()
- Maximal munch grammar: detects numbers and distinguishes between floats and integers by picking the longest match (maximal munch), using the following regex list:
[
{
tokenType: 'INTEGER',
regex: /^[0-9]+/
},
{
tokenType: 'FLOAT',
regex: /^[0-9]+\.[0-9]+/
}
]
const maximalMunchGrammar = new MaximalMunchGrammar()
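Under maximal munch, when several rules match, the longest match wins; that is why "3.14" becomes one FLOAT token rather than the INTEGER "3". A hedged sketch of that selection rule (illustrative only, not the package's source):

```javascript
// Sketch of the maximal munch rule: among all matching token types,
// keep the one whose matched value is longest (illustrative only).
const maximalMunchRules = [
  { tokenType: 'INTEGER', regex: /^[0-9]+/ },
  { tokenType: 'FLOAT', regex: /^[0-9]+\.[0-9]+/ }
]

function maximalMunch (rules, input) {
  let best = null
  for (const { tokenType, regex } of rules) {
    const match = input.match(regex)
    if (match && (best === null || match[0].length > best.tokenValue.length)) {
      best = { tokenType, tokenValue: match[0] }
    }
  }
  return best
}

console.log(maximalMunch(maximalMunchRules, '3.14'))
// { tokenType: 'FLOAT', tokenValue: '3.14' } — the FLOAT match is longer than INTEGER's '3'
console.log(maximalMunch(maximalMunchRules, '42'))
// { tokenType: 'INTEGER', tokenValue: '42' } — only INTEGER matches
```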
- Exclamation grammar: detects words (including the Swedish characters åäöÅÄÖ) and exclamation marks using the following regex list:
[
{
tokenType: 'WORD',
regex: /^[\w|åäöÅÄÖ]+/
},
{
tokenType: 'EXCLAMATION',
regex: /^\!/
}
]
const exclamationGrammar = new ExclamationGrammar()
Use your own grammar:
import tokenizer from '@tamim.jabr/tokenizer'
const {
Tokenizer,
Grammar
} = tokenizer
const regexAndTypesList = [
{
tokenType: 'WORD_WITHOUT_NUMBERS',
regex: /^[a-zA-Z]+/
},
{
tokenType: 'DOLLAR_SIGN',
regex: /^\$+/
}
]
const ownGrammar = new Grammar(regexAndTypesList)
const testString = '$ test string'
const newTokenizer = new Tokenizer(ownGrammar, testString)
newTokenizer.getActiveToken()
// expected:{ tokenType: 'DOLLAR_SIGN', tokenValue: '$' }
Public Interface (Methods):
- getActiveToken(): returns the current active token
- moveActiveTokenToNext(): moves the active token to the next token, if one exists
- moveActiveTokenToPrevious(): moves the active token to the previous token, if one exists
- hasNext(): returns a boolean indicating whether there is a token after the current active token. Note that the token of type "END" is a valid token, so hasNext() returns true when the active token is the one just before "END"
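One plausible way to model this interface is an array of tokens plus an index pointing at the active token (an assumption for illustration, not the package's actual implementation; `SketchTokenizer` is a hypothetical class):

```javascript
// Sketch of the public interface: tokens in an array, an index for the
// active token (illustrative only, not the package's real source).
class SketchTokenizer {
  constructor (tokens) {
    // `tokens` is assumed to already include the trailing END token.
    this.tokens = tokens
    this.index = 0
  }

  getActiveToken () {
    return this.tokens[this.index]
  }

  hasNext () {
    // END counts as a real token, so this is true until END is active.
    return this.index < this.tokens.length - 1
  }

  moveActiveTokenToNext () {
    if (!this.hasNext()) throw new Error('Invalid index for active token')
    this.index += 1
  }

  moveActiveTokenToPrevious () {
    if (this.index === 0) throw new Error('Invalid index for active token')
    this.index -= 1
  }
}

const t = new SketchTokenizer([
  { tokenType: 'WORD', tokenValue: 'hi' },
  { tokenType: 'DOT', tokenValue: '.' },
  { tokenType: 'END', tokenValue: '' }
])
t.moveActiveTokenToNext()
console.log(t.getActiveToken()) // { tokenType: 'DOT', tokenValue: '.' }
console.log(t.hasNext()) // true — END is still ahead and counts as a token
```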
Lexical Errors:
Lexical errors happen when you send the tokenizer a string containing a part that the chosen grammar's regex list does not cover. The error is thrown when you first try to move the active token to the invalid part, or immediately on getActiveToken() if the lexical error is in the first token.
const arithmeticGrammar = new ArithmeticGrammar()
const newTokenizer = new Tokenizer(arithmeticGrammar, '4 hej + 2')
newTokenizer.moveActiveTokenToNext()
// expected: LexicalError: No lexical element matches "hej + 2"
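The failure mode can be sketched as: when no regex in the grammar matches the start of the remaining input, an error quoting the unmatched rest is thrown (illustrative only; the real LexicalError class lives inside the package, and `nextTokenOrThrow` is a hypothetical helper):

```javascript
// Sketch of lexical error detection: if no rule matches the start of the
// remaining input, the tokenizer cannot proceed (illustrative only).
const arithmeticRules = [
  { tokenType: 'NUMBER', regex: /^[0-9]+(\.([0-9])+)?/ },
  { tokenType: 'ADD', regex: /^[+]/ }
]

function nextTokenOrThrow (rules, rest) {
  for (const { tokenType, regex } of rules) {
    const match = rest.match(regex)
    if (match) return { tokenType, tokenValue: match[0] }
  }
  throw new Error(`No lexical element matches "${rest}"`)
}

console.log(nextTokenOrThrow(arithmeticRules, '4 hej + 2'))
// { tokenType: 'NUMBER', tokenValue: '4' } — the leading '4' still matches
try {
  nextTokenOrThrow(arithmeticRules, 'hej + 2')
} catch (e) {
  console.log(e.message) // No lexical element matches "hej + 2"
}
```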