Lexer

JavaScript library for tokenizing strings with regular expressions

Installation

npm install @parsing-kit/lexer
# or, if you prefer yarn:
yarn add @parsing-kit/lexer

Usage

Code example

import { Lexer, regularExpression } from "@parsing-kit/lexer"

// Lexer input
const code = "print '👋'"

class MyLexer extends Lexer {
  // Lexer rules
  static rules = {
    print: regularExpression(/\b(print)\b/),
    string: regularExpression(/'([^'\\]|\\.)*'/),
    space: regularExpression(/( |\t)+/)
  }
}

// Initialize a lexer
const lexer = new MyLexer(code)

// Run the lexer and save tokens as an array
const tokens = lexer.tokenize()
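
Each element of tokens is a Token instance with a type, a value, and a position (see the API section below). A rough sketch of inspecting the result:

for (const token of tokens) {
  console.log(token.type, JSON.stringify(token.value))
}

// Possible output (illustrative):
// print "print"
// space " "
// string "'👋'"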

Defining rules

Basic rules

To tell the lexer how to tokenize a string, we need to define rules. We do that with an object where each key corresponds to a token type and each value is a rule object.

For example, suppose our input string is a space-separated list of integers. One possible way to define the rules is:

const rules = {
  integer: regularExpression(/[0-9]+/),
  space: regularExpression(/( )+/)
}
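
To try these rules out, they can be attached to a Lexer subclass just like in the Usage section above (IntegerLexer is a hypothetical name):

class IntegerLexer extends Lexer {
  static rules = rules
}

const tokens = new IntegerLexer("12 345").tokenize()
// expected token types: integer, space, integer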

Let's go a bit further and add support for hexadecimal integers. There are a few ways to do that; let's start with the easiest one: adding another token type.

const rules = {
  decimal: regularExpression(/[0-9]+/),
  hexadecimal: regularExpression(/0x[0-9a-fA-F]+/),
  space: regularExpression(/( )+/)
}

But this rules definition has a problem: the order of the rules. The lexer tries rules in the order they are specified in the object. So when trying to tokenize the string 0x9, we'll get a SyntaxError. The reason is that the lexer 'ate' the 0 character, because it matches the decimal rule, and then it saw the x character, which didn't match any rule.
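
A sketch of what that failure looks like, assuming the mis-ordered rules above are attached to a Lexer subclass (NumbersLexer is a hypothetical name):

class NumbersLexer extends Lexer {
  static rules = rules // the decimal-before-hexadecimal rules from above
}

try {
  new NumbersLexer("0x9").tokenize()
} catch (error) {
  // expected: the library's SyntaxError, with error.position pointing at the unexpected x
}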

The way to fix this is to reorder the rules:

const rules = {
  hexadecimal: regularExpression(/0x[0-9a-fA-F]+/),
  decimal: regularExpression(/[0-9]+/),
  space: regularExpression(/( )+/)
}

But what if we want both hexadecimal and decimal numbers to be tokenized as integer? We can combine the two rules:

const rules = {
  integer: regularExpression(/0x[0-9a-fA-F]+|[0-9]+/),
  space: regularExpression(/( )+/)
}

It works, but it's a bit harder to read. And what if we need an even more complex rule, for example a string literal:

const rules = {
  string: regularExpression(/"([^"\\]|\\.)*"|'([^'\\]|\\.)*'/)
}

This is even harder to understand. That's where the oneOf rule comes in:

const rules = {
  string: oneOf(
    regularExpression(/"([^"\\]|\\.)*"/),
    regularExpression(/'([^'\\]|\\.)*'/)
  )
}

And for our example with integers:

const rules = {
  integer: oneOf(
    regularExpression(/0x[0-9a-fA-F]+/),
    regularExpression(/[0-9]+/)
  ),
  space: regularExpression(/( )+/)
}

Here, the order of regular expressions matters too!
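
For example, if the plain decimal pattern came first inside oneOf, tokenizing 0x9 would again stop after the leading 0, just like in the earlier example:

const badOrder = oneOf(
  regularExpression(/[0-9]+/),        // matches only the leading 0 of "0x9"
  regularExpression(/0x[0-9a-fA-F]+/) // never gets a chance to match
)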

More advanced rules

Now let's write a more advanced rule. For example, we want to tokenize a string literal, but we don't want the quotes to end up in the token's value. For this, the regularExpression rule takes a second argument: a function that receives a regular expression match array (the value returned by regex.exec(string)) and returns a string, which becomes the resulting token value:

const rules = {
  string: regularExpression(/'(([^'\\]|\\.)*)'/, (match) => match[1])
}
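
With this rule, the surrounding quotes are stripped from the token's value. A small sketch (StringLexer is a hypothetical name):

class StringLexer extends Lexer {
  static rules = rules
}

const [token] = new StringLexer("'hi'").tokenize()
// expected: token.type === "string", token.value === "hi" (no quotes)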

Changing rules of a certain lexer instance

Sometimes we might need to change the lexer rules for only one object (instance) without touching the others. This can be achieved by assigning to the rules property of that instance:

const lexer = new CustomLexer("...")

lexer.rules = {
  // ...
}
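
For example, to add one extra rule to a single instance while keeping the rules defined on the class (a sketch; the comment rule is purely illustrative):

lexer.rules = {
  ...lexer.rules,                        // keep the rules defined on the class
  comment: regularExpression(/#[^\n]*/)  // extra rule for this instance only
}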

Working with tokens

There are four ways to work with tokens (a short sketch of some of them follows the list):

  1. Call Lexer#tokenize() and get an array of tokens.
  2. Call Lexer#nextToken() to get the next token (at the end of input it returns a token of type endOfFile, or of the type specified in the lexer options).
  3. Call Lexer#peekAtNextToken() to peek at the next token without removing it from the token stream.
  4. Use a for-of loop and iterate over the tokens: for (const token of lexer).
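
A quick sketch of options 2 and 4, assuming the default endOfFile token type:

// Option 2: pull tokens one by one until the end of input
let token = lexer.nextToken()
while (token.type !== "endOfFile") {
  console.log(token.type, token.value)
  token = lexer.nextToken()
}

// Option 4: iterate with for-of (reset first, since the loop above consumed all tokens)
lexer.reset()
for (const token of lexer) {
  console.log(token.type, token.value)
}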

Check out the API section below for a more in-depth description of how these methods work.

Overriding default lexer properties

If no options are provided, a lexer uses the default values. If you want to change those defaults for your class, here's how to do it:

class MyLexer extends Lexer {
  static defaultOptions = {
    ...super.defaultOptions,
    endOfFileTokenType: "EndOfFile"
  }
}
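
With this override, the token produced at the end of the input is expected to have type "EndOfFile" instead of the default "endOfFile". A rough sketch:

const lexer = new MyLexer("") // empty input, so only the end-of-file token remains
lexer.nextToken()             // expected: a token with type "EndOfFile"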

API

TODO: autogenerate API

Lexer

static rules: Rules

An object with rules

constructor(input: string, options?: LexerOptions)

new Lexer(input, options)

Arguments:

  • input - a string to tokenize
  • options - an object with lexer options
    • skip - an array of token types that should be skipped (default: [])
    • endOfFileTokenType - type of token that will be returned when reaching end of file (default: "endOfFile")
    • filePath - path to the source file (default: undefined)
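
For example, to skip whitespace tokens while tokenizing (a sketch reusing the MyLexer class and code string from the Usage section):

const lexer = new MyLexer(code, { skip: ["space"] })
const tokens = lexer.tokenize() // expected: no tokens of type "space" in the result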

get rules

Returns the rules of the current lexer instance. These can be overridden without touching the class prototype, which is particularly useful if you need to extend a lexer inside a certain function without affecting other code.

set rules

Replaces the rules of the current lexer instance with the passed ones. This doesn't affect other lexer instances.

reset()

lexer.reset()

You can use this method to reset the current lexer's position to the beginning of the input.

nextToken(): Token

lexer.nextToken()

Returns the next token. When the end of input is reached, it returns a token of type endOfFile (you can specify another type in the lexer's options).

peekAtNextToken({ atOffset: number = 0 }): Token

lexer.peekAtNextToken()
lexer.peekAtNextToken({ atOffset: 1 }) // returns a token after the next one

tokenize(): Token[]

lexer.tokenize()

Returns an array of tokens.

Note: the lexer does not reset its current position when this method is called.

lexer.tokenize() // [...] (array with tokens)
lexer.tokenize() // [] (empty array)

But if you call lexer.reset() before the second call, you'll get the tokens again instead of an empty array.
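
A sketch of that pattern:

lexer.tokenize() // [...] (array with tokens)
lexer.reset()    // rewind to the beginning of the input
lexer.tokenize() // [...] (the same tokens again)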

Rule

abstract class Rule {
  abstract isMatching(input: string, position: Position): boolean
  abstract createToken(type: string, input: string, position: Position): Token
}

Rules

type Rules = {
  [ruleName: string]: Rule
}

Token

class Token {
  type: string
  value: string
  position: {
    start: Position,
    end: Position
  }
}

Position

class Position {
  line: number
  column: number
  index: number
  filePath?: string
}

add(line: number, column: number, index: number)

Creates a new Position instance whose line, column, and index are computed by adding the passed arguments to the current values.
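
A sketch of the expected behaviour, assuming the argument order new Position(line, column, index, filePath?) shown in the toString() examples below:

const position = new Position(1, 5, 4)
const moved = position.add(0, 3, 3)
// moved is a new Position with line 1, column 8, index 7; position itself is unchanged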

clone(): Position

Creates a new Position instance with the same properties.

toString(): string

Creates a string with information about the position in the source code. If a file path was not provided when creating the Position instance, the resulting string looks like line:column. Otherwise, it looks like line:column (in filepath).

Examples:

new Position(1, 2, 1).toString() // 1:2
new Position(1, 2, 1, "source-file.js") // 1:2 (in source-file.js)

SyntaxError

Exception thrown by the lexer when the input doesn't match any rule.

class SyntaxError extends Error {
  position: Position
}

position: Position

Position in the input where the error occurred.
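
A sketch of catching the error and reporting where it occurred. This assumes the package exports its SyntaxError class (which then shadows the built-in SyntaxError in the importing module):

import { SyntaxError } from "@parsing-kit/lexer" // assumption: exported by the package

try {
  lexer.tokenize()
} catch (error) {
  if (error instanceof SyntaxError) {
    console.error(`Unexpected input at ${error.position}`) // uses Position#toString()
  } else {
    throw error
  }
}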