@teplovs/parser v1.0.0-alpha
Parser
Installation
npm install --save @teplovs/parser
# or:
yarn add @teplovs/parser
Getting started
In this getting-started tutorial, we will build a JSON parser.
Initializing an empty project
First, let's create an empty folder for our project:
mkdir json-parser
And then we have to navigate to the folder:
cd json-parser
We have to initialize an npm package in order to install dependencies:
npm init
# or, if you prefer using yarn:
yarn init
And now install dependencies:
npm install --save @teplovs/lexer @teplovs/parser
# or with yarn:
yarn add @teplovs/lexer @teplovs/parser
Finally, let's use ES6 modules instead of CommonJS (this allows us to use import and export instead of require and module.exports). To do this, open your package.json and specify a property called type:
{
"name": "json-parser",
"type": "module",
// ...
}
Building a lexer
A lexer is a tool that splits source code into tokens. Those tokens are then consumed by a parser.
Here you can find more in-depth information on the lexer library.
Let's create a file called lexer.js, where we will put the code of our lexer.
Lexer rules are described using regular expressions. To learn more about them, you can read this article.
Now let's implement our lexer. First, we need to import the Lexer class from the library:
import { Lexer } from "@teplovs/lexer"
Then we can create our own lexer class; let's call it JSONLexer. We should export it in order to access it from other modules:
export class JSONLexer extends Lexer {
// ...
}
Now we need to start defining our rules. That is done using a static property of our lexer called rules. We assign it a key-value object, where each key is the name of a rule and each value is the regular expression used to match it.
To start with, let's add a regular expression for numbers:
export class JSONLexer extends Lexer {
static rules = {
// The 'Number' rule matches both integers and floating-point numbers
Number: /[0-9]+(\.[0-9]+)?/,
}
}
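Note that this pattern is a simplification: real JSON numbers can also be negative and can have an exponent (for example, -3 or 2.5e10). If you want to support those, a pattern along these lines should work (a sketch, not part of the grammar we build in this tutorial):
// A more complete JSON number pattern (sketch): optional minus sign,
// integer part, optional fraction, optional exponent
Number: /-?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?/,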
Then we can add strings:
export class JSONLexer extends Lexer {
static rules = {
// The 'Number' rule matches both integers and floating-point numbers
Number: /[0-9]+(\.[0-9]+)?/,
// The 'String' rule matches only double-quoted strings
String: /"([^"\\]|\\.)*"/,
}
}
And we can easily add more rules in the same way:
export class JSONLexer extends Lexer {
static rules = {
// The 'Number' rule matches both integers and floating-point numbers
Number: /[0-9]+(\.[0-9]+)?/,
// The 'String' rule matches only double-quoted strings
String: /"([^"\\]|\\.)*"/,
Boolean: /\b(true|false)\b/,
ArrayStart: /\[/,
ArrayEnd: /\]/,
ObjectStart: /\{/,
ObjectEnd: /\}/,
Comma: /,/,
Colon: /:/,
Whitespace: /\s+/
}
}
Resulting code:
import { Lexer } from "@teplovs/lexer"
export class JSONLexer extends Lexer {
static rules = {
// The 'Number' rule matches both integers and floating-point numbers
Number: /[0-9]+(\.[0-9]+)?/,
// The 'String' rule matches only double-quoted strings
String: /"([^"\\]|\\.)*"/,
Boolean: /\b(true|false)\b/,
ArrayStart: /\[/,
ArrayEnd: /\]/,
ObjectStart: /\{/,
ObjectEnd: /\}/,
Comma: /,/,
Colon: /:/,
Whitespace: /\s+/
}
}
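If you want to try the lexer on its own before building the parser, something like the following should work. Note that the method for iterating over tokens is an assumption on our side (we call it tokenize() below); check the @teplovs/lexer documentation for the actual API:
import { JSONLexer } from "./lexer.js"

const lexer = new JSONLexer('{ "answer": 42 }')

// Assumption: the lexer exposes a way to iterate over tokens;
// `tokenize()` here is a placeholder for the real method name
for (const token of lexer.tokenize()) {
  console.log(token.type, token.value)
}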
Building a parser
Now let's create a file called parser.js.
First, let's import the parser that will be the base class for our JSON parser:
import { Parser } from "@teplovs/parser"
Now let's create our parser class and export it:
export class JSONParser extends Parser {}
The parser library exports some functions that are used for defining rules. The simplest among them is probably the token function.
Let's define rules for numbers, strings, and booleans:
import { Parser, token } from "@teplovs/parser"
export class JSONParser extends Parser {
static rules = {
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
}
}
As you can see in this example, to use the token function, you pass an object that defines what type and/or value a token must have in order to match the rule.
Let's say you would like to match a token of type Boolean with the value true. You could do it this way:
import { Parser, token } from "@teplovs/parser"
export class SomeKindOfAParser extends Parser {
static rules = {
True: token({
type: "Boolean",
value: "true"
})
}
}
Now let's go back to our JSON parser. It is often very useful to combine multiple rules into one. For instance, this will come in handy when we define a rule for JSON arrays.
Let's create a rule called Expression that combines all possible JSON expressions. To do that, we will use two functions: oneOf and rule. oneOf defines a rule that can match one of several variants, and the rule function refers to another rule by its name. Here is the code example:
// Note that we've added new imports!
import { Parser, token, oneOf, rule } from "@teplovs/parser"
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
}
}
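At this point, the parser can already handle single values. We can give it a quick try using the same constructor and parse() call that we will use at the end of this tutorial:
import { JSONLexer } from "./lexer.js"
import { JSONParser } from "./parser.js"

// Feed the lexer to the parser and start parsing from the 'Expression' rule
const lexer = new JSONLexer("true")
const parser = new JSONParser(lexer)
console.log(parser.parse("Expression"))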
The Expression rule will actually act as the main rule in our parser. We just need to add a few more expression types.
Now let's start working on parsing objects. The next 'building block' we need is a key-value pair. Let's define a rule for it using a function called chain, which matches a sequence of tokens or subrules. Here is an example:
// Note that we've added a new import!
import { Parser, token, oneOf, rule, chain } from "@teplovs/parser"
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
KeyValuePair: chain(
rule("String"),
token({ type: "Colon" }),
rule("Expression")
),
}
}
What we do here is match the following sequence: a string, a colon, and an expression. For example, this input would match the rule: "key":"value".
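Since parse() takes the name of a rule to start from (we use "Expression" later in this tutorial), you should be able to test this rule directly. Starting from a rule other than Expression is an assumption here:
const lexer = new JSONLexer('"key":"value"')
const parser = new JSONParser(lexer)

// Assumption: parse() can start from any named rule, not only 'Expression'
console.log(parser.parse("KeyValuePair"))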
But here is a little problem: we should ignore whitespace, and currently we don't. Let's quickly jump back to lexer.js and skip whitespace by default:
import { Lexer } from "@teplovs/lexer"
export class JSONLexer extends Lexer {
static rules = {
// ... (all the rules of the lexer stay the same)
}
static get defaultOptions() {
return {
// Skip 'Whitespace' tokens by default
skip: ["Whitespace"]
}
}
}
Now we can create a rule for objects. Let's try to do this:
import { Parser, token, oneOf, rule, chain } from "@teplovs/parser"
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
KeyValuePair: chain(
rule("String"),
token({ type: "Colon" }),
rule("Expression")
),
Object: chain(
token({ type: "ObjectStart" }),
// ???
token({ type: "ObjectEnd" })
)
}
}
But there is a problem. Objects can be empty (have no key-value pairs inside), have only one key-value pair, or have many of them. We can use two functions: one called oneOrMore (or repeatable, depending on your preference, since they are the same function), and one called optional. But we also need to define the rule in a way that won't look confusing to others. So let's create a custom function for rule definition!
We will create a function called commaSeparatedList, which takes any rule as an argument:
// Note that we've added a new import!
import { Parser, token, oneOf, rule, chain, oneOrMore } from "@teplovs/parser"
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }),
rule
)
)
)
}
// ... (parser class)
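For example, commaSeparatedList(rule("Number")) would match 1, 2 or 1, 2, 3, but not a lone 1, because oneOrMore requires at least one ", Number" pair after the first Number. This is exactly why we combine it with optional and a single-item alternative below:
// A hypothetical usage: matches "1, 2" and "1, 2, 3", but not a lone "1",
// because oneOrMore requires at least one ", Number" pair after the first one
const numberList = commaSeparatedList(rule("Number"))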
This little function will make our parser rules much clearer to others. Now let's implement the Object rule, and don't forget to add it to the Expression rule:
// Note that we've added a new import!
import { Parser, token, oneOf, rule, chain, oneOrMore, optional } from "@teplovs/parser"
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }),
rule
)
)
)
}
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
rule("Object"),
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
KeyValuePair: chain(
rule("String"),
token({ type: "Colon" }),
rule("Expression")
),
Object: chain(
token({ type: "ObjectStart" }),
// An object can have no body, that's why we use `optional`
optional(
oneOf(
// An object body can also have only one key-value pair:
rule("KeyValuePair"),
// or many comma-separated key-value pairs:
commaSeparatedList(rule("KeyValuePair"))
)
),
token({ type: "ObjectEnd" })
)
}
}
And the last step in defining rules is adding a rule for arrays. Don't forget to add the Array rule to the Expression rule as well:
import { Parser, token, oneOf, rule, chain, oneOrMore, optional } from "@teplovs/parser"
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }),
rule
)
)
)
}
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
rule("Array"),
rule("Object")
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
Array: chain(
token({ type: "ArrayStart" }),
// An array can have no body, that's why we use `optional`
optional(
oneOf(
// An array body can also have only one expression:
rule("Expression"),
// or many comma-separated expressions:
commaSeparatedList(rule("Expression"))
)
),
token({ type: "ArrayEnd" })
),
KeyValuePair: chain(
rule("String"),
token({ type: "Colon" }),
rule("Expression")
),
Object: chain(
token({ type: "ObjectStart" }),
// An object can have no body, that's why we use `optional`
optional(
oneOf(
// An object body can also have only one key-value pair:
rule("KeyValuePair"),
// or many comma-separated key-value pairs:
commaSeparatedList(rule("KeyValuePair"))
)
),
token({ type: "ObjectEnd" })
)
}
}
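With arrays and objects in place, the grammar accepts arbitrarily nested JSON. Here is a quick sanity check, using the same parse() call as in the final section (the exact output shape is not important yet; we will shape it next):
import { JSONLexer } from "./lexer.js"
import { JSONParser } from "./parser.js"

const input = '{ "name": "json-parser", "tags": ["npm", "parser"], "stars": 42 }'
const parser = new JSONParser(new JSONLexer(input))

// We only check that parsing succeeds; shaping the output comes next
console.log(parser.parse("Expression"))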
That's it for this section. The next step is manipulating the output of a parser.
Shaping the output of the parser
The next thing to do is to 'shape' the output of the parser. We don't need everything the parser returns by default. For instance, we don't need all those commas, brackets, and braces in arrays and objects.
Ignoring the output
Let's jump into the commaSeparatedList function. We don't need those commas to be available in the final output, so let's 'ignore' them:
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
// This comma will be ignored
token({ type: "Comma" }).ignore(),
rule
)
)
)
}
Changing the output with a custom function
We sometimes might want to change the output manually, using a custom function. You can do this with the .saveAs method, which passes the default result of parsing a rule to your callback as an argument.
For instance, let's say we want to convert a token of type 'Number' to a JavaScript number. Each Token object has the properties type, value, and position. We can convert the token to a number this way:
token({ type: "Number" }).saveAs(token => {
/*
Actually, the faster way to convert a string to a number is `+token.value`.
But for the sake of readability and simplicity, we will leave it this way:
*/
return parseFloat(token.value)
})
A similar thing can be done with other types of rules; the difference is in the type of the argument passed to the callback.
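For example, here is one possible sketch for the String rule. Using JSON.parse to decode a single string literal is just our shortcut for handling quotes and escape sequences; it is not something the library requires:
token({ type: "String" }).saveAs(token => {
  // token.value still contains the surrounding quotes and raw escape
  // sequences (e.g. '"a\\nb"'); JSON.parse decodes them for us
  return JSON.parse(token.value)
})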
Let's improve the result of parsing a commaSeparatedList. The current output looks like this:
[
expression1,
[
[ expression2 ],
[ expression3 ],
// ...
]
]
It has a lot of arrays we don't need. What we want to get is a flat array of expressions:
[
expression1,
expression2,
expression3,
// ...
]
Let's take small steps. First, we want to flatten the second array:
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }).ignore(),
rule
).saveAs((chainItems) => {
/*
Here, we get an array with the output of each child rule as an argument.
Let's imagine we are parsing a comma-separated list of numbers,
and the 'Comma' token is not ignored. Then the argument of this function
is an array of the 'Comma' token and a number. But if the comma is ignored,
the only child inside of the array is the number. And we need to
return this number.
*/
const ruleResult = chainItems[0]
return ruleResult
})
)
)
}
The result after this little change is the following:
[
expression1,
[
expression2,
expression3,
// ...
]
]
And now we want to create a fully flat array. That can be done the following way:
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }).ignore(),
rule
).saveAs((chainItems) => {
const ruleResult = chainItems[0]
return ruleResult
})
)
).saveAs((items) => {
const [firstItem, allTheOtherItems] = items
return [firstItem, ...allTheOtherItems]
})
}
And now we get the result we wanted:
[
expression1,
expression2,
expression3,
// ...
]
Now it's your turn to update all the other rules in order to generate a JavaScript object out of JSON. If you need some help, you can check out the source code in the examples/json-parser folder.
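As a starting hint, here is one possible sketch of a shaped Array rule (the version in examples/json-parser may differ). It assumes that ignored tokens do not appear among the chain results, as we saw above, and that an unmatched optional produces an empty result:
Array: chain(
  token({ type: "ArrayStart" }).ignore(),
  optional(
    oneOf(
      // Wrap a single expression into an array so that both branches
      // of `oneOf` produce the same shape
      rule("Expression").saveAs(expression => [expression]),
      // commaSeparatedList already returns a flat array (see above)
      commaSeparatedList(rule("Expression"))
    )
  ),
  token({ type: "ArrayEnd" }).ignore()
).saveAs(chainItems => {
  // With the brackets ignored, the only child left is the optional body.
  // Assumption: an unmatched `optional` yields an empty/undefined result,
  // so we fall back to an empty array.
  const body = chainItems[0]
  return body ?? []
})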
Using the parser
Now let's create an index.js file and use it to parse our package.json:
import { JSONLexer } from "./lexer.js"
import { JSONParser } from "./parser.js"
import { fileURLToPath } from "url"
import { readFileSync } from "fs"
import { dirname, join } from "path"
// Helper function to make parsing JSON easier
const parse = (inputString) => {
const lexer = new JSONLexer(inputString)
const parser = new JSONParser(lexer)
return parser.parse("Expression")
}
// This is a snippet that does the same as `__dirname` in CommonJS
const projectFolder = dirname(fileURLToPath(import.meta.url))
const pathToPackageJSON = join(projectFolder, "package.json")
const packageInformation = String(readFileSync(pathToPackageJSON))
console.log(parse(packageInformation))
And then run the file:
node index.js
Development
Prerequisites
- Node.js and npm
Setup
- Clone the repository
git clone https://github.com/teplovs/parser
- Navigate to the folder
cd parser
- Install dependencies
npm install
# or:
yarn install
- Happy hacking! 🎉