@teplovs/parser v1.0.0-alpha
Parser
Installation
npm install --save @teplovs/parser
# or:
yarn add @teplovs/parser
Getting started
In this getting-started tutorial, we will build a JSON parser.
Initializing an empty project
First, let's create an empty folder for our project:
mkdir json-parser
And then we have to navigate to the folder:
cd json-parser
We have to initialize an npm package in order to install dependencies:
npm init
# or, if you prefer using yarn:
yarn init
And now install dependencies:
npm install --save @teplovs/lexer @teplovs/parser
# or with yarn:
yarn add @teplovs/lexer @teplovs/parser
Finally, let's use ES6 modules instead of CommonJS (this allows us to use import and export instead of require and module.exports). To do this, open your package.json and specify a property called type:
{
"name": "json-parser",
"type": "module",
// ...
}
Building a lexer
A lexer is a tool that splits source code into tokens. Those tokens are then consumed by a parser.
Here you can find more in-depth information on the lexer library.
Let's create a file called lexer.js, where we will put the code of our lexer.
Lexer rules are described using regular expressions. To learn more about them, you can read this article.
Now let's implement our lexer. First, we need to import the Lexer class from the library:
import { Lexer } from "@teplovs/lexer"
Then we can create our own lexer class; let's call it JSONLexer. We should export it in order to access it from other modules:
export class JSONLexer extends Lexer {
// ...
}
Now we need to start defining our rules. That is done using a static property of our lexer called rules. We assign it a key-value object, where each key is the name of a rule and each value is the regular expression used to match it.
To start with, let's add a regular expression for numbers:
export class JSONLexer extends Lexer {
static rules = {
// The 'Number' rule matches both integers and floating-point numbers
Number: /[0-9]+(\.[0-9]+)?/,
}
}
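Note that this pattern is a simplification: real JSON numbers can also be negative and can have an exponent (for example, -3 or 2.5e10). If you want to support those, a pattern along these lines should work (a sketch, not part of the grammar we build in this tutorial):
// A more complete JSON number pattern (sketch): optional minus sign,
// integer part, optional fraction, optional exponent
Number: /-?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?/,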
Then we can add strings:
export class JSONLexer extends Lexer {
static rules = {
// The 'Number' rule matches both integers and floating-point numbers
Number: /[0-9]+(\.[0-9]+)?/,
// The 'String' rule matches only double-quoted strings
String: /"([^"\\]|\\.)*"/,
}
}
And we can easily add more rules in the same way:
export class JSONLexer extends Lexer {
static rules = {
// The 'Number' rule matches both integers and floating-point numbers
Number: /[0-9]+(\.[0-9]+)?/,
// The 'String' rule matches only double-quoted strings
String: /"([^"\\]|\\.)*"/,
Boolean: /\b(true|false)\b/,
ArrayStart: /\[/,
ArrayEnd: /\]/,
ObjectStart: /\{/,
ObjectEnd: /\}/,
Comma: /,/,
Colon: /:/,
Whitespace: /\s+/
}
}
Resulting code:
import { Lexer } from "@teplovs/lexer"
export class JSONLexer extends Lexer {
static rules = {
// The 'Number' rule matches both integers and floating-point numbers
Number: /[0-9]+(\.[0-9]+)?/,
// The 'String' rule matches only double-quoted strings
String: /"([^"\\]|\\.)*"/,
Boolean: /\b(true|false)\b/,
ArrayStart: /\[/,
ArrayEnd: /\]/,
ObjectStart: /\{/,
ObjectEnd: /\}/,
Comma: /,/,
Colon: /:/,
Whitespace: /\s+/
}
}
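If you want to try the lexer on its own before building the parser, something like the following should work. Note that the method for iterating over tokens is an assumption on our side (we call it tokenize() below); check the @teplovs/lexer documentation for the actual API:
import { JSONLexer } from "./lexer.js"

const lexer = new JSONLexer('{ "answer": 42 }')

// Assumption: the lexer exposes a way to iterate over tokens;
// `tokenize()` here is a placeholder for the real method name
for (const token of lexer.tokenize()) {
  console.log(token.type, token.value)
}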
Building a parser
Now let's create a file called parser.js.
First, let's import the parser that will be the base class for our JSON parser:
import { Parser } from "@teplovs/parser"
Now let's create our parser class and export it:
export class JSONParser extends Parser {}
The parser library exports some functions that are used for defining rules. The simplest among them is probably the token function.
Let's define rules for numbers, strings, and booleans:
import { Parser, token } from "@teplovs/parser"
export class JSONParser extends Parser {
static rules = {
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
}
}
As you can see in this example, to use the token function, you pass an object that defines what type and/or value a token must have in order to match the rule.
Let's say you would like to match a token of type Boolean with the value true. You could do it this way:
import { Parser, token } from "@teplovs/parser"
export class SomeKindOfAParser extends Parser {
static rules = {
True: token({
type: "Boolean",
value: "true"
})
}
}
Now let's go back to our JSON parser. It is often very useful to combine multiple rules into one. For instance, this will come in handy when we define a rule for JSON arrays.
Let's create a rule called Expression that combines all possible JSON expressions. To do that, we will use two functions: oneOf and rule. oneOf defines a rule that can match one of several variants, and the rule function refers to another rule by its name. Here is the code example:
// Note that we've added new imports!
import { Parser, token, oneOf, rule } from "@teplovs/parser"
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
}
}
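At this point, the parser can already handle single values. We can give it a quick try using the same constructor and parse() call that we will use at the end of this tutorial:
import { JSONLexer } from "./lexer.js"
import { JSONParser } from "./parser.js"

// Feed the lexer to the parser and start parsing from the 'Expression' rule
const lexer = new JSONLexer("true")
const parser = new JSONParser(lexer)
console.log(parser.parse("Expression"))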
The Expression rule will actually act as the main rule in our parser. We just need to add a few more expression types.
Now let's start working on parsing objects. The next 'building block' we need is a key-value pair. Let's define a rule for it using a function called chain, which matches a sequence of tokens or subrules. Here is an example:
// Note that we've added a new import!
import { Parser, token, oneOf, rule, chain } from "@teplovs/parser"
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
KeyValuePair: chain(
rule("String"),
token({ type: "Colon" }),
rule("Expression")
),
}
}
What we do here is match the following sequence: a string, a colon, and an expression. For example, this input would match the rule: "key":"value".
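Since parse() takes the name of a rule to start from (we use "Expression" later in this tutorial), you should be able to test this rule directly. Starting from a rule other than Expression is an assumption here:
const lexer = new JSONLexer('"key":"value"')
const parser = new JSONParser(lexer)

// Assumption: parse() can start from any named rule, not only 'Expression'
console.log(parser.parse("KeyValuePair"))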
But here is a little problem: we should ignore whitespace, and currently we don't. Let's quickly jump back to lexer.js and skip whitespace by default:
import { Lexer } from "@teplovs/lexer"
export class JSONLexer extends Lexer {
static rules = {
// ... (all the rules of the lexer stay the same)
}
static get defaultOptions() {
return {
// Skip 'Whitespace' tokens by default
skip: ["Whitespace"]
}
}
}
Now we can create a rule for objects. Let's try to do this:
import { Parser, token, oneOf, rule, chain } from "@teplovs/parser"
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
KeyValuePair: chain(
rule("String"),
token({ type: "Colon" }),
rule("Expression")
),
Object: chain(
token({ type: "ObjectStart" }),
// ???
token({ type: "ObjectEnd" })
)
}
}
But there is a problem. Objects can be empty (have no key-value pairs inside), have only one key-value pair, or have many of them. We can use two functions: one called oneOrMore (or repeatable, depending on your preference, since they are the same function), and one called optional. But we also need to define the rule in a way that won't look confusing to others. So let's create a custom function for rule definition!
We will create a function called commaSeparatedList, which takes any rule as an argument:
// Note that we've added a new import!
import { Parser, token, oneOf, rule, chain, oneOrMore } from "@teplovs/parser"
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }),
rule
)
)
)
}
// ... (parser class)
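For example, commaSeparatedList(rule("Number")) would match 1, 2 or 1, 2, 3, but not a lone 1, because oneOrMore requires at least one ", Number" pair after the first Number. This is exactly why we combine it with optional and a single-item alternative below:
// A hypothetical usage: matches "1, 2" and "1, 2, 3", but not a lone "1",
// because oneOrMore requires at least one ", Number" pair after the first one
const numberList = commaSeparatedList(rule("Number"))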
This little function will make our parser rules much clearer to others. Now let's implement the Object rule, and don't forget to add it to the Expression rule:
// Note that we've added a new import!
import { Parser, token, oneOf, rule, chain, oneOrMore, optional } from "@teplovs/parser"
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }),
rule
)
)
)
}
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
rule("Object"),
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
KeyValuePair: chain(
rule("String"),
token({ type: "Colon" }),
rule("Expression")
),
Object: chain(
token({ type: "ObjectStart" }),
// An object can have no body, that's why we use `optional`
optional(
oneOf(
// An object body can also have only one key-value pair:
rule("KeyValuePair"),
// or many comma-separated key-value pairs:
commaSeparatedList(rule("KeyValuePair"))
)
),
token({ type: "ObjectEnd" })
)
}
}
And the last step in defining rules is adding a rule for arrays. Don't forget to add the Array rule to the Expression rule as well:
import { Parser, token, oneOf, rule, chain, oneOrMore, optional } from "@teplovs/parser"
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }),
rule
)
)
)
}
export class JSONParser extends Parser {
static rules = {
Expression: oneOf(
rule("Number"),
rule("String"),
rule("Boolean"),
rule("Array"),
rule("Object")
),
Number: token({ type: "Number" }),
String: token({ type: "String" }),
Boolean: token({ type: "Boolean" }),
Array: chain(
token({ type: "ArrayStart" }),
// An array can have no body, that's why we use `optional`
optional(
oneOf(
// An array body can also have only one expression:
rule("Expression"),
// or many comma-separated expressions:
commaSeparatedList(rule("Expression"))
)
),
token({ type: "ArrayEnd" })
),
KeyValuePair: chain(
rule("String"),
token({ type: "Colon" }),
rule("Expression")
),
Object: chain(
token({ type: "ObjectStart" }),
// An object can have no body, that's why we use `optional`
optional(
oneOf(
// An object body can also have only one key-value pair:
rule("KeyValuePair"),
// or many comma-separated key-value pairs:
commaSeparatedList(rule("KeyValuePair"))
)
),
token({ type: "ObjectEnd" })
)
}
}
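With arrays and objects in place, the grammar accepts arbitrarily nested JSON. Here is a quick sanity check, using the same parse() call as in the final section (the exact output shape is not important yet; we will shape it next):
import { JSONLexer } from "./lexer.js"
import { JSONParser } from "./parser.js"

const input = '{ "name": "json-parser", "tags": ["npm", "parser"], "stars": 42 }'
const parser = new JSONParser(new JSONLexer(input))

// We only check that parsing succeeds; shaping the output comes next
console.log(parser.parse("Expression"))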
That's it for this section. The next step is manipulating the output of a parser.
Shaping the output of the parser
The next thing to do is to 'shape' the output of the parser. We don't need everything the parser returns by default. For instance, we don't need all those commas, brackets, and braces in arrays and objects.
Ignoring the output
Let's jump into the commaSeparatedList function. We don't need those commas to be available in the final output, so let's 'ignore' them:
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
// This comma will be ignored
token({ type: "Comma" }).ignore(),
rule
)
)
)
}
Changing the output with a custom function
We sometimes might want to change the output manually, using a custom function. You can do this with the .saveAs method, which passes the default result of parsing a rule to your callback as an argument.
For instance, let's say we want to convert a token of type 'Number' to a JavaScript number. Each Token object has the properties type, value, and position. We can convert the token to a number this way:
token({ type: "Number" }).saveAs(token => {
/*
Actually, the faster way to convert a string to a number is `+token.value`.
But for the sake of readability and simplicity, we will leave it this way:
*/
return parseFloat(token.value)
})
A similar thing can be done with other types of rules; the difference is in the type of the argument passed to the callback.
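For example, here is one possible sketch for the String rule. Using JSON.parse to decode a single string literal is just our shortcut for handling quotes and escape sequences; it is not something the library requires:
token({ type: "String" }).saveAs(token => {
  // token.value still contains the surrounding quotes and raw escape
  // sequences (e.g. '"a\\nb"'); JSON.parse decodes them for us
  return JSON.parse(token.value)
})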
Let's improve the result of parsing a commaSeparatedList. The current output looks like this:
[
expression1,
[
[ expression2 ],
[ expression3 ],
// ...
]
]
It has a lot of arrays we don't need. What we want to get is a flat array of expressions:
[
expression1,
expression2,
expression3,
// ...
]
Let's take small steps. First, we want to flatten the second array:
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }).ignore(),
rule
).saveAs((chainItems) => {
/*
Here, we get an array with the output of each child rule as an argument.
Let's imagine we are parsing a comma-separated list of numbers,
and the 'Comma' token is not ignored. Then the argument of this function
is an array of the 'Comma' token and a number. But if the comma is ignored,
the only child inside of the array is the number. And we need to
return this number.
*/
const ruleResult = chainItems[0]
return ruleResult
})
)
)
}
The result after this little change is the following:
[
expression1,
[
expression2,
expression3,
// ...
]
]
And now we want to create a fully flat array. That can be done the following way:
const commaSeparatedList = (rule) => {
return chain(
rule,
oneOrMore(
chain(
token({ type: "Comma" }).ignore(),
rule
).saveAs((chainItems) => {
const ruleResult = chainItems[0]
return ruleResult
})
)
).saveAs((items) => {
const [firstItem, allTheOtherItems] = items
return [firstItem, ...allTheOtherItems]
})
}
And now we get the result we wanted:
[
expression1,
expression2,
expression3,
// ...
]
Now it's your turn to update all the other rules in order to generate a JavaScript object out of JSON. If you need some help, you can check out the source code in the examples/json-parser folder.
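As a starting hint, here is one possible sketch of a shaped Array rule (the version in examples/json-parser may differ). It assumes that ignored tokens do not appear among the chain results, as we saw above, and that an unmatched optional produces an empty result:
Array: chain(
  token({ type: "ArrayStart" }).ignore(),
  optional(
    oneOf(
      // Wrap a single expression into an array so that both branches
      // of `oneOf` produce the same shape
      rule("Expression").saveAs(expression => [expression]),
      // commaSeparatedList already returns a flat array (see above)
      commaSeparatedList(rule("Expression"))
    )
  ),
  token({ type: "ArrayEnd" }).ignore()
).saveAs(chainItems => {
  // With the brackets ignored, the only child left is the optional body.
  // Assumption: an unmatched `optional` yields an empty/undefined result,
  // so we fall back to an empty array.
  const body = chainItems[0]
  return body ?? []
})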
Using the parser
Now let's create an index.js file and use it to parse our package.json:
import { JSONLexer } from "./lexer.js"
import { JSONParser } from "./parser.js"
import { fileURLToPath } from "url"
import { readFileSync } from "fs"
import { dirname, join } from "path"
// Helper function to make parsing JSON easier
const parse = (inputString) => {
const lexer = new JSONLexer(inputString)
const parser = new JSONParser(lexer)
return parser.parse("Expression")
}
// This is a snippet that does the same as `__dirname` in CommonJS
const projectFolder = dirname(fileURLToPath(import.meta.url))
const pathToPackageJSON = join(projectFolder, "package.json")
const packageInformation = String(readFileSync(pathToPackageJSON))
console.log(parse(packageInformation))
And then run the file:
node index.js
Development
Prerequisites
- Node.js and npm
Setup
- Clone the repository
git clone https://github.com/teplovs/parser
- Navigate to the folder
cd parser
- Install dependencies
npm install
# or:
yarn install
- Happy hacking! 🎉