@watl/generic-parser v0.2.5
Generic Language Parser
This library enables tokenizing and parsing abstract grammars into custom ASTs using a simple, declarative, and composable syntax.
Install
$ npm install --save @watl/generic-parser
How it works
There are really only a handful of basic utilities exported from generic-parser, and new parsers are expected to be created declaratively by combining larger and more complex patterns together into trees of what are referred to as Reader functions. A Reader is ultimately just a function which receives a ParserState parameter and returns some kind of result. The top-level Reader is then passed into the parse function to parse your source.
For example, here we have an (admittedly pretty bad) parser for simple math expressions.
import { parse, sequence, optional, circular, token_matcher } from '@watl/generic-parser';

// Define a token for all our valid operators
const operator = token_matcher('operator', /\+|\-|\/|\*/y);

// Define a token to find integers
const integer = token_matcher('integer', /[0-9]+/y);

// Finally, define what an expression looks like
const expression = sequence(
    integer,
    optional(
        sequence(
            operator,
            circular(() => expression)
        )
    )
);

// Now that we've defined our grammar, we can parse our test string
const [ result, state ] = parse('10+20-15', expression, /\n/g);
For reference, here is the same grammar expressed in EBNF:

operator = '+' | '-' | '/' | '*' ;
integer_digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
integer = integer_digit , { integer_digit } ;
expression = integer , [ operator , expression ] ;
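Based on the signatures documented below (sequence yields an array of its sub-results in order, and optional yields null when nothing matched), the result for this input would be a nested structure roughly like the following sketch (not verified output):

// result ≈ [
//     Token(integer '10'),
//     [
//         Token(operator '+'),
//         [ Token(integer '20'), [ Token(operator '-'), [ Token(integer '15'), null ] ] ]
//     ]
// ]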
ParserState
A ParserState object is very simple: it just contains the raw lines of source and a pointer (line/char) to the current location in the file. Additionally, if you make use of error recovery via the skippable function, errors collected during parsing will be stored on the state.
interface ParserState {
    lines: string[];
    line_index: number;
    char_index: number;
    errors: MatchFailed[];
}
Token<T extends string>
The actual text that gets extracted from the source using token_matcher is represented using a Token object.
interface Token<T extends string> {
    type: T;
    text: string;
    position: {
        start: {
            line: number;
            char: number;
        };
        end: {
            line: number;
            char: number;
        };
    };
}
MatchFailed
When a branch of the parser fails (for example, due to a syntax error), it will return an instance of the MatchFailed class, which describes what failed and where in the source code. Normally, this would bubble all the way up and be the final result of the parse call, but there are also some Reader types that can handle these MatchFailed results and try to recover.
declare class MatchFailed {
    public readonly line: number;
    public readonly char: number;
    public readonly reason: string;
    public readonly formatted: string;
}
Reader<T, E = MatchFailed>
A Reader function takes in one of the above ParserState objects and returns a value of either type T or E. Because a Reader is ultimately just a function, you can write one by hand (see the sketch below), but most grammars are built from the low-level Reader functions provided out of the box, documented in the sections that follow.
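For illustration only, here is what a hand-written Reader might look like. This sketch assumes that readers advance the state's pointer directly and that the ParserState and Token types are importable; both are assumptions, not documented API:

import { ParserState, Token } from '@watl/generic-parser';

// A hypothetical reader that consumes the remainder of the current line
function rest_of_line(state: ParserState) : Token<'rest_of_line'> {
    const start = { line: state.line_index, char: state.char_index };
    const text = state.lines[state.line_index].slice(state.char_index);

    // Move the pointer forward past the consumed text (assumed contract)
    state.char_index += text.length;

    return {
        type: 'rest_of_line',
        text,
        position: {
            start,
            end: { line: state.line_index, char: state.char_index }
        }
    };
}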
token_matcher(name, regex, close?) : Reader<Token>
The token_matcher function looks for a specific token at the current location. If a token matching the given regex is found at the current source location, a Token object representing that portion of the text will be returned, moving the pointer forward. Otherwise, it returns MatchFailed.
const identifier = token_matcher('identifier', /[a-zA-Z_][a-zA-Z0-9_]*/y);

parse('foo', identifier, /\n/g);
// Returns:
{
    type: 'identifier',
    text: 'foo',
    position: {
        start: {
            line: 0,
            char: 0
        },
        end: {
            line: 0,
            char: 3
        }
    }
}
sequence<T>(...matchers) : Reader<T[]>
The sequence function searches for a specific sequence of matches, in order. This is the primary building block for creating any expression or statement that contains more than one token.
const integer = token_matcher('integer', /[0-9]+/y);
const op_add = token_matcher('op_add', /\+/y);

// Look for an expression formatted as `n+n` for any integer `n`
const add_expr = sequence(
    integer,
    op_add,
    integer
);
In EBNF:

integer_digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
integer = integer_digit , { integer_digit } ;
op_add = '+' ;
add_expr = integer , op_add , integer ;
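Per the Reader<T[]> signature above, a successful match yields the sub-results in order. A quick sketch using add_expr:

// result is (roughly) an array of three Token objects: integer '1', op_add '+', integer '2'
const [ result ] = parse('1+2', add_expr, /\n/g);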
one_of<T>(...matchers) : Reader<T>
The one_of function allows you to match against a number of different options. The parser will try the options in the order they're given until one matches, or until it runs out of branches to try.
const literal = one_of(
    literal_integer,
    literal_string,
    literal_boolean,
    literal_null,
    literal_regex
);
In EBNF:

literal = literal_integer
    | literal_string
    | literal_boolean
    | literal_null
    | literal_regex
    ;
repeat<T>(matcher, allow_empty = false) : Reader<T[]>
The repeat function allows you to look for any number of the same pattern, repeated one after another. The second parameter controls whether or not zero matches is considered valid.
const function_body = repeat(
    one_of(
        statement,
        expression
    ),
    true // allow empty
);
In EBNF:

function_body = { statement | expression } ;
optional<T>(matcher) : Reader<T, null>
The optional function is a wrapper to be placed around another Reader. If the wrapped reader returns a MatchFailed, the optional wrapper will instead just return null, allowing the parent to continue parsing, accepting that the optional content is simply not present in the source.
// Parses "public function" or "function"
const declare_func
= sequence(
// The "public" keyword, and the whitespace after it, is optional. If
// the source does not contain them, index 0 of the result will be null.
optional(
sequence(
token_matcher('keyword_public', /public\b/y),
token_matcher('whitespace', /[ \t\s]+/y)
)
),
token_matcher('keyword_function', /function\b/y),
// ...
);
In EBNF:

keyword_public = 'public' ;
whitespace = ( ' ' | '\t' | '\s' ) , { ' ' | '\t' | '\s' } ;
keyword_function = 'function' ;
declare_func = [ keyword_public , whitespace ] , keyword_function ;
list<T, S>(elem_matcher, separator_matcher, allow_empty, allow_dangling_separator) : Reader<T[]>
The list function allows you to match against a delimited list. It takes two matchers: one for the list elements themselves, and one for the delimiter. Additionally, it takes two boolean flags to control whether or not empty lists are considered valid, and whether or not dangling separators are considered valid.
const integer_list = sequence(
    token_matcher('open_bracket', /\[/y),
    list(
        token_matcher('integer', /[0-9]+/y),
        token_matcher('list_separator', /,/y),
        true, true // allow empty and allow dangling separator
    ),
    token_matcher('close_bracket', /\]/y),
);
In EBNF:

open = '[' ;
close = ']' ;
sep = ',' ;
integer_digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
integer = integer_digit , { integer_digit } ;
integer_list = open , [ integer , { sep , integer } , [ sep ] ] , close ;
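To illustrate the two flags, a quick sketch using the integer_list reader above (note that it defines no whitespace handling, so elements must be packed together):

// With both flags enabled, all of these parse successfully
parse('[]', integer_list, /\n/g);
parse('[1,2,3]', integer_list, /\n/g);
parse('[1,2,3,]', integer_list, /\n/g);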
circular<T>(() => Reader<T>) : Reader<T>
The circular function allows you to late-bind a Reader in the tree, allowing for the easy construction of circular or recursive grammars.
const expression = one_of(
    identifier,
    literal_string,
    literal_integer,
    literal_boolean,
    sequence(
        punc_open_paren,
        circular(() => expression), // recursive pattern
        punc_close_paren
    ),
    // ...
);
In EBNF:

expression = identifier
    | literal_string
    | literal_integer
    | literal_boolean
    | ( '(' , expression , ')' )
    ;
process<T, E, R>(matcher, (params, state) => R | E) : Reader<R, E>
The process function exists to enable you to map the raw token expressions you find in the source into other types of objects. This is where you can craft your various AST nodes in whatever way you want them formatted.
const declare_func = process(
    sequence(
        token_matcher('keyword_func', /function\b/y),
        token_matcher('whitespace', /\s+/y),
        token_matcher('identifier', /[a-z]+/y),
        token_matcher('punc_open_paren', /\(/y),
        // params...
        token_matcher('punc_close_paren', /\)/y),
        token_matcher('punc_open_brace', /\{/y),
        // body...
        token_matcher('punc_close_brace', /\}/y),
    ),
    (tokens) => {
        // Indexes assume matchers fill the elided `params` and `body` slots above
        const [ func, , name, , params, , , body ] = tokens;
        // Process our sequence of tokens into an object to represent this
        // entity in our final AST
        return { _type: 'declare_func', func, name, params, body };
    }
);
In EBNF:

identifier = (? any alpha char ?) , { (? any alpha char ?) } ;
whitespace = '\s' , { '\s' } ;
func_params = '(' , (? params... ?) , ')' ;
func_body = '{' , (? body... ?) , '}' ;
declare_func = 'function' , whitespace , identifier , func_params , func_body ;
skippable<T>(matcher, skip, onSkip) : Reader<T>
The skippable function enables you to attempt to recover from a match failure by providing a fallback pattern used to skip over the broken part of the source. The parser will then continue as if the match succeeded and attempt to keep parsing from that point, collecting the failures that occur as a list of errors on the ParserState. This is mostly intended to enable you to provide better error messages to the user than the default "something unexpected at char 10".
const test = sequence(
    skippable(
        token_matcher('foo', /foo\b/y),
        // If we don't find 'foo', attempt to just skip over some block of text
        // that looks like "a word", and continue parsing from there
        /[a-z]+\b/y,
        // When we skip over a failure, we need to tell the parser what to tell
        // the user about what happened
        (skipped, state) => {
            const message = `Encountered unexpected token "${skipped.text}". (Did you mean "foo"?)`;
            return new MatchFailed(message, state, skipped.text.length);
        }
    ),
    token_matcher('whitespace', /\s+/y),
    token_matcher('bar', /bar\b/y),
    token_matcher('whitespace', /\s+/y),
    token_matcher('baz', /baz\b/y),
);
// Works as expected, no errors
parse('foo bar baz', test, /\n/g);
// Parsing actually completes, but you get an error on the state object along with the AST:
// > Encountered unexpected token "fo". (Did you mean "foo"?)
parse('fo bar baz', test, /\n/g);
// This one actually still fails, because "bar" isn't identified as skippable
parse('fo ba baz', test, /\n/g);
Simple Example Parser
import {
    parse,
    token_matcher,
    list, optional, sequence, repeat, one_of, circular, process,
    MatchFailed
} from '@watl/generic-parser';
// This is our test string that we're going to try parsing
const test_string = 'const foo: u8 = 10;';
// We're gonna use this a lot as whitespace is allowed optionally
// just about everywhere
const whitespace = optional(
    repeat(
        token_matcher('whitespace', /\s+/y)
    )
);
// Matcher for simple integer literals, like `10`
const literal_integer = token_matcher('literal_integer', /[0-9]+/y);

// Matcher for simple alpha-numeric identifier names
const identifier = token_matcher('identifier', /[a-zA-Z_][a-zA-Z0-9_]*/y);

// In reality, an expression parser is way more complex than this, but we'll just
// say that an expression is either an int literal or an identifier
const expression = one_of(
    literal_integer,
    identifier
);
const const_declaration = process(
    sequence(
        // Look for the identifying 'const' keyword to start us off
        token_matcher('const_keyword', /const\b/y),
        whitespace,
        // The name of our new constant
        identifier,
        whitespace,
        // We want to support type inference in our new language, so we'll make the
        // type declaration optional in the grammar
        optional(
            sequence(
                token_matcher('punc_type_declaration', /:/y),
                whitespace,
                identifier,
                whitespace
            )
        ),
        // Look for the assignment operator next
        token_matcher('punc_op_assignment', /=/y),
        whitespace,
        // Read the actual expression to be assigned
        expression,
        whitespace,
        // And finally, we end on a semi-colon
        token_matcher('punc_statement_terminator', /;/y)
    ),
    // Using `process` allows us to transform the collected tokens
    // from the sequence to turn them into a more useful AST node
    (tokens) => {
        const [ keyword, , name, , type_expr, op_assign, , expr, , semi ] = tokens;
        return {
            node_type: 'const_declaration',
            keyword,
            name,
            type_expr,
            op_assign,
            expr,
            semi
        };
    }
);
const [ result, state ] = parse(test_string, const_declaration, /\n/g);
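After a successful parse, result holds the object built in the process callback. Roughly sketched (token fields elided):

// result ≈ {
//     node_type: 'const_declaration',
//     keyword: { type: 'const_keyword', text: 'const', position: { ... } },
//     name: { type: 'identifier', text: 'foo', ... },
//     type_expr: [ ... ], // or null if the type declaration were omitted
//     expr: { type: 'literal_integer', text: '10', ... },
//     ...
// }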