@watl/generic-parser v0.2.5

License: ISC • Last release: 5 years ago

Generic Language Parser

This library enables tokenizing and parsing abstract grammars into custom ASTs using a simple, declarative, and composable syntax.

Install

$ npm install --save @watl/generic-parser

How it works

generic-parser exports only a handful of basic utilities; new parsers are built declaratively by combining larger and more complex patterns together into trees of what are referred to as Reader functions. A Reader is ultimately just a function which receives a ParserState parameter and returns some kind of result. The top-level Reader is then passed into the parse function to parse your source.

For example, here we have an (admittedly, pretty bad) parser for simple math expressions.

import { parse, sequence, optional, circular, token_matcher } from '@watl/generic-parser';

// Define a token for all our valid operators
const operator
    = token_matcher('operator', /\+|\-|\/|\*/y);

// Define a token to find integers
const integer
    = token_matcher('integer', /[0-9]+/y);

// Finally, define what an expression looks like
const expression
    = sequence(
        integer,
        optional(
            sequence(
                operator,
                circular(() => expression)
            )
        )
    );

// Now that we've defined our grammar, we can parse our test string
const [ result, state ] = parse('10+20-15', expression, /\n/g);

The same grammar, expressed in EBNF:
operator       = '+' | '-' | '/' | '*' ;
integer_digit  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
integer        = integer_digit , { integer_digit } ;
expression     = integer , [ operator , expression ] ;

ParserState

A ParserState object is very simple: it just contains the raw lines of source and a pointer (line/char) to the current location in the file. Additionally, if you make use of error recovery via the skippable function, errors that are collected during parsing will be stored on the state.

interface ParserState {
    lines: string[];
    line_index: number;
    char_index: number;
    errors: MatchFailed[];
}
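
To make the pointer mechanics concrete, here is a minimal, self-contained sketch of how a token-matching Reader might consume text from a ParserState and advance the pointer. This is illustrative only, not the library's actual implementation (for one thing, it returns null on failure rather than a MatchFailed):

```javascript
// Toy sketch of a token-matching Reader over a ParserState-like object.
// Illustrative only; the real library's internals may differ.
function toy_token_matcher(type, regex) {
    return (state) => {
        const line = state.lines[state.line_index];

        // Use the sticky (`y`) flag semantics: match exactly at the pointer
        regex.lastIndex = state.char_index;
        const match = regex.exec(line);

        if (match === null) {
            return null; // the real library returns a MatchFailed here
        }

        const start = { line: state.line_index, char: state.char_index };
        state.char_index += match[0].length; // move the pointer forward
        const end = { line: state.line_index, char: state.char_index };

        return { type, text: match[0], position: { start, end } };
    };
}

const state = { lines: ['10+20'], line_index: 0, char_index: 0, errors: [] };
const read_int = toy_token_matcher('integer', /[0-9]+/y);
const token = read_int(state);
// token.text is '10', and state.char_index has advanced to 2
```

Note how the sticky regex anchors the match at `char_index`, which is why the token regexes throughout this document carry the `y` flag.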

Token<T extends string>

The actual text that gets extracted from the source using token_matcher is represented using a Token object.

interface Token<T extends string> {
    type: T;
    text: string;
    position: {
        start: {
            line: number;
            char: number;
        };
        end: {
            line: number;
            char: number;
        };
    };
}

MatchFailed

When a branch of the parser fails (for example, due to a syntax error), it will return an instance of the MatchFailed class, which describes what failed and where in the source code. Normally, this would bubble all the way up and be the final result of the parse call, but there are also some Reader types that can handle these MatchFailed results and try to recover.

declare class MatchFailed {
    public readonly line: number;
    public readonly char: number;
    public readonly reason: string;
    public readonly formatted: string;
}
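
To illustrate the recovery behavior, here is a self-contained toy (not the library's actual code) showing how a combinator in the style of one_of might rewind the pointer when a branch returns a MatchFailed-like value, and only bubble the failure up once every branch has been tried:

```javascript
// Toy stand-in for the library's MatchFailed class (illustrative only)
class ToyMatchFailed {
    constructor(line, char, reason) {
        this.line = line;
        this.char = char;
        this.reason = reason;
    }
}

// Toy sketch of a one_of-style combinator: try each branch in order,
// rewinding the pointer after a failed branch before trying the next
function toy_one_of(...readers) {
    return (state) => {
        const saved = { line: state.line_index, char: state.char_index };
        let last_failure = null;

        for (const reader of readers) {
            const result = reader(state);

            if (!(result instanceof ToyMatchFailed)) {
                return result; // first successful branch wins
            }

            // Rewind and try the next branch
            state.line_index = saved.line;
            state.char_index = saved.char;
            last_failure = result;
        }

        return last_failure; // no branch matched; bubble the failure up
    };
}

const fail = (state) =>
    new ToyMatchFailed(state.line_index, state.char_index, 'no match');
const ok = (state) => { state.char_index += 2; return 'matched'; };

const state = { lines: ['ab'], line_index: 0, char_index: 0, errors: [] };
const result = toy_one_of(fail, ok)(state);
// result is 'matched'; the failed first branch left no trace on the pointer
```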

Reader<T, E = MatchFailed>

A Reader function takes in one of the above ParserState objects, and returns a value of either type T or E. The following low-level Reader functions are provided out of the box:

token_matcher(name, regex, close?) : Reader<Token>

The token_matcher function looks for a specific token at the current location. If a token matching the given regex is found at the current source location, a Token object representing that portion of the text will be returned, moving the pointer forward. Otherwise, returns MatchFailed.

const identifier
    = token_matcher('identifier', /[a-zA-Z_][a-zA-Z0-9_]*/y);

const [ result ] = parse('foo', identifier, /\n/g);

// result:
{
    type: 'identifier',
    text: 'foo',
    position: {
        start: {
            line: 0,
            char: 0
        },
        end: {
            line: 0,
            char: 3
        }
    }
}
sequence<T>(...matchers) : Reader<T[]>

The sequence function searches for a specific sequence of matches in order. This is the primary building block to create any expression or statement that contains more than one token.

const integer
    = token_matcher('integer', /[0-9]+/y);

const op_add
    = token_matcher('op_add', /\+/y);

const add_expr
    // Look for an expression formatted as `n+n` for any integer `n`
    = sequence(
        integer,
        op_add,
        integer
    );

The same grammar, expressed in EBNF:
integer_digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
integer       = integer_digit , { integer_digit } ;
op_add        = '+' ;
add_expr      = integer , op_add , integer ;
one_of<T>(...matchers) : Reader<T>

The one_of function allows you to match against a number of different options. The parser will try options in the order they're given until one matches, or we run out of branches to try.

const literal
    = one_of(
        literal_integer,
        literal_string,
        literal_boolean,
        literal_null,
        literal_regex
    );

The same grammar, expressed in EBNF:
literal = literal_integer
        | literal_string
        | literal_boolean
        | literal_null
        | literal_regex
        ;
repeat<T>(matcher, allow_empty = false) : Reader<T[]>

The repeat function allows you to look for any number of the same pattern, repeated one after another. The second param controls whether matching zero times is considered valid.

const function_body
    = repeat(
        one_of(
            statement,
            expression
        ),
        true // allow empty
    );

The same grammar, expressed in EBNF:
function_body = { statement | expression } ;
optional<T>(matcher) : Reader<T, null>

The optional function is a wrapper to be placed around another Reader. If the wrapped reader returns a MatchFailed, the optional wrapper will instead just return null, allowing the parent to continue parsing and just accepting that the optional content is not defined in the source.

// Parses "public function" or "function"
const declare_func
    = sequence(
        // The "public" keyword, and the whitespace after it, is optional. If
        // the source does not contain them, index 0 of the result will be null.
        optional(
            sequence(
                token_matcher('keyword_public', /public\b/y),
                token_matcher('whitespace', /[ \t\s]+/y)
            )
        ),
        token_matcher('keyword_function', /function\b/y),
        // ...
    );

The same grammar, expressed in EBNF:
keyword_public   = 'public' ;
whitespace       = ( ' ' | '\t' | '\s' ) , { ' ' | '\t' | '\s' } ;
keyword_function = 'function' ;
declare_func     = [ keyword_public , whitespace ] , keyword_function ;
list<T, S>(elem_matcher, separator_matcher, allow_empty, allow_dangling_separator) : Reader<T[]>

The list function allows you to match against a delimited list. It takes two matchers, one for the list elements themselves, and one for the delimiter. Additionally, it takes two boolean flags to control whether or not empty lists are considered valid, and whether or not dangling separators are considered valid.

const integer_list
    = sequence(
        token_matcher('open_bracket', /\[/y),
        list(
            token_matcher('integer', /[0-9]+/y),
            token_matcher('list_separator', /,/y),
            true, true // allow empty and allow dangling separator
        ),
        token_matcher('close_bracket', /\]/y),
    );

The same grammar, expressed in EBNF:
open          = '[' ;
close         = ']' ;
sep           = ',' ;
integer_digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
integer       = integer_digit , { integer_digit } ;
integer_list  = open , [ integer , { sep , integer } , [ sep ] ] , close ;
circular<T>(() => Reader<T>) : Reader<T>

The circular function just allows you to late-bind a Reader in the tree, allowing for the easy construction of circular or recursive grammars.

const expression
    = one_of(
        identifier,
        literal_string,
        literal_integer,
        literal_boolean,
        sequence(
            punc_open_paren,
            circular(() => expression),  // recursive pattern
            punc_close_paren
        ),
        // ...
    )

The same grammar, expressed in EBNF:
expression = identifier
           | literal_string
           | literal_integer
           | literal_boolean
           | ( '(' , expression , ')' )
           ;
process<T, E, R>(matcher, (params, state) => R | E) : Reader<R, E>

The process function exists to enable you to map the raw token expressions you find in the source into other types of objects. This is where you can craft your various AST nodes in whatever way you want them formatted.

const declare_func
    = process(
        sequence(
            token_matcher('keyword_func', /function\b/y),
            token_matcher('whitespace', /\s+/y),
            token_matcher('identifier', /[a-z]+/y),
            token_matcher('punc_open_paren', /\(/y),
            // params...
            token_matcher('punc_close_paren', /\)/y),
            token_matcher('punc_open_brace', /\{/y),
            // body...
            token_matcher('punc_close_brace', /\}/y),
        ),
        (tokens) => {
            // Skip the whitespace and punctuation tokens we don't care about
            const [ func, , name, , params, , , body ] = tokens;

            // Process our sequence of tokens into an object to represent this
            // entity in our final AST
            return { _type: 'declare_func', func, name, params, body };
        }
    );

The same grammar, expressed in EBNF:
identifier   = (? any alpha char ?) , { (? any alpha char ?) } ;
whitespace   = '\s' , { '\s' } ;
func_params  = '(' , (? params... ?) , ')' ;
func_body    = '{' , (? body... ?) , '}' ;
declare_func = 'function' , whitespace , identifier , func_params , func_body ;
skippable<T>(matcher, skip, onSkip) : Reader<T>

The skippable function enables you to attempt to recover from a match failure by providing a fallback pattern to use to skip over a broken part of the source. The parser will then continue as if the match succeeded, and attempt to keep parsing from that point, collecting the failures that occur as a list of errors on the ParserState. This is mostly intended to enable you to provide better error messages to the user than the default "something unexpected at char 10".

const test
    = sequence(
        skippable(
            token_matcher('foo', /foo\b/y),

            // If we don't find 'foo', attempt to just skip over some block of text
            // that looks like "a word", and continue parsing from there
            /[a-z]+\b/y,

            // When we skip over a failure, we need to tell the parser what to tell
            // the user about what happened
            (skipped, state) => {
                const message = `Encountered unexpected token "${skipped.text}". (Did you mean "foo"?)`;

                return new MatchFailed(message, state, skipped.text.length);
            }
        ),
        token_matcher('whitespace', /\s+/y),
        token_matcher('bar', /bar\b/y),
        token_matcher('whitespace', /\s+/y),
        token_matcher('baz', /baz\b/y),
    );

// Works as expected, no errors
parse('foo bar baz', test, /\n/g);

// Parsing actually completes, but you get an error on the state object along with the AST:
//  > Encountered unexpected token "fo". (Did you mean "foo"?)
parse('fo bar baz', test, /\n/g);

// This one actually still fails, because "bar" isn't identified as skippable
parse('fo ba baz', test, /\n/g);

Simple Example Parser

import {
    parse,
    token_matcher,
    list, optional, sequence, repeat, one_of, circular, process,
    MatchFailed
} from '@watl/generic-parser';

// This is our test string that we're going to try parsing
const test_string = 'const foo: u8 = 10;';

// We're gonna use this a lot as whitespace is allowed optionally
// just about everywhere
const whitespace
    = optional(
        repeat(
            token_matcher('whitespace', /\s+/y)
        )
    );

// Matcher for simple integer literals, like `10`
const literal_integer
    = token_matcher('literal_integer', /[0-9]+/y);

// Matcher for simple alpha-numeric identifier names
const identifier
    = token_matcher('identifier', /[a-zA-Z_][a-zA-Z0-9_]*/y);

// In reality, an expression parser is way more complex than this, but we'll just
// say that an expression is either an int literal or an identifier
const expression
    = one_of(
        literal_integer,
        identifier
    );

const const_declaration
    = process(
        sequence(
            // Look for the identifying 'const' keyword to start us off
            token_matcher('const_keyword', /const\b/y),
            whitespace,

            // The name of our new constant
            identifier,
            whitespace,

            // We want to support type inference in our new language, so we'll make the
            // type declaration optional in the grammar
            optional(
                sequence(
                    token_matcher('punc_type_declaration', /:/y),
                    whitespace,
                    identifier,
                    whitespace
                )
            ),

            // Look for the assignment operator next
            token_matcher('punc_op_assignment', /=/y),
            whitespace,

            // Read the actual expression to be assigned
            expression,
            whitespace,

            // And finally, we end on a semi-colon
            token_matcher('punc_statement_terminator', /;/y)
        ),

        // Using `process` allows us to transform the collected tokens
        // from the sequence to turn them into a more useful AST node
        (tokens) => {
            const [ keyword, , name, , type_expr, op_assign, , expr, , semi ] = tokens;

            return {
                node_type: 'const_declaration',
                keyword,
                name,
                type_expr,
                op_assign,
                expr,
                semi
            };
        }
    );

const [ result, state ] = parse(test_string, const_declaration, /\n/g);