0.8.2-20180812191723 • Published 8 years ago

@atomist/microgrammar v0.8.2-20180812191723

@atomist/microgrammar

A parsing library written in TypeScript that fills the large gap between the sweet spots of regular expressions and full-blown BNF or equivalent grammars. It can parse and cleanly update structured content.

Concepts

Microgrammars are a powerful way of parsing structured content such as source code, described in this Stanford paper. They are designed to recognize structures in a string or stream and extract their content: for example, to recognize a Java method that has a particular annotation and extract particular parameters. They are more powerful and typically more readable than regular expressions for complex cases, although they can be built using regular expressions.

Atomist microgrammars go beyond the Stanford paper example in that they permit updating as well as matching, preserving positions. They also draw inspiration from other experience and sources such as the old SNOBOL programming language.

Examples

There are two styles of use:

  • From definitions: Defining a grammar in JavaScript objects
  • From strings: Defining a grammar in a string that resembles input that will be matched

From Definitions Style

Here's a simple example:

const mg = Microgrammar.fromDefinitions<{name: string, age: number}>({
    name: /[a-zA-Z0-9]+/,
    _col: ":",
    age: Integer
});

const results = mg.findMatches("-celine:61 greg*^ tom::: mandy:11");
assert(results.length === 2);
const first = results[0];
assert(first.$matched === "celine:61");
// The offset of this match was the 1st character, as the 0th was discarded
assert(first.$offset === 1);
assert(first.name === "celine");
assert(first.age === 61);

Some notes:

  • A microgrammar definition is typically an object literal, with its properties being matched in turn. This is like concatenation in a BNF grammar.
  • Matcher property values can be regular expressions (like /[a-zA-Z0-9]+/ here), string literals (like ":"), or custom matchers (like Integer). It's easy to define custom matchers for use in composition.
  • All properties need to match for the whole microgrammar to match.
  • Properties that match are bound to the result, unless their names begin with _, in which case the values are discarded.
  • Certain out of band values, beginning with $, are added to the results, showing the exact text that matched, the offset etc.
  • When using TypeScript, microgrammar results can be strongly typed. In this case we've used an anonymous type, but we could also use an interface, or an untyped, JavaScript style.
  • Matching skips junk such as greg*^ tom:::. In this case, greg and tom: will look like the start of valid matches, but the first will fail when it can't match a : and the second when there isn't a digit after the colon.
  • We can match against a string or a stream. In this case we've used a string. In stream matching, we'd be more likely to use an API offering callbacks rather than building an array, so we don't need to hold all our matches in memory at once.

Of course, such a simple example could easily be handled by a regular expression and capture groups. But the power becomes apparent with nested productions and more elaborate matchers.
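To make that comparison concrete, here is a minimal sketch of what the regular expression version of the name:age example might look like. It uses only the standard RegExp API and no microgrammar code; the Person interface and findPeople function are hypothetical names introduced for illustration. Unlike the microgrammar, this regex is strict about whitespace, so it is an approximation rather than an exact equivalent.

```typescript
// Plain-regex approximation of the name:age example above.
// `Person` and `findPeople` are illustrative names, not part of @atomist/microgrammar.
interface Person {
    name: string;
    age: number;
    offset: number;   // Plays the role of $offset
    matched: string;  // Plays the role of $matched
}

function findPeople(input: string): Person[] {
    // Capture group 1 is the name, capture group 2 is the age.
    const pattern = /([a-zA-Z0-9]+):(\d+)/g;
    const found: Person[] = [];
    let m: RegExpExecArray | null;
    while ((m = pattern.exec(input)) !== null) {
        found.push({
            name: m[1],
            age: parseInt(m[2], 10),
            offset: m.index,
            matched: m[0],
        });
    }
    return found;
}

const people = findPeople("-celine:61 greg*^ tom::: mandy:11");
console.log(people.map(p => `${p.name} is ${p.age} (offset ${p.offset})`));
```

The meaning of each capture group lives in a comment and in positional indices, which is manageable here but tends to become hard to read once structures nest.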

A more complex example, showing composition:

export const CLASS_NAME = /[a-zA-Z_$][a-zA-Z0-9_$]*/;

// Any annotation we're not interested in
const DiscardedAnnotation = {
    _at: "@",
    _annotationName: CLASS_NAME,
    _content: optional(JavaParenthesizedExpression),
};

const SpringBootApp = Microgrammar.fromDefinitions<{ name: string }>({
    _app: "@SpringBootApplication",
    _content: optional(JavaParenthesizedExpression),
    _otherAnnotations: zeroOrMore(DiscardedAnnotation),
    _visibility: optional("public"),
    _class: "class",
    name: CLASS_NAME,
});

This will match content like this:

@SpringBootApplication
@Foo
@Bar(name = "Baz", magicParam = 31754)
public class MySpringBootApplication

Notes:

  • JavaParenthesizedExpression is a built-in matcher constant that matches any valid Java content within (...). It uses a state machine. It's easy to write such custom matchers.
  • By default, microgrammars are tolerant of whitespace, treating it as a token separator. This is the behavior we want when parsing most languages or configuration formats.
  • Because the other properties have names beginning with _, only the class name (MySpringBootApplication in our example) is bound to the result. We care about the structure of the rest of the class declaration, but we don't need to extract other values in this particular case.
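To illustrate the state machine idea mentioned in the notes above, here is a small self-contained sketch that recognizes a balanced (...) block while ignoring parentheses inside double-quoted strings. It is not the library's JavaParenthesizedExpression and does not implement the library's matcher interface; matchParenthesized is a hypothetical standalone function, and it deliberately ignores character literals and comments.

```typescript
// Hypothetical helper, NOT the library's JavaParenthesizedExpression.
// Returns the balanced "(...)" text starting at `offset`, or undefined if
// the input does not start with "(" there or never balances.
function matchParenthesized(input: string, offset: number = 0): string | undefined {
    if (input.charAt(offset) !== "(") {
        return undefined;
    }
    type State = "code" | "string" | "escape";
    let state: State = "code";
    let depth = 0;
    for (let i = offset; i < input.length; i++) {
        const ch = input.charAt(i);
        switch (state) {
            case "code":
                if (ch === '"') {
                    state = "string";
                } else if (ch === "(") {
                    depth++;
                } else if (ch === ")") {
                    depth--;
                    if (depth === 0) {
                        return input.substring(offset, i + 1);
                    }
                }
                break;
            case "string":
                // Parentheses inside a string literal do not affect depth.
                if (ch === "\\") {
                    state = "escape";
                } else if (ch === '"') {
                    state = "code";
                }
                break;
            case "escape":
                // The escaped character is consumed; return to the string.
                state = "string";
                break;
        }
    }
    return undefined; // Ran out of input before the parentheses balanced
}

console.log(matchParenthesized('(name = "Baz", magicParam = 31754) public class'));
```

A regular expression cannot track arbitrary nesting depth, which is one reason a hand-written matcher is a natural fit for this kind of structure.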

From String Style

This is a higher level usage model in which a string resembling the desired input but with variable placeholders is used to define the grammar.

This style is ideally suited for simpler grammars. For example:

const ValuePredicateGrammar = Microgrammar.fromString<Predicate>(
    "@${name}='${value}'");

It can be combined with the definitional style through providing optional definitions for the named fields. For example, to constrain the match on a name in the above example using a regular expression:

const ValuePredicateGrammar = Microgrammar.fromString<Predicate>(
    "@${name}='${value}'", {
    	name: /[a-z]+/
    });

As with the object definitional style, whitespace is ignored by default.
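As a rough mental model of the string style, the sketch below turns a template with ${...} placeholders into a regular expression and binds each placeholder to the text it matched. This is an illustration only and makes no claim about how Microgrammar.fromString is implemented: matchTemplate, Definitions and Bindings are hypothetical names, unconstrained placeholders are assumed to match lazily, constraint expressions are assumed to contain no capture groups of their own, and the whitespace tolerance described above is not modeled.

```typescript
// Hypothetical sketch of the "from string" idea; not the library's implementation.
type Definitions = { [name: string]: RegExp };
type Bindings = { [name: string]: string };

// Escape characters that have special meaning in a regular expression.
function escapeRegex(literal: string): string {
    return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function matchTemplate(template: string, input: string, definitions: Definitions = {}): Bindings[] {
    const placeholder = /\$\{([a-zA-Z_][a-zA-Z0-9_]*)\}/g;
    const names: string[] = [];
    let source = "";
    let last = 0;
    let p: RegExpExecArray | null;
    while ((p = placeholder.exec(template)) !== null) {
        // Literal text before the placeholder must appear verbatim.
        source += escapeRegex(template.substring(last, p.index));
        // Use the supplied constraint if there is one, else match lazily.
        const constraint = definitions[p[1]];
        source += "(" + (constraint ? constraint.source : ".+?") + ")";
        names.push(p[1]);
        last = p.index + p[0].length;
    }
    source += escapeRegex(template.substring(last));

    const results: Bindings[] = [];
    const regex = new RegExp(source, "g");
    let m: RegExpExecArray | null;
    while ((m = regex.exec(input)) !== null) {
        if (m[0].length === 0) {
            regex.lastIndex++; // Avoid looping forever on empty matches
            continue;
        }
        const bindings: Bindings = {};
        for (let i = 0; i < names.length; i++) {
            bindings[names[i]] = m[i + 1];
        }
        results.push(bindings);
    }
    return results;
}

const template = "@${name}='${value}'";
console.log(matchTemplate(template, "@author='rod' and @since='2018'"));
console.log(matchTemplate(template, "@Author='rod' @since='2018'", { name: /[a-z]+/ }));
```

In the second call the name constraint rejects @Author because of the uppercase letter, so only the @since predicate is returned.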

Further documentation can be found in the reference. You can also take a look at the tests in this repository.

Alternatives and When To Use Microgrammars

Microgrammars have obvious similarities to BNF grammars, but differ in some important respects:

  • They are intended to match and explain parts of the input, rather than the whole input
  • They excel at skipping content they are uninterested in
  • They are not necessarily context free
  • They do not need to construct a full AST, although they construct ASTs for structures they do match. Thus they can easily cope with partially structured data, happily skipping over incomprehensible content

Similarities are:

  • The idea of productions
  • Composability, including the ability to reuse productions in different grammars
  • Operations such as alternative, optional and rep that enable building complex structures

Compared to regular expressions, microgrammars are:

  • Capable of handling greater complexity
  • More composable
  • Higher level, able to use regular expressions as building blocks
  • Capable of expressing nested structures
  • Arbitrarily extensible through TypeScript function predicates and custom matchers

While it would be overkill to use a microgrammar for something that can be expressed in a simple regex, microgrammars tend to be clearer for complex cases.

Usage

The @atomist/microgrammar module contains both the TypeScript typings and compiled JavaScript. You can use this project by adding the dependency in your package.json.

$ npm install @atomist/microgrammar --save

Performance considerations

See Writing efficient microgrammars.

Development

See the contribution guidelines.

Running tests

Run all the tests with Mocha:

npm test

Run one test file:

TEST=MyTestFile.ts npm run testone

Run benchmarks with profiling, leaving a profile.txt file to view:

npm run benchmark

Clean (including deleting any profiling data):

npm run clean
