1.2.0 • Published 2 years ago
@mazard/scanner v1.2.0
Mazard Scanner
This scanner converts a Markdown document into an array of tokens. A parser can then interpret these tokens into an expression tree. It draws heavily on Robert Nystrom's Crafting Interpreters and on The Theory of Parsing, Translation, and Compiling by Alfred Aho and Jeffrey Ullman.
Token types
| Type | Description | Example |
|---|---|---|
| SYMBOL | An alphanumeric string that closely resembles a variable name in other languages | Foo, foo, foo-bar, foo_bar |
| RUNE | Similar to a symbol, but these strings contain non-alphanumeric content | Foo#, -foo, _foo, 1foo, fo>o |
| NUMBER | An integer, decimal, or a number in exponential notation | 1, 1.0, +1, -1, 1.0e1 |
| SPACE | One or more space characters. The literal value is the number of spaces encountered. | |
| TAB | A "\t" or " " at the start of a line. | |
| BR | One or more line break characters. | |
| COLON | A, well, colon | : |
| COLON_COLON | Two colons in sequence, likely indicating an Obsidian metadata value | :: |
| FRONTMATTER_START | The triple-dash at the start of a frontmatter section | --- |
| FRONTMATTER_END | The triple-dash at the end of a frontmatter section | --- |
| FRONTMATTER_KEY | A frontmatter key | The foo in foo: bar |
| FRONTMATTER_VALUE | A frontmatter value | The bar in foo: bar |
| FRONTMATTER_BULLET | A dash at the beginning of a line | The - in - bar |
| CODE_START | The triple-backtick at the start of a code section | ``` |
| CODE_LANGUAGE | The language specified after the triple backticks of a CODE_START | The typescript in ```typescript |
| CODE_KEY | Similar to frontmatter, code blocks can have keys and values after the CODE_START | The foo in foo: bar |
| CODE_VALUE | A metadata code value | The bar in foo: bar |
| CODE_SOURCE | The source code inside of a code block | |
| CODE_END | The triple-backtick at the end of a code section | ``` |
| HHASH | A run of one to six hash characters (#) at the beginning of a line | The ### in ### Foo |
| HGTHAN | A > at the beginning of a line | The > in > Foo |
| L_BRACKET | A single left bracket | [ |
| LL_BRACKET | Two left brackets | [[ |
| R_BRACKET | A single right bracket | ] |
| RR_BRACKET | Two right brackets | ]] |
| LL_BRACE | Two left braces | {{ |
| RR_BRACE | Two right braces | }} |
| ASTERISK | A single asterisk | * |
| ASTERISK_ASTERISK | Two asterisks | ** |
| EQUALS_EQUALS | Two equals signs | == |
| ORDINAL | A number with an ordinal suffix | 1st, 2nd, 3rd, 4th |
| PIPE | A bar pipe | \| |
| TAG | A symbol prefixed with a hashtag | #tag, #tag-foo, #tag1 |
| TILDE_TILDE | Two tildes | ~~ |
| ESCAPE | A backslash followed by any character | \\\| |
| L_PAREN | A left parenthesis | ( |
| R_PAREN | A right parenthesis | ) |
| BACKTICK | A single backtick | \` |
| DOLLAR | A dollar sign | $ |
| DOLLAR_DOLLAR | Two dollar signs | $$ |
| PERCENT_PERCENT | Two percent signs | %% |
| COMMENT | The content of a comment | A comment in %% A comment |
| HTML_TAG | An html tag | <div>, </div>, <p /> |
| HR | A horizontal rule | ---, ***, ___ |
| BULLET | A dash or asterisk at the beginning of a line | The - in - foo |
| N_BULLET | A numbered bullet at the beginning of a line | The 1. in 1. foo |
| CHECKBOX | A checkbox at the beginning of a line | The - [ ] in - [ ] foo |
| URL | A url | https://www.google.com |
| EOF | The very end of the string or file | |
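
The distinction between SYMBOL, RUNE, NUMBER, and ORDINAL in the table can be illustrated with a small classifier. This is only a sketch: `classifyWord` and the three patterns are hypothetical names inferred from the examples in the table, not part of the scanner's API, and the real scanner may draw the boundaries differently.

```typescript
// Hypothetical helper, not part of @mazard/scanner: classifies one
// whitespace-free word using patterns inferred from the table's examples.
type WordType = "SYMBOL" | "RUNE" | "NUMBER" | "ORDINAL";

// Integer, decimal, or exponential notation with an optional sign: 1, 1.0, +1, -1, 1.0e1
const NUMBER_PATTERN = /^[+-]?\d+(\.\d+)?(e[+-]?\d+)?$/i;
// Digits followed by an English ordinal suffix: 1st, 2nd, 3rd, 4th
const ORDINAL_PATTERN = /^\d+(st|nd|rd|th)$/;
// A letter followed by letters, digits, hyphens, or underscores: Foo, foo-bar, foo_bar
const SYMBOL_PATTERN = /^[A-Za-z][A-Za-z0-9_-]*$/;

function classifyWord(word: string): WordType {
  if (NUMBER_PATTERN.test(word)) return "NUMBER";
  if (ORDINAL_PATTERN.test(word)) return "ORDINAL";
  if (SYMBOL_PATTERN.test(word)) return "SYMBOL";
  // Anything else that mixes in other characters falls through to RUNE.
  return "RUNE";
}

for (const word of ["foo_bar", "1foo", "fo>o", "1.0e1", "3rd"]) {
  console.log(word, classifyWord(word));
}
```

Under these assumed patterns, RUNE acts as the fallback: a word becomes a RUNE whenever it fails the stricter checks, which matches examples such as -foo, _foo, and 1foo.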
Some examples
const tokens = scanTokens([
"# Mazard Scanner",
"",
"This scanner converts a Markdown document into an array of tokens.",
]);
printTokens(tokens);

| No | Type | Lexeme | Literal | Line | Column |
|---|---|---|---|---|---|
| 0 | HHASH | "#" | 1 | 0 | 0 |
| 1 | SPACE | " " | 1 | 0 | 1 |
| 2 | SYMBOL | "Mazard" | "Mazard" | 0 | 2 |
| 3 | SPACE | " " | 1 | 0 | 8 |
| 4 | SYMBOL | "Scanner" | "Scanner" | 0 | 9 |
| 5 | BR | "\n\n" | 2 | 0 | 16 |
| 6 | SYMBOL | "This" | "This" | 2 | 0 |
| 7 | SPACE | " " | 1 | 2 | 4 |
| 8 | SYMBOL | "scanner" | "scanner" | 2 | 5 |
| 9 | SPACE | " " | 1 | 2 | 12 |
| 10 | SYMBOL | "converts" | "converts" | 2 | 13 |
| 11 | SPACE | " " | 1 | 2 | 21 |
| 12 | SYMBOL | "a" | "a" | 2 | 22 |
| 13 | SPACE | " " | 1 | 2 | 23 |
| 14 | SYMBOL | "Markdown" | "Markdown" | 2 | 24 |
| 15 | SPACE | " " | 1 | 2 | 32 |
| 16 | SYMBOL | "document" | "document" | 2 | 33 |
| 17 | SPACE | " " | 1 | 2 | 41 |
| 18 | SYMBOL | "into" | "into" | 2 | 42 |
| 19 | SPACE | " " | 1 | 2 | 46 |
| 20 | SYMBOL | "an" | "an" | 2 | 47 |
| 21 | SPACE | " " | 1 | 2 | 49 |
| 22 | SYMBOL | "array" | "array" | 2 | 50 |
| 23 | SPACE | " " | 1 | 2 | 55 |
| 24 | SYMBOL | "of" | "of" | 2 | 56 |
| 25 | SPACE | " " | 1 | 2 | 58 |
| 26 | RUNE | "tokens." | "tokens." | 2 | 59 |
| 27 | EOF | "" | "" | 2 | 66 |
const tokens = scanTokens("here's a *line* with some ~~formatting~~.");
printTokens(tokens);

| No | Type | Lexeme | Literal | Line | Column |
|---|---|---|---|---|---|
| 0 | RUNE | "here's" | "here's" | 0 | 0 |
| 1 | SPACE | " " | 1 | 0 | 6 |
| 2 | SYMBOL | "a" | "a" | 0 | 7 |
| 3 | SPACE | " " | 1 | 0 | 8 |
| 4 | ASTERISK | "*" | "*" | 0 | 9 |
| 5 | SYMBOL | "line" | "line" | 0 | 10 |
| 6 | ASTERISK | "*" | "*" | 0 | 14 |
| 7 | SPACE | " " | 1 | 0 | 15 |
| 8 | SYMBOL | "with" | "with" | 0 | 16 |
| 9 | SPACE | " " | 1 | 0 | 20 |
| 10 | SYMBOL | "some" | "some" | 0 | 21 |
| 11 | SPACE | " " | 1 | 0 | 25 |
| 12 | TILDE_TILDE | "~~" | "~~" | 0 | 26 |
| 13 | SYMBOL | "formatting" | "formatting" | 0 | 28 |
| 14 | TILDE_TILDE | "~~" | "~~" | 0 | 38 |
| 15 | RUNE | "." | "." | 0 | 40 |
| 16 | EOF | "" | "" | 0 | 41 |
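
In the output above, the lexemes read in order spell out the original input, and each column equals the previous column plus the previous lexeme's length. The sketch below shows that round trip on a hand-written excerpt. The `Token` interface and `joinLexemes` are assumptions that mirror the printed columns; the package's actual types may differ.

```typescript
// Assumed token shape, mirroring the columns printed in the tables above.
interface Token {
  type: string;
  lexeme: string;
  literal: string | number | boolean;
  line: number;
  column: number;
}

// Concatenates lexemes in order; for the examples shown, this reproduces the input.
function joinLexemes(tokens: Token[]): string {
  return tokens.map((token) => token.lexeme).join("");
}

// A hand-written excerpt of the output above, covering the text "a *line*".
const excerpt: Token[] = [
  { type: "SYMBOL", lexeme: "a", literal: "a", line: 0, column: 7 },
  { type: "SPACE", lexeme: " ", literal: 1, line: 0, column: 8 },
  { type: "ASTERISK", lexeme: "*", literal: "*", line: 0, column: 9 },
  { type: "SYMBOL", lexeme: "line", literal: "line", line: 0, column: 10 },
  { type: "ASTERISK", lexeme: "*", literal: "*", line: 0, column: 14 },
];

console.log(joinLexemes(excerpt)); // prints "a *line*"
```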
const tokens = scanTokens([
"- [x] Finish the scanner.",
"- [ ] Write some reasonable documentation",
]);
printTokens(tokens);

| No | Type | Lexeme | Literal | Line | Column |
|---|---|---|---|---|---|
| 0 | CHECKBOX | "- [x]" | true | 0 | 0 |
| 1 | SPACE | " " | 1 | 0 | 5 |
| 2 | SYMBOL | "Finish" | "Finish" | 0 | 6 |
| 3 | SPACE | " " | 1 | 0 | 12 |
| 4 | SYMBOL | "the" | "the" | 0 | 13 |
| 5 | SPACE | " " | 1 | 0 | 16 |
| 6 | RUNE | "scanner." | "scanner." | 0 | 17 |
| 7 | BR | "\n" | 1 | 0 | 25 |
| 8 | CHECKBOX | "- [ ]" | false | 1 | 0 |
| 9 | SPACE | " " | 1 | 1 | 5 |
| 10 | SYMBOL | "Write" | "Write" | 1 | 6 |
| 11 | SPACE | " " | 1 | 1 | 11 |
| 12 | SYMBOL | "some" | "some" | 1 | 12 |
| 13 | SPACE | " " | 1 | 1 | 16 |
| 14 | SYMBOL | "reasonable" | "reasonable" | 1 | 17 |
| 15 | SPACE | " " | 1 | 1 | 27 |
| 16 | SYMBOL | "documentation" | "documentation" | 1 | 28 |
| 17 | EOF | "" | "" | 1 | 41 |
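
Because a CHECKBOX token carries its state as a boolean literal, a consumer can tally finished and open tasks without parsing the lexeme. The following is a minimal sketch under that assumption; `CheckboxLike` and `summarizeCheckboxes` are hypothetical names, and the tokens are written by hand rather than produced by scanTokens.

```typescript
// Minimal assumed shape: only the fields this sketch reads.
interface CheckboxLike {
  type: string;
  lexeme: string;
  literal: unknown;
}

// Counts CHECKBOX tokens by their boolean literal and ignores every other type.
function summarizeCheckboxes(tokens: CheckboxLike[]): { done: number; open: number } {
  let done = 0;
  let open = 0;
  for (const token of tokens) {
    if (token.type !== "CHECKBOX") continue;
    if (token.literal === true) done += 1;
    else open += 1;
  }
  return { done, open };
}

// Hand-written tokens. The five-character lexemes are an assumption based on
// the following SPACE token starting at column 5 in the table above.
const checkboxTokens: CheckboxLike[] = [
  { type: "CHECKBOX", lexeme: "- [x]", literal: true },
  { type: "SYMBOL", lexeme: "Finish", literal: "Finish" },
  { type: "CHECKBOX", lexeme: "- [ ]", literal: false },
];

console.log(summarizeCheckboxes(checkboxTokens)); // prints { done: 1, open: 1 }
```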