@mazard/scanner v1.2.0
Mazard Scanner
This scanner converts a Markdown document into an array of tokens. These tokens can then be interpreted by a parser into an expression tree. Much inspiration has been taken from Robert Nystrom's Crafting Interpreters as well as Alfred Aho's The Theory of Parsing, Translation, and Compiling.
Token types
| Type | Description | Example | 
|---|---|---|
| SYMBOL | An alphanumeric string that closely resembles a variable name in other languages | Foo, foo, foo-bar, foo_bar | 
| RUNE | Similar to a symbol, but these strings contain non-alphanumeric content | Foo#, -foo, _foo, 1foo, fo>o | 
| NUMBER | An integer, decimal, or a number in exponential notation | 1, 1.0, +1, -1, 1.0e1 | 
| SPACE | One or more space characters. The literal value is the number of spaces encountered. | |
| TAB | A "\t" or " " at the start of a line. | |
| BR | One or more line break characters. | |
| COLON | A, well, colon | : | 
| COLON_COLON | Two colons in sequence, likely indicating an Obsidian metadata value | :: | 
| FRONTMATTER_START | The triple-dash at the start of a frontmatter section | --- | 
| FRONTMATTER_END | The triple-dash at the end of a frontmatter section | --- | 
| FRONTMATTER_KEY | A frontmatter key | The `foo` in `foo: bar` | 
| FRONTMATTER_VALUE | A frontmatter value | The `bar` in `foo: bar` | 
| FRONTMATTER_BULLET | A dash at the beginning of a line | The `-` in `- bar` | 
| CODE_START | The triple-backtick at the start of a code section | ``` | 
| CODE_LANGUAGE | The language specified after the triple backticks of a CODE_START | The `typescript` in ```` ```typescript ```` | 
| CODE_KEY | Similar to frontmatter, code blocks can have keys and values after the CODE_START | The `foo` in `foo: bar` | 
| CODE_VALUE | A metadata code value | The `bar` in `foo: bar` | 
| CODE_SOURCE | The source code inside of a code block | |
| CODE_END | The triple-backtick at the end of a code section | ``` | 
| HHASH | A one- to six-legged hash tag at the beginning of a line | The `###` in `### Foo` | 
| HGTHAN | A `>` at the beginning of a line | The `>` in `> Foo` | 
| L_BRACKET | A single left bracket | [ | 
| LL_BRACKET | Two left brackets | [[ | 
| R_BRACKET | A single right bracket | ] | 
| RR_BRACKET | Two right brackets | ]] | 
| LL_BRACE | Two left braces | {{ | 
| RR_BRACE | Two right braces | }} | 
| ASTERISK | A single asterisk | * | 
| ASTERISK_ASTERISK | Two asterisks | ** | 
| EQUALS_EQUALS | Two equals signs | == | 
| ORDINAL | A number with an ordinal suffix | 1st, 2nd, 3rd, 4th | 
| PIPE | A bar pipe | \| | 
| TAG | A symbol prefixed with a hashtag | #tag, #tag-foo, #tag1 | 
| TILDE_TILDE | Two tildes | ~~ | 
| ESCAPE | A backslash followed by any character | \| | 
| L_PAREN | A left parenthesis | ( | 
| R_PAREN | A right parenthesis | ) | 
| BACKTICK | A single backtick | `` ` `` | 
| DOLLAR | A dollar sign | $ | 
| DOLLAR_DOLLAR | Two dollar signs | $$ | 
| PERCENT_PERCENT | Two percent signs | %% | 
| COMMENT | The content of a comment | `A comment` in `%% A comment` | 
| HTML_TAG | An HTML tag | `<div>`, `</div>`, `<p />` | 
| HR | A horizontal rule | `---`, `***`, `___` | 
| BULLET | A dash or asterisk at the beginning of a line | The `-` in `- foo` | 
| N_BULLET | A numbered bullet at the beginning of a line | The `1.` in `1. foo` | 
| CHECKBOX | A checkbox at the beginning of a line | The `- [ ]` in `- [ ] foo` | 
| URL | A url | https://www.google.com | 
| EOF | The very end of the string or file | | 
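
The boundary between SYMBOL, RUNE, NUMBER, ORDINAL, and TAG is easiest to see in code. The sketch below is not the scanner's implementation: `classifyWord` and its regular expressions are hypothetical, inferred only from the examples in the table, and it assumes the word has already been isolated from surrounding whitespace and punctuation tokens.

```typescript
// Hypothetical helper, not part of @mazard/scanner. The patterns are
// guesses derived from the example column of the token table.
type WordType = "NUMBER" | "ORDINAL" | "TAG" | "SYMBOL" | "RUNE";

// Integer, decimal, or exponential notation with an optional sign.
const NUMBER = /^[+-]?\d+(\.\d+)?(e[+-]?\d+)?$/i;
// Digits followed by an ordinal suffix.
const ORDINAL = /^\d+(st|nd|rd|th)$/;
// Starts with a letter; "-" and "_" may only join alphanumeric runs.
const SYMBOL = /^[A-Za-z][A-Za-z0-9]*([-_][A-Za-z0-9]+)*$/;
// A symbol with a leading "#".
const TAG = /^#[A-Za-z][A-Za-z0-9]*([-_][A-Za-z0-9]+)*$/;

function classifyWord(word: string): WordType {
  if (NUMBER.test(word)) return "NUMBER";
  if (ORDINAL.test(word)) return "ORDINAL";
  if (TAG.test(word)) return "TAG";
  if (SYMBOL.test(word)) return "SYMBOL";
  // Anything else word-like falls through to RUNE.
  return "RUNE";
}

for (const word of ["foo-bar", "fo>o", "1.0e1", "4th", "#tag-foo"]) {
  console.log(word, classifyWord(word));
}
```

Treating RUNE as the fallback keeps the classifier total: a word such as `1foo` fails the number, ordinal, tag, and symbol checks and so lands in RUNE, matching the table.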
Some examples
const tokens = scanTokens([
	"# Mazard Scanner",
	"",
	"This scanner converts a Markdown document into an array of tokens.",
]);
printTokens(tokens);

| No | Type | Lexeme | Literal | Line | Column | 
|---|---|---|---|---|---|
| 0 | HHASH | "#" | 1 | 0 | 0 | 
| 1 | SPACE | " " | 1 | 0 | 1 | 
| 2 | SYMBOL | "Mazard" | "Mazard" | 0 | 2 | 
| 3 | SPACE | " " | 1 | 0 | 8 | 
| 4 | SYMBOL | "Scanner" | "Scanner" | 0 | 9 | 
| 5 | BR | "\n\n" | 2 | 0 | 16 | 
| 6 | SYMBOL | "This" | "This" | 2 | 0 | 
| 7 | SPACE | " " | 1 | 2 | 4 | 
| 8 | SYMBOL | "scanner" | "scanner" | 2 | 5 | 
| 9 | SPACE | " " | 1 | 2 | 12 | 
| 10 | SYMBOL | "converts" | "converts" | 2 | 13 | 
| 11 | SPACE | " " | 1 | 2 | 21 | 
| 12 | SYMBOL | "a" | "a" | 2 | 22 | 
| 13 | SPACE | " " | 1 | 2 | 23 | 
| 14 | SYMBOL | "Markdown" | "Markdown" | 2 | 24 | 
| 15 | SPACE | " " | 1 | 2 | 32 | 
| 16 | SYMBOL | "document" | "document" | 2 | 33 | 
| 17 | SPACE | " " | 1 | 2 | 41 | 
| 18 | SYMBOL | "into" | "into" | 2 | 42 | 
| 19 | SPACE | " " | 1 | 2 | 46 | 
| 20 | SYMBOL | "an" | "an" | 2 | 47 | 
| 21 | SPACE | " " | 1 | 2 | 49 | 
| 22 | SYMBOL | "array" | "array" | 2 | 50 | 
| 23 | SPACE | " " | 1 | 2 | 55 | 
| 24 | SYMBOL | "of" | "of" | 2 | 56 | 
| 25 | SPACE | " " | 1 | 2 | 58 | 
| 26 | RUNE | "tokens." | "tokens." | 2 | 59 | 
| 27 | EOF | "" | "" | 2 | 66 | 
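
For readers who want to work with tokens in code, here is a minimal sketch of a token shape. The `Token` interface and its field names are assumptions read off the `printTokens` column headers, and `joinLexemes` is a hypothetical helper; the types actually exported by the package may differ.

```typescript
// Assumed token shape, inferred from the printTokens columns.
// The real type exported by @mazard/scanner may differ.
interface Token {
  type: string;
  lexeme: string;
  literal: string | number | boolean;
  line: number;
  column: number;
}

// Hypothetical helper: joins the lexemes back into the text they came from.
function joinLexemes(tokens: Token[]): string {
  return tokens.map((token) => token.lexeme).join("");
}

// Tokens 0 to 5 of the heading example, typed in by hand.
const heading: Token[] = [
  { type: "HHASH", lexeme: "#", literal: 1, line: 0, column: 0 },
  { type: "SPACE", lexeme: " ", literal: 1, line: 0, column: 1 },
  { type: "SYMBOL", lexeme: "Mazard", literal: "Mazard", line: 0, column: 2 },
  { type: "SPACE", lexeme: " ", literal: 1, line: 0, column: 8 },
  { type: "SYMBOL", lexeme: "Scanner", literal: "Scanner", line: 0, column: 9 },
  { type: "BR", lexeme: "\n\n", literal: 2, line: 0, column: 16 },
];

console.log(JSON.stringify(joinLexemes(heading))); // prints "# Mazard Scanner\n\n"
```

Note that in this sample each token's column plus its lexeme length equals the column of the next token on the same line, which is a handy sanity check when writing a parser on top of the scanner.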
const tokens = scanTokens("here's a *line* with some ~~formatting~~.");
printTokens(tokens);

| No | Type | Lexeme | Literal | Line | Column | 
|---|---|---|---|---|---|
| 0 | RUNE | "here's" | "here's" | 0 | 0 | 
| 1 | SPACE | " " | 1 | 0 | 6 | 
| 2 | SYMBOL | "a" | "a" | 0 | 7 | 
| 3 | SPACE | " " | 1 | 0 | 8 | 
| 4 | ASTERISK | "*" | "*" | 0 | 9 | 
| 5 | SYMBOL | "line" | "line" | 0 | 10 | 
| 6 | ASTERISK | "*" | "*" | 0 | 14 | 
| 7 | SPACE | " " | 1 | 0 | 15 | 
| 8 | SYMBOL | "with" | "with" | 0 | 16 | 
| 9 | SPACE | " " | 1 | 0 | 20 | 
| 10 | SYMBOL | "some" | "some" | 0 | 21 | 
| 11 | SPACE | " " | 1 | 0 | 25 | 
| 12 | TILDE_TILDE | "~~" | "~~" | 0 | 26 | 
| 13 | SYMBOL | "formatting" | "formatting" | 0 | 28 | 
| 14 | TILDE_TILDE | "~~" | "~~" | 0 | 38 | 
| 15 | RUNE | "." | "." | 0 | 40 | 
| 16 | EOF | "" | "" | 0 | 41 | 
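
The scanner only emits the delimiters; pairing them up is the parser's job. As a rough illustration of that next step, the sketch below collects the text between matching delimiter tokens. `delimitedSpans` and the reduced `Token` shape are hypothetical, and a real parser would also need to handle nesting and unmatched delimiters more carefully.

```typescript
// Reduced, assumed token shape: only the fields this sketch needs.
interface Token {
  type: string;
  lexeme: string;
}

// Hypothetical helper: returns the text between each opening and closing
// token of the given delimiter type.
function delimitedSpans(tokens: Token[], delimiter: string): string[] {
  const spans: string[] = [];
  let current: string | null = null; // null while outside a pair
  for (const token of tokens) {
    if (token.type === delimiter) {
      if (current === null) {
        current = ""; // opening delimiter
      } else {
        spans.push(current); // closing delimiter
        current = null;
      }
    } else if (current !== null) {
      current += token.lexeme;
    }
  }
  return spans; // an unclosed delimiter contributes nothing
}

// Tokens 4 to 14 of the formatting example, typed in by hand.
const formatted: Token[] = [
  { type: "ASTERISK", lexeme: "*" },
  { type: "SYMBOL", lexeme: "line" },
  { type: "ASTERISK", lexeme: "*" },
  { type: "SPACE", lexeme: " " },
  { type: "SYMBOL", lexeme: "with" },
  { type: "SPACE", lexeme: " " },
  { type: "SYMBOL", lexeme: "some" },
  { type: "SPACE", lexeme: " " },
  { type: "TILDE_TILDE", lexeme: "~~" },
  { type: "SYMBOL", lexeme: "formatting" },
  { type: "TILDE_TILDE", lexeme: "~~" },
];

console.log(delimitedSpans(formatted, "ASTERISK")); // [ 'line' ]
console.log(delimitedSpans(formatted, "TILDE_TILDE")); // [ 'formatting' ]
```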
const tokens = scanTokens([
	"- [x] Finish the scanner.",
	"- [ ] Write some reasonable documentation",
]);
printTokens(tokens);

| No | Type | Lexeme | Literal | Line | Column | 
|---|---|---|---|---|---|
| 0 | CHECKBOX | "- [x]" | true | 0 | 0 | 
| 1 | SPACE | " " | 1 | 0 | 5 | 
| 2 | SYMBOL | "Finish" | "Finish" | 0 | 6 | 
| 3 | SPACE | " " | 1 | 0 | 12 | 
| 4 | SYMBOL | "the" | "the" | 0 | 13 | 
| 5 | SPACE | " " | 1 | 0 | 16 | 
| 6 | RUNE | "scanner." | "scanner." | 0 | 17 | 
| 7 | BR | "\n" | 1 | 0 | 25 | 
| 8 | CHECKBOX | "- [ ]" | false | 1 | 0 | 
| 9 | SPACE | " " | 1 | 1 | 5 | 
| 10 | SYMBOL | "Write" | "Write" | 1 | 6 | 
| 11 | SPACE | " " | 1 | 1 | 11 | 
| 12 | SYMBOL | "some" | "some" | 1 | 12 | 
| 13 | SPACE | " " | 1 | 1 | 16 | 
| 14 | SYMBOL | "reasonable" | "reasonable" | 1 | 17 | 
| 15 | SPACE | " " | 1 | 1 | 27 | 
| 16 | SYMBOL | "documentation" | "documentation" | 1 | 28 | 
| 17 | EOF | "" | "" | 1 | 41 |
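
Because a CHECKBOX token carries its checked state as a boolean literal, a task list can be pulled out of the token stream without re-parsing the text. The following is a sketch under assumptions: `collectTasks`, `Task`, and the reduced `Token` shape are hypothetical names, and the sample tokens are an abridged, hand-typed copy of the checkbox example.

```typescript
// Reduced, assumed token shape: only the fields this sketch needs.
interface Token {
  type: string;
  lexeme: string;
  literal: string | number | boolean;
}

interface Task {
  done: boolean;
  text: string;
}

// Hypothetical helper: each CHECKBOX starts a task, and the lexemes that
// follow are appended to it until the line ends (BR) or the input ends (EOF).
function collectTasks(tokens: Token[]): Task[] {
  const tasks: Task[] = [];
  let open: Task | null = null;
  for (const token of tokens) {
    if (token.type === "CHECKBOX") {
      open = { done: token.literal === true, text: "" };
      tasks.push(open);
    } else if (token.type === "BR" || token.type === "EOF") {
      open = null;
    } else if (open !== null) {
      open.text += token.lexeme;
    }
  }
  // Drop the space that separates the checkbox from its label.
  return tasks.map((task) => ({ done: task.done, text: task.text.trim() }));
}

const sample: Token[] = [
  { type: "CHECKBOX", lexeme: "- [x]", literal: true },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "Finish", literal: "Finish" },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "the", literal: "the" },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "RUNE", lexeme: "scanner.", literal: "scanner." },
  { type: "BR", lexeme: "\n", literal: 1 },
  { type: "CHECKBOX", lexeme: "- [ ]", literal: false },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "Write", literal: "Write" },
  { type: "EOF", lexeme: "", literal: "" },
];

console.log(collectTasks(sample));
```

With the sample above, the first task comes back as done with the text `Finish the scanner.` and the second as not done with the text `Write`.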