intertext-lexer v0.30.0
InterText Lexer Interlex
Notes
- An 'empty lexer' (i.e. a lexer without any lexemes) will match the empty string and nothing else.
  Depending on the lexer's configuration, the result for the empty string may contain
  $start and / or $end tokens; the result for any other input may, in addition, either contain a single $error token or else cause an error to be thrown, as the case may be.
Adding Lexemes
- At the moment, lexemes are added one by one by calling lexer.add_lexeme(); in the future, defining entire grammars in one go should become possible
- lexer.add_lexeme() expects a cfg ('configuration') object with the following keys:
  - cfg.mode: a valid JS identifier to identify the lexing mode;
    - if no base mode was passed on instantiation of the lexer, the mode named in the first call to lexer.add_lexeme() will become the base mode
    - lexing will start from the base mode, which thereby becomes the initial mode (but note that the base mode may be re-assumed later without it becoming the initial mode)
  - cfg.lxid (Lexeme ID): a valid JS identifier to identify the matched lexeme;
  - cfg.pattern: a string or a regular expression to describe a constant or variable pattern. Strings will be converted to regexes, using proper escaping for all characters that are special in regexes (like *, ( and so on);
    - named groups (captures): when a regular expression has named groups (as in /abc(?<letter>[a-z]+)/), those matches will be put into an object named g of the token (so if the above matches, token.g.letter will contain a string matching one or more of the basic 26 lower case Latin letters)
    - names of modes and lexemes will be used to construct regex group names; therefore, they must all be valid JS identifiers
  - cfg.jump: see Jumps
  - cfg.create: an optional function that will be called right after a token is created from the lexeme (and right before it is frozen and yielded to the caller); whatever create() returns will become the next token
  - cfg.value and cfg.empty_value allow setting the value property of a token; both can be either null (the default), a text or a function whose return value will become token.value.
    - when defined, cfg.value will always override the token value; cfg.empty_value will only be considered when the token value would be an empty string
    - when cfg.value or cfg.empty_value are functions, they will be called in the context of the lexer and with the token as only argument
    - cfg.value or cfg.empty_value will be considered immediately before cfg.create() is called (where applicable)
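The two kinds of pattern can be illustrated in plain JavaScript, without using intertext-lexer itself. This is only a sketch: escape_for_regex is a hypothetical helper and not necessarily how the library escapes strings, and copying match.groups into an object is merely assumed to be comparable to what ends up in token.g.

```javascript
// Hypothetical helper (not part of intertext-lexer): escape all characters
// that are special in regexes so that a string is matched literally.
const escape_for_regex = ( text ) => text.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );

// A constant pattern given as a string containing special characters:
const constant_re = new RegExp( escape_for_regex( '(*)' ), 'u' );
console.log( constant_re.test( 'a(*)b' ) );   // true

// A variable pattern with a named group, as in the text above:
const variable_re = /abc(?<letter>[a-z]+)/u;
const match       = 'abcxyz'.match( variable_re );
// Copy the named captures into a plain object, comparable to `token.g`:
const g           = { ...match.groups };
console.log( g.letter );                      // 'xyz'
```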
Jumps
The jump property of a lexeme declaration indicates which new mode the lexer should switch to when
it encounters a matching pattern. It is either a string or a function. Allowed strings take one of the following forms
(assuming we're in mode plain in the below):
- entry jumps (jumps that mandate a jump to a new mode): say we're looking for left pointy brackets < and want to switch to mode tag. Depending on whether the boundary itself should belong to the current or the upcoming mode, the value for jump should have either a leading or a trailing left square bracket [:
  - { jump: '[tag', }: an inclusive entry jump; the 'boundary post' (the token for the <) will belong to the new mode, tag. This is called 'inclusive' because the new mode already includes the upcoming token (although it is declared with mode: 'plain'). The jump target tag appears 'inside' the square bracket.
  - { jump: 'tag[', }: an exclusive entry jump; the 'boundary post' (the token for the <) will belong to the old mode, plain. This is called 'exclusive' because the new mode will not include the upcoming token. The jump target appears 'outside' the square bracket.
- exit jumps (jumps that mandate a jump out of the current mode back to the previous one): say we're inside a pair of pointy brackets <...> that we're lexing in mode tag; now we encounter the right pointy bracket > that signals the end of that stretch, so it's time to revert to plain. This can be symbolized by a jump value that either starts or ends with a ] (right square bracket) and has a . (dot) symbolizing the location of the mode the 'boundary post' will belong to:
  - { jump: '.]', }: an inclusive exit jump; 'boundary post' belongs to the old mode, tag
  - { jump: '].', }: an exclusive exit jump; 'boundary post' belongs to the new mode, plain
- singleton (or virtual) jumps: lexemes that are declared with a mode name enclosed by a left and a right bracket as in jump: '[foo]' will, when a match occurs, cause a token to be emitted for that match whose mode is set to the jump target (here foo). For example, when declaring tokens and modes for typical "string literals", it is possible to fast-track, as it were, the special case of an empty string literal, "", in plain mode, but still make that lexeme and token belong to the string literal mode (say, dqstr for 'double quoted string'):
    lexer.add_lexeme { mode: 'plain', lxid: 'dq2', jump: '[dqstr]', pattern: /(?<!")""(?!")/u, reserved: '"', }
  - Singleton jumps will cause border tokens to be emitted just as with regular jumps (when the lexer is configured with border_tokens: true).
In case the value of the jump property is a function, it will be called with an object { token, match, lexer, }. It should return one of:
- null in case nothing should be done; the token will be used as passed into this function, and the mode will not be changed, or
- an object with two optional properties, jump and token, in which case jump should be a string whose value will be interpreted as described above, and token could be the token passed in or a completely new one.
  - In any event, the token's jump, mode and mk properties will be adjusted as appropriate, which means that setting or not setting these values makes no difference.
  - Note: You can only change the lexer's mode by returning an allowable value for jump; a returned token's mode will be ignored.
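To make the string notation for jumps more tangible, here is a small classifier written in plain JavaScript. It is a sketch only: the function name parse_jump and the shape of the returned object (kind, target, boundary) are assumptions made for illustration and do not mirror the library's internals.

```javascript
// Hypothetical classifier for the jump strings described above.
// `boundary` names the mode that the 'boundary post' token belongs to.
const parse_jump = ( jump ) => {
  let match;
  // '[foo]': singleton jump, the token is assigned to the jump target
  if ( ( match = jump.match( /^\[([^\[\].]+)\]$/ ) ) !== null )
    return { kind: 'singleton', target: match[ 1 ], boundary: 'target' };
  // '[tag': inclusive entry jump, boundary post belongs to the new mode
  if ( ( match = jump.match( /^\[([^\[\].]+)$/ ) ) !== null )
    return { kind: 'entry', target: match[ 1 ], boundary: 'new' };
  // 'tag[': exclusive entry jump, boundary post belongs to the old mode
  if ( ( match = jump.match( /^([^\[\].]+)\[$/ ) ) !== null )
    return { kind: 'entry', target: match[ 1 ], boundary: 'old' };
  // '.]': inclusive exit jump, boundary post belongs to the old mode
  if ( jump === '.]' ) return { kind: 'exit', target: null, boundary: 'old' };
  // '].': exclusive exit jump, boundary post belongs to the new mode
  if ( jump === '].' ) return { kind: 'exit', target: null, boundary: 'new' };
  throw new Error( `not a valid jump: ${jump}` );
};

for ( const jump of [ '[tag', 'tag[', '.]', '].', '[dqstr]' ] )
  console.log( jump, parse_jump( jump ) );
```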
Example
Here is a minimal lexer that understands a tiny fraction of the Markdown grammar, namely, single stars *
for emphasis and single backticks ` for code spans. The single star will be passed through
as-is inside code spans:
{ Interlex
  compose } = require 'intertext-lexer'
first = Symbol 'first'
last = Symbol 'last'
#.........................................................................................................
new_toy_md_lexer = ( mode = 'plain' ) ->
  lexer = new Interlex { dotall: false, }
  #.........................................................................................................
  lexer.add_lexeme { mode: 'plain', lxid: 'escchr', jump: null, pattern: /\\(?<chr>.)/u, }
  lexer.add_lexeme { mode: 'plain', lxid: 'star1', jump: null, pattern: /(?<!\*)\*(?!\*)/u, }
  lexer.add_lexeme { mode: 'plain', lxid: 'codespan', jump: 'literal[', pattern: /(?<!`)`(?!`)/u, }
  lexer.add_lexeme { mode: 'plain', lxid: 'other', jump: null, pattern: /[^*`\\]+/u, }
  lexer.add_lexeme { mode: 'literal', lxid: 'codespan', jump: '.]', pattern: /(?<!`)`(?!`)/u, }
  lexer.add_lexeme { mode: 'literal', lxid: 'text', jump: null, pattern: /(?:\\`|[^`])+/u, }
  #.........................................................................................................
  return lexer

Topological Sorting
Interlex optionally uses topological sorting (provided by
ltsort, q.v.) of lexemes. This is triggered by adding a
before or after attribute when calling lexer.add_lexeme(). Either attribute may be a TID (which
identifies a lexeme in the same mode) or a list (array) of TIDs. Either attribute may also be a star * meaning
'before' or 'after everything else'. These dependency indicators don't have to be exhaustive; where left
unspecified, the relative ordering of addition of the lexemes is kept.
Observe that ordering is only defined for lexemes within the same lexer mode; there's no notion of relative ordering between lexer modes or lexemes across modes.
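The effect of before and after can be sketched with a small depth-first topological sort in plain JavaScript. This is not ltsort and not the library's code: sort_lexemes is a hypothetical function, the star * is not handled, and references to unknown IDs are simply ignored. It only shows how partial constraints can be honored while otherwise keeping the order of addition.

```javascript
// Hypothetical sketch: order lexeme IDs by their `before` / `after`
// constraints, falling back to the order of addition where unconstrained.
const sort_lexemes = ( lexemes ) => {
  const as_list = ( x ) => ( x == null ? [] : Array.isArray( x ) ? x : [ x ] );
  // Maps each ID to the list of IDs that must come earlier:
  const earlier_of = new Map( lexemes.map( ( { lxid } ) => [ lxid, [] ] ) );
  for ( const { lxid, before, after } of lexemes ) {
    for ( const other of as_list( before ) ) earlier_of.get( other )?.push( lxid );
    for ( const other of as_list( after  ) ) earlier_of.get( lxid ).push( other );
  }
  const result = [];
  const state  = new Map(); // ID -> 'visiting' | 'done'
  const visit  = ( id ) => {
    if ( !earlier_of.has( id ) ) return;          // unknown ID: ignore
    if ( state.get( id ) === 'done' ) return;
    if ( state.get( id ) === 'visiting' ) throw new Error( `cyclic dependency at ${id}` );
    state.set( id, 'visiting' );
    for ( const earlier of earlier_of.get( id ) ) visit( earlier );
    state.set( id, 'done' );
    result.push( id );
  };
  for ( const { lxid } of lexemes ) visit( lxid );
  return result;
};

const lexemes = [
  { lxid: 'word', },
  { lxid: 'star2',  before: 'word', },
  { lxid: 'ws', },
  { lxid: 'escchr', before: [ 'star2', ], },
];
console.log( sort_lexemes( lexemes ) );
// → [ 'escchr', 'star2', 'word', 'ws' ]
```

Note that ws keeps its place after word because no constraint mentions it.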
Reserved and Catchall Lexemes
Each lexeme can announce so-called 'reserved' characters or words; these are for now restricted to strings and lists of strings, but could support regexes in the future as well. The idea is to collect those characters and character sequences that are 'triggers' for a given lexeme and, when the mode has been defined, to automatically construct two lexemes that will capture
- all the remaining sequences of non-reserved characters; this is called a catchall lexeme (whose default TID is set to $catchall unless overridden by an lxid setting). The catchall lexeme's function lies in explicitly capturing any part of the input that has not been covered by any other lexeme higher up in the chain of patterns, thereby avoiding a more unhelpful $error token that would just say 'no match at position so-and-so' and terminate lexing.
- all the remaining reserved characters (default TID: $reserved); these could conceivably be used to produce a list of fishy parts in the source, and / or to highlight such places in the output, or, if one feels so inclined, terminate parsing with an error message. For example, when one wants to translate Markdown-like markup syntax to HTML, one could decide that double stars start and end bold type (<strong>...</strong>), or, when a single asterisk is used at the start of a line, indicate unordered list items (<ul>...<li>...</ul>), and are considered illegal in any other position except inside code stretches and when escaped with a backslash. Such a mechanism can help to uncover problems with the source text instead of just glossing over dubious markup and 'just doing something', possibly leading to subtle errors.
Whether the catchall and the reserved lexemes should match single occurrences or contiguous stretches of
occurrences of reserved items can be set with concat: true and concat: false. In the below lexer these
have been left to their defaults (no concatenation called for), but in the last tabular output below the
result for a string of 'foreign' and 'reserved' characters with concat: true is shown.
{ Interlex, } = require 'intertext-lexer'
### NOTE these are the default settings, shown here for clarity ###
lexer = new Interlex()
#.........................................................................................................
mode = 'plain'
lexer.add_lexeme { mode, lxid: 'escchr', pattern: /\\(?<chr>.)/u, reserved: '\\', }
lexer.add_lexeme { mode, lxid: 'star2', pattern: ( /(?<!\*)\*\*(?!\*)/u ), reserved: '*', }
lexer.add_lexeme { mode, lxid: 'heading', pattern: ( /^(?<hashes>#+)\s+/u ), reserved: '#', }
lexer.add_lexeme { mode, lxid: 'word', pattern: ( /\p{Letter}+/u ), }
lexer.add_lexeme { mode, lxid: 'number_symbol', pattern: ( /#(?=\p{Number})/u ), }
lexer.add_lexeme { mode, lxid: 'number', pattern: ( /\p{Number}+/u ), }
lexer.add_lexeme { mode, lxid: 'ws', pattern: ( /\s+/u ), }
lexer.add_catchall_lexeme { mode, concat: false, }
lexer.add_reserved_lexeme { mode, concat: false, }
#.........................................................................................................
H.tabulate "lexer", ( x for _, x of lexer.registry.plain.lexemes )
for probe in [ 'helo', 'helo*x', '*x', "## question #1 and a hash: #", "## question #1 and a hash: \\#", ]
  debug GUY.trm.reverse GUY.trm.steel probe
  H.tabulate ( rpr probe ), lexer.run probe

The lexer's plain mode now has a $catchall and a $reserved lexeme:
lexer
┌───────┬───────────────┬─────────────────────────────────────────┬──────┬──────────┬──────────────┐
│mode │lxid │pattern │jump │reserved │type_of_jump │
├───────┼───────────────┼─────────────────────────────────────────┼──────┼──────────┼──────────────┤
│plain │escchr │/(?<𝔛escchr>\\(?<escchr𝔛chr>.))/u │● │\ │nojump │
│plain │star2 │/(?<𝔛star2>(?<!\*)\*\*(?!\*))/u │● │* │nojump │
│plain │heading │/(?<𝔛heading>^(?<heading𝔛hashes>#+)\s+)/u│● │# │nojump │
│plain │word │/(?<𝔛word>\p{Letter}+)/u │● │● │nojump │
│plain │number_symbol │/(?<𝔛number_symbol>#(?=\p{Number}))/u │● │● │nojump │
│plain │number │/(?<𝔛number>\p{Number}+)/u │● │● │nojump │
│plain │ws │/(?<𝔛ws>\s+)/u │● │● │nojump │
│plain │$catchall │/(?<𝔛$catchall>(?!\\|\*|#)[^])/ │● │● │nojump │
│plain │$reserved │/(?<𝔛$reserved>\\|\*|#)/ │● │● │nojump │
└───────┴───────────────┴─────────────────────────────────────────┴──────┴──────────┴──────────────┘

Results:
'helo'
┌───────┬──────┬────────────┬──────┬───────┬───────┬──────┬────┬────────┐
│mode │lxid │mk │jump │value │x1 │x2 │g │$key │
├───────┼──────┼────────────┼──────┼───────┼───────┼──────┼────┼────────┤
│plain │word │plain:word │● │helo │0 │4 │● │^plain │
└───────┴──────┴────────────┴──────┴───────┴───────┴──────┴────┴────────┘

'helo*x'
┌───────┬───────────┬─────────────────┬──────┬───────┬───────┬──────┬────┬────────┐
│mode │lxid │mk │jump │value │x1 │x2 │g │$key │
├───────┼───────────┼─────────────────┼──────┼───────┼───────┼──────┼────┼────────┤
│plain │word │plain:word │● │helo │0 │4 │● │^plain │
│plain │$reserved │plain:$reserved │● │* │4 │5 │● │^plain │
│plain │word │plain:word │● │x │5 │6 │● │^plain │
└───────┴───────────┴─────────────────┴──────┴───────┴───────┴──────┴────┴────────┘

'*x'
┌───────┬───────────┬─────────────────┬──────┬───────┬───────┬──────┬────┬────────┐
│mode │lxid │mk │jump │value │x1 │x2 │g │$key │
├───────┼───────────┼─────────────────┼──────┼───────┼───────┼──────┼────┼────────┤
│plain │$reserved │plain:$reserved │● │* │0 │1 │● │^plain │
│plain │word │plain:word │● │x │1 │2 │● │^plain │
└───────┴───────────┴─────────────────┴──────┴───────┴───────┴──────┴────┴────────┘

'## question #1 and a hash: #'
┌───────┬───────────────┬─────────────────────┬──────┬──────────┬───────┬──────┬────────────────┬────────┐
│mode │lxid │mk │jump │value │x1 │x2 │g │$key │
├───────┼───────────────┼─────────────────────┼──────┼──────────┼───────┼──────┼────────────────┼────────┤
│plain │heading │plain:heading │● │## │0 │3 │{ hashes: '##' }│^plain │
│plain │word │plain:word │● │question │3 │11 │● │^plain │
│plain │ws │plain:ws │● │ │11 │12 │● │^plain │
│plain │number_symbol │plain:number_symbol │● │# │12 │13 │● │^plain │
│plain │number │plain:number │● │1 │13 │14 │● │^plain │
│plain │ws │plain:ws │● │ │14 │15 │● │^plain │
│plain │word │plain:word │● │and │15 │18 │● │^plain │
│plain │ws │plain:ws │● │ │18 │19 │● │^plain │
│plain │word │plain:word │● │a │19 │20 │● │^plain │
│plain │ws │plain:ws │● │ │20 │21 │● │^plain │
│plain │word │plain:word │● │hash │21 │25 │● │^plain │
│plain │$catchall │plain:$catchall │● │: │25 │27 │● │^plain │
│plain │$reserved │plain:$reserved │● │# │27 │28 │● │^plain │
└───────┴───────────────┴─────────────────────┴──────┴──────────┴───────┴──────┴────────────────┴────────┘

'## question #1 and a hash: \\#'
┌───────┬───────────────┬─────────────────────┬──────┬──────────┬───────┬──────┬────────────────┬────────┐
│mode │lxid │mk │jump │value │x1 │x2 │g │$key │
├───────┼───────────────┼─────────────────────┼──────┼──────────┼───────┼──────┼────────────────┼────────┤
│plain │heading │plain:heading │● │## │0 │3 │{ hashes: '##' }│^plain │
│plain │word │plain:word │● │question │3 │11 │● │^plain │
│plain │ws │plain:ws │● │ │11 │12 │● │^plain │
│plain │number_symbol │plain:number_symbol │● │# │12 │13 │● │^plain │
│plain │number │plain:number │● │1 │13 │14 │● │^plain │
│plain │ws │plain:ws │● │ │14 │15 │● │^plain │
│plain │word │plain:word │● │and │15 │18 │● │^plain │
│plain │ws │plain:ws │● │ │18 │19 │● │^plain │
│plain │word │plain:word │● │a │19 │20 │● │^plain │
│plain │ws │plain:ws │● │ │20 │21 │● │^plain │
│plain │word │plain:word │● │hash │21 │25 │● │^plain │
│plain │$catchall │plain:$catchall │● │: │25 │27 │● │^plain │
│plain │escchr │plain:escchr │● │\# │27 │29 │{ chr: '#' } │^plain │
└───────┴───────────────┴─────────────────────┴──────┴──────────┴───────┴──────┴────────────────┴────────┘

Result with add_catchall_lexeme { mode, concat: false, }, add_reserved_lexeme { mode, concat: false, }:
':.;*#'
┌───────┬───────────┬─────────────────┬──────┬───────┬───────┬──────┬────┬────────┐
│mode │lxid │mk │jump │value │x1 │x2 │g │$key │
├───────┼───────────┼─────────────────┼──────┼───────┼───────┼──────┼────┼────────┤
│plain │$catchall │plain:$catchall │● │: │0 │1 │● │^plain │
│plain │$catchall │plain:$catchall │● │. │1 │2 │● │^plain │
│plain │$catchall │plain:$catchall │● │; │2 │3 │● │^plain │
│plain │$reserved │plain:$reserved │● │* │3 │4 │● │^plain │
│plain │$reserved │plain:$reserved │● │# │4 │5 │● │^plain │
└───────┴───────────┴─────────────────┴──────┴───────┴───────┴──────┴────┴────────┘

Result with add_catchall_lexeme { mode, concat: true, }, add_reserved_lexeme { mode, concat: true, }:
':.;*#'
┌───────┬───────────┬─────────────────┬──────┬───────┬───────┬──────┬────┬────────┐
│mode │lxid │mk │jump │value │x1 │x2 │g │$key │
├───────┼───────────┼─────────────────┼──────┼───────┼───────┼──────┼────┼────────┤
│plain │$catchall │plain:$catchall │● │:.; │0 │3 │● │^plain │
│plain │$reserved │plain:$reserved │● │*# │3 │5 │● │^plain │
└───────┴───────────┴─────────────────┴──────┴───────┴───────┴──────┴────┴────────┘

- it is possible to give $catchall and $reserved lexemes a custom TID by setting the lxid parameter when calling lexer.add_catchall_lexeme() and lexer.add_reserved_lexeme()
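The patterns shown for $catchall and $reserved in the lexer table above can be approximated in plain JavaScript as follows. This is a sketch under assumptions: new_patterns and lex are hypothetical names, the library's actual construction may differ, and the concat: true variant (a + quantifier around the same pattern) is a guess that merely reproduces the tabulated results for ':.;*#'.

```javascript
// Hypothetical helper: escape characters that are special in regexes.
const escape_for_regex = ( text ) => text.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );

// Build sticky regexes for non-reserved ('$catchall') and reserved
// ('$reserved') material from a list of reserved strings.
const new_patterns = ( reserved, { concat = false } = {} ) => {
  if ( reserved.length === 0 ) throw new Error( 'need at least one reserved string' );
  const alternatives = reserved.map( escape_for_regex ).join( '|' );
  const quantifier   = concat ? '+' : '';
  return {
    $catchall: new RegExp( `(?:(?!${alternatives})[^])${quantifier}`, 'y' ),
    $reserved: new RegExp( `(?:${alternatives})${quantifier}`, 'y' ),
  };
};

// Toy tokenizer: at each position, use the first pattern that matches.
const lex = ( source, patterns ) => {
  const tokens = [];
  let x1 = 0;
  while ( x1 < source.length ) {
    let hit = false;
    for ( const [ lxid, re ] of Object.entries( patterns ) ) {
      re.lastIndex = x1;
      const match = re.exec( source );
      if ( match === null ) continue;
      tokens.push( { lxid, value: match[ 0 ], x1, x2: re.lastIndex } );
      x1  = re.lastIndex;
      hit = true;
      break;
    }
    if ( !hit ) throw new Error( `no match at position ${x1}` );
  }
  return tokens;
};

const reserved = [ '\\', '*', '#' ];
console.log( lex( ':.;*#', new_patterns( reserved, { concat: false } ) ).map( ( t ) => t.value ) );
// → [ ':', '.', ';', '*', '#' ]
console.log( lex( ':.;*#', new_patterns( reserved, { concat: true } ) ).map( ( t ) => t.value ) );
// → [ ':.;', '*#' ]
```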
Linewise Lexing and State-Keeping
- state:
  - state: 'keep' will not reset lexer state implicitly (except once before the very first chunk of source is passed to the lexer with lexer.walk() or lexer.run())
    - this is the default for both split: 'lines' and split: false, so modes (but not lexemes) may stretch across line or chunk boundaries
  - state: 'reset' will reset lexer state before processing each new chunk of source. This always happens when lexer.walk() (or lexer.run()) is called, and, if split: 'lines' is set, before each new line of input
- split:
  - split: 'lines' is the default; when lexer.walk { source, } (or lexer.run { source, }) is called, the lexer will internally use GUY.str.walk_lines source to split the source into line-sized chunks (with line endings such as \n removed)
  - split: false means no splitting of source is attempted.
    - when lexer.walk { path, } is used with split: false (not recommended), then the entire content of the corresponding file is first (synchronously) read into memory and then lexed in its entirety. This may be suboptimal when files get big in comparison to available RAM.
- automatic $border tokens:
  - enabled with cfg.border_tokens: true
  - issued whenever a jump from one mode to the other occurs
  - when jump lexemes are declared as inclusive, just looking at the stream of tokens may make it impossible to determine stretches of contiguous tokens; e.g. when < starts and > ends tag mode inclusively, then <t1><t2> will have a sequence of { value: '>', mode: 'tag', }, { value: '<', mode: 'tag', } with no change in mode. Enable border tokens and now you get a sequence
      { mode: 'tag', lxid: 'rightpointy', value: '>', }
      { mode: 'tag', lxid: '$border', value: '', atrs: { prv: 'tag', nxt: 'plain', }, }
      { mode: 'tag', lxid: '$border', value: '', atrs: { prv: 'plain', nxt: 'tag', }, }
      { mode: 'tag', lxid: 'leftpointy', value: '<', }
  - value of border tokens can be set with e.g. cfg.border_value: '|' (can then concatenate all value properties of all tokens to visualize where lexer mode was changed)
- Behavior of automatic $eof tokens:
  - only when enabled at instantiation with eof_token: true
  - when start tokens are enabled, they will be sent
    - when state is reset: each time lexer.walk() is called
    - when state is keep: only when lexer.walk() is called for the first time after an implicit or explicit reset of the lexer state (an implicit reset only occurs once after a lexer has been instantiated and is used for the first time; an explicit reset is triggered by a prior explicit call to lexer.end())
  - in any event, 'reset of lexer state' means that the mode stack is emptied and the lexing mode is set to the base mode; however, the line number will not be reset to 1
  - when EOF tokens are enabled, they will be sent
    - when state is reset: each time lexer.walk() is called and has exhausted the current source
    - when state is keep: any time when lexer.end() is called (explicitly)
  - there's the edge case of a reset of the lexer state caused by an explicit call to lexer.start() from application code within a for token from lexer.walk source loop; this is a question that will have to be dealt with later
- using { state: 'reset', } can be advantageous when lexing line-oriented code such as CSV because it guarantees that at the start of each line, the lexer is reset to its base mode and hence things like an erroneously forgotten closing quote will not affect the entire rest of the result; in other words, it makes lexing a little more robust.
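The robustness argument for state: 'reset' can be illustrated with a toy model in plain JavaScript. This does not use intertext-lexer; mode_at_line_start is a hypothetical function that only tracks whether a double quote is open, as a stand-in for a lexer with a plain and a dqstr mode.

```javascript
// Toy model: report the mode in effect at the start of each line.
// With state 'keep', an unclosed quote leaks into the following lines;
// with state 'reset', every line starts in the base mode again.
const mode_at_line_start = ( lines, { state = 'keep' } = {} ) => {
  let mode = 'plain';
  const result = [];
  for ( const line of lines ) {
    if ( state === 'reset' ) mode = 'plain';
    result.push( mode );
    for ( const chr of line )
      if ( chr === '"' ) mode = ( mode === 'plain' ? 'dqstr' : 'plain' );
  }
  return result;
};

// The first line has an erroneously forgotten closing quote:
const csv_lines = [ 'a,"b,c', 'd,e', 'f,g' ];
console.log( mode_at_line_start( csv_lines, { state: 'keep'  } ) );
// → [ 'plain', 'dqstr', 'dqstr' ]
console.log( mode_at_line_start( csv_lines, { state: 'reset' } ) );
// → [ 'plain', 'plain', 'plain' ]
```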
CFG Settings first, last, start_of_line, end_of_line
It is possible to include any kind of values when lexing starts or ends and also before and after each line;
in each case, no value is sent if the respective setting is null or undefined; when the setting has been
set to a function, that function is called without arguments; all other values are sent as-is. In order to
send null, undefined or a function, use a function with that return value.
- first: emitted as first token
- last: emitted as last token when end of source has been reached
- start_of_line: only when split: 'lines' is set: emitted before first token (if any) of each line
- end_of_line: only when split: 'lines' is set: emitted after last token (if any) of each line
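The rule for how these four settings are turned into emitted values can be sketched in plain JavaScript. The generator below is a hypothetical model, not the library's implementation; each line itself stands in for the tokens that would be produced for that line.

```javascript
// Resolve a setting as described above: null / undefined send nothing,
// functions are called without arguments, everything else is sent as-is.
const resolve = ( setting ) => {
  if ( setting == null ) return [];
  return [ typeof setting === 'function' ? setting() : setting ];
};

// Hypothetical model of linewise walking with the four settings applied.
function* walk_with_markers( lines, cfg = {} ) {
  yield* resolve( cfg.first );
  for ( const line of lines ) {
    yield* resolve( cfg.start_of_line );
    yield line; // stand-in for the tokens of this line
    yield* resolve( cfg.end_of_line );
  }
  yield* resolve( cfg.last );
}

// `end_of_line` is a function returning null, so null itself is sent:
const cfg  = { first: '<<', end_of_line: () => null, last: '>>', };
const sent = [ ...walk_with_markers( [ 'a', 'b' ], cfg ) ];
console.log( sent );
// → [ '<<', 'a', null, 'b', null, '>>' ]
```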
The reset() Method
The reset() method of the Lexer will be called at the beginning of lexing and, additionally when split:
'lines' is set, before each new line.
Linewise Lexing
TO BE UPDATED
- advantages
  - no more struggling with different end-of-line (EOL) standards
  - lexeme definitions can simply assume /^/ will match start-of-line and /$/ will match end-of-line, forget about the 'dot match all' flag (/.../s)
  - oftentimes, 'lines of text' will be reasonably small and meaningful chunks of data to work with, as certified by the success of decades of Posix-style line-oriented data processing; the alternative is handling the content of an entire (arbitrarily huge) file, or arbitrary chunks of a file derived from running offsets + some byte lengths (which always risks cutting through a multibyte UTF-8-encoded character and needs some sort of careful state-keeping)
  - most of the time, lexers will have no need to look at EOL characters; many languages do not care for newlines (outside of string literals) at all and those that do care only (at least at the lexing level) about whether something comes close to the start or the end of a given line, or that something like a line comment will extend to the end of the present line
- initialize with lexer = new Interlex { linewise: true, }
- each time lexer.feed(), lexer.walk(), or lexer.run() is called, the internal line counter is incremented
- therefore, one should call lexer.feed(), lexer.walk(), and lexer.run() only with a single line of text
- observe that one can always call lexer.walk { path, }, then the lexer will iterate over the lines of the file
- lexer will yield tokens in the shape { mode, lxid, mk, jump, value, lnr1, x1, lnr2, x2, g, source, } as with non-linewise lexing, but with source representing the current line (not the entire lexed text), lnr1 indicating the 1-based line number of the start of the match, lnr2 the same for the end of the match, and x1 and x2 indexing into those lines in terms of 0-based UTF-16 code unit indexes, with x2 being exclusive (so if the first letter on the first line is matched, its token will contain { lnr1: 1, x1: 0, lnr2: 1, x2: 1, })
  - since lnr1 and lnr2 are only present in linewise lexing, which implies that the lexer only gets to see a single line at a time, lnr1 and lnr2 must always be equal (IOW there can be no tokens across linebreaks in linewise mode). However, if those tokens are then fed to a parser, that parser may match tokens across linebreaks, and in that case it will be convenient to derive the position of the resulting region by { lnr1, x1, } = first_token; { lnr2, x2, } = last_token
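The derivation of a region from its first and last token, and the meaning of 'UTF-16 code unit index', can be shown in a few lines of plain JavaScript. The tokens below are made-up examples and region_of is a hypothetical helper name.

```javascript
// Derive the position of a multi-token region from its first and last token,
// following `{ lnr1, x1, } = first_token; { lnr2, x2, } = last_token`.
const region_of = ( first_token, last_token ) => {
  const { lnr1, x1 } = first_token;
  const { lnr2, x2 } = last_token;
  return { lnr1, x1, lnr2, x2 };
};

// Made-up tokens from two different lines:
const first_token = { lnr1: 3, x1: 4, lnr2: 3, x2: 9 };
const last_token  = { lnr1: 5, x1: 0, lnr2: 5, x2: 2 };
console.log( region_of( first_token, last_token ) );
// → { lnr1: 3, x1: 4, lnr2: 5, x2: 2 }

// x1, x2 count UTF-16 code units, not code points: '𝔛' (U+1D51B) takes up
// two code units, so an 'a' following it starts at index 2.
const line = '𝔛a';
console.log( line.length );          // 3
console.log( line.indexOf( 'a' ) );  // 2
```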
Prepending and Appending to Chunks and Lines
- Can instantiate lexer with prepend, append settings
- this will prefix, suffix each line or chunk with the string given, if any
- may choose to instantiate as lexer = new Interlex { split: 'lines', append: '\n', } to ensure each line is properly terminated depending on use case
- when prepending, x positions will take prefix length into account and will not match positions in the source
Comparing Token Positions
- import as { sorter } = require 'intertext-lexer'
- sorter.sort: ( tokens... ) ->
  - sort tokens according to their relative positions as given by the attributes lnr1, x1
- sorter.cmp: ( a, b ) =>
  - compare the positions of two tokens a, b according to their attributes lnr1, x1; returns -1 if a starts before b, 0 if a and b start at the same point (not possible if a ≠ b and both tokens came out of the same lexer running over the same source), and +1 if a starts after b
- sorter.ordering_is: ( a, b ) -> ( @cmp a, b ) is -1
  - returns true if the ordering of the two tokens a, b is as given in the call, otherwise false. If JavaScript allowed for custom operators or operator overrides, then maybe I would've implemented this as a << b or a precedes b instead of ordering_is a, b
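The comparison rules above can be restated as a plain JavaScript sketch. This is a reimplementation for illustration, not the library's sorter; in particular, whether the real sort() returns a new list or sorts in place is not stated in the text, and the version below returns a new array.

```javascript
// Compare two tokens by line number first, then by code unit index.
const cmp = ( a, b ) => {
  if ( a.lnr1 !== b.lnr1 ) return a.lnr1 < b.lnr1 ? -1 : +1;
  if ( a.x1   !== b.x1   ) return a.x1   < b.x1   ? -1 : +1;
  return 0;
};

// Assumption: return a sorted copy, leave the arguments untouched.
const sort        = ( ...tokens ) => [ ...tokens ].sort( cmp );
const ordering_is = ( a, b ) => cmp( a, b ) === -1;

// Made-up tokens:
const t1 = { lnr1: 1, x1: 7, value: 'c' };
const t2 = { lnr1: 2, x1: 0, value: 'd' };
const t3 = { lnr1: 1, x1: 2, value: 'b' };
console.log( sort( t1, t2, t3 ).map( ( t ) => t.value ) );  // [ 'b', 'c', 'd' ]
console.log( ordering_is( t3, t1 ) );                       // true
console.log( ordering_is( t2, t1 ) );                       // false
```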
Positioning
- can increase (but not decrease) line number lnr1, lnr2, code unit index x1, x2 by calling lexer.set_offset { lnr, x, } before lexing a chunk of source
- lnr must be a one-based line number; it will be decremented by 1 and added to both lnr1 and lnr2
- x must be a zero-based code unit index (JS string index); it will be added to both x1 and x2
- both lnr and x are optional; their defaults are { lnr: 1, x: 0, }
- this is useful when parts of a file or a string are to be lexed with some parts omitted
- output of Start_stop_preprocessor can be used, line and column positions will be those in the original source
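The arithmetic behind the offset can be written down as a small plain JavaScript function. apply_offset is a hypothetical name; it only models the effect that lexer.set_offset { lnr, x, } is described to have on the positions of a token and is not the library's code.

```javascript
// Model of the described offset arithmetic: `lnr` is one-based, so it is
// decremented by 1 before being added; `x` is zero-based and added as-is.
const apply_offset = ( token, { lnr = 1, x = 0 } = {} ) => ( {
  ...token,
  lnr1: token.lnr1 + lnr - 1,
  lnr2: token.lnr2 + lnr - 1,
  x1:   token.x1 + x,
  x2:   token.x2 + x,
} );

// A made-up token for the first three code units of a chunk:
const token = { lnr1: 1, x1: 0, lnr2: 1, x2: 3, value: 'xyz' };

// With the defaults, positions stay unchanged:
console.log( apply_offset( token ) );
// The chunk really started on line 10, at code unit 12 of that line:
console.log( apply_offset( token, { lnr: 10, x: 12 } ) );
// → { lnr1: 10, x1: 12, lnr2: 10, x2: 15, value: 'xyz' }
```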
Tools
(experimental)
Collection of useful stuff
Start-Stop Preprocessor
- use it to find start, stop tokens in source before applying your main lexer
- currently fixed to recognize XML processing instruction-like <?start?>, <?stop?>, <?stop-all?> (with variant <?stop_all?> to avoid risk of line breaks when re-flowing text in editor); no whitespace may be used inside these
- could be extended to accept custom lexer or custom lexemes
- will yield tokens with { data: { active: true, }, } (or false) depending on whether source text followed more closely a start or a stop instruction
- the tokens containing the relevant processing instructions will always be set to active: false
- uses linewise mode
- initialize as (shown here are the defaults):
    { tools } = require 'intertext-lexer'
    prepro = new tools.Start_stop_preprocessor { active: true, eraser: ' ', }
  - set active: false to only start when a <?start?> marker is found
  - eraser and joiner control how gaps in the source are treated
    - consider a source like abc<?start?>xyz being processed with initial active: true
    - the <?start?> marker is redundant in that case and will be 'elided', meaning a token with `value:
    - but that leaves a hole in the source: abc❓❓❓xyz, how to deal with it?
    - the MVP solution was to send one active chunk abc, one inactive chunk <?start?>, then one active chunk xyz. But this is not a good solution if the downstream lexer operates in linewise fashion because it then will treat abc and xyz as appearing on two consecutive lines and mess up their position data
    - most of the time one would prefer all lnr1, x1 positions to be preserved as faithfully as feasible
    - xyz occurred at x1: 12 in the input; if we now pass on abcxyz that would change its position to x1: 3. What's more, it isn't quite clear whether we should treat abc and xyz as separate stretches / words (because they were separated by a marker) or as a single stretch / word (because the marker was elided). Only the consumer can tell what they want
    - The preprocessor tries to err on the side of the 'safe' and practical assumption that the consumer probably wants their source positions to be preserved and won't mind extraneous inline spaces (true for a lot of source code, HTML &c) and will replace each elided character by a \x20 (U+0020 Space), yielding abc         xyz (with nine spaces).
    - This behavior is called 'erasing' and is controlled by the eraser configuration setting. This can be any string, including the empty string; it will be repeated for as many times as the number of code units (JS string index, length) the erased part comprised (so any codepoint in U+0000..U+FFFF will preserve positions)
    - The alternative to 'erasing' is 'joining' which will put a single copy of whatever text is present in the joiner configuration setting into the spot where a marker was found, so processing abc<?start?>xyz with joiner: ' ' will produce abc xyz
    - settings eraser: '' and joiner: '' are equivalent and will produce abcxyz
    - if the default setting of eraser: ' ' is not a good fit for your use case, consider using something like eraser: '\x00'; U+0000 should not normally be part of any human-readable text source; a pattern /\x00+/ will preserve the information that the source has a hole in this spot, and the resulting line abc␀␀␀␀␀␀␀␀␀xyz will preserve positions
    - Because the preprocessor will keep lines with 'holes' rather than breaking up lines that have intermittent start/stop marks, the relative ordering of active and inactive chunks is not guaranteed; only the ordering relative to other active (respectively inactive) tokens is preserved
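The difference between 'erasing' and 'joining' can be shown with two string operations in plain JavaScript. This sketch does not model the active / inactive state or the tokens of the real preprocessor; erase, join and marker_re are hypothetical names, and only the replacement of markers within a single line is shown.

```javascript
// The markers named in the text above, with no whitespace allowed inside:
const marker_re = /<\?(?:start|stop|stop-all|stop_all)\?>/g;

// 'Erasing': repeat the eraser once per code unit of the elided marker.
const erase = ( line, eraser = ' ' ) =>
  line.replace( marker_re, ( marker ) => eraser.repeat( marker.length ) );

// 'Joining': put a single copy of the joiner where the marker was.
const join = ( line, joiner = '' ) => line.replace( marker_re, () => joiner );

const source = 'abc<?start?>xyz';
const erased = erase( source );
console.log( JSON.stringify( erased ) );   // "abc         xyz"
console.log( erased.indexOf( 'xyz' ) );    // 12, same as in the source
console.log( join( source, ' ' ) );        // 'abc xyz'
console.log( erase( source, '' ) === join( source, '' ) ); // true, both 'abcxyz'
```

With eraser: '\x00' the same call yields abc, nine U+0000 characters, then xyz, so positions are preserved while the hole remains detectable.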
To Do
- – documentation
- – allow to configure start, stop, error tokens, implicit finalize()
- – introduce aliases for names of compose that don't use snake case &c
- – group renaming has a fault in that it will wrongly accept things looking like a named group inside a square-bracket character class, as in /[?<abc>)]
- – we cannot mix regexes with and without s / dotall flag; configure that per mode, per instance?
- – allow to add lexemes w/out explicit mode, will provide default / add to base mode
- – use datoms
- – provide collection of standard lexers for recurring tasks, including an abstracted version of Markdown star lexer
- – clarify whether to use 'lexeme ID' or 'token ID'; should really be the former because a lexeme is the description ('class' or 'type' if you will) of its instances (the tokens); tokens with the same lxid may repeat while there can only be at most one lexeme with a given lxid in a given namespace / mode
- – implement readable representation / RPR for lexers, maybe as table
- – safeguard against undefined lexemes mentioned by before, after
- – distinguish between
  - proto-lexemes (which are lexeme definitions that may be incomplete and have not yet been compiled; they are 'dormant' and stateless),
  - (proper) lexemes (which are lexemes in the registry of a lexer that is ready to be used; these may be stateful), and
  - tokens (the results of certain lexemes having matched at some point in the source text)
- – allow symbolic mode, jump values as in '$codespan_mode' that refer to values in @cfg?
- – allow to set prefixes for input (as class members) and output (as instance members, object properties, or list elements)
- – implement add_lexemes() for adding single and multiple lexemes
- – make use of mode names in lx_* properties mandatory to avoid name conflicts
- – implement set_lnr()
- – offer text normalization that includes removing trailing whitespace, different line endings
    echo '–––'; echo "a1 xyz123\nb1"
    echo '–––'; echo "a2 xyz123\n\rb2"
    echo '–––'; echo "a3 xyz123\r\nb3"
    echo '–––'; echo "a4 xyz123\n\nb4"
    echo '–––'; echo "a5 xyz123\n\r\n\rb5"
    echo '–––'; echo "a6 xyz123\n\n\r\rb6"
    echo '–––'; echo "a7 xyz123\r\n\r\nb7"
    echo '–––'; echo "a8 xyz123\r\n\n\rb8"
    /(\n\r|\r\n|\n)/ -> '\n'
    /\r/ -> ''
  - pay attention to the excellent SO answer https://stackoverflow.com/a/3469155/7568091 which suggests using /[^\S\r\n]/ with double negative ([^] plus \S) to match linear whitespace only
- – export GUY.*.walk_lines() to promote easy use of line-wise lexing
- – should we walk over entire file content when lexer.cfg.linewise is false? Needed to keep parity with walking over texts
- – implement reset() method that is equivalent to instantiating a new lexer with the same settings
- – already possible to use : within mode names to indicate multi-level hierarchy (modes and submodes); possible / necessary / useful to formalize this?
- – allow lexeme declarations to declare errors with a code
- – optionally (but less importantly), could demand implicit catchall and reserved lexemes for all modes, then allow overrides per mode
- – add public API `new_token()` (can be used as `new_token t` to produce copy of `t`, or `new_token { t..., value: 'xxx', }` to derive from `t`, so don't need explicit arguments for that)
- – review role of Datom, `$key` element
- – introduce new value for `cfg.split` which is like `lines` but forgoes the implicit application of `GUY.str.walk_lines()` and trimming, assuming this has been properly done by the consumer; this mainly as a minor optimization
- – how to mark borders when two inclusive jumps appear with no separation as in `<tag1><tag2>`?
- – implement method to add standard lexemes:
- – add tests to ensure positive, negative lookbehinds, lookaheads are not recognized as capturing groups
- – might want to have tokens that cause one or two border tokens to be emitted, notation:
  - `jump: '].['`: emit one token `{ lxid: '$border', data: { prv: 'plain', nxt: 'plain', }, }`
  - `jump: ']..['`: emit two tokens `{ lxid: '$border', data: { prv: 'plain', nxt: 'plain', }, }`
  - `jump: ']xyz['`: emit two tokens `{ lxid: '$border', data: { prv: 'plain', nxt: 'xyz', }, }` and `{ lxid: '$border', data: { prv: 'xyz', nxt: 'plain', }, }`; the mode `xyz` introduced here need not be declared
- – implement positioning API to ensure correct positioning of tokens obtained from a lexer that
  consumes output of a `Start_stop_preprocessor`:
- – allow parsing of 'minimal token' with mandatory attribute `value`, optional attributes `lnr1`, `x1`; this will implicitly call `lexer.set_offset { lnr: t.lnr1, x: t.x1, }`. Useful for consuming tokens from `Start_stop_preprocessor`
  - should offset be reset or carried on when intermittently lexing w/out positions?
- – consider to rename `token.source` -> `token.input`, `token.value` -> `token.source`
- – implement declarative chaining of preprocessors and lexers, lexers and lexers
- – implement using regexes in `reserved` when possible
- – disallow lexing with an 'empty' lexer (that has no lexemes); must explicitly declare a 'match-nothing' lexeme if that's what you want (unlikely)
- – remove `set_offset()`, implement `set_position()`
- – in test `start_stop_preprocessor/start_stop_preprocessor_basic()` some tests show active, inactive tokens out of order; try to fix in preprocessor `$assemble_lines()`?
- – just as denchg tokens are emitted at the very start and end of each document, so should `$border` tokens be emitted
- – use
`$meta` (or similar), `$outline` as mode names in preprocessors to avoid name clashes with userland modes
- – allow for longest-first matches that, starting from the left end, always return the longest
  matching sequence, such that `list of integers` is split into `list of`, `integers` when matchers are `/list\b/`, `/list\s+of\b/`, `/integers?\b/` (and, indeed, `/of\s+integers?\b/`)
  - alternatively, and simpler, require that all lexemes are bounded by a separator to the left and right as motivated by Regular-Expressions.info: Alternation with The Vertical Bar or Pipe Symbol
- – use `slevithan/regex` internally to escape strings etc.; export it for the benefit of the user
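
The line-ending rules noted in the text-normalization item above (`/(\n\r|\r\n|\n)/ -> '\n'`, `/\r/ -> ''`), together with the `/[^\S\r\n]/` pattern for linear whitespace, could be combined roughly as follows. This is only a minimal sketch in plain JavaScript; `normalize_text()` is a hypothetical helper name and not part of the current InterText Lexer API.

```javascript
// Hypothetical helper, NOT part of the InterText Lexer API.
function normalize_text(text) {
  return text
    // two-character line endings (\n\r, \r\n) and plain \n all become a single \n
    .replace(/\n\r|\r\n|\n/g, '\n')
    // any carriage return that is still left over is dropped
    .replace(/\r/g, '')
    // strip trailing linear whitespace (whitespace that is neither \r nor \n) from each line
    .replace(/[^\S\r\n]+$/gm, '');
}

// The `a8` sample from the list above, with a trailing blank added to the first line:
console.log(JSON.stringify(normalize_text('a8 xyz123 \r\n\n\rb8')));
// prints "a8 xyz123\n\nb8"
```

The order of the alternatives matters: the two-character sequences are tried before the bare `\n`, so that `\n\r` and `\r\n` each count as one line break rather than two.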
Is Done
- + demo in `hengist/dev/snippets/src/demo-compose-regexp.coffee` and `hengist/blob/master/dev/intertext-lexer/src/first-demo.coffee`
- + prefix named groups for parameters with rule name (token key) to enable re-use of parameter names
- + allow multiple `gosub_*`, `return` tokens in a single lexer mode; use API rather than naming convention for these
- + implement `cfg`-based API for `add_lexeme()` that provides `jump` argument to replace `gosub_` / `return` naming convention
- + implement `step()`
- + rename `autoreset`, `reset()` -> `autostart`, `start()`
- + make calls to `finalize()` implicit
- + with `cfg.autostart`, `feed()` and `reset()` behave identically
- + implement `feed()` to add new source
- + implement functions for `jump`
- + implement topological sorting of lexemes
- + allow lexemes to announce 'reserved' / 'forbidden' / 'active' characters (such as `<` that signals start of an HTML tag) that can later be used to formulate a fallback pattern to capture otherwise unmatched text portions
  - + at any point, allow to construct a pattern that only matches reserved characters and a pattern that matches anything except reserved characters
- + (?) consider to reset `lexer.cfg.linewise` to `true` when `lexer.walk()` gets called with `path` or else throw error (because results will likely be not as expected). Contra: legitimate to parse with local positions, no line numbers
- + implement lexeme property `create`
- + disallow lexemes to be accidentally overwritten
- + allow lexeme declarations to override `value` by setting either `value` or `empty_value` to
- + modify behavior of catchall and reserved:
  - + catchall and reserved are 'declared', not 'added', meaning they will be created implicitly when `_finalize()` is called
  - + catchall and reserved always come last (in this order)
  - + prevent re-ordering of catchall and reserved when doing topological sorting
  - + the instantiation settings `catchall_concat` and `reserved_concat` can be overridden when either is declared constant or function
- + implement `line`, `col` coordinates for tokens
- + change indexing shape from `lnr`, `start`, `stop` to `l1`, `x1`, `l2`, `x2`, since in the general case, a token may start on one line and end on another. `x1`, `x2` are zero-based, exclusive, code unit indexes (JS string indices), while `l1`, `l2` are one-based, inclusive line numbers. Observe that it can be quite difficult to give correct column numbers when complex scripts are used; for Latin script sources that do not use combining characters but may be intermingled e.g. with symbols and CJK characters from SMP, SIP and TIP, `( Array.from 'string'[ ... x1 ] ).length` converts correctly from 0-based code units to human-readable column counts (but throw in combining characters, RTL scripts or complex emoji and they will be incorrect)
- + consider to introduce 'pre-jumps' (?) such that the occurrence of a match (say,
`<` in `plain` mode) means that the match is already in the jump-target mode (say, `tag`). This should make some things cleaner / more logical when both the left and the right delimiters of a mode are within that mode
- Implement syntax, semantics for inclusive, exclusive jumps:
  - fast, slow jump; inclusive, exclusive jump; early, late jump
  - syntax (assuming mode `plain`):
    - entry jumps:
      - `{ jump: '[tag', }` (inclusive entry jump; boundary 'post' belongs to new mode `tag`),
      - `{ jump: 'tag[', }` (exclusive entry jump; boundary 'post' belongs to old mode `plain`),
    - exit jumps; the location of the `.` (dot) symbolizes the location of the mode the 'post' will belong to:
      - `{ jump: '.]', }` (inclusive exit jump; boundary 'post' belongs to old mode `tag`),
      - `{ jump: '].', }` (exclusive exit jump; boundary 'post' belongs to new mode `plain`)
- documentation
- + rename
`x` -> `atrs`
- + rename `atrs` -> `data`
- + implement 'singleton' / 'virtual' jumps: `jump: '[str]'` will return a token in mode `str` without jumping into that mode (or, by 'virtually' jumping to that mode and then immediately back); the `value` of the token will be the matched substring (as usual)
  - optionally: `jump: 'str[]'`, same as above, but value will always be the empty string (can also be done as `jump: '[str]', value: ''` so not essential)
- + add MVP version of `tools/start-stop-preprocessor` to implement start/stop roughly as detailed below
  - implement a preprocessing mode and a binary property `lexer.state.active`, the rule being that
    - until the preprocessing mode has brought the lexer from `active == false` to `active == true`, all material is rendered, as-is, as `value` property of special `$raw` tokens, without being scanned by the regular mode patterns
    - the preprocessing mode can bring `lexer.state.active` from `true` to `false` and vice versa any number of times, which means that we can use the lexer to determine regions for lexing
    - the reason the above is not feasible with regular modes is that
      - once we jump from preprocessing to regular, the lexer will stay in that regular mode (when `state: 'keep'` is set)
      - when preprocessing has found a `start` meta-token, one does know that only material after that token will have to be lexed by a regular mode, but one does not yet know whether another meta-token can be matched within that remaining region of the source; therefore, one has to first exhaust the preprocessor (for the current chunk or line at least) before regular lexing can start
      - this is essentially the behavior of `Interlex` itself, so one could implement preprocessing by instantiating a separate `Interlex` instance
- + tokens should never have the jump function as value for the `jump` property
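
The conversion from a zero-based code-unit index `x1` to a human-readable column count mentioned in the indexing item above, `( Array.from 'string'[ ... x1 ] ).length`, might look like this in plain JavaScript. This is a minimal sketch only; `column_from_x1()` is a hypothetical name, not part of the lexer's API, and, as stated above, the result will be off as soon as combining characters, RTL scripts or complex emoji are involved.

```javascript
// Hypothetical helper, NOT part of the InterText Lexer API: counts the code points
// that precede the code-unit index `x1` in `line`.
function column_from_x1(line, x1) {
  // Array.from iterates a string by code points, so a character outside the BMP
  // (two UTF-16 code units) is counted once
  return Array.from(line.slice(0, x1)).length;
}

const line = 'a\u{20000}b';     // U+20000 is a CJK character from the SIP (two code units)
const x1 = line.indexOf('b');   // code-unit index of 'b'
console.log(x1, column_from_x1(line, x1));
// prints 3 2
```

For pure BMP text without surrogate pairs the function simply returns `x1` unchanged.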