xml-zero-lexer v4.0.1
xml-zero-lexer
Friendly and forgiving HTML5/XML5/React-JSX lexer/parser with lots of tests. Memory-efficient and Web Worker compatible.
Features
- HTML, XML, and React JSX parsing/lexing
- Zero-copy (returns only indexes into the original string, not copies of values)
- TypeScript
- Lots of tests
- Limited in scope. Easy to audit
- Zero dependencies
"Zero Copy"?
Generally speaking (very generally), the term 'zero copy' refers to a way of parsing input that returns a 'table of contents' data structure: a list of indexes into the input, handed to you, the programmer.
The term "Zero Copy" is an attempt to distinguish 'table of contents' parsing from a 'fork the input data, and copy every distinct part into a new data structure without any reference to the original' approach to parsing.
This is all quite philosophical, so unless you really care about this, please choose a more popular library.
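To make the distinction concrete, here is a minimal sketch that does not use xml-zero-lexer at all. The IndexToken and CopiedToken types and the valueOf helper are hypothetical names made up for this illustration; the index values are taken from the Usage example in this README.

```typescript
// Hypothetical illustration of 'table of contents' vs 'copying' results.
// None of these names come from xml-zero-lexer.
type IndexToken = [kind: string, start: number, end: number];
type CopiedToken = { kind: string; value: string };

const input = "<p>his divine shadow</p>";

// Zero-copy style: only positions are stored; the input is never duplicated.
const elementToken: IndexToken = ["element", 1, 2];
const textToken: IndexToken = ["text", 3, 20];
const toc: IndexToken[] = [elementToken, textToken];

// Copying style: every value is extracted into a new string up front.
const copied: CopiedToken[] = toc.map(([kind, start, end]) => ({
  kind,
  value: input.substring(start, end),
}));

// With the zero-copy table, a value is materialised only when it is needed.
function valueOf(source: string, token: IndexToken): string {
  return source.substring(token[1], token[2]);
}

console.log(valueOf(input, textToken)); // "his divine shadow"
console.log(copied.map((c) => c.value)); // [ "p", "his divine shadow" ]
```

The trade-off in this sketch is that the index table stays small and cheap to build, but it is only meaningful while the original input string is kept around.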
Usage
import Lexx, { NodeTypes } from "xml-zero-lexer";
//
const xml = "<p>his divine shadow</p>";
//
const tokens = Lexx(xml);
//
// So now 'tokens' will be:
//
//   [
//     [NodeTypes.ELEMENT_NODE, 1, 2],
//     [NodeTypes.TEXT_NODE, 3, 20],
//     [NodeTypes.CLOSE_ELEMENT],
//   ]
//
// You can now index into the input string
// using the return value `tokens`...
//
const firstToken = tokens[0];
//
// The first token is an element so this is true:
const isElement = firstToken[0] === NodeTypes.ELEMENT_NODE;
//
// would console.log a string of "p"
console.log(xml.substring(firstToken[1], firstToken[2]));
//
const secondToken = tokens[1];
//
// The second token is a text node so this is true;
const isTextNode = secondToken[0] === NodeTypes.TEXT_NODE;
//
// would console.log a string of "his divine shadow"
console.log(xml.substring(secondToken[1], secondToken[2]));
API
xml-zero-lexer's default export is a function that can be imported as import Lexx from 'xml-zero-lexer';.
You can provide this function one or two arguments:
- (Required) The input string of HTML, XML, or React JSX;
- (Optional) { jsx?: boolean; html?: boolean; blackholes?: string[] } where:
  - jsx (boolean, optional, default=false): Whether to parse {expression} in the middle of text / child nodes as a distinct JSX token. React-JSX attributes are always parsed by xml-zero-lexer, so this option only controls whether JSX in text / child nodes should be parsed (true) or treated as text (false).
  - html (boolean, optional, default=false): Currently, only affects whether to treat <br> and <link> and <img> (and other HTML self-closing tags) as self-closing tags, affecting whether to return another token of [NodeTypes.CLOSE_ELEMENT] when parsing these elements.
  - blackholes (string[], optional, default=["script", "style"]): an array of element names (typically ["script", "style"] etc.) that have special parsing rules, meaning input of <script> <p> </script> should be parsed as a script element with a text node of " <p> ". The special rule is that such an element encompasses everything until it closes with the same element name.
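As a rough sketch, the options object described above could be written out like this. The alias name LexxOptions is hypothetical (the library may not export a type with that name), and the call to Lexx is left as a comment so the snippet runs without the package installed.

```typescript
// Shape of the optional second argument, as described in the list above.
// The alias name `LexxOptions` is an assumption made for this example.
type LexxOptions = { jsx?: boolean; html?: boolean; blackholes?: string[] };

// Options one might use when lexing an HTML page.
const htmlPageOptions: LexxOptions = {
  html: true,
  blackholes: ["script", "style"],
};

// Options one might use when lexing React JSX source.
const jsxOptions: LexxOptions = { jsx: true };

// With the package installed, these would be passed as the second argument:
//   import Lexx from "xml-zero-lexer";
//   const tokens = Lexx("<p>Hello {name}</p>", jsxOptions);
```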
This function returns Token[]. Each Token has the shape [NodeType: NODE_TYPE, ...restIndexes: number[]] where:
- NodeType: NODE_TYPE = number (integer). A number representing the type of node, such as 'open element', 'attribute', 'jsx attribute', 'text node', 'close element' and so on. For convenience the NodeTypes export has named keys for each type: NodeTypes.ELEMENT_NODE, NodeTypes.CLOSE_ELEMENT, NodeTypes.TEXT_NODE, NodeTypes.ATTRIBUTE_NODE (etc...). This constant can be accessed through the NodeTypes export, eg import { NodeTypes } from 'xml-zero-lexer';.
- ...restIndexes: number[] is an array of integer indexes into the input string that can be used as eg. inputString.substring(restIndexes[0], restIndexes[1]). Each NODE_TYPE can have a different restIndexes.length, from 0 to 4. Some NodeTypes also have a variable length, such as NodeTypes.ATTRIBUTE_NODE, because attribute values are optional in HTML5 (ie <input hidden> lacks a value, and in HTML5 should be treated as a boolean true, but is essentially valueless for the purposes of parsing). So an input string of <a href="//zombo.com" hidden> would return:
[
  [NodeTypes.ELEMENT_NODE, 1, 2],
  [NodeTypes.ATTRIBUTE_NODE, 3, 7, 9, 20],
  [NodeTypes.ATTRIBUTE_NODE, 22, 28],
];
Note the variation in length of the ATTRIBUTE_NODE tokens.
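One way to cope with the variable-length attribute tokens is a small helper that treats the value indexes as optional. This is only a sketch: readAttribute and the Attribute type are hypothetical names, and the local NodeTypes object is a stand-in with assumed numeric values so the snippet runs on its own (in real code, import NodeTypes from xml-zero-lexer instead). The token values are the ones listed above.

```typescript
// Stand-in for the library's NodeTypes export so this sketch is self-contained.
// The numeric values are assumptions; import NodeTypes from the package in real code.
const NodeTypes = { ELEMENT_NODE: 1, ATTRIBUTE_NODE: 2 } as const;

type Token = number[];
type Attribute = { name: string; value: string | null };

// Hypothetical helper: read an attribute token as a name plus an optional value.
// Returns null for tokens that are not attributes.
function readAttribute(input: string, token: Token): Attribute | null {
  const [type, nameStart, nameEnd, valueStart, valueEnd] = token;
  if (type !== NodeTypes.ATTRIBUTE_NODE || nameStart === undefined || nameEnd === undefined) {
    return null;
  }
  const name = input.substring(nameStart, nameEnd);
  // A two-index attribute token (eg `hidden`) has no value indexes.
  const value =
    valueStart !== undefined && valueEnd !== undefined
      ? input.substring(valueStart, valueEnd)
      : null;
  return { name, value };
}

const input = '<a href="//zombo.com" hidden>';
const tokens: Token[] = [
  [NodeTypes.ELEMENT_NODE, 1, 2],
  [NodeTypes.ATTRIBUTE_NODE, 3, 7, 9, 20],
  [NodeTypes.ATTRIBUTE_NODE, 22, 28],
];

const attributes: Attribute[] = [];
for (const token of tokens) {
  const attribute = readAttribute(input, token);
  if (attribute) attributes.push(attribute);
}

console.log(attributes);
// [ { name: "href", value: "//zombo.com" }, { name: "hidden", value: null } ]
```

Returning null for a missing value keeps the "valueless" case distinct from an empty string; a caller that wants HTML5 boolean semantics can map null to true afterwards.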
To learn more about the expected outputs for various inputs, please read the tests in index.test.ts... they're designed to be readable, just look for the const cases = [ variable around line 41.
4KB gzipped.
Want to turn it back into formatted HTML or XML? Try xml-zero-beautify.
CHANGELOG
If you're upgrading to 3.1.x or greater, please note that parsing </> no longer results in an array with two items (an opening element and a closing element). It now results in an array with one item, a closing element. This better adheres to expected semantics for React-JSX parsing. Tests have been updated accordingly.
Part of XML-Zero.js