1.1.3 • Published 2 years ago

@thenja/html-parser v1.1.3

Weekly downloads
-
License
MIT
Repository
github
Last release
2 years ago

Test Coverage-shield-badge-1

Html-Parser

A simple forgiving html parser for javascript (browser and nodejs).

Features

  • Works in NodeJs or the browser
  • Parse HTML that may not be valid
  • HTML is parsed into a json object, this object can be modified and converted back into HTML
  • Ability to clean HTML, such as remove empty tags and more.

Not supported

  • CDATA in html is not supported

How to use

Installation

npm install @thenja/html-parser --save

Typescript

import { HtmlParser } from "@thenja/html-parser";
let htmlParser = new HtmlParser();

Javascript (browser)

<script src="dist/thenja-html-parser.min.js" type="text/javascript"></script>
var htmlParser = new Thenja.HtmlParser();

Parse HTML

Basic usage

let html = "<div><p>Hello world!</p></div>";
let output = htmlParser.parse(html);

Example output

[
  {
    "type": "tag",
    "tagType": "default",
    "name": "div",
    "attributes": {},
    "children": [
      {
        "type": "tag",
        "tagType": "default",
        "name": "p",
        "attributes": {},
        "children": [
          {
            "type": "text",
            "data": "Hello world!"
          }
        ]
      }
    ]
  }
]

Parse html and reverse the output

let html = "<div><p>Hello world!</p></div>";
let output = htmlParser.parse(html);
let reversedHtml = htmlParser.reverse(output);

Listen for errors

let html = "<div><p>Hello world!</p></div>";
let output = htmlParser.parse(html, (err) => {
  // handle errors here
});

Listen for nodes being added when parsing

// In this example we will replace .jpg extensions with .png
let html = "<div><img src='my-picture.jpg' /></div>";
let output = htmlParser.parse(html, null, (node, parentNode) => {
  if(node.name === 'img' && node.attributes && node.attributes.src) {
    node.attributes.src = node.attributes.src.replace('.jpg', '.png');
  }
});
let newHtml = htmlParser.reverse(output);
// newHtml will equal: <div><img src='my-picture.png' /></div>

Listen for nodes being stringified when reversing

// In this example we will remove the class attribute
let html = "<div class='my-style'></div>";
let output = htmlParser.parse(html);
let newHtml = htmlParser.reverse(output, (node) => {
  if(node.name === 'div') {
    delete node.attributes['class'];
  }
});
// newHtml will equal: <div></div>

Clean up the html

The clean function allows you to remove unwanted html tags (such as empty tags) and empty text nodes.

Available options:

OptionsDescription
removeEmptyTagsRemove empty html tags, such as <p></p>
removeEmptyTextNodesBasically remove a text node if it only contains whitespace
let html = "<div>Hi there<p></p></div>";
// by default, clean options are true, so this is only here for demo purposes
let cleanOptions = { removeEmptyTags: true, removeEmptyTextNodes: true };
let output = htmlParser.parse(html);
output = htmlParser.clean(output, cleanOptions);
let newHtml = htmlParser.reverse(output);
// newHtml will equal: <div>Hi there</div>

Development

npm run init - Setup the app for development (run once after cloning)

npm run dev - Run this command when you want to work on this app. It will compile typescript, run tests and watch for file changes.

Distribution

npm run build -- -v <version> - Create a distribution build of the app.

-v (version) - Optional Either "patch", "minor" or "major". Increase the version number in the package.json file.

The build command creates a /compiled directory which has all the javascript compiled code and typescript definitions. As well, a /dist directory is created that contains a minified javascript file.

Testing

Tests are automatically ran when you do a build.

npm run test - Run the tests. The tests will be ran in a nodejs environment. You can run the tests in a browser environment by opening the file /spec/in-browser/SpecRunner.html.

License

MIT © Nathan Anderson

ToDo

  1. Add in more unit tests

  2. Add in a flattenText() function. This will flatten many nested text nodes into one text node.

<p>My name is <strong>Nathan</strong></p>
Flattened to:
<p>My name is Nathan</p>