Custom HTML parser

A module for easily extracting information from HTML documents.

Installing:

npm install custom_html_parser

Starting example:

import parser from 'custom_html_parser';
// html: the document as a string; options: the array of search groups described below
const result = parser.parse(html, options);

Description

This module uses a SAX parser under the hood and relies heavily on the htmlparser2 module. The goal was a module that can be set up with nothing more than a JSON file.

Here is how it works:

The parser searches for tags with a given name and attributes, and for tags nested inside them. Each search group is defined by a base_tag. Every time the parser finds a tag matching base_tag, it creates a new result object, and everything it finds inside that tag (as defined by base_tag and search_tags) is added to that object. Several search targets can be defined in the search_tags array.

Example:

HTML document:

<div class="foo" title="bar" interesting-attribute="interesting-value">
  something
</div>

Options:

[
  {
    "base_tag": {
      "tag": "div", 
      "attributes": ["class", "title"],
      "values": ["foo", "bar"],
      
      "get_attributes": ["interesting-attribute", "missing_attribute"],
      "get_attributes_as": ["save_name", "save_name_2"],
      "prefix_attributes_with": ["custom_prefix ", "prefix_2 "],
      "empty_attributes_placeholders": ["", "missing"],

      "get_text": true,
      "get_text_as": "inside_text",
      "prefix_text_with": "another_custom_prefix "
    }
  }
]

Result:

[
  [
    {
      "save_name":["custom_prefix interesting-value"],
      "save_name_2":["missing"],
      "inside_text":["another_custom_prefix something"]
    }
  ]
]
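The matching and attribute rules used in this example can be sketched in plain JavaScript. This is not the module's source code: matchesRule and collectAttributes are hypothetical helpers, and the behavior shown (a placeholder is stored without its prefix, an empty placeholder stores nothing, a missing get_attributes_as entry falls back to the attribute name) is an assumption inferred from the example and the option descriptions.

```javascript
// Hypothetical helper: does an opening tag satisfy a rule's tag/attributes/values?
function matchesRule(rule, tagName, attribs) {
  if (rule.tag !== tagName) return false;
  // attributes[i] must be present on the tag with the value values[i]
  return rule.attributes.every((name, i) => attribs[name] === rule.values[i]);
}

// Hypothetical helper: store the requested attribute values in the result object.
function collectAttributes(rule, attribs, result) {
  (rule.get_attributes || []).forEach((name, i) => {
    const key = (rule.get_attributes_as || [])[i] || name; // assumption: fall back to the name
    const value = attribs[name];
    // Assumption: a found value gets the prefix, a placeholder does not.
    const saved = value
      ? ((rule.prefix_attributes_with || [])[i] || '') + value
      : (rule.empty_attributes_placeholders || [])[i];
    if (!saved) return; // assumption: an empty placeholder saves nothing
    (result[key] = result[key] || []).push(saved);
  });
  return result;
}

const rule = {
  tag: 'div',
  attributes: ['class', 'title'],
  values: ['foo', 'bar'],
  get_attributes: ['interesting-attribute', 'missing_attribute'],
  get_attributes_as: ['save_name', 'save_name_2'],
  prefix_attributes_with: ['custom_prefix ', 'prefix_2 '],
  empty_attributes_placeholders: ['', 'missing'],
};
// Attributes of the <div> as a SAX parser would report them on the opening tag.
const attribs = { class: 'foo', title: 'bar', 'interesting-attribute': 'interesting-value' };

const result = {};
if (matchesRule(rule, 'div', attribs)) collectAttributes(rule, attribs, result);
console.log(JSON.stringify(result));
```

The sketch only covers the attribute part of the result; the text between the tags would be handled separately while the tag is open.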

More complicated example:

This is where the strength of the approach shows: when tags repeat (as in a list), it is much easier to extract and organize the information this way.

HTML document:

<ul class="hsy_ul" style="width: 1472px">
  <li class="hsy_li">
    <strong class="hsy_m">Jan</strong>
    <a title="Spring">
      <span>12 in.</span>
      <i style="height:56%;"></i>
    </a>
    <a title="Summer">
      <span>23 in.</span>
      <i style="height:57%;"></i>
    </a>
    <a title="Fall">
      <span>22 in.</span>
      <i style="height:57%;"></i>
    </a>
    <a title="Winter">
      <span>1 in.</span>
      <i style="height:57%;"></i>
    </a>
  </li>
</ul>

Options:

[
  {
    "base_tag": {
      "tag": "ul", 
      "attributes": ["class"],
      "values": ["hsy_ul"]
    },
    "search_tags": [
      [
        {
          "tag": "li", 
          "attributes": ["class"],
          "values": ["hsy_li"]
        },
        {
          "tag": "a", 
          "attributes": [],
          "values": [],

          "get_text": true,
          "get_text_as": "Depth",
          "empty_text_placeholder": "0 in.",
          "inside_tag_text": true,

          "get_attributes": ["title"],
          "get_attributes_as": ["Season"],
          "prefix_attributes_with": ["2017, "]
        }
      ]
    ]
  }
]

Result:

[
  [
    {
      "Season": ["2017, Spring","2017, Summer","2017, Fall","2017, Winter"],
      "Depth": ["12 in.","23 in.","22 in.","1 in."]
    }
  ]
]
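Because every saved value is appended to an array under its key, values found in the same pass line up by index (the first Season belongs to the first Depth). A small helper can turn one result object into row-like records. toRows is a hypothetical name, not part of the module, and the sketch assumes that all arrays in the object have the same length.

```javascript
// Hypothetical helper: turn { key: [v0, v1], ... } into [{ key: v0, ... }, { key: v1, ... }].
// Assumes every array in the result object has the same length.
function toRows(resultObject) {
  const keys = Object.keys(resultObject);
  const length = keys.length ? resultObject[keys[0]].length : 0;
  const rows = [];
  for (let i = 0; i < length; i++) {
    const row = {};
    for (const key of keys) row[key] = resultObject[key][i];
    rows.push(row);
  }
  return rows;
}

// A shortened result object in the shape shown above (two of the four seasons).
const sample = {
  Season: ['2017, Summer', '2017, Fall'],
  Depth: ['23 in.', '22 in.'],
};
const rows = toRows(sample);
console.log(rows);
```

If a placeholder is left empty so that nothing is saved for some tags, the arrays can end up with different lengths and this pairing no longer holds.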

Documentation:

Every time a tag matches, you can save its attributes, the text between its opening and closing tags, or both. Several options control how these values are saved. All of them can be used in both base_tag and search_tags entries; whatever is found is always saved to the result object of the current base_tag.

Option	Type	Required	Meaning
tag	string	Required	The name of the tag to search for
attributes	string[]	Required	The attribute names that must be present for the tag to match
values	string[]	Required	The value each of those attributes must have, in the same order
get_text	boolean	Optional	Whether to save the text between the opening and closing tags. Default: false
get_text_as	string	Optional	The key under which the text between the tags is saved
prefix_text_with	string	Optional	A prefix for the text. Only added if some text is found
empty_text_placeholder	string	Optional	Saved instead when there is nothing between the tags. The prefix is not added to it
only_first_text	boolean	Optional	Whether to save only the text from the opening tag up to the first nested opening tag (or up to the closing tag). Default: false
inside_tag_text	boolean	Optional	Whether to also save the text found inside nested tags, as long as they are within the current tag. Default: false
get_attributes	string[]	Optional	Which attribute values to save when the tag matches
get_attributes_as	string[]	Optional	The key under which each attribute value is saved
prefix_attributes_with	string[]	Optional	The prefix for each attribute value
empty_attributes_placeholders	string[]	Optional	The placeholder to save when an attribute is empty or does not exist
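The text options in the table can be read as a small decision: text that was found gets the prefix, and an empty tag falls back to the placeholder without the prefix. A minimal sketch, where formatText is a hypothetical helper and trimming the whitespace around the text is an assumption rather than documented behavior:

```javascript
// Hypothetical helper: decide what to store for the text found inside a matched tag.
// Returns undefined when nothing should be saved.
function formatText(rule, rawText) {
  if (!rule.get_text) return undefined; // get_text defaults to false
  const text = rawText.trim(); // assumption: surrounding whitespace is dropped
  if (text !== '') return (rule.prefix_text_with || '') + text;
  // Nothing between the tags: use the placeholder as-is, without the prefix.
  return rule.empty_text_placeholder || undefined;
}

const textRule = {
  get_text: true,
  get_text_as: 'inside_text',
  prefix_text_with: 'another_custom_prefix ',
  empty_text_placeholder: 'n/a',
};
console.log(formatText(textRule, '\n  something\n')); // another_custom_prefix something
console.log(formatText(textRule, '   '));              // n/a
console.log(formatText({ get_text: false }, 'ignored')); // undefined
```

The returned value would then be appended to the array stored under get_text_as in the current result object.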

The string arrays that belong together are matched by index: they must all have the same length, and the values at the same index describe the same attribute. For example:

{
  "get_attributes": ["theme", "alt"],
  "get_attributes_as": ["style", "description"],
  "prefix_attributes_with": ["Image: ", "Text:"],
  "empty_attributes_placeholders": ["empty", ""]
}

Here the value of the tag's alt attribute is saved under the name description and prefixed with Text:. If the tag has no alt value, nothing is saved, because the placeholder for alt is an empty string.
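Since these arrays are matched by index, it can help to check their lengths and regroup them per attribute before handing the options to the parser. The helper below is a hypothetical pre-check, not something the module provides; the fallbacks for missing arrays are assumptions.

```javascript
// Hypothetical pre-check: regroup the index-aligned arrays into one object per attribute
// and fail early if the arrays do not all have the same length.
function groupAttributeOptions(rule) {
  const names = rule.get_attributes || [];
  const parallel = ['get_attributes_as', 'prefix_attributes_with', 'empty_attributes_placeholders'];
  for (const option of parallel) {
    if (rule[option] !== undefined && rule[option].length !== names.length) {
      throw new Error(`${option} must have ${names.length} entries, found ${rule[option].length}`);
    }
  }
  return names.map((attribute, i) => ({
    attribute,
    // Assumed fallbacks when an optional array is left out entirely.
    saveAs: rule.get_attributes_as ? rule.get_attributes_as[i] : attribute,
    prefix: rule.prefix_attributes_with ? rule.prefix_attributes_with[i] : '',
    placeholder: rule.empty_attributes_placeholders ? rule.empty_attributes_placeholders[i] : '',
  }));
}

const grouped = groupAttributeOptions({
  get_attributes: ['theme', 'alt'],
  get_attributes_as: ['style', 'description'],
  prefix_attributes_with: ['Image: ', 'Text:'],
  empty_attributes_placeholders: ['empty', ''],
});
// The second entry describes alt: saved as description, prefix Text:, empty placeholder.
console.log(grouped[1]);
```

Grouping the values this way makes a misaligned entry visible before any HTML is parsed.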

The hierarchy within a search_tags array runs from top to bottom: each entry is searched for inside the tag matched by the previous entry. Everything that is found is saved to the same base_tag result. To search several hierarchies within the same base tag, add more arrays to search_tags.
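One way to picture this hierarchy is as a chain that has to be matched in order against the tags that are currently open. The sketch below is an illustration, not the module's implementation: tagMatches and chainMatches are hypothetical helpers, and the sketch assumes that "inside" means nested at any depth rather than direct children only.

```javascript
// Does one open tag satisfy a rule's tag name and attribute/value pairs?
function tagMatches(rule, openTag) {
  return rule.tag === openTag.name &&
    rule.attributes.every((attr, i) => openTag.attribs[attr] === rule.values[i]);
}

// Hypothetical helper: openTags lists the currently open tags inside the base tag,
// outermost first. The chain matches when its rules appear in that order and the
// last rule matches the innermost open tag.
function chainMatches(chain, openTags) {
  let next = 0; // index of the next rule in the chain that still has to match
  openTags.forEach((openTag, depth) => {
    if (next >= chain.length) return;
    const isLastRule = next === chain.length - 1;
    const isInnermost = depth === openTags.length - 1;
    // The final rule may only be consumed by the innermost tag.
    if (isLastRule && !isInnermost) return;
    if (tagMatches(chain[next], openTag)) next++;
  });
  return next === chain.length;
}

const chain = [
  { tag: 'li', attributes: ['class'], values: ['hsy_li'] },
  { tag: 'a', attributes: [], values: [] },
];
const li = { name: 'li', attribs: { class: 'hsy_li' } };
const a = { name: 'a', attribs: { title: 'Spring' } };
console.log(chainMatches(chain, [li, a])); // true: <a> is open inside the matching <li>
console.log(chainMatches(chain, [a]));     // false: no enclosing <li class="hsy_li">
```

With several arrays in search_tags, each array would be checked as its own chain against the same list of open tags.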

TODO: Make attributes and values optional parameters. Add tests.

License

MIT
