0.0.2 • Published 6 years ago

custom_html_parser v0.0.2

License
MIT
Repository
bitbucket

Custom HTML parser

Created to make extracting information from HTML documents easy.

Installing:

npm install custom_html_parser

Starting example:

import parser from 'custom_html_parser';
// html: the HTML document as a string; options: the search configuration described below
const result = parser.parse(html, options);

Description

This module uses a SAX-style parser under the hood and relies heavily on the htmlparser2 module. The goal was to create a module that can be set up entirely with a JSON file.

Here is how it works:

It searches for tags with a given name and attributes, and for tags inside them. Each search group is defined by a base_tag. Every time the parser finds a matching base tag, it creates a new result object. Everything it finds inside this search group (as defined in base_tag and search_tags) is added to that result object. Several search targets can be defined with the search_tags option.
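
The grouping behaviour described above can be sketched without a real parser. This is a minimal illustration, not the module's implementation: the event list is hand-written to resemble what a SAX-style parser emits, and matchesRule, collect, and the items key are hypothetical names. It assumes exact string comparison of attribute values and base tags that are not nested.

```javascript
// Returns true when the tag name matches and every listed attribute
// equals the value at the same index (exact comparison is an assumption).
function matchesRule(rule, name, attribs) {
  if (rule.tag !== name) return false;
  return rule.attributes.every((attr, i) => attribs[attr] === rule.values[i]);
}

// Walks a hand-written event stream and groups inner text per base tag.
function collect(events, baseRule, innerRule, key) {
  const results = [];
  let current = null;    // result object of the open base tag, if any
  let capturing = false; // true while inside a matching inner tag

  for (const ev of events) {
    if (ev.type === 'open') {
      if (matchesRule(baseRule, ev.name, ev.attribs)) {
        current = {};          // every base tag starts a new result
        results.push(current);
      } else if (current && matchesRule(innerRule, ev.name, ev.attribs)) {
        capturing = true;
      }
    } else if (ev.type === 'text') {
      const text = ev.data.trim();
      if (capturing && text !== '') {
        if (!current[key]) current[key] = [];
        current[key].push(text);
      }
    } else if (ev.type === 'close') {
      if (ev.name === innerRule.tag) capturing = false;
      if (ev.name === baseRule.tag) current = null;
    }
  }
  return results;
}

// Two lists, so two result objects are expected.
const events = [
  { type: 'open', name: 'ul', attribs: { class: 'list' } },
  { type: 'open', name: 'li', attribs: {} },
  { type: 'text', data: 'one' },
  { type: 'close', name: 'li' },
  { type: 'open', name: 'li', attribs: {} },
  { type: 'text', data: 'two' },
  { type: 'close', name: 'li' },
  { type: 'close', name: 'ul' },
  { type: 'open', name: 'ul', attribs: { class: 'list' } },
  { type: 'open', name: 'li', attribs: {} },
  { type: 'text', data: 'three' },
  { type: 'close', name: 'li' },
  { type: 'close', name: 'ul' },
];

const baseRule = { tag: 'ul', attributes: ['class'], values: ['list'] };
const innerRule = { tag: 'li', attributes: [], values: [] };
const grouped = collect(events, baseRule, innerRule, 'items');
console.log(JSON.stringify(grouped));
// [{"items":["one","two"]},{"items":["three"]}]
```

Each base tag produces its own result object, and everything found inside it lands in that object, which mirrors the nested arrays in the results shown in this README.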

Example:

HTML document:
<div class="foo" title="bar" interesting-attribute="interesting-value">
  something
</div>
Options:
[
  {
    "base_tag": {
      "tag": "div", 
      "attributes": ["class", "title"],
      "values": ["foo", "bar"],
      
      "get_attributes": ["interesting-attribute", "missing_attribute"],
      "get_attributes_as": ["save_name", "save_name_2"],
      "prefix_attributes_with": ["custom_prefix ", "prefix_2 "],
      "empty_attributes_placeholders": ["", "missing"],

      "get_text": true,
      "get_text_as": "inside_text",
      "prefix_text_with": "another_custom_prefix "
    }
  }
]
Result:
[
  [
    {
      "save_name":["custom_prefix interesting-value"],
      "save_name_2":["missing"],
      "inside_text":["another_custom_prefix something"]
    }
  ]
]
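
How the attribute options in this example combine can be sketched as below. This is an illustration only, under assumptions inferred from the result above: a missing or empty attribute falls back to its placeholder, placeholders are not prefixed, and an empty placeholder saves nothing. saveAttributes is a hypothetical helper, not part of the module.

```javascript
// Applies get_attributes and its companion arrays to one matched tag.
// All four arrays are read by index, so position i describes one attribute.
function saveAttributes(rule, attribs, result) {
  const names = rule.get_attributes || [];
  names.forEach((name, i) => {
    const key = rule.get_attributes_as[i];
    const prefix = (rule.prefix_attributes_with || [])[i] || '';
    const placeholder = (rule.empty_attributes_placeholders || [])[i] || '';
    const value = attribs[name];

    let saved;
    if (value !== undefined && value !== '') {
      saved = prefix + value;   // real values get the prefix
    } else if (placeholder !== '') {
      saved = placeholder;      // assumption: placeholders are not prefixed
    } else {
      return;                   // nothing to save for this attribute
    }
    if (!result[key]) result[key] = [];
    result[key].push(saved);
  });
  return result;
}

const attributeRule = {
  get_attributes: ['interesting-attribute', 'missing_attribute'],
  get_attributes_as: ['save_name', 'save_name_2'],
  prefix_attributes_with: ['custom_prefix ', 'prefix_2 '],
  empty_attributes_placeholders: ['', 'missing'],
};
const attribs = {
  class: 'foo',
  title: 'bar',
  'interesting-attribute': 'interesting-value',
};

const savedAttributes = saveAttributes(attributeRule, attribs, {});
console.log(JSON.stringify(savedAttributes));
// {"save_name":["custom_prefix interesting-value"],"save_name_2":["missing"]}
```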

More complicated example:

This is where the strength of the approach shows. When tags repeat (as in a list), extracting and organizing information this way is much easier.

HTML document:
<ul class="hsy_ul" style="width: 1472px">
  <li class="hsy_li">
    <strong class="hsy_m">Jan</strong>
    <a title="Spring">
      <span>12 in.</span>
      <i style="height:56%;"></i>
    </a>
    <a title="Summer">
      <span>23 in.</span>
      <i style="height:57%;"></i>
    </a>
    <a title="Fall">
      <span>22 in.</span>
      <i style="height:57%;"></i>
    </a>
    <a title="Winter">
      <span>1 in.</span>
      <i style="height:57%;"></i>
    </a>
  </li>
</ul>
Options:
[
  {
    "base_tag": {
      "tag": "ul", 
      "attributes": ["class"],
      "values": ["hsy_ul"]
    },
    "search_tags": [
      [
        {
          "tag": "li", 
          "attributes": ["class"],
          "values": ["hsy_li"]
        },
        {
          "tag": "a", 
          "attributes": [],
          "values": [],

          "get_text": true,
          "get_text_as": "Depth",
          "empty_text_placeholder": "0 in.",
          "inside_tag_text": true,

          "get_attributes": ["title"],
          "get_attributes_as": ["Season"],
          "prefix_attributes_with": ["2017, "]
        }
      ]
    ]
  }
]
Result:
[
  [
    {
      "Season": ["2017, Spring","2017, Summer","2017, Fall","2017, Winter"],
      "Depth": ["12 in.","23 in.","22 in.","1 in."]
    }
  ]
]
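
Because each key holds a parallel array, a consumer will often want one record per matched tag. A minimal sketch of that step follows; toRecords is a hypothetical helper, and it assumes every array in the result object has the same length (which is what the placeholder options help guarantee).

```javascript
// Turns parallel arrays into one record per index.
// Assumes every array in the result object has the same length.
function toRecords(resultObject) {
  const keys = Object.keys(resultObject);
  const length = keys.length === 0 ? 0 : resultObject[keys[0]].length;
  const records = [];
  for (let i = 0; i < length; i += 1) {
    const record = {};
    for (const key of keys) record[key] = resultObject[key][i];
    records.push(record);
  }
  return records;
}

// Shortened sample in the same shape as the result above.
const sample = {
  Season: ['2017, Summer', '2017, Fall'],
  Depth: ['23 in.', '22 in.'],
};
const records = toRecords(sample);
console.log(JSON.stringify(records));
// [{"Season":"2017, Summer","Depth":"23 in."},{"Season":"2017, Fall","Depth":"22 in."}]
```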

Documentation:

Every time a matching tag occurs, you can save its attributes and/or the text between its opening and closing tags. Several options control how these values are saved. All of them can be set on both the base_tag and the search_tags, but whatever is found is always saved to the current base_tag's result object.

| Option | Type | Required | Meaning |
| --- | --- | --- | --- |
| tag | string | Required | The name of the tag to search for |
| attributes | string[] | Required | The attribute names that must match for the tag |
| values | string[] | Required | The expected value of each attribute, in the same order as attributes |
| get_text | boolean | Optional | Whether to save the text between the opening and closing tags. Default: false |
| get_text_as | string | Optional | The key under which the text between the tags is saved |
| prefix_text_with | string | Optional | A prefix for the text. Only added when some text is found |
| empty_text_placeholder | string | Optional | Saved instead when there is nothing between the tags. No prefix is added to it |
| only_first_text | boolean | Optional | Whether to save only the text from the opening tag up to the first nested opening tag (or to the end of the tag). Default: false |
| inside_tag_text | boolean | Optional | Whether to also save the text inside nested tags, as long as they are within the current tag. Default: false |
| get_attributes | string[] | Optional | Which attribute values to save when the tag matches |
| get_attributes_as | string[] | Optional | The key under which each attribute value is saved |
| prefix_attributes_with | string[] | Optional | Prefixes to add to the attribute values |
| empty_attributes_placeholders | string[] | Optional | The placeholder to use when an attribute is empty or missing |
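
The text options can be read as a small decision procedure. The sketch below is one interpretation, not the module's code: saveText is a hypothetical helper, trimming of surrounding whitespace is assumed from the first example's result, and the 'n/a' placeholder is made up for illustration.

```javascript
// Applies the text options to the raw text collected for one tag.
// Assumptions: text is trimmed, the placeholder is used only when the
// trimmed text is empty, and get_text_as is provided whenever get_text is.
function saveText(rule, rawText, result) {
  if (!rule.get_text) return result; // default: do not save text
  const text = rawText.trim();

  let saved;
  if (text !== '') {
    saved = (rule.prefix_text_with || '') + text;
  } else if (rule.empty_text_placeholder) {
    saved = rule.empty_text_placeholder; // the placeholder is not prefixed
  } else {
    return result;                       // nothing found, nothing saved
  }
  const key = rule.get_text_as;
  if (!result[key]) result[key] = [];
  result[key].push(saved);
  return result;
}

const textRule = {
  get_text: true,
  get_text_as: 'inside_text',
  prefix_text_with: 'another_custom_prefix ',
  empty_text_placeholder: 'n/a', // hypothetical placeholder
};

const withText = saveText(textRule, '\n  something\n', {});
const withoutText = saveText(textRule, '   ', {});
console.log(withText.inside_text[0]);    // another_custom_prefix something
console.log(withoutText.inside_text[0]); // n/a
```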

String arrays that belong together are matched by position. This means they must all have the same length, and the values at a given index describe the same attribute. For example, given:

{
  "get_attributes": ["theme", "alt"],
  "get_attributes_as": ["style", "description"],
  "prefix_attributes_with": ["Image: ", "Text:"],
  "empty_attributes_placeholders": ["empty", ""]
}

The tag's alt attribute value will be saved under the name description, prefixed with Text:. If there is no alt value, nothing will be saved, because its placeholder is an empty string.
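
Since mismatched array lengths are an easy mistake, a check before calling the parser may help. The following is a sketch under the rules stated in this section; validateRule is a hypothetical helper that the module does not provide, and the img tag name is only an example.

```javascript
// Returns a list of problems found in one tag rule (empty list means OK).
function validateRule(rule) {
  const problems = [];
  for (const field of ['tag', 'attributes', 'values']) {
    if (rule[field] === undefined) problems.push(`missing required field: ${field}`);
  }
  // Arrays inside one group are paired by index and must share a length.
  const groups = [
    ['attributes', 'values'],
    ['get_attributes', 'get_attributes_as', 'prefix_attributes_with', 'empty_attributes_placeholders'],
  ];
  for (const group of groups) {
    const present = group.filter((field) => Array.isArray(rule[field]));
    const lengths = new Set(present.map((field) => rule[field].length));
    if (lengths.size > 1) problems.push(`length mismatch in: ${present.join(', ')}`);
  }
  return problems;
}

const goodRule = {
  tag: 'img',
  attributes: [],
  values: [],
  get_attributes: ['theme', 'alt'],
  get_attributes_as: ['style', 'description'],
  prefix_attributes_with: ['Image: ', 'Text:'],
  empty_attributes_placeholders: ['empty', ''],
};
// Same rule with one prefix missing, so the arrays no longer line up.
const badRule = { ...goodRule, prefix_attributes_with: ['Image: '] };

console.log(validateRule(goodRule).length); // 0
console.log(validateRule(badRule).length);  // 1
```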

The hierarchy within a search_tags array runs from top to bottom: each entry is searched for inside the tag matched by the previous entry. Everything that is found is still saved to the same base_tag result. You can add more of these hierarchies to the same base tag by adding more arrays to search_tags.
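
As an illustration of several hierarchies under one base tag, the options below reuse the HTML from the more complicated example and add a second hierarchy for the month label. This is an untested sketch of the option shape only; the Month key is made up, and the exact output of the module for it is not shown here.

```javascript
// Options sketch: one base tag with two independent hierarchies.
const multiHierarchyOptions = [
  {
    base_tag: { tag: 'ul', attributes: ['class'], values: ['hsy_ul'] },
    search_tags: [
      // Hierarchy 1: <a> inside <li class="hsy_li">, saving the title.
      [
        { tag: 'li', attributes: ['class'], values: ['hsy_li'] },
        {
          tag: 'a',
          attributes: [],
          values: [],
          get_attributes: ['title'],
          get_attributes_as: ['Season'],
        },
      ],
      // Hierarchy 2: <strong class="hsy_m"> inside the same <li>.
      [
        { tag: 'li', attributes: ['class'], values: ['hsy_li'] },
        {
          tag: 'strong',
          attributes: ['class'],
          values: ['hsy_m'],
          get_text: true,
          get_text_as: 'Month', // hypothetical key name
        },
      ],
    ],
  },
];

console.log(multiHierarchyOptions[0].search_tags.length); // 2
```

Both hierarchies write into the result object of the same ul base tag, so Season and Month would end up side by side in one object.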

TODO: Make attributes and values optional parameters. Add tests.


License

MIT
