seize

Seize is a lightweight web-page content extractor for Node and the browser, inspired by Arc90's Readability and Safari Reader.

Install

npm i --save seize

Usage

Seize works with any DOM implementation, for example jsdom. It does no parsing of its own: it only finds the DOM node that holds the main content and prepares it for further use.

Example

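// Note: require('jsdom').jsdom is the pre-v10 jsdom API.
var Seize = require('seize'),
    jsdom = require('jsdom').jsdom;

// Build a window from the HTML string and hand its document to Seize.
var window = jsdom('<your html here>').defaultView,
    seize  = new Seize(window.document);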

seize.content(); // returns DOM-node
seize.text();    // returns only text

Browser usage

In the browser, you should either clone your DOM object or create a new document from an HTML string:

/**
 * Converts an HTML string to a Document
 * @param  {String}   html  HTML document string
 * @return {Document}       detached document built from the string
 */
function HTMLParser(html){
  // Create an empty, detached document; "example" is only its title.
  var doc = document.implementation.createHTMLDocument("example");
  doc.documentElement.innerHTML = html;
  return doc;
}
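
The cloning option can be sketched in the same way. This is a minimal sketch, not part of Seize: the helper name extractFromLiveDocument and its parameters are hypothetical, and it assumes the Seize constructor is already available in your page (how it is loaded in the browser is not covered here). The clone is presumably needed because extraction and clean-up work on the nodes they are given.

```javascript
/**
 * Hypothetical helper, not part of Seize.
 * @param  {Function} SeizeCtor  the Seize constructor, however you load it
 * @param  {Document} doc        the live document of the page
 * @return {Object}              a Seize instance bound to a detached copy
 */
function extractFromLiveDocument(SeizeCtor, doc) {
  // Deep-clone the document so the page the user sees is never touched.
  var clone = doc.cloneNode(true);
  return new SeizeCtor(clone);
}
```

With Seize loaded, extractFromLiveDocument(Seize, document).text() would then read the text from the copy rather than from the live page.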

How it works

The algorithm works as follows:

Collecting the HTML tags that are expected to hold text or content, such as p, table, img, etc.
Filtering out unnecessary tags, by content and by tag name, which definitely can't be part of a content container
Scoring each container by the tags it contains
Scoring by class name, id, tag, XPath and text
Sorting the candidates by score
Taking the first candidate
Cleaning up the article
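
The filtering, scoring and ranking steps can be illustrated with a small sketch over plain objects instead of real DOM nodes. Everything in it is an assumption made for illustration: the tag lists, the class/id patterns, the weights and the function names are not taken from Seize's source, and the XPath score and the final clean-up are left out.

```javascript
// Illustrative values only; none of them come from Seize itself.
var CONTENT_TAGS = ['p', 'table', 'img', 'pre', 'blockquote'];
var EXCLUDED_TAGS = ['nav', 'footer', 'aside', 'form'];
var POSITIVE_HINT = /article|content|entry|main|post/i;
var NEGATIVE_HINT = /comment|footer|menu|sidebar|sponsor/i;

// A container is a plain object:
// { tagName, className, id, text, childTags: ['p', 'img', ...] }
function scoreContainer(container) {
  var score = 0;

  // Score by contained tags: reward every content-like child.
  container.childTags.forEach(function (tag) {
    if (CONTENT_TAGS.indexOf(tag) !== -1) score += 5;
  });

  // Score by class name and id.
  var hint = container.className + ' ' + container.id;
  if (POSITIVE_HINT.test(hint)) score += 25;
  if (NEGATIVE_HINT.test(hint)) score -= 25;

  // Score by text: one point per 100 characters, capped at 3.
  score += Math.min(Math.floor(container.text.length / 100), 3);

  return score;
}

function pickBestContainer(containers) {
  var ranked = containers
    // Filter out tags that can't be a content container.
    .filter(function (c) { return EXCLUDED_TAGS.indexOf(c.tagName) === -1; })
    .map(function (c) { return { container: c, score: scoreContainer(c) }; })
    // Sort candidates by score, best first.
    .sort(function (a, b) { return b.score - a.score; });

  // Take the first candidate, or null when nothing is left.
  return ranked.length ? ranked[0] : null;
}
```

Keeping the weights in one place makes it easy to see how a navigation block with many links loses to a container that holds several paragraphs and carries a content-like class name.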