0.1.7 • Published 10 years ago

seize v0.1.7

License
MIT

seize


Seize is a lightweight web-page content extractor for Node and the browser, inspired by Arc90's Readability and Safari Reader.

Install

npm i --save seize

Usage

Seize works with DOM libraries such as jsdom. It only extracts a DOM node and prepares it for further use.

Example

// Uses the jsdom API from before v10; newer jsdom releases use `new JSDOM(html).window` instead.
var Seize = require('seize'),
    jsdom = require('jsdom').jsdom;

var window = jsdom('<your html here>').defaultView,
    seize  = new Seize(window.document);

seize.content(); // returns DOM-node
seize.text();    // returns only text

Browser usage

In the browser, clone your DOM object or create a new document from an HTML string:

/**
 * Converts an HTML string to a Document
 * @param  {String}   html  HTML document string
 * @return {Document}       parsed document, detached from the current page
 */
function HTMLParser(html) {
  // The argument is only the title of the new, empty document
  var doc = document.implementation.createHTMLDocument("example");
  doc.documentElement.innerHTML = html;
  return doc;
}
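
The other option mentioned above, cloning the existing document, might look like the sketch below. `cloneDocument` is a hypothetical helper and not part of Seize; the usage line in the comment simply mirrors the Node example and is an assumption about how Seize is loaded in the browser.

```javascript
/**
 * Returns a deep copy of a document so the live page is left untouched.
 * Hypothetical helper for illustration; not part of Seize.
 * @param  {Document} doc  document to copy
 * @return {Document}      detached deep clone
 */
function cloneDocument(doc) {
  return doc.cloneNode(true); // true = copy the whole subtree
}

// Assumed browser usage, mirroring the Node example:
// var seize = new Seize(cloneDocument(document));
```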

How it works

The algorithm works as follows:

  • Collect the HTML tags expected to be text or content containers, such as p, table, img, etc.
  • Filter out unnecessary tags, by content and by tag name, that definitely can't be part of a content container
  • Score each container by the tags it contains
  • Adjust the score by class name, id name, tag XPath score and text score
  • Sort the candidates by score
  • Take the first candidate
  • Clean up the article
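
The filtering, scoring, and sorting steps above can be sketched roughly as follows. This is an illustration only, not Seize's actual code: the function names, tag lists, regular expressions, and weights are all assumptions, candidates are plain objects so the sketch runs without a DOM, and the XPath score and the final clean-up step are left out.

```javascript
// Illustrative sketch only: names, patterns, and weights are assumptions,
// not Seize's real implementation.

// Hypothetical weights for tags found inside a container.
var CONTENT_TAG_WEIGHTS = { p: 3, table: 1, img: 1 };

// Hypothetical tags that are never treated as content containers.
var EXCLUDED_TAGS = ['nav', 'footer', 'aside', 'form'];

// Hypothetical class/id hints, in the spirit of Readability-style heuristics.
var POSITIVE_HINT = /article|content|post|text|body/i;
var NEGATIVE_HINT = /comment|footer|sidebar|nav|banner|ad-/i;

// Drop nodes that are excluded by tag name or have no text at all.
function filterCandidates(nodes) {
  return nodes.filter(function (node) {
    return EXCLUDED_TAGS.indexOf(node.tag) === -1 && node.text.trim() !== '';
  });
}

// Combine a child-tag score, a class/id score, and a simple text score.
function scoreCandidate(node) {
  var score = 0;
  node.childTags.forEach(function (tag) {
    if (Object.prototype.hasOwnProperty.call(CONTENT_TAG_WEIGHTS, tag)) {
      score += CONTENT_TAG_WEIGHTS[tag];
    }
  });
  var hints = node.className + ' ' + node.id;
  if (POSITIVE_HINT.test(hints)) score += 25;
  if (NEGATIVE_HINT.test(hints)) score -= 25;
  score += Math.min(Math.floor(node.text.length / 100), 3); // longer text
  score += node.text.split(',').length - 1;                 // one point per comma
  return score;
}

// Sort candidates by score, highest first, and take the first one.
function pickBest(nodes) {
  var ranked = filterCandidates(nodes)
    .map(function (node) { return { node: node, score: scoreCandidate(node) }; })
    .sort(function (a, b) { return b.score - a.score; });
  return ranked.length ? ranked[0].node : null;
}

// Made-up sample data: { tag, id, className, text, childTags }
var sample = [
  { tag: 'nav', id: 'menu', className: '', text: 'Home, About, Contact', childTags: [] },
  { tag: 'div', id: '', className: 'sidebar', text: 'Related links', childTags: ['img'] },
  { tag: 'div', id: 'main', className: 'article-body',
    text: 'First paragraph, with some detail. Second paragraph.',
    childTags: ['p', 'p', 'img'] }
];

console.log(pickBest(sample).id); // prints "main"
```

In this sample the nav element is filtered out by tag name, the sidebar is penalised through its class name, and the block with paragraphs and an article-like class name comes out on top.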

Todo

Seize is still in development, so use it at your own risk. Help with improving it is always welcome.

  • Improve readme
  • Improve text scoring
  • Improve detection of pages that can't be extracted
  • More tests
  • More examples

Contributing

You are welcome to improve this small piece of software :)
