0.2.0 • Published 5 years ago

html-juicer v0.2.0

License
GPL-3.0

html-juicer

A simpler way to clean up third-party webpages, similar to arc90 Readability.

Why

arc90 Readability is widely used to get a clean view of a webpage. But its algorithm has some shortcomings, so some pages produce the wrong result.

In arc90 Readability's algorithm, every paragraph is scored first. The paragraph, its parentNode, and that parent's parentNode are added to a candidate list, and the candidate with the highest score becomes the topCandidate. With that candidate in hand, arc90 then walks through its siblings looking for other possible content. Under this algorithm, the traversal therefore searches at most four levels deep to find a top candidate.
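The bottom-up step described above can be sketched roughly as follows. This is a simplified illustration on a plain tree rather than a real DOM, not arc90's actual code: `TreeNode`, `makeNode`, `paragraphScore`, and `bottomUpTopCandidate` are hypothetical names, and the scoring formula is a stand-in chosen only to make the example concrete.

```typescript
// Plain tree node standing in for a DOM element (hypothetical type).
interface TreeNode {
  tag: string;
  text: string;
  children: TreeNode[];
  parent?: TreeNode;
  score: number;
}

function makeNode(tag: string, text: string, children: TreeNode[]): TreeNode {
  const created: TreeNode = { tag, text, children, score: 0 };
  for (const child of children) child.parent = created;
  return created;
}

// Illustrative scoring only: 1 point, +1 per comma, +1 per 100 characters.
function paragraphScore(text: string): number {
  return 1 + (text.split(",").length - 1) + Math.floor(text.length / 100);
}

// Pre-order list of every node in the tree.
function allNodes(root: TreeNode): TreeNode[] {
  let found: TreeNode[] = [root];
  for (const child of root.children) found = found.concat(allNodes(child));
  return found;
}

// Bottom-up pass: a paragraph credits itself, its parent and its grandparent.
function bottomUpTopCandidate(root: TreeNode): TreeNode {
  const nodes = allNodes(root);
  for (const current of nodes) {
    if (current.tag !== "p") continue;
    const points = paragraphScore(current.text);
    let target: TreeNode | undefined = current;
    for (let level = 0; level < 3 && target; level++) {
      target.score += points;
      target = target.parent;
    }
  }
  // Pick the highest-scoring node; on a tie the earlier (outer) node wins.
  let best = root;
  for (const candidate of nodes) {
    if (candidate.score > best.score) best = candidate;
  }
  return best;
}

// A deeply nested article: article > section > div > div > p.
const para = (text: string) => makeNode("p", text, []);
const wrapA = makeNode("div", "", [
  makeNode("div", "", [para("a, b, c"), para("d, e"), para("f, g")]),
]);
const wrapB = makeNode("div", "", [
  makeNode("div", "", [para("h, i"), para("j, k")]),
]);
const sectionA = makeNode("section", "", [wrapA]);
const sectionB = makeNode("section", "", [wrapB]);
const deepArticle = makeNode("article", "", [sectionA, sectionB]);

// The winner sits inside the first section; the article itself scores nothing.
const topCandidate = bottomUpTopCandidate(deepArticle);
```

In this sketch the scores never propagate past the grandparent, so the common ancestor of both sections is never considered as a candidate.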

In practice, though, many well-known blog sites nest their content very deeply. For an article like this one on Medium, arc90 Readability returns only the first section of the whole article, and the bottom-up traversal cannot do anything about it.

How we implement it

So html-juicer uses a top-down traversal instead.

We first score every paragraph as arc90 does, but we also add that score to every parentNode until we reach the root. Then we traverse down the DOM tree to find the most likely root of the article. That node is the final target. Simple, right? 🤓
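A minimal sketch of this top-down idea, again on a plain tree rather than a real DOM. The source does not spell out how the descent decides where to stop, so the rule used here (keep descending while a single non-paragraph child holds at least 75% of the current node's score) is an assumption, as are the names `ScoredNode`, `scored`, `scoreTree`, and `findArticleRoot` and the scoring formula.

```typescript
interface ScoredNode {
  tag: string;
  text: string;
  children: ScoredNode[];
  parent?: ScoredNode;
  score: number;
}

function scored(tag: string, children: ScoredNode[] = [], text = ""): ScoredNode {
  const created: ScoredNode = { tag, text, children, score: 0 };
  for (const child of children) child.parent = created;
  return created;
}

// Illustrative scoring only: 1 point, +1 per comma, +1 per 100 characters.
function textScore(text: string): number {
  return 1 + (text.split(",").length - 1) + Math.floor(text.length / 100);
}

// Score each paragraph, then add that score to every ancestor up to the root.
function scoreTree(current: ScoredNode): void {
  if (current.tag === "p") {
    const points = textScore(current.text);
    for (let target: ScoredNode | undefined = current; target; target = target.parent) {
      target.score += points;
    }
  }
  for (const child of current.children) scoreTree(child);
}

// Walk down while one non-paragraph child still holds most of the score.
// The 0.75 dominance threshold is an assumption made for this sketch.
function findArticleRoot(root: ScoredNode, dominance = 0.75): ScoredNode {
  let current = root;
  for (;;) {
    const threshold = current.score * dominance;
    const next = current.children.filter(
      (child) => child.tag !== "p" && child.score > 0 && child.score >= threshold
    )[0];
    if (!next) return current;
    current = next;
  }
}

// html > body > (nav, main > article > two deeply nested sections)
const paragraph = (text: string) => scored("p", [], text);
const partA = scored("section", [
  scored("div", [scored("div", [paragraph("a, b, c"), paragraph("d, e"), paragraph("f, g")])]),
]);
const partB = scored("section", [
  scored("div", [scored("div", [paragraph("h, i"), paragraph("j, k")])]),
]);
const articleNode = scored("article", [partA, partB]);
const pageRoot = scored("html", [
  scored("body", [scored("nav", [paragraph("Home")]), scored("main", [articleNode])]),
]);

scoreTree(pageRoot);
// The descent stops at the article, because neither section dominates it.
const articleRoot = findArticleRoot(pageRoot);
```

Because every ancestor carries the accumulated score, the descent can stop at the node that contains both sections, however deep the paragraphs are nested.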

More things

With the article root in hand, we do further processing based on the caller's config. With the default config, we remove the h1 tag, strip all useless attributes, and rewrite each resource's src so that it points to the correct URL. All helper methods are well tested in helpers.test.ts.
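The three default steps could look something like the sketch below. It works on a plain element model instead of jsdom, and it is not the package's helper code: `El`, `KEPT_ATTRIBUTES`, and `cleanArticle` are hypothetical names, and the list of attributes worth keeping is an assumption. Only the standard `URL` constructor is used to resolve `src` against the page URL.

```typescript
// Plain element model standing in for a DOM element (hypothetical type).
interface El {
  tag: string;
  attrs: { [name: string]: string };
  children: El[];
}

// Assumed allowlist; the real set of "useful" attributes may differ.
const KEPT_ATTRIBUTES = ["src", "href", "alt"];

// Returns a cleaned copy: no h1 children, only allowlisted attributes,
// and src values resolved against baseUrl when one is given.
function cleanArticle(element: El, baseUrl: string | null): El {
  const attrs: { [name: string]: string } = {};
  for (const name of Object.keys(element.attrs)) {
    if (KEPT_ATTRIBUTES.indexOf(name) === -1) continue;
    const value = element.attrs[name];
    attrs[name] = name === "src" && baseUrl ? new URL(value, baseUrl).href : value;
  }
  const children = element.children
    .filter((child) => child.tag !== "h1")
    .map((child) => cleanArticle(child, baseUrl));
  return { tag: element.tag, attrs, children };
}

const rawArticle: El = {
  tag: "article",
  attrs: { class: "post", id: "main" },
  children: [
    { tag: "h1", attrs: {}, children: [] },
    { tag: "img", attrs: { src: "/img/a.png", class: "hero" }, children: [] },
  ],
};

const cleaned = cleanArticle(rawArticle, "https://example.com/posts/1");
// cleaned.children[0].attrs.src is "https://example.com/img/a.png"
```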

Usage

Currently html-juicer only works in Node.js.

npm i html-juicer

import {Juicer} from 'html-juicer'

new Juicer(
  html: string, 
  config?: {
      useHeaderAsTitle?: boolean
      cleanH1?: boolean
      cleanAttribute?: boolean
      url?: URL | string | null
  },
): {
  content: string
  title: string
}

Config

| name | description | default |
| --- | --- | --- |
| useHeaderAsTitle | use h1 as the result title, or document.title | true |
| cleanH1 | remove the h1 tag in the article root | true |
| cleanAttribute | clean useless attributes | true |
| url | the URL of the html | null |
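As an illustration of how these defaults combine with a caller's partial config, here is a small sketch. The option names and default values come from the table above, but `JuicerConfig`, `DEFAULT_CONFIG`, and `resolveConfig` are hypothetical names, and the package may merge its config differently.

```typescript
// Option names and defaults mirror the config table; the type name is hypothetical.
interface JuicerConfig {
  useHeaderAsTitle?: boolean;
  cleanH1?: boolean;
  cleanAttribute?: boolean;
  url?: URL | string | null;
}

const DEFAULT_CONFIG: Required<JuicerConfig> = {
  useHeaderAsTitle: true,
  cleanH1: true,
  cleanAttribute: true,
  url: null,
};

// Caller-supplied options override the defaults; omitted options keep them.
function resolveConfig(config: JuicerConfig = {}): Required<JuicerConfig> {
  return { ...DEFAULT_CONFIG, ...config };
}

const defaults = resolveConfig();
const custom = resolveConfig({ cleanH1: false, url: "https://example.com/posts/1" });
```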

Dependencies

html-juicer only depends on jsdom.
