HTMLGrabr library

A Node.js library to grab and clean HTML content.

Features

Extract page content from a URL (HTMLGrabr.grabUrl(url: URL): Promise<GrabbedPage>)
Extract page content from a string (HTMLGrabr.grab(s: string): GrabbedPage)
Extract Open Graph properties
Clean the page content:
- Extract main HTML content using mozilla-readability
- Sanitize HTML content using DOMPurify, with some extras:
  - Remove unwanted links or images
  - Remove tracking pixels
  - Remove unwanted attributes (such as style, class, id, ...)
  - And more

Usage

npm install --save htmlgrabr

Then in your code:

const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const { URL } = require('url')

const grabber = new HTMLGrabr()

grabber.grabUrl(new URL('https://about.readflow.app'))
  .then(page => {
    console.log(page)
  }, err => {
    console.error(err)
  })
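
The same call can also be written with async/await. The following is a minimal sketch, assuming only that grabUrl returns a promise resolving to a page object with a title field, as the example above and the result object suggest. The helper name grabTitle is hypothetical and not part of the library.

```javascript
const { URL } = require('url')

// Hypothetical helper (not part of htmlgrabr): grabs a page and returns
// its title, or null if the URL is invalid or the grab fails.
async function grabTitle(grabber, href) {
  try {
    const page = await grabber.grabUrl(new URL(href))
    return page.title
  } catch (err) {
    console.error(`Unable to grab ${href}:`, err.message)
    return null
  }
}

// Usage, assuming `grabber` is an HTMLGrabr instance:
// grabTitle(grabber, 'https://about.readflow.app').then(console.log)
```

Passing the grabber as a parameter keeps the helper easy to exercise with a stub object in tests.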

API

Create new instance:

const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const grabber = new HTMLGrabr(config)

Configuration object:

interface GrabberConfig {
  debug?: boolean                     // Print debug logs if true
  pretty?: boolean                    // Beautify HTML content if true
  isBlockedHost?: BlockedHostCtrlFunc // Function used to detect unwanted URLs
  rewriteURL?: URLRewriterFunc        // Function used to rewrite HTML src attributes
  rules?: Map<string, Rule>           // Rule definitions (see below)
  headers?: Headers                   // HTTP headers to set
}
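
As an illustration, a configuration object might be assembled as below. This is only a sketch: the BlockedHostCtrlFunc and URLRewriterFunc types are not defined in this document, so the function signatures used here (a hostname in, a boolean out; a URL string in, a URL string out) are assumptions, and the host names and proxy endpoint are made up.

```javascript
// Assumption: BlockedHostCtrlFunc receives a hostname and returns true
// when resources from that host should be dropped.
const blockedHosts = new Set(['ads.example.com', 'tracker.example.net'])
const isBlockedHost = (hostname) => blockedHosts.has(hostname)

// Assumption: URLRewriterFunc receives a URL string and returns the
// rewritten URL string. Here URLs are routed through a hypothetical
// proxy endpoint.
const rewriteURL = (src) =>
  'https://proxy.example.com/?url=' + encodeURIComponent(src)

const config = {
  debug: false, // no debug logs
  pretty: true, // beautify the resulting HTML
  isBlockedHost,
  rewriteURL
}

// const grabber = new HTMLGrabr(config)
```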

Rule definition:

export interface Rule {
  selector: string             // HTML query selector
  type: 'redirect' | 'content' // Rule type:
  // - 'redirect' will use 'src' or 'href' attribute to redirect content extraction
  // - 'content' to specify content to extract
}
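
A rule map could look like the sketch below. The document does not say what the Map key represents, so keying rules by hostname is an assumption here, and the hostnames and selectors are invented for illustration. The isValidRule helper is hypothetical and only checks the two fields listed in the Rule interface.

```javascript
// Hypothetical helper: checks that an object has the shape of a Rule.
function isValidRule(rule) {
  return typeof rule.selector === 'string' &&
    (rule.type === 'redirect' || rule.type === 'content')
}

// Assumption: rules are keyed by hostname.
const rules = new Map([
  // Follow the src of the embedded frame instead of the wrapper page.
  ['video.example.com', { selector: 'iframe.player', type: 'redirect' }],
  // Extract only the article body on this site.
  ['blog.example.com', { selector: 'article .post-body', type: 'content' }]
])

for (const [key, rule] of rules) {
  if (!isValidRule(rule)) throw new Error(`Invalid rule for ${key}`)
}

// const grabber = new HTMLGrabr({ rules })
```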

Grab a page:

// grabUrl returns a Promise, so await it inside an async function
const result = await grabber.grabUrl(new URL('https://...'))

Result object:

interface GrabbedPage {
  title: string        // Page title
  url: string | null   // Source URL
  image: string | null // Page illustration
  html: string         // HTML content
  text: string         // Text content (from HTML)
  excerpt: string      // Excerpt (from meta data or HTML)
  length: number       // Read length
  images: ImageMeta[]  // Embedded image URLs
}
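
To show how these fields might be consumed, here is a small sketch that builds a one-line summary from a result. The summarize helper is hypothetical, and the sample object is hand-written to match the interface above rather than taken from a real grab.

```javascript
// Hypothetical helper: builds a one-line summary from a GrabbedPage.
function summarize(page) {
  // url may be null, for example when the page came from a string
  const source = page.url === null ? 'unknown source' : page.url
  return `${page.title} (${source}), ${page.images.length} image(s): ${page.excerpt}`
}

// Hand-written sample standing in for a real result.
const sample = {
  title: 'Hello',
  url: null,
  image: null,
  html: '<p>Hi</p>',
  text: 'Hi',
  excerpt: 'Hi',
  length: 2,
  images: []
}

console.log(summarize(sample))
// prints "Hello (unknown source), 0 image(s): Hi"
```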
