htmlgrabr v1.1.1

Weekly downloads: 8
License: MIT
Repository: github
Last release: 3 years ago

HTMLGrabr library

A Node.js library to grab and clean HTML content.

Features

  • Extract page content from a URL (HTMLGrabr.grabUrl(url: URL): GrabbedPage)
  • Extract page content from a string (HTMLGrabr.grab(s: string): GrabbedPage)
  • Extract Open Graph properties
  • Clean the page content:
    • Extract main HTML content using mozilla-readability
    • Sanitize HTML content using DOMPurify, with some extras:
      • Remove unwanted links or images
      • Remove tracking pixels
      • Remove unwanted attributes (such as style, class, id, ...)
      • And more

Usage

npm install --save htmlgrabr

Then in your code:

const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const { URL } = require('url')

const grabber = new HTMLGrabr()

grabber.grabUrl(new URL('https://about.readflow.app'))
  .then(page => {
    console.log(page)
  }, err => {
    console.error(err)
  })
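
The same call can also be written with async/await. The following is a minimal sketch, assuming only what the usage above shows: grabUrl takes a URL and returns a promise that resolves to a page object with a title. The helper name grabTitle and the stub grabber are hypothetical, introduced so the snippet runs without network access; in real code you would pass an HTMLGrabr instance instead of the stub.

```javascript
// Hypothetical helper: returns the page title, or null if grabbing fails.
// `grabber` is any object exposing grabUrl(url: URL) => Promise<page>.
async function grabTitle(grabber, href) {
  try {
    // new URL() throws on malformed input, so it sits inside the try block
    const page = await grabber.grabUrl(new URL(href))
    return page.title
  } catch (err) {
    console.error(`Unable to grab ${href}:`, err.message)
    return null
  }
}

// Stand-in for `new HTMLGrabr()` so the sketch runs offline (assumption).
const stubGrabber = {
  grabUrl: async (url) => ({ title: `Page at ${url.hostname}` })
}

grabTitle(stubGrabber, 'https://about.readflow.app').then(title => {
  console.log(title)
})
```

Passing the grabber as an argument keeps the helper easy to test, since a stub can replace the real instance.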

API

Create a new instance:

const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const grabber = new HTMLGrabr(config)

Configuration object:

interface GrabberConfig {
  debug?: boolean                     // Print debug logs if true
  pretty?: boolean                    // Beautify HTML content if true
  isBlockedHost?: BlockedHostCtrlFunc // Function used to detect unwanted URLs
  rewriteURL?: URLRewriterFunc        // Function used to rewrite HTML src attributes
  rules?: Map<string, Rule>           // Rule definitions (see below)
  headers?: Headers                   // HTTP headers to set
}
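
As an illustration, a configuration object might be assembled as in the sketch below. The signatures of BlockedHostCtrlFunc and URLRewriterFunc are not given here, so the snippet assumes that isBlockedHost receives a hostname and returns a boolean, and that rewriteURL receives a URL string and returns the rewritten string. The host names and the proxy address are made up for the example.

```javascript
// Assumed signature: (hostname: string) => boolean
const blockedHosts = new Set(['ads.example.com', 'tracker.example.net'])
const isBlockedHost = (hostname) => blockedHosts.has(hostname)

// Assumed signature: (src: string) => string
// Hypothetical rewrite: route plain-HTTP resources through a proxy.
const rewriteURL = (src) =>
  src.startsWith('http://')
    ? `https://proxy.example.org/?url=${encodeURIComponent(src)}`
    : src

const config = {
  debug: false,   // no debug logs
  pretty: true,   // beautify the HTML output
  isBlockedHost,
  rewriteURL
}

// With the library installed, the object would be passed to the constructor:
// const grabber = new HTMLGrabr(config)
```

All fields are optional, so a config may set only the options it needs.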

Rule definition:

export interface Rule {
  selector: string             // HTML query selector
  type: 'redirect' | 'content' // Rule type:
  // - 'redirect' will use 'src' or 'href' attribute to redirect content extraction
  // - 'content' to specify content to extract
}
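
A rules map could be built as sketched below. How the map keys are interpreted is not stated here, so the keys are treated as arbitrary labels (an assumption). The selectors are invented, and validateRules is a hypothetical guard rather than part of the library.

```javascript
// Keys are arbitrary labels in this sketch (assumption).
const rules = new Map([
  ['article-body', { selector: 'article .post-content', type: 'content' }],
  ['embedded-frame', { selector: 'iframe.story', type: 'redirect' }]
])

// Hypothetical guard: checks each rule against the Rule interface above.
function validateRules(rules) {
  for (const [name, rule] of rules) {
    if (typeof rule.selector !== 'string' || rule.selector.length === 0) {
      throw new TypeError(`Rule "${name}" needs a non-empty selector`)
    }
    if (rule.type !== 'redirect' && rule.type !== 'content') {
      throw new TypeError(`Rule "${name}" has unknown type "${rule.type}"`)
    }
  }
  return rules
}

const rulesConfig = { rules: validateRules(rules) }
console.log(rulesConfig.rules.size)
```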

Grab a page:

// grabUrl returns a Promise; it resolves to the result object described below
const result = grabber.grabUrl(new URL('https://...'))

Result object:

interface GrabbedPage {
  title: string        // Page title
  url: string | null   // Source URL
  image: string | null // Page illustration
  html: string         // HTML content
  text: string         // Text content (from HTML)
  excerpt: string      // Excerpt (from meta data or HTML)
  length: number       // Read length
  images: ImageMeta[]  // Embedded image URLs
}
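
To show how the nullable fields might be handled, here is a small sketch that builds a one-line summary from a GrabbedPage-shaped object. The summarize helper is hypothetical and the sample data is invented for illustration; it is not real library output.

```javascript
// Hypothetical helper: one-line summary of a GrabbedPage-shaped object.
function summarize(page) {
  // url and image may be null according to the interface above
  const source = page.url === null ? 'unknown source' : page.url
  const illustration = page.image === null ? 'no illustration' : page.image
  return `${page.title} (${source}): ${page.excerpt} ` +
    `[${page.images.length} image(s), ${illustration}]`
}

// Sample data invented for illustration.
const samplePage = {
  title: 'Hello',
  url: null,
  image: null,
  html: '<p>Hello world</p>',
  text: 'Hello world',
  excerpt: 'Hello world',
  length: 11,
  images: []
}

console.log(summarize(samplePage))
```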