1.1.1 • Published 5 years ago
htmlgrabr v1.1.1
HTMLGrabr library
A Node.js library to grab and clean HTML content.
Features
- Extract page content from a URL (HTMLGrabr.grabUrl(url: URL): GrabbedPage)
- Extract page content from a string (HTMLGrabr.grab(s: string): GrabbedPage)
- Extract Open Graph properties
- Clean the page content:
  - Extract main HTML content using mozilla-readability
  - Sanitize HTML content using DOMPurify, with some extras:
    - Remove unwanted links or images
    - Remove pixel tracker
    - Remove unwanted attributes (such as style, class, id, ...)
    - And more
Usage
npm install --save htmlgrabr
Then in your code:
const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const { URL } = require('url')
const grabber = new HTMLGrabr()
grabber.grabUrl(new URL('https://about.readflow.app'))
  .then(page => {
    console.log(page)
  }, err => {
    console.error(err)
  })
API
Create new instance:
const HTMLGrabr = require('htmlgrabr').HTMLGrabr
const grabber = new HTMLGrabr(config)
Configuration object:
interface GrabberConfig {
  debug?: boolean                     // Print debug logs if true
  pretty?: boolean                    // Beautify HTML content if true
  isBlockedHost?: BlockedHostCtrlFunc // Function used to detect unwanted URLs
  rewriteURL?: URLRewriterFunc        // Function used to rewrite HTML src attributes
  rules?: Map<string, Rule>           // Rule definitions (see below)
  headers?: Headers                   // HTTP headers to set
}
Rule definition:
export interface Rule {
  selector: string             // HTML query selector
  type: 'redirect' | 'content' // Rule type:
  // - 'redirect' will use 'src' or 'href' attribute to redirect content extraction
  // - 'content' to specify content to extract
}
Grab a page:
const result = grabber.grabUrl(new URL('https://...'))
Result object:
interface GrabbedPage {
  title: string        // Page title
  url: string | null   // Source URL
  image: string | null // Page illustration
  html: string         // HTML content
  text: string         // Text content (from HTML)
  excerpt: string      // Excerpt (from meta data or HTML)
  length: number       // Read length
  images: ImageMeta[]  // Embedded image URLs
}
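Example configuration:
The interfaces above do not show the signatures of BlockedHostCtrlFunc and URLRewriterFunc, or what the keys of the rules Map mean. The sketch below is therefore only an illustration under stated assumptions: isBlockedHost is assumed to take a hostname and return a boolean, rewriteURL is assumed to take and return a URL string, and the rule keys are arbitrary labels. The hostnames, the proxy URL, the selectors, and the summarize helper are all hypothetical. The headers option is left out because the exact Headers type is not shown here.

```javascript
// Hypothetical blocklist: placeholder hostnames, not part of htmlgrabr.
const BLOCKED_HOSTS = new Set(['ads.example.com', 'tracker.example.com'])

// ASSUMPTION: BlockedHostCtrlFunc receives a hostname and returns true when it is unwanted.
function isBlockedHost(hostname) {
  return BLOCKED_HOSTS.has(hostname)
}

// ASSUMPTION: URLRewriterFunc receives a src URL string and returns the rewritten string.
// Here every src is routed through a hypothetical proxy.
function rewriteURL(src) {
  return 'https://proxy.example.com/?url=' + encodeURIComponent(src)
}

// Rules follow the Rule interface (selector + type).
// ASSUMPTION: the Map keys are free-form labels chosen by the caller.
const rules = new Map([
  ['article-body', { selector: 'article .content', type: 'content' }],
  ['canonical-link', { selector: 'link[rel="canonical"]', type: 'redirect' }]
])

// Object matching the optional fields of GrabberConfig.
const config = {
  debug: false,
  pretty: true,
  isBlockedHost,
  rewriteURL,
  rules
}

// Hypothetical helper: builds a one-line summary from GrabbedPage fields.
function summarize(page) {
  return `${page.title} | length=${page.length} | images=${page.images.length}`
}

// Possible usage, following the calls shown earlier in this README:
// const { HTMLGrabr } = require('htmlgrabr')
// const grabber = new HTMLGrabr(config)
// grabber.grabUrl(new URL('https://about.readflow.app'))
//   .then(page => console.log(summarize(page)), err => console.error(err))
```

Keeping the two callbacks as plain functions makes them easy to test on their own before handing the config to the grabber. Check the library's type definitions for the real callback signatures before relying on the shapes assumed here.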