Defuddle

de·fud·dle /diˈfʌdl/ transitive verb
to remove unnecessary elements from a web page, and make it easily readable.

Beware! Defuddle is very much a work in progress!

Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.

Features

Defuddle aims to output clean and consistent HTML documents. It was written for Obsidian Web Clipper with the goal of creating a more useful input for HTML-to-Markdown converters like Turndown.

Defuddle can be used as a replacement for Mozilla Readability with a few differences:

More forgiving, removes fewer uncertain elements.
Provides a consistent output for footnotes, citations, code blocks.
Uses a page's mobile styles to guess at unnecessary elements.
Extracts more metadata from the page, including schema.org data.

Installation

npm install defuddle

Usage

import { Defuddle } from 'defuddle';

const article = new Defuddle(document).parse();

// Use the extracted content and metadata
console.log(article.content);  // HTML string of the main content
console.log(article.title);    // Title of the article

Server-side usage

When using Defuddle in a Node.js environment, you can use JSDOM to create a DOM document:

import { Defuddle } from 'defuddle';
import { JSDOM } from 'jsdom';

const html = '...'; // Your HTML string
const dom = new JSDOM(html, {
  url: "https://www.example.com/page-url" // Optional: helps resolve relative URLs
});

const article = new Defuddle(dom.window.document).parse();
console.log(article.content);

Passing the url option to the JSDOM constructor lets Defuddle convert relative URLs (images, links, etc.) to absolute URLs.
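To make the effect of the base URL concrete, here is a minimal sketch of relative URL resolution using the standard URL class built into Node.js and browsers. The helper name resolveUrl is hypothetical and is not part of Defuddle or JSDOM; it only illustrates why a base URL is needed, not how Defuddle resolves URLs internally.

```javascript
// Illustration only: how a relative URL resolves against a page URL.
// resolveUrl is a hypothetical helper, not part of Defuddle or JSDOM.
function resolveUrl(relative, pageUrl) {
  try {
    return new URL(relative, pageUrl).href;
  } catch {
    // Without a usable base, a relative URL cannot be made absolute.
    return relative;
  }
}

const pageUrl = 'https://www.example.com/page-url';

console.log(resolveUrl('/images/photo.png', pageUrl));
// https://www.example.com/images/photo.png

console.log(resolveUrl('other-article', pageUrl));
// https://www.example.com/other-article

console.log(resolveUrl('/images/photo.png', undefined));
// /images/photo.png (unchanged, no base to resolve against)
```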

Response

The parse() method returns an object with the following properties:

Property	Type	Description
`content`	string	HTML string of the extracted main content
`title`	string	Title of the article
`description`	string	Description or summary of the article
`domain`	string	Domain name of the website
`favicon`	string	URL of the website's favicon
`image`	string	URL of the article's main image
`parseTime`	number	Time taken to parse the page in milliseconds
`published`	string	Publication date of the article
`author`	string	Author of the article
`site`	string	Name of the website
`schemaOrgData`	object	Raw schema.org data extracted from the page
`wordCount`	number	Total number of words in the extracted content
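
As one possible use of these properties, the sketch below builds a YAML front matter block from a parse() result, which can be handy when saving the converted Markdown. The helper toFrontMatter, its choice of fields, and the sample object are assumptions for illustration and are not part of Defuddle.

```javascript
// Hypothetical helper: turn the documented parse() metadata into YAML front matter.
// The field selection and output layout are assumptions for illustration.
function toFrontMatter(article) {
  const fields = ['title', 'author', 'published', 'site', 'description', 'wordCount'];
  const lines = ['---'];
  for (const field of fields) {
    const value = article[field];
    // Skip properties that are missing or empty.
    if (value === undefined || value === null || value === '') continue;
    // JSON.stringify yields a quoted, escaped scalar that YAML also accepts.
    lines.push(`${field}: ${JSON.stringify(value)}`);
  }
  lines.push('---');
  return lines.join('\n');
}

// Stand-in object with the same shape as a parse() result.
const sampleArticle = {
  title: 'Hello: world',
  author: 'Jane Doe',
  description: '',
  wordCount: 120
};

console.log(toFrontMatter(sampleArticle));
// ---
// title: "Hello: world"
// author: "Jane Doe"
// wordCount: 120
// ---
```

Quoting every string value avoids YAML surprises with characters such as colons in titles.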

HTML standardization

Defuddle attempts to standardize HTML elements to provide a consistent input for subsequent manipulation such as conversion to Markdown.

Headings

Anchor links in <h1> to <h6> elements are removed and become plain headings.

Code blocks

Code blocks are standardized to the following output. If present, line numbers and syntax highlighting are removed, but the language is retained and added as a data attribute and class.

<pre>
  <code data-lang="js" class="language-js">
    // code
  </code>
</pre>
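
The following is a rough sketch of how such a standardized block could be produced from a class name and plain source text. The helpers detectLanguage, escapeHtml, and standardCodeBlock are hypothetical, and the class name patterns they recognize ("language-*" and "lang-*") are an assumption; Defuddle's actual implementation may differ.

```javascript
// Hypothetical helpers; Defuddle's real implementation may differ.

// Pull a language name out of class names such as "language-js" or "lang-python".
function detectLanguage(className) {
  const match = /(?:^|\s)(?:language|lang)-([\w+#-]+)/i.exec(className || '');
  return match ? match[1].toLowerCase() : '';
}

// Escape text so it is safe inside an HTML element.
function escapeHtml(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Build the standardized <pre><code> markup from plain source text.
function standardCodeBlock(className, source) {
  const lang = detectLanguage(className);
  const attrs = lang ? ` data-lang="${lang}" class="language-${lang}"` : '';
  return `<pre><code${attrs}>${escapeHtml(source)}</code></pre>`;
}

console.log(standardCodeBlock('highlight language-js', 'if (a < b) run();'));
// <pre><code data-lang="js" class="language-js">if (a &lt; b) run();</code></pre>
```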

Footnotes

Inline references and footnotes are converted to a standard format:

Inline reference<sup id="fnref:1"><a href="#fn:1">1</a></sup>.

<div class="footnotes">
  <ol>
    <li class="footnote" id="fn:1">
      <p>
        Footnote content.&nbsp;<a href="#fnref:1" class="footnote-backref">↩</a>
      </p>
    </li>
  </ol>
</div>
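
For illustration, the standard footnote format above can be generated from a list of footnote bodies as in the sketch below. The helpers footnoteRef and footnoteList are hypothetical and only reproduce the documented output shape (without the indentation); they are not Defuddle's own code.

```javascript
// Hypothetical helpers that emit the standard footnote markup shown above.

// Inline reference for footnote number n.
function footnoteRef(n) {
  return `<sup id="fnref:${n}"><a href="#fn:${n}">${n}</a></sup>`;
}

// Footnote list; notes is an array of HTML strings, footnote 1 first.
function footnoteList(notes) {
  const items = notes.map((html, i) => {
    const n = i + 1;
    return `<li class="footnote" id="fn:${n}"><p>${html}&nbsp;` +
      `<a href="#fnref:${n}" class="footnote-backref">↩</a></p></li>`;
  });
  return `<div class="footnotes"><ol>${items.join('')}</ol></div>`;
}

console.log(`Inline reference${footnoteRef(1)}.`);
// Inline reference<sup id="fnref:1"><a href="#fn:1">1</a></sup>.

console.log(footnoteList(['Footnote content.']));
```

Each reference id (fnref:N) and footnote id (fn:N) point at each other, so the backref link returns the reader to the place the footnote was cited.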

Development

Build

To build the package, you'll need Node.js and npm installed. Then run:

# Install dependencies
npm install

# Clean and build
npm run build

This will generate:

dist/index.js - UMD build for both Node.js and browsers
dist/index.d.ts - TypeScript declaration file

