0.2.4 • Published 4 months ago

defuddle v0.2.4

License
MIT

de·fud·dle /diˈfʌdl/ transitive verb
to remove unnecessary elements from a web page, and make it easily readable.

Beware! Defuddle is very much a work in progress!

Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.

Features

Defuddle aims to output clean and consistent HTML documents. It was written for Obsidian Web Clipper with the goal of creating a more useful input for HTML-to-Markdown converters like Turndown.

Defuddle can be used as a replacement for Mozilla Readability with a few differences:

  • More forgiving: removes fewer uncertain elements.
  • Provides consistent output for footnotes, citations, and code blocks.
  • Uses a page's mobile styles to guess at unnecessary elements.
  • Extracts more metadata from the page, including schema.org data.

Installation

npm install defuddle

Usage

import { Defuddle } from 'defuddle';

const article = new Defuddle(document).parse();

// Use the extracted content and metadata
console.log(article.content);  // HTML string of the main content
console.log(article.title);    // Title of the article

Server-side usage

When using Defuddle in a Node.js environment, you can use JSDOM to create a DOM document:

import { Defuddle } from 'defuddle';
import { JSDOM } from 'jsdom';

const html = '...'; // Your HTML string
const dom = new JSDOM(html, {
  url: "https://www.example.com/page-url" // Optional: helps resolve relative URLs
});

const article = new Defuddle(dom.window.document).parse();
console.log(article.content);

Providing url in the JSDOM constructor helps convert relative URLs (images, links, etc.) to absolute URLs.
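To illustrate what converting relative URLs to absolute URLs means here, the following minimal sketch uses the standard URL constructor available in Node.js and browsers. The resolveUrl helper is hypothetical and is not part of Defuddle or JSDOM; it only shows how a base URL such as the one passed to JSDOM determines the absolute form of each link.

```typescript
// Illustrative only: resolveUrl is a hypothetical helper, not part of Defuddle or JSDOM.
function resolveUrl(href: string, base: string): string {
  // The standard URL constructor resolves `href` against `base`.
  // An href that is already absolute ignores the base.
  return new URL(href, base).href;
}

const pageUrl = "https://www.example.com/page-url";

const resolved: string[] = [
  "images/photo.jpg",                 // path relative to the page
  "/about",                           // path relative to the site root
  "https://cdn.example.org/logo.png", // already absolute
].map((href) => resolveUrl(href, pageUrl));

console.log(resolved);
```

Without a base URL there is nothing to resolve "images/photo.jpg" against, which is why supplying url matters when the HTML comes from a string.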

Response

The parse() method returns an object with the following properties:

Property       Type    Description
content        string  HTML string of the extracted main content
title          string  Title of the article
description    string  Description or summary of the article
domain         string  Domain name of the website
favicon        string  URL of the website's favicon
image          string  URL of the article's main image
parseTime      number  Time taken to parse the page in milliseconds
published      string  Publication date of the article
author         string  Author of the article
site           string  Name of the website
schemaOrgData  object  Raw schema.org data extracted from the page
wordCount      number  Total number of words in the extracted content
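For TypeScript users, the properties listed above can be written down as an interface. This is a sketch inferred from the property list only: the name ParsedArticle and the summarize helper are assumptions made for illustration, and the package's own type declarations (dist/index.d.ts) remain the authoritative source.

```typescript
// Shape inferred from the property list above; the name ParsedArticle is hypothetical.
interface ParsedArticle {
  content: string;
  title: string;
  description: string;
  domain: string;
  favicon: string;
  image: string;
  parseTime: number;
  published: string;
  author: string;
  site: string;
  schemaOrgData: object;
  wordCount: number;
}

// Hypothetical helper: build a one-line summary from a few metadata fields.
// Empty author or site values are simply left out.
function summarize(
  article: Pick<ParsedArticle, "title" | "author" | "site" | "wordCount">
): string {
  const byline = article.author ? ` by ${article.author}` : "";
  const source = article.site ? ` (${article.site})` : "";
  return `${article.title}${byline}${source}, ${article.wordCount} words`;
}

const summary = summarize({
  title: "Example article",
  author: "Jane Doe",
  site: "Example",
  wordCount: 1200,
});
console.log(summary);
```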

HTML standardization

Defuddle attempts to standardize HTML elements to provide a consistent input for subsequent manipulation such as conversion to Markdown.

Headings

Anchor links inside <h1> to <h6> elements are removed, leaving plain headings.

Code blocks

Code blocks are standardized to the following output. If present, line numbers and syntax highlighting are removed, but the language is retained and added as both a data-lang attribute and a language-* class.

<pre>
  <code data-lang="js" class="language-js">
    // code
  </code>
</pre>
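The following sketch shows one way the standardized output above could be produced from plain code text and a source class name. It is not Defuddle's implementation: detectLanguage, escapeHtml, and renderCodeBlock are hypothetical helpers, and the assumption that the language appears as a "language-" or "lang-" class is only one common convention.

```typescript
// Hypothetical helpers for illustration; not Defuddle's actual implementation.

// Assumption: the language is encoded in a class such as "language-js" or "lang-python".
function detectLanguage(className: string): string | null {
  const match = /(?:^|\s)(?:language|lang)-([\w+#-]+)/.exec(className);
  return match?.[1]?.toLowerCase() ?? null;
}

// Escape characters that would otherwise be parsed as markup inside <code>.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Emit the standardized <pre><code> structure, keeping only the language.
function renderCodeBlock(code: string, className: string): string {
  const lang = detectLanguage(className);
  const attrs = lang ? ` data-lang="${lang}" class="language-${lang}"` : "";
  return `<pre><code${attrs}>${escapeHtml(code)}</code></pre>`;
}

const block = renderCodeBlock("if (a < b) run();", "hljs language-js");
console.log(block);
```

Keeping the language in both a data attribute and a class means downstream converters can read whichever convention they already support.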

Footnotes

Inline references and footnotes are converted to a standard format:

Inline reference<sup id="fnref:1"><a href="#fn:1">1</a></sup>.

<div class="footnotes">
  <ol>
    <li class="footnote" id="fn:1">
      <p>
        Footnote content.&nbsp;<a href="#fnref:1" class="footnote-backref">↩</a>
      </p>
    </li>
  </ol>
</div>
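As a sketch of how the ids and back-references in this format pair up, the hypothetical helpers below generate the same markup from a list of footnote bodies. They are not part of Defuddle, and they assume each footnote body is already safe HTML.

```typescript
// Hypothetical helpers that emit the footnote format shown above; not part of Defuddle.

// Inline reference: links forward to the footnote with the same number.
function footnoteRef(n: number): string {
  return `<sup id="fnref:${n}"><a href="#fn:${n}">${n}</a></sup>`;
}

// Footnote list: each item links back to its inline reference.
// Assumption: every entry in `notes` is already-safe HTML.
function footnoteList(notes: string[]): string {
  const items = notes.map((html, i) => {
    const n = i + 1; // footnotes are numbered from 1
    return [
      `    <li class="footnote" id="fn:${n}">`,
      `      <p>`,
      `        ${html}&nbsp;<a href="#fnref:${n}" class="footnote-backref">↩</a>`,
      `      </p>`,
      `    </li>`,
    ].join("\n");
  });
  return ['<div class="footnotes">', "  <ol>", ...items, "  </ol>", "</div>"].join("\n");
}

const ref = footnoteRef(1);
const list = footnoteList(["First note.", "Second note."]);
console.log(ref);
console.log(list);
```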

Development

Build

To build the package, you'll need Node.js and npm installed. Then run:

# Install dependencies
npm install

# Clean and build
npm run build

This will generate:

  • dist/index.js - UMD build for both Node.js and browsers
  • dist/index.d.ts - TypeScript declaration file