1.2.0 • Published 6 years ago

find-main-content v1.2.0

License
ISC

Find The Main Content In An HTML Page

A module that finds the main content of an HTML page with the help of Cheerio. The extracted content can be returned as HTML, Markdown, or plain text.

It removes the header, footer, menu, sidebar, ...

Installation

$ npm install find-main-content -S

You also need to install Cheerio:

$ npm install cheerio -S

Simple usage

const cheerio = require('cheerio');
const { findContent } = require('find-main-content');

const $ = cheerio.load('<html> .... </html>');

// findContent returns a data structure containing the main content plus
// extracted information on links, images, headers, title, description, ...
const html = findContent($); // main content in HTML format (the default)
const txt = findContent($, 'txt'); // main content in plain-text format
const md = findContent($, 'md'); // main content in Markdown format
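
Each call returns the full result structure, with the extracted text in its content field. As a rough sketch, the three formats can be collected in one pass. Note that extractAllFormats is a hypothetical helper and not part of the package; findContent is passed in as an argument so the snippet does not depend on how the module is loaded.

```javascript
// Hypothetical helper (not part of find-main-content): run the extractor once
// per output format and keep only the `content` field of each result.
function extractAllFormats($, findContent, formats = ['html', 'txt', 'md']) {
  const contents = {};
  for (const format of formats) {
    const result = findContent($, format);
    contents[format] = result.content;
  }
  return contents;
}

// Possible usage, assuming `$` and `findContent` are set up as shown above:
// const contents = extractAllFormats($, findContent);
// console.log(contents.md);
```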

Options

You can control how the main content is extracted by passing an options object. Any subset of the following attributes can be specified.

const options = {

  // If more than one H1 is found, use the first one as the main title of the page
  useFirstH1: true,

  // Remove the H1 from the main content; the H1 will still be in the final JSON structure
  removeH1FromContent: true,

  // Some sites put links inside Hn tags; if true, these headers are removed
  removeHeadersWithoutText: true,

  // If true, don't add the images to the final extraction
  removeImages: true,

  // Remove the HTML figcaption tags
  removeFigcaptions: true,

  // Replace links with their anchor text
  replaceLinks: true,

  // Remove HTML forms
  removeForm: false,

  // Remove basic HTML tags that have no children
  removeEmptyTag: false,

  // Remove the tags that match the given selectors
  removeTags: '... ', // list of selectors separated by commas or line breaks

  // The HTML selector. If specified, the main content is extracted from the HTML element that matches the selector
  htmlSelector: '...'

};

const cheerio = require('cheerio');
const { findContent } = require('find-main-content');

const $ = cheerio.load('<html> .... </html>');

const data = findContent($, 'html', options);
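
Because removeTags takes a single string of selectors, it can be convenient to keep the selectors in an array and join them when building the options. The following is a small sketch under that assumption: buildOptions and the example selectors are hypothetical, and only the option names come from the list above.

```javascript
// Hypothetical helper: merge option overrides with a list of selectors to strip.
function buildOptions(overrides = {}, selectorsToRemove = []) {
  const options = { ...overrides };
  if (selectorsToRemove.length > 0) {
    // removeTags expects one string; here the selectors are comma-separated.
    options.removeTags = selectorsToRemove.join(',');
  }
  return options;
}

// The selectors below are placeholders; adapt them to the pages being processed.
const siteOptions = buildOptions(
  { removeImages: true, replaceLinks: true },
  ['.cookie-banner', '#comments']
);

// Possible usage, assuming `$` and `findContent` are set up as shown above:
// const data = findContent($, 'html', siteOptions);
```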

Structure returned by the function findContent

{
  title: '...',
  description: '...',
  images: [
    {
      src: 'https://... .jpg',
      alt: '...'
    },
    ...
  ],
  links: [
    {
      href: 'https://...',
      text: '...'
    },
    ...
  ],
  headers: [
    {
      type: 'h1',
      text: '...'
    },
    {
      type: 'h2',
      text: '...'
    },
    ...
  ],
  content: '....' // in either html, markdown or txt format
}
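
As an illustration of working with this structure, the sketch below builds an indented outline from headers and lists the images that lack alt text. The helper names and the sample object are hypothetical; only the field names are taken from the structure above, and the outline assumes every header type has the form 'h1' to 'h6'.

```javascript
// Hypothetical helpers operating on an object shaped like the findContent result.

// Turn the flat headers list into an indented outline, one line per header.
function buildOutline(headers) {
  return headers
    .map(({ type, text }) => {
      const level = Number(type.slice(1)); // 'h2' -> 2
      return `${'  '.repeat(level - 1)}- ${text}`;
    })
    .join('\n');
}

// Collect the src of every image whose alt text is missing or blank.
function imagesWithoutAlt(images) {
  return images
    .filter(({ alt }) => !alt || alt.trim() === '')
    .map(({ src }) => src);
}

// Sample data shaped like the documented structure (all values are made up).
const sample = {
  title: 'Example page',
  headers: [
    { type: 'h1', text: 'Example page' },
    { type: 'h2', text: 'Details' },
  ],
  images: [
    { src: 'https://example.com/a.jpg', alt: 'A photo' },
    { src: 'https://example.com/b.jpg', alt: '' },
  ],
};

console.log(buildOutline(sample.headers));
console.log(imagesWithoutAlt(sample.images));
```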