1.2.0 • Published 4 years ago
find-main-content v1.2.0
Find The Main Content In An HTML Page
Module for finding the main content on a page with the help of Cheerio. It can convert it into markdown, text or keep it in HTML.
It removes header, footer, menu, sidebar, ...
Installation
$ npm install find-main-content -S
You need also to use Cheerio
$ npm install cheerio -S
Simple usage
const cheerio = require('cheerio');
const { findContent } = require('find-main-content');
const $ = cheerio.load('<html> .... </html>');
// Return a nice data structure within the main content &
// some extract infos on links, images, headers, title, description, ...
const html = findContent($); // get the main content in the html format
const txt = findContent($, 'txt'); // get the main content in the txt format
const md = findContent($, 'md'); // get the main content in the markdown format
Options
You can control how to extract the main div with some options. You can specify a subset of the following attributes.
const options = {
// If more then one H1 is found, use the first one as the main title of the page
useFirstH1: true,
// Remove the H1 from the main content, the H1 will be in the final json structure
removeH1FromContent: true,
// Some site set some links in Hn, if true, we remove them
removeHeadersWithoutText: true,
// if true, don't add the images in the final extraction
removeImages: true,
// Remove HTML tag figcaption
removeFigcaptions: true,
// Replace links by their anchor text
replaceLinks: true,
// Remove HTML Form
removeForm: false,
// Remove basic html tags that have no children
removeEmptyTag: false
// Remove tags that match to selectors
removeTags : '... ' // list of selectors separated by comma or line break
// The HTML selector. If specified, the main content will be extract from the html element that matchs to the selector
htmlSelector : '...'
};
const cheerio = require('cheerio');
const { findContent } = require('find-main-content');
const $ = cheerio.load('<html> .... </html>');
const data = findContent($, 'html', options);
Structure returned by the function findContent
{
title: '...',
description: "...',
images: [
{
src: 'https://... .jpg',
alt: '...'
},
...
],
links: [
{
href: 'https://...',
text: '...'
},
],
headers: [
{
type: 'h1',
text: '...'
},
{
type: 'h2',
text: '...'
}
...
],
content: '....' // in either html, markdown or txt format
}
1.2.0
4 years ago
1.1.9
5 years ago
1.1.8
5 years ago
1.1.7
5 years ago
1.1.6
5 years ago
1.1.5
5 years ago
1.1.4
5 years ago
1.1.3
5 years ago
1.1.2
5 years ago
1.1.1
5 years ago
1.1.0
5 years ago
1.0.8
5 years ago
1.0.7
5 years ago
1.0.6
5 years ago
1.0.5
5 years ago
1.0.4
5 years ago
1.0.3
5 years ago
1.0.2
5 years ago
1.0.1
5 years ago
1.0.0
5 years ago