0.0.4 • Published 10 months ago

rehype-extract-posts v0.0.4

License
MIT

rehype-extract-posts

Rehype plugin to extract posts from an HTML document into a clean RSS-compatible JSON format.

Install

npm install rehype-extract-posts

Use

import { got } from 'got'
import { unified } from 'unified'
import { VFile } from 'vfile'
import rehypeParse from 'rehype-parse'
import rehypeExtractPosts from 'rehype-extract-posts'

const html = await got.get('https://indiehackers.com').then(res => res.body)
// Use a real VFile so the data written by the plugin is readable on `file` afterwards
const file = new VFile({ value: html })
const processor = unified().use(rehypeParse).use(rehypeExtractPosts)
const tree = processor.parse(file)
await processor.run(tree, file)

console.log(file.data.posts)

Running the code above stores an array of candidate posts found in the HTML document in file.data.posts. Each post uses the following schema:

{
	url: string
	lang?: string
	title?: string
	author?: string
	content?: string
	snippet?: string
	summary?: string
	categories?: string[]
	commentsUrl?: string
	imageUrl?: string
	media?: {
		url: string
		length?: number
		type?: string
	}[]
	createdAt?: string
	updatedAt?: string
}

See the test/ folder for examples.
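
Because every field except url is optional, code that consumes file.data.posts usually needs to guard against missing values. The following is a minimal sketch of that idea, not part of the package: newestTitledPosts and timestamp are hypothetical helpers, the sample array is made up to match the schema above, and it assumes createdAt holds a string that Date.parse understands (for example ISO 8601).

```javascript
// Hypothetical sample data shaped like the schema; not real plugin output.
const posts = [
  { url: 'https://example.com/a', title: 'First', createdAt: '2023-01-02T00:00:00Z' },
  { url: 'https://example.com/b' }, // only the required field is present
  { url: 'https://example.com/c', title: 'Third', createdAt: '2023-03-04T00:00:00Z' },
]

// Convert an optional date string to milliseconds; missing or unparsable dates sort last.
function timestamp(value) {
  const ms = value ? Date.parse(value) : NaN
  return Number.isNaN(ms) ? 0 : ms
}

// Keep posts that have a non-empty title and order them newest first.
// filter() returns a new array, so the input array is not mutated by sort().
function newestTitledPosts(posts) {
  return posts
    .filter((post) => typeof post.title === 'string' && post.title.trim() !== '')
    .sort((a, b) => timestamp(b.createdAt) - timestamp(a.createdAt))
}

const result = newestTitledPosts(posts)
console.log(result.map((post) => post.url))
```

In real use, the posts array would come from file.data.posts after processor.run has finished.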

API

This package exports a single plugin function.

unified().use(rehypeExtractPosts[, options])

The plugin executes a series of traversals to find and extract potential posts in an HTML document.

options

Configuration (optional).

options.host

Host string to prepend to all internal URLs on the page. Applies to nodes with href and src properties (such as a, img, and video). If options.host is not set and the rehypeExtractMeta plugin runs before rehypeExtractPosts, the url value from file.data.meta is used as a fallback. Otherwise, the plugin tries to infer the host from the meta tags in the tree.
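
To make the fallback order concrete, here is a rough sketch of how a host could be chosen and applied to internal URLs. It is an illustration under assumptions, not the plugin's actual implementation: pickHost and absolutize are hypothetical names, and resolving with the standard URL constructor is a stand-in for whatever the plugin does internally.

```javascript
// Hypothetical helper: choose the host using the order described above.
function pickHost(options = {}, file = {}) {
  // 1. An explicit options.host wins.
  if (options.host) return options.host
  // 2. Otherwise use the url left in file.data.meta by rehypeExtractMeta, if any.
  const metaUrl = file.data && file.data.meta && file.data.meta.url
  if (metaUrl) return metaUrl
  // 3. The plugin would try to infer the host from meta tags here; omitted in this sketch.
  return undefined
}

// Hypothetical helper: turn an internal URL into an absolute one.
function absolutize(url, host) {
  if (!host) return url
  // new URL(input, base) resolves relative inputs against base
  // and leaves inputs that are already absolute unchanged.
  return new URL(url, host).href
}

const host = pickHost({}, { data: { meta: { url: 'https://example.com/blog' } } })
console.log(absolutize('/post/1', host))
console.log(absolutize('https://other.org/a', host))
```

Passing the option to the plugin itself follows the signature shown earlier: unified().use(rehypeParse).use(rehypeExtractPosts, { host: 'https://indiehackers.com' }).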