rehype-extract-posts v0.0.5
rehype-extract-posts
Rehype plugin to extract posts from an HTML document into a clean RSS-compatible JSON format.
Install
npm install rehype-extract-posts
Use
import { got } from 'got'
import { unified } from 'unified'
import { VFile } from 'vfile'
import rehypeParse from 'rehype-parse'
import rehypeExtractPosts from 'rehype-extract-posts'
// Fetch the page to extract posts from
const html = await got.get('https://indiehackers.com').then(res => res.body)
// Use a real VFile: unified wraps plain objects in a new VFile, so data
// written by the plugin would not be visible on a plain { value } object
const file = new VFile({ value: html })
const processor = unified().use(rehypeParse).use(rehypeExtractPosts)
const tree = processor.parse(file)
await processor.run(tree, file)
console.log(file.data.posts)
Running the above code stores an array in file.data.posts containing the potential posts found in the HTML document. Each post follows this schema:
{
  url: string
  lang?: string
  title?: string
  author?: string
  content?: string
  snippet?: string
  summary?: string
  categories?: string[]
  commentsUrl?: string
  imageUrl?: string
  media?: [{
    url: string
    length?: number
    type?: string
  }]
  createdAt?: string
  updatedAt?: string
}
See the test/ folder for examples.
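Since url is the only required field in the schema, code that consumes file.data.posts may want to guard against incomplete entries before using them. The sketch below is one possible check, not part of the package: looksLikePost is a hypothetical helper, and samplePosts is made-up data standing in for file.data.posts.

```javascript
// Hypothetical helper (not exported by rehype-extract-posts): checks the
// required `url` field and the documented shape of `categories` and `media`.
function looksLikePost(value) {
  if (value === null || typeof value !== 'object') return false
  if (typeof value.url !== 'string') return false
  if (value.categories !== undefined && !Array.isArray(value.categories)) return false
  if (value.media !== undefined) {
    if (!Array.isArray(value.media)) return false
    // Every media entry needs its own `url` string
    const mediaOk = value.media.every(
      (m) => m !== null && typeof m === 'object' && typeof m.url === 'string'
    )
    if (!mediaOk) return false
  }
  return true
}

// Made-up sample data standing in for file.data.posts
const samplePosts = [
  { url: 'https://example.com/a', title: 'First post', categories: ['news'] },
  { title: 'Missing url' },
  { url: 'https://example.com/b', media: [{ url: 'https://example.com/b.mp3', type: 'audio/mpeg' }] }
]

const valid = samplePosts.filter(looksLikePost)
console.log(valid.map((post) => post.url))
```

All other fields are optional, so the check deliberately ignores them; a stricter consumer could extend it field by field.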
API
This package exports a single plugin function.
unified().use(rehypeExtractPosts[, options])
The plugin executes a series of traversals to find and extract potential posts in an HTML document.
options
Configuration (optional).
options.host
Prepend the host string to all internal URLs on the page. Applies to nodes with href and src props (like a, img, video). If the rehypeExtractMeta plugin is used before rehypeExtractPosts, the url value from file.data.meta will be used as a fallback. If not, the plugin will try to infer the host from the meta tags in the tree.
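The option is passed in the usual unified way, for example unified().use(rehypeExtractPosts, { host: 'https://example.com' }), where the host value is a placeholder. To illustrate the effect on internal URLs, the sketch below resolves href and src values against a host using the standard URL constructor. This is an assumption about the observable result, not the plugin's actual implementation, and resolveAgainstHost is a hypothetical helper.

```javascript
// Hypothetical helper (not part of rehype-extract-posts): shows what
// resolving a URL against a host looks like with the standard URL class.
function resolveAgainstHost(url, host) {
  try {
    // Relative URLs are resolved against the host; absolute URLs pass through
    return new URL(url, host).href
  } catch {
    // Leave values that cannot be parsed untouched
    return url
  }
}

const host = 'https://example.com' // placeholder host
console.log(resolveAgainstHost('/post/hello', host))
// https://example.com/post/hello
console.log(resolveAgainstHost('https://cdn.example.org/a.png', host))
// https://cdn.example.org/a.png
```

External URLs keep their own origin, so only internal links and assets gain the host prefix.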