rehype-extract-posts v0.0.4
rehype-extract-posts
Rehype plugin to extract posts from an HTML document into a clean RSS-compatible JSON format.
Install
npm install rehype-extract-posts
Use
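import { got } from 'got'
import { unified } from 'unified'
import { VFile } from 'vfile'
import rehypeParse from 'rehype-parse'
import rehypeExtractPosts from 'rehype-extract-posts'

// Fetch the page to scan for posts.
const html = await got.get('https://indiehackers.com').then(res => res.body)
// Use a real VFile so the data the plugin attaches during run() is visible on this object.
const file = new VFile({ value: html })
const processor = unified().use(rehypeParse).use(rehypeExtractPosts)
// Parse the HTML into a tree, then run the plugin's transforms on it.
const tree = processor.parse(file)
await processor.run(tree, file)
console.log(file.data.posts)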
Running the above code populates file.data.posts with an array of potential posts found in the HTML document. Each post uses the following schema:
{
  url: string
  lang?: string
  title?: string
  author?: string
  content?: string
  snippet?: string
  summary?: string
  categories?: string[]
  commentsUrl?: string
  imageUrl?: string
  media?: [{
    url: string
    length?: number
    type?: string
  }]
  createdAt?: string
  updatedAt?: string
}
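Because url is the only required field in the schema, consumers may want to guard against incomplete entries before using the result. The following is a minimal sketch of one way to do that. cleanPosts and the sample data are hypothetical and not part of this package, and it assumes createdAt, when present, is a string that Date.parse can read.

```javascript
// Hypothetical helper, not part of rehype-extract-posts.
// Keeps only entries that satisfy the required part of the schema (a string `url`)
// and orders them newest first by `createdAt`. Entries without a parseable
// `createdAt` are treated as oldest.
function cleanPosts(posts) {
  const time = (post) => {
    const t = Date.parse(post.createdAt ?? '')
    return Number.isNaN(t) ? 0 : t
  }
  return (posts ?? [])
    .filter((post) => post && typeof post.url === 'string' && post.url.length > 0)
    .sort((a, b) => time(b) - time(a))
}

// Sample data invented for illustration only.
const sample = [
  { url: 'https://example.com/a', title: 'Older', createdAt: '2023-01-01T00:00:00Z' },
  { title: 'No URL, dropped' },
  { url: 'https://example.com/b', title: 'Newer', createdAt: '2023-06-01T00:00:00Z' },
  { url: 'https://example.com/c' }
]

const cleaned = cleanPosts(sample)
```

In real use, file.data.posts would take the place of sample.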
See the test/ folder for examples.
API
This package exports a single plugin function.
unified().use(rehypeExtractPosts[, options])
The plugin executes a series of traversals to find and extract potential posts in an HTML document.
options
Configuration (optional).
options.host
Prepend the host string to all internal URLs on the page. Applies to nodes with href and src props (like a, img, video). If the rehypeExtractMeta plugin is used before rehypeExtractPosts, the url value from file.data.meta will be used as a fallback. If not, the plugin will try to infer the host from the meta tags in the tree.
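To illustrate what prepending a host to internal URLs can look like, here is a rough sketch built on the standard URL constructor. It is an assumption about the general technique, not the plugin's actual implementation. resolveAgainstHost and the example values are hypothetical.

```javascript
// Hypothetical illustration of host prepending; not the plugin's actual code.
function resolveAgainstHost(value, host) {
  try {
    // The WHATWG URL constructor resolves relative references against a base
    // and leaves absolute URLs unchanged.
    return new URL(value, host).href
  } catch {
    // Leave values that cannot be parsed untouched.
    return value
  }
}

// Example values invented for illustration only.
const host = 'https://example.com'
const resolved = ['/posts/1', 'images/cover.png', 'https://cdn.example.org/v.mp4']
  .map((value) => resolveAgainstHost(value, host))

// With the plugin itself, the host is passed through the documented option:
// unified().use(rehypeParse).use(rehypeExtractPosts, { host: 'https://example.com' })
```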