1.0.0 • Published 5 months ago

@silyze/html-prompt-utils v1.0.0

License
MIT

HTML Prompt Utils

HTML Prompt Utils is a lightweight toolkit for turning raw HTML into a compact, prompt‑friendly JSON representation, suited to LLM prompts, diffing, or storing a minimal DOM snapshot. It consists of a streaming HTML serializer and a lossy compression step that keeps only the semantically useful parts of the DOM (text, ids, classes, and a curated list of attributes).


Features

| Feature | Description |
| --- | --- |
| Streaming HTML → AST | Parse HTML incrementally with HTMLSerializer, producing a simple DocumentNode tree. |
| Lossy compression | compressNode collapses the tree into a terser CompressedNode, merging selectors and discarding irrelevant markup. |
| Attribute whitelisting | Only meaningful attributes (e.g. href, src, value, …) are preserved, keeping output compact and deterministic. |
| Ignore tags | Configure tags (default: head, script, iframe, …) that should be excluded entirely during parsing. |
| Tree‑agnostic | Works with full HTML strings, server‑sent chunks, or any ReadableStream you wrap in HTMLTextStream. |

Installation

npm install @silyze/html-prompt-utils

Quick start

import {
  HTMLTextStream,
  HTMLSerializer,
  compressNode,
} from "@silyze/html-prompt-utils";

// 1) Wrap your HTML (string | Promise<string>) in a stream helper
const html = new HTMLTextStream(
  `<div id="app"><p>Hello <strong>world</strong></p></div>`
);

// 2) Parse it → DocumentNode (AST)
const doc = await HTMLSerializer.parse(html);

// 3) Compress the AST for prompt usage
const compressed = compressNode(doc);

console.log(JSON.stringify(compressed, null, 2));
/*
{
  "div#app": {
    "p": {
      "text": [
        "Hello",
        "world"
      ]
    }
  }
}
*/

API Reference

All exports are available from the package root:

import {
  compressNode,
  HTMLTextStream,
  HTMLSerializer,
  DocumentNode,
  CompressedNode,
  HTMLPipeTarget,
  HTMLStream,
} from "@silyze/html-prompt-utils";

Types

| Type | Description |
| --- | --- |
| DocumentNode | A minimal DOM‑like interface { name, attributes?, children? }. |
| CompressedNode | A recursively compressed representation (see Compression format below). |
| HTMLStream | Object with pipeTo(target) for pumping data into a consumer. |
| HTMLPipeTarget | { write(chunk), end(chunk?) } – anything that accepts streamed chunks (e.g. htmlparser2.Parser). |
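
To make the DocumentNode shape more concrete, here is a minimal sketch of a depth‑first walk over such a tree. The interface is an assumption inferred from the description above (attributes as a string map, children as an array of nodes); the package's real type may differ, for example in how text nodes are represented. The names DocumentNodeSketch and listTags are hypothetical.

```typescript
// Assumed shape, inferred from the description above; not the package's own type.
interface DocumentNodeSketch {
  name: string;
  attributes?: Record<string, string>;
  children?: DocumentNodeSketch[];
}

// Depth-first walk that records each tag name, indented by nesting depth.
function listTags(
  node: DocumentNodeSketch,
  depth = 0,
  out: string[] = []
): string[] {
  out.push(`${"  ".repeat(depth)}${node.name}`);
  for (const child of node.children ?? []) {
    listTags(child, depth + 1, out);
  }
  return out;
}

const sampleTree: DocumentNodeSketch = {
  name: "div",
  attributes: { id: "app" },
  children: [{ name: "p", children: [{ name: "strong" }] }],
};

const tagOutline = listTags(sampleTree);
console.log(tagOutline.join("\n"));
```

The same recursive pattern applies to any per‑node processing, such as counting elements or collecting attribute values.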

HTMLTextStream (class)

Wraps a string or Promise<string> as an HTMLStream so it can be consumed by the parser.

new HTMLTextStream(src: string | Promise<string>): HTMLTextStream
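
As a rough illustration of what such a wrapper involves, the sketch below re‑creates the idea with local stand‑in types. It is not the package's source: PipeTargetSketch, StreamSketch, TextStreamSketch, and CollectingTarget are hypothetical names, and the Promise<void> return type of pipeTo is an assumption.

```typescript
// Local stand-ins for the package's HTMLPipeTarget and HTMLStream types.
interface PipeTargetSketch {
  write(chunk: string): void;
  end(chunk?: string): void;
}
interface StreamSketch {
  pipeTo(target: PipeTargetSketch): Promise<void>;
}

// Hypothetical re-implementation of a text-backed stream.
class TextStreamSketch implements StreamSketch {
  private readonly src: string | Promise<string>;

  constructor(src: string | Promise<string>) {
    this.src = src;
  }

  async pipeTo(target: PipeTargetSketch): Promise<void> {
    // Awaiting a plain string yields it unchanged, so both input kinds share one path.
    const text = await this.src;
    // The whole document is delivered as a single final chunk.
    target.end(text);
  }
}

// A target that simply records what it receives.
class CollectingTarget implements PipeTargetSketch {
  received = "";
  ended = false;

  write(chunk: string): void {
    this.received += chunk;
  }

  end(chunk?: string): void {
    if (chunk !== undefined) this.received += chunk;
    this.ended = true;
  }
}

async function demoTextStream(): Promise<CollectingTarget> {
  const target = new CollectingTarget();
  await new TextStreamSketch(Promise.resolve("<p>Hello</p>")).pipeTo(target);
  return target;
}
```

Because the target only needs write and end, the same stream can feed a parser, a logger, or a test double like CollectingTarget.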

HTMLSerializer (class)

| Member | Signature | Notes |
| --- | --- | --- |
| constructor | (ignoreTags?: string[]) | Creates a serializer instance. |
| static defaultIgnoreTags | readonly string[] | Defaults to ["head","script","iframe","meta","style","link"]. |
| static parse | (html: HTMLStream, ignoreTags?, options?) | Convenience that builds a Parser (from htmlparser2), feeds it, and resolves to a DocumentNode. |
| root | Promise<DocumentNode> | Promise of the final tree (same as return from parse). |
| currentRoot | DocumentNode | Synchronous access while streaming (advanced). |

Under the hood it implements the htmlparser2.Handler interface (onopentag, ontext, …) so you can wire it manually when needed.
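
The following sketch shows one way a handler of this kind could build a tree while skipping ignored tags. It is an illustration under assumptions, not the package's implementation: IgnoringHandlerSketch and TreeNode are hypothetical, text is stored as plain strings in children, and the callbacks are invoked by hand here rather than by htmlparser2.

```typescript
interface TreeNode {
  name: string;
  attributes?: Record<string, string>;
  children: (TreeNode | string)[];
}

// Hypothetical handler; method names mirror htmlparser2's handler callbacks.
class IgnoringHandlerSketch {
  readonly root: TreeNode = { name: "#document", children: [] };
  private stack: TreeNode[] = [this.root];
  private ignoreDepth = 0; // greater than 0 while inside an ignored subtree
  private ignore: Set<string>;

  constructor(ignoreTags: string[] = ["head", "script", "style"]) {
    this.ignore = new Set(ignoreTags);
  }

  private top(): TreeNode {
    return this.stack[this.stack.length - 1] ?? this.root;
  }

  onopentag(name: string, attributes: Record<string, string>): void {
    // Once inside an ignored tag, every nested open tag deepens the skip.
    if (this.ignoreDepth > 0 || this.ignore.has(name)) {
      this.ignoreDepth++;
      return;
    }
    const node: TreeNode = { name, attributes, children: [] };
    this.top().children.push(node);
    this.stack.push(node);
  }

  ontext(text: string): void {
    if (this.ignoreDepth > 0 || text.trim() === "") return;
    this.top().children.push(text);
  }

  onclosetag(_name: string): void {
    if (this.ignoreDepth > 0) {
      this.ignoreDepth--;
      return;
    }
    if (this.stack.length > 1) this.stack.pop();
  }
}

// Drive the handler manually, as a parser would.
const handler = new IgnoringHandlerSketch();
handler.onopentag("div", { id: "app" });
handler.onopentag("script", {});
handler.ontext("alert(1)");
handler.onclosetag("script");
handler.ontext("Hello");
handler.onclosetag("div");

const handlerResult = JSON.stringify(handler.root);
console.log(handlerResult);
```

Tracking a depth counter rather than a boolean keeps nested elements inside an ignored tag from ending the skip too early.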

compressNode(root: DocumentNode): CompressedNode | undefined

Traverses a DocumentNode and returns its compressed form, or undefined if the node is entirely ignorable (e.g. whitespace only).
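
Since the result may be undefined, callers that build prompt strings may want to guard against it. The helper below is a hypothetical sketch (toPromptJson is not part of the package) that falls back to an empty object, because JSON.stringify(undefined) returns undefined rather than a string.

```typescript
// Hypothetical helper: serialize a compression result, treating `undefined`
// (nothing worth keeping) as an empty object.
function toPromptJson<T>(compressed: T | undefined): string {
  return JSON.stringify(compressed ?? {}, null, 2);
}

const emptyJson = toPromptJson(undefined);
const filledJson = toPromptJson({ p: { text: "Hi" } });
console.log(emptyJson);
console.log(filledJson);
```

In real code the argument would be the value returned by compressNode.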

Compression format

  • Outermost keys are CSS‑like selectors: tag#id.class1.class2.
  • Special key text holds raw text content.
  • If a selector or text key holds a single value, the array wrapper is stripped.
  • Attributes are expressed as additional selectors like [href], [value], [placeholder].
  • Only attributes in the preservation whitelist are ever kept:
const preserveAttributes = [
  "type",
  "placeholder",
  "value",
  "min",
  "max",
  "name",
  "src",
  "alt",
  "href",
  "target",
  "action",
  "for",
  "selected",
  "checked",
  "multiple",
  "list",
];
  • contenteditable="true" is also kept.
  • Deep single‑child chains collapse: <div><span><a>… ⇒ selector div span a.
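
The selector rules above can be sketched as follows. This is a simplified illustration, not the package's algorithm: ElementSketch, keptAttributes, toSelector, and collapseChain are hypothetical names, only a subset of the whitelist is used, attributes are emitted by name only (as in the [href] examples above), and text children are not modeled.

```typescript
interface ElementSketch {
  name: string;
  attributes?: Record<string, string | undefined>;
  children?: ElementSketch[];
}

// Subset of the whitelist above, enough for the demo.
const keptAttributes = new Set([
  "type", "placeholder", "value", "name", "src", "alt", "href",
]);

// Build `tag#id.class1.class2[attr]` for a single element.
function toSelector(el: ElementSketch): string {
  const attrs = el.attributes ?? {};
  let selector = el.name;
  const id = attrs["id"];
  if (id) selector += `#${id}`;
  for (const cls of (attrs["class"] ?? "").split(/\s+/).filter(Boolean)) {
    selector += `.${cls}`;
  }
  for (const key of Object.keys(attrs)) {
    if (keptAttributes.has(key)) selector += `[${key}]`;
  }
  return selector;
}

// Follow a chain of single-child elements and join their selectors with spaces.
function collapseChain(el: ElementSketch): string {
  const parts = [toSelector(el)];
  let current = el;
  while (true) {
    const kids = current.children ?? [];
    if (kids.length !== 1) break;
    current = kids[0];
    parts.push(toSelector(current));
  }
  return parts.join(" ");
}

const inputSelector = toSelector({
  name: "input",
  attributes: { id: "q", type: "text", placeholder: "Search", "data-x": "1" },
});

const chainSelector = collapseChain({
  name: "div",
  children: [
    {
      name: "span",
      children: [{ name: "a", attributes: { class: "btn primary", href: "/docs" } }],
    },
  ],
});

console.log(inputSelector); // input#q[type][placeholder]
console.log(chainSelector); // div span a.btn.primary[href]
```

Non‑whitelisted attributes such as data-x simply drop out, which is what keeps the keys short and stable across page renders.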

Advanced usage

Custom streaming source

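import { PassThrough } from "node:stream";
import { Parser } from "htmlparser2";
import { HTMLSerializer } from "@silyze/html-prompt-utils";

const pass = new PassThrough();
// Emit decoded strings instead of Buffers, since the parser expects text
pass.setEncoding("utf8");

// Adapt the Node stream to the HTMLStream shape. Arrow functions are used
// so that write/end keep their `this` binding to the target.
const stream = {
  pipeTo: (target) =>
    pass
      .on("data", (chunk) => target.write(chunk))
      .on("end", () => target.end()),
};

// The serializer acts as the htmlparser2 handler
const serializer = new HTMLSerializer();
const parser = new Parser(serializer);
stream.pipeTo(parser);

pass.write('<p streaming="yes">');
pass.write("Hello");
pass.end("</p>");

// Resolves once the parser has seen the end of input
const doc = await serializer.root;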

Changing ignored tags

const doc = await HTMLSerializer.parse(htmlStream, [
  /* your tags */
]);
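
If the goal is to add tags on top of the defaults rather than replace them, the list can be merged before it is passed in. The sketch below assumes, without confirmation from the package, that a custom list replaces the defaults. extendIgnoreTags is a hypothetical helper, and documentedDefaults is copied from the table above; in real code it would be read from HTMLSerializer.defaultIgnoreTags.

```typescript
// Copied from the documented default list for the sake of a standalone demo.
const documentedDefaults: readonly string[] = [
  "head", "script", "iframe", "meta", "style", "link",
];

// Merge extra tags into a base list, lowercasing and removing duplicates.
function extendIgnoreTags(base: readonly string[], extra: string[]): string[] {
  const merged = [...base, ...extra].map((tag) => tag.toLowerCase());
  return Array.from(new Set(merged));
}

const ignoreTags = extendIgnoreTags(documentedDefaults, ["svg", "noscript", "SCRIPT"]);
console.log(ignoreTags);
```

The resulting array can then be passed as the second argument to HTMLSerializer.parse.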