md-llm

A Markdown-to-llmxml (.llm) bridge that transforms Markdown documents into structured AST nodes optimized for LLM processing.

What's llmxml? Why?

LLMs <3 XML. But it doesn't have to be strict, deeply nested XML; mostly flat pseudo-XML works fine.
Humans <3 Markdown.
Markdown sections form nested hierarchies through # headers.
md-llm takes Markdown documents and breaks them into nested nodes that can be output as XML.
Once a document is converted to nested nodes, you can also target specific portions of it, which gives you importable Markdown 'modules' in meld.

This library is consumed by the oneshot and meld CLI tools. If you like what this does, you probably want those.

Features

- Bidirectional conversion between Markdown and LLM-optimized formats
- Semantic section detection and processing
- Rich support for Markdown elements:
  - Headers with customizable depth
  - Code blocks with language and metadata
  - Lists (ordered, unordered, and task lists)
  - Tables with alignment and formatting
  - Blockquotes and thematic breaks
  - HTML content preservation
  - Frontmatter processing
  - Definition lists
  - References and footnotes
- Extensible transform pipeline
- Fuzzy section matching
- High performance and memory efficient
- Customizable tag name generation
- Modular architecture for custom transforms

Installation

npm install md-llm

Quick Start

import { mdToLlm } from 'md-llm';

const markdown = `
# System Context
Some context here...

## Project Setup
Instructions for setup:
\`\`\`bash
npm install
\`\`\`
`;

const result = await mdToLlm(markdown);
console.log(result.ast);

Output:

{
  type: 'tag',
  name: 'Document',
  children: [
    {
      type: 'tag',
      name: 'SystemContext',
      children: [
        { type: 'text', value: 'Some context here...' },
        {
          type: 'tag',
          name: 'ProjectSetup',
          children: [
            { type: 'text', value: 'Instructions for setup:' },
            {
              type: 'tag',
              name: 'Code',
              attributes: { language: 'bash' },
              children: [{ type: 'text', value: 'npm install' }]
            }
          ]
        }
      ]
    }
  ]
}

API Reference

Core Function

async function mdToLlm(
  markdown: string,
  options?: MdToLlmOptions
): Promise<ParseResult>

Options

interface MdToLlmOptions {
  headerDepth?: number;  // 1-6, default 2
  tagNameMap?: Record<string, string>;  // Custom header->tag mappings
  preserveHeaderText?: boolean;  // Keep original text as first line?
  customTransforms?: MdToLlmTransform[];  // Add custom transforms
}
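
To make the tagNameMap option more concrete, here is a minimal sketch of how a header might be turned into a tag name. It assumes the PascalCase convention visible in the Quick Start output ('System Context' becomes 'SystemContext') and assumes that an explicit tagNameMap entry takes priority. The helper toTagName, its underscore and 'Section' fallbacks, and the sample option values are hypothetical and are not part of the documented API.

```typescript
// Hypothetical helper, not exported by md-llm: derives a tag name from header text.
function toTagName(
  headerText: string,
  tagNameMap: Record<string, string> = {}
): string {
  // Assumption: an explicit mapping wins over the generated name.
  if (Object.prototype.hasOwnProperty.call(tagNameMap, headerText)) {
    return tagNameMap[headerText];
  }

  // Split on anything that is not a letter or digit, then PascalCase the words.
  const words = headerText.split(/[^A-Za-z0-9]+/).filter((w) => w.length > 0);
  const name = words
    .map((w) => w.charAt(0).toUpperCase() + w.slice(1))
    .join('');

  // Assumed fallbacks: a placeholder for empty names, and an underscore
  // prefix because XML names cannot start with a digit.
  if (name === '') return 'Section';
  return /^[0-9]/.test(name) ? `_${name}` : name;
}

// Sample option values, shaped like MdToLlmOptions.
const options = {
  headerDepth: 3,
  tagNameMap: { 'Project Setup': 'Setup' },
};

const generated = toTagName('System Context', options.tagNameMap);
const overridden = toTagName('Project Setup', options.tagNameMap);
```

Here generated is 'SystemContext' and overridden is 'Setup'. The real library may normalize header text differently, so treat this only as an illustration of what the option is for.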

Transform Pipeline

The library uses a modular transform pipeline that processes different Markdown elements:

HeaderTransform: Converts headers to semantic tags
CodeFenceTransform: Processes code blocks with language and metadata
ListTransform: Handles ordered and unordered lists
TableTransform: Processes tables with alignment
BlockquoteTransform: Handles blockquotes
ThematicBreakTransform: Processes horizontal rules
FrontmatterTransform: Extracts YAML frontmatter
DefinitionTransform: Processes definition lists
ReferenceTransform: Handles link references and footnotes
TaskListTransform: Processes task lists
HtmlTransform: Preserves HTML content
SectionTransform: Handles section boundaries and hierarchy
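
The list above can be pictured as a dispatch loop. The sketch below is one plausible shape for it, assuming the first transform whose canTransform returns true handles the node. The types are simplified local stand-ins, and applyTransforms, the 'Break' tag name, and the plain-text fallback are hypothetical; the library's actual ordering and fallback rules are not documented here.

```typescript
// Simplified local stand-ins for the library's types (assumed shapes).
interface MdastNode {
  type: string;
  value?: string;
  children?: MdastNode[];
}

type LlmAstNode =
  | { type: 'tag'; name: string; attributes?: Record<string, string>; children: LlmAstNode[] }
  | { type: 'text'; value: string };

interface MdToLlmTransform {
  canTransform(node: MdastNode): boolean;
  transform(node: MdastNode): LlmAstNode;
}

// Hypothetical dispatcher: the first transform that accepts the node wins.
function applyTransforms(
  node: MdastNode,
  transforms: MdToLlmTransform[]
): LlmAstNode {
  for (const t of transforms) {
    if (t.canTransform(node)) return t.transform(node);
  }
  // Assumed fallback: keep any literal value as plain text.
  return { type: 'text', value: node.value ?? '' };
}

// Two toy transforms keyed on mdast node types.
const thematicBreak: MdToLlmTransform = {
  canTransform: (node) => node.type === 'thematicBreak',
  transform: () => ({ type: 'tag', name: 'Break', children: [] }),
};

const codeFence: MdToLlmTransform = {
  canTransform: (node) => node.type === 'code',
  transform: (node) => ({
    type: 'tag',
    name: 'Code',
    children: [{ type: 'text', value: node.value ?? '' }],
  }),
};

const pipeline = [thematicBreak, codeFence];
const out = applyTransforms({ type: 'code', value: 'npm install' }, pipeline);
```

With this ordering, a node of type 'code' skips the thematic-break transform and is handled by the code-fence one, so out is a Code tag wrapping the text 'npm install'.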

Custom Transforms

You can create custom transforms by implementing the MdToLlmTransform interface:

interface MdToLlmTransform {
  transform(node: MdastNode): LlmAstNode;
  canTransform(node: MdastNode): boolean;
}

Example custom transform:

class CustomTransform implements MdToLlmTransform {
  canTransform(node: MdastNode): boolean {
    return node.type === 'customType';
  }

  transform(node: MdastNode): LlmAstNode {
    return {
      type: 'tag',
      name: 'CustomTag',
      attributes: {},
      children: []
    };
  }
}

// Use in options
const options = {
  customTransforms: [new CustomTransform()]
};

Node Types

The library uses two main node types:

interface TagNode {
  type: 'tag';
  name: string;
  attributes?: Record<string, string>;
  children: (TagNode | TextNode)[];
}

interface TextNode {
  type: 'text';
  value: string;
}
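
Because the tree consists of only these two node types, turning it into pseudo-XML takes a short recursive walk. The following is a minimal sketch; renderNode is a hypothetical helper and is not part of the documented API. It escapes quotes in attribute values but deliberately leaves text content unescaped, on the assumption that mostly flat pseudo-XML for an LLM does not need strict escaping.

```typescript
interface TextNode {
  type: 'text';
  value: string;
}

interface TagNode {
  type: 'tag';
  name: string;
  attributes?: Record<string, string>;
  children: (TagNode | TextNode)[];
}

// Hypothetical serializer: renders a node tree as indented pseudo-XML.
function renderNode(node: TagNode | TextNode, indent = 0): string {
  const pad = '  '.repeat(indent);
  if (node.type === 'text') return pad + node.value;

  // Render attributes as key="value", escaping embedded double quotes.
  const attrs = Object.entries(node.attributes ?? {})
    .map(([key, value]) => ` ${key}="${value.replace(/"/g, '&quot;')}"`)
    .join('');

  // Tags without children collapse to a self-closing form.
  if (node.children.length === 0) return `${pad}<${node.name}${attrs} />`;

  const body = node.children
    .map((child) => renderNode(child, indent + 1))
    .join('\n');
  return `${pad}<${node.name}${attrs}>\n${body}\n${pad}</${node.name}>`;
}

const tree: TagNode = {
  type: 'tag',
  name: 'ProjectSetup',
  children: [
    { type: 'text', value: 'Instructions for setup:' },
    {
      type: 'tag',
      name: 'Code',
      attributes: { language: 'bash' },
      children: [{ type: 'text', value: 'npm install' }],
    },
  ],
};

const xml = renderNode(tree);
```

For the tree above, xml opens with a ProjectSetup tag, then the instruction text, then a Code tag carrying language="bash" that wraps npm install. Multi-line text values only get their first line indented in this sketch.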

Section Processing

Each section of a document is represented as a Section object that records its title, nesting level, content, and metadata:

interface Section {
  id: string;
  title: string;
  level: number;
  frontmatter?: string;
  content: Node[];
  metadata: {
    hasDefinitionLists: boolean;
    hasTaskLists: boolean;
    hasFootnotes: boolean;
    references: {
      links: Map<string, string>;
      footnotes: Map<string, string>;
    }
  };
  parent?: Section;
  children: Section[];
}
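
Since sections form a tree through children, targeting one portion of a document amounts to a tree search. The sketch below is a crude stand-in for fuzzy section matching: it compares titles after lowercasing and collapsing punctuation, and accepts a section whose normalized title contains the normalized query. SectionLike, normalize, and findSection are hypothetical names, and the library's real matching algorithm is not described here and may well differ.

```typescript
// Trimmed-down local version of Section with only the fields used here.
interface SectionLike {
  title: string;
  level: number;
  children: SectionLike[];
}

// Lowercase and collapse every run of non-alphanumerics into one space.
const normalize = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();

// Hypothetical lookup: depth-first search, first loose title match wins.
function findSection(
  sections: SectionLike[],
  query: string
): SectionLike | undefined {
  const wanted = normalize(query);
  if (wanted === '') return undefined; // an empty query matches nothing

  for (const section of sections) {
    if (normalize(section.title).includes(wanted)) return section;
    const nested = findSection(section.children, query);
    if (nested) return nested;
  }
  return undefined;
}

const sections: SectionLike[] = [
  {
    title: 'System Context',
    level: 1,
    children: [{ title: 'Project Setup', level: 2, children: [] }],
  },
];

const match = findSection(sections, 'project-setup');
```

Here match is the nested 'Project Setup' section, because 'project-setup' and 'Project Setup' normalize to the same string.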

Error Handling

Errors encountered during parsing are reported on the ParseResult, alongside the AST:

interface ParseResult {
  ast: DocumentNode;
  errors?: Error[];
}
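
Because errors is optional, callers should handle its absence before reading it. The sketch below shows one way to do that, using a local stand-in type; summarizeErrors and the sample error message are hypothetical. Whether mdToLlm can also throw is not stated here, so wrapping the call in try/catch may still be sensible.

```typescript
// Local stand-in for ParseResult; the AST type is irrelevant to this sketch.
interface ParseResultLike {
  ast: unknown;
  errors?: Error[];
}

// Hypothetical helper: turns the optional error list into numbered messages.
function summarizeErrors(result: ParseResultLike): string[] {
  return (result.errors ?? []).map((e, i) => `${i + 1}. ${e.message}`);
}

// A result without errors yields an empty summary.
const clean = summarizeErrors({ ast: {} });

// A made-up error, purely for illustration.
const failed = summarizeErrors({
  ast: {},
  errors: [new Error('Unclosed code fence')],
});
```

In real use, the argument would be the value resolved by mdToLlm, and a non-empty summary could be logged or used to abort the conversion.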

Contributing

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

License

This project is licensed under the Meld License - see the LICENSE file for details.

0.1.0