0.1.0 • Published 6 months ago

md-llm

A Markdown-to-llmxml (.llm) bridge that transforms Markdown documents into structured AST nodes optimized for LLM processing.

What's llmxml? Why?

  1. LLMs <3 XML. But it doesn't have to be strict, insanely nested XML. It can be mostly-flat pseudo-XML.
  2. Humans <3 Markdown.
  3. Markdown sections can form nested hierarchies through # headers.
  4. md-llm takes Markdown documents and breaks them into nested nodes that can be output as XML.
  5. Once a document is converted to nested nodes, you can also target specific portions of it, which gives you importable Markdown 'modules' in meld.

This library is consumed by the oneshot and meld CLI tools. If you like what it does, you probably want those.

Features

  • Bidirectional conversion between Markdown and LLM-optimized formats
  • Semantic section detection and processing
  • Rich support for Markdown elements:
    • Headers with customizable depth
    • Code blocks with language and metadata
    • Lists (ordered, unordered, and task lists)
    • Tables with alignment and formatting
    • Blockquotes and thematic breaks
    • HTML content preservation
    • Frontmatter processing
    • Definition lists
    • References and footnotes
  • Extensible transform pipeline
  • Fuzzy section matching
  • High performance and memory efficient
  • Customizable tag name generation
  • Modular architecture for custom transforms

Installation

npm install md-llm

Quick Start

import { mdToLlm } from 'md-llm';

const markdown = `
# System Context
Some context here...

## Project Setup
Instructions for setup:
\`\`\`bash
npm install
\`\`\`
`;

const result = await mdToLlm(markdown);
console.log(result.ast);

Output:

{
  type: 'tag',
  name: 'Document',
  children: [
    {
      type: 'tag',
      name: 'SystemContext',
      children: [
        { type: 'text', value: 'Some context here...' },
        {
          type: 'tag',
          name: 'ProjectSetup',
          children: [
            { type: 'text', value: 'Instructions for setup:' },
            {
              type: 'tag',
              name: 'Code',
              attributes: { language: 'bash' },
              children: [{ type: 'text', value: 'npm install' }]
            }
          ]
        }
      ]
    }
  ]
}

API Reference

Core Function

async function mdToLlm(
  markdown: string,
  options?: MdToLlmOptions
): Promise<ParseResult>

Options

interface MdToLlmOptions {
  headerDepth?: number;  // 1-6, default 2
  tagNameMap?: Record<string, string>;  // Custom header->tag mappings
  preserveHeaderText?: boolean;  // Keep original text as first line?
  customTransforms?: MdToLlmTransform[];  // Add custom transforms
}
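
To make the header-to-tag step concrete, here is a minimal sketch of how a header such as "System Context" could become the tag name SystemContext, with tagNameMap taking precedence. The helper toTagName is hypothetical and is not part of md-llm's documented API; the PascalCase rule is inferred from the Quick Start output and the library's actual rules may differ.

```typescript
// Hypothetical helper: illustrates tag name generation, not md-llm's real code.
function toTagName(
  headerText: string,
  tagNameMap: Record<string, string> = {}
): string {
  // An explicit mapping wins over the generated name.
  if (Object.prototype.hasOwnProperty.call(tagNameMap, headerText)) {
    return tagNameMap[headerText] as string;
  }

  // Split on anything that is not a letter or digit, then PascalCase the words.
  return headerText
    .split(/[^A-Za-z0-9]+/)
    .filter((word) => word.length > 0)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join('');
}

console.log(toTagName('System Context')); // SystemContext
console.log(toTagName('Project Setup', { 'Project Setup': 'Setup' })); // Setup
```

The mapping is keyed by the exact header text here; whether the real option matches exactly or loosely is not stated in this README.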

Transform Pipeline

The library uses a modular transform pipeline that processes different Markdown elements:

  • HeaderTransform: Converts headers to semantic tags
  • CodeFenceTransform: Processes code blocks with language and metadata
  • ListTransform: Handles ordered and unordered lists
  • TableTransform: Processes tables with alignment
  • BlockquoteTransform: Handles blockquotes
  • ThematicBreakTransform: Processes horizontal rules
  • FrontmatterTransform: Extracts YAML frontmatter
  • DefinitionTransform: Processes definition lists
  • ReferenceTransform: Handles link references and footnotes
  • TaskListTransform: Processes task lists
  • HtmlTransform: Preserves HTML content
  • SectionTransform: Handles section boundaries and hierarchy

Custom Transforms

You can create custom transforms by implementing the MdToLlmTransform interface:

interface MdToLlmTransform {
  transform(node: MdastNode): LlmAstNode;
  canTransform(node: MdastNode): boolean;
}

Example custom transform:

class CustomTransform implements MdToLlmTransform {
  canTransform(node: MdastNode): boolean {
    return node.type === 'customType';
  }

  transform(node: MdastNode): LlmAstNode {
    return {
      type: 'tag',
      name: 'CustomTag',
      attributes: {},
      children: []
    };
  }
}

// Register the transform through the options object
const options = {
  customTransforms: [new CustomTransform()]
};

const result = await mdToLlm(markdown, options);

Node Types

The library uses two main node types:

interface TagNode {
  type: 'tag';
  name: string;
  attributes?: Record<string, string>;
  children: (TagNode | TextNode)[];
}

interface TextNode {
  type: 'text';
  value: string;
}
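
Because the tree has only these two node shapes, turning it into pseudo-XML is a short recursive walk. The sketch below shows one possible renderer; renderXml and escapeXml are hypothetical helpers written for this example, and the library's own output format (indentation, escaping, self-closing tags) may differ.

```typescript
interface TextNode {
  type: 'text';
  value: string;
}

interface TagNode {
  type: 'tag';
  name: string;
  attributes?: Record<string, string>;
  children: (TagNode | TextNode)[];
}

// Escape the characters that would otherwise be read as markup.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical renderer: walks the tree depth-first and indents by nesting level.
function renderXml(node: TagNode | TextNode, depth = 0): string {
  const indent = '  '.repeat(depth);
  if (node.type === 'text') return indent + escapeXml(node.value);

  const attrs = Object.entries(node.attributes ?? {})
    .map(([key, value]) => ` ${key}="${escapeXml(value)}"`)
    .join('');
  if (node.children.length === 0) return `${indent}<${node.name}${attrs} />`;

  const body = node.children.map((child) => renderXml(child, depth + 1));
  return [`${indent}<${node.name}${attrs}>`, ...body, `${indent}</${node.name}>`].join('\n');
}

const code: TagNode = {
  type: 'tag',
  name: 'Code',
  attributes: { language: 'bash' },
  children: [{ type: 'text', value: 'npm install' }],
};

console.log(renderXml(code));
// <Code language="bash">
//   npm install
// </Code>
```

The indentation is purely for readability. A renderer meant for code blocks would likely emit text children verbatim, since added whitespace changes their content.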

Section Processing

Each section records its title, heading level, content, metadata, and position in the section hierarchy:

interface Section {
  id: string;
  title: string;
  level: number;
  frontmatter?: string;
  content: Node[];
  metadata: {
    hasDefinitionLists: boolean;
    hasTaskLists: boolean;
    hasFootnotes: boolean;
    references: {
      links: Map<string, string>;
      footnotes: Map<string, string>;
    }
  };
  parent?: Section;
  children: Section[];
}
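
The parent and children fields imply a tree built from heading levels. A common way to build such a tree from a flat list of headings is a stack of currently open sections, sketched below. SectionSketch is a trimmed-down stand-in for the Section interface, and buildSectionTree is a hypothetical helper; nothing here is confirmed to match md-llm's internals.

```typescript
// Trimmed-down version of the Section shape; only the fields needed here.
interface SectionSketch {
  title: string;
  level: number;
  parent?: SectionSketch;
  children: SectionSketch[];
}

// Hypothetical helper: nest a flat list of headings by level using a stack.
function buildSectionTree(
  headings: { title: string; level: number }[]
): SectionSketch[] {
  const roots: SectionSketch[] = [];
  const stack: SectionSketch[] = []; // open sections, outermost first

  for (const { title, level } of headings) {
    // Close every open section at the same or a deeper level.
    let parent = stack[stack.length - 1];
    while (parent !== undefined && parent.level >= level) {
      stack.pop();
      parent = stack[stack.length - 1];
    }

    const section: SectionSketch = { title, level, children: [] };
    if (parent !== undefined) {
      section.parent = parent;
      parent.children.push(section);
    } else {
      roots.push(section);
    }
    stack.push(section);
  }
  return roots;
}

const tree = buildSectionTree([
  { title: 'System Context', level: 1 },
  { title: 'Project Setup', level: 2 },
  { title: 'Usage', level: 2 },
  { title: 'Appendix', level: 1 },
]);

for (const root of tree) {
  console.log(root.title, root.children.map((child) => child.title));
}
```

With this input, "Project Setup" and "Usage" both end up as children of "System Context", and "Appendix" starts a new top-level section.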

Error Handling

Error information is returned in the optional errors field of the ParseResult:

interface ParseResult {
  ast: DocumentNode;
  errors?: Error[];
}
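
Since errors is optional, callers probably need to check it before trusting the AST. The sketch below shows one way to do that; unwrapAst and ParseResultLike are hypothetical names introduced for this example, and it assumes that a missing or empty errors array means success, which this README does not state explicitly.

```typescript
// Stand-in for ParseResult; the AST type is left generic on purpose.
interface ParseResultLike<Ast> {
  ast: Ast;
  errors?: Error[];
}

// Hypothetical helper: return the AST, or throw one error summarizing all failures.
function unwrapAst<Ast>(result: ParseResultLike<Ast>): Ast {
  const errors = result.errors ?? [];
  if (errors.length > 0) {
    const summary = errors.map((e) => e.message).join('; ');
    throw new Error(`md-llm reported ${errors.length} error(s): ${summary}`);
  }
  return result.ast;
}

// Intended usage with the real library (not executed here):
//   const ast = unwrapAst(await mdToLlm(markdown));
console.log(unwrapAst({ ast: { type: 'tag', name: 'Document', children: [] } }));
```

Throwing is only one policy. A caller that can tolerate partial results could log the errors and continue with result.ast instead.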

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the Meld License - see the LICENSE file for details.