OhMyReader

A web content extraction library that handles multiple formats and encodings and converts web pages into structured Markdown.

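Features

🚀 Automatically selects the best extractor
📝 Converts content to clean Markdown
🔍 Extracts metadata intelligently (title, author, date)
🌐 Supports multiple encodings (UTF-8, GBK, and others)
⚡ Asynchronous operation with support for concurrency
🛡️ Built-in content validation
🔄 Automatic retries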

Installation

npm install ohmyreader
# or
bun add ohmyreader

Basic usage

import { extractContent, Reader } from 'ohmyreader';

// Simple usage
const result = await extractContent('https://example.com/article');

// Or use the chained API
const chainedResult = await new Reader()
  .from('https://example.com/article')
  .extract();
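
Because extraction is asynchronous, several pages can be processed at once. The following is a minimal sketch rather than part of the library: `extractAll`, `PageResult`, `Extractor`, and `BatchOutcome` are hypothetical names, and the extractor is passed in as a parameter so the helper does not assume anything about ohmyreader beyond "a function that takes a URL and returns a promise".

```typescript
// Minimal shape of the fields used here; the library's result type has more.
interface PageResult {
  title: string;
  url: string;
}

// Any async function mapping a URL to a result (hypothetical type alias).
type Extractor = (url: string) => Promise<PageResult>;

interface BatchOutcome {
  succeeded: PageResult[];
  failed: { url: string; reason: string }[];
}

// Starts every extraction concurrently and separates successes from failures,
// so one bad URL does not reject the whole batch.
async function extractAll(urls: string[], extract: Extractor): Promise<BatchOutcome> {
  const attempts = await Promise.all(
    urls.map(async (url) => {
      try {
        return { ok: true as const, value: await extract(url) };
      } catch (error) {
        const reason = error instanceof Error ? error.message : String(error);
        return { ok: false as const, url, reason };
      }
    })
  );

  const outcome: BatchOutcome = { succeeded: [], failed: [] };
  for (const attempt of attempts) {
    if (attempt.ok) {
      outcome.succeeded.push(attempt.value);
    } else {
      outcome.failed.push({ url: attempt.url, reason: attempt.reason });
    }
  }
  return outcome;
}
```

With the library installed, a call might look like `extractAll(urls, (url) => extractContent(url))`.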

Advanced usage

Chained configuration

import { Reader } from 'ohmyreader';

const result = await new Reader()
  .from('https://example.com/article')
  .withValidation({
    enabled: true,
    minLength: 100,
    maxNavRatio: 0.3
  })
  .withRequestOptions({
    timeout: 5000,
    retries: 3,
    headers: {
      'User-Agent': 'Custom User Agent'
    }
  })
  .withExtractOptions({
    preferParser: true,
    includeComments: true,
    includeMeta: true
  })
  .extract();

Options object style

import { extractContent } from 'ohmyreader';

const result = await extractContent('https://example.com/article', {
  validate: {
    enabled: true,
    minLength: 100,
    maxNavRatio: 0.3
  },
  request: {
    timeout: 5000,
    retries: 3,
    headers: {
      'User-Agent': 'Custom User Agent'
    }
  },
  extract: {
    preferParser: true,
    includeComments: true,
    includeMeta: true
  }
});

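Configuration options

interface ExtractOptions {
  // Content validation options
  validate?: {
    enabled?: boolean;        // Whether to enable content validation
    minLength?: number;       // Minimum content length
    maxNavRatio?: number;     // Maximum ratio of navigation content
  };

  // Network request options
  request?: {
    timeout?: number;         // Timeout in milliseconds
    retries?: number;         // Number of retries
    headers?: HeadersInit;    // Custom request headers
  };

  // Extraction options
  extract?: {
    includeComments?: boolean;  // Whether to include comments
    preferParser?: boolean;     // Whether to prefer Parser
    includeMeta?: boolean;      // Whether to include metadata
  };
}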
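
When many calls share the same settings, it can help to keep one defaults object and override individual fields per call. The sketch below is an illustration, not library code: `Options` is a local copy of the shape documented above (with `headers` narrowed to a plain record so it compiles without DOM typings), and `mergeOptions` is a hypothetical helper.

```typescript
// Local copy of the documented option shape; headers is simplified to a record.
interface Options {
  validate?: { enabled?: boolean; minLength?: number; maxNavRatio?: number };
  request?: { timeout?: number; retries?: number; headers?: Record<string, string> };
  extract?: { includeComments?: boolean; preferParser?: boolean; includeMeta?: boolean };
}

// Merges per-call overrides onto shared defaults one option group at a time,
// so overriding a single field does not discard the rest of its group.
function mergeOptions(defaults: Options, overrides: Options = {}): Options {
  return {
    validate: { ...defaults.validate, ...overrides.validate },
    request: {
      ...defaults.request,
      ...overrides.request,
      // Headers are merged key by key rather than replaced wholesale.
      headers: { ...defaults.request?.headers, ...overrides.request?.headers },
    },
    extract: { ...defaults.extract, ...overrides.extract },
  };
}

// Example values taken from this README; they are not the library's defaults.
const shared: Options = {
  request: { timeout: 5000, retries: 3, headers: { "User-Agent": "Custom User Agent" } },
};
const quick = mergeOptions(shared, { request: { timeout: 1000 } });
```

Here `quick` keeps the shared retry count and headers and changes only the timeout.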

Result type

interface ExtractResult {
  title: string;          // Article title
  author: string;         // Author
  publishDate: string;    // Publication date
  markdown: string;       // Body text as Markdown
  excerpt: string;        // Excerpt
  leadImageUrl?: string | null;  // Lead image URL
  domain?: string;        // Domain
  wordCount?: number;     // Word count
  url: string;            // Original URL
}
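
One way to use these fields is to save an article as a Markdown file with a metadata header. The following sketch assumes only the required fields listed above; `Article` is a local subset of the result type and `toDocument` is a hypothetical helper, not something the library provides.

```typescript
// Local subset of the documented result type.
interface Article {
  title: string;
  author: string;
  publishDate: string;
  markdown: string;
  url: string;
}

// Builds a Markdown document with a YAML-style front matter header.
// Values are JSON-quoted so titles containing quotes or colons stay intact.
function toDocument(article: Article): string {
  const quote = (value: string) => JSON.stringify(value);
  const lines = [
    "---",
    `title: ${quote(article.title)}`,
    `author: ${quote(article.author)}`,
    `date: ${quote(article.publishDate)}`,
    `source: ${quote(article.url)}`,
    "---",
    "",
    article.markdown,
  ];
  return lines.join("\n");
}
```

The returned string can be written to disk as-is, for example with `fs.writeFile` in Node.js.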

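Error handling

import { Reader } from 'ohmyreader';

try {
  const result = await new Reader()
    .from('https://example.com/article')
    .extract();
} catch (error) {
  // Messages are matched verbatim, so the Chinese strings must stay as they are.
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('URL未设置')) {
    // Handle a missing URL
  } else if (message.includes('提取失败')) {
    // Handle an extraction failure
  } else if (message.includes('网络错误')) {
    // Handle a network error
  }
}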
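
Matching on message strings in several places gets repetitive, so the checks can be collected in one function. This is a sketch under the assumption that the three message fragments shown above are the only ones that matter; `ExtractFailure` and `classifyError` are hypothetical names.

```typescript
// Hypothetical set of failure categories derived from the messages above.
type ExtractFailure = "missing-url" | "extraction" | "network" | "unknown";

// Maps a caught value to a failure category by inspecting its message.
// Non-Error values are stringified so the function never throws itself.
function classifyError(error: unknown): ExtractFailure {
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes("URL未设置")) return "missing-url";
  if (message.includes("提取失败")) return "extraction";
  if (message.includes("网络错误")) return "network";
  return "unknown";
}
```

A `catch` block can then use `switch (classifyError(error))` and decide, for instance, to retry only the `"network"` case.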

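Notes

Some websites require login or use anti-scraping measures
Leave a reasonable interval between requests
Follow the terms of use of the target website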
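
The advice on request intervals can be sketched as a small sequential runner. Nothing here comes from the library: `sleep` and `extractPolitely` are hypothetical helpers, the extractor is supplied by the caller, and the one-second default delay is an arbitrary choice that should be adjusted to the target site.

```typescript
// Resolves after the given number of milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Processes URLs one at a time, pausing between requests (not before the first),
// and returns the results in input order.
async function extractPolitely<T>(
  urls: string[],
  extract: (url: string) => Promise<T>,
  delayMs = 1000
): Promise<T[]> {
  const results: T[] = [];
  let first = true;
  for (const url of urls) {
    if (!first) {
      await sleep(delayMs);
    }
    first = false;
    results.push(await extract(url));
  }
  return results;
}
```

Unlike a concurrent batch, this trades speed for a predictable request rate against a single host.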

Contributing

Issues and pull requests are welcome!

License

MIT

