@darkbing/knowledge-retrieval v1.0.2

Knowledge Retrieval

A powerful web crawler and knowledge processing toolkit for extracting and managing web content. This package provides an interactive CLI for crawling websites, processing content, and managing knowledge bases.

Installation

npm install @darkbing/knowledge-retrieval

Features

  • Web Crawler: Configurable depth and page limits
  • Content Processing: Markdown and JSON output formats
  • AI Integration: Compatible with a range of LLMs, such as Ollama models
  • Settings Management: Interactive configuration
  • Customizable Headers: Support for authentication and custom requests
  • Organized Storage: Structured data storage with raw and processed content

Quick Start

  1. Install the package globally (optional):

npm install -g @darkbing/knowledge-retrieval

  2. Create a .env file with your configuration:

BASE_STORAGE_PATH=./knowledge_retrieval
MODEL_NAME=ollama/mistral

  3. Start the interactive CLI:

npx knowledge-retrieval
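
The same .env values can also be wired up in code. A minimal sketch, assuming you load the file with the dotenv package (not bundled with @darkbing/knowledge-retrieval) and map the variables onto the settings keys documented under Configuration:

import 'dotenv/config';
import { SettingsManager } from '@darkbing/knowledge-retrieval';

// Map the .env variables onto the documented settings keys.
const settings = new SettingsManager();
await settings.updateSettings({
  baseStoragePath: process.env.BASE_STORAGE_PATH ?? './knowledge_retrieval',
  modelName: process.env.MODEL_NAME ?? 'ollama/mistral'
});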

Interactive CLI

The CLI provides several options:

  • Crawl Website: Start a new crawling session
  • Process Data: Convert raw data to structured format
  • Cleanup: Remove temporary files
  • Settings: Manage configuration

Settings Management

Configure your crawler through the interactive settings menu:

  1. Crawler Settings

    • Max crawl depth
    • Max pages to crawl
    • Request timeout
    • User agent
  2. Storage Settings

    • Base storage path
    • Raw data directory
    • Processed data directory
  3. Processing Settings

    • Default processing mode (markdown/json)
    • Model name for AI processing
  4. Custom Headers

    • Add/remove custom HTTP headers
    • Support for authentication tokens
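
All four groups can also be set programmatically. A hedged sketch: the keys for groups 1–3 come from the Default Settings below, while the customHeaders key and its shape are an assumption, since headers do not appear there:

import { SettingsManager } from '@darkbing/knowledge-retrieval';

const settings = new SettingsManager();
await settings.updateSettings({
  // Crawler settings (keys documented under Default Settings)
  maxCrawlDepth: 5,
  maxCrawlPages: 100,
  requestTimeout: 10000,
  // Processing settings
  defaultProcessingMode: 'json',
  modelName: 'ollama/mistral',
  // Assumption: custom headers are passed as a plain key/value map
  customHeaders: {
    Authorization: 'Bearer <your-token>'
  }
});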

Programmatic Usage

import { KnowledgeRetrieval, SettingsManager } from '@darkbing/knowledge-retrieval';

// Initialize with custom settings
const settings = new SettingsManager();
await settings.updateSettings({
  maxCrawlDepth: 3,
  maxCrawlPages: 50,
  modelName: 'ollama/mistral'
});

// Create crawler instance
const crawler = new KnowledgeRetrieval(settings);

// Start crawling
await crawler.crawl('https://example.com');

// Process crawled data
await crawler.processData();
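
Crawls can fail partway through (timeouts, unreachable pages), so in practice you may want basic error handling around the same calls; a minimal sketch:

try {
  // crawl() must finish before processData() has raw data to work on
  await crawler.crawl('https://example.com');
  await crawler.processData();
} catch (err) {
  console.error('Crawl or processing failed:', err);
}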

Configuration

Default Settings

{
  "maxCrawlDepth": 3,
  "maxCrawlPages": 50,
  "requestTimeout": 5000,
  "userAgent": "KnowledgeRetrievalBot/1.0",
  "baseStoragePath": "./knowledge_retrieval",
  "rawDataDir": "raw_data",
  "processedDataDir": "processed_data",
  "defaultProcessingMode": "markdown",
  "modelName": "ollama/mistral"
}

Settings are stored in know-bot.json and can be managed through the interactive CLI or programmatically.
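
Because know-bot.json is plain JSON, you can also inspect it directly. A small sketch, assuming the file sits in the directory the CLI was run from:

import { readFileSync } from 'node:fs';

// Assumption: know-bot.json is in the current working directory.
const saved = JSON.parse(readFileSync('know-bot.json', 'utf8'));
console.log(saved.maxCrawlDepth); // e.g. 3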

Development

  1. Clone the repository:

git clone https://github.com/SnapsPH/know.git
cd know

  2. Install dependencies:

npm install

  3. Build the project:

npm run build

  4. Run tests:

npm test

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Changelog

1.0.0

  • Initial release
  • Interactive CLI with settings management
  • Web crawler with configurable depth and limits
  • Content processing with markdown and JSON support
  • AI integration with model selection
  • Custom headers support
  • Comprehensive test coverage