@darkbing/knowledge-retrieval v1.0.2

Knowledge Retrieval

A powerful web crawler and knowledge processing toolkit for extracting and managing web content. This package provides an interactive CLI for crawling websites, processing content, and managing knowledge bases.

Installation

npm install @darkbing/knowledge-retrieval

Features

  • Web Crawler: Configurable depth and page limits
  • Content Processing: Markdown and JSON output formats
  • AI Integration: Compatible with a range of LLMs, such as Ollama models
  • Settings Management: Interactive configuration
  • Customizable Headers: Support for authentication and custom requests
  • Organized Storage: Structured data storage with raw and processed content

Quick Start

  1. Install the package globally (optional):

npm install -g @darkbing/knowledge-retrieval

  2. Create a .env file with your configuration:

BASE_STORAGE_PATH=./knowledge_retrieval
MODEL_NAME=ollama/mistral

  3. Start the interactive CLI:

npx knowledge-retrieval
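
The same .env values can also be wired up in code. A minimal sketch, assuming you load the file with the dotenv package (not bundled with @darkbing/knowledge-retrieval) and map the variables onto the settings keys documented under Configuration:

import 'dotenv/config';
import { SettingsManager } from '@darkbing/knowledge-retrieval';

// Map the .env variables onto the documented settings keys.
const settings = new SettingsManager();
await settings.updateSettings({
  baseStoragePath: process.env.BASE_STORAGE_PATH ?? './knowledge_retrieval',
  modelName: process.env.MODEL_NAME ?? 'ollama/mistral'
});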

Interactive CLI

The CLI provides several options:

  • Crawl Website: Start a new crawling session
  • Process Data: Convert raw data to structured format
  • Cleanup: Remove temporary files
  • Settings: Manage configuration

Settings Management

Configure your crawler through the interactive settings menu:

  1. Crawler Settings

    • Max crawl depth
    • Max pages to crawl
    • Request timeout
    • User agent
  2. Storage Settings

    • Base storage path
    • Raw data directory
    • Processed data directory
  3. Processing Settings

    • Default processing mode (markdown/json)
    • Model name for AI processing
  4. Custom Headers

    • Add/remove custom HTTP headers
    • Support for authentication tokens
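
All four groups can also be set programmatically. A hedged sketch: the keys for groups 1–3 come from the Default Settings below, while the customHeaders key and its shape are an assumption, since headers do not appear there:

import { SettingsManager } from '@darkbing/knowledge-retrieval';

const settings = new SettingsManager();
await settings.updateSettings({
  // Crawler settings (keys documented under Default Settings)
  maxCrawlDepth: 5,
  maxCrawlPages: 100,
  requestTimeout: 10000,
  // Processing settings
  defaultProcessingMode: 'json',
  modelName: 'ollama/mistral',
  // Assumption: custom headers are passed as a plain key/value map
  customHeaders: {
    Authorization: 'Bearer <your-token>'
  }
});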

Programmatic Usage

import { KnowledgeRetrieval, SettingsManager } from '@darkbing/knowledge-retrieval';

// Initialize with custom settings
const settings = new SettingsManager();
await settings.updateSettings({
  maxCrawlDepth: 3,
  maxCrawlPages: 50,
  modelName: 'ollama/mistral'
});

// Create crawler instance
const crawler = new KnowledgeRetrieval(settings);

// Start crawling
await crawler.crawl('https://example.com');

// Process crawled data
await crawler.processData();
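
Crawls can fail partway through (timeouts, unreachable pages), so in practice you may want basic error handling around the same calls; a minimal sketch:

try {
  // crawl() must finish before processData() has raw data to work on
  await crawler.crawl('https://example.com');
  await crawler.processData();
} catch (err) {
  console.error('Crawl or processing failed:', err);
}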

Configuration

Default Settings

{
  "maxCrawlDepth": 3,
  "maxCrawlPages": 50,
  "requestTimeout": 5000,
  "userAgent": "KnowledgeRetrievalBot/1.0",
  "baseStoragePath": "./knowledge_retrieval",
  "rawDataDir": "raw_data",
  "processedDataDir": "processed_data",
  "defaultProcessingMode": "markdown",
  "modelName": "ollama/mistral"
}

Settings are stored in know-bot.json and can be managed through the interactive CLI or programmatically.
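
Because know-bot.json is plain JSON, you can also inspect it directly. A small sketch, assuming the file sits in the directory the CLI was run from:

import { readFileSync } from 'node:fs';

// Assumption: know-bot.json is in the current working directory.
const saved = JSON.parse(readFileSync('know-bot.json', 'utf8'));
console.log(saved.maxCrawlDepth); // e.g. 3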

Development

  1. Clone the repository:

git clone https://github.com/SnapsPH/know.git
cd know

  2. Install dependencies:

npm install

  3. Build the project:

npm run build

  4. Run tests:

npm test

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Changelog

1.0.0

  • Initial release
  • Interactive CLI with settings management
  • Web crawler with configurable depth and limits
  • Content processing with markdown and JSON support
  • AI integration with model selection
  • Custom headers support
  • Comprehensive test coverage