0.0.2 • Published 8 months ago
@lg-tools/crawler v0.0.2
LeafyGreen Crawler Tool
A CLI tool for crawling and analyzing website content for LeafyGreen AI.
Overview
This tool crawls websites and stores the content in MongoDB collections for use with LeafyGreen AI systems. The crawler can either process a single URL passed on the command line or use the pre-configured website sources.
Prerequisites
- Node.js (v16 or higher)
- Yarn package manager
- MongoDB Atlas account with connection details
- Environment variables properly configured
Installation
# From the root of the leafygreen-ui-private repository
cd tools/crawler
yarn install
Configuration
Create a .env file in the tools/crawler directory with the following variables:
MONGODB_USER=your_mongodb_user
MONGODB_PASSWORD=your_mongodb_password
MONGODB_PROJECT_URL=your_project_url
MONGODB_APP_NAME=your_app_name
Default Sources
The crawler comes with pre-configured sources in src/constants.ts:
- MongoDB Design (https://mongodb.design)
- React Documentation (https://react.dev)
- MDN Web Docs (https://developer.mozilla.org)
To add or modify sources, edit the SOURCES array in src/constants.ts.
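The actual shape of the SOURCES array is not shown in this README. As a rough sketch only, assuming each entry carries a display name and a start URL (the `CrawlSource` interface and its `name` and `url` fields are hypothetical), the default sources listed above might be declared like this:

```typescript
// Hypothetical entry shape; check src/constants.ts for the real definition.
interface CrawlSource {
  name: string; // human-readable label
  url: string;  // page where crawling starts
}

// The three default sources named in this README.
const SOURCES: CrawlSource[] = [
  { name: "MongoDB Design", url: "https://mongodb.design" },
  { name: "React Documentation", url: "https://react.dev" },
  { name: "MDN Web Docs", url: "https://developer.mozilla.org" },
];

// Under this assumed shape, adding a source means appending one more
// { name, url } object to the array above.
```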
Usage
Building the Tool
yarn build
Basic Usage
# Use the built version
yarn lg-crawler
# Or use the development version
yarn crawl
Command Line Options
- -v, --verbose: Enable verbose output
- -d, --depth <number>: Set maximum crawl depth (default: 3)
- --url <url>: Specify a single URL to crawl
- --dry-run: Run crawler without inserting documents into MongoDB
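To make the option semantics concrete, here is a hand-rolled sketch of how these four flags could be parsed into a typed object. It is not the tool's real parser, which this README does not describe; `CrawlOptions` and `parseCrawlArgs` are hypothetical names, and only the flags and the default depth of 3 come from the list above.

```typescript
// Hypothetical options object mirroring the documented flags.
interface CrawlOptions {
  verbose: boolean;
  depth: number;
  url?: string;
  dryRun: boolean;
}

// Parses an argument list such as process.argv.slice(2).
function parseCrawlArgs(argv: string[]): CrawlOptions {
  const options: CrawlOptions = { verbose: false, depth: 3, dryRun: false };

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "-v":
      case "--verbose":
        options.verbose = true;
        break;
      case "--dry-run":
        options.dryRun = true;
        break;
      case "-d":
      case "--depth": {
        // Consume the next argument as the depth value.
        const value = Number(argv[++i]);
        if (!Number.isInteger(value) || value < 0) {
          throw new Error("--depth expects a non-negative integer");
        }
        options.depth = value;
        break;
      }
      case "--url": {
        // Consume the next argument as the URL to crawl.
        const value = argv[++i];
        if (value === undefined) {
          throw new Error("--url expects a value");
        }
        options.url = value;
        break;
      }
      default:
        throw new Error(`Unknown option: ${arg}`);
    }
  }
  return options;
}
```

With this sketch, `parseCrawlArgs(["--url", "https://example.com", "--depth", "2"])` yields a depth of 2 and leaves `verbose` and `dryRun` at `false`.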
Examples
# Crawl all pre-configured sources with verbose output
yarn crawl --verbose
# Crawl a specific URL with a depth of 2
yarn crawl --url https://example.com --depth 2
# Test crawling without saving to MongoDB
yarn crawl --dry-run --verbose
Development
Project Structure
- src/index.ts: Main entry point and command-line interface
- src/crawler.ts: Core crawler implementation
- src/constants.ts: Configuration constants and source definitions
- src/utils/: Helper utilities for crawling and data processing
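The core crawler itself is not reproduced in this README. As an illustration of what a maximum crawl depth generally means, the sketch below performs a breadth-first traversal that stops following links after a given number of hops. `crawl` and `FetchLinks` are hypothetical names, and the convention that the start page sits at depth 0 is an assumption; src/crawler.ts may count depth differently.

```typescript
// Hypothetical link extractor: resolves to the URLs a page links to.
type FetchLinks = (url: string) => Promise<string[]>;

// Breadth-first crawl that follows links at most maxDepth hops
// away from startUrl. Returns every URL discovered, in visit order.
async function crawl(
  startUrl: string,
  maxDepth: number,
  fetchLinks: FetchLinks
): Promise<string[]> {
  const visited = new Set<string>([startUrl]);
  let frontier: string[] = [startUrl];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      const links = await fetchLinks(url);
      for (const link of links) {
        // Skip pages already seen so cycles do not loop forever.
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return Array.from(visited);
}
```

Passing the link extractor in as a parameter keeps the traversal testable without network access, for example by backing it with an in-memory map of pages to links.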
Adding New Features
- Make your code changes
- Build the project:
  yarn build
- Test your changes:
  yarn crawl --dry-run --verbose
Running Tests
yarn test
Troubleshooting
- MongoDB Connection Issues: Verify your .env file has the correct credentials
- Crawling Errors: Use the --verbose flag to get detailed logs
- Rate Limiting: Some websites may block the crawler if too many requests are made
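When chasing MongoDB connection issues, a quick first step is to confirm that all four variables from the Configuration section are actually set. The helper below is a hypothetical sketch for that check and is not part of the tool; `findMissingEnvVars` is an assumed name, while the variable names are taken from this README.

```typescript
// Variable names documented in the Configuration section.
const REQUIRED_ENV_VARS = [
  "MONGODB_USER",
  "MONGODB_PASSWORD",
  "MONGODB_PROJECT_URL",
  "MONGODB_APP_NAME",
] as const;

// Returns the names of required variables that are unset or blank.
function findMissingEnvVars(
  env: Record<string, string | undefined>
): string[] {
  return REQUIRED_ENV_VARS.filter(
    (name) => (env[name] ?? "").trim() === ""
  );
}

// Typical use in Node.js, after the .env file has been loaded:
//   const missing = findMissingEnvVars(process.env);
//   if (missing.length > 0) {
//     console.error(`Missing variables: ${missing.join(", ")}`);
//   }
```

This only verifies that the values are present; wrong credentials will still surface as an authentication error from MongoDB.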
License
Apache-2.0