enterprise-ai-recursive-web-scraper v1.0.7

✨ Features

  • 🚀 High Performance:
    • Blazing fast multi-threaded scraping with concurrent processing
    • Smart rate limiting to prevent API throttling and server overload
    • Automatic request queuing and retry mechanisms
  • 🤖 AI-Powered: Intelligent content extraction using Groq LLMs
  • 🌐 Multi-Browser: Support for Chromium, Firefox, and WebKit
  • 📊 Smart Extraction:
    • Structured data extraction without LLMs using CSS selectors
    • Topic-based and semantic chunking strategies
    • Cosine similarity clustering for content deduplication
  • 🎯 Advanced Capabilities:
    • Recursive domain crawling with boundary respect
    • Intelligent rate limiting with token bucket algorithm
    • Session management for complex multi-page flows
    • Custom JavaScript execution support
    • Enhanced screenshot capture with lazy-load detection
    • iframe content extraction
  • 🔒 Enterprise Ready:
    • Proxy support with authentication
    • Custom headers and user-agent configuration
    • Comprehensive error handling and retry mechanisms
    • Flexible timeout and rate limit management
    • Detailed logging and monitoring

🚀 Quick Start

To install the package, run:

npm install enterprise-ai-recursive-web-scraper

Using the CLI

The enterprise-ai-recursive-web-scraper package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.

Installation

Ensure that the package is installed globally to use the CLI:

npm install -g enterprise-ai-recursive-web-scraper

Running the CLI

Once installed, you can use the web-scraper command to start scraping. Here's a basic example of how to use it:

web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output

CLI Options

  • -k, --api-key <key>: (Required) Your Google Gemini API key
  • -u, --url <url>: (Required) The URL of the website to scrape
  • -o, --output <directory>: Output directory for scraped data (default: scraping_output)
  • -d, --depth <number>: Maximum crawl depth (default: 3)
  • -c, --concurrency <number>: Concurrent scraping limit (default: 5)
  • -r, --rate-limit <number>: Requests per second (default: 5)
  • -t, --timeout <number>: Request timeout in milliseconds (default: 30000)
  • -f, --format <type>: Output format: json|csv|markdown (default: json)
  • -v, --verbose: Enable verbose logging
  • --retry-attempts <number>: Number of retry attempts (default: 3)
  • --retry-delay <number>: Delay between retries in ms (default: 1000)

Example combining rate limiting with depth, concurrency, and output options:

web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output \
  --depth 5 --concurrency 10 --rate-limit 2 --retry-attempts 3 --format csv --verbose

🔧 Advanced Usage

Rate Limiting Configuration

Configure rate limiting to respect server limits and prevent throttling:

import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    rateLimiter: new RateLimiter({
        maxTokens: 5,      // Maximum number of tokens
        refillRate: 1,     // Tokens refilled per second
        retryAttempts: 3,  // Number of retry attempts
        retryDelay: 1000   // Delay between retries (ms)
    })
});
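
This configuration follows the token bucket model named in the feature list: the bucket holds at most maxTokens tokens, each request spends one, and tokens refill continuously at refillRate per second, so short bursts are allowed while the long-run rate stays bounded. As a rough standalone sketch of the idea (illustration only, not the library's internal code):

class TokenBucket {
    private tokens: number;
    private lastRefill = Date.now();

    constructor(private maxTokens: number, private refillRate: number) {
        this.tokens = maxTokens; // start with a full bucket
    }

    tryConsume(): boolean {
        const now = Date.now();
        const elapsedSeconds = (now - this.lastRefill) / 1000;
        // Refill proportionally to elapsed time, never exceeding capacity
        this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSeconds * this.refillRate);
        this.lastRefill = now;
        if (this.tokens >= 1) {
            this.tokens -= 1; // spend one token: the request may proceed
            return true;
        }
        return false; // bucket empty: the caller should wait and retry
    }
}

With maxTokens: 5 and refillRate: 1, up to five requests can fire back-to-back before throughput settles at one request per second.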

Structured Data Extraction

To extract structured data using a JSON schema, you can use the JsonExtractionStrategy:

import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";

const schema = {
    baseSelector: "article",
    fields: [
        { name: "title", selector: "h1" },
        { name: "content", selector: ".content" },
        { name: "date", selector: "time", attribute: "datetime" }
    ]
};

const scraper = new WebScraper({
    extractionStrategy: new JsonExtractionStrategy(schema)
});
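
Each element matched by baseSelector produces one record whose keys are the field names: selector picks a child element, and attribute (when present) reads an attribute value instead of text content. For the schema above, the extracted records would take roughly this shape (illustrative values; the exact result envelope may differ):

// Illustrative output shape only; values are made up
const extracted = [
    {
        title: "Example headline",           // text of the first h1 in the article
        content: "Body text of the article", // text of .content
        date: "2024-05-01T12:00:00Z"         // datetime attribute of the time element
    }
];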

Custom Browser Session

You can customize the browser session with specific configurations:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    browserConfig: {
        headless: false,
        proxy: "http://proxy.example.com",
        userAgent: "Custom User Agent"
    }
});
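
However the scraper is configured, starting a crawl looks roughly like this. Note this is a minimal sketch: the scrapeWebsite method name and the apiKey option are assumptions mirroring the CLI flags, so check the repository for the exact API:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    apiKey: "YOUR_API_KEY",           // assumed option, mirroring the CLI's --api-key flag
    browserConfig: { headless: true }
});

// scrapeWebsite(url) is assumed as the entry point; see the repository docs
// for the confirmed signature and the shape of the returned results
const results = await scraper.scrapeWebsite("https://example.com");
console.log(results);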

🤝 Contributors

📄 License

MIT © Mike Odnis

💙 Built with create-typescript-app