0.0.4 • Published 7 months ago

@cli-upkaran/adapter-website v0.0.4

Website data preparation adapter for cli-upkaran.

This package provides an adapter for fetching and processing content from websites (single pages, sitemaps, crawls) as part of a data preparation pipeline within cli-upkaran commands.

Features

  • Fetches content from URLs.
  • Discovers URLs via sitemap.xml or depth-limited crawling.
  • Extracts main content using CSS selectors or Readability.
  • Converts HTML to Markdown or extracts plain text.
  • Fetches multiple URLs concurrently.
  • Supports filtering fetched URLs using glob patterns.
  • Integrates with the @cli-upkaran/dataprep-core pipeline.
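To make the sitemap discovery and glob filtering features concrete, here is a minimal, dependency-free sketch. It is not the adapter's actual code: the helper names (extractSitemapUrls, globToRegExp, filterUrlsByPath) are hypothetical, the sitemap parsing is a simple regular-expression scan rather than a full XML parser, and the glob support is limited to `**`, `*`, and `?`. It also assumes patterns are matched against the URL path, which the adapter may or may not do.

```typescript
// Hypothetical helper: collect the contents of every <loc> element in a sitemap.
// A regex scan is enough for a sketch; it does not decode XML entities or
// follow nested sitemap index files.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    const loc = match[1];
    if (loc !== undefined) urls.push(loc);
  }
  return urls;
}

// Hypothetical helper: convert a simple glob into an anchored RegExp.
// `**` matches any characters (including `/`), `*` matches any characters
// except `/`, and `?` matches a single character except `/`.
function globToRegExp(glob: string): RegExp {
  let source = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === '*') {
      if (glob.charAt(i + 1) === '*') {
        source += '.*';
        i++; // consume the second `*`
      } else {
        source += '[^/]*';
      }
    } else if (ch === '?') {
      source += '[^/]';
    } else {
      // Escape characters that are special in regular expressions.
      source += ch.replace(/[.+^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp(`^${source}$`);
}

// Hypothetical helper: keep only URLs whose path matches at least one pattern.
// An empty pattern list is treated as "no filtering" (an assumption).
function filterUrlsByPath(urls: string[], patterns: string[]): string[] {
  if (patterns.length === 0) return urls;
  const matchers = patterns.map(globToRegExp);
  return urls.filter((url) => {
    const path = new URL(url).pathname;
    return matchers.some((re) => re.test(path));
  });
}
```

With these helpers, filterUrlsByPath(extractSitemapUrls(xml), ['/docs/**']) would keep only the sitemap entries under /docs/. A production implementation would more likely rely on an XML parser and an established glob library.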

Installation

This package is intended to be used as a dependency by cli-upkaran command plugins that need to process web content.

pnpm add @cli-upkaran/adapter-website

Usage

A command plugin maps its own options onto the adapter's options and calls the adapter to fetch and process website content:

// Within a command plugin's implementation
import { processWebsite } from '@cli-upkaran/adapter-website';
import type { DataPrepAdapterOptions } from '@cli-upkaran/dataprep-core';

// Hypothetical options shape for this example command; the field types are
// assumptions, so adjust them to match your plugin.
interface MyWebCommandOptions {
  startUrl: string;
  includePathPatterns?: string[];
  contentSelector?: string;
}

async function runMyWebCommand(options: MyWebCommandOptions) {
  // Map the command's options onto the adapter's options
  const adapterOptions: DataPrepAdapterOptions = {
    url: options.startUrl,
    match: options.includePathPatterns,
    selector: options.contentSelector,
    // ... other website adapter specific options like concurrency, depth, etc.
  };

  // Use the adapter to get data sources
  const dataSources = await processWebsite(adapterOptions);

  // ... process dataSources ...
}

(Note: processWebsite and the option names shown above are illustrative; the actual exported API may differ.)
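The concurrency and depth options mentioned in the example's comment are not documented here, so the following is only a sketch of how a depth-limited crawl with bounded concurrency could work. All names (LinkFetcher, mapWithConcurrency, crawl) and parameters are hypothetical and are not part of this package's API. The page fetcher is injected as a function so the sketch stays independent of any HTTP or HTML-parsing library.

```typescript
// Hypothetical signature: given a URL, resolve to the links found on that page.
type LinkFetcher = (url: string) => Promise<string[]>;

// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as `items`.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runWorker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => runWorker()));
  return results;
}

// Breadth-first crawl from `startUrl`. Returns every URL discovered within
// `maxDepth` link hops of the start page, in discovery order.
async function crawl(
  startUrl: string,
  maxDepth: number,
  concurrency: number,
  fetchLinks: LinkFetcher,
): Promise<string[]> {
  const visited = new Set<string>([startUrl]);
  let frontier: string[] = [startUrl];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    // Fetch the whole current level, never exceeding the concurrency limit.
    const linkLists = await mapWithConcurrency(frontier, concurrency, fetchLinks);
    const nextFrontier: string[] = [];
    for (const links of linkLists) {
      for (const link of links) {
        if (!visited.has(link)) {
          visited.add(link);
          nextFrontier.push(link);
        }
      }
    }
    frontier = nextFrontier;
  }
  return Array.from(visited);
}
```

Processing the crawl level by level keeps the depth limit easy to reason about, and the visited set prevents refetching pages that link back to each other. A real crawler would also need error handling, same-origin checks, and URL normalization, which are omitted here.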

Contributing

See the main CONTRIBUTING.md in the root of the repository.

License

MIT - See the main LICENSE file in the root of the repository.