0.0.4 • Published 7 months ago
@cli-upkaran/adapter-website
Website data preparation adapter for cli-upkaran.
This package provides an adapter for fetching and processing content from websites (single pages, sitemaps, crawls) as part of a data preparation pipeline within cli-upkaran commands.
Features
- Fetches content from URLs.
- Discovers URLs via sitemap.xml or crawling (limited depth).
- Extracts main content using CSS selectors or Readability.
- Converts HTML to Markdown or extracts plain text.
- Handles concurrent requests.
- Supports filtering fetched URLs using glob patterns.
- Integrates with the @cli-upkaran/dataprep-core pipeline.
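To make the sitemap discovery and glob filtering features more concrete, here is a minimal, dependency-free sketch of the general idea. It is not taken from the adapter's source: the helper names (`extractSitemapUrls`, `globToRegExp`, `filterUrlsByGlob`) are hypothetical, and it assumes glob patterns are matched against the URL path, which the real adapter may or may not do.

```typescript
// Hypothetical helpers for illustration only; they are NOT exported by
// @cli-upkaran/adapter-website.

/** Collect every <loc> value from a sitemap.xml document (no entity decoding). */
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

/** Translate a small glob subset into a RegExp: `**`, `*`, and `?`. */
function globToRegExp(glob: string): RegExp {
  let source = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === '*') {
      if (glob.charAt(i + 1) === '*') {
        source += '.*'; // `**` may cross path separators
        i++;
      } else {
        source += '[^/]*'; // `*` stays within one path segment
      }
    } else if (ch === '?') {
      source += '[^/]'; // `?` matches one non-separator character
    } else {
      // Escape characters that are special in regular expressions.
      source += ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp('^' + source + '$');
}

/** Keep URLs whose path matches at least one glob (assumed matching rule). */
function filterUrlsByGlob(urls: string[], patterns: string[]): string[] {
  const matchers = patterns.map(globToRegExp);
  return urls.filter((url) => {
    const path = new URL(url).pathname;
    return matchers.some((re) => re.test(path));
  });
}

const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/guides/setup</loc></url>
  <url><loc>https://example.com/blog/hello</loc></url>
</urlset>`;

const discovered = extractSitemapUrls(sitemap);
const selected = filterUrlsByGlob(discovered, ['/docs/**']);
console.log(selected);
```

In this sketch `/docs/**` keeps both documentation URLs, while `/docs/*` would keep only `https://example.com/docs/intro`, because a single `*` does not cross a `/`.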
Installation
This package is intended to be used as a dependency by cli-upkaran command plugins that need to process web content.
pnpm add @cli-upkaran/adapter-website
Usage
Command plugins can utilize this adapter to configure website fetching.
// Within a command plugin's implementation
import { processWebsite } from '@cli-upkaran/adapter-website';
import type { DataPrepAdapterOptions } from '@cli-upkaran/dataprep-core';

async function runMyWebCommand(options: MyWebCommandOptions) {
  const adapterOptions: DataPrepAdapterOptions = {
    url: options.startUrl,
    match: options.includePathPatterns,
    selector: options.contentSelector,
    // ... other website adapter specific options like concurrency, depth, etc.
  };

  // Use the adapter to get data sources
  const dataSources = await processWebsite(adapterOptions);

  // ... process dataSources ...
}
(Note: The exact API (processWebsite) is illustrative and may differ based on the actual implementation.)
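The usage snippet mentions a depth option without showing what a depth limit does. The following sketch shows one common way to bound a crawl: a breadth-first traversal that stops after a fixed number of link hops. Everything here is an assumption for illustration. `crawlToDepth` and `LinkLookup` are hypothetical names, and the link lookup is a synchronous in-memory function, whereas a real adapter would fetch and parse pages asynchronously.

```typescript
// Hypothetical sketch; NOT the adapter's actual crawler.

/** Returns the outgoing links of a page. Injected so the sketch needs no network. */
type LinkLookup = (url: string) => string[];

/**
 * Breadth-first crawl that follows links at most `maxDepth` hops from the start
 * URL and never visits the same URL twice.
 */
function crawlToDepth(startUrl: string, maxDepth: number, getLinks: LinkLookup): string[] {
  const visited = new Set<string>([startUrl]);
  let frontier: string[] = [startUrl];

  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next; // pages discovered at this level form the next level
  }

  return Array.from(visited);
}

// A tiny in-memory "site" standing in for real pages.
const siteLinks: Record<string, string[]> = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/a/deep', 'https://example.com/'],
  'https://example.com/a/deep': ['https://example.com/a/deeper'],
};
const lookup: LinkLookup = (url) => siteLinks[url] ?? [];

const oneHop = crawlToDepth('https://example.com/', 1, lookup);
const twoHops = crawlToDepth('https://example.com/', 2, lookup);
console.log(oneHop.length, twoHops.length);
```

With a depth of 1 the sketch returns the start page plus its two direct links. A depth of 2 adds `https://example.com/a/deep` but still stops before `https://example.com/a/deeper`.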
Contributing
See the main CONTRIBUTING.md in the root of the repository.
License
MIT - See the main LICENSE file in the root of the repository.