WaterCrawl Node.js Client

A TypeScript/Node.js client library for the WaterCrawl API. This client provides a simple and intuitive way to interact with WaterCrawl's web crawling service.

Features

  • Full API coverage for WaterCrawl endpoints
  • Written in TypeScript with complete type definitions
  • Support for both synchronous and asynchronous crawling
  • Real-time crawl monitoring with event streaming
  • Automatic result downloading and processing
  • Promise-based API with async/await support
  • Comprehensive error handling
  • ESLint and Prettier configured for code quality
  • Built with ES Modules
  • Continuous Integration with GitHub Actions
  • Automated npm publishing with version tags

Installation

npm install @watercrawl/nodejs

Quick Start

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';

// Initialize the client with your API key
const client = new WaterCrawlAPIClient('your-api-key');

// Simple synchronous crawling
const result = await client.scrapeUrl('https://watercrawl.dev');
console.log(result);

// Asynchronous crawling with monitoring
const request = await client.scrapeUrl('https://watercrawl.dev', {}, {}, false);
for await (const event of client.monitorCrawlRequest(request.uuid)) {
    console.log('Event:', event);
}

API Reference

Types

interface CrawlRequest {
    uuid: string;
    status: 'pending' | 'running' | 'completed' | 'failed' | 'stopped';
    created_at: string;
    updated_at: string;
}

interface CrawlResult {
    uuid: string;
    url: string;
    data: Record<string, any>;
    metadata: Record<string, any>;
}

interface PageOptions {
    wait_for_selector?: string;
    screenshot?: boolean;
    viewport?: {
        width: number;
        height: number;
    };
}

interface PluginOptions {
    extract_links?: boolean;
    extract_text?: boolean;
    custom_js?: string;
}

interface CrawlEvent {
    type: 'status' | 'result' | 'error';
    data: any;
}
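
For illustration, the status union on CrawlRequest makes small type-safe helpers easy to write. The function below is hypothetical and not part of the library:

import type { CrawlRequest } from '@watercrawl/nodejs';

// Hypothetical helper (not part of the library): true once a request
// has settled in a terminal state.
function isFinished(request: CrawlRequest): boolean {
    return request.status === 'completed'
        || request.status === 'failed'
        || request.status === 'stopped';
}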

Constructor

class WaterCrawlAPIClient {
    constructor(apiKey: string, baseUrl?: string);
}
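
The baseUrl parameter is optional; pass it when targeting a self-hosted WaterCrawl instance. The self-hosted URL below is a placeholder, not a documented default:

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';

// Hosted service: only the API key is required
const client = new WaterCrawlAPIClient('your-api-key');

// Self-hosted deployment (placeholder URL, substitute your own)
const selfHosted = new WaterCrawlAPIClient('your-api-key', 'https://crawl.example.com');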

Methods

Crawl Requests

// Create a new crawl request
async createCrawlRequest(
    url: string,
    spiderOptions?: Record<string, any>,
    pageOptions?: PageOptions,
    pluginOptions?: PluginOptions
): Promise<CrawlRequest>;

// List all crawl requests
async getCrawlRequestsList(
    page?: number,
    pageSize?: number
): Promise<{ results: CrawlRequest[] }>;

// Get a specific crawl request
async getCrawlRequest(itemId: string): Promise<CrawlRequest>;

// Stop a crawl request
async stopCrawlRequest(itemId: string): Promise<null>;

// Download crawl request results
async downloadCrawlRequest(itemId: string): Promise<CrawlResult[]>;
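
Taken together, these methods cover the full request lifecycle. A minimal sketch, assuming the signatures above (the two-second polling interval is illustrative, not a library default):

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';

const client = new WaterCrawlAPIClient('your-api-key');

// Create a request without waiting for it to finish
const request = await client.createCrawlRequest('https://watercrawl.dev');

// Poll until the crawl settles (interval is illustrative)
let current = await client.getCrawlRequest(request.uuid);
while (current.status === 'pending' || current.status === 'running') {
    await new Promise(resolve => setTimeout(resolve, 2000));
    current = await client.getCrawlRequest(request.uuid);
}

// Fetch everything the crawl produced
const results = await client.downloadCrawlRequest(request.uuid);
console.log(`Got ${results.length} results, final status: ${current.status}`);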

Monitoring and Results

// Monitor crawl progress in real-time
async *monitorCrawlRequest(
    itemId: string,
    download?: boolean
): AsyncGenerator<CrawlEvent, void, unknown>;

// Get results for a crawl request
async getCrawlRequestResults(
    itemId: string
): Promise<{ results: CrawlResult[] }>;

// Download a specific result
async downloadResult(resultObject: CrawlResult): Promise<Record<string, any>>;
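
For example, to list the stored results of a finished crawl and resolve the full payload of each one (a sketch assuming the signatures above; the request UUID is a placeholder):

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';

const client = new WaterCrawlAPIClient('your-api-key');
const requestId = 'your-request-uuid'; // placeholder

const { results } = await client.getCrawlRequestResults(requestId);
for (const result of results) {
    // Resolve the stored payload for a single result
    const data = await client.downloadResult(result);
    console.log(result.url, data);
}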

Simplified Crawling

// Synchronous crawling (waits for result)
async scrapeUrl(
    url: string,
    pageOptions?: PageOptions,
    pluginOptions?: PluginOptions,
    sync?: true,
    download?: true
): Promise<Record<string, any>>;

// Asynchronous crawling (returns immediately)
async scrapeUrl(
    url: string,
    pageOptions?: PageOptions,
    pluginOptions?: PluginOptions,
    sync: false
): Promise<CrawlRequest>;

Examples

Basic Crawling

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';
import type { CrawlResult } from '@watercrawl/nodejs';

const client = new WaterCrawlAPIClient('your-api-key');

// Simple crawl
const result = await client.scrapeUrl('https://watercrawl.dev');
console.log('Crawl result:', result);

Advanced Crawling with Options

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';
import type { PageOptions, PluginOptions } from '@watercrawl/nodejs';

const client = new WaterCrawlAPIClient('your-api-key');

// Crawl with custom options
const pageOptions: PageOptions = {
    wait_for_selector: '.content',
    screenshot: true,
    viewport: {
        width: 1920,
        height: 1080
    }
};

const pluginOptions: PluginOptions = {
    extract_links: true,
    extract_text: true,
    custom_js: `
        // Custom JavaScript to run on the page
        return {
            title: document.title,
            metaDescription: document.querySelector('meta[name="description"]')?.content
        };
    `
};

const result = await client.scrapeUrl('https://watercrawl.dev', pageOptions, pluginOptions);

Asynchronous Crawling with Progress Monitoring

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';
import type { CrawlEvent } from '@watercrawl/nodejs';

const client = new WaterCrawlAPIClient('your-api-key');

// Start crawl asynchronously
const request = await client.scrapeUrl('https://watercrawl.dev', {}, {}, false);

// Monitor progress
for await (const event of client.monitorCrawlRequest(request.uuid)) {
    switch (event.type) {
        case 'status':
            console.log('Status update:', event.data);
            break;
        case 'result':
            console.log('Got result:', event.data);
            break;
        case 'error':
            console.error('Error:', event.data);
            break;
    }
}

Managing Multiple Crawls

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';
import type { CrawlRequest, CrawlResult } from '@watercrawl/nodejs';

const client = new WaterCrawlAPIClient('your-api-key');

// List all crawl requests
const { results: requests } = await client.getCrawlRequestsList();
console.log('All requests:', requests);

// Get specific request details
const request: CrawlRequest = await client.getCrawlRequest(requests[0].uuid);
console.log('Request details:', request);

// Stop a crawl
await client.stopCrawlRequest(request.uuid);

// Download results
const { results } = await client.getCrawlRequestResults(request.uuid);
console.log('Results:', results);

Error Handling

The client includes comprehensive error handling with TypeScript support:

import { WaterCrawlAPIClient } from '@watercrawl/nodejs';
import type { APIError } from '@watercrawl/nodejs';

try {
    const result = await client.scrapeUrl('https://watercrawl.dev');
} catch (error) {
    if ((error as APIError).response) {
        // API error with response
        console.error('API Error:', (error as APIError).response.data);
    } else {
        // Network or other error
        console.error('Error:', (error as Error).message);
    }
}

Development

Setup

# Install dependencies
npm install

# Copy example environment file
cp .env.example .env

# Add your API key to .env
echo "WATERCRAWL_API_KEY=your-api-key" >> .env

Available Scripts

# Build the project
npm run build

# Run tests
npm test

# Lint the code
npm run lint

# Format the code
npm run format

# Prepare for publishing
npm run prepare

Continuous Integration

This project uses GitHub Actions for continuous integration and deployment:

  • Testing: All pushes to main and pull requests are automatically tested against Node.js versions 16, 18, and 20.
  • Publishing: When a version tag (e.g., v1.0.0) is pushed, the package is automatically:
    1. Built and tested
    2. Validated against the tag's version number
    3. Published to npm
    4. Released on GitHub

Publishing a New Version

To publish a new version:

  1. Update the version in package.json:

    npm version patch  # or minor, or major
  2. Push the changes and the new tag:

    git push && git push --tags
  3. The GitHub Action will automatically:

    • Run all tests
    • Publish to npm if tests pass
    • Create a GitHub release

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.