@purepageio/fetch-engines

Fetching web content can be complex. You need to handle static HTML, dynamic JavaScript-driven sites, network errors, retries, caching, and potential bot detection measures. Managing browser automation tools like Playwright adds another layer of complexity with resource pooling and stealth configurations.

@purepageio/fetch-engines simplifies this entire process by providing a set of robust, configurable, and easy-to-use engines for retrieving web page content.

Why use @purepageio/fetch-engines?

Unified API: Get content from simple or complex sites using the same fetchHTML(url, options?) method.
Flexible Strategies: Choose the right tool for the job:
- FetchEngine: Lightweight and fast for static HTML, using the standard fetch API. Ideal for speed and efficiency with content that doesn't require JavaScript rendering. Supports custom headers.
- HybridEngine: The best of both worlds – tries FetchEngine first for speed, automatically falls back to a powerful browser engine (internally, PlaywrightEngine) for reliability on complex, JavaScript-heavy pages. Supports custom headers.
Robust & Resilient: Built-in caching, configurable retries, and standardized error handling make your fetching logic more dependable.
Simplified Automation: When HybridEngine uses its browser capabilities (via the internal PlaywrightEngine), it manages browser instances and contexts automatically through efficient pooling and includes integrated stealth measures to bypass common anti-bot systems.
Content Transformation: Optionally convert fetched HTML directly to clean Markdown content.
TypeScript Ready: Fully typed for a better development experience.

This package provides a high-level abstraction, letting you focus on using the web content rather than the intricacies of fetching it.

Features

Multiple Fetching Strategies: Choose between FetchEngine (lightweight fetch) or HybridEngine (smart fallback to a full browser engine).
Unified API: Simple fetchHTML(url, options?) interface across both primary engines.
Custom Headers: Easily provide custom HTTP headers for requests in both FetchEngine and HybridEngine.
Configurable Retries: Automatic retries on failure with customizable attempts and delays.
Built-in Caching: In-memory caching with configurable TTL to reduce redundant fetches.
Playwright Stealth: When HybridEngine utilizes its browser capabilities, it automatically integrates playwright-extra and stealth plugins to bypass common bot detection.
Managed Browser Pooling: Efficient resource management for HybridEngine's browser mode with configurable browser/context limits and lifecycles.
Smart Fallbacks: HybridEngine uses FetchEngine first, falling back to its internal browser engine only when needed. The internal browser engine can also optionally use a fast HTTP fetch before launching a full browser.
Content Conversion: Optionally convert fetched HTML directly to Markdown.
Standardized Errors: Custom FetchError classes provide context on failures.
TypeScript Ready: Fully typed codebase for enhanced developer experience.
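
To make the retry and caching features concrete, here is a minimal, self-contained sketch of what "retry with a fixed delay" and "in-memory cache with a TTL" mean. It is not the library's implementation: `withRetries` and `TtlCache` are hypothetical names used only for illustration, and their parameters merely mirror the `maxRetries`, `retryDelay`, and `cacheTTL` options described in the Configuration section.

```typescript
// Illustrative sketch only: not the library's internal code.
// `withRetries` and `TtlCache` are hypothetical helpers.

/** Runs `task`, retrying up to `maxRetries` extra times with a fixed delay. */
async function withRetries<T>(
  task: () => Promise<T>,
  maxRetries: number,
  retryDelayMs: number
): Promise<T> {
  const retries = Math.max(0, maxRetries);
  let lastError: unknown;
  // Attempt 0 is the initial try; attempts 1..retries are the retries.
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        await new Promise<void>((resolve) => setTimeout(resolve, retryDelayMs));
      }
    }
  }
  throw lastError;
}

/** Minimal in-memory cache whose entries expire after `ttlMs` milliseconds. */
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // Expired: drop it and report a miss.
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    // A TTL of 0 disables caching, mirroring the documented `cacheTTL: 0`.
    if (this.ttlMs <= 0) return;
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

In this sketch the `now` parameter exists only so expiry can be checked deterministically; a real cache would normally read the clock itself.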

Installation

pnpm add @purepageio/fetch-engines
# or with npm
npm install @purepageio/fetch-engines
# or with yarn
yarn add @purepageio/fetch-engines

If you plan to use the HybridEngine (which internally uses Playwright for advanced fetching), you also need to install Playwright's browser binaries:

pnpm exec playwright install
# or
npx playwright install

Engines

FetchEngine: Uses the standard fetch API. Suitable for simple HTML pages or APIs returning HTML. Lightweight and fast. This is your go-to for speed and efficiency when JavaScript rendering is not required.
HybridEngine: A smart combination. It first attempts to fetch content using the lightweight FetchEngine. If that fails for any reason (e.g., network error, non-HTML content, HTTP error like 403), or if spaMode is enabled and an SPA shell is detected, it automatically falls back to using an internal, powerful browser engine (based on Playwright). This provides the speed of FetchEngine for simple sites while retaining the power of a full browser for complex, dynamic websites. This is recommended for most general-purpose fetching tasks.
PlaywrightEngine (Internal Component): While not recommended for direct use by most users, PlaywrightEngine is the component HybridEngine uses internally for its browser-based fetching. It manages Playwright browser instances, contexts, and stealth features. Users needing direct, low-level control over Playwright might consider it, but HybridEngine offers a more robust and flexible approach for most scenarios.
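
The fallback flow described for HybridEngine can be pictured with the following sketch. It is a simplified illustration, not the library's code: `fetchWithFallback`, `looksLikeSpaShell`, and the `Fetcher` type are hypothetical, and the SPA-shell heuristic (fewer than 50 characters of visible text in the body) is an assumption made for the example, since the library's actual detection logic is not described here.

```typescript
// Illustrative sketch of a "fast path first, browser fallback" strategy.
// All names are hypothetical; this is not the library's implementation.

type Fetcher = (url: string) => Promise<string>;

/** Assumed heuristic: an SPA shell has almost no visible text in <body>. */
function looksLikeSpaShell(html: string): boolean {
  const bodyMatch = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
  const body = bodyMatch?.[1] ?? html;
  const visibleText = body
    .replace(/<script[\s\S]*?<\/script>/gi, "") // Ignore inline scripts.
    .replace(/<[^>]+>/g, "") // Strip remaining tags.
    .trim();
  return visibleText.length < 50; // Arbitrary threshold for the sketch.
}

async function fetchWithFallback(
  url: string,
  fastFetch: Fetcher,
  browserFetch: Fetcher,
  spaMode: boolean
): Promise<{ html: string; usedBrowser: boolean }> {
  try {
    const html = await fastFetch(url);
    // Keep the fast result unless SPA mode is on and the page looks empty.
    if (!(spaMode && looksLikeSpaShell(html))) {
      return { html, usedBrowser: false };
    }
  } catch {
    // Any failure on the fast path (network, HTTP error, ...) falls through.
  }
  return { html: await browserFetch(url), usedBrowser: true };
}
```

Passing the two fetchers as functions keeps the sketch independent of any real network or browser, so the decision logic can be exercised with stubs.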

Basic Usage

FetchEngine

import { FetchEngine } from "@purepageio/fetch-engines";

const engine = new FetchEngine(); // Default: fetches HTML

async function main() {
  try {
    const url = "https://example.com";
    const result = await engine.fetchHTML(url);
    console.log(`Fetched ${result.url} (ContentType: ${result.contentType})`);
    console.log(`Title: ${result.title}`);
    console.log(`Content (HTML): ${result.content.substring(0, 100)}...`);

    // Example fetching Markdown directly via constructor option
    const markdownEngine = new FetchEngine({ markdown: true });
    const mdResult = await markdownEngine.fetchHTML(url);
    console.log(`\nFetched ${mdResult.url} (ContentType: ${mdResult.contentType})`);
    console.log(`Content (Markdown):\n${mdResult.content.substring(0, 300)}...`);
  } catch (error) {
    console.error("Fetch failed:", error);
  }
}
main();

HybridEngine

import { HybridEngine } from "@purepageio/fetch-engines";

// Engine configured to fetch HTML by default for its internal engines
// and provide some custom headers for all requests made by HybridEngine.
const engine = new HybridEngine({
  markdown: false,
  headers: { "X-Global-Custom-Header": "HybridGlobalValue" },
  // Other PlaywrightEngine specific configs can be set here for the fallback mechanism
  // e.g., playwrightLaunchOptions: { args: ["--disable-gpu"] }
});

async function main() {
  try {
    const urlSimple = "https://example.com"; // Simple site, likely handled by FetchEngine
    const urlComplex = "https://quotes.toscrape.com/"; // Intended to exercise the Playwright fallback (used only if the FetchEngine attempt fails)

    // --- Scenario 1: FetchEngine part of HybridEngine handles it ---
    console.log(`\nFetching simple site (${urlSimple}) with per-request headers...`);
    const result1 = await engine.fetchHTML(urlSimple, {
      headers: { "X-Request-Specific": "SimpleRequestValue" },
    });
    // FetchEngine (via HybridEngine) will use:
    // 1. Its base default headers (User-Agent etc.)
    // 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
    // 3. Overridden/augmented by per-request headers ("X-Request-Specific")
    console.log(`Fetched ${result1.url} (ContentType: ${result1.contentType}) - Title: ${result1.title}`);
    console.log(`Content (HTML): ${result1.content.substring(0, 100)}...`);

    // --- Scenario 2: Playwright part of HybridEngine handles it ---
    console.log(`\nFetching complex site (${urlComplex}) requesting Markdown and with per-request headers...`);
    const result2 = await engine.fetchHTML(urlComplex, {
      markdown: true,
      headers: { "X-Request-Specific": "ComplexRequestValue", "X-Another": "ComplexAnother" },
    });
    // PlaywrightEngine (via HybridEngine) will use:
    // 1. Its base default headers (User-Agent etc. if doing HTTP fallback, or for page.setExtraHTTPHeaders)
    // 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
    // 3. Overridden/augmented by per-request headers ("X-Request-Specific", "X-Another")
    // The markdown: true option will be respected by the Playwright part.
    console.log(`Fetched ${result2.url} (ContentType: ${result2.contentType}) - Title: ${result2.title}`);
    console.log(`Content (Markdown):\n${result2.content.substring(0, 300)}...`);
  } catch (error) {
    console.error("Hybrid fetch failed:", error);
  } finally {
    await engine.cleanup(); // Important for HybridEngine
  }
}
main();

Configuration

Engines accept an optional configuration object in their constructor to customize behavior.

FetchEngine

The FetchEngine accepts a FetchEngineOptions object with the following properties:

Option	Type	Default	Description
`markdown`	`boolean`	`false`	If `true`, converts fetched HTML to Markdown. `contentType` in the result will be set to `'markdown'`.
`headers`	`Record<string, string>`	`{}`	Custom HTTP headers to be sent with the request. These are merged with and can override the engine's default headers. Headers from `fetchHTML` options take higher precedence.

// Example: FetchEngine with custom headers and Markdown conversion
import { FetchEngine } from "@purepageio/fetch-engines";

const customFetchEngine = new FetchEngine({
  markdown: true,
  headers: {
    "User-Agent": "MyCustomFetchAgent/1.0",
    "X-Api-Key": "your-api-key",
  },
});

Header Precedence for `FetchEngine`:

1. Headers passed in fetchHTML(url, { headers: { ... } }) (highest precedence).
2. Headers passed in the FetchEngine constructor new FetchEngine({ headers: { ... } }).
3. Default headers of the FetchEngine (e.g., its default User-Agent) (lowest precedence).
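
The precedence order above amounts to a layered merge in which later layers win. The sketch below shows the idea with a hypothetical `mergeHeaders` helper; it is not the library's code, the default header values are invented for the example, and lower-casing header names is a choice made here (HTTP header names are case-insensitive) rather than documented library behavior.

```typescript
// Illustrative sketch: layered header merge where later layers override earlier ones.
// `mergeHeaders` and the default values below are hypothetical.

type HeaderMap = Record<string, string>;

function mergeHeaders(...layers: HeaderMap[]): HeaderMap {
  const merged: HeaderMap = {};
  for (const layer of layers) {
    for (const [headerName, value] of Object.entries(layer)) {
      // Normalize case so "User-Agent" and "user-agent" are the same header.
      merged[headerName.toLowerCase()] = value;
    }
  }
  return merged;
}

// Lowest precedence first, highest precedence last.
const engineDefaults: HeaderMap = { "User-Agent": "DefaultAgent/1.0", Accept: "text/html" };
const constructorHeaders: HeaderMap = {
  "User-Agent": "MyCustomFetchAgent/1.0",
  "X-Api-Key": "your-api-key",
};
const requestHeaders: HeaderMap = { "X-Api-Key": "per-request-key" };

const effectiveHeaders = mergeHeaders(engineDefaults, constructorHeaders, requestHeaders);
console.log(effectiveHeaders);
```

Here the constructor's `User-Agent` replaces the default, the per-request `X-Api-Key` replaces the constructor's, and the untouched default `Accept` header survives.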

`PlaywrightEngineConfig` (Used by `HybridEngine`)

The HybridEngine constructor accepts a PlaywrightEngineConfig object. Its settings configure the internal FetchEngine, the internal PlaywrightEngine used for fallback, and the hybrid strategy itself.

Key Options for HybridEngine (from PlaywrightEngineConfig):

Option	Type	Default	Description
`headers`	`Record<string, string>`	`{}`	Custom HTTP headers. For `HybridEngine`, these serve as default headers for both its internal `FetchEngine` (constructor) and `PlaywrightEngine` (constructor). They can be overridden by headers in `HybridEngine.fetchHTML()` options.
`markdown`	`boolean`	`false`	Default Markdown conversion. For `HybridEngine`: sets default for internal `FetchEngine` (constructor) and internal `PlaywrightEngine`. Can be overridden per-request for the `PlaywrightEngine` part.
`useHttpFallback`	`boolean`	`true`	(For Playwright part) If `true`, attempts a fast HTTP fetch before using Playwright. Ineffective if `spaMode` is `true`.
`useHeadedModeFallback`	`boolean`	`false`	(For Playwright part) If `true`, automatically retries specific failed Playwright attempts in headed (visible) mode.
`defaultFastMode`	`boolean`	`true`	If `true`, initially blocks non-essential resources and skips human simulation. Can be overridden per-request. Effectively `false` if `spaMode` is `true`.
`simulateHumanBehavior`	`boolean`	`true`	If `true` (and not `fastMode` or `spaMode`), attempts basic human-like interactions.
`concurrentPages`	`number`	`3`	Max number of pages to process concurrently within the engine queue.
`maxRetries`	`number`	`3`	Max retry attempts for a failed fetch (excluding initial try).
`retryDelay`	`number`	`5000`	Delay (ms) between retries.
`cacheTTL`	`number`	`900000`	Cache Time-To-Live (ms). `0` disables caching. (15 mins default)
`spaMode`	`boolean`	`false`	If `true`, enables Single Page Application mode. This typically bypasses `useHttpFallback`, effectively sets `fastMode` to `false`, uses more patient load conditions (e.g., network idle), and may apply `spaRenderDelayMs`. Recommended for JavaScript-heavy sites.
`spaRenderDelayMs`	`number`	`0`	Explicit delay (ms) after page load events in `spaMode` to allow for client-side rendering. Only applies if `spaMode` is `true`.
`playwrightLaunchOptions`	`LaunchOptions`	`undefined`	(For Playwright part) Optional Playwright launch options (from `playwright` package, e.g., `{ args: ['--some-flag'] }`) passed when a browser instance is created. Merged with internal defaults.
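
Several options in this table interact: `spaMode` makes `useHttpFallback` ineffective and forces fast mode off, and `simulateHumanBehavior` only applies when neither fast mode nor SPA mode is active. The sketch below encodes those documented rules in a hypothetical `resolveEffectiveModes` function. How the library combines them internally is an assumption; only the rules and default values come from the table.

```typescript
// Illustrative sketch of how the documented mode options interact.
// `resolveEffectiveModes` is hypothetical, not part of the library.

interface ModeOptions {
  spaMode: boolean;
  defaultFastMode: boolean;
  useHttpFallback: boolean;
  simulateHumanBehavior: boolean;
}

interface EffectiveModes {
  useHttpFallback: boolean;
  fastMode: boolean;
  simulateHumanBehavior: boolean;
}

// Default values as listed in the options table.
const documentedDefaults: ModeOptions = {
  spaMode: false,
  defaultFastMode: true,
  useHttpFallback: true,
  simulateHumanBehavior: true,
};

function resolveEffectiveModes(options: ModeOptions, requestFastMode?: boolean): EffectiveModes {
  // spaMode forces fast mode off; otherwise a per-request value wins over the default.
  const fastMode = options.spaMode ? false : requestFastMode ?? options.defaultFastMode;
  return {
    // The HTTP fallback is ineffective in SPA mode.
    useHttpFallback: options.useHttpFallback && !options.spaMode,
    fastMode,
    // Human simulation only applies when neither fast mode nor SPA mode is active.
    simulateHumanBehavior: options.simulateHumanBehavior && !fastMode && !options.spaMode,
  };
}

console.log(resolveEffectiveModes({ ...documentedDefaults, spaMode: true }));
```

One consequence visible in the sketch: with the documented defaults, fast mode is on, so human simulation stays inactive until fast mode is turned off.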

Browser Pool Options (For HybridEngine's internal PlaywrightEngine):

Option	Type	Default	Description
`maxBrowsers`	`number`	`2`	Max concurrent browser instances managed by the pool.
`maxPagesPerContext`	`number`	`6`	Max pages per browser context before recycling.
`maxBrowserAge`	`number`	`1200000`	Max age (ms) a browser instance lives before recycling. (20 mins default)
`healthCheckInterval`	`number`	`60000`	How often (ms) the pool checks browser health. (1 min default)
`useHeadedMode`	`boolean`	`false`	Forces the entire pool (for Playwright part) to launch browsers in headed (visible) mode.
`poolBlockedDomains`	`string[]`	`[]`	List of domain glob patterns to block requests to (for Playwright part).
`poolBlockedResourceTypes`	`string[]`	`[]`	List of Playwright resource types (e.g., 'image', 'font') to block (for Playwright part).
`proxy`	`{ server: string, ... }?`	`undefined`	Proxy configuration object (see `PlaywrightEngineConfig` type) (for Playwright part).
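
To illustrate what the age and page limits mean in practice, here is a small sketch of a recycling decision. The `shouldRecycle` function and the `PooledBrowserState` shape are hypothetical, and the sketch assumes a single context per browser for simplicity; the library's pool may track state differently. Only the option names and default values are taken from the table.

```typescript
// Illustrative sketch of a pool recycling check; not the library's implementation.

interface PooledBrowserState {
  launchedAt: number; // Timestamp (ms) when the browser was started.
  pagesOpenedInContext: number; // Pages served by its context so far.
  isHealthy: boolean; // Result of the most recent health check.
}

interface PoolLimits {
  maxBrowserAge: number;
  maxPagesPerContext: number;
}

// Default values as listed in the pool options table.
const documentedPoolDefaults: PoolLimits = {
  maxBrowserAge: 1200000, // 20 minutes
  maxPagesPerContext: 6,
};

function shouldRecycle(state: PooledBrowserState, limits: PoolLimits, now: number): boolean {
  const tooOld = now - state.launchedAt >= limits.maxBrowserAge;
  const tooManyPages = state.pagesOpenedInContext >= limits.maxPagesPerContext;
  return !state.isHealthy || tooOld || tooManyPages;
}

const freshBrowser: PooledBrowserState = { launchedAt: 0, pagesOpenedInContext: 2, isHealthy: true };
console.log(shouldRecycle(freshBrowser, documentedPoolDefaults, 60000)); // One minute old, under both limits.
```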

`HybridEngine` - Configuration Summary & Header Precedence

When you configure HybridEngine using PlaywrightEngineConfig:

headers: Constructor headers are passed to the internal FetchEngine's constructor and the internal PlaywrightEngine's constructor.
markdown: Sets the default for both internal engines.
spaMode: Sets the default for HybridEngine's SPA shell detection and for the internal PlaywrightEngine.
Other options primarily configure the internal PlaywrightEngine or general retry/caching logic.

Per-request options in HybridEngine.fetchHTML(url, options):

headers?: Record<string, string>:
- These headers override any headers set in the HybridEngine constructor.
- If FetchEngine is used: These headers are passed to FetchEngine.fetchHTML(url, { headers: ... }). FetchEngine then merges them with its constructor headers and base defaults.
- If PlaywrightEngine (fallback) is used: These headers are merged with HybridEngine constructor headers (options take precedence) and the result is passed to PlaywrightEngine.fetchHTML(). PlaywrightEngine then applies its own logic (e.g., for page.setExtraHTTPHeaders or its HTTP fallback).
markdown?: boolean:
- If FetchEngine is used: This per-request option is ignored. FetchEngine uses its own constructor markdown setting.
- If PlaywrightEngine (fallback) is used: This overrides PlaywrightEngine's default and determines its output format.
spaMode?: boolean: Overrides HybridEngine's default SPA mode and is passed to PlaywrightEngine if used.
fastMode?: boolean: Passed to PlaywrightEngine if used; no effect on FetchEngine.
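
The asymmetry in how the per-request markdown option is treated is easy to miss, so here is a tiny sketch of the rule as documented above. `resolveMarkdown` and `EngineUsed` are hypothetical names for illustration only, not library exports.

```typescript
// Illustrative sketch of the documented per-request `markdown` rule.

type EngineUsed = "fetch" | "playwright";

function resolveMarkdown(
  engineUsed: EngineUsed,
  constructorMarkdown: boolean,
  requestMarkdown?: boolean
): boolean {
  if (engineUsed === "fetch") {
    // FetchEngine ignores the per-request option and uses its constructor setting.
    return constructorMarkdown;
  }
  // PlaywrightEngine lets the per-request option override the default.
  return requestMarkdown ?? constructorMarkdown;
}

// Constructor says HTML, request asks for Markdown:
console.log(resolveMarkdown("fetch", false, true)); // false
console.log(resolveMarkdown("playwright", false, true)); // true
```

Because the outcome depends on which internal engine served the request, checking `result.contentType` is the dependable way to know which format was actually returned.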

// Example: HybridEngine with SPA mode enabled by default
import { HybridEngine } from "@purepageio/fetch-engines";

const spaHybridEngine = new HybridEngine({ spaMode: true, spaRenderDelayMs: 2000 });

async function fetchSpaSite() {
  try {
    // Falls back to PlaywrightEngine if the initial response looks like an SPA shell
    const result = await spaHybridEngine.fetchHTML(
      "https://www.smallblackdots.net/release/16109/corrina-joseph-wish-tonite-lonely"
    );
    console.log(`Title: ${result.title}`);
  } catch (e) {
    console.error(e);
  } finally {
    await spaHybridEngine.cleanup(); // Release browser resources
  }
}

fetchSpaSite();

Return Value

All fetchHTML() methods return a Promise that resolves to an HTMLFetchResult object:

content (string): The fetched content, either original HTML or converted Markdown.
contentType ('html' | 'markdown'): Indicates the format of the content string.
title (string | null): Extracted page title (from original HTML).
url (string): Final URL after redirects.
isFromCache (boolean): True if the result came from cache.
statusCode (number | undefined): HTTP status code.
error (Error | undefined): Error object if the fetch failed after all retries. It's generally recommended to rely on catching thrown errors for failure handling.
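
For reference, the fields listed above can be written as a TypeScript shape. The interface below is transcribed from this list rather than copied from the package's type definitions, so treat `HTMLFetchResultShape` and the `summarizeResult` helper as illustrative names; in real code, use the types the package exports.

```typescript
// Illustrative shape transcribed from the field list; not the package's own type.

interface HTMLFetchResultShape {
  content: string;
  contentType: "html" | "markdown";
  title: string | null;
  url: string;
  isFromCache: boolean;
  statusCode: number | undefined;
  error: Error | undefined;
}

/** Hypothetical helper that formats a one-line summary of a result. */
function summarizeResult(result: HTMLFetchResultShape): string {
  const source = result.isFromCache ? "cache" : "network";
  const status = result.statusCode ?? "n/a";
  const title = result.title ?? "(no title)";
  return `${result.url} [${status}, ${source}, ${result.contentType}] ${title}`;
}

// Hand-written sample data, not output from a real fetch.
const sampleResult: HTMLFetchResultShape = {
  content: "<html><head><title>Example Domain</title></head><body></body></html>",
  contentType: "html",
  title: "Example Domain",
  url: "https://example.com/",
  isFromCache: false,
  statusCode: 200,
  error: undefined,
};

console.log(summarizeResult(sampleResult));
```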

API Reference

`engine.fetchHTML(url, options?)`

url (string): URL to fetch.
options? (FetchOptions): Optional per-request overrides.
- headers?: Record<string, string>: Custom headers for this specific request.
- markdown?: boolean: (For HybridEngine's Playwright part) Request Markdown conversion.
- fastMode?: boolean: (For HybridEngine's Playwright part) Override fast mode.
- spaMode?: boolean: (For HybridEngine) Override SPA mode behavior for this request.
Returns: Promise<HTMLFetchResult>

Fetches the page and returns its content in result.content as HTML or Markdown, depending on the engine configuration and per-request options; result.contentType indicates which format was returned.

`engine.cleanup()` (`HybridEngine`; a no-op on `FetchEngine`)

Returns: Promise<void>

For HybridEngine, this gracefully shuts down all browser instances managed by its internal PlaywrightEngine. It is crucial to call await engine.cleanup() when you are finished using HybridEngine to release system resources. FetchEngine has a cleanup method for API consistency, but it's a no-op as FetchEngine doesn't manage persistent resources.
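
One way to make sure cleanup always runs is to wrap engine usage in a helper built on try/finally. The sketch below shows the pattern with a hypothetical `withEngine` function and a stub engine, so it runs without a browser; `Cleanable` and `StubEngine` are illustrative names, not library exports.

```typescript
// Illustrative sketch: guarantee cleanup() runs even when the work throws.

interface Cleanable {
  cleanup(): Promise<void>;
}

async function withEngine<E extends Cleanable, T>(
  engine: E,
  work: (engine: E) => Promise<T>
): Promise<T> {
  try {
    return await work(engine);
  } finally {
    await engine.cleanup(); // Runs on success and on failure.
  }
}

/** Stand-in for a real engine so the pattern can be shown without a browser. */
class StubEngine implements Cleanable {
  cleanupCalls = 0;

  async fetchHTML(url: string): Promise<{ url: string; title: string | null }> {
    return { url, title: "Stub page" };
  }

  async cleanup(): Promise<void> {
    this.cleanupCalls += 1;
  }
}

withEngine(new StubEngine(), async (engine) => {
  const result = await engine.fetchHTML("https://example.com");
  console.log(`Title: ${result.title}`);
});
```

With a real HybridEngine, the same helper would take the engine instance in place of the stub, under the assumption that the work is finished once the callback resolves.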

Stealth / Anti-Detection (via `HybridEngine`)

When HybridEngine uses its internal browser capabilities (via PlaywrightEngine), it automatically integrates playwright-extra and its powerful stealth plugin (puppeteer-extra-plugin-stealth). This plugin applies various techniques to make the headless browser controlled by Playwright appear more like a regular human-operated browser, helping to bypass many common bot detection systems.

There are no manual configuration options for stealth; it is enabled by default when HybridEngine uses its browser functionality.

While effective, be aware that no stealth technique is foolproof, and sophisticated websites may still detect automated browsing.

Error Handling

Errors during fetching are typically thrown as instances of FetchError (or its subclasses like FetchEngineHttpError), providing more context than standard Error objects.

FetchError properties:
- message (string): Description of the error.
- code (string | undefined): A specific error code (e.g., ERR_NAVIGATION_TIMEOUT, ERR_HTTP_ERROR, ERR_NON_HTML_CONTENT).
- originalError (Error | undefined): The underlying error that caused this fetch error (e.g., a Playwright error object).
- statusCode (number | undefined): The HTTP status code, if relevant (especially for FetchEngineHttpError).

Common FetchError codes and scenarios:

ERR_HTTP_ERROR: Thrown by FetchEngine for HTTP status codes >= 400. error.statusCode will be set.
ERR_NON_HTML_CONTENT: Thrown by FetchEngine if the content type is not HTML and markdown conversion is not requested.
ERR_PLAYWRIGHT_OPERATION: A general error from HybridEngine's browser mode indicating a failure during a Playwright operation (e.g., page acquisition, navigation, interaction). The originalError property will often contain the specific Playwright error.
ERR_NAVIGATION: Often seen as part of ERR_PLAYWRIGHT_OPERATION's message or in originalError when a Playwright navigation (in HybridEngine's browser mode) fails (e.g., timeout, SSL error).
ERR_MARKDOWN_CONVERSION_NON_HTML: Thrown by HybridEngine (when its Playwright part is active) if markdown: true is requested for a non-HTML content type (e.g., XML, JSON).
ERR_UNSUPPORTED_RAW_CONTENT_TYPE: Thrown by HybridEngine (when its Playwright part is active and markdown is false) if the response has a content type it doesn't support for direct fetching (e.g., images, applications).
ERR_CACHE_ERROR: Indicates an issue with cache read/write operations.
ERR_PROXY_CONFIG_ERROR: Problem with proxy configuration (for HybridEngine's browser mode).
ERR_BROWSER_POOL_EXHAUSTED: If HybridEngine's browser pool cannot provide a page.
Other Scenarios (often wrapped by ERR_PLAYWRIGHT_OPERATION or a generic FetchError when HybridEngine uses its browser mode):
- Network issues (DNS resolution, connection refused).
- Proxy connection failures.
- Page crashes or context/browser disconnections within Playwright.
- Failures during browser launch or management by the pool.
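
When handling these codes in application code, it can help to group them into a few broad categories. The sketch below does that with a hypothetical `categorizeFetchError` function. The code strings come from the list above, but the grouping itself is just one possible choice, and `FetchErrorLike` is a structural stand-in for the library's FetchError so the example stays self-contained.

```typescript
// Illustrative sketch: grouping the documented error codes into categories.

interface FetchErrorLike {
  message: string;
  code?: string;
  statusCode?: number;
}

type ErrorCategory = "http" | "content" | "browser" | "infrastructure" | "unknown";

function categorizeFetchError(error: FetchErrorLike): ErrorCategory {
  switch (error.code) {
    case "ERR_HTTP_ERROR":
      return "http";
    case "ERR_NON_HTML_CONTENT":
    case "ERR_MARKDOWN_CONVERSION_NON_HTML":
    case "ERR_UNSUPPORTED_RAW_CONTENT_TYPE":
      return "content";
    case "ERR_PLAYWRIGHT_OPERATION":
    case "ERR_NAVIGATION":
    case "ERR_BROWSER_POOL_EXHAUSTED":
      return "browser";
    case "ERR_CACHE_ERROR":
    case "ERR_PROXY_CONFIG_ERROR":
      return "infrastructure";
    default:
      return "unknown"; // Missing or unrecognized code.
  }
}

console.log(categorizeFetchError({ message: "Forbidden", code: "ERR_HTTP_ERROR", statusCode: 403 })); // "http"
```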

The HTMLFetchResult object may also carry an error property when the final attempt failed after all retries but an earlier attempt produced some intermediate (potentially unusable) result data. It's generally best to rely on the thrown error for failure handling.

Example:

import { HybridEngine, FetchError } from "@purepageio/fetch-engines";

// Example using HybridEngine to illustrate error handling
const engine = new HybridEngine({ useHttpFallback: false, maxRetries: 1 }); // useHttpFallback applies to the Playwright part

async function fetchWithHandling(url: string) {
  try {
    const result = await engine.fetchHTML(url, { headers: { "X-My-Header": "TestValue" } });
    if (result.error) {
      console.warn(`Fetch for ${url} included non-critical error after retries: ${result.error.message}`);
    }
    console.log(`Success for ${url}! Title: ${result.title}, Content type: ${result.contentType}`);
    // Use result.content
  } catch (error) {
    console.error(`Fetch failed for ${url}:`);
    if (error instanceof FetchError) {
      console.error(`  Error Code: ${error.code || "N/A"}`);
      console.error(`  Message: ${error.message}`);
      if (error.statusCode) {
        console.error(`  Status Code: ${error.statusCode}`);
      }
      if (error.originalError) {
        console.error(`  Original Error: ${error.originalError.name} - ${error.originalError.message}`);
      }
      // Example of specific handling:
      if (error.code === "ERR_PLAYWRIGHT_OPERATION") {
        console.error(
          "  Hint: This was a Playwright operation failure (HybridEngine's browser mode). Check Playwright logs or originalError."
        );
      }
    } else if (error instanceof Error) {
      console.error(`  Generic Error: ${error.message}`);
    } else {
      console.error(`  Unknown error occurred: ${String(error)}`);
    }
  }
}

async function runExamples() {
  await fetchWithHandling("https://nonexistentdomain.example.com"); // Likely DNS or navigation error via FetchEngine or Playwright
  await fetchWithHandling("https://example.com/non_html_resource.json"); // Placeholder non-HTML resource; substitute a real JSON URL to exercise this path
  await engine.cleanup(); // Important for HybridEngine
}

runExamples();

Logging

Currently, the library uses console.warn and console.error for internal warnings (like fallback events) and critical errors. More sophisticated logging options may be added in the future.