@purepageio/fetch-engines v0.4.1
@purepageio/fetch-engines
Fetching web content can be complex. You need to handle static HTML, dynamic JavaScript-driven sites, network errors, retries, caching, and potential bot detection measures. Managing browser automation tools like Playwright adds another layer of complexity with resource pooling and stealth configurations.
@purepageio/fetch-engines simplifies this entire process by providing a set of robust, configurable, and easy-to-use engines for retrieving web page content.
Why use @purepageio/fetch-engines?
- Unified API: Get content from simple or complex sites using the same fetchHTML(url, options?) method.
- Flexible Strategies: Choose the right tool for the job:
  - FetchEngine: Lightweight and fast for static HTML, using the standard fetch API. Ideal for speed and efficiency with content that doesn't require JavaScript rendering. Supports custom headers.
  - HybridEngine: The best of both worlds. It tries FetchEngine first for speed, then automatically falls back to a powerful browser engine (internally, PlaywrightEngine) for reliability on complex, JavaScript-heavy pages. Supports custom headers.
- Robust & Resilient: Built-in caching, configurable retries, and standardized error handling make your fetching logic more dependable.
- Simplified Automation: When HybridEngine uses its browser capabilities (via the internal PlaywrightEngine), it manages browser instances and contexts automatically through efficient pooling and includes integrated stealth measures to bypass common anti-bot systems.
- Content Transformation: Optionally convert fetched HTML directly to clean Markdown content.
- TypeScript Ready: Fully typed for a better development experience.
This package provides a high-level abstraction, letting you focus on using the web content rather than the intricacies of fetching it.
Table of Contents
- Features
- Installation
- Engines
- Basic Usage
- Configuration
- Return Value
- API Reference
- Stealth / Anti-Detection (PlaywrightEngine)
- Error Handling
- Logging
- Contributing
- License
Features
- Multiple Fetching Strategies: Choose between FetchEngine (lightweight fetch) or HybridEngine (smart fallback to a full browser engine).
- Unified API: Simple fetchHTML(url, options?) interface across both primary engines.
- Custom Headers: Easily provide custom HTTP headers for requests in both FetchEngine and HybridEngine.
- Configurable Retries: Automatic retries on failure with customizable attempts and delays.
- Built-in Caching: In-memory caching with configurable TTL to reduce redundant fetches.
- Playwright Stealth: When HybridEngine utilizes its browser capabilities, it automatically integrates playwright-extra and stealth plugins to bypass common bot detection.
- Managed Browser Pooling: Efficient resource management for HybridEngine's browser mode with configurable browser/context limits and lifecycles.
- Smart Fallbacks: HybridEngine uses FetchEngine first, falling back to its internal browser engine only when needed. The internal browser engine can also optionally use a fast HTTP fetch before launching a full browser.
- Content Conversion: Optionally convert fetched HTML directly to Markdown.
- Standardized Errors: Custom FetchError classes provide context on failures.
- TypeScript Ready: Fully typed codebase for enhanced developer experience.
Installation
pnpm add @purepageio/fetch-engines
# or with npm
npm install @purepageio/fetch-engines
# or with yarn
yarn add @purepageio/fetch-engines

If you plan to use the HybridEngine (which internally uses Playwright for advanced fetching), you also need to install Playwright's browser binaries:
pnpm exec playwright install
# or
npx playwright install

Engines
- FetchEngine: Uses the standard fetch API. Suitable for simple HTML pages or APIs returning HTML. Lightweight and fast. This is your go-to for speed and efficiency when JavaScript rendering is not required.
- HybridEngine: A smart combination. It first attempts to fetch content using the lightweight FetchEngine. If that fails for any reason (e.g., network error, non-HTML content, HTTP error like 403), or if spaMode is enabled and an SPA shell is detected, it automatically falls back to using an internal, powerful browser engine (based on Playwright). This provides the speed of FetchEngine for simple sites while retaining the power of a full browser for complex, dynamic websites. This is recommended for most general-purpose fetching tasks.
- PlaywrightEngine (Internal Component): While not recommended for direct use by most users, PlaywrightEngine is the component HybridEngine uses internally for its browser-based fetching. It manages Playwright browser instances, contexts, and stealth features. Users needing direct, low-level control over Playwright might consider it, but HybridEngine offers a more robust and flexible approach for most scenarios.
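The fallback rule described for HybridEngine can be pictured as a small decision function. The sketch below is an illustration only: FastAttempt, looksLikeSpaShell, and needsBrowserFallback are hypothetical names, and the shell heuristic (an empty root/app mount node plus almost no visible text) is an assumption, not the library's documented detection logic.

```typescript
// Hypothetical outcome of the lightweight fetch attempt (not a library type).
type FastAttempt =
  | { kind: "ok"; html: string }
  | { kind: "failed"; reason: string };

// Assumed heuristic: an empty mount node and fewer than 50 characters of visible text.
function looksLikeSpaShell(html: string): boolean {
  const bodyMatch = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
  const body = bodyMatch ? bodyMatch[1] : html;
  const visibleText = body
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop inline and external scripts
    .replace(/<style[\s\S]*?<\/style>/gi, "") // drop styles
    .replace(/<[^>]+>/g, "") // drop remaining tags
    .trim();
  const hasEmptyMount = /<div[^>]+id=["'](root|app)["'][^>]*>\s*<\/div>/i.test(body);
  return hasEmptyMount && visibleText.length < 50;
}

// Fall back on any failure, or when spaMode is on and the HTML looks like a shell.
function needsBrowserFallback(attempt: FastAttempt, spaMode: boolean): boolean {
  if (attempt.kind === "failed") return true;
  return spaMode && looksLikeSpaShell(attempt.html);
}

const exampleShell =
  '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';
console.log(needsBrowserFallback({ kind: "ok", html: exampleShell }, true));
```

With spaMode off, the same shell would be returned as-is, which matches the description that shell detection only applies when spaMode is enabled.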
Basic Usage
FetchEngine
import { FetchEngine } from "@purepageio/fetch-engines";

const engine = new FetchEngine(); // Default: fetches HTML

async function main() {
  try {
    const url = "https://example.com";
    const result = await engine.fetchHTML(url);
    console.log(`Fetched ${result.url} (ContentType: ${result.contentType})`);
    console.log(`Title: ${result.title}`);
    console.log(`Content (HTML): ${result.content.substring(0, 100)}...`);

    // Example fetching Markdown directly via constructor option
    const markdownEngine = new FetchEngine({ markdown: true });
    const mdResult = await markdownEngine.fetchHTML(url);
    console.log(`\nFetched ${mdResult.url} (ContentType: ${mdResult.contentType})`);
    console.log(`Content (Markdown):\n${mdResult.content.substring(0, 300)}...`);
  } catch (error) {
    console.error("Fetch failed:", error);
  }
}

main();

HybridEngine
import { HybridEngine } from "@purepageio/fetch-engines";

// Engine configured to fetch HTML by default for its internal engines
// and provide some custom headers for all requests made by HybridEngine.
const engine = new HybridEngine({
  markdown: false,
  headers: { "X-Global-Custom-Header": "HybridGlobalValue" },
  // Other PlaywrightEngine specific configs can be set here for the fallback mechanism
  // e.g., playwrightLaunchOptions: { args: ["--disable-gpu"] }
});

async function main() {
  try {
    const urlSimple = "https://example.com"; // Simple site, likely handled by FetchEngine
    const urlComplex = "https://quotes.toscrape.com/"; // JS-heavy site, likely requiring Playwright fallback

    // --- Scenario 1: FetchEngine part of HybridEngine handles it ---
    console.log(`\nFetching simple site (${urlSimple}) with per-request headers...`);
    const result1 = await engine.fetchHTML(urlSimple, {
      headers: { "X-Request-Specific": "SimpleRequestValue" },
    });
    // FetchEngine (via HybridEngine) will use:
    // 1. Its base default headers (User-Agent etc.)
    // 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
    // 3. Overridden/augmented by per-request headers ("X-Request-Specific")
    console.log(`Fetched ${result1.url} (ContentType: ${result1.contentType}) - Title: ${result1.title}`);
    console.log(`Content (HTML): ${result1.content.substring(0, 100)}...`);

    // --- Scenario 2: Playwright part of HybridEngine handles it ---
    console.log(`\nFetching complex site (${urlComplex}) requesting Markdown and with per-request headers...`);
    const result2 = await engine.fetchHTML(urlComplex, {
      markdown: true,
      headers: { "X-Request-Specific": "ComplexRequestValue", "X-Another": "ComplexAnother" },
    });
    // PlaywrightEngine (via HybridEngine) will use:
    // 1. Its base default headers (User-Agent etc. if doing HTTP fallback, or for page.setExtraHTTPHeaders)
    // 2. Overridden/augmented by HybridEngine's constructor headers ("X-Global-Custom-Header")
    // 3. Overridden/augmented by per-request headers ("X-Request-Specific", "X-Another")
    // The markdown: true option will be respected by the Playwright part.
    console.log(`Fetched ${result2.url} (ContentType: ${result2.contentType}) - Title: ${result2.title}`);
    console.log(`Content (Markdown):\n${result2.content.substring(0, 300)}...`);
  } catch (error) {
    console.error("Hybrid fetch failed:", error);
  } finally {
    await engine.cleanup(); // Important for HybridEngine
  }
}

main();

Configuration
Engines accept an optional configuration object in their constructor to customize behavior.
FetchEngine
The FetchEngine accepts a FetchEngineOptions object with the following properties:
| Option | Type | Default | Description |
|---|---|---|---|
| markdown | boolean | false | If true, converts fetched HTML to Markdown. contentType in the result will be set to 'markdown'. |
| headers | Record<string, string> | {} | Custom HTTP headers to be sent with the request. These are merged with and can override the engine's default headers. Headers from fetchHTML options take higher precedence. |
// Example: FetchEngine with custom headers and Markdown conversion
const customFetchEngine = new FetchEngine({
  markdown: true,
  headers: {
    "User-Agent": "MyCustomFetchAgent/1.0",
    "X-Api-Key": "your-api-key",
  },
});

Header Precedence for FetchEngine:
- Headers passed in fetchHTML(url, { headers: { ... } }) (highest precedence).
- Headers passed in the FetchEngine constructor, new FetchEngine({ headers: { ... } }).
- Default headers of the FetchEngine (e.g., its default User-Agent) (lowest precedence).
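The precedence order above amounts to a layered merge in which later sources win. The following is a minimal sketch of that idea, not the engine's actual code: mergeHeaders and HeaderMap are hypothetical names, and the default User-Agent value is a placeholder.

```typescript
type HeaderMap = Record<string, string>;

// Later spreads win: engine defaults < constructor headers < per-request headers.
function mergeHeaders(
  engineDefaults: HeaderMap,
  constructorHeaders: HeaderMap = {},
  requestHeaders: HeaderMap = {},
): HeaderMap {
  return { ...engineDefaults, ...constructorHeaders, ...requestHeaders };
}

const effectiveHeaders = mergeHeaders(
  { "User-Agent": "PlaceholderDefaultAgent/0.0" }, // placeholder, not the real default
  { "User-Agent": "MyCustomFetchAgent/1.0", "X-Api-Key": "your-api-key" },
  { "X-Api-Key": "per-request-key" },
);
// The constructor's User-Agent replaces the default, and the
// per-request X-Api-Key replaces the constructor's value.
console.log(effectiveHeaders);
```

This sketch compares header names case-sensitively. HTTP header names are case-insensitive, so a real merge may need to normalize names before combining the layers.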
PlaywrightEngineConfig (Used by HybridEngine)
The HybridEngine constructor accepts a PlaywrightEngineConfig object. These settings configure the underlying FetchEngine and PlaywrightEngine (for fallback scenarios) and the hybrid strategy itself. When using HybridEngine, you are essentially configuring how it will behave and how its internal Playwright capabilities will operate if needed.
Key Options for HybridEngine (from PlaywrightEngineConfig):
| Option | Type | Default | Description |
|---|---|---|---|
| headers | Record<string, string> | {} | Custom HTTP headers. For HybridEngine, these serve as default headers for both its internal FetchEngine (constructor) and PlaywrightEngine (constructor). They can be overridden by headers in HybridEngine.fetchHTML() options. |
| markdown | boolean | false | Default Markdown conversion. For HybridEngine: sets default for internal FetchEngine (constructor) and internal PlaywrightEngine. Can be overridden per-request for the PlaywrightEngine part. |
| useHttpFallback | boolean | true | (For Playwright part) If true, attempts a fast HTTP fetch before using Playwright. Ineffective if spaMode is true. |
| useHeadedModeFallback | boolean | false | (For Playwright part) If true, automatically retries specific failed Playwright attempts in headed (visible) mode. |
| defaultFastMode | boolean | true | If true, initially blocks non-essential resources and skips human simulation. Can be overridden per-request. Effectively false if spaMode is true. |
| simulateHumanBehavior | boolean | true | If true (and not fastMode or spaMode), attempts basic human-like interactions. |
| concurrentPages | number | 3 | Max number of pages to process concurrently within the engine queue. |
| maxRetries | number | 3 | Max retry attempts for a failed fetch (excluding initial try). |
| retryDelay | number | 5000 | Delay (ms) between retries. |
| cacheTTL | number | 900000 | Cache Time-To-Live (ms). 0 disables caching. (15 mins default) |
| spaMode | boolean | false | If true, enables Single Page Application mode. This typically bypasses useHttpFallback, effectively sets fastMode to false, uses more patient load conditions (e.g., network idle), and may apply spaRenderDelayMs. Recommended for JavaScript-heavy sites. |
| spaRenderDelayMs | number | 0 | Explicit delay (ms) after page load events in spaMode to allow for client-side rendering. Only applies if spaMode is true. |
| playwrightLaunchOptions | LaunchOptions | undefined | (For Playwright part) Optional Playwright launch options (from playwright package, e.g., { args: ['--some-flag'] }) passed when a browser instance is created. Merged with internal defaults. |
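To make the cacheTTL semantics in the table concrete, here is a minimal sketch of an in-memory cache with a time-to-live, where a TTL of 0 disables caching. TtlCache is a hypothetical class written for illustration and is not the library's cache. The injectable clock is an assumption added so that expiry can be demonstrated without waiting.

```typescript
// Minimal in-memory TTL cache; `now` is injectable so expiry is easy to demonstrate.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; storedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    if (this.ttlMs <= 0) return undefined; // a TTL of 0 disables caching
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.ttlMs <= 0) return; // nothing is stored when caching is disabled
    this.entries.set(key, { value, storedAt: this.now() });
  }
}

let simulatedClockMs = 0;
const pageCache = new TtlCache<string>(900000, () => simulatedClockMs); // 15 minutes
pageCache.set("https://example.com", "<html>cached page</html>");
simulatedClockMs = 899999;
console.log(pageCache.get("https://example.com") !== undefined); // still cached
simulatedClockMs = 900000;
console.log(pageCache.get("https://example.com") === undefined); // expired
```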
Browser Pool Options (For HybridEngine's internal PlaywrightEngine):
| Option | Type | Default | Description |
|---|---|---|---|
| maxBrowsers | number | 2 | Max concurrent browser instances managed by the pool. |
| maxPagesPerContext | number | 6 | Max pages per browser context before recycling. |
| maxBrowserAge | number | 1200000 | Max age (ms) a browser instance lives before recycling. (20 mins default) |
| healthCheckInterval | number | 60000 | How often (ms) the pool checks browser health. (1 min default) |
| useHeadedMode | boolean | false | Forces the entire pool (for Playwright part) to launch browsers in headed (visible) mode. |
| poolBlockedDomains | string[] | [] | List of domain glob patterns to block requests to (for Playwright part). |
| poolBlockedResourceTypes | string[] | [] | List of Playwright resource types (e.g., 'image', 'font') to block (for Playwright part). |
| proxy | { server: string, ... }? | undefined | Proxy configuration object (see PlaywrightEngineConfig type) (for Playwright part). |
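As a rough illustration of how the age and page limits in the table could drive recycling, the sketch below checks one pooled browser against those limits. Everything here is hypothetical (PooledBrowserState, PoolLimits, recycleReason), and the use of "greater than or equal" at the boundaries is an assumption. The pool's real bookkeeping is internal to the library.

```typescript
// Hypothetical bookkeeping for one pooled browser; field names are illustrative.
interface PooledBrowserState {
  launchedAtMs: number; // when the browser instance was started
  pagesOpenedInContext: number; // pages served by its current context
}

// Subset of the documented pool options.
interface PoolLimits {
  maxBrowserAge: number; // ms
  maxPagesPerContext: number;
}

type RecycleReason = "browser-too-old" | "context-page-limit" | null;

// Returns why the browser should be recycled, or null if it can keep serving pages.
function recycleReason(
  state: PooledBrowserState,
  limits: PoolLimits,
  nowMs: number,
): RecycleReason {
  if (nowMs - state.launchedAtMs >= limits.maxBrowserAge) return "browser-too-old";
  if (state.pagesOpenedInContext >= limits.maxPagesPerContext) return "context-page-limit";
  return null;
}

// Documented defaults: 20 minutes and 6 pages.
const documentedPoolDefaults: PoolLimits = { maxBrowserAge: 1200000, maxPagesPerContext: 6 };
console.log(recycleReason({ launchedAtMs: 0, pagesOpenedInContext: 6 }, documentedPoolDefaults, 60000));
```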
HybridEngine - Configuration Summary & Header Precedence
When you configure HybridEngine using PlaywrightEngineConfig:
- headers: Constructor headers are passed to the internal FetchEngine's constructor and the internal PlaywrightEngine's constructor.
- markdown: Sets the default for both internal engines.
- spaMode: Sets the default for HybridEngine's SPA shell detection and for the internal PlaywrightEngine.
- Other options primarily configure the internal PlaywrightEngine or general retry/caching logic.
Per-request options in HybridEngine.fetchHTML(url, options):
- headers?: Record<string, string>:
  - These headers override any headers set in the HybridEngine constructor.
  - If FetchEngine is used: These headers are passed to FetchEngine.fetchHTML(url, { headers: ... }). FetchEngine then merges them with its constructor headers and base defaults.
  - If PlaywrightEngine (fallback) is used: These headers are merged with HybridEngine constructor headers (options take precedence) and the result is passed to PlaywrightEngine.fetchHTML(). PlaywrightEngine then applies its own logic (e.g., for page.setExtraHTTPHeaders or its HTTP fallback).
- markdown?: boolean:
  - If FetchEngine is used: This per-request option is ignored. FetchEngine uses its own constructor markdown setting.
  - If PlaywrightEngine (fallback) is used: This overrides PlaywrightEngine's default and determines its output format.
- spaMode?: boolean: Overrides HybridEngine's default SPA mode and is passed to PlaywrightEngine if used.
- fastMode?: boolean: Passed to PlaywrightEngine if used; no effect on FetchEngine.
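The per-request rules above can be summarized in a small resolver. This is a sketch of the documented behavior under assumed names (EnginePath, HybridDefaults, PerRequestOptions, resolveForPath are all hypothetical), not code from the library. It covers only headers and markdown.

```typescript
type EnginePath = "fetch" | "playwright";
type HeaderMap = Record<string, string>;

// Constructor-level settings of a hybrid engine (illustrative subset).
interface HybridDefaults {
  markdown: boolean;
  headers: HeaderMap;
}

// Per-request overrides (illustrative subset).
interface PerRequestOptions {
  markdown?: boolean;
  headers?: HeaderMap;
}

function resolveForPath(
  path: EnginePath,
  defaults: HybridDefaults,
  request: PerRequestOptions = {},
): { headers: HeaderMap; markdown: boolean } {
  return {
    // Per-request headers override constructor headers on both paths.
    headers: { ...defaults.headers, ...request.headers },
    // Per-request markdown only takes effect on the Playwright path;
    // the fetch path keeps the constructor setting.
    markdown:
      path === "playwright" ? (request.markdown ?? defaults.markdown) : defaults.markdown,
  };
}

const hybridDefaults: HybridDefaults = {
  markdown: false,
  headers: { "X-Global-Custom-Header": "HybridGlobalValue" },
};
const perRequest: PerRequestOptions = {
  markdown: true,
  headers: { "X-Request-Specific": "ComplexRequestValue" },
};
console.log(resolveForPath("fetch", hybridDefaults, perRequest).markdown); // constructor value
console.log(resolveForPath("playwright", hybridDefaults, perRequest).markdown); // per-request value
```

One practical consequence of the documented rule: if Markdown output must be guaranteed regardless of which path serves the request, setting markdown in the constructor is the safer choice.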
// Example: HybridEngine with SPA mode enabled by default
const spaHybridEngine = new HybridEngine({ spaMode: true, spaRenderDelayMs: 2000 });

async function fetchSpaSite() {
  try {
    // This will use PlaywrightEngine directly if smallblackdots is an SPA shell
    const result = await spaHybridEngine.fetchHTML(
      "https://www.smallblackdots.net/release/16109/corrina-joseph-wish-tonite-lonely"
    );
    console.log(`Title: ${result.title}`);
  } catch (e) {
    console.error(e);
  }
}

Return Value
All fetchHTML() methods return a Promise that resolves to an HTMLFetchResult object:
- content (string): The fetched content, either original HTML or converted Markdown.
- contentType ('html' | 'markdown'): Indicates the format of the content string.
- title (string | null): Extracted page title (from original HTML).
- url (string): Final URL after redirects.
- isFromCache (boolean): True if the result came from cache.
- statusCode (number | undefined): HTTP status code.
- error (Error | undefined): Error object if the fetch failed after all retries. It's generally recommended to rely on catching thrown errors for failure handling.
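For reference, the fields listed above can be written as a TypeScript shape. The interface below is reconstructed from that list for illustration; the package exports its own HTMLFetchResult type, which should be preferred in real code. HTMLFetchResultShape and describeResult are hypothetical names.

```typescript
// Shape reconstructed from the documented field list (illustrative only).
interface HTMLFetchResultShape {
  content: string;
  contentType: "html" | "markdown";
  title: string | null;
  url: string;
  isFromCache: boolean;
  statusCode?: number;
  error?: Error;
}

// One-line summary of a result, handy for logging.
function describeResult(r: HTMLFetchResultShape): string {
  const source = r.isFromCache ? "cache" : "network";
  const statusText = r.statusCode ?? "n/a";
  return `${r.url} [${r.contentType}, ${statusText}, ${source}] ${r.title ?? "(no title)"}`;
}

console.log(
  describeResult({
    content: "<html>...</html>",
    contentType: "html",
    title: "Example Domain",
    url: "https://example.com",
    isFromCache: false,
    statusCode: 200,
  }),
);
```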
API Reference
engine.fetchHTML(url, options?)
- url (string): URL to fetch.
- options? (FetchOptions): Optional per-request overrides.
  - headers?: Record<string, string>: Custom headers for this specific request.
  - markdown?: boolean: (For HybridEngine's Playwright part) Request Markdown conversion.
  - fastMode?: boolean: (For HybridEngine's Playwright part) Override fast mode.
  - spaMode?: boolean: (For HybridEngine) Override SPA mode behavior for this request.
- Returns: Promise<HTMLFetchResult>
Fetches content, returning HTML or Markdown based on configuration/options in result.content with result.contentType indicating the format.
engine.cleanup() (required for HybridEngine; a no-op on FetchEngine)
- Returns: Promise<void>
For HybridEngine, this gracefully shuts down all browser instances managed by its internal PlaywrightEngine. It is crucial to call await engine.cleanup() when you are finished using HybridEngine to release system resources.
FetchEngine has a cleanup method for API consistency, but it's a no-op as FetchEngine doesn't manage persistent resources.
Stealth / Anti-Detection (via HybridEngine)
When HybridEngine uses its internal browser capabilities (via PlaywrightEngine), it automatically integrates playwright-extra and its powerful stealth plugin (puppeteer-extra-plugin-stealth). This plugin applies various techniques to make the headless browser controlled by Playwright appear more like a regular human-operated browser, helping to bypass many common bot detection systems.
There are no manual configuration options for stealth; it is enabled by default when HybridEngine uses its browser functionality.
While effective, be aware that no stealth technique is foolproof, and sophisticated websites may still detect automated browsing.
Error Handling
Errors during fetching are typically thrown as instances of FetchError (or its subclasses like FetchEngineHttpError), providing more context than standard Error objects.
FetchError properties:
- message (string): Description of the error.
- code (string | undefined): A specific error code (e.g., ERR_NAVIGATION_TIMEOUT, ERR_HTTP_ERROR, ERR_NON_HTML_CONTENT).
- originalError (Error | undefined): The underlying error that caused this fetch error (e.g., a Playwright error object).
- statusCode (number | undefined): The HTTP status code, if relevant (especially for FetchEngineHttpError).
Common FetchError codes and scenarios:
- ERR_HTTP_ERROR: Thrown by FetchEngine for HTTP status codes >= 400. error.statusCode will be set.
- ERR_NON_HTML_CONTENT: Thrown by FetchEngine if the content type is not HTML and markdown conversion is not requested.
- ERR_PLAYWRIGHT_OPERATION: A general error from HybridEngine's browser mode indicating a failure during a Playwright operation (e.g., page acquisition, navigation, interaction). The originalError property will often contain the specific Playwright error.
- ERR_NAVIGATION: Often seen as part of ERR_PLAYWRIGHT_OPERATION's message or in originalError when a Playwright navigation (in HybridEngine's browser mode) fails (e.g., timeout, SSL error).
- ERR_MARKDOWN_CONVERSION_NON_HTML: Thrown by HybridEngine (when its Playwright part is active) if markdown: true is requested for a non-HTML content type (e.g., XML, JSON).
- ERR_UNSUPPORTED_RAW_CONTENT_TYPE: Thrown by HybridEngine (when its Playwright part is active and markdown: false) if the response has a content type it doesn't support for direct fetching (e.g., images, applications).
- ERR_CACHE_ERROR: Indicates an issue with cache read/write operations.
- ERR_PROXY_CONFIG_ERROR: Problem with proxy configuration (for HybridEngine's browser mode).
- ERR_BROWSER_POOL_EXHAUSTED: If HybridEngine's browser pool cannot provide a page.
- Other Scenarios (often wrapped by ERR_PLAYWRIGHT_OPERATION or a generic FetchError when HybridEngine uses its browser mode):
  - Network issues (DNS resolution, connection refused).
  - Proxy connection failures.
  - Page crashes or context/browser disconnections within Playwright.
  - Failures during browser launch or management by the pool.
The HTMLFetchResult object may also contain an error property when the final fetch attempt failed after all retries but an earlier attempt produced some intermediate (and potentially unusable) result data. It's generally best to rely on the thrown error for failure handling.
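One way to act on the documented codes is a small lookup that turns a code into a log-friendly hint. The sketch below works on a duck-typed object rather than the real FetchError class so that it stays self-contained. FetchErrorLike and explainFetchError are hypothetical names, and the hint texts are suggestions rather than messages produced by the library.

```typescript
// Duck-typed view of the documented FetchError properties.
interface FetchErrorLike {
  message: string;
  code?: string;
  statusCode?: number;
}

// Maps a documented error code to a short, human-readable hint.
function explainFetchError(err: FetchErrorLike): string {
  switch (err.code) {
    case "ERR_HTTP_ERROR":
      return `Server responded with HTTP ${err.statusCode ?? "unknown"}.`;
    case "ERR_NON_HTML_CONTENT":
      return "Response was not HTML.";
    case "ERR_PLAYWRIGHT_OPERATION":
      return "A browser operation failed; inspect originalError for details.";
    case "ERR_BROWSER_POOL_EXHAUSTED":
      return "No browser page was available from the pool.";
    case "ERR_PROXY_CONFIG_ERROR":
      return "Proxy configuration looks invalid.";
    default:
      return err.message; // unknown or missing code: fall back to the raw message
  }
}

console.log(explainFetchError({ message: "Forbidden", code: "ERR_HTTP_ERROR", statusCode: 403 }));
```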
Example:
import { HybridEngine, FetchError } from "@purepageio/fetch-engines";

// Example using HybridEngine to illustrate error handling
const engine = new HybridEngine({ useHttpFallback: false, maxRetries: 1 }); // useHttpFallback for Playwright part

async function fetchWithHandling(url: string) {
  try {
    const result = await engine.fetchHTML(url, { headers: { "X-My-Header": "TestValue" } });
    if (result.error) {
      console.warn(`Fetch for ${url} included non-critical error after retries: ${result.error.message}`);
    }
    console.log(`Success for ${url}! Title: ${result.title}, Content type: ${result.contentType}`);
    // Use result.content
  } catch (error) {
    console.error(`Fetch failed for ${url}:`);
    if (error instanceof FetchError) {
      console.error(` Error Code: ${error.code || "N/A"}`);
      console.error(` Message: ${error.message}`);
      if (error.statusCode) {
        console.error(` Status Code: ${error.statusCode}`);
      }
      if (error.originalError) {
        console.error(` Original Error: ${error.originalError.name} - ${error.originalError.message}`);
      }
      // Example of specific handling:
      if (error.code === "ERR_PLAYWRIGHT_OPERATION") {
        console.error(
          " Hint: This was a Playwright operation failure (HybridEngine's browser mode). Check Playwright logs or originalError."
        );
      }
    } else if (error instanceof Error) {
      console.error(` Generic Error: ${error.message}`);
    } else {
      console.error(` Unknown error occurred: ${String(error)}`);
    }
  }
}

async function runExamples() {
  await fetchWithHandling("https://nonexistentdomain.example.com"); // Likely DNS or navigation error via FetchEngine or Playwright
  await fetchWithHandling("https://example.com/non_html_resource.json"); // Test with actual JSON URL if available (FetchEngine might handle, or Playwright if complex)
  await engine.cleanup(); // Important for HybridEngine
}

runExamples();

Logging
Currently, the library uses console.warn and console.error for internal warnings (like fallback events) and critical errors. More sophisticated logging options may be added in the future.
Contributing
Contributions are welcome! Please open an issue or submit a pull request on the GitHub repository.
License
MIT