Octopus Scout
Octoryn Web Ingestion Engine: a governed, auditable, AI-native ingestion pipeline for web pages, PDFs, and knowledge workflows.
This first version optimizes for the normal 80% of the web: fetch, optional browser render, extract, normalize to Markdown/JSON, build evidence anchors, cache/version the result, and expose it through API, CLI, queue, and MCP-compatible tooling.
For architecture and technical notes, and a comparison with Firecrawl, see docs/ARCHITECTURE.md.
Quick Start
npm install
npm run playwright:install
npm run dev
curl -s http://localhost:8787/health
curl -s http://localhost:8787/scrape \
-H 'content-type: application/json' \
-d '{"url":"https://example.com","render":"static"}'
The SSRF guard blocks private/loopback hosts by default; set OCTORYN_SCOUT_ALLOW_PRIVATE_HOSTS=true to scrape localhost or other private addresses during local dev and tests.
CLI:
npm run cli -- scrape https://example.com --render static
npm run cli -- sitemap https://example.com/sitemap.xml
npm run cli -- map https://example.com --search docs --limit 100
npm run cli -- crawl https://example.com --max-depth 1 --max-pages 25
npm run cli -- export https://example.com --embed --jsonl
npm run cli -- ingest https://example.com
npm run cli -- search "what is this page about" --top-k 5 --mode hybrid
npm run cli -- extract https://example.com --schema '{"type":"object","properties":{"title":{"type":"string"}}}'
npm run cli -- ingest-site https://example.com --max-depth 1 --max-pages 25
npm run cli -- crawl https://example.com --resume <crawlId>
npm run cli -- crawls
npm run cli -- retention --snapshot-versions 5 --audit-days 90
npm run cli -- refresh --max-age-days 7
npm run cli -- approvals pending
npm run cli -- approve <approval-id> --by you@org.com --note "reviewed"
Use as a library:
The package (octopus-scout) exposes the engine through its dist/index.js entrypoint,
so you can call the pipeline directly instead of going through the HTTP/CLI/MCP surfaces:
import { scrapeUrl, searchKnowledge } from "octopus-scout";
const result = await scrapeUrl({ url: "https://example.com", render: "static" });
console.log(result.extraction.markdown);
const hits = await searchKnowledge({ query: "what is this page about", topK: 5 });
console.log(hits);
Storage
No Docker or external database is required. By default octopus-scout uses an
embedded SQLite database (a single octopus-scout.db file under the data
dir) for snapshots, vectors, audit, approvals, and crawl jobs — clone and run.
Works on any platform. SQLite is provided by the optional native module better-sqlite3; npm install never fails if it can't build, and at runtime octopus-scout uses SQLite when the driver is available and otherwise transparently falls back to the file backend (with a one-time notice). So the clone-and-run promise holds even where no prebuilt binary or build toolchain exists.
- Set OCTORYN_SCOUT_STORAGE_BACKEND=file for the plain-JSON fallback (files under .octoryn-scout/).
- Set DATABASE_URL=postgres://... to use Postgres + pgvector instead, for large corpora or multi-instance deployments.
When REDIS_URL is set, /jobs/scrape and npm run worker use BullMQ for durable queues.
Postgres and Redis are entirely optional; bring them in only when you outgrow the embedded defaults:
docker compose up
API
Ingestion:
- GET /health
- POST /fetch - static fetch with robots and rate-limit policy
- POST /render - Playwright browser render
- POST /scrape - full ingestion pipeline (hash dedup + governance gating; supports pre-scrape actions on browser renders)
- POST /sitemap - sitemap URL extraction
- POST /map - fast site URL discovery (sitemap + root-page links, same-origin/subdomain + path/search filters); cf. Firecrawl /map
- POST /crawl - depth-bounded crawl (BFS, sitemap seed, same-origin + regex filters; resumeCrawlId to continue a checkpointed crawl)
- GET /crawls / GET /crawls/:id - list / read persisted crawl jobs
- POST /jobs/scrape / POST /jobs/crawl / POST /jobs/ingest-site - enqueue durable jobs when Redis is configured
- GET /jobs/:id?queue=scrape|crawl|site|dead - job state / result / failure
Knowledge & retrieval:
- POST /export - chunk + (optionally) embed a page into a RAG document / JSONL
- POST /ingest - scrape → chunk → embed → store into the vector index
- POST /ingest-site - crawl a whole site and index every page into the vector store
- POST /search - retrieval over the knowledge base; mode=vector|lexical|hybrid (default), optional rerank; returns chunks with citation anchors, trust, and governance status
- POST /extract - LLM structured extraction: scrape a URL and return JSON conforming to a supplied JSON Schema (cf. Firecrawl /extract)
- GET /versions?url= - version history (content-hash snapshots) for a URL
- GET /snapshots/:id - read a saved snapshot
Governance & operations:
- GET /governance/approvals?status= - list approval requests (pending/approved/rejected)
- GET /governance/approvals/:id - read one approval request
- POST /governance/approvals/:id/decision - approve/reject (records an audit event)
- GET /audit?target=&action= - query the append-only audit trail
- POST /admin/retention - prune old snapshot versions, audit events, and decided approvals
- POST /admin/refresh - run a staleness sweep: re-ingest snapshots older than a threshold
- GET /events - tail recent internal events (scrape/approval/crawl/ingest)
- GET /webhooks - webhook delivery log (status, attempts, response code)
- GET /metrics (?format=prometheus) - request/status/governance counters + per-domain stats
- GET /ready - readiness probe (checks Redis/Postgres reachability when configured)
Pipeline
URL Input
-> Fetcher / Browser Renderer (pooled) / Crawler (depth-bounded BFS)
-> Content Extractor
-> Markdown / JSON Normalizer
-> Evidence + Citation Builder
-> Governance (trust score, sensitive-domain gating, audit, human approval)
-> Cache / Hash-Dedup / Versioning
-> Knowledge Pipeline (chunking + embedding hook + RAG/JSONL export)
-> Agent / RAG / Workflow (CLI, HTTP API, MCP server)
Knowledge & RAG
POST /export (or cli export) chunks a page's Markdown by heading structure into
token-bounded, overlapping chunks, maps each chunk back to a citation anchor and to
its character offsets in the source Markdown, and emits a RagDocument (or JSONL, one
line per chunk).
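As a rough illustration of heading-based chunking with overlap and source offsets, here is a minimal sketch. It is not the engine's implementation: the Chunk shape and the chunkMarkdown function are hypothetical, and the size bound is measured in characters as a stand-in for tokens.

```typescript
// Hypothetical shape for illustration; the engine's real chunk type may differ.
interface Chunk {
  heading: string; // nearest heading above the chunk ("" before the first heading)
  start: number; // inclusive character offset into the source Markdown
  end: number; // exclusive character offset into the source Markdown
  text: string;
}

// Assumption: characters approximate tokens; a real chunker would count tokens.
function chunkMarkdown(markdown: string, maxChars = 800, overlap = 100): Chunk[] {
  if (overlap >= maxChars) throw new Error("overlap must be smaller than maxChars");

  // 1. Each ATX heading line ("# ...", "## ...") starts a new section.
  const sections: { heading: string; start: number; end: number }[] = [];
  const headingRe = /^#{1,6}[ \t]+(.*)$/gm;
  let current = { heading: "", start: 0 };
  let m: RegExpExecArray | null;
  while ((m = headingRe.exec(markdown)) !== null) {
    if (m.index > current.start) sections.push({ ...current, end: m.index });
    current = { heading: (m[1] ?? "").trim(), start: m.index };
  }
  if (markdown.length > current.start) sections.push({ ...current, end: markdown.length });

  // 2. Slide a bounded window over each section, stepping by (maxChars - overlap)
  //    so that consecutive chunks share `overlap` characters.
  const chunks: Chunk[] = [];
  for (const s of sections) {
    for (let start = s.start; start < s.end; start += maxChars - overlap) {
      const end = Math.min(start + maxChars, s.end);
      chunks.push({ heading: s.heading, start, end, text: markdown.slice(start, end) });
      if (end === s.end) break;
    }
  }
  return chunks;
}
```

Because every chunk records its start and end offsets, markdown.slice(chunk.start, chunk.end) always reproduces the chunk text, which is what makes mapping a chunk back to its source possible.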
POST /ingest runs the full read-path — scrape → chunk → embed → store — into a vector
index, and POST /search retrieves the nearest chunks for a query, each carrying its
source URL, citation anchor, trust score, and governance status. Content blocked by
governance is never indexed; requires_approval content is indexed with its status so
search can filter it (includeBlocked, minTrust, url).
Retrieval (POST /search) supports three modes: vector (embedding cosine),
lexical (SQLite FTS5 by default, in-memory BM25 on the file backend, Postgres full-text on Postgres), and
hybrid (default) which fuses both candidate sets with Reciprocal Rank Fusion.
Results then pass through a pluggable reranker (OCTORYN_SCOUT_RERANK_PROVIDER =
heuristic default | cohere | voyage | none); the heuristic reranker is
deterministic and offline, and Cohere/Voyage activate when their API key is set.
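To make the fusion step concrete, here is a small sketch of Reciprocal Rank Fusion over two ranked ID lists. The function name and the constant k = 60 (a commonly used default) are assumptions for illustration; the engine's actual parameters are not documented here.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d in that list),
// where rank is 1-based. Items missing from a list contribute nothing for that list.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60, // assumed smoothing constant, not taken from the engine
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Example: chunk IDs as returned by a vector search and a lexical search.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // vector candidates, best first
  ["b", "d", "a"], // lexical candidates, best first
]);
console.log(fused.map((r) => r.id)); // [ 'b', 'a', 'd', 'c' ]
```

RRF only looks at ranks, never at raw scores, so cosine similarities and full-text scores do not need to be normalized against each other before fusing.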
POST /extract (or cli extract) performs LLM structured extraction: it scrapes a
URL, then returns JSON conforming to a JSON Schema you supply. The provider is pluggable
(OCTORYN_SCOUT_EXTRACTION_PROVIDER = none default | anthropic | openai): Anthropic
uses the official SDK with claude-opus-4-8 and output_config json-schema output, OpenAI
uses json-schema response_format; governance-blocked pages are skipped, never extracted.
Embeddings are produced through a pluggable EmbeddingProvider
(OCTORYN_SCOUT_EMBEDDING_PROVIDER = stub | voyage | openai): the default is a
deterministic, network-free stub, and Voyage/OpenAI activate when their API key is set
(VOYAGE_API_KEY / OPENAI_API_KEY), falling back to the stub otherwise.
The default embedding provider is a deterministic, NON-SEMANTIC stub: it produces stable offline vectors for testing but does not capture meaning, so vector and hybrid search are only semantically meaningful once you set OCTORYN_SCOUT_EMBEDDING_PROVIDER to voyage or openai (with the matching API key).
The vector store is the embedded SQLite backend (in-process cosine) by default; when DATABASE_URL is set it uses pgvector (a vector(dim) column + HNSW cosine index, <=> distance) and transparently falls back to jsonb + in-process cosine if the vector extension is unavailable.
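For readers unfamiliar with in-process cosine search, the following sketch shows the idea as a brute-force scan. The names cosineSimilarity, nearest, and StoredVector are hypothetical and not part of the package's API.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical row shape: a chunk ID plus its stored embedding.
interface StoredVector {
  id: string;
  embedding: number[];
}

// Brute-force top-k: score every stored vector against the query and keep the best k.
function nearest(query: number[], rows: StoredVector[], k: number): { id: string; similarity: number }[] {
  return rows
    .map((row) => ({ id: row.id, similarity: cosineSimilarity(query, row.embedding) }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, k);
}
```

pgvector's <=> operator returns cosine distance, which is 1 minus this similarity, so ordering by ascending <=> gives the same ranking as ordering by descending similarity.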
Access control
Set OCTORYN_SCOUT_AUTH_MODE (off | write | all) and OCTORYN_SCOUT_API_KEYS
(comma-separated) to require an API key via Authorization: Bearer <key> or
x-api-key. write protects all mutating requests plus the governance-sensitive
/governance and /audit reads; all protects everything except GET /health. With
no keys configured, auth is disabled (backward compatible).
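The rules above can be sketched as a small decision function. This is an illustration only: requiresApiKey and extractApiKey are hypothetical names, "mutating" is assumed to mean any method other than GET, HEAD, or OPTIONS, and /admin is treated as a protected prefix based on the Security section.

```typescript
type AuthMode = "off" | "write" | "all";

// Prefixes whose reads are also protected in write mode (assumed from the README text).
const SENSITIVE_PREFIXES = ["/governance", "/audit", "/admin"];

function requiresApiKey(mode: AuthMode, method: string, path: string, configuredKeys: number): boolean {
  // With no keys configured, auth is disabled regardless of mode.
  if (mode === "off" || configuredKeys === 0) return false;
  const verb = method.toUpperCase();
  // "all" protects everything except the health probe.
  if (mode === "all") return !(verb === "GET" && path === "/health");
  // "write" protects mutating requests plus governance-sensitive reads.
  const mutating = verb !== "GET" && verb !== "HEAD" && verb !== "OPTIONS";
  const sensitive = SENSITIVE_PREFIXES.some((p) => path === p || path.startsWith(p + "/"));
  return mutating || sensitive;
}

// Accept the key from either header; header names are expected lowercased.
function extractApiKey(headers: Record<string, string | undefined>): string | undefined {
  const auth = headers["authorization"];
  if (auth && /^Bearer\s+/i.test(auth)) return auth.replace(/^Bearer\s+/i, "").trim();
  return headers["x-api-key"];
}
```

A client would then send either header with each protected request, for example Authorization: Bearer <key> on POST /scrape.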
Per-domain policy
Point OCTORYN_SCOUT_POLICY_FILE (or drop <dataDir>/policy.json) at a
GovernancePolicy: per-domain action (allow | block | require_approval),
rateLimitMs, and trustOverride. Policy escalation is applied on top of the
keyword/robots decision (it can only tighten, never relax a block), with the
most-specific domain match winning.
{
"version": "v1",
"defaultAction": "allow",
"domains": [{ "domain": "example.com", "action": "require_approval", "rateLimitMs": 3000 }]
}
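A sketch of how "most-specific match wins" and "tighten only" could combine is shown below. The type and function names are hypothetical, the numeric trustOverride is an assumed type, and subdomain suffix matching with longest-domain precedence is one plausible reading of "most-specific", not a confirmed rule.

```typescript
type Action = "allow" | "require_approval" | "block";

interface DomainRule {
  domain: string;
  action: Action;
  rateLimitMs?: number;
  trustOverride?: number; // assumed to be numeric
}

interface GovernancePolicy {
  version: string;
  defaultAction: Action;
  domains: DomainRule[];
}

// Higher number = stricter decision.
const STRICTNESS: Record<Action, number> = { allow: 0, require_approval: 1, block: 2 };

// Assumption: a rule matches its own host and any subdomain; the longest domain wins.
function matchRule(policy: GovernancePolicy, host: string): DomainRule | undefined {
  const h = host.toLowerCase();
  let best: DomainRule | undefined;
  for (const rule of policy.domains) {
    const d = rule.domain.toLowerCase();
    const matches = h === d || h.endsWith("." + d);
    if (matches && (!best || d.length > best.domain.length)) best = rule;
  }
  return best;
}

// The policy may only tighten the base (keyword/robots) decision, never relax it.
function applyPolicy(base: Action, policy: GovernancePolicy, host: string): Action {
  const policyAction = matchRule(policy, host)?.action ?? policy.defaultAction;
  return STRICTNESS[policyAction] > STRICTNESS[base] ? policyAction : base;
}
```

With the example policy above, a page on example.com that the base checks allowed would come out as require_approval, while a page the base checks already blocked would stay blocked.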
Scale & reliability
- Browser pool: a single Chromium instance with a bounded concurrent-page semaphore (OCTORYN_SCOUT_BROWSER_MAX_PAGES) and idle auto-shutdown.
- Distributed rate limiting: per-domain spacing is enforced across processes via Redis when REDIS_URL is set (atomic EVAL), falling back to in-memory; honors robots crawl-delay.
- Dead-letter queue: scrape/crawl jobs that exhaust retries are pushed to a dead-letter queue with a classified failure reason (timeout, robots_blocked, http_error, render_error, unknown).
- Resumable crawls: crawl jobs checkpoint their frontier/visited/results to a store every OCTORYN_SCOUT_CRAWL_CHECKPOINT_EVERY pages; POST /crawl with resumeCrawlId continues from the last checkpoint. GET /crawls lists jobs.
- Whole-site ingestion: POST /ingest-site crawls a site and indexes every allowed page into the vector store in one call; POST /jobs/ingest-site runs it as a durable BullMQ job (poll GET /jobs/:id), with exhausted retries routed to the dead-letter queue.
- Distributed scheduler lock: the scheduled staleness sweep is wrapped in a Redis lock (SET NX PX + Lua compare-del), so multiple instances don't double-sweep; without Redis the lock degrades to a no-op and the sweep runs anyway, as on a single instance.
- Retention: POST /admin/retention (or cli retention) prunes snapshot versions beyond OCTORYN_SCOUT_SNAPSHOT_RETENTION_VERSIONS / _DAYS, audit events past OCTORYN_SCOUT_AUDIT_RETENTION_DAYS, and already-decided approvals (pending approvals are never pruned). 0 = keep everything.
- Observability: GET /metrics (JSON or Prometheus) and GET /ready.
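The in-memory fallback for per-domain spacing can be pictured as a map from domain to the next allowed request time. The sketch below is an assumption about the general technique, not the engine's code; DomainSpacer and reserve are hypothetical names, and the clock is injectable so the behavior is easy to test.

```typescript
// Reserves request slots per domain so consecutive requests are at least spacingMs apart.
class DomainSpacer {
  private readonly nextAllowed = new Map<string, number>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Returns how many milliseconds the caller should wait before sending,
  // and books the following slot for the next caller on the same domain.
  reserve(domain: string, spacingMs: number): number {
    const now = this.now();
    const slot = Math.max(now, this.nextAllowed.get(domain) ?? 0);
    this.nextAllowed.set(domain, slot + spacingMs);
    return slot - now;
  }
}

// Usage: pick the stricter of the policy spacing and the robots crawl-delay,
// then sleep for the returned duration before fetching.
const spacer = new DomainSpacer();
const policySpacingMs = 3000; // e.g. rateLimitMs from a per-domain policy
const crawlDelayMs = 1000; // e.g. parsed from robots.txt (illustrative value)
const waitMs = spacer.reserve("example.com", Math.max(policySpacingMs, crawlDelayMs));
console.log(`wait ${waitMs} ms before requesting example.com`);
```

This only coordinates callers inside one process; sharing the same bookkeeping across processes is what the Redis-backed path is for.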
Eventing & automation
The engine emits internal events (scrape.completed, approval.requested,
approval.decided, crawl.completed, site_ingest.completed) on an in-process bus;
GET /events tails them.
- Webhooks: set OCTORYN_SCOUT_WEBHOOK_URLS (comma list) to forward events as JSON POSTs. When OCTORYN_SCOUT_WEBHOOK_SECRET is set, each delivery carries an x-octoryn-signature: sha256=<hmac> header for verification; deliveries retry with backoff up to OCTORYN_SCOUT_WEBHOOK_MAX_ATTEMPTS and are logged at GET /webhooks. Filter which events fire with OCTORYN_SCOUT_WEBHOOK_EVENTS. This closes the human-in-the-loop: an approval.requested webhook can page a reviewer.
- Scheduled refresh: with OCTORYN_SCOUT_SCHEDULE_ENABLED=true, a background sweep every OCTORYN_SCOUT_REFRESH_INTERVAL_MS re-ingests snapshots older than OCTORYN_SCOUT_STALENESS_MAX_AGE_DAYS (up to OCTORYN_SCOUT_REFRESH_LIMIT per run), keeping the knowledge base fresh. Trigger manually with POST /admin/refresh or cli refresh.
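On the receiving side, a webhook consumer would typically recompute the HMAC and compare it to the x-octoryn-signature header. The sketch below assumes the signature is a hex-encoded HMAC-SHA256 over the raw request body; that encoding is an assumption, so check it against a real delivery before relying on it. signBody and verifySignature are hypothetical helper names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: the header value is "sha256=" followed by the hex HMAC of the raw body.
function signBody(secret: string, rawBody: string): string {
  return "sha256=" + createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Compare in constant time so the check does not leak how many characters matched.
function verifySignature(secret: string, rawBody: string, header: string | undefined): boolean {
  if (!header) return false;
  const expected = Buffer.from(signBody(secret, rawBody), "utf8");
  const received = Buffer.from(header, "utf8");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Verify against the raw body bytes as received, before any JSON parsing, since re-serializing the payload can change whitespace or key order and invalidate the signature.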
Discovery & interaction
- POST /map (or cli map): fast URL discovery for a site. It seeds from sitemaps and the root page's links, dedupes, filters by same-origin/subdomain and includePaths / excludePaths / search, and caps at limit. No per-URL scraping; it's a cheap map.
- Pre-scrape actions: /scrape and /render accept an actions array executed in order on browser renders before the DOM is captured: wait, waitForSelector, click, scroll, type, press, screenshot (per-action screenshots returned in actionScreenshots). Useful for cookie banners, "load more", and tabbed content.
- Stealth-plus: OCTORYN_SCOUT_STEALTH=true renders with comprehensive, hand-rolled (zero-dependency) anti-detection: realistic Chrome UA + UA-CH headers, locale/timezone/viewport, automation launch-flag hiding, and an init script that patches navigator.webdriver / languages / plugins, stubs window.chrome, and spoofs WebGL vendor/renderer + hardwareConcurrency. OCTORYN_SCOUT_EXTRA_HEADERS (JSON) injects custom headers on both static fetch and render.
- BYO proxy: OCTORYN_SCOUT_PROXY_URLS (comma list, http://user:pass@host:port) routes requests through your proxies with round-robin rotation: Playwright-native on the render path, and a hand-rolled node:net / node:tls CONNECT tunnel on the static path (zero dependencies). Bring your own proxies; there is no hosted proxy pool.
- JS-challenge handling: Cloudflare-style "Just a moment" interstitials are detected and waited out by the real browser executing the challenge (no solving). FetchProvider is a pluggable seam (LocalFetchProvider today) for future backends.
- CAPTCHA: a CaptchaSolver seam exists but ships only a NoopCaptchaSolver placeholder (TODO). Solving modern CAPTCHAs requires an external service/model and is intentionally not built in.
- Out of scope (by design): a hosted proxy pool and adversarial-grade anti-bot evasion. The stealth + BYO-proxy + challenge-waiting above handle most of the everyday web; hard targets behind aggressive bot defenses or CAPTCHAs are not guaranteed.
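To illustrate the round-robin proxy rotation mentioned above, here is a minimal sketch that parses a comma-separated list in the http://user:pass@host:port form and hands out proxies in turn. ProxyRotator is a hypothetical name; the returned object uses the server / username / password fields that Playwright's proxy launch option accepts.

```typescript
// Shape compatible with Playwright's `proxy` option (server, username, password).
interface ProxySettings {
  server: string;
  username?: string;
  password?: string;
}

class ProxyRotator {
  private index = 0;
  private readonly proxies: ProxySettings[];

  constructor(commaList: string) {
    this.proxies = commaList
      .split(",")
      .map((s) => s.trim())
      .filter((s) => s.length > 0)
      .map((raw) => {
        const u = new URL(raw);
        return {
          server: `${u.protocol}//${u.host}`, // credentials are kept out of the server URL
          username: u.username ? decodeURIComponent(u.username) : undefined,
          password: u.password ? decodeURIComponent(u.password) : undefined,
        };
      });
  }

  // Returns the next proxy in round-robin order, or undefined if none are configured.
  next(): ProxySettings | undefined {
    if (this.proxies.length === 0) return undefined;
    const proxy = this.proxies[this.index];
    this.index = (this.index + 1) % this.proxies.length;
    return proxy;
  }
}
```

Usage might look like new ProxyRotator(process.env.OCTORYN_SCOUT_PROXY_URLS ?? "").next(), with the result passed to the browser launch options on the render path.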
Security
- SSRF protection: every outbound fetch/render runs through a URL guard that rejects non-http(s) schemes and any host resolving to a private/loopback/link-local address (incl. the cloud metadata IP 169.254.169.254), defeating DNS-rebinding by checking the resolved IP. Override per environment with OCTORYN_SCOUT_ALLOW_PRIVATE_HOSTS=true (for localhost dev/tests) or scope with OCTORYN_SCOUT_HOST_ALLOWLIST / _BLOCKLIST.
- Content limits: responses over OCTORYN_SCOUT_MAX_CONTENT_BYTES are rejected (streamed read aborts early), and only OCTORYN_SCOUT_ALLOWED_CONTENT_TYPES are processed; bodies are charset-decoded from the content-type header.
- API-key auth: see Access control above; /governance, /audit, and /admin reads/writes are protected in write mode.
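The core of an SSRF guard is a check on the resolved address rather than on the hostname. The sketch below covers the scheme check and IPv4 ranges only and is a simplified illustration, not the engine's guard; isAllowedScheme and isBlockedIPv4 are hypothetical names, and a real guard also needs the IPv6 equivalents (for example ::1, fc00::/7, fe80::/10).

```typescript
// Only http: and https: URLs are allowed; unparsable URLs are rejected.
function isAllowedScheme(url: string): boolean {
  try {
    const protocol = new URL(url).protocol;
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// Returns true if a resolved IPv4 address must not be fetched.
// Anything that is not a valid dotted quad is rejected (fail closed).
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".");
  if (parts.length !== 4) return true;
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  if (octets.some((n) => Number.isNaN(n) || n > 255)) return true;
  const a = octets[0] ?? 0;
  const b = octets[1] ?? 0;
  return (
    a === 0 || // "this network"
    a === 10 || // private 10.0.0.0/8
    a === 127 || // loopback 127.0.0.0/8
    (a === 169 && b === 254) || // link-local, incl. metadata IP 169.254.169.254
    (a === 172 && b >= 16 && b <= 31) || // private 172.16.0.0/12
    (a === 192 && b === 168) // private 192.168.0.0/16
  );
}
```

To resist DNS rebinding, the address that passes this check should be the same one the connection is actually made to, rather than letting the HTTP client resolve the hostname a second time.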
MCP server (Claude & Codex)
The engine ships an MCP stdio server exposing eight tools — octoryn_scrape,
octoryn_crawl, octoryn_map, octoryn_export, octoryn_ingest,
octoryn_ingest_site, octoryn_search, octoryn_extract — so agents can scrape, crawl,
map, ingest, semantically search the governed knowledge base, and run structured
extraction directly.
npm run build # produces dist/mcp.js
npx octopus-scout-mcp # or: node dist/mcp.js
Ready-to-paste configs live in docs/mcp/ (Claude Code .mcp.json,
Claude Desktop, Codex config.toml); full guide in docs/MCP.md.
Governance Defaults
The engine respects robots.txt by default, applies per-domain rate limiting, records
content hashes and source metadata, creates citation anchors from extracted Markdown,
and assigns a basic source trust score.
Medical/legal/financial content is flagged as requires_approval: a pending
ApprovalRecord is created and the page waits for a human decision via
/governance/approvals/:id/decision (or cli approve/reject). Every scrape, approval
request, and decision is written to an append-only audit trail (/audit).
OCTORYN_SCOUT_APPROVAL_MODE (off | flag | enforce) controls how strict gating is.
Re-scraping unchanged content is deduplicated by content hash, and each distinct
version is retained as a queryable snapshot (/versions?url=).
These policies are intentionally conservative and easy to replace with stricter Octoryn governance rules.
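Content-hash dedup and versioning can be sketched with an in-memory store: a re-scrape whose hash matches the latest snapshot is reported as a duplicate, and anything else becomes a new version. SnapshotStore and its fields are hypothetical, and both the use of SHA-256 and the comparison against only the latest version are assumptions about the approach.

```typescript
import { createHash } from "node:crypto";

interface Snapshot {
  url: string;
  version: number; // 1-based, increasing per URL
  contentHash: string; // hex SHA-256 of the normalized content (assumed algorithm)
  content: string;
}

class SnapshotStore {
  private readonly byUrl = new Map<string, Snapshot[]>();

  // Saves content for a URL unless it is identical to the latest stored version.
  save(url: string, content: string): { snapshot: Snapshot; deduplicated: boolean } {
    const contentHash = createHash("sha256").update(content, "utf8").digest("hex");
    const versions = this.byUrl.get(url) ?? [];
    const latest = versions[versions.length - 1];
    if (latest && latest.contentHash === contentHash) {
      return { snapshot: latest, deduplicated: true };
    }
    const snapshot: Snapshot = { url, version: versions.length + 1, contentHash, content };
    versions.push(snapshot);
    this.byUrl.set(url, versions);
    return { snapshot, deduplicated: false };
  }

  // Version history for a URL, oldest first (the idea behind GET /versions?url=).
  versions(url: string): Snapshot[] {
    return [...(this.byUrl.get(url) ?? [])];
  }
}
```

Hashing the normalized Markdown rather than the raw HTML means cosmetic changes that do not survive extraction are less likely to produce a new version.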
Contributing
See CONTRIBUTING.md for setup and the local check gate
(typecheck + format:check + test). Security issues: please follow
SECURITY.md rather than opening a public issue.
License
AGPL-3.0-or-later Octoryn. Network use is distribution: if you run a modified version as a service, the AGPL requires you to offer your modified source to its users. This is deliberate — it keeps the engine and its derivatives open.