Version 1.0.0 • Published yesterday • CLI • Licence: MIT • Size: 89 kB

sitecheckup

A command-line tool for checking the integrity of a deployed website. It crawls the live site with a real browser and checks that pages, links, images, video embeds, stylesheets, scripts, fonts, favicons, Open Graph tags, canonical URLs and other resources are actually working, so regressions get caught before or after deployment.

It's framework-agnostic and works against anything reachable over HTTP: sites built by static site generators such as Hugo, Jekyll, Eleventy, Zola, Pelican and Hexo, frameworks that support static export like Next.js, Astro, Gatsby, SvelteKit and Nuxt, server-rendered apps, and plain HTML sites.

The --json report is structured for AI coding agents, not just humans: issues are deduplicated, confidence-scored and grouped into patterns so an agent can pair it with the site's source and go straight to fixing quality issues instead of triaging a raw list. See Fixing issues with an AI coding agent.

Install

npm install -g sitecheckup

or run it without installing:

npx sitecheckup https://example.com

Usage

sitecheckup <url> [options]

<url>              base URL of the deployed site (e.g. https://dsheiko.com)
--depth N          limit crawl depth (default: unlimited)
--concurrency N    parallel pages (default: 2)
--max N            stop after N pages
--timeout N        page timeout in seconds (default: 15)
--wait N           extra ms to wait after page load for JS rendering (default: 0)
--only cats        comma-separated list of checks to run (default: all)
--skip cats        comma-separated list of checks to skip
--no-external      skip external link checking
--json file        write a JSON report to file
--html file        write an HTML report to file
--fail-on level    exit non-zero on "error" (default) or "warning"
--debug            verbose stderr logging
--help, -h         show this help

Available check categories for --only / --skip: http, links, external-links, images, video, css, javascript, fonts, icons, opengraph, canonical, embeds.

Example

sitecheckup https://dsheiko.com --concurrency 4 --json report.json --html report.html

Summary by category
  ✗ HTTP integrity: 1 error (1 page affected)
  ✗ Internal links: 2 errors (2 pages affected)
  ✗ Images: 1 error (1 page affected)

Performance
  Unique assets transferred: 4.8 MB
  CSS: 82 KB
  JS: 640 KB
  Requests: 143
  Median page weight: 118 KB
  Heaviest page: https://dsheiko.com/blog/post (210 KB)
  Largest image: https://dsheiko.com/img/hero.jpg (1.1 MB)

38 pages crawled in 22.4s
Summary: 4 errors, 1 warning, 0 info (7 distinct across all severities)

For full details (per-issue breakdown with affected pages), run again with --json report.json or --html report.html

The console output groups findings by category and shows how many distinct problems there are versus how many pages they affect. The full per-issue detail (message, exact pages, sample context) goes only into the --json/--html reports, which is why the tool nudges you toward one of those when it finds anything.

The process exits with code 1 when there is at least one error-level finding (or at least one warning-level finding when --fail-on warning is set), which makes it a natural fit for a post-deploy CI step.
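The exit-code rule can be written out as a small function. This is only a sketch of the behaviour described above, not sitecheckup's actual source: the function name exitCodeFor is hypothetical, and the errors/warnings fields are borrowed from the summary object of the JSON report.

```javascript
// Sketch of the documented exit-code rule; NOT the tool's real implementation.
// `summary` mirrors the report's summary object, `failOn` mirrors --fail-on.
function exitCodeFor(summary, failOn = "error") {
  // Any error-level finding always fails the run.
  if (summary.errors > 0) return 1;
  // Warnings only fail the run when --fail-on warning is requested.
  if (failOn === "warning" && summary.warnings > 0) return 1;
  return 0;
}

console.log(exitCodeFor({ errors: 0, warnings: 1, info: 0 }));            // 0
console.log(exitCodeFor({ errors: 0, warnings: 1, info: 0 }, "warning")); // 1
```

In a CI step the shell only sees the resulting exit code, so a non-zero value is what stops the pipeline.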

What it checks

HTTP integrity: every page returns 200, with no unexpected 3xx redirects, no 404/500 responses, no circular redirects, no HTTPS certificate errors, and a sane Content-Type (the Content-Type check is skipped for URLs that are non-HTML by extension, like feeds and sitemaps, since those aren't supposed to serve text/html).
Internal links: every <a href> target exists and, when it points at a fragment (/docs#installation), the target page actually has a matching id/name.
External links: every off-site <a href> is classified as OK, Redirected, Dead, Timeout or SSL error. A 401/403/429/999 response is reported separately as low-confidence, since anti-bot/WAF protection commonly returns those for pages that actually exist.
Images: every <img> returns 200, doesn't redirect excessively, and has a non-zero content length. Also flags <img src=""> and <img src="undefined">.
Video: YouTube/Vimeo embeds and self-hosted <video>/<source> elements are checked for basic reachability.
CSS: every stylesheet loads, and url(...) references inside it (background images, @font-face sources) are checked too.
JavaScript: every script file loads, plus uncaught exceptions and console errors captured during the real page render are reported.
Fonts: preloaded .woff, .woff2 and .ttf files return 200.
Favicons & icons: favicon.ico, apple-touch-icon, manifest.webmanifest and mask-icon, when referenced, all return 200.
Open Graph assets: the images referenced by og:image and twitter:image are checked to exist.
Canonical URLs: a canonical tag exists, points at a working page, and isn't duplicated (skipped on non-HTML pages, same as HTTP integrity's Content-Type check).
Performance: unique assets transferred, CSS/JS size, request count, median page weight, the heaviest page and the largest image are reported (informational, never fails the run).
Other embedded resources: PDFs, ZIPs, MP3s, MP4s, SVGs and iframes linked from a page are checked for reachability.
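To make the external-link rule concrete, here is a rough sketch of how a status code could map to a result and a confidence level. It is an illustration under assumptions, not the tool's code: classifyExternalStatus and the "Blocked" label are hypothetical names, and Timeout and SSL error are left out because they come from failed requests rather than from a status code.

```javascript
// Statuses that anti-bot/WAF layers commonly return for pages that do exist.
const BOT_BLOCK_STATUSES = new Set([401, 403, 429, 999]);

// Hypothetical helper: maps an HTTP status to a result label and a confidence.
// "Blocked" is an illustrative label, not one taken from sitecheckup's reports.
function classifyExternalStatus(status) {
  if (BOT_BLOCK_STATUSES.has(status)) {
    return { result: "Blocked", confidence: "low" };
  }
  if (status >= 200 && status < 300) return { result: "OK", confidence: "high" };
  if (status >= 300 && status < 400) return { result: "Redirected", confidence: "high" };
  // Everything else (404, 410, 5xx, ...) is treated as a dead link in this sketch.
  return { result: "Dead", confidence: "high" };
}

console.log(classifyExternalStatus(404)); // { result: 'Dead', confidence: 'high' }
console.log(classifyExternalStatus(403)); // { result: 'Blocked', confidence: 'low' }
```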

Fixing issues with an AI coding agent

The JSON report (--json report.json) is meant to be handed straight to an AI coding agent like Claude Code, so it can find the source of each problem and fix it, rather than you triaging the list by hand.

Each entry in issues is a distinct problem, already deduplicated across pages:

{
  "category": "links",
  "severity": "error",
  "confidence": "high",
  "message": "Broken internal link (404)",
  "target": "https://dsheiko.com/weblog/74",
  "meta": { "status": 404 },
  "pageCount": 2,
  "pages": [ "https://dsheiko.com/weblog/72", "https://dsheiko.com/weblog/73" ],
  "contexts": [
    { "page": "https://dsheiko.com/weblog/72", "text": "Read more", "html": "<a href=\"/weblog/74\">Read more</a>" }
  ]
}

target is the URL of the broken resource (a page, image, script, link, etc.)
pages lists the pages that reference it (capped at 5; pageCount has the true total)
meta carries check-specific detail, e.g. an HTTP status code or the missing anchor
confidence is "high" for deterministic failures (404, missing file) and "low" when the status (401/403/429/999) is one that anti-bot/WAF protection commonly returns for pages that actually exist. "low" only occurs on links and external-links issues; treat such findings as needing manual verification, not as confirmed defects
contexts (link issues only) gives the offending <a> tag's visible text and outerHTML for a sample of the affected pages, so the agent can grep the source for that exact anchor instead of searching the whole content tree
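As an illustration of how a script or agent might read one entry, the sketch below pulls out the parts described above. The field names come from the sample entry; the helper name describeIssue and the sample URLs are hypothetical.

```javascript
// Hypothetical helper: summarises one entry of `issues` for triage.
function describeIssue(issue) {
  return {
    target: issue.target,
    // "low" confidence means a possible bot/WAF block, not a confirmed defect.
    needsManualCheck: issue.confidence === "low",
    // `pages` is capped, so pageCount can exceed the number of listed pages.
    pagesTruncated: issue.pageCount > issue.pages.length,
    // outerHTML snippets that can be searched for in the site source.
    grepHints: (issue.contexts || []).map((context) => context.html),
  };
}

// Made-up entry: 8 affected pages, of which only 5 are listed.
const sample = {
  target: "https://example.com/weblog/74",
  confidence: "high",
  pageCount: 8,
  pages: ["/a", "/b", "/c", "/d", "/e"].map((p) => "https://example.com" + p),
  contexts: [
    { page: "https://example.com/a", text: "Read more", html: '<a href="/weblog/74">Read more</a>' },
  ],
};

console.log(describeIssue(sample));
```

Searching the content tree for a grepHints entry is usually faster than searching for the target URL, since templates often build the URL from parts.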

summary reports both raw and deduplicated counts — errors/warnings/info are per-page occurrence counts, while distinctErrors/distinctWarnings/distinctInfo match issues.length by severity. Use the distinct counts when sizing the fix, and the raw counts when gauging how many pages are affected.
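The relationship between issues and the distinct counts can be checked with a few lines. This sketch assumes the report object exposes issues and summary as top-level keys and that "info" is a valid severity value alongside "error" and "warning"; the function names are hypothetical.

```javascript
// Maps a severity value to the matching distinct-count field in `summary`.
const DISTINCT_FIELD = new Map([
  ["error", "distinctErrors"],
  ["warning", "distinctWarnings"],
  ["info", "distinctInfo"], // assumption: "info" is the third severity value
]);

// Counts entries of `issues` per severity (each entry is one distinct problem).
function distinctCounts(issues) {
  const counts = { distinctErrors: 0, distinctWarnings: 0, distinctInfo: 0 };
  for (const issue of issues) {
    const field = DISTINCT_FIELD.get(issue.severity);
    if (field) counts[field] += 1;
  }
  return counts;
}

// True when the distinct counts in `summary` agree with the `issues` array.
function summaryMatchesIssues(report) {
  const counts = distinctCounts(report.issues);
  return Object.keys(counts).every((field) => report.summary[field] === counts[field]);
}

const report = {
  summary: { distinctErrors: 2, distinctWarnings: 1, distinctInfo: 0 },
  issues: [{ severity: "error" }, { severity: "error" }, { severity: "warning" }],
};
console.log(summaryMatchesIssues(report)); // true
```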

patterns groups distinct issues further, by templating out the variable part of a URL (a slug, an id, a query value). A recurring bug like a broken share link that appears under 200 different slugs shows up in issues as 200 rows, but as one row here with an occurrenceCount — that's usually the more useful place to start:

{
  "category": "external-links",
  "severity": "warning",
  "confidence": "high",
  "messageTemplate": "Redirected: https://example.com/{slug}//share?ref={value}",
  "targetTemplate": "https://example.com/{slug}//share?ref={value}",
  "occurrenceCount": 200,
  "totalPageCount": 340,
  "examples": [ { "target": "https://example.com/my-post//share?ref=fb", "message": "...", "pageCount": 2, "pages": [ "..." ] } ]
}
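One simple way to decide where to start is to sort patterns by how often they recur. The sketch below uses the occurrenceCount and totalPageCount fields from the sample entry; rankPatterns is a hypothetical helper and the ordering is only a suggestion, not something the tool does itself.

```javascript
// Hypothetical helper: most frequent patterns first, ties broken by pages affected.
function rankPatterns(patterns) {
  return [...patterns].sort(
    (a, b) =>
      b.occurrenceCount - a.occurrenceCount ||
      b.totalPageCount - a.totalPageCount
  );
}

// Made-up patterns for illustration.
const ranked = rankPatterns([
  { targetTemplate: "https://example.com/img/{slug}.png", occurrenceCount: 3, totalPageCount: 3 },
  { targetTemplate: "https://example.com/{slug}//share?ref={value}", occurrenceCount: 200, totalPageCount: 340 },
]);
console.log(ranked[0].targetTemplate); // https://example.com/{slug}//share?ref={value}
```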

To fix the findings, point the agent at the report and the site's source:

sitecheckup https://example.com --json report.json
claude "Read report.json and fix the errors. The site source is in ./content,
built with Hugo. Group fixes by category and skip anything under external-links
that's a third-party site being down, not something we control."

Worth telling the agent up front:

Which checks are actionable from source (http, links, images, css, javascript, fonts, icons, opengraph, canonical, embeds) versus ones that just reflect a third party being flaky (external-links, video).
Where the site source lives and what generates it: target and pages are live URLs, not file paths, so the agent needs to map them back to templates or content files itself.
To start from patterns rather than issues when there are any: a handful of patterns usually points at a small number of root-cause bugs, while the raw issues list can bury them under hundreds of near-identical rows.
To skip confidence: "low" findings by default, since those are unverified bot/WAF blocks, not confirmed defects, and only investigate them if asked.
To re-run sitecheckup after fixing to confirm summary.distinctErrors and summary.distinctWarnings dropped, rather than trusting the diff alone.
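Two of the points above lend themselves to a small pre-processing step before the report reaches the agent. The sketch below assumes the report has top-level issues and summary keys; actionableIssues and distinctDelta are hypothetical helper names, not part of sitecheckup.

```javascript
// Categories that mostly reflect third parties rather than the site's own source.
const THIRD_PARTY_CATEGORIES = new Set(["external-links", "video"]);

// Keeps only issues the agent can act on from source, skipping low-confidence ones.
function actionableIssues(report) {
  return report.issues.filter(
    (issue) =>
      issue.confidence !== "low" && !THIRD_PARTY_CATEGORIES.has(issue.category)
  );
}

// Compares two reports (before and after a fix); negative numbers mean improvement.
function distinctDelta(before, after) {
  return {
    errors: after.summary.distinctErrors - before.summary.distinctErrors,
    warnings: after.summary.distinctWarnings - before.summary.distinctWarnings,
  };
}

// Made-up summaries for illustration.
const before = { summary: { distinctErrors: 4, distinctWarnings: 1 } };
const after = { summary: { distinctErrors: 1, distinctWarnings: 1 } };
console.log(distinctDelta(before, after)); // { errors: -3, warnings: 0 }
```

Feeding the agent the filtered list keeps it from spending time on findings it cannot fix, and the delta gives an objective check once it reports being done.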

How it works

The crawler drives a real, headless Chromium instance (via Puppeteer) starting from <url> and follows same-origin <a href> links to discover the rest of the site. Every check — including asset checks like images, fonts and stylesheets — is done through the browser rather than a plain HTTP client, so redirects, TLS handling and cert errors match what an actual visitor would see.
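The link-discovery step can be illustrated without a browser. In the sketch below the hrefs are passed in as plain strings, where the real crawler would read them from the rendered page; sameOriginLinks is a hypothetical name and the exact filtering rules the tool applies are an assumption.

```javascript
// Hypothetical helper: resolves hrefs against the page URL and keeps only
// same-origin targets, with fragments stripped and duplicates removed.
function sameOriginLinks(pageUrl, hrefs) {
  const origin = new URL(pageUrl).origin;
  const found = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, pageUrl); // handles relative and absolute hrefs
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    if (url.origin !== origin) continue; // off-site links, mailto:, tel:, ...
    url.hash = ""; // /docs#a and /docs#b are the same page to crawl
    found.add(url.href);
  }
  return [...found];
}

console.log(
  sameOriginLinks("https://example.com/blog/", [
    "/about",
    "post-1#intro",
    "https://other.example/x",
    "mailto:hello@example.com",
    "/about",
  ])
);
// [ 'https://example.com/about', 'https://example.com/blog/post-1' ]
```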

Known limitations

Video checks are reachability-only: the tool does not call YouTube's or Vimeo's oEmbed API, so it can't tell a deleted video apart from a private one, and it doesn't fetch thumbnails.
CSS/font assets are discovered by parsing the stylesheet text for url(...) references. Assets that a browser would only fetch conditionally (e.g. a @font-face that's never actually used by visible text) may not be checked, since they wouldn't be loaded by a real visitor either.
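For a sense of what parsing stylesheet text for url(...) references involves, here is a rough regex-based sketch. It is not the tool's parser: extractCssUrls is a hypothetical name, and a plain regex like this ignores CSS comments and escaped characters.

```javascript
// Hypothetical helper: finds url(...) references in stylesheet text and
// resolves them against the stylesheet's own URL.
function extractCssUrls(cssText, stylesheetUrl) {
  // Matches url("x"), url('x') and unquoted url(x), with optional whitespace.
  const pattern = /url\(\s*(?:"([^"]*)"|'([^']*)'|([^'")\s][^)\s]*))\s*\)/g;
  const urls = new Set();
  for (const match of cssText.matchAll(pattern)) {
    const raw = match[1] ?? match[2] ?? match[3];
    if (!raw || raw.startsWith("data:")) continue; // inline data needs no request
    try {
      urls.add(new URL(raw, stylesheetUrl).href);
    } catch {
      // ignore references that do not form a valid URL
    }
  }
  return [...urls];
}

const css = `
  @font-face { font-family: "A"; src: url("/fonts/a.woff2") format("woff2"); }
  .hero { background: url( img/hero.jpg ); }
  .dot { background: url('data:image/png;base64,AAAA'); }
`;
console.log(extractCssUrls(css, "https://example.com/css/site.css"));
// [ 'https://example.com/fonts/a.woff2', 'https://example.com/css/img/hero.jpg' ]
```

Relative references resolve against the stylesheet URL rather than the page URL, which is why img/hero.jpg ends up under /css/ here.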

Development

npm install
npm test

Tests run against a local, in-process HTTP fixture server (see test/fixtures/site) so no real network access is required.

License

MIT

Keywords

site-checker health-check link-checker crawler puppeteer post-deploy hugo jekyll eleventy gatsby next astro nuxt sveltekit