npm.io
1.4.1 • Published 2 months agoCLI

sitemap-xml-parser

Licence
MIT
Version
1.4.1
Deps
1
Size
28 kB
Vulns
0
Weekly
0
Stars
2

sitemap-xml-parser

Parses sitemap XML files and returns all listed URLs. Can be used as a CLI tool or a Node.js library.

  • Follows sitemap index files recursively and decompresses gzip automatically
  • Supports custom request headers, concurrency control, and request timeouts
  • CLI: outputs plain URLs, TSV, or JSON with configurable field selection (--fields)
  • CLI: filters URLs by substring or regular expression

Installation

npm install sitemap-xml-parser

CLI

Run without installing via npx:

npx sitemap-xml-parser <url> [options]

Or, after installing globally (npm install -g sitemap-xml-parser):

sitemap-xml-parser <url> [options]

Fetched URLs are printed to stdout, one per line. Errors are printed to stderr. See Options for available flags.

Examples

# Print all URLs
npx sitemap-xml-parser https://example.com/sitemap.xml

# Save URLs to a file, errors to a log
npx sitemap-xml-parser https://example.com/sitemap.xml > urls.txt 2> errors.log

# Count URLs
npx sitemap-xml-parser https://example.com/sitemap.xml --count

# Stop after 100 entries
npx sitemap-xml-parser https://example.com/sitemap.xml --cap 100

# Filter and count
npx sitemap-xml-parser https://example.com/sitemap.xml --filter "blog" --count

# Filter by regular expression
npx sitemap-xml-parser https://example.com/sitemap.xml --filter-regex "blog/[0-9]{4}/"

# Output as TSV (loc, lastmod, changefreq, priority)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv
CLI: getting more fields — discovering what's available and outputting all of them

Some sitemaps include extension fields such as image:image or news:news beyond the standard four. If you need to include those fields in your output, use --list-fields to find out what's available first.

# Output as JSON with all fields (all fields present in the source XML are included by default)
npx sitemap-xml-parser https://example.com/sitemap.xml --format json

# Discover all fields present in a sitemap
npx sitemap-xml-parser https://example.com/sitemap.xml --list-fields

# Output as TSV with custom columns (e.g. image sitemap extension)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv --fields loc,image:image

# Output as TSV with all fields (fetches twice: once to discover fields, once to output)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv \
  --fields "$(npx sitemap-xml-parser https://example.com/sitemap.xml --list-fields | paste -sd, -)"

Options

CLI
Flag Default Description
--delay <ms> 1000 Milliseconds to wait between batches when following a sitemap index. --limit URLs are fetched in parallel per batch; after each batch completes, the process waits --delay ms before starting the next. Set to 0 to disable.
--limit <n> 10 Number of child sitemaps to fetch concurrently per batch.
--timeout <ms> 30000 Milliseconds before a request is aborted.
--cap <n> Stop collecting after this many URL entries. Useful for sampling large sitemaps.
--header <Name: Value> Add a request header. Repeatable. Single: --header "User-Agent: MyBot/1.0". Multiple: --header "User-Agent: MyBot/1.0" --header "Authorization: Bearer token"
--filter <str> Only output URLs whose loc contains the given string (substring match). Can be combined with --count or --format.
--filter-regex <regex> Only output URLs whose loc matches the given regular expression. Invalid patterns exit non-zero. Can be combined with --count or --format.
--format <fmt> Output format: tsv prints a header row followed by one tab-separated row per entry; json outputs a JSON array of entry objects including all fields from the source XML.
--fields <f1,f2,...> Comma-separated list of fields to include in the output. Requires --format. For tsv, defaults to loc,lastmod,changefreq,priority. For json, defaults to all fields. Nested values are serialized as JSON in TSV output.
--list-fields Print all field names found across every entry, one per line. Scans the entire sitemap and outputs the union of all keys seen. Useful for discovering available fields before using --fields. Compatible with --filter and --filter-regex. Cannot be combined with --format, --fields, --cap, or --count.
--count Print only the total number of URLs.
Library
Option Type Default Description
delay number 1000 Same as --delay.
limit number 10 Same as --limit.
timeout number 30000 Same as --timeout.
cap number Same as --cap.
headers object Key-value map of request headers. Same as repeated --header.
onError function Called as onError(url, error) when a fetch or parse fails. The entry is skipped regardless.
onEntry function Called as onEntry(entry) each time a URL entry is parsed. entry has the same shape as the objects returned by fetch().

Features

  • Follows Sitemap Index files recursively, including nested indexes (Index within an Index)
  • Automatically decompresses gzip: supports both .gz URLs and Content-Encoding: gzip responses
  • Batch processing: fetches limit child sitemaps in parallel per batch, then waits delay ms after each batch completes
  • Automatically follows redirects (301/302/303/307/308) up to 5 hops; errors beyond that are reported via onError. Custom request headers are forwarded only when the redirect stays on the same origin (same scheme, host, and port); they are stripped on cross-origin redirects.

Usage

const SitemapXMLParser = require('sitemap-xml-parser');

const parser = new SitemapXMLParser('https://example.com/sitemap.xml');

(async () => {
    const urls = await parser.fetch();
    urls.forEach(entry => {
        console.log(entry.loc);
    });
})();
Custom headers
const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
    headers: {
        'User-Agent': 'MyBot/2.0',
        'Authorization': 'Bearer my-token',
    },
});
Error handling with onError

Failed URLs (network errors, non-2xx responses, malformed XML) are skipped by default. Provide an onError callback to inspect them:

const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
    onError: (url, err) => {
        console.error(`Skipped ${url}: ${err.message}`);
    },
});

Return value

fetch() resolves to an array of URL entry objects. Each object contains all fields present in the source XML — no field selection is applied at the library level:

[
  {
    loc:        'https://example.com/page1',
    lastmod:    '2024-01-01',
    changefreq: 'weekly',
    priority:   '0.8',
  },
  // ...
]

loc is always a string. Standard fields (lastmod, changefreq, priority) are strings when present, or undefined when absent from the source XML.

Sitemap extension fields (e.g. image:image, news:news, video:video) are also preserved as-is when present in the source XML. Their values reflect the structure parsed by the underlying XML parser — nested elements become objects.