0.11.0 • Published 4 years ago

getsitemap v0.11.0

Weekly downloads: 4
License: ISC
Repository: github
Last release: 4 years ago

getsitemap

getsitemap is a library that takes a domain name as input and returns a stream of objects, one for each <url> entry in the <urlset> elements of the website's sitemap.xml file(s). Use it to obtain a list of pages to crawl from a website. Each object in the stream follows the sitemap protocol:

{
  url: "http://newyorktimes.com", // Always present
  lastmod: "2019-10-01" // Optional
}
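Because lastmod is optional, code that consumes these objects has to decide what to do when it is missing. The following is a minimal sketch, not part of getsitemap: the helper name modifiedSince is hypothetical, and keeping entries without a usable lastmod is an assumption made for illustration.

```javascript
// Hypothetical helper, not part of getsitemap.
// Keeps entries whose lastmod is on or after `since` (a Date or a
// millisecond timestamp). Entries with a missing or unparseable lastmod
// are kept, because their age is unknown (an assumption).
function modifiedSince(entries, since) {
  const cutoff = since instanceof Date ? since.getTime() : since
  return entries.filter((entry) => {
    if (!entry.lastmod) return true
    const modified = Date.parse(entry.lastmod)
    return Number.isNaN(modified) || modified >= cutoff
  })
}
```

Passing Date.parse("2019-10-01") as since would keep an entry dated 2019-10-02, drop one dated 2019-09-30, and keep any entry that has no lastmod.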

See Turbo Crawl for a powerful web crawling library based on getsitemap.

Usage

The example below streams the URL set to a file in ndjson format, where each line is a JSON object. The file as a whole is not valid JSON, but the format is useful for reading large files line by line.

const fs = require("fs")
const getsitemap = require("getsitemap")

const url = "theintercept.com"
const since = Date.parse("2019-10-01")

const mapper = new getsitemap.SiteMapper(url)
const sitemapstream = mapper.map(since)

// Option 1: pipe the stream to an ndjson file
const file = fs.createWriteStream("./intercept.ndjson")
sitemapstream.pipe(file)

/* OR */

// Option 2: handle each object as it arrives
sitemapstream.on("data", (obj) => {
  // obj.url, obj.lastmod
})
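To show why the ndjson format suits large files, here is a minimal sketch of reading such a file back one line at a time with Node's built-in readline module. The function name readNdjson and its callback parameter are hypothetical and not part of getsitemap.

```javascript
const fs = require("fs")
const readline = require("readline")

// Hypothetical helper, not part of getsitemap.
// Reads an ndjson file line by line, parses each non-blank line as JSON,
// and passes the result to onEntry. Resolves with the number of entries read.
async function readNdjson(path, onEntry) {
  const lines = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity, // treat \r\n as a single line break
  })
  let count = 0
  for await (const line of lines) {
    if (line.trim() === "") continue // skip blank lines
    onEntry(JSON.parse(line))
    count++
  }
  return count
}

// Example call (assumes ./intercept.ndjson exists):
// readNdjson("./intercept.ndjson", (entry) => console.log(entry.url))
```

Only one line is parsed at a time, so a large file never needs to be parsed as a single JSON document.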

Configuration

getsitemap uses hittp under the hood to make HTTP requests. By default, hittp delays requests to the same host by 3 seconds so as not to overload the server. getsitemap can be configured in the same way as hittp:

const fs = require("fs")
const getsitemap = require("getsitemap")

const url = "theintercept.com"
const since = Date.parse("2019-10-01")
const options = { delay_ms: 3000, cachePath: "./.hittp/cache" } // Default

const mapper = new getsitemap.SiteMapper()
mapper.map(url, since, options).then((sitemapstream) => {
  const file = fs.createWriteStream(`./intercept.ndjson`)
  sitemapstream.pipe(file)
})

Don't forget to add your cache path to .gitignore! The default path is ./.hittp
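To make the idea of a per-host delay concrete, here is a minimal sketch of how requests to the same host could be spaced out. It is not hittp's actual implementation: the class HostScheduler and its reserve method are hypothetical, and only the 3000 ms default comes from the text above.

```javascript
// Hypothetical per-host scheduler; NOT hittp's actual implementation.
class HostScheduler {
  constructor(delayMs = 3000) {
    this.delayMs = delayMs
    this.nextFree = new Map() // host -> earliest time (ms) the next request may start
  }

  // Reserves the next request slot for the URL's host and returns how many
  // milliseconds the caller should wait, measured from `now`.
  reserve(url, now = Date.now()) {
    const host = new URL(url).host
    const earliest = this.nextFree.get(host) ?? now
    const start = Math.max(earliest, now)
    this.nextFree.set(host, start + this.delayMs)
    return start - now
  }
}

// Example: wait before each request to the same host.
// const scheduler = new HostScheduler(3000)
// const waitMs = scheduler.reserve("https://theintercept.com/sitemap.xml")
// setTimeout(() => { /* make the request here */ }, waitMs)
```

Each host has its own slot, so a delay on one host does not hold up requests to another.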

Versions

0.11.0 • 4 years ago
0.10.0 • 4 years ago
0.9.2 • 4 years ago
0.9.1 • 4 years ago
0.9.0 • 5 years ago
0.8.6 • 5 years ago
0.8.5 • 5 years ago
0.8.4 • 5 years ago
0.8.3 • 5 years ago
0.8.2 • 5 years ago
0.8.1 • 5 years ago
0.8.0 • 5 years ago
0.7.14 • 5 years ago
0.7.13 • 5 years ago
0.7.12 • 5 years ago
0.7.11 • 5 years ago
0.7.10 • 5 years ago
0.7.9 • 5 years ago
0.7.8 • 5 years ago
0.7.7 • 5 years ago
0.7.6 • 5 years ago
0.7.5 • 5 years ago
0.7.4 • 5 years ago
0.7.3 • 5 years ago
0.7.2 • 5 years ago
0.7.1 • 5 years ago
0.7.0 • 5 years ago
0.6.2 • 5 years ago
0.6.1 • 5 years ago
0.6.0 • 5 years ago
0.5.0 • 5 years ago
0.4.0 • 5 years ago
0.3.1 • 5 years ago
0.3.0 • 5 years ago
0.2.1 • 5 years ago
0.2.0 • 5 years ago
0.1.1 • 5 years ago
0.1.0 • 5 years ago