@headwall/url-crawler

head-spider

URL crawler and web content analyser.

The idea is to create an instance of the crawler, then add one or more URLs to it, along with one or more response/document processors. When the crawler has no more URLs in its queue, it has finished.
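As a rough illustration of that queue-driven flow, here is a minimal sketch in plain JavaScript. It is not this package's API: `crawl`, `fakeFetch`, `countWords` and the `example.test` URLs are hypothetical names made up for the example, and the fetcher is a stub that reads from an in-memory object instead of the network.

```javascript
// Hypothetical sketch of a queue-driven crawl loop.
// None of these names come from @headwall/url-crawler.
async function crawl(startUrls, fetchPage, processors) {
  const queue = [...startUrls];     // URLs waiting to be fetched
  const seen = new Set(startUrls);  // URLs already queued, so nothing is fetched twice
  const results = [];

  while (queue.length > 0) {
    const url = queue.shift();
    const page = await fetchPage(url); // assumed to resolve to { body, links }

    // Run every processor over the fetched page, collecting data as we go.
    const data = {};
    for (const processor of processors) {
      processor(page, data);
    }
    results.push({ url, data });

    // Queue any links that have not been seen before.
    for (const link of page.links) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }

  // The queue is empty, so the crawl has finished.
  return results;
}

// A stub "site" so the sketch runs without network access.
const site = {
  "https://example.test/": { body: "home", links: ["https://example.test/about"] },
  "https://example.test/about": { body: "about us", links: ["https://example.test/"] },
};
const fakeFetch = async (url) => site[url];

// A trivial processor: count the words in the page body.
const countWords = (page, data) => {
  data.words = page.body.split(/\s+/).length;
};

crawl(["https://example.test/"], fakeFetch, [countWords]).then((results) => {
  console.log(results);
});
```

The `seen` set is what lets the loop terminate even though the two stub pages link to each other.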

This can form the basis of a technical SEO crawler, or any other content crawler/scraper.

When a page has been fetched, a series of "processors" are run over it to extract structured data.

After all the processors have finished, the "analysers" are run. These can look for things like missing IMG alt text, out-of-sequence heading elements, or whatever else you want.
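The two stages might look something like the sketch below. It is only an illustration of the processor-then-analyser idea, not the package's real interface: the function names and the shape of `data` are assumptions, and the HTML is picked apart with naive regular expressions to keep the example dependency-free, where a real crawler would use a proper DOM parser.

```javascript
// Hypothetical sketch; these names are not the package's actual API.

// Processors extract structured data from the fetched HTML.
const processors = [
  function extractImages(html, data) {
    data.images = [...html.matchAll(/<img\b[^>]*>/gi)].map(([tag]) => {
      const src = /\bsrc\s*=\s*"([^"]*)"/i.exec(tag);
      const alt = /\balt\s*=\s*"([^"]*)"/i.exec(tag);
      // null means the attribute is absent; "" means it is present but empty.
      return { src: src ? src[1] : null, alt: alt ? alt[1] : null };
    });
  },
  function extractHeadings(html, data) {
    // Record heading levels (1 to 6) in document order.
    data.headings = [...html.matchAll(/<h([1-6])\b/gi)].map((m) => Number(m[1]));
  },
];

// Analysers inspect the extracted data and report issues as strings.
const analysers = [
  function missingAltText(data) {
    return data.images
      .filter((img) => img.alt === null)
      .map((img) => `Image ${img.src} has no alt attribute`);
  },
  function headingOrder(data) {
    const issues = [];
    let previous = 0; // 0 means "no heading seen yet"
    for (const level of data.headings) {
      if (level > previous + 1) {
        const before = previous === 0 ? "the start of the page" : `h${previous}`;
        issues.push(`Heading h${level} follows ${before} and skips a level`);
      }
      previous = level;
    }
    return issues;
  },
];

// Run all processors first, then all analysers over the collected data.
function analysePage(html) {
  const data = {};
  for (const processor of processors) {
    processor(html, data);
  }
  return analysers.flatMap((analyser) => analyser(data));
}

console.log(analysePage('<h1>Title</h1><img src="a.png"><h3>Deep</h3>'));
```

Keeping extraction and checking separate means several analysers can share one processor's output without re-parsing the page.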

You can easily add your own processors and analysers.

This is still in early development: I'm working on the test suite and setting up some basic document processors.

You can run the test suite with npm run test.

Keywords: spider, crawl, web

Dependencies: jquery, jsdom, md5, mime-types, superagent
