0.3.1 • Published 2 years ago

@markoskon/scrape-url-node v0.3.1

Weekly downloads
-
License
MIT
Repository
github
Last release
2 years ago

Scrape URLs

A node script that scrapes from an initial URL linked URLs, under the same origin, recursively. It then reports in the console:

  • The unique characters from the HTML pages it visited.
  • The URLs it discovered and visited. In other words, you get a list with all the linked pages under the same origin.
  • A list of URLs it discovered and not visited because they are not from the same origin. In other words, you get a list with all the URLs you link to (this list also includes URLs with hashes and non http/https, though).

Install

npm i -g @markoskon/scrape-url-node

Usage

scrape [option..] <url>
# get the help menu
scrape --help

Future updates

There's a (small) chance to extend it to search for other information, not only URL lists and characters. This is because the ability to recursively discover URLs from a root URL seems useful (I'm sure there are many programs that already do this, but, oh well). I could implement this by parsing additional options in the CLI, or even with a callback, if I export a function that you then can import in your code.

For the time being, I'm not considering to execute JavaScript with a tool like Puppeteer, because the information I want doesn't require it.