@jldb/web-to-md v0.1.0
# Web-to-MD: Your Friendly Neighborhood Web Crawler and Markdown Converter

Welcome to Web-to-MD, the CLI tool that turns websites into your personal Markdown library!
## Why Web-to-MD?

Ever wished you could magically transform entire websites into neatly organized Markdown files? Well, wish no more! Web-to-MD is here to save the day (and your sanity)!
## Features That'll Make You Go "Wow!"

- Crawls websites like a pro detective
- Magically transforms HTML into beautiful Markdown
- Resumes interrupted crawls (because life happens!)
- Creates separate Markdown files or one big book of knowledge
- Shows fancy progress bars (because who doesn't love those?)
- Respects rate limits (we're polite crawlers here!)
- Preserves directory structure (if you're into that sort of thing)
- Handles authentication gracefully (no trespassing allowed!)
- Multi-worker support (because teamwork makes the dream work!)
- Smart content change detection (no need to crawl what hasn't changed!)
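The change-detection idea above can be sketched in a few lines: hash each page's content and skip any page whose hash matches the previous crawl. This is a minimal illustration of the technique, not Web-to-MD's actual implementation; the function names here are hypothetical.

```typescript
import { createHash } from "crypto";

// Hash a page's HTML so it can be cheaply compared against the last crawl.
function contentHash(html: string): string {
  return createHash("sha256").update(html).digest("hex");
}

// A page needs re-crawling if there is no previous hash or the content changed.
function hasChanged(html: string, previousHash?: string): boolean {
  return previousHash === undefined || contentHash(html) !== previousHash;
}
```

Storing one hash per URL (for example, in the crawl's saved state) is enough to decide on the next run which pages can be skipped entirely.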
## Installation

1. Clone this repo (it won't bite, promise!)
2. Run `npm install` (sit back and watch the magic happen)
3. Run `npm run build` to compile the TypeScript code
## Usage

Fire up Web-to-MD with this incantation:

```bash
npm start -- -u <url> -o <output_directory> [options]
```
## Options (Mix and Match to Your Heart's Content)

- `-u, --url <url>`: The URL of your web treasure trove (required)
- `-o, --output <output>`: Where to stash your Markdown gold (required)
- `-c, --combine`: Merge all pages into one massive scroll of knowledge
- `-e, --exclude <paths>`: Comma-separated list of paths to skip (shh, we won't tell)
- `-r, --rate <rate>`: Max pages per second (default: 5, for the speed demons)
- `-d, --depth <depth>`: How deep should we dig? (default: 3, watch out for dragons)
- `-m, --max-file-size <size>`: Max file size in MB for combined output (default: 2)
- `-n, --name <name>`: Name your combined file (get creative!)
- `-p, --preserve-structure`: Keep the directory structure (for the neat freaks)
- `-t, --timeout <timeout>`: Timeout in seconds for page navigation (default: 3.5)
- `-i, --initial-timeout <initialTimeout>`: Initial timeout in seconds for the first page (default: 60)
- `-re, --retries <retries>`: Number of retries for initial page load (default: 3)
- `-w, --workers <workers>`: Number of concurrent workers (default: 1, for the multitaskers)
## Example (Because We All Need a Little Guidance)

```bash
npm start -- -u https://docs.example.com -o ./my_docs -c -d 5 -r 3 -n "ExampleDocs" -w 3
```

This will:

- Crawl https://docs.example.com
- Save Markdown files to ./my_docs
- Combine all pages into one file
- Crawl up to 5 levels deep
- Respect a rate limit of 3 pages per second
- Name the combined file "ExampleDocs"
- Use 3 concurrent workers for faster crawling
## Config Magic: Resuming and Customizing Your Crawls

Web-to-MD comes with a nifty config feature that lets you resume interrupted crawls and customize your crawling experience. Here's how it works:
### Config File

After a crawl (complete or interrupted), Web-to-MD saves a `config.json` file in your output directory. This file contains all the settings and state information from your last crawl.
### Resuming a Crawl

To resume an interrupted crawl, simply run Web-to-MD with the same output directory. The tool will automatically detect the `config.json` file and pick up where it left off.
### Customizing Your Crawl

You can manually edit the `config.json` file to customize your next crawl. Here are the available options and their default values:
| Option | Description | Default Value |
|---|---|---|
| `url` | Starting URL for the crawl | (required) |
| `outputDir` | Output directory for Markdown files | (required) |
| `excludePaths` | Paths to exclude from crawling | `[]` |
| `maxPagesPerSecond` | Maximum pages to crawl per second | `5` |
| `maxDepth` | Maximum depth to crawl | `3` |
| `maxFileSizeMB` | Maximum file size in MB for combined output | `2` |
| `combine` | Combine all pages into a single file | `false` |
| `name` | Name for the combined output file | `undefined` |
| `preserveStructure` | Preserve directory structure | `false` |
| `timeout` | Timeout in seconds for page navigation | `3.5` |
| `initialTimeout` | Initial timeout in seconds for the first page load | `60` |
| `retries` | Number of retries for initial page load | `3` |
| `numWorkers` | Number of concurrent workers | `1` |
You can modify these settings in the `config.json` file to customize your crawl. For example:

```json
{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
```
### Example Workflow

1. Start an initial crawl:

   ```bash
   npm start -- -u https://docs.example.com -o ./my_docs -d 3 -w 2
   ```

2. If the crawl is interrupted, Web-to-MD will save the state in `./my_docs/config.json`. To resume, simply run:

   ```bash
   npm start -- -o ./my_docs
   ```

3. To customize, edit `./my_docs/config.json` to change the crawl settings as needed. For example:

   ```json
   {
     "settings": {
       "url": "https://docs.example.com",
       "outputDir": "./my_docs",
       "excludePaths": ["/blog", "/forum"],
       "maxPagesPerSecond": 5,
       "maxDepth": 4,
       "numWorkers": 3
     }
   }
   ```

4. Run the crawl again with the updated config:

   ```bash
   npm start -- -o ./my_docs
   ```
This workflow allows you to fine-tune your crawls and easily pick up where you left off!
## Contributing

Got ideas? Found a bug? We're all ears! Open an issue or send a pull request. Let's make Web-to-MD even more awesome together!
## License

ISC (It's So Cool) License
## Acknowledgements

A big thank you to all the open-source projects that made Web-to-MD possible. You rock!

Now go forth and crawl some docs!