@jldb/web-to-md v0.1.0
# Web-to-MD: Your Friendly Neighborhood Web Crawler and Markdown Converter

Welcome to Web-to-MD, the CLI tool that turns websites into your personal Markdown library!
## Why Web-to-MD?

Ever wished you could magically transform entire websites into neatly organized Markdown files? Well, wish no more! Web-to-MD is here to save the day (and your sanity)!
## Features That'll Make You Go "Wow!"

- Crawls websites like a pro detective
- Magically transforms HTML into beautiful Markdown
- Resumes interrupted crawls (because life happens!)
- Creates separate Markdown files or one big book of knowledge
- Shows fancy progress bars (because who doesn't love those?)
- Respects rate limits (we're polite crawlers here!)
- Preserves directory structure (if you're into that sort of thing)
- Handles authentication gracefully (no trespassing allowed!)
- Multi-worker support (because teamwork makes the dream work!)
- Smart content change detection (no need to crawl what hasn't changed!)
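The change-detection idea above can be sketched in a few lines: hash each page's content and skip any page whose hash matches the previous crawl. This is a minimal illustration of the technique, not Web-to-MD's actual implementation; the function names here are hypothetical.

```typescript
import { createHash } from "crypto";

// Hash a page's HTML so it can be cheaply compared against the last crawl.
function contentHash(html: string): string {
  return createHash("sha256").update(html).digest("hex");
}

// A page needs re-crawling if there is no previous hash or the content changed.
function hasChanged(html: string, previousHash?: string): boolean {
  return previousHash === undefined || contentHash(html) !== previousHash;
}
```

Storing one hash per URL (for example, in the crawl's saved state) is enough to decide on the next run which pages can be skipped entirely.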
## Installation

1. Clone this repo (it won't bite, promise!)
2. Run `npm install` (sit back and watch the magic happen)
3. Run `npm run build` to compile the TypeScript code
## Usage

Fire up Web-to-MD with this incantation:

```bash
npm start -- -u <url> -o <output_directory> [options]
```
## Options (Mix and Match to Your Heart's Content)

- `-u, --url <url>`: The URL of your web treasure trove (required)
- `-o, --output <output>`: Where to stash your Markdown gold (required)
- `-c, --combine`: Merge all pages into one massive scroll of knowledge
- `-e, --exclude <paths>`: Comma-separated list of paths to skip (shh, we won't tell)
- `-r, --rate <rate>`: Max pages per second (default: 5, for the speed demons)
- `-d, --depth <depth>`: How deep should we dig? (default: 3, watch out for dragons)
- `-m, --max-file-size <size>`: Max file size in MB for combined output (default: 2)
- `-n, --name <name>`: Name your combined file (get creative!)
- `-p, --preserve-structure`: Keep the directory structure (for the neat freaks)
- `-t, --timeout <timeout>`: Timeout in seconds for page navigation (default: 3.5)
- `-i, --initial-timeout <initialTimeout>`: Initial timeout in seconds for the first page (default: 60)
- `-re, --retries <retries>`: Number of retries for initial page load (default: 3)
- `-w, --workers <workers>`: Number of concurrent workers (default: 1, for the multitaskers)
## Example (Because We All Need a Little Guidance)

```bash
npm start -- -u https://docs.example.com -o ./my_docs -c -d 5 -r 3 -n "ExampleDocs" -w 3
```

This will:

- Crawl https://docs.example.com
- Save Markdown files to ./my_docs
- Combine all pages into one file
- Crawl up to 5 levels deep
- Respect a rate limit of 3 pages per second
- Name the combined file "ExampleDocs"
- Use 3 concurrent workers for faster crawling
## Config Magic: Resuming and Customizing Your Crawls

Web-to-MD comes with a nifty config feature that lets you resume interrupted crawls and customize your crawling experience. Here's how it works:
### Config File

After a crawl (complete or interrupted), Web-to-MD saves a `config.json` file in your output directory. This file contains all the settings and state information from your last crawl.
### Resuming a Crawl

To resume an interrupted crawl, simply run Web-to-MD with the same output directory. The tool will automatically detect the `config.json` file and pick up where it left off.
### Customizing Your Crawl

You can manually edit the `config.json` file to customize your next crawl. Here are the available options and their default values:
| Option | Description | Default Value |
|---|---|---|
| `url` | Starting URL for the crawl | (required) |
| `outputDir` | Output directory for Markdown files | (required) |
| `excludePaths` | Paths to exclude from crawling | `[]` |
| `maxPagesPerSecond` | Maximum pages to crawl per second | `5` |
| `maxDepth` | Maximum depth to crawl | `3` |
| `maxFileSizeMB` | Maximum file size in MB for combined output | `2` |
| `combine` | Combine all pages into a single file | `false` |
| `name` | Name for the combined output file | `undefined` |
| `preserveStructure` | Preserve directory structure | `false` |
| `timeout` | Timeout in seconds for page navigation | `3.5` |
| `initialTimeout` | Initial timeout in seconds for the first page load | `60` |
| `retries` | Number of retries for initial page load | `3` |
| `numWorkers` | Number of concurrent workers | `1` |
You can modify these settings in the `config.json` file to customize your crawl. For example:

```json
{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
```
### Example Workflow

1. Start an initial crawl:

   ```bash
   npm start -- -u https://docs.example.com -o ./my_docs -d 3 -w 2
   ```

2. If the crawl is interrupted, Web-to-MD will save the state in `./my_docs/config.json`. To resume, simply run:

   ```bash
   npm start -- -o ./my_docs
   ```

3. To customize, edit `./my_docs/config.json` to change the crawl settings as needed. For example:

   ```json
   {
     "settings": {
       "url": "https://docs.example.com",
       "outputDir": "./my_docs",
       "excludePaths": ["/blog", "/forum"],
       "maxPagesPerSecond": 5,
       "maxDepth": 4,
       "numWorkers": 3
     }
   }
   ```

4. Run the crawl again with the updated config:

   ```bash
   npm start -- -o ./my_docs
   ```
This workflow allows you to fine-tune your crawls and easily pick up where you left off!
## Contributing

Got ideas? Found a bug? We're all ears! Open an issue or send a pull request. Let's make Web-to-MD even more awesome together!
## License

ISC (It's So Cool) License
## Acknowledgements

A big thank you to all the open-source projects that made Web-to-MD possible. You rock!

Now go forth and crawl some docs!