crawltojson v1.11.11
A powerful and flexible web crawler that converts website content into structured JSON. Perfect for creating training datasets, migrating content, web scraping, or any other task that requires structured web content extraction.
🎯 Intended Use
Just two commands to crawl a website and save the content in a structured JSON file.
npx crawltojson config
npx crawltojson crawl
🚀 Features
- 🌐 Crawl any website with customizable patterns
- 📦 Export to structured JSON
- 🎯 CSS selector-based content extraction
- 🔄 Automatic retry mechanism for failed requests
- 🌲 Depth-limited crawling
- ⏱️ Configurable timeouts
- 🚫 URL pattern exclusion
- 💾 Stream-based processing for memory efficiency
- 🎨 Beautiful CLI interface with progress indicators
📋 Table of Contents
- Installation
- Quick Start
- Configuration Options
- Advanced Usage
- Output Format
- Use Cases
- Development
- Troubleshooting
- Contributing
- License
🔧 Installation
Global Installation (Recommended)
npm install -g crawltojson
Using npx (No Installation)
npx crawltojson
Local Project Installation
npm install crawltojson
🚀 Quick Start
- Generate a configuration file:
crawltojson config
- Start crawling:
crawltojson crawl
⚙️ Configuration Options
Basic Options
url
- Starting URL to crawl
- Example: "https://example.com/blog"
- Must be a valid HTTP/HTTPS URL
match
- URL pattern to match (supports glob patterns)
- Example: "https://example.com/blog/**"
- Use ** for wildcard matching
- Default: Same as starting URL with /** appended
selector
- CSS selector to extract content
- Example: "article.content"
- Default: "body"
- Supports any valid CSS selector
maxPages
- Maximum number of pages to crawl
- Default: 50
- Range: 1 to unlimited
- Helps control crawl scope
Advanced Options
maxRetries
- Maximum number of retries for failed requests
- Default: 3
- Useful for handling temporary network issues
- Exponential backoff between retries (see the sketch below)
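A minimal sketch of a retry loop with exponential backoff, for illustration only (the helper name and delay values are assumptions, not the package's actual internals):

// Illustrative retry helper: the delay doubles after each failed attempt.
// Names and timings are assumptions, not crawltojson's real implementation.
async function withRetries<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;   // Retries exhausted: give up.
      const delayMs = 1000 * 2 ** attempt;    // 1s, 2s, 4s, ...
      await new Promise((res) => setTimeout(res, delayMs));
    }
  }
}

With the default of 3, a failing request in this sketch would be attempted up to four times in total before the error is surfaced.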
maxLevels
- Maximum depth level for crawling
- Default: 3
- Controls how deep the crawler goes from the starting URL
- Level 0 is the starting URL
- Helps prevent infinite crawling (sketched below)
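Conceptually, the depth limit behaves like a breadth-first traversal that stops following links once maxLevels is reached. A rough sketch, where fetchLinks is a hypothetical helper and not part of the package:

// Rough sketch of depth-limited breadth-first crawling.
// `fetchLinks` is a hypothetical helper returning the links found on a page.
async function crawl(
  startUrl: string,
  maxLevels: number,
  fetchLinks: (url: string) => Promise<string[]>
): Promise<void> {
  const seen = new Set<string>([startUrl]);
  let frontier = [startUrl];                      // Level 0 is the starting URL.
  for (let level = 0; level < maxLevels && frontier.length > 0; level++) {
    const next: string[] = [];
    for (const url of frontier) {
      for (const link of await fetchLinks(url)) { // Links found here sit at level + 1.
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
}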
timeout
- Page load timeout in milliseconds
- Default: 7000 (7 seconds)
- Prevents hanging on slow-loading pages
- Adjust based on site performance
excludePatterns
- Array of URL patterns to ignore
- Default patterns:
[
  "**/tag/**",      // Ignore tag pages
  "**/tags/**",     // Ignore tag listings
  "**/#*",          // Ignore anchor links
  "**/search**",    // Ignore search pages
  "**.pdf",         // Ignore PDF files
  "**/archive/**"   // Ignore archive pages
]
Configuration File
The configuration is stored in crawltojson.config.json. Example:
{
  "url": "https://example.com/blog",
  "match": "https://example.com/blog/**",
  "selector": "article.content",
  "maxPages": 100,
  "maxRetries": 3,
  "maxLevels": 3,
  "timeout": 7000,
  "outputFile": "crawltojson.output.json",
  "excludePatterns": [
    "**/tag/**",
    "**/tags/**",
    "**/#*"
  ]
}
🎯 Advanced Usage
Selecting Content
The selector option supports any valid CSS selector. Examples:
# Single element
article.main-content
# Multiple elements
.post-content, .comments
# Nested elements
article .content p
# Complex selectors
main article:not(.ad) .content
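The crawler is built on Playwright, so selector-based extraction works roughly like the sketch below (crawltojson's exact internals may differ):

// Rough sketch of selector-based text extraction with Playwright;
// not crawltojson's exact code.
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com/blog/post", { timeout: 7000 });
// Join the text of every element matching the selector.
const content = await page.$$eval("article.content", (els) =>
  els.map((el) => el.textContent?.trim() ?? "").join("\n")
);
await browser.close();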
URL Pattern Matching
The match pattern supports glob-style matching:
# Match exact path
https://example.com/blog/
# Match all blog posts
https://example.com/blog/**
# Match specific sections
https://example.com/blog/2024/**
https://example.com/blog/*/technical/**
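The patterns follow familiar glob semantics, so you can sanity-check one against a URL before crawling using a glob library such as minimatch (an illustration only; crawltojson may use a different matcher internally):

// Test a URL against a glob pattern with minimatch (npm install minimatch).
// This demonstrates glob semantics, not crawltojson's own matcher.
import { minimatch } from "minimatch";

console.log(minimatch("https://example.com/blog/2024/intro",
                      "https://example.com/blog/**")); // true
console.log(minimatch("https://example.com/about",
                      "https://example.com/blog/**")); // false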
Exclude Patterns
Customize excludePatterns for your needs:
{
  "excludePatterns": [
    "**/tag/**",       // Tag pages
    "**/category/**",  // Category pages
    "**/page/*",       // Pagination
    "**/wp-admin/**",  // Admin pages
    "**?preview=true", // Preview pages
    "**.pdf",          // PDF files
    "**/feed/**",      // RSS feeds
    "**/print/**"      // Print pages
  ]
}
📄 Output Format
The crawler generates a JSON file with the following structure:
[
  {
    "url": "https://example.com/page1",
    "content": "Extracted content...",
    "timestamp": "2024-11-02T12:00:00.000Z",
    "level": 0
  },
  {
    "url": "https://example.com/page2",
    "content": "More content...",
    "timestamp": "2024-11-02T12:00:10.000Z",
    "level": 1
  }
]
Fields:
- url: The normalized URL of the crawled page
- content: Extracted text content based on the selector
- timestamp: ISO timestamp of when the page was crawled
- level: Depth level from the starting URL (0-based)
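Because the output is plain JSON, post-processing is straightforward. For example, a small script (file names are illustrative) that keeps only pages near the top of the site:

// Illustrative post-processing: keep only pages crawled at depth 0 or 1.
import { readFileSync, writeFileSync } from "node:fs";

type Page = { url: string; content: string; timestamp: string; level: number };

const pages: Page[] = JSON.parse(readFileSync("crawltojson.output.json", "utf8"));
const shallow = pages.filter((p) => p.level <= 1);
writeFileSync("shallow.json", JSON.stringify(shallow, null, 2));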
🎯 Use Cases
Content Migration
- Crawl existing website content
- Export to structured format
- Import into new platform
Training Data Collection
- Gather content for ML models
- Create datasets for NLP
- Build content classifiers
Content Archival
- Archive website content
- Create backups
- Document snapshots
SEO Analysis
- Extract meta content
- Analyze content structure
- Track content changes
Documentation Collection
- Crawl documentation sites
- Create offline copies
- Generate searchable indexes
🛠️ Development
Local Setup
- Clone the repository:
git clone https://github.com/yourusername/crawltojson.git
cd crawltojson
- Install dependencies:
npm install
- Build the project:
npm run build
- Link for local testing:
npm link
Development Commands
# Run build
npm run build
# Clean build
npm run clean
# Run tests
npm test
# Watch mode
npm run dev
Publishing
- Update version:
npm version patch|minor|major
- Build and publish:
npm run build
npm publish
❗ Troubleshooting
Common Issues
- Browser Installation Failed
# Manual installation
npx playwright install chromium
- Permission Errors
# Fix CLI permissions
chmod +x ./dist/cli.js
- Build Errors
# Clean install
rm -rf node_modules dist package-lock.json
npm install
npm run build
Debug Mode
Set the DEBUG environment variable:
DEBUG=crawltojson* crawltojson crawl
🤝 Contributing
- Fork the repository
- Create feature branch
- Commit changes
- Push to branch
- Create Pull Request
Coding Standards
- Use ESLint configuration
- Add tests for new features
- Update documentation
- Follow semantic versioning
📜 License
MIT License - see LICENSE for details.
🙏 Acknowledgments
- Built with Playwright
- CLI powered by Commander.js
- Inspired by web scraping communities
Made with ❤️ by Vivek M. Agarwal