0.2.0 • Published 4 months ago
pathik v0.2.0
Pathik for Node.js
High-performance web crawler implemented in Go with JavaScript bindings.
Installation
npm install pathik
Prerequisites
- Node.js 14+
- Go 1.16+ (for building the binary)
Usage
Basic Crawling
const pathik = require('pathik');
const path = require('path');
const fs = require('fs');
// Create output directory
const outputDir = path.resolve('./output_data');
fs.mkdirSync(outputDir, { recursive: true });
// List of URLs to crawl
const urls = [
'https://example.com',
'https://news.ycombinator.com'
];
// Crawl the URLs
pathik.crawl(urls, { outputDir })
.then(results => {
console.log('Crawling results:');
for (const [url, files] of Object.entries(results)) {
console.log(`URL: ${url}`);
console.log(`HTML file: ${files.html}`);
console.log(`Markdown file: ${files.markdown}`);
}
})
.catch(error => {
console.error(`Error during crawling: ${error.message}`);
});
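The resolved value maps each URL to its `html` and `markdown` file paths, so it can be post-processed like any plain object. The following is a minimal sketch, not part of pathik: `findIncomplete` is a hypothetical helper, the sample paths are made up, and it assumes (unverified) that a URL that failed to produce a file would show up with an empty or missing path.

```javascript
// Hypothetical helper (not part of pathik): list URLs whose result entry
// lacks an HTML or Markdown path. Assumes the result shape used above:
// { [url]: { html: string, markdown: string } }.
function findIncomplete(results) {
  return Object.entries(results)
    .filter(([, files]) => !files || !files.html || !files.markdown)
    .map(([url]) => url);
}

// Sample data standing in for a real crawl result (paths are invented).
const sample = {
  'https://example.com': { html: '/tmp/a.html', markdown: '/tmp/a.md' },
  'https://news.ycombinator.com': { html: '/tmp/b.html', markdown: '' }
};

console.log(findIncomplete(sample));
```

In real use the function would be called on the `results` argument inside the `.then` callback.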
Parallel Crawling
Pathik crawls URLs in parallel by default, which makes it efficient for batch processing:
const pathik = require('pathik');
// List of many URLs to crawl in parallel
const urls = [
'https://example.com',
'https://news.ycombinator.com',
'https://github.com',
'https://developer.mozilla.org',
'https://wikipedia.org'
];
// Crawl multiple URLs in parallel (default behavior)
pathik.crawl(urls, { outputDir: './output' })
.then(results => {
console.log(`Successfully crawled ${Object.keys(results).length} URLs`);
});
// Disable parallel crawling if needed
pathik.crawl(urls, {
outputDir: './output',
parallel: false // Process sequentially
})
.then(results => {
console.log(`Successfully crawled ${Object.keys(results).length} URLs sequentially`);
});
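For very long URL lists it may be convenient to hand pathik one batch at a time and merge the results. The sketch below shows one way to do that; `chunk` and `crawlInBatches` are hypothetical helpers, and the crawl function is injected so the example runs without pathik installed. Whether batching is needed at all depends on your workload.

```javascript
// Hypothetical helper: split a list into fixed-size batches.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Crawl batch by batch and merge the per-URL results into one object.
// `crawl` is injected; in real use it could be
// (batch) => pathik.crawl(batch, { outputDir: './output' }).
async function crawlInBatches(urls, size, crawl) {
  const merged = {};
  for (const batch of chunk(urls, size)) {
    Object.assign(merged, await crawl(batch));
  }
  return merged;
}

// Stub crawl so the sketch runs on its own (returns empty paths).
const fakeCrawl = async (batch) =>
  Object.fromEntries(batch.map((url) => [url, { html: '', markdown: '' }]));

crawlInBatches(
  ['https://example.com', 'https://github.com', 'https://wikipedia.org'],
  2,
  fakeCrawl
).then((results) => console.log(`Crawled ${Object.keys(results).length} URLs`));
```

Each batch is still crawled in parallel by pathik; only the batches themselves run one after another.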
R2 Upload
const pathik = require('pathik');
// Crawl and upload to R2
pathik.crawlToR2(['https://example.com'], { uuid: 'my-unique-id' })
.then(results => {
console.log('R2 Upload results:');
for (const [url, info] of Object.entries(results)) {
console.log(`URL: ${url}`);
console.log(`R2 HTML key: ${info.r2_html_key}`);
console.log(`R2 Markdown key: ${info.r2_markdown_key}`);
}
})
.catch(error => {
console.error(`Error during R2 upload: ${error.message}`);
});
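Since the `uuid` option prefixes the uploaded filenames, generating it yourself means the prefix is known before the upload starts. A minimal sketch, assuming Node.js 14.17 or later for `crypto.randomUUID`; `makeR2Options` is a hypothetical helper, not part of pathik.

```javascript
const { randomUUID } = require('crypto'); // available from Node.js 14.17

// Hypothetical helper: build crawlToR2 options with a caller-chosen UUID,
// so the prefix can be logged or stored before the upload runs.
function makeR2Options(parallel = true) {
  return { uuid: randomUUID(), parallel };
}

const options = makeR2Options();
console.log(`Uploading under prefix ${options.uuid}`);

// Real use (requires pathik and its R2 setup, which this README does not describe):
// pathik.crawlToR2(['https://example.com'], options).then(console.log);
```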
Command-line Interface
# Install globally
npm install -g pathik
# Crawl URLs
pathik crawl https://example.com https://news.ycombinator.com -o ./output
# Crawl multiple URLs in parallel (default)
pathik crawl https://example.com https://news.ycombinator.com https://github.com -o ./output
# Disable parallel crawling
pathik crawl https://example.com https://news.ycombinator.com --no-parallel -o ./output
# Crawl and upload to R2
pathik r2 https://example.com -u my-unique-id
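When the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below uses only the flags shown above (`-o` and `--no-parallel`); `buildCrawlArgs` is a hypothetical helper, and nothing here is part of pathik itself.

```javascript
// Hypothetical helper: build the argument list for `pathik crawl`,
// mirroring the command lines above.
function buildCrawlArgs(urls, { outputDir, parallel = true } = {}) {
  const args = ['crawl', ...urls];
  if (!parallel) args.push('--no-parallel');
  if (outputDir) args.push('-o', outputDir);
  return args;
}

const args = buildCrawlArgs(['https://example.com'], {
  outputDir: './output',
  parallel: false
});
console.log(['pathik', ...args].join(' '));

// To actually run the command (requires a global pathik install):
// require('child_process').execFile('pathik', args, (err, stdout) => { /* ... */ });
```

Passing the arguments as an array to `execFile` avoids shell quoting problems with URLs that contain `&` or `?`.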
API
pathik.crawl(urls, options)
Crawl URLs and save content locally.
- urls: String or array of URLs to crawl
- options: Object with crawl options
  - outputDir: Directory to save output (uses temp dir if null)
  - parallel: Boolean to enable/disable parallel crawling (default: true)
- Returns: Promise resolving to an object mapping URLs to file paths
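Because `urls` may be a single string or an array, callers that accept user input may want to normalize and check it before calling `crawl`. The following is a sketch of such a pre-check, not pathik's own validation; `toUrlList` is a hypothetical name, and rejecting non-HTTP protocols is an assumption about what you would want to crawl.

```javascript
// Hypothetical pre-check: accept a string or an array, verify each entry
// parses as an http(s) URL, and return the original strings as an array.
function toUrlList(input) {
  const list = Array.isArray(input) ? input : [input];
  for (const raw of list) {
    const parsed = new URL(raw); // throws TypeError on malformed input
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
      throw new TypeError(`Unsupported protocol in URL: ${raw}`);
    }
  }
  return list;
}

console.log(toUrlList('https://example.com'));
console.log(toUrlList(['https://example.com', 'https://github.com']));
```

The original strings are returned unchanged so that they still match the keys of the object `crawl` resolves to.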
pathik.crawlToR2(urls, options)
Crawl URLs and upload content to R2.
- urls: String or array of URLs to crawl
- options: Object with R2 options
  - uuid: UUID to prefix filenames (generates random UUID if null)
  - parallel: Boolean to enable/disable parallel crawling (default: true)
- Returns: Promise resolving to an object mapping URLs to R2 keys
Building the Binary
If the Go binary isn't built automatically during installation:
npm run build-binary
Troubleshooting
Missing Binary
If pathik reports that its Go binary is missing, build it manually:
npm run build-binary
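Import Errors
If importing pathik fails, uninstall the package and reinstall the dependencies from the pathik-js directory:
npm uninstall pathik
cd pathik-js && npm install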
Performance
Pathik's concurrent crawling is powered by Go's goroutines, making it significantly more memory-efficient than browser automation tools:
- Uses ~10x less memory than Playwright
- Efficiently processes large batches of URLs
- Parallelism controlled by the Go binary (default: 5 concurrent crawls)
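The concurrency cap is enforced inside the Go binary, so nothing needs to be configured from JavaScript. Purely to illustrate what a limit of 5 concurrent tasks means, here is a small bounded-concurrency sketch in JavaScript; `runWithLimit` is a hypothetical function and is not how pathik is implemented.

```javascript
// Illustration only: run async tasks with at most `limit` in flight at once.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}

// Demo: 12 simulated tasks, capped at 5 concurrent, tracking the observed peak.
let inFlight = 0;
let peak = 0;
const demo = runWithLimit([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 5, async (n) => {
  inFlight += 1;
  peak = Math.max(peak, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 10));
  inFlight -= 1;
  return n * 2;
});

demo.then((doubled) => {
  console.log(`${doubled.length} tasks done, peak concurrency ${peak}`);
});
```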
License
Apache 2.0