Simple Web Scraper
@the-node-forge/simple-web-scraper v1.1.2 • Published 4 months ago
A lightweight and efficient web scraping package for JavaScript/TypeScript applications. This package helps developers fetch HTML content, parse web pages, and extract data effortlessly.
✨ Features
- ✅ Fetch Web Content – Retrieve HTML from any URL with ease.
- ✅ Parse and Extract Data – Utilize integrated parsing tools to extract information.
- ✅ Configurable Options – Customize scraping behaviors using CSS selectors.
- ✅ Headless Browser Support – Optionally use Puppeteer for JavaScript-rendered pages.
- ✅ Lightweight & Fast – Uses Cheerio for static HTML scraping.
- ✅ TypeScript Support – Fully typed for robust development.
- ✅ Data Export Support – Export scraped data in JSON or CSV formats.
- ✅ CSV Import Support – Read CSV files and convert them to JSON.
📚 Installation
Install via npm:
npm install @the-node-forge/simple-web-scraper
or with Yarn:
yarn add @the-node-forge/simple-web-scraper
🚀 Why Use Cheerio and Puppeteer?
This package leverages Cheerio and Puppeteer for powerful web scraping capabilities:
🔹 Cheerio (Fast and Lightweight)
- Ideal for static HTML parsing (like jQuery for the backend).
- Extremely fast and lightweight – perfect for pages without JavaScript rendering.
- Provides easy CSS selector querying for extracting structured data.
🔹 Puppeteer (Headless Browser Automation)
- Handles JavaScript-rendered pages – essential for scraping dynamic content.
- Can interact with pages, click buttons, and fill out forms.
- Allows screenshot capturing, PDF generation, and full-page automation.
✅ Best of Both Worlds
- Use Cheerio for speed when scraping static pages.
- Switch to Puppeteer for JavaScript-heavy sites requiring full rendering.
- Provides flexibility to choose the best approach for your project.
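To make the "static pages" case concrete, here is a dependency-free sketch of what static scraping means: operating on the raw HTML string after it has been fetched. This is illustrative only – the package itself uses Cheerio's full selector engine, not regular expressions.

```javascript
// Illustrative only: shows what "static HTML parsing" means by
// pulling the <title> out of an already-fetched HTML string.
// The package itself uses Cheerio, not regex matching.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return match ? match[1].trim() : null;
}

const html = '<html><head><title>Example Domain</title></head><body></body></html>';
console.log(extractTitle(html)); // "Example Domain"
```

For anything beyond trivial extraction, a real parser such as Cheerio is the right tool; regex-based parsing breaks down quickly on real-world markup.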
✅ API Reference
WebScraper Class
new WebScraper(options?: ScraperOptions)
📊 Props
Parameter | Type | Description |
---|---|---|
usePuppeteer | boolean (optional) | Whether to use Puppeteer (default: true) |
throttle | number (optional) | Delay in milliseconds between requests (default: 1000) |
rules | Record&lt;string, string&gt; | CSS selectors defining the data extraction rules |
Methods
scrape(url: string): Promise<Record<string, any>>
- Scrapes the given URL based on the configured options.
exportToJSON(data: any, filePath: string): void
- Exports the given data to a JSON file.
exportToCSV(data: any | any[], filePath: string): void
- Exports the given data to a CSV file.
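The throttle option spaces out successive requests so the target server is not hammered. A minimal sketch of how such a delay can be implemented in plain JavaScript (a hypothetical helper, not the package's internal code):

```javascript
// Hypothetical throttling sketch: wait `throttleMs` milliseconds
// between successive scrapes of a list of URLs.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeAll(urls, scrapeFn, throttleMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await scrapeFn(url)); // scrape one page
    await sleep(throttleMs);           // then pause before the next request
  }
  return results;
}
```

A per-request delay like this is the simplest form of rate limiting; setting throttle to a polite value is good etiquette when scraping sites you do not control.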
🛠️ Basic Usage
1. Scraping Web Pages
You can scrape web pages using either Puppeteer (for JavaScript-heavy pages) or Cheerio (for static HTML pages).
import { WebScraper } from '@the-node-forge/simple-web-scraper';
const scraper = new WebScraper({
usePuppeteer: false, // Set to true for dynamic pages
rules: {
title: 'h1',
description: 'meta[name="description"]',
},
});
(async () => {
const data = await scraper.scrape('https://example.com');
console.log(data);
})();
2. Using Puppeteer for JavaScript-heavy Pages
To scrape pages that require JavaScript execution:
const scraper = new WebScraper({
usePuppeteer: true, // Enable Puppeteer for JavaScript-rendered content
rules: {
heading: 'h1',
price: '.product-price',
},
});
(async () => {
const data = await scraper.scrape('https://example.com/product');
console.log(data);
})();
3. Exporting Data
- Scraped data can be exported to JSON or CSV files using utility functions.
Export to JSON
import { exportToJSON } from '@the-node-forge/simple-web-scraper';
const data = { name: 'Example', value: 42 };
exportToJSON(data, 'output.json');
Export to CSV
import { exportToCSV } from '@the-node-forge/simple-web-scraper';
const data = [
{ name: 'Example 1', value: 42 },
{ name: 'Example 2', value: 99 },
];
exportToCSV(data, 'output.csv');
// Preserve null and undefined values as null
exportToCSV(data, 'output.csv', { preserveNulls: true });
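The feature list also mentions CSV import (reading CSV files back into JSON). The package's exact import function is not documented here, so as an illustration, here is a dependency-free sketch of converting simple, unquoted CSV text into an array of JSON objects:

```javascript
// Illustrative CSV-to-JSON sketch (not the package's API): handles
// simple comma-separated rows without quoted or escaped fields.
function csvToJSON(csvText) {
  const [headerLine, ...rows] = csvText.trim().split('\n');
  const headers = headerLine.split(',').map((h) => h.trim());
  return rows.map((row) => {
    const values = row.split(',').map((v) => v.trim());
    return Object.fromEntries(headers.map((h, i) => [h, values[i] ?? null]));
  });
}

const csv = 'name,value\nExample 1,42\nExample 2,99';
console.log(csvToJSON(csv));
// [ { name: 'Example 1', value: '42' }, { name: 'Example 2', value: '99' } ]
```

Note that real CSV can contain quoted fields, embedded commas, and newlines; a production parser must handle those cases, which this sketch deliberately does not.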
🖥 Backend Example - Module (import)
This example demonstrates how to use simple-web-scraper in a Node.js backend:
import express from 'express';
import { WebScraper, exportToJSON, exportToCSV } from '@the-node-forge/simple-web-scraper';
const app = express();
const scraper = new WebScraper({
usePuppeteer: true,
rules: { title: 'h1', content: 'p' },
});
app.get('/scrape-example', async (req, res) => {
try {
const url = 'https://github.com/The-Node-Forge';
const data = await scraper.scrape(url);
exportToJSON(data, 'output.json'); // export JSON
exportToCSV(data, 'output.csv', { preserveNulls: true }); // export CSV
res.status(200).json({ success: true, data });
} catch (error) {
res.status(500).json({ success: false, error: error.message });
}
});
🖥 Backend Example - CommonJS (require)
This example demonstrates how to use simple-web-scraper in a Node.js backend:
const express = require('express');
const {
WebScraper,
exportToJSON,
exportToCSV,
} = require('@the-node-forge/simple-web-scraper/dist');
const app = express();
const scraper = new WebScraper({
usePuppeteer: true,
rules: {
fullHTML: 'html', // Entire page HTML
title: 'head > title', // Page title
description: 'meta[name="description"]', // Meta description
keywords: 'meta[name="keywords"]', // Meta keywords
favicon: 'link[rel="icon"]', // Favicon URL
mainHeading: 'h1', // First H1 heading
allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
firstParagraph: 'p', // First paragraph
allParagraphs: 'p', // All paragraphs on the page
links: 'a', // All links on the page
images: 'img', // All image URLs
imageAlts: 'img', // Alternative text for images
videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
tables: 'table', // Capture table elements
tableData: 'td', // Capture table cells
lists: 'ul, ol', // Capture all lists
listItems: 'li', // Capture all list items
scripts: 'script', // JavaScript file sources
stylesheets: 'link[rel="stylesheet"]', // External CSS files
structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
socialLinks:
'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
author: 'meta[name="author"]', // Author meta tag
publishDate: 'meta[property="article:published_time"], time', // Publish date
modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
canonicalURL: 'link[rel="canonical"]', // Canonical URL
openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
openGraphImage: 'meta[property="og:image"]', // OpenGraph image
twitterCard: 'meta[name="twitter:card"]', // Twitter card type
twitterTitle: 'meta[name="twitter:title"]', // Twitter title
twitterDescription: 'meta[name="twitter:description"]', // Twitter description
twitterImage: 'meta[name="twitter:image"]', // Twitter image
},
});
app.get('/test-scraper', async (req, res) => {
try {
const url = 'https://github.com/The-Node-Forge';
const data = await scraper.scrape(url);
exportToJSON(data, 'output.json'); // export JSON
exportToCSV(data, 'output.csv'); // export CSV
res.status(200).json({ success: true, data });
} catch (error) {
res.status(500).json({ success: false, error: error.message });
}
});
🛠️ Full Usage Example
import { WebScraper } from '@the-node-forge/simple-web-scraper';
const scraper = new WebScraper({
usePuppeteer: true, // Set to false if scraping static pages
rules: {
fullHTML: 'html', // Entire page HTML
title: 'head > title', // Page title
description: 'meta[name="description"]', // Meta description
keywords: 'meta[name="keywords"]', // Meta keywords
favicon: 'link[rel="icon"]', // Favicon URL
mainHeading: 'h1', // First H1 heading
allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
firstParagraph: 'p', // First paragraph
allParagraphs: 'p', // All paragraphs on the page
links: 'a', // All links on the page
images: 'img', // All image URLs
imageAlts: 'img', // Alternative text for images
videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
tables: 'table', // Capture table elements
tableData: 'td', // Capture table cells
lists: 'ul, ol', // Capture all lists
listItems: 'li', // Capture all list items
scripts: 'script', // JavaScript file sources
stylesheets: 'link[rel="stylesheet"]', // External CSS files
structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
socialLinks:
'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
author: 'meta[name="author"]', // Author meta tag
publishDate: 'meta[property="article:published_time"], time', // Publish date
modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
canonicalURL: 'link[rel="canonical"]', // Canonical URL
openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
openGraphImage: 'meta[property="og:image"]', // OpenGraph image
twitterCard: 'meta[name="twitter:card"]', // Twitter card type
twitterTitle: 'meta[name="twitter:title"]', // Twitter title
twitterDescription: 'meta[name="twitter:description"]', // Twitter description
twitterImage: 'meta[name="twitter:image"]', // Twitter image
},
});
(async () => {
const data = await scraper.scrape('https://example.com');
console.log(data);
})();
📊 Rule Set Table
Rule | CSS Selector | Target Data |
---|---|---|
fullHTML | html | The entire HTML of the page |
title | head > title | The <title> of the page |
description | meta[name="description"] | Meta description for SEO |
keywords | meta[name="keywords"] | Meta keywords |
favicon | link[rel="icon"] | Website icon |
mainHeading | h1 | The first <h1> heading |
allHeadings | h1, h2, h3, h4, h5, h6 | All headings (h1–h6) |
firstParagraph | p | The first paragraph (&lt;p&gt;) |
allParagraphs | p | All paragraphs on the page |
links | a | All anchor <a> links |
images | img | All image <img> sources |
imageAlts | img | All image alt texts |
videos | video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"] | Video sources (&lt;video&gt;, YouTube, Vimeo) |
tables | table | All <table> elements |
tableData | td | Individual <td> elements |
lists | ul, ol | All ordered <ol> and unordered <ul> lists |
listItems | li | All list <li> items |
scripts | script | Included JavaScript files (&lt;script src="..."&gt;) |
stylesheets | link[rel="stylesheet"] | Stylesheets (&lt;link rel="stylesheet"&gt;) |
structuredData | script[type="application/ld+json"] | JSON-LD structured data for SEO |
socialLinks | a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"] | Facebook, Twitter, LinkedIn, Instagram links |
author | meta[name="author"] | Page author meta tag |
publishDate | meta[property="article:published_time"], time | Date article was published |
modifiedDate | meta[property="article:modified_time"] | Last modified date |
canonicalURL | link[rel="canonical"] | Canonical URL (avoids duplicate content) |
openGraphTitle | meta[property="og:title"] | OpenGraph metadata for social sharing |
openGraphDescription | meta[property="og:description"] | OpenGraph description |
openGraphImage | meta[property="og:image"] | OpenGraph image URL |
twitterCard | meta[name="twitter:card"] | Twitter card type (summary, summary_large_image) |
twitterTitle | meta[name="twitter:title"] | Twitter title metadata |
twitterDescription | meta[name="twitter:description"] | Twitter description metadata |
twitterImage | meta[name="twitter:image"] | Twitter image metadata |
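You rarely need every rule in the table at once. Since rules is a plain Record&lt;string, string&gt;, a subset can be picked out with standard JavaScript before constructing the scraper (fullRules below is just a placeholder for any rules map like the one above):

```javascript
// Build a smaller rules object from a larger one. `fullRules` is a
// placeholder standing in for any rules map like the table above.
const fullRules = {
  title: 'head > title',
  description: 'meta[name="description"]',
  openGraphTitle: 'meta[property="og:title"]',
  links: 'a',
};

function pickRules(rules, keys) {
  return Object.fromEntries(
    keys.filter((k) => k in rules).map((k) => [k, rules[k]]),
  );
}

console.log(pickRules(fullRules, ['title', 'description']));
// { title: 'head > title', description: 'meta[name="description"]' }
```

Keeping the rule set small also keeps the scraped payload small, which matters when exporting results to JSON or CSV.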
💡 Contributing
Contributions are welcome! Please submit issues or pull requests.