Simple Web Scraper
@the-node-forge/simple-web-scraper v1.1.2 • Published 4 months ago
A lightweight and efficient web scraping package for JavaScript/TypeScript applications. This package helps developers fetch HTML content, parse web pages, and extract data effortlessly.
✨ Features
- ✅ Fetch Web Content – Retrieve HTML from any URL with ease.
- ✅ Parse and Extract Data – Utilize integrated parsing tools to extract information.
- ✅ Configurable Options – Customize scraping behaviors using CSS selectors.
- ✅ Headless Browser Support – Optionally use Puppeteer for JavaScript-rendered pages.
- ✅ Lightweight & Fast – Uses Cheerio for static HTML scraping.
- ✅ TypeScript Support – Fully typed for robust development.
- ✅ Data Export Support – Export scraped data in JSON or CSV formats.
- ✅ CSV Import Support – Read CSV files and convert them to JSON.
📚 Installation
Install via npm:
npm install @the-node-forge/simple-web-scraper
or with Yarn:
yarn add @the-node-forge/simple-web-scraper
🚀 Why Use Cheerio and Puppeteer?
This package leverages Cheerio and Puppeteer for powerful web scraping capabilities:
🔹 Cheerio (Fast and Lightweight)
- Ideal for static HTML parsing (like jQuery for the backend).
- Extremely fast and lightweight – perfect for pages without JavaScript rendering.
- Provides easy CSS selector querying for extracting structured data.
🔹 Puppeteer (Headless Browser Automation)
- Handles JavaScript-rendered pages – essential for scraping dynamic content.
- Can interact with pages, click buttons, and fill out forms.
- Allows screenshot capturing, PDF generation, and full-page automation.
✅ Best of Both Worlds
- Use Cheerio for speed when scraping static pages.
- Switch to Puppeteer for JavaScript-heavy sites requiring full rendering.
- Provides flexibility to choose the best approach for your project.
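To make the "static pages" case concrete, here is a dependency-free sketch of what static scraping means: operating on the raw HTML string after it has been fetched. This is illustrative only – the package itself uses Cheerio's full selector engine, not regular expressions.

```javascript
// Illustrative only: shows what "static HTML parsing" means by
// pulling the <title> out of an already-fetched HTML string.
// The package itself uses Cheerio, not regex matching.
function extractTitle(html) {
  const match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return match ? match[1].trim() : null;
}

const html = '<html><head><title>Example Domain</title></head><body></body></html>';
console.log(extractTitle(html)); // "Example Domain"
```

For anything beyond trivial extraction, a real parser such as Cheerio is the right tool; regex-based parsing breaks down quickly on real-world markup.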
✅ API Reference
WebScraper Class
new WebScraper(options?: ScraperOptions)
📊 Props
Parameter | Type | Description |
---|---|---|
usePuppeteer | boolean (optional) | Whether to use Puppeteer (default: true) |
throttle | number (optional) | Delay in milliseconds between requests (default: 1000) |
rules | Record&lt;string, string&gt; | CSS selectors defining the data extraction rules |
Methods
scrape(url: string): Promise<Record<string, any>>
- Scrapes the given URL based on the configured options.
exportToJSON(data: any, filePath: string): void
- Exports the given data to a JSON file.
exportToCSV(data: any | any[], filePath: string): void
- Exports the given data to a CSV file.
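The throttle option spaces out successive requests so the target server is not hammered. A minimal sketch of how such a delay can be implemented in plain JavaScript (a hypothetical helper, not the package's internal code):

```javascript
// Hypothetical throttling sketch: wait `throttleMs` milliseconds
// between successive scrapes of a list of URLs.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeAll(urls, scrapeFn, throttleMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await scrapeFn(url)); // scrape one page
    await sleep(throttleMs);           // then pause before the next request
  }
  return results;
}
```

A per-request delay like this is the simplest form of rate limiting; setting throttle to a polite value is good etiquette when scraping sites you do not control.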
🛠️ Basic Usage
1. Scraping Web Pages
You can scrape web pages using either Puppeteer (for JavaScript-heavy pages) or Cheerio (for static HTML pages).
import { WebScraper } from '@the-node-forge/simple-web-scraper';
const scraper = new WebScraper({
usePuppeteer: false, // Set to true for dynamic pages
rules: {
title: 'h1',
description: 'meta[name="description"]',
},
});
(async () => {
const data = await scraper.scrape('https://example.com');
console.log(data);
})();
2. Using Puppeteer for JavaScript-heavy Pages
To scrape pages that require JavaScript execution:
const scraper = new WebScraper({
usePuppeteer: true, // Enable Puppeteer for JavaScript-rendered content
rules: {
heading: 'h1',
price: '.product-price',
},
});
(async () => {
const data = await scraper.scrape('https://example.com/product');
console.log(data);
})();
3. Exporting Data
- Scraped data can be exported to JSON or CSV files using utility functions.
Export to JSON
import { exportToJSON } from '@the-node-forge/simple-web-scraper';
const data = { name: 'Example', value: 42 };
exportToJSON(data, 'output.json');
Export to CSV
import { exportToCSV } from '@the-node-forge/simple-web-scraper';
const data = [
{ name: 'Example 1', value: 42 },
{ name: 'Example 2', value: 99 },
];
exportToCSV(data, 'output.csv');
// Preserve null and undefined values as null
exportToCSV(data, 'output.csv', { preserveNulls: true });
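The feature list also mentions CSV import (reading CSV files back into JSON). The package's exact import function is not documented here, so as an illustration, here is a dependency-free sketch of converting simple, unquoted CSV text into an array of JSON objects:

```javascript
// Illustrative CSV-to-JSON sketch (not the package's API): handles
// simple comma-separated rows without quoted or escaped fields.
function csvToJSON(csvText) {
  const [headerLine, ...rows] = csvText.trim().split('\n');
  const headers = headerLine.split(',').map((h) => h.trim());
  return rows.map((row) => {
    const values = row.split(',').map((v) => v.trim());
    return Object.fromEntries(headers.map((h, i) => [h, values[i] ?? null]));
  });
}

const csv = 'name,value\nExample 1,42\nExample 2,99';
console.log(csvToJSON(csv));
// [ { name: 'Example 1', value: '42' }, { name: 'Example 2', value: '99' } ]
```

Note that real CSV can contain quoted fields, embedded commas, and newlines; a production parser must handle those cases, which this sketch deliberately does not.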
🖥 Backend Example - Module (import)
This example demonstrates how to use simple-web-scraper in a Node.js backend:
import express from 'express';
import { WebScraper, exportToJSON, exportToCSV } from '@the-node-forge/simple-web-scraper';
const app = express();
const scraper = new WebScraper({
usePuppeteer: true,
rules: { title: 'h1', content: 'p' },
});
app.get('/scrape-example', async (req, res) => {
try {
const url = 'https://github.com/The-Node-Forge';
const data = await scraper.scrape(url);
exportToJSON(data, 'output.json'); // export JSON
exportToCSV(data, 'output.csv', { preserveNulls: true }); // export CSV
res.status(200).json({ success: true, data });
} catch (error) {
res.status(500).json({ success: false, error: error.message });
}
});
🖥 Backend Example - CommonJS (require)
This example demonstrates how to use simple-web-scraper in a Node.js backend:
const express = require('express');
const {
WebScraper,
exportToJSON,
exportToCSV,
} = require('@the-node-forge/simple-web-scraper/dist');
const app = express();
const scraper = new WebScraper({
usePuppeteer: true,
rules: {
fullHTML: 'html', // Entire page HTML
title: 'head > title', // Page title
description: 'meta[name="description"]', // Meta description
keywords: 'meta[name="keywords"]', // Meta keywords
favicon: 'link[rel="icon"]', // Favicon URL
mainHeading: 'h1', // First H1 heading
allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
firstParagraph: 'p', // First paragraph
allParagraphs: 'p', // All paragraphs on the page
links: 'a', // All links on the page
images: 'img', // All image URLs
imageAlts: 'img', // Alternative text for images
videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
tables: 'table', // Capture table elements
tableData: 'td', // Capture table cells
lists: 'ul, ol', // Capture all lists
listItems: 'li', // Capture all list items
scripts: 'script', // JavaScript file sources
stylesheets: 'link[rel="stylesheet"]', // External CSS files
structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
socialLinks:
'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
author: 'meta[name="author"]', // Author meta tag
publishDate: 'meta[property="article:published_time"], time', // Publish date
modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
canonicalURL: 'link[rel="canonical"]', // Canonical URL
openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
openGraphImage: 'meta[property="og:image"]', // OpenGraph image
twitterCard: 'meta[name="twitter:card"]', // Twitter card type
twitterTitle: 'meta[name="twitter:title"]', // Twitter title
twitterDescription: 'meta[name="twitter:description"]', // Twitter description
twitterImage: 'meta[name="twitter:image"]', // Twitter image
},
});
app.get('/test-scraper', async (req, res) => {
try {
const url = 'https://github.com/The-Node-Forge';
const data = await scraper.scrape(url);
exportToJSON(data, 'output.json'); // export JSON
exportToCSV(data, 'output.csv'); // export CSV
res.status(200).json({ success: true, data });
} catch (error) {
res.status(500).json({ success: false, error: error.message });
}
});
🛠️ Full Usage Example
import { WebScraper } from '@the-node-forge/simple-web-scraper';
const scraper = new WebScraper({
usePuppeteer: true, // Set to false if scraping static pages
rules: {
fullHTML: 'html', // Entire page HTML
title: 'head > title', // Page title
description: 'meta[name="description"]', // Meta description
keywords: 'meta[name="keywords"]', // Meta keywords
favicon: 'link[rel="icon"]', // Favicon URL
mainHeading: 'h1', // First H1 heading
allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
firstParagraph: 'p', // First paragraph
allParagraphs: 'p', // All paragraphs on the page
links: 'a', // All links on the page
images: 'img', // All image URLs
imageAlts: 'img', // Alternative text for images
videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
tables: 'table', // Capture table elements
tableData: 'td', // Capture table cells
lists: 'ul, ol', // Capture all lists
listItems: 'li', // Capture all list items
scripts: 'script', // JavaScript file sources
stylesheets: 'link[rel="stylesheet"]', // External CSS files
structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
socialLinks:
'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
author: 'meta[name="author"]', // Author meta tag
publishDate: 'meta[property="article:published_time"], time', // Publish date
modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
canonicalURL: 'link[rel="canonical"]', // Canonical URL
openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
openGraphImage: 'meta[property="og:image"]', // OpenGraph image
twitterCard: 'meta[name="twitter:card"]', // Twitter card type
twitterTitle: 'meta[name="twitter:title"]', // Twitter title
twitterDescription: 'meta[name="twitter:description"]', // Twitter description
twitterImage: 'meta[name="twitter:image"]', // Twitter image
},
});
(async () => {
const data = await scraper.scrape('https://example.com');
console.log(data);
})();
📊 Rule Set Table
Rule | CSS Selector | Target Data |
---|---|---|
fullHTML | html | The entire HTML of the page |
title | head > title | The <title> of the page |
description | meta[name="description"] | Meta description for SEO |
keywords | meta[name="keywords"] | Meta keywords |
favicon | link[rel="icon"] | Website icon |
mainHeading | h1 | The first <h1> heading |
allHeadings | h1, h2, h3, h4, h5, h6 | All headings (h1–h6) |
firstParagraph | p | The first paragraph (&lt;p&gt;) |
allParagraphs | p | All paragraphs on the page |
links | a | All anchor <a> links |
images | img | All image <img> sources |
imageAlts | img | All image alt texts |
videos | video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"] | Video sources (&lt;video&gt;, YouTube, Vimeo) |
tables | table | All <table> elements |
tableData | td | Individual <td> elements |
lists | ul, ol | All ordered <ol> and unordered <ul> lists |
listItems | li | All list <li> items |
scripts | script | Included JavaScript files (&lt;script src="..."&gt;) |
stylesheets | link[rel="stylesheet"] | Stylesheets (&lt;link rel="stylesheet"&gt;) |
structuredData | script[type="application/ld+json"] | JSON-LD structured data for SEO |
socialLinks | a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"] | Facebook, Twitter, LinkedIn, Instagram links |
author | meta[name="author"] | Page author meta tag |
publishDate | meta[property="article:published_time"], time | Date article was published |
modifiedDate | meta[property="article:modified_time"] | Last modified date |
canonicalURL | link[rel="canonical"] | Canonical URL (avoids duplicate content) |
openGraphTitle | meta[property="og:title"] | OpenGraph metadata for social sharing |
openGraphDescription | meta[property="og:description"] | OpenGraph description |
openGraphImage | meta[property="og:image"] | OpenGraph image URL |
twitterCard | meta[name="twitter:card"] | Twitter card type (summary, summary_large_image) |
twitterTitle | meta[name="twitter:title"] | Twitter title metadata |
twitterDescription | meta[name="twitter:description"] | Twitter description metadata |
twitterImage | meta[name="twitter:image"] | Twitter image metadata |
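You rarely need every rule in the table at once. Since rules is a plain Record&lt;string, string&gt;, a subset can be picked out with standard JavaScript before constructing the scraper (fullRules below is just a placeholder for any rules map like the one above):

```javascript
// Build a smaller rules object from a larger one. `fullRules` is a
// placeholder standing in for any rules map like the table above.
const fullRules = {
  title: 'head > title',
  description: 'meta[name="description"]',
  openGraphTitle: 'meta[property="og:title"]',
  links: 'a',
};

function pickRules(rules, keys) {
  return Object.fromEntries(
    keys.filter((k) => k in rules).map((k) => [k, rules[k]]),
  );
}

console.log(pickRules(fullRules, ['title', 'description']));
// { title: 'head > title', description: 'meta[name="description"]' }
```

Keeping the rule set small also keeps the scraped payload small, which matters when exporting results to JSON or CSV.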
💡 Contributing
Contributions are welcome! Please submit issues or pull requests.