
Simple Web Scraper


Live Documentation

A lightweight and efficient web scraping package for JavaScript/TypeScript applications. This package helps developers fetch HTML content, parse web pages, and extract data effortlessly.


✨ Features

  • Fetch Web Content – Retrieve HTML from any URL with ease.
  • Parse and Extract Data – Utilize integrated parsing tools to extract information.
  • Configurable Options – Customize scraping behaviors using CSS selectors.
  • Headless Browser Support – Optionally use Puppeteer for JavaScript-rendered pages.
  • Lightweight & Fast – Uses Cheerio for static HTML scraping.
  • TypeScript Support – Fully typed for robust development.
  • Data Export Support – Export scraped data in JSON or CSV formats.
  • CSV Import Support – Read CSV files and convert them to JSON.

📚 Installation

Install via npm:

npm install @the-node-forge/simple-web-scraper

or using Yarn:

yarn add @the-node-forge/simple-web-scraper

🚀 Why Use Cheerio and Puppeteer?

This package leverages Cheerio and Puppeteer for powerful web scraping capabilities:

🔹 Cheerio (Fast and Lightweight)

  • Ideal for static HTML parsing (like jQuery for the backend).
  • Extremely fast and lightweight – perfect for pages without JavaScript rendering.
  • Provides easy CSS selector querying for extracting structured data.
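
For a sense of what the package wraps, here is a minimal sketch of direct Cheerio usage (illustrative only, assuming cheerio is installed as a dependency):

import * as cheerio from 'cheerio';

// Parse a static HTML string and query it with CSS selectors
const html = '<h1>Hello</h1><p>Static content</p>';
const $ = cheerio.load(html);
console.log($('h1').text()); // "Hello"
console.log($('p').text()); // "Static content"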

🔹 Puppeteer (Headless Browser Automation)

  • Handles JavaScript-rendered pages – essential for scraping dynamic content.
  • Can interact with pages, click buttons, and fill out forms.
  • Allows screenshot capturing, PDF generation, and full-page automation.
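
And a comparable sketch of direct Puppeteer usage for a JavaScript-rendered page (illustrative only; launching Puppeteer drives a headless Chromium):

import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch(); // start headless Chromium
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });
  console.log(await page.content()); // HTML after JavaScript has run
  await browser.close();
})();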

Best of Both Worlds

  • Use Cheerio for speed when scraping static pages.
  • Switch to Puppeteer for JavaScript-heavy sites requiring full rendering.
  • Provides flexibility to choose the best approach for your project.

API Reference

WebScraper Class

new WebScraper(options?: ScraperOptions)

📊 Constructor Options

Parameter    | Type                   | Description
usePuppeteer | boolean (optional)     | Whether to use Puppeteer (default: true)
throttle     | number (optional)      | Delay in milliseconds between requests (default: 1000)
rules        | Record<string, string> | CSS selectors defining data extraction rules
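
For example, a scraper restricted to static HTML that waits two seconds between requests (a minimal sketch using only the options documented above):

import { WebScraper } from '@the-node-forge/simple-web-scraper';

const politeScraper = new WebScraper({
  usePuppeteer: false, // static pages, so Cheerio is sufficient
  throttle: 2000, // wait 2000 ms between requests
  rules: { title: 'h1' },
});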

Methods

scrape(url: string): Promise<Record<string, any>>

  • Scrapes the given URL based on the configured options.

Utility Functions

exportToJSON(data: any, filePath: string): void

  • Exports the given data to a JSON file.

exportToCSV(data: any | any[], filePath: string, options?: { preserveNulls?: boolean }): void

  • Exports the given data to a CSV file. Pass { preserveNulls: true } to preserve null and undefined values as null.
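
Because the package ships type definitions, constructor options can be annotated explicitly. A minimal sketch, assuming the ScraperOptions type from the constructor signature above is exported by the package:

import { WebScraper, ScraperOptions } from '@the-node-forge/simple-web-scraper';

// ScraperOptions is assumed to be exported, as the constructor signature suggests
const options: ScraperOptions = {
  usePuppeteer: false,
  throttle: 1000,
  rules: { title: 'h1', links: 'a' },
};

const scraper = new WebScraper(options);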

🛠️ Basic Usage

1. Scraping Web Pages

You can scrape web pages using either Puppeteer (for JavaScript-heavy pages) or Cheerio (for static HTML pages).

import { WebScraper } from '@the-node-forge/simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: false, // Set to true for dynamic pages
  rules: {
    title: 'h1',
    description: 'meta[name="description"]',
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data);
})();

2. Using Puppeteer for JavaScript-heavy Pages

To scrape pages that require JavaScript execution:

const scraper = new WebScraper({
  usePuppeteer: true, // Enable Puppeteer for JavaScript-rendered content
  rules: {
    heading: 'h1',
    price: '.product-price',
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com/product');
  console.log(data);
})();

3. Exporting Data

  • Scraped data can be exported to JSON or CSV files using utility functions.

Export to JSON

import { exportToJSON } from '@the-node-forge/simple-web-scraper';

const data = { name: 'Example', value: 42 };
exportToJSON(data, 'output.json');

Export to CSV

import { exportToCSV } from '@the-node-forge/simple-web-scraper';

const data = [
  { name: 'Example 1', value: 42 },
  { name: 'Example 2', value: 99 },
];
exportToCSV(data, 'output.csv');

// Preserve null and undefined values as null
exportToCSV(data, 'output.csv', { preserveNulls: true });
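
The feature list also mentions CSV import (reading a CSV file back as JSON). The package's export for this is not shown in this README, so here is a self-contained sketch using only Node's built-in fs module, assuming a simple file with a header row and no quoted commas:

import { readFileSync } from 'fs';

// Minimal CSV-to-JSON conversion; assumes a header row and
// no quoted or escaped commas inside field values.
function csvToJSON(filePath: string): Record<string, string>[] {
  const [header, ...rows] = readFileSync(filePath, 'utf8').trim().split('\n');
  const keys = header.split(',');
  return rows.map((row) => {
    const values = row.split(',');
    return Object.fromEntries(keys.map((key, i) => [key, values[i]]));
  });
}

console.log(csvToJSON('output.csv'));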

🖥 Backend Example - Module (import)

This example demonstrates how to use simple-web-scraper in a Node.js backend:

import express from 'express';
import { WebScraper, exportToJSON, exportToCSV } from '@the-node-forge/simple-web-scraper';

const app = express();
const scraper = new WebScraper({
  usePuppeteer: true,
  rules: { title: 'h1', content: 'p' },
});

app.get('/scrape-example', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv', { preserveNulls: true }); // export CSV

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

app.listen(3000, () => console.log('Server listening on port 3000'));

🖥 Backend Example - CommonJS (require)

The same backend pattern using CommonJS require:

const express = require('express');
const {
  WebScraper,
  exportToJSON,
  exportToCSV,
} = require('@the-node-forge/simple-web-scraper/dist');

const app = express();

const scraper = new WebScraper({
  usePuppeteer: true,
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

app.get('/test-scraper', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv'); // export CSV

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

app.listen(3000, () => console.log('Server listening on port 3000'));

🛠️ Full Usage Example

import { WebScraper } from '@the-node-forge/simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: true, // Set to false if scraping static pages
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

(async () => {
  const data = await scraper.scrape('https://example.com');
  console.log(data);
})();

📊 Rule Set Table

Rule                 | CSS Selector | Target Data
fullHTML             | html | The entire HTML of the page
title                | head > title | The <title> of the page
description          | meta[name="description"] | Meta description for SEO
keywords             | meta[name="keywords"] | Meta keywords
favicon              | link[rel="icon"] | Website icon
mainHeading          | h1 | The first <h1> heading
allHeadings          | h1, h2, h3, h4, h5, h6 | All headings (h1-h6)
firstParagraph       | p | The first paragraph (<p>)
allParagraphs        | p | All paragraphs on the page
links                | a | All anchor <a> links
images               | img | All image <img> sources
imageAlts            | img | All image alt texts
videos               | video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"] | Video sources (<video>, YouTube, Vimeo)
tables               | table | All <table> elements
tableData            | td | Individual <td> elements
lists                | ul, ol | All ordered <ol> and unordered <ul> lists
listItems            | li | All list <li> items
scripts              | script | JavaScript files included (<script src="...">)
stylesheets          | link[rel="stylesheet"] | Stylesheets (<link rel="stylesheet">)
structuredData       | script[type="application/ld+json"] | JSON-LD structured data for SEO
socialLinks          | a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"] | Facebook, Twitter, LinkedIn, Instagram links
author               | meta[name="author"] | Page author
publishDate          | meta[property="article:published_time"], time | Date article was published
modifiedDate         | meta[property="article:modified_time"] | Last modified date
canonicalURL         | link[rel="canonical"] | Canonical URL (avoids duplicate content)
openGraphTitle       | meta[property="og:title"] | OpenGraph metadata for social sharing
openGraphDescription | meta[property="og:description"] | OpenGraph description
openGraphImage       | meta[property="og:image"] | OpenGraph image URL
twitterCard          | meta[name="twitter:card"] | Twitter card type (summary, summary_large_image)
twitterTitle         | meta[name="twitter:title"] | Twitter title metadata
twitterDescription   | meta[name="twitter:description"] | Twitter description metadata
twitterImage         | meta[name="twitter:image"] | Twitter image metadata

💡 Contributing

Contributions are welcome! Please submit issues or pull requests.