@phunky/scrape-channel-listings

Scrape Channel Listings

A TypeScript library for scraping TV channel listings from various providers, including:

DIRECTV
DISH Network
Sky UK
Virgin Media

This project started as a proof of concept for scraping TV channel listings. The codebase has since been significantly improved with the assistance of Cursor AI.

Features

Parallel scraping with configurable concurrency
Performance monitoring and statistics
Error handling and detailed logging
JSON output by default
Optional file output for each provider
Individual provider scraping support
Available as both a library and CLI tool
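
The "parallel scraping with configurable concurrency" feature can be pictured as a small promise pool. The sketch below is illustrative only: `runWithConcurrency` is a hypothetical helper, not a function exported by this package, and the package's internal scheduling may work differently.

```typescript
// Hypothetical helper, not part of the package's documented API.
// Runs async tasks with at most `limit` of them in flight at once
// and returns their results in the original task order.
async function runWithConcurrency<T>(
    tasks: Array<() => Promise<T>>,
    limit: number
): Promise<T[]> {
    const results: T[] = new Array(tasks.length);
    let next = 0; // index of the next task that has not been started

    // Each worker repeatedly claims the next unstarted task until none remain.
    async function worker(): Promise<void> {
        while (next < tasks.length) {
            const index = next++;
            results[index] = await tasks[index]();
        }
    }

    // Never start more workers than tasks, and always start at least one.
    const workerCount = Math.max(1, Math.min(limit, tasks.length));
    await Promise.all(Array.from({ length: workerCount }, () => worker()));
    return results;
}
```

With a limit of 2 and seven provider tasks, two scrapers would run at a time and each finished scraper would free a slot for the next one.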

Prerequisites

This package requires Playwright with the Chromium browser for web scraping. After installing the package, install Playwright's Chromium browser:

# Install the package
npm install @phunky/scrape-channel-listings

# Install Playwright's Chromium browser
npx playwright install chromium

Data Sources

The channel listings are scraped from the following sources:

Virgin Media UK: rxtvinfo.com/virgin-media-channel-list-uk
Sky UK: rxtvinfo.com/sky-channel-list-uk
Sky UK Satellite: rxtvinfo.com/sky-satellite-channel-list-uk
Sky Ireland: rxtvinfo.com/sky-stream-and-sky-glass-channel-list-republic-of-ireland-epg
Freesat HD: rxtvinfo.com/freesat-channel-list-uk/
DIRECTV: usdirect.com/channels
DISH Network: allconnect.com/providers/dish/channel-guide

Please note that these sources are third-party websites and may change without notice. The scrapers are maintained to work with the current structure of these sites, but may need updates if the source websites undergo significant changes.

Installation

npm install @phunky/scrape-channel-listings

Usage

As a Library

import { scrapeAllProviders, scrapeProvider, type Channel, type ScrapingSummary } from '@phunky/scrape-channel-listings';

// Scrape all providers
const channels = await scrapeAllProviders();
console.log(channels); // Array of { provider: string, channels: Channel[] }

// Scrape with options
const summary = await scrapeAllProviders({
    writeFiles: true, // Write results to files
    maxConcurrent: 2  // Limit concurrent scrapers
});
console.log(summary); // ScrapingSummary object

// Scrape a specific provider
const result = await scrapeProvider('directv');
console.log(result); // ScraperResult object

As a CLI Tool

# Scrape all providers
npx @phunky/scrape-channel-listings

# Scrape specific providers
npx @phunky/scrape-channel-listings --provider virgin
npx @phunky/scrape-channel-listings --provider sky
npx @phunky/scrape-channel-listings --provider skyireland
npx @phunky/scrape-channel-listings --provider skysatellite
npx @phunky/scrape-channel-listings --provider freesat
npx @phunky/scrape-channel-listings --provider directv
npx @phunky/scrape-channel-listings --provider dish

# Write results to files
npx @phunky/scrape-channel-listings --write-files

# Control concurrent scrapers
npx @phunky/scrape-channel-listings --max-concurrent 4

API Reference

Types

interface Channel {
    number: string;
    name: string;
}

interface ProviderChannels {
    provider: string;
    channels: Channel[];
}

interface ScraperResult {
    name: string;
    success: boolean;
    duration: number;
    channelCount?: number;
    error?: Error;
    channels?: Channel[];
}

interface ScrapingOptions {
    writeFiles?: boolean;
    maxConcurrent?: number;
}

interface ScrapingSummary {
    results: ScraperResult[];
    totalDuration: number;
    successRate: string;
    totalChannels: number;
    failedScrapers: ScraperResult[];
}

Functions

`scrapeAllProviders(options?: ScrapingOptions): Promise<ProviderChannels[] | ScrapingSummary>`

Scrapes channel listings from all configured providers. Returns either an array of provider channels or a summary object, depending on the writeFiles option; in the library usage example above, the call with writeFiles: true yields the summary.
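
Because the return type is a union, callers need to narrow it before reading fields. The sketch below shows one way to do that with a type guard. The interfaces are copied from the Types section so the example stands alone; `isScrapingSummary` and `countChannels` are hypothetical helpers, not part of the package.

```typescript
// Interfaces copied from the Types section so this example is self-contained.
interface Channel { number: string; name: string; }
interface ProviderChannels { provider: string; channels: Channel[]; }
interface ScraperResult {
    name: string;
    success: boolean;
    duration: number;
    channelCount?: number;
    error?: Error;
    channels?: Channel[];
}
interface ScrapingSummary {
    results: ScraperResult[];
    totalDuration: number;
    successRate: string;
    totalChannels: number;
    failedScrapers: ScraperResult[];
}

// Hypothetical type guard: the array form is ProviderChannels[],
// so anything that is not an array is treated as a ScrapingSummary.
function isScrapingSummary(
    value: ProviderChannels[] | ScrapingSummary
): value is ScrapingSummary {
    return !Array.isArray(value);
}

// Hypothetical helper: total channel count for either return shape.
function countChannels(value: ProviderChannels[] | ScrapingSummary): number {
    if (isScrapingSummary(value)) {
        return value.totalChannels;
    }
    return value.reduce((sum, provider) => sum + provider.channels.length, 0);
}
```

The value returned by `scrapeAllProviders` could be passed straight to `countChannels` without the caller knowing which shape it received.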

`scrapeProvider(providerName: string, options?: ScrapingOptions): Promise<ScraperResult>`

Scrapes channel listings from a specific provider. Throws an error if the provider is not found.
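
The `success` and `error` fields on ScraperResult suggest that a scrape failure may be reported in the result rather than thrown, while an unknown provider name throws. That reading is an assumption, not something this README states. Under it, a caller might format results as sketched below; `describeResult` is a hypothetical helper, and treating `duration` as milliseconds is also an assumption.

```typescript
// Interfaces copied from the Types section so this example is self-contained.
interface Channel { number: string; name: string; }
interface ScraperResult {
    name: string;
    success: boolean;
    duration: number;
    channelCount?: number;
    error?: Error;
    channels?: Channel[];
}

// Hypothetical helper: turns a ScraperResult into a one-line status message.
// Assumes `duration` is measured in milliseconds.
function describeResult(result: ScraperResult): string {
    if (!result.success) {
        const reason = result.error?.message ?? "unknown error";
        return `${result.name} failed after ${result.duration}ms: ${reason}`;
    }
    // Prefer the reported count, then fall back to the channel array length.
    const count = result.channelCount ?? result.channels?.length ?? 0;
    return `${result.name} returned ${count} channels in ${result.duration}ms`;
}
```

In practice the call to `scrapeProvider` would sit inside a `try`/`catch` to handle the unknown-provider error, with `describeResult` applied to the resolved value.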

Configuration

The scraper can be configured using environment variables:

# Run with custom configuration
HEADLESS=false CONCURRENT_SCRAPERS=2 npm run scrape

Available environment variables:

HEADLESS: Set to 'false' to see the browser while scraping (default: true)
CONCURRENT_SCRAPERS: Number of scrapers to run in parallel (default: 10)
RETRY_ATTEMPTS: Number of retry attempts for failed scrapes (default: 1)
RETRY_DELAY: Delay between retries in milliseconds (default: 1000)
PAGE_TIMEOUT: Page load timeout in milliseconds (default: 30000)
OUTPUT_DIR: Directory to save results when using --write-files (default: '../data')
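
As a rough illustration of how these variables and their defaults could be read, the sketch below maps an environment object onto a typed config. `ScraperConfig`, `intFromEnv`, and `loadConfig` are hypothetical names; the package's actual parsing and validation may differ.

```typescript
// Hypothetical config shape mirroring the environment variables listed above.
interface ScraperConfig {
    headless: boolean;
    concurrentScrapers: number;
    retryAttempts: number;
    retryDelay: number;   // milliseconds
    pageTimeout: number;  // milliseconds
    outputDir: string;
}

// Accepts only strings of decimal digits; anything else falls back to the default.
function intFromEnv(value: string | undefined, fallback: number): number {
    if (value === undefined || !/^\d+$/.test(value)) {
        return fallback;
    }
    return parseInt(value, 10);
}

// Builds a config from an environment-like object, using the documented defaults.
function loadConfig(env: Record<string, string | undefined>): ScraperConfig {
    return {
        // Only the literal string 'false' disables headless mode.
        headless: env.HEADLESS !== "false",
        concurrentScrapers: intFromEnv(env.CONCURRENT_SCRAPERS, 10),
        retryAttempts: intFromEnv(env.RETRY_ATTEMPTS, 1),
        retryDelay: intFromEnv(env.RETRY_DELAY, 1000),
        pageTimeout: intFromEnv(env.PAGE_TIMEOUT, 30000),
        outputDir: env.OUTPUT_DIR ?? "../data",
    };
}
```

In Node.js this would typically be called as `loadConfig(process.env)`.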

Error Handling

The scraper will:

Retry failed attempts based on the RETRY_ATTEMPTS setting
Log detailed error messages
Continue with remaining providers if one fails
Exit with code 1 if any scraper fails
Provide error details in the final summary (when using --write-files)
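
The retry behaviour can be sketched as a small wrapper around an async task. This is an illustration under stated assumptions, not the package's implementation: `withRetry` is a hypothetical helper, and it assumes RETRY_ATTEMPTS counts retries after the first try (so the default of 1 allows up to two tries in total) with RETRY_DELAY as a fixed pause between them.

```typescript
// Hypothetical helper, not part of the package's documented API.
// Calls `task`, retrying up to `retryAttempts` extra times and waiting
// `retryDelayMs` milliseconds between tries. Rethrows the last error
// if every try fails.
async function withRetry<T>(
    task: () => Promise<T>,
    retryAttempts: number,
    retryDelayMs: number
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt <= retryAttempts; attempt++) {
        try {
            return await task();
        } catch (error) {
            lastError = error;
            // Only pause if another try is still to come.
            if (attempt < retryAttempts) {
                await new Promise<void>((resolve) => setTimeout(resolve, retryDelayMs));
            }
        }
    }
    throw lastError;
}
```

A caller that wants to continue with the remaining providers after one fails could catch the rethrown error per provider and record it, rather than letting it abort the whole run.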

Development

# Install dependencies
npm install

# Build the package
npm run build

# Run tests
npm test

# Run a specific scraper
npm run scrape:virgin
npm run scrape:sky
npm run scrape:skyireland
npm run scrape:skysatellite
npm run scrape:freesat
npm run scrape:directv
npm run scrape:dish