0.4.23 • Published 2 years ago

tms-scrape v0.4.23

License
MIT

tms-scrape

Scrape a single web page from the command line.

Install

npm install [-g] tms-scrape

Test

npm run test

This scrapes the terrymorse.com home page and produces the following file:

test
└── index.html

Usage

scrape --url <source_url>  [--config <config_file> --dest <directory>]

Where:

  • source_url - URL of the page to scrape (overrides the config_file value)
  • config_file - configuration file (JSON)
  • directory - directory that receives the scrape results (overrides the config_file value)

The source_url is required. It can be given either in the config file:

{
  "urls": [
    "https://terrymorse.com"
  ],
  "directory": "test"
}

or on the command line.

Config File Default Properties

By default, a scrape does the following:

  • evaluates the page statically (puppeteer is not used)
  • saves only the HTML
  • does not follow links
  • removes all scripts from the HTML file
  • converts all relative URLs to absolute
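The last item, converting relative URLs to absolute, can be illustrated with the standard URL class built into Node.js. This is only a sketch of the idea and not necessarily how tms-scrape does it; the helper name toAbsolute is hypothetical.

```javascript
// Hypothetical helper (not part of tms-scrape): resolve a reference found
// in a page against the URL of the page it was found in.
function toAbsolute(ref, pageUrl) {
  try {
    return new URL(ref, pageUrl).href;
  } catch {
    // Leave references that cannot be parsed unchanged.
    return ref;
  }
}

const page = 'https://terrymorse.com/blog/post.html';

console.log(toAbsolute('/img/logo.png', page));
// https://terrymorse.com/img/logo.png
console.log(toAbsolute('css/site.css', page));
// https://terrymorse.com/blog/css/site.css
console.log(toAbsolute('https://example.com/a.js', page));
// https://example.com/a.js
```

References that are already absolute pass through unchanged, so the same helper can be applied to every src and href on a page.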

See the following for config file property details:

const configDefault = {
  // urls to scrape (required)
  "urls": [],

  // destination directory
  "directory": "./scrape-result",

  // scrape using axios instead of 'website-scraper'
  "scrapeWithAxios": false,

  // types of files to save (default none)
  "sources": [
    // {selector: 'img', attr: 'src'},
    // {selector: 'link[rel="stylesheet"]', attr: 'href'}
  ],

  // where to store files (default none)
  "subdirectories": [
    // {directory: 'img', extensions: ['.jpg', '.jpeg', '.png', '.svg']},
    // {directory: 'css', extensions: ['.css']},
    // {directory: 'font', extensions: ['.woff', '.ttf', '.woff2']}
  ],

  // how deep in hierarchy to search (1: files referenced by source file)
  "maxDepth": 1,

  // remove all link elements
  "removeLinkEls": true,

  // remove all style elements
  "removeStyles": true,

  // remove all scripts from HTML file
  "removeScripts": true,

  // convert relative refs to absolute
  "convertRelativeRefs": true,

  // save html to file
  "saveToFile": true,

  // name for source file
  "defaultFilename": "index.html",

  // keep going if there are errors
  "ignoreErrors": true,

  // dynamic: parse dynamic pages using puppeteer
  "dynamic": false,

  // dynamic settings
  "puppeteerConfig": {
    // "launchOptions": {"headless": false},
    // "scrollToBottom": {"timeout": 30_000, "viewportN": 10},
    // "blockNavigation": true,
    // "browser": null,
    // "headers": {}
  }
}

A config file specified on the command line may contain any of these properties. Any property missing from the config file falls back to the default shown above.
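The precedence described in this README (defaults, then config file, then command-line values for the URL and destination directory) might be sketched as follows. The resolveConfig helper and the shape of its cli argument are hypothetical, the defaults object is trimmed to a few properties, and the shallow merge is an assumption; the package may merge nested properties such as puppeteerConfig differently.

```javascript
// Trimmed copy of the documented defaults (only a few properties shown).
const configDefault = {
  urls: [],
  directory: './scrape-result',
  sources: [],
  maxDepth: 1,
  removeScripts: true,
  dynamic: false,
};

// Hypothetical helper: later sources win over earlier ones.
// cli = { url, dest } mirrors the --url and --dest options.
function resolveConfig(fileConfig = {}, cli = {}) {
  // Assumption: a shallow merge of file values over defaults.
  const config = { ...configDefault, ...fileConfig };
  if (cli.url) config.urls = [cli.url];
  if (cli.dest) config.directory = cli.dest;
  if (config.urls.length === 0) {
    throw new Error('no source url in the config file or on the command line');
  }
  return config;
}

// A config file that also saves images, using one of the commented-out
// "sources" entries from the defaults listing.
const fileConfig = {
  urls: ['https://terrymorse.com'],
  directory: 'test',
  sources: [{ selector: 'img', attr: 'src' }],
};

const resolved = resolveConfig(fileConfig, { dest: 'out' });
console.log(resolved.directory);     // out (command line overrides the file)
console.log(resolved.urls);          // [ 'https://terrymorse.com' ]
console.log(resolved.removeScripts); // true (default, not set in the file)
```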

API

import { doScrape } from 'tms-scrape';
/**
 * scrape a single page
 * @param {Object} options
 * @return {Promise<{directory: string, html: string}>}
 */

// doScrape returns a Promise, so await it (top-level await works in ES modules)
const { directory, html } = await doScrape(options);
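Because doScrape returns a Promise, failures surface as rejections. The sketch below wraps the call with basic error handling. It assumes that the options object accepts the same property names as the config file, which this README does not state explicitly. The wrapper name scrapeToDirectory and the stand-in fakeScrape are hypothetical; the scrape function is passed in as a parameter so the snippet runs without the package installed.

```javascript
// Hypothetical wrapper: `scrape` is expected to be the doScrape function
// imported from 'tms-scrape'.
async function scrapeToDirectory(scrape, url, directory) {
  // Assumption: options use the same property names as the config file.
  const options = { urls: [url], directory };
  try {
    const result = await scrape(options);
    return { ok: true, directory: result.directory, length: result.html.length };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}

// Stand-in for doScrape, used only so that the example runs offline.
const fakeScrape = async (options) => ({
  directory: options.directory,
  html: '<html><body>hello</body></html>',
});

scrapeToDirectory(fakeScrape, 'https://terrymorse.com', 'test')
  .then((summary) => console.log(summary));
```

With the real package, pass the imported doScrape in place of fakeScrape.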