0.1.25 • Published 1 month ago

scrpr

scrpr is a lightweight scraper multitool. It can fetch data via HTTPS, detect changes, and parse the most common formats.

Usage Example

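const scrpr = require("scrpr");

// create a scraper with up to 5 parallel requests and a custom cache directory
const scraper = scrpr({
	concurrency: 5,
	cachedir: '/tmp/scraper-cache',
});

// fetch a csv file and parse it
scraper("https://example.org/data.csv", {
	parse: "csv",
}, function(err, change, data){

	// stop on errors; otherwise log the parsed data only if the resource changed
	if (err) return console.error(err);
	if (change) console.log(data);

});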

scrpr(opts) → function scraper

Constructor; returns the scraper function.

Opts:

  • concurrency — number of parallel requests; default: 1
  • cachedir — directory to save cache data in; default: <root module>/.scrpr-cache

scraper([url], [opts], [callback(err, change, data)])

Scraper, delivers data

Opts:

  • method — http method; default: get
  • url — URL, alternative to url parameter
  • headers — additional http request headers, default: {}
  • data — http data to be sent, default: null
  • cache — use cache, default: true
  • cacheid — override cache id, default: hash(url, opts)
  • parse — format to parse, default: null (raw data)
  • successCodes — array of http status codes considered successful, default: [ 200 ]
  • needle — options passed on to needle, default: {}
  • xlsx — options passed on to xlsx, default: {}
  • xsv — options passed on to xsv, default: {}
  • pdf — options passed on to pdf.js-extract, default: {}
  • preprocess(data, callback(err, data)) — modify data before parsing
  • postprocess(data, callback(err, data)) — modify data after parsing
  • stream — deliver data as ReadableStream — no parsing or processing, default: false
  • metaredirects — follow <meta http-equiv="refresh"> style redirects, default: false
  • iconv — decode stream or data as this charset with iconv-lite before parsing, default: false
  • cooldown — microseconds since last fetch before a resource is fetched again, default: false
  • sizechange — treat unchanged content-length as same file, default: false
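
The preprocess and postprocess hooks listed above both take (data, callback(err, data)). The following is a minimal sketch of how such hooks might look. The helper names stripBom and onlyItems are hypothetical, and two things are assumptions rather than documented behaviour: that preprocess receives a Buffer or string, and that the parsed JSON contains an "items" array.

```javascript
// Hypothetical preprocess hook, following the documented (data, callback) signature.
// Assumption: data arrives as a Buffer or a string before parsing.
function stripBom(data, callback) {
	const text = Buffer.isBuffer(data) ? data.toString("utf8") : String(data);
	// Remove a leading byte order mark, which can confuse parsers.
	callback(null, text.charCodeAt(0) === 0xfeff ? text.slice(1) : text);
}

// Hypothetical postprocess hook.
// Assumption: the parsed JSON is an object with an "items" array (an invented shape).
function onlyItems(data, callback) {
	if (!data || !Array.isArray(data.items)) {
		return callback(new Error("unexpected payload: no items array"));
	}
	callback(null, data.items);
}

// Options object as it might be passed to scraper(url, opts, callback).
const hookOpts = {
	parse: "json",
	preprocess: stripBom,
	postprocess: onlyItems,
};
```

Passing an Error as the first callback argument is the usual Node.js convention for signalling failure; whether scrpr forwards it unchanged to the scraper callback is not stated in this document.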

Callback:

  • err — contains Error or null
  • change — true if data changed
  • data — raw or parsed data when changed, otherwise status string
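
Because data carries a status string when nothing changed, callers may want to keep the two cases apart. As a sketch, the callback could be wrapped in a Promise like this. Both toResult and scrapeAsync are hypothetical helpers, not part of scrpr; the scraper argument is assumed to be the function returned by scrpr(opts).

```javascript
// Separate content from the status string based on the change flag.
function toResult(change, data) {
	return change
		? { change: true, data: data, status: null }
		: { change: false, data: null, status: data };
}

// Wrap a scraper function in a Promise (hypothetical helper).
function scrapeAsync(scraper, url, opts) {
	return new Promise((resolve, reject) => {
		scraper(url, opts, (err, change, data) => {
			if (err) return reject(err);
			resolve(toResult(Boolean(change), data));
		});
	});
}
```

With a real scraper this might be used as scrapeAsync(scraper, "https://example.org/data.csv", { parse: "csv" }), checking result.change before reading result.data.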

Parsers

  • csv — Comma Separated Values; data is an Object, parsed with xsv
  • tsv — Tab Separated Values; data is an Object, parsed with xsv
  • ssv — Semicolon Separated Values (data has been exported "as csv" with some localizations of Microsoft Excel); data is an Object, parsed with xsv
  • xml — eXtensible Markup Language; data is an Object, parsed with xml2js
  • json — JavaScript Object Notation; data is an Object, parsed natively
  • html — HyperText Markup Language; data is an instance of cheerio
  • yaml — YAML Ain't Markup Language; data is an Object, parsed with yaml
  • xlsx — Office Open XML Workbook; data is an Object, parsed with xlsx; { "<sheetname>": [ [ cell, cell, cell, ... ], ... ] }
  • pdf — Portable Document Format; data is an Object, parsed with pdf.js-extract
  • kdl — KDL Document Language; data is an Object, parsed with kdljs
  • dw — Datawrapper Visualisation; data is an Object, extracted with dataunwrapper
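
The xlsx parser delivers each sheet as an array of rows, each row an array of cells. A small sketch of turning one sheet into records follows. sheetToRecords is a hypothetical helper, it assumes the first row holds the column names, and the sheet name and values in the sample workbook are made up.

```javascript
// Turn one sheet ([[cell, ...], ...]) into an array of objects keyed by the header row.
// Assumption: the first row contains the column names.
function sheetToRecords(rows) {
	if (!Array.isArray(rows) || rows.length === 0) return [];
	const [header, ...body] = rows;
	return body.map((row) => {
		const record = {};
		header.forEach((name, i) => {
			// Cells missing at the end of a short row become null.
			record[String(name)] = i < row.length ? row[i] : null;
		});
		return record;
	});
}

// Sample workbook in the documented shape; names and values are invented.
const workbook = { Sheet1: [["id", "name"], [1, "Ada"], [2, "Grace"]] };
const records = sheetToRecords(workbook.Sheet1);
```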

FTP

Rudimentary handling for ftp URLs is available if the optional get-uri dependency is installed.

Local Files

Rudimentary handling for local files is available with the file:/ pseudo-protocol.

Optional dependencies

xsv, xlsx, xml2js, yaml, cheerio, dataunwrapper, iconv-lite, kdljs, pdf.js-extract and get-uri are optional dependencies. They should only be installed if their use is required.
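
Since the parser packages are optional, it can help to check for the one a given parse value needs before scraping. The sketch below maps parser names to packages as given in the Parsers list. missingPackageFor is a hypothetical helper, and note that require.resolve looks packages up relative to the calling file, which may differ from where scrpr itself resolves them.

```javascript
// Parser name -> optional package, taken from the Parsers list.
// json is parsed natively and needs no package, so it is not listed.
const PARSER_PACKAGES = {
	csv: "xsv", tsv: "xsv", ssv: "xsv",
	xml: "xml2js", html: "cheerio", yaml: "yaml",
	xlsx: "xlsx", pdf: "pdf.js-extract", kdl: "kdljs", dw: "dataunwrapper",
};

// Returns the package name if the parser's package cannot be resolved,
// or null if it resolves or the parser needs no package.
// The resolve parameter exists so the lookup can be replaced in tests.
function missingPackageFor(parser, resolve = require.resolve) {
	const pkg = PARSER_PACKAGES[parser];
	if (!pkg) return null;
	try {
		resolve(pkg);
		return null;
	} catch (err) {
		return pkg;
	}
}
```

If missingPackageFor("xlsx") returns a name, that package would need to be installed before using parse: "xlsx".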

License

UNLICENSE
