scrpr v0.1.28
scrpr
scrpr is a lightweight scraper multitool. It can fetch data via HTTPS, detect changes, and parse the most common formats.
Usage Example
const scrpr = require("scrpr");
const scraper = scrpr({
  concurrency: 5,
  cachedir: '/tmp/scraper-cache',
});
scraper("https://example.org/data.csv", {
  parse: "csv",
}, function(err, change, data){
  if (err) console.error(err);
  if (change) console.log(data);
});

scrpr(opts) → function scraper
Constructor, returns scraper function
Opts:
concurrency - number of parallel requests; default: 1
cachedir - directory to save cache data in; default: <root module>/.scrpr-cache
scraper([url], [opts], [callback(err, change, data)])
Scraper, delivers data
Opts:
method - HTTP method; default: get
url - URL, alternative to the url parameter
headers - additional HTTP request headers; default: {}
data - HTTP data to be sent; default: null
cache - use cache; default: true
cacheid - override cache id; default: hash(url, opts)
parse - format to parse; default: null (raw data)
successCodes - array of HTTP status codes considered successful; default: [ 200 ]
needle - options passed on to needle; default: {}
xlsx - options passed on to xlsx; default: {}
xsv - options passed on to xsv; default: {}
pdf - options passed on to pdf.js-extract; default: {}
preprocess(data, callback(err, data)) - modify data before parsing
postprocess(data, callback(err, data)) - modify data after parsing
stream - deliver data as ReadableStream, with no parsing or processing; default: false
metaredirects - follow <meta http-equiv="refresh"> style redirects; default: false
iconv - decode stream or data as this charset with iconv-lite before parsing; default: false
cooldown - microseconds since last fetch before a resource is fetched again; default: false
sizechange - treat unchanged content-length as same file; default: false
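The preprocess and postprocess hooks use the callback signature listed above. As a rough sketch, they might be written and combined with other options as follows. It assumes that preprocess receives the raw response as a Buffer or string and that postprocess receives parsed CSV as an array of row objects; this README confirms neither. The names stripBom, dropEmptyRows, csvOpts and fetchCsv are made up for illustration.

```javascript
// Hypothetical preprocess hook: remove a UTF-8 byte order mark before parsing.
// Assumption: data arrives as a Buffer or a string.
function stripBom(data, callback) {
  const text = Buffer.isBuffer(data) ? data.toString("utf8") : String(data);
  callback(null, text.replace(/^\uFEFF/, ""));
}

// Hypothetical postprocess hook: drop rows whose values are all empty strings.
// Assumption: parsed CSV data is an array of row objects; anything else is passed through.
function dropEmptyRows(rows, callback) {
  if (!Array.isArray(rows)) return callback(null, rows);
  callback(null, rows.filter(function (row) {
    return Object.values(row).some(function (value) { return value !== ""; });
  }));
}

// Options object using only option names documented above.
const csvOpts = {
  parse: "csv",
  headers: { "Accept": "text/csv" },
  successCodes: [200],
  preprocess: stripBom,
  postprocess: dropEmptyRows,
};

// Takes a scraper created with scrpr(opts) as an argument,
// so this sketch can be loaded without scrpr installed.
function fetchCsv(scraper, url, callback) {
  scraper(url, csvOpts, callback);
}
```

Passing the scraper in as a parameter keeps the hooks easy to try out in isolation before wiring them to real requests.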
Callback:
err - contains an Error or null
change - true if data changed
data - raw or parsed data when changed, otherwise a status string
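Because data holds a status string rather than content when nothing changed, a callback would typically check err and change before using data. A minimal sketch of that branching, where makeHandler and onData are hypothetical names and the returned strings exist only for the caller's own bookkeeping:

```javascript
// Builds a callback with the (err, change, data) signature described above.
// makeHandler and onData are illustrative names, not part of scrpr.
function makeHandler(onData) {
  return function (err, change, data) {
    if (err) {
      // err contains an Error when something went wrong
      console.error("scrape failed:", err.message);
      return "error";
    }
    if (!change) {
      // data is a status string here, not content
      return "unchanged";
    }
    // data is raw or parsed content
    onData(data);
    return "changed";
  };
}
```

The resulting function can be passed as the last argument to scraper, for example scraper(url, { parse: "json" }, makeHandler(console.log)).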
Parsers
csv - Comma Separated Values; data is an Object, parsed with xsv
tsv - Tab Separated Values; data is an Object, parsed with xsv
ssv - Semicolon Separated Values (data exported "as csv" by some localizations of Microsoft Excel); data is an Object, parsed with xsv
xml - eXtensible Markup Language; data is an Object, parsed with xml2js
json - JavaScript Object Notation; data is an Object, parsed natively
html - HyperText Markup Language; data is an instance of cheerio
yaml - YAML Ain't Markup Language; data is an Object, parsed with yaml
xlsx - Office Open XML Workbook; data is an Object, parsed with xlsx: { "<sheetname>": [ [ cell, cell, cell, ... ], ... ] }
pdf - Portable Document Format; data is an Object, parsed with pdf.js-extract
kdl - KDL Document Language; data is an Object, parsed with kdljs
dw - Datawrapper Visualisation; data is an Object, extracted with dataunwrapper
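The xlsx result shape listed above, an object mapping sheet names to arrays of cell rows, can be converted into records with a small helper. This sketch assumes the first row of a sheet holds the column names, which this README does not state; sheetToRecords is a made-up name and the sample workbook values are invented.

```javascript
// Converts one sheet ([ [ cell, cell, ... ], ... ]) into an array of objects.
// Assumption: the first row contains the column names.
function sheetToRecords(rows) {
  if (!Array.isArray(rows) || rows.length === 0) return [];
  const header = rows[0];
  return rows.slice(1).map(function (cells) {
    const record = {};
    header.forEach(function (name, i) {
      // Missing trailing cells become null
      record[String(name)] = cells[i] === undefined ? null : cells[i];
    });
    return record;
  });
}

// Sample data in the documented shape; the values are invented.
const workbook = {
  Sheet1: [
    ["city", "population"],
    ["Springfield", 30000],
    ["Shelbyville", 25000],
  ],
};

const records = sheetToRecords(workbook.Sheet1);
console.log(records);
```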
FTP
Rudimentary handling for ftp URLs is available if the optional get-uri dependency is installed.
Local Files
Rudimentary handling for local files is available with the file:/ pseudo-protocol.
Optional dependencies
xsv, xlsx, xml2js, yaml, cheerio, dataunwrapper, iconv-lite, kdljs, pdf.js-extract and get-uri are optional dependencies. They should only be installed if their use is required.
License