scraping-pipeline v0.0.4
Node.js Scraping Pipelines
Introduction
scraping-pipeline is an asynchronous TypeScript module.
It helps organize the code of pipeline applications.
It includes generic functionality to scrape, parse, process, modify, and send data.
It also lets you define custom modules when the generic functionality is not enough.
Quick Start
How to install
npm i scraping-pipeline
Here are some examples to help you understand the features.
Basic pipeline with custom modules
import { Pipeline, Modules } from 'scraping-pipeline';

// Custom getter: produces the raw CSV text that enters the pipeline.
const yourFunctionToGetSomeCsv = async (): Promise<string> => {
  let someCsv: string;
  ...
  return someCsv;
};

// Custom saver: receives the parsed data from the previous module.
const yourFunctionToStoreData = async (data: any) => {
  ...
};

const getter = new Modules.General.Custom(yourFunctionToGetSomeCsv);
const parser = new Modules.General.CsvParser({ headers: true });
const saver = new Modules.General.Custom(yourFunctionToStoreData);

// Modules run in the order given: getter, then parser, then saver.
const pipeline = new Pipeline([getter, parser, saver]);
pipeline.run().then(() => { console.log('Done'); });
Components and Types
Pipeline
Pipeline is the main component of the package.
It is initialized with a pipe of Modules.
Pipeline has a method, run.
Running the Pipeline executes the Modules in sequence and feeds the Data from each one to the next.
The first module receives no input Data.
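The sequential hand-off described above can be pictured with a small standalone sketch. It does not use the package's classes: Step and runInSequence are hypothetical names introduced only for illustration, and the package's real implementation may differ.

```typescript
// A hypothetical step: receives the previous step's output and returns its own.
type Step = (input: unknown) => Promise<unknown>;

// Runs the steps one after another, feeding each result to the next step.
// The first step receives undefined because nothing precedes it.
async function runInSequence(steps: Step[]): Promise<unknown> {
  let current: unknown = undefined;
  for (const step of steps) {
    current = await step(current);
  }
  return current;
}

// Usage: produce a string, split it, then count the parts.
const steps: Step[] = [
  async () => 'a,b,c',
  async (text) => (text as string).split(','),
  async (parts) => (parts as string[]).length,
];

runInSequence(steps).then((result) => console.log(result)); // logs 3
```

Awaiting each step before starting the next is what keeps the order deterministic, which matches the "in sequence" behaviour the Pipeline describes.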
Modules
Modules are small components that usually perform a single task.
All Modules implement Modules.Base and extend Modules.Common<InputType, OutputType>.
The General Modules are designed to handle standard tasks.
CsvParser
Modules.General.CsvParser is a module that parses CSV Data and returns structured output.
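To make "structured output" concrete, the sketch below shows the kind of transformation a header-aware CSV parser performs. It is a simplified standalone function, not the package's CsvParser: parseCsvWithHeaders is a hypothetical name, quoted fields are not handled, and the exact shape CsvParser returns is an assumption.

```typescript
// Simplified illustration: turns CSV text into one record per row,
// keyed by the names in the header line. Quoting and escaping are ignored.
function parseCsvWithHeaders(csv: string): Record<string, string>[] {
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== '');
  const [headerLine, ...dataLines] = lines;
  if (headerLine === undefined) return [];
  const headers = headerLine.split(',').map((h) => h.trim());
  return dataLines.map((line) => {
    const cells = line.split(',');
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      // Missing cells become empty strings.
      record[header] = (cells[i] ?? '').trim();
    });
    return record;
  });
}

const rows = parseCsvWithHeaders('name,price\napple,1.20\npear,0.80\n');
console.log(rows);
// [ { name: 'apple', price: '1.20' }, { name: 'pear', price: '0.80' } ]
```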
ArrayParser
Modules.General.ArrayParser is a generic module that converts string arrays into a meaningful structure.
This module may be useful when you need to parse raw data from documents.
It takes a ParsingTemplate as a constructor argument, which tells the parser how to convert the array into structured data.
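The idea of a template driving the conversion can be sketched as follows. This is only an illustration of the concept: IndexTemplate and applyTemplate are hypothetical names, the sample values are made up, and the real ParsingTemplate may look quite different, since its shape is not described here.

```typescript
// Hypothetical template: maps a field name to the array index holding its value.
// This is NOT the package's ParsingTemplate.
type IndexTemplate = Record<string, number>;

// Builds a structured object by picking values from the array per the template.
function applyTemplate(values: string[], template: IndexTemplate): Record<string, string> {
  const result: Record<string, string> = {};
  for (const [field, index] of Object.entries(template)) {
    // Indexes outside the array yield an empty string.
    result[field] = values[index] ?? '';
  }
  return result;
}

// Raw lines extracted from a document, in a known order (made-up sample values).
const rawLines = ['INV-001', '2024-01-15', '149.90'];
const invoice = applyTemplate(rawLines, { id: 0, date: 1, total: 2 });
console.log(invoice);
// { id: 'INV-001', date: '2024-01-15', total: '149.90' }
```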
Custom
Modules.General.Custom<InputType, OutputType> uses a custom async function to solve problems the General Modules do not cover.
It takes an async function as a processor, which performs the task.
The processor function receives 3 arguments:
- data: InputType
- previous: any
- old: any[]
It returns a value of type OutputType.
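The processor contract listed above can be written down as a type. Processor, Row, and keepPriced are hypothetical names used for illustration, and the Promise return type is an assumption based on the processor being async. The processor is called directly here so the sketch runs without the package.

```typescript
// Hypothetical alias mirroring the three arguments listed above.
type Processor<InputType, OutputType> = (
  data: InputType,
  previous: any,
  old: any[],
) => Promise<OutputType>;

interface Row {
  name: string;
  price: string;
}

// Example processor: keeps only rows whose price parses to a positive number.
// The previous and old arguments are accepted but not needed for this task.
const keepPriced: Processor<Row[], Row[]> = async (data, _previous, _old) => {
  return data.filter((row) => Number(row.price) > 0);
};

// Direct call for illustration; with the package it would instead be passed
// to new Modules.General.Custom(keepPriced).
keepPriced(
  [{ name: 'apple', price: '1.20' }, { name: 'gift', price: '0' }],
  undefined,
  [],
).then((rows) => console.log(rows)); // [ { name: 'apple', price: '1.20' } ]
```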
Data
Data<T> is a generic type used to send Data between modules.
Data contains the current, previous, and old data. It stores all data passed across the Pipeline.
Usually you don't need to think about Data<T>; it is used at a lower level of the pipeline.
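One way to picture a container holding current, previous, and old data is sketched below. DataSketch and advance are hypothetical names, the field names are guesses, and the rule for how values shift between the fields is an assumption, not the package's documented behaviour.

```typescript
// Hypothetical shape: the source only says Data holds current, previous and old data.
interface DataSketch<T> {
  current: T;
  previous: unknown;
  old: unknown[];
}

// Assumed behaviour: when a module produces output, the former current value
// becomes previous, and the former previous value (if any) is appended to old.
function advance<T, U>(data: DataSketch<T>, next: U): DataSketch<U> {
  const old = data.previous === undefined ? data.old : [...data.old, data.previous];
  return { current: next, previous: data.current, old };
}

const start: DataSketch<string> = { current: 'a,b', previous: undefined, old: [] };
const second = advance(start, ['a', 'b']);
const third = advance(second, 2);
console.log(third); // { current: 2, previous: [ 'a', 'b' ], old: [ 'a,b' ] }
```

Under this assumption nothing is discarded as the Pipeline advances, which is consistent with the statement that Data stores everything passed across it.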
License
May be freely distributed under the MIT license.