0.0.4 • Published 2 years ago

scraping-pipeline v0.0.4

Node.js Scraping Pipelines

Introduction

Scraping Pipeline is an asynchronous TypeScript module.

It helps organize the code in pipeline applications.

It contains generic functionality to scrape, parse, process, modify and send data.

It also lets you define custom modules when the generic functionality is not enough.

Quick Start

How to install

npm i scraping-pipeline

Here are some examples to help you understand the features:

Basic pipeline with custom modules

import { Pipeline, Modules } from 'scraping-pipeline';

const yourFunctionToGetSomeCsv = async (): Promise<string> => {
  // Declared with `let`: a `const` must be initialized where it is declared.
  let someCsv: string;
  ...
  return someCsv;
};

const yourFunctionToStoreData = async (data: any) => {
  ...
};

const getter = new Modules.General.Custom(yourFunctionToGetSomeCsv);
const parser = new Modules.General.CsvParser({ headers: true });
const saver = new Modules.General.Custom(yourFunctionToStoreData);

const pipeline = new Pipeline([getter, parser, saver]);

pipeline.run().then(() => { console.log('Done') });

Components and Types

Pipeline

Pipeline is the main component of the package. It is initialized with a pipe (an ordered array) of Modules.

Pipeline has a run method. Running the Pipeline executes the Modules in sequence and feeds the Data from each one to the next.

The first module receives no input Data.
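The sequential behaviour described above can be sketched without the package. This is a minimal illustration, not the library's implementation: `Step`, `runInSequence` and the three sample steps are hypothetical names, and passing `undefined` to the first step is an assumption based on the statement that the first module receives no input.

```typescript
// Hypothetical stand-in for a module: an async step that maps input to output.
type Step = (input: unknown) => Promise<unknown>;

// Runs the steps in order and feeds each step's output into the next one.
// The first step is called with `undefined` because nothing precedes it.
async function runInSequence(steps: Step[]): Promise<unknown> {
  let current: unknown = undefined;
  for (const step of steps) {
    current = await step(current);
  }
  return current;
}

// Sample steps: get some text, split it into fields, count the fields.
const fetchText: Step = async () => 'a,b,c';
const splitFields: Step = async (input) => String(input).split(',');
const countFields: Step = async (input) => (input as string[]).length;
```

Calling `runInSequence([fetchText, splitFields, countFields])` resolves once the last step has finished, which mirrors how `pipeline.run()` is awaited in the Quick Start.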

Modules

Modules are small components that usually perform a single task.

All Modules implement Modules.Base and extend Modules.Common<InputType, OutputType>.

There are some General Modules designed for standard tasks.

CsvParser

Modules.General.CsvParser is a module that parses CSV Data and returns structured output.
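The exact output shape of CsvParser is not documented here. As a rough illustration of what "structured output" commonly means when a headers option is enabled, the sketch below turns CSV text into one object per row, keyed by the header row. `parseCsvWithHeaders` and `CsvRow` are hypothetical names, and the splitting is deliberately naive: it does not handle quoted fields or embedded commas.

```typescript
// One parsed row: header name -> cell text.
type CsvRow = Record<string, string>;

// Naive CSV parsing: the first non-empty line supplies the keys,
// every following line becomes one object. No quoting support.
function parseCsvWithHeaders(csv: string): CsvRow[] {
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) return [];
  const headers = (lines[0] ?? '').split(',').map((h) => h.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(',');
    const row: CsvRow = {};
    headers.forEach((header, i) => {
      // Missing trailing cells become empty strings.
      row[header] = (cells[i] ?? '').trim();
    });
    return row;
  });
}
```

A real parser should follow the quoting rules of the CSV format; this version only shows the mapping from header names to cell values.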

ArrayParser

Modules.General.ArrayParser is a generic module that converts string arrays into structured data.

This module may be useful when you need to parse raw data from documents.

It takes a ParsingTemplate as a constructor argument, which tells the parser how to convert the array into structured data.
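The shape of ParsingTemplate is not described in this document, so the following is only a sketch of the general idea of template-driven parsing. It assumes a hypothetical template, `IndexTemplate`, that maps each output field to a position in the string array; the real ParsingTemplate may look quite different.

```typescript
// Hypothetical template: output field name -> index in the source array.
type IndexTemplate = Record<string, number>;

// Builds a structured record by picking array elements according to the template.
function applyTemplate(
  values: string[],
  template: IndexTemplate,
): Record<string, string> {
  const result: Record<string, string> = {};
  Object.keys(template).forEach((field) => {
    const index = template[field] ?? -1;
    // Out-of-range positions yield an empty string instead of undefined.
    result[field] = values[index] ?? '';
  });
  return result;
}
```

For example, `applyTemplate(['INV-001', '2021-03-04', '99.50'], { id: 0, date: 1, total: 2 })` yields an object with `id`, `date` and `total` fields taken from the corresponding positions.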

Custom

Modules.General.Custom<InputType, OutputType> uses a custom async function to solve custom problems.

It takes an async function as a processor, which performs the task.

The processor function receives three arguments:

  • data: InputType
  • previous: any
  • old: any[]

It returns a value of type OutputType.
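A processor with the three arguments listed above might look like the sketch below. The `Processor` alias, the `Product` type and both sample functions are hypothetical names introduced for illustration; only the argument list (`data`, `previous`, `old`) comes from the description above. The example also assumes that `old` holds the results of earlier modules, which this document implies but does not spell out.

```typescript
// Alias for the processor signature described above (the alias name is ours).
type Processor<InputType, OutputType> = (
  data: InputType,
  previous: any,
  old: any[],
) => Promise<OutputType>;

// Hypothetical record type used by the sample processors.
interface Product {
  name: string;
  price: number;
}

// Uses only `data`: keeps products that cost at least 10.
const keepExpensive: Processor<Product[], Product[]> = async (data) =>
  data.filter((product) => product.price >= 10);

// Also reads `old` to report how many earlier results are available.
const summarize: Processor<Product[], string> = async (data, _previous, old) =>
  `${data.length} kept, ${old.length} earlier results`;
```

Following the Quick Start, such a function would be wrapped as `new Modules.General.Custom(keepExpensive)` before being placed in a Pipeline.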

Data

Data<T> is a generic type used to pass Data between modules. It contains the current, previous and old data, so it stores all data passed along the Pipeline.

Usually you don't need to think about Data<T>; it is used at the lower level of the pipeline.
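To make the current / previous / old distinction concrete, here is a small sketch of how such a record could evolve from module to module. `DataSketch` and `advance` are hypothetical; the field names follow the description above, but the real Data<T> type and its bookkeeping rules (for example, the order of `old`) may differ.

```typescript
// Hypothetical shape modelled on the description of Data<T>.
interface DataSketch<T> {
  current: T;
  previous: unknown;
  old: unknown[];
}

// Builds the record for the next module: the new output becomes `current`,
// the former `current` becomes `previous`, and the former `previous`
// (if there was one) is appended to `old`.
function advance<T, U>(data: DataSketch<T>, output: U): DataSketch<U> {
  return {
    current: output,
    previous: data.current,
    old: data.previous === undefined ? data.old : [...data.old, data.previous],
  };
}
```

Starting from `{ current: 'a,b', previous: undefined, old: [] }`, two calls to `advance` leave the original CSV text in `old`, the parsed array in `previous`, and the latest result in `current`.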

License

May be freely distributed under the MIT license.