batch-processing-service v1.0.0

REACH Batch Article URL Analyser

Local Development instructions:

The application uses the dotenv framework to inject environment variables into the application; these are supplied via .env files in the root folder. The project currently contains the following env files (a loading sketch follows the table):

| file | description |
| --- | --- |
| .env | local development |
| .env.demo | demo environment (currently not used) |
| .env.dev.Ad.Safety.analyser | ad safety analyser environment (currently not used) |
| .env.dev.appnexus | app nexus dev environment (currently not used) |
| .env.prod.appnexus | demo environment (currently not used) |
| .env.dev.BERTHA | bertha environment |
| .env.dev.stable | stable environment |
| .env.prod.reach | production environment |
| .env.sc | (currently not used) |
| .env.telegraph | (currently not used) |
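
The exact bootstrap wiring is not shown in this README; a minimal sketch of how NODE_ENV-based file selection could work with dotenv might look like this:

    // Sketch only: select the env file matching NODE_ENV, falling back to .env
    // for local development. The app's real startup code may differ.
    const dotenv = require('dotenv');

    const envFile = process.env.NODE_ENV ? `.env.${process.env.NODE_ENV}` : '.env';
    dotenv.config({ path: envFile });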

Environment service configuration

Within each environment the following services must be configured (see "How To create a new environment file" below for the required credentials):

  • NLU
  • VR
  • WDS
  • Cloud storage bucket
  • DB

App Launch instructions:

Launch the app with:

    node app.js

or:

    npm start

To launch with a specific environment, such as prod:

    NODE_ENV=prod.appnexus node app.js

or bertha:

    NODE_ENV=dev.BERTHA node app.js

This starts a server at http://localhost:6003 (see the server log for the port number).

How To create a new environment file:

To create a new environment file, the following credentials are required. Obtain them from the respective watson services in the ibm console:

  • NLU:
    natural_language_understanding_apikey = << key from credentials in ibm console >>
    natural_language_understanding_url = https://gateway-lon.watsonplatform.net/natural-language-understanding/api
    natural_language_understanding_version = 2019-02-01
    Change the url if the NLU instance is in a region other than London.
  • VR:
    visual_recognition_apikey = << key from credentials in ibm console >>  
    visual_recognition_url = https://gateway.watsonplatform.net/visual-recognition/api  
    visual_recognition_version = 2019-02-01  
    Change the url if the VR instance is in a region other than London.
  • WDS:
    discovery_url = https://gateway-lon.watsonplatform.net/discovery/api  
    discovery_apikey = << key from credentials in ibm console >>  
    discovery_version = 2019-02-01  
    discovery_collectionid =  
    discovery_environmentid =
  • Cloud storage bucket:
    Create a set of credentials for IBM cloud storage, with HMAC credentials, and convert the values to base64 strings for the following:
    cloud_storage_enpoint_url = https://s3.eu-gb.cloud-object-storage.appdomain.cloud  
    cloud_storage_apikey = << key from credentials in ibm console >>
    cloud_storage_resource_instance_id = << key from credentials in ibm console >>  
    cloud_storage_access_key_id = << key from credentials in ibm console >>  
    cloud_storage_secret_access_key = << key from credentials in ibm console >>  
    cloud_storage_bucket = << input storage bucket >>  
    cloud_storage_reports = << output storage bucket >>
  • DB: Get the db credentials from the ibm console; connections are made as follows:
    postgreSQL_connectionString = postgres://user:password@0af45143-13f5-40ee-a847-2aea727b42fd.bmo1leol0d54tib7un7g.databases.appdomain.cloud:port/db?sslmode=verify-full
    postgreSQL_certificate_base64 = << pem ssl certificate string >>
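
As an illustration of how these two db values might be consumed (assuming the node-postgres driver; the service's actual connection code may be structured differently):

    // Sketch only: decode the base64 PEM certificate and hand both values to
    // the pg driver.
    const { Client } = require('pg');

    const client = new Client({
      connectionString: process.env.postgreSQL_connectionString,
      ssl: {
        ca: Buffer.from(process.env.postgreSQL_certificate_base64, 'base64').toString('utf-8')
      }
    });

    client.connect(); // returns a promise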

Other environment variables

| variable | description |
| --- | --- |
| write_to_db | enables writing NLU findings into the db, to be cached; the default value is present in the env file |
| read_from_db_cache | enables reading cached NLU findings from the db; the default value is present in the env file |
| write_to_log | creates a CSV log file with a line for each analysed article, containing the article's rating |
| write_rules_to_log | also adds the rating of each rule; applies only if write_to_log = true |
| write_to_cache | stores result JSONs as files on the server, used as a cache for future requests |
| analyze_images | enables/disables analysing images identified in HTML articles |
| recalculation_rate | recalculation rate for new rulesets |
| sleep_interval | time interval in seconds that the processor waits between lookups for new input files; default values are in the env files |
| selected_process_mode | selects the batch file processing mode, configured in config/default.json; this defines the input/output format and any filtering that needs doing. It defaults to default, and more modes are available |
| max_small_file_size | file size threshold below which the whole file is processed in one go instead of as a stream; defaults to 20kb if not present |
| articles_parallel | number of articles to process in parallel; defaults to 30 if not present |
| NODE_ENV | used to change the psql file to use the test db for e2e tests |
| LOCAL_DEV | used to set the db to localhost; useful for local development |
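
A small sketch of how the stated fallbacks could be read in code (the variable handling shown is an assumption, not the service's actual helper):

    // Sketch only: apply the documented defaults when the variables are absent.
    const maxSmallFileSize = parseInt(process.env.max_small_file_size, 10) || 20 * 1024; // 20kb default
    const articlesParallel = parseInt(process.env.articles_parallel, 10) || 30;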

Processing modes:

Processing Mode config:

All existing processing modes are currently defined in config/default.json under processMode.

The processing currently supports the following flags:

| flag | description |
| --- | --- |
| name | name of the process mode |
| inputFormat | expected file format for the input |
| outputFormat | expected file format for the output file |
| saveArticles | saves articles as a list, in the format selected above |
| saveReport | saves a report for the processed file; this includes the processed file, how many articles failed, the total number processed, and the status. Reports are currently always saved as json |
| outputArticleErrors | if the output format is json, outputs the error for each article that fails, along with the input used, allowing better debugging |
| removeUnmatchedFilteredArticle | if true, and there are matchers in articleFilterOutput that return undefined (but do not fail), the articles are removed from the output; the default is false |
| inputTransformation | uses the Jexl framework to apply transforms to the input |
| articleFilterOutput | uses the Jexl framework; please see below for the format |
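
A hypothetical processMode entry in config/default.json, purely to illustrate the flags above (the values are invented, not the shipped config):

    {
      "processMode": [
        {
          "name": "default",
          "inputFormat": "csv",
          "outputFormat": "json",
          "saveArticles": true,
          "saveReport": true,
          "outputArticleErrors": true,
          "removeUnmatchedFilteredArticle": false,
          "inputTransformation": [],
          "articleFilterOutput": []
        }
      ]
    }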

Article Filter output format:

This part of the config allows parts of the article response to be extracted and filtered. It also allows transforms to be applied to the data in the fields. It works by adding an array of elements that should be present in the output: each element has a key, plus either a value (to output a static value) or a matcher (to filter out only parts of the original object).

| key | description |
| --- | --- |
| key | destination json key for the object (if the format is csv, this value is omitted) |
| value | value, if we want to output a static value associated with the key |
| matcher | Jexl matcher to filter objects |
| transforms | Jexl supports transforms being added to a matcher, so we keep a static list of transformer functions in the jexlTransforms.js class and load them into Jexl; to apply them in order, put the name of each function into the array |

Please see default.json for transform and matcher examples, and the Jexl page.
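
For illustration only, an articleFilterOutput array might look like this; the matcher expression and keys here are invented, and urlExtraneousRemoval refers to the transform shown below:

    [
      { "key": "source", "value": "batch-processor" },
      { "key": "url", "matcher": "article.url", "transforms": ["urlExtraneousRemoval"] }
    ]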

If we need to add more transformer functions to the config, we add a new function to jexlTransforms:

    // URLStringBuilder is assumed to be required at the top of jexlTransforms.js.
    const urlExtraneousRemoval = () => {
      return {
        name: 'urlExtraneousRemoval',
        // Strips extraneous parts from a URL string.
        method: urlString => URLStringBuilder.buildRemovingExtraneous(urlString)
      };
    };

Here the name returned is the name of the transform function, and method is the function added to the Jexl list of transforms. Then add the new transformer as an export:

    module.exports = {
      urlExtraneousRemoval
    };

This function is then available for use in the transforms arrays of inputTransformation and articleFilterOutput.
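
How the exported factories get registered is not shown in this README; a plausible sketch (assuming the standard jexl npm API) would be:

    // Sketch only: load every exported transformer factory into Jexl.
    const jexl = require('jexl');
    const jexlTransforms = require('./jexlTransforms');

    Object.values(jexlTransforms).forEach(factory => {
      const { name, method } = factory();
      jexl.addTransform(name, method); // jexl.addTransform(name, fn) is the standard API
    });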

Code structure:

The processor

The batch processor service is an express.js server app with no routing; its single goal is to poll for new files to process from ibm cloud storage. This works by starting the services in ./app.js and calling processor.init(). The service tracks objects from ibm cloud storage by their name, and uses the etag to detect changes (in case the same file is uploaded multiple times).

  • The server/processor/processor.js: is responsible for waiting and polling for new files from ibm cloud storage to be analysed.
  • The server/processor/controllers/processorOrchestrator.js: is responsible for reading a file with multiple urls from ibm cloud storage, preparing a batch, and saving it, by either using a stream or an object put.
  • The server/processor/controllers/report/: contains controllers to create report or article streams.
  • The server/processor/controllers/jsonFilter/: applies transforms and/or filtering to article objects as per the processing mode configuration.
  • The server/processor/controllers/objectOutputBuilder.js: is used to convert objects to their desired output format.
  • The server/processor/controllers/storageCache.js: is used to build a local cache of processed cloud storage objects, as a means to complement the db article_process table.

Once a batch ends up in the articleQueue file, each article in it is processed individually through the brand-safety-tools orchestrator file.
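
As a rough sketch of the loop described above (every helper name here is hypothetical; only the name/etag check and the sleep_interval behaviour come from this README):

    // Sketch only: poll cloud storage, skip objects already seen (same name +
    // etag), hand new ones to the orchestrator, then sleep for sleep_interval
    // seconds before looking again.
    const init = async () => {
      for (;;) {
        const objects = await listBucketObjects(); // hypothetical storage helper
        for (const obj of objects) {
          if (!storageCache.isProcessed(obj.name, obj.etag)) { // hypothetical cache check
            await processorOrchestrator.process(obj); // hypothetical call
          }
        }
        await new Promise(resolve => setTimeout(resolve, Number(process.env.sleep_interval) * 1000));
      }
    };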

Unit tests

Unit tests are written using the Jest framework and can be run with npm test in the terminal. An HTML coverage report is available at link.

Folder structure goes as follows:

| path | description |
| --- | --- |
| test | root test folder |
| test/data | sample files to run the batch processor locally, or to be used in the unit tests; these are purely for development/demo purposes |
| test/e2e | end-to-end tests; the paths below it attempt to mirror the path of the tested file in the server folder |
| test/helpers | helper files used to set up the tests |
| test/mocks | reusable jest mock files |
| test/unit | unit tests |

Tests structure:

A unit test should attempt to test one condition of the class/module it is testing. Test names should follow:

    test('<method name>() <condition to test>, <expected return value>')

an example:

    test('filter() with config, articleFilterOutput and removeUnmatchedFilteredArticle set to false, throws an error', () => {})

describe() blocks should aggregate tests in the following order of preference:

  • aggregate tests within a test file.
  • aggregate a complex/finicky logical scenario.
  • aggregate tests around a method.
  • a test file should never test, or contain describes that test, more than one file.
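
Putting the naming rule and the describe() ordering together, a test file might be shaped like this (a sketch reusing the example name above; the body is a placeholder):

    describe('filter()', () => {
      test('filter() with config, articleFilterOutput and removeUnmatchedFilteredArticle set to false, throws an error', () => {
        // assertion goes here
      });
    });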