batch-processing-service v1.0.0

REACH Batch Article URL Analyser

Local Development instructions:

The application uses the dotenv framework to inject environment variables into the application; these are supplied via .env files in the root folder. The project currently contains the following env files (a loading sketch follows the table):

| file | description |
| --- | --- |
| .env | local development |
| .env.demo | demo environment (currently not used) |
| .env.dev.Ad.Safety.analyser | ad safety analyser environment (currently not used) |
| .env.dev.appnexus | app nexus dev environment (currently not used) |
| .env.prod.appnexus | demo environment (currently not used) |
| .env.dev.BERTHA | bertha environment |
| .env.dev.stable | stable environment |
| .env.prod.reach | production environment |
| .env.sc | (currently not used) |
| .env.telegraph | (currently not used) |
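
The exact bootstrap wiring is not shown in this README; a minimal sketch of how NODE_ENV-based file selection could work with dotenv might look like this:

    // Sketch only: select the env file matching NODE_ENV, falling back to .env
    // for local development. The app's real startup code may differ.
    const dotenv = require('dotenv');

    const envFile = process.env.NODE_ENV ? `.env.${process.env.NODE_ENV}` : '.env';
    dotenv.config({ path: envFile });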

Environment service configuration

Within each environment the following services must be configured (see "How To create a new environment file" below for the required credentials):

  • NLU
  • VR
  • WDS
  • Cloud storage bucket
  • DB

App Launch instructions:

Launch the app with:

    node app.js

or:

    npm start

To launch with a specific environment, such as prod:

    NODE_ENV=prod.appnexus node app.js

or bertha:

    NODE_ENV=dev.BERTHA node app.js

This starts a server at http://localhost:6003 (see the server log for the port number).

How To create a new environment file:

To create a new environment file, the following credentials are required. Obtain them from the respective watson services in the ibm console:

  • NLU:
    natural_language_understanding_apikey = << key from credentials in ibm console >>
    natural_language_understanding_url = https://gateway-lon.watsonplatform.net/natural-language-understanding/api
    natural_language_understanding_version = 2019-02-01
    Change the url if the NLU instance is in a region other than London.
  • VR:
    visual_recognition_apikey = << key from credentials in ibm console >>  
    visual_recognition_url = https://gateway.watsonplatform.net/visual-recognition/api  
    visual_recognition_version = 2019-02-01  
    Change the url if the VR instance is in a region other than London.
  • WDS:
    discovery_url = https://gateway-lon.watsonplatform.net/discovery/api  
    discovery_apikey = << key from credentials in ibm console >>  
    discovery_version = 2019-02-01  
    discovery_collectionid =  
    discovery_environmentid =
  • Cloud storage bucket:
    Create a set of credentials for IBM cloud storage, with HMAC credentials, and convert the values to base64 strings for the following:
    cloud_storage_enpoint_url = https://s3.eu-gb.cloud-object-storage.appdomain.cloud  
    cloud_storage_apikey = << key from credentials in ibm console >>
    cloud_storage_resource_instance_id = << key from credentials in ibm console >>  
    cloud_storage_access_key_id = << key from credentials in ibm console >>  
    cloud_storage_secret_access_key = << key from credentials in ibm console >>  
    cloud_storage_bucket = << input storage bucket >>  
    cloud_storage_reports = << output storage bucket >>
  • DB: Get the db credentials from the ibm console; connections are made as follows:
    postgreSQL_connectionString = postgres://user:password@0af45143-13f5-40ee-a847-2aea727b42fd.bmo1leol0d54tib7un7g.databases.appdomain.cloud:port/db?sslmode=verify-full
    postgreSQL_certificate_base64 = << pem ssl certificate string >>
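
As an illustration of how these two db values might be consumed (assuming the node-postgres driver; the service's actual connection code may be structured differently):

    // Sketch only: decode the base64 PEM certificate and hand both values to
    // the pg driver.
    const { Client } = require('pg');

    const client = new Client({
      connectionString: process.env.postgreSQL_connectionString,
      ssl: {
        ca: Buffer.from(process.env.postgreSQL_certificate_base64, 'base64').toString('utf-8')
      }
    });

    client.connect(); // returns a promise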

Other environment variables

| variable | description |
| --- | --- |
| write_to_db | enables writing NLU findings into the db, to be cached; the default value is present in the env file |
| read_from_db_cache | enables reading cached NLU findings from the db; the default value is present in the env file |
| write_to_log | creates a CSV log file with a line for each analysed article, containing the article's rating |
| write_rules_to_log | also adds the rating of each rule; applies only if write_to_log = true |
| write_to_cache | stores result JSONs as files on the server, used as a cache for future requests |
| analyze_images | enables/disables analysing images identified in HTML articles |
| recalculation_rate | recalculation rate for new rulesets |
| sleep_interval | time interval in seconds that the processor waits between lookups for new input files; default values are in the env files |
| selected_process_mode | selects the batch file processing mode, configured in config/default.json; this defines the input/output format and any filtering that needs doing. It defaults to default, and more modes are available |
| max_small_file_size | file size threshold below which the whole file is processed in one go instead of as a stream; defaults to 20kb if not present |
| articles_parallel | number of articles to process in parallel; defaults to 30 if not present |
| NODE_ENV | used to change the psql file to use the test db for e2e tests |
| LOCAL_DEV | used to set the db to localhost; useful for local development |
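
A small sketch of how the stated fallbacks could be read in code (the variable handling shown is an assumption, not the service's actual helper):

    // Sketch only: apply the documented defaults when the variables are absent.
    const maxSmallFileSize = parseInt(process.env.max_small_file_size, 10) || 20 * 1024; // 20kb default
    const articlesParallel = parseInt(process.env.articles_parallel, 10) || 30;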

Processing modes:

Processing Mode config:

All existing processing modes are currently defined in config/default.json under processMode.

The processing currently supports the following flags:

| flag | description |
| --- | --- |
| name | name of the process mode |
| inputFormat | expected file format for the input |
| outputFormat | expected file format for the output file |
| saveArticles | saves articles as a list, in the format selected above |
| saveReport | saves a report for the processed file; this includes the processed file, how many articles failed, the total number processed, and the status. Reports are currently always saved as json |
| outputArticleErrors | if the output format is json, outputs the error for each article that fails, along with the input used, allowing better debugging |
| removeUnmatchedFilteredArticle | if true, and there are matchers in articleFilterOutput that return undefined (but do not fail), the articles are removed from the output; the default is false |
| inputTransformation | uses the Jexl framework to apply transforms to the input |
| articleFilterOutput | uses the Jexl framework; please see below for the format |
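
A hypothetical processMode entry in config/default.json, purely to illustrate the flags above (the values are invented, not the shipped config):

    {
      "processMode": [
        {
          "name": "default",
          "inputFormat": "csv",
          "outputFormat": "json",
          "saveArticles": true,
          "saveReport": true,
          "outputArticleErrors": true,
          "removeUnmatchedFilteredArticle": false,
          "inputTransformation": [],
          "articleFilterOutput": []
        }
      ]
    }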

Article Filter output format:

This part of the config allows parts of the article response to be extracted and filtered. It also allows transforms to be applied to the data in the fields. It works by adding an array of elements that should be present in the output: each element has a key, plus either a value (to output a static value) or a matcher (to filter out only parts of the original object).

| key | description |
| --- | --- |
| key | destination json key for the object (if the format is csv, this value is omitted) |
| value | value, if we want to output a static value associated with the key |
| matcher | Jexl matcher to filter objects |
| transforms | Jexl supports transforms being added to a matcher, so we keep a static list of transformer functions in the jexlTransforms.js class and load them into Jexl; to apply them in order, put the name of each function into the array |

Please see default.json for transform and matcher examples, and the Jexl page.
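
For illustration only, an articleFilterOutput array might look like this; the matcher expression and keys here are invented, and urlExtraneousRemoval refers to the transform shown below:

    [
      { "key": "source", "value": "batch-processor" },
      { "key": "url", "matcher": "article.url", "transforms": ["urlExtraneousRemoval"] }
    ]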

If we need to add more transformer functions to the config, we add a new function to jexlTransforms:

    // URLStringBuilder is assumed to be required at the top of jexlTransforms.js.
    const urlExtraneousRemoval = () => {
      return {
        name: 'urlExtraneousRemoval',
        // Strips extraneous parts from a URL string.
        method: urlString => URLStringBuilder.buildRemovingExtraneous(urlString)
      };
    };

Here the name returned is the name of the transform function, and method is the function added to the Jexl list of transforms. Then add the new transformer as an export:

    module.exports = {
      urlExtraneousRemoval
    };

This function is then available for use in the transforms arrays of inputTransformation and articleFilterOutput.
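
How the exported factories get registered is not shown in this README; a plausible sketch (assuming the standard jexl npm API) would be:

    // Sketch only: load every exported transformer factory into Jexl.
    const jexl = require('jexl');
    const jexlTransforms = require('./jexlTransforms');

    Object.values(jexlTransforms).forEach(factory => {
      const { name, method } = factory();
      jexl.addTransform(name, method); // jexl.addTransform(name, fn) is the standard API
    });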

Code structure:

The processor

The batch processor service is an express.js server app with no routing; its single goal is to poll for new files to process from ibm cloud storage. This works by starting the services in ./app.js and calling processor.init(). The service tracks objects from ibm cloud storage by their name, and uses the etag to detect changes (in case the same file is uploaded multiple times).

  • The server/processor/processor.js: is responsible for waiting and polling for new files from ibm cloud storage to be analysed.
  • The server/processor/controllers/processorOrchestrator.js: is responsible for reading a file with multiple urls from ibm cloud storage, preparing a batch, and saving it, by either using a stream or an object put.
  • The server/processor/controllers/report/: contains controllers to create report or article streams.
  • The server/processor/controllers/jsonFilter/: applies transforms and/or filtering to article objects as per the processing mode configuration.
  • The server/processor/controllers/objectOutputBuilder.js: is used to convert objects to their desired output format.
  • The server/processor/controllers/storageCache.js: is used to build a local cache of processed cloud storage objects, as a means to complement the db article_process table.

Once a batch ends up in the articleQueue file, each article in it is processed individually through the brand-safety-tools orchestrator file.
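
As a rough sketch of the loop described above (every helper name here is hypothetical; only the name/etag check and the sleep_interval behaviour come from this README):

    // Sketch only: poll cloud storage, skip objects already seen (same name +
    // etag), hand new ones to the orchestrator, then sleep for sleep_interval
    // seconds before looking again.
    const init = async () => {
      for (;;) {
        const objects = await listBucketObjects(); // hypothetical storage helper
        for (const obj of objects) {
          if (!storageCache.isProcessed(obj.name, obj.etag)) { // hypothetical cache check
            await processorOrchestrator.process(obj); // hypothetical call
          }
        }
        await new Promise(resolve => setTimeout(resolve, Number(process.env.sleep_interval) * 1000));
      }
    };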

Unit tests

Unit tests are written using the Jest framework and can be run with npm test in the terminal. An HTML coverage report is available at link.

Folder structure goes as follows:

| path | description |
| --- | --- |
| test | root test folder |
| test/data | sample files to run the batch processor locally, or to be used in the unit tests; these are purely for development/demo purposes |
| test/e2e | end-to-end tests; the paths below it attempt to mirror the path of the tested file in the server folder |
| test/helpers | helper files used to set up the tests |
| test/mocks | reusable jest mock files |
| test/unit | unit tests |

Tests structure:

A unit test should attempt to test one condition of the class/module it is testing. Test names should follow:

    test('<method name>() <condition to test>, <expected return value>')

an example:

    test('filter() with config, articleFilterOutput and removeUnmatchedFilteredArticle set to false, throws an error', () => {})

describe() blocks should aggregate tests in the following order of preference:

  • aggregate tests within a test file.
  • aggregate a complex/finicky logical scenario.
  • aggregate tests around a method.
  • a test file should never test, or contain describes that test, more than one file.
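
Putting the naming rule and the describe() ordering together, a test file might be shaped like this (a sketch reusing the example name above; the body is a placeholder):

    describe('filter()', () => {
      test('filter() with config, articleFilterOutput and removeUnmatchedFilteredArticle set to false, throws an error', () => {
        // assertion goes here
      });
    });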