lex.services.jobs-scheduler v1.1.1
lex.services.jobs-scheduler
Centralized Job Scheduler
THIS DOCUMENT IS WORK IN PROGRESS
- lex.services.jobs-scheduler
- Motivation
- Terminology
- High level overview
- For job/worker developers
- For scheduler administrators
Motivation
The motivation for creating a centralized jobs scheduler is to have a single point of job management. Before the Jobs Scheduler we had various ways of deploying jobs, configuring them and scheduling their runs. The Scheduler keeps job configurations in one place and takes responsibility for launching jobs at the right time.
Terminology
This document uses the following terms:
- Job is a configured unit of work - a recipe describing when, how and by whom the work should be done.
- Worker is an application responsible for doing a job. Although a worker is a specialized application that processes a certain type of job, multiple jobs may exist for one worker, so a worker can be invoked (asked to do a job) at different times with slightly different Job configurations.
- Job Run is one particular invocation of a Worker with a Job configuration. Input to the run is the Job configuration; output of the run is the collected log and the termination status.
High level overview
The Scheduler is an application with three logical connections:
- Cloudant database
- Management UI
- Runtime environment
The Cloudant database is used to store information about Workers, Jobs and Job Runs. Currently there is a separate Cloudant database for each of these three data entities.
The Management UI is connected to the Scheduler through the REST API the Scheduler exposes.
There may be multiple runtime environments that workers run in. Because the way workers are implemented and invoked varies significantly between environments, the Scheduler has a sub-component, called an "engine", for each supported environment. Currently supported environments are:
Kubernetes - workers are implemented as single-pass processes, e.g. Java main() applications, packed as Docker images. It is the Scheduler's responsibility to dynamically create a Kubernetes POD and run the worker's Docker image in that POD.
Web Service - workers are implemented as web services. The web service is expected to be deployed and accessible on a configured URL. The worker is invoked by an HTTP POST request to that endpoint.
For job/worker developers
Dockerized workers
A Kubernetes cluster is used as the runtime environment for dockerized workers. Workers are not deployed in the cluster all the time; they are only registered in the Cloudant database and their images deployed to a Docker Registry.
When the Scheduler needs to run a Job, either on a regular schedule or on a manual request, the following actions are taken:
- Scheduler finds the Job configuration in the database by Job ID.
- Scheduler finds the Worker configuration in the database by the Worker ID in the Job configuration.
- Scheduler creates a Job Run record in the database.
- Scheduler dynamically creates a POD in the Kubernetes cluster. The POD has one container with the Docker image set in the Worker configuration.
- POD starts.
- POD terminates.
- Scheduler records the POD termination status in the Job Run record.
- Scheduler moves log messages from the POD to the Job Run record.
- Scheduler deletes the POD to release system resources.
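The POD the Scheduler creates can be pictured as a plain Kubernetes manifest. The sketch below builds such a manifest from a Worker configuration and a Run ID; the helper name, labels and naming convention are illustrative assumptions, not the Scheduler's actual internals.

```python
# Sketch only: the kind of single-container POD manifest the Scheduler
# could submit to the Kubernetes API. Names and labels are assumptions.

def build_worker_pod(worker_doc, run_id, env):
    """Create a POD manifest for one Job Run of a dockerized worker."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"job-run-{run_id}",
            "labels": {"app": "jobs-scheduler-worker", "runId": run_id},
        },
        "spec": {
            # Worker containers run once; the Scheduler deletes the POD
            # after collecting the log, so never restart automatically.
            "restartPolicy": "Never",
            "containers": [{
                "name": "worker",
                "image": worker_doc["worker"]["image"],
                "env": [{"name": k, "value": v} for k, v in env.items()],
            }],
        },
    }

worker = {"worker": {"image": "registry.ng.bluemix.net/yl_catalog/lex.batch.requiredlearningcsv"}}
pod = build_worker_pod(worker, "run123", {"JOB_ID": "requiredlearningcsv", "RUN_ID": "run123"})
```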
Preparing new dockerized worker
Dockerized workers are the preferred type of workers: they are secure from a tenant-isolation perspective, they do not suffer from memory leaks because they run only as long as they need, and they can be force-terminated if necessary. At the same time they are much easier to develop than Web Service based workers, as they are basically a single one-pass process packed as a Docker image. The process itself reads configuration and writes logs in the traditional way - the configuration is passed in environment variables and logs are expected to be sent to stdout or stderr.
Furthermore, dockerized workers are programming language agnostic, so they may be implemented in Java, JavaScript, Go, Python or even as a shell script.
The programming model for a dockerized worker is a single process. The process should:
- read environment variables to get the information necessary for the run, such as the Cloudant URL, credentials, tenant ID and other job parameters.
- perform the actual task
- write all log messages to stdout or stderr
- terminate with exit code zero in case of success, non-zero in case of failure
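The steps above can be sketched as a minimal Python worker. Only the environment variable names (TENANT_ID, JOB_PARAMS) come from this document; the do_work helper and the parameter shape are illustrative assumptions.

```python
import json
import os
import sys

def do_work(tenant_id, params):
    # Hypothetical task body; a real worker performs its actual
    # processing here, logging to stdout/stderr only.
    print(f"processing for tenant {tenant_id} with {len(params)} parameter(s)")

def main():
    # 1. Read run information from environment variables.
    tenant_id = os.environ.get("TENANT_ID", "")
    params = json.loads(os.environ.get("JOB_PARAMS") or "{}")
    try:
        # 2. Perform the actual task.
        do_work(tenant_id, params)
    except Exception as exc:
        print(f"job failed: {exc}", file=sys.stderr)
        return 1  # non-zero exit code marks the run as failed
    return 0      # zero exit code marks the run as successful

# In the Docker image, the entrypoint would run: sys.exit(main())
```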
The worker should then be packed as a Docker image and deployed to a Docker Registry. The preferred registry is the Container Registry service provided by Bluemix.
The next step is to register the Worker. At the time of writing this involves manually inserting a record into the jobs-worker database, such as:
{
"_id": "requiredlearningcsv",
"worker": {
"name": "Required Learning CSV upload",
"description": "Required Learning CSV upload worker. No parameters needed.",
"image": "registry.ng.bluemix.net/yl_catalog/lex.batch.requiredlearningcsv"
}
}
In future there may be an administration UI available for this step.
The last step is to configure one or more Jobs which will use the new Worker. There is a UI for Job management; for the TEST environment it is: https://yourlearningtest.w3bmix.ibm.com/servicecenter/#jobs/main
Environment variables
The worker runs in an environment with the following variables:
JOB_ID
The ID of the job.
RUN_ID
The ID of the run.
TENANT_ID
The ID of the tenant the Job is configured for.
JOB_PARAMS
Besides other properties, a Job has a parameters property which has no predefined meaning or structure. It is up to the particular Worker how to use the entered value. For example, a CSV parsing worker may use this property to configure the CSV location, delimiters and limits. Another worker could use this property for a completely different purpose.
JOB_SECURE_PARAMS
Similar to JOB_PARAMS, but should be used for sensitive information, such as passwords. The difference from JOB_PARAMS is that the UI hides secure parameters from users other than the job owners.
JOB_STORAGE
The last saved value of the job storage object. See the Job Storage chapter.
LAST_SUCCESS_START
The start timestamp of the last successful run of the job.
LAST_SUCCESS_FINISH
The finish timestamp of the last successful run of the job.
yl.env secret based variables
There is a system-wide configurable mapping between yl.env secrets and environment variables. The current configuration is:
- yl.env/cloudant -> cloudant
- yl.env/db2 -> db2
- yl.env/db2shadow -> db2shadow
- yl.env/ibmsso -> ibmsso
- yl.env/init -> init
- yl.env/catalog -> catalog
VCAP_SERVICES
For backward compatibility with CloudFoundry processes, the Scheduler also sets the VCAP_SERVICES variable. This allows converting FF3 jobs to new Workers without any code modification, just by packing them as a Docker image.
VCAP_SERVICES should be used only by migrated CloudFoundry workers. The variable will be removed in the future; FF3 methods relying on this variable will be annotated as deprecated.
Content of VCAP_SERVICES variable is system-wide configurable.
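For a migrated worker, reading credentials from VCAP_SERVICES looks the same as it did under CloudFoundry. A minimal sketch, assuming the usual Bluemix cloudantNoSQLDB service label (an assumption - the actual content of the variable is system-wide configurable, as noted above):

```python
import json
import os

def cloudant_credentials():
    """Extract Cloudant credentials from VCAP_SERVICES, CloudFoundry style.

    The "cloudantNoSQLDB" service name is the conventional Bluemix label
    and is an assumption here; check the configured VCAP_SERVICES content.
    """
    services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    instances = services.get("cloudantNoSQLDB", [])
    return instances[0]["credentials"] if instances else None
```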
Logging
As already mentioned, a dockerized worker should write all its log messages to stdout or stderr. The messages should contain no timestamp, as Kubernetes inserts timestamps for us. When the worker terminates, the Scheduler moves the log from the POD to the Cloudant database. The log is attached to the Job Run record and is therefore archived even after the POD is deleted.
Job Storage
Sometimes a worker needs to save and later load values such as timestamps of the last processed records. To avoid the necessity of connecting to a database, the Scheduler implements a simple storage mechanism by which a worker saves a data structure and gets that structure back on its next run.
We use JOB_STORAGE environment variable to transport stored data from Scheduler to worker. This is JSON encoded representation of the last saved value.
To save a value, the worker writes a special string to the job log, i.e. to standard output. The written line should be prefixed with the [saveJobStorage] string, followed by a space and the JSON-encoded value, e.g.:
[saveJobStorage] {"myLastProcessedTs": "2018-01-16T14:48:32.905173068Z", "lastTimeWeatherWas": "sunny"}
The Scheduler captures the line and updates the job configuration object accordingly. The next time the job runs, the value is passed back in the JOB_STORAGE environment variable as described above.
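The round trip can be sketched in a few lines of Python. The JOB_STORAGE variable and the [saveJobStorage] prefix are as described above; the function names are illustrative.

```python
import json
import os

def load_job_storage():
    """Read the structure saved by the previous run, or {} on the first run."""
    raw = os.environ.get("JOB_STORAGE")
    return json.loads(raw) if raw else {}

def save_job_storage(value):
    """Ask the Scheduler to persist `value` for the next run by writing
    the special [saveJobStorage] line to stdout (i.e. to the job log)."""
    print("[saveJobStorage] " + json.dumps(value))

storage = load_job_storage()
last_ts = storage.get("myLastProcessedTs")  # None on the very first run
save_job_storage({"myLastProcessedTs": "2018-01-16T14:48:32Z"})
```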
Web service workers
A Web Service worker is implemented as a single REST HTTP endpoint. A job is started by POSTing an HTTP request to that endpoint. The body of the request contains a JSON object with the run ID, the job configuration, the URL of the callback interface, the job storage object and a few other properties. It is also possible to specify HTTP headers to be sent with the request; headers can be specified at the worker and/or job level by the httpHeaders property. The Scheduler exposes three endpoints (log, status, job_storage) which are supposed to be called back by the invoked worker to update the status of the job run.
When the Scheduler needs to run a Job, either on a regular schedule or on a manual request, the following actions are taken:
- Scheduler finds the Job configuration in the database by Job ID.
- Scheduler finds the Worker configuration in the database by the Worker ID in the Job configuration.
- Scheduler creates a Job Run record in the database.
- Scheduler sends a POST request to the endpoint. The endpoint URL may be defined at either the worker or the job level.
- The web service endpoint receives the POST request and starts the job. The endpoint immediately confirms the start of the job by replying with status code 201. The job runs asynchronously from now on.
- Scheduler starts a timeout timer. The timeout may be specified at the job level and defaults to 12 hours.
- The running job MAY call the Scheduler's log callback service to write raw logs to the Scheduler.
- The running job MAY call the Scheduler's job_storage callback service to store an arbitrary JSON structure associated with the configured job. The next time the job is invoked, the Scheduler sends that stored JSON in the POST request.
- The finishing job MUST call the Scheduler's status service to notify the Scheduler about job finalization.
- Any call to the callback endpoints mentioned above restarts the timeout. If the timeout overflows, the job is considered dead and is marked as ABORTED.
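Under these rules, a minimal web service worker can be sketched with Python's standard library alone. Everything below is illustrative: a real worker would POST to the log/status/job_storage callback URLs carried in the request body, whereas this sketch just records the callback calls it would make.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_job(body, callback):
    # A real worker would POST to the callback URLs carried in `body`;
    # this sketch hands the would-be callback calls to `callback` instead.
    callback("log", "job started")
    # ... perform the actual work here ...
    callback("status", {"runId": body["runId"], "state": "FINISHED"})

class WorkerHandler(BaseHTTPRequestHandler):
    calls = []  # recorded callback invocations, for illustration only

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        # Confirm the start immediately with 201 ...
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()
        # ... and run the job asynchronously from now on.
        threading.Thread(
            target=run_job,
            args=(body, lambda name, payload: WorkerHandler.calls.append((name, payload))),
        ).start()

    def log_message(self, *args):  # silence default request logging
        pass
```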
Further details about the callback API as well as the web service worker API (the POST endpoint) may be found in the API chapter.
Web service worker configuration
The distinction between dockerized and web service based workers is made at the Worker configuration level. Web service based workers have the engine property set to webservice and an additional url property set to the URL of the worker activation endpoint. If no url property is set at the worker level, it is expected to be set at the job level. This allows having one special worker called other_webservice which serves as a generic web service worker for jobs that have url specified at the job level.
This is an example of a web service worker configuration:
{
"_id": "requiredlearningcsv",
"worker": {
"name": "Required Learning CSV upload",
"description": "Required Learning CSV upload worker. No parameters needed.",
"engine": "webservice",
"url": "http://my.workers.com/workerABC"
}
}
Job templates
Every worker configuration, regardless of its type/engine, may contain a template for new jobs. The template is defined as the template property of the worker object in the worker configuration document. When a user creates a new job in the UI, the template is used as the initial data for that job. Although job templates are not mandatory, defining them is highly recommended as it creates a better user experience.
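For illustration, a worker document carrying a job template might look like this; the template field values are assumptions, mirroring the job properties shown in the jobs-config example in the administrators' section:

```json
{
  "_id": "requiredlearningcsv",
  "worker": {
    "name": "Required Learning CSV upload",
    "image": "registry.ng.bluemix.net/yl_catalog/lex.batch.requiredlearningcsv",
    "template": {
      "name": "Required Learning CSV upload",
      "cron": "0 10 * * *"
    }
  }
}
```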
For scheduler administrators
The Scheduler is deployed as a regular web service application in Kubernetes, in the yl namespace.
Scheduler databases
Workers
Worker configurations are stored in the jobs-worker Cloudant database. The database contains documents such as:
{
"_id": "requiredlearningcsv",
"worker": {
"name": "Required Learning CSV upload",
"description": "Required Learning CSV upload worker. No parameters needed.",
"image": "registry.ng.bluemix.net/yl_catalog/lex.batch.requiredlearningcsv"
}
}
The configuration above is an example of a dockerized worker. Note that it does not have the engine property configured - it therefore defaults to kubernetes. An example of a web service worker could be:
{
"_id": "requiredlearningcsv",
"worker": {
"name": "Required Learning CSV upload",
"description": "Required Learning CSV upload worker. No parameters needed.",
"engine": "webservice",
"url": "https://yl.company.com/workerXYZ",
"httpHeaders": {
"Authorization": "Basic YWxhZGRpbjpvcGVuc2VzYW1l"
}
}
}
Jobs
Job configurations are stored in the jobs-config database. The database contains documents such as:
{
"_id": "c33d29933ce131e4b0e97da50d7ed8f4",
"_rev": "2-ea09f666e9b86f14d39cfe8d63c9887e",
"job": {
"tenant": "ibm",
"name": "Required Learning CSV upload",
"worker": "requiredlearningcsv",
"cron": "0 10 * * *"
}
}
Job Runs
An archive of all job runs is stored in the jobs-log database. The database contains documents such as:
{
"_id": "6cc0a54ddac5017d610b602da2670f6f",
"_rev": "3-57aaeed82fbc02cf7077b92288b33719",
"job": "c33d29933ce131e4b0e97da50d7ed8f4",
"tenant": "ibm",
"state": "FINISHED",
"started": "2017-12-13T10:00:00.537Z",
"finished": "2017-12-13T10:00:21.844Z",
"_attachments": {
"log": {
"content_type": "text/plain",
"revpos": 2,
"digest": "md5-fQPFHI0pVScul/vzV/cwBw==",
"length": 1072,
"stub": true
}
}
}
Design documents
Cloudant documents are kept and maintained within this component's repository, in the cloudant directory.
- Subdirectories are named after the Cloudant databases which should be updated, without the tenant prefix (e.g. mydb instead of ibm_mydb).
- For the time being it is necessary to execute publish.sh in each DB directory to copy them to the lex.cloudant.services repository, which is used by the CloudOps Team for deployments to PROD and other environments.
Scheduler environment variables
- PATH_PREFIX - Prefix for service URLs, optional, defaults to /.
- YL_CLOUDANT - Cloudant connection info as a JSON object, mandatory. The Scheduler reads only the url property, which is therefore expected to also contain credentials.
- YL_CLOUDANT_DB - Cloudant configuration database name, e.g. ylconfigurations.
- DB_JOBS - Name of the database with job configurations, mandatory.
- DB_LOGS - Name of the database with job logs, mandatory.
- DB_WORKERS - Name of the database with worker configurations, mandatory.
- KUB_HOST - API host name or IP address of the controlled Kubernetes cluster, mandatory.
- KUB_PORT - API port of the controlled Kubernetes cluster, mandatory.
- KUB_TOKEN - API token of the controlled Kubernetes cluster, mandatory.
- KUB_NAMESPACE - Namespace for worker PODs, mandatory.
- WORKERS_VCAP_SERVICES - VCAP_SERVICES exposed to workers for backward FF3 compatibility, optional.
- DISABLE_CRON - true to disable the internal cron, optional, defaults to false. Should be set only when run locally.
API
Scheduler API
Swagger document for Scheduler API is available in Git.
Web Service worker API
Swagger document for Web Service worker API is available in Git.