lex.services.jobs-scheduler v1.1.1
lex.services.jobs-scheduler
Centralized Job Scheduler
THIS DOCUMENT IS WORK IN PROGRESS
- lex.services.jobs-scheduler
- Motivation
- Terminology
- High level overview
- For job/worker developers
- For scheduler administrators
Motivation
The motivation for creating a centralized jobs scheduler is to have a single point of job management. Before the Jobs Scheduler we had various ways of deploying jobs, configuring them and scheduling their runs. The Scheduler keeps job configurations in one place and takes responsibility for launching jobs at the right time.
Terminology
This document uses the following terms:
- Job is a configured unit of work - a recipe describing when, how and by whom the work should be done.
- Worker is an application responsible for doing a job. Although a worker is a specialized application that processes a certain type of job, multiple jobs may exist for one worker, so a worker can be invoked (asked to do a job) at different times with slightly different Job configurations.
- Job Run is one particular invocation of a Worker with a Job configuration. Input to the run is the Job configuration; output of the run is the collected log and the termination status.
High level overview
The Scheduler is an application with three logical connections:
- Cloudant database
- Management UI
- Runtime environment
The Cloudant database is used to store information about Workers, Jobs and Job Runs. Currently there is a separate Cloudant database for each of these three data entities.
The Management UI is connected to the Scheduler through the REST API the Scheduler exposes.
There may be multiple runtime environments that workers run in. Because the way workers are implemented and invoked varies significantly between environments, the Scheduler has a sub-component, called an "engine", for each supported environment. Currently supported environments are:
Kubernetes - workers are implemented as single-pass processes, e.g. Java main() applications, packed as Docker images. It is the Scheduler's responsibility to dynamically create a Kubernetes POD and run the worker's Docker image in that POD.
Web Service - workers are implemented as web services. The web service is expected to be deployed and accessible on a configured URL. The worker is invoked by an HTTP POST request to that endpoint.
For job/worker developers
Dockerized workers
A Kubernetes cluster is used as the runtime environment for dockerized workers. Workers are not deployed in the cluster all the time; they are only registered in the Cloudant database and their images deployed to a Docker Registry.
When the Scheduler needs to run a Job, either on a regular schedule or on a manual request, the following actions are taken:
- Scheduler finds the Job configuration in the database by Job ID.
- Scheduler finds the Worker configuration in the database by the Worker ID in the Job configuration.
- Scheduler creates a Job Run record in the database.
- Scheduler dynamically creates a POD in the Kubernetes cluster. The POD has one container with the Docker image set in the Worker configuration.
- POD starts.
- POD terminates.
- Scheduler records the POD termination status in the Job Run record.
- Scheduler moves log messages from the POD to the Job Run record.
- Scheduler deletes the POD to release system resources.
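The POD the Scheduler creates can be pictured as a plain Kubernetes manifest. The sketch below builds such a manifest from a Worker configuration and a Run ID; the helper name, labels and naming convention are illustrative assumptions, not the Scheduler's actual internals.

```python
# Sketch only: the kind of single-container POD manifest the Scheduler
# could submit to the Kubernetes API. Names and labels are assumptions.

def build_worker_pod(worker_doc, run_id, env):
    """Create a POD manifest for one Job Run of a dockerized worker."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"job-run-{run_id}",
            "labels": {"app": "jobs-scheduler-worker", "runId": run_id},
        },
        "spec": {
            # Worker containers run once; the Scheduler deletes the POD
            # after collecting the log, so never restart automatically.
            "restartPolicy": "Never",
            "containers": [{
                "name": "worker",
                "image": worker_doc["worker"]["image"],
                "env": [{"name": k, "value": v} for k, v in env.items()],
            }],
        },
    }

worker = {"worker": {"image": "registry.ng.bluemix.net/yl_catalog/lex.batch.requiredlearningcsv"}}
pod = build_worker_pod(worker, "run123", {"JOB_ID": "requiredlearningcsv", "RUN_ID": "run123"})
```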
Preparing new dockerized worker
Dockerized workers are the preferred type of workers: they are secure from a tenant-isolation perspective, they do not suffer from memory leaks because they run only as long as they need, and they can be force-terminated if necessary. At the same time they are much easier to develop than Web Service based workers, as they are basically a single one-pass process packed as a Docker image. The process itself reads configuration and writes logs in the traditional way - the configuration is passed in environment variables and logs are expected to be sent to stdout or stderr.
Furthermore, dockerized workers are programming language agnostic, so they may be implemented in Java, JavaScript, Go, Python or even as a shell script.
The programming model for a dockerized worker is a single process. The process should:
- read environment variables to get the information necessary for the run, such as the Cloudant URL, credentials, tenant ID and other job parameters.
- perform the actual task
- write all log messages to stdout or stderr
- terminate with exit code zero in case of success, non-zero in case of failure
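The steps above can be sketched as a minimal Python worker. Only the environment variable names (TENANT_ID, JOB_PARAMS) come from this document; the do_work helper and the parameter shape are illustrative assumptions.

```python
import json
import os
import sys

def do_work(tenant_id, params):
    # Hypothetical task body; a real worker performs its actual
    # processing here, logging to stdout/stderr only.
    print(f"processing for tenant {tenant_id} with {len(params)} parameter(s)")

def main():
    # 1. Read run information from environment variables.
    tenant_id = os.environ.get("TENANT_ID", "")
    params = json.loads(os.environ.get("JOB_PARAMS") or "{}")
    try:
        # 2. Perform the actual task.
        do_work(tenant_id, params)
    except Exception as exc:
        print(f"job failed: {exc}", file=sys.stderr)
        return 1  # non-zero exit code marks the run as failed
    return 0      # zero exit code marks the run as successful

# In the Docker image, the entrypoint would run: sys.exit(main())
```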
The worker should then be packed as a Docker image and deployed to a Docker Registry. The preferred registry is the Container Registry service provided by Bluemix.
The next step is to register the Worker. At the time of writing this involves manually inserting a record into the jobs-worker database, such as:
{
"_id": "requiredlearningcsv",
"worker": {
"name": "Required Learning CSV upload",
"description": "Required Learning CSV upload worker. No parameters needed.",
"image": "registry.ng.bluemix.net/yl_catalog/lex.batch.requiredlearningcsv"
}
}
In future there may be an administration UI available for this step.
The last step is to configure one or more Jobs which will use the new Worker. There is a UI for Job management; for the TEST environment it is: https://yourlearningtest.w3bmix.ibm.com/servicecenter/#jobs/main
Environment variables
The worker runs in an environment with the following variables:
JOB_ID
The ID of the job.
RUN_ID
The ID of the run.
TENANT_ID
The ID of the tenant the Job is configured for.
JOB_PARAMS
Besides other properties, a Job has a parameters property which has no predefined meaning or structure. It is up to the particular Worker how to use the entered value. For example, a CSV parsing worker may use this property to configure the CSV location, delimiters and limits. Another worker could use this property for a completely different purpose.
JOB_SECURE_PARAMS
Similar to JOB_PARAMS, but should be used for sensitive information, such as passwords. The difference from JOB_PARAMS is that the UI hides secure parameters from users other than the job owners.
JOB_STORAGE
The last saved value of the job storage object. See the Job Storage chapter.
LAST_SUCCESS_START
The start timestamp of the last successful run of the job.
LAST_SUCCESS_FINISH
The finish timestamp of the last successful run of the job.
yl.env secret based variables
There is a system-wide configurable mapping between yl.env secrets and environment variables. The current configuration is:
- yl.env/cloudant -> cloudant
- yl.env/db2 -> db2
- yl.env/db2shadow -> db2shadow
- yl.env/ibmsso -> ibmsso
- yl.env/init -> init
- yl.env/catalog -> catalog
VCAP_SERVICES
For backward compatibility with CloudFoundry processes, the Scheduler also sets the VCAP_SERVICES variable. This allows converting FF3 jobs to new Workers without any code modification, just by packing them as a Docker image.
VCAP_SERVICES should be used only by migrated CloudFoundry workers. The variable will be removed in the future; FF3 methods relying on this variable will be annotated as deprecated.
Content of VCAP_SERVICES variable is system-wide configurable.
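For a migrated worker, reading credentials from VCAP_SERVICES looks the same as it did under CloudFoundry. A minimal sketch, assuming the usual Bluemix cloudantNoSQLDB service label (an assumption - the actual content of the variable is system-wide configurable, as noted above):

```python
import json
import os

def cloudant_credentials():
    """Extract Cloudant credentials from VCAP_SERVICES, CloudFoundry style.

    The "cloudantNoSQLDB" service name is the conventional Bluemix label
    and is an assumption here; check the configured VCAP_SERVICES content.
    """
    services = json.loads(os.environ.get("VCAP_SERVICES", "{}"))
    instances = services.get("cloudantNoSQLDB", [])
    return instances[0]["credentials"] if instances else None
```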
Logging
As already mentioned, a dockerized worker should write all its log messages to stdout or stderr. The messages should contain no timestamp, as Kubernetes inserts timestamps for us. When the worker terminates, the Scheduler moves the log from the POD to the Cloudant database. The log is attached to the Job Run record and is therefore archived even after the POD is deleted.
Job Storage
Sometimes a worker needs to save and later load values such as timestamps of the last processed records. To avoid the necessity of connecting to a database, the Scheduler implements a simple storage mechanism by which a worker saves a data structure and gets that structure back on its next run.
We use JOB_STORAGE environment variable to transport stored data from Scheduler to worker. This is JSON encoded representation of the last saved value.
To save a value, the worker writes a special string to the job log, i.e. to standard output. The written line should be prefixed with the [saveJobStorage] string, followed by a space and the JSON-encoded value, e.g.:
[saveJobStorage] {"myLastProcessedTs": "2018-01-16T14:48:32.905173068Z", "lastTimeWeatherWas": "sunny"}
The Scheduler captures the line and updates the job configuration object accordingly. The next time the job runs, the value is passed back in the JOB_STORAGE environment variable as described above.
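The round trip can be sketched in a few lines of Python. The JOB_STORAGE variable and the [saveJobStorage] prefix are as described above; the function names are illustrative.

```python
import json
import os

def load_job_storage():
    """Read the structure saved by the previous run, or {} on the first run."""
    raw = os.environ.get("JOB_STORAGE")
    return json.loads(raw) if raw else {}

def save_job_storage(value):
    """Ask the Scheduler to persist `value` for the next run by writing
    the special [saveJobStorage] line to stdout (i.e. to the job log)."""
    print("[saveJobStorage] " + json.dumps(value))

storage = load_job_storage()
last_ts = storage.get("myLastProcessedTs")  # None on the very first run
save_job_storage({"myLastProcessedTs": "2018-01-16T14:48:32Z"})
```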
Web service workers
A Web Service worker is implemented as a single REST HTTP endpoint. A job is started by POSTing an HTTP request to that endpoint. The body of the request contains a JSON object with the run ID, the job configuration, the URL of the callback interface, the job storage object and a few other properties. It is also possible to specify HTTP headers to be sent with the request; headers can be specified at the worker and/or job level by the httpHeaders property. The Scheduler exposes three endpoints (log, status, job_storage) which are supposed to be called back by the invoked worker to update the status of the job run.
When the Scheduler needs to run a Job, either on a regular schedule or on a manual request, the following actions are taken:
- Scheduler finds the Job configuration in the database by Job ID.
- Scheduler finds the Worker configuration in the database by the Worker ID in the Job configuration.
- Scheduler creates a Job Run record in the database.
- Scheduler sends a POST request to the endpoint. The endpoint URL may be defined at either the worker or the job level.
- The web service endpoint receives the POST request and starts the job. The endpoint immediately confirms the start of the job by replying with status code 201. The job runs asynchronously from now on.
- Scheduler starts a timeout timer. The timeout may be specified at the job level and defaults to 12 hours.
- The running job MAY call the Scheduler's log callback service to write raw logs to the Scheduler.
- The running job MAY call the Scheduler's job_storage callback service to store an arbitrary JSON structure associated with the configured job. The next time the job is invoked, the Scheduler sends that stored JSON in the POST request.
- The finishing job MUST call the Scheduler's status service to notify the Scheduler about job finalization.
- Any call to the callback endpoints mentioned above restarts the timeout. If the timeout overflows, the job is considered dead and is marked as ABORTED.
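Under these rules, a minimal web service worker can be sketched with Python's standard library alone. Everything below is illustrative: a real worker would POST to the log/status/job_storage callback URLs carried in the request body, whereas this sketch just records the callback calls it would make.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_job(body, callback):
    # A real worker would POST to the callback URLs carried in `body`;
    # this sketch hands the would-be callback calls to `callback` instead.
    callback("log", "job started")
    # ... perform the actual work here ...
    callback("status", {"runId": body["runId"], "state": "FINISHED"})

class WorkerHandler(BaseHTTPRequestHandler):
    calls = []  # recorded callback invocations, for illustration only

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        # Confirm the start immediately with 201 ...
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()
        # ... and run the job asynchronously from now on.
        threading.Thread(
            target=run_job,
            args=(body, lambda name, payload: WorkerHandler.calls.append((name, payload))),
        ).start()

    def log_message(self, *args):  # silence default request logging
        pass
```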
Further details about the callback API as well as the web service worker API (the POST endpoint) may be found in the API chapter.
Web service worker configuration
The distinction between dockerized and web service based workers is made at the Worker configuration level. Web service based workers have the engine property set to webservice and an additional url property set to the URL of the worker activation endpoint. If no url property is set at the worker level, it is expected to be set at the job level. This allows having one special worker called other_webservice which serves as a generic web service worker for jobs that have url specified at the job level.
This is an example of a web service worker configuration:
{
"_id": "requiredlearningcsv",
"worker": {
"name": "Required Learning CSV upload",
"description": "Required Learning CSV upload worker. No parameters needed.",
"engine": "webservice",
"url": "http://my.workers.com/workerABC"
}
}
Job templates
Every worker configuration, regardless of its type/engine, may contain a template for new jobs. The template is defined as the template property of the worker object in the worker configuration document. When a user creates a new job in the UI, the template is used as the initial data for that job. Although job templates are not mandatory, defining them is highly recommended as it creates a better user experience.
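For illustration, a worker document carrying a job template might look like this; the template field values are assumptions, mirroring the job properties shown in the jobs-config example in the administrators' section:

```json
{
  "_id": "requiredlearningcsv",
  "worker": {
    "name": "Required Learning CSV upload",
    "image": "registry.ng.bluemix.net/yl_catalog/lex.batch.requiredlearningcsv",
    "template": {
      "name": "Required Learning CSV upload",
      "cron": "0 10 * * *"
    }
  }
}
```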
For scheduler administrators
The Scheduler is deployed as a regular web service application in Kubernetes, in the yl namespace.
Scheduler databases
Workers
Worker configurations are stored in the jobs-worker Cloudant database. The database contains documents such as:
{
"_id": "requiredlearningcsv",
"worker": {
"name": "Required Learning CSV upload",
"description": "Required Learning CSV upload worker. No parameters needed.",
"image": "registry.ng.bluemix.net/yl_catalog/lex.batch.requiredlearningcsv"
}
}
The configuration above is an example of a dockerized worker. Note that it does not have the engine property configured - it therefore defaults to kubernetes. An example of a web service worker could be:
{
"_id": "requiredlearningcsv",
"worker": {
"name": "Required Learning CSV upload",
"description": "Required Learning CSV upload worker. No parameters needed.",
"engine": "webservice",
"url": "https://yl.company.com/workerXYZ",
"httpHeaders": {
"Authorization": "Basic YWxhZGRpbjpvcGVuc2VzYW1l"
}
}
}
Jobs
Job configurations are stored in the jobs-config database. The database contains documents such as:
{
"_id": "c33d29933ce131e4b0e97da50d7ed8f4",
"_rev": "2-ea09f666e9b86f14d39cfe8d63c9887e",
"job": {
"tenant": "ibm",
"name": "Required Learning CSV upload",
"worker": "requiredlearningcsv",
"cron": "0 10 * * *"
}
}
Job Runs
An archive of all job runs is stored in the jobs-log database. The database contains documents such as:
{
"_id": "6cc0a54ddac5017d610b602da2670f6f",
"_rev": "3-57aaeed82fbc02cf7077b92288b33719",
"job": "c33d29933ce131e4b0e97da50d7ed8f4",
"tenant": "ibm",
"state": "FINISHED",
"started": "2017-12-13T10:00:00.537Z",
"finished": "2017-12-13T10:00:21.844Z",
"_attachments": {
"log": {
"content_type": "text/plain",
"revpos": 2,
"digest": "md5-fQPFHI0pVScul/vzV/cwBw==",
"length": 1072,
"stub": true
}
}
}
Design documents
Cloudant documents are kept and maintained within this component's repository, in the cloudant directory.
- Subdirectories are named after the Cloudant databases which should be updated, without the tenant prefix (e.g. mydb instead of ibm_mydb).
- For the time being it is necessary to execute publish.sh in each DB directory to copy them to the lex.cloudant.services repository, which is used by the CloudOps Team for deployments to PROD and other environments.
Scheduler environment variables
- PATH_PREFIX - Prefix for service URLs, optional, defaults to /.
- YL_CLOUDANT - Cloudant connection info as a JSON object, mandatory. The Scheduler reads only the url property, which is therefore expected to also contain credentials.
- YL_CLOUDANT_DB - Cloudant configuration database name, e.g. ylconfigurations.
- DB_JOBS - Name of the database with job configurations, mandatory.
- DB_LOGS - Name of the database with job logs, mandatory.
- DB_WORKERS - Name of the database with worker configurations, mandatory.
- KUB_HOST - API host name or IP address of the controlled Kubernetes cluster, mandatory.
- KUB_PORT - API port of the controlled Kubernetes cluster, mandatory.
- KUB_TOKEN - API token of the controlled Kubernetes cluster, mandatory.
- KUB_NAMESPACE - Namespace for worker PODs, mandatory.
- WORKERS_VCAP_SERVICES - VCAP_SERVICES exposed to workers for backward FF3 compatibility, optional.
- DISABLE_CRON - true to disable the internal cron, optional, defaults to false. Should be set only when run locally.
API
Scheduler API
Swagger document for Scheduler API is available in Git.
Web Service worker API
Swagger document for Web Service worker API is available in Git.