platform-agent-import-astra-data-lake v2.7.1-dev.0
An AWS Batch agent to create the Glue Catalog for the Astra Data Lake and populate the lake with data.
Prerequisites
- Install Node.js and Node Package Manager (npm).
- Install your favorite text/code editor. I highly recommend Visual Studio Code as a lightweight, flexible, and extensible code editor.
- Set the required environment variables. This can be done by running the setup-dev-environment.sh script in the pipeline/ directory.
Getting Started
npm install
npm test
For dev deployments, you can set up the necessary environment variables using the following script:
# Note: requires aws cli to be installed on your machine
. ./pipeline/lookup-ecr-uri.sh
Deployment Guide
Creation of the ECR repository for this agent is a once-per-account deployment. The steps are documented here by the serverless/CloudFormation YAML, but have been pushed up to AWS via serverless deploy from a developer machine rather than building a pipeline for a one-time action. Once this configuration is complete, deployment occurs normally via the CI pipeline using the /agent-import-astra-data-lake/ECR-URI SSM parameter.
pushd pipeline/ecr
npm install
export AWS_PROFILE=prod
export AWS_REGION=us-east-1
npm run deploy
# Note that a serverless plugin handles looking up the ECR URI and storing it in SSM
popd
Creation of the Agent User for this agent is a once-per-account deployment. The steps are documented here by the serverless/CloudFormation YAML, but have been pushed up to AWS via serverless deploy from a developer machine rather than building a pipeline for a one-time action. Once this configuration is complete, deployment occurs normally via the CI pipeline.
pushd pipeline/user
npm install
export AWS_PROFILE=prod
export AWS_REGION=us-east-1
npm run deploy
popd
Also once-per-account, you must alter the ecsInstanceRole permissions to grant the secretsmanager:GetSecretValue permission on the agent-import-astra-data-lake resource. To do this, go into the IAM Management Console, open the 'Roles' page, search for 'ecsInstanceRole', and open the Role. Under 'Permissions', expand the 'data-ingestion-secrets-read' Policy and edit it to add:
"arn:aws:secretsmanager:*:*:secret:/agent-import-astra-data-lake/*"
to the Resource array of the policy.
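After the edit, the relevant statement of the policy would look roughly like the fragment below. This is a sketch: the first resource ARN is a placeholder standing in for whatever entries the policy already contains; only the agent-import-astra-data-lake line comes from the step above.

```json
{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": [
    "arn:aws:secretsmanager:*:*:secret:/some-existing-prefix/*",
    "arn:aws:secretsmanager:*:*:secret:/agent-import-astra-data-lake/*"
  ]
}
```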
Batch Job Queue
Creation of the Batch Job Queue and an SSM parameter that saves the job queue name must be done once per account-stage (so twice if you want a separate staging environment), and is handled by the project in the pipeline/job-queue folder. Consumers requiring the job queue name can reference it by looking up the /platform-agent-import-adl/job-queue-name-$STAGE_DATA_LAKE SSM parameter.
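As a sketch, a consumer script could build the stage-specific parameter name like this; the actual get-parameter call is left commented out since it requires AWS credentials:

```shell
# Build the SSM parameter name for a given data-lake stage.
stage=dev   # substitute staging or prod as needed
param_name="/platform-agent-import-adl/job-queue-name-${stage}"
echo "$param_name"

# With credentials configured, fetch the queue name itself:
# aws ssm get-parameter --name "$param_name" --query Parameter.Value --output text
```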
For Dev account deployment:
pushd pipeline/job-queue
npm install
export AWS_PROFILE=dev
export AWS_REGION=us-east-1
export STAGE_DATA_LAKE=dev
./node_modules/.bin/serverless deploy
popd
For Prod account 'staging' stage deployment:
pushd pipeline/job-queue
npm install
export AWS_PROFILE=prod
export AWS_REGION=us-east-1
export STAGE_DATA_LAKE=staging
./node_modules/.bin/serverless deploy
popd
For Prod account 'prod' stage deployment:
pushd pipeline/job-queue
npm install
export AWS_PROFILE=prod
export AWS_REGION=us-east-1
export STAGE_DATA_LAKE=prod
./node_modules/.bin/serverless deploy
popd
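The three stage deployments above differ only in the AWS profile and stage name, so they can be wrapped in a small helper. This is a sketch, not part of the pipeline: the serverless deploy itself is commented out so the function only reports what it would do.

```shell
# Deploy the job queue for one account/stage combination.
deploy_job_queue() {
  local profile="$1" stage="$2"
  export AWS_PROFILE="$profile" AWS_REGION=us-east-1 STAGE_DATA_LAKE="$stage"
  echo "deploying job queue: profile=$profile stage=$stage"
  # (cd pipeline/job-queue && npm install && ./node_modules/.bin/serverless deploy)
}

deploy_job_queue dev dev
deploy_job_queue prod staging
deploy_job_queue prod prod
```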
Execution of the batch
Executing this batch requires the following parameters:
- TenantId: The tenant ID of the customer you are executing the batch for.
- QueryExecutionIds: A JSON string of query execution IDs.
- AstraDataLakeOutputLocation: The output location for the Astra data, in the form <S3Bucket>/<S3Key>, e.g. astra-data-lake-dev/dev/AstraData. If you want the output at the root level of the bucket, pass only the name of the bucket.
- HealthCheck: If this parameter is set to true, only a check is performed to make sure that the batch is operational; no further processing is done. To generate the astra-data-lake, set this value to false.
- TreatWarningsAsErrors: An optional parameter that is not currently used. Pass true for now in this argument.
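Putting the parameters together, a job submission might look like the sketch below. The job name and job definition are assumptions (this README does not specify them), and all parameter values are example placeholders; the parameters file is built first so the submit call itself can stay commented out.

```shell
# Build the Batch job parameters (all values are example placeholders).
cat > /tmp/adl-params.json <<'EOF'
{
  "TenantId": "example-tenant",
  "QueryExecutionIds": "[\"00000000-0000-0000-0000-000000000000\"]",
  "AstraDataLakeOutputLocation": "astra-data-lake-dev/dev/AstraData",
  "HealthCheck": "false",
  "TreatWarningsAsErrors": "true"
}
EOF
echo "wrote /tmp/adl-params.json"

# With credentials configured (job definition name is hypothetical):
# aws batch submit-job \
#   --job-name adl-import-example \
#   --job-queue "$(aws ssm get-parameter --name /platform-agent-import-adl/job-queue-name-dev --query Parameter.Value --output text)" \
#   --job-definition agent-import-astra-data-lake \
#   --parameters file:///tmp/adl-params.json
```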
Dependencies
mocha
Mocha is a simple and flexible JavaScript test framework. It can be used for BDD, TDD, and other testing types and provides many features for synchronous and asynchronous testing.
chai
Chai is a BDD/TDD assertion library. It provides the standard asserts as well as "should" and "expect" style asserts for BDD language.
aws-sdk
aws-sdk provides an interface for Amazon Web Services such as Step Functions and Batch.