2.0.492 • Published 14 days ago

cdk-emrserverless-with-delta-lake v2.0.492

License
Apache-2.0

cdk-emrserverless-with-delta-lake


| npm (JS/TS) | PyPI (Python) | Maven (Java) | Go | NuGet |
| --- | --- | --- | --- | --- |
| Link | Link | Link | Link | Link |

[Figure: high-level architecture]

This construct builds an EMR Studio, a cluster template for the EMR Studio, and an EMR Serverless application. Two S3 buckets are created: one for the EMR Studio workspace and one for EMR Serverless applications. In addition, the VPC and the subnets for the EMR Studio are tagged {"Key": "for-use-with-amazon-emr-managed-policies", "Value": "true"} via a custom resource, which the service role of EMR Studio requires.
This construct is for analysts, data engineers, and anyone who wants to know how to process Delta Lake data with EMR Serverless.
[Figure: cfn designer]
They deploy the construct with CDK v2 and, within a few minutes, submit a serverless job via the AWS CLI to the EMR application generated by the construct. After the EMR Serverless job finishes, they can check the processed results on an EMR notebook through the cluster template.
[Figure: app history]

Requirements

  1. Your current identity has AdministratorAccess permissions.
  2. An IAM user named Administrator with AdministratorAccess permissions.
    • This is related to the Portfolio of AWS Service Catalog created by the construct, which is required for EMR cluster templates.
    • You can choose whichever identity you wish to associate with the Product in the Portfolio for creating an EMR cluster via the cluster template. Check serviceCatalogProps in the EmrServerless construct for details; otherwise, the IAM user mentioned above is associated with the Product.

Before deployment

You might want to execute the following command.

PROFILE_NAME="scott.hsieh"
# If you only have one set of credentials on your local machine, just drop `--profile`, buddy.
cdk bootstrap aws://${AWS_ACCOUNT_ID}/${AWS_REGION} --profile ${PROFILE_NAME}
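
For scripted setups, the same command can be assembled programmatically. This is a minimal sketch and not part of the construct: `bootstrap_command` is a hypothetical helper, and the account ID and region below are placeholders.

```python
from typing import List, Optional


def bootstrap_command(account_id: str, region: str, profile: Optional[str] = None) -> List[str]:
    """Build the argv for `cdk bootstrap` against one account/region pair."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("an AWS account ID is a 12-digit number")
    argv = ["cdk", "bootstrap", f"aws://{account_id}/{region}"]
    # `--profile` is only needed when several credential profiles exist locally.
    if profile:
        argv += ["--profile", profile]
    return argv


if __name__ == "__main__":
    # Placeholder values; pass the list to subprocess.run(...) to execute it.
    print(" ".join(bootstrap_command("123456789012", "ap-northeast-1", "scott.hsieh")))
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting.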

Minimal content for deployment

#!/usr/bin/env node
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { EmrServerless } from 'cdk-emrserverless-with-delta-lake';

class TypescriptStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    new EmrServerless(this, 'EmrServerless');
  }
}

const app = new cdk.App();
new TypescriptStack(app, 'TypescriptStack', {
  stackName: 'emr-studio',
  env: {
    region: process.env.CDK_DEFAULT_REGION,
    account: process.env.CDK_DEFAULT_ACCOUNT,
  },
});

After deployment

Promise me, darling, take advantage of the CloudFormation outputs. All you need is copy-paste, copy-paste, copy-paste; life should always be that easy.
[Figure: cfn outputs]
First, define the following environment variables in your current session.

```
export PROFILE_NAME="${YOUR_PROFILE_NAME}"
export JOB_ROLE_ARN="${copy-paste-thank-you}"
export APPLICATION_ID="${copy-paste-thank-you}"
export SERVERLESS_BUCKET_NAME="${copy-paste-thank-you}"
export DELTA_LAKE_SCRIPT_NAME="delta-lake-demo"
```  
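
If copy-paste gets tedious, the stack outputs can also be read with boto3 and turned into `export` lines. This is a hedged sketch: the output keys produced by the construct are not listed here, so the keys in `key_to_env` are whatever you see in your own CloudFormation outputs, and `fetch_outputs` assumes boto3 is installed and credentials are configured.

```python
from typing import Dict, Mapping


def outputs_to_dict(describe_stacks_response: Mapping) -> Dict[str, str]:
    """Flatten the `Outputs` list of the first stack into {OutputKey: OutputValue}."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def export_lines(outputs: Mapping[str, str], key_to_env: Mapping[str, str]) -> str:
    """Render `export NAME="value"` lines for the selected output keys."""
    return "\n".join(
        f'export {env_name}="{outputs[key]}"' for key, env_name in key_to_env.items()
    )


def fetch_outputs(stack_name: str) -> Dict[str, str]:
    """Query CloudFormation for the outputs of one stack."""
    import boto3  # imported lazily so the helpers above work without it

    response = boto3.client("cloudformation").describe_stacks(StackName=stack_name)
    return outputs_to_dict(response)
```

For the stack from the minimal example, `fetch_outputs("emr-studio")` returns the outputs as a dictionary; map the keys you find there to `JOB_ROLE_ARN`, `APPLICATION_ID`, and `SERVERLESS_BUCKET_NAME` and print the result of `export_lines`.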
  1. Copy part of the NYC taxi data into the EMR Serverless bucket.
    aws s3 cp s3://nyc-tlc/trip\ data/ s3://${SERVERLESS_BUCKET_NAME}/nyc-taxi/ --exclude "*" --include "yellow_tripdata_2021-*.parquet" --recursive --profile ${PROFILE_NAME}
  2. Create a Python script for processing Delta Lake

    touch ${DELTA_LAKE_SCRIPT_NAME}.py
    cat << EOF > ${DELTA_LAKE_SCRIPT_NAME}.py
    from pyspark.sql import SparkSession
    import uuid
    
    if __name__ == "__main__":
        """
            Delta Lake with EMR Serverless, take NYC taxi as example.
        """
        spark = SparkSession \\
            .builder \\
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \\
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \\
            .enableHiveSupport() \\
            .appName("Delta-Lake-OSS") \\
            .getOrCreate()
    
        url = "s3://${SERVERLESS_BUCKET_NAME}/emr-serverless-spark/delta-lake/output/1.2.1/%s/" % str(
            uuid.uuid4())
    
        # creates a Delta table and outputs to target S3 bucket
        spark.range(5).write.format("delta").save(url)
    
        # reads the Delta table back and prints it to stdout
        spark.read.format("delta").load(url).show()
    
        # The source for the second Delta table.
        base = spark.read.parquet(
            "s3://${SERVERLESS_BUCKET_NAME}/nyc-taxi/*.parquet")
    
        # The second Delta table, oh ya.
        base.write.format("delta") \\
            .mode("overwrite") \\
            .save("s3://${SERVERLESS_BUCKET_NAME}/emr-serverless-spark/delta-lake/nyx-tlc-2021")
        spark.stop()
    EOF
  3. Upload the script and required jars into the serverless bucket

    # upload script
    aws s3 cp ${DELTA_LAKE_SCRIPT_NAME}.py s3://${SERVERLESS_BUCKET_NAME}/scripts/${DELTA_LAKE_SCRIPT_NAME}.py --profile ${PROFILE_NAME}
    # download jars and upload them
    DELTA_VERSION="2.2.0"
    DELTA_LAKE_CORE="delta-core_2.13-${DELTA_VERSION}.jar"
    DELTA_LAKE_STORAGE="delta-storage-${DELTA_VERSION}.jar"
    curl https://repo1.maven.org/maven2/io/delta/delta-core_2.13/${DELTA_VERSION}/${DELTA_LAKE_CORE} --output ${DELTA_LAKE_CORE}
    curl https://repo1.maven.org/maven2/io/delta/delta-storage/${DELTA_VERSION}/${DELTA_LAKE_STORAGE} --output ${DELTA_LAKE_STORAGE}
    aws s3 mv ${DELTA_LAKE_CORE} s3://${SERVERLESS_BUCKET_NAME}/jars/${DELTA_LAKE_CORE} --profile ${PROFILE_NAME}
    aws s3 mv ${DELTA_LAKE_STORAGE} s3://${SERVERLESS_BUCKET_NAME}/jars/${DELTA_LAKE_STORAGE} --profile ${PROFILE_NAME}
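
The jar file names, their Maven Central URLs, and the `spark.jars` value passed to the job all derive from the same two inputs: the Delta Lake version and the Scala suffix. A minimal sketch that keeps them in sync is shown here; the helper names are hypothetical, and the `delta-core` / `delta-storage` naming simply mirrors the commands above.

```python
from typing import List

MAVEN_BASE = "https://repo1.maven.org/maven2/io/delta"


def delta_jar_names(delta_version: str, scala_version: str) -> List[str]:
    """File names of the two jars uploaded in the step above."""
    return [
        f"delta-core_{scala_version}-{delta_version}.jar",
        f"delta-storage-{delta_version}.jar",
    ]


def maven_urls(delta_version: str, scala_version: str) -> List[str]:
    """Download URLs, following the same layout as the curl commands."""
    core, storage = delta_jar_names(delta_version, scala_version)
    return [
        f"{MAVEN_BASE}/delta-core_{scala_version}/{delta_version}/{core}",
        f"{MAVEN_BASE}/delta-storage/{delta_version}/{storage}",
    ]


def spark_jars_value(bucket: str, delta_version: str, scala_version: str) -> str:
    """Comma-separated S3 URIs suitable for `--conf spark.jars=...`."""
    names = delta_jar_names(delta_version, scala_version)
    return ",".join(f"s3://{bucket}/jars/{name}" for name in names)


if __name__ == "__main__":
    # "my-bucket" is a placeholder for ${SERVERLESS_BUCKET_NAME}.
    print(spark_jars_value("my-bucket", "2.2.0", "2.13"))
```

Deriving everything from one pair of values avoids submitting a job that points at jar names different from the ones that were uploaded.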

Create an EMR Serverless app

Remember, you have plenty of information to copy and paste from the CloudFormation outputs.
[Figure: cfn outputs]

aws emr-serverless start-job-run \
  --application-id ${APPLICATION_ID} \
  --execution-role-arn ${JOB_ROLE_ARN} \
  --name 'shy-shy-first-time' \
  --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${SERVERLESS_BUCKET_NAME}'/scripts/'${DELTA_LAKE_SCRIPT_NAME}'.py",
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1 --conf spark.jars=s3://'${SERVERLESS_BUCKET_NAME}'/jars/'${DELTA_LAKE_CORE}',s3://'${SERVERLESS_BUCKET_NAME}'/jars/'${DELTA_LAKE_STORAGE}'"
        }
    }' \
  --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${SERVERLESS_BUCKET_NAME}'/serverless-log/"
            }
        }
    }' \
  --profile ${PROFILE_NAME}
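
The nested quoting in the CLI call is easy to get wrong, so the same request can also be built as a plain dictionary and sent with boto3. This is a sketch under assumptions: `start_job_run_kwargs` is a hypothetical helper, the Spark settings are copied from the command above, and the jar URIs are whatever you uploaded to the bucket.

```python
from typing import Any, Dict, List


def start_job_run_kwargs(
    application_id: str,
    job_role_arn: str,
    bucket: str,
    script_name: str,
    jar_uris: List[str],
    name: str = "shy-shy-first-time",
) -> Dict[str, Any]:
    """Assemble the keyword arguments for the EMR Serverless StartJobRun call."""
    spark_params = " ".join(
        [
            "--conf spark.executor.cores=1",
            "--conf spark.executor.memory=4g",
            "--conf spark.driver.cores=1",
            "--conf spark.driver.memory=4g",
            "--conf spark.executor.instances=1",
            # Multiple jars are passed as one comma-separated list.
            "--conf spark.jars=" + ",".join(jar_uris),
        ]
    )
    return {
        "applicationId": application_id,
        "executionRoleArn": job_role_arn,
        "name": name,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": f"s3://{bucket}/scripts/{script_name}.py",
                "sparkSubmitParameters": spark_params,
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": f"s3://{bucket}/serverless-log/"}
            }
        },
    }
```

With boto3 installed, `boto3.client("emr-serverless").start_job_run(**kwargs)` submits the job and returns the `applicationId`, `jobRunId`, and `arn`.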

If the `aws emr-serverless start-job-run` command succeeds, you should see a response similar to the following:

{
    "applicationId": "00f1gvklchoqru25",
    "jobRunId": "00f1h0ipd2maem01",
    "arn": "arn:aws:emr-serverless:ap-northeast-1:630778274080:/applications/00f1gvklchoqru25/jobruns/00f1h0ipd2maem01"
}

and find the Delta Lake data under s3://${SERVERLESS_BUCKET_NAME}/emr-serverless-spark/delta-lake/nyx-tlc-2021/.
[Figure: Delta Lake data]
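
The data only appears once the job run has finished. A small polling loop can wait for that; the sketch below takes any client object with a `get_job_run` method (for example `boto3.client("emr-serverless")`), and the polling interval and limit are arbitrary assumptions.

```python
import time

# States after which a job run no longer changes.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job_run(client, application_id: str, job_run_id: str,
                     poll_seconds: float = 30.0, max_polls: int = 120) -> str:
    """Poll GetJobRun until the run reaches a terminal state and return that state."""
    for _ in range(max_polls):
        response = client.get_job_run(applicationId=application_id, jobRunId=job_run_id)
        state = response["jobRun"]["state"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"job run {job_run_id} did not finish after {max_polls} polls")
```

Pass the `applicationId` and `jobRunId` from the response above; a return value of `"SUCCESS"` means the Delta table should be in place.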

Check the executing job

Access the EMR Studio via the URL from the CloudFormation outputs. It should look similar to https://es-pilibalapilibala.emrstudio-prod.ap-northeast-1.amazonaws.com; the random string and the region in your URL will differ from mine.
1. Enter the application
[Figure: enter into the app]
2. Enter the executing job
Check results from an EMR notebook via cluster template

  1. Create a workspace and an EMR cluster via the cluster template on the AWS Console
    [Figure: create workspace]
  2. Check the results delivered by the EMR serverless application via an EMR notebook.
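
In the notebook, checking the results can be as short as loading the Delta table the job wrote. This is a sketch, assuming the notebook's cluster has the Delta Lake jars available and that `spark` is the session the notebook provides; `preview_delta_table` is a hypothetical helper.

```python
def delta_table_uri(bucket: str) -> str:
    """S3 location the job script wrote the second Delta table to."""
    return f"s3://{bucket}/emr-serverless-spark/delta-lake/nyx-tlc-2021"


def preview_delta_table(spark, bucket: str, rows: int = 5):
    """Load the Delta table and print its schema, row count, and a few rows."""
    df = spark.read.format("delta").load(delta_table_uri(bucket))
    df.printSchema()
    print("rows:", df.count())
    df.show(rows)
    return df
```

In a notebook cell, call `preview_delta_table(spark, "<your serverless bucket name>")`.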

Fun facts

  1. You can assign multiple jars as a comma-separated list to spark.jars, as the Spark documentation says, for your EMR Serverless job. The UI will complain, but you can still start the job. Don't be afraid, just click it like you did as a child, facing authority fearlessly.
    [Figure: ui bug]
  2. To fully delete a stack with the construct, you need to make sure no workspace remains within the EMR Studio. In addition, you need to remove the associated identity from the Service Catalog (a necessary resource for the cluster template).
  3. Version inconsistency on Spark history. It can probably be ignored, yet it still made me wonder why the versions differ.
    [Figure: naughty inconsistency]
  4. So far, I still haven't figured out how to make the s3a URI work. The s3 URI is fine, while the serverless app complains that it cannot find a proper credentials provider to read the s3a URI.

Future work

  1. Custom resource for EMR Serverless
  2. Make the construct more flexible for users
  3. Compare Databricks Runtime and EMR Serverless.