Add delta lake and emr-cli example (#50)
* Add delta lake and emr-cli example
dacort authored Jan 26, 2024
1 parent c39223b commit 6fdeb29
Showing 6 changed files with 521 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -8,3 +8,5 @@ dist/
# macOS detritus
.DS_Store

# config files
.env
70 changes: 70 additions & 0 deletions examples/pyspark/delta-lake/Dockerfile
@@ -0,0 +1,70 @@
# This is a multi-stage Dockerfile that can be used to build many different types of
# bundled dependencies for PySpark projects.
# The `base` stage installs generic tools necessary for packaging.
#
# There are `export-` and `build-` stages for the different types of projects.
# - python-packages - Generic support for Python projects with pyproject.toml
# - poetry - Support for Poetry projects
#
# This Dockerfile is generated automatically as part of the emr-cli tool.
# Feel free to modify it for your needs, but leave the `build-` and `export-`
# stages related to your project.
#
# To build manually, you can use the following command, assuming
# the Docker BuildKit backend is enabled. https://docs.docker.com/build/buildkit/
#
# Example for building a poetry project and saving the output to dist/ folder
# docker build --target export-poetry --output dist .


## ----------------------------------------------------------------------------
## Base stage for python development
## ----------------------------------------------------------------------------
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3 tar gzip

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# EMR 6.x uses Python 3.7 - limit Poetry version to 1.5.1
ENV POETRY_VERSION=1.5.1
RUN python3 -m pip install --upgrade pip
RUN curl -sSL https://install.python-poetry.org | python3 -

ENV PATH="$PATH:/root/.local/bin"

WORKDIR /app

COPY . .

# Test stage - installs test dependencies defined in pyproject.toml
FROM base AS test
RUN python3 -m pip install .[test]

## ----------------------------------------------------------------------------
## Build and export stages for standard Python projects
## ----------------------------------------------------------------------------
# Build stage - installs required dependencies and creates a venv package
FROM base AS build-python
RUN python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install .
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

# Export stage - used to copy packaged venv to local filesystem
FROM scratch AS export-python
COPY --from=build-python /output/pyspark_deps.tar.gz /

## ----------------------------------------------------------------------------
## Build and export stages for Poetry Python projects
## ----------------------------------------------------------------------------
# Build stage for poetry
FROM base AS build-poetry
RUN poetry self add poetry-plugin-bundle && \
    poetry bundle venv dist/bundle --without dev && \
    tar -czvf dist/pyspark_deps.tar.gz -C dist/bundle . && \
    rm -rf dist/bundle

FROM scratch AS export-poetry
COPY --from=build-poetry /app/dist/pyspark_deps.tar.gz /
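
As the header comment notes, you can also produce this bundle locally without the EMR CLI. A sketch, assuming Docker BuildKit is available:

```bash
# Runs the build-poetry stage and copies pyspark_deps.tar.gz into dist/
DOCKER_BUILDKIT=1 docker build --target export-poetry --output dist .
```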
78 changes: 78 additions & 0 deletions examples/pyspark/delta-lake/README.md
@@ -0,0 +1,78 @@
# EMR Serverless Delta Lake with Poetry example

This example shows how to use the [`emr-cli`](https://github.com/awslabs/amazon-emr-cli) to deploy a Poetry-based project with Delta Lake to EMR Serverless.

As of EMR 6.9.0, Delta Lake jars are provided on the EMR Serverless image. This means you can use the `spark.jars` Spark configuration item to specify the path to the local Delta Lake jars. If you use a different version than what's provided with EMR Serverless, you can still use the `--packages` option to specify your version.
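
For reference, here is how the two approaches look as options passed to `spark-submit` (a sketch; the Maven coordinates and version in the second option are illustrative, not taken from this repo):

```bash
# Use the Delta Lake jars bundled with the EMR Serverless image (paths as of EMR 6.9.0)
--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar

# Or fetch a specific Delta Lake release from Maven instead
--packages io.delta:delta-core_2.12:2.1.0
```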

## Getting Started

> [!NOTE]
> This assumes you already have an EMR Serverless 6.9.0 application or have completed the pre-requisites in this repo's [README](/README.md).

To create an EMR Serverless application compatible with this code, use the following command:

```bash
aws emr-serverless create-application \
  --release-label emr-6.9.0 \
  --type SPARK
```
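
The `create-application` call returns the new application's ID. If you'd rather capture it directly into a shell variable, one option is the AWS CLI's `--query` flag (a sketch, assuming default CLI settings):

```bash
APPLICATION_ID=$(aws emr-serverless create-application \
  --release-label emr-6.9.0 \
  --type SPARK \
  --query applicationId \
  --output text)
```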

- Define some environment variables to be used later

```shell
export APPLICATION_ID=<APPLICATION_ID>
export S3_BUCKET=<YOUR_BUCKET_NAME>
export JOB_ROLE_ARN=arn:aws:iam::<ACCOUNT_ID>:role/emr-serverless-job-role
```

You can either `git clone` this project or use the `emr init` command to create a Poetry project and add the `delta-spark` dependency yourself.

- Option 1: `git clone`

```
git clone https://github.com/aws-samples/emr-serverless-samples.git
cd emr-serverless-samples/examples/pyspark/delta-lake
poetry install
```

- Option 2: `emr init`

```
emr init --project-type poetry delta-lake
cd delta-lake
poetry add delta-spark==2.1.0
```

Copy `main.py` from this directory to your new folder.

## Deploying

```bash
emr run \
  --application-id ${APPLICATION_ID} \
  --job-role ${JOB_ROLE_ARN} \
  --s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-delta-lake/ \
  --s3-logs-uri s3://${S3_BUCKET}/logs/ \
  --entry-point main.py \
  --job-args ${S3_BUCKET} \
  --spark-submit-opts "--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar" \
  --build --wait --show-stdout
```

> [!NOTE]
> Because of how `delta-spark` is packaged, this will include `pyspark` as a dependency. The `--build` flag packages and deploys a virtualenv with `delta-spark` and related dependencies.

You should see the following output:

```
[emr-cli]: Job submitted to EMR Serverless (Job Run ID: 00fgj5hq9e4le80m)
[emr-cli]: Waiting for job to complete...
[emr-cli]: Job state is now: SCHEDULED
[emr-cli]: Job state is now: RUNNING
[emr-cli]: Job state is now: SUCCESS
[emr-cli]: stdout for 00fgj5hq9e4le80m
--------------------------------------
Itsa Delta!
[emr-cli]: Job completed successfully!
```
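
If you need to dig into a run after the fact (for example, if it fails), you can also query it with the AWS CLI; a sketch, substituting the job run ID printed above:

```bash
aws emr-serverless get-job-run \
  --application-id ${APPLICATION_ID} \
  --job-run-id <JOB_RUN_ID>
```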
26 changes: 26 additions & 0 deletions examples/pyspark/delta-lake/main.py
@@ -0,0 +1,26 @@
import sys
import uuid

from delta import DeltaTable, configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure Spark with the Delta Lake SQL extension and catalog
builder = (
    SparkSession.builder.appName("DeltaExample")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

bucket_name = sys.argv[1]

# Write to a unique prefix so repeated runs don't collide
url = f"s3://{bucket_name}/tmp/delta-lake/output/1.0.1/{uuid.uuid4()}/"

# Creates a Delta table and outputs to the target S3 bucket
spark.range(0, 5).write.format("delta").save(url)

if DeltaTable.isDeltaTable(spark, url):
    print("Itsa Delta!")