Commit: Update Delta Lake example
dacort committed Jan 26, 2024
1 parent f9fc1aa commit 387edfa
Showing 4 changed files with 180 additions and 392 deletions.
70 changes: 70 additions & 0 deletions examples/pyspark/delta-lake/Dockerfile
@@ -0,0 +1,70 @@
# This is a multi-stage Dockerfile that can be used to build many different types of
# bundled dependencies for PySpark projects.
# The `base` stage installs generic tools necessary for packaging.
#
# There are `export-` and `build-` stages for the different types of projects.
# - python-packages - Generic support for Python projects with pyproject.toml
# - poetry - Support for Poetry projects
#
# This Dockerfile is generated automatically as part of the emr-cli tool.
# Feel free to modify it for your needs, but leave the `build-` and `export-`
# stages related to your project.
#
# To build manually, you can use the following command, assuming
# the Docker BuildKit backend is enabled. https://docs.docker.com/build/buildkit/
#
# Example for building a poetry project and saving the output to dist/ folder
# docker build --target export-poetry --output dist .


## ----------------------------------------------------------------------------
## Base stage for python development
## ----------------------------------------------------------------------------
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3 tar gzip

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# EMR 6.x uses Python 3.7 - limit Poetry version to 1.5.1
ENV POETRY_VERSION=1.5.1
RUN python3 -m pip install --upgrade pip
RUN curl -sSL https://install.python-poetry.org | python3 -

ENV PATH="$PATH:/root/.local/bin"

WORKDIR /app

COPY . .

# Test stage - installs test dependencies defined in pyproject.toml
FROM base AS test
RUN python3 -m pip install .[test]

## ----------------------------------------------------------------------------
## Build and export stages for standard Python projects
## ----------------------------------------------------------------------------
# Build stage - installs required dependencies and creates a venv package
FROM base AS build-python
RUN python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install .
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

# Export stage - used to copy packaged venv to local filesystem
FROM scratch AS export-python
COPY --from=build-python /output/pyspark_deps.tar.gz /

## ----------------------------------------------------------------------------
## Build and export stages for Poetry Python projects
## ----------------------------------------------------------------------------
# Build stage for poetry
FROM base AS build-poetry
RUN poetry self add poetry-plugin-bundle && \
    poetry bundle venv dist/bundle --without dev && \
    tar -czvf dist/pyspark_deps.tar.gz -C dist/bundle . && \
    rm -rf dist/bundle

FROM scratch AS export-poetry
COPY --from=build-poetry /app/dist/pyspark_deps.tar.gz /
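
As the comment at the top of the Dockerfile describes, you can also build the dependency bundle by hand. A minimal sketch, assuming Docker BuildKit is enabled:

```bash
# Build the Poetry bundle and write pyspark_deps.tar.gz to ./dist
DOCKER_BUILDKIT=1 docker build --target export-poetry --output dist .

# For a plain pyproject.toml project, target the generic export stage instead
DOCKER_BUILDKIT=1 docker build --target export-python --output dist .
```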
32 changes: 30 additions & 2 deletions examples/pyspark/delta-lake/README.md
@@ -6,7 +6,16 @@ As of EMR 6.9.0, Delta Lake jars are provided on the EMR Serverless image. This

## Getting Started

> [!NOTE]
> This assumes you already have an EMR Serverless 6.9.0 application or have completed the pre-requisites in this repo's [README](/README.md).

To create an EMR Serverless application compatible with this code, use the following command:

```bash
aws emr-serverless create-application \
--release-label emr-6.9.0 \
--type SPARK
```
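
The command returns the new application's ID, which later steps reference as `APPLICATION_ID`. If you want to capture it in one step, one option (a sketch using the AWS CLI's `--query` flag) is:

```bash
# Create the application and capture its ID for later use
export APPLICATION_ID=$(aws emr-serverless create-application \
  --release-label emr-6.9.0 \
  --type SPARK \
  --query applicationId \
  --output text)
```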

- Define some environment variables to be used later

@@ -43,8 +52,27 @@ emr run \
--application-id ${APPLICATION_ID} \
--job-role ${JOB_ROLE_ARN} \
--s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-delta-lake/ \
--s3-logs-uri s3://${S3_BUCKET}/logs/ \
--entry-point main.py \
--job-args ${S3_BUCKET} \
--spark-submit-opts "--conf spark.jars=/usr/share/aws/delta/lib/delta-core.jar,/usr/share/aws/delta/lib/delta-storage.jar" \
  --build --wait --show-stdout
```

> [!NOTE]
> Because of how `delta-spark` is packaged, this will include `pyspark` as a dependency. The `--build` flag packages and deploys a virtualenv with `delta-spark` and related dependencies.

You should see the following output:

```
[emr-cli]: Job submitted to EMR Serverless (Job Run ID: 00fgj5hq9e4le80m)
[emr-cli]: Waiting for job to complete...
[emr-cli]: Job state is now: SCHEDULED
[emr-cli]: Job state is now: RUNNING
[emr-cli]: Job state is now: SUCCESS
[emr-cli]: stdout for 00fgj5hq9e4le80m
--------------------------------------
Itsa Delta!
[emr-cli]: Job completed successfully!
```
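
For reference, here is a minimal sketch of what an entry point like `main.py` could look like. This is an illustrative assumption, not the exact script in this example: it takes the bucket name as its first argument (matching `--job-args ${S3_BUCKET}`) and prints the "Itsa Delta!" line shown in the output above.

```python
import sys

from delta import DeltaTable
from pyspark.sql import SparkSession

# The bucket name arrives via --job-args ${S3_BUCKET}
bucket = sys.argv[1]

spark = (
    SparkSession.builder.appName("delta-lake-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Write a tiny DataFrame out in Delta format (hypothetical demo path)
path = f"s3://{bucket}/tmp/delta-lake-demo/"
df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "value"])
df.write.format("delta").mode("overwrite").save(path)

# Confirm the path now holds a Delta table
if DeltaTable.isDeltaTable(spark, path):
    print("Itsa Delta!")
```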