
Update docs for EMR 7.x - closes #56
Also replace deprecated GitHub markdown alert syntax
dacort committed Jan 8, 2024
1 parent 30ea478 commit dbe1565
Showing 8 changed files with 43 additions and 11 deletions.
6 changes: 4 additions & 2 deletions airflow/README.md
@@ -2,7 +2,8 @@

As of `apache-airflow-providers-amazon==5.0.0`, the EMR Serverless Operator is now part of the official [Apache Airflow Amazon Provider](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/index.html) and has been tested with open source Apache Airflow v2.2.2.

-> **Warning** The operator in this repository is no longer maintained.
+> [!CAUTION]
+> The operator in this repository is no longer maintained.
## Amazon Managed Workflows for Apache Airflow (MWAA)

@@ -15,7 +16,8 @@ apache-airflow-providers-amazon==6.0.0
boto3>=1.23.9
```

-> **Note** `boto3>=1.23.9` is required for EMR Serverless support
+> [!IMPORTANT]
+> `boto3>=1.23.9` is required for EMR Serverless support
## Example DAGs

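For a self-managed Airflow installation, the same pins can go straight to `pip`; a minimal sketch using the versions from the MWAA requirements above:

```shell
# The Amazon provider ships the EMR Serverless operators; boto3 >= 1.23.9
# is needed for the emr-serverless API.
pip install "apache-airflow-providers-amazon>=6.0.0" "boto3>=1.23.9"
```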
3 changes: 2 additions & 1 deletion cloudformation/emr-serverless-cloudwatch-dashboard/README.md
@@ -98,7 +98,8 @@ Let's take a quick look and see how we can use the dashboard to optimize our EMR

In this example, we'll start an application with a limited set of pre-initialized capacity and run jobs that both fit and exceed that capacity and see what happens.

-> **Note**: EMR Serverless sends metrics to Amazon CloudWatch every 1 minute, so you may see different behavior depending on how quickly you run the commands.
+> [!NOTE]
+> EMR Serverless sends metrics to Amazon CloudWatch every 1 minute, so you may see different behavior depending on how quickly you run the commands.
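To follow along, you first need an application with capped pre-initialized capacity. A minimal sketch — the name, release label, and worker sizes here are illustrative, not values from this walkthrough:

```shell
# Create a Spark application with a small pool of pre-initialized workers;
# jobs that exceed this capacity spill over to on-demand workers.
aws emr-serverless create-application \
    --name dashboard-demo \
    --type SPARK \
    --release-label emr-6.9.0 \
    --initial-capacity '{
        "DRIVER":   {"workerCount": 1, "workerConfiguration": {"cpu": "2vCPU", "memory": "4GB"}},
        "EXECUTOR": {"workerCount": 2, "workerConfiguration": {"cpu": "4vCPU", "memory": "8GB"}}
    }'
```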
### Pre-requisites

2 changes: 1 addition & 1 deletion examples/java/README.md
@@ -92,7 +92,7 @@ aws emr-serverless start-job-run \
}'
```

-> **Note**: We don't specify any Spark CPU or memory configurations - the defaults [defined here](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-defaults) fit into the pre-init capacity we created above. We used a value of `16GB` for our pre-init capacity because the Spark default is `14GB` plus 10% for Spark memory overhead.
+> [!TIP]
+> We don't specify any Spark CPU or memory configurations - the defaults [defined here](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-defaults) fit into the pre-init capacity we created above. We used a value of `16GB` for our pre-init capacity because the Spark default is `14GB` plus 10% for Spark memory overhead (14GB × 1.10 = 15.4GB, which fits within 16GB).
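If you did want to size the job explicitly, the same request can pin the memory. A hedged sketch — `$APPLICATION_ID`, `$JOB_ROLE_ARN`, and the S3 path are placeholders for the values defined earlier in the example:

```shell
# 14g plus 10% overhead = 15.4GB, which still fits the 16GB pre-init workers.
aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/code/java-demo.jar",
            "sparkSubmitParameters": "--conf spark.driver.memory=14g --conf spark.executor.memory=14g"
        }
    }'
```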
## Clean up

5 changes: 4 additions & 1 deletion examples/java/hello-world/src/main/java/HelloWorld.java
@@ -1,5 +1,8 @@
package com.amazon.damon;

+import org.apache.spark.sql.SparkSession;
+
+
public class HelloWorld {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
@@ -8,4 +11,4 @@ public static void main(String[] args) {

        spark.stop();
    }
-}
\ No newline at end of file
+}
3 changes: 2 additions & 1 deletion examples/pyspark/custom-images/README.md
@@ -6,7 +6,8 @@ In this example, we use a simple example of adding the `seaborn` library to buil

## Pre-requisites

-> **Note** This example is intended to be run in the `us-east-1` region as it reads data from [NOAA Global Surface Summary of Day dataset](https://registry.opendata.aws/noaa-gsod/) from the Registry of Open Data.
+> [!IMPORTANT]
+> This example is intended to be run in the `us-east-1` region, as it reads data from the [NOAA Global Surface Summary of Day dataset](https://registry.opendata.aws/noaa-gsod/) on the Registry of Open Data.
In order to make use of custom images in EMR, you'll need to have:
- a local installation of Docker to build your image
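Once those prerequisites are in place, the overall flow is build, push to ECR, and point the application at the image. A hedged sketch — the repository name, tag, and account ID are illustrative:

```shell
# Build the image and push it to a private ECR repository.
aws ecr create-repository --repository-name emr-serverless-ci --region us-east-1
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-ci:seaborn .
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-ci:seaborn

# Create an application (EMR 6.9.0 or later) that uses the custom image.
aws emr-serverless create-application \
    --name custom-image-demo \
    --type SPARK \
    --release-label emr-6.9.0 \
    --image-configuration '{"imageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-ci:seaborn"}'
```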
5 changes: 4 additions & 1 deletion examples/pyspark/custom_python_version/README.md
@@ -1,6 +1,9 @@
# Custom Python versions on EMR Serverless

-Occasionally, you'll require a specific Python version. While EMR Serverless uses Python 3.7.10 by default, you can upgrade by building your own virtual environment with the desired version and copying the binaries when you package your virtual environment.
+> [!IMPORTANT]
+> EMR release 7.x now supports Python 3.9.x by default. To change the Python version in 7.x releases, use `public.ecr.aws/amazonlinux/amazonlinux:2023-minimal` as your base image.
+
+Occasionally, you'll require a specific Python version. While EMR Serverless uses Python 3.7.x by default, you can upgrade by building your own virtual environment with the desired version and copying the binaries when you package your virtual environment.

Let's say you want to make use of the new `match` statements in Python 3.10 - We'll use a Dockerfile to install Python 3.10.6 and create our custom virtual environment.
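Once the packed environment is uploaded to S3, the job run just has to be pointed at it. A minimal sketch of the relevant properties, passed via `sparkSubmitParameters` in `start-job-run` — the bucket and archive name are placeholders:

```shell
# Ship the packed venv with the job and point driver and executors at it.
--conf spark.archives=s3://my-bucket/artifacts/pyspark_3.10.6.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```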

17 changes: 17 additions & 0 deletions examples/pyspark/dependencies/Dockerfile.al2023
@@ -0,0 +1,17 @@
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base

RUN dnf install -y gcc python3 python3-devel

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    great_expectations==0.18.7 \
    venv-pack==0.2.0

RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
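Building with this file mirrors the existing workflow; a sketch using docker's standard `-f` flag to select it:

```shell
# Build the AL2023-based image (for EMR 7.x) and export pyspark_ge.tar.gz
# to the current directory via the `export` stage.
DOCKER_BUILDKIT=1 docker build -f Dockerfile.al2023 --output . .
```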
13 changes: 9 additions & 4 deletions examples/pyspark/dependencies/README.md
@@ -8,7 +8,13 @@ You can create isolated Python virtual environments to package multiple Python l
- [Docker](https://www.docker.com/get-started) with the [BuildKit backend](https://docs.docker.com/engine/reference/builder/#buildkit)
- An S3 bucket in `us-east-1` and an IAM Role to run your EMR Serverless jobs

-> **Note**: If using Docker on Apple Silicon ensure you use `--platform linux/amd64`
+> [!IMPORTANT]
+> This example is intended to be run in the `us-east-1` region, as it reads data from the [New York City Taxi dataset](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) on the Registry of Open Data. If your EMR Serverless application is in a different region, you must [configure networking](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html).
+
+> [!IMPORTANT]
+> The default [Dockerfile](./Dockerfile) is configured to use `linux/amd64`.
+> If using Graviton, update it to use `linux/arm64`, or pass `--platform linux/arm64` to the `docker build` command. See the [EMR Serverless architecture options](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/architecture.html) for more detail.
+> If using EMR 7.x, you must use Amazon Linux 2023 as the base image instead of Amazon Linux 2. A sample file is provided in [Dockerfile.al2023](./Dockerfile.al2023).
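For the Graviton case called out above, the override can stay on the command line; a sketch:

```shell
# Build an arm64 environment without editing the Dockerfile.
DOCKER_BUILDKIT=1 docker build --platform linux/arm64 --output . .
```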
Set the following variables according to your environment.

@@ -28,8 +34,6 @@ All the commands below should be executed in this (`examples/pyspark/dependencie

This command builds the included `Dockerfile` and exports the resulting `pyspark_ge.tar.gz` file to your local filesystem.

-> **Note** The included [Dockerfile](./Dockerfile) builds for x86 - if you would like to build for Graviton, update the Dockerfile to use `linux/arm64` as the platform and see the [EMR Serverless architecture options](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/architecture.html) for more detail.
```shell
# Enable BuildKit backend
DOCKER_BUILDKIT=1 docker build --output . .
@@ -123,7 +127,8 @@ _This approach can also be used with EMR release label `emr-6.6.0`._

To do this, we'll create a [`pom.xml`](./pom.xml) that specifies our dependencies and use a [maven Docker container](./Dockerfile.jars) to build the uberjar. In this example, we'll package `org.postgresql:postgresql:42.4.0` and use the example script in [./pg_query.py](./pg_query.py) to query a Postgres database.

-> **Note**: The code in `pg_query.py` is for demonstration purposes only - never store credentials directly in your code. 😁
+> [!TIP]
+> The code in `pg_query.py` is for demonstration purposes only - never store credentials directly in your code. 😁
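As a preview of where the steps below land, wiring the finished uberjar into a job run might look like this sketch — the bucket and jar name are placeholders based on the files mentioned above:

```shell
# Attach the packaged Postgres driver to the job and run the query script.
aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/code/pg_query.py",
            "sparkSubmitParameters": "--jars s3://my-bucket/code/uber-jars-1.0-SNAPSHOT.jar"
        }
    }'
```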
1. Build an uberjar with your dependencies

Expand Down
