
Update docs for EMR 7.x - closes #56
Also replace deprecated GitHub markdown alert syntax
dacort committed Jan 8, 2024
1 parent 30ea478 commit dbe1565
Showing 8 changed files with 43 additions and 11 deletions.
6 changes: 4 additions & 2 deletions airflow/README.md
@@ -2,7 +2,8 @@

As of `apache-airflow-providers-amazon==5.0.0`, the EMR Serverless Operator is now part of the official [Apache Airflow Amazon Provider](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/index.html) and has been tested with open source Apache Airflow v2.2.2.

-> **Warning** The operator in this repository is no longer maintained.
+> [!CAUTION]
+> The operator in this repository is no longer maintained.
## Amazon Managed Workflows for Apache Airflow (MWAA)

@@ -15,7 +16,8 @@ apache-airflow-providers-amazon==6.0.0
boto3>=1.23.9
```

-> **Note** `boto3>=1.23.9` is required for EMR Serverless support
+> [!IMPORTANT]
+> `boto3>=1.23.9` is required for EMR Serverless support
## Example DAGs

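For a self-managed Airflow installation, the same pins can go straight to `pip`; a minimal sketch using the versions from the MWAA requirements above:

```shell
# The Amazon provider ships the EMR Serverless operators; boto3 >= 1.23.9
# is needed for the emr-serverless API.
pip install "apache-airflow-providers-amazon>=6.0.0" "boto3>=1.23.9"
```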
3 changes: 2 additions & 1 deletion cloudformation/emr-serverless-cloudwatch-dashboard/README.md
@@ -98,7 +98,8 @@ Let's take a quick look and see how we can use the dashboard to optimize our EMR

In this example, we'll start an application with a limited set of pre-initialized capacity and run jobs that both fit and exceed that capacity and see what happens.

-> **Note**: EMR Serverless sends metrics to Amazon CloudWatch every 1 minute, so you may see different behavior depending on how quickly you run the commands.
+> [!NOTE]
+> EMR Serverless sends metrics to Amazon CloudWatch every 1 minute, so you may see different behavior depending on how quickly you run the commands.
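To follow along, you first need an application with capped pre-initialized capacity. A minimal sketch — the name, release label, and worker sizes here are illustrative, not values from this walkthrough:

```shell
# Create a Spark application with a small pool of pre-initialized workers;
# jobs that exceed this capacity spill over to on-demand workers.
aws emr-serverless create-application \
    --name dashboard-demo \
    --type SPARK \
    --release-label emr-6.9.0 \
    --initial-capacity '{
        "DRIVER":   {"workerCount": 1, "workerConfiguration": {"cpu": "2vCPU", "memory": "4GB"}},
        "EXECUTOR": {"workerCount": 2, "workerConfiguration": {"cpu": "4vCPU", "memory": "8GB"}}
    }'
```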
### Pre-requisites

2 changes: 1 addition & 1 deletion examples/java/README.md
@@ -92,7 +92,7 @@ aws emr-serverless start-job-run \
}'
```

-> **Note**: We don't specify any Spark CPU or memory configurations - the defaults [defined here](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-defaults) fit into the pre-init capacity we created above. We used a value of `16GB` for our pre-init capacity because the Spark default is `14GB` plus 10% for Spark memory overhead.
+> [!TIP]
+> We don't specify any Spark CPU or memory configurations - the defaults [defined here](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-defaults) fit into the pre-init capacity we created above. We used a value of `16GB` for our pre-init capacity because the Spark default is `14GB` plus 10% for Spark memory overhead (14GB × 1.10 = 15.4GB, which fits within 16GB).
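If you did want to size the job explicitly, the same request can pin the memory. A hedged sketch — `$APPLICATION_ID`, `$JOB_ROLE_ARN`, and the S3 path are placeholders for the values defined earlier in the example:

```shell
# 14g plus 10% overhead = 15.4GB, which still fits the 16GB pre-init workers.
aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/code/java-demo.jar",
            "sparkSubmitParameters": "--conf spark.driver.memory=14g --conf spark.executor.memory=14g"
        }
    }'
```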
## Clean up

5 changes: 4 additions & 1 deletion examples/java/hello-world/src/main/java/HelloWorld.java
@@ -1,5 +1,8 @@
package com.amazon.damon;

+import org.apache.spark.sql.SparkSession;
+
+
public class HelloWorld {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
@@ -8,4 +11,4 @@ public static void main(String[] args) {

        spark.stop();
    }
-}
\ No newline at end of file
+}
3 changes: 2 additions & 1 deletion examples/pyspark/custom-images/README.md
@@ -6,7 +6,8 @@ In this example, we use a simple example of adding the `seaborn` library to buil

## Pre-requisites

-> **Note** This example is intended to be run in the `us-east-1` region as it reads data from [NOAA Global Surface Summary of Day dataset](https://registry.opendata.aws/noaa-gsod/) from the Registry of Open Data.
+> [!IMPORTANT]
+> This example is intended to be run in the `us-east-1` region, as it reads data from the [NOAA Global Surface Summary of Day dataset](https://registry.opendata.aws/noaa-gsod/) on the Registry of Open Data.
In order to make use of custom images in EMR, you'll need to have:
- a local installation of Docker to build your image
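Once those prerequisites are in place, the overall flow is build, push to ECR, and point the application at the image. A hedged sketch — the repository name, tag, and account ID are illustrative:

```shell
# Build the image and push it to a private ECR repository.
aws ecr create-repository --repository-name emr-serverless-ci --region us-east-1
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-ci:seaborn .
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-ci:seaborn

# Create an application (EMR 6.9.0 or later) that uses the custom image.
aws emr-serverless create-application \
    --name custom-image-demo \
    --type SPARK \
    --release-label emr-6.9.0 \
    --image-configuration '{"imageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-ci:seaborn"}'
```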
5 changes: 4 additions & 1 deletion examples/pyspark/custom_python_version/README.md
@@ -1,6 +1,9 @@
# Custom Python versions on EMR Serverless

-Occasionally, you'll require a specific Python version. While EMR Serverless uses Python 3.7.10 by default, you can upgrade by building your own virtual environment with the desired version and copying the binaries when you package your virtual environment.
+> [!IMPORTANT]
+> EMR release 7.x now supports Python 3.9.x by default. To change the Python version in 7.x releases, use `public.ecr.aws/amazonlinux/amazonlinux:2023-minimal` as your base image.
+
+Occasionally, you'll require a specific Python version. While EMR Serverless uses Python 3.7.x by default, you can upgrade by building your own virtual environment with the desired version and copying the binaries when you package your virtual environment.

Let's say you want to make use of the new `match` statements in Python 3.10 - We'll use a Dockerfile to install Python 3.10.6 and create our custom virtual environment.
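Once the packed environment is uploaded to S3, the job run just has to be pointed at it. A minimal sketch of the relevant properties, passed via `sparkSubmitParameters` in `start-job-run` — the bucket and archive name are placeholders:

```shell
# Ship the packed venv with the job and point driver and executors at it.
--conf spark.archives=s3://my-bucket/artifacts/pyspark_3.10.6.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```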

17 changes: 17 additions & 0 deletions examples/pyspark/dependencies/Dockerfile.al2023
@@ -0,0 +1,17 @@
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base

RUN dnf install -y gcc python3 python3-devel

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    great_expectations==0.18.7 \
    venv-pack==0.2.0

RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
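Building with this file mirrors the existing workflow; a sketch using docker's standard `-f` flag to select it:

```shell
# Build the AL2023-based image (for EMR 7.x) and export pyspark_ge.tar.gz
# to the current directory via the `export` stage.
DOCKER_BUILDKIT=1 docker build -f Dockerfile.al2023 --output . .
```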
13 changes: 9 additions & 4 deletions examples/pyspark/dependencies/README.md
@@ -8,7 +8,13 @@ You can create isolated Python virtual environments to package multiple Python l
- [Docker](https://www.docker.com/get-started) with the [BuildKit backend](https://docs.docker.com/engine/reference/builder/#buildkit)
- An S3 bucket in `us-east-1` and an IAM Role to run your EMR Serverless jobs

-> **Note**: If using Docker on Apple Silicon ensure you use `--platform linux/amd64`
+> [!IMPORTANT]
+> This example is intended to be run in the `us-east-1` region, as it reads data from the [New York City Taxi dataset](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) on the Registry of Open Data. If your EMR Serverless application is in a different region, you must [configure networking](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html).
+
+> [!IMPORTANT]
+> The default [Dockerfile](./Dockerfile) is configured to use `linux/amd64`.
+> If using Graviton, update it to use `linux/arm64`, or pass `--platform linux/arm64` to the `docker build` command. See the [EMR Serverless architecture options](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/architecture.html) for more detail.
+> If using EMR 7.x, you must use Amazon Linux 2023 as the base image instead of Amazon Linux 2. A sample file is provided in [Dockerfile.al2023](./Dockerfile.al2023).
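For the Graviton case called out above, the override can stay on the command line; a sketch:

```shell
# Build an arm64 environment without editing the Dockerfile.
DOCKER_BUILDKIT=1 docker build --platform linux/arm64 --output . .
```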
Set the following variables according to your environment.

@@ -28,8 +34,6 @@ All the commands below should be executed in this (`examples/pyspark/dependencie

This command builds the included `Dockerfile` and exports the resulting `pyspark_ge.tar.gz` file to your local filesystem.

-> **Note** The included [Dockerfile](./Dockerfile) builds for x86 - if you would like to build for Graviton, update the Dockerfile to use `linux/arm64` as the platform and see the [EMR Serverless architecture options](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/architecture.html) for more detail.
```shell
# Enable BuildKit backend
DOCKER_BUILDKIT=1 docker build --output . .
@@ -123,7 +127,8 @@ _This approach can also be used with EMR release label `emr-6.6.0`._

To do this, we'll create a [`pom.xml`](./pom.xml) that specifies our dependencies and use a [maven Docker container](./Dockerfile.jars) to build the uberjar. In this example, we'll package `org.postgresql:postgresql:42.4.0` and use the example script in [./pg_query.py](./pg_query.py) to query a Postgres database.

-> **Note**: The code in `pg_query.py` is for demonstration purposes only - never store credentials directly in your code. 😁
+> [!TIP]
+> The code in `pg_query.py` is for demonstration purposes only - never store credentials directly in your code. 😁
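As a preview of where the steps below land, wiring the finished uberjar into a job run might look like this sketch — the bucket and jar name are placeholders based on the files mentioned above:

```shell
# Attach the packaged Postgres driver to the job and run the query script.
aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/code/pg_query.py",
            "sparkSubmitParameters": "--jars s3://my-bucket/code/uber-jars-1.0-SNAPSHOT.jar"
        }
    }'
```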
1. Build an uberjar with your dependencies

Expand Down
