Container images focusing on Data Processing Pipelines (DPP)


Overview

This project produces OCI (Docker-compliant) images, which provide ready-to-use environments for Data Processing Pipelines (DPP), to be deployed on Modern Data Stacks (MDS), whether on private or public clouds (e.g., AWS, Azure, GCP).

These images are based on AWS-supported Corretto. Thanks to GitHub Actions (CI/CD), every time commits are pushed to this Git repository, the OCI images are rebuilt and published on Docker Hub.

These OCI images are aimed at deploying Data Engineering applications, typically Data Processing Pipelines (DPP), on Modern Data Stacks (MDS).

The authors of this repository also maintain general-purpose cloud Python OCI images in a dedicated GitHub repository and Docker Hub space.

Thanks to Docker multi-stage builds, one can easily maintain two images within the same Dockerfile: one for everyday data engineering work, and the other to deploy the corresponding applications onto production environments.
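
For instance, with Docker, the --target option selects which stage of a multi-stage Dockerfile to build; the dev and prod stage names below are purely illustrative (they are not taken from this repository):
$ docker build --target dev -t infrahelpers/dpp:dev .
$ docker build --target prod -t infrahelpers/dpp:prod .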

The Docker images of this repository simply add various utilities to make them work out of the box with cloud vendors (e.g., Azure and AWS command-line utilities) and cloud-native tools (e.g., Mountpoint for Amazon S3), on top of the base images maintained by the AWS-supported Corretto project. They also add specific Python versions.
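
For instance, one may check that the cloud command-line utilities are indeed available within a given image (the image tag is just an example, taken from the Simple use section below):
$ docker run --rm infrahelpers/dpp:jdk17-python3.9 aws --version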

Within the OCI images, Python packages are installed with the pip utility. For testing purposes, outside of the containers, Python virtual environments may be set up thanks to PyEnv and pipenv, as detailed in the dedicated procedure of the Python induction notebook sub-project.

Any additional Python module may be installed either:

  • With pip and some requirements.txt dependency specification file:
$ python3 -m pip install -r requirements.txt
  • In a dedicated virtual environment, controlled by pipenv through local Pipfile (and potentially Pipfile.lock) files, which should be versioned:
$ pipenv --rm; pipenv install; pipenv install --dev

The OCI images, on the other hand, install those modules globally (system-wide), with pip.
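
To inspect which Python modules come pre-installed globally within a given image (a minimal check; the image tag is just an example):
$ docker run --rm infrahelpers/dpp:jdk17-python3.9 python3 -m pip list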

The Docker images of this repository are intended to run any Data Engineering application / Data Processing Pipeline (DPP).


Simple use

  • Download the Docker images
    • JDK17:
$ docker pull infrahelpers/dpp:jdk17-python3.9
  docker pull infrahelpers/dpp:jdk17-sbt1.9.8
  • JDK11:
$ docker pull infrahelpers/dpp:jdk11-python3.9
  docker pull infrahelpers/dpp:jdk11-sbt1.9.8
  • Launch a Spark application:
$ docker run -it --rm infrahelpers/dpp:jdk11-python3.9
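
As a quick smoke test, one may check the Python and Java versions shipped within a given image (assuming, as is usually the case, that the entrypoint lets the command pass through):
$ docker run --rm infrahelpers/dpp:jdk11-python3.9 python3 -V
  docker run --rm infrahelpers/dpp:jdk11-python3.9 java -version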

Build your own container image

  • Clone this Git repository:
$ mkdir -p ~/dev/infra && cd ~/dev/infra
$ git clone https://github.com/data-engineering-helpers/dpp-images.git
$ cd dpp-images
  • Build the OCI images (here with Docker, but any other tool may be used):
    • Setup the requested versions for the various stacks:
$ export JDK_VERSION="17" # or "11" or "8"
  export PYTHON_MINOR_VERSION="3.9"
  export PYTHON_MICRO_VERSION="3.9.18"
  export SBT_VERSION="1.9.8"
  • Amazon Linux 2023 (AL2023) for Elastic MapReduce (EMR) 7.x and Databricks base image:
$ docker build -t infrahelpers/dpp:jdk$JDK_VERSION --build-arg JDK_VERSION=$JDK_VERSION corretto-emr-dbs-universal-base
  • Amazon Linux 2023 (AL2023) for Elastic MapReduce (EMR) 7.x and Databricks, with a single Python installation and more freedom on its version:
$ docker build -t infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MINOR_VERSION --build-arg JDK_VERSION=$JDK_VERSION --build-arg PYTHON_MINOR_VERSION=$PYTHON_MINOR_VERSION --build-arg PYTHON_MICRO_VERSION=$PYTHON_MICRO_VERSION corretto-emr-dbs-universal-pyspark
  docker tag infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MINOR_VERSION infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MICRO_VERSION
  docker tag infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MINOR_VERSION infrahelpers/dpp:jdk$JDK_VERSION-python
  • Amazon Linux 2023 (AL2023) for Elastic MapReduce (EMR) 7.x and Databricks, with SBT and Scala, and more freedom on their versions:
$ docker build -t infrahelpers/dpp:jdk$JDK_VERSION-sbt$SBT_VERSION --build-arg JDK_VERSION=$JDK_VERSION --build-arg SBT_VERSION=$SBT_VERSION corretto-emr-dbs-universal-spark-scala
  docker tag infrahelpers/dpp:jdk$JDK_VERSION-sbt$SBT_VERSION infrahelpers/dpp:jdk$JDK_VERSION-sbt
  • Publish the resulting images to Docker Hub:
$ docker login
  docker push infrahelpers/dpp:jdk$JDK_VERSION
  docker push infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MINOR_VERSION
  docker push infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MICRO_VERSION
  docker push infrahelpers/dpp:jdk$JDK_VERSION-python
  docker push infrahelpers/dpp:jdk$JDK_VERSION-sbt$SBT_VERSION
  docker push infrahelpers/dpp:jdk$JDK_VERSION-sbt
  • Choose which image should be the latest, tag it and upload it to Docker Hub:
$ docker tag infrahelpers/dpp:jdk$JDK_VERSION infrahelpers/dpp:latest
$ docker push infrahelpers/dpp:latest
  • Shut down a running Docker container:
$ docker ps
CONTAINER ID IMAGE                    COMMAND                   CREATED        STATUS        PORTS                  NAMES
7b69efc9dc9a de/dpp                   "/bin/sh -c 'python …"    48 seconds ago Up 47 seconds 0.0.0.0:9000->8050/tcp vigilant_merkle
$ docker kill vigilant_merkle
vigilant_merkle
$ docker ps
CONTAINER ID IMAGE                    COMMAND                   CREATED        STATUS        PORTS                  NAMES
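
Note that docker kill terminates the container immediately (SIGKILL); for a graceful shutdown, docker stop may be preferred, as it first sends SIGTERM and leaves the process some time to exit cleanly:
$ docker stop vigilant_merkle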
