This project produces OCI (Docker-compliant) images, which provide ready-to-use environments for Data Processing Pipelines (DPP), to be deployed on a Modern Data Stack (MDS), be it on private or public clouds (e.g., AWS, Azure, GCP).
These images are based on AWS-supported Corretto. Thanks to GitHub Actions (CI/CD), every time commits are pushed to this Git repository, the OCI images are rebuilt and published on Docker Hub.
These OCI images are aimed at deploying Data Engineering applications, typically Data Processing Pipelines (DPP), on a Modern Data Stack (MDS).
The authors of this repository also maintain general purpose cloud Python OCI images in a dedicated GitHub repository and Docker Hub space.
Thanks to Docker multi-stage builds, one can easily maintain, in the same Docker specification file, two images: one for everyday data engineering work, and the other one to deploy the corresponding applications onto production environments.
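As an illustration, with a hypothetical Dockerfile declaring two stages named `dev` and `prod` (the stage names are just for the example), each image may be built by targeting the corresponding stage:
$ # Build the development image from the (hypothetical) dev stage
$ docker build --target dev -t dpp-app:dev .
$ # Build the leaner production image from the (hypothetical) prod stage
$ docker build --target prod -t dpp-app:prod .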
The Docker images of this repository just add various utilities to make them work out of the box with cloud vendors (e.g., Azure and AWS command-line utilities) and cloud-native tools (e.g., Mountpoint for Amazon S3), on top of the native images maintained by the AWS-supported Corretto project. They also add specific Python versions.
In the OCI images, Python packages are installed with the `pip` utility.
For testing purposes, outside of the container, Python virtual environments may be installed thanks to PyEnv and pipenv, as detailed in the dedicated procedure on the Python induction notebook sub-project.
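For instance, a minimal sketch of such a local setup, assuming PyEnv and pipenv are already installed and that Python 3.9.18 is the targeted version:
$ pyenv install 3.9.18   # install the targeted Python version
$ pyenv local 3.9.18     # pin that version for the current project directory
$ pipenv install --dev   # create the virtual environment and install the dependencies
$ pipenv shell           # enter the virtual environment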
Any additional Python module may be installed either:
- With `pip` and some `requirements.txt` dependency specification file:
$ python3 -m pip install -r requirements.txt
- In a dedicated virtual environment, controlled by `pipenv` through local `Pipfile` (and potentially `Pipfile.lock`) files, which should be versioned:
$ pipenv --rm; pipenv install; pipenv install --dev
The OCI images, on the other hand, install those Python modules globally.
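To inspect which Python modules come globally installed in a given image, one may for instance run the following (the `jdk17-python3.9` tag is just an example):
$ # List the globally installed Python packages inside the container
$ docker run --rm infrahelpers/dpp:jdk17-python3.9 python3 -m pip list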
The Docker images of this repository are intended to run any Data Engineering application / Data Processing Pipeline (DPP).
- Images on Docker Cloud
- Cloud Python images:
- Amazon-maintained OCI images for Machine Learning (ML): https://github.com/aws/deep-learning-containers
- General-purpose C++ and Python Debian OCI images:
- General-purpose light Python/Debian OCI images:
- Native Python OCI images:
- AWS cloud: GitHub - Data Engineering Helpers - Knowledge Sharing - AWS
- Kubernetes: GitHub - Data Engineering Helpers - Knowledge Sharing - Kubernetes (k8s)
- Download the Docker images
- JDK17:
$ docker pull infrahelpers/dpp:jdk17-python3.9
docker pull infrahelpers/dpp:jdk17-sbt1.9.8
- JDK11:
$ docker pull infrahelpers/dpp:jdk11-python3.9
docker pull infrahelpers/dpp:jdk11-sbt1.9.8
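Once pulled, the images available locally may be checked with:
$ # List the local images for that repository, with their tags and sizes
$ docker images infrahelpers/dpp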
- Launch a Spark application:
$ docker run -it --rm infrahelpers/dpp:jdk11-python3.9
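For instance, a local PySpark script may be mounted into the container and submitted as follows. This is only a sketch: the script name (`my_pipeline.py`) is hypothetical, and it assumes `spark-submit` is available on the image PATH (on some setups, Spark is provided by the target platform, e.g., EMR or Databricks, rather than by the image itself):
$ # Mount the current directory and submit the (hypothetical) PySpark script
$ docker run -it --rm -v "$PWD:/app" infrahelpers/dpp:jdk11-python3.9 spark-submit /app/my_pipeline.py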
- Clone the Git repository:
$ mkdir -p ~/dev/infra && cd ~/dev/infra
$ git clone https://github.com/data-engineering-helpers/dpp.git
$ cd dpp
- Build the OCI images (here with Docker, but any other tool may be used):
- Set up the requested versions for the various stacks:
$ export JDK_VERSION="17" # or "11" or "8"
export PYTHON_MINOR_VERSION="3.9"
export PYTHON_MICRO_VERSION="3.9.18"
export SBT_VERSION="1.9.8"
- Amazon Linux 2023 (AL2023) base image for Elastic MapReduce (EMR) 7.x and Databricks:
$ docker build -t infrahelpers/dpp:jdk$JDK_VERSION --build-arg JDK_VERSION=$JDK_VERSION corretto-emr-dbs-universal-base
- Amazon Linux 2023 (AL2023) for Elastic MapReduce (EMR) 7.x and Databricks, with a single Python installation and more freedom on its version:
$ docker build -t infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MINOR_VERSION --build-arg JDK_VERSION=$JDK_VERSION --build-arg PYTHON_MINOR_VERSION=$PYTHON_MINOR_VERSION --build-arg PYTHON_MICRO_VERSION=$PYTHON_MICRO_VERSION corretto-emr-dbs-universal-pyspark
docker tag infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MINOR_VERSION infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MICRO_VERSION
docker tag infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MINOR_VERSION infrahelpers/dpp:jdk$JDK_VERSION-python
- Amazon Linux 2023 (AL2023) for Elastic MapReduce (EMR) 7.x and Databricks, with SBT and Scala and more freedom on their versions:
$ docker build -t infrahelpers/dpp:jdk$JDK_VERSION-sbt$SBT_VERSION --build-arg JDK_VERSION=$JDK_VERSION --build-arg SBT_VERSION=$SBT_VERSION corretto-emr-dbs-universal-spark-scala
docker tag infrahelpers/dpp:jdk$JDK_VERSION-sbt$SBT_VERSION infrahelpers/dpp:jdk$JDK_VERSION-sbt
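To quickly check the resulting Scala-enabled image, SBT may be invoked directly, for instance:
$ # Print the SBT version shipped with the freshly built image
$ docker run -it --rm infrahelpers/dpp:jdk$JDK_VERSION-sbt$SBT_VERSION sbt --version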
- In addition to what the Docker Hub builds, the CI/CD (GitHub Actions) pipeline also builds the `infrahelpers/dpp` images on two CPU architectures, namely the classical AMD64 and the newer ARM64.
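As a sketch, a similar multi-architecture build may be reproduced locally with Docker Buildx, provided the builder supports both platforms (e.g., through QEMU emulation):
$ # Build the base image for both AMD64 and ARM64 in one go
$ docker buildx build --platform linux/amd64,linux/arm64 --build-arg JDK_VERSION=$JDK_VERSION -t infrahelpers/dpp:jdk$JDK_VERSION corretto-emr-dbs-universal-base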
- (Optional) Push the newly built images to Docker Hub. That step is usually not needed, as the images are automatically built every time there is a change on GitHub:
$ docker login
docker push infrahelpers/dpp:jdk$JDK_VERSION
docker push infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MINOR_VERSION
docker push infrahelpers/dpp:jdk$JDK_VERSION-python$PYTHON_MICRO_VERSION
docker push infrahelpers/dpp:jdk$JDK_VERSION-python
docker push infrahelpers/dpp:jdk$JDK_VERSION-sbt$SBT_VERSION
docker push infrahelpers/dpp:jdk$JDK_VERSION-sbt
- Choose which image should be the latest, tag it and upload it to Docker Hub:
$ docker tag infrahelpers/dpp:jdk$JDK_VERSION infrahelpers/dpp:latest
$ docker push infrahelpers/dpp:latest
- Shut down the Docker container:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7b69efc9dc9a de/dpp "/bin/sh -c 'python …" 48 seconds ago Up 47 seconds 0.0.0.0:9000->8050/tcp vigilant_merkle
$ docker kill vigilant_merkle
vigilant_merkle
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
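As an alternative to `docker kill`, the container may be stopped gracefully (a SIGTERM is sent first), by name or by container ID, as taken from the `docker ps` output above:
$ # Gracefully stop the container instead of killing it
$ docker stop vigilant_merkle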