Apache Spark with AWS Glue metastore and Python Docker image
- Builder Image: Uses sdaberdaku/spark-with-glue-builder:v3.5.1 as the builder stage, which provides the Spark and Hadoop binaries.
- Python Version: The runtime image is based on python:3.10.14-slim-bookworm.
- User Setup: Creates a system user 'spark' with a specified UID and GID for running Spark processes.
- Java and Other Packages: Installs OpenJDK 17, tini, procps, and gettext-base.
- Environment Variables: Sets the environment variables required by Java, Spark, and Hadoop.
- Spark and Hadoop Installation: Copies the Spark and Hadoop binaries from the builder image into their target directories.
- Permissions: Assigns ownership of the Spark and Hadoop directories to the 'spark' user.
- Entrypoint Setup: Copies and configures Spark's entrypoint and decommission scripts.
- Python Dependencies: Installs PySpark and the other dependencies listed in requirements.txt (a quick way to verify the resulting image is sketched below).
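As a quick sanity check, the components above can be inspected in a built image. The following is a minimal sketch, assuming the image tag used in the build instructions below and that the entrypoint script passes arbitrary commands through (as the standard Spark entrypoint does):

# Smoke test: print the Java, Python, and Spark versions baked into the
# image, and confirm that the 'spark' system user exists.
docker run --rm sdaberdaku/spark-glue-python:v3.5.1-python3.10.14 \
  bash -c 'java -version && python3 --version && /opt/spark/bin/spark-submit --version && id spark'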
The Docker image inherits a number of JAR files from its builder image. The following JARs are included under /opt/spark/jars (a sketch showing how they are typically wired together at runtime follows the list):
- AWS Glue Data Catalog Spark client: aws-glue-datacatalog-spark-client-3.5.1.jar
- AWS Java SDK bundle: aws-java-sdk-bundle-1.12.262.jar
- Hadoop AWS: hadoop-aws-3.3.4.jar
- WildFly OpenSSL: wildfly-openssl-1.0.7.Final.jar
- PostgreSQL JDBC driver: postgresql-42.6.0.jar
- Checker Qual: checker-qual-3.31.0.jar
- Delta Spark: delta-spark_2.12-3.2.0.jar
- ANTLR 4 runtime: antlr4-runtime-4.9.3.jar
- Delta storage: delta-storage-3.2.0.jar
- Delta storage S3/DynamoDB: delta-storage-s3-dynamodb-3.2.0.jar
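These JARs provide the AWS Glue Data Catalog, S3A, and Delta Lake integrations. The following spark-submit sketch (run inside a container started from this image) shows the standard configuration keys these libraries expect; my_job.py is a hypothetical application script, and the exact settings should be checked against your Spark and Delta versions:

# Wire the bundled JARs together at runtime: Glue as the Hive metastore,
# Delta Lake SQL support, and the S3A filesystem for s3a:// paths.
/opt/spark/bin/spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.catalog.DeltaCatalog \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  my_job.py  # hypothetical application script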
Hadoop native libraries are downloaded and installed in the /opt/hadoop directory.
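Whether the native libraries are actually picked up can be checked with Hadoop's built-in checknative tool; this sketch assumes the hadoop launcher script is available at /opt/hadoop/bin/hadoop inside the image:

# List which native Hadoop libraries (zlib, snappy, etc.) can be loaded.
docker run --rm sdaberdaku/spark-glue-python:v3.5.1-python3.10.14 \
  /opt/hadoop/bin/hadoop checknative -a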
Follow these instructions to build and push the Docker image (a usage sketch follows the steps):
- Clone this repository:
git clone https://github.com/sebastiandaberdaku/spark-glue-python.git
cd spark-glue-python
- Build the image:
docker build -t sdaberdaku/spark-glue-python:v3.5.1-python3.10.14 . --network host
- Push it to the registry:
docker push sdaberdaku/spark-glue-python:v3.5.1-python3.10.14
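Once built, the image can be run directly. The following usage sketch starts an interactive PySpark shell; the AWS region, the credentials mount, and the spark user's home directory (/home/spark) are assumptions to adapt to your environment:

# Start an interactive PySpark shell backed by the AWS Glue metastore.
# AWS_REGION and the credentials mount point are assumed values.
docker run --rm -it \
  -e AWS_REGION=eu-west-1 \
  -v "$HOME/.aws:/home/spark/.aws:ro" \
  sdaberdaku/spark-glue-python:v3.5.1-python3.10.14 \
  /opt/spark/bin/pyspark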