A PoC of a lakehouse architecture.
At $WORK the target architecture is Microsoft Synapse, Microsoft's umbrella product combining ADLS Gen2, Spark, Serverless SQL and Dedicated SQL, integrated with version control.
Azure Synapse has a few weak spots:
- Its data pipelines provide too little control over dependencies
- Pull requests result in large JSON blobs, which makes them hard to review
- The data engineering team wants to generate templated pipelines, which is essentially programming (and hence better served by a programming language)
This PoC serves as an investigation of an alternative orchestration set-up with Apache Airflow.
The following components are used:
- Airflow 2.8.0 (on Python 3.10)
- Minio to emulate ADLS Gen2. Microsoft provides Azurite for storage emulation, but it does not support hierarchical namespaces, which are essential for this PoC.
- Spark for running PySpark jobs
Ideally we'd leverage dbt as well, but this might be a bridge too far at the moment. The current version of the dbt-synapse connector (1.4.0) is somewhat maintained by Microsoft and only supports Dedicated SQL, not Serverless SQL.
The PoC implements a medallion data lake architecture with the following layers:
- landingzone: This is where the files land, partitioned by application and entity.
- bronze: The landed files are converted to Parquet but otherwise stored in raw form, partitioned by application and entity.
- TODO: This is where quality checks should take place.
- silver: This is where the delta is calculated and the data is structured according to our data domain definition.
- gold: A curated set of data ready for consumption.
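To make the layer definitions concrete, the storage layout could look roughly like the sketch below. The bucket, application and file names are hypothetical; only the layer/application/entity partitioning described above is prescribed.

```
datalake/
  landingzone/hr/employee/2024-01-01_employee.csv   # files land as-is
  bronze/hr/employee/part-0000.parquet              # same data, converted to Parquet
  silver/employee/part-0000.parquet                 # delta applied, domain model
  gold/department_report/part-0000.parquet          # curated, ready for consumption
```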
Minio is used for S3-compatible storage.
Data processing is done using Apache Spark. The jobs are defined in the `airflow/spark-jobs` directory. Each job receives the metadata entities defined in `airflow/dags/metadata/entities.py`.
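As a rough sketch of what such a job could look like, the example below converts a landed CSV file for a single entity to Parquet in the bronze layer. The bucket name, paths and entity values are assumptions for illustration only; the actual jobs take their configuration from the entity metadata passed in by the DAGs.

```python
# Hypothetical, simplified landing-to-bronze Spark job.
# Assumes the Spark session is already configured with S3A credentials for Minio.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("landing_to_bronze_sketch").getOrCreate()

# Illustrative values; the real jobs receive these from the entity metadata.
application, entity = "hr", "employee"
landing_path = f"s3a://datalake/landingzone/{application}/{entity}/"
bronze_path = f"s3a://datalake/bronze/{application}/{entity}/"

# Read the landed CSV as-is and persist it, unchanged apart from the format, as Parquet.
df = spark.read.option("header", True).csv(landing_path)
df.write.mode("overwrite").parquet(bronze_path)
```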
- `000_setup_env_v1`: Used to initialize Airflow connections and the Minio storage bucket. Only needs to be executed once to complete set-up of the PoC.
- `landing_to_silver_v1`: Moves a CSV file from the landingzone via a Parquet file in the bronze layer to a Parquet file in the silver layer. The current DAG also generates fake data and uploads it to the landing zone to emulate a file-drop event. The data generation takes the DAG's execution time into account so that changes to the data over time can be simulated.
- `gold_report_v1`: Waits for department and employee data to be updated before generating a dummy report.
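For the "wait for department and employee data" dependency, one plausible mechanism in Airflow 2.8 is data-aware scheduling with Datasets, sketched below. Whether `gold_report_v1` actually uses Datasets or another mechanism (e.g. sensors) is not specified here; the dataset URIs and the task body are purely illustrative.

```python
# Hypothetical sketch of a Dataset-triggered report DAG (Airflow 2.8).
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Illustrative dataset URIs; the landing_to_silver DAGs would declare these
# as outlets on their final tasks so that updates trigger this DAG.
department_silver = Dataset("s3://datalake/silver/department")
employee_silver = Dataset("s3://datalake/silver/employee")

@dag(
    dag_id="gold_report_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=[department_silver, employee_silver],  # run when both datasets are updated
    catchup=False,
)
def gold_report_sketch():
    @task
    def build_report():
        # Placeholder for the Spark job that produces the gold-layer report.
        pass

    build_report()

gold_report_sketch()
```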
- To start, execute `./env-start.sh` to boot up Airflow, Minio and Spark.
- Once all services have started, Airflow should be operational at http://localhost:8080. You can log in with airflow/airflow.
- Trigger the `setup_env_v1` DAG by pressing the play (⏵) button.
- Enable the other 3 DAGs (`department_landing_to_silver_v1`, `employee_landing_to_silver_v1`, `gold_report_v1`) and give them some time to catch up (start date 2024-01-01).
- To clean up services and storage, execute `./env-cleanup.sh`.
When Python requirements change, the source images need to be rebuilt. You can do so using `docker-compose -f docker-compose.[airflow|spark|minio].yaml up -d --force-recreate --build [service-name]`, e.g. `docker-compose -f docker-compose.airflow.yaml up -d --force-recreate --build airflow-scheduler`.
- From hearsay it seems Delta is not mature enough in Microsoft Synapse.
- This needs to be investigated.
- If it is mature, Delta-formatted Parquet should be used from the silver layer onwards (a rough sketch follows after this list).
- Investigate Iceberg support
- Check data quality, e.g. using Amazon Deequ (a rough sketch also follows after this list).
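As a sketch of what moving the silver layer to Delta could look like, assuming the delta-spark package is installed; the paths and key column below are hypothetical:

```python
# Hypothetical sketch: Delta instead of plain Parquet from the silver layer onwards.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("delta_silver_sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

bronze_path = "s3a://datalake/bronze/hr/employee/"  # illustrative paths
silver_path = "s3a://datalake/silver/employee"

updates_df = spark.read.parquet(bronze_path)

if not DeltaTable.isDeltaTable(spark, silver_path):
    # First load: write the silver table as Delta.
    updates_df.write.format("delta").save(silver_path)
else:
    # Subsequent loads: let Delta handle the delta calculation as an upsert.
    (DeltaTable.forPath(spark, silver_path).alias("t")
        .merge(updates_df.alias("u"), "t.employee_id = u.employee_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```

Similarly, a minimal sketch of a data quality check via Deequ's Python wrapper, pydeequ. It additionally requires the matching Deequ jar on the Spark classpath and the SPARK_VERSION environment variable set; the column names and path are again assumptions:

```python
# Hypothetical sketch of a bronze-layer quality check using pydeequ.
from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = SparkSession.builder.appName("deequ_check_sketch").getOrCreate()
df = spark.read.parquet("s3a://datalake/bronze/hr/employee/")  # illustrative path

# Hypothetical expectations on the employee entity.
check = (Check(spark, CheckLevel.Error, "bronze employee checks")
         .isComplete("employee_id")   # no nulls in the key column
         .isUnique("employee_id"))    # no duplicate keys

result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check)
          .run())

# A failed check could block promotion to the silver layer.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```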