
Registry DAG Template

Guidelines for building, organizing, and publishing example DAGs that the Airflow community can discover and draw inspiration from on the Astronomer Registry.

Maintained with ❤️ by Astronomer.


To give the Airflow community the best experience, example DAG repositories should be public on GitHub and follow the structural guidelines and practices presented below.


Before you begin

An easy way to get started developing and running Apache Airflow pipelines locally is with the Astro CLI. Learn how to install it here. The CLI is also a great way to take advantage of this repository template.


Repository structure

This repository is organized like a project initialized with the Astro CLI. The template provides all the files needed to run Apache Airflow locally within Docker containers using the Astro CLI, along with an example DAG.

.
├── .astro
│   └── config.yaml # Configuration file for local environment settings
├── .dockerignore
├── .gitignore
├── Dockerfile
├── README.md
├── dags # DAG files go here
│   └── example_dag.py
├── packages.txt # For OS-level packages
├── requirements.txt # Place Provider packages as well as other Python modules used in your Airflow DAG here
└── .astro-registry.yaml # Contains the path/to/dag_file of DAGs to publish to the Astronomer Registry as well as the categories they should be assigned to.

Here are additional details about these files, plus other directories and files that you can optionally include in your own repository:

.astro/config.yaml

This file establishes settings for your local environment. The project.name setting is already set to "registry_dag_template"; ideally, update it to a name pertinent to your project.
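
For reference, a minimal sketch of what this file looks like with the template default:

project:
  name: registry_dag_template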

Dockerfile

The Dockerfile includes a reference to an Astronomer Certified Docker Image. Astronomer Certified (AC) is a Debian-based, production-ready distribution of Apache Airflow that mirrors the open source project and undergoes additional levels of rigorous testing conducted by the Astronomer team.

This Docker image is hosted on Astronomer's Docker Registry and allows you to run Airflow on Astronomer. Additionally, the image you include in your Dockerfile dictates the version of Airflow you'd like to run locally.

Note: Airflow version 1.10.x reached end-of-life on June 17, 2021. It is strongly recommended to use a version of Airflow that is 2.0 or greater.
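
For illustration, the Dockerfile can be as short as a single FROM line. The tag below is only an example; per the note above, substitute the Airflow 2.x image version you want to run:

FROM quay.io/astronomer/ap-airflow:2.3.0-onbuild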

README.md

The README for your example DAG repository has an understated importance -- it provides necessary context about your DAG for the Airflow community. Consider including a description of the use case, why the DAG was created and how it can be used, the Airflow Providers it uses, the Connections assumed to be in place to run the DAG, etc. The more details, the better.

requirements.txt

Any additional Python modules that need to be installed in your local Airflow environment should be listed here, including the Provider packages used in the DAG. To make it transparent which module versions were used during development and testing of the DAG, pin the versions. This lets others easily run, reproduce, and build upon the DAG without having to guess which versions were installed.

For example:

apache-airflow-providers-amazon==2.0.0
apache-airflow-providers-salesforce==3.0.0
apache-airflow-providers-snowflake==2.0.0
jsonschema==3.2.0

Browse the Astronomer Registry to find all of the available Providers that can be used in any Airflow DAG.

.astro-registry.yaml

This file controls which DAGs are published on the Astronomer Registry and which category or categories those DAGs are assigned on the Registry. It should be kept up to date with any new DAGs that should be published. (More info in the Publishing your DAG repository for the Astronomer Registry section.)

The format is straightforward:

# These categories will be applied to all DAGs in the repo.
categories:
  - Airflow Fundamentals
# List of DAGs that should be published to the Astronomer Registry.
dags:
  - path: dags/example_dag.py  # Must be the path/to/the/dag_file.py

More than one category can be assigned to the DAGs. The following DAG categories are currently supported on the Astronomer Registry, so pick the ones that best describe your repository's DAGs:

  • AI + Machine Learning
  • Airflow Fundamentals
  • Alerts/Notifications
  • Big Data & Analytics
  • CI/CD
  • Communication & Messaging
  • Compute
  • Containers
  • Data Management & Governance
  • Data Processing
  • Data Quality
  • Data Science
  • Data Storage
  • Databases
  • DevOps
  • ETL/ELT
  • Infrastructure (IaaS)
  • Logging & Monitoring
  • Machine Learning
  • Model Registry
  • Orchestration
  • Public Cloud
  • Query Engines
  • Web Services
  • Work Management

include/ [Optional]

This directory can be added as a main directory of the repository (i.e. at the same level as dags/) to house any other files: Python functions, small or reference datasets, custom Airflow Operators, SQL files, etc. The Astro CLI reads from this directory automatically, so changes to its files are picked up without restarting your local environment.

As part of a DAG-writing best practice, it is a good idea to separate accompanying logic and data from the logic which creates the DAG. Think of the DAG file as a configuration file; it should be clean and ideally only contain logic for the DAG object, tasks, and task dependencies. Using the include/ directory can help improve organization of the repository greatly by providing a place to put the other "stuff".

For example:

...
├── include
│   ├── operators
│   │   ├── __init__.py
│   │   └── custom_operator.py
│   ├── sql
│   │   ├── extract
│   │   │   └── extract_data.sql
│   │   ├── load
│   │   │   └── insert_into_table.sql
│   │   └── transform
│   │       └── transform_other_data.sql
│   └── data
│       └── reference_data.csv
...
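
To show how this layout keeps the DAG file configuration-like, here is a minimal sketch of a DAG that imports the custom Operator from include/ (the CustomOperator name and its sql parameter are hypothetical, based on the tree above):

# dags/example_dag.py -- minimal sketch; CustomOperator is hypothetical
from datetime import datetime

from airflow import DAG
from include.operators.custom_operator import CustomOperator

with DAG(
    dag_id="example_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The DAG file only wires tasks together; the logic lives in include/
    extract = CustomOperator(
        task_id="extract_data",
        sql="include/sql/extract/extract_data.sql",  # hypothetical parameter
    )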

Before creating custom Operators or using the PythonOperator to execute logic, check out the Astronomer Registry to find all of the available Modules that can be used out-of-the-box with Airflow Providers.

plugins/ [Optional]

This directory can be added as a main directory of the repository (i.e. at the same level as dags/) to house logic for Airflow UI plugins such as menu links. The Astro CLI reads from this directory automatically, so changes to its files are picked up without restarting your local environment.
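
As a minimal sketch using Airflow's standard plugin interface, a menu link pointing at the Astronomer Registry could look like this (file and class names are illustrative):

# plugins/registry_link_plugin.py -- sketch of a menu-link UI plugin
from airflow.plugins_manager import AirflowPlugin

registry_menu_item = {
    "name": "Astronomer Registry",             # label shown in the Airflow UI
    "href": "https://registry.astronomer.io",  # where the link points
    "category": "Links",                       # menu the link appears under
}

class RegistryLinkPlugin(AirflowPlugin):
    name = "registry_link_plugin"
    appbuilder_menu_items = [registry_menu_item]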

airflow_settings.yaml [Optional]

When using the Astro CLI, this file can be used to programmatically create Connections, Variables, and Pools for your local Airflow environment. For more information about this file, check out this documentation.

Note: Since this file can contain sensitive information like credentials, it is strongly recommended that it be used for local development only and not published to GitHub. The .gitignore in this repository template already lists airflow_settings.yaml, so please do not remove that entry.
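
As a sketch of the format for local development (every value below is a placeholder; do not commit real credentials):

airflow:
  connections:
    - conn_id: my_postgres_conn      # placeholder Connection
      conn_type: postgres
      conn_host: localhost
      conn_login: user
      conn_password: password
      conn_port: 5432
  pools:
    - pool_name: my_pool             # placeholder Pool
      pool_slot: 5
      pool_description: An example pool
  variables:
    - variable_name: my_variable     # placeholder Variable
      variable_value: some_value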


Using this template

Follow the steps below to use the template and initialize a new repository locally.

  1. Above the file list, click Use this template


  2. Type a name for the repository, select your user as the owner, and optionally add a description.


  3. Select Public visibility for the repository.


  4. Click Create repository from template.

  5. Clone the repository locally. Refer to the GitHub documentation for different options of cloning repositories.

  6. [Optional] Navigate to where the repository was cloned and run the following Astro CLI command to update the project.name setting in the provided .astro/config.yaml file. This updates the name used to generate Docker containers, making them easier to distinguish if multiple projects have been initialized locally from this template.

    astro config set project.name <name-of-repository>
  7. To begin developing and testing the DAG, run astro dev start to spin up a local Airflow environment. There is no need to run astro dev init as this functionality is already built into the template repository.


Key requirements

  • The DAG must have a top-level docstring. The docstring should consist of the title of the DAG on its own first, standalone line (this is used as the title displayed on the Astronomer Registry), followed by a paragraph describing the DAG itself, which is used as the DAG description on the Registry. (See the sketch after this list.)

  • The repo must have at least one semantically versioned tag. Repository tags are how the Astronomer Registry detects new DAG updates and propagates them to the Registry. (More on this in the next section.)
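
Here is a minimal sketch of a top-level docstring that satisfies the first requirement; the standalone first line becomes the title on the Registry and the paragraph becomes the description (both are placeholder text):

"""
Example Extract-and-Load Pipeline

This DAG extracts data from an example source and loads it into an example
destination.
"""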

Publishing your DAG repository for the Astronomer Registry

If you have never submitted your DAG repository for publication to the Astronomer Registry, create a new release/tag for your repository on the main branch. The backend of the Astronomer Registry checks for new tags on a DAG repository to trigger updating the DAG on the Registry.

NOTE: Tags for the repository must follow typical semantic versioning.

Now that you've created a release/tag, head over to the Astronomer Registry and fill out the form with your shiny new DAG repo details!

If your repo has already been published to the Astronomer Registry, create a new release/tag and the Astronomer Registry will do the rest (picking up updates to DAG files, new DAGs, etc.). Just make sure the .astro-registry.yaml file is updated with any new or updated DAGs.