This project intends to develop and maintain a command-line (CLI) utility in Go to help deploy data engineering pipelines on the Modern Data Stack (MDS).
Even though the members of the GitHub organization may be employed by some companies, they speak on their personal behalf and do not represent these companies.
- Data engineering pipeline deployment on the Modern Data Stack (MDS)
- Architecture principles for data engineering pipelines on the Modern Data Stack (MDS)
- Data Processing Pipeline (DPP) utility in Go (this repository)
- Check the latest versions/tags: https://github.com/data-engineering-helpers/dppctl/tags
- Import/download the module:
$ go build
- Clone and edit the YAML deployment specification. For instance, for a deployment on AWS cloud:
$ cp depl/aws-dev-sample.yaml depl/aws-dev.yaml
$ vi depl/aws-dev.yaml
- Check the version of the `dppctl` utility:
$ ./dppctl -v
[dppctl] 0.0.x-alpha.x
- Launch the `dppctl` utility in checking mode (the default):
$ ./dppctl -f depl/aws-dev.yaml
- Launch the `dppctl` utility in deployment mode:
$ ./dppctl -f depl/aws-dev.yaml -c deploy
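For reference, a minimal sketch of how the documented flags (`-v`, `-f`, `-c`) could be wired with Go's standard `flag` package; the actual `dppctl` implementation may differ:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Flags as documented above: -v (version), -f (spec file), -c (command).
	version := flag.Bool("v", false, "print the version and exit")
	deplFile := flag.String("f", "", "path to the YAML deployment specification")
	command := flag.String("c", "check", "command to run: check (default) or deploy")
	flag.Parse()

	if *version {
		fmt.Println("[dppctl] 0.0.x-alpha.x")
		return
	}
	fmt.Printf("Running %q on %q\n", *command, *deplFile)
}
```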
- Recompute the dependencies:
$ go mod tidy
- Check that the tests pass:
$ go test
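To illustrate the kind of check `go test` runs, here is a self-contained, hypothetical test of the base64 decoding that the MWAA CLI API (described below) requires; the `decodeCliOutput` helper is defined inline for the sake of the example and is not the actual `dppctl` API:

```go
package main

import (
	"encoding/base64"
	"testing"
)

// decodeCliOutput is a hypothetical helper mirroring the base64 decoding
// that the MWAA CLI API output requires; defined here only to keep the
// test self-contained.
func decodeCliOutput(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	return string(raw), err
}

func TestDecodeCliOutput(t *testing.T) {
	encoded := base64.StdEncoding.EncodeToString([]byte("dag_name"))
	got, err := decodeCliOutput(encoded)
	if err != nil {
		t.Fatalf("decodeCliOutput: %v", err)
	}
	if got != "dag_name" {
		t.Errorf("got %q, want %q", got, "dag_name")
	}
}
```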
- Tag the Git repository:
$ git commit -m "[Release] v0.0.x-alpha.x"
$ git push
$ git tag -a v0.0.x-alpha.x -m "[Release] v0.0.x-alpha.x"
$ git push --tags
- Publish the module:
$ GOPROXY=proxy.golang.org go list -m github.com/data-engineering-helpers/[email protected]
github.com/data-engineering-helpers/data-pipeline-deployment v0.0.x-alpha.x
As of early 2023, apparently for security reasons, it does not seem possible to target/use the Airflow API directly on the AWS managed service (MWAA). One has to use the API backend of the MWAA CLI instead, which is why the Go code of the corresponding `AWSAirflowCLI()` function is not straightforward. Note that the use of the MWAA CLI API (through `curl`) is itself convoluted, as detailed below. A sketch of that flow in Go follows the reference list below.
- Stack Overflow - Is it possible to access the Airflow API in AWS MWAA?
- Apache Airflow - Airflow API reference guide
- AWS - Amazon Managed Workflows for Apache Airflow (MWAA) User Guide
- GitHub - AWS - Sample code for MWAA
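To make the convolution concrete, here is a minimal sketch in Go of what an `AWSAirflowCLI()`-style wrapper has to do, assuming the `aws-sdk-go-v2` MWAA client (`CreateCliToken`) and the `/aws_mwaa/cli` endpoint documented below; it is an illustration of the flow, not the actual `dppctl` code:

```go
package main

import (
	"bytes"
	"context"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/mwaa"
)

// cliResponse mirrors the JSON payload returned by the MWAA CLI API:
// both output streams come back base64-encoded.
type cliResponse struct {
	Stdout string `json:"stdout"`
	Stderr string `json:"stderr"`
}

// awsAirflowCLI creates a short-lived CLI token, POSTs the Airflow command
// to the /aws_mwaa/cli endpoint, and base64-decodes the resulting output.
func awsAirflowCLI(ctx context.Context, envName, airflowCmd string) (string, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return "", err
	}

	// The CLI token is valid for only one or two invocations, so it has
	// to be re-created before (almost) every call.
	token, err := mwaa.NewFromConfig(cfg).CreateCliToken(ctx,
		&mwaa.CreateCliTokenInput{Name: aws.String(envName)})
	if err != nil {
		return "", err
	}

	url := fmt.Sprintf("https://%s/aws_mwaa/cli", aws.ToString(token.WebServerHostname))
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url,
		bytes.NewBufferString(airflowCmd))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+aws.ToString(token.CliToken))
	req.Header.Set("Content-Type", "text/plain")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	var out cliResponse
	if err := json.Unmarshal(body, &out); err != nil {
		return "", err
	}
	decoded, err := base64.StdEncoding.DecodeString(out.Stdout)
	return string(decoded), err
}

func main() {
	// Hypothetical environment name; replace with the actual MWAA environment.
	stdout, err := awsAirflowCLI(context.Background(), "my-mwaa-env", "dags list -o json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(stdout)
}
```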
- Configuration:
$ export MWAA_ENV="<the-MWAA-environment-name>"
export AWS_REGION="eu-west-1"
export CLI_TOKEN
export WEB_SERVER_HOSTNAME
- Create a CLI (command-line) token:
$ aws mwaa --region $AWS_REGION create-cli-token --name $MWAA_ENV
{
"CliToken": "someToken",
"WebServerHostname": "<airflow-id>.$AWS_REGION.airflow.amazonaws.com"
}
- Copy/paste the web server hostname and the CLI token and save them as environment variables:
$ CLI_TOKEN="someToken"
WEB_SERVER_HOSTNAME="<airflow-id>.$AWS_REGION.airflow.amazonaws.com"
- Note that the CLI token is very short-lived (valid for only one or two invocations) and that the two operations (`aws mwaa create-cli-token` and `CLI_TOKEN="some-token"`) must therefore be repeated every time before the following commands are performed
- Invoke an Airflow command through the API wrapping the MWAA CLI:
- Raw (unformatted) output:
$ curl -s --request POST "https://$WEB_SERVER_HOSTNAME/aws_mwaa/cli" --header "Authorization: Bearer $CLI_TOKEN" --header "Content-Type: text/plain" --data-raw "dags list -o json" | jq -r ".stdout" | base64 -d
...
[{"dag_id": "dag_name", "filepath": "prefix/script.py", "owner": "airflow", "paused": "True"}, {"dag_id": ...}, ...]
- CSV-formatted output (list of DAGs):
$ curl -s --request POST "https://$WEB_SERVER_HOSTNAME/aws_mwaa/cli" --header "Authorization: Bearer $CLI_TOKEN" --header "Content-Type: text/plain" --data-raw "dags list -o json" | jq -r ".stdout" | base64 -d | grep "^\[{\"dag_id\"" | jq -r ".[]|[.dag_id,.filepath,.owner,.paused]|@csv" | sed -e s/\"//g
...
dag_name,prefix/script.py,airflow,True
...
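The JSON DAG list above can also be parsed programmatically; here is a minimal Go sketch, with struct fields mirroring the output shown above (note that Airflow serializes the `paused` flag as a string, not a boolean):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// dagEntry mirrors the fields of the JSON output shown above.
type dagEntry struct {
	DagID    string `json:"dag_id"`
	Filepath string `json:"filepath"`
	Owner    string `json:"owner"`
	Paused   string `json:"paused"`
}

func main() {
	// Sample payload, taken from the output shown above.
	payload := `[{"dag_id": "dag_name", "filepath": "prefix/script.py", "owner": "airflow", "paused": "True"}]`

	var dags []dagEntry
	if err := json.Unmarshal([]byte(payload), &dags); err != nil {
		log.Fatal(err)
	}
	for _, dag := range dags {
		// Same CSV-like rendering as the jq pipeline above.
		fmt.Printf("%s,%s,%s,%s\n", dag.DagID, dag.Filepath, dag.Owner, dag.Paused)
	}
}
```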