Command-line (CLI) utility for Data Processing Pipelines (DPP)

data-engineering-helpers/dppctl

Getting started with Go for Data Processing Pipeline (DPP) CLI tool

Overview

This project aims to develop and maintain a command-line (CLI) utility, written in Go, that helps deploy data engineering pipelines on the Modern Data Stack (MDS).

Even though the members of the GitHub organization may be employed by some companies, they speak on their personal behalf and do not represent these companies.

References

AWS SDK for Go

Getting started

  • Build the dppctl utility:
$ go build
  • Clone and edit the YAML deployment specification. For instance, for a deployment on AWS cloud:
$ cp depl/aws-dev-sample.yaml depl/aws-dev.yaml
$ vi depl/aws-dev.yaml
  • Check the version of the dppctl utility:
$ ./dppctl -v
[dppctl] 0.0.x-alpha.x
  • Launch the dppctl utility in checking mode (which is the default one):
$ ./dppctl -f depl/aws-dev.yaml
  • Launch the dppctl utility in deployment mode:
$ ./dppctl -f depl/aws-dev.yaml -c deploy
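
The three invocations above (-v, -f and -c, with checking as the default mode) can be mirrored with the standard flag package. The following is only a minimal sketch, not the actual dppctl source: the options type, the parseArgs function and the "check" mode name are hypothetical, chosen for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// options holds the command-line settings of this sketch (hypothetical type).
type options struct {
	specFile    string // -f: path to the YAML deployment specification
	command     string // -c: "check" (assumed name of the default mode) or "deploy"
	showVersion bool   // -v: print the version and exit
}

// parseArgs parses the arguments (without the program name) and validates them.
func parseArgs(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("dppctl", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // parsing errors are returned to the caller instead
	fs.StringVar(&opts.specFile, "f", "", "path to the YAML deployment specification")
	fs.StringVar(&opts.command, "c", "check", "command: check or deploy")
	fs.BoolVar(&opts.showVersion, "v", false, "print the version and exit")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	if opts.showVersion {
		return opts, nil // nothing else is required to print the version
	}
	if opts.command != "check" && opts.command != "deploy" {
		return opts, fmt.Errorf("unknown command %q (expected check or deploy)", opts.command)
	}
	if opts.specFile == "" {
		return opts, errors.New("missing -f <deployment specification>")
	}
	return opts, nil
}

func main() {
	// The three invocations shown above, parsed one after the other.
	for _, args := range [][]string{
		{"-v"},
		{"-f", "depl/aws-dev.yaml"},
		{"-f", "depl/aws-dev.yaml", "-c", "deploy"},
	} {
		opts, err := parseArgs(args)
		fmt.Printf("%v -> %+v (err: %v)\n", args, opts, err)
	}
}
```

Using a dedicated FlagSet rather than the global flag functions keeps the parsing testable without touching os.Args.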

Publish the module

  • Recompute the dependencies:
$ go mod tidy
  • Check that the tests pass:
$ go test
  • Tag the Git repository:
$ git commit -m "[Release] v0.0.x-alpha.x"
$ git push
$ git tag -a v0.0.x-alpha.x -m "[Release] v0.0.x-alpha.x"
$ git push --tags
  • Publish the module:
$ GOPROXY=proxy.golang.org go list -m github.com/data-engineering-helpers/data-pipeline-deployment@v0.0.x-alpha.x
github.com/data-engineering-helpers/data-pipeline-deployment v0.0.x-alpha.x

Troubleshooting

AWS Airflow (MWAA)

As of the beginning of 2023, and apparently for security reasons, it does not seem possible to call the Airflow API directly on the AWS managed service (MWAA). One has to go through the API backend of the MWAA CLI instead, which is why the Go code of the corresponding AWSAirflowCLI() function is not straightforward. Using that MWAA CLI API (through curl) is itself convoluted, as detailed below.
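
To give an idea of what going through the MWAA CLI backend involves on the Go side, here is a minimal sketch of how such a request could be built with the standard net/http package. It is not the code of AWSAirflowCLI(): the newCLIRequest helper is hypothetical, and the hostname and token are placeholders for the values returned by aws mwaa create-cli-token.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newCLIRequest (hypothetical helper) builds the POST request that carries an
// Airflow CLI command, for instance "dags list -o json", to the MWAA CLI
// endpoint. The command travels as a plain-text body and the short-lived CLI
// token as a bearer token.
func newCLIRequest(hostname, token, command string) (*http.Request, error) {
	url := "https://" + hostname + "/aws_mwaa/cli"
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(command))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "text/plain")
	return req, nil
}

func main() {
	// Placeholder values: a real hostname and token come from create-cli-token.
	req, err := newCLIRequest("example.invalid", "someToken", "dags list -o json")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
	// Sending the request would then be: resp, err := http.DefaultClient.Do(req)
}
```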

Listing the DAGs

  • Configuration:
$ export MWAA_ENV="<the-MWAA-environment-name>"
  export AWS_REGION="eu-west-1"
  export CLI_TOKEN
  export WEB_SERVER_HOSTNAME
  • Create a CLI (command-line) token:
$ aws mwaa --region $AWS_REGION create-cli-token --name $MWAA_ENV
{
    "CliToken": "someToken",
    "WebServerHostname": "<airflow-id>.$AWS_REGION.airflow.amazonaws.com"
}
  • Copy/paste the web server hostname and the CLI token and save them as environment variables:
$ CLI_TOKEN="someToken"
  WEB_SERVER_HOSTNAME="<airflow-id>.$AWS_REGION.airflow.amazonaws.com"
  • Note that the CLI token is very short-lived (it is valid for only one or two calls), so the two operations (aws mwaa create-cli-token and CLI_TOKEN="someToken") must be repeated every time before the following commands are performed.

  • Invoke an Airflow command through the API wrapping the MWAA CLI

    • Raw (not formatted) output:
$ curl -s --request POST "https://$WEB_SERVER_HOSTNAME/aws_mwaa/cli" --header "Authorization: Bearer $CLI_TOKEN" --header "Content-Type: text/plain" --data-raw "dags list -o json"|jq -r ".stdout" | base64 -d
...
[{"dag_id": "dag_name", "filepath": "prefix/script.py", "owner": "airflow", "paused": "True"}, {"dag_id": ...}, ...]
    • CSV-formatted output (list of DAGs):
$ curl -s --request POST "https://$WEB_SERVER_HOSTNAME/aws_mwaa/cli" --header "Authorization: Bearer $CLI_TOKEN" --header "Content-Type: text/plain" --data-raw "dags list -o json"|jq -r ".stdout" | base64 -d | grep "^\[{\"dag_id\"" | jq -r ".[]|[.dag_id,.filepath,.owner,.paused]|@csv" | sed -e s/\"//g
...
...
dag_name,prefix/script.py,airflow,True
...
