This repository contains the practical exercise of the Data Derp training. It contains the following relevant modules:
- base
  - .tf files for the creation of an AWS S3 bucket where ingested/transformed files will live
- data-ingestion
  - /src - source code
  - /tests - tests
  - .tf files for the creation of the AWS Glue job
- data-transformation
  - /src - source code
  - /tests - tests
  - .tf files for the creation of the AWS Glue job
- data-analytics
  - An empty Jupyter Notebook
- data-streaming
  - .dbc files for practice with streaming
- bootstrap
  - CloudFormation template that creates a VPC, a GitHub runner (requires a GitHub Personal Access Token), and a Terraform remote state S3 bucket

1. Mirror this repo in your account as a PRIVATE repo (you will be running your own self-hosted GitHub runners, so make sure your project is private)
2. Set up your Development Environment
3. Bootstrap the AWS dependencies: `./data-derp aws-deps -p <project-name> -m <module-name> -u <github-username>`
   - 💡 You will need valid AWS credentials. See the README.
   - The `project-name` and `module-name` must be globally unique, because an AWS S3 bucket is created and S3 bucket names are globally unique
4. Create a GitHub workflow: `./data-derp setup-workflow -p <project-name> -m <module-name>`
   - The `project-name` and `module-name` must be the same as in step 3
5. Fix the tests in `data-ingestion/` and `data-transformation/` (in that order). See Development Environment for tips and tricks on running Python and the tests in the dev-container.
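
Since the bootstrap and workflow commands above take the same `project-name` and `module-name`, and those values end up in a globally unique S3 bucket, it can help to sanity-check them before running anything. The sketch below is not part of the `data-derp` script: the function name `is_valid_name_part` and the example values are hypothetical, and it assumes each value becomes part of the bucket name. It applies only a conservative subset of the S3 bucket naming rules (lowercase letters, digits, and hyphens, with no leading or trailing hyphen) and cannot tell you whether a name is already taken.

```shell
# Hypothetical helper, not part of the data-derp CLI.
# Succeeds when the value is non-empty, uses only lowercase letters, digits
# and hyphens, and neither starts nor ends with a hyphen.
is_valid_name_part() {
  case "$1" in
    "") return 1 ;;
    *[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;
    -*|*-) return 1 ;;
    *) return 0 ;;
  esac
}

# Example values (assumptions); replace them with your own.
project_name="my-project"
module_name="ingestion-01"

for name in "$project_name" "$module_name"; do
  if is_valid_name_part "$name"; then
    echo "ok: $name"
  else
    echo "invalid: $name" >&2
  fi
done
```

If both values pass, use exactly the same pair for `aws-deps` and `setup-workflow`.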
- Import https://github.com/kelseymok/data-derp as a PRIVATE repo called `data-derp`.
- Clone the new repo locally and add the original repository as a remote named `source`:

      git clone git@github.com:<your-username>/data-derp.git
      cd ./data-derp
      git remote add source git@github.com:kelseymok/data-derp.git

- To pull in new changes:

      git fetch source
      git rebase source/master
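
The update commands above can be wrapped in a small function that stops early when the `source` remote is missing. This is a minimal sketch, assuming the remote is named `source` and the upstream branch is `master` as in the commands above; the function name `sync_from_source` is hypothetical and not part of this repository.

```shell
# Hypothetical helper: rebase the current branch onto source/master.
sync_from_source() {
  # Fail early if the "source" remote has not been added yet
  # (or if the current directory is not a git repository).
  if ! git remote get-url source >/dev/null 2>&1; then
    echo "remote 'source' is not configured; add it with 'git remote add source <url>'" >&2
    return 1
  fi
  # Fetch the latest upstream commits, then replay local commits on top.
  git fetch source && git rebase source/master
}
```

Run `sync_from_source` from inside your clone. Because a rebase rewrites local commits, a branch you have already pushed may need a force push afterwards.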