Data Derp

This repository contains the practical exercise for the Data Derp training. It is organized into the following modules:

base
- .tf files for the creation of an AWS S3 bucket where ingested/transformed files will live
data-ingestion
- /src - source code
- /tests - tests
- .tf files for the creation of the AWS Glue job
data-transformation
- /src - source code
- /tests - tests
- .tf files for the creation of the AWS Glue job
data-analytics
- An empty Jupyter Notebook
data-streaming
- .dbc files for practice with streaming
bootstrap
- CloudFormation template that creates a VPC, a GitHub runner (requires a GitHub Personal Access Token), and a Terraform remote state S3 bucket

Quickstart

1. Mirror this repo in your account as a PRIVATE repo. Because you will be running your own self-hosted GitHub runners, make sure the project is private.
2. Set up your Development Environment.
3. Bootstrap the AWS dependencies: ./data-derp aws-deps -p <project-name> -m <module-name> -u <github-username>
- 💡 You will need valid AWS credentials. See the README.
- The project-name and module-name must be globally unique, because an AWS S3 bucket is created from them and S3 bucket names are globally unique.
4. Create a GitHub workflow: ./data-derp setup-workflow -p <project-name> -m <module-name>
- The project-name and module-name must be the same as in step 3.
5. Fix the tests in data-ingestion/ and data-transformation/ (in that order). See Development Environment for tips and tricks on running Python and the tests in the dev-container.
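
Since the bootstrap command creates an S3 bucket, it can help to sanity-check the chosen names before running it. The sketch below is only an illustration. It assumes the bucket name is formed by joining project-name and module-name with a hyphen (the actual scheme used by ./data-derp is not stated here), and it checks only a subset of the S3 naming rules: 3 to 63 characters, lowercase letters, digits, hyphens and dots, starting and ending with a letter or digit, no adjacent dots. It cannot tell whether a name is already taken by someone else. Both function names are hypothetical.

```python
import re

# Subset of the S3 bucket naming rules: 3-63 characters, lowercase letters,
# digits, hyphens and dots, starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def candidate_bucket_name(project_name: str, module_name: str) -> str:
    """Assumed (hypothetical) naming scheme: '<project-name>-<module-name>'."""
    return f"{project_name}-{module_name}"


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the basic syntax checks described above."""
    if ".." in name:  # adjacent dots are not allowed in bucket names
        return False
    return _BUCKET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    name = candidate_bucket_name("my-project", "my-module")
    print(name, looks_like_valid_bucket_name(name))
```

A name that passes this check can still be rejected by AWS if another account already owns it, so a distinctive project-name is still the safer choice.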

Start importing a repository in your GitHub account.
Import https://github.com/kelseymok/data-derp as a PRIVATE repo called data-derp.
Clone the new repo locally and add the original repository as a remote named source:

git clone git@github.com:<your-username>/data-derp.git
cd ./data-derp
git remote add source git@github.com:kelseymok/data-derp.git

git fetch source
git rebase source/master
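
The same clone-and-sync sequence can be generated for any username. The helper below is a minimal sketch that only builds the commands and does not run them. It assumes GitHub SSH remotes of the form git@github.com:<user>/data-derp.git and uses `git -C data-derp` in place of `cd ./data-derp`. The name `mirror_commands` is hypothetical and not part of this repository.

```python
import shlex
from typing import List


def mirror_commands(username: str, upstream: str = "kelseymok") -> List[List[str]]:
    """Build the git commands that clone your copy and sync it with the original."""
    repo = f"git@github.com:{username}/data-derp.git"      # your private copy
    source = f"git@github.com:{upstream}/data-derp.git"    # the original repository
    return [
        ["git", "clone", repo],
        ["git", "-C", "data-derp", "remote", "add", "source", source],
        ["git", "-C", "data-derp", "fetch", "source"],
        ["git", "-C", "data-derp", "rebase", "source/master"],
    ]


if __name__ == "__main__":
    # Print the commands for review; pass each list to subprocess.run to execute.
    for command in mirror_commands("your-username"):
        print(shlex.join(command))
```

Re-running the fetch and rebase commands later pulls in any new changes from the original repository.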
