This project loads raw song and user activity data, processes it into a star schema, and saves the result for later analysis, using the AWS Elastic MapReduce (EMR) service. It also serves as the Data Lake project of the Data Engineer Nanodegree Program.
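At a high level, `etl.py` reads the raw JSON data from S3 with Spark, reshapes it into star-schema dimension and fact tables, and writes them back to S3 as parquet. The sketch below illustrates the idea for the song and artist dimensions only; the raw-data path, bucket names, and column names are assumptions for illustration, not the exact implementation.

```python
# Minimal sketch of the star-schema transformation (illustrative only).
# The raw-data path, output bucket/prefixes, and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-etl-sketch").getOrCreate()

# Read raw song metadata (JSON) from S3.
song_df = spark.read.json("s3a://sample-raw-data-bucket/song_data/*/*/*/*.json")

# songs dimension: one row per song, partitioned by year and artist.
songs = song_df.select("song_id", "title", "artist_id", "year", "duration") \
               .dropDuplicates(["song_id"])
songs.write.mode("overwrite").partitionBy("year", "artist_id") \
     .parquet("s3a://sample-data-lake-bucket/data-lake/song")

# artists dimension: one row per artist.
artists = song_df.select("artist_id", "artist_name", "artist_location",
                         "artist_latitude", "artist_longitude") \
                 .dropDuplicates(["artist_id"])
artists.write.mode("overwrite").parquet("s3a://sample-data-lake-bucket/data-lake/artist")

spark.stop()
```

The prerequisites and the steps to set up and run the job on EMR are listed below.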
- Python3
- Python virtual environment (aka `venv`)
- AWS credentials/config files under the `~/.aws` directory
- Bootstrap virtual environment with dependencies.

  ```
  $ python3 -m venv ./venv
  $ source ./venv/bin/activate
  $ pip install -r requirements.txt
  ```

- Copy the config templates `template.dl.cfg` to `dl.cfg` and `aws_stuff/template.common.sh` to `aws_stuff/common.sh`.

  ```
  $ cp ./template.dl.cfg ./dl.cfg
  $ cp ./aws_stuff/template.common.sh ./aws_stuff/common.sh
  ```

- Fill in the `ETL_PROCESSED_DATA_SET` section of `dl.cfg`. It specifies the target S3 bucket and key prefixes where the processed data set is stored (see the configuration-reading sketch after this list). Here are possible values.

  ```
  [ETL_PROCESSED_DATA_SET]
  BUCKET_NAME=sample-data-lake-bucket
  USER_DATA_PREFIX=data-lake/user
  ARTIST_DATA_PREFIX=data-lake/artist
  TIME_DATA_PREFIX=data-lake/time
  SONG_DATA_PREFIX=data-lake/song
  SONGPLAY_DATA_PREFIX=data-lake/songplay
  ```

- Fill in `cluster_name`, `key_pair_file_name`, `subnet_id`, `log_uri`, and `pem_file_path` in `aws_stuff/common.sh`. Here are possible values.

  ```
  # can be any name of your choice
  cluster_name="tony-emr-cluster"

  # S3 location of your choice to store EMR logs
  log_uri="s3://sample-emr-cluster-log/"

  # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html
  key_pair_file_name="sample-ec2-emr-key-pair"
  pem_file_path="${HOME}/.aws/sample-ec2-emr-key-pair.pem"

  # default subnet ID
  # https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html#create-default-vpc
  subnet_id="sample-subnet-id"
  ```

- Spin up EMR cluster.

  ```
  $ cd ./aws_stuff
  $ ./create_emr_cluster.sh
  ```

- Look for the cluster ID in the output of the previous step, then put it into `aws_stuff/common.sh`. Here is a possible value.

  ```
  cluster_id="sample-cluster-id"
  ```

- Retrieve the public DNS name of the master node from the EMR console in AWS, then put it into `aws_stuff/common.sh`. Here is a possible value.

  ```
  master_public_dns="sample-master-node.compute.amazonaws.com"
  ```

- Upload `etl.py` and `dl.cfg` to the master node.

  ```
  $ cd ./aws_stuff
  $ ./upload_etl_stuff.sh
  ```

- SSH to the master node and submit the `etl.py` script via the `spark-submit` command (a small sketch for sanity-checking the output appears after this list).

  ```
  $ spark-submit --master yarn etl.py
  ```

- Terminate the EMR cluster after use.

  ```
  $ cd ./aws_stuff
  $ ./terminate_emr_cluster.sh
  ```

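As referenced in the `dl.cfg` step above, here is a minimal sketch of how the `[ETL_PROCESSED_DATA_SET]` values could be consumed to build the S3 output locations. The helper name, the `s3a://` scheme, and the assumption that `dl.cfg` sits next to the script are illustrative, not necessarily what `etl.py` does.

```python
# Sketch only: read dl.cfg and build S3 output paths for the processed tables.
import configparser

config = configparser.ConfigParser()
config.read("dl.cfg")  # uploaded alongside etl.py by upload_etl_stuff.sh

section = config["ETL_PROCESSED_DATA_SET"]
bucket = section["BUCKET_NAME"]

def output_path(prefix_key):
    """Build an s3a:// URI for one of the processed tables (illustrative helper)."""
    return "s3a://{}/{}".format(bucket, section[prefix_key])

# e.g. s3a://sample-data-lake-bucket/data-lake/songplay
songplay_path = output_path("SONGPLAY_DATA_PREFIX")
```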
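Before terminating the cluster, one quick sanity check is to read one of the processed tables back from S3, for example from a `pyspark` shell on the master node. The path below reuses the sample values from `dl.cfg` and is illustrative only.

```python
# Sketch: read back the songplay fact table and inspect it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-output-sketch").getOrCreate()
songplays = spark.read.parquet("s3a://sample-data-lake-bucket/data-lake/songplay")
songplays.printSchema()
print("songplay rows:", songplays.count())
spark.stop()
```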