Author: Jarome Leslie
Date: 2021-10-11
With the goal of expanding on Sparkify's existing data warehouse platform, this project creates a data lake using Apache Spark. First, the raw directories of JSON logs and song files are loaded from the AWS S3 bucket `s3://udacity-dend/`. Next, the data is transformed into five tables, which are then written as partitioned parquet files to their respective table directories in the S3 bucket `s3://udacity-workspace/`.
With this tool, the Sparkify team can query their data and answer questions about their expanding user base more efficiently.
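As a minimal sketch of the pattern `etl.py` is assumed to follow, the script builds a Spark session configured for S3 access before doing any reads or writes; the `hadoop-aws` package version below is illustrative:

```python
from pyspark.sql import SparkSession

# Build a Spark session able to read from and write to S3 via s3a:// paths.
# The hadoop-aws version is an assumption; match it to the Hadoop version
# bundled with your Spark or EMR release.
spark = SparkSession.builder \
    .appName("sparkify-data-lake") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
    .getOrCreate()
```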
The diagram below illustrates how the Sparkify database is modeled using the Star Schema and is centred around the songplays Fact table. Supporting its definition are four Dimension tables: songs, users, artists, and time. As highlighted in the diagram, the tables have been partitioned in the following way (see the write sketch after this list):
- the songplays table is partitioned by year and month, based on the start_time field;
- the time table is partitioned by year and month; and
- the songs table is partitioned by year and artist.
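A sketch of how these partitioning choices map onto the parquet writes; the DataFrame names and the exact partition column names (`year`, `month`, `artist_id`) are assumptions about `etl.py`, with the songplays `year`/`month` columns taken to be derived from `start_time`:

```python
# songplays fact table: partitioned by year and month (derived from start_time)
songplays_table.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3a://udacity-workspace/songplays/")

# time dimension table: partitioned by year and month
time_table.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("s3a://udacity-workspace/time/")

# songs dimension table: partitioned by year and artist
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://udacity-workspace/songs/")
```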
The source files provided include logs and song data. The log data consists of event files in JSON format, one for each day of November 2018. The song data is a subset of the Million Song Dataset, provided as a collection of JSON files.
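A sketch of reading the raw files with the session above; the glob depth of the paths is an assumption about the directory layout under `s3://udacity-dend/`:

```python
# Daily event logs (one JSON file per day of November 2018)
log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# Million Song Dataset subset (one JSON file per song)
song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
```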
- To run the data pipeline, execute the `etl.py` script: `python etl.py`
Leveraging the computing resources of AWS EMR can improve execution time from the X hours required locally to the Y minutes achieved with the configuration prescribed below. Prior to executing these steps, create an IAM user with the AmazonS3FullAccess policy attached.
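If you manage IAM from the command line, the user and policy attachment can be set up roughly as follows; the user name `sparkify-etl` is illustrative:

```sh
# Create the IAM user and attach the managed AmazonS3FullAccess policy
aws iam create-user --user-name sparkify-etl
aws iam attach-user-policy \
    --user-name sparkify-etl \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Generate an access key pair for programmatic access to S3
aws iam create-access-key --user-name sparkify-etl
```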
- Create an EMR cluster as follows:

      aws emr create-cluster --name udacity-spark-proj --use-default-roles \
          --release-label emr-5.28.0 --instance-count 3 --applications Name=Spark \
          --ec2-attributes KeyName=AWS_EC2_Personal,SubnetIds=subnet-0f756913b00e7b3ac \
          --instance-type m5.xlarge --profile personal
- Enable port forwarding as follows:

      ssh -i <path to local pem file> -N -D 8157 hadoop@<Public IPv4 DNS>
- Copy the AWS credentials file to the EMR cluster as follows:

      scp -i <path to local pem file> <path to AWS credentials file> hadoop@<Public IPv4 DNS>:/home/hadoop
- Connect to the EMR cluster using the command:

      ssh -i <path to local identity file> hadoop@<Public IPv4 DNS>
- Execute the Spark job using the command below:

      /usr/bin/spark-submit --master yarn ./etl.py
- Terminate the EMR cluster after the job is completed:

      aws emr terminate-clusters --cluster-ids <value>
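If the cluster id was not noted at creation time, it can be looked up before terminating; a sketch:

```sh
# List active clusters and note the Id (j-XXXXXXXXXXXXX) to pass to terminate-clusters
aws emr list-clusters --active --profile personal
```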