In this project, I worked on the music streaming app, Sparkify. I built an ETL pipeline that reads Sparkify's Logs and Songs JSON files from an S3 data lake and writes them back to S3 as Parquet files.
Spark running on AWS EMR was used to transform the raw data into a set of dimensional tables.
This enables analytics that surface insights into the tracks and user activity logs, supporting an increase in SaaS ROI.
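As a concrete illustration, the first step of the pipeline is simply reading the raw JSON from S3 with PySpark. The sketch below shows that step; the bucket names and glob paths are placeholders, not the project's actual locations.

```python
from pyspark.sql import SparkSession

# Placeholder S3 locations -- the real bucket names and path globs are
# assumptions for illustration, not the project's actual configuration.
SONG_DATA = "s3a://sparkify-input/song_data/*/*/*/*.json"
LOG_DATA = "s3a://sparkify-input/log_data/*/*/*.json"

# On an EMR cluster the Hadoop/AWS connectors are already on the classpath,
# so creating a session is enough to read directly from S3.
spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

song_df = spark.read.json(SONG_DATA)   # raw song metadata
log_df = spark.read.json(LOG_DATA)     # raw user-activity events
```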
As Sparkify's Data Engineer, I implemented the ETL for a star schema data warehouse model, built on the S3 data lake and stored as Parquet files.
The star schema model for the data warehouse was achieved by using Apache PySpark on AWS EMR to ETL the Songs and Logs JSON data in S3 into Parquet files on S3. There is one Fact table, "songplays", along with four Dimension tables named "users", "songs", "artists" and "time".
These Fact and Dimension tables can then be incorporated into the design of a SQL data warehouse.
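To make the modeling step concrete, the sketch below builds the "songs" dimension and the "songplays" fact table and writes them back to S3 as Parquet. The column names (song_id, title, ts, userId, etc.) and the join condition are assumptions based on a typical Sparkify-style dataset; the actual ETL script may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Placeholder inputs/outputs, continuing from the read step sketched above.
song_df = spark.read.json("s3a://sparkify-input/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://sparkify-input/log_data/*/*/*.json")
output = "s3a://sparkify-output/"

# "songs" dimension table -- columns assumed from the song metadata files.
songs_table = song_df.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet(output + "songs/")

# "songplays" fact table: keep only song-play events and join them back to
# the song metadata to resolve song_id / artist_id.
events = log_df.filter(F.col("page") == "NextSong")
songplays_table = (
    events.join(
        song_df,
        (events.song == song_df.title) & (events.artist == song_df.artist_name),
        "left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
        F.col("userId").alias("user_id"),
        "level", "song_id", "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
)
songplays_table.write.mode("overwrite").parquet(output + "songplays/")
```

The same pattern (select the needed columns, de-duplicate on the business key, write partitioned Parquet) applies to the remaining "users", "artists" and "time" dimension tables.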