Sparkify is a fictional music application that stores songs and user activity logs in separate JSON files. As the application grew, it became increasingly difficult for the company to manage and benefit from these files. The suggested solution is to invest in a database and an ETL pipeline (the process of Extracting data from various sources, Transforming and processing it, then Loading it into the destination database).
Since the company deals with a huge amount of data, a star schema is the right fit for this application: it keeps inserts and updates straightforward and lets analytical queries run with simple joins. The database consists of the following tables:
- songplays (the fact table)
- songs (dimension table, extracted from the song_data files)
- artists (dimension table, extracted from the song_data files)
- users (dimension table, extracted from the log_data files)
- time (dimension table, derived from the timestamp column in the log_data files)
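As a rough illustration, the fact table and one of the dimension tables might be created like this. This is a minimal sketch: the column names are assumptions based on the typical Sparkify JSON layout, and the connection string is a placeholder, so the real create_tables.py may differ.

```python
import psycopg2

# Assumed connection string; adjust host/dbname/user/password for your setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Fact table: one row per song-play event from the activity logs.
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")

# Dimension table: one row per distinct user seen in the logs.
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
""")

conn.commit()
conn.close()
```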
The logic of the ETL pipeline is as follows (see the sketch after this list):
- Navigate the source directories and pull in all JSON files
- Split the song data into two tables (songs & artists)
- Split the log data into two tables (users & time)
- Finally, insert the records into the new PostgreSQL database
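A condensed sketch of that flow, using the pandas/psycopg2 stack such scripts typically rely on. The helper names and exact columns here are illustrative, not the actual contents of etl.py.

```python
import glob
import os

import pandas as pd
import psycopg2


def get_json_files(filepath):
    """Walk the directory tree under filepath and collect every JSON file."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files


def process_song_file(cur, filepath):
    """Split one song file into songs and artists records."""
    df = pd.read_json(filepath, lines=True)
    for _, row in df.iterrows():
        cur.execute(
            "INSERT INTO songs VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            row[["song_id", "title", "artist_id", "year", "duration"]].tolist(),
        )
        cur.execute(
            "INSERT INTO artists VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            row[["artist_id", "artist_name", "artist_location",
                 "artist_latitude", "artist_longitude"]].tolist(),
        )


def process_log_file(cur, filepath):
    """Split one activity-log file into users and time records."""
    df = pd.read_json(filepath, lines=True)
    df = df[df["page"] == "NextSong"]  # keep only song-play events

    # Time dimension: break the millisecond timestamp into calendar parts.
    for ts in pd.to_datetime(df["ts"], unit="ms"):
        cur.execute(
            "INSERT INTO time VALUES (%s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            (ts, ts.hour, ts.day, ts.month, ts.year, ts.dayofweek),
        )

    # Users dimension: deduplicate within the file, keep the latest level.
    user_df = df[["userId", "firstName", "lastName", "gender", "level"]].drop_duplicates()
    for _, row in user_df.iterrows():
        cur.execute(
            "INSERT INTO users VALUES (%s, %s, %s, %s, %s) "
            "ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level",
            row.tolist(),
        )


conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.autocommit = True
cur = conn.cursor()
for path in get_json_files("data/song_data"):
    process_song_file(cur, path)
for path in get_json_files("data/log_data"):
    process_log_file(cur, path)
conn.close()
```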
Here are screenshots of the tables after loading them with records from the ETL pipeline:
- users table
- time table
- songplays table
To run the pipeline, follow these steps in order:
- Open a terminal (or Bash on Windows)
- Run python create_tables.py and press Enter to execute the command
- Run python etl.py and wait until processing is completed
- Run test.ipynb to make sure all the records were added successfully
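Condensed, the same steps in a terminal (this assumes python and jupyter are on your PATH):

```
python create_tables.py        # set up the database tables
python etl.py                  # process the JSON files and load the records
jupyter notebook test.ipynb    # open the notebook to verify the inserts
```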
Regards, Noof Aleliwi