Sparkify is a fictional music application that stores songs and user activity logs in separate JSON files. As the application grew, it became increasingly difficult for the company to manage and benefit from these files. The suggested solution is to invest in a database and an ETL pipeline (the process of Extracting data from various sources, Transforming and processing it, then Loading it into the destination database).
Since the company deals with a huge amount of data, a star schema is the right fit for this application: it keeps inserts and updates straightforward and lets analytical queries run with simple joins. The database consists of the following tables:
- songplays (the fact table)
- songs (dimension table, extracted from the song_data files)
- artists (dimension table, extracted from the song_data files)
- users (dimension table, extracted from the log_data files)
- time (dimension table, derived from the timestamp column in the log_data files)
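As a rough illustration, the fact table and one of the dimension tables might be created like this. This is a minimal sketch: the column names are assumptions based on the typical Sparkify JSON layout, and the connection string is a placeholder, so the real create_tables.py may differ.

```python
import psycopg2

# Assumed connection string; adjust host/dbname/user/password for your setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Fact table: one row per song-play event from the activity logs.
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")

# Dimension table: one row per distinct user seen in the logs.
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
""")

conn.commit()
conn.close()
```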
The logic of the ETL pipeline is as follows (see the sketch after this list):
- Navigate the source directories and pull in all JSON files
- Split the song data into two tables (songs & artists)
- Split the log data into two tables (users & time)
- Finally, insert the records into the new PostgreSQL database
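A condensed sketch of that flow, using the pandas/psycopg2 stack such scripts typically rely on. The helper names and exact columns here are illustrative, not the actual contents of etl.py.

```python
import glob
import os

import pandas as pd
import psycopg2


def get_json_files(filepath):
    """Walk the directory tree under filepath and collect every JSON file."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files


def process_song_file(cur, filepath):
    """Split one song file into songs and artists records."""
    df = pd.read_json(filepath, lines=True)
    for _, row in df.iterrows():
        cur.execute(
            "INSERT INTO songs VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            row[["song_id", "title", "artist_id", "year", "duration"]].tolist(),
        )
        cur.execute(
            "INSERT INTO artists VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            row[["artist_id", "artist_name", "artist_location",
                 "artist_latitude", "artist_longitude"]].tolist(),
        )


def process_log_file(cur, filepath):
    """Split one activity-log file into users and time records."""
    df = pd.read_json(filepath, lines=True)
    df = df[df["page"] == "NextSong"]  # keep only song-play events

    # Time dimension: break the millisecond timestamp into calendar parts.
    for ts in pd.to_datetime(df["ts"], unit="ms"):
        cur.execute(
            "INSERT INTO time VALUES (%s, %s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            (ts, ts.hour, ts.day, ts.month, ts.year, ts.dayofweek),
        )

    # Users dimension: deduplicate within the file, keep the latest level.
    user_df = df[["userId", "firstName", "lastName", "gender", "level"]].drop_duplicates()
    for _, row in user_df.iterrows():
        cur.execute(
            "INSERT INTO users VALUES (%s, %s, %s, %s, %s) "
            "ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level",
            row.tolist(),
        )


conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.autocommit = True
cur = conn.cursor()
for path in get_json_files("data/song_data"):
    process_song_file(cur, path)
for path in get_json_files("data/log_data"):
    process_log_file(cur, path)
conn.close()
```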
Here are screenshots of the tables after loading them with records from the ETL pipeline:
- users table
- time table
- songplays table
To run the pipeline, follow these steps in order:
- Open a terminal (or Bash on Windows)
- Run python create_tables.py and press Enter to execute the command
- Run python etl.py and wait until processing is completed
- Run test.ipynb to make sure all the records were added successfully
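Condensed, the same steps in a terminal (this assumes python and jupyter are on your PATH):

```
python create_tables.py        # set up the database tables
python etl.py                  # process the JSON files and load the records
jupyter notebook test.ipynb    # open the notebook to verify the inserts
```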
Regards, Noof Aleliwi