ETL Project for a fictional company sparkify. The ETL Pipeline built is aimed at enabling the Analytics team to retrieve information on songs users are listening to. Data sources exists in directories of JSON logs and JSON metadata on Sparkify songs app.
Created a database schema to match Analytics teams data requirements making use of star schema modelling. Fact table - songplays - log records data associated with songplays
songplay_id | start_time | user_id | level | song_id | artist_id | session_id | location | user_agent |
---|
Dimension tables:
-
users - app users
first_name last_name gender level -
songs - songs in app database
song_id title artist_id year duration -
artists - artists in app database
artist_id name location latitude longitude -
time - timestamps of records on songplays broken down into specific units
start_time hour day week month year weekday
Created a python ETL pipeline that retrieves, processes and loads records into the the various tables. create_tables.py --> creates the fact and dimension tables etl.py --> reads and processes files from song_data and log_data and loads them into tables
Highly uitilized the web to get an understanding of concepts
-Reference #1:Pandas User Guide
-Reference #2:Selecting a specific value from a dataframe
-Reference #3:Dataframe to list