This is the first project in the Udacity Data Engineer Nanodegree.
The virtual startup 'Sparkify' provides a music streaming service. In this project we create an ETL pipeline using song and log datasets to a DB optimized for queries on analysis of song plays, which could be used to support music recommendations, user behaviour prediction, or reporting for artist remuneration.
Clone this repository and (if you want to use your own data), replace the files in the /data/song_data
and data/log_data
directories with your own song and log data.
pip3 install -r requirements.txt
python3 create_tables.py
python3 etl.py
Note: Expects to be able to connect to Postgres using the following credentials:
host=127.0.0.1 dbname=sparkifydb user=student password=student
- etl.py - The main script that runs the ETL Pipeline which processes the files in
data/
directory - create_tables.py - Creates the
sparkifydb
, dropping the existing db & tables if they already exist - db_connection.py - Defines a context manager class
DbConnection
which wraps thepsycopg2
connection & methods - tests/test_etl.py - Test for the ETL Pipeline
- tests/test.ipynb - Executes a select on each table to confirm that it is populated.
songplays - records in log data associated with song plays i.e. records with page NextSong
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
users - users in the app
- user_id, first_name, last_name, gender, level
songs - songs in music database
- song_id, title, artist_id, year, duration
artists - artists in music database
- artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.