Skip to content

julianazhu/sparkify-postgres-etl

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sparkify-etl-pipeline

Project Overview

This is the first project in the Udacity Data Engineer Nanodegree.

The virtual startup 'Sparkify' provides a music streaming service. In this project we create an ETL pipeline using song and log datasets to a DB optimized for queries on analysis of song plays, which could be used to support music recommendations, user behaviour prediction, or reporting for artist remuneration.

How to Run

Clone this repository and (if you want to use your own data), replace the files in the /data/song_data and data/log_data directories with your own song and log data.

pip3 install -r requirements.txt
python3 create_tables.py
python3 etl.py

Note: Expects to be able to connect to Postgres using the following credentials:

host=127.0.0.1 dbname=sparkifydb user=student password=student

Project Files

  • etl.py - The main script that runs the ETL Pipeline which processes the files in data/ directory
  • create_tables.py - Creates the sparkifydb, dropping the existing db & tables if they already exist
  • db_connection.py - Defines a context manager class DbConnection which wraps the psycopg2 connection & methods
  • tests/test_etl.py - Test for the ETL Pipeline
  • tests/test.ipynb - Executes a select on each table to confirm that it is populated.

SparkifyDB (Star Schema)

Fact Table

songplays - records in log data associated with song plays i.e. records with page NextSong

  • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables

users - users in the app

  • user_id, first_name, last_name, gender, level

songs - songs in music database

  • song_id, title, artist_id, year, duration

artists - artists in music database

  • artist_id, name, location, latitude, longitude

time - timestamps of records in songplays broken down into specific units

  • start_time, hour, day, week, month, year, weekday

Data Source - The Million Song Dataset

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

About

ETL Pipeline to Postgres

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published