sparkify-etl-pipeline

Project Overview

This is the first project in the Udacity Data Engineer Nanodegree.

The virtual startup 'Sparkify' provides a music streaming service. In this project we create an ETL pipeline using song and log datasets to a DB optimized for queries on analysis of song plays, which could be used to support music recommendations, user behaviour prediction, or reporting for artist remuneration.

How to Run

Clone this repository and (if you want to use your own data), replace the files in the /data/song_data and data/log_data directories with your own song and log data.

pip3 install -r requirements.txt
python3 create_tables.py
python3 etl.py

Note: Expects to be able to connect to Postgres using the following credentials:

host=127.0.0.1 dbname=sparkifydb user=student password=student

Project Files

etl.py - The main script that runs the ETL Pipeline which processes the files in data/ directory
create_tables.py - Creates the sparkifydb, dropping the existing db & tables if they already exist
db_connection.py - Defines a context manager class DbConnection which wraps the psycopg2 connection & methods
tests/test_etl.py - Test for the ETL Pipeline
tests/test.ipynb - Executes a select on each table to confirm that it is populated.

SparkifyDB (Star Schema)

Fact Table

songplays - records in log data associated with song plays i.e. records with page NextSong

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

users - users in the app

user_id, first_name, last_name, gender, level

songs - songs in music database

song_id, title, artist_id, year, duration

artists - artists in music database

artist_id, name, location, latitude, longitude

time - timestamps of records in songplays broken down into specific units

start_time, hour, day, week, month, year, weekday

Data Source - The Million Song Dataset

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

sparkify-etl-pipeline

Project Overview

How to Run

Project Files

SparkifyDB (Star Schema)

Fact Table

Dimension Tables

Data Source - The Million Song Dataset

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
data		data
tests		tests
LICENSE		LICENSE
README.md		README.md
create_tables.py		create_tables.py
db_connection.py		db_connection.py
etl.py		etl.py
requirements.txt		requirements.txt
sql_queries.py		sql_queries.py

License

julianazhu/sparkify-postgres-etl

Folders and files

Latest commit

History

Repository files navigation

sparkify-etl-pipeline

Project Overview

How to Run

Project Files

SparkifyDB (Star Schema)

Fact Table

Dimension Tables

Data Source - The Million Song Dataset

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages