
Awesimous/Sparkify_AWS


Introduction

Sparkify

Sparkify strives for a world with more music! How do we do this? By accompanying you with the best fit for your taste. We make sure that the music in your ears is just right for you - whatever the situation and mood might be!

Starting up the project

The aim of the project is to empower our data analytics with the most effective structure possible, on the AWS Cloud! For that reason, we build an ETL pipeline that extracts data from S3, stages it on Redshift and transforms it into dimensional tables for song play analysis. With the cloud approach we want data that:

  • is easily accessible,
  • has a structure that is easy to understand,
  • can be analysed by SQL queries.

Point of Origin

Our data is stored as two sets of JSON files:

Log data

Log_data files store all the information we have about users and their sessions, including the user's name and location, level of access, song and artist name, and the timestamp when the song was played. The fields available in every log_data file are:

  • artist
  • auth
  • firstName
  • gender
  • itemInSession
  • lastName
  • length
  • level
  • location
  • method
  • page
  • registration
  • sessionId
  • song
  • status
  • ts
  • userAgent
  • userId

The log_data files are partitioned by year and month, with a separate folder for each partition. For example, below we have pasted filepaths to two files in this dataset:

  • log_data/2018/11/2018-11-12-events.json
  • log_data/2018/11/2018-11-13-events.json
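For orientation, a single event record inside one of these files might look roughly like the hypothetical example below (shown here as a Python dict; all values are invented for illustration and are not taken from the actual dataset):

    # Hypothetical log_data event record (all values invented for illustration)
    event = {
        "artist": "Some Artist",
        "auth": "Logged In",
        "firstName": "Jane",
        "gender": "F",
        "itemInSession": 0,
        "lastName": "Doe",
        "length": 207.64,
        "level": "free",
        "location": "San Francisco-Oakland-Hayward, CA",
        "method": "PUT",
        "page": "NextSong",
        "registration": 1541016707796.0,
        "sessionId": 139,
        "song": "Some Song",
        "status": 200,
        "ts": 1541990217796,
        "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)",
        "userId": "8"
    }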

Song data

Song_data files provide information about every single song available in our service, along with some information about its artist. The following fields are available for each song:

  • artist_id
  • artist_latitude
  • artist_location
  • artist_longitude
  • artist_name
  • duration
  • num_songs
  • song_id
  • title
  • year

Each JSON file in the song_data dataset stores information about one song. The song_data files are partitioned by the first three letters of each song's track ID. For example, below we have pasted filepaths to two files in this dataset:

  • song_data/A/B/C/TRABCEI128F424C983.json
  • song_data/A/A/B/TRAABJL12903CDCF1A.json
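Analogously, one song_data file holds a single record of roughly this shape (again a hypothetical Python representation with invented values):

    # Hypothetical song_data record (all values invented for illustration)
    song = {
        "artist_id": "AR0000000000000000",
        "artist_latitude": None,
        "artist_location": "",
        "artist_longitude": None,
        "artist_name": "Some Artist",
        "duration": 207.64,
        "num_songs": 1,
        "song_id": "SO0000000000000000",
        "title": "Some Song",
        "year": 0
    }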

Database Design

(Database schema diagram: see the grafik image in the repository.)

Explanation of colors:

  • Grey: Table names
  • Red: VARCHAR
  • Yellow: INTEGER / FLOAT
  • Green: TIMESTAMP

Fact table

Table name: songplays
Fields: songplay_id, start_time, user_id, level, session_id, location, user_agent, song_id, artist_id
Data sources: log_data, song_data

Dimension tables

Table name: users
Fields: user_id, first_name, last_name, gender, level

Table name: songs
Fields: song_id, title, artist_id, year, duration

Table name: artists
Fields: artist_id, name, location, latitude, longitude

Table name: time
Fields: start_time, hour, day, week, month, year, weekday
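As a rough sketch of how this design translates into Redshift SQL, the fact table and its population from the staging tables could look like the following, written in the sql_queries.py style of query strings. The authoritative statements live in sql_queries.py; the staging table names, column types and join conditions shown here are assumptions for illustration.

    # Sketch: songplays fact table DDL and insert from staging tables
    # (types, constraints and staging column names are assumptions).
    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY,
            start_time  TIMESTAMP NOT NULL,
            user_id     INTEGER NOT NULL,
            level       VARCHAR,
            song_id     VARCHAR,
            artist_id   VARCHAR,
            session_id  INTEGER,
            location    VARCHAR,
            user_agent  VARCHAR
        );
    """

    songplay_table_insert = """
        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                               session_id, location, user_agent)
        SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
               e.userId, e.level, s.song_id, s.artist_id,
               e.sessionId, e.location, e.userAgent
        FROM staging_events e
        LEFT JOIN staging_songs s
               ON e.song = s.title AND e.artist = s.artist_name
        WHERE e.page = 'NextSong';
    """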

Explanation of files

The files only work if a Redshift cluster is up and running.

The project workspace includes four files:

  • create_tables.py drops and creates the database tables. It resets the tables prior to running the ETL scripts.
  • etl.py reads and processes the files from the song_data and log_data buckets stored in S3 and stages them on Redshift. The staged tables are subsequently used to insert the data into our analytics schema (see create_tables.py with the respective SQL queries).
  • sql_queries.py contains all of the project's SQL queries and is referred to by create_tables.py and etl.py.
  • main.py is the driver program that executes create_tables.py and etl.py in sequence (a minimal sketch follows below).
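A minimal sketch of main.py, assuming create_tables.py and etl.py each expose a main() function (the actual entry points in the repository may be named differently):

    # main.py - driver: reset the tables, then run the ETL
    # (assumes create_tables and etl each expose a main() function)
    import create_tables
    import etl


    def main():
        create_tables.main()  # drop and (re)create staging + analytics tables
        etl.main()            # COPY from S3 into staging, then insert into analytics tables


    if __name__ == "__main__":
        main()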

How to run

The data sources are provided by two S3 buckets (plus a JSON path file for the log data):

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data
  • Log data json path: s3://udacity-dend/log_json_path.json
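The staging load comes down to two Redshift COPY statements against these locations. A hedged sketch, in the sql_queries.py string style, is shown below; the IAM role ARN is a placeholder you must supply, and the region is an assumption:

    # Sketch: COPY from S3 into the staging tables
    # (the IAM role ARN is a placeholder; the region is an assumption).
    staging_events_copy = """
        COPY staging_events
        FROM 's3://udacity-dend/log_data'
        IAM_ROLE '<your-iam-role-arn>'
        REGION 'us-west-2'
        FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
    """

    staging_songs_copy = """
        COPY staging_songs
        FROM 's3://udacity-dend/song_data'
        IAM_ROLE '<your-iam-role-arn>'
        REGION 'us-west-2'
        FORMAT AS JSON 'auto';
    """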

You need an AWS Redshift cluster up and running.

Create Cluster

  • A Redshift dc2.large cluster with 4 nodes was used.
  • Include an IAM role as the authorization mechanism.
  • The only policy attached to this IAM role is AmazonS3ReadOnlyAccess.
  • Make sure Redshift has public access and VPC security group access.
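If you prefer to create the cluster programmatically instead of via the console, a sketch with boto3 could look like this (cluster identifier, database name, credentials and the IAM role ARN are placeholders):

    # Sketch: create a 4-node dc2.large Redshift cluster with boto3
    # (identifier, credentials and IAM role ARN are placeholders).
    import boto3

    redshift = boto3.client("redshift", region_name="us-west-2")

    redshift.create_cluster(
        ClusterIdentifier="sparkify-cluster",
        ClusterType="multi-node",
        NodeType="dc2.large",
        NumberOfNodes=4,
        DBName="sparkify",
        MasterUsername="awsuser",
        MasterUserPassword="<your-password>",
        IamRoles=["<your-iam-role-arn>"],  # role with AmazonS3ReadOnlyAccess attached
        PubliclyAccessible=True,
    )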

Run the driver program main.py:

  • python main.py
  • lean back

The create_tables.py and etl.py files can also be run independently:

  • python create_tables.py
  • python etl.py
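Under the hood, etl.py essentially loops over the COPY and INSERT queries against an open Redshift connection. A minimal sketch, assuming sql_queries.py exposes copy_table_queries and insert_table_queries lists (these names and the connection details are assumptions):

    # Sketch of etl.py's core loop (query list names and connection
    # parameters are assumptions; real values come from your cluster).
    import psycopg2

    from sql_queries import copy_table_queries, insert_table_queries


    def load_staging_tables(cur, conn):
        for query in copy_table_queries:  # COPY from S3 into staging tables
            cur.execute(query)
            conn.commit()


    def insert_tables(cur, conn):
        for query in insert_table_queries:  # INSERT ... SELECT into analytics tables
            cur.execute(query)
            conn.commit()


    def main():
        conn = psycopg2.connect(
            host="<cluster-endpoint>", dbname="sparkify",
            user="awsuser", password="<your-password>", port=5439,
        )
        cur = conn.cursor()
        load_staging_tables(cur, conn)
        insert_tables(cur, conn)
        conn.close()


    if __name__ == "__main__":
        main()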

About

Load data from S3 into staging tables on Redshift and execute SQL statements that create the analytics tables from these staging tables.
