Sparkify strives for a world with more music! How do we do this? By accompanying you with the best fit for your taste. We ensure that the music in your ears is just right for you - whatever the situation and mood might be!
The aim of the project is to empower our data analytics with the most effective structure possible, on the AWS Cloud! For that reason, we build an ETL pipeline that extracts data from S3, stages it on Redshift, and transforms it into dimensional tables for song play analysis. With the Cloud approach we want data that:
- is easily accessible,
- has a structure that is easy to understand,
- can be analysed by SQL queries.
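Once the star schema described below is loaded, such analyses boil down to plain SQL. A hypothetical example (table and column names come from the schema section of this document):

```python
# Hypothetical analysis query: the ten most-played songs.
# Table/column names follow the star schema described below.
top_songs = """
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 10;
"""
```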
Our data is stored as two sets of JSON files:
Log_data files store all the information we have about users and their sessions, including the user's name and location, level of access, song and artist name, the timestamp at which the song was played, etc. The fields available in every log_data file are:
- artist
- auth
- firstName
- gender
- itemInSession
- lastName
- length
- level
- location
- method
- page
- registration
- sessionId
- song
- status
- ts
- userAgent
- userId
The log_data files are partitioned by year and month, with a separate folder for each partition. For example, below we have pasted filepaths to two files in this dataset:
- log_data/2018/11/2018-11-12-events.json
- log_data/2018/11/2018-11-13-events.json
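A minimal sketch of reading one of these files locally, assuming the usual line-delimited JSON layout (one event object per line):

```python
import json

# Sketch: parse one of the example log_data files listed above.
# Assumes line-delimited JSON; 'NextSong' marking actual song plays
# is an assumption about the 'page' field, not stated in this README.
with open("log_data/2018/11/2018-11-12-events.json") as f:
    for line in f:
        event = json.loads(line)
        if event.get("page") == "NextSong":
            print(event["ts"], event["userId"], event["song"])
```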
Song_data files provide information about every song available in our service, along with some info about the artist. The following fields are available for each song:
- artist_id
- artist_latitude
- artist_location
- artist_longitude
- artist_name
- duration
- num_songs
- song_id
- title
- year
Each JSON file in the song_data dataset stores info about one song. The song_data files are partitioned by the first three letters of each song's track ID. For example, below we have pasted filepaths to two files in this dataset:
- song_data/A/B/C/TRABCEI128F424C983.json
- song_data/A/A/B/TRAABJL12903CDCF1A.json
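Since each file stores a single song, reading one locally is straightforward. A minimal sketch using the first example path above:

```python
import json

# Sketch: each song_data file holds one JSON object with the fields listed above.
with open("song_data/A/B/C/TRABCEI128F424C983.json") as f:
    song = json.load(f)

print(song["song_id"], song["title"], song["artist_name"], song["duration"])
```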
Star-schema diagram (legend: grey = table names, red = VARCHAR, yellow = INTEGER / FLOAT, green = TIMESTAMP).
Fact table:
- Table name: songplays; Fields: songplay_id, start_time, user_id, level, session_id, location, user_agent, song_id, artist_id; Datasource: log_data, song_data

Dimension tables:
- Table name: users; Fields: user_id, first_name, last_name, gender, level
- Table name: songs; Fields: song_id, title, artist_id, year, duration
- Table name: artists; Fields: artist_id, name, location, latitude, longitude
- Table name: time; Fields: start_time, hour, day, week, month, year, weekday
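As an illustration, a minimal sketch of how the fact table could be declared in the style of sql_queries.py; the exact Redshift types and constraints here are assumptions, not the project's final DDL:

```python
# Sketch of the songplays fact-table DDL (types/constraints are assumptions).
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INTEGER   NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INTEGER,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```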
The scripts below only work if a Redshift cluster is up and running.
The project workspace includes the following files:
- create_tables.py drops and creates the database tables, resetting them prior to running the ETL scripts.
- etl.py reads and processes files from the song_data and log_data buckets stored on S3 and stages them on Redshift. The staged tables are subsequently used to insert the data into our star schema (see sql_queries.py for the respective SQL queries, and the COPY sketch after the datasource list below).
- sql_queries.py contains all of the project's SQL queries and is referred to by create_tables.py and etl.py.
- main.py is the driver program that executes create_tables.py and etl.py in sequence.
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
- Log data json path: s3://udacity-dend/log_json_path.json
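A minimal sketch of the staging step, in the style of sql_queries.py, using the buckets above; the IAM role ARN is a placeholder and the bucket region (us-west-2) is an assumption:

```python
# Sketch of a Redshift COPY from S3 into a staging table. 'staging_events'
# is assumed to be created by create_tables.py; the ARN is a placeholder.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-s3-role>'
    REGION 'us-west-2'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""
```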
Create Cluster
- A Redshift dc2.large cluster with 4 nodes was used
- An IAM role is used as the authorization mechanism
- The only policy attached to this IAM role is AmazonS3ReadOnlyAccess
- Make sure Redshift has public access and VPC Security Group access
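A minimal sketch of creating such a cluster with boto3, assuming AWS credentials are configured and the IAM role already exists; all identifiers and the password are placeholders:

```python
import boto3

# Sketch of the cluster described above: 4x dc2.large, publicly accessible,
# with an IAM role that carries AmazonS3ReadOnlyAccess. Names are placeholders.
redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier="sparkify-cluster",
    DBName="sparkify",
    MasterUsername="awsuser",
    MasterUserPassword="<choose-a-password>",
    IamRoles=["arn:aws:iam::<account-id>:role/<redshift-s3-role>"],
    PubliclyAccessible=True,
)
```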
Run the driver program main.py
- python main.py
- lean back
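For reference, a minimal sketch of what the driver plausibly looks like, assuming create_tables.py and etl.py each expose a main() entry point (an assumption based on the file descriptions above):

```python
# Driver sketch: reset the tables, then run the ETL.
import create_tables
import etl

def main():
    create_tables.main()  # drop and recreate staging + analytics tables
    etl.main()            # COPY from S3 into staging, then fill the star schema

if __name__ == "__main__":
    main()
```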
The create_tables.py and etl.py files can also be run independently:
- python create_tables.py
- python etl.py