Author: Sean Reidy
UMBC DATA 601 Fall 2020
Final Project
This project is an utilizes the Spotify Web API in conjunction with the Genius Web API. exploring, analysing, modeling genre by using different audio features from 13 different Spotify genres.
The report and modeling can be found in Report.ipynb
The final model is a Random Forest Classifier, with 70% accuracy at predicting genre.
It is obvious to most individuals what a musical genre sounds like, it's quite easy to say one song is rock and another is rap after only to listing for a few seconds. What’s difficult is asking why? Why is song X rock and not rap? What natural heuristics did we use in our heads to make this decision? For Music, and other creative media, there are no predefined rules for classifying a track or artists with a genre. With this project I hope to define genre in more quantitative factors, such as a song's popularity, the key, and other audio features.
- What characteristics objectivly define a music genre?
- Is it possible to classify traks into a genre from audio feautres and lyric data?
- Are some genres more diverse than others? And how so, is the genre creatively contraited by the trappings of the genre?
- What is better at defining genre, is it the audio features (such as tone, valance, or danceability) or is the lyrical content of a song more effective for a classification model?
- Does incorpering NLP features on lyrical content help improve classification of genre?
- How does lyrical sentiment, and profanity differ across genre?
- Does there exist a relationship between the Valence (musical tone of a song: happy or sad) and Lyrics Sentiment (are the lyrics themselves happy or sad)?
- Using our understanding of genre from our models is it possible to generate music and lyrics of a given genre?
Looking at the technical and programming aspects of this project. I wanted to learn and implement parallel processing for the data collection process. Earlier in this project I had realized that the volume of data I would need to pull from the Spotify API would be much greater than my past work. Pulling tracks one at a time would not be practical, so I wanted to learn how to parallelize this process with Pythons Concurrent Futures package.
If you're a Spotify user like myself, I’m almost certain you were fascinated by your recent spotify wrapped for 2020. Spotify’s yearly summary of your musical adventures on the platform, serving as an “best of” montage of your music.
This year in particular given the circumstances of our new lives, many user's musical habits and even tastes have shifted and changed. Music is deeply personal and varied, as this year has made some of us branch out into new genres, and others look back at comfort songs from our youth. There is reason to suspect that these changes in taste is a response to stress and environmental factors. More info on shifting tastes in music during the pandemic
Similar projects and tutorials references on the Spotify Web API and python data science
- Every Noise an ongoing project mapping and cataloging every single spotify genre
- https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b Shows how to implment the Spotipy python package to acess the Spotify Web API and save the results to a pandas data frame.
- https://www.kaggle.com/nicapotato/top-spotify-tracks-2019-eda/execution An example of EDA on spotify feature data
- https://www.kaggle.com/aniruddhachoudhury/classify-song-genres-from-audio-data-model Refrenced for testing and comparing moddle results
-
Data: Data is not included in this Github repo due to file size restrictions.To Download any of the data used follow this google drive link
-
Notebooks
- Report.ipynb: Notebook containing the report and modeling
- GettingData.ipynb: Pulling from both the Spotify Web API and the Genius Web API
- EDA.ipynb: Exploritroty
- FeatureEngineering.ipynb: cleaning, and saving the data, removing outliers.
- Packages Used:
- Spotipy: A Python library for the Spotify Web API
- lyricsgenius : A Python wrapper for the Genius Web API
- getpass: A Python library to hide spotify web API credentials
- pickel: used to save the spotify dataframe to disk
- pandas
- matplotlib
- seaborn
- scikit-learn
- re Regular expression operations
- nltk Python Natural Language Toolkit
- wordcloud
Illustrated in the image above is the process used to collect the data from both the Spotify Web API and the Genius API. Two different Python API wrappers were used to interact with the API’s, Spotipy and lyricsgenius
There is a total of 6 root genres, each root genre has 5 of the most popular sub_genres, These genres were selected using the Every Noise Project By looking at the top most popular genres across all of Spotify, then selecting the associated sub genres.
- Pop : post teen pop, dance pop, electropop, pop dance, indie pop
- Rap : hip hop. souther hip hop, gangster rap, trap, dirty south rap
- EDM : electro house, big room, pop edm, pop dance, complextro
- R&B : urban contemporary, new jack swing, neo soul, hip pop,pop r&b
- Country : Country road, contemporary county, moden country rock, country rock, country dawn
- Rock : album rock, classic rock, permanet wave, hard rock, modern rock
The Features of the Dataset
- track_id : a spotify primary key; unique for each track
- artist: name of artist
- album: name of album
- trackName: title of track
- root genre: (TARGET VAR) the root genre of the track, pop, rap, edm, R&B, Country, Rock
- sub genre: The associated sub genre of the track
- acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- duration_ms: The duration of the track in milliseconds.
- energy: a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context.
- key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation
- liveness: Detects the presence of an audience in the recording.
- loudness: The overall loudness of a track in decibels (dB).
- mode: indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- speechiness: Speechiness detects the presence of spoken words in a track.
- tempo: BPM of track
- time_signature: An estimated overall time signature of a track.
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
- category_id: The Spotify Category ID of the track
- popularity: The value will be between 0 and 100, with 100 being the most popular. The artist’s popularity is calculated from the popularity of all the artist’s tracks.
- lyrics: A large string of song raw lyrics from the genius API
- lyrics_vector: lyrics, cleaned of new line chars and vectorized into a list
- pos_tagging: a list of tuples generated from nltk.pos_tag, Part of Speach
- nouns: A total count of nouns in a track
- verbs: a total count of verbs in a track
- adverbs: a total count of adverbs in a track
- adjectives: a total count of adjectives in a track
- foreign_count: a total count of foreign(non english) words in a track
- profanity_count: a total count of profanity in a track
- lyrics_lemm: lyrics vectors that have been stemmed and lemmitized
- lyrics_sentiment: The mean sentiment from nltk.sentiment.vader across all lines in the track lyrics. negative values is more sad,mean and postive values happy and upbeat.
The Spotify Web Api provided the following Audio Features for each track more info here