This project serves as the final capstone project for the MiSK Data Science immersive program.
This project aims to recommend songs to a user based on the mood of their most recently played songs on Spotify. Two datasets are used: the user's 50 most recently played tracks and a dataset of tracks collected from Spotify mood playlists.
To determine the mood of a song, two variables are observed for each track: the valence of the track from the Spotify API audio features, and the lyrics of the track. These two variables are chosen because either one alone is usually not enough to determine the mood of a song, which can be specific to the listener, especially when the lyrics do not match the audio mood (valence) of the track. A popular example is "Take a Walk" by Passion Pit, an upbeat, happy-sounding song with sad lyrics. On the flip side, "Don't Panic" by Coldplay is a mellow, sad-sounding song with hopeful, happy lyrics.
Spotify classifies each track with a set of audio features (e.g. danceability, energy, valence, tempo) that can be retrieved through its API.
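As a rough sketch of how such features can be retrieved (not the exact retrieval code used in this project), the spotipy client wraps the relevant endpoint; the credentials and track ID below are placeholders:

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# placeholder credentials and track ID, for illustration only
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))
features = sp.audio_features(["PLACEHOLDER_TRACK_ID"])[0]  # one dict per track ID
print(features["valence"], features["energy"])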
Lyrics are not available through Spotify, so multiple methods were used to obtain the lyrics of the tracks.
The lyricsgenius package for Python simplifies the process further by providing Python methods for the Genius API.
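A minimal sketch of fetching lyrics with lyricsgenius, assuming a Genius API token (the token below is a placeholder):

import lyricsgenius

genius = lyricsgenius.Genius("YOUR_GENIUS_TOKEN")  # placeholder token
song = genius.search_song("Take a Walk", "Passion Pit")  # search by title and artist
if song is not None:
    print(song.lyrics[:200])  # first 200 characters of the scraped lyrics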
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from sklearn.preprocessing import LabelEncoder
from wordcloud import WordCloud
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.impute import KNNImputer
from numpy.linalg import norm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Preparing Data
# Read csv of 50 most recently played songs of user
user_df = pd.read_csv('./processed dataset/processed_user_dataset.csv')
# Read csv of spotify mood playlist dataset
spotify_df = pd.read_csv('./processed dataset/processed_spotify_dataset.csv')
frames = [user_df,spotify_df]
# the first 50 rows are the user's recently played tracks
df = pd.concat(frames)
df.shape
(2388, 19)
The dataset contains 12 columns that we can use for exploratory analysis. Since we have already cleaned the dataset, there are not a lot of observed NaN values. However, we will check to make sure:
msno.matrix(df)
plt.show()
We can see that there are only two columns with missing values (noting that the first 50 rows are the user's recently listened tracks). played_at is null for the Spotify dataset because it does not apply there, so we will keep those NaN values. On the other hand, mood is null for the user's songs, as this is the information we don't yet have and will be predicting.
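To complement the matrix plot with exact counts, a quick check:

# count the missing values per column, keeping only columns that have any
missing = df.isna().sum()
print(missing[missing > 0])  # expect played_at (Spotify rows) and mood (user rows)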
# names of the ten audio-feature columns
labels = list(df)[6:16]
features_df = df[['danceability',
'energy',
'key',
'loudness',
'speechiness',
'acousticness',
'instrumentalness',
'liveness',
'valence',
'tempo']]
# showing distribution of each audio feature across all the observations
features_df.hist(figsize = (15,15))
plt.show()
By observing the histograms, we can see that the audio features with the least variance across observations are speechiness and instrumentalness. This tells us that the majority of our observations have around the same speechiness and instrumentalness, so we probably won't be able to get a lot of information out of these features. On the other hand, valence, tempo, energy, and danceability have the most variance across observations, so these are the features we should study to differentiate between tracks and also to find similarities.
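To back up the visual impression with numbers (a rough check, since tempo is on a different scale than the 0-1 features):

# standard deviation of each audio feature, smallest first
print(features_df.std().sort_values())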
# Finding the correlation between all the audio features
ax = sns.heatmap(features_df.corr(),
                 annot=True,
                 linewidths=.5, cmap='RdPu')
We can see that the strongest positive correlations are between energy and loudness and between valence and energy, while acousticness is anti-correlated with both energy and loudness. Both results are logical: energetic songs tend to be loud and have positive valence, while acoustic tracks are less likely to be loud and energetic.
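Rather than reading the extremes off the heatmap, we can extract them programmatically (a small sketch):

# flatten the correlation matrix, dropping self-pairs and mirrored duplicates
pairs = features_df.corr().stack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
print(pairs.sort_values().head(3))  # strongest anti-correlations
print(pairs.sort_values().tail(3))  # strongest positive correlations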
# mean energy of artists that appear more than 10 times, top 20
(df.groupby('artist').filter(lambda x: len(x) > 10)
   .groupby('artist')['energy'].mean()
   .sort_values(ascending=False).to_frame()[:20])
artist | energy
---|---
City Morgue | 0.849840 |
Katy Perry | 0.782824 |
Lady Gaga | 0.755636 |
Little Mix | 0.750118 |
$uicideboy$ | 0.742516 |
Maroon 5 | 0.727786 |
One Direction | 0.727706 |
Dua Lipa | 0.722000 |
Shawn Mendes | 0.695789 |
Rihanna | 0.645053 |
Ariana Grande | 0.637375 |
Taylor Swift | 0.609824 |
Various Artists | 0.553193 |
Oceans Ahead | 0.539615 |
Ed Sheeran | 0.535704 |
Adele | 0.406000 |
Sam Smith | 0.361545 |
Billie Eilish | 0.288011 |
df.loc[df['artist'] == 'City Morgue', 'track']
12      16 TOES
27      33rd Blakk Glass
36      66SLAVS
57      ACAB (feat. Nascar Aloe)
59      ALL KILLER NO FILLER
140     Arson
149     Aw Shit - Zillakami Solo
155     BUAKAW
320     CRANK
321     CYKA
325     Caligula
459     DAWG
557     Downer
794     Gravehop187
803     HURTWORLD '99
1330    NECK BRACE
1443    PROSTHETIC LEGS
1604    SHINNERS13
1608    SPLINTER
1710    Sk8 Head
1731    Snow On Tha Bluff
1857    THE BALLOONS
1858    THE ELECTRIC EXPERIENCE
2046    V12
2179    YELLOW PISS
Name: track, dtype: object
If we exclude Various Artists, $uicideboy$ is the most frequently occurring artist in our dataset; however, City Morgue has the most energetic songs on average and appears 25 times in our dataset.
# mean loudness of artists that appear more than 10 times, top 20
(df.groupby('artist').filter(lambda x: len(x) > 10)
   .groupby('artist')['loudness'].mean()
   .sort_values(ascending=False).to_frame()[:20])
artist | loudness
---|---
Little Mix | 0.783266 |
Katy Perry | 0.780845 |
One Direction | 0.768305 |
Dua Lipa | 0.766334 |
Lady Gaga | 0.763828 |
City Morgue | 0.757022 |
$uicideboy$ | 0.741045 |
Shawn Mendes | 0.736193 |
Maroon 5 | 0.732349 |
Rihanna | 0.726620 |
Ariana Grande | 0.718249 |
Ed Sheeran | 0.675858 |
Oceans Ahead | 0.659323 |
Adele | 0.655346 |
Sam Smith | 0.639838 |
Various Artists | 0.619837 |
Taylor Swift | 0.613379 |
Billie Eilish | 0.446614 |
df.loc[df['artist'] == 'Little Mix', 'track']
63      About the Boy
253     Black Magic
626     F.U.
706     Freak
922     How Ya Doin'? (feat. Missy Elliott)
1038    If You Want My Love
1071    Joan of Arc
1218    Love Me or Leave Me
1371    No More Sad Songs
1387    Notice
1489    Power
1619    Salute
1689    Shout Out to My Ex
1690    Shout Out to My Ex
1845    Sweet Melody
2084    Wasabi
2162    Woman Like Me (feat. Nicki Minaj)
Name: track, dtype: object
Even though loudness and energy are highly correlated, the artist with the loudest songs on average is Little Mix, who appear in the dataset 17 times. Looking at both outputs, Katy Perry follows the energy-loudness correlation most closely, ranking second in both.
# 2-D scatter plot to visualize tendencies using some audio features,
# specifically the "energy" and "danceability" features.
# The red crosshair marks the user's average track
average_energy = user_df['energy'].mean()
average_danceability = user_df['danceability'].mean()
plt.scatter(spotify_df['danceability'], spotify_df['energy'], alpha=0.75)
plt.axhline(y=average_energy, color='r')
plt.axvline(x=average_danceability, color='r')
plt.title("Energy as a function of Danceability")
plt.xlabel("Danceability")
plt.ylabel("Energy")
plt.show()
Here we compare the user's tendencies in energy and danceability against the Spotify dataset. The songs in the Spotify dataset generally have high energy and danceability, while the user's average track (the red crosshair) is somewhat higher in energy than in danceability and sits at an intermediate level of both.
# Finding the correlation between all the audio features(continuous) and the mood(categorical)
enc = LabelEncoder()
spotify_df['mood_enc'] = enc.fit_transform(spotify_df['mood'])
# keep only the numeric columns (mood_enc included) when computing correlations
corr = spotify_df.select_dtypes('number').corr()
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns,
annot=True,
linewidths=.2,)
plt.show()
Noting that after encoding: Angry = 0, Calm = 1, Energy = 2, Happy = 3, Sad = 4 (LabelEncoder assigns integers in alphabetical order).
True to our earlier deduction, speechiness has the least correlation with the mood of a song. We can also note that no single audio feature is strongly correlated with mood on its own.
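To read the mood column of the heatmap directly (a quick sketch; mood_enc is an arbitrary integer coding, so these numbers are only a rough signal):

# absolute correlation of each feature with the encoded mood, weakest first
print(corr['mood_enc'].drop('mood_enc').abs().sort_values())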
# Word cloud of user dataset lyrics vs. spotify dataset lyrics
wordcloud_spotify = WordCloud().generate(' '.join(spotify_df['single_text']))
plt.imshow(wordcloud_spotify)
plt.axis("off")
plt.show()
wordcloud_user = WordCloud().generate(' '.join(user_df['single_text']))
plt.imshow(wordcloud_user)
plt.axis("off")
plt.show()
Surprisingly, both datasets share the same most common words. This could indicate that the songs the user listens to are very similar to the collected Spotify data. Another factor could be that these words are simply among the most common in song lyrics in general; most of the words generated in our word clouds match those in the following article.
It's hard to judge the overall sentiment of the lyrics just by looking at their word clouds, so we will assign a sentiment to each song ranging from negative to neutral to positive. To do this we will perform sentiment analysis using popular lexicons. These sentiments will help us better identify the mood:
# NLP sentiment analysis using TextBlob
# bin edges for mapping polarity to negative / neutral / positive
filter_values = [-1, -0.35, 0.3, 1]
def sentiment_func(lyrics):
    try:
        return TextBlob(lyrics).sentiment
    except Exception:
        return None
df['sentiment'] = df['single_text'].apply(sentiment_func)
# sentiment is a (polarity, subjectivity) tuple; keep only the polarity
df['polarity'] = df['sentiment'].apply(lambda x: x[0] if x is not None else np.nan)
df = df.drop(columns='sentiment')
df['sentiment'] = pd.cut(df['polarity'], bins=filter_values,
                         labels=['negative', 'neutral', 'positive'])
We won't be looking at the subjectivity score that TextBlob outputs alongside polarity, as it's not relevant to song lyrics.
Polarity is a float in [-1, 1]: -1 indicates negative sentiment and +1 indicates positive sentiment. However, observing the results, TextBlob does not always behave as expected, so we'll also use vaderSentiment, which gives a more detailed breakdown, and compare the two.
# NLP sentiment analysis using Vader
analyzer = SentimentIntensityAnalyzer()
# polarity_scores returns {'neg', 'neu', 'pos', 'compound'}; compound is the overall score in [-1, 1]
df['v_sentiment'] = df['single_text'].apply(analyzer.polarity_scores)
# expand the score dict into separate columns
df = pd.concat([df.drop(['v_sentiment'], axis=1), df['v_sentiment'].apply(pd.Series)], axis=1)
df['v_sentiment'] = pd.cut(df['compound'], bins=filter_values,
                           labels=['negative', 'neutral', 'positive'])
Vader performs much better at predicting negative sentiment; however, it also overstates neutral (and sometimes negative) lyrics as positive. After observing and comparing, we see that Vader is better at predicting Energy songs as positive and Angry songs as negative, which are our extremes at both ends, while TextBlob is better at predicting Calm songs as neutral. Both perform about the same for Happy and Sad songs. However, if we take into consideration that the mood does not always match the sentiment of the lyrics, a mismatch between the two can sometimes make sense.
For example, Taylor Swift's 'Bad Blood' has a mood of Energy, but its lyrics get a negative sentiment from both Vader and TextBlob, likely because they contain words such as 'bad', 'sad', and 'mad' that are recognized as negative sentiment words.
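A quick sketch of this effect, scoring an illustrative line (not the actual lyrics) with both lexicons:

line = "now we got bad blood and it drives me mad"  # illustrative text, not the real lyric
print(TextBlob(line).sentiment.polarity)           # TextBlob polarity in [-1, 1]
print(analyzer.polarity_scores(line)['compound'])  # Vader compound score in [-1, 1]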
# showing distribution of the classifications given by Vader and TextBlob
df.groupby(["mood", "sentiment", "v_sentiment"]).size().reset_index(name="count")
 | mood | sentiment | v_sentiment | count
---|---|---|---|---
0 | Angry | negative | negative | 10 |
1 | Angry | negative | neutral | 0 |
2 | Angry | negative | positive | 2 |
3 | Angry | neutral | negative | 258 |
4 | Angry | neutral | neutral | 17 |
5 | Angry | neutral | positive | 143 |
6 | Angry | positive | negative | 3 |
7 | Angry | positive | neutral | 2 |
8 | Angry | positive | positive | 24 |
9 | Calm | negative | negative | 4 |
10 | Calm | negative | neutral | 0 |
11 | Calm | negative | positive | 1 |
12 | Calm | neutral | negative | 102 |
13 | Calm | neutral | neutral | 30 |
14 | Calm | neutral | positive | 230 |
15 | Calm | positive | negative | 2 |
16 | Calm | positive | neutral | 0 |
17 | Calm | positive | positive | 40 |
18 | Energy | negative | negative | 8 |
19 | Energy | negative | neutral | 0 |
20 | Energy | negative | positive | 2 |
21 | Energy | neutral | negative | 101 |
22 | Energy | neutral | neutral | 25 |
23 | Energy | neutral | positive | 279 |
24 | Energy | positive | negative | 5 |
25 | Energy | positive | neutral | 1 |
26 | Energy | positive | positive | 78 |
27 | Happy | negative | negative | 11 |
28 | Happy | negative | neutral | 0 |
29 | Happy | negative | positive | 0 |
30 | Happy | neutral | negative | 92 |
31 | Happy | neutral | neutral | 27 |
32 | Happy | neutral | positive | 297 |
33 | Happy | positive | negative | 4 |
34 | Happy | positive | neutral | 0 |
35 | Happy | positive | positive | 88 |
36 | Sad | negative | negative | 8 |
37 | Sad | negative | neutral | 1 |
38 | Sad | negative | positive | 2 |
39 | Sad | neutral | negative | 134 |
40 | Sad | neutral | neutral | 23 |
41 | Sad | neutral | positive | 234 |
42 | Sad | positive | negative | 3 |
43 | Sad | positive | neutral | 0 |
44 | Sad | positive | positive | 47 |
By observing and comparing the two lexicons, we can conclude that Vader performs better than TextBlob except on Calm songs, but that is because TextBlob classifies most songs as neutral, so it has a higher chance of labelling Calm songs correctly simply because most of its classifications are neutral.
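One way to quantify the comparison is to check, per mood, how often each lexicon returns the sentiment we would intuitively expect; the expected mapping below is our own assumption, not part of the dataset:

# assumed "expected" sentiment for each mood, for a rough agreement check
expected = {'Angry': 'negative', 'Sad': 'negative', 'Calm': 'neutral',
            'Happy': 'positive', 'Energy': 'positive'}
for col in ['sentiment', 'v_sentiment']:
    match = df[col] == df['mood'].map(expected)
    print(col, match.groupby(df['mood']).mean().round(2).to_dict())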
# we will drop TextBlob and use Vader
df = df.drop(columns= ['sentiment','neg','pos','neu', 'polarity'])
df.rename(columns={'v_sentiment': 'sentiment', 'compound': 'polarity'}, inplace= True)
So we will be looking at both the valence of the song and its sentiment to recommend songs to the user. Note that valence is defined by Spotify as a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track: tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
By combining the audio feature and the sentiment, and using our labelled Spotify dataset, we can better predict the mood of the user's most recently played songs.
We will impute the NaN values in the mood column of the user's dataset by comparing the audio feature values and sentiment of each track with similar, already-labelled tracks in the Spotify dataset. The reason we chose imputation over a machine learning classifier is that modern imputers are powerful and easy to use, and they serve our purpose well.
# Encode mood column
df['mood_enc'] = df.mood.map({'Angry': 0, 'Sad': 1, 'Calm': 2, 'Happy': 3, 'Energy': 4})
The KNN imputer is a multivariate imputer, which means it can take multiple features into account when imputing missing values. However, the imputer only works on continuous variables, so we first map our mood column to integer values as an identifier for each mood. The idea in kNN methods is to identify 'k' samples in the dataset that are similar or close in the feature space, and then use those 'k' samples to estimate the value of the missing data points: each sample's missing values are imputed using the mean value of its 'k' neighbours found in the dataset.[1]
# knn based imputation for categorical variables
imputer = KNNImputer(n_neighbors= 2)
df_filled = imputer.fit_transform(df[['danceability',
'energy',
'key',
'loudness',
'speechiness',
'acousticness',
'instrumentalness',
'liveness',
'valence',
'mood_enc']])
df_filled
array([[ 0.761 ,  0.525 , 11.    , ...,  0.0921,  0.531 ,  3.    ],
       [ 0.626 ,  0.528 ,  4.    , ...,  0.0995,  0.274 ,  1.    ],
       [ 0.559 ,  0.345 ,  4.    , ...,  0.141 ,  0.458 ,  2.5   ],
       ...,
       [ 0.561 ,  0.0848,  2.    , ...,  0.112 ,  0.206 ,  1.    ],
       [ 0.635 ,  0.673 ,  1.    , ...,  0.669 ,  0.837 ,  3.    ],
       [ 0.724 ,  0.895 ,  7.    , ...,  0.097 ,  0.64  ,  4.    ]])
We notice that some of the imputed values are floats between two integers, which means the neighbours of that track belong to more than one mood. However, for our application we consider a track to have exactly one mood, so we cast the result to int (note that astype(int) truncates toward zero, so the 2.5 above becomes 2). Now we decode the moods and give them back their labels.
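A tiny sketch of the difference between truncating and rounding here:

vals = np.array([2.5, 3.0, 1.9])
print(vals.astype(int))           # [2 3 1] - truncates toward zero
print(np.rint(vals).astype(int))  # [2 3 2] - rounds to nearest (ties go to the even integer)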
imputed_mood = df_filled[:,-1].astype(int)
imputed_mood[:50]
df = df.drop(columns='mood_enc')
df['mood_enc'] = imputed_mood.tolist()
df.reset_index(drop = True, inplace=True)
df['mood'] = df['mood_enc'].map({0: 'Angry', 1: 'Sad', 2: 'Calm', 3: 'Happy', 4: 'Energy'})
Just by observing the results, we can conclude that the imputer performed very well. Take 'Rhinestone Eyes' by Gorillaz from the user's dataset as an example: its lyrics were identified as positive (although I would personally categorize them as negative), but the audio features ultimately led the imputer to classify the track as Angry, which is where I would personally place this song too.
Now that we have labelled the 50 most recently listened-to songs, we can observe which mood occurs most for the user and then recommend similar songs that fall under that mood. We will be using content-based filtering, as this method uses only information about the description and attributes of the items a user has previously consumed to model the user's preferences. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present): various candidate items are compared with items previously rated by the user and the best-matching items are recommended. [1]
df[:50]['artist'].value_counts()[:5]
Taylor Swift             4
Tears For Fears          3
Nothing But Thieves      2
Elbow                    2
Red Hot Chili Peppers    2
Name: artist, dtype: int64
df[:50]['track'].value_counts()[:5]
Everybody Wants To Rule The World    3
The Funeral                          2
gold rush                            2
Under the Bridge                     2
Chasing Cars                         2
Name: track, dtype: int64
df[:50]['mood'].value_counts()
# So our user is mainly listening to Sad and Calm songs
Sad       18
Calm      18
Happy     11
Energy     2
Angry      1
Name: mood, dtype: int64
How do we use a matrix for recommendation? The answer is similarity. We need to calculate the similarity of one lyric to another, as well as one feature vector to another. How do we do that? We can use different metrics such as cosine similarity or Euclidean distance. For our song recommendation system, we are going to use cosine similarity, and in particular its implementation from scikit-learn.
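For intuition, cosine similarity measures the angle between two vectors: cos(theta) = a·b / (||a|| ||b||). A minimal sketch using the numpy norm imported earlier:

def cosine_sim_manual(a, b):
    # dot product divided by the product of the vector lengths
    return np.dot(a, b) / (norm(a) * norm(b))

print(cosine_sim_manual(np.array([1.0, 0.5]), np.array([0.9, 0.6])))  # ~0.99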
features = ['instrumentalness', 'acousticness', 'danceability', 'energy', 'liveness',
            'loudness', 'speechiness', 'tempo', 'valence', 'polarity']
def combine_all_features(row):
    # join the feature values into one space-separated "document" per track
    return " ".join(str(row[f]) for f in features)
df['combined_features'] = df.apply(combine_all_features, axis=1)
cv = CountVectorizer()
count_matrix = cv.fit_transform(df['combined_features'])
cosine_sim = cosine_similarity(count_matrix)
def fetch_artist_from_index_1(index):
    return df[df.index == index]["artist"].values[0]
def fetch_index_from_artist_1(artist):
    return df[df.artist == artist].index.values[0]
user_choice_for_singer = "Taylor Swift"
artist_index = fetch_index_from_artist_1(user_choice_for_singer)
similar_artists = list(enumerate(cosine_sim[artist_index]))
# sort by similarity, skipping the top entries (the query's own tracks)
sim_artist_sort = sorted(similar_artists, key=lambda x: x[1], reverse=True)[2:]
print("Top 10 similar artists/singers to " + user_choice_for_singer + " are:\n")
for i, (idx, score) in enumerate(sim_artist_sort[:10], start=1):
    print(i, '->', fetch_artist_from_index_1(idx))
    print("   Similarity score: %.4f" % score)
Top 10 similar artists/singers to Taylor Swift are:

1 -> Leona Lewis
   Similarity score: 0.1818
2 -> Donovan Woods
   Similarity score: 0.1818
3 -> James Newton Howard
   Similarity score: 0.1818
4 -> Leon Bridges
   Similarity score: 0.1741
5 -> Stealth
   Similarity score: 0.1741
6 -> Billie Eilish
   Similarity score: 0.1741
7 -> Carly Rae Jepsen
   Similarity score: 0.1741
8 -> OMI
   Similarity score: 0.1741
9 -> Leon Bridges
   Similarity score: 0.1741
10 -> Sticky Fingers
   Similarity score: 0.1741
# the combined_features column and CountVectorizer similarity matrix built above are reused here
def fetch_song_from_index_2(index):
    return df[df.index == index]["track"].values[0]
def fetch_artist_from_index_2(index):
    return df[df.index == index]["artist"].values[0]
def fetch_index_from_song_2(track):
    return df[df.track == track].index.values[0]
user_choice_song = "Everybody Wants To Rule The World"
song_index = fetch_index_from_song_2(user_choice_song)
similar_songs = list(enumerate(cosine_sim[song_index]))
# sort by similarity, skipping the first four entries (the query song and its duplicates)
similar_songs_sorted = sorted(similar_songs, key=lambda x: x[1], reverse=True)[4:]
print("Top 10 similar songs to " + user_choice_song + " are:\n")
for i, (idx, score) in enumerate(similar_songs_sorted[:10], start=1):
    print(i, '->', fetch_song_from_index_2(idx) + " By " + fetch_artist_from_index_2(idx))
    print("   Similarity score: %.4f" % score)
Top 10 similar songs to Everybody Wants To Rule The World are:

1 -> High Life By Manic Drive
   Similarity score: 0.1907
2 -> Rolling in the Deep By Adele
   Similarity score: 0.1907
3 -> Hurt Like That By Katelyn Tarver
   Similarity score: 0.1741
4 -> I Like Me Better By Lauv
   Similarity score: 0.1741
5 -> Touch By Sleeping At Last
   Similarity score: 0.1612
6 -> Black Hole - Acoustic Version By Griff
   Similarity score: 0.1005
7 -> I Like That By Bazzi
   Similarity score: 0.1005
8 -> My Universe By Coldplay
   Similarity score: 0.1005
9 -> Queen (feat. Quinn XCII) By ayokay
   Similarity score: 0.1005
10 -> 3 Nights By Dominic Fike
   Similarity score: 0.0953
# for the mood-based recommendations, drop polarity and rebuild the combined features
features = ['instrumentalness', 'acousticness', 'danceability', 'energy', 'liveness',
            'loudness', 'speechiness', 'tempo', 'valence']
def combine_all_features(row):
    return " ".join(str(row[f]) for f in features)
df['combined_features'] = df.apply(combine_all_features, axis=1)
# Initialize tfidf vectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')
# Fit and transform
tfidf_matrix = tfidf.fit_transform(df['combined_features'])
cosine_sim = cosine_similarity(tfidf_matrix)
# cv = CountVectorizer()
# count_matrix = cv.fit_transform(df['combined_features'])
# cosine_sim = cosine_similarity(count_matrix)
def fetch_song_from_index(index):
    return df[df.index == index]["track"].values[0]
def fetch_artist_from_index(index):
    return df[df.index == index]["artist"].values[0]
def fetch_index_from_mood(mood):
    # index of the first track labelled with the given mood
    return df[df.mood == mood].index.values[0]
user_choice_mood = "Sad"
song_index = fetch_index_from_mood(user_choice_mood)
similar_songs = list(enumerate(cosine_sim[song_index]))
# sort by similarity, skipping the query track itself
similar_songs_sorted = sorted(similar_songs, key=lambda x: x[1], reverse=True)[1:]
print("Top 10 similar songs that match the " + user_choice_mood + " mood of your songs:\n")
for i, (idx, score) in enumerate(similar_songs_sorted[:10], start=1):
    print(i, '->', fetch_song_from_index(idx) + " By " + fetch_artist_from_index(idx))
    print("   Similarity score: %.4f" % score)
Top 10 similar songs that match the Sad mood of your songs:

1 -> Like I Did By JC Stewart
   Similarity score: 1.0000
2 -> Moral of the Story By Ashe
   Similarity score: 0.1860
3 -> Neva Cared By Polo G
   Similarity score: 0.1787
4 -> Superhero By Hayd
   Similarity score: 0.1461
5 -> Overpass Graffiti By Ed Sheeran
   Similarity score: 0.1444
6 -> Ain't No Mountain High Enough - Stereo Version By Marvin Gaye
   Similarity score: 0.1443
7 -> One Day By Tate McRae
   Similarity score: 0.1424
8 -> Feeling Whitney By Post Malone
   Similarity score: 0.1374
9 -> Gimme! Gimme! Gimme! (A Man After Midnight) By ABBA
   Similarity score: 0.1370
10 -> Falling Up By Dean Lewis
   Similarity score: 0.1332
The recommendations made for each given user input are accurate, aside from a few tracks that do not match the criteria. Generally, the model can recommend songs based on the user's most recently listened-to artist, a chosen track, or the dominant mood of all recently played songs.
By using the Spotify API, we were able to scrape data from both the user's recently played songs and playlists created by Spotify and other Spotify users, resulting in our two main datasets. The Spotify API was also used to obtain the audio features of each track, which give us more information about the track and ultimately allow us to predict its mood. Additionally, we scraped lyrics for each track to further help determine the moods of the tracks, using the Genius API, the BeautifulSoup4 library, and the lyrics_extractor library. After we obtained our data, we did some data cleaning and preparation, which included removing rows with NaN values in the lyrics and normalizing the loudness in the audio features. This stage also included performing some text pre-processing on the lyrics to make them easier to run NLP on.
We did some EDA to understand our datasets better before moving on to the next steps. NLP was performed on the lyrics of the songs to determine each song's sentiment, which helps in determining its mood. Although the NLP libraries used (TextBlob and Vader) did not perform as well as expected, it was concluded that in some cases it is fine for there to be a disparity between the mood of the lyrics and the mood of the song.
With all our data cleaned and ready, there was one step left before giving the user recommendations. Using KNN imputation, we assigned a mood to the user's songs. Since KNN imputation is multivariate, we were able to use the audio features and the lyric sentiment to find the tracks closest across all of these features and thus predict the mood of each song. Before resorting to imputation, multiple machine learning models were tried, but none performed well enough (none reached an accuracy above 0.5). Some deep learning models were also tried, as seen in this article, but they did not perform any better than the machine learning algorithms. So, in the end, imputation was the best choice for its ease of use, and it also labelled the songs accurately.
Finally, we determined which mood was most prevalent among the user's most recently listened-to songs and recommended similar songs from our dataset. We also looked at their most listened-to artist and recommended similar artists, and at their most listened-to song and recommended similar songs. The recommender system utilized content-based filtering, as we have the user's personal data and preferences. We used both the TF-IDF vectorizer and the count vectorizer, which convert a given set of strings into a frequency representation that makes it easier to find similarities.[1] Although it mostly returned similar songs, it sometimes recommended songs that did not match the criteria.
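For a concrete picture of the difference between the two, a small sketch on toy strings (get_feature_names_out assumes a recent scikit-learn):

docs = ["happy sad happy", "happy energetic"]
cv_demo = CountVectorizer().fit(docs)
print(cv_demo.get_feature_names_out())    # ['energetic' 'happy' 'sad']
print(cv_demo.transform(docs).toarray())  # raw counts: [[0 2 1], [1 1 0]]
tfidf_demo = TfidfVectorizer().fit(docs)
# tf-idf down-weights 'happy', which appears in every document
print(tfidf_demo.transform(docs).toarray().round(2))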
In conclusion, the project delivers a recommender system that can impute the mood of a user's recently played songs and then recommend songs that sound similar and carry a similar mood.