The machine learning approach
So far we’ve been able to uncover some interesting insights about the data. To keep this article concise, let’s move straight to the machine learning algorithm that was used.
I wanted to build a model to predict which class, hit or non-hit, a song is most likely to belong to, based on a set of explanatory variables, as explained at the beginning of this article.
In its raw state, the collected data was not ready for training a machine learning model, so I preprocessed it: first with the SMOTE technique, because the hit class was under-represented compared to the non-hit class, and then with other feature engineering techniques to standardize the data.
I also wanted a model that would perform as well as possible. To that end, I trained several algorithms and compared their results on the selected evaluation criteria. It turned out that the LightGBM classification algorithm was the one that, at training time, best detected the patterns between the explanatory variables and the variable to be predicted (hit).
For a first-level performance analysis of the model, we will use the confusion matrix. Visualizing the confusion matrix will allow us to understand the errors our classifier makes on the test subset.
This matrix measures the quality of a classification system. In binary classification, the main diagonal represents the observations correctly classified by the model, and the secondary diagonal those classified incorrectly. The most frequent mistake made by the model is therefore to have classified a song as a non-hit when in reality it was a hit (129 cases), which is precisely a Type II error.
A Type I error occurs when the model classifies a song as a hit when it is actually a non-hit (a false hit); a Type II error is the reverse, the case where it classifies a song as a non-hit when it is actually a hit (a false non-hit). If we put ourselves in the shoes of a music producer, a Type I error is less acceptable than a Type II error: we wouldn’t want to incur all the expenses related to producing and promoting a song that the model predicted would be a hit, only for it not to be one in the end. The Type I error should therefore be kept minimal.
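Both error types can be read straight off the matrix. Here is a small sketch with toy labels (0 = non-hit, 1 = hit), not the article’s actual test subset:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_test = np.array([0, 0, 0, 1, 1, 0, 1, 0])  # toy true labels
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0])  # toy model predictions

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
# Main diagonal (tn, tp): correct classifications.
# fp = Type I error (false hit); fn = Type II error (false non-hit).
print(cm)
```

With scikit-learn’s row/column convention (rows = true class, columns = predicted class), the false hits sit in the top-right cell and the false non-hits in the bottom-left one.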
The classification report
The classification report presents statistics calculated from the data in the confusion matrix. Each metric describes a different aspect of the classification. We will use this report for a second-level performance analysis of the model.
The accuracy, which globally measures the percentage of correct classifications performed by the model, is 96%. Since the test subset is imbalanced, this percentage is inflated by the over-represented class, in this case the non-hit class, so this metric is not the best one we could use.
The recall measures, for each class, the percentage of occurrences correctly classified by the model. A classification is correct when the predicted class matches the actual class. On the one hand, of the 5,484 non-hit songs we used to test the model, 98% were correctly classified. On the other hand, the algorithm correctly classified only 52% of the 271 hit songs we submitted to it. In short, it is harder for the algorithm to classify a hit song as a hit (true hit) than to classify a non-hit song as a non-hit (true non-hit).
The precision levels for the non-hit and hit classes are 0.98 and 0.55, respectively. This means that 98% of all the songs the model classified as non-hits are indeed non-hits, while only 55% of the songs it predicted as hits really are.
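Such a report can be produced in one call. The toy labels below are illustrative, so the numbers will not match the article’s figures:

```python
import numpy as np
from sklearn.metrics import classification_report

y_test = np.array([0] * 8 + [1] * 4)                  # 8 non-hits, 4 hits
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# output_dict=True also exposes each metric programmatically
rep = classification_report(
    y_test, y_pred, target_names=["non-hit", "hit"], output_dict=True
)
print(classification_report(y_test, y_pred, target_names=["non-hit", "hit"]))
```

Each row of the printed table gives one class’s precision, recall, and F1, which is exactly the structure the analysis above walks through.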
Our model performs better on non-hit songs. This is most likely because this class had far more data from the start: pattern detection between the non-hit class of the dependent variable and the explanatory variables is probably favored as a result.
Once satisfied with the model’s performance, I set out to understand which explanatory variables are most influential in determining the class of a given song. For that, I used the feature importance plot below.
We can see that a song’s popularity on Spotify is the most important variable in predicting which class it is most likely to belong to. Then, the artist’s popularity on Spotify, their number of followers, and the number of markets in which the song is available constitute a second wave of determining variables. Finally come mainly the audio-related variables, with a relatively similar level of influence.
This analysis draws attention to something major. Essentially, a song is a hit if it is popular on Spotify, is performed by an artist who is also popular on Spotify and has a significant number of followers, and finally, if it is available in the greatest number of countries across the world. This conclusion seems logical, and … Eurêka🙂, it is also verified empirically by our model.
To better appreciate the relevance of this conclusion, it should be borne in mind that Billboard’s year-end top 100 music list is based primarily on commercial performance. Indeed, this ranking is a faithful reflection of physical and digital sales, radio listening and music streaming in the United States, all of them income-generating activities, directly or indirectly.
The more popular a song is on Spotify, the more it is streamed online; more streams translate into more revenue, because for each song a subscriber listens to, the streaming platform pays a fee to the artist or the music label. The artist’s popularity on Spotify and the number of followers they have there are channels that amplify streams and sales, which in turn increase the revenue their music generates.
The more revenue a song generates, the more likely it is to appear on Billboard’s year-end top 100 list, and therefore the more likely it is to be classified as a hit by our model, since the variables that mainly determine a song’s revenue are the most influential ones in the classification process, according to our importance plot (logical, isn’t it😉 ?).
A song usually has another important characteristic that we haven’t yet taken into account: the lyrics. Can we further improve the model’s performance by using them? That is what we will explore in the second part of this article.