hitpredict: predicting billboard hits using spotify...

1
HitPredict: Predicting Billboard Hits Using Spotify Data Elena Georgieva, Marcella Suta, Nicholas Burton Professors Andrew Ng and Ron Dror Stanford Machine Learning Poster Session | Stanford, California 2018 Contact: {egeorgie, msuta, ngburton} @stanford.edu Abstract Features and Data Ten audio features were extracted from the Spotify API 4 (Table 1). We created the Artist Score metric, assigning a score of 1 to a song if the artist previously had a Billboard hit, and 0 otherwise. Results Methods Algorithms Supervised Learning: data split 75/25 into training/ validation. Logistic Regression and GDA yielded the strongest results. Bagging using random forests corrected SVM from over-fitting. Decision Tree performs poorly as it suers from severe over-fitting. Neural Network with regularization, using one hidden layer of six units with the sigmoid activation function. The L 2 regularization function was applied to the cost function to avoid over-fitting. Audio Features Danceability Liveness Instrumentalness Speechiness Acousticness Loudness Valence Tempo Energy Artist Score 73.4 73.4 72 100 76.8 75.9 73.7 72.8 51.5 76.5 0 25 50 75 100 Logistic Regression GDA SVM (w/ Bagging) Decision Tree Neural Network Training (75%) Validation (25%) Figure 3. Billboard hit prediction accuracy results for five machine- learning algorithms. LR and NN give the highest prediction accuracy on the validation set. Table 1. Audio features extracted from Spotify’s API. Spotify assigns each song a value between 0 and 1 for these features, except loudness which is measured in decibels. Data for ~4000 songs was collected from Billboard.com 3 and the Million Song Dataset 5 . Songs were from 1990-2018. Songs were labeled 1 or 0 based on Billboard success. Audio features for each song were extracted from the Spotify Web API 4 . Five machine-learning algorithms were used to predict a song’s Billboard success. Figure 2. A plot of songs’ danceability vs. energy vs. loudness (dB). Black circles represent Billboard hits and red marks represent non-hits. References [1] Billboard. (2018). Billboard Hot 100 Chart. [2] Chinoy, S. and Ma, J. (2018). Why Songs of the Summer Sound the Same. Nytimes.com. [3] Guo, A. Python API for Billboard Data. Github.com. [4] Spotify Web API. https:// developer.spotify.com/ [5] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. ISMIR Conference, 2011. The Billboard Hot 100 Chart 1 remains one of the definitive ways to measure the success of a popular song. We investigated using machine learning techniques to predict which songs will become Billboard Hot 100 Hits. We were able to predict the Billboard success of a song with ~75% accuracy using machine-learning algorithms including Logistic Regression, GDA, SVM, Decision Trees and Neural Networks. Figure 1. Illustration of audio features for the 5 top tracks of December 2018. Our algorithm predicted their Billboard success with 100% accuracy. Logistic Regression Neural Network Feature Accuracy Feature Accuracy Artist Score 72.9% Danceability 65.3% Instrumental 73.2% Acousticness 69.6% Danceability 73.2% Speechiness 73.0% Accousticness 75.3% Valence 73.4% Speechiness 75.8% Energy 74.9% Loudness 75.8% Artist Score 74.6% Tempo 75.9% Instrumental 75.1% Valence 75.7% Tempo 76.5% Energy 74.0% Liveness 76.4% Liveness 74.3% Loudness 72.7% Table 2. Error analysis for the two strongest- performing algorithms. The features at the end of the list decreased the accuracy of predictions. Increasing order of importance 50 55 60 65 70 75 80 85 90 95 100 Logistic Regression Neural Network Figure 4. Algorithms yield higher accuracy for more recent songs. Features of pop songs are unique to their time period. Figure 5. Features of songs released in winter vary from features of other songs. We did not observe the same trends for song of summer. 50 55 60 65 70 75 80 85 90 95 100 Sept - May Summer Feb - Oct Winter Logistic Regression Neural Network 0 0.5 1 Dance Energy Speech Acoustic Liveness Valence Top Songs: December 2018 "Sicko Mode" Travis Scott "Thank You, Next" Ariana Grande "Happier" Marshmello & Bastille "Without Me" Halsey "Girls Like You" Maroon 5 ft. Cardi B.

Upload: others

Post on 20-Apr-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: HitPredict: Predicting Billboard Hits Using Spotify Datacs229.stanford.edu/proj2018/poster/16.pdf · 60 65 70 75 80 85 90 95 100 Logistic Regression Neural Network Figure 4. Algorithms

HitPredict: Predicting Billboard Hits Using Spotify Data!Elena Georgieva, Marcella Suta, Nicholas Burton

Professors Andrew Ng and Ron Dror

Stanford Machine Learning Poster Session | Stanford, California 2018 Contact: {egeorgie, msuta, ngburton} @stanford.edu

Abstract

Features and Data •  Ten audio features were extracted from the Spotify API4

(Table 1). •  We created the Artist Score metric, assigning a score of 1

to a song if the artist previously had a Billboard hit, and 0 otherwise.

Results

Methods

Algorithms •  Supervised Learning: data split 75/25 into training/

validation. Logistic Regression and GDA yielded the strongest results.

•  Bagging using random forests corrected SVM from over-fitting.

•  Decision Tree performs poorly as it suffers from severe over-fitting.

•  Neural Network with regularization, using one hidden layer of six units with the sigmoid activation function. The L2 regularization function was applied to the cost function to avoid over-fitting.

Audio Features Danceability Liveness Instrumentalness Speechiness Acousticness Loudness Valence Tempo Energy Artist Score

73.4 73.4 72

100

76.875.9 73.7 72.8

51.5

76.5

0

25

50

75

100

LogisticRegression

GDA SVM(w/Bagging)

DecisionTree

NeuralNetwork

Training(75%)Validation(25%) Figure 3. Billboard hit

prediction accuracy results for five machine-learning algorithms. LR and NN give the highest prediction accuracy on the validation set.

Table 1. Audio features extracted from Spotify’s API. Spotify assigns each song a value between 0 and 1 for these features, except loudness which is measured in decibels.

•  Data for ~4000 songs was collected from Billboard.com3 and the Million Song Dataset5. Songs were from 1990-2018.

•  Songs were labeled 1 or 0 based on Billboard success.

•  Audio features for each song were extracted from the Spotify Web API4.

•  Five machine-learning algorithms were used to predict a song’s Billboard success.

Figure 2. A plot of songs’ danceability vs. energy vs. loudness (dB). Black circles represent Billboard hits and red marks represent non-hits.

References [1] Billboard. (2018). Billboard Hot 100 Chart. [2] Chinoy, S. and Ma, J. (2018). Why Songs of the Summer Sound the Same. Nytimes.com. [3] Guo, A. Python API for Billboard Data. Github.com. [4] Spotify Web API. https://developer.spotify.com/ [5] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. ISMIR Conference, 2011.

•  The Billboard Hot 100 Chart1 remains one of the definitive ways to measure the success of a popular song. We investigated using machine learning techniques to predict which songs will become Billboard Hot 100 Hits.

•  We were able to predict the Billboard success of a song with ~75% accuracy using machine-learning algorithms including Logistic Regression, GDA, SVM, Decision Trees and Neural Networks.

Figure 1. Illustration of audio features for the 5 top tracks of December 2018. Our algorithm predicted their Billboard success with 100% accuracy.

Logistic Regression Neural Network Feature Accuracy Feature Accuracy Artist Score 72.9% Danceability 65.3% Instrumental 73.2% Acousticness 69.6% Danceability 73.2% Speechiness 73.0% Accousticness 75.3% Valence 73.4% Speechiness 75.8% Energy 74.9% Loudness 75.8% Artist Score 74.6% Tempo 75.9% Instrumental 75.1% Valence 75.7% Tempo 76.5% Energy 74.0% Liveness 76.4% Liveness 74.3% Loudness 72.7%

Table 2. Error analysis for the two strongest-performing algorithms. The features at the end of the list decreased the accuracy of predictions.

Incr

easi

ng o

rder

of i

mpo

rtan

ce

50556065707580859095100 LogisticRegression

NeuralNetwork

Figure 4. Algorithms yield higher accuracy for more recent songs. Features of pop songs are unique to their time period.

Figure 5. Features of songs released in winter vary from features of other songs. We did not observe the same trends for song of summer.

50556065707580859095100

Sept-May

Summer Feb-Oct Winter

LogisticRegressionNeuralNetwork

0

0.5

1Dance

Energy

Speech

Acoustic

Liveness

Valence

TopSongs:December2018"SickoMode"TravisScott

"ThankYou,Next"ArianaGrande"Happier"Marshmello&Bastille"WithoutMe"Halsey

"GirlsLikeYou"Maroon5ft.CardiB.