Data Science I | Harvard University | Fall 2017
Spotify is a music, podcast, and video streaming service. It provides content from record labels and media companies that is protected by digital rights management (DRM).
One of Spotify’s primary products is Playlists, collections of tracks that users (and/or Spotify) can build for any mood or event. With over 40 million songs available, the company attempts to direct the most relevant songs to users based on their preferences.
These Playlists are compiled in a complex manner, involving both human-led and computer-led processes. What is clear is that algorithmically curated discovery playlists, and their effectiveness, remain an important business interest for the company.
The overarching goal of this project is to understand how these algorithms can be evaluated and improved with machine learning techniques.
Spotify’s business model is, to a significant extent, centered around providing its users with relevant songs based on user inputs and historical preferences. Being able to recommend appropriate playlists to its users is hence of vital importance. With this motivation in mind, the two problem statements are:
1. Which predictors and which model can predict the success of a Spotify playlist (i.e., its number of followers) more accurately out-of-sample than a simple baseline model, and how well do these predictors match the expectations gained from exploratory data analysis?
2. Using this improved model, generate playlists according to user-specified filters such that the resultant playlists are deemed to have a high probability of being successful.
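To make the first problem statement concrete, the sketch below shows one way an out-of-sample comparison against a simple baseline could be set up. It is only an illustration under stated assumptions: the data are synthetic, the single predictor and the response are hypothetical stand-ins (not the project's actual predictors or follower counts), and the mean predictor and one-variable least-squares fit are chosen for brevity rather than taken from the project's models.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination; it can be negative on held-out data."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_mean_baseline(y_train):
    """Baseline: always predict the mean of the training response."""
    mean_y = sum(y_train) / len(y_train)
    return lambda x: mean_y


def fit_simple_ols(x_train, y_train):
    """One-predictor least squares: y ~ intercept + slope * x."""
    n = len(x_train)
    mean_x = sum(x_train) / n
    mean_y = sum(y_train) / n
    sxx = sum((x - mean_x) ** 2 for x in x_train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Synthetic stand-in data (assumption: NOT the project's data set).
rng = random.Random(0)
x = [rng.uniform(0, 100) for _ in range(200)]        # hypothetical predictor
y = [2.0 + 0.05 * xi + rng.gauss(0, 1) for xi in x]  # hypothetical success measure

# Hold out the last 50 observations as a test set.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

baseline = fit_mean_baseline(y_train)
model = fit_simple_ols(x_train, y_train)

# Score both on data that neither has seen during fitting.
baseline_r2 = r2_score(y_test, [baseline(xi) for xi in x_test])
model_r2 = r2_score(y_test, [model(xi) for xi in x_test])

print(f"baseline test R^2: {baseline_r2:.3f}")
print(f"model test R^2:    {model_r2:.3f}")
```

Because the baseline predicts a constant, its test-set R^2 can never exceed zero, so any candidate model with a positive test-set R^2 improves on it by this measure.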
High-level Data Statistics
Metric | Statistic |
---|---|
Unique Playlists | 1,420 |
Unique Tracks | 72,789 |
Unique Artists | 28,915 |
Unique Predictors | >3,000 |
Please refer to Data Mining & Wrangling for more information on the data mining and wrangling procedures employed.
Please refer to Exploratory Data Analysis for more in-depth insights into the exploratory data analysis.
Please refer to Baseline Models for more information on the baseline models employed.
Please refer to Advanced Models for more information on the advanced models employed.
Please refer to Model Inference for more information on the chosen model.
Please refer to Playlist Generation & Conclusion for more information.
Caruana, R., & Niculescu-Mizil, A. (2005). An Empirical Comparison of Supervised Learning Algorithms Using Different Performance Metrics.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An Introduction to Statistical Learning with Applications in R. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
We would be remiss not to mention both the Spotify API:
As well as the useful Python API extension:
Finally, of course: