Spotify Playlist Data Analysis

Data Science I, Harvard University, Fall 2017

Introduction

Spotify is a music, podcast, and video streaming service. It provides content from record labels and media companies that is protected by digital rights management (DRM).

One of Spotify’s primary products is Playlists: collections of tracks that users (and/or Spotify) can build for any mood or event. With over 40 million songs available, the company attempts to direct the most relevant songs to users based on their preferences.

These Playlists are compiled through a combination of human-led and computer-led processes. What is clear is that algorithmically curated discovery playlists, and their effectiveness, remain an important business interest for the company.

The overarching goal of this project is to understand how these algorithms can be evaluated and improved with machine learning techniques.

Problem Statement and Motivation

Spotify’s business model is, to a significant extent, centered around providing its users with relevant songs based on user inputs and historical preferences. Being able to recommend appropriate playlists to its users is hence of vital importance. With this motivation in mind, the two problem statements are:

1. What predictors and what model can be used to determine the success of a Spotify playlist (i.e., its number of followers) more accurately out-of-sample than a simple baseline model, and how well do these predictors match the expectations gained from exploratory data analysis?

2. Using this improved model, generate playlists according to user-specified filters such that the resultant playlists are deemed to have a high probability of being successful.
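To make the first problem statement concrete, the sketch below compares a mean-only baseline with a one-predictor least-squares fit by out-of-sample R². It is only an illustration under stated assumptions: the data are synthetic, the single predictor and the log-follower response are hypothetical stand-ins, and the models actually used in the project are described in the Baseline Models and Advanced Models sections, not here.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y_true, y_pred):
    """R^2 measured against the mean of the evaluation set itself."""
    y_bar = mean(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    sst = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - sse / sst


# Synthetic stand-in data (assumption): x is a hypothetical playlist-level
# predictor, y is a hypothetical log-follower count that depends on x.
rng = random.Random(0)
x = [rng.random() for _ in range(200)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 0.5) for xi in x]

# Hold out the last 50 playlists so both models are scored out-of-sample.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Baseline: always predict the training-set mean.
baseline_pred = [mean(y_train)] * len(y_test)

# Candidate model: simple linear regression on the single predictor.
intercept, slope = fit_simple_ols(x_train, y_train)
model_pred = [intercept + slope * xi for xi in x_test]

r2_baseline = r_squared(y_test, baseline_pred)
r2_model = r_squared(y_test, model_pred)
print(f"baseline out-of-sample R^2: {r2_baseline:.3f}")
print(f"model out-of-sample R^2:    {r2_model:.3f}")
```

Because the baseline predicts a constant, its out-of-sample R² cannot exceed zero under this definition, which gives a natural reference point for any candidate model.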

Introduction and Description of Data

High-level Data Statistics

Metric               Statistic
Unique Playlists     1,420
Unique Tracks        72,789
Unique Artists       28,915
Unique Predictors    >3,000
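As a rough illustration of how counts like these can be derived, the sketch below tallies unique playlists, tracks, and artists from a flat list of playlist-track rows. The row layout and the identifiers are hypothetical; the project's actual data structures are covered in Data Mining & Wrangling.

```python
# Hypothetical rows (assumption): one (playlist_id, track_id, artist_id)
# tuple per track occurrence in a playlist.
rows = [
    ("pl_1", "tr_a", "ar_x"),
    ("pl_1", "tr_b", "ar_y"),
    ("pl_2", "tr_a", "ar_x"),
    ("pl_2", "tr_c", "ar_x"),
    ("pl_3", "tr_b", "ar_y"),
]


def count_unique(rows):
    """Count distinct playlists, tracks, and artists across all rows."""
    playlists = {playlist_id for playlist_id, _, _ in rows}
    tracks = {track_id for _, track_id, _ in rows}
    artists = {artist_id for _, _, artist_id in rows}
    return {
        "Unique Playlists": len(playlists),
        "Unique Tracks": len(tracks),
        "Unique Artists": len(artists),
    }


summary = count_unique(rows)
for metric, value in summary.items():
    print(f"{metric}: {value:,}")
```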

Please refer to Data Mining & Wrangling for more information on the data mining and wrangling procedures employed.

Please refer to Exploratory Data Analysis for more in-depth insights into the exploratory data analysis.

Modeling Approach and Project Trajectory

Please refer to Baseline Models for more information on the baseline models employed.

Please refer to Advanced Models for more information on the advanced models employed.

Results, Conclusions, and Future Work

Please refer to Model Inference for more information on the chosen model.

Please refer to Playlist Generation & Conclusion for more information.

Caruana, R., & Niculescu-Mizil, A. (2005). An Empirical Comparison of Supervised Learning Algorithms Using Different Performance Metrics.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An Introduction to Statistical Learning with Applications in R. Springer.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.

We would be remiss not to mention the Spotify API:

Spotify Developer API

As well as the useful Python API extension:

Spotipy Python Library

Finally, of course:

Harvard Data Science I