Data Mining & Wrangling

Source Code

Please refer to Data Mining & Wrangling for the source code (Jupyter Notebook).

Data Mining

Connect With The Spotify API

To pull playlist data from the Spotify API, we first need to establish a connection. This requires two credentials, a “client ID” and a “client secret”. Once both are obtained, the connection is set up as follows:

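import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Replace the placeholders with your own credentials.
client_id = "client_id"
client_secret = "client_secret_id"

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)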

The aim of this project is twofold: (i) to identify the key predictors (whether track features or artist features) that are statistically significant in determining a playlist’s success, measured by its number of followers; and (ii) to create a custom playlist that is expected to be successful (i.e., to attract many followers).

To this end, the first step is to obtain the playlists on which we want to run our predictions. We focus on Spotify’s own “featured” playlists, i.e., those produced by Spotify itself around specific genres, moods, artists, etc.

The initial step is to pull Spotify’s featured playlists and obtain a number of base playlist features. The obtained baseline playlist features are converted into a large dataframe next.

data = pd.DataFrame(np.array(spotify_playlists).reshape(len(spotify_playlists),6), 
                    columns=['Name', 'No. of Tracks', 'ID', 'URI', 'HREF', 'Public'])
data.head()
   Name              No. of Tracks  ID                      URI                                                HREF                                               Public
0  Today's Top Hits  50             37i9dQZF1DXcBWIGoYBM5M  spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...  https://api.spotify.com/v1/users/spotify/playl...  True
1  RapCaviar         63             37i9dQZF1DX0XUsuxWHRQd  spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...  https://api.spotify.com/v1/users/spotify/playl...  True
2  mint              61             37i9dQZF1DX4dyzvuaRJ0n  spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...  https://api.spotify.com/v1/users/spotify/playl...  True
3  Are & Be          51             37i9dQZF1DX4SBhb3fqCJd  spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...  https://api.spotify.com/v1/users/spotify/playl...  True
4  Rock This         64             37i9dQZF1DXcF6B6QPhFDv  spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...  https://api.spotify.com/v1/users/spotify/playl...  True
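As an alternative to reshaping a flat array, the same table can be built directly from a list of per-playlist records, which keeps the track counts numeric and the `Public` flag boolean. The sketch below assumes each API item is a dict with `name`, `id`, `uri`, `href`, `public`, and a nested `tracks.total`; the helper name `playlists_to_frame` and the sample record are illustrative and are not taken from the notebook.

```python
import pandas as pd

COLUMNS = ["Name", "No. of Tracks", "ID", "URI", "HREF", "Public"]

def playlists_to_frame(items):
    """Convert playlist items (dicts) into the six-column playlist table."""
    rows = [
        {
            "Name": item["name"],
            "No. of Tracks": item["tracks"]["total"],
            "ID": item["id"],
            "URI": item["uri"],
            "HREF": item["href"],
            "Public": item["public"],
        }
        for item in items
    ]
    return pd.DataFrame(rows, columns=COLUMNS)

# Illustrative record only; real items would come from the API response.
sample_items = [
    {
        "name": "Example Playlist",
        "id": "abc123",
        "uri": "spotify:playlist:abc123",
        "href": "https://api.spotify.com/v1/playlists/abc123",
        "public": True,
        "tracks": {"total": 50},
    },
]
example_df = playlists_to_frame(sample_items)
```

Building the frame from dicts also makes the column order explicit, so it does not depend on the order in which values were appended to a flat list.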

For each playlist, the number of followers is obtained; this number will be the response variable for our regression-based models. Finally, the number of followers is added to the playlist dataframe as a new column.

data['Followers'] = pd.DataFrame({'Followers': playlist_follower})
data.head()
   Name              No. of Tracks  ID                      URI                                                HREF                                               Public  Followers
0  Today's Top Hits  50             37i9dQZF1DXcBWIGoYBM5M  spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...  https://api.spotify.com/v1/users/spotify/playl...  True    18247159.0
1  RapCaviar         63             37i9dQZF1DX0XUsuxWHRQd  spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...  https://api.spotify.com/v1/users/spotify/playl...  True    8375355.0
2  mint              61             37i9dQZF1DX4dyzvuaRJ0n  spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...  https://api.spotify.com/v1/users/spotify/playl...  True    4616753.0
3  Are & Be          51             37i9dQZF1DX4SBhb3fqCJd  spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...  https://api.spotify.com/v1/users/spotify/playl...  True    3806312.0
4  Rock This         64             37i9dQZF1DXcF6B6QPhFDv  spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...  https://api.spotify.com/v1/users/spotify/playl...  True    4004115.0

Following the steps outlined above, we produce a dataframe of more than 1,400 playlists with relevant information such as playlist ID, number of playlist tracks, and number of playlist followers.

Collect Spotify Audio Features Per Track in Playlist

Using the dataframe of playlists, and specifically the playlist ID column, we iterate over all tracks in every playlist and pull the audio features that could help predict the success of a playlist. Audio features include acousticness, energy, key, valence, and others.

To this end, we define a function that pulls all tracks of a playlist.

def get_playlist_tracks(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks

Extracting features from Spotify can take a significant amount of time and tends to raise errors along the way. To avoid losing information when such an error occurs, results are cached in an in-memory dictionary, so tracks that were already fetched are skipped. Audio features are extracted using the code below; note that running it on all playlists takes hours.

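import random
import sys
import time

import numpy as np

for item, song in enumerate(songs):
    if song not in audio_feat:
        try:
            audio_feat[song] = sp.audio_features(song)
        except Exception:
            # Skip tracks that fail; they stay out of the cache and can be retried later.
            pass

        # Pause briefly at intervals to throttle requests.
        if item % limit_songs_small == 0:
            time.sleep(random.randint(0, 1))

        if item % limit_songs_medium == 0:
            time.sleep(random.randint(0, 1))

        # Report progress as a percentage of all songs.
        out = np.floor(item * 1. / len(songs) * 100)
        sys.stdout.write("\r%d%%" % out)
        sys.stdout.flush()

sys.stdout.write("\r%d%%" % 100)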
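Because the extraction runs for hours, it may be worth persisting the cache to disk and retrying transient failures, so that an interrupted run can resume where it stopped. The following is a minimal sketch of that idea under stated assumptions: `fetch_with_cache`, its parameters, and the JSON cache file are hypothetical names, and `fetch` stands in for any call such as `sp.audio_features` whose result is JSON-serializable.

```python
import json
import os
import time

def fetch_with_cache(keys, fetch, cache_path, max_retries=3,
                     base_delay=1.0, checkpoint_every=100):
    """Fetch a value for every key, skipping keys already cached on disk."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)

    def save():
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)

    for n, key in enumerate(keys, start=1):
        if key in cache:
            continue
        for attempt in range(max_retries):
            try:
                cache[key] = fetch(key)
                break
            except Exception:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                time.sleep(base_delay * 2 ** attempt)
        # Keys that still fail after max_retries are left out of the cache.
        if n % checkpoint_every == 0:
            save()
    save()
    return cache
```

A call might look like `audio_feat = fetch_with_cache(songs, sp.audio_features, "audio_feat.json")`; rerunning the same call after a crash would only request the tracks that are still missing.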

Once all the audio features are extracted, they are converted into the main audio feature dataframe and saved as a large csv file.

acousticness dance duration energy instrumentalness key liveness loudness mode playlist song speech tempo time valence
0 0.365 0.307 258933 0.481 0 3 0.207 -8.442 0 37i9dQZF1DXcBWIGoYBM5M 00kkWwGsR9HblTUHb3BmdX 0.128 68.894 3 0.329
1 0.993 0.322 160897 0.0121 0.927 5 0.127 -31.994 1 37i9dQZF1DXcBWIGoYBM5M 01T3AjynqSMVfiAQCAfrKJ 0.0491 112.464 4 0.118
2 0.994 0.375 58387 0.00406 0.908 7 0.0842 -31.824 0 37i9dQZF1DXcBWIGoYBM5M 02BumRY2OTFMkMxrXSVMat 0.0671 139.682 1 0.358
3 0.992 0.393 288280 0.0429 0.925 9 0.0821 -25.727 0 37i9dQZF1DXcBWIGoYBM5M 02mkkozonPEDCenOhuWwLc 0.0341 135.405 4 0.0394
4 0.992 0.373 99867 0.117 0.909 10 0.111 -25.222 0 37i9dQZF1DXcBWIGoYBM5M 02xmGU9unopKjpblPRC67j 0.0511 125.288 3 0.189
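The conversion from the cache to a dataframe is not shown above; one way it could be done is sketched here. It assumes each cache value is the list returned for a single track (one feature dict, or `None` when no features are available) and that failed tracks may be missing or `None`. The helper name `audio_cache_to_frame` and the sample values are illustrative.

```python
import pandas as pd

def audio_cache_to_frame(audio_feat):
    """Flatten {track_id: [feature_dict] or None} into one row per track."""
    rows = []
    for track_id, result in audio_feat.items():
        # Skip tracks for which nothing usable was returned.
        if not result or result[0] is None:
            continue
        row = dict(result[0])
        row["song"] = track_id
        rows.append(row)
    return pd.DataFrame(rows)

# Illustrative cache contents only.
sample_cache = {
    "track_a": [{"energy": 0.48, "valence": 0.33, "tempo": 68.9}],
    "track_b": [None],  # no features available for this track
    "track_c": None,    # nothing was cached for this track
}
features_frame = audio_cache_to_frame(sample_cache)
```

Saving with `features_frame.to_csv("tracks_df.csv", index=False)` would avoid writing the row index, which is what produces `Unnamed: 0` columns when a CSV is read back.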

Collect Spotify Artist Information Per Track in Playlist

Following a similar procedure as the audio feature extraction, artist information for every track in every playlist is extracted next.

First, a function is defined to retrieve artist information given an artist name.

def get_artist(name):
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    if len(items) > 0:
        return items[0]
    else:
        return None

Again, an in-memory dictionary is set up as a cache for the main artist feature extraction loop. Artist features are extracted using the code below; note that running it on all playlists takes hours.

for item, artist in enumerate(artists):
    if artist not in artist_info:
        try:
            artist_info[artist] = get_artist(artist)
        except Exception:
            # Skip artists that fail; they stay out of the cache and can be retried later.
            pass

    # Pause briefly at intervals to throttle requests.
    if item % limit_artist_small == 0:
        time.sleep(random.randint(0, 1))

    if item % limit_artist_medium == 0:
        time.sleep(random.randint(0, 1))

    # Report progress as a percentage of all artists.
    out = np.floor(item * 1. / len(artists) * 100)
    sys.stdout.write("\r%d%%" % out)
    sys.stdout.flush()

sys.stdout.write("\r%d%%" % 100)

Once all the artist features are extracted, they are converted into the main artist feature dataframe and saved as a large csv file.

   artist     followers  genres                                             playlist                popularity  song
0  10 Years   157035     [alternative metal, nu metal, post-grunge, rap...  37i9dQZF1DXcF6B6QPhFDv  63          0uyDAijTR0tOuH24hxDhE5
1  21 Savage  2323273    [dwn trap, rap, trap music]                        37i9dQZF1DX0XUsuxWHRQd  98          2vaMWMPMgsWX4fwJiKmdWm
2  24hrs      28839      [dwn trap, trap music, underground hip hop]        37i9dQZF1DX0XUsuxWHRQd  73          2c5D6B8oXAwc6easamdgVA
3  3LAU       175224     [big room, brostep, deep big room, edm, electr...  37i9dQZF1DX4JAvHpjipBk  67          6yxobtnNHKRAA0cvoNxJhe
4  50 Cent    2686486    [east coast hip hop, gangster rap, hip hop, po...  37i9dQZF1DX0XUsuxWHRQd  85          32aYDW8Qdnv1ur89TUlDnm

Data Wrangling

Loading Data Frames

Once all data is extracted from Spotify, the next step is to combine the separate dataframes (i.e., for playlists, audio features and artists) and to perform some initial feature engineering in the hope of creating useful data for inference and prediction of playlist success.

The first step is to load all the dataframes separately, beginning with the playlist dataframe.

playlist_df = pd.read_csv('Playlist.csv')
playlist_df.head()
   Unnamed: 0  Name              No. of Tracks  ID                      URI                                                HREF                                               Public  Followers
0  0           Today's Top Hits  50             37i9dQZF1DXcBWIGoYBM5M  spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...  https://api.spotify.com/v1/users/spotify/playl...  True    18079985.0
1  1           RapCaviar         61             37i9dQZF1DX0XUsuxWHRQd  spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...  https://api.spotify.com/v1/users/spotify/playl...  True    8283836.0
2  2           mint              61             37i9dQZF1DX4dyzvuaRJ0n  spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...  https://api.spotify.com/v1/users/spotify/playl...  True    4593498.0
3  3           Are & Be          51             37i9dQZF1DX4SBhb3fqCJd  spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...  https://api.spotify.com/v1/users/spotify/playl...  True    3773823.0
4  4           Rock This         60             37i9dQZF1DXcF6B6QPhFDv  spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...  https://api.spotify.com/v1/users/spotify/playl...  True    3989695.0

Next, the track feature dataframe is loaded.

tracks_df = pd.read_csv('tracks_df_sub.csv').drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
tracks_df.head()
acousticness dance duration energy instrumentalness key liveness loudness mode playlist song speech tempo time valence
0 0.039500 0.299 214973 0.9210 0.737000 4 0.5890 -6.254 1 37i9dQZF1DXcBWIGoYBM5M 0076oEQq8IToGfnzU3bTHY 0.1930 174.982 4 0.0532
1 0.365000 0.307 258933 0.4810 0.000000 3 0.2070 -8.442 0 37i9dQZF1DXcBWIGoYBM5M 00kkWwGsR9HblTUHb3BmdX 0.1280 68.894 3 0.3290
2 0.078700 0.630 261731 0.6560 0.000906 0 0.0953 -6.423 0 37i9dQZF1DXcBWIGoYBM5M 01JkrDSrakX5UO5knhpKNA 0.0276 133.012 4 0.4320
3 0.000192 0.521 188834 0.8370 0.051000 5 0.0929 -4.581 1 37i9dQZF1DXcBWIGoYBM5M 01KsbekyuQQXpVnxIfNRaC 0.1220 80.027 4 0.6230
4 0.993000 0.322 160897 0.0121 0.927000 5 0.1270 -31.994 1 37i9dQZF1DXcBWIGoYBM5M 01T3AjynqSMVfiAQCAfrKJ 0.0491 112.464 4 0.1180

Finally, the artist dataframe is loaded.

artist_df_sub = pd.read_csv('artist_df_sub.csv').drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
artist_df_sub.head()
   artist     followers  genres                                             playlist                popularity  song
0  *NSYNC     498511.0   ['boy band', 'dance pop', 'europop', 'pop', 'p...  37i9dQZF1DWXDAhqlN7e6W  75.0        35zGjsxI020C2NPKp2fzS7
1  10 Years   154800.0   ['alternative metal', 'nu metal', 'post-grunge...  37i9dQZF1DWWJOmJ7nRx0C  63.0        4qmoz9OUEBaXUzlWQX4ZU4
2  2 Chainz   1926728.0  ['dwn trap', 'pop rap', 'rap', 'southern hip h...  37i9dQZF1DX7QOv5kjbU68  91.0        4XoP1AkbOurU9CeZ2rMEz2
3  21 Savage  2224587.0  ['dwn trap', 'rap', 'trap music']                  37i9dQZF1DX7QOv5kjbU68  98.0        4ckuS4Nj4FZ7i3Def3Br8W
4  24hrs      27817.0    ['dwn trap', 'trap music', 'underground hip hop']  37i9dQZF1DX0XUsuxWHRQd  74.0        2c5D6B8oXAwc6easamdgVA

As shown above, each artist is described by a list of genres rather than a single genre. The genre lists are therefore one-hot encoded (one indicator column per genre) so that they can be used as predictors in our models.

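from sklearn.preprocessing import MultiLabelBinarizer

# Split each genre string on commas and one-hot encode the resulting tokens.
mlb = MultiLabelBinarizer(sparse_output=True)
pre_data = mlb.fit_transform(artist_df_sub['genres'].str.split(','))
classes = [i.strip('[]') for i in mlb.classes_]
genre_sub = pd.DataFrame(pre_data.toarray(),columns=classes)
# Keep only the first column for each distinct genre name.
_, i = np.unique(genre_sub.columns, return_index=True)
genre_sub = genre_sub.iloc[:, i]

artist_df_sub_mid = artist_df_sub.drop('genres', axis=1)

# Attach the genre indicator columns to the remaining artist columns.
artist_sub_frames = [artist_df_sub_mid,genre_sub]
artist_df = pd.concat(artist_sub_frames,axis=1,join='inner')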
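Since the CSV stores each genre list as a Python-style string such as `['dwn trap', 'rap', 'trap music']`, another way to encode it is to parse the string back into a list first. The sketch below is an alternative under that assumption, using only pandas; `one_hot_genres` and the sample rows are illustrative. Note that it yields clean genre names (e.g. `rap`), whereas the columns in the master dataframe shown later keep their quotes (e.g. `'zydeco'`).

```python
import ast
import pandas as pd

def one_hot_genres(df, column="genres"):
    """Return df with the genre-list column replaced by 0/1 indicator columns."""
    # Parse "['rap', 'trap music']" into a real list; missing values become [].
    parsed = df[column].apply(
        lambda s: ast.literal_eval(s) if isinstance(s, str) else []
    )
    # One row per (original row, genre); rows with no genres drop out here.
    exploded = parsed.explode().dropna()
    dummies = pd.get_dummies(exploded).astype(int).groupby(level=0).max()
    # Restore rows that had no genres as all-zero rows.
    dummies = dummies.reindex(df.index, fill_value=0)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Illustrative rows only.
sample = pd.DataFrame({
    "artist": ["21 Savage", "24hrs", "Unknown"],
    "genres": [
        "['dwn trap', 'rap', 'trap music']",
        "['dwn trap', 'trap music']",
        "[]",
    ],
})
encoded = one_hot_genres(sample)
```

Parsing with `ast.literal_eval` avoids the stray brackets, quotes, and leading spaces that a plain split on commas leaves in the tokens.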

Once all the genres are one-hot encoded, the dataframes are grouped by playlist to enable feature engineering.

Feature Engineering

Artist Variables

For artists, feature engineering led to the following predictors:

First, the top 50 artists (by number of Spotify followers) are extracted. We then count the number of times these artists appear in a given playlist and record the counts as predictors in the final dataframe. Second, we obtain the list of the 30 artists who appear most often in playlists with 35,000+ followers. Third, all the genres in a playlist are encoded as binary values in the one-hot encoded genre columns.
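The notebook's code for the top-artist counts is not reproduced here; the following is a hedged sketch of one way such counts could be computed. It assumes an artist dataframe with `artist`, `followers`, and `playlist` columns, and it infers the rank buckets from column names such as `top_0_10` and `top_10_20` in the table below. The function name `top_artist_counts` and the toy data are hypothetical.

```python
import pandas as pd

def top_artist_counts(artist_df, n_top=50, bucket=10):
    """Count, per playlist, appearances of the most-followed artists by rank bucket."""
    # Rank unique artists by follower count (rank 0 = most followed).
    ranked = (artist_df.drop_duplicates("artist")
              .sort_values("followers", ascending=False)
              .head(n_top)["artist"]
              .reset_index(drop=True))
    rank_of = {name: rank for rank, name in ranked.items()}

    ranks = artist_df["artist"].map(rank_of)  # NaN for artists outside the top n
    in_top = artist_df[ranks.notna()].copy()
    lo = (ranks[ranks.notna()] // bucket * bucket).astype(int)
    in_top["bucket"] = "top_" + lo.astype(str) + "_" + (lo + bucket).astype(str)

    counts = in_top.groupby(["playlist", "bucket"]).size().unstack(fill_value=0)
    # Playlists without any top artist still get a row of zeros.
    return counts.reindex(artist_df["playlist"].unique(), fill_value=0)

# Toy data: with n_top=2 and bucket=1, artist A is rank 0 and artist B is rank 1.
toy = pd.DataFrame({
    "playlist": ["p1", "p1", "p2", "p3"],
    "artist": ["A", "B", "A", "C"],
    "followers": [100, 50, 100, 10],
})
toy_counts = top_artist_counts(toy, n_top=2, bucket=1)
```

With the defaults (`n_top=50`, `bucket=10`), the resulting columns would be named `top_0_10` through `top_40_50`.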

Finally, the main artist data frame is created below:

artist_features_df['Playlist_Followers'] = playlist_df[['Followers']].groupby(playlist_df['ID']).first()
artist_features_df['ID'] = artist_features_df.index

artist_main_df = artist_features_df.reset_index().drop(0, axis=1)
artist_main_df.head()
followers_mean followers_std popularity_mean popularity_std top_0_10 top_10_20 top_20_30 top_30_40 top_40_50 Playlist_Followers ID
0 134413.666667 3.654590e+05 42.833333 19.575645 0 0 0 0 0 24.0 01WIu4Rst0xeZnTunWxUL7
1 103320.580645 3.320150e+05 48.903226 15.029648 0 0 0 0 0 330.0 05dTMGk8MjnpQg3bKuoXcc
2 566814.560000 1.427308e+06 60.280000 15.512146 0 0 0 1 0 73.0 070FVPBKvfu6M5tf4I9rt2
3 199831.484848 2.953859e+05 58.696970 15.627470 0 0 0 0 0 6173.0 08vPKM3pmoyF6crB2EtASQ
4 223253.774194 4.918438e+05 49.516129 19.489948 0 0 0 0 0 145.0 08ySLuUm0jMf7lJmFwqRMu

Audio Feature Variables

Similar to the artist feature engineering, the playlists’ audio features are engineered next. Specifically, for each audio feature mined from Spotify (such as acousticness, duration, and energy), the mean and standard deviation across all tracks in a playlist are computed.
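A compact way to compute such per-playlist statistics is a grouped aggregation, sketched below. It assumes a track dataframe shaped like `tracks_df` above, with a `playlist` column, a `song` identifier column, and numeric feature columns; the function name `aggregate_audio_features` and the toy values are illustrative.

```python
import pandas as pd

def aggregate_audio_features(tracks_df, group_col="playlist", id_col="song"):
    """Per-playlist mean and standard deviation of every numeric audio feature."""
    stats = (tracks_df.drop(columns=[id_col])
             .groupby(group_col)
             .agg(["mean", "std"]))
    # Flatten the (feature, statistic) column pairs into names like 'energy_mean'.
    stats.columns = [f"{feature}_{stat}" for feature, stat in stats.columns]
    return stats

# Toy data only.
toy_tracks = pd.DataFrame({
    "playlist": ["p1", "p1", "p2"],
    "song": ["s1", "s2", "s3"],
    "energy": [0.25, 0.75, 0.5],
    "key": [2, 4, 7],
})
toy_stats = aggregate_audio_features(toy_tracks)
```

Deriving the output names from the aggregation itself keeps every statistic tied to its source column. Pandas uses the sample standard deviation by default, so a playlist with a single track gets `NaN` for every `_std` column.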

The engineered audio features are converted into a dataframe as follows:

features_df['Followers'] = playlist_df[['Followers']].groupby(playlist_df['ID']).first()
features_df['ID'] = features_df.index

features_main_df = features_df.reset_index().drop(0, axis=1)
features_main_df.head()
acousticness_mean acousticness_std dance_mean dance_std energy_mean energy_std instrumentalness_mean instrumentalness_std key_mean key_std ... speech_mean speech_std tempo_mean tempo_std time_mean time_std valence_mean valence_std Followers ID
0 0.641282 0.326942 0.467911 0.241057 0.275940 0.225821 0.119650 0.277109 0.275940 0.225821 ... 0.383051 0.403365 101.045969 51.857504 3.338462 1.553996 0.319263 0.246235 24.0 01WIu4Rst0xeZnTunWxUL7
1 0.249844 0.321182 0.555140 0.172088 0.666567 0.230578 0.077776 0.240452 0.666567 0.230578 ... 0.137260 0.226812 130.850167 30.525135 4.000000 0.454859 0.496127 0.256787 6198.0 056jpfChuMP5D1NMMaDXRR
2 0.278816 0.262749 0.634392 0.140270 0.596000 0.166902 0.192559 0.341460 0.596000 0.166902 ... 0.082210 0.131105 122.768255 28.215783 4.000000 0.200000 0.656235 0.245299 330.0 05dTMGk8MjnpQg3bKuoXcc
3 0.228810 0.251421 0.600400 0.178801 0.612200 0.192433 0.179571 0.336604 0.612200 0.192433 ... 0.052150 0.025935 114.439167 21.997673 4.000000 0.262613 0.481787 0.251199 73.0 070FVPBKvfu6M5tf4I9rt2
4 0.394114 0.362573 0.599424 0.151256 0.541097 0.289705 0.203059 0.332371 0.541097 0.289705 ... 0.106724 0.112448 110.134788 25.125111 4.000000 0.353553 0.511997 0.243171 6173.0 08vPKM3pmoyF6crB2EtASQ

5 rows × 26 columns

Finally, the last step is to create the main dataframe with an inner merge of the audio feature dataframe and the artist dataframe. This inner merge leads to a loss of 126 playlists in total (i.e., playlists that appear in only one of the two dataframes).

master_df = pd.merge(features_main_df, artist_df_groups, how='inner', on='ID')
master_df.head()
acousticness_mean acousticness_std dance_mean dance_std energy_mean energy_std instrumentalness_mean instrumentalness_std key_mean key_std ... 'wrestling' 'wrock' 'ye ye' 'yoik' 'zapstep' 'zeuhl' 'zim' 'zolo' 'zydeco' 'no_genre'
0 0.641282 0.326942 0.467911 0.241057 0.275940 0.225821 0.119650 0.277109 0.275940 0.225821 ... 0 0 0 0 0 0 0 0 0 1
1 0.278816 0.262749 0.634392 0.140270 0.596000 0.166902 0.192559 0.341460 0.596000 0.166902 ... 0 0 0 0 0 0 0 0 0 1
2 0.228810 0.251421 0.600400 0.178801 0.612200 0.192433 0.179571 0.336604 0.612200 0.192433 ... 0 0 0 0 0 0 0 0 0 1
3 0.394114 0.362573 0.599424 0.151256 0.541097 0.289705 0.203059 0.332371 0.541097 0.289705 ... 0 0 0 0 0 0 0 0 0 1
4 0.194509 0.278470 0.531067 0.150001 0.759400 0.249805 0.115499 0.258020 0.759400 0.249805 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 3245 columns
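To see how many playlists an inner merge drops from each side, the keys can first be compared with an outer merge and pandas' `indicator` option. This is a small sketch with toy frames; `merge_report` is a hypothetical helper, and `ID` is assumed to be the shared key as in the merge above.

```python
import pandas as pd

def merge_report(left, right, key="ID"):
    """Inner-merge two frames and report how many keys each side loses."""
    probe = pd.merge(left[[key]].drop_duplicates(),
                     right[[key]].drop_duplicates(),
                     how="outer", on=key, indicator=True)
    left_only = int((probe["_merge"] == "left_only").sum())
    right_only = int((probe["_merge"] == "right_only").sum())
    merged = pd.merge(left, right, how="inner", on=key)
    return merged, left_only, right_only

# Toy frames only.
audio = pd.DataFrame({"ID": ["a", "b", "c"], "energy_mean": [0.1, 0.2, 0.3]})
artists = pd.DataFrame({"ID": ["b", "c", "d"], "followers_mean": [10.0, 20.0, 30.0]})
merged, lost_left, lost_right = merge_report(audio, artists)
```

The sum of the two counts is the number of playlists that fail to survive the inner merge.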

The master dataframe is then saved for both EDA and modeling purposes; its final size is shown below.

Number of Playlists: 1420
Number of Predictors: 3245

String Parsing / Natural Language Processing

Here, we further analyze the playlist names, based on the rationale that listeners usually search for key terms like ‘Best’, ‘Hit’, or ‘Workout’ when they look for certain types of playlists. Due to the relatively small size of our data set, we adopt a string parsing approach for our model (which could be easily scaled with Python’s NLTK package for larger data sets or more advanced modeling).

new_df = pd.merge(full_df, playlist_df[['Name', 'ID']], on='ID', how='left')

String parsing follows the below example methods:

Str_Best = full_df_concise.Name.str.contains('Best|Top|Hit|best|top|hit|Hot|hot|Pick|pick')
Str_Workout = full_df_concise.Name.str.contains('Workout|workout|Motivation|motivation|Power|power|Cardio|cardio')
Str_Party = full_df_concise.Name.str.contains('Party|party')
Str_Chill = full_df_concise.Name.str.contains('Chill|chill|Relax|relax')
Str_Acoustic = full_df_concise.Name.str.contains('Acoustic|acoustic')
Str_2000s = full_df_concise.Name.str.contains('20')
Str_1990s = full_df_concise.Name.str.contains('90|91|92|93|94|95|96|97|98|99')
Str_1980s = full_df_concise.Name.str.contains('80|81|82|83|84|85|86|87|88|89')
Str_1970s = full_df_concise.Name.str.contains('70|71|72|73|74|75|76|77|78|79')
Str_1960s = full_df_concise.Name.str.contains('60|61|62|63|64|65|66|67|68|69')
Str_1950s = full_df_concise.Name.str.contains('50s')
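The same kind of flags can be written more compactly with case-insensitive matching, which removes the need to list both capitalizations. The sketch below is illustrative only: the dictionary `keyword_patterns`, the helper `keyword_flags`, and the sample names are assumptions. Its decade pattern is deliberately narrower than the digit lists above, since it matches only decade mentions such as "90s" or "1990s" and not individual years.

```python
import pandas as pd

keyword_patterns = {
    "Str_Best": r"best|top|hit|hot|pick",
    "Str_Workout": r"workout|motivation|power|cardio",
    "Str_Chill": r"chill|relax",
    # Decade mentions such as "90s", "1990s" or "'90s" (not single years).
    "Str_1990s": r"\b(?:19)?90'?s\b",
}

def keyword_flags(names, patterns=keyword_patterns):
    """Return one boolean column per pattern, matched case-insensitively."""
    names = names.fillna("")
    return pd.DataFrame({
        col: names.str.contains(pat, case=False, regex=True)
        for col, pat in patterns.items()
    })

# Illustrative playlist names only.
sample_names = pd.Series(
    ["Today's Top Hits", "90s Rock Anthems", "Cardio", "Deep Focus", None]
)
flags = keyword_flags(sample_names)
```

A pattern must not end in `|`: the empty alternative matches every string, so the flag would be true for all playlists.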

Interaction Terms with Audio Features and Genre

The following section describes the process of creating interaction terms between genres and audio features. Interaction terms are considered because genre may have an effect on the relationships between audio features and the number of playlist followers. For example, different levels of energy may be more popular for rap music than for acoustic music.

The first step is to bucket the genres (more than 100 specific genres in total) into broader categories. As listed below, some of the most common broad genres include house, hip hop, pop, dance, r&b, rap, acoustic, and soul.

broad_genres = ['house','hip hop','pop','dance','r&b','rap','acoustic','soul']

Next, interaction terms are generated between genre categories and certain audio features. Below are the interaction terms that are created. These features are selected through a separate analysis in which all of the genres, audio features, and all possible interactions are used as predictors to model the number of playlist followers. We find that the interaction terms listed below are significant.

interaction_columns = ['house_acousticness_mean','hip hop_acousticness_std','pop_liveness_std','dance_liveness_std',
                      'r&b_acousticness_std','rap_energy_std','rap_key_std','acoustic_acousticness_std','acoustic_acousticness_mean',
                      'acoustic_energy_std','acoustic_key_std','soul_acousticness_std']
house_acousticness_mean hip hop_acousticness_std pop_liveness_std dance_liveness_std r&b_acousticness_std rap_energy_std rap_key_std acoustic_acousticness_std acoustic_acousticness_mean acoustic_energy_std acoustic_key_std soul_acousticness_std
count 1420.000000 1418.000000 1418.000000 1418.000000 1418.000000 1418.000000 1418.000000 1418.000000 1420.000000 1418.000000 1418.000000 1418.000000
mean 0.224109 0.235339 0.156279 0.137165 0.239961 0.210305 0.210305 0.102606 0.115892 0.080324 0.080324 0.173310
std 0.212280 0.144852 0.056181 0.073726 0.143718 0.094412 0.094412 0.150939 0.190786 0.117964 0.117964 0.162756
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.129984 0.111896 0.160846 0.204460 0.204460 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.221718 0.302918 0.155873 0.149616 0.306497 0.238572 0.238572 0.000000 0.000000 0.000000 0.000000 0.240984
75% 0.366849 0.341949 0.185451 0.180391 0.344083 0.267752 0.267752 0.285148 0.228570 0.220703 0.220703 0.332228
max 0.961000 0.428986 0.351859 0.351859 0.444861 0.371096 0.371096 0.444861 0.961000 0.347747 0.347747 0.420705
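The construction of these columns is not shown above. One plausible construction, consistent with the many zeros in the summary table, is a broad-genre indicator multiplied by an audio feature; the sketch below illustrates that reading and should be taken as an assumption. The helpers `bucket_genres` and `add_interactions`, the substring-based bucketing rule, and the toy data are all hypothetical.

```python
import pandas as pd

def bucket_genres(genre_columns, broad_genres):
    """Map each broad genre to the specific genre columns whose name contains it."""
    return {broad: [c for c in genre_columns if broad in c] for broad in broad_genres}

def add_interactions(df, genre_cols_by_bucket, pairs):
    """Add '<bucket>_<feature>' columns: genre indicator times audio feature."""
    out = df.copy()
    for bucket, feature in pairs:
        # A playlist belongs to a bucket if any of its specific genres is present.
        indicator = (df[genre_cols_by_bucket[bucket]].sum(axis=1) > 0).astype(int)
        out[f"{bucket}_{feature}"] = indicator * df[feature]
    return out

# Toy data only.
toy_master = pd.DataFrame({
    "deep house": [1, 0],
    "pop rap": [0, 1],
    "acousticness_mean": [0.2, 0.5],
    "energy_std": [0.3, 0.1],
})
buckets = bucket_genres(["deep house", "pop rap"], ["house", "rap"])
toy_inter = add_interactions(
    toy_master, buckets,
    [("house", "acousticness_mean"), ("rap", "energy_std")],
)
```

Plain substring matching is a crude bucketing rule: for example, `rap` also matches `trap music`, so a real mapping would likely need manual review.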

The final dataframe is now complete. We use this dataframe and its features to conduct EDA and to construct models in the following sections.