Data Mining & Wrangling

Source Code
Data Mining
Data Wrangling

Source Code

Please refer to Data Mining & Wrangling for the source code (Jupyter Notebook).

Data Mining

Connect With The Spotify API

To pull playlist data from the Spotify API, we first need to establish a connection with it. This requires two credentials: a “client ID” and a “client secret”. Once these are obtained, the connection is set up as follows:

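import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholders: substitute the credentials issued for your Spotify application
client_id = "client_id"
client_secret = "client_secret_id"

client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)

sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)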
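Hard-coding credentials in a notebook makes them easy to leak. As a minimal sketch, the two values could instead be read from environment variables. The helper name `load_spotify_credentials` is hypothetical; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones spotipy itself falls back to when no credentials are passed explicitly.

```python
import os


def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables."""
    # `env` can be any mapping; it defaults to the real process environment.
    env = os.environ if env is None else env
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable: {missing}") from None
```

The returned pair can then be passed to `SpotifyClientCredentials` in place of the literal strings.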

Collect Spotify’s Featured Playlist Data

The project has two goals: (i) to identify the key predictors (whether track features or artist features) that are statistically significant in determining a playlist’s success, measured by its number of followers; and (ii) to create a custom playlist that is expected to be successful (i.e., to attract many followers).

To this end, the first step is to obtain the playlists we want to run our predictions on. We focus on Spotify’s own “featured” playlists, i.e., those produced by Spotify itself for specific genres, moods, artists, etc.

We first pull Spotify’s featured playlists together with a number of base playlist features. These baseline features are then converted into a dataframe.
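The construction of the `spotify_playlists` list is not shown here. As a rough sketch, assuming each playlist arrives as a dictionary with the fields of Spotify’s playlist object (`name`, `tracks.total`, `id`, `uri`, `href`, `public`), one playlist could be flattened into the six values that become a dataframe row. The helper name `playlist_to_row` and the example item are hypothetical.

```python
def playlist_to_row(item):
    """Flatten one playlist object into the six fields used as columns."""
    return [
        item["name"],
        item["tracks"]["total"],  # number of tracks in the playlist
        item["id"],
        item["uri"],
        item["href"],
        item["public"],
    ]


# Hand-written item shaped like one entry of an API response (made-up values).
example_item = {
    "name": "Example Playlist",
    "tracks": {"total": 50},
    "id": "example_id",
    "uri": "spotify:playlist:example_id",
    "href": "https://api.spotify.com/v1/playlists/example_id",
    "public": True,
}
row = playlist_to_row(example_item)
```

Collecting one such row per playlist gives a list that reshapes cleanly into a six-column dataframe.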

data = pd.DataFrame(np.array(spotify_playlists).reshape(len(spotify_playlists),6), 
                    columns=['Name', 'No. of Tracks', 'ID', 'URI', 'HREF', 'Public'])
data.head()

	Name	No. of Tracks	ID	URI	HREF	Public
0	Today's Top Hits	50	37i9dQZF1DXcBWIGoYBM5M	spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...	https://api.spotify.com/v1/users/spotify/playl...	True
1	RapCaviar	63	37i9dQZF1DX0XUsuxWHRQd	spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...	https://api.spotify.com/v1/users/spotify/playl...	True
2	mint	61	37i9dQZF1DX4dyzvuaRJ0n	spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...	https://api.spotify.com/v1/users/spotify/playl...	True
3	Are & Be	51	37i9dQZF1DX4SBhb3fqCJd	spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...	https://api.spotify.com/v1/users/spotify/playl...	True
4	Rock This	64	37i9dQZF1DXcF6B6QPhFDv	spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...	https://api.spotify.com/v1/users/spotify/playl...	True

For each playlist, the number of followers is obtained; this number is the response variable for our regression-based models. The follower counts are then added to the playlist dataframe as a new column.

data['Followers'] = pd.DataFrame({'Followers': playlist_follower})
data.head()

	Name	No. of Tracks	ID	URI	HREF	Public	Followers
0	Today's Top Hits	50	37i9dQZF1DXcBWIGoYBM5M	spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...	https://api.spotify.com/v1/users/spotify/playl...	True	18247159.0
1	RapCaviar	63	37i9dQZF1DX0XUsuxWHRQd	spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...	https://api.spotify.com/v1/users/spotify/playl...	True	8375355.0
2	mint	61	37i9dQZF1DX4dyzvuaRJ0n	spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...	https://api.spotify.com/v1/users/spotify/playl...	True	4616753.0
3	Are & Be	51	37i9dQZF1DX4SBhb3fqCJd	spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...	https://api.spotify.com/v1/users/spotify/playl...	True	3806312.0
4	Rock This	64	37i9dQZF1DXcF6B6QPhFDv	spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...	https://api.spotify.com/v1/users/spotify/playl...	True	4004115.0

Following the steps outlined above, we produce a dataframe of more than 1,400 playlists with relevant information such as playlist ID, number of playlist tracks, and number of playlist followers.

Collect Spotify Audio Features Per Track in Playlist

Using the dataframe of playlists, and specifically the playlist ID column, we iterate over all tracks in every playlist and pull the audio features that could help predict the success of a playlist. Audio features include acousticness, energy, key, valence, and so on.

To this end, we define a function that pulls all tracks of a playlist.

def get_playlist_tracks(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks
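The function above relies on pagination: each response carries a batch of `items` plus a `next` pointer, and the loop keeps requesting pages until `next` is empty. The same pattern can be sketched without a network connection. Here `collect_all_items` and the fake pages are hypothetical stand-ins for the API client and its responses.

```python
def collect_all_items(first_page, fetch_next):
    """Gather the 'items' of every page, following 'next' until it is empty."""
    items = list(first_page["items"])  # copy so the first page is not mutated
    page = first_page
    while page["next"]:
        page = fetch_next(page)
        items.extend(page["items"])
    return items


# Fake three-page response standing in for the API (made-up data).
later_pages = {
    "p2": {"items": [3, 4], "next": "p3"},
    "p3": {"items": [5], "next": None},
}
first = {"items": [1, 2], "next": "p2"}
all_items = collect_all_items(first, lambda page: later_pages[page["next"]])
```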

Extracting features from Spotify can take a significant amount of time, and requests tend to raise errors along the way. To avoid losing information when such an error occurs, results are cached in an in-memory dictionary, so songs that have already been fetched are skipped when the loop is re-run. Audio features are extracted using the code below; note that running it on all playlists takes hours.

for item,song in enumerate(songs):
    if song not in audio_feat:
        try:
            audio_feat[song] = sp.audio_features(song)
        except Exception:
            # Skip failed requests; the song stays out of the cache and is retried on a re-run
            pass

        if item % limit_songs_small == 0:
            time.sleep(random.randint(0, 1))

        if item % limit_songs_medium == 0:
            time.sleep(random.randint(0, 1))

        out = np.floor(item * 1. / len(songs_playlist) * 100)
        sys.stdout.write("\r%d%%" % out)
        sys.stdout.flush()

sys.stdout.write("\r%d%%" % 100)
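An in-memory dictionary protects against individual request errors, but not against a kernel restart during a run that lasts hours. One possible extension, sketched below under stated assumptions, is to checkpoint the cache to a JSON file at regular intervals. The names `fetch_with_checkpoint`, `_save`, `cache_path`, and `save_every` are hypothetical, `fetch` stands in for a call such as `sp.audio_features`, and keys are assumed to be strings (as Spotify IDs are).

```python
import json
import os
import tempfile


def _save(cache, cache_path):
    """Write to a temporary file first so a crash cannot corrupt the cache."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(cache, fh)
    os.replace(tmp_path, cache_path)


def fetch_with_checkpoint(keys, fetch, cache_path, save_every=100):
    """Fetch each key once, saving the cache every `save_every` new results."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)  # resume from an earlier run
    new_results = 0
    for key in keys:
        if key in cache:
            continue
        try:
            cache[key] = fetch(key)
        except Exception:
            continue  # leave the key out so a later run retries it
        new_results += 1
        if new_results % save_every == 0:
            _save(cache, cache_path)
    _save(cache, cache_path)
    return cache
```

Calling the function a second time with the same `cache_path` only requests the keys that failed or were never reached.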

Once all the audio features are extracted, they are converted into the main audio feature dataframe and saved as a large csv file.

	acousticness	dance	duration	energy	instrumentalness	key	liveness	loudness	mode	playlist	song	speech	tempo	time	valence
0	0.365	0.307	258933	0.481	0	3	0.207	-8.442	0	37i9dQZF1DXcBWIGoYBM5M	00kkWwGsR9HblTUHb3BmdX	0.128	68.894	3	0.329
1	0.993	0.322	160897	0.0121	0.927	5	0.127	-31.994	1	37i9dQZF1DXcBWIGoYBM5M	01T3AjynqSMVfiAQCAfrKJ	0.0491	112.464	4	0.118
2	0.994	0.375	58387	0.00406	0.908	7	0.0842	-31.824	0	37i9dQZF1DXcBWIGoYBM5M	02BumRY2OTFMkMxrXSVMat	0.0671	139.682	1	0.358
3	0.992	0.393	288280	0.0429	0.925	9	0.0821	-25.727	0	37i9dQZF1DXcBWIGoYBM5M	02mkkozonPEDCenOhuWwLc	0.0341	135.405	4	0.0394
4	0.992	0.373	99867	0.117	0.909	10	0.111	-25.222	0	37i9dQZF1DXcBWIGoYBM5M	02xmGU9unopKjpblPRC67j	0.0511	125.288	3	0.189

Collect Spotify Artist Information Per Track in Playlist

Following a similar procedure as the audio feature extraction, artist information for every track in every playlist is extracted next.

First, a function is defined to retrieve artist information given an artist name.

def get_artist(name):
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    if len(items) > 0:
        return items[0]
    else:
        return None
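Because `get_artist` returns the first search hit, it may pick a different artist when several names are similar. A slightly stricter selection, sketched here as an assumption rather than as the approach used in this project, prefers an exact case-insensitive name match and only then falls back to the first hit. The helper name `pick_artist` is hypothetical, and each item is assumed to be a dictionary with a `name` key.

```python
def pick_artist(items, name):
    """Prefer an exact, case-insensitive name match; else use the first hit."""
    if not items:
        return None
    wanted = name.strip().casefold()
    for item in items:
        if item.get("name", "").strip().casefold() == wanted:
            return item
    return items[0]


# Made-up search results for illustration.
hits = [{"name": "Example Band Tribute"}, {"name": "example band"}]
chosen = pick_artist(hits, "Example Band")
```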

Again, an in-memory dictionary is set up as a cache for the main artist feature extraction loop. Artist features are extracted using the code below; note that running it on all playlists takes hours.

for item,artist in enumerate(artists):
    if artist not in artist_info:
        try:
            artist_info[artist] = get_artist(artist)
        except Exception:
            # Skip failed requests; the artist stays out of the cache and is retried on a re-run
            pass
    
    if item % limit_artist_small == 0:
        time.sleep(random.randint(0, 1))
    
    if item % limit_artist_medium == 0:
        time.sleep(random.randint(0, 1))
        
    out = np.floor(item * 1. / len(artists) * 100)
    sys.stdout.write("\r%d%%" % out)
    sys.stdout.flush()

sys.stdout.write("\r%d%%" % 100)

Once all the artist features are extracted, they are converted into the main artist feature dataframe and saved as a large csv file.

	artist	followers	genres	playlist	popularity	song
0	10 Years	157035	[alternative metal, nu metal, post-grunge, rap...	37i9dQZF1DXcF6B6QPhFDv	63	0uyDAijTR0tOuH24hxDhE5
1	21 Savage	2323273	[dwn trap, rap, trap music]	37i9dQZF1DX0XUsuxWHRQd	98	2vaMWMPMgsWX4fwJiKmdWm
2	24hrs	28839	[dwn trap, trap music, underground hip hop]	37i9dQZF1DX0XUsuxWHRQd	73	2c5D6B8oXAwc6easamdgVA
3	3LAU	175224	[big room, brostep, deep big room, edm, electr...	37i9dQZF1DX4JAvHpjipBk	67	6yxobtnNHKRAA0cvoNxJhe
4	50 Cent	2686486	[east coast hip hop, gangster rap, hip hop, po...	37i9dQZF1DX0XUsuxWHRQd	85	32aYDW8Qdnv1ur89TUlDnm

Data Wrangling

Loading Data Frames

Once all data is extracted from Spotify, the next step is to combine the separate dataframes (i.e., for playlists, audio features and artists) and to perform some initial feature engineering in the hope of creating useful data for inference and prediction of playlist success.

The first step is to load each dataframe separately, beginning with the playlist dataframe.

playlist_df = pd.read_csv('Playlist.csv')
playlist_df.head()

	Unnamed: 0	Name	No. of Tracks	ID	URI	HREF	Public	Followers
0	0	Today's Top Hits	50	37i9dQZF1DXcBWIGoYBM5M	spotify:user:spotify:playlist:37i9dQZF1DXcBWIG...	https://api.spotify.com/v1/users/spotify/playl...	True	18079985.0
1	1	RapCaviar	61	37i9dQZF1DX0XUsuxWHRQd	spotify:user:spotify:playlist:37i9dQZF1DX0XUsu...	https://api.spotify.com/v1/users/spotify/playl...	True	8283836.0
2	2	mint	61	37i9dQZF1DX4dyzvuaRJ0n	spotify:user:spotify:playlist:37i9dQZF1DX4dyzv...	https://api.spotify.com/v1/users/spotify/playl...	True	4593498.0
3	3	Are & Be	51	37i9dQZF1DX4SBhb3fqCJd	spotify:user:spotify:playlist:37i9dQZF1DX4SBhb...	https://api.spotify.com/v1/users/spotify/playl...	True	3773823.0
4	4	Rock This	60	37i9dQZF1DXcF6B6QPhFDv	spotify:user:spotify:playlist:37i9dQZF1DXcF6B6...	https://api.spotify.com/v1/users/spotify/playl...	True	3989695.0

Next, the track feature dataframe is loaded.

tracks_df = pd.read_csv('tracks_df_sub.csv').drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
tracks_df.head()

	acousticness	dance	duration	energy	instrumentalness	key	liveness	loudness	mode	playlist	song	speech	tempo	time	valence
0	0.039500	0.299	214973	0.9210	0.737000	4	0.5890	-6.254	1	37i9dQZF1DXcBWIGoYBM5M	0076oEQq8IToGfnzU3bTHY	0.1930	174.982	4	0.0532
1	0.365000	0.307	258933	0.4810	0.000000	3	0.2070	-8.442	0	37i9dQZF1DXcBWIGoYBM5M	00kkWwGsR9HblTUHb3BmdX	0.1280	68.894	3	0.3290
2	0.078700	0.630	261731	0.6560	0.000906	0	0.0953	-6.423	0	37i9dQZF1DXcBWIGoYBM5M	01JkrDSrakX5UO5knhpKNA	0.0276	133.012	4	0.4320
3	0.000192	0.521	188834	0.8370	0.051000	5	0.0929	-4.581	1	37i9dQZF1DXcBWIGoYBM5M	01KsbekyuQQXpVnxIfNRaC	0.1220	80.027	4	0.6230
4	0.993000	0.322	160897	0.0121	0.927000	5	0.1270	-31.994	1	37i9dQZF1DXcBWIGoYBM5M	01T3AjynqSMVfiAQCAfrKJ	0.0491	112.464	4	0.1180

Finally, the artist dataframe is loaded.

artist_df_sub = pd.read_csv('artist_df_sub.csv').drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
artist_df_sub.head()

	artist	followers	genres	playlist	popularity	song
0	*NSYNC	498511.0	['boy band', 'dance pop', 'europop', 'pop', 'p...	37i9dQZF1DWXDAhqlN7e6W	75.0	35zGjsxI020C2NPKp2fzS7
1	10 Years	154800.0	['alternative metal', 'nu metal', 'post-grunge...	37i9dQZF1DWWJOmJ7nRx0C	63.0	4qmoz9OUEBaXUzlWQX4ZU4
2	2 Chainz	1926728.0	['dwn trap', 'pop rap', 'rap', 'southern hip h...	37i9dQZF1DX7QOv5kjbU68	91.0	4XoP1AkbOurU9CeZ2rMEz2
3	21 Savage	2224587.0	['dwn trap', 'rap', 'trap music']	37i9dQZF1DX7QOv5kjbU68	98.0	4ckuS4Nj4FZ7i3Def3Br8W
4	24hrs	27817.0	['dwn trap', 'trap music', 'underground hip hop']	37i9dQZF1DX0XUsuxWHRQd	74.0	2c5D6B8oXAwc6easamdgVA

As shown above, artists are categorized by a list of genres rather than a single genre. Genres are therefore one-hot encoded to convert these genre lists into predictors we can run models on.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
pre_data = mlb.fit_transform(artist_df_sub['genres'].str.split(','))
classes = [i.strip('[]') for i in mlb.classes_]
genre_sub = pd.DataFrame(pre_data.toarray(),columns=classes)
_, i = np.unique(genre_sub.columns, return_index=True)
genre_sub = genre_sub.iloc[:, i]

artist_df_sub_mid = artist_df_sub.drop('genres', axis=1)

artist_sub_frames = [artist_df_sub_mid,genre_sub]
artist_df = pd.concat(artist_sub_frames,axis=1,join='inner')
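After a round trip through CSV, the genre lists are stored as strings such as `"['rap', 'trap music']"`, so splitting on commas leaves brackets, quotes, and spaces attached to the labels. An alternative sketch, not the method used above, parses each cell back into a real list with `ast.literal_eval` before encoding. The helper names are hypothetical, and mapping an empty list to a `no_genre` label is an assumption.

```python
import ast


def parse_genres(cell):
    """Turn a stringified list such as "['rap', 'trap music']" into a list."""
    genres = ast.literal_eval(cell) if isinstance(cell, str) else []
    return genres if genres else ["no_genre"]  # assumed label for empty lists


def one_hot(genre_lists):
    """Return (sorted genre names, one 0/1 row per input list)."""
    classes = sorted({g for genres in genre_lists for g in genres})
    rows = [[int(c in genres) for c in classes] for genres in genre_lists]
    return classes, rows


cells = ["['dwn trap', 'rap', 'trap music']", "['rap']", "[]"]
classes, rows = one_hot([parse_genres(c) for c in cells])
```

The parsed lists could equally be passed to `MultiLabelBinarizer.fit_transform`, which accepts an iterable of label collections.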

Once all the genres are one-hot encoded, the dataframes are grouped by playlist to enable feature engineering.

Feature Engineering

Artist Variables

In terms of artists, feature engineering led to the following predictors:

Thirty columns represent the top 30 artists (those appearing most often in popular playlists). These are categorical variables indicating whether a playlist contains a specific artist.
Five columns represent the number of times the top 50 artists (by artist followers) appear in a playlist, bucketed into groups of 10 artists each.
Two columns represent the mean and standard deviation of artist followers per playlist.
Two columns represent the mean and standard deviation of artist popularity per playlist.
Artist genres are one-hot encoded.

First, the top 50 artists (by number of Spotify followers) are extracted. We then count the number of times these artists show up in a given playlist and record the counts as predictors in the final dataframe. Second, we obtain the list of 30 artists who appear most often in playlists with 35,000+ followers. Third, all the genres in a playlist are encoded as binary values in the one-hot encoded genre columns.
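The bucketed top-50 counts can be illustrated with a short sketch. It assumes that artists are ranked by followers, that buckets cover ranks 0-9, 10-19, and so on, and that columns are named on the pattern of `top_30_40`; the function name and sample data are hypothetical.

```python
def top_artist_bucket_counts(playlist_artists, ranked_artists,
                             bucket_size=10, top_n=50):
    """Count how often artists from each rank bucket appear in one playlist."""
    rank = {name: i for i, name in enumerate(ranked_artists[:top_n])}
    counts = {
        f"top_{lo}_{lo + bucket_size}": 0
        for lo in range(0, top_n, bucket_size)
    }
    for name in playlist_artists:
        if name in rank:
            lo = rank[name] // bucket_size * bucket_size  # bucket lower bound
            counts[f"top_{lo}_{lo + bucket_size}"] += 1
    return counts


# Made-up ranking: artist_0 has the most followers, artist_59 the fewest.
ranked = [f"artist_{i}" for i in range(60)]
counts = top_artist_bucket_counts(
    ["artist_0", "artist_5", "artist_34", "artist_55", "other"], ranked
)
```

Artists ranked outside the top 50, or not ranked at all, leave every bucket unchanged.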

Finally, the main artist data frame is created below:

artist_features_df['Playlist_Followers'] = playlist_df[['Followers']].groupby(playlist_df['ID']).first()
artist_features_df['ID'] = artist_features_df.index

artist_main_df = artist_features_df.reset_index().drop(0, axis=1)
artist_main_df.head()

	followers_mean	followers_std	popularity_mean	popularity_std	top_30_40	Playlist_Followers	ID
0	134413.666667	3.654590e+05	42.833333	19.575645	0	24.0	01WIu4Rst0xeZnTunWxUL7
1	103320.580645	3.320150e+05	48.903226	15.029648	0	330.0	05dTMGk8MjnpQg3bKuoXcc
2	566814.560000	1.427308e+06	60.280000	15.512146	1	73.0	070FVPBKvfu6M5tf4I9rt2
3	199831.484848	2.953859e+05	58.696970	15.627470	0	6173.0	08vPKM3pmoyF6crB2EtASQ
4	223253.774194	4.918438e+05	49.516129	19.489948	0	145.0	08ySLuUm0jMf7lJmFwqRMu

Audio Feature Variables

Similar to the artist feature engineering, the playlists’ audio features are engineered next. Specifically, for each audio feature mined from Spotify (such as acousticness, duration, and energy), the mean and standard deviation across all tracks in a playlist are computed.
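The aggregation itself is not shown; the sketch below illustrates the idea with the standard library only. It assumes one record per track with a `playlist` key, uses the sample standard deviation, and names its outputs on the `<feature>_mean` / `<feature>_std` pattern. The function name and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_by_playlist(tracks, features):
    """Return {playlist: {"<feature>_mean": ..., "<feature>_std": ...}}."""
    grouped = defaultdict(list)
    for track in tracks:
        grouped[track["playlist"]].append(track)
    summary = {}
    for playlist, rows in grouped.items():
        stats = {}
        for feature in features:
            values = [row[feature] for row in rows]
            stats[f"{feature}_mean"] = mean(values)
            # The sample standard deviation is undefined for a single track.
            stats[f"{feature}_std"] = (
                stdev(values) if len(values) > 1 else float("nan")
            )
        summary[playlist] = stats
    return summary


tracks = [
    {"playlist": "p1", "energy": 0.2, "tempo": 100.0},
    {"playlist": "p1", "energy": 0.6, "tempo": 120.0},
    {"playlist": "p2", "energy": 0.9, "tempo": 90.0},
]
summary = summarize_by_playlist(tracks, ["energy", "tempo"])
```

With pandas, a `groupby('playlist')` followed by `agg(['mean', 'std'])` on the numeric columns expresses the same computation.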

The engineered audio features are converted into a dataframe as follows:

features_df['Followers'] = playlist_df[['Followers']].groupby(playlist_df['ID']).first()
features_df['ID'] = features_df.index

features_main_df = features_df.reset_index().drop(0, axis=1)
features_main_df.head()

	acousticness_mean	acousticness_std	dance_mean	dance_std	energy_mean	energy_std	instrumentalness_mean	instrumentalness_std	key_mean	key_std	...	speech_mean	speech_std	tempo_mean	tempo_std	time_mean	time_std	valence_mean	valence_std	Followers	ID
0	0.641282	0.326942	0.467911	0.241057	0.275940	0.225821	0.119650	0.277109	0.275940	0.225821	...	0.383051	0.403365	101.045969	51.857504	3.338462	1.553996	0.319263	0.246235	24.0	01WIu4Rst0xeZnTunWxUL7
1	0.249844	0.321182	0.555140	0.172088	0.666567	0.230578	0.077776	0.240452	0.666567	0.230578	...	0.137260	0.226812	130.850167	30.525135	4.000000	0.454859	0.496127	0.256787	6198.0	056jpfChuMP5D1NMMaDXRR
2	0.278816	0.262749	0.634392	0.140270	0.596000	0.166902	0.192559	0.341460	0.596000	0.166902	...	0.082210	0.131105	122.768255	28.215783	4.000000	0.200000	0.656235	0.245299	330.0	05dTMGk8MjnpQg3bKuoXcc
3	0.228810	0.251421	0.600400	0.178801	0.612200	0.192433	0.179571	0.336604	0.612200	0.192433	...	0.052150	0.025935	114.439167	21.997673	4.000000	0.262613	0.481787	0.251199	73.0	070FVPBKvfu6M5tf4I9rt2
4	0.394114	0.362573	0.599424	0.151256	0.541097	0.289705	0.203059	0.332371	0.541097	0.289705	...	0.106724	0.112448	110.134788	25.125111	4.000000	0.353553	0.511997	0.243171	6173.0	08vPKM3pmoyF6crB2EtASQ

5 rows × 26 columns

Finally, the main dataframe is created with an inner merge of the audio feature dataframe and the artist dataframe. This inner merge drops 126 playlists in total (i.e., playlists that do not appear in both dataframes).

master_df = pd.merge(features_main_df, artist_df_groups, how='inner', on='ID')
master_df.head()

	acousticness_mean	acousticness_std	dance_mean	dance_std	energy_mean	energy_std	instrumentalness_mean	instrumentalness_std	key_mean	key_std	...	'no_genre'
0	0.641282	0.326942	0.467911	0.241057	0.275940	0.225821	0.119650	0.277109	0.275940	0.225821	...	1
1	0.278816	0.262749	0.634392	0.140270	0.596000	0.166902	0.192559	0.341460	0.596000	0.166902	...	1
2	0.228810	0.251421	0.600400	0.178801	0.612200	0.192433	0.179571	0.336604	0.612200	0.192433	...	1
3	0.394114	0.362573	0.599424	0.151256	0.541097	0.289705	0.203059	0.332371	0.541097	0.289705	...	1
4	0.194509	0.278470	0.531067	0.150001	0.759400	0.249805	0.115499	0.258020	0.759400	0.249805	...	1

5 rows × 3245 columns
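Since an inner merge silently drops non-matching rows, it can be useful to check which playlist IDs are lost on each side. The sketch below does this with plain sets; `merge_report` and the sample IDs are hypothetical.

```python
def merge_report(left_ids, right_ids):
    """Summarize which keys an inner merge would keep or drop."""
    left, right = set(left_ids), set(right_ids)
    return {
        "kept": sorted(left & right),        # present in both inputs
        "only_left": sorted(left - right),   # dropped from the left input
        "only_right": sorted(right - left),  # dropped from the right input
    }


report = merge_report(["a", "b", "c"], ["b", "c", "d"])
```

In pandas, merging with `how='outer'` and `indicator=True` gives the same breakdown through the added `_merge` column.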

The master dataframe is then saved for both EDA and modeling, and its final size is shown below.

Number of Playlists: 1420
Number of Predictors: 3245

String Parsing / Natural Language Processing

Here, we further analyze the playlist names, based on the rationale that listeners usually search for key terms like ‘Best’, ‘Hit’, or ‘Workout’ when they look for certain types of playlists. Due to the relatively small size of our dataset, we adopt a string parsing approach for our model (which could be scaled up with Python’s NLTK package for larger datasets or more advanced modeling).

After reading in the full dataset and the playlist dataset, we perform a left join on playlist ID to add the playlist name to the full dataset.
We search for 12 categories of specific strings that cover ‘Best’, ‘Workout’, ‘Party’, ‘Chill’, ‘Acoustic’, ‘2000s’, ‘1990s’, ‘1980s’, ‘1970s’, ‘1960s’, and ‘1950s’ using the str.contains method.
After creating these 12 boolean variables, we convert them to binary ones (0 or 1) by multiplying by 1.
Lastly, we include those binary variables in the dataframe as predictor variables.

new_df = pd.merge(full_df, playlist_df[['Name', 'ID']], on='ID', how='left')

String parsing follows the below example methods:

Str_Best = full_df_concise.Name.str.contains('Best|Top|Hit|best|top|hit|Hot|hot|Pick|pick')
Str_Workout = full_df_concise.Name.str.contains('Workout|workout|Motivation|motivation|Power|power|Cardio')
Str_Party = full_df_concise.Name.str.contains('Party|party')
Str_Chill = full_df_concise.Name.str.contains('Chill|chill|Relax|relax')
Str_Acoustic = full_df_concise.Name.str.contains('Acoustic|acoustic')
Str_2000s = full_df_concise.Name.str.contains('20')
Str_1990s = full_df_concise.Name.str.contains('90|91|92|93|94|95|96|97|98|99')
Str_1980s = full_df_concise.Name.str.contains('80|81|82|83|84|85|86|87|88|89')
Str_1970s = full_df_concise.Name.str.contains('70|71|72|73|74|75|76|77|78|79')
Str_1960s = full_df_concise.Name.str.contains('60|61|62|63|64|65|66|67|68|69')
Str_1950s = full_df_concise.Name.str.contains('50s')
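Substring patterns such as `'20'` or `'90|91|...'` match any name containing those digits, and listing both capitalizations by hand is easy to get wrong. A stricter variant, sketched here with the standard `re` module and not the patterns used above, matches case-insensitively and only accepts decade labels or four-digit years. The dictionary `KEYWORD_PATTERNS`, the helper `name_flags`, and the exact regular expressions are assumptions for illustration.

```python
import re

KEYWORD_PATTERNS = {
    "Str_Best": r"best|top|hit|hot|pick",
    "Str_Workout": r"workout|motivation|power|cardio",
    "Str_Party": r"party",
    "Str_Chill": r"chill|relax",
    "Str_Acoustic": r"acoustic",
    # Matches "90s", "'90s", "1990s" or a year 1990-1999, not arbitrary digits.
    "Str_1990s": r"(?<!\d)(?:(?:19)?90'?s|199\d)(?!\d)",
}


def name_flags(name):
    """Return a 0/1 flag per pattern; a missing name yields all zeros."""
    text = name if isinstance(name, str) else ""
    return {
        key: int(re.search(pattern, text, flags=re.IGNORECASE) is not None)
        for key, pattern in KEYWORD_PATTERNS.items()
    }


flags = name_flags("Today's Top Hits")
```

Keyword matching here is still substring-based (for example, "hit" also matches inside "white"). In pandas, `str.contains` accepts `case=False` and `na=False` for the same case-insensitive, missing-safe behavior.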

Interaction Terms with Audio Features and Genre

The following section describes the process of creating interaction terms between genres and audio features. Interaction terms are considered because genre may have an effect on the relationships between audio features and the number of playlist followers. For example, different levels of energy may be more popular for rap music than for acoustic music.

The first step is to bucket the more than 100 specific genres into broader categories. As listed below, the most common broad genres include house, hip hop, pop, dance, r&b, rap, acoustic, and soul.

broad_genres = ['house','hip hop','pop','dance','r&b','rap','acoustic','soul']

Next, interaction terms are generated between genre categories and certain audio features; the terms created are listed below. They were selected through a separate analysis in which all genres, all audio features, and all possible interactions were used as predictors of the number of playlist followers. In that analysis, the interaction terms listed below were found to be significant.

interaction_columns = ['house_acousticness_mean','hip hop_acousticness_std','pop_liveness_std','dance_liveness_std',
                      'r&b_acousticness_std','rap_energy_std','rap_key_std','acoustic_acousticness_std','acoustic_acousticness_mean',
                      'acoustic_energy_std','acoustic_key_std','soul_acousticness_std']

	house_acousticness_mean	hip hop_acousticness_std	pop_liveness_std	dance_liveness_std	r&b_acousticness_std	rap_energy_std	rap_key_std	acoustic_acousticness_std	acoustic_acousticness_mean	acoustic_energy_std	acoustic_key_std	soul_acousticness_std
count	1420.000000	1418.000000	1418.000000	1418.000000	1418.000000	1418.000000	1418.000000	1418.000000	1420.000000	1418.000000	1418.000000	1418.000000
mean	0.224109	0.235339	0.156279	0.137165	0.239961	0.210305	0.210305	0.102606	0.115892	0.080324	0.080324	0.173310
std	0.212280	0.144852	0.056181	0.073726	0.143718	0.094412	0.094412	0.150939	0.190786	0.117964	0.117964	0.162756
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.129984	0.111896	0.160846	0.204460	0.204460	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.221718	0.302918	0.155873	0.149616	0.306497	0.238572	0.238572	0.000000	0.000000	0.000000	0.000000	0.240984
75%	0.366849	0.341949	0.185451	0.180391	0.344083	0.267752	0.267752	0.285148	0.228570	0.220703	0.220703	0.332228
max	0.961000	0.428986	0.351859	0.351859	0.444861	0.371096	0.371096	0.444861	0.961000	0.347747	0.347747	0.420705
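The code that builds these columns is not shown. One plausible construction, sketched below purely as an assumption, flags a broad genre whenever its name occurs inside one of the playlist’s specific genre labels, then multiplies that 0/1 flag by the audio statistic. This would be consistent with the zero minimums in the summary above. The function names and sample values are hypothetical.

```python
BROAD_GENRES = ["house", "hip hop", "pop", "dance", "r&b", "rap",
                "acoustic", "soul"]


def broad_genre_flags(specific_genres):
    """Flag a broad genre when its name occurs inside any specific label."""
    # Caveat: substring matching also flags "rap" for labels such as "trap".
    return {
        broad: int(any(broad in genre for genre in specific_genres))
        for broad in BROAD_GENRES
    }


def interaction_terms(flags, audio_stats, pairs):
    """Multiply each genre flag by the paired audio statistic."""
    return {
        f"{genre}_{stat}": flags[genre] * audio_stats[stat]
        for genre, stat in pairs
    }


flags = broad_genre_flags(["deep house", "pop rap"])
terms = interaction_terms(
    flags,
    {"acousticness_mean": 0.25, "energy_std": 0.5},
    [("house", "acousticness_mean"), ("rap", "energy_std"),
     ("soul", "energy_std")],
)
```

A term is zero for every playlist outside the genre and equals the audio statistic otherwise, so a model can fit a genre-specific slope for that statistic.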

The final dataframe is now complete. We use this dataframe and its features to conduct EDA and to construct models in the following sections.