Please refer to Data Mining & Wrangling for the source code (Jupyter Notebook).
To begin pulling playlist data from the Spotify API, a connection with the API must first be established. This requires a "client ID" and a "client secret". Once these credentials are obtained, we follow the steps below to set up the API connection:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Credentials obtained from the Spotify developer dashboard
client_id = "client_id"
client_secret = "client_secret"

# Authenticate via the client-credentials flow and create the API client
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
The goal of this project is twofold: (i) to identify which predictors (whether track features or artist features) are statistically significant in determining a playlist's success, measured by its number of followers; and (ii) to create a custom playlist that is predicted to be successful (i.e., one that would attract many followers).
To this end, the first step in any further analysis is to obtain the playlists we want to run our predictions on. We decide to focus on Spotify's own "featured" playlists, i.e., those curated by Spotify itself around specific genres, moods, artists, and so on.
The initial step is to pull Spotify's featured playlists and collect a number of baseline playlist features, which are then assembled into a large dataframe.
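As a rough illustration, the featured playlists can be pulled with spotipy's featured_playlists endpoint; the variable name spotify_playlists and the exact fields kept below are assumptions inferred from the dataframe that follows, and the full collection of 1,400+ playlists presumably also draws on additional browse endpoints.

# Assumed sketch: pull featured playlists and keep the fields used in the dataframe below
spotify_playlists = []
featured = sp.featured_playlists(limit=50)['playlists']['items']
for p in featured:
    spotify_playlists.append([p['name'], p['tracks']['total'], p['id'],
                              p['uri'], p['href'], p['public']])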
import numpy as np
import pandas as pd

# Convert the collected playlist records into a dataframe
data = pd.DataFrame(np.array(spotify_playlists).reshape(len(spotify_playlists), 6),
                    columns=['Name', 'No. of Tracks', 'ID', 'URI', 'HREF', 'Public'])
data.head()
Name | No. of Tracks | ID | URI | HREF | Public | |
---|---|---|---|---|---|---|
0 | Today's Top Hits | 50 | 37i9dQZF1DXcBWIGoYBM5M | spotify:user:spotify:playlist:37i9dQZF1DXcBWIG... | https://api.spotify.com/v1/users/spotify/playl... | True |
1 | RapCaviar | 63 | 37i9dQZF1DX0XUsuxWHRQd | spotify:user:spotify:playlist:37i9dQZF1DX0XUsu... | https://api.spotify.com/v1/users/spotify/playl... | True |
2 | mint | 61 | 37i9dQZF1DX4dyzvuaRJ0n | spotify:user:spotify:playlist:37i9dQZF1DX4dyzv... | https://api.spotify.com/v1/users/spotify/playl... | True |
3 | Are & Be | 51 | 37i9dQZF1DX4SBhb3fqCJd | spotify:user:spotify:playlist:37i9dQZF1DX4SBhb... | https://api.spotify.com/v1/users/spotify/playl... | True |
4 | Rock This | 64 | 37i9dQZF1DXcF6B6QPhFDv | spotify:user:spotify:playlist:37i9dQZF1DXcF6B6... | https://api.spotify.com/v1/users/spotify/playl... | True |
For each playlist, the number of followers is obtained; this count will serve as the response variable for our regression-based models. Finally, the follower counts are concatenated to the playlist dataframe.
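Follower counts are not included in the simplified playlist objects returned above, so they have to be requested per playlist. A minimal sketch of how the playlist_follower list could be built, assuming all playlists are owned by the 'spotify' user:

# Assumed sketch: request the follower count for each playlist individually
playlist_follower = []
for pid in data['ID']:
    playlist = sp.user_playlist('spotify', pid, fields='followers.total')
    playlist_follower.append(playlist['followers']['total'])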
data['Followers'] = pd.DataFrame({'Followers': playlist_follower})
data.head()
Name | No. of Tracks | ID | URI | HREF | Public | Followers | |
---|---|---|---|---|---|---|---|
0 | Today's Top Hits | 50 | 37i9dQZF1DXcBWIGoYBM5M | spotify:user:spotify:playlist:37i9dQZF1DXcBWIG... | https://api.spotify.com/v1/users/spotify/playl... | True | 18247159.0 |
1 | RapCaviar | 63 | 37i9dQZF1DX0XUsuxWHRQd | spotify:user:spotify:playlist:37i9dQZF1DX0XUsu... | https://api.spotify.com/v1/users/spotify/playl... | True | 8375355.0 |
2 | mint | 61 | 37i9dQZF1DX4dyzvuaRJ0n | spotify:user:spotify:playlist:37i9dQZF1DX4dyzv... | https://api.spotify.com/v1/users/spotify/playl... | True | 4616753.0 |
3 | Are & Be | 51 | 37i9dQZF1DX4SBhb3fqCJd | spotify:user:spotify:playlist:37i9dQZF1DX4SBhb... | https://api.spotify.com/v1/users/spotify/playl... | True | 3806312.0 |
4 | Rock This | 64 | 37i9dQZF1DXcF6B6QPhFDv | spotify:user:spotify:playlist:37i9dQZF1DXcF6B6... | https://api.spotify.com/v1/users/spotify/playl... | True | 4004115.0 |
Following the steps outlined above, we are able to produce a dataframe of more than 1,400 playlists with relevant information such as playlist ID, number of tracks, and number of followers.
Using the dataframe of playlists, and specifically the playlist ID column, we iterate over all tracks in every playlist and pull audio features that could potentially be helpful in predicting the success of a playlist. Audio features include acousticness, energy, key, valence, and so on.
To this end, we define a function to pull all of a playlist's tracks.
def get_playlist_tracks(username, playlist_id):
    # Pull the first page of tracks, then follow the pagination links until exhausted
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks
Running the feature extraction against Spotify can take a significant amount of time and tends to raise errors along the way. To avoid losing information when such an error occurs, the results are cached in an in-memory dictionary. Audio features are extracted using the code below; note that running it on all playlists takes a significant amount of time (measured in hours).
# audio_feat is an in-memory cache mapping song ID -> audio features
for item, song in enumerate(songs):
    if song not in audio_feat:
        try:
            audio_feat[song] = sp.audio_features(song)
        except:
            pass
    # Sleep periodically to throttle requests and avoid rate limits
    if item % limit_songs_small == 0:
        time.sleep(random.randint(0, 1))
    if item % limit_songs_medium == 0:
        time.sleep(random.randint(0, 1))
    # Simple progress indicator
    out = np.floor(item * 1. / len(songs) * 100)
    sys.stdout.write("\r%d%%" % out)
    sys.stdout.flush()
sys.stdout.write("\r%d%%" % 100)
Once all the audio features are extracted, they are assembled into the main audio-feature dataframe and saved as a large CSV file.
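A minimal sketch of how the cached features could be flattened into that dataframe: the column renames (e.g., danceability to dance, duration_ms to duration) are assumptions inferred from the table below, and in the actual pipeline each row also carries the playlist ID recorded during the crawl.

# Assumed sketch: flatten the cache; sp.audio_features returns a one-element list per track
rows = []
for song, feats in audio_feat.items():
    if feats and feats[0]:
        f = feats[0]
        rows.append({'acousticness': f['acousticness'], 'dance': f['danceability'],
                     'duration': f['duration_ms'], 'energy': f['energy'],
                     'instrumentalness': f['instrumentalness'], 'key': f['key'],
                     'liveness': f['liveness'], 'loudness': f['loudness'], 'mode': f['mode'],
                     'song': song, 'speech': f['speechiness'], 'tempo': f['tempo'],
                     'time': f['time_signature'], 'valence': f['valence']})
tracks_df = pd.DataFrame(rows)
tracks_df.to_csv('tracks_df_sub.csv')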
acousticness | dance | duration | energy | instrumentalness | key | liveness | loudness | mode | playlist | song | speech | tempo | time | valence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.365 | 0.307 | 258933 | 0.481 | 0 | 3 | 0.207 | -8.442 | 0 | 37i9dQZF1DXcBWIGoYBM5M | 00kkWwGsR9HblTUHb3BmdX | 0.128 | 68.894 | 3 | 0.329 |
1 | 0.993 | 0.322 | 160897 | 0.0121 | 0.927 | 5 | 0.127 | -31.994 | 1 | 37i9dQZF1DXcBWIGoYBM5M | 01T3AjynqSMVfiAQCAfrKJ | 0.0491 | 112.464 | 4 | 0.118 |
2 | 0.994 | 0.375 | 58387 | 0.00406 | 0.908 | 7 | 0.0842 | -31.824 | 0 | 37i9dQZF1DXcBWIGoYBM5M | 02BumRY2OTFMkMxrXSVMat | 0.0671 | 139.682 | 1 | 0.358 |
3 | 0.992 | 0.393 | 288280 | 0.0429 | 0.925 | 9 | 0.0821 | -25.727 | 0 | 37i9dQZF1DXcBWIGoYBM5M | 02mkkozonPEDCenOhuWwLc | 0.0341 | 135.405 | 4 | 0.0394 |
4 | 0.992 | 0.373 | 99867 | 0.117 | 0.909 | 10 | 0.111 | -25.222 | 0 | 37i9dQZF1DXcBWIGoYBM5M | 02xmGU9unopKjpblPRC67j | 0.0511 | 125.288 | 3 | 0.189 |
Following a procedure similar to the audio-feature extraction, artist information is extracted next for every track in every playlist.
First, a function is defined to retrieve artist information given an artist name.
def get_artist(name):
    # Search for the artist by name and return the top match (or None if nothing is found)
    results = sp.search(q='artist:' + name, type='artist')
    items = results['artists']['items']
    if len(items) > 0:
        return items[0]
    else:
        return None
Again, an in-memory dictionary is set up as a cache for the main artist feature extraction loop. Artist features are extracted using the code below; note that running it on all playlists takes a significant amount of time (measured in hours).
# artist_info is an in-memory cache mapping artist name -> artist object
for item, artist in enumerate(artists):
    if artist not in artist_info:
        try:
            artist_info[artist] = get_artist(artist)
        except:
            pass
    # Sleep periodically to throttle requests and avoid rate limits
    if item % limit_artist_small == 0:
        time.sleep(random.randint(0, 1))
    if item % limit_artist_medium == 0:
        time.sleep(random.randint(0, 1))
    # Simple progress indicator
    out = np.floor(item * 1. / len(artists) * 100)
    sys.stdout.write("\r%d%%" % out)
    sys.stdout.flush()
sys.stdout.write("\r%d%%" % 100)
Once all the artist features are extracted, they are assembled into the main artist-feature dataframe and saved as a large CSV file.
artist | followers | genres | playlist | popularity | song | |
---|---|---|---|---|---|---|
0 | 10 Years | 157035 | [alternative metal, nu metal, post-grunge, rap... | 37i9dQZF1DXcF6B6QPhFDv | 63 | 0uyDAijTR0tOuH24hxDhE5 |
1 | 21 Savage | 2323273 | [dwn trap, rap, trap music] | 37i9dQZF1DX0XUsuxWHRQd | 98 | 2vaMWMPMgsWX4fwJiKmdWm |
2 | 24hrs | 28839 | [dwn trap, trap music, underground hip hop] | 37i9dQZF1DX0XUsuxWHRQd | 73 | 2c5D6B8oXAwc6easamdgVA |
3 | 3LAU | 175224 | [big room, brostep, deep big room, edm, electr... | 37i9dQZF1DX4JAvHpjipBk | 67 | 6yxobtnNHKRAA0cvoNxJhe |
4 | 50 Cent | 2686486 | [east coast hip hop, gangster rap, hip hop, po... | 37i9dQZF1DX0XUsuxWHRQd | 85 | 32aYDW8Qdnv1ur89TUlDnm |
Once all data is extracted from Spotify, the next step is to combine the separate dataframes (i.e., playlists, audio features, and artists) and to perform some initial feature engineering, in the hope of creating useful predictors for inference and prediction of playlist success.
The first step is to load each dataframe separately, beginning with the playlist dataframe.
playlist_df = pd.read_csv('Playlist.csv')
playlist_df.head()
Unnamed: 0 | Name | No. of Tracks | ID | URI | HREF | Public | Followers | |
---|---|---|---|---|---|---|---|---|
0 | 0 | Today's Top Hits | 50 | 37i9dQZF1DXcBWIGoYBM5M | spotify:user:spotify:playlist:37i9dQZF1DXcBWIG... | https://api.spotify.com/v1/users/spotify/playl... | True | 18079985.0 |
1 | 1 | RapCaviar | 61 | 37i9dQZF1DX0XUsuxWHRQd | spotify:user:spotify:playlist:37i9dQZF1DX0XUsu... | https://api.spotify.com/v1/users/spotify/playl... | True | 8283836.0 |
2 | 2 | mint | 61 | 37i9dQZF1DX4dyzvuaRJ0n | spotify:user:spotify:playlist:37i9dQZF1DX4dyzv... | https://api.spotify.com/v1/users/spotify/playl... | True | 4593498.0 |
3 | 3 | Are & Be | 51 | 37i9dQZF1DX4SBhb3fqCJd | spotify:user:spotify:playlist:37i9dQZF1DX4SBhb... | https://api.spotify.com/v1/users/spotify/playl... | True | 3773823.0 |
4 | 4 | Rock This | 60 | 37i9dQZF1DXcF6B6QPhFDv | spotify:user:spotify:playlist:37i9dQZF1DXcF6B6... | https://api.spotify.com/v1/users/spotify/playl... | True | 3989695.0 |
Next, the track feature dataframe is loaded.
tracks_df = pd.read_csv('tracks_df_sub.csv').drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
tracks_df.head()
acousticness | dance | duration | energy | instrumentalness | key | liveness | loudness | mode | playlist | song | speech | tempo | time | valence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.039500 | 0.299 | 214973 | 0.9210 | 0.737000 | 4 | 0.5890 | -6.254 | 1 | 37i9dQZF1DXcBWIGoYBM5M | 0076oEQq8IToGfnzU3bTHY | 0.1930 | 174.982 | 4 | 0.0532 |
1 | 0.365000 | 0.307 | 258933 | 0.4810 | 0.000000 | 3 | 0.2070 | -8.442 | 0 | 37i9dQZF1DXcBWIGoYBM5M | 00kkWwGsR9HblTUHb3BmdX | 0.1280 | 68.894 | 3 | 0.3290 |
2 | 0.078700 | 0.630 | 261731 | 0.6560 | 0.000906 | 0 | 0.0953 | -6.423 | 0 | 37i9dQZF1DXcBWIGoYBM5M | 01JkrDSrakX5UO5knhpKNA | 0.0276 | 133.012 | 4 | 0.4320 |
3 | 0.000192 | 0.521 | 188834 | 0.8370 | 0.051000 | 5 | 0.0929 | -4.581 | 1 | 37i9dQZF1DXcBWIGoYBM5M | 01KsbekyuQQXpVnxIfNRaC | 0.1220 | 80.027 | 4 | 0.6230 |
4 | 0.993000 | 0.322 | 160897 | 0.0121 | 0.927000 | 5 | 0.1270 | -31.994 | 1 | 37i9dQZF1DXcBWIGoYBM5M | 01T3AjynqSMVfiAQCAfrKJ | 0.0491 | 112.464 | 4 | 0.1180 |
Finally, the artist dataframe is loaded.
artist_df_sub = pd.read_csv('artist_df_sub.csv').drop(['Unnamed: 0','Unnamed: 0.1'],axis=1)
artist_df_sub.head()
artist | followers | genres | playlist | popularity | song | |
---|---|---|---|---|---|---|
0 | *NSYNC | 498511.0 | ['boy band', 'dance pop', 'europop', 'pop', 'p... | 37i9dQZF1DWXDAhqlN7e6W | 75.0 | 35zGjsxI020C2NPKp2fzS7 |
1 | 10 Years | 154800.0 | ['alternative metal', 'nu metal', 'post-grunge... | 37i9dQZF1DWWJOmJ7nRx0C | 63.0 | 4qmoz9OUEBaXUzlWQX4ZU4 |
2 | 2 Chainz | 1926728.0 | ['dwn trap', 'pop rap', 'rap', 'southern hip h... | 37i9dQZF1DX7QOv5kjbU68 | 91.0 | 4XoP1AkbOurU9CeZ2rMEz2 |
3 | 21 Savage | 2224587.0 | ['dwn trap', 'rap', 'trap music'] | 37i9dQZF1DX7QOv5kjbU68 | 98.0 | 4ckuS4Nj4FZ7i3Def3Br8W |
4 | 24hrs | 27817.0 | ['dwn trap', 'trap music', 'underground hip hop'] | 37i9dQZF1DX0XUsuxWHRQd | 74.0 | 2c5D6B8oXAwc6easamdgVA |
As we can see above, artists are tagged with a list of genres rather than a single genre. The genre lists are therefore one-hot encoded to turn them into predictors we can run models on.
from sklearn.preprocessing import MultiLabelBinarizer

# One-hot encode the genre lists (stored as strings) into sparse indicator columns
mlb = MultiLabelBinarizer(sparse_output=True)
pre_data = mlb.fit_transform(artist_df_sub['genres'].str.split(','))
classes = [i.strip('[]') for i in mlb.classes_]
genre_sub = pd.DataFrame(pre_data.toarray(), columns=classes)

# Drop duplicate genre columns, keeping the first occurrence of each
_, i = np.unique(genre_sub.columns, return_index=True)
genre_sub = genre_sub.iloc[:, i]

# Replace the raw genre strings with the one-hot encoded columns
artist_df_sub_mid = artist_df_sub.drop('genres', axis=1)
artist_sub_frames = [artist_df_sub_mid, genre_sub]
artist_df = pd.concat(artist_sub_frames, axis=1, join='inner')
Once all the genres are one-hot encoded, the dataframes are grouped by playlist to enable feature engineering.
In terms of artists, feature engineering led to the following predictors. First, the top 50 artists (in terms of number of Spotify followers) are extracted, and we count the number of times these artists appear in a given playlist, recording the counts as predictors in the final dataframe. Second, we obtain the list of 30 artists who appear most often in playlists with 35,000+ followers. Third, all the genres in a playlist are encoded as binary values in the one-hot encoded genre columns. A sketch of this aggregation is given below.
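A minimal sketch, under stated assumptions, of how the per-playlist artist features could be computed: artist_df is taken to hold one row per (playlist, artist) pair with followers and popularity columns, and the bucketed top-artist counts (top_0_10 through top_40_50) are assumed to count appearances of the top 50 artists split into groups of ten by follower rank. The top-30-artists feature is analogous and omitted here.

# Assumed sketch: aggregate artist statistics per playlist
grouped = artist_df.groupby('playlist')
artist_features_df = pd.DataFrame({
    'followers_mean': grouped['followers'].mean(),
    'followers_std': grouped['followers'].std(),
    'popularity_mean': grouped['popularity'].mean(),
    'popularity_std': grouped['popularity'].std(),
})

# Top 50 artists by follower count, split into buckets of ten by rank
top_50 = artist_df.groupby('artist')['followers'].max().nlargest(50).index
for b in range(5):
    bucket = set(top_50[b * 10:(b + 1) * 10])
    artist_features_df['top_%d_%d' % (b * 10, (b + 1) * 10)] = grouped['artist'].agg(
        lambda names, bucket=bucket: sum(name in bucket for name in names))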
Finally, the main artist data frame is created below:
artist_features_df['Playlist_Followers'] = playlist_df[['Followers']].groupby(playlist_df['ID']).first()
artist_features_df['ID'] = artist_features_df.index
artist_main_df = artist_features_df.reset_index().drop(0, axis=1)
artist_main_df.head()
followers_mean | followers_std | popularity_mean | popularity_std | top_0_10 | top_10_20 | top_20_30 | top_30_40 | top_40_50 | Playlist_Followers | ID | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 134413.666667 | 3.654590e+05 | 42.833333 | 19.575645 | 0 | 0 | 0 | 0 | 0 | 24.0 | 01WIu4Rst0xeZnTunWxUL7 |
1 | 103320.580645 | 3.320150e+05 | 48.903226 | 15.029648 | 0 | 0 | 0 | 0 | 0 | 330.0 | 05dTMGk8MjnpQg3bKuoXcc |
2 | 566814.560000 | 1.427308e+06 | 60.280000 | 15.512146 | 0 | 0 | 0 | 1 | 0 | 73.0 | 070FVPBKvfu6M5tf4I9rt2 |
3 | 199831.484848 | 2.953859e+05 | 58.696970 | 15.627470 | 0 | 0 | 0 | 0 | 0 | 6173.0 | 08vPKM3pmoyF6crB2EtASQ |
4 | 223253.774194 | 4.918438e+05 | 49.516129 | 19.489948 | 0 | 0 | 0 | 0 | 0 | 145.0 | 08ySLuUm0jMf7lJmFwqRMu |
Similar to the artist feature engineering, the playlists' audio features are engineered next. Specifically, for each audio feature mined from Spotify (such as acousticness, duration, and energy), the mean and standard deviation across all tracks in a playlist are computed.
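A minimal sketch of this per-playlist aggregation, assuming the track-level dataframe from above; the flattened column names (e.g., acousticness_mean) mirror the dataframe shown below, and the variable name features_df is taken from the code that follows.

# Assumed sketch: mean and standard deviation of every audio feature, per playlist
audio_cols = [c for c in tracks_df.columns if c not in ('playlist', 'song')]
features_df = tracks_df.groupby('playlist')[audio_cols].agg(['mean', 'std'])
features_df.columns = ['%s_%s' % (col, stat) for col, stat in features_df.columns]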
The response variable is then attached and the engineered audio features finalized into a dataframe as follows:
features_df['Followers'] = playlist_df[['Followers']].groupby(playlist_df['ID']).first()
features_df['ID'] = features_df.index
features_main_df = features_df.reset_index().drop(0, axis=1)
features_main_df.head()
acousticness_mean | acousticness_std | dance_mean | dance_std | energy_mean | energy_std | instrumentalness_mean | instrumentalness_std | key_mean | key_std | ... | speech_mean | speech_std | tempo_mean | tempo_std | time_mean | time_std | valence_mean | valence_std | Followers | ID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.641282 | 0.326942 | 0.467911 | 0.241057 | 0.275940 | 0.225821 | 0.119650 | 0.277109 | 0.275940 | 0.225821 | ... | 0.383051 | 0.403365 | 101.045969 | 51.857504 | 3.338462 | 1.553996 | 0.319263 | 0.246235 | 24.0 | 01WIu4Rst0xeZnTunWxUL7 |
1 | 0.249844 | 0.321182 | 0.555140 | 0.172088 | 0.666567 | 0.230578 | 0.077776 | 0.240452 | 0.666567 | 0.230578 | ... | 0.137260 | 0.226812 | 130.850167 | 30.525135 | 4.000000 | 0.454859 | 0.496127 | 0.256787 | 6198.0 | 056jpfChuMP5D1NMMaDXRR |
2 | 0.278816 | 0.262749 | 0.634392 | 0.140270 | 0.596000 | 0.166902 | 0.192559 | 0.341460 | 0.596000 | 0.166902 | ... | 0.082210 | 0.131105 | 122.768255 | 28.215783 | 4.000000 | 0.200000 | 0.656235 | 0.245299 | 330.0 | 05dTMGk8MjnpQg3bKuoXcc |
3 | 0.228810 | 0.251421 | 0.600400 | 0.178801 | 0.612200 | 0.192433 | 0.179571 | 0.336604 | 0.612200 | 0.192433 | ... | 0.052150 | 0.025935 | 114.439167 | 21.997673 | 4.000000 | 0.262613 | 0.481787 | 0.251199 | 73.0 | 070FVPBKvfu6M5tf4I9rt2 |
4 | 0.394114 | 0.362573 | 0.599424 | 0.151256 | 0.541097 | 0.289705 | 0.203059 | 0.332371 | 0.541097 | 0.289705 | ... | 0.106724 | 0.112448 | 110.134788 | 25.125111 | 4.000000 | 0.353553 | 0.511997 | 0.243171 | 6173.0 | 08vPKM3pmoyF6crB2EtASQ |
5 rows × 26 columns
Finally, the last step is to create the master dataframe via an inner merge of the audio-feature dataframe and the artist dataframe. This inner merge drops 126 playlists in total (i.e., playlists that do not appear in both dataframes).
master_df = pd.merge(features_main_df, artist_df_groups, how='inner', on='ID')
master_df.head()
acousticness_mean | acousticness_std | dance_mean | dance_std | energy_mean | energy_std | instrumentalness_mean | instrumentalness_std | key_mean | key_std | ... | 'wrestling' | 'wrock' | 'ye ye' | 'yoik' | 'zapstep' | 'zeuhl' | 'zim' | 'zolo' | 'zydeco' | 'no_genre' | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.641282 | 0.326942 | 0.467911 | 0.241057 | 0.275940 | 0.225821 | 0.119650 | 0.277109 | 0.275940 | 0.225821 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0.278816 | 0.262749 | 0.634392 | 0.140270 | 0.596000 | 0.166902 | 0.192559 | 0.341460 | 0.596000 | 0.166902 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0.228810 | 0.251421 | 0.600400 | 0.178801 | 0.612200 | 0.192433 | 0.179571 | 0.336604 | 0.612200 | 0.192433 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0.394114 | 0.362573 | 0.599424 | 0.151256 | 0.541097 | 0.289705 | 0.203059 | 0.332371 | 0.541097 | 0.289705 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 0.194509 | 0.278470 | 0.531067 | 0.150001 | 0.759400 | 0.249805 | 0.115499 | 0.258020 | 0.759400 | 0.249805 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 3245 columns
The master dataframe is saved for both EDA and modeling purposes, and its final size is reported below.
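A sketch of this save-and-report step; the output filename is an assumption.

# Assumed sketch: persist the master dataframe and report its size
master_df.to_csv('master_df.csv')
print('Number of Playlists:', master_df.shape[0])
print('Number of Predictors:', master_df.shape[1])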
Number of Playlists: 1420
Number of Predictors: 3245
Here, we further analyze the playlist names, based on the rationale that listeners often search for key terms like 'Best', 'Hit', or 'Workout' when looking for a particular type of playlist. Given the relatively small size of our data set, we adopt a string-parsing approach (which could easily be scaled with Python's NLTK package for larger data sets or more advanced modeling).
new_df = pd.merge(full_df, playlist_df[['Name', 'ID']], on='ID', how='left')
String parsing follows the example patterns below:
Str_Best = full_df_concise.Name.str.contains('Best|Top|Hit|best|top|hit|Hot|hot|Pick|pick')
Str_Workout = full_df_concise.Name.str.contains('Workout|workout|Motivation|motivation|Power|power|Cardio|cardio')
Str_Party = full_df_concise.Name.str.contains('Party|party')
Str_Chill = full_df_concise.Name.str.contains('Chill|chill|Relax|relax')
Str_Acoustic = full_df_concise.Name.str.contains('Acoustic|acoustic')
Str_2000s = full_df_concise.Name.str.contains('20')
Str_1990s = full_df_concise.Name.str.contains('90|91|92|93|94|95|96|97|98|99')
Str_1980s = full_df_concise.Name.str.contains('80|81|82|83|84|85|86|87|88|89')
Str_1970s = full_df_concise.Name.str.contains('70|71|72|73|74|75|76|77|78|79')
Str_1960s = full_df_concise.Name.str.contains('60|61|62|63|64|65|66|67|68|69')
Str_1950s = full_df_concise.Name.str.contains('50s')
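These boolean series can then be attached as binary predictors; a minimal sketch, with the column names assumed:

# Assumed sketch: attach the parsed name flags as binary (0/1) predictor columns
name_flags = {'Str_Best': Str_Best, 'Str_Workout': Str_Workout, 'Str_Party': Str_Party,
              'Str_Chill': Str_Chill, 'Str_Acoustic': Str_Acoustic, 'Str_2000s': Str_2000s,
              'Str_1990s': Str_1990s, 'Str_1980s': Str_1980s, 'Str_1970s': Str_1970s,
              'Str_1960s': Str_1960s, 'Str_1950s': Str_1950s}
for col, flag in name_flags.items():
    full_df_concise[col] = flag.fillna(False).astype(int)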
The following section describes the process of creating interaction terms between genres and audio features. Interaction terms are considered because genre may have an effect on the relationships between audio features and the number of playlist followers. For example, different levels of energy may be more popular for rap music than for acoustic music.
The first step is to bucket the genres (of which there are more than 100 specific ones) into broader categories. As listed below, these broad genres include house, hip hop, pop, dance, r&b, rap, acoustic, and soul; a sketch of the bucketing follows the list.
broad_genres = ['house','hip hop','pop','dance','r&b','rap','acoustic','soul']
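A minimal sketch of how the fine-grained one-hot genre columns could be collapsed into these broad buckets; the substring matching against the quoted genre column names is an illustrative assumption.

# Assumed sketch: a playlist carries a broad genre if any matching fine-grained genre column is active
genre_cols = [c for c in master_df.columns if c.startswith("'")]
for broad in broad_genres:
    matching = [c for c in genre_cols if broad in c]
    if matching:
        master_df[broad] = master_df[matching].max(axis=1)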
Next, interaction terms are generated between the genre categories and certain audio features; those created are listed below. These features were selected through a separate analysis in which all of the genres, audio features, and all possible interactions were used as predictors of the number of playlist followers, and the interaction terms listed below were found to be significant.
interaction_columns = ['house_acousticness_mean', 'hip hop_acousticness_std', 'pop_liveness_std',
                       'dance_liveness_std', 'r&b_acousticness_std', 'rap_energy_std', 'rap_key_std',
                       'acoustic_acousticness_std', 'acoustic_acousticness_mean', 'acoustic_energy_std',
                       'acoustic_key_std', 'soul_acousticness_std']
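A minimal sketch of how these interactions could be computed, as the product of each broad genre indicator and the corresponding engineered audio-feature column:

# Assumed sketch: interaction term = broad genre indicator x audio-feature statistic
for col in interaction_columns:
    genre, feature = col.split('_', 1)  # e.g. 'rap', 'energy_std'
    master_df[col] = master_df[genre] * master_df[feature]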
house_acousticness_mean | hip hop_acousticness_std | pop_liveness_std | dance_liveness_std | r&b_acousticness_std | rap_energy_std | rap_key_std | acoustic_acousticness_std | acoustic_acousticness_mean | acoustic_energy_std | acoustic_key_std | soul_acousticness_std | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1420.000000 | 1418.000000 | 1418.000000 | 1418.000000 | 1418.000000 | 1418.000000 | 1418.000000 | 1418.000000 | 1420.000000 | 1418.000000 | 1418.000000 | 1418.000000 |
mean | 0.224109 | 0.235339 | 0.156279 | 0.137165 | 0.239961 | 0.210305 | 0.210305 | 0.102606 | 0.115892 | 0.080324 | 0.080324 | 0.173310 |
std | 0.212280 | 0.144852 | 0.056181 | 0.073726 | 0.143718 | 0.094412 | 0.094412 | 0.150939 | 0.190786 | 0.117964 | 0.117964 | 0.162756 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.129984 | 0.111896 | 0.160846 | 0.204460 | 0.204460 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.221718 | 0.302918 | 0.155873 | 0.149616 | 0.306497 | 0.238572 | 0.238572 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.240984 |
75% | 0.366849 | 0.341949 | 0.185451 | 0.180391 | 0.344083 | 0.267752 | 0.267752 | 0.285148 | 0.228570 | 0.220703 | 0.220703 | 0.332228 |
max | 0.961000 | 0.428986 | 0.351859 | 0.351859 | 0.444861 | 0.371096 | 0.371096 | 0.444861 | 0.961000 | 0.347747 | 0.347747 | 0.420705 |
By now, the final dataframe has been created. We will leverage this dataframe and its features to conduct EDA and to construct models in the following sections.