Advanced Models

Source Code

Please refer to Advanced Models for the source code (Jupyter Notebook).

Strategy of Fitting Advanced Models

1) Data Pre-Processing: After reading in the dataframe, we first split it into training and test data (a 90%-10% split, chosen because the dataset is small). Then we standardize the numerical columns, and finally we check for missing data and impute accordingly.

2) Model Score Function: To keep the model summary simple, we create a model scoring function that reports 6 metrics.

3) Model Fitting: Here, we fit 9 different advanced regressors on the training data and then predict on the test data.

4) Model Summary: After fitting all the models, we present 3 summary tables based on training score, test score and qualitative metrics for the models.

5) Cross Validation: Based on the summary, we further fine-tune the parameters of the best model using cross validation.

Data Pre-Processing

After reading in the dataframe, we will pre-process the data in four steps:
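The notebook's own pre-processing code is not reproduced on this page. As a minimal sketch of the split, standardization and imputation described in the strategy above, the following uses a synthetic dataframe; the column names (`num_tracks`, `avg_popularity`, `log_followers`) and the mean-imputation strategy are assumptions made for illustration, not the project's actual choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the playlist dataframe (column names are hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "num_tracks": rng.integers(10, 200, size=100).astype(float),
    "avg_popularity": rng.uniform(0, 100, size=100),
    "log_followers": rng.uniform(0, 15, size=100),
})
df.loc[[3, 42], "avg_popularity"] = np.nan  # inject a few missing values

numeric_cols = ["num_tracks", "avg_popularity"]
X, y = df[numeric_cols], df["log_followers"]

# 90%-10% train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Standardize using statistics from the training rows only, so that no
# information from the test set leaks into the fit. StandardScaler ignores
# NaN when fitting and leaves NaN in place when transforming.
scaler = StandardScaler().fit(X_train)

# Impute remaining missing values with the training-column mean (assumed strategy).
imputer = SimpleImputer(strategy="mean")
X_train_p = imputer.fit_transform(scaler.transform(X_train))
X_test_p = imputer.transform(scaler.transform(X_test))

print(X_train_p.shape, X_test_p.shape)
```

Fitting the scaler and imputer on the training rows and only applying them to the test rows keeps the test set untouched until scoring.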

Model Score Function

Here, we choose 6 metrics to evaluate our models:

1) $R^2$ - R Squared = measures how well future datasets are likely to be predicted by the model. The score ranges from negative (because the model can be arbitrarily worse) to a best possible value of 1.0. Usually, the bigger the $R^2$, the better the model. Yet we do acknowledge the tendency of $R^2$ to reward over-fitting: as more predictors are added, it can only stay constant or increase on the training set.

2) $EVar$ - Explained Variance Score = measures how well the model explains the variance in the response variable. The best possible score is 1.0; lower values are worse, and the score can be negative for a model that fits poorly. Similar to $R^2$, the higher the score, the better the model.

3) $MAE$ - Mean Absolute Error = computes the expected value of the absolute error, or the $\ell_1$ loss function. For all the error functions, the smaller the error, the better the model.

4) $MSE$ - Mean Squared Error = computes the expected value of the squared error.

5) $MSLE$ - Mean Squared Log Error = computes the expected value of the squared logarithmic error. This would probably be the most appropriate metric to evaluate our models, as we log-transformed our response variable, the number of followers for the playlist.

6) $MEAE$ - Median Absolute Error = computes the loss function by using the median of all absolute differences between the actual values and the predicted values. This metric is robust to outliers.

The key metrics given the parameters of our pre-processing are 1) $R^2$ and 5) $MSLE$, which will be the most important basis of comparison for the 9 advanced models.
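A scoring function covering these six metrics could look roughly like the sketch below. The helper name `score_model` and the toy arrays are assumptions for illustration; the metric functions themselves come from `sklearn.metrics`.

```python
import numpy as np
import pandas as pd
from sklearn import metrics


def score_model(y_true, y_pred):
    """Return the six metrics as a pandas Series (hypothetical helper)."""
    return pd.Series({
        "R2": metrics.r2_score(y_true, y_pred),
        "EVar": metrics.explained_variance_score(y_true, y_pred),
        "MAE": metrics.mean_absolute_error(y_true, y_pred),
        "MSE": metrics.mean_squared_error(y_true, y_pred),
        # MSLE is only defined for non-negative targets and predictions.
        "MSLE": metrics.mean_squared_log_error(y_true, y_pred),
        "MEAE": metrics.median_absolute_error(y_true, y_pred),
    })


# Toy values, not taken from the project data.
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])
scores = score_model(y_true, y_pred)
print(scores)
```

Returning a Series makes it easy to collect one column per model and assemble the summary tables later on.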

Model Fitting

Gradient Boosting Regressor

According to Ben Gorman, if Linear Regression were a Toyota Camry, the Gradient Boosting Regressor would easily be a UH-60 Blackhawk Helicopter.

Gradient Boosting Regressor is an ensemble machine learning procedure that fits new models consecutively to provide a more reliable estimate of the response variable. It constructs new base-learners to be correlated with the negative gradient of the loss function. The available loss functions are: 1) least squares regression (ls), 2) least absolute deviation (lad), 3) huber (a combination of ls and lad), and 4) quantile, which allows for quantile regression. This choice of loss function gives Gradient Boosting great flexibility. Based on trial and error and cross-validation, the best loss function for our model is huber.

The principle behind this procedure is to adopt a slow learning approach where we fit a regression tree to the residuals from the model rather than to the actual response variable. We then add this new regression tree to the fitted function and update the residuals. The base learners here are small regression trees, and we slowly improve the fit in areas where it does not perform well.

The main tuning parameters are the number of splits in each tree ($d$), which controls the complexity of boosting; the number of trees ($B$), which can overfit if too big; and the shrinkage parameter ($\lambda$).

A simplified algorithm of boosting is described here (James, Witten, Hastie, and Tibshirani, 2013):

Step 1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set

Step 2. For $b = 1, 2, \cdots, B$, repeat the following steps:

(a) Fit a regression tree $\hat{f}_b$ with $d$ splits to the training data $(X, r)$

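(b) Update $\hat{f}$ by adding in a shrunken version of the new regression tree: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}_b(x)$

(c) Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}_b(x_i)$

Step 3. We finally output the boosted model: $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}_b(x)$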
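The three steps can be sketched directly with scikit-learn regression trees. This is an illustration of the simplified algorithm under squared-error loss, not the internals of `GradientBoostingRegressor`; the function names `boost` and `predict`, the synthetic data, and the use of `max_leaf_nodes = d + 1` to obtain $d$ splits are assumptions of this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def boost(X, y, B=100, d=2, lam=0.1):
    """Boosted regression trees following Steps 1-3; returns the fitted trees."""
    r = y.astype(float).copy()            # Step 1: f_hat = 0, so residuals r = y
    trees = []
    for _ in range(B):                    # Step 2
        # (a) a tree with d splits has d + 1 terminal nodes
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1, random_state=0)
        tree.fit(X, r)
        trees.append(tree)                # (b) f_hat += lam * tree (stored implicitly)
        r -= lam * tree.predict(X)        # (c) update the residuals
    return trees


def predict(trees, X, lam=0.1):
    """Step 3: the boosted model is the sum of the shrunken trees."""
    return lam * sum(tree.predict(X) for tree in trees)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

trees = boost(X, y, B=100, d=2, lam=0.1)
y_hat = predict(trees, X, lam=0.1)
print("training MSE:", np.mean((y - y_hat) ** 2))
```

A smaller `lam` slows the learning and typically needs a larger `B` to reach the same training error, which is the trade-off between the shrinkage parameter and the number of trees.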

GradientBoostingRegressor(alpha=0.99, criterion='friedman_mse', init=None,
             learning_rate=0.03, loss='huber', max_depth=5,
             max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=200,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

Random Forest Regressor

Random Forest Regressor is another commonly used machine learning method that incorporates different regression trees. We bootstrap the training samples, which are used to build an ensemble of full regression trees. Each time a split in a tree is considered, we randomly select a subset of predictors (typically set as $m = \sqrt{p}$) of the full set of $p$ predictors. Finally, we average the predictions of the trees. This is an improvement over bagging (which takes the full set of predictors, $m=p$) because by allowing only a random subset of predictors at each split, Random Forest effectively decorrelates the regression trees.

In Boosting, weak learners (high bias, low variance) are used as base learners and are modelled sequentially to reduce bias. Random Forest, in contrast, grows full decision trees (low bias, high variance) in parallel and aims to reduce variance. In this way, Random Forest effectively limits the chances of over-fitting, whereas boosting sometimes does overfit.
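To make the bootstrap-and-average idea concrete, here is a minimal hand-rolled sketch built from individual decision trees. It is not the `RandomForestRegressor` implementation; the function names `fit_forest` and `predict_forest` and the synthetic data are assumptions of this example. Passing `max_features=None` makes every split consider all $p$ predictors, which corresponds to plain bagging.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest(X, y, n_trees=50, max_features="sqrt", seed=0):
    """Fit fully grown trees on bootstrap samples of the training data."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for b in range(n_trees):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample rows with replacement
        # max_features limits the predictors considered at each split
        tree = DecisionTreeRegressor(max_features=max_features, random_state=b)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Average the predictions of the individual trees."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Synthetic data for illustration only (p = 9, so sqrt(p) = 3).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 9))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

forest = fit_forest(X, y, max_features="sqrt")   # random forest: m = sqrt(p)
bagged = fit_forest(X, y, max_features=None)     # bagging: m = p

print("forest training MSE:", np.mean((y - predict_forest(forest, X)) ** 2))
print("bagged training MSE:", np.mean((y - predict_forest(bagged, X)) ** 2))
```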

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=15,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=2, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
           oob_score=False, random_state=2, verbose=0, warm_start=False)

Huber Regressor

HuberRegressor(alpha=10000, epsilon=1.0, fit_intercept=True, max_iter=100,
        tol=1e-05, warm_start=False)

Elastic Net

ElasticNet(alpha=0.05, copy_X=True, fit_intercept=True, l1_ratio=1.0,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

SVR

SVR(C=10.0, cache_size=200, coef0=0.0, degree=3, epsilon=2.0, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

Neural Network

MLPRegressor(activation='relu', alpha=1e-06, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

Adaboost Regressor

AdaBoostRegressor(base_estimator=GradientBoostingRegressor(alpha=0.95, criterion='friedman_mse', init=None,
             learning_rate=0.01, loss='huber', max_depth=3,
             max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=200,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False),
         learning_rate=0.04, loss='exponential', n_estimators=200,
         random_state=None)

Bagging Regressor

BaggingRegressor(base_estimator=GradientBoostingRegressor(alpha=0.95, criterion='friedman_mse', init=None,
             learning_rate=0.01, loss='huber', max_depth=3,
             max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=200,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

Extra Trees Regressor

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=15,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
          oob_score=False, random_state=None, verbose=0, warm_start=False)

Model Summary

Training Data

Insights

Training Scores:

| Metric | 1. Gradient Boosting | 2. Random Forest | 3. Huber | 4. Elastic Net | 5. SVR | 6. Neural Network | 7. Adaboost | 8. Bagging | 9. Extra Trees |
|---|---|---|---|---|---|---|---|---|---|
| R2 | 0.736802 | 0.849933 | -10.140200 | 0.235608 | 0.316403 | -1.250388 | 0.334077 | 0.248881 | 0.919044 |
| EVar | 0.736812 | 0.849968 | -0.366807 | 0.235608 | 0.318354 | -0.782976 | 0.339111 | 0.251695 | 0.919044 |
| MAE | 1.216138 | 0.924503 | 9.558437 | 2.126171 | 2.015300 | 3.631153 | 2.067943 | 2.119796 | 0.611329 |
| MSE | 2.406400 | 1.372049 | 101.854169 | 6.988793 | 6.250089 | 20.575162 | 6.088492 | 6.867435 | 0.740172 |
| MSLE | 0.034869 | 0.025533 | 5.343244 | 0.094997 | 0.090284 | 0.346028 | 0.082981 | 0.097215 | 0.007881 |
| MEAE | 0.993349 | 0.778746 | 10.386407 | 1.803829 | 1.988947 | 3.070110 | 1.918677 | 1.819704 | 0.433278 |

Training Scores Ranking:

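| Metric | 1. Gradient Boosting | 2. Random Forest | 3. Huber | 4. Elastic Net | 5. SVR | 6. Neural Network | 7. Adaboost | 8. Bagging | 9. Extra Trees |
|---|---|---|---|---|---|---|---|---|---|
| R2 | 3.0 | 2.0 | 9.0 | 7.0 | 5.0 | 8.0 | 4.0 | 6.0 | 1.0 |
| EVar | 3.0 | 2.0 | 8.0 | 7.0 | 5.0 | 9.0 | 4.0 | 6.0 | 1.0 |
| MAE | 3.0 | 2.0 | 9.0 | 7.0 | 4.0 | 8.0 | 5.0 | 6.0 | 1.0 |
| MSE | 3.0 | 2.0 | 9.0 | 7.0 | 5.0 | 8.0 | 4.0 | 6.0 | 1.0 |
| MSLE | 3.0 | 2.0 | 9.0 | 6.0 | 5.0 | 8.0 | 4.0 | 7.0 | 1.0 |
| MEAE | 3.0 | 2.0 | 9.0 | 4.0 | 7.0 | 8.0 | 6.0 | 5.0 | 1.0 |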

Test Data

Insights

Test Scores:

| Metric | 1. Gradient Boosting | 2. Random Forest | 3. Huber | 4. Elastic Net | 5. SVR | 6. Neural Network | 7. Adaboost | 8. Bagging | 9. Extra Trees |
|---|---|---|---|---|---|---|---|---|---|
| R2 | 0.364040 | 0.335447 | -10.592332 | 0.252289 | 0.148519 | -1.922166 | 0.251908 | 0.216170 | 0.115049 |
| EVar | 0.364041 | 0.336090 | -0.096876 | 0.254182 | 0.151558 | -1.353871 | 0.253246 | 0.225546 | 0.115100 |
| MAE | 1.868775 | 1.906521 | 9.485701 | 2.021368 | 2.135076 | 4.055117 | 2.065463 | 2.024350 | 2.136890 |
| MSE | 5.440345 | 5.684945 | 99.167105 | 6.396325 | 7.284031 | 24.997796 | 6.399582 | 6.705308 | 7.570352 |
| MSLE | 0.076427 | 0.085219 | 5.374158 | 0.097137 | 0.110722 | 0.457958 | 0.094543 | 0.102952 | 0.102324 |
| MEAE | 1.595252 | 1.588612 | 9.923143 | 1.710858 | 1.880958 | 3.540511 | 1.928455 | 1.713208 | 1.593460 |

Test Score Rankings:

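| Metric | 1. Gradient Boosting | 2. Random Forest | 3. Huber | 4. Elastic Net | 5. SVR | 6. Neural Network | 7. Adaboost | 8. Bagging | 9. Extra Trees |
|---|---|---|---|---|---|---|---|---|---|
| R2 | 1.0 | 2.0 | 9.0 | 3.0 | 6.0 | 8.0 | 4.0 | 5.0 | 7.0 |
| EVar | 1.0 | 2.0 | 8.0 | 3.0 | 6.0 | 9.0 | 4.0 | 5.0 | 7.0 |
| MAE | 1.0 | 2.0 | 9.0 | 3.0 | 6.0 | 8.0 | 5.0 | 4.0 | 7.0 |
| MSE | 1.0 | 2.0 | 9.0 | 3.0 | 6.0 | 8.0 | 4.0 | 5.0 | 7.0 |
| MSLE | 1.0 | 2.0 | 9.0 | 4.0 | 7.0 | 8.0 | 3.0 | 6.0 | 5.0 |
| MEAE | 3.0 | 1.0 | 9.0 | 4.0 | 6.0 | 8.0 | 7.0 | 5.0 | 2.0 |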
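The rankings follow from the scores once the direction of each metric is taken into account: for $R^2$ and $EVar$ a higher score ranks better, while for the error metrics a lower score ranks better. The sketch below shows one way this could be computed with pandas, using the $R^2$ and $MSLE$ test scores reported above; the variable names are assumptions, and the notebook may have computed the rankings differently.

```python
import pandas as pd

models = ["Gradient Boosting", "Random Forest", "Huber", "Elastic Net", "SVR",
          "Neural Network", "Adaboost", "Bagging", "Extra Trees"]

# R2 and MSLE rows copied from the test score table.
test_scores = pd.DataFrame(
    [[0.364040, 0.335447, -10.592332, 0.252289, 0.148519,
      -1.922166, 0.251908, 0.216170, 0.115049],
     [0.076427, 0.085219, 5.374158, 0.097137, 0.110722,
      0.457958, 0.094543, 0.102952, 0.102324]],
    index=["R2", "MSLE"],
    columns=models,
)

higher_is_better = {"R2", "EVar"}

# Rank each metric across models: descending for scores, ascending for errors.
ranks = pd.DataFrame({
    metric: row.rank(ascending=metric not in higher_is_better)
    for metric, row in test_scores.iterrows()
}).T

print(ranks)
```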

Cross Validation

In this last section we assess the accuracy (bias) and precision (residual error) of model 1, the Gradient Boosting Regressor. The idea of cross validation is to divide our training dataset into 5 random parts: 4 parts form the training set used to estimate the model, and the remaining part forms the validation set used to check the predictive capability and refine the model, with each part serving as the validation set in turn. The test set is only used once to estimate the model’s true error. In this way, we can fine-tune our ensemble models’ parameters to ensure a more statistically robust model.

grid = {'max_depth': [5, 10],
        'learning_rate': [0.03, 0.07, 0.10], 
        'n_estimators': [50, 100, 200], 
        'loss': ['ls', 'huber'],
        'alpha': [0.7, 0.9, 0.99], 
        'max_features': ['sqrt', 'auto', 'log2']}
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [5, 10], 'learning_rate': [0.03, 0.07, 0.1], 'n_estimators': [50, 100, 200], 'loss': ['ls', 'huber'], 'alpha': [0.7, 0.9, 0.99], 'max_features': ['sqrt', 'auto', 'log2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
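For readers who want to run a comparable search, the following is a reduced sketch on synthetic data. The grid is deliberately smaller than the one above so that it finishes quickly, and it uses only parameter values that recent scikit-learn releases accept: the `'ls'` loss and `max_features='auto'` shown in the output above come from an older release and have since been renamed or removed. The data, the variable names and the reduced grid are assumptions of this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=120)

# Hold out 10% as the test set; it is not used during the search.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Reduced grid (16 combinations) so the sketch runs in seconds.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.03, 0.1],
    "n_estimators": [50],
    "loss": ["huber"],
    "alpha": [0.9, 0.99],
    "max_features": ["sqrt", None],
}

# 5-fold cross validation on the training set only.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0), param_grid, cv=5
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
# With refit=True (the default), the best estimator is refit on the full
# training set and can be scored once on the held-out test set.
print("test R2:", search.best_estimator_.score(X_test, y_test))
```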

Cross-Validated Model - Fit on Training Set, Scored on Test Set

R2      0.363502
EVar    0.363517
MAE     1.871419
MSE     5.444946
MSLE    0.076121
MEAE    1.605702
dtype: float64


Best Estimator Parameters
loss: huber
max_depth: 5
n_estimators: 200
learning_rate: 0.03
alpha: 0.99
max_features: auto