Please refer to Advanced Models for the source code (Jupyter Notebook).
1) Data Pre-Processing: After reading in the dataframe, we first split the data into training and test sets (a 90%-10% split, chosen because the dataset is small). Then we standardize the numerical columns, and finally we check for any missing data and impute accordingly.
2) Model Score Function: To keep the model summaries simple, we will create a model scoring function that reports 6 metrics
3) Model Fitting: Here, we will fit 9 different advanced regressors on the training data and then predict on the test data
4) Model Summary: After fitting all the models, we will present 3 summary tables based on training score, test score and qualitative metrics for the models
5) Cross Validation: Based on the summary, we will further fine-tune the parameters of the best model using cross validation
After reading in the dataframe, we will pre-process the data in four steps:
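As a minimal sketch of the split, standardize, and impute operations described in the overview, the following uses only the Python standard library. The column name `num_tracks`, the helper names, and the toy rows are all hypothetical; the notebook itself may well use pandas and scikit-learn instead. The point it illustrates is the ordering: the scaling statistics are estimated on the training rows only and then reused for the test rows, so no information from the test set leaks into the fit.

```python
import random
from statistics import mean, pstdev

def split_rows(rows, test_fraction=0.10, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(train_rows, column):
    """Mean and standard deviation of one column, ignoring missing (None) values."""
    observed = [row[column] for row in train_rows if row[column] is not None]
    sigma = pstdev(observed)
    return mean(observed), (sigma if sigma > 0 else 1.0)  # guard constant columns

def transform(rows, column, mu, sigma):
    """Standardize with the training statistics; impute missing values with 0.0,
    which is the training mean on the standardized scale."""
    out = []
    for row in rows:
        new_row = dict(row)
        value = row[column]
        new_row[column] = 0.0 if value is None else (value - mu) / sigma
        out.append(new_row)
    return out

# Hypothetical toy data: 19 observed values and one missing value.
rows = [{"num_tracks": float(n)} for n in range(1, 20)] + [{"num_tracks": None}]
train, test = split_rows(rows)                      # 90%-10% split
mu, sigma = fit_scaler(train, "num_tracks")         # statistics from training rows only
train_ready = transform(train, "num_tracks", mu, sigma)
test_ready = transform(test, "num_tracks", mu, sigma)
```

In scikit-learn the same steps are usually written with `train_test_split`, `StandardScaler`, and `SimpleImputer`.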
Here, we choose 6 metrics to evaluate our models:
1) $R^2$ - R Squared = measures how well future datasets are likely to be predicted by the model. The score ranges from negative (because the model can be arbitrarily worse) to a best possible value of 1.0. Usually, the bigger the $R^2$, the better the model. Yet we do acknowledge the tendency of $R^2$ to reward over-fitting: as more predictors are added, it will only remain constant or increase for the training set.
2) $EVar$ - Explained Variance Score = measures how well the model explains the variance in the response variable. The best possible score is 1.0, and lower values are worse; like $R^2$, it can become negative when the model fits badly. Similar to $R^2$, the higher the score, the better the model.
3) $MAE$ - Mean Absolute Error = computes the expected value of the absolute error, or the $\ell_1$ loss function. For all the error functions, the smaller the error, the better the model.
4) $MSE$ - Mean Squared Error = computes the expected value of the squared error.
5) $MSLE$ - Mean Squared Log Error = computes the expected value of the squared logarithmic error. This would probably be the most appropriate metric to evaluate our models, as we log-transformed our response variable, the number of followers for the playlist.
6) $MEAE$ - Median Absolute Error = computes the loss function by using the median of all absolute differences between the actual values and the predicted values. This metric is robust to outliers.
The key metrics given the parameters of our pre-processing are 1) $R^2$ and 5) $MSLE$, which will be the most important basis of comparison for the 9 advanced models.
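To make the six definitions concrete, here is a minimal sketch of a scoring function written from the formulas, using only the standard library. The function name `score_model` and the toy numbers are assumptions for illustration; the notebook's own scoring function is not shown in this text. MSLE is computed on $\log(1 + y)$, so the sketch assumes non-negative targets and predictions.

```python
import math
from statistics import mean, median

def score_model(y_true, y_pred):
    """Return the six metrics as a dict, computed directly from their definitions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    y_bar = mean(y_true)
    e_bar = mean(errors)
    ss_res = sum(e ** 2 for e in errors)               # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)     # total sum of squares
    ss_err_centered = sum((e - e_bar) ** 2 for e in errors)
    log_errors = [math.log1p(t) - math.log1p(p) for t, p in zip(y_true, y_pred)]
    return {
        "R2": 1 - ss_res / ss_tot,
        "EVar": 1 - ss_err_centered / ss_tot,          # ignores a constant bias in the errors
        "MAE": mean([abs(e) for e in errors]),
        "MSE": ss_res / len(errors),
        "MSLE": mean([le ** 2 for le in log_errors]),
        "MEAE": median([abs(e) for e in errors]),
    }

# Hypothetical toy example.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
scores = score_model(y_true, y_pred)
for name, value in scores.items():
    print(f"{name:5s} {value:.6f}")
```

Because $EVar$ subtracts the mean error before squaring, it is never smaller than $R^2$; the two coincide when the errors average to zero. The same quantities are available in `sklearn.metrics` as `r2_score`, `explained_variance_score`, `mean_absolute_error`, `mean_squared_error`, `mean_squared_log_error`, and `median_absolute_error`.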
According to Ben Gorman, if Linear Regression were a Toyota Camry, the Gradient Boosting Regressor would easily be a UH-60 Blackhawk Helicopter.
Gradient Boosting Regressor is an ensemble machine learning procedure that fits new models sequentially to provide a more reliable estimate of the response variable. Each new base learner is constructed to be correlated with the negative gradient of the loss function. The available loss functions are: 1) least squares regression (ls), 2) least absolute deviation (lad), 3) huber (a combination of ls and lad), and 4) quantile, which allows for quantile regression. The choice of the loss function allows for great flexibility in Gradient Boosting. Based on trial and error and cross-validation, huber is the best loss function for our model.
The principle behind this procedure is to adopt a slow learning approach where we fit a regression tree to the residuals from the model rather than to the actual response variable. We then add this new regression tree to the fitted function and update the residuals. The base learners here can be small regression trees, and we slowly improve the fit in areas where it does not perform well.
The main tuning parameters are the number of splits in each tree ($d$), which controls the complexity of boosting; the number of trees ($B$), which can overfit if too big; and the shrinkage parameter ($\lambda$).
A simplified algorithm of boosting is described here (James, Witten, Hastie, and Tibshirani, 2013):
Step 1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set
Step 2. For $b = 1, 2, \cdots, B$, repeat the following steps:
(a) Fit a regression tree $\hat{f}_b$ with $d$ splits to the training data $(X, r)$
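(b) Update $\hat{f}$ by adding in a shrunken version of the new regression tree: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}_b(x)$
(c) Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}_b(x_i)$
Step 3. We finally output the boosted model: $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}_b(x)$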
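The three steps above can be sketched in plain Python. This is an illustration only, not the scikit-learn implementation: it fixes $d = 1$ (each tree is a single-split "stump"), uses squared-error loss, and the function names `fit_stump` and `boost` as well as the toy data are assumptions.

```python
def fit_stump(X, r):
    """Fit a regression tree with d = 1 split to the residuals r by least squares."""
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        for threshold in sorted({row[j] for row in X}):
            left = [r[i] for i in range(n) if X[i][j] <= threshold]
            right = [r[i] for i in range(n) if X[i][j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    if best is None:                      # no valid split: predict the mean residual
        m = sum(r) / n
        return lambda row: m
    _, j, threshold, lm, rm = best
    return lambda row: lm if row[j] <= threshold else rm

def boost(X, y, B=100, lam=0.1):
    """Boosting for regression with B stumps and shrinkage lam."""
    trees = []
    r = list(y)                                              # Step 1: f_hat = 0, r_i = y_i
    for _ in range(B):                                       # Step 2
        tree = fit_stump(X, r)                               # (a) fit a tree to (X, r)
        trees.append(tree)                                   # (b) f_hat += lam * tree
        r = [ri - lam * tree(row) for ri, row in zip(r, X)]  # (c) update the residuals
    return lambda row: sum(lam * t(row) for t in trees)      # Step 3: boosted model

def mse(model, X, y):
    return sum((yi - model(row)) ** 2 for row, yi in zip(X, y)) / len(y)

# Hypothetical toy data: y = x^2 on a small grid.
X = [[k / 10] for k in range(20)]
y = [row[0] ** 2 for row in X]
model_20 = boost(X, y, B=20)
model_200 = boost(X, y, B=200)
print(mse(model_20, X, y), mse(model_200, X, y))
```

With a least-squares stump and $0 < \lambda \le 1$, each round can only lower the training error, which is why a large $B$ eventually over-fits and has to be chosen by validation rather than by training score.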
GradientBoostingRegressor(alpha=0.99, criterion='friedman_mse', init=None,
learning_rate=0.03, loss='huber', max_depth=5,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200,
presort='auto', random_state=None, subsample=1.0, verbose=0,
warm_start=False)
Random Forest Regressor is another commonly used machine learning method that combines different regression trees. We bootstrap the training samples and use each bootstrap sample to build one tree in an ensemble of full regression trees. Each time a split in a tree is considered, we randomly select a subset of $m$ predictors (typically set as $m = \sqrt{p}$) from the full set of $p$ predictors. Finally, we average the predictions of the trees. This is an improvement over bagging (which considers the full set of predictors, $m = p$) because by allowing only a random subset of predictors at each split, Random Forest effectively decorrelates the regression trees.
In Boosting, weak learners (high bias, low variance) are used as base learners and are fitted sequentially to reduce bias. Random Forest, in contrast, fits fully grown decision trees (low bias, high variance) in parallel and aims to reduce variance. In this way, Random Forest effectively limits the chances of over-fitting, whereas boosting sometimes does overfit.
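The bootstrap, the random subset of $m$ predictors, and the final averaging can be sketched as follows. To stay short, each "tree" here is a single split rather than a fully grown tree, so this is a simplification of the method described above and not the scikit-learn algorithm. The names `best_split`, `fit_forest`, and `predict`, and the toy data, are assumptions.

```python
import math
import random

def best_split(X, y, features):
    """Least-squares split, searching only the candidate `features`."""
    best = None
    for j in features:
        for threshold in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    return best

def fit_forest(X, y, n_trees=50, m=None, seed=0):
    """Ensemble of one-split trees; m = p reproduces plain bagging."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = m or max(1, int(math.sqrt(p)))                  # default m = sqrt(p)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample, with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        features = rng.sample(range(p), m)              # random subset of predictors
        split = best_split(Xb, yb, features)
        if split is None:                               # degenerate sample: constant tree
            mean_y = sum(yb) / len(yb)
            trees.append((0, float("inf"), mean_y, mean_y))
        else:
            trees.append(split[1:])                     # (feature, threshold, left, right)
    return trees

def predict(trees, row):
    """Average the predictions of all trees."""
    preds = [lm if row[j] <= t else rm for j, t, lm, rm in trees]
    return sum(preds) / len(preds)

# Hypothetical toy data: 4 predictors, only the first one matters.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(4)] for _ in range(40)]
y = [3 * row[0] + data_rng.gauss(0, 0.1) for row in X]
forest = fit_forest(X, y)
print(predict(forest, X[0]), y[0])
```

Calling `fit_forest(X, y, m=4)` lets every tree see all four predictors, which is the bagging case. With $m < p$, some trees are forced to split on other predictors, which is the decorrelation effect.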
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=15,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
oob_score=False, random_state=2, verbose=0, warm_start=False)
HuberRegressor(alpha=10000, epsilon=1.0, fit_intercept=True, max_iter=100,
tol=1e-05, warm_start=False)
ElasticNet(alpha=0.05, copy_X=True, fit_intercept=True, l1_ratio=1.0,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
SVR(C=10.0, cache_size=200, coef0=0.0, degree=3, epsilon=2.0, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
MLPRegressor(activation='relu', alpha=1e-06, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
nesterovs_momentum=True, power_t=0.5, random_state=None,
shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
verbose=False, warm_start=False)
AdaBoostRegressor(base_estimator=GradientBoostingRegressor(alpha=0.95, criterion='friedman_mse', init=None,
learning_rate=0.01, loss='huber', max_depth=3,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200,
presort='auto', random_state=None, subsample=1.0, verbose=0,
warm_start=False),
learning_rate=0.04, loss='exponential', n_estimators=200,
random_state=None)
BaggingRegressor(base_estimator=GradientBoostingRegressor(alpha=0.95, criterion='friedman_mse', init=None,
learning_rate=0.01, loss='huber', max_depth=3,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200,
presort='auto', random_state=None, subsample=1.0, verbose=0,
warm_start=False),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=10, n_jobs=1, oob_score=False,
random_state=None, verbose=0, warm_start=False)
ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=15,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
Insights
Training Scores:
Metric | 1. Gradient Boosting | 2. Random Forest | 3. Huber | 4. Elastic Net | 5. SVR | 6. Neural Network | 7. Adaboost | 8. Bagging | 9. Extra Trees |
---|---|---|---|---|---|---|---|---|---|
R2 | 0.736802 | 0.849933 | -10.140200 | 0.235608 | 0.316403 | -1.250388 | 0.334077 | 0.248881 | 0.919044 |
EVar | 0.736812 | 0.849968 | -0.366807 | 0.235608 | 0.318354 | -0.782976 | 0.339111 | 0.251695 | 0.919044 |
MAE | 1.216138 | 0.924503 | 9.558437 | 2.126171 | 2.015300 | 3.631153 | 2.067943 | 2.119796 | 0.611329 |
MSE | 2.406400 | 1.372049 | 101.854169 | 6.988793 | 6.250089 | 20.575162 | 6.088492 | 6.867435 | 0.740172 |
MSLE | 0.034869 | 0.025533 | 5.343244 | 0.094997 | 0.090284 | 0.346028 | 0.082981 | 0.097215 | 0.007881 |
MEAE | 0.993349 | 0.778746 | 10.386407 | 1.803829 | 1.988947 | 3.070110 | 1.918677 | 1.819704 | 0.433278 |
Training Score Rankings:
Metric | 1. Gradient Boosting | 2. Random Forest | 3. Huber | 4. Elastic Net | 5. SVR | 6. Neural Network | 7. Adaboost | 8. Bagging | 9. Extra Trees |
---|---|---|---|---|---|---|---|---|---|
R2 | 3.0 | 2.0 | 9.0 | 7.0 | 5.0 | 8.0 | 4.0 | 6.0 | 1.0 |
EVar | 3.0 | 2.0 | 8.0 | 7.0 | 5.0 | 9.0 | 4.0 | 6.0 | 1.0 |
MAE | 3.0 | 2.0 | 9.0 | 7.0 | 4.0 | 8.0 | 5.0 | 6.0 | 1.0 |
MSE | 3.0 | 2.0 | 9.0 | 7.0 | 5.0 | 8.0 | 4.0 | 6.0 | 1.0 |
MSLE | 3.0 | 2.0 | 9.0 | 6.0 | 5.0 | 8.0 | 4.0 | 7.0 | 1.0 |
MEAE | 3.0 | 2.0 | 9.0 | 4.0 | 7.0 | 8.0 | 6.0 | 5.0 | 1.0 |
Insights
On the test data, Gradient Boosting Regressor and Random Forest Regressor rank first and second on both $R^2$ and $MSLE$. Third place is close: Elastic Net narrowly edges out Adaboost Regressor on $R^2$ (0.2523 vs. 0.2519), while Adaboost Regressor ranks third on $MSLE$.
In the cross-validation section, the parameters of the Gradient Boosting Regressor could be further fine-tuned to enhance its performance.
Our results closely resemble the empirical findings of Rich Caruana and Alexandru Niculescu-Mizil in An Empirical Comparison of Supervised Learning Algorithms Using Different Performance Metrics (2005), which concluded that boosted trees were the best learning algorithm, closely followed by Random Forest.
While Gradient Boosting Regressor outperforms Random Forest Regressor by about 3 percentage points of test $R^2$ (0.364 vs. 0.335), we also need to acknowledge the apparent advantages of the Random Forest Regressor: it naturally handles categorical predictor variables, it is relatively time-efficient even for larger datasets, it does not require any assumption about the underlying population distribution, it fits non-linear interactions well, and it selects variables automatically. Most importantly, Random Forest has only one hyper-parameter that we need to set, the number of features to select at each node, whereas Gradient Boosting has several hyperparameters that we need to fine-tune in cross-validation.
Test Scores:
Metric | 1. Gradient Boosting | 2. Random Forest | 3. Huber | 4. Elastic Net | 5. SVR | 6. Neural Network | 7. Adaboost | 8. Bagging | 9. Extra Trees |
---|---|---|---|---|---|---|---|---|---|
R2 | 0.364040 | 0.335447 | -10.592332 | 0.252289 | 0.148519 | -1.922166 | 0.251908 | 0.216170 | 0.115049 |
EVar | 0.364041 | 0.336090 | -0.096876 | 0.254182 | 0.151558 | -1.353871 | 0.253246 | 0.225546 | 0.115100 |
MAE | 1.868775 | 1.906521 | 9.485701 | 2.021368 | 2.135076 | 4.055117 | 2.065463 | 2.024350 | 2.136890 |
MSE | 5.440345 | 5.684945 | 99.167105 | 6.396325 | 7.284031 | 24.997796 | 6.399582 | 6.705308 | 7.570352 |
MSLE | 0.076427 | 0.085219 | 5.374158 | 0.097137 | 0.110722 | 0.457958 | 0.094543 | 0.102952 | 0.102324 |
MEAE | 1.595252 | 1.588612 | 9.923143 | 1.710858 | 1.880958 | 3.540511 | 1.928455 | 1.713208 | 1.593460 |
Test Score Rankings:
Metric | 1. Gradient Boosting | 2. Random Forest | 3. Huber | 4. Elastic Net | 5. SVR | 6. Neural Network | 7. Adaboost | 8. Bagging | 9. Extra Trees |
---|---|---|---|---|---|---|---|---|---|
R2 | 1.0 | 2.0 | 9.0 | 3.0 | 6.0 | 8.0 | 4.0 | 5.0 | 7.0 |
EVar | 1.0 | 2.0 | 8.0 | 3.0 | 6.0 | 9.0 | 4.0 | 5.0 | 7.0 |
MAE | 1.0 | 2.0 | 9.0 | 3.0 | 6.0 | 8.0 | 5.0 | 4.0 | 7.0 |
MSE | 1.0 | 2.0 | 9.0 | 3.0 | 6.0 | 8.0 | 4.0 | 5.0 | 7.0 |
MSLE | 1.0 | 2.0 | 9.0 | 4.0 | 7.0 | 8.0 | 3.0 | 6.0 | 5.0 |
MEAE | 3.0 | 1.0 | 9.0 | 4.0 | 6.0 | 8.0 | 7.0 | 5.0 | 2.0 |
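The ranking tables follow mechanically from the score tables: for $R^2$ and $EVar$ the highest value ranks first, and for the four error metrics the lowest value ranks first. A minimal sketch, using the test $R^2$ and $MSLE$ rows copied from the table above, is shown below. The helper `rank_models` is hypothetical; the float-valued ranks in the tables suggest something like pandas' `rank` was used, but that is an assumption.

```python
MODELS = ["Gradient Boosting", "Random Forest", "Huber", "Elastic Net", "SVR",
          "Neural Network", "Adaboost", "Bagging", "Extra Trees"]

# Test scores copied from the table, in the same model order as MODELS.
test_scores = {
    "R2":   [0.364040, 0.335447, -10.592332, 0.252289, 0.148519,
             -1.922166, 0.251908, 0.216170, 0.115049],
    "MSLE": [0.076427, 0.085219, 5.374158, 0.097137, 0.110722,
             0.457958, 0.094543, 0.102952, 0.102324],
}
HIGHER_IS_BETTER = {"R2", "EVar"}   # every other metric is an error: lower is better

def rank_models(values, higher_is_better):
    """Rank 1 = best. Ties are not handled, as none occur in these rows."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

rankings = {metric: rank_models(vals, metric in HIGHER_IS_BETTER)
            for metric, vals in test_scores.items()}
for metric, ranks in rankings.items():
    print(metric, dict(zip(MODELS, ranks)))
```

The output reproduces the $R^2$ and $MSLE$ rows of the test ranking table, including the swap between Elastic Net and Adaboost for third place.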
In the last section we will assess the accuracy (bias) and precision (residual error) of model 1, the Gradient Boosting Regressor. The idea of cross validation is that we divide our training dataset into 5 random parts. In each round, 4 parts form the training set used to estimate the model, and the remaining part forms the validation set used to check the predictive capability and refine the model; each part takes a turn as the validation set. The test set is only used once, to estimate the model’s true error. In this way, we can fine-tune our ensemble models’ parameters to ensure a more statistically robust model.
grid = {'max_depth': [5, 10],
'learning_rate': [0.03, 0.07, 0.10],
'n_estimators': [50, 100, 200],
'loss': ['ls', 'huber'],
'alpha': [0.7, 0.9, 0.99],
'max_features': ['sqrt', 'auto', 'log2']}
GridSearchCV(cv=5, error_score='raise',
estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=3, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, presort='auto', random_state=None,
subsample=1.0, verbose=0, warm_start=False),
fit_params={}, iid=True, n_jobs=-1,
param_grid={'max_depth': [5, 10], 'learning_rate': [0.03, 0.07, 0.1], 'n_estimators': [50, 100, 200], 'loss': ['ls', 'huber'], 'alpha': [0.7, 0.9, 0.99], 'max_features': ['sqrt', 'auto', 'log2']},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
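To give a sense of what this search does, the sketch below enumerates the parameter grid and builds 5-fold train/validation index splits by hand. It is an illustration of the bookkeeping only, not of `GridSearchCV` internals; the helper `k_fold_indices` and the sample count of 23 are assumptions. The folds here are contiguous and unshuffled, which is also the default behaviour of scikit-learn's `KFold`.

```python
from itertools import product

grid = {
    "max_depth": [5, 10],
    "learning_rate": [0.03, 0.07, 0.10],
    "n_estimators": [50, 100, 200],
    "loss": ["ls", "huber"],
    "alpha": [0.7, 0.9, 0.99],
    "max_features": ["sqrt", "auto", "log2"],
}

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx); every sample is in the validation fold exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Every combination of grid values is one candidate model.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
folds = list(k_fold_indices(n_samples=23, k=5))   # 23 is an arbitrary toy sample count
n_fits = len(candidates) * len(folds)
print(len(candidates), "candidates x", len(folds), "folds =", n_fits, "fits")
```

The grid has 2 × 3 × 3 × 2 × 3 × 3 = 324 candidates, so 5-fold cross-validation needs 1,620 model fits, plus one final refit of the best candidate on the full training set. In scikit-learn, `alpha` only affects the `huber` and `quantile` losses, so the three `alpha` values give identical models whenever `loss='ls'`.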
Cross-Validated Model - Fit on Training Set, Scored on Test Set
R2 0.363502
EVar 0.363517
MAE 1.871419
MSE 5.444946
MSLE 0.076121
MEAE 1.605702
dtype: float64
Best Estimator Parameters
loss: huber
max_depth: 5
n_estimators: 200
learning rate: 0.03
alpha: 0.99
max_features: auto