Random Forest for Classification

Section 1. Read the prepared training/validation/test sets

clear;clc; %Clear the workspace and the screen
rng(42); %Set the random number generator to "the answer to everything"
addpath(genpath('.'));
data_path='Data';
filename='PreparedData.mat';
full_filename=fullfile(data_path, filename);
load(full_filename); %Load the dataset
CategoricalFeatureNames = {'Gender','Education','Marriage','PayStatus_Sep','PayStatus_Aug','PayStatus_Jul',...
'PayStatus_Jun','PayStatus_May','PayStatus_Apr'};
RelevantFeatureNames = {'Credit_Limit','Gender','Education','Marriage','Age','PayStatus_Sep','PayStatus_Aug','PayStatus_Jul',...
'PayStatus_Jun','PayStatus_May','PayStatus_Apr','Bill_Sep','Bill_Aug','Bill_Jul','Bill_Jun',...
'Bill_May','Bill_Apr','Pay_Sep','Pay_Aug','Pay_Jul','Pay_Jun','Pay_May','Pay_Apr'};
[RelevantFeatures, indexinVariableNames, indexinRelevantFeatureNames]=intersect(VariableNames,RelevantFeatureNames,'stable');
[RelevantCategoricalFeatures, indexinCategoricalFeatureNames, CategindexinRelevantFeatures]=intersect(CategoricalFeatureNames,RelevantFeatures,'stable');
X_train=XY_train(:,RelevantFeatures);
Y_train=table2array(XY_train(:,'Default_Status'));
X_test=XY_test(:,RelevantFeatures);
Y_test=table2array(XY_test(:,'Default_Status'));
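The intersect calls above use the 'stable' flag, which keeps the matched names in the order they appear in the first argument and also returns the matching index vectors. A minimal, self-contained illustration with toy names (not from the dataset):
A = {'Age','Gender','Bill_Sep','Default_Status'}; % e.g. all variable names in the table
B = {'Gender','Age'};                             % e.g. the subset we care about
[C, ia, ib] = intersect(A, B, 'stable');
% C  = {'Age','Gender'}  - common names, ordered as they appear in A
% ia = [1; 2]            - positions of the common names in A
% ib = [2; 1]            - positions of the common names in B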

Section 2. Train a small decision tree

We will use this small decision tree as a baseline to compare against the random forests below.
rng(42); %Set the random number generator to "the answer to everything"
maxNumSplits=4;
Tree= fitctree(X_train,Y_train,...
'PredictorNames',RelevantFeatureNames,...
'MaxNumSplits',maxNumSplits,...
'CategoricalPredictors',RelevantCategoricalFeatures);
view(Tree,'mode','graph');
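As a baseline for the forests below, it is worth scoring this single small tree on the test set; a minimal sketch (the variable names are illustrative and not used later):
% Misclassification rate of the single small tree on the test data.
Yhat_tree = predict(Tree, X_test);
tree_test_error = mean(Yhat_tree ~= Y_test)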

Section 3. Train a random forest

Train a random forest with 3 small decision trees.
%templateTree(Name,Value) is a MATLAB function that creates a decision tree template.
% Input arguments:
% 'MaxNumSplits' - Maximal number of decision splits.
% 'MinParentSize' - Minimum number of branch node observations.
% 'NumVariablesToSample' - Number of predictors to select at random for each split; for bagged classification the default is the square root of the number of predictors.
% Output arguments:
% tree - Decision tree template for classification or regression.
tree=templateTree('MaxNumSplits',4);
%fitcensemble() is a MATLAB function which can fit random forest models.
% Input arguments:
% X_train - Input (features) training data.
% Y_train - Output (response variable) training data.
% 'PredictorNames' - predictor variable names.
% 'ResponseName' - Response variable name.
% 'CategoricalPredictors' - Indicates a categorical variable list.
% 'Method' - Ensemble-aggregation method. 'Bag' indicates bootstrap aggregating.
% 'Learners' - Weak learners to use in the ensemble. For a random forest, the weak learner is a decision tree.
% 'NumLearningCycles' - Number of ensemble learning cycles, i.e., the number of decision trees in the random forest.
% 'Cost' - Cost of misclassification.
% 'Weights' - Observation weights.
% Output arguments:
% RF - Trained random forest classification model, which contains the fitted model information.
rng(42);
TreeNumber=3;
RF_small=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'NumLearningCycles',TreeNumber);
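Because 'Bag' trains each tree on a bootstrap sample, every tree has out-of-bag observations, which give a built-in validation estimate without touching the test set. A minimal sketch (not part of the evaluation in Section 5):
% Out-of-bag misclassification rate of the 3-tree forest.
oob_error_small = oobLoss(RF_small)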

Section 4. View the random forest

You can view a random forest easily: it is just a collection of trees.
for t=1:TreeNumber
view(RF_small.Trained{t},'mode','graph')
end

Section 5. Evaluate Performance

Section 5.1 Compute the confusion matrix on the training and test data
To predict, the forest runs each observation through every tree and aggregates the trees' predictions; for classification this amounts to picking the majority class (the mode). See the sketch after the confusion matrices below.
outputs_train=[predict(RF_small, X_train) 1-predict(RF_small, X_train) ]';
targets_train=[Y_train 1-Y_train]';
plot_confusion(targets_train, outputs_train,{'default','non-default'},'Training Data Confusion Matrix')
outputs_test=[predict(RF_small, X_test) 1-predict(RF_small, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Testing Data Confusion Matrix')
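To make the voting explicit, here is a minimal sketch that polls each trained tree directly. Note that the ensemble's own predict aggregates the trees' classification scores rather than counting hard votes, so the two can disagree on borderline observations.
% Majority vote over the individual trees in the small forest.
votes = zeros(height(X_test), TreeNumber);
for t = 1:TreeNumber
    votes(:,t) = predict(RF_small.Trained{t}, X_test);  % each tree's 0/1 label
end
Yhat_vote = mode(votes, 2);                              % majority class per observation
agreement_with_ensemble = mean(Yhat_vote == predict(RF_small, X_test))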
Section 5.2 Compute accuracy, precision, sensitivity and specificity
First, we will check the performance measures on the training data:
Accuracy = (TP+TN)/(TP+TN+FP+FN);
Precision= TP/(TP+FP);
Sensitivity (Recall) = TP/(TP+FN);
Specificity = TN/(TN+FP);
Where,
TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
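The helper tperformance used below is not shown in this script; as a rough sketch of what it computes, the same four measures can be derived directly from confusionmat (assuming class 1, default, is the positive class):
% Sketch: the four measures from a 2x2 confusion matrix (classes ordered [0 1]).
Yhat = predict(RF_small, X_train);
C = confusionmat(Y_train, Yhat);   % rows = true class, columns = predicted class
TN = C(1,1); FP = C(1,2); FN = C(2,1); TP = C(2,2);
Accuracy    = (TP+TN)/(TP+TN+FP+FN);
Precision   = TP/(TP+FP);
Sensitivity = TP/(TP+FN);
Specificity = TN/(TN+FP);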
performance_small_train=tperformance(RF_small,X_train,Y_train)
performance_small_train = 1×4 table
    Accuracy    Precision    Sensitivity    Specificity
    0.7990      0.6769       0.1622         0.9782
Then, we will check the performance measures on the test data:
performance_small_test=tperformance(RF_small,X_test,Y_test)
performance_small_test = 1×4 table
    Accuracy    Precision    Sensitivity    Specificity
    0.7930      0.6674       0.1577         0.9772
Section 5.3 Plot ROC curve and compute AUC
[~,score] = predict(RF_small, X_test);
figure;
[X,Y,~,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
The area under the curve is: 0.741752

Section 6. Random forest with 31 small decision trees

tic
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=31;
RF_medium=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'NumLearningCycles',TreeNumber);
% compute confusion matrix
outputs_train=[predict(RF_medium, X_train) 1-predict(RF_medium, X_train) ]';
targets_train=[Y_train 1-Y_train]';
plot_confusion(targets_train, outputs_train,{'default','non-default'},'Training Data Confusion Matrix')
outputs_test=[predict(RF_medium, X_test) 1-predict(RF_medium, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Testing Data Confusion Matrix')
% compute accuracy, precision, sensitivity and specificity
performance_medium_train=tperformance(RF_medium,X_train,Y_train)
performance_medium_train = 1×4 table
    Accuracy    Precision    Sensitivity    Specificity
    0.8059      0.6768       0.2224         0.9701
performance_medium_test=tperformance(RF_medium,X_test,Y_test)
performance_medium_test = 1×4 table
    Accuracy    Precision    Sensitivity    Specificity
    0.8022      0.6884       0.2195         0.9712
% plot ROC curve
[~,score] = predict(RF_medium, X_test);
figure;
[X,Y,~,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
The area under the curve is: 0.757871
toc
Elapsed time is 7.053786 seconds.

Section 7. Random forest with 101 small decision trees

tic
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=101;
RF_large=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'NumLearningCycles',TreeNumber);
% compute confusion matrix
outputs_train=[predict(RF_large, X_train) 1-predict(RF_large, X_train) ]';
targets_train=[Y_train 1-Y_train]';
plot_confusion(targets_train, outputs_train,{'default','non-default'},'Training Data Confusion Matrix')
outputs_test=[predict(RF_large, X_test) 1-predict(RF_large, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Testing Data Confusion Matrix')
% compute accuracy, precision, sensitivity and specificity
performance_large_train=tperformance(RF_large,X_train,Y_train)
performance_large_train = 1×4 table
    Accuracy    Precision    Sensitivity    Specificity
    0.8135      0.6929       0.2714         0.9661
performance_large_test=tperformance(RF_large,X_test,Y_test)
performance_large_test = 1×4 table
    Accuracy    Precision    Sensitivity    Specificity
    0.8092      0.7035       0.2615         0.9680
% plot ROC curve
[~,score] = predict(RF_large, X_test);
figure;
[X,Y,T,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
The area under the curve is: 0.760331
toc
Elapsed time is 15.463312 seconds.

Section 8. Cost Function and Weighting

Section 8.1 Compare the medium-size random forest without and with a misclassification cost

tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=31;
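% Cost(i,j) is the cost of classifying into class j when the true class is i,
% with classes ordered [0 1]. So [0,1;5,0] makes missing a true default
% (predicting 0 when the truth is 1) five times as costly as a false alarm.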
RF_mediumCost=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'Cost',[0,1;5,0],...
'NumLearningCycles',TreeNumber);
outputs_test=[predict(RF_medium, X_test) 1-predict(RF_medium, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest Confusion Matrix')
outputs_test=[predict(RF_mediumCost, X_test) 1-predict(RF_mediumCost, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest - Cost Confusion Matrix')
% plot ROC curve
[~,score] = predict(RF_mediumCost, X_test);
figure;
[X,Y,T,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
The area under the curve is: 0.754348
Section 8.2 Compare the medium-size random forest without observation weights and with observation weights
weights=ones(length(Y_train),1);
weights(Y_train==1)=5;
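% Up-weighting the defaults (Y_train==1) by a factor of 5 plays a similar role
% to the cost matrix in Section 8.1: both push the forest toward catching more
% defaults at the expense of more false alarms.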
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=31;
RF_mediumWeight=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'Weights', weights,...
'NumLearningCycles',TreeNumber);
outputs_test=[predict(RF_medium, X_test) 1-predict(RF_medium, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest Confusion Matrix')
outputs_test=[predict(RF_mediumWeight, X_test) 1-predict(RF_mediumWeight, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest - Weight Confusion Matrix')
% plot ROC curve
[~,score] = predict(RF_mediumWeight, X_test);
figure;
[X,Y,T,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
The area under the curve is: 0.754348

Section 9. View Variable Importance in the Medium-Size Random Forest

Section 9.1 Compute predictor importance by summing the impurity gain over all splits on a given predictor. The Gini index is the default impurity measure for classification trees.
%predictorImportance() is a MATLAB function that estimates predictor importance by summing the per-tree estimates over all weak learners.
importance=predictorImportance(RF_medium);
%sort() sorts array elements.
[~,ind]=sort(importance,'descend');
%array2table() converts a homogeneous array to a table.
Gini_rank_table=array2table([RelevantFeatureNames(ind).', num2cell(importance(ind)).'],'VariableNames',{'variable_name','importance_score'})
Gini_rank_table = 23×2 table (first 10 rows shown)
        variable_name      importance_score
 1      'PayStatus_Aug'    0.0038
 2      'PayStatus_Sep'    0.0033
 3      'PayStatus_Jul'    0.0014
 4      'PayStatus_May'    0.0013
 5      'PayStatus_Jun'    0.0012
 6      'PayStatus_Apr'    8.0277e-04
 7      'Credit_Limit'     1.8782e-04
 8      'Pay_Sep'          1.4075e-04
 9      'Pay_Jul'          1.3601e-04
10      'Pay_Jun'          8.6370e-05
figure;
bar(importance);
xticks([1,ind(1),15,23])
xticklabels(RelevantFeatureNames([1,ind(1),15,23]))
title('Random Forest Variable Gini Importance Plot')
ylabel('variable importance')
xlabel('predictor names')
Section 9.2 Compute predictor importance by permuting out-of-bag predictor observations.
%oobPermutedPredictorImportance() is a MATLAB function that estimates predictor importance by permuting out-of-bag predictor observations across the trees in the forest.
importance=oobPermutedPredictorImportance(RF_medium);
%sort() sorts array elements.
[~,ind]=sort(importance,'descend');
%array2table() converts a homogeneous array to a table.
Permutation_rank_table=array2table([RelevantFeatureNames(ind).', num2cell(importance(ind)).'],'VariableNames',{'variable_name','importance_score'})
Permutation_rank_table = 23×2 table (first 10 rows shown)
        variable_name      importance_score
 1      'PayStatus_Sep'    1.0099
 2      'PayStatus_Aug'    0.6544
 3      'PayStatus_May'    0.6383
 4      'PayStatus_Apr'    0.6142
 5      'PayStatus_Jul'    0.5178
 6      'PayStatus_Jun'    0.4801
 7      'Marriage'         0.1826
 8      'Pay_Sep'          0.1826
 9      'Bill_Apr'         0.1826
10      'Credit_Limit'     0.0953
figure;
bar(importance);
xticks([1,ind(1),15,23])
xticklabels(RelevantFeatureNames([1,ind(1),15,23]))
title('Random Forest Variable Permutation Importance Plot')
ylabel('variable importance')
xlabel('predictor names')

Section 10. Optimize Random Forest Hyperparameters

Section 10.1 Run the following code to list the hyperparameters that can be optimized.
Hyperparameters=hyperparameters('fitcensemble',X_train, Y_train, 'Tree');
for i=1:length(Hyperparameters)
Hyperparameters(i,1).Name
end
ans = 'Method'
ans = 'NumLearningCycles'
ans = 'LearnRate'
ans = 'MinLeafSize'
ans = 'MaxNumSplits'
ans = 'SplitCriterion'
ans = 'NumVariablesToSample'
Section 10.2 Optimize - 'MaxNumSplits'
Here we'll optimize the random forest. We're restricting the ensemble to 5 trees and the cross-validation to 5 folds for the sake of time.
%You can set 'OptimizeHyperparameters' in fitcensemble() with a cell array of eligible hyperparameter names, and
% fitcensemble() can automatically use cross validation to tune the hyperparameters.
%'HyperparameterOptimizationOptions' provides options for optimization, and you can indicate the search method, number of folds
%and maximum number of iterations.
tic
RF_optimize = fitcensemble(X_train,Y_train,...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'NumLearningCycles',5,...
'OptimizeHyperparameters',{'MaxNumSplits'},...
'HyperparameterOptimizationOptions',...
struct('MaxObjectiveEvaluations',10,'Optimizer','randomsearch','Kfold',5));
|=========================================================================================|
| Iter | Eval result | Objective  | Objective runtime | BestSoFar (observed) | MaxNumSplits |
|=========================================================================================|
|    1 | Best        |    0.19805 |            5.2606 |              0.19805 |         1340 |
|    2 | Best        |    0.19152 |            1.4734 |              0.19152 |            2 |
|    3 | Best        |    0.18233 |            2.7625 |              0.18233 |          123 |
|    4 | Accept      |    0.20671 |            5.6357 |              0.18233 |        12978 |
|    5 | Accept      |    0.18248 |            2.3516 |              0.18233 |           57 |
|    6 | Accept      |    0.19519 |            1.3902 |              0.18233 |            2 |
|    7 | Accept      |    0.18471 |            2.2962 |              0.18233 |            6 |
|    8 | Accept      |    0.18771 |            1.5175 |              0.18233 |            7 |
|    9 | Accept      |    0.18343 |            1.7757 |              0.18233 |           14 |
|   10 | Accept      |     0.1901 |             3.913 |              0.18233 |          765 |
|=========================================================================================|
Optimization completed.
MaxObjectiveEvaluations of 10 reached.
Total function evaluations: 10
Total elapsed time: 32.9371 seconds.
Total objective function evaluation time: 28.3764
Best observed feasible point:
    MaxNumSplits
    ____________
        123
Observed objective function value = 0.18233
Function evaluation time = 2.7625
toc
Elapsed time is 34.295416 seconds.
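Having found a good MaxNumSplits (123 in the run above), you could refit a full forest with it; a minimal sketch reusing the same calls as Section 6 (RF_best is illustrative and not used later):
bestTree=templateTree('MaxNumSplits',123); % best value observed above
rng(42);
RF_best=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',bestTree,...
'NumLearningCycles',31);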

Section 11. Use cross-validation to evaluate the random forests

We'll use cross-validation to do model selection without touching the test data.
We'll skip the large forest for the sake of time.
tic
%cv_large = crossval(RF_large,'KFold',5);
cv_medium = crossval(RF_medium,'KFold',5);
cv_small = crossval(RF_small,'KFold',5);
cv_optimize = crossval(RF_optimize,'KFold',5);
%l_large = kfoldLoss(cv_large);
l_medium = kfoldLoss(cv_medium);
l_small = kfoldLoss(cv_small);
l_optimize = kfoldLoss(cv_optimize);
%table(l_large,l_medium,l_small,l_optimize)
table(l_medium,l_small,l_optimize)
ans = 1×3 table
    l_medium    l_small    l_optimize
    0.1890      0.1961     0.1841
toc
Elapsed time is 14.106522 seconds.

Section 12. Boosting

This is an example of how to run boosting. With 'Method','AdaBoostM1', the trees are grown sequentially on reweighted data, so the result is a boosted tree ensemble rather than a bagged random forest; see the sketch at the end of this section for a like-for-like evaluation.
tic
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=31;
RF_AdaBoost=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','AdaBoostM1',...
'Learners',tree,...
'NumLearningCycles',TreeNumber);
outputs_test=[predict(RF_AdaBoost, X_test) 1-predict(RF_AdaBoost, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest AdaBoost Confusion Matrix')
toc
Elapsed time is 6.771934 seconds.
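For an apples-to-apples comparison with the bagged forests, the same evaluation can be applied to the boosted ensemble; a minimal sketch reusing the tperformance helper and perfcurve as above (output not shown here):
% Performance measures and AUC for the boosted ensemble on the test data.
performance_boost_test=tperformance(RF_AdaBoost,X_test,Y_test)
[~,score] = predict(RF_AdaBoost, X_test);
[~,~,~,auc_boost] = perfcurve(Y_test,score(:,2),1);
fprintf('The AdaBoost area under the curve is: %f\n', auc_boost);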