Random Forest for Classification
Section 1. Read prepared training/validation/test set
clear;clc; %Clear the workspace and the screen
rng(42); %Set the random number generator to "the answer to everything"
addpath(genpath('.'));
data_path='Data';
filename='PreparedData.mat';
full_filename=fullfile(data_path, filename);
load(full_filename); %Load the dataset
CategoricalFeatureNames={'Gender','Education','Marriage','PayStatus_Sep','PayStatus_Aug','PayStatus_Jul',...
'PayStatus_Jun','PayStatus_May','PayStatus_Apr'};
RelevantFeatureNames={'Credit_Limit','Gender','Education','Marriage','Age','PayStatus_Sep','PayStatus_Aug','PayStatus_Jul',...
'PayStatus_Jun','PayStatus_May','PayStatus_Apr','Bill_Sep','Bill_Aug','Bill_Jul','Bill_Jun',...
'Bill_May','Bill_Apr','Pay_Sep','Pay_Aug','Pay_Jul','Pay_Jun','Pay_May','Pay_Apr'};
[RelevantFeatures, indexinVariableNames, indexinRelevantFeatureNames]=intersect(VariableNames,RelevantFeatureNames,'stable');
[RelevantCategoricalFeatures, indexinCategoricalFeatureNames, CategindexinRelevantFeatures]=intersect(CategoricalFeatureNames,RelevantFeatures,'stable');
X_train=XY_train(:,RelevantFeatures);
Y_train=table2array(XY_train(:,'Default_Status'));
X_test=XY_test(:,RelevantFeatures);
Y_test=table2array(XY_test(:,'Default_Status'));
Section 2. Train a small decision tree
We will use this small decision tree as a baseline to compare against the random forests.
rng(42); %Set the random number generator to "the answer to everything"
maxNumSplits=4;
Tree= fitctree(X_train,Y_train,...
'PredictorNames',RelevantFeatureNames,...
'MaxNumSplits',maxNumSplits,...
'CategoricalPredictors',RelevantCategoricalFeatures);
view(Tree,'mode','graph');
Section 3. Train a random forest
Train a random forest with 3 small decision trees.
%templateTree(Name,Value) is a MATLAB function that creates a decision tree template.
% Input arguments:
% 'MaxNumSplits' - Maximal number of decision splits.
% 'MinParentSize' - Minimum number of branch node observations.
% 'NumVariablesToSample' - Number of predictors to select at random for each split, and the default is square root of the number of predictors.
% Output arguments:
% tree - Decision tree template for classification or regression.
tree=templateTree('MaxNumSplits',4);
%fitcensemble() is a MATLAB function that fits ensemble classification models; with 'Method','Bag' it fits a random forest (bagged decision trees).
% Input arguments:
% X_train - Input (features) training data.
% Y_train - Output (response variable) training data.
% 'PredictorNames' - predictor variable names.
% 'ResponseName' - Response variable name.
% 'CategoricalPredictors' - Indicates a categorical variable list.
% 'Method' - Ensemble-aggregation method. 'Bag' indicates bootstrap aggregating.
% 'Learners' - Weak learners to use in the ensemble. For a random forest, the weak learner is a decision tree.
% 'NumLearningCycles' - Number of ensemble learning cycles, i.e., the number of decision trees in the random forest.
% 'Cost' - Cost of misclassification.
% 'Weights' - Observation weights.
% Output arguments:
% RF - Trained random forest classification model which contains the fitted model information.
rng(42);
TreeNumber=3;
RF_small=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'NumLearningCycles',TreeNumber);
Section 4. View the random forest
You can view a random forest easily: it is just a collection of trees.
for t=1:TreeNumber
view(RF_small.Trained{t},'mode','graph')
end
Section 5. Evaluate Performance
Section 5.1 Compute confusion matrices on the training and testing data
To predict, the forest runs each observation through every tree and then takes the majority vote (the mode), since this is classification.
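As a minimal sketch of that idea (not part of the original script, and assuming the stored compact trees accept the same table of features), the vote can be reproduced by hand from the trees in RF_small.Trained. Note that MATLAB actually averages the per-tree class scores rather than the raw labels, so the hand vote may occasionally differ.
%Hand-rolled majority vote across the individual trees (illustration only)
tree_votes=zeros(height(X_test),TreeNumber);
for t=1:TreeNumber
    tree_votes(:,t)=predict(RF_small.Trained{t},X_test); %each tree's predicted label (0 or 1)
end
manual_vote=mode(tree_votes,2); %majority vote across the trees; should usually match predict(RF_small,X_test)
The script itself calls predict on the full ensemble and builds the confusion matrices: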
outputs_train=[predict(RF_small, X_train) 1-predict(RF_small, X_train) ]'; %2-by-N indicator matrix: row 1 = default (1), row 2 = non-default (0)
targets_train=[Y_train 1-Y_train]';
plot_confusion(targets_train, outputs_train,{'default','non-default'},'Training Data Confusion Matrix')
outputs_test=[predict(RF_small, X_test) 1-predict(RF_small, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Testing Data Confusion Matrix')
Section 5.2 Compute accuracy, precision, sensitivity and specificity
First, we will check the performance measures on the training data:
Accuracy = (TP+TN)/(TP+TN+FP+FN);
Precision= TP/(TP+FP);
Sensitivity (Recall) = TP/(TP+FN);
Specificity = TN/(TN+FP);
Where,
TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
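A hedged sketch of these measures computed directly from the forest's predictions (assuming class 1, default, is the positive class; the course helper tperformance presumably reports something similar):
%Count the four outcomes on the training data (illustration only)
yhat=predict(RF_small,X_train);
TP=sum(yhat==1 & Y_train==1); TN=sum(yhat==0 & Y_train==0);
FP=sum(yhat==1 & Y_train==0); FN=sum(yhat==0 & Y_train==1);
Accuracy=(TP+TN)/(TP+TN+FP+FN)
Precision=TP/(TP+FP)
Sensitivity=TP/(TP+FN) %recall
Specificity=TN/(TN+FP)
In the script, the measures come from the helper function tperformance: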
performance_small_train=tperformance(RF_small,X_train,Y_train)
Then, we will check the performance measures on the testing data:
performance_small_test=tperformance(RF_small,X_test,Y_test)
Section 5.3 Plot ROC curve and compute AUC
[~,score] = predict(RF_small, X_test); %the second output of predict() holds the class scores
figure;
[X,Y,~,auc] = perfcurve(Y_test,score(:,2),1); %score(:,2) is the score for class 1 (default), the positive class
plot(X,Y,X,X); %ROC curve together with the 45-degree reference line
fprintf('The area under the curve is: %f\n', auc);
Section 6. Random forest with 31 small decision trees
tic
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=31;
RF_medium=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'NumLearningCycles',TreeNumber);
% compute confusion matrix
outputs_train=[predict(RF_medium, X_train) 1-predict(RF_medium, X_train) ]';
targets_train=[Y_train 1-Y_train]';
plot_confusion(targets_train, outputs_train,{'default','non-default'},'Training Data Confusion Matrix')
outputs_test=[predict(RF_medium, X_test) 1-predict(RF_medium, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Testing Data Confusion Matrix')
% compute accuracy, precision, sensitivity and specificity
performance_medium_train=tperformance(RF_medium,X_train,Y_train)
performance_medium_test=tperformance(RF_medium,X_test,Y_test)
% plot ROC curve
[~,score] = predict(RF_medium, X_test);
figure;
[X,Y,~,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
toc
Section 7. Random forest with 101 small decision trees
tic
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=101;
RF_large=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'NumLearningCycles',TreeNumber);
% compute confusion matrix
outputs_train=[predict(RF_large, X_train) 1-predict(RF_large, X_train) ]';
targets_train=[Y_train 1-Y_train]';
plot_confusion(targets_train, outputs_train,{'default','non-default'},'Training Data Confusion Matrix')
outputs_test=[predict(RF_large, X_test) 1-predict(RF_large, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Testing Data Confusion Matrix')
% compute accuracy, precision, sensitivity and specificity
performance_large_train=tperformance(RF_large,X_train,Y_train)
performance_large_test=tperformance(RF_large,X_test,Y_test)
% plot ROC curve
[~,score] = predict(RF_large, X_test);
figure;
[X,Y,T,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
toc
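As an aside (a hedged sketch, not part of the original script), a bagged ensemble also records its out-of-bag error, which shows how the error settles as trees are added and can guide the choice between 3, 31 and 101 trees without touching the test data:
%Cumulative out-of-bag classification error as trees are added to RF_large
figure;
plot(oobLoss(RF_large,'Mode','cumulative'));
xlabel('number of trees');
ylabel('out-of-bag classification error');
title('OOB Error vs. Number of Trees');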
Section 8. Cost Function and Weighting
Section 8.1 Compare the medium-size random forest with and without a misclassification cost
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=31;
RF_mediumCost=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'Cost',[0,1;5,0],... %Cost(i,j): cost of predicting class j when the true class is i (classes ordered 0,1), so missing a true default costs 5 times a false alarm
'NumLearningCycles',TreeNumber);
outputs_test=[predict(RF_medium, X_test) 1-predict(RF_medium, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest Confusion Matrix')
outputs_test=[predict(RF_mediumCost, X_test) 1-predict(RF_mediumCost, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest - Cost Confusion Matrix')
% plot ROC curve
[~,score] = predict(RF_mediumCost, X_test);
figure;
[X,Y,T,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
Section 8.2 Compare the medium-size random forest without and with observation weights
weights=ones(length(Y_train),1); %start every observation with weight 1
weights(Y_train==1)=5; %give default observations (class 1) five times the weight
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=31;
RF_mediumWeight=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'Learners',tree,...
'Weights', weights,...
'NumLearningCycles',TreeNumber);
outputs_test=[predict(RF_medium, X_test) 1-predict(RF_medium, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest Confusion Matrix')
outputs_test=[predict(RF_mediumWeight, X_test) 1-predict(RF_mediumWeight, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest - Weight Confusion Matrix')
% plot ROC curve
[~,score] = predict(RF_mediumWeight, X_test);
figure;
[X,Y,T,auc] = perfcurve(Y_test,score(:,2),1);
plot(X,Y,X,X);
fprintf('The area under the curve is: %f\n', auc);
Section 9. View Variable Importance in the Medium-Size Random Forest
Section 9.1 Compute the predictor importance by summing the impurity gain over all splits on a given predictor. Gini impurity is the default split criterion for classification trees.
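As a quick illustration of the impurity being summed (a hedged sketch, not part of the original script): for a node with class proportions p_k, the Gini impurity is 1 - sum_k p_k^2, and a split's gain is the parent impurity minus the weighted impurity of its children.
%Gini impurity of a single node, for intuition only
p=[0.7 0.3]; %example class proportions at a node
gini=1-sum(p.^2) %here 1-(0.49+0.09)=0.42; a pure node would have impurity 0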
%predictorImportance() is a MATLAB function to estimate the predictor importance by summing these estimates over all weak learners.
importance=predictorImportance(RF_medium);
%sort() sorts array elements.
[~,ind]=sort(importance,'descend');
%array2table() converts a homogeneous array to a table.
Gini_rank_table=array2table([RelevantFeatureNames(ind).', num2cell(importance(ind)).'],'VariableNames',{'variable_name','importance_score'})
figure;
bar(importance);
xticks([1,ind(1),15,23])
xticklabels(RelevantFeatureNames([1,ind(1),15,23]))
title('Random Forest Variable Gini Importance Plot')
ylabel('variable importance')
xlabel('predictor names')
Section 9.2 Compute the predictor importance by permuting out-of-bag predictor observations across the trees of the random forest.
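The idea behind permutation importance: shuffle one predictor's values, and the more the model's error increases, the more the model relied on that predictor. A hedged sketch of the idea follows (illustration only; the real computation below is done per tree on its out-of-bag observations, and Age is used here purely as an example variable):
%Shuffle one predictor on the test set and measure the increase in error (illustration only)
baseline_err=loss(RF_medium,X_test,Y_test);
X_perm=X_test;
X_perm.Age=X_perm.Age(randperm(height(X_perm))); %break the link between Age and the response
permuted_err=loss(RF_medium,X_perm,Y_test);
importance_sketch=permuted_err-baseline_err %larger increase => more important predictor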
%oobPermutedPredictorImportance() is a MATLAB function to estimate the predictor importance by permutation of out-of-bag predictor observations for random forest of trees.
importance=oobPermutedPredictorImportance(RF_medium);
%sort() sorts array elements.
[~,ind]=sort(importance,'descend');
%array2table() converts a homogeneous array to a table.
Permutation_rank_table=array2table([RelevantFeatureNames(ind).', num2cell(importance(ind)).'],'VariableNames',{'variable_name','importance_score'})
figure;
bar(importance);
xticks([1,ind(1),15,23])
xticklabels(RelevantFeatureNames([1,ind(1),15,23]))
title('Random Forest Variable Permutation Importance Plot')
ylabel('variable importance')
xlabel('predictor names')
Section 10. Optimize Random Forest Hyperparameters
Section 10.1 Run the following code to see the hyperparameters that can be optimized.
Hyperparameters=hyperparameters('fitcensemble',X_train, Y_train, 'Tree');
for i=1:length(Hyperparameters)
Hyperparameters(i,1).Name
end
Section 10.2 Optimize - 'MaxNumSplits'
Here we'll optimize the random forest. For the sake of time, we restrict the ensemble to 5 trees and use random search with at most 10 objective evaluations and 5-fold cross-validation.
%You can set 'OptimizeHyperparameters' in fitcensemble() with a cell array of eligible hyperparameter names, and
% fitcensemble() can automatically use cross validation to tune the hyperparameters.
%'HyperparameterOptimizationOptions' provides options for the optimization: you can choose the search method, the number of cross-validation folds,
%and the maximum number of objective evaluations.
tic
RF_optimize = fitcensemble(X_train,Y_train,...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','Bag',...
'NumLearningCycles',5,...
'OptimizeHyperparameters',{'MaxNumSplits'},...
'HyperparameterOptimizationOptions',...
struct('MaxObjectiveEvaluations',10,'Optimizer','randomsearch','Kfold',5));
toc
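After training, the evaluated settings can be inspected. (A hedged note: with the 'randomsearch' optimizer, the HyperparameterOptimizationResults property should be a table of the tried MaxNumSplits values and their cross-validated objective, sorted best first.)
%Inspect the hyperparameter search results stored in the fitted model
RF_optimize.HyperparameterOptimizationResults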
Section 11. Use cross-validation to evaluate the random forests
We'll use cross validation to do model selection without touching the test data.
We'll skip the large forest for the sake of time.
tic
%cv_large = crossval(RF_large,'KFold',5);
cv_medium = crossval(RF_medium,'KFold',5);
cv_small = crossval(RF_small,'KFold',5);
cv_optimize = crossval(RF_optimize,'KFold',5);
%l_large = kfoldLoss(cv_large);
l_medium = kfoldLoss(cv_medium);
l_small = kfoldLoss(cv_small);
l_optimize = kfoldLoss(cv_optimize);
%table(l_large,l_medium,l_small,l_optimize)
table(l_medium,l_small,l_optimize)
toc
Section 12. Boosting
This is an example of how to run boosting: unlike bagging, AdaBoostM1 trains the trees sequentially, giving more weight to observations that earlier trees misclassified.
tic
tree=templateTree('MaxNumSplits',4);
rng(42);
TreeNumber=31;
RF_AdaBoost=fitcensemble(X_train, Y_train, ...
'CategoricalPredictors',RelevantCategoricalFeatures,...
'Method','AdaBoostM1',...
'Learners',tree,...
'NumLearningCycles',TreeNumber);
outputs_test=[predict(RF_AdaBoost, X_test) 1-predict(RF_AdaBoost, X_test) ]';
targets_test=[Y_test 1-Y_test]';
plot_confusion(targets_test, outputs_test,{'default','non-default'},'Random Forest AdaBoost Confusion Matrix')
toc
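As a final hedged sketch (not part of the original script), the bagged forest and the boosted ensemble can be compared on the test set through their AUCs; boosting scores are aggregated margins rather than probabilities, but perfcurve only needs a ranking:
%Compare test-set AUC of bagging (RF_medium) and boosting (RF_AdaBoost)
[~,score_bag]=predict(RF_medium,X_test);
[~,score_boost]=predict(RF_AdaBoost,X_test);
[~,~,~,auc_bag]=perfcurve(Y_test,score_bag(:,2),1);
[~,~,~,auc_boost]=perfcurve(Y_test,score_boost(:,2),1);
table(auc_bag,auc_boost)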