Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Page 1: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Unsupervised Learning Techniques to Diversifying and Pruning Random Forest

Dr Mohamed Medhat Gaber

School of Computing Science and Digital Media, Robert Gordon University

27 January 2015


Page 2: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Acknowledgement

Work done in collaboration with PhD student Khaled Fawagreh and co-supervisor Dr Eyad Elyan


Page 3: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


1 Background: Data Classification; Ensemble Classification; Ensemble Diversity; Random Forests

2 Clustering and Ensemble Diversity: CLUB-DRF; Experimental Study

3 Outlier Scoring and Ensemble Diversity: LOFB-DRF; Experimental Study

4 Summary and Future Work: Summary; Future Work


Page 4: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


What is Data Classification?

Data classification is the process of assigning a class (a label) to a data instance, based on the values of a set of predictive attributes (features).

The process has two stages (a minimal example follows this list):

1 Model construction: a potentially large number of "labelled" instances are fed to a classification technique to build a model (classifier).

2 Model usage: once the model is constructed, it can be deployed and used to classify "unlabelled" instances.

A large number of techniques have been proposed to address the data classification process (e.g., decision trees, artificial neural networks, and support vector machines).

Predictive accuracy has been the major concern when designing a new classification technique, followed by the time needed for model construction and usage.
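As a minimal illustration of the two stages (not the talk's code), the sketch below builds a decision-tree classifier from labelled instances and then applies it to held-out ones; the dataset and scikit-learn choices are assumptions made for the example.

```python
# Stage 1 (model construction) and stage 2 (model usage) on an example dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # stage 1: build the classifier
predictions = model.predict(X_test)                     # stage 2: classify unseen instances
print("accuracy:", (predictions == y_test).mean())
```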


Page 8: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Decision Tree Classification Techniques

Almost all decision trees are constructed using a similar procedure

Attributes (features) are represented in internal nodes, with their values given on the links for tree traversal (a variation of this exists for binary decision trees)

Leaf nodes are class labels

Decision trees mainly vary in the goodness measure used to find the best attribute to split on (e.g., information gain, gain ratio, Gini index, and chi-square); a small example follows this list

The first attribute, which is called the root, is the best attribute (according to some goodness measure) to split on

An iterative process to build subtrees follows, finding the best attribute (or attribute = value test) to split on at each iteration
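To make the goodness measures concrete, here is a small, self-contained sketch (an illustration, not the talk's code) computing information gain and the Gini index for a candidate split directly from class labels:

```python
# Toy computation of two split goodness measures: information gain (entropy
# based) and the Gini index; the labels and the split are made up.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def information_gain(parent, left, right):
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]          # a split that separates the classes perfectly
print(information_gain(parent, left, right))  # ~0.954 bits
print(gini(parent), gini(left), gini(right))  # ~0.469, 0.0, 0.0
```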


Page 14: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Ensemble Classification

Combining a number of classifiers that vote towards the winning class has been thoroughly investigated by the machine learning and data mining communities.

Bagging, boosting and stacking are among the major approaches to building ensembles of classifiers.

Bagging uses bootstrap sampling to generate a diverse set of samples of the dataset (see the sketch after this list).

Boosting builds classifiers in a sequence, encouraging later classifiers to specialise in instances incorrectly classified by previous classifiers in the sequence.

Stacking uses a hierarchy of classifiers that generates a new dataset on which a single classifier is built.
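A rough bagging sketch, under the assumptions (illustrative, not from the talk) that decision trees are the base classifiers and a simple majority vote combines them:

```python
# Minimal bagging: train each tree on a bootstrap sample (drawn with
# replacement), then combine the trees by majority vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap sample indices
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

votes = np.array([t.predict(X_test) for t in trees])
# Majority vote per test instance:
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("bagged accuracy:", (pred == y_test).mean())
```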


Page 19: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Diversity and Predictive Accuracy

Diversity among the members of the ensemble is key to predictive accuracy

There are many ways to measure such diversity; it is not a straightforward process (one simple measure is sketched after this list)

Regardless of the measure used, diversity has been the target of a number of 'diversity creation' methods

Bagging and boosting enforce diversity by input manipulation

Stacking typically imposes diversity by using a number of different classifiers

Error-correcting output codes manipulate the output to create diversity
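As one illustrative diversity measure (the slides do not commit to a specific one), the average pairwise disagreement rate counts how often two ensemble members predict differently, averaged over all pairs:

```python
# Pairwise disagreement: fraction of instances on which two members disagree,
# averaged over all pairs of ensemble members.
import numpy as np
from itertools import combinations

def mean_pairwise_disagreement(predictions):
    # predictions: (n_members, n_instances) array of predicted class labels
    pairs = combinations(range(len(predictions)), 2)
    return np.mean([(predictions[i] != predictions[j]).mean() for i, j in pairs])

preds = np.array([[0, 1, 1, 0, 1],   # member 1
                  [0, 1, 0, 0, 1],   # member 2
                  [1, 1, 1, 0, 0]])  # member 3
print(mean_pairwise_disagreement(preds))  # 0.4
```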


Page 25: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Random Forests: An Overview

An ensemble classification and regression technique introduced by Leo Breiman

It generates a diversified ensemble of decision trees by adopting two methods:

A bootstrap sample is used for the construction of each tree (bagging), resulting in approximately 63.2% unique samples, with the rest repeated

At each node split, only a subset of features is drawn randomly to assess the goodness of each feature/attribute (typically √F or log2(F) features, where F is the total number of features)

Trees are allowed to grow without pruning

Typically 100 to 500 trees are used to form the ensemble

It is now considered among the best performing classifiers (a short example follows this list)
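A short example of the above via scikit-learn's RandomForestClassifier (the dataset and tree count are illustrative); the last lines also check the ~63.2% unique-sample figure empirically:

```python
# Random Forest with 500 unpruned trees and sqrt(F) features per split.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("accuracy:", rf.score(X_test, y_test))

# Expected unique fraction of a size-n bootstrap sample is 1 - (1 - 1/n)^n,
# which tends to 1 - 1/e ≈ 0.632 for large n.
n = len(X_train)
sample = np.random.default_rng(0).integers(0, n, n)
print("unique fraction:", len(np.unique(sample)) / n)
```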


Page 31: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Random Forest Tops State-of-the-art Classifiers

179 classifiers

121 datasets (the whole UCI repository at the time of the experiment)

Random Forest ranked first, followed by SVM with a Gaussian kernel

Reference

Fernandez-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1), 3133-3181.


Page 34: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Improving Random Forests

Source: Fawagreh, K., Gaber, M. M., & Elyan, E. (2014). Random forests: from early developments to recent advancements. Systems Science & Control Engineering: An Open Access Journal, 2(1), pp. 602-609.


Page 35: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


How is Diversity Related to Clustering?

The aim of any clustering algorithm is to produce cohesive clusters that are well separated

A good clustering model diversifies among members of different clusters

Inspired by this observation, we hypothesised that if the trees in the Random Forest are clustered, we can use a small subset (typically one tree) from each cluster to produce a diversified Random Forest

The benefits are twofold:

Increased diversification

A smaller ensemble, leading to faster classification of unlabelled instances


Page 39: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


CLUB-DRF

We termed the method CLUster-Based Diversified Random Forests (CLUB-DRF)

Three stages are followed (a code sketch follows the diagram below):

1 A Random Forest is induced using the traditional method

2 Trees are clustered according to their classification pattern

3 One or more representatives are chosen from each cluster to form the pruned Random Forest

[Diagram: Training Set → Random Forest Algorithm → Parent RF (trees t1 … tn) with classification patterns C(t1, T) … C(tn, T) → Clustering Algorithm → Cluster 1 … Cluster k → Representative Selection → CLUB-DRF (trees t1 … tk), evaluated on the Testing Set]
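A minimal sketch of the three stages, under stated assumptions: the third-party kmodes package (pip install kmodes) supplies the k-modes clustering used in the experiments, training-set accuracy stands in for the OOB criterion mentioned on later slides, and the dataset, k and tree count are illustrative:

```python
# CLUB-DRF sketch: induce a parent forest, cluster trees by their prediction
# patterns, keep one representative per cluster, then vote with the pruned forest.
import numpy as np
from kmodes.kmodes import KModes  # assumption: third-party k-modes implementation
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: induce the parent Random Forest in the traditional way.
parent = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# Stage 2: cluster the trees by their classification pattern, i.e. the vector
# of labels each tree predicts for the training set (a categorical vector).
patterns = np.array([t.predict(X_train) for t in parent.estimators_]).astype(int)
k = 20
clusters = KModes(n_clusters=k, random_state=0).fit_predict(patterns)

# Stage 3: keep one representative per cluster (here the member with the best
# training accuracy, a stand-in for the OOB criterion on the slides).
reps = []
for c in range(k):
    members = np.where(clusters == c)[0]
    if len(members):
        reps.append(members[np.argmax([(patterns[i] == y_train).mean() for i in members])])

# The pruned CLUB-DRF ensemble classifies by majority vote (binary labels here).
votes = np.array([parent.estimators_[i].predict(X_test) for i in reps])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("CLUB-DRF accuracy:", (pred == y_test).mean())
print("parent RF accuracy:", parent.score(X_test, y_test))
```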


Page 41: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


CLUB-DRF Settings

A number of settings are needed, as follows:

The clustering algorithm used

The number of clusters of trees

The number of trees representing each cluster

The criteria for choosing the representatives:

Random

Best performing


Page 45: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Experimental Setup

We tested the technique over 15 datasets from the UCI repository

We generated 500 trees for the main Random Forest

We used k-modes to cluster the trees

We used the following values for k: 5, 10, 15, 20, 25, 30, 35, and 40

We used one representative tree per cluster, based on the Out-Of-Bag (OOB) performance

A repeated hold-out method was used to estimate the performance (sketched below)
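A minimal sketch of repeated hold-out estimation, assuming its common form (several independent random train/test splits whose scores are averaged); the split ratio, repetition count and dataset are illustrative:

```python
# Repeated hold-out: average the test score over several random splits.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
scores = []
for seed in range(10):  # 10 independent hold-out repetitions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))
print("mean accuracy:", np.mean(scores), "+/-", np.std(scores))
```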


Page 51: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Summarised Results

[Chart: number of datasets (y-axis, 0-9) against ensemble size in number of trees (x-axis, 10-40), comparing the methods CLUB-DRF and RF]


Page 52: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Pruning Results


Page 53: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Sample of Detailed Results


Page 54: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


How is Diversity Related to Outlier Detection?

Outliers are out-of-the-norm instances that are thought to be generated by a different mechanism

By analogy, trees that are significantly different (diverse) from the rest of the trees in the Random Forest can be seen as outliers

The Local Outlier Factor (LOF) assigns a real number to each instance to represent its peculiarity

Inspired by this analogy, we hypothesised that a diverse ensemble of trees can be formed using an outlier detection method


Page 58: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


LOFB-DRF

We termed the method Local Outlier Factor Based Diversified Random Forests (LOFB-DRF)

It follows similar steps to CLUB-DRF

Each tree is assigned an LOF value

Trees are then chosen according to two criteria:

Predictive accuracy

LOF value


Page 62: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


LOFB-DRF Settings

A number of settings are needed, as follows:

The LOF setting for the number of nearest neighbours

The option for combining LOF with predictive accuracy:

Using LOF only, ruling out predictive accuracy

Using a combination strategy


Page 64: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Experimental Setup

We tested the technique over 10 datasets from the UCI repository

We generated 500 trees for the main Random Forest

We used LOF with 40 nearest neighbours

We ranked each tree using rank = normal(LOF) × accuracy, where normal(LOF), accuracy ∈ [0, 1]

Trees with the highest rank are chosen as representatives (a sketch follows this list)

We used the following values for the number of representative trees: 5, 10, 15, 20, 25, 30, 35, and 40

A repeated hold-out method was used to estimate the performance
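A minimal sketch of the LOFB-DRF selection step, under stated assumptions: scikit-learn's LocalOutlierFactor supplies the LOF values over the trees' prediction patterns, min-max scaling supplies normal(LOF), and training-set accuracy stands in for the accuracy term; the dataset and representative count are illustrative:

```python
# LOFB-DRF sketch: score each tree's prediction pattern with LOF, combine the
# normalised LOF with accuracy, and keep the highest-ranked trees.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

parent = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
patterns = np.array([t.predict(X_train) for t in parent.estimators_])  # one row per tree

# Higher LOF = more of a local outlier among the trees, i.e. more diverse.
lof = LocalOutlierFactor(n_neighbors=40).fit(patterns)
lof_scores = -lof.negative_outlier_factor_
normal_lof = (lof_scores - lof_scores.min()) / np.ptp(lof_scores)  # scale to [0, 1]

accuracy = np.array([(p == y_train).mean() for p in patterns])
rank = normal_lof * accuracy                      # rank = normal(LOF) x accuracy

top = np.argsort(rank)[-20:]                      # keep, e.g., 20 representatives
votes = np.array([parent.estimators_[i].predict(X_test) for i in top])
pred = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote over binary labels
print("LOFB-DRF accuracy:", (pred == y_test).mean())
```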


Page 71: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Summarised Results

[Chart: number of datasets (y-axis, 0-6) against ensemble size in number of trees (x-axis, 10-40), comparing the methods LOFB-DRF and RF]


Page 72: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Pruning Results


Page 73: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Sample of Detailed Results


Page 74: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Summary

Random Forest has proved its superiority over the last few years

Two methods were presented in this talk, aiming at diversifying and pruning Random Forests

Results showed the potential of these two methods to further enhance the predictive accuracy of the method

The high level of pruning makes these techniques candidates for real-time applications, as the number of trees to be traversed is significantly reduced


Page 78: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Future Work

In CLUB-DRF:

Exploring other methods for choosing tree representatives from each cluster (e.g., varying the number of representatives per cluster)

Using other clustering techniques

In LOFB-DRF:

Exploring other options for combining the LOF value and predictive accuracy

Using LOF and predictive accuracy for the choice of tree representatives in each cluster

Applying both methods to other ensemble classification techniques


Page 83: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Q & A

Thanks for listening!

Contact Details

Dr Mohamed Medhat Gaber

E-mail: [email protected]

Webpage: http://mohamedmgaber.weebly.com/

LinkedIn: https://www.linkedin.com/profile/view?id=21808352

Twitter: https://twitter.com/mmmgaber

ResearchGate: https://www.researchgate.net/profile/Mohamed_Gaber16?ev=prf_highl
