Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Page 1: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Unsupervised Learning Techniques to Diversifying and Pruning Random Forest

Dr Mohamed Medhat Gaber

School of Computing Science and Digital Media, Robert Gordon University

27 January 2015


Page 2: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Acknowledgement

Work done in collaboration with PhD student Khaled Fawagreh and co-supervisor Dr Eyad Elyan


Page 3: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


1 Background: Data Classification; Ensemble Classification; Ensemble Diversity; Random Forests

2 Clustering and Ensemble Diversity: CLUB-DRF; Experimental Study

3 Outlier Scoring and Ensemble Diversity: LOFB-DRF; Experimental Study

4 Summary and Future Work: Summary; Future Work


Page 4: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


What is Data Classification?

Data classification is the process of assigning a class (a label) to a data instance, based on the values of a set of predictive attributes (features).

The process has two stages (a minimal example follows this list):

1 Model construction: a potentially large number of "labelled" instances are fed to a classification technique to build a model (classifier).

2 Model usage: once the model is constructed, it can be deployed and used to classify "unlabelled" instances.

A large number of techniques have been proposed to address the data classification process (e.g., decision trees, artificial neural networks, and support vector machines).

Predictive accuracy has been the major concern when designing a new classification technique, followed by the time needed for model construction and usage.
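As a minimal illustration of the two stages (not the talk's code), the sketch below builds a decision-tree classifier from labelled instances and then applies it to held-out ones; the dataset and scikit-learn choices are assumptions made for the example.

```python
# Stage 1 (model construction) and stage 2 (model usage) on an example dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # stage 1: build the classifier
predictions = model.predict(X_test)                     # stage 2: classify unseen instances
print("accuracy:", (predictions == y_test).mean())
```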


Page 8: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Decision Tree Classification Techniques

Almost all decision trees are constructed using a similar procedure

Attributes (features) are represented in internal nodes, with their values given on the links for tree traversal (a variation of this exists for binary decision trees)

Leaf nodes are class labels

Decision trees mainly vary in the goodness measure used to find the best attribute to split on (e.g., information gain, gain ratio, Gini index, and chi-square); a small example follows this list

The first attribute, which is called the root, is the best attribute (according to some goodness measure) to split on

An iterative process to build subtrees follows, finding the best attribute (or attribute = value test) to split on at each iteration
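To make the goodness measures concrete, here is a small, self-contained sketch (an illustration, not the talk's code) computing information gain and the Gini index for a candidate split directly from class labels:

```python
# Toy computation of two split goodness measures: information gain (entropy
# based) and the Gini index; the labels and the split are made up.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def information_gain(parent, left, right):
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]          # a split that separates the classes perfectly
print(information_gain(parent, left, right))  # ~0.954 bits
print(gini(parent), gini(left), gini(right))  # ~0.469, 0.0, 0.0
```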


Page 14: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Ensemble Classification

Combining a number of classifiers that vote towards the winning class has been thoroughly investigated by the machine learning and data mining communities.

Bagging, boosting and stacking are among the major approaches to building ensembles of classifiers.

Bagging uses bootstrap sampling to generate a diverse set of samples of the dataset (see the sketch after this list).

Boosting builds classifiers in a sequence, encouraging later classifiers to specialise in instances incorrectly classified by previous classifiers in the sequence.

Stacking uses a hierarchy of classifiers that generates a new dataset on which a single classifier is built.
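A rough bagging sketch, under the assumptions (illustrative, not from the talk) that decision trees are the base classifiers and a simple majority vote combines them:

```python
# Minimal bagging: train each tree on a bootstrap sample (drawn with
# replacement), then combine the trees by majority vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap sample indices
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

votes = np.array([t.predict(X_test) for t in trees])
# Majority vote per test instance:
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("bagged accuracy:", (pred == y_test).mean())
```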


Page 19: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Diversity and Predictive Accuracy

Diversity among the members of the ensemble is key to predictive accuracy

There are many ways to measure such diversity; it is not a straightforward process (one simple measure is sketched after this list)

Regardless of the measure used, diversity has been the target of a number of 'diversity creation' methods

Bagging and boosting enforce diversity by input manipulation

Stacking typically imposes diversity by using a number of different classifiers

Error-correcting output codes manipulate the output to create diversity
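As one illustrative diversity measure (the slides do not commit to a specific one), the average pairwise disagreement rate counts how often two ensemble members predict differently, averaged over all pairs:

```python
# Pairwise disagreement: fraction of instances on which two members disagree,
# averaged over all pairs of ensemble members.
import numpy as np
from itertools import combinations

def mean_pairwise_disagreement(predictions):
    # predictions: (n_members, n_instances) array of predicted class labels
    pairs = combinations(range(len(predictions)), 2)
    return np.mean([(predictions[i] != predictions[j]).mean() for i, j in pairs])

preds = np.array([[0, 1, 1, 0, 1],   # member 1
                  [0, 1, 0, 0, 1],   # member 2
                  [1, 1, 1, 0, 0]])  # member 3
print(mean_pairwise_disagreement(preds))  # 0.4
```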


Page 25: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Random Forests: An Overview

An ensemble classification and regression technique introduced by Leo Breiman

It generates a diversified ensemble of decision trees by adopting two methods:

A bootstrap sample is used for the construction of each tree (bagging), resulting in approximately 63.2% unique samples, with the rest repeated

At each node split, only a subset of features is drawn randomly to assess the goodness of each feature/attribute (typically √F or log2(F) features, where F is the total number of features)

Trees are allowed to grow without pruning

Typically 100 to 500 trees are used to form the ensemble

It is now considered among the best performing classifiers (a short example follows this list)
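A short example of the above via scikit-learn's RandomForestClassifier (the dataset and tree count are illustrative); the last lines also check the ~63.2% unique-sample figure empirically:

```python
# Random Forest with 500 unpruned trees and sqrt(F) features per split.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("accuracy:", rf.score(X_test, y_test))

# Expected unique fraction of a size-n bootstrap sample is 1 - (1 - 1/n)^n,
# which tends to 1 - 1/e ≈ 0.632 for large n.
n = len(X_train)
sample = np.random.default_rng(0).integers(0, n, n)
print("unique fraction:", len(np.unique(sample)) / n)
```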


Page 31: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Random Forest Tops State-of-the-art Classifiers

179 classifiers

121 datasets (the whole UCI repository at the time of the experiment)

Random Forest ranked first, followed by SVM with a Gaussian kernel

Reference

Fernandez-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1), 3133-3181.


Page 34: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Improving Random Forests

Source: Fawagreh, K., Gaber, M. M., & Elyan, E. (2014). Random forests: from early developments to recent advancements. Systems Science & Control Engineering: An Open Access Journal, 2(1), pp. 602-609.


Page 35: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


How is Diversity Related to Clustering?

The aim of any clustering algorithm is to produce cohesive clusters that are well separated

A good clustering model diversifies among members of different clusters

Inspired by this observation, we hypothesised that if the trees in the Random Forest are clustered, we can use a small subset (typically one tree) from each cluster to produce a diversified Random Forest

The benefits are twofold:

Increased diversification

A smaller ensemble, leading to faster classification of unlabelled instances


Page 39: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


CLUB-DRF

We termed the method CLUster-Based Diversified Random Forests (CLUB-DRF)

Three stages are followed (a code sketch follows the diagram below):

1 A Random Forest is induced using the traditional method

2 Trees are clustered according to their classification pattern

3 One or more representatives are chosen from each cluster to form the pruned Random Forest

[Diagram: Training Set → Random Forest Algorithm → Parent RF (trees t1 … tn) with classification patterns C(t1, T) … C(tn, T) → Clustering Algorithm → Cluster 1 … Cluster k → Representative Selection → CLUB-DRF (trees t1 … tk), evaluated on the Testing Set]
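A minimal sketch of the three stages, under stated assumptions: the third-party kmodes package (pip install kmodes) supplies the k-modes clustering used in the experiments, training-set accuracy stands in for the OOB criterion mentioned on later slides, and the dataset, k and tree count are illustrative:

```python
# CLUB-DRF sketch: induce a parent forest, cluster trees by their prediction
# patterns, keep one representative per cluster, then vote with the pruned forest.
import numpy as np
from kmodes.kmodes import KModes  # assumption: third-party k-modes implementation
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stage 1: induce the parent Random Forest in the traditional way.
parent = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# Stage 2: cluster the trees by their classification pattern, i.e. the vector
# of labels each tree predicts for the training set (a categorical vector).
patterns = np.array([t.predict(X_train) for t in parent.estimators_]).astype(int)
k = 20
clusters = KModes(n_clusters=k, random_state=0).fit_predict(patterns)

# Stage 3: keep one representative per cluster (here the member with the best
# training accuracy, a stand-in for the OOB criterion on the slides).
reps = []
for c in range(k):
    members = np.where(clusters == c)[0]
    if len(members):
        reps.append(members[np.argmax([(patterns[i] == y_train).mean() for i in members])])

# The pruned CLUB-DRF ensemble classifies by majority vote (binary labels here).
votes = np.array([parent.estimators_[i].predict(X_test) for i in reps])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("CLUB-DRF accuracy:", (pred == y_test).mean())
print("parent RF accuracy:", parent.score(X_test, y_test))
```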


Page 41: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


CLUB-DRF Settings

A number of settings are needed, as follows:

The clustering algorithm used

The number of clusters of trees

The number of trees representing each cluster

The criteria for choosing the representatives:

Random

Best performing


Page 45: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Experimental Setup

We tested the technique over 15 datasets from the UCI repository

We generated 500 trees for the main Random Forest

We used k-modes to cluster the trees

We used the following values for k: 5, 10, 15, 20, 25, 30, 35, and 40

We used one representative tree per cluster, based on the Out-Of-Bag (OOB) performance

A repeated hold-out method was used to estimate the performance (sketched below)
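A minimal sketch of repeated hold-out estimation, assuming its common form (several independent random train/test splits whose scores are averaged); the split ratio, repetition count and dataset are illustrative:

```python
# Repeated hold-out: average the test score over several random splits.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
scores = []
for seed in range(10):  # 10 independent hold-out repetitions
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))
print("mean accuracy:", np.mean(scores), "+/-", np.std(scores))
```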


Page 51: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Summarised Results

[Chart: number of datasets (y-axis, 0-9) against ensemble size in number of trees (x-axis, 10-40), comparing the methods CLUB-DRF and RF]


Page 52: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Pruning Results


Page 53: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Sample of Detailed Results


Page 54: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


How is Diversity Related to Outlier Detection?

Outliers are out-of-the-norm instances that are thought to be generated by a different mechanism

By analogy, trees that are significantly different (diverse) from the rest of the trees in the Random Forest can be seen as outliers

The Local Outlier Factor (LOF) assigns a real number to each instance to represent its peculiarity

Inspired by this analogy, we hypothesised that a diverse ensemble of trees can be formed using an outlier detection method


Page 58: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


LOFB-DRF

We termed the method Local Outlier Factor Based Diversified Random Forests (LOFB-DRF)

It follows similar steps to CLUB-DRF

Each tree is assigned an LOF value

Trees are then chosen according to two criteria:

Predictive accuracy

LOF value


Page 62: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


LOFB-DRF Settings

A number of settings are needed, as follows:

The LOF setting for the number of nearest neighbours

The option for combining LOF with predictive accuracy:

Using LOF only, ruling out predictive accuracy

Using a combination strategy


Page 64: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Experimental Setup

We tested the technique over 10 datasets from the UCI repository

We generated 500 trees for the main Random Forest

We used LOF with 40 nearest neighbours

We ranked each tree using rank = normal(LOF) × accuracy, where normal(LOF), accuracy ∈ [0, 1]

Trees with the highest rank are chosen as representatives (a sketch follows this list)

We used the following values for the number of representative trees: 5, 10, 15, 20, 25, 30, 35, and 40

A repeated hold-out method was used to estimate the performance
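A minimal sketch of the LOFB-DRF selection step, under stated assumptions: scikit-learn's LocalOutlierFactor supplies the LOF values over the trees' prediction patterns, min-max scaling supplies normal(LOF), and training-set accuracy stands in for the accuracy term; the dataset and representative count are illustrative:

```python
# LOFB-DRF sketch: score each tree's prediction pattern with LOF, combine the
# normalised LOF with accuracy, and keep the highest-ranked trees.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

parent = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
patterns = np.array([t.predict(X_train) for t in parent.estimators_])  # one row per tree

# Higher LOF = more of a local outlier among the trees, i.e. more diverse.
lof = LocalOutlierFactor(n_neighbors=40).fit(patterns)
lof_scores = -lof.negative_outlier_factor_
normal_lof = (lof_scores - lof_scores.min()) / np.ptp(lof_scores)  # scale to [0, 1]

accuracy = np.array([(p == y_train).mean() for p in patterns])
rank = normal_lof * accuracy                      # rank = normal(LOF) x accuracy

top = np.argsort(rank)[-20:]                      # keep, e.g., 20 representatives
votes = np.array([parent.estimators_[i].predict(X_test) for i in top])
pred = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote over binary labels
print("LOFB-DRF accuracy:", (pred == y_test).mean())
```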


Page 71: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Summarised Results

[Chart: number of datasets (y-axis, 0-6) against ensemble size in number of trees (x-axis, 10-40), comparing the methods LOFB-DRF and RF]


Page 72: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Pruning Results


Page 73: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Sample of Detailed Results


Page 74: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Summary

Random Forest has proved its superiority over the last few years

Two methods were presented in this talk, aiming at diversifying and pruning Random Forests

Results showed the potential of these two methods to further enhance the predictive accuracy of the method

The high level of pruning makes these techniques candidates for real-time applications, as the number of trees to be traversed is significantly reduced


Page 78: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Future Work

In CLUB-DRF:

Exploring other methods for choosing tree representatives from each cluster (e.g., varying the number of representatives per cluster)

Using other clustering techniques

In LOFB-DRF:

Exploring other options for combining the LOF value and predictive accuracy

Using LOF and predictive accuracy for the choice of tree representatives in each cluster

Applying both methods to other ensemble classification techniques


Page 83: Unsupervised Learning Techniques to Diversifying and Pruning Random Forest


Q & A

Thanks for listening!

Contact Details

Dr Mohamed Medhat Gaber

E-mail: [email protected]

Webpage: http://mohamedmgaber.weebly.com/

LinkedIn: https://www.linkedin.com/profile/view?id=21808352

Twitter: https://twitter.com/mmmgaber

ResearchGate: https://www.researchgate.net/profile/Mohamed_Gaber16?ev=prf_highl
