Copyright © 2016, SAS Institute Inc. All rights reserved.
DEMYSTIFYING RANDOM FORESTS
ANTONI DZIECIOLOWSKI SAS CANADA
RANDOM FOREST MOTIVATION
“With excellent performance on all eight metrics, calibrated boosted trees were
the best learning algorithm overall. Random forests are close second, followed
by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets.”
Rich Caruana and Alexandru Niculescu-Mizil.
An Empirical Comparison of Supervised Learning Algorithms. ICML 2006.
DECISION TREE DEFINITION
A decision tree is a schematic, tree-shaped diagram used to determine a
course of action or to show a statistical probability. Each branch of the
tree represents a possible decision, occurrence, or reaction. The tree is
structured to show how and why one choice may lead to the next, with the
branches indicating that the options are mutually exclusive.
DECISION TREE DEFINITION
[Figure: example decision tree splitting on X1 = 2]
DECISION TREE BINARY SPLIT EXAMPLE
Julie Grisanti - Decision Trees: An Overview
http://www.aunalytics.com/decision-trees-an-overview/
Splitting Criteria:
• Information Gain
• Variance
• Gini Index (binary splits)
• Chi-Square
• Etc.
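The Gini index from the list above can be sketched in a few lines of plain Python. This is an illustration, not any particular tool's implementation: impurity is 1 − Σ pₖ², and a split is scored by how much it reduces the weighted impurity of the parent node.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into `left`/`right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

labels = [0, 0, 1, 1, 1, 1]
# A candidate split that isolates both 0s on the left is pure, so the
# gain equals the parent impurity, 4/9.
print(split_gain(labels, [0, 0], [1, 1, 1, 1]))
```

The same scaffolding scores information gain by swapping `gini` for an entropy function.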
RANDOM FOREST LEO BREIMAN
1928 - 2005
• Responsible in part for bridging the gap between statistics and
computer science in machine learning.
• Contributed foundational work on classification and regression trees
and on fitting ensembles of trees to bootstrap samples (bagging).
• Focused on computationally intensive multivariate analysis,
especially the use of nonlinear methods for pattern recognition
and prediction in high-dimensional spaces.
• Developed random forests as a computationally efficient alternative
to neural nets.
https://www.stat.berkeley.edu/~breiman/
WHAT IS A RANDOM FOREST?
“Random forests are a combination of tree predictors such that each tree
depends on the values of a random vector sampled independently and
with the same distribution for all trees in the forest.”
Leo Breiman. Random Forests. Statistics
Department, University of California, Berkeley, 2001.
RANDOM FOREST
D = ((x1, y1), …, (xN, yN))  (observed data points)
m < M features (variables)

Algorithm: Random Forest for Regression or Classification.
1. For t = 1 to B (construct B trees):
   (a) Draw a bootstrap sample Dt of size N from the training data D.
   (b) Grow a random-forest tree Tt on the bootstrapped data by recursively
       repeating the following steps for each leaf node of the tree, until
       the minimum node size nmin is reached:
       i. Select m variables at random from the M variables.
       ii. Pick the best variable/split-point among the m.
       iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {Tt}, t = 1, …, B.
[Hastie, Tibshirani, Friedman. The Elements of Statistical Learning]
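The algorithm above can be sketched in plain Python. This is a minimal illustration, not SAS's implementation: each "tree" is a one-level stump, so step (b)(i)'s per-node feature draw collapses to one draw per tree, and split quality is scored by accuracy rather than an impurity criterion.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feat_ids):
    """Step (b): pick the best single-feature threshold split among
    the randomly chosen `feat_ids`."""
    best = None
    for f in feat_ids:
        for row in rows:                      # candidate thresholds
            thr = row[f]
            left = [y for x, y in zip(rows, labels) if x[f] <= thr]
            right = [y for x, y in zip(rows, labels) if x[f] > thr]
            if not left or not right:
                continue
            pred_l = Counter(left).most_common(1)[0][0]
            pred_r = Counter(right).most_common(1)[0][0]
            acc = (left.count(pred_l) + right.count(pred_r)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, f, thr, pred_l, pred_r)
    if best is None:                          # degenerate bootstrap sample
        maj = Counter(labels).most_common(1)[0][0]
        return lambda x: maj
    _, f, thr, pl, pr = best
    return lambda x: pl if x[f] <= thr else pr

def fit_forest(rows, labels, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    M = len(rows[0])
    trees = []
    for _ in range(n_trees):
        # (a) bootstrap sample of size N, drawn with replacement
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        # (i) select m of the M features at random
        feats = rng.sample(range(M), m)
        trees.append(fit_stump([rows[i] for i in idx],
                               [labels[i] for i in idx], feats))
    # 2. the ensemble classifies by majority vote over all trees
    return lambda x: Counter(t(x) for t in trees).most_common(1)[0][0]

X = [[1, 5], [2, 4], [3, 6], [7, 1], [8, 2], [9, 3]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(forest([0, 7]), forest([10, 0]))
```

A real forest grows each tree to the minimum node size, redrawing the m features at every node; the stump keeps the sketch short while preserving the bootstrap-plus-random-subspace structure.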
VISUALIZATION OF BAGGING
HOW TO BUILD A RANDOM TREE (BOOTSTRAPPING)
Data Space (inputs):

         Feat 1   Feat 2   Feat 3   …   Feat M
Obs 1      2        3        5          3
Obs 2      6        1        4          4
Obs 3      3        5        9          5
Obs 4      5        7        8          8
Obs 5      0        8        2          2
…
Obs N      7        1        3          5

Response Space (outputs):

Target 1   Target 2   Target 3   Target 4   Target 5   …   Target N
   0          1          1          0          0            1

Pick m features from M and n observations from N at random
(here, Feat 1 and Feat 3 are the selected features).
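The sampling step in the table above can be sketched with Python's standard library; the sizes and the seed here are arbitrary choices for illustration. Observations are drawn with replacement (a bootstrap sample), features without replacement.

```python
import random

rng = random.Random(7)
N, M = 6, 4          # observations and features, as in the table above
m, n = 2, N          # m features per tree; the bootstrap keeps sample size N

obs_ids = [rng.randrange(N) for _ in range(n)]   # sample WITH replacement
feat_ids = rng.sample(range(M), m)               # sample WITHOUT replacement

# Drawing N of N with replacement typically repeats some rows and omits
# others (on average about 36.8% of rows are left out).
print(obs_ids)
print(feat_ids)
```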
BAGGING OR BOOTSTRAP AGGREGATION
Average many noisy but approximately unbiased models to reduce the
variance of the estimated prediction function.
[Hastie, Tibshirani, Friedman. The Elements of Statistical Learning]
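A toy illustration of why averaging helps, assuming independent, unbiased estimators: the variance of the mean of B independent estimates is 1/B times the variance of a single one. (Forest trees are correlated, so the real reduction is smaller than this ideal.)

```python
import random
from statistics import mean, pvariance

rng = random.Random(0)
true_value = 1.0

def noisy_estimate():
    """An approximately unbiased but noisy 'model': truth plus noise."""
    return true_value + rng.gauss(0, 1)

B = 25
# Variance of single estimates vs. variance of averages of B estimates.
singles = [noisy_estimate() for _ in range(2000)]
averages = [mean(noisy_estimate() for _ in range(B)) for _ in range(2000)]

print(pvariance(singles))    # close to 1
print(pvariance(averages))   # close to 1/B for independent estimates
```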
BUILDING A FOREST (ENSEMBLE)
RANDOM FOREST ADVANTAGES
• Can solve both types of problems: classification and regression
• Generalizes well to new data
• It is unexcelled in accuracy among current algorithms*
• It runs efficiently on large databases and can handle thousands of input variables without variable
deletion
• It gives estimates of which variables are important in the classification
• It generates an internal unbiased estimate of the generalization error as the forest building progresses
• It has an effective method for estimating missing data and maintains accuracy when a large proportion
of the data are missing
• It computes proximities between pairs of cases that can be used in clustering, locating outliers, or
giving interesting views of the data
• The out-of-bag error estimate removes the need for a set-aside test set
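The out-of-bag (OOB) estimate in the last bullet works because each bootstrap sample leaves out, on average, (1 − 1/N)^N ≈ 1/e ≈ 36.8% of the observations; those held-out rows act as a free test set for that tree. A quick simulation confirms the fraction:

```python
import math
import random

rng = random.Random(42)
N, B = 500, 200      # N observations, B bootstrap samples (trees)

oob_count = 0
for _ in range(B):
    in_bag = {rng.randrange(N) for _ in range(N)}   # one bootstrap sample
    oob_count += N - len(in_bag)                    # rows this tree never saw

frac = oob_count / (N * B)
print(frac, math.exp(-1))   # observed OOB fraction vs. the 1/e limit
```

In a full forest, each observation is scored only by the trees for which it was out of bag, and those votes yield the internal generalization-error estimate.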
DISADVANTAGES
• The results are less actionable because forests are not easily interpreted.
They are considered a black-box approach, giving statistical modelers little
control over what the model does, similar to a neural network.
• Random forests do a good job at classification but are not as well suited
to regression, since they do not give precise continuous predictions. In
regression, they cannot predict beyond the range of the training data, and
they may over-fit data sets that are particularly noisy.
SAS ENTERPRISE MINER
RANDOM FOREST: SAS HPFOREST
PROC HPFOREST data=<libref.dataset>;
   target <targetname> / level=<type of target>;
   input <categorical variables> / level=nominal;
   input <numerical variables> / level=interval;
run;
OUTPUT OF PROC HPFOREST