random forests-talk-nl-meetup
TRANSCRIPT
© 2015 IBM Corporation
Trees. Bagging. Random Forests.
Abdul Haseeb /
IBM - Sweden
IBM Spark © 2015 IBM Corporation
1. Background – Neural Networks
2. Regression and Classification Trees
3. Bias - Variance - Over/Underfitting
4. Bagging and Random Forests
Agenda
Suited to problems with:
• Lots of data, many variables, and noise in the data.
Strengths and weaknesses:
• Implicitly learn the mapping function and cope well with noise and missing data,
but are hard to interpret and slow to train.
Neural Networks
Good:
• Flexible fitters that capture non-linearity and interactions.
• Handle categorical and numeric y and x very nicely.
• Good performance, and interpretable (when small).
Bad:
• Not the best out-of-sample predictive performance.
Tree-based Learning
Regression: the output variable takes continuous values.
Regression Trees
Step function f(x): within each region (interval), the prediction is the average y value for the subset of training data falling in that region.
• Predict using the step function.
• Equivalently, we drop an observation x down the tree
until it lands in a leaf, and predict the average
of the y values for the training observations
in the same leaf.
Regression Trees
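As an illustration (not from the talk), here is a minimal Python/scikit-learn sketch of the step-function behaviour on hypothetical 1-D data; the library and data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (hypothetical): a noisy sine curve.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

# Each leaf of the fitted tree stores the average y of its training observations.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Predictions form a step function over the regions (intervals) found by the tree.
grid = np.linspace(0, 6, 10).reshape(-1, 1)
print(tree.predict(grid))
```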
Regression Trees with 2 dimensions
Regression Trees with n dimensions
Classification: the output variable takes class labels.
Classification Trees
• Recursive binary splits to partition the predictor space.
• Each binary split consists of a decision rule which sends x left or right.
• The set of bottom nodes (or leaves) gives a partition of the x space.
• To predict, we drop an out-of-sample x down the tree until it lands in a bottom node.
Trees Summary
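A hedged sketch of these ideas, again assuming Python/scikit-learn and toy data rather than anything used in the talk.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy two-class data (hypothetical).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Recursive binary splits partition the (x1, x2) space into rectangular leaves.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(clf))           # each line is a decision rule sending x left or right

# An out-of-sample x is dropped down the tree; its leaf's majority class is the prediction.
print(clf.predict([[0.5, 0.2]]))
```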
Good:
• Handles categorical/numeric nicely.
• Agnostic to the scale of predictors.
• Computationally fast.
• Small trees are interpretable.
• Variable selection.
Bad:
• The step function is crude and does not give the best predictive performance.
• Hard to assess uncertainty.
• Big trees are not interpretable.
Trees
• Both high bias and high variance are undesirable properties of a model.
• A model with high bias “underfits” the
data.
• A model with high variance and low bias
“overfits” the data.
Bias, Variance, Under/Over-fitting
• Overfitting the data causes the model to fit the noise, rather than the actual underlying behavior.
• Models with less variance will be more robust (fit a better model) in the presence of noisy data.
• The more complex the type of model we choose, the less error there will be on the training set.
Overfitting Problem
• Hold out an additional set of data (the validation set).
• The validation set is used to see how well the model generalizes
to unseen data.
• Instead of just looking at training error, we now look at both
training AND validation/generalization error.
• Underfitting – validation error and training error are both high.
• Overfitting – validation error is high while training error is low.
Cross Validation
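A minimal sketch of this train/validation diagnosis, assuming Python/scikit-learn and synthetic data (neither is prescribed by the talk).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data; hold out 30% as a validation set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_err = 1 - model.score(X_tr, y_tr)   # near zero for a fully grown tree
val_err = 1 - model.score(X_val, y_val)   # a large gap to train_err signals overfitting
print(train_err, val_err)                 # both high would instead signal underfitting
```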
• Logistic regression has high bias and low variance.
• Instead of representing the data points as a line, it tries to draw a separating line between two
classes (a linear decision boundary).
• For example, … the interaction between leads and marketing content, such as webinars.
• However, TOO MANY webinar visits may be a negative signal, and could reflect the behavior of
a competitor.
• If this is the case, logistic regression would underfit the data by not capturing this nonlinear
relationship between webinar visits and conversion.
Logistic Regression – Underfit
• For example, … in the training data, all leads that converted were from California.
• Then the learning algorithm identifies location = California as a strongly positive signal.
• Our model will have very good training and validation error, even though location = California is
a spurious signal that will not hold for genuinely new leads.
• Regularization prevents any single feature from being given too positive or too negative a weight.
The strength of regularization can be finely tuned.
• More regularization means more bias and less variance;
less regularization means less bias and more variance.
• Too strong regularization means the model will ignore all the features.
Logistic Regression – Overfit
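A rough illustration of tuning regularization strength; scikit-learn's C parameter (the inverse regularization strength) and the synthetic data are assumptions, not part of the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data with many uninformative features, inviting spurious signals.
X, y = make_classification(n_samples=400, n_features=60, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# In scikit-learn, C is the inverse regularization strength:
# small C -> strong regularization (more bias, less variance),
# large C -> weak regularization (less bias, more variance).
for C in [0.001, 0.01, 0.1, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=2000).fit(X_tr, y_tr)
    print(C, clf.score(X_tr, y_tr), clf.score(X_val, y_val))
```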
• Decision trees in general have low bias and high variance.
• Given a training set, we can keep asking questions until we are able to distinguish between
ALL examples in the data set.
• We could keep asking questions until there is only a single example in each leaf.
• Since this allows us to correctly classify all elements in the training set, the tree is unbiased.
• However, there are many possible trees that could distinguish between all elements, which
means higher variance.
Decision Tree Basics
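A small sketch contrasting a fully grown tree with a depth-limited one, under the same scikit-learn/synthetic-data assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data (flip_y adds label noise).
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Grown until every leaf is pure: essentially zero training error (unbiased on the
# training set) but a large train/validation gap (high variance).
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(full.score(X_tr, y_tr), full.score(X_val, y_val))

# Depth-limited tree: more bias, less variance.
small = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print(small.score(X_tr, y_tr), small.score(X_val, y_val))
```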
A divide-and-conquer approach focused on performance.
Ensemble methods take a group of weak
learners and combine them to form a strong learner.
Ensemble Methods – weak learners' collective capability
• Use randomization when building decision trees to counteract overfitting.
• Bootstrap aggregation – randomizing the subset of examples used to build each tree.
• Randomizing features – randomizing the subset of features used when asking questions.
• Average the predictions of all the different trees.
• Because we take the average of low-bias models, the averaged model also has low bias.
• The average also has low variance.
• Ensemble methods can reduce variance without increasing bias.
Random Forests
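A minimal scikit-learn sketch of these ingredients (bootstrap samples, random feature subsets, averaged/voted predictions); the library and data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Each of the n_estimators trees is grown on a bootstrap sample (bootstrap aggregation),
# and max_features="sqrt" randomizes the subset of features considered at each split.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)

# Prediction combines all trees: class probabilities are the average of the per-tree
# estimates, and the predicted class is the majority vote.
print(rf.predict_proba(X[:3]))
print(rf.predict(X[:3]))
```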
1. Sample N cases at random (with replacement) to create a subset of the data.
2. At each node:
1. Choose m predictor variables at random.
2. Choose the predictor variable that provides the
best split.
3. At the next node, choose another m predictor
variables at random and iterate.
Random Forests
1. Random Record Selection:
• Each tree is trained on roughly 2/3 of the total training data.
• Cases are drawn at random with replacement from the original data.
2. Random Variable Selection:
• Some predictor variables (say, m) are selected at random, and the best split on those variables is used to split the node.
3. For each tree, use the left-out 1/3 of the data to calculate the misclassification rate – the out-of-bag (OOB) error rate.
4. RF Score: the forest chooses the classification having the most votes over all the trees in the forest.
How Random Forests Work
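A short sketch of the OOB error estimate, under the same scikit-learn/synthetic-data assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Sampling N cases with replacement leaves roughly 1/3 of the cases out of each tree's
# bootstrap sample; those left-out cases give the out-of-bag (OOB) error estimate.
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            bootstrap=True, random_state=0).fit(X, y)
print("OOB error rate:", 1 - rf.oob_score_)
```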
• Generates m new training data sets.
• Each new training data set picks a sample of observations with replacement (a bootstrap sample) from the original data set.
• By sampling with replacement, some observations may be repeated in each new training data set.
• The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
Bagging (Bootstrap Aggregating)
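A minimal bagging sketch, assuming scikit-learn's BaggingClassifier (which bags decision trees by default) and synthetic data; neither is named in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# m = n_estimators new training sets, each a bootstrap sample (drawn with replacement,
# so some observations repeat); one model (a decision tree by default) is fitted per
# sample, and the class predictions are combined by voting (averaging for regression).
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)
print(bag.predict(X[:5]))
```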
1. Initialize proximities to zero.
2. For any given tree, apply the tree to all cases.
3. If case i and case j both end up in the same node, increase the proximity prox(i, j) between i
and j by one.
4. Accumulate over all trees in the RF and normalize by the number of trees to create a
proximity matrix (a square matrix with 1 on the diagonal and values between 0 and 1 in the
off-diagonal positions).
Proximity (Similarity)
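A rough sketch of this proximity computation on a fitted scikit-learn forest; the data are synthetic and the code is illustrative rather than the talk's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# rf.apply(X) returns, for every case, the index of the leaf it lands in for each tree.
leaves = rf.apply(X)                                  # shape: (n_cases, n_trees)

# prox(i, j) = fraction of trees in which case i and case j share a leaf.
n_cases, n_trees = leaves.shape
prox = np.zeros((n_cases, n_cases))
for t in range(n_trees):
    prox += (leaves[:, t][:, None] == leaves[:, t][None, :])
prox /= n_trees                                        # 1 on the diagonal, [0, 1] elsewhere
```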
1. The number of trees used in the forest (ntree), and
2. The number of predictor variables sampled at random as split candidates at each node (mtry).
Fine tune Random Forest
• Set mtry to its default value (the square root of the total number of predictors).
• Build random forests with different ntree values (100, 200, 300, …, 1,000).
• Build an RF classifier for each ntree value, record the OOB error rate, and
find the number of trees at which the OOB error rate stabilizes and reaches
its minimum.
Fine tune Random Forest – Optimal ntree
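An illustrative sweep over ntree, assuming scikit-learn (where ntree corresponds to n_estimators and the default mtry for classification is max_features="sqrt") and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=40, random_state=0)

# Keep mtry at its default (max_features="sqrt") and sweep ntree,
# looking for the point where the OOB error rate flattens out.
for ntree in [100, 200, 300, 500, 1000]:
    rf = RandomForestClassifier(n_estimators=ntree, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
    print(ntree, round(1 - rf.oob_score_, 4))
```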
1. Experiment with mtry set to the square root of the total number of predictors,
half of that square root value, and twice that square root value, and
check which mtry returns the maximum area under the ROC curve (AUC).
• Thus, for 1,000 predictors, the number of predictors to consider at each node
would be roughly 16, 32, and 64.
Fine tune Random Forest – Optimal mtry
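An illustrative mtry comparison by validation AUC, under the same scikit-learn/synthetic-data assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Try mtry = half the square root, the square root, and twice the square root
# of the number of predictors, and keep the value with the highest validation AUC.
root = int(round(np.sqrt(X.shape[1])))
for mtry in [root // 2, root, 2 * root]:
    rf = RandomForestClassifier(n_estimators=300, max_features=mtry,
                                random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])
    print(mtry, round(auc, 4))
```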
1. Poor generalization to cases with completely new data.
2. Lack of insight.
3. If a variable is categorical with many levels, random forests are
biased towards variables having more levels.
Shortcomings of Random Forests
Q/A
References
• Trees and Random Forests (Adele Cutler, Professor, Utah State University)
• Analysis of a Random Forests model (Gerard Biau, Journal of Machine Learning Research)
• Random Forest (Wikipedia)
• Random Forests Introduction (Quora)
• Random Forests for Regressions (Quora)
• Random Forests, Ensembles, and Performance Metrics (Dr. Arshavir Blackwell, CitizenNet)
• Random Forests (Leo Breiman, UC Berkeley)
• Kaggle and Random Forests
• Random Forest Classifiers: A Survey and Future Research Directions (V. Y. Kulkarni and Dr. P. K. Sinha, International Journal of
Advanced Computing)
• Exploratory data analysis using Random Forests (Zachary Jones and Fridolin Linder, Penn State University)