random forests-talk-nl-meetup
TRANSCRIPT
© 2015 IBM Corporation
Trees. Bagging. Random Forests.
Abdul Haseeb /
IBM - Sweden
IBM Spark © 2015 IBM Corporation
1. Background – Neural Networks
2. Regression and Classification Trees
3. Bias - Variance - Over/Underfitting
4. Bagging and Random Forests
Agenda
Suited to problems with:
• Lots of data, many variables, and noise in the data.
Strengths and weaknesses:
• Implicitly learn the mapping function and cope well with noise and missing data,
but are hard to interpret and slow to train.
Neural Networks
Good:
• Flexible fitters that capture non-linearity and interactions.
• Handle categorical and numeric y and x very nicely.
• Good performance, and interpretable (when small).
Bad:
• Not the best out-of-sample predictive performance.
Tree-based Learning
Regression: the output variable takes continuous values.
Regression Trees
Step function f(x): within each region (interval), the prediction is the average y value for the subset of training data falling in that region.
• Predict using the step function.
• Equivalently, we drop an observation x down the tree
until it lands in a leaf, and predict the average
of the y values for the training observations
in the same leaf.
Regression Trees
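As an illustration (not from the talk), here is a minimal Python/scikit-learn sketch of the step-function behaviour on hypothetical 1-D data; the library and data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (hypothetical): a noisy sine curve.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

# Each leaf of the fitted tree stores the average y of its training observations.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Predictions form a step function over the regions (intervals) found by the tree.
grid = np.linspace(0, 6, 10).reshape(-1, 1)
print(tree.predict(grid))
```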
Regression Trees with 2 dimensions
Regression Trees with n dimensions
Classification: the output variable takes class labels.
Classification Trees
• Recursive binary splits to partition the predictor space.
• Each binary split consists of a decision rule which sends x left or right.
• The set of bottom nodes (or leaves) gives a partition of the x space.
• To predict, we drop an out-of-sample x down the tree until it lands in a bottom node.
Trees Summary
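A hedged sketch of these ideas, again assuming Python/scikit-learn and toy data rather than anything used in the talk.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy two-class data (hypothetical).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Recursive binary splits partition the (x1, x2) space into rectangular leaves.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(clf))           # each line is a decision rule sending x left or right

# An out-of-sample x is dropped down the tree; its leaf's majority class is the prediction.
print(clf.predict([[0.5, 0.2]]))
```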
Good:
• Handles categorical/numeric nicely.
• Agnostic to the scale of predictors.
• Computationally fast.
• Small trees are interpretable.
• Variable selection.
Bad:
• The step function is crude and does not give the best predictive performance.
• Hard to assess uncertainty.
• Big trees are not interpretable.
Trees
• Both high bias and high variance are undesirable properties of a model.
• A model with high bias “underfits” the
data.
• A model with high variance and low bias
“overfits” the data.
Bias, Variance, Under/Over-fitting
• Overfitting the data causes the model to fit the noise, rather than the actual underlying behavior.
• Models with less variance will be more robust (fit a better model) in the presence of noisy data.
• The more complex the type of model we choose, the less error there will be on the training set.
Overfitting Problem
• Hold out an additional set of data (the validation set).
• The validation set is used to see how well the model generalizes
to unseen data.
• Instead of just looking at training error, we now look at both
training AND validation/generalization error.
• Underfitting – validation error and training error are both high.
• Overfitting – validation error is high while training error is low.
Cross Validation
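A minimal sketch of this train/validation diagnosis, assuming Python/scikit-learn and synthetic data (neither is prescribed by the talk).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data; hold out 30% as a validation set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_err = 1 - model.score(X_tr, y_tr)   # near zero for a fully grown tree
val_err = 1 - model.score(X_val, y_val)   # a large gap to train_err signals overfitting
print(train_err, val_err)                 # both high would instead signal underfitting
```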
• Logistic regression has high bias and low variance.
• Instead of representing the data points as a line, it tries to draw a separating line between two
classes (a linear decision boundary).
• For example, … the interaction between leads and marketing content, such as webinars.
• However, TOO MANY webinar visits may be a negative signal, and could reflect the behavior of
a competitor.
• If this is the case, logistic regression would underfit the data by not capturing this nonlinear
relationship between webinar visits and conversion.
Logistic Regression – Underfit
• For example, … in the training data, all leads that converted were from California.
• Then the learning algorithm identifies location = California as a strongly positive signal.
• Our model will have very good training and validation error, even though location = California is
a spurious signal that will not hold for genuinely new leads.
• Regularization prevents any single feature from being given too positive or too negative a weight.
The strength of regularization can be finely tuned.
• More regularization means more bias and less variance;
less regularization means less bias and more variance.
• Too strong regularization means the model will ignore all the features.
Logistic Regression – Overfit
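A rough illustration of tuning regularization strength; scikit-learn's C parameter (the inverse regularization strength) and the synthetic data are assumptions, not part of the talk.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data with many uninformative features, inviting spurious signals.
X, y = make_classification(n_samples=400, n_features=60, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# In scikit-learn, C is the inverse regularization strength:
# small C -> strong regularization (more bias, less variance),
# large C -> weak regularization (less bias, more variance).
for C in [0.001, 0.01, 0.1, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=2000).fit(X_tr, y_tr)
    print(C, clf.score(X_tr, y_tr), clf.score(X_val, y_val))
```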
• Decision trees in general have low bias and high variance.
• Given a training set, we can keep asking questions until we are able to distinguish between
ALL examples in the data set.
• We could keep asking questions until there is only a single example in each leaf.
• Since this allows us to correctly classify all elements in the training set, the tree is unbiased.
• However, there are many possible trees that could distinguish between all elements, which
means higher variance.
Decision Tree Basics
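A small sketch contrasting a fully grown tree with a depth-limited one, under the same scikit-learn/synthetic-data assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data (flip_y adds label noise).
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Grown until every leaf is pure: essentially zero training error (unbiased on the
# training set) but a large train/validation gap (high variance).
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(full.score(X_tr, y_tr), full.score(X_val, y_val))

# Depth-limited tree: more bias, less variance.
small = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print(small.score(X_tr, y_tr), small.score(X_val, y_val))
```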
A divide-and-conquer approach focused on performance.
Ensemble methods take a group of weak
learners and combine them to form a strong learner.
Ensemble Methods – weak learners' collective capability
• Use randomization when building decision trees to counteract overfitting.
• Bootstrap aggregation – randomizing the subset of examples used to build each tree.
• Randomizing features – randomizing the subset of features used when asking questions.
• Average the predictions of all the different trees.
• Because we take the average of low-bias models, the averaged model also has low bias.
• The average also has low variance.
• Ensemble methods can reduce variance without increasing bias.
Random Forests
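A minimal scikit-learn sketch of these ingredients (bootstrap samples, random feature subsets, averaged/voted predictions); the library and data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Each of the n_estimators trees is grown on a bootstrap sample (bootstrap aggregation),
# and max_features="sqrt" randomizes the subset of features considered at each split.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            bootstrap=True, random_state=0).fit(X, y)

# Prediction combines all trees: class probabilities are the average of the per-tree
# estimates, and the predicted class is the majority vote.
print(rf.predict_proba(X[:3]))
print(rf.predict(X[:3]))
```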
1. Sample N cases at random (with replacement) to create a subset of the data.
2. At each node:
1. Choose m predictor variables at random.
2. Choose the predictor variable that provides the
best split.
3. At the next node, choose another m predictor
variables at random and iterate.
Random Forests
1. Random Record Selection:
• Each tree is trained on roughly 2/3 of the total training data.
• Cases are drawn at random with replacement from the original data.
2. Random Variable Selection:
• Some predictor variables (say, m) are selected at random, and the best split on those variables is used to split the node.
3. For each tree, use the left-out 1/3 of the data to calculate the misclassification rate – the out-of-bag (OOB) error rate.
4. RF Score: the forest chooses the classification having the most votes over all the trees in the forest.
How Random Forests Work
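A short sketch of the OOB error estimate, under the same scikit-learn/synthetic-data assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Sampling N cases with replacement leaves roughly 1/3 of the cases out of each tree's
# bootstrap sample; those left-out cases give the out-of-bag (OOB) error estimate.
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            bootstrap=True, random_state=0).fit(X, y)
print("OOB error rate:", 1 - rf.oob_score_)
```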
• Generates m new training data sets.
• Each new training data set picks a sample of observations with replacement (a bootstrap sample) from the original data set.
• By sampling with replacement, some observations may be repeated in each new training data set.
• The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
Bagging (Bootstrap Aggregating)
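A minimal bagging sketch, assuming scikit-learn's BaggingClassifier (which bags decision trees by default) and synthetic data; neither is named in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# m = n_estimators new training sets, each a bootstrap sample (drawn with replacement,
# so some observations repeat); one model (a decision tree by default) is fitted per
# sample, and the class predictions are combined by voting (averaging for regression).
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)
print(bag.predict(X[:5]))
```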
1. Initialize proximities to zero.
2. For any given tree, apply the tree to all cases.
3. If case i and case j both end up in the same node, increase the proximity prox(i, j) between i
and j by one.
4. Accumulate over all trees in the RF and normalize by the number of trees to create a
proximity matrix (a square matrix with 1 on the diagonal and values between 0 and 1 in the
off-diagonal positions).
Proximity (Similarity)
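A rough sketch of this proximity computation on a fitted scikit-learn forest; the data are synthetic and the code is illustrative rather than the talk's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# rf.apply(X) returns, for every case, the index of the leaf it lands in for each tree.
leaves = rf.apply(X)                                  # shape: (n_cases, n_trees)

# prox(i, j) = fraction of trees in which case i and case j share a leaf.
n_cases, n_trees = leaves.shape
prox = np.zeros((n_cases, n_cases))
for t in range(n_trees):
    prox += (leaves[:, t][:, None] == leaves[:, t][None, :])
prox /= n_trees                                        # 1 on the diagonal, [0, 1] elsewhere
```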
1. The number of trees used in the forest (ntree), and
2. The number of predictor variables sampled at random as split candidates at each node (mtry).
Fine tune Random Forest
• Set mtry to its default value (the square root of the total number of predictors).
• Build random forests with different ntree values (100, 200, 300, …, 1,000).
• Build an RF classifier for each ntree value, record the OOB error rate, and
find the number of trees at which the OOB error rate stabilizes and reaches
its minimum.
Fine tune Random Forest – Optimal ntree
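An illustrative sweep over ntree, assuming scikit-learn (where ntree corresponds to n_estimators and the default mtry for classification is max_features="sqrt") and synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=40, random_state=0)

# Keep mtry at its default (max_features="sqrt") and sweep ntree,
# looking for the point where the OOB error rate flattens out.
for ntree in [100, 200, 300, 500, 1000]:
    rf = RandomForestClassifier(n_estimators=ntree, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)
    print(ntree, round(1 - rf.oob_score_, 4))
```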
1. Experiment with mtry set to the square root of the total number of predictors,
half of that square root value, and twice that square root value, and
check which mtry returns the maximum area under the ROC curve (AUC).
• Thus, for 1,000 predictors, the number of predictors to consider at each node
would be roughly 16, 32, and 64.
Fine tune Random Forest – Optimal mtry
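An illustrative mtry comparison by validation AUC, under the same scikit-learn/synthetic-data assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Try mtry = half the square root, the square root, and twice the square root
# of the number of predictors, and keep the value with the highest validation AUC.
root = int(round(np.sqrt(X.shape[1])))
for mtry in [root // 2, root, 2 * root]:
    rf = RandomForestClassifier(n_estimators=300, max_features=mtry,
                                random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])
    print(mtry, round(auc, 4))
```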
1. Poor generalization to cases with completely new data.
2. Lack of insight.
3. If a variable is categorical with many levels, random forests are
biased towards variables having more levels.
Shortcomings of Random Forests
Q/A
References
• Trees and Random Forests (Adele Cutler, Professor, Utah State University)
• Analysis of a Random Forests model (Gerard Biau, Journal of Machine Learning Research)
• Random Forest (Wikipedia)
• Random Forests Introduction (Quora)
• Random Forests for Regressions (Quora)
• Random Forests, Ensembles, and Performance Metrics (Dr. Arshavir Blackwell, CitizenNet)
• Random Forests (Leo Breiman, UC Berkeley)
• Kaggle and Random Forests
• Random Forest Classifiers: A Survey and Future Research Directions (V. Y. Kulkarni and Dr. P. K. Sinha, International Journal of
Advanced Computing)
• Exploratory data analysis using Random Forests (Zachary Jones and Fridolin Linder, Penn State University)