random forests 13-06-2015

Random Forests Cork Big Data & Analytics Group

Upload: michael-keane

Post on 15-Apr-2017

TRANSCRIPT

Page 1: Random forests 13-06-2015

Random Forests
Cork Big Data & Analytics Group

Page 2: Random forests 13-06-2015

Decision Trees
• One of the older ML algorithms (Breiman et al. 1984)
• One of the most popular (Rexer Data Miner Survey 2013)
• Really versatile: handle non-linear relationships, missing data, outliers, categorical or numerical targets, you name it!
• Can be easily interpreted: the rules can be presented as a table, or a series of if-then statements for each “split”
• Can also be visually represented
• Common algorithms: CART, ID3, C4.5, CHAID, C5.0
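To make the "series of if-then statements" point concrete, here is a minimal Python sketch of a hand-written tree with two splits. The feature names (`age`, `income`), the thresholds, and the labels are hypothetical, chosen only for illustration; they are not from the talk.

```python
def predict(age, income):
    """Hand-written decision tree: each `if` corresponds to one split."""
    if age < 30:            # split 1, on a numeric feature (hypothetical)
        return "no"
    if income < 40_000:     # split 2, only reached when age >= 30
        return "no"
    return "yes"


# The same tree presented as a rule table: (condition, prediction).
RULES = [
    ("age < 30", "no"),
    ("age >= 30 and income < 40000", "no"),
    ("age >= 30 and income >= 40000", "yes"),
]
```

Each path from the root to a leaf becomes one row of the table, which is why a single tree is easy to read and explain.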

Page 3: Random forests 13-06-2015

Note: hope you don’t mind the political example!

Page 4: Random forests 13-06-2015

Decision Trees (cont’d.)
• Decision trees have low bias: the created model generally approximates reality well
• On the other hand, they have high variance: a model tends to perform differently on different samples of the data
• We need consistent performance, so what now?
• How about we “grow” a bunch of decision trees, and average them up?
• Breiman thought about this, and in 2001 developed…
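The "average them up" idea can be illustrated with a small simulation. This is only a sketch under a strong assumption: each "tree" is modelled as the true value plus independent Gaussian noise, which real trees grown on overlapping data are not. The function names and the numbers (25 trees, 2000 repetitions) are illustrative choices.

```python
import random
import statistics


def noisy_tree_prediction(rng, truth=10.0, noise_sd=1.0):
    """Stand-in for one high-variance tree: the truth plus random error."""
    return truth + rng.gauss(0.0, noise_sd)


def ensemble_prediction(rng, n_trees=25):
    """Average the predictions of n_trees independent 'trees'."""
    return statistics.fmean(noisy_tree_prediction(rng) for _ in range(n_trees))


rng = random.Random(0)  # fixed seed so the run is repeatable
single = [noisy_tree_prediction(rng) for _ in range(2000)]
averaged = [ensemble_prediction(rng) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(f"variance of a single tree:     {var_single:.3f}")
print(f"variance of a 25-tree average: {var_averaged:.3f}")
```

With independent errors the variance of the average shrinks roughly in proportion to the number of trees. Correlated trees gain less, which is one motivation for making the trees differ from each other.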

Page 5: Random forests 13-06-2015
Page 6: Random forests 13-06-2015

Random Forests
• Mimics an ensemble of “experts” making a decision
• Grows a bunch of bagged decision trees, with each split using a random subset of the variables (to reduce variance)
• Fast (relatively), scalable, keeps most of the benefits of decision trees
• Has several parameters to tweak for performance
• Implemented in all major ML software and libraries
• But it is a “black box”, so no rules, no visualizations, little inference
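The two sources of randomness mentioned above (bagging the rows and sub-sampling the variables) can be sketched in a few lines of standard-library Python. This is not a full forest: tree growing is left out, and the helper names are hypothetical. The square-root default for the number of candidate variables is a common convention for classification, assumed here.

```python
import math
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (the 'bagging' step)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, rng, mtry=None):
    """Pick the random subset of variables considered at one split."""
    if mtry is None:
        mtry = max(1, math.isqrt(n_features))  # assumed default: floor(sqrt(p))
    return rng.sample(range(n_features), mtry)


def majority_vote(votes):
    """Combine the class labels predicted by the individual trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)
rows = bootstrap_indices(10, rng)       # rows used to grow one tree
features = candidate_features(9, rng)   # variables tried at one split
print(rows, features, majority_vote(["yes", "no", "yes"]))
```

For a regression target, the majority vote would be replaced by a plain average of the trees' predictions.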

Page 7: Random forests 13-06-2015

Random Forests (cont’d.)
• Give you “free” cross-validation (through calculating the out-of-bag, or OOB, error)
• This means shorter training time, since no separate cross-validation runs are needed
• Calculates variable importance
• Partial dependence plots
• Now supports censored (survival) data
• Handles class imbalance
• Can create very large objects in memory
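Where the "free" validation data comes from can be sketched as follows. Each tree is grown on a bootstrap sample, so some rows are never drawn for that tree and can be used to test it. The function name and the row count are illustrative assumptions.

```python
import random


def split_in_bag_oob(n_rows, rng):
    """Return (in-bag indices, out-of-bag indices) for one bootstrap sample."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    oob = sorted(set(range(n_rows)) - set(in_bag))  # rows this tree never saw
    return in_bag, oob


rng = random.Random(1)
n_rows = 10_000
in_bag, oob = split_in_bag_oob(n_rows, rng)
oob_fraction = len(oob) / n_rows
print(f"fraction of rows left out of this sample: {oob_fraction:.3f}")
```

For large samples the left-out fraction approaches (1 - 1/n)^n ≈ 1/e, a little over a third of the rows. The OOB error is obtained by predicting each row using only the trees for which that row was out of bag.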

Page 8: Random forests 13-06-2015

Random Forests in R
• randomForest
• randomForestSRC
• ggRandomForests
• party
• randomForestCI (swager on GitHub)
• edarf (zmjones on GitHub)
• Boruta

Page 9: Random forests 13-06-2015

Tuning Parameters
• Number of Trees
• Number of Variables
• Prior Class Weights
• Cutoff
• Sample Size
• Node Size
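As a rough guide to how these six knobs fit together, the sketch below collects them in a plain Python container. The field names follow the argument names of the R `randomForest` function (`ntree`, `mtry`, `classwt`, `cutoff`, `sampsize`, `nodesize`). The `ForestSettings` class itself is hypothetical, not part of any library, and the defaults shown are assumed to match that package's classification defaults.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ForestSettings:
    """Hypothetical container for the tuning parameters listed above."""
    ntree: int = 500                            # number of trees to grow
    mtry: Optional[int] = None                  # variables tried at each split
    classwt: Optional[Sequence[float]] = None   # prior class weights
    cutoff: Optional[Sequence[float]] = None    # per-class vote thresholds
    sampsize: Optional[int] = None              # rows drawn for each tree
    nodesize: int = 1                           # minimum size of terminal nodes


# Example: more trees and larger leaves than the assumed defaults.
settings = ForestSettings(ntree=1000, nodesize=5)
print(settings)
```

Larger `nodesize` values give shallower trees, which trains faster and produces smaller objects in memory.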

Page 10: Random forests 13-06-2015

Some Resources
• James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning
• Kuhn, Johnson, Applied Predictive Modeling
• Jones, Linder, Exploratory Data Analysis using Random Forests (article)
• Package vignettes on CRAN
• CrossValidated.com

Page 11: Random forests 13-06-2015

THANK YOU
[email protected]