
Data Mining - Volinsky - 2011 - Columbia University

Topic 10 - Ensemble Methods

Ensemble Models - Motivation

• Remember this picture?
• We are always looking for a balance between low complexity (‘good on average’ but bad for prediction) and high complexity (‘good for specific cases’ but might overfit).
• By combining many different models, ensembles make it easier to hit the ‘sweet spot’ of modelling.
• Models are best combined when they draw on diverse, independent opinions - the Wisdom of Crowds.


[Figure: training error (S_train) and test error (S_test) curves as a function of model complexity]

Ensemble Methods - Motivation

• Models are just models.
  – They are usually not the truth! The truth is often much more complex than any single model can capture.
  – Combinations of simple models can be arbitrarily complex (e.g. spam/robot models, neural nets, splines).
• Notion: an average of several measurements is often more accurate and stable than a single measurement (a one-line justification appears below).
  – Accuracy: how well the model does for estimation and prediction
  – Stability: small changes in inputs have little effect on outputs
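A one-line justification of the averaging intuition (not on the original slide, just the standard variance argument): if the individual measurements are independent with common variance, the variance of their average shrinks with the number of measurements,

    \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{\sigma^2}{n},
    \qquad X_1,\dots,X_n \text{ independent, } \operatorname{Var}(X_i)=\sigma^2 .

Correlated measurements (or models) shrink the variance less, which is why the diversity and independence above matter.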


Ensemble Methods – How They Work

• The ensemble predicts a target value as an average or a vote of the predictions of several individual models.
  – Each model is fit independently of the others.
  – The final prediction is a combination of the independent predictions of all models.
• For a continuous target, an ensemble averages the predictions (usually weighted).
• For a categorical target (classification), an ensemble may average the probabilities of the target values, or may use ‘voting’.
  – Voting classifies a case into the class that was selected most often by the individual models.
  (A small sketch of both combination rules follows below.)
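A minimal sketch of the two combination rules just described (illustrative only; the predictions, probabilities, and weights below are made up):

    import numpy as np

    # Continuous target: (weighted) average of the individual predictions.
    preds = np.array([2.9, 3.4, 3.1])           # predictions from three models
    weights = np.array([0.5, 0.3, 0.2])         # e.g. validation-based or posterior weights
    y_hat = np.average(preds, weights=weights)  # weighted average prediction

    # Categorical target, option 1: average the predicted class probabilities.
    probs = np.array([[0.6, 0.4],               # each row = one model's P(class 0), P(class 1)
                      [0.7, 0.3],
                      [0.4, 0.6]])
    avg_class = probs.mean(axis=0).argmax()     # class with the highest average probability

    # Categorical target, option 2: majority vote over hard class predictions.
    votes = np.array([0, 0, 1])                 # each model's predicted class
    vote_class = np.bincount(votes).argmax()    # most frequently chosen class

    print(y_hat, avg_class, vote_class)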


Ensemble Models – Why they work

• Voting example
  – 5 independent classifiers
  – 70% accuracy for each
  – Combine them by voting…
  – What is the probability that the ensemble model is correct?
• Let's simulate it (see the sketch below)
  – What about 100 classifiers?
  – (Not a realistic example; why?)
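A quick simulation of the voting example, as a sketch (not from the original slides; it assumes the classifiers err independently, which is exactly the unrealistic part):

    import numpy as np

    rng = np.random.default_rng(0)

    def majority_vote_accuracy(n_classifiers, accuracy=0.7, n_trials=100_000):
        # Each classifier is independently correct with the given probability.
        correct = rng.random((n_trials, n_classifiers)) < accuracy
        # The ensemble is correct when more than half of the classifiers are correct.
        return (correct.sum(axis=1) > n_classifiers / 2).mean()

    print(majority_vote_accuracy(5))    # ~0.84; exact value is sum_{k=3..5} C(5,k) 0.7^k 0.3^(5-k) ≈ 0.837
    print(majority_vote_accuracy(100))  # ~1.0; a majority of 100 such classifiers is almost never wrong

In practice the classifiers' errors are correlated, so the real gain is far smaller; that is why the slide calls the example unrealistic.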


Ensemble Schemes

• The beauty is that you can average together models of any kind!
• You don't need fancy schemes - just average!
• But there are fancy schemes; each has its own way of fitting many models to the same data and then voting or averaging:
  – Stacking (Wolpert 1992): fit many leave-one-out models
  – Bagging (Breiman 1996): build models on many permutations of the original data
  – Boosting (Freund & Schapire 1996): iteratively re-model, re-weighting the data based on the errors of the previous models
  – Arcing (Breiman 1998), Bumping (Tibshirani 1997), Crumpling (Anderson & Elder 1998), Born-Again (Breiman 1998)
  – Bayesian Model Averaging - near to my heart…
• We'll explore BMA, bagging and boosting…


Ensemble Methods – Bayesian Model Averaging


Model Averaging

• Idea: account for the inherent variance of the model selection process
• Posterior Variance = Within-Model Variance + Between-Model Variance (in symbols, see below)
• Data-driven model selection is risky: “part of the evidence is spent to specify the model” (Leamer, 1978)
• Model-based inferences can be over-precise
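The decomposition above is just the law of total variance taken over models (written here in the Δ notation used on the next slide):

    \operatorname{Var}(\Delta \mid D)
      = \underbrace{\mathrm{E}_{M}\big[\operatorname{Var}(\Delta \mid M, D)\big]}_{\text{within-model variance}}
      + \underbrace{\operatorname{Var}_{M}\big(\mathrm{E}[\Delta \mid M, D]\big)}_{\text{between-model variance}},

where the outer expectation and variance are taken with respect to the posterior model probabilities Pr(M | D). Picking a single model throws away the second term, which is why model-based inferences can be over-precise.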


Model Averaging

• For some quantity of interest Δ, average over all models M_k, given the data D:

    \Pr(\Delta \mid D) = \sum_{k} \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D)

• To calculate the first term properly, you need to integrate out the model parameters θ:

    \Pr(\Delta \mid M, D) = \int \Pr(\Delta \mid M, \theta, D)\,\Pr(\theta \mid M, D)\,d\theta \approx \Pr(\Delta \mid M, \hat{\theta}, D),

  where \hat{\theta} is the MLE.

• For the second term, note that

    \Pr(M_k \mid D) \propto \Pr(D \mid M_k)\,\Pr(M_k),

  where the log integrated likelihood is approximated by BIC:

    \mathrm{BIC}_k = \log \Pr(D \mid M_k) \approx \log \Pr(D \mid \hat{\theta}_k, M_k) - \frac{d_k}{2}\log(n)
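A small numerical sketch of how these approximations become averaging weights (my illustration; it follows the slide's convention that BIC_k approximates log Pr(D | M_k), and assumes equal prior model probabilities):

    import numpy as np

    def bma_weights(bic):
        # bic[k] approximates log Pr(D | M_k), so with equal priors Pr(M_k | D) ∝ exp(bic[k]).
        bic = np.asarray(bic, dtype=float)
        w = np.exp(bic - bic.max())   # subtract the max for numerical stability
        return w / w.sum()

    # Hypothetical BIC values for three candidate models:
    weights = bma_weights([-105.2, -103.8, -110.4])

    # The BMA prediction is the weighted average of the individual models' predictions:
    model_preds = np.array([3.1, 2.8, 3.5])
    print(weights, np.dot(weights, model_preds))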

Bayesian Model Averaging

• The approximations on the previous page let you calculate many posterior model probabilities quickly, and give you the weights to use for averaging.
• But how do you know which models to average over?
  – Example: regression with p parameters
  – Each subset of the p parameters is a ‘model’
  – 2^p possible models!
• Idea:


Model Averaging

• But how do you find the best models without fitting all of them?
• Solution: the Leaps and Bounds algorithm can find the best models without fitting every model
  – Goal: find the single best model for each model size

[Figure: branch-and-bound search tree, with the annotation "Don't need to traverse this part of the tree since there is no way it can beat AB"]

BMA - Example


[Figure: the best models found and their posterior model probabilities (PMP = Posterior Model Probability); scored on holdout data, BMA wins]

Ensemble Methods - Boosting


Boosting…

• Different approach to model ensembles – mostly for classification

• Observed: when model predictions are not highly correlated, combining does well

• Big idea: can we fit models specifically to the “difficult” parts of the data?


Boosting – Algorithm

[Algorithm figure from HTF, p. 339]
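The HTF algorithm figure is not reproduced here; as a stand-in, here is a minimal AdaBoost-style sketch (my own illustration, assuming binary labels coded -1/+1 and scikit-learn decision stumps as the weak learner):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, n_rounds=50):
        # y must be coded as -1 / +1.
        n = len(y)
        w = np.full(n, 1.0 / n)                      # start with uniform case weights
        stumps, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = w[pred != y].sum() / w.sum()       # weighted error of this round
            if err >= 0.5:                           # no better than random guessing: stop
                break
            alpha = 0.5 * np.log((1 - err) / err)    # weight given to this stump
            w *= np.exp(-alpha * y * pred)           # up-weight the misclassified cases
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(X, stumps, alphas):
        # Weighted vote of the stumps; the sign gives the predicted class.
        scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(scores)

In practice you would simply use sklearn.ensemble.AdaBoostClassifier, which implements the same idea.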

Example

[A sequence of example figures illustrating the boosting iterations, courtesy M. Littman]

Boosting - Advantages

• Fast algorithms - AdaBoost
• Flexible - can work with any classification algorithm
• The individual models don't have to be good
  – In fact, the method works best with bad models!
  – (bad = slightly better than random guessing)
  – Most common model: “boosted stumps” (one-split trees)


Boosting Example from HTF p. 302

Ensemble Methods – Bagging / Stacking


Bagging for Combining Classifiers

Bagging = Bootstrap aggregating

• Big idea:
  – To avoid overfitting to a specific dataset, fit the model to “bootstrapped” random sets of the data
• Bootstrap:
  – Random sample, with replacement, from the data set
  – Size of sample = size of data
  – X = (1,2,3,4,5,6,7,8,9,10)
  – B1 = (1,2,3,3,4,5,6,6,7,8)
  – B2 = (1,1,1,1,2,2,2,5,6,8)
  – …
• Bootstrap samples have the same statistical properties as the original data
• By creating similar datasets you can see how much stability there is in your data; if there is a lack of stability, averaging helps (see the sketch below)
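The bootstrap sampling step itself is tiny; here is a sketch (numpy, with a fixed seed so the draws are reproducible; the particular samples will differ from the B1, B2 shown above):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.arange(1, 11)                            # the toy data set (1, ..., 10)
    B1 = rng.choice(X, size=X.size, replace=True)   # bootstrap sample: same size, drawn with replacement
    B2 = rng.choice(X, size=X.size, replace=True)   # another bootstrap sample
    print(np.sort(B1), np.sort(B2))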

Bagging

• Training data set of size N
• Generate B “bootstrap” sampled data sets of size N
• Build B models (e.g., trees), one for each bootstrap sample
  – The intuition is that the bootstrapping “perturbs” the data enough to make the models more resistant to true variability
  – Note: only about 63% of the data points (1 - 1/e ≈ 0.632) appear in any given bootstrap sample
    • The rest can be used as an out-of-sample estimate!
• For prediction, combine the predictions from the B models (see the sketch below)
  – Voting or averaging, based on the “out-of-bag” sample
  – Plus: generally improves accuracy for models such as trees
  – Negative: you lose interpretability
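A compact sketch of the bagging loop just described (my own illustration using scikit-learn trees on a synthetic dataset; sklearn.ensemble.BaggingClassifier packages the same idea, including the out-of-bag score):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    N, B = len(y), 100

    models = []
    oob_votes, oob_counts = np.zeros((N, 2)), np.zeros(N)
    for _ in range(B):
        idx = rng.integers(0, N, size=N)           # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(N), idx)      # the ~37% of cases this sample never saw
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        models.append(tree)
        oob_votes[oob, tree.predict(X[oob])] += 1  # record out-of-bag votes for this tree
        oob_counts[oob] += 1

    # Out-of-bag estimate: each case is scored only by trees that never saw it.
    scored = oob_counts > 0
    oob_pred = oob_votes[scored].argmax(axis=1)
    print("out-of-bag accuracy:", (oob_pred == y[scored]).mean())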


HTF Bagging Example p 285

Ensemble Methods – Random Forests


Random Forests

• Trees are great, but…
  – As we've seen, they are “unstable”
  – Also, trees are sensitive to the primary split, which can lead the tree in inappropriate directions
  – One way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data (see the sketch below)
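A quick way to see this instability for yourself (a sketch using scikit-learn and its copy of the Wisconsin breast cancer data, which also appears later in these slides; tree_.feature holds the splitting variable at each node, with the root at position 0):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(1)

    # Fit a tree on several bootstrap samples and look at the primary (root) split.
    for b in range(5):
        idx = rng.integers(0, len(y), size=len(y))
        tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        print("bootstrap sample", b,
              "-> root splits on feature", tree.tree_.feature[0],
              "at threshold", round(float(tree.tree_.threshold[0]), 3))

Because many of the predictors are highly correlated, the chosen root variable and threshold can change from one bootstrap sample to the next.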

Example of Tree Instability

[Figure from G. Ridgeway, 2003]

Random Forests

• Solution: random forests - an ensemble of decision trees
  – Similar to bagging: inject randomness to overcome instability
  – Each tree is built on a random subset of the training data
    • A bootstrapped version of the data
  – At each split point, only a random subset of the predictors is considered
  – Use the “out-of-bag” hold-out sample to estimate the size of each tree
  – The prediction is simply the majority vote of the trees (or the mean prediction of the trees)
• Randomizing the variables used is the key
  – It reduces the correlation between the models!
• Random forests have the advantages of trees, with more robustness and a smoother decision rule (see the sketch below)
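A short sketch of the recipe above using scikit-learn's implementation (my choice of data and settings; max_features="sqrt" is the random subset of predictors tried at each split, and oob_score=True gives the out-of-bag estimate):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    rf = RandomForestClassifier(
        n_estimators=500,      # number of bootstrapped trees
        max_features="sqrt",   # random subset of predictors considered at each split
        oob_score=True,        # score each case only with trees that never saw it
        random_state=0,
    ).fit(X, y)

    print("out-of-bag accuracy:", rf.oob_score_)
    # Variables that are chosen again and again across the trees get high importance scores:
    print("largest feature importances:", sorted(rf.feature_importances_, reverse=True)[:5])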


HTF Example p 589

Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32


Random Forests – How Big A Tree

• Breiman’s original algorithm said: “to keep bias low, trees are to be grown to maximum depth”

• However, empirical evidence typically shows that “stumps” do best


Ensembles – Main Points

• Averaging models together has been shown to be effective for prediction
• Many weird names:
  – See the papers by Leo Breiman (e.g. “Bagging Predictors”, “Arcing the Edge”, and “Random Forests”) for more detail
• Key points
  – Models average well if they are uncorrelated
  – You can inject randomness to ensure uncorrelated models
  – Averaging small models works better than averaging large ones
• Ensembles can also give more insight into the variables than a simple tree
  – Variables that show up again and again must be good


Visualizing Forests

• Data: Wisconsin Breast Cancer
  – Courtesy S. Urbanek


References

• Random Forests, from Leo Breiman himself
• Breiman, Leo (2001). “Random Forests”. Machine Learning 45 (1), 5-32.
• Hastie, Tibshirani, Friedman (HTF), The Elements of Statistical Learning
  – Chapters 8, 10, 15, 16
