
Bagging: motivation

- Decision trees suffer from high variance. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.

- Averaging a set of observations reduces variance. Hence a natural way to reduce the variance, and thereby increase the prediction accuracy, of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.

- We bootstrap to generate different training datasets.


Bagging: procedure

- Bootstrap the training set to obtain the bth bootstrapped training set, b = 1, ..., B.

- Train a tree on the bth bootstrapped training set, without pruning, so that each individual tree has high variance but low bias.

- Let \hat{f}^{*b}(x) be the prediction for the test case x. The bagging prediction is then

  \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)   for regression,

  \hat{f}_{\mathrm{bag}}(x) = \arg\max_{k} \sum_{b=1}^{B} I\{\hat{f}^{*b}(x) = k\}   for classification.
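As a concrete reference point, here is a minimal sketch of the regression case, assuming scikit-learn's DecisionTreeRegressor as the base learner; the helper names bag_trees/bag_predict and the default B = 100 are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, B=100, random_state=0):
    # Fit B unpruned regression trees, each on a bootstrap sample of (X, y).
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        # Unpruned tree: low bias, high variance, as the procedure requires.
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bag_predict(trees, X_test):
    # f_bag(x) = (1/B) * sum_b f*b(x): average the B individual predictions.
    preds = np.stack([t.predict(X_test) for t in trees])
    return preds.mean(axis=0)
```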


Test error estimation for bagging trees

- Cross-validation or a validation set
- Out-of-Bag (OOB) error estimation

- One can show that, on average, each bagged tree makes use of around two-thirds of the observations: any given observation is left out of a bootstrap sample with probability (1 - 1/n)^n ≈ e^{-1} ≈ 0.37. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations.

- We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation. In order to obtain a single prediction for the ith observation, we can average these predicted responses (if regression is the goal) or take a majority vote (if classification is the goal). This leads to a single OOB prediction for the ith observation.


- An OOB prediction can be obtained in this way for each of the n observations, from which the overall OOB MSE (for a regression problem) or classification error (for a classification problem) can be computed. The resulting OOB error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation.

- It can be shown that, with B sufficiently large, the OOB error is virtually equivalent to the leave-one-out cross-validation error.
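Below is a minimal sketch of OOB error estimation in the same setting (bagged regression trees fit with scikit-learn); the function name bag_oob_mse and the default B are illustrative, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_oob_mse(X, y, B=200, random_state=0):
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    oob_sum = np.zeros(n)    # running sum of OOB predictions per observation
    oob_count = np.zeros(n)  # number of trees for which observation i was OOB
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        oob = np.ones(n, dtype=bool)
        oob[idx] = False                 # roughly one-third of observations are left out
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        if oob.any():
            oob_sum[oob] += tree.predict(X[oob])
            oob_count[oob] += 1
    used = oob_count > 0
    oob_pred = oob_sum[used] / oob_count[used]  # average the ~B/3 OOB predictions
    return np.mean((y[used] - oob_pred) ** 2)   # OOB estimate of the test MSE
```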


[Figure omitted: test error and OOB error versus the number of trees, for bagging and random forests on the Heart data.]

Figure: Bagging and random forest results for the Heart data. The test error (black and orange) is shown as a function of B, the number of bootstrapped training sets used. Random forests were applied with m = √p. The dashed line indicates the test error resulting from a single classification tree. The green and blue traces show the OOB error, which in this case is considerably lower.


Variable importance measures

Bagging typically results in improved accuracy over prediction using a single tree, at the expense of interpretability. One can obtain an overall summary of the importance of each predictor using the RSS (for bagging regression trees) or the Gini index (for bagging classification trees).

- In the case of bagging regression trees, we can record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor.

- In the context of bagging classification trees, we can add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all B trees (a sketch follows below).
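As an illustration, here is a hedged sketch of averaging per-tree importances across a bagged ensemble, reusing the bag_trees() helper sketched earlier; scikit-learn's feature_importances_ attribute (normalized impurity decrease per tree) stands in for the raw RSS or Gini decrease described above.

```python
import numpy as np

def bagged_importance(trees):
    # Average each tree's impurity-decrease importances over the ensemble,
    # then express them relative to the most important predictor,
    # as in the variable importance plot below.
    imp = np.mean([t.feature_importances_ for t in trees], axis=0)
    return 100 * imp / imp.max()
```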


[Figure omitted: variable importance plot; predictors ordered from most to least important: Thal, Ca, ChestPain, Oldpeak, MaxHR, RestBP, Age, Chol, Slope, Sex, ExAng, RestECG, Fbs; horizontal axis: Variable Importance (0-100).]

Figure: A variable importance plot for the Heart data. Variable importance is computed using the mean decrease in Gini index, and expressed relative to the maximum.


Random forests

- The bagged trees built on the bootstrapped samples often look quite similar to each other. They are therefore often highly correlated.

- Averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities. In particular, this means that bagging will not lead to a substantial reduction in variance over a single tree in this setting.

- To decorrelate the trees built on the bootstrapped samples, random forests build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors (see the sketch below).
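A minimal sketch of this difference in scikit-learn terms, where max_features controls the number of split candidates m; the n_estimators value is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

# max_features=None considers all p predictors at each split, i.e. plain bagging;
# max_features="sqrt" samples m = sqrt(p) candidate predictors per split (a random forest).
bagging = RandomForestClassifier(n_estimators=300, max_features=None)
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt")
```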


[Figure omitted: test classification error versus the number of trees, for m = p, m = p/2, and m = √p.]

Figure: Results from random forests for the 15-class gene expression data set with p = 500 predictors. The test error is displayed as a function of the number of trees. Each colored line corresponds to a different value of m, the number of predictors available for splitting at each interior tree node. Random forests (m < p) lead to a slight improvement over bagging (m = p). A single classification tree has an error rate of 45.7%.


Choice of parameters m and B

- m can be chosen by cross-validation.
- For m, it is recommended that:
  - For classification, the default value for m is √p and the minimum node size is one.
  - For regression, the default value for m is p/3 and the minimum node size is five.

- As with bagging, random forests will not overfit if we increase B, so in practice we use a value of B sufficiently large for the error rate to have settled down (see the sketch below).
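One hedged way to check that the error has settled down is to grow the forest incrementally with scikit-learn's warm_start option and monitor the OOB error; the B values and the assumption that a feature matrix X and labels y already exist are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_features="sqrt", oob_score=True,
                            warm_start=True, random_state=0)
for B in (50, 100, 200, 400):
    rf.set_params(n_estimators=B)   # add trees up to B without refitting existing ones
    rf.fit(X, y)
    print(B, 1 - rf.oob_score_)     # OOB classification error after B trees
```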


Boosting: AdaBoost (discrete boosting) for classification

- Like bagging, boosting involves combining a large number of decision trees.

- Different from bagging, boosted trees are grown sequentially: each tree is grown using information from previously grown trees.

- Boosting does not involve bootstrap sampling; instead, each tree is fit on a weighted version of the original dataset: the samples that are misclassified by the previous tree receive larger weights.

- The individual trees do not contribute to the final prediction equally: larger weights are given to more accurate classifiers in the sequence (a sketch follows below).
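Here is a minimal sketch of discrete AdaBoost with stumps, assuming labels coded as y ∈ {-1, +1}; the function names and the weight formula α = ½ log((1 − err)/err) follow the standard Freund-Schapire form and are illustrative rather than taken from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=100):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)              # start from uniform observation weights
    stumps, alphas = [], []
    for _ in range(B):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)  # more accurate classifier => larger weight
        w *= np.exp(-alpha * y * pred)         # misclassified samples get upweighted
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X_test):
    score = sum(a * s.predict(X_test) for s, a in zip(stumps, alphas))
    return np.sign(score)                      # weighted vote of the B stumps
```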


Figure: Schematic of AdaBoost. Classifiers are trained on weighted versions of the dataset, and then combined to produce a final prediction.


Boosting for regression trees
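The algorithm itself appears only as a figure on the original slide. Below is a minimal sketch of the standard boosting procedure for regression trees (repeatedly fit a small tree to the current residuals and add a shrunken copy of it to the fit), with the learning rate λ and tree size d named as in the later slides; max_depth is used as a stand-in for the number of splits d.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, B=1000, lam=0.01, d=1):
    f = np.zeros(len(y))          # current fitted values, start at 0
    r = y.astype(float).copy()    # residuals
    trees = []
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=d).fit(X, r)  # small tree on the residuals
        update = tree.predict(X)
        f += lam * update         # add a shrunken version of the new tree
        r -= lam * update         # update the residuals
        trees.append(tree)
    return trees

def boost_predict(trees, X_test, lam=0.01):
    return lam * sum(t.predict(X_test) for t in trees)
```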


[Figure omitted: test classification error versus the number of trees, for boosting with depth 1, boosting with depth 2, and a random forest with m = √p.]

Figure: Results from performing boosting and random forests on the 15-class gene expression data set in order to predict cancer versus normal. The test error is displayed as a function of the number of trees. For the two boosted models, λ = 0.01. Depth-1 trees slightly outperform depth-2 trees, and both outperform the random forest, although the standard errors are around 0.02, making none of these differences significant. The test error rate for a single tree is 24%.


Choice of tuning parameters

- The number of trees B. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.

- The common practice is to restrict all individual trees to have the same size.

- Let d be the number of splits in each tree. When d = 1, each tree is a stump, consisting of a single split. In this case, the boosted stump ensemble is fitting an additive model, since each term involves only a single variable. More generally, d is the interaction depth.

- For regression, d = 1 often works well. For classification, experience indicates that d ∈ [3, 7] works well in the context of boosting, with results being fairly insensitive to particular choices in this range. One can fine-tune the value of d by trying several different values and choosing the one that produces the lowest risk on a validation sample. However, this seldom provides significant improvement over using d ≈ 5 (see the sketch below).
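One hedged way to carry out this tuning, mapping the slide's parameters onto scikit-learn's GradientBoostingClassifier (n_estimators for B, learning_rate for λ, max_depth as a proxy for d); the grid values and the assumption that a feature matrix X and labels y exist are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validate over the number of trees B and the tree size d.
param_grid = {"n_estimators": [500, 1000, 2000], "max_depth": [1, 3, 5, 7]}
gbm = GradientBoostingClassifier(learning_rate=0.01)
search = GridSearchCV(gbm, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```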