TRANSCRIPT
Lecture 15: Model Selection and Validation, Cross Validation
Hao Helen Zhang
October 15, 2020
Outline
Read: ESL 7.4, 7.10
Training set, Validation set, Test Set
Hold-out method
K-fold Cross Validation
Model-based Selection Criteria
Model Assessment
Given data $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i$ is the input and $y_i$ is the response, we fit the model $\hat f$.
Model assessment is extremely important, as it can
measure the quality of a fitted model.
compare different methods and make an optimal choice.
select the best model within a model class $\mathcal{F} = \{f_\alpha\}$, where $\alpha$ is the model index.
choose optimal tuning parameters in regularization methods.
Model Selection and Assessment
Two Goals:
Model Assessment: after choosing a final model, estimate its prediction error (generalization performance) on new data.
Model Selection: estimate the performance of different models and choose the best one.
Different models:
Is logistic regression $f_1$ better than LDA $f_2$?
From the same model class:
Is 3-NN better than 5-NN?
What is the best $k$ for the model class $\mathcal{F} = \{f_k, 1 \le k \le n\}$?
Bias-Variance Trade-off
As complexity increases, the model tends to
adapt to more complicated underlying structures (decrease in bias)
have increased estimation error (increase in variance)
In between, there is an optimal model complexity, giving the minimum test error.
Role of the tuning parameter $\alpha$ ($k$ in kNN, $\lambda$ in the LASSO):
The tuning parameter $\alpha$ controls the model complexity.
Find the best $\alpha$, the one producing the minimum test error.
Model Performance Evaluation
How do we evaluate the performance of $\hat f$?
its performance on real-world data (“generalization” performance)
we want to test $\hat f$ on data that it has never seen before
this set of data is called the “test” set
the test data is independent of the training data
Why not training data?
the risk of overfitting
Loss and Errors
If $Y$ is continuous, two commonly used loss functions are
squared error: $L(Y, f(X)) = (Y - f(X))^2$
absolute error: $L(Y, f(X)) = |Y - f(X)|$
If $Y$ takes values in $\{1, \dots, K\}$, common loss functions are
0–1 loss: $L(Y, f(X)) = I\big(Y \neq f(X)\big)$
log-likelihood: $L(Y, p(X)) = -2\sum_{k=1}^{K} I(Y = k)\log p_k(X) = -2\log p_Y(X)$
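To make these concrete, here is a minimal numpy sketch of the losses above (the function names are mine; `p` is an $n \times K$ matrix of class probabilities):

```python
import numpy as np

def squared_error(y, fx):
    # L(Y, f(X)) = (Y - f(X))^2
    return (y - fx) ** 2

def absolute_error(y, fx):
    # L(Y, f(X)) = |Y - f(X)|
    return np.abs(y - fx)

def zero_one_loss(y, fx):
    # L(Y, f(X)) = I(Y != f(X))
    return (y != fx).astype(float)

def deviance_loss(y, p):
    # L(Y, p(X)) = -2 log p_Y(X); y holds labels in {0, ..., K-1},
    # p[i, k] is the estimated probability of class k for case i
    return -2.0 * np.log(p[np.arange(len(y)), y])

# toy check of the deviance: picks out -2 log p_{y_i}(x_i) per case
y = np.array([0, 1, 1])
p = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(deviance_loss(y, p))
```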
Training Error and Test Error
Training error: the average loss over the training samples,
$$\mathrm{TrainErr} = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat f(x_i)\big)$$
Test error: given a test data set $(x_i', y_i')$, $i = 1, \dots, n'$,
$$\mathrm{TestErr} = \frac{1}{n'}\sum_{i=1}^{n'} L\big(y_i', \hat f(x_i')\big).$$
TestErr is an estimate of PE; the larger $n'$, the better the estimate.
Prediction error: the expected error over all the inputs,
$$\mathrm{PE} = \mathbb{E}_{(X, Y', \mathrm{data})}\left[\frac{1}{n}\sum_{i=1}^{n} L\big(Y_i', \hat f(X_i)\big)\right],$$
where $Y_i' = f(X_i) + \varepsilon_i'$, $i = 1, \dots, n$, are independent of $y_1, \dots, y_n$. The expectation is taken w.r.t. all random quantities.
Examples
Regression:
$$\mathrm{TrainErr} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat f(x_i)\big)^2$$
Classification:
log-likelihood loss: TrainErr is based on the sample log-likelihood,
$$\mathrm{TrainErr} = -\frac{2}{n}\sum_{i=1}^{n} \log p_{y_i}(x_i),$$
also called the cross-entropy loss or deviance.
0–1 loss: the error is the misclassification rate,
$$\mathrm{TrainErr} = \frac{1}{n}\sum_{i=1}^{n} I\big(y_i \neq \hat f(x_i)\big), \qquad \mathrm{TestErr} = \frac{1}{n'}\sum_{i=1}^{n'} I\big(y_i' \neq \hat f(x_i')\big);$$
TestErr is also called the expected misclassification rate.
What is Wrong with Training Error
TrainErr is not a good estimate of TestErr
TrainErr is often quite different from TestErr.
TrainErr can dramatically underestimate TestErr.
TrainErr is not an objective measure of the generalization performance:
we built $\hat f$ based on the training data.
TrainErr decreases with model complexity.
Training error drops to zero if the model is complex enough.
A model with zero training error overfits the training data; over-fitted models typically generalize poorly.
Common Techniques
How do we estimate TestErr accurately? Consider
data-rich situation
data-insufficient situation
Common Techniques (Efron 2004, JASA)
Hold Out Method (data-rich situation)
Resampling methods:
cross validation
bootstrap
Analytical estimation criteria
Data-Rich Situation: Hold Out Method
Key Idea: Estimate the test error by
hold out a subset of the training data from the fitting process
apply the fitted model to those held-out observations and calculate the prediction errors
Hold Out Approach: without tuning
Randomly divide the data into two parts: a training set and a test or hold-out set.
The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the held-out set.
The resulting error provides an estimate of the test error (e.g., MSE in regression, misclassification rate in classification).
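A minimal sketch of the hold-out estimate on synthetic regression data (the data, the 70/30 split, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic regression data
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# random split: 70% train / 30% hold-out
idx = rng.permutation(n)
train, hold = idx[:140], idx[140:]

# fit least squares on the training part only
coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# MSE on the held-out part estimates the test error
mse_hold = np.mean((y[hold] - X[hold] @ coef) ** 2)
print(f"hold-out MSE: {mse_hold:.3f}")
```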
Hold Out Approach: with tuning
Randomly divide the dataset into three parts:
training set: to fit the models
tuning (validation) set: to estimate the prediction error for model selection
test set: to assess the generalization error of the final model
Typical proportions are 50%, 25%, and 25%, respectively.
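In code, the three-way split is just a partition of shuffled indices; a sketch using the proportions above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
idx = rng.permutation(n)

# 50% / 25% / 25%
n_train, n_val = n // 2, n // 4
train = idx[:n_train]                  # fit each candidate model here
val   = idx[n_train:n_train + n_val]   # pick the tuning parameter here
test  = idx[n_train + n_val:]          # use ONCE, to report the final error
```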
Drawbacks of Hold Out Approach
The TestErr estimate can be highly variable, depending on which observations are included in the training set and which are held out.
Only a subset of the observations (those not held out) are used to fit the model.
The estimated error therefore tends to overestimate the test error for the model fit on the entire data set.
Resampling Methods
These methods refit a model to samples formed from the training set:
Cross-validation
bootstrap techniques
They provide estimates of test-set prediction error, and of the standard deviation and bias of our parameter estimates.
K-fold Cross Validation
The simplest and most popular way to estimate the test error:
1. Randomly split the data into $K$ roughly equal parts.
2. For each $k = 1, \dots, K$: leave the $k$th portion out and fit the model using the other $K - 1$ parts; denote the solution by $\hat f^{-k}(x)$. Calculate the prediction error of $\hat f^{-k}(x)$ on the $k$th (left-out) portion.
3. Average the errors.
Define the index function (allocating fold memberships)
$$\kappa: \{1, \dots, n\} \to \{1, \dots, K\}.$$
Then the $K$-fold cross-validation estimate of the prediction error is
$$\mathrm{CV} = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat f^{-\kappa(i)}(x_i)\big)$$
Typical choice: $K = 5$ or $10$.
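A from-scratch sketch of the procedure, with the fold-membership function $\kappa$ made explicit (the helper names `fit`, `predict`, and `loss` are mine):

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, loss, K=5, seed=0):
    """K-fold CV estimate: CV = (1/n) sum_i L(y_i, f^{-kappa(i)}(x_i))."""
    n = len(y)
    rng = np.random.default_rng(seed)
    kappa = rng.permutation(np.arange(n) % K)   # random fold memberships
    err = np.empty(n)
    for k in range(K):
        held = kappa == k                       # the k-th (left-out) portion
        model = fit(X[~held], y[~held])         # fit on the other K-1 parts
        err[held] = loss(y[held], predict(model, X[held]))
    return err.mean()

# example: least squares with squared-error loss
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda coef, X: X @ coef
loss = lambda y, yhat: (y - yhat) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
print(kfold_cv_error(X, y, fit, predict, loss, K=10))
```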
Choose α
Given tuning parameter $\alpha$, the model fit is $\hat f^{-k}(x, \alpha)$. The $\mathrm{CV}(\alpha)$ curve is
$$\mathrm{CV}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat f^{-\kappa(i)}(x_i, \alpha)\big)$$
$\mathrm{CV}(\alpha)$ provides an estimate of the test error curve.
We find the best parameter $\hat\alpha$ by minimizing $\mathrm{CV}(\alpha)$.
The final model is $\hat f(x, \hat\alpha)$.
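For illustration, the $\mathrm{CV}(\alpha)$ curve for ridge regression, using scikit-learn's `cross_val_score` (the data and the candidate grid are made up):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=80)

alphas = np.logspace(-3, 2, 20)
# CV(alpha): 5-fold CV error at each candidate tuning value
cv_curve = [
    -cross_val_score(Ridge(alpha=a), X, y, cv=5,
                     scoring="neg_mean_squared_error").mean()
    for a in alphas
]
best = alphas[int(np.argmin(cv_curve))]    # alpha-hat minimizes CV(alpha)
final_model = Ridge(alpha=best).fit(X, y)  # final model f(x, alpha-hat)
```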
Application Examples
For regression problems, apply CV to
shrinkage methods: select λ
best subset selection: select the best model
nonparametric methods: select the smoothing parameter
For classification methods, apply CV to
kNN: select k
SVM: select λ
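For instance, selecting $k$ in kNN by 10-fold CV (a scikit-learn sketch on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

ks = range(1, 21)
# mean 10-fold CV accuracy for each candidate k
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          X, y, cv=10).mean() for k in ks]
best_k = list(ks)[int(np.argmax(scores))]
print(f"best k: {best_k}")
```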
Leave-One-Out (LOO) Cross Validation
If $K = n$, we have leave-one-out cross validation, with
$$\kappa(i) = i.$$
LOO-CV is approximately unbiased for the true prediction error.
LOO-CV can have high variance because the $n$ “training sets” are so similar to one another.
Computation:
The computational burden is generally considerable, requiring $n$ model fits.
For special models like linear smoothers, computation can be done quickly; for example, generalized cross validation (GCV).
Statistical Properties of Cross Validation Error Estimates
Cross-validation error is very nearly unbiased for test error.
The slight bias arises because the training set in cross-validation is slightly smaller than the actual data set (e.g., for LOOCV the training set size is $n - 1$ when there are $n$ observed cases).
In nearly all situations, the effect of this bias is conservative, i.e., the estimated fit is slightly biased in the direction suggesting a poorer fit. In practice, this bias is rarely a concern.
The variance of the CV error can be large.
In practice, to reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
Performance of CV
Example: Two-class classification problem
Linear model with best subsets regression of subset size p
In Figure 7.9 ($p$ is the tuning parameter):
both curves pick $p = 10$ as the best (the true model);
the CV curve is flat beyond 10;
the standard error bars are the standard errors of the individual misclassification error rates.
Standard Errors for K-fold CV
Let $\mathrm{CV}_k(f_\alpha)$ denote the validation error of $f_\alpha$ on the $k$th fold, or simply $\mathrm{CV}_k(\alpha)$. Here $\alpha \in \{\alpha_1, \dots, \alpha_m\}$, a set of candidate values for $\alpha$. Then we have:
The cross-validation error,
$$\mathrm{CV}(\alpha) = \frac{1}{K}\sum_{k=1}^{K} \mathrm{CV}_k(\alpha).$$
The standard deviation of $\mathrm{CV}(\alpha)$, calculated as
$$\mathrm{SE}(\alpha) = \sqrt{\widehat{\mathrm{Var}}\big(\mathrm{CV}(\alpha)\big)}.$$
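A small sketch of both quantities from the fold-wise errors; estimating $\mathrm{Var}(\mathrm{CV}(\alpha))$ by the sample variance of the $K$ fold errors divided by $K$ is one common convention (it treats the folds as roughly independent):

```python
import numpy as np

def cv_mean_and_se(fold_errors):
    """Given CV_k(alpha) for k = 1..K, return CV(alpha) and SE(alpha)."""
    fold_errors = np.asarray(fold_errors, dtype=float)
    K = len(fold_errors)
    cv = fold_errors.mean()                    # CV(alpha)
    se = fold_errors.std(ddof=1) / np.sqrt(K)  # sqrt(Var-hat(CV(alpha)))
    return cv, se

# e.g. five fold-wise misclassification rates for one alpha
print(cv_mean_and_se([0.20, 0.25, 0.15, 0.30, 0.22]))
```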
One-standard Error for K-fold CV
Let
$$\hat\alpha = \arg\min_{\alpha \in \{\alpha_1, \dots, \alpha_m\}} \mathrm{CV}(\alpha).$$
First we find $\hat\alpha$; then we move $\alpha$ in the direction of increasing regularization as much as we can, as long as
$$\mathrm{CV}(\alpha) \le \mathrm{CV}(\hat\alpha) + \mathrm{SE}(\hat\alpha).$$
That is, choose the most parsimonious model with error no more than one standard error above the best error.
Idea: “all else equal (up to one standard error), go for the simplest model.”
In this example, the one-SE rule would choose $p = 9$.
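A sketch of the rule; the `simpler_is_larger` flag (my framing) records which direction of $\alpha$ counts as more regularization:

```python
import numpy as np

def one_se_rule(alphas, cv, se, simpler_is_larger=True):
    """Most regularized alpha whose CV error is within one SE of the best."""
    alphas, cv, se = map(np.asarray, (alphas, cv, se))
    i_min = int(np.argmin(cv))
    ok = cv <= cv[i_min] + se[i_min]   # candidates within one SE of the minimum
    candidates = alphas[ok]
    return candidates.max() if simpler_is_larger else candidates.min()

# toy curve: the minimum is at alpha = 1, but alpha = 10 is within one SE
alphas = np.array([0.1, 1.0, 10.0, 100.0])
cv     = np.array([0.30, 0.20, 0.24, 0.40])
se     = np.array([0.05, 0.05, 0.05, 0.05])
print(one_se_rule(alphas, cv, se))  # -> 10.0
```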
[Figure 7.9 (ESL): prediction error and cross-validation curves versus subset size $p$; the $y$-axis is the misclassification error.]
The Wrong and Right Way
Consider a simple classifier for microarrays:
1. Starting with 5,000 genes, find the 200 genes having the largest correlation with the class label.
2. Carry out nearest-centroid classification using the top 200 genes.
How do we select the tuning parameter in the classifier?
Way 1: apply cross-validation to step 2
Way 2: apply cross-validation to steps 1 and 2
Which is right? Way 2: cross-validating the whole procedure.
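The difference is easy to demonstrate with scikit-learn: screening on all the data before CV leaks the held-out labels into step 1, while putting the screening inside a `Pipeline` redoes it on every training fold (synthetic data stands in for the microarray example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline

# high-dimensional toy data standing in for the microarray setting
X, y = make_classification(n_samples=100, n_features=5000,
                           n_informative=10, random_state=0)

# WRONG: screen genes on ALL the data, then cross-validate only step 2;
# the screening has already peeked at the held-out labels
screened = SelectKBest(f_classif, k=200).fit_transform(X, y)
wrong = cross_val_score(NearestCentroid(), screened, y, cv=5).mean()

# RIGHT: put the screening inside the pipeline, so it is redone on each
# training fold and the held-out fold stays truly unseen
pipe = Pipeline([("screen", SelectKBest(f_classif, k=200)),
                 ("clf", NearestCentroid())])
right = cross_val_score(pipe, X, y, cv=5).mean()

print(f"wrong way accuracy: {wrong:.2f} (optimistic)")
print(f"right way accuracy: {right:.2f}")
```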
Generalized Cross Validation
If a linear fitting method $\hat y = Sy$ satisfies
$$y_i - \hat f^{-i}(x_i) = \frac{y_i - \hat f(x_i)}{1 - S_{ii}},$$
then we do NOT need to train the model $n$ times. Using the approximation $S_{ii} \approx \mathrm{trace}(S)/n$, we have the approximate score
$$\mathrm{GCV} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/n}\right]^2$$
computational advantage
$\mathrm{GCV}(\alpha)$ is always smoother than $\mathrm{CV}(\alpha)$
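A sketch for one concrete linear smoother, ridge regression, where $S = X(X^\top X + \alpha I)^{-1}X^\top$ (the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 8
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(scale=0.5, size=n)

def gcv_ridge(X, y, alpha):
    """GCV score for ridge, a linear smoother y-hat = S y."""
    n = len(y)
    S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T)
    resid = y - S @ y
    df = np.trace(S)   # effective degrees of freedom, trace(S)
    return np.mean((resid / (1 - df / n)) ** 2)

alphas = np.logspace(-3, 3, 30)
scores = [gcv_ridge(X, y, a) for a in alphas]
print(f"GCV-chosen alpha: {alphas[int(np.argmin(scores))]:.4f}")
```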
Model-Based Methods: Information Criteria
Covariance penalties
Cp
AIC
BIC
Stein’s UBR
They can be used for both model selection and test error estimation.
Generalized Information Criteria (GIC)
Information criteria:
$$\mathrm{GIC}(\text{model}) = -2 \cdot \text{loglik} + \alpha \cdot \mathrm{df},$$
where the degrees of freedom (df) of $\hat f$ count the number of effective parameters of the model.
Examples: in linear regression settings, we use
$$\mathrm{AIC} = n\log(\mathrm{RSS}/n) + 2 \cdot \mathrm{df}, \qquad \mathrm{BIC} = n\log(\mathrm{RSS}/n) + \log(n) \cdot \mathrm{df},$$
where $\mathrm{RSS} = \sum_{i=1}^{n} \big[y_i - \hat f(x_i)\big]^2$ (the residual sum of squares).
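A sketch of these criteria for least squares on nested synthetic models, taking df as the number of fitted coefficients:

```python
import numpy as np

def aic_bic(X, y):
    """AIC/BIC in the slide's form n*log(RSS/n) + penalty * df."""
    n, df = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    aic = n * np.log(rss / n) + 2 * df
    bic = n * np.log(rss / n) + np.log(n) * df
    return aic, bic

rng = np.random.default_rng(0)
n = 100
X_full = rng.normal(size=(n, 10))
y = X_full[:, 0] - X_full[:, 1] + rng.normal(size=n)

# compare nested models with the first d columns; BIC penalizes extra
# parameters more heavily than AIC (log n > 2 once n >= 8)
for d in (1, 2, 5, 10):
    aic, bic = aic_bic(X_full[:, :d], y)
    print(f"d={d:2d}  AIC={aic:8.2f}  BIC={bic:8.2f}")
```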
Optimism of Training Error Rate
training error:
$$\mathrm{err} = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat f(x_i)\big)$$
test error:
$$\mathrm{Err} = \mathbb{E}_{(X, Y, \mathrm{data})}\big[L(Y, \hat f(X))\big].$$
in-sample error rate (an estimate of the prediction error):
$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{y^{\mathrm{new}}}\mathbb{E}_{y}\, L\big(y_i^{\mathrm{new}}, \hat f(x_i)\big)$$
Optimism:
$$\mathrm{op} = \mathrm{Err}_{\mathrm{in}} - \mathbb{E}_y\, \mathrm{err}$$
Optimism
$\mathrm{op}$ is typically positive. (Why?)
For squared error, 0–1, and other loss functions,
$$\mathrm{op} = \frac{2}{n}\sum_{i=1}^{n} \mathrm{Cov}(\hat y_i, y_i).$$
The amount by which $\mathrm{err}$ underestimates the true error depends on how strongly $y_i$ is correlated with its prediction $\hat y_i$.
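For intuition, a standard special case (cf. ESL Section 7.4): for least squares with $d$ coefficients, $\hat y = Hy$ with $H = X(X^\top X)^{-1}X^\top$ and $\mathrm{Var}(\varepsilon) = \sigma^2$,
$$\sum_{i=1}^{n} \mathrm{Cov}(\hat y_i, y_i) = \mathrm{tr}\big(\mathrm{Cov}(Hy, y)\big) = \sigma^2\, \mathrm{tr}(H) = d\,\sigma^2 \quad\Longrightarrow\quad \mathrm{op} = \frac{2\,d\,\sigma^2}{n},$$
which is exactly the complexity penalty behind $C_p$.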