
  Fall 2011

    Copyright Robert A Stine Revised 11/16/11

    Statistics 622 Module 6

    Cross Validation

    Contents
      Overview
      Introduction to Cross Validation
      Example Data small_loans.jmp
      Two Models
      Cross Validation
      Calibration Analysis and Shrinkage
      Adjusted R-Squared
      Advice for Practice
      Multi-fold Cross Validation
      Comments

    The goal of regression modeling is often prediction in one way or another. Before we rely on a model, we have to think hard about how closely those predictions are going to approximate the actual values. If you look back at Module 5, you'll see that there's a risk in using statistics like RMSE for making this judgment. The risk is that these statistics presume we've fit the right model, as if we have found the model that nature used to produce the data. In most applications, we never know whether we have the "true model", as if there were such a thing. After all, a model is just a means by which we can come to understand, and to simplify, the complexities of the real world. Once we admit we don't know the true model, how are we supposed to judge our own model? How well will the model that we're using perform when we try to use it to predict other data? Will it be as accurate as it claims, or will it disappoint? If we knew these answers, we might have chosen a different model for the task. Cross validation is an approach to evaluating a model. It can be employed in a computationally intensive method that picks a model, finding the model that predicts best out-of-sample. Like more familiar statistics such as R², it too is a bit on the optimistic side. That said, cross-validation is perhaps the best that we can do when we need to anticipate how well a model is going to perform in use. We just have to be realistic about how we use it.


    Overview

    Cross validation

    Goal is to find a realistic (honest) estimate for how well a model predicts new data. Statistical models predict the data used to pick and fit the model (train the model) better than they predict new data that was not used in the fitting process. You cannot expect a model to predict new data better than it fits the observed data. Cross validation measures statistical sampling error (estimated by RMSE and standard errors) as well as model specification error that is missed by the usual diagnostics.

    Shrinkage

    Calibration in a test sample reveals things not seen in the usual calibration plot: prediction often works better if we pull back the predictions toward a common value (zero or the mean). The effect is evident in the CV version of a calibration plot of new y values versus predictions from a model. The slope b1 in this calibration plot is less than 1. The overall F statistic of the model suggests how much less.

    Discussion

    Cross validation holds back some data from the modeling process in order to test the model. It implements the common-sense notion of testing a model on new data before applying it in practice. It comes in many flavors, ranging from leaving out each case one at a time (not such a good idea) to methods that repeatedly leave out large subsets.


    Example

    Start by fitting two regression models, one deliberately much more complex than the other. Compare the models using classical measures, such as residual plots, RMSE, and statistical significance. Split the observed data in half, then refit the models. Compare the accuracy of each model when predicting the data used in the estimation to the accuracy when predicting new data. Calibration plots are handy in this context.

    Prediction accuracy is key in models

    Having an honest estimate of how well a model can be expected to predict new data is key in both using and selling a model. Statistical models often claim to be able to predict better as they become more complex. This tendency then produces models that are not as precise as claimed because the models assume the MRM and don't allow for specification error. Cross-validation allows us to see this effect.


    Introduction to Cross Validation

    Idea

    Classical internal estimates of model performance (e.g., RMSE and R²) presume that the equation of the fitted model is correctly specified (has the right variables). These omit an important source of prediction error. Cross validation provides an external estimate of model performance by testing the model in a more realistic setting that does not presume so much. To assess a statistical model, hold back data from the modeling process; reserve this data to test the model. Common in consulting: the client withholds some data to test the claims of the consultant.

    Procedure

    Fit a regression on a randomly chosen subset of data, and then test it (validate it) using the rest of the data.
    Estimation/training data: choose/fit the model.
    Validation/test data: predict using the model.
    Does the model built from the estimation data fit the cases in the validation sample as well as it claims to? Example: Is the average squared error when predicting the test sample as small as promised by the model?
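    A minimal sketch of this procedure in Python with scikit-learn (not JMP): the DataFrame loans, the list predictors, and the response name "PRSM" are placeholders for whatever data you are modeling.

        # Split-sample cross validation: fit on a training subset, then check
        # how well that fit predicts the held-out validation cases.
        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split

        def split_sample_cv(loans, predictors, response="PRSM", train_size=200, seed=0):
            train, test = train_test_split(loans, train_size=train_size, random_state=seed)
            model = LinearRegression().fit(train[predictors], train[response])

            # Claimed accuracy: RMSE on the estimation/training data.
            train_resid = train[response] - model.predict(train[predictors])
            claimed_rmse = float(np.sqrt(np.mean(train_resid ** 2)))

            # Actual accuracy: spread of prediction errors on the validation/test data.
            test_errors = test[response] - model.predict(test[predictors])
            actual_sd = float(np.std(test_errors, ddof=1))
            return claimed_rmse, actual_sd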

    Flexible

    The idea extends beyond regression to any method that generates predictions.


    Various forms of cross validation

    Split sample trains on half of the data and then tests on the other half: fit a model using half of the data, then use it to predict the other half; then reverse the roles of the halves. k-fold cross validation divides the data into k subsets (at random). Train (or estimate) the fit on k-1 subsets and then test on the omitted subset. Repeat, leaving out a different subset each time. (k is usually 5 or 10.) Reversed cross validation reverses the allocation of the data, using more to test the fit than to fit the model. Three-way cross-validation splits the data into 3 groups: training, testing, and tuning. The tuning set is used to pick the model that is to be used.
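    The splitting schemes above can be sketched with scikit-learn's splitters; data is a placeholder DataFrame and the fractions are only illustrative.

        # Sketch of split-sample, k-fold, and three-way splits.
        from sklearn.model_selection import KFold, train_test_split

        def make_splits(data, seed=1):
            # Split sample: two halves whose roles (fit/test) can later be reversed.
            half_a, half_b = train_test_split(data, test_size=0.5, random_state=seed)

            # k-fold: each of k = 10 folds is held out once while the rest train the model.
            folds = list(KFold(n_splits=10, shuffle=True, random_state=seed).split(data))

            # Three-way split: training, testing, and a tuning set used to pick the model.
            train, rest = train_test_split(data, test_size=0.4, random_state=seed)
            tune, test = train_test_split(rest, test_size=0.5, random_state=seed)
            return half_a, half_b, folds, train, tune, test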

    Random repetition All of these can be repeated, each time starting with a new random split of the data into disjoint subsets.

    Optimistic

    Cross-validation is not a panacea; it too omits sources of prediction error. (a) The samples used to test a model in CV are random samples from the same population. That may not be true when the model is used. (b) Unless you take extra steps, CV won't adjust for how the data was used to pick the model. This can be fixed but requires further effort.


    Example Data small_loans.jmp

    Sample of 990 loans to small businesses.

    Relatively tall and thin data table + lots of substantive guidance make it easy to choose explanatory variables.

    Response variable Evaluate the risk of loans by building a model that predicts whether loans will be repaid on time. Loan amounts are right skewed with average $43,000. The response (PRSM) equals 1 if the loan is repaid on schedule. Values less than 1 indicate that the loan is behind schedule. Average PRSM is 0.87 with SD 0.07. Much more symmetric.

    Explanatory characteristics Variety of characteristics of the merchant, such as FICO credit score, credit history, cash flow and location. Characteristics of the neighborhood where the merchant operates, including household income, property values, population. Properties of the broker that originates the loan, including commission and fees.

    Derived characteristics Some explanatory variables are not original data columns, but were derived (logs, ratios) from substantive insights.

    [Histograms: PRSM (counts, 0.6 to 1.1) and loan amount (counts, $0 to $350,000)]


    Two Models

    Split sample

    Use 200 observations to train the models. Use the remaining 790 to test the models. This is a reversed cross-validation, using the bulk of the data to judge accuracy rather than to get a better fit. Exclude the data in the test group.1

    Initial, relatively parsimonious model2

    The overall fit is significant, though some individual estimates are not. Use the VIF to see whether Log(income) is collinear.

    Summary of Fit
        RSquare                     0.430
        RSquare Adj                 0.413
        Root Mean Square Error      0.056
        Mean of Response            0.871
        Observations (or Sum Wgts)  200

    Analysis of Variance
        Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
        Model        6       0.450          0.0749     24.3134   <.0001*
        Error      193       0.595          0.0031
        C. Total   199       1.044

    Parameter Estimates (excerpt)
        Term        Estimate   Std Error   t Ratio   Prob>|t|
        Intercept    0.23141     0.1400      1.65     0.1001
        FICO         0.00027     0.0001      4.72    <.0001*
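    One way to carry out the VIF check mentioned above is with statsmodels; this is only a sketch, and the column names shown in the usage comment are hypothetical stand-ins for the predictors in small_loans.jmp.

        # Variance inflation factors for the predictors of a fitted regression.
        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        def vif_table(X):
            Xc = sm.add_constant(X)          # add the intercept column
            return pd.Series(
                [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
                index=Xc.columns,
            )

        # e.g., vif_table(loans[["FICO", "Log Income", "Residential %"]])
        # Large values (say, above 5 or 10) flag collinear predictors.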


    Elaborate model

    Adds explanatory variables, such as State and interactions. The output shows only an excerpt of the estimates. Why does the effect test for State show only 39 degrees of freedom?

    Summary of Fit
        RSquare                     0.690
        RSquare Adj                 0.556
        Root Mean Square Error      0.048
        Observations (or Sum Wgts)  200

    Analysis of Variance
        Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
        Model       60       0.720         0.012006     5.1516   <.0001*
        Error      139       0.324         0.002331
        C. Total   199       1.044

    Effect Tests (excerpt)
        Source          DF   Sum of Squares   F Ratio   Prob > F
        ISO.Name         7       0.0243        1.4911    0.1752
        FICO*ISO.Name    7       0.0182        1.1129    0.3584
        State           39       0.1688        1.8576    0.0048*

    Parameter Estimates (excerpt)
        Term            Estimate   Std Error   t Ratio   Prob>|t|
        Intercept        0.39728     0.1411      2.81     0.0056*
        FICO             0.00021     0.0001      2.45     0.0155*
        Residential %    0.00311     0.0005      5.83    <.0001*


    Diagnostics

    Check residual plots, leverage plots, and calibration plots. Do these show any anomalies or anticipate problems?

    Comparison of models

    The difference between R² statistics seems more impressive than the difference between RMSEs, though reducing the RMSE by 10% isn't bad. Both models have some coefficient estimates that are not statistically significant, and both omit some interesting, potentially useful interactions.

                   R²     RMSE
        Initial    43%    0.056
        Enhanced   69%    0.048

    Preference

    Which model is better? These models are not nested (we cannot obtain the simpler model by removing predictors from the larger model), which complicates the comparison of the fits: the larger, enhanced model drops some of the predictors in the smaller regression and adds others. Without nesting, we cannot rely on a partial F test to compare the two models.


    Cross Validation

    Analysis steps

    Save the prediction formulas from each model. Compute the prediction errors in the test data.3 Define a formula that subtracts the predicted values from the actual values. Compare the distributions.

    Moments in Validation

                    Initial   Elaborate
        Mean        -0.0012    -0.0076
        Std Dev      0.0579     0.0634
        N            790        790
        Variance     0.0033     0.0040
        N Missing    0          42       (why does the elaborate model have 42 missing predictions?)

    The choice between the models reverses in cross-validation: the simpler model fits better in the test sample than the enhanced model!

                    From model in training sample       Test sample
                    F       R²      RMSE                SD(pred errors)
        Basic       15.6    0.407   0.051               0.058
        Enhanced     3.6    0.690   0.046               0.063
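    A sketch of these analysis steps in Python (all names are placeholders): compute the test-sample prediction errors for each fitted model and compare their spread, which is what the tables above summarize.

        # Prediction errors in the validation data: actual response minus prediction.
        import numpy as np

        def prediction_error_sd(model, test, predictors, response="PRSM"):
            errors = test[response] - model.predict(test[predictors])
            return float(np.std(errors, ddof=1))

        # Compare the two fitted models on the same held-out rows, e.g.:
        #   prediction_error_sd(initial_model, test, initial_predictors)
        #   prediction_error_sd(elaborate_model, test, elaborate_predictors)
        # The model with the smaller SD of prediction errors predicts new cases better.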

    3 Rows > Clear row states, then build a column that subtracts the predicted values from the actual PRSM that forms the response.


    Calibration Analysis and Shrinkage

    Both models are calibrated in-sample.

    In-sample calibration finds situations in which the average of the response doesn't match the predicted values. Most often, we can fix this by (a) transforming a variable, (b) adding explanatory variables to the model, or (c) calibrating the model using a spline/polynomial. Validation in the test sample detects a different issue that is invisible if checked in-sample.

    In-sample and out-of-sample calibration plots

    Build scatterplots with separate calibration regressions for the two subsets and two models (red is the test fit).

    Summary of Fit, Test sample

                                  Initial   Enhanced
        RSquare                   0.4077    0.3706
        Root Mean Square Error    0.0578    0.0596
        Observations              790       748

    Calibration regression in the test sample, Initial model
        Term                Estimate   Std Error   t Ratio   Prob>|t|
        Intercept           -0.048       0.039      -1.22     0.2222
        Initial Pred PRSM    1.051       0.045      23.29    <.0001*
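    The out-of-sample calibration regression itself is easy to sketch with statsmodels, assuming an array y_test of actual responses in the test sample and an array y_pred of predictions saved from the training fit (placeholder names).

        # Out-of-sample calibration: regress actual test responses on saved predictions.
        import statsmodels.api as sm

        def calibration_fit(y_test, y_pred):
            model = sm.OLS(y_test, sm.add_constant(y_pred)).fit()
            b0, b1 = model.params          # in-sample these would be 0 and 1
            return b0, b1, model

        # A slope b1 noticeably below 1 signals that the model's predictions should
        # be shrunk toward the mean before being used on new data.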


    Estimated intercept and slope

    For an in-sample calibration plot, the regression of y on the fitted values ŷ has intercept b0 = 0 and slope b1 = 1. That's not the case in the fit of the test data on the predictions from the model built with the training data.

    Shrinkage and F

    When predicting new data, we can improve the predictions of any regression model (with at least 2 explanatory variables) by shrinking the predictions in the sense of pulling them back toward the average prediction. How much shrinkage? The overall F-statistic of the fitted model estimates the amount. In this example, we have:

                   Training         Test
                   F      1/F       Shrinkage (1-b1)   SE      Significant benefit?
        Basic      15.6   0.06      -0.051             0.05    No
        Enhanced    3.6   0.28       0.32              0.03    Yes

    Hence if the F-stat is large, there's little value in shrinkage. As F gets smaller, it becomes more useful to shrink the estimates.

    Interpreting the overall F-stat

    Complex models claim to fit better than they actually do. Once the F is large, you're okay and shrinkage won't make much difference. If the F is small, you get better predictions by shrinking.
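    A sketch of shrinking predictions toward their mean, using the 1/F rule of thumb described in these notes (an approximation, not an exact formula):

        # Pull each prediction back toward the average prediction by roughly 1/F.
        import numpy as np

        def shrink_predictions(y_pred, overall_f):
            y_pred = np.asarray(y_pred, dtype=float)
            factor = max(0.0, 1.0 - 1.0 / overall_f)   # e.g., F = 3.6 gives factor ~ 0.72
            return y_pred.mean() + factor * (y_pred - y_pred.mean())

        # Enhanced model (F about 3.6): predictions move about 28% of the way to the mean.
        # Basic model (F about 15.6): the adjustment is negligible.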


    Adjusted R-Squared

    Definition

    Adjusted R² behaves more like the overall F-ratio in that it takes account of the sample size n and the number of explanatory variables in the model, k:

        1 - R^2_{adj} = \frac{\text{Residual SS}/(n-(k+1))}{\text{Total SS}/(n-1)}, \qquad
        \text{Residual SS} = \sum_i e_i^2, \quad \text{Total SS} = \sum_i (y_i - \bar{y})^2,

    whereas the usual R² statistic is simply

        1 - R^2 = \frac{\text{Residual SS}}{\text{Total SS}}.

    What does it do?

    It's trying (though not hard enough) to estimate the R² in the test sample. The adjustment works well unless the model becomes highly complex relative to the sample size, as is the case for the enhanced model in this example. The enhanced model fits k = 60 coefficients with n = 200 observations. Compare the R² and adjusted R² statistics in the original and test samples. Adjusted R² anticipates the decay in the fit out-of-sample, but underestimates the magnitude.

                   Training            Test
                   R²      Adj R²      Calibration R²
        Basic      0.43    0.41        0.41
        Enhanced   0.69    0.56        0.37
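    As an arithmetic check of the adjusted R² column (rearranging the definition above, with n = 200 and k = 6 or k = 60):

        R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-(k+1)}

        \text{Basic: } 1 - (1 - 0.43)\tfrac{199}{193} \approx 0.41, \qquad
        \text{Enhanced: } 1 - (1 - 0.69)\tfrac{199}{139} \approx 0.56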


    Advice for Practice

    What to do in the real world?

    In this example, the basic model is more honest, but the enhanced model claims to be more predictive. That suggests using the enhanced model; when testing, however, the simpler model fit better. But that test fit the models to only n=200 cases. How accurate can we expect the enhanced model to be if we were to use all n=990 cases for fitting, rather than only 200? We know that we don't get good estimates from this model with n=200, but if we could use all of the data, we might yet get a better fit.

    Two approaches (others come later)

    (a) Rely on test statistics like adjusted R² and the overall F ratio to anticipate how much worse the model will fit when applied to new data. When fit to all 990 cases, the enhanced model would seem to fit reasonably well, with little loss of accuracy: adjusted R² is only slightly less than R², and the F ratio is large, F ≈ 17.

    (b) Use multi-fold cross validation (described in the next section), which tests the model on held-out subsets while still fitting it to nearly the full sample.

    Summary of Fit
        RSquare                     0.5642
        RSquare Adj                 0.5310
        Root Mean Square Error      0.0510
        Observations (or Sum Wgts)  990

    Analysis of Variance
        Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
        Model       70       3.0972        0.0442      16.9985   <.0001*
        Error      919       2.3921        0.0026
        C. Total   989       5.4892


    Multi-fold Cross Validation

    Addresses two issues:

    (a) The results of a cross-validation experiment depend on which cases happen to be included or excluded. That is, there's an element of randomness in the CV results. (b) We need a large test sample to get a good estimate of precision. But a large test sample means that the estimation sample is smaller than what you plan to use. (e.g., What's the point of validating a model fit to n=500 when you plan to use 1,000?)

    Approach (standard)

    Partition the data into disjoint folds, say N=10 or 20 folds, each holding 10% or 5% of the cases. Train (i.e., fit) the model using N-1 folds. The size of the training sample is then close to the full sample size you plan to use. Test the model by predicting the cases in the left-out fold. Average the prediction errors over the folds.
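    A minimal sketch of this standard multi-fold procedure in Python with scikit-learn; loans, predictors, and "PRSM" are placeholders as before.

        # N-fold cross validation: fit on N-1 folds, predict the held-out fold,
        # and pool the prediction errors across folds.
        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import KFold

        def multifold_cv_rmse(loans, predictors, response="PRSM", n_folds=10, seed=0):
            kfold = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
            errors = []
            for train_idx, test_idx in kfold.split(loans):
                train, test = loans.iloc[train_idx], loans.iloc[test_idx]
                model = LinearRegression().fit(train[predictors], train[response])
                errors.append(test[response].to_numpy() - model.predict(test[predictors]))
            errors = np.concatenate(errors)
            return float(np.sqrt(np.mean(errors ** 2)))   # pooled out-of-sample RMSE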

    Randomized version Repeatedly train the model using a randomly selected 90% of the data and test using the remaining 10%.

    Scripting The JMP script cv_script.jsl automates the randomized process. This script repeatedly fits the regression model named by the table variable Model in the data table (a dialog that describes a regression). It constructs a new data table that saves the RMSE for the regression when fit in the training sample and the SD of the prediction errors for the data in the test sample.
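    For readers without JMP, here is a rough Python analogue of what the script is described as doing (this is not cv_script.jsl itself, and all names are placeholders): repeatedly refit the regression on a random 90% of the cases and record the training RMSE and the SD of the test prediction errors.

        # Randomized cross validation: one row of results per random 90/10 split.
        import numpy as np
        import pandas as pd
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split

        def randomized_cv(loans, predictors, response="PRSM", reps=50, test_frac=0.10):
            rows = []
            for rep in range(reps):
                train, test = train_test_split(loans, test_size=test_frac, random_state=rep)
                model = LinearRegression().fit(train[predictors], train[response])
                train_resid = train[response] - model.predict(train[predictors])
                test_err = test[response] - model.predict(test[predictors])
                rows.append({"train_rmse": float(np.sqrt(np.mean(train_resid ** 2))),
                             "test_sd": float(np.std(test_err, ddof=1))})
            return pd.DataFrame(rows)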


    Model comparison

    What's the effect of fitting the elaborate model if we get to use the full sample? Which is better? Repeat the random splitting process 50 times for each model, each time leaving out 10% of the cases. The model is thus fit to about 900 cases rather than 200 as in the first example. With the larger estimation sample (90% of the full 990 cases) the more complex model predicts better!

        Average RMSE       Training   Testing
        Simple model       0.057      0.057
        Elaborate model    0.053      0.053

    Details for the elaborate model

    The following histograms show the in-sample and out-of-sample RMSEs created by this script. Notice how much more variable the testing results are. Test RMSEs are more variable since they rely on smaller samples of only 10% of the cases.


    Out-of-sample precision

    Can we trust the results of the elaborate model when fit to a large sample with about 900 cases? For this, we can compare how much the claimed (training) RMSE underestimates the actual (testing) RMSE:

        Difference = (Testing RMSE) - (Training RMSE) = actual accuracy - claimed accuracy

    Because the distribution of the differences is centered at zero, the model does not claim (on average) to predict better than it does, though there is considerable variation in the results.

        Mean             0.00027
        Std Dev          0.00395
        Std Err Mean     0.00056
        Upper 95% Mean   0.00139
        Lower 95% Mean  -0.00086


    Comments

    (a) Beware relying on one test sample.4

    We've seen that there's a lot of variation in the test results. If you use only one test sample, you can easily be fooled. Look back at the CV results for the elaborate model: the model has an average RMSE of 0.053 in training samples. You should not be surprised to get a test sample with a much higher or lower RMSE. One test sample is not enough.

    (b) There's no such thing as a true model! The choice of the best model depends on its complexity relative to the amount of data we have to use. With n=200, the simpler model evidently predicts better (we could validate this as well). With n=900, the more complex, elaborate model predicts better.

    4 A colleague and I once gamed a modeling competition by taking advantage of the fact that the organizer used one test sample to judge the various competitors.

    (Annotations from the RMSE histograms above: the typical training RMSE is 0.053. Test samples with RMSE > 0.053 make it look like the model is worse than claimed; test samples with RMSE < 0.053 make it look like the model is better than claimed.)