MODEL SELECTION STRATEGIES
Tony Panzarella
Lab Course, March 20, 2014
Preamble
• Although the focus will be on time-to-event data, the same principles apply to other outcome data
Developing a multivariable prediction model
• Select clinically relevant predictors for possible inclusion in the model
• Evaluate the quality of the data and how to handle missing data
• Data handling decisions
• Choosing a strategy for selecting the important variables in the final model
• Deciding how to model continuous variables
• Selecting measures of model performance or predictive accuracy
AUTOMATIC SELECTION ROUTINES
Forward Selection
• Variables are added to the model one at a time
• At each stage the variable added is the one which gives the largest decrease in the value of -2LogL on its inclusion
• The process ends when each of the remaining variables fails to reduce -2LogL by a pre-specified amount (typically couched as a significance level)
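A minimal sketch of forward selection in SAS PROC PHREG, assuming the myeloma data set and candidate covariates used in the best-subsets example later in these slides; the entry criterion of 0.10 is an illustrative choice, not a recommendation from the slides:

/* Forward selection: variables enter one at a time while they meet the entry criterion (SLENTRY) */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
         / selection=forward slentry=0.10;
run;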
Backward elimination
• Full model is fit first
• Variables are excluded one at a time
• At each stage the variable omitted is the one that increases -2LogL by the smallest amount on its exclusion
• The process ends when the next candidate for deletion increases the value of -2LogL by more than a pre-specified amount
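The corresponding backward-elimination call (again a sketch on the myeloma data; the stay criterion of 0.10 is illustrative):

/* Backward elimination: start from the full model and drop variables failing the stay criterion (SLSTAY) */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
         / selection=backward slstay=0.10;
run;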
Stepwise
• Operates similarly to forward selection
• However, a variable that is included can be considered for exclusion at a later stage
• Thus, after adding a variable, the procedure then checks whether any previously included variable can be deleted
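And a stepwise sketch, which combines an entry and a stay criterion (both values illustrative):

/* Stepwise selection: entered variables can later be removed if they no longer meet SLSTAY */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
         / selection=stepwise slentry=0.10 slstay=0.10;
run;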
/* Best subsets: screen candidate models using the score test and report the best 3 models of each size */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
         / selection=score best=3;
run;
Best Subsets
• Provides a computationally efficient way to screen all possible models
• The procedure requires a criterion to judge a model
• Given the criterion, the software screens all models containing q covariates and reports the covariates in the best, say n, models for q = 1, 2, 3, …, p, where p denotes the number of covariates
• SAS uses the score test
The PHREG Procedure
Regression Models Selected by Score Criterion

Number of    Score
Variables    Chi-Square    Variables Included in Model
    1          8.5164      LogBUN
    1          5.0664      HGB
    1          3.1816      Platelet
    2         12.7252      LogBUN HGB
    2         11.1842      LogBUN Platelet
    2          9.9962      LogBUN SCalc
    3         15.3053      LogBUN HGB SCalc
    3         13.9911      LogBUN HGB Age
    3         13.5788      LogBUN HGB Frac
    4         16.9873      LogBUN HGB Age SCalc
    4         16.0457      LogBUN HGB Frac SCalc
    4         15.7619      LogBUN HGB LogPBM SCalc
    5         17.6291      LogBUN HGB Age Frac SCalc
    5         17.3519      LogBUN HGB Age LogPBM SCalc
    5         17.1922      LogBUN HGB Age LogWBC SCalc
    6         17.9120      LogBUN HGB Age Frac LogPBM SCalc
    6         17.7947      LogBUN HGB Age LogWBC Frac SCalc
    6         17.7744      LogBUN HGB Platelet Age Frac SCalc
    7         18.1517      LogBUN HGB Platelet Age Frac LogPBM SCalc
    7         18.0568      LogBUN HGB Age LogWBC Frac LogPBM SCalc
    7         18.0223      LogBUN HGB Platelet Age LogWBC Frac SCalc
    8         18.3925      LogBUN HGB Platelet Age LogWBC Frac LogPBM SCalc
    8         18.1636      LogBUN HGB Platelet Age Frac LogPBM Protein SCalc
    8         18.1309      LogBUN HGB Platelet Age LogWBC Frac Protein SCalc
    9         18.4550      LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
Disadvantages of automatic routines
• They typically lead to one particular subset of variables, rather than a set of equally good ones
• The subsets found might be different for different selection routines
• They generally tend not to account for the hierarchic principle
• They are dependent on the stopping rule
• They do not foster critical thinking about the problem
Collett
• The model selection strategy depends to some extent on the purpose of the study
Collett
• Chow et al. (2002)
• Main goal: investigate what explanatory variables, in a palliative care setting, are associated with overall survival
Collett
• Fosker et al. (2013)
• The Importance of Poor Performance Status in Personalizing Palliative Radiotherapy Towards the End of Life
Collett
Step 0: Identify a set of explanatory variables that have the potential for being included in a model
• This approach assumes that all variables are considered to be on an equal footing, and there is no a priori reason to include any specific variables (like treatment)
Steps 1-4: Determine the combination of variables to be included
• In practice, there will not be a unique combination of variables; there are likely to be a number of equally good models
Collett
• If the number of potential explanatory variables (including interactions, non-linear terms, etc.) is not too large, it might be feasible to consider all combinations of terms
• Pay due regard to the hierarchic principle and use the statistic -2Log(Likelihood)
• Use AIC to compare possible models
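For reference, the usual definition of AIC for a model with q fitted parameters (a standard formula, not specific to these slides; smaller values indicate a better trade-off between fit and complexity):

$$\mathrm{AIC} = -2\log\hat{L} + 2q$$

where $\hat{L}$ is the maximized (partial) likelihood of the model.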
Collett
• When the number of variables is relatively large, fitting all of the possible models can be computationally expensive
• Automatic selection routines might then seem to be an attractive option:
   • Forward selection
   • Backward elimination
   • Stepwise
Collett
Step 1: Fit a univariate model for each covariate, and identify the predictors significant at some level p1, say 0.20.
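A sketch of this univariable screening step in SAS, assuming the myeloma data set used elsewhere in these slides; in practice the call is repeated (or wrapped in a macro) for every candidate covariate, and variables with p-values below p1 are carried forward:

/* One univariable Cox model per candidate covariate */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN;
run;

proc phreg data=myeloma;
   model Time*VStatus(0) = HGB;
run;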
Collett
Step 2: Fit a multivariate model with all significant univariate predictors, and use backward selection to eliminate non-significant variables at some level p2, say 0.10
Collett
Step 3: Starting with the final step (2) model, consider each of the non-significant variables from step (1) using forward selection, with significance level p3, say 0.10
Collett
Step 4: Do final pruning of the main-effects model (omit variables that are non-significant, add any that are significant), using stepwise regression with significance level p4
Collett
• At this stage, you may also consider adding interactions between any of the main effects currently in the model, under the hierarchical principle
Collett
• Collett recommends using a likelihood ratio test for all variable inclusion/exclusion decisions
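For reference, the likelihood ratio statistic for comparing a reduced (nested) model with a fuller model (standard definition, not specific to these slides):

$$\mathrm{LR} = -2\left(\log\hat{L}_{\text{reduced}} - \log\hat{L}_{\text{full}}\right)$$

which, under the null hypothesis that the extra terms are zero, is referred to a chi-squared distribution with degrees of freedom equal to the number of parameters dropped.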
Collett
• Statistical criteria alone should not guide the model selection strategy
• It may not be appropriate to include particular combinations of variables
• It might be unwise to omit some variables that are not statistically significant
Hosmer, Lemeshow and May
Purposeful selection
Step 1: Fit a multivariable model containing all variables significant in the univariable analysis at the 0.20 to 0.25 significance level, and any other variables not selected using this criterion but judged to be of clinical importance
Hosmer, Lemeshow and May
• Note: If there are many covariates that show a statistically significant association with survival, you can rank-order the covariates by p-value and use only the most highly significant variables. Include one covariate per ten events.
Hosmer, Lemeshow and May
Step 2: Use Wald test p-values of the individual coefficients to identify covariates that might be deleted
• Be cautious not to delete too many seemingly non-significant variables at one time
• Confirm the above by using the partial likelihood ratio test
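For reference, the Wald statistic for an individual coefficient (standard definition):

$$z = \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)}, \qquad z^2 \sim \chi^2_1 \ \text{under } H_0\colon \beta_j = 0$$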
Hosmer, Lemeshow and May
Step 3: Assess whether removal of the covariate has produced an "important" change in the coefficients of the variables remaining in the model. A value of 20% is used as an indicator of important change. If the variable excluded is an important confounder, reintroduce it into the model. This process continues until no variables can be deleted.
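One way to quantify the change referred to here (a sketch; the notation is mine rather than the book's): if $\hat{\beta}$ is a coefficient in the larger model and $\hat{\theta}$ the corresponding estimate after deleting a covariate, then

$$\Delta\hat{\beta}\% = 100 \times \frac{\hat{\theta} - \hat{\beta}}{\hat{\beta}}$$

and a value of $|\Delta\hat{\beta}\%| > 20\%$ flags the deleted covariate as a potentially important confounder.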
Hosmer, Lemeshow and May
Step 4: Add to the model, one at a time, all variables excluded from the initial multivariable model to confirm that they are neither statistically significant nor a confounder
• The result is referred to as the preliminary main effects model
Hosmer, Lemeshow and May
Step 5: Test linearity of the continuous covariates
• The result is referred to as the main effects model
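A minimal sketch of one simple linearity check in SAS, assuming the myeloma data used elsewhere in these slides: add a quadratic term for a continuous covariate such as Age and test whether it improves the fit (fractional polynomials or spline terms are common alternatives; Age2 is a derived variable created here for illustration):

data myeloma2;
   set myeloma;
   Age2 = Age*Age;   /* quadratic term for the linearity check */
run;

proc phreg data=myeloma2;
   model Time*VStatus(0) = LogBUN HGB Age Age2;   /* test the Age2 coefficient */
run;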
Hosmer, Lemeshow and May
Step 6: Are interactions needed? Use the 0.05 significance level. Use the Wald p-value and partial likelihood ratio test as described earlier.
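A sketch of testing an interaction in SAS, again assuming the myeloma data. The product term is created explicitly in a DATA step (recent releases of PROC PHREG may also accept an interaction written directly in the MODEL statement); LogBUN_HGB is a derived variable introduced here for illustration:

data myeloma3;
   set myeloma;
   LogBUN_HGB = LogBUN*HGB;   /* interaction (product) term */
run;

proc phreg data=myeloma3;
   model Time*VStatus(0) = LogBUN HGB LogBUN_HGB;   /* Wald test of the interaction coefficient */
run;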
Hosmer, Lemeshow and May
Step 7: Final model. Check model assumptions and goodness-of-fit.
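A minimal sketch of one such check in SAS: the ASSESS statement in PROC PHREG provides a supremum test of the proportional hazards assumption based on cumulative residuals (shown here on the myeloma data for illustration):

proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB;
   assess ph / resample;   /* supremum test of proportional hazards */
run;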
Machin, Cheung, Parmar
• Explanatory variables are categorized:
1. Fundamental to research design (D)
2. Those that influence outcome or are confounders (K)
3. Uncertain influence (Q)
Strategies
• Forced-entry
• Significance tests
• Change in estimates of hazard ratios
Forced-entry
• Include variables in the model according to research design or prior opinion. This could include a non-statistically significant variable, e.g. the treatment variable in an RCT (see the sketch below)
• Include variables known to be influential in their ability to confound the primary association of interest
• The resulting model (with statistically non-significant effects) could have reduced efficiency
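A sketch of forced entry combined with automatic selection in SAS: the INCLUDE= option keeps the first listed effects in every model while the remaining variables compete for entry. Trt is a hypothetical treatment indicator added for illustration; the other variables and data set follow the myeloma example:

proc phreg data=myeloma;
   /* include=1 forces the first listed variable (the hypothetical Trt) into every model */
   model Time*VStatus(0) = Trt LogBUN HGB Platelet Age SCalc
         / selection=stepwise include=1 slentry=0.10 slstay=0.10;
run;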
Significance testing
• Step-up or step-down procedures where selection is 'manual', not automated
Change in estimates
• If our purpose is to obtain a suitable estimate of the HR for a key variable, the significance-testing strategy may not be successful in selecting confounders
• Compare the crude estimate HR_crude with the adjusted estimate HR_adjusted for a clinically important difference. A 10% change is suggested.
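One way to operationalise this comparison (a sketch, using the 10% threshold suggested above):

$$\%\,\text{change} = 100 \times \frac{\mathrm{HR}_{\text{adjusted}} - \mathrm{HR}_{\text{crude}}}{\mathrm{HR}_{\text{crude}}}$$

with an absolute value greater than 10% suggesting that the adjustment variable is worth retaining as a confounder.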
Practical considerations
• Due to the effects of bias, if more than 20% of the data points are missing for a variable, exclude it from the modeling process. If missing data comprise < 5%, then the bias introduced will likely be small (see the sketch below for checking missingness)
• Check to see how any automatic selection routines handle missing data
• In practice, one can start with missing data excluded at the early stages of the selection process but bring them back into the process as it becomes more clear which variables are likely to be in the final model
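A minimal sketch for checking the extent of missingness per variable in SAS before applying the thresholds above; the data set name and variable list are placeholders:

proc means data=mydata n nmiss;   /* n and nmiss reported for each listed variable */
   var LogBUN HGB Platelet Age;
run;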
Practical considerations
• Significance level to use? Err on the side of caution: use 0.10 generally and 0.2 for the change-in-estimates method
Practical considerations
• Univariable analysis per se is not recommended
• Rationale for univariable screening:
   • If an explanatory variable is associated with an outcome variable, this association may be the result of confounding
   • However, if an explanatory variable is not associated with an outcome variable in a univariable analysis, there is no gain in further examining it in a multivariable analysis
• This argument is flawed; it overlooks the possibility of confounding which may suppress a genuine relation, so-called 'negative' confounding
Positive vs. Negative Confounding
• Positive confounding: an association is found between an exposure variable and outcome but in reality there is no association (the spurious association is caused by the confounder), OR the observed association is stronger than the true association because of the confounder
• Negative confounding: an association is not found between an exposure variable and outcome but in reality there is an association (the true association is suppressed by the confounder), OR the observed association is weaker than the true association because of a confounder
[Diagram: confounding example contrasting the true and apparent magnitudes of the association between higher education in women and breast cancer incidence, with nulliparity as the confounder.]
Steyerberg
• The problem of overfitting already starts with considering too many candidate predictors in a data set
• The problem is difficult to solve with the standard statistical techniques which are used by default in medical research
• The uncertainty of model selection is an important source of overfitting
Steyerberg
• Improvements can be sought by limiting the necessity for selection by using subject matter knowledge, especially in relatively smaller data sets (also advocated by Harrell)
• Use better algorithms to discover patterns in the data (e.g. the LASSO)
• The LASSO is a penalized estimation technique where the estimated regression coefficients are constrained such that the sum of their scaled absolute values falls below some constant k chosen by cross-validation
• This type of constraint forces some regression coefficients towards zero (which helps with the overfitting problem) and some to exactly zero (helping with variable selection)
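In symbols, the constraint described above can be written as follows (a sketch in generic notation; $L(\beta)$ denotes the partial likelihood and the $\beta_j$ are assumed to be suitably scaled):

$$\hat{\beta} = \arg\max_{\beta}\, \log L(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le k$$

with $k$ chosen by cross-validation as described above.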
Royston et al.
• No consensus exists on the best method for selecting variables
• Two main strategies:
   • Full model approach: all candidate variables are included. This model is claimed to avoid overfitting and selection bias and to provide correct standard errors and P values. However, the full model is not always easy to define.
   • Backward elimination approach: the choice of significance level has a major effect on the number of variables selected. Selection of predictors by significance testing is known to produce selection bias (regression coefficients are overestimated) and optimism as a result of overfitting. Overfitting leads to worse prediction in independent data.
Example 1 - Chow et al. (Collett approach)
Example 2 – Fosker et al. (Harrell approach)
• The Importance of Poor Performance Status in Personalising Palliative Radiotherapy Towards the End of Life
• The goal of our project is to define a clinically relevant ECOG PS based algorithm that would enable accurate prediction of patients with shorter life expectancies (< 3-4 months).
Multivariate Analysis
Cox Proportional Hazards model results
NOTE: ECOG=0 as reference category for variable ECOG

Parameter        P-value    Hazard Ratio    95% CI Lower    95% CI Upper
Age 67-74        0.0009     1.168           1.066           1.28
Age 75+          <.0001     1.265           1.15            1.391
Brain mets Yes   <.0001     1.354           1.252           1.465
ECOG 1           <.0001     1.575           1.317           1.885
ECOG 2           <.0001     2.258           1.881           2.712
ECOG 3           <.0001     3.59            2.989           4.312
ECOG 4           <.0001     5.925           4.743           7.401
Gender Male      <.0001     1.337           1.24            1.44
Primary Lung     <.0001     1.249           1.158           1.348
Conclusions
• One size doesn't fit all: it is hard to conclude that there is a "best" approach
• "Cutting to the chase" is not an appropriate description of multivariable model building
• "A good model is one chosen by using a careful, well thought out covariate selection process that gives thoughtful consideration to issues of adjustment and interactions and thoroughly evaluates the model for assumptions, influential observations, and tests for goodness-of-fit" (Hosmer and Lemeshow 2008)
References
• Collett D. Modelling Survival Data in Medical Research. Chapman and Hall, 1991.
• Hosmer DW, Lemeshow S, May S. Applied Survival Analysis: Regression Modeling of Time-to-Event Data, 2nd edition. Wiley, 2008.
• Machin D, Cheung YB, Parmar MKB. Survival Analysis: A Practical Approach. Wiley, 2006.
• Steyerberg EW. Clinical Prediction Models. Springer, 2009.
• Royston P, Moons KGM, Altman DG, Vergouwe Y. Prognosis and prognostic research: Developing a prognostic model. BMJ, June 2009, Volume 338, pp. 1373-1377.
• Harrell FE. Regression Modeling Strategies. New York: Springer, 2001.