MODEL SELECTION STRATEGIES
Tony Panzarella
Lab Course, March 20, 2014
Preamble
• Although the focus will be on time-to-event data, the same principles apply to other outcome data
Developing a multivariable prediction model
• Select clinically relevant predictors for possible inclusion in the model
• Evaluate the quality of the data and how to handle missing data
• Data handling decisions
• Choosing a strategy for selecting the important variables in the final model
• Deciding how to model continuous variables
• Selecting measures of model performance or predictive accuracy
AUTOMATIC SELECTION ROUTINES
Forward Selection
• Variables are added to the model one at a time
• At each stage the variable added is the one which gives the largest decrease in the value of -2LogL on its inclusion
• The process ends when each of the remaining variables fails to reduce -2LogL by a pre-specified amount (typically couched as a significance level)
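A minimal sketch of forward selection in SAS PROC PHREG, assuming the myeloma data set and candidate covariates used in the best-subsets example later in these slides; the entry criterion of 0.10 is an illustrative choice, not a recommendation from the slides:

/* Forward selection: variables enter one at a time while they meet the entry criterion (SLENTRY) */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
         / selection=forward slentry=0.10;
run;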
Backward elimination
• Full model is fit first
• Variables are excluded one at a time
• At each stage the variable omitted is the one that increases -2LogL by the smallest amount on its exclusion
• The process ends when the next candidate for deletion increases the value of -2LogL by more than a pre-specified amount
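The corresponding backward-elimination call (again a sketch on the myeloma data; the stay criterion of 0.10 is illustrative):

/* Backward elimination: start from the full model and drop variables failing the stay criterion (SLSTAY) */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
         / selection=backward slstay=0.10;
run;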
Stepwise
• Operates similarly to forward selection
• However, a variable that is included can be considered for exclusion at a later stage
• Thus, after adding a variable, the procedure then checks whether any previously included variable can be deleted
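And a stepwise sketch, which combines an entry and a stay criterion (both values illustrative):

/* Stepwise selection: entered variables can later be removed if they no longer meet SLSTAY */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
         / selection=stepwise slentry=0.10 slstay=0.10;
run;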
/* Best subsets: screen candidate models using the score test and report the best 3 models of each size */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
         / selection=score best=3;
run;
Best Subsets
• Provides a computationally efficient way to screen all possible models
• The procedure requires a criterion to judge a model
• Given the criterion, the software screens all models containing q covariates and reports the covariates in the best, say n, models for q = 1, 2, 3, …, p, where p denotes the number of covariates
• SAS uses the score test
The PHREG Procedure
Regression Models Selected by Score Criterion

Number of    Score
Variables    Chi-Square    Variables Included in Model
    1          8.5164      LogBUN
    1          5.0664      HGB
    1          3.1816      Platelet
    2         12.7252      LogBUN HGB
    2         11.1842      LogBUN Platelet
    2          9.9962      LogBUN SCalc
    3         15.3053      LogBUN HGB SCalc
    3         13.9911      LogBUN HGB Age
    3         13.5788      LogBUN HGB Frac
    4         16.9873      LogBUN HGB Age SCalc
    4         16.0457      LogBUN HGB Frac SCalc
    4         15.7619      LogBUN HGB LogPBM SCalc
    5         17.6291      LogBUN HGB Age Frac SCalc
    5         17.3519      LogBUN HGB Age LogPBM SCalc
    5         17.1922      LogBUN HGB Age LogWBC SCalc
    6         17.9120      LogBUN HGB Age Frac LogPBM SCalc
    6         17.7947      LogBUN HGB Age LogWBC Frac SCalc
    6         17.7744      LogBUN HGB Platelet Age Frac SCalc
    7         18.1517      LogBUN HGB Platelet Age Frac LogPBM SCalc
    7         18.0568      LogBUN HGB Age LogWBC Frac LogPBM SCalc
    7         18.0223      LogBUN HGB Platelet Age LogWBC Frac SCalc
    8         18.3925      LogBUN HGB Platelet Age LogWBC Frac LogPBM SCalc
    8         18.1636      LogBUN HGB Platelet Age Frac LogPBM Protein SCalc
    8         18.1309      LogBUN HGB Platelet Age LogWBC Frac Protein SCalc
    9         18.4550      LogBUN HGB Platelet Age LogWBC Frac LogPBM Protein SCalc
Disadvantages of automatic routines
• They typically lead to one particular subset of variables, rather than a set of equally good ones
• The subsets found might be different for different selection routines
• They generally tend not to account for the hierarchic principle
• They are dependent on the stopping rule
• They do not foster critical thinking about the problem
Collett
• The model selection strategy depends to some extent on the purpose of the study
Collett
• Chow et al. (2002)
• Main goal: investigate what explanatory variables, in a palliative care setting, are associated with overall survival
Collett
• Fosker et al. (2013)
• The Importance of Poor Performance Status in Personalizing Palliative Radiotherapy Towards the End of Life
Collett
Step 0: Identify a set of explanatory variables that have the potential for being included in a model
• This approach assumes that all variables are considered to be on an equal footing, and there is no a priori reason to include any specific variables (like treatment)
Steps 1-4: Determine the combination of variables to be included
• In practice, there will not be a unique combination of variables; there are likely to be a number of equally good models
Collett
• If the number of potential explanatory variables (including interactions, non-linear terms, etc.) is not too large, it might be feasible to consider all combinations of terms
• Pay due regard to the hierarchic principle and use the statistic -2Log(Likelihood)
• Use AIC to compare possible models
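For reference, the usual definition of AIC for a model with q fitted parameters (a standard formula, not specific to these slides; smaller values indicate a better trade-off between fit and complexity):

$$\mathrm{AIC} = -2\log\hat{L} + 2q$$

where $\hat{L}$ is the maximized (partial) likelihood of the model.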
Collett
• When the number of variables is relatively large, fitting all of the possible models can be computationally expensive
• Automatic selection routines might then seem to be an attractive option:
   • Forward selection
   • Backward elimination
   • Stepwise
Collett
Step 1: Fit a univariate model for each covariate, and identify the predictors significant at some level p1, say 0.20.
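A sketch of this univariable screening step in SAS, assuming the myeloma data set used elsewhere in these slides; in practice the call is repeated (or wrapped in a macro) for every candidate covariate, and variables with p-values below p1 are carried forward:

/* One univariable Cox model per candidate covariate */
proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN;
run;

proc phreg data=myeloma;
   model Time*VStatus(0) = HGB;
run;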
Collett
Step 2: Fit a multivariate model with all significant univariate predictors, and use backward selection to eliminate non-significant variables at some level p2, say 0.10
Collett
Step 3: Starting with the final step (2) model, consider each of the non-significant variables from step (1) using forward selection, with significance level p3, say 0.10
Collett
Step 4: Do final pruning of the main-effects model (omit variables that are non-significant, add any that are significant), using stepwise regression with significance level p4
Collett
• At this stage, you may also consider adding interactions between any of the main effects currently in the model, under the hierarchical principle
Collett
• Collett recommends using a likelihood ratio test for all variable inclusion/exclusion decisions
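For reference, the likelihood ratio statistic for comparing a reduced (nested) model with a fuller model (standard definition, not specific to these slides):

$$\mathrm{LR} = -2\left(\log\hat{L}_{\text{reduced}} - \log\hat{L}_{\text{full}}\right)$$

which, under the null hypothesis that the extra terms are zero, is referred to a chi-squared distribution with degrees of freedom equal to the number of parameters dropped.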
Collett
• Statistical criteria alone should not guide the model selection strategy
• It may not be appropriate to include particular combinations of variables
• It might be unwise to omit some variables that are not statistically significant
Hosmer, Lemeshow and May
Purposeful selection
Step 1: Fit a multivariable model containing all variables significant in the univariable analysis at the 0.20 to 0.25 significance level, and any other variables not selected using this criterion but judged to be of clinical importance
Hosmer, Lemeshow and May
• Note: If there are many covariates that show a statistically significant association with survival, you can rank-order the covariates by p-value and use only the most highly significant variables. Include one covariate per ten events.
Hosmer, Lemeshow and May
Step 2: Use Wald test p-values of the individual coefficients to identify covariates that might be deleted
• Be cautious not to delete too many seemingly non-significant variables at one time
• Confirm the above by using the partial likelihood ratio test
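For reference, the Wald statistic for an individual coefficient (standard definition):

$$z = \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)}, \qquad z^2 \sim \chi^2_1 \ \text{under } H_0\colon \beta_j = 0$$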
Hosmer, Lemeshow and May
Step 3: Assess whether removal of the covariate has produced an "important" change in the coefficients of the variables remaining in the model. A value of 20% is used as an indicator of important change. If the variable excluded is an important confounder, reintroduce it into the model. This process continues until no variables can be deleted.
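One way to quantify the change referred to here (a sketch; the notation is mine rather than the book's): if $\hat{\beta}$ is a coefficient in the larger model and $\hat{\theta}$ the corresponding estimate after deleting a covariate, then

$$\Delta\hat{\beta}\% = 100 \times \frac{\hat{\theta} - \hat{\beta}}{\hat{\beta}}$$

and a value of $|\Delta\hat{\beta}\%| > 20\%$ flags the deleted covariate as a potentially important confounder.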
Hosmer, Lemeshow and May
Step 4: Add to the model, one at a time, all variables excluded from the initial multivariable model to confirm that they are neither statistically significant nor a confounder
• The result is referred to as the preliminary main effects model
Hosmer, Lemeshow and May
Step 5: Test linearity of the continuous covariates
• The result is referred to as the main effects model
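A minimal sketch of one simple linearity check in SAS, assuming the myeloma data used elsewhere in these slides: add a quadratic term for a continuous covariate such as Age and test whether it improves the fit (fractional polynomials or spline terms are common alternatives; Age2 is a derived variable created here for illustration):

data myeloma2;
   set myeloma;
   Age2 = Age*Age;   /* quadratic term for the linearity check */
run;

proc phreg data=myeloma2;
   model Time*VStatus(0) = LogBUN HGB Age Age2;   /* test the Age2 coefficient */
run;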
Hosmer, Lemeshow and May
Step 6: Are interactions needed? Use the 0.05 significance level. Use the Wald p-value and partial likelihood ratio test as described earlier.
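A sketch of testing an interaction in SAS, again assuming the myeloma data. The product term is created explicitly in a DATA step (recent releases of PROC PHREG may also accept an interaction written directly in the MODEL statement); LogBUN_HGB is a derived variable introduced here for illustration:

data myeloma3;
   set myeloma;
   LogBUN_HGB = LogBUN*HGB;   /* interaction (product) term */
run;

proc phreg data=myeloma3;
   model Time*VStatus(0) = LogBUN HGB LogBUN_HGB;   /* Wald test of the interaction coefficient */
run;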
Hosmer, Lemeshow and May
Step 7: Final model. Check model assumptions and goodness-of-fit.
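A minimal sketch of one such check in SAS: the ASSESS statement in PROC PHREG provides a supremum test of the proportional hazards assumption based on cumulative residuals (shown here on the myeloma data for illustration):

proc phreg data=myeloma;
   model Time*VStatus(0) = LogBUN HGB;
   assess ph / resample;   /* supremum test of proportional hazards */
run;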
Machin, Cheung, Parmar
• Explanatory variables are categorized:
1. Fundamental to research design (D)
2. Those that influence outcome or are confounders (K)
3. Uncertain influence (Q)
Strategies
• Forced-entry
• Significance tests
• Change in estimates of hazard ratios
Forced-entry
• Include variables in the model according to research design or prior opinion. This could include a non-statistically significant variable, e.g. the treatment variable in an RCT (see the sketch below)
• Include variables known to be influential in their ability to confound the primary association of interest
• The resulting model (with statistically non-significant effects) could have reduced efficiency
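A sketch of forced entry combined with automatic selection in SAS: the INCLUDE= option keeps the first listed effects in every model while the remaining variables compete for entry. Trt is a hypothetical treatment indicator added for illustration; the other variables and data set follow the myeloma example:

proc phreg data=myeloma;
   /* include=1 forces the first listed variable (the hypothetical Trt) into every model */
   model Time*VStatus(0) = Trt LogBUN HGB Platelet Age SCalc
         / selection=stepwise include=1 slentry=0.10 slstay=0.10;
run;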
Significance testing
• Step-up or step-down procedures where selection is 'manual', not automated
Change in estimates
• If our purpose is to obtain a suitable estimate of the HR for a key variable, the significance-testing strategy may not be successful in selecting confounders
• Compare the crude estimate HR_crude with the adjusted estimate HR_adjusted for a clinically important difference. A 10% change is suggested.
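One way to operationalise this comparison (a sketch, using the 10% threshold suggested above):

$$\%\,\text{change} = 100 \times \frac{\mathrm{HR}_{\text{adjusted}} - \mathrm{HR}_{\text{crude}}}{\mathrm{HR}_{\text{crude}}}$$

with an absolute value greater than 10% suggesting that the adjustment variable is worth retaining as a confounder.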
Practical considerations
• Due to the effects of bias, if more than 20% of the data points are missing for a variable, exclude it from the modeling process. If missing data comprise < 5%, then the bias introduced will likely be small (see the sketch below for checking missingness)
• Check to see how any automatic selection routines handle missing data
• In practice, one can start with missing data excluded at the early stages of the selection process but bring them back into the process as it becomes more clear which variables are likely to be in the final model
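A minimal sketch for checking the extent of missingness per variable in SAS before applying the thresholds above; the data set name and variable list are placeholders:

proc means data=mydata n nmiss;   /* n and nmiss reported for each listed variable */
   var LogBUN HGB Platelet Age;
run;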
Practical considerations
• Significance level to use? Err on the side of caution: use 0.10 generally and 0.2 for the change-in-estimates method
Practical considerations
• Univariable analysis per se is not recommended
• Rationale for univariable screening:
   • If an explanatory variable is associated with an outcome variable, this association may be the result of confounding
   • However, if an explanatory variable is not associated with an outcome variable in a univariable analysis, there is no gain in further examining it in a multivariable analysis
• This argument is flawed; it overlooks the possibility of confounding which may suppress a genuine relation, so-called 'negative' confounding
Positive vs. Negative Confounding
• Positive confounding: an association is found between an exposure variable and outcome but in reality there is no association (the spurious association is caused by the confounder), OR the observed association is stronger than the true association because of the confounder
• Negative confounding: an association is not found between an exposure variable and outcome but in reality there is an association (the true association is suppressed by the confounder), OR the observed association is weaker than the true association because of a confounder
[Diagram: confounding example contrasting the true and apparent magnitudes of the association between higher education in women and breast cancer incidence, with nulliparity as the confounder.]
Steyerberg
• The problem of overfitting already starts with considering too many candidate predictors in a data set
• The problem is difficult to solve with the standard statistical techniques which are used by default in medical research
• The uncertainty of model selection is an important source of overfitting
Steyerberg
• Improvements can be sought by limiting the necessity for selection by using subject matter knowledge, especially in relatively smaller data sets (also advocated by Harrell)
• Use better algorithms to discover patterns in the data (e.g. the LASSO)
• The LASSO is a penalized estimation technique where the estimated regression coefficients are constrained such that the sum of their scaled absolute values falls below some constant k chosen by cross-validation
• This type of constraint forces some regression coefficients towards zero (which helps with the overfitting problem) and some to exactly zero (helping with variable selection)
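In symbols, the constraint described above can be written as follows (a sketch in generic notation; $L(\beta)$ denotes the partial likelihood and the $\beta_j$ are assumed to be suitably scaled):

$$\hat{\beta} = \arg\max_{\beta}\, \log L(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le k$$

with $k$ chosen by cross-validation as described above.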
Royston et al.
• No consensus exists on the best method for selecting variables
• Two main strategies:
   • Full model approach: all candidate variables are included. This model is claimed to avoid overfitting and selection bias and to provide correct standard errors and P values. However, the full model is not always easy to define.
   • Backward elimination approach: the choice of significance level has a major effect on the number of variables selected. Selection of predictors by significance testing is known to produce selection bias (regression coefficients are overestimated) and optimism as a result of overfitting. Overfitting leads to worse prediction in independent data.
Example 1 - Chow et al. (Collett approach)
Example 2 – Fosker et al. (Harrell approach)
• The Importance of Poor Performance Status in Personalising Palliative Radiotherapy Towards the End of Life
• The goal of our project is to define a clinically relevant ECOG PS based algorithm that would enable accurate prediction of patients with shorter life expectancies (< 3-4 months).
Multivariate Analysis
Cox Proportional Hazards model results
NOTE: ECOG=0 as reference category for variable ECOG

Parameter        P-value    Hazard Ratio    95% CI Lower    95% CI Upper
Age 67-74        0.0009     1.168           1.066           1.28
Age 75+          <.0001     1.265           1.15            1.391
Brain mets Yes   <.0001     1.354           1.252           1.465
ECOG 1           <.0001     1.575           1.317           1.885
ECOG 2           <.0001     2.258           1.881           2.712
ECOG 3           <.0001     3.59            2.989           4.312
ECOG 4           <.0001     5.925           4.743           7.401
Gender Male      <.0001     1.337           1.24            1.44
Primary Lung     <.0001     1.249           1.158           1.348
Conclusions
• One size doesn't fit all: it is hard to conclude that there is a "best" approach
• "Cutting to the chase" is not an appropriate description of multivariable model building
• "A good model is one chosen by using a careful, well thought out covariate selection process that gives thoughtful consideration to issues of adjustment and interactions and thoroughly evaluates the model for assumptions, influential observations, and tests for goodness-of-fit" (Hosmer and Lemeshow 2008)
References
• Collett D. Modelling Survival Data in Medical Research. Chapman and Hall, 1991.
• Hosmer DW, Lemeshow S, May S. Applied Survival Analysis: Regression Modeling of Time-to-Event Data, 2nd edition. Wiley, 2008.
• Machin D, Cheung YB, Parmar MKB. Survival Analysis: A Practical Approach. Wiley, 2006.
• Steyerberg EW. Clinical Prediction Models. Springer, 2009.
• Royston P, Moons KGM, Altman DG, Vergouwe Y. Prognosis and prognostic research: Developing a prognostic model. BMJ, June 2009, Volume 338, pp. 1373-1377.
• Harrell FE. Regression Modeling Strategies. New York: Springer, 2001.