TRANSCRIPT
RACE 625: Medical Statistics in Clinical Research
Multiple linear regression
Asst. Prof. Dr. Sasivimol Rattanasiri [email protected]
Doctor of Philosophy Program in Clinical Epidemiology, Section for Clinical Epidemiology & Biostatistics, Faculty of Medicine Ramathibodi Hospital, Mahidol University
Semester 1, 2017 www.ceb-rama.org
CONTENTS
1. MULTIPLE REGRESSION MODEL
2. FITTING MULTIPLE REGRESSION MODEL
3. FITTING CATEGORICAL INDEPENDENT VARIABLES
4. EVALUATING THE MULTIPLE REGRESSION MODEL
4.1 Test for overall regression
4.2 The coefficient of multiple determination
4.3 Test for partial regression coefficients
4.4 Evaluating multiple regression by STATA
5. MODEL SELECTION
5.1 Forward selection procedure
5.2 Backward elimination procedure
5.3 Stepwise procedure
6. CONFOUNDING AND INTERACTION
6.1 Confounding effects
6.2 Interaction effects
7. REGRESSION DIAGNOSTICS
7.1 Normality
7.2 Linearity
7.3 Homoskedasticity
7.4 Multicollinearity
7.5 Outliers
OBJECTIVES
This module will help you to be able to:
Fit a regression model considering two or more co-variables simultaneously
Perform model selection
Assess goodness of fit of the model & check assumptions
Interpret & report results
REFERENCES
1. Neter J, Wasserman W, and Kutner MH. Applied linear statistical models, third edition. Boston:
IRWIN, 1990; 21-433.
2. Kleinbaum DG, Kupper LL, Muller KE, and Nizam A. Applied regression analysis and other
multivariable methods. Washington: Duxbury Press, 1998; 39-212.
3. Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
4. Hosmer DW and Lemeshow S. Applied logistic regression. New York: John Wiley & Sons, 1989.
READING SECTION
Read Neter J et al.; Chapters 7, 8, 11, & 12
ASSIGNMENT IV (25%)
P.39, Due: September 27, 2017
1. MULTIPLE REGRESSION MODEL
In Module IV, we discussed simple linear regression and correlation analysis. A simple
regression model includes one predictor and one outcome. In practice, an outcome is
usually affected by more than one predictor. For example, the systolic blood pressure
(SBP) may be determined by the age, smoking behavior, and body mass index (BMI). To
investigate the more complicated relationship between an outcome and a number of
predictors, we use a natural extension of simple linear regression known as multiple
regression analysis.
There are several advantages of using multiple regression analysis.
1) To develop a prognostic index that predicts the outcome from several predictors. For
example, the SBP may be predicted by the age, smoking behavior, and BMI.
2) To adjust (control) for potential confounding factors for which the study design has
not planned. For example, the effect of BMI on the SBP may be confounded by age and
smoking behavior. Fitting all these variables together allows us to assess the effect of
BMI adjusted for age and smoking behavior.
A multiple regression model with y as an outcome (dependent variable) and X1, X2, X3, ..., Xk as k
predictors (independent variables) is written as:

y = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk + ε   (1)

where β0 refers to the population intercept or constant term,
β1, β2, β3, ..., βk are the population slopes or regression coefficients of the independent
variables X1, X2, X3, ..., Xk, respectively,
ε refers to the random error term.
The constant term β0 is the mean value of the outcome when all predictors in the
model take the value 0.
The coefficients β1, β2, β3, ..., βk are called the partial regression coefficients. For
example, β1 is the partial coefficient of X1, and it gives the change in y due to a one-unit
change in X1 when all other predictors included in the model are held constant.
A positive value for a particular βi in model (1) indicates a positive relationship
between the outcome (Y) and the related predictor (Xi). A negative value for a
particular βi in that model indicates a negative relationship between the outcome
(Y) and the related predictor (Xi).
The multiple regression model (1) can only be used when the relationship between the
outcome and each predictor is linear. Each of the Xi variables in the model (1) represents
a single variable raised to the first power. This model is referred to as a first-order multiple
regression model.
For a data sample, a multiple linear regression is written as:

yi = a + b1X1i + b2X2i + b3X3i + ... + bkXki + ei   (2)

where
ŷi is the estimated or predicted value of the outcome yi,
a is the unbiased estimator of the population intercept,
b1, b2, b3, ..., bk are the unbiased estimators of the population slopes of the predictor
variables X1, X2, X3, ..., Xk, respectively,
ei is the random error, which is the difference between the observed and predicted
values of the dependent variable (yi − ŷi). The technical term for this difference is a
residual.
2. FITTING MULTIPLE REGRESSION MODEL
The method of least squares is used to estimate the parameters in the multiple regression
model. In general, the least squares method chooses as the best-fitting model the one that
minimizes the sum of squares of the distance between the observed values and predicted
values of the outcome which can be defined as:
Σi=1..n (yi − ŷi)² = Σi=1..n (yi − a − b1X1i − b2X2i − b3X3i − ... − bkXki)²   (3)
The least squares estimates of the regression coefficients a, b1, b2, b3, ..., bk in model (3) are
obtained by using matrix mathematics. In this text we do not present the matrix formulas for
calculating these least squares estimates; the matrix formulation for multiple regression can
be found in Kleinbaum, Appendix B, pages 732-743, and Neter, Section 7.2, pages 236-239.
Instead of using the formulas manually, the calculations in a multiple regression analysis
are made by using statistical software packages, such as STATA or SPSS. Even for a
multiple regression with two predictors, the formulas are complex and manual calculations
are time consuming. In this module we perform the multiple regression analysis using the
STATA program. However, the solutions obtained by using other statistical software
packages such as SPSS, SAS, or MINITAB can be interpreted in the same way.
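The least-squares idea can be illustrated with a small pure-Python sketch: build the normal equations (X'X)b = X'y and solve them, which is equivalent to minimizing the sum of squared residuals in equation (3). The toy data, true coefficient values, and function names below are illustrative assumptions, not the module's SBP dataset.

```python
# A minimal pure-Python sketch of least squares for y = a + b1*x1 + b2*x2.
# Solving the normal equations (X'X) b = X'y minimizes the sum of squared
# residuals in equation (3). Toy data only; not the module's dataset.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_ols(rows):
    """rows: list of (x1, x2, y); returns the estimates [a, b1, b2]."""
    X = [[1.0, x1, x2] for x1, x2, _ in rows]
    y = [row[2] for row in rows]
    XtX = [[sum(X[i][p] * X[i][q] for i in range(len(X))) for q in range(3)]
           for p in range(3)]
    Xty = [sum(X[i][p] * y[i] for i in range(len(X))) for p in range(3)]
    return solve(XtX, Xty)

# Noise-free toy data generated from y = 2 + 3*x1 + 0.5*x2,
# so the fit should recover those coefficients (up to rounding error).
data = [(x1, x2, 2 + 3 * x1 + 0.5 * x2)
        for x1 in (1, 2, 3, 5) for x2 in (10, 20, 40)]
a, b1, b2 = fit_ols(data)
```

In practice these matrix computations are exactly what the `regress` command performs internally, which is why the manual formulas are rarely used.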
Example 2-1 Researchers wanted to investigate how SBP varies with the BMI and age.
The outcome is ‘sbp’ and the predictors are ‘BMI’ and ‘age’. To fit the multiple linear
regression by using the STATA, we would type:
. regress sbp1 bmi age
1. Source | SS df MS 2.Number of obs = 30
-----------+---------------------------------- 3.F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 4.R-squared = 0.4044
-----------+---------------------------------- 5.Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
6. sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
There are two predictors, the BMI and age, included in this model. In the STATA output
no. 6, the variable '_cons' refers to the intercept. The regression coefficients for the
intercept, BMI, and age are presented in the second column (labeled 'Coef.'). The multiple
regression model for this example is given as:
predicted sbp = 56.02 + 1.78(BMI) + 0.56(age)

From this equation, the estimated intercept is 56.02. It is the value of the estimated SBP
when BMI = 0 and age = 0. This means that a patient who has zero BMI and zero age is
expected to have an SBP of 56.02 mmHg. This is the technical interpretation of the
intercept. In reality, that may not be meaningful because none of the patients in our
sample has both zero BMI and zero age. In most medical applications the value of the
intercept has no practical meaning.
The estimated coefficient of the BMI (b1) is 1.78. This value gives the change in the
estimated SBP for a one-unit change in the BMI when the age is held constant. Thus, we
can state that a patient with one extra unit of BMI but the same age is expected to have a
higher SBP of 1.78 mmHg.
The estimated coefficient of age (b2) is 0.56. This value gives the change in the estimated
SBP for a one-unit change in age when the BMI is held constant. Thus, we can state that
a patient who is one year older but has the same BMI is expected to have a higher
SBP of 0.56 mmHg.
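As a quick numeric illustration of the fitted equation, a hypothetical patient profile can be plugged in; the BMI and age values below are invented for illustration, and the coefficients are the rounded estimates from the output above.

```python
# Predicted SBP from the fitted model: sbp = 56.02 + 1.78*BMI + 0.56*age.
# The patient profile (BMI 25, age 50) is a hypothetical example.
intercept, b_bmi, b_age = 56.02, 1.78, 0.56

def predict_sbp(bmi, age):
    """Return the predicted SBP (mmHg) for a given BMI and age."""
    return intercept + b_bmi * bmi + b_age * age

sbp_hat = predict_sbp(25, 50)  # 56.02 + 44.5 + 28.0 = 128.52 mmHg
```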
3. FITTING CATEGORICAL INDEPENDENT VARIABLES
All the predictive variables that we have considered up to this point have been measured
on a continuous scale. However, regression analysis can be generalized to incorporate
categorical variables into the model. For example, we want to test if smoking status affects
the SBP. The generalization is based on the use of dummy variables which is the central
idea of this section.
The dummy variable is a binary variable that takes on the values 0 and 1 which is used to
identify the different categories of a categorical variable. The term “dummy” is used to
indicate the fact that the numerical values (such as 0 and 1) assumed by the variable have
no quantitative meaning, but are used to identify different categories of the categorical
variable under consideration.
For example, we have a variable for smoking status coded 1 for current smoker, 2 for ex-
smoker, and 3 for never smoker. Thus, the dummy variables for smoking behavior are
presented as follows:
Smoking behavior    Dummy 1   Dummy 2   Dummy 3
Current smoker         1         0         0
Ex-smoker              0         1         0
Never smoker           0         0         1
To include these variables in a multiple regression model, we use k-1 dummy variables for
k categories. The omitted category is referred to as the reference group. It is arbitrary
which group is assigned to be the reference group. The choice of a reference group is
usually dictated by subject-matter considerations. For example, if never smoke category is
referred to as the reference group in the multiple regression analysis, only the dummy
variables 1 and 2 are fitted in the model.
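The k−1 coding scheme can be sketched in a few lines; the category labels and the choice of never smoker as the reference follow the table above, while the function name is a hypothetical illustration.

```python
# k-1 dummy coding for a categorical variable with k = 3 levels.
# "never" is taken as the reference group, so it scores 0 on both dummies.
def smoke_dummies(status):
    """Map a smoking status to (dummy1, dummy2); 'never' is the reference."""
    coding = {
        "current": (1, 0),
        "ex":      (0, 1),
        "never":   (0, 0),  # reference group: all dummies are 0
    }
    return coding[status]
```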
There are two easy ways to create dummy variables in STATA. The first is to use the tabulate
command with the generate( ) option, as shown below.

. tabulate smoke, gen(dum)

      smoke |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       25.00       25.00
          2 |          9       28.13       53.13
          3 |         15       46.88      100.00
------------+-----------------------------------
      Total |         32      100.00

. list smoke dum1 dum2 dum3

     +----------------------------+
     | smoke   dum1   dum2   dum3 |
     |----------------------------|
  1. |     1      1      0      0 |
  2. |     1      1      0      0 |
  3. |     1      1      0      0 |
  4. |     2      0      1      0 |
  5. |     2      0      1      0 |
     |----------------------------|
  6. |     2      0      1      0 |
  7. |     3      0      0      1 |
  8. |     3      0      0      1 |
  9. |     3      0      0      1 |
     +----------------------------+
The tabulate command with the generate option created three dummy variables called dum1,
dum2, and dum3. STATA creates all three dummy variables and lets the users choose
their own reference. Suppose that we add the smoking status to the regression model for
predicting SBP that already contains the BMI and age. If the never smoke category is
taken as the reference group, only dum1 and dum2 are fitted in the model as
follows:
. regress sbp1 bmi age dum1 dum2
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
dum1 | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
dum2 | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
The second way is to use the xi command to create dummy variables.

. xi i.smoke
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_1 omitted)

. list smoke _Ismoke_2 _Ismoke_3
+-------------------------------+
| smoke _Ismok~2 _Ismok~3 |
|-------------------------------|
1. | past 1 0 |
2. | never 0 1 |
3. | never 0 1 |
4. | never 0 1 |
5. | never 0 1 |
The xi command created two dummy variables called _Ismoke_2 and _Ismoke_3 and omitted
the dummy variable for group 1 as the reference group. By default the xi command assigns
the first category as the reference group. If we want another category to be the reference, it
can be re-assigned using the char var [omit] # command. For example, if we would like group 3
to be the reference group, we would type
char smoke [omit] 3
xi i.smoke
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_3 omitted)
The xi command is a prefix command, which can be followed by another modeling
command. To fit the multiple regression of SBP on the BMI, age, and smoking status, we
would type
. char smoke [omit] 3
. xi: regress sbp1 bmi age i.smoke
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_3 omitted)
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
_Ismoke_1 | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
_Ismoke_2 | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
Alternatively, a single command using factor-variable notation can specify the reference group, as follows:
. regress sbp1 bmi age ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
|
smoke |
present | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
past | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
|
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
With these commands, STATA treats BMI and age as continuous variables and smoke
as a categorical variable represented by dummies. We can interpret that, with identical
BMI and age, current smokers have an SBP about 0.07 mmHg non-significantly lower than
never smokers, and ex-smokers have an SBP about 0.04 mmHg non-significantly lower than
never smokers.
4. EVALUATING THE MULTIPLE REGRESSION MODEL
Before using a multiple regression model to predict or estimate, it is desirable to determine
first whether it adequately describes the relationship between the outcome and a set of
predictors and whether it can be used effectively for the purpose of prediction.
4.1 Test for overall regression
We now consider an overall test for a regression model which contains k predictors in the
model. The null hypothesis for this test is that there is no linear relationship between the
outcome and the set of predictors. In other words, all of the predictors considered together
do not explain a significant amount of the variation in the outcome. The null and
alternative hypotheses are defined as:
H0: β1 = β2 = ... = βk = 0
Ha: not all βj (j = 1, ..., k) equal zero
An ANOVA approach can be used to perform this test. The particular form of an ANOVA
table for regression analysis is presented in Table 4-1. The ANOVA approach is based on
the partitioning of the total variation of the observed values yi into two components as
follows:

(yi − ȳ) = (ŷi − ȳ) + (yi − ŷi)   (4)

total variation = explained variation + unexplained variation
Thus, the total variation can be viewed as the sum of two components:
1. Explained variation is the variation of the predicted values ŷi around the mean ȳ,
which is measured by the regression sum of squares (SSR). This indicates how
much of the variation of the observed values yi can be explained by the predictive
variables that are included in the regression model. The mean square regression
(MSR) is obtained by dividing the regression sum of squares by its corresponding
degrees of freedom.
2. Unexplained variation is the variation of the observed values yi around the fitted
regression line, which is measured by the error sum of squares (SSE). The mean
square residual (MSE) is obtained by dividing the error sum of squares by its
corresponding degrees of freedom.
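Using the sums of squares from the Example 2-1 output, one can verify numerically that the two components add up to the total variation, i.e. SST = SSR + SSE:

```python
# Sums of squares read off the Example 2-1 STATA ANOVA table.
SSR = 3496.04264   # explained (regression) sum of squares
SSE = 5149.95736   # unexplained (error) sum of squares
SST = 8646.0       # total sum of squares

# The partition of total variation implies SST = SSR + SSE.
assert abs((SSR + SSE) - SST) < 1e-6
```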
Table 4-1 ANOVA table for multiple regression

Source of variation   SS                    df          MS                      F          P value
Regression            SSR = Σ(ŷi − ȳ)²      k           MSR = SSR/k             MSR/MSE
Error                 SSE = Σ(yi − ŷi)²     n − k − 1   MSE = SSE/(n − k − 1)
Total                 SST = Σ(yi − ȳ)²      n − 1
The appropriate statistical test for significant overall regression is the F test, which is obtained
by dividing the mean square regression by the mean square residual as follows:
F = MSR / MSE   (5)
This test has a F distribution with k and n-k-1 degrees of freedom. The computed value of F
can then be converted to the associated p value. The last step is to compare the p value with the
level of significance, and make a decision whether the null hypothesis is rejected or not.
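The overall F statistic in equation (5) can be reproduced from the Example 2-1 ANOVA quantities:

```python
# Overall F test for Example 2-1, computed from the ANOVA quantities.
SSR, SSE = 3496.04264, 5149.95736
n, k = 30, 2                     # observations and predictors

MSR = SSR / k                    # mean square regression
MSE = SSE / (n - k - 1)          # mean square error
F = MSR / MSE                    # overall F statistic, df = (k, n - k - 1)
```

Rounded to two decimals this reproduces the F(2, 27) = 9.16 shown in the STATA output.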
4.2 The coefficient of multiple determination
In Section 2.4.2 of Module IV, we discussed the coefficient of determination for a
simple linear regression model. It is the proportion of the total sum of squares (SST) that is
explained by the simple regression model. The coefficient of determination for the multiple
regression model, usually called the coefficient of multiple determination, is denoted by R².
It is defined as the proportion of the total sum of squares (SST) that is explained by the
multiple regression model. It tells us how good the multiple regression model is and how
well the predictive variables included in the model explain the outcome.
The calculation of the R² is also based on the ANOVA table presented in Table 4-1, and is
defined as:

R² = SSR / SST = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = 1 − SSE / SST   (8)
However, the R² has one major shortcoming. The R² value generally increases as we add
more and more predictive variables to the regression model. This does not imply that the
regression model with a higher value of R² does a better job of predicting the outcome.
Such a value of R² can be misleading, and it will not represent the true power of the predictive
variables in the regression model. To eliminate this shortcoming of R², it is preferable to use
the adjusted R², which is a modified measure that adjusts for the number of predictive
variables in the model. The adjusted R² can be determined by dividing each sum of squares by
its associated degrees of freedom. Thus, the adjusted R² is defined as:
R²a = 1 − (SSE / (n − k − 1)) / (SST / (n − 1))   (9)
where n is the total number of observations, and k is the number of predictive variables in
the model.
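Equations (8) and (9) can be checked against the Example 2-1 output (n = 30, k = 2):

```python
# R-squared and adjusted R-squared for Example 2-1 (n = 30, k = 2),
# using the sums of squares from the STATA ANOVA table.
SSE, SST = 5149.95736, 8646.0
n, k = 30, 2

r2 = 1 - SSE / SST                                   # equation (8)
adj_r2 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))   # equation (9)
```

These reproduce the R-squared = 0.4044 and Adj R-squared = 0.3602 shown in the STATA output.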
4.3 Test for partial regression coefficients
Frequently we wish to test whether the addition of a specific predictor (X*), given that the
others (X1, X2, X3, ..., Xk) are already in the model, significantly improves the prediction of
the outcome. That is, we want to test the null hypothesis that β* = 0 against the alternative
hypothesis that β* ≠ 0 in the model

y = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk + β*X* + ε.
The appropriate statistical test for these partial regression coefficients is the partial F test.
The concept of this test is to compare the error sum of squares between two models:
1) The full model contains kXXX ,...,, 21 and *X as the predictors.
2) The reduced model contains kXXX ,...,, 21 but not *X .
The error sum of squares of the full model is smaller than that of the reduced model. The
difference between these error sums of squares is called the extra sum of squares.
The error sum of squares when X1, X2, ..., Xk and X* are in the model is denoted
by SSE(X1, X2, ..., Xk, X*).
The error sum of squares when X1, X2, ..., Xk are in the model is denoted by
SSE(X1, X2, ..., Xk).
The extra sum of squares, denoted by SSR(X* | X1, X2, ..., Xk), is defined as:

SSR(X* | X1, X2, ..., Xk) = SSE(X1, X2, ..., Xk) − SSE(X1, X2, ..., Xk, X*)

Thus, the partial F test is defined as:

F* = [SSR(X* | X1, X2, ..., Xk) / 1] / [SSE(X1, X2, ..., Xk, X*) / (n − k − 1)]
   = MSR(X* | X1, X2, ..., Xk) / MSE(X1, X2, ..., Xk, X*)   (6)
This F statistic has an F distribution with 1 and n-k-1 degrees of freedom.
Note:
To distinguish the partial F test in equation (6) from the overall F test in equation (5), we use
the F* test statistic for the partial F test, and the F test statistic for the overall F test.
An equivalent way to perform the partial F test is to use a t test. The t test focuses on a test
of the null hypothesis that β* = 0. The t test for testing this null hypothesis is computed as:

t = b* / SE(b*)   (7)

where b* is the estimated coefficient of the specific predictor (X*) in the regression model
y = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk + β*X* + ε,
and SE(b*) is the estimate of the standard error of b*.
This test has a t distribution with n-k-1 degrees of freedom.
Since the two tests are equivalent, the choice is usually made in terms of the available
information provided by the computer package output.
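The t test in equation (7), and its equivalence to the partial F test, can be checked with the BMI coefficient and standard error from the Example 2-1 output:

```python
# t test for the partial regression coefficient of BMI in Example 2-1,
# using the coefficient and standard error from the STATA output.
b_bmi, se_bmi = 1.783396, 0.849328

t = b_bmi / se_bmi          # t statistic with n - k - 1 = 27 df
partial_F = t ** 2          # equivalent partial F* statistic (1 and 27 df)
```

Rounded to two decimals, t matches the 2.10 reported in the output table.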
4.4 Evaluating multiple regression by STATA
Consider again the STATA output of Example 2-1; the numbered parts of the output are discussed below.
. regress sbp1 bmi age
1. Source | SS df MS 2.Number of obs = 30
-----------+---------------------------------- 3.F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 4.R-squared = 0.4044
-----------+---------------------------------- 5.Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
6. sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
1. Table of the analysis of variance
The ANOVA table shows the values of sum of squares, degrees of freedom and variances as:
- The values of sum of squares are
Sum of squares of total variation: SST=8646
Sum of squares of regression: SSR=3496.04
Sum of squares of error: SSE=5149.96
- The degrees of freedom (df) are
df of the total variation is n-1, that is 30-1=29
df of the explained variation is k, that is 2.
df of the unexplained variation is n-k-1, that is 27.
- The variances or mean square (MS) are
Mean square of total variation (MST) is SST/df = 8646/29 = 298.14
Mean square of regression (MSR) is SSR/df = 3496.04/2 = 1748.02
Mean square of error (MSE) is SSE/df = 5149.96/27 = 190.74
2. The number of total observations
The number of total observations for this example is 30.
3. The overall F test for overall regression
We now consider the overall F test for a regression model which contains the predictive
variables; BMI and age. The null hypothesis for this test is that there is no linear relationship
between the SBP and the set of predictive variables; BMI and age. From the ANOVA table, the
overall F test for the null hypothesis H0: βBMI = βage = 0 is computed as:

F = MSR / MSE = 1748.02 / 190.74 = 9.16.
This test has the F distribution with 2 and 27 degrees of freedom. The output from the STATA
indicated that the p value for this example (0.0009) is less than 0.001. As a result, we reject the
null hypothesis and conclude that there is a relationship between the SBP and the set of
predictive variables, BMI and age.
4. The R²
The estimated R² for this example is computed as:

R² = SSR / SST = 3496.04 / 8646 = 0.4044
This implies that 40.44% of the variation of SBP is explained by its linear relationship with the
BMI and age.
5. The adjusted R²
The estimate of the adjusted R² for this example is computed as:

R²a = 1 − (SSE / (n − k − 1)) / (SST / (n − 1)) = 1 − (5149.96 / 27) / (8646 / 29) = 0.3602
This implies that after adjusting for the number of predictive variables, 36.02% of the variation
of SBP is explained by its linear relationship with the BMI and age.
6. The table for tests of partial regression coefficients
This table shows the set of predictive variables and their corresponding coefficients. The
variable '_cons' refers to the intercept. The partial regression coefficients for the BMI and
age are presented in the second column (labeled 'Coef.'). Their standard errors (SE) and the
values of t test for partial regression coefficients are presented in the third and fourth columns,
respectively. The corresponding p values and 95% CI of coefficients are presented in the fifth
and sixth columns, respectively.
The null hypotheses for partial regression coefficients can be defined as:
H0: β1 = 0
H0: β2 = 0
For our example, the t value for the BMI is 2.10, which has a p value of 0.045. Thus, we
reject the null hypothesis and conclude that there is a linear relationship between the BMI and
SBP when age is already in the model. In other words, the addition of the BMI, given that age
is already in the model, significantly contributes to the prediction of the SBP.
The t value for age is 3.39, which has a p value of 0.002. Thus, we reject the null hypothesis
and conclude that there is a linear relationship between age and SBP when BMI is already in
the model. In other words, the addition of age, given that BMI is already in the model,
significantly contributes to the prediction of the SBP.
5. MODEL SELECTION
In this section we focus on determining the best (most important or most valid) subset of
the k predictive variables for describing the relationship between the outcome and the
predictive variables. There are many strategies for selecting the best model. Such strategies
have focused on deciding whether a single variable should be added to a model or whether
a single variable should be deleted from a model. In this section we explain an algorithm
for evaluating models with forward selection, backward elimination, and stepwise
procedures. These procedures are widely used in practice.
5.1 Forward selection procedure
This strategy focuses on deciding whether a single variable should be added to a model. In
the forward selection procedure, we proceed as follows:
Step 0:
1. Fit a simple linear regression model for each of the k potential predictive variables.
2. Select the first predictive variable which most highly correlates with the outcome.
3. Fit the regression model to the selected predictive variable.
4. Consider the overall F test. If it is not significant, stop and conclude that no predictive
variables are important predictors. If the overall F test is significant, add the selected
predictor to the model and go to the next step.
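Step 0 can be sketched in a few lines of pure Python: compute each candidate predictor's squared correlation with the outcome and select the largest. The toy data and variable names below are illustrative, not the module's SBP dataset.

```python
# Forward selection, step 0: pick the predictor most highly correlated
# with the outcome (illustrative toy data, not the SBP dataset).
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

y = [110, 120, 125, 140, 150]
predictors = {
    "age":   [30, 40, 45, 60, 70],   # strongly related to y in this toy data
    "noise": [5, 1, 4, 2, 3],        # unrelated to y
}

# Squared correlation of each candidate with the outcome.
r2_by_var = {name: pearson_r(xs, y) ** 2 for name, xs in predictors.items()}
first_selected = max(r2_by_var, key=r2_by_var.get)
```

After this choice, the overall F test of the selected single-predictor model decides whether the procedure continues, exactly as in step 0 above.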
Example:
To fit a simple linear regression of SBP on each of the three independent variables (age, BMI,
and smoking status) using STATA, we would type:
. regress sbp1 age
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(1, 28) = 12.41
Model | 2655.06439 1 2655.06439 Prob > F = 0.0015
Residual | 5990.93561 28 213.961986 R-squared = 0.3071
-------------+---------------------------------- Adj R-squared = 0.2823
Total | 8646 29 298.137931 Root MSE = 14.627
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .6133205 .1741078 3.52 0.001 .2566768 .9699642
_cons | 94.03786 7.572722 12.42 0.000 78.52584 109.5499
------------------------------------------------------------------------------
. regress sbp1 bmi
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(1, 28) = 4.95
Model | 1298.35408 1 1298.35408 Prob > F = 0.0344
Residual | 7347.64592 28 262.415926 R-squared = 0.1502
-------------+---------------------------------- Adj R-squared = 0.1198
Total | 8646 29 298.137931 Root MSE = 16.199
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmi | 2.193388 .986084 2.22 0.034 .1734861 4.213289
_cons | 69.75698 22.33493 3.12 0.004 24.00596 115.508
------------------------------------------------------------------------------
. regress sbp1 ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 0.07
Model | 43.6752137 2 21.8376068 Prob > F = 0.9339
Residual | 8602.32479 27 318.604622 R-squared = 0.0051
-------------+---------------------------------- Adj R-squared = -0.0686
Total | 8646 29 298.137931 Root MSE = 17.849
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoke |
present | -2.692308 8.020824 -0.34 0.740 -19.14968 13.76506
past | .0854701 7.740062 0.01 0.991 -15.79583 15.96677
|
_cons | 119.6923 4.95056 24.18 0.000 109.5346 129.85
------------------------------------------------------------------------------
From the STATA output for prediction of SBP, we see that the highest squared correlation
is for age (R-squared = 0.3071). The overall F test for the regression of SBP on age is
statistically significant (p = 0.0015). Therefore, age is added to the model at this step.
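The step-0 screening just described (fit each one-predictor model and keep the variable with the largest t statistic, i.e. the smallest p value) can be sketched in pure Python. The toy data below are hypothetical, not the course data set:

```python
import math

# Hypothetical toy data: SBP with two candidate predictors.
age = [40, 45, 50, 55, 60, 65, 70, 75]
bmi = [22, 30, 24, 29, 23, 31, 25, 28]
sbp = [120, 126, 130, 137, 141, 149, 152, 160]

def t_stat(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b1 / se_b1

# Step 0: fit each one-predictor model and keep the variable with the
# largest |t| (equivalently, the smallest p value).
candidates = {"age": age, "bmi": bmi}
best = max(candidates, key=lambda name: abs(t_stat(candidates[name], sbp)))
print(best)  # with these numbers, age dominates
```

In practice the selected t statistic would still be compared against its critical value before the variable enters the model, as the procedure describes.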
Step 1:
1. Fit regression models that contain the variable initially selected (at step 0) and
another predictive variable which is not yet in the model.
2. Consider the t test of the null hypothesis that βj = 0, and the p value associated
with each remaining variable.
3. Focus on the variable with the largest t statistic, which is equivalent to the smallest
p value. If the test is significant, add that predictive variable to the regression model.
If it is not significant, keep the model from step 0, which has only one predictive
variable.
To fit the multiple regressions that contain age and one other predictive variable in
STATA, we would type:
. regress sbp1 age bmi
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
. regress sbp1 age ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(3, 26) = 3.85
Model | 2659.70909 3 886.569696 Prob > F = 0.0210
Residual | 5986.29091 26 230.241958 R-squared = 0.3076
-------------+---------------------------------- Adj R-squared = 0.2277
Total | 8646 29 298.137931 Root MSE = 15.174
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .6147128 .1823656 3.37 0.002 .239855 .9895706
|
smoke |
present | -.3221169 6.854604 -0.05 0.963 -14.41196 13.76772
past | -.9337973 6.586714 -0.14 0.888 -14.47298 12.60539
|
_cons | 94.34723 8.616691 10.95 0.000 76.63536 112.0591
------------------------------------------------------------------------------
From the STATA output, we see that BMI has the largest t statistic, with a p value of 0.045.
It also gives the largest adjusted R-squared. Therefore, BMI is added to the regression
model at this step.
Step 2:
At each subsequent step, consider the t test for the predictive variables which are not yet in
the model. If the largest t test is statistically significant, add the new variable to the model.
If the t test is not significant, no more variables are included in the model and the process
is stopped.
For our example, we have already added age and BMI to the model. We now consider whether
we should also add smoking status. To add smoking status to the multiple regression that
contains age and BMI with STATA, we would type:
. regress sbp1 age bmi ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
|
smoke |
present | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
past | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
|
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
The joint test for the two smoking-status dummy variables is obtained as
. test (1.smoke=0) (2.smoke=0)
( 1) 1.smoke = 0
( 2) 2.smoke = 0
F( 2, 25) = 0.00
Prob > F = 0.9999
From the above STATA output, the F statistic for smoking status, controlling for age and
BMI, is very small (0.00), with a p value of 0.9999. This is not statistically
significant at α = 0.05, so the process stops. Thus, the forward selection procedure
identifies age and BMI as the best subset of the predictive variables. To run the forward
selection procedure with a single STATA command, we would type:
. xi:sw regress sbp1 age bmi (i.smoke),pe(0.05)
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_3 omitted)
begin with empty model
p = 0.0015 < 0.0500 adding age
p = 0.0452 < 0.0500 adding bmi
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
5.2 Backward elimination procedure
This strategy starts from a model containing all candidate variables and decides, one at
a time, whether a single variable should be deleted. In the backward elimination
procedure, we proceed as follows:
Step 0:
1. Fit the regression model that contains all predictive variables.
2. Consider the t test for every variable in the model.
3. Focus on the variable with the smallest t statistic, which has the largest p value.
If the test is not significant, remove that variable from the model. If it is
significant, keep that variable in the model.
To fit the regression model that contains all predictive variables (age, BMI, and smoking
status) using STATA, we would type:
. regress sbp1 age bmi ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
|
smoke |
present | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
past | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
|
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
. test (1.smoke=0) (2.smoke=0)
( 1) 1.smoke = 0
( 2) 2.smoke = 0
F( 2, 25) = 0.00
Prob > F = 0.9999
From the STATA output, we see that smoking status has the smallest test statistic, with a
p value of 0.9999. Therefore, smoking status is dropped from the regression model at this step.
Step 1:
If a variable is dropped in step 0, re-compute the regression equation for the remaining
variables, and repeat the backward elimination procedure steps 0 and 1. If the variable is
not dropped, the backward elimination procedure is terminated.
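The elimination loop in steps 0 and 1 can be sketched generically. Here `fit_pvalues` is a stand-in for refitting the model (e.g. `regress` plus `test` in STATA); the stub below simply replays the p values reported in the outputs for this example:

```python
# A minimal sketch of the backward-elimination loop (not Stata's sw command).
def backward_eliminate(predictors, fit_pvalues, alpha=0.05):
    predictors = list(predictors)
    while predictors:
        pvals = fit_pvalues(predictors)         # p value per predictor in model
        worst = max(predictors, key=pvals.get)  # largest p value
        if pvals[worst] < alpha:                # everything significant: stop
            break
        predictors.remove(worst)                # drop the worst and refit
    return predictors

# Hypothetical stub mimicking the lecture example: smoking status is dropped
# first (p = 0.9999), then age (p = 0.002) and BMI (p = 0.045) both stay.
def fit_pvalues(current):
    table = {
        frozenset({"age", "bmi", "smoke"}):
            {"age": 0.003, "bmi": 0.055, "smoke": 0.9999},
        frozenset({"age", "bmi"}):
            {"age": 0.002, "bmi": 0.045},
    }
    return table[frozenset(current)]

print(backward_eliminate(["age", "bmi", "smoke"], fit_pvalues))
```

Note that BMI survives even though its p value is 0.055 in the full model: only the single worst variable is removed per pass, and after smoking status is dropped, BMI's p value falls below 0.05.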
From step 0 of our example, smoking status was dropped from the model, so we re-compute
the regression equation without it. The result of the re-computation is as follows:
. regress sbp1 age bmi
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
From the STATA output, we see that BMI has the smallest t statistic, with a p value of 0.045.
However, the test is significant, so BMI is not dropped from the regression model.
Therefore, we stop here with this model, which is the same model obtained with the
forward selection procedure. To run the backward elimination procedure with a single
STATA command, we would type:
. xi:sw regress sbp1 age bmi (i.smoke),pr(0.05)
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_3 omitted)
begin with full model
p = 0.9999 >= 0.0500 removing _Ismoke_1 _Ismoke_2
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
6. CONFOUNDING AND INTERACTION
Both confounding and interaction involve additional variables that may affect an
association between two or more variables. These additional variables are synonymously
referred to as extraneous variables, control variables, or covariates. For example, in the
earlier SBP example, we assess whether age is associated with SBP while accounting for
smoking status; the extraneous variable here is smoking status.
6.1 Confounding effects
Confounding is present when the relationship of interest differs depending on whether an
extraneous variable is ignored or included in the data analysis. Assessing confounding
requires comparing a crude estimate of the association (which ignores the extraneous
variable) with an adjusted estimate (which accounts in some way for the extraneous
variable). If the crude and adjusted estimates are meaningfully different, confounding is
present and the extraneous variable must be included in the analysis.
Suppose that we are interested in describing the relationship between a predictive
variable, drug therapy, and a continuous outcome, SBP, taking into account the possible
confounding effect of a third variable, age. Suppose drug therapy is a dichotomous
variable (drug = 1 for drug A and 0 for placebo). The comparison between a crude estimate
of the association and an adjusted estimate can be expressed in terms of the following
two regression models:

1) Y(SBP) = β0 + β1(drug) + β2(age)

2) Y(SBP) = β0 + β1(drug)
Model (1) expresses the relationship between drug therapy and SBP, adjusted for age, in
terms of the partial regression coefficient β1 of the variable drug. The estimate of β1
obtained from the least-squares fit of model (1), which we denote β̂1|Age, is an
adjusted-effect measure: it gives the estimated change in SBP per unit change in drug
therapy after adjusting for age.

Model (2) expresses the relationship between drug therapy and SBP, ignoring age, in terms
of the regression coefficient β1 of the variable drug. The estimate of β1 obtained from
the least-squares fit of model (2), denoted β̂1, is a crude estimate of the relationship
between drug therapy and SBP.
Thus, confounding is present if the estimates of the regression coefficient of the study
variable drug from models (1) and (2) are meaningfully different. As an example, suppose
that

β̂1 = 15.9 and β̂1|Age = 4.1

Then a 1-unit change in drug therapy yields roughly a 16-unit change in SBP when age is
ignored, but only about a 4-unit change when age is controlled. That is, the relationship
between drug therapy and SBP is much weaker after controlling for age. Thus, we would
treat age as a confounder and control for it in the analysis.
As another example, suppose that

β̂1 = 6.1 and β̂1|Age = 6.2

Here, we can conclude that age is not a confounder because there is no meaningful
difference between the estimates 6.1 and 6.2. Sometimes an investigator must deal with
more problematic comparisons, such as β̂1 = 5.5 versus β̂1|Age = 4.1. One approach is to
consider the clinical importance of the numerical difference between the estimates, based
on a priori knowledge of the variable(s) involved. For example, the estimated
coefficients 5.5 and 4.1 are the crude and adjusted differences in mean SBP between drug
A and placebo. We must decide whether a mean difference of 5.5 is clinically more
important than a mean difference of 4.1. If the difference is meaningful in clinical
practice, we treat age as a confounder and include the variable age in the model.
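The crude-versus-adjusted comparison can be automated with a relative-change rule. The 10% threshold below is a common epidemiological heuristic, not part of the lecture, and it complements rather than replaces the clinical judgment just described:

```python
def is_confounder(crude, adjusted, threshold=0.10):
    """Flag confounding when the adjusted estimate differs from the crude
    estimate by more than `threshold` in relative terms (10% rule heuristic)."""
    return abs(crude - adjusted) / abs(crude) >= threshold

print(is_confounder(15.9, 4.1))  # clear confounding by age
print(is_confounder(6.1, 6.2))   # no meaningful difference
```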
One approach sometimes used, incorrectly, to assess confounding is a statistical test of
the partial regression coefficient of the extraneous variable. Such a test addresses
precision, not confounding. For example, such a test evaluates whether significant
additional variation in SBP is explained by adding BMI (β2) to a model already containing
drug therapy (β1); equivalently, it asks whether the confidence interval for β1 is
considerably narrower when BMI is in the model than when it is not. However, even if
β̂2 ≠ 0, it does not follow that β̂1|BMI differs from β̂1. Therefore, β̂2 ≠ 0 is not
sufficient evidence of a confounding effect.
Consider the STATA outputs, which describe two regression models for the relationship
between drug and SBP when the BMI is ignored or included in the model, respectively.
1) Crude estimates of the relationship between drug therapy and SBP (BMI is ignored).
xi: regress sbp i.drug
i.drug _Idrug_0-1 (naturally coded; _Idrug_0 omitted)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 1, 30) = 6.16
Model | 1094.31693 1 1094.31693 Prob > F = 0.0189
Residual | 5331.65182 30 177.721727 R-squared = 0.1703
-------------+------------------------------ Adj R-squared = 0.1426
Total | 6425.96875 31 207.289315 Root MSE = 13.331
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Idrug_1 | 11.90688 4.798404 2.48 0.019 2.107235 21.70653
_cons | 137.4615 3.697418 37.18 0.000 129.9104 145.0127
------------------------------------------------------------------------------
2) Adjusted estimates of the relationship between drug therapy and SBP, after accounting for BMI.
. xi: regress sbp i.drug bmi
i.drug _Idrug_0-1 (naturally coded; _Idrug_0 omitted)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 2, 29) = 40.54
Model | 4733.10854 2 2366.55427 Prob > F = 0.0000
Residual | 1692.86021 29 58.3744899 R-squared = 0.7366
-------------+------------------------------ Adj R-squared = 0.7184
Total | 6425.96875 31 207.289315 Root MSE = 7.6403
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Idrug_1 | 10.6436 2.754685 3.86 0.001 5.009639 16.27756
bmi | 1.560152 .1976058 7.90 0.000 1.156002 1.964301
_cons | 55.13354 10.64064 5.18 0.000 33.37098 76.89609
------------------------------------------------------------------------------
From the STATA output, the statistical test of H0: βBMI = 0 has a p value of <0.001.
Thus, we reject the null hypothesis and conclude that adding BMI, given that drug is
already in the model, significantly contributes to the prediction of SBP. The adjusted
R-squared is 0.72 when BMI is included in the model, whereas the R-squared is 0.17 when
BMI is ignored. However, there is no meaningful difference between the crude and adjusted
estimates, 11.91 and 10.64 respectively. As a result, BMI is not a confounder in this
example, even though including BMI in the model explains additional variation in SBP.
6.2 Interaction effects
Interaction is present when the relationship of interest differs at different levels of
the extraneous variable. Assessing interaction therefore focuses on describing the
relationship of interest at each level of the extraneous variable. For example, in
assessing interaction due to sex when describing the relationship between age and SBP, we
must determine whether the regression coefficients relating age and SBP differ between
males and females.
To illustrate the concept of interaction, consider the following example. Suppose that we
wish to determine how two independent variables, age and sex, jointly affect systolic
blood pressure. To distinguish between interaction and no interaction, we consider two
graphs based on two hypothetical data sets, presented in Figure 6-1. These show the
straight-line regression of SBP on age for females against the corresponding regression
for males.
In Figure 6-1(a), the two regression lines are parallel. This suggests that the rate of
change in SBP as a function of age is the same for males and females; in other words, the
relationship between SBP and age does not depend on sex. We conclude that there is no
interaction between age and sex. In this situation, we can investigate the effects of age
and sex on SBP independently of one another; these effects are called the main effects.
One way to represent the relationship depicted in Figure 6-1(a) is with a regression
model of the form

Y = β0 + β1(age) + β2(sex).

Here, the change in the mean of SBP for a 1-unit change in age equals β1, regardless of
sex, while changing the category of sex only shifts the straight line relating SBP and
age without affecting the slope β1.
a) No interaction between age and sex
b) Interaction between age and sex
Figure 6-1 Graphs of non-interacting and interacting independent variables
In contrast, the two regression lines in Figure 6-1(b) cross, or tend toward crossing.
This figure depicts a situation where the relationship between SBP and age depends on
sex: SBP appears to increase with increasing age for males but to decrease with
increasing age for females. We conclude that there is an interaction between age and sex.
In this situation, we cannot interpret the main effects of age and sex on SBP, since the
two variables do not operate independently of one another in their effects on SBP. One
way to represent such interaction effects mathematically is a regression model of the form
Y = β0 + β1(age) + β2(sex) + β12(age)(sex).

An interaction term is the product of the two independent variables of interest, here age
and sex. The change in the mean of SBP for a 1-unit change in age now equals
β1 + β12(sex), which clearly depends on sex. For our particular example, when sex = 0
(i.e., when sex = male), the regression model can be written as:

Y = β0 + β1(age) + β2(0) + β12(age)(0)
  = β0 + β1(age)

and when sex = 1 (i.e., when sex = female), the regression model becomes:

Y = β0 + β1(age) + β2(1) + β12(age)(1)
  = (β0 + β2) + (β1 + β12)(age)

Note that these regression lines have different intercepts and different slopes. In
linear regression, interaction is evaluated with statistical tests on product terms
involving the basic independent variables in the model.
We present here an example of how to assess interaction using STATA, continuing with our
example of the relationship between age and SBP. Fitting the linear regression model to
this data set indicated a linear relationship between age and SBP. Another question these
data can answer is whether an interaction exists between age and smoking status; that is,
whether the slopes of the straight lines relating SBP to age differ significantly for
smokers and for non-smokers. To create the interaction term in STATA, we would type:
. xi: regress sbp i.smk*age
i.smk _Ismk_0-1 (naturally coded; _Ismk_0 omitted)
i.smk*age _IsmkXage_# (coded as above)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 3, 28) = 26.63
Model | 4758.42362 3 1586.14121 Prob > F = 0.0000
Residual | 1667.54513 28 59.5551833 R-squared = 0.7405
-------------+------------------------------ Adj R-squared = 0.7127
Total | 6425.96875 31 207.289315 Root MSE = 7.7172
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ismk_1 | -12.84603 21.71534 -0.59 0.559 -57.32789 31.63583
age | 1.515216 .2703328 5.61 0.000 .9614643 2.068968
_IsmkXage_1 | .4349186 .4048227 1.07 0.292 -.3943232 1.26416
_cons | 58.57428 14.80476 3.96 0.000 28.2481 88.90046
------------------------------------------------------------------------------
The regression model including the interaction term can be written as:

Y = β0 + β1(smoke) + β2(age) + β3(smoke)(age)

From the STATA output, when smoking status = non-smoker (smoke = 0), the fitted model is:

Y = 58.57 − 12.85(0) + 1.52(age) + 0.43(age)(0)
  = 58.57 + 1.52(age)

and when smoking status = smoker (smoke = 1), the fitted model becomes:

Y = 58.57 − 12.85(1) + 1.52(age) + 0.43(age)(1)
  = 45.73 + 1.95(age)
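The algebra above can be checked directly from the estimates reported in the STATA output (a quick sketch using the reported coefficients):

```python
# Coefficients from the regress output: _cons, _Ismk_1, age, _IsmkXage_1.
b0, b_smk, b_age, b_int = 58.57428, -12.84603, 1.515216, 0.4349186

def fitted_line(smoke):
    """Intercept and slope of the SBP-on-age line for a smoking group."""
    return b0 + b_smk * smoke, b_age + b_int * smoke

print(fitted_line(0))  # non-smokers: (58.57428, 1.515216)
print(fitted_line(1))  # smokers: lower intercept, steeper slope
```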
Plotting these two lines (Figure 6-2), we see that they appear almost parallel, which
suggests that there is probably no statistically significant interaction. We can confirm
this by testing the hypothesis H0: β3 = 0, given that smoking status and age are in the
model. The p value for this test is 0.292. Therefore, the slopes of the straight lines
relating SBP and age do not significantly differ for smokers and non-smokers; there is no
interaction between age and smoking status in this situation. Equivalently, smokers have
a consistently higher SBP than non-smokers across these ages, and the rate of change with
respect to age does not differ significantly between the two groups.
Figure 6-2 Comparison by smoking status of straight-line regressions of SBP on age
7. REGRESSION DIAGNOSTICS
7.1 Normality
The assumption of normality can be assessed formally with a normal plot of the residuals.
The residuals are assumed to be independent normal random variables with mean 0 and
constant variance σ². Violating this assumption invalidates the model's inferences, so to
make sure the model is appropriate we check whether the residuals are normally
distributed. Checking this assumption can be performed as follows:
STATA command:
1. Estimate the residuals
After fitting the regression model with the regress command, the residuals can be
estimated with the predict command and the resid option:
. xi: regress sbp i.drug age quet
i.drug _Idrug_0-1 (naturally coded; _Idrug_0 omitted)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 3, 28) = 32.42
Model | 4989.65072 3 1663.21691 Prob > F = 0.0000
Residual | 1436.31803 28 51.2970727 R-squared = 0.7765
-------------+------------------------------ Adj R-squared = 0.7525
Total | 6425.96875 31 207.289315 Root MSE = 7.1622
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Idrug_1 | 10.62885 2.582308 4.12 0.000 5.339232 15.91847
age | 1.003488 .3102821 3.23 0.003 .3679039 1.639072
quet | 9.705102 4.339773 2.24 0.033 .8154801 18.59472
_cons | 51.38847 10.11437 5.08 0.000 30.67013 72.10681
------------------------------------------------------------------------------
. predict error, resid
2. Create a normal probability plot
. pnorm error
Figure 7-1 Normal probability plot of residuals from the full multiple regression model
3. Test for normality
. swilk error
Shapiro-Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+--------------------------------------------------
error | 32 0.97405 0.865 -0.300 0.61789
The test's null hypothesis is that the residuals are normally distributed.
The Shapiro-Wilk statistic equals 0.97405, with a p value of 0.61789.
We therefore fail to reject the null hypothesis and conclude that the residuals are
normally distributed.
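The coordinates behind the pnorm plot can be reproduced by hand: plot the empirical probabilities i/(N+1) against the standard-normal CDF of the standardized residuals. A small sketch with hypothetical residuals (not the fitted model's):

```python
import math
from statistics import mean, stdev

resid = [-9.1, -4.2, -1.5, 0.3, 2.8, 5.0, 7.7]   # hypothetical residuals

def phi(z):
    """Standard-normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

m, s = mean(resid), stdev(resid)
# (empirical P[i] = i/(N+1), Normal F[(e - m)/s]) pairs; the points lie near
# the 45-degree line when the residuals are approximately normal.
pairs = [((i + 1) / (len(resid) + 1), phi((e - m) / s))
         for i, e in enumerate(sorted(resid))]
print([(round(p, 2), round(q, 2)) for p, q in pairs])
```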
7.2 Linearity
An augmented component-plus-residual plot against each independent variable suggests
whether the relationship is linear. This can be performed as below.
STATA command:
1. Plot Augmented component-plus-residual versus independent variable age
. acprplot age, mspline msopts(bands(13)) title(Augmented component-plus-residual versus age plot)
Figure 7-2 Augmented component-plus-residual versus age
2. Plot Augmented component-plus-residual versus independent variable BMI
. acprplot quet, mspline msopts(bands(13)) title(Augmented component-plus-residual versus BMI plot)
Figure 7-3 Augmented component-plus-residual versus BMI
These graphs suggest neither definite linearity nor definite curvature. The relationships
might be linear, with a few outliers preventing the smoothed line from being straight.
7.3 Homoskedasticity
The assumption of homoskedasticity can be assessed by plotting the residuals against the
predicted values of the dependent variable; this shows whether the spread of the
residuals is constant across the fitted values. For example, after fitting drug, age, and
BMI to predict SBP, a plot of residuals against the predicted values of SBP is shown in
Figure 7-4. In the graph, the residuals lie symmetrically below and above the horizontal
line, which suggests constant variance; a formal test can be performed as below.
. estat hettest error
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: error
chi2(1) = 1.82
Prob > chi2 = 0.1771
The test suggests that the variance of the residuals is constant.
Figure 7-4 Residuals from the regression model, plotted against predicted values of SBP
7.4 Multicollinearity
This assumption does not matter for simple regression analysis, but it is important to
check in the multiple regression model. Multicollinearity occurs when independent
variables are correlated with one another; it inflates the standard errors of the
coefficients, makes the estimates unstable, and undermines inference about the
coefficients. Therefore, pairwise correlations (r) should be examined to see which
variables are highly correlated and may cause a collinearity effect if fitted together.
When collinearity is present, adding or deleting independent variables produces large
changes in the coefficients and their standard errors; a standard error can be as large
as the coefficient, or even larger.
In addition to pairwise correlations, the variance inflation factor (VIF) is usually used
to detect multicollinearity. It measures how much the variance of a regression
coefficient is inflated compared with the situation in which the independent variables
are not linearly related. It can be estimated in STATA as below. A VIF > 10 suggests
collinearity, whereas values close to 1 show no evidence of collinearity; a variable with
VIF > 10 should be considered for omission from the model. For this example, there is no
evidence of collinearity.
. estat vif
Variable | VIF 1/VIF
-------------+----------------------
age | 2.82 0.355212
quet | 2.81 0.355588
_Idrug_1 | 1.00 0.996620
-------------+----------------------
Mean VIF | 2.21
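The VIF has a simple closed form: VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. With only two correlated predictors, R_j² is just their squared pairwise correlation, which allows a quick hand check (a sketch; the correlation value is back-calculated for illustration, not taken from the data):

```python
def vif_from_r2(r_squared):
    """Variance inflation factor from the R^2 of predictor j on the others."""
    return 1.0 / (1.0 - r_squared)

# A pairwise correlation of about 0.8 between age and quet reproduces the
# VIF of roughly 2.8 reported by estat vif above.
print(round(vif_from_r2(0.8 ** 2), 2))
```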
7.5 Outliers
7.5.1 Identifying X outliers
Identifying X outliers in multiple regression is a multi-dimensional problem involving
all of the X variables. Considering each variable separately is not appropriate: an
extreme value on a single variable (a univariate outlier) may not influence the
regression line, whereas a multivariate outlier, which univariate screening cannot
detect, may affect the fitted regression. We use the H (hat) matrix to identify X
outliers:

H = X(X'X)⁻¹X'

Let hii (the hat, or leverage, value) be the i-th element on the main diagonal of H,
which can be obtained from:

hii = xi'(X'X)⁻¹xi,  with 0 ≤ hii ≤ 1 and Σ hii = p

where p is the number of regression parameters, including the constant term. The higher
the hii value (the higher the leverage), the further xi lies from the centre of the X
space, and hence the more extreme the observation is as an X outlier. A common suggestion
is to flag a case whose leverage exceeds twice the average, 2p/n. Another criterion
classifies leverages of 0.2-0.5 as moderate and above 0.5 as high. Estimation of the
leverage values can be performed as below.
STATA command:
2. Estimate the leverage values
. predict xdist, hat
. sum xdist,det
Leverage
-------------------------------------------------------------
Percentiles Smallest
1% .0549423 .0549423
5% .0568518 .0568518
10% .0762993 .0651438 Obs 32
25% .0843276 .0762993 Sum of Wgt. 32
50% .1157301 Mean .125
Largest Std. Dev. .056152
75% .1453192 .2223231
90% .2223231 .2470635 Variance .0031531
95% .2482278 .2482278 Skewness 1.114692
99% .2663676 .2663676 Kurtosis 3.49025
3. Identify outlying cases
For this model, the cutoff is twice the average leverage: 2p/n = 2(4)/32 = 0.25. Only one
subject has a leverage value above this cutoff, and even the highest leverage does not
exceed the high-leverage threshold of 0.5. Subject id = 2 has the highest leverage value,
0.266. Keep in mind that this subject is a potential X outlier, but we still need to
explore whether it actually influences the regression model (the predicted values).
. list person error xdist quet age drug if xdist>2*(4/32) & xdist!=.
+----------------------------------------------------+
| person error xdist quet age drug |
|----------------------------------------------------|
18. | 2 -2.082762 .2663676 3.251 41 0 |
+----------------------------------------------------+
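For simple regression the leverage has a closed form, hii = 1/n + (xi − x̄)²/Sxx, which makes the 2p/n rule easy to check by hand (a toy sketch with hypothetical x values, not the course data):

```python
# Leverage in simple regression: h_ii = 1/n + (x_i - xbar)^2 / Sxx,
# with p = 2 parameters (slope + constant).
x = [40, 41, 44, 48, 51, 62]          # hypothetical ages
n, p = len(x), 2
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

cutoff = 2 * p / n                    # the 2p/n rule of thumb
flagged = [xi for xi, hi in zip(x, h) if hi > cutoff]
print(round(sum(h), 6), flagged)      # the leverages always sum to p
```

Here the isolated value 62 is the only case whose leverage exceeds the cutoff, exactly the pattern the rule is designed to flag.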
7.5.2 Identifying Y outliers
This can be done using studentized deleted residuals. The deleted residual for case i is

di = Yi − Ŷi(i)

where Ŷi(i) is the fitted value for case i when case i is excluded from the fit. The
studentized deleted residual is

di* = di / s(di)

or, in a computational form that avoids refitting the model,

di* = ei [ (n − p − 1) / ( SSE(1 − hii) − ei² ) ]^(1/2)

The studentized deleted residual follows a t-distribution with n − p − 1 degrees of
freedom. To identify outliers, it is compared with the t critical value at a chosen
significance level, say 0.05, as below.
STATA command:
1. Estimate the studentized deleted residuals
. predict estu,rstudent
. sum estu,det
Studentized residuals
-------------------------------------------------------------
Percentiles Smallest
1% -2.514817 -2.514817
5% -1.589173 -1.589173
10% -1.08973 -1.185283 Obs 32
25% -.5998537 -1.08973 Sum of Wgt. 32
50% -.0976471 Mean .0120775
Largest Std. Dev. 1.08881
75% .626205 1.436678
90% 1.436678 1.613788 Variance 1.185507
95% 2.430551 2.430551 Skewness .3678051
99% 2.576622 2.576622 Kurtosis 3.461563
2. Identify outlying cases
. list person error estu quet age drug if abs(estu)>invttail(27,0.05)
+-----------------------------------------------------+
| person error estu quet age drug |
|-----------------------------------------------------|
8. | 8 14.76043 2.430551 3.612 48 1 |
9. | 9 14.84753 2.576622 2.368 44 1 |
12. | 12 -14.32618 -2.514817 4.032 51 1 |
+-----------------------------------------------------+
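The formula above avoids actually re-fitting the model n times. A hedged numpy sketch on simulated toy data (not the course data set) computes the studentized deleted residuals directly from the ordinary residuals, SSE, and leverages:

```python
import numpy as np

# Simulated toy regression (n = 20, p = 3); purely illustrative data.
rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtXinv = np.linalg.inv(X.T @ X)
beta = XtXinv @ X.T @ y
e = y - X @ beta                   # ordinary residuals
SSE = e @ e
h = np.diag(X @ XtXinv @ X.T)      # leverages h_ii

# Studentized deleted residual:
# d_i* = e_i * sqrt( (n - p - 1) / ( SSE*(1 - h_ii) - e_i^2 ) )
t = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))

# |d_i*| would then be compared with a t critical value on n - p - 1 df,
# as the STATA invttail(27,0.05) condition does for the course data.
print(np.round(t, 3))
```

This matches STATA's `predict ..., rstudent` definition.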
7.5.3 Identifying influential cases
After identifying cases that are outlying with respect to their X values and/or their Y values,
the next step is to ascertain whether or not these outlying cases are influential. We shall take
up three measures of influence that are widely used in practice, each based on the omission of a
single case to measure its influence.
i) Influence on the fitted values (DFFITS)
The first task is to explore whether these X and Y outliers influence the fitted values, since not every X or Y outlier affects the fitted values. DFFITS is a measure that takes into account both the X (leverage) and Y (d_i*) outlying indexes and is widely used in practice:

DFFITS_i = ( Ŷ_i − Ŷ_i(i) ) / [ MSE_(i) h_ii ]^(1/2)

The letters DF refer to the difference between the fitted value Ŷ_i for the ith case when all n cases are used in fitting the model, and the fitted value Ŷ_i(i) for the ith case when the ith case is removed from the model. The denominator is a standardization, so DFFITS reflects the number of estimated standard deviations by which the fitted value Ŷ_i changes when the ith case is removed. DFFITS can also be calculated from d_i* and h_ii as follows:

DFFITS_i = d_i* [ h_ii / (1 − h_ii) ]^(1/2)

As the equation shows, DFFITS is a function of d_i*, scaled up or down according to the h_ii value. An absolute DFFITS value exceeding 1 is considered influential for small to medium sample sizes, and exceeding 2*sqrt(p/n) for large sample sizes. This can be explored as follows:
STATA command
1. Estimate DFFITS
. predict dfits, dfits
2. Identify influential cases
. list person error dfits quet age drug if abs(dfits)>1
+-----------------------------------------------------+
| person error dfits quet age drug |
|-----------------------------------------------------|
8. | 8 14.76043 1.041157 3.612 48 1 |
9. | 9 14.84753 1.377664 2.368 44 1 |
12. | 12 -14.32618 -1.440561 4.032 51 1 |
+-----------------------------------------------------+
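The second DFFITS formula can be verified against the leave-one-out definition numerically; a minimal numpy sketch on simulated toy data (not the course data):

```python
import numpy as np

# Simulated toy regression (n = 20, p = 3); illustrative only.
rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtXinv = np.linalg.inv(X.T @ X)
beta = XtXinv @ X.T @ y
e = y - X @ beta
SSE = e @ e
h = np.diag(X @ XtXinv @ X.T)

# DFFITS_i = d_i* * sqrt( h_ii / (1 - h_ii) )
tstar = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))
dffits = tstar * np.sqrt(h / (1 - h))

# Flag |DFFITS| > 1 (small/medium n) or > 2*sqrt(p/n) (large n).
print("flagged:", np.where(np.abs(dffits) > 1)[0])
```

This mirrors STATA's `predict ..., dfits` followed by the `abs(dfits)>1` listing.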
ii) Influence on all of the estimated regression coefficients (Cook’s distance)
Cook’s distance (D_i) measures the impact of the ith case on all regression coefficients (the coefficient vector) when the ith case is omitted. It is defined as:

D_i = (b − b_(i))' (X'X) (b − b_(i)) / (p · MSE)

D_i is interpreted as follows:
- if D_i > 4/n, or
- if D_i > F(1 − α; p, n − p)
then the ith case has a substantial influence on the estimated coefficients.
STATA command:
1. Estimate Cook’s distance
. predict cookd1,cooks
2. Identify influential cases
. list person error cookd1 quet age drug if cookd1>(4/32)
+----------------------------------------------------+
| person error cookd1 quet age drug |
|----------------------------------------------------|
8. | 8 14.76043 .2305867 3.612 48 1 |
9. | 9 14.84753 .3949498 2.368 44 1 |
10. | 10 8.75689 .1641441 4.637 64 1 |
12. | 12 -14.32618 .4359133 4.032 51 1 |
+----------------------------------------------------+
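Cook's distance can also be computed case by case without n re-fits, using the standard computational shortcut D_i = e_i^2 h_ii / (p · MSE · (1 − h_ii)^2), which is algebraically equal to the quadratic-form definition above. A toy numpy sketch (simulated data, not the course data):

```python
import numpy as np

# Simulated toy regression (n = 20, p = 3); illustrative only.
rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtXinv = np.linalg.inv(X.T @ X)
beta = XtXinv @ X.T @ y
e = y - X @ beta
h = np.diag(X @ XtXinv @ X.T)
MSE = (e @ e) / (n - p)

# Shortcut form of D_i = (b - b_(i))' X'X (b - b_(i)) / (p * MSE)
D = e**2 * h / (p * MSE * (1 - h)**2)

print("flagged:", np.where(D > 4 / n)[0])   # the 4/n rule of thumb
```

This mirrors STATA's `predict ..., cooksd` followed by the `cookd1>(4/32)` listing.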
iii) Influence on the partial regression coefficients (DFBETAS)
This measures the influence of the ith case on the estimation of each regression coefficient, assessed by:

DFBETAS_k(i) = ( b_k − b_k(i) ) / [ MSE_(i) c_kk ]^(1/2)

where c_kk = the kth diagonal element of (X'X)^(-1), and b_k(i) = the kth regression coefficient when the ith case is omitted.

That is, DFBETAS is the standardized difference between the coefficients estimated with and without the ith case. Absolute DFBETAS values greater than 1 (small to medium sample sizes) or greater than 2/sqrt(n) (large sample sizes) indicate influential cases. This can be estimated in STATA as follows:
STATA command:
1. Estimate DFBETAS
. predict df_drug,dfbeta(_Idrug_1)
. predict df_age,dfbeta(age)
. predict df_bmi,dfbeta(quet)
2. Identify influential cases
. list person error df_drug df_age df_bmi quet age drug if abs(df_drug)>1 |
abs(df_age)>1 | abs(df_bmi)>1
+----------------------------------------------------------------------------+
| person error df_drug df_age df_bmi quet age drug |
|----------------------------------------------------------------------------|
12. | 12 -14.32618 -.4310711 1.128824 -1.263236 4.032 51 1 |
+----------------------------------------------------------------------------+
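The DFBETAS definition can be sketched directly with a leave-one-out re-fit per case; a toy numpy example (simulated data, not the course data set):

```python
import numpy as np

# Simulated toy regression (n = 15, p = 3); illustrative only.
rng = np.random.default_rng(3)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtXinv = np.linalg.inv(X.T @ X)
beta = XtXinv @ X.T @ y
c = np.diag(XtXinv)               # c_kk: kth diagonal element of (X'X)^-1

# DFBETAS_k(i) = (b_k - b_k(i)) / sqrt(MSE_(i) * c_kk), by direct re-fit.
dfbetas = np.empty((n, p))
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    r = y[mask] - X[mask] @ b_i
    mse_i = (r @ r) / (n - 1 - p)
    dfbetas[i] = (beta - b_i) / np.sqrt(mse_i * c)

# Flag |DFBETAS| > 1 (small/medium n) or > 2/sqrt(n) (large n).
print("flagged (case, coef):", np.argwhere(np.abs(dfbetas) > 1))
```

This parallels STATA's `predict ..., dfbeta()` calls and the `abs(df_*) > 1` listing above; in practice the closed-form update b − b_(i) = (X'X)^(-1) x_i e_i / (1 − h_ii) avoids the explicit loop.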
Finally, the diagnostics need to be summarized as a whole by considering the three measures together: which subjects are influential cases affecting the predicted values (DFFITS), the estimated coefficient vector (Cook's D), and the individual coefficients (DFBETAS).
. list person sbp drug age quet smkgr if abs( dfits)>1 | cookd1>(4/32) |
abs(df_drug)>1 | abs(df_age)>1 | abs(df_bmi)>1
+-------------------------------------------+
| person sbp drug age quet smkgr |
|-------------------------------------------|
8. | 8 160 1 48 3.612 1 |
9. | 9 144 1 44 2.368 1 |
10. | 10 180 1 64 4.637 1 |
12. | 12 138 1 51 4.032 1 |
+-------------------------------------------+
There are 4 subjects who are potentially influential cases according to the above criteria. In summary, the residuals are normally distributed with constant variance, but some outliers influence the regression model. We thus need to explore the data, starting with the subjects listed above. Their data need to be checked and validated for all variables to make sure the data are correct. If some data are incorrect, they should be corrected, the regression model re-fitted, and the model diagnostics re-assessed. If the data for these subjects are correct as they are, try excluding these subjects and see how the regression model changes.
Assignment IV (25%) Due Date: Sep 27, 2017
A cohort study of a Thai population free of chronic kidney disease (CKD) at baseline was conducted. Subjects were followed up for 7 years, with newly diagnosed CKD as the outcome of interest. The fourth objective was to determine the factors associated with the percent change of GFR at 7 years of follow-up. Many factors may be associated with the percent change of GFR, such as:
Demographic data: age, gender, BMI, waist-hip ratio, systolic blood pressure
Risk behavior: smoking, alcohol consumption, exercise
Comorbidity: hypertension (HT), diabetes mellitus (DM), high cholesterol
Medical history: NSAID use
The aims of this assignment are to:
a) Fit the linear regression model step by step, and explain the method used.
b) What is the parsimonious equation?
c) Perform diagnostic measures, checking the assumptions of the final model.
d) Interpret the results of the final model.
e) Create a dummy table and present the results according to it.
f) Interpret and write up the results according to the table.
*** The data are given in the data set SEEK2_data.dta