TRANSCRIPT
RACE 615 Introduction to Medical Statistics
Multiple Linear Regression
Assist. Prof. Dr. Sasivimol Rattanasiri
Doctor of Philosophy Program in Clinical Epidemiology, Section for Clinical Epidemiology & Biostatistics Faculty of Medicine Ramathibodi Hospital, Mahidol University
CONTENTS
1. MULTIPLE REGRESSION MODEL
2. FITTING MULTIPLE REGRESSION MODEL
3. FITTING CATEGORICAL INDEPENDENT VARIABLES
4. EVALUATING THE MULTIPLE REGRESSION MODEL
   4.1 Test for overall regression
   4.2 The coefficient of multiple determination
   4.3 Test for partial regression coefficients
   4.4 Evaluating multiple regression by STATA
5. MODEL SELECTION
   5.1 Forward selection procedure
   5.2 Backward elimination procedure
   5.3 Stepwise procedure
6. CONFOUNDING AND INTERACTION
   6.1 Confounding effects
   6.2 Interaction effects
7. REGRESSION DIAGNOSTICS
   7.1 Normality
   7.2 Linearity
   7.3 Homoskedasticity
   7.4 Multicollinearity
   7.5 Outliers
Assignment V
OBJECTIVES
This module will help you to be able to:
- Fit a regression model considering more than two co-variables simultaneously
- Perform model selection
- Assess goodness of fit of the model and check assumptions
- Interpret and report results
REFERENCES
1. Neter J, Wasserman W, and Kutner MH. Applied linear statistical models, third edition. Boston: IRWIN, 1990; 21-433.
2. Kleinbaum DG, Kupper LL, Muller KE, and Nizam A. Applied regression analysis and other multivariable methods. Washington: Duxbury Press, 1998; 39-212.
3. Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
4. Hosmer D and Lemeshow S. Applied regression analysis and other multivariable methods. Washington: Duxbury Press, 1998.
READING SECTION
Read Neter J et al., Chapters 7, 8, 11, and 12.
ASSIGNMENT V (25%) p.41, Due: October 8, 2015
1. MULTIPLE REGRESSION MODEL

In Module IV, we discussed simple linear regression and correlation analysis. A simple regression model includes one predictor and one outcome. In practice, an outcome is usually affected by more than one predictor. For example, systolic blood pressure (SBP) may be determined by age, smoking behavior, and body mass index (BMI). To investigate the more complicated relationship between an outcome and a number of predictors, we use a natural extension of simple linear regression known as multiple regression analysis.
There are several advantages of using multiple regression analysis.
1) Develop a prognostic index to predict the outcome from several predictors. For example, the SBP may be predicted by age, smoking behavior, and BMI.
2) Adjust (control) for potential confounding factors that the study design did not plan to adjust for. For example, the effect of BMI on the SBP may be confounded by age and smoking behavior. Fitting all these variables together allows the effect of BMI to be assessed after adjusting for age and smoking behavior.
A multiple regression model with $y$ as an outcome (dependent variable) and $x_1, x_2, x_3, \dots, x_k$ as $k$ predictors (independent variables) is written as:

y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \varepsilon    (1)

where $\alpha$ refers to the population intercept or constant term,
$\beta_1, \beta_2, \beta_3, \dots, \beta_k$ are the population slopes or regression coefficients of the independent variables $X_1, X_2, X_3, \dots, X_k$, respectively,
$\varepsilon$ refers to the random error term.

The constant term $\alpha$ is the mean value of the outcome when all predictors in the model take the value 0.

The coefficients $\beta_1, \beta_2, \beta_3, \dots, \beta_k$ are called the partial regression coefficients. For example, $\beta_1$ is the partial coefficient of $x_1$, and it gives the change in $y$ due to a one-unit change in $x_1$ when all other predictors included in the model are held constant.
A positive value for a particular $\beta_i$ in model (1) indicates a positive relationship between the outcome ($Y$) and the related predictor ($X$). A negative value for a particular $\beta_i$ in that model indicates a negative relationship between the outcome ($Y$) and the related predictor ($X$).

The multiple regression model (1) can only be used when the relationship between the outcome and each predictor is linear. Each of the $X_i$ variables in model (1) represents a single variable raised to the first power. This model is referred to as a first-order multiple regression model.
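For contrast, a hypothetical model containing a squared term, such as

y = \alpha + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon

would be a second-order model in $X_1$, because one predictor enters at a power higher than the first.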
For a data sample, a multiple linear regression is written as:

\hat{y}_i = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_k X_k    (2)

where
$\hat{y}_i$ is the estimated or predicted value of the outcome $y_i$,
$a$ is the unbiased estimator of the population intercept,
$b_1, b_2, b_3, \dots, b_k$ are the unbiased estimators of the population slopes of the predictive variables $X_1, X_2, X_3, \dots, X_k$, respectively,
$e_i$ is the random error, which is the difference between the observed and predicted values of the dependent variable $y_i$. The technical term for this distance is a residual.
2. FITTING MULTIPLE REGRESSION MODEL

The method of least squares is used to estimate the parameters in the multiple regression model. In general, the least squares method chooses as the best-fitting model the one that minimizes the sum of squared distances between the observed and predicted values of the outcome, which can be defined as:

\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b_1 X_{1i} - b_2 X_{2i} - b_3 X_{3i} - \cdots - b_k X_{ki})^2    (3)
The least squares estimates of the regression coefficients $a, b_1, b_2, b_3, \dots, b_k$ in model (3) are obtained by using matrix mathematics. In this text we do not present the matrix formulas for calculating these least squares estimates; the matrix formulation for multiple regression can be found in Kleinbaum, Appendix B, pages 732-743, and Neter, Section 7.2, pages 236-239.
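For reference, the closed form derived in those texts can be stated compactly. Writing the data as a response vector and a design matrix (a standard result, sketched here rather than taken from this module), the model and its least squares solution are:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \mathbf{b} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}

where $\mathbf{X}$ is the $n \times (k+1)$ matrix whose first column is all 1s (for the intercept) and whose remaining columns hold the $k$ predictors.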
Instead of applying the formulas manually, the calculations in a multiple regression analysis are made by using statistical software packages, such as STATA or SPSS. Even for a multiple regression with two predictors, the formulas are complex and manual calculation is time consuming. In this module we perform the multiple regression analysis using the STATA program. However, solutions obtained with other statistical software packages, such as SPSS, SAS, or MINITAB, can be interpreted in the same way.
Example 2-1 Researchers wanted to investigate how SBP varies with the QUET¹ and age. The outcome is 'sbp' and the predictors are 'quet' and 'age'. To fit the multiple linear regression using STATA, we would type:

¹ QUET stands for "quetelet index", a measure of size defined by QUET = 100 × (weight/height²)
regress sbp quet age

1.    Source |       SS       df       MS        2. Number of obs =      32
  -----------+------------------------------    3. F(  2,    29) =   25.92
       Model |  4120.59224     2  2060.29612       Prob > F      =  0.0000
    Residual |  2305.37651    29  79.4957416    4. R-squared     =  0.6412
  -----------+------------------------------    5. Adj R-squared =  0.6165
       Total |  6425.96875    31  207.289315    6. Root MSE      =   8.916

   ---------------------------------------------------------------------------
7.       sbp |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
   ----------+----------------------------------------------------------------
        quet |   9.750732   5.402456    1.805   0.081    -1.298531        20.8
         age |   1.045157   .3860567    2.707   0.011     .2555828    1.834732
       _cons |   55.32344   12.53475    4.414   0.000       29.687    80.95987
   ---------------------------------------------------------------------------
There are two predictors, the QUET and age, included in this model. In the STATA output no. 7, the variable '_cons' refers to the intercept. The regression coefficients for the intercept, QUET, and age are presented in the second column (labelled 'Coef.'). The multiple regression model for this example is given as:

\hat{sbp} = 55.32 + 9.75(quet) + 1.05(age)
From this equation, the estimated intercept is 55.32. It is the estimated SBP for QUET = 0 and age = 0. This means that a patient who has zero QUET and zero age is expected to have an SBP of 55.32 mmHg. This is the technical interpretation of the intercept. In reality, it may not be meaningful, because none of the patients in our sample has both zero QUET and zero age. In most medical applications the value of the intercept has no practical meaning.
The estimated coefficient of the QUET ($b_1$) is 9.75. This value gives the change in the estimated SBP for a one-unit change in the QUET when age is held constant. Thus, we can state that a patient with one extra unit of QUET, but the same age, is expected to have an SBP higher by 9.75 mmHg.

The estimated coefficient of age ($b_2$) is 1.05. This value gives the change in the estimated SBP for a one-unit change in age when the QUET is held constant. Thus, we can state that a patient with one extra year of age, but the same QUET, is expected to have an SBP higher by 1.05 mmHg.
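To make the fitted equation concrete, consider a hypothetical patient (the values below are illustrative, not taken from the data set) with QUET = 3.5 and age = 50:

\hat{sbp} = 55.32 + 9.75(3.5) + 1.05(50) = 55.32 + 34.13 + 52.50 = 141.95 \approx 142 \text{ mmHg}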
3. FITTING CATEGORICAL INDEPENDENT VARIABLES

All the predictive variables that we have considered up to this point have been measured on a continuous scale. However, regression analysis can be generalized to incorporate categorical variables into the model. For example, we may want to test whether smoking status affects the SBP. The generalization is based on the use of dummy variables, which is the central idea of this section.
A dummy variable is a binary variable that takes on the values 0 and 1 and is used to identify the different categories of a categorical variable. The term "dummy" indicates that the numerical values (such as 0 and 1) assumed by the variable have no quantitative meaning but are used only to identify the different categories of the categorical variable under consideration.
For example, we have a variable for smoking status coded 1 for current smoker, 2 for ex-smoker, and 3 for never smoker. Thus, the dummy variables for smoking behavior are presented as follows:

Smoking behavior    Dummy 1   Dummy 2   Dummy 3
Current smoker         1         0         0
Ex-smoker              0         1         0
Never smoker           0         0         1
To include these variables in a multiple regression model, we use k-1 dummy variables for k categories. The omitted category is referred to as the reference group. It is arbitrary which group is assigned to be the reference group; the choice is usually dictated by subject-matter considerations. For example, if the never smoker category is taken as the reference group in the multiple regression analysis, only dummy variables 1 and 2 are fitted in the model.
There are two easy ways to create dummy variables in STATA. The first is to use the tabulate command with the generate( ) option, as shown below.
tabulate smkgr, gen(dum)

      smkgr |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       25.00       25.00
          2 |          9       28.13       53.13
          3 |         15       46.88      100.00
------------+-----------------------------------
      Total |         32      100.00
list smkgr dum1 dum2 dum3

     +----------------------------+
     | smkgr   dum1   dum2   dum3 |
     |----------------------------|
  1. |     1      1      0      0 |
  2. |     1      1      0      0 |
  3. |     1      1      0      0 |
  4. |     2      0      1      0 |
  5. |     2      0      1      0 |
     |----------------------------|
  6. |     2      0      1      0 |
  7. |     3      0      0      1 |
  8. |     3      0      0      1 |
  9. |     3      0      0      1 |
     +----------------------------+
The tabulate command with the generate option created three dummy variables called dum1, dum2, and dum3. STATA creates all three dummy variables and lets users choose their own reference. Suppose that we add smoking status to the regression model for predicting SBP that already contains the QUET and age. If the never smoker category is taken as the reference group, only dum1 and dum2 are fitted in the model, as follows:
regress sbp quet age dum1 dum2

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  4,    27) =   24.96
       Model |  5058.30759     4   1264.5769           Prob > F      =  0.0000
    Residual |  1367.66116    27  50.6541171           R-squared     =  0.7872
-------------+------------------------------           Adj R-squared =  0.7556
       Total |  6425.96875    31  207.289315           Root MSE      =  7.1172

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   8.066741   4.332307     1.86   0.074    -.8224187     16.9559
         age |   1.207758   .3111643     3.88   0.001     .5693015    1.846214
        dum1 |   13.26391   3.134341     4.23   0.000     6.832772    19.69505
        dum2 |   6.908531   3.047391     2.27   0.032     .6557999    13.16126
       _cons |   47.20073   10.40753     4.54   0.000     25.84624    68.55522
------------------------------------------------------------------------------
The second is to use the xi command to create dummy variables:

xi i.smkgr
i.smkgr           _Ismkgr_1-3         (naturally coded; _Ismkgr_1 omitted)
list smkgr _Ismkgr_2 _Ismkgr_3

     +-----------------------------+
     | smkgr   _Ismkg~2   _Ismkg~3 |
     |-----------------------------|
  1. |     1          0          0 |
  2. |     1          0          0 |
  3. |     1          0          0 |
  9. |     2          1          0 |
 10. |     2          1          0 |
     |-----------------------------|
 11. |     2          1          0 |
 18. |     3          0          1 |
 19. |     3          0          1 |
 20. |     3          0          1 |
     +-----------------------------+
The xi command created two dummy variables called _Ismkgr_2 and _Ismkgr_3 and omitted the dummy variable for group 1 as the reference group. By default, the xi command assigns the first category as the reference group. If we want another category to be the reference, this can be re-assigned using the char var [omit] # command. For example, if we would like group 3 to be the reference group, we would type
char smkgr [omit] 3
xi i.smkgr
i.smkgr           _Ismkgr_1-3         (naturally coded; _Ismkgr_3 omitted)
The xi command is a prefix command, which can be followed with another modeling
command. To fit the multiple regression of SBP on the QUET, age, and smoking status, we
would type
char smkgr [omit] 3
xi: regress sbp quet age i.smkgr
i.smkgr           _Ismkgr_1-3         (naturally coded; _Ismkgr_3 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  4,    27) =   24.96
       Model |  5058.30759     4   1264.5769           Prob > F      =  0.0000
    Residual |  1367.66116    27  50.6541171           R-squared     =  0.7872
-------------+------------------------------           Adj R-squared =  0.7556
       Total |  6425.96875    31  207.289315           Root MSE      =  7.1172

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   8.066741   4.332307     1.86   0.074    -.8224187     16.9559
         age |   1.207758   .3111643     3.88   0.001     .5693015    1.846214
   _Ismkgr_1 |   13.26391   3.134341     4.23   0.000     6.832772    19.69505
   _Ismkgr_2 |   6.908531   3.047391     2.27   0.032     .6557999    13.16126
       _cons |   47.20073   10.40753     4.54   0.000     25.84624    68.55522
------------------------------------------------------------------------------
With this command, STATA treats quet and age as continuous variables and smkgr as dummy variables. We can interpret that, with identical QUET and age, current smokers have an SBP about 13.26 mmHg higher than never smokers, a statistically significant difference.
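One way to read this output (a sketch based on the coefficients above, with never smokers as the reference group) is as three parallel fitted equations, one per smoking group:

\text{never smoker: } \hat{sbp} = 47.20 + 8.07(quet) + 1.21(age)
\text{ex-smoker: } \hat{sbp} = (47.20 + 6.91) + 8.07(quet) + 1.21(age)
\text{current smoker: } \hat{sbp} = (47.20 + 13.26) + 8.07(quet) + 1.21(age)

The dummy coefficients shift only the intercept (to 54.11 for ex-smokers and 60.46 for current smokers); the slopes for quet and age are shared by all three groups.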
4. EVALUATING THE MULTIPLE REGRESSION MODEL
Before using a multiple regression model to predict or estimate, it is desirable to determine
first whether it adequately describes the relationship between the outcome and a set of
predictors and whether it can be used effectively for the purpose of prediction.
4.1 Test for overall regression
We now consider an overall test for a regression model which contains k predictors in the
model. The null hypothesis for this test is that there is no linear relationship between the
outcome and the set of predictors. In other words, all of the predictors considered together
do not explain a significant amount of the variation in the outcome. The null and
alternative hypotheses are defined as:
H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0
H_a: \text{not all } \beta_j \ (j = 1, \dots, k) \text{ equal zero}
An ANOVA approach can be used to perform this test. The particular form of an ANOVA
table for regression analysis is presented in Table 4-1. The ANOVA approach is based on
the partitioning of the total variation of the observed values $y_i$ into two components, as follows:

(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)    (4)

total variation = explained variation + unexplained variation
Thus, the total variation can be viewed as the sum of two components:
1. Explained variation is the variation of the predicted values $\hat{y}_i$ around the mean $\bar{y}$, which is measured by the regression sum of squares (SSR). This indicates how much of the variation of the observed values $y_i$ can be explained by the predictive variables that are included in the regression model. The mean square regression (MSR) is obtained by dividing the regression sum of squares by its corresponding degrees of freedom.
2. Unexplained variation is the variation of the observed values $y_i$ around the fitted regression line, which is measured by the error sum of squares (SSE). The mean square residual (MSE) is obtained by dividing the error sum of squares by its corresponding degrees of freedom.
Table 4-1 ANOVA table for multiple regression

Source of variation   SS                                             df        MS                  F         P value
Regression            $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$   $k$       MSR = SSR/k         MSR/MSE
Error                 $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$       $n-k-1$   MSE = SSE/(n-k-1)
Total                 $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$         $n-1$
The appropriate statistical test for significant overall regression is the F test, which is obtained by dividing the mean square regression by the mean square residual as follows:

F = \frac{MSR}{MSE}    (5)

This test has an F distribution with k and n-k-1 degrees of freedom. The computed value of F can then be converted to the associated p value. The last step is to compare the p value with the level of significance and decide whether the null hypothesis is rejected or not.
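In STATA, the conversion from F to p value can be reproduced from the stored results of regress; a minimal sketch (run immediately after any regress command) is:

* upper-tail p value of the overall F test, using regress's stored results
display Ftail(e(df_m), e(df_r), e(F))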
4.2 The coefficient of multiple determination
In section 2.4.2 of module IV, we discussed about the coefficient of determination for a
simple linear regression model. It is the proportion of the total sum of squares (SST) that is
explained by the simple regression model. The coefficient of determination for the multiple
regression model, usually called the coefficient of multiple determination, is denoted by 2R
. It is defined as the proportion of the total sum of squares (SST) that is explained by the
multiple regression model. It tells us how good the multiple regression model is and how
well the predictive variables included in the model explain the outcome.
The calculation of $R^2$ is also based on the ANOVA table presented in Table 4-1, and it is defined as:

R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{SSE}{SST}    (8)
However, $R^2$ has one major shortcoming: its value generally increases as we add more and more predictive variables to the regression model. This does not imply that a regression model with a higher value of $R^2$ does a better job of predicting the outcome. Such a value of $R^2$ can be misleading, and it will not represent the true power of the predictive variables in the regression model. To eliminate this shortcoming of $R^2$, it is preferable to use the adjusted $R^2$, a modified measure that adjusts for the number of predictive variables in the model. The adjusted $R^2$ is determined by dividing each sum of squares by its associated degrees of freedom. Thus, the adjusted $R^2$ can be defined as:

R_a^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}    (9)
where n is the total number of observations, and k is the number of predictive variables in
the model.
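Equation (9) can also be rewritten in terms of $R^2$ itself, which makes the adjustment explicit (a standard algebraic rearrangement, added here for reference):

R_a^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}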
4.3 Test for partial regression coefficients
Frequently we wish to test whether the addition of a specific predictor ($X^*$), given that others are already in the model ($X_1, X_2, X_3, \dots, X_k$), significantly improves the prediction of the outcome. That is, we want to test the null hypothesis $H_0: \beta^* = 0$ against the alternative hypothesis $H_a: \beta^* \neq 0$ in the model

y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \beta^* X^* + \varepsilon.
The appropriate statistical test for these partial regression coefficients is the partial F test. The concept of this test is to compare the error sum of squares between two models:
1) The full model contains $X_1, X_2, \dots, X_k$ and $X^*$ as the predictors.
2) The reduced model contains $X_1, X_2, \dots, X_k$ but not $X^*$.
The error sum of squares of the full model is smaller than that of the reduced model. The difference between these error sums of squares is called an extra sum of squares.
The error sum of squares when $X_1, X_2, \dots, X_k$ and $X^*$ are in the model is denoted by $SSE(X_1, X_2, \dots, X_k, X^*)$.
The error sum of squares when $X_1, X_2, \dots, X_k$ are in the model is denoted by $SSE(X_1, X_2, \dots, X_k)$.
The extra sum of squares, denoted by $SSR(X^* \mid X_1, X_2, \dots, X_k)$, can be defined as:

SSR(X^* \mid X_1, X_2, \dots, X_k) = SSE(X_1, X_2, \dots, X_k) - SSE(X_1, X_2, \dots, X_k, X^*)
Thus, the partial F test is defined as:

F^* = \frac{SSR(X^* \mid X_1, X_2, \dots, X_k)/1}{SSE(X_1, X_2, \dots, X_k, X^*)/(n-k-1)} = \frac{MSR(X^* \mid X_1, X_2, \dots, X_k)}{MSE(X_1, X_2, \dots, X_k, X^*)}    (6)
This F statistic has an F distribution with 1 and n-k-1 degrees of freedom.
Note: To distinguish the partial F test in equation (6) from the overall F test in equation (5), we use the $F^*$ statistic for the partial F test and the $F$ statistic for the overall F test.
An equivalent way to perform the partial F test is to use a t test. The t test focuses on the null hypothesis $H_0: \beta^* = 0$ and is computed as:

t = \frac{\hat{\beta}^*}{SE(\hat{\beta}^*)}    (7)

where $\hat{\beta}^*$ is the estimated coefficient of the specific predictor ($X^*$) in the regression model $y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \beta^* X^* + \varepsilon$, and $SE(\hat{\beta}^*)$ is the estimate of the standard error of $\hat{\beta}^*$.
This test has a t distribution with n-k-1 degrees of freedom.
Since the two tests are equivalent, the choice is usually made in terms of the available
information provided by the computer package output.
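The equivalence is exact: because the partial F test has 1 numerator degree of freedom, the two statistics are related by

F^* = t^2

so, for example, a reported t of 1.805 corresponds to a partial F of about 3.26.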
4.4 Evaluating multiple regression by STATA
Consider again the STATA output of Example 2-1; the numbered items below walk through it:

regress sbp quet age

1.    Source |       SS       df       MS        2. Number of obs =      32
  -----------+------------------------------    3. F(  2,    29) =   25.92
       Model |  4120.59224     2  2060.29612       Prob > F      =  0.0000
    Residual |  2305.37651    29  79.4957416    4. R-squared     =  0.6412
  -----------+------------------------------    5. Adj R-squared =  0.6165
       Total |  6425.96875    31  207.289315    6. Root MSE      =   8.916

   ---------------------------------------------------------------------------
7.       sbp |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
   ----------+----------------------------------------------------------------
        quet |   9.750732   5.402456    1.805   0.081    -1.298531        20.8
         age |   1.045157   .3860567    2.707   0.011     .2555828    1.834732
       _cons |   55.32344   12.53475    4.414   0.000       29.687    80.95987
   ---------------------------------------------------------------------------
1. Table of the analysis of variance
The ANOVA table shows the values of sum of squares, degrees of freedom and variances as:
- The values of sum of squares are
Sum of squares of total variation: SST=6425.97
Sum of squares of regression: SSR=4120.59
Sum of squares of error: SSE=2305.38
- The degrees of freedom (df) are
df of the total variation is n-1, that is 32-1=31
df of the explained variation is k, that is 2.
df of the unexplained variation is n-k-1, that is 29.
- The variances or mean square (MS) are
Mean square of total variation (MST) is SST/df = 6425.97/31 = 207.29
Mean square of regression (MSR) is SSR/df = 4120.59/2 = 2060.30
Mean square of error (MSE) is SSE/df = 2305.38/29 = 79.50
2. The number of total observations
The number of total observations for this example is 32.
3. The overall F test for overall regression
We now consider the overall F test for the regression model which contains the predictive variables QUET index and age. The null hypothesis for this test is that there is no linear relationship between the SBP and this set of predictive variables. From the ANOVA table, the overall F test for the null hypothesis $H_0: \beta_{quet} = \beta_{age} = 0$ is computed as:

F = \frac{MSR}{MSE} = \frac{2060.30}{79.50} = 25.92
This test has an F distribution with 2 and 29 degrees of freedom. The STATA output indicates that the p value for this example is less than 0.001. As a result, we reject the null hypothesis and conclude that there is a relationship between the SBP and the set of predictive variables QUET index and age.
4. The R²
The estimated $R^2$ for this example is computed as:

R^2 = \frac{SSR}{SST} = \frac{4120.59}{6425.97} = 0.6412

This implies that 64.12% of the variation of SBP is explained by its linear relationship with the QUET index and age.
5. The adjusted R²
Following equation (9), the estimate of the adjusted $R^2$ for this example is computed as:

R_a^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} = 1 - \frac{2305.38/29}{6425.97/31} = 1 - \frac{79.50}{207.29} = 0.6165

This implies that, after adjusting for the number of predictive variables, 61.65% of the variation of SBP is explained by its linear relationship with the QUET and age.
6. The square root of the mean square error
This value is the square root of the mean square error (MSE). The MSE for this example is 79.50, so the square root of the MSE is 8.916.
7. The table for tests of partial regression coefficients
This table shows the set of predictive variables and their corresponding coefficients. The variable '_cons' refers to the intercept. The partial regression coefficients for the QUET and age are presented in the second column (labelled 'Coef.'). Their standard errors (SE) and the values of the t tests for the partial regression coefficients are presented in the third and fourth columns, respectively. The corresponding p values and 95% CIs of the coefficients are presented in the fifth and sixth columns, respectively.
The null hypotheses for the partial regression coefficients can be defined as:

H_0: \beta_1 = 0
H_0: \beta_2 = 0
For our example, the t value for the QUET is 1.81, with a p value of 0.081. Thus, we fail to reject the null hypothesis and conclude that there is no relationship between the QUET and SBP when age is already in the model. In other words, the addition of the QUET, given that age is already in the model, does not significantly contribute to the prediction of the SBP.

The t value for age is 2.71, with a p value of 0.011. Thus, we reject the null hypothesis and conclude that there is a relationship between age and SBP when the QUET is already in the model. In other words, the addition of age, given that the QUET is already in the model, significantly contributes to the prediction of the SBP.
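These t tests can also be reproduced as partial F tests with STATA's test command after regress; a minimal sketch (assuming the model from Example 2-1 is still in memory) is:

regress sbp quet age
* partial F test for quet, given that age is in the model;
* the reported F equals the square of the t statistic (1.805^2 = 3.26)
test quet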
5. MODEL SELECTION
In this section we focus on determining the best (most important or most valid) subset of
the k predictive variables for describing the relationship between the outcome and the
predictive variables. There are many strategies for selecting the best model. Such strategies
have focused on deciding whether a single variable should be added to a model or whether
a single variable should be deleted from a model. In this section we explain an algorithm
for evaluating models with forward selection, backward elimination, and stepwise
procedures. These procedures are widely used in practice.
5.1 Forward selection procedure
This strategy focuses on deciding whether a single variable should be added to a model. In
the forward selection procedure, we proceed as follows:
Step 0:
1. Fit a simple linear regression model for each of the k potential predictive variables.
2. Select the predictive variable that correlates most highly with the outcome.
3. Fit the regression model with the selected predictive variable.
4. Consider the overall F test: if it is not significant, stop and conclude that no predictive variables are important predictors. If the overall F test is significant, add the selected predictor to the model and go to the next step.
Example:
To fit the simple linear regressions for the three independent variables (age, QUET, and smoking status) predicting SBP using STATA, we would type:
regress sbp quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   36.75
       Model |  3537.94585     1  3537.94585           Prob > F      =  0.0000
    Residual |   2888.0229    30  96.2674299           R-squared     =  0.5506
-------------+------------------------------           Adj R-squared =  0.5356
       Total |  6425.96875    31  207.289315           Root MSE      =  9.8116

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   21.49167   3.545147     6.06   0.000     14.25151    28.73182
       _cons |   70.57641   12.32187     5.73   0.000      45.4118    95.74102
------------------------------------------------------------------------------
regress sbp age

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   45.18
       Model |  3861.63037     1  3861.63037           Prob > F      =  0.0000
    Residual |  2564.33838    30  85.4779458           R-squared     =  0.6009
-------------+------------------------------           Adj R-squared =  0.5876
       Total |  6425.96875    31  207.289315           Root MSE      =  9.2454

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     1.6045   .2387159     6.72   0.000     1.116977    2.092023
       _cons |   59.09162   12.81626     4.61   0.000     32.91733    85.26592
------------------------------------------------------------------------------

xi: regress sbp i.smk
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =    1.95
       Model |  393.098162     1  393.098162           Prob > F      =  0.1723
    Residual |  6032.87059    30  201.095686           R-squared     =  0.0612
-------------+------------------------------           Adj R-squared =  0.0299
       Total |  6425.96875    31  207.289315           Root MSE      =  14.181

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Ismk_1 |   7.023529   5.023498     1.40   0.172    -3.235823    17.28288
       _cons |      140.8   3.661472    38.45   0.000     133.3223    148.2777
------------------------------------------------------------------------------
From the STATA output for the prediction of SBP, we see that the highest squared correlation is for age (R-squared = 0.60). The overall F test for the regression of SBP on age is statistically significant (p < 0.001). Therefore, age is added to the model at this step.
Step 1:
1. Fit regression models that contain the variable initially selected (at step 0) and
another predictive variable which is not yet in the model.
2. Consider the t test of the null hypothesis that $\beta_j = 0$ and the p value associated with each remaining variable.
3. Focus on the variable with the largest t statistic, which is equivalent to the smallest p value. If the test is significant, add that predictive variable to the regression model. If it is not significant, use the model from step 0, which has only one predictive variable.
To fit the multiple regressions that contain age and another predictive variable using STATA, we would type:
regress sbp age quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   25.92
       Model |  4120.59224     2  2060.29612           Prob > F      =  0.0000
    Residual |  2305.37651    29  79.4957416           R-squared     =  0.6412
-------------+------------------------------           Adj R-squared =  0.6165
       Total |  6425.96875    31  207.289315           Root MSE      =   8.916

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.045157   .3860567     2.71   0.011     .2555828    1.834732
        quet |   9.750732   5.402456     1.80   0.081    -1.298531        20.8
       _cons |   55.32344   12.53475     4.41   0.000       29.687    80.95987
------------------------------------------------------------------------------

xi: regress sbp age i.smk
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
From the STATA output, we see that smoking status (variable name smk) has the largest t statistic, with a p value of 0.001. It also gives the largest adjusted R-squared. Therefore, smoking status is added to the regression model at this step.
Step 2:
At each subsequent step, consider the t test for the predictive variables which are not yet in
the model. If the largest t test is statistically significant, add the new variable to the model.
If the t test is not significant, no more variables are included in the model and the process
is stopped.
For our example, we have already added age and smoking status to the model. We now consider whether we should add the QUET index (variable name quet) to the model. To add the QUET to the multiple regression that contains age and smoking status with STATA, we would type:
xi: regress sbp age i.smk quet
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =   29.71
       Model |  4889.82567     3  1629.94189           Prob > F      =  0.0000
    Residual |  1536.14308    28  54.8622529           R-squared     =  0.7609
-------------+------------------------------           Adj R-squared =  0.7353
       Total |  6425.96875    31  207.289315           Root MSE      =  7.4069

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.212715   .3238192     3.75   0.001      .549401    1.876028
     _Ismk_1 |   9.945568   2.656057     3.74   0.001     4.504882    15.38625
        quet |   8.592448   4.498681     1.91   0.066    -.6226828    17.80758
       _cons |   45.10319   10.76488     4.19   0.000     23.05235    67.15404
------------------------------------------------------------------------------
From the above STATA output, the t test for the QUET index, controlling for age and smoking status, is 1.91, with a p value of 0.066. This is not statistically significant at α = 0.05, so the process is stopped. Thus, the forward selection procedure identifies age and smoking status as the best subset of the predictive variables.
To do the forward selection procedure with the STATA command, we would type:
xi: sw regress sbp age quet i.smk, pe(0.05)
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)
                      begin with empty model
p = 0.0000 <  0.0500  adding    age
p = 0.0009 <  0.0500  adding    _Ismk_1

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
5.2 Backward elimination procedure
This strategy focuses on deciding whether a single variable should be deleted from a model. In the backward elimination procedure, we proceed as follows:
Step 0:
1. Fit the regression model that contains all predictive variables.
2. Consider the t test for every variable in the model.
3. Focus on the variable with the lowest t statistic, which also has the highest p value. If the test is not significant, remove that variable from the model. If it is significant, keep that variable in the model.
To fit the regression model that contains all predictive variables (age, QUET, and smoking status) using STATA, we would type:
xi: regress sbp quet age i.smk
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =   29.71
       Model |  4889.82567     3  1629.94189           Prob > F      =  0.0000
    Residual |  1536.14308    28  54.8622529           R-squared     =  0.7609
-------------+------------------------------           Adj R-squared =  0.7353
       Total |  6425.96875    31  207.289315           Root MSE      =  7.4069

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   8.592448   4.498681     1.91   0.066    -.6226828    17.80758
         age |   1.212715   .3238192     3.75   0.001      .549401    1.876028
     _Ismk_1 |   9.945568   2.656057     3.74   0.001     4.504882    15.38625
       _cons |   45.10319   10.76488     4.19   0.000     23.05235    67.15404
------------------------------------------------------------------------------
From the STATA output, we see that the QUET index has the lowest t statistic, with a p value of 0.066. Therefore, the QUET index is dropped from the regression model at this step.
Step 1:
If a variable is dropped in step 0, re-compute the regression equation for the remaining
variables, and repeat the backward elimination procedure steps 0 and 1. If the variable is
not dropped, the backward elimination procedure is terminated.
From step 0 of our example, the QUET index was dropped from the model, so we re-compute the regression equation without the QUET index. The result of the re-computation is presented as follows:
xi: regress sbp age i.smk
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
From the STATA output, we see that smoking status has the lowest t statistic, with a p value of 0.001. However, the test is significant, so smoking status is not dropped from the regression model. Therefore, we stop here with this model, which is the same model we obtained using the forward selection procedure.
To do the backward elimination procedure using the STATA command, we would type:
xi: sw regress sbp quet age i.smk, pr(0.05)
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)
                      begin with full model
p = 0.0664 >= 0.0500  removing quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
5.3 Stepwise procedure
Another approach for model selection is a stepwise method. There are two main versions
of the stepwise procedure: (a) forward selection followed by a test for backward
elimination which is called the forward stepwise procedure and (b) backward elimination
followed by a test for forward selection which is called the backward stepwise procedure.
The forward stepwise procedure allows a variable which was added to the model at an earlier stage to be dropped subsequently if it is no longer helpful in conjunction with variables added at later stages. Conversely, the backward stepwise procedure allows variables which were dropped from the model at an earlier stage to be added later.
For our example, we proceed with the forward stepwise procedure as follows:
Step 0: Set the level of significance for including a variable, denoted by $p_E$, to 0.05. Age is added to the model in this step, since it has the highest significant correlation with SBP.
Step 1: Smoking status is added to the model, since it has a higher significant partial correlation with SBP than does the QUET, given that age is already in the model.
Step 2: Fit the model that contains both variables (age and smoking status), and set the level of significance for removing a variable, denoted by $p_R$, to 0.10. The t test for age, given that smoking status is already in the model, is 8.47, with a p value less than 0.001. Thus, age is not dropped from the model.
Step 3: Add the QUET to the model from the previous step, and check whether we should include it. The t test for the QUET, given that age and smoking status are already in the model, is 1.91, with a p value of 0.066. Thus, the QUET is not added to the model, and the process is terminated.
To do the forward stepwise regression using the STATA command, we would type:
xi: sw regress sbp age quet i.smk, pe(.05) pr(.1) forward
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)
                      begin with empty model
p = 0.0000 <  0.0500  adding    age
p = 0.0009 <  0.0500  adding    _Ismk_1

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
Thus, the forward stepwise procedure identifies age and smoking status as the best subset of predictive variables, a result that happens to be consistent with our previous analyses based on forward selection and backward elimination.
6. CONFOUNDING AND INTERACTION

Both confounding and interaction involve additional variables that may affect an association between two or more variables. The additional variables to be considered are synonymously referred to as extraneous variables, control variables, or covariates. For example, in the previous example about SBP, we assess whether age is associated with the SBP, accounting for smoking status, so the extraneous variable here is smoking status.
6.1 Confounding effects
Confounding is the condition where the relationship of interest differs depending on whether an extraneous variable is ignored or included in the data analysis. The assessment of confounding requires a comparison between a crude estimate of an association (which ignores the extraneous variable) and an adjusted estimate of the association (which accounts in some way for the extraneous variable). If the crude and adjusted estimates are meaningfully different, confounding is present and the extraneous variable must be included in the data analysis.
Suppose that we are interested in describing the relationship between a predictive variable "drug therapy" and a continuous outcome "SBP", taking into account the possible confounding effect of a third variable, "age". Suppose drug therapy is a dichotomous variable (e.g., drug = 1 or 0 for drug A or placebo, respectively). The comparison between a crude estimate of the association and an adjusted estimate of the association can be expressed in terms of the following two regression models:

1) SBP: Y = \beta_0 + \beta_1(drug) + \beta_2(age)
2) SBP: Y = \beta_0 + \beta_1(drug)
Model (1) expresses the relationship between drug therapy and SBP, adjusted for the variable age, in terms of the partial regression coefficient ($\beta_1$) of the variable drug. The estimate of $\beta_1$ (which we will denote by $\hat{\beta}_{1|Age}$), obtained from the least-squares fitting of model (1), is an adjusted-effect measure. This value gives the estimated change in SBP per unit change in drug therapy after adjusting for age.

Model (2) expresses the relationship between drug therapy and SBP, ignoring the variable age, in terms of the regression coefficient ($\beta_1$) of the variable drug. The estimate of $\beta_1$ (which is denoted by $\hat{\beta}_1$), obtained from the least-squares fitting of model (2), is a crude estimate of the relationship between drug therapy and SBP.
Thus, confounding is present if the estimates of the regression coefficient of the study variable "drug" from models (1) and (2) are meaningfully different. As an example, suppose that

\hat{\beta}_1 = 15.9 \quad \text{and} \quad \hat{\beta}_{1|Age} = 4.1
Then we can conclude that a 1-unit change in drug therapy yields a 16-unit change in SBP when age is ignored, but only a 4-unit change in SBP when age is controlled. That is, the relationship between drug therapy and SBP is much weaker after controlling for age. Thus, we would treat age as a confounder and control for it in the analysis.
As another example, suppose that:

\hat{\beta}_1 = 6.1 \quad \text{and} \quad \hat{\beta}_{1|Age} = 6.2

Here, we can conclude that age is not a confounder because there is no meaningful difference between the estimates 6.2 and 6.1. Sometimes an investigator may have to deal with more problematic comparisons, such as $\hat{\beta}_1 = 5.5$ versus $\hat{\beta}_{1|Age} = 4.1$. One approach to this problem is to consider the clinical importance of the numerical difference between estimates, based on a priori knowledge of the variable(s) involved.
For example, the estimated coefficients 5.5 and 4.1 are the crude and adjusted differences
of mean SBP between drug A and placebo. It is important to decide whether a mean
difference of 5.5 is clinically more important than a mean difference of 4.1. If there is a meaningful difference in clinical practice, we have to treat age as a confounder and include the variable age in the model.
One approach sometimes used incorrectly to assess confounding is a statistical test of the partial regression coefficient of the extraneous variable. Such a test does not address confounding, but rather precision. It evaluates whether significant additional variation in SBP is explained by adding the variable age to a model already containing the variable drug therapy. In other words, it determines whether a confidence interval for $\beta_1$ is considerably narrower when age is in the model than when it is not. Another reason for not focusing on $\beta_2$ is that, even if $\hat{\beta}_2 \neq 0$, it does not follow that $\hat{\beta}_{1|Age} \neq \hat{\beta}_1$. In other words, $\hat{\beta}_2 \neq 0$ is not sufficient for a confounding effect.
Consider the STATA outputs, which describe two regression models for the relationship
between drug and SBP when the age is ignored or included in the model, respectively.
1) Crude estimates of the relationship between drug therapy and SBP (age is ignored).

xi: regress sbp i.drug
i.drug            _Idrug_0-1          (naturally coded; _Idrug_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =    6.16
       Model |  1094.31693     1  1094.31693           Prob > F      =  0.0189
    Residual |  5331.65182    30  177.721727           R-squared     =  0.1703
-------------+------------------------------           Adj R-squared =  0.1426
       Total |  6425.96875    31  207.289315           Root MSE      =  13.331

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Idrug_1 |   11.90688   4.798404     2.48   0.019     2.107235    21.70653
       _cons |   137.4615   3.697418    37.18   0.000     129.9104    145.0127
------------------------------------------------------------------------------
2) Adjusted estimates of the relationship between drug therapy and SBP, after accounting for age
xi: regress sbp i.drug age
i.drug            _Idrug_0-1          (naturally coded; _Idrug_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   40.54
       Model |  4733.10854     2  2366.55427           Prob > F      =  0.0000
    Residual |  1692.86021    29  58.3744899           R-squared     =  0.7366
-------------+------------------------------           Adj R-squared =  0.7184
       Total |  6425.96875    31  207.289315           Root MSE      =  7.6403

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Idrug_1 |    10.6436   2.754685     3.86   0.001     5.009639    16.27756
         age |   1.560152   .1976058     7.90   0.000     1.156002    1.964301
       _cons |   55.13354   10.64064     5.18   0.000     33.37098    76.89609
------------------------------------------------------------------------------
From the STATA output, the statistical test of $H_0: \beta_{age} = 0$ has a p value of <0.001. Thus, we reject the null hypothesis and conclude that the addition of age, given that drug is already in the model, significantly contributes to the prediction of the SBP. The adjusted R-squared when age is included in the model is 0.72, whereas the R-squared is 0.17 when age is ignored. However, there is no meaningful difference between the crude and adjusted estimates, 11.91 and 10.64, respectively. As a result, age is not a confounder in this example, although additional variation of SBP is explained by including age in the model.
6.2 Interaction effects
Interaction is the condition where the relationship of interest is different at different levels
of the extraneous variable. The assessment of interaction focuses on describing the
relationship of interest at different levels of the extraneous variable. For example, in
assessing interaction due to sex in describing the relationship between age and SBP, we
must determine whether the regression coefficients of the relationship between age and
SBP differ between males and females.
To illustrate the concept of interaction, let us consider the following example. Suppose that we wish to determine how two independent variables, age and sex, jointly affect the systolic blood pressure. To distinguish between interaction and no interaction, we consider two graphs based on two hypothetical data sets, which are presented in Figure 6-1. These show the straight-line regression of SBP versus age for females against the corresponding regression for males.
In Figure 6-1(a), the two regression lines are parallel. This figure suggests that the rate of change in SBP as a function of age remains the same for males and females. In other words, the relationship between SBP and age does not depend on sex. It can be concluded that there is no interaction between age and sex. In this situation, we can investigate the effects of age and sex on SBP independently of one another. These effects are called the main effects. One way to represent the relationship depicted in Figure 6-1(a) is with a regression model of the form

Y = \beta_0 + \beta_1(age) + \beta_2(sex).

Here, the change in the mean of SBP for a 1-unit change in age is equal to $\beta_1$, regardless of males or females, while changing the category of sex only has the effect of shifting the straight line relating SBP and age, without affecting the value of the slope $\beta_1$.
Figure 6-1 Graphs of non-interacting and interacting independent variables: (a) no interaction between age and sex; (b) interaction between age and sex [both panels plot systolic blood pressure (mmHg) against age (years) for females and males]
In contrast, the two regression lines cross, or tend toward crossing, in Figure 6-1(b). This figure depicts a situation where the relationship between SBP and age depends on sex; in particular, the SBP appears to increase with increasing age for males but to decrease with increasing age for females. It can be concluded that there is an interaction between age and sex. In this situation, we cannot investigate the main effects of age and sex on SBP, since age and sex do not operate independently of one another in their effects on SBP. One way to represent such interaction effects mathematically is to use a regression model of the form
Y = \beta_0 + \beta_1(age) + \beta_2(sex) + \beta_{12}(age)(sex).

An interaction term is basically the product of the two independent variables of interest, which are age and sex in our example. Here the change in the mean of SBP for a 1-unit change in age is equal to $\beta_1 + \beta_{12}(sex)$, which clearly depends on sex. For our particular example, when sex = 0 (i.e., when sex = male), the regression model can be written as:

Y = \beta_0 + \beta_1(age) + \beta_2(0) + \beta_{12}(age)(0)
  = \beta_0 + \beta_1(age)

and when sex = 1 (i.e., when sex = female), the regression model becomes:

Y = \beta_0 + \beta_1(age) + \beta_2(1) + \beta_{12}(age)(1)
  = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12})(age)
Note that these regression lines have different intercepts and different slopes. In linear regression, interaction is evaluated by using statistical tests of product terms involving the basic independent variables in the model.
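Product terms can also be built by hand rather than with the xi prefix; a minimal sketch, assuming variables named sex and age are in memory (hypothetical names for this illustration), is:

* generate the product term explicitly
gen agesex = age*sex
* fit the interaction model and test the product term
regress sbp age sex agesex
test agesex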
We present here an example of how to assess interaction using STATA. We will continue with our example of the relationship between age and SBP to investigate a possible interaction. The results of fitting the linear regression model to this data set indicated that there is a linear relationship between age and SBP. Another question that can be answered with such data is whether an interaction exists between age and smoking status; in other words, whether the slopes of the straight lines relating SBP to age differ significantly for smokers and for non-smokers. To create an interaction term using STATA, we would type:
xi: regress sbp i.smk*age
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)
i.smk*age         _IsmkXage_#         (coded as above)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =   26.63
       Model |  4758.42362     3  1586.14121           Prob > F      =  0.0000
    Residual |  1667.54513    28  59.5551833           R-squared     =  0.7405
-------------+------------------------------           Adj R-squared =  0.7127
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7172

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Ismk_1 |  -12.84603   21.71534    -0.59   0.559    -57.32789    31.63583
         age |   1.515216   .2703328     5.61   0.000     .9614643    2.068968
 _IsmkXage_1 |   .4349186   .4048227     1.07   0.292    -.3943232     1.26416
       _cons |   58.57428   14.80476     3.96   0.000      28.2481    88.90046
------------------------------------------------------------------------------
The regression model including the interaction term can be written as:

Y = \beta_0 + \beta_1(smoke) + \beta_2(age) + \beta_3(smoke)(age)

From the STATA output, we see that when smoking status = non-smoker (smoke = 0), the fitted model can be written as:

\hat{Y} = 58.57 - 12.85(0) + 1.52(age) + 0.43(age)(0)
        = 58.57 + 1.52(age)

and when smoking status = smoker (smoke = 1), the fitted model becomes:

\hat{Y} = 58.57 - 12.85(1) + 1.52(age) + 0.43(age)(1)
        = 45.73 + 1.95(age)
Plotting these two lines (Figure 6-2), we see that they appear to be almost parallel, which
indicates that there is probably no statistically significant interaction. We can confirm this
lack of significance with the partial F test of the hypothesis H₀: β₃ = 0, given
that smoking status and age are in the model. The p-value for this test is
0.292; therefore, the slopes of the straight lines relating SBP to age do not differ
significantly between smokers and non-smokers, and there is no interaction between age
and smoking status in this situation. This conclusion is equivalent to finding that smokers
have a consistently higher SBP than non-smokers across the observed age range, while the
rate of change of SBP with age is statistically the same for both groups.
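Figure 6-2 can be reproduced from the fitted model (a minimal sketch, assuming the smoking indicator is the 0/1 variable smk used above):

. * fitted values from the interaction model
. predict sbphat, xb
. * one fitted line per smoking group
. twoway (line sbphat age if smk==0, sort) (line sbphat age if smk==1, sort), legend(label(1 "Non-smokers") label(2 "Smokers"))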
Figure 6-2 Comparison by smoking status of straight-line regressions of SBP on age
7. REGRESSION DIAGNOSTICS
7.1 Normality
The assumption of normality can be assessed formally with a normal probability plot of the
residuals. The residuals are assumed to be independent normal random variables with mean 0
and constant variance σ². Violation of this assumption results in an invalid model. To make
sure that the model is appropriate, we need to check whether the distribution of the
residuals is normal. Checking this assumption can be performed as follows:
STATA command:
1. Estimate the residuals
After fitting the regression model with the regress command, the residuals can be estimated
with the predict command and its resid option:
. xi: regress sbp i.drug age quet
i.drug            _Idrug_0-1          (naturally coded; _Idrug_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =   32.42
       Model |  4989.65072     3  1663.21691           Prob > F      =  0.0000
    Residual |  1436.31803    28  51.2970727           R-squared     =  0.7765
-------------+------------------------------           Adj R-squared =  0.7525
       Total |  6425.96875    31  207.289315           Root MSE      =  7.1622

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Idrug_1 |   10.62885   2.582308     4.12   0.000     5.339232    15.91847
         age |   1.003488   .3102821     3.23   0.003     .3679039    1.639072
        quet |   9.705102   4.339773     2.24   0.033     .8154801    18.59472
       _cons |   51.38847   10.11437     5.08   0.000     30.67013    72.10681
------------------------------------------------------------------------------

. predict error, resid
2. Create a normal probability plot
. pnorm error
Figure 7-1 Normal probability plot of residuals from the full multiple regression model
3. Test for normality
. swilk error

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
       error |     32    0.97405      0.865    -0.300    0.61789
The test is performed under the null hypothesis that the residuals are normally distributed.
The Shapiro-Wilk statistic is 0.97405 and its associated p-value is 0.61789.
We therefore fail to reject the null hypothesis and conclude that the residuals are normally
distributed.
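A quantile-normal plot and a histogram with a normal overlay offer complementary visual checks (a minimal sketch using standard STATA graph commands):

. * residual quantiles against normal quantiles
. qnorm error
. * histogram of residuals with a fitted normal density
. histogram error, normal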
7.2 Linearity
An augmented component-plus-residual plot against each independent variable will suggest
whether a relationship is linear. This can be performed as below.
STATA command:
1. Plot the augmented component-plus-residual against the independent variable age
. acprplot age, mspline msopts(bands(13)) title(Augmented component-plus-residual versus age plot)
Figure 7-2 Augmented component-plus-residual versus age
2. Plot the augmented component-plus-residual against the independent variable BMI (quet)
. acprplot quet, mspline msopts(bands(13)) title(Augmented component-plus-residual versus BMI plot)
Figure 7-3 Augmented component-plus-residual versus BMI
These graphs suggest neither a definitely linear nor a definitely curved relationship. The
relationships might be linear, but some outliers may prevent the smoothed line from being straight.
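A lowess smooth of the residuals against each predictor is another informal check (a sketch; a roughly flat smooth around zero is consistent with linearity):

. * residuals against age with a lowess smooth
. twoway (scatter error age) (lowess error age)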
7.3 Homoskedasticity
The assumption of homoskedasticity can be assessed by plotting the residuals against the
predicted values of the dependent variable. This plot gives an idea of whether the spread of
the residuals is constant across the fitted values. For example, after fitting drug, age,
and BMI to predict SBP, a plot of the residuals against the predicted values of SBP is shown
in Figure 7-4. In that graph, the residuals lie symmetrically below and above the horizontal
line, which suggests constant variance. A formal test can be performed as below.
. estat hettest error

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: error

         chi2(1)      =     1.82
         Prob > chi2  =   0.1771
The test does not reject the null hypothesis of constant variance (p = 0.1771), suggesting that the variance of the residuals is constant across the values of the independent variables.
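The residual-versus-fitted plot in Figure 7-4 can be obtained directly after regress (a sketch using STATA's built-in post-estimation plot):

. * residuals against fitted values, with a reference line at zero
. rvfplot, yline(0)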
Figure 7-4 Residuals from the regression model, plotted against predicted values of SBP
7.4 Multicollinearity
This issue does not arise in simple regression analysis, but it is important to check for in
the multiple regression model. Multicollinearity occurs when the independent variables are
correlated with one another, and it results in unstable estimates of the coefficients,
inflated standard errors, and unreliable inference about the coefficients. Therefore,
pairwise correlations (r) should be examined to get an idea of which variables might be
highly correlated and could have a collinearity effect in the model if we fit them together.
When collinearity is present, adding or deleting independent variables produces large changes
in the coefficients and their standard errors; a standard error is sometimes as large as its
coefficient, or even larger.
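The pairwise correlations among the candidate predictors can be inspected as a first screen (a minimal sketch, using the predictors age and quet from the example above):

. * pairwise correlations with significance levels
. pwcorr age quet, sig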
To detect multicollinearity in addition to pairwise correlation, the variance inflation
factor (VIF) is usually used. This parameter measures how much the variances of the
regression coefficients are inflated compared with the situation in which the independent
variables are not linearly related. It can be estimated in STATA as below. A VIF > 10
suggests collinearity, and a value close to 1 indicates no evidence of collinearity. If the
VIF is greater than 10, that variable should be considered for omission from the model. For
this example, there is no evidence of collinearity.

. estat vif

    Variable |       VIF       1/VIF
-------------+----------------------
         age |      2.82    0.355212
        quet |      2.81    0.355588
    _Idrug_1 |      1.00    0.996620
-------------+----------------------
    Mean VIF |      2.21
7.5 Outliers
7.5.1 Identifying X outliers
Identifying X outliers in multiple regression is a multi-dimensional problem across the X
variables. Considering each variable separately is not appropriate: an extreme value on a
single variable (a univariate outlier) may not influence the regression line, whereas a
multivariate outlier (which cannot be detected by examining one variable at a time) may
strongly affect the fitted regression line. We use the hat matrix H to identify X outliers,
as follows:
H = X(X′X)⁻¹X′
Let h_ii (the hat value, or leverage) be the ith element on the main diagonal of the H matrix,
which can be obtained from:

h_ii = x_i′(X′X)⁻¹x_i,    0 ≤ h_ii ≤ 1,    Σᵢ h_ii = p

where p = number of regression parameters including the constant term. The higher the h_ii
value (the higher the leverage), the further the ith observation lies from the centre of the
X matrix, and hence the more clearly it is an X outlier. The suggestion is that a hat value
greater than twice the average (2p/n) marks the case as an outlier. Another criterion
classifies h_ii of 0.2-0.5 as moderate leverage and h_ii > 0.5 as high leverage. Estimation
of the hat values can be performed as below.
STATA command: 1. Estimate the leverage values
. predict xdist, hat
. sum xdist, det

                          Leverage
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0549423       .0549423
 5%     .0568518       .0568518
10%     .0762993       .0651438       Obs                  32
25%     .0843276       .0762993       Sum of Wgt.          32

50%     .1157301                      Mean               .125
                        Largest       Std. Dev.        .056152
75%     .1453192       .2223231
90%     .2223231       .2470635       Variance       .0031531
95%     .2482278       .2482278       Skewness       1.114692
99%     .2663676       .2663676       Kurtosis        3.49025
2. Identify outlying cases
For this model, the cutoff of twice the average leverage is 2 × 4/32 = 0.25. Only one subject
has a leverage value higher than this cutoff; however, even the highest leverage does not
exceed the high-leverage threshold of 0.5. We can see that the subject with id = 2 has the
highest leverage value, 0.266. Keep in mind that this subject is a potential X outlier, but
whether it actually influences the regression model (the predicted values) needs to be
explored further.
. list person error xdist quet age drug if xdist>2*(4/32) & xdist!=.

     +----------------------------------------------------+
     | person       error      xdist    quet   age   drug |
     |----------------------------------------------------|
 18. |      2   -2.082762   .2663676   3.251    41      0 |
     +----------------------------------------------------+
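The moderate/high leverage criteria can be checked in the same way (a sketch; count simply tallies the observations satisfying each condition):

. * moderate leverage: 0.2 <= h < 0.5
. count if xdist >= 0.2 & xdist < 0.5
. * high leverage: h >= 0.5 (excluding missing values)
. count if xdist >= 0.5 & xdist < .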
7.5.2 Identifying Y outliers
This can be done using studentized deleted residuals, which are calculated by:

d_i = Y_i − Ŷ_i(i),    d*_i = d_i / s(d_i)

or, equivalently,

d*_i = e_i / [MSE_(i)(1 − h_ii)]^(1/2) = e_i [ (n − p − 1) / (SSE(1 − h_ii) − e_i²) ]^(1/2)

Here d_i is called the deleted residual: the residual for the ith case when that case is
deleted from the fit, MSE_(i) is the error mean square with the ith case deleted, and SSE is
the error sum of squares of the full fit. The studentized deleted residual follows a
t distribution with n − p − 1 degrees of freedom. To identify outliers, it is compared with
a t critical value at a chosen significance level, say α = 0.05.
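The cutoff used below is the upper 5% point of the t distribution with n − p − 1 = 32 − 4 − 1 = 27 degrees of freedom; a Bonferroni-adjusted cutoff is sometimes preferred when all n cases are screened (a sketch):

. * upper 5% t critical value with 27 df
. display invttail(27, 0.05)
. * Bonferroni-adjusted version over n = 32 cases
. display invttail(27, 0.05/(2*32))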
STATA command: 1. Estimate the studentized deleted residuals
. predict estu,rstudent
. sum estu, det

                    Studentized residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -2.514817      -2.514817
 5%    -1.589173      -1.589173
10%     -1.08973      -1.185283       Obs                  32
25%    -.5998537       -1.08973       Sum of Wgt.          32

50%    -.0976471                      Mean           .0120775
                        Largest       Std. Dev.        1.08881
75%      .626205       1.436678
90%     1.436678       1.613788       Variance       1.185507
95%     2.430551       2.430551       Skewness       .3678051
99%     2.576622       2.576622       Kurtosis       3.461563
2. Identify outlying cases
. list person error estu quet age drug if abs(estu)>invttail(27,0.05)

     +-----------------------------------------------------+
     | person       error        estu    quet   age   drug |
     |-----------------------------------------------------|
  8. |      8    14.76043    2.430551   3.612    48      1 |
  9. |      9    14.84753    2.576622   2.368    44      1 |
 12. |     12   -14.32618   -2.514817   4.032    51      1 |
     +-----------------------------------------------------+
7.5.3 Identifying influential cases
After identifying cases that are outlying with respect to their X values and/or their Y values,
the next step is to ascertain whether or not these outlying cases are influential. We shall take
up three measures of influence that are widely used in practice, each based on the omission of a
single case to measure its influence.
i) Influence on the fitted values (DFFITS)
The first task is to explore whether these X and Y outliers influence the fitted values,
since not every X or Y outlier affects the fitted values. DFFITS takes into account both the
X (leverage) and Y (d*) outlying indexes and is widely used in practice:

DFFITS_i = (Ŷ_i − Ŷ_i(i)) / [MSE_(i) h_ii]^(1/2)

The letters DF refer to the difference between the fitted value Ŷ_i for the ith case when
all n cases are used in fitting the model and the fitted value Ŷ_i(i) obtained when the ith
case is removed. The denominator standardizes this difference, so DFFITS expresses the number
of estimated standard deviations by which the fitted value changes when the ith case is
removed from the model. DFFITS can also be calculated from d* and h_ii as follows:

DFFITS_i = d*_i [ h_ii / (1 − h_ii) ]^(1/2)
As the equation shows, DFFITS is a function of d*, scaled up or down according to the value
of h_ii. A case whose absolute DFFITS exceeds 1 is considered influential for small-to-medium
sample sizes; for large samples the suggested cutoff is 2√(p/n).
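For reference, the large-sample cutoff for this data set would be computed as follows (a sketch):

. * large-sample DFFITS cutoff, 2*sqrt(p/n) with p = 4 and n = 32
. display 2*sqrt(4/32)

The DFFITS values can then be estimated and screened as below.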
STATA command 1. Estimate DFFITS
. predict dfits, dfits
2. Identify influential cases
. list person error dfits quet age drug if abs(dfits)>1

     +-----------------------------------------------------+
     | person       error       dfits    quet   age   drug |
     |-----------------------------------------------------|
  8. |      8    14.76043    1.041157   3.612    48      1 |
  9. |      9    14.84753    1.377664   2.368    44      1 |
 12. |     12   -14.32618   -1.440561   4.032    51      1 |
     +-----------------------------------------------------+
ii) Influence on all of the estimated regression coefficients (Cook’s distance)
Cook’s distance (D_i) measures the impact of the ith case on all of the regression
coefficients (the coefficient vector) when the ith case is omitted. It is defined as:

D_i = (b − b_(i))′ X′X (b − b_(i)) / (p · MSE)
The way to interpret D_i is as follows:
- if D_i > 4/n, or
- if D_i > F(1−α; p, n−p),
the ith case has a substantial influence on the estimated coefficients.
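For this data set both cutoffs can be computed directly (a sketch; the 50th percentile of the F distribution is a common choice for the second rule):

. * rule-of-thumb cutoff 4/n
. display 4/32
. * median of F(p, n-p) = F(4, 28)
. display invF(4, 28, 0.5)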
STATA command: 1. Estimate Cook’s distance
. predict cookd1,cooks
2. Identify influential cases
. list person error cookd1 quet age drug if cookd1>(4/32)

     +-----------------------------------------------------+
     | person       error      cookd1    quet   age   drug |
     |-----------------------------------------------------|
  8. |      8    14.76043    .2305867   3.612    48      1 |
  9. |      9    14.84753    .3949498   2.368    44      1 |
 10. |     10     8.75689    .1641441   4.637    64      1 |
 12. |     12   -14.32618    .4359133   4.032    51      1 |
     +-----------------------------------------------------+
iii) Influence on the partial regression coefficients (DFBETAS)
This measures the influence of the ith case on the estimate of each individual regression
coefficient. It can be assessed by:

DFBETAS_k(i) = (b_k − b_k(i)) / [MSE_(i) c_kk]^(1/2)

where c_kk is the kth diagonal element of (X′X)⁻¹ and b_k(i) is the kth regression
coefficient estimated with the ith case omitted. That is, DFBETAS is the standardized
difference between the coefficients estimated with and without the ith case. Cases with
absolute DFBETAS > 1 (small-to-medium samples) or > 2/√n (large samples) are considered
influential. This can be estimated in STATA as follows:
STATA command:
1. Estimate DFBETAS
. predict df_drug, dfbeta(_Idrug_1)
. predict df_age, dfbeta(age)
. predict df_bmi, dfbeta(quet)
2. Identify influential cases
. list person error df_drug df_age df_bmi quet age drug if abs(df_drug)>1 | abs(df_age)>1 | abs(df_bmi)>1

     +----------------------------------------------------------------------------+
     | person       error     df_drug     df_age      df_bmi    quet   age   drug |
     |----------------------------------------------------------------------------|
 12. |     12   -14.32618   -.4310711   1.128824   -1.263236   4.032    51      1 |
     +----------------------------------------------------------------------------+
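The large-sample DFBETAS cutoff for n = 32 would be (a sketch):

. * large-sample cutoff 2/sqrt(n)
. display 2/sqrt(32)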
Finally, the diagnostics need to be summarized as a whole across the three measures: which
subjects are influential cases affecting the predicted values (DFFITS), the coefficient
vector as a whole (Cook’s D), and the individual coefficients (DFBETAS).
. list person sbp drug age quet smkgr if abs(dfits)>1 | cookd1>(4/32) | abs(df_drug)>1 | abs(df_age)>1 | abs(df_bmi)>1

     +-------------------------------------------+
     | person   sbp   drug   age    quet   smkgr |
     |-------------------------------------------|
  8. |      8   160      1    48   3.612       1 |
  9. |      9   144      1    44   2.368       1 |
 10. |     10   180      1    64   4.637       1 |
 12. |     12   138      1    51   4.032       1 |
     +-------------------------------------------+
There are four subjects who are potentially influential cases under the above criteria. In
summary, the residuals are normally distributed with constant variance, but there are some
outliers that influence the regression model. We therefore need to explore the data further,
starting with the subjects identified above. Their data need to be checked and validated for
all variables to make sure they are correct. If any data are incorrect, they should be
corrected and the regression model re-fitted; the model diagnostics then need to be
re-assessed. If the data for these subjects are correct as they stand, try excluding these
subjects and see how the regression model changes.
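As a sensitivity check, the model can be re-fitted without the flagged subjects (a sketch using the person identifiers listed above):

. * re-fit excluding the potentially influential cases
. xi: regress sbp i.drug age quet if !inlist(person, 8, 9, 10, 12)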
Assignment V: Multiple regression (25%)
Due date: Oct 8, 2015
From the randomized controlled trial of calcium supplements, researchers wanted to determine
the factors associated with BMD at the total left femur, stored in the variable ‘total’. The
factors to explore are coded as follows:
- 1 = smoke, 2 = ex-smoke, 3 = non-smoke
- 1 = yes, 2 = sometimes, 3 = no
- hours/day
- 1 = regular, 2 = sometimes, 3 = no
- classified by glucose >= 126 mg/dl
- classified as hypertension if SBP >= 140 or DBP >= 90; otherwise normal
- classified as high cholesterol if cholesterol >= 240 mg/dl; otherwise normal
The data are given in the data set cross-sectional_BMD_&_risk_factor.dta.
- Fit the regression model step by step and explain the method used.
- What is the parsimonious equation?
- Perform diagnostic measures, check the assumptions, and explain the results.
- Interpret the results of the final model, writing a report in text and tables as needed.