TRANSCRIPT
RACE 615 Introduction to Medical Statistics
Multiple Linear Regression
Assist. Prof. Dr. Sasivimol Rattanasiri
Doctor of Philosophy Program in Clinical Epidemiology, Section for Clinical Epidemiology & Biostatistics Faculty of Medicine Ramathibodi Hospital, Mahidol University
CONTENTS
1. MULTIPLE REGRESSION MODEL
2. FITTING MULTIPLE REGRESSION MODEL
3. FITTING CATEGORICAL INDEPENDENT VARIABLES
4. EVALUATING THE MULTIPLE REGRESSION MODEL
   4.1 Test for overall regression
   4.2 The coefficient of multiple determination
   4.3 Test for partial regression coefficients
   4.4 Evaluating multiple regression by STATA
5. MODEL SELECTION
   5.1 Forward selection procedure
   5.2 Backward elimination procedure
   5.3 Stepwise procedure
6. CONFOUNDING AND INTERACTION
   6.1 Confounding effects
   6.2 Interaction effects
7. REGRESSION DIAGNOSTICS
   7.1 Normality
   7.2 Linearity
   7.3 Homoskedasticity
   7.4 Multicollinearity
   7.5 Outliers
Assignment V
OBJECTIVES
This module will help you to be able to:
- Fit a regression model considering more than two co-variables simultaneously
- Perform model selection
- Assess goodness of fit of the model and check assumptions
- Interpret and report results
REFERENCES
1. Neter J, Wasserman W, and Kutner MH. Applied linear statistical models, third edition. Boston: IRWIN, 1990; 21-433.
2. Kleinbaum DG, Kupper LL, Muller KE, and Nizam A. Applied regression analysis and other multivariable methods. Washington: Duxbury Press, 1998; 39-212.
3. Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
4. Hosmer D and Lemeshow S. Applied regression analysis and other multivariable methods. Washington: Duxbury Press, 1998.
READING SECTION
Read Neter J et al., Chapters 7, 8, 11, and 12.
ASSIGNMENT V (25%) p.41, Due: October 8, 2015
1. MULTIPLE REGRESSION MODEL

In Module IV, we discussed simple linear regression and correlation analysis. A simple regression model includes one predictor and one outcome. In practice, an outcome is usually affected by more than one predictor. For example, systolic blood pressure (SBP) may be determined by age, smoking behavior, and body mass index (BMI). To investigate the more complicated relationship between an outcome and a number of predictors, we use a natural extension of simple linear regression known as multiple regression analysis.
There are several advantages of using multiple regression analysis.
1) Develop a prognostic index to predict the outcome from several predictors. For example, the SBP may be predicted by age, smoking behavior, and BMI.
2) Adjust (control) for potential confounding factors that the study design did not plan to adjust for. For example, the effect of BMI on the SBP may be confounded by age and smoking behavior. Fitting all these variables together allows the effect of BMI to be assessed after adjusting for age and smoking behavior.
A multiple regression model with $y$ as an outcome (dependent variable) and $x_1, x_2, x_3, \dots, x_k$ as $k$ predictors (independent variables) is written as:

y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \varepsilon    (1)

where $\alpha$ refers to the population intercept or constant term,
$\beta_1, \beta_2, \beta_3, \dots, \beta_k$ are the population slopes or regression coefficients of the independent variables $X_1, X_2, X_3, \dots, X_k$, respectively,
$\varepsilon$ refers to the random error term.

The constant term $\alpha$ is the mean value of the outcome when all predictors in the model take the value 0.

The coefficients $\beta_1, \beta_2, \beta_3, \dots, \beta_k$ are called the partial regression coefficients. For example, $\beta_1$ is the partial coefficient of $x_1$, and it gives the change in $y$ due to a one-unit change in $x_1$ when all other predictors included in the model are held constant.
A positive value for a particular $\beta_i$ in model (1) indicates a positive relationship between the outcome ($Y$) and the related predictor ($X$). A negative value for a particular $\beta_i$ in that model indicates a negative relationship between the outcome ($Y$) and the related predictor ($X$).

The multiple regression model (1) can only be used when the relationship between the outcome and each predictor is linear. Each of the $X_i$ variables in model (1) represents a single variable raised to the first power. This model is referred to as a first-order multiple regression model.
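For contrast, a hypothetical model containing a squared term, such as

y = \alpha + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon

would be a second-order model in $X_1$, because one predictor enters at a power higher than the first.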
For a data sample, a multiple linear regression is written as:

\hat{y}_i = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_k X_k    (2)

where
$\hat{y}_i$ is the estimated or predicted value of the outcome $y_i$,
$a$ is the unbiased estimator of the population intercept,
$b_1, b_2, b_3, \dots, b_k$ are the unbiased estimators of the population slopes of the predictive variables $X_1, X_2, X_3, \dots, X_k$, respectively,
$e_i$ is the random error, which is the difference between the observed and predicted values of the dependent variable $y_i$. The technical term for this distance is a residual.
2. FITTING MULTIPLE REGRESSION MODEL

The method of least squares is used to estimate the parameters in the multiple regression model. In general, the least squares method chooses as the best-fitting model the one that minimizes the sum of squared distances between the observed and predicted values of the outcome, which can be defined as:

\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b_1 X_{1i} - b_2 X_{2i} - b_3 X_{3i} - \cdots - b_k X_{ki})^2    (3)
The least squares estimates of the regression coefficients $a, b_1, b_2, b_3, \dots, b_k$ in model (3) are obtained by using matrix mathematics. In this text we do not present the matrix formulas for calculating these least squares estimates; the matrix formulation for multiple regression can be found in Kleinbaum, Appendix B, pages 732-743, and Neter, Section 7.2, pages 236-239.
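For reference, the closed form derived in those texts can be stated compactly. Writing the data as a response vector and a design matrix (a standard result, sketched here rather than taken from this module), the model and its least squares solution are:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \mathbf{b} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}

where $\mathbf{X}$ is the $n \times (k+1)$ matrix whose first column is all 1s (for the intercept) and whose remaining columns hold the $k$ predictors.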
Instead of applying the formulas manually, the calculations in a multiple regression analysis are made by using statistical software packages, such as STATA or SPSS. Even for a multiple regression with two predictors, the formulas are complex and manual calculation is time consuming. In this module we perform the multiple regression analysis using the STATA program. However, solutions obtained with other statistical software packages, such as SPSS, SAS, or MINITAB, can be interpreted in the same way.
Example 2-1 Researchers wanted to investigate how SBP varies with the QUET¹ and age. The outcome is 'sbp' and the predictors are 'quet' and 'age'. To fit the multiple linear regression using STATA, we would type:

¹ QUET stands for "quetelet index", a measure of size defined by QUET = 100 × (weight/height²)
regress sbp quet age

1.    Source |       SS       df       MS        2. Number of obs =      32
  -----------+------------------------------    3. F(  2,    29) =   25.92
       Model |  4120.59224     2  2060.29612       Prob > F      =  0.0000
    Residual |  2305.37651    29  79.4957416    4. R-squared     =  0.6412
  -----------+------------------------------    5. Adj R-squared =  0.6165
       Total |  6425.96875    31  207.289315    6. Root MSE      =   8.916

   ---------------------------------------------------------------------------
7.       sbp |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
   ----------+----------------------------------------------------------------
        quet |   9.750732   5.402456    1.805   0.081    -1.298531        20.8
         age |   1.045157   .3860567    2.707   0.011     .2555828    1.834732
       _cons |   55.32344   12.53475    4.414   0.000       29.687    80.95987
   ---------------------------------------------------------------------------
There are two predictors, the QUET and age, included in this model. In the STATA output no. 7, the variable '_cons' refers to the intercept. The regression coefficients for the intercept, QUET, and age are presented in the second column (labelled 'Coef.'). The multiple regression model for this example is given as:

\hat{sbp} = 55.32 + 9.75(quet) + 1.05(age)
From this equation, the estimated intercept is 55.32. It is the estimated SBP for QUET = 0 and age = 0. This means that a patient who has zero QUET and zero age is expected to have an SBP of 55.32 mmHg. This is the technical interpretation of the intercept. In reality, it may not be meaningful, because none of the patients in our sample has both zero QUET and zero age. In most medical applications the value of the intercept has no practical meaning.
The estimated coefficient of the QUET ($b_1$) is 9.75. This value gives the change in the estimated SBP for a one-unit change in the QUET when age is held constant. Thus, we can state that a patient with one extra unit of QUET, but the same age, is expected to have an SBP higher by 9.75 mmHg.

The estimated coefficient of age ($b_2$) is 1.05. This value gives the change in the estimated SBP for a one-unit change in age when the QUET is held constant. Thus, we can state that a patient with one extra year of age, but the same QUET, is expected to have an SBP higher by 1.05 mmHg.
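To make the fitted equation concrete, consider a hypothetical patient (the values below are illustrative, not taken from the data set) with QUET = 3.5 and age = 50:

\hat{sbp} = 55.32 + 9.75(3.5) + 1.05(50) = 55.32 + 34.13 + 52.50 = 141.95 \approx 142 \text{ mmHg}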
3. FITTING CATEGORICAL INDEPENDENT VARIABLES

All the predictive variables that we have considered up to this point have been measured on a continuous scale. However, regression analysis can be generalized to incorporate categorical variables into the model. For example, we may want to test whether smoking status affects the SBP. The generalization is based on the use of dummy variables, which is the central idea of this section.
A dummy variable is a binary variable that takes on the values 0 and 1 and is used to identify the different categories of a categorical variable. The term "dummy" indicates that the numerical values (such as 0 and 1) assumed by the variable have no quantitative meaning but are used only to identify the different categories of the categorical variable under consideration.
For example, we have a variable for smoking status coded 1 for current smoker, 2 for ex-smoker, and 3 for never smoker. Thus, the dummy variables for smoking behavior are presented as follows:

Smoking behavior    Dummy 1   Dummy 2   Dummy 3
Current smoker         1         0         0
Ex-smoker              0         1         0
Never smoker           0         0         1
To include these variables in a multiple regression model, we use k-1 dummy variables for k categories. The omitted category is referred to as the reference group. It is arbitrary which group is assigned to be the reference group; the choice is usually dictated by subject-matter considerations. For example, if the never smoker category is taken as the reference group in the multiple regression analysis, only dummy variables 1 and 2 are fitted in the model.
There are two easy ways to create dummy variables in STATA. The first is to use the tabulate command with the generate( ) option, as shown below.
tabulate smkgr, gen(dum)

      smkgr |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       25.00       25.00
          2 |          9       28.13       53.13
          3 |         15       46.88      100.00
------------+-----------------------------------
      Total |         32      100.00
list smkgr dum1 dum2 dum3

     +----------------------------+
     | smkgr   dum1   dum2   dum3 |
     |----------------------------|
  1. |     1      1      0      0 |
  2. |     1      1      0      0 |
  3. |     1      1      0      0 |
  4. |     2      0      1      0 |
  5. |     2      0      1      0 |
     |----------------------------|
  6. |     2      0      1      0 |
  7. |     3      0      0      1 |
  8. |     3      0      0      1 |
  9. |     3      0      0      1 |
     +----------------------------+
The tabulate command with the generate option created three dummy variables called dum1, dum2, and dum3. STATA creates all three dummy variables and lets users choose their own reference. Suppose that we add smoking status to the regression model for predicting SBP that already contains the QUET and age. If the never smoker category is taken as the reference group, only dum1 and dum2 are fitted in the model, as follows:
regress sbp quet age dum1 dum2

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  4,    27) =   24.96
       Model |  5058.30759     4   1264.5769           Prob > F      =  0.0000
    Residual |  1367.66116    27  50.6541171           R-squared     =  0.7872
-------------+------------------------------           Adj R-squared =  0.7556
       Total |  6425.96875    31  207.289315           Root MSE      =  7.1172

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   8.066741   4.332307     1.86   0.074    -.8224187     16.9559
         age |   1.207758   .3111643     3.88   0.001     .5693015    1.846214
        dum1 |   13.26391   3.134341     4.23   0.000     6.832772    19.69505
        dum2 |   6.908531   3.047391     2.27   0.032     .6557999    13.16126
       _cons |   47.20073   10.40753     4.54   0.000     25.84624    68.55522
------------------------------------------------------------------------------
The second is to use the xi command to create dummy variables:

xi i.smkgr
i.smkgr           _Ismkgr_1-3         (naturally coded; _Ismkgr_1 omitted)
list smkgr _Ismkgr_2 _Ismkgr_3

     +-----------------------------+
     | smkgr   _Ismkg~2   _Ismkg~3 |
     |-----------------------------|
  1. |     1          0          0 |
  2. |     1          0          0 |
  3. |     1          0          0 |
  9. |     2          1          0 |
 10. |     2          1          0 |
     |-----------------------------|
 11. |     2          1          0 |
 18. |     3          0          1 |
 19. |     3          0          1 |
 20. |     3          0          1 |
     +-----------------------------+
The xi command created two dummy variables called _Ismkgr_2 and _Ismkgr_3 and omitted the dummy variable for group 1 as the reference group. By default, the xi command assigns the first category as the reference group. If we want another category to be the reference, this can be re-assigned using the char var [omit] # command. For example, if we would like group 3 to be the reference group, we would type
char smkgr [omit] 3
xi i.smkgr
i.smkgr           _Ismkgr_1-3         (naturally coded; _Ismkgr_3 omitted)
The xi command is a prefix command, which can be followed with another modeling
command. To fit the multiple regression of SBP on the QUET, age, and smoking status, we
would type
char smkgr [omit] 3
xi: regress sbp quet age i.smkgr
i.smkgr           _Ismkgr_1-3         (naturally coded; _Ismkgr_3 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  4,    27) =   24.96
       Model |  5058.30759     4   1264.5769           Prob > F      =  0.0000
    Residual |  1367.66116    27  50.6541171           R-squared     =  0.7872
-------------+------------------------------           Adj R-squared =  0.7556
       Total |  6425.96875    31  207.289315           Root MSE      =  7.1172

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   8.066741   4.332307     1.86   0.074    -.8224187     16.9559
         age |   1.207758   .3111643     3.88   0.001     .5693015    1.846214
   _Ismkgr_1 |   13.26391   3.134341     4.23   0.000     6.832772    19.69505
   _Ismkgr_2 |   6.908531   3.047391     2.27   0.032     .6557999    13.16126
       _cons |   47.20073   10.40753     4.54   0.000     25.84624    68.55522
------------------------------------------------------------------------------
With this command, STATA treats quet and age as continuous variables and smkgr as dummy variables. We can interpret that, with identical QUET and age, current smokers have an SBP about 13.26 mmHg higher than never smokers, a statistically significant difference.
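One way to read this output (a sketch based on the coefficients above, with never smokers as the reference group) is as three parallel fitted equations, one per smoking group:

\text{never smoker: } \hat{sbp} = 47.20 + 8.07(quet) + 1.21(age)
\text{ex-smoker: } \hat{sbp} = (47.20 + 6.91) + 8.07(quet) + 1.21(age)
\text{current smoker: } \hat{sbp} = (47.20 + 13.26) + 8.07(quet) + 1.21(age)

The dummy coefficients shift only the intercept (to 54.11 for ex-smokers and 60.46 for current smokers); the slopes for quet and age are shared by all three groups.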
4. EVALUATING THE MULTIPLE REGRESSION MODEL
Before using a multiple regression model to predict or estimate, it is desirable to determine
first whether it adequately describes the relationship between the outcome and a set of
predictors and whether it can be used effectively for the purpose of prediction.
4.1 Test for overall regression
We now consider an overall test for a regression model which contains k predictors in the
model. The null hypothesis for this test is that there is no linear relationship between the
outcome and the set of predictors. In other words, all of the predictors considered together
do not explain a significant amount of the variation in the outcome. The null and
alternative hypotheses are defined as:
H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0
H_a: \text{not all } \beta_j \ (j = 1, \dots, k) \text{ equal zero}
An ANOVA approach can be used to perform this test. The particular form of an ANOVA
table for regression analysis is presented in Table 4-1. The ANOVA approach is based on
the partitioning of the total variation of the observed values $y_i$ into two components, as follows:

(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)    (4)

total variation = explained variation + unexplained variation
Thus, the total variation can be viewed as the sum of two components:
1. Explained variation is the variation of the predicted values $\hat{y}_i$ around the mean $\bar{y}$, which is measured by the regression sum of squares (SSR). This indicates how much of the variation of the observed values $y_i$ can be explained by the predictive variables that are included in the regression model. The mean square regression (MSR) is obtained by dividing the regression sum of squares by its corresponding degrees of freedom.
2. Unexplained variation is the variation of the observed values $y_i$ around the fitted regression line, which is measured by the error sum of squares (SSE). The mean square residual (MSE) is obtained by dividing the error sum of squares by its corresponding degrees of freedom.
Table 4-1 ANOVA table for multiple regression

Source of variation   SS                                             df        MS                  F         P value
Regression            $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$   $k$       MSR = SSR/k         MSR/MSE
Error                 $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$       $n-k-1$   MSE = SSE/(n-k-1)
Total                 $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$         $n-1$
The appropriate statistical test for significant overall regression is the F test, which is obtained by dividing the mean square regression by the mean square residual as follows:

F = \frac{MSR}{MSE}    (5)

This test has an F distribution with k and n-k-1 degrees of freedom. The computed value of F can then be converted to the associated p value. The last step is to compare the p value with the level of significance and decide whether the null hypothesis is rejected or not.
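In STATA, the conversion from F to p value can be reproduced from the stored results of regress; a minimal sketch (run immediately after any regress command) is:

* upper-tail p value of the overall F test, using regress's stored results
display Ftail(e(df_m), e(df_r), e(F))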
4.2 The coefficient of multiple determination
In section 2.4.2 of module IV, we discussed about the coefficient of determination for a
simple linear regression model. It is the proportion of the total sum of squares (SST) that is
explained by the simple regression model. The coefficient of determination for the multiple
regression model, usually called the coefficient of multiple determination, is denoted by 2R
. It is defined as the proportion of the total sum of squares (SST) that is explained by the
multiple regression model. It tells us how good the multiple regression model is and how
well the predictive variables included in the model explain the outcome.
The calculation of $R^2$ is also based on the ANOVA table presented in Table 4-1, and it is defined as:

R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{SSE}{SST}    (8)
However, $R^2$ has one major shortcoming: its value generally increases as we add more and more predictive variables to the regression model. This does not imply that a regression model with a higher value of $R^2$ does a better job of predicting the outcome. Such a value of $R^2$ can be misleading, and it will not represent the true power of the predictive variables in the regression model. To eliminate this shortcoming of $R^2$, it is preferable to use the adjusted $R^2$, a modified measure that adjusts for the number of predictive variables in the model. The adjusted $R^2$ is determined by dividing each sum of squares by its associated degrees of freedom. Thus, the adjusted $R^2$ can be defined as:

R_a^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}    (9)
where n is the total number of observations, and k is the number of predictive variables in
the model.
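Equation (9) can also be rewritten in terms of $R^2$ itself, which makes the adjustment explicit (a standard algebraic rearrangement, added here for reference):

R_a^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}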
4.3 Test for partial regression coefficients
Frequently we wish to test whether the addition of a specific predictor ($X^*$), given that others are already in the model ($X_1, X_2, X_3, \dots, X_k$), significantly improves the prediction of the outcome. That is, we want to test the null hypothesis $H_0: \beta^* = 0$ against the alternative hypothesis $H_a: \beta^* \neq 0$ in the model

y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \beta^* X^* + \varepsilon.
The appropriate statistical test for these partial regression coefficients is the partial F test. The concept of this test is to compare the error sum of squares between two models:
1) The full model contains $X_1, X_2, \dots, X_k$ and $X^*$ as the predictors.
2) The reduced model contains $X_1, X_2, \dots, X_k$ but not $X^*$.
The error sum of squares of the full model is smaller than that of the reduced model. The difference between these error sums of squares is called an extra sum of squares.
The error sum of squares when $X_1, X_2, \dots, X_k$ and $X^*$ are in the model is denoted by $SSE(X_1, X_2, \dots, X_k, X^*)$.
The error sum of squares when $X_1, X_2, \dots, X_k$ are in the model is denoted by $SSE(X_1, X_2, \dots, X_k)$.
The extra sum of squares, denoted by $SSR(X^* \mid X_1, X_2, \dots, X_k)$, can be defined as:

SSR(X^* \mid X_1, X_2, \dots, X_k) = SSE(X_1, X_2, \dots, X_k) - SSE(X_1, X_2, \dots, X_k, X^*)
Thus, the partial F test is defined as:

F^* = \frac{SSR(X^* \mid X_1, X_2, \dots, X_k)/1}{SSE(X_1, X_2, \dots, X_k, X^*)/(n-k-1)} = \frac{MSR(X^* \mid X_1, X_2, \dots, X_k)}{MSE(X_1, X_2, \dots, X_k, X^*)}    (6)
This F statistic has an F distribution with 1 and n-k-1 degrees of freedom.
Note: To distinguish the partial F test in equation (6) from the overall F test in equation (5), we use the $F^*$ statistic for the partial F test and the $F$ statistic for the overall F test.
An equivalent way to perform the partial F test is to use a t test. The t test focuses on the null hypothesis $H_0: \beta^* = 0$ and is computed as:

t = \frac{\hat{\beta}^*}{SE(\hat{\beta}^*)}    (7)

where $\hat{\beta}^*$ is the estimated coefficient of the specific predictor ($X^*$) in the regression model $y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k + \beta^* X^* + \varepsilon$, and $SE(\hat{\beta}^*)$ is the estimate of the standard error of $\hat{\beta}^*$.
This test has a t distribution with n-k-1 degrees of freedom.
Since the two tests are equivalent, the choice is usually made in terms of the available
information provided by the computer package output.
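The equivalence is exact: because the partial F test has 1 numerator degree of freedom, the two statistics are related by

F^* = t^2

so, for example, a reported t of 1.805 corresponds to a partial F of about 3.26.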
4.4 Evaluating multiple regression by STATA
Consider again the STATA output of Example 2-1; the numbered items below walk through it:

regress sbp quet age

1.    Source |       SS       df       MS        2. Number of obs =      32
  -----------+------------------------------    3. F(  2,    29) =   25.92
       Model |  4120.59224     2  2060.29612       Prob > F      =  0.0000
    Residual |  2305.37651    29  79.4957416    4. R-squared     =  0.6412
  -----------+------------------------------    5. Adj R-squared =  0.6165
       Total |  6425.96875    31  207.289315    6. Root MSE      =   8.916

   ---------------------------------------------------------------------------
7.       sbp |      Coef.   Std. Err.      t     P>|t|    [95% Conf. Interval]
   ----------+----------------------------------------------------------------
        quet |   9.750732   5.402456    1.805   0.081    -1.298531        20.8
         age |   1.045157   .3860567    2.707   0.011     .2555828    1.834732
       _cons |   55.32344   12.53475    4.414   0.000       29.687    80.95987
   ---------------------------------------------------------------------------
1. Table of the analysis of variance
The ANOVA table shows the values of sum of squares, degrees of freedom and variances as:
- The values of sum of squares are
Sum of squares of total variation: SST=6425.97
Sum of squares of regression: SSR=4120.59
Sum of squares of error: SSE=2305.38
- The degrees of freedom (df) are
df of the total variation is n-1, that is 32-1=31
df of the explained variation is k, that is 2.
df of the unexplained variation is n-k-1, that is 29.
- The variances or mean square (MS) are
Mean square of total variation (MST) is SST/df = 6425.97/31 = 207.29
Mean square of regression (MSR) is SSR/df = 4120.59/2 = 2060.30
Mean square of error (MSE) is SSE/df = 2305.38/29 = 79.50
2. The number of total observations
The number of total observations for this example is 32.
3. The overall F test for overall regression
We now consider the overall F test for the regression model which contains the predictive variables QUET index and age. The null hypothesis for this test is that there is no linear relationship between the SBP and this set of predictive variables. From the ANOVA table, the overall F test for the null hypothesis $H_0: \beta_{quet} = \beta_{age} = 0$ is computed as:

F = \frac{MSR}{MSE} = \frac{2060.30}{79.50} = 25.92
This test has an F distribution with 2 and 29 degrees of freedom. The STATA output indicates that the p value for this example is less than 0.001. As a result, we reject the null hypothesis and conclude that there is a relationship between the SBP and the set of predictive variables QUET index and age.
4. The R²
The estimated $R^2$ for this example is computed as:

R^2 = \frac{SSR}{SST} = \frac{4120.59}{6425.97} = 0.6412

This implies that 64.12% of the variation of SBP is explained by its linear relationship with the QUET index and age.
5. The adjusted R²
Following equation (9), the estimate of the adjusted $R^2$ for this example is computed as:

R_a^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} = 1 - \frac{2305.38/29}{6425.97/31} = 1 - \frac{79.50}{207.29} = 0.6165

This implies that, after adjusting for the number of predictive variables, 61.65% of the variation of SBP is explained by its linear relationship with the QUET and age.
6. The square root of the mean square error
This value is the square root of the mean square error (MSE). The MSE for this example is 79.50, so the square root of the MSE is 8.916.
7. The table for tests of partial regression coefficients
This table shows the set of predictive variables and their corresponding coefficients. The variable '_cons' refers to the intercept. The partial regression coefficients for the QUET and age are presented in the second column (labelled 'Coef.'). Their standard errors (SE) and the values of the t tests for the partial regression coefficients are presented in the third and fourth columns, respectively. The corresponding p values and 95% CIs of the coefficients are presented in the fifth and sixth columns, respectively.
The null hypotheses for the partial regression coefficients can be defined as:

H_0: \beta_1 = 0
H_0: \beta_2 = 0
For our example, the t value for the QUET is 1.81, with a p value of 0.081. Thus, we fail to reject the null hypothesis and conclude that there is no relationship between the QUET and SBP when age is already in the model. In other words, the addition of the QUET, given that age is already in the model, does not significantly contribute to the prediction of the SBP.

The t value for age is 2.71, with a p value of 0.011. Thus, we reject the null hypothesis and conclude that there is a relationship between age and SBP when the QUET is already in the model. In other words, the addition of age, given that the QUET is already in the model, significantly contributes to the prediction of the SBP.
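These t tests can also be reproduced as partial F tests with STATA's test command after regress; a minimal sketch (assuming the model from Example 2-1 is still in memory) is:

regress sbp quet age
* partial F test for quet, given that age is in the model;
* the reported F equals the square of the t statistic (1.805^2 = 3.26)
test quet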
5. MODEL SELECTION
In this section we focus on determining the best (most important or most valid) subset of
the k predictive variables for describing the relationship between the outcome and the
predictive variables. There are many strategies for selecting the best model. Such strategies
have focused on deciding whether a single variable should be added to a model or whether
a single variable should be deleted from a model. In this section we explain an algorithm
for evaluating models with forward selection, backward elimination, and stepwise
procedures. These procedures are widely used in practice.
5.1 Forward selection procedure
This strategy focuses on deciding whether a single variable should be added to a model. In
the forward selection procedure, we proceed as follows:
Step 0:
1. Fit a simple linear regression model for each of the k potential predictive variables.
2. Select the predictive variable that correlates most highly with the outcome.
3. Fit the regression model with the selected predictive variable.
4. Consider the overall F test: if it is not significant, stop and conclude that no predictive variables are important predictors. If the overall F test is significant, add the selected predictor to the model and go to the next step.
Example:
To fit the simple linear regressions for the three independent variables (age, QUET, and smoking status) predicting SBP using STATA, we would type:
regress sbp quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   36.75
       Model |  3537.94585     1  3537.94585           Prob > F      =  0.0000
    Residual |   2888.0229    30  96.2674299           R-squared     =  0.5506
-------------+------------------------------           Adj R-squared =  0.5356
       Total |  6425.96875    31  207.289315           Root MSE      =  9.8116

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   21.49167   3.545147     6.06   0.000     14.25151    28.73182
       _cons |   70.57641   12.32187     5.73   0.000      45.4118    95.74102
------------------------------------------------------------------------------
regress sbp age

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   45.18
       Model |  3861.63037     1  3861.63037           Prob > F      =  0.0000
    Residual |  2564.33838    30  85.4779458           R-squared     =  0.6009
-------------+------------------------------           Adj R-squared =  0.5876
       Total |  6425.96875    31  207.289315           Root MSE      =  9.2454

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     1.6045   .2387159     6.72   0.000     1.116977    2.092023
       _cons |   59.09162   12.81626     4.61   0.000     32.91733    85.26592
------------------------------------------------------------------------------

xi: regress sbp i.smk
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =    1.95
       Model |  393.098162     1  393.098162           Prob > F      =  0.1723
    Residual |  6032.87059    30  201.095686           R-squared     =  0.0612
-------------+------------------------------           Adj R-squared =  0.0299
       Total |  6425.96875    31  207.289315           Root MSE      =  14.181

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Ismk_1 |   7.023529   5.023498     1.40   0.172    -3.235823    17.28288
       _cons |      140.8   3.661472    38.45   0.000     133.3223    148.2777
------------------------------------------------------------------------------
From the STATA output for the prediction of SBP, we see that the highest squared correlation is for age (R-squared = 0.60). The overall F test for the regression of SBP on age is statistically significant (p < 0.001). Therefore, age is added to the model at this step.
Step 1:
1. Fit regression models that contain the variable initially selected (at step 0) and
another predictive variable which is not yet in the model.
2. Consider the t test of the null hypothesis that $\beta_j = 0$ and the p value associated with each remaining variable.
3. Focus on the variable with the largest t statistic, which is equivalent to the smallest p value. If the test is significant, add that predictive variable to the regression model. If it is not significant, use the model from step 0, which has only one predictive variable.
To fit the multiple regressions that contain age and another predictive variable using STATA, we would type:
regress sbp age quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   25.92
       Model |  4120.59224     2  2060.29612           Prob > F      =  0.0000
    Residual |  2305.37651    29  79.4957416           R-squared     =  0.6412
-------------+------------------------------           Adj R-squared =  0.6165
       Total |  6425.96875    31  207.289315           Root MSE      =   8.916

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.045157   .3860567     2.71   0.011     .2555828    1.834732
        quet |   9.750732   5.402456     1.80   0.081    -1.298531        20.8
       _cons |   55.32344   12.53475     4.41   0.000       29.687    80.95987
------------------------------------------------------------------------------

xi: regress sbp age i.smk
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
From the STATA output, we see that smoking status (variable name smk) has the largest t statistic, with a p value of 0.001. It also gives the largest adjusted R-squared. Therefore, smoking status is added to the regression model at this step.
Step 2:
At each subsequent step, consider the t test for the predictive variables which are not yet in
the model. If the largest t test is statistically significant, add the new variable to the model.
If the t test is not significant, no more variables are included in the model and the process
is stopped.
For our example, we have already added age and smoking status to the model. We now consider whether we should add the QUET index (variable name quet) to the model. To add the QUET to the multiple regression that contains age and smoking status with STATA, we would type:
xi: regress sbp age i.smk quet
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =   29.71
       Model |  4889.82567     3  1629.94189           Prob > F      =  0.0000
    Residual |  1536.14308    28  54.8622529           R-squared     =  0.7609
-------------+------------------------------           Adj R-squared =  0.7353
       Total |  6425.96875    31  207.289315           Root MSE      =  7.4069

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.212715   .3238192     3.75   0.001      .549401    1.876028
     _Ismk_1 |   9.945568   2.656057     3.74   0.001     4.504882    15.38625
        quet |   8.592448   4.498681     1.91   0.066    -.6226828    17.80758
       _cons |   45.10319   10.76488     4.19   0.000     23.05235    67.15404
------------------------------------------------------------------------------
From the above STATA output, the t test for the QUET index, controlling for age and smoking status, is 1.91, with a p value of 0.066. This is not statistically significant at α = 0.05, so the process is stopped. Thus, the forward selection procedure identifies age and smoking status as the best subset of the predictive variables.
To do the forward selection procedure with the STATA command, we would type:
xi: sw regress sbp age quet i.smk, pe(0.05)
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)
                      begin with empty model
p = 0.0000 <  0.0500  adding    age
p = 0.0009 <  0.0500  adding    _Ismk_1

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
5.2 Backward elimination procedure
This strategy focuses on deciding whether a single variable should be deleted from a model. In the backward elimination procedure, we proceed as follows:
Step 0:
1. Fit the regression model that contains all predictive variables.
2. Consider the t test for every variable in the model.
3. Focus on the variable with the lowest t statistic, which also has the highest p value. If the test is not significant, remove that variable from the model. If it is significant, keep that variable in the model.
To fit the regression model that contains all predictive variables (age, QUET, and smoking status) using STATA, we would type:
xi: regress sbp quet age i.smk
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =   29.71
       Model |  4889.82567     3  1629.94189           Prob > F      =  0.0000
    Residual |  1536.14308    28  54.8622529           R-squared     =  0.7609
-------------+------------------------------           Adj R-squared =  0.7353
       Total |  6425.96875    31  207.289315           Root MSE      =  7.4069

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        quet |   8.592448   4.498681     1.91   0.066    -.6226828    17.80758
         age |   1.212715   .3238192     3.75   0.001      .549401    1.876028
     _Ismk_1 |   9.945568   2.656057     3.74   0.001     4.504882    15.38625
       _cons |   45.10319   10.76488     4.19   0.000     23.05235    67.15404
------------------------------------------------------------------------------
From the STATA output, we see that the QUET index has the lowest t statistic, with a p value of 0.066. Therefore, the QUET index is dropped from the regression model at this step.
Step 1:
If a variable is dropped in step 0, re-compute the regression equation for the remaining
variables, and repeat the backward elimination procedure steps 0 and 1. If the variable is
not dropped, the backward elimination procedure is terminated.
From step 0 of our example, the QUET index was dropped from the model, so we re-compute the regression equation without the QUET index. The result of the re-computation is presented as follows:
xi: regress sbp age i.smk
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
From the STATA output, we see that smoking status has the lowest t statistic, with a p value of 0.001. However, the test is significant, so smoking status is not dropped from the regression model. Therefore, we stop here with this model, which is the same model we obtained using the forward selection procedure.
To do the backward elimination procedure using the STATA command, we would type:
xi: sw regress sbp quet age i.smk, pr(0.05)
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)
                      begin with full model
p = 0.0664 >= 0.0500  removing quet

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
5.3 Stepwise procedure
Another approach for model selection is a stepwise method. There are two main versions
of the stepwise procedure: (a) forward selection followed by a test for backward
elimination which is called the forward stepwise procedure and (b) backward elimination
followed by a test for forward selection which is called the backward stepwise procedure.
The forward stepwise procedure allows a variable which was added to the model at an earlier stage to be dropped subsequently if it is no longer helpful in conjunction with variables added at later stages. Conversely, the backward stepwise procedure allows variables which were dropped from the model at an earlier stage to be added later.
For our example, we proceed with the forward stepwise procedure as follows:
Step 0: Set the level of significance for including a variable, denoted by $p_E$, to 0.05. Age is added to the model in this step, since it has the highest significant correlation with SBP.
Step 1: Smoking status is added to the model, since it has a higher significant partial correlation with SBP than does the QUET, given that age is already in the model.
Step 2: Fit the model that contains both variables (age and smoking status), and set the level of significance for removing a variable, denoted by $p_R$, to 0.10. The t test for age, given that smoking status is already in the model, is 8.47, with a p value less than 0.001. Thus, age is not dropped from the model.
Step 3: Add the QUET to the model from the previous step, and check whether we should include it. The t test for the QUET, given that age and smoking status are already in the model, is 1.91, with a p value of 0.066. Thus, the QUET is not added to the model, and the process is terminated.
To do the forward stepwise regression using the STATA command, we would type:
xi: sw regress sbp age quet i.smk, pe(.05) pr(.1) forward
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)
                      begin with empty model
p = 0.0000 <  0.0500  adding    age
p = 0.0009 <  0.0500  adding    _Ismk_1

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   39.16
       Model |  4689.68423     2  2344.84211           Prob > F      =  0.0000
    Residual |  1736.28452    29    59.87188           R-squared     =  0.7298
-------------+------------------------------           Adj R-squared =  0.7112
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7377

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    1.70916   .2017587     8.47   0.000     1.296517    2.121803
     _Ismk_1 |   10.29439   2.768107     3.72   0.001     4.632978    15.95581
       _cons |    48.0496   11.12956     4.32   0.000      25.2871    70.81211
------------------------------------------------------------------------------
Thus, the forward stepwise procedure identifies age and smoking status as the best subset of predictive variables, a result that happens to be consistent with our previous analyses based on forward selection and backward elimination.
6. CONFOUNDING AND INTERACTION

Both confounding and interaction involve additional variables that may affect an association between two or more variables. The additional variables to be considered are synonymously referred to as extraneous variables, control variables, or covariates. For example, in the previous example about SBP, we assess whether age is associated with the SBP, accounting for smoking status, so the extraneous variable here is smoking status.
6.1 Confounding effects
Confounding is the condition where the relationship of interest differs depending on whether an extraneous variable is ignored or included in the data analysis. The assessment of confounding requires a comparison between a crude estimate of an association (which ignores the extraneous variable) and an adjusted estimate of the association (which accounts in some way for the extraneous variable). If the crude and adjusted estimates are meaningfully different, confounding is present and the extraneous variable must be included in the data analysis.
Suppose that we are interested in describing the relationship between a predictive variable "drug therapy" and a continuous outcome "SBP", taking into account the possible confounding effect of a third variable, "age". Suppose drug therapy is a dichotomous variable (e.g., drug = 1 or 0 for drug A or placebo, respectively). The comparison between a crude estimate of the association and an adjusted estimate of the association can be expressed in terms of the following two regression models:

1) SBP: Y = \beta_0 + \beta_1(drug) + \beta_2(age)
2) SBP: Y = \beta_0 + \beta_1(drug)
Model (1) expresses the relationship between drug therapy and SBP, adjusted for the variable age, in terms of the partial regression coefficient ($\beta_1$) of the variable drug. The estimate of $\beta_1$ (which we will denote by $\hat{\beta}_{1|Age}$), obtained from the least-squares fitting of model (1), is an adjusted-effect measure. This value gives the estimated change in SBP per unit change in drug therapy after adjusting for age.

Model (2) expresses the relationship between drug therapy and SBP, ignoring the variable age, in terms of the regression coefficient ($\beta_1$) of the variable drug. The estimate of $\beta_1$ (which is denoted by $\hat{\beta}_1$), obtained from the least-squares fitting of model (2), is a crude estimate of the relationship between drug therapy and SBP.
Thus, confounding is present if the estimates of the regression coefficient of the study variable "drug" from models (1) and (2) are meaningfully different. As an example, suppose that

\hat{\beta}_1 = 15.9 \quad \text{and} \quad \hat{\beta}_{1|Age} = 4.1
Then we can conclude that a 1-unit change in drug therapy yields a 16-unit change in SBP when age is ignored, but only a 4-unit change in SBP when age is controlled. That is, the relationship between drug therapy and SBP is much weaker after controlling for age. Thus, we would treat age as a confounder and control for it in the analysis.
As another example, suppose that:

\hat{\beta}_1 = 6.1 \quad \text{and} \quad \hat{\beta}_{1|Age} = 6.2

Here, we can conclude that age is not a confounder because there is no meaningful difference between the estimates 6.2 and 6.1. Sometimes an investigator may have to deal with more problematic comparisons, such as $\hat{\beta}_1 = 5.5$ versus $\hat{\beta}_{1|Age} = 4.1$. One approach to this problem is to consider the clinical importance of the numerical difference between estimates, based on a priori knowledge of the variable(s) involved.
For example, the estimated coefficients 5.5 and 4.1 are the crude and adjusted differences
of mean SBP between drug A and placebo. It is important to decide whether a mean
difference of 5.5 is clinically more important than a mean difference of 4.1. If there is a meaningful difference in clinical practice, we have to treat age as a confounder and include the variable age in the model.
One approach sometimes used incorrectly to assess confounding is a statistical test of the partial regression coefficient of the extraneous variable. Such a test does not address confounding, but rather precision. It evaluates whether significant additional variation in SBP is explained by adding the variable age to a model already containing the variable drug therapy. In other words, it determines whether a confidence interval for $\beta_1$ is considerably narrower when age is in the model than when it is not. Another reason for not focusing on $\beta_2$ is that, even if $\hat{\beta}_2 \neq 0$, it does not follow that $\hat{\beta}_{1|Age} \neq \hat{\beta}_1$. In other words, $\hat{\beta}_2 \neq 0$ is not sufficient for a confounding effect.
Consider the STATA outputs, which describe two regression models for the relationship
between drug and SBP when the age is ignored or included in the model, respectively.
1) Crude estimates of the relationship between drug therapy and SBP (age is ignored).

xi: regress sbp i.drug
i.drug            _Idrug_0-1          (naturally coded; _Idrug_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =    6.16
       Model |  1094.31693     1  1094.31693           Prob > F      =  0.0189
    Residual |  5331.65182    30  177.721727           R-squared     =  0.1703
-------------+------------------------------           Adj R-squared =  0.1426
       Total |  6425.96875    31  207.289315           Root MSE      =  13.331

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Idrug_1 |   11.90688   4.798404     2.48   0.019     2.107235    21.70653
       _cons |   137.4615   3.697418    37.18   0.000     129.9104    145.0127
------------------------------------------------------------------------------
2) Adjusted estimates of the relationship between drug therapy and SBP, after accounting for age
xi: regress sbp i.drug age
i.drug            _Idrug_0-1          (naturally coded; _Idrug_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  2,    29) =   40.54
       Model |  4733.10854     2  2366.55427           Prob > F      =  0.0000
    Residual |  1692.86021    29  58.3744899           R-squared     =  0.7366
-------------+------------------------------           Adj R-squared =  0.7184
       Total |  6425.96875    31  207.289315           Root MSE      =  7.6403

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Idrug_1 |    10.6436   2.754685     3.86   0.001     5.009639    16.27756
         age |   1.560152   .1976058     7.90   0.000     1.156002    1.964301
       _cons |   55.13354   10.64064     5.18   0.000     33.37098    76.89609
------------------------------------------------------------------------------
From the STATA output, the statistical test of $H_0: \beta_{age} = 0$ has a p value of <0.001. Thus, we reject the null hypothesis and conclude that the addition of age, given that drug is already in the model, significantly contributes to the prediction of the SBP. The adjusted R-squared when age is included in the model is 0.72, whereas the R-squared is 0.17 when age is ignored. However, there is no meaningful difference between the crude and adjusted estimates, 11.91 and 10.64, respectively. As a result, age is not a confounder in this example, although additional variation of SBP is explained by including age in the model.
6.2 Interaction effects
Interaction is the condition where the relationship of interest is different at different levels
of the extraneous variable. The assessment of interaction focuses on describing the
relationship of interest at different levels of the extraneous variable. For example, in
assessing interaction due to sex in describing the relationship between age and SBP, we
must determine whether the regression coefficients of the relationship between age and
SBP differ between males and females.
To illustrate the concept of interaction, let us consider the following example. Suppose that we wish to determine how two independent variables, age and sex, jointly affect the systolic blood pressure. To distinguish between interaction and no interaction, we consider two graphs based on two hypothetical data sets, which are presented in Figure 6-1. These show the straight-line regression of SBP versus age for females against the corresponding regression for males.
In Figure 6-1(a), the two regression lines are parallel. This figure suggests that the rate of change in SBP as a function of age remains the same for males and females. In other words, the relationship between SBP and age does not depend on sex. It can be concluded that there is no interaction between age and sex. In this situation, we can investigate the effects of age and sex on SBP independently of one another. These effects are called the main effects. One way to represent the relationship depicted in Figure 6-1(a) is with a regression model of the form

Y = \beta_0 + \beta_1(age) + \beta_2(sex).

Here, the change in the mean of SBP for a 1-unit change in age is equal to $\beta_1$, regardless of males or females, while changing the category of sex only has the effect of shifting the straight line relating SBP and age, without affecting the value of the slope $\beta_1$.
Figure 6-1 Graphs of non-interacting and interacting independent variables: (a) no interaction between age and sex; (b) interaction between age and sex [both panels plot systolic blood pressure (mmHg) against age (years) for females and males]
In contrast, the two regression lines cross, or tend toward crossing, in Figure 6-1(b). This figure depicts a situation where the relationship between SBP and age depends on sex; in particular, the SBP appears to increase with increasing age for males but to decrease with increasing age for females. It can be concluded that there is an interaction between age and sex. In this situation, we cannot investigate the main effects of age and sex on SBP, since age and sex do not operate independently of one another in their effects on SBP. One way to represent such interaction effects mathematically is to use a regression model of the form
Y = \beta_0 + \beta_1(age) + \beta_2(sex) + \beta_{12}(age)(sex).

An interaction term is basically the product of the two independent variables of interest, which are age and sex in our example. Here the change in the mean of SBP for a 1-unit change in age is equal to $\beta_1 + \beta_{12}(sex)$, which clearly depends on sex. For our particular example, when sex = 0 (i.e., when sex = male), the regression model can be written as:

Y = \beta_0 + \beta_1(age) + \beta_2(0) + \beta_{12}(age)(0)
  = \beta_0 + \beta_1(age)

and when sex = 1 (i.e., when sex = female), the regression model becomes:

Y = \beta_0 + \beta_1(age) + \beta_2(1) + \beta_{12}(age)(1)
  = (\beta_0 + \beta_2) + (\beta_1 + \beta_{12})(age)
Note that these regression lines have different intercepts and different slopes. In linear regression, interaction is evaluated by using statistical tests of product terms involving the basic independent variables in the model.
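Product terms can also be built by hand rather than with the xi prefix; a minimal sketch, assuming variables named sex and age are in memory (hypothetical names for this illustration), is:

* generate the product term explicitly
gen agesex = age*sex
* fit the interaction model and test the product term
regress sbp age sex agesex
test agesex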
We present here an example of how to assess interaction using STATA. We will continue with our example of the relationship between age and SBP to investigate a possible interaction. The results of fitting the linear regression model to this data set indicated that there is a linear relationship between age and SBP. Another question that can be answered with such data is whether an interaction exists between age and smoking status; in other words, whether the slopes of the straight lines relating SBP to age differ significantly for smokers and for non-smokers. To create an interaction term using STATA, we would type:
xi: regress sbp i.smk*age
i.smk             _Ismk_0-1           (naturally coded; _Ismk_0 omitted)
i.smk*age         _IsmkXage_#         (coded as above)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =   26.63
       Model |  4758.42362     3  1586.14121           Prob > F      =  0.0000
    Residual |  1667.54513    28  59.5551833           R-squared     =  0.7405
-------------+------------------------------           Adj R-squared =  0.7127
       Total |  6425.96875    31  207.289315           Root MSE      =  7.7172

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Ismk_1 |  -12.84603   21.71534    -0.59   0.559    -57.32789    31.63583
         age |   1.515216   .2703328     5.61   0.000     .9614643    2.068968
 _IsmkXage_1 |   .4349186   .4048227     1.07   0.292    -.3943232     1.26416
       _cons |   58.57428   14.80476     3.96   0.000      28.2481    88.90046
------------------------------------------------------------------------------
The regression model including the interaction term can be written as:

Y = \beta_0 + \beta_1(smoke) + \beta_2(age) + \beta_3(smoke)(age)

From the STATA output, we see that when smoking status = non-smoker (smoke = 0), the fitted model can be written as:

\hat{Y} = 58.57 - 12.85(0) + 1.52(age) + 0.43(age)(0)
        = 58.57 + 1.52(age)

and when smoking status = smoker (smoke = 1), the fitted model becomes:

\hat{Y} = 58.57 - 12.85(1) + 1.52(age) + 0.43(age)(1)
        = 45.73 + 1.95(age)
Plotting these two lines (Figure 6-2), we see that they appear to be almost parallel, which
indicates that there is probably no statistically significant interaction. We can confirm this
lack of significance with the partial F test of the hypothesis H₀: β₃ = 0, given
that smoking status and age are in the model. The p-value for this test is
0.292; therefore, the slopes of the straight lines relating SBP to age do not differ
significantly between smokers and non-smokers, and there is no interaction between age
and smoking status in this situation. This conclusion is equivalent to finding that smokers
have a consistently higher SBP than non-smokers across the observed age range, while the
rate of change of SBP with age is statistically the same for both groups.
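Figure 6-2 can be reproduced from the fitted model (a minimal sketch, assuming the smoking indicator is the 0/1 variable smk used above):

. * fitted values from the interaction model
. predict sbphat, xb
. * one fitted line per smoking group
. twoway (line sbphat age if smk==0, sort) (line sbphat age if smk==1, sort), legend(label(1 "Non-smokers") label(2 "Smokers"))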
Figure 6-2 Comparison by smoking status of straight-line regressions of SBP on age
7. REGRESSION DIAGNOSTICS
7.1 Normality
The assumption of normality can be assessed formally with a normal probability plot of the
residuals. The residuals are assumed to be independent normal random variables with mean 0
and constant variance σ². Violation of this assumption results in an invalid model. To make
sure that the model is appropriate, we need to check whether the distribution of the
residuals is normal. Checking this assumption can be performed as follows:
STATA command:
1. Estimate the residuals
After fitting the regression model with the regress command, the residuals can be estimated
with the predict command and its resid option:
. xi: regress sbp i.drug age quet
i.drug            _Idrug_0-1          (naturally coded; _Idrug_0 omitted)

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  3,    28) =   32.42
       Model |  4989.65072     3  1663.21691           Prob > F      =  0.0000
    Residual |  1436.31803    28  51.2970727           R-squared     =  0.7765
-------------+------------------------------           Adj R-squared =  0.7525
       Total |  6425.96875    31  207.289315           Root MSE      =  7.1622

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Idrug_1 |   10.62885   2.582308     4.12   0.000     5.339232    15.91847
         age |   1.003488   .3102821     3.23   0.003     .3679039    1.639072
        quet |   9.705102   4.339773     2.24   0.033     .8154801    18.59472
       _cons |   51.38847   10.11437     5.08   0.000     30.67013    72.10681
------------------------------------------------------------------------------

. predict error, resid
2. Create a normal probability plot
. pnorm error
Figure 7-1 Normal probability plot of residuals from the full multiple regression model
3. Test for normality
. swilk error

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
       error |     32    0.97405      0.865    -0.300    0.61789
The test is performed under the null hypothesis that the residuals are normally distributed.
The Shapiro-Wilk statistic is 0.97405 and its associated p-value is 0.61789.
We therefore fail to reject the null hypothesis and conclude that the residuals are normally
distributed.
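A quantile-normal plot and a histogram with a normal overlay offer complementary visual checks (a minimal sketch using standard STATA graph commands):

. * residual quantiles against normal quantiles
. qnorm error
. * histogram of residuals with a fitted normal density
. histogram error, normal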
7.2 Linearity
An augmented component-plus-residual plot against each independent variable will suggest
whether a relationship is linear. This can be performed as below.
STATA command:
1. Plot the augmented component-plus-residual against the independent variable age
. acprplot age, mspline msopts(bands(13)) title(Augmented component-plus-residual versus age plot)
Figure 7-2 Augmented component-plus-residual versus age
2. Plot the augmented component-plus-residual against the independent variable BMI (quet)
. acprplot quet, mspline msopts(bands(13)) title(Augmented component-plus-residual versus BMI plot)
Figure 7-3 Augmented component-plus-residual versus BMI
These graphs suggest neither a definitely linear nor a definitely curved relationship. The
relationships might be linear, but some outliers may prevent the smoothed line from being straight.
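A lowess smooth of the residuals against each predictor is another informal check (a sketch; a roughly flat smooth around zero is consistent with linearity):

. * residuals against age with a lowess smooth
. twoway (scatter error age) (lowess error age)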
7.3 Homoskedasticity
The assumption of homoskedasticity can be assessed by plotting the residuals against the
predicted values of the dependent variable. This plot gives an idea of whether the spread of
the residuals is constant across the fitted values. For example, after fitting drug, age,
and BMI to predict SBP, a plot of the residuals against the predicted values of SBP is shown
in Figure 7-4. In that graph, the residuals lie symmetrically below and above the horizontal
line, which suggests constant variance. A formal test can be performed as below.
. estat hettest error

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: error

         chi2(1)      =     1.82
         Prob > chi2  =   0.1771
The test does not reject the null hypothesis of constant variance (p = 0.1771), suggesting that the variance of the residuals is constant across the values of the independent variables.
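The residual-versus-fitted plot in Figure 7-4 can be obtained directly after regress (a sketch using STATA's built-in post-estimation plot):

. * residuals against fitted values, with a reference line at zero
. rvfplot, yline(0)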
Figure 7-4 Residuals from the regression model, plotted against predicted values of SBP
7.4 Multicollinearity
This issue does not arise in simple regression analysis, but it is important to check for in
the multiple regression model. Multicollinearity occurs when the independent variables are
correlated with one another, and it results in unstable estimates of the coefficients,
inflated standard errors, and unreliable inference about the coefficients. Therefore,
pairwise correlations (r) should be examined to get an idea of which variables might be
highly correlated and could have a collinearity effect in the model if we fit them together.
When collinearity is present, adding or deleting independent variables produces large changes
in the coefficients and their standard errors; a standard error is sometimes as large as its
coefficient, or even larger.
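The pairwise correlations among the candidate predictors can be inspected as a first screen (a minimal sketch, using the predictors age and quet from the example above):

. * pairwise correlations with significance levels
. pwcorr age quet, sig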
To detect multicollinearity in addition to pairwise correlation, the variance inflation
factor (VIF) is usually used. This parameter measures how much the variances of the
regression coefficients are inflated compared with the situation in which the independent
variables are not linearly related. It can be estimated in STATA as below. A VIF > 10
suggests collinearity, and a value close to 1 indicates no evidence of collinearity. If the
VIF is greater than 10, that variable should be considered for omission from the model. For
this example, there is no evidence of collinearity.

. estat vif

    Variable |       VIF       1/VIF
-------------+----------------------
         age |      2.82    0.355212
        quet |      2.81    0.355588
    _Idrug_1 |      1.00    0.996620
-------------+----------------------
    Mean VIF |      2.21
7.5 Outliers
7.5.1 Identifying X outliers
Identifying X outliers in multiple regression is a multi-dimensional problem across the X
variables. Considering each variable separately is not appropriate: an extreme value on a
single variable (a univariate outlier) may not influence the regression line, whereas a
multivariate outlier (which cannot be detected by examining one variable at a time) may
strongly affect the fitted regression line. We use the hat matrix H to identify X outliers,
as follows:
H = X(X′X)⁻¹X′
Let h_ii (the hat value, or leverage) be the ith element on the main diagonal of the H matrix,
which can be obtained from:

h_ii = x_i′(X′X)⁻¹x_i,    0 ≤ h_ii ≤ 1,    Σᵢ h_ii = p

where p = number of regression parameters including the constant term. The higher the h_ii
value (the higher the leverage), the further the ith observation lies from the centre of the
X matrix, and hence the more clearly it is an X outlier. The suggestion is that a hat value
greater than twice the average (2p/n) marks the case as an outlier. Another criterion
classifies h_ii of 0.2-0.5 as moderate leverage and h_ii > 0.5 as high leverage. Estimation
of the hat values can be performed as below.
STATA command: 1. Estimate the leverage values
. predict xdist, hat
. sum xdist, det

                          Leverage
-------------------------------------------------------------
      Percentiles      Smallest
 1%     .0549423       .0549423
 5%     .0568518       .0568518
10%     .0762993       .0651438       Obs                  32
25%     .0843276       .0762993       Sum of Wgt.          32

50%     .1157301                      Mean               .125
                        Largest       Std. Dev.        .056152
75%     .1453192       .2223231
90%     .2223231       .2470635       Variance       .0031531
95%     .2482278       .2482278       Skewness       1.114692
99%     .2663676       .2663676       Kurtosis        3.49025
2. Identify outlying cases
For this model, the cutoff of twice the average leverage is 2 × 4/32 = 0.25. Only one subject
has a leverage value higher than this cutoff; however, even the highest leverage does not
exceed the high-leverage threshold of 0.5. We can see that the subject with id = 2 has the
highest leverage value, 0.266. Keep in mind that this subject is a potential X outlier, but
whether it actually influences the regression model (the predicted values) needs to be
explored further.
. list person error xdist quet age drug if xdist>2*(4/32) & xdist!=.

     +----------------------------------------------------+
     | person       error      xdist    quet   age   drug |
     |----------------------------------------------------|
 18. |      2   -2.082762   .2663676   3.251    41      0 |
     +----------------------------------------------------+
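The moderate/high leverage criteria can be checked in the same way (a sketch; count simply tallies the observations satisfying each condition):

. * moderate leverage: 0.2 <= h < 0.5
. count if xdist >= 0.2 & xdist < 0.5
. * high leverage: h >= 0.5 (excluding missing values)
. count if xdist >= 0.5 & xdist < .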
7.5.2 Identifying Y outliers
This can be done using studentized deleted residuals, which are calculated by:

d_i = Y_i − Ŷ_i(i),    d*_i = d_i / s(d_i)

or, equivalently,

d*_i = e_i / [MSE_(i)(1 − h_ii)]^(1/2) = e_i [ (n − p − 1) / (SSE(1 − h_ii) − e_i²) ]^(1/2)

Here d_i is called the deleted residual: the residual for the ith case when that case is
deleted from the fit, MSE_(i) is the error mean square with the ith case deleted, and SSE is
the error sum of squares of the full fit. The studentized deleted residual follows a
t distribution with n − p − 1 degrees of freedom. To identify outliers, it is compared with
a t critical value at a chosen significance level, say α = 0.05.
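The cutoff used below is the upper 5% point of the t distribution with n − p − 1 = 32 − 4 − 1 = 27 degrees of freedom; a Bonferroni-adjusted cutoff is sometimes preferred when all n cases are screened (a sketch):

. * upper 5% t critical value with 27 df
. display invttail(27, 0.05)
. * Bonferroni-adjusted version over n = 32 cases
. display invttail(27, 0.05/(2*32))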
STATA command: 1. Estimate the studentized deleted residuals
. predict estu,rstudent
. sum estu, det

                    Studentized residuals
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -2.514817      -2.514817
 5%    -1.589173      -1.589173
10%     -1.08973      -1.185283       Obs                  32
25%    -.5998537       -1.08973       Sum of Wgt.          32

50%    -.0976471                      Mean           .0120775
                        Largest       Std. Dev.        1.08881
75%      .626205       1.436678
90%     1.436678       1.613788       Variance       1.185507
95%     2.430551       2.430551       Skewness       .3678051
99%     2.576622       2.576622       Kurtosis       3.461563
2. Identify outlying cases
. list person error estu quet age drug if abs(estu)>invttail(27,0.05)

     +-----------------------------------------------------+
     | person       error        estu    quet   age   drug |
     |-----------------------------------------------------|
  8. |      8    14.76043    2.430551   3.612    48      1 |
  9. |      9    14.84753    2.576622   2.368    44      1 |
 12. |     12   -14.32618   -2.514817   4.032    51      1 |
     +-----------------------------------------------------+
7.5.3 Identifying influential cases
After identifying cases that are outlying with respect to their X values and/or their Y values,
the next step is to ascertain whether or not these outlying cases are influential. We shall take
up three measures of influence that are widely used in practice, each based on the omission of a
single case to measure its influence.
i) Influence on the fitted values (DFFITS)
The first task is to explore whether these X and Y outliers influence the fitted values,
since not every X or Y outlier affects the fitted values. DFFITS takes into account both the
X (leverage) and Y (d*) outlying indexes and is widely used in practice:

DFFITS_i = (Ŷ_i − Ŷ_i(i)) / [MSE_(i) h_ii]^(1/2)

The letters DF refer to the difference between the fitted value Ŷ_i for the ith case when
all n cases are used in fitting the model and the fitted value Ŷ_i(i) obtained when the ith
case is removed. The denominator standardizes this difference, so DFFITS expresses the number
of estimated standard deviations by which the fitted value changes when the ith case is
removed from the model. DFFITS can also be calculated from d* and h_ii as follows:

DFFITS_i = d*_i [ h_ii / (1 − h_ii) ]^(1/2)
As the equation shows, DFFITS is a function of d*, scaled up or down according to the value
of h_ii. A case whose absolute DFFITS exceeds 1 is considered influential for small-to-medium
sample sizes; for large samples the suggested cutoff is 2√(p/n).
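For reference, the large-sample cutoff for this data set would be computed as follows (a sketch):

. * large-sample DFFITS cutoff, 2*sqrt(p/n) with p = 4 and n = 32
. display 2*sqrt(4/32)

The DFFITS values can then be estimated and screened as below.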
STATA command 1. Estimate DFFITS
. predict dfits, dfits
2. Identify influential cases
. list person error dfits quet age drug if abs(dfits)>1

     +-----------------------------------------------------+
     | person       error       dfits    quet   age   drug |
     |-----------------------------------------------------|
  8. |      8    14.76043    1.041157   3.612    48      1 |
  9. |      9    14.84753    1.377664   2.368    44      1 |
 12. |     12   -14.32618   -1.440561   4.032    51      1 |
     +-----------------------------------------------------+
ii) Influence on all of the estimated regression coefficients (Cook’s distance)
Cook’s distance (D_i) measures the impact of the ith case on all of the regression
coefficients (the coefficient vector) when the ith case is omitted. It is defined as:

D_i = (b − b_(i))′ X′X (b − b_(i)) / (p · MSE)
The way to interpret D_i is as follows:
- if D_i > 4/n, or
- if D_i > F(1−α; p, n−p),
the ith case has a substantial influence on the estimated coefficients.
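For this data set both cutoffs can be computed directly (a sketch; the 50th percentile of the F distribution is a common choice for the second rule):

. * rule-of-thumb cutoff 4/n
. display 4/32
. * median of F(p, n-p) = F(4, 28)
. display invF(4, 28, 0.5)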
STATA command: 1. Estimate Cook’s distance
. predict cookd1,cooks
2. Identify influential cases
. list person error cookd1 quet age drug if cookd1>(4/32)

     +-----------------------------------------------------+
     | person       error      cookd1    quet   age   drug |
     |-----------------------------------------------------|
  8. |      8    14.76043    .2305867   3.612    48      1 |
  9. |      9    14.84753    .3949498   2.368    44      1 |
 10. |     10     8.75689    .1641441   4.637    64      1 |
 12. |     12   -14.32618    .4359133   4.032    51      1 |
     +-----------------------------------------------------+
iii) Influence on the partial regression coefficients (DFBETAS)
This measures the influence of the ith case on the estimate of each individual regression
coefficient. It can be assessed by:

DFBETAS_k(i) = (b_k − b_k(i)) / [MSE_(i) c_kk]^(1/2)

where c_kk is the kth diagonal element of (X′X)⁻¹ and b_k(i) is the kth regression
coefficient estimated with the ith case omitted. That is, DFBETAS is the standardized
difference between the coefficients estimated with and without the ith case. Cases with
absolute DFBETAS > 1 (small-to-medium samples) or > 2/√n (large samples) are considered
influential. This can be estimated in STATA as follows:
STATA command:
1. Estimate DFBETAS
. predict df_drug, dfbeta(_Idrug_1)
. predict df_age, dfbeta(age)
. predict df_bmi, dfbeta(quet)
2. Identify influential cases
. list person error df_drug df_age df_bmi quet age drug if abs(df_drug)>1 | abs(df_age)>1 | abs(df_bmi)>1

     +----------------------------------------------------------------------------+
     | person       error     df_drug     df_age      df_bmi    quet   age   drug |
     |----------------------------------------------------------------------------|
 12. |     12   -14.32618   -.4310711   1.128824   -1.263236   4.032    51      1 |
     +----------------------------------------------------------------------------+
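The large-sample DFBETAS cutoff for n = 32 would be (a sketch):

. * large-sample cutoff 2/sqrt(n)
. display 2/sqrt(32)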
Finally, the diagnostics need to be summarized as a whole across the three measures: which
subjects are influential cases affecting the predicted values (DFFITS), the coefficient
vector as a whole (Cook’s D), and the individual coefficients (DFBETAS).
. list person sbp drug age quet smkgr if abs(dfits)>1 | cookd1>(4/32) | abs(df_drug)>1 | abs(df_age)>1 | abs(df_bmi)>1

     +-------------------------------------------+
     | person   sbp   drug   age    quet   smkgr |
     |-------------------------------------------|
  8. |      8   160      1    48   3.612       1 |
  9. |      9   144      1    44   2.368       1 |
 10. |     10   180      1    64   4.637       1 |
 12. |     12   138      1    51   4.032       1 |
     +-------------------------------------------+
There are four subjects who are potentially influential cases under the above criteria. In
summary, the residuals are normally distributed with constant variance, but there are some
outliers that influence the regression model. We therefore need to explore the data further,
starting with the subjects identified above. Their data need to be checked and validated for
all variables to make sure they are correct. If any data are incorrect, they should be
corrected and the regression model re-fitted; the model diagnostics then need to be
re-assessed. If the data for these subjects are correct as they stand, try excluding these
subjects and see how the regression model changes.
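As a sensitivity check, the model can be re-fitted without the flagged subjects (a sketch using the person identifiers listed above):

. * re-fit excluding the potentially influential cases
. xi: regress sbp i.drug age quet if !inlist(person, 8, 9, 10, 12)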
Assignment V: Multiple regression (25%)
Due date: Oct 8, 2015
From the randomized controlled trial of calcium supplements, researchers wanted to determine
the factors associated with BMD at the total left femur, stored in the variable ‘total’. The
factors to explore are coded as follows:
- 1 = smoke, 2 = ex-smoke, 3 = non-smoke
- 1 = yes, 2 = sometimes, 3 = no
- hours/day
- 1 = regular, 2 = sometimes, 3 = no
- classified by glucose >= 126 mg/dl
- classified as hypertension if SBP >= 140 or DBP >= 90; otherwise normal
- classified as high cholesterol if cholesterol >= 240 mg/dl; otherwise normal
The data are given in the data set cross-sectional_BMD_&_risk_factor.dta.
- Fit the regression model step by step and explain the method used.
- What is the parsimonious equation?
- Perform diagnostic measures, check the assumptions, and explain the results.
- Interpret the results of the final model, writing a report in text and tables as needed.