TRANSCRIPT
RACE 625: Medical Statistics in Clinical Research
Multiple linear regression
Asst. Prof. Dr. Sasivimol Rattanasiri [email protected]
Doctor of Philosophy Program in Clinical Epidemiology, Section for Clinical Epidemiology & Biostatistics, Faculty of Medicine Ramathibodi Hospital, Mahidol University
Semester 1, 2017 www.ceb-rama.org
CONTENTS
1. MULTIPLE REGRESSION MODEL
2. FITTING MULTIPLE REGRESSION MODEL
3. FITTING CATEGORICAL INDEPENDENT VARIABLES
4. EVALUATING THE MULTIPLE REGRESSION MODEL
4.1 Test for overall regression
4.2 The coefficient of multiple determination
4.3 Test for partial regression coefficients
4.4 Evaluating multiple regression by STATA
5. MODEL SELECTION
5.1 Forward selection procedure
5.2 Backward elimination procedure
5.3 Stepwise procedure
6. CONFOUNDING AND INTERACTION
6.1 Confounding effects
6.2 Interaction effects
7. REGRESSION DIAGNOSTICS
7.1 Normality
7.2 Linearity
7.3 Homoskedasticity
7.4 Multicollinearity
7.5 Outliers
OBJECTIVES
This module will help you to be able to:
Fit a regression model considering two or more co-variables simultaneously
Perform model selection
Assess goodness of fit of the model & check assumptions
Interpret & report results
REFERENCES
1. Neter J, Wasserman W, and Kutner MH. Applied linear statistical models, third edition. Boston:
IRWIN, 1990; 21-433.
2. Kleinbaum DG, Kupper LL, Muller KE, and Nizam A. Applied regression analysis and other
multivariable methods. Washington: Duxbury Press, 1998; 39-212.
3. Altman DG. Practical statistics for medical research. London: Chapman & Hall, 1991.
4. Hosmer DW and Lemeshow S. Applied logistic regression. New York: John Wiley & Sons, 1989.
READING SECTION
Read Neter J et al.; Chapters 7, 8, 11, & 12
ASSIGNMENT IV (25%)
P.39, Due: September 27, 2017
1. MULTIPLE REGRESSION MODEL
In Module IV, we discussed simple linear regression and correlation analysis. A simple
regression model includes one predictor and one outcome. In practice, an outcome is
usually affected by more than one predictor. For example, the systolic blood pressure
(SBP) may be determined by the age, smoking behavior, and body mass index (BMI). To
investigate the more complicated relationship between an outcome and a number of
predictors, we use a natural extension of simple linear regression known as multiple
regression analysis.
There are several advantages of using multiple regression analysis.
1) To develop a prognostic index that predicts the outcome from several predictors. For
example, the SBP may be predicted by the age, smoking behavior, and BMI.
2) To adjust (control) for potential confounding factors for which the study design has
not planned. For example, the effect of BMI on the SBP may be confounded by age and
smoking behavior. Fitting all these variables together allows us to assess the effect of
BMI adjusted for age and smoking behavior.
A multiple regression model with y as an outcome (dependent variable) and X1, X2, X3, ..., Xk as k
predictors (independent variables) is written as:

y = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk + ε   (1)

where β0 refers to the population intercept or constant term,
β1, β2, β3, ..., βk are the population slopes or regression coefficients of the independent
variables X1, X2, X3, ..., Xk, respectively,
ε refers to the random error term.
The constant term β0 is the mean value of the outcome when all predictors in the
model take the value 0.
The coefficients β1, β2, β3, ..., βk are called the partial regression coefficients. For
example, β1 is the partial coefficient of X1, and it gives the change in y due to a one-unit
change in X1 when all other predictors included in the model are held constant.
A positive value for a particular βi in model (1) indicates a positive relationship
between the outcome (Y) and the related predictor (Xi). A negative value for a
particular βi in that model indicates a negative relationship between the outcome
(Y) and the related predictor (Xi).
The multiple regression model (1) can only be used when the relationship between the
outcome and each predictor is linear. Each of the Xi variables in the model (1) represents
a single variable raised to the first power. This model is referred to as a first-order multiple
regression model.
For a data sample, a multiple linear regression is written as:

yi = a + b1X1i + b2X2i + b3X3i + ... + bkXki + ei   (2)

where
ŷi is the estimated or predicted value of the outcome yi,
a is the unbiased estimator of the population intercept,
b1, b2, b3, ..., bk are the unbiased estimators of the population slopes of the predictor
variables X1, X2, X3, ..., Xk, respectively,
ei is the random error, which is the difference between the observed and predicted
values of the dependent variable (yi − ŷi). The technical term for this difference is a
residual.
2. FITTING MULTIPLE REGRESSION MODEL
The method of least squares is used to estimate the parameters in the multiple regression
model. In general, the least squares method chooses as the best-fitting model the one that
minimizes the sum of squares of the distance between the observed values and predicted
values of the outcome which can be defined as:
Σi=1..n (yi − ŷi)² = Σi=1..n (yi − a − b1X1i − b2X2i − b3X3i − ... − bkXki)²   (3)
The least squares estimates of the regression coefficients a, b1, b2, b3, ..., bk in model (3) are
obtained by using matrix mathematics. In this text we do not present the matrix formulas for
calculating these least squares estimates; the matrix formulation for multiple regression can
be found in Kleinbaum, Appendix B, pages 732-743, and Neter, Section 7.2, pages 236-239.
Instead of using the formulas manually, the calculations in a multiple regression analysis
are made by using statistical software packages, such as STATA or SPSS. Even for a
multiple regression with two predictors, the formulas are complex and manual calculations
are time consuming. In this module we perform the multiple regression analysis using the
STATA program. However, the solutions obtained by using other statistical software
packages such as SPSS, SAS, or MINITAB can be interpreted in the same way.
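The least-squares idea can be illustrated with a small pure-Python sketch: build the normal equations (X'X)b = X'y and solve them, which is equivalent to minimizing the sum of squared residuals in equation (3). The toy data, true coefficient values, and function names below are illustrative assumptions, not the module's SBP dataset.

```python
# A minimal pure-Python sketch of least squares for y = a + b1*x1 + b2*x2.
# Solving the normal equations (X'X) b = X'y minimizes the sum of squared
# residuals in equation (3). Toy data only; not the module's dataset.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_ols(rows):
    """rows: list of (x1, x2, y); returns the estimates [a, b1, b2]."""
    X = [[1.0, x1, x2] for x1, x2, _ in rows]
    y = [row[2] for row in rows]
    XtX = [[sum(X[i][p] * X[i][q] for i in range(len(X))) for q in range(3)]
           for p in range(3)]
    Xty = [sum(X[i][p] * y[i] for i in range(len(X))) for p in range(3)]
    return solve(XtX, Xty)

# Noise-free toy data generated from y = 2 + 3*x1 + 0.5*x2,
# so the fit should recover those coefficients (up to rounding error).
data = [(x1, x2, 2 + 3 * x1 + 0.5 * x2)
        for x1 in (1, 2, 3, 5) for x2 in (10, 20, 40)]
a, b1, b2 = fit_ols(data)
```

In practice these matrix computations are exactly what the `regress` command performs internally, which is why the manual formulas are rarely used.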
Example 2-1 Researchers wanted to investigate how SBP varies with the BMI and age.
The outcome is ‘sbp’ and the predictors are ‘BMI’ and ‘age’. To fit the multiple linear
regression by using the STATA, we would type:
. regress sbp1 bmi age
1. Source | SS df MS 2.Number of obs = 30
-----------+---------------------------------- 3.F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 4.R-squared = 0.4044
-----------+---------------------------------- 5.Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
6. sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
There are two predictors, the BMI and age, included in this model. In the STATA output
no. 6, the variable '_cons' refers to the intercept. The regression coefficients for the
intercept, BMI, and age are presented in the second column (labeled 'Coef.'). The multiple
regression model for this example is given as:
predicted sbp = 56.02 + 1.78(BMI) + 0.56(age)

From this equation, the estimated intercept is 56.02. It is the value of the estimated SBP
when BMI = 0 and age = 0. This means that a patient who has zero BMI and zero age is
expected to have an SBP of 56.02 mmHg. This is the technical interpretation of the
intercept. In reality, that may not be meaningful because none of the patients in our
sample has both zero BMI and zero age. In most medical applications the value of the
intercept has no practical meaning.
The estimated coefficient of the BMI (b1) is 1.78. This value gives the change in the
estimated SBP for a one-unit change in the BMI when the age is held constant. Thus, we
can state that a patient with one extra unit of BMI but the same age is expected to have a
higher SBP of 1.78 mmHg.
The estimated coefficient of age (b2) is 0.56. This value gives the change in the estimated
SBP for a one-unit change in age when the BMI is held constant. Thus, we can state that
a patient who is one year older but has the same BMI is expected to have a higher
SBP of 0.56 mmHg.
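As a quick numeric illustration of the fitted equation, a hypothetical patient profile can be plugged in; the BMI and age values below are invented for illustration, and the coefficients are the rounded estimates from the output above.

```python
# Predicted SBP from the fitted model: sbp = 56.02 + 1.78*BMI + 0.56*age.
# The patient profile (BMI 25, age 50) is a hypothetical example.
intercept, b_bmi, b_age = 56.02, 1.78, 0.56

def predict_sbp(bmi, age):
    """Return the predicted SBP (mmHg) for a given BMI and age."""
    return intercept + b_bmi * bmi + b_age * age

sbp_hat = predict_sbp(25, 50)  # 56.02 + 44.5 + 28.0 = 128.52 mmHg
```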
3. FITTING CATEGORICAL INDEPENDENT VARIABLES
All the predictive variables that we have considered up to this point have been measured
on a continuous scale. However, regression analysis can be generalized to incorporate
categorical variables into the model. For example, we want to test if smoking status affects
the SBP. The generalization is based on the use of dummy variables which is the central
idea of this section.
The dummy variable is a binary variable that takes on the values 0 and 1 which is used to
identify the different categories of a categorical variable. The term “dummy” is used to
indicate the fact that the numerical values (such as 0 and 1) assumed by the variable have
no quantitative meaning, but are used to identify different categories of the categorical
variable under consideration.
For example, we have a variable for smoking status coded 1 for current smoker, 2 for ex-
smoker, and 3 for never smoker. Thus, the dummy variables for smoking behavior are
presented as follows:
Smoking behavior    Dummy 1   Dummy 2   Dummy 3
Current smoker         1         0         0
Ex-smoker              0         1         0
Never smoker           0         0         1
To include these variables in a multiple regression model, we use k-1 dummy variables for
k categories. The omitted category is referred to as the reference group. It is arbitrary
which group is assigned to be the reference group. The choice of a reference group is
usually dictated by subject-matter considerations. For example, if never smoke category is
referred to as the reference group in the multiple regression analysis, only the dummy
variables 1 and 2 are fitted in the model.
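The k−1 coding scheme can be sketched in a few lines; the category labels and the choice of never smoker as the reference follow the table above, while the function name is a hypothetical illustration.

```python
# k-1 dummy coding for a categorical variable with k = 3 levels.
# "never" is taken as the reference group, so it scores 0 on both dummies.
def smoke_dummies(status):
    """Map a smoking status to (dummy1, dummy2); 'never' is the reference."""
    coding = {
        "current": (1, 0),
        "ex":      (0, 1),
        "never":   (0, 0),  # reference group: all dummies are 0
    }
    return coding[status]
```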
There are two easy ways to create dummy variables in STATA. The first is to use the tabulate
command with the generate( ) option, as shown below.

. tabulate smoke, gen(dum)

      smoke |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          8       25.00       25.00
          2 |          9       28.13       53.13
          3 |         15       46.88      100.00
------------+-----------------------------------
      Total |         32      100.00

. list smoke dum1 dum2 dum3

     +----------------------------+
     | smoke   dum1   dum2   dum3 |
     |----------------------------|
  1. |     1      1      0      0 |
  2. |     1      1      0      0 |
  3. |     1      1      0      0 |
  4. |     2      0      1      0 |
  5. |     2      0      1      0 |
     |----------------------------|
  6. |     2      0      1      0 |
  7. |     3      0      0      1 |
  8. |     3      0      0      1 |
  9. |     3      0      0      1 |
     +----------------------------+
The tabulate command with the generate option created three dummy variables called dum1,
dum2, and dum3. STATA creates all three dummy variables and lets the users choose
their own reference. Suppose that we add the smoking status to the regression model for
predicting SBP that already contains the BMI and age. If the never smoke category is
taken as the reference group, only dum1 and dum2 are fitted in the model as
follows:
. regress sbp1 bmi age dum1 dum2
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
dum1 | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
dum2 | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
The second way is to use the xi command to create dummy variables.

. xi i.smoke
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_1 omitted)

. list smoke _Ismoke_2 _Ismoke_3
+-------------------------------+
| smoke _Ismok~2 _Ismok~3 |
|-------------------------------|
1. | past 1 0 |
2. | never 0 1 |
3. | never 0 1 |
4. | never 0 1 |
5. | never 0 1 |
The xi command created two dummy variables called _Ismoke_2 and _Ismoke_3 and omitted
the dummy variable for group 1 as the reference group. By default the xi command assigns
the first category as the reference group. If we want another category to be the reference, it
can be re-assigned using the char var [omit] # command. For example, if we would like group 3
to be the reference group, we would type
char smoke [omit] 3
xi i.smoke
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_3 omitted)
The xi command is a prefix command, which can be followed by another modeling
command. To fit the multiple regression of SBP on the BMI, age, and smoking status, we
would type
. char smoke [omit] 3
. xi: regress sbp1 bmi age i.smoke
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_3 omitted)
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
_Ismoke_1 | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
_Ismoke_2 | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
Alternatively, a single command using factor-variable notation can specify the reference group, as follows:
. regress sbp1 bmi age ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
|
smoke |
present | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
past | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
|
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
With these commands, STATA treats BMI and age as continuous variables and smoke
as a categorical variable represented by dummies. We can interpret that, with identical
BMI and age, current smokers have an SBP about 0.07 mmHg non-significantly lower than
never smokers, and ex-smokers have an SBP about 0.04 mmHg non-significantly lower than
never smokers.
4. EVALUATING THE MULTIPLE REGRESSION MODEL
Before using a multiple regression model to predict or estimate, it is desirable to determine
first whether it adequately describes the relationship between the outcome and a set of
predictors and whether it can be used effectively for the purpose of prediction.
4.1 Test for overall regression
We now consider an overall test for a regression model which contains k predictors in the
model. The null hypothesis for this test is that there is no linear relationship between the
outcome and the set of predictors. In other words, all of the predictors considered together
do not explain a significant amount of the variation in the outcome. The null and
alternative hypotheses are defined as:
H0: β1 = β2 = ... = βk = 0
Ha: not all βj (j = 1, ..., k) equal zero
An ANOVA approach can be used to perform this test. The particular form of an ANOVA
table for regression analysis is presented in Table 4-1. The ANOVA approach is based on
the partitioning of the total variation of the observed values yi into two components as
follows:

(yi − ȳ) = (ŷi − ȳ) + (yi − ŷi)   (4)

total variation = explained variation + unexplained variation
Thus, the total variation can be viewed as the sum of two components:
1. Explained variation is the variation of the predicted values ŷi around the mean ȳ,
which is measured by the regression sum of squares (SSR). This indicates how
much of the variation of the observed values yi can be explained by the predictive
variables that are included in the regression model. The mean square regression
(MSR) is obtained by dividing the regression sum of squares by its corresponding
degrees of freedom.
2. Unexplained variation is the variation of the observed values yi around the fitted
regression line, which is measured by the error sum of squares (SSE). The mean
square residual (MSE) is obtained by dividing the error sum of squares by its
corresponding degrees of freedom.
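Using the sums of squares from the Example 2-1 output, one can verify numerically that the two components add up to the total variation, i.e. SST = SSR + SSE:

```python
# Sums of squares read off the Example 2-1 STATA ANOVA table.
SSR = 3496.04264   # explained (regression) sum of squares
SSE = 5149.95736   # unexplained (error) sum of squares
SST = 8646.0       # total sum of squares

# The partition of total variation implies SST = SSR + SSE.
assert abs((SSR + SSE) - SST) < 1e-6
```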
Table 4-1 ANOVA table for multiple regression

Source of variation   SS                    df          MS                      F          P value
Regression            SSR = Σ(ŷi − ȳ)²      k           MSR = SSR/k             MSR/MSE
Error                 SSE = Σ(yi − ŷi)²     n − k − 1   MSE = SSE/(n − k − 1)
Total                 SST = Σ(yi − ȳ)²      n − 1
The appropriate statistical test for significant overall regression is the F test, which is obtained
by dividing the mean square regression by the mean square residual as follows:
F = MSR / MSE   (5)
This test has a F distribution with k and n-k-1 degrees of freedom. The computed value of F
can then be converted to the associated p value. The last step is to compare the p value with the
level of significance, and make a decision whether the null hypothesis is rejected or not.
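The overall F statistic in equation (5) can be reproduced from the Example 2-1 ANOVA quantities:

```python
# Overall F test for Example 2-1, computed from the ANOVA quantities.
SSR, SSE = 3496.04264, 5149.95736
n, k = 30, 2                     # observations and predictors

MSR = SSR / k                    # mean square regression
MSE = SSE / (n - k - 1)          # mean square error
F = MSR / MSE                    # overall F statistic, df = (k, n - k - 1)
```

Rounded to two decimals this reproduces the F(2, 27) = 9.16 shown in the STATA output.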
4.2 The coefficient of multiple determination
In Section 2.4.2 of Module IV, we discussed the coefficient of determination for a
simple linear regression model. It is the proportion of the total sum of squares (SST) that is
explained by the simple regression model. The coefficient of determination for the multiple
regression model, usually called the coefficient of multiple determination, is denoted by R².
It is defined as the proportion of the total sum of squares (SST) that is explained by the
multiple regression model. It tells us how good the multiple regression model is and how
well the predictive variables included in the model explain the outcome.
The calculation of the R² is also based on the ANOVA table presented in Table 4-1, and is
defined as:

R² = SSR / SST = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = 1 − SSE / SST   (8)
However, the R² has one major shortcoming. The R² value generally increases as we add
more and more predictive variables to the regression model. This does not imply that the
regression model with a higher value of R² does a better job of predicting the outcome.
Such a value of R² can be misleading, and it will not represent the true power of the predictive
variables in the regression model. To eliminate this shortcoming of R², it is preferable to use
the adjusted R², which is a modified measure that adjusts for the number of predictive
variables in the model. The adjusted R² can be determined by dividing each sum of squares by
its associated degrees of freedom. Thus, the adjusted R² is defined as:
R²a = 1 − (SSE / (n − k − 1)) / (SST / (n − 1))   (9)
where n is the total number of observations, and k is the number of predictive variables in
the model.
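Equations (8) and (9) can be checked against the Example 2-1 output (n = 30, k = 2):

```python
# R-squared and adjusted R-squared for Example 2-1 (n = 30, k = 2),
# using the sums of squares from the STATA ANOVA table.
SSE, SST = 5149.95736, 8646.0
n, k = 30, 2

r2 = 1 - SSE / SST                                   # equation (8)
adj_r2 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))   # equation (9)
```

These reproduce the R-squared = 0.4044 and Adj R-squared = 0.3602 shown in the STATA output.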
4.3 Test for partial regression coefficients
Frequently we wish to test whether the addition of a specific predictor (X*), given that the
others (X1, X2, X3, ..., Xk) are already in the model, significantly improves the prediction of
the outcome. That is, we want to test the null hypothesis that β* = 0 against the alternative
hypothesis that β* ≠ 0 in the model

y = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk + β*X* + ε.
The appropriate statistical test for these partial regression coefficients is the partial F test.
The concept of this test is to compare the error sum of squares between two models:
1) The full model contains kXXX ,...,, 21 and *X as the predictors.
2) The reduced model contains kXXX ,...,, 21 but not *X .
The error sum of squares of the full model is smaller than that of the reduced model. The
difference between these error sums of squares is called the extra sum of squares.
The error sum of squares when X1, X2, ..., Xk and X* are in the model is denoted
by SSE(X1, X2, ..., Xk, X*).
The error sum of squares when X1, X2, ..., Xk are in the model is denoted by
SSE(X1, X2, ..., Xk).
The extra sum of squares, denoted by SSR(X* | X1, X2, ..., Xk), is defined as:

SSR(X* | X1, X2, ..., Xk) = SSE(X1, X2, ..., Xk) − SSE(X1, X2, ..., Xk, X*)

Thus, the partial F test is defined as:

F* = [SSR(X* | X1, X2, ..., Xk) / 1] / [SSE(X1, X2, ..., Xk, X*) / (n − k − 1)]
   = MSR(X* | X1, X2, ..., Xk) / MSE(X1, X2, ..., Xk, X*)   (6)
This F statistic has an F distribution with 1 and n-k-1 degrees of freedom.
Note:
To distinguish the partial F test in equation (6) from the overall F test in equation (5), we use
the F* test statistic for the partial F test, and the F test statistic for the overall F test.
An equivalent way to perform the partial F test is to use a t test. The t test focuses on a test
of the null hypothesis that β* = 0. The t test for testing this null hypothesis is computed as:

t = b* / SE(b*)   (7)

where b* is the estimated coefficient of the specific predictor (X*) in the regression model
y = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk + β*X* + ε,
and SE(b*) is the estimate of the standard error of b*.
This test has a t distribution with n-k-1 degrees of freedom.
Since the two tests are equivalent, the choice is usually made in terms of the available
information provided by the computer package output.
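The t test in equation (7), and its equivalence to the partial F test, can be checked with the BMI coefficient and standard error from the Example 2-1 output:

```python
# t test for the partial regression coefficient of BMI in Example 2-1,
# using the coefficient and standard error from the STATA output.
b_bmi, se_bmi = 1.783396, 0.849328

t = b_bmi / se_bmi          # t statistic with n - k - 1 = 27 df
partial_F = t ** 2          # equivalent partial F* statistic (1 and 27 df)
```

Rounded to two decimals, t matches the 2.10 reported in the output table.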
4.4 Evaluating multiple regression by STATA
Consider again the STATA output of Example 2-1; the numbered parts of the output are discussed below.
. regress sbp1 bmi age
1. Source | SS df MS 2.Number of obs = 30
-----------+---------------------------------- 3.F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 4.R-squared = 0.4044
-----------+---------------------------------- 5.Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
6. sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------+----------------------------------------------------------------
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
1. Table of the analysis of variance
The ANOVA table shows the values of sum of squares, degrees of freedom and variances as:
- The values of sum of squares are
Sum of squares of total variation: SST=8646
Sum of squares of regression: SSR=3496.04
Sum of squares of error: SSE=5149.96
- The degrees of freedom (df) are
df of the total variation is n-1, that is 30-1=29
df of the explained variation is k, that is 2.
df of the unexplained variation is n-k-1, that is 27.
- The variances or mean square (MS) are
Mean square of total variation (MST) is SST/df = 8646/29 = 298.14
Mean square of regression (MSR) is SSR/df = 3496.04/2 = 1748.02
Mean square of error (MSE) is SSE/df = 5149.96/27 = 190.74
2. The number of total observations
The number of total observations for this example is 30.
3. The overall F test for overall regression
We now consider the overall F test for a regression model which contains the predictive
variables; BMI and age. The null hypothesis for this test is that there is no linear relationship
between the SBP and the set of predictive variables; BMI and age. From the ANOVA table, the
overall F test for the null hypothesis H0: βBMI = βage = 0 is computed as:

F = MSR / MSE = 1748.02 / 190.74 = 9.16.
This test has the F distribution with 2 and 27 degrees of freedom. The output from the STATA
indicated that the p value for this example (0.0009) is less than 0.001. As a result, we reject the
null hypothesis and conclude that there is a relationship between the SBP and the set of
predictive variables, BMI and age.
4. The R²
The estimated R² for this example is computed as:

R² = SSR / SST = 3496.04 / 8646 = 0.4044
This implies that 40.44% of the variation of SBP is explained by its linear relationship with the
BMI and age.
5. The adjusted R²
The estimate of the adjusted R² for this example is computed as:

R²a = 1 − (SSE / (n − k − 1)) / (SST / (n − 1)) = 1 − (5149.96 / 27) / (8646 / 29) = 0.3602
This implies that after adjusting for the number of predictive variables, 36.02% of the variation
of SBP is explained by its linear relationship with the BMI and age.
6. The table for tests of partial regression coefficients
This table shows the set of predictive variables and their corresponding coefficients. The
variable '_cons' refers to the intercept. The partial regression coefficients for the BMI and
age are presented in the second column (labeled 'Coef.'). Their standard errors (SE) and the
values of t test for partial regression coefficients are presented in the third and fourth columns,
respectively. The corresponding p values and 95% CI of coefficients are presented in the fifth
and sixth columns, respectively.
The null hypotheses for partial regression coefficients can be defined as:
H0: β1 = 0
H0: β2 = 0
For our example, the t value for the BMI is 2.10, which has a p value of 0.045. Thus, we
reject the null hypothesis and conclude that there is a linear relationship between the BMI and
SBP when age is already in the model. In other words, the addition of the BMI, given that age
is already in the model, significantly contributes to the prediction of the SBP.
The t value for age is 3.39, which has a p value of 0.002. Thus, we reject the null hypothesis
and conclude that there is a linear relationship between age and SBP when BMI is already in
the model. In other words, the addition of age, given that BMI is already in the model,
significantly contributes to the prediction of the SBP.
5. MODEL SELECTION
In this section we focus on determining the best (most important or most valid) subset of
the k predictive variables for describing the relationship between the outcome and the
predictive variables. There are many strategies for selecting the best model. Such strategies
have focused on deciding whether a single variable should be added to a model or whether
a single variable should be deleted from a model. In this section we explain an algorithm
for evaluating models with forward selection, backward elimination, and stepwise
procedures. These procedures are widely used in practice.
5.1 Forward selection procedure
This strategy focuses on deciding whether a single variable should be added to a model. In
the forward selection procedure, we proceed as follows:
Step 0:
1. Fit a simple linear regression model for each of the k potential predictive variables.
2. Select the first predictive variable which most highly correlates with the outcome.
3. Fit the regression model to the selected predictive variable.
4. Consider the overall F test. If it is not significant, stop and conclude that no predictive
variables are important predictors. If the overall F test is significant, add the selected
predictor to the model and go to the next step.
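Step 0 can be sketched in a few lines of pure Python: compute each candidate predictor's squared correlation with the outcome and select the largest. The toy data and variable names below are illustrative, not the module's SBP dataset.

```python
# Forward selection, step 0: pick the predictor most highly correlated
# with the outcome (illustrative toy data, not the SBP dataset).
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

y = [110, 120, 125, 140, 150]
predictors = {
    "age":   [30, 40, 45, 60, 70],   # strongly related to y in this toy data
    "noise": [5, 1, 4, 2, 3],        # unrelated to y
}

# Squared correlation of each candidate with the outcome.
r2_by_var = {name: pearson_r(xs, y) ** 2 for name, xs in predictors.items()}
first_selected = max(r2_by_var, key=r2_by_var.get)
```

After this choice, the overall F test of the selected single-predictor model decides whether the procedure continues, exactly as in step 0 above.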
Example:
To fit a simple linear regression of SBP on each of the three independent variables (age, BMI,
and smoking status) using STATA, we would type:
. regress sbp1 age
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(1, 28) = 12.41
Model | 2655.06439 1 2655.06439 Prob > F = 0.0015
Residual | 5990.93561 28 213.961986 R-squared = 0.3071
-------------+---------------------------------- Adj R-squared = 0.2823
Total | 8646 29 298.137931 Root MSE = 14.627
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .6133205 .1741078 3.52 0.001 .2566768 .9699642
_cons | 94.03786 7.572722 12.42 0.000 78.52584 109.5499
------------------------------------------------------------------------------
. regress sbp1 bmi
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(1, 28) = 4.95
Model | 1298.35408 1 1298.35408 Prob > F = 0.0344
Residual | 7347.64592 28 262.415926 R-squared = 0.1502
-------------+---------------------------------- Adj R-squared = 0.1198
Total | 8646 29 298.137931 Root MSE = 16.199
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bmi | 2.193388 .986084 2.22 0.034 .1734861 4.213289
_cons | 69.75698 22.33493 3.12 0.004 24.00596 115.508
------------------------------------------------------------------------------
. regress sbp1 ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 0.07
Model | 43.6752137 2 21.8376068 Prob > F = 0.9339
Residual | 8602.32479 27 318.604622 R-squared = 0.0051
-------------+---------------------------------- Adj R-squared = -0.0686
Total | 8646 29 298.137931 Root MSE = 17.849
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
smoke |
present | -2.692308 8.020824 -0.34 0.740 -19.14968 13.76506
past | .0854701 7.740062 0.01 0.991 -15.79583 15.96677
|
_cons | 119.6923 4.95056 24.18 0.000 109.5346 129.85
------------------------------------------------------------------------------
From the STATA output for prediction of SBP, we see that the highest squared correlation
is for age (R-squared = 0.3071). The overall F test for the regression of SBP on age is
statistically significant (p = 0.0015). Therefore, age is added to the model at this step.
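The step-0 screening just described (fit each one-predictor model and keep the variable with the largest t statistic, i.e. the smallest p value) can be sketched in pure Python. The toy data below are hypothetical, not the course data set:

```python
import math

# Hypothetical toy data: SBP with two candidate predictors.
age = [40, 45, 50, 55, 60, 65, 70, 75]
bmi = [22, 30, 24, 29, 23, 31, 25, 28]
sbp = [120, 126, 130, 137, 141, 149, 152, 160]

def t_stat(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b1 / se_b1

# Step 0: fit each one-predictor model and keep the variable with the
# largest |t| (equivalently, the smallest p value).
candidates = {"age": age, "bmi": bmi}
best = max(candidates, key=lambda name: abs(t_stat(candidates[name], sbp)))
print(best)  # with these numbers, age dominates
```

In practice the selected t statistic would still be compared against its critical value before the variable enters the model, as the procedure describes.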
Step 1:
1. Fit regression models that contain the variable initially selected (at step 0) and
another predictive variable which is not yet in the model.
2. Consider the t test of the null hypothesis that βj = 0, and the p value associated
with each remaining variable.
3. Focus on the variable with the largest t statistic, which is equivalent to the smallest
p value. If the test is significant, add that predictive variable to the regression model.
If it is not significant, keep the model from step 0, which has only one predictive
variable.
To fit the multiple regressions that contain age and one other predictive variable in
STATA, we would type:
. regress sbp1 age bmi
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
. regress sbp1 age ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(3, 26) = 3.85
Model | 2659.70909 3 886.569696 Prob > F = 0.0210
Residual | 5986.29091 26 230.241958 R-squared = 0.3076
-------------+---------------------------------- Adj R-squared = 0.2277
Total | 8646 29 298.137931 Root MSE = 15.174
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .6147128 .1823656 3.37 0.002 .239855 .9895706
|
smoke |
present | -.3221169 6.854604 -0.05 0.963 -14.41196 13.76772
past | -.9337973 6.586714 -0.14 0.888 -14.47298 12.60539
|
_cons | 94.34723 8.616691 10.95 0.000 76.63536 112.0591
------------------------------------------------------------------------------
From the STATA output, we see that BMI has the largest t statistic, with a p value of 0.045.
It also gives the largest adjusted R-squared. Therefore, BMI is added to the regression
model at this step.
Step 2:
At each subsequent step, consider the t test for the predictive variables which are not yet in
the model. If the largest t test is statistically significant, add the new variable to the model.
If the t test is not significant, no more variables are included in the model and the process
is stopped.
For our example, we have already added age and BMI to the model. We now consider whether
we should also add smoking status. To add smoking status to the multiple regression that
contains age and BMI with STATA, we would type:
. regress sbp1 age bmi ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
|
smoke |
present | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
past | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
|
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
The joint test for the two smoking-status dummy variables is obtained as
. test (1.smoke=0) (2.smoke=0)
( 1) 1.smoke = 0
( 2) 2.smoke = 0
F( 2, 25) = 0.00
Prob > F = 0.9999
From the above STATA output, the F statistic for smoking status, controlling for age and
BMI, is very small (0.00), with a p value of 0.9999. This is not statistically
significant at α = 0.05, so the process stops. Thus, the forward selection procedure
identifies age and BMI as the best subset of the predictive variables. To run the forward
selection procedure with a single STATA command, we would type:
. xi:sw regress sbp1 age bmi (i.smoke),pe(0.05)
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_3 omitted)
begin with empty model
p = 0.0015 < 0.0500 adding age
p = 0.0452 < 0.0500 adding bmi
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
5.2 Backward elimination procedure
This strategy starts from a model containing all candidate variables and decides, one at
a time, whether a single variable should be deleted. In the backward elimination
procedure, we proceed as follows:
Step 0:
1. Fit the regression model that contains all predictive variables.
2. Consider the t test for every variable in the model.
3. Focus on the variable with the smallest t statistic, which has the largest p value.
If the test is not significant, remove that variable from the model. If it is
significant, keep that variable in the model.
To fit the regression model that contains all predictive variables (age, BMI, and smoking
status) using STATA, we would type:
. regress sbp1 age bmi ib(3).smoke
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(4, 25) = 4.24
Model | 3496.06849 4 874.017123 Prob > F = 0.0093
Residual | 5149.93151 25 205.99726 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3091
Total | 8646 29 298.137931 Root MSE = 14.353
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5635802 .1743535 3.23 0.003 .2044924 .922668
bmi | 1.78307 .8849171 2.01 0.055 -.0394511 3.605591
|
smoke |
present | -.071104 6.484866 -0.01 0.991 -13.42694 13.28473
past | -.0397733 6.246055 -0.01 0.995 -12.90376 12.82422
|
_cons | 56.06207 20.6748 2.71 0.012 13.48152 98.64262
------------------------------------------------------------------------------
. test (1.smoke=0) (2.smoke=0)
( 1) 1.smoke = 0
( 2) 2.smoke = 0
F( 2, 25) = 0.00
Prob > F = 0.9999
From the STATA output, we see that smoking status has the smallest test statistic, with a
p value of 0.9999. Therefore, smoking status is dropped from the regression model at this step.
Step 1:
If a variable is dropped in step 0, re-compute the regression equation for the remaining
variables, and repeat the backward elimination procedure steps 0 and 1. If the variable is
not dropped, the backward elimination procedure is terminated.
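The elimination loop in steps 0 and 1 can be sketched generically. Here `fit_pvalues` is a stand-in for refitting the model (e.g. `regress` plus `test` in STATA); the stub below simply replays the p values reported in the outputs for this example:

```python
# A minimal sketch of the backward-elimination loop (not Stata's sw command).
def backward_eliminate(predictors, fit_pvalues, alpha=0.05):
    predictors = list(predictors)
    while predictors:
        pvals = fit_pvalues(predictors)         # p value per predictor in model
        worst = max(predictors, key=pvals.get)  # largest p value
        if pvals[worst] < alpha:                # everything significant: stop
            break
        predictors.remove(worst)                # drop the worst and refit
    return predictors

# Hypothetical stub mimicking the lecture example: smoking status is dropped
# first (p = 0.9999), then age (p = 0.002) and BMI (p = 0.045) both stay.
def fit_pvalues(current):
    table = {
        frozenset({"age", "bmi", "smoke"}):
            {"age": 0.003, "bmi": 0.055, "smoke": 0.9999},
        frozenset({"age", "bmi"}):
            {"age": 0.002, "bmi": 0.045},
    }
    return table[frozenset(current)]

print(backward_eliminate(["age", "bmi", "smoke"], fit_pvalues))
```

Note that BMI survives even though its p value is 0.055 in the full model: only the single worst variable is removed per pass, and after smoking status is dropped, BMI's p value falls below 0.05.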
From step 0 of our example, smoking status was dropped from the model, so we re-compute
the regression equation without it. The result of the re-computation is as follows:
. regress sbp1 age bmi
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
From the STATA output, we see that BMI has the smallest t statistic, with a p value of 0.045.
However, the test is significant, so BMI is not dropped from the regression model.
Therefore, we stop here with this model, which is the same model obtained with the
forward selection procedure. To run the backward elimination procedure with a single
STATA command, we would type:
. xi:sw regress sbp1 age bmi (i.smoke),pr(0.05)
i.smoke _Ismoke_1-3 (naturally coded; _Ismoke_3 omitted)
begin with full model
p = 0.9999 >= 0.0500 removing _Ismoke_1 _Ismoke_2
Source | SS df MS Number of obs = 30
-------------+---------------------------------- F(2, 27) = 9.16
Model | 3496.04264 2 1748.02132 Prob > F = 0.0009
Residual | 5149.95736 27 190.739161 R-squared = 0.4044
-------------+---------------------------------- Adj R-squared = 0.3602
Total | 8646 29 298.137931 Root MSE = 13.811
------------------------------------------------------------------------------
sbp1 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .5637281 .1660759 3.39 0.002 .2229686 .9044876
bmi | 1.783396 .849328 2.10 0.045 .0407193 3.526073
_cons | 56.01783 19.46729 2.88 0.008 16.07424 95.96142
------------------------------------------------------------------------------
6. CONFOUNDING AND INTERACTION
Both confounding and interaction involve additional variables that may affect an
association between two or more variables. These additional variables are synonymously
referred to as extraneous variables, control variables, or covariates. For example, in the
earlier SBP example, we assess whether age is associated with SBP while accounting for
smoking status; the extraneous variable here is smoking status.
6.1 Confounding effects
Confounding is present when the relationship of interest differs depending on whether an
extraneous variable is ignored or included in the data analysis. Assessing confounding
requires comparing a crude estimate of the association (which ignores the extraneous
variable) with an adjusted estimate (which accounts in some way for the extraneous
variable). If the crude and adjusted estimates are meaningfully different, confounding is
present and the extraneous variable must be included in the analysis.
Suppose that we are interested in describing the relationship between a predictive
variable, drug therapy, and a continuous outcome, SBP, taking into account the possible
confounding effect of a third variable, age. Suppose drug therapy is a dichotomous
variable (drug = 1 for drug A and 0 for placebo). The comparison between a crude estimate
of the association and an adjusted estimate can be expressed in terms of the following
two regression models:

1) Y(SBP) = β0 + β1(drug) + β2(age)

2) Y(SBP) = β0 + β1(drug)
Model (1) expresses the relationship between drug therapy and SBP, adjusted for age, in
terms of the partial regression coefficient β1 of the variable drug. The estimate of β1
obtained from the least-squares fit of model (1), which we denote β̂1|Age, is an
adjusted-effect measure: it gives the estimated change in SBP per unit change in drug
therapy after adjusting for age.

Model (2) expresses the relationship between drug therapy and SBP, ignoring age, in terms
of the regression coefficient β1 of the variable drug. The estimate of β1 obtained from
the least-squares fit of model (2), denoted β̂1, is a crude estimate of the relationship
between drug therapy and SBP.
Thus, confounding is present if the estimates of the regression coefficient of the study
variable drug from models (1) and (2) are meaningfully different. As an example, suppose
that

β̂1 = 15.9 and β̂1|Age = 4.1

Then a 1-unit change in drug therapy yields roughly a 16-unit change in SBP when age is
ignored, but only about a 4-unit change when age is controlled. That is, the relationship
between drug therapy and SBP is much weaker after controlling for age. Thus, we would
treat age as a confounder and control for it in the analysis.
As another example, suppose that

β̂1 = 6.1 and β̂1|Age = 6.2

Here, we can conclude that age is not a confounder because there is no meaningful
difference between the estimates 6.1 and 6.2. Sometimes an investigator must deal with
more problematic comparisons, such as β̂1 = 5.5 versus β̂1|Age = 4.1. One approach is to
consider the clinical importance of the numerical difference between the estimates, based
on a priori knowledge of the variable(s) involved. For example, the estimated
coefficients 5.5 and 4.1 are the crude and adjusted differences in mean SBP between drug
A and placebo. We must decide whether a mean difference of 5.5 is clinically more
important than a mean difference of 4.1. If the difference is meaningful in clinical
practice, we treat age as a confounder and include the variable age in the model.
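The crude-versus-adjusted comparison can be automated with a relative-change rule. The 10% threshold below is a common epidemiological heuristic, not part of the lecture, and it complements rather than replaces the clinical judgment just described:

```python
def is_confounder(crude, adjusted, threshold=0.10):
    """Flag confounding when the adjusted estimate differs from the crude
    estimate by more than `threshold` in relative terms (10% rule heuristic)."""
    return abs(crude - adjusted) / abs(crude) >= threshold

print(is_confounder(15.9, 4.1))  # clear confounding by age
print(is_confounder(6.1, 6.2))   # no meaningful difference
```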
One approach sometimes used, incorrectly, to assess confounding is a statistical test of
the partial regression coefficient of the extraneous variable. Such a test addresses
precision, not confounding. For example, such a test evaluates whether significant
additional variation in SBP is explained by adding BMI (β2) to a model already containing
drug therapy (β1); equivalently, it asks whether the confidence interval for β1 is
considerably narrower when BMI is in the model than when it is not. However, even if
β̂2 ≠ 0, it does not follow that β̂1|BMI differs from β̂1. Therefore, β̂2 ≠ 0 is not
sufficient evidence of a confounding effect.
Consider the STATA outputs, which describe two regression models for the relationship
between drug and SBP when the BMI is ignored or included in the model, respectively.
1) Crude estimates of the relationship between drug therapy and SBP (BMI is ignored).
xi: regress sbp i.drug
i.drug _Idrug_0-1 (naturally coded; _Idrug_0 omitted)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 1, 30) = 6.16
Model | 1094.31693 1 1094.31693 Prob > F = 0.0189
Residual | 5331.65182 30 177.721727 R-squared = 0.1703
-------------+------------------------------ Adj R-squared = 0.1426
Total | 6425.96875 31 207.289315 Root MSE = 13.331
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Idrug_1 | 11.90688 4.798404 2.48 0.019 2.107235 21.70653
_cons | 137.4615 3.697418 37.18 0.000 129.9104 145.0127
------------------------------------------------------------------------------
2) Adjusted estimates of the relationship between drug therapy and SBP, after accounting for BMI.
. xi: regress sbp i.drug bmi
i.drug _Idrug_0-1 (naturally coded; _Idrug_0 omitted)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 2, 29) = 40.54
Model | 4733.10854 2 2366.55427 Prob > F = 0.0000
Residual | 1692.86021 29 58.3744899 R-squared = 0.7366
-------------+------------------------------ Adj R-squared = 0.7184
Total | 6425.96875 31 207.289315 Root MSE = 7.6403
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Idrug_1 | 10.6436 2.754685 3.86 0.001 5.009639 16.27756
bmi | 1.560152 .1976058 7.90 0.000 1.156002 1.964301
_cons | 55.13354 10.64064 5.18 0.000 33.37098 76.89609
------------------------------------------------------------------------------
From the STATA output, the statistical test of H0: βBMI = 0 has a p value of <0.001.
Thus, we reject the null hypothesis and conclude that adding BMI, given that drug is
already in the model, significantly contributes to the prediction of SBP. The adjusted
R-squared is 0.72 when BMI is included in the model, whereas the R-squared is 0.17 when
BMI is ignored. However, there is no meaningful difference between the crude and adjusted
estimates, 11.91 and 10.64 respectively. As a result, BMI is not a confounder in this
example, even though including BMI in the model explains additional variation in SBP.
6.2 Interaction effects
Interaction is present when the relationship of interest differs at different levels of
the extraneous variable. Assessing interaction therefore focuses on describing the
relationship of interest at each level of the extraneous variable. For example, in
assessing interaction due to sex when describing the relationship between age and SBP, we
must determine whether the regression coefficients relating age and SBP differ between
males and females.
To illustrate the concept of interaction, consider the following example. Suppose that we
wish to determine how two independent variables, age and sex, jointly affect systolic
blood pressure. To distinguish between interaction and no interaction, we consider two
graphs based on two hypothetical data sets, presented in Figure 6-1. These show the
straight-line regression of SBP on age for females against the corresponding regression
for males.
In Figure 6-1(a), the two regression lines are parallel. This suggests that the rate of
change in SBP as a function of age is the same for males and females; in other words, the
relationship between SBP and age does not depend on sex. We conclude that there is no
interaction between age and sex. In this situation, we can investigate the effects of age
and sex on SBP independently of one another; these effects are called the main effects.
One way to represent the relationship depicted in Figure 6-1(a) is with a regression
model of the form

Y = β0 + β1(age) + β2(sex).

Here, the change in the mean of SBP for a 1-unit change in age equals β1, regardless of
sex, while changing the category of sex only shifts the straight line relating SBP and
age without affecting the slope β1.
a) No interaction between age and sex
b) Interaction between age and sex
Figure 6-1 Graphs of non-interacting and interacting independent variables
In contrast, the two regression lines in Figure 6-1(b) cross, or tend toward crossing.
This figure depicts a situation where the relationship between SBP and age depends on
sex: SBP appears to increase with increasing age for males but to decrease with
increasing age for females. We conclude that there is an interaction between age and sex.
In this situation, we cannot interpret the main effects of age and sex on SBP, since the
two variables do not operate independently of one another in their effects on SBP. One
way to represent such interaction effects mathematically is a regression model of the form
Y = β0 + β1(age) + β2(sex) + β12(age)(sex).

An interaction term is the product of the two independent variables of interest, here age
and sex. The change in the mean of SBP for a 1-unit change in age now equals
β1 + β12(sex), which clearly depends on sex. For our particular example, when sex = 0
(i.e., when sex = male), the regression model can be written as:

Y = β0 + β1(age) + β2(0) + β12(age)(0)
  = β0 + β1(age)

and when sex = 1 (i.e., when sex = female), the regression model becomes:

Y = β0 + β1(age) + β2(1) + β12(age)(1)
  = (β0 + β2) + (β1 + β12)(age)

Note that these regression lines have different intercepts and different slopes. In
linear regression, interaction is evaluated with statistical tests on product terms
involving the basic independent variables in the model.
We present here an example of how to assess interaction using STATA, continuing with our
example of the relationship between age and SBP. Fitting the linear regression model to
this data set indicated a linear relationship between age and SBP. Another question these
data can answer is whether an interaction exists between age and smoking status; that is,
whether the slopes of the straight lines relating SBP to age differ significantly for
smokers and for non-smokers. To create the interaction term in STATA, we would type:
. xi: regress sbp i.smk*age
i.smk _Ismk_0-1 (naturally coded; _Ismk_0 omitted)
i.smk*age _IsmkXage_# (coded as above)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 3, 28) = 26.63
Model | 4758.42362 3 1586.14121 Prob > F = 0.0000
Residual | 1667.54513 28 59.5551833 R-squared = 0.7405
-------------+------------------------------ Adj R-squared = 0.7127
Total | 6425.96875 31 207.289315 Root MSE = 7.7172
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ismk_1 | -12.84603 21.71534 -0.59 0.559 -57.32789 31.63583
age | 1.515216 .2703328 5.61 0.000 .9614643 2.068968
_IsmkXage_1 | .4349186 .4048227 1.07 0.292 -.3943232 1.26416
_cons | 58.57428 14.80476 3.96 0.000 28.2481 88.90046
------------------------------------------------------------------------------
The regression model including the interaction term can be written as:

Y = β0 + β1(smoke) + β2(age) + β3(smoke)(age)

From the STATA output, when smoking status = non-smoker (smoke = 0), the fitted model is:

Y = 58.57 − 12.85(0) + 1.52(age) + 0.43(age)(0)
  = 58.57 + 1.52(age)

and when smoking status = smoker (smoke = 1), the fitted model becomes:

Y = 58.57 − 12.85(1) + 1.52(age) + 0.43(age)(1)
  = 45.73 + 1.95(age)
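The algebra above can be checked directly from the estimates reported in the STATA output (a quick sketch using the reported coefficients):

```python
# Coefficients from the regress output: _cons, _Ismk_1, age, _IsmkXage_1.
b0, b_smk, b_age, b_int = 58.57428, -12.84603, 1.515216, 0.4349186

def fitted_line(smoke):
    """Intercept and slope of the SBP-on-age line for a smoking group."""
    return b0 + b_smk * smoke, b_age + b_int * smoke

print(fitted_line(0))  # non-smokers: (58.57428, 1.515216)
print(fitted_line(1))  # smokers: lower intercept, steeper slope
```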
Plotting these two lines (Figure 6-2), we see that they appear almost parallel, which
suggests that there is probably no statistically significant interaction. We can confirm
this by testing the hypothesis H0: β3 = 0, given that smoking status and age are in the
model. The p value for this test is 0.292. Therefore, the slopes of the straight lines
relating SBP and age do not significantly differ for smokers and non-smokers; there is no
interaction between age and smoking status in this situation. Equivalently, smokers have
a consistently higher SBP than non-smokers across these ages, and the rate of change with
respect to age does not differ significantly between the two groups.
Figure 6-2 Comparison by smoking status of straight-line regressions of SBP on age
7. REGRESSION DIAGNOSTICS
7.1 Normality
The assumption of normality can be assessed formally with a normal plot of the residuals.
The residuals are assumed to be independent normal random variables with mean 0 and
constant variance σ². Violating this assumption invalidates the model's inferences, so to
make sure the model is appropriate we check whether the residuals are normally
distributed. Checking this assumption can be performed as follows:
STATA command:
1. Estimate the residuals
After fitting the regression model with the regress command, the residuals can be
estimated with the predict command and the resid option:
. xi: regress sbp i.drug age quet
i.drug _Idrug_0-1 (naturally coded; _Idrug_0 omitted)
Source | SS df MS Number of obs = 32
-------------+------------------------------ F( 3, 28) = 32.42
Model | 4989.65072 3 1663.21691 Prob > F = 0.0000
Residual | 1436.31803 28 51.2970727 R-squared = 0.7765
-------------+------------------------------ Adj R-squared = 0.7525
Total | 6425.96875 31 207.289315 Root MSE = 7.1622
------------------------------------------------------------------------------
sbp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Idrug_1 | 10.62885 2.582308 4.12 0.000 5.339232 15.91847
age | 1.003488 .3102821 3.23 0.003 .3679039 1.639072
quet | 9.705102 4.339773 2.24 0.033 .8154801 18.59472
_cons | 51.38847 10.11437 5.08 0.000 30.67013 72.10681
------------------------------------------------------------------------------
. predict error, resid
2. Create a normal probability plot
. pnorm error
Figure 7-1 Normal probability plot of residuals from the full multiple regression model
3. Test for normality
. swilk error
Shapiro-Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+--------------------------------------------------
error | 32 0.97405 0.865 -0.300 0.61789
The test's null hypothesis is that the residuals are normally distributed.
The Shapiro-Wilk statistic equals 0.97405, with a p value of 0.61789.
We therefore fail to reject the null hypothesis and conclude that the residuals are
normally distributed.
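The coordinates behind the pnorm plot can be reproduced by hand: plot the empirical probabilities i/(N+1) against the standard-normal CDF of the standardized residuals. A small sketch with hypothetical residuals (not the fitted model's):

```python
import math
from statistics import mean, stdev

resid = [-9.1, -4.2, -1.5, 0.3, 2.8, 5.0, 7.7]   # hypothetical residuals

def phi(z):
    """Standard-normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

m, s = mean(resid), stdev(resid)
# (empirical P[i] = i/(N+1), Normal F[(e - m)/s]) pairs; the points lie near
# the 45-degree line when the residuals are approximately normal.
pairs = [((i + 1) / (len(resid) + 1), phi((e - m) / s))
         for i, e in enumerate(sorted(resid))]
print([(round(p, 2), round(q, 2)) for p, q in pairs])
```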
7.2 Linearity
An augmented component-plus-residual plot against each independent variable suggests
whether the relationship is linear. This can be performed as below.
STATA command:
1. Plot Augmented component-plus-residual versus independent variable age
. acprplot age, mspline msopts(bands(13)) title(Augmented component-plus-residual versus age plot)
Figure 7-2 Augmented component-plus-residual versus age
2. Plot Augmented component-plus-residual versus independent variable BMI
. acprplot quet, mspline msopts(bands(13)) title(Augmented component-plus-residual versus BMI plot)
Figure 7-3 Augmented component-plus-residual versus BMI
These graphs suggest neither definite linearity nor definite curvature. The relationships
might be linear, with a few outliers preventing the smoothed line from being straight.
7.3 Homoskedasticity
The assumption of homoskedasticity can be assessed by plotting the residuals against the
predicted values of the dependent variable; this shows whether the spread of the
residuals is constant across the fitted values. For example, after fitting drug, age, and
BMI to predict SBP, a plot of residuals against the predicted values of SBP is shown in
Figure 7-4. In the graph, the residuals lie symmetrically below and above the horizontal
line, which suggests constant variance; a formal test can be performed as below.
. estat hettest error
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: error
chi2(1) = 1.82
Prob > chi2 = 0.1771
The test suggests that the variance of the residuals is constant.
Figure 7-4 Residuals from the regression model, plotted against predicted values of SBP
7.4 Multicollinearity
This assumption does not matter for simple regression analysis, but it is important to
check in the multiple regression model. Multicollinearity occurs when independent
variables are correlated with one another; it inflates the standard errors of the
coefficients, makes the estimates unstable, and undermines inference about the
coefficients. Therefore, pairwise correlations (r) should be examined to see which
variables are highly correlated and may cause a collinearity effect if fitted together.
When collinearity is present, adding or deleting independent variables produces large
changes in the coefficients and their standard errors; a standard error can be as large
as the coefficient, or even larger.
In addition to pairwise correlations, the variance inflation factor (VIF) is usually used
to detect multicollinearity. It measures how much the variance of a regression
coefficient is inflated compared with the situation in which the independent variables
are not linearly related. It can be estimated in STATA as below. A VIF > 10 suggests
collinearity, whereas values close to 1 show no evidence of collinearity; a variable with
VIF > 10 should be considered for omission from the model. For this example, there is no
evidence of collinearity.
. estat vif
Variable | VIF 1/VIF
-------------+----------------------
age | 2.82 0.355212
quet | 2.81 0.355588
_Idrug_1 | 1.00 0.996620
-------------+----------------------
Mean VIF | 2.21
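The VIF has a simple closed form: VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. With only two correlated predictors, R_j² is just their squared pairwise correlation, which allows a quick hand check (a sketch; the correlation value is back-calculated for illustration, not taken from the data):

```python
def vif_from_r2(r_squared):
    """Variance inflation factor from the R^2 of predictor j on the others."""
    return 1.0 / (1.0 - r_squared)

# A pairwise correlation of about 0.8 between age and quet reproduces the
# VIF of roughly 2.8 reported by estat vif above.
print(round(vif_from_r2(0.8 ** 2), 2))
```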
7.5 Outliers
7.5.1 Identifying X outliers
Identifying X outliers in multiple regression is a multi-dimensional problem involving
all of the X variables. Considering each variable separately is not appropriate: an
extreme value on a single variable (a univariate outlier) may not influence the
regression line, whereas a multivariate outlier, which univariate screening cannot
detect, may affect the fitted regression. We use the H (hat) matrix to identify X
outliers:

H = X(X'X)⁻¹X'

Let hii (the hat, or leverage, value) be the i-th element on the main diagonal of H,
which can be obtained from:

hii = xi'(X'X)⁻¹xi,  with 0 ≤ hii ≤ 1 and Σ hii = p

where p is the number of regression parameters, including the constant term. The higher
the hii value (the higher the leverage), the further xi lies from the centre of the X
space, and hence the more extreme the observation is as an X outlier. A common suggestion
is to flag a case whose leverage exceeds twice the average, 2p/n. Another criterion
classifies leverages of 0.2-0.5 as moderate and above 0.5 as high. Estimation of the
leverage values can be performed as below.
STATA command:
2. Estimate the leverage values
. predict xdist, hat
. sum xdist,det
Leverage
-------------------------------------------------------------
Percentiles Smallest
1% .0549423 .0549423
5% .0568518 .0568518
10% .0762993 .0651438 Obs 32
25% .0843276 .0762993 Sum of Wgt. 32
50% .1157301 Mean .125
Largest Std. Dev. .056152
75% .1453192 .2223231
90% .2223231 .2470635 Variance .0031531
95% .2482278 .2482278 Skewness 1.114692
99% .2663676 .2663676 Kurtosis 3.49025
3. Identify outlying cases
For this model, the cutoff is twice the average leverage: 2p/n = 2(4)/32 = 0.25. Only one
subject has a leverage value above this cutoff, and even the highest leverage does not
exceed the high-leverage threshold of 0.5. Subject id = 2 has the highest leverage value,
0.266. Keep in mind that this subject is a potential X outlier, but we still need to
explore whether it actually influences the regression model (the predicted values).
. list person error xdist quet age drug if xdist>2*(4/32) & xdist!=.
+----------------------------------------------------+
| person error xdist quet age drug |
|----------------------------------------------------|
18. | 2 -2.082762 .2663676 3.251 41 0 |
+----------------------------------------------------+
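For simple regression the leverage has a closed form, hii = 1/n + (xi − x̄)²/Sxx, which makes the 2p/n rule easy to check by hand (a toy sketch with hypothetical x values, not the course data):

```python
# Leverage in simple regression: h_ii = 1/n + (x_i - xbar)^2 / Sxx,
# with p = 2 parameters (slope + constant).
x = [40, 41, 44, 48, 51, 62]          # hypothetical ages
n, p = len(x), 2
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

cutoff = 2 * p / n                    # the 2p/n rule of thumb
flagged = [xi for xi, hi in zip(x, h) if hi > cutoff]
print(round(sum(h), 6), flagged)      # the leverages always sum to p
```

Here the isolated value 62 is the only case whose leverage exceeds the cutoff, exactly the pattern the rule is designed to flag.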
7.5.2 Identifying Y outliers
This can be done using studentized deleted residuals. The deleted residual for case i is

di = Yi − Ŷi(i)

where Ŷi(i) is the fitted value for case i when case i is excluded from the fit. The
studentized deleted residual is

di* = di / s(di)

or, in a computational form that avoids refitting the model,

di* = ei [ (n − p − 1) / ( SSE(1 − hii) − ei² ) ]^(1/2)

The studentized deleted residual follows a t-distribution with n − p − 1 degrees of
freedom. To identify outliers, it is compared with the t critical value at a chosen
significance level, say 0.05, as below.
STATA command:
1. Estimate the studentized deleted residuals
. predict estu,rstudent
. sum estu,det
Studentized residuals
-------------------------------------------------------------
Percentiles Smallest
1% -2.514817 -2.514817
5% -1.589173 -1.589173
10% -1.08973 -1.185283 Obs 32
25% -.5998537 -1.08973 Sum of Wgt. 32
50% -.0976471 Mean .0120775
Largest Std. Dev. 1.08881
75% .626205 1.436678
90% 1.436678 1.613788 Variance 1.185507
95% 2.430551 2.430551 Skewness .3678051
99% 2.576622 2.576622 Kurtosis 3.461563
2. Identify outlying cases
. list person error estu quet age drug if abs(estu)>invttail(27,0.05)
+-----------------------------------------------------+
| person error estu quet age drug |
|-----------------------------------------------------|
8. | 8 14.76043 2.430551 3.612 48 1 |
9. | 9 14.84753 2.576622 2.368 44 1 |
12. | 12 -14.32618 -2.514817 4.032 51 1 |
+-----------------------------------------------------+
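The formula above avoids actually re-fitting the model n times. A hedged numpy sketch on simulated toy data (not the course data set) computes the studentized deleted residuals directly from the ordinary residuals, SSE, and leverages:

```python
import numpy as np

# Simulated toy regression (n = 20, p = 3); purely illustrative data.
rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtXinv = np.linalg.inv(X.T @ X)
beta = XtXinv @ X.T @ y
e = y - X @ beta                   # ordinary residuals
SSE = e @ e
h = np.diag(X @ XtXinv @ X.T)      # leverages h_ii

# Studentized deleted residual:
# d_i* = e_i * sqrt( (n - p - 1) / ( SSE*(1 - h_ii) - e_i^2 ) )
t = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))

# |d_i*| would then be compared with a t critical value on n - p - 1 df,
# as the STATA invttail(27,0.05) condition does for the course data.
print(np.round(t, 3))
```

This matches STATA's `predict ..., rstudent` definition.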
7.5.3 Identifying influential cases
After identifying cases that are outlying with respect to their X values and/or their Y values,
the next step is to ascertain whether or not these outlying cases are influential. We shall take
up three measures of influence that are widely used in practice, each based on the omission of a
single case to measure its influence.
i) Influence on the fitted values (DFFITS)
The first task is to explore whether these X and Y outliers influence the fitted values, since not every X or Y outlier affects the fitted values. DFFITS is a measure that takes into account both the X (leverage) and Y (d_i*) outlying indexes and is widely used in practice:

DFFITS_i = ( Ŷ_i − Ŷ_i(i) ) / [ MSE_(i) h_ii ]^(1/2)

The letters DF refer to the difference between the fitted value Ŷ_i for the ith case when all n cases are used in fitting the model, and the fitted value Ŷ_i(i) for the ith case when the ith case is removed from the model. The denominator is a standardization, so DFFITS reflects the number of estimated standard deviations by which the fitted value Ŷ_i changes when the ith case is removed. DFFITS can also be calculated from d_i* and h_ii as follows:

DFFITS_i = d_i* [ h_ii / (1 − h_ii) ]^(1/2)

As the equation shows, DFFITS is a function of d_i*, scaled up or down according to the h_ii value. An absolute DFFITS value exceeding 1 is considered influential for small to medium sample sizes, and exceeding 2*sqrt(p/n) for large sample sizes. This can be explored as follows:
STATA command
1. Estimate DFFITS
. predict dfits, dfits
2. Identify influential cases
. list person error dfits quet age drug if abs(dfits)>1
+-----------------------------------------------------+
| person error dfits quet age drug |
|-----------------------------------------------------|
8. | 8 14.76043 1.041157 3.612 48 1 |
9. | 9 14.84753 1.377664 2.368 44 1 |
12. | 12 -14.32618 -1.440561 4.032 51 1 |
+-----------------------------------------------------+
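The second DFFITS formula can be verified against the leave-one-out definition numerically; a minimal numpy sketch on simulated toy data (not the course data):

```python
import numpy as np

# Simulated toy regression (n = 20, p = 3); illustrative only.
rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtXinv = np.linalg.inv(X.T @ X)
beta = XtXinv @ X.T @ y
e = y - X @ beta
SSE = e @ e
h = np.diag(X @ XtXinv @ X.T)

# DFFITS_i = d_i* * sqrt( h_ii / (1 - h_ii) )
tstar = e * np.sqrt((n - p - 1) / (SSE * (1 - h) - e**2))
dffits = tstar * np.sqrt(h / (1 - h))

# Flag |DFFITS| > 1 (small/medium n) or > 2*sqrt(p/n) (large n).
print("flagged:", np.where(np.abs(dffits) > 1)[0])
```

This mirrors STATA's `predict ..., dfits` followed by the `abs(dfits)>1` listing.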
ii) Influence on all of the estimated regression coefficients (Cook’s distance)
Cook’s distance (D_i) measures the impact of the ith case on all regression coefficients (the coefficient vector) when the ith case is omitted. It is defined as:

D_i = (b − b_(i))' (X'X) (b − b_(i)) / (p · MSE)

D_i is interpreted as follows:
- if D_i > 4/n, or
- if D_i > F(1 − α; p, n − p)
then the ith case has a substantial influence on the estimated coefficients.
STATA command:
1. Estimate Cook’s distance
. predict cookd1,cooks
2. Identify influential cases
. list person error cookd1 quet age drug if cookd1>(4/32)
+----------------------------------------------------+
| person error cookd1 quet age drug |
|----------------------------------------------------|
8. | 8 14.76043 .2305867 3.612 48 1 |
9. | 9 14.84753 .3949498 2.368 44 1 |
10. | 10 8.75689 .1641441 4.637 64 1 |
12. | 12 -14.32618 .4359133 4.032 51 1 |
+----------------------------------------------------+
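Cook's distance can also be computed case by case without n re-fits, using the standard computational shortcut D_i = e_i^2 h_ii / (p · MSE · (1 − h_ii)^2), which is algebraically equal to the quadratic-form definition above. A toy numpy sketch (simulated data, not the course data):

```python
import numpy as np

# Simulated toy regression (n = 20, p = 3); illustrative only.
rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtXinv = np.linalg.inv(X.T @ X)
beta = XtXinv @ X.T @ y
e = y - X @ beta
h = np.diag(X @ XtXinv @ X.T)
MSE = (e @ e) / (n - p)

# Shortcut form of D_i = (b - b_(i))' X'X (b - b_(i)) / (p * MSE)
D = e**2 * h / (p * MSE * (1 - h)**2)

print("flagged:", np.where(D > 4 / n)[0])   # the 4/n rule of thumb
```

This mirrors STATA's `predict ..., cooksd` followed by the `cookd1>(4/32)` listing.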
iii) Influence on the partial regression coefficients (DFBETAS)
This measures the influence of the ith case on the estimation of each regression coefficient, assessed by:

DFBETAS_k(i) = ( b_k − b_k(i) ) / [ MSE_(i) c_kk ]^(1/2)

where c_kk = the kth diagonal element of (X'X)^(-1), and b_k(i) = the kth regression coefficient when the ith case is omitted.

That is, DFBETAS is the standardized difference between the coefficients estimated with and without the ith case. Absolute DFBETAS values greater than 1 (small to medium sample sizes) or greater than 2/sqrt(n) (large sample sizes) indicate influential cases. This can be estimated in STATA as follows:
STATA command:
1. Estimate DFBETAS
. predict df_drug,dfbeta(_Idrug_1)
. predict df_age,dfbeta(age)
. predict df_bmi,dfbeta(quet)
2. Identify influential cases
. list person error df_drug df_age df_bmi quet age drug if abs(df_drug)>1 |
abs(df_age)>1 | abs(df_bmi)>1
+----------------------------------------------------------------------------+
| person error df_drug df_age df_bmi quet age drug |
|----------------------------------------------------------------------------|
12. | 12 -14.32618 -.4310711 1.128824 -1.263236 4.032 51 1 |
+----------------------------------------------------------------------------+
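The DFBETAS definition can be sketched directly with a leave-one-out re-fit per case; a toy numpy example (simulated data, not the course data set):

```python
import numpy as np

# Simulated toy regression (n = 15, p = 3); illustrative only.
rng = np.random.default_rng(3)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtXinv = np.linalg.inv(X.T @ X)
beta = XtXinv @ X.T @ y
c = np.diag(XtXinv)               # c_kk: kth diagonal element of (X'X)^-1

# DFBETAS_k(i) = (b_k - b_k(i)) / sqrt(MSE_(i) * c_kk), by direct re-fit.
dfbetas = np.empty((n, p))
for i in range(n):
    mask = np.arange(n) != i
    b_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    r = y[mask] - X[mask] @ b_i
    mse_i = (r @ r) / (n - 1 - p)
    dfbetas[i] = (beta - b_i) / np.sqrt(mse_i * c)

# Flag |DFBETAS| > 1 (small/medium n) or > 2/sqrt(n) (large n).
print("flagged (case, coef):", np.argwhere(np.abs(dfbetas) > 1))
```

This parallels STATA's `predict ..., dfbeta()` calls and the `abs(df_*) > 1` listing above; in practice the closed-form update b − b_(i) = (X'X)^(-1) x_i e_i / (1 − h_ii) avoids the explicit loop.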
Finally, the diagnostics need to be summarized as a whole by considering the three measures together: which subjects are influential cases affecting the predicted values (DFFITS), the estimated coefficient vector (Cook's D), and the individual coefficients (DFBETAS).
. list person sbp drug age quet smkgr if abs( dfits)>1 | cookd1>(4/32) |
abs(df_drug)>1 | abs(df_age)>1 | abs(df_bmi)>1
+-------------------------------------------+
| person sbp drug age quet smkgr |
|-------------------------------------------|
8. | 8 160 1 48 3.612 1 |
9. | 9 144 1 44 2.368 1 |
10. | 10 180 1 64 4.637 1 |
12. | 12 138 1 51 4.032 1 |
+-------------------------------------------+
There are 4 subjects who are potentially influential cases according to the above criteria. In summary, the residuals are normally distributed with constant variance, but some outliers influence the regression model. We thus need to explore the data, starting with the subjects listed above. Their data need to be checked and validated for all variables to make sure the data are correct. If some data are incorrect, they should be corrected, the regression model re-fitted, and the model diagnostics re-assessed. If the data for these subjects are correct as they are, try excluding these subjects and see how the regression model changes.
Assignment IV (25%) Due Date: Sep 27, 2017
A cohort study of a Thai population free of chronic kidney disease (CKD) at baseline was conducted. Subjects were followed up for 7 years, with newly diagnosed CKD as the outcome of interest. The fourth objective was to determine the factors associated with the percent change of GFR at 7 years of follow-up. Many factors may be associated with the percent change of GFR, such as:
Demographic data: age, gender, BMI, waist-hip ratio, systolic blood pressure
Risk behavior: smoking, alcohol consumption, exercise
Comorbidity: hypertension (HT), diabetes mellitus (DM), high cholesterol
Medical history: NSAID use
The aims of this assignment are to:
a) Fit the linear regression model step by step, and explain the method used.
b) What is the parsimonious equation?
c) Perform diagnostic measures, checking the assumptions of the final model.
d) Interpret the results of the final model.
e) Create a dummy table and present the results according to it.
f) Interpret and write up the results according to the table.
*** The data are given in the data set SEEK2_data.dta