
Introduction to linear model

Valeska Andreozzi

2012

References 3

Correlation 5
  Definition 6
  Pearson correlation coefficient 8
  Spearman correlation coefficient 10
  Hypothesis test 12

Simple linear regression 15
  Motivation 16
  The model 18
  Model assumptions 22
  Fitting the model 23
  Exercise 28

Multiple linear regression 29
  The model 30
  Hypothesis test 35
  Variable selection 46
  Model check 50
  Interaction 66


Summary

References

Correlation

Definition
Pearson correlation coefficient
Spearman correlation coefficient
Hypothesis test

Simple linear regression

Motivation
The model
Model assumptions
Fitting the model
Exercise

Multiple linear regression

The model
Hypothesis test
Variable selection
Model check
Interaction

DEIO/CEAUL Valeska Andreozzi – slide 2

References slide 3

References

■ Rosner, B (2010). Fundamentals of Biostatistics. 7th Edition. Duxbury Resource Center.

■ Krzanowski, W (1998). An Introduction to Statistical Modelling. Arnold Texts in Statistics.

■ Harrell, F (2001). Regression Modeling Strategies. Springer-Verlag.

■ Weisberg, S (2005). Applied Linear Regression (Wiley Series in Probability and Statistics). Third Edition. Wiley.

■ Dalgaard, P (2008). Introductory Statistics with R (Statistics and Computing). Second Edition. Springer.


Correlation slide 5

Definition

■ The correlation coefficient is a dimensionless quantity, independent of the units of the random variables X and Y, and ranges between −1 and 1

■ For random variables that are approximately linearly related, a correlation coefficient of 0 implies independence

■ A correlation coefficient close to 1 implies nearly perfect positive dependence, with large values of X corresponding to large values of Y and small values of X corresponding to small values of Y

■ A correlation coefficient close to −1 implies nearly perfect negative dependence, with large values of X corresponding to small values of Y and vice versa



Examples

■ An example of a strong positive correlation is between forced expiratory volume (FEV) and height.

■ A weaker positive correlation exists between serum cholesterol and dietary cholesterol intake.

■ A strong negative correlation can be found between resting pulse and age in children under the age of 10 years.


Pearson correlation coefficient

■ Assuming that Y and X are two random variables with a linear relationship, we can measure the correlation in a sample by calculating the Pearson correlation coefficient, given by:

r = Σi (xi − x̄)(yi − ȳ) / √[ Σi (xi − x̄)² · Σi (yi − ȳ)² ]
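As a quick check of the formula, here is a minimal Python sketch (the slides themselves use R; the helper name `pearson_r` is my own) that computes r directly from the definition above:

```python
import math

def pearson_r(x, y):
    # r = sum of cross-deviations over the square root of the
    # product of the sums of squared deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# a perfect increasing line gives r = 1, a perfect decreasing line r = -1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_r([1, 2, 3], [3, 2, 1]))         # -1.0
```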


in R

library(ISwR)

data(thuesen)

attach(thuesen)

View(thuesen)

plot(thuesen)

cor(blood.glucose, short.velocity)

cor(blood.glucose, short.velocity,use="complete.obs")

Correlation in R Commander

Statistics > Summaries > Correlation matrix ...



Spearman correlation coefficient

■ Adequate when one or both variables are either ordinal or have a distribution that is far from normal

■ The Spearman correlation coefficient is a nonparametric method, which has the advantage of being invariant to monotone transformations of the coordinates.

■ The main disadvantage of this method is that its interpretation is not quite clear.

■ The Spearman (rank) correlation coefficient rs is obtained by replacing the observations of X and Y by their ranks and computing the correlation (Pearson coefficient).

■ If there were a perfect correlation between two variables, then the ranks for each person on each variable would be the same and rs = 1. The less perfect the correlation, the closer to zero rs would be.
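The rank-and-correlate recipe above can be sketched in a few lines of Python (the slides use R's rank() and cor(); the helper names here are my own, and ties are given average ranks):

```python
import math

def ranks(v):
    # rank observations 1..n; tied values share the average of their positions
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    # Spearman rank correlation = Pearson correlation of the ranks
    return pearson_r(ranks(x), ranks(y))

# y = x**3 is monotone but nonlinear: rs = 1 even though r < 1
print(spearman_rs([1, 2, 3, 4], [1, 8, 27, 64]))   # 1.0
```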


in R

Correlation in R

x<-rank(thuesen$blood.glucose[-16])

y<-rank(thuesen$short.velocity[-16])

cor(x,y,method="pearson")

cor(thuesen$blood.glucose, thuesen$short.velocity,
    use="complete.obs", method="spearman")

Correlation in R Commander

Statistics > Summaries > Correlation matrix ...


Hypothesis test

■ It is possible to test the significance of the correlation by transforming it to a t-distributed variable, which will be identical to the test obtained from testing the significance of the slope of the regression of y on x, or vice versa (see later).

■ In R

cor.test(blood.glucose, short.velocity)

■ in R Commander

Statistics > Summaries > Correlation test ...



Exercises

Match the following items with the scatterplots: r = 0, 0 < r < 1, r = 1, r = −1, −1 < r < 0




Simple linear regression slide 15

Motivation

What is the relationship between systolic blood pressure (SBP) and age among healthy adults?

■ SBP increases with age

■ There are fluctuations around a linear trend

■ Variability of SBP is not completely explained by age ⇒ random component

Why would we like to fit a model?

■ Describe the relationship between SBP and age

■ Prediction


Motivation

What can we say about SBP × age?

[Scatterplot of pa (SBP, 120–220 mmHg) against id (age, 20–70 years)]



General concepts

yi = β0 + β1xi + ǫi

[Scatterplot of pa (SBP) against id (age) with the fitted regression line]

■ Positive linear correlation

■ Relationship is not perfect.

■ Fitted line describes the linear relationship between SBP and age:

ŷi = 98.71 + 0.97xi


Model interpretation

SBPi = 98.71 + 0.97 × agei

[Scatterplot of pa against id with the fitted line SBPi = 98.71 + 0.97 × agei]

β0 = 98.71

■ estimated value of SBP when age is zero

β1 = 0.97

■ SBP increases by 0.97 mmHg for an increment of one year of age



Model illustration

Illustration of the components of a simple linear regression model.

■ Systematic component: β0 + β1xi

■ Statistical/Probabilistic Model: Yi = β0 + β1xi + ǫi or E(Yi) = β0 + β1xi


Model illustration

Simple linear regression representation.

■ The means of the probability distributions of Yi show the systematic relation with X


Model assumptions

Independence: the Yi are all independent

Linearity: the expected value of Yi is a linear function of Xi

Homogeneity of variance: the variance of the probability distribution of Yi is constant over X and equal to σ2

Normality: for each Xi, Yi follows a Normal distribution. This assumption is necessary to build hypothesis tests and confidence intervals for the model parameters β



Model estimation

Least Squares Method

E(Yi|X) = β0 + β1xi

■ LSM gives estimates β̂0 and β̂1 that minimize the sum of squared errors (SSE)

SSE = Σi ǫ̂i² = Σi (yi − ŷi)² = Σi (yi − β̂0 − β̂1xi)²


Model estimation

β coefficients: differentiating SSE and setting the partial derivatives to zero, we have:

∂SSE/∂β0 = Σi [yi − β0 − β1xi] = 0

∂SSE/∂β1 = Σi [xi(yi − β0 − β1xi)] = 0

Solving this system gives the estimates of the model parameters:

β̂0 = ȳ − β̂1x̄

β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
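To make the closed-form solution concrete, here is a small Python sketch (the slides themselves use R's lm(); the function name is my own). The data are synthetic points placed exactly on the earlier slide's fitted SBP line 98.71 + 0.97x, so the estimates recover those coefficients:

```python
def fit_simple_ols(x, y):
    # normal-equation solutions:
    #   b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    #   b0 = ybar - b1 * xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    b0 = ybar - b1 * xbar
    return b0, b1

# synthetic ages and SBP values lying exactly on 98.71 + 0.97 * age
ages = [20, 30, 40, 50, 60, 70]
sbp = [98.71 + 0.97 * a for a in ages]
b0, b1 = fit_simple_ols(ages, sbp)
print(round(b0, 2), round(b1, 2))   # 98.71 0.97
```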


Model estimation

Variance of Y (σ2)

■ Under the assumption that the errors are independent random variables with zero mean and constant variance σ2, an unbiased estimator of σ2 is the ratio between SSE = Σi ǫ̂i² and the degrees of freedom (the number of observations minus the number of model coefficients); for simple linear regression, σ̂2 = SSE / (n − 2)

■ The variance σ2 of Y is then estimated by σ̂2.
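A one-function Python sketch of that estimator (helper name mine; in simple linear regression two coefficients are estimated, so the denominator is n − 2). The toy data here sit around the least-squares line 2 + 3x with residuals (1, −1, −1, 1):

```python
def sigma2_hat(x, y, b0, b1):
    # unbiased error-variance estimate: SSE divided by its degrees of
    # freedom (n observations minus 2 estimated coefficients)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return sse / (len(x) - 2)

# residuals are (1, -1, -1, 1) around the fitted line 2 + 3x, so SSE = 4
print(sigma2_hat([1, 2, 3, 4], [6, 7, 10, 15], 2, 3))   # 2.0
```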



in R

Simple linear regression in R

dados<-read.table("pasis.dat",header=T)

names(dados)

head(dados)

plot(dados)

modelo<-lm(pa~id,data=dados)

summary(modelo)

plot(dados)

abline(modelo,col=2)


in R Commander

Simple linear regression in R Commander

Data > Import data > from text file, clipboard, or URL...

Graphics > Scatterplot...

Statistics > Fit Models > Linear regression...


Exercise

Exercise in R

1. With the rmr dataset (ISwR package), plot the metabolic rate versus body weight. Fit a linear regression model to the relation. According to the fitted model, what is the predicted metabolic rate for a body weight of 70 kg?

2. In the juul dataset (ISwR package), fit a linear regression model of the square root of the IGF-I concentration on age, for the group of subjects over 25 years old.

Tools > Load package(s)...

Data > Data in packages > Read data set from an attached package...

Graphics > Scatterplot...

Statistics > Fit Models > Linear regression...

Data > Manage variable in active data set > Compute new variable...



Multiple linear regression slide 29

Multiple linear regression

yi = β0 + β1x1i + β2x2i + ǫi

■ Describe the relationship between the response (dependent, outcome) variable (Y) and two or more independent variables (covariates, predictors, explanatory variables) (X1, X2, X3, · · · , Xk)

■ Estimate the direction of the association between the response variable and the covariates.

■ The covariates can be transformed variables (example: log(cd4)), polynomial terms (example: age²), interaction terms (example: age × sex) and dummy variables.

■ Determine which covariates are important to predict the response variable

■ Describe the relationship of X1, X2, X3, · · · , Xk and Y adjusted for the effect of other covariates Z1 and Z2, for example.


Multiple linear regression

yi = β0 + β1x1i + β2x2i + ǫi

■ Assumes that the response variable is a random variable which varies from individual to individual i.

■ The nature of the continuous response variable suggests that the Normal distribution is adequate as the population model of Yi

■ So, Yi follows a Normal distribution with unknown mean µi and variance σ2 (Yi ∼ N(µi, σ2))

■ Similarly, we can say that each observation yi = µi + ǫi and that ǫi ∼ N(0, σ2)

■ The model parameters are also estimated by the least squares method.



Example

Describe the relationship between blood pressure (yi) and age (x1i), body mass index (x2i) and smoking habits (x3i). File: (multi.dat)

E(Yi) = β0 + β1x1i + β2x2i + β3x3i

Data > Import data > from a text file, clipboard or URL ...

Statistics > Summaries > Active data set

summary(dados)

pessoa pa id

Min. : 1.00 Min. :120.0 Min. :41.00

1st Qu.: 8.75 1st Qu.:134.8 1st Qu.:48.00

Median :16.50 Median :143.0 Median :53.50

Mean :16.50 Mean :144.5 Mean :53.25

3rd Qu.:24.25 3rd Qu.:152.0 3rd Qu.:58.25

Max. :32.00 Max. :180.0 Max. :65.00

imc hf

Min. :2368 não:15

1st Qu.:3022 sim:17

Median :3380

Mean :3441

3rd Qu.:3776

Max. :4637


in R Commander

Statistics > Fit models > Linear models...

Call:

lm(formula = pa ~ id + imc + hf, data = multi)

Residuals:

Min 1Q Median 3Q Max

-13.5420 -6.1812 -0.7282 5.2908 15.7050

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 45.103192 10.764875 4.190 0.000252 ***

id 1.212715 0.323819 3.745 0.000829 ***

imc 0.008592 0.004499 1.910 0.066427 .

hf[T.sim] 9.945568 2.656057 3.744 0.000830 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 7.407 on 28 degrees of freedom

Multiple R-squared: 0.7609, Adjusted R-squared: 0.7353

F-statistic: 29.71 on 3 and 28 DF, p-value: 7.602e-09



Model interpretation

■ β0 = 45.103: the intercept has no real interpretation in this example, because it is the average BP of a person with age zero, body mass index equal to zero and who does not smoke (the reference category of hf).

■ β1 = 1.212: BP increases, on average, 1.21 mmHg for an increase of 1 year of age, adjusted for body mass index and smoking habits

■ β2 = 0.008: BP increases, on average, 0.008 mmHg for an increase of 1 unit of body mass index, holding everything else constant

■ β3 = 9.945: BP is, on average, 9.94 mmHg higher for those who smoke compared with those who do not smoke. This effect is adjusted for all the other variables in the model

[Effect plots: id, imc and hf versus pa]


Hypothesis test

Analysis of Variance

■ ANOVA partitions the total variability in the sample of yi into:

Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi (yi − ŷi)²

where the first term is the total variability, the second is the variability explained by the regression line, and the third is the variability not explained (residual variation about the fitted line).

[Figure: variability partitions]



ANOVA

Σi (yi − ȳ)² [Total] = Σi (ŷi − ȳ)² [Regression] + Σi (yi − ŷi)² [Residual]

Source       Sum of squares (SS)      Degrees of freedom (df)   Mean sum of squares (MS)
Regression   SSreg = Σ (ŷi − ȳ)²      m                         MSregression = SSreg / m
Residual     SSE = Σ (yi − ŷi)²       n − m − 1                 MSresidual = SSE / (n − m − 1)
Total        SStotal = Σ (yi − ȳ)²    n − 1                     MStotal = SStotal / (n − 1)


Hypothesis test

ANOVA

F = (SSreg / m) / (SSE / (n − m − 1)) = MSregression / MSresidual ∼ Fm,n−m−1

with n = number of observations and m = number of variables.

■ If there is no linear relationship, the regression sum of squares just represents random variation, so the regression mean square is another, independent, estimate of σ2

■ The F test indicates whether there is evidence of a linear relationship between Y and X

■ F test: ratio between the variability explained by the regression and the residual variation

■ This ratio will be close to one if there is no effective relationship, and larger otherwise. In simple linear regression this is equivalent to testing H0: β1 = 0 versus H1: β1 ≠ 0
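A Python sketch of the decomposition and the F statistic for simple regression (m = 1); the helper name is mine, and the assert verifies the ANOVA identity before the ratio is formed:

```python
def anova_f(x, y):
    # fit by least squares, then partition the total sum of squares and
    # form F = (SSreg / m) / (SSE / (n - m - 1)) with m = 1 predictor
    n, m = len(x), 1
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * a for a in x]
    ss_total = sum((b - ybar) ** 2 for b in y)
    ss_reg = sum((f - ybar) ** 2 for f in fitted)
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))
    assert abs(ss_total - (ss_reg + sse)) < 1e-9   # the partition holds
    return (ss_reg / m) / (sse / (n - m - 1))

print(anova_f([1, 2, 3, 4], [6, 7, 10, 15]))   # 22.5
```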



in R Commander

ANOVA for multiple regression model

Statistics > Fit models > Linear models...

Call:

lm(formula = pa ~ id + imc + hf, data = multi)

Residuals:

Min 1Q Median 3Q Max

-13.5420 -6.1812 -0.7282 5.2908 15.7050

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 45.103192 10.764875 4.190 0.000252 ***

id 1.212715 0.323819 3.745 0.000829 ***

imc 0.008592 0.004499 1.910 0.066427 .

hf[T.sim] 9.945568 2.656057 3.744 0.000830 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 7.407 on 28 degrees of freedom

Multiple R-squared: 0.7609, Adjusted R-squared: 0.7353

F-statistic: 29.71 on 3 and 28 DF, p-value: 7.602e-09


Hypothesis test

Wald test

■ Test H0: βk = 0 versus H1: βk ≠ 0 using the T statistic

T = β̂k / SE(β̂k)

■ Under H0, T follows a t-Student distribution with n − p degrees of freedom (p = number of model coefficients and n = number of observations), or approximately a standard normal distribution (zero mean and unit variance).


in R Commander

Wald test in R Commander

Statistics > Fit models > Linear models...

Call:

lm(formula = pa ~ id + imc + hf, data = multi)

Residuals:

Min 1Q Median 3Q Max

-13.5420 -6.1812 -0.7282 5.2908 15.7050

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 45.103192 10.764875 4.190 0.000252 ***

id 1.212715 0.323819 3.745 0.000829 ***

imc 0.008592 0.004499 1.910 0.066427 .

hf[T.sim] 9.945568 2.656057 3.744 0.000830 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 7.407 on 28 degrees of freedom

Multiple R-squared: 0.7609, Adjusted R-squared: 0.7353

F-statistic: 29.71 on 3 and 28 DF, p-value: 7.602e-09



Hypothesis test

Partial F-test

■ To compare two models that are nested, we can compute a partial F-test.

■ Suppose two models Mp and Mq with, respectively, p and q parameters (p < q).

■ Mp and Mq are nested models if Mp ⊂ Mq, i.e., all parameters present in Mp are also present in Mq.

■ To test H0: the subset of variables present in Mq that is not present in Mp is not significant

against

■ H1: at least one of the variables in this subset is significant to model Y

■ corresponds to testing simultaneously that q − p parameters are all equal to zero, using the partial F-test:

F = [(SSregq − SSregp) / (q − p)] / [SSEq / (n − q)] ∼ Fq−p,n−q
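As an illustrative Python sketch (framing and names mine): comparing the intercept-only null model Mp (p = 1, for which SSreg = 0) with a simple-regression model Mq (q = 2) via the partial F formula. In this special case the partial F equals the overall ANOVA F:

```python
def partial_f_null_vs_simple(x, y):
    # Mp: y ~ 1 has SSreg_p = 0; Mq: y ~ x is fitted by least squares.
    # F = ((SSreg_q - SSreg_p) / (q - p)) / (SSE_q / (n - q))
    n, p, q = len(x), 1, 2
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * a for a in x]
    ss_reg_q = sum((f - ybar) ** 2 for f in fitted)
    sse_q = sum((b - f) ** 2 for b, f in zip(y, fitted))
    return ((ss_reg_q - 0.0) / (q - p)) / (sse_q / (n - q))

print(partial_f_null_vs_simple([1, 2, 3, 4], [6, 7, 10, 15]))   # 22.5
```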


in R Commander

Partial F-test in R Commander

Models > Hypothesis test > Compare two models...

> anova(LinearModel.3, LinearModel.1)

Analysis of Variance Table

Model 1: pa ~ id

Model 2: pa ~ id + imc + hf

Res.Df RSS Df Sum of Sq F Pr(>F)

1 30 2564.3

2 28 1536.1 2 1028.2 9.3707 0.0007663 ***

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1


Confidence Interval

■ The 100(1 − α)% confidence interval for the β's is given by:

[β̂k − tn−p,α/2 × SE(β̂k) ; β̂k + tn−p,α/2 × SE(β̂k)]

In R:

confint(modelo)

In R Commander:

Models > Confidence intervals...

> Confint(LinearModel.6, level=.95)

Estimate 2.5 % 97.5 %

(Intercept) 45.103192423 23.0523453750 67.15403947

id 1.212714616 0.5494010054 1.87602823

imc 0.008592449 -0.0006226821 0.01780758

hf[T.sim] 9.945567820 4.5048826190 15.38625302



Coefficient of determination

■ The (multiple) coefficient of determination is a summary measure given by the ratio of the regression sum of squares to the total sum of squares:

R² = SSreg / SStotal

■ R² represents the proportion of the total sum of squares explained by the regression.

■ R² can also be obtained as the square of the correlation between the yi and the predicted values ŷi from the model.

■ R² lies between 0 and 1.

■ Important note: do not confuse the F-test from ANOVA with R². The F-test from ANOVA indicates whether there is evidence of a linear relationship between Y and X, or in other words, that the regression model is significant. R² measures the quality of the model for prediction of Y
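A Python sketch checking both routes to R² described above, the sum-of-squares ratio and the squared correlation between y and ŷ (helper name mine):

```python
import math

def r_squared_two_ways(x, y):
    # returns (SSreg / SStotal, cor(y, yhat)^2); the two agree for OLS fits
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * a for a in x]
    ss_reg = sum((f - ybar) ** 2 for f in fitted)
    ss_total = sum((b - ybar) ** 2 for b in y)
    fbar = sum(fitted) / n
    num = sum((b - ybar) * (f - fbar) for b, f in zip(y, fitted))
    den = math.sqrt(ss_total * sum((f - fbar) ** 2 for f in fitted))
    return ss_reg / ss_total, (num / den) ** 2

r2_ss, r2_cor = r_squared_two_ways([1, 2, 3, 4], [6, 7, 10, 15])
print(round(r2_ss, 4), round(r2_cor, 4))   # 0.9184 0.9184
```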


Coefficient of determination

■ R² should not be used as a measure of quality of the model fit, because its value always increases when a variable is added to the model.

■ For this purpose one can use the adjusted coefficient of determination R²a:

R²a = 1 − MSresidual / MStotal


Variable selection

■ There are several methods to select variables for a model. The most popular are the sequential ones: forward selection, backward deletion and stepwise selection.

■ Forward selection: models are systematically built up by adding variables one by one to the null model comprising just β0 (intercept)

■ Backward deletion: models are systematically reduced by deleting variables one by one from the full model comprising all candidate variables

■ Stepwise selection: a combination of the two processes mentioned above.

■ For any of these methods, the most crucial decision is the choice of the stopping rule.

■ Some choices are the Akaike Information Criterion, for which there is no statistical distribution associated with a formal test, or the partial F-test, for which the level of significance to add or delete a variable has to be chosen.

■ Let's learn by an example.



Example

Choose package MASS and data set birthwt

Data > Data in packages > Read data set from an attached package...

Help > Help on active data set (if available)...

Recode the variables: change bwt to kg, and transform race and smoke into factors

Data > Manage variables in active data set > Convert numeric variables to factors...

Data > Manage variables in active data set > Compute new variable

Fit the model: bwt ∼ age + ftv + ht + lwt + ptl + race + smoke + ui

Statistics > Fit models > Linear model...

Select variables by using a sequential method

Models > Stepwise model selection...


Example

Select variables by using forward selection with the partial F-test. In each step, add the variable with the minimum p-value, provided it is below 0.20

nullmodel <-lm(bwt ~ 1, data=birthwt)

add1(nullmodel,scope=~ age +ftv +ht + lwt + ptl + race + smoke + ui,

test="F")

model1<-lm(bwt ~ ui, data=birthwt)

add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")

model1<-lm(bwt ~ ui+race, data=birthwt)

add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")

model1<-lm(bwt ~ ui+race+smoke, data=birthwt)

add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")

model1<-lm(bwt ~ ui+race+smoke+ht, data=birthwt)

add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")

model1<-lm(bwt ~ ui+race+smoke+ht+lwt, data=birthwt)

add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")

addmodel<-lm(bwt~ ui+race+smoke+ht+lwt,data=birthwt)

summary(addmodel)



Example

Select variables by using backward deletion with the partial F-test. In each step, delete the variable with the maximum p-value, provided it is above 0.25

fullmodel <-lm(bwt ~ age +ftv +ht + lwt + ptl + race + smoke + ui,

data=birthwt)

drop1(fullmodel,test="F")

model2 <-lm(bwt ~ age +ht + lwt + ptl + race + smoke + ui, data=birthwt)

drop1(model2,test="F")

model2 <-lm(bwt ~ ht + lwt + ptl + race + smoke + ui, data=birthwt)

drop1(model2,test="F")

model2 <-lm(bwt ~ ht + lwt + race + smoke + ui, data=birthwt)

drop1(model2,test="F")

dropmodel<-lm(bwt ~ ht + lwt + race + smoke + ui, data=birthwt)

summary(dropmodel)


Model check

■ Regression diagnostics are used after fitting to check whether the fitted mean function and the assumptions are consistent with the observed data.

■ The basic statistics here are the residuals, or possibly rescaled residuals.

■ If the fitted model does not give a set of residuals that appear reasonable, then some aspect of the model, either the assumed mean function or the assumptions concerning the variance function, may be called into doubt.


Residuals

■ Using matrix notation, we begin by deriving the properties of the residuals.

■ The basic multiple linear regression model is given by Y = Xβ + ǫ, with Var(ǫ) = σ2I

■ X is a known matrix with n rows and p columns, including a column of 1s for the intercept

■ β is the unknown p × 1 parameter vector

■ ǫ consists of unobservable errors that we assume are equally variable and uncorrelated



Residuals

■ We estimate β by β̂ = (XᵀX)⁻¹XᵀY, and the fitted values Ŷ are

Ŷ = Xβ̂ (1)
  = X(XᵀX)⁻¹XᵀY (2)
  = HY (3)

■ where H is an n × n matrix called the hat matrix, because it transforms the vector of observed responses Y into the vector of fitted responses Ŷ


Residuals

■ The vector of residuals ǫ̂ is defined by

ǫ̂ = Y − Ŷ (4)
  = Y − Xβ̂ (5)
  = Y − X(XᵀX)⁻¹XᵀY (6)
  = (I − H)Y (7)
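A small Python sketch of equations (1)–(7) for the simple-regression design matrix X with columns (1, x); the 2 × 2 inverse is written out by hand and all names are mine. It builds H, forms the residuals as (I − H)Y, and they match y − ŷ from the fitted line:

```python
def hat_matrix(x):
    # H = X (X^T X)^{-1} X^T for the design X with columns (1, x)
    n = len(x)
    sx, sxx = sum(x), sum(a * a for a in x)
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]   # (X^T X)^{-1}
    # H[i][j] = (1, x_i) @ inv @ (1, x_j)^T
    return [[inv[0][0] + inv[0][1] * x[j]
             + x[i] * (inv[1][0] + inv[1][1] * x[j])
             for j in range(n)] for i in range(n)]

x, y = [1, 2, 3, 4], [6, 7, 10, 15]
H = hat_matrix(x)
residuals = [y[i] - sum(H[i][j] * y[j] for j in range(4)) for i in range(4)]
print([round(e, 6) for e in residuals])            # [1.0, -1.0, -1.0, 1.0]
print(round(sum(H[i][i] for i in range(4)), 6))    # trace of H = p = 2.0
```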


Residuals

■ The errors ǫ are unobservable random variables, assumed to have zero mean and uncorrelated elements, each with common variance σ2. The residuals ǫ̂ are computed quantities that can be graphed or otherwise studied. Their mean and variance, using equation 7, are:

E(ǫ̂) = 0

Var(ǫ̂) = σ2(I − H)

■ Like the errors, each of the residuals has zero mean, but each residual may have a different variance.

■ Unlike the errors, the residuals are correlated

■ The residuals are linear combinations of the errors. If the errors are normally distributed, so are the residuals.


Residuals

■ In scalar form, the variance of the ith residual is

Var(ǫ̂i) = σ2(1 − hii) (8)

■ where hii is the ith diagonal element of H

■ Diagnostic procedures are based on the computed residuals, which we would like to assume behave as the unobservable errors would.



Residuals

■ The point of the derivation above is that model validation should be based on standardized residuals

■ Here are some examples




Residuals

Residual plots:

■ (a) null plot;

■ (b) right-opening megaphone;

■ (c) left-opening megaphone;

■ (d) double outward box;

■ (e) - (f) nonlinearity;

■ (g) - (h) combinations of nonlinearity and nonconstant variance function.

DEIO/CEAUL Valeska Andreozzi – slide 59

Residuals

Working residual

r = yi − µi

Pearson residual

rp = (yi − µi) / √σ2

Pearson standardized residual

r′p = (yi − µi) / √(σ2(1 − hii))

in R

rstandard(model, type="pearson")

in R Commander

Models > Add observations statistics to data

The R Commander calculates the Studentized residuals (residuals re-normalized to have unit variance, using a leave-one-out estimate of the error variance)

DEIO/CEAUL Valeska Andreozzi – slide 60
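The standardized residual formula above can be reproduced by hand and compared with `rstandard()`. A sketch, using the built-in `cars` data as an illustrative model (not the one fitted in these slides):

```r
# Sketch: reproduce rstandard() from equation 8 (cars data, illustrative)
fit    <- lm(dist ~ speed, data = cars)
sigma2 <- sum(residuals(fit)^2) / fit$df.residual   # estimate of sigma^2
h      <- hatvalues(fit)                            # diagonal elements h_ii of H
r_std  <- residuals(fit) / sqrt(sigma2 * (1 - h))   # standardized residuals
all.equal(unname(r_std), unname(rstandard(fit)))
```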


Plot of residuals

Constant variance: plot the standardized residuals against their corresponding fitted values (ŷi). The points should scatter randomly and evenly about zero if the assumption holds.

Graphs > Scatterplot...

[figure: Studentized residuals of LinearModel.2 (rstudent.LinearModel.2) plotted against its fitted values (fitted.LinearModel.2)]

DEIO/CEAUL Valeska Andreozzi – slide 61
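The same diagnostic can be produced without R Commander. A sketch, again using the `cars` data as an illustrative model:

```r
# Sketch: standardized residuals against fitted values (cars data, illustrative)
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), rstandard(fit),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)   # points should scatter randomly and evenly about 0
```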

Plot of residuals

Normality: plot the ranked standardized residuals against inverse normal cumulative distribution values. Departures from normality are indicated by deviations of the plot from a straight line.

Graphs > Quantile-comparision plot...

[figure: normal quantile-comparison plot of birthwt$rstudent.LinearModel.2 against norm quantiles]

DEIO/CEAUL Valeska Andreozzi – slide 62
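In plain R this is `qqnorm()` plus `qqline()`. A sketch with the same illustrative `cars` model:

```r
# Sketch: quantile-comparison plot of the standardized residuals
fit <- lm(dist ~ speed, data = cars)   # illustrative model
qqnorm(rstandard(fit))   # ranked residuals vs normal quantiles
qqline(rstandard(fit))   # departures from this line indicate non-normality
```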


Plot of residuals

Independence: plot standardized residuals against the serial order in which the observations were taken. Again, a random scatter of points indicates that the assumption is valid.

Graphs > Scatterplot...

[figures: Studentized residuals of LinearModel.2 plotted against observation number (obsNumber), overall and grouped by low01]

Be careful with the data set organization...

DEIO/CEAUL Valeska Andreozzi – slide 63

Plot of residuals

The truth is:

Data > Manage variables in active data set > Compute new variable

birthwt$index <- with(birthwt, sample(1:189))

[figure: Studentized residuals of LinearModel.2 plotted against the randomized index]

DEIO/CEAUL Valeska Andreozzi – slide 64


Plot of residuals

Linearity: plot standardized residuals against the individual explanatory variables. Linearity is indicated if all plots exhibit a random scatter of equal width about zero. Non-linearity when residuals are plotted against explanatory variables in the model suggests that higher-order terms involving those variables should be added to the model. Systematic patterns exhibited when residuals are plotted against variables that are not included in the model suggest that those variables should be added to the model.

Graphs > Scatterplot...

[figures: Studentized residuals of LinearModel.2 plotted against lwt and against age]

DEIO/CEAUL Valeska Andreozzi – slide 65
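These per-covariate plots are ordinary scatterplots of `rstandard()` values. A sketch using the `birthwt` data from MASS; the model below is an illustrative choice, not necessarily the slides' LinearModel.2:

```r
# Sketch: standardized residuals against each explanatory variable
# (birthwt data from MASS; illustrative model)
library(MASS)
fit <- lm(bwt ~ lwt + age, data = birthwt)
op <- par(mfrow = c(1, 2))
plot(birthwt$lwt, rstandard(fit), xlab = "lwt",
     ylab = "Standardized residuals"); abline(h = 0, lty = 2)
plot(birthwt$age, rstandard(fit), xlab = "age",
     ylab = "Standardized residuals"); abline(h = 0, lty = 2)
par(op)
```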

Interaction

■ When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way, on the level of a covariate.

■ That is, the covariate modifies the effect of the risk factor.

■ Epidemiologists use the term effect modifier to describe a variable that interacts with a risk factor.

■ Interaction can be included in a regression model by adding a product term (covariate × risk factor).

DEIO/CEAUL Valeska Andreozzi – slide 66
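In R's formula interface, `y ~ x * g` is shorthand for the main effects plus the product term `x:g`. A sketch using the `birthwt` data from MASS, with an illustrative choice of risk factor and covariate:

```r
# Sketch of a product term in R: y ~ x * g expands to y ~ x + g + x:g
# (birthwt data from MASS; illustrative choice of variables)
library(MASS)
m1 <- lm(bwt ~ smoke * lwt, data = birthwt)
m2 <- lm(bwt ~ smoke + lwt + smoke:lwt, data = birthwt)
all.equal(coef(m1), coef(m2))   # identical coefficients: same fitted model
```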

Interaction

Interaction representation

[figures: outcome y against risk factor x for group A and group B — left panel: without interaction; right panel: with interaction]

DEIO/CEAUL Valeska Andreozzi – slide 67


in R

The cystfibr data frame (package: ISwR) contains lung function data for cystic fibrosis patients (7-23 years old).

Call:

lm(formula = pemaxlog ~ bmp * sex, data = cystfibr)

Residuals:

Min 1Q Median 3Q Max

-0.34387 -0.19622 -0.06581 0.26224 0.48999

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.869731 0.481692 8.034 7.7e-08 ***

bmp 0.010660 0.005975 1.784 0.0889 .

sex[T.fem] 1.184884 0.738289 1.605 0.1234

bmp:sex[T.fem] -0.017056 0.009388 -1.817 0.0835 .

---

Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.2685 on 21 degrees of freedom

Multiple R-squared: 0.2219, Adjusted R-squared: 0.1107

F-statistic: 1.996 on 3 and 21 DF, p-value: 0.1455

DEIO/CEAUL Valeska Andreozzi – slide 68
