Introduction to linear model
Valeska Andreozzi
2012
References 3

Correlation 5
  Definition 6
  Pearson correlation coefficient 8
  Spearman correlation coefficient 10
  Hypothesis test 12

Simple linear regression 15
  Motivation 16
  The model 18
  Model assumptions 22
  Fitting the model 23
  Exercise 28

Multiple linear regression 29
  The model 30
  Hypothesis test 35
  Variable selection 46
  Model check 50
  Interaction 66
Summary

References
Correlation: Definition, Pearson correlation coefficient, Spearman correlation coefficient, Hypothesis test
Simple linear regression: Motivation, The model, Model assumptions, Fitting the model, Exercise
Multiple linear regression: The model, Hypothesis test, Variable selection, Model check, Interaction
DEIO/CEAUL Valeska Andreozzi – slide 2
References
References
■ Rosner, B. (2010). Fundamentals of Biostatistics, 7th Edition. Duxbury Resource Center.
■ Krzanowski, W. (1998). An Introduction to Statistical Modelling. Arnold Texts in Statistics.
■ Harrell, F. (2001). Regression Modeling Strategies. Springer-Verlag.
■ Weisberg, S. (2005). Applied Linear Regression, 3rd Edition. Wiley Series in Probability and Statistics. Wiley.
■ Dalgaard, P. (2008). Introductory Statistics with R, 2nd Edition. Springer.
Correlation
Definition
■ The correlation coefficient is a dimensionless quantity, independent of the units of the random variables X and Y, that ranges between −1 and 1
■ For random variables that are approximately linearly related, a correlation coefficient of 0 implies no linear association
■ A correlation coefficient close to 1 implies nearly perfect positive dependence, with large values of X corresponding to large values of Y and small values of X corresponding to small values of Y
■ A correlation coefficient close to −1 implies nearly perfect negative dependence, with large values of X corresponding to small values of Y and vice versa
Examples
■ An example of a strong positive correlation is between forced expiratory volume (FEV) and height.
■ A weaker positive correlation exists between serum cholesterol and dietary cholesterol intake.
■ A strong negative correlation can be found between resting pulse and age in children under the age of 10 years.
Pearson correlation coefficient
■ Assuming that X and Y are two random variables with a linear relationship, we can measure the correlation in a sample by calculating the Pearson correlation coefficient, given by:
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}
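As a language-neutral check of this formula (the slides otherwise use R), a minimal Python sketch with synthetic data:

```python
import math

def pearson_r(x, y):
    # r = sum_i (x_i - xbar)(y_i - ybar) /
    #     sqrt( sum_i (x_i - xbar)^2 * sum_i (y_i - ybar)^2 )
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # exactly linear: 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # exactly inverse: -1.0
```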
in R
library(ISwR)                                            # package with the thuesen data
data(thuesen)
attach(thuesen)
View(thuesen)
plot(thuesen)
cor(blood.glucose, short.velocity)                       # returns NA: there are missing values
cor(blood.glucose, short.velocity, use="complete.obs")   # use only complete pairs
Correlation in R Commander
Statistics > Summaries > Correlation matrix ...
Spearman correlation coefficient
■ Adequate when one or both variables are ordinal or have a distribution that is far from normal
■ The Spearman correlation coefficient is a nonparametric method with the advantage of being invariant to monotone transformations of the coordinates.
■ The main disadvantage of this method is that its interpretation is not as clear.
■ The Spearman (rank) correlation coefficient rs is obtained by replacing the observations of X and Y by their ranks and computing their (Pearson) correlation.
■ If there were a perfect correlation between the two variables, the ranks for each person on each variable would be the same and rs = 1. The less perfect the correlation, the closer rs would be to zero.
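A minimal Python sketch of both properties, the rank-based computation and the invariance to monotone transformations (synthetic data; the `ranks` helper assumes no ties):

```python
import math

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sum((xi - xbar) ** 2 for xi in x)
                           * sum((yi - ybar) ** 2 for yi in y))

def ranks(v):
    # rank of each value, 1 = smallest (no ties handled, for simplicity)
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman_rs(x, y):
    # Spearman = Pearson computed on the ranks
    return pearson_r(ranks(x), ranks(y))

x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(spearman_rs(x, y))                         # 0.8
# invariant to a monotone transformation of a coordinate:
print(spearman_rs(x, [math.exp(v) for v in y]))  # still 0.8
```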
in R
Correlation in R
x <- rank(thuesen$blood.glucose[-16])    # drop observation 16 (the one with a missing value)
y <- rank(thuesen$short.velocity[-16])
cor(x, y, method="pearson")              # Pearson on the ranks = Spearman
cor(thuesen$blood.glucose, thuesen$short.velocity,
    use="complete.obs", method="spearman")
Correlation in R Commander
Statistics > Summaries > Correlation matrix ...
Hypothesis test
■ It is possible to test the significance of the correlation by transforming it to a t-distributed variable; the result is identical to the test obtained from testing the significance of the slope of the regression of y on x, or vice versa (see later).
■ In R
cor.test(blood.glucose, short.velocity)
■ in R Commander
Statistics > Summaries > Correlation test ...
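The transformation behind this test is the standard one: under H0 of zero correlation, t = r√(n − 2)/√(1 − r²) follows a t distribution with n − 2 degrees of freedom. A minimal Python sketch of the statistic (the numbers are synthetic):

```python
import math

def cor_t(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 df under H0: rho = 0
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# e.g. a sample correlation of 0.6 from n = 27 pairs:
print(cor_t(0.6, 27))  # 0.6 * 5 / 0.8 = 3.75, to be compared with t on 25 df
```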
Exercises
Match the following items with the graphics: r = 0, 0 < r < 1, r = 1, r = −1, −1 < r < 0
Simple linear regression
Motivation
What is the relationship between systolic blood pressure (SBP) and age among healthy adults?
■ SBP increases with age
■ There are fluctuations around a linear trend
■ Variability of SBP not completely explained by age ⇒ random component
Why would we like to fit a model?
■ Describe the relationship between SBP and age
■ Prediction
Motivation
What can we say about SBP × age?
[Scatterplot of pa (SBP, mmHg) against id (age, years)]
General concepts
yi = β0 + β1xi + ǫi
[Scatterplot of pa against id with the fitted regression line]

■ Positive linear correlation
■ The relationship is not perfect.
■ The fitted line describes the linear relationship between SBP and age:

\hat{y}_i = 98.71 + 0.97 x_i
Model interpretation
SBPi = 98.71 + 0.97 × agei
[Scatterplot of pa against id with the fitted line]

\hat\beta_0 = 98.71
■ the estimated value of SBP when age is zero

\hat\beta_1 = 0.97
■ SBP increases by 0.97 mmHg for each one-year increase in age
Model illustration
Illustration of the components of a simple linear regression model.
■ Systematic component: β0 + β1xi
■ Statistical/Probabilistic Model: Yi = β0 + β1xi + ǫi or E(Yi) = β0 + β1xi
Model illustration
Simple linear regression representation.
■ The means of the probability distributions of Yi show the systematic relation with X
Model assumptions
Independence: the Yi are all independent
Linearity: the expected value of Yi is a linear function of xi
Homogeneity of variance: the variance of the probability distribution of Yi is constant over X and equal to σ2
Normality: for every xi, Yi follows a Normal distribution. This assumption is needed to build hypothesis tests and confidence intervals for the model parameters β
Model estimation
Least Squares Method
E(Yi|X) = β0 + β1xi
■ LSM gives estimates β0 and β1 that minimize the sum of squared errors (SSE)
SSE = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
Model estimation
β coefficients
Differentiating SSE and setting the partial derivatives to zero, we have:

\frac{\partial SSE}{\partial \beta_0} = \sum_{i=1}^{n} \left[ y_i - \beta_0 - \beta_1 x_i \right] = 0

\frac{\partial SSE}{\partial \beta_1} = \sum_{i=1}^{n} \left[ x_i (y_i - \beta_0 - \beta_1 x_i) \right] = 0

Solving this system gives the estimates of the model parameters:

\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}

\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
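The closed-form estimates can be checked numerically; a minimal Python sketch (synthetic data):

```python
def fit_simple(x, y):
    # beta1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    # beta0 = ybar - beta1 * xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

# an exact line y = 1 + 2x is recovered exactly
print(fit_simple([0, 1, 2, 3], [1, 3, 5, 7]))   # (1.0, 2.0)

# for any data, the least-squares residuals sum to zero
x, y = [1, 2, 3, 4], [2, 1, 4, 3]
b0, b1 = fit_simple(x, y)
print(round(sum(yi - b0 - b1 * xi for xi, yi in zip(x, y)), 10))  # 0.0
```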
Model estimation
Variance of Y (σ2)
■ Under the assumption that the errors are independent random variables with zero mean and constant variance σ2, an unbiased estimator of σ2 is the ratio between SSE = \sum_{i=1}^{n} \hat\epsilon_i^2 and its degrees of freedom (the number of observations minus the number of model coefficients):

\hat\sigma^2 = \frac{SSE}{n - 2}

■ This gives the estimated variance of Y.
in R
Simple linear regression in R
dados <- read.table("pasis.dat", header=TRUE)   # read the SBP/age data
names(dados)
head(dados)
plot(dados)
modelo <- lm(pa ~ id, data=dados)               # regress SBP (pa) on age (id)
summary(modelo)
plot(dados)
abline(modelo, col=2)                           # add the fitted line in red
in R Commander
Simple linear regression in R Commander
Data > Import data > from text file, clipboard, or URL...
Graphics > Scatterplot...
Statistics > Fit Models > Linear regression...
Exercise
Exercise in R
1. With the rmr dataset (ISwR package), plot the metabolic rate versus body weight. Fit a linear regression model to the relation. According to the fitted model, what is the predicted metabolic rate for a body weight of 70 kg?
2. With the juul dataset (ISwR package), fit a linear regression model of the square root of the IGF-I concentration versus age, for the group of subjects over 25 years old.
Tools > Load package(s)...
Data > Data in packages > Read data set from an attached package...
Graphics > Scatterplot...
Statistics > Fit Models > Linear regression...
Data > Manage variable in active data set > Compute new variable...
Multiple linear regression
Multiple linear regression
yi = β0 + β1x1i + β2x2i + ǫi
■ Describes the relationship between the response (dependent, outcome) variable Y and two or more independent variables (covariates, predictors, explanatory variables) X1, X2, X3, ..., Xk
■ Estimates the direction of the association between the response variable and the covariates.
■ The covariates can be transformed variables (example: log(cd4)), polynomial terms (example: age2), interaction terms (example: age × sex) and dummy variables.
■ Determines which covariates are important to predict the response variable
■ Describes the relationship of X1, X2, X3, ..., Xk and Y adjusted for the effect of other covariates Z1 and Z2, for example.
Multiple linear regression
yi = β0 + β1x1i + β2x2i + ǫi
■ Assumes that the response variable is a random variable which varies from individual to individual i.
■ The continuous nature of the response variable suggests that the Normal distribution is adequate as the population model for Yi
■ So Yi follows a Normal distribution with mean µi and unknown variance σ2 (Yi ∼ N(µi, σ2))
■ Equivalently, each observation yi = µi + ǫi with ǫi ∼ N(0, σ2)
■ The model parameters are again estimated by the least squares method.
Example
Describe the relationship between blood pressure (yi) and age (x1i), body mass index (x2i) and smoking habits (x3i). File: (multi.dat)
E(Yi) = β0 + β1x1i + β2x2i + β3x3i
Data > Import data > from a text file, clipboard or URL ...
Statistics > Summaries > Active data set
summary(dados)
pessoa pa id
Min. : 1.00 Min. :120.0 Min. :41.00
1st Qu.: 8.75 1st Qu.:134.8 1st Qu.:48.00
Median :16.50 Median :143.0 Median :53.50
Mean :16.50 Mean :144.5 Mean :53.25
3rd Qu.:24.25 3rd Qu.:152.0 3rd Qu.:58.25
Max. :32.00 Max. :180.0 Max. :65.00
imc hf
Min. :2368 não:15
1st Qu.:3022 sim:17
Median :3380
Mean :3441
3rd Qu.:3776
Max. :4637
in R Commander
Statistics > Fit models > Linear models...
Call:
lm(formula = pa ~ id + imc + hf, data = multi)
Residuals:
Min 1Q Median 3Q Max
-13.5420 -6.1812 -0.7282 5.2908 15.7050
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.103192 10.764875 4.190 0.000252 ***
id 1.212715 0.323819 3.745 0.000829 ***
imc 0.008592 0.004499 1.910 0.066427 .
hf[T.sim] 9.945568 2.656057 3.744 0.000830 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 7.407 on 28 degrees of freedom
Multiple R-squared: 0.7609, Adjusted R-squared: 0.7353
F-statistic: 29.71 on 3 and 28 DF, p-value: 7.602e-09
Model interpretation
■ β̂0 = 45.103 The intercept has no real interpretation in this example, because it is the average BP of a person with age zero, body mass index equal to zero and who does not smoke (the reference category of hf)
■ β̂1 = 1.212 BP increases, on average, 1.21 mmHg for each one-year increase in age, adjusted for body mass index and smoking habits
■ β̂2 = 0.008 BP increases, on average, 0.008 mmHg for each one-unit increase in body mass index, holding everything else constant
■ β̂3 = 9.945 BP is, on average, 9.94 mmHg higher for those who smoke than for those who do not. This effect is adjusted for all the other variables in the model
[Effect plots of id, imc and hf (não/sim) on pa from the fitted model]
Hypothesis test
Analysis of Variance
■ ANOVA partition the total variability in the sample of yi into:
\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{total variability}} = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\substack{\text{variability explained by}\\ \text{the regression line}}} + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\substack{\text{variability not explained}\\ \text{(residual variation about the fitted line)}}}

[Figure: variability partitions]
ANOVA
\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{Total}} = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{Regression}} + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{Residual}}

Source      | Sum of squares (SS)               | degrees of freedom (df) | Mean sum of squares (MS)
Regression  | SSreg = \sum (\hat{y}_i - \bar{y})^2 | m                    | MSregression = SSreg / m
Residual    | SSE = \sum (y_i - \hat{y}_i)^2       | n − m − 1            | MSresidual = SSE / (n − m − 1)
Total       | SStotal = \sum (y_i - \bar{y})^2     | n − 1                | MStotal = SStotal / (n − 1)
Hypothesis test
ANOVA
F = \frac{SSreg / m}{SSE / (n - m - 1)} = \frac{MS_{regression}}{MS_{residual}} \sim F_{m,\, n-m-1}

with n = number of observations and m = number of variables.

■ If there is no linear relationship, the regression sum of squares just represents random variation, so the regression mean square is another, independent, estimate of σ2
■ The F test indicates whether there is evidence of a linear relationship between Y and X
■ F test: the ratio between the variability explained by the regression and the residual variation
■ This ratio will be close to one if there is no effective relationship, and larger otherwise. In simple linear regression this is equivalent to testing H0: β1 = 0 versus H1: β1 ≠ 0
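A minimal Python sketch of the sum-of-squares decomposition and the resulting F ratio (synthetic data, m = 1):

```python
def anova_simple(x, y):
    # fit y = b0 + b1 x by least squares, then split the total sum of squares
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * xi for xi in x]
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    ss_reg = sum((yh - ybar) ** 2 for yh in yhat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    m = 1                                      # one explanatory variable
    f = (ss_reg / m) / (sse / (n - m - 1))     # F = MSregression / MSresidual
    return ss_total, ss_reg, sse, f

ss_total, ss_reg, sse, f = anova_simple([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
print(round(ss_total - (ss_reg + sse), 10))    # the decomposition holds: 0.0
print(f)                                       # 10 / (4.8 / 3) = 6.25
```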
in R Commander
ANOVA for multiple regression model
Statistics > Fit models > Linear models...
Call:
lm(formula = pa ~ id + imc + hf, data = multi)
Residuals:
Min 1Q Median 3Q Max
-13.5420 -6.1812 -0.7282 5.2908 15.7050
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.103192 10.764875 4.190 0.000252 ***
id 1.212715 0.323819 3.745 0.000829 ***
imc 0.008592 0.004499 1.910 0.066427 .
hf[T.sim] 9.945568 2.656057 3.744 0.000830 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 7.407 on 28 degrees of freedom
Multiple R-squared: 0.7609, Adjusted R-squared: 0.7353
F-statistic: 29.71 on 3 and 28 DF, p-value: 7.602e-09
Hypothesis test
Wald test
■ Tests H0: βk = 0 versus H1: βk ≠ 0 using the statistic

T = \frac{\hat\beta_k}{SE(\hat\beta_k)}

■ Under H0, T follows a Student's t distribution with n − p degrees of freedom (p = number of model coefficients and n = number of observations), or approximately a standard normal distribution.
in R Commander
Wald test in R Commander
Statistics > Fit models > Linear models...
Call:
lm(formula = pa ~ id + imc + hf, data = multi)
Residuals:
Min 1Q Median 3Q Max
-13.5420 -6.1812 -0.7282 5.2908 15.7050
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.103192 10.764875 4.190 0.000252 ***
id 1.212715 0.323819 3.745 0.000829 ***
imc 0.008592 0.004499 1.910 0.066427 .
hf[T.sim] 9.945568 2.656057 3.744 0.000830 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 7.407 on 28 degrees of freedom
Multiple R-squared: 0.7609, Adjusted R-squared: 0.7353
F-statistic: 29.71 on 3 and 28 DF, p-value: 7.602e-09
Hypothesis test
Partial F-test
■ To compare two nested models, we can compute a partial F-test.
■ Suppose two models Mp and Mq with, respectively, p and q parameters (p < q).
■ Mp and Mq are nested models (Mp ⊂ Mq) if all parameters present in Mp are also present in Mq.
■ Testing H0: the variables present in Mq but not in Mp are all not significant
against
■ H1: at least one of the variables in this subset is significant for modelling Y
■ corresponds to testing simultaneously that q − p parameters are all equal to zero, using the partial F statistic:

F = \frac{(SSreg_q - SSreg_p)/(q - p)}{SSE_q/(n - q)} \sim F_{q-p,\, n-q}
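In the special case where Mp is the intercept-only null model (so SSregp = 0 and p = 1), the partial F reduces to the overall ANOVA F. A minimal Python sketch of that case (synthetic data):

```python
def partial_f_vs_null(x, y):
    # Mp: intercept only (p = 1, SSreg_p = 0); Mq: intercept + slope (q = 2)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * xi for xi in x]
    ssreg_q = sum((yh - ybar) ** 2 for yh in yhat)
    sse_q = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    p, q = 1, 2
    # F = ((SSreg_q - SSreg_p) / (q - p)) / (SSE_q / (n - q))
    return ((ssreg_q - 0.0) / (q - p)) / (sse_q / (n - q))

# against the null model, the partial F equals the overall ANOVA F
print(partial_f_vs_null([1, 2, 3, 4, 5], [2, 1, 4, 3, 6]))  # 6.25
```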
in R Commander
Partial F-test in R Commander
Models > Hypothesis test > Compare two models...
> anova(LinearModel.3, LinearModel.1)
Analysis of Variance Table
Model 1: pa ~ id
Model 2: pa ~ id + imc + hf
Res.Df RSS Df Sum of Sq F Pr(>F)
1 30 2564.3
2 28 1536.1 2 1028.2 9.3707 0.0007663 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Confidence Interval
■ The 100(1 − α)% confidence interval for the β's is given by:

\left[ \hat\beta_k - t_{n-p,\,\alpha/2} \times SE(\hat\beta_k)\ ;\ \hat\beta_k + t_{n-p,\,\alpha/2} \times SE(\hat\beta_k) \right]
In R:
confint(modelo)
In R Commander:
Models > Confidence intervals...
> Confint(LinearModel.6, level=.95)
Estimate 2.5 % 97.5 %
(Intercept) 45.103192423 23.0523453750 67.15403947
id 1.212714616 0.5494010054 1.87602823
imc 0.008592449 -0.0006226821 0.01780758
hf[T.sim] 9.945567820 4.5048826190 15.38625302
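The interval for the id coefficient can be reproduced from the printed estimate and standard error; a minimal Python sketch (the quantile t28,0.025 ≈ 2.0484 is supplied as a constant, not computed here):

```python
def confint(beta_hat, se, t_quantile):
    # [beta - t * SE, beta + t * SE]
    return beta_hat - t_quantile * se, beta_hat + t_quantile * se

# id coefficient from the model output: 1.212715 (SE 0.323819), 28 df;
# t_{28, 0.025} is approximately 2.0484
lo, hi = confint(1.212715, 0.323819, 2.0484)
print(round(lo, 4), round(hi, 4))  # matches Confint's 0.5494 and 1.8760
```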
Coefficient of determination
■ The (multiple) coefficient of determination is a summary measure given by the ratio of the regression sum of squares to the total sum of squares:

R^2 = \frac{SSreg}{SStotal}

■ R2 represents the proportion of the total sum of squares explained by the regression.
■ R2 equals the square of the correlation between the yi and the predicted values ŷi from the model.
■ R2 lies between 0 and 1.
■ Important note: do not confuse the ANOVA F-test with R2. The ANOVA F-test indicates whether there is evidence of a linear relationship between Y and X, in other words, that the regression model is significant. R2 measures the quality of the model for prediction of Y
Coefficient of determination
■ R2 should not be used as a measure of quality of the model fit, because its value always increases when a variable is added to the model.
■ For this purpose one can use the adjusted coefficient of determination R2a:

R^2_a = 1 - \frac{MS_{residual}}{MS_{total}}
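Expanding the mean squares, this can be rewritten as R²a = 1 − (1 − R²)(n − 1)/(n − m − 1). A minimal Python sketch using the values reported in the model output above (R² = 0.7609, n = 32, m = 3):

```python
def adjusted_r2(r2, n, m):
    # R2_a = 1 - MSresidual / MStotal = 1 - (1 - R2) * (n - 1) / (n - m - 1)
    return 1 - (1 - r2) * (n - 1) / (n - m - 1)

# slide output: Multiple R-squared 0.7609, n = 32 observations, m = 3 covariates
print(round(adjusted_r2(0.7609, 32, 3), 4))  # 0.7353, as reported
```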
Variable selection
■ There are several methods to select variables for a model. The most popular are the sequential ones: forward selection, backward deletion and stepwise selection.
■ Forward selection: models are systematically built up by adding variables one by one, starting from the null model comprising just β0 (the intercept)
■ Backward deletion: models are systematically reduced by deleting variables one by one, starting from the full model comprising all candidate variables
■ Stepwise selection: a combination of the two processes mentioned above.
■ For any of these methods, the most crucial decision is the choice of the stopping rule.
■ Some choices are the Akaike Information Criterion, for which there is no associated statistical distribution to perform a formal test, or the partial F-test, for which the significance level to add or delete a variable has to be chosen.
■ Let's learn by example.
Example
Choose the package MASS and the data set birthwt
Data > Data in packages > Read data set from an attached package...
Help > Help on active data set (if available)...
Recode the variables: change bwt to kg, transform race and smoke to factors
Data > Manage variables in active data set > Convert
numerical variables to factors...
Data > Manage variables in active data set > Compute new variable
Fit the model: bwt ∼ age + ftv + ht + lwt + ptl + race + smoke + ui
Statistics > Fit models > Linear model...
Select variables by using a sequential method
Models > Stepwise model selection...
Example
Select variables using forward selection with the partial F-test. In each step, add the variable with the minimum p-value, provided it is below 0.20
nullmodel <-lm(bwt ~ 1, data=birthwt)
add1(nullmodel,scope=~ age +ftv +ht + lwt + ptl + race + smoke + ui,
test="F")
model1<-lm(bwt ~ ui, data=birthwt)
add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")
model1<-lm(bwt ~ ui+race, data=birthwt)
add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")
model1<-lm(bwt ~ ui+race+smoke, data=birthwt)
add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")
model1<-lm(bwt ~ ui+race+smoke+ht, data=birthwt)
add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")
model1<-lm(bwt ~ ui+race+smoke+ht+lwt, data=birthwt)
add1(model1,scope=~ age +ftv +ht + lwt + ptl + race + smoke +ui ,test="F")
addmodel<-lm(bwt~ ui+race+smoke+ht+lwt,data=birthwt)
summary(addmodel)
Example
Select variables using backward deletion with the partial F-test. In each step, delete the variable with the maximum p-value, provided it is above 0.25
fullmodel <-lm(bwt ~ age +ftv +ht + lwt + ptl + race + smoke + ui,
data=birthwt)
drop1(fullmodel,test="F")
model2 <-lm(bwt ~ age +ht + lwt + ptl + race + smoke + ui, data=birthwt)
drop1(model2,test="F")
model2 <-lm(bwt ~ ht + lwt + ptl + race + smoke + ui, data=birthwt)
drop1(model2,test="F")
model2 <-lm(bwt ~ ht + lwt + race + smoke + ui, data=birthwt)
drop1(model2,test="F")
dropmodel<-lm(bwt ~ ht + lwt + race + smoke + ui, data=birthwt)
summary(dropmodel)
Model check
■ Regression diagnostics are used after fitting to check whether the fitted mean function and the assumptions are consistent with the observed data.
■ The basic statistics here are the residuals, or possibly rescaled residuals.
■ If the fitted model does not give a set of residuals that appear reasonable, then some aspect of the model (either the assumed mean function or the assumptions concerning the variance function) may be called into doubt.
Residuals
■ Using matrix notation, we begin by deriving the properties of the residuals.
■ The basic multiple linear regression model is given by Y = Xβ + ǫ with Var(ǫ) = σ2I
■ X is a known matrix with n rows and p columns, including a column of 1s for the intercept
■ β is the unknown p × 1 parameter vector
■ ǫ consists of unobservable errors that we assume are equally variable and uncorrelated
Residuals
■ We estimate β by \hat\beta = (X^T X)^{-1} X^T Y, and the fitted values \hat{Y} are

\hat{Y} = X\hat\beta \quad (1)
        = X (X^T X)^{-1} X^T Y \quad (2)
        = HY \quad (3)

■ where H is an n × n matrix called the hat matrix, because it transforms the vector of observed responses Y into the vector of fitted responses \hat{Y}
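A minimal Python sketch of the hat matrix for the simple-regression design X = (1, x), checking two of its defining properties, trace(H) = p and idempotence (H H = H):

```python
def hat_matrix(x):
    # design matrix X with columns (1, x); H = X (X^T X)^{-1} X^T
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx               # det(X^T X) for this 2x2 case
    inv = [[sxx / det, -sx / det],        # (X^T X)^{-1}
           [-sx / det, n / det]]
    rows = [[1.0, v] for v in x]
    def quad(a, b):                       # a^T (X^T X)^{-1} b for rows a, b
        return sum(a[i] * inv[i][j] * b[j] for i in range(2) for j in range(2))
    return [[quad(ri, rj) for rj in rows] for ri in rows]

H = hat_matrix([1.0, 2.0, 3.0, 4.0])
n = 4
print(round(sum(H[i][i] for i in range(n)), 10))   # trace(H) = p = 2
HH = [[sum(H[i][k] * H[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
print(max(abs(HH[i][j] - H[i][j]) for i in range(n) for j in range(n)) < 1e-9)  # True
```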
Residuals
■ The vector of residuals \hat\epsilon is defined by

\hat\epsilon = Y - \hat{Y} \quad (4)
             = Y - X\hat\beta \quad (5)
             = Y - X (X^T X)^{-1} X^T Y \quad (6)
             = (I - H) Y \quad (7)
Residuals
■ The errors ǫ are unobservable random variables, assumed to have zero mean and uncorrelated elements, each with common variance σ2. The residuals \hat\epsilon are computed quantities that can be graphed or otherwise studied. Their mean and variance, using equation 7, are:

E(\hat\epsilon) = 0
Var(\hat\epsilon) = \sigma^2 (I - H)

■ Like the errors, each residual has zero mean, but each residual may have a different variance.
■ Unlike the errors, the residuals are correlated.
■ The residuals are linear combinations of the errors. If the errors are normally distributed, so are the residuals.
Residuals
■ In scalar form, the variance of the ith residual is

Var(\hat\epsilon_i) = \sigma^2 (1 - h_{ii}) \quad (8)

■ where h_{ii} is the ith diagonal element of H
■ Diagnostic procedures are based on the computed residuals, which we would like to assume behave as the unobservable errors would.
Residuals
■ The point of the derivation above is that model validation should be based on standardized residuals
■ [Figures: example plots of standardized residuals]
Residuals
Residual plots:
■ (a) null plot;
■ (b) right-opening megaphone;
■ (c) left-opening megaphone;
■ (d) double outward box;
■ (e) - (f) nonlinearity;
■ (g) - (h) combinations of nonlinearity and nonconstant variance function.
Residuals
Working residual: r = y_i - \hat\mu_i

Pearson residual: r_p = \frac{y_i - \hat\mu_i}{\sqrt{\hat\sigma^2}}

Standardized Pearson residual: r'_p = \frac{y_i - \hat\mu_i}{\sqrt{\hat\sigma^2 (1 - h_{ii})}}
in R
rstandard(model, type="pearson")
in R Commander
Models > Add observations statistics to data
R Commander calculates the Studentized residuals (the residuals re-normalized to unit variance, using a leave-one-out estimate of the error variance)
Plot of residuals
Constant variance: plot the standardized residuals against the corresponding fitted values (ŷi). The points should appear randomly and evenly scattered about zero if the assumption holds.
Graphs > Scatterplot...
[Plot of the Studentized residuals of LinearModel.2 against its fitted values]
Plot of residuals
Normality: plot the ranked standardized residuals against the inverse normal cumulative distribution values. Departures from normality are indicated by deviations of the plot from a straight line.
Graphs > Quantile-comparision plot...
[Normal quantile-comparison plot of the Studentized residuals of LinearModel.2]
Plot of residuals
Independence: plot the standardized residuals against the serial order in which the observations were taken. Again, a random scatter of points indicates that the assumption is valid.
Graphs > Scatterplot...
[Two plots of the Studentized residuals of LinearModel.2 against observation number; in the second, the points are marked by low01]

Be careful with the organization of the data set...
Plot of residuals
The truth is:
Data > Manage variables in active data set > Compute new variable
birthwt$index <- with(birthwt, sample(1:189))
[Plot of the Studentized residuals of LinearModel.2 against the randomized index]
Plot of residuals
Linearity: plot the standardized residuals against the individual explanatory variables. Linearity is indicated if all plots exhibit a random scatter of equal width about zero. Non-linearity when the residuals are plotted against explanatory variables in the model suggests that higher-order terms involving those variables should be added to the model. Systematic patterns exhibited when the residuals are plotted against variables not included in the model suggest that those variables should be added to the model.
Graphs > Scatterplot...
[Plots of the Studentized residuals of LinearModel.2 against lwt and against age]
Interaction
■ When interaction is present, the association between the risk factor and the outcome variable differs, or depends in some way, on the level of a covariate.
■ That is, the covariate modifies the effect of the risk factor.
■ Epidemiologists use the term effect modifier to describe a variable that interacts with a risk factor.
■ Interaction can be included in a regression model by adding the product term covariate × risk factor.
Interaction
Interaction representation
[Illustration: outcome y against risk factor x for groups A and B, without interaction (parallel lines) and with interaction (lines with different slopes)]
in R
The cystfibr data frame (package: ISwR) contains lung function data for cystic fibrosis patients (7-23 years old).
Call:
lm(formula = pemaxlog ~ bmp * sex, data = cystfibr)
Residuals:
Min 1Q Median 3Q Max
-0.34387 -0.19622 -0.06581 0.26224 0.48999
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.869731 0.481692 8.034 7.7e-08 ***
bmp 0.010660 0.005975 1.784 0.0889 .
sex[T.fem] 1.184884 0.738289 1.605 0.1234
bmp:sex[T.fem] -0.017056 0.009388 -1.817 0.0835 .
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 0.2685 on 21 degrees of freedom
Multiple R-squared: 0.2219, Adjusted R-squared: 0.1107
F-statistic: 1.996 on 3 and 21 DF, p-value: 0.1455