the team - exeter data analytics hub## residual standard error: 10.31 on 98 degrees of freedom ##...
TRANSCRIPT
Linear models in R
Richard Sherley
University of Exeter, Penryn Campus, UK
Februrary 2020
Richard Sherley Linear models in R Februrary 2020 1 / 1
Thanks and preamble
Thanks to J. J. Valletta (now an Associate Lecturer in Statistics,University of St Andrews) and T. J. McKinley (Lecturer in MathematicalBiology, now at the Streatham Campus)
Extensive notes, handouts of these slides, and data files for the practicalsare available at: https://exeter-data-analytics.github.io/StatModelling/
The Team• Dr Beth Clark
• Dr Dan Padfield
• Dr Matt Silk
• Dr Richard Inger
Richard Sherley Linear models in R Februrary 2020 2 / 1
What is a model and why do we need one?
A model is a human construct/abstraction that tries to approximate thedata generating process in some useful manner
Models are built for• enhancing our understanding of a complex phenomenon
• executing “what if”scenarios
• predicting/forecasting an outcome
• controlling a system
Richard Sherley Linear models in R Februrary 2020 3 / 1
Illustrative example
0 2 4 6 8 10
05
1015
2025
Hours spent watching Game of Thrones (GoT) at the weekend
Min
utes
spe
nt o
n m
onda
y ta
lkin
g ab
out G
oT a
t wor
k
Richard Sherley Linear models in R Februrary 2020 4 / 1
Illustrative example
0 2 4 6 8 10
05
1015
2025
Hours spent watching Game of Thrones (GoT) at the weekend
Min
utes
spe
nt o
n m
onda
y ta
lkin
g ab
out G
oT a
t wor
k
Richard Sherley Linear models in R Februrary 2020 5 / 1
Illustrative example
0 2 4 6 8 10
05
1015
2025
Hours spent watching Game of Thrones (GoT) at the weekend
Min
utes
spe
nt o
n m
onda
y ta
lkin
g ab
out G
oT a
t wor
k
c = intercept
m = gradient
Richard Sherley Linear models in R Februrary 2020 6 / 1
Illustrative example
0 2 4 6 8 10
05
1015
2025
Hours spent watching Game of Thrones (GoT) at the weekend
Min
utes
spe
nt o
n m
onda
y ta
lkin
g ab
out G
oT a
t wor
k
β0 = intercept
β1 = gradient
Richard Sherley Linear models in R Februrary 2020 7 / 1
Formal definition
yi = β0 + β1xi + εi
εi ∼ N (0, σ2)
Observed data• y (outcome/response): minutes spent talking about GoT
• x (explanatory): hours spent watching Game of Thrones (GoT)
Parameters to infer• β0: intercept
• β1: gradient wrt minutes talking about GoT
Richard Sherley Linear models in R Februrary 2020 8 / 1
Linear models in R
• Use the lm() function
• Requires a formula objectoutcome ∼ explanatory variable
1 # talk: minutes spent talking about GoT (outcome/response variable)
2 # watch: hours spent watching GoT (explanatory variable)
3
4 fit <- lm(talk ~ watch)
5
6 # If data is in a data frame called "df"
7 fit <- lm(talk ~ watch , df)
Richard Sherley Linear models in R Februrary 2020 9 / 1
Summary of fitted model
1 summary(fit)
##
## Call:
## lm(formula = height ~ weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.089 -6.926 -0.689 6.057 24.967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.35229 7.11668 0.331 0.742
## weight 2.17446 0.08782 24.762 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 10.31 on 98 degrees of freedom
## Multiple R-squared: 0.8622, Adjusted R-squared: 0.8608
## F-statistic: 613.1 on 1 and 98 DF, p-value: < 2.2e-16
Richard Sherley Linear models in R Februrary 2020 10 / 1
Model checking
In order to make robust inference, we must check the model fit
1 plot(fit)
140 160 180 200 220
−30
−20
−10
010
2030
Fitted values
Res
idua
ls
Residuals vs Fitted
79
5664
−2 −1 0 1 2
−3
−2
−1
01
23
Theoretical Quantiles
Sta
ndar
dize
d re
sidu
als
Normal Q−Q
79
5664
140 160 180 200 220
0.0
0.5
1.0
1.5
Fitted values
Sta
ndar
dize
d re
sidu
als
Scale−Location79
5664
0.00 0.01 0.02 0.03 0.04
−3
−2
−1
01
23
Leverage
Sta
ndar
dize
d re
sidu
als
Cook's distance
Residuals vs Leverage
79
56
88
Richard Sherley Linear models in R Februrary 2020 11 / 1
Model checking
A couple of examples where the homogeneity of variance assumption isviolated
Richard Sherley Linear models in R Februrary 2020 12 / 1
Model checking: Leverage and Influence
• Obs. 1 – large leverage (outlier in x and y), but not influential
• Obs. 2 – large residual, but not an outlier in x or y. Not influential
• Obs. 3 – not an outlier for y, but has large leverage and largeresidual: very influential (high Cook’s distance).
Richard Sherley Linear models in R Februrary 2020 13 / 1
Model checking: Plot your data!
• Always visualise your data before fitting any model
• Assumption of a linear relationship...
• Scatterplots can indicate unequal variance, nonlinearity and outliers
Richard Sherley Linear models in R Februrary 2020 14 / 1
Model checking: Plot your data!
• Always visualise your data before fitting any model
• Assumption of a linear relationship...
• Scatterplots can indicate unequal variance, nonlinearity and outliers
Anscombe (1973) American Statistician 27: 17–21
• r2 = 0.68 in all four cases
• Test that H0 = 0 identical in all four cases (t = 4.24, P = 0.002)
Richard Sherley Linear models in R Februrary 2020 15 / 1
Confidence vs prediction intervals
25
50
75
100
0.7 0.8 0.9
thorax
long
evity
Richard Sherley Linear models in R Februrary 2020 16 / 1
Confidence vs prediction intervals
0
25
50
75
100
0.7 0.8 0.9thorax
long
evity
Fitted regression line with 97% prediction interval
Richard Sherley Linear models in R Februrary 2020 16 / 1
Multiple linear regression
yi = β0 + β1x1i + . . .+ βpxpi + εi
εi ∼ N (0, σ2)
60 70 80 90 100
120
140
160
180
200
220
240
120
140
160
180
200
220
Weight (kg)
Hei
ght o
f Par
ents
(cm
)
Hei
ght (
cm)
Richard Sherley Linear models in R Februrary 2020 17 / 1
Multiple linear regression
yi = β0 + β1x1i + . . .+ βpxpi + εi
εi ∼ N (0, σ2)
60 70 80 90 100
120
140
160
180
200
220
240
120
140
160
180
200
220
Weight (kg)
Hei
ght o
f Par
ents
(cm
)
Hei
ght (
cm)
Richard Sherley Linear models in R Februrary 2020 18 / 1
Categorical variables
160
200
240
60 70 80 90 100
weight
heig
ht
sex
Female
Male
Richard Sherley Linear models in R Februrary 2020 19 / 1
Categorical variables
We need dummy variables
Si =
{1 if i is male,0 otherwise
Here, female is known as the baseline/reference levelThe regression is:
yi = β0 + β1Si + β2xi + εi
Or in English:
heighti = β0 + β1sexi + β2weighti + εi
Richard Sherley Linear models in R Februrary 2020 20 / 1
Categorical variables
The mean regression lines for male and female are:
• Female (sex=0)
heighti = β0 + (β1 × 0) + β2weighti
heighti = β0 + β2weighti
• Male (sex=1)
heighti = β0 + (β1 × 1) + β2weighti
heighti = (β0 + β1) + β2weighti
Richard Sherley Linear models in R Februrary 2020 21 / 1
Summary
Linear regression is a powerful tool:
• It splits the data into signal (trend/mean) and noise (residual error)
• It can cope with multiple variables
• It can incorporate different types of variable
• It can be used to produce point and interval estimates for theparameters
• It can be used to assess the importance of variables
But always check that the model fit is sensible, that the assumptions aremet and the results make sense!
Richard Sherley Linear models in R Februrary 2020 22 / 1