Angewandte statistische Regression I
Dr. Matteo Tanadini ([email protected])
Fall Semester 2019 (ETHZ)
Lecture 7, Angewandte statistische Regression I
Outline
1 Introduction
2 Designs
3 Get to know data
4 Fitting a Mixed Effects Model with lme4
5 Checking the model assumptions
6 Inference procedure
7 Concluding remarks
Section 1
Introduction
What is a Mixed Model?
Definition: a statistical model that contains both random and fixed effects.
Example: a clinical trial where a new drug is tested along with the gold standard.
Usually, fixed effects are predictors of interest, for which we want to draw conclusions (e.g. drug).
Random effects are predictors that need to be considered in the analysis (design variables, e.g. patient), but are usually not of primary interest¹.
¹ Nevertheless, we may be interested in their variability.
What is a Mixed Model?
It is natural to use Mixed Effects Models when “grouped data” is present. Such data arises in experiments where an “experimental unit” is measured more than once (e.g. the diameter of a tree measured yearly).
How do we decide whether a variable must be considered random or not?
Design of the experiment/study
Questions addressed
Hospitals example:
Response variable: length of stay [days]
Design: 10 hospitals, about 1,000 hospitalisations per hospital
Q1: What is the average length of stay?
Q2: Which hospital has, on average, the longest stays?
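To make the two questions concrete, here is a minimal simulation sketch (invented parameters, not the course data): the hospital enters as a shared shift for all stays recorded in the same hospital.

```r
## Hypothetical simulation of the hospitals design (invented numbers)
set.seed(1)
n.hosp <- 10; n.per <- 1000
hospital <- factor(rep(paste0("H", 1:n.hosp), each = n.per))
b <- rnorm(n.hosp, mean = 0, sd = 1.5)               # hospital-level random effects
stay <- 6 + b[as.integer(hospital)] + rnorm(n.hosp * n.per, sd = 3)
## Q1 is a population-level (fixed) quantity:
mean(stay)
## Q2 is a hospital-level quantity, i.e. a question about the random effects:
sort(tapply(stay, hospital, mean), decreasing = TRUE)[1]
```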
Orthodont Example
data(Orthodont, package = "nlme")
names(Orthodont)
[1] "distance" "age" "Subject" "Sex"
head(Orthodont)
Grouped Data: distance ~ age | Subject
distance age Subject Sex
1 26 8 M01 Male
2 25 10 M01 Male
3 29 12 M01 Male
4 31 14 M01 Male
5 22 8 M02 Male
6 22 10 M02 Male
Orthodont Example
[Figure: lattice xyplot of distance against age for each Subject, in two panels (Male, Female).]
This graph was produced with xyplot() from the lattice package [6]. Similar graphs can be produced with the ggplot2 package [7].
Orthodont Example
Aims: model the growth of the measured distance² and inspect gender differences. For more information about the data set, type help(Orthodont) in R.
Response variable: distance
Fixed: Sex and age
Random: Subject
We are interested in growth (i.e. age) and gender (i.e. Sex). We must consider the person (Subject), but this is not our main interest.
² Between the pituitary gland and the pterygomaxillary fissure [mm].
Can we drop “Subject” from the analysis?
Given that we are not interested in “Subject”, can we simply drop it from the analysis? (Motivate your answer.)
[Figure: the same xyplot of distance against age, panelled by Male/Female.]
Can we drop “Subject” from the analysis?
Let’s fit a simple linear model:
lm.Ort <- lm(distance ~ age * Sex, data = Orthodont)
summary(lm.Ort)
Call:
lm(formula = distance ~ age * Sex, data = Orthodont)
Residuals:
Min 1Q Median 3Q Max
-5.616 -1.322 -0.168 1.330 5.247
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.341 1.416 11.54 < 2e-16 ***
age 0.784 0.126 6.22 1.1e-08 ***
SexFemale 1.032 2.219 0.47 0.64
age:SexFemale -0.305 0.198 -1.54 0.13
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 2.3 on 104 degrees of freedom
Multiple R-squared: 0.423, Adjusted R-squared: 0.406
F-statistic: 25.4 on 3 and 104 DF, p-value: 2.11e-12
Is that fine? Let’s check the model assumptions.
Can we drop “Subject” from the analysis?
par(mfrow = c(2, 2), oma = c(0, 0, 0, 0), mar = c(2, 4, 2, 1))
plot(lm.Ort)
[Figure: standard lm diagnostic plots for lm.Ort: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook’s distance; observations 35, 39 and 49 are flagged.]
The main model assumptions look good! What about the independence of the observations?
Can we drop “Subject” from the analysis?
[Figure: boxplots of residuals(lm.Ort) grouped by Subject, ordered by median residual.]
There is a clear “Subject” effect! Omitting it from the analysis is incorrect.
Can we drop “Subject” from the analysis?
The previous graph clearly showed that if we don’t consider “Subject” in our analysis, we cannot assume that the observations are independent.
Thus, the inference procedure (i.e. computing p-values and estimating confidence intervals) is not valid when subjects are omitted³.
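A numeric counterpart to the residuals-by-subject graph (a sketch; the slides make this point graphically): per-subject mean residuals that are systematically far from zero indicate within-subject dependence.

```r
## Per-subject mean residuals from the lm fit (assumes the nlme package is installed)
data(Orthodont, package = "nlme")
lm.Ort <- lm(distance ~ age * Sex, data = Orthodont)
subj.mean.res <- tapply(residuals(lm.Ort), Orthodont$Subject, mean)
round(sort(subj.mean.res), 2)   # systematic per-subject shifts -> dependence
```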
Design variables should always be included in the model!
³ Although the estimated regression coefficients are still unbiased.
Section 2
Designs
Experiments can have several random effects:
Agricultural study: plots and subplots; response variable = plant height
Educational study: schools, classes and pupils; response variable = grade
Fish reproduction study: mothers and fathers; response variable = larvae survival
Experiments can have several random effects, and things can become complex/complicated:
Educational study (II): schools, teachers and pupils; response variable = grade
Pedomorphological study: sites; response variables = [Zn] and [Cu]
Survival study: computer processors monitored over time; response variable = working/broken
Note: these are the 3 most common situations that yield “grouped data”.
Warning: not all studies that have “grouped data” are best analysed within the framework of Mixed Effects Models.
Section 3
Get to know data
When dealing with Mixed Models, it is essential to understand the data structure.
There are 2 fundamental steps to carry out before fitting models:
summaries (str(), head(), table(), ...)
graphics
The structure of the data determines what questions you can answer.
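As a sketch, the summary step can look like this for the Orthodont data (assumes the nlme package is installed):

```r
## Quick structural checks on grouped data, using the Orthodont example
data(Orthodont, package = "nlme")
str(Orthodont, give.attr = FALSE)   # variables and their types
table(Orthodont$Subject)            # measurements per subject (balanced design)
with(Orthodont, table(Sex, age))    # who was measured at which ages
```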
See MixedModels Lab.pdf
Section 4
Fitting a Mixed Effects Model with lme4
lme4
Currently, the most powerful way to fit Mixed Models in R is the lme4 package [1].
To fit a Linear Mixed Effects Model we use the function lmer().
There are 2 arguments that need to be specified: formula⁴ and data.
mod.0 <- lmer(distance ~ age * Sex + (1 | Subject),
data = Orthodont)
⁴ Which must contain at least one random effect.
Models can contain several random effects:
mem.0 <- lmer(grade ~ subject + (1 | School) + (1 | Class) + (1 | Student),
data = MathGrades)
##
fish.0 <- lmer(weight ~ days + (1 | Mother) + (1 | Father) + (1 | larvae),
data = FishDeveloppement)
We don’t need to specify whether random effects are nested or crossed.
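A hypothetical sketch of why: lme4 reads nesting from the factor coding itself. If class labels are unique across schools, (1 | School) + (1 | Class) is implicitly nested; if labels were reused across schools, one would write (1 | School/Class) or relabel the classes.

```r
## Invented toy coding: class labels unique across schools -> implicitly nested
school <- factor(rep(c("S1", "S2"), each = 4))
class  <- factor(rep(c("a", "b", "c", "d"), each = 2))
table(school, class)   # each class occurs in exactly one school
```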
Section 5
Checking the model assumptions
Model assumptions
There are two classes of assumptions:
Structural assumptions (model equation)
Distributional assumptions
All assumptions are assessed simultaneously using graphs.
Structural assumptions
When fitting mod.0, we assumed growth to be linear and allowed the two gender groups to have different slopes. We also assumed that growth is the same within each gender group; in other words, that the regression lines for the subjects run parallel to their gender line (i.e. a simple random intercept).
If we instead assume that growth (the slope) is the same in both groups, we would fit the model:
mem.1 <- lmer(distance ~ age + Sex + (1 | Subject),
data = Orthodont)
The graphical analysis suggested quite strongly that the two groups have different slopes.
Distributional assumptions
distance_{i,j,k} = β_i^{(0)} + β_i^{(1)} · age_{i,j,k} + b_j + ε_{i,j,k}
i = 1, ..., n_sex;  j = 1, ..., n_sub;  k = 1, ..., N
ε_{i,j,k} ∼ N(0, σ²_ε)
b_j ∼ N(0, σ²_b)
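These assumptions can be made tangible by simulating from the model (a sketch with invented parameter values, single-group version for brevity):

```r
## Simulate from the random-intercept model (invented sigma_b, sigma_eps, betas)
set.seed(42)
n.sub <- 27; ages <- c(8, 10, 12, 14)
b    <- rnorm(n.sub, 0, 1.8)                    # b_j ~ N(0, sigma_b^2)
eps  <- rnorm(n.sub * length(ages), 0, 1.3)     # eps ~ N(0, sigma_eps^2)
subj <- rep(seq_len(n.sub), each = length(ages))
age  <- rep(ages, times = n.sub)
distance <- 17 + 0.66 * age + b[subj] + eps     # beta0 + beta1 * age + b_j + eps
## all observations of one subject share its b_j, hence are correlated
```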
See MixedModels Lab.pdf
The fundamental graphics to assess model assumptions are:
Tukey-Anscombe plot
Scale-location plot
Quantile-Quantile plots
Residuals against predictor plots
Further graphics (if required)
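As a sketch, the first three plots can be produced by hand from any fitted model; resid() and fitted() work for lmer fits just as for lm (an lm stand-in is used here so the code runs without lme4):

```r
## Hand-made Tukey-Anscombe and QQ plots (lm stand-in; same calls work for lmer fits)
data(Orthodont, package = "nlme")
mod <- lm(distance ~ age * Sex, data = Orthodont)
plot(fitted(mod), resid(mod), main = "Tukey-Anscombe plot")
abline(h = 0, lty = 2)
qqnorm(resid(mod)); qqline(resid(mod))
```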
Section 6
Inference procedure
p-values and confidence intervals can be computed with the functions anova(), drop1() and confint().
Fixed effects are usually tested with Likelihood Ratio Tests.
For random effects, we prefer to estimate confidence intervals via likelihood profiling (profile()).
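To illustrate the likelihood-ratio idea, here is a sketch with nested lm fits so the code runs without lme4 (with lmer, the same comparison is done by calling anova() on two fitted models):

```r
## Likelihood-ratio test between two nested models (lm stand-in for illustration)
data(Orthodont, package = "nlme")
m0 <- lm(distance ~ age + Sex, data = Orthodont)
m1 <- lm(distance ~ age * Sex, data = Orthodont)
lrt <- as.numeric(2 * (logLik(m1) - logLik(m0)))
pchisq(lrt, df = 1, lower.tail = FALSE)   # p-value for the interaction term
```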
See MixedModels Lab.pdf
Section 7
Concluding remarks
Take home message
Mixed Effects Models are very powerful methods, well suited to analyse a wide variety of modern experiments
They are relatively easy to use
Complex experiments/studies require the use of appropriate visualisation techniques
There is some debate on how to perform inference. However, remember that statistical significance does not imply practical/biological significance.
lme4 is one of the most up-to-date software packages for analysing these experiments. It does not implement *everything*, for good reasons
Extensions to LMMs and GLMMs exist (e.g. mgcv)
Further topics (not covered here):
Generalised Mixed Models (GLMMs; glmer(..., family = ...))
Generalised Additive Mixed Models (GAMMs; mgcv)
R2 and stepwise model selection (Don’t use them with LMMs!)
Partial residual plots
Dynamic documents (knitr and RMarkdown)
Getting advice (Internet, consulting offices)
Literature (see next slide)
...
Literature
lme4: Mixed-effects modeling with R [2] (short!)
Mixed Effects Models in S and S-plus [4]
Extending the linear model with R: generalized linear, mixed effects and nonparametric regression models [3]
Generalized additive models: an introduction with R [8]
Lattice: multivariate data visualization with R [5]
Bibliography I
Douglas Bates et al. lme4: Linear Mixed-Effects Models using ’Eigen’ and S4. R package version 1.1-18-1. 2018. URL: https://CRAN.R-project.org/package=lme4.
Douglas M. Bates. “lme4: Mixed-effects modeling with R”. 2010. URL: http://lme4.r-forge.r-project.org/book.
Julian J. Faraway. Extending the linear model with R: generalized linear, mixed effects and nonparametric regression models. CRC Press, 2005.
Jose Pinheiro and Douglas Bates. Mixed-effects models in S and S-PLUS. Springer Science & Business Media, 2006.
Deepayan Sarkar. Lattice: multivariate data visualization with R. Springer Science & Business Media, 2008.
Bibliography II
Deepayan Sarkar. lattice: Trellis Graphics for R. R package version 0.20-35. 2017. URL: https://CRAN.R-project.org/package=lattice.
Hadley Wickham et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.0.0. 2018. URL: https://CRAN.R-project.org/package=ggplot2.
Simon Wood. Generalized additive models: an introduction with R. CRC Press, 2006.