
Summary of postgraduate training week material on mixed models

Peter Craig (March 26th 2013)

    1 Motivation

We'll use two motivating examples based on datasets which are part of the lme4 package for R: Dyestuff and sleepstudy.

    Load needed packages:

    library(lme4)

    library(ggplot2)

Dyestuff is a one-way layout with response Yield being measured for 5 replicates from each of 6 Batches:

dye.plot = qplot(
  Batch, Yield, data=Dyestuff,
  geom="jitter",
  position=position_jitter(width=.1, height=0)
)
dye.plot

[Figure: jittered plot of Yield (approx. 1450-1600) against Batch (A-F)]

Sleepstudy records the Reaction times over 10 days of 18 Subjects who underwent a sleep deprivation regime:

sleep.plot = qplot(Days, Reaction, data=sleepstudy) +
  facet_wrap(~Subject, ncol=6) +
  stat_smooth(method = "lm") +
  xlab("Days of sleep deprivation") +
  ylab("Average reaction time (ms)")
sleep.plot

[Figure: Average reaction time (ms, approx. 200-500) against Days of sleep deprivation (0-9), one panel per Subject (308-372), each panel with a fitted least-squares line]

In both datasets, we have covariates whose values are naturally part of a larger "population" (batches and subjects respectively). It would seem natural to use models which incorporate that knowledge.

The sleep study example is more interesting but more complicated, and so I will focus initially on the Dyestuff data, which is an example of a single-factor experimental design (one-way analysis of variance).

As well as the examples, we have an additional motivation for considering random effects from the generalised linear models strand of the training week: what should we really do about over-dispersion?

    2 One-way random effects analysis of variance

Here are some questions we might want to answer in the context of the Dyestuff data:


Q1 How much do batch differences contribute to overall variation in yield?

Q2 What is the true mean yield for one of the sampled batches?

Q3 What can we say about an unsampled batch?

Q4 What is the overall mean yield across all batches (not just those sampled)?

First let's look at this using the conventional linear model (traditional one-way analysis of variance).

    The ‘fixed-effects’ model is:

$$y_{ij} = \mu + \alpha_i + \epsilon_{ij} \qquad \text{for } i = 1, \dots, M \text{ and } j = 1, \dots, N$$

where $\epsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2)$. By fixing the number of observations to be the same $N$ in each group, we have a balanced design which simplifies some of the mathematical description below.

The fixed-effects model is over-parameterised: we need to add a constraint. The standard constraint in R is $\alpha_1 = 0$, but an alternative parameterisation which matches the later random-effects model better is $\sum_i \alpha_i = 0$, which we will use.

Regardless of the constraint imposed, $\mu + \alpha_i$ is the true mean yield for batch $i$ and the estimates satisfy $\hat\mu + \hat\alpha_i = \bar{y}_i$. For the sum-to-zero constraint, $\hat\mu = \bar{y}$ and $\hat\alpha_i = \bar{y}_i - \bar{y}$.

Fitting the model in R and summarising, we get

options(contrasts = c("contr.sum", "contr.sum"))
dmodel.f = lm(Yield ~ Batch, Dyestuff)
summary(dmodel.f)

##
## Call:
## lm(formula = Yield ~ Batch, data = Dyestuff)
##
## Residuals:
##   Min    1Q Median    3Q   Max
## -85.0 -33.0    3.0  31.8  97.0
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1527.50       9.04  168.98
## [Batch coefficient rows and p-value column lost in extraction]
##
## Residual standard error: 49.5 on 24 degrees of freedom
## Multiple R-squared: 0.489, Adjusted R-squared: 0.383
## F-statistic: 4.6 on 5 and 24 DF, p-value: 0.0044

anova(dmodel.f)

## Analysis of Variance Table
##
## Response: Yield
##           Df Sum Sq Mean Sq F value Pr(>F)
## Batch      5  56358   11272     4.6 0.0044 **
## Residuals 24  58830    2451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of variance gives us an answer to Q1 by looking at the sums of squares: the Batch sum of squares is just a little smaller than the residual sum of squares, and so Batch explains just less than 50% of the variation in Yield.

The summary gives us an answer to Q2: the batch A mean is estimated at 1527.5 - 22.5 = 1505, batch B at 1527.5 + 0.50 = 1528, etc.
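These estimates can be read straight off the group means; a quick sketch of the check (not in the original notes):

with(Dyestuff, tapply(Yield, Batch, mean))  # mu-hat + alpha-hat_i for each batch, e.g. A = 1505, B = 1528
mean(Dyestuff$Yield)                        # mu-hat = 1527.5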

The fixed-effects analysis gives no answer to Q3 other than an estimate of within-batch standard deviation at 49.5, which applies to all batches.

The fixed-effects analysis gives us an estimate of $\mu$ at 1527.5 with standard error 9. The estimate would seem to be an answer to Q4, but the standard error is not the correct measure of uncertainty for Q4 because $\mu$ is not the overall mean yield for all batches; it is the true overall mean yield for these six batches (since $\sum_{i=1}^{6} \alpha_i = 0$).

However, the batches are clearly a sample from a larger "population" of batches and it seems natural to model inter-batch variation randomly.

To make the model into a 'random-effects' model: remove the constraint on the $\alpha_i$ and instead add a model for their variability: $\alpha_i \overset{\text{iid}}{\sim} N(0, \psi^2)$.

    2.1 Inference

    2.1.1 Adjusted sums of squares

Originally, people adjusted sums of squares in the ANOVA table, but that methodology led to divisions over the right approach in complex designs and has largely disappeared.

Here's the idea: if we look at the standard one-way analysis of variance table, it reads

             Sum of squares                        Expected sum of squares
Batch        $N \sum_i (\bar{y}_i - \bar{y})^2$    $(M-1)N[\psi^2 + \sigma^2/N]$
Residuals    $(N-1) \sum_i s_i^2$                  $M(N-1)\sigma^2$

The expected sums of squares are calculated using the random-effects model.


We can see that the residual sum of squares leads easily to an unbiased estimate of $\sigma^2$, whereas the batch sum of squares needs to be adjusted by subtracting an appropriate multiple of the residual sum of squares in order to lead to an unbiased estimator of $\psi^2$. The resulting estimator can be negative! This reveals an important feature of the model: it is possible empirically for the batch sum of squares to be in some sense "too small", and that is known to cause difficulties.
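As a concrete sketch of the adjustment (using the ANOVA table computed above, with M = 6 batches and N = 5 replicates; the object names are illustrative):

M = 6; N = 5
aov.tab = anova(dmodel.f)
sigma2.mom = aov.tab["Residuals", "Mean Sq"]             # unbiased for sigma^2
psi2.mom = (aov.tab["Batch", "Mean Sq"] - sigma2.mom)/N  # adjusted; can be negative
c(sigma2 = sigma2.mom, psi2 = psi2.mom)                  # approx. 2451 and 1764

For this balanced design, these moment estimates coincide with the REML estimates reported later.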

    2.1.2 Maximum likelihood (ML)

The modern approach to estimation is all based on the observation that, given values for $\mu$, $\psi$ and $\sigma$, the data $y$ have a multivariate normal distribution with block-diagonal covariance matrix

$$y \mid \mu, \psi, \sigma \sim N_{MN}(\mu \mathbf{1}, \Sigma) \quad \text{where} \quad \Sigma = \begin{pmatrix} \Sigma_g & & 0 \\ & \ddots & \\ 0 & & \Sigma_g \end{pmatrix}$$

and

$$\Sigma_g = \begin{pmatrix} \psi^2 + \sigma^2 & \psi^2 & \dots & \psi^2 \\ \psi^2 & \psi^2 + \sigma^2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \psi^2 \\ \psi^2 & \dots & \psi^2 & \psi^2 + \sigma^2 \end{pmatrix} = \psi^2 J_N + \sigma^2 I_N$$

where $I_N$ is the $N \times N$ identity matrix and $J_N$ the corresponding $N \times N$ matrix of all 1s. This immediately gives us a likelihood

$$L(\mu, \psi, \sigma) = p(y \mid \mu, \psi, \sigma)$$

and we could then proceed by maximum likelihood.

    2.1.3 Restricted maximum likelihood (REML)

However, ML is only asymptotically unbiased and efficient, and often leads to badly biased estimates of variance parameters.

The common alternative is restricted maximum likelihood. The restricted likelihood function $L(\psi, \sigma)$ can be derived by two different methods which give the same answer:

1. When estimating the variance parameters $\psi$ and $\sigma$, take as data a maximal collection of contrasts (linear combinations of elements of $y$) which don't involve $\mu$: for example $y_{ij} - \bar{y}_i$, $\bar{y}_i - \bar{y}$. Use the joint distribution of the contrasts to define the restricted likelihood for $\psi$ and $\sigma$.


2. Integrate the original likelihood $L(\mu, \psi, \sigma)$ with respect to $\mu$ to obtain a marginal likelihood:

$$L(\psi, \sigma) = \int L(\mu, \psi, \sigma) \, d\mu$$

An alternative interpretation of this is that REML is actually Bayesian maximum a posteriori estimation using a flat prior on all three parameters and first integrating out $\mu$ before doing the maximisation.

Both ML and REML are implemented for a wide range of random-effects models by the lmer function from the lme4 package for R.

    dmodel.r = lmer(Yield ~ 1 | Batch, data = Dyestuff)

Because R automatically adds an 'intercept' when it is missing, the model formula is equivalent to Yield ~ 1 + 1|Batch. In the longer version of the formula, there are two components: 1 specifies a fixed-effect intercept, and 1|Batch means add a random constant which depends on Batch, those constants being drawn from a normal distribution with mean zero.
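One can check the equivalence directly; a small sketch (dmodel.r2 is an illustrative name, and the random-effects term is parenthesised here, which is the same formula written more defensively):

dmodel.r2 = lmer(Yield ~ 1 + (1 | Batch), data = Dyestuff)
all.equal(fixef(dmodel.r), fixef(dmodel.r2))  # TRUE: identical fits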

    The result of fitting the model is:

dmodel.r

## Linear mixed model fit by REML
## Formula: Yield ~ 1 | Batch
##    Data: Dyestuff
##  AIC BIC logLik deviance REMLdev
##  326 330   -160      327     320
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Batch    (Intercept) 1764     42.0
##  Residual             2451     49.5
## Number of obs: 30, groups: Batch, 6
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   1527.5       19.4    78.8

Here we see that the estimate of $\psi$ is 42.0 and the estimate of $\sigma$ is 49.5. Looking at the variance estimates, the estimate for $\psi^2$ is about 25% smaller than the estimate for $\sigma^2$; compare this to the relative sizes of the sums of squares for the fixed-effects model. The reason the answer is different is that the random-effects analysis builds in the equivalent of adjusting sums of squares.

In principle, the random-effects analysis gives a better answer to Q1. However, the answer is based on the assumption/model of a normal distribution for batch mean yields. If the extra assumptions in random-effects models are badly violated, we may actually get better answers from fixed-effects models.


2.1.4 Bayesian approaches

REML can be seen as pseudo-Bayes. Fixed-effects modelling can be seen as Bayes with an independent flat prior on each random effect. To do Bayes properly for the random-effects model needs a little care in relation to the prior for $\psi$.

One might think that, since $p(\sigma) \propto 1/\sigma$ is the Jeffreys prior for the variance in normal sampling, $p(\psi) \propto 1/\psi$ would be a natural lazy prior. However, that leads to disaster as the posterior is then improper. Similarly, the $1/\psi^2 \sim \Gamma(\epsilon, \epsilon)$ prior originally proposed for WinBUGS examples leads to bad MCMC properties since the posterior is "nearly improper" (logically this is nonsense but somehow is a useful description). A uniform improper prior on $\psi$ is now quite popular, but see Gelman (2006) for more on this and other possibilities.

WinBUGS and Stan (the Gelman group's new Hamiltonian MCMC software) can both fit this model easily. In R, we can access Stan using RStan and there is also the very fine MCMCglmm package.
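As an illustration only (this sketch is not from the original notes), the one-way model could be fitted with MCMCglmm using a parameter-expanded prior for the batch variance, of the heavy-tailed kind Gelman (2006) discusses; the particular prior settings below are assumptions, not recommendations:

library(MCMCglmm)
prior = list(
    G = list(G1 = list(V = 1, nu = 1, alpha.mu = 0, alpha.V = 25^2)),  # heavy-tailed prior on psi
    R = list(V = 1, nu = 0.002)
)
dmodel.b = MCMCglmm(Yield ~ 1, random = ~Batch, data = Dyestuff,
    prior = prior, verbose = FALSE)
summary(dmodel.b$VCV)  # posterior samples of psi^2 (Batch) and sigma^2 (units)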

    2.2 More about REML

Once we have estimates for $\psi$ and $\sigma$, we have a linear model for $y$ but with correlated errors: putting $\alpha_i$ and $\epsilon_{ij}$ together into a single "error" term. Hence we can estimate $\mu$ by BLUE (best linear unbiased estimation), which is equivalent to generalised least squares (GLS). This also gives us a standard error for $\hat\mu$. In fact, for the balanced model, it is still the case that $\hat\mu = \bar{y}$ and the square of the standard error is simply $(\hat\psi^2 + \hat\sigma^2/N)/M$.

Looking again at the output for dmodel.r, we see that the estimate of $\mu$ is 1527.5 with standard error 19.4. The larger standard error for the random-effects model reflects the fact that $\mu$ is now the true mean yield across all batches.
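We can check the formula numerically against the lmer output using the rounded REML estimates (a quick sketch):

(1764 + 2451/5)/6        # squared standard error of mu-hat, approx. 375.7
sqrt((1764 + 2451/5)/6)  # approx. 19.4, matching the output above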

Similarly, we can estimate the random effect $\alpha_i$ by BLUP (best linear unbiased prediction): we seek the linear combination of $y$ which satisfies $E[c^T y - \alpha_i] = 0$ and which minimises $E[(c^T y - \alpha_i)^2]$. This turns out to be

$$\tilde\alpha_i = \left( \frac{\psi^2}{\psi^2 + \sigma^2/N} \right) (\bar{y}_i - \bar{y})$$

which exhibits "shrinkage" compared to the fixed-effects least squares estimate $\bar{y}_i - \bar{y}$. The ratio $\psi^2/(\psi^2 + \sigma^2/N)$ is known as the shrinkage factor. Shrinkage is easily understood once we realise that the expected sample variance of the collection $\bar{y}_1, \dots, \bar{y}_M$ is greater than $\psi^2$: consequently we expect (probabilistically) that $\bar{y}_i$ lies further from $\mu$ than the true batch mean $\mu + \alpha_i$ does.

Note that $\bar{y} + \tilde\alpha_i$ is the answer to Q2 using the random-effects model and differs from the answer from the fixed-effects model. There is an intriguing literature which argues that even anti-Bayesian frequentists are often better off using Bayesian estimates, because shrinkage means that the latter often have better coverage properties than the apparently obvious corresponding frequentist estimates.

The estimates of the $\alpha_i$ are obtained in R for the example by


ranef(dmodel.r)

## $Batch
##   (Intercept)
## A    -17.6080
## B      0.3913
## C     28.5641
## D    -23.0860
## E     56.7369
## F    -44.9982

We can examine the shrinkage phenomenon empirically by computing the fitted values:

plot(fitted(dmodel.f), fitted(dmodel.r))
abline(0, 1)
mean(Dyestuff$Yield)

## [1] 1527.5

points(1527.5, 1527.5, pch = 19, col = 2)

[Figure: fitted(dmodel.r) against fitted(dmodel.f); the six groups of points lie on a line of slope less than 1 through the red point at (1527.5, 1527.5)]

The pairs all lie on a straight line, but not the line where the coordinates are equal. The line does go through $(\bar{y}, \bar{y})$, but with what slope?

(fitted(dmodel.r) - mean(Dyestuff$Yield))/(fitted(dmodel.f) - mean(Dyestuff$Yield))

##      1      2      3      4      5      6      7      8
## 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826
##      9     10     11     12     13     14     15     16
## 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826
##     17     18     19     20     21     22     23     24
## 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826
##     25     26     27     28     29     30
## 0.7826 0.7826 0.7826 0.7826 0.7826 0.7826

That this slope is $\hat\psi^2/(\hat\psi^2 + \hat\sigma^2/N)$ can be verified using the output of dmodel.r:

1764/(1764 + 2451.3/5)

## [1] 0.7825

    3 General mixed model formulation

The general form of a linear mixed model is

$$y = X\beta + Zb + \epsilon$$

where $X\beta$ are fixed effects just like in the standard linear model, $Zb$ are one or more components/terms of random effects with $b \sim N(0, \Sigma(\Omega))$, and $\epsilon \sim N(0, \sigma^2 I)$ is the measurement/error/replication term. Here the covariance matrix for the random effects is parameterised by one or more parameters collectively denoted $\Omega$.

When there is more than one random-effect component, components are often nested but need not always be so.

For the one-way model, $\Omega = (\psi)$ and

$$\beta = (\mu) \quad X = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \quad b = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_M \end{pmatrix} \quad Z = \begin{pmatrix} 1 & 0 & \dots & 0 \\ \vdots & & & \\ 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & & & \\ 0 & 1 & \dots & 0 \\ & & \ddots & \\ 0 & 0 & \dots & 1 \\ \vdots & & & \\ 0 & 0 & \dots & 1 \end{pmatrix} \quad \Sigma = \psi^2 I_M$$


For the sleep study data, a natural model is

$$y_{it} = \mu + \beta t + \alpha_i + \gamma_i t + \epsilon_{it}$$

where $i$ indexes subjects and $t$ indexes days.

The simplest version of this model is to take $\alpha_i \overset{\text{iid}}{\sim} N(0, \sigma_\alpha^2)$ and independently $\gamma_i \overset{\text{iid}}{\sim} N(0, \sigma_\gamma^2)$. Then $\Omega = (\sigma_\alpha, \sigma_\gamma)$,

$$\beta = \begin{pmatrix} \mu \\ \beta \end{pmatrix} \quad X = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & 9 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & 9 \end{pmatrix} \quad b = \begin{pmatrix} \alpha_1 \\ \gamma_1 \\ \vdots \\ \alpha_{18} \\ \gamma_{18} \end{pmatrix} \quad Z = \begin{pmatrix} 1 & 0 & 0 & 0 & \dots & 0 & 0 \\ 1 & 1 & 0 & 0 & \dots & 0 & 0 \\ \vdots & & & & & & \\ 1 & 9 & 0 & 0 & \dots & 0 & 0 \\ 0 & 0 & 1 & 0 & \dots & 0 & 0 \\ 0 & 0 & 1 & 1 & \dots & 0 & 0 \\ \vdots & & & & & & \\ 0 & 0 & 1 & 9 & \dots & 0 & 0 \\ 0 & 0 & 0 & 0 & \dots & 1 & 0 \\ 0 & 0 & 0 & 0 & \dots & 1 & 1 \\ \vdots & & & & & & \\ 0 & 0 & 0 & 0 & \dots & 1 & 9 \end{pmatrix}$$

and $\Sigma(\Omega)$ is the diagonal matrix with diagonal entries $(\sigma_\alpha^2, \sigma_\gamma^2, \sigma_\alpha^2, \sigma_\gamma^2, \dots, \sigma_\alpha^2, \sigma_\gamma^2)$.

    This model can be fitted in R by

slmodel3 = lmer(Reaction ~ Days + (1 | Subject) + (0 + Days | Subject), sleepstudy)
slmodel3

## Linear mixed model fit by REML
## Formula: Reaction ~ Days + (1 | Subject) + (0 + Days | Subject)
##    Data: sleepstudy
##  AIC  BIC logLik deviance REMLdev
## 1754 1770   -872     1752    1744
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 627.6    25.05
##  Subject  Days         35.9     5.99
##  Residual             653.6    25.57
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   251.41       6.89    36.5
## Days           10.47       1.56     6.7
##
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.184


A slightly more complicated version of the model introduces correlation between $\alpha_i$ and $\gamma_i$, so that $\Omega = (\sigma_\alpha, \sigma_\gamma, \rho)$ and $\Sigma(\Omega)$ is block-diagonal with each block being the $2 \times 2$ matrix

$$\begin{pmatrix} \sigma_\alpha^2 & \rho \sigma_\alpha \sigma_\gamma \\ \rho \sigma_\alpha \sigma_\gamma & \sigma_\gamma^2 \end{pmatrix}$$

In R, we do

slmodel1 = lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
slmodel1

## Linear mixed model fit by REML
## Formula: Reaction ~ Days + (Days | Subject)
##    Data: sleepstudy
##  AIC  BIC logLik deviance REMLdev
## 1756 1775   -872     1752    1744
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 612.1    24.74
##           Days         35.1     5.92    0.066
##  Residual             654.9    25.59
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   251.41       6.82    36.8
## Days           10.47       1.55     6.8
##
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

The difference between the two formulae expressing the model takes a little effort to understand. Because of R's addition of intercept terms, the formula for slmodel1 is equivalent to Reaction ~ 1+Days + (1+Days|Subject), which specifies a fixed-effect linear regression on Days and then random-effect changes to the linear regression dependent on Subject. In the formula for slmodel3, the (1|Subject) specifies random intercepts and (0+Days|Subject) specifies (independently) random slopes: the purpose of 0+ is to prevent R's default behaviour of adding the apparently missing intercept term to that part of the formula. R still adds the fixed-effect intercept, and slmodel3 again specifies a fixed-effect regression with random-effect changes to the intercept and (independently this time) random-effect changes to the slope. If we compare the outputs of fitting the two models, we see that slmodel1 includes a correlation between the random-effect changes for intercept and slope.
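The claimed equivalence for slmodel1 can be checked directly (a small sketch; slmodel1b is an illustrative name):

slmodel1b = lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy)
all.equal(fixef(slmodel1), fixef(slmodel1b))  # TRUE: the same model refitted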


3.1 Inference

Generalising from the one-way case, we obtain a multivariate normal model for $y$:

$$y \mid \beta, \Omega, \sigma \sim N(X\beta, Z\Sigma(\Omega)Z^T + \sigma^2 I)$$

which gives us a likelihood function $L(\beta, \Omega, \sigma)$.

Again, either by integrating out $\beta$ or by taking as data a maximal set of contrasts which are free of $\beta$, we arrive at the REML likelihood $L(\Omega, \sigma)$, which can be maximised numerically.

Given $\hat\Omega$ and $\hat\sigma$, and writing $\hat\Sigma = \Sigma(\hat\Omega)$, we can

• use GLS/BLUE (see appendix) to obtain estimates of the fixed effects $\tilde\beta$:

$$\tilde\beta = (X^T W^{-1} X)^{-1} X^T W^{-1} y \quad \text{where } W = Z \hat\Sigma Z^T + \hat\sigma^2 I$$

• use BLUP (see appendix) to obtain estimates of the random effects:

$$\tilde{b} = \hat\Sigma Z^T W^{-1} (y - X\tilde\beta)$$
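As a concrete illustration (a sketch, not part of the original notes), both formulae can be applied by hand to the one-way Dyestuff model, using the rounded REML estimates from the earlier output, and compared with lmer's results:

psi2 = 1764; sigma2 = 2451                 # rounded REML estimates of psi^2, sigma^2
y = Dyestuff$Yield
X = matrix(1, nrow = 30, ncol = 1)         # fixed-effects design: intercept only
Z = model.matrix(~ 0 + Batch, Dyestuff)    # one indicator column per batch
W = psi2 * Z %*% t(Z) + sigma2 * diag(30)  # W = Z Sigma-hat Z^T + sigma-hat^2 I
beta.tilde = solve(t(X) %*% solve(W) %*% X, t(X) %*% solve(W) %*% y)
b.tilde = psi2 * t(Z) %*% solve(W) %*% (y - X %*% beta.tilde)  # Sigma-hat Z^T = psi^2 Z^T here
beta.tilde  # approx. 1527.5, matching the fixed-effect estimate
b.tilde     # approx. ranef(dmodel.r)$Batch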

Taking slmodel1 as the model for now, we see that there are fixed-effects estimates of intercept and slope, which are $\tilde\mu = 251.4$ and $\tilde\beta = 10.47$.

We can also look at the random effects to see that we do get random intercept and slope terms per subject:

ranef(slmodel1)

## $Subject
##     (Intercept)     Days
## 308      2.2572   9.1993
## 309    -40.3984  -8.6211
## 310    -38.9605  -5.4502
## 330     23.6919  -4.8137
## 331     22.2613  -3.0693
## 332      9.0398  -0.2719
## 333     16.8410  -0.2231
## 334     -7.2330   1.0744
## 335     -0.3320 -10.7524
## 337     34.8900   8.6296
## 349    -25.2110   1.1727
## 350    -13.0714   6.6140
## 351      4.5784  -3.0152
## 352     20.8636   3.5367
## 369      3.2754   0.8723
## 370    -25.6144   4.8218
## 371      0.8072  -0.9882
## 372     12.3147   1.2844


It's interesting to compare these to the ordinary regression lines shown on the original plot of the data:

sleep.ranef.linedata = coef(slmodel1)$Subject
names(sleep.ranef.linedata) = c("a", "b")
sleep.ranef.linedata$Subject = row.names(sleep.ranef.linedata)
sleep.plot +
  geom_abline(
    aes(intercept=a, slope=b),
    data=sleep.ranef.linedata,
    color="red"
  )

[Figure: the per-Subject data panels as before, with each subject's mixed-model line (red) overlaid on the ordinary regression fit; axes: Days of sleep deprivation (0-9), Average reaction time (ms, 200-500)]

Here, the coef function applied to a model object produced by lmer returns the fixed and random effects added together.
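A quick sketch confirming this:

head(coef(slmodel1)$Subject[, "(Intercept)"])
head(fixef(slmodel1)["(Intercept)"] + ranef(slmodel1)$Subject[, "(Intercept)"])  # identical values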

The shrinkage of the regression slopes is visible in many cases. However, there are exceptions; exactly why is not clear to me, although it may be due to the shrinkage of the intercepts. I would need to investigate further mathematically to understand better.


4 Diagnostics

Once we have $\tilde\beta$ and $\tilde{b}$, we can define residuals

$$e = y - X\tilde\beta - Z\tilde{b}$$

and fitted values

$$\tilde{y} = y - e$$

Like the linear model, there is value in:

• looking at model fit by plotting the residuals versus each covariate and versus the fitted values;

• looking at a quantile-plot of the residuals to assess normality of the residuals.

Unlike the linear model, there is value in looking at a quantile-plot of each group of random effects to assess their normality. There is also some potential to diagnose problems with the fixed-effects part of the model by:

• looking at normality of $\tilde{e} = y - X\tilde\beta = Z\tilde{b} + e$;

• plotting $\tilde{e}$ versus covariates;

although the components of $Zb + \epsilon$, the model equivalent of $\tilde{e}$, are not independent.
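A sketch of computing both kinds of residual for slmodel1 by hand (model.matrix rebuilds X; the object names are illustrative):

e = resid(slmodel1)                                    # e = y - X beta~ - Z b~
X = model.matrix(~ 1 + Days, sleepstudy)
e.tilde = sleepstudy$Reaction - X %*% fixef(slmodel1)  # e~ = y - X beta~ = Z b~ + e
qqnorm(e.tilde, main = "")                             # assess normality of e~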

    4.1 Dyestuff

    We can plot the residuals versus the Batch covariate:

    qplot(Dyestuff$Batch, resid(dmodel.r))

[Figure: residuals of dmodel.r (approx. -50 to 50) against Dyestuff$Batch (A-F)]

which looks pretty reasonable.

We can make a Gaussian quantile-quantile plot of the residuals for both the random-effects and fixed-effects models:

    qqnorm(resid(dmodel.r))

[Figure: normal Q-Q plot of resid(dmodel.r)]

    qqnorm(resid(dmodel.f))

[Figure: normal Q-Q plot of resid(dmodel.f)]

The former has a slightly worrying lower tail compared to the latter; however, there are only 30 data points.

    Finally, we can produce a quantile plot of the random effects:

    qqnorm(ranef(dmodel.r)$Batch[, 1])

[Figure: normal Q-Q plot of the estimated Batch random effects]

    but there are so few random effects that we learn little.

    4.2 Sleepstudy

    We should plot the residuals versus each covariate:


qplot(sleepstudy$Days, resid(slmodel1))

[Figure: residuals of slmodel1 (approx. -100 to 100) against sleepstudy$Days (0-9)]

    and

    qplot(sleepstudy$Subject, resid(slmodel1))

[Figure: residuals of slmodel1 (approx. -100 to 100) by sleepstudy$Subject (308-372)]

Both indicate the existence of a few clear outliers, and the former some evidence of a quadratic dependence on time.

We should actually plot the residuals against both covariates simultaneously (like the original data plot):

qplot(Days, resid(slmodel1), data = sleepstudy) +
  facet_wrap(~Subject, ncol = 6) +
  xlab("Days of sleep deprivation") +
  ylab("Residual")

[Figure: residuals (approx. -100 to 100) against Days of sleep deprivation (0-9), one panel per Subject (308-372)]

We see autocorrelation for many subjects but not consistently. There would be many ways to enhance the model to take this and the apparent quadratic dependence on time into account, but it is not clear that the effort would be worthwhile.


We should also make a quantile plot of the residuals and of each type of random effect:

    qqnorm(resid(slmodel1), main = "")

[Figure: normal Q-Q plot of resid(slmodel1), approx. -100 to 100]

    OK apart from the three outliers;

    qqnorm(ranef(slmodel1)$Subject[, 1], main = "")

[Figure: normal Q-Q plot of the Subject intercept random effects]

    no evidence of problems;

    qqnorm(ranef(slmodel1)$Subject[, 2], main = "")

[Figure: normal Q-Q plot of the Subject slope random effects]

    no evidence of problems.

    5 Model selection/comparison

When we entertain multiple models for the same data, we can use likelihood-ratio tests and quantities like AIC and BIC, but we need to be more aware of their limitations than for the linear model.

The general definition of AIC is $-2\ell + 2p$, where $\ell$ denotes the maximum of the log-likelihood, and BIC is $-2\ell + p \log n$.
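As a small check of these definitions against the lmer output for dmodel.r (whose reported AIC of 326 and BIC of 330 are based on the REML log-likelihood with p = 3 and n = 30), a sketch:

ll = logLik(dmodel.r)              # REML log-likelihood, about -160
-2 * as.numeric(ll) + 2 * 3        # AIC, approx. 326
-2 * as.numeric(ll) + 3 * log(30)  # BIC, approx. 330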


With REML, we can only use these methods to compare models having the same fixed-effects parameterisation.

With ML, we might in principle compare models with different fixed-effects parameterisations, or even random- and fixed-effects models. However, the parameter-counting issues for random-effects models make this quite questionable, and it is arguable that the decision between fixed and random effects is a fundamental modelling choice to be made by the modeller and not by semi-automatic criteria.

AIC and BIC for random-effects models are undermined by the difficulty in counting parameters: counting one parameter for a single variance component is believed to under-state the correct number of parameters in the asymptotic arguments leading to AIC/BIC. It is not clear how much under-statement there is, and in fact that is likely to depend on the relative values of the variance parameters.

    Consider the models for the dyestuff data:

• The fixed-effects model has 6 fixed-effects parameters and 1 variance parameter. In comparing linear models, it would be conventional to take $p = 6$; from an asymptotic likelihood perspective, one might take $p = 7$.

• For the random-effects model, there are apparently $p = 3$ parameters. But in reality, the effective number of parameters is somewhere between 3 and 7.

• At present, there is no neat resolution to this issue, although it is an active area of research. The best advice one can give (courtesy of Doug Bates) is to compute AIC/BIC for the random-effects model using $p = 3$ but to treat the resulting numbers with greater caution.

    5.1 Dyestuff

On the basis of the discussion above, we should not be attempting to use AIC/BIC to choose between the fixed- and random-effects models. It seems clear that there would be many possible batches and that a random-effects model is more likely to be appropriate.

However, suppose that we were to try it. The extractAIC function calculates AIC for us:

## AIC for dmodel.r and dmodel.f
extractAIC(dmodel.f)

## [1]   6.0 239.4

extractAIC(dmodel.r)

## [1]   3.0 325.7

## AIC for random-effects using maximum likelihood
dmodel.rML = lmer(Yield ~ 1 | Batch, data = Dyestuff, REML = FALSE)
extractAIC(dmodel.rML)

## [1]   3.0 333.3


We see that the AIC is apparently much lower for the fixed-effects model. However, this is not a correct comparison. The fixed-effects model was obtained using lm, and for both lm and glm, extractAIC bases the AIC calculation on the minimum deviance rather than the maximum log-likelihood; the difference is constant, which means that the two versions select the same model. To obtain a version for dmodel.f which is more comparable to the random-effects values, do

-2 * logLik(dmodel.f) + 2 * 6

## 'log Lik.' 324.6 (df=7)

The purpose of this example is just to point out differences in the definition of AIC and not to make any further statement about which model should be used for the dyestuff data.

    5.2 Sleepstudy

Looking at the earlier output (slmodel1 and slmodel3) for the sleepstudy data, we see that the AIC is slightly lower for slmodel3 than for slmodel1. The fit (as measured by the log-likelihood, ML or REML deviance) is virtually identical, but the extra correlation parameter penalises slmodel1. The fit is essentially the same because the correlation parameter estimate is close to zero.
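A formal version of this comparison is the likelihood-ratio test of the correlation parameter, which anova() carries out after refitting both models by ML (a sketch):

anova(slmodel3, slmodel1)  # one extra parameter; correlation estimate near zero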

For the sake of it, let us fit the same model omitting the slope random effects:

slmodel2 = lmer(Reaction ~ Days + (1 | Subject), sleepstudy)
slmodel2

## Linear mixed model fit by REML
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
##  AIC  BIC logLik deviance REMLdev
## 1794 1807   -893     1794    1786
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378     37.1
##  Residual              960     31.0
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      9.746    25.8
## Days          10.467      0.804    13.0
##
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371


We see that the AIC and BIC are now much higher, reflecting the fact that this is a poor model because there are significant differences in slope between subjects.

    A BLUE — Best Linear Unbiased Estimation

Context is that we have a vector $y = X\beta + Zb + \epsilon$ where $X$ and $Z$ are known matrices, $\beta$ are unknown parameters, $b \sim N(0, \Sigma)$ and $\epsilon \sim N(0, \sigma^2 I)$. Here $\Sigma$ and $\sigma$ are assumed known since we will already have estimated them using REML. Another way to write this is that $y \sim N(X\beta, W)$ where $W = Z\Sigma Z^T + \sigma^2 I$ is known.

For an arbitrary vector $c$, we now seek the best linear unbiased estimate of $\psi = c^T \beta$, i.e. the linear combination $\tilde\psi = h^T y$ which minimises $\mathrm{Var}[\tilde\psi]$ subject to $E[\tilde\psi] = \psi = c^T \beta$.

$E[\tilde\psi] = h^T X \beta$ and so (since $\beta$ is unknown) $h$ must satisfy $X^T h = c$.

But $\mathrm{Var}[\tilde\psi] = h^T W h$ and so we set up a Lagrange-multiplier based function to carry out the constrained optimisation:

$$Q(h, \lambda) = h^T W h + \lambda^T (X^T h - c)$$

    We now need to find the turning-point of Q.

$$\frac{\partial Q}{\partial h} = 2Wh + X\lambda$$

and equating to zero yields

$$h = -\tfrac{1}{2} W^{-1} X \lambda$$

and applying the constraint we now find that

$$c = X^T h = -\tfrac{1}{2} X^T W^{-1} X \lambda$$

from which

$$\lambda = -2 [X^T W^{-1} X]^{-1} c$$

Substituting this value of $\lambda$ into the equation for $h$, we arrive at

$$h = W^{-1} X [X^T W^{-1} X]^{-1} c$$

from which

$$\tilde\psi = c^T [X^T W^{-1} X]^{-1} X^T W^{-1} y$$

Since this is of the form $c^T \tilde\beta$ where $\tilde\beta = [X^T W^{-1} X]^{-1} X^T W^{-1} y$ for all $c$, we call $\tilde\beta$ the BLUE of $\beta$.

    B BLUP — Best Linear Unbiased Prediction

Context is the same as for BLUE except that this time the focus is on estimating linear combinations of the vector $b$.

Let $\phi = c^T b$. We require the best linear unbiased predictor of $\phi$, i.e. $\tilde\phi = h^T y$ such that $E[\tilde\phi - \phi] = 0$, which minimises $\mathrm{Var}[\tilde\phi - \phi]$.


Now $E[\phi] = 0$ and so we require that $E[\tilde\phi] = 0$, i.e. $h^T X \beta = 0$. Since $\beta$ is unknown, this implies that $h^T X = 0$.

$$\mathrm{Var}[\tilde\phi - \phi] = \mathrm{Var}[h^T X \beta + h^T Z b + h^T \epsilon - c^T b] = h^T W h + c^T \Sigma c - 2 h^T Z \Sigma c$$

and we again set up a Lagrange-multiplier based function:

$$Q(h, \lambda) = h^T W h + c^T \Sigma c - 2 h^T Z \Sigma c + \lambda^T X^T h$$

$$\frac{\partial Q}{\partial h} = 2Wh - 2Z\Sigma c + X\lambda$$

and equating to zero yields

$$h = W^{-1}(Z\Sigma c - \tfrac{1}{2} X \lambda)$$

and applying the constraint $X^T h = 0$ we now find that

$$0 = X^T h = X^T W^{-1}(Z\Sigma c - \tfrac{1}{2} X \lambda)$$

from which

$$\lambda = 2 [X^T W^{-1} X]^{-1} X^T W^{-1} Z \Sigma c$$

and substituting into the equation for $h$ gives

$$h = W^{-1}(Z\Sigma c - X[X^T W^{-1} X]^{-1} X^T W^{-1} Z \Sigma c) = W^{-1}(I - X[X^T W^{-1} X]^{-1} X^T W^{-1}) Z \Sigma c$$

Therefore

$$\tilde\phi = c^T \Sigma Z^T (I - W^{-1} X [X^T W^{-1} X]^{-1} X^T) W^{-1} y$$

and since this is of the form $c^T \tilde{b}$ where $\tilde{b}$ does not depend on $c$, we call

$$\tilde{b} = \Sigma Z^T (I - W^{-1} X [X^T W^{-1} X]^{-1} X^T) W^{-1} y$$

the BLUP of $b$.

Note that this can be written in terms of the BLUE of $\beta$:

$$\tilde{b} = \Sigma Z^T W^{-1} (y - X\tilde\beta)$$
