
Page 1: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models


Shravan Vasishth

Department of Linguistics, University of Potsdam, Germany

October 7, 2014

1 / 42

Page 2: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Introduction

Motivating this course

In psycholinguistics, we usually use frequentist methods to analyze our data.

The most common tool I use is lme4.

In recent years, very powerful programming languages have become available that make Bayesian modeling relatively easy.

Bayesian tools have several important advantages over frequentist methods, but they require some very specific background knowledge.

My goal in this course is to try to provide that background knowledge.

Note: I will teach a more detailed (2-week) course at ESSLLI in Barcelona, Aug 3-14, 2015.

2 / 42

Page 3: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Introduction

Motivating this course

In this introductory lecture, my goals are to

1 motivate you to look at Bayesian Linear Mixed Models as an alternative to using frequentist methods.

2 make sure we are all on the same page regarding the moving parts of a linear mixed model.

3 / 42

Page 4: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Preliminaries

Prerequisites

1 Familiarity with fitting standard LMMs such as:

lmer(rt~cond+(1+cond|subj)+(1+cond|item),dat)

2 Basic knowledge of R.

4 / 42

Page 5: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Preliminaries

A bit about my background

1999: Discovered the word "ANOVA" in a chance conversation with a psycholinguist.

1999: Did my first self-paced reading experiment. Fit a repeated measures ANOVA.

2000: Went to statisticians at Ohio State's statistical consulting unit, and they said: "Why are you fitting ANOVA? You need linear mixed models."

2000-2010: Kept fitting and publishing linear mixed models.

2010: Realized I didn't really know what I was doing, and started a part-time MSc in Statistics at Sheffield's School of Math and Statistics (2011-2015).

5 / 42

Page 6: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

Linear mixed models
Example: Gibson and Wu data, Language and Cognitive Processes, 2012

6 / 42

Page 7: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

Linear mixed models
Example: Gibson and Wu data, Language and Cognitive Processes, 2012

Subject vs. object relative clauses in Chinese, self-paced reading. The critical region is the head noun. The goal is to find out whether SRs are harder to process than ORs at the head noun.

7 / 42

Page 8: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

Linear mixed models
Example: Gibson and Wu 2012 data

> head(data[,c(1,2,3,4,7,10,11)])

subj item type pos rt rrt x

7 1 13 obj-ext 6 1140 -0.8771930 0.5

20 1 6 subj-ext 6 1197 -0.8354219 -0.5

32 1 5 obj-ext 6 756 -1.3227513 0.5

44 1 9 obj-ext 6 643 -1.5552100 0.5

60 1 14 subj-ext 6 860 -1.1627907 -0.5

73 1 4 subj-ext 6 868 -1.1520737 -0.5

8 / 42

Page 9: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

lme4 model of Gibson and Wu data
Crossed varying intercepts and slopes model, with correlation

This is the type of "maximal" model that most people fit nowadays (citing Barr et al. 2012):

> m1 <- lmer(rrt~x+(1+x|subj)+(1+x|item),

+ subset(data,region=="headnoun"))

I will now show two major (related) problems that occur with the small datasets we usually have in psycholinguistics:

The correlation estimates lead to degenerate variance-covariance matrices, and/or

The correlation estimates are wild estimates that have no bearing on reality.

[This is not a failing of lmer, but rather a result of the user demanding too much of lmer.]

9 / 42

Page 10: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

lme4 model of Gibson and Wu data
Typical data analysis: Crossed varying intercepts and slopes model, with correlation

> summary(m1)

Linear mixed model fit by REML [lmerMod]

Formula: rrt ~ x + (1 + x | subj) + (1 + x | item)

Data: subset(data, region == "headnoun")

REML criterion at convergence: 1595.6

Scaled residuals:

Min 1Q Median 3Q Max

-2.5441 -0.6430 -0.1237 0.5996 3.2501

Random effects:

Groups Name Variance Std.Dev. Corr

subj (Intercept) 0.371228 0.60928

x 0.053241 0.23074 -0.51

item (Intercept) 0.110034 0.33171

x 0.009218 0.09601 1.00

Residual 0.891577 0.94423

Number of obs: 547, groups: subj, 37; item, 15

Fixed effects:

Estimate Std. Error t value

(Intercept) -2.67151 0.13793 -19.369

x -0.07758 0.09289 -0.835

Correlation of Fixed Effects:

(Intr)

x 0.012

10 / 42

Page 11: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

The “best” model

The way to decide on the "best" model is to find the simplest model using the Generalized Likelihood Ratio Test (Pinheiro and Bates 2000). Here, this is the varying intercepts model, not the maximal model.

> m1<- lmer(rrt~x+(1+x|subj)+(1+x|item),

+ headnoun)

> m1a<- lmer(rrt~x+(1|subj)+(1|item),

+ headnoun)

11 / 42

Page 12: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

The “best” model

> anova(m1,m1a)

Data: headnoun

Models:

m1a: rrt ~ x + (1 | subj) + (1 | item)

m1: rrt ~ x + (1 + x | subj) + (1 + x | item)

Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)

m1a 5 1603.5 1625.0 -796.76 1593.5

m1 9 1608.5 1647.3 -795.27 1590.5 2.9742 4 0.5622

12 / 42

Page 13: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

Here, we simulate data with the same structure, sample size, and parameter values as the Gibson and Wu data, except that we assume that the correlations are 0.6. Then we analyze the data using lmer (maximal model). Can lmer recover the correlations?

13 / 42

Page 14: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

We define a function called new.df that generates data similar to the Gibson and Wu dataset. For the code, see the accompanying .R file; a sketch of such a generator is shown below.
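The accompanying .R file is not reproduced in this transcript. The following is a minimal sketch of what a new.df-style generator could look like: the argument names (nsubj, nitems, rho.u, rho.w) and the list return value match the gendata() call shown later, but all numeric settings, the condition-assignment scheme, and the column layout (the real data frame has more columns; gendata() selects columns 1, 2, 3, and 9) are assumptions.

library(MASS)  # for mvrnorm

new.df <- function(nsubj = 37, nitems = 15, rho.u = 0.6, rho.w = 0.6,
                   beta = c(6.5, -0.07),   # fixed intercept and slope (assumed values)
                   sd.u = c(0.6, 0.2),     # subject intercept/slope sds (assumed)
                   sd.w = c(0.3, 0.1),     # item intercept/slope sds (assumed)
                   sigma.e = 0.9) {        # residual sd (assumed)
  ## correlated random intercepts and slopes for subjects and items
  Sigma.u <- diag(sd.u) %*% matrix(c(1, rho.u, rho.u, 1), 2) %*% diag(sd.u)
  Sigma.w <- diag(sd.w) %*% matrix(c(1, rho.w, rho.w, 1), 2) %*% diag(sd.w)
  u <- mvrnorm(nsubj,  c(0, 0), Sigma.u)
  w <- mvrnorm(nitems, c(0, 0), Sigma.w)
  dat <- expand.grid(subj = 1:nsubj, item = 1:nitems)
  ## assign conditions (1 = SR, 2 = OR); counterbalancing scheme is assumed
  dat$cond <- ifelse((dat$subj + dat$item) %% 2 == 0, 1, 2)
  x <- ifelse(dat$cond == 1, -0.5, 0.5)
  dat$rt <- rnorm(nrow(dat),
                  mean = (beta[1] + u[dat$subj, 1] + w[dat$item, 1]) +
                         (beta[2] + u[dat$subj, 2] + w[dat$item, 2]) * x,
                  sd = sigma.e)
  ## return a list with the data frame as the first element,
  ## mirroring how gendata() unpacks it with dat[[1]]
  list(dat)
}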

14 / 42

Page 15: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

Next, we write a function that generates data for us repeatedly with the following specifications: sample size for subjects and items, and some correlation between subject intercept and slope, and item intercept and slope.

> gendata<-function(subjects=37,items=15){

+ dat<-new.df(nsubj=subjects,nitems=items,

+ rho.u=0.6,rho.w=0.6)

+ dat <- dat[[1]]

+ dat<-dat[,c(1,2,3,9)]

+ dat$x<-ifelse(dat$cond==1,-0.5,0.5)

+

+ return(dat)

+ }

15 / 42

Page 16: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

Set number of simulations:

> nsim<-100

Next, we generate simulated data 100 times, and then store the estimated subject and item level correlations in the random effects, and plot their distributions.

We do this for two settings: the Gibson and Wu sample sizes (37 subjects, 15 items), and 50 subjects and 30 items.

16 / 42

Page 17: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

37 subjects and 15 items

> library(lme4)

> subjcorr<-rep(NA,nsim)

> itemcorr<-rep(NA,nsim)

> for(i in 1:nsim){

+ dat<-gendata()

+ m3<-lmer(rt~x+(1+x|subj)+(1+x|item),dat)

+ subjcorr[i]<-attr(VarCorr(m3)$subj,"correlation")[1,2]

+ itemcorr[i]<-attr(VarCorr(m3)$item,"correlation")[1,2]

+ }

17 / 42

Page 18: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

[Figure: density plots of the estimated subject correlations ρ̂u ("Distribution of subj. corr.") and item correlations ρ̂w ("Distribution of item corr.") across the 100 simulations.]

18 / 42

Page 19: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

50 subjects and 30 items

> subjcorr<-rep(NA,nsim)

> itemcorr<-rep(NA,nsim)

> for(i in 1:nsim){

+ #print(i)

+ dat<-gendata(subjects=50,items=30)

+ m3<-lmer(rt~x+(1+x|subj)+(1+x|item),dat)

+ subjcorr[i]<-attr(VarCorr(m3)$subj,"correlation")[1,2]

+ itemcorr[i]<-attr(VarCorr(m3)$item,"correlation")[1,2]

+ }

19 / 42

Page 20: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

[Figure: density plots of the estimated subject correlations ρ̂u and item correlations ρ̂w across the 100 simulations, now with 50 subjects and 30 items.]

20 / 42

Page 21: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

How meaningful were the lmer estimates of correlations in the maximal model m1?
Simulated data

Conclusion:

1 It seems that lmer can estimate the correlation parameters only when the sample size for items and subjects is "large enough" (this can be established using simulation, as done above).

2 Barr et al.'s recommendation to fit a maximal model makes sense as a general rule only if it's already clear that we have enough data to estimate all the variance components and parameters.

3 In my experience, that is rarely the case, at least in psycholinguistics, especially when we go to more complex designs than a simple two-condition study.

21 / 42

Page 22: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Repeated measures data

Keep it maximal?

Gelman and Hill (2007, p. 549) make a more tempered recommendation than Barr et al.:

Don't get hung up on whether a coefficient "should" vary by group. Just allow it to vary in the model, and then, if the estimated scale of variation is small . . . , maybe you can ignore it if that would be more convenient. Practical concerns sometimes limit the feasible complexity of a model–for example, we might fit a varying-intercept model first, then allow slopes to vary, then add group-level predictors, and so forth. Generally, however, it is only the difficulties of fitting and, especially, understanding the models that keeps us from adding even more complexity, more varying coefficients, and more interactions.

22 / 42

Page 23: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Why fit Bayesian LMMs?

Advantages of fitting a Bayesian LMM

1 For such data, there can be situations where you really need to or want to fit full variance-covariance matrices for random effects. Bayesian LMMs will let you fit them even in cases where lmer would fail to converge or return nonsensical estimates (due to too little data). The way we will set them up, Bayesian LMMs will typically underestimate correlation, unless there is enough data.

2 A direct answer to the research question can be obtained by examining the posterior distribution given the data.

3 We can avoid the traditional hard binary decision associated with frequentist methods: p < 0.05 implies reject null, and p > 0.05 implies "accept" null. We are more interested in quantifying our uncertainty about the scientific claim.

4 Prior knowledge can be included in the model.

23 / 42

Page 24: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Why fit Bayesian LMMs?

Disadvantages of doing a Bayesian analysis

You have to invest effort into specifying a model; unlike lmer, which involves a single line of code, JAGS and Stan model specifications can extend to 20-30 lines. A lot of decisions have to be made.

There is a steep learning curve; you have to know a bit about probability distributions, MCMC methods, and of course Bayes' Theorem.

It takes much more time to fit a complicated model in a Bayesian setting than with lmer.

But I will try to demonstrate to you in this course that it's worth the effort, especially when you don't have a lot of data (usually the case in psycholinguistics).

24 / 42

Page 25: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models

yi = β0 + β1 xi + εi    (1)

(yi: response; xi: predictor; β0, β1: parameters; εi: error)

where

εi is the residual error, assumed to be normally distributed: εi ∼ N(0, σ²).

Each response yi (i ranging from 1 to I) is independently and identically distributed as yi ∼ N(β0 + β1 xi, σ²).

Point values for parameters: β0 and β1 are the parameters to be estimated. In the frequentist setting, these are point values; they have no distribution.

Hypothesis test: Usually, β1 is the parameter of interest; in the frequentist setting, we test the null hypothesis that β1 = 0.
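For concreteness, here is a minimal simulation sketch (the parameter values and variable names are made up, not from any dataset in this lecture) showing what fitting this model and testing β1 = 0 looks like in R:

## Simulate from yi = beta0 + beta1*xi + ei with made-up parameter values,
## then fit the linear model; summary() reports the t-test of beta1 = 0.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # beta0 = 2, beta1 = 0.5, sigma = 1
m0 <- lm(y ~ x)
summary(m0)$coefficients              # estimates, SEs, t and p values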

25 / 42

Page 26: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models and repeated measures data

Repeated measures data

Linear mixed models are useful for correlated data (e.g., repeated measures) where the responses y are not independently distributed.

A key difference from linear models is that the intercept and/or slope vary by subject j = 1, . . . , J (and possibly also by item k = 1, . . . , K):

yi = [β0 + u0j + w0k] + [β1 + u1j + w1k] xi + εi    (2)

(varying intercepts in the first bracket, varying slopes in the second; xi: predictor, εi: error)

26 / 42

Page 27: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models and repeated measures data

Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation

yi = [β0 + u0j + w0k] + [β1 + u1j + w1k] xi + εi    (3)

This is the “maximal” model we saw earlier:

> m1 <- lmer(rrt~x+(1+x|subj)+(1+x|item),

+ headnoun)

27 / 42

Page 28: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models and repeated measures data

Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation

> summary(m1)

Linear mixed model fit by REML [lmerMod]

Formula: rrt ~ x + (1 + x | subj) + (1 + x | item)

Data: headnoun

REML criterion at convergence: 1595.6

Scaled residuals:

Min 1Q Median 3Q Max

-2.5441 -0.6430 -0.1237 0.5996 3.2501

Random effects:

Groups Name Variance Std.Dev. Corr

subj (Intercept) 0.371228 0.60928

x 0.053241 0.23074 -0.51

item (Intercept) 0.110034 0.33171

x 0.009218 0.09601 1.00

Residual 0.891577 0.94423

Number of obs: 547, groups: subj, 37; item, 15

Fixed effects:

Estimate Std. Error t value

(Intercept) -2.67151 0.13793 -19.369

x -0.07758 0.09289 -0.835

Correlation of Fixed Effects:

(Intr)

x 0.012

28 / 42

Page 29: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models and repeated measures data

Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation

rrti = (β0 + u0j + w0k) + (β1 + u1j + w1k)xi + εi (4)

1 i = 1, . . . , 547 data points; j = 1, . . . , 37 subjects; k = 1, . . . , 15 items

2 xi is coded −0.5 (SR) and 0.5 (OR).

3 εi ∼ N(0, σ²).

4 u0j ∼ N(0, σu0²) and u1j ∼ N(0, σu1²).

5 w0k ∼ N(0, σw0²) and w1k ∼ N(0, σw1²).

with a multivariate normal distribution for the varying intercepts and slopes:

(u0j, u1j)ᵀ ∼ N((0, 0)ᵀ, Σu),   (w0k, w1k)ᵀ ∼ N((0, 0)ᵀ, Σw)    (5)

29 / 42

Page 30: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models and repeated measures data

The variance components associated with subjects

Random effects:

Groups Name Variance Std.Dev. Corr

subj (Intercept) 0.37 0.61

so 0.05 0.23 -0.51

Σu = [ σu0²          ρu σu0 σu1 ]  =  [ 0.61²                  −0.51 × 0.61 × 0.23 ]
     [ ρu σu0 σu1    σu1²       ]     [ −0.51 × 0.61 × 0.23    0.23²               ]    (6)

30 / 42

Page 31: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models and repeated measures data

The variance components associated with items

Random effects:

Groups Name Variance Std.Dev. Corr

item (Intercept) 0.11 0.33

so 0.01 0.10 1.00

Note the by-items intercept-slope correlation of +1.00.

Σw = [ σw0²          ρw σw0 σw1 ]  =  [ 0.33²              1 × 0.33 × 0.10 ]
     [ ρw σw0 σw1    σw1²       ]     [ 1 × 0.33 × 0.10    0.10²           ]    (7)

31 / 42

Page 32: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models and repeated measures data

The variance components of the "maximal" linear mixed model

RTi = β0 + u0j + w0k + (β1 + u1j + w1k) xi + εi    (8)

(u0j, u1j)ᵀ ∼ N((0, 0)ᵀ, Σu),   (w0k, w1k)ᵀ ∼ N((0, 0)ᵀ, Σw)    (9)

εi ∼ N(0, σ²)    (10)

The parameters are β0, β1, Σu, Σw, and σ. Each of the Σ matrices has three parameters (two standard deviations and one correlation), so together with β0, β1, and σ we have 3 + 3 + 3 = 9 parameters.

32 / 42

Page 33: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Brief review of linear (mixed) models

Linear models and repeated measures data

Summary so far

Linear mixed models allow us to take all relevant variance components into account; LMMs allow us to describe how the data were generated.

However, maximal models should not be fit blindly, especially when there is not enough data to estimate the parameters.

For small datasets we often see degenerate variance-covariance estimates (with correlation ±1). Many psycholinguists ignore this degeneracy.

If one cares about the correlation, one should not ignore the degeneracy.

33 / 42

Page 34: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The frequentist approach

1 In the frequentist setting, we start with a dependent measure y, for which we assume a probability model.

2 In the above example, we have reading time data, rt, which we assume is generated from a normal distribution with some mean µ and variance σ²; we write this rt ∼ N(µ, σ²).

3 Given a particular set of parameter values µ and σ², we could state the probability distribution of rt given the parameters. We can write this as p(rt | µ, σ²).

34 / 42

Page 35: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The frequentist approach

1 In reality, we know neither µ nor σ². The goal of fitting a model to such data is to estimate the two parameters, and then to draw inferences about what the true value of µ is.

2 The frequentist method relies on the fact that, under repeated sampling and with a large enough sample size, the sampling distribution of the sample mean X̄ is N(µ, σ²/n).

3 The standard method is to use the sample mean x̄ as an estimate of µ; given a large enough sample size n, we can compute an approximate 95% confidence interval x̄ ± 2 × √(σ̂²/n).

35 / 42

Page 36: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The frequentist approach

The 95% confidence interval has a slightly complicated interpretation:

If we were to repeatedly carry out the experiment and compute a confidence interval each time using the above procedure, 95% of those confidence intervals would contain the true parameter value µ (assuming, of course, that all our model assumptions are satisfied).

The particular confidence interval we calculated for our particular sample does not give us a range such that we are 95% certain that the true µ lies within it, although this is how most users of statistics seem to (mis)interpret the confidence interval.
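This repeated-sampling property can be checked by simulation (the figure on the next slide shows such a check). A minimal sketch, with an arbitrary true µ, σ, and sample size chosen only for illustration:

## Draw 100 samples from N(mu, sigma^2), compute an approximate 95% CI
## from each (mean +/- 2 SE), and count how many intervals contain mu.
set.seed(1)
mu <- 60; sigma <- 4; n <- 40
covered <- replicate(100, {
  y <- rnorm(n, mu, sigma)
  ci <- mean(y) + c(-2, 2) * sd(y) / sqrt(n)
  ci[1] < mu && mu < ci[2]
})
mean(covered)   # proportion of intervals covering mu; close to 0.95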

36 / 42

Page 37: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The 95% CI

[Figure: "95% CIs in 100 repeated samples"; x-axis: i-th repeated sample, y-axis: Scores.]

37 / 42

Page 38: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The Bayesian approach

1 The Bayesian approach starts with a probability model that defines our prior belief about the possible values that the parameters µ and σ² might have.

2 This probability model expresses what we know so far about these two parameters (we may not know much, but in practical situations, it is not the case that we don't know anything about their possible values).

3 Given this prior distribution, the probability model p(y | µ, σ²) and the data y allow us to compute the probability distribution of the parameters given the data, p(µ, σ² | y).

4 This probability distribution, called the posterior distribution, is what we use for inference.

38 / 42

Page 39: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The Bayesian approach

1 Unlike the 95% confidence interval, we can define a 95% credible interval that represents the range within which we are 95% certain that the true value of the parameter lies, given the data at hand.

2 Note that in the frequentist setting, the parameters are point values: µ is assumed to have a particular value in nature.

3 In the Bayesian setting, µ is a random variable with a probability distribution; it has a mean, but there is also some uncertainty associated with its true value.

39 / 42

Page 40: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The Bayesian approach

Bayes' theorem makes it possible to derive the posterior distribution given the prior and the data. The conditional probability rule in probability theory (see Kerns) is that the joint distribution of two random variables p(θ, y) is equal to p(θ | y) p(y). It follows that:

p(θ, y) = p(θ | y) p(y)
        = p(y, θ)          (because p(θ, y) = p(y, θ))
        = p(y | θ) p(θ)    (11)

The first and third lines in the equalities above imply that

p(θ | y)p(y) = p(y | θ)p(θ). (12)

40 / 42

Page 41: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The Bayesian approach

Dividing both sides by p(y), we get:

p(θ | y) = p(y | θ) p(θ) / p(y)    (13)

The term p(y | θ) is the probability of the data given θ. If we treat this as a function of θ, we have the likelihood function. Since p(θ | y) is the posterior distribution of θ given y, p(y | θ) the likelihood, and p(θ) the prior, the following relationship is established:

Posterior ∝ Likelihood×Prior (14)

41 / 42

Page 42: Paris Lecture 1

Lecture 1: Review of Linear Mixed Models

Frequentist vs Bayesian methods

The Bayesian approach

Posterior ∝ Likelihood×Prior (15)

We ignore the denominator p(y) here because it only serves as a normalizing constant that renders the left-hand side (the posterior) a probability distribution.

The above is Bayes' theorem, and it is the basis for determining the posterior distribution given a prior and the likelihood. The rest of this course simply unpacks this idea.

Next week, we will look at some simple examples of the application of Bayes' Theorem.
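As a tiny preview of such an application (an invented example, not from the course materials), the proportionality can be applied directly on a grid of parameter values; here, a posterior for a binomial success probability θ with a Beta(2, 2) prior:

## Posterior is proportional to Likelihood x Prior, computed on a grid and renormalized.
## Made-up data: 7 successes out of 10 trials; Beta(2, 2) prior on theta.
theta      <- seq(0.001, 0.999, length.out = 1000)
prior      <- dbeta(theta, 2, 2)
likelihood <- dbinom(7, size = 10, prob = theta)
posterior  <- likelihood * prior
posterior  <- posterior / sum(posterior)   # normalizing plays the role of p(y)
sum(theta * posterior)                     # posterior mean of theta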

42 / 42