non-gaussian response variables - wordpress.com...have a poisson distribution (at each distance)...

Non-Gaussian Response Variables

What is the Generalized Model Doing?

• The fixed effects are like the factors in a traditional analysis of variance or linear model

• The random effects are different

– A generalized linear mixed model (GLMM) includes fixed and random factors

– The random effects are modeled as coming from a specified distribution (e.g., each female has a different baseline offspring size, and this baseline size has a normal distribution with an estimated variance)

– The model basically searches through parameter values to find the set of slopes and intercepts that maximize the probability of the observed data

– By including a random effect, you can reduce that variable’s impact on the fixed effect analysis – you usually will not care too much about the random effect (but you might)
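To make the "searching through parameter values" idea concrete, here is a minimal sketch in R. It uses simulated data (not the datasets in these slides), and the names x, y, and negloglik are illustrative assumptions. It maximizes a Poisson likelihood directly with optim() and compares the result with glm(), which reaches the same estimates by a different algorithm.

```r
# Simulated data (hypothetical; not from the slides)
set.seed(1)
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))

# Negative log-likelihood of a Poisson model with a log link
negloglik <- function(beta) {
  mu <- exp(beta[1] + beta[2] * x)
  -sum(dpois(y, lambda = mu, log = TRUE))
}

# Search for the intercept and slope that maximize the likelihood,
# starting from an intercept-only guess
fit_optim <- optim(par = c(log(mean(y)), 0), fn = negloglik, method = "BFGS")

# glm() fits the same model by iteratively reweighted least squares
fit_glm <- glm(y ~ x, family = poisson)

print(fit_optim$par)
print(unname(coef(fit_glm)))
```

The two sets of estimates should agree to a few decimal places, which is all that "maximum likelihood" means here.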

A Flexible Modeling Framework

• GLMMs (and related models) allow a lot of modeling flexibility

– Explicit modeling of heterogeneity

– Including a spatial or temporal component

– Response variables that are not normally distributed

– Include fixed and random factors

– Naturally allows nested designs and repeated measures

Exponential Family of Distributions

• Normal: symmetric, continuous

• Poisson: asymmetric, discrete

– Rare events, like number of robberies per week in College Station

– The mean equals the variance

• Binomial: asymmetric, discrete

– Number of occurrences out of a fixed number of trials

– The mean is larger than the variance

• Negative binomial: asymmetric, discrete

– Like the binomial, except the variance is larger than the mean

• Gamma: asymmetric, continuous

– Can have a variety of shapes, but all observations are positive
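The mean-variance statements for the Poisson and binomial can be checked numerically. The sketch below computes the mean and variance straight from the probability mass functions; the parameter values (a Poisson mean of 4, and 20 trials with success probability 0.3) are arbitrary choices for illustration.

```r
# Support wide enough that the omitted tail probability is negligible
k <- 0:200

# Poisson with mean 4 (arbitrary illustrative value)
p_pois    <- dpois(k, lambda = 4)
mean_pois <- sum(k * p_pois)
var_pois  <- sum(k^2 * p_pois) - mean_pois^2

# Binomial with 20 trials and success probability 0.3 (arbitrary values)
p_binom    <- dbinom(k, size = 20, prob = 0.3)
mean_binom <- sum(k * p_binom)
var_binom  <- sum(k^2 * p_binom) - mean_binom^2

print(c(mean = mean_pois, variance = var_pois))    # mean equals variance
print(c(mean = mean_binom, variance = var_binom))  # mean exceeds variance
```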

Parts of a GLM

• The distribution of the response variable

– Usually we assume it’s normal

• Specification of the systematic component in terms of explanatory variables

– The fixed effects

– If we had random factors, it would be a GLMM

• The link between the systematic part and the response variable

– In our usual models, the link is the identity link, where the expected value of the response variable is directly estimated (like from the equation for a line: y = mx + b)

Implementing a Poisson GLM

• Now the response variable has a Poisson distribution

• We specify the systematic part of the model in the usual way (same goes for random parts if we want those)

• The link is logarithmic, which ensures that the predicted values are always non-negative (a Poisson distribution doesn’t allow negative values)
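A small sketch of what the log link does, using only base R. The linear predictor values in eta are made up for illustration; the point is that even negative values of the linear predictor map to positive expected counts.

```r
# Hypothetical linear predictor values, including negative ones
eta <- c(-3, -1, 0, 1, 3)

# Inverse of the log link: the expected count is exp(eta)
mu <- exp(eta)

# The poisson() family object carries the same pair of functions
fam <- poisson(link = "log")
print(fam$linkinv(eta))  # same values as mu
print(fam$linkfun(mu))   # recovers eta

print(all(mu > 0))
```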

Example: Amphibian Roadkills

• Dataset: Roadkills of amphibians at 52 sites of varying distance from a “natural park”

• Number of roadkills is not normally distributed

• Amphibian getting run over by car might be a rare, random event, so you might expect it to have a Poisson distribution (at each distance)

Plot of Roadkills on Distance

[Figure: scatterplot of Total Roadkills (y-axis, 0 to 100) against Distance from Park (x-axis, 0 to 25000)]

Fitting a Poisson GLM

• > M1 <- glm(TOT.N ~ D.PARK, family=poisson, data=RK)

• > summary(M1)

Poisson GLM Output

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-8.1100  -1.6950  -0.4708   1.4206   7.3337

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.316e+00  4.322e-02   99.87   <2e-16 ***
D.PARK      -1.059e-04  4.387e-06  -24.13   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1071.4  on 51  degrees of freedom
Residual deviance:  390.9  on 50  degrees of freedom
AIC: 634.29

Number of Fisher Scoring iterations: 4
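Because the link is logarithmic, the coefficients in this output are on the log scale. The sketch below back-transforms them by hand; the coefficient values are copied from the output, while the three distances in d are arbitrary illustrative choices.

```r
# Coefficients copied from the summary(M1) output
b0 <- 4.316        # intercept on the log scale
b1 <- -1.059e-04   # slope for D.PARK on the log scale

# Expected roadkills at a few illustrative distances from the park
d  <- c(0, 5000, 25000)
mu <- exp(b0 + b1 * d)
print(round(mu, 1))

# Multiplicative change in expected roadkills per 1000 units of distance
ratio_per_1000 <- exp(b1 * 1000)
print(round(ratio_per_1000, 3))
```

A ratio below 1 means the expected count shrinks by that factor for every additional 1000 units of distance, rather than dropping by a fixed number of roadkills.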

Meaning of Deviance

• Null and residual deviances are kind of like maximum likelihood equivalents of the total and residual sums of squares

• An R²-like term can be obtained from:

$$100 \times \frac{\text{null deviance} - \text{residual deviance}}{\text{null deviance}}$$

• Applying this relationship to the previous model, we find that it explains 63.5% of the variation
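The 63.5% figure can be reproduced from the two deviances reported for M1. The helper deviance_explained() is a hypothetical convenience function, not part of base R; it applies the same arithmetic to any fitted glm object.

```r
# Deviances copied from the summary(M1) output
null_dev  <- 1071.4
resid_dev <- 390.9

pct <- 100 * (null_dev - resid_dev) / null_dev
print(round(pct, 1))

# Hypothetical helper: the same calculation for a fitted glm object
deviance_explained <- function(model) {
  100 * (model$null.deviance - deviance(model)) / model$null.deviance
}
```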

Fitting a Line for the Model

[Figure: the same scatterplot of Total Roadkills against Distance from Park, with the fitted Poisson GLM curve (solid) and its approximate 95% confidence bands (dashed)]

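Code for the Lines

# Distances at which to predict
MyData <- data.frame(D.PARK = seq(from = 0, to = 25000, by = 1000))
# Predict on the link (log) scale, with standard errors
G <- predict(M1, newdata = MyData, type = "link", se.fit = TRUE)
# Back-transform the fit and the approximate 95% bands to the count scale
F <- exp(G$fit)
FSEUP <- exp(G$fit + 1.96 * G$se.fit)
FSELOW <- exp(G$fit - 1.96 * G$se.fit)
# Add the fitted line (solid) and the bands (dashed) to the existing scatterplot
lines(MyData$D.PARK, F, lty = 1, lwd = 3)
lines(MyData$D.PARK, FSEUP, lty = 2, lwd = 3)
lines(MyData$D.PARK, FSELOW, lty = 2, lwd = 3)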

Model Selection in a Poisson GLM

• Option 1: Drop terms sequentially and test full and reduced models

• Option 2: Use the “drop1” command to drop each explanatory variable in turn

• Option 3: Use the anova command on the full model, which adds the terms one at a time in the order they appear in the formula and tests each addition (so the results depend on the order of the terms)
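A minimal sketch of Options 1 and 2 on simulated data. The data frame dat and its variables x1, x2, and y are made up for illustration, with x2 given no true effect. The test name is written out as "Chisq"; R also accepts the abbreviation "Chi".

```r
# Simulated data (hypothetical): y depends on x1 but not on x2
set.seed(42)
n   <- 100
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- rpois(n, lambda = exp(1 + 1.5 * dat$x1))

full    <- glm(y ~ x1 + x2, family = poisson, data = dat)
reduced <- glm(y ~ x1,      family = poisson, data = dat)

# Option 1: compare the full and reduced models with a likelihood ratio test
cmp <- anova(reduced, full, test = "Chisq")
print(cmp)

# Option 2: drop each term in turn from the full model
d1 <- drop1(full, test = "Chisq")
print(d1)
```

For a single dropped term, the LRT reported by drop1() is the same deviance difference that the two-model anova() comparison tests.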

The drop1 command

• Example: Still roadkills, but with nine explanatory variables

• > M2 <- glm(TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + D.PARK + SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR, family=poisson, data=RK)

• > summary(M2)

• > drop1(M2, test="Chi")

Results of drop1()

Single term deletions

Model:
TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + D.PARK + SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR

            Df Deviance     AIC    LRT  Pr(>Chi)
<none>           270.23  529.62
OPEN.L       1   273.93  531.32   3.69 0.0546474 .
MONT.S       1   306.89  564.28  36.66 1.410e-09 ***
SQ.POLIC     1   285.53  542.92  15.30 9.181e-05 ***
D.PARK       1   838.09 1095.48 567.85 < 2.2e-16 ***
SQ.SHRUB     1   298.31  555.70  28.08 1.167e-07 ***
SQ.WATRES    1   280.02  537.41   9.79 0.0017539 **
L.WAT.C      1   335.47  592.86  65.23 6.648e-16 ***
SQ.LPROAD    1   281.25  538.64  11.02 0.0009009 ***
SQ.DWATCOUR  1   272.50  529.89   2.27 0.1319862
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Overdispersion

• Recall that the Poisson distribution assumes the variance is equal to the mean

• If the variance is greater than the mean, then a Poisson will not accurately describe the data

• This problem is called overdispersion

Detecting Overdispersion

• Calculate:

$$\frac{D}{n - p}$$

• D is the residual deviance of the model [It was 390.9 in model M1 a few slides ago].

• n – p represents the degrees of freedom for the residual deviance [also reported by the summary() function – in this case 50]

• If this value is around 1, then overdispersion should not be a problem

• If it is substantially greater than 1, then overdispersion is a problem.

• 390.9/50 = 7.8, so overdispersion is a problem in this dataset.
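The ratio is easy to compute from any fitted model. The sketch below first reproduces the slide's arithmetic, then applies a hypothetical helper, dispersion_ratio(), to two simulated datasets (one truly Poisson, one deliberately overdispersed) to show what "around 1" and "well above 1" look like. All simulated values are illustrative assumptions.

```r
# The calculation from the slide, using the M1 output
print(390.9 / 50)

# Hypothetical helper: residual deviance over residual degrees of freedom
dispersion_ratio <- function(model) deviance(model) / df.residual(model)

# Simulated data (not the roadkill data)
set.seed(7)
n  <- 200
x  <- runif(n)
mu <- exp(1 + x)
y_pois <- rpois(n, lambda = mu)           # variance equals the mean
y_over <- rnbinom(n, mu = mu, size = 1)   # variance is mu + mu^2

r_pois <- dispersion_ratio(glm(y_pois ~ x, family = poisson))
r_over <- dispersion_ratio(glm(y_over ~ x, family = poisson))
print(c(poisson_data = r_pois, overdispersed_data = r_over))
```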

Overdispersion in a Poisson GLM

• One approach is to use a quasi-Poisson GLM

• This model includes a dispersion parameter to better model the variance relative to the mean

• If the dispersion parameter (φ) is large then it might be better to use a different model

Fitting a Quasipoisson

• > M4 <- glm(TOT.N ~ D.PARK, family=quasipoisson, data=RK)

• > summary(M4)

Results

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-8.1100  -1.6950  -0.4708   1.4206   7.3337

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.316e+00  1.194e-01  36.156  < 2e-16 ***
D.PARK      -1.058e-04  1.212e-05  -8.735 1.24e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 7.630148)

    Null deviance: 1071.4  on 51  degrees of freedom
Residual deviance:  390.9  on 50  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4
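Compared with the Poisson fit, the quasi-Poisson standard errors are larger by a factor of the square root of the dispersion parameter. The first part of this sketch checks that against the numbers reported in the two outputs; the second part repeats the check on simulated data (all simulated values are illustrative assumptions) and shows that R estimates the dispersion from the Pearson residuals.

```r
# Values copied from the Poisson and quasi-Poisson outputs
phi      <- 7.630148
se_pois  <- c(intercept = 4.322e-02, D.PARK = 4.387e-06)
se_quasi <- se_pois * sqrt(phi)
print(signif(se_quasi, 4))  # compare with the quasi-Poisson Std. Error column

# The same relationship on simulated overdispersed data
set.seed(3)
x  <- runif(100)
y  <- rnbinom(100, mu = exp(1 + x), size = 2)
mp <- glm(y ~ x, family = poisson)
mq <- glm(y ~ x, family = quasipoisson)

# Dispersion estimate: sum of squared Pearson residuals over residual df
phi_hat <- sum(resid(mp, type = "pearson")^2) / df.residual(mp)
print(c(by_hand = phi_hat, from_summary = summary(mq)$dispersion))
```

Note that this Pearson-based estimate is why the reported dispersion (7.63) differs slightly from the deviance-based ratio (7.8).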

Model Selection in Quasipoisson

• AIC is not defined for a quasipoisson model, so you can’t use AIC

• It’s possible to compare models using F-tests

• drop1(M5, test="F")

Model Validation in Poisson GLM

• Pearson residuals: the raw residuals (observed minus fitted) divided by the square root of the fitted mean at that value of the explanatory variable (because the variance of the Poisson changes with the mean)

• Deviance residuals: the contribution of each observation to the residual deviance. In other words, a measure of how badly that point fits.

• The default is to use the deviance residuals for model validation, and they will usually be the best choice.
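Both residual types can be computed by hand for a Poisson GLM, which makes the definitions above concrete. This sketch uses simulated data (an assumption, not the roadkill data) and compares the manual versions with what resid() returns.

```r
# Simulated Poisson data (hypothetical)
set.seed(11)
x  <- runif(50)
y  <- rpois(50, lambda = exp(1 + x))
m  <- glm(y ~ x, family = poisson)
mu <- unname(fitted(m))

# Pearson residual: raw residual divided by the square root of the fitted mean
pearson_manual <- (y - mu) / sqrt(mu)

# Deviance residual: signed square root of each observation's contribution
# to the residual deviance; y * log(y / mu) is taken as 0 when y = 0
term <- ifelse(y == 0, 0, y * log(y / mu))
deviance_manual <- sign(y - mu) * sqrt(pmax(2 * (term - (y - mu)), 0))

print(all.equal(pearson_manual,  unname(resid(m, type = "pearson"))))
print(all.equal(deviance_manual, unname(resid(m, type = "deviance"))))

# The squared deviance residuals add up to the residual deviance
print(c(sum_sq = sum(deviance_manual^2), deviance = deviance(m)))
```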

What to Plot

• Deviance residuals versus:

– The fitted values

– Each explanatory variable in the model

– Each explanatory variable dropped from the model

– Time (if it’s available)

– Any spatial aspect of the data

• We don’t expect normality, but we are looking for patterns and fit

Model Validation Plots

[Figure: the four standard diagnostic panels from plot(): Residuals vs Fitted (residuals against predicted values), Normal Q-Q (standardized deviance residuals against theoretical quantiles), Scale-Location (against predicted values), and Residuals vs Leverage (standardized Pearson residuals against leverage, with Cook's distance contours)]

Model Validation Plots

[Figure: four panels plotted against the fitted values (mu): Response residuals (E), Pearson residuals (EP), Pearson residuals scaled (EP2), and Deviance residuals (ED)]

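Code for the Validation Plots

# Model validation example
M5 <- glm(TOT.N ~ D.PARK, family = quasipoisson, data = RK)
# Standard diagnostic panels
plot(M5)
# Residuals of each type, plus fitted values on the count scale
EP <- resid(M5, type = "pearson")
ED <- resid(M5, type = "deviance")
mu <- predict(M5, type = "response")
E <- RK$TOT.N - mu
# Pearson residuals scaled by the estimated dispersion (7.630148, from the summary)
EP2 <- E / sqrt(7.630148 * mu)
# Plot each residual type against the fitted values in a 2 x 2 layout
op <- par(mfrow = c(2, 2))
plot(x = mu, y = E, main = "Response residuals")
plot(x = mu, y = EP, main = "Pearson residuals")
plot(x = mu, y = EP2, main = "Pearson residuals scaled")
plot(x = mu, y = ED, main = "Deviance residuals")
par(op)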

Interpretation

• This model has a couple of problems

• First, the residuals have a clear pattern, where they are above the predicted line at some distances and below it at others

• Second, some outliers are strongly influencing the results

Negative Binomial GLM

• Assumes:

– The distribution of the response variable is negative binomial for any value of X. Recall that the variance is larger than the mean for a negative binomial distribution

– The link function is logarithmic, which ensures that the fitted values are always non-negative

Fitting a Negative Binomial GLM

• > library(MASS)

• > M6 <- glm.nb(TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + D.PARK + SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR, link="log", data=RK)

• > summary(M6, cor=FALSE)

Some output

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.951e+00  4.145e-01   9.532   <2e-16 ***
OPEN.L      -9.419e-03  3.245e-03  -2.903   0.0037 **
MONT.S       5.846e-02  3.481e-02   1.679   0.0931 .
SQ.POLIC    -4.618e-02  1.298e-01  -0.356   0.7221
D.PARK      -1.235e-04  1.292e-05  -9.557   <2e-16 ***
SQ.SHRUB    -3.881e-01  2.883e-01  -1.346   0.1784
SQ.WATRES    1.631e-01  1.675e-01   0.974   0.3301
L.WAT.C      2.076e-01  9.636e-02   2.154   0.0312 *
SQ.LPROAD    5.944e-01  3.214e-01   1.850   0.0644 .
SQ.DWATCOUR -1.489e-05  1.139e-02  -0.001   0.9990
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(5.5178) family taken to be 1)

    Null deviance: 213.674  on 51  degrees of freedom
Residual deviance:  51.803  on 42  degrees of freedom
AIC: 390.11
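The number in "Negative Binomial(5.5178)" is the estimated parameter theta. In the parameterization used by glm.nb, the variance is mu + mu^2/theta, so a smaller theta means more overdispersion. The sketch below evaluates that variance for a few illustrative means (the values in mu are arbitrary) and checks the formula against the probability mass function in base R, where theta corresponds to the size argument of dnbinom().

```r
# theta copied from the glm.nb output
theta <- 5.5178

# Illustrative fitted means (arbitrary values)
mu <- c(5, 20, 60)

# Variance implied by the negative binomial at each mean
nb_variance <- mu + mu^2 / theta
print(round(cbind(mean = mu, variance = nb_variance), 1))

# Check against the probability mass function
k <- 0:5000
var_from_pmf <- sapply(mu, function(m) {
  p <- dnbinom(k, mu = m, size = theta)
  sum(k^2 * p) - sum(k * p)^2
})
print(all.equal(var_from_pmf, nb_variance, tolerance = 1e-6))
```

Unlike the quasi-Poisson, where the variance is a constant multiple of the mean, here the excess variance grows with the square of the mean.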

Tools for Model Selection

• The z-statistic from the summary (previous slide)

• Analysis of deviance table from anova(M6, test="Chi") – does sequential testing

• Drop each term in turn using drop1(M6, test="Chi")

• Manually specify a nested model and compare them using anova(M6, M7, test="Chi")

Results

• Model after model selection procedure:

> M8 <- glm.nb(TOT.N ~ OPEN.L + D.PARK, link = "log", data=RK)

> summary(M8)

• > plot(M8)

Negative Binomial Plots

[Figure: the four plot(M8) diagnostic panels for the negative binomial GLM: Residuals vs Fitted, Normal Q-Q, Scale-Location, and standardized Pearson residuals vs Leverage with Cook's distance contours]

Poisson Plots

[Figure: the same four diagnostic panels for the Poisson GLM, as shown on the earlier Model Validation Plots slide]

Which is better?

Adding Random Effects in a GLMM

• What if you have a non-Gaussian response variable AND want to include random effects in your model?

• The answer is a GLMM

• Several packages are available in R, but we will use glmer from the lme4 package

Example: Deer Parasites

• Data consist of whether or not each deer has parasites

• Deer differ by sex, size and farm of origin

• Which factors seem like they should be fixed and which are random?

• Because the response variable is binary, a binomial distribution is appropriate

Implementing the GLMM

• > library(lme4)

• > DE.lme4 <- glmer(Ec01 ~ CLength * fSex + (1 | fFarm), family=binomial, data=deer)

• > summary(DE.lme4)

Results – Part I

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: Ec01 ~ CLength * fSex + (1 | fFarm)
   Data: deer

     AIC      BIC   logLik deviance df.resid
   832.6    856.1   -411.3    822.6      821

Scaled residuals:
    Min      1Q  Median      3Q     Max
-6.2678 -0.6090  0.2809  0.5022  3.4546

Random effects:
 Groups Name        Variance Std.Dev.
 fFarm  (Intercept) 2.391    1.546
Number of obs: 826, groups:  fFarm, 24

Results – Part II

Fixed effects:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.938969   0.356003   2.638  0.00835 **
CLength       0.038964   0.006917   5.633 1.77e-08 ***
fSex2         0.624487   0.222938   2.801  0.00509 **
CLength:fSex2 0.035859   0.011409   3.143  0.00167 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) CLngth fSex2
CLength     -0.107
fSex2       -0.189  0.238
CLngth:fSx2  0.091 -0.514  0.232
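The binomial GLMM uses a logit link, so these estimates are on the log-odds scale. The sketch below back-transforms a few of them with plogis(), the inverse logit in base R. It rests on some assumptions: that the intercept describes the reference sex at CLength = 0 (the name suggests length was centered, which the slides do not state) on a farm whose random intercept is zero, and that roughly 95% of farm intercepts fall within 1.96 standard deviations of that.

```r
# Estimates copied from the glmer output
b0        <- 0.938969   # intercept (log-odds)
b_length  <- 0.038964   # CLength slope for the reference sex (log-odds per unit)
sd_farm   <- 1.546      # standard deviation of the farm random intercepts

# Probability of parasites for the reference sex at CLength = 0 on a "typical" farm
p_typical <- plogis(b0)
print(round(p_typical, 3))

# The same probability on farms at the low and high ends of the farm distribution
p_range <- plogis(b0 + c(-1.96, 1.96) * sd_farm)
print(round(p_range, 3))

# Odds ratio per unit of CLength for the reference sex
print(round(exp(b_length), 3))
```

The wide spread between the low and high farms shows why the farm effect matters here even if it is not the focus of the analysis.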

Summary

• Generalized Linear Models can accommodate non-Gaussian response variables

• It’s possible to include fixed and random effects, and then the model is called a Generalized Linear Mixed Model

• The syntax for the random effects depends upon the package that’s being used for the analysis, so be careful

Summary

• Other features can be modeled as well, and you should consult Zuur et al. and the literature if your data include:

– Temporal autocorrelation

– Spatial autocorrelation

– An excess or deficit of individuals in the zero category compared to the expectations of the exponential family of distributions
