Non-Gaussian Response Variables
What is the Generalized Model Doing?
• The fixed effects are like the factors in a traditional analysis of variance or linear model
• The random effects are different
– A generalized linear mixed model (GLMM) includes fixed and random factors
– The random effects are modeled as coming from a specified distribution (e.g., each female has a different baseline offspring size, and this baseline size has a normal distribution with an estimated variance)
– The model basically searches through parameter values to find the set of slopes and intercepts that maximizes the likelihood of the observed data
– By including a random effect, you can account for that variable’s influence on the fixed-effect analysis; you usually will not care too much about the random effect itself (but you might)
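The idea of searching through parameter values can be sketched in a few lines of R. This is only an illustration for a model with fixed effects: the data are simulated (an assumption, not the lecture's data), and the names x, y, negloglik, fit_optim and fit_glm are made up for the example.

```r
# Simulated data: an assumption for illustration, not the lecture's dataset
set.seed(1)
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))

# Negative log-likelihood of a Poisson model with a log link
negloglik <- function(par) {
  mu <- exp(par[1] + par[2] * x)
  -sum(dpois(y, lambda = mu, log = TRUE))
}

# Search over intercept and slope for the values that make the data most likely
fit_optim <- optim(c(0, 0), negloglik, method = "BFGS")

# glm() reaches the same estimates with its own fitting algorithm
fit_glm <- glm(y ~ x, family = poisson)

print(fit_optim$par)
print(unname(coef(fit_glm)))
```

The two sets of estimates should agree closely, which is the sense in which the model "searches" for the most likely slopes and intercepts.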
A Flexible Modeling Framework
• GLMMs (and related models) allow a lot of modeling flexibility
– Explicit modeling of heterogeneity
– Including a spatial or temporal component
– Response variables that are not normally distributed
– Include fixed and random factors
– Naturally allows nested designs and repeated measures
Exponential Family of Distributions
• Normal: symmetric, continuous
• Poisson: asymmetric, discrete
– Rare events, like number of robberies per week in College Station
– The mean equals the variance
• Binomial: asymmetric, discrete
– Number of occurrences (successes) in a fixed number of trials
– The mean is larger than the variance
• Negative binomial: asymmetric, discrete
– A count distribution like the Poisson, except the variance is larger than the mean
• Gamma: asymmetric, continuous
– Can have a variety of shapes, but all observations are positive
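The mean and variance relationships listed above can be checked by simulation. The sketch below draws from each distribution with parameter values chosen arbitrarily (an assumption) so that every mean is 4; the object names are hypothetical.

```r
set.seed(42)
n <- 1e5

pois  <- rpois(n, lambda = 4)             # variance should be close to the mean
binom <- rbinom(n, size = 10, prob = 0.4) # mean 4, variance 10 * 0.4 * 0.6 = 2.4
nbin  <- rnbinom(n, mu = 4, size = 2)     # mean 4, variance 4 + 4^2 / 2 = 12
gam   <- rgamma(n, shape = 2, rate = 0.5) # continuous and strictly positive

summary_tab <- data.frame(
  distribution = c("Poisson", "Binomial", "Negative binomial", "Gamma"),
  mean     = c(mean(pois), mean(binom), mean(nbin), mean(gam)),
  variance = c(var(pois), var(binom), var(nbin), var(gam))
)
print(summary_tab)
```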
Parts of a GLM
• The distribution of the response variable
– Usually we assume it’s normal
• Specification of the systematic component in terms of explanatory variables
– The fixed effects
– If we had random factors, it would be a GLMM
• The link between the systematic part and the response variable
– In our usual models, the link is the identity link, where the expected value of the response variable is directly estimated (like from the equation for a line: y = mx+b)
Implementing a Poisson GLM
• Now the response variable has a Poisson distribution
• We specify the systematic part of the model in the usual way (same goes for random parts if we want those)
• The link is logarithmic, which ensures that the predicted values are always non-negative (a Poisson distribution doesn’t allow negative values)
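Written out for a single explanatory variable, the three parts of a Poisson GLM look as follows. The symbols (Y_i for the count at site i, X_i for the explanatory variable, alpha and beta for the intercept and slope) are notation assumed here for illustration.

```latex
\begin{aligned}
Y_i &\sim \text{Poisson}(\mu_i), \qquad E(Y_i) = \operatorname{var}(Y_i) = \mu_i \\
\log(\mu_i) &= \alpha + \beta X_i \\
\mu_i &= e^{\alpha + \beta X_i} > 0
\end{aligned}
```

The last line is the reason the log link keeps predictions in the allowed range: whatever values the intercept and slope take, the exponential is positive.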
Example: Amphibian Roadkills
• Dataset: Roadkills of amphibians at 52 sites of varying distance from a “natural park”
• Number of roadkills is not normally distributed
• Amphibian getting run over by car might be a rare, random event, so you might expect it to have a Poisson distribution (at each distance)
Plot of Roadkills on Distance
[Figure: scatterplot of Total Roadkills (y-axis, 0 to 100) against Distance from Park (x-axis, 0 to 25000)]
Fitting a Poisson GLM
• > M1 <- glm(TOT.N ~ D.PARK, family=poisson, data=roadkills)
• > summary(M1)
Poisson GLM Output
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-8.1100  -1.6950  -0.4708   1.4206   7.3337

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.316e+00  4.322e-02   99.87   <2e-16 ***
D.PARK      -1.059e-04  4.387e-06  -24.13   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1071.4  on 51  degrees of freedom
Residual deviance:  390.9  on 50  degrees of freedom
AIC: 634.29

Number of Fisher Scoring iterations: 4
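Because of the log link, the coefficients in this output are on the log scale, so they need to be exponentiated before they can be read as counts. A small sketch using the printed estimates (the variable names are hypothetical):

```r
# Estimates copied from the Poisson GLM output
intercept <- 4.316
slope     <- -1.059e-04

# Expected number of roadkills at a distance of 0 and of 10000
mu_0     <- exp(intercept)
mu_10000 <- exp(intercept + slope * 10000)

# Multiplicative change in the expected count per 1000 units of D.PARK
ratio_1000 <- exp(slope * 1000)

print(c(mu_0 = mu_0, mu_10000 = mu_10000, ratio_1000 = ratio_1000))
```

A ratio just under 0.9 means the expected count falls by roughly 10% for every additional 1000 units of distance from the park.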
Meaning of Deviance
• Null and residual deviances are kind of like maximum likelihood equivalents of the total and residual sums of squares
• An R²-like term can be obtained from:

$$100 \times \frac{\text{null deviance} - \text{residual deviance}}{\text{null deviance}}$$

• Applying this relationship to the previous model, we find that it explains 63.5% of the variation
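The 63.5% figure can be reproduced from the two deviances printed in the model summary. The variable names below are hypothetical; with a fitted model object the same quantities are available as M1$null.deviance and deviance(M1).

```r
# Deviances copied from the summary of M1
null_dev  <- 1071.4
resid_dev <- 390.9

# Percentage of the null deviance explained by the model
pseudo_r2 <- 100 * (null_dev - resid_dev) / null_dev
print(round(pseudo_r2, 1))
```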
Fitting a Line for the Model
[Figure: the same scatterplot of Total Roadkills against Distance from Park, with the fitted Poisson GLM curve and its upper and lower bands added]
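Code for the Lines
# Distances at which to predict
MyData <- data.frame(D.PARK = seq(from = 0, to = 25000, by = 1000))
# Predict on the link (log) scale, with standard errors
G <- predict(M1, newdata = MyData, type = "link", se.fit = TRUE)
# Back-transform the fit and the approximate 95% bands to the count scale
F <- exp(G$fit)
FSEUP <- exp(G$fit + 1.96 * G$se.fit)
FSELOW <- exp(G$fit - 1.96 * G$se.fit)
# Add the fitted line (solid) and the bands (dashed) to the existing plot
lines(MyData$D.PARK, F, lty = 1, lwd = 3)
lines(MyData$D.PARK, FSEUP, lty = 2, lwd = 3)
lines(MyData$D.PARK, FSELOW, lty = 2, lwd = 3)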
Model Selection in a Poisson GLM
• Option 1: Drop terms sequentially and test full and reduced models
• Option 2: Use the “drop1” command to drop each explanatory variable in turn
• Option 3: Use the anova command to sequentially remove each term and compare the resulting models to the original full model
The drop1 command
• Example: Still roadkills, but with nine explanatory variables
• > M2 <- glm(TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + D.PARK + SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR, family=poisson, data=RK)
• > summary(M2)
• > drop1(M2, test="Chi")
Results of drop1()
Single term deletions

Model:
TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + D.PARK + SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR
            Df Deviance     AIC    LRT  Pr(>Chi)
<none>           270.23  529.62
OPEN.L       1   273.93  531.32   3.69 0.0546474 .
MONT.S       1   306.89  564.28  36.66 1.410e-09 ***
SQ.POLIC     1   285.53  542.92  15.30 9.181e-05 ***
D.PARK       1   838.09 1095.48 567.85 < 2.2e-16 ***
SQ.SHRUB     1   298.31  555.70  28.08 1.167e-07 ***
SQ.WATRES    1   280.02  537.41   9.79 0.0017539 **
L.WAT.C      1   335.47  592.86  65.23 6.648e-16 ***
SQ.LPROAD    1   281.25  538.64  11.02 0.0009009 ***
SQ.DWATCOUR  1   272.50  529.89   2.27 0.1319862
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
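Each row of the drop1() table is a likelihood ratio test: the LRT column is the increase in deviance when that term is removed, compared against a chi-square distribution with Df degrees of freedom. A sketch for the OPEN.L row, using the rounded deviances from the table (so the result differs slightly from the printed 3.69 and 0.0546); the variable names are hypothetical.

```r
# Deviances copied from the drop1() table
dev_full        <- 270.23  # <none>: the full model
dev_without_var <- 273.93  # model with OPEN.L removed

# Likelihood ratio statistic and its chi-square p-value on 1 degree of freedom
lrt <- dev_without_var - dev_full
p   <- pchisq(lrt, df = 1, lower.tail = FALSE)

print(c(LRT = lrt, p_value = p))
```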
Overdispersion
• Recall that the Poisson distribution assumes the variance is equal to the mean
• If the variance is greater than the mean, then a Poisson will not accurately describe the data
• This problem is called overdispersion
Detecting Overdispersion
• Calculate:

$$\frac{D}{n - p}$$

• D is the residual deviance of the model [It was 390.9 in model M1 a few slides ago].
• n - p represents the degrees of freedom for the residual deviance [also reported by the summary() function; in this case 50]
• If this value is around 1, then overdispersion should not be a problem
• If it is substantially greater than 1, then overdispersion is a problem.
• 390.9/50 = 7.8, so overdispersion is a problem in this dataset.
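The check is a one-line calculation once a model is fitted. The sketch below first reproduces the lecture's value and then, as an illustration on simulated data (an assumption, since the roadkill data are not reproduced here), shows how the statistic behaves for counts that are truly Poisson versus counts with extra variance. The helper name dispersion is made up.

```r
# Value from the lecture: residual deviance over residual degrees of freedom
lecture_ratio <- 390.9 / 50

# The same statistic computed from any fitted glm object
dispersion <- function(model) deviance(model) / df.residual(model)

# Simulated data: one truly Poisson response and one overdispersed response
set.seed(7)
x      <- runif(500)
y_pois <- rpois(500, lambda = exp(1 + x))
y_over <- rnbinom(500, mu = exp(1 + x), size = 1)

d_pois <- dispersion(glm(y_pois ~ x, family = poisson))
d_over <- dispersion(glm(y_over ~ x, family = poisson))

print(c(lecture = lecture_ratio, poisson = d_pois, overdispersed = d_over))
```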
Overdispersion in a Poisson GLM
• One approach is to use a quasi-Poisson GLM
• This model includes a dispersion parameter to better model the variance relative to the mean
• If the dispersion parameter (φ) is large then it might be better to use a different model
Fitting a Quasipoisson
• > M4 <- glm(TOT.N ~ D.PARK, family=quasipoisson, data=RK)
• > summary(M4)
Results
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-8.1100  -1.6950  -0.4708   1.4206   7.3337

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.316e+00  1.194e-01  36.156  < 2e-16 ***
D.PARK      -1.058e-04  1.212e-05  -8.735 1.24e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 7.630148)

    Null deviance: 1071.4  on 51  degrees of freedom
Residual deviance:  390.9  on 50  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4
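One way to see what the dispersion parameter does is to compare the standard errors in this output with those from the Poisson fit: the quasi-Poisson standard errors are the Poisson ones multiplied by the square root of the dispersion parameter. A quick check using the printed values (variable names are hypothetical):

```r
# Dispersion parameter reported by the quasi-Poisson summary.
# R estimates it from the Pearson residuals, which is why it is close to,
# but not the same as, the deviance-based ratio 390.9/50.
phi <- 7.630148

# Standard errors copied from the Poisson GLM output
se_poisson <- c(intercept = 4.322e-02, D.PARK = 4.387e-06)

# Quasi-Poisson standard errors are inflated by sqrt(phi)
se_quasi <- se_poisson * sqrt(phi)
print(signif(se_quasi, 4))
```

The inflated standard errors are what make the test statistics smaller and the p-values larger in the quasi-Poisson output.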
Model Selection in Quasipoisson
• AIC is not defined for a quasipoisson model, so you can’t use AIC
• It’s possible to compare models using F-tests
• drop1(M5, test="F")
Model Validation in Poisson GLM
• Pearson residuals: the difference between the observed and fitted values, divided by the square root of the variance, which for a Poisson equals the fitted mean (scaling is needed because the variance of the Poisson changes with the mean)
• Deviance residuals: the contribution of each observation to the residual deviance. In other words, a measure of how badly that point fits.
• The default is to use the deviance residuals for model validation, and they will usually be the best choice.
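Both kinds of residual can be computed by hand, which makes the definitions concrete. The sketch below uses simulated Poisson data (an assumption) and compares the manual versions with what resid() returns; all object names are hypothetical.

```r
# Simulated data: an assumption for illustration
set.seed(3)
x <- runif(100)
y <- rpois(100, lambda = exp(1 + 2 * x))
m <- glm(y ~ x, family = poisson)
mu <- unname(fitted(m))

# Pearson residual: raw residual divided by the square root of the Poisson variance (= mu)
pearson_manual <- (y - mu) / sqrt(mu)

# Deviance residual: signed square root of each observation's contribution to the deviance.
# The term y * log(y / mu) is taken to be 0 when y is 0.
unit_dev <- 2 * (ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
deviance_manual <- sign(y - mu) * sqrt(pmax(unit_dev, 0))

# The squared deviance residuals add up to the residual deviance
print(c(sum_sq = sum(deviance_manual^2), deviance = deviance(m)))
```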
What to Plot
• Deviance residuals versus:
– The fitted values
– Each explanatory variable in the model
– Each explanatory variable dropped from the model
– Against time (if it’s available)
– Against any spatial aspect of the data
• We don’t expect normality, but we are looking for patterns and fit
Model Validation Plots
[Figure: four diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours); observations 1, 2, and 19 are labeled in the first three panels]
Model Validation Plots
[Figure: four panels plotting residuals against the fitted values (mu): Response residuals (E), Pearson residuals (EP), Pearson residuals scaled (EP2), and Deviance residuals (ED)]
Code for the Validation Plots
# Model validation example
M5 <- glm(TOT.N ~ D.PARK, family = quasipoisson, data = RK)
plot(M5)                              # standard four-panel diagnostics
EP <- resid(M5, type = "pearson")     # Pearson residuals
ED <- resid(M5, type = "deviance")    # deviance residuals
mu <- predict(M5, type = "response")  # fitted values on the count scale
E <- RK$TOT.N - mu                    # response (raw) residuals
EP2 <- E / sqrt(7.630148 * mu)        # Pearson residuals scaled by the dispersion parameter
op <- par(mfrow = c(2, 2))
plot(x = mu, y = E, main = "Response residuals")
plot(x = mu, y = EP, main = "Pearson residuals")
plot(x = mu, y = EP2, main = "Pearson residuals scaled")
plot(x = mu, y = ED, main = "Deviance residuals")
par(op)
Interpretation
• This model has a couple of problems
• First, the residuals have a clear pattern, where they are above the predicted line at some distances and below it at others
• Second, some outliers are strongly influencing the results
Negative Binomial GLM
• Assumes:
– The distribution of the response variable is negative binomial for any value of X. Recall that the variance is larger than the mean for a negative binomial distribution
– The link function is logarithmic, which ensures that the fitted values are always non-negative
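These assumptions can be written out in the same way as for the Poisson GLM. The notation below is assumed for illustration; theta is the extra parameter that glm.nb() in MASS estimates, and it controls how much larger the variance is than the mean.

```latex
\begin{aligned}
Y_i &\sim \text{NB}(\mu_i, \theta) \\
E(Y_i) &= \mu_i, \qquad \operatorname{var}(Y_i) = \mu_i + \frac{\mu_i^2}{\theta} \\
\log(\mu_i) &= \alpha + \beta X_i
\end{aligned}
```

As theta becomes large the second variance term vanishes and the model approaches a Poisson; a small theta means strong overdispersion.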
Fitting a Negative Binomial GLM
• > library(MASS)
• > M6 <- glm.nb(TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + D.PARK + SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR, link="log", data=RK)
• > summary(M6, cor=FALSE)
Some output
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.951e+00  4.145e-01   9.532   <2e-16 ***
OPEN.L      -9.419e-03  3.245e-03  -2.903   0.0037 **
MONT.S       5.846e-02  3.481e-02   1.679   0.0931 .
SQ.POLIC    -4.618e-02  1.298e-01  -0.356   0.7221
D.PARK      -1.235e-04  1.292e-05  -9.557   <2e-16 ***
SQ.SHRUB    -3.881e-01  2.883e-01  -1.346   0.1784
SQ.WATRES    1.631e-01  1.675e-01   0.974   0.3301
L.WAT.C      2.076e-01  9.636e-02   2.154   0.0312 *
SQ.LPROAD    5.944e-01  3.214e-01   1.850   0.0644 .
SQ.DWATCOUR -1.489e-05  1.139e-02  -0.001   0.9990
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(5.5178) family taken to be 1)

    Null deviance: 213.674  on 51  degrees of freedom
Residual deviance:  51.803  on 42  degrees of freedom
AIC: 390.11
Tools for Model Selection
• The z-statistic from the summary (previous slide)
• Analysis of deviance table from anova(M6, test="Chi"), which does sequential testing
• Drop each term in turn using drop1(M6, test="Chi")
• Manually specify a nested model and compare them using anova(M6, M7, test="Chi")
Results
• Model after model selection procedure:
> M8 <- glm.nb(TOT.N ~ OPEN.L + D.PARK, link = "log", data=RK)
> summary(M8)
• > plot(M8)
[Figure: Negative Binomial Plots. Four diagnostic panels from plot(M8): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]
[Figure: Poisson Plots. The same four diagnostic panels for the Poisson model]
Which is better?
Adding Random Effects in a GLMM
• What if you have a non-Gaussian response variable AND want to include random effects in your model?
• The answer is a GLMM
• Several packages are available in R, but we will use glmer from the lme4 package
Example: Deer Parasites
• Data consist of whether or not each deer has parasites
• Deer differ by sex, size and farm of origin
• Which factors seem like they should be fixed and which are random?
• Because the response variable is binary, a binomial distribution is appropriate
Implementing the GLMM
• > library(lme4)
• > DE.lme4 <- glmer(Ec01 ~ CLength * fSex + (1 | fFarm), family=binomial, data=deer)
• > summary(DE.lme4)
Results – Part I
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: Ec01 ~ CLength * fSex + (1 | fFarm)
   Data: deer

     AIC      BIC   logLik deviance df.resid
   832.6    856.1   -411.3    822.6      821

Scaled residuals:
    Min      1Q  Median      3Q     Max
-6.2678 -0.6090  0.2809  0.5022  3.4546

Random effects:
 Groups Name        Variance Std.Dev.
 fFarm  (Intercept) 2.391    1.546
Number of obs: 826, groups:  fFarm, 24
Results – Part II
Fixed effects:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.938969   0.356003   2.638  0.00835 **
CLength       0.038964   0.006917   5.633 1.77e-08 ***
fSex2         0.624487   0.222938   2.801  0.00509 **
CLength:fSex2 0.035859   0.011409   3.143  0.00167 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) CLngth fSex2
CLength     -0.107
fSex2       -0.189  0.238
CLngth:fSx2  0.091 -0.514  0.232
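The fixed effects are on the logit scale, so the inverse logit (plogis() in R) turns them into probabilities. The sketch below assumes that CLength is a centered length, so that CLength = 0 is an average-sized deer, and that fSex2 is the second level of the sex factor; neither is stated on the slides. The variable names are hypothetical.

```r
# Estimates copied from the glmer output
b_intercept <- 0.938969
b_sex2      <- 0.624487
sd_farm     <- 1.546   # standard deviation of the farm random intercept

# Probability of parasites at CLength = 0 on a typical farm (random effect = 0)
p_sex1 <- plogis(b_intercept)
p_sex2 <- plogis(b_intercept + b_sex2)

# Same deer of the first sex, on farms one standard deviation below and above the typical farm
p_low_farm  <- plogis(b_intercept - sd_farm)
p_high_farm <- plogis(b_intercept + sd_farm)

print(round(c(sex1 = p_sex1, sex2 = p_sex2,
              low_farm = p_low_farm, high_farm = p_high_farm), 3))
```

The spread between the low and high farms gives a feel for how much of the variation in infection the random effect is absorbing.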
Summary
• Generalized Linear Models can accommodate non-Gaussian response variables
• It’s possible to include fixed and random effects, and then the model is called a Generalized Linear Mixed Model
• The syntax for the random effects depends upon the package that’s being used for the analysis, so be careful
Summary
• Other features can be modeled as well, and you should consult Zuur et al. and the literature if your data include:
– Temporal autocorrelation
– Spatial autocorrelation
– An excess or deficit of individuals in the zero category compared to the expectations of the exponential family of distributions