modern regression and mortality · high levels of particulates (pm10) on short term, non-accidental...


Post on 03-Aug-2020


Modern regression and Mortality

Doug Nychka

National Center for Atmospheric Research

www.cgd.ucar.edu/~nychka

• Statistical models

• GLM models

• Flexible regression

• Combining GLM and splines.


Statistical tools for the analysis of mortality related to air pollution and temperature.

The goal of this talk is to give a gentle introduction to the models used by Zeger, Dominici et al. (SPH, Johns Hopkins) for explaining mortality based on environmental factors.

The Johns Hopkins group is focused on the effect of high levels of particulates (PM10) on short-term, non-accidental mortality rates.

Their work draws on the National Morbidity, Mortality and Air Pollution Study Database (NMMAPS). (Welty, Peng and McDermott)


Schematic of PM10 model:

Mortality in an urban center =

Calendar effects + Age categories

+ seasonal effects

+ particulates + temperature/humidity stress

+ interaction of temperature with age

+ random component.

This work is a useful motivation for understanding

• Poisson regression

• Flexible regression models


Some data attributes:

• A response depends on other variables (mortality depends on temperature and PM):

$Y_k = f(X_k) + \text{noise}$

Here we interpret the noise as having a mean of zero.

• The explanatory variables can be either categorical (age category, day of week) or continuous (daily average temperature, PM).

• The response may be discrete or non-Gaussian.


Issues for data analysis

How does one specify all of these components in a simple and unambiguous fashion?

How does one estimate a multivariate functional relationship?


Formulas

Some variables:

mort: Daily non-accidental mortality for three age categories

time: Day

tmp: Daily average temperature

tmp3: Daily average temperature for past three days

PM: Daily particulates (PM10, particle diameter < 10 µm)

AgeCat: Three age categories: <65, 65-74, ≥75


Formula for a linear regression of mortality on 3-day temperature

mort ~ tmp3

Temperature and PM

mort ~ tmp + PM

Dependence on the age categories

mort ~ AgeCat + tmp + PM

These are additive models because each variable appears by itself, so the contribution of each can be inferred from its individual value.
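As a rough illustration of how these formulas are used, the sketch below fits the three additive models with lm(). The data frame `dat` and every number in it are synthetic stand-ins invented for this example; they are not NMMAPS values.

```r
# Synthetic stand-in data; none of these numbers come from NMMAPS.
set.seed(1)
n <- 300
dat <- data.frame(
  AgeCat = factor(rep(1:3, each = n / 3)),  # three age categories
  tmp    = runif(n, 20, 90),                # daily average temperature
  PM     = runif(n, 5, 80)                  # daily PM10
)
dat$tmp3 <- dat$tmp + rnorm(n, sd = 3)      # crude proxy for a 3-day average
age_shift <- c(0, -20, 40)[as.integer(dat$AgeCat)]
dat$mort <- 60 - 0.2 * dat$tmp + 0.05 * dat$PM + age_shift + rnorm(n, sd = 5)

fit1 <- lm(mort ~ tmp3, data = dat)               # one predictor
fit2 <- lm(mort ~ tmp + PM, data = dat)           # two additive terms
fit3 <- lm(mort ~ AgeCat + tmp + PM, data = dat)  # adds a shift per age category

coef(fit3)  # intercept, two age offsets, one slope each for tmp and PM
```

In the last fit the age categories only move the line up or down; the slopes for tmp and PM are shared by all three groups.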


Interactive dependence on the age categories

mort ~ AgeCat + tmp + AgeCat:tmp

or just

mort ~ AgeCat*tmp

This gives three different slopes and intercepts. The : denotes an interaction; * includes all possible terms (main effects and interactions). The default coding conventions (for example, reporting coefficients as contrasts against a baseline category) make the individual slopes and intercepts less transparent in the fit.

The following form is more interpretable:

mort ~ AgeCat + AgeCat:tmp - 1

Coefficients:

AgeCat1  AgeCat2  AgeCat3  tmp:AgeCat1  tmp:AgeCat2  tmp:AgeCat3
 66.804   44.522  121.158       -0.156       -0.139       -0.473
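The two ways of writing the interaction model describe the same fit and differ only in how the coefficients are reported. A minimal sketch on simulated data follows; the intercepts and slopes used to simulate loosely echo the coefficients above but are otherwise arbitrary, and `dat` is a made-up data frame.

```r
set.seed(2)
n <- 300
dat <- data.frame(AgeCat = factor(rep(1:3, each = n / 3)),
                  tmp    = runif(n, 20, 90))
# Simulation parameters (assumed for illustration only)
level <- c(67, 45, 121)[as.integer(dat$AgeCat)]
slope <- c(-0.15, -0.14, -0.47)[as.integer(dat$AgeCat)]
dat$mort <- level + slope * dat$tmp + rnorm(n, sd = 5)

fit_star   <- lm(mort ~ AgeCat * tmp, data = dat)
fit_direct <- lm(mort ~ AgeCat + AgeCat:tmp - 1, data = dat)

# Same fitted values, different coefficient bookkeeping
all.equal(fitted(fit_star), fitted(fit_direct))

coef(fit_star)    # baseline intercept and slope, then differences from baseline
coef(fit_direct)  # one intercept and one slope per age category
```

With the second form, the slope for an age category can be read off directly instead of being assembled from a baseline slope plus a contrast.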


All models are wrong, but some are useful ...

but some models are more wrong than others!

Changing variance is missed.

Figure: absolute residuals from the three-slopes model.


Subtleties of the temperature response are missed!


A Generalized Linear Model (GLM)

Assume that mortality ($M_t$) is approximately Poisson distributed.

Mean model with log link function

$M_t \sim \text{Poisson}(\mu_t)$

$E(M_t) = \mu_t$

Model the log of the mean:

$\log(\mu_t) = f(\text{covariates})$
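To make the log link concrete, here is a worked case assuming $f$ is linear in a single covariate $x_t$. The symbols $\beta_0$, $\beta_1$ and $x_t$ are introduced only for this illustration and do not appear in the slides.

```latex
\log(\mu_t) = \beta_0 + \beta_1 x_t
\quad\Longrightarrow\quad
\mu_t = e^{\beta_0}\, e^{\beta_1 x_t},
\qquad
\frac{\mu_t \,\big|_{x_t + 1}}{\mu_t \,\big|_{x_t}} = e^{\beta_1}
```

Effects that are additive on the log scale therefore act multiplicatively on expected mortality: a one-unit increase in the covariate multiplies the mean by $e^{\beta_1}$.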


Variance model with over-dispersion

The Poisson distribution has a variance equal to the mean: $\mathrm{Var}(M_t) = \mu_t$. This allows modeling of both the mean and variance simultaneously.

Often the Poisson does not include enough variability, so an additional parameter is included:

$\mathrm{Var}(M_t) = \phi \mu_t$

where $\phi$ is the over-dispersion parameter.
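One common way to estimate the over-dispersion parameter is the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below assumes simulated counts from a negative binomial distribution, used here only as a convenient source of extra-Poisson variability (its variance is not exactly proportional to the mean); `x`, `y` and all constants are invented for the example.

```r
set.seed(3)
n  <- 500
x  <- runif(n, 0, 1)
mu <- exp(2 + 0.5 * x)
# Negative binomial counts: Var = mu + mu^2 / size, so more spread than Poisson
y  <- rnbinom(n, mu = mu, size = 5)

fit <- glm(y ~ x, family = quasipoisson)

# Moment estimate of phi: Pearson chi-square over residual degrees of freedom
phi_hat <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi_hat
summary(fit)$dispersion  # essentially the same quantity, as reported by summary()
```

An estimate well above 1 indicates more variability than a plain Poisson model allows; the quasi-Poisson fit inflates standard errors accordingly while leaving the coefficient estimates unchanged.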


Example of a Poisson Response


GLM fit to mortality

First consider the linear model but using Poisson regression. $\log(\mu_t)$ has three different slopes as a function of temperature, based on the age categories.

In R code:

glm(mort ~ AgeCat*tmp3, family = quasipoisson)

This results in residuals that better fit model assumptions.
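A small simulated check of the residual claim, under assumed log-scale intercepts and slopes (all values and the data frame `dat` are hypothetical): Pearson residuals divide by the square root of the fitted mean, so their spread should be roughly the same in each age category even when the raw counts differ a lot in size.

```r
set.seed(4)
n <- 600
dat <- data.frame(AgeCat = factor(rep(1:3, each = n / 3)),
                  tmp3   = runif(n, 20, 90))
# Assumed log-scale intercepts and slopes, chosen only for illustration
b0 <- c(4.2, 3.8, 4.8)[as.integer(dat$AgeCat)]
b1 <- c(-0.003, -0.003, -0.005)[as.integer(dat$AgeCat)]
dat$mort <- rpois(n, lambda = exp(b0 + b1 * dat$tmp3))

fit <- glm(mort ~ AgeCat * tmp3, family = quasipoisson, data = dat)

# Spread of the scaled residuals within each age category
res_sd <- tapply(residuals(fit, type = "pearson"), dat$AgeCat, sd)
res_sd
```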


Flexible Regression

The response of mortality is not linear, and it is not clear what functional form to choose.

One strategy is to represent the unknown function as a linear combination of basis functions

$f(\text{temp}) = \sum_{j=1}^{M} \psi_j(\text{temp})\,\beta_j$

Splines: Local basis functions that allow one to control the flexibility of the shape of $f$.
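The basis expansion can be sketched with bs() from the splines package that ships with R. The temperature grid, the choice M = 8, and the coefficient values below are arbitrary assumptions for illustration.

```r
library(splines)  # part of the standard R distribution

temp <- seq(20, 90, length.out = 200)

# Cubic B-spline basis with M = 8 basis functions psi_1, ..., psi_8
Psi <- bs(temp, df = 8, degree = 3, intercept = TRUE)
dim(Psi)  # one row per temperature, one column per basis function

# Any coefficient vector beta defines a curve f(temp) = sum_j psi_j(temp) * beta_j
beta <- c(1.0, 0.6, 0.2, 0.1, 0.1, 0.3, 0.6, 0.9)  # arbitrary values
f <- as.vector(Psi %*% beta)

# Locality: count the temperatures at which each basis function is nonzero
colSums(Psi > 0)
```

Because f is linear in beta, estimating the curve reduces to an ordinary regression on the columns of Psi.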


Three factors that control splines:

• number of knots

• location of knots (usually equally spaced)

• order of fit (usually cubic)

A cubic B-spline basis for temperature (6 knots)

Functions are piecewise cubic in between knots.


Controlling the resolution

4 knots

12 knots
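How the knot count sets the resolution can be seen from the size of the basis. In this sketch `basis_with_knots` is a hypothetical helper, and it counts interior knots only; whether the knot counts on the slide include the boundary knots is not stated, so treat the correspondence as approximate.

```r
library(splines)

temp <- seq(20, 90, length.out = 200)

# Hypothetical helper: cubic B-spline basis with k equally spaced interior knots
basis_with_knots <- function(x, k) {
  inner <- seq(min(x), max(x), length.out = k + 2)[-c(1, k + 2)]
  bs(x, knots = inner, degree = 3, intercept = TRUE)
}

ncol(basis_with_knots(temp, 4))   # k + degree + 1 basis functions
ncol(basis_with_knots(temp, 12))  # more knots, more columns, finer resolution
```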


Applying the spline model to temperature

The simplest model is now

$\log(\mu_t) = \text{AgeCat} + f(\text{tmp3}_t) = \text{AgeCat} + \sum_{j=1}^{M} \psi_j(\text{tmp3}_t)\,\beta_j$

where $\psi_j$ are the B-splines.

There is no interaction here between age category and the temperature response.

An extension is to have a distinct response to temperature for each age category. This gives three separate B-spline curves with different coefficients.
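Both versions of the spline model can be written as glm() formulas. The sketch below assumes a simulated U-shaped temperature effect and a five-column basis; these choices, and the data frame `dat`, are illustrative and not taken from the talk.

```r
library(splines)

set.seed(5)
n <- 900
dat <- data.frame(AgeCat = factor(rep(1:3, each = n / 3)),
                  tmp3   = runif(n, 20, 90))
# Assumed U-shaped temperature effect on the log scale, stronger in category 3
u   <- ((dat$tmp3 - 65) / 30)^2
lev <- c(4.0, 3.6, 4.6)[as.integer(dat$AgeCat)]
amp <- c(0.1, 0.1, 0.3)[as.integer(dat$AgeCat)]
dat$mort <- rpois(n, exp(lev + amp * u))

# Common temperature curve, shifted up or down by age category
fit_common <- glm(mort ~ AgeCat + bs(tmp3, df = 5),
                  family = quasipoisson, data = dat)

# Separate B-spline coefficients for each age category
fit_by_age <- glm(mort ~ AgeCat + AgeCat:bs(tmp3, df = 5),
                  family = quasipoisson, data = dat)

length(coef(fit_common))  # 1 + 2 + 5 coefficients
length(coef(fit_by_age))  # 1 + 2 + 3 * 5 coefficients
```

The second model triples the number of spline coefficients, which is the price of letting each age category have its own temperature curve.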


Results with 4 knots

Relative Risk estimates:


Results with 6 knots

Relative Risk estimates:


Issues of inference

Selecting the amount of curve flexibility

There are information criteria and cross-validation techniques to do this.

Uncertainty in estimates

Parametric bootstrapping (fitting synthetic data sets generated from the fitted model) gives a useful measure of error.

Bayesian methods can also give an idea of the uncertainty of the estimated relationships.
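A minimal sketch of two of these ideas on simulated data: choosing the spline flexibility by AIC, then a parametric bootstrap for the fitted curve at one temperature. The true curve, the candidate range of df, the 200 replicates and the evaluation point of 85 are all assumptions of this example; family = poisson is used because AIC needs a full likelihood, which the quasi-Poisson family does not provide.

```r
library(splines)

set.seed(6)
n    <- 400
tmp3 <- runif(n, 20, 90)
mort <- rpois(n, exp(4 + 0.3 * ((tmp3 - 65) / 30)^2))  # assumed true curve

# 1. Compare spline flexibility with AIC
dfs  <- 3:8
aics <- sapply(dfs, function(d) AIC(glm(mort ~ bs(tmp3, df = d), family = poisson)))
best_df <- dfs[which.min(aics)]

# 2. Parametric bootstrap: simulate counts from the fitted model, refit,
#    and record the fitted log-mean at a temperature of interest
fit  <- glm(mort ~ bs(tmp3, df = best_df), family = poisson)
mu   <- fitted(fit)
new  <- data.frame(tmp3 = 85)
boot <- replicate(200, {
  mort_star <- rpois(n, mu)
  fit_star  <- glm(mort_star ~ bs(tmp3, df = best_df), family = poisson)
  predict(fit_star, newdata = new)
})
sd(boot)  # bootstrap standard error of the fitted log-mean at 85
```

This bootstrap conditions on the selected df, so it does not capture the extra uncertainty from the selection step itself.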


This talk is ending too soon!

Standardized residuals against time:

Use an additive model:

$\log(\mu_t) = f_1(\text{tmp3}_t) + f_2(t)$

$f_1$ and $f_2$ are both modeled using B-splines.
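The additive model with a time trend can be sketched the same way, with one B-spline term per component. The three years of simulated daily data, the seasonal temperature cycle, the slow downward trend and the basis sizes are all assumptions made for this example.

```r
library(splines)

set.seed(7)
n_days <- 3 * 365
dat <- data.frame(time = seq_len(n_days))
# Assumed seasonal temperature cycle plus noise
dat$tmp3 <- 55 + 25 * sin(2 * pi * (dat$time - 100) / 365) + rnorm(n_days, sd = 5)
# Assumed log-mean: U-shaped temperature effect plus a slow downward time trend
dat$mort <- rpois(n_days,
                  exp(4 + 0.3 * ((dat$tmp3 - 65) / 30)^2 - 0.0002 * dat$time))

fit <- glm(mort ~ bs(tmp3, df = 5) + bs(time, df = 8),
           family = quasipoisson, data = dat)

# The two additive components f1 and f2 on the log scale (centered by predict)
parts <- predict(fit, type = "terms")
colnames(parts)
```

Plotting each column of `parts` against its own variable gives the kind of additive curves shown for temperature and for time.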


Additive curves for time trends


Remarks

• There are many new statistical tools to discern structure in complex data sets.

• Inference can be formalized by Bayesian methods.

• One challenge is to combine models across cities.
