Going Beyond GIS for Environmental Health
Frank C. Curriero
Environmental Health Sciences and Biostatistics
Bloomberg School of Public Health
EnviroHealth Connections Summer Institute
2006
Bio
• Joint appt. in Env Health Sci and Biostatistics
• PhD in Statistics
• Research agenda is spatial statistics
Statistics
Env Health Geography (GIS)
Spatial Statistics
Objectives
• Provide exposure to the field of spatial statistics. Keep it simple (non-technical)
Applications of GIS in Environmental Health
Beyond GIS, maps make you think/question
Current research topics
• Geography (location) is a source of variation worth considering in environmental health investigations.
What is Spatial Statistics?
Statistics for the analysis of spatial data
“spatial” “geographic”
What is Spatial Data?
The “where,” in addition to the “what” that was observed or measured, is important and is recorded with the data.
Location information (the “where”) can vary.
What is GIS?
Stands for Geographic Information System
Anything more depends on who you ask!
What is a GIS?
One word def: Database
Two word def: Visual Database
Visual database for geographic data
• Stores
• Manipulates
• Analyzes
• Queries
• Creates
• Displays
. . . . MAPS
“Layer cake of information”
What else:
- A computer system (piece of software) with a tremendous amount of capability for storing, querying, combining, presenting, . . . , spatial data.
- GIS is designed specifically for spatial data and hence built to handle all of its complicated features.
- GIS is a generic name like word processor. ArcGIS, MapInfo, Idrisi are examples of different GIS.
- The earth does not have to be the backdrop for every GIS application, but certainly most common.
What else (cont.)
- Public health was not the first and will probably not be the last application of GIS and spatial statistics.
- GIS as a mechanism for generating hypotheses (exploratory spatial data analysis).
- GIS is a tool, a very powerful and valuable tool when working with spatial data.
Applications in Spatial Statistics and GIS
• Waterborne disease outbreaks
• DDE soil contamination
• Lyme Disease
• Prostate cancer mapping
• Chesapeake Bay water quality assessment
US Waterborne Disease Outbreaks, 1948-1994
Outbreak Data
Location          Longitude   Latitude   Month   Year
AL, Anniston      -85.83      33.65      Oct     1953
AL, Center Pt.    -86.68      33.63      Nov     1958
...               ...         ...        ...     ...
WY, Cody          -109.06     44.53      July    1986
US Waterborne Disease Outbreaks, 1948-1994
Substantive Questions
Do outbreaks occur at random across the US?
Are outbreaks preceded by extreme precipitation events?
Does the risk of an outbreak vary spatially, and is it related to watershed vulnerability?
Objective: Association between extreme precip. and outbreaks
Methods: Overlay map of outbreaks and extreme precip events
• 2,105 watersheds (USGS)
• 16,000+ weather stations (NCDC)
• Define extreme precipitation
• Aggregate precip and outbreaks to watershed
Results: 51% of outbreaks were coincident with extreme levels of precip within a 2 month lag preceding the outbreak month.
Conclusion: Is this evidence of an association?
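The coincidence figure above can be sketched in a few lines. This is only an illustration under stated assumptions: the watershed ids, dates, and extreme-precipitation months are made up, and the lag window (the outbreak month plus the two months before it) is one reading of "within a 2 month lag," not necessarily the definition used in the study.

```python
def month_index(year, month):
    """Convert (year, month) to a running month count so lags are plain subtraction."""
    return year * 12 + (month - 1)

def is_coincident(outbreak, extreme_months, max_lag=2):
    """True if the outbreak's watershed had an extreme-precipitation month in the
    outbreak month or in the max_lag months before it (assumed window)."""
    watershed, year, month = outbreak
    t = month_index(year, month)
    events = extreme_months.get(watershed, set())
    return any(0 <= t - month_index(y, m) <= max_lag for (y, m) in events)

def fraction_coincident(outbreaks, extreme_months, max_lag=2):
    """Share of outbreaks preceded by an extreme-precipitation month."""
    hits = sum(is_coincident(o, extreme_months, max_lag) for o in outbreaks)
    return hits / len(outbreaks)

# Hypothetical records, already aggregated to watershed: (watershed id, year, month)
outbreaks = [("W1", 1970, 10), ("W2", 1975, 11), ("W3", 1980, 7), ("W1", 1982, 1)]
# Hypothetical extreme-precipitation months per watershed
extreme_months = {
    "W1": {(1970, 9), (1981, 11)},
    "W2": {(1975, 6)},
    "W3": {(1980, 7)},
}

print(fraction_coincident(outbreaks, extreme_months))
```

A coincidence fraction on its own does not answer the question of association; it would need to be compared with the fraction expected if outbreak timing were unrelated to precipitation.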
US Waterborne Disease Outbreaks, 1948-1994
[Map: US outbreak locations overlaid with extreme precipitation events; legend: Outbreak, Extreme Prcp]
US Waterborne Disease Outbreaks, 1948-1994
• Generating the map involved many GIS tasks across numerous data sources (GIS spatial analysis).
• Statistically speaking, though, it represents risk factor data.
• Spatial statistics often considers the map as a starting point, which in GIS is often an endpoint.
Western Maryland Superfund Site
[Map: site land use areas (Residential, Residential, Undeveloped, Industrial); axes: Easting, Northing]
Western Maryland Superfund Site
DDE Soil Sample Data
Sample #   Easting   Northing   DDE (ppm)
1          1108420   725173     160
2          1108300   725378     4
...        ...       ...        ...
110        1108490   725038     92
[Histogram: DDE in Soil Samples 1992-1997 (ppm); y-axis: Frequency]
N = 110, Mean = 25.40, Stdev = 46.38, Min = 0.005, Max = 300
[Map: DDE sample locations classed by concentration y; axes: Easting, Northing; legend classes: 0.01 <= y < 2.1, 2.1 <= y < 4.4, 4.4 <= y < 23, 23 <= y < 300]
Substantive Questions
Does the site exceed regulated levels of DDE contamination, and is it in need of remediation?
What is the level of DDE in my backyard?
Kriged DDE Predictions
[Map: kriged DDE surface over the site, with land use labels (Residential, Residential, Undeveloped, Industrial); axes: Easting, Northing; color scale 0 to 400]
Kriging: Spatial prediction at unsampled locations based on data from sampled locations.
Environmental health applications of kriging: exposure maps
Kriged DDE Predictions
Baltimore County Lyme Disease: 1989-1990
Lyme Disease Cases and Controls
Cases                    Controls
Longitude   Latitude     Longitude   Latitude
-76.4047    39.3421      -76.4054    39.3419
-76.3433    39.3736      -76.3522    39.3718
...         ...          ...         ...
-76.7592    39.3265      -76.7665    39.3119
[Map: Baltimore County Lyme Disease; legend: Lyme Case, Lyme Control]
Substantive Questions
Do cases of Lyme Disease tend to cluster, generally or as localized “hot spots?”
Does risk of Lyme Disease vary spatially over Balt. County?
Identify and quantify environmental risk factors associated with Lyme Disease.
Baltimore County Lyme Disease Risk: 1989-1990
[Map: estimated Lyme Disease risk surface; axes: Longitude, Latitude; color scale 0.0 to 2.0; legend: Lyme Case, Lyme Control]
Spatial Case/Control Analysis
• Spatial density estimate of cases divided by spatial density estimate of controls (nonparametric kernel approach).
• Logistic regression approach to include covariates.
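The density-ratio idea in the first bullet can be sketched as follows. This is a minimal illustration, not the analysis behind the map: the Gaussian kernel, the single shared bandwidth, the planar coordinates, and every data point are assumptions made for the example.

```python
import math

def kernel_density(points, x, y, bandwidth):
    """Gaussian kernel density estimate at (x, y) from a list of (x, y) points."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return norm * total

def relative_risk(cases, controls, x, y, bandwidth):
    """Case density divided by control density at (x, y).
    Values above 1 suggest more cases than the controls would lead one to expect."""
    return (kernel_density(cases, x, y, bandwidth)
            / kernel_density(controls, x, y, bandwidth))

# Hypothetical locations in arbitrary planar units: cases bunch up near (1, 1)
cases = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (3.0, 3.0)]
controls = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (3.1, 2.9)]

print(relative_risk(cases, controls, 1.0, 1.0, bandwidth=0.5))  # above 1
print(relative_risk(cases, controls, 3.0, 3.0, bandwidth=0.5))  # below 1
```

The bandwidth controls how smooth the estimated surface is, so in practice the choice deserves care; far from all controls the denominator approaches zero and the ratio becomes unstable.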
Statistical Methods Exist to Address
• Do cases (events) show a tendency to cluster?
• Identifying “clusters” or “hot spots.”
• Does risk of disease (or outcome of interest) vary spatially?
• Is disease risk elevated near a particular point source?
• Spatial prediction of outcomes at unobserved locations.
• Risk factor estimation in the presence of residual spatial variation.
Types of Spatial Data
1. Geostatistical Data
Basic structure is data tagged with locations.
Locations can essentially exist anywhere.
Referred to as continuous spatial variation.
Example: MD Superfund Site DDE
2. Point Pattern Data
Locations are the data denoting occurrence of events.
Common to aggregate to area-level data.
Example: Baltimore County Lyme Disease cases and controls
3. Area-level Data
Data summarized to an area unit.
Rarely arises naturally.
Often an aggregate form of point pattern data.
Referred to as discrete spatial variation.
Example: Maryland prostate cancer by zip code
Why Collect Locations as Part of Data?
• Sometimes locations are the only data (as in point patterns).
• Risk (or outcome of interest) may vary spatially.
• Location can serve as an information gateway to other linked data sources: environmental, demographic, social, etc.
• Data are spatially dependent and locations are used in statistical methods that account for this dependence.
• In general, things can vary spatially, and geography (location) may be a source of variation worth considering.
Temporal Dependence
• Time series or longitudinal data.
• Past/present direction inherent in temporal data.
Spatial Dependence
• Dimensions > 1 and loss of directional component.
• Observations closer together in space are more similar than observations further away (clustering).
“in space” “on the earth”
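One common way to check whether nearby observations are more alike than distant ones is the empirical semivariogram: half the average squared difference between pairs of values, grouped by the distance separating them. The sketch below is illustrative only; the function name, the binning scheme, and the transect data are assumptions, not part of the talk.

```python
import math
from collections import defaultdict

def empirical_semivariogram(samples, bin_width):
    """samples: list of (x, y, value).
    Returns {bin index k: semivariance}, where bin k collects the pairs whose
    separation distance falls in [k * bin_width, (k + 1) * bin_width)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(samples)):
        xi, yi, zi = samples[i]
        for j in range(i + 1, len(samples)):
            xj, yj, zj = samples[j]
            dist = math.hypot(xi - xj, yi - yj)
            k = int(dist // bin_width)
            sums[k] += (zi - zj) ** 2
            counts[k] += 1
    return {k: sums[k] / (2.0 * counts[k]) for k in sorted(counts)}

# Hypothetical transect: six points on a line, value rising steadily with x
samples = [(float(x), 0.0, float(x)) for x in range(6)]
gamma = empirical_semivariogram(samples, bin_width=1.0)
print(gamma)
```

When spatial dependence is present, the semivariance is small at short distances and grows as the separation increases, which is what this toy transect shows.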
Spatial Dependence (clustering) in Environmental Health Data
Could be due to:
• A contagious agent of the outcome under investigation.
• The spatial variation in the population at risk.
• An underlying shared environmental characteristic, measured or unmeasured, that also varies spatially (Shared Environment Effect).
What GIS is Not
• A complete system for statistical or scientific inference.
• Maps, the most basic and fundamental concept in GIS, are not statistical inference.
• A GIS map of one variable is analogous to a histogram; a display of two variables overlaid is analogous to an x-y scatterplot or 2x2 table.
In statistics we go beyond histograms and scatterplots.
An Important Distinction
In the GIS literature, analysis or spatial analysis often means spatial data manipulation, which is something different from statistical analysis.
Geographic Analysis of Prostate Cancer in Maryland
PI: Ann Klassen (HPM & Oncology)
Collaborators: Margaret Ensminger, Chyvette Williams, JeanHeeHong (HPM); Frank Curriero (Biostat); Anthony Alberg (Epi); Martin Kulldorff (Harvard); Helen Meissner (NCI)
Cooperative Agreement from Association of Schools of Public Health and Centers for Disease Control
Data Agreement with the Maryland Cancer Registry
One of six CDC projects investigating geography and prostate cancer, including NY, CT/MA, NJ, Kansas/Iowa, and Louisiana.
Prostate Cancer Reported to MD Cancer Registry 1992-1997
Proportion of an Outcome of Interest*
* All geocoded cases
[Map legend classes: No Data, 0 - 12, 13 - 30, 31 - 67, 68 - 100]
Outcomes of Interest Include
• Incidence
• Stage at diagnosis
• Tumor grade at diagnosis
• Failure to stage or grade
• Treatment and mortality
What is Geocoding?
GIS process of translating mailing address information to coordinates on a map, such as longitude and latitude
16 Goucher Woods Ct, Towson, MD 21286
(-76.5883, 39.4005)
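As a toy illustration of the idea, geocoding can be pictured as a lookup from a cleaned-up address to a coordinate pair, with a miss for addresses that cannot be matched. Real geocoders match against street reference data and interpolate along street segments; the dictionary below is a hypothetical stand-in that holds only the example from this slide, and the function names are made up.

```python
def normalize(address):
    """Crude normalization: uppercase, drop commas and periods, collapse spaces."""
    cleaned = address.upper().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())

# Hypothetical reference table; it contains only the slide's example address.
REFERENCE = {
    normalize("16 Goucher Woods Ct, Towson, MD 21286"): (-76.5883, 39.4005),
}

def geocode(address):
    """Return (longitude, latitude), or None when the address cannot be matched."""
    return REFERENCE.get(normalize(address))

print(geocode("16 goucher woods ct  Towson MD 21286"))
print(geocode("8123 Rose Haven Road, Rosedale, MD 21237"))  # no match in the table
```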
Nongeocoded Data
Mailing addresses that could not be geocoded
8123 Rose Haven Road, Rosedale, MD 21237
Nongeocoded
[Maps: Proportion of Outcome of Interest, shown for Geocoded Cases (15,585) and for All Cases (17,091). Legends: No Data, 0 - 12, 13 - 30, 31 - 67, 68 - 100; and 0 - 8, 9 - 12, 13 - 30, 31 - 67, 68 - 100]
Statistical Issues
(1) Common to just ignore nongeocodes
What's the consequence?
Historically not well documented in publications
(2) Level of aggregation for analyses? Zip code level, census tract, county, etc.
(3) Nongeocodes represent missing data and are most likely not missing at random
Statistical Issues (cont.)
[Map: MD Prostate Cancer Proportion of NonGeocodes; legend (% Nongeocoded): 0 - 9, 10 - 25, 26 - 47, 48 - 75, 76 - 100]
Statistical Issues (cont.)
(3) Nongeocodes carry plenty of information
Known Information (fictitious example)
Age = 72
Race = White
Year of Diagnosis = 1991
Stage at Diagnosis = Late
Tumor Grade = Aggressive
Zip Code = 21237
Statistical Solutions
(a) Impute a location for nongeocodes
Determine the age-race distribution within known zip codes
Weighted random selection based on known age and race
Sampling with and without replacement
Multiple imputation to assess bias
(Joint work with Ann Klassen, HPM)
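A rough sketch of the imputation idea in (a), under loudly stated assumptions: a nongeocoded case borrows the location of a geocoded case from the same zip code, with donors that share its age group and race drawn more often. The record layout, the weighting rule (`match_weight`), and all values are hypothetical; this draws with replacement and is not the authors' procedure.

```python
import random

def impute_location(case, geocoded, rng, match_weight=5.0):
    """Draw a donor location from geocoded cases in the same zip code.
    Donors sharing the case's age group or race get a higher weight (assumed rule)."""
    donors = [g for g in geocoded if g["zip"] == case["zip"]]
    if not donors:
        return None
    weights = []
    for g in donors:
        w = 1.0
        if g["age_group"] == case["age_group"]:
            w *= match_weight
        if g["race"] == case["race"]:
            w *= match_weight
        weights.append(w)
    donor = rng.choices(donors, weights=weights, k=1)[0]
    return donor["lon"], donor["lat"]

def multiple_imputations(case, geocoded, n, seed=0):
    """Repeat the draw n times so the spread across imputations can be examined."""
    rng = random.Random(seed)
    return [impute_location(case, geocoded, rng) for _ in range(n)]

# Hypothetical geocoded records (coordinates are made up)
geocoded = [
    {"zip": "21237", "age_group": "70-79", "race": "White", "lon": -76.50, "lat": 39.33},
    {"zip": "21237", "age_group": "60-69", "race": "Black", "lon": -76.51, "lat": 39.34},
    {"zip": "21286", "age_group": "70-79", "race": "White", "lon": -76.59, "lat": 39.40},
]
# Nongeocoded case with known attributes, echoing the fictitious example
case = {"zip": "21237", "age_group": "70-79", "race": "White"}

draws = multiple_imputations(case, geocoded, n=20, seed=1)
print(draws[:3])
```

Repeating the analysis over several imputed data sets, rather than one, is what lets the sensitivity of the results to the imputed locations be assessed.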
(b) Develop statistical models for outcomes at different levels of aggregation
Spatial variation in risk model for geocoded household level data and nongeocoded zip code level data
(Joint work with Peter Diggle, Biost)
Chesapeake Bay Water Quality Assessment
Data
• Temperature
• Turbidity
• Dissolved Oxygen
• Chlorophyll a
Needed
Assessments at unsampled locations
Kriging
A spatial regression method that provides optimal prediction at unsampled locations.
Kriged predictions are weighted averages of sampled data, with higher weights given to data closer to the prediction site.
Proximity is measured by the straight line Euclidean distance (“as the crow flies”).
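The weighted-average description can be made concrete with a small ordinary kriging sketch. Everything specific here is an assumption for illustration: the exponential covariance model, its sill and range parameters, the absence of a nugget effect, and the sample values. In practice the covariance model would be estimated from the data.

```python
import math

def covariance(h, sill=1.0, corr_range=2.0):
    """Exponential covariance model (assumed form and made-up parameters)."""
    return sill * math.exp(-h / corr_range)

def euclid(p, q):
    """Straight-line distance between two (x, y, ...) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ordinary_kriging(samples, target):
    """samples: list of (x, y, value); target: (x, y).
    Returns (prediction, weights); the weights are constrained to sum to 1."""
    n = len(samples)
    a = [[covariance(euclid(samples[i], samples[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])  # unbiasedness constraint row
    b = [covariance(euclid(s, target)) for s in samples] + [1.0]
    weights = solve(a, b)[:n]    # last unknown is the Lagrange multiplier
    prediction = sum(w * s[2] for w, s in zip(weights, samples))
    return prediction, weights

# Hypothetical samples: (x, y, measured value)
samples = [(0.0, 0.0, 10.0), (1.0, 0.0, 14.0), (0.0, 3.0, 30.0)]
prediction, weights = ordinary_kriging(samples, (0.1, 0.0))
print(prediction, weights)
```

The nearest sample receives the largest weight, and with no nugget effect the prediction at a sampled location reproduces the observed value there.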
Chesapeake Bay Fixed Station Data
Euclidean distance may not be appropriate.
Propose a water metric
Currently kriging only works for Euclidean distance.
New methods needed.
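To illustrate why Euclidean distance can mislead in an estuary, the sketch below compares the straight-line distance between two stations with the length of the shortest path that stays in the water. The raster grid, the cell coding, and the four-neighbor movement rule are all assumptions for the example; the slides do not define the proposed water metric, and this only computes a distance, not a kriging method built on it.

```python
import math
from collections import deque

def water_distance(grid, start, goal):
    """Shortest in-water path length, in cell steps, between two raster cells.
    grid: list of strings, '.' = water and '#' = land (assumed coding).
    Moves go up, down, left, or right. Returns None if no water path exists."""
    rows, cols = len(grid), len(grid[0])
    steps = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return steps[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == "." and (nr, nc) not in steps):
                steps[(nr, nc)] = steps[(r, c)] + 1
                queue.append((nr, nc))
    return None

# Hypothetical bay: a peninsula of land separates two stations
grid = [
    "..#..",
    "..#..",
    "..#..",
    ".....",
]
station_a, station_b = (0, 1), (0, 3)

straight = math.hypot(station_a[0] - station_b[0], station_a[1] - station_b[1])
by_water = water_distance(grid, station_a, station_b)
print(straight, by_water)
```

The two stations are 2 cells apart as the crow flies but 8 cells apart by water, so a method that weights by Euclidean proximity would treat them as much closer neighbors than the water connecting them suggests.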
Closing Remarks
• GIS for spatial database management and hypothesis generation (posing the questions)
• Spatial Statistics for inferential methods (answering the questions)
• Why consider location
- Scientific inference may depend on it
- Gateway to environmental data
- Source of variation worth considering
• Biography and Geography of Public Health