Local Indicators of Categorical Data
Boots, B. (2003). Developing local measures of spatial association for categorical data. Journal of
Geographical Systems, 5(2), 139-160.
Why does space matter?
• Tobler's First Law:
  • "Everything is related to everything else, but near things are more related than distant things." [1]
• Spatial autocorrelation
• Observations are located in space or have a spatial component
  • Where did someone get sick?
  • Where are richer people living?
  • A wide range of questions can be evaluated from a spatial perspective
• Similar properties are highly likely if the distance (physical, but also social etc.) is small
• Data often has distinct spatial characteristics
  • Clustering vs. randomness vs. uniform distribution
Spatial Data Basics
• Spatial data is stored together with its attributes in one of two formats
• Raster data
  • Area represented by equally sized squares
• Vector data
  • Data represented as points, lines or polygons
Global and local measures
• Expression of spatial value similarity
• Global measures
  • Moran's I (deviation from mean)
  • Geary's C (actual values)
  • Getis-Ord (identifies general clustering of high or low values)
  • Join-count statistic (binary data)
  • Single value for the entire data set
• Local measures
  • Value for each observation
  • E.g. local Getis-Ord and local Moran's I
  • Expression of spatial value similarity
Example Global Moran’s I
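$$
I = \frac{N \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left( \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \right) \sum_{i=1}^{N} (x_i - \bar{x})^2}
$$

$N$ is the number of observations (points or polygons)
$\bar{x}$ is the mean of the variable
$x_i$ is the variable value at a particular location
$x_j$ is the variable value at another location
$w_{ij}$ is a weight indexing the location of i relative to j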
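As a minimal sketch of global Moran's I, the Python below evaluates the statistic on a small raster. The binary rook-contiguity weights (w_ij = 1 for cells sharing an edge) and the helper names `rook_weights` and `morans_i` are assumptions made for illustration; the slides do not specify a weighting scheme.

```python
def rook_weights(rows, cols):
    """Binary rook-contiguity weights {(i, j): 1.0} for a rows x cols grid.

    Cells are numbered row by row; this weighting scheme is an assumption.
    """
    weights = {}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    weights[(i, rr * cols + cc)] = 1.0
    return weights


def morans_i(values, weights):
    """Global Moran's I for a flat list of values and weights {(i, j): w_ij}."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Numerator: weighted cross-products of deviations from the mean
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    w_sum = sum(weights.values())
    # Denominator: sum of squared deviations
    ss = sum(d * d for d in dev)
    return (n / w_sum) * (cross / ss)


w = rook_weights(4, 4)
clustered = [1, 1, 0, 0] * 4                                  # left half high
checker = [(r + c) % 2 for r in range(4) for c in range(4)]   # every neighbor differs

i_clustered = morans_i(clustered, w)   # about 0.667 (positive autocorrelation)
i_checker = morans_i(checker, w)       # -1.0 (perfect dispersion)
print(i_clustered, i_checker)
```

The clustered pattern gives a positive value and the checkerboard the most negative one, which matches the reading of I as a single summary value for the whole data set.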
Measures of Local Spatial Association
• Common uses
  • Assessing the assumption of stationarity for a given study region
  • Identifying the existence of pockets of distinctive data values (hot and cold spots)
  • Identifying the scale (spatial extent) at which there is no discernible association of data values
Measures of Local Spatial Association
• Example: local Moran's I
  • Measurement of similarity for each region
  • Local Getis-Ord…
• The sum of the local values gives (up to a constant factor) the global test statistic
• All common measures are designed for continuous (and ordinal) variables
  • Developed in the context of regression to identify residuals
  • Applying them to categorical data would mean quantifying the categories, which implies a measurable distance between classes
  • No measures for the local spatial association of categorical data
Categorical Data
• Join-count statistic widely applied as a global measure
  • Mostly for binary data
  • More classes are problematic and require large subregions to ensure sufficient counts
  • Only cells and polygons
  • Counts links between cells
  • Values assigned based on occurrence or non-occurrence
  • Border between cells
• Assume from now on a raster dataset with black and white cells
• New: local join-count statistic
• Different from quantitative data; two base concepts:
  • Composition, which relates to aspatial characteristics of the different classes
    • Global and local concentration
  • Configuration, which refers to characteristics of the spatial distribution of the classes
    • Clustering
Categorical Data
• Global composition:
  • Share of one class in the overall count
  • 15 cells black, 85 white, total: 100
  • Share: 15% black
• Local composition:
  • If the global composition is known, the likelihood of finding x members of a class among the n cells of a subregion is given by the binomial distribution: $\Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$, where p is the global share of the class
  • Evaluation of significant presence and absence of cells based on the formula above for a specific m by m subregion (n = m²); adjustment for multiple testing; assuming no spatial dependence
  • Pr(X <= x) or Pr(X >= x) < 0.05
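The local composition test can be sketched in Python as follows: each m by m window is compared with the global share through the two binomial tails. The function names, the labels "hot", "cold" and "ns", and the choice to skip cells whose window crosses the edge are assumptions for illustration; no adjustment for multiple testing is applied in this sketch.

```python
from math import comb


def binom_tail_upper(x, n, p):
    """Pr(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def binom_tail_lower(x, n, p):
    """Pr(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, x + 1))


def local_composition(grid, m, alpha=0.05):
    """Label cells by comparing their m x m window with the global share.

    grid: list of rows of 0/1 values (1 = black); m must be odd.
    Returns {(row, col): "hot" | "cold" | "ns"} for cells whose window
    lies fully inside the grid (edge cells are skipped in this sketch).
    """
    rows, cols = len(grid), len(grid[0])
    p = sum(map(sum, grid)) / (rows * cols)   # global composition
    half, n = m // 2, m * m
    labels = {}
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            # Number of black cells in the window centered on (r, c)
            x = sum(grid[rr][cc]
                    for rr in range(r - half, r + half + 1)
                    for cc in range(c - half, c + half + 1))
            if binom_tail_upper(x, n, p) < alpha:
                labels[(r, c)] = "hot"      # significantly many black cells
            elif binom_tail_lower(x, n, p) < alpha:
                labels[(r, c)] = "cold"     # significantly few black cells
            else:
                labels[(r, c)] = "ns"       # not significant
    return labels


# 6 x 6 raster with a 3 x 3 black block in the upper-left corner (p = 0.25)
demo_grid = [[1 if r < 3 and c < 3 else 0 for c in range(6)] for r in range(6)]
demo_labels = local_composition(demo_grid, m=3)
print(demo_labels[(1, 1)], demo_labels[(4, 4)])
```

In the demo, the window centered on the black block is flagged as "hot", while an all-white 3 by 3 window is not significant: with only nine cells and a global share of 25%, seeing no black cell at all is not unusual.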
Join-Count Test Statistic
• Test statistic given by: Z = (Observed - Expected) / SD of Expected
• Expected = randomly generated
• Expected values and SD of expected:
  • k = total number of joins
  • p_b = expected proportion black (random or given), p_w = proportion white
  • M is based on k via
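A sketch of the Z statistic for b/b joins on a raster is given below. It assumes the standard free-sampling moments from the join-count literature, E[BB] = k p_b² and Var[BB] = k p_b² + 2m p_b³ - (k + 2m) p_b⁴ with m = ½ Σ k_i (k_i - 1), where k_i is the number of joins of cell i. These formulas, the rook definition of a join, and the function names are assumptions; they are not taken from the slide, whose own expressions are not shown.

```python
from math import sqrt


def join_counts(grid):
    """Count rook joins by type on a 0/1 grid (1 = black). Returns (bb, ww, bw)."""
    rows, cols = len(grid), len(grid[0])
    bb = ww = bw = 0
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so every join is counted once
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < rows and cc < cols:
                    pair = grid[r][c] + grid[rr][cc]
                    if pair == 2:
                        bb += 1
                    elif pair == 0:
                        ww += 1
                    else:
                        bw += 1
    return bb, ww, bw


def bb_z_score(grid, p_b=None):
    """Z for the b/b join count under free sampling (assumed textbook moments)."""
    rows, cols = len(grid), len(grid[0])
    bb, ww, bw = join_counts(grid)
    k = bb + ww + bw                          # total number of joins
    if p_b is None:
        # Proportion black estimated from the data when not given
        p_b = sum(map(sum, grid)) / (rows * cols)
    # m = 1/2 * sum over cells of k_i * (k_i - 1), k_i = joins of cell i
    m = 0.0
    for r in range(rows):
        for c in range(cols):
            k_i = (r > 0) + (r < rows - 1) + (c > 0) + (c < cols - 1)
            m += k_i * (k_i - 1)
    m *= 0.5
    expected = k * p_b**2
    variance = k * p_b**2 + 2 * m * p_b**3 - (k + 2 * m) * p_b**4
    return (bb - expected) / sqrt(variance)


# 4 x 4 raster, left half black: 10 b/b, 10 w/w and 4 b/w joins
half_black = [[1, 1, 0, 0] for _ in range(4)]
counts = join_counts(half_black)
z_bb = bb_z_score(half_black)
print(counts, z_bb)
```

A positive Z means more b/b joins than a random arrangement would produce, that is, clustering of black cells; a negative Z points towards dispersal.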
Categorical Data
• Global configuration
  • Counts all possible links and the links of type b/b, w/w and b/w (shares)
  • Rarely used
  • A high share of b/b and w/w in contrast to b/w indicates clustering
  • A high relative share of b/w indicates dispersal
• Local configuration
  • Local configuration depends on global and local composition
  • Conditional relationship
  • Is the number of join counts different from that of a random distribution of black cells?
Categorical Data
• Local configuration continued
  • Using the global composition we derive the join counts for a random distribution
  • Distribution of join counts
    • For large datasets with counts for b/b, w/w and b/w larger than 30: normal approximation
    • Smaller: total count or simulation of sample configurations
  • Counting all links in the subregion around a spatial unit
  • Identifying all cells which differ significantly regarding b/b, w/w and b/w count from the global value, assuming randomness
• Local composition and configuration can be combined as a tool for visualization
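For small subregions, the simulation route mentioned above could look like the following Python sketch. It tests the b/b join count of one m by m window by randomly rearranging the window's black cells. Conditioning on the observed number of black cells in the window is one reading of the "conditional relationship" and is an assumption here, as are the function names, the rook joins and the 999 simulations.

```python
import random


def window_cells(r, c, m):
    """Cells of the m x m window centered on (r, c); m must be odd."""
    half = m // 2
    return [(rr, cc)
            for rr in range(r - half, r + half + 1)
            for cc in range(c - half, c + half + 1)]


def bb_joins(black, cells):
    """Number of rook joins between two black cells inside the window."""
    cell_set = set(cells)
    count = 0
    for (r, c) in cells:
        if (r, c) not in black:
            continue
        # Look only right and down so every join is counted once
        for nb in ((r, c + 1), (r + 1, c)):
            if nb in cell_set and nb in black:
                count += 1
    return count


def local_bb_pvalue(grid, r, c, m, n_sim=999, seed=0):
    """Observed b/b joins and upper-tail Monte Carlo p-value for one window.

    The window must lie fully inside the grid. The number of black cells
    in the window is held fixed in every simulated arrangement.
    """
    rng = random.Random(seed)
    cells = window_cells(r, c, m)
    black = {cell for cell in cells if grid[cell[0]][cell[1]] == 1}
    observed = bb_joins(black, cells)
    at_least = 0
    for _ in range(n_sim):
        sim_black = set(rng.sample(cells, len(black)))
        if bb_joins(sim_black, cells) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_sim + 1)


# 5 x 5 window with six black cells forming one compact clump
clumped = [[1, 1, 0, 0, 0],
           [1, 1, 0, 0, 0],
           [1, 1, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]]
obs_clumped, p_clumped = local_bb_pvalue(clumped, 2, 2, 5)
print(obs_clumped, p_clumped)
```

A small p-value marks the window as a clump: its black cells touch each other more often than the same number of randomly placed black cells would.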
An example
• Perennial shrub
  • Atriplex hymenelytra
• Study area:
  • Death Valley, CA
• Black: presence of perennial shrub / White: absence
• Global composition:
  • 65/256 = 0.254
  • Insignificant global test, no spatial association
• Local tests for matrices: 3x3, 5x5 and 7x7
Example
• Significant deviations from global composition under the assumption of non-dependence
Example
• Significant deviation from global configuration under the assumption of non-dependence
Example
• Combination of both
• Interpretation can be difficult
• Hot clumps, and cells that are hot only or clump only, indicate areas specifically suitable for growth of the shrub
• Explorative data analysis:
  • Next question: what makes this area special?
Problems
• Assumption of global spatial non-dependence
  • Problematic
  • Truly random patterns are very unlikely
• With global spatial dependence:
  • Too liberal: many local hotspots identified
• Suggested method:
  • Identify cells with significant local composition
  • Compare number and distribution with random simulations
  • Identify cells with significant local configuration (clumps)
  • Compute the probability of encountering black cells in clumps and outside of clumps
  • Evaluate local composition using an additive binomial with all subregions
• Useful? Step two enables evaluation via Monte Carlo simulation of whether numbers and distribution vary significantly
• We are still often interested in the hotspots, e.g. as targets for intervention
Potential Problems
• Vector data characterized by unequally sized polygons
  • How to define areas?
    • Steps to central polygon
    • Potential bias towards large polygons with many boundaries
• Highly complex data problematic
  • What if a polygon has multiple borders with a second polygon?
• Other methods also yield results
  • Moran's I and Getis-Ord produce results with binary data
  • Though conceptually inappropriate, they might provide hints, include global composition and standardize
  • The scan statistic identifies hot spots but requires conversion to point data
Potential Problems
• Edge effects
  • What to do with missing values at the edge of the study area?
  • Use of count data to estimate edge effects highly problematic
• Modifiable areal unit problem (MAUP)
  • Testing across varying subregion sizes (steps)
  • Clustering varies across geographic scales
• Multiple testing
  • Can be too conservative
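To illustrate why a multiple-testing adjustment can be too conservative, the Python sketch below compares an unadjusted 0.05 threshold with a Bonferroni threshold for 3x3 windows, using the global composition 65/256 from the shrub example. The Bonferroni correction and one test per cell (256 tests) are assumptions chosen for illustration; the slides do not say which adjustment was used.

```python
from math import comb


def upper_tail(x, n, p):
    """Pr(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def smallest_significant_count(n, p, threshold):
    """Smallest number of black cells in an n-cell window with Pr(X >= x) < threshold."""
    for x in range(n + 1):
        if upper_tail(x, n, p) < threshold:
            return x
    return None


alpha = 0.05
n_tests = 256                        # assumption: one local test per cell
p_global = 65 / 256                  # global composition from the example
bonferroni_alpha = alpha / n_tests   # per-test threshold after Bonferroni

needed_unadjusted = smallest_significant_count(9, p_global, alpha)
needed_bonferroni = smallest_significant_count(9, p_global, bonferroni_alpha)
print(needed_unadjusted, needed_bonferroni)
```

Under these assumptions a 3x3 window needs at least 6 black cells to be flagged at the unadjusted level but at least 8 after the Bonferroni correction. Because neighboring windows overlap, the local tests are not independent, so a correction that treats them as 256 separate tests may discard real hot spots.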
Conclusion
• Join counts are a well-established measure of global spatial association for categorical data
• Development of local spatial statistics for categorical data
  • More accurate conceptual treatment of categorical data
  • Can visualize clustering and concentration of categorical data
  • Useful for explorative spatial data analysis
  • But often limited to binary problems
• Practical improvement?
  • A local Moran's I may provide an indication
  • It depends on the question asked
• Assessing the impact of global measures
  • Complicated and not fully developed
  • Necessity: depends on the question asked
Software to deal with spatial problems
• GIS
  • Spatial data tool
  • Spatial properties (adjacency…) inherent to datasets: worry free
  • Tools can be created in Python / integrated tools for spatial statistics
  • Push a button, but limited options in non-spatial statistics
• R
  • Flexible, with a large variety of available tools
  • Data has to be preprocessed to allow spatial calculation (adjacency etc.)
  • Can take some time
• (Matlab)
• (SAS)
  • Seems to have a variety of procedures for point data analysis
Code in R
• Introduction to spatial R:
  • https://pakillo.github.io/R-GIS-tutorial/
• Creating neighbors in spatial data:
  • https://cran.r-project.org/web/packages/spdep/vignettes/nb.pdf
  • This can also be used to create all subregions
• Global join count (spdep package):
  • http://www.inside-r.org/packages/cran/spdep/docs/joincount.test
  • Perform this test on all subregions using the global values as expected values
• Procedure to test for differences in local composition (has to be performed for all spatial units) (stats package):
  • https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prop.test.html
References
• Anselin, L. (1995). Local indicators of spatial association-LISA. Geographical Analysis, 27(2), 93-115.
• Boots, B. (2003). Developing local measures of spatial association for categorical data. Journal of Geographical Systems, 5(2), 139-160.
• Rogerson, P., & Yamada, I. (2008). Statistical detection and surveillance of geographic clusters. CRC Press.
• Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(2), 234-240.