Local Indicators of Categorical Data

Boots, B. (2003). Developing local measures of spatial association for categorical data. Journal of Geographical Systems, 5(2), 139-160.

Why does space matter?

• Tobler’s first law: "Everything is related to everything else, but near things are more related than distant things." (Tobler, 1970)
  • Spatial autocorrelation
• Observations are located in space or have a spatial component
  • Where did someone get sick?
  • Where are richer people living?
  • A wide range of questions can be evaluated from a spatial perspective
• Similar properties are highly likely when distance (physical, but also social etc.) is small
• Data often has distinct spatial characteristics
  • Clustering vs. randomness vs. uniform distribution

Spatial Data Basics

• Spatial data is stored together with attributes in two formats
• Raster data
  • Area represented by equally sized squares
• Vector data
  • Data represented as points, lines or polygons

Global and local measures

• Expression of spatial value similarity
• Global measures
  • Moran’s I (deviation from mean)
  • Geary’s C (actual values)
  • Getis-Ord (identifies general clustering of high or low values)
  • Join-count statistic (binary data)
  • Single value for entire data set
• Local measures
  • Value for each observation
  • E.g. local Getis-Ord and local Moran’s I
  • Expression of spatial value similarity

Example Global Moran’s I

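$$I = \frac{N \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left( \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \right) \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

• $N$ is the number of observations (points or polygons)
• $\bar{x}$ is the mean of the variable
• $x_i$ is the variable value at a particular location
• $x_j$ is the variable value at another location
• $w_{ij}$ is a weight indexing the location of i relative to j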
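The Global Moran’s I definition can be checked on a toy example. The following is a minimal sketch, not a reference implementation: the function names and the six-location chain with binary contiguity weights are assumptions made purely for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a square weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    # Numerator: weighted cross-products of deviations from the mean.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    # Denominator parts: sum of all weights and sum of squared deviations.
    w_sum = sum(sum(row) for row in weights)
    ss = sum(d * d for d in dev)
    return (n / w_sum) * (cross / ss)


def chain_weights(n):
    """Binary contiguity weights for n locations on a line (hypothetical layout)."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


w = chain_weights(6)
i_clustered = morans_i([1, 1, 1, 0, 0, 0], w)    # similar values next to each other
i_alternating = morans_i([1, 0, 1, 0, 1, 0], w)  # neighbours always differ
print(i_clustered, i_alternating)
```

The clustered sequence gives a positive value and the alternating sequence a negative one, which matches the reading of Moran’s I as a measure of value similarity between neighbours.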

Measures of Local Spatial Association

• Common uses
  • assessing the assumption of stationarity for a given study region
  • identifying the existence of pockets of distinctive data values (hot and cold spots)
  • identifying the scale (spatial extent) at which there is no discernible association of data values

Measures of Local Spatial Association

• Example: local Moran’s I
  • Measurement of similarity for each region
  • Local Getis-Ord…
• The sum of the local values is proportional to the global test statistic
• All common measures are for continuous (and ordinal) variables
  • Developed in the context of regression to identify residuals
  • Applying them to categorical data would quantify the categories, implying a measurable distance between them
  • No measures for local spatial association of categorical data

Categorical Data

• Join-count statistic widely applied as global measure
  • Mostly for binary data
  • More classes are problematic and require large subregions to ensure sufficient counts
  • Only cells and polygons
  • Counts links between cells
  • Values assigned based on occurrence or non-occurrence
  • Border between cells
• Assume from now on a raster dataset with black and white cells
• New: local join-count statistic
• Different from quantitative data; two base concepts:
  • Composition, which relates to aspatial characteristics of the different classes
    • Global and local concentration
  • Configuration, which refers to characteristics of the spatial distribution of the classes
    • Clustering

Categorical Data

• Global composition:
  • Share of one class in the overall count
  • 15 cells black, 85 white, total: 100
  • Share: 15% black
• Local composition:
  • If the global composition is known, the likelihood of finding x members of a class is given by the binomial distribution:

    $$\Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$

    with $n = m^2$ cells in the subregion and $p$ the global share of the class
  • Evaluation of significant presence and absence of cells based on the formula above for a specific m by m subregion; adjustment for multiple testing; assuming no spatial dependence
  • Pr(X ≤ x) or Pr(X ≥ x) < 0.05
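The local composition test can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the labels "hot" and "cold" are shorthand for significant presence and absence, and no multiple-testing adjustment is applied.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of exactly x black cells among n, given global share p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def upper_tail(x, n, p):
    """Pr(X >= x)."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))


def lower_tail(x, n, p):
    """Pr(X <= x)."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))


def classify_subregion(black_count, m, p_global, alpha=0.05):
    """Label an m-by-m subregion by comparing its black count with the global share."""
    n = m * m
    if upper_tail(black_count, n, p_global) < alpha:
        return "hot"    # significantly more black cells than expected
    if lower_tail(black_count, n, p_global) < alpha:
        return "cold"   # significantly fewer black cells than expected
    return "not significant"


# Global share of 15% black, as in the example above.
label_many = classify_subregion(5, 3, 0.15)       # 5 of 9 cells black
label_none_small = classify_subregion(0, 3, 0.15)  # 0 of 9 cells black
label_none_large = classify_subregion(0, 7, 0.15)  # 0 of 49 cells black
print(label_many, label_none_small, label_none_large)
```

With a 15% global share, an empty 3x3 subregion is unremarkable, while an empty 7x7 subregion is flagged. Significant absence can therefore only be detected in sufficiently large subregions.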

Join-Count Test Statistic

Test statistic given by:

$$Z = \frac{\text{Observed} - \text{Expected}}{\text{SD of Expected}}$$

• Expected = randomly generated
• Expected values and SD of expected: [formula images]
• k = total number of joins
• p_b = expected proportion black (random or given), p_w = proportion white
• m is based on k via [formula image]
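A sketch of the Z statistic for a black/white raster follows. It assumes the free-sampling moments commonly given in the join-count literature, which may differ from the expressions on the slide: E[BB] = k·p_b², E[BW] = 2·k·p_b·p_w, Var[BB] = k·p_b² + 2m·p_b³ − (k + 2m)·p_b⁴, Var[BW] = 2(k + m)·p_b·p_w − 4(k + 2m)·p_b²·p_w², with m = ½ Σ k_i(k_i − 1) and k_i the number of joins of cell i. Rook adjacency and all function names are also assumptions.

```python
from math import sqrt


def rook_joins(grid):
    """Return (bb, ww, bw, k, m) for a 0/1 grid under rook (edge-sharing) adjacency."""
    rows, cols = len(grid), len(grid[0])
    bb = ww = bw = 0
    degree_term = 0
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so that every join is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    a, b = grid[r][c], grid[rr][cc]
                    if a == 1 and b == 1:
                        bb += 1
                    elif a == 0 and b == 0:
                        ww += 1
                    else:
                        bw += 1
            # k_i: number of joins this cell takes part in.
            k_i = (r > 0) + (r < rows - 1) + (c > 0) + (c < cols - 1)
            degree_term += k_i * (k_i - 1)
    k = bb + ww + bw
    m = degree_term // 2
    return bb, ww, bw, k, m


def join_count_z(grid, p_b):
    """Z scores for BB, WW and BW joins under the assumed free-sampling moments."""
    bb, ww, bw, k, m = rook_joins(grid)
    p_w = 1 - p_b

    def same_colour_sd(p):
        return sqrt(k * p**2 + 2 * m * p**3 - (k + 2 * m) * p**4)

    sd_bw = sqrt(2 * (k + m) * p_b * p_w - 4 * (k + 2 * m) * p_b**2 * p_w**2)
    return {
        "BB": (bb - k * p_b**2) / same_colour_sd(p_b),
        "WW": (ww - k * p_w**2) / same_colour_sd(p_w),
        "BW": (bw - 2 * k * p_b * p_w) / sd_bw,
    }


# 4x4 checkerboard: every join links a black and a white cell.
board = [[(r + c) % 2 for c in range(4)] for r in range(4)]
counts = rook_joins(board)
z = join_count_z(board, 0.5)
print(counts, z)
```

On the checkerboard the BW score is strongly positive and the BB and WW scores are negative, the pattern that indicates dispersal rather than clustering.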

Categorical Data

• Global configuration
  • Counts all possible links and the links that are b/b, w/w and b/w, expressed as shares
  • Rarely used
  • High share of b/b and w/w in contrast to b/w indicates clustering
  • High relative share of b/w indicates dispersal
• Local configuration
  • Local configuration depends on global and local composition
  • Conditional relationship
  • Is the number of join counts different from that of a random distribution of black cells?

Categorical Data

• Local configuration continued
  • Using the global composition we derive the join counts for a random distribution
  • Distribution of join counts
    • For large datasets with counts for b/b, w/w and b/w larger than 30: normal approximation
    • Smaller: total count (complete enumeration) or simulation of sample configurations
  • Counting all links in the subregion around a spatial unit
  • Identifying all cells which differ significantly in b/b, w/w and b/w count from the global value, assuming randomness
• Local composition and configuration can be combined as a tool for visualization
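For small subregions, where the normal approximation is not appropriate, the simulation route can be sketched as a permutation test. This sketch holds the number of black cells in the window fixed and reshuffles their positions, which is one possible reading of the conditional relationship between configuration and composition. The function names, the rook adjacency and the one-sided test on b/b joins are assumptions.

```python
import random


def bb_joins(window):
    """Number of black/black joins in a 0/1 window under rook adjacency."""
    rows, cols = len(window), len(window[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            if window[r][c] != 1:
                continue
            if c + 1 < cols and window[r][c + 1] == 1:
                count += 1
            if r + 1 < rows and window[r + 1][c] == 1:
                count += 1
    return count


def clump_p_value(window, n_sim=999, seed=0):
    """Share of random rearrangements with at least as many b/b joins as observed."""
    rng = random.Random(seed)
    rows, cols = len(window), len(window[0])
    cells = [v for row in window for v in row]
    observed = bb_joins(window)
    at_least = 0
    for _ in range(n_sim):
        rng.shuffle(cells)  # keeps the local composition, randomises positions
        shuffled = [cells[i * cols:(i + 1) * cols] for i in range(rows)]
        if bb_joins(shuffled) >= observed:
            at_least += 1
    return (at_least + 1) / (n_sim + 1)


# Two hypothetical 5x5 windows, each with six black cells.
clumped = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
scattered = [
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
p_clumped = clump_p_value(clumped)
p_scattered = clump_p_value(scattered)
print(bb_joins(clumped), p_clumped, p_scattered)
```

Both windows have the same local composition, so only the clumped one is flagged. This shows why configuration is evaluated separately from composition.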

An example

• Perennial shrub
  • Atriplex hymenelytra
• Study area:
  • Death Valley, CA
• Black: presence of perennial shrub / White: absence
• Global composition:
  • 65/256 = 0.254
  • Insignificant global test, no spatial association
• Local tests for matrices: 3x3, 5x5 and 7x7

Example

• Significant deviations from global composition under the assumption of non-dependence

Example

• Significant deviation from global configuration under the assumption of non-dependence

Example

• Combination of both

• Interpretation can be difficult

• Hot clumps, and hot or clump only, indicate areas specifically suitable for growth of the shrub
• Explorative data analysis:
  • Next question: What makes this area special?

Problems

• Assumption of global spatial non-dependence
  • Problematic
  • True random patterns very unlikely
  • With global spatial dependence:
    • Too liberal: many local hotspots identified
• Suggested method:
  • Identify cells with significant local composition
  • Compare number and distribution with random simulations
  • Identify cells with significant local configuration (clumps)
  • Compute probability to encounter black cells in clumps and outside of clumps
  • Evaluate local composition using additive binomial with all subregions
• Useful? Step two enables evaluation via Monte Carlo simulation of whether numbers and distribution vary significantly
• We are still often interested in the hotspots: targets for intervention etc.
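Step two of the suggested method can be sketched as a Monte Carlo comparison. The sketch below compares only the number of significant subregions, not their spatial distribution. The 16x16 grid shape (chosen because 256 cells matches the example), the sliding 3x3 windows without edge padding, the artificial "observed" pattern and all function names are assumptions for illustration.

```python
import random
from math import comb


def upper_tail(x, n, p):
    """Pr(X >= x) for a binomial(n, p) count."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def count_hot_windows(grid, m, p, alpha=0.05):
    """Number of m-by-m windows whose black count is significantly high."""
    rows, cols = len(grid), len(grid[0])
    n = m * m
    # Smallest black count with upper-tail probability below alpha.
    threshold = next((x for x in range(n + 1) if upper_tail(x, n, p) < alpha), n + 1)
    hot = 0
    for r in range(rows - m + 1):
        for c in range(cols - m + 1):
            blacks = sum(grid[rr][cc]
                         for rr in range(r, r + m) for cc in range(c, c + m))
            if blacks >= threshold:
                hot += 1
    return hot


def simulate_hot_counts(rows, cols, n_black, m, n_sim=199, seed=0):
    """Hot-window counts for random grids with the same global composition."""
    rng = random.Random(seed)
    p = n_black / (rows * cols)
    cells = [1] * n_black + [0] * (rows * cols - n_black)
    counts = []
    for _ in range(n_sim):
        rng.shuffle(cells)
        grid = [cells[i * cols:(i + 1) * cols] for i in range(rows)]
        counts.append(count_hot_windows(grid, m, p))
    return counts


# Artificial observed pattern: all 65 black cells packed into the top rows.
rows = cols = 16
n_black = 65
flat = [1] * n_black + [0] * (rows * cols - n_black)
observed_grid = [flat[i * cols:(i + 1) * cols] for i in range(rows)]

observed = count_hot_windows(observed_grid, 3, n_black / (rows * cols))
simulated = simulate_hot_counts(rows, cols, n_black, 3)
p_value = (1 + sum(s >= observed for s in simulated)) / (len(simulated) + 1)
print(observed, max(simulated), p_value)
```

If the observed number of hot subregions lies well outside the simulated range, the local hotspots are unlikely to be a by-product of the global composition alone.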

Potential Problems

• Vector data characterized by unequally sized polygons
  • How to define areas?
  • Steps to central polygon
  • Potential bias towards large polygons with many boundaries
• Highly complex data problematic
  • What if a polygon has multiple borders with a second polygon?
• Other methods also yield results
  • Moran’s I and Getis-Ord produce results with binary data
  • Though conceptually inappropriate, they might provide hints, include global composition and standardize
  • Scan statistic to identify hot spots, but requires conversion to point data

Potential Problems

• Edge effects
  • What to do with missing values at the edge of the study area?
  • Use of count data to estimate edge effects highly problematic
• Modifiable areal unit problem (MAUP)
  • Testing across varying subregion sizes (steps)
  • Clustering varies across geographic scales
• Multiple testing
  • Can be too conservative
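How conservative a multiple-testing adjustment becomes can be illustrated with a small calculation. The slides do not name the adjustment, so the Bonferroni correction used here is an assumption, as are the 16x16 grid with sliding 3x3 windows and the function names.

```python
from math import comb


def upper_tail(x, n, p):
    """Pr(X >= x) for a binomial(n, p) count."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))


def min_significant_count(n, p, alpha):
    """Smallest black-cell count with upper-tail probability below alpha, else None."""
    for x in range(n + 1):
        if upper_tail(x, n, p) < alpha:
            return x
    return None


p_global = 65 / 256   # global composition from the example
n_cells = 9           # one 3x3 subregion
n_tests = 14 * 14     # 3x3 windows fitting into an assumed 16x16 grid

unadjusted = min_significant_count(n_cells, p_global, 0.05)
bonferroni = min_significant_count(n_cells, p_global, 0.05 / n_tests)
print(unadjusted, bonferroni)
```

Under these assumptions a 3x3 subregion needs 6 black cells to be flagged without adjustment but 8 of 9 after the Bonferroni correction, so only nearly full subregions remain detectable.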

Conclusion

• Join counts are a well-established measure of global spatial association for categorical data
• Development of local spatial statistics for categorical data
  • More accurate conceptual treatment of categorical data
  • Can visualize clustering and concentration of categorical data
  • Useful for explorative spatial data analysis
  • But often limited to binary problems
• Practical improvement?
  • A local Moran’s I may provide an indication
  • It depends on the question asked
• Assessing impact of global measures
  • Complicated and not fully developed
  • Necessity: depends on question asked

Software to deal with spatial problems

• GIS
  • Spatial data tool
  • Spatial properties (adjacency…) inherent to datasets – worry free
  • Tools can be created in Python / integrated tools for spatial statistics
  • Push a button, but limited options in non-spatial statistics
• R
  • Flexible and a large variety of available tools
  • Data has to be preprocessed to allow spatial calculation – adjacency etc.
  • Can take some time
• (Matlab)
• (SAS)
  • Seems to have a variety of procedures for point data analysis

Code in R

• Introduction to spatial R:
  • https://pakillo.github.io/R-GIS-tutorial/
• Creating neighbors in spatial data:
  • https://cran.r-project.org/web/packages/spdep/vignettes/nb.pdf
  • This can also be used to create all subregions
• Global join count (spdep package):
  • http://www.inside-r.org/packages/cran/spdep/docs/joincount.test
  • Perform this test on all subregions using global as expected values
• Procedure to test for differences in local composition (has to be performed for all spatial units) (stats package):
  • https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prop.test.html


References

• Anselin, L. (1995). Local indicators of spatial association-LISA. Geographical Analysis, 27(2), 93-115.
• Boots, B. (2003). Developing local measures of spatial association for categorical data. Journal of Geographical Systems, 5(2), 139-160.
• Rogerson, P., & Yamada, I. (2008). Statistical detection and surveillance of geographic clusters. CRC Press.
• Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(2), 234-240.
