encyclopedia of environmetrics || nearest neighbor methods based in part on the article “nearest...

14
Nearest neighbor methods Nearest neighbor (NN) methods include at least six different groups of statistical methods. All have in common the idea that some aspects of the similarity between a point and its NN can be used to make useful inferences. In some cases, the similarity is the distance between the point and its NN; in others, the appropriate similarity is based on other identifying characteristics of the points. I will discuss in detail NN methods for studying spatial point processes and analyzing field experiments because both are commonly used in biology and environmetrics. I will very briefly discuss NN designs for field experiments. I will not discuss NN estimates of probability density functions (pdfs) [1], NN methods for discrimination or classification [2], or NN linkage (i.e., simple linkage) in hierarchical clustering [3, pp. 73–84]. Although these last three methods have been applied to environmetric data, they are much more general. Nearest Neighbor Methods for Spatial Point Processes Spatial point process data describe the locations of “interesting” events (see Point processes, spatial) and (possibly) some information about each event. Some examples include locations of tree trunks [4], locations of bird nests [5], locations of pottery shards, and locations of cancer cases [6]. I will focus on the most common case where the location is recorded in two dimensions (x,y ). Similar techniques can be used for three-dimensional data (e.g., locations of galaxies in space) or one-dimensional data (e.g., nesting sites along a coastline or along a riverbank). Usually, the locations of all events in a defined area are observed (completely mapped data), but occasionally only a subset of locations is observed (sparsely sampled data). Univariate point process data include only the locations of the events; marked point process data include additional information about the event at each location [7, pp. 294–296]. For example, the species may be recorded for each tree, some cultural Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics. identification may be recorded for each pottery shard, and nest success or nest failure may be recorded for each bird nest. Location or marked location data can be used to answer many different sorts of questions. The scien- tific context for a question depends on the area of application, but the questions can be grouped into general categories. One very common category of question concerns the spatial pattern of the observa- tions. Are the locations spatially clustered? Do they tend to be regularly distributed, or are they random (i.e., a realization of a homogeneous Poisson pro- cess)? A second common set of questions concerns the relationships between different types of events in a marked point process. Do two different species of tree tend to occur together? Are locations of cancer cases more clustered than a random subset of a con- trol group (see Disease mapping)? A third set of questions deals with the density (number of events per unit area). What is the average density of trees in an area? What does a map of density look like? Methods to answer each of these types of question are discussed in the following sections. Theoretical treatments of NN methods for spatial point patterns can be found in Refs 7 – 9. Applications of NN methods can be found in many articles and books, including Refs 7, 10, 11. Describing and Testing Spatial Patterns Using Completely Mapped Data Describing and testing spatial patterns of locations has a long history. Historically, the primary con- cern was with the question of randomness [12–15]. Are locations randomly distributed throughout the study area (i.e., are the locations a realization of a Poisson process with homogeneous intensity), or do the locations indicate some structure (i.e., clus- tering or repulsion between locations)? Because of the many connotations of randomness and the importance of a homogeneous Poisson process as a benchmark, it is commonly called complete spatial randomness (CSR). In this section NN tests based on completely mapped data are described. Locations of all events are recorded in an arbitrary study region. Often, the study region is square or rectangular, but this is not a requirement. Tests for the less common case of sampled data are described in the next section. Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd. This article is © 2013 John Wiley & Sons, Ltd. This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd. DOI: 10.1002/9780470057339.van007.pub2

Upload: walter-w

Post on 12-Dec-2016

225 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

Nearest neighbor methods

Nearest neighbor (NN) methods include at least sixdifferent groups of statistical methods. All have incommon the idea that some aspects of the similaritybetween a point and its NN can be used to makeuseful inferences. In some cases, the similarity is thedistance between the point and its NN; in others, theappropriate similarity is based on other identifyingcharacteristics of the points. I will discuss in detailNN methods for studying spatial point processesand analyzing field experiments because both arecommonly used in biology and environmetrics. I willvery briefly discuss NN designs for field experiments.I will not discuss NN estimates of probability densityfunctions (pdfs) [1], NN methods for discriminationor classification [2], or NN linkage (i.e., simplelinkage) in hierarchical clustering [3, pp. 73–84].Although these last three methods have been appliedto environmetric data, they are much more general.

Nearest Neighbor Methods for SpatialPoint Processes

Spatial point process data describe the locations of“interesting” events (see Point processes, spatial)and (possibly) some information about each event.Some examples include locations of tree trunks [4],locations of bird nests [5], locations of pottery shards,and locations of cancer cases [6]. I will focus on themost common case where the location is recorded intwo dimensions (x, y). Similar techniques can be usedfor three-dimensional data (e.g., locations of galaxiesin space) or one-dimensional data (e.g., nesting sitesalong a coastline or along a riverbank). Usually, thelocations of all events in a defined area are observed(completely mapped data), but occasionally only asubset of locations is observed (sparsely sampleddata). Univariate point process data include onlythe locations of the events; marked point processdata include additional information about the eventat each location [7, pp. 294–296]. For example, thespecies may be recorded for each tree, some cultural

Based in part on the article “Nearest neighbormethods” by Philip M. Dixon, which appeared in theEncyclopedia of Environmetrics.

identification may be recorded for each pottery shard,and nest success or nest failure may be recorded foreach bird nest.

Location or marked location data can be used toanswer many different sorts of questions. The scien-tific context for a question depends on the area ofapplication, but the questions can be grouped intogeneral categories. One very common category ofquestion concerns the spatial pattern of the observa-tions. Are the locations spatially clustered? Do theytend to be regularly distributed, or are they random(i.e., a realization of a homogeneous Poisson pro-cess)? A second common set of questions concernsthe relationships between different types of events ina marked point process. Do two different species oftree tend to occur together? Are locations of cancercases more clustered than a random subset of a con-trol group (see Disease mapping)? A third set ofquestions deals with the density (number of eventsper unit area). What is the average density of treesin an area? What does a map of density look like?Methods to answer each of these types of questionare discussed in the following sections.

Theoretical treatments of NN methods for spatialpoint patterns can be found in Refs 7–9. Applicationsof NN methods can be found in many articles andbooks, including Refs 7, 10, 11.

Describing and Testing Spatial PatternsUsing Completely Mapped Data

Describing and testing spatial patterns of locationshas a long history. Historically, the primary con-cern was with the question of randomness [12–15].Are locations randomly distributed throughout thestudy area (i.e., are the locations a realization ofa Poisson process with homogeneous intensity), ordo the locations indicate some structure (i.e., clus-tering or repulsion between locations)? Becauseof the many connotations of randomness and theimportance of a homogeneous Poisson process as abenchmark, it is commonly called complete spatialrandomness (CSR).

In this section NN tests based on completelymapped data are described. Locations of all eventsare recorded in an arbitrary study region. Often, thestudy region is square or rectangular, but this is nota requirement. Tests for the less common case ofsampled data are described in the next section.

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 2: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

2 Nearest neighbor methods

Tests Based on Mean Nearest Neighbor Distance

The distances between NNs provide informationabout the pattern of points. Define W as the dis-tance from a randomly chosen event to the nearestother event in a homogeneous Poisson process withintensity (expected number of points per unit area;see Poisson intensity) of ρ. The pdf and cumulativedistribution function (cdf) of W are

g(w) = 2ρπw exp(−ρπw2) (1)

G(w) = 1 − exp(−ρπw2) (2)

so the mean and the variance of W are

E(W) = 1

2ρ1/2 (3)

and

var(W) = 4 − π

4πρ(4)

Based on these moments evaluated at the observeddensity, ρ, Clark and Evans [14] proposed using thestandardized mean to test CSR. The Clark–Evansstatistic [14] is

ZCE = w − E[W |ρ]

(N−1var[W |ρ])1/2 (5)

Their proposed test was to compare ZCE to a standardnormal distribution.

This test and the many users of it ignore twoproblems: nonindependence and edge effects. Ina completely mapped area, many of the distancesbetween NNs are correlated. That correlation is 1.0when two points, A and B, are reflexive NNs, thatis, when B is the NN of A, and A is the NNof B [16]. Other authors have called these isolatedNNs [17] or mutual NNs [18]. This problem is notrestricted to a few points. When points exhibit CSR intwo dimensions, approximately 62.15% of the pointsare reflexive NNs [16]. Their effect is to inflate thevariance of the mean NN distance.

Edge effects arise because the distribution of W

(2) assumes an unbounded area, but the observed NNdistances are calculated from points in a defined studyarea. When a point is near the edge of the studyarea, it is possible that the true NN is a point justoutside the study area, not a more distant point thathappens to be in the study area. Edge effects leadto overestimation (positive bias) of the mean NN

distance. Edge effects can be practically important;neglecting them can alter conclusions about thespatial pattern [19].

Edge effects may be minimized by includinga buffer area that surrounds the primary studyarea [14]. NN distances are calculated only for pointsin the primary study area, but locations in the bufferarea are available as potential NNs. With a suffi-ciently large buffer area, this approach can eliminateedge effects, but it is wasteful since an appropriatelylarge buffer area may contain many locations. Theeffects of both issues can be eliminated by usingMonte Carlo simulation to establish study-specificcritical values (see Simulation and Monte Carlomethods) or by deriving approximations for the edge-corrected moments by curve-fitting to an extensive setof simulation results [20].

One difficulty with tests based on the mean NNdistance is that the mean is just a single summary ofthe pattern. Two point patterns may have the samemean NN distance, but one exhibits CSR and theother does not. One such pattern would have a fewpatches of clustered points and an appropriate numberof widely scattered individuals. The points in theclusters have small NN distances, but the widelyscattered individuals have large NN distances. Withthe appropriate mix of clustered and scattered points,the mean NN distance could be exactly that givenby (3).

Distribution of Nearest Neighbor Distances

An alternative is to consider the entire distribu-tion, G(w), of NN distance [9]. CSR can be testedby comparing the observed distribution functionof NN distances, G(w), to the theoretical cdf(2). A variety of test statistics have been sug-gested, including Kolmogorov–Smirnov type statis-tics: supw |G(w) − G(w)|, Cramer–von Mises typestatistics:

∫w

[G(w) − G(w)]2, or Anderson–Darlingtype statistics:

∫w

[G(w)−G(w)]2/[G(w)(1−G(w))].The Kolmogorov–Smirnov statistic seems to be themost commonly used. The usual critical values forthe one-sample Kolmogorov–Smirnov test are notappropriate here because of nonindependence of NNdistances, especially for reflexive NNs, as discussedabove. Instead, a Monte Carlo test (see Simulationand Monte Carlo methods) must be used [21].

Edge-corrected estimators of G(w) have beendevised; some of them are summarized in Ref. 8,

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 3: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

Nearest neighbor methods 3

pp. 613–614, 637–638 and Ref. 7, pp. 208–212.The edge-corrected cdf can also be estimated bya Kaplan–Meier estimator [22]. Edge correctionsreduce the bias in the estimator, but they increasethe sampling variance [23].

Although edge-corrected estimators are needed ifthe observed distribution function is to be comparedwith the theoretical distribution function under CSR(2), they may not be needed for a Monte Carlo testof CSR. A more powerful test of CSR is to use anonedge-corrected estimator and compare the biasedestimate of G(w) to the biased mean, G(w), underCSR [23]. The biased mean G(w) and pointwiseprediction intervals are computed by simulation.Values of G(w) above the simulated mean indicateclustering of points (an excess of short distances toNNs). Values of G(w) below the simulated meanindicate regularity (few to no points with shortdistances to NNs). Tests of departure from CSR at aspecific distance, w, are based on quantiles of G(w)

estimated by simulation.The Monte Carlo approach is not limited to testing

CSR. It can be used to evaluate the fit of any processthat can be simulated. For example, a set of locationsmight be compared with a Poisson cluster processor a Strauss process [7, 9].

NN methods have been extended in a variety ofways. Tests can be based on other functions of theNN distances (e.g., squared [24] or smallest [25] NNdistances), but such tests have not been widely used.Distances to the second NN, or the third NN, orperhaps an even further neighbor have been suggestedas a way to look at patterns on a larger scale. Finally,the distance between a randomly chosen point andthe nearest event also provides information about thespatial pattern [7, 9].

Point–Event Distances and the J (x) Function

The point–event distribution, F(x), considers thedistance between a randomly chosen location (not thelocation of an event) and the nearest event. This canbe estimated by choosing m locations in the studyarea and computing the distance from each locationto its NN. As with G(w), edge effects complicateestimation of the cdf. An edge-corrected estimator is

FR(x) = number of (xi ≤ x, di > x)

number of (di > x)(6)

where xi is the distance between a point and itsneighboring event, and di is the distance between apoint and its nearest boundary. When the events area realization of CSR, X, the point–event distance,and W , the NN distance, have the same distribution,so F(x) = 1 − exp(−ρπx2). However, the effects ofdeviations from randomness on F (x) are opposite tothose on G(w). Values of F (x) above the expectedvalue indicate regularity. Values below the expectedvalue indicate clustering.

New summary statistics can be derived by combin-ing the event–event distance distribution, G(x), andpoint–event distance distribution, F(x). Van Lieshoutand Baddeley [26] proposed the J function,

J (x) = 1 − G(x)

1 − F(x)(7)

which is defined for all x for which F(t) < 1.J (t) = 1 for all t if the process is CSR. J (t) > 1indicates inhibition at distance t and J (t) > 1 indi-cates clustering at distance t . J (t) can be estimatedwithout requiring edge correction because the biasesarising from edge effects cancel out if the processis CSR [27]. Recently, estimators of F(x), G(w),and J (x) have been developed for inhomogeneouspoint processes [28]. Maltez-Mauro et al. use boththe J (x) and Ripley’s K(t) function to evaluate spa-tial patterns of recruitment in a Mediterranean Oakforest [29].

The NN distance distribution, point–event dis-tance distribution, and Ripley’s K function providedifferent insights into the spatial pattern. The NNdistribution function, G(w), is slightly more pow-erful at detecting departures from CSR in the direc-tion of regularity [9]. The point–event distributionfunction provides information about the empty spacebetween points. It appears to be the more power-ful method for detecting departures in the directionof clustering [7, 9]. Ripley’s K function simultane-ously examines the spatial pattern at many distancescales and is now the most popular approach forcompletely mapped data. However, it is possible toconstruct point processes with the same G(w) buta different Ripley’s K(t) function [7]. Conversely,Baddeley and Silverman [30] illustrate two processeswith the same Ripley’s K function but very differentNN distance distributions and point–event distancedistributions.

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 4: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

4 Nearest neighbor methods

Are Events Points or Circles?

The distributional theory in the previous sectionsassumes that events occupy no space. Treating eventsas points assumes that it is possible for two events(e.g., locations of tree trunks or bird nests) to bean infinitesimally small distance from each other.The point assumption is reasonable when the area ofthe events is small relative to the spacing between theevents. The assumption is likely to be appropriate forbird nests (generally small) or tree trunks (generallylow density), but not for ant nests (large size relativeto the density of nests in an area). If events areincorrectly assumed to be points, the analysis ofthe spatial pattern indicates a tendency to regularitybecause two events do not occur within a smalldistance (the physical size of the event) of each other.

An approximation to the mean event–event dis-tance, W , for nonoverlapping circles under CSR is

W ≈ d + exp(−ρπd2)[1 − �(2ρπd)1/2]√ρ

(8)

where d and ρ are, respectively, the diameter of thecircles and the number of circles per unit area [31].In (8) a low intensity, ρ, and a small and constantdiameter, d , are assumed. The distribution, G(w),of NN distances for nonoverlapping circles can beestimated by Monte Carlo simulation by using asequential inhibition algorithm [9]. The distributionof event–event distances can be complicated whenthe circles are large or the density is high. Forexample, it may not be possible to fit the requirednumber of large circles into the study area.

Algorithms and Computing

The simplest way to compute NN distances is bydirect enumeration; that is, computing the distancesbetween all pairs of points, then reporting the small-est distance for each point as the NN distance. Fora large number of points, this becomes impracti-cal, and more efficient algorithms have been devel-oped to approach the problem. Possible approachesinclude subdividing the region into smaller subre-gions [32], computing the Dirichlet tessellation (alsoknown as the Voronoi tessellation or Thiessen tessel-lation) and using that to identify NNs, or computingthe quadtree, a sorted matrix of locations that simpli-fies the search for NNs, and using that to identify

x coordinate

50100 30

0

50

100

150

200

y co

ordi

nate

Figure 1 Marked plot of tree locations in a 50 m × 200 mplot of hardwood swamp in South Carolina, USA. Circlesare locations of cypress trees, squares are locations of blackgum trees, and dots are locations of any other species.

NNs [33]. Murtagh [32] reviews the properties ofthese algorithms.

Functions or procedures for NN spatial analysisare included in few statistical programs, but directenumeration is very simple to program when needed.Packages of functions for spatial point pattern anal-ysis usually include functions for point–point and

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 5: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

Nearest neighbor methods 5

point–event analyses. Many of these are R packagessuch as Splancs, spatial, and spatstat.

Example: Trees in a Swamp Hardwood Forest

Figure 1 shows the locations of all 630 trees (stems >

11.5 cm diameter at breast height) and the locationsof 91 cypress trees in a 1- ha plot of swamp hardwoodforest in South Carolina, USA. There are 13 differenttree species in this plot, but over 75% of the stemsare one of the three species: black gum (Nyssasylvatica), water tupelo (Nyssa aquatica), or baldcypress (Taxodium distichum). Visually, trees seemto be scattered randomly throughout the plot, butcypress trees seem to be clustered in three bands. NNstatistics provide a way to test the hypothesis thatstems are randomly distributed throughout the plot.Tests based on mean NN distance, G(w), and F(x)

are illustrated using the locations of all 630 trees andthe locations of the 91 cypress trees.

For all 630 tree locations, the mean NN distanceis 1.99 m. If 630 points were randomly distributedin a 200 m × 50 m rectangle, the expected NN dis-tance is 2.034 m, with a standard error (SE) of0.044 m, using edge-corrected moments [20]. Thereis no evidence of departure from CSR (z = −0.973with a two-sided P value of 0.33). The effect of theedge corrections is minimal here because the plotis large and the NN distance is small. The uncor-rected expected NN distance is 1.992 m, with an SEof 0.042 m, using (3) and (4).

Conclusions using the distribution of NN dis-tances, G(w), are similar. G(w) was estimatedwithout edge corrections, so G(w) must be com-pared with simulated values, not the theoreticalexpectation (3). The observed cdf, the theoreti-cal expected value (3), and the average simu-lated cdf are very similar (Figure 2a), althoughthe observed G(w) is slightly larger than theexpected value at short distances. The differences

Distance (m)

cdf

1 2 3 4

0

2

4

6

8

(a) Distance (m)

cdf

1 2 3 4

−4

−2

0

4

2

(b)

Distance (m)

5 10 150

2

4

6

8

10

(c)

cdf

5 10 15

Distance (m)

5

0

10

−10

−5

15

−15

(d)

cdf

Figure 2 Cdf plots of NN distances for all trees and for cypress trees only, in a 50 m × 200 m plot (see Figure 1).(a) All trees: comparison of observed cdf (solid line), theoretical cdf (dotted line), and average simulated cdf under CSR(dashed line). (b) All trees: comparison of observed cdf (solid line) and average simulated cdf (dashed line). The dottedlines represent the pointwise 0.025 and 0.975 quantiles of the simulated cdf under CSR. For clarity, the theoretical cdfis subtracted from all plotted cdfs. (c) Cypress trees only: comparison of observed cdf (solid line), theoretical cdf (dottedline), and average simulated cdf under CSR (dashed line). (d) Cypress trees only: comparison of observed cdf (solid line)and average simulated cdf (dashed line). The dotted lines represent the pointwise 0.025 and 0.975 quantiles of the simulatedcdf under CSR. For clarity, the theoretical cdf is subtracted from all plotted cdfs.

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 6: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

6 Nearest neighbor methods

can be seen more clearly if 1 − exp(−ρπw2) issubtracted from all curves (Figure 2b). AlthoughG(w) is larger than both the theoretical expectedvalue and the average simulated value, it lieswithin the pointwise 95% confidence limits. None ofthe three summary statistics (Kolmogorov–Smirnov,Cramer–von Mises, or Anderson–Darling) is signif-icant at α = 0.05. For example, the observed Kol-mogorov–Smirnov test statistic of 0.044 is less thanthe simulated 90th percentile, 0.052. The estimated P

values for the Kolmogorov–Smirnov, Cramer–vonMises, and Anderson–Darling test statistics are 0.19,0.31, and 0.40, respectively.

In contrast, the distribution of point–event dis-tances suggests there is some clustering of treelocations. The observed F (x) is below the theo-retical and average simulated curves (Figure 3a, b)and outside the pointwise 95% confidence boundsat large distances (Figure 3b). Because F (x) falls

below the expected values, distances from randomlychosen points are stochastically greater than expectedif events exhibited CSR. The greater than expectedabundance of large empty spaces provides evidenceof clustering of the events. The P values for the Kol-mogorov–Smirnov, Cramer–von Mises, and Ander-son–Darling summary statistics range from 0.004 to0.009. The conclusion of some evidence for cluster-ing of all tree locations matches the conclusion usingRipley’s K function.

For the 91 cypress trees, the mean NN distanceis 5.08 m, which is slightly smaller than the edge-corrected expected distance of 5.55 m, with an SE of0.33 m. Using the NN distance, there is no evidenceof a nonrandom distribution; the z-statistic is −1.41,with a two-sided P value of 0.16. The effect of theedge corrections is larger when the density of pointsis smaller. The uncorrected expected NN distance forthe 91 cypress trees is 5.24 m, with an SE of 0.29 m.

Distance (m)

1 2 3 40

2

4

6

8

(a)

cdf

−4

0

−2

2

4

Distance (m)

1 2 3 4

(b)

cdf

(c) Distance (m)

0

2

4

6

8

10

5 10 15

cdf

Distance (m)(d)

5 10 15−2

−1

0

1

2

cdf

Figure 3 Cdf plots of point–event distances for all trees and for cypress trees only, in a 50 m × 200 m plot (see Figure 1).(a) All trees: comparison of observed cdf (solid line), theoretical cdf (dotted line), and average simulated cdf under CSR(dashed line). (b) All trees: comparison of observed cdf (solid line) and average simulated cdf (dashed line). The dottedlines represent the pointwise 0.025 and 0.975 quantiles of the simulated cdf under CSR. For clarity, the theoretical cdf issubtracted from all plotted cdfs. (c) Cypress trees only: comparison of the observed cdf (solid line), theoretical cdf (dottedline), and average simulated cdf under CSR (dashed line). (d) Cypress trees only: comparison of observed cdf (solid line)and average simulated cdf (dashed line). The dotted lines represent the pointwise 0.025 and 0.975 quantiles of the simulatedcdf under CSR. For clarity, the theoretical cdf is subtracted from all plotted cdfs.

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 7: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

Nearest neighbor methods 7

The distributions of G(w) and F (x) for the91 cypress trees provide evidence of clustering.There is an unusually large number of NN dis-tances between 3 m and 7 m (Figure 2c, d) andG(w) is at or above the pointwise 0.975 quan-tiles of simulated values (Figure 2d). This excessis consistent with clustering of cypress trees. Thereare also significant fewer (at least pointwise) NNdistances at 13 m. All three summary statisticsare significant (Kolmogorov–Smirnov P value =0.034, Cramer–von Mises P value = 0.007, Ander-son–Darling P value = 0.011). The point–event dis-tances are stochastically greater than expected underCSR (Figure 3c, d). The observed distribution, F (x),lies outside the pointwise 95% confidence bandsfor many distances. All three summary statistics arehighly significant (P = 0.001). The conclusion thatcypress trees are strongly clustered matches that usingRipley’s K function.

Directed Tests

The tests in the previous section are general tests ofCSR against an unspecified alternative. Other testsmay be more powerful when the alternative is morespecific (e.g., events are associated with specific sites,or the density of events increases from east to west).Association between point events and a nonpointstochastic process can be tested using the NN distancefrom each event to the nearest part of the secondprocess [34]. However, most directed tests [35–37]use features other than NN distance.

Describing and Testing Spatial Patternswith Use of a Sample of Nearest NeighborDistances

An alternative to mapping the locations of all eventsin a study area is to measure NN distances for a ran-dom sample of individuals. When NN distances arecalculated from a simple random sample of events,the distributional theory for both the mean NN dis-tance and the distribution function is much simpler.Many different tests of CSR have been developed foruse with a random sample of point–event, NN, orpoint–event–event distances. These are summarizedand evaluated in Ref. 8, pp. 602–614 and Ref. 9,pp. 33–40.

The most straightforward way to select a randomsample of NN distances is to enumerate all indi-viduals in the statistical population, select a simplerandom sample of events, and measure the distancesfrom the selected events to their NNs. Enumera-tion can be avoided by clever use of subregions(described by Byth and Ripley [38]), or by randomlyselecting points (not events). The distance from therandomly selected point to the nearest event is arandom sample from the distribution of point–eventdistances, F(x), but the distance from that event toits closest event is not a random sample from G(w)

because the point–event and event–event distancesare correlated. The distributions of all quantities inthe point–event–event sample when events exhibitCSR have been derived [39].

An alternative that is easy to implement in the fieldis the T -square sample [40], illustrated in Refs 8,9, p. 35, a modified point–event–event sample. Apoint, A, is randomly chosen, and the NN, B, isfound. Define X as the distance between A andB. Then, the study area is divided into two halfplanes by a line through B and perpendicular to AB(hence the name, T -square). Attention is restrictedto the half plane that does not contain point A.The distance to the NN, Z, of B in that halfplane is measured. When points exhibit CSR, X andZ are independent, and the distribution of Z/

√2

is the same as the distribution of NN distances,G(w) [40].

Estimating Density

A random sample of point to NN distances can beused to estimate ρ, the average density of eventsin the study area. When events exhibit CSR, themaximum likelihood estimate is

ρ = n

n∑i=1

X2i

)−1

(9)

where Xi is the distance from a randomly chosenpoint i to its NN [41]. An unbiased estimate is [42]:

ρP = (n − 1)

n∑i=1

X2i

)−1

(10)

Both estimators are very dependent on the CSRassumption and can be biased if locations are clus-tered or regularly distributed.

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 8: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

8 Nearest neighbor methods

Many other estimators have been proposed. Dig-gle [9, pp. 39–40] summarizes many of these. Byth[43] evaluated the robustness of many estimatorsto deviations from CSR. She recommended an esti-mator, ρT, based on two quantities from T -squaresampling: X, the distance from point to nearest event,and Z, the distance to the NN in a half plane:

ρT = n2

[2√

2

(n∑

i=1

Xi

) (n∑

i=1

Zi

)]−1

(11)

Nearest Neighbor Methods to ExamineSpatial Patterns of More Than One Typeof Point

Additional information about an event is often avail-able. For example, tree locations might be markedwith the species of tree (a mark with discrete levels)or the size of the tree (a mark with continuous levels).The methods in the previous sections can be used toanalyze patterns in all events (ignoring the marks) orsubgroups of events (e.g., just species A or just treeslarger than 50 cm). However, other interesting ques-tions could be asked about the relationship betweenthe two (or more) sets of locations.

Multivariate spatial point patterns are those whereevents can be classified into different types, thatis, the marks are discrete [8, p. 707]. Usually, thenumber of different types is small; bivariate patterns,with two types of marks, are the most common. Somequestions that could be asked about point processeswith discrete marks are as follows:

1. Are the processes that generate locations withdifferent marks independent?

2. Are marks randomly assigned to locations? Con-ditional on the observed locations of superposi-tion of the two marked processes, are the marksindependent?

3. Are marks segregated? Are locations with onetype of mark surrounded by locations with thesame mark?

These questions about the relationships betweenprocesses make no assumptions about the marginalpattern of each process. In particular, either process(or the superposition of processes) may be inde-pendent, clustered, or regularly distributed. The two

general methods to answer these questions are thecomparison of distribution functions [44, 45] and theNN contingency table [4, 14, 46]. Other approachesinclude Ripley’s K functions, van Lieshout’s J func-tion [47], or parametric point process models.

Define the following multitype extensions of thepoint–event and NN distances. Xi is the distancefrom a randomly chosen point to the nearest eventwith mark i, with cdf Fi(x). Wij is the distance froman event with mark i to the nearest event with markj , with cdf Gij (w). If the process with mark i isindependent of the process with mark j , then

Fi(x) = Gji(x) (12)

Fj (x) = Gij (x) (13)

and Xi and Xj are independent [44, 48]. Notethat property (12) does not imply property (13),so two tests are needed [44]. One possible test isbased on supx |Fi(x) − Gji(x)|. Critical values areestimated by Monte Carlo simulation because of thenonindependence of point–event and event–eventdistances.

Two different simulation methods could be usedin the Monte Carlo test. The choice depends onthe null hypothesis. If the null hypothesis is inde-pendence between marks (question 1, above), thentoroidal shifts or some parametric model should beused to generate the randomization distribution. If thenull hypothesis is random assignment of marks con-ditional on the set of events, then random labeling ofevents should be used to generate the randomizationdistribution. In general, these two hypotheses are notequivalent and the sampling distributions are not thesame [49].

Nearest Neighbor Contingency Tables

The NN contingency table focuses on the ecologi-cally important question of segregation [4, 46]. Thistable describes marks of events and their NNs, notthe distance between them (Table 1). In sparselysampled data, the counts (NAA, NAB, NBA, and NBB)are independent Poisson random variables or condi-tionally independent given the row marginal totals(NA and NB) under the null hypothesis of randomlabeling [4]. The hypothesis can be tested with atraditional one degree of freedom (df) χ2 test of inde-pendence [4] (see Categorical data).

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 9: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

Nearest neighbor methods 9

Table 1 Cell counts in an NN contingency table.

Mark of neighbor

Mark of point A B Total

A NAA NAB NA

B NBA NBB NB

Total MA MB N

N is the number of points in the spatial pattern. NA and NB arethe number of type A and type B points. NAA is the number oftype A points with type A NNs. NAB is the number of type Apoints with type B NNs. NBA and NBB are the number of typeB points with type A and type B NNs, respectively. MA andMB are the column sums, that is, the total number of pointswith type A neighbors and type B neighbors, respectively.

Table 2 Expected cell counts for NN contingency tablefor completely mapped data.

To:

From: A B

A NA(NA − 1)N − 1

NANBN − 1

B NBNAN − 1

NB(NB − 1)N − 1

See Table 1 for definitions of M , NA, and NB.

In completely mapped data, the sampling distri-bution of the counts is different [16]. If events arerandomly labeled, the expected values of the countsdepend only on the number of each type of event(NA and NB) and the total number of events, N

(Table 2) [46]. The variances and covariances dependon the number of events of each type, the number ofreflexive NNs, and the number of shared NNs; for aderivation of the formula, see Ref. 46. The first twomoments of the cell counts can be used to test forsegregation of type A events [NAA > E(NAA)], testfor segregation of type B events [NBB > E(NBB)],or construct an omnibus 2 df χ2 test of randomlabeling. If the numbers of points are large, thenthe distributions of test statistics can be adequatelyapproximated by asymptotic normal and χ2 distribu-tions [50]. If the number of points is small, then thedistributions should be determined by Monte Carlosimulation.

Patterns with k marks (k > 2) can be analyzed byconsidering all pairs of marks two at a time (using

distance methods or 2 × 2 NN contingency tables),or by considering the k × k contingency table. Theexpected counts and their variances under randomlabeling follow the same form as those for a 2 × 2contingency table, but there are more possible formsfor the covariance between two counts. Details aregiven in Ref. 51.

Other approaches that have been suggested forthe analysis of multitype point processes include thecomparison of bivariate Ripley’s K functions [45,52], empty space methods [45] (comparisons ofpoint–event distance distributions), and mark corre-lation functions [7].

Example, Part 2

Cypress and black gum are two of the three abun-dant species in the 1-ha plot of swamp forestconsidered in Example 1. An interesting ecologi-cal question is whether these two species are spa-tially segregated; that is, do cypress trees tend tobe found near other cypress trees and do blackgum trees tend to be found near other black gumtrees? The marked plot of locations (Figure 1) sug-gests that cypress trees and black gum occur indifferent clusters. Confirming this requires an anal-ysis of the bivariate spatial pattern. The threetests that will be illustrated are the comparisonsof cdfs (12), (13) [48], the independence of dis-tances [48], and the NN contingency table [46].The ecological background suggests that randomlabeling is the more appropriate null hypothe-sis.

The cdf of distances from randomly chosen pointsto the nearest black gum tree, FG(x), and the cdfof point–event distances to the nearest cypress tree,FC(x), were estimated without edge corrections byusing a randomly located grid of points [38]. Thecdfs of distances from black gum trees to the nearestcypress tree, GGC(x), and from cypress trees to thenearest black gum tree, GCG(x), were also estimatedwithout edge corrections. Both species show the samepattern. Cypress trees are stochastically further fromblack gum trees than from randomly chosen points(Figure 4a). Also, black gum trees are stochasticallyfarther from cypress trees than from randomly chosenpoints (Figure 4b).

The observed differences can be compared withthose found under random labeling by using the Kol-mogorov–Smirnov two-sample statistics, supx |FG(x)

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 10: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

10 Nearest neighbor methods

5 10 15 20

0

0.2

0.4

0.6

0.8

1.0

Distance (m)(a)

cdf

5 10 15 200

0.2

0.4

0.6

0.8

1.0

Distance (m)(b)

cdf

0

0

5

5

10

10

15

15

20

20

Distance to black gum (m)(c)

Dis

tanc

e to

cyp

ress

(m

)

Figure 4 Cdf of point–event distances for all treesand for cypress trees only. (a) Comparison of cdfs ofpoint to cypress tree distances [FC(x), dotted line] andblack gum tree to cypress tree distances [GGC(x), solidline]. (b) Comparison of cdfs of point to black gumtree distances [FG(x), dotted line] and cypress tree to blackgum tree distances [GCG(x), solid line]. (c) Relationshipbetween distances from a randomly chosen point to thenearest cypress tree and nearest black gum tree.

− GCG(x)| and supx |FC(x) − GGC(x)|, as the teststatistics. The observed maximum differences are notunusually large (P value = 0.109 for black gum and0.083 for cypress). The distance from a randomlychosen point to the nearest black gum tree, XG,is negatively correlated with the distance from thesame point to the nearest cypress tree, XC (Figure 4c;

Kendall’s τ = −0.12, with a one-sided P value,by randomization, of 0.001). This result is consis-tent with the spatial segregation of the two species.The different results from the three tests are con-sistent with Diggle and Cox’s [48] observation thatthe correlation test is more powerful than the Kol-mogorov–Smirnov test for the sparsely sampled spa-tial patterns they studied.

The NN contingency table indicates that bothspecies have an excess of NNs of the same species(Table 3). The variances of the cell counts are38.88 for black gum–black gum and 25.55 forcypress–cypress. The P values can be computed byMonte Carlo randomization or by a normal approx-imation [46]. In either case, the one-sided P valuesare small (0.001 or less) for both species.

Nearest Neighbor Methods for FieldExperiments

A different set of NN methods can be used to ana-lyze spatially structured field experiments. One of themost common applications is in agronomic varietytrials, which commonly include many treatments ran-domly assigned to small plots arranged in a rectan-gular lattice. Traditional methods of controlling forbetween-plot heterogeneity, such as using a random-ized complete block (RCB) design, may not be veryeffective when a large number of treatments forcesthe blocks to be large. NN methods use informationfrom adjacent plots to adjust for within-block het-erogeneity and so provide more precise estimates oftreatment means and differences. If there is within-plot heterogeneity on a spatial scale that is larger thana single plot and smaller than the entire block, thenyields from adjacent plots will be positively corre-lated. Information from neighboring plots can be usedto reduce or remove the unwanted effect of the spa-tial heterogeneity and hence improve the estimate ofthe treatment effect. Data from neighboring plots canalso be used to reduce the influence of competitionbetween adjacent plots. Each of these approaches willbe briefly discussed below.

Papadakis [53, 54] proposed an analysis of covari-ance to reduce the effects of small-scale spatial het-erogeneity in yields. The value of the covariate foreach plot is obtained by averaging residuals from theneighboring plots. The choice of neighboring plotsdepends on the crop, the plot size and shape, and the

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 11: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

Nearest neighbor methods 11

Table 3 Observed counts (NO), expected counts (NE), and z scores for the cypress tree and blackgum tree NN contingency table.

Species of neighbor

Black gum tree Cypress treeSpecies ofpoint NO NE z NO NE z Total

Black gum 149 121.1 4.47 33 60.9 −4.47 182Cypress 43 60.9 −3.54 48 30.1 3.54 91Total 192 81 273

spacing between plots. In many row crops, the neigh-boring plots are defined as the two adjacent plots inthe row, except that plots at an end of a row have onlyone neighbor. In other situations, it may be appropri-ate to consider four neighbors, which include the twobetween-row neighbors. If the spatial heterogeneity issuch that the effects of within-row and between-rowneighbors are quite different, then one could computeseparate covariates for within-row and between-rowneighbors. Once the covariates are computed, treat-ment effects are reestimated using an analysis ofcovariance. For adjustment in one dimension (e.g.,along crop rows), the model would be

Yi = μ + Xiτi + βRi + εi (14)

where Yi is the observed yield on the ith plot, Ri

is the mean residual on neighbors of the ith plot,β is the spatial dependence parameter, μ and τi

are the parameters in the model for the treatmenteffects, Xi is the row of the design matrix for theith plot, and εi is the residual variability in yields,which are assumed to be uncorrelated. When β is0, observations on adjacent plots are independent;larger positive values of β (β < 1) correspond toincreasing spatial correlation between neighboringplots. The observed value of β depends on plot size,plot shape, plot spacing, and the scale of the spatialheterogeneity. Values are often close to one whenplots are small. In the absence of treatments, andignoring edge effects, the Papadakis model impliesthat correlations between plot yields have a first-orderautoregressive structure, with corr(Yi, Yj ) = λ|i−j |,where β = 2λ/(1 + λ2). Values of β close to oneimply that λ is also close to one.

The autoregressive correlation structure impliedby the ad hoc Papadakis model is one example ofthe random field approach to a spatially designed

experiment [8, 55, 56]. Many other models, includ-ing the iterated Papadakis method, the WilkinsonNN [57] model, the Besag and Kempton [58] first-order difference models, the Williams model [59],and the Gleeson–Cullis autoregressive integratedmoving average (ARIMA) models [60, 61] (see Timeseries) correspond to different specifications of aspatial correlation matrix. Computations are handledeither by a general purpose residual maximum like-lihood (REML) algorithm for linear mixed effectsmodels [e.g., PROC MIXED or PROC GLIMMIXin SAS, lme() in the R nlme package, lmer( ) inthe R lme4 package], or by specialized software fora particular model (e.g., TwoD for two-dimensionalGleeson–Cullis models [62]).

The properties of these methods have been exten-sively discussed over the past 20 years. Dagnelie [63,64] provides a review and historical summary of thePapadakis model. Wu and Dutilleul [65] use unifor-mity trial data to compare autoregressive models,difference models, and traditional RCB analyses. Theefficiency of a spatial analysis, relative to an RCBdesign, is usually greater than 1.2 and can be as highas 2 [64]. Wilkinson [57] claimed the Papadakis esti-mator of treatment effects was biased, but Pearce [66]found generally very small biases in the scenarios heconsidered. The Papadakis method appears to workbest when there are at least three replicates per treat-ment, many treatments (greater than 10), and strongbut patchy spatial heterogeneity [64]. When thereis an underlying trend, first-order difference modelsappear to work well.

Medium-scale spatial heterogeneity usually causesa positive correlation between adjacent plots. Whenthere is competition between plots, neighbors canhave a negative effect on the response in adja-cent plots [67]. The Papadakis model (14) can be

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 12: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

12 Nearest neighbor methods

extended to estimate treatment-specific competitiveeffects. The choice of covariate should be influencedby biological mechanisms. If competition for sun-light is important, a reasonable covariate could bethe difference between the mean height of plants inthe plot and the mean height on neighboring plots.If disease spread is important, a reasonable covariatecould be the mean disease severity on neighboringplots [68]. The coefficient for the covariate [β in(14)] estimates the strength of the competitive rela-tionship.

Experimental design for a study that will use someform of neighbor-adjusted analysis usually focuses onneighbor balance; that is, ensuring that all pairs oftreatments occur in adjacent plots equally frequently.Adjacent plots can be defined as only those withinthe same row (one-dimensional neighbor-balanceddesigns [69]) or as including both those in the samerow and those in the same column (two-dimensionalneighbor-balanced designs [70]). The choice willdepend on the size, shape, and spacing of the plotsand on the biological and physical mechanisms influ-encing the correlation between plots. Methods forconstruction of one-dimensional or two-dimensionaldesigns can be found in Refs 69–71.

References

[1] Devroye, L. & Gyorfi, L. (1985). Nonparametric densityestimation, The L1View, John Wiley & Sons, New York.

[2] Ripley, B.D. (1996). Pattern Recognition and NeuralNetworks, Cambridge University Press, Cambridge.

[3] Everitt, B.S., Landau, S., Leese, M., & Stahl, D. (2011).Cluster Analysis, 5th Edition, John Wiley & Sons, Ltd,Chichester.

[4] Pielou, E.C. (1961). Segregation and symmetry in two-species populations as studied by nearest-neighbourrelationships, Journal of Ecology 49, 255–269.

[5] Brown, D. (1975). A test of randomness in nest spacing,Wildlife 26, 102–103.

[6] Cuzick, J. & Edwards, R. (1990). Spatial clustering forinhomogeneous populations (with discussion), Journalof the Royal Statistical Society, Series B 52, 73–104.

[7] Illian, J., Penttinen, A., Stoyan, H., & Stoyan, D.(2008). Statistical Analysis and Modelling of SpatialPoint Patterns, John Wiley & Sons, Chichester.

[8] Cressie, N. (1991). Statistics for Spatial Data, JohnWiley & Sons, New York.

[9] Diggle, P.J. (2003). Statistical Analysis of Spatial PointPatterns, 2nd Edition, Arnold, London.

[10] Pielou, E.C. (1977). Mathematical Ecology, John Wiley& Sons, New York.

[11] Stoyan, D. & Penttinen, A. (2000). Recent applicationsof point process methods in forestry statistics, StatisticalScience 15, 61–78.

[12] Ashby, E. (1936). Statistical ecology, Botanical Review2, 221–235.

[13] Ashby, E. (1948). Statistical ecology. II – a reassess-ment, Botanical Review 14, 222–234.

[14] Clark, P.J. & Evans, F.C. (1954). Distance to nearestneighbor as a measure of spatial relationships in popu-lations, Ecology 35, 445–453.

[15] Neyman, J. & Scott, E.L. (1952). A theory of thespatial distribution of galaxies, Astrophysical Journal116, 144–163.

[16] Cox, T.F. (1981). Reflexive nearest neighbours, Biomet-rics 37, 367–369.

[17] Pickard, D.K. (1982). Isolated nearest neighbors, Jour-nal of Applied Probability 19, 444–449.

[18] Schilling, M.F. (1986). Mutual and shared neighborprobabilities: finite- and infinite-dimensional results,Advances in Applied Probability 18, 388–405.

[19] Brewer, R. & McCann, M.T. (1985). Spacing in acornwoodpeckers, Ecology 66, 307–308.

[20] Donnelly, K.P. (1978). Simulations to determine thevariance and edge effect of total nearest-neighbour dis-tance, in Simulation Studies in Archaeology, I. Hodder,ed., Cambridge University Press, Cambridge, pp. 91–95.

[21] Diggle, P.J., Besag, J., & Gleaves, J.T. (1976). Statisticalanalysis of spatial point patterns by means of distancemethods, Biometrics 32, 659–667.

[22] Baddeley, A. & Gill, R.D. (1997). Kaplan–Meier esti-mators of distance distributions for spatial point pro-cesses, The Annals of Statistics 25, 263–292.

[23] Gignoux, J., Duby, C., & Barot, S. (1999). Comparingthe performance of Diggle’s tests of spatial randomnessfor small samples with and without edge-effect cor-rection: application to ecological data, Biometrics 55,156–164.

[24] Brown, D. & Rothery, P. (1978). Randomness and localregularity of points in a plane, Biometrika 65, 115–122.

[25] Silverman, B. & Brown, T. (1978). Short distance,flat triangles, and Poisson limits, Journal of AppliedProbability 15, 815–825.

[26] van Lieshout, M.N.M. & Baddeley, A. (1996). A non-parametric measure of spatial interaction in point pat-terns, Statistica Neerlandica 50, 344–361.

[27] Baddeley, A., Kerscher, M., Schladitz, K., & Scott, B.T.(2000). Estimating the J function without edge correc-tion, Statistica Neerlandica 54, 315–328.

[28] van Lieshout, M.N.M. (2011). A $J$-function for inho-mogeneous point processes, Statistica Neerlandica 65,183–201.

[29] Maltez-Mauro, S., Garcia, L.V., Maranon, T., & Fre-itas, H. (2007). Recruitment patterns in a MediterraneanOak forest: a case study showing the important of thespatial component, Forest Science 53, 645–652.

[30] Baddeley, A.J. & Silverman, B.W. (1984). A caution-ary example on the use of second-order methods foranalyzing point patterns, Biometrics 40, 1089–1093.

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 13: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

Nearest neighbor methods 13

[31] Simberloff, D. (1979). Nearest neighbor assessmentsof spatial configurations of circles rather than points,Ecology 60, 679–685.

[32] Murtagh, F. (1984). A review of fast techniques fornearest neighbor searching, Compstat 1984, 143–147.

[33] Friedman, J., Bentley, J.L., & Finkel, R.A. (1977).An algorithm for finding best matches in logarithmicexpected time, ACM Transactions on Mathematical Soft-ware 3, 209–226.

[34] Berman, M. (1986). Testing for spatial associationbetween a point process and another stochastic process,Applied Statistics 35, 54–62.

[35] Lawson, A. (1988). On tests for spatial trend in anon-homogeneous Poisson process, Journal of AppliedStatistics 15, 225–234.

[36] Rathbun, S.L. (1996). Estimation of Poisson intensityusing partially observed concomitant variables, Biomet-rics 52, 226–242.

[37] Waller, L.A., Turnbull, B.W., Clark, L.C., & Nasca, P.(1994). Spatial pattern analyses to detect rare diseaseclusters, in Case Studies in Biometry, N. Lange, L. Ryan,L. Billard, D. Brillinger, L. Conquest, & J. Greenhouse,eds, John Wiley & Sons, Inc., New York, pp. 3–23.

[38] Byth, K. & Ripley, B.D. (1980). On sampling spatialpatterns by distance methods, Biometrics 36, 279–284.

[39] Cox, T.F. & Lewis, T. (1976). A conditioned distanceratio method for analyzing spatial patterns, Biometrika63, 483–491.

[40] Besag, J.E. & Gleaves, J.T. (1973). On the detectionof spatial pattern in plant communities, Bulletin of theInternational Statistical Institute 45, 153–158.

[41] Moore, P.G. (1954). Spacing in plant populations, Ecol-ogy 35, 222–227.

[42] Pollard, J.H. (1971). On distance estimators of density inrandomly distributed forests, Biometrics 27, 991–1002.

[43] Byth, K. (1982). On robust distance-based intensityestimators, Biometrics 38, 127–135.

[44] Goodall, D.W. (1965). Plot-less tests of interspecificassociation, Journal of Ecology 53, 197–210.

[45] Lotwick, H.W. & Silverman, B.W. (1982). Methods foranalyzing spatial processes of several types of points,Journal of the Royal Statistical Society, Series B 44,406–413.

[46] Dixon, P.M. (1994). Testing spatial segregation usinga nearest-neighbor contingency table, Ecology 75,1940–1948.

[47] van Lieshout, M.N.M. (2006). A $J$-function formarked point patterns, Annals of the Institute of Statisti-cal Mechanics 58, 235–259.

[48] Diggle, P.J. & Cox, T.F. (1983). Some distance-basedtests of independence for sparsely-sampled multivariatespatial point patterns, International Statistical Review51, 11–23.

[49] Goreaud, F. & Pelissier, R. (2003). Avoiding misinter-pretation of biotic interactions with the intertype K12-function: population independence vs. random labellinghypothesis, Journal of Vegetation Science 14, 681–692.

[50] Ceyhan, E. (2010). On the use of nearest neighborcontingency tables for testing spatial segregation, Envi-ronmental and Ecological Statistics 17, 247–282.

[51] Dixon, P.M. (2002). Nearest-neighbor contingency tableanalysis of spatial segregation for several species, Eco-science 9, 142–151.

[52] Diggle, P.J. & Chetwynd, A.G. (1991). Second-orderanalysis of spatial clustering for inhomogeneous popu-lations, Biometrics 47, 1155–1163.

[53] Papadakis, J.S. (1937). Methode statistique pour desexperiences sur champ, Institut d’Amelioration desPlantes a Thessaloniki (Thessalonika, Greece), BulletinScientifique, No. 23.

[54] Papadakis, J.S. (1984). Advances in the analysis of fieldexperiments, Proceedings of the Academy of Athens 59,326–342.

[55] Bartlett, M.S. (1978). Nearest neighbour models in theanalysis of field experiments (with discussion), Journalof the Royal Statistical Society, Series B 40, 147–174.

[56] Zimmerman, D.L. & Harville, D.A. (1991). A randomfield approach to the analysis of field-plot experimentsand other spatial experiments, Biometrics 47, 223–239.

[57] Wilkinson, G.N., Eckert, S.R., Hancock, T.W., &Mayo, O. (1983). Nearest neighbour (NN) analysis offield experiments (with discussion), Journal of the RoyalStatistical Society, Series B 45, 151–211.

[58] Besag, J. & Kempton, R. (1986). Statistical analysis offield experiments using neighbouring plots, Biometrics42, 231–251.

[59] Williams, E.R. (1986). A neighbour model for fieldexperiments, Biometrika 73, 279–287.

[60] Cullis, B.R. & Gleeson, A.C. (1991). Spatial analysisof field experiments – extension to two dimensions,Biometrics 47, 1449–1460.

[61] Gleeson, A.C. & Cullis, B.R. (1987). Residual maximumlikelihood (REML) estimation of a neighbour model forfield experiments, Biometrics 43, 277–288.

[62] Gilmour, A.R. (1992). TwoD: A Program to Fit a MixedLinear Model with Two-Dimensional Spatial Adjustmentfor Local Trend, NSW Agriculture Research Center,Tamworth, Australia.

[63] Dagnelie, P. (1987). La methode de Papadakis en exper-imentation agronomique: considerations historiques etbibliographiques, Biometrie et Praximetrie 27, 49–64.

[64] Dagnelie, P. (1989). The method of Papadakis in agricul-tural estimation: an overview, Biuletyn Oceny Odmian(Cultivar Testing Bulletin) 21, 111–122.

[65] Wu, T. & Dutilleul, P. (1999). Validity and efficiency ofneighbor analyses in comparison with classical completeand incomplete block analyses of field experiments,Agronomy Journal 91, 721–731.

[66] Pearce, S.C. (1998). Field experimentation on roughland: the method of Papadakis reconsidered, Journal ofAgricultural Science, Cambridge 131, 1–11.

[67] Kempton, R. (1982). Adjustment for competition bet-ween varieties in plant breeding trials, Journal of Agri-cultural Science 98, 599–611.

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2

Page 14: Encyclopedia of Environmetrics || Nearest Neighbor Methods Based in part on the article “Nearest neighbor methods” by Philip M. Dixon, which appeared in the Encyclopedia of Environmetrics

14 Nearest neighbor methods

[68] Kempton, R. (1991). Interference in agricultural exper-iments, Proceedings of the Second Biometric Soci-ety East/Central/Southern African Network Meeting,Harare, Zimbabwe.

[69] Williams, R.M. (1952). Experimental designs forserially correlated observations, Biometrika 39,151–167.

[70] Freeman, G.H. (1979). Some two-dimensional designsbalanced for nearest neighbours, Journal of the RoyalStatistical Society, Series B 41, 88–95.

[71] Federer, W.T. & Basford, K.E. (1991). Competingeffects designs and models for two-dimensional fieldarrangements, Biometrics 47, 1461–1472.

(See also Point processes, spatial–temporal; Rip-ley’s K function; Spatial analysis in ecology;Spatial design, optimal)

PHILIP M. DIXON

Encyclopedia of Environmetrics, Online © 2006 John Wiley & Sons, Ltd.This article is © 2013 John Wiley & Sons, Ltd.This article was published in Encyclopedia of Environmetrics Second Edition in 2012 by John Wiley & Sons, Ltd.DOI: 10.1002/9780470057339.van007.pub2