Analysis of Spatial Point Patterns
Using Hierarchical Clustering
Algorithms
Sandra M. C. Pereira
Grad Dip (UFMS), BSc Hons, MSc (UnB), Brazil
This thesis is presented for the degree of
Doctor of Philosophy
of the University of Western Australia
School of Mathematics & Statistics.
September 2003
Abstract
This thesis is a new proposal for analysing spatial point patterns in spatial statis-
tics using the outputs of popular techniques of (classical, non-spatial, multivariate)
cluster analysis. The outputs of a chosen hierarchical algorithm, named fusion dis-
tances, are applied to investigate important spatial characteristics of a given point
pattern.
The fusion distances may be regarded as a missing link between the fields of
spatial statistics and multivariate cluster analysis. Up to now, these two fields have
remained rather separate because of fundamental differences in approach. It is shown
that fusion distances are very good at discriminating different types of spatial point
patterns.
A detailed study on the power of the Monte Carlo test under the null hypoth-
esis of Complete Spatial Randomness (the benchmark of spatial statistics) against
chosen alternative models is also conducted. For instance, the test (based on the
fusion distance) is very powerful for some arbitrary values of the parameters of the
alternative.
A new general approach is developed for analysing a given point pattern using
several graphical techniques for exploratory data analysis and inference. The new
strategy is applied to univariate and multivariate point patterns. A new extension of
a popular strategy in spatial statistics, named the analysis of the local configuration,
is also developed. This new extension uses the fusion distances, and analyses a
localised neighbourhood of a given point of the point pattern.
A new spatial summary function and new statistics, namely the fusion distance function
H(t), the area statistic A, the S statistic and the spatial Rg index, are introduced and
shown to be useful tools for identifying relevant features of spatial point patterns.
In conclusion, the new methodology using the outputs of hierarchical clustering
algorithms can be considered as an essential complement to the existing approaches
in spatial statistics literature.
Resumo (translated from the Portuguese)
This doctoral thesis is a new proposal for analysing spatial point patterns in
spatial statistics using hierarchical techniques of cluster analysis for multivariate
data. The results obtained by applying a hierarchical algorithm chosen a priori,
called the fusion distances, are used to investigate the important characteristics
of an arbitrary point pattern.
The fusion distances may be regarded as a bridge between the fields of spatial
statistics and cluster analysis. Up to the present, these two fields have remained
separate because of fundamental differences in methodology. It is shown that the
fusion distances are very good at discriminating between different types of point
patterns.
The power of the test of the null hypothesis of complete spatial randomness against
the alternative hypothesis of non-randomness, based on spatial models of regularity
and clustering, was studied using simulations. For example, the test (using the
fusion distances) is very powerful for arbitrary values of the parameters of the
selected alternative models.
A new general methodology is developed for studying point patterns using several
techniques of exploratory data analysis and inference. The new methodology is
applied to univariate and multivariate point patterns. A new extension of the
popular method in spatial statistics called the analysis of local configuration is
also developed. This extension uses the fusion distances and analyses a local
neighbourhood of an arbitrary point of the point pattern.
Three new statistics and one new function are presented and defined in this
thesis: the fusion distance function H(t); the area statistic A; the S statistic; and
the spatial Rg index. It is shown that these new statistics are useful tools for
identifying relevant properties of point patterns.
It is therefore hoped that the new procedure for analysing point patterns, founded
on the fusion distances, will be an essential complement to the existing methods in
the spatial statistics literature.
Statement of Originality
The research and computational work presented in this thesis are wholly my own,
except for the cited and quoted references and the ideas that are explicitly stated
and acknowledged in the text.
“You are a child of the universe, no less than the trees and the stars and you
have a right to be here.” Excerpt from Desiderata by Max Ehrmann [36].
Figure 1: A specimen of Ouratea acuminata which is the most frequent species found
in the Brazilian trees dataset. Source: [77].
Contents
Abstract
List of Tables
List of Figures
Acknowledgements
1 Introduction
1.1 Thesis research
1.2 Overview of the thesis
2 Spatial point patterns and cluster analysis
2.1 Spatial point patterns
2.2 Spatial clustering
2.3 Cluster analysis
2.4 Selected hierarchical clustering algorithms
3 Monte Carlo test
3.1 General case
3.2 Function estimate
3.3 P-P plot
3.4 Inapplicability of Monte Carlo test to P-P plot
3.5 Modified Monte Carlo test applied to P-P plot
3.6 Q-Q plot
3.7 A-A plot
4 New strategy for analysing point patterns
4.1 New summary function and statistic
4.1.1 Fusion distance function
4.1.2 Area statistic
4.2 Relative distribution
4.3 Description of strategy
4.3.1 Exploratory data analysis
4.3.2 Inference
4.4 Applications
4.4.1 Application to published point patterns
4.4.2 Application to simulated point patterns
5 Study of power
5.1 Models
5.2 Tests
5.3 Experimental study
5.4 Estimation of power
5.4.1 Test using supremum distance
5.4.2 Test using area statistic
6 Analysis of multivariate point patterns
6.1 Extension based on fusion distance function
6.2 Extension based on S statistic
6.3 Extension based on spatial Rg index
7 Analysis of local configuration
7.1 Strategy
7.2 Applications
7.2.1 Application to full redwoods
7.2.2 Application to Longleaf pines
7.2.3 Application to Lansing woods
8 Analysis of Brazilian trees point pattern
8.1 Brazilian trees point pattern
8.2 Analysis of univariate Brazilian trees dataset
8.3 Analysis of Multivariate Brazilian trees dataset
8.4 Complementary analysis
8.4.1 Fusion distance function
8.4.2 Area statistic
8.4.3 S statistic and spatial Rg index
8.4.4 Gamma approximation for spatial Rg index
8.4.5 Analysis of local configuration
9 Conclusion and open problems
9.1 Problems studied and findings
9.2 Critique
9.3 Open problems
Bibliography
A New strategy based on the Average and Complete Linkage
A.1 Exploratory data analysis
A.2 Inference
A.2.1 Envelopes for P-P plots, Q-Q plots and A-A plots
A.2.2 Bands for P-P plots, Q-Q plots and A-A plots
A.3 Random labelling hypothesis
A.4 Histograms
B Power of the test: fusion distance function
B.1 Cluster alternative
B.2 Inhibition alternative
C Complementary information on the Brazilian trees dataset
List of Tables
4.1 Empirical area statistic
5.1 Power of test: clustering, area statistic
5.2 Power of test: inhibition, area statistic
6.1 S statistic
6.2 Two classifications
6.3 Spatial Rg index: Single Linkage, Average Linkage
6.4 Spatial Rg index and gamma approximation
7.1 Full redwoods: Single, Average, Complete Linkage
7.2 Contingency tables: Longleaf pines, Single, Average and Complete Linkage
7.3 Contingency table: Lansing woods, Average Linkage
7.4 Contingency table and Pearson residuals: Lansing woods, Average Linkage
8.1 Brazilian trees and ranked frequency of species
8.2 Brazilian trees dataset: seven subclasses, three classes, two types
8.3 Area statistic for Brazilian trees
8.4 S statistic, Rg index: Brazilian trees dataset
8.5 Monte Carlo null distribution of spatial Rg index
8.6 Contingency tables of Brazilian trees into seven, three, two types and groups
B.1 Power: cluster alternative
B.2 Power: inhibition alternative
C.1 Heights of Brazilian trees
C.2 Dbh of Brazilian trees
C.3 Brazilian trees’ plant systematics
List of Figures
1 Ouratea acuminata
2.1 Standard spatial datasets
2.2 Single Linkage dendrograms
3.1 Monte Carlo test applied to function estimates
3.2 Inapplicability of pointwise Monte Carlo test to P-P plots
3.3 Monte Carlo tests using critical band
4.1 Fusion distance function plots
4.2 Knee plots
4.3 Inverted knee plots
4.4 Single Linkage relative pdf plots
4.5 P-P plots
4.6 Simulation envelopes: P-P, Q-Q plots
4.7 Envelopes for A-A plots
4.8 Critical bands: P-P, Q-Q plots
4.9 Critical bands for A-A plots
4.10 Application of the new strategy: dataset 1
4.11 Application of the new strategy: dataset 2
5.1 Realisations from clustering
5.2 Inhibition model I
5.3 Power of tests: clustering
5.4 Power of tests: clustering, cont.
5.5 Interpretation of power: clustering
5.6 Power of test: inhibition
5.7 Interpretation of power: inhibition
5.8 Power of test: clustering, inhibition, area statistic
6.1 Cat Retinal Ganglia dataset
6.2 P-P plots for Cat Retinal Ganglia dataset
6.3 A-A plots for Cat Retinal Ganglia dataset
6.4 Simple example
6.5 Austin Hughes’ dataset
6.6 Classified Longleaf pines dataset
6.7 Clustered dataset
6.8 Full redwoods dataset
6.9 Gamma approximations
7.1 Kernel densities of full redwoods
7.2 Dendrograms of tvd for full redwoods
7.3 Local configuration of full redwoods
7.4 Local fusion distance function: full redwoods
7.5 Proportional Longleaf pines dataset
7.6 Kernel densities of Longleaf pines
7.7 Dendrograms of tvd for Longleaf pines
7.8 Local configuration of Longleaf pines
7.9 Local fusion distance function: Longleaf pines
7.10 Relative frequency barplot of dbh for two groups: Longleaf pines
7.11 Lansing woods dataset
7.12 Lansing woods dataset and six types
7.13 Local configuration of Lansing woods
7.14 Local fusion distance function: Lansing woods, four groups
8.1 Brazilian trees dataset
8.2 Brazilian trees: seven subclasses
8.3 Brazilian trees: three classes, two types
8.4 Barplots of species
8.5 Histogram and scatter plots of heights
8.6 Histogram and scatter plots of dbh’s
8.7 Box plots of top ten heights and dbh based on species
8.8 Scatter plots of top ten heights and dbh based on species
8.9 Mark correlation function
8.10 Three most frequent species
8.11 F-function for the most frequent species
8.12 G-cross for the most frequent species
8.13 J-cross for the most frequent species
8.14 K-cross for the most frequent species
8.15 J-cross for three classes
8.16 F-function for two types
8.17 G-cross for two types
8.18 J-cross for two types
8.19 K-cross for two types
8.20 Fusion distance function from Brazilian trees
8.21 Gamma approximation for Brazilian trees
8.22 Kernel densities of Brazilian trees
8.23 Dendrograms of tvd for Brazilian trees
8.24 Local configuration classification: Average Linkage
8.25 Local fusion distance function: seven, three, two groups
A.1 Average and Complete Linkage: dendrograms
A.2 Average Linkage relative pdf plots
A.3 Complete Linkage relative pdf plots
A.4 Average and Complete Linkage P-P plots: envelopes
A.5 Average and Complete Linkage Q-Q plots: envelopes
A.6 Average and Complete Linkage A-A plots: envelopes
A.7 Average and Complete Linkage P-P plots: bands
A.8 Average and Complete Linkage Q-Q plots: bands
A.9 Average and Complete Linkage A-A plots: bands
A.10 Average Linkage P-P plots for Cat Retinal Ganglia dataset
A.11 Complete Linkage P-P plots for Cat Retinal Ganglia dataset
A.12 Average Linkage Q-Q plots for Cat Retinal Ganglia dataset
A.13 Complete Linkage Q-Q plots for Cat Retinal Ganglia dataset
A.14 Average Linkage A-A plots for Cat Retinal Ganglia dataset
A.15 Complete Linkage A-A plots for Cat Retinal Ganglia dataset
A.16 Average Linkage: histograms
B.1 Power: Q-Q plots for clustering
B.2 Power: Q-Q plots for clustering, cont.
B.3 Power: Q-Q plots for clustering, cont.
B.4 Power: Q-Q plots for clustering, cont.
B.5 Power: Q-Q plots for clustering, cont.
B.6 Power: Q-Q plots for inhibition
B.7 Power: Q-Q plots for inhibition, cont.
B.8 Power: Q-Q plots for inhibition, cont.
B.9 Power: Q-Q plots for inhibition, cont.
B.10 Power: Q-Q plots for inhibition, cont.
Acknowledgements
I am very grateful for the assistance of my supervisor, Prof. Adrian J. Baddeley,
in providing me with guidance and knowledge in the field of spatial statistics.
I am also grateful for helpful discussions with the following researchers: Dr. H.
Rue (on the area statistic), Prof. A. Unwin (on the knee plot), Dr. U. Hahn (on the
fusion distance function and power of the test), Prof. M. Handcock (on the relative
distribution plots), Prof. N. Cressie (on the analysis of local configuration), Dr. M.
Meirelles and Mr. A. Luiz (on the Brazilian trees point pattern).
My special thanks are given to Dr. J. Chia, Mr. T. Duong, Dr. R. Guidi, Dr.
R. Milne, Dr. B. Turlach and Dr. M. van Lieshout for their enlightened suggestions
in order to make my oral and written presentations more understandable.
I also express my gratitude to the University of Western Australia and to the
School of Mathematics and Statistics, especially for granting me two and a half years
of the University Postgraduate Scholarship (UPA).
Finally, I thank very much my dearest parents, Benedicto and Selia, my
beloved husband Robert, and my family and friends for their love, friendship,
motivation, support and prayers over the eternity of my studies.
CHAPTER 1
Introduction
1.1 Thesis research
Spatial statistics. Spatial statistics is the analysis of data of any kind which are
attributed to locations in space. Examples of spatial data are temperature recordings
from a network of weather stations; measurements of soil properties at each location
in a field; public health records of the incidence of new disease cases; and maps
of the spatial locations of geological faults. Techniques of spatial statistics have
been used in a variety of scientific fields such as biology, geostatistics, epidemiology,
and pattern recognition. For complete and more detailed information on spatial
statistics, see [24, 30, 90, 102, 109].
This work concerns the study of spatial point patterns in spatial statistics, in
particular the analysis of spatial clustering. Given a pattern of points strewn
over a plane, the objective of spatial clustering is to identify any clusters of points.
This is important, for instance, in the analysis of occurrences of rare diseases, where
a cluster of disease cases may indicate a common cause for the disease [3, 12, 111].
Most of the techniques in spatial statistics are designed to detect the presence of
clustering, but not to identify the clusters themselves. A pattern can be clustered
in the sense of spatial statistics without having any clearly identifiable clusters.
Clustering is vaguely defined in spatial statistics as the tendency of some points
of the pattern to be closer to each other than they would be expected to be on
average in a homogeneous Poisson point pattern, the benchmark of spatial statistics.
However, there are some exceptions where the benchmark is not the homogeneous
Poisson process. For instance, if the human population is not uniformly spread over
a country, then we may collect data about the local density of the population, such
as a sample of many non-disease cases called “controls”, and the null hypothesis is
that the disease cases are (inhomogeneous) Poisson with an intensity proportional
to the population density [59]. So a clustering of rare disease cases may also mean
a clustering relative to the expected pattern in the population.
The general methodology which has been developed for analysing spatial point
patterns leads to a particular way of looking at the problem of identifying clusters
in point patterns. For example, spatial summary functions such as the empty space
function F, the nearest neighbour distance function G, the reduced second moment
function K and Van Lieshout and Baddeley’s J-function are useful tools for spatial
clustering. This approach is described in several recent works [23, 32, 66, 67].
Non-spatial multivariate cluster analysis. A point in the plane may be identified
by its coordinates (x, y), so that a pattern of n points (n ∈ N) in the plane may
be thought of as a set of data recording the values of two random variables X and
Y observed for each of n “points” or “objects”. Methods for identifying clusters
of similar objects in such multivariate data have been developed in the field of
multivariate cluster analysis [37, 49, 45, 68, 72]. This is a much larger and older
field than that of spatial clustering, and offers hundreds of different techniques,
algorithms and methods.
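To make the idea concrete, the following minimal Python sketch treats a point pattern as a list of (x, y) pairs and runs a naive single-linkage agglomeration, recording the distance at which each merge occurs. These merge heights are what the abstract calls fusion distances. The function name, the toy pattern and the brute-force search are illustrative assumptions, not the implementation used in this thesis.

```python
import math
from itertools import combinations


def single_linkage_fusion_distances(points):
    """Return the n - 1 merge heights of a naive single-linkage clustering.

    Illustrative helper (hypothetical name); cubic-time or worse, so it is
    only suitable for small patterns.
    """
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    fusion = []
    while len(clusters) > 1:
        # Single linkage: the distance between two clusters is the smallest
        # distance between any pair of their members.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        fusion.append(d)
        # Merge cluster j into cluster i (i < j, so index i stays valid).
        clusters[i].extend(clusters.pop(j))
    return fusion


# Toy pattern: three nearby points and one distant point.
pattern = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0)]
heights = single_linkage_fusion_distances(pattern)
```

In this toy pattern the last fusion distance is much larger than the first two, which is the kind of contrast a clustered pattern would be expected to produce.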
It is important to notice that up to now, the fields of spatial statistics and
multivariate cluster analysis have remained rather separate, despite some connection
between them, because of the fundamental differences in approach.
Objectives. The main objective of this thesis is to combine techniques from spatial
statistics and non-spatial multivariate cluster analysis to solve problems in spatial
clustering. Instead of the usual statistical modelling process of formulating
theoretical models, which leads to tests that may or may not perform well in practice,
we start with a procedure, namely hierarchical clustering, which we know performs
well and has an empirical basis; then we give it a formal inferential property.
The specific aims of this work are to investigate new spatial summary statistics
and functions using non-spatial multivariate cluster analysis and spatial statistics; to
construct inferential measures for identifying and validating clusters; and to develop
graphical techniques for spatial clustering.
This thesis adopts a novel strategy in combining and reconciling techniques from
the analysis of spatial point patterns and hierarchical clustering algorithms. Hope-
fully, this work will not only lead to new interpretations of established knowledge
but also to the discovery and creation of (alternative) complementary strategies to
analyse univariate and multivariate point patterns.
1.2 Overview of the thesis
Chapter 2 presents a summarised description of analysis of spatial point patterns
and of multivariate cluster analysis.
Chapter 3 briefly reviews the Monte Carlo hypothesis test. Based on this method-
ology, a new graphical procedure for performing the Monte Carlo test with an exact
significance level is presented. This modified version of the test is then applied to
several graphical devices.
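As a reminder of the general principle, a Monte Carlo test ranks the observed value of a statistic among values simulated under the null hypothesis. The sketch below is a minimal one-sided version in Python; the function names, the default of 99 simulations and the toy null model are assumptions made for illustration and are not the modified graphical procedure developed in Chapter 3.

```python
import random


def monte_carlo_p_value(observed, simulate_statistic, n_sim=99, seed=0):
    """One-sided Monte Carlo p-value; large statistics are evidence against the null."""
    rng = random.Random(seed)
    simulated = [simulate_statistic(rng) for _ in range(n_sim)]
    # Count simulated values at least as extreme as the observed one
    # (ties are treated as at least as extreme).
    n_as_extreme = sum(1 for value in simulated if value >= observed)
    # The observed value is ranked among n_sim + 1 values in total.
    return (n_as_extreme + 1) / (n_sim + 1)


def mean_of_ten_uniforms(rng):
    # Toy null model (hypothetical): the mean of ten Uniform(0, 1) draws.
    return sum(rng.random() for _ in range(10)) / 10


p_typical = monte_carlo_p_value(0.5, mean_of_ten_uniforms)
p_extreme = monte_carlo_p_value(1.5, mean_of_ten_uniforms)
```

With 99 simulations the smallest attainable p-value is 1/100, which is why the number of simulations is usually chosen so that the desired significance level is a multiple of 1/(n_sim + 1).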
Chapter 4 introduces a new summary function, the fusion distance function,
and a new statistic, the area statistic, which are based on the output of hierarchical
clustering algorithms. Next, the chapter presents a new strategy for analysing point
patterns which can distinguish between different types of patterns. The new strategy
is an application of the fusion distance function and the modified version of the
Monte Carlo test.
Chapter 5 investigates the power of the Monte Carlo test of Complete Spatial
Randomness against two alternative models: spatial clustering and inhibition. The
power of the test is estimated using simulation experiments. The fusion distance
function and the area statistic (introduced in Chapter 4) are the basis of the calcu-
lation of the power of the test.
Chapter 6 presents two new methods and a modified version of a multivari-
ate cluster analysis technique for analysing multivariate point patterns. The first
method is based on the strategy introduced in Chapter 4. The second is based on a
new summary index named the S statistic. Finally, the third is a modified version of
a cluster analysis index, the Rg index, that has been adapted to the spatial context
of the statistical analysis.
Chapter 7 investigates an alternative approach to analysing a local neighbourhood
of a point pattern, named “the analysis of local configuration”. This approach
is a new extension of a popular strategy in spatial statistics, the Local Indicators
of Spatial Association (LISA). First, the probability density function of the fusion
distances is estimated using kernel density techniques, and then the groups of the
fusion distance probability densities are classified using a chosen distance measure.
(This measure is known as the total variation distance.)
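The two ingredients named above can be sketched in a few lines of Python. The example assumes a Gaussian kernel with a fixed, hand-picked bandwidth and takes the total variation distance to be half the integral of the absolute difference of the two densities; the normalising constant, the kernel, the bandwidth and all function names are assumptions for illustration, and the definitions used in Chapter 7 may differ.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(t):
        return sum(math.exp(-0.5 * ((t - x) / bandwidth) ** 2) for x in sample) / norm

    return density


def total_variation_distance(f, g, lower, upper, n_grid=2000):
    """Approximate 0.5 * integral of |f - g| over [lower, upper] (midpoint rule)."""
    step = (upper - lower) / n_grid
    total = 0.0
    for k in range(n_grid):
        t = lower + (k + 0.5) * step
        total += abs(f(t) - g(t))
    return 0.5 * total * step


# Two hypothetical sets of fusion distances, one short and one long.
f = gaussian_kde([0.10, 0.12, 0.15], bandwidth=0.02)
g = gaussian_kde([0.50, 0.55, 0.60], bandwidth=0.02)
tv_same = total_variation_distance(f, f, -1.0, 2.0)
tv_diff = total_variation_distance(f, g, -1.0, 2.0)
```

Under this convention the distance lies between 0 (identical densities) and 1 (densities with essentially no overlap), which makes it a convenient dissimilarity for a subsequent classification step.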
Chapter 8 introduces a large multivariate point pattern, named “the Brazilian
trees dataset”. This point pattern is thoroughly analysed using (traditional) standard
techniques of spatial statistics, and the new strategies developed in Chapters 4 and
6. The analysis of local configuration (Chapter 7) is also applied to the Brazilian
trees dataset.
Chapter 9 presents a summary and critique of the research done in this thesis.
The main problems and findings of each chapter are discussed, and suggestions for
future work are also made.
CHAPTER 2
Spatial point patterns and cluster analysis
This chapter presents a summarised description of the analysis of point patterns,
spatial clustering, and cluster analysis. That is, in Section 2.1, a concise background
on spatial point patterns is given, and three standard point patterns are presented.
In Section 2.2, several bibliographical references of methods for analysing spatial
clustering are cited. Finally, in Sections 2.3 and 2.4, cluster analysis and a selection
of hierarchical clustering algorithms are described, respectively.
2.1 Spatial point patterns
A spatial point pattern is defined as a set of locations regularly or irregularly
distributed within a region of interest, which have been generated by some unknown
random mechanism. A spatial point pattern may be interpreted as a realisation of
a spatial point process. The standard benchmark point process in the spatial
statistics literature is the homogeneous Poisson point process [24, 30, 60], also
known as Complete Spatial Randomness (CSR). In other words, CSR characterises
the absence of structure in the process. A definition of the homogeneous
Poisson process is presented in Section 5.1. Further information on the theory of
spatial point processes is presented in [26, 102].
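For readers who wish to experiment, a realisation of CSR on the unit square can be simulated with the Python standard library alone. The sketch below relies on two standard properties of the homogeneous Poisson process: the number of points in a region is Poisson distributed with mean equal to the intensity times the area, and, given that number, the points are independent and uniformly distributed. The function name and the chosen intensity are assumptions for illustration.

```python
import random


def simulate_csr(intensity, seed=None):
    """Simulate a homogeneous Poisson process on the unit square (area 1)."""
    rng = random.Random(seed)
    # Draw the Poisson count by counting arrivals of a unit-rate process
    # in the interval [0, intensity].
    n_points = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= intensity:
        n_points += 1
        elapsed += rng.expovariate(1.0)
    # Given the count, the locations are independent and uniform on the square.
    return [(rng.random(), rng.random()) for _ in range(n_points)]


pattern = simulate_csr(intensity=65, seed=1)
```

When a test conditions on the observed number of points, the count step can be skipped and a fixed number of uniform points generated instead.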
Figure 2.1: Standard spatial point patterns: pines (left), cells (centre) and
redwoods (right) re-scaled to the unit square. Source: [30].
Examples of spatial point patterns are found in many fields of applied science,
such as biology, biostatistics, botany, environmental engineering, geography
and astronomy. Figure 2.1 shows three standard datasets from spatial statistics:
the Japanese black pine saplings, biological cell centres, and California redwood
seedlings. (These point patterns are simple, extreme examples used in Chapter 4
for illustrative purposes; in Chapters 6–8, more complex, ambiguous and challenging
point patterns will be described and analysed.) Henceforth, the standard
datasets will be referred to as the pines, cells, and redwoods, respectively.
The pines (Figure 2.1 (left)) were extracted from a larger dataset published in
[80], and show the locations of 65 Japanese black pine saplings in a square of side
5.7 m. The cells (Figure 2.1 (centre)), published by [25], show the locations of 42
cell centres in the rescaled unit square. The redwoods (Figure 2.1 (right)) were
extracted from a larger dataset published in [105], and show the locations of 62
California redwood seedlings in a square of side approximately 23 m. These datasets
were chosen by Diggle [30] to illustrate random, regular and clustered
point patterns, respectively. When looking at these examples of point patterns (see
Figure 2.1), typical questions that may arise are as follows:
1. Are the points of the pines distributed at completely random locations?
2. Are the points of the cells attracting each other or they are being repulsed?
3. Is there any kind of dependence between points of the redwoods?
Some of the typical questions may be answered by considering traditional summary
functions such as the empty space F-function, nearest neighbour distance
G-function, and reduced second moment K-function (also known as Ripley’s
K-function). The definitions, properties and applications of these summary functions are
reported in [24, 30, 90]. Most of the summary functions available in the spatial
statistics literature provide useful descriptions of a given point pattern.
However, a practical interpretation of the existing summary functions might be
complicated. (A good reason for this complication might be that the traditional
summary functions might not take into consideration some fundamental aspects of
the given point pattern. For example, in biology and botany, it is important to take
into account the spread of seeds, interaction between plants, ecological conditions
for life, division and growth of cells, etc.) In our view, there is still a need to
introduce new summary functions whose analytical results are easy to interpret.
There is also interest in finding new summary functions that might perform
better than existing summary functions at discriminating between different types of
patterns. For example, Van Lieshout and Baddeley [66] recently introduced a new
summary function which is a non-parametric measure of spatial interaction. This
function is called the J-function. The value J(r) = 1 suggests a
lack of interaction between points of the given pattern, that is, Complete Spatial
Randomness. Deviations from the value 1 suggest spatial inhibition if J(r) > 1 or
spatial clustering if J(r) < 1. The definition, properties and applications of Van
Lieshout and Baddeley’s J-function are presented in [66, 67]. In Chapter 4, a new
summary function which is also easy to compute and interpret will be introduced.
It will be shown that this summary function performs well in practical applications.
2.2 Spatial clustering
Given a spatial point pattern, the aim of spatial clustering is to identify any
clusters of points. This identification is important, for instance, in the analysis
of occurrences of rare diseases, where a cluster of disease cases may indicate the
possibility of a common cause [3, 12, 32, 33, 61, 106, 111]. Some methods for
identifying clusters in spatial patterns have been developed in the spatial statistics
literature [8, 18, 24, 65, 109]. For instance, Van Lieshout [65] presents a Bayesian
approach to modelling data and unknown cluster centres in object recognition. In
particular, data and cluster centres are modelled as realisations of a point process.
More information on this technique is available in [65, Chapter 5]. Other methods for
analysing spatial clustering are also elaborated in several recent works [12, 23, 32, 71].
The approach proposed in this thesis is to apply a chosen hierarchical clustering
algorithm to a given point pattern, and to investigate the output of the algorithm. If
spatial clusters are detected then tools for identifying and validating clusters are also
introduced. To the best of our knowledge, this approach is new and more general
than analysing spatial clustering only. However, before developing our approach, a
summary of cluster analysis, hierarchical algorithms and main properties is presented
next.
2.3 Cluster analysis
Cluster analysis is a frequently used term for techniques which seek to separate
data into groups. For instance, let x1, . . . , xn be observed measurements of ℓ variables
on each of n points or objects which are believed to be heterogeneous. Then
the main objective of cluster analysis is to group these n points into g homogeneous
classes or clusters, where n, ℓ, g ∈ N. Usually g is much smaller than n. General
references for cluster analysis are [17, 37, 45, 49, 68, 72].
Most algorithms for finding clusters in a dataset are based on a measure of dissimilarity
between points. A dissimilarity coefficient d has the following properties:
1. $d(x_i, x_j) \ge 0$,
2. $d(x_i, x_j) = d(x_j, x_i)$,
3. $d(x_i, x_i) = 0$, where $i, j = 1, 2, \ldots, n$.
Note that clustering algorithms may also be based on a measure of similarity between
points. In this case, a similarity coefficient will have the scale reversed. A
dissimilarity coefficient d may satisfy a metric property
$d(x_i, x_j) \le d(x_i, x_k) + d(x_k, x_j)$, (2.1)
or an ultrametric property
$d(x_i, x_j) \le \max\{d(x_i, x_k), d(x_k, x_j)\}$, (2.2)
where $i, j, k = 1, \ldots, n$. For instance, a popular choice of a dissimilarity coefficient
is the pairwise Euclidean distance given by $d(x_i, x_j) = \|x_i - x_j\|$, where $i \ne j$.
An important concept in cluster analysis is a dendrogram which is regarded as a
two-dimensional diagram, and illustrates the fusions or partitions that are made at
each successive level of a hierarchical clustering algorithm. That is, the dendrogram
is a graphical representation of an ultrametric dissimilarity coefficient.
Figure 2.2: Dendrograms obtained by a hierarchical clustering algorithm, Single
Linkage, applied to the pines (left), cells (centre), and redwoods (right). The
pairwise Euclidean distance described previously in the text is the chosen dissimilar-
ity coefficient, the y-axis represents distance between clusters, and the datasets are
presented in Section 2.1.
Moreover, Jardine and Sibson [53] defined a dendrogram as a special function
that maps an ultrametric dissimilarity coefficient into the set of real numbers. Typical
examples of a dendrogram are shown in Figure 2.2. (For each pair of clusters
merged at a stage of the algorithm, a horizontal line is drawn, with
y-coordinate equal to the minimum Euclidean distance between the two clusters. The
vertical and horizontal lines represent the tree structure of the successive mergers
of the clusters.)
A hierarchical clustering algorithm is considered as an approximation of a dis-
similarity coefficient by an ultrametric. A hierarchical technique classifies a dataset
into a hierarchy of partitions, building from the lowest level of n clusters, each con-
taining a single point, to a single cluster containing all n points. Consequently,
when a point is allocated to a group, this point is not allowed to be reallocated to
a different group as the number of clusters g decreases.
There are several hierarchical algorithms; for instance, [17] lists 23 different techniques,
such as Single Linkage, Average Linkage, Complete Linkage, Ward’s Minimum
Variance, and Centroid. Among them, Single Linkage, Average Linkage, and
Complete Linkage are known to be the easiest and most commonly used in the cluster
analysis literature. In this thesis, Single Linkage, Average Linkage and Complete
Linkage are chosen to be the foundation of the proposed strategies to analyse point
patterns. These three algorithms satisfy some important properties presented in
[38, 53, 58], for example, chaining effect, monotonicity, stability and ties. These
features make the selected algorithms more attractive than others.
In the next section, the chosen algorithms and main properties are briefly de-
scribed. More details and further information on hierarchical clustering algorithms,
applications and properties are also reported in [17, 38, 53, 58, 68].
2.4 Selected hierarchical clustering algorithms
Single Linkage is considered to be the simplest clustering algorithm, and was introduced
by Florek, Lukaszewicz, Perkal, Steinhaus, and Zubrzycki [40]. The main
feature of Single Linkage is that the dissimilarity coefficient between groups is defined
as the distance between their closest pair of points, one from each group.
Examples of dendrograms of the Single Linkage applied to the pines, cells and
redwoods, in which the dissimilarity coefficient is the pairwise Euclidean distance,
are shown in Figure 2.2. (The datasets are introduced in Section 2.1.) The following
description and notation of Single Linkage is quoted from [68].
Algorithm:
1. Order the $\tfrac{1}{2}n(n-1)$ dissimilarity coefficients into ascending order.
2. Let $C_1, \ldots, C_n$ be the starting clusters each containing one point, namely
$C_i = \{x_i\}$, where $i = 1, \ldots, n$.
3. Let $d_{i_1 j_1} = \min\{d(x_i, x_j) : i \ne j,\ i, j = 1, \ldots, n\}$ so that $x_{i_1}$ and $x_{j_1}$ are
nearest. (For a point process, the probability of obtaining equal values for
the smallest dissimilarity coefficients is equal to 0.) Then these two points are
grouped into a cluster, so we have $(n-1)$ clusters, where $C_{i_1} \cup C_{j_1}$ is a new
cluster. The value $d_{i_1 j_1}$ is called the first “fusion distance” $h_1$.
4. Let $d_{i_2 j_2}$ be the next smallest dissimilarity coefficient. If neither $i_1$ nor $j_1$ equals
$i_2$ or $j_2$, the new $(n-2)$ clusters are $C_{i_1} \cup C_{j_1}$, $C_{i_2} \cup C_{j_2}$. If $i_2 = i_1$ and $j_1 \ne j_2$
the new $(n-2)$ clusters are $C_{i_1} \cup C_{j_1} \cup C_{j_2}$, plus the remaining old clusters.
The value $d_{i_2 j_2}$ is called the second fusion distance $h_2$, where $h_1 \le h_2$.
5. The process continues as described in item 4 through all $\tfrac{1}{2}n(n-1)$ dissimilarity
coefficients. At the kth stage, let $d_{i_k j_k}$ denote the kth smallest dissimilarity
coefficient. Then the cluster containing $i_k$ is joined with the cluster containing
$j_k$. If $i_k$ and $j_k$ are already in the same cluster, then no new groups are
formed in this stage. The value $d_{i_k j_k}$ is called the kth fusion distance $h_k$, where
$h_1 \le h_2 \le \cdots \le h_k$.
6. The clustering process can be halted before all the clusters have been joined
into one group by stopping when the inter-cluster dissimilarity coefficients are
all greater than $d_0$, where $d_0$ is an arbitrary value called the threshold level. Let
$C_1^*, \ldots, C_g^*$ be the resulting clusters. These clusters have the property that if
$d_0^*$ ($> d_0$) is a higher threshold, then the two clusters $C_r$, $C_s$ will be joined at
the threshold $d_0^*$ if at least one dissimilarity coefficient $d_{i_r j_s}$ (or a single link)
exists between $i_r$ and $j_s$ with $x_{i_r} \in C_r$, $x_{j_s} \in C_s$ and $d_0 < d_{i_r j_s} \le d_0^*$.
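The steps above can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the thesis: the function name `single_linkage_fusion_distances` is hypothetical, points are assumed to be coordinate tuples, the dissimilarity coefficient is assumed to be the Euclidean distance, and only the n − 1 distances at which two distinct clusters are actually joined are recorded as fusion distances (stages at which both points already share a cluster are skipped).

```python
import math
from itertools import combinations


def single_linkage_fusion_distances(points):
    """Return the n - 1 distances at which Single Linkage joins two clusters."""
    n = len(points)
    parent = list(range(n))  # union-find forest: parent[i] == i marks a cluster root

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 1: all n(n-1)/2 pairwise Euclidean distances in ascending order.
    pairs = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    fusion = []
    for d, i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Steps 3-5: the clusters containing i and j are joined at distance d.
            parent[root_i] = root_j
            fusion.append(d)
        # Otherwise i and j are already in the same cluster: no new group is formed.
    return fusion


if __name__ == "__main__":
    print(single_linkage_fusion_distances([(0, 0), (1, 0), (3, 0), (7, 0)]))
```

Processing the sorted distances and joining clusters with a union-find structure is the same scheme as Kruskal's minimum spanning tree construction, so the recorded values are non-decreasing by construction, in line with h1 ≤ h2 ≤ · · · in item 5.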
Properties of Single Linkage: A brief description of relevant properties: chaining
effect, monotonicity, ties, and stability is presented as follows. Further details on
the properties of Single Linkage algorithm are presented in [53].
a. Chaining effect: Single Linkage has a tendency to form spherical or elliptical
clusters, each one around a nucleus. However, if the clusters have no nuclei
the algorithm leads to a chaining effect. This effect is due to the fact that
links, once made, cannot be broken. Therefore, Single Linkage may not give
satisfactory results if random noise is present between clusters.
b. Monotonicity: Single Linkage gives clustering of identical topology for any
monotonic transformation of a dissimilarity coefficient d.
c. Ties: if there are ties, that is, equal values for the smallest dissimilarity coef-
ficient between two clusters, then it does not matter which choice is made for
joining the clusters. The resulting clusters will be unchanged. It is therefore
allowable to randomly choose one of the smallest coefficients and then proceed
with the clustering process.
d. Stability: if there are small changes in the dissimilarity coefficient d then
these changes should not give rise to noticeable alteration in the classification
of Single Linkage.
Average Linkage This algorithm, also named “Unweighted Pair-Group Average”,
was introduced by Sokal and Michener [97]. A definition of Average Linkage using
the same notation as that of Single Linkage is presented as follows. Consider two
clusters Cr and Cs; then the dissimilarity coefficient drs between the clusters Cr and
Cs is defined as the average of all dissimilarity coefficients d(xr, xs), where xr is any point
of Cr and xs is any point of Cs. Typical examples of dendrograms generated by
Average Linkage applied to the pines, cells and redwoods are shown in
Figures A.1 (a), (c), and (e), in appendix A.
Properties of Average Linkage: The main properties of Average Linkage are
monotonicity and ties which are briefly described in items b and c of Single Link-
age properties, respectively. Further information on the algorithm and properties is
reported in [58].
Complete Linkage has its original form published by Sørensen [98], and is the
opposite of Single Linkage. That is, the dissimilarity coefficient between groups is
defined as the largest distance between a point of one cluster and a point of the
other. Formally,
$d_{rs} = \max\{d(x_r, x_s) : x_r \in C_r,\ x_s \in C_s\}$. (2.3)
Figures A.1 (b), (d) and (f), in appendix A, show typical examples of dendrograms
generated by Complete Linkage applied to the pines, cells, and redwoods.
Properties of Complete Linkage: this algorithm satisfies the following properties:
point proportion, cluster omission, monotonicity, and well-structured g-group admis-
sibility. Observe that the algorithm does not fulfill the properties: ties and stability
described in items c and d of Single Linkage, respectively. For more information on
its properties, see [38].
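The three inter-cluster dissimilarities of this section differ only in how the pairwise distances between two clusters are summarised (minimum, average, or maximum as in equation (2.3)). The following sketch, offered as an illustration under stated assumptions and not as the thesis implementation, makes that explicit. The names `linkage_dissimilarity` and `fusion_distances` are hypothetical, the dissimilarity coefficient is assumed to be Euclidean distance, and the naive recomputation of all inter-cluster dissimilarities at every stage favours clarity over speed.

```python
import math


def linkage_dissimilarity(cluster_r, cluster_s, method):
    """Dissimilarity d_rs between two clusters, each given as a list of points."""
    dists = [math.dist(p, q) for p in cluster_r for q in cluster_s]
    if method == "single":
        return min(dists)               # closest pair of points
    if method == "complete":
        return max(dists)               # equation (2.3)
    if method == "average":
        return sum(dists) / len(dists)  # average over all pairs
    raise ValueError(f"unknown method: {method}")


def fusion_distances(points, method="average"):
    """Agglomerate from n singleton clusters to one, recording each fusion distance."""
    clusters = [[p] for p in points]
    fusion = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest dissimilarity.
        d, r, s = min(
            (linkage_dissimilarity(clusters[r], clusters[s], method), r, s)
            for r in range(len(clusters))
            for s in range(r + 1, len(clusters))
        )
        fusion.append(d)
        clusters[r] = clusters[r] + clusters[s]  # merge cluster s into cluster r
        del clusters[s]
    return fusion


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (3, 0), (7, 0)]
    for name in ("single", "average", "complete"):
        print(name, fusion_distances(pts, name))
```

For the four collinear points in the usage example, the three methods agree on the first merger and then diverge, which is the behaviour one would expect from the definitions above.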
CHAPTER 3
Monte Carlo test
This chapter presents a brief review of the methodology of the Monte Carlo hy-
pothesis test and its application to spatial point patterns in spatial statistics. In
particular, a new and modified version of the Monte Carlo test applied to P-P plots
(Definition 5) is presented in Section 3.5. This modified version of the Monte Carlo
test applied to P-P plots has exact significance level α. A transformed version of
the P-P plot, the A-A plot (Definition 8), is introduced in Section 3.7. This plot
is a useful tool for analysing function estimates and has the property of stabilising
variance.
In this thesis, the Monte Carlo test methodology is applied to a variety of graph-
ical tools: P-P plots, A-A plots and Q-Q plots (Definition 7) using pointwise sim-
ulation envelopes and simultaneous critical bands. Together with the output of a
hierarchical clustering algorithm (Section 2.4), the Monte Carlo test is the founda-
tion of a new strategy to analyse point patterns. This strategy will be described in
Chapter 4. Monte Carlo testing was introduced independently by Dwass [35]
and Barnard [10]. A brief description of the one-sided Monte Carlo test published
by Diggle [30] is presented below.
The one-sided Monte Carlo test Let H0 be a given simple null hypothesis, x be a
given spatial dataset, z(x) be the corresponding value of a real-valued test statistic
Z, and $z_i$, where $i = 2, \ldots, m$, be simulated values generated by random sampling
from the distribution of Z under H0. Let $z_{(j)}$ be the jth largest among the complete
set of values $\{z_1, z_2, \ldots, z_m\}$, where $z_1 = z(x)$ and $m \in \mathbb{N}$. Then, under H0,
$P(z_1 = z_{(j)}) = \frac{1}{m}$ for $j = 1, \ldots, m$. (3.1)
The null hypothesis is rejected if $z_1$ ranks kth largest or higher. This gives an exact,
one-sided test of size $\alpha = \frac{k}{m}$. It is assumed that there are no ties, $P(z_i = z_j) = 0$
$(i \ne j)$, so that the ranking of $z_i$ is unequivocal. Otherwise, equal values or ties
may occur, in which case Diggle suggested the conservative rule of selecting the least
extreme rank for $z_i$. For further details on this test, see [30, page 7].
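The one-sided test can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the example statistic (mean nearest neighbour distance) and the example null model (n independent uniform points in the unit square, standing in for CSR conditional on the number of points) are choices made here for illustration and are not prescribed by the text. Ties are counted against the observed value, which corresponds to taking its least extreme rank.

```python
import math
import random


def mean_nn_distance(points):
    """Mean nearest neighbour distance, used here as an example statistic Z."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def simulate_uniform(n, rng):
    """n independent uniform points in the unit square (assumed null model)."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def monte_carlo_one_sided(z_obs, simulate_statistic, m, k):
    """Reject H0 if z_obs ranks kth largest or higher among m values.

    The m values are z_obs plus m - 1 simulated statistics, so the size is k/m.
    """
    sims = [simulate_statistic() for _ in range(m - 1)]
    # Rank 1 is the largest value; a simulated value equal to z_obs is counted
    # as larger, so that z_obs receives its least extreme rank under ties.
    rank = 1 + sum(1 for z in sims if z >= z_obs)
    return rank, rank <= k


if __name__ == "__main__":
    rng = random.Random(2003)
    # A regular 6 x 6 grid, used as an example of a non-random pattern.
    grid = [((i + 0.5) / 6, (j + 0.5) / 6) for i in range(6) for j in range(6)]
    rank, reject = monte_carlo_one_sided(
        mean_nn_distance(grid),
        lambda: mean_nn_distance(simulate_uniform(len(grid), rng)),
        m=100,
        k=5,
    )
    print(rank, reject)
```

With m = 100 and k = 5 the usage example corresponds to a test of size 5/100.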
Applications of Monte Carlo tests to point patterns are reported in [11, 24, 30,
66, 67, 90] and their main properties are investigated by [50, 54]. Next, standard
definitions of the inverse function of the cumulative distribution function (c.d.f.),
the quantiles of the distribution function and dataset are presented. The definitions
are important for building the two-sided version of Monte Carlo tests.
Definition 1 (Inverse function of c.d.f.). If a random variable has the cumulative
distribution function F then its inverse function, denoted by $F^{-1}$, is defined as
$F^{-1}(p) = \min\{t \in \mathbb{R} : F(t) \ge p\}$, for $p \in [0, 1]$.
Definition 2 (Quantile of c.d.f.). If F is the cumulative distribution function of
a random variable, then the pth quantile of F, where $p \in [0, 1]$, is a real number
given by
$q_p = F^{-1}(p)$.
Examples of quantiles are the lower quartile, median and upper quartile of the
distribution F, which are the values $F^{-1}(0.25)$, $F^{-1}(0.5)$ and $F^{-1}(0.75)$, respectively.
A definition of a quantile of the given dataset x is given below.
Definition 3 (Quantile of dataset). For the dataset $x = \{x_1, \ldots, x_n\}$ the order
statistics are the numbers ranked in ascending order (thus, $x_{(1)}$ is the minimum and
$x_{(n)}$ the maximum). If F is the empirical c.d.f. of the data $x_1, \ldots, x_n$ then the $\frac{k}{n}$th
quantile of F is the kth order statistic $x_{(k)}$.
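As a small check of how Definitions 1 to 3 fit together, the sketch below computes the empirical c.d.f., its inverse in the sense of Definition 1, and the kth order statistic. The function names are hypothetical, and the inverse is restricted to 0 < p ≤ 1 here as a simplifying assumption.

```python
def empirical_cdf(data, t):
    """Proportion of data values less than or equal to t."""
    return sum(1 for x in data if x <= t) / len(data)


def inverse_ecdf(data, p):
    """Smallest data value t with F(t) >= p, for the empirical c.d.f. F."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1 in this sketch")
    for x in sorted(data):
        if empirical_cdf(data, x) >= p:
            return x


def dataset_quantile(data, k):
    """The (k/n)th quantile of the data: the kth order statistic (Definition 3)."""
    ordered = sorted(data)
    if not 1 <= k <= len(ordered):
        raise ValueError("k must satisfy 1 <= k <= n")
    return ordered[k - 1]


if __name__ == "__main__":
    data = [3, 1, 4, 1.5, 9]
    print(dataset_quantile(data, 2), inverse_ecdf(data, 2 / 5))
```

When the data values are distinct, the two functions agree at p = k/n, which is the content of Definition 3.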
More information on quantiles is described in [16, 92]. Next, the general approach
of the two-sided Monte Carlo test is presented.
3.1 General case
Let x be the given dataset, Z a real-valued statistic, H0 a simple null hypothesis
and H1 a simple or composite alternative hypothesis. We aim to construct a two-sided
test of exact size α, where α is a rational number in (0, 1).
1. Select a number m, where m is such that $(m+1)\frac{\alpha}{2} \in \mathbb{Z}^+$, and simulate m
independent and identically distributed (i.i.d.) realisations of X under H0,
that is, $x^{(1)}, \ldots, x^{(m)}$.
2. Calculate the test statistic Z applied to each of the m realisations
$\{Z(x^{(i)}) : i = 1, \ldots, m\}$.
3. Compute the $\frac{\alpha}{2}$th and $(1-\frac{\alpha}{2})$th quantiles of the complete set
$\{Z(x), Z(x^{(1)}), \ldots, Z(x^{(m)})\}$ given by Definition 3. For simplicity, the $\frac{\alpha}{2}$th and
$(1-\frac{\alpha}{2})$th quantiles are denoted by L and U, respectively.
(In other words, if $(m+1)\frac{\alpha}{2}$ is a positive integer k say, then the $\frac{\alpha}{2}$th quantile
of $Z_1, \ldots, Z_{m+1}$ is the kth order statistic $Z_{(k)}$, and the $(1-\frac{\alpha}{2})$th quantile is
the $(m-k+1)$th order statistic $Z_{(m-k+1)}$.)
4. Reject H0 if $Z(x) \notin [L, U]$.
If the distribution of the test statistic Z is continuous, the rank of the given test
statistic Z(x) among the set of values $\{Z(x^{(i)}) : i = 1, \ldots, m\}$ determines an exact
significance level for the test since, under H0, each of the m + 1 possible rankings of
Z(x) is equally probable. Otherwise, ties in the set $\{Z(x), Z(x^{(1)}), \ldots, Z(x^{(m)})\}$
may occur, and the level of significance of the test is then not exact. Besag and
Diggle [11] recommended randomly assigning an ordering to any equal values because
this random choice provides an upper bound for the significance level of the Monte
Carlo test.
The quantiles L and U are well-defined since $(m+1)\frac{\alpha}{2}$ is an integer. Moreover,
if $(m+1)\frac{\alpha}{2}$ is a positive integer, the $(\frac{\alpha}{2})$th and $(1-\frac{\alpha}{2})$th quantiles of the
set $\{Z(x), Z(x^{(1)}), \ldots, Z(x^{(m)})\}$ may be calculated. (See item 3 of the general case
described previously.) For the special case $\alpha = \frac{2}{m+1}$, L and U are respectively given
by
$L = \min\{Z(x), Z(x^{(1)}), \ldots, Z(x^{(m)})\}$ (3.2)
$U = \max\{Z(x), Z(x^{(1)}), \ldots, Z(x^{(m)})\}$. (3.3)
In general, L and U are the $(m+1)\frac{\alpha}{2}$th smallest value and the $(m+1)(1-\frac{\alpha}{2})$th largest
value, respectively.
Proposition 4. The level of significance of the Monte Carlo test is:
P(reject | H0) = P( reject H0 | H0 is true ) = α
Proof. Under H0, Z(x), Z(x(1)), . . . , Z(x(m)) are i.i.d., so the probability that Z(x)
is one of the (m + 1)α most extreme elements is equal to α, by symmetry.
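The symmetry argument in the proof can be turned into a short rank-based sketch. The assumptions here are ours: the function name `monte_carlo_two_sided` is hypothetical, α is passed as an exact rational (for instance `Fraction(1, 20)`) so that the integrality condition on (m + 1)α/2 can be checked without rounding error, ties are assumed absent, and the rejection rule is expressed directly as "Z(x) is among the k smallest or the k largest of the m + 1 values", which is one way of reading the interval rule of item 4.

```python
from fractions import Fraction


def monte_carlo_two_sided(z_obs, sims, alpha):
    """Two-sided Monte Carlo test from m simulated statistics.

    alpha should be an exact rational, e.g. Fraction(1, 20) or the string "1/20".
    Returns the rank of z_obs among the m + 1 values (1 = smallest) and the decision.
    """
    m = len(sims)
    k = (m + 1) * Fraction(alpha) / 2
    if k.denominator != 1 or k < 1:
        raise ValueError("(m + 1) * alpha / 2 must be a positive integer")
    k = int(k)
    # Rank of the observed statistic among all m + 1 values, assuming no ties.
    rank = 1 + sum(1 for z in sims if z < z_obs)
    # The 2k most extreme ranks are 1..k and (m + 2 - k)..(m + 1).
    reject = rank <= k or rank >= m + 2 - k
    return rank, reject


if __name__ == "__main__":
    simulated = [float(i) for i in range(1, 40)]  # m = 39 placeholder values
    print(monte_carlo_two_sided(20.5, simulated, Fraction(1, 20)))
```

Under H0 and without ties, each of the m + 1 ranks is equally likely, so the 2k = (m + 1)α extreme ranks are hit with probability α, which is the statement of Proposition 4.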
Monte Carlo tests are applied to two special cases: function estimates and P-P
plots, in the next sections. Examples of function estimates are the reduced second
moment function K, empty space F function, nearest neighbour distance distribu-
tion function G and Van Lieshout and Baddeley’s function J .
3.2 Function estimate
Figure 3.1 shows a typical plot of a graphical method for applying a two-sided
Monte Carlo test to a function estimate. Instead of a single real-valued statistic
Z(x), one might consider function estimates of the form Zx(t) where as before x
denotes the given dataset and t > 0. A procedure to make a plot for applying the
two-sided Monte Carlo test is described as follows.
Figure 3.1: A typical plot of a graphical method for applying a Monte Carlo test to a
function estimate. Dotted lines: the $(\frac{\alpha}{2})$th and $(1-\frac{\alpha}{2})$th quantiles, L(t) and U(t), of
the function estimate determined by H0. Solid line: the function estimate $Z_x(t)$ determined
by a given dataset x.
1. Simulate m i.i.d. realisations $Z^{(1)}(t), \ldots, Z^{(m)}(t)$ under H0, and calculate
the $(\frac{\alpha}{2})$th and $(1-\frac{\alpha}{2})$th quantiles (Definition 3), L(t) and U(t), of the set
$\{Z_x(t), Z^{(1)}(t), \ldots, Z^{(m)}(t)\}$.
2. Plot $Z_x(t)$, L(t), U(t) against t, see Figure 3.1.
3. To perform the test using the plot, fix an arbitrary $t_0 \in \mathbb{R}^+$, and reject H0 if
$Z_x(t_0) \notin [L(t_0), U(t_0)]$.
In this general context, there is no simple rule for choosing t0 to achieve maximum
power. Care must be taken to fix t0 prior to performing the test, and independently
of the outcome of the simulations, so that the test has the desired significance level α.
Similar to the general case (Section 3.1), if the distribution of Z(t) is continuous
then the rank of the given function estimate among the set of values $\{Z^{(i)}(t) : i = 1, \ldots, m\}$
determines an exact significance level for the test since, under H0, each of the m + 1
possible rankings of $Z_x(t)$ is equally likely. Otherwise, ties may occur and we follow
Besag and Diggle’s recommendation [11] stated previously. Once again, for the
special case $\alpha = \frac{2}{m+1}$, the L(t) and U(t) quantiles are respectively given by the
following equations
$L(t) = \min\{Z_x(t), Z^{(1)}(t), \ldots, Z^{(m)}(t)\}$ (3.4)
$U(t) = \max\{Z_x(t), Z^{(1)}(t), \ldots, Z^{(m)}(t)\}$. (3.5)
In the spatial statistics literature, L(t) and U(t) are known as the lower and upper
pointwise simulation envelopes, respectively.
Since Zx(t) is real-valued then the level of significance of the Monte Carlo test
applied to function estimates is α by Proposition 4.
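Pointwise simulation envelopes can be sketched as below. The assumptions are ours and are stated plainly: every function estimate is represented as a list of values on a common grid of t values, the names `pointwise_envelopes` and `reject_at` are hypothetical, and the envelopes are computed from the m simulated curves alone. Under that convention, the observed value falling strictly outside the kth smallest and kth largest simulated values at a pre-chosen grid point is the same event as the observed value being among the k smallest or k largest of all m + 1 values there.

```python
def pointwise_envelopes(sims, k=1):
    """Lower and upper envelopes from m simulated curves on a common grid.

    With k = 1 these are the pointwise minimum and maximum of the simulations.
    """
    lower, upper = [], []
    for column in zip(*sims):       # one column per grid point t
        ordered = sorted(column)
        lower.append(ordered[k - 1])  # kth smallest simulated value
        upper.append(ordered[-k])     # kth largest simulated value
    return lower, upper


def reject_at(z_obs, lower, upper, idx):
    """Pointwise test at the grid index idx, which must be fixed in advance."""
    return not (lower[idx] <= z_obs[idx] <= upper[idx])


if __name__ == "__main__":
    sims = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]  # placeholder curves, m = 3
    low, up = pointwise_envelopes(sims)
    print(low, up, reject_at([1, 5, 3], low, up, idx=1))
```

As the text stresses, the index (that is, t0) has to be chosen before looking at the simulations; scanning the whole plot for an excursion is a different test, which is what the simultaneous critical bands later in this chapter address.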
Next, a graphical tool named the P-P plot is presented. This plot is useful for
comparing two distribution functions. However, before a definition of the P-P plot
is presented, a function estimate $\bar{Z}(t)$ of m i.i.d. realisations $Z^{(1)}(t), \ldots, Z^{(m)}(t)$ is
introduced by the following equation
$\bar{Z}(t) = \frac{1}{m} \sum_{i=1}^{m} Z^{(i)}(t)$ for $t > 0$. (3.6)
3.3 P-P plot
Figure 3.2(a) shows an example of the P-P plot, introduced by Wilk and Gnanadesikan
[114], in which the function estimate $\bar{Z}(t)$ (equation (3.6)) is plotted against
$Z_x(t)$. The definition of the P-P plot is given as follows.
Definition 5 (P-P plot). If two distributions have cumulative distribution functions
F1 and F2 then the P-P plot of F1 and F2 displays the pairs
$(F_1(t), F_2(t))$, $\forall\, t \in \mathbb{R}$. (3.7)
The equivalent definition of the P-P plot is the graph of the function $F_2 \circ F_1^{-1}$,
where $F_1^{-1}$ is the inverse function of F1 given by Definition 1. An important
property of the P-P plot is that if $F_1 \equiv F_2$ then the plot is the identity line.
In spatial point patterns, the P-P plot from a (given) function estimate Zx(t)
against a (simulated) theoretical function Z∗(t) will show the extent of agreement
between the given dataset and theoretical point process. For instance, many sum-
mary functions used in spatial statistics are c.d.f.’s (the empty space F , and nearest
neighbour distance distribution G) so that they are amenable to the P-P plot. More
information on the P-P plot is reported in [16, 41, 114].
The rationale described for applying the Monte Carlo test to function estimates
does not extend to the P-P plot. This inapplicability of the test to the P-P plot is
explained as follows.
3.4 Inapplicability of Monte Carlo test to P-P plot
Proceeding in a fashion similar to Section 3.2, it will be shown that the rationale
of the two-sided Monte Carlo test is not applicable to P-P plots. The analogue for
P-P plots of the graphical procedure described in Section 3.2 would be to fix a
coordinate value $v_0 \in [0, 1]$ and reject H0 if $Z_x(t_0)$ lies outside $[L(t_0), U(t_0)]$, with
$t_0 = \bar{Z}^{-1}(v_0)$, where $\bar{Z}^{-1}$ is the inverse function (Definition 1) of $\bar{Z}$. Under H0, the rank
of $Z_x(t_0)$ in $\{Z^{(1)}(t_0), \ldots, Z^{(m)}(t_0), Z_x(t_0)\}$ is not (in general) uniformly distributed
over $\{1, \ldots, m+1\}$ because $t_0 = \bar{Z}^{-1}(v_0)$ depends on $Z^{(1)}(t_0), \ldots, Z^{(m)}(t_0)$ but not
on x.
[Panels: (a) P-P plot; (b) Analogue of Figure 3.1.]
Figure 3.2: Inapplicability of the pointwise Monte Carlo test rationale to P-P plots:
(a) P-P plot of empirical function estimate $Z_x(t)$ against mean of realisations $\bar{Z}(t)$,
with Monte Carlo test applied at abscissa $v_0$. (b) The test in (a) corresponds to
applying a Monte Carlo test to $Z_x(t)$ at a random ordinate $t = \bar{Z}^{-1}(v)$ which depends
on $Z_x(t)$.
Figure 3.2(b) shows the Monte Carlo test applied to $Z_x(t)$ at the random ordinate
$t = \bar{Z}^{-1}(v)$ which depends on the simulated data $\{Z^{(i)}(t) : i = 1, \ldots, m\}$. The
difficulty is that $\bar{Z}(t)$ depends on $Z^{(1)}(t), \ldots, Z^{(m)}(t)$, so for a fixed $v_0$, $Z_x(\bar{Z}^{-1}(v_0))$
depends on both data and simulations. Therefore, the significance level of the test
is typically not equal to α. The significance level is generally unknown.
3.5 Modified Monte Carlo test applied to P-P plot
To resolve the problem described in Section 3.4, the following procedure is
adopted.
1. Simulate two sets of i.i.d. realisations from H0, that is, $\{Z_{x^{(1)}}(t), \ldots, Z_{x^{(m)}}(t)\}$
and $\{Z_{y^{(1)}}(t), \ldots, Z_{y^{(M)}}(t)\}$ independently, where m and M are positive integers.
2. From the set $\{Z_{y^{(1)}}(t), \ldots, Z_{y^{(M)}}(t)\}$, compute the mean
$\bar{Z}_y(t) = \frac{1}{M} \sum_{j=1}^{M} Z_{y^{(j)}}(t)$. (3.8)
3. From the set $\{Z_{x^{(1)}}(t), \ldots, Z_{x^{(m)}}(t)\}$, calculate the $(\frac{\alpha}{2})$th and $(1-\frac{\alpha}{2})$th quantiles
(Definition 3), denoted by $L_x(t)$ and $U_x(t)$, respectively.
4. Plot $Z_x(t)$, $L_x(t)$, $U_x(t)$ against $\bar{Z}_y(t)$. That is, plot pairs $(\bar{Z}_y(t), Z_x(t))$,
$(\bar{Z}_y(t), L_x(t))$ and $(\bar{Z}_y(t), U_x(t))$.
5. Fix an arbitrary value $v_0$, let $t_0 = \bar{Z}_y^{-1}(v_0)$. Reject H0 if $Z_x(t_0) \notin [L_x(t_0), U_x(t_0)]$.
We have no general rule for choosing M. Note that m must be chosen so that $(m+1)\frac{\alpha}{2}$
is a (positive) integer. In our applications, their selected values were identical:
M = m = 39, 99, 999.
It is worth observing that $U_x(t)$ and $L_x(t)$ depend on the set $\{Z_{x^{(1)}}(t), \ldots, Z_{x^{(m)}}(t)\}$,
$Z_x(t)$ depends on x, and $\bar{Z}_y(t)$ depends on the set $\{Z_{y^{(1)}}(t), \ldots, Z_{y^{(M)}}(t)\}$. Therefore,
for a fixed $v_0$, $Z_x(\bar{Z}_y^{-1}(v_0))$ depends on x and $Z_{y^{(j)}}(t)$; $L_x(\bar{Z}_y^{-1}(v_0))$ and $U_x(\bar{Z}_y^{-1}(v_0))$
depend on $Z_{x^{(i)}}(t)$ and $Z_{y^{(j)}}(t)$, where $i = 1, \ldots, m$, $j = 1, \ldots, M$. For $\alpha = \frac{2}{m+1}$,
$L_x(t)$ and $U_x(t)$ are called the lower and upper (pointwise) simulation envelopes of
the (two-sided) modified Monte Carlo test applied to P-P plots.
We now prove that the (two-sided) modified Monte Carlo test for P-P plots has
an exact significance level α.
Proposition 6. The significance level of the (two-sided) modified Monte Carlo test
for P-P plots is P(reject | H0) = P(reject H0 | H0 is true) = α.
Proof.
$P(\text{reject } H_0 \mid H_0 \text{ is true}) = P_{H_0}\left( Z_x(\bar{Z}_y^{-1}(v)) \notin [L_x(\bar{Z}_y^{-1}(v)),\ U_x(\bar{Z}_y^{-1}(v))] \right)$
$= E_{H_0}\left[ P\left( Z_x(\bar{Z}_y^{-1}(v)) \notin [L_x(\bar{Z}_y^{-1}(v)),\ U_x(\bar{Z}_y^{-1}(v))] \,\middle|\, Z_{y^{(1)}}(t), \ldots, Z_{y^{(M)}}(t) \right) \right]$ (3.9)
The function $\bar{Z}_y(t)$ is completely determined by the set of realisations $\{Z_{y^{(j)}}(t) :
j = 1, \ldots, M\}$, so $\bar{Z}_y^{-1}(v_0) = t_0$ is fixed given $Z_{y^{(1)}}(t_0), \ldots, Z_{y^{(M)}}(t_0)$. The argument
presented in Section 3.1 for the general case of the Monte Carlo test
establishes that:
$P\left( Z_x(\bar{Z}_y^{-1}(v_0)) \notin [L_x(\bar{Z}_y^{-1}(v_0)),\ U_x(\bar{Z}_y^{-1}(v_0))] \,\middle|\, Z_{y^{(1)}}(t_0), \ldots, Z_{y^{(M)}}(t_0) \right) = \alpha$
so
$E_{H_0}\left[ P\left( Z_x(\bar{Z}_y^{-1}(v_0)) \notin [L_x(\bar{Z}_y^{-1}(v_0)),\ U_x(\bar{Z}_y^{-1}(v_0))] \,\middle|\, Z_{y^{(1)}}(t_0), \ldots, Z_{y^{(M)}}(t_0) \right) \right] = \alpha$.
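The five-step modified procedure of this section can be sketched as follows. This is an illustration under our own assumptions, not the thesis code: function estimates are lists of values on a common increasing grid of t values, the helper names are hypothetical, the inverse of the mean curve is taken on the grid (first grid index at which the mean reaches v0, in the spirit of Definition 1), and the envelopes use the kth smallest and kth largest of the m "x" simulations at that index.

```python
def mean_curve(curves):
    """Pointwise mean of a list of curves on a common grid (equation (3.8))."""
    return [sum(column) / len(column) for column in zip(*curves)]


def inverse_index(curve, v0):
    """Index of the first grid point at which the curve reaches v0."""
    for idx, value in enumerate(curve):
        if value >= v0:
            return idx
    raise ValueError("v0 is not reached on this grid")


def modified_pp_test(z_obs, sims_x, sims_y, v0, k=1):
    """Modified Monte Carlo test on the P-P plot at abscissa v0.

    sims_y fixes t0 through the mean curve; sims_x, simulated independently,
    supplies the envelopes. Returns the grid index of t0 and the decision.
    """
    z_bar_y = mean_curve(sims_y)
    idx0 = inverse_index(z_bar_y, v0)            # t0 depends on sims_y only
    column = sorted(sim[idx0] for sim in sims_x)
    lower, upper = column[k - 1], column[-k]     # L_x(t0), U_x(t0)
    reject = not (lower <= z_obs[idx0] <= upper)
    return idx0, reject


if __name__ == "__main__":
    sims_y = [[0.0, 0.25, 1.0], [0.0, 0.75, 1.0]]                   # placeholder curves
    sims_x = [[0.0, 0.3, 1.0], [0.0, 0.5, 1.0], [0.0, 0.7, 1.0]]    # placeholder curves
    print(modified_pp_test([0.0, 0.9, 1.0], sims_x, sims_y, v0=0.5))
```

The design point mirrored from the proof is that `idx0` is computed from `sims_y` alone, so that, conditional on those simulations, the comparison of the observed curve with the `sims_x` envelopes happens at a fixed t0.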
Critical bands for function estimates Our aim is to construct a region in which
H0 is rejected at an exact significance level α if Zx(t) goes inside this region for
any t. This region is known as a critical region and is defined by [13, 92]. A
complementary idea for plotting a critical region for summary functions in spatial
statistics is introduced by Ripley [90, Chapter 8] for the first time, to the best of
our knowledge. Examples of the L-function with a “95% confidence band” are plotted in
[90, pages 171, 173]. That is, 95% of realisations of L calculated from a binomial
process should lie within the confidence band. (The summary function L is defined
as $L(t) = \sqrt{K(t)/\pi}$, where K(t) is the reduced second moment function.) Further
information on K, L functions, and confidence band developed by Ripley is reported
in [90]. Next, our procedure to apply the Monte Carlo test to a function estimate
using the simultaneous critical band at an exact significance level α is presented.
Monte Carlo tests using simultaneous critical bands Let Zx(t) be a real function
estimate determined by a given dataset x. The procedure is given as follows.
1. Follow items 1 and 2 of the procedure described in Section 3.5.
2. For each realisation i, compute the maximum absolute deviation $d_i$ of $Z_{x^{(i)}}(t)$
from $\bar{Z}_y(t)$ defined by
$d_i = \sup_t |Z_{x^{(i)}}(t) - \bar{Z}_y(t)|$, $i = 1, \ldots, m$.
3. Order the set $\{d_1, \ldots, d_m\}$ in ascending order, compute the $(1-\alpha)$th quantile
(Definition 3) of the ordered $d_{(i)}$’s, and denote it by $d_{(1-\alpha)}$.
4. Calculate the critical functions [13] given by equation
$\bar{Z}_y(t) \pm d_{(1-\alpha)}$. (3.10)
The graph of these functions against t is called the simultaneous critical band.
5. Plot $Z_x(t)$ and the critical functions (equation 3.10) against t.
6. Reject H0 if $Z_x(t) \notin [\bar{Z}_y(t) - d_{(1-\alpha)},\ \bar{Z}_y(t) + d_{(1-\alpha)}]$ for some t.
Thus, H0 is rejected if the graph of Zx(t) lies outside the critical band at any point
t. Figure 3.3 displays a typical plot of the Monte Carlo test applied to the function
estimate Zx(t) using the simultaneous critical band at an exact significance level α.
Figure 3.3: A typical plot of a graphical method for applying a Monte Carlo test
to a function estimate using a critical band. Dashed lines: the critical functions
determined by H0. Solid line: the function estimate Zx(t) determined by a given
dataset x.
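The construction of the simultaneous critical band can be sketched as below, again under assumptions of ours: curves are lists on a common grid so that the supremum over t becomes a maximum over grid points, the function names are hypothetical, and the caller supplies the index k of the order statistic, chosen so that k/m = 1 − α in the sense of Definition 3.

```python
def mean_curve(curves):
    """Pointwise mean of a list of curves on a common grid."""
    return [sum(column) / len(column) for column in zip(*curves)]


def sup_deviation(curve, reference):
    """Maximum absolute deviation between two curves over the grid."""
    return max(abs(a - b) for a, b in zip(curve, reference))


def simultaneous_band(sims_x, sims_y, k):
    """Critical functions: mean of sims_y plus/minus the kth smallest deviation."""
    z_bar_y = mean_curve(sims_y)
    deviations = sorted(sup_deviation(sim, z_bar_y) for sim in sims_x)
    d_crit = deviations[k - 1]  # kth order statistic of d_1, ..., d_m
    lower = [z - d_crit for z in z_bar_y]
    upper = [z + d_crit for z in z_bar_y]
    return z_bar_y, d_crit, lower, upper


def reject_anywhere(z_obs, z_bar_y, d_crit):
    """Reject H0 if the observed curve leaves the band at some grid point."""
    return sup_deviation(z_obs, z_bar_y) > d_crit


if __name__ == "__main__":
    sims_y = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]                       # placeholder curves
    sims_x = [[0.0, 0.5, 1.0], [0.0, 0.75, 1.0], [0.25, 0.5, 0.5]]    # placeholder curves
    z_bar_y, d_crit, lower, upper = simultaneous_band(sims_x, sims_y, k=2)
    print(d_crit, reject_anywhere([0.0, 0.875, 1.0], z_bar_y, d_crit))
```

Because the band has constant half-width around the mean curve, leaving the band at some t is equivalent to the observed maximum absolute deviation exceeding the critical deviation, which is how `reject_anywhere` is written.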
3.6 Q-Q plot
The Q-Q plot is another useful graphical device for comparing two distribution
functions. The definition of the Q-Q plot is presented below.
Definition 7 (Q-Q plot). If two distributions have cumulative distribution functions
F1 and F2, then the Q-Q plot of F1 and F2 displays the pairs of points
$(F_1^{-1}(p), F_2^{-1}(p))$, $\forall\, p \in [0, 1]$, where $F_1^{-1}$ and $F_2^{-1}$ are the inverse functions
(Definition 1) of F1 and F2, respectively. Equivalently the Q-Q plot is the graph of
$F_2^{-1} \circ F_1$.
The ranges of the x and y axes are the ranges of the corresponding distributions
F1, F2. Two important properties of the Q-Q plot are presented as follows. The
Q-Q plot is the identity line if and only if F1 ≡ F2. In addition, the Q-Q plot is a
straight line if and only if F1(t) = F2(a + bt) where a, b ∈ R. More information on
the Q-Q plot and its properties is available in [16, 41, 114].
Monte Carlo tests can also be applied graphically to Q-Q plots. The procedure to
calculate and plot the pointwise simulation envelopes and the simultaneous critical
bands for Q-Q plots is analogous to those previously described for function estimates
and for the modified version for P-P plots, in Sections 3.2 and 3.5.
Next, a transformed version of the P-P plot, named the A-A plot, is presented.
The transformed P-P plot is another useful tool for comparing distribution functions
and function estimates graphically.
3.7 A-A plot
The A-A plot is a transformed P-P plot. Aitkin and Clayton [1] proposed the
use of the Fisher angular transformation, arcsin√
1 − F , in the P-P plot. The A-A
plot definition and rationale are presented as follows.
Definition 8 (A-A plot). If two random variables have cumulative distribution
functions F1 and F2 then the A-A plot of F1 and F2 displays the pairs of points
$\left(\arcsin\sqrt{1 - F_1(t)},\ \arcsin\sqrt{1 - F_2(t)}\right)$, $\forall\, t \in \mathbb{R}$.
Examples of the A-A plot are presented in [1] and in Section 4.4.1. The rationale
for the A-A plot is based on Wilk and Gnanadesikan’s proposition of a transformation
of the axes of the P-P plot and Q-Q plot by a real function. It is also known
that Fisher’s angular transformation [96], $\arcsin\sqrt{F}$, stabilises the variance of
binomial estimates of proportions.
In spatial statistics, an important property of the A-A plot is that if a given point
pattern is a realisation of a theoretical point process then the plot should be close to
the identity line. In this thesis, the transformation $\arcsin\sqrt{1 - Z(t)}$ is applied
to both axes of the P-P plot to achieve approximately constant variance. Observe
that Z(t) is a chosen summary function which is also a c.d.f. For more information on
the transformed P-P plot and Fisher’s angular transformation, see [1, 96, 114].
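The transformation of both axes can be sketched in a few lines. The function names (`angular`, `aa_pairs`) and the grid of c.d.f. values are hypothetical, chosen only to illustrate Definition 8; the inputs are assumed to be two c.d.f.s already evaluated on a common grid of t values.

```python
import math

def angular(z):
    """Transformation arcsin(sqrt(1 - z)) applied to a c.d.f. value z in [0, 1]."""
    return math.asin(math.sqrt(1.0 - z))

def aa_pairs(values1, values2):
    """A-A pairs from two c.d.f.s evaluated on a common grid of t values."""
    return [(angular(a), angular(b)) for a, b in zip(values1, values2)]

# Hypothetical c.d.f. values on a shared grid; identical inputs give points
# on the identity line.
grid_values = [0.0, 0.25, 0.5, 0.75, 1.0]
pairs = aa_pairs(grid_values, grid_values)
```

Because the transformation maps [0, 1] onto [0, π/2] in decreasing order, both axes of an A-A plot run from 0 to about 1.57.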
Monte Carlo tests can also be applied graphically to the A-A plot. The procedure
to calculate and plot the pointwise simulation envelopes and simultaneous critical
bands for the A-A plot is analogous to those previously described for the P-P plot
and Q-Q plot in Sections 3.5 and 3.6.
CHAPTER 4
New strategy for analysing point patterns
This chapter introduces a new summary function, the fusion distance function, and
a new statistic, the area statistic, which are based on the output of a non-spatial hi-
erarchical clustering algorithm (Chapter 2) applied to a given spatial point pattern.
This chapter also explores applications of the fusion distance function in a spatial
context and develops both a graphical non-parametric method for exploratory anal-
ysis of point patterns, and formal inference using simulations and Monte Carlo tests
(Chapter 3).
The new strategy has two parts: exploratory data analysis (Section 4.3.1) and
inference (Section 4.3.2). First, the fusion distance function from the (observed)
given point pattern will be compared with the mean of simulations from a (chosen)
theoretical point process using graphical techniques. The proposed techniques to
compare the fusion distance functions are P-P plots (Definition 5), Q-Q plots (Def-
inition 7), A-A plots (Definition 8) and relative distribution plots (Definition 12).
Second, in the inference, the modified version of the Monte Carlo method (Section
3.5) is proposed for testing the fusion distance function from the given point pattern
against the mean of the fusion distance functions from simulations of the theoretical
point process. In most applications, a chosen theoretical point process is a homo-
geneous Poisson process of unknown intensity λ; this is the null hypothesis CSR
(defined in Section 5.1).
In this chapter, our choice of null hypothesis is a binomial point process, which is
the homogeneous Poisson process conditioned on a fixed number of points.
However, an inhomogeneous Poisson process may also be selected for a null hypoth-
esis. For instance, in spatial epidemiology, an example is a case-control study [59]
where we have a point pattern of 30–100 cases of a rare disease (the cases) and an-
other point pattern of 1000–10000 people who are healthy but otherwise comparable
in age and socioeconomic status (the controls). The control data tells us about the
nonuniform density of the population. A natural null hypothesis is that the cases
are an inhomogeneous Poisson process with intensity proportional to the population
density. Thus, this is a good example where the null hypothesis is not CSR but is
an inhomogeneous Poisson process with a known intensity up to a constant factor.
The new strategy may equally well be applied to these examples.
4.1 New summary function and statistic
The new summary function, the fusion distance function, is motivated by the
following. Consider Figure 2.2 which shows the dendrograms obtained by apply-
ing the Single Linkage algorithm (Section 2.4) to the datasets: pines, cells and
redwoods (Section 2.1). The pointwise Euclidean distance is chosen as a dissimilar-
ity coefficient between points and clusters. On visual inspection, the output for the
redwoods shows a highly structured dendrogram with clear separation between large
clusters, while the output for the cells has a very disordered appearance and that
for the pines is intermediate. This suggests that the dendrograms carry enough
information to enable us to discriminate between clustered patterns and CSR.
In the cluster analysis literature, there is no formal definition of what is meant
by a structured or disordered dendrogram. However, a very simple criterion is to
look only at the computed values of the dissimilarity coefficient between two clusters
of points in the dendrogram. These values will be named “fusion distances”. The
range of the fusion distances in the dendrogram is comparatively much broader
for the redwoods and much narrower for the cells, with the pines again giving an
intermediate result.
The exploratory results of cluster analysis indicate that a point pattern can be
analysed by applying a hierarchical clustering algorithm and then extracting the
fusion distances of the dendrogram. To the best of our knowledge, this approach to
analysing a spatial dataset is new. It is developed in Section 4.1.1. In Chapter 5, we
will demonstrate that the fusion distances are very good at discriminating between
different types of spatial patterns.
4.1.1 Fusion distance function Let x = {x1, . . . , xn} be a given point pattern,
consisting of a fixed number n of points in some bounded region $W \subset \mathbb{R}^2$. A
chosen hierarchical clustering algorithm is applied to the given point pattern, based
on the pairwise Euclidean distances ‖xi − xj‖, i ≠ j. This produces a dendrogram,
from which a list of fusion distances hk, k = 1, . . . , n − 1, is extracted. For example,
for the Single Linkage algorithm (Section 2.4), hk is defined as $h_k = d_{i_k j_k}$, where d is
the pairwise Euclidean distance between a point in one group and a point in the other
group (for Single Linkage, the smallest such distance between the two groups fused at
step k). We can now form the empirical cumulative distribution function of the
fusion distances. This function is named the fusion distance function, denoted by
H(t), and its definition is as follows.
Definition 9 (Fusion distance function). For t ∈ R,
$$H(t) = \frac{1}{n-1} \sum_{k=1}^{n-1} \mathbf{1}\{h_k \le t\}, \tag{4.1}$$
where hk is a chosen fusion distance between two groups of points and 1{·} is the
indicator function. That is, H is the empirical c.d.f. of the fusion distances.
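A minimal sketch of Definition 9 for the Single Linkage case is given below. It is not the software used in the thesis: it relies on the standard equivalence between Single Linkage merge heights and the edge lengths of the Euclidean minimum spanning tree, computed here with Prim's algorithm. The function names and the toy pattern are assumptions made for illustration.

```python
import math
from bisect import bisect_right

def single_linkage_fusion_distances(points):
    """Return the n - 1 Single Linkage fusion distances in increasing order.

    Relies on a standard equivalence: the Single Linkage merge heights are
    the edge lengths of the Euclidean minimum spanning tree (Prim's algorithm).
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best[i]: shortest distance from point i to any point already in the tree
    best = [math.dist(points[0], p) for p in points]
    heights = []
    for _ in range(n - 1):
        j = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        heights.append(best[j])
        in_tree[j] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[j], points[i]))
    return sorted(heights)

def fusion_distance_function(heights):
    """Return H(t), the empirical c.d.f. of the fusion distances (equation 4.1)."""
    h = sorted(heights)
    return lambda t: bisect_right(h, t) / len(h)

# Hypothetical toy pattern: three nearby points and one distant point.
pattern = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.9, 0.9)]
h = single_linkage_fusion_distances(pattern)
H = fusion_distance_function(h)
```

For this toy pattern two of the three fusion distances are small and one is large, so H(t) already equals 2/3 for moderate t and reaches 1 only once t exceeds the distance to the isolated point. Other linkage rules (Average, Complete) would need a genuine agglomerative implementation, since the spanning-tree shortcut applies to Single Linkage only.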
Figure 4.1 shows typical plots of the fusion distance function H(t) for the pines,
cells and redwoods.
[Figure 4.1 plots: three panels, each showing H(t) (vertical axis, 0 to 1) against t (horizontal axis, 0 to 0.30).]
Figure 4.1: The fusion distance function H(t) against t for the point patterns: pines
(left), cells (centre) and redwoods (right). Dissimilarity coefficient: pairwise Eu-
clidean distance, Single Linkage algorithm. The point patterns are re-scaled to the
unit square and their physical dimensions and background information are described
by Diggle [30].
The fusion distance function depends on the chosen algorithm and coefficient.
Also, the fusion distances cannot be regarded as if they were independent and
identically distributed observations. (The hk’s are ordered: $h_1 \le \ldots \le h_{n-1}$. In general,
the $\binom{n}{2}$ pairwise distances are not independent.)
Application of fusion distance function: knee plot Consider a given multivariate
dataset D with n objects or points that have been measured on ℓ variables, where
n, ℓ ∈ N and ℓ < n. Then apply cluster analysis to D and form the set S(D) =
{h1, . . . , hn−1} of the fusion distances hk, where k = 1, . . . , n − 1, between two
clusters.
It is a common practice in cluster analysis to plot the number of clusters, say g,
against the output of a hierarchical algorithm to find the best number of clusters in
the dataset. In other words, the plot of g against the values of the fusion distances
hk’s provides information on the best number of clusters. This plot is known as a
knee plot or scree plot. For the definition and applications of the knee plot, see [52].
Figure 4.2 shows examples of the knee plots of the fusion distances plotted against
the number of clusters for the pines, cells, and redwoods. The knee plots were
made using the pairwise Euclidean distance and Single Linkage algorithm (Section
2.4).
Linear transformation of knee plot A relationship between the index k of the set
S(D) of the fusion distances and the number of clusters g is given by
k = (n − g). (4.2)
Re-arranging equation (4.2) so that the number of clusters is a function of the
index, that is, g = n − k, and dividing by the total number n of points in
the dataset, the following result is obtained:
$$\frac{g}{n} = 1 - \frac{k}{n}, \quad \text{where } n > 0. \tag{4.3}$$
Next, the linear transformation given by equation (4.3) is applied to the y-axis of
the inverted knee plot. Therefore, the linear transformation given by equation (4.3)
is the relationship between a knee plot and the plot of the fusion distance function.
That is, a knee plot is a rotated and scaled version of the cumulative distribution
function.
Figure 4.3 shows examples of inverted knee plots of the number of clusters against
the fusion distances for the pines, cells and redwoods. The inverted knee plots
are similar to the plots of the fusion distance function H(t) shown in Figure 4.1,
except for a scale factor.
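The relationship between the knee plot and the fusion distance function can be made concrete with a short sketch. The function names and the four fusion distances are hypothetical; the code only applies equations (4.2) and (4.3) to a sorted list of fusion distances.

```python
def knee_plot_points(fusion_distances):
    """Pairs (g, h_k) with g = n - k clusters remaining after the k-th fusion."""
    h = sorted(fusion_distances)
    n = len(h) + 1                      # n points give n - 1 fusion distances
    return [(n - k, h[k - 1]) for k in range(1, n)]

def inverted_knee_on_cdf_scale(fusion_distances):
    """Pairs (h_k, 1 - g/n) = (h_k, k/n), using equation (4.3)."""
    h = sorted(fusion_distances)
    n = len(h) + 1
    return [(h[k - 1], k / n) for k in range(1, n)]

# Hypothetical fusion distances for a pattern of n = 5 points.
h = [0.02, 0.03, 0.05, 0.40]
knee = knee_plot_points(h)
rescaled = inverted_knee_on_cdf_scale(h)
```

When the fusion distances are distinct, H(h_k) = k/(n − 1), so the rescaled ordinate k/n differs from the fusion distance function only by the factor (n − 1)/n.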
[Figure 4.2 plots: three panels, each showing the fusion distances h_k (vertical axis) against the number of clusters g (horizontal axis).]
Figure 4.2: Knee plots of the fusion distances against the number of clusters for
the pines (left), cells (centre) and redwoods (right). Dissimilarity coefficient:
pairwise Euclidean distance, Single Linkage algorithm.
[Figure 4.3 plots: three panels, each showing the number of clusters g (vertical axis) against the fusion distances h_k (horizontal axis).]
Figure 4.3: Inverted knee plots for the pines (left), cells (centre) and redwoods
(right).
Knee plots and analysis of point patterns The smooth shape of the right plot in
Figure 4.2 suggests that there are clusters of points in the redwoods. Furthermore,
the sharp decrease of the values of fusion distances (when the number of clusters
varies from 2 to 7) indicates that the best number of clusters is an integer between
2 and 7. For the cells (see the central plot in Figure 4.2), the values of fusion
distances are moderately flat for the values of g ∈ [1, 25]. This stability suggests
that there may not be clusters in this dataset. For the pines (see the left plot in
Figure 4.2), the values of fusion distances are intermediate, between the values of
redwoods and cells. This application of the knee plot using fusion distances also
indicates that a knee plot may be a useful tool to discriminate between different
types of spatial patterns. However, further investigation is needed; in this study
the example serves only as exploratory data analysis.
4.1.2 Area statistic The area statistic, A, is a new index based on the fusion
distance function introduced in Section 4.1.1, and is defined as follows.
Definition 10 (Area statistic).
$$A = \int_0^1 H\left((H^{*})^{-1}(u)\right) du \tag{4.4}$$
where H(t) is the fusion distance function for a given point pattern, and $H^{*}(t)$
is the sample mean of the fusion distance functions for simulations from the null
hypothesis.
That is, given x, the area statistic is defined as the area under the P-P plot
of the fusion distance function H of x against the pointwise mean $H^{*}$ of fusion
distance functions that are computed from simulations of the null hypothesis. (A
typical example of a reference point process is a homogeneous Poisson process, which
is introduced in Definition 13.)
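The area under the P-P plot can be computed without plotting, as in the sketch below. It assumes that every simulation contributes the same number of fusion distances and that the inverse in equation (4.4) is the generalised inverse; under those assumptions the mean function jumps by the same amount at each pooled simulated value, and the integral reduces to an average of H over the pooled values. The function name and the toy inputs are hypothetical.

```python
from bisect import bisect_right

def area_statistic(observed, simulated):
    """Area under the P-P plot of H against the pointwise mean of simulations.

    observed  : fusion distances of the data
    simulated : list of fusion-distance lists, one per simulation,
                each assumed to have the same length
    """
    h = sorted(observed)
    pooled = [s for sim in simulated for s in sim]
    # The mean function jumps by 1/len(pooled) at each pooled value, so the
    # integral over u in [0, 1] is the average of H over the pooled values.
    return sum(bisect_right(h, s) / len(h) for s in pooled) / len(pooled)

# Hypothetical inputs: observed distances far below, equal to, and far above
# the simulated ones.
a_small = area_statistic([0.1, 0.1], [[1.0, 2.0]])
a_equal = area_statistic([1.0, 2.0, 3.0], [[1.0, 2.0, 3.0]])
a_large = area_statistic([5.0, 6.0], [[1.0, 2.0]])
```

With step functions, identical observed and pooled values give (N + 1)/(2N) rather than exactly 0.5 (here N = 3, so 2/3); the discrepancy vanishes as the number N of pooled values grows.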
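The area statistic A can be rewritten (if $H^{*}$ is continuous and strictly increasing) as
$$A = \int_0^{+\infty} H(t)\, dH^{*}(t).$$
Since the fusion distances are non-negative, $H^{*}$ places no mass on t < 0, and
$\int_{-\infty}^{+\infty} H^{*}(t)\, dH^{*}(t) = \frac{1}{2}$ for continuous $H^{*}$. So
$$A - \frac{1}{2} = \int_{-\infty}^{+\infty} \left( H(t) - H^{*}(t) \right) dH^{*}(t).$$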
This may be compared with the Anderson-Darling statistic, which is quoted from
[34, page 26, equation 4.1.4],
$$AD = \int_{-\infty}^{+\infty} \frac{\left(H(t) - E_0H(t)\right)^2}{E_0H(t)\left(1 - E_0H(t)\right)}\, dE_0H(t),$$
where $E_0H(t)$ is the expected value of H(t) under the null hypothesis.
The rationale for the Anderson-Darling statistic is that, if H(t) is an estimate of
$E_0H(t)$ such as the empirical c.d.f. based on n independent observations, then
$$\mathrm{Var}(H(t)) = \frac{E_0H(t)\left(1 - E_0H(t)\right)}{n}$$
so
$$E\left( \frac{\left(H(t) - E_0H(t)\right)^2}{E_0H(t)\left(1 - E_0H(t)\right)} \right) = \text{constant}.$$
In this sense, the area statistic can be described as an unweighted, unsquared
simplification of the Anderson-Darling statistic. Further information on the
Anderson-Darling statistic is reported in [4, 34].
Proposition 11. If $H \equiv H^{*}$ then A = 0.5, and if X is a homogeneous Poisson
process then E(A) = 0.5.
Proof. The first statement is trivial. For the second statement, let us assume
that the number of simulations of the reference homogeneous Poisson process
(Section 5.1) is sufficiently large that $H^{*}(t)$ is essentially non-random. Let us also
assume that the point pattern x is a realisation of the same Poisson process. Under
these assumptions,
$$E(A) = E\left( \int_0^1 H\left((H^{*})^{-1}(u)\right) du \right) = \int_0^1 E\left( H\left((H^{*})^{-1}(u)\right) \right) du.$$
Since $H^{*}$ is non-random and $E\left(H(v)\right) = H^{*}(v)$ for all $v \in \mathbb{R}^{+}$ (x is Poisson), then
$E\left( H\left((H^{*})^{-1}(u)\right) \right) = u$. So that $E(A) = \int_0^1 u\, du = 0.5$.
                      SL                       AL                       CL
Datasets       A      Ā      S_A        A      Ā      S_A        A      Ā      S_A
Pines        0.493  0.498  0.044      0.496  0.503  0.028      0.495  0.499  0.025
Cells        0.312  0.500  0.055      0.352  0.502  0.036      0.370  0.501  0.031
Redwoods     0.726  0.499  0.047      0.672  0.501  0.030      0.657  0.500  0.026
Table 4.1: Empirical area statistic A from the pines, cells and redwoods; the
sample mean Ā and sample standard deviation S_A of the area statistic based on
1000 simulations of homogeneous Poisson processes with intensities 65, 42, and 62,
respectively. Single Linkage (SL), Average Linkage (AL) and Complete Linkage
(CL).
Values of A > 0.5 would be associated with clustered point patterns and values
of A < 0.5 with regular point patterns. If a point pattern is clustered then
the fusion distances tend to include a higher frequency of small distances, so the
fusion distance function of the clustered pattern may lie substantially above the mean
of the simulated fusion distance functions from the homogeneous Poisson point process.
In this case, A > 0.5. For a regular pattern, small fusion distances are comparatively
rare, so its fusion distance function may lie considerably below the mean of the
simulated fusion distance functions from the homogeneous Poisson point process, and
thus A < 0.5. However, no general rules can be inferred.
Illustration Table 4.1 presents the estimated values of the area statistic A from
the pines, cells and redwoods using Single Linkage (SL), Average Linkage (AL)
and Complete Linkage (CL). This table also shows the values of the sample mean
Ā and sample standard deviation S_A of the area statistic based on the assumption
that the reference point process is the homogeneous Poisson process. The estimated
values for Ā and S_A are based on 1000 simulations of Poisson point processes with the
same intensities as the observed point patterns, that is, λi = 65, 42, 62, respectively.
4.2 Relative distribution
Useful graphical tools for comparing two distributions are the P-P plot (Defini-
tion 5), Q-Q plot (Definition 7), and A-A plot (Definition 8). Another useful device
is the relative distribution plot which is based on the relative distribution method
[47] applied to social sciences. The relative distribution plot is presented as follows.
Definition 12 (Relative distribution). Let F1, F2 be two cumulative distribution
functions. Their relative cumulative distribution function is given by
$$G(r) = F_2\left(F_1^{-1}(r)\right), \quad 0 \le r \le 1,$$
where $F_1^{-1}$ is the inverse function (Definition 1) of the cumulative distribution
function F1. Now, if F1, F2 have probability densities f1, f2 then G has the probability
density
$$g(r) = \frac{f_2\left(F_1^{-1}(r)\right)}{f_1\left(F_1^{-1}(r)\right)}, \quad 0 \le r \le 1.$$
The function g is called the relative probability density function of F1, F2.
A simple interpretation of the plots of the relative distribution is that if the
cumulative distribution functions F1 and F2 are identical then, for 0 ≤ r ≤ 1, g(r) =
1 and G(r) = r. The plot of the relative cumulative distribution function is identical
to the P-P plot. In other words, G is the function plotted in a P-P plot. Further
information on the definition, properties and applications of the relative distribution
method is presented in [47].
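Definition 12 can be illustrated from two samples by mapping each value of the comparison sample through the empirical c.d.f. of the reference sample; for continuous F1 the mapped values have c.d.f. G. The sketch below is only an illustration: it replaces kernel smoothing with a plain histogram estimate of g, and the function names and data are assumptions.

```python
from bisect import bisect_right

def relative_data(reference, comparison):
    """Map each comparison value y to F1(y), F1 = e.c.d.f. of the reference."""
    ref = sorted(reference)
    return [bisect_right(ref, y) / len(ref) for y in comparison]

def relative_cdf(rel, r):
    """Empirical estimate of G(r) from the relative data."""
    return sum(1 for v in rel if v <= r) / len(rel)

def relative_density_histogram(rel, bins=10):
    """Histogram estimate of g on [0, 1] (a stand-in for kernel smoothing)."""
    counts = [0] * bins
    for v in rel:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c * bins / len(rel) for c in counts]

# Hypothetical samples: identical reference and comparison distributions.
reference = list(range(1, 101))
rel = relative_data(reference, reference)
density = relative_density_histogram(rel)
```

For identical samples the estimated G is close to the identity and the histogram heights are close to 1, matching the interpretation given above.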
New application of relative distribution plot To the best of our knowledge, the
relative distribution method has not been applied to spatial statistics. Therefore,
a new application of the relative distribution plot to analyse point patterns based
on the fusion distance function is given as follows. (The software for computing the
relative distribution is available in [48]).
Figure 4.4 shows the relative probability density functions with pointwise 95%
confidence intervals for the fusion distance functions H(t) for the pines, cells and
redwoods (Section 2.1) plotted against the mean $H^{*}(t)$ of 1000 realisations from a
binomial point process on the unit square. The fusion distance function is computed
using the Single Linkage algorithm (Section 2.4).
The relative density g(r) is estimated by kernel smoothing techniques [2, 14, 39,
112], and the pointwise confidence intervals that are shown on the plots of g(r) are
based on the large-sample normal approximation [47],
$$\hat{g}(r) \sim N\left( g(r),\ \frac{g(r)R(\kappa)}{m h_m} + \frac{g^2(r)R(\kappa)}{n h_m} \right),$$
where $\hat{g}(r)$ is the kernel estimate of g(r); n, m are the sample sizes from the
estimated distribution and from H0, respectively; $h_m$ is the smoothing bandwidth used
for the density estimation; κ is the kernel of the density estimation, and
$R(\kappa) = \int_{-\infty}^{+\infty} |\kappa(x)|^2\, dx$. Under this approximation, the 95% pointwise
confidence intervals for g(r) are given by
$$\hat{g}(r) \pm 1.96 \sqrt{ \frac{g(r)R(\kappa)}{m h_m} + \frac{g^2(r)R(\kappa)}{n h_m} }.$$
[Figure 4.4 plots: three panels, horizontal axis with ticks at 0.0, 0.4 and 0.8, vertical axis from 0 to 8.]
Figure 4.4: Relative probability density function (y-axis) of the fusion distance
function H(t) plotted against the mean $H^{*}(t)$ (x-axis), with pointwise 95% confidence
intervals, for the pines (left), cells (centre), and redwoods (right).
Solid lines: relative probability density; dotted lines: 95% confidence intervals. The
mean is estimated from 1000 realisations of a binomial point process; Single Linkage
algorithm.
(For the computational work, Handcock and Morris [48] chose the Ksmooth density
estimation [110], and Ricardo Cao’s adaptive method for the bandwidth estimation.)
Details and more information on the estimation of the confidence interval for the
relative distribution are provided in [47, Chapter 9].
The left plot in Figure 4.4 shows that the pines are almost indistinguishable
from a binomial point process with regard to the distribution of fusion distances.
However, the central plot in Figure 4.4 shows that the cells and a binomial point
process are different. There is a peak in the relative probability density plot at about
the 75th percentile on the x-axis, which suggests regularity in the cells. In
other words, for the cells there is a higher concentration of large fusion distances
than we would expect for a binomial point process.
The right plot in Figure 4.4 shows that the redwoods and a binomial point process are
different. The peaks at the tails indicate polarisation, which suggests that there is
clustering in the redwoods. Handcock [personal communication] observed that the relative
peak at about the 75th percentile on the x-axis suggests some regularity in the redwoods.
Handcock also mentioned that the redwoods could be thought of as a mixture of a
clustered pattern with a small component of a regular pattern.
The plots of the relative cumulative distributions from the pines, cells, and
redwoods are not shown in this section because these plots are identical to the P-P
plots of the fusion distance function that are presented in Figure 4.6.
4.3 Description of strategy
This section describes the two parts of the strategy: exploratory data analysis
and inference.
4.3.1 Exploratory data analysis If the aim is to perform exploratory data
analysis on a given point pattern x then the procedure is as follows.
1. Apply a chosen hierarchical algorithm to x and compute its fusion distance
function H(t) (Definition 9).
2. Simulate m i.i.d. realisations of a binomial point process (each consisting of
n = N(x) independent, uniformly distributed random points), denoted by $x^{(1)}, \ldots, x^{(m)}$.
3. For each realisation $x^{(i)}$, where i = 1, . . . , m, compute its fusion distance
function, denoted by $H_i(t)$.
4. Compute the pointwise mean of the simulated fusion distance functions, given by
$$H^{*}(t) = \frac{1}{m} \sum_{i=1}^{m} H_i(t). \tag{4.5}$$
5. Compare H(t) with $H^{*}(t)$ graphically using an A-A plot (Definition 8), P-P plot
(Definition 5), Q-Q plot (Definition 7) or relative distribution plot (Definition
12).
If the given point pattern x is a realisation of a binomial point process then the
plot (A-A plot, P-P plot, Q-Q plot or relative cumulative distribution plot) should
be close to the identity line.
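The five steps above can be sketched end to end for the Single Linkage case. This is an illustrative outline rather than the thesis software: the fusion distances are obtained from the minimum spanning tree (equivalent to Single Linkage merge heights), the window is assumed to be the unit square, and the function names, grid size, number of simulations and the clustered toy pattern are all hypothetical choices.

```python
import math
import random
from bisect import bisect_right

def fusion_distances(points):
    """Single Linkage fusion distances via the minimum spanning tree (Prim)."""
    n = len(points)
    in_tree = [False] * n
    in_tree[0] = True
    best = [math.dist(points[0], p) for p in points]
    heights = []
    for _ in range(n - 1):
        j = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        heights.append(best[j])
        in_tree[j] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[j], points[i]))
    return sorted(heights)

def ecdf(sorted_values, t):
    """Empirical c.d.f. of an already sorted list, evaluated at t."""
    return bisect_right(sorted_values, t) / len(sorted_values)

def binomial_pattern(n, rng):
    """n independent uniform points on the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def pp_plot(pattern, m=99, grid_size=50, seed=0):
    """Steps 1-5: return P-P pairs (mean of H_i(t), H(t)) over a grid of t."""
    rng = random.Random(seed)
    observed = fusion_distances(pattern)                          # step 1
    sims = [fusion_distances(binomial_pattern(len(pattern), rng))
            for _ in range(m)]                                    # steps 2-3
    t_max = max(observed[-1], max(s[-1] for s in sims))
    grid = [t_max * i / grid_size for i in range(grid_size)] + [t_max]
    mean_h = [sum(ecdf(s, t) for s in sims) / m for t in grid]    # step 4
    return [(mh, ecdf(observed, t)) for mh, t in zip(mean_h, grid)]  # step 5

# Hypothetical clustered pattern: 40 points jittered around four centres.
rng = random.Random(1)
centres = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
pattern = [(cx + rng.uniform(-0.03, 0.03), cy + rng.uniform(-0.03, 0.03))
           for cx, cy in centres for _ in range(10)]
pairs = pp_plot(pattern)
```

Plotting the second coordinate of each pair against the first gives the P-P plot; applying the angular transformation to both coordinates would give the corresponding A-A plot.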
Interpretation of exploratory data analysis A simple interpretation is that if a
given point pattern is clustered then the expected P-P plot of the fusion distance
function will mostly be above the identity line, suggesting that there are many more
points aggregated into groups than in a binomial point process. But, if the given
point pattern is regular then the expected P-P plot of the fusion distance function
will mostly be below the identity line, suggesting the absence of spatial clustering.
An equivalent interpretation of the exploratory data analysis can be made for
an A-A plot and relative cumulative distribution plot. However, for Q-Q plots, the
interpretation is opposite. In other words, if a given point pattern is regular then the
expected Q-Q plot of the fusion distance function will mostly be above the identity
line. But, if the point pattern is clustered then the expected Q-Q plot of the fusion
distance function will mostly be below the identity line.
[Figure 4.5 plots: three panels, both axes from 0 to 1.]
Figure 4.5: P-P plots of fusion distance function H(t) versus the mean $H^{*}(t)$ for pines
(left), cells (centre) and redwoods (right). Solid lines: P-P plots, dotted lines: identity
line. The mean is estimated from 1000 realisations of a binomial point process with
the same intensity as the observed pattern, Single Linkage algorithm.
Illustration Figure 4.5 shows the exploratory data analysis performed for the
datasets: pines, cells and redwoods (Section 2.1). The P-P plots of the fusion
distance function of the datasets are plotted against the mean of fusion distance
functions from 1000 simulations of binomial point processes with the same intensities as
the observed datasets.
The P-P plot of the fusion distance function of the pines is very close to the
identity line (see the left plot in Figure 4.5). Therefore, the pines are consistent
with a realisation of a binomial point process. However, the fusion distance functions
of the cells and redwoods are distant from the identity line. Thus, the cells and
redwoods appear not to be realisations of a binomial point process (see the central
and right plots in Figure 4.5, respectively).
4.3.2 Inference The second part of the strategy is to perform the modified
version of the Monte Carlo test (Section 3.5), graphically. Let x be a realisation
of a point process X, and H0, H1 be the given null and alternative hypotheses,
respectively. If the purpose of the analysis is a formal test based on the fusion
distance function, then the procedure is given as follows.
1. Specify H0, H1 and significance level α.
2. Apply a chosen hierarchical algorithm to x and compute its fusion distance
function H(t) (Definition 9).
3. Simulate m i.i.d. realisations under H0, that is, $x^{(1)}, \ldots, x^{(m)}$.
4. For each simulated point pattern $x^{(i)}$, i = 1, . . . , m, compute its fusion distance
function $H_i(t)$, and then the mean $H^{*}(t)$ given by equation (4.5).
5. Apply the modified version of the Monte Carlo test (Section 3.5) using either
simulation envelopes or critical bands to the P-P plot (Definition 5), Q-Q plot
(Definition 7) or A-A plot (Definition 8).
If H(t) is outside the pointwise simulation envelope or simultaneous critical band
then H0 is rejected.
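The final step can be sketched for curves that have already been evaluated on a common grid. The code below shows one common construction and is not necessarily identical to the modified test of Section 3.5: pointwise envelopes are taken as the k-th smallest and k-th largest simulated values, and the simultaneous test ranks the supremum deviation of each curve from the mean of the other curves. All function names and the toy curves are assumptions.

```python
def pointwise_envelopes(sim_curves, k=1):
    """k-th smallest and k-th largest simulated value at each grid point.

    With m simulations the two-sided pointwise level is 2k / (m + 1).
    """
    lower, upper = [], []
    for values in zip(*sim_curves):
        ordered = sorted(values)
        lower.append(ordered[k - 1])
        upper.append(ordered[-k])
    return lower, upper

def sup_deviation_p_value(observed_curve, sim_curves):
    """Monte Carlo p-value for the supremum deviation from the mean curve.

    Each curve is compared with the mean of the other m curves, so that all
    m + 1 deviations are exchangeable under H0.
    """
    curves = [observed_curve] + list(sim_curves)
    totals = [sum(values) for values in zip(*curves)]
    m = len(sim_curves)
    deviations = []
    for curve in curves:
        others_mean = [(tot - c) / m for tot, c in zip(totals, curve)]
        deviations.append(max(abs(c - o) for c, o in zip(curve, others_mean)))
    rank = sum(1 for d in deviations[1:] if d >= deviations[0])
    return (1 + rank) / (m + 1)

# Hypothetical curves on a three-point grid: 19 similar simulations and one
# outlying observed curve.
sims = [[0.10 + 0.001 * i, 0.50, 0.90] for i in range(19)]
observed = [0.90, 0.90, 0.90]
lower, upper = pointwise_envelopes(sims)
p_value = sup_deviation_p_value(observed, sims)
```

With m = 39 and k = 1, or m = 999 and k = 25, the pointwise envelopes have a two-sided level of 5%, which matches the numbers of realisations used in the applications of this chapter.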
4.4 Applications
4.4.1 Application to published point patterns The inferential part of the
new strategy was applied to the pines, cells and redwoods datasets (Section 2.1).
For each test, Single Linkage was the chosen hierarchical algorithm and M, m = 999
realisations under H0 were generated; that is, the realisations were from binomial
point processes with the same intensities as the observed datasets. (The composite
H1 was that the point patterns were not realisations of the binomial point process
with the specified intensities.) The significance level was α = 0.05 and the software
used was R version 1.5.1 (R Development Core Team [51]) on a Pentium 4 (1.8 GHz).
The results are as follows.
Simulation envelopes for P-P, Q-Q, A-A plots For the pines, the fusion distance
function is within the pointwise simulation envelopes for the P-P plot, Q-Q plot and
A-A plot. (See the left plots in Figures 4.6 and 4.7.) However, for the cells
and redwoods, the fusion distance functions are substantially outside the pointwise
envelopes. (See the central and right plots in Figures 4.6 and 4.7, respectively.) For
instance, when H(t) ≈ 0.17, H0 is not rejected for the pines but is rejected for the
cells and redwoods.
[Figure 4.6 plots: six panels. Upper row: P-P plots, both axes from 0 to 1. Lower row: Q-Q plots, axes from 0 to between 0.15 and 0.25.]
Figure 4.6: Plots of fusion distance function H(t) against the mean $H^{*}(t)$ with pointwise
simulation envelopes at 5% significance level. Datasets: pines (left), cells (centre) and
redwoods (right). Upper: P-P plots, lower: Q-Q plots. Dashed lines: envelopes;
dotted lines: identity line; Single Linkage algorithm.
[Figure 4.7 plots: three panels, both axes from 0 to 1.5.]
Figure 4.7: A-A plots of $\arcsin\sqrt{1 - H(t)}$ against $\arcsin\sqrt{1 - H^{*}(t)}$ with pointwise
simulation envelopes at 5% significance level. Datasets: pines (left), cells (centre)
and redwoods (right). Solid lines: A-A plots, dashed lines: envelopes; dotted lines:
identity line; Single Linkage algorithm.
[Figure 4.8 plots: six panels. Upper row: P-P plots, horizontal axis from 0 to 1, vertical axis from -0.2 to 1. Lower row: Q-Q plots, axes from 0 to between 0.15 and 0.25.]
Figure 4.8: Plots of fusion distance function H(t) versus the mean $H^{*}(t)$ with simultaneous
critical bands at 5% significance level. Datasets: pines (left), cells (centre) and
redwoods (right). Upper: P-P plots, lower: Q-Q plots. Dashed lines: critical functions;
dotted lines: identity line; Single Linkage algorithm.
Critical bands for P-P plots, Q-Q plots and A-A plots For the pines, the fusion
distance function is inside the simultaneous critical bands for the P-P plots, Q-Q
plots and A-A plots. (See the left plots in Figures 4.8 and 4.9.) However, for
the cells and redwoods the fusion distance functions are substantially outside the
critical bands. (See the central and right plots in Figures 4.8 and 4.9, respectively.)
Thus, H0 is not rejected for the pines but is rejected for the cells and redwoods.
4.4.2 Application to simulated point patterns In this subsection, two
datasets are simulated, and the inferential part of the new method (introduced in
Section 4.3.2) is applied to them. The fusion distance function and Ripley’s
K-function are computed for both datasets. (The definition and properties of
Ripley’s K-function are available in [88, 90].)
This application illustrates that there is still a need for new summary functions
for analysing point patterns. Note that the fusion distance
function successfully identifies the presence of spatial clustering in both datasets, in
contrast to the K-function.
[Figure 4.9 plots: three panels, both axes from 0 to 1.5.]
Figure 4.9: A-A plots of $\arcsin\sqrt{1 - H(t)}$ versus $\arcsin\sqrt{1 - H^{*}(t)}$ with the
simultaneous critical bands at 5% significance level. Datasets: pines (left), cells (centre)
and redwoods (right). Solid lines: A-A plots; dashed lines: critical functions; dotted
lines: identity line; Single Linkage algorithm.
Datasets The first simulated dataset, Dataset one, is a realisation of a Matérn
cluster process (Definition 14) with parent intensity λp = 5, daughter intensity
λc = 5, and radius r = 0.25 on the unit square. (See the left plot in Figure
4.10.) Dataset two is also a realisation of a Matérn cluster process, where λp = 10,
λc = 10, and r = 0.5, on the unit square. (See the left plot in Figure 4.11.) The
chosen parameters of the cluster process were based on Ripley’s choice [89].
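For readers who wish to generate comparable patterns, a Matérn cluster process can be simulated as sketched below. This is not the code used for the thesis. It assumes that the "daughter intensity" λc is the mean number of daughters per parent (Definition 14 may parameterise it differently), that daughters are uniform on a disc of radius r around their parent, and that parents are simulated on a window enlarged by r to reduce edge effects. The function names and the seed are hypothetical.

```python
import math
import random

def poisson_count(mean, rng):
    """Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def matern_cluster(parent_intensity, mean_daughters, radius, seed=None):
    """Matérn cluster pattern on the unit square.

    Parents are generated on the square enlarged by `radius` on every side so
    that clusters centred just outside the window still contribute points.
    """
    rng = random.Random(seed)
    side = 1.0 + 2.0 * radius
    n_parents = poisson_count(parent_intensity * side * side, rng)
    points = []
    for _ in range(n_parents):
        px = rng.uniform(-radius, 1.0 + radius)
        py = rng.uniform(-radius, 1.0 + radius)
        for _ in range(poisson_count(mean_daughters, rng)):
            # uniform point in the disc of the given radius around the parent
            rho = radius * math.sqrt(rng.random())
            theta = 2.0 * math.pi * rng.random()
            x, y = px + rho * math.cos(theta), py + rho * math.sin(theta)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                points.append((x, y))
    return points

# Parameters quoted for Dataset one; the seed is arbitrary.
dataset_one = matern_cluster(parent_intensity=5, mean_daughters=5,
                             radius=0.25, seed=42)
```

Under these assumptions the expected number of points in the unit square is λp × λc, that is, 25 for Dataset one and 100 for Dataset two.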
[Figure 4.10 plots: three panels. Left: point pattern, titled λ = 5, r = 0.25. Centre: K-function estimate with simulation envelopes, horizontal axis from 0 to 0.08. Right: P-P plot of the fusion distance function with critical band and identity line, both axes from 0 to 1.]
Figure 4.10: Left: Dataset one, a simulated Matérn cluster process with λp = 5,
λc = 5 and r = 0.25 on the unit square. Centre: translation-corrected estimate of the
K-function [81]. Right: P-P plot of fusion distance function. Solid lines: function
estimates, dashed lines: envelopes (centre), critical bands (right), dotted lines: identity
line, Single Linkage algorithm.
K-function The estimates of the K-function for the datasets were computed and
plotted using the software described in Section 4.4 and the spatial library Spatstat
[9] on a Pentium 4 (1.8 GHz). The K-function was estimated using the translation
correction of Ohser [81], and the upper and lower simulation envelopes were calculated
from 40 realisations of the homogeneous Poisson process (Definition 13)
with the same intensities as the simulated datasets. Observe that for t < 0.08, the
estimated K-function was inside the simulation envelopes, which is consistent with CSR
for both datasets. (See the central plots in Figures 4.10 and 4.11, respectively.)
[Figure 4.11 panels: point pattern (λ = 10, r = 0.5); K-function with simulation envelopes; P-P plot of the fusion distance function with critical band and identity line.]
Figure 4.11: Left: Dataset two, a simulated Matern cluster process with λp = 10,
λc = 10 and r = 0.5 on the unit square. Centre: translation-corrected estimate of the K-function
[81]. Right: P-P plot of fusion distance function. Solid lines: function estimates,
dashed lines: envelopes (centre), critical band (right), dotted lines: identity line,
Single Linkage algorithm.
Fusion distance function The (two-sided) modified Monte Carlo test using critical
bands (Section 3.5), based on the P-P plot of the fusion distance function, was
applied to the simulated datasets at the 5% significance level. The simple H0, that
the dataset is a realisation of a homogeneous Poisson process with the same intensity
as the simulated dataset, was tested against the composite H1 that the dataset is not
a realisation of the homogeneous Poisson process. The fusion distance function and
critical bands were calculated from M, m = 39 realisations under H0 using the Single
Linkage algorithm (Section 2.4).
For instance, when H(t) ≈ 0.3 and H(t) ≈ 0.55, H0 was rejected for Dataset one
and Dataset two, respectively. (The fusion distance function was also outside the
critical bands in a small neighbourhood around these values.) That is, the test
indicates that the simulated datasets were not realisations of the homogeneous
Poisson process. Moreover, the estimated fusion distance functions were substantially
above the identity line, suggesting clustered patterns for both datasets. (See the
right plots in Figures 4.10 and 4.11, respectively.)
CHAPTER 5
Power of Monte Carlo tests
This chapter evaluates the power of two Monte Carlo tests of Complete Spatial
Randomness (CSR) based on the fusion distance function (Definition 9) against the
alternative hypotheses of spatial clustering and inhibition. One Monte Carlo test
uses the supremum distance (Kolmogorov-Smirnov statistic) and the other uses the
area statistic (Definition 10). The power of the tests is then estimated through
simulation experiments.
There are several models [19, 69, 78, 95, 105], or alternative hypotheses, against
which the power of the test of CSR may be evaluated. In this thesis, two models are
chosen to represent clustered and inhibited interactions between points. The first
model is the Matern cluster process, and the second is the Matern type II point
process. Both models were introduced in [69] and are described in Section 5.1.
There are two important reasons for selecting these models. First, the models
are simple to simulate using direct algorithms (straightforward programming and
short computational time). In other words, Matern cluster and Matern type II point
processes are not computationally intensive to simulate, nor do they depend on
iterative algorithms such as Markov chain Monte Carlo (MCMC). Second, the
parameters of these models are easily varied to control the degree of dependence
between points. (That is, the degree of attraction or repulsion varies from
“no interaction” through “mild interaction” to “strong interaction”.)
The main purpose of this chapter is not to give a general rule or recommendation
on the power of the test of CSR. It is only to give an illustration of the power of
the test based on the fusion distance and area statistic restricted to a chosen set of
parameters of the alternative models. (The following definition of the power is given
by Bickel and Doksum [13, Chapter 5]: the power of a test against the alternative
H1 is the probability of rejecting H0 when H1 is true.) The next sections briefly
describe the chosen models, test statistics and summary function used to estimate
the power of the test of CSR against cluster and inhibition alternatives.
5.1 Models
Null model The chosen null model is a homogeneous Poisson point process which
is also known as Complete Spatial Randomness (CSR). This process is defined by an
important property and presented below. (The following definition is quoted from
[60].)
40 Chapter 5. Study of power
Definition 13 (Homogeneous Poisson point process). A point process on the
plane is a homogeneous Poisson process if: (i) N(B) has the Poisson distribution
with mean λ|B| for some positive constant λ and any measurable subset B
of R², where |B| is the area of B and N(B) represents the number of events (points
of the process) in B; (ii) for any disjoint measurable subsets B1, . . . , Bn of R², the
random variables N(B1), . . . , N(Bn) are independent.
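Definition 13 suggests a direct way to simulate CSR on a rectangle: draw the number of points from a Poisson distribution with mean λ|B| and then place the points independently and uniformly. The sketch below assumes this standard construction; the helper names `poisson_count` and `simulate_csr` are hypothetical, and only the Python standard library is used.

```python
import random

def poisson_count(mean, rng):
    """Draw a Poisson(mean) count by counting unit-rate arrivals in [0, mean]."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def simulate_csr(intensity, rng, width=1.0, height=1.0):
    """Homogeneous Poisson process on [0, width] x [0, height]."""
    n = poisson_count(intensity * width * height, rng)
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

# One realisation with intensity 100 on the unit square (arbitrary seed).
pattern = simulate_csr(100, random.Random(2003))
```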
Further information on properties and applications of the homogeneous Poisson
point process is reported in [24, 26, 30, 60].
First model: clustering The first chosen alternative model is a Matern cluster
process (see Section 2, Figures 1B and 1C of Matern [70]), a special case of a
Neyman-Scott process [78, 90, 102] which consists of independent random circular
clusters of radius r. (The following definition of the process is quoted from [71].)
Definition 14 (Matern cluster point process). The Matern cluster process
with parent intensity λp > 0, cluster intensity λc > 0 and cluster radius r > 0,
is constructed in two steps: (1) Generate the cluster centres (parents) xp from the
homogeneous Poisson process of intensity λp. (2) For each parent xp generate a
cluster (daughters) from the homogeneous Poisson process of intensity λc on the
ball b(xp, r). The Matern cluster process is given by the union of the clusters. The
expected number of daughters per cluster is µ = λc πr² and the overall intensity of
the process is λ = λp λc πr².
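The two steps of Definition 14 can be sketched as follows. This is an illustrative implementation rather than the software used in the thesis: the cluster is parameterised by the mean number of daughters µ = λc πr², and parents are generated on the unit square enlarged by r so that clusters centred just outside the window can contribute points (an assumed treatment of edge effects). All function names are hypothetical.

```python
import math
import random

def poisson_count(mean, rng):
    """Draw a Poisson(mean) count by counting unit-rate arrivals in [0, mean]."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def simulate_matern_cluster(parent_intensity, mean_daughters, radius, rng):
    """Matern cluster process observed on the unit square."""
    # Step 1: parents on the square enlarged by `radius` (assumption).
    side = 1.0 + 2.0 * radius
    n_parents = poisson_count(parent_intensity * side * side, rng)
    points = []
    for _ in range(n_parents):
        px = rng.uniform(-radius, 1.0 + radius)
        py = rng.uniform(-radius, 1.0 + radius)
        # Step 2: a Poisson number of daughters, uniform on the disc b(parent, radius).
        for _ in range(poisson_count(mean_daughters, rng)):
            dist = radius * math.sqrt(rng.random())
            angle = 2.0 * math.pi * rng.random()
            x, y = px + dist * math.cos(angle), py + dist * math.sin(angle)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                points.append((x, y))
    return points

# Example: 5 parents and 20 daughters per parent on average, radius 0.05.
pattern = simulate_matern_cluster(5, 20, 0.05, random.Random(14))
```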
Second model: inhibition The second chosen alternative process is a Matern
model II inhibition process, which is introduced by Matern [69]. (The following
definition of the model is quoted from [71, page 48].)
Definition 15 (Matern model II). The Matern model II process with initial in-
tensity λ0 > 0 and minimum inter-point distance r > 0, is constructed by dependent
thinning of the homogeneous Poisson process as follows: (1) Generate the homoge-
neous Poisson process of intensity λ0. For each point xi, generate an independent
uniform variable si ∈ [0, 1] that represents the times by which the points can be
ordered. (2) Remove any points xi such that there is another point xj satisfying
||xi − xj|| < r and sj < si. The overall intensity of the process is
\[ \lambda = \lambda_0 \, \frac{1 - \exp(-\lambda_0 \pi r^2)}{\lambda_0 \pi r^2}. \]
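The dependent thinning in Definition 15 and the intensity formula above can be sketched as follows. The code is an illustration under stated assumptions: the initial Poisson process is generated on the unit square enlarged by r so that points near the boundary can be thinned by neighbours outside the window, and the function names are hypothetical.

```python
import math
import random

def poisson_count(mean, rng):
    """Draw a Poisson(mean) count by counting unit-rate arrivals in [0, mean]."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def matern_ii_intensity(initial_intensity, radius):
    """Overall intensity of Matern model II from Definition 15."""
    ball = math.pi * radius ** 2
    return (1.0 - math.exp(-initial_intensity * ball)) / ball

def simulate_matern_ii(initial_intensity, radius, rng):
    """Matern model II observed on the unit square."""
    lo, hi = -radius, 1.0 + radius  # enlarged window (assumption)
    n = poisson_count(initial_intensity * (hi - lo) ** 2, rng)
    pts = [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(n)]
    marks = [rng.random() for _ in range(n)]  # the "times" s_i
    kept = []
    for i, (x, y) in enumerate(pts):
        # x_i is removed if some x_j with s_j < s_i lies within distance r.
        older_neighbour = any(
            marks[j] < marks[i]
            and math.hypot(x - pts[j][0], y - pts[j][1]) < radius
            for j in range(n) if j != i
        )
        if not older_neighbour and 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
            kept.append((x, y))
    return kept

kept = simulate_matern_ii(200, 0.05, random.Random(15))
overall = matern_ii_intensity(200, 0.05)
```

Because thinning is decided against all initial points, any two retained points are at least r apart.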
More information on the Matern model II and Matern cluster processes, prop-
erties and applications is reported in [69, 70, 71, 90, 102, 104].
5.2 Tests
First test The first Monte Carlo test is based on the fusion distance
function and uses the supremum distance, the Kolmogorov-Smirnov statistic [34].
The supremum distance is defined by
\[ U = \sup_{0 \le t \le t_1} \left| \hat{F}(t) - F(t) \right| \]
over a range of t values of interest, where F(t) is the (theoretical) c.d.f., F̂(t) is the
(observed) empirical c.d.f. and t1 denotes the upper limit of the range of t values.
However, it may not be possible to use standard goodness-of-fit tests, such as the
χ² test, since the distribution of the Kolmogorov-Smirnov statistic under CSR
is still unknown. The distribution of U is usually known in classical cases where
F̂ is the empirical cumulative distribution function of independent and identically
distributed observations. Here, however, the observations are not
independent. Therefore, the two-sided modified version of the Monte Carlo test
(Section 3.5) is performed, not only to estimate the power of the test of CSR but
also to achieve the exact significance level α.
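A minimal sketch of the supremum distance follows. It rests on two assumptions that are not restated in this section: that H(t) is the empirical c.d.f. of the fusion distances, and that the supremum is approximated by a maximum over a finite grid of t values. The Single Linkage fusion distances are computed as the edge lengths of a minimum spanning tree (Prim's algorithm), which is a standard equivalence; the function names and the reference c.d.f. in the example are hypothetical.

```python
import math

def fusion_distances(points):
    """Single Linkage fusion distances via Prim's minimum spanning tree."""
    n = len(points)
    best = [math.inf] * n   # shortest link from each point to the tree
    in_tree = [False] * n
    if n:
        best[0] = 0.0
    heights = []
    for step in range(n):
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[i] = True
        if step > 0:
            heights.append(best[i])
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[i], points[k])
                if d < best[k]:
                    best[k] = d
    return sorted(heights)

def ecdf(values, t):
    """Proportion of values not exceeding t."""
    return sum(v <= t for v in values) / len(values)

def supremum_distance(observed, reference, t_grid):
    """Maximum over the grid of |observed(t) - reference(t)|."""
    return max(abs(observed(t) - reference(t)) for t in t_grid)

# Three collinear points fuse at distances 1 and 2.
fd = fusion_distances([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
# Hypothetical reference c.d.f., used only to exercise the function.
u = supremum_distance(lambda t: ecdf(fd, t),
                      lambda t: min(t / 2.0, 1.0),
                      [0.0, 0.5, 1.0, 1.5, 2.0])
```

In the setting of this chapter the reference curve would presumably be estimated from realisations of CSR rather than given in closed form.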
Second test The second Monte Carlo test is also based on the
fusion distance function and uses the area statistic (Section 4.1.2). As with the
first test statistic, the distribution of the area statistic under CSR is
unknown, so the two-sided modified version of the Monte Carlo test is performed.
Clustering algorithm and dissimilarity coefficient The Single Linkage algorithm is
chosen to form the clusters, and the pairwise Euclidean distance is chosen to measure
the distance between points.
Range of argument To estimate the power of the first test based on the fusion
distance function H(t), the ranges of the argument t are [0, 0.22] for the Matern
cluster process and [0, 0.20] for the Matern model II. For the second test using the
area statistic, the range of t is [0, √2] for both models. (The upper limits of the
ranges of t values are chosen because H(t1) ≈ 1 for realisations from these models.)
5.3 Experimental study
Software and computational time The software and library used for the power
study were described in Sections 4.4 and 4.4.2, respectively. The continuous compu-
tational times for the power based on the fusion distance function and area statistic
were approximately four weeks and two weeks, respectively.
Realisations from null model For each Monte Carlo test performed, M, m = 99
realisations of CSR with intensity λ = 100 points on the unit square were simulated.
Realisations from clustering The Matern cluster processes, with five parent
intensities λp ∈ {5, 10, 20, 25, 50}, were simulated on the unit square. The daughter
parameter λc was then adjusted to keep the overall intensity of the process constant
at 100 points. The degree of interaction between daughters and parents was varied
through the radius r ∈ {0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2}.
Typical realisations of the Matern cluster processes for some of these parameters are
shown in Figure 5.1. (For instance, when r = 0.005 only the clusters can be seen in
this figure.) Since the overall intensity is fixed at 100, the mean number of daughters
per parent is 100/λp, so the generated patterns had three different sizes of clusters:
small (each parent had on average two daughters), medium (each parent had on
average four or five daughters) and large (each parent had on average ten or twenty
daughters). The more daughter points and the smaller the radius, the more
clustered the point pattern is.
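The arithmetic behind the cluster sizes can be checked with a short calculation. The sketch assumes only the relations stated in Definition 14 (µ = λc πr² and λ = λp µ) with the overall intensity fixed at 100; the variable and function names are hypothetical.

```python
import math

TOTAL_INTENSITY = 100
parent_intensities = [5, 10, 20, 25, 50]

# Mean number of daughters per cluster: mu = lambda / lambda_p.
mean_daughters = {lp: TOTAL_INTENSITY / lp for lp in parent_intensities}

def daughter_intensity(mu, radius):
    """lambda_c on the ball b(x_p, r), obtained from mu = lambda_c * pi * r^2."""
    return mu / (math.pi * radius ** 2)

# Daughter intensity needed for 20 daughters per cluster at radius 0.2.
lc = daughter_intensity(mean_daughters[5], 0.2)
```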
Realisations from inhibition The Matern model II processes, with ten initial
intensities λ0 ∈ {110, 120, 130, 140, 150, 160, 170, 180, 190, 200}, were simulated on
the unit square. The initial intensities were chosen to achieve an overall intensity of
100 points. For each λ0, the degree of inhibition between points is controlled by the
radius parameter r ∈ {0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05}.
Figure 5.2 shows typical realisations of Matern model II processes for some
of these parameters. The degree of inhibition amongst points varies with
the size of the radius: the larger the radius, the more inhibited the point
pattern.
5.4 Estimation of power
The two-sided modified version of the Monte Carlo test (Section 3.5) at the exact
significance level α = 0.05 was performed based on M, m = 99 realisations under
the null model. The powers of the tests were then estimated from 1000 simulations
under H1 for each set of parameters of the alternative models. The chosen values of
the parameters of the alternative models are described in Section 5.3. The fraction
of rejections out of 1000 simulations was the estimate of the power of each test.
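The estimation of power as a fraction of rejections can be sketched as below. Note that this illustration substitutes the ordinary one-sided Monte Carlo rank test for the two-sided modified version of Section 3.5, which is not restated here, and that the toy statistics are hypothetical stand-ins for the supremum distance or area statistic.

```python
import random

def monte_carlo_p_value(observed, simulated):
    """One-sided Monte Carlo p-value; large statistics count as extreme."""
    exceed = sum(s >= observed for s in simulated)
    return (1 + exceed) / (1 + len(simulated))

def estimate_power(draw_null_stat, draw_alt_stat, rng, m=99, n_rep=1000, alpha=0.05):
    """Fraction of alternative samples for which the test rejects H0."""
    rejections = 0
    for _ in range(n_rep):
        simulated = [draw_null_stat(rng) for _ in range(m)]
        if monte_carlo_p_value(draw_alt_stat(rng), simulated) <= alpha:
            rejections += 1
    return rejections / n_rep

rng = random.Random(1)
# Alternative identical to the null: the rejection rate estimates the size.
size_under_h0 = estimate_power(lambda g: g.random(), lambda g: g.random(),
                               rng, n_rep=200)
# Alternative far from the null: every simulated test rejects.
power_shifted = estimate_power(lambda g: g.random(), lambda g: g.random() + 2.0,
                               rng, n_rep=200)
```

With m = 99 simulations, the test rejects at the 5% level exactly when the observed statistic is among the five largest of the 100 values.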
5.4.1 Test using supremum distance
The results obtained for the powers
of Monte Carlo tests, using the supremum distance, show that the optimal choice
of the upper limit t1 does not strongly depend on the mean number of daughters
per cluster. (See Figures 5.3, 5.4, and Tables B.1(a)–(e) (in appendix B) which
present the estimated powers of the Monte Carlo test of CSR against Matern cluster
processes with parameters described previously in the text.)
[Figure 5.1 panels: λ = 25 and λ = 50, each with r = 0.005, 0.05, 0.125 and 0.2.]
Figure 5.1: Typical plots of realisations of Matern cluster point processes with λp =
{25, 50}, λc is adjusted to keep the total intensity of the process constant at 100
points and r = {0.005, 0.05, 0.125, 0.2} on the unit square.
[Figure 5.2 panels: (λ = 110, r = 0.005), (λ = 120, r = 0.01), (λ = 130, r = 0.015), (λ = 140, r = 0.02), (λ = 150, r = 0.025), (λ = 160, r = 0.03).]
Figure 5.2: Typical plots of realisations of Matern model II processes with λ0 =
{110, 120, 130, 140, 150, 160} and r = {0.005, 0.01, 0.015, 0.02, 0.025, 0.03} on the
unit square.
Clustering For instance, consider r = 0.005 and the first plot in Figure 5.3 (top
row, left side). The test is very powerful provided that t1 < 0.07 or t1 > 0.17.
However, the loss of power is noticeable if t1 ∈ [0.07, 0.17]. To investigate the
reasons for this loss of power, extra simulations were done. Figure 5.5 shows
typical plots of fusion distance functions and estimated means from realisations
under the Poisson (CSR) and Matern cluster processes. The left plot on the top
row in Figure 5.5 shows 100 fusion distance functions for the Poisson with λ = 100,
and 100 fusion distance functions for Matern cluster with λp = 5, λc = 20, and
r = 0.005.
The values from the fusion distance functions of Matern cluster are different from
those of Poisson when t1 ∈ [0, 0.1]. Consequently, the power of the test is strong.
However, for t1 ∈ (0.11, 0.13] the fusion distance functions of both patterns appear
to be approximately equal. In addition, for t1 ∈ (0.13, 0.2] the values of the fusion
distance functions for both patterns also appear to be similar. Therefore, the power
of the test is weak when both fusion distance functions have similar values.
[Figure 5.3 panels: power against t_1 for r = 0.005, 0.01, 0.025, 0.05, 0.075 and 0.1, with one curve for each of λ = 5, 10, 20, 25, 50.]
Figure 5.3: Power of Monte Carlo tests of CSR against Matern cluster processes
with parameters λp, λc, r; where λp, r are varying as shown. λc is adjusted to
keep intensity of the process constant at 100. Test uses 99 realisations of CSR.
Power estimated from 1000 realisations under Matern cluster processes. Test statistic:
supremum distance; t1 is the upper limit of the range.
[Figure 5.4 panels: power against t_1 for r = 0.125, 0.15, 0.175 and 0.2, with one curve for each of λ = 5, 10, 20, 25, 50.]
Figure 5.4: Power of Monte Carlo tests of CSR against Matern cluster processes with
parameters λp, λc, r; where λp, r are varying as shown. λc is adjusted to keep
intensity of process constant at 100. Test uses 99 realisations of CSR. Power estimated
from 1000 realisations under Matern cluster processes. Test statistic: supremum
distance; t1 is the upper limit of the range.
The right plot on the top row in Figure 5.5 shows the estimated means of the
fusion distance functions against t1. The estimated means of both fusion distance
functions are equal when t1 ∈ [0.12, 0.13]. Thus, the left plot on the top row in
Figure 5.3 shows that the power is approximately equal to zero for t1 ∈ [0.12, 0.13].
For t1 > 0.13, the power gradually increases.
[Figure 5.5 panels: H(t) (left) and mean of H(t) (right) against t_1 for Poisson and cluster processes; upper row λ = 5, r = 0.005; lower row λ = 50, r = 0.2.]
Figure 5.5: Typical plots of fusion distance functions from Poisson and Matern
cluster processes. Left: computed H(t) of individual realisations, right: estimated
H(t). Upper: 100 realisations from Poisson with λ = 100, and 100 realisations from
Matern cluster with λp = 5, λc = 20, r1 = 0.005. Lower: 100 realisations of Poisson
with λ = 100, and 100 realisations of Matern cluster with λp = 50, λc = 2, r10 = 0.2.
A more detailed explanation can be visualised through Q-Q plots. Figures B.1
– B.5 (in appendix B) show the quantiles of the fusion distance functions from the
Matern cluster process plotted against those from the Poisson process. Note that
both quantiles are equal when t1 = 0.13 in Figure B.3. Therefore, the power is equal
to zero. However, when t1 > 0.13 the quantiles from both processes are different,
so the power is greater than zero. (See Figure B.3.) The left plot on the lower row
in Figure 5.5 shows 100 fusion distance functions from the Poisson process with
λ = 100, and 100 fusion distance functions from the Matern cluster process with
λp = 50, λc = 2, and r = 0.2. The right plot on the lower row in Figure 5.5 shows
the estimated means of the fusion distance functions. Similar conclusions can be
drawn to explain the fluctuation of the power of the test of CSR against the Matern
cluster process, not only for these parameters but also for the remaining parameters.
[Figure 5.6 panels: power against t_1, one curve per radius r; upper panel for the smaller radii, lower panel for r = 0.03, 0.035, 0.04, 0.045, 0.05.]
Figure 5.6: Power of Monte Carlo tests of CSR against Matern model II processes
with parameters λ0, r; where λ0 is chosen to achieve an intensity of 100. Test uses
99 realisations of CSR. Power estimated from 1000 realisations under the Matern
model II. Test statistic: supremum distance; t1 is the upper limit of the range.
Inhibition The results obtained for the power of the test of CSR against the
Matern model II, using the supremum distance, show that the optimal choice of t1
depends on the choice of r. (See Figure 5.6 and Table B.2 (in appendix B) which
present the estimated powers of the test of CSR against Matern model II processes
with parameters described previously in the text.)
For instance, consider r = 0.005 and the upper plot in Figure 5.6. The test
of CSR is powerful provided that t1 > 0.15. However, the fluctuation of the power is
noticeable for t1 ≤ 0.15. To investigate this fluctuation, we proceed in a fashion
similar to that described for clustering previously. Figure 5.7 shows typical plots
of the fusion distance functions from the Poisson and Matern model II processes.
Observe that the values of the fusion distance functions of Matern model II patterns
are close to those of the Poisson patterns for t1 ∈ [0, 0.2]. In particular, for
t1 ∈ [0, 0.05] the fusion distance functions of both patterns are approximately equal,
so the power of the test of CSR is approximately zero.
Figures B.6 – B.10 (in appendix B) show the Q-Q plots of the fusion distance
functions from both models. The quantiles are plotted for t1 ∈ [0, 0.2]. Note that, the
quantiles of the fusion distance functions for both models (Poisson and Matern model
II) are very close to the identity line. For instance, when t1 = 0.01, both quantiles
are equal in Figure B.6. Therefore, for t1 = 0.01, the power is zero. However, the test
of CSR is the most powerful for the Matern model II, with λ0 = 200 and r = 0.05,
if t1 ∈ [0.001, 0.06]. (See the lower plot in Figure 5.6.) For these parameters, the
fusion distance functions from both models are very different. (See the lower plots
in Figure 5.7.)
The left plot on the lower row in Figure 5.7 shows 100 fusion distance functions for
the Poisson process with λ = 100, and 100 fusion distance functions from the Matern
model II with λ0 = 200 and r = 0.05. The right plot on the lower row in Figure 5.7
shows the estimated means of the fusion distance functions. Similar conclusions can
be drawn to explain the fluctuation of the power of the test of CSR against the
Matern model II alternative based on the fusion distance function, not only for these
parameters but also for the remaining parameters.
5.4.2 Test using area statistic
Clustering The left plot in Figure 5.8 and Table 5.1 present the estimated powers
of Monte Carlo tests using the area statistic. The test is the most powerful for the
Matern cluster with λp = 5, λc = 20, and r = 0.005. For the other parameters
described previously, the obtained results show that the estimated power decreases
with the increasing radius r or with the increasing mean number of parents λp.
[Figure 5.7 panels: H(t) (left) and mean of H(t) (right) against t_1 for Poisson and Matern II processes; upper row λ = 110, r = 0.005; lower row λ = 200, r = 0.05.]
Figure 5.7: Typical plots of fusion distance functions from Poisson and Matern
model II processes. Left: computed H(t) of individual realisations, right: estimated
H(t). Upper: 100 realisations of Poisson with λ = 100, and 100 realisations of
Matern model II with λ0 = 110, r1 = 0.005. Lower: 100 realisations of Poisson with
λ = 100, and 100 realisations of Matern model II with λ0 = 200, r10 = 0.05.
Inhibition The right plot in Figure 5.8 and Table 5.2 present the estimated powers
of Monte Carlo tests of CSR against Matern model II processes using the area
statistic. The obtained results show that the estimated power increases with the
increasing inhibition radius r. The test is the most powerful for the Matern model
II with radius r > 0.045.

Matern cluster, area statistic
r            0.005  0.01  0.025  0.05  0.075  0.1   0.125  0.15  0.175  0.2
5 parents    1.00   1.00  1.00   1.00  1.00   0.99  0.99   0.99  0.99   0.99
10 parents   1.00   1.00  1.00   1.00  1.00   0.99  1.00   0.99  0.94   0.81
20 parents   1.00   1.00  1.00   1.00  1.00   0.99  0.93   0.72  0.49   0.31
25 parents   1.00   1.00  1.00   1.00  0.99   0.98  0.81   0.53  0.32   0.19
50 parents   1.00   1.00  0.99   0.99  0.91   0.55  0.28   0.14  0.07   0.05
Table 5.1: Power of Monte Carlo tests of CSR against Matern cluster processes with
λp, λc, r, where λp and r are varying as shown and λc is adjusted to keep intensity
of the process constant at 100. Test uses 99 realisations of CSR. Power using area
statistic and estimated from 1000 simulations under Matern cluster process.

Matern model II, area statistic
r      0.005  0.01   0.015  0.02  0.025  0.03  0.035  0.04  0.045  0.05
Power  0.004  0.004  0.005  0.02  0.09   0.38  0.72   0.95  0.99   1.00
Table 5.2: Power of Monte Carlo tests of CSR against the Matern model II with λ0,
r; where λ0 is chosen to achieve an intensity of 100. Test uses 99 realisations of
CSR. Power using area statistic and estimated from 1000 simulations under Matern
model II.
Conclusion The power of the Monte Carlo test based on the supremum distance
(Section 5.4.1) is quite variable and difficult to interpret, whereas the power of
the Monte Carlo test based on the area statistic (Section 5.4.2) is straightforward
to interpret. The best power achieved by the supremum distance is comparable to
the best power achieved by the area statistic. Therefore, we recommend the Monte
Carlo test based on the area statistic for the models studied here.
[Figure 5.8 panels: power against r; left (Cluster) with one curve for each of λ = 5, 10, 20, 25, 50; right (Inhibition).]
Figure 5.8: Power of Monte Carlo tests of CSR against Matern cluster (left) and
Matern model II (right) with parameters described previously in the text. λp, λ0 are
chosen to achieve an intensity of 100. Test uses 99 realisations of CSR. Power using
area statistic and estimated from 1000 simulations under each model.
CHAPTER 6
Analysis of multivariate point patterns
The main aim of this chapter is the analysis of multivariate or multitype point pat-
terns using a clustering algorithm. Diggle [30, page 90] defined a multivariate point
process as any stochastic mechanism which generates events classified as type j for
j = 1, . . . , k. The k (univariate) point processes are referred to as the components
of the multivariate process. A multivariate point pattern is a realisation of a multi-
variate point process. For a general introduction to the theory of multivariate point
processes, see [20, 21].
The usual approach [24, 30, 90, 91, 102] to investigating independence between
different types of points in a multivariate point pattern begins with the estimation
of cross-type versions of the standard summary functions, such as the nearest
neighbour distance function G and the second reduced moment function K. Van
Lieshout and Baddeley’s J function [67] is also used to study forms of dependence
between points of different types in a multivariate point pattern. Further information
and applications of the J-function to multivariate point patterns are reported in [67].
In this chapter, an extension of the strategy described in Chapter 4 to multivari-
ate point patterns using three summary statistics is investigated. The first statistic
is the fusion distance function introduced in Section 4.1.1, and the second is a new
summary statistic introduced in Section 6.2, the S statistic. (The S statistic mea-
sures the number of clusters in which all members belong to the same type of a given
multitype point pattern). The properties of S for a given bivariate point pattern
are examined under the random labelling and independence hypotheses.
Finally, we introduce a spatially modified version of the Rg index [37, 86], a
popular measurement used for comparing two classifications in cluster analysis. The
properties of the spatial Rg index are investigated under the random labelling, and
independence null hypotheses.
6.1 Extension based on fusion distance function
Let Y = (X1, . . . , Xℓ) be a (marked) multivariate point process on R^d, where Xj
(j = 1, . . . , ℓ) is a univariate point process on R^d. Note that the types and the
components of a given point process are different concepts. The types of a multitype
point process are the marks or labels, for example “on”/“off”, attached to the points.
The components of a marked point process are the sub-patterns consisting of the
points of one type. Our notation for a realisation of a (given) marked multivariate
point process, a (given) marked point pattern, is y = (x1, . . . , xℓ), where xj is the
sub-pattern of points of type j, yi is the ith point (i = 1, . . . , n), and y0 is the
unmarked point pattern. (y0 can also be referred to as the point pattern regardless
of the marks.)
The extension of the strategy to analyse multivariate point patterns is to test
the null hypothesis of random labelling using the fusion distance function. The
definition of the random labelling property is given below, and then the procedure
of the strategy is described next.
Definition 16 (random labelling). Let mi be the type (mark) attached to the ith
point yi. The random labelling hypothesis states that, given the unmarked pattern
y0, the types m1, ..., mn attached to these points are i.i.d. with distribution
pj. A consequence of this hypothesis is that, given the unmarked pattern y0 and the
number nj of points of type j, the component of type j is a simple random sample
of size nj without replacement from y0. There is also another consequence: given
the unmarked pattern y0 and the number nj of points of each type j, the marks
m1, ..., mn are a random permutation of (1, 1, . . . , 1, 2, 2, . . . , 2, . . . , k), where
k is the number of types.
The procedure:
1. Select a component of the marked point pattern, say, the sub-point pattern
of type 1 (size n1), and compute its fusion distance function, $H_1(t)$.
2. Collect $x_1^{(1)}, \ldots, x_1^{(m)}$, m i.i.d. sub-samples of size n1 (selected randomly
without replacement) from y0, and compute the fusion distance functions $H_1^{(r)}(t)$,
where r = 1, . . . , m.
3. Calculate the mean $\bar{H}_1(t)$ of the fusion distance functions $H_1^{(r)}(t)$, given by
\[ \bar{H}_1(t) = \frac{1}{m} \sum_{r=1}^{m} H_1^{(r)}(t). \tag{6.1} \]
4. Apply the two parts of the strategy presented in Section 4.3, the exploratory
data analysis and inference, to compare $H_1(t)$ with $\bar{H}_1(t)$.
If the random labelling hypothesis does not hold, then H1(t) is outside the (point-
wise) simulation envelopes or inside the (simultaneous) critical band at exact signif-
icance level α.
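Steps 1 to 3 of the procedure can be sketched as follows. The sketch assumes that the fusion distance function is the empirical c.d.f. of the Single Linkage fusion distances, computed here as minimum spanning tree edge lengths and evaluated on a finite grid; the function names and the small simulated pattern are hypothetical and serve only to exercise the code.

```python
import math
import random

def fusion_distances(points):
    """Single Linkage fusion distances via Prim's minimum spanning tree."""
    n = len(points)
    best = [math.inf] * n
    in_tree = [False] * n
    if n:
        best[0] = 0.0
    heights = []
    for step in range(n):
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[i] = True
        if step > 0:
            heights.append(best[i])
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[i], points[k])
                if d < best[k]:
                    best[k] = d
    return sorted(heights)

def fusion_cdf(points, t_grid):
    """Empirical c.d.f. of the fusion distances on a grid (needs >= 2 points)."""
    heights = fusion_distances(points)
    return [sum(h <= t for h in heights) / len(heights) for t in t_grid]

def random_labelling_curves(unmarked, n1, m, t_grid, rng):
    """Steps 2-3: m sub-samples of size n1 from y0 and the mean of their curves."""
    curves = [fusion_cdf(rng.sample(unmarked, n1), t_grid) for _ in range(m)]
    mean_curve = [sum(c[k] for c in curves) / m for k in range(len(t_grid))]
    return curves, mean_curve

rng = random.Random(6)
y0 = [(rng.random(), rng.random()) for _ in range(30)]  # unmarked pattern
x1 = y0[:12]                                            # component of type 1
t_grid = [k / 20 for k in range(21)]
h1 = fusion_cdf(x1, t_grid)                             # step 1
curves, h1_bar = random_labelling_curves(y0, 12, 19, t_grid, rng)
```

The curves returned in step 2 are also what pointwise envelopes or critical bands would be built from in step 4.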
In other words, if the random labelling hypothesis is true, then, given the number
of points of type 1, the component of type 1 is a random sample without replacement
from y0. Next, an illustration of the extension of the strategy using the fusion
distance function applied to the bivariate Cat Retinal Ganglia dataset is presented.
Figure 6.1: The bivariate Cat Retinal Ganglia dataset with two types: 65 “on” cells
(△) and 70 “off” cells (◦) on a rectangular region with dimensions 1 mm by 0.7533
mm. Source: [113], data provided in [9].
Cat Retinal Ganglia dataset The Cat Retinal Ganglia Data were introduced by
Wassle, Boycott, and Illing [113] and were analysed by [31, 67, 113]. Figure 6.1
shows the dataset, which is a pattern of beta-type ganglion cells in the retina of a
cat recorded by [113]. Beta cells are associated with the resolution of fine detail in
the cat’s visual system. The cells can be classified anatomically as “on” or “off”.
In this sample, there are 65 on cells and 70 off cells in a rectangular region with
dimensions 1 mm by 0.7533 mm. Van Lieshout and Baddeley [67] stated that the
statistical independence of the on and off components would strengthen the claim
that there are two separate channels, one for brightness and another for darkness, as
postulated by Hering in 1874. More information on the Cat Retinal Ganglia dataset
and its analysis is presented in [31, 67, 113].
Illustration The extension of the strategy was applied to the Cat Retinal Ganglia
dataset using the two-sided Monte Carlo test (Section 3.5). The random labelling
null hypothesis was tested against the dependence of the types on the locations of the
points at 5% exact significance level. Figures 6.2 and 6.3 show the P-P plots and A-
A plots of the fusion distance function with simulation envelopes and critical bands
based on the Single Linkage, respectively. The number of random permutations of
the type labels was 999. The results based on the Single Linkage are equivalent to
those obtained from the Average Linkage and Complete Linkage. (Further details
are provided in Section A.3, appendix A.) The fusion distance functions of the
two types of the dataset are mostly outside the simulation envelopes and inside
the critical bands. Therefore, the random labelling is rejected for the Cat Retinal
Ganglia dataset based on these clustering algorithms. Note that our result agrees
with the results obtained by [31, 67, 113].
[Figure 6.2 panels: four P-P plots on axes from 0 to 1; rows: on cells, off cells; columns: simulation envelopes, critical bands.]
Figure 6.2: P-P plots of fusion distance functions H1(t), H2(t) versus H(t) from the
Cat Retinal Ganglia dataset. Upper: on cells (type 1); lower: off cells (type 2). Left:
simulation envelopes; right: critical bands; 5% significance level; Single Linkage
algorithm; 999 random permutations of the type labels.
6.2 Extension based on S statistic
In this section, we introduce the S statistic. Then, we examine its properties
under random labelling and independence hypotheses. We also present an extension
of the strategy using the S statistic.
Definition 17 (S statistic). Let y be a marked multivariate point pattern with
j types or marks on a bounded region W . Then, a chosen clustering algorithm is
applied to y and the number of clusters is counted at all levels of the dendrogram in
Figure 6.3: A-A plots of $\arcsin\sqrt{1 - H(t)}$ against $\arcsin\sqrt{1 - H(t)}$ for the Cat
Retinal Ganglia dataset. Upper: on cells (type 1); lower: off cells (type 2). Left:
simulation envelopes; right: critical bands; 5% significance level; Single Linkage
algorithm; 999 random permutations of the type labels.
which all members of a cluster are of the same type. Thus, the statistic S is defined
as:
$$S = \#\{\text{clusters at all hierarchical levels in which all members of a cluster are of the same type}\}, \tag{6.2}$$
where # denotes “the number of”. That is, the S statistic measures the degree of
attraction between the types of a given multivariate point pattern.
Large values of S correspond to positive association between types of nearby
points, while small values of S correspond to negative association. In applications,
the alternative hypothesis is usually one of positive association so that we perform
one-sided tests where large values are critical.
Properties of S under random labelling Given a hierarchical dendrogram of a
marked point pattern with two components, let $C_i$ be the cluster created by fusion
at the $i$th step of the hierarchical algorithm, and $C_{n-1} = \{y_1, \ldots, y_n\}$ be the entire
set of points. Then,
$$S = \sum_{i=1}^{n-1} \mathbf{1}\{\text{cluster } C_i \text{ consists only of points of a single type}\},$$
where $\mathbf{1}\{\cdot\}$ denotes the indicator function. Thus the expected value of S is given by
$$E[S] = \sum_{i=1}^{n-1} P\{\text{cluster } C_i \text{ consists only of points of a single type}\}.$$
Under the random labelling hypothesis, if there are $n_1$ points of type 1 and $n_2$ points
of type 2 then (conditional on y, that is, conditional on the locations of the points,
but not on their marks)
$$\begin{aligned}
P\{\text{cluster } C_i \text{ consists of points of a single type}\}
&= P\{\text{all points in } C_i \text{ have type 1}\} + P\{\text{all points in } C_i \text{ have type 2}\} \\
&= \frac{\binom{n-s_i}{n_1-s_i}}{\binom{n}{n_1}} + \frac{\binom{n-s_i}{n_2-s_i}}{\binom{n}{n_2}},
\end{aligned} \tag{6.3}$$
where $s_i$ = # points in $C_i$.
If $s_i > n_1, n_2$ then P{cluster $C_i$ consists of points of a single type} = 0. Thus
(conditional on y)
$$E_{rr}[S] = \frac{\sum_{i=1}^{n-1} \left[ \binom{n-s_i}{n_1-s_i} + \binom{n-s_i}{n_2-s_i} \right]}{\binom{n}{n_1}}, \tag{6.4}$$
taking $\binom{n}{k} = 0$ for $k < 0$. Observe that $E_{rr}[S]$ is the expected value of the S
statistic under the random labelling hypothesis conditional on y, and depends on the
given dendrogram, more particularly on the sizes of the clusters $s_i$. The summands
in equation (6.4) decrease rapidly as $s_i$ increases, so it is easy to terminate the sum.
Write $I_i = \mathbf{1}\{\text{cluster } C_i \text{ consists of points of a single type}\}$; then $S = \sum_{i=1}^{n-1} I_i$.
Thus
$$\operatorname{Var}(S) = \operatorname{Var}\Big(\sum_{i=1}^{n-1} I_i\Big) = \sum_{i=1}^{n-1} \operatorname{Var}(I_i) + \sum_{i \neq j} \operatorname{Cov}(I_i, I_j). \tag{6.5}$$
Note that $\operatorname{Var}(I_i) = p_i(1 - p_i)$, where
$p_i = E_{rr}(I_i) = P\{\text{cluster } C_i \text{ consists of points of a single type}\}$, and
$\operatorname{Cov}(I_i, I_j) = p_{ij} - p_i p_j$, where
$$p_{ij} = E_{rr}(I_i I_j) = P\{\text{cluster } C_i \text{ consists of points of a single type and } C_j \text{ consists of points of a single type}\}. \tag{6.6}$$
If $C_i \cap C_j \neq \emptyset$ then
$$p_{ij} = P\{C_i \cup C_j \text{ consists of points of a single type}\} = \frac{\binom{n-s}{n_1-s} + \binom{n-s}{n_2-s}}{\binom{n}{n_1}}, \tag{6.7}$$
where $s$ = # points in $C_i \cup C_j$. If $C_i \cap C_j = \emptyset$ then
$$\begin{aligned}
p_{ij} &= P\{C_i \text{ consists of points of a single type},\ C_j \text{ consists of points of a single type}\} \\
&= P\{C_i \cup C_j \text{ consists of points of a single type}\} \\
&\quad + P\{C_i \text{ consists of points of type 1},\ C_j \text{ consists of points of type 2}\} \\
&\quad + P\{C_i \text{ consists of points of type 2},\ C_j \text{ consists of points of type 1}\} \\
&= \frac{\binom{n-s_i-s_j}{n_1-s_i-s_j} + \binom{n-s_i-s_j}{n_2-s_i-s_j} + \binom{n-s_i-s_j}{n_1-s_i} + \binom{n-s_i-s_j}{n_2-s_i}}{\binom{n}{n_1}},
\end{aligned} \tag{6.8}$$
so Var(S) may also be computed.
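The variance calculation can be checked numerically on a small dendrogram. The following Python sketch is illustrative only: the function names and the two toy dendrograms are assumptions, not part of the thesis. It evaluates Var(S) from the probabilities $p_i$ and $p_{ij}$ and compares the result with a brute-force enumeration of all labellings. In the disjoint case the fourth count is written as $\binom{n-s_i-s_j}{n_2-s_i}$, the number of labellings in which $C_i$ is entirely of type 2 and $C_j$ entirely of type 1.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def binom(n, k):
    """Binomial coefficient, taken to be 0 outside 0 <= k <= n."""
    return comb(n, k) if 0 <= k <= n else 0


def p_single(s, n1, n2):
    """P{a set of s points is single-type} under random labelling."""
    n = n1 + n2
    return Fraction(binom(n - s, n1 - s) + binom(n - s, n2 - s), comb(n, n1))


def p_pair(ci, cj, n1, n2):
    """P{C_i and C_j are each single-type} under random labelling."""
    n = n1 + n2
    if ci & cj:  # overlapping clusters: the union must be single-type
        return p_single(len(ci | cj), n1, n2)
    si, sj = len(ci), len(cj)
    rest = n - si - sj
    # Disjoint clusters: both type 1, both type 2,
    # C_i type 1 with C_j type 2, or C_i type 2 with C_j type 1.
    count = (binom(rest, n1 - si - sj) + binom(rest, n2 - si - sj)
             + binom(rest, n1 - si) + binom(rest, n2 - si))
    return Fraction(count, comb(n, n1))


def var_s(clusters, n1, n2):
    """Var(S) as the sum of variances plus the sum of covariances."""
    p = [p_single(len(c), n1, n2) for c in clusters]
    var = sum(pi * (1 - pi) for pi in p)
    for i, j in combinations(range(len(clusters)), 2):
        # each unordered pair appears twice in the sum over i != j
        var += 2 * (p_pair(clusters[i], clusters[j], n1, n2) - p[i] * p[j])
    return var


def var_s_brute_force(clusters, n1, n2):
    """Var(S) by enumerating every labelling with n1 points of type 1."""
    n = n1 + n2
    values = []
    for type1 in combinations(range(n), n1):
        t1 = set(type1)
        values.append(sum(1 for c in clusters if c <= t1 or not (c & t1)))
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)


# Toy dendrograms (hypothetical), clusters listed in order of fusion.
balanced = [frozenset({0, 1}), frozenset({2, 3}), frozenset({0, 1, 2, 3})]
unbalanced = [frozenset({0, 1}), frozenset({2, 3}),
              frozenset({2, 3, 4}), frozenset({0, 1, 2, 3, 4})]
print(var_s(balanced, 2, 2), var_s_brute_force(balanced, 2, 2))
print(var_s(unbalanced, 2, 3), var_s_brute_force(unbalanced, 2, 3))
```

Exact fractions are used so that the formula and the enumeration can be compared without rounding error; the enumeration is only feasible for very small n.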
Illustrative example Let us consider y = (x1, x2) to be a given marked bivariate
point pattern, where x1 = {y1, y2}, x2 = {y3, y4}, y0 = {y1, y2, y3, y4}, y1 = (1, 1),
y2 = (2, 2), y3 = (6, 6), y4 = (8, 8), and let their respective marks be m1 = “on”,
m2 = “on”, m3 = “off”, m4 = “off”. The Single Linkage algorithm is applied to y, and
its dendrogram is shown in Figure 6.4. The fusion distances are $h_1 = \sqrt{2}$, $h_2 = 2\sqrt{2}$,
$h_3 = 4\sqrt{2}$, and the clusters are C1 = {y1, y2}, C2 = {y3, y4}, C3 = {y1, y2, y3, y4}.
Under the random labelling hypothesis, and conditional on y, there are n1 = 2
points of type “on” and n2 = 2 points of type “off”. The numbers of points in the
clusters are s1 = 2, s2 = 2, s3 = 4. By equation (6.3), each of C1 and C2 consists only
of “on” points with probability 1/6 and only of “off” points with probability 1/6, so
that p1 = p2 = 1/3, while p3 = 0. The observed statistic is S = 2 and its expected
value is $E_{rr}[S] = 2/3$.
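The illustrative example can be reproduced with a few lines of Python. The sketch below is a minimal, unoptimised Single Linkage written with the standard library only; the function names are hypothetical and the implementation is an assumption for illustration, not the software used in this thesis. It returns the fusion distances and clusters, counts S, and evaluates equation (6.4) exactly.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, dist


def single_linkage(points):
    """Return (fusion distance, fused cluster) for each of the n-1 fusions."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Single Linkage: the distance between two clusters is the
        # smallest distance between a point of one and a point of the other.
        d, a, b = min(
            ((min(dist(points[i], points[j]) for i in a for j in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((d, a | b))
    return merges


def s_statistic(merges, marks):
    """S: number of fused clusters whose members all share one mark."""
    return sum(1 for _, c in merges if len({marks[i] for i in c}) == 1)


def expected_s_random_labelling(merges, n1, n2):
    """Expected value of S under random labelling, equation (6.4)."""
    n = n1 + n2
    total = 0
    for _, c in merges:
        s = len(c)
        # A binomial coefficient with negative lower index is taken as 0.
        total += comb(n - s, n1 - s) if s <= n1 else 0
        total += comb(n - s, n2 - s) if s <= n2 else 0
    return Fraction(total, comb(n, n1))


points = [(1, 1), (2, 2), (6, 6), (8, 8)]
marks = ["on", "on", "off", "off"]
merges = single_linkage(points)
S = s_statistic(merges, marks)
E = expected_s_random_labelling(merges, 2, 2)
print([round(d, 3) for d, _ in merges])  # fusion distances h1, h2, h3
print(S, E)
```

The search over all pairs of clusters at every step is far too slow for the datasets analysed later in this chapter; it is used here only because it keeps the example self-contained.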
Figure 6.4: A Single Linkage dendrogram applied to a given marked bivariate point
pattern y.
Properties of S under independence The calculations become simpler if the types
are assumed to be independent with P{type 1} = p and P{type 2} = 1 − p = q.
Then $E[I_i] = P\{\text{cluster } C_i \text{ consists of points of a single type}\} = p^{s_i} + q^{s_i}$. Thus the
expected value of S under independence, denoted by $E_{ind}[S]$, is given as follows,
$$E_{ind}[S] = \sum_{i=1}^{n-1} \left( p^{s_i} + q^{s_i} \right). \tag{6.9}$$
Observe that equation (6.9) can be used as an approximation when n is large,
$n_1/n \sim p$ and $n_2/n \sim q$. If $C_i \cap C_j = \emptyset$ then $I_i$, $I_j$ are independent. In practice, if the point
pattern has more than 2 types, then the properties of the S statistic under random
labelling and independence are very difficult to compute analytically. Therefore,
in this case, we need to rely on Monte Carlo simulation and tests, described in
Chapter 3.
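Equation (6.9) needs only the cluster sizes of the dendrogram. As a small illustration (the function name is hypothetical, and the choice p = 1/2 is an assumption made for the example), the cluster sizes 2, 2, 4 of the four-point example of this section give the following value.

```python
def expected_s_independence(cluster_sizes, p):
    """Expected value of S under independence, equation (6.9)."""
    q = 1.0 - p
    return sum(p ** s + q ** s for s in cluster_sizes)


sizes = [2, 2, 4]  # s1, s2, s3 from the four-point illustrative example
e_ind = expected_s_independence(sizes, 0.5)
print(e_ind)  # 1.125
```

For this tiny pattern the value 1.125 is far from the random labelling value 2/3 obtained from equation (6.4), which is consistent with equation (6.9) being suggested as an approximation only when n is large.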
The extension of the strategy using the S statistic is similar to that described
for the fusion distance function presented in Section 6.1. But here, the (one-sided)
modified version of the Monte Carlo test described in Section 3.5 is performed to
test the null hypotheses: random labelling, and independence.
Applications of S statistic The strategy using the S statistic is applied to the
bivariate point patterns: Cat Retinal Ganglia dataset (Section 6.1), Austin Hughes’
Amacrine Cell Data, Clustered, and Longleaf pines. (The level of significance of the
tests is α = 0.05.) However, before the results are presented, a brief description of
the remaining datasets is given as follows.
Austin Hughes’ dataset Figure 6.5 shows Austin Hughes’ Amacrine Cell Data,
which has 152 on cells and 142 off cells on a rectangular region with dimensions
Figure 6.5: Austin Hughes’ dataset with two types: 152 on cells (△) and 142 off
cells (◦) on a rectangular region with dimensions 1.6065 mm by 1 mm. Source: [31],
data provided in [9].
1.6065 mm by 1.00 mm. This dataset is an example of a bivariate point pattern of
Amacrine cells in the retina of a rabbit. In what follows, this dataset is referred to
as Austin Hughes’ dataset. For more information on Austin Hughes’ dataset, see
[31, 9].
Longleaf pines dataset The Longleaf pines data were introduced by Platt, Evans,
and Rathbun [85], and record the locations and diameters at breast height (dbh)
of 584 Longleaf pines in a square of side 200 m in southern Georgia, USA. (Platt, Evans
and Rathbun [85, page 500] classified trees less than 5 cm dbh as “juveniles”, trees
with 5–30 cm dbh as “subadults”, and trees larger than 30 cm dbh as “adults”.)
More details of this dataset are reported in [9, 85, 87]. For simplicity, the dataset
(analysed here) was re-scaled to the unit square and classified into two types: 313
trees with dbh ≤ 30 cm, and 271 trees with dbh > 30 cm. Trees with dbh ≤ 30 cm were
called “young” and trees with dbh > 30 cm were called “adult”. Figure 6.6 shows the
Longleaf pines dataset classified into young (◦) and adult (△) types.
Clustered dataset Figure 6.7 shows a realisation of a Matern cluster process (de-
fined in Section 5.1) with λp = 2, λc = 100 and r = 0.2 on the unit square. This
simulated dataset is an example of an “ideally” clustered point pattern. The daugh-
ters from the first parent are labelled type 1 and the daughters of the second parent
are labelled type 2.
Results of S statistic Table 6.1 shows the estimated values of the S statistic and the 5%
Monte Carlo critical values under the null hypotheses of random labelling and
independence, denoted by S5%rr and S5%ind, respectively. (The 5% Monte Carlo critical
value is the 95th quantile defined in Section 3.) The chosen clustering algorithm was
the Single Linkage and the number of realisations under each null hypothesis was
999. From Table 6.1, the observed values of the S statistic are greater than the Monte
Carlo critical values for the Clustered and Longleaf pines datasets. Thus, both null
hypotheses are rejected for these datasets. However, for the Cat Retinal Ganglia and
Austin Hughes’ datasets, S < S5%rr and S < S5%ind. Therefore, neither null hypothesis is
rejected for these point patterns. The results from the Single Linkage
are similar to those obtained from the Average Linkage and Complete Linkage:
the null hypotheses were rejected for the Clustered and Longleaf pines
datasets, but not for the Cat Retinal Ganglia and Austin Hughes’ datasets.
Consequently, the results based on these two algorithms are not shown here.

Datasets               S      S5%rr    S5%ind
Cat Retinal Ganglia    1      32.7     32.3
Austin Hughes’         9      67.0     66.6
Clustered              198    44.2     43.8
Longleaf pines         349    120.2    119.8

Table 6.1: Estimated values of S statistic, Monte Carlo critical values under random
labelling and independence null hypotheses, S5%rr, S5%ind, respectively, for Cat Retinal
Ganglia, Austin Hughes’, Clustered and Longleaf pines datasets; Single Linkage
algorithm; for each dataset and null hypothesis: 999 permutations of the type labels.
Figure 6.6: Longleaf pines dataset classified into two types: 313 young trees with
dbh ≤ 30 cm (◦), and 271 adult trees with dbh > 30 cm (△), on a square
region of side 200 m. The square region was re-scaled to the unit square. Source:
[85], data provided in [9].
Figure 6.7: Clustered dataset, a simulated Matern cluster point pattern with λp = 2,
λc = 100, r = 0.2 on the unit square. Daughters of the first parent are labelled type
1 (◦) and daughters of the second parent are labelled type 2 (△).
6.3 Extension based on spatial Rg index
In this section, the Rg index is presented, and then the index is modified to
assimilate the spatial context. Next, the properties of the spatial Rg under ran-
dom labelling and independence hypotheses are investigated. The extension of the
strategy using the spatial Rg index is also presented.
Rg index The Rg index is one of the most commonly used measures for
comparing two classifications in non-spatial cluster analysis (Section 2.3). The index
was introduced by Rand [86], and the following definition is quoted from [37, page 147].
Definition 18 (Rg index). Let C1 and C2 be two classifications of the same dataset
of n points into g clusters, where g is fixed. The Rg index of similarity between C1
and C2 is
$$R_g(C_1, C_2) = \frac{T_g - \frac{1}{2}P_g - \frac{1}{2}Q_g + \binom{n}{2}}{\binom{n}{2}},$$
where
$$T_g = \sum_{i=1}^{g}\sum_{j=1}^{g} n_{ij}^2 - n, \quad P_g = \sum_{i=1}^{g} n_{i\cdot}^2 - n, \quad Q_g = \sum_{j=1}^{g} n_{\cdot j}^2 - n,$$
and the quantity $n_{ij}$ is the number of points in common between the ith cluster
of the first classification and the jth cluster of the second (the clusters in the two
classifications may each be labelled arbitrarily from 1 to g). The terms $n_{i\cdot}$ and $n_{\cdot j}$
are the appropriate marginal totals of the $n_{ij}$ values.
Everitt [37] interpreted the Rg index as the probability that two points are treated
alike in both classifications. He also pointed out that the Rg index lies in the interval
[0,1] and takes its upper limit when there is complete agreement between the two
classifications. Further details, properties and applications of the Rg index to cluster
analysis are reported in [37, 42, 86].
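The interpretation of the Rg index as the proportion of pairs of points treated alike can be checked against the contingency-table formula of Definition 18. The following Python sketch is illustrative: the function names and the five-point labelling are assumptions made for the example.

```python
from itertools import combinations
from math import comb


def rand_index(labels_a, labels_b):
    """Rg index from the contingency table, as in Definition 18."""
    n = len(labels_a)
    table = {}
    for a, b in zip(labels_a, labels_b):
        table[(a, b)] = table.get((a, b), 0) + 1
    rows, cols = {}, {}
    for (a, b), count in table.items():
        rows[a] = rows.get(a, 0) + count  # marginal totals n_i.
        cols[b] = cols.get(b, 0) + count  # marginal totals n_.j
    t = sum(c * c for c in table.values()) - n
    p = sum(c * c for c in rows.values()) - n
    q = sum(c * c for c in cols.values()) - n
    return (t - p / 2 - q / 2 + comb(n, 2)) / comb(n, 2)


def rand_index_by_pairs(labels_a, labels_b):
    """Proportion of pairs of points treated alike in both classifications."""
    n = len(labels_a)
    alike = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )
    return alike / comb(n, 2)


marks = ["on", "on", "off", "off", "on"]  # hypothetical first classification
groups = [1, 1, 2, 2, 2]                  # hypothetical second classification
print(rand_index(marks, groups), rand_index_by_pairs(marks, groups))
```

Here 6 of the 10 pairs are treated alike, so both functions return 0.6. The pair-counting version is quadratic in n and is included only as a check on the formula.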
Spatial Rg index To the best of our knowledge, the Rg index has not been applied
to the analysis of spatial point patterns. Thus, a (new) modified version of the index,
the spatial Rg index, is introduced as follows.
Definition 19 (Spatial Rg index). Let y be a marked multivariate point pattern
with j different marks, and y0 be the unmarked point pattern. Let Cm be the
classification of the points of y based on their marks (group i contains all points
with mark equal to i, i = 1, . . . , j). Let Cs be the classification of y0 obtained by
applying a chosen clustering algorithm to y0 and extracting a classification with j
classes. The spatial Rg index is Rg(Cs, Cm).
That is, we compare the “j-class” classification from the point pattern y0 with
the classification into “j-marks” from the marked point pattern y using the spatial
Rg index. Thus, the spatial Rg index measures the extent of spatial segregation of
the points of different marks.
Properties of spatial Rg index In this section, some properties of the spatial Rg
index for a bivariate point pattern are investigated. Consider a point pattern y with
two types, where the first type has n1 points and the second has n2 points, and n =
n1+n2. After applying a clustering algorithm to the unmarked point pattern y0, cut
the dendrogram (Section 2.3) into two groups so that the first group has m1 points
and the second group has m2 points, where n = m1 + m2 and n1, n2,m1,m2 ∈ N.
Let Ai,j , where i, j = 1, 2, be the numbers of points of type i belonging to group j.
Then
A11 + A12 = n1, A11 + A21 = m1, A21 + A22 = n2, A12 + A22 = m2.
The summarised information on the two classifications is presented in Table 6.2.
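The spatial Rg index (denoted by Rg) can also be written as
$$\begin{aligned}
R_g &= \frac{1}{\binom{n}{2}} \Big[ \#\text{ of pairs } (i, j) \text{ of same type and same group} + \#\text{ of pairs } (i, j) \text{ of different type and different group} \Big] \\
&= \frac{1}{\binom{n}{2}} \left[ \binom{A_{11}}{2} + \binom{A_{22}}{2} + \binom{A_{12}}{2} + \binom{A_{21}}{2} + A_{11}A_{22} + A_{12}A_{21} \right].
\end{aligned} \tag{6.10}$$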
          group 1   group 2   Σ
type 1    A11       A12       n1
type 2    A21       A22       n2
Σ         m1        m2        n

Table 6.2: Spatial classification into two types and cluster analysis classification into
two groups for a given bivariate point pattern.
Consider A11 = X and A22 = Y; then the spatial Rg index can be re-written as a
quadratic form in X and Y:
$$R_g = \frac{1}{n(n-1)} \Big[ X(X-1) + Y(Y-1) + (n_1 - X)(n_1 - 1 - X) + (n_2 - Y)(n_2 - 1 - Y) + XY + (n_1 - X)(n_2 - Y) \Big]. \tag{6.11}$$
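The quadratic form given by equation (6.11) is symmetric in X about $\tfrac{1}{2}n_1$, that is,
if X is replaced by $(n_1 - X)$ the same result is obtained. The quadratic form is also
symmetric in Y about $\tfrac{1}{2}n_2$. Moreover, the coefficients of $X^2$ and $Y^2$ are positive so
that a minimum occurs at $X = \tfrac{1}{2}n_1$, $Y = \tfrac{1}{2}n_2$, yielding
$$\begin{aligned}
\min R_g &= \frac{1}{n(n-1)} \Big[ \tfrac{1}{2}n_1(\tfrac{1}{2}n_1 - 1) + \tfrac{1}{2}n_2(\tfrac{1}{2}n_2 - 1) + \tfrac{1}{2}n_1(\tfrac{1}{2}n_1 - 1) + \tfrac{1}{2}n_2(\tfrac{1}{2}n_2 - 1) + \tfrac{1}{4}n_1 n_2 + \tfrac{1}{4}n_1 n_2 \Big] \\
&= \frac{1}{n(n-1)} \Big[ n_1(\tfrac{1}{2}n_1 - 1) + n_2(\tfrac{1}{2}n_2 - 1) + \tfrac{1}{2}n_1 n_2 \Big].
\end{aligned} \tag{6.12}$$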
The minimum of the spatial Rg index given by equation (6.12) occurs when X, Y are
free to take real values. However, if X, Y are constrained to be nonnegative integers
then the minimum of the spatial Rg index occurs at one of the integer points closest
to $X = \tfrac{1}{2}n_1$, $Y = \tfrac{1}{2}n_2$. Let n be large (n → ∞) with $n_1 \approx c_1 n$ and $n_2 \approx c_2 n$, where
$c_2 = 1 - c_1$. If $A_{ij} \approx c_{ij} n$ then $c_{11} + c_{12} = c_1$ and $c_{21} + c_{22} = c_2 = 1 - c_1$. Therefore,
the spatial Rg index can be approximated by the following expression
$$\begin{aligned}
R_g &\approx c_{11}^2 + c_{12}^2 + c_{22}^2 + c_{21}^2 + c_{12}c_{21} + c_{11}c_{22} \\
&= Z^2 + (c_1 - Z)^2 + W^2 + (c_2 - W)^2 + (c_1 - Z)(c_2 - W) + ZW,
\end{aligned} \tag{6.13}$$
where $Z = c_{11}$, $W = c_{22}$. By symmetry, the minimum value of the spatial Rg index
as a function of Z and W, for fixed $c_1$, occurs at $Z = \tfrac{1}{2}c_1$ and $W = \tfrac{1}{2}c_2$. Then
$$\begin{aligned}
\min R_g &= \tfrac{1}{4}c_1^2 + \tfrac{1}{4}c_1^2 + \tfrac{1}{4}c_2^2 + \tfrac{1}{4}c_2^2 + \tfrac{1}{4}c_1 c_2 + \tfrac{1}{4}c_1 c_2 = \tfrac{1}{2}(c_1^2 + c_2^2 + c_1 c_2) \\
&= \tfrac{1}{2}\big(c_1^2 + (1 - c_1)^2 + c_1(1 - c_1)\big) = \tfrac{1}{2}(1 - c_1 + c_1^2).
\end{aligned} \tag{6.14}$$
This is never less than its value at $c_1 = \tfrac{1}{2}$, so that $R_g \geq \tfrac{1}{2}(1 - \tfrac{1}{2} + \tfrac{1}{4}) = \tfrac{3}{8}$. The spatial Rg
index is equal to 1 if, and only if, all possible pairs are either of the same group
and same type or of a different group and different type. The only way to ensure
equality is to have all points belong to one group and one type.
For given n1, n2 the maximum possible value of the spatial Rg index occurs at a
boundary point (because the spatial Rg index is a convex function of X and Y ). In
other words, the maximum value of the spatial Rg index occurs when either X = 0
or n1, and either Y = 0 or n2.
Random labelling The distribution of the spatial Rg index under the random labelling
hypothesis for a given bivariate point pattern is derived as follows. Suppose that the
point pattern consists of n1 points of type 1 and n2 points of type 2, where n =
n1 + n2. Hierarchical clustering of the unmarked points divides them into 2 groups
of size m1, m2, where n = m1 + m2. If the labels are randomly permuted (equal
probability for all n! permutations) then each possible labelling has probability
$$\frac{n_1!\, n_2!}{n!} = \frac{1}{\binom{n}{n_1}},$$
that is, each subset of n1 points has an equal chance of being the subset labelled
type 1. Hence the outcome presented by Table 6.2 has probability
$$\frac{\binom{m_1}{A_{11}} \binom{m_2}{A_{22}}}{\binom{n}{n_1}},$$
and value
$$R_g = \frac{\binom{A_{11}}{2} + \binom{A_{22}}{2} + \binom{A_{12}}{2} + \binom{A_{21}}{2} + A_{11}A_{22} + A_{12}A_{21}}{\binom{n}{2}}.$$
Observe that
A12 = n1 − A11, A21 = m1 − A11,
A22 = n2 − A21 = n − n1 − m1 + A11.
Thus there is only one free variable, A11 = X say, constrained by
$$\begin{aligned}
A_{11} \geq 0 &\iff X \geq 0 \\
A_{12} \geq 0 &\iff X \leq n_1 \\
A_{21} \geq 0 &\iff X \leq m_1 \\
A_{22} \geq 0 &\iff X \geq m_1 - n_2
\end{aligned} \tag{6.15}$$
Therefore $\max(0, m_1 - n_2) \leq X \leq \min(n_1, m_1)$. Then under the random labelling
hypothesis, the spatial Rg index and probability can be expressed as a function of
X as follows:
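$$R_g(x) = \frac{1}{\binom{n}{2}} \left[ \binom{x}{2} + \binom{n_2 - m_1 + x}{2} + \binom{n_1 - x}{2} + \binom{m_1 - x}{2} + x(n_2 - m_1 + x) + (n_1 - x)(m_1 - x) \right] \tag{6.16}$$
and
$$P(X = x) = \frac{\binom{m_1}{x}\binom{m_2}{n_2 - m_1 + x}}{\binom{n}{n_1}}. \tag{6.17}$$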
Exact null distribution of Rg index The exact distribution of Rg index under the
null hypothesis of random labelling could, in principle, be calculated from equations
(6.16) and (6.17). However, in practice this will be difficult when n is large.
Monte Carlo approximation of the null distribution of Rg index The null dis-
tribution of Rg index can be approximated to arbitrarily good accuracy, by Monte
Carlo Methods, by randomly permuting the type labels and computing Rg index for
each such permutation.
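Both routes to the null distribution can be sketched in a few lines of Python. The code below is an illustration under stated assumptions: the function names are hypothetical, group 1 is taken to be the first m1 positions of the label vector, and the sizes n1 = n2 = m1 = 2 are a toy case. It tabulates the exact distribution from equations (6.16) and (6.17) and also draws Monte Carlo values by permuting the type labels.

```python
import random
from fractions import Fraction
from math import comb


def rg_of_x(x, n1, n2, m1):
    """Spatial Rg index as a function of X = A11, equation (6.16)."""
    n = n1 + n2
    a11, a12, a21, a22 = x, n1 - x, m1 - x, n2 - m1 + x
    agree = (comb(a11, 2) + comb(a22, 2) + comb(a12, 2) + comb(a21, 2)
             + a11 * a22 + a12 * a21)
    return Fraction(agree, comb(n, 2))


def exact_null_distribution(n1, n2, m1):
    """Return {Rg value: probability} under random labelling, eq. (6.17)."""
    n = n1 + n2
    m2 = n - m1
    dist = {}
    for x in range(max(0, m1 - n2), min(n1, m1) + 1):
        prob = Fraction(comb(m1, x) * comb(m2, n2 - m1 + x), comb(n, n1))
        value = rg_of_x(x, n1, n2, m1)
        dist[value] = dist.get(value, 0) + prob
    return dist


def monte_carlo_null(n1, n2, m1, n_sim, seed=0):
    """Rg values obtained by randomly permuting the type labels."""
    rng = random.Random(seed)
    labels = [1] * n1 + [2] * n2
    values = []
    for _ in range(n_sim):
        rng.shuffle(labels)
        x = labels[:m1].count(1)  # type 1 points falling in group 1
        values.append(rg_of_x(x, n1, n2, m1))
    return values


null = exact_null_distribution(2, 2, 2)
simulated = monte_carlo_null(2, 2, 2, n_sim=999)
# One-sided p-value of an observed Rg = 1 (large values are critical).
p_value = sum(prob for value, prob in null.items() if value >= 1)
print(null, p_value)
```

For this toy case Rg takes the value 1 with probability 1/3 and the value 1/3 with probability 2/3, so an observed Rg = 1 has exact one-sided p-value 1/3. The loop over x has at most min(n1, m1) + 1 terms; the binomial coefficients become very large integers as n grows.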
Based on visual inspection of the histograms of the fusion distances applied to
the multivariate point patterns Longleaf pines (described in Section 6.2) and Brazil-
ian trees (introduced in Section 8.1), it seems appropriate to try fitting a gamma
distribution for approximating the spatial Rg index distribution. (The histograms of
the fusion distances applied to Longleaf pines and Brazilian trees datasets are shown
in Figure A.16, in appendix A.) The following definition of the gamma distribution
is quoted from [55, page 166].
Definition 20 (Shifted Gamma distribution). A random variable X has a
shifted gamma distribution if its probability density function is given by
$$f(x) = \frac{(x - \gamma)^{\alpha - 1} \exp[-(x - \gamma)/\beta]}{\beta^{\alpha}\,\Gamma(\alpha)}, \tag{6.18}$$
where α > 0, β > 0, and x > γ. The parameters α, β, and γ are known as the
“shape”, “scale” and “shift” of the distribution, respectively.
The parameters α, β, γ of the shifted gamma distribution are estimated using
the Method of Moments described by [55, page 186] and presented as follows. Given
values of n independent random variables X1, . . . , Xn, each distributed as in equation
(6.18), the Method of Moments estimators $\hat\alpha$, $\hat\beta$ and $\hat\gamma$ are given by
$$\hat\alpha = \frac{4 m_2^3}{m_3^2}, \quad \hat\beta = \frac{m_3}{2 m_2}, \quad \hat\gamma = \bar{X} - \frac{2 m_2^2}{m_3}, \tag{6.19}$$
respectively, where
$$\bar{X} = n^{-1}\sum_{j=1}^{n} X_j, \quad m_2 = n^{-1}\sum_{j=1}^{n} (X_j - \bar{X})^2, \quad m_3 = n^{-1}\sum_{j=1}^{n} (X_j - \bar{X})^3.$$
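The estimators in equation (6.19) are direct to compute. The following Python sketch is illustrative only: the function name is hypothetical, the four input values are a toy sample rather than simulated Rg values, and the check for positive skewness is an added assumption (the estimators require $m_3 > 0$ for the scale to be positive).

```python
def shifted_gamma_moments(xs):
    """Method of Moments estimates (shape, scale, shift), equation (6.19)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    if m3 <= 0:
        raise ValueError("the estimators require positive sample skewness")
    alpha = 4 * m2 ** 3 / m3 ** 2
    beta = m3 / (2 * m2)
    gamma = mean - 2 * m2 ** 2 / m3
    return alpha, beta, gamma


sample = [0.0, 0.0, 0.0, 4.0]  # toy data: mean 1, m2 = 3, m3 = 6
alpha_hat, beta_hat, gamma_hat = shifted_gamma_moments(sample)
print(alpha_hat, beta_hat, gamma_hat)  # 3.0 1.0 -2.0
```

The estimates reproduce the sample moments: the fitted mean $\hat\gamma + \hat\alpha\hat\beta = 1$, the fitted variance $\hat\alpha\hat\beta^2 = 3$, and the fitted third central moment $2\hat\alpha\hat\beta^3 = 6$.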
Even though the Moment estimators are often less accurate than the Maximum
Likelihood estimators α∗, β∗ and γ∗, the Moment estimators do not rely on iterative
computational algorithms. Therefore, the Method of Moments is preferred here for
estimating the parameters of the gamma distribution, for
two main reasons. First, the aim is to have a simple approximation of the spatial
Rg index distribution. Second, this approximation should be feasible and rapidly
calculated using direct algorithms.
In other words, the computational time for calculating the
Maximum Likelihood estimators is much longer than for the
Method of Moments estimators. In addition to the waiting time problem, the programming task
is much more difficult and demanding than the direct calculations of the Method
of Moments estimators. Further information on the gamma distribution and its
estimators is reported in [55].
Gamma approximation to null distribution of spatial Rg index: A procedure to
approximate the null distribution of the spatial Rg index from a given bivariate
point pattern is described as follows.
Given a bivariate point pattern, the null distribution is simulated a small number of times (for
example, 30 to 100 times), and the parameters of the shifted gamma distribution
of the spatial Rg index are estimated using the Method of Moments given by equation
(6.19). The p-value for the spatial Rg index observed in the given bivariate point
pattern is then calculated from the fitted shifted gamma distribution.
Extension of strategy using spatial Rg index The extension of the strategy using
the spatial Rg index is similar to that described for the S statistic presented in
Section 6.2. The (one-sided) modified version of the Monte Carlo test (Section 3.5)
is then performed to test the random labelling null hypothesis.
                       Single Linkage      Average Linkage
Datasets               Rg       R5%rr      Rg       R5%rr
Cat Retinal Ganglia    0.496    0.502      0.496    0.511
Austin Hughes’         0.498    0.499      0.498    0.504
Clustered              1        0.507      1        0.507
Longleaf pines         0.503    0.503      0.553    0.503

Table 6.3: The estimated spatial Rg index, and 5% Monte Carlo critical values,
R5%rr, for the Monte Carlo test of random labelling. Datasets: Cat Retinal Ganglia,
Austin Hughes’, Clustered and Longleaf pines. 999 realisations under random
labelling.
First application The strategy using the spatial Rg index was applied to the
bivariate point patterns: Cat Retinal Ganglia, Austin Hughes’, Clustered, and Lon-
gleaf pines. These datasets were described previously. (The level of significance of
the tests was α = 0.05.) Table 6.3 shows the estimated values of the spatial Rg index
and 5% Monte Carlo critical value under random labelling hypothesis, denoted by
R5%rr. (The 5% Monte Carlo critical value is the 95th quantile defined in Section
3.) The number of realisations under the null hypothesis was 999, and the chosen
clustering algorithms were the Single Linkage and Average Linkage.
The results obtained from the Single Linkage (see Table 6.3) show that the ran-
dom labelling hypothesis is only rejected for the Clustered point pattern. However,
based on the Average Linkage, the random labelling is rejected for the Clustered
and Longleaf pines datasets. The results from the Complete Linkage algorithm are
similar to those obtained from the Average Linkage. That is, the null hypothesis is
rejected for the Clustered and Longleaf pines datasets. Therefore, the result from
the Complete Linkage is not presented here.
Second application The null distributions of the spatial Rg index from the bivari-
ate point patterns: Cat Retinal Ganglia, Austin Hughes’, Longleaf pines, Clustered,
and (full) California redwoods seedlings (described below) were approximated using
gamma distributions. The Method of Moments was used to estimate the parameters
of the gamma distributions (Section 6.3). However, before the results are presented,
a brief description of the (full) California redwoods seedlings dataset is given as
follows.
California redwoods seedlings dataset Figure 6.8 shows the California redwoods
dataset [105] in which the locations of 195 seedlings of California redwood trees are
Figure 6.8: California redwoods seedlings dataset with 195 points re-scaled to the unit
square. This dataset is regarded as the full redwoods. Ripley’s subset is commonly
known as the redwoods data. Source: [105], data provided in [9].
plotted. Strauss [105] divided the sampling region into two regions demarcated by
a diagonal line corresponding to a discontinuity in the soil and land usage. (Region
I has 72 trees and region II has 123 trees.) Strauss [personal communication] has
informed the author that the dataset is no longer available. Therefore, a plot of the
entire dataset [105] was scanned and digitised by the author in 2002.
Henceforth, the California redwoods seedlings dataset is referred to as the full
redwoods. To the best of our knowledge, this dataset has only been analysed by
[65, 105]. For further details on the dataset, see [105]. A subset of the full redwoods
dataset, consisting of 62 points in a square sub-region, was extracted by Ripley [88]
and is known as the redwoods data in spatial statistics literature. The subset is a
very good example of a clustered point pattern. (Figure 6.8 shows the full redwoods
dataset with regions I, II, and Ripley’s subset.)
Even though the full redwoods seedlings dataset is a univariate point pattern,
the regions I and II of the dataset were regarded as if they had two separate marks:
the points located at region I were labelled type 1, and the points in region II were
labelled type 2.
The results The Single Linkage was unable to divide the datasets into two sub-
stantial groups. As an example, for the Cat Retinal Ganglia dataset, the first cluster
obtained had 126 points and the other cluster had 9 points. Another example was
for the Austin Hughes’ dataset, where the first cluster had 293 points and the other
Datasets               Rg       α        β         γ        p-value
Cat Retinal Ganglia    0.496    0.511    0.0074    0.496    0.792
Austin Hughes’         0.498    0.505    0.0034    0.498    0.955
Clustered              1        0.508    0.0050    0.497    0
Longleaf pines         0.553    1.182    0.0015    0.499    3.33e-16
Full redwoods          0.588    0.961    0.0061    0.497    2.57e-07

Table 6.4: Estimated values of the spatial Rg index from datasets: Cat Retinal
Ganglia, Austin Hughes’, Clustered, Longleaf pines, and full redwoods; estimated
parameters α, β, γ of gamma approximation and p-values from the Monte Carlo
null distribution of spatial Rg index under random labelling null hypothesis; Average
Linkage algorithm.
cluster had 1 point. Thus, the Average Linkage was chosen to form the clusters of
the datasets.
Figure 6.9 shows that the fitted gammas are very good approximations for the
Monte Carlo null distributions of the spatial Rg index for the datasets: Cat Reti-
nal Ganglia, Austin Hughes’, Clustered, Longleaf pines with two types, and full
redwoods with two regions.
Table 6.4 shows the estimated values of the spatial Rg index, parameters α, β, γ
of the gamma approximations, and p-values from the gamma approximation of Rg
index under random labelling null hypothesis based on the Average Linkage.
The parameters were estimated using the Method of Moments (Section 6.3). Be-
cause of the large p-values for the Cat Retinal Ganglia and Austin Hughes’ datasets
(presented in Table 6.4), the null hypothesis of random labelling is not rejected for
these datasets. However, the random labelling is rejected for: Clustered, Longleaf
pines, and full redwoods datasets.
Figure 6.9: Q-Q plots comparing the Monte Carlo estimates of the null distributions
of the spatial Rg index (horizontal axis) with their gamma approximations (vertical axis),
for each of the datasets: Cat Retinal Ganglia, Austin Hughes’, Clustered, Longleaf pines,
and full redwoods. Solid lines: Q-Q plots; dashed lines: identity line; Average Linkage algorithm.
CHAPTER 7
Analysis of local configuration
This chapter presents a new extension of a popular approach for analysing localised
neighbourhoods of a given point pattern in spatial statistics. The new extension,
named “analysis of local configuration”, is based on the fusion distance function.
Our attention is now focused on a local neighbourhood of a given point of the
dataset.
In spatial statistics literature, the original strategy is known as the Local Indi-
cators of Spatial Association, or LISA [5, 6, 22, 23, 109]. An early paper by Getis
and Ord [44] suggested local versions of the K, L, and G functions. Anselin [6] out-
lined a general class of local indicators of spatial association, LISA, and showed how
this class of local indicators allows for the decomposition of global indicators such
as the Moran’s I and Geary’s c statistics [109, page 170]. Anselin also illustrated
applications of LISA to the spatial pattern of conflict in African countries [28] and
to a number of Monte Carlo simulations. The following definition of LISA is quoted
from [6, page 94].
Definition 21 (LISA). A local indicator of spatial association is any statistic that
satisfies two requirements. First, a value of LISA for each observation gives an
indicator of the extent of significant spatial clustering of similar values around that
observation. Second, the sum of LISAs for all observations is proportional to a
global indicator of spatial association.
Recently, Cressie and Collins [22, 23] also investigated LISA methodology for
point patterns and developed a version based on the product density function. (The
product density function is defined by Stoyan, Kendall and Mecke [102, page 120].)
After estimating the product density functions using kernel smoothing [2, 14, 39,
112], Cressie and Collins applied classical multidimensional scaling [68] to reduce
the number of LISA functions, then they applied the non-hierarchical K-means al-
gorithm [49] to classify LISA functions into bundles or groups. (Cressie and Collins
defined a bundle of LISA functions as a set of similar product density functions [23].)
Further information on the methodology developed by Cressie and Collins and its
application to a minefield point pattern with clutter, are presented in [23, 22].
Instead of using the (traditional summary function) K-function, classical mul-
tidimensional scaling, and non-hierarchical algorithm to characterise a local neigh-
bourhood, fusion distances based on hierarchical algorithms will be used here. This
application of the fusion distance is new, and some of Cressie and Collins’ steps will
be followed to introduce the analysis of local configuration. In particular, the probability
density functions of the fusion distances are estimated using kernel smoothing
techniques [2, 14, 39, 112]. Then the groups of the fusion distance densities will
be classified using a different measure from that chosen by Cressie and Collins [22].
The procedure for the analysis of local configuration is presented as follows.
7.1 Strategy
Let x = {x1, . . . ,xn} be a given point pattern with n points, and k be a small
positive integer (k < n). For each point xi (i = 1, . . . , n), its k-nearest neighbour
points, xi1,xi2, . . . ,xik are computed using the pairwise Euclidean distance. Then,
for each point xi and its k-nearest neighbours, the subset {xi,xi1,xi2, . . . ,xik} is
formed. This subset is regarded as the local neighbourhood around the point xi.
Now, a chosen hierarchical clustering algorithm is applied to each subset, and an-
other set is formed, that is, the set of fusion distances {hi1, . . . , hik−1}.
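The construction of the local neighbourhoods can be sketched as follows. This Python fragment is an illustration with hypothetical names and a toy set of four collinear points; it returns, for each point, the indices of the point itself followed by its k nearest neighbours in Euclidean distance, ready to be passed to a hierarchical clustering algorithm.

```python
from math import dist


def local_neighbourhoods(points, k):
    """For each point, the indices of the point and its k nearest neighbours."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        result.append([i] + others[:k])
    return result


pattern = [(0, 0), (1, 0), (3, 0), (7, 0)]  # toy point pattern
neighbourhoods = local_neighbourhoods(pattern, k=2)
print(neighbourhoods)  # [[0, 1, 2], [1, 0, 2], [2, 1, 0], [3, 2, 1]]
```

Sorting all distances for every point costs of the order of n² log n operations, which is acceptable for patterns of a few hundred points; ties in distance are broken here by index order, which is an arbitrary choice.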
The probability density functions of the fusion distances are estimated using the
kernel density estimator, given by
\[
f_i(h, \ell) = f(h; \ell, x_i) = (k\ell)^{-1} \sum_{j=1}^{k-1} \kappa\{(h - h_{ij})/\ell\}, \tag{7.1}
\]
where $\kappa$ is a function satisfying $\int_{-\infty}^{+\infty} \kappa(h)\,dh = 1$
and $\kappa(h) > 0$, known as the kernel, $\ell$ is a positive number known as the
bandwidth or window width, $i = 1, \ldots, n$ and $j = 1, \ldots, (k - 1)$. Further
information on kernel estimators and their properties is reported in [14, 112].
Henceforth, $f_i(h, \ell)$ is denoted by $f_i$. ($f_i$ is a smoothed density estimate
of the $h_{ij}$ for the point $x_i$.)
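A kernel estimate of this kind can be sketched as follows. The Gaussian kernel, the
bandwidth value and the function names are assumptions made only for this example.
The sketch also divides by the number of fusion distances actually supplied, so that
the estimate integrates to one; this normalisation is a choice of the sketch rather
than a transcription of equation (7.1).

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(h, fusion_distances, bandwidth):
    """Kernel density estimate at h from the fusion distances of one point."""
    m = len(fusion_distances)
    total = sum(gaussian_kernel((h - hij) / bandwidth) for hij in fusion_distances)
    return total / (m * bandwidth)

# Hypothetical fusion distances for one local neighbourhood.
fusion_distances = [0.1, 0.2]
density_at_midpoint = kernel_density(0.15, fusion_distances, bandwidth=0.05)
```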
The probability density functions of the fusion distances may be compared using
a distance measure. For instance, the total variation distance [46, 108] can be used,
and its definition is presented as follows.
Definition 22 (Total variation distance). Let $f(x)$ and $g(x)$ be two probability
densities of random variables $X_1$ and $X_2$. The total variation distance between
$f(x)$ and $g(x)$ is given by
\[
d(f, g) = \sup_{B \subset \mathbb{R}} \left| \int_B f(x)\,dx - \int_B g(x)\,dx \right|
        = \frac{1}{2} \int_{\mathbb{R}} |f(x) - g(x)|\,dx. \tag{7.2}
\]
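The right-hand side of equation (7.2) can be approximated numerically. The sketch
below uses the midpoint rule on a finite interval; the integration limits, the number
of steps and the two uniform densities are illustrative assumptions.

```python
def total_variation(f, g, lower, upper, steps=10000):
    """Approximate 0.5 * integral of |f - g| over [lower, upper] (midpoint rule)."""
    width = (upper - lower) / steps
    total = 0.0
    for s in range(steps):
        x = lower + (s + 0.5) * width
        total += abs(f(x) - g(x))
    return 0.5 * total * width

def uniform_01(x):
    """Uniform density on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def uniform_shifted(x):
    """Uniform density on [0.5, 1.5]."""
    return 1.0 if 0.5 <= x <= 1.5 else 0.0

# The two densities overlap on half of their support.
distance = total_variation(uniform_01, uniform_shifted, -1.0, 3.0, steps=4000)
```

Identical densities give a distance of zero, and densities with disjoint supports
give the maximum distance of one.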
The total variation distances between each pair of probability densities $f_i$, $f_r$
of the fusion distances are computed, that is, $d(f_i, f_r)$, where
$i, r = 1, \ldots, n$. Let $D = [d(f_i, f_r)]$ be the total variation distance matrix.
Next, a chosen hierarchical clustering algorithm is applied to D to search for
clusters of the probability densities. In other words, the smoothed densities of the
fusion distances are considered as if they were units or points to be classified. Thus,
hierarchical clustering algorithms are applied to find similar groups of the probability
density functions of the fusion distances.
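Clustering the densities from a precomputed distance matrix can be sketched as
follows. Average Linkage is used only as an example, the function name is
hypothetical, and the small matrix is made-up data standing in for the total
variation distance matrix.

```python
def average_linkage_groups(D, g):
    """Merge clusters by Average Linkage on distance matrix D until g groups remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > g:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [D[i][r] for i in clusters[a] for r in clusters[b]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Label each unit (each density) with the number of its group.
    labels = [0] * len(D)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical distance matrix for four densities: two tight pairs.
D = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
labels = average_linkage_groups(D, g=2)
```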
It is now assumed that there may be $g$ clusters or groups of probability densities of
the fusion distances. Then, the mean of the local fusion distance functions is computed
for each group of probability densities,
\[
H_v(t) = \frac{1}{n_v} \sum_{s=1}^{n_v} H_s(t), \tag{7.3}
\]
where $n_v$ is the number of points in group $v$, $H_s(t)$ is the local fusion distance
function of the $s$th point in that group, $s = 1, \ldots, n_v$, $v = 1, \ldots, g$,
and $0 < g < n$.
A homogeneous Poisson point process with the same intensity and on the same
bounded region as the given point pattern is simulated. For each point of the
simulated Poisson process, the fusion distance function based on its $k$ nearest
neighbours is calculated, that is, $H_{\mathrm{Pois}}(t)^{(i)}$, where
$i = 1, \ldots, n$. Then, the mean of the local fusion distance functions for all the
points from the simulated Poisson process is given by
\[
H_{\mathrm{Pois}}(t) = \frac{1}{n} \sum_{i=1}^{n} H_{\mathrm{Pois}}(t)^{(i)}. \tag{7.4}
\]
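The simulation step and the averaging in equations (7.3) and (7.4) can be sketched
as follows. The sketch assumes that the region is the unit square, draws the number
of points by counting unit-rate exponential inter-arrival times, and treats each
local fusion distance function (Section 4.1.1) as a curve already evaluated on a
common grid of t values. The function names and the toy curves are hypothetical;
the intensity 195 corresponds to the 195 full redwoods points on the unit square.

```python
import random

def simulate_poisson_unit_square(intensity, rng):
    """Simulate a homogeneous Poisson process with the given intensity on [0, 1]^2."""
    # The number of points is Poisson(intensity * area): count the unit-rate
    # exponential inter-arrival times that fit into the interval [0, intensity].
    n, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= intensity:
        n += 1
        elapsed += rng.expovariate(1.0)
    return [(rng.random(), rng.random()) for _ in range(n)]

def pointwise_mean(curves):
    """Average several curves evaluated on a common grid of t values."""
    return [sum(values) / len(values) for values in zip(*curves)]

rng = random.Random(2003)
pattern = simulate_poisson_unit_square(195, rng)

# Two made-up local curves on a three-point grid, averaged pointwise.
mean_curve = pointwise_mean([[0.0, 0.5, 1.0], [0.0, 0.3, 0.8]])
```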
Finally, for each group, its mean (equation (7.3)) is plotted against the Poisson
mean (equation (7.4)) in the same plot. That is, the group means of the local
fusion distance functions are compared graphically with the mean of the local fusion
distance functions from the homogeneous Poisson process. If the group mean
(equation (7.3)) of the fusion distance function is above the identity line, this
suggests that this group consists of points clustered together.
Interpretation of local fusion distance function. As in the interpretation of
the fusion distance function for the exploratory data analysis (Section 4.3.1), if the
local fusion distance function for a given group (equation (7.3)) is substantially
above the identity line, then a locally clustered pattern is suggested.
If the local fusion distance function is considerably below the identity line,
then there is an indication that the point pattern is locally regular. If the local
fusion distance function is close to the identity line, then the pattern is consistent
with local randomness.
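The interpretation rule can be expressed as a simple comparison with the identity
line. The assessment in this chapter is graphical, so the numerical tolerance, the
use of the average vertical deviation and the function name below are all
assumptions made only for illustration.

```python
def interpret_local_pattern(group_mean, poisson_mean, tolerance=0.05):
    """Compare a group mean curve with the Poisson mean on a common grid of t."""
    # A point (poisson_mean, group_mean) lies above the identity line
    # exactly when group_mean exceeds poisson_mean.
    deviations = [hv - hp for hv, hp in zip(group_mean, poisson_mean)]
    average = sum(deviations) / len(deviations)
    if average > tolerance:
        return "locally clustered"
    if average < -tolerance:
        return "locally regular"
    return "locally random"

# Made-up curves: the group mean lies above the identity line.
poisson_mean = [0.0, 0.25, 0.5, 0.75, 1.0]
group_mean = [0.0, 0.5, 0.8, 0.95, 1.0]
verdict = interpret_local_pattern(group_mean, poisson_mean)
```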
7.2 Applications
The analysis of local configuration was applied to the datasets: full redwoods
(Section 6.3), Longleaf pines (Section 6.2) and Lansing woods (which is described
in Section 7.2.3). The results obtained from each dataset are presented as follows.
7.2.1 Application to full redwoods
The full redwoods dataset is described in Section 6.3 and the results of the
analysis of local configuration are shown below.
Results: kernel estimation. The kernel probability densities based on the 20 nearest
neighbours and on Single Linkage, Average Linkage, and Complete Linkage have similar
shapes, and suggest that fusion distances smaller than or equal to 0.05 are the most
probable for this dataset. (See the plots in Figure 7.1.)
Total variation distance. Figure 7.2 shows that, except for the Single Linkage
dendrogram, the dendrograms of the total variation distances from the kernel densities
are well structured and suggest the presence of spatial clustering in the dataset.
Observe that the dendrograms shown in Figure 7.2 are based only on the spatial
locations of the points, and do not use information about the dichotomy between
regions I and II.
Contingency tables. Table 7.1 shows the frequency counts of the full redwoods
points that belong to the two regions (I, II) and groups (1, 2) based on the total
variation distance. The results based on the Average Linkage and Complete Linkage
are successful in identifying the majority of points of the full redwoods that belong
to regions I and II (see the diagonal cells of the table).
               SL Group     AL Group     CL Group
Region         1     2      1     2      1     2
I             72     0     68     4     70     2
II           118     5      2   121     10   113
Table 7.1: Contingency tables of the full redwoods by regions (I,II) and groups (1,2).
Groups are based on total variation distances; 20 nearest neighbours; Single Linkage
(SL), Average Linkage (AL), Complete Linkage (CL).
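The agreement between regions and groups in Table 7.1 can be quantified by the share
of points on the diagonal of each table. This is a sketch: the function name is
hypothetical, and the reading assumes that group 1 corresponds to region I and
group 2 to region II, since the group labels produced by a clustering algorithm are
arbitrary.

```python
# Counts from Table 7.1: rows are regions I and II, columns are groups 1 and 2.
tables = {
    "SL": [[72, 0], [118, 5]],
    "AL": [[68, 4], [2, 121]],
    "CL": [[70, 2], [10, 113]],
}

def diagonal_share(table):
    """Proportion of points lying in the diagonal cells of a square table."""
    total = sum(sum(row) for row in table)
    diagonal = sum(table[i][i] for i in range(len(table)))
    return diagonal / total

shares = {name: diagonal_share(table) for name, table in tables.items()}
```

With these counts the diagonal share is about 0.39 for Single Linkage, 0.97 for
Average Linkage and 0.94 for Complete Linkage.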
[Figure 7.1: three panels (Single Linkage, Average Linkage, Complete Linkage);
horizontal axis: fusion distances; vertical axis: probability density function.]
Figure 7.1: Kernel probability densities of the fusion distances from the full redwoods
dataset based on the 20 nearest neighbours, Single Linkage, Average Linkage, and
Complete Linkage.
[Figure 7.2: three dendrogram panels (Single Linkage, Average Linkage, Complete
Linkage); vertical axis: total variation distances.]
Figure 7.2: Dendrograms of the total variation distances from kernel probability
densities of fusion distances for the full redwoods dataset based on the 20 nearest
neighbours, Single Linkage, Average Linkage, and Complete Linkage.
[Figure 7.3: three panels, from top to bottom: Single Linkage, Average Linkage,
Complete Linkage.]
Figure 7.3: Classification of points in the full redwoods dataset into two groups
(△, ◦) based on their local configuration (20 nearest neighbours, fusion distances,
kernel smoothing, total variation distance, hierarchical clustering: Single Linkage,
Average Linkage, and Complete Linkage).
[Figure 7.4: horizontal axis: estimated mean of H_{Pois}(t); vertical axis: estimated
mean of H_v(t); legend: group 1, group 2, identity line.]
Figure 7.4: Estimated group means of local fusion distance functions plotted against
the local fusion distance function from a homogeneous Poisson process with the same
intensity as the full redwoods; 20 nearest neighbours; Average Linkage.
Local configuration classification. The Single Linkage overclassifies the points of
region I (see the upper plot in Figure 7.3). This is an example of the chaining effect
(Section 2.4) and suggests that the clusters of the probability densities of the fusion
distances may not have a nucleus. Therefore, the Single Linkage performs poorly at
identifying the points that belong to the different regions of the full redwoods.
For instance, consider the Average Linkage dendrogram of total variation dis-
tances (see the central plot in Figure 7.3). First, a homogeneous Poisson process
with the same intensity as the full redwoods on the re-scaled unit square was simu-
lated, and the estimated mean (equation (7.4)) of the local fusion distance functions
was calculated. Second, the estimated means (equation (7.3)) of local fusion dis-
tance functions for the two groups were computed and compared with the estimated
mean of the local fusion distance function for the homogeneous Poisson process.
The local fusion distance function for group 1 is different from the local fusion
distance function for group 2. (See Figure 7.4.) The local fusion distance function
for group 1 suggests that there is a locally clustered pattern in the dataset.
Therefore, the local fusion distance function successfully discriminates between the
two different patterns of the full redwoods dataset. An equivalent conclusion can be
drawn for the Complete Linkage.
Figure 7.5: Longleaf pines trees are shown as circles; the diameter of each circle is
proportional to the maximum size of the tree's diameter at breast height. Adult trees
are plotted with larger circles; young trees are plotted with smaller circles. Source:
[85], data provided in [9].
The analysis of local configuration based on these three algorithms and k = 10
is very similar to that obtained from k = 20 nearest neighbours; consequently, the
results are not shown here. In conclusion, the analysis of local configuration using
the Average Linkage and Complete Linkage has successfully separated and identified
the majority of the redwoods trees that have different spatial neighbourhoods.
7.2.2 Application to Longleaf pines
The Longleaf pines dataset, classified into two types, young and adult trees, is
described in Section 6.2. This dataset is more complicated to analyse than the full
redwoods described in Section 6.3. First, the task of identifying and separating the
trees that belong to the two types based on their location is difficult. Second, the
young trees are close to the adult trees and, except for the size of the dbh, the types
cannot be distinguished (see Figure 7.5). Therefore, it is a challenge to analyse this
inhomogeneous dataset.
Results: kernel estimation. Figure 7.6 shows that the kernel densities based on
the 10 nearest neighbours, Single Linkage, Average Linkage, and Complete Linkage
have different shapes.
Total variation distance. Even though the structured Average Linkage and Complete
Linkage dendrograms may suggest that there are clusters in the dataset, the
disordered Single Linkage dendrogram is enough evidence that the Longleaf pines
dataset does not exhibit a very good separation into clusters. (See Figure 7.7.)
[Figure 7.6: three panels (Single Linkage, Average Linkage, Complete Linkage);
horizontal axis: fusion distances; vertical axis: probability density function.]
Figure 7.6: Kernel probability densities of fusion distances from the Longleaf pines
based on the 10 nearest neighbours; Single Linkage; Average Linkage; Complete
Linkage.
[Figure 7.7: three dendrogram panels (Single Linkage, Average Linkage, Complete
Linkage); vertical axis: total variation distances.]
Figure 7.7: Dendrograms of total variation distances from kernel probability densities
of fusion distances for the Longleaf pines dataset based on the 10 nearest neighbours;
Single Linkage; Average Linkage; Complete Linkage.
               SL Group     AL Group     CL Group
Type           1     2      1     2      1     2
young        306     7    214    99    260    53
adult        271     0    256    15    262     9
Table 7.2: Contingency tables of Longleaf pines by types (young,adult) and groups
(1,2). Types based on dbh of trees and groups based on total variation distances; 10
nearest neighbours; Single Linkage (SL); Average Linkage (AL); Complete Linkage
(CL).
Contingency tables. For instance, the Average Linkage panel of Table 7.2 shows that
it is not true that all young trees are in a different neighbourhood from the adult
trees. However, there may be a substantial number of young trees that are packed
together (the 99 young trees which were classified into group 2). In other words, some
young trees are growing in tight clusters.
Local configuration classification. Figure 7.8 (upper) shows that the Single Linkage
classifies the majority of the trees of Longleaf pines as the young type. As with
the result of this algorithm applied to the full redwoods, the chaining effect
suggests that the clusters of kernel densities of fusion distances may not have a
nucleus. Therefore, the Single Linkage has pointed out that this dataset may not be
well separated into clusters.
Even though there is no strong evidence for clusters, let us consider, for example,
the Average Linkage dendrogram (see the central plot in Figure 7.8). First, a
homogeneous Poisson process with the same intensity as the Longleaf pines dataset
on the 200 m sided square was simulated and the mean (equation (7.4)) of the
local fusion distance functions was estimated. Next, the estimated group means
(equation (7.3)) of the local fusion distance functions were computed, and compared
with the Poisson mean (equation (7.4)).
Figure 7.9 shows that the local fusion distance functions from groups 1 and 2 are
different. This is also confirmed by the barplot of the relative frequency
of dbh for both groups, which is plotted in Figure 7.10. The local fusion distance
function for group 2 suggests that there is a locally clustered pattern in the dataset.
Therefore, the local fusion distance function using the Average Linkage successfully
identifies a small pocket of young trees that are clustered together. In conclusion,
the results of the local configuration applied to the Longleaf pines are very good,
considering that this point pattern is indeed a much more challenging dataset to
analyse than the full redwoods.
[Figure 7.8: three panels, from top to bottom: Single Linkage, Average Linkage,
Complete Linkage.]
Figure 7.8: Classification of points in the Longleaf pines dataset into two groups
(△, ◦) based on their local configuration (10 nearest neighbours, fusion distances,
kernel smoothing, total variation distance, hierarchical clustering: Single Linkage,
Average Linkage, and Complete Linkage).
[Figure 7.9: horizontal axis: estimated mean of H_{Pois}(t); vertical axis: estimated
mean of H_v(t); legend: group 1, group 2, identity line.]
Figure 7.9: Estimated group means of local fusion distance functions plotted against
the mean of local fusion distance functions from a homogeneous Poisson process with
the same intensity as the Longleaf pines; 10 nearest neighbours; Average Linkage.
[Figure 7.10: barplot; horizontal axis: dbh classes 0-20, 20-40, 40-60, 60-80;
vertical axis: relative frequency; legend: group 1, group 2.]
Figure 7.10: Relative frequency barplot of dbh from Longleaf pines classified into two
groups; 10 nearest neighbours, Average Linkage.
7.2.3 Application to Lansing woods
Figure 7.11 shows the Lansing woods dataset introduced by [43]. The data record the
location and botanical classification of 2251 trees on a 924 ft x 924 ft (19.6 acre)
block of Lansing Woods, in Clinton County, Michigan, USA. The original block size was
re-scaled to the unit square.
Figure 7.11: Lansing woods dataset with six types: black oak (◦), hickory (△),
maple (+), miscellaneous (×), red oak (◇), white oak (▽). Source: [43], data
provided in [9].
The botanical classification of the trees into species is: hickory, maple, red oak,
white oak, black oak and miscellaneous. For details on the dataset and its analysis,
see [43, 30]. Figure 7.12 shows the trees' locations plotted individually by their
types. The plots are ordered according to the frequency of points in each type, that
is, from the largest to the smallest. (The symbol encoding is the same as in
Figure 7.11.)
The Longleaf pines and Lansing woods datasets are similar with respect to the
locations of the trees. It is also noticeable that the Lansing woods trees are closer
together than the Longleaf pines. Visually, it may be an impossible task to identify
and to separate different neighbourhoods, even though the species of the trees are
known.
[Figure 7.12: six panels: Hickory, Maple, White oak, Red oak, Black oak,
Miscellaneous.]
Figure 7.12: Lansing woods dataset with six types plotted individually. The plots are
ordered according to the descending frequency of points in each type. Hickory (△),
maple (+), white oak (▽), red oak (◇), black oak (◦), miscellaneous (×). The symbol
encoding in this figure is the same as in Figure 7.11.
                      Group
Type            1     2     3     4     5     6
Hickory       283   299    24    24    64     9
Maple         239   186     6     2    80     1
White oak     170   197     6    17    49     9
Red oak       141   146     7     5    45     2
Black oak      56    58     2     8    11     0
Misc           43    40     4     0    17     1
Table 7.3: Contingency table of Lansing woods classified into six botanical types
(hickory, maple, white oak, red oak, black oak, misc.), and six groups which are
based on total variation distances, 20 nearest neighbours, Average Linkage.
Results. The Single Linkage dendrogram applied to the kernel densities of the
fusion distances from the Lansing woods dataset has a disordered structure similar
to the Single Linkage dendrogram applied to the Longleaf pines dataset (see the upper
plot in Figure 7.7). Thus, the disordered dendrogram indicates that the dataset
exhibits a poor separation into clusters. (The equivalent conclusion was also drawn
from the Single Linkage dendrogram applied to the Longleaf pines in Section 7.2.2.)
Moreover, the results of the analysis of local configuration based on the 20 near-
est neighbours, Average Linkage, and Complete Linkage are alike. Therefore, a
summary of these results, such as the contingency table and local configuration
classification of the Lansing woods with six types based on the Average Linkage, is
presented next.
Contingency table and local configuration classification. As in the analysis
of the Longleaf pines, we consider, for instance, the Average Linkage dendrogram of
the total variation distances based on the 20 nearest neighbours. The plots shown
in Figure 7.13 are ordered according to the descending frequency of points in each
group (from the largest to the smallest). That is, group 1 has the largest frequency,
followed by groups 2, 5, 4, 3 and 6 (the smallest frequency).
Next, the Poisson process with the same intensity as the Lansing woods dataset
on the re-scaled unit square was simulated and the mean (equation (7.4)) of the local
fusion distance function was estimated. The six estimated group means (equation
(7.3)) of the local fusion distance functions were then computed and compared with
the estimated Poisson mean of the local fusion distance functions.
The upper plot in Figure 7.14 shows that the local fusion distance functions
of the groups are not substantially above the identity line. Therefore, the local
fusion distance function suggests that there may not be groups in the Lansing woods
dataset.
[Figure 7.13: six panels: Spatial group 1, Spatial group 2, Spatial group 5,
Spatial group 4, Spatial group 3, Spatial group 6.]
Figure 7.13: Local configuration classification based on total variation distances from
the Lansing woods dataset. The ordered plots are according to the descending fre-
quency of points in each group; 20 nearest neighbours; Average Linkage.
[Figure 7.14: two panels; horizontal axis: estimated mean of H_{Pois}(t); vertical
axis: estimated mean of H_v(t); legend (upper): groups 1 to 6 and identity line;
legend (lower): groups 1 to 4 and identity line.]
Figure 7.14: Estimated group means of local fusion distance functions plotted against
the means of local fusion distance functions from homogeneous Poisson processes
with the same intensity as the Lansing woods dataset: six groups (upper), four
groups (lower), 20 nearest neighbours, Average Linkage.
Further classification. Figure 7.13 suggests that the Lansing woods dataset might
be classified into fewer groups. Thus, the four-group classification is now considered.
The Average Linkage dendrogram from the total variation distances is cut at g = 4
groups and the mean (equation (7.3)) of the local fusion distance functions for each
new group is also computed.
For the Lansing woods dataset, the classification into g = 6 groups is a refinement
of the four-group classification. This nesting is a useful property of hierarchical
clustering algorithms.
A homogeneous Poisson process, with the same estimated intensity as the Lans-
ing woods dataset on the re-scaled unit square, is simulated and the mean (equa-
tion (7.4)) of the local fusion distance functions is estimated. The estimated group
means (equation (7.3)) of the local fusion distance functions are then compared with
the Poisson mean (equation (7.4)), graphically.
The lower plot in Figure 7.14 suggests two different types of patterns. First, the
local fusion distance function from group 1 is slightly below the identity line.
Second, the local fusion distance function from group 2 is considerably
above the identity line. Thus, the results of the local fusion distance function for the
Lansing woods dataset classified into four groups (based on the 20 nearest neighbours
and Average Linkage) suggest that this dataset may not have a clear separation into
local clusters, except for groups 1 and 2.
In addition to the results of the local fusion distance, we may ask whether there
is any relationship between the botanical and local configuration classifications. To
answer this question, a formal test, the Pearson χ² test of independence [56, Chapter
13], is performed. The null hypothesis is that the botanical classification into three
types, namely “oak” (white, red, and black oak), hickory, and grouped “maple and
miscellaneous”, is independent of the four-group classification.
This is tested against the dependence of these two classifications.
Table 7.4 shows the contingency table and the Pearson residuals for the Lansing
woods dataset classified into three botanical types and four groups.
The computed statistic is χ² = 32.56 with 6 degrees of freedom, and the tabulated
upper 5% point of χ² is 12.59 [56, page 667], so that the null hypothesis of
independence is rejected at α = 0.05. It would also be rejected at α = 0.01. (The
p-value is 1.274 × 10^{-5}.) Therefore, the two classifications are not independent.
(Table 7.4 shows that there are significantly fewer hickory trees in group 2 than
would be expected under independence. There is also a significantly higher number of
these trees in group 3. The botanical explanation for this finding is unknown to us.)
Contingency table (expected counts under independence in parentheses):
                                       Group
Type                          1          2          3          4
Oak                         606         24         64          9
                      (595.567)   (17.489)   (83.073)    (6.871)
Hickory                     518          2         97          2
                      (524.404)   (15.399)   (73.147)    (6.050)
Maple & miscellaneous       783         30        105         11
                      (787.029)   (23.112)  (109.780)    (9.080)
Pearson residuals:
                                       Group
Type                          1          2          3          4
Oak                       0.428      1.557     -2.093      0.812
Hickory                  -0.280     -3.415      2.789     -1.646
Maple & miscellaneous    -0.144      1.433     -0.456      0.637
Table 7.4: Upper: contingency table of Lansing woods dataset by three botanical types
and four groups. Groups based on total variation distances, 20 nearest neighbours,
Average Linkage. Lower: Pearson residuals calculated from the contingency table.
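The Pearson test applied to Table 7.4 can be reproduced with a short sketch. The
function names are hypothetical, the row labels follow Table 7.4, and the upper-tail
probability uses the closed form of the χ² distribution that is valid for an even
number of degrees of freedom only.

```python
import math

# Observed counts from Table 7.4 (rows: types, columns: groups 1 to 4).
observed = [
    [606, 24, 64, 9],     # Oak
    [518, 2, 97, 2],      # Hickory
    [783, 30, 105, 11],   # Maple & miscellaneous
]

def pearson_chi_square(table):
    """Return the statistic, degrees of freedom, expected counts and residuals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [[(o - e) / math.sqrt(e) for o, e in zip(orow, erow)]
                 for orow, erow in zip(table, expected)]
    statistic = sum(res ** 2 for row in residuals for res in row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected, residuals

def chi_square_upper_tail(x, df):
    """Upper-tail probability of the chi-square distribution (even df only)."""
    if df % 2 != 0:
        raise ValueError("closed form implemented for even df only")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

statistic, df, expected, residuals = pearson_chi_square(observed)
p_value = chi_square_upper_tail(statistic, df)
```

The statistic, degrees of freedom, expected counts and residuals obtained in this way
agree with the values reported in the text and in Table 7.4 up to rounding.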
In summary, the results of the analysis of local configuration based on the 20
nearest neighbours, Single Linkage, Average Linkage and Complete Linkage indicate
that there may not be clusters of trees in the Lansing woods dataset. Despite this,
the results obtained from the Average Linkage suggest the presence of two different
spatial sub-patterns for the Lansing woods dataset.
Conclusion. The local configuration strategy was applied to three point patterns:
full redwoods, Longleaf pines and Lansing woods. The spatial classifications of the
latter two might appear to disagree with their biological/botanical classifications.
However, the strategy successfully identified the different spatial textures
of all three point patterns.
CHAPTER 8
Analysis of Brazilian trees point pattern
This chapter studies a large point pattern classified into fifty-six botanical species,
seven botanical subclasses and three botanical classes. The spatial dataset was
kindly provided by Dr. Meirelles and Mr. Luiz in 1999, and was named the “Brazil-
ian trees dataset”.
Exploratory data analysis and inference based on the traditional summary func-
tions: empty space F , nearest neighbour distance G, Van Lieshout and Baddeley’s
J , reduced second moment K, and mark correlation ρ [82, 104] are applied to the
Brazilian trees point pattern. Next, the complementary analysis using the new
strategies (Chapters 4 and 6) is applied to the Brazilian trees dataset. The study is
based on the fusion distance function (Section 4.1.1), area statistic (Section 4.1.2),
S statistic (Section 6.2) and spatial Rg index (Section 6.3).
This chapter also presents the analysis of the local configuration (Section 7.1)
applied to the Brazilian trees dataset (Section 8.1). The analysis is based on the
20 nearest neighbours, Single Linkage, Average Linkage and Complete Linkage al-
gorithms. The Brazilian trees point pattern is introduced as follows.
8.1 Brazilian trees point pattern
Data collection and preparation. The Brazilian trees dataset was collected by
Meirelles and Martins in 1979, on the ecological reserve of the Federal University
of Brasília, named Água Limpa farm, in Brasília, DF, Brazil. All trees in one
hectare were mapped and, to the best of our knowledge, the trees were a natural
stand of native species [75]. The sampled area was 100 m x 100 m. The
data record the tree number, quadrat number, species number, location, height (in
metres) and dbh (in metres). The dbh was measured at 0.3 m above ground level.
Parts of the dataset were analysed by Meirelles and Luiz [64, 74]. For instance,
Meirelles and Luiz [74] examined the 18 most dominant species. The species Byr-
sonima coccolobifolia and Aspidosperma tomentosum were classified as random and
the 16 remaining species were classified as clustered. Meirelles and Luiz used the
Morisita Index [76] and Dispersion Index [63] to classify the species. Another in-
vestigation was made by Luiz [64], where six species were studied in his Master’s
thesis. For more information on the dataset and its analysis, see [64, 74].
In our data preparation, a few minor inconsistencies were found. First, the
species Ouratea acuminata (species 32) was published as Ouratea hexasperma
in Table 1 of [74, page 187]. However, the name of this species was corrected by
[73]. Second, Meirelles and Luiz [64, 74] stated that their dataset had 1122 trees
classified into 56 different species. Nevertheless, in the dataset presented here the
species Kielmeyera coriaceae (species 25 in Table 8.1) was missing. Meirelles and
Luiz acknowledged the absence of this species [75]. Finally, a few discrepancies in
the measurements of the trees' height and dbh were found. The measurements were
not typical of trees on a Brazilian savanna or grassland, that is, some were too
small and others were too large for typical trees from the central region of Brazil.
Therefore, Meirelles and Luiz also corrected the unusual measurements [75].
The Brazilian trees dataset may have other inconsistencies that have not been
identified. However, to the best of our knowledge, the minor inconsistencies found
were corrected. The author is very grateful to Dr. Meirelles and Mr. Luiz, for the
provided dataset and corrections.
The dataset. Figure 8.1 shows the Brazilian trees dataset: the locations of 1122
Brazilian trees on a 100 m square in the reserve of Água Limpa farm, DF, Brazil.
The original block size was re-scaled to the unit square.
Figure 8.1: Locations of 1122 trees on a 100 m square on a grassland, in the reserve
of Água Limpa farm, DF, Brazil. The original block size was re-scaled to the unit
square. Source: Meirelles and Martin (1999).
Botanical plant systematics. The currently accepted botanical classification of
each tree species into genus, species, family, order, subclass and class was extracted
by the author from [15, 115]. For the complete plant systematics of the Brazilian
trees dataset, see Table C.3 in Appendix C.
Fifty-six botanical species. The tree species ranked in order of frequency are
presented in Table 8.1. The most frequent species is Ouratea acuminata, a photograph
of which is shown in Figure 1. The picture was downloaded from [77] in February
2003.
Some comments on the botanical nomenclature for Table 8.1 are as follows. There
were some species that were not identified by Meirelles and Martins. For instance,
for species 16, Siagrus sp., “Siagrus” was the identified genus but the species was
unknown. Another example is species 6, Myrtaceae fm. The genus and species of the
tree were not identified but its family was identified as the “Myrtaceae”. Another
example is species 17: this species was not identified, so it was named “IND.453”.
Similar remarks apply to the other species on the sampled area that are listed as not
identified in Table 8.1 and in Table C.3 (Appendix C).
Seven botanical subclasses. The plant systematics of the Brazilian trees dataset
into seven subclasses, ranked in order of frequency, is presented in Table 8.2. The
dataset classified into seven subclasses is plotted in Figure 8.2.
Figure 8.2: Brazilian trees dataset classified by the seven botanical subclasses:
Arecidae (▽); Asteridae (£); Dilleniidae (+); Hamamelidae (×); Liliidae (◇);
Miscellaneous (△); and Rosidae (◦).
Frequency Species number Name
293 32 Ouratea acuminata
65 41 Qualea grandiflora
64 43 Qualea parviflora
57 47 Sclerolobium aureum
48 16 Siagrus sp.
43 10 Caryocar brasiliense
43 45 Roupala montana
41 21 Erythroxylum tortuosum
36 31 Myrcia sp.
29 54 Vellozia sp.
26 9 Byrsonima sp.
25 26 Lafaensia pacari
25 46 Salacia crassifolia
20 2 Aspidosperma tomentosum
19 11 Connarus fulvus
17 7 Byrsonima coccolobifolia
17 29 Miconia sp.
17 42 Qualea multiflora
16 20 Erythroxylum suberosum
15 8 Byrsonima crassa
14 13 Dalbergia vidacea
14 15 Didymopanax macrocarpum
13 1 Aspidosperma macrocarpum
13 23 Butia sp.
13 27 Palmeira sp.
12 5 Bowdichia virgiloides
10 28 Miconia ferruginata
9 40 Pterodon pubescens
8 44 Rapanea guyanensis
7 3 Bombax gracilipes
7 6 Myrtaceae fm.
7 18 Enterolobium ellipticum
7 35 Piptocarpha rotundifolia
6 52 IND. 192
5 4 Bombax tomentosum
5 12 Copaifera langsdorfii
5 30 Mimosa claussenii
5 56 Platimenia reticulata
4 14 Davilla elliptica
4 19 Eremanthus
4 24 Hymenaea stillocarpa
4 38 Pouteria ramiflora
4 51 Symplocos revoluta
3 22 IND.445
3 48 Stryphnodendron sp.
3 49 Styrax ferrugineus
3 50 Sweetia dasycarpa
3 55 Vochysia elliptica
2 33 Palicourea rigida
2 36 Vochysia rufa
2 39 Vochysia thyrsoidea
2 53 Strychnos sp.
1 17 IND.453
1 34 IND. 443
1 37 Plenckia populosea
0 25 Kielmeyera coriaceae
Table 8.1: The tree species ranked in order of frequency in the Brazilian trees dataset.
Frequency  Species number                                  Subclass
553        5, 6, 7, 8, 9, 11, 12, 13, 15, 18, 20, 21, 24,  Rosidae
           26, 28, 29, 30, 36, 37, 39, 40, 41, 42, 43,
           45, 46, 47, 48, 50, 55, 56
371        3, 4, 10, 14, 32, 38, 44, 49, 51                Dilleniidae
59         16, 17, 22, 34, 52                              Miscellaneous
48         1, 2, 19, 33, 35, 53                            Asteridae
36         31                                              Hamamelidae
29         54                                              Liliidae
26         23, 27                                          Arecidae

Frequency  Subclass                                        Class
1008       Asteridae, Dilleniidae, Hamamelidae, Rosidae    Magnoliopsida
59         Miscellaneous                                   Others
55         Arecidae, Liliidae                              Liliopsida

Frequency  Class                    Type
1008       Magnoliopsida            1
114        Liliopsida and Others    2
Table 8.2: The tree subclasses (upper), classes (centre), and types (lower) ranked
in order of frequency in the Brazilian dataset. Species numbers are shown in Table 8.1.
Figure 8.3: Left: Brazilian trees dataset classified by the three botanical classes:
Magnoliopsida (◦), Liliopsida (◇), and Others (△). Right: the dataset classified by
type 1 (◦) and type 2 (△).
Three botanical classes The three botanical classes of the Brazilian trees dataset
are Magnoliopsida, Liliopsida and others (the trees that were not identified). The
left plot in Figure 8.3 shows the classified dataset and Table 8.2 presents the classes
ranked in order of frequency.
Next, the Brazilian trees dataset is analysed using the standard summary functions
F, G, J, K and ρ (the mark correlation function [104]). The new strategies
developed in Chapters 4 and 6 are also applied to the dataset: the fusion distance
function (Section 4.1.1), area statistic (Section 4.1.2), S statistic (Section 6.2) and
spatial Rg index (Section 6.3). The Brazilian trees dataset is
regarded first as an example of a univariate point pattern, and then as an example
of a multivariate point pattern. The spatial statistics were computed using
R Version 1.5.1 (R Development Core Team) [51] and the Spatstat library
Version 1.3-2 [9], on a Pentium 4 (1.8 GHz).
8.2 Analysis of univariate Brazilian trees dataset
Let us assume that the Brazilian trees dataset is a realisation of a univariate
point pattern. Its attributes: species, heights, and dbh are analysed descriptively.
The mark correlation function is then presented, and estimated for each one of the
attributes of the univariate Brazilian trees dataset.
Species Figure 8.4 shows barplots of the frequencies of the ranked species of the
Brazilian trees dataset. The plots reveal an interesting but unexplained feature:
the frequency of a species decays approximately exponentially with its rank, that is,
log(frequency) ≈ a + b × rank, where b < 0.
Thus
frequency ≈ A exp(−B × rank), where A = exp(a) and B = −b > 0.
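The log-linear relation above can be checked with an ordinary least-squares fit of log(frequency) on rank. The following is a minimal sketch only: the function name `fit_log_linear` and the example frequencies are hypothetical and chosen for illustration; they are not the counts of Table 8.1, and the thesis does not state how (or whether) a and b were estimated.

```python
import math

def fit_log_linear(frequencies):
    """Least-squares fit of log(frequency) = a + b * rank.

    `frequencies` are positive counts in decreasing order; ranks are 1, 2, ...
    Species with frequency 0 must be dropped first, since log(0) is undefined.
    """
    ranks = list(range(1, len(frequencies) + 1))
    logs = [math.log(freq) for freq in frequencies]
    n = len(ranks)
    mean_rank = sum(ranks) / n
    mean_log = sum(logs) / n
    sxx = sum((r - mean_rank) ** 2 for r in ranks)
    sxy = sum((r - mean_rank) * (y - mean_log) for r, y in zip(ranks, logs))
    b = sxy / sxx                    # slope: negative for exponential decay
    a = mean_log - b * mean_rank     # intercept
    return a, b

# Hypothetical frequencies that follow the model exactly (A = 300, B = 0.1);
# they are NOT the counts of Table 8.1.
example = [300.0 * math.exp(-0.1 * rank) for rank in range(1, 21)]
a, b = fit_log_linear(example)
A, B = math.exp(a), -b
print(f"A = {A:.1f}, B = {B:.3f}")
```

Applied to real counts, a roughly straight barplot of log(frequency) against rank (as in the right panel of Figure 8.4) corresponds to residuals from this fit that show no systematic curvature.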
Height Table C.1, in appendix C, suggests that the values of the height are
quoted to the nearest 0.1 metre in the range 0–8 metres, and to the nearest 1 metre
in the range 8–26 metres. However, the values 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5 and 7 are
very frequent, suggesting that the height has often been guessed to the nearest half
metre only. The plots in Figure 8.5 suggest that there is no indication of spatial
trend in height.
[Plots for Figure 8.4. x-axes: Species (in rank order); y-axes: Frequency (left), Log(frequency) (right).]
Figure 8.4: Barplots of the frequencies of ranked species of Brazilian trees: frequency
(left), logarithm of frequency (right).
Dbh The information shown in Table C.2, in appendix C, suggests that most of
the values of the diameter at breast height are quoted to the nearest 0.5 metre. The
scatter plots shown in Figure 8.6 suggest that there is no spatial trend in diameter
at breast height.
Species, height and dbh The ten most frequent species of the Brazilian trees
dataset are plotted against heights and diameters at breast height (dbh) in Fig-
ure 8.7. The first and second tallest species are the Myrcia and Qualea parviflora
(species 31 and 43 in Table 8.1), respectively. Trees with the two largest dbh be-
long to species Myrcia and Caryocar brasiliense (species 31 and 10 in Table 8.1).
Figure 8.8 shows that there is correlation between dbh and height.
Mark correlation function Let $X$ be a simple marked point process on $\mathbb{R}^d$.
The mark correlation function is a measure of the dependence between
the marks of two points of the process a distance $r$ apart, where $r > 0$. The following
definition of the mark correlation function is quoted from Stoyan and Stoyan [104,
page 263]. Further details on the definition, properties, estimation and application of
the mark correlation function are also reported in [82, 101, 100, 104, 93].
Definition 23 (Mark correlation function). Let $f(m', m'')$ be an arbitrary (non-negative)
measurable function on $\mathbb{R}^2$ depending on the marks $m'$ and $m''$ of two
[Plots for Figure 8.5. Axes: Height against Frequency (histogram); X-coordinate against Height; Y-coordinate against Height.]
Figure 8.5: Heights of the Brazilian trees dataset: histogram (upper), scatter plots
of heights against x and y (centre, lower), respectively.
[Plots for Figure 8.6. Axes: Diameter at breast height against Frequency (histogram); X-coordinate against Diameter at breast height; Y-coordinate against Diameter at breast height.]
Figure 8.6: Dbh of the Brazilian trees dataset: histogram (upper); scatter plots of
dbh against x and y (centre, lower), respectively.
[Plots for Figure 8.7. x-axes: Species (10, 16, 21, 31, 32, 41, 43, 45, 47, 54); y-axes: Height (upper), Dbh (lower).]
Figure 8.7: Ten most frequent species of the Brazilian trees dataset plotted against
height (upper) and dbh (lower). Numbers plotted on the x-axis are species numbers
shown in Table 8.1.
[Plots for Figure 8.8. x-axes: Dbh; y-axes: Height; plotted symbols are species numbers.]
Figure 8.8: Dbh of the Brazilian trees dataset plotted against the height: 56 species
(upper); ten most frequent species (lower). Numbers inside the plots are species
numbers shown in Table 8.1.
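points $x'$ and $x''$. The measure $\alpha_f^{(2)}$ on $\mathbb{R}^{2d}$ is defined by
\[
\alpha_f^{(2)}(B_1 \times B_2) = E\Biggl[ \sum_{[x_1; m_1] \in X} \; \sum_{\substack{[x_2; m_2] \in X \\ x_1 \neq x_2}} f(m_1, m_2)\, 1_{B_1}(x_1)\, 1_{B_2}(x_2) \Biggr].
\]
The summation is over all pairs $[x_1; m_1]$, $[x_2; m_2]$ of marked points of $X$ in $B_1$ and
$B_2$, where $x_1 \neq x_2$ and $B_1$, $B_2$ are Borel sets of $\mathbb{R}^d$. Then, assuming continuity, there
is a density function $\varrho_f^{(2)}(x_1, x_2)$ for $\alpha_f^{(2)}$, which is called the "$f$-product density".
For instance, if $f \equiv 1$ then $\alpha_f^{(2)}(B_1 \times B_2) = E[N(B_1) N(B_2)]$ for disjoint $B_1$ and $B_2$,
that is, $\alpha_f^{(2)}$ is the second factorial moment measure of $X$. The quotient
\[
\kappa_f(x_1, x_2) = \frac{\varrho_f^{(2)}(x_1, x_2)}{\varrho^{(2)}(x_1, x_2)}, \quad \text{where } \varrho^{(2)}(x_1, x_2) \neq 0,
\]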
can be interpreted as a conditional mean, namely as the mean of $f(M_1, M_2)$, given
that there is a point of the point process at both locations $x_1$ and $x_2$, where $M_1$ and
$M_2$ denote the marks of $x_1$ and $x_2$, respectively. If the point process is stationary
and isotropic then $\kappa_f(x_1, x_2)$ depends only on $\|x_1 - x_2\|$ and we usually write $\kappa_f(r)$,
$r > 0$. This function describes the correlation between marks. To give $\kappa_f(r)$ more
of the character of a correlation function, it is normalised. The mark correlation
function is defined by
\[
\rho_f(r) = \frac{\kappa_f(r)}{\kappa_f(\infty)},
\]
where $\kappa_f(\infty) = E[f(M, M')]$ and $M$, $M'$ are independent samples from the marginal
distribution of marks. Thus, roughly speaking,
\[
\rho_f(r) = \frac{E[f(M_1, M_2)]}{E[f(M, M')]},
\]
where $M_1$, $M_2$ are the marks attached to two points of the process separated by a
distance $r$, while $M$, $M'$ are independent realisations from the marginal distribution
of marks. Note that $f$ is any function $f(m_1, m_2)$ with two arguments that are
possible marks of the point pattern, and which returns a nonnegative real value.
The mark correlation function is not a correlation function in the usual statistical
sense because this function can take any nonnegative real value. The value 1 suggests
lack of correlation. If the marks attached to the points of $X$ are i.i.d. then $\rho_f(r) \equiv 1$.
The interpretation of values larger or smaller than 1 will depend on the choice of
the function $f$.
For the height and dbh of the Brazilian trees dataset, the function $f$ of the mark
correlation function is defined by $f(m_1, m_2) = m_1 m_2$ because these attributes are
continuous real-valued marks. Thus
\[
\rho_f(r) = \frac{E[M_1 M_2]}{E[M M']} = \frac{\mathrm{cov}(M_1, M_2)}{E[M]\,E[M']} + 1
\]
since $M$, $M'$ are independent (so that $E[M M'] = E[M]\,E[M']$). In this case, the mark
correlation function $\rho_f(r)$ is
a re-scaled version of the covariance function of the marks at two points separated
by a distance $r$. If the marks are i.i.d. then $\rho_f(r) \equiv 1$, whereas $\rho_f(r) > 1$ suggests
positive association and $\rho_f(r) < 1$ indicates negative association.
For the species of the Brazilian trees, which is a discrete mark, the function $f$
is defined by $f(m_1, m_2) = 1\{m_1 = m_2\}$, where $1\{\cdot\}$ denotes the indicator function.
Therefore,
\[
\rho_f(r) = \frac{P(M_1 = M_2)}{P(M = M')},
\]
where $M$, $M'$ are independent with the same mark distribution. Analogous to the
interpretation of $\rho_f$ for continuous marks, if discrete marks are i.i.d. then $\rho_f(r) \equiv 1$,
whereas $\rho_f(r) > 1$ indicates positive association and $\rho_f(r) < 1$ suggests negative
association.
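To make the two choices of f concrete, the following minimal sketch computes a naive estimate of the mark correlation function at a single distance r. It is an illustration under stated assumptions, not the estimator used in this chapter: it ignores edge effects (no translation correction), replaces kernel smoothing by a simple distance band of half-width `half_width`, and estimates E[f(M, M′)] by averaging f over all ordered pairs of distinct points. The function name, the band width and the toy pattern are all hypothetical.

```python
import math

def mark_correlation(points, marks, f, r, half_width):
    """Naive estimate of rho_f(r), without edge correction.

    Numerator: mean of f over ordered pairs whose distance is within
    `half_width` of r. Denominator: mean of f over all ordered pairs of
    distinct points, standing in for E[f(M, M')] with independent marks.
    """
    near_sum = near_count = 0
    all_sum = all_count = 0
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            value = f(marks[i], marks[j])
            all_sum += value
            all_count += 1
            if abs(math.dist(points[i], points[j]) - r) <= half_width:
                near_sum += value
                near_count += 1
    if near_count == 0 or all_sum == 0:
        return float("nan")   # no pairs near r, or degenerate denominator
    return (near_sum / near_count) / (all_sum / all_count)

def product(m1, m2):          # continuous marks (e.g. height, dbh)
    return m1 * m2

def same_mark(m1, m2):        # discrete marks (e.g. species)
    return 1 if m1 == m2 else 0

# Toy pattern (hypothetical): three tight pairs; each pair shares a mark.
toy_points = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0), (21, 0)]
toy_marks = [1, 1, 2, 2, 3, 3]
rho_species = mark_correlation(toy_points, toy_marks, same_mark, r=1.0, half_width=0.1)
rho_product = mark_correlation(toy_points, toy_marks, product, r=1.0, half_width=0.1)
```

In the toy pattern every pair of points one unit apart carries the same mark, so both estimates exceed 1, which is the "positive association" reading described above.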
The sampling window of the Brazilian trees dataset was restricted because
Spatstat [9] was unable to handle the entire dataset owing to insufficient
memory. Thus, the sampling window considered was [20, 80] × [20, 80] metres.
Note that Stoyan and Stoyan [104, page 292], in Figures 124 and 125, plotted the
estimated mark correlation functions against r for r ∈ [0, 28], where
28 mm was about one quarter of the shortest side of the sampling window. For the
Brazilian trees dataset, the estimated mark correlation functions using the translation
correction [81] were likewise plotted against r for r ∈ [0, 15], where 15 m
is equal to one quarter of the shortest side of the restricted window.
The left plot in Figure 8.9 shows the estimated mark correlation function from
the heights of the Brazilian trees. The mark correlation function suggests a positive
association at small distances (r ≤ 3). This positive association indicates that young
trees are clustered together. A positive correlation at small distances (r ≤ 3) is also
noticeable for the estimated mark correlation function from the species (see the right
plot in Figure 8.9), suggesting that neighbouring trees tend to be of the same species
more frequently than would be expected if the species were allocated at random.
However, the central plot in Figure 8.9 suggests that the dbh values of the trees are
independent at distances greater than 1 m.
[Plots for Figure 8.9. Panels: Height, Dbh, Species; x-axes: r; y-axes: trans, theo.]
Figure 8.9: Estimated mark correlation functions for the height, dbh and species from
the Brazilian trees dataset. Solid lines: the estimated mark correlation function using
the translation correction [81]; dashed lines: the line y = 1, which represents independence
of the marks.
8.3 Analysis of Multivariate Brazilian trees dataset
The Brazilian trees dataset is now regarded as a multivariate point pattern with
fifty-six types. In theory, the spatial analysis of such a dataset is possible, but in
practice it is prohibitive to carry out this analysis. The methods available in the
spatial statistics literature work very well for datasets that have at most two or three
different types. Thus, a feasible option is to analyse the dataset classified into fewer
types using the summary functions F, G, J and K.
Henceforth, the estimators of the functions (F , G, J) are calculated using the
Kaplan-Meier estimators [7], denoted by “km”, and the estimator of the K-function
is computed using the translation correction [81], denoted by “trans”. Moreover,
the J-function of the theoretical homogeneous Poisson process is denoted by “theo”.
(Our notation for “km, trans, and theo” will appear on the y-axis of the plots
presented in the next subsections.)
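For reference when reading the "theo" curves, the benchmark values of the four summary functions under a homogeneous planar Poisson process of intensity λ are collected below. These are standard results for d = 2 and are stated here as a reading aid; the symbol λ for the intensity follows the notation used later in Section 8.4.1.

```latex
\[
F(r) = G(r) = 1 - \exp(-\lambda \pi r^2), \qquad
J(r) = \frac{1 - G(r)}{1 - F(r)} = 1, \qquad
K(r) = \pi r^2, \qquad r \ge 0 .
\]
```

Relative to these benchmarks, an estimated G or K above the theoretical curve, or an estimated J below 1, points towards clustering, while deviations in the opposite direction point towards regularity; for F the directions are reversed.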
Three most frequent species The plots in Figure 8.10 show the locations of the
three most frequent species of the Brazilian trees classified into Ouratea, Qualea
and Others. The Ouratea acuminata species has 293 trees, followed by the Qualea
sp.: Qualea grandiflora, Qualea parviflora, Qualea multiflora which have 146 trees
(in total); while the 52 remaining species have 683 trees: see the species frequencies
shown in Table 8.1.
F-function The Kaplan-Meier estimates of the F-function suggest that the three
point patterns Others, Ouratea, and Qualea are consistent with realisations of Poisson
point processes with the same intensities as the observed species. (See the plots in
Figure 8.11.)
G-cross function The Kaplan-Meier estimates of the G-cross function suggest that
the Qualea point pattern is clustered at small distances r < 4 because the estimated
function lies above the theoretical function of a homogeneous Poisson process. (See
the (Qualea, Qualea) plot in Figure 8.12.) Thus, the trees of Qualea sp. tend to be
closer to each other at small distances than if they were randomly located.
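The comparison described above (estimated curve against the Poisson curve) can be sketched as follows. This is a simplified illustration only: it uses the raw empirical distribution of nearest-neighbour distances rather than the Kaplan-Meier estimator used in this chapter, so it ignores edge effects. The function names and the toy coordinates are hypothetical.

```python
import math

def g_cross(from_points, to_points, r, same_type=False):
    """Fraction of `from_points` whose nearest point in `to_points` is within r.

    With same_type=True the two lists are the same pattern and a point is
    not allowed to be its own neighbour. No edge correction is applied.
    """
    hits = 0
    for i, p in enumerate(from_points):
        nearest = min(
            math.dist(p, q)
            for j, q in enumerate(to_points)
            if not (same_type and i == j)
        )
        if nearest <= r:
            hits += 1
    return hits / len(from_points)

def g_poisson(intensity, r):
    """Theoretical G for a homogeneous planar Poisson process."""
    return 1.0 - math.exp(-intensity * math.pi * r * r)

# Toy clustered pattern (hypothetical): four tight pairs in a 10 x 10 window.
pairs = [(1, 1), (1.2, 1), (5, 5), (5.2, 5), (8, 2), (8.2, 2), (3, 8), (3.2, 8)]
intensity = len(pairs) / (10 * 10)
g_hat = g_cross(pairs, pairs, r=0.5, same_type=True)
g_theo = g_poisson(intensity, r=0.5)
```

Here `g_hat` is far above `g_theo`, the pattern that the text reads as clustering at small distances.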
J-cross function The Kaplan-Meier estimates of the J-cross functions (see Fig-
ure 8.13) suggest positive association for the univariate point patterns: Others,
Ouratea, and Qualea at distances r < 4, because their estimated values are smaller
than 1.
K-cross function The K-cross functions plotted in Figure 8.14 show that Ouratea
and Qualea are clustered at distances r < 5. Note that the translation-corrected
estimates for Ouratea and Qualea sp. are greater than the theoretical values
for homogeneous Poisson point processes for r < 5.
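For a rectangular window, a translation-corrected estimate of the (univariate) K-function can be sketched as below. Each ordered pair within distance r is weighted by the reciprocal of the area of overlap between the window and its copy shifted by the vector joining the two points; for a rectangle of sides `width` and `height` that overlap is (width − |dx|)(height − |dy|). This is a minimal sketch assuming the usual form of the translation correction and the intensity-squared estimate n(n − 1)/|W|²; it is not claimed to reproduce the Spatstat implementation, and the function name and simulated pattern are hypothetical.

```python
import math
import random

def k_translation(points, width, height, r):
    """Translation-corrected estimate of K(r) on [0, width] x [0, height].

    Intended for r well below min(width, height), so overlaps are positive.
    """
    n = len(points)
    area = width * height
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = abs(points[i][0] - points[j][0])
            dy = abs(points[i][1] - points[j][1])
            if math.hypot(dx, dy) <= r:
                # area of the window intersected with its translate
                total += 1.0 / ((width - dx) * (height - dy))
    # divide by the estimate of intensity squared, n(n - 1) / area^2
    return area * area * total / (n * (n - 1))

# Simulated complete spatial randomness in a 10 x 10 window (illustrative).
rng = random.Random(1)
csr = [(rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)) for _ in range(300)]
k_hat = k_translation(csr, 10.0, 10.0, 1.0)   # compare with pi * 1.0 ** 2
```

For the simulated random pattern the estimate should lie close to πr²; values clearly above πr², as reported for Ouratea and Qualea, are what the text interprets as clustering.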
Three botanical classes The estimated F, G, J, and K functions for the three
botanical classes are similar to those obtained from the three most frequent species,
except for the J-cross function. In the diagonal plots in Figure 8.15, the Kaplan-Meier
estimates of the J-cross functions suggest positive association for the univariate point
patterns Liliopsida, Magnoliopsida, and Others at small distances r < 4, where the
estimated values of J-cross are smaller than 1.
Two types The association between the types Magnoliopsida and Others is analysed
using the summary functions F, G, J and K.
[Plots for Figure 8.10. Panels: Others, Ouratea, Qualea.]
Figure 8.10: Location of the three most frequent species of the Brazilian trees dataset:
Others (left), Ouratea (centre), and Qualea (right).
[Plots for Figure 8.11. Panels: others, Ouratea, Qualea; x-axes: r; y-axes: km, theo.]
Figure 8.11: F -functions from: Others (left), Ouratea (centre), Qualea (right). Solid
lines: Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.
F-function The Kaplan-Meier estimates of the F-function suggest that the
Magnoliopsida point pattern is consistent with a realisation of a homogeneous Poisson
point process with the same intensity as the observed point pattern: see the left plot
in Figure 8.16.
Observe that there is a small deviation between the estimated and homogeneous
Poisson curves for the Others point pattern at distances r < 8. However, this
deviation, which suggests regularity, is not supported by the cross-functions G, J and
K. (See the results presented next.)
G-cross function The lower right plot in Figure 8.17 suggests that the Others are
clustered for r < 4 because the estimated function is above the homogeneous Poisson
function. Thus, the Others trees tend to be closer to each other than if they were
randomly located for r < 4.
J-cross function The diagonal plots in Figure 8.18 suggest a positive association
for the univariate datasets: Magnoliopsida and Others for r < 4. There is also
a suggestion of a positive association between Others and Magnoliopsida for r >
2. That is, the presence of an Others tree increases the probability of finding a
Magnoliopsida tree nearby. (See the lower left plot in Figure 8.18.)
K-cross function The lower right plot in Figure 8.19 suggests clustering for the
Others point pattern. This result agrees with that from the J-cross function.
8.4 Complementary analysis
8.4.1 Fusion distance function The inferential part of the strategy (Section
4.3.2) is applied to the univariate Brazilian trees dataset using the fusion distance
function (Section 4.1.1). The null hypothesis is composite; that is, H0: the Brazilian
[Plots for Figure 8.12: array of G functions for Ouratea, Qualea and others. Panels: (others,others), (others,Ouratea), (others,Qualea), (Ouratea,others), (Ouratea,Ouratea), (Ouratea,Qualea), (Qualea,others), (Qualea,Ouratea), (Qualea,Qualea); x-axes: r; y-axes: km, theo.]
Figure 8.12: G-cross functions from: Others (left), Ouratea (centre), Qualea (right).
Solid lines: Kaplan-Meier estimates, dashed lines: homogeneous Poisson point pro-
cesses.
[Plots for Figure 8.13: array of J functions for Ouratea, Qualea and others. Panels: (others,others), (others,Ouratea), (others,Qualea), (Ouratea,others), (Ouratea,Ouratea), (Ouratea,Qualea), (Qualea,others), (Qualea,Ouratea), (Qualea,Qualea); x-axes: r; y-axes: km, theo.]
Figure 8.13: J-cross functions for the Others, Ouratea, and Qualea. Solid lines:
Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.
[Plots for Figure 8.14: array of K functions for Ouratea, Qualea and others. Panels: (others,others), (others,Ouratea), (others,Qualea), (Ouratea,others), (Ouratea,Ouratea), (Ouratea,Qualea), (Qualea,others), (Qualea,Ouratea), (Qualea,Qualea); x-axes: r; y-axes: trans, theo.]
Figure 8.14: K-cross functions for the Others, Ouratea, and Qualea. Solid lines:
translate border estimates, dashed lines: homogeneous Poisson point processes.
[Plots for Figure 8.15: array of J functions for Magnoliopsida, Liliopsida and others. Panels: (Liliopsida,Liliopsida), (Liliopsida,Magnoliopsida), (Liliopsida,others), (Magnoliopsida,Liliopsida), (Magnoliopsida,Magnoliopsida), (Magnoliopsida,others), (others,Liliopsida), (others,Magnoliopsida), (others,others); x-axes: r; y-axes: km, theo.]
Figure 8.15: J-cross functions for the botanical classes: Liliopsida, Magnoliopsida
and Others. Solid lines: Kaplan-Meier estimates, dashed lines: homogeneous Pois-
son point processes.
[Plots for Figure 8.16. Panels: Magnoliopsida, others; x-axes: r; y-axes: km, theo.]
Figure 8.16: F -functions from the Magnoliopsida (left) and Others (right). Solid
lines: Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.
[Plots for Figure 8.17: array of G functions for Magnoliopsida and others. Panels: (Magnoliopsida,Magnoliopsida), (Magnoliopsida,others), (others,Magnoliopsida), (others,others); x-axes: r; y-axes: km, theo.]
Figure 8.17: G-cross functions for the Magnoliopsida and Others. Solid lines:
Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.
[Plots for Figure 8.18: array of J functions for Magnoliopsida and others. Panels: (Magnoliopsida,Magnoliopsida), (Magnoliopsida,others), (others,Magnoliopsida), (others,others); x-axes: r; y-axes: km, theo.]
Figure 8.18: J-cross functions for the Magnoliopsida and Others. Solid lines:
Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.
[Plots for Figure 8.19: array of K functions for Magnoliopsida and others. Panels: (Magnoliopsida,Magnoliopsida), (Magnoliopsida,others), (others,Magnoliopsida), (others,others); x-axes: r; y-axes: trans, theo.]
Figure 8.19: K-cross functions for the Magnoliopsida and Others. Solid lines:
translation-corrected estimates, dashed lines: homogeneous Poisson point processes.
trees dataset is a realisation of a Poisson process with unknown intensity λ. The
intensity is then estimated from the dataset as λ̂ = 0.1122. The (two-sided) modified
Monte Carlo test (Section 3.5) is performed based on 999 simulations under H0, using
the Average Linkage algorithm. The resulting test has approximately the (desired)
5% significance level.
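The general shape of such a Monte Carlo test can be sketched as follows. The sketch is deliberately generic and rests on several assumptions that are not taken from this thesis: it is an ordinary two-sided rank-based Monte Carlo test rather than the modified test of Section 3.5; it conditions on the observed number of points instead of simulating a Poisson process with the estimated intensity; and it uses the mean nearest-neighbour distance as a hypothetical stand-in statistic, because the fusion distance function of Section 4.1.1 is not reproduced here.

```python
import math
import random

def mean_nn_distance(points):
    """Hypothetical stand-in statistic: mean nearest-neighbour distance."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def simulate_csr(n, width, height, rng):
    """n independent uniform points on [0, width] x [0, height]."""
    return [(rng.uniform(0.0, width), rng.uniform(0.0, height)) for _ in range(n)]

def monte_carlo_test(observed, statistic, width, height, n_sim=999, seed=0):
    """Two-sided Monte Carlo p-value for complete spatial randomness."""
    rng = random.Random(seed)
    n = len(observed)
    t_obs = statistic(observed)
    sims = [statistic(simulate_csr(n, width, height, rng)) for _ in range(n_sim)]
    below = sum(1 for t in sims if t <= t_obs)
    above = sum(1 for t in sims if t >= t_obs)
    # rank of the observed value in the smaller tail, doubled
    p_value = min(1.0, 2.0 * (1 + min(below, above)) / (n_sim + 1))
    return t_obs, p_value

# Toy clustered pattern (hypothetical): 30 points packed along a short segment.
clustered = [(5.0 + 0.01 * i, 5.0) for i in range(30)]
t_obs, p_value = monte_carlo_test(clustered, mean_nn_distance, 10.0, 10.0, n_sim=99)
```

With 999 simulations the smallest attainable two-sided p-value under this scheme is 2/1000, which is why 999 is a convenient choice for testing at the 5% level.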
[Plots for Figure 8.20. x-axes: Mean of H(t); y-axes: H(t) − mean of H(t); legends: fusion distance function, simulation envelopes, y = 0 line (upper); fusion distance function, critical band, y = 0 line (lower).]
Figure 8.20: P-P plots of the mean of H(t) (x-axis) plotted against H(t) minus the mean
of H(t) (y-axis), applied to the univariate Brazilian trees dataset. 5% significance level,
999 realisations under H0. Simulation envelopes (upper), critical bands (lower). Solid
lines: P-P plots, dotted lines: y = 0 line, dashed lines: envelopes and bands. Average
Linkage algorithm.
Algorithm  Â      Ā      S_A    A_2.5  A_97.5
SL         0.523  0.501  0.010  0.481  0.520
AL         0.520  0.500  0.007  0.487  0.512
CL         0.513  0.495  0.006  0.484  0.506
Table 8.3: Estimated area statistic Â from the univariate Brazilian trees dataset;
the sample mean and standard deviation, Ā and S_A; Monte Carlo critical values,
A_2.5 and A_97.5. The statistics are estimated from 999 realisations under H0. Single
Linkage (SL), Average Linkage (AL), Complete Linkage (CL).
Figure 8.20 shows the estimated mean of H(t) (x-axis) plotted against the estimated
H(t) minus the mean of H(t) (y-axis). Observe that this difference is substantially
outside the simulation envelope and critical band. Consequently, H0 is rejected. In
other words, the univariate Brazilian trees dataset is not consistent with a realisation
of a homogeneous Poisson point process with λ = 0.1122. The results of the fusion
distance function from the Brazilian trees indicate that the locations of the trees are
more clustered than would be expected for a homogeneous Poisson process.
The fusion distance function was also computed for the Brazilian trees dataset
classified into the three most frequent species, three botanical classes and two types.
The obtained results are similar to those presented in Section 8.3, so they are not
shown here.
8.4.2 Area statistic If a given point pattern is a realisation of a homogeneous
Poisson process, then the expected value of the area statistic is 0.5 (see Proposition
11 in Section 4.1.2). Deviations from this value may indicate either spatial clustering
or spatial inhibition. The null hypothesis is exactly the same as the null hypothesis
of the fusion distance function (Section 8.4.1). The (two-sided) modified Monte
Carlo test (Section 3.5) is performed based on 999 simulations under H0, and the
Single Linkage, Average Linkage, and Complete Linkage algorithms.
The prediction intervals for the estimated area statistic are given as follows:
Single Linkage: [0.481, 0.521]; Average Linkage: [0.486, 0.514]; Complete Linkage:
[0.483, 0.507]. (These agree with the sample mean ± 1.96 standard deviations in
Table 8.3.) The Monte Carlo estimates of the critical values, A2.5 and A97.5, are
the 2.5th and 97.5th percentiles of the area statistic under H0, respectively. The
estimated values of the area statistic A are greater than the A97.5 critical values for
all three algorithms. (See Table 8.3.) Therefore, the null hypothesis is rejected. The
results of the area statistic also suggest that the locations of the trees are more
clustered than we would expect for a homogeneous Poisson process.
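The arithmetic behind the intervals and the test decisions can be retraced from the values in Table 8.3. The short sketch below assumes that the prediction intervals are the sample mean ± 1.96 standard deviations (a normal approximation, inferred from the numbers rather than stated in the text) and that H0 is rejected when the observed value falls outside the Monte Carlo critical values; the variable and function names are illustrative.

```python
# Values transcribed from Table 8.3:
# (observed A, sample mean, standard deviation, A_2.5, A_97.5).
table_8_3 = {
    "SL": (0.523, 0.501, 0.010, 0.481, 0.520),
    "AL": (0.520, 0.500, 0.007, 0.487, 0.512),
    "CL": (0.513, 0.495, 0.006, 0.484, 0.506),
}

def normal_interval(mean, sd, z=1.96):
    """Mean +/- z standard deviations, rounded to three decimals."""
    return round(mean - z * sd, 3), round(mean + z * sd, 3)

def rejects(observed, lower, upper):
    """Two-sided decision: reject when the observed value is outside [lower, upper]."""
    return observed < lower or observed > upper

intervals = {alg: normal_interval(row[1], row[2]) for alg, row in table_8_3.items()}
decisions = {alg: rejects(row[0], row[3], row[4]) for alg, row in table_8_3.items()}

for alg in table_8_3:
    print(alg, intervals[alg], "reject H0" if decisions[alg] else "do not reject H0")
```

The observed value exceeds the upper critical value for each of the three algorithms, so the decision does not depend on which linkage is used.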
8.4.3 S statistic and spatial Rg index The extension of the strategy using
the S statistic (Section 6.2) and the spatial Rg index (Section 6.3) is applied to test the
random labelling hypothesis for the multivariate Brazilian trees dataset. That is,
(one-sided) Monte Carlo tests (Section 3.5) are performed at the 5% significance level,
based on the Single Linkage (SL), Average Linkage (AL) and Complete Linkage
(CL) algorithms and 999 random permutations of the type labels.
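The mechanics of a random labelling test can be sketched as follows: the point locations are held fixed, the type labels are permuted at random, and the observed statistic is compared with its permutation distribution. The statistic used here (the number of points whose nearest neighbour carries the same label) is a hypothetical stand-in, since the S statistic and the spatial Rg index are defined in Sections 6.2 and 6.3 and are not reproduced here; the function names and the toy pattern are likewise illustrative.

```python
import math
import random

def same_label_nn_count(points, labels):
    """Hypothetical statistic: points whose nearest neighbour shares their label."""
    count = 0
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nearest = min(others, key=lambda j: math.dist(p, points[j]))
        if labels[i] == labels[nearest]:
            count += 1
    return count

def random_labelling_test(points, labels, statistic, n_perm=999, seed=0):
    """One-sided Monte Carlo p-value under random labelling (large values reject)."""
    rng = random.Random(seed)
    t_obs = statistic(points, labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # locations fixed, labels permuted
        if statistic(points, shuffled) >= t_obs:
            at_least += 1
    return t_obs, (1 + at_least) / (n_perm + 1)

# Toy segregated pattern (hypothetical): type "a" near the origin, type "b" far away.
toy_points = [(0.1 * i, 0.0) for i in range(10)] + [(10.0 + 0.1 * i, 10.0) for i in range(10)]
toy_labels = ["a"] * 10 + ["b"] * 10
t_obs, p_value = random_labelling_test(toy_points, toy_labels, same_label_nn_count, n_perm=199)
```

Permuting labels keeps the number of points of each type fixed, which is what distinguishes the random labelling null hypothesis from a null hypothesis about the point locations themselves.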
56 species, 7 subclasses, 3 classes, 2 types Table 8.4 shows the estimated values of
the S statistic, spatial Rg index, S, Rg; and Monte Carlo critical values, S5%rr, R5%rr,
respectively, for the Brazilian trees dataset classified into 56 species, 7 subclasses, 3
classes, and 2 types.
56 species S S5%rr Rg R5%rr
SL 58 39 0.648 0.647
AL 64 44 0.895 0.894
CL 68 46 0.896 0.896
7 subclasses S S5%rr Rg R5%rr
SL 209 183 0.367 0.371
AL 235 206 0.603 0.602
CL 249 213 0.598 0.600
3 classes S S5%rr Rg R5%rr
SL 654 636 0.809 0.812
AL 729 704 0.439 0.448
CL 738 712 0.408 0.414
2 types S S5%rr Rg R5%rr
SL 655 640 0.816 0.819
AL 730 707 0.501 0.503
CL 739 714 0.504 0.506
Table 8.4: Estimated values S, S5%rr , Rg, R5%rr from the Brazilian trees dataset
classified into: 56 species, 7 subclasses, 3 classes, 2 types; Single Linkage (SL),
Average Linkage (AL), Complete Linkage (CL), 999 random permutations of the
type labels.
Note that using the S statistic and based on the three clustering algorithms, the
random labelling is rejected for the multivariate Brazilian trees dataset.
For the spatial Rg index, except for the three class and two type classifications,
where the random labelling hypothesis is not rejected, the remaining results are
similar to those obtained from S statistic. Thus, the random labelling hypothesis is
rejected for the dataset classified into 56 species, and based on the Single Linkage and
Average Linkage. The random labelling hypothesis is also rejected for the dataset
classified into seven subclasses, and based on the Average Linkage. (See Table 8.4.)
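The decisions summarised above can be retraced directly from Table 8.4. The sketch below assumes the one-sided rule "reject random labelling when the observed value is strictly greater than the 5% Monte Carlo critical value"; this rule is inferred from the reported conclusions and applied to the three-decimal values as printed, and the dictionary and function names are illustrative.

```python
# (observed value, 5% Monte Carlo critical value) transcribed from Table 8.4.
S_VALUES = {
    "56 species":   {"SL": (58, 39),   "AL": (64, 44),   "CL": (68, 46)},
    "7 subclasses": {"SL": (209, 183), "AL": (235, 206), "CL": (249, 213)},
    "3 classes":    {"SL": (654, 636), "AL": (729, 704), "CL": (738, 712)},
    "2 types":      {"SL": (655, 640), "AL": (730, 707), "CL": (739, 714)},
}
RG_VALUES = {
    "56 species":   {"SL": (0.648, 0.647), "AL": (0.895, 0.894), "CL": (0.896, 0.896)},
    "7 subclasses": {"SL": (0.367, 0.371), "AL": (0.603, 0.602), "CL": (0.598, 0.600)},
    "3 classes":    {"SL": (0.809, 0.812), "AL": (0.439, 0.448), "CL": (0.408, 0.414)},
    "2 types":      {"SL": (0.816, 0.819), "AL": (0.501, 0.503), "CL": (0.504, 0.506)},
}

def rejected(table):
    """Sorted (classification, algorithm) pairs where observed > critical value."""
    return sorted(
        (classification, algorithm)
        for classification, row in table.items()
        for algorithm, (observed, critical) in row.items()
        if observed > critical
    )

s_rejections = rejected(S_VALUES)     # all twelve combinations
rg_rejections = rejected(RG_VALUES)   # three combinations only
```

Under this rule the S statistic rejects in every case, whereas the spatial Rg index rejects only for 56 species (SL, AL) and seven subclasses (AL), matching the text.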
8.4.4 Gamma approximation for spatial Rg index The gamma approxi-
mation is fitted to the Monte Carlo null distribution of the spatial Rg index applied
to the Brazilian trees dataset classified into two types, using the procedure described
in Section 6.3. Figure 8.21 shows that the fitted gamma is a good approximation
for the Monte Carlo null distribution of the spatial Rg index.
[Plot for Figure 8.21: Brazilian trees with two types. x-axis: Monte Carlo null distribution; y-axis: gamma approximation; legend: q-q plot, identity line.]
Figure 8.21: Q-Q plot comparing the Monte Carlo estimate of the null distribution
of the spatial Rg index with its gamma approximation, for the Brazilian trees dataset
with two types. Solid line: Q-Q plot, dashed line: identity line.
Table 8.5 presents the estimated value of the spatial Rg index; the parameters α,
β, γ of the fitted gamma approximation; and the p-value from the Monte Carlo null
distribution of the spatial Rg index under the random labelling hypothesis, based
on the Average Linkage. (The parameters of the gamma approximation were
estimated using the Method of Moments (Section 6.3).) Therefore, the random
labelling hypothesis is not rejected for the bivariate Brazilian trees dataset,
because of the large p-value.

Datasets    Rg      α      β       γ      p-value
Brazilian   0.501   4.530  0.0005  0.499  0.584

Table 8.5: Estimated spatial Rg index applied to the bivariate Brazilian trees dataset;
parameters α, β, γ of the fitted gamma; and p-value from the Monte Carlo null
distribution of the spatial Rg index under random labelling hypothesis; Average Linkage.
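A method-of-moments fit of a three-parameter gamma distribution can be sketched as below. This is only an illustration under stated assumptions: the procedure of Section 6.3 is not reproduced here, the function name is hypothetical, and it is assumed that α is the shape, β the scale and γ the location (shift), so that the mean is γ + αβ, the variance is αβ², and the skewness is 2/√α.

```python
import math
import statistics

def fit_gamma_moments(sample):
    """Method-of-moments fit of a shifted gamma distribution.

    Assumed parameterisation: shape alpha, scale beta, location gamma.
    The first three sample moments are matched:
        mean = gamma + alpha * beta
        variance = alpha * beta**2
        skewness = 2 / sqrt(alpha)
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    var = statistics.pvariance(sample, mu=mean)
    if var == 0:
        raise ValueError("sample has zero variance")
    sd = math.sqrt(var)
    skew = sum((x - mean) ** 3 for x in sample) / (n * sd ** 3)
    if skew <= 0:
        raise ValueError("moment fit needs positive sample skewness")
    alpha = 4.0 / skew ** 2        # from skewness = 2 / sqrt(alpha)
    beta = sd * skew / 2.0         # from variance = alpha * beta**2
    gamma = mean - alpha * beta    # from mean = gamma + alpha * beta
    return alpha, beta, gamma
```

Under this assumed parameterisation, the rounded values in Table 8.5 give a fitted mean of 0.499 + 4.530 × 0.0005 ≈ 0.5013, close to the observed Rg of 0.501.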
8.4.5 Analysis of local configuration This section presents the analysis of
the local configuration (Section 7.1) applied to the Brazilian trees dataset, and based
on the 20 nearest neighbours, Single Linkage, Average Linkage and Complete Link-
age. Figure 8.22 shows the kernel densities of the probability functions of the fusion
distances applied to the univariate Brazilian trees dataset. The dendrograms of the
Single Linkage, Average Linkage and Complete Linkage are shown in Figure 8.23.
Similar to the dendrograms of the Longleaf pines and Lansing woods datasets
(Section 7.2.3), the disordered Single Linkage dendrogram suggests that there may
not be spatial clustering in the Brazilian trees dataset. Even though there is no
strong evidence for clusters in the dataset, the analysis proceeds and the dendro-
gram of the Average Linkage is cut into seven, three and two clusters. The cluster
classification is then compared with the botanical classification.
The results of the classification based on the Single Linkage and Complete Linkage
algorithms are not shown here. The main reasons are that the Single Linkage
separates the dataset poorly into meaningful clusters, and that the Complete Linkage
results are similar to those obtained from the Average Linkage.
Seven groups Table 8.6 shows the frequency counts of the Brazilian trees
classified into seven, three, and two botanical types, and into seven,
three, and two groups based on the total variation distance, respectively. The upper
plot in Figure 8.24 shows the local configuration classification of the Brazilian trees
dataset into seven groups from the total variation distance based on the 20 nearest
neighbours and Average Linkage.
A Poisson process with the same estimated intensity as the Brazilian trees dataset
on the 100 m square was simulated, and the mean (equation (7.4)) of the local fusion
distance function was computed. The estimated group means (equation (7.3)) of the
local fusion distance functions were calculated, and compared with the estimated
mean of the local fusion distance functions from the Poisson process. Except for
group 7, which only has 2 trees, the upper plot in Figure 8.25 suggests that there is
not a clear separation for clusters in the Brazilian trees dataset.

[Three panels: Single Linkage, Average Linkage, Complete Linkage. Horizontal axis: fusion distances; vertical axis: probability density function.]
Figure 8.22: Kernel probability densities of the fusion distances from the univariate
Brazilian trees dataset based on the 20 nearest neighbours, Single Linkage, Average
Linkage, and Complete Linkage.

[Three dendrograms: Single Linkage, Average Linkage, Complete Linkage. Vertical axis: total variation distances.]
Figure 8.23: Dendrograms of total variation distances from kernel densities of fusion
distances for the Brazilian trees dataset, based on the 20 nearest neighbours; Single
Linkage; Average Linkage; Complete Linkage.

[Three panels: Seven groups, Three groups, Two groups.]
Figure 8.24: Classification of points in the Brazilian trees dataset into: seven groups
(upper), three groups (centre), and two groups (lower) based on their local
configuration (20 nearest neighbours, fusion distances, kernel smoothing, total variation
distance, Average Linkage).

                       Group
Subclass           1     2     3     4     5     6     7
Aracidae           6    13     3     0     3     1     0
Asteridae         16    27     3     0     1     1     0
Dilleniidae      118   201    39     1     4     8     0
Hamamelidae       12    20     4     0     0     0     0
Liliidae           9     9     9     0     1     1     0
Miscellaneous     11    40     7     0     0     1     0
Rosidae          152   321    48     6     9    15     2

                       Group
Class              1     2     3
Magnoliopsida    415   569    24
Liliopsida        31    22     2
Others            18    40     1

                       Group
Type               1     2
1                984    24
2                111     3

Table 8.6: Contingency tables of the Brazilian trees dataset by botanical types and
groups. Upper: seven subclasses and groups; centre: three classes and groups; lower:
two types and groups. Groups based on total variation distances; 20 nearest
neighbours; Average Linkage.
Three groups The central plot in Figure 8.24 shows the local configuration clas-
sification of the Brazilian trees dataset into three groups based on the total variation
distances from the 20 nearest neighbours and Average Linkage. The group means
of the local fusion distance functions were estimated, and compared with the mean
of the fusion distance functions from a simulated Poisson process with the same
intensity as the Brazilian trees dataset on the 100 m square. The lower left plot in
Figure 8.25 indicates that the Brazilian trees dataset does not have a good separation
for clusters.
Two groups The lower plot in Figure 8.24 shows the local configuration classi-
fication of the Brazilian trees dataset into two groups based on the total variation
distance from the 20 nearest neighbours and Average Linkage. The lower right plot
in Figure 8.25 suggests that there is not a clear separation for two groups in the
Brazilian trees dataset.
The results of the analysis based on the 10 nearest neighbours are very similar to
those obtained for the 20 nearest neighbours. Therefore, the 10 nearest neighbour
analysis is not presented here.
[Three panels: seven groups (group 1 to group 7), three groups (group 1 to group 3), two groups (group 1, group 2), each with an identity line. Horizontal axis: estimated mean of H_{Pois}(t); vertical axis: estimated mean of H_v(t).]
Figure 8.25: Estimated group means of local fusion distance functions plotted against
the mean of local fusion distance functions from homogeneous Poisson processes with
the same intensity as the Brazilian trees dataset. Upper: seven groups; lower left:
three groups; lower right: two groups. 20 nearest neighbours; Average Linkage.
CHAPTER 9
Conclusion and open problems
This chapter describes the main problems studied, and findings of the research
reported in this thesis, chapter by chapter. Furthermore, important issues arising
from the new approach are discussed in detail and suggestions for future work are
made.
9.1 Problems studied and findings
Chapter 3 The problem of an exact significance level of the Monte Carlo test
applied to the P-P plot was studied and solved. The (new) modified version of
the two-sided test was described in Section 3.5. In addition, the transformed P-P
plot, “A-A plot”, was presented and applied to the published and simulated point
patterns.
Chapter 4 The problem of the shortage of summary statistics that work well in
practical applications was investigated, and complementary statistics were proposed.
In particular, the fusion distance function and the area statistic were developed.
Moreover, both statistics were thoroughly studied using the most popular hierarchical
algorithms (Single Linkage, Average Linkage and Complete Linkage) and dissimilarity
coefficient (Euclidean pairwise distance) in multivariate cluster analysis. Note that
the fusion distance function and area statistic depend strongly on the chosen
clustering algorithm and dissimilarity coefficient. In other words, for a given point
pattern, the fusion distance function and area statistic will take different shapes
and values for different algorithms and coefficients.
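The dependence on the clustering algorithm can be illustrated with a small sketch that computes the fusion distances of a point pattern under the three linkage rules. It is a naive implementation written for illustration, with hypothetical names, and is not the software used in this thesis. It assumes the Euclidean pairwise distance and returns only the raw fusion distances; the fusion distance function H(t) built from them is defined in Chapter 4 and is not reproduced here.

```python
import math

def fusion_distances(points, method="single"):
    """Return the n - 1 fusion (merge) distances of an agglomerative
    hierarchical clustering of `points`.

    `method` is "single", "complete" or "average" (unweighted group
    average). Naive O(n^3) search, adequate only for small patterns.
    """
    clusters = {i: 1 for i in range(len(points))}   # cluster id -> size
    dist = {(i, j): math.dist(points[i], points[j])
            for i in clusters for j in clusters if i < j}
    fusions = []
    next_id = len(points)
    while len(clusters) > 1:
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        fusions.append(d_ab)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        updated = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            if method == "single":
                updated[c] = min(d_ac, d_bc)
            elif method == "complete":
                updated[c] = max(d_ac, d_bc)
            elif method == "average":
                updated[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
            else:
                raise ValueError("unknown method: " + method)
        # Drop distances involving the two merged clusters, add the new one.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        for c, d in updated.items():
            dist[(c, next_id)] = d      # next_id exceeds every existing id
        clusters[next_id] = size_a + size_b
        next_id += 1
    return fusions
```

For the four collinear points (0, 0), (1, 0), (3, 0), (7, 0), the sketch gives fusion distances 1, 2, 4 under Single Linkage, 1, 3, 7 under Complete Linkage, and 1, 2.5, 17/3 under Average Linkage, so the same pattern yields three different sets of fusion distances.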
Fusion distance function The fusion distance function is regarded as a link
with the open problem of finding the best number of clusters in multivariate clus-
ter analysis. Here, the relationship between the fusion distance function and best
number of clusters using a knee plot is examined. In summary, the fusion distance
function is a linear combination of the knee plot. However, the problem of find-
ing parametric functions for estimating the fusion distance function under the null
hypothesis of Complete Spatial Randomness is still open.
Area statistic The problem of the (unknown) value of the area statistic un-
der the null hypothesis of Complete Spatial Randomness is investigated and solved.
Proposition 11 demonstrates that the area statistic for a homogeneous Poisson pro-
cess with (known) intensity λ is equal to 0.5.
New strategy The problem of analysing point patterns using tools of exploratory
data analysis and inference is studied, and a new strategy is proposed in Section
4.3. A new application of the relative distribution method using the fusion distance
function is presented. In particular, the relative distribution plot is applied to the
standard spatial datasets.
Chapter 5 The problem of estimating the power of the Monte Carlo tests un-
der the null hypothesis of Complete Spatial Randomness was investigated, and a
particular case of the power was estimated. This particular study was an illustra-
tion of the powers of the Monte Carlo tests using the supremum distance and area
statistics. Both tests were based on the fusion distance function.
The chosen alternatives are two special models of Matérn point processes (the Matérn
cluster process and Matérn model II). Observe that simulations from the alternative
models do not depend on iterative algorithms. Thus, the computing time for the
simulations is manageable, because direct algorithms are used.
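As an illustration of a direct (non-iterative) simulation, the following sketch generates a Matérn cluster process on a rectangle. It is a generic textbook construction and not the code used in Chapter 5. The names and the parameterisation (parent intensity `kappa`, mean number of offspring `mu`, cluster radius `radius`) are assumptions made for this example.

```python
import math
import random

def poisson_variate(rng, mean):
    """Poisson random number by Knuth's multiplication method.
    Suitable for moderate means only, since exp(-mean) underflows
    for very large means."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def matern_cluster(kappa, mu, radius, width=1.0, height=1.0, seed=0):
    """Simulate a Matern cluster process on [0, width] x [0, height].

    Parents form a Poisson process of intensity kappa on the window
    enlarged by `radius` (so that clusters centred just outside the
    window still contribute). Each parent has a Poisson(mu) number of
    offspring, uniform in the disc of the given radius around it.
    Only offspring inside the window are kept; parents are discarded.
    """
    rng = random.Random(seed)
    area = (width + 2 * radius) * (height + 2 * radius)
    points = []
    for _ in range(poisson_variate(rng, kappa * area)):
        px = rng.uniform(-radius, width + radius)
        py = rng.uniform(-radius, height + radius)
        for _ in range(poisson_variate(rng, mu)):
            r = radius * math.sqrt(rng.random())   # uniform in the disc
            theta = 2.0 * math.pi * rng.random()
            x, y = px + r * math.cos(theta), py + r * math.sin(theta)
            if 0.0 <= x <= width and 0.0 <= y <= height:
                points.append((x, y))
    return points
```

Each pattern is produced in a single pass, with no burn-in or convergence check, which is what makes repeated simulation for a power study practical.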
The power of the Monte Carlo test based on the supremum distance is quite
variable and difficult to interpret, whereas the power of the Monte Carlo test based
on the area statistic is straightforward. The best power achieved by the supremum
distance is comparable to the best power achieved by the area statistic. Therefore,
the Monte Carlo test based on the area statistic is recommended for the models
studied here.
Chapter 6 The problem of analysing multivariate point patterns is examined,
and complementary strategies are proposed. First, an extension of the strategy
using the fusion distance function is described. Second, another extension based on
the S statistic is presented. Third, an extension of the strategy using the spatial Rg
index, a (new) modified version of Rg index, is developed.
In addition, some properties of the statistics (S, spatial Rg index) are investi-
gated. The extensions are applied to bivariate published point patterns. Finally,
the Monte Carlo null distribution of the spatial Rg index is approximated, using a
gamma distribution.
Chapter 7 The problem of examining a localised neighbourhood based on the
fusion distance function was studied, and the analysis of local configuration, a new
extension of LISA (Local Indicators of Spatial Association), is presented in Section
7.1. That is, given a local neighbourhood of a point pattern, the probability density
of the fusion distance function is approximated using kernel smoothing techniques.
The total variation distance is chosen to measure distances between probability
densities of the local fusion distances.
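The comparison of two local neighbourhoods can be sketched as follows: each set of local fusion distances is smoothed with a Gaussian kernel, and the two densities are compared by the total variation distance. This is a minimal sketch under stated assumptions, with hypothetical names. The bandwidth is supplied by the user, the integral is approximated numerically on a finite interval, and the usual convention TV = ½ ∫ |f − g| is used; the normalisation adopted in Section 7.1 may differ.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates a Gaussian kernel density
    estimate of `sample` with the given bandwidth."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(math.exp(-0.5 * ((t - x) / bandwidth) ** 2)
                          for x in sample)

    return density

def total_variation(f, g, lo, hi, steps=2000):
    """Approximate 0.5 * integral of |f - g| over [lo, hi] by the
    midpoint rule. The interval should cover the support of both
    densities up to negligible tails."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h
        total += abs(f(t) - g(t))
    return 0.5 * h * total
```

With this convention the distance is 0 for identical densities and approaches 1 for densities with essentially disjoint support. The resulting matrix of pairwise distances between points can then be passed to a hierarchical algorithm, as in the dendrograms of Figure 8.23.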
The analysis of local configuration is applied to published multivariate datasets.
The local configuration strategy has successfully identified different textures of the
published datasets. The Average Linkage and Complete Linkage perform better than
the Single Linkage. The poor performance of the Single Linkage is mainly due to the
chaining effect. Thus, if there are no clusters with a nucleus in a given dataset, then
the Single Linkage is unable to separate the dataset into meaningful clusters.
Chapter 8 The problem of analysing large multivariate point patterns, such
as the Brazilian trees dataset, was studied using exploratory data analysis and
inference, based on spatial summary functions and statistics. The additional task of
finding the complete botanical classification of the Brazilian trees dataset was
completed. A few inconsistencies found in the dataset were also corrected. The
analysis of the Brazilian trees dataset with fifty-six types proved problematic, and
was resolved by classifying the dataset into fewer types: seven subclasses, three most
frequent species, three classes, and two types.
The results from the fusion distance function and area statistic show that regard-
less of the types, the univariate dataset is clustered. For the multivariate dataset
classified into fewer types, the results obtained by using the traditional summary
functions (G, J , K), and the fusion distance function show the presence of clustered
sub-patterns in the dataset.
9.2 Critique
Limitations In this thesis, the proposed methods and strategies were only
examined by using popular hierarchical algorithms (Single Linkage, Average Linkage
and Complete Linkage) and dissimilarity coefficient (Euclidean pairwise distance).
There are no theoretical results related to the new (non-parametric) summary
statistics and function. Moreover, the statistics rely on Monte Carlo simulations
and tests for the applications to point patterns.
In Chapter 5, a comparative study of the power of the Monte Carlo tests using the
supremum distance and area statistic under Complete Spatial Randomness against
the chosen alternative models based on the traditional summary functions (G, J , K)
was not done. This would be useful for comparing the fusion distance function
with the traditional summary functions; however, it requires extensive computation
and programming.
In Chapter 7, the analysis of a localised neighbourhood of a given point pattern,
based on the local fusion distance function, is examined. However, the problem of
identifying the clusters, and measuring the degree of spatial clustering needs further
investigation.
Weaknesses The fusion distance function and area statistic are strongly dependent
on the choice of hierarchical algorithm and dissimilarity coefficient. Also, the
fusion distances of a given point pattern cannot be regarded as independent and
identically distributed observations. (See details described in Section 4.1.1.)
The study of the power of the Monte Carlo tests (Chapter 5) is not intended to
produce a general rule. The results found are an illustration of the performance of
the new summary statistics to an arbitrary choice of the alternative models.
Difficulty The main difficulty found is the practical impossibility of plotting
large multitype point patterns. In theory, the existing methods and software work
very well only if the number of types is smaller than or equal to three.
Final issues First, the fusion distance function H(t), area statistic A, statistic
S, and spatial Rg index are a non-parametric function and non-parametric statistics.
Second, it was not possible to use standard goodness-of-fit tests, since the
distribution of the Kolmogorov-Smirnov statistic under CSR is still unknown. So the
two-sided modified version of the Monte Carlo test (Section 3.5), based on the fusion
distance function and area statistic, is performed to estimate the power of the test
of CSR and to achieve the exact significance level α. (See further information in
Section 5.2.) Finally, the proposed strategies and methods depend strongly on
computers and graphical analysis.
9.3 Open problems
Chapter 4
• Study the fusion distance function and area statistic using other hierarchical
algorithms, for instance Ward's minimum variance method, and a generalised
dissimilarity coefficient such as the Mahalanobis distance.
• Given a clustered pattern with a fixed number of clusters, investigate the
new summary function and statistics using a non-hierarchical algorithm, for
example K-means.
• Approximate a parametric function for the fusion distance function under the
null model of Complete Spatial Randomness.
• Examine the fusion distance function under more complicated models such as
the inhomogeneous Poisson point process.
Chapter 5
• Compare the power of the Monte Carlo test under Complete Spatial Random-
ness using the supremum distance, and area statistic based on the summary
functions: G, J , and K.
• Investigate the power of the Monte Carlo test under Complete Spatial
Randomness using the supremum distance and area statistic based on other
hierarchical algorithms and the Mahalanobis distance.
Chapter 6
• Examine the viability of the computational programming for estimating the
parameters of the Monte Carlo null distribution of the spatial Rg index using
the Maximum Likelihood Method.
• Explore other distributions, such as the Log Normal and Weibull, to approxi-
mate the Monte Carlo null distribution of the spatial Rg index.
Chapter 8
• Extend the existing techniques and software for analysing and plotting large
multivariate point patterns, especially, for point patterns with a number of
types larger than five.
Bibliography
[1] M. Aitkin and D. Clayton. The fitting of exponential, Weibull and extreme
value distributions to complex censored survival data using GLIM. Appl.
Statist., 29:156–163, 1980.
[2] H. Akaike. An approximation to the density function. Ann. Inst. Statist.
Math., 6:127–132, 1954.
[3] N. H. Anderson and D. M. Titterington. Some methods for investigating
spatial clustering, with epidemiological applications. J. R. Statist. Soc. A,
160(1):87–105, 1997.
[4] T. W. Anderson and D. A. Darling. Asymptotic theory of certain “goodness
of fit” criteria based on stochastic processes. Ibid., 23:193–212, 1952.
[5] L. Anselin. The Moran scatterplot as an ESDA tool to assess local instability
in spatial association. In The DISDATA Specialist Meeting on GIS and Spatial
Analysis, Amsterdam, The Netherlands, pages 1–5. West Virginia University,
Regional Research Institute, Research Paper 9330, 1993.
[6] L. Anselin. Local indicators of spatial association - LISA. Geographical Anal-
ysis, 27:93–115, 1995.
[7] A. J. Baddeley and R. D. Gill. Kaplan-Meier estimators for interpoint distance
distributions of spatial point processes. Ann. Statist., 25:263–292, 1997.
[8] A. J. Baddeley and M. N. M. van Lieshout. Stochastic geometry models in
high-level vision. In K. V. Mardia, editor, Statistics and Images, pages 233–
258. Carfax, Abingdon, 1993.
[9] A. J. Baddeley and R. Turner. SpatStat for R, 1.3-2 edition, May 2002.
[10] G. A. Barnard. Discussion of Professor Bartlett’s paper. J. R. Statist. Soc.
Ser. B, 25:294, 1963.
[11] J. Besag and P. J. Diggle. Simple Monte Carlo tests for spatial pattern. Applied
Statistics, 26:327–333, 1977.
[12] J. Besag and J. Newell. The detection of clusters in rare diseases. J. R. Statist.
Soc. A, 154(1):143–155, 1991.
136 Bibliography
[13] P. J. Bickel and Kjell A. Doksum. Mathematical Statistics. Holden-Day, Inc.,
California, 1977.
[14] A. W. Bowman and A. Azzalini. Applied smoothing techniques for data anal-
ysis: the kernel approach with S-Plus illustrations. Oxford University Press,
Oxford, 1997.
[15] R. K. Brummitt. Vascular Plant Families and Genera. Royal Botanic Gardens,
Kew, 1992.
[16] J. M. Chambers, W. S. Cleveland, B. Kleiner, and P. A. Tukey. Graphical
Methods for Data Analysis. Wadsworth, Inc., California, 1987.
[17] J. L. Chandon and S. Pinson. Analyse Typologique. Masson, Paris, 1981.
[18] A. D. Cliff and J. K. Ord. Spatial Processes: Models and Applications. Pion,
London, 1981.
[19] D. R. Cox. Some statistical methods related with series of events (with dis-
cussion). J. R. Statist. Soc. B, 17:129–164, 1955.
[20] D. R. Cox and V. Isham. Point Processes. Chapman and Hall, London, 1980.
[21] D. R. Cox and P. A. W. Lewis. Multivariate point processes. In Proceedings
of the sixth Berkeley Symposium of Mathematics Statistics and Probability,
number 3, pages 401–445. University of California Press, 1972.
[22] N. Cressie and L. B. Collins. Analysis of spatial point patterns using bundles
of product density lisa functions. J Agric Biol Environ Stat, 6:118–135, 2001.
[23] N. Cressie and L. B. Collins. Patterns in spatial point locations: Local indica-
tors of spatial association in a minefield with clutter. Naval Research Logistics,
48:333–347, 2001.
[24] N. A. C. Cressie. Statistics for Spatial Data. John Wiley and Sons, Inc., New
York, 1991.
[25] F. H. C. Crick and P. A. Lawrence. Compartments and polychones in insect
development. Science, 189:340–347, 1975.
[26] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Pro-
cesses. Spring-Verlag, New York, 1988.
Bibliography 137
[27] A. Dasgupta and A. E. Raftery. Detecting features in spatial point processes
with clutter via model-based clustering. Journal of the American Statistical
Association, 93:294–302, 1998.
[28] P. Diehl. Geography and war: A review and assessment of the empirical
literature, edited by M. Ward. New Geopolitics, Gordon and Breach, 1992.
[29] P. J. Diggle. On parameter estimation and goodness-of-fit testing for spatial
point patterns. Biometrics, 35:87–101, 1979.
[30] P. J. Diggle. Statistical Analysis of Spatial Point Patterns. Academic Press,
London, 1983.
[31] P. J. Diggle. Displaced amacrine cells in the retina of a rabbit: analysis of a
bivariate spatial point pattern. J. Neurosci. Meth., 18:115–125, 1986.
[32] P. J. Diggle. A point process modelling approach to raised incidence of a
rare phenomenon in the vinicity of a prespecified point. J. R. Statist. Soc. A,
153(3):349–362, 1990.
[33] R. Doll. The epidemiology of childhood leukaemia. J. R. Statist. Soc. A.,
152:341–351, 1989.
[34] J. Durbin. Distribution Theory for Tests Based on the Sample Distribution
Function. Society for Industrial and Applied Mathematics, Philadelphia, 1973.
[35] M. Dwass. Modified randomization tests for nonparametric hypotheses. Ann.
Math. Statist., 28:181–187, 1957.
[36] M. Ehrmann and URL R. L. Bell. Desiderata.
http://www.geocities.com/lswote/desiderata.html, 1927.
[37] B. S. Everitt. Cluster Analysis. Edward Arnold, London, 1993.
[38] L. Fisher and J. W. van Ness. Admissible clustering procedures. Biometrika,
58:91–104, 1971.
[39] E. Fix and J. L. Hodges. Discriminatory analysis– non-parametric discrimi-
nation: consistency properties. Report, Project no. 21-29-004 No. 4,, USAF
School of Aviation Medicine, Randolph Field, TX, 1951.
138 Bibliography
[40] K. Florek, J. Lukaszewicz, J. Perkal, H. Steinhaus, and S. Zubrzycki. Sur la
liaison et la division des points d’un ensemble fini. Colloq. Math., 2:282–285,
in French, 1951.
[41] E. B. Fowlkes. A Folio of Distributions. Marcel Dekker, Inc., New York and
Basel, 1987.
[42] E. B. Fowlkes and C. L. Mallows. A method for comparing two hierarchical
clusterings. J. Amer. Statist. Assoc., 78:553–569, 1983.
[43] D. J. Gerrard. Competition quotient: A new measure of the competition affect-
ing individual forest trees. Research bulletin, Vol 20, Agricultural Experiment
Station, Michigan State University, 1969.
[44] A. Getis and K. Ord. The analysis of spatial association by use of distance
statistics. Geographical Analysis, 24:189–206, 1992.
[45] A. D. Gordon. Classification. Chapman and Hall, London, 1981.
[46] P. R. Halmos. Measure Theory. Van Nostrand Reinhold Company, New York,
1969.
[47] M. S. Handcock and M. Morris. Relative Distribution Methods in the Social
Sciences. Springer-Verlag, New York, 1999.
[48] M. S. Handcock and M. Morris. The software on relative distribution methods
on social sciences. http://csde.washington.edu/~handcock/RelDist/ ,
1999.
[49] J. A. Hartigan. Clustering Algorithms. John Wiley and Sons Ltd, New York,
1975.
[50] A. C. A. Hope. A simplified Monte Carlo significance test procedure. J. R.
Statist. Soc. B, 30:582–598, 1968.
[51] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics.
Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.
[52] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice
Hall,Inc., Englewood Cliff, 1988.
[53] N. Jardine and R. Sibson. Mathematical Taxonomy. John Wiley and Sons
Ltd, London, 1971.
Bibliography 139
[54] K.-H. Jockel. Finite sample properties and asymptotic efficiency of Monte
Carlo tests. Annals of Statistics, 14:336–347, 1986.
[55] N. L. Johnson and S. Kotz. Distributions in Statistics: continuous univariate
distributions-1. Houghton Mifflin Company, Boston, 1970.
[56] R. A. Johnson and G. K. Bhattacharyya. Statistics Principles and Methods.
John Wiley and Sons, Inc., New York, 1996.
[57] S. P. Kaluzny, S. C. Vega, T. P. Cardoso, and A. A. Shelly. S+SPATIALSTATS
User’s Manual, 1.0 edition, February 1996.
[58] L. Kaufman and P. Rousseeuw. Finding Group in Data: an Introduction to
Cluster Analysis. John Wiley and Sons Inc., New York, 1990.
[59] J. E. Kelsall and P. J. Diggle. Kernel estimation of relative risk. Bernoulli,
1:3–16, 1995.
[60] J. F. C Kingman. Poisson Processes. Oxford University Press, Oxford, 1993.
[61] L. J. Kinlen. Evidence for an infective cause for childhood leukaemia: a
Scottish new town compared to nuclear reprocessing sites. Lancet, 1988.
[62] E. L. Lehmann. Elements of large-sample theory. Springer-Verlag New York,
New York, 1999.
[63] J. A. Ludwig and J. F. Reynolds. Statistical Ecology: a primer on methods
and computing. John Wiley and Sons, New York, 1988.
[64] A. J. B. Luiz. Determinacao da distribuicao espacial de pontos usando a
distancia ao vizinho mais proximo: Aplicacao em populacoes vegetais. Mas-
ter’s thesis, Universidade de Brasılia, Brazil, in Portuguese, 1995.
[65] M. N. M. van Lieshout. Stochastic Geometry Models in Image Analysis and
Spatial Statistics. PhD thesis, Free University of Amsterdam, 1994.
[66] M. N. M. van Lieshout and A. J. Baddeley. A nonparametric measure of
spatial interaction in point patterns. Statistica Neerlandica, 50:344–361, 1996.
[67] M. N. M. van Lieshout and A. J. Baddeley. Indices of dependence between
types in multivariate point patterns. Scandinavian Journal of Statistics,
26:511–532, 1999.
140 Bibliography
[68] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic
Press Inc., London, 1979.
[69] B. Matern. Spatial variation. Medd. Statens Skogforskringsinstitut, 49, 5,
Forest Research Institute of Sweden, 1960.
[70] B. Matern. Doubly stochastic poisson processes in the plane. In Statistical
Ecology: Spatial Patterns and Statistical Distributions, based on the Proceed-
ings of the International Symposium on Statistical Ecology, volume 1, pages
195–213, University Park and London, 1971. The Pennsylvania State Univer-
sity Press.
[71] B. Matern. Spatial Variation: Lecture Notes in Statistic. Spring-Verlag, Berlin,
1986.
[72] G. J. McLachlan and K. E. Basford. Mixture models: Inference and Applica-
tions to Clustering. M. Dekker, New York, 1988.
[73] M. L. Meirelles. Personal communication, 2003.
[74] M. L. Meirelles and A. J. B. Luiz. Padroes espaciais de arvores de um cerrado
em Brasılia, DF. Revta Brasil. Bot., Sao Paulo, 18:185–189, in Portuguese,
1995.
[75] M. L. Meirelles and A. J. B. Luiz. Personal communication, 2000.
[76] M. Morisita. Measuring of the dispersion of individuals and analysis of the
distributional patterns. Memoirs of the faculty of science, E 2, Kyushu Univ.
Ser., Kyushu University, 215–235, 1959.
[77] URL New York Botanical Garden. Ochnaceae, Gomphia acuminata
DC(isotype). http://image.nybg.org/herbim/2080/v-208-00428910big.jpg,
1811.
[78] J. Neyman and E. L. Scott. Statistical approach to problems of cosmology. J.
R. Statist. Soc. B, 20:1–43, 1958.
[79] J. Neyman and E. L Scott. Processes of clustering and applications. In Stochas-
tic Point Processes, P. W. A. Lewis, 1972.
[80] M. Numata. Forest vegetation in the vicinity of Choshi. Coastal flora and veg-
etation at Choshi, Chiba Prefecture IV. Bulletin Choshi Marine Laboratory,
Chiba University, in Japanese, 3:28–48, 1961.
Bibliography 141
[81] J. Ohser. On estimators for the reduced moment measure of point processes.
Math. Operationsforch. Statist. Ser. Statist., in German, 14:63–71, 1983.
[82] A. K. Penttinen and D. Stoyan. Statistical analysis for a class of line segment
processes. Scand. J. Statist., 16:153–161, 1989.
[83] Sandra M. C. Pereira. Um estudo descritivo sobre os criterios de determinacao
do numero de agrupamentos. Master’s thesis, Universidade de Brasılia, Brazil,
in Portuguese, 1993.
[84] Sandra M. C. Pereira. Analysis of spatial point processes based on the out-
puts of clustering algorithms. In Proceedings of the 23rd European Meeting of
Statisticians, Revista Estatıstica, Statistic Review, volume II, pages 309–310,
Instituto Nacional de Estatıstica, Portugal, 2001.
[85] W. J. Platt, G. W. Evans, and S. L. Rathbun. The population dynamics of
a long-lived Conifer (Pinus palustris). The American Naturalist, 131:491–525,
1988.
[86] W. M. Rand. Objective criteria for the evaluation of clustering methods. J.
Amer. Stat. Assoc., 66:846–850, 1971.
[87] S. L. Rathbun and N. Cressie. A space-time survival point process for a
longleaf pine forest in southern Georgia. Journal of the American Statistical
Association, 89:1164–1173, 1994.
[88] B. D. Ripley. Modelling spatial patterns (with discussion). Journal of the
Royal Statistical Society, Series B, 39:172–212, 1977.
[89] B. D. Ripley. Tests of randomness for spatial point patterns. Journal of the
Royal Statistical Society, Series B, 41:368–374, 1979.
[90] B. D. Ripley. Spatial Statistics. John Wiley and Sons, Inc., New York, 1981.
[91] B. D. Ripley. Statistical Inference for Spatial Processes. Cambridge University
Press, New York, 1988.
[92] G. G. Roussas. A First Course in Mathematical Statistics. Addison-Wesley
Publishing Company, Reading, 1973.
[93] M. Schlather. On the second-order characteristics of marked point processes.
Bernoulli, 7 (1):99–117, 2001.
142 Bibliography
[94] B. T. Scott. Summary Functions in the Analysis of Spatial Point Patterns.
PhD thesis, University of Western Australia, 2001.
[95] I. J. Smalley. Contraction crack networks in basalt flows. Geological Magazine,
103 (2):110–114, 1966.
[96] G. W. Snedecor and W. G. Cochran. Statistical Methods. Iowa State University
Press, Ames, 1980.
[97] R. R. Sokal and C. D. Michener. A statistical method for evaluating systematic
relationships. Univ. Kansas Sci. Bull., 38:1409–1438, 1958.
[98] T. Sørensen. A method of establishing groups of equal amplitude in plant
sociology based on similarity of species content. K. danske Vidensk. Selsk.
Skr. (biol), 5:1–34, 1948.
[99] M. A. Stephens. Tests based on edf statistics. In R. B. D’Agostino and M. A.
Stephens, Goodness-of-Fit Techniques, 1986.
[100] D. Stoyan. Correlations of the marks of marked point processes- statistical
inference and simple models. J. Inf. Process. Cybern., 20:285–294, 1984.
[101] D. Stoyan. On correlations of marked point processes. Math. Nachr., 116:197–
207, 1984.
[102] D. Stoyan, W. S. Kendall, and J. Mecke. Stochastic Geometry and Its Appli-
cations. John Wiley & Sons, Chichester, 1987.
[103] D. Stoyan and A. Penttinen. Recent applications of point process methods in
forestry statistics. Statistical Science, 15(1):61–78, 2000.
[104] D. Stoyan and H. Stoyan. Fractals, Random Shapes and Point Fields. John
Wiley & Sons, Chichester, 1994.
[105] D. J. Strauss. A model for clustering. Biometrika, 62:467–475, 1975.
[106] M. J. Symons, R. C. Grimson, and Y. C. Yuan. Clustering of rare events.
Biometrics, 39(1):193–205, 1983.
[107] E. Thonnes and M.N.M. van Lieshout. A comparative study on the power of
van Lieshout and Baddeley’s J-function. Research report 334, Department of
Statistics, University of Warwick, 1999.
[108] H. Thorisson. Coupling, Stationarity, and Regeneration. Springer-Verlag New
York, Inc., New York, 2000.
[109] G. J. G. Upton and B. Fingleton. Spatial Data Analysis by Example, Volume
1: Point Pattern and Quantitative Data. John Wiley and Sons, Inc., New
York, 1985.
[110] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-Plus.
Springer–Verlag New York, Inc., New York, 1999.
[111] R. Wakeford. Childhood leukaemia and nuclear installations. J. R. Statist.
Soc. A, 152:61–86, 1989.
[112] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, London,
UK, 1995.
[113] H. Wässle, B. B. Boycott, and R.-B. Illing. Morphology and mosaic of on- and
off-beta cells in the cat retina and some functional considerations. Proc. Roy.
Soc. London Ser. B, 212:177–195, 1981.
[114] M. B. Wilk and R. Gnanadesikan. Probability plotting methods for the anal-
ysis of data. Biometrika, 55:1–17, 1968.
[115] D. W. Woodland. Contemporary Plant Systematics. Prentice Hall, Englewood
Cliffs, 1991.
APPENDIX A
Results of the new strategy based on the Average Linkage
and Complete Linkage algorithms
A.1 Exploratory data analysis
Dendrograms of the Average Linkage and Complete Linkage applied to the stan-
dard point patterns: pines, cells, and redwoods are plotted in Figure A.1. The
datasets are presented in Section 2.1.
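The fusion distances read off such dendrograms can be illustrated with a minimal sketch, given here in Python using the standard library only. It is a naive agglomerative procedure written for clarity, not the implementation used for the figures. The function names `fusion_distances` and `fusion_distance_function` are hypothetical, and the empirical form taken for H(t), the proportion of fusion distances not exceeding t, is an assumption made for illustration.

```python
import math


def fusion_distances(points, method="average"):
    """Return the n - 1 fusion distances (merge heights) of a naive
    agglomerative clustering of 2-D points."""
    def linkage(a, b):
        # All pairwise distances between the members of the two clusters.
        pair = [math.dist(points[i], points[j]) for i in a for j in b]
        if method == "average":
            return sum(pair) / len(pair)
        if method == "complete":
            return max(pair)
        raise ValueError("method must be 'average' or 'complete'")

    clusters = [[i] for i in range(len(points))]  # start from singletons
    heights = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        h, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        heights.append(h)
        clusters[a] = clusters[a] + clusters[b]  # fuse cluster b into a
        del clusters[b]
    return heights


def fusion_distance_function(heights, t):
    """Assumed empirical form: proportion of fusion distances <= t."""
    return sum(h <= t for h in heights) / len(heights)
```

For three collinear points at 0, 1 and 5, the first fusion occurs at distance 1 under both algorithms; the second occurs at 4.5 under Average Linkage and at 5 under Complete Linkage.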
Relative distribution plots. Figures A.2 and A.3 show the relative probability
density functions, with pointwise 95% confidence intervals, of the fusion distance
functions H(t) for the pines, cells, and redwoods, plotted against the mean
H(t) of 1000 realisations of a binomial point process on the unit square. Figure
A.2 is based on the Average Linkage and Figure A.3 on the Complete Linkage.
A.2 Inference
A.2.1 Envelopes for P-P plots, Q-Q plots and A-A plots. Figures A.4,
A.5, and A.6 show the P-P plots, Q-Q plots and A-A plots, respectively, for the
pines, cells, and redwoods, with pointwise simulation envelopes at the 5%
significance level. The results are based on the Average Linkage and Complete
Linkage algorithms, and on 999 realisations of the binomial point process on the
unit square with the same intensities as the observed datasets.
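The construction of pointwise envelopes from simulated curves can be sketched as follows. The sketch assumes that the lower and upper envelopes are the k-th smallest and k-th largest simulated values at each grid point, so that with s simulations the pointwise two-sided level is 2k/(s + 1). The choice k = 25 for s = 999 is an assumption that reproduces the 5% level; it is not stated in the text, and the function names are hypothetical.

```python
def pointwise_envelope(sim_curves, k):
    """Lower and upper pointwise envelopes from simulated curves.

    sim_curves: list of curves, each a list of values on a common grid.
    k: rank of the order statistics used (k-th smallest, k-th largest).
    """
    lower, upper = [], []
    for values in zip(*sim_curves):  # all simulations at one grid point
        ordered = sorted(values)
        lower.append(ordered[k - 1])
        upper.append(ordered[-k])
    return lower, upper


def pointwise_level(n_sims, k):
    """Two-sided pointwise significance level of the k-th rank envelope."""
    return 2 * k / (n_sims + 1)
```

An observed curve that leaves the band [lower, upper] at a given grid point is significant at that point only; the level does not hold simultaneously over the whole grid.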
A.2.2 Bands for P-P plots, Q-Q plots and A-A plots. Figures A.7, A.8,
and A.9 show the P-P plots, Q-Q plots and A-A plots, respectively, for the pines,
cells, and redwoods, with simultaneous critical bands at the 5% significance
level. The results are based on the Average Linkage and Complete Linkage
algorithms, and on 999 realisations of the binomial point process on the unit
square with the same intensities as the observed datasets.
A.3 Random labelling hypothesis
The results of the extension based on the fusion distance function (Section 6.1)
applied to the bivariate point pattern Cat Retinal Ganglia, and based on the Average
Linkage and Complete Linkage, are presented next. Figures A.10 and A.11 show the
P-P plots, Figures A.12 and A.13 the Q-Q plots, and Figures A.14 and A.15 the
A-A plots, each with the envelopes and bands at the 5% significance level.
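The random labelling simulations behind these figures can be sketched as below: the point locations are held fixed and the type labels are randomly permuted, after which a summary statistic is recomputed for the points of one type. This is a minimal sketch under stated assumptions; `random_labelling` and its arguments are hypothetical names, and `statistic` stands for whatever summary is of interest (for example, a fusion distance function of the selected points).

```python
import random


def random_labelling(points, labels, statistic, target, n_perm=999, seed=1):
    """Recompute `statistic` on the points carrying label `target`
    after each random permutation of the type labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # locations stay fixed, labels move
        subset = [p for p, lab in zip(points, shuffled) if lab == target]
        values.append(statistic(subset))
    return values
```

Each permutation preserves the number of points of each type, so only the allocation of types to locations varies between simulations.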
Figure A.1: Dendrograms of the clustering algorithms applied to the spatial datasets:
(a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage;
right: (b),(d),(f): Complete Linkage.
Figure A.2: Relative probability density function (y-axis) of the fusion distances H(t)
plotted against H(t) (x-axis), with pointwise 95% confidence intervals, for the
datasets: pines (left), cells (centre), and redwoods (right); Average Linkage;
1000 realisations under H0.
Figure A.3: Relative probability density function (y-axis) of the fusion distances H(t)
plotted against H(t) (x-axis), with pointwise 95% confidence intervals, for the
datasets: pines (left), cells (centre), and redwoods (right); Complete Linkage;
1000 realisations under H0.
Figure A.4: Simulation Envelopes at significance level 5% for P-P plots applied to the
datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average
Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: P-P plots; dashed lines:
envelopes; dotted lines: identity line, 999 realisations under H0.
A.4 Histograms
The histograms of the fusion distances from the Average Linkage algorithm applied
to the point patterns Longleaf pines (Section 6.2) and Brazilian trees (Section
8.1) are shown in Figure A.16. Their shape suggests that a gamma distribution is
a reasonable candidate for approximating the distribution of the spatial Rg index
(Section 6.3).
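A gamma fit of this kind can be sketched with the method of moments, which matches the sample mean and variance to those of a gamma distribution with shape k and scale θ (mean kθ, variance kθ²). The method of moments is used here only for illustration; the fitting method is not specified in this appendix, and `gamma_moments` is a hypothetical name.

```python
import statistics


def gamma_moments(data):
    """Method-of-moments estimates (shape, scale) of a gamma distribution."""
    mean = statistics.fmean(data)
    var = statistics.variance(data)  # sample variance
    shape = mean * mean / var
    scale = var / mean
    return shape, scale
```

The product of the two estimates reproduces the sample mean, which gives a quick check on the fit before comparing the fitted density with the histogram.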
Figure A.5: Simulation envelopes at 5% significance level for Q-Q plots applied to the
datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average
Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: Q-Q plots; dashed lines:
envelopes; dotted lines: identity line, 999 realisations under H0.
Figure A.6: Simulation envelopes at 5% significance level for A-A plots applied to the
datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average
Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: A-A plots; dashed lines:
envelopes; dotted lines: identity line, 999 realisations under H0.
Figure A.7: Bands at 5% significance level for P-P plots applied to the datasets:
(a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage;
right: (b),(d),(f): Complete Linkage. Solid lines: P-P plots; dashed lines: bands;
dotted lines: identity line, 999 realisations under H0.
Figure A.8: Bands at 5% significance level for Q-Q plots of fusion distance func-
tion H(t) versus H(t) applied to the datasets: (a),(b) pines; (c),(d) cells; (e),(f)
redwoods; left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete Link-
age. Solid lines: Q-Q plots; dashed lines: bands; dotted lines: identity line, 999
realisations under H0.
Figure A.9: Bands at 5% significance level for A-A plots of $\arcsin\sqrt{1 - H(t)}$
versus $\arcsin\sqrt{1 - H(t)}$ applied to the datasets: (a),(b) pines; (c),(d) cells;
(e),(f) redwoods; left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete
Linkage. Solid lines: A-A plots; dashed lines: bands; dotted lines: identity line, 999
realisations under H0.
Figure A.10: Average Linkage P-P plots for the Cat Retinal Ganglia against the
random labelling hypothesis. First row: on cells (type 1), second row: off cells (type
2). Left: pointwise envelopes; right: critical bands. Solid lines: P-P plot, Dashed
lines: envelopes and bands, dotted lines: identity line; 5% significance level, 999
random permutations of the type labels.
Figure A.11: Complete Linkage P-P plots for the Cat Retinal Ganglia against the
random labelling hypothesis. First row: on cells (type 1), second row: off cells (type
2). Left: simulation envelopes; right: critical bands. Solid lines: P-P plot, Dashed
lines: envelopes and bands, dotted lines: identity line; 5% significance level; 999
random permutations of the type labels.
Figure A.12: Average Linkage Q-Q plots for the Cat Retinal Ganglia against the
random labelling hypothesis. First row: on cells (type 1), second row: off cells (type
2). Left: simulation envelopes; right: critical bands. Solid lines: Q-Q plot, Dashed
lines: envelopes and bands, dotted lines: identity line; 5% significance level, 999
random permutations of the type labels.
Figure A.13: Complete Linkage Q-Q plots for the Cat Retinal Ganglia against the
random labelling hypothesis. First row: on cells (type 1), second row: off cells (type
2). Left: simulation envelopes; right: critical bands. Solid lines: Q-Q plot, Dashed
lines: envelopes and bands, dotted lines: identity line. 5% significance level; 999
random permutations of the type labels.
Figure A.14: Average Linkage A-A plots for the Cat Retinal Ganglia against the
random labelling hypothesis. First row: on cells (type 1), second row: off cells (type
2). Left: simulation envelopes, right: critical bands. Solid lines: A-A plot, Dashed
lines: envelopes and bands, dotted lines: identity line; 5% significance level; 999
random permutations of the type labels.
Figure A.15: Complete Linkage A-A plots for the Cat Retinal Ganglia against the
random labelling hypothesis. First row: on cells (type 1), second row: off cells (type
2). Left: simulation envelopes; right: critical bands. Solid lines: A-A plot, Dashed
lines: envelopes and bands, dotted lines: identity line; 5% significance level, 999
random permutations of the type labels.
[Figure A.16 panels: Longleaf pines (left), Brazilian trees (right); x-axis: fusion distances h_k; y-axis: frequency.]
Figure A.16: Histograms of the fusion distances from the Average Linkage den-
drograms applied to the point patterns. Left: Longleaf pines (Section 6.2); right:
Brazilian trees (Section 8.1).
APPENDIX B
Power of the test: fusion distance function
B.1 Cluster alternative
Estimated power. Tables B.1 (a)–(e) present the estimated powers of Monte
Carlo tests of CSR against the Matern cluster model with the parameters described
in Section 5.3. The tests use the supremum distance and are based on
the fusion distance function.
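The mechanics of a Monte Carlo test with a supremum distance can be sketched as follows. The sketch assumes that each curve, evaluated on a grid up to the upper limit t1, is compared with the mean of the remaining curves, and that the p-value is the rank of the observed statistic among the s + 1 statistics. This is a common construction and may differ in detail from the one defined in Section 5.3; the function names are hypothetical.

```python
def sup_distance(curve, reference):
    """Supremum (maximum absolute) distance between two curves on a grid."""
    return max(abs(a - b) for a, b in zip(curve, reference))


def monte_carlo_p_value(observed, simulated):
    """Monte Carlo p-value of the observed curve against simulated curves.

    Each curve is compared with the mean of all the other curves, and the
    observed statistic is ranked among the resulting values.
    """
    curves = [observed] + list(simulated)
    n = len(curves)
    totals = [sum(col) for col in zip(*curves)]  # column sums over curves
    stats = []
    for curve in curves:
        # Leave-one-out mean: remove this curve's value from each total.
        ref = [(tot - v) / (n - 1) for tot, v in zip(totals, curve)]
        stats.append(sup_distance(curve, ref))
    larger = sum(1 for u in stats[1:] if u >= stats[0])
    return (larger + 1) / n
```

With 99 simulated realisations the smallest attainable p-value is 1/100, and a test at the 5% level rejects when the observed statistic is among the five largest of the 100 values. The power is then estimated as the proportion of rejections over repeated realisations of the alternative.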
Power explanation: Q-Q plots. Figures B.1–B.5 show the quantiles of 100
realisations of the fusion distance functions from the Poisson process with λ = 100
plotted against the quantiles of 100 realisations of the fusion distance functions from
the Matern cluster process with λp = 5, λc = 20, r = 0.005. The upper limit t1 ranges
over [0, 0.22] in increments of 0.005. Note that the quantiles of the fusion distance
functions from both models (Poisson and Matern cluster) are equal for t1 = 0.12 (see
Figure B.3). However, for t1 > 0.13, the fusion distance functions of the two models
differ (see Figures B.3–B.5).
B.2 Inhibition alternative
Estimated power. Table B.2 presents the estimated powers of the test of CSR
against the Matern model II with the parameters described previously. The tests use
the supremum distance and are based on the fusion distance function.
Power explanation: Q-Q plots. Figures B.6–B.10 show the quantiles of 100
realisations of the fusion distance functions from the Poisson process with λ = 100
plotted against the quantiles of 100 realisations of the fusion distance functions from
the Matern model II with λ0 = 200, r = 0.005. The upper limit t1 ranges over [0, 0.2]
in increments of 0.005. Observe that the quantiles of the fusion distance functions for
the two models (Poisson and Matern model II) lie very close to the identity line,
demonstrating that realisations of the Matern model II with the specified parameters
are very similar to those of the homogeneous Poisson process. Therefore, the power of
the test for these model parameters is very small or zero.
Table B.1 (a): λp = 5 parents
t1 \ r  0.005  0.01   0.025  0.05   0.075  0.1    0.125  0.15   0.175  0.2
0.00    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01    1.00   1.00   1.00   1.00   0.99   0.95   0.82   0.63   0.45   0.35
0.02    1.00   1.00   1.00   1.00   1.00   1.00   0.99   0.93   0.83   0.69
0.03    1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.94   0.86
0.04    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.97   0.90
0.05    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.97   0.92
0.06    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.93
0.07    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.93
0.08    1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.99   0.97   0.91
0.09    1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.93   0.86
0.10    1.00   1.00   1.00   1.00   0.99   0.98   0.97   0.92   0.86   0.72
0.11    0.84   0.80   0.78   0.72   0.72   0.73   0.72   0.68   0.61   0.50
0.12    0.06   0.07   0.05   0.07   0.10   0.16   0.22   0.23   0.24   0.24
0.13    0.03   0.04   0.03   0.02   0.03   0.04   0.05   0.06   0.09   0.11
0.14    0.03   0.04   0.03   0.04   0.04   0.06   0.08   0.10   0.12   0.15
0.15    0.07   0.09   0.09   0.14   0.16   0.18   0.18   0.20   0.22   0.25
0.16    0.45   0.45   0.42   0.39   0.39   0.37   0.33   0.34   0.35   0.37
0.17    0.83   0.80   0.76   0.69   0.63   0.58   0.51   0.52   0.51   0.52
0.18    0.91   0.90   0.85   0.80   0.74   0.70   0.61   0.63   0.59   0.62
0.19    0.98   0.98   0.97   0.93   0.88   0.86   0.78   0.81   0.73   0.75
0.20    0.98   0.98   0.97   0.93   0.89   0.87   0.79   0.81   0.74   0.78
0.21    0.98   0.98   0.97   0.92   0.89   0.86   0.81   0.80   0.76   0.81
0.22    0.98   0.99   0.97   0.94   0.91   0.88   0.86   0.85   0.84   0.86
Table B.1 (b): λp = 10 parents
t1 \ r  0.005  0.01   0.025  0.05   0.075  0.1    0.125  0.15   0.175  0.2
0.00    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01    1.00   1.00   1.00   1.00   0.93   0.74   0.48   0.32   0.19   0.14
0.02    1.00   1.00   1.00   1.00   1.00   0.97   0.84   0.66   0.47   0.36
0.03    1.00   1.00   1.00   1.00   1.00   0.99   0.94   0.85   0.65   0.50
0.04    1.00   1.00   1.00   1.00   1.00   1.00   0.97   0.89   0.76   0.62
0.05    1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.93   0.78   0.65
0.06    1.00   1.00   1.00   1.00   1.00   0.99   0.99   0.92   0.79   0.68
0.07    1.00   1.00   1.00   1.00   1.00   0.99   0.98   0.91   0.77   0.65
0.08    1.00   1.00   1.00   1.00   1.00   0.99   0.96   0.85   0.69   0.59
0.09    1.00   1.00   1.00   1.00   0.99   0.97   0.89   0.76   0.59   0.50
0.10    0.84   0.78   0.82   0.81   0.85   0.81   0.70   0.57   0.41   0.38
0.11    0.02   0.03   0.04   0.11   0.28   0.35   0.38   0.32   0.23   0.27
0.12    0.00   0.00   0.00   0.00   0.02   0.07   0.10   0.14   0.13   0.15
0.13    0.02   0.02   0.03   0.03   0.04   0.05   0.06   0.08   0.11   0.14
0.14    0.34   0.35   0.28   0.22   0.16   0.14   0.13   0.15   0.18   0.21
0.15    0.80   0.79   0.70   0.56   0.40   0.31   0.27   0.26   0.29   0.34
0.16    0.96   0.95   0.89   0.78   0.60   0.49   0.42   0.40   0.42   0.43
0.17    0.99   0.98   0.96   0.89   0.75   0.65   0.59   0.53   0.54   0.57
0.18    0.99   0.99   0.97   0.92   0.81   0.70   0.65   0.61   0.60   0.65
0.19    1.00   1.00   0.99   0.96   0.90   0.83   0.79   0.73   0.73   0.78
0.20    1.00   1.00   0.98   0.96   0.87   0.81   0.79   0.75   0.77   0.82
0.21    1.00   0.99   0.98   0.94   0.86   0.81   0.80   0.79   0.81   0.86
0.22    0.99   0.99   0.98   0.95   0.88   0.85   0.86   0.86   0.88   0.92
Table B.1 (c): λp = 20 parents
t1 \ r  0.005  0.01   0.025  0.05   0.075  0.1    0.125  0.15   0.175  0.2
0.00    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01    1.00   1.00   1.00   0.95   0.63   0.35   0.22   0.11   0.09   0.07
0.02    1.00   1.00   1.00   1.00   0.96   0.75   0.51   0.29   0.20   0.17
0.03    1.00   1.00   1.00   1.00   0.99   0.88   0.68   0.49   0.33   0.25
0.04    1.00   1.00   1.00   1.00   1.00   0.94   0.77   0.57   0.42   0.33
0.05    1.00   1.00   1.00   1.00   1.00   0.96   0.81   0.65   0.46   0.37
0.06    1.00   1.00   1.00   1.00   0.99   0.95   0.80   0.64   0.49   0.39
0.07    1.00   1.00   1.00   1.00   0.99   0.94   0.75   0.59   0.45   0.36
0.08    1.00   1.00   1.00   1.00   0.97   0.88   0.65   0.53   0.42   0.32
0.09    0.64   0.69   0.76   0.87   0.85   0.72   0.52   0.39   0.32   0.28
0.10    0.01   0.02   0.10   0.31   0.46   0.42   0.31   0.26   0.22   0.21
0.11    0.01   0.01   0.01   0.02   0.10   0.14   0.16   0.14   0.14   0.14
0.12    0.13   0.11   0.08   0.03   0.04   0.04   0.08   0.11   0.11   0.12
0.13    0.58   0.50   0.37   0.19   0.12   0.08   0.11   0.11   0.12   0.14
0.14    0.90   0.85   0.69   0.45   0.29   0.17   0.17   0.18   0.19   0.21
0.15    0.98   0.95   0.86   0.68   0.47   0.33   0.28   0.31   0.29   0.29
0.16    0.99   0.98   0.93   0.79   0.59   0.45   0.42   0.43   0.40   0.42
0.17    1.00   1.00   0.96   0.86   0.68   0.56   0.51   0.55   0.52   0.55
0.18    0.99   0.99   0.94   0.86   0.71   0.62   0.61   0.63   0.63   0.67
0.19    1.00   0.99   0.97   0.91   0.79   0.74   0.72   0.72   0.76   0.76
0.20    1.00   0.98   0.94   0.89   0.79   0.74   0.76   0.79   0.82   0.84
0.21    0.99   0.97   0.92   0.86   0.79   0.77   0.81   0.84   0.86   0.91
0.22    0.97   0.95   0.92   0.88   0.84   0.88   0.90   0.91   0.93   0.95
Table B.1 (d): λp = 25 parents
t1 \ r  0.005  0.01   0.025  0.05   0.075  0.1    0.125  0.15   0.175  0.2
0.00    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01    1.00   1.00   1.00   0.91   0.55   0.29   0.16   0.11   0.09   0.07
0.02    1.00   1.00   1.00   1.00   0.91   0.63   0.37   0.24   0.17   0.13
0.03    1.00   1.00   1.00   1.00   0.98   0.83   0.56   0.39   0.29   0.19
0.04    1.00   1.00   1.00   1.00   0.99   0.89   0.66   0.46   0.35   0.25
0.05    1.00   1.00   1.00   1.00   0.99   0.91   0.68   0.52   0.40   0.31
0.06    1.00   1.00   1.00   1.00   0.98   0.90   0.67   0.52   0.42   0.30
0.07    1.00   1.00   1.00   1.00   0.98   0.87   0.61   0.48   0.40   0.31
0.08    0.97   0.98   0.98   0.99   0.92   0.75   0.54   0.40   0.35   0.28
0.09    0.32   0.37   0.53   0.75   0.73   0.58   0.41   0.33   0.28   0.24
0.10    0.00   0.01   0.04   0.20   0.34   0.34   0.25   0.22   0.21   0.17
0.11    0.03   0.02   0.01   0.01   0.08   0.13   0.13   0.14   0.16   0.14
0.12    0.25   0.20   0.14   0.05   0.04   0.06   0.10   0.11   0.14   0.13
0.13    0.67   0.62   0.43   0.20   0.12   0.09   0.11   0.12   0.15   0.13
0.14    0.91   0.87   0.71   0.43   0.28   0.18   0.17   0.19   0.21   0.20
0.15    0.96   0.96   0.84   0.62   0.44   0.30   0.28   0.29   0.31   0.32
0.16    0.98   0.97   0.90   0.73   0.55   0.39   0.40   0.42   0.43   0.42
0.17    0.99   0.98   0.93   0.81   0.66   0.53   0.51   0.53   0.55   0.55
0.18    0.98   0.98   0.92   0.81   0.68   0.60   0.61   0.66   0.65   0.67
0.19    0.98   0.99   0.95   0.85   0.75   0.70   0.73   0.77   0.77   0.80
0.20    0.96   0.97   0.91   0.80   0.76   0.72   0.78   0.83   0.84   0.87
0.21    0.94   0.95   0.89   0.78   0.79   0.77   0.84   0.90   0.90   0.92
0.22    0.93   0.93   0.89   0.84   0.87   0.87   0.91   0.95   0.96   0.96
Table B.1 (e): λp = 50 parents
t1 \ r  0.005  0.01   0.025  0.05   0.075  0.1    0.125  0.15   0.175  0.2
0.00    1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01    1.00   1.00   1.00   0.60   0.25   0.11   0.07   0.06   0.04   0.05
0.02    1.00   1.00   1.00   0.93   0.53   0.26   0.17   0.14   0.08   0.08
0.03    1.00   1.00   1.00   0.99   0.74   0.40   0.23   0.15   0.12   0.10
0.04    1.00   1.00   1.00   0.99   0.82   0.50   0.33   0.20   0.15   0.13
0.05    1.00   1.00   1.00   0.99   0.82   0.55   0.35   0.22   0.20   0.14
0.06    1.00   1.00   1.00   0.98   0.80   0.51   0.34   0.24   0.19   0.17
0.07    0.84   0.88   0.91   0.91   0.71   0.48   0.32   0.23   0.19   0.17
0.08    0.27   0.31   0.51   0.67   0.57   0.38   0.25   0.19   0.17   0.15
0.09    0.03   0.03   0.08   0.26   0.36   0.26   0.19   0.14   0.15   0.14
0.10    0.06   0.04   0.03   0.05   0.15   0.13   0.12   0.12   0.12   0.12
0.11    0.20   0.17   0.09   0.04   0.07   0.08   0.09   0.08   0.09   0.10
0.12    0.48   0.42   0.27   0.11   0.06   0.08   0.08   0.08   0.09   0.11
0.13    0.71   0.65   0.47   0.24   0.12   0.11   0.10   0.10   0.10   0.13
0.14    0.81   0.78   0.62   0.37   0.20   0.17   0.16   0.18   0.18   0.21
0.15    0.85   0.82   0.69   0.48   0.31   0.26   0.26   0.28   0.29   0.32
0.16    0.85   0.82   0.71   0.53   0.40   0.37   0.39   0.41   0.43   0.46
0.17    0.86   0.83   0.72   0.61   0.51   0.51   0.54   0.55   0.56   0.61
0.18    0.81   0.78   0.68   0.65   0.61   0.61   0.67   0.68   0.69   0.71
0.19    0.81   0.79   0.74   0.72   0.72   0.72   0.78   0.82   0.82   0.83
0.20    0.77   0.74   0.74   0.76   0.78   0.82   0.86   0.88   0.89   0.90
0.21    0.78   0.76   0.78   0.81   0.86   0.88   0.91   0.95   0.94   0.94
0.22    0.84   0.86   0.87   0.92   0.94   0.94   0.96   0.98   0.98   0.98
Table B.1: Power of Monte Carlo tests of CSR against Matern cluster process with
parameters λp, λc, r; where λp, r are varying as shown; λc is adjusted to keep
intensity of the process constant at 100. Test uses 99 realisations of CSR. Power es-
timated from 1000 realisations under Matern cluster processes; supremum distance,
Single Linkage.
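The parametrisation in the caption, with the process intensity λp λc held at 100, can be illustrated by a minimal simulation sketch of a Matern cluster process on the unit square. The sketch assumes that parents are generated on the window enlarged by r on each side, so that clusters centred just outside the square still contribute offspring, and it uses Knuth's multiplication method for the Poisson counts. All function names are hypothetical, and the edge treatment may differ from the one used to produce the tables.

```python
import math
import random


def poisson(rng, mean):
    """Poisson variate by Knuth's multiplication method (small means only)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def offspring_mean(lambda_p, intensity=100):
    """Mean offspring per parent keeping lambda_p * lambda_c = intensity."""
    return intensity / lambda_p


def matern_cluster(lambda_p, lambda_c, r, rng):
    """One realisation of a Matern cluster process on the unit square."""
    side = 1 + 2 * r  # enlarged window for the parents
    points = []
    for _ in range(poisson(rng, lambda_p * side * side)):
        px = rng.uniform(-r, 1 + r)
        py = rng.uniform(-r, 1 + r)
        for _ in range(poisson(rng, lambda_c)):
            # Offspring uniformly distributed in the disc of radius r.
            rho = r * math.sqrt(rng.random())
            theta = 2 * math.pi * rng.random()
            x = px + rho * math.cos(theta)
            y = py + rho * math.sin(theta)
            if 0 <= x <= 1 and 0 <= y <= 1:
                points.append((x, y))
    return points
```

For λp = 5, 10, 20, 25 and 50 the corresponding values of λc are 20, 10, 5, 4 and 2, so each table refers to a process with the same expected number of points as the Poisson null model with λ = 100.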
[Figure B.1 panels: x-axis, quantiles of H(t) of Poisson; y-axis, quantiles of H(t) of Cluster.]
Figure B.1: Typical Q-Q plots of 100 realisations of fusion distance functions for a
homogeneous Poisson and 100 realisations of fusion distance functions for Matern
cluster processes. The upper limit t1 ∈ [0; 0.04] by an increment of 0.005. Solid line:
Q-Q plot, dotted line: identity line, Single Linkage.
[Figure B.2 panels: x-axis, quantiles of H(t) of Poisson; y-axis, quantiles of H(t) of Cluster.]
Figure B.2: Typical Q-Q plots of 100 realisations of fusion distance functions for a
homogeneous Poisson and 100 realisations of fusion distance functions for Matern
cluster processes. The upper limit t1 ∈ [0.045; 0.085] by an increment of 0.005. Solid
line: Q-Q plot, dotted line: identity line, Single Linkage.
[Figure B.3 panels: x-axis, quantiles of H(t) of Poisson; y-axis, quantiles of H(t) of Cluster.]
Figure B.3: Typical Q-Q plots of 100 realisations of fusion distance functions for a
homogeneous Poisson and 100 realisations of fusion distance functions for Matern
cluster processes. The upper limit t1 ∈ [0.09; 0.13] by an increment of 0.005. Solid
line: Q-Q plot, dotted line: identity line, Single Linkage.
[Figure B.4 panels: x-axis, quantiles of H(t) of Poisson; y-axis, quantiles of H(t) of Cluster.]
Figure B.4: Typical Q-Q plots of 100 realisations of fusion distance functions for a
homogeneous Poisson and 100 realisations of fusion distance functions for Matern
cluster processes. The upper limit t1 ∈ [0.135; 0.175] by an increment of 0.005. Solid
line: Q-Q plot, dotted line: identity line, Single Linkage.
[Figure B.5 panels: x-axis, quantiles of H(t) of Poisson; y-axis, quantiles of H(t) of Cluster.]
Figure B.5: Typical Q-Q plots of 100 realisations of fusion distance functions for
Poisson and 100 realisations of fusion distance functions for Matern cluster pro-
cesses. The upper limit t1 ∈ [0.18; 0.22] by an increment of 0.005. Solid line: Q-Q
plot, dotted line: identity line, Single Linkage.
[Figure B.6 panels: x-axis, quantiles of H(t) of Poisson; y-axis, quantiles of H(t) of Matern II.]
Figure B.6: Typical Q-Q plots of 100 realisations of fusion distance functions for a
homogeneous Poisson and of 100 realisations of fusion distance functions for Matern
model II processes. The upper limit t1 ∈ [0; 0.04] by an increment of 0.005. Solid
lines: Q-Q plot, dotted lines: identity line, Single Linkage.
[Figure B.7 panels: x-axis, quantiles of H(t) of Poisson; y-axis, quantiles of H(t) of Matern II.]
Figure B.7: Typical Q-Q plots of 100 realisations of fusion distance functions for a
homogeneous Poisson process and 100 realisations of fusion distance functions for
Matern model II processes. The upper limit t1 ∈ [0.045; 0.085] by an increment of
0.005. Solid lines: Q-Q plot, dotted lines: identity line, Single Linkage.
[Figure B.8 panels: x-axis, quantiles of H(t) of Poisson; y-axis, quantiles of H(t) of Matern II.]
Figure B.8: Typical Q-Q plots of 100 realisations of fusion distance functions for a
homogeneous Poisson process and 100 realisations of fusion distance functions for
Matern model II processes. The upper limit t1 takes values in [0.09; 0.13] in
increments of 0.005. Solid lines: Q-Q plot, dotted lines: identity line, Single Linkage.
[Figure B.9: nine Q-Q plot panels, one for each value of t1. Horizontal axis: quantiles of H(t) of Poisson. Vertical axis: quantiles of H(t) of Matern II.]
Figure B.9: Typical Q-Q plots of 100 realisations of fusion distance functions for a
homogeneous Poisson process and 100 realisations of fusion distance functions for
Matern model II processes. The upper limit t1 takes values in [0.135; 0.175] in
increments of 0.005. Solid lines: Q-Q plot, dotted lines: identity line, Single Linkage.
[Figure B.10: five Q-Q plot panels, one for each value of t1. Horizontal axis: quantiles of H(t) of Poisson. Vertical axis: quantiles of H(t) of Matern II.]
Figure B.10: Typical Q-Q plots of 100 realisations of fusion distance functions for
a homogeneous Poisson process and 100 realisations of fusion distance functions for
Matern model II processes. The upper limit t1 takes values in [0.18; 0.2] in increments
of 0.005. Solid lines: Q-Q plot, dotted lines: identity line, Single Linkage.
Inhibition alternative: Matern model II
t1     0.005  0.01   0.015  0.02   0.025  0.03   0.035  0.04   0.045  0.05
0.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
0.02   0.04   0.03   0.26   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.03   0.05   0.04   0.03   0.21   0.69   1.00   1.00   1.00   1.00   1.00
0.04   0.07   0.07   0.05   0.03   0.06   0.42   0.99   1.00   1.00   1.00
0.05   0.16   0.12   0.13   0.10   0.05   0.04   0.27   0.93   1.00   1.00
0.06   0.18   0.17   0.21   0.24   0.19   0.20   0.02   0.22   0.76   1.00
0.07   0.20   0.19   0.40   0.50   0.38   0.32   0.14   0.03   0.19   0.83
0.08   0.09   0.23   0.49   0.60   0.55   0.64   0.41   0.18   0.03   0.21
0.09   0.09   0.23   0.49   0.64   0.73   0.75   0.73   0.56   0.16   0.04
0.10   0.10   0.23   0.50   0.55   0.76   0.80   0.82   0.72   0.43   0.14
0.11   0.19   0.22   0.47   0.52   0.71   0.79   0.83   0.77   0.54   0.39
0.12   0.08   0.20   0.44   0.46   0.65   0.72   0.78   0.72   0.50   0.42
0.13   0.08   0.13   0.34   0.34   0.52   0.55   0.62   0.49   0.46   0.47
0.14   0.23   0.40   0.50   0.50   0.65   0.70   0.77   0.66   0.63   0.69
0.15   0.41   0.59   0.71   0.68   0.80   0.82   0.87   0.79   0.77   0.85
0.16   0.57   0.74   0.84   0.78   0.90   0.91   0.93   0.86   0.83   0.92
0.17   0.80   0.84   0.89   0.83   0.94   0.95   0.97   0.87   0.87   0.96
0.18   0.88   0.91   0.94   0.87   0.96   0.97   0.98   0.99   0.89   0.98
0.19   0.94   0.94   0.96   0.98   0.98   0.99   0.99   0.99   0.89   0.99
0.20   0.97   0.97   0.98   0.99   0.99   0.99   1.00   0.99   0.90   0.99
Table B.2: Power of Monte Carlo tests of CSR against Matern model II processes
with parameters λ, r; where λ is chosen to achieve an intensity of 100. Test uses
99 realisations of CSR. Power estimated from 1000 realisations under Matern model
II; supremum distance, Single Linkage.
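The Monte Carlo test summarised in Table B.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used to produce the table: H(t) is taken to be the proportion of single-linkage fusion distances not exceeding t, the window is the unit square, CSR is simulated conditionally on the observed number of points, and the supremum distance compares each curve with the mean of the remaining curves over a grid on (0, t1]. All function names are hypothetical. The sketch relies on the known fact that single-linkage fusion distances coincide with the edge lengths of the Euclidean minimum spanning tree.

```python
import math
import random


def fusion_distances(points):
    """Single-linkage fusion distances, computed as the sorted edge lengths
    of the Euclidean minimum spanning tree (Prim's algorithm, O(n^2))."""
    n = len(points)
    if n < 2:
        return []
    best = [math.dist(points[0], p) for p in points]  # distance to the tree
    in_tree = [False] * n
    in_tree[0] = True
    edges = []
    for _ in range(n - 1):
        j = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        edges.append(best[j])
        in_tree[j] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[j], points[i]))
    return sorted(edges)


def h_curve(points, grid):
    """Assumed estimator of H(t): fraction of fusion distances <= t."""
    d = fusion_distances(points)
    if not d:
        return [0.0] * len(grid)
    return [sum(x <= t for x in d) / len(d) for t in grid]


def csr(n, rng):
    """n independent uniform points in the unit square (CSR given n)."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def monte_carlo_p_value(observed, t1, n_sim=99, n_grid=50, rng=None):
    """Rank-based p-value of the supremum-distance statistic on (0, t1]."""
    rng = rng or random.Random()
    grid = [t1 * (k + 1) / n_grid for k in range(n_grid)]
    curves = [h_curve(observed, grid)]
    curves += [h_curve(csr(len(observed), rng), grid) for _ in range(n_sim)]
    m = len(curves)
    totals = [sum(c[k] for c in curves) for k in range(n_grid)]
    # Each curve is compared with the mean of the other m - 1 curves.
    stats = [
        max(abs(c[k] - (totals[k] - c[k]) / (m - 1)) for k in range(n_grid))
        for c in curves
    ]
    rank = sum(s >= stats[0] for s in stats)  # counts the observed curve too
    return rank / m


def estimated_power(simulate, t1, n_rep, alpha=0.05, rng=None, **kwargs):
    """Fraction of n_rep patterns from `simulate(rng)` that are rejected."""
    rng = rng or random.Random()
    rejections = sum(
        monte_carlo_p_value(simulate(rng), t1, rng=rng, **kwargs) <= alpha
        for _ in range(n_rep)
    )
    return rejections / n_rep
```

With n_sim=99 the smallest attainable p-value is 0.01, so a test at the 5% level rejects when the observed statistic ranks among the five largest of the 100 values. Reproducing one table entry with 1000 realisations in pure Python would be slow; the sketch is intended only to make the procedure explicit.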
APPENDIX C
Complementary information on the Brazilian trees dataset
Height and diameter at breast height. Tables C.1 and C.2 show the observed
frequencies of the height and of the diameter at breast height (dbh) for the
Brazilian trees dataset.
Complete botanical classification. Table C.3 presents the complete botanical
classification of the Brazilian trees dataset (Chapter 8) into genus, species, family,
order, subclass and class, extracted by the author from [15, 115].
height (m) 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7
frequency 1 1 5 7 5 12 19 47 46 35
height (m) 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7
frequency 54 53 113 59 52 26 18 75 21 9
height (m) 2.8 2.9 3 3.1 3.2 3.3 3.4 3.5 3.6 3.7
frequency 6 7 52 12 5 5 9 28 7 4
height (m) 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6 4.7
frequency 6 9 52 10 7 6 4 28 4 1
height (m) 4.8 4.9 5 5.1 5.3 5.4 5.5 5.6 5.9 6
frequency 2 2 44 1 2 2 18 7 1 24
height (m) 6.1 6.2 6.3 6.4 6.5 6.6 7 7.2 7.3 7.4
frequency 4 3 3 3 18 3 20 1 1 2
height (m) 7.5 7.6 8 8.5 8.6 8.7 8.9 9 9.5 10
frequency 9 1 18 2 1 1 1 3 1 3
height (m) 11
frequency 1
Table C.1: Frequency of the height of the Brazilian trees.
dbh (m) 1 2 2.5 3 3.5 4 4.5 4.7 5 5.5
frequency 1 1 1 1 3 4 1 1 17 21
dbh (m) 6 6.5 7 7.1 7.5 7.7 8 8.5 9 9.5
frequency 55 44 193 1 56 1 149 44 110 29
dbh (m) 10 10.5 10.8 11 11.2 11.5 12 12.5 13 13.5
frequency 74 9 1 45 1 9 51 9 26 6
dbh (m) 14 14.5 15 15.5 16 16.5 17 17.5 18 18.5
frequency 24 7 25 4 11 4 17 2 6 1
dbh (m) 19 19.5 20 20.5 21 21.5 22 23 23.5 24
frequency 10 2 8 1 5 2 4 4 1 7
dbh (m) 25 25.5 26 26.5 28 29 32 32.5 33
frequency 1 1 4 1 1 2 1 1 1
Table C.2: Frequency of the dbh of the Brazilian trees.
Number Genus Species Family Order Subclass Class
1 Aspidosperma macrocarpon Apocynaceae Gentianales Asteridae Magnoliopsida
2 Aspidosperma tomentosum Apocynaceae Gentianales Asteridae Magnoliopsida
3 Bombax gracilipes Bombacaceae Malvales Dilleniidae Magnoliopsida
4 Bombax tomentosum Bombacaceae Malvales Dilleniidae Magnoliopsida
5 Bowdichia virgiloides Fabaceae Fabales Rosidae Magnoliopsida
6 NA NA Myrtaceae Myrtales Rosidae Magnoliopsida
7 Byrsonima coccolabifalia Malpighiaceae Polygalales Rosidae Magnoliopsida
8 Byrsonima crassa Malpighiaceae Polygalales Rosidae Magnoliopsida
9 Byrsonima NA Malpighiaceae Polygalales Rosidae Magnoliopsida
10 Caryocar brasiliense Caryocaraceae Theales Dilleniidae Magnoliopsida
11 Connarus fulvus Connaraceae Rosales Rosidae Magnoliopsida
12 Copaifera langsdorfii Fabaceae Fabales Rosidae Magnoliopsida
13 Dalbergia vidacea Fabaceae Fabales Rosidae Magnoliopsida
14 Davilla elliptica Dilleniaceae Dilleniales Dilleniidae Magnoliopsida
15 Didymopanax macrocarpum Araliaceae Apiales Rosidae Magnoliopsida
16 Siagrus NA NA NA NA NA
17 NA Ind.453 NA NA NA NA
18 Enterolobium ellipticum Fabaceae Fabales Rosidae Magnoliopsida
19 Eremanthus NA Asteraceae Asterales Asteridae Magnoliopsida
20 Erythroxylum suberosum Erythroxylaceae Linales Rosidae Magnoliopsida
21 Erythroxylum tortuosum Erythroxylaceae Linales Rosidae Magnoliopsida
22 NA Ind.445 NA NA NA NA
23 Butia NA Arecaceae Arecales Arecidae Liliopsida
24 Hymenaea stillocarpa Fabaceae Fabales Rosidae Magnoliopsida
25 Kielmeyera coriaceae NA NA NA NA
26 Lafoensia pacari Lythraceae Myrtales Rosidae Magnoliopsida
27 Palmeira NA Arecaceae Arecales Arecidae Liliopsida
28 Miconia ferruginata Melastomataceae Myrtales Rosidae Magnoliopsida
29 Miconia NA Melastomataceae Myrtales Rosidae Magnoliopsida
30 Mimosa claussenii Mimosaceae Fabales Rosidae Magnoliopsida
31 Myrica NA Myricaceae Myricales Hamamelidae Magnoliopsida
32 Ouratea acuminata Ochnaceae Theales Dilleniidae Magnoliopsida
33 Palicourea rigida Rubiaceae Rubiales Asteridae Magnoliopsida
34 NA Ind.443 NA NA NA NA
35 Piptocarpha rotundifolia Asteraceae Asterales Asteridae Magnoliopsida
36 Vochysia rufa Vochysiaceae Polygalales Rosidae Magnoliopsida
37 Plenckia populosea Celastraceae Celastrales Rosidae Magnoliopsida
38 Pouteria ramiflora Sapotaceae Ebenales Dilleniidae Magnoliopsida
39 Vochysia thyrsoidea Vochysiaceae Polygalales Rosidae Magnoliopsida
40 Pteredon pubescens Fabaceae Fabales Rosidae Magnoliopsida
41 Qualea grandiflora Vochysiaceae Polygalales Rosidae Magnoliopsida
42 Qualea multiflora Vochysiaceae Polygalales Rosidae Magnoliopsida
43 Qualea parviflora Vochysiaceae Polygalales Rosidae Magnoliopsida
44 Rapanea guyanensis Myrsinaceae Primulales Dilleniidae Magnoliopsida
45 Roupala montana Proteaceae Proteales Rosidae Magnoliopsida
46 Salacia crassifolia Hippocrateaceae Celastrales Rosidae Magnoliopsida
47 Sclerolobium aureum Fabaceae Fabales Rosidae Magnoliopsida
48 Stryphnodendron NA Fabaceae Fabales Rosidae Magnoliopsida
49 Styrax ferrugineus Styracaceae Ebenales Dilleniidae Magnoliopsida
50 Sweetia dasycarpa Fabaceae Fabales Rosidae Magnoliopsida
51 Symplocos revoluta Symplocaceae Ebenales Dilleniidae Magnoliopsida
52 NA Ind.192 NA NA NA NA
53 Strychnos NA Loganiaceae Gentianales Asteridae Magnoliopsida
54 Vellozia NA Velloziaceae Liliales Liliidae Liliopsida
55 Vochysia elliptica Vochysiaceae Polygalales Rosidae Magnoliopsida
56 Plathimenia reticulata Fabaceae Fabales Rosidae Magnoliopsida
Table C.3: Complete plant systematics of the Brazilian trees into genus, species,
family, order, subclass, and class. (“NA”: unknown). Source: [15, 115].