
Analysis of Spatial Point Patterns Using Hierarchical Clustering Algorithms

Sandra M. C. Pereira
Grad Dip (UFMS), BSc Hons, MSc (UnB), Brazil

This thesis is presented for the degree of

Doctor of Philosophy

of the University of Western Australia

School of Mathematics & Statistics.

September 2003


Abstract

This thesis is a new proposal for analysing spatial point patterns in spatial statistics using the outputs of popular techniques of (classical, non-spatial, multivariate) cluster analysis. The outputs of a chosen hierarchical algorithm, named fusion distances, are applied to investigate important spatial characteristics of a given point pattern.

The fusion distances may be regarded as a missing link between the fields of

spatial statistics and multivariate cluster analysis. Up to now, these two fields have

remained rather separate because of fundamental differences in approach. It is shown

that fusion distances are very good at discriminating different types of spatial point

patterns.

A detailed study of the power of the Monte Carlo test of the null hypothesis of Complete Spatial Randomness (the benchmark of spatial statistics) against chosen alternative models is also conducted. For instance, the test (based on the fusion distance) is very powerful for some arbitrary values of the parameters of the alternative.

A new general approach is developed for analysing a given point pattern using

several graphical techniques for exploratory data analysis and inference. The new

strategy is applied to univariate and multivariate point patterns. A new extension of

a popular strategy in spatial statistics, named the analysis of the local configuration,

is also developed. This new extension uses the fusion distances, and analyses a

localised neighbourhood of a given point of the point pattern.

A new spatial summary function and new statistics, namely the fusion distance function H(t), the area statistic A, the statistic S, and the spatial Rg index, are introduced and shown to be useful tools for identifying relevant features of spatial point patterns.

In conclusion, the new methodology using the outputs of hierarchical clustering algorithms can be considered an essential complement to the existing approaches in the spatial statistics literature.


Resumo (abstract originally in Portuguese, translated into English)

This doctoral thesis is a new proposal for analysing point patterns in spatial statistics using the hierarchical techniques of cluster analysis for multivariate datasets. The results obtained by applying a hierarchical algorithm chosen a priori, called fusion distances, are used to investigate the important characteristics of an arbitrary point pattern.

The fusion distances may be regarded as a bridge between the fields of spatial statistics and cluster analysis. Until now, these two fields have remained separate because of fundamental differences in methodology. It is shown that the fusion distances are very good at discriminating between different types of point patterns.

The power of the test of the null hypothesis of complete spatial randomness against the alternative hypothesis of non-randomness, based on spatial models of regularity and clustering, was studied using simulations. For example, the test (using the fusion distances) is very powerful for arbitrary values of the parameters of the selected alternative models.

A new general methodology is developed for studying point patterns using several techniques of exploratory data analysis and inference. The new methodology is applied to univariate and multivariate point patterns. A new extension of a popular method in spatial statistics, called the analysis of local configuration, is also developed. This extension uses the fusion distances and analyses a local neighbourhood of an arbitrary point of the point pattern.

Three new statistics and one new function are presented and defined in this thesis: the fusion distance function H(t); the area statistic A; the statistic S; and the spatial Rg index. It is shown that these new statistics are useful tools for identifying relevant properties of point patterns.

It is therefore hoped that the new procedure for analysing point patterns, founded on the fusion distances, will be an essential complement to the existing methods in the spatial statistics literature.


Statement of Originality

The research and computational work done in this thesis are wholly my own

composition. However, exception must be made for the cited, quoted references,

and ideas that are explicitly stated and acknowledged in my work.


“You are a child of the universe, no less than the trees and the stars and you have a right to be here.” Excerpt from Desiderata by Max Ehrmann [36].

Figure 1: A specimen of Ouratea acuminata which is the most frequent species found

in the Brazilian trees dataset. Source: [77].


Contents

Abstract
List of Tables
List of Figures
Acknowledgements

1 Introduction
1.1 Thesis research
1.2 Overview of the thesis

2 Spatial point patterns and cluster analysis
2.1 Spatial point patterns
2.2 Spatial clustering
2.3 Cluster analysis
2.4 Selected hierarchical clustering algorithms

3 Monte Carlo test
3.1 General case
3.2 Function estimate
3.3 P-P plot
3.4 Inapplicability of Monte Carlo test to P-P plot
3.5 Modified Monte Carlo test applied to P-P plot
3.6 Q-Q plot
3.7 A-A plot

4 New strategy for analysing point patterns
4.1 New summary function and statistic
4.1.1 Fusion distance function
4.1.2 Area statistic
4.2 Relative distribution
4.3 Description of strategy
4.3.1 Exploratory data analysis
4.3.2 Inference
4.4 Applications
4.4.1 Application to published point patterns
4.4.2 Application to simulated point patterns

5 Study of power
5.1 Models
5.2 Tests
5.3 Experimental study
5.4 Estimation of power
5.4.1 Test using supremum distance
5.4.2 Test using area statistic

6 Analysis of multivariate point patterns
6.1 Extension based on fusion distance function
6.2 Extension based on S statistic
6.3 Extension based on spatial Rg index

7 Analysis of local configuration
7.1 Strategy
7.2 Applications
7.2.1 Application to full redwoods
7.2.2 Application to Longleaf pines
7.2.3 Application to Lansing woods

8 Analysis of Brazilian trees point pattern
8.1 Brazilian trees point pattern
8.2 Analysis of univariate Brazilian trees dataset
8.3 Analysis of multivariate Brazilian trees dataset
8.4 Complementary analysis
8.4.1 Fusion distance function
8.4.2 Area statistic
8.4.3 S statistic and spatial Rg index
8.4.4 Gamma approximation for spatial Rg index
8.4.5 Analysis of local configuration

9 Conclusion and open problems
9.1 Problems studied and findings
9.2 Critique
9.3 Open problems

Bibliography

A New strategy based on the Average and Complete Linkage
A.1 Exploratory data analysis
A.2 Inference
A.2.1 Envelopes for P-P plots, Q-Q plots and A-A plots
A.2.2 Bands for P-P plots, Q-Q plots and A-A plots
A.3 Random labelling hypothesis
A.4 Histograms

B Power of the test: fusion distance function
B.1 Cluster alternative
B.2 Inhibition alternative

C Complementary information on the Brazilian trees dataset


List of Tables

4.1 Empirical area statistic
5.1 Power of test: clustering, area statistic
5.2 Power of test: inhibition, area statistic
6.1 S statistic
6.2 Two classifications
6.3 Spatial Rg index: Single Linkage, Average Linkage
6.4 Spatial Rg index and gamma approximation
7.1 Full redwoods: Single, Average, Complete Linkage
7.2 Contingency tables: Longleaf pines, Single, Average and Complete Linkage
7.3 Contingency table: Lansing woods, Average Linkage
7.4 Contingency table and Pearson residuals: Lansing woods, Average Linkage
8.1 Brazilian trees and ranked frequency of species
8.2 Brazilian trees dataset: seven subclasses, three classes, two types
8.3 Area statistic for Brazilian trees
8.4 S statistic, Rg index: Brazilian trees dataset
8.5 Monte Carlo null distribution of spatial Rg index
8.6 Contingency tables of Brazilian trees into seven, three, two types and groups
B.1 Power: cluster alternative
B.2 Power: inhibition alternative
C.1 Heights of Brazilian trees
C.2 Dbh of Brazilian trees
C.3 Brazilian trees’ plant systematics


List of Figures

1 Ouratea acuminata
2.1 Standard spatial datasets
2.2 Single Linkage dendrograms
3.1 Monte Carlo test applied to function estimates
3.2 Inapplicability of pointwise Monte Carlo test to P-P plots
3.3 Monte Carlo tests using critical band
4.1 Fusion distance function plots
4.2 Knee plots
4.3 Inverted knee plots
4.4 Single Linkage relative pdf plots
4.5 P-P plots
4.6 Simulation envelopes: P-P, Q-Q plots
4.7 Envelopes for A-A plots
4.8 Critical bands: P-P, Q-Q plots
4.9 Critical bands for A-A plots
4.10 Application of the new strategy: dataset 1
4.11 Application of the new strategy: dataset 2
5.1 Realisations from clustering
5.2 Inhibition model I
5.3 Power of tests: clustering
5.4 Power of tests: clustering, cont.
5.5 Interpretation of power: clustering
5.6 Power of test: inhibition
5.7 Interpretation of power: inhibition
5.8 Power of test: clustering, inhibition, area statistic
6.1 Cat Retinal Ganglia dataset
6.2 P-P plots for Cat Retinal Ganglia dataset
6.3 A-A plots for Cat Retinal Ganglia dataset
6.4 Simple example
6.5 Austin Hughes’ dataset
6.6 Classified Longleaf pines dataset
6.7 Clustered dataset
6.8 Full redwoods dataset
6.9 Gamma approximations
7.1 Kernel densities of full redwoods
7.2 Dendrograms of tvd for full redwoods
7.3 Local configuration of full redwoods
7.4 Local fusion distance function: full redwoods
7.5 Proportional Longleaf pines dataset
7.6 Kernel densities of Longleaf pines
7.7 Dendrograms of tvd for Longleaf pines
7.8 Local configuration of Longleaf pines
7.9 Local fusion distance function: Longleaf pines
7.10 Relative frequency barplot of dbh for two groups: Longleaf pines
7.11 Lansing woods dataset
7.12 Lansing woods dataset and six types
7.13 Local configuration of Lansing woods
7.14 Local fusion distance function: Lansing woods, four groups
8.1 Brazilian trees dataset
8.2 Brazilian trees: seven subclasses
8.3 Brazilian trees: three classes, two types
8.4 Barplots of species
8.5 Histogram and scatter plots of heights
8.6 Histogram and scatter plots of dbh’s
8.7 Box plots of top ten heights and dbh based on species
8.8 Scatter plots of top ten heights and dbh based on species
8.9 Mark correlation function
8.10 Three most frequent species
8.11 F-function for the most frequent species
8.12 G-cross for the most frequent species
8.13 J-cross for the most frequent species
8.14 K-cross for the most frequent species
8.15 J-cross for three classes
8.16 F-function for two types
8.17 G-cross for two types
8.18 J-cross for two types
8.19 K-cross for two types
8.20 Fusion distance function from Brazilian trees
8.21 Gamma approximation for Brazilian trees
8.22 Kernel densities of Brazilian trees
8.23 Dendrograms of tvd for Brazilian trees
8.24 Local configuration classification: Average Linkage
8.25 Local fusion distance function: seven, three, two groups
A.1 Average and Complete Linkage: dendrograms
A.2 Average Linkage relative pdf plots
A.3 Complete Linkage relative pdf plots
A.4 Average and Complete Linkage P-P plots: envelopes
A.5 Average and Complete Linkage Q-Q plots: envelopes
A.6 Average and Complete Linkage A-A plots: envelopes
A.7 Average and Complete Linkage P-P plots: bands
A.8 Average and Complete Linkage Q-Q plots: bands
A.9 Average and Complete Linkage A-A plots: bands
A.10 Average Linkage P-P plots for Cat Retinal Ganglia dataset
A.11 Complete Linkage P-P plots for Cat Retinal Ganglia dataset
A.12 Average Linkage Q-Q plots for Cat Retinal Ganglia dataset
A.13 Complete Linkage Q-Q plots for Cat Retinal Ganglia dataset
A.14 Average Linkage A-A plots for Cat Retinal Ganglia dataset
A.15 Complete Linkage A-A plots for Cat Retinal Ganglia dataset
A.16 Average Linkage: histograms
B.1 Power: Q-Q plots for clustering
B.2 Power: Q-Q plots for clustering, cont.

B.6 Power: Q-Q plots for inhibition
B.7 Power: Q-Q plots for inhibition, cont.


Acknowledgements

I am very grateful for the assistance of my supervisor, Prof. Adrian J. Baddeley,

in providing me with guidance and knowledge in the field of spatial statistics.

I am also grateful for helpful discussions with the following researchers: Dr. H.

Rue (on the area statistic), Prof. A. Unwin (on the knee plot), Dr. U. Hahn (on the

fusion distance function and power of the test), Prof. M. Handcock (on the relative

distribution plots), Prof. N. Cressie (on the analysis of local configuration), Dr. M.

Meirelles and Mr. A. Luiz (on the Brazilian trees point pattern).

My special thanks are given to Dr. J. Chia, Mr. T. Duong, Dr. R. Guidi, Dr. R. Milne, Dr. B. Turlach and Dr. M. van Lieshout for their enlightening suggestions, which helped to make my oral and written presentations more understandable.

I also express my gratitude to the University of Western Australia and to the School of Mathematics and Statistics, especially for granting me two and a half years of the University Postgraduate Scholarship (UPA).

Finally, my heartfelt thanks go to my dearest parents, Benedicto and Selia, my beloved husband Robert, my family and friends for their love, friendship, motivation, support and prayers over the eternity of my studies.

CHAPTER 1

Introduction

1.1 Thesis research

Spatial statistics. Spatial statistics is the analysis of data of any kind which are attributed to locations in space. Examples of spatial data are temperature recordings from a network of weather stations; measurements of soil properties at each location in a field; public health records of the incidence of new disease cases; and maps of the spatial locations of geological faults. Techniques of spatial statistics have been used in a variety of scientific fields such as biology, geostatistics, epidemiology, and pattern recognition. For more complete and detailed information on spatial statistics, see [24, 30, 90, 102, 109].

This work concerns the study of spatial point patterns in spatial statistics, in particular the analysis of spatial clustering. Given a pattern of points strewn over a plane, the objective of spatial clustering is to identify any clusters of points. This is important, for instance, in the analysis of occurrences of rare diseases, where a cluster of disease cases may indicate a common cause for the disease [3, 12, 111].

Most of the techniques in spatial statistics are designed to detect the presence of

clustering, but not to identify the clusters themselves. A pattern can be clustered

in the sense of spatial statistics without having any clearly identifiable clusters.

Clustering is vaguely defined in spatial statistics as the tendency of some points

of the pattern to be closer to each other than they would be expected to be on

average in a homogeneous Poisson point pattern, the benchmark of spatial statistics.

However, there are some exceptions where the benchmark is not the homogeneous

Poisson process. For instance, if the human population is not uniformly spread over

a country, then we may collect data about the local density of the population, such

as a sample of many non-disease cases called “controls”, and the null hypothesis is

that the disease cases are (inhomogeneous) Poisson with an intensity proportional

to the population density [59]. So a clustering of rare disease cases may also mean

a clustering relative to the expected pattern in the population.

The general methodology which has been developed for analysing spatial point patterns leads to a particular way of looking at the problem of identifying clusters in point patterns. For example, spatial summary functions such as the empty space function F, the nearest neighbour distance function G, the reduced second moment function K, and Van Lieshout and Baddeley’s J-function are useful tools for studying spatial clustering. This approach is described in several recent works [23, 32, 66, 67].


Non-spatial multivariate cluster analysis. A point in the plane may be identified by its coordinates (x, y), so that a pattern of n points (n ∈ N) in the plane may be thought of as a set of data recording the values of two random variables X and Y observed for each of n “points” or “objects”. Methods for identifying clusters of similar objects in such multivariate data have been developed in the field of multivariate cluster analysis [37, 45, 49, 68, 72]. This is a much larger and older field than that of spatial clustering, and offers hundreds of different techniques, algorithms and methods.
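
As a minimal sketch of this viewpoint, a point pattern can be passed as an n × 2 data matrix to a standard hierarchical clustering routine, and the heights at which clusters merge can be read off. The example assumes Python with NumPy and SciPy, uses Single Linkage with Euclidean distance, and takes the merge heights as the fusion distances; the function name `fusion_distances` and the toy coordinates are illustrative and are not taken from the thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def fusion_distances(points, method="single"):
    """Return the n - 1 merge heights of a hierarchical clustering.

    `points` is an (n, 2) array-like of planar coordinates.
    The name and interface are illustrative assumptions.
    """
    points = np.asarray(points, dtype=float)
    # linkage returns an (n - 1) x 4 array; column 2 holds the distance
    # at which each pair of clusters is merged.
    merges = linkage(points, method=method, metric="euclidean")
    return merges[:, 2]


# Hypothetical toy pattern: two tight pairs that lie far apart.
toy_pattern = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.1)]
toy_fusion = fusion_distances(toy_pattern)
```

For n points the routine returns n − 1 merge heights. In the toy pattern the two tight pairs merge at a distance of about 0.1, and the final merge happens at a much larger distance, which is the kind of contrast a clustered pattern would be expected to produce.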

It is important to notice that up to now, the fields of spatial statistics and

multivariate cluster analysis have remained rather separate, despite some connection

between them, because of the fundamental differences in approach.

Objectives. The main objective of this thesis is to combine techniques from spatial statistics and non-spatial multivariate cluster analysis to solve problems in spatial clustering. Instead of the usual statistical modelling process of formulating theoretical models, which leads to tests that may or may not perform well in practice, we start with a procedure, namely hierarchical clustering, which we know performs well and has an empirical basis; then we give it a formal inferential property.

The specific aims of this work are to investigate new spatial summary statistics

and functions using non-spatial multivariate cluster analysis and spatial statistics; to

construct inferential measures for identifying and validating clusters; and to develop

graphical techniques for spatial clustering.

This thesis adopts a novel strategy in combining and reconciling techniques from the analysis of spatial point patterns and hierarchical clustering algorithms. Hopefully, this work will not only lead to new interpretations of established knowledge but also to the discovery and creation of (alternative) complementary strategies to analyse univariate and multivariate point patterns.

1.2 Overview of the thesis

Chapter 2 presents a summarised description of the analysis of spatial point patterns and of multivariate cluster analysis.

Chapter 3 briefly reviews the Monte Carlo hypothesis test. Based on this methodology, a new graphical procedure for performing the Monte Carlo test with an exact significance level is presented. This modified version of the test is then applied to several graphical devices.
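
The basic Monte Carlo test that this chapter takes as its starting point can be sketched as a rank calculation: the observed statistic is ranked among statistics computed from patterns simulated under the null hypothesis. The sketch below assumes a one-sided test in which large values count as evidence against the null hypothesis; the function name `monte_carlo_p_value` is hypothetical, and the modified graphical procedure developed in the thesis is not reproduced here.

```python
import numpy as np


def monte_carlo_p_value(observed_stat, simulated_stats):
    """One-sided Monte Carlo p-value (large values are extreme).

    `simulated_stats` holds the statistic computed on m patterns
    simulated under the null hypothesis. Illustrative helper only.
    """
    simulated_stats = np.asarray(simulated_stats, dtype=float)
    m = simulated_stats.size
    # Count simulations at least as extreme as the observation;
    # ties are counted as extreme, which is the conservative choice.
    n_as_extreme = int(np.sum(simulated_stats >= observed_stat))
    # The observed value is included in both numerator and denominator.
    return (1 + n_as_extreme) / (m + 1)
```

With m = 99 simulations, rejecting when the p-value is at most 0.05 corresponds to the observed statistic ranking among the five largest of the 100 values.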


Chapter 4 introduces a new summary function, the fusion distance function, and a new statistic, the area statistic, which are based on the output of hierarchical clustering algorithms. Next, the chapter presents a new strategy for analysing point patterns which can distinguish between different types of patterns. The new strategy is an application of the fusion distance function and the modified version of the Monte Carlo test.

Chapter 5 investigates the power of the Monte Carlo test of Complete Spatial Randomness against two alternative models: spatial clustering and inhibition. The power of the test is estimated using simulation experiments. The fusion distance function and the area statistic (introduced in Chapter 4) are the basis of the calculation of the power of the test.

Chapter 6 presents two new methods and a modified version of a multivariate cluster analysis technique for analysing multivariate point patterns. The first method is based on the strategy introduced in Chapter 4. The second is based on a new summary index named the S statistic. Finally, the third is a modified version of a cluster analysis index, the Rg index, that has been adapted to the spatial context of the statistical analysis.

Chapter 7 investigates an alternative approach to analysing a local neighbourhood of a point pattern, named “the analysis of local configuration”. This approach is a new extension of a popular strategy in spatial statistics, the Local Indicators of Spatial Association (LISA). First, the probability density function of the fusion distances is estimated using kernel density techniques, and then the groups of the fusion distance probability densities are classified using a chosen distance measure. (This measure is known as the total variation distance.)
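
The two ingredients named here, a kernel density estimate and the total variation distance, can be illustrated with a short sketch. It assumes Gaussian kernels with SciPy's default bandwidth, the convention TVD(f, g) = ½ ∫ |f − g|, and a simple grid approximation of the integral; the thesis may use a different kernel, bandwidth or normalisation, and the function name `total_variation_distance` is illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde


def total_variation_distance(sample_a, sample_b, grid_size=512):
    """Approximate 0.5 * integral |f_a - f_b| for two 1-D samples.

    f_a and f_b are Gaussian kernel density estimates. Each sample
    needs at least two distinct values. Illustrative helper only.
    """
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pad = 0.25 * (hi - lo)  # extend the grid so the tails are covered
    grid = np.linspace(lo - pad, hi + pad, grid_size)
    f_a = gaussian_kde(a)(grid)
    f_b = gaussian_kde(b)(grid)
    step = grid[1] - grid[0]
    # Riemann-sum approximation of the integral on the grid.
    return 0.5 * np.sum(np.abs(f_a - f_b)) * step
```

Under this convention the distance lies between 0 (identical densities) and 1 (densities with no overlap), so a matrix of pairwise distances between estimated densities could then be handed to a hierarchical clustering routine.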

Chapter 8 introduces a large multivariate point pattern, named “the Brazilian trees dataset”. This point pattern is thoroughly analysed using (traditional) standard techniques of spatial statistics, and the new strategies developed in Chapters 4 and 6. The analysis of local configuration (Chapter 7) is also applied to the Brazilian trees dataset.

Chapter 9 presents a summary and critique of the research done in this thesis.

The main problems and findings of each chapter are discussed, and suggestions for

future work are also made.


CHAPTER 2

Spatial point patterns and cluster analysis

This chapter presents a summarised description of the analysis of point patterns, spatial clustering, and cluster analysis. In Section 2.1, a concise background on spatial point patterns is given, and three standard point patterns are presented. In Section 2.2, several references to methods for analysing spatial clustering are cited. Finally, in Sections 2.3 and 2.4, cluster analysis and a selection of hierarchical clustering algorithms are described, respectively.

2.1 Spatial point patterns

A spatial point pattern is defined as a set of locations regularly or irregularly distributed within a region of interest, which have been generated by some unknown random mechanism. A spatial point pattern may be interpreted as a realisation of a spatial point process. For instance, the standard benchmark in the spatial statistics literature for a point process is the homogeneous Poisson point process [24, 30, 60], also known as Complete Spatial Randomness or CSR. In other words, CSR characterises the absence of structure in the process. A definition of the homogeneous Poisson process is presented in Section 5.1. Further information on the theory of spatial point processes is presented in [26, 102].
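
A realisation of CSR in a rectangular window can be simulated in two steps: draw the number of points from a Poisson distribution whose mean is the intensity times the window area, then place that many points independently and uniformly in the window. The sketch below assumes Python with NumPy; `simulate_csr` and its parameters are illustrative names, and the formal definition used in the thesis is the one given in Section 5.1.

```python
import numpy as np


def simulate_csr(intensity, rng, width=1.0, height=1.0):
    """Simulate a homogeneous Poisson pattern in [0, width] x [0, height].

    `intensity` is the expected number of points per unit area and
    `rng` is a numpy.random.Generator. Illustrative helper only.
    """
    area = width * height
    # Step 1: the number of points is Poisson with mean intensity * area.
    n = rng.poisson(intensity * area)
    # Step 2: given n, the locations are independent and uniform.
    return rng.random((n, 2)) * np.array([width, height])


rng = np.random.default_rng(2003)  # arbitrary seed for reproducibility
csr_pattern = simulate_csr(intensity=100.0, rng=rng)
```

Patterns generated this way can serve as the simulated reference patterns in a Monte Carlo test of CSR.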

Figure 2.1: Standard spatial point patterns: pines (left), cells (centre) and redwoods (right), re-scaled to the unit square. Source: [30].

Examples of spatial point patterns are found in many fields of applied sciences such as biology, biostatistics, botany, environmental engineering, geography and astronomy. Figure 2.1 shows three standard datasets from spatial statistics: the Japanese black pine saplings, biological cell centres, and California redwood seedlings. (These point patterns are simple, extreme examples used in Chapter 4 for illustrative purposes; in Chapters 6–8, more complex, ambiguous and challenging point patterns will be described and analysed.) Henceforth, the standard datasets will be referred to as the pines, cells, and redwoods, respectively.

The pines (Figure 2.1 (left)) were extracted from a larger dataset published in [80], and show the locations of 65 Japanese black pine saplings in a square of side 5.7 m. The cells (Figure 2.1 (centre)), published in [25], show the locations of 42 cell centres in the rescaled unit square. The redwoods (Figure 2.1 (right)) were extracted from a larger dataset published in [105], and show the locations of 62 California redwood seedlings in a square of side approximately 23 m. These datasets were chosen by Diggle [30] to illustrate examples of random, regular and clustered point patterns, respectively. When looking at these examples of point patterns (see Figure 2.1), typical questions that may arise are as follows:

1. Are the points of the pines distributed at completely random locations?

2. Are the points of the cells attracting each other, or are they being repelled?

3. Is there any kind of dependence between points of the redwoods?

Some of the typical questions may be answered by considering traditional sum-

mary functions such as the empty space F -function, nearest neighbour distance

G-function, and reduced second moment K-function (also known as Ripley’s K-function). The definitions, properties and applications of these summary functions are reported by [24, 30, 90]. Most of the summary functions available in the spatial

statistics literature provide useful descriptions of a given point pattern.
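To make one of these summary functions concrete, the following minimal sketch estimates the nearest neighbour distance G-function of a pattern. It is only an illustration: no edge correction is applied (an assumption that the estimators in [24, 30, 90] do not make), and the function names and the simulated pattern are hypothetical.

```python
import math
import random


def nearest_neighbour_distances(points):
    """Distance from each point to its nearest other point in the pattern."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def g_hat(points, r):
    """Uncorrected empirical G-function: share of points with a neighbour within r."""
    nn = nearest_neighbour_distances(points)
    return sum(d <= r for d in nn) / len(nn)


# Hypothetical pattern: 65 uniform points in the unit square (no real data used).
rng = random.Random(1)
pattern = [(rng.random(), rng.random()) for _ in range(65)]
g_values = [g_hat(pattern, r) for r in (0.0, 0.05, 0.1, 2.0)]
```

Since the estimate is an empirical c.d.f. of the nearest neighbour distances, it is non-decreasing in r and reaches 1 once r exceeds the largest nearest neighbour distance.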

However, a practical interpretation of the existing summary functions might be

complicated. (A good reason for this complication might be that the traditional

summary functions might not take into consideration some fundamental aspects of

the given point pattern. For example, in biology and botany, it is important to take

into account the spread of seeds, interaction between plants, ecological conditions

for life, division and growth of cells, etc.) In our view, there is still a need to introduce new summary functions whose analytical results are easy to interpret.

There is also interest in finding new summary functions that might perform

better than existing summary functions at discriminating between different types of

patterns. For example, Van Lieshout and Baddeley [66] recently introduced a new

summary function which is a non-parametric measure of spatial interaction. This

function is called the J-function and for values J(r) = 1, the function suggests a

lack of interaction between points of the given pattern, that is, Complete Spatial Randomness. Deviations from the value 1 suggest spatial inhibition if J(r) > 1 or spatial clustering if J(r) < 1. The definition, properties and applications of Van Lieshout and Baddeley’s J-function are presented by [66, 67]. In Chapter 4, a new

summary function which is also easy to compute and interpret will be introduced.

It will be shown that this summary function performs well in practical applications.

2.2 Spatial clustering

Given a spatial point pattern, the aim of spatial clustering is to identify any

clusters of points. This identification is important, for instance, in the analysis

of occurrences of rare diseases, where a cluster of disease cases may indicate the

possibility of a common cause [3, 12, 32, 33, 61, 106, 111]. Some methods for

identifying clusters in spatial patterns have been developed in the spatial statistics

literature [8, 18, 24, 65, 109]. For instance, Van Lieshout [65] presents a Bayesian

approach to modelling data and unknown cluster centres in object recognition. In

particular, data and cluster centres are modelled as realisations of a point process.

More information on this technique is available in [65, Chapter 5]. Other methods for

analysing spatial clustering are also elaborated in several recent works [12, 23, 32, 71].

The approach proposed in this thesis is to apply a chosen hierarchical clustering

algorithm to a given point pattern, and to investigate the output of the algorithm. If

spatial clusters are detected then tools for identifying and validating clusters are also

introduced. To the best of our knowledge, this approach is new and more general

than analysing spatial clustering only. However, before developing our approach, a

summary of cluster analysis, hierarchical algorithms and main properties is presented

next.

2.3 Cluster analysis

Cluster analysis is a frequently used term for techniques which seek to separate

data into groups. For instance, let $x_1, \dots, x_n$ be observed measurements of $\ell$ variables on each of $n$ points or objects which are believed to be heterogeneous. Then the main objective of cluster analysis is to group these $n$ points into $g$ homogeneous classes or clusters, where $n, \ell, g \in \mathbb{N}$. Usually $g$ is much smaller than $n$. General

references for cluster analysis are [17, 37, 45, 49, 68, 72].

Most algorithms for finding clusters in a dataset are based on a measure of dis-

similarity between points. A dissimilarity coefficient d has the following properties:


1. $d(x_i, x_j) \geq 0$,

2. $d(x_i, x_j) = d(x_j, x_i)$,

3. $d(x_i, x_i) = 0$, where $i, j = 1, 2, \dots, n$.

Note that clustering algorithms may also be based on a measure of similarity between points. In this case, a similarity coefficient will have the scale reversed. A dissimilarity coefficient $d$ may satisfy a metric property

$$d(x_i, x_j) \leq d(x_i, x_k) + d(x_k, x_j), \tag{2.1}$$

or an ultrametric property

$$d(x_i, x_j) \leq \max\{d(x_i, x_k), d(x_k, x_j)\}, \tag{2.2}$$

where $i, j, k = 1, \dots, n$. For instance, a popular choice of a dissimilarity coefficient is the pairwise Euclidean distance given by $d(x_i, x_j) = \|x_i - x_j\|$, where $i \neq j$.
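The difference between properties (2.1) and (2.2) can be checked numerically. The sketch below is illustrative only: the helper names and the three collinear points are hypothetical, and the check is a brute-force scan over all triples, suitable for small examples.

```python
import itertools
import math


def euclidean(p, q):
    """Pairwise Euclidean distance between two points given as tuples."""
    return math.dist(p, q)


def is_metric(points, d, tol=1e-12):
    """Check the triangle inequality (2.1) over all ordered triples."""
    return all(
        d(a, b) <= d(a, c) + d(c, b) + tol
        for a, b, c in itertools.permutations(points, 3)
    )


def is_ultrametric(points, d, tol=1e-12):
    """Check the ultrametric inequality (2.2) over all ordered triples."""
    return all(
        d(a, b) <= max(d(a, c), d(c, b)) + tol
        for a, b, c in itertools.permutations(points, 3)
    )


pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # hypothetical collinear points
metric_ok = is_metric(pts, euclidean)
ultrametric_ok = is_ultrametric(pts, euclidean)
```

For these points the Euclidean distance satisfies (2.1) but not (2.2), because the two outer points are at distance 2 while each is at distance 1 from the middle point.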

An important concept in cluster analysis is a dendrogram which is regarded as a

two-dimensional diagram, and illustrates the fusions or partitions that are made at

each successive level of a hierarchical clustering algorithm. That is, the dendrogram

is a graphical representation of an ultrametric dissimilarity coefficient.

Figure 2.2: Dendrograms obtained by a hierarchical clustering algorithm, Single

Linkage, applied to the pines (left), cells (centre), and redwoods (right). The

pairwise Euclidean distance described previously in the text is the chosen dissimilar-

ity coefficient, the y-axis represents distance between clusters, and the datasets are

presented in Section 2.1.

Moreover, Jardine and Sibson [53] defined a dendrogram as a special function

that maps an ultrametric dissimilarity coefficient into the set of real numbers. Typical examples of a dendrogram are shown in Figure 2.2. (For each pair of clusters which are merged at a stage of the algorithm, a horizontal line is drawn, with y-coordinate equal to the minimum Euclidean distance between the two clusters. The

vertical and horizontal lines represent the tree structure of the successive mergers

of the clusters.)

A hierarchical clustering algorithm is considered as an approximation of a dis-

similarity coefficient by an ultrametric. A hierarchical technique classifies a dataset

into a hierarchy of partitions, building from the lowest level of n clusters, each con-

taining a single point, to a single cluster containing all n points. Consequently,

when a point is allocated to a group, this point is not allowed to be reallocated to

a different group as the number of clusters g decreases.

There are several hierarchical algorithms; for instance, [17] lists 23 different techniques, such as Single Linkage, Average Linkage, Complete Linkage, Ward’s Minimum Variance, and Centroid. Among them, Single Linkage, Average Linkage, and Complete Linkage are known to be the easiest and most commonly used in the cluster analysis literature. In this thesis, Single Linkage, Average Linkage and Complete

Linkage are chosen to be the foundation of the proposed strategies to analyse point

patterns. These three algorithms satisfy some important properties presented in

[38, 53, 58], for example, chaining effect, monotonicity, stability and ties. These

features make the selected algorithms more attractive than others.

In the next section, the chosen algorithms and main properties are briefly de-

scribed. More details and further information on hierarchical clustering algorithms,

applications and properties are also reported in [17, 38, 53, 58, 68].

2.4 Selected hierarchical clustering algorithms

Single Linkage is considered to be the simplest clustering algorithm, and was introduced by Florek, Lukaszewicz, Perkal, Steinhaus, and Zubrzycki [40]. The main

feature of Single Linkage is that the dissimilarity coefficient between groups is de-

fined as the distance between their closest pairs of points, one from each group.

Examples of dendrograms of the Single Linkage applied to the pines, cells and

redwoods, in which the dissimilarity coefficient is the pairwise Euclidean distance,

are shown in Figure 2.2. (The datasets are introduced in Section 2.1.) The following

description and notation of Single Linkage is quoted from [68].

Algorithm:

1. Order the $\frac{1}{2}n(n-1)$ dissimilarity coefficients into ascending order.

2. Let $C_1, \dots, C_n$ be the starting clusters each containing one point, namely $C_i = \{x_i\}$, where $i = 1, \dots, n$.

3. Let $d_{i_1 j_1} = \min\{d(x_i, x_j) : i \neq j,\ i, j = 1, \dots, n\}$ so that $x_{i_1}$ and $x_{j_1}$ are nearest. (For a point process, the probability of obtaining equal values for the smallest dissimilarity coefficients is equal to 0.) Then these two points are grouped into a cluster, so we have $(n-1)$ clusters, where $C_{i_1} \cup C_{j_1}$ is a new cluster. The value $d_{i_1 j_1}$ is called the first “fusion distance” $h_1$.

4. Let $d_{i_2 j_2}$ be the next smallest dissimilarity coefficient. If neither $i_1$ nor $j_1$ equals $i_2$ or $j_2$, the new $(n-2)$ clusters are $C_{i_1} \cup C_{j_1}$, $C_{i_2} \cup C_{j_2}$. If $i_2 = i_1$ and $j_1 \neq j_2$ the new $(n-2)$ clusters are $C_{i_1} \cup C_{j_1} \cup C_{j_2}$, plus the remaining old clusters. The value $d_{i_2 j_2}$ is called the second fusion distance $h_2$, where $h_1 \leq h_2$.

5. The process continues as described in item 4 through all $\frac{1}{2}n(n-1)$ dissimilarity coefficients. At the $k$th stage, let $d_{i_k j_k}$ denote the $k$th smallest dissimilarity coefficient. Then the cluster containing $i_k$ is joined with the cluster containing $j_k$. If $i_k$ and $j_k$ are already in the same cluster, then no new groups are formed in this stage. The value $d_{i_k j_k}$ is called the $k$th fusion distance $h_k$, where $h_1 \leq h_2 \leq \cdots \leq h_k$.

6. The clustering process can be halted before all the clusters have been joined into one group by stopping when the inter-cluster dissimilarity coefficients are all greater than $d_0$, where $d_0$ is an arbitrary value called the threshold level. Let $C_1^*, \dots, C_g^*$ be the resulting clusters. These clusters have the property that if $d_0^*$ $(> d_0)$ is a higher threshold, then the two clusters $C_r$, $C_s$ will be joined at the threshold $d_0^*$ if at least one dissimilarity coefficient $d_{i_r j_s}$ (or a single link) exists between $i_r$ and $j_s$ with $x_{i_r} \in C_r$, $x_{j_s} \in C_s$ and $d_0 < d_{i_r j_s} \leq d_0^*$.
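The steps above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the thesis: it assumes Euclidean dissimilarities, tracks clusters with a union-find forest (a choice made here for brevity), and records a fusion distance only at stages where two different clusters are actually joined, which yields n − 1 values. The function name and the four points are hypothetical.

```python
import itertools
import math


def single_linkage_fusion_distances(points):
    """Return the n - 1 merge heights of Single Linkage, in ascending order."""
    n = len(points)
    # Step 1: order all n(n-1)/2 pairwise dissimilarities.
    pairs = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    # Step 2: every point starts in its own cluster (union-find forest).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    fusion = []
    # Steps 3-5: scan the sorted coefficients; join the two clusters when the
    # link connects different clusters, otherwise skip the coefficient.
    for d, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            fusion.append(d)
            if len(fusion) == n - 1:
                break
    return fusion


pts = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0), (4.0, 3.0)]  # hypothetical data
h = single_linkage_fusion_distances(pts)
```

For these four points the sorted dissimilarities are 1, 2, 3, 4, about 4.47 and 5; the coefficient 3 links two points that are already in the same cluster, so the recorded merge heights are 1, 2 and 4. Stopping the scan at a threshold level, as in item 6, would amount to breaking out of the loop once `d` exceeds the chosen threshold.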

Properties of Single Linkage: A brief description of relevant properties: chaining

effect, monotonicity, ties, and stability is presented as follows. Further details on

the properties of Single Linkage algorithm are presented in [53].

a. Chaining effect: Single Linkage has a tendency to form spherical or elliptical

clusters, each one around a nucleus. However, if the clusters have no nuclei

the algorithm leads to a chaining effect. This effect is due to the fact that

links, once made, can not be broken. Therefore, Single Linkage may not give

satisfactory results if random noise is present between clusters.

b. Monotonicity: Single Linkage gives clustering of identical topology for any

monotonic transformation of a dissimilarity coefficient d.

c. Ties: if there are ties, that is, equal values for the smallest dissimilarity coef-

ficient between two clusters, then it does not matter which choice is made for

joining the clusters. The resulting clusters will be unchanged. It is therefore

allowable to randomly choose one of the smallest coefficients and then proceed

with the clustering process.

d. Stability: if there are small changes in the dissimilarity coefficient d then

these changes should not give rise to noticeable alteration in the classification

of Single Linkage.

Average Linkage This algorithm, also named “Unweighted Pair-Group Average”, was introduced by Sokal and Michener [97]. A definition of Average Linkage using the same notation as that of Single Linkage is presented as follows. Consider two clusters $C_r$ and $C_s$; then the dissimilarity coefficient $d_{rs}$ between the clusters $C_r$ and $C_s$ is defined as the average of all dissimilarity coefficients $d(x_r, x_s)$, where $x_r$ is any point of $C_r$ and $x_s$ is any point of $C_s$. Typical examples of dendrograms generated by Average Linkage applied to the pines, cells and redwoods are shown in Figures A.1 (a), (c), and (e), in appendix A.

Properties of Average Linkage: The main properties of Average Linkage are

monotonicity and ties which are briefly described in items b and c of Single Link-

age properties, respectively. Further information on the algorithm and properties is

reported in [58].

Complete Linkage has its original form published by Sørensen [98], and is the opposite of Single Linkage. That is, the dissimilarity coefficient between groups is defined as the largest distance between a point of one cluster and a point of the other. Formally,

$$d_{rs} = \max\{d(x_i, x_j) : x_i \in C_r,\ x_j \in C_s\}. \tag{2.3}$$

Figures A.1 (b), (d) and (f), in appendix A, show typical examples of dendrograms

generated by Complete Linkage applied to the pines, cells, and redwoods.
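The three inter-cluster dissimilarities differ only in how the pairwise distances between two clusters are summarised. The short sketch below, with a hypothetical function name and hypothetical clusters, assumes Euclidean distance as the dissimilarity coefficient.

```python
import itertools
import math
from statistics import mean


def cluster_dissimilarities(cr, cs):
    """Single, Average and Complete Linkage dissimilarities between two clusters."""
    d = [math.dist(p, q) for p, q in itertools.product(cr, cs)]
    return {"single": min(d), "average": mean(d), "complete": max(d)}


c_r = [(0.0, 0.0), (1.0, 0.0)]  # hypothetical cluster C_r
c_s = [(3.0, 0.0), (5.0, 0.0)]  # hypothetical cluster C_s
out = cluster_dissimilarities(c_r, c_s)
```

Here the four pairwise distances are 3, 5, 2 and 4, so the Single, Average and Complete Linkage values are 2, 3.5 and 5, respectively.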

Properties of Complete Linkage: this algorithm satisfies the following properties:

point proportion, cluster omission, monotonicity, and well-structured g-group admis-

sibility. Observe that the algorithm does not fulfill the properties: ties and stability

described in items c and d of Single Linkage, respectively. For more information on

its properties, see [38].

CHAPTER 3

Monte Carlo test

This chapter presents a brief review of the methodology of the Monte Carlo hy-

pothesis test and its application to spatial point patterns in spatial statistics. In

particular, a new and modified version of the Monte Carlo test applied to P-P plots

(Definition 5) is presented in Section 3.5. This modified version of the Monte Carlo

test applied to P-P plots has exact significance level α. A transformed version of

the P-P plot, the A-A plot (Definition 8), is introduced in Section 3.7. This plot

is a useful tool for analysing function estimates and has the property of stabilising

variance.

In this thesis, the Monte Carlo test methodology is applied to a variety of graph-

ical tools: P-P plots, A-A plots and Q-Q plots (Definition 7) using pointwise sim-

ulation envelopes and simultaneous critical bands. Together with the output of a

hierarchical clustering algorithm (Section 2.4), the Monte Carlo test is the founda-

tion of a new strategy to analyse point patterns. This strategy will be described in

Chapter 4. Monte Carlo testing was introduced independently by Dwass [35] and Barnard [10]. A brief description of the one-sided Monte Carlo test published

by Diggle [30] is presented below.

The one-sided Monte Carlo test Let $H_0$ be a given simple null hypothesis, $x$ be a given spatial dataset, $z(x)$ be the corresponding value of a real-valued test statistic $Z$, and $z_i$, where $i = 2, \dots, m$, be simulated values generated by random sampling from the distribution of $Z$ under $H_0$. Let $z_{(j)}$ be the $j$th largest among the complete set of values $\{z_1, z_2, \dots, z_m\}$, where $z_1 = z(x)$ and $m \in \mathbb{N}$. Then, under $H_0$,

$$P(z_1 = z_{(j)}) = \frac{1}{m} \quad \text{for } j = 1, \dots, m. \tag{3.1}$$

The null hypothesis is rejected if $z_1$ ranks $k$th largest or higher. This gives an exact, one-sided test of size $\alpha = \frac{k}{m}$. It is assumed that there are no ties, $P(z_i = z_j) = 0$ $(i \neq j)$, so that the ranking of $z_i$ is unequivocal. Otherwise, equal values or ties may occur, in which case Diggle suggested the conservative rule of selecting the least extreme rank for $z_i$. For further details on this test, see [30, page 7].
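A minimal sketch of this one-sided test is given below. The sampler `simulate` is a hypothetical stand-in for drawing the test statistic under the null hypothesis, and ties are handled by the conservative rule described above; nothing here is taken from the software used in the thesis.

```python
import random


def one_sided_mc_test(z_obs, simulate, m, k, rng):
    """Reject H0 if z_obs ranks kth largest or higher among m values.

    simulate(rng) draws one value of Z under H0 (hypothetical callable);
    m - 1 values are simulated, so the test has size k / m.
    """
    sims = [simulate(rng) for _ in range(m - 1)]
    # Conservative rule for ties: simulated values equal to z_obs are counted
    # as larger, which gives z_obs the least extreme rank.
    rank = 1 + sum(z >= z_obs for z in sims)
    return rank <= k, rank


rng = random.Random(0)
reject, rank = one_sided_mc_test(
    z_obs=10.0, simulate=lambda r: r.random(), m=100, k=5, rng=rng
)
```

With m = 100 and k = 5 the test has size 0.05; the observed value 10 exceeds every simulated uniform value, so it ranks first and the null hypothesis is rejected.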

Applications of Monte Carlo tests to point patterns are reported in [11, 24, 30,

66, 67, 90] and their main properties are investigated by [50, 54]. Next, standard

definitions of the inverse function of the cumulative distribution function (c.d.f.),

the quantiles of the distribution function and dataset are presented. The definitions

are important for building the two-sided version of Monte Carlo tests.

Definition 1 (Inverse function of c.d.f.). If a random variable has the cumulative distribution function $F$ then its inverse function, denoted by $F^{-1}$, is defined as

$$F^{-1}(p) = \min\{t \in \mathbb{R} : F(t) \geq p\}, \quad \text{for } p \in [0, 1].$$

Definition 2 (Quantile of c.d.f.). If $F$ is the cumulative distribution function of a random variable, then the $p$th quantile of $F$, where $p \in [0, 1]$, is a real number given by

$$q_p = F^{-1}(p).$$

Examples of quantiles are the lower quartile, median and upper quartile of the distribution $F$ which are the values $F^{-1}(0.25)$, $F^{-1}(0.5)$ and $F^{-1}(0.75)$, respectively. A definition of a quantile of the given dataset $x$ is given below.

Definition 3 (Quantile of dataset). For the dataset $x = \{x_1, \dots, x_n\}$ the order statistics are the numbers ranked in ascending order (thus, $x_{(1)}$ is the minimum and $x_{(n)}$ the maximum). If $F$ is the empirical c.d.f. of the data $x_1, \dots, x_n$ then the $\frac{k}{n}$th quantile of $F$ is the $k$th order statistic $x_{(k)}$.
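Definitions 1 and 3 can be combined in a short sketch: for the empirical c.d.f. of n values, the smallest t with F(t) ≥ p is the order statistic of index ⌈np⌉. The function name, the tolerance guard and the data are hypothetical choices made for this illustration.

```python
import math


def empirical_quantile(data, p):
    """Return F^{-1}(p) = min{t : F(t) >= p} for the empirical c.d.f. F of data."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1] for the minimum to be a data value")
    xs = sorted(data)
    n = len(xs)
    # The smallest k with k/n >= p is ceil(n * p); the small tolerance guards
    # against floating-point round-up when n * p is an integer.
    k = max(1, math.ceil(n * p - 1e-12))
    return xs[k - 1]


data = [0.9, 0.1, 0.5, 0.3]               # hypothetical dataset, n = 4
q_half = empirical_quantile(data, 2 / 4)  # second order statistic
```

With n = 4, the (2/4)th quantile is the second order statistic, in agreement with Definition 3.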

More information on quantiles is described in [16, 92]. Next, the general approach

of the two-sided Monte Carlo test is presented.

3.1 General case

Let x be the given dataset, Z a real-valued statistic, H0 a simple null hypothesis

and H1 a simple or composite alternative hypothesis. We aim to construct a two-

sided test of exact size α, where α is a rational number in (0, 1).

1. Select a number $m$ such that $(m + 1)\frac{\alpha}{2} \in \mathbb{Z}^+$ and simulate $m$ independent and identically distributed (i.i.d.) realisations of $X$ under $H_0$, that is, $x^{(1)}, \dots, x^{(m)}$.

2. Calculate the test statistic $Z$ applied to each of the $m$ realisations, $\{Z(x^{(i)}) : i = 1, \dots, m\}$.

3. Compute the $\frac{\alpha}{2}$th and $(1 - \frac{\alpha}{2})$th quantiles of the complete set $\{Z(x), Z(x^{(1)}), \dots, Z(x^{(m)})\}$ given by Definition 3. For simplicity, the $\frac{\alpha}{2}$th and $(1 - \frac{\alpha}{2})$th quantiles are denoted by $L$ and $U$, respectively.

(In other words, if $(m + 1)\frac{\alpha}{2}$ is a positive integer, $k$ say, then the $\frac{\alpha}{2}$th quantile of $Z_1, \dots, Z_{m+1}$ is the $k$th order statistic $Z_{(k)}$, and the $(1 - \frac{\alpha}{2})$th quantile is the $(m - k + 1)$th order statistic $Z_{(m-k+1)}$.)

4. Reject $H_0$ if $Z(x) \notin [L, U]$.
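The procedure can be sketched as a rank test. The version below rests on stated assumptions: it rejects when the observed statistic is among the k smallest or k largest of the m + 1 values, with k = (m + 1)α/2, which is the reading used in the proof of Proposition 4; it assumes a continuous statistic (no ties); and `simulate` is a hypothetical sampler of the statistic under the null hypothesis.

```python
import random


def two_sided_mc_test(z_obs, simulate, m, alpha, rng):
    """Two-sided Monte Carlo test based on the rank of z_obs among m + 1 values."""
    k = (m + 1) * alpha / 2
    if abs(k - round(k)) > 1e-9 or round(k) < 1:
        raise ValueError("(m + 1) * alpha / 2 must be a positive integer")
    k = round(k)
    sims = [simulate(rng) for _ in range(m)]
    # Rank of the observed value in the combined set (1 = smallest); ties are
    # ignored here, as for a continuous test statistic.
    rank = 1 + sum(z < z_obs for z in sims)
    reject = rank <= k or rank >= m + 2 - k
    return reject, rank


rng = random.Random(0)
uniform = lambda r: r.random()  # hypothetical sampler of Z under H0
central, _ = two_sided_mc_test(0.5, uniform, m=39, alpha=0.05, rng=rng)
extreme, rank = two_sided_mc_test(2.0, uniform, m=39, alpha=0.05, rng=rng)
```

With m = 39 and α = 0.05, k equals 1, so the null hypothesis is rejected only when the observed value is the smallest or the largest of the 40 values; a choice of m for which (m + 1)α/2 is not an integer raises an error instead of silently changing the size of the test.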

If the distribution of the test statistic $Z$ is continuous, the rank of the given test statistic $Z(x)$ among the set of values $\{Z(x^{(i)}) : i = 1, \dots, m\}$ determines an exact significance level for the test since, under $H_0$, each of the $m + 1$ possible rankings of $Z(x)$ is equally probable. Otherwise, ties in the set $\{Z(x), Z(x^{(1)}), \dots, Z(x^{(m)})\}$ may occur, in which case the level of significance of the test is not exact. Besag and Diggle [11] recommended randomly assigning an ordering to any equal values because this random choice provides an upper bound for the significance level of the Monte Carlo test.

The quantiles $L$ and $U$ are well-defined since $(m + 1)\frac{\alpha}{2}$ is an integer. Moreover, if $(m + 1)\frac{\alpha}{2}$ is a positive integer, the $(\frac{\alpha}{2})$th and $(1 - \frac{\alpha}{2})$th quantiles of the set $\{Z(x), Z(x^{(1)}), \dots, Z(x^{(m)})\}$ may be calculated (see item 3 of the general case described previously). For the special case $\alpha = \frac{2}{m+1}$, $L$ and $U$ are respectively given by

$$L = \min\{Z(x), Z(x^{(1)}), \dots, Z(x^{(m)})\} \tag{3.2}$$

$$U = \max\{Z(x), Z(x^{(1)}), \dots, Z(x^{(m)})\}. \tag{3.3}$$

In general, $L$ and $U$ are the $(m + 1)\frac{\alpha}{2}$th smallest value and the $(m + 1)\frac{\alpha}{2}$th largest value, respectively.
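As a worked illustration of the integrality condition, the exact sizes attainable in the special case can be computed directly. The values of $m$ used below are the ones reported for the applications in Section 3.5, and $k$ denotes the integer $(m + 1)\frac{\alpha}{2}$ of item 3.

```latex
\[
\alpha = \frac{2k}{m+1}, \qquad k = (m+1)\frac{\alpha}{2} \in \mathbb{Z}^{+}
\]
\[
k = 1:\quad m = 39 \Rightarrow \alpha = 0.05, \qquad
m = 99 \Rightarrow \alpha = 0.02, \qquad
m = 999 \Rightarrow \alpha = 0.002 .
\]
\[
\alpha = 0.05,\ m = 99:\quad (m+1)\tfrac{\alpha}{2} = 2.5 \notin \mathbb{Z}^{+},
\quad\text{so } m = 99 \text{ does not give an exact test of size } 0.05 .
\]
```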

Proposition 4. The level of significance of the Monte Carlo test is:

P(reject | H0) = P( reject H0 | H0 is true ) = α

Proof. Under H0, Z(x), Z(x(1)), . . . , Z(x(m)) are i.i.d., so the probability that Z(x)

is one of the (m + 1)α most extreme elements is equal to α, by symmetry.

Monte Carlo tests are applied to two special cases: function estimates and P-P

plots, in the next sections. Examples of function estimates are the reduced second

moment function K, empty space F function, nearest neighbour distance distribu-

tion function G and Van Lieshout and Baddeley’s function J .

3.2 Function estimate

Figure 3.1 shows a typical plot of a graphical method for applying a two-sided

Monte Carlo test to a function estimate. Instead of a single real-valued statistic

Z(x), one might consider function estimates of the form Zx(t) where as before x

denotes the given dataset and t > 0. A procedure to make a plot for applying the

two-sided Monte Carlo test is described as follows.


Figure 3.1: A typical plot of a graphical method for applying Monte Carlo test to a function estimate. Dotted lines: the $(\frac{\alpha}{2})$th and $(1 - \frac{\alpha}{2})$th quantiles, $L(t)$ and $U(t)$, of the function estimate determined by $H_0$. Solid line: the function estimate $Z_x(t)$ determined by a given dataset $x$.

1. Simulate $m$ i.i.d. realisations $Z^{(1)}(t), \dots, Z^{(m)}(t)$ under $H_0$, and calculate the $(\frac{\alpha}{2})$th and $(1 - \frac{\alpha}{2})$th quantiles (Definition 3), $L(t)$ and $U(t)$, of the set $\{Z_x(t), Z^{(1)}(t), \dots, Z^{(m)}(t)\}$.

2. Plot $Z_x(t)$, $L(t)$, $U(t)$ against $t$, see Figure 3.1.

3. To perform the test using the plot, fix an arbitrary $t_0 \in \mathbb{R}^+$, and reject $H_0$ if $Z_x(t_0) \notin [L(t_0), U(t_0)]$.

In this general context, there is no simple rule for choosing t0 to achieve maximum

power. Care must be taken to fix t0 prior to performing the test, and independently

of the outcome of the simulations, so that the test has the desired significance level α.

Similar to the general case (Section 3.1), if the distribution of $Z(t)$ is continuous then the rank of the given function estimate among the set of values $\{Z^{(i)}(t) : i = 1, \dots, m\}$ determines an exact significance level for the test since, under $H_0$, each of the $m + 1$ possible rankings of $Z_x(t)$ is equally likely. Otherwise, ties may occur and we follow Besag and Diggle’s recommendation [11] stated previously. Once again, for the special case $\alpha = \frac{2}{m+1}$, the $L(t)$ and $U(t)$ quantiles are respectively given by the following equations

$$L(t) = \min\{Z_x(t), Z^{(1)}(t), \dots, Z^{(m)}(t)\} \tag{3.4}$$

$$U(t) = \max\{Z_x(t), Z^{(1)}(t), \dots, Z^{(m)}(t)\}. \tag{3.5}$$

In the spatial statistics literature, L(t) and U(t) are known as the lower and upper

pointwise simulation envelopes, respectively.
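On a common grid of t values, the envelopes reduce to order statistics taken column by column. The sketch below is illustrative: curves are assumed to be stored as lists of equal length, the parameter `k` generalises equations (3.4) and (3.5) to the kth smallest and kth largest values, and all names and numbers are hypothetical.

```python
def pointwise_envelopes(z_obs, z_sims, k=1):
    """Lower and upper pointwise envelopes of the observed and simulated curves.

    z_obs holds Z_x(t) on a grid of t values; z_sims holds m simulated curves
    on the same grid. With k = 1 this gives the minimum and maximum.
    """
    curves = [z_obs] + list(z_sims)
    lower, upper = [], []
    for column in zip(*curves):
        ordered = sorted(column)
        lower.append(ordered[k - 1])  # kth smallest value at this t
        upper.append(ordered[-k])     # kth largest value at this t
    return lower, upper


z_obs = [0.0, 0.5, 1.0]                         # hypothetical observed curve
z_sims = [[0.1, 0.4, 0.9], [0.0, 0.6, 1.0]]     # hypothetical simulations
lower, upper = pointwise_envelopes(z_obs, z_sims)
```

The envelopes are pointwise: they are valid for a test at one value of t fixed in advance, not for a simultaneous statement over the whole range of t.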

Since Zx(t) is real-valued then the level of significance of the Monte Carlo test

applied to function estimates is α by Proposition 4.

Next, a graphical tool named the P-P plot is presented. This plot is useful for

comparing two distribution functions. However, before a definition of the P-P plot

is presented, a function estimate $\bar{Z}(t)$, the mean of $m$ i.i.d. realisations $Z^{(1)}(t), \dots, Z^{(m)}(t)$, is introduced by the following equation

$$\bar{Z}(t) = \frac{1}{m} \sum_{i=1}^{m} Z^{(i)}(t) \quad \text{for } t > 0. \tag{3.6}$$

3.3 P-P plot

Figure 3.2(a) shows an example of the P-P plot, introduced by Wilk and Gnanadesikan [114], in which the function estimate $\bar{Z}(t)$ (equation (3.6)) is plotted against $Z_x(t)$. The definition of the P-P plot is given as follows.

Definition 5 (P-P plot). If two distributions have cumulative distribution functions $F_1$ and $F_2$ then the P-P plot of $F_1$ and $F_2$ displays the pairs

$$(F_1(t), F_2(t)), \quad \forall\, t \in \mathbb{R}. \tag{3.7}$$

Equivalently, the P-P plot is the graph of the function $F_2 \circ F_1^{-1}$, where $F_1^{-1}$ is the inverse function of $F_1$ given by Definition 1. An important property of the P-P plot is that if $F_1 \equiv F_2$ then the plot is the identity line.

In spatial point patterns, the P-P plot from a (given) function estimate Zx(t)

against a (simulated) theoretical function Z∗(t) will show the extent of agreement

between the given dataset and theoretical point process. For instance, many sum-

mary functions used in spatial statistics are c.d.f.’s (the empty space F , and nearest

neighbour distance distribution G) so that they are amenable to the P-P plot. More

information on the P-P plot is reported in [16, 41, 114].

The rationale described for applying the Monte Carlo test to function estimates

does not extend to the P-P plot. This inapplicability of the test to the P-P plot is

explained as follows.


3.4 Inapplicability of Monte Carlo test to P-P plot

Proceeding in a fashion similar to Section 3.2, it will be shown that the rationale

of two-sided Monte Carlo test is not applicable to P-P plots. The analogue for

P-P plots of the graphical procedure described in Section 3.2 would be to fix a

coordinate value $v_0 \in [0, 1]$ and reject $H_0$ if $Z_x(t_0)$ lies outside $[L(t_0), U(t_0)]$, with $t_0 = \bar{Z}^{-1}(v_0)$, where $\bar{Z}^{-1}$ is the inverse function (Definition 1) of $\bar{Z}$. Under $H_0$, the rank of $Z_x(t_0)$ in $\{Z^{(1)}(t_0), \dots, Z^{(m)}(t_0), Z_x(t_0)\}$ is not (in general) uniformly distributed over $\{1, \dots, m + 1\}$ because $t_0 = \bar{Z}^{-1}(v_0)$ depends on $Z^{(1)}(t_0), \dots, Z^{(m)}(t_0)$ but not on $x$.

Figure 3.2: Inapplicability of the pointwise Monte Carlo test rationale to P-P plots: (a) P-P plot of empirical function estimate $Z_x(t)$ against mean of realisations $\bar{Z}(t)$, with Monte Carlo test applied at abscissa $v_0$. (b) Analogue of Figure 3.1: the test in (a) corresponds to applying a Monte Carlo test to $Z_x(t)$ at a random ordinate $t = \bar{Z}^{-1}(v)$ which depends on $Z_x(t)$.

Figure 3.2(b) shows the Monte Carlo test applied to $Z_x(t)$ at the random ordinate $t = \bar{Z}^{-1}(v)$ which depends on the simulated data $\{Z^{(i)}(t) : i = 1, \dots, m\}$. The difficulty is that $\bar{Z}(t)$ depends on $Z^{(1)}(t), \dots, Z^{(m)}(t)$, so for a fixed $v_0$, $Z_x(\bar{Z}^{-1}(v_0))$ depends on both data and simulations. Therefore, the significance level of the test is typically not equal to $\alpha$. The significance level is generally unknown.

3.5 Modified Monte Carlo test applied to P-P plot

To resolve the problem described in Section 3.4, the following procedure is

adopted.

1. Simulate two sets of i.i.d. realisations from $H_0$, that is, $\{Z_x^{(1)}(t), \dots, Z_x^{(m)}(t)\}$ and $\{Z_y^{(1)}(t), \dots, Z_y^{(M)}(t)\}$ independently, where $m$ and $M$ are positive integers.

2. From the set $\{Z_y^{(1)}(t), \dots, Z_y^{(M)}(t)\}$, compute the mean

$$\bar{Z}_y(t) = \frac{1}{M} \sum_{j=1}^{M} Z_y^{(j)}(t). \tag{3.8}$$

3. From the set $\{Z_x^{(1)}(t), \dots, Z_x^{(m)}(t)\}$, calculate the $(\frac{\alpha}{2})$th and $(1 - \frac{\alpha}{2})$th quantiles (Definition 3), denoted by $L_x(t)$ and $U_x(t)$, respectively.

4. Plot $Z_x(t)$, $L_x(t)$, $U_x(t)$ against $\bar{Z}_y(t)$. That is, plot pairs $(\bar{Z}_y(t), Z_x(t))$, $(\bar{Z}_y(t), L_x(t))$ and $(\bar{Z}_y(t), U_x(t))$.

5. Fix an arbitrary value $v_0$, let $t_0 = \bar{Z}_y^{-1}(v_0)$. Reject $H_0$ if $Z_x(t_0) \notin [L_x(t_0), U_x(t_0)]$.

We have no general rule for choosing $M$. Note that $m$ must be chosen so that $(m + 1)\frac{\alpha}{2}$ is a (positive) integer. In our applications, their selected values were identical: $M = m = 39, 99, 999$.

It is worth observing that $U_x(t)$ and $L_x(t)$ depend on the set $\{Z_x^{(1)}(t), \dots, Z_x^{(m)}(t)\}$, $Z_x(t)$ depends on $x$, and $\bar{Z}_y(t)$ depends on the set $\{Z_y^{(1)}(t), \dots, Z_y^{(M)}(t)\}$. Therefore, for a fixed $v_0$, $Z_x(\bar{Z}_y^{-1}(v_0))$ depends on $x$ and $Z_y^{(j)}(t)$; $L_x(\bar{Z}_y^{-1}(v_0))$ and $U_x(\bar{Z}_y^{-1}(v_0))$ depend on $Z_x^{(i)}(t)$ and $Z_y^{(j)}(t)$, where $i = 1, \dots, m$, $j = 1, \dots, M$. For $\alpha = \frac{2}{m+1}$, $L_x(t)$ and $U_x(t)$ are called the lower and upper (pointwise) simulation envelopes of the (two-sided) modified Monte Carlo test applied to P-P plots.
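Steps 2 to 5 of the modified procedure can be sketched on a common grid of t values. The following is an illustration under stated assumptions, not the code used for the applications: curves are lists of equal length, the inverse of the mean curve is approximated by the first grid index at which the mean reaches $v_0$ (a grid version of Definition 1), the envelopes at $t_0$ are the kth smallest and kth largest of the first set of simulations, and every name and number is hypothetical.

```python
def modified_pp_test(z_obs, x_sims, y_sims, v0, k=1):
    """Sketch of steps 2-5 on a common grid of t values (lists of equal length)."""
    M = len(y_sims)
    # Step 2: pointwise mean of the second set of simulations, equation (3.8).
    zbar_y = [sum(col) / M for col in zip(*y_sims)]
    # Step 5: grid version of t0, the first grid index where the mean reaches v0
    # (raises StopIteration if v0 is never reached on the grid).
    idx = next(i for i, v in enumerate(zbar_y) if v >= v0)
    # Step 3 at t0: kth smallest and kth largest of the first set of simulations.
    ordered = sorted(curve[idx] for curve in x_sims)
    lower, upper = ordered[k - 1], ordered[-k]
    reject = not (lower <= z_obs[idx] <= upper)
    return reject, idx, (lower, upper)


y_sims = [[0.0, 0.4, 1.0], [0.2, 0.6, 1.0]]                      # hypothetical
x_sims = [[0.1, 0.45, 1.0], [0.0, 0.55, 0.9], [0.2, 0.5, 1.0]]   # hypothetical
z_obs = [0.3, 0.9, 1.0]                                          # hypothetical
reject, idx, envelope = modified_pp_test(z_obs, x_sims, y_sims, v0=0.5)
```

The design point is that the abscissa $t_0$ is computed from the second set of simulations only, so it is independent of both the data and the first set, which is what the argument for the exact significance level relies on.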

We now prove that the (two-sided) modified Monte Carlo test for P-P plots has

an exact significance level α.

Proposition 6. The significance level of the (two-sided) modified Monte Carlo test

for P-P plots is P(reject | H0) = P(reject H0 | H0 is true) = α.

Proof.

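$$
\begin{aligned}
P(\text{reject } H_0 \mid H_0 \text{ is true})
&= P_{H_0}\Big( Z_x(\bar{Z}_y^{-1}(v_0)) \notin \big[L_x(\bar{Z}_y^{-1}(v_0)),\, U_x(\bar{Z}_y^{-1}(v_0))\big] \Big) \\
&= E_{H_0}\Big[ P\Big( Z_x(\bar{Z}_y^{-1}(v_0)) \notin \big[L_x(\bar{Z}_y^{-1}(v_0)),\, U_x(\bar{Z}_y^{-1}(v_0))\big] \,\Big|\, Z_y^{(1)}(t), \dots, Z_y^{(M)}(t) \Big) \Big]
\end{aligned}
\tag{3.9}
$$

The function $\bar{Z}_y(t)$ is completely determined by the set of realisations $\{Z_y^{(j)}(t) : j = 1, \dots, M\}$, so $\bar{Z}_y^{-1}(v_0) = t_0$ is fixed given $Z_y^{(1)}(t_0), \dots, Z_y^{(M)}(t_0)$. The argument presented in Section 3.1 for the general case of the Monte Carlo test establishes that:

$$P\Big( Z_x(\bar{Z}_y^{-1}(v_0)) \notin \big[L_x(\bar{Z}_y^{-1}(v_0)),\, U_x(\bar{Z}_y^{-1}(v_0))\big] \,\Big|\, Z_y^{(1)}(t_0), \dots, Z_y^{(M)}(t_0) \Big) = \alpha$$

so

$$E_{H_0}\Big[ P\Big( Z_x(\bar{Z}_y^{-1}(v_0)) \notin \big[L_x(\bar{Z}_y^{-1}(v_0)),\, U_x(\bar{Z}_y^{-1}(v_0))\big] \,\Big|\, Z_y^{(1)}(t_0), \dots, Z_y^{(M)}(t_0) \Big) \Big] = \alpha.$$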

Critical bands for function estimates Our aim is to construct a region in which

H0 is rejected at an exact significance level α if Zx(t) goes inside this region for

any t. This region is known as a critical region and is defined by [13, 92]. A

complementary idea for plotting a critical region for summary functions in spatial

statistics is introduced by Ripley [90, Chapter 8] for the first time, to the best of

our knowledge. Examples of L-function with “95% confidence band” are plotted on

[90, pages 171, 173]. That is, 95% of realisations of L calculated from a binomial

process should lie within the confidence band. (The summary function $L$ is defined as $L(t) = \sqrt{K(t)/\pi}$, where $K(t)$ is the reduced second moment function.) Further

information on K, L functions, and confidence band developed by Ripley is reported

in [90]. Next, our procedure to apply the Monte Carlo test to a function estimate

using the simultaneous critical band at an exact significance level α is presented.

Monte Carlo tests using simultaneous critical bands Let Zx(t) be a real function

estimate determined by a given dataset x. The procedure is given as follows.

1. Follow items 1 and 2 of the procedure described in Section 3.5.

2. For each realisation $i$, compute the maximum absolute deviation $d_i$ of $Z_x^{(i)}(t)$ from $\bar{Z}_y(t)$ defined by

$$d_i = \sup_t \big| Z_x^{(i)}(t) - \bar{Z}_y(t) \big|, \quad i = 1, \dots, m.$$

3. Order the set $\{d_1, \dots, d_m\}$ in ascending order, compute the $(1 - \alpha)$th quantile (Definition 3) of the ordered $d_{(i)}$’s, and denote it by $d_{(1-\alpha)}$.

4. Calculate the critical functions [13] given by the equation

$$\bar{Z}_y(t) \pm d_{(1-\alpha)}. \tag{3.10}$$

The graph of these functions against $t$ is called the simultaneous critical band.

5. Plot $Z_x(t)$ and the critical functions (equation (3.10)) against $t$.

6. Reject $H_0$ if $Z_x(t) \notin \big[\bar{Z}_y(t) - d_{(1-\alpha)},\ \bar{Z}_y(t) + d_{(1-\alpha)}\big]$ for some $t$.

Thus, H0 is rejected if the graph of Zx(t) lies outside the critical band at any point

t. Figure 3.3 displays a typical plot of the Monte Carlo test applied to the function

estimate Zx(t) using the simultaneous critical band at an exact significance level α.
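A grid-based sketch of the simultaneous critical band follows. It is illustrative only: the supremum over t is approximated by a maximum over a finite grid, the $(1 - \alpha)$th quantile is taken as the order statistic of index ⌈m(1 − α)⌉ in the spirit of Definition 3, and the function names and numbers are hypothetical.

```python
import math


def simultaneous_band(x_sims, y_sims, alpha):
    """Critical functions (3.10) on a common grid, as (lower, upper, d_crit)."""
    m, M = len(x_sims), len(y_sims)
    zbar_y = [sum(col) / M for col in zip(*y_sims)]
    # Step 2: maximum absolute deviation of each simulated curve from the mean
    # (the supremum is approximated by a maximum over the grid).
    d = sorted(
        max(abs(z - zb) for z, zb in zip(curve, zbar_y)) for curve in x_sims
    )
    # Step 3: (1 - alpha)th quantile of d_1, ..., d_m as an order statistic.
    k = max(1, math.ceil(m * (1 - alpha) - 1e-12))
    d_crit = d[k - 1]
    lower = [zb - d_crit for zb in zbar_y]
    upper = [zb + d_crit for zb in zbar_y]
    return lower, upper, d_crit


def rejects(z_obs, lower, upper):
    """Step 6: reject H0 if Z_x(t) leaves the band at any grid value of t."""
    return any(not (lo <= z <= up) for z, lo, up in zip(z_obs, lower, upper))


y_sims = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]                      # hypothetical
x_sims = [[0.0, 0.625, 1.0], [0.0, 0.25, 1.0],
          [0.0625, 0.5, 1.0], [0.0, 0.5, 0.5]]                   # hypothetical
lower, upper, d_crit = simultaneous_band(x_sims, y_sims, alpha=0.25)
```

Because the band has constant half-width around the mean curve, a single excursion of the observed curve outside it at any t is enough to reject, in contrast to the pointwise envelopes of Section 3.2.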

Figure 3.3: A typical plot of a graphical method for applying a Monte Carlo test

to a function estimate using a critical band. Dashed lines: the critical functions

determined by H0. Solid line: the function estimate Zx(t) determined by a given

dataset x.

3.6 Q-Q plot

The Q-Q plot is another useful graphical device for comparing two distribution

functions. The definition of the Q-Q plot is presented below.

Definition 7 (Q-Q plot). If two distributions have cumulative distribution functions $F_1$ and $F_2$, then the Q-Q plot of $F_1$ and $F_2$ displays the pairs of points $\big(F_1^{-1}(p), F_2^{-1}(p)\big)$, $\forall\, p \in [0, 1]$, where $F_1^{-1}$ and $F_2^{-1}$ are the inverse functions (Definition 1) of $F_1$ and $F_2$, respectively. Equivalently the Q-Q plot is the graph of $F_2^{-1} \circ F_1$.

The ranges of the x and y axes are the ranges of the corresponding distributions

F1, F2. Two important properties of the Q-Q plot are presented as follows. The

Q-Q plot is the identity line if and only if F1 ≡ F2. In addition, the Q-Q plot is a

straight line if and only if F1(t) = F2(a + bt) where a, b ∈ R. More information on

the Q-Q plot and its properties is available in [16, 41, 114].

Monte Carlo tests can also be applied graphically to Q-Q plots. The procedures to calculate and plot the pointwise simulation envelopes and the simultaneous critical bands for Q-Q plots are analogous to those previously described for function estimates and for the modified Monte Carlo test applied to P-P plots, in Sections 3.2 and 3.5.

Next, a transformed version of the P-P plot, named the A-A plot, is presented. The transformed P-P plot is another useful tool for comparing distribution functions and function estimates graphically.

3.7 A-A plot

The A-A plot is a transformed P-P plot. Aitkin and Clayton [1] proposed the use of the Fisher angular transformation, $\arcsin\sqrt{1 - F}$, in the P-P plot. The A-A plot definition and rationale are presented as follows.

Definition 8 (A-A plot). If two random variables have cumulative distribution functions $F_1$ and $F_2$, then the A-A plot of $F_1$ and $F_2$ displays the pairs of points $\left(\arcsin\sqrt{1 - F_1(t)},\ \arcsin\sqrt{1 - F_2(t)}\right)$, $\forall\, t \in \mathbb{R}$.
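As a minimal sketch of Definition 8, the following Python fragment maps two c.d.f.s to A-A plot coordinates. The function names `angular` and `aa_points` and the uniform example are illustrative assumptions, not part of the thesis software.

```python
import math

def angular(p):
    """Angular transform arcsin(sqrt(1 - p)) of a probability p."""
    p = min(1.0, max(0.0, p))  # guard against rounding just outside [0, 1]
    return math.asin(math.sqrt(1.0 - p))

def aa_points(F1, F2, ts):
    """A-A plot coordinates for two c.d.f.s F1 and F2 evaluated at the values ts."""
    return [(angular(F1(t)), angular(F2(t))) for t in ts]

# Hypothetical example: two identical uniform c.d.f.s on [0, 1]
# give points that lie on the identity line.
uniform_cdf = lambda t: min(1.0, max(0.0, t))
points = aa_points(uniform_cdf, uniform_cdf, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Both axes run from 0 to $\pi/2$, since the transform sends a probability of 1 to 0 and a probability of 0 to $\pi/2$.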

Examples of the A-A plot are presented in [1] and in Section 4.4.1. The rationale for the A-A plot is based on Wilk and Gnanadesikan's proposition of a transformation of the axes of the P-P plot and Q-Q plot by a real function. It is also known that Fisher's angular transformation [96], $\arcsin\sqrt{F}$, stabilises the variance of binomial estimates of proportions.

In spatial statistics, an important property of the A-A plot is that if a given point pattern is a realisation of a theoretical point process then the plot should be close to the identity line. In this thesis, the transformation $\arcsin\sqrt{1 - Z(t)}$ is applied to both axes of the P-P plot to achieve approximately constant variance. Observe that $Z(t)$ is a chosen summary function which is also a c.d.f. For more information on the transformed P-P plot and Fisher's angular transformation, see [1, 96, 114].

Monte Carlo tests can also be applied graphically to the A-A plot. The procedure for calculating and plotting the pointwise simulation envelopes and simultaneous critical bands for the A-A plot is analogous to the procedures described for the P-P plot and Q-Q plot in Sections 3.5 and 3.6.

CHAPTER 4

New strategy for analysing point patterns

This chapter introduces a new summary function, the fusion distance function, and

a new statistic, the area statistic, which are based on the output of a non-spatial hi-

erarchical clustering algorithm (Chapter 2) applied to a given spatial point pattern.

This chapter also explores applications of the fusion distance function in a spatial

context and develops both a graphical non-parametric method for exploratory anal-

ysis of point patterns, and formal inference using simulations and Monte Carlo tests

(Chapter 3).

The new strategy has two parts: exploratory data analysis (Section 4.3.1) and

inference (Section 4.3.2). First, the fusion distance function from the (observed)

given point pattern will be compared with the mean of simulations from a (chosen)

theoretical point process using graphical techniques. The proposed techniques to

compare the fusion distance functions are P-P plots (Definition 5), Q-Q plots (Def-

inition 7), A-A plots (Definition 8) and relative distribution plots (Definition 12).

Second, in the inference, the modified version of the Monte Carlo method (Section

3.5) is proposed for testing the fusion distance function from the given point pattern

against the mean of the fusion distance functions from simulations of the theoretical

point process. In most applications, a chosen theoretical point process is a homo-

geneous Poisson process of unknown intensity λ; this is the null hypothesis CSR

(defined in Section 5.1).

In this chapter, our choice of null hypothesis is a binomial point process which is

a simple case of the homogeneous Poisson conditioned on a fixed number of points.

However, an inhomogeneous Poisson process may also be selected for a null hypoth-

esis. For instance, in spatial epidemiology, an example is a case-control study [59]

where we have a point pattern of 30–100 cases of a rare disease (the cases) and an-

other point pattern of 1000–10000 people who are healthy but otherwise comparable

in age and socioeconomic status (the controls). The control data tells us about the

nonuniform density of the population. A natural null hypothesis is that the cases

are an inhomogeneous Poisson process with intensity proportional to the population

density. Thus, this is a good example where the null hypothesis is not CSR but is

an inhomogeneous Poisson process with a known intensity up to a constant factor.

The new strategy may equally well be applied to these examples.


4.1 New summary function and statistic

The new summary function, the fusion distance function, is motivated by the

following. Consider Figure 2.2 which shows the dendrograms obtained by apply-

ing the Single Linkage algorithm (Section 2.4) to the datasets: pines, cells and

redwoods (Section 2.1). The pointwise Euclidean distance is chosen as a dissimilar-

ity coefficient between points and clusters. On visual inspection, the output for the

redwoods shows a highly structured dendrogram with clear separation between large

clusters, while the output for the cells has a very disordered appearance and that

for the pines is intermediate. This suggests that the dendrograms carry enough information to enable us to discriminate between clustered patterns and CSR.

In the cluster analysis literature, there is no formal definition of what is meant

by a structured or disordered dendrogram. However, a very simple criterion is to

look only at the computed values of the dissimilarity coefficient between two clusters

of points in the dendrogram. These values will be named “fusion distances”. The

range of the fusion distances in the dendrogram is comparatively much broader

for the redwoods and much narrower for the cells, with the pines again giving an

intermediate result.

The exploratory results of cluster analysis indicate that a point pattern can be

analysed by applying a hierarchical clustering algorithm and then extracting the

fusion distances of the dendrogram. To the best of our knowledge, this approach to

analysing a spatial dataset is new. It is developed in Section 4.1.1. In Chapter 5, we

will demonstrate that the fusion distances are very good at discriminating between

different types of spatial patterns.

4.1.1 Fusion distance function. Let $x = \{x_1, \ldots, x_n\}$ be a given point pattern, consisting of a fixed number $n$ of points in some bounded region $W \subset \mathbb{R}^2$. A chosen hierarchical clustering algorithm is applied to the given point pattern, based on the pairwise Euclidean distances $\|x_i - x_j\|$, $i \neq j$. This produces a dendrogram, from which a list of fusion distances $h_k$, $k = 1, \ldots, n - 1$, is extracted. For example, for the Single Linkage algorithm (Section 2.4), $h_k = d_{i_k j_k}$, where $d_{i_k j_k}$ is the pairwise Euclidean distance between a point in one group and a point in the other group being merged (for Single Linkage, the closest such pair). We can now form the empirical cumulative distribution function of the fusion distances. This function is named the fusion distance function, denoted by $H(t)$, and its definition is as follows.


Definition 9 (Fusion distance function). For $t \in \mathbb{R}$,

$$H(t) = \frac{1}{n - 1} \sum_{k=1}^{n-1} \mathbf{1}\{h_k \le t\}, \qquad (4.1)$$

where $h_k$ is a chosen fusion distance between two groups of points and $\mathbf{1}\{\cdot\}$ is the indicator function. That is, $H$ is the empirical c.d.f. of the fusion distances.
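The construction of $H(t)$ can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis software: it uses the standard fact that Single Linkage merge heights coincide with the edge lengths of the Euclidean minimum spanning tree, and the function names and the four-point example are hypothetical.

```python
import math

def single_linkage_fusion_distances(points):
    """Return the n-1 Single Linkage fusion distances of a planar pattern, sorted.

    Assumption: merge heights are obtained as minimum spanning tree edge
    lengths via Prim's algorithm, which costs O(n^2) for n points.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best[j] = shortest distance from point j to the tree built so far
    best = [math.dist(points[0], p) for p in points]
    heights = []
    for _ in range(n - 1):
        i = min((j for j in range(n) if not in_tree[j]), key=best.__getitem__)
        heights.append(best[i])
        in_tree[i] = True
        for j in range(n):
            if not in_tree[j]:
                best[j] = min(best[j], math.dist(points[i], points[j]))
    return sorted(heights)

def fusion_distance_function(heights):
    """Return H, the empirical c.d.f. of the fusion distances (Definition 9)."""
    hs = sorted(heights)
    return lambda t: sum(1 for h in hs if h <= t) / len(hs)

# Hypothetical four-point pattern; its fusion distances are about 0.1, 0.3, 0.4.
example = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.0), (0.5, 0.3)]
heights = single_linkage_fusion_distances(example)
H = fusion_distance_function(heights)
```

Other linkage rules (Average or Complete Linkage) would need a different merge routine; only the final empirical c.d.f. step stays the same.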

Figure 4.1 shows typical plots of the fusion distance function H(t) for the pines,

cells and redwoods.

[Figure 4.1 panels: $H(t)$ (y-axis, 0 to 1) against $t$ (x-axis, 0 to 0.30), one panel per dataset.]

Figure 4.1: The fusion distance function H(t) against t for the point patterns: pines

(left), cells (centre) and redwoods (right). Dissimilarity coefficient: pairwise Eu-

clidean distance, Single Linkage algorithm. The point patterns are re-scaled to the

unit square and their physical dimensions and background information are described

by Diggle [30].

The fusion distance function depends on the chosen algorithm and coefficient. Also, the fusion distances cannot be regarded as if they were independent and identically distributed observations. (The $h_k$'s are ordered: $h_1 \le \cdots \le h_{n-1}$. In general, the $\binom{n}{2}$ pairwise distances are not independent.)

Application of fusion distance function: knee plot. Consider a given multivariate dataset $D$ with $n$ objects or points that have been measured on $\ell$ variables, where $n, \ell \in \mathbb{N}$ and $\ell < n$. Then apply cluster analysis to $D$ and form the set $S(D) = \{h_1, \ldots, h_{n-1}\}$ of the fusion distances $h_k$, where $k = 1, \ldots, n - 1$, between two clusters.

It is a common practice in cluster analysis to plot the number of clusters, say g,

against the output of a hierarchical algorithm to find the best number of clusters in

the dataset. In other words, the plot of $g$ against the values of the fusion distances $h_k$ provides information on the best number of clusters. This plot is known as a knee plot or scree plot. For the definition and applications of the knee plot, see [52].

Figure 4.2 shows examples of the knee plots of the fusion distances plotted against

the number of clusters for the pines, cells, and redwoods. The knee plots were

made using the pairwise Euclidean distance and Single Linkage algorithm (Section

2.4).

Linear transformation of knee plot A relationship between the index k of the set

S(D) of the fusion distances and the number of clusters g is given by

k = (n − g). (4.2)

Re-arranging equation (4.2) so that the number of clusters is a function of the index, that is, $g = n - k$, and dividing by the total number $n$ of points of the dataset gives

$$\frac{g}{n} = 1 - \frac{k}{n}, \quad \text{where } n > 0. \qquad (4.3)$$

Next, the linear transformation given by equation (4.3) is applied to the y-axis of

the inverted knee plot. Therefore, the linear transformation given by equation (4.3)

is the relationship between a knee plot and the plot of the fusion distance function.

That is, a knee plot is a rotated and scaled version of the cumulative distribution

function.
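The relationship in equations (4.2) and (4.3) can be illustrated with a short Python sketch. The function names and the three fusion distances are hypothetical; the code only assumes that the $k$-th smallest fusion distance $h_k$ leaves $g = n - k$ clusters.

```python
def knee_plot_points(heights):
    """Pairs (g, h_k), where g = n - k clusters remain after the k-th fusion."""
    hs = sorted(heights)
    n = len(hs) + 1  # n points give n - 1 fusion distances
    return [(n - k, h) for k, h in enumerate(hs, start=1)]

def inverted_knee_fractions(heights):
    """Pairs (h_k, g / n), with g / n = 1 - k / n as in equation (4.3)."""
    n = len(heights) + 1
    return [(h, g / n) for g, h in knee_plot_points(heights)]

# Hypothetical fusion distances for a pattern of n = 4 points.
knee = knee_plot_points([0.1, 0.3, 0.4])
inverted = inverted_knee_fractions([0.1, 0.3, 0.4])
```

Reading the second list from left to right, the fraction of clusters falls as the fusion distance grows, which is the mirror image of the rising empirical c.d.f. $H(t)$.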

Figure 4.3 shows examples of inverted knee plots of the number of clusters against

the fusion distances for the pines, cells and redwoods. The inverted knee plots

are similar to the plots of the fusion distance function H(t) shown in Figure 4.1,

except for a scale factor.

[Figure 4.2 panels: fusion distances $h_k$ (y-axis) against number of clusters $g$ (x-axis), one panel per dataset.]

Figure 4.2: Knee plots of the fusion distances against the number of clusters for

the pines (left), cells (centre) and redwoods (right). Dissimilarity coefficient:

pairwise Euclidean distance, Single Linkage algorithm.

[Figure 4.3 panels: number of clusters $g$ (y-axis) against fusion distances $h_k$ (x-axis), one panel per dataset.]

Figure 4.3: Inverted knee plots for the pines (left), cells (centre) and redwoods

(right).

Knee plots and analysis of point patterns The smooth shape of the right plot in

Figure 4.2 suggests that there are clusters of points in the redwoods. Furthermore,

the sharp decrease of the values of fusion distances (when the number of clusters

varies from 2 to 7) indicates that the best number of clusters is an integer between

2 and 7. For the cells (see the central plot in Figure 4.2), the values of fusion

distances are moderately flat for the values of g ∈ [1, 25]. This stability suggests

that there may not be clusters in this dataset. For the pines (see the left plot in

Figure 4.2), the values of fusion distances are intermediate, between the values of

redwoods and cells. This application of the knee plot using fusion distances is also

an indication that a knee plot may be a useful tool to discriminate between different

types of spatial patterns. However, there is still a need for further investigation. In

this study, this example is only for an exploratory data analysis.

4.1.2 Area statistic The area statistic, A, is a new index based on the fusion

distance function introduced in Section 4.1.1, and is defined as follows.

Definition 10 (Area statistic).

$$A = \int_0^1 H\left(H^{*-1}(u)\right) du \qquad (4.4)$$

where $H(t)$ is the fusion distance function for a given point pattern, and $H^{*}(t)$ is the sample mean of the fusion distance functions for simulations from the null hypothesis.

That is, given $x$, the area statistic is defined as the area under the P-P plot of the fusion distance function $H$ of $x$ against the pointwise mean $H^{*}$ of fusion distance functions that are computed from simulations of the null hypothesis. (A typical example of a reference point process is a homogeneous Poisson process, which is introduced by Definition 13.)
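A numerical version of the area statistic can be sketched as follows. The sketch assumes that the inverse in Definition 10 is the usual generalised inverse and that every simulation yields the same number of fusion distances, so that $H^{*}$ places equal mass on each pooled simulated distance; under those assumptions $A$ is the average of $H$ over the pooled simulated fusion distances. The function name and the toy inputs are hypothetical.

```python
from bisect import bisect_right

def area_statistic(observed, simulated):
    """Area under the P-P plot of H (observed) against the pointwise mean H*.

    observed  : fusion distances of the given pattern
    simulated : list of lists, one list of fusion distances per simulation
                (assumed to be of equal length)
    """
    obs = sorted(observed)
    pooled = [s for sim in simulated for s in sim]
    # H(s) is the proportion of observed fusion distances <= s;
    # averaging H over the pooled simulated distances integrates H against H*.
    return sum(bisect_right(obs, s) for s in pooled) / (len(obs) * len(pooled))

# Toy inputs: smaller observed distances push A towards 1, larger towards 0.
a_small = area_statistic([1, 2], [[5, 6], [7, 8]])
a_large = area_statistic([5, 6], [[1, 2]])
a_same = area_statistic([1, 2, 3, 4], [[1, 2, 3, 4]])
```

With very few distances the value for identical inputs is not exactly 0.5 because of the step functions involved; the discrepancy shrinks as the number of fusion distances grows.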

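The area statistic $A$ can be rewritten (if $H^{*}$ is continuous and strictly increasing) as

$$A = \int_0^{+\infty} H(t)\, dH^{*}(t).$$

Since $\int_{-\infty}^{+\infty} H^{*}(t)\, dH^{*}(t) = \tfrac{1}{2}$ for continuous $H^{*}$, and $H^{*}$ places no mass on $t < 0$ (fusion distances are non-negative), it follows that

$$A - \frac{1}{2} = \int_{-\infty}^{+\infty} \left(H(t) - H^{*}(t)\right) dH^{*}(t).$$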

This may be compared with the Anderson-Darling statistic, which is quoted from [34, page 26, equation 4.1.4]:

$$AD = \int_{-\infty}^{+\infty} \frac{\left(H(t) - E_0 H(t)\right)^2}{E_0 H(t)\left(1 - E_0 H(t)\right)}\, dE_0 H(t),$$

where $E_0 H(t)$ is the expected value of $H(t)$ under the null hypothesis.

The rationale for the Anderson-Darling statistic is that, if $H(t)$ is an estimate of $E_0 H(t)$ such as the empirical c.d.f. based on $n$ observations, then

$$\mathrm{Var}(H(t)) = \frac{E_0 H(t)\left(1 - E_0 H(t)\right)}{n}$$

so

$$E\left(\frac{\left(H(t) - E_0 H(t)\right)^2}{E_0 H(t)\left(1 - E_0 H(t)\right)}\right) = \text{constant}.$$

Therefore, the area statistic can be described as a simplification of the Anderson-Darling statistic. Further information on the Anderson-Darling statistic is reported in [4, 34].

Proposition 11. If $H \equiv H^{*}$ then $A = 0.5$, and if $X$ is a homogeneous Poisson process then $E(A) = 0.5$.

Proof. The first statement is trivial. For the second statement, let us assume that the number of simulations of the reference homogeneous Poisson process (Section 5.1) is sufficiently large that $H^{*}(t)$ is essentially non-random. Let us also assume that the point pattern $x$ is a realisation of the same Poisson process. Under these assumptions,

$$E(A) = E\left(\int_0^1 H\left(H^{*-1}(u)\right) du\right) = \int_0^1 E\left(H\left(H^{*-1}(u)\right)\right) du.$$

Since $H^{*}$ is non-random and $E\left(H(v)\right) = H^{*}(v)$ for all $v \in \mathbb{R}_{+}$ ($x$ is Poisson), we have $E\left(H\left(H^{*-1}(u)\right)\right) = u$. Hence $E(A) = \int_0^1 u\, du = 0.5$.


                     SL                        AL                        CL
Datasets     A      Ā      S_A         A      Ā      S_A         A      Ā      S_A
Pines      0.493  0.498  0.044       0.496  0.503  0.028       0.495  0.499  0.025
Cells      0.312  0.500  0.055       0.352  0.502  0.036       0.370  0.501  0.031
Redwoods   0.726  0.499  0.047       0.672  0.501  0.030       0.657  0.500  0.026

Table 4.1: Empirical area statistic A from the pines, cells and redwoods; the sample mean Ā and sample standard deviation S_A of the area statistic based on 1000 simulations of homogeneous Poisson processes with intensities 65, 42, and 62, respectively. Single Linkage (SL), Average Linkage (AL) and Complete Linkage (CL).

Values of $A > 0.5$ would be associated with clustered point patterns and values of $A < 0.5$ with regular point patterns. If a point pattern is clustered then the fusion distances tend to have a higher frequency of small distances, so the fusion distance function of the clustered pattern may be substantially above the mean of the simulated fusion distance functions from the homogeneous Poisson point process. In this case, $A > 0.5$. For a regular pattern, the fusion distances may not have a high concentration of small distances, so its fusion distance function may be considerably below the mean of the simulated fusion distance functions from the homogeneous Poisson point process. Thus, $A < 0.5$. However, no general rules can be inferred.

Illustration. Table 4.1 presents the estimated values of the area statistic $A$ from the pines, cells and redwoods using Single Linkage (SL), Average Linkage (AL) and Complete Linkage (CL). This table also shows the values of the sample mean $\bar{A}$ and sample standard deviation $S_A$ of the area statistic under the assumption that the reference point process is the homogeneous Poisson process. The estimated values of $\bar{A}$ and $S_A$ are based on 1000 simulations of Poisson point processes with the same intensities as the observed point patterns, that is, $\lambda_i = 65, 42, 62$, respectively.

4.2 Relative distribution

Useful graphical tools for comparing two distributions are the P-P plot (Defini-

tion 5), Q-Q plot (Definition 7), and A-A plot (Definition 8). Another useful device

is the relative distribution plot which is based on the relative distribution method

[47] applied to social sciences. The relative distribution plot is presented as follows.


Definition 12 (Relative distribution). Let $F_1$, $F_2$ be two cumulative distribution functions. Their relative cumulative distribution function is given by

$$G(r) = F_2\left(F_1^{-1}(r)\right), \quad 0 \le r \le 1,$$

where $F_1^{-1}$ is the inverse function (Definition 1) of the cumulative distribution function $F_1$. Now, if $F_1$, $F_2$ have probability densities $f_1$, $f_2$ then $G$ has the probability density

$$g(r) = \frac{f_2\left(F_1^{-1}(r)\right)}{f_1\left(F_1^{-1}(r)\right)}, \quad 0 \le r \le 1.$$

The function $g$ is called the relative probability density function of $F_1$, $F_2$.

A simple interpretation of the plots of the relative distribution is that if the cumulative distribution functions $F_1$ and $F_2$ are identical then, for $0 \le r \le 1$, $g(r) = 1$ and $G(r) = r$. The plot of the relative cumulative distribution function is identical to the P-P plot. In other words, $G$ is the function plotted in a P-P plot. Further information on the definition, properties and applications of the relative distribution method is presented in [47].
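An empirical counterpart of $G$ can be sketched by transforming each comparison value through the empirical c.d.f. of the reference sample. This is an illustrative sketch only, not the software of [48]; the function names and the toy samples are assumptions.

```python
from bisect import bisect_right

def relative_data(reference, comparison):
    """Map each comparison value y to r = F1(y), with F1 the e.c.d.f. of the reference."""
    ref = sorted(reference)
    return [bisect_right(ref, y) / len(ref) for y in comparison]

def relative_cdf(reference, comparison):
    """Return G, where G(r) is the proportion of relative data that are <= r."""
    rel = sorted(relative_data(reference, comparison))
    return lambda r: bisect_right(rel, r) / len(rel)

# Toy check: identical samples give G(r) close to r (exactly r up to discreteness).
sample = list(range(1, 11))
G = relative_cdf(sample, sample)
```

For continuous $F_1$, the proportion of values $F_1(Y) \le r$ with $Y \sim F_2$ equals $F_2(F_1^{-1}(r))$, which is why the empirical c.d.f. of the relative data estimates $G$.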

New application of relative distribution plot To the best of our knowledge, the

relative distribution method has not been applied to spatial statistics. Therefore,

a new application of the relative distribution plot to analyse point patterns based

on the fusion distance function is given as follows. (The software for computing the

relative distribution is available in [48]).

Figure 4.4 shows the relative probability density functions with pointwise 95% confidence intervals for the fusion distance functions $H(t)$ for the pines, cells and redwoods (Section 2.1) plotted against the mean $\bar{H}(t)$ of 1000 realisations from a binomial point process on the unit square. The fusion distance function is computed using the Single Linkage algorithm (Section 2.4).

The relative density $g(r)$ is estimated by kernel smoothing techniques [2, 14, 39, 112], and the pointwise confidence intervals that are shown on the plots of $g(r)$ are based on the large-sample normal approximation [47],

$$\hat{g}(r) \sim N\left(g(r),\ \frac{g(r) R(\kappa)}{m h_m} + \frac{g^2(r) R(\kappa)}{n h_m}\right)$$

where $n$, $m$ are the sample sizes from the estimated distribution and from $H_0$, respectively; $h_m$ is the smoothing bandwidth used for the density estimation; $\kappa$ is the kernel of the density estimation, and $R(\kappa) = \int_{-\infty}^{+\infty} |\kappa(x)|^2\, dx$. Under this approximation, the 95% pointwise confidence intervals for $g(r)$ are given by

$$\hat{g}(r) \pm 1.96 \sqrt{\frac{g(r) R(\kappa)}{m h_m} + \frac{g^2(r) R(\kappa)}{n h_m}}.$$

[Figure 4.4 panels: relative probability density (y-axis, 0 to 8) against $\bar{H}(t)$ (x-axis, 0 to 1), one panel per dataset.]

Figure 4.4: Relative probability density function (y-axis) of the fusion distances $H(t)$ plotted against the mean $\bar{H}(t)$ (x-axis). The probability density plots with pointwise 95% confidence intervals of the pines (left), cells (centre), and redwoods (right). Solid lines: relative probability density; dotted lines: 95% confidence intervals. The mean is estimated from 1000 realisations from a binomial point process; Single Linkage algorithm.

(For the computational work, Handcock and Morris [48] chose the Ksmooth density

estimation [110], and Ricardo Cao’s adaptive method for the bandwidth estimation.)

Details and more information on the estimation of the confidence interval for the

relative distribution are provided in [47, Chapter 9].
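The pointwise interval above is simple to evaluate once an estimate of $g(r)$ is available. The sketch below is an assumption-laden illustration, not the implementation of [48]: the function name is hypothetical, the estimate is plugged into the variance formula, and the default roughness $R(\kappa) = 1/(2\sqrt{\pi})$ corresponds to a standard Gaussian kernel, which may differ from the kernel actually used.

```python
import math

def relative_density_ci(g_hat, n, m, h,
                        roughness=1.0 / (2.0 * math.sqrt(math.pi)), z=1.96):
    """Pointwise interval g_hat +/- z * sqrt(g R/(m h) + g^2 R/(n h)).

    g_hat     : estimated relative density at r (plugged in for g)
    n, m      : sample sizes from the estimated distribution and from H0
    h         : smoothing bandwidth h_m
    roughness : R(kappa); the default assumes a standard Gaussian kernel
    """
    variance = g_hat * roughness / (m * h) + g_hat ** 2 * roughness / (n * h)
    half_width = z * math.sqrt(variance)
    return g_hat - half_width, g_hat + half_width

# Hypothetical values chosen only to make the arithmetic easy to follow.
lower, upper = relative_density_ci(1.0, n=100, m=100, h=0.1, roughness=0.5)
```

A smaller bandwidth or smaller samples widen the interval, which is one reason the intervals in a relative density plot can be broad near the tails.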

The left plot in Figure 4.4 shows that the pines are almost indistinguishable

from a binomial point process with regard to the distribution of fusion distances.

However, the central plot in Figure 4.4 shows that the cells and a binomial point process are different. There is a peak in the relative probability density plot at about the 75th percentile on the x-axis, which suggests regularity in the cells. In other words, for the cells there is a higher concentration of large fusion distances than we would expect for a binomial point process.

The right plot in Figure 4.4 shows that the redwoods and a binomial point process are different. The peaks at the tails indicate polarisation, which suggests that there is clustering in the redwoods. Handcock [personal communication] observed that the relative peak at about the 75th percentile on the x-axis suggests some regularity in the redwoods.


Handcock also mentioned that the redwoods could be thought of as a mixture of a

clustered pattern with a small component of a regular pattern.

The plots of the relative cumulative distributions from the pines, cells, and

redwoods are not shown in this section because these plots are identical to the P-P

plots of the fusion distance function that are presented in Figure 4.6.

4.3 Description of strategy

This section describes the two parts of the strategy: exploratory data analysis

and inference.

4.3.1 Exploratory data analysis. If the aim is to perform exploratory data analysis of a given point pattern $x$, then the procedure is as follows.

1. Apply a chosen hierarchical algorithm to $x$ and compute its fusion distance function $H(t)$ (Definition 9).

2. Simulate $m$ i.i.d. realisations of a binomial point process (each consisting of $n = N(x)$ independent random points), denoted by $x^{(1)}, \ldots, x^{(m)}$.

3. For each realisation $x^{(i)}$, where $i = 1, \ldots, m$, compute its fusion distance function, denoted by $H_i(t)$.

4. Compute the pointwise mean of the simulated fusion distance functions, given by

$$\bar{H}(t) = \frac{1}{m} \sum_{i=1}^{m} H_i(t). \qquad (4.5)$$

5. Compare $H(t)$ with $\bar{H}(t)$ graphically using an A-A plot (Definition 8), P-P plot (Definition 5), Q-Q plot (Definition 7) or relative distribution plot (Definition 12).

If the given point pattern x is a realisation of a binomial point pattern then the

plot (A-A plot, P-P plot, Q-Q plot or relative cumulative distribution plot) should

be close to the identity line.
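The five steps above can be sketched end to end in Python. This is a hedged illustration rather than the thesis implementation: it assumes the window is the unit square, uses Single Linkage computed through minimum spanning tree edge lengths, and evaluates the functions on an evenly spaced grid. All function names and the random example pattern are hypothetical.

```python
import math
import random
from bisect import bisect_right

def fusion_distances(points):
    """Sorted Single Linkage fusion distances (MST edge lengths, Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    in_tree[0] = True
    best = [math.dist(points[0], p) for p in points]
    out = []
    for _ in range(n - 1):
        i = min((j for j in range(n) if not in_tree[j]), key=best.__getitem__)
        out.append(best[i])
        in_tree[i] = True
        for j in range(n):
            if not in_tree[j]:
                best[j] = min(best[j], math.dist(points[i], points[j]))
    return sorted(out)

def ecdf(sorted_values, t):
    """Empirical c.d.f. of sorted_values evaluated at t."""
    return bisect_right(sorted_values, t) / len(sorted_values)

def pp_plot(pattern, m=99, grid_size=50, seed=0):
    """Steps 1 to 5: P-P points (mean simulated H, observed H) on a grid of t values."""
    rng = random.Random(seed)
    n = len(pattern)
    h_obs = fusion_distances(pattern)                                   # step 1
    sims = [fusion_distances([(rng.random(), rng.random()) for _ in range(n)])
            for _ in range(m)]                                          # steps 2 and 3
    t_max = max(h_obs[-1], max(s[-1] for s in sims))
    grid = [t_max * k / grid_size for k in range(grid_size)] + [t_max]
    h_bar = [sum(ecdf(s, t) for s in sims) / m for t in grid]           # step 4, eq. (4.5)
    return [(hb, ecdf(h_obs, t)) for hb, t in zip(h_bar, grid)]         # step 5

# Hypothetical pattern of 30 uniform points, so the plot should hug the identity line.
rng = random.Random(1)
pattern = [(rng.random(), rng.random()) for _ in range(30)]
pts = pp_plot(pattern, m=19, grid_size=20)
```

Plotting the second coordinate of `pts` against the first gives the P-P plot; applying the angular transform to both coordinates would give the corresponding A-A plot.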


Interpretation of exploratory data analysis A simple interpretation is that if a

given point pattern is clustered then the expected P-P plot of the fusion distance

function will mostly be above the identity line, suggesting that there are many more

points aggregated into groups than in a binomial point process. But, if the given

point pattern is regular then the expected P-P plot of the fusion distance function

will mostly be below the identity line, suggesting the absence of spatial clustering.

An equivalent interpretation of the exploratory data analysis can be made for

an A-A plot and relative cumulative distribution plot. However, for Q-Q plots, the

interpretation is opposite. In other words, if a given point pattern is regular then the

expected Q-Q plot of the fusion distance function will mostly be above the identity

line. But, if the point pattern is clustered then the expected Q-Q plot of the fusion

distance function will mostly be below the identity line.

[Figure 4.5 panels: P-P plots with both axes from 0 to 1, one panel per dataset.]

Figure 4.5: P-P plots of fusion distance function $H(t)$ versus $\bar{H}(t)$ for pines (left), cells (centre) and redwoods (right). Solid lines: P-P plots; dotted lines: identity line. The mean is estimated from 1000 realisations of a binomial point process with the same intensity as the observed pattern; Single Linkage algorithm.

Illustration Figure 4.5 shows the exploratory data analysis performed for the

datasets: pines, cells and redwoods (Section 2.1). The P-P plots of the fusion

distance function of the datasets are plotted against the mean of fusion distance

functions from 1000 simulations of binomial point processes with same intensities as

the observed datasets.

The P-P plot of the fusion distance function of the pines is very close to the

identity line (see the left plot in Figure 4.5). Therefore, the pines can be regarded

as a realisation of a binomial point process. However, the fusion distance functions

of the cells and redwoods are distant from the identity line. Thus, the cells and

redwoods appear not to be realisations of a binomial point process (see the central

and right plots in Figure 4.5, respectively).


4.3.2 Inference The second part of the strategy is to perform the modified

version of the Monte Carlo test (Section 3.5), graphically. Let x be a realisation

of a point process X, and H0, H1 be the given null and alternative hypotheses,

respectively. If the purpose of the analysis is a formal test based on the fusion

distance function, then the procedure is given as follows.

1. Specify H0, H1 and significance level α.

2. Apply a chosen hierarchical algorithm to x and compute its fusion distance

function H(t) (Definition 9).

3. Simulate m i.i.d. realisations under H0, that is, x(1), . . . ,x(m).

4. For each simulated point pattern $x^{(i)}$, $i = 1, \ldots, m$, compute its fusion distance function $H_i(t)$, and then the mean $\bar{H}(t)$ given by equation (4.5).

5. Apply the modified version of the Monte Carlo test (Section 3.5) using either

simulation envelopes or critical bands to the P-P plot (Definition 5), Q-Q plot

(Definition 7) or A-A plot (Definition 8).

If H(t) is outside the pointwise simulation envelope or simultaneous critical band

then H0 is rejected.
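A simplified version of the critical band test can be sketched as follows. The details of the modified Monte Carlo test are given in Section 3.5 and may differ from this sketch, which assumes that the same $m$ simulations supply both the mean and the deviations, and that the critical value is an upper order statistic of the maximum absolute deviations. The function name and the toy curves are hypothetical.

```python
def critical_band_test(h_obs, h_sims, alpha=0.05):
    """Simultaneous band test for function values on a common grid.

    h_obs  : observed function values on the grid
    h_sims : list of simulated function value lists on the same grid
    Returns (reject, d, h_bar), where the band is h_bar(t) +/- d.
    """
    m = len(h_sims)
    h_bar = [sum(column) / m for column in zip(*h_sims)]
    # maximum absolute deviation of each simulated curve from the mean
    devs = sorted(max(abs(v - b) for v, b in zip(sim, h_bar)) for sim in h_sims)
    # critical value: discard the floor(alpha * m) largest deviations
    # (the small constant guards against floating-point rounding of alpha * m)
    rank = max(1, m - int(alpha * m + 1e-9))
    d = devs[rank - 1]
    d_obs = max(abs(v - b) for v, b in zip(h_obs, h_bar))
    return d_obs > d, d, h_bar

# Toy curves on a three-point grid; the mean is [0, 0.5, 1] and d is 0.25.
sims = [[0.0, 0.25, 1.0], [0.0, 0.75, 1.0]]
inside = critical_band_test([0.0, 0.5, 1.0], sims)
outside = critical_band_test([0.0, 1.0, 1.0], sims)
```

In practice `h_obs` and `h_sims` would hold the fusion distance functions (or their P-P, Q-Q or A-A transforms) evaluated on a shared grid.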

4.4 Applications

4.4.1 Application to published point patterns The inferential part of the

new strategy was applied to the pines, cells and redwoods datasets (Section 2.1).

For each test, Single Linkage was the chosen hierarchical algorithm and $M, m = 999$ realisations were generated under $H_0$, that is, from binomial point processes with the same intensities as the observed datasets. (The composite $H_1$ was that the point patterns were not realisations of the binomial point process with the specified intensities.) The significance level was $\alpha = 0.05$ and the software used was R version 1.5.1 (R Development Core Team) [51] on a Pentium 4 (1.8 GHz). The results are as follows.

Simulation envelopes for P-P, Q-Q, A-A plots. For the pines, the fusion distance function is within the pointwise simulation envelopes for the P-P plot, Q-Q plot and A-A plot. (See the left plots in Figures 4.6 and 4.7.) However, for the cells and redwoods, the fusion distance functions are substantially outside the pointwise envelopes. (See the central and right plots in Figures 4.6 and 4.7, respectively.) For instance, when $H(t) \approx 0.17$, $H_0$ is not rejected for the pines but is rejected for the cells and redwoods.

[Figure 4.6 panels: upper row, P-P plots with both axes from 0 to 1; lower row, Q-Q plots with both axes in units of fusion distance.]

Figure 4.6: Plots of fusion distance function $H(t)$ against $\bar{H}(t)$ with pointwise simulation envelopes at 5% significance level. Datasets: pines (left), cells (centre) and redwoods (right). Upper: P-P plots; lower: Q-Q plots. Dashed lines: envelopes; dotted lines: identity line; Single Linkage algorithm.

[Figure 4.7 panels: A-A plots with both axes from 0 to 1.5.]

Figure 4.7: A-A plots of $\arcsin\sqrt{1 - H(t)}$ against $\arcsin\sqrt{1 - \bar{H}(t)}$ with pointwise simulation envelopes at 5% significance level. Datasets: pines (left), cells (centre) and redwoods (right). Solid lines: A-A plots; dashed lines: envelopes; dotted lines: identity line; Single Linkage algorithm.


[Figure 4.8 panels: upper row, P-P plots with x-axis from 0 to 1; lower row, Q-Q plots with both axes in units of fusion distance.]

Figure 4.8: Plots of fusion distance function $H(t)$ versus $\bar{H}(t)$ with simultaneous critical bands at 5% significance level. Datasets: pines (left), cells (centre) and redwoods (right). Upper: P-P plots; lower: Q-Q plots. Dashed lines: critical functions; dotted lines: identity line; Single Linkage algorithm.

Critical bands for P-P plots, Q-Q plots and A-A plots. For the pines, the fusion distance function is inside the simultaneous critical bands for the P-P plots, Q-Q plots and A-A plots. (See the left plots in Figures 4.8 and 4.9.) However, for the cells and redwoods the fusion distance functions are substantially outside the critical bands. (See the central and right plots in Figures 4.8 and 4.9, respectively.) Thus, $H_0$ is not rejected for the pines but is rejected for the cells and redwoods.

4.4.2 Application to simulated point patterns In this subsection, two

datasets are simulated, and the inferential part of the new method (introduced in

Section 4.3.2) is applied to them. The fusion distance function and Ripley’s K-

function are computed for both datasets. (The definition and properties of the

Ripley’s K-function are available in [88, 90].)

This application is a good example of why there is still a need for new summary functions for analysing point patterns. Note that the fusion distance function successfully identifies the presence of spatial clustering in both datasets, in contrast to the K-function.


[Figure 4.9 panels: A-A plots with both axes from 0 to 1.5.]

Figure 4.9: A-A plots of $\arcsin\sqrt{1 - H(t)}$ versus $\arcsin\sqrt{1 - \bar{H}(t)}$ with the simultaneous critical bands at 5% significance level. Datasets: pines (left), cells (centre) and redwoods (right). Solid lines: A-A plots; dashed lines: critical functions; dotted lines: identity line; Single Linkage algorithm.

Datasets. The first simulated dataset, Dataset one, is a realisation of a Matérn cluster process (Definition 14) with parent intensity $\lambda_p = 5$, daughter intensity $\lambda_c = 5$, and radius $r = 0.25$ on the unit square. (See the left plot in Figure 4.10.) Dataset two is also a realisation of a Matérn cluster process, where $\lambda_p = 10$, $\lambda_c = 10$, and $r = 0.5$, on the unit square. (See the left plot in Figure 4.11.) The chosen parameters of the cluster process were based on Ripley's choice [89].
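For readers who want to generate comparable patterns, a minimal simulation sketch follows. It rests on assumptions that Definition 14 (outside this section) may state differently: parents form a homogeneous Poisson process of intensity $\lambda_p$ on the unit square dilated by $r$, $\lambda_c$ is taken as the mean number of daughters per parent, and daughters are uniform in the disc of radius $r$ around their parent. The function names are hypothetical.

```python
import math
import random

def poisson(rng, mean):
    """Poisson variate via Knuth's multiplication method (adequate for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def matern_cluster(lam_p, lam_c, r, seed=None):
    """Simulate a Matern cluster pattern on the unit square (assumptions in the text)."""
    rng = random.Random(seed)
    lo, hi = -r, 1.0 + r  # parents may lie up to r outside the window
    n_parents = poisson(rng, lam_p * (hi - lo) ** 2)
    pts = []
    for _ in range(n_parents):
        px, py = rng.uniform(lo, hi), rng.uniform(lo, hi)
        for _ in range(poisson(rng, lam_c)):
            # uniform point in the disc of radius r centred on the parent
            rho = r * math.sqrt(rng.random())
            theta = 2.0 * math.pi * rng.random()
            x, y = px + rho * math.cos(theta), py + rho * math.sin(theta)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
                pts.append((x, y))  # keep only daughters inside the window
    return pts

# Parameters of Dataset one; the realisation itself depends on the seed.
pattern_one = matern_cluster(5, 5, 0.25, seed=1)
```

Simulating parents on the dilated window avoids thinning the pattern near the boundary, since clusters centred just outside the unit square can still contribute daughters inside it.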

[Figure 4.10 panels: left, point pattern titled λ = 5, r = 0.25; centre, K-function estimate with simulation envelopes for t from 0 to 0.08; right, P-P plot of the fusion distance function with critical band and identity line, both axes from 0 to 1.]

Figure 4.10: Left: Dataset one, a simulated Matérn cluster process with $\lambda_p = 5$, $\lambda_c = 5$ and $r = 0.25$ on the unit square. Centre: translation-corrected estimate of the K-function [81]. Right: P-P plot of the fusion distance function. Solid lines: function estimates; dashed lines: envelopes (centre) and critical bands (right); dotted lines: identity line; Single Linkage algorithm.

K-function The estimates of the K-function for the datasets were computed and plotted using the software described in Section 4.4 and the spatial library Spatstat [9] on a Pentium 4 (1.8 GHz). The K-function was estimated using the translation correction of Ohser [81], and the upper and lower simulation envelopes were calculated from 40 realisations of the homogeneous Poisson process (Definition 13) with the same intensities as the simulated datasets. Observe that for t < 0.08, the estimated K-function was inside the simulation envelopes, suggesting CSR for both datasets. (See the central plots in Figures 4.10 and 4.11, respectively.)
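The translation-corrected estimator can be illustrated with a minimal pure-Python sketch. It is not the Spatstat implementation used for the results above: the function name `k_translation`, the restriction to the unit square, the requirement that every t be smaller than 1, and the estimate n(n − 1)/|W|² for λ² are all assumptions made for this illustration.

```python
import math
import random

def k_translation(points, t_values):
    """Translation-corrected estimate of K(t) on the unit square (all t < 1)."""
    n = len(points)
    t_max = max(t_values)
    sums = [0.0] * len(t_values)
    for i in range(n):
        for j in range(i + 1, n):
            dx = abs(points[i][0] - points[j][0])
            dy = abs(points[i][1] - points[j][1])
            d = math.hypot(dx, dy)
            if d > t_max:
                continue
            # Reciprocal of the area of the window intersected with its
            # translate by the vector (dx, dy).
            weight = 1.0 / ((1.0 - dx) * (1.0 - dy))
            for k, t in enumerate(t_values):
                if d <= t:
                    sums[k] += 2.0 * weight  # ordered pairs (i, j) and (j, i)
    # With |W| = 1, lambda^2 is estimated by n * (n - 1).
    return [s / (n * (n - 1)) for s in sums]

rng = random.Random(81)
pattern = [(rng.random(), rng.random()) for _ in range(200)]
t_values = [0.02, 0.05, 0.08]
k_hat = k_translation(pattern, t_values)
csr_reference = [math.pi * t ** 2 for t in t_values]  # K(t) = pi t^2 under CSR
```

For a pattern of uniformly scattered points, `k_hat` should lie close to `csr_reference`; simulation envelopes would be obtained by repeating the calculation on independent CSR realisations and taking pointwise minima and maxima.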

Figure 4.11: Left: Dataset two, a simulated Matern cluster process with λp = 10, λc = 10 and r = 0.5 on the unit square. Centre: translation-corrected estimate of the K-function [81]. Right: P-P plot of the fusion distance function. Solid lines: function estimates; dashed lines: simulation envelopes (centre) and critical bands (right); dotted lines: identity line; Single Linkage algorithm.

Fusion distance function The (two-sided) modified Monte Carlo test using critical bands (Section 3.5), based on the P-P plot of the fusion distance function, was applied to the simulated datasets at the 5% significance level. The simple H0, that the dataset is a realisation of a homogeneous Poisson process with the same intensity as the simulated dataset, was tested against the composite H1 that the dataset is not a realisation of the homogeneous Poisson process. The fusion distance function and critical bands were calculated from M, m = 39 realisations under H0 using the Single Linkage algorithm (Section 2.4).

For instance, H0 was rejected when H(t) ≈ 0.3 for Dataset one and when H(t) ≈ 0.55 for Dataset two. (The fusion distance function was also outside the critical bands in a small neighbourhood around these values.) That is, the test indicates that the simulated datasets were not realisations of the homogeneous Poisson process. Moreover, the estimated fusion distance functions were substantially above the identity line, suggesting clustered patterns for both datasets. (See the right plots in Figures 4.10 and 4.11, respectively.)

CHAPTER 5

Power of Monte Carlo tests

This chapter evaluates the power of two Monte Carlo tests of Complete Spatial

Randomness (CSR) based on the fusion distance function (Definition 9) against the

alternative hypotheses of spatial clustering and inhibition. One Monte Carlo test

uses the supremum distance (Kolmogorov-Smirnov statistic) and the other uses the

area statistic (Definition 10). The power of the tests is then estimated through

simulation experiments.

There are several models [19, 69, 78, 95, 105], or alternative hypotheses, against which the power of the test of CSR may be evaluated. In this thesis, two models are chosen to represent clustered and inhibited interactions between points. The first model is the Matern cluster process, and the second is the Matern type II point process. Both models were introduced in [69] and are described in Section 5.1.

There are two important reasons for selecting these models. First, the models are simple to simulate using direct algorithms (feasible programming and short computational time). In other words, the Matern cluster and Matern type II point processes are not computationally intensive to simulate, nor do they depend on iterative algorithms such as Markov Chain Monte Carlo (MCMC) methods. Second, the parameters of these models are easily adjusted to control the dependence between points. (That is, the degree of attraction or repulsion can be varied from “no interaction” through “mild interaction” to “strong interaction”.)

The main purpose of this chapter is not to give a general rule or recommendation on the power of the test of CSR. It is only to give an illustration of the power of the test based on the fusion distance and area statistic, restricted to a chosen set of parameters of the alternative models. (The following definition of the power is given by Bickel and Doksum [13, Chapter 5]: the power of a test against the alternative H1 is the probability of rejecting H0 when H1 is true.) In the next sections, a brief description of the chosen models, test statistics and summary function used to estimate the power of the test of CSR against cluster and inhibition alternatives is presented.

5.1 Models

Null model The chosen null model is a homogeneous Poisson point process which

is also known as Complete Spatial Randomness (CSR). This process is defined by an

important property and presented below. (The following definition is quoted from

[60].)


Definition 13 (Homogeneous Poisson point process). A point process on the plane is a homogeneous Poisson process if: (i) N(B) has the Poisson distribution with mean λ|B| for some positive constant λ and any measurable subset B of R², where |B| is the area of B and N(B) represents the number of events (points of the process) in B; (ii) for any disjoint measurable subsets B1, . . . , Bn of R², the random variables N(B1), . . . , N(Bn) are independent.

Further information on properties and applications of the homogeneous Poisson

point process is reported in [24, 26, 30, 60].
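Definition 13 suggests a direct simulation recipe on a bounded window: draw the number of points from a Poisson distribution with mean λ|W| and then scatter that many points uniformly. The sketch below follows this recipe using only the Python standard library; the helper names `poisson_count` and `simulate_csr` are illustrative and are not taken from the software described in Section 4.4.

```python
import random

def poisson_count(mean, rng):
    """Poisson(mean) draw: count unit-rate exponential arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_csr(intensity, rng, width=1.0, height=1.0):
    """Homogeneous Poisson process of the given intensity on a rectangle."""
    # Property (i): N(W) is Poisson with mean lambda * |W|.
    n = poisson_count(intensity * width * height, rng)
    # Given N(W) = n, the points are independent and uniform on W.
    return [(rng.uniform(0.0, width), rng.uniform(0.0, height)) for _ in range(n)]

rng = random.Random(2003)
pattern = simulate_csr(100.0, rng)
counts = [len(simulate_csr(100.0, rng)) for _ in range(200)]
mean_count = sum(counts) / len(counts)  # should be near 100
```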

First model: clustering The first chosen alternative model is a Matern cluster

process (see Section 2, Figures 1B and 1C of Matern [70]), a special case of a

Neyman-Scott process [78, 90, 102] which consists of independent random circular

clusters of radius r. (The following definition of the process is quoted from [71].)

Definition 14 (Matern cluster point process). The Matern cluster process with parent intensity λp > 0, cluster intensity λc > 0 and cluster radius r > 0 is constructed in two steps: (1) Generate the cluster centres (parents) xp from the homogeneous Poisson process of intensity λp. (2) For each parent xp, generate a cluster (daughters) from the homogeneous Poisson process of intensity λc on the ball b(xp, r). The Matern cluster process is given by the union of the clusters. The expected number of daughters per cluster is μ = λc π r² and the overall intensity of the process is λ = λp λc π r².
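The two-step construction of Definition 14 can be sketched as follows. This is an illustration only: the function names are hypothetical, and generating the parents on the unit square enlarged by r (so that clusters centred just outside the window still contribute daughters) is an assumption that the definition itself does not state.

```python
import math
import random

def poisson_count(mean, rng):
    """Poisson(mean) draw: count unit-rate exponential arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_matern_cluster(parent_intensity, daughter_intensity, radius, rng):
    """Matern cluster process observed on the unit square."""
    lo, hi = -radius, 1.0 + radius  # assumed buffer zone for the parents
    n_parents = poisson_count(parent_intensity * (hi - lo) ** 2, rng)
    mean_daughters = daughter_intensity * math.pi * radius ** 2  # mu
    points = []
    for _ in range(n_parents):
        px, py = rng.uniform(lo, hi), rng.uniform(lo, hi)
        for _ in range(poisson_count(mean_daughters, rng)):
            # Uniform point in the disc b(parent, radius).
            rho = radius * math.sqrt(rng.random())
            theta = rng.uniform(0.0, 2.0 * math.pi)
            x, y = px + rho * math.cos(theta), py + rho * math.sin(theta)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:  # keep daughters in the window
                points.append((x, y))
    return points

rng = random.Random(14)
radius = 0.05
parent_intensity = 20.0
daughter_intensity = 5.0 / (math.pi * radius ** 2)  # gives mu = 5 daughters per cluster
sizes = [len(simulate_matern_cluster(parent_intensity, daughter_intensity, radius, rng))
         for _ in range(200)]
mean_size = sum(sizes) / len(sizes)  # should be near lambda_p * mu = 100
```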

Second model: inhibition The second chosen alternative process is a Matern model II inhibition process, which was introduced by Matern [69]. (The following definition of the model is quoted from [71, page 48].)

Definition 15 (Matern model II). The Matern model II process with initial intensity λ0 > 0 and minimum inter-point distance r > 0 is constructed by dependent thinning of the homogeneous Poisson process as follows: (1) Generate the homogeneous Poisson process of intensity λ0. For each point xi, generate an independent uniform variable si ∈ [0, 1] that represents the time by which the points can be ordered. (2) Remove any point xi such that there is another point xj satisfying ||xi − xj|| < r and sj < si. The overall intensity of the process is

\[ \lambda = \lambda_0 \, \frac{1 - \exp(-\lambda_0 \pi r^2)}{\lambda_0 \pi r^2}. \]
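The dependent thinning in Definition 15 can be sketched in a few lines. The names below are illustrative, and the sketch simulates directly on the unit square without a buffer zone, which is an assumption: points near the boundary then have fewer competitors than the intensity formula presumes.

```python
import math
import random

def poisson_count(mean, rng):
    """Poisson(mean) draw: count unit-rate exponential arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_matern_ii(initial_intensity, radius, rng):
    """Matern model II on the unit square by dependent thinning."""
    n = poisson_count(initial_intensity, rng)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    times = [rng.random() for _ in range(n)]  # the marks s_i
    kept = []
    for i, (xi, yi) in enumerate(points):
        # x_i is removed if an earlier point lies within distance r. The check
        # uses the original pattern, so removed points still inhibit others.
        dominated = any(
            times[j] < times[i]
            and math.hypot(xi - points[j][0], yi - points[j][1]) < radius
            for j in range(n) if j != i
        )
        if not dominated:
            kept.append((xi, yi))
    return kept

def matern_ii_intensity(initial_intensity, radius):
    """Overall intensity from Definition 15 (the lambda_0 factors cancel)."""
    disc_area = math.pi * radius ** 2
    return (1.0 - math.exp(-initial_intensity * disc_area)) / disc_area

rng = random.Random(15)
pattern = simulate_matern_ii(200.0, 0.05, rng)
theoretical = matern_ii_intensity(200.0, 0.05)  # slightly above 100
```

By construction, no two retained points are closer than r, because the later of any such pair would have been removed.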

More information on the Matern model II and Matern cluster processes, prop-

erties and applications is reported in [69, 70, 71, 90, 102, 104].


5.2 Tests

First test The first Monte Carlo test is based on the fusion distance function and uses the supremum distance, the Kolmogorov-Smirnov statistic [34]. The supremum distance is defined by

\[ U = \sup_{0 \le t \le t_1} |\hat{F}(t) - F(t)| \]

over a range of t values of interest, where F(t) is the (theoretical) c.d.f., F̂(t) is the (observed) empirical c.d.f. and t1 denotes the upper limit of the range of t values.
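As a small illustration of the supremum distance, the sketch below evaluates it on an equally spaced grid over [0, t1]. The grid only approximates the supremum, and the names `ecdf` and `supremum_distance` are hypothetical helpers rather than part of the thesis software.

```python
import bisect

def ecdf(sample):
    """Return the empirical c.d.f. of the sample as a function of t."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(t):
        return bisect.bisect_right(ordered, t) / n

    return cdf

def supremum_distance(observed_cdf, reference_cdf, t1, steps=1000):
    """Grid approximation of sup over 0 <= t <= t1 of |observed - reference|."""
    grid = [t1 * k / steps for k in range(steps + 1)]
    return max(abs(observed_cdf(t) - reference_cdf(t)) for t in grid)

observed = ecdf([0.1, 0.2, 0.3, 0.4])

def reference(t):
    """Toy reference c.d.f.: uniform distribution on [0, 1]."""
    return min(max(t, 0.0), 1.0)

u = supremum_distance(observed, reference, t1=0.5)
```

In this toy example the largest gap occurs at t = 0.4, where the empirical c.d.f. reaches 1 while the reference equals 0.4.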

However, it may not be possible to use standard goodness-of-fit tests, such as the χ² test, since the distribution of the Kolmogorov-Smirnov statistic under CSR is still unknown. The distribution of U is known in classical cases where F̂ is the empirical cumulative distribution function of independent and identically distributed observations. Here, however, the classical result does not apply because the observations are not independent. Therefore, the two-sided modified version of Monte Carlo tests (Section 3.5) is performed not only to estimate the power of the test of CSR but also to achieve the exact significance level α.

Second test The power of the second Monte Carlo test is also based on the

fusion distance function and uses the area statistic (Section 4.1.2). Similar to the

distribution of the first test, the distribution of the area statistic under CSR is

unknown so the two-sided modified version of Monte Carlo tests is performed.

Clustering algorithm and dissimilarity coefficient The Single Linkage is chosen

to form the clusters, and the pairwise Euclidean distance is chosen to measure the

distance between points.
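For Single Linkage with Euclidean distance, the n − 1 fusion distances coincide with the edge lengths of a minimum spanning tree of the points, which allows a compact sketch without any clustering library. The code below uses Prim's algorithm; treating H(t) as the proportion of fusion distances not exceeding t is an assumption made here for illustration (the formal definition is given in Definition 9), and the function names are hypothetical.

```python
import bisect
import math

def single_linkage_fusion_distances(points):
    """Sorted fusion distances of Single Linkage via a minimum spanning tree."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # shortest distance from each point to the tree
    best[0] = 0.0
    heights = []
    for step in range(n):
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[i] = True
        if step > 0:
            heights.append(best[i])  # length of the edge that joins point i
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[i], points[k])
                if d < best[k]:
                    best[k] = d
    return sorted(heights)

def fusion_distance_function(heights):
    """Assumed form of H(t): proportion of fusion distances <= t."""
    ordered = sorted(heights)

    def H(t):
        return bisect.bisect_right(ordered, t) / len(ordered)

    return H

points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
heights = single_linkage_fusion_distances(points)
H = fusion_distance_function(heights)
```

The O(n²) loop is adequate for patterns of about 100 points, as used in this chapter; other linkage rules (Average, Complete) do not share the spanning-tree shortcut.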

Range of argument To estimate the power of the first test based on the fusion distance function H(t), the ranges of the argument t are: [0, 0.22] for the Matern cluster, and [0, 0.20] for the Matern model II. For the second test using the area statistic, the range of t is [0, √2] for both models. (The upper limits of the ranges of t values are chosen because H(t1) ≈ 1 for realisations from these models.)

5.3 Experimental study

Software and computational time The software and library used for the power

study were described in Sections 4.4 and 4.4.2, respectively. The continuous compu-

tational times for the power based on the fusion distance function and area statistic

were approximately four weeks and two weeks, respectively.


Realisations from null model For each Monte Carlo test performed, M, m = 99 realisations of CSR with intensity λ = 100 on the unit square were simulated.

Realisations from clustering The Matern cluster processes, with five parent intensities λp = {5, 10, 20, 25, 50}, were simulated on the unit square. The daughter parameter λc was then adjusted to keep the total intensity of the process constant at 100 points. The degree of interaction between daughters and parents was varied through the radius r = {0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2}. Typical realisations of the Matern cluster processes for some of the described parameters are shown in Figure 5.1. (For instance, when r = 0.005 only the clusters, and not the individual points, can be seen in this figure.) The generated patterns had three different sizes of clusters: small (each parent on average had two daughters), medium (each parent on average had three, four or five daughters) and large clusters (each parent on average had ten daughters). The more daughter points and the smaller the radius, the more clustered the point pattern is.
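The adjustment of the daughter parameter follows from Definition 14: since λ = λp μ with μ = λc π r², fixing λ = 100 gives μ = 100/λp and λc = μ/(π r²). The sketch below tabulates these values; the function names are illustrative. Under this reading, the values λc = 20 and λc = 2 quoted in the figure captions for λp = 5 and λp = 50 agree with the mean number of daughters μ rather than with an intensity on the ball.

```python
import math

TOTAL_INTENSITY = 100.0
PARENT_INTENSITIES = [5, 10, 20, 25, 50]

def mean_daughters(parent_intensity, total=TOTAL_INTENSITY):
    """mu from lambda = lambda_p * mu."""
    return total / parent_intensity

def daughter_intensity(parent_intensity, radius, total=TOTAL_INTENSITY):
    """lambda_c from mu = lambda_c * pi * r^2."""
    return mean_daughters(parent_intensity, total) / (math.pi * radius ** 2)

table = {lp: mean_daughters(lp) for lp in PARENT_INTENSITIES}
```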

Realisations from inhibition The Matern model II processes, with ten initial intensities λ0 = {110, 120, 130, 140, 150, 160, 170, 180, 190, 200}, were simulated on the unit square. The initial intensities were chosen to achieve the total intensity of 100 points. For each λ0, the degree of inhibition between points was controlled by the radius parameter r = {0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05}. Figure 5.2 shows typical realisations of Matern model II processes for some of the described parameters. The degree of inhibition amongst points varies with the size of the radius: the larger the radius, the more inhibited the point pattern.

5.4 Estimation of power

The two-sided modified version of Monte Carlo tests (Section 3.5) at the exact significance level α = 0.05 was performed based on M, m = 99 realisations under the null model. Then, the powers of the tests were estimated from 1000 simulations under H1 for each set of parameters of the alternative models. The chosen values of the parameters of the alternative models were described in Section 5.3. For each test, the fraction of rejections out of 1000 simulations was the estimate of its power.
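The power estimate described above is simply a rejection frequency, which the following sketch makes explicit. It is not the two-sided modified test of Section 3.5: for brevity it uses a plain one-sided rank-based Monte Carlo test, and the Gaussian draws are toy placeholders for the supremum distance or area statistic. All names are hypothetical.

```python
import random

def monte_carlo_rejects(observed, null_values, alpha=0.05):
    """One-sided Monte Carlo test: reject when the observed value ranks high."""
    exceed = sum(1 for value in null_values if value >= observed)
    p_value = (exceed + 1) / (len(null_values) + 1)
    return p_value <= alpha

def estimate_power(null_statistic, alt_statistic, rng, m=99, n_alt=1000, alpha=0.05):
    """Fraction of rejections over n_alt simulations under the alternative."""
    rejections = 0
    for _ in range(n_alt):
        null_values = [null_statistic(rng) for _ in range(m)]
        if monte_carlo_rejects(alt_statistic(rng), null_values, alpha):
            rejections += 1
    return rejections / n_alt

def toy_null(rng):
    """Placeholder statistic under H0."""
    return rng.gauss(0.0, 1.0)

def toy_alternative(rng):
    """Placeholder statistic under H1 (shifted upwards)."""
    return rng.gauss(3.0, 1.0)

rng = random.Random(5)
size = estimate_power(toy_null, toy_null, rng)         # should be close to alpha
power = estimate_power(toy_null, toy_alternative, rng)
```

With m = 99 and α = 0.05, the test rejects when the observed statistic is among the five largest of the 100 values, which is what makes the significance level exact for a continuous statistic.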

5.4.1 Test using supremum distance

The results obtained for the powers of Monte Carlo tests, using the supremum distance, show that the optimal choice of the upper limit t1 does not strongly depend on the mean number of daughters per cluster. (See Figures 5.3, 5.4, and Tables B.1(a)–(e) (in appendix B) which present the estimated powers of the Monte Carlo test of CSR against Matern cluster processes with parameters described previously in the text.)

Figure 5.1: Typical plots of realisations of Matern cluster point processes with λp = {25, 50}, λc adjusted to keep the total intensity of the process constant at 100 points, and r = {0.005, 0.05, 0.125, 0.2} on the unit square.


Figure 5.2: Typical plots of realisations of Matern model II processes with λ0 = {110, 120, 130, 140, 150, 160} and r = {0.005, 0.01, 0.015, 0.02, 0.025, 0.03} on the unit square. (Panels: λ0 = 110, r = 0.005; λ0 = 120, r = 0.01; λ0 = 130, r = 0.015; λ0 = 140, r = 0.02; λ0 = 150, r = 0.025; λ0 = 160, r = 0.03.)

Clustering For instance, consider r = 0.005 and the first plot in Figure 5.3 (top

row and left side). The test is very powerful given that t1 < 0.07 or t1 > 0.17.

However, the loss of the power is noticeable if t1 ∈ [0.07, 0.17]. To investigate the

reasons for loss of power of the test, extra simulations were done. Figure 5.5 shows

typical plots of fusion distance functions and estimated means from realisations

under the Poisson (CSR) and Matern cluster processes. The left plot on the top

row in Figure 5.5 shows 100 fusion distance functions for the Poisson with λ = 100,

and 100 fusion distance functions for Matern cluster with λp = 5, λc = 20, and

r = 0.005.

The values of the fusion distance functions of the Matern cluster are different from those of the Poisson when t1 ∈ [0, 0.1]. Consequently, the power of the test is strong. However, for t1 ∈ (0.11, 0.13] the fusion distance functions of both patterns appear to be approximately equal. In addition, for t1 ∈ (0.13, 0.2] the values of the fusion distance functions for both patterns also appear to be similar. Therefore, the power of the test is weak when both fusion distance functions have similar values.

Figure 5.3: Power of Monte Carlo tests of CSR against Matern cluster processes with parameters λp, λc, r; where λp, r are varying as shown. λc is adjusted to keep the intensity of the process constant at 100. Test uses 99 realisations of CSR. Power estimated from 1000 realisations under Matern cluster processes. Test statistic: supremum distance; t1 is the upper limit of the range. (Panels: r = 0.005, 0.01, 0.025, 0.05, 0.075, 0.1; each panel plots power against t1 with one curve for each of λp = 5, 10, 20, 25, 50.)

Figure 5.4: Power of Monte Carlo tests of CSR against Matern cluster processes with parameters λp, λc, r; where λp, r are varying as shown. λc is adjusted to keep the intensity of the process constant at 100. Test uses 99 realisations of CSR. Power estimated from 1000 realisations under Matern cluster processes. Test statistic: supremum distance; t1 is the upper limit of the range. (Panels: r = 0.125, 0.15, 0.175, 0.2; each panel plots power against t1 with one curve for each of λp = 5, 10, 20, 25, 50.)

The right plot on the top row in Figure 5.5 shows the estimated means of the fusion distance functions against t1. The estimated means of both fusion distance functions are equal when t1 ∈ [0.12, 0.13]. Accordingly, the left plot on the top row in Figure 5.3 shows that the power is approximately equal to zero for t1 ∈ [0.12, 0.13]. For t1 > 0.13, the power gradually increases.


Figure 5.5: Typical plots of fusion distance functions from Poisson and Matern cluster processes. Left: computed H(t) of individual realisations; right: estimated means of H(t). Upper: 100 realisations from Poisson with λ = 100, and 100 realisations from Matern cluster with λp = 5, λc = 20, r1 = 0.005. Lower: 100 realisations of Poisson with λ = 100, and 100 realisations of Matern cluster with λp = 50, λc = 2, r10 = 0.2. (Horizontal axes: t1.)

A more detailed explanation can be visualised through Q-Q plots. Figures B.1 – B.5 (in appendix B) show the quantiles of the fusion distance functions from the Matern cluster plotted against those from the Poisson. Note that both quantiles are equal when t1 = 0.13 in Figure B.3. Therefore, the power is equal to zero. However, when t1 > 0.13 the quantiles from both processes are different, so the power is greater than zero. (See Figure B.3.) The left plot on the lower row in Figure 5.5 shows 100 fusion distance functions from the Poisson with λ = 100, and 100 fusion distance functions from the Matern cluster with λp = 50, λc = 2, and r = 0.2. The right plot on the lower row in Figure 5.5 shows the estimated means of the fusion distance functions. Similar conclusions can be drawn to explain the fluctuation of the power of the test of CSR against the Matern cluster, not only for these parameters but also for the remaining parameters.

Figure 5.6: Power of Monte Carlo tests of CSR against Matern model II processes with parameters λ0, r; where λ0 is chosen to achieve an intensity of 100. Test uses 99 realisations of CSR. Power estimated from 1000 realisations under the Matern model II. Test statistic: supremum distance; t1 is the upper limit of the range. (Two panels, upper and lower, each plotting power against t1 with one curve per value of r.)


Inhibition The results obtained for the power of the test of CSR against the Matern model II, using the supremum distance, show that the optimal choice of t1 depends on the choice of r. (See Figure 5.6 and Table B.2 (in appendix B) which present the estimated powers of the test of CSR against Matern model II processes with parameters described previously in the text.)

For instance, consider r = 0.005 and the upper plot in Figure 5.6. The test of CSR is powerful given that t1 > 0.15. However, the fluctuation of the power is noticeable for t1 ≤ 0.15. To investigate this fluctuation, we proceed in a fashion similar to that described previously for clustering. Figure 5.7 shows typical plots of the fusion distance functions from the Poisson and Matern model II processes. Observe that the values of the fusion distance functions of Matern model II patterns are close to those of the Poisson patterns for t1 ∈ [0, 0.2]. In particular, for t1 ∈ [0, 0.05] the fusion distance functions of both patterns are approximately equal, so the power of the test of CSR is approximately zero.

Figures B.6 – B.10 (in appendix B) show the Q-Q plots of the fusion distance functions from both models. The quantiles are plotted for t1 ∈ [0, 0.2]. Note that the quantiles of the fusion distance functions for both models (Poisson and Matern model II) are very close to the identity line. For instance, when t1 = 0.01, both quantiles are equal in Figure B.6. Therefore, for t1 = 0.01, the power is zero. However, the test of CSR is the most powerful for the Matern model II with λ0 = 200 and r = 0.05 if t1 ∈ [0.001, 0.06]. (See the lower plot in Figure 5.6.) For these parameters, the fusion distance functions from both models are very different. (See the lower plots in Figure 5.7.)

The lower left plot in Figure 5.7 shows 100 fusion distance functions for the Poisson with λ = 100, and 100 fusion distance functions from the Matern model II with λ0 = 200 and r = 0.05. The lower right plot in Figure 5.7 shows the estimated means of the fusion distance functions. Similar conclusions can be drawn to explain the fluctuation of the power of the test of CSR against the Matern model II alternative based on the fusion distance function, not only for these parameters but also for the remaining parameters.

Figure 5.7: Typical plots of fusion distance functions from Poisson and Matern model II processes. Left: computed H(t) of individual realisations; right: estimated means of H(t). Upper: 100 realisations of Poisson with λ = 100, and 100 realisations of Matern model II with λ0 = 110, r1 = 0.005. Lower: 100 realisations of Poisson with λ = 100, and 100 realisations of Matern model II with λ0 = 200, r10 = 0.05. (Horizontal axes: t1.)

5.4.2 Test using area statistic

Clustering The left plot in Figure 5.8 and Table 5.1 present the estimated powers of Monte Carlo tests using the area statistic. The test is the most powerful for the Matern cluster with λp = 5, λc = 20, and r = 0.005. For the other parameters described previously, the obtained results show that the estimated power decreases with the increasing radius r or with the increasing mean number of parents λp.

Inhibition The right plot in Figure 5.8 and Table 5.2 present the estimated powers of Monte Carlo tests of CSR against Matern model II processes using the area statistic. The obtained results show that the estimated power increases with the increasing inhibition radius r. The test is the most powerful for the Matern model II with radius r ≥ 0.045.

Matern cluster, area statistic

Radius r     0.005  0.01   0.025  0.05   0.075  0.1    0.125  0.15   0.175  0.2
5 parents    1.00   1.00   1.00   1.00   1.00   0.99   0.99   0.99   0.99   0.99
10 parents   1.00   1.00   1.00   1.00   1.00   0.99   1.00   0.99   0.94   0.81
20 parents   1.00   1.00   1.00   1.00   1.00   0.99   0.93   0.72   0.49   0.31
25 parents   1.00   1.00   1.00   1.00   0.99   0.98   0.81   0.53   0.32   0.19
50 parents   1.00   1.00   0.99   0.99   0.91   0.55   0.28   0.14   0.07   0.05

Table 5.1: Power of Monte Carlo tests of CSR against Matern cluster processes with λp, λc, r, where λp and r are varying as shown and λc is adjusted to keep the intensity of the process constant at 100. Test uses 99 realisations of CSR. Power using area statistic and estimated from 1000 simulations under Matern cluster process.

Matern model II, area statistic

Radius r     0.005  0.01   0.015  0.02   0.025  0.03   0.035  0.04   0.045  0.05
Power        0.004  0.004  0.005  0.02   0.09   0.38   0.72   0.95   0.99   1.00

Table 5.2: Power of Monte Carlo tests of CSR against the Matern model II with λ0, r; where λ0 is chosen to achieve an intensity of 100. Test uses 99 realisations of CSR. Power using area statistic and estimated from 1000 simulations under Matern model II.

Conclusion The power of the Monte Carlo test based on the supremum distance

(Section 5.4.1) is quite variable and difficult to understand whereas the power of

the Monte Carlo test based on the area statistic (Section 5.4.2) is straightforward.

The best power achieved by the supremum distance is comparable to the best power

achieved by the area statistic. Therefore, we recommend the Monte Carlo test based

on the area statistic for the models studied here.


Figure 5.8: Power of Monte Carlo tests of CSR against Matern cluster (left) and Matern model II (right) with parameters described previously in the text. λp, λ0 are chosen to achieve an intensity of 100. Test uses 99 realisations of CSR. Power using area statistic and estimated from 1000 simulations under each model. (Both panels plot power against r; the left panel has one curve for each of λp = 5, 10, 20, 25, 50.)

CHAPTER 6

Analysis of multivariate point patterns

The main aim of this chapter is the analysis of multivariate or multitype point pat-

terns using a clustering algorithm. Diggle [30, page 90] defined a multivariate point

process as any stochastic mechanism which generates events classified as type j for

j = 1, . . . , k. The k (univariate) point processes are referred to as the components

of the multivariate process. A multivariate point pattern is a realisation of a multi-

variate point process. For a general introduction to the theory of multivariate point

processes, see [20, 21].

The usual approach [24, 30, 90, 91, 102] to investigating independence between different types of multivariate point patterns begins with the estimation of cross-type versions of the standard summary functions, such as the nearest neighbour distance function G and the second reduced moment function K. In addition, Van Lieshout and Baddeley’s J function [67] is used to study forms of dependence between points of different types in a multivariate point pattern. Further information and applications of the J-function to multivariate point patterns are reported in [67].

In this chapter, an extension of the strategy described in Chapter 4 to multivari-

ate point patterns using three summary statistics is investigated. The first statistic

is the fusion distance function introduced in Section 4.1.1, and the second is a new

summary statistic introduced in Section 6.2, the S statistic. (The S statistic mea-

sures the number of clusters in which all members belong to the same type of a given

multitype point pattern). The properties of S for a given bivariate point pattern

are examined under the random labelling and independence hypotheses.

Finally, we introduce a spatially modified version of the Rg index [37, 86], a

popular measurement used for comparing two classifications in cluster analysis. The

properties of the spatial Rg index are investigated under the random labelling, and

independence null hypotheses.

6.1 Extension based on fusion distance function

Let Y = (X1, . . . , Xℓ) be a (marked) multivariate point process on R^d, where Xj (j = 1, . . . , ℓ) is a univariate point process on R^d. Note that the types and the components of a given point process are different concepts. The types of a multitype point process are the marks or labels, for example “on”/“off”, attached to the points. The components of a marked point process are the sub-patterns consisting of the points of one type. Our notation for a realisation of a (given) marked multivariate point process, a (given) marked point pattern, is y = (x1, . . . , xℓ), where xj is the sub-pattern of points of type j, yi is the ith point (i = 1, . . . , n), and y0 is the unmarked point pattern. (The y0 can also be referred to as the point pattern regardless of the marks.)

The extension of the strategy to analyse multivariate point patterns is to test

the null hypothesis of random labelling using the fusion distance function. The

definition of the random labelling property is given below, and then the procedure

of the strategy is described next.

Definition 16 (random labelling). Let mi be the type (mark) attached to the ith point yi. The random labelling hypothesis states that, given the unmarked pattern y0, the types m1, . . . , mn attached to these points are i.i.d. with distribution (pj), that is, P(mi = j) = pj for j = 1, . . . , k (k ∈ N). A consequence of this hypothesis is that, given the unmarked pattern y0 and the number nj of points of type j, the component of type j is a simple random sample of size nj without replacement from y0. There is also another consequence: given the unmarked pattern y0 and the number nj of points of each type j, the marks m1, . . . , mn are a random permutation of (1, 1, . . . , 1, 2, 2, . . . , 2, . . . , k).
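The second consequence in Definition 16 is what makes the hypothesis easy to simulate: holding the locations and the type counts fixed, a realisation under random labelling is obtained by permuting the marks. A minimal sketch follows; the function name `random_relabel` is illustrative, and the counts are those of the Cat Retinal Ganglia dataset analysed later in this chapter.

```python
import random

def random_relabel(marks, rng):
    """Random permutation of the marks; the type counts n_j stay fixed."""
    shuffled = list(marks)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(16)
marks = ["on"] * 65 + ["off"] * 70
relabelled = random_relabel(marks, rng)
```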

The procedure:

1. Select a component of the marked point pattern, say, the sub-pattern of type 1 (size n1), and compute its fusion distance function, H1(t).

2. Collect $x_1^{(1)}, \ldots, x_1^{(m)}$, m i.i.d. sub-samples of size n1 (selected randomly without replacement) from y0, and compute the fusion distance functions $H_1^{(r)}(t)$, where r = 1, . . . , m.

3. Calculate the mean $\bar{H}_1(t)$ of the fusion distance functions $H_1^{(r)}(t)$, given by

\[ \bar{H}_1(t) = \frac{1}{m} \sum_{r=1}^{m} H_1^{(r)}(t). \tag{6.1} \]

4. Apply the two parts of the strategy presented in Section 4.3, the exploratory data analysis and inference, to compare H1(t) with $\bar{H}_1(t)$.
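Steps 1 to 3 of the procedure can be sketched as below, under two assumptions that are not taken from the text: H(t) is treated as the proportion of Single Linkage fusion distances not exceeding t, and the functions are evaluated on a fixed grid of t values. The synthetic pattern and all names are illustrative; step 4 (envelopes and critical bands) is not reproduced.

```python
import bisect
import math
import random

def fusion_distances(points):
    """Single Linkage fusion distances (minimum spanning tree edge lengths)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n
    best[0] = 0.0
    heights = []
    for step in range(n):
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[i] = True
        if step > 0:
            heights.append(best[i])
        for k in range(n):
            if not in_tree[k]:
                best[k] = min(best[k], math.dist(points[i], points[k]))
    return heights

def h_on_grid(points, grid):
    """Assumed H(t) evaluated on a grid: proportion of fusion distances <= t."""
    ordered = sorted(fusion_distances(points))
    return [bisect.bisect_right(ordered, t) / len(ordered) for t in grid]

def mean_h_under_random_labelling(unmarked, n1, m, grid, rng):
    """Steps 2 and 3: mean of H over m sub-samples drawn without replacement."""
    total = [0.0] * len(grid)
    for _ in range(m):
        sub_sample = rng.sample(unmarked, n1)
        total = [a + b for a, b in zip(total, h_on_grid(sub_sample, grid))]
    return [a / m for a in total]

rng = random.Random(61)
unmarked = [(rng.random(), rng.random()) for _ in range(60)]  # synthetic y0
type_one = unmarked[:25]                                      # synthetic component
grid = [0.5 * k / 20 for k in range(21)]
h1 = h_on_grid(type_one, grid)                                          # step 1
h1_bar = mean_h_under_random_labelling(unmarked, 25, 20, grid, rng)     # steps 2-3
```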

If the random labelling hypothesis does not hold, then H1(t) is outside the (point-

wise) simulation envelopes or inside the (simultaneous) critical band at exact signif-

icance level α.


In other words, if the random labelling hypothesis is true, then given the number

of points of type 1, the component of type 1 is a random sample without replacement

from y. Next, an illustration of the extension of the strategy using the fusion

distance function applied to the bivariate Cat Retinal Ganglia dataset is presented.

Figure 6.1: The bivariate Cat Retinal Ganglia dataset with two types: 65 “on” cells (△) and 70 “off” cells (◦) on a rectangular region with dimensions 1 mm by 0.7533 mm. Source: [113], data provided in [9].

Cat Retinal Ganglia dataset The Cat Retinal Ganglia Data were introduced by

Wassle, Boycott, and Illing [113] and were analysed by [31, 67, 113]. Figure 6.1

shows the dataset, which is a pattern of beta-type ganglion cells in the retina of a

cat recorded by [113]. Beta cells are associated with the resolution of fine detail in

the cat’s visual system. The cells can be classified anatomically as “on” or “off”.

In this sample, there are 65 on cells and 70 off cells in a rectangular region with

dimensions 1 mm by 0.7533 mm. Van Lieshout and Baddeley [67] stated that the

statistical independence of the on and off components would strengthen the claim

that there are two separate channels, one for brightness and another for darkness, as

postulated by Hering in 1874. More information on the Cat Retinal Ganglia dataset

and its analysis is presented in [31, 67, 113].

Illustration The extension of the strategy was applied to the Cat Retinal Ganglia dataset using the two-sided Monte Carlo test (Section 3.5). The random labelling null hypothesis was tested against the dependence of the types on the locations of the points at the 5% exact significance level. Figures 6.2 and 6.3 show the P-P plots and A-A plots of the fusion distance function with simulation envelopes and critical bands based on the Single Linkage, respectively. The number of random permutations of the type labels was 999. The results based on the Single Linkage are equivalent to those obtained from the Average Linkage and Complete Linkage. (Further details are provided in Section A.3, appendix A.) The fusion distance functions of the two types of the dataset are mostly outside the simulation envelopes and inside the critical bands. Therefore, the random labelling hypothesis is rejected for the Cat Retinal Ganglia dataset based on these clustering algorithms. Note that our result agrees with the results obtained by [31, 67, 113].

Figure 6.2: P-P plots of fusion distance functions H1(t), H2(t) versus H̄(t) from the Cat Retinal Ganglia dataset. Upper: on cells (type 1); lower: off cells (type 2). Left: simulation envelopes; right: critical bands; 5% significance level; Single Linkage algorithm; 999 random permutations of the type labels.

6.2 Extension based on S statistic

In this section, we introduce the S statistic. Then, we examine its properties

under random labelling and independence hypotheses. We also present an extension

of the strategy using the S statistic.

Figure 6.3: A-A plots of $\arcsin\sqrt{1 - H_i(t)}$, $i = 1, 2$, against $\arcsin\sqrt{1 - H(t)}$ for the Cat
Retinal Ganglia dataset. Upper row: on cells (type 1); lower row: off cells (type 2). Left:
simulation envelopes; right: critical bands; 5% significance level; Single Linkage
algorithm; 999 random permutations of the type labels.

Definition 17 (S statistic). Let y be a marked multivariate point pattern with
j types or marks on a bounded region W. Then, a chosen clustering algorithm is
applied to y and the number of clusters is counted at all levels of the dendrogram in
which all members of a cluster are of the same type. Thus, the statistic S is defined
as:

\[
S = \#\{\text{clusters at all hierarchical levels in which all members of a cluster are of the same type}\}, \tag{6.2}
\]
where # denotes “the number of”. That is, the S statistic measures the degree of

attraction between the types of a given multivariate point pattern.

Large values of S correspond to positive association between types of nearby

points, while small values of S correspond to negative association. In applications,

the alternative hypothesis is usually one of positive association so that we perform

one-sided tests where large values are critical.

Properties of S under random labelling Given a hierarchical dendrogram of a

marked point pattern with two components, let Ci be the cluster created by fusion


at the ith step of the hierarchical algorithm, and $C_{n-1} = \{y_1, \ldots, y_n\}$ be the entire
set of points. Then,
\[
S = \sum_{i=1}^{n-1} \mathbf{1}\{\text{cluster } C_i \text{ consists only of points of a single type}\},
\]
where $\mathbf{1}\{\cdot\}$ denotes the indicator function. Thus the expected value of S is given by
\[
E[S] = \sum_{i=1}^{n-1} P\{\text{cluster } C_i \text{ consists only of points of a single type}\}.
\]

Under the random labelling hypothesis, if there are n1 points of type 1 and n2 points

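of type 2 then (conditional on y, that is, conditional on the locations of the points,
but not on their marks)
\[
\begin{aligned}
P\{\text{cluster } C_i \text{ consists of points of a single type}\}
&= P\{\text{all points in } C_i \text{ have type 1}\} + P\{\text{all points in } C_i \text{ have type 2}\} \\
&= \frac{\binom{n-s_i}{n_1-s_i}}{\binom{n}{n_1}} + \frac{\binom{n-s_i}{n_2-s_i}}{\binom{n}{n_2}},
\end{aligned}
\tag{6.3}
\]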

where si = # points in Ci.

If $s_i > \max(n_1, n_2)$ then $P\{\text{cluster } C_i \text{ consists of points of a single type}\} = 0$. Thus
(conditional on y)
\[
E_{rr}[S] = \frac{\sum_{i=1}^{n-1}\left[\binom{n-s_i}{n_1-s_i} + \binom{n-s_i}{n_2-s_i}\right]}{\binom{n}{n_1}}, \tag{6.4}
\]
taking $\binom{n}{-k} = 0$ for $-k < 0$. Observe that $E_{rr}[S]$ is the expected value of the S

statistic under the random labelling hypothesis conditional on y, and depends on the

given dendrogram, more particularly on the sizes of the clusters si. The summands

in equation (6.4) decrease rapidly as si increases, so it is easy to terminate the sum.
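As a minimal sketch (not part of the original analysis), equation (6.4) can be evaluated directly from the cluster sizes of a dendrogram. The function names `binom` and `expected_s_random_labelling` are illustrative assumptions, and Python is used only for illustration.

```python
from fractions import Fraction
from math import comb


def binom(n: int, k: int) -> int:
    """Binomial coefficient, taken to be 0 when k is negative or exceeds n."""
    return comb(n, k) if 0 <= k <= n else 0


def expected_s_random_labelling(sizes, n1, n2):
    """Expected value of S under random labelling, following equation (6.4).

    sizes: cluster sizes s_1, ..., s_{n-1} read off the dendrogram.
    n1, n2: numbers of points of type 1 and of type 2.
    """
    n = n1 + n2
    total = sum(binom(n - s, n1 - s) + binom(n - s, n2 - s) for s in sizes)
    return Fraction(total, comb(n, n1))
```

For cluster sizes 2, 2, 4 with n1 = n2 = 2, the call `expected_s_random_labelling([2, 2, 4], 2, 2)` returns the fraction 2/3. Exact fractions are used so that small examples can be checked by hand.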

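Write $I_i = \mathbf{1}\{\text{cluster } C_i \text{ consists of points of a single type}\}$; then $S = \sum_{i=1}^{n-1} I_i$.
Thus
\[
\operatorname{Var}(S) = \operatorname{Var}\Bigl(\sum_{i=1}^{n-1} I_i\Bigr)
= \sum_{i=1}^{n-1} \operatorname{Var}(I_i) + \sum_{i \neq j} \operatorname{Cov}(I_i, I_j). \tag{6.5}
\]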


Note that $\operatorname{Var}(I_i) = p_i(1 - p_i)$, where
$p_i = E_{rr}(I_i) = P\{\text{cluster } C_i \text{ consists of points of a single type}\}$, and
$\operatorname{Cov}(I_i, I_j) = p_{ij} - p_i p_j$, where
\[
p_{ij} = E_{rr}(I_i I_j) = P\{\text{cluster } C_i \text{ consists of points of a single type and } C_j \text{ consists of points of a single type}\}. \tag{6.6}
\]

If $C_i \cap C_j \neq \emptyset$ then
\[
p_{ij} = P\{C_i \cup C_j \text{ consists of points of a single type}\}
= \frac{\binom{n-s}{n_1-s} + \binom{n-s}{n_2-s}}{\binom{n}{n_1}}, \tag{6.7}
\]

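where s = # points in $C_i \cup C_j$. If $C_i \cap C_j = \emptyset$ then
\[
\begin{aligned}
p_{ij} &= P\{C_i \text{ consists of points of a single type},\ C_j \text{ consists of points of a single type}\} \\
&= P\{C_i \cup C_j \text{ consists of points of a single type}\} \\
&\quad + P\{C_i \text{ consists of points of type 1},\ C_j \text{ consists of points of type 2}\} \\
&\quad + P\{C_i \text{ consists of points of type 2},\ C_j \text{ consists of points of type 1}\} \\
&= \frac{\binom{n-s_i-s_j}{n_1-s_i-s_j} + \binom{n-s_i-s_j}{n_2-s_i-s_j} + \binom{n-s_i-s_j}{n_1-s_i} + \binom{n-s_i-s_j}{n_2-s_i}}{\binom{n}{n_1}},
\end{aligned}
\tag{6.8}
\]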

so Var(S) may also be computed.
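The variance calculation of equations (6.5) to (6.8) can be sketched as follows. This is an illustrative sketch under stated assumptions: clusters are represented as `frozenset`s of point indices, all function names are hypothetical, and the two mixed cases for disjoint clusters are counted directly from the events (the number of ways of placing the remaining type 1 labels).

```python
from fractions import Fraction
from math import comb


def binom(n, k):
    """Binomial coefficient, taken to be 0 when k is negative or exceeds n."""
    return comb(n, k) if 0 <= k <= n else 0


def single_type_prob(size, n1, n2):
    """P{a given set of `size` points is all of one type}, as in (6.3)."""
    n = n1 + n2
    count = binom(n - size, n1 - size) + binom(n - size, n2 - size)
    return Fraction(count, comb(n, n1))


def joint_prob(ci, cj, n1, n2):
    """p_ij of (6.6): both clusters are single-typed."""
    n = n1 + n2
    if ci & cj:  # overlapping clusters: the union must be single-typed (6.7)
        return single_type_prob(len(ci | cj), n1, n2)
    si, sj = len(ci), len(cj)
    rest = n - si - sj  # points outside both clusters
    count = (binom(rest, n1 - si - sj)    # both clusters of type 1
             + binom(rest, n2 - si - sj)  # both clusters of type 2
             + binom(rest, n1 - si)       # C_i of type 1, C_j of type 2
             + binom(rest, n1 - sj))      # C_i of type 2, C_j of type 1
    return Fraction(count, comb(n, n1))


def variance_s(clusters, n1, n2):
    """Var(S) under random labelling, following the decomposition (6.5)."""
    p = [single_type_prob(len(c), n1, n2) for c in clusters]
    var = sum(pi * (1 - pi) for pi in p)
    for i, ci in enumerate(clusters):
        for j, cj in enumerate(clusters):
            if i != j:
                var += joint_prob(ci, cj, n1, n2) - p[i] * p[j]
    return var
```

For four points with clusters {0, 1}, {2, 3} and {0, 1, 2, 3} and n1 = n2 = 2, `variance_s` returns 8/9, which agrees with a direct enumeration of the six equally likely labellings.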

Illustrative example Let us consider y = (x1,x2) to be a given marked bivariate

point pattern, where x1 = {y1, y2}, x2 = {y3, y4}, y0 = {y1, y2, y3, y4}, y1 = (1, 1),

y2 = (2, 2), y3 = (6, 6), y4 = (8, 8), and their respective marks be m1 = “on”,
m2 = “on”, m3 = “off”, m4 = “off”. The Single Linkage algorithm is applied to y, and
its dendrogram is shown in Figure 6.4. The fusion distances are $h_1 = \sqrt{2}$, $h_2 = 2\sqrt{2}$,
$h_3 = 4\sqrt{2}$, and the clusters are C1 = {y1, y2}, C2 = {y3, y4}, C3 = {y1, y2, y3, y4}.
Under the random labelling hypothesis, and conditional on y, there are n1 = 2
points of type “on” and n2 = 2 points of type “off”. The numbers of points in each
cluster are s1 = 2, s2 = 2, s3 = 4. By equation (6.3), p1 = p2 = 1/6 + 1/6 = 1/3 and
p3 = 0. The observed statistic is S = 2, and its expected value is $E_{rr}[S] = 2/3$.
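The illustrative example can be reproduced with a short script. The sketch below uses a naive pure-Python Single Linkage implementation written only for this illustration (it is not the software used in this thesis); the function names are assumptions, and points are referred to by their index 0 to 3 rather than y1 to y4.

```python
from itertools import combinations
from math import dist


def single_linkage_clusters(points):
    """Return [(fusion_distance, merged_cluster), ...] in order of fusion.

    Naive single linkage on a list of coordinate tuples; clusters are
    frozensets of point indices.
    """
    active = [frozenset([i]) for i in range(len(points))]
    fusions = []
    while len(active) > 1:
        # Single linkage: the distance between two clusters is the
        # smallest distance between a point of one and a point of the other.
        d, a, b = min(
            ((min(dist(points[i], points[j]) for i in a for j in b), a, b)
             for a, b in combinations(active, 2)),
            key=lambda t: t[0],
        )
        active = [c for c in active if c not in (a, b)] + [a | b]
        fusions.append((d, a | b))
    return fusions


def s_statistic(fusions, marks):
    """Count the fused clusters whose members all carry the same mark."""
    return sum(1 for _, cluster in fusions
               if len({marks[i] for i in cluster}) == 1)


points = [(1, 1), (2, 2), (6, 6), (8, 8)]
marks = ["on", "on", "off", "off"]
fusions = single_linkage_clusters(points)
S = s_statistic(fusions, marks)
```

Here `fusions` holds the three fusion distances together with the clusters {0, 1}, {2, 3} and {0, 1, 2, 3}, and `S` equals 2, in agreement with the values stated in the example.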



Figure 6.4: A Single Linkage dendrogram applied to a given marked bivariate point

pattern y.

Properties of S under independence The calculations become simpler if the types

are assumed to be independent with P{type 1} = p and P{type 2} = 1 − p = q.

Then $E[I_i] = P\{\text{cluster } C_i \text{ consists of points of a single type}\} = p^{s_i} + q^{s_i}$. Thus the
expected value of S under independence, denoted by $E_{ind}[S]$, is given as follows,
\[
E_{ind}[S] = \sum_{i=1}^{n-1} \left( p^{s_i} + q^{s_i} \right). \tag{6.9}
\]
Observe that equation (6.9) can be used as an approximation when n is large,
$n_1/n \sim p$ and $n_2/n \sim q$. If $C_i \cap C_j = \emptyset$ then $I_i$, $I_j$ are independent. In practice, if the point

pattern has more than 2 types, then the properties of the S statistic under random

labelling and independence are very difficult to compute analytically. Therefore,

in this case, we need to rely on Monte Carlo simulation and tests, described in

Chapter 3.
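For the two-type case, equation (6.9) is simple enough to evaluate in one line. The following sketch is illustrative only; the function name is an assumption.

```python
def expected_s_independence(sizes, p):
    """Expected value of S under independence, following equation (6.9).

    sizes: cluster sizes s_1, ..., s_{n-1} read off the dendrogram.
    p: probability that a point is of type 1 (so q = 1 - p for type 2).
    """
    q = 1.0 - p
    return sum(p ** s + q ** s for s in sizes)
```

With cluster sizes 2, 2, 4 and p = 0.5, the summands are 0.5, 0.5 and 0.125, so the function returns 1.125.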

The extension of the strategy using the S statistic is similar to that described

for the fusion distance function presented in Section 6.1. But here, the (one-sided)

modified version of the Monte Carlo test described in Section 3.5 is performed to

test the null hypotheses: random labelling, and independence.

Applications of S statistic The strategy using the S statistic is applied to the

bivariate point patterns: Cat Retinal Ganglia dataset (Section 6.1), Austin Hughes’

Amacrine Cell Data, Clustered, and Longleaf pines. (The level of significance of the

tests is α = 0.05.) However, before the results are presented, a brief description of

the remaining datasets is given as follows.

Austin Hughes’ dataset Figure 6.5 shows Austin Hughes’ Amacrine Cell Data,

which has 152 on cells and 142 off cells on a rectangular region with dimensions


Figure 6.5: Austin Hughes’ dataset with two types: 152 on cells (△) and 142 off

cells (◦) on a rectangular region with dimensions 1.6065 mm by 1 mm. Source: [31],

data provided in [9].

1.6065 mm by 1.00 mm. This dataset is an example of a bivariate point pattern of

Amacrine cells in the retina of a rabbit. In what follows, this dataset is referred to

as Austin Hughes’ dataset. For more information on Austin Hughes’ dataset, see

[31, 9].

Longleaf pines dataset The Longleaf pines data were introduced by Platt, Evans,

and Rathbun [85], and register the locations and diameters at breast height (dbh)

of 584 Longleaf pines, in a square of 200 m in southern Georgia, USA. (Platt, Evans

and Rathbun [85, page 500] classified trees less than 5 cm dbh as “juveniles”, trees

with 5–30 cm dbh as “subadults”, and trees larger than 30 cm dbh as “adults”.)

More details of this dataset are reported in [9, 85, 87]. For simplicity, the dataset

(analysed here) was re-scaled to the unit square and classified into two types: 313
trees with dbh ≤ 30 cm, and 271 trees with dbh > 30 cm. Trees with dbh ≤ 30 cm were
called “young” and trees with dbh > 30 cm were named “adult”. Figure 6.6 shows the
Longleaf pines dataset classified into young (◦), and adult (△) types.

Clustered dataset Figure 6.7 shows a realisation of a Matérn cluster process (defined
in Section 5.1) with λp = 2, λc = 100 and r = 0.2 on the unit square. This
simulated dataset is an example of an “ideally” clustered point pattern. The daughters
from the first parent are labelled type 1 and the daughters of the second parent
are labelled type 2.

Results of S statistic Table 6.1 shows the estimated values of the S statistic and the 5%
Monte Carlo critical values under the null hypotheses of random labelling and
independence, denoted by S5%rr and S5%ind, respectively. (The 5% Monte Carlo critical
value is the 95th quantile defined in Section 3.) The chosen clustering algorithm was
the Single Linkage and the number of realisations under each null hypothesis was
999. From Table 6.1, the observed values of the S statistic are greater than the Monte
Carlo critical values for the Clustered and Longleaf pines datasets. Thus, both null
hypotheses are rejected for these datasets. However, for the Cat Retinal Ganglia and
Austin Hughes’ datasets, S < S5%rr and S < S5%ind. Therefore, neither null hypothesis is
rejected for these point patterns. The results obtained from the Single Linkage
are similar to those obtained from the Average Linkage and Complete Linkage:
the null hypotheses were rejected for the Clustered and Longleaf pines
datasets, but not rejected for the Cat Retinal Ganglia and Austin Hughes’ datasets.
Consequently, the results based on the Average Linkage and Complete Linkage are not shown here.

Datasets               S      S5%rr    S5%ind
Cat Retinal Ganglia    1      32.7     32.3
Austin Hughes’         9      67.0     66.6
Clustered              198    44.2     43.8
Longleaf pines         349    120.2    119.8

Table 6.1: Estimated values of S statistic, Monte Carlo critical values under random
labelling and independence null hypotheses, S5%rr, S5%ind, respectively, for Cat Retinal
Ganglia, Austin Hughes’, Clustered and Longleaf pines datasets; Single Linkage
algorithm; for each dataset and null hypothesis: 999 permutations of the type labels.

Figure 6.6: Longleaf pines dataset classified into two types: 313 young trees with
dbh ≤ 30 cm (◦), and 271 adult trees with dbh > 30 cm (△), on a square
region of side 200 m. The square region was re-scaled to the unit square. Source:

[85], data provided in [9].


Figure 6.7: Clustered dataset, a simulated Matérn cluster point pattern with λp = 2,
λc = 100, r = 0.2 on the unit square. Daughters of the first parent are labelled type
1 (◦) and daughters of the second parent are labelled type 2 (△).

6.3 Extension based on spatial Rg index

In this section, the Rg index is presented, and then the index is modified to

assimilate the spatial context. Next, the properties of the spatial Rg under ran-

dom labelling and independence hypotheses are investigated. The extension of the

strategy using the spatial Rg index is also presented.

Rg index The Rg index is one of the most commonly used measurements for

comparing two classifications in non-spatial cluster analysis (Section 2.3). The index

was introduced by Rand [86] and the following definition is quoted from [37, page 147].

Definition 18 (Rg index). Let C1 and C2 be two classifications of the same dataset

of n points into g clusters, where g is fixed. The Rg index of similarity between C1

and C2 is

\[
R_g(C_1, C_2) = \frac{T_g - \tfrac{1}{2}P_g - \tfrac{1}{2}Q_g + \binom{n}{2}}{\binom{n}{2}},
\]
where
\[
T_g = \sum_{i=1}^{g}\sum_{j=1}^{g} n_{ij}^2 - n, \qquad
P_g = \sum_{i=1}^{g} n_{i\cdot}^2 - n, \qquad
Q_g = \sum_{j=1}^{g} n_{\cdot j}^2 - n,
\]
and the quantity $n_{ij}$ is the number of points in common between the ith cluster
of the first classification and the jth cluster of the second (the clusters in the two
classifications may each be labelled arbitrarily from 1 to g). The terms $n_{i\cdot}$ and $n_{\cdot j}$
are the appropriate marginal totals of the $n_{ij}$ values.

Everitt [37] interpreted the Rg index as the probability that two points are treated

alike in both classifications. He also pointed out that the Rg index lies in the interval


[0,1] and takes its upper limit when there is complete agreement between the two

classifications. Further details, properties and applications of the Rg index to cluster

analysis are reported in [37, 42, 86].
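To make Definition 18 and Everitt's interpretation concrete, the following sketch computes the Rg index in two ways: from the contingency counts via Tg, Pg and Qg, and by counting the pairs of points that are treated alike in both classifications. The function names and the representation of a classification as a list of labels are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(labels_a, labels_b):
    """Rg index computed from the contingency counts n_ij (Definition 18)."""
    n = len(labels_a)
    n_ij = Counter(zip(labels_a, labels_b))  # joint counts
    n_i = Counter(labels_a)                  # marginal totals, classification 1
    n_j = Counter(labels_b)                  # marginal totals, classification 2
    t = sum(v * v for v in n_ij.values()) - n
    p = sum(v * v for v in n_i.values()) - n
    q = sum(v * v for v in n_j.values()) - n
    return (t - p / 2 - q / 2 + comb(n, 2)) / comb(n, 2)


def rand_index_by_pairs(labels_a, labels_b):
    """Proportion of pairs treated alike (together or apart) in both."""
    n = len(labels_a)
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / comb(n, 2)
```

For the label lists `[1, 1, 2, 2, 2]` and `[1, 2, 2, 2, 1]`, both functions give 0.4, since 4 of the 10 pairs are treated alike; for two identical classifications both give 1.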

Spatial Rg index To the best of our knowledge, the Rg index has not been applied

to the analysis of spatial point patterns. Thus, a (new) modified version of the index,

the spatial Rg index, is introduced as follows.

Definition 19 (Spatial Rg index). Let y be a marked multivariate point pattern

with j different marks, and y0 be the unmarked point pattern. Let Cm be the

classification of the points of y based on their marks (group i contains all points

with mark equal to i, i = 1, . . . , j). Let Cs be the classification of y0 obtained by

applying a chosen clustering algorithm to y0 and extracting a classification with j

classes. The spatial Rg index is Rg(Cs, Cm).

That is, we compare the “j-class” classification from the point pattern y0 with

the classification into “j-marks” from the marked point pattern y using the spatial

Rg index. Thus, the spatial Rg index measures the extent of spatial segregation of

the points of different marks.

Properties of spatial Rg index In this section, some properties of the spatial Rg

index for a bivariate point pattern are investigated. Consider a point pattern y with

two types, where the first type has n1 points and the second has n2 points, and n =

n1+n2. After applying a clustering algorithm to the unmarked point pattern y0, cut

the dendrogram (Section 2.3) into two groups so that the first group has m1 points

and the second group has m2 points, where n = m1 + m2 and n1, n2,m1,m2 ∈ N.

Let Ai,j , where i, j = 1, 2, be the numbers of points of type i belonging to group j.

Then

A11 + A12 = n1, A11 + A21 = m1, A21 + A22 = n2, A12 + A22 = m2.

The summarised information on the two classifications is presented in Table 6.2.

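The spatial Rg index (denoted by Rg) can also be written as
\[
\begin{aligned}
R_g &= \frac{1}{\binom{n}{2}}\Bigl[\#\text{ of pairs } (i, j) \text{ of same type and same group}
 + \#\text{ of pairs } (i, j) \text{ of different type and different group}\Bigr] \\
&= \frac{1}{\binom{n}{2}}\left[\binom{A_{11}}{2} + \binom{A_{22}}{2} + \binom{A_{12}}{2} + \binom{A_{21}}{2} + A_{11}A_{22} + A_{12}A_{21}\right].
\end{aligned}
\tag{6.10}
\]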


            group 1    group 2    Σ
type 1      A11        A12        n1
type 2      A21        A22        n2
Σ           m1         m2         n

Table 6.2: Spatial classification into two types and cluster analysis classification into
two groups for a given bivariate point pattern.

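Consider A11 = X and A22 = Y; then the spatial Rg index can be re-written as a
quadratic form in X and Y
\[
\begin{aligned}
R_g = \frac{1}{n(n-1)}\Bigl[& X(X-1) + Y(Y-1) + (n_1 - X)(n_1 - 1 - X) \\
& + (n_2 - Y)(n_2 - 1 - Y) + XY + (n_1 - X)(n_2 - Y)\Bigr].
\end{aligned}
\tag{6.11}
\]
The quadratic form given by equation (6.11) is symmetric in X about $\tfrac{1}{2}n_1$, that is,
if X is replaced by (n1 − X) the same result is obtained. The quadratic form is also
symmetric in Y about $\tfrac{1}{2}n_2$. Moreover, the coefficients of $X^2$ and $Y^2$ are positive so
that a minimum occurs at $X = \tfrac{1}{2}n_1$, $Y = \tfrac{1}{2}n_2$, yielding
\[
\begin{aligned}
\min R_g &= \frac{1}{n(n-1)}\Bigl[\tfrac{1}{2}n_1\bigl(\tfrac{1}{2}n_1 - 1\bigr) + \tfrac{1}{2}n_2\bigl(\tfrac{1}{2}n_2 - 1\bigr)
 + \tfrac{1}{2}n_1\bigl(\tfrac{1}{2}n_1 - 1\bigr) + \tfrac{1}{2}n_2\bigl(\tfrac{1}{2}n_2 - 1\bigr) + \tfrac{1}{4}n_1n_2 + \tfrac{1}{4}n_1n_2\Bigr] \\
&= \frac{1}{n(n-1)}\Bigl[n_1\bigl(\tfrac{1}{2}n_1 - 1\bigr) + n_2\bigl(\tfrac{1}{2}n_2 - 1\bigr) + \tfrac{1}{2}n_1n_2\Bigr].
\end{aligned}
\tag{6.12}
\]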

The minimum of the spatial Rg index given by equation (6.12) occurs when X,Y are

free to take real values. However, if X,Y are constrained to be nonnegative integers

then the minimum of the spatial Rg index occurs at one of the integer points closest

to $X = \tfrac{1}{2}n_1$, $Y = \tfrac{1}{2}n_2$. Let n be large (n → ∞) with n1 ≈ c1 n and n2 ≈ c2 n, where
c2 = 1 − c1. If Aij ≈ cij n then c11 + c12 = c1 and c21 + c22 = c2 = 1 − c1. Therefore,
the spatial Rg index can be approximated by the following expression
\[
\begin{aligned}
R_g &\approx c_{11}^2 + c_{12}^2 + c_{22}^2 + c_{21}^2 + c_{12}c_{21} + c_{11}c_{22} \\
&= Z^2 + (c_1 - Z)^2 + W^2 + (c_2 - W)^2 + (c_1 - Z)(c_2 - W) + ZW,
\end{aligned}
\tag{6.13}
\]

where Z = c11,W = c22. By symmetry, the minimum value of the spatial Rg index


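as a function of Z and W, for fixed c1, occurs at $Z = \tfrac{1}{2}c_1$ and $W = \tfrac{1}{2}c_2$. Then
\[
\begin{aligned}
\min R_g &= \tfrac{1}{4}c_1^2 + \tfrac{1}{4}c_1^2 + \tfrac{1}{4}c_2^2 + \tfrac{1}{4}c_2^2 + \tfrac{1}{4}c_1c_2 + \tfrac{1}{4}c_1c_2
 = \tfrac{1}{2}\bigl(c_1^2 + c_2^2 + c_1c_2\bigr) \\
&= \tfrac{1}{2}\bigl(c_1^2 + (1 - c_1)^2 + c_1(1 - c_1)\bigr) = \tfrac{1}{2}\bigl(1 - c_1 + c_1^2\bigr).
\end{aligned}
\tag{6.14}
\]
This is never smaller than its value at $c_1 = \tfrac{1}{2}$, so that $R_g \geq \tfrac{1}{2}\bigl(1 - \tfrac{1}{2} + \tfrac{1}{4}\bigr) = \tfrac{3}{8}$. The spatial Rg
index is equal to 1 if, and only if, all possible pairs are either of the same group
and same type or of a different group and different type. The only way to ensure
equality is to have all points belong to one group and one type.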

For given n1, n2 the maximum possible value of the spatial Rg index occurs at a

boundary point (because the spatial Rg index is a convex function of X and Y ). In

other words, the maximum value of the spatial Rg index occurs when either X = 0

or n1, and either Y = 0 or n2.

Random labelling The distribution of the spatial Rg index under the random labelling
hypothesis for a given bivariate point pattern is derived as follows. Suppose that the

point pattern consists of n1 points of type 1 and n2 points of type 2, where n =

n1 + n2. Hierarchical clustering of the unmarked points divides them into 2 groups

of size m1, m2, where n = m1 + m2. If the labels are randomly permuted (equal

probability for all n! permutations) then each possible labelling has probability

\[
\frac{n_1!\,n_2!}{n!} = \frac{1}{\binom{n}{n_1}},
\]

that is, each subset of n1 points has an equal chance of being the subset labelled

type 1. Hence the outcome presented by Table 6.2 has probability

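\[
\frac{\binom{m_1}{A_{11}}\binom{m_2}{A_{22}}}{\binom{n}{n_1}},
\]
and value
\[
R_g = \frac{\binom{A_{11}}{2} + \binom{A_{22}}{2} + \binom{A_{12}}{2} + \binom{A_{21}}{2} + A_{11}A_{22} + A_{12}A_{21}}{\binom{n}{2}}.
\]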

Observe that

A12 = n1 − A11, A21 = m1 − A11,

A22 = n2 − A21 = n − n1 − m1 + A11.


Thus there is only one free variable, A11 = X say, constrained by
\[
\begin{aligned}
A_{11} \geq 0 &\iff X \geq 0, \\
A_{12} \geq 0 &\iff X \leq n_1, \\
A_{21} \geq 0 &\iff X \leq m_1, \\
A_{22} \geq 0 &\iff X \geq m_1 - n_2.
\end{aligned}
\tag{6.15}
\]
Therefore max(0, m1 − n2) ≤ X ≤ min(n1, m1). Then under the random labelling

hypothesis, the spatial Rg index and probability can be expressed as a function of

X as follows:

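\[
\begin{aligned}
R_g(x) = \frac{1}{\binom{n}{2}}\Biggl[& \binom{x}{2} + \binom{n_2 - m_1 + x}{2} + \binom{n_1 - x}{2} + \binom{m_1 - x}{2} \\
& + x(n_2 - m_1 + x) + (n_1 - x)(m_1 - x)\Biggr]
\end{aligned}
\tag{6.16}
\]
and
\[
P(X = x) = \frac{\binom{m_1}{x}\binom{m_2}{n_2 - m_1 + x}}{\binom{n}{n_1}}. \tag{6.17}
\]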

Exact null distribution of Rg index The exact distribution of Rg index under the

null hypothesis of random labelling could, in principle, be calculated from equations

(6.16) and (6.17). However, in practice this will be difficult when n is large.
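For small n the exact calculation is straightforward. The sketch below tabulates the value of the spatial Rg index and its probability for every admissible x, following equations (6.15) to (6.17); the function name and the use of exact fractions are choices made for this illustration only.

```python
from fractions import Fraction
from math import comb


def exact_null_distribution(n1, n2, m1):
    """Return [(x, R_g(x), P(X = x))] over the admissible range of x = A11.

    n1, n2: numbers of points of type 1 and type 2.
    m1: size of the first group; the second group has n1 + n2 - m1 points.
    """
    n = n1 + n2
    m2 = n - m1
    rows = []
    for x in range(max(0, m1 - n2), min(n1, m1) + 1):
        a11, a12, a21 = x, n1 - x, m1 - x
        a22 = n2 - a21  # equals n2 - m1 + x
        pairs = (comb(a11, 2) + comb(a22, 2) + comb(a12, 2) + comb(a21, 2)
                 + a11 * a22 + a12 * a21)
        rg = Fraction(pairs, comb(n, 2))
        prob = Fraction(comb(m1, x) * comb(m2, a22), comb(n, n1))
        rows.append((x, rg, prob))
    return rows
```

The probabilities returned for a given (n1, n2, m1) sum to one, which gives a quick check of the implementation; the exact p-value of an observed index is then the total probability of the rows whose index is at least as large as the observed one.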

Monte Carlo approximation of the null distribution of Rg index The null dis-

tribution of Rg index can be approximated to arbitrarily good accuracy, by Monte

Carlo Methods, by randomly permuting the type labels and computing Rg index for

each such permutation.
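A minimal sketch of this Monte Carlo approximation is given below. It assumes that the group memberships from the cluster analysis are already available as a list of labels; the function names, the seed argument, and the p-value convention (1 + number of simulated values at least as large as the observed value, divided by 1 + number of simulations) are assumptions of this sketch and need not coincide with the modified test of Section 3.5.

```python
import random
from itertools import combinations
from math import comb


def spatial_rg(types, groups):
    """Proportion of point pairs treated alike by the type and group labels."""
    n = len(types)
    agree = sum(
        (types[i] == types[j]) == (groups[i] == groups[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / comb(n, 2)


def monte_carlo_null(types, groups, n_perm=999, seed=0):
    """Simulate the index under random labelling by permuting the type labels."""
    rng = random.Random(seed)
    shuffled = list(types)
    values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # the group labels stay fixed
        values.append(spatial_rg(shuffled, groups))
    return values


def one_sided_p_value(observed, null_values):
    """Monte Carlo p-value for a test in which large values are critical."""
    exceed = sum(v >= observed for v in null_values)
    return (1 + exceed) / (1 + len(null_values))
```

Only the type labels are permuted; the clustering of the unmarked points is computed once and reused for every permutation, which is what keeps the simulation inexpensive.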

Based on visual inspection of the histograms of the fusion distances applied to

the multivariate point patterns Longleaf pines (described in Section 6.2) and Brazil-

ian trees (introduced in Section 8.1), it seems appropriate to try fitting a gamma

distribution for approximating the spatial Rg index distribution. (The histograms of

the fusion distances applied to Longleaf pines and Brazilian trees datasets are shown

in Figure A.16, in appendix A.) The following definition of the gamma distribution

is quoted from [55, page 166].

Definition 20 (Shifted Gamma distribution). A random variable X has a

shifted gamma distribution if its probability density function is given by

\[
f_X(x) = \frac{(x - \gamma)^{\alpha - 1}\exp[-(x - \gamma)/\beta]}{\beta^{\alpha}\,\Gamma(\alpha)}, \tag{6.18}
\]
where α > 0, β > 0, and x > γ. The parameters α, β, and γ are known as the

“shape”, “scale” and “shift” of the distribution, respectively.


The parameters α, β, γ of the shifted gamma distribution are estimated using

the Method of Moments described by [55, page 186] and presented as follows. Given

values of n independent random variables X1, . . . , Xn, each distributed as in equation

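(6.18), the Method of Moments estimators $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\gamma}$ are given by
\[
\hat{\alpha} = \frac{4m_2^3}{m_3^2}, \qquad \hat{\beta} = \frac{m_3}{2m_2}, \qquad \hat{\gamma} = \bar{X} - \frac{2m_2^2}{m_3}, \tag{6.19}
\]
respectively, where
\[
\bar{X} = n^{-1}\sum_{j=1}^{n} X_j, \qquad
m_2 = n^{-1}\sum_{j=1}^{n} (X_j - \bar{X})^2, \qquad
m_3 = n^{-1}\sum_{j=1}^{n} (X_j - \bar{X})^3.
\]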

Even though the Moment estimators are often less accurate than the Maximum

Likelihood estimators α∗, β∗ and γ∗, the Moment estimators do not rely on iterative

computational algorithms. Therefore, the Method of Moments is here preferred for

estimating the parameters of the gamma distribution. Our preference is because of

two main reasons. First, the aim is to have a simple approximation of the spatial

Rg index distribution. Second, this approximation should be feasible and rapidly

calculated using direct algorithms.

In other words, computing the Maximum Likelihood estimators takes much longer
than computing the Method of Moments estimators. In addition to the longer
running time, the programming task is much more difficult and demanding than
the direct calculation of the Method of Moments estimators. Further information
on the gamma distribution and its estimators is reported in [55].
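The direct nature of the Method of Moments calculation can be seen in the following sketch of equation (6.19). The function name is an assumption, and the check that the third central moment is positive is an added safeguard of this sketch (the scale estimate would otherwise be non-positive).

```python
def shifted_gamma_moments(sample):
    """Method of Moments estimates (alpha, beta, gamma) from equation (6.19)."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in sample) / n  # third central moment
    if m3 <= 0:
        raise ValueError("the third central moment must be positive")
    alpha = 4 * m2 ** 3 / m3 ** 2
    beta = m3 / (2 * m2)
    shift = mean - 2 * m2 ** 2 / m3
    return alpha, beta, shift
```

By construction the fitted distribution reproduces the first three sample moments: the shift plus the product of shape and scale equals the sample mean, and the shape times the squared scale equals m2.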

Gamma approximation to null distribution of spatial Rg index: A procedure to

approximate the null distribution of the spatial Rg index from a given bivariate

point pattern is described as follows.

Given a bivariate point pattern, the null distribution is simulated a few times (for
example, 30 to 100 times), and the parameters of the shifted gamma distribution

of the spatial Rg index are estimated using the Method of Moments given by equation

(6.19). The p-value for the observed spatial Rg index is then calculated from the

given bivariate point pattern based on the shifted gamma distribution.
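The last step of the procedure, computing an upper-tail p-value from the fitted shifted gamma distribution, can be sketched as follows. The regularised incomplete gamma function is evaluated here with its standard power series, which is an implementation choice of this sketch (adequate for moderate arguments) and not necessarily the routine used to produce the results reported in this chapter; the function names are assumptions.

```python
from math import exp, lgamma, log


def regularized_lower_gamma(a, x, tol=1e-14, max_terms=10_000):
    """Regularised lower incomplete gamma function P(a, x) by power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a  # k = 0 term of sum_k x^k / (a (a+1) ... (a+k))
    total = term
    k = 0
    while abs(term) > tol * abs(total) and k < max_terms:
        k += 1
        term *= x / (a + k)
        total += term
    return min(1.0, total * exp(a * log(x) - x - lgamma(a)))


def gamma_p_value(observed, alpha, beta, shift):
    """Upper-tail probability P(X >= observed) under the shifted gamma (6.18)."""
    z = (observed - shift) / beta  # standardise to a unit-scale gamma
    return 1.0 - regularized_lower_gamma(alpha, z)
```

With shape 1, scale 1 and shift 0 the distribution is the unit exponential, so `gamma_p_value(2.0, 1.0, 1.0, 0.0)` agrees with exp(−2) to numerical precision. An observed value at or below the shift gives a p-value of 1.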

Extension of strategy using spatial Rg index The extension of the strategy using

the spatial Rg index is similar to that described for the S statistic presented in

Section 6.2. The (one-sided) modified version of the Monte Carlo test (Section 3.5)

is then performed to test the random labelling null hypothesis.


                       Single Linkage      Average Linkage
Datasets               Rg       R5%rr      Rg       R5%rr
Cat Retinal Ganglia    0.496    0.502      0.496    0.511
Austin Hughes’         0.498    0.499      0.498    0.504
Clustered              1        0.507      1        0.507
Longleaf pines         0.503    0.503      0.553    0.503

Table 6.3: The estimated spatial Rg index, and 5% Monte Carlo critical values,
R5%rr, for the Monte Carlo test of random labelling. Datasets: Cat Retinal Ganglia,
Austin Hughes’, Clustered and Longleaf pines. 999 realisations under random
labelling.

First application The strategy using the spatial Rg index was applied to the

bivariate point patterns: Cat Retinal Ganglia, Austin Hughes’, Clustered, and Lon-

gleaf pines. These datasets were described previously. (The level of significance of

the tests was α = 0.05.) Table 6.3 shows the estimated values of the spatial Rg index

and 5% Monte Carlo critical value under random labelling hypothesis, denoted by

R5%rr. (The 5% Monte Carlo critical value is the 95th quantile defined in Section

3.) The number of realisations under the null hypothesis was 999, and the chosen

clustering algorithms were the Single Linkage and Average Linkage.

The results obtained from the Single Linkage (see Table 6.3) show that the ran-

dom labelling hypothesis is only rejected for the Clustered point pattern. However,

based on the Average Linkage, the random labelling is rejected for the Clustered

and Longleaf pines datasets. The results from the Complete Linkage algorithm are

similar to those obtained from the Average Linkage. That is, the null hypothesis is

rejected for the Clustered and Longleaf pines datasets. Therefore, the result from

the Complete Linkage is not presented here.

Second application The null distributions of the spatial Rg index from the bivari-

ate point patterns: Cat Retinal Ganglia, Austin Hughes’, Longleaf pines, Clustered,

and (full) California redwoods seedlings (described below) were approximated using

gamma distributions. The Method of Moments was used to estimate the parameters

of the gamma distributions (Section 6.3). However, before the results are presented,

a brief description of the (full) California redwoods seedlings dataset is given as

follows.

California redwoods seedlings dataset Figure 6.8 shows the California redwoods

dataset [105] in which the locations of 195 seedlings of California redwood trees are



Figure 6.8: California redwoods seedlings dataset with 195 points re-scaled to the unit

square. This dataset is regarded as the full redwoods. Ripley’s subset is commonly

known as the redwoods data. Source: [105], data provided in [9].

plotted. Strauss [105] divided the sampling region into two regions demarcated by

a diagonal line corresponding to a discontinuity in the soil and land usage. (Region

I has 72 trees and region II has 123 trees.) Strauss [personal communication] has

informed the author that the dataset is no longer available. Therefore, a plot of the

entire dataset [105] was scanned and digitised by the author in 2002.

Henceforth, the California redwoods seedlings dataset is regarded as the full

redwoods. To the best of our knowledge, this dataset has only been analysed by

[65, 105]. For further details on the dataset, see [105]. A subset of the full redwoods

dataset, consisting of 62 points in a square sub-region, was extracted by Ripley [88]

and is known as the redwoods data in spatial statistics literature. The subset is a

very good example of a clustered point pattern. (Figure 6.8 shows the full redwoods

dataset with regions I, II, and Ripley’s subset.)

Even though the full redwoods seedlings dataset is a univariate point pattern,

the regions I and II of the dataset were regarded as if they had two separate marks:

the points located at region I were labelled type 1, and the points in region II were

labelled type 2.

The results The Single Linkage was unable to divide the datasets into two substantial
groups. As an example, for the Cat Retinal Ganglia dataset, the first cluster
obtained had 126 points and the other cluster had 9 points. Another example was
for the Austin Hughes’ dataset, where the first cluster had 293 points and the other
cluster had 1 point. Thus, the Average Linkage was chosen to form the clusters of
the datasets.

Datasets               Rg       α        β         γ        p-value
Cat Retinal Ganglia    0.496    0.511    0.0074    0.496    0.792
Austin Hughes’         0.498    0.505    0.0034    0.498    0.955
Clustered              1        0.508    0.0050    0.497    0
Longleaf pines         0.553    1.182    0.0015    0.499    3.33e-16
Full redwoods          0.588    0.961    0.0061    0.497    2.57e-07

Table 6.4: Estimated values of the spatial Rg index from datasets: Cat Retinal
Ganglia, Austin Hughes’, Clustered, Longleaf pines, and full redwoods; estimated
parameters α, β, γ of gamma approximation and p-values from the Monte Carlo
null distribution of spatial Rg index under random labelling null hypothesis; Average
Linkage algorithm.

Figure 6.9 shows that the fitted gammas are very good approximations for the

Monte Carlo null distributions of the spatial Rg index for the datasets: Cat Reti-

nal Ganglia, Austin Hughes’, Clustered, Longleaf pines with two types, and full

redwoods with two regions.

Table 6.4 shows the estimated values of the spatial Rg index, parameters α, β, γ

of the gamma approximations, and p-values from the gamma approximation of Rg

index under random labelling null hypothesis based on the Average Linkage.

The parameters were estimated using the Method of Moments (Section 6.3). Be-

cause of the large p-values for the Cat Retinal Ganglia and Austin Hughes’ datasets

(presented in Table 6.4), the null hypothesis of random labelling is not rejected for

these datasets. However, the random labelling is rejected for: Clustered, Longleaf

pines, and full redwoods datasets.


[Figure 6.9 panels: Cat Retinal Ganglia, Austin Hughes’, Clustered, Longleaf pines, Full redwoods. Horizontal axis: Monte Carlo null distribution; vertical axis: gamma approximation; legend: q-q plot, identity line.]

Figure 6.9: Q-Q plots comparing the Monte Carlo estimates of the null distributions

of the spatial Rg index with their gamma approximations, for each of datasets: Cat

Retinal Ganglia, Austin Hughes’, Clustered, Longleaf pines, and full redwoods. Solid

lines: Q-Q plots, dashed lines: identity line, Average Linkage algorithm.

CHAPTER 7

Analysis of local configuration

This chapter presents a new extension of a popular approach for analysing localised

neighbourhoods of a given point pattern in spatial statistics. The new extension,

named “analysis of local configuration”, is based on the fusion distance function.

Our attention is now focused on a local neighbourhood of a given point of the

dataset.

In spatial statistics literature, the original strategy is known as the Local Indi-

cators of Spatial Association, or LISA [5, 6, 22, 23, 109]. An early paper by Getis

and Ord [44] suggested local versions of the K, L, and G functions. Anselin [6] out-

lined a general class of local indicators of spatial association, LISA, and showed how

this class of local indicators allows for the decomposition of global indicators such

as the Moran’s I and Geary’s c statistics [109, page 170]. Anselin also illustrated

applications of LISA to the spatial pattern of conflict in African countries [28] and

to a number of Monte Carlo simulations. The following definition of LISA is quoted

from [6, page 94].

Definition 21 (LISA). A local indicator of spatial association is any statistic that

satisfies two requirements. First, a value of LISA for each observation gives an

indicator of the extent of significant spatial clustering of similar values around that

observation. Second, the sum of LISAs for all observations is proportional to a

global indicator of spatial association.

Recently, Cressie and Collins [22, 23] also investigated LISA methodology for

point patterns and developed a version based on the product density function. (The

product density function is defined by Stoyan, Kendall and Mecke [102, page 120].)

After estimating the product density functions using kernel smoothing [2, 14, 39,

112], Cressie and Collins applied classical multidimensional scaling [68] to reduce

the number of LISA functions, then they applied the non-hierarchical K-means al-

gorithm [49] to classify LISA functions into bundles or groups. (Cressie and Collins

defined a bundle of LISA functions as a set of similar product density functions [23].)

Further information on the methodology developed by Cressie and Collins and its

application to a minefield point pattern with clutter, are presented in [23, 22].

Instead of using the (traditional summary function) K-function, classical mul-

tidimensional scaling, and non-hierarchical algorithm to characterise a local neigh-

bourhood, fusion distances based on hierarchical algorithms will be used here. This


application of the fusion distance is new, and some of Cressie and Collins’s steps will
be followed to introduce the analysis of local configuration. In particular, the
probability density functions of the fusion distances are estimated using kernel smoothing

techniques [2, 14, 39, 112]. Then the groups of the fusion distance densities will

be classified using a different measure from that chosen by Cressie and Collins [22].

The analysis of the local configuration procedure is presented as follows.

7.1 Strategy

Let x = {x1, . . . ,xn} be a given point pattern with n points, and k be a small

positive integer (k < n). For each point xi (i = 1, . . . , n), its k-nearest neighbour

points, xi1,xi2, . . . ,xik are computed using the pairwise Euclidean distance. Then,

for each point xi and its k-nearest neighbours, the subset {xi,xi1,xi2, . . . ,xik} is

formed. This subset is regarded as the local neighbourhood around the point xi.

Now, a chosen hierarchical clustering algorithm is applied to each subset, and an-

other set is formed, that is, the set of fusion distances {hi1, . . . , hik−1}.
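
The construction of the local neighbourhoods and their fusion distances can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the names `k_nearest`, `fusion_distances` and `local_fusion_distances` are hypothetical, and a naive agglomerative procedure is assumed for clarity. The sketch returns every merge height obtained from the k + 1 points of a neighbourhood.

```python
import math
from itertools import combinations

# Hypothetical helper names; a naive agglomeration is assumed for clarity.
LINKAGES = {
    "single": min,                         # nearest pair between two clusters
    "complete": max,                       # farthest pair between two clusters
    "average": lambda d: sum(d) / len(d),  # mean over all pairs
}

def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def fusion_distances(pts, linkage="single"):
    """Merge heights (fusion distances) from agglomerating the points in pts."""
    between = LINKAGES[linkage]
    clusters = [[p] for p in pts]
    heights = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        d, a, b = min(
            (between([math.dist(p, q) for p in clusters[a] for q in clusters[b]]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        heights.append(d)
        clusters[a] += clusters[b]  # fuse cluster b into cluster a (a < b)
        del clusters[b]
    return heights

def local_fusion_distances(points, i, k, linkage="single"):
    """Fusion distances of the local neighbourhood {x_i, x_i1, ..., x_ik}."""
    hood = [points[i]] + [points[j] for j in k_nearest(points, i, k)]
    return fusion_distances(hood, linkage)
```

For example, `local_fusion_distances(pattern, 0, 20, "average")` would give the Average Linkage fusion distances around the first point of a list `pattern` of (x, y) tuples.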

The probability density functions of the fusion distances are estimated using the

kernel density estimator, given by

\[
f_i(h, \ell) = f(h; \ell, x_i) = (k\ell)^{-1} \sum_{j=1}^{k-1} \kappa\{(h - h_{ij})/\ell\}, \tag{7.1}
\]

where $\kappa$ is a function satisfying $\int_{-\infty}^{+\infty} \kappa(h)\,dh = 1$ and $\kappa(h) > 0$, known as the kernel, $\ell$ is a positive number known as the bandwidth or window width, $i = 1, \ldots, n$ and $j = 1, \ldots, (k - 1)$. Further information on kernel estimators and their properties is reported in [14, 112]. Henceforth, $f_i(h, \ell)$ is denoted by $f_i$. ($f_i$ is a smoothed density estimator of the $h_{ij}$ for the point $x_i$.)
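
A kernel estimate of this kind can be sketched in Python as follows. The sketch rests on assumptions that are not taken from the text: a Gaussian kernel is used (the kernel is not named here), the function names are hypothetical, and the sum is divided by the number of fusion distances times the bandwidth so that the estimate integrates to one, whereas equation (7.1) writes the normalising factor as (kℓ)^{-1}.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, one common choice of kernel (an assumption here)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(h, fusion, bandwidth, kernel=gaussian_kernel):
    """Kernel estimate at h from the fusion distances of one neighbourhood.

    Assumption: the sum is divided by (number of fusion distances * bandwidth),
    so that the estimate integrates to one.
    """
    m = len(fusion)
    return sum(kernel((h - hj) / bandwidth) for hj in fusion) / (m * bandwidth)
```

Evaluating `kernel_density` on a grid of values of h gives one smoothed curve per point of the pattern; the bandwidth controls how strongly the individual fusion distances are smoothed.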

The probability density functions of the fusion distances may be compared using

a distance measure. For instance, the total variation distance [46, 108] can be used,

and its definition is presented as follows.

Definition 22 (Total variation distance). Let f(x) and g(x) be two probability

densities of random variables X1 and X2. The total variation distance between f(x)

and g(x) is given by

\[
d(f, g) = \sup_{B \subset \mathbb{R}} \left| \int_B f(x)\,dx - \int_B g(x)\,dx \right| = \frac{1}{2} \int_{\mathbb{R}} |f(x) - g(x)|\,dx. \tag{7.2}
\]
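
The second form of equation (7.2) lends itself to numerical approximation. The following Python sketch uses the trapezoidal rule on a finite interval; the function names and the two normal densities are illustrative assumptions only, chosen because their total variation distance is known in closed form.

```python
import math

def total_variation(f, g, lower, upper, steps=10_000):
    """Approximate (1/2) * integral of |f - g| over [lower, upper] (trapezoidal rule)."""
    width = (upper - lower) / steps
    total = 0.0
    for s in range(steps + 1):
        x = lower + s * width
        weight = 0.5 if s in (0, steps) else 1.0  # end points count half
        total += weight * abs(f(x) - g(x))
    return 0.5 * total * width

def normal_pdf(mean, sd):
    """Density of a normal distribution (used only for illustration)."""
    return lambda x: math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

f = normal_pdf(0.0, 1.0)
g = normal_pdf(1.0, 1.0)
# For two unit-variance normals one unit apart the exact value is 2 * Phi(0.5) - 1.
print(total_variation(f, g, -10.0, 11.0))
```

The interval must be wide enough to contain essentially all the mass of both densities; otherwise the approximation underestimates the distance.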


The total variation distances between each pair of probability densities fi, fr

of the fusion distances are computed, that is, d(fi, fr), where i, r = 1, . . . , n. Let

$D = [\,d(f_i, f_r)\,]$ be the total variation distance matrix.

Next, a chosen hierarchical clustering algorithm is applied to D to search for

clusters of the probability densities. In other words, the smoothed densities of the

fusion distances are considered as if they were units or points to be classified. Thus,

hierarchical clustering algorithms are applied to find similar groups of the probability

density functions of the fusion distances.
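
The clustering of the densities from the matrix D, followed by a cut of the hierarchy into g groups, can be sketched in Python as follows. This is an illustrative naive implementation with a hypothetical function name, not the routine used in this thesis, and the numbering of the resulting groups is arbitrary.

```python
from itertools import combinations

def cut_into_groups(D, g, linkage="average"):
    """Agglomerate the items of distance matrix D until g groups remain.

    Returns a list of group labels (1..g), one per item.
    """
    between = {
        "single": min,
        "complete": max,
        "average": lambda d: sum(d) / len(d),
    }[linkage]
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > g:
        # Fuse the two clusters with the smallest linkage distance.
        _, a, b = min(
            (between([D[i][j] for i in clusters[a] for j in clusters[b]]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        clusters[a] += clusters[b]
        del clusters[b]
    labels = [0] * len(D)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = label
    return labels
```

Because the procedure is agglomerative, the grouping obtained for a larger value of g is always a refinement of the grouping obtained for a smaller value.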

It is now assumed there may be g clusters or groups of probability densities of the

fusion distances. Then, the mean of the local fusion distance function is computed

for each group of probability densities,

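\[
H_v(t) = \frac{1}{n_v} \sum_{s=1}^{n_v} H_s(t), \tag{7.3}
\]
where $n_v$ is the number of points in group $v$, $H_s(t)$ is the local fusion distance function of the $s$-th point of that group, $s = 1, \ldots, n_v$, $v = 1, \ldots, g$, and $0 < g < n$.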

A homogeneous Poisson point process with the same intensity and on the same bounded region as the given point pattern is simulated. For each point of the simulated Poisson process, the fusion distance function based on its k nearest neighbours is calculated, that is, $H^{(i)}_{\mathrm{Pois}}(t)$, where $i = 1, \ldots, n$. Then, the mean of the local fusion distance functions over all the points of the simulated Poisson process is given by
\[
H_{\mathrm{Pois}}(t) = \frac{1}{n} \sum_{i=1}^{n} H^{(i)}_{\mathrm{Pois}}(t). \tag{7.4}
\]
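
The simulation step can be sketched in Python as follows. The sketch assumes a rectangular window and draws a random number of points, as for an unconditional homogeneous Poisson process; since the text indexes the simulated points by i = 1, ..., n, a simulation conditional on n points may equally have been intended. The function name `simulate_poisson` is hypothetical.

```python
import random

def simulate_poisson(intensity, width=1.0, height=1.0, rng=random):
    """Simulate a homogeneous Poisson process on [0, width] x [0, height]."""
    mean_count = intensity * width * height
    # Number of points: count the unit-rate exponential gaps that fit into mean_count.
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean_count:
        count += 1
        elapsed += rng.expovariate(1.0)
    # Given the count, the points are independent and uniform on the window.
    return [(rng.uniform(0.0, width), rng.uniform(0.0, height)) for _ in range(count)]
```

For a pattern of n points re-scaled to the unit square, the estimated intensity is n, so `simulate_poisson(n)` gives a comparable completely random pattern.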

Finally, for each group, its mean (equation (7.3)) is plotted against the Poisson mean (equation (7.4)) in the same plot. That is, the group means of the local fusion distance functions are compared graphically with the mean of the local fusion distance functions from the homogeneous Poisson process. If the group mean (equation (7.3)) of the fusion distance function is above the identity line, this suggests that this group consists of points clustered together.

Interpretation of local fusion distance function. Similar to the interpretation of the fusion distance function for the exploratory data analysis (Section 4.3.1), if the local fusion distance function for a given group (equation (7.3)) is substantially above the identity line, then a locally clustered pattern is suggested. If the local fusion distance function is considerably below the identity line, then there is an indication that the point pattern is locally regular. Finally, if the local fusion distance function is close to the identity line, then the pattern is consistent with local randomness.

7.2 Applications

The analysis of local configuration was applied to the datasets: full redwoods

(Section 6.3), Longleaf pines (Section 6.2) and Lansing woods (which is described

in Section 7.2.3). The results obtained from each dataset are presented as follows.

7.2.1 Application to full redwoods

The full redwoods dataset is described in Section 6.3 and the results of the analysis of local configuration are shown below.

Results: kernel estimation. The kernel probability densities based on the 20 nearest neighbours for Single Linkage, Average Linkage, and Complete Linkage have similar shapes and suggest a higher probability of fusion distances smaller than or equal to 0.05 for the dataset. (See the plots in Figure 7.1.)

Total variation distance. Figure 7.2 shows that, except for the Single Linkage dendrogram, the dendrograms of the total variation distances from the kernel densities are well structured and suggest the presence of spatial clustering in the dataset. Observe that the dendrograms shown in Figure 7.2 are based only on the spatial locations of the points, and do not use information about the region I/II dichotomy.

Contingency tables. Table 7.1 shows the frequency counts of the full redwoods points that belong to the two regions (I, II) and groups (1, 2) based on the total variation distance. The results based on the Average Linkage and Complete Linkage are successful in identifying the majority of points of the full redwoods that belong to regions I and II (see the diagonal cells of the table).

               SL             AL             CL
             Group          Group          Group
Region      1     2        1     2        1     2
I          72     0       68     4       70     2
II        118     5        2   121       10   113

Table 7.1: Contingency tables of the full redwoods by regions (I,II) and groups (1,2).

Groups are based on total variation distances; 20 nearest neighbours; Single Linkage

(SL), Average Linkage (AL), Complete Linkage (CL).
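
As a small worked check of the statement about the diagonal cells, the share of points falling on the diagonal of each sub-table of Table 7.1 can be computed as below. The counts are those of Table 7.1; the helper name `diagonal_share` is hypothetical and the calculation is only an illustrative summary, not a statistic used elsewhere in this thesis.

```python
# Counts from Table 7.1: rows are regions (I, II), columns are groups (1, 2).
tables = {
    "SL": [[72, 0], [118, 5]],
    "AL": [[68, 4], [2, 121]],
    "CL": [[70, 2], [10, 113]],
}

def diagonal_share(table):
    """Proportion of points on the diagonal (region I in group 1, region II in group 2)."""
    total = sum(sum(row) for row in table)
    return (table[0][0] + table[1][1]) / total

for name, table in tables.items():
    print(f"{name}: {diagonal_share(table):.3f}")
```

Of the 195 points, about 97% lie on the diagonal for the Average Linkage and about 94% for the Complete Linkage, against about 39% for the Single Linkage.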


[Figure 7.1 panels: Single Linkage, Average Linkage, Complete Linkage. Horizontal axis: fusion distances; vertical axis: probability density function.]

Figure 7.1: Kernel probability densities of the fusion distances from the full redwoods

dataset based on the 20 nearest neighbours, Single Linkage, Average Linkage, and

Complete Linkage.


[Figure 7.2 panels: Single Linkage, Average Linkage, Complete Linkage dendrograms. Vertical axis: total variation distances.]

Figure 7.2: Dendrograms of the total variation distances from kernel probability

densities of fusion distances for the full redwoods dataset based on the 20 nearest

neighbours, Single Linkage, Average Linkage, and Complete Linkage.


[Figure 7.3 panels: Single Linkage, Average Linkage, Complete Linkage.]

Figure 7.3: Classification of points in the full redwoods dataset into two groups (△, ◦) based on their local configuration (20 nearest neighbours, fusion distances, kernel smoothing, total variation distance, hierarchical clustering: Single Linkage, Average Linkage, and Complete Linkage).


[Figure 7.4 plot. Horizontal axis: estimated mean of H_{Pois}(t); vertical axis: estimated mean of H_v(t); legend: group 1, group 2, identity line.]

Figure 7.4: Estimated group means of local fusion distance functions plotted against

the local fusion distance function from a homogeneous Poisson process with the same

intensity as the full redwoods; 20 nearest neighbours; Average Linkage.

Local configuration classification. The Single Linkage overclassifies the points of

region I (see the upper plot in Figure 7.3). This is an example of the chaining effect

(Section 2.4) and suggests that the clusters of the probability densities of the fusion

distances may not have a nucleus. Therefore, the Single Linkage performs poorly at

identifying the points that belong to the different regions of the full redwoods.

For instance, consider the Average Linkage dendrogram of total variation dis-

tances (see the central plot in Figure 7.3). First, a homogeneous Poisson process

with the same intensity as the full redwoods on the re-scaled unit square was simu-

lated, and the estimated mean (equation (7.4)) of the local fusion distance functions

was calculated. Second, the estimated means (equation (7.3)) of local fusion dis-

tance functions for the two groups were computed and compared with the estimated

mean of the local fusion distance function for the homogeneous Poisson process.

The local fusion distance function for group 1 is different from the local fusion

distance function for group 2. (See Figure 7.4.) The local fusion distance function

for group 1 suggests that there is a locally clustered pattern in the dataset.

Therefore, the local fusion distance function successfully discriminates between the

two different patterns of the full redwoods dataset. An equivalent conclusion can be

drawn for the Complete Linkage.


Figure 7.5: Longleaf pine trees are shown as circles; the diameter of each circle is proportional to the tree's diameter at breast height (dbh). Adult trees are plotted with larger circles; young trees are plotted with smaller circles. Source: [85], data provided in [9].

The analysis of local configuration based on these three algorithms and k = 10 nearest neighbours is very similar to that obtained with k = 20; consequently, the results are not shown here. In conclusion, the analysis of local configuration using the Average Linkage and Complete Linkage has successfully separated and identified the majority of the redwood trees that have different spatial neighbourhoods.

7.2.2 Application to Longleaf pines

The Longleaf pines dataset, classified into two types (young and adult trees), is described in Section 6.2. This dataset is more complicated to analyse than the full redwoods described in Section 6.3. First, the task of identifying and separating the trees that belong to the two types based on their location is difficult. Second, the young trees are close to the adult trees and, except for the size of the dbh, the types cannot be distinguished (see Figure 7.5). Therefore, it is a challenge to analyse this inhomogeneous dataset.

Results: kernel estimation. Figure 7.6 shows that the kernel densities based on

the 10 nearest neighbours, Single Linkage, Average Linkage, and Complete Linkage

have different shapes.

Total variation distance. Even though the structured Average Linkage and Complete Linkage dendrograms may suggest that there are clusters in the dataset, the disordered Single Linkage dendrogram is enough evidence that the Longleaf pines dataset does not exhibit a very good separation into clusters (see Figure 7.7).


[Figure 7.6 panels: Single Linkage, Average Linkage, Complete Linkage. Horizontal axis: fusion distances; vertical axis: probability density function.]

Figure 7.6: Kernel probability densities of fusion distances from the Longleaf pines

based on the 10 nearest neighbours; Single Linkage; Average Linkage; Complete

Linkage.


[Figure 7.7 panels: Single Linkage, Average Linkage, Complete Linkage dendrograms. Vertical axis: total variation distances.]

Figure 7.7: Dendrograms of total variation distances from kernel probability densities

of fusion distances for the Longleaf pines dataset based on the 10 nearest neighbours;

Single Linkage; Average Linkage; Complete Linkage.


               SL             AL             CL
             Group          Group          Group
Type        1     2        1     2        1     2
young     306     7      214    99      260    53
adult     271     0      256    15      262     9

Table 7.2: Contingency tables of Longleaf pines by types (young,adult) and groups

(1,2). Types based on dbh of trees and groups based on total variation distances; 10

nearest neighbours; Single Linkage (SL); Average Linkage (AL); Complete Linkage

(CL).

Contingency tables. For instance, the Average Linkage panel of Table 7.2 shows that not all young trees are in a different neighbourhood from the adult trees. However, there may be a substantial number of young trees that are packed together (the 99 young trees that were classified into group 2). In other words, some young trees are growing in tight clusters.

Local configuration classification. Figure 7.8 (upper) shows that the Single Linkage
classifies the majority of the trees of Longleaf pines as the young type. Similar

to the result of this algorithm applied to the full redwoods, the chaining effect sug-

gests that the clusters of kernel densities of fusion distances may not have a nucleus.

Therefore, the Single Linkage has pointed out that this dataset may not be well

separated into clusters.

Even though there is no strong evidence for clusters, let us consider, for example, the Average Linkage dendrogram (see the central plot in Figure 7.8). First, a homogeneous Poisson process with the same intensity as the Longleaf pines dataset on the 200 m sided square was simulated and the mean (equation (7.4)) of the local fusion distance functions was estimated. Next, the estimated group means (equation (7.3)) of the local fusion distance functions were computed and compared with the Poisson mean (equation (7.4)).

Figure 7.9 shows that the local fusion distance functions from groups 1 and 2 are

different. This statement is also confirmed by the barplot of the relative frequency

of dbh for both groups, which is plotted in Figure 7.10. The local fusion distance

function for group 2 suggests that there is a locally clustered pattern in the dataset.

Therefore, the local fusion distance function using the Average Linkage successfully

identifies a small pocket of young trees that are clustered together. In conclusion,

the results of the local configuration applied to the Longleaf pines are very good


[Figure 7.8 panels: Single Linkage, Average Linkage, Complete Linkage.]

Figure 7.8: Classification of points in the Longleaf pines dataset into two groups (△, ◦) based on their local configuration (10 nearest neighbours, fusion distances, kernel smoothing, total variation distance, hierarchical clustering: Single Linkage, Average Linkage, and Complete Linkage).


[Figure 7.9 plot. Vertical axis: estimated mean of H_v(t); legend: group 1, group 2, identity line.]

Figure 7.9: Estimated group means of local fusion distance functions plotted against the mean of local fusion distance functions from a homogeneous Poisson process with the same intensity as the Longleaf pines; 10 nearest neighbours; Average Linkage.

[Figure 7.10 barplot. Horizontal axis: dbh classes 0-20, 20-40, 40-60, 60-80; vertical axis: relative frequency; legend: group 1, group 2.]

Figure 7.10: Relative frequency barplot of dbh from Longleaf pines classified into two

groups; 10 nearest neighbours, Average Linkage.

because this point pattern is indeed a much more challenging dataset to analyse

than the full redwoods.


7.2.3 Application to Lansing woods

Figure 7.11 shows the Lansing woods dataset introduced in [43]. The data record the location and botanical classification of 2251 trees on a 924 ft × 924 ft (19.6 acre) block of Lansing Woods, in Clinton County, Michigan, USA. The original block size was re-scaled to the unit square.

Figure 7.11: Lansing woods dataset with six types: black oak (◦), hickory (△), maple (+), miscellaneous (×), red oak (◇), white oak (▽). Source: [43], data provided in [9].

The botanical classification of the types of the trees into species is: hickory,

maple, red oak, white oak, black oak and miscellaneous. For details on the dataset and

its analysis, see [43, 30]. Figure 7.12 shows the trees’ locations plotted individually

by their types. The plots are ordered according to the frequency of points in each

type, that is, from the largest to the smallest. (The symbol encoding is the same as

in Figure 7.11.)

The Longleaf pines and Lansing woods datasets are similar with respect to the

locations of the trees. It is also noticeable that the locations of the Lansing woods

trees are closer together than in the Longleaf pines. Visually, it may be an impossible

task to identify and to separate different neighbourhoods, even though the species

of the trees are known.

Results. The Single Linkage dendrogram applied to the kernel densities of the

fusion distances from the Lansing woods dataset has a disordered structure similar

to the Single Linkage dendrogram applied to the Longleaf pines dataset (see upper


[Figure 7.12 panels: Hickory, Maple, White oak, Red oak, Black oak, Miscellaneous.]

Figure 7.12: Lansing woods dataset with six types plotted individually. The plots are ordered according to the descending frequency of points in each type. Hickory (△), maple (+), white oak (▽), red oak (◇), black oak (◦), miscellaneous (×). The symbol encoding in this figure is the same as in Figure 7.11.


                            Group
Type           1     2     3     4     5     6
Hickory      283   299    24    24    64     9
Maple        239   186     6     2    80     1
White oak    170   197     6    17    49     9
Red oak      141   146     7     5    45     2
Black oak     56    58     2     8    11     0
Misc          43    40     4     0    17     1

Table 7.3: Contingency table of Lansing woods classified into six botanical types

(hickory, maple, white oak, red oak, black oak, misc.), and six groups which are

based on total variation distances, 20 nearest neighbours, Average Linkage.

plot in Figure 7.7). Thus, the disordered dendrogram indicates that the dataset

exhibits a poor separation for clusters. (The equivalent conclusion was also drawn

from the Single Linkage dendrogram applied to the Longleaf pines in Section 7.2.2.)

Moreover, the results of the analysis of local configuration based on the 20 near-

est neighbours, Average Linkage, and Complete Linkage are alike. Therefore, a

summary of these results, such as the contingency table and local configuration

classification of the Lansing woods with six types based on the Average Linkage, is

presented next.

Contingency table and local configuration classification. Similar to the analysis of the Longleaf pines, we consider, for instance, the Average Linkage dendrogram of the total variation distances based on the 20 nearest neighbours. The plots shown in Figure 7.13 are ordered according to the descending frequency of points in each group (from the largest to the smallest). That is, group 1 has the largest frequency, followed by groups 2, 5, 4, 3 and 6 (the lowest frequency).

Next, the Poisson process with the same intensity as the Lansing woods dataset

on the re-scaled unit square was simulated and the mean (equation (7.4)) of the local

fusion distance function was estimated. The six estimated group means (equation

(7.3)) of the local fusion distance functions were then computed and compared with

the estimated Poisson mean of the local fusion distance functions.

The upper plot in Figure 7.14 shows that the local fusion distance functions

of the groups are not substantially above the identity line. Therefore, the local

fusion distance function suggests that there may not be groups in the Lansing woods

dataset.


[Figure 7.13 panels: Spatial group 1, Spatial group 2, Spatial group 5, Spatial group 4, Spatial group 3, Spatial group 6.]

Figure 7.13: Local configuration classification based on total variation distances from

the Lansing woods dataset. The ordered plots are according to the descending fre-

quency of points in each group; 20 nearest neighbours; Average Linkage.


[Figure 7.14, upper plot. Vertical axis: estimated mean of H_v(t); legend: group 1, group 2, group 3, group 4, group 5, group 6, identity line.]

[Figure 7.14, lower plot. Vertical axis: estimated mean of H_v(t); legend: group 1, group 2, group 3, group 4, identity line.]

Figure 7.14: Estimated group means of local fusion distance functions plotted against the means of local fusion distance functions from homogeneous Poisson processes with the same intensity as the Lansing woods dataset: six groups (upper), four groups (lower), 20 nearest neighbours, Average Linkage.


Further classification. Figure 7.13 suggests that the Lansing woods dataset might

be classified into fewer groups. Thus, the four group classification is now considered.

The Average Linkage dendrogram from the total variation distances is cut at g = 4

groups and the mean (equation (7.3)) of the local fusion distance functions for each

new group is also computed.

For the Lansing woods dataset, the classification into g = 6 groups is a refinement

of the four group classification. This is a nice property of a hierarchical clustering

algorithm.

A homogeneous Poisson process, with the same estimated intensity as the Lans-

ing woods dataset on the re-scaled unit square, is simulated and the mean (equa-

tion (7.4)) of the local fusion distance functions is estimated. The estimated group

means (equation (7.3)) of the local fusion distance functions are then compared with

the Poisson mean (equation (7.4)), graphically.

The lower plot in Figure 7.14 shows that there is a suggestion for two different

types of patterns. First, the local fusion distance function from group 1 is slightly below the

identity line. Second, the local fusion distance function from group 2 is considerably

above the identity line. Thus, the results of the local fusion distance function for the

Lansing woods dataset classified into four groups (based on the 20 nearest neighbours

and Average Linkage) suggest that this dataset may not have a clear separation for

local clusters, except for groups 1 and 2.

In addition to the results of the local fusion distance function, we may ask whether there is any relationship between the botanical and local configuration classifications. To answer this question, a formal test, the Pearson χ² test of independence [56, Chapter 13], is performed. The null hypothesis is that the botanical classification into three types, “oak” (white, red, and black oak), hickory, and grouped “maple and miscellaneous”, is independent of the four group classification. This is tested against the alternative that the two classifications are dependent.

Table 7.4 shows the contingency table and Pearson residuals for the Lansing

woods dataset classified into three botanical types and four groups, respectively.

The computed χ² = 32.56 with d.f. = 6, and the tabulated upper 5% point of χ² is 12.59 [56, page 667], so that the null hypothesis of independence is rejected at α = 0.05. It would also be rejected at α = 0.01. (The p-value is $1.274 \times 10^{-5}$.) Therefore, the two classifications are not independent.
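
The test statistic, its degrees of freedom and the p-value can be re-computed from the observed counts of Table 7.4 with a few lines of Python. This is a sketch for checking the arithmetic, not the code used in this thesis: the function names are hypothetical, and the closed-form tail probability used here is valid only for an even number of degrees of freedom.

```python
import math

# Observed counts from Table 7.4: three botanical types (rows) by four groups (columns).
observed = [
    [606, 24, 64, 9],
    [518, 2, 97, 2],
    [783, 30, 105, 11],
]

def pearson_chi_square(table):
    """Return (statistic, degrees of freedom, Pearson residuals)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Residual = (observed - expected) / sqrt(expected), expected = row * column / n.
    residuals = [
        [(obs - r * c / n) / math.sqrt(r * c / n) for obs, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]
    statistic = sum(res ** 2 for row in residuals for res in row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, residuals

def chi_square_upper_tail(x, dof):
    """Upper tail probability of the chi-square distribution (even dof only)."""
    if dof % 2 != 0:
        raise ValueError("the closed form used here requires an even number of d.f.")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))

statistic, dof, residuals = pearson_chi_square(observed)
print(round(statistic, 2), dof, chi_square_upper_tail(statistic, dof))
```

The squared Pearson residuals sum to the χ² statistic, so the residuals show which cells contribute most to the departure from independence.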

(Table 7.4 shows that there are significantly fewer maple and miscellaneous trees in group 2 than would be expected under independence. There is also a significantly high number of these trees in group 3. The botanical explanation for this finding is unknown to us.)

Contingency table (expected counts under independence in parentheses):

                                          Group
Type                          1            2            3            4
Hickory                     606           24           64            9
                        (595.567)     (17.489)     (83.073)      (6.871)
Maple & miscellaneous       518            2           97            2
                        (524.404)     (15.399)     (73.147)      (6.050)
Oak                         783           30          105           11
                        (787.029)     (23.112)    (109.780)      (9.080)

Pearson residuals:

                                          Group
Type                          1            2            3            4
Hickory                   0.428        1.557       -2.093        0.812
Maple & miscellaneous    -0.280       -3.415        2.789       -1.646
Oak                      -0.144        1.433       -0.456        0.637

Table 7.4: Upper: contingency table of Lansing woods dataset by three botanical types and four groups. Groups based on total variation distances, 20 nearest neighbours, Average Linkage. Lower: Pearson residuals calculated from the contingency table.

In summary, the results of the analysis of local configuration based on the 20

nearest neighbours, Single Linkage, Average Linkage and Complete Linkage indicate

that there may not be clusters of trees in the Lansing woods dataset. Despite this,

the results obtained from the Average Linkage suggest the presence of two different

spatial sub-patterns for the Lansing woods dataset.

Conclusion. The local configuration strategy was applied to three point patterns:

full redwoods, Longleaf pines and Lansing woods. The spatial classifications of the

latter two might appear to disagree with their biological/botanical classifications.

However, the strategy was able to successfully identify the different spatial textures

of all three point patterns.

CHAPTER 8

Analysis of Brazilian trees point pattern

This chapter studies a large point pattern classified into fifty-six botanical species,

seven botanical subclasses and three botanical classes. The spatial dataset was

kindly provided by Dr. Meirelles and Mr. Luiz in 1999, and was named the “Brazil-

ian trees dataset”.

Exploratory data analysis and inference based on the traditional summary functions (empty space F, nearest neighbour distance G, Van Lieshout and Baddeley's J, reduced second moment K, and mark correlation ρ [82, 104]) are applied to the Brazilian trees point pattern. Next, the complementary analysis using the new strategies (Chapters 4 and 6) is applied to the Brazilian trees dataset. The study is based on the fusion distance function (Section 4.1.1), area statistic (Section 4.1.2), S statistic (Section 6.2) and spatial Rg index (Section 6.3).

This chapter also presents the analysis of the local configuration (Section 7.1)

applied to the Brazilian trees dataset (Section 8.1). The analysis is based on the

20 nearest neighbours, Single Linkage, Average Linkage and Complete Linkage al-

gorithms. The Brazilian trees point pattern is introduced as follows.

8.1 Brazilian trees point pattern

Data collection and preparation. The Brazilian trees dataset was collected by Meirelles and Martins in 1979 on the ecological reserve of the Federal University of Brasília, known as Água Limpa farm, in Brasília, DF, Brazil. All trees in one hectare were mapped and, to the best of our knowledge, the trees were a natural stand of native species [75]. The sampled area was 100 m × 100 m. The data record the tree number, quadrat number, species number, location, height (in metres) and dbh (in metres). The dbh was measured at 0.3 m above ground level.

Parts of the dataset were analysed by Meirelles and Luiz [64, 74]. For instance,

Meirelles and Luiz [74] examined the 18 most dominant species. The species Byr-

sonima coccolobifolia and Aspidosperma tomentosum were classified as random and

the 16 remaining species were classified as clustered. Meirelles and Luiz used the

Morisita Index [76] and Dispersion Index [63] to classify the species. Another in-

vestigation was made by Luiz [64], where six species were studied in his Master’s

thesis. For more information on the dataset and its analysis, see [64, 74].

In our data preparation, a few minor inconsistencies were found. First, the

species Ouratea acuminata (species 32) was published as the Ouratea hexasperma


in Table 1 on [74, page 187]. However, the name of this species was corrected by

[73]. Second, Meirelles and Luiz [64, 74] stated that their dataset had 1122 trees

classified into 56 different species. Nevertheless, in the dataset here presented the

species Kielmeyera coriaceae (species 25 in Table 8.1) was missing. Meirelles and

Luiz acknowledged the absence of this species [75]. Finally, a few discrepancies in

the measurements of the trees’ height and dbh were found. The measurements were

not typical of trees on a Brazilian savanna or grassland, that is, some were too

small and others were too large for typical trees from the central region of Brazil.

Therefore, Meirelles and Luiz also corrected the unusual measurements [75].

The Brazilian trees dataset may have other inconsistencies that have not been

identified. However, to the best of our knowledge, the minor inconsistencies found

were corrected. The author is very grateful to Dr. Meirelles and Mr. Luiz, for the

provided dataset and corrections.

The dataset. Figure 8.1 shows the Brazilian trees dataset: the locations of 1122 Brazilian trees on a 100 m square in the reserve of Água Limpa farm, DF, Brazil. The original block size was re-scaled to the unit square.

Figure 8.1: Locations of 1122 trees on a 100 m square on a grassland, in the reserve of Água Limpa farm, DF, Brazil. The original block size was re-scaled to the unit square. Source: Meirelles and Martin (1999).

Botanical plant systematics. The currently accepted botanical classification of

each tree species into genus, species, family, order, subclass and class is extracted


by the author from [15, 115]. For the complete plant systematics of the Brazilian

trees dataset, see Table C.3, in appendix C.

Fifty-six botanical species. The tree species ranked in order of frequency are presented in Table 8.1. The most frequent species is Ouratea acuminata, a photograph of which is shown in Figure 1. The picture was downloaded from [77] in February 2003.

Some comments on the botanical nomenclature for Table 8.1 are as follows. There

were some species that were not identified by Meirelles and Martins. For instance,

for species 16, Siagrus sp., “Siagrus” was the identified genus but the species was

unknown. Another example is species 6, Myrtaceae fm. The genus and species of the

tree were not identified but its family was identified as the “Myrtaceae”. Another

example is species 17: this species was not identified, so it was named “IND.453”.

Similar remarks apply to the other unidentified species on the sampled area, which are listed in Table 8.1 and in Table C.3 in appendix C.

Seven botanical subclasses. The plant systematics of the Brazilian trees dataset

into seven subclasses ranked in order of frequency is presented in Table 8.2. The

dataset classified into seven subclasses is plotted in Figure 8.2.

Figure 8.2: Brazilian trees dataset classified by the seven botanical subclasses: Arecidae (▽); Asteridae (□); Dilleniidae (+); Hamamelidae (×); Liliidae (◇); Miscellaneous (△); and Rosidae (◦).


Frequency Species number Name

293 32 Ouratea acuminata

65 41 Qualea grandiflora

64 43 Qualea parviflora

57 47 Sclerolobium aureum

48 16 Siagrus sp.

43 10 Caryocar brasiliense

43 45 Roupala montana

41 21 Erythroxylum tortuosum

36 31 Myrcia sp.

29 54 Vellozia sp.

26 9 Byrsonima sp.

25 26 Lafaensia pacari

25 46 Salacia crassifolia

20 2 Aspidosperma tomentosum

19 11 Connarus fulvus

17 7 Byrsonima coccolobifolia

17 29 Miconia sp.

17 42 Qualea multiflora

16 20 Erythroxylum suberosum

15 8 Byrsonima crassa

14 13 Dalbergia vidacea

14 15 Didymopanax macrocarpum

13 1 Aspidosperma macrocarpum

13 23 Butia sp.

13 27 Palmeira sp.

12 5 Bowdichia virgiloides

10 28 Miconia ferruginata

9 40 Pterodon pubescens

8 44 Rapanea guyanensis

7 3 Bombax gracilipes

7 6 Myrtaceae fm.

7 18 Enterolobium ellipticum

7 35 Piptocarpha rotundifolia

6 52 IND. 192

5 4 Bombax tomentosum

5 12 Copaifera langsdorfii

5 30 Mimosa claussenii

5 56 Platimenia reticulata

4 14 Davilla elliptica

4 19 Eremanthus

4 24 Hymenaea stillocarpa

4 38 Pouteria ramiflora

4 51 Symplocos revoluta

3 22 IND.445

3 48 Stryphnodendron sp.

3 49 Styrax ferrugineus

3 50 Sweetia dasycarpa

3 55 Vochysia elliptica

2 33 Palicourea rigida

2 36 Vochysia rufa

2 39 Vochysia thyrsoidea

2 53 Strychnos sp.

1 17 IND.453

1 34 IND. 443

1 37 Plenckia populosea

0 25 Kielmeyera coriaceae

Table 8.1: The tree species ranked in order of frequency in the Brazilian trees dataset.


Frequency Species number Subclass

553 5, 6, 7, 8, 9, 11, 12, 13, 15, 18, 20, Rosidae21, 24, 26, 28, 29, 30, 36, 37, 39, 40,41, 42, 43, 45, 46, 47, 48, 50, 55, 56

371 3, 4, 10, 14, 32, 38, 44, 49, 51 Dilleniidae

59 16, 17, 22, 34, 52 Miscellaneous

48 1, 2, 19, 33, 35, 53 Asteridae

36 31 Hamamelidae

29 54 Liliidae

26 23, 27 Arecidae

Frequency Subclass Class

1008 Asteridae, Dilleniidae, Hamamelidae, Rosidae Magnoliopsida

59 Miscellaneous Others

55 Arecidae, Liliidae Liliopsida

Frequency Class Type

1008 Magnoliopsida 1

114 Liliopsida and Others 2

Table 8.2: The tree subclasses (upper), classes (centre), and types (lower) ranked in order of frequency in the Brazilian trees dataset. Species numbers are those shown in Table 8.1.
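The aggregation behind Table 8.2 can be checked directly from the species frequencies in Table 8.1. The following is a minimal sketch in Python (the thesis itself used R); the dictionary and function names are illustrative and not part of the original analysis.

```python
# Species frequencies transcribed from Table 8.1, keyed by species number.
SPECIES_FREQ = {
    1: 13, 2: 20, 3: 7, 4: 5, 5: 12, 6: 7, 7: 17, 8: 15, 9: 26, 10: 43,
    11: 19, 12: 5, 13: 14, 14: 4, 15: 14, 16: 48, 17: 1, 18: 7, 19: 4, 20: 16,
    21: 41, 22: 3, 23: 13, 24: 4, 25: 0, 26: 25, 27: 13, 28: 10, 29: 17, 30: 5,
    31: 36, 32: 293, 33: 2, 34: 1, 35: 7, 36: 2, 37: 1, 38: 4, 39: 2, 40: 9,
    41: 65, 42: 17, 43: 64, 44: 8, 45: 43, 46: 25, 47: 57, 48: 3, 49: 3, 50: 3,
    51: 4, 52: 6, 53: 2, 54: 29, 55: 3, 56: 5,
}

# Species numbers in each botanical subclass, from Table 8.2 (upper).
SUBCLASS_SPECIES = {
    "Rosidae": [5, 6, 7, 8, 9, 11, 12, 13, 15, 18, 20, 21, 24, 26, 28, 29, 30,
                36, 37, 39, 40, 41, 42, 43, 45, 46, 47, 48, 50, 55, 56],
    "Dilleniidae": [3, 4, 10, 14, 32, 38, 44, 49, 51],
    "Miscellaneous": [16, 17, 22, 34, 52],
    "Asteridae": [1, 2, 19, 33, 35, 53],
    "Hamamelidae": [31],
    "Liliidae": [54],
    "Arecidae": [23, 27],
}

# Subclasses in each class, from Table 8.2 (centre).
CLASS_SUBCLASSES = {
    "Magnoliopsida": ["Asteridae", "Dilleniidae", "Hamamelidae", "Rosidae"],
    "Others": ["Miscellaneous"],
    "Liliopsida": ["Arecidae", "Liliidae"],
}


def subclass_counts():
    """Number of trees in each subclass."""
    return {name: sum(SPECIES_FREQ[s] for s in species)
            for name, species in SUBCLASS_SPECIES.items()}


def class_counts():
    """Number of trees in each class."""
    sub = subclass_counts()
    return {name: sum(sub[s] for s in subs)
            for name, subs in CLASS_SUBCLASSES.items()}


def type_counts():
    """Type 1 is Magnoliopsida; type 2 pools Liliopsida and Others."""
    cls = class_counts()
    return {1: cls["Magnoliopsida"], 2: cls["Liliopsida"] + cls["Others"]}


if __name__ == "__main__":
    print(subclass_counts())
    print(class_counts())
    print(type_counts())
```

The three functions return the frequencies listed in the three parts of Table 8.2, and the grand total equals the 1122 trees of the dataset.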

Figure 8.3: Left: Brazilian trees dataset classified by the three botanical classes: Magnoliopsida (◦), Liliopsida (¦), and Others (M). Right: the dataset classified by type: type 1 (◦) and type 2 (M).


Three botanical classes The three botanical classes of the Brazilian trees dataset

are Magnoliopsida, Liliopsida and others (the trees that were not identified). The

left plot in Figure 8.3 shows the classified dataset and Table 8.2 presents the classes

ranked in order of frequency.

Next, the Brazilian trees dataset is analysed using the standard summary functions F, G, J, K and ρ (the mark correlation function [104]). The new strategies developed in Chapters 4 and 6 are also applied to the dataset: the fusion distance function (Section 4.1.1), the area statistic (Section 4.1.2), the S statistic (Section 6.2) and the spatial Rg index (Section 6.3). The Brazilian trees dataset is regarded first as an example of a univariate point pattern, and then as an example of a multivariate point pattern. The spatial statistics were computed using R Version 1.5.1 [51] and the Spatstat library Version 1.3-2 [9] on a Pentium 4 (1.8 GHz).

8.2 Analysis of univariate Brazilian trees dataset

Let us assume that the Brazilian trees dataset is a realisation of a univariate point pattern. Its attributes (species, height, and diameter at breast height, dbh) are first analysed descriptively. The mark correlation function is then presented and estimated for each of these attributes.

Species Figure 8.4 shows barplots of the species frequencies in rank order. An interesting but unexplained feature of the plots is that the frequency decays approximately exponentially with the rank of the species:

$$\log(\text{frequency}) = a + b \cdot \text{rank}, \quad \text{where } b < 0.$$

Thus

$$\text{frequency} \approx A \exp(-B \cdot \text{rank}), \quad \text{where } A = \exp(a) \text{ and } B = -b > 0.$$
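The log-linear relation between frequency and rank can be explored with an ordinary least-squares fit. The sketch below is illustrative only: it is written in Python rather than the R used in the thesis, the function name is hypothetical, and the thesis does not state how (or whether) a and b were estimated.

```python
import math

# Nonzero species frequencies from Table 8.1, in rank order. Species 25
# (frequency 0) is left out because log(0) is undefined.
FREQS = [293, 65, 64, 57, 48, 43, 43, 41, 36, 29, 26, 25, 25, 20, 19, 17, 17,
         17, 16, 15, 14, 14, 13, 13, 13, 12, 10, 9, 8, 7, 7, 7, 7, 6, 5, 5, 5,
         5, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1]


def fit_log_linear(freqs):
    """Least-squares fit of log(frequency) = a + b * rank; returns (a, b)."""
    ranks = list(range(1, len(freqs) + 1))
    logs = [math.log(f) for f in freqs]
    n = len(freqs)
    mean_x = sum(ranks) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ranks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ranks, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


if __name__ == "__main__":
    a, b = fit_log_linear(FREQS)
    # A = exp(a) and B = -b in the exponential form of the relation.
    print("a =", a, "b =", b, "A =", math.exp(a), "B =", -b)
```

A negative fitted slope b corresponds to B = −b > 0 in the exponential form.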

Height Table C.1, in appendix C, suggests that the values of the height are

quoted to the nearest 0.1 metre in the range 0–8 metres, and to the nearest 1 metre

in the range 8–26 metres. However, the values 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5 and 7 are

very frequent, suggesting that the height has often been guessed to the nearest half

metre only. The plots in Figure 8.5 suggest that there is no indication of spatial

trend in height.


[Figure 8.4: two panels; x-axis: Species (in rank order); y-axis: Frequency (left), Log(frequency) (right).]

Figure 8.4: Barplots of the frequencies of ranked species of Brazilian trees: frequency (left), logarithm of frequency (right).

Dbh The information shown in Table C.2, in appendix C, suggests that most of

the values of the diameter at breast height are quoted to the nearest 0.5 metre. The

scatter plots shown in Figure 8.6 suggest that there is no spatial trend in diameter

at breast height.

Species, height and dbh The ten most frequent species of the Brazilian trees

dataset are plotted against heights and diameters at breast height (dbh) in Fig-

ure 8.7. The first and second tallest species are the Myrcia and Qualea parviflora

(species 31 and 43 in Table 8.1), respectively. Trees with the two largest dbh be-

long to species Myrcia and Caryocar brasiliense (species 31 and 10 in Table 8.1).

Figure 8.8 shows that there is correlation between dbh and height.

Mark correlation function Let X be a simple marked point process on $\mathbb{R}^d$. The mark correlation function is a measure of the dependence between the marks of two points of the process a distance r apart, where r > 0. The following definition of the mark correlation function is quoted from Stoyan and Stoyan [104, page 263]. Further details on the definition, properties, estimation and application of the mark correlation function are reported in [82, 101, 100, 104, 93].

Definition 23 (Mark correlation function). Let f(m′, m′′) be an arbitrary (non-negative) measurable function on $\mathbb{R}^2$ depending on the marks m′ and m′′ of two


[Figure 8.5: three panels; histogram of Height (y-axis: Frequency); Height against X-coordinate; Height against Y-coordinate.]

Figure 8.5: Heights of the Brazilian trees dataset: histogram (upper); scatter plots of heights against x (centre) and y (lower).


[Figure 8.6: three panels; histogram of Diameter at breast height (y-axis: Frequency); Diameter at breast height against X-coordinate; Diameter at breast height against Y-coordinate.]

Figure 8.6: Dbh of the Brazilian trees dataset: histogram (upper); scatter plots of dbh against x (centre) and y (lower).


[Figure 8.7: two panels; x-axis: Species (10, 16, 21, 31, 32, 41, 43, 45, 47, 54); y-axis: Height (upper), Dbh (lower).]

Figure 8.7: Ten most frequent species of the Brazilian trees dataset plotted against height (upper) and dbh (lower). Numbers plotted on the x-axis are species numbers shown in Table 8.1.


[Figure 8.8: two scatter plots of Height (y-axis) against Dbh (x-axis), with points labelled by species number.]

Figure 8.8: Dbh of the Brazilian trees dataset plotted against the height: 56 species (upper); ten most frequent species (lower). Numbers inside the plots are species numbers shown in Table 8.1.


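points x′ and x′′. The measure $\alpha_f^{(2)}$ on $\mathbb{R}^{2d}$ is defined by

$$\alpha_f^{(2)}(B_1 \times B_2) = E\Biggl[\,\sum_{[x_1;m_1]\in X}\ \sum_{\substack{[x_2;m_2]\in X \\ x_1 \neq x_2}} f(m_1,m_2)\,\mathbf{1}_{B_1}(x_1)\,\mathbf{1}_{B_2}(x_2)\Biggr].$$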

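The summation is over all pairs $[x_1; m_1]$, $[x_2; m_2]$ of marked points of X in $B_1$ and $B_2$, where $x_1 \neq x_2$ and $B_1$, $B_2$ are Borel sets of $\mathbb{R}^d$. Then, assuming continuity, there is a density function $\varrho_f^{(2)}(x_1, x_2)$ for $\alpha_f^{(2)}$, which is called the “f-product density”. For instance, if $f \equiv 1$ and $B_1$, $B_2$ are disjoint then $\alpha_f^{(2)}(B_1 \times B_2) = E[N(B_1)N(B_2)]$; in general, because pairs with $x_1 = x_2$ are excluded, $\alpha_f^{(2)}$ is then the second factorial moment measure of X, with density $\varrho^{(2)}$. The quotient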

$$\kappa_f(x_1, x_2) = \frac{\varrho_f^{(2)}(x_1, x_2)}{\varrho^{(2)}(x_1, x_2)}, \quad \text{where } \varrho^{(2)}(x_1, x_2) \neq 0,$$

can be interpreted as a conditional mean, namely as the mean of f(M1,M2), given

that there is a point of the point process at both locations x1 and x2, where M1 and

M2 denote the marks of x1 and x2, respectively. If the point process is stationary

and isotropic then κf (x1, x2) depends only on ‖x1 −x2‖ and we usually write κf (r),

r > 0. This function describes the correlation between marks. To give κf (r) more

of the character of a correlation function, it is normalised. The mark correlation

function is defined by

$$\rho_f(r) = \frac{\kappa_f(r)}{\kappa_f(\infty)},$$

where $\kappa_f(\infty) = E[f(M, M')]$ and $M$, $M'$ are independent samples from the marginal distribution of marks. Thus, roughly speaking,

$$\rho_f(r) = \frac{E[f(M_1, M_2)]}{E[f(M, M')]},$$

where M1, M2 are the marks attached to two points of the process separated by a

distance r, while M , M ′ are independent realisations from the marginal distribution

of marks. Note that f is any function f(m1,m2) with two arguments that are

possible marks of the point pattern, and which returns a nonnegative real value.

The mark correlation function is not a correlation function in the usual statistical

sense because this function can take any nonnegative real value. The value 1 suggests

lack of correlation. If the marks attached to the points X are i.i.d. then ρf (r) ≡ 1.

The interpretation of values larger or smaller than 1 will depend on the choice of

the function f .


For the height and dbh of the Brazilian trees dataset, the function f of the mark

correlation function is defined by f(m1,m2) = m1m2 because these attributes are

continuous real-valued marks. Thus

$$\rho_f(r) = \frac{E[M_1 M_2]}{E[M M']} = \frac{\operatorname{cov}(M_1, M_2)}{E[M]\,E[M']} + 1$$

since M,M ′ are independent. In this case, the mark correlation function ρf (r) is

a re-scaled version of the covariance function of the marks at two points separated

by a distance r. If the marks are i.i.d. then ρf (r) ≡ 1, whereas ρf (r) > 1 suggests

positive association and ρf (r) < 1 indicates negative association.

For the species of the Brazilian trees, which is a discrete mark, the function f

is defined by f(m1,m2) = 1{m1 = m2}, where 1{} denotes the indicator function.

Therefore,

$$\rho_f(r) = \frac{P(M_1 = M_2)}{P(M = M')},$$

where M,M ′ are independent with the same mark distribution. Analogous to the

interpretation of ρf for continuous marks, if discrete marks are i.i.d. then ρf (r) ≡ 1,

whereas ρf (r) > 1 indicates positive association and ρf (r) < 1 suggests negative

association.
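To make the two choices of f concrete, the following Python sketch computes a naive estimate of the mark correlation function at a single distance r. It is an illustration under stated assumptions, not the estimator used in this chapter: it applies no edge correction (the thesis uses the translation correction [81] in Spatstat), it uses a simple distance band of half-width h in place of kernel smoothing, and all function names are hypothetical.

```python
import math


def product_f(m1, m2):
    """f for continuous marks such as height or dbh."""
    return m1 * m2


def indicator_f(m1, m2):
    """f for discrete marks such as species."""
    return 1.0 if m1 == m2 else 0.0


def mark_correlation(points, marks, f, r, h):
    """Naive estimate of rho_f(r).

    Numerator: mean of f over ordered pairs whose distance lies in
    [r - h, r + h]. Denominator: mean of f over all ordered pairs of
    distinct points, standing in for E[f(M, M')].
    """
    near_sum, near_count = 0.0, 0
    all_sum, all_count = 0.0, 0
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            value = f(marks[i], marks[j])
            all_sum += value
            all_count += 1
            if abs(math.dist(points[i], points[j]) - r) <= h:
                near_sum += value
                near_count += 1
    if near_count == 0 or all_sum == 0.0:
        raise ValueError("no pairs near distance r, or f is zero for all pairs")
    return (near_sum / near_count) / (all_sum / all_count)


if __name__ == "__main__":
    # Ten points on a line; the first five carry label "a", the rest "b".
    pts = [(float(i), 0.0) for i in range(10)]
    labels = ["a"] * 5 + ["b"] * 5
    print(mark_correlation(pts, labels, indicator_f, r=1.0, h=0.1))
```

In the toy example, like labels sit next to each other, so the estimate at r = 1 exceeds 1; with identical marks on every point the estimate is 1 for either choice of f.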

The sampling window of the Brazilian trees dataset was restricted because Spatstat [9] was unable to handle the entire dataset owing to a shortage of computer memory. The sampling window considered was therefore [20, 80] × [20, 80] metres. Note that Stoyan and Stoyan [104, page 292], in Figures 124 and 125, plotted the estimated mark correlation functions against r for r ∈ [0, 28], where 28 mm was about one quarter of the shortest side of their sampling window. For the Brazilian trees dataset, the mark correlation functions were estimated using the translation correction [81] and likewise plotted against r for r ∈ [0, 15], where 15 m is one quarter of the shortest side of the restricted window.

The left plot in Figure 8.9 shows the estimated mark correlation function for the heights of the Brazilian trees. The mark correlation function suggests a positive association at small distances (r ≤ 3). This positive association indicates that young trees are clustered together. A positive association at small distances (r ≤ 3) is also noticeable in the estimated mark correlation function for the species (see the right plot in Figure 8.9), suggesting that neighbouring trees tend to be of the same species more frequently than would be expected if the species were allocated at random. However, the central plot in Figure 8.9 suggests that the dbh values of the trees are independent at distances greater than 1 m.


[Figure 8.9: three panels (Height, Dbh, Species); x-axis: r; y-axis: trans, theo.]

Figure 8.9: Estimated mark correlation functions for the height, dbh and species from the Brazilian trees dataset. Solid lines: mark correlation function estimated using the translation correction [81]; dashed lines: the line y = 1, which represents independence of the marks.

8.3 Analysis of Multivariate Brazilian trees dataset

The Brazilian trees dataset is now regarded as a multivariate point pattern with fifty-six types. In theory, the spatial analysis of such a dataset is possible, but in practice it is prohibitive. The methods available in the spatial statistics literature work very well for datasets that have at most two or three different types. Thus, a feasible option is to analyse the dataset classified into fewer types using the summary functions F, G, K and J.

Henceforth, the functions F, G and J are estimated using the Kaplan-Meier estimators [7], denoted by “km”, and the K-function is estimated using the translation correction [81], denoted by “trans”. Moreover, the corresponding function of the theoretical homogeneous Poisson process is denoted by “theo”. (The labels “km”, “trans” and “theo” appear on the y-axis of the plots presented in the next subsections.)
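For reference, the “theo” curves against which the estimates are compared have simple closed forms for a homogeneous Poisson process in the plane. The sketch below states these standard results in Python; the function names are illustrative, and the planar case (d = 2) is assumed because the trees are located in a two-dimensional plot.

```python
import math


def f_poisson(r, lam):
    """Empty space function F(r) = 1 - exp(-lam * pi * r^2)."""
    return 1.0 - math.exp(-lam * math.pi * r * r)


def g_poisson(r, lam):
    """Nearest neighbour distance distribution G(r); equal to F(r) under CSR."""
    return f_poisson(r, lam)


def k_poisson(r):
    """K(r) = pi * r^2, whatever the intensity."""
    return math.pi * r * r


def j_poisson(r):
    """J(r) = (1 - G(r)) / (1 - F(r)), which is identically 1 under CSR."""
    return 1.0


if __name__ == "__main__":
    lam = 0.1122  # estimated intensity of the full dataset (Section 8.4.1)
    for r in (1.0, 2.0, 4.0):
        print(r, g_poisson(r, lam), k_poisson(r), j_poisson(r))
```

Estimated G or K values above these curves point towards clustering, and J values below 1 do the same; this is how the plots in the following subsections are read. For a single type, or for a pair of types, lam should be the intensity of the type whose points are being counted.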

Three most frequent species The plots in Figure 8.10 show the locations of the trees classified into three groups: Ouratea, Qualea and Others. Ouratea acuminata has 293 trees; the three Qualea species (Qualea grandiflora, Qualea parviflora and Qualea multiflora) have 146 trees in total; and the 52 remaining species have 683 trees (see the species frequencies shown in Table 8.1).

F-function The Kaplan-Meier estimates of the F-function suggest that the three point patterns Others, Ouratea, and Qualea are consistent with realisations of homogeneous Poisson point processes with the same intensities as the observed patterns. (See the plots in Figure 8.11.)

G-cross function The Kaplan-Meier estimates of the G-cross function suggest that the Qualea point pattern is clustered at small distances (r < 4), because the estimated function lies above the theoretical function for a homogeneous Poisson process. (See the (Qualea, Qualea) plot in Figure 8.12.) Thus, at small distances the trees of Qualea sp. tend to be closer to each other than they would be if they were randomly located.

J-cross function The Kaplan-Meier estimates of the J-cross functions (see Fig-

ure 8.13) suggest positive association for the univariate point patterns: Others,

Ouratea, and Qualea at distances r < 4, because their estimated values are smaller

than 1.

K-cross function The K-cross functions plotted in Figure 8.14 show that Ouratea and Qualea are clustered at distances r < 5: the translation-corrected estimates for Ouratea and Qualea sp. are greater than the theoretical values for homogeneous Poisson point processes for r < 5.

Three botanical classes The estimated F, G, J and K functions for the three botanical classes are similar to those obtained for the three most frequent species, except for the J-cross function. In the diagonal plots of Figure 8.15, the Kaplan-Meier estimates of the J-cross functions suggest positive association for the univariate point patterns Liliopsida, Magnoliopsida and Others at small distances (r < 4), where the estimated values of the J-cross functions are smaller than 1.

Two types The association between the types Magnoliopsida and Others is analysed using the summary functions F, G, J and K.

Figure 8.10: Location of the three most frequent species of the Brazilian trees dataset: Others (left), Ouratea (centre), and Qualea (right).


[Figure 8.11: three panels (others, Ouratea, Qualea); x-axis: r; y-axis: km, theo.]

Figure 8.11: F-functions from: Others (left), Ouratea (centre), Qualea (right). Solid lines: Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.

F-function The Kaplan-Meier estimate of the F-function suggests that the point pattern Magnoliopsida is a realisation of a homogeneous Poisson point process with the same intensity as the observed point pattern: see the left plot in Figure 8.16. There is a small deviation between the estimated and homogeneous Poisson curves for the Others point pattern at distances r < 8. However, this deviation, which suggests regularity, is not supported by the cross-functions G, J and K. (See the results presented next.)

G-cross function The lower right plot in Figure 8.17 suggests that the Others trees are clustered for r < 4, because the estimated function lies above the homogeneous Poisson function. Thus, for r < 4 the Others trees tend to be closer to each other than they would be if they were randomly located.

J-cross function The diagonal plots in Figure 8.18 suggest a positive association

for the univariate datasets: Magnoliopsida and Others for r < 4. There is also

a suggestion of a positive association between Others and Magnoliopsida for r >

2. That is, the presence of an Others tree increases the probability of finding a

Magnoliopsida tree nearby. (See the lower left plot in Figure 8.18.)

K-cross function The lower right plot in Figure 8.19 suggests clustering for the Others point pattern. This result agrees with that from the J-cross function.

8.4 Complementary analysis

8.4.1 Fusion distance function The inferential part of the strategy (Section

4.3.2) is applied to the univariate Brazilian trees dataset using the fusion distance

function (Section 4.1.1). The null hypothesis is composite; that is, H0: the Brazilian


[Figure 8.12: 3 × 3 array of G functions for Ouratea, Qualea & others, with one panel for each ordered pair of (others, Ouratea, Qualea); x-axis: r; y-axis: km, theo.]

Figure 8.12: G-cross functions from: Others (left), Ouratea (centre), Qualea (right). Solid lines: Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.


[Figure 8.13: 3 × 3 array of J functions for Ouratea, Qualea & others, with one panel for each ordered pair of (others, Ouratea, Qualea); x-axis: r; y-axis: km, theo.]

Figure 8.13: J-cross functions for the Others, Ouratea, and Qualea. Solid lines: Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.


[Figure 8.14: 3 × 3 array of K functions for Ouratea, Qualea & others, with one panel for each ordered pair of (others, Ouratea, Qualea); x-axis: r; y-axis: trans, theo.]

Figure 8.14: K-cross functions for the Others, Ouratea, and Qualea. Solid lines: translation-corrected estimates, dashed lines: homogeneous Poisson point processes.


[Figure 8.15: 3 × 3 array of J functions for Magnoliopsida, Liliopsida & others, with one panel for each ordered pair of (Liliopsida, Magnoliopsida, others); x-axis: r; y-axis: km, theo.]

Figure 8.15: J-cross functions for the botanical classes: Liliopsida, Magnoliopsida and Others. Solid lines: Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.


[Figure 8.16: two panels (Magnoliopsida, others); x-axis: r; y-axis: km, theo.]

Figure 8.16: F-functions from the Magnoliopsida (left) and Others (right). Solid lines: Kaplan-Meier estimates, dashed lines: homogeneous Poisson point processes.

[Figure 8.17: 2 × 2 array of G functions for Magnoliopsida & others; lower right panel: (others, others); x-axis: r; y-axis: km, theo.]

Figure 8.17: G-cross functions for the Magnoliopsida and Others. Solid lines:



[Figure 8.18: 2 × 2 array of J functions for Magnoliopsida & others; lower right panel: (others, others); x-axis: r; y-axis: km, theo.]

Figure 8.18: J-cross functions for the Magnoliopsida and Others. Solid lines:



[Figure 8.19: 2 × 2 array of K functions for Magnoliopsida & others; lower right panel: (others, others); x-axis: r; y-axis: trans, theo.]

Figure 8.19: K-cross functions from the Magnoliopsida and Others. Solid lines:



trees dataset is a realisation of a homogeneous Poisson process with unknown intensity λ. The intensity is estimated from the dataset as $\hat{\lambda} = 0.1122$. The (two-sided) modified Monte Carlo test (Section 3.5) is performed using the Average Linkage algorithm and 999 simulations under H0. The resulting test has approximately the desired 5% significance level.
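The general shape of such a test can be sketched as follows. This Python sketch is not the modified Monte Carlo test of Section 3.5, whose details are not reproduced here: it is a plain two-sided rank-based Monte Carlo test, it conditions on the observed number of points so that the unknown intensity drops out, it assumes a unit-square window, and it uses the mean nearest-neighbour distance as a stand-in for a fusion-distance summary. All names are hypothetical.

```python
import math
import random


def mean_nn_distance(points):
    """Stand-in test statistic: mean nearest-neighbour distance."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


def simulate_csr(n, rng):
    """n independent uniform points in the unit square.

    Given the number of points, a homogeneous Poisson process has this
    distribution, so no intensity value is needed.
    """
    return [(rng.random(), rng.random()) for _ in range(n)]


def monte_carlo_p_value(observed, n_sim=999, seed=0, statistic=mean_nn_distance):
    """Two-sided Monte Carlo p-value from the rank of the observed statistic."""
    rng = random.Random(seed)
    t_obs = statistic(observed)
    sims = [statistic(simulate_csr(len(observed), rng)) for _ in range(n_sim)]
    low = 1 + sum(t <= t_obs for t in sims)   # rank counted from below
    high = 1 + sum(t >= t_obs for t in sims)  # rank counted from above
    return min(1.0, 2.0 * min(low, high) / (n_sim + 1))


if __name__ == "__main__":
    rng = random.Random(1)
    # A tightly clustered pattern: 30 points inside a square of side 0.01.
    clustered = [(0.01 * rng.random(), 0.01 * rng.random()) for _ in range(30)]
    print(monte_carlo_p_value(clustered, n_sim=99))
```

With 999 simulations the smallest attainable two-sided p-value is 2/1000, which is why 999 is a convenient choice for a 5% test.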

[Figure 8.20: two panels; x-axis: Mean of H(t); y-axis: H(t) − mean of H(t); legends: fusion distance function, simulation envelopes (upper panel) or critical band (lower panel), y = 0 line.]

Figure 8.20: P-P plots of the mean $\bar{H}(t)$ (x-axis) against $H(t) - \bar{H}(t)$ (y-axis) for the univariate Brazilian trees dataset; 5% significance level, 999 realisations under H0, Average Linkage algorithm. Simulation envelopes (upper); critical bands (lower). Solid lines: P-P plots; dotted lines: the line y = 0; dashed lines: envelopes and bands.


Algorithm $A$ $\bar{A}$ $S_A$ $A_{2.5}$ $A_{97.5}$

SL 0.523 0.501 0.010 0.481 0.520

AL 0.520 0.500 0.007 0.487 0.512

CL 0.513 0.495 0.006 0.484 0.506

Table 8.3: Estimated area statistic $A$ from the univariate Brazilian trees dataset; the sample mean and standard deviation, $\bar{A}$ and $S_A$; and the Monte Carlo critical values, $A_{2.5}$ and $A_{97.5}$. The statistics are estimated from 999 realisations under H0. Single Linkage (SL), Average Linkage (AL), Complete Linkage (CL).

Figure 8.20 shows the estimated mean $\bar{H}(t)$ (x-axis) plotted against the estimated $H(t) - \bar{H}(t)$ (y-axis). Observe that $H(t) - \bar{H}(t)$ lies substantially outside the simulation envelope and the critical band. Consequently, H0 is rejected. In other words, the univariate Brazilian trees dataset is not consistent with a realisation of a homogeneous Poisson point process with λ = 0.1122. The results of the fusion distance function indicate that the locations of the trees are more clustered than would be expected for a homogeneous Poisson process.

The fusion distance function was also computed for the Brazilian trees dataset

classified into the three most frequent species, three botanical classes and two types.

The obtained results are similar to those presented in Section 8.3, so they are not

shown here.

8.4.2 Area statistic If a given point pattern is a realisation of a homogeneous

Poisson process, then the expected value of the area statistic is 0.5 (see Proposition

11 in Section 4.1.2). Deviations from this value may indicate either spatial clustering

or spatial inhibition. The null hypothesis is exactly the same as the null hypothesis

of the fusion distance function (Section 8.4.1). The (two-sided) modified Monte

Carlo test (Section 3.5) is performed based on 999 simulations under H0, and the

Single Linkage, Average Linkage, and Complete Linkage algorithms.

The prediction intervals for the estimated area statistic are as follows: Single Linkage: [0.481, 0.521]; Average Linkage: [0.486, 0.514]; Complete Linkage: [0.483, 0.507]. The Monte Carlo estimates of the critical values, A2.5 and A97.5, are the 2.5th and 97.5th percentiles of the area statistic under H0, respectively. For all three algorithms the estimated value of the area statistic A exceeds A97.5 (see Table 8.3). Therefore, the null hypothesis is rejected. The results of the area statistic also suggest that the locations of the trees are more clustered than would be expected for a homogeneous Poisson process.
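The decisions above can be reproduced from the entries of Table 8.3. The Python sketch below does so; it also shows that the quoted prediction intervals agree, to three decimals, with the sample mean plus or minus 1.96 sample standard deviations. That normal-approximation reading is an assumption made here for illustration, since the text does not state how the intervals were obtained, and the names used are hypothetical.

```python
# Rows of Table 8.3: observed A, Monte Carlo mean, standard deviation,
# and the Monte Carlo critical values A2.5 and A97.5.
TABLE_8_3 = {
    "SL": {"a": 0.523, "mean": 0.501, "sd": 0.010, "lo": 0.481, "hi": 0.520},
    "AL": {"a": 0.520, "mean": 0.500, "sd": 0.007, "lo": 0.487, "hi": 0.512},
    "CL": {"a": 0.513, "mean": 0.495, "sd": 0.006, "lo": 0.484, "hi": 0.506},
}


def two_sided_decision(row):
    """Reject H0 when the observed A falls outside [A2.5, A97.5]."""
    outside = row["a"] < row["lo"] or row["a"] > row["hi"]
    return "reject" if outside else "do not reject"


def normal_interval(row, z=1.96):
    """Assumed form of the prediction interval: mean +/- z * sd."""
    return (round(row["mean"] - z * row["sd"], 3),
            round(row["mean"] + z * row["sd"], 3))


if __name__ == "__main__":
    for name, row in TABLE_8_3.items():
        print(name, two_sided_decision(row), normal_interval(row))
```

In every row the observed value lies above the upper critical value, which is the direction associated with clustering in the discussion above.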

8.4.3 S statistic and spatial Rg index The extension of the strategy using the S statistic (Section 6.2) and the spatial Rg index (Section 6.3) is applied to test the random labelling hypothesis for the multivariate Brazilian trees dataset. That is, (one-sided) Monte Carlo tests (Section 3.5) are performed at the 5% significance level, based on the Single Linkage (SL), Average Linkage (AL) and Complete Linkage (CL) algorithms and on 999 random permutations of the type labels.
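The mechanics of a random labelling test can be sketched as follows: the point locations are held fixed and the type labels are permuted. This Python sketch is illustrative only. The S statistic and the spatial Rg index are defined in Chapter 6 and are not reproduced here, so a stand-in statistic is used (the number of points whose nearest neighbour carries the same label), and all function names are hypothetical.

```python
import math
import random


def nearest_neighbour_indices(points):
    """Index of the nearest other point, for every point."""
    result = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        result.append(min(others, key=lambda j: math.dist(p, points[j])))
    return result


def same_label_nn_count(labels, nn):
    """Stand-in statistic: points whose nearest neighbour shares their label."""
    return sum(labels[i] == labels[j] for i, j in enumerate(nn))


def random_labelling_p_value(points, labels, n_perm=999, seed=0):
    """One-sided Monte Carlo p-value; large statistic values are extreme."""
    rng = random.Random(seed)
    nn = nearest_neighbour_indices(points)  # locations never change
    t_obs = same_label_nn_count(labels, nn)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reallocation of the type labels
        if same_label_nn_count(shuffled, nn) >= t_obs:
            at_least += 1
    return (1 + at_least) / (n_perm + 1)


if __name__ == "__main__":
    # Two well-separated groups, each carrying its own label.
    pts = [(0.01 * i, 0.0) for i in range(10)] + \
          [(10.0 + 0.01 * i, 0.0) for i in range(10)]
    labs = ["a"] * 10 + ["b"] * 10
    print(random_labelling_p_value(pts, labs, n_perm=199))
```

Because only the labels are permuted, the test is conditional on the observed locations, so it addresses the allocation of types rather than the clustering of the points themselves.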

56 species, 7 subclasses, 3 classes, 2 types Table 8.4 shows the estimated values of the S statistic and the spatial Rg index, denoted S and Rg, together with their Monte Carlo critical values, S5%rr and R5%rr, respectively, for the Brazilian trees dataset classified into 56 species, 7 subclasses, 3 classes, and 2 types.

56 species S S5%rr Rg R5%rr

SL 58 39 0.648 0.647

AL 64 44 0.895 0.894

CL 68 46 0.896 0.896

7 subclasses S S5%rr Rg R5%rr

SL 209 183 0.367 0.371

AL 235 206 0.603 0.602

CL 249 213 0.598 0.600

3 classes S S5%rr Rg R5%rr

SL 654 636 0.809 0.812

AL 729 704 0.439 0.448

CL 738 712 0.408 0.414

2 types S S5%rr Rg R5%rr

SL 655 640 0.816 0.819

AL 730 707 0.501 0.503

CL 739 714 0.504 0.506

Table 8.4: Estimated values S, S5%rr , Rg, R5%rr from the Brazilian trees dataset

classified into: 56 species, 7 subclasses, 3 classes, 2 types; Single Linkage (SL),

Average Linkage (AL), Complete Linkage (CL), 999 random permutations of the

type labels.


Using the S statistic, the random labelling hypothesis is rejected for the multivariate
Brazilian trees dataset under all three clustering algorithms and all four
classifications, since every estimated S exceeds its critical value S5%rr.

For the spatial Rg index, the random labelling hypothesis is not rejected for the
three-class and two-type classifications. It is rejected for the dataset classified
into 56 species, based on the Single Linkage and Average Linkage, and for the dataset
classified into seven subclasses, based on the Average Linkage. (See Table 8.4.)
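The decisions stated above can be reproduced directly from the entries of Table 8.4. The short sketch below assumes the one-sided rule "reject when the estimated value strictly exceeds the Monte Carlo critical value", which is consistent with the conclusions reported in the text; the variable and function names are illustrative only.

```python
# Values copied from Table 8.4: (S, S5%rr, Rg, R5%rr) per classification and algorithm.
table_8_4 = {
    "56 species": {"SL": (58, 39, 0.648, 0.647),
                   "AL": (64, 44, 0.895, 0.894),
                   "CL": (68, 46, 0.896, 0.896)},
    "7 subclasses": {"SL": (209, 183, 0.367, 0.371),
                     "AL": (235, 206, 0.603, 0.602),
                     "CL": (249, 213, 0.598, 0.600)},
    "3 classes": {"SL": (654, 636, 0.809, 0.812),
                  "AL": (729, 704, 0.439, 0.448),
                  "CL": (738, 712, 0.408, 0.414)},
    "2 types": {"SL": (655, 640, 0.816, 0.819),
                "AL": (730, 707, 0.501, 0.503),
                "CL": (739, 714, 0.504, 0.506)},
}


def rejects(observed, critical):
    """Assumed one-sided rule: reject random labelling if observed > critical."""
    return observed > critical


rejected_by_s = []
rejected_by_rg = []
for classification, rows in table_8_4.items():
    for algorithm, (s, s_crit, rg, rg_crit) in rows.items():
        if rejects(s, s_crit):
            rejected_by_s.append((classification, algorithm))
        if rejects(rg, rg_crit):
            rejected_by_rg.append((classification, algorithm))

print("Rejected by S: ", rejected_by_s)
print("Rejected by Rg:", rejected_by_rg)
```

Under this rule the S statistic rejects in all twelve cases, whereas the spatial Rg index rejects only for 56 species (SL, AL) and 7 subclasses (AL).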

8.4.4 Gamma approximation for spatial Rg index

The gamma approximation is fitted to the Monte Carlo null distribution of the spatial
Rg index applied to the Brazilian trees dataset classified into two types, using the
procedure described in Section 6.3. Figure 8.21 shows that the fitted gamma
distribution is a good approximation to the Monte Carlo null distribution of the
spatial Rg index.

[Figure 8.21: Q-Q plot panel titled "Brazilian trees with two types"; vertical axis: gamma approximation; both axes run from 0.500 to 0.520.]

Figure 8.21: Q-Q plot comparing the Monte Carlo estimate of the null distribution

of the spatial Rg index with its gamma approximation, for the Brazilian trees dataset

with two types. Solid line: Q-Q plot, dashed line: identity line.

Table 8.5 presents the estimated value of the spatial Rg index; the parameters α,
β, γ of the fitted gamma approximation; and the p-value from the Monte Carlo null
distribution of the spatial Rg index under the random labelling hypothesis,
based on the Average Linkage. (The parameters of the gamma approximation were
estimated using the Method of Moments (Section 6.3).)

Dataset      Rg      α       β        γ       p-value

Brazilian    0.501   4.530   0.0005   0.499   0.584

Table 8.5: Estimated spatial Rg index applied to the bivariate Brazilian trees dataset;
parameters α, β, γ of the fitted gamma; and p-value from the Monte Carlo null
distribution of the spatial Rg index under random labelling hypothesis; Average Linkage.

Because of the large p-value, the random labelling hypothesis is not rejected for the
bivariate Brazilian trees dataset, in agreement with the two-type Average Linkage
result in Table 8.4.
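As an illustration of how a three-parameter gamma distribution can be fitted by moments, the sketch below matches the sample mean, variance and skewness. It assumes that α is a shape, β a scale and γ a location parameter, so that the mean is γ + αβ, the variance αβ², and the skewness 2/√α; the exact procedure of Section 6.3 may differ, and the function name `fit_gamma3_moments` is hypothetical. With the values in Table 8.5 the implied mean is 0.499 + 4.530 × 0.0005 ≈ 0.5013. The sample below is simulated toy data, not the Monte Carlo null distribution from this chapter.

```python
import numpy as np


def fit_gamma3_moments(x):
    """Moment estimates (alpha, beta, gamma) of a shifted gamma distribution.

    Assumed parametrisation: X = gamma + G, with G ~ Gamma(shape=alpha, scale=beta).
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    sd = x.std(ddof=1)
    centred = x - mean
    skew = np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5
    if skew <= 0:
        raise ValueError("a gamma fit by moments needs positive sample skewness")
    alpha = 4.0 / skew ** 2          # from skewness = 2 / sqrt(alpha)
    beta = sd / np.sqrt(alpha)       # from variance = alpha * beta**2
    gamma = mean - alpha * beta      # from mean = gamma + alpha * beta
    return alpha, beta, gamma


# Toy sample drawn from a shifted gamma with parameters of the same order
# as those in Table 8.5 (used here only to exercise the function).
rng = np.random.default_rng(0)
sample = 0.499 + rng.gamma(shape=4.53, scale=0.0005, size=200_000)
alpha_hat, beta_hat, gamma_hat = fit_gamma3_moments(sample)
print(alpha_hat, beta_hat, gamma_hat)
```

Skewness-based estimates are known to be variable in small samples, so with 999 Monte Carlo values the fitted shape parameter should be read with some caution.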

8.4.5 Analysis of local configuration

This section presents the analysis of the local configuration (Section 7.1) applied to
the Brazilian trees dataset, based on the 20 nearest neighbours and the Single Linkage,
Average Linkage and Complete Linkage algorithms. Figure 8.22 shows the kernel estimates
of the probability density functions of the fusion distances for the univariate
Brazilian trees dataset. The dendrograms of the Single Linkage, Average Linkage and
Complete Linkage are shown in Figure 8.23.

Similar to the dendrograms of the Longleaf pines and Lansing woods datasets

(Section 7.2.3), the disordered Single Linkage dendrogram suggests that there may

not be spatial clustering in the Brazilian trees dataset. Even though there is no

strong evidence for clusters in the dataset, the analysis proceeds and the dendro-

gram of the Average Linkage is cut into seven, three and two clusters. The cluster

classification is then compared with the botanical classification.
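The steps of the local configuration analysis (nearest neighbours, fusion distances, kernel smoothing, total variation distance, Average Linkage, cutting the dendrogram) can be sketched as below. This is an illustrative reconstruction under several assumptions, not the code used for this chapter: the neighbourhood includes the point itself, a Gaussian kernel with a fixed bandwidth is used, the total variation distance is taken as half the integrated absolute difference (the scaling in Section 7.1 may differ), and all function names are hypothetical. The pattern is simulated toy data on a 100 m square.

```python
import numpy as np


def agglomerate(dist, method="average", n_clusters=1):
    """Naive agglomerative clustering on a square distance matrix.

    Returns the fusion distances of the merges performed and the cluster
    label of every item once `n_clusters` clusters remain.
    """
    d = np.array(dist, dtype=float)
    n = d.shape[0]
    np.fill_diagonal(d, np.inf)
    size = np.ones(n)
    active = np.ones(n, dtype=bool)
    labels = np.arange(n)
    heights = []
    for _ in range(n - n_clusters):
        masked = np.where(active[:, None] & active[None, :], d, np.inf)
        i, j = np.unravel_index(np.argmin(masked), masked.shape)
        heights.append(masked[i, j])
        if method == "single":
            merged = np.minimum(d[i], d[j])
        elif method == "complete":
            merged = np.maximum(d[i], d[j])
        else:  # average linkage
            merged = (size[i] * d[i] + size[j] * d[j]) / (size[i] + size[j])
        d[i, :] = merged
        d[:, i] = merged
        d[i, i] = np.inf
        active[j] = False          # cluster j is absorbed into cluster i
        size[i] += size[j]
        labels[labels == j] = i
    return np.array(heights), labels


def local_fusion_density(points, index, k, grid, bandwidth, method):
    """Gaussian kernel density of the fusion distances in a k-neighbourhood."""
    to_all = np.linalg.norm(points - points[index], axis=1)
    neighbourhood = points[np.argsort(to_all)[: k + 1]]  # the point itself is included
    diff = neighbourhood[:, None, :] - neighbourhood[None, :, :]
    heights, _ = agglomerate(np.linalg.norm(diff, axis=2), method)
    z = (grid[:, None] - heights[None, :]) / bandwidth
    norm = len(heights) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm


def total_variation(f, g, step):
    """Half the integrated absolute difference, approximated on a regular grid."""
    return 0.5 * np.sum(np.abs(f - g)) * step


rng = np.random.default_rng(0)
points = rng.uniform(0.0, 100.0, size=(60, 2))  # toy pattern
grid = np.linspace(0.0, 60.0, 241)
step = grid[1] - grid[0]
n = len(points)
densities = np.array(
    [local_fusion_density(points, i, 10, grid, 2.0, "average") for i in range(n)]
)
tv = np.zeros((n, n))
for a in range(n):
    for b in range(a + 1, n):
        tv[a, b] = tv[b, a] = total_variation(densities[a], densities[b], step)

# Stopping after n - 3 merges is equivalent to cutting the dendrogram into three groups.
_, groups = agglomerate(tv, "average", n_clusters=3)
print(np.unique(groups, return_counts=True))
```

The naive merge loop costs O(n³) operations, which is adequate for a sketch but would be slow for a dataset of the size analysed here; an optimised clustering routine would be the practical choice.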

The results of the classification based on the Single Linkage and Complete Linkage
algorithms are not shown here, because the Single Linkage separates the dataset
poorly into meaningful clusters, and the Complete Linkage results are similar to
those obtained from the Average Linkage.

Seven groups

Table 8.6 shows the frequency counts of the Brazilian trees cross-classified by
botanical type (seven subclasses, three classes, two types) and by group (seven,
three, and two groups, respectively) based on the total variation distance. The upper
plot in Figure 8.24 shows the local configuration classification of the Brazilian trees
dataset into seven groups from the total variation distance based on the 20 nearest
neighbours and Average Linkage.

A Poisson process with the same estimated intensity as the Brazilian trees dataset

on the 100 m square was simulated, and the mean (equation (7.4)) of the local fusion


[Figure 8.22: three panels (Single Linkage, Average Linkage, Complete Linkage); horizontal axis: fusion distances, 0 to 15; vertical axis: probability density function.]

Figure 8.22: Kernel probability densities of the fusion distances from the univariate

Brazilian trees dataset based on the 20 nearest neighbours, Single Linkage, Average

Linkage, and Complete Linkage.

[Figure 8.23: three dendrograms (Single Linkage, Average Linkage, Complete Linkage); vertical axis: total variation distances.]

Figure 8.23: Dendrograms of total variation distances from kernel densities of fusion

distances for the Brazilian trees dataset, based on the 20 nearest neighbours; Single

Linkage; Average Linkage; Complete Linkage.


[Figure 8.24: three panels: Seven groups, Three groups, Two groups.]

Figure 8.24: Classification of points in the Brazilian trees dataset into: seven groups

(upper), three groups (centre), and two groups (lower) based on their local config-

uration (20 nearest neighbours, fusion distances, kernel smoothing, total variation

distance, Average Linkage).


                         Group
Subclass            1     2     3     4     5     6     7
Aracidae            6    13     3     0     3     1     0
Asteridae          16    27     3     0     1     1     0
Dilleniidae       118   201    39     1     4     8     0
Hamamelidae        12    20     4     0     0     0     0
Liliidae            9     9     9     0     1     1     0
Miscellaneous      11    40     7     0     0     1     0
Rosidae           152   321    48     6     9    15     2

                         Group
Class               1     2     3
Magnoliopsida     415   569    24
Liliopsida         31    22     2
Others             18    40     1

                         Group
Type                1     2
1                 984    24
2                 111     3

Table 8.6: Contingency tables of the Brazilian trees dataset by botanical types and

groups. Upper: seven subclasses and groups; centre: three classes and groups; lower:

two types and groups. Groups based on total variation distances; 20 nearest neigh-

bours; Average Linkage.

distance function was computed. The estimated group means (equation (7.3)) of the

local fusion distance functions were calculated, and compared with the estimated

mean of the local fusion distance functions from the Poisson process. Except for

group 7, which only has 2 trees, the upper plot in Figure 8.25 suggests that there is

not a clear separation for clusters in the Brazilian trees dataset.

Three groups

The central plot in Figure 8.24 shows the local configuration classification
of the Brazilian trees dataset into three groups based on the total variation

distances from the 20 nearest neighbours and Average Linkage. The group means

of the local fusion distance functions were estimated, and compared with the mean

of the fusion distance functions from a simulated Poisson process with the same

intensity as the Brazilian trees dataset on the 100 m square. The lower left plot in

Figure 8.25 indicates that the Brazilian trees dataset does not have a good separation


for clusters.

Two groups

The lower plot in Figure 8.24 shows the local configuration classification
of the Brazilian trees dataset into two groups based on the total variation

distance from the 20 nearest neighbours and Average Linkage. The lower right plot

in Figure 8.25 suggests that there is not a clear separation for two groups in the

Brazilian trees dataset.

The results of the analysis based on the 10 nearest neighbours are very similar to

those obtained for the 20 nearest neighbours. Therefore, the 10 nearest neighbour

analysis is not presented here.


[Figure 8.25: three panels; vertical axis: estimated mean of H_v(t); both axes run from 0.0 to 1.0; legends: group 1 to group 7, group 1 to group 3, and group 1 to group 2, each with the identity line.]


the mean of local fusion distance functions from homogeneous Poisson processes with

the same intensity as the Brazilian trees dataset. Upper: seven groups; lower left:

three groups; lower right: two groups. 20 nearest neighbours; Average Linkage.

CHAPTER 9

Conclusion and open problems

This chapter describes the main problems studied, and findings of the research

reported in this thesis, chapter by chapter. Furthermore, important issues arising

from the new approach are discussed in detail and suggestions for future work are

made.

9.1 Problems studied and findings

Chapter 3 The problem of an exact significance level of the Monte Carlo test

applied to the P-P plot was studied and solved. The (new) modified version of

the two-sided test was described in Section 3.5. In addition, the transformed P-P

plot, “A-A plot”, was presented and applied to the published and simulated point

patterns.

Chapter 4 The problem of the shortage of summary statistics that work well in
practical applications was investigated, and complementary statistics were proposed.
In particular, the fusion distance function and the area statistic were developed.
Both were thoroughly studied using the most popular hierarchical algorithms
(Single Linkage, Average Linkage and Complete Linkage) and dissimilarity
coefficient (Euclidean pairwise distance) in multivariate cluster analysis. Note that
the fusion distance function and area statistic depend strongly on the chosen
clustering algorithm and dissimilarity coefficient. In other words, for a given point
pattern, the fusion distance function and area statistic will take different shapes
and values for different algorithms and coefficients.

Fusion distance function The fusion distance function is regarded as a link

with the open problem of finding the best number of clusters in multivariate clus-

ter analysis. Here, the relationship between the fusion distance function and best

number of clusters using a knee plot is examined. In summary, the fusion distance

function is a linear combination of the knee plot. However, the problem of find-

ing parametric functions for estimating the fusion distance function under the null

hypothesis of Complete Spatial Randomness is still open.

Area statistic The problem of the (unknown) value of the area statistic un-

der the null hypothesis of Complete Spatial Randomness is investigated and solved.

Proposition 11 demonstrates that the area statistic for a homogeneous Poisson pro-

cess with (known) intensity λ is equal to 0.5.


New strategy The problem of analysing point patterns using tools of exploratory

data analysis and inference is studied, and a new strategy is proposed in Section

4.3. A new application of the relative distribution method using the fusion distance

function is presented. In particular, the relative distribution plot is applied to the

standard spatial datasets.

Chapter 5 The problem of estimating the power of the Monte Carlo tests un-

der the null hypothesis of Complete Spatial Randomness was investigated, and a

particular case of the power was estimated. This particular study was an illustra-

tion of the powers of the Monte Carlo tests using the supremum distance and area

statistics. Both tests were based on the fusion distance function.

The chosen alternatives are two special models of Matérn point processes (the Matérn
cluster process and Matérn model II). Simulations from these alternative models
do not depend on iterative algorithms. Thus, the computing time required for the
power study remains feasible, because direct simulation algorithms are used.
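To illustrate what a direct (non-iterative) simulation algorithm looks like, the sketch below generates a Matérn cluster process on the unit square in a single pass: Poisson parents, a Poisson number of offspring per parent, and offspring placed uniformly in a disc around the parent. The parameter names `kappa` (parent intensity), `mu` (mean number of offspring) and `radius`, the unit-square window, and the function name are assumptions for illustration; they are not the parameter values used in Chapter 5.

```python
import numpy as np


def simulate_matern_cluster(kappa, mu, radius, seed=0):
    """Direct simulation of a Matérn cluster process on the unit square."""
    rng = np.random.default_rng(seed)
    # Parents are simulated on the window dilated by `radius`, so that
    # clusters centred just outside the window still contribute offspring.
    lo, hi = -radius, 1.0 + radius
    n_parents = rng.poisson(kappa * (hi - lo) ** 2)
    parents = rng.uniform(lo, hi, size=(n_parents, 2))
    offspring = []
    for parent in parents:
        n_off = rng.poisson(mu)
        # Uniform points in a disc: radius scales with the square root of a uniform.
        r = radius * np.sqrt(rng.uniform(size=n_off))
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_off)
        offsets = np.column_stack((r * np.cos(theta), r * np.sin(theta)))
        offspring.append(parent + offsets)
    if not offspring:
        return np.empty((0, 2))
    pts = np.vstack(offspring)
    inside = np.all((pts >= 0.0) & (pts <= 1.0), axis=1)
    return pts[inside]  # only the offspring are kept, clipped to the window


pattern = simulate_matern_cluster(kappa=20.0, mu=10.0, radius=0.05, seed=0)
print(len(pattern))  # the expected number of points is kappa * mu = 200
```

No accept/reject iteration or Markov chain is involved, so the cost of one simulation is proportional to the number of points generated.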

The power of the Monte Carlo test based on the supremum distance is quite

variable and difficult to understand whereas the power of the Monte Carlo test based

on the area statistic is straightforward. The best power achieved by the supremum

distance is comparable to the best power achieved by the area statistic. Therefore,

the Monte Carlo test based on the area statistic is recommended for the models

studied here.

Chapter 6 The problem of analysing multivariate point patterns is examined,

and complementary strategies are proposed. First, an extension of the strategy

using the fusion distance function is described. Second, another extension based on

the S statistic is presented. Third, an extension of the strategy using the spatial Rg

index, a (new) modified version of Rg index, is developed.

In addition, some properties of the statistics (S, spatial Rg index) are investi-

gated. The extensions are applied to bivariate published point patterns. Finally,

the Monte Carlo null distribution of the spatial Rg index is approximated, using a

gamma distribution.

Chapter 7 The problem of examining a localised neighbourhood based on the

fusion distance function was studied, and the analysis of local configuration, a new

extension of LISA (Local Indicators of Spatial Association), is presented in Section

7.1. That is, given a local neighbourhood of a point pattern, the probability density

of the fusion distance function is approximated using kernel smoothing techniques.


The total variation distance is chosen to measure distances between probability

densities of the local fusion distances.

The analysis of local configuration is applied to published multivariate datasets.

The local configuration strategy has successfully identified different textures of the

published datasets. The Average Linkage and Complete Linkage show a better

performance than the Single Linkage. In fact, the poor performance of the Single

Linkage is mainly because of the chaining effect. Thus, if there are no clusters with

nucleus in a given dataset then the Single Linkage is unable to separate the dataset

into meaningful clusters.

Chapter 8 The problem of analysing large multivariate point patterns, such
as the Brazilian trees dataset, was studied using exploratory data analysis and
inference, based on spatial summary functions and statistics. The additional task of
finding the complete botanical classification of the Brazilian trees dataset was also
completed, and a few inconsistencies found in the dataset were corrected. The
difficulty of analysing the Brazilian trees dataset with fifty-six types was addressed
by classifying the dataset into fewer types: seven subclasses, three most frequent
species, three classes, and two types.

The results from the fusion distance function and area statistic show that regard-

less of the types, the univariate dataset is clustered. For the multivariate dataset

classified into fewer types, the results obtained by using the traditional summary

functions (G, J , K), and the fusion distance function show the presence of clustered

sub-patterns in the dataset.

9.2 Critique

Limitations In this thesis, the proposed methods and strategies were only

examined by using popular hierarchical algorithms (Single Linkage, Average Linkage

and Complete Linkage) and dissimilarity coefficient (Euclidean pairwise distance).

There are no theoretical results related to the new (non-parametric) summary

statistics and function. Moreover, the statistics rely on Monte Carlo simulations

and tests for the applications to point patterns.

In Chapter 5, a comparative study of the power of the Monte Carlo tests using the

supremum distance and area statistic under Complete Spatial Randomness against

the chosen alternative models based on the traditional summary functions (G, J , K)

was not done. Such a study would be useful for comparing the fusion distance function
with the traditional summary functions; however, it requires extensive computation
and programming.

In Chapter 7, the analysis of a localised neighbourhood of a given point pattern,

based on the local fusion distance function, is examined. However, the problem of

identifying the clusters, and measuring the degree of spatial clustering needs further

investigation.

Weaknesses The fusion distance function and area statistic are strongly
dependent on the choice of hierarchical algorithm and dissimilarity coefficient. Also,
the fusion distances of a given point pattern cannot be regarded as independent and
identically distributed observations. (See details described in Section 4.1.1.)

The study of the power of the Monte Carlo tests (Chapter 5) is not intended to

produce a general rule. The results found are an illustration of the performance of

the new summary statistics to an arbitrary choice of the alternative models.

Difficulty The main difficulty found is the practical impossibility of plotting

large multitype point patterns. In theory, the existing methods and software work

very well only if the number of types is smaller than or equal to three.

Final issues First, the fusion distance function H(t) is a non-parametric function,
and the area statistic A, the statistic S, and the spatial Rg index are non-parametric
statistics. Second, it was not possible to use standard goodness-of-fit tests, since
the distribution of the Kolmogorov-Smirnov statistic under CSR is still unknown.
Therefore, the two-sided modified version of the Monte Carlo test (Section 3.5), based
on the fusion distance function and area statistic, is performed to estimate the power
of the test of CSR and to achieve the exact significance level α. (See further
information in Section 5.2.) Finally,

the proposed strategies and methods depend strongly on computers and graphical

analysis.

9.3 Open problems

Chapter 4

• Study the fusion distance function and area statistic using other hierarchical
algorithms, for instance Ward's minimum variance method, and a generalised
dissimilarity coefficient such as the Mahalanobis distance.


• Given a clustered pattern with a fixed number of clusters, investigate the

new summary function and statistics using a non-hierarchical algorithm, for

example K-means.

• Approximate a parametric function for the fusion distance function under the

null model of Complete Spatial Randomness.

• Examine the fusion distance function under more complicated models such as

the inhomogeneous Poisson point process.

Chapter 5

• Compare the power of the Monte Carlo test under Complete Spatial Random-

ness using the supremum distance, and area statistic based on the summary

functions: G, J , and K.

• Investigate the power of the Monte Carlo test under Complete Spatial Randomness
using the supremum distance and area statistic based on other hierarchical
algorithms and the Mahalanobis distance.

Chapter 6

• Examine the viability of the computational programming for estimating the

parameters of the Monte Carlo null distribution of the spatial Rg index using

the Maximum Likelihood Method.

• Explore other distributions, such as the Log Normal and Weibull, to approxi-

mate the Monte Carlo null distribution of the spatial Rg index.

Chapter 8

• Extend the existing techniques and software for analysing and plotting large

multivariate point patterns, especially, for point patterns with a number of

types larger than five.


Bibliography

[1] M. Aitkin and D. Clayton. The fitting of exponential, Weibull and extreme

value distributions to complex censored survival data using GLIM. Appl.

Statist., 29:156–163, 1980.

[2] H. Akaike. An approximation to the density function. Ann. Inst. Statist.

Math., 6:127–132, 1954.

[3] N. H. Anderson and D. M. Titterington. Some methods for investigating

spatial clustering, with epidemiological applications. J. R. Statist. Soc. A,

160(1):87–105, 1997.

[4] T. W. Anderson and D. A. Darling. Asymptotic theory of certain “goodness

of fit” criteria based on stochastic processes. Ibid., 23:193–212, 1952.

[5] L. Anselin. The Moran scatterplot as an ESDA tool to assess local instability

in spatial association. In The DISDATA Specialist Meeting on GIS and Spatial

Analysis, Amsterdam, The Netherlands, pages 1–5. West Virginia University,

Regional Research Institute, Research Paper 9330, 1993.

[6] L. Anselin. Local indicators of spatial association - LISA. Geographical Anal-

ysis, 27:93–115, 1995.

[7] A. J. Baddeley and R. D. Gill. Kaplan-Meier estimators for interpoint distance

distributions of spatial point processes. Ann. Statist., 25:263–292, 1997.

[8] A. J. Baddeley and M. N. M. van Lieshout. Stochastic geometry models in

high-level vision. In K. V. Mardia, editor, Statistics and Images, pages 233–

258. Carfax, Abingdon, 1993.

[9] A. J. Baddeley and R. Turner. SpatStat for R, 1.3-2 edition, May 2002.

[10] G. A. Barnard. Discussion of Professor Bartlett’s paper. J. R. Statist. Soc.

Ser. B, 25:294, 1963.

[11] J. Besag and P. J. Diggle. Simple Monte Carlo tests for spatial pattern. Applied

Statistics, 26:327–333, 1977.

[12] J. Besag and J. Newell. The detection of clusters in rare diseases. J. R. Statist.

Soc. A, 154(1):143–155, 1991.


[13] P. J. Bickel and Kjell A. Doksum. Mathematical Statistics. Holden-Day, Inc.,

California, 1977.

[14] A. W. Bowman and A. Azzalini. Applied smoothing techniques for data anal-

ysis: the kernel approach with S-Plus illustrations. Oxford University Press,

Oxford, 1997.

[15] R. K. Brummitt. Vascular Plant Families and Genera. Royal Botanic Gardens,

Kew, 1992.

[16] J. M. Chambers, W. S. Cleveland, B. Kleiner, and P. A. Tukey. Graphical

Methods for Data Analysis. Wadsworth, Inc., California, 1987.

[17] J. L. Chandon and S. Pinson. Analyse Typologique. Masson, Paris, 1981.

[18] A. D. Cliff and J. K. Ord. Spatial Processes: Models and Applications. Pion,

London, 1981.

[19] D. R. Cox. Some statistical methods related with series of events (with dis-

cussion). J. R. Statist. Soc. B, 17:129–164, 1955.

[20] D. R. Cox and V. Isham. Point Processes. Chapman and Hall, London, 1980.

[21] D. R. Cox and P. A. W. Lewis. Multivariate point processes. In Proceedings

of the sixth Berkeley Symposium of Mathematics Statistics and Probability,

number 3, pages 401–445. University of California Press, 1972.

[22] N. Cressie and L. B. Collins. Analysis of spatial point patterns using bundles
of product density LISA functions. J. Agric. Biol. Environ. Stat., 6:118–135, 2001.

[23] N. Cressie and L. B. Collins. Patterns in spatial point locations: Local indica-

tors of spatial association in a minefield with clutter. Naval Research Logistics,

48:333–347, 2001.

[24] N. A. C. Cressie. Statistics for Spatial Data. John Wiley and Sons, Inc., New

York, 1991.

[25] F. H. C. Crick and P. A. Lawrence. Compartments and polyclones in insect
development. Science, 189:340–347, 1975.

[26] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point
Processes. Springer-Verlag, New York, 1988.


[27] A. Dasgupta and A. E. Raftery. Detecting features in spatial point processes

with clutter via model-based clustering. Journal of the American Statistical

Association, 93:294–302, 1998.

[28] P. Diehl. Geography and war: A review and assessment of the empirical

literature, edited by M. Ward. New Geopolitics, Gordon and Breach, 1992.

[29] P. J. Diggle. On parameter estimation and goodness-of-fit testing for spatial

point patterns. Biometrics, 35:87–101, 1979.

[30] P. J. Diggle. Statistical Analysis of Spatial Point Patterns. Academic Press,

London, 1983.

[31] P. J. Diggle. Displaced amacrine cells in the retina of a rabbit: analysis of a

bivariate spatial point pattern. J. Neurosci. Meth., 18:115–125, 1986.

[32] P. J. Diggle. A point process modelling approach to raised incidence of a
rare phenomenon in the vicinity of a prespecified point. J. R. Statist. Soc. A,
153(3):349–362, 1990.

[33] R. Doll. The epidemiology of childhood leukaemia. J. R. Statist. Soc. A.,

152:341–351, 1989.

[34] J. Durbin. Distribution Theory for Tests Based on the Sample Distribution

Function. Society for Industrial and Applied Mathematics, Philadelphia, 1973.

[35] M. Dwass. Modified randomization tests for nonparametric hypotheses. Ann.

Math. Statist., 28:181–187, 1957.

[36] M. Ehrmann and URL R. L. Bell. Desiderata.

http://www.geocities.com/lswote/desiderata.html, 1927.

[37] B. S. Everitt. Cluster Analysis. Edward Arnold, London, 1993.

[38] L. Fisher and J. W. van Ness. Admissible clustering procedures. Biometrika,

58:91–104, 1971.

[39] E. Fix and J. L. Hodges. Discriminatory analysis. Non-parametric
discrimination: consistency properties. Report, Project no. 21-29-004, No. 4, USAF
School of Aviation Medicine, Randolph Field, TX, 1951.


[40] K. Florek, J. Lukaszewicz, J. Perkal, H. Steinhaus, and S. Zubrzycki. Sur la

liaison et la division des points d’un ensemble fini. Colloq. Math., 2:282–285,

in French, 1951.

[41] E. B. Fowlkes. A Folio of Distributions. Marcel Dekker, Inc., New York and

Basel, 1987.

[42] E. B. Fowlkes and C. L. Mallows. A method for comparing two hierarchical

clusterings. J. Amer. Statist. Assoc., 78:553–569, 1983.

[43] D. J. Gerrard. Competition quotient: A new measure of the competition affect-

ing individual forest trees. Research bulletin, Vol 20, Agricultural Experiment

Station, Michigan State University, 1969.

[44] A. Getis and K. Ord. The analysis of spatial association by use of distance

statistics. Geographical Analysis, 24:189–206, 1992.

[45] A. D. Gordon. Classification. Chapman and Hall, London, 1981.

[46] P. R. Halmos. Measure Theory. Van Nostrand Reinhold Company, New York,

1969.

[47] M. S. Handcock and M. Morris. Relative Distribution Methods in the Social

Sciences. Springer-Verlag, New York, 1999.

[48] M. S. Handcock and M. Morris. The software on relative distribution methods

on social sciences. http://csde.washington.edu/~handcock/RelDist/ ,

1999.

[49] J. A. Hartigan. Clustering Algorithms. John Wiley and Sons Ltd, New York,

1975.

[50] A. C. A. Hope. A simplified Monte Carlo significance test procedure. J. R.

Statist. Soc. B, 30:582–598, 1968.

[51] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics.

Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.

[52] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice
Hall, Inc., Englewood Cliffs, 1988.

[53] N. Jardine and R. Sibson. Mathematical Taxonomy. John Wiley and Sons

Ltd, London, 1971.


[54] K.-H. Jöckel. Finite sample properties and asymptotic efficiency of Monte
Carlo tests. Annals of Statistics, 14:336–347, 1986.

[55] N. L. Johnson and S. Kotz. Distributions in Statistics: continuous univariate

distributions-1. Houghton Mifflin Company, Boston, 1970.

[56] R. A. Johnson and G. K. Bhattacharyya. Statistics Principles and Methods.

John Wiley and Sons, Inc., New York, 1996.

[57] S. P. Kaluzny, S. C. Vega, T. P. Cardoso, and A. A. Shelly. S+SPATIALSTATS

User’s Manual, 1.0 edition, February 1996.

[58] L. Kaufman and P. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley and Sons Inc., New York, 1990.

[59] J. E. Kelsall and P. J. Diggle. Kernel estimation of relative risk. Bernoulli,

1:3–16, 1995.

[60] J. F. C. Kingman. Poisson Processes. Oxford University Press, Oxford, 1993.

[61] L. J. Kinlen. Evidence for an infective cause for childhood leukaemia: a

Scottish new town compared to nuclear reprocessing sites. Lancet, 1988.

[62] E. L. Lehmann. Elements of large-sample theory. Springer-Verlag New York,

New York, 1999.

[63] J. A. Ludwig and J. F. Reynolds. Statistical Ecology: a primer on methods

and computing. John Wiley and Sons, New York, 1988.

[64] A. J. B. Luiz. Determinação da distribuição espacial de pontos usando a
distância ao vizinho mais próximo: Aplicação em populações vegetais. Master's
thesis, Universidade de Brasília, Brazil, in Portuguese, 1995.

[65] M. N. M. van Lieshout. Stochastic Geometry Models in Image Analysis and

Spatial Statistics. PhD thesis, Free University of Amsterdam, 1994.

[66] M. N. M. van Lieshout and A. J. Baddeley. A nonparametric measure of

spatial interaction in point patterns. Statistica Neerlandica, 50:344–361, 1996.

[67] M. N. M. van Lieshout and A. J. Baddeley. Indices of dependence between

types in multivariate point patterns. Scandinavian Journal of Statistics,

26:511–532, 1999.


[68] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic

Press Inc., London, 1979.

[69] B. Matérn. Spatial variation. Medd. Statens Skogsforskningsinstitut, 49, 5,
Forest Research Institute of Sweden, 1960.

[70] B. Matérn. Doubly stochastic Poisson processes in the plane. In Statistical
Ecology: Spatial Patterns and Statistical Distributions, based on the Proceedings
of the International Symposium on Statistical Ecology, volume 1, pages
195–213, University Park and London, 1971. The Pennsylvania State University
Press.

[71] B. Matérn. Spatial Variation: Lecture Notes in Statistics. Springer-Verlag, Berlin,
1986.

[72] G. J. McLachlan and K. E. Basford. Mixture models: Inference and Applica-

tions to Clustering. M. Dekker, New York, 1988.

[73] M. L. Meirelles. Personal communication, 2003.

[74] M. L. Meirelles and A. J. B. Luiz. Padrões espaciais de árvores de um cerrado
em Brasília, DF. Revta Brasil. Bot., São Paulo, 18:185–189, in Portuguese,
1995.

[75] M. L. Meirelles and A. J. B. Luiz. Personal communication, 2000.

[76] M. Morisita. Measuring of the dispersion of individuals and analysis of the

distributional patterns. Memoirs of the faculty of science, E 2, Kyushu Univ.

Ser., Kyushu University, 215–235, 1959.

[77] URL New York Botanical Garden. Ochnaceae, Gomphia acuminata

DC(isotype). http://image.nybg.org/herbim/2080/v-208-00428910big.jpg,

1811.

[78] J. Neyman and E. L. Scott. Statistical approach to problems of cosmology. J.

R. Statist. Soc. B, 20:1–43, 1958.

[79] J. Neyman and E. L. Scott. Processes of clustering and applications. In
Stochastic Point Processes, P. A. W. Lewis, 1972.

[80] M. Numata. Forest vegetation in the vicinity of Choshi. Coastal flora and veg-

etation at Choshi, Chiba Prefecture IV. Bulletin Choshi Marine Laboratory,

Chiba University, in Japanese, 3:28–48, 1961.


[81] J. Ohser. On estimators for the reduced moment measure of point processes.
Math. Operationsforsch. Statist. Ser. Statist., in German, 14:63–71, 1983.

[82] A. K. Penttinen and D. Stoyan. Statistical analysis for a class of line segment

processes. Scand. J. Statist., 16:153–161, 1989.

[83] Sandra M. C. Pereira. Um estudo descritivo sobre os critérios de determinação
do número de agrupamentos. Master's thesis, Universidade de Brasília, Brazil,
in Portuguese, 1993.

[84] Sandra M. C. Pereira. Analysis of spatial point processes based on the
outputs of clustering algorithms. In Proceedings of the 23rd European Meeting of
Statisticians, Revista Estatística, Statistic Review, volume II, pages 309–310,
Instituto Nacional de Estatística, Portugal, 2001.

[85] W. J. Platt, G. W. Evans, and S. L. Rathbun. The population dynamics of

a long-lived Conifer (Pinus palustris). The American Naturalist, 131:491–525,

1988.

[86] W. M. Rand. Objective criteria for the evaluation of clustering methods. J.

Amer. Stat. Assoc., 66:846–850, 1971.

[87] S. L. Rathbun and N. Cressie. A space-time survival point process for a

longleaf pine forest in southern Georgia. Journal of the American Statistical

Association, 89:1164–1173, 1994.

[88] B. D. Ripley. Modelling spatial patterns (with discussion). Journal of the

Royal Statistical Society, Series B, 39:172–212, 1977.

[89] B. D. Ripley. Tests of randomness for spatial point patterns. Journal of the

Royal Statistical Society, Series B, 41:368–374, 1979.

[90] B. D. Ripley. Spatial Statistics. John Wiley and Sons, Inc., New York, 1981.

[91] B. D. Ripley. Statistical Inference for Spatial Processes. Cambridge University

Press, New York, 1988.

[92] G. G. Roussas. A First Course in Mathematical Statistics. Addison-Wesley

Publishing Company, Reading, 1973.

[93] M. Schlather. On the second-order characteristics of marked point processes.

Bernoulli, 7 (1):99–117, 2001.


[94] B. T. Scott. Summary Functions in the Analysis of Spatial Point Patterns.

PhD thesis, University of Western Australia, 2001.

[95] I. J. Smalley. Contraction crack networks in basalt flows. Geological Magazine,

103 (2):110–114, 1966.

[96] G. W. Snedecor and W. G. Cochran. Statistical Methods. Iowa State University

Press, Ames, 1980.

[97] R. R. Sokal and C. D. Michener. A statistical method for evaluating systematic

relationships. Univ. Kansas Sci. Bull., 38:1409–1438, 1958.

[98] T. Sørensen. A method of establishing groups of equal amplitude in plant

sociology based on similarity of species content. K. danske Vidensk. Selsk.

Skr. (biol), 5:1–34, 1948.

[99] M. A. Stephens. Tests based on edf statistics. In R. B. D’Agostino and M. A.

Stephens, Goodness-of-Fit Techniques, 1986.

[100] D. Stoyan. Correlations of the marks of marked point processes – statistical

inference and simple models. J. Inf. Process. Cybern., 20:285–294, 1984.

[101] D. Stoyan. On correlations of marked point processes. Math. Nachr., 116:197–

207, 1984.

[102] D. Stoyan, W. S. Kendall, and J. Mecke. Stochastic Geometry and Its Appli-

cations. John Wiley & Sons, Chichester, 1987.

[103] D. Stoyan and A. Penttinen. Recent applications of point process methods in

forestry statistics. Statistical Science, 15(1):61–78, 2000.

[104] D. Stoyan and H. Stoyan. Fractals, Random Shapes and Point Fields. John

Wiley & Sons, Chichester, 1994.

[105] D. J. Strauss. A model for clustering. Biometrika, 62:467–475, 1975.

[106] M. J. Symons, R. C. Grimson, and Y. C. Yuan. Clustering of rare events.

Biometrics, 39(1):193–205, 1983.

[107] E. Thönnes and M. N. M. van Lieshout. A comparative study on the power of

van Lieshout and Baddeley’s J-function. Research report 334, Department of

Statistics, University of Warwick, 1999.


[108] H. Thorisson. Coupling, Stationarity, and Regeneration. Springer-Verlag New

York, Inc., New York, 2000.

[109] G. J. G. Upton and B. Fingleton. Spatial Data Analysis by Example, Volume

1: Point Pattern and Quantitative Data. John Wiley and Sons, Inc., New

York, 1985.

[110] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-Plus.

Springer–Verlag New York, Inc., New York, 1999.

[111] R. Wakeford. Childhood leukaemia and nuclear installations. J. R. Statist.

Soc. A, 152:61–86, 1989.

[112] M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, London,

UK, 1995.

[113] H. Wässle, B. B. Boycott, and R.-B. Illing. Morphology and mosaic of on- and

off-beta cells in the cat retina and some functional considerations. Proc. Roy.

Soc. London Ser. B, 212:177–195, 1981.

[114] M. B. Wilk and R. Gnanadesikan. Probability plotting methods for the anal-

ysis of data. Biometrika, 55:1–17, 1968.

[115] D. W. Woodland. Contemporary Plant Systematics. Prentice Hall, Englewood

Cliffs, 1991.


APPENDIX A

Results of the new strategy based on the Average Linkage

and Complete Linkage algorithms

A.1 Exploratory data analysis

Figure A.1 shows the dendrograms produced by the Average Linkage and Complete Linkage algorithms for the standard point patterns pines, cells, and redwoods. The datasets are presented in Section 2.1.
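The fusion distances are taken here to be the heights at which clusters merge in a dendrogram. As an illustrative sketch only (not the implementation used for the results in this appendix; the function name `fusion_distances` is hypothetical), a naive agglomerative procedure in Python returns these heights for the Average Linkage and Complete Linkage criteria:

```python
import math
from itertools import combinations


def fusion_distances(points, linkage="average"):
    """Return the n - 1 fusion distances of a naive agglomerative clustering.

    linkage="average": mean pairwise distance between two clusters.
    linkage="complete": largest pairwise distance between two clusters.
    """
    if linkage not in ("average", "complete"):
        raise ValueError("linkage must be 'average' or 'complete'")
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a, b in combinations(range(len(clusters)), 2):
            dists = [math.dist(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]]
            d = sum(dists) / len(dists) if linkage == "average" else max(dists)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        heights.append(d)                      # fusion distance of this merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]                        # a < b, so index a stays valid
    return heights


pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(fusion_distances(pts, "average"))   # [1.0, 4.5]
print(fusion_distances(pts, "complete"))  # [1.0, 5.0]
```

The naive search is only practical for small patterns; it is meant to show how the two criteria lead to different fusion distances for the same points.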

Relative distribution plots. Figures A.2 and A.3 show the relative probability density functions, with pointwise 95% confidence intervals, of the fusion distance functions H(t) for the pines, cells, and redwoods, plotted against the mean H(t) of 1000 realisations of a binomial point process on the unit square. Figure A.2 is based on the Average Linkage and Figure A.3 on the Complete Linkage.

A.2 Inference

A.2.1 Envelopes for P-P plots, Q-Q plots and A-A plots

Figures A.4, A.5, and A.6 show, respectively, the P-P plots, Q-Q plots, and A-A plots for the pines, cells, and redwoods, with pointwise simulation envelopes at the 5% significance level. The results are based on the Average Linkage and Complete Linkage algorithms, and on 999 realisations of a binomial point process on the unit square with the same intensities as the observed datasets.
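The construction of a pointwise simulation envelope can be sketched as follows. This is a minimal illustration, assuming that each simulated curve is an empirical distribution function of fusion distances evaluated on a common grid; the helper names `edf` and `pointwise_envelope` are hypothetical. With m = 999 simulations and a 5% level, the 25th smallest and 25th largest simulated values are taken at each grid point.

```python
def edf(values, grid):
    """Empirical distribution function of `values` evaluated on `grid`."""
    n = len(values)
    return [sum(v <= t for v in values) / n for t in grid]


def pointwise_envelope(sim_curves, alpha=0.05):
    """Lower and upper pointwise envelopes from m simulated curves.

    At each grid point the k-th smallest and k-th largest simulated values
    are used, with k = alpha * (m + 1) / 2 (k = 25 for m = 999, alpha = 0.05).
    """
    m = len(sim_curves)
    k = round(alpha * (m + 1) / 2)
    if k < 1:
        raise ValueError("too few simulations for this alpha")
    lower, upper = [], []
    for column in zip(*sim_curves):      # one column per grid point
        ordered = sorted(column)
        lower.append(ordered[k - 1])     # k-th smallest
        upper.append(ordered[m - k])     # k-th largest
    return lower, upper
```

An observed curve that leaves the interval between `lower` and `upper` at a given grid point is significant at that point only; the envelope is not a simultaneous test over the whole range.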

A.2.2 Bands for P-P plots, Q-Q plots and A-A plots

Figures A.7, A.8, and A.9 show, respectively, the P-P plots, Q-Q plots, and A-A plots for the pines, cells, and redwoods, with simultaneous critical bands at the 5% significance level. The results are based on the Average Linkage and Complete Linkage algorithms, and on 999 realisations of a binomial point process on the unit square with the same intensities as the observed datasets.

A.3 Random labelling hypothesis

The results of the extension based on the fusion distance function (Section 6.1), applied to the bivariate point pattern Cat Retinal Ganglia with the Average Linkage and Complete Linkage algorithms, are presented next. Figures A.10 and A.11 show the P-P plots, Figures A.12 and A.13 the Q-Q plots, and Figures A.14 and A.15 the A-A plots, each with envelopes and bands at the 5% significance level. In each pair, the first figure is based on the Average Linkage and the second on the Complete Linkage.
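Under the random labelling hypothesis the point locations are held fixed and only the type labels are permuted. The following sketch (hypothetical names; `statistic` stands for any summary computed from the points of one type, such as a fusion distance function) generates the reference values from 999 random permutations of the labels:

```python
import random


def random_labelling_curves(points, labels, target, statistic,
                            n_perm=999, seed=0):
    """Evaluate `statistic` on the points labelled `target`, for the observed
    labels and for `n_perm` random permutations of the labels."""
    rng = random.Random(seed)

    def subset(current_labels):
        return [p for p, lab in zip(points, current_labels) if lab == target]

    observed = statistic(subset(labels))
    simulated = []
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # locations fixed, labels permuted
        simulated.append(statistic(subset(shuffled)))
    return observed, simulated
```

Because the labels are permuted rather than redrawn, the number of points of each type is the same in every simulated pattern as in the observed one.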

Figure A.1: Dendrograms of the clustering algorithms applied to the spatial datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete Linkage.

Figure A.2: Relative probability density function (y-axis) of the fusion distances H(t) plotted against H(t) (x-axis). The probability density function plots with pointwise 95% confidence intervals of the datasets: pines (left), cells (centre), and redwoods (right); Average Linkage; 1000 realisations under H0.

Figure A.3: Relative probability density function (y-axis) of the fusion distances H(t) plotted against H(t) (x-axis). The probability density function plots with pointwise 95% confidence intervals of the datasets: pines (left), cells (centre), and redwoods (right); Complete Linkage; 1000 realisations under H0.


Figure A.4: Simulation envelopes at 5% significance level for P-P plots applied to the datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: P-P plots; dashed lines: envelopes; dotted lines: identity line, 999 realisations under H0.

A.4 Histograms

Figure A.16 shows the histograms of the fusion distances from the Average Linkage algorithm applied to the point patterns Longleaf pines (Section 6.2) and Brazilian trees (Section 8.1). Judging from these histograms, it seems appropriate to fit a gamma distribution to approximate the distribution of the spatial Rg index (Section 6.3).
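As a hedged illustration of how a gamma distribution might be fitted to a sample such as a set of fusion distances, the sketch below uses the method of moments. This choice of estimator is an assumption made for the example, not necessarily the fitting procedure of Section 6.3, and the function names are hypothetical.

```python
import math


def gamma_moment_fit(data):
    """Method-of-moments estimates (shape, scale) of a gamma distribution.

    Uses mean = shape * scale and variance = shape * scale**2.
    """
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)   # sample variance
    return mean ** 2 / var, var / mean


def gamma_pdf(x, shape, scale):
    """Gamma probability density at x > 0."""
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))
```

The fitted density can then be overlaid on the histogram to judge visually whether the gamma shape is adequate.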

Figure A.5: Simulation envelopes at 5% significance level for Q-Q plots applied to the datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: Q-Q plots; dashed lines: envelopes; dotted lines: identity line, 999 realisations under H0.



Figure A.6: Simulation envelopes at 5% significance level for A-A plots applied to the datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: A-A plots; dashed lines: envelopes; dotted lines: identity line, 999 realisations under H0.


Figure A.7: Bands at 5% significance level for P-P plots applied to the datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: P-P plots; dashed lines: bands; dotted lines: identity line, 999 realisations under H0.


Figure A.8: Bands at 5% significance level for Q-Q plots of fusion distance function H(t) versus H(t) applied to the datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: Q-Q plots; dashed lines: bands; dotted lines: identity line, 999 realisations under H0.

Figure A.9: Bands at 5% significance level for A-A plots of $\arcsin\sqrt{1 - H(t)}$ versus $\arcsin\sqrt{1 - H(t)}$ applied to the datasets: (a),(b) pines; (c),(d) cells; (e),(f) redwoods. Left: (a),(c),(e): Average Linkage; right: (b),(d),(f): Complete Linkage. Solid lines: A-A plots; dashed lines: bands; dotted lines: identity line, 999 realisations under H0.


Figure A.10: Average Linkage P-P plots for the Cat Retinal Ganglia against the random labelling hypothesis. First row: on cells (type 1), second row: off cells (type 2). Left: pointwise envelopes; right: critical bands. Solid lines: P-P plot, dashed lines: envelopes and bands, dotted lines: identity line; 5% significance level, 999 random permutations of the type labels.

Figure A.11: Complete Linkage P-P plots for the Cat Retinal Ganglia against the random labelling hypothesis. First row: on cells (type 1), second row: off cells (type 2). Left: simulation envelopes; right: critical bands. Solid lines: P-P plot, dashed lines: envelopes and bands, dotted lines: identity line; 5% significance level; 999 random permutations of the type labels.



Figure A.12: Average Linkage Q-Q plots for the Cat Retinal Ganglia against the random labelling hypothesis. First row: on cells (type 1), second row: off cells (type 2). Left: simulation envelopes; right: critical bands. Solid lines: Q-Q plot, dashed lines: envelopes and bands, dotted lines: identity line; 5% significance level; 999 random permutations of the type labels.



Figure A.13: Complete Linkage Q-Q plots for the Cat Retinal Ganglia against the random labelling hypothesis. First row: on cells (type 1), second row: off cells (type 2). Left: simulation envelopes; right: critical bands. Solid lines: Q-Q plot, dashed lines: envelopes and bands, dotted lines: identity line; 5% significance level; 999 random permutations of the type labels.



Figure A.14: Average Linkage A-A plots for the Cat Retinal Ganglia against the random labelling hypothesis. First row: on cells (type 1), second row: off cells (type 2). Left: simulation envelopes; right: critical bands. Solid lines: A-A plot, dashed lines: envelopes and bands, dotted lines: identity line; 5% significance level; 999 random permutations of the type labels.


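Figure A.15: Complete Linkage A-A plots for the Cat Retinal Ganglia against the random labelling hypothesis. First row: on cells (type 1), second row: off cells (type 2). Left: simulation envelopes; right: critical bands. Solid lines: A-A plot, dashed lines: envelopes and bands, dotted lines: identity line; 5% significance level; 999 random permutations of the type labels.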




Figure A.16: Histograms of the fusion distances from the Average Linkage dendrograms applied to the point patterns. Left: Longleaf pines (Section 6.2); right: Brazilian trees (Section 8.1). Axes: fusion distances h_k (x-axis), frequency (y-axis).

APPENDIX B

Power of the test: fusion distance function

B.1 Cluster alternative

Estimated power. Tables B.1 (a) – (e) present the estimated powers of Monte Carlo tests of CSR against the Matérn cluster model, with the parameters described in Section 5.3. The tests use the supremum distance and are based on the fusion distance function.
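A Monte Carlo test based on a supremum distance, and the estimation of its power, can be sketched as below. This is an illustration under stated assumptions, not the exact procedure of Section 5.3: each curve (for example a fusion distance function evaluated on a grid up to the upper limit t1) is compared with the pointwise mean of the remaining curves, and the null hypothesis is rejected when the observed distance ranks among the largest. All function names are hypothetical.

```python
def sup_distance(curve, reference):
    """Supremum (maximum absolute) distance between two curves on one grid."""
    return max(abs(c - r) for c, r in zip(curve, reference))


def monte_carlo_test(observed, simulated, alpha=0.05):
    """Return True if the null hypothesis is rejected at level alpha.

    With m simulated curves, the test rejects when the observed distance is
    among the alpha * (m + 1) largest of the m + 1 distances (the 5 largest
    of 100 for m = 99, alpha = 0.05). Ties count against rejection.
    """
    curves = [observed] + list(simulated)
    total = [sum(col) for col in zip(*curves)]
    m1 = len(curves)
    dists = []
    for curve in curves:
        mean_others = [(t - c) / (m1 - 1) for t, c in zip(total, curve)]
        dists.append(sup_distance(curve, mean_others))
    rank = sum(d >= dists[0] for d in dists)   # 1 means strictly largest
    return rank <= alpha * m1


def estimated_power(alternative_curves, null_curve_sets, alpha=0.05):
    """Fraction of rejections over paired alternative and null simulations."""
    rejections = [monte_carlo_test(obs, sims, alpha)
                  for obs, sims in zip(alternative_curves, null_curve_sets)]
    return sum(rejections) / len(rejections)
```

In the setting of the tables, each test would use 99 curves simulated under CSR, and the power would be the rejection fraction over 1000 curves simulated under the alternative.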

Power explanation: Q-Q plots. Figures B.1 – B.5 show the quantiles of 100 realisations of the fusion distance function from the Poisson process with λ = 100, plotted against the quantiles of 100 realisations of the fusion distance function from the Matérn cluster process with λp = 5, λc = 20, r = 0.005. The upper limit t1 ranges over [0; 0.22] in increments of 0.005. Note that the quantiles of the fusion distance functions from the two models (Poisson and Matérn cluster) are equal for t1 = 0.12. (See Figure B.3.) However, for t1 > 0.13, the two fusion distance functions differ. (See Figures B.3 – B.5.)
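For readers who wish to reproduce patterns of this kind, the following is a minimal sketch of simulating a Matérn cluster process on the unit square, assuming the usual construction: Poisson parents with intensity λp, each with a Poisson(λc) number of offspring placed uniformly in a disc of radius r. Parents are generated on a window enlarged by r so that clusters centred just outside the square still contribute; this edge treatment and the function names are assumptions of the example.

```python
import math
import random


def poisson(rng, mean):
    """Poisson variate by Knuth's multiplication method (moderate means only)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def matern_cluster(lam_p, lam_c, r, seed=None):
    """Simulate a Matern cluster process and return its points in [0, 1]^2."""
    rng = random.Random(seed)
    side = 1 + 2 * r                                  # enlarged parent window
    n_parents = poisson(rng, lam_p * side * side)
    points = []
    for _ in range(n_parents):
        px = rng.uniform(-r, 1 + r)
        py = rng.uniform(-r, 1 + r)
        for _ in range(poisson(rng, lam_c)):
            rad = r * math.sqrt(rng.random())         # uniform in the disc
            ang = 2 * math.pi * rng.random()
            x, y = px + rad * math.cos(ang), py + rad * math.sin(ang)
            if 0 <= x <= 1 and 0 <= y <= 1:
                points.append((x, y))
    return points


pattern = matern_cluster(lam_p=5, lam_c=20, r=0.005, seed=1)
```

With λp = 5 and λc = 20 the expected number of points is 100, matching the intensity of the Poisson process used for comparison.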

B.2 Inhibition alternative

Estimated power. Table B.2 presents the estimated powers of the test of CSR against the Matérn model II, with the parameters described previously. The tests use the supremum distance and are based on the fusion distance function.

Power explanation: Q-Q plots. Figures B.6 – B.10 show the quantiles of 100 realisations of the fusion distance function from the Poisson process with λ = 100, plotted against the quantiles of 100 realisations of the fusion distance function from the Matérn model II with λ0 = 200, r = 0.005. The upper limit t1 ranges over [0; 0.2] in increments of 0.005. Observe that the Q-Q plots for the two models (Poisson and Matérn model II) lie very close to the identity line, demonstrating that realisations of the Matérn model II with the specified parameters are very similar to those of the homogeneous Poisson process. Therefore, the power of the test for these parameter values is very small or zero.
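The Matérn model II can be simulated by dependent thinning of a Poisson process. The sketch below assumes the standard construction: each primary point receives an independent uniform mark, and a point is retained only if no other primary point within distance r carries a smaller mark. Edge effects are ignored, and the function names are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Poisson variate by Knuth's multiplication method (moderate means only)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def matern_ii(lam0, r, seed=None):
    """Simulate Matern model II on the unit square by dependent thinning."""
    rng = random.Random(seed)
    n = poisson(rng, lam0)
    # Each primary point carries coordinates and an independent uniform mark.
    primary = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
    kept = []
    for x, y, mark in primary:
        has_older_neighbour = any(
            other < mark and math.dist((x, y), (ox, oy)) <= r
            for ox, oy, other in primary
        )
        if not has_older_neighbour:
            kept.append((x, y))
    return kept


pattern = matern_ii(lam0=200, r=0.005, seed=2)
```

When r is very small, few primary points have a neighbour within distance r, so the thinning removes almost nothing and the result is close to a Poisson pattern.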


Table B.1 (a): λp = 5 parents

t1 \ r  0.005   0.01  0.025   0.05  0.075    0.1  0.125   0.15  0.175    0.2
0.00     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01     1.00   1.00   1.00   1.00   0.99   0.95   0.82   0.63   0.45   0.35
0.02     1.00   1.00   1.00   1.00   1.00   1.00   0.99   0.93   0.83   0.69
0.03     1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.94   0.86
0.04     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.97   0.90
0.05     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.97   0.92
0.06     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.93
0.07     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.93
0.08     1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.99   0.97   0.91
0.09     1.00   1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.93   0.86
0.10     1.00   1.00   1.00   1.00   0.99   0.98   0.97   0.92   0.86   0.72
0.11     0.84   0.80   0.78   0.72   0.72   0.73   0.72   0.68   0.61   0.50
0.12     0.06   0.07   0.05   0.07   0.10   0.16   0.22   0.23   0.24   0.24
0.13     0.03   0.04   0.03   0.02   0.03   0.04   0.05   0.06   0.09   0.11
0.14     0.03   0.04   0.03   0.04   0.04   0.06   0.08   0.10   0.12   0.15
0.15     0.07   0.09   0.09   0.14   0.16   0.18   0.18   0.20   0.22   0.25
0.16     0.45   0.45   0.42   0.39   0.39   0.37   0.33   0.34   0.35   0.37
0.17     0.83   0.80   0.76   0.69   0.63   0.58   0.51   0.52   0.51   0.52
0.18     0.91   0.90   0.85   0.80   0.74   0.70   0.61   0.63   0.59   0.62
0.19     0.98   0.98   0.97   0.93   0.88   0.86   0.78   0.81   0.73   0.75
0.20     0.98   0.98   0.97   0.93   0.89   0.87   0.79   0.81   0.74   0.78
0.21     0.98   0.98   0.97   0.92   0.89   0.86   0.81   0.80   0.76   0.81
0.22     0.98   0.99   0.97   0.94   0.91   0.88   0.86   0.85   0.84   0.86

Table B.1 (b): λp = 10 parents

t1 \ r  0.005   0.01  0.025   0.05  0.075    0.1  0.125   0.15  0.175    0.2
0.00     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01     1.00   1.00   1.00   1.00   0.93   0.74   0.48   0.32   0.19   0.14
0.02     1.00   1.00   1.00   1.00   1.00   0.97   0.84   0.66   0.47   0.36
0.03     1.00   1.00   1.00   1.00   1.00   0.99   0.94   0.85   0.65   0.50
0.04     1.00   1.00   1.00   1.00   1.00   1.00   0.97   0.89   0.76   0.62
0.05     1.00   1.00   1.00   1.00   1.00   1.00   0.98   0.93   0.78   0.65
0.06     1.00   1.00   1.00   1.00   1.00   0.99   0.99   0.92   0.79   0.68
0.07     1.00   1.00   1.00   1.00   1.00   0.99   0.98   0.91   0.77   0.65
0.08     1.00   1.00   1.00   1.00   1.00   0.99   0.96   0.85   0.69   0.59
0.09     1.00   1.00   1.00   1.00   0.99   0.97   0.89   0.76   0.59   0.50
0.10     0.84   0.78   0.82   0.81   0.85   0.81   0.70   0.57   0.41   0.38
0.11     0.02   0.03   0.04   0.11   0.28   0.35   0.38   0.32   0.23   0.27
0.12     0.00   0.00   0.00   0.00   0.02   0.07   0.10   0.14   0.13   0.15
0.13     0.02   0.02   0.03   0.03   0.04   0.05   0.06   0.08   0.11   0.14
0.14     0.34   0.35   0.28   0.22   0.16   0.14   0.13   0.15   0.18   0.21
0.15     0.80   0.79   0.70   0.56   0.40   0.31   0.27   0.26   0.29   0.34
0.16     0.96   0.95   0.89   0.78   0.60   0.49   0.42   0.40   0.42   0.43
0.17     0.99   0.98   0.96   0.89   0.75   0.65   0.59   0.53   0.54   0.57
0.18     0.99   0.99   0.97   0.92   0.81   0.70   0.65   0.61   0.60   0.65
0.19     1.00   1.00   0.99   0.96   0.90   0.83   0.79   0.73   0.73   0.78
0.20     1.00   1.00   0.98   0.96   0.87   0.81   0.79   0.75   0.77   0.82
0.21     1.00   0.99   0.98   0.94   0.86   0.81   0.80   0.79   0.81   0.86
0.22     0.99   0.99   0.98   0.95   0.88   0.85   0.86   0.86   0.88   0.92

Table B.1 (c): λp = 20 parents

t1 \ r  0.005   0.01  0.025   0.05  0.075    0.1  0.125   0.15  0.175    0.2
0.00     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01     1.00   1.00   1.00   0.95   0.63   0.35   0.22   0.11   0.09   0.07
0.02     1.00   1.00   1.00   1.00   0.96   0.75   0.51   0.29   0.20   0.17
0.03     1.00   1.00   1.00   1.00   0.99   0.88   0.68   0.49   0.33   0.25
0.04     1.00   1.00   1.00   1.00   1.00   0.94   0.77   0.57   0.42   0.33
0.05     1.00   1.00   1.00   1.00   1.00   0.96   0.81   0.65   0.46   0.37
0.06     1.00   1.00   1.00   1.00   0.99   0.95   0.80   0.64   0.49   0.39
0.07     1.00   1.00   1.00   1.00   0.99   0.94   0.75   0.59   0.45   0.36
0.08     1.00   1.00   1.00   1.00   0.97   0.88   0.65   0.53   0.42   0.32
0.09     0.64   0.69   0.76   0.87   0.85   0.72   0.52   0.39   0.32   0.28
0.10     0.01   0.02   0.10   0.31   0.46   0.42   0.31   0.26   0.22   0.21
0.11     0.01   0.01   0.01   0.02   0.10   0.14   0.16   0.14   0.14   0.14
0.12     0.13   0.11   0.08   0.03   0.04   0.04   0.08   0.11   0.11   0.12
0.13     0.58   0.50   0.37   0.19   0.12   0.08   0.11   0.11   0.12   0.14
0.14     0.90   0.85   0.69   0.45   0.29   0.17   0.17   0.18   0.19   0.21
0.15     0.98   0.95   0.86   0.68   0.47   0.33   0.28   0.31   0.29   0.29
0.16     0.99   0.98   0.93   0.79   0.59   0.45   0.42   0.43   0.40   0.42
0.17     1.00   1.00   0.96   0.86   0.68   0.56   0.51   0.55   0.52   0.55
0.18     0.99   0.99   0.94   0.86   0.71   0.62   0.61   0.63   0.63   0.67
0.19     1.00   0.99   0.97   0.91   0.79   0.74   0.72   0.72   0.76   0.76
0.20     1.00   0.98   0.94   0.89   0.79   0.74   0.76   0.79   0.82   0.84
0.21     0.99   0.97   0.92   0.86   0.79   0.77   0.81   0.84   0.86   0.91
0.22     0.97   0.95   0.92   0.88   0.84   0.88   0.90   0.91   0.93   0.95

Table B.1 (d): λp = 25 parents

t1 \ r  0.005   0.01  0.025   0.05  0.075    0.1  0.125   0.15  0.175    0.2
0.00     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01     1.00   1.00   1.00   0.91   0.55   0.29   0.16   0.11   0.09   0.07
0.02     1.00   1.00   1.00   1.00   0.91   0.63   0.37   0.24   0.17   0.13
0.03     1.00   1.00   1.00   1.00   0.98   0.83   0.56   0.39   0.29   0.19
0.04     1.00   1.00   1.00   1.00   0.99   0.89   0.66   0.46   0.35   0.25
0.05     1.00   1.00   1.00   1.00   0.99   0.91   0.68   0.52   0.40   0.31
0.06     1.00   1.00   1.00   1.00   0.98   0.90   0.67   0.52   0.42   0.30
0.07     1.00   1.00   1.00   1.00   0.98   0.87   0.61   0.48   0.40   0.31
0.08     0.97   0.98   0.98   0.99   0.92   0.75   0.54   0.40   0.35   0.28
0.09     0.32   0.37   0.53   0.75   0.73   0.58   0.41   0.33   0.28   0.24
0.10     0.00   0.01   0.04   0.20   0.34   0.34   0.25   0.22   0.21   0.17
0.11     0.03   0.02   0.01   0.01   0.08   0.13   0.13   0.14   0.16   0.14
0.12     0.25   0.20   0.14   0.05   0.04   0.06   0.10   0.11   0.14   0.13
0.13     0.67   0.62   0.43   0.20   0.12   0.09   0.11   0.12   0.15   0.13
0.14     0.91   0.87   0.71   0.43   0.28   0.18   0.17   0.19   0.21   0.20
0.15     0.96   0.96   0.84   0.62   0.44   0.30   0.28   0.29   0.31   0.32
0.16     0.98   0.97   0.90   0.73   0.55   0.39   0.40   0.42   0.43   0.42
0.17     0.99   0.98   0.93   0.81   0.66   0.53   0.51   0.53   0.55   0.55
0.18     0.98   0.98   0.92   0.81   0.68   0.60   0.61   0.66   0.65   0.67
0.19     0.98   0.99   0.95   0.85   0.75   0.70   0.73   0.77   0.77   0.80
0.20     0.96   0.97   0.91   0.80   0.76   0.72   0.78   0.83   0.84   0.87
0.21     0.94   0.95   0.89   0.78   0.79   0.77   0.84   0.90   0.90   0.92
0.22     0.93   0.93   0.89   0.84   0.87   0.87   0.91   0.95   0.96   0.96


Table B.1 (e): λp = 50 parents

t1 \ r  0.005   0.01  0.025   0.05  0.075    0.1  0.125   0.15  0.175    0.2
0.00     1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
0.01     1.00   1.00   1.00   0.60   0.25   0.11   0.07   0.06   0.04   0.05
0.02     1.00   1.00   1.00   0.93   0.53   0.26   0.17   0.14   0.08   0.08
0.03     1.00   1.00   1.00   0.99   0.74   0.40   0.23   0.15   0.12   0.10
0.04     1.00   1.00   1.00   0.99   0.82   0.50   0.33   0.20   0.15   0.13
0.05     1.00   1.00   1.00   0.99   0.82   0.55   0.35   0.22   0.20   0.14
0.06     1.00   1.00   1.00   0.98   0.80   0.51   0.34   0.24   0.19   0.17
0.07     0.84   0.88   0.91   0.91   0.71   0.48   0.32   0.23   0.19   0.17
0.08     0.27   0.31   0.51   0.67   0.57   0.38   0.25   0.19   0.17   0.15
0.09     0.03   0.03   0.08   0.26   0.36   0.26   0.19   0.14   0.15   0.14
0.10     0.06   0.04   0.03   0.05   0.15   0.13   0.12   0.12   0.12   0.12
0.11     0.20   0.17   0.09   0.04   0.07   0.08   0.09   0.08   0.09   0.10
0.12     0.48   0.42   0.27   0.11   0.06   0.08   0.08   0.08   0.09   0.11
0.13     0.71   0.65   0.47   0.24   0.12   0.11   0.10   0.10   0.10   0.13
0.14     0.81   0.78   0.62   0.37   0.20   0.17   0.16   0.18   0.18   0.21
0.15     0.85   0.82   0.69   0.48   0.31   0.26   0.26   0.28   0.29   0.32
0.16     0.85   0.82   0.71   0.53   0.40   0.37   0.39   0.41   0.43   0.46
0.17     0.86   0.83   0.72   0.61   0.51   0.51   0.54   0.55   0.56   0.61
0.18     0.81   0.78   0.68   0.65   0.61   0.61   0.67   0.68   0.69   0.71
0.19     0.81   0.79   0.74   0.72   0.72   0.72   0.78   0.82   0.82   0.83
0.20     0.77   0.74   0.74   0.76   0.78   0.82   0.86   0.88   0.89   0.90
0.21     0.78   0.76   0.78   0.81   0.86   0.88   0.91   0.95   0.94   0.94
0.22     0.84   0.86   0.87   0.92   0.94   0.94   0.96   0.98   0.98   0.98

Table B.1: Power of Monte Carlo tests of CSR against Matérn cluster process with parameters λp, λc, r; where λp, r are varying as shown; λc is adjusted to keep intensity of the process constant at 100. Test uses 99 realisations of CSR. Power estimated from 1000 realisations under Matérn cluster processes; supremum distance, Single Linkage.


Figure B.1: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson and 100 realisations of fusion distance functions for Matérn cluster processes. The upper limit t1 ∈ [0; 0.04] by an increment of 0.005. Solid line: Q-Q plot, dotted line: identity line, Single Linkage. Axes: quantiles of H(t) of Poisson (x-axis), quantiles of H(t) of Cluster (y-axis).


Figure B.2: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson and 100 realisations of fusion distance functions for Matérn cluster processes. The upper limit t1 ∈ [0.045; 0.085] by an increment of 0.005. Solid line: Q-Q plot, dotted line: identity line, Single Linkage.


Figure B.3: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson and 100 realisations of fusion distance functions for Matérn cluster processes, with the upper limit t1 increasing by an increment of 0.005. Solid line: Q-Q plot, dotted line: identity line, Single Linkage.






Figure B.4: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson and 100 realisations of fusion distance functions for Matérn cluster processes, with the upper limit t1 increasing by an increment of 0.005. Solid line: Q-Q plot, dotted line: identity line, Single Linkage.






Figure B.5: Typical Q-Q plots of 100 realisations of fusion distance functions for Poisson and 100 realisations of fusion distance functions for Matérn cluster processes. The upper limit t1 ∈ [0.18; 0.22] by an increment of 0.005. Solid line: Q-Q plot, dotted line: identity line, Single Linkage.


Figure B.6: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson and of 100 realisations of fusion distance functions for Matérn model II processes. The upper limit t1 ∈ [0; 0.04] by an increment of 0.005. Solid lines: Q-Q plot, dotted lines: identity line, Single Linkage. Axes: quantiles of H(t) of Poisson (x-axis), quantiles of H(t) of Matérn II (y-axis).


Figure B.7: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson process and 100 realisations of fusion distance functions for Matérn model II processes. The upper limit t1 ∈ [0.045; 0.085] by an increment of 0.005. Solid lines: Q-Q plot, dotted lines: identity line, Single Linkage.


Figure B.8: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson process and 100 realisations of fusion distance functions for Matérn model II processes, with the upper limit t1 increasing by an increment of 0.005. Solid lines: Q-Q plot, dotted lines: identity line, Single Linkage.


Figure B.9: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson process and 100 realisations of fusion distance functions for Matérn model II processes, with the upper limit t1 increasing by an increment of 0.005. Solid lines: Q-Q plot, dotted line: identity line, Single Linkage.


Figure B.10: Typical Q-Q plots of 100 realisations of fusion distance functions for a homogeneous Poisson process and 100 realisations of fusion distance functions for Matérn model II processes. The upper limit t1 ∈ [0.18; 0.2] by an increment of 0.005. Solid lines: Q-Q plot, dotted lines: identity line, Single Linkage.


Inhibition alternative: Matern model II

t1     0.005  0.01 0.015  0.02 0.025  0.03 0.035  0.04 0.045  0.05
0.00    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
0.01    0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
0.02    0.04  0.03  0.26  1.00  1.00  1.00  1.00  1.00  1.00  1.00
0.03    0.05  0.04  0.03  0.21  0.69  1.00  1.00  1.00  1.00  1.00
0.04    0.07  0.07  0.05  0.03  0.06  0.42  0.99  1.00  1.00  1.00
0.05    0.16  0.12  0.13  0.10  0.05  0.04  0.27  0.93  1.00  1.00
0.06    0.18  0.17  0.21  0.24  0.19  0.20  0.02  0.22  0.76  1.00
0.07    0.20  0.19  0.40  0.50  0.38  0.32  0.14  0.03  0.19  0.83
0.08    0.09  0.23  0.49  0.60  0.55  0.64  0.41  0.18  0.03  0.21
0.09    0.09  0.23  0.49  0.64  0.73  0.75  0.73  0.56  0.16  0.04
0.10    0.10  0.23  0.50  0.55  0.76  0.80  0.82  0.72  0.43  0.14
0.11    0.19  0.22  0.47  0.52  0.71  0.79  0.83  0.77  0.54  0.39
0.12    0.08  0.20  0.44  0.46  0.65  0.72  0.78  0.72  0.50  0.42
0.13    0.08  0.13  0.34  0.34  0.52  0.55  0.62  0.49  0.46  0.47
0.14    0.23  0.40  0.50  0.50  0.65  0.70  0.77  0.66  0.63  0.69
0.15    0.41  0.59  0.71  0.68  0.80  0.82  0.87  0.79  0.77  0.85
0.16    0.57  0.74  0.84  0.78  0.90  0.91  0.93  0.86  0.83  0.92
0.17    0.80  0.84  0.89  0.83  0.94  0.95  0.97  0.87  0.87  0.96
0.18    0.88  0.91  0.94  0.87  0.96  0.97  0.98  0.99  0.89  0.98
0.19    0.94  0.94  0.96  0.98  0.98  0.99  0.99  0.99  0.89  0.99
0.20    0.97  0.97  0.98  0.99  0.99  0.99  1.00  0.99  0.90  0.99

Table B.2: Power of Monte Carlo tests of CSR against Matern model II processes

with parameters λ, r; where λ is chosen to achieve an intensity of 100. Test uses

99 realisations of CSR. Power estimated from 1000 realisations under Matern model

II; supremum distance, Single Linkage.
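The procedure summarised in the caption of Table B.2 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the thesis: H(t) is treated as the empirical distribution function of the single-linkage fusion distances (computed here as minimum spanning tree edge lengths), the supremum distance is taken over a grid on [0, t1] against the leave-one-out mean of the other curves, CSR is simulated conditionally on the observed number of points in the unit square, and edge effects are ignored. All function names are hypothetical.

```python
import math
import random
from bisect import bisect_right

def matern2_lambda(target, r):
    # Parent Poisson intensity such that the thinned (Matern II) intensity
    # (1 - exp(-lam*pi*r^2)) / (pi*r^2) equals `target`; needs target*pi*r^2 < 1.
    a = math.pi * r * r
    return -math.log(1.0 - target * a) / a

def poisson(rng, mean):
    # Count unit-rate exponential arrivals that occur before time `mean`.
    n, t = 0, rng.expovariate(1.0)
    while t < mean:
        n, t = n + 1, t + rng.expovariate(1.0)
    return n

def csr(rng, n):
    # n independent uniform points in the unit square.
    return [(rng.random(), rng.random()) for _ in range(n)]

def matern2(rng, lam, r):
    # Keep a point only if no other parent point within r has a smaller mark.
    pts = csr(rng, poisson(rng, lam))
    marks = [rng.random() for _ in pts]
    return [p for i, p in enumerate(pts)
            if all(marks[i] < marks[j] or math.dist(p, q) > r
                   for j, q in enumerate(pts) if j != i)]

def fusion_distances(pts):
    # Single-linkage fusion distances via Prim's minimum spanning tree.
    n = len(pts)
    best, done, out = [math.inf] * n, [False] * n, []
    if n:
        best[0] = 0.0
    for _ in range(n):
        i = min((k for k in range(n) if not done[k]), key=best.__getitem__)
        done[i] = True
        out.append(best[i])
        for k in range(n):
            if not done[k]:
                best[k] = min(best[k], math.dist(pts[i], pts[k]))
    return sorted(out[1:])  # drop the zero recorded for the starting point

def mc_test(pattern, t1, nsim=99, rng=random, ngrid=40):
    # Monte Carlo p-value: rank of the observed supremum distance among
    # the distances of nsim CSR simulations (assumed form of the test).
    grid = [t1 * k / ngrid for k in range(ngrid + 1)]
    patterns = [pattern] + [csr(rng, len(pattern)) for _ in range(nsim)]
    curves = []
    for p in patterns:
        fd = fusion_distances(p)
        curves.append([bisect_right(fd, t) / len(fd) for t in grid])
    total = [sum(col) for col in zip(*curves)]
    stats = [max(abs(c[k] - (total[k] - c[k]) / nsim) for k in range(len(grid)))
             for c in curves]
    return sum(s >= stats[0] for s in stats) / (nsim + 1)

rng = random.Random(1)
r = 0.03
pattern = matern2(rng, matern2_lambda(100, r), r)
print("Monte Carlo p-value:", mc_test(pattern, t1=0.05, rng=rng))
```

Repeating the last three lines over many Matern model II realisations and recording how often the p-value falls at or below the chosen significance level would give a power estimate of the kind tabulated above, although the exact statistic and edge treatment used in the thesis may differ.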

APPENDIX C

Complementary information on the Brazilian trees dataset

Height and diameter at breast height. Tables C.1 and C.2 show the observed values of the height and the diameter at breast height (dbh) of the Brazilian trees dataset.

Complete botanical classification. Table C.3 presents the complete botanical classification of the Brazilian trees dataset (Chapter 8) into genus, species, family, order, subclass and class, extracted by the author from [15, 115].


height (m)   0.8   0.9     1   1.1   1.2   1.3   1.4   1.5   1.6   1.7
frequency      1     1     5     7     5    12    19    47    46    35

height (m)   1.8   1.9     2   2.1   2.2   2.3   2.4   2.5   2.6   2.7
frequency     54    53   113    59    52    26    18    75    21     9

height (m)   2.8   2.9     3   3.1   3.2   3.3   3.4   3.5   3.6   3.7
frequency      6     7    52    12     5     5     9    28     7     4

height (m)   3.8   3.9     4   4.1   4.2   4.3   4.4   4.5   4.6   4.7
frequency      6     9    52    10     7     6     4    28     4     1

height (m)   4.8   4.9     5   5.1   5.3   5.4   5.5   5.6   5.9     6
frequency      2     2    44     1     2     2    18     7     1    24

height (m)   6.1   6.2   6.3   6.4   6.5   6.6     7   7.2   7.3   7.4
frequency      4     3     3     3    18     3    20     1     1     2

height (m)   7.5   7.6     8   8.5   8.6   8.7   8.9     9   9.5    10
frequency      9     1    18     2     1     1     1     3     1     3

height (m)    11
frequency      1

Table C.1: Frequency of the height of the Brazilian trees.

dbh (m)        1     2   2.5     3   3.5     4   4.5   4.7     5   5.5
frequency      1     1     1     1     3     4     1     1    17    21

dbh (m)        6   6.5     7   7.1   7.5   7.7     8   8.5     9   9.5
frequency     55    44   193     1    56     1   149    44   110    29

dbh (m)       10  10.5  10.8    11  11.2  11.5    12  12.5    13  13.5
frequency     74     9     1    45     1     9    51     9    26     6

dbh (m)       14  14.5    15  15.5    16  16.5    17  17.5    18  18.5
frequency     24     7    25     4    11     4    17     2     6     1

dbh (m)       19  19.5    20  20.5    21  21.5    22    23  23.5    24
frequency     10     2     8     1     5     2     4     4     1     7

dbh (m)       25  25.5    26  26.5    28    29    32  32.5    33
frequency      1     1     4     1     1     2     1     1     1

Table C.2: Frequency of the dbh of the Brazilian trees.


Number Genus Species Family Order Subclass Class

1 Aspidosperma macrocarpon Apocynaceae Gentianales Asteridae Magnoliopsida

2 Aspidosperma tomentosum Apocynaceae Gentianales Asteridae Magnoliopsida

3 Bombax gracilipes Bombacaceae Malvales Dilleniidae Magnoliopsida

4 Bombax tomentosum Bombacaceae Malvales Dilleniidae Magnoliopsida

5 Bowdichia virgiloides Fabaceae Fabales Rosidae Magnoliopsida

6 NA NA Myrtaceae Myrtales Rosidae Magnoliopsida

7 Byrsonima coccolabifalia Malpighiaceae Polygalales Rosidae Magnoliopsida

8 Byrsonima crassa Malpighiaceae Polygalales Rosidae Magnoliopsida

9 Byrsonima NA Malpighiaceae Polygalales Rosidae Magnoliopsida

10 Caryocar brasiliense Caryocaraceae Theales Dilleniidae Magnoliopsida

11 Connarus fulvus Connaraceae Rosales Rosidae Magnoliopsida

12 Copaifera langsdorfii Fabaceae Fabales Rosidae Magnoliopsida

13 Dalbergia vidacea Fabaceae Fabales Rosidae Magnoliopsida

14 Davilla elliptica Dilleniaceae Dilleniales Dilleniidae Magnoliopsida

15 Didymopanax macrocarpum Araliaceae Apiales Rosidae Magnoliopsida

16 Siagrus NA NA NA NA NA

17 NA Ind.453 NA NA NA NA

18 Enterolobium ellipticum Fabaceae Fabales Rosidae Magnoliopsida

19 Eremanthus NA Asteraceae Asterales Asteridae Magnoliopsida

20 Erythroxylum suberosum Erythroxylaceae Linales Rosidae Magnoliopsida

21 Erythroxylum tortuosum Erythroxylaceae Linales Rosidae Magnoliopsida


23 Butia NA Arecaceae Arecales Arecidae Liliopsida

24 Hymenaea stillocarpa Fabaceae Fabales Rosidae Magnoliopsida

25 Kielmeyera coriaceae NA NA NA NA

26 Lafoensia pacari Lythraceae Myrtales Rosidae Magnoliopsida

27 Palmeira NA Arecaceae Arecales Arecidae Liliopsida

28 Miconia ferruginata Melastomataceae Myrtales Rosidae Magnoliopsida

29 Miconia NA Melastomataceae Myrtales Rosidae Magnoliopsida

30 Mimosa claussenii Mimosaceae Fabales Rosidae Magnoliopsida

31 Myrica NA Myricaceae Myricales Hamamelidae Magnoliopsida

32 Ouratea acuminata Ochnaceae Theales Dilleniidae Magnoliopsida

33 Palicourea rigida Rubiaceae Rubiales Asteridae Magnoliopsida


35 Piptocarpha rotundifolia Asteraceae Asterales Asteridae Magnoliopsida

36 Vochysia rufa Vochysiaceae Polygalales Rosidae Magnoliopsida

37 Plenckia populosea Celastraceae Celastrales Rosidae Magnoliopsida

38 Pouteria ramiflora Sapotaceae Ebenales Dilleniidae Magnoliopsida

39 Vochysia thyrsoidea Vochysiaceae Polygalales Rosidae Magnoliopsida

40 Pteredon pubescens Fabaceae Fabales Rosidae Magnoliopsida

41 Qualea grandiflora Vochysiaceae Polygalales Rosidae Magnoliopsida

42 Qualea multiflora Vochysiaceae Polygalales Rosidae Magnoliopsida

43 Qualea parviflora Vochysiaceae Polygalales Rosidae Magnoliopsida

44 Rapanea guyanensis Myrsinaceae Primulales Dilleniidae Magnoliopsida

45 Roupala montana Proteaceae Proteales Rosidae Magnoliopsida

46 Salacia crassifolia Hippocrateaceae Celastrales Rosidae Magnoliopsida

47 Sclerolobium aureum Fabaceae Fabales Rosidae Magnoliopsida

48 Stryphnodendron NA Fabaceae Fabales Rosidae Magnoliopsida

49 Styrax ferrugineus Styracaceae Ebenales Dilleniidae Magnoliopsida

50 Sweetia dasycarpa Fabaceae Fabales Rosidae Magnoliopsida

51 Symplocos revoluta Symplocaceae Ebenales Dilleniidae Magnoliopsida


53 Strychnos NA Loganiaceae Gentianales Asteridae Magnoliopsida

54 Vellozia NA Velloziaceae Liliales Liliidae Liliopsida

55 Vochysia elliptica Vochysiaceae Polygalales Rosidae Magnoliopsida

56 Plathimenia reticulata Fabaceae Fabales Rosidae Magnoliopsida

Table C.3: Complete plant systematics of the Brazilian trees into genus, species,

family, order, subclass, and class. (“NA”: unknown). Source: [15, 115].
