
Research Collection

Doctoral Thesis

A geometric framework for visual grouping

Author(s): Turina, Andreas

Publication Date: 2003

Permanent Link: https://doi.org/10.3929/ethz-a-004488586

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use.

ETH Library

DISS. ETH NO. 14919

A Geometric Framework for Visual Grouping

A dissertation submitted to the

SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH

for the degree of Doctor of Technical Sciences

presented by

ANDREAS TURINA

Dipl. El.-Ing. ETH

born 4th of June, 1971

citizen of Fällanden, Switzerland

accepted on the recommendation of

Prof. Dr. Luc Van Gool, examiner
Prof. Dr. Bernt Schiele, co-examiner

2002

To My Parents

Abstract

This dissertation deals with a geometric framework for the efficient detection of regular repetitions of planar (but not necessarily coplanar) patterns. Such pattern repetitions are ubiquitous: tilings of a floor, repetitions of windows on a building facade, mirror-symmetries etc. Basically, two aspects are of importance: there is a repeating pattern, and the repetition is carried out in a regular manner.

The desire for an automatic detection of such groupings is an old challenge in Computer Vision, and an immense number of contributions exists, most of them addressing the grouping of low-level features, like edges and contours, assuming pseudo-orthographic projection models. Geometric grouping contributions that deal with full perspective skew are comparatively new.

Most of these earlier approaches are characterized by their extensive use of combinatorial techniques, which renders the grouping process fairly inefficient. In addition, they focus on one particular grouping type only, restricted to a narrow range of features, often specified by the user beforehand.

The grouping system proposed in this dissertation avoids the shortcomings of earlier contributions. The novelty of our approach is that it is efficient, banning extensive combinatorics from all stages. Furthermore, our approach is more general in that all groupings related by planar homologies are detected. These include periodicities, mirror-symmetries and point-symmetries that have traditionally been dealt with separately. The approach can handle perspective distortions. It avoids getting trapped in combinatorics through invariant-based hashing for pattern matching and through Hough transforms for the detection of fixed structures.

At the heart of our system lie the fixed structures of the transformations that describe these regular configurations. Fixed structures are geometric entities, like points and lines, that remain fixed under both the original symmetry operation in the scene and the transformation that relates repeating patterns in the image. The knowledge of fixed structures drastically reduces the complexity (degrees of freedom) of the problem, and therefore the main effort is their efficient extraction.

A first step detects small, repeating planar patches near points of interest in the image using affinely invariant neighbourhoods. The way they are extracted makes them immune to affine geometric transformations and linear photometric changes. Invariant neighbourhoods are characterized by a feature vector of moment invariants that describe the underlying intensity profile, again in an invariant way. Pattern repetitions then translate to clusters in this feature space, and similar patterns can be found efficiently using invariant-based indexing.

In a second step, clusters of similar invariant neighbourhoods are analyzed for their regularity using a cascaded version of the Hough transform. The end products are candidates for fixed structures, found in a non-combinatorial way. A single point / neighbourhood match then suffices to lift the remaining degree of freedom in order to set up a grouping (i.e. planar homology) hypothesis. Finally, hypotheses are validated for correctness by a correlation-based procedure that delineates the symmetric parts in the image. The system has been applied to a wealth of regular images to demonstrate its performance.

Kurzfassung

This dissertation addresses the efficient detection of regularly repeating, planar (but not necessarily coplanar) patterns in images. Regular repetitions of this kind are nearly ubiquitous: consider, for example, a tiled floor, the regular arrangement of windows on a building facade, mirror-symmetries etc. Essentially, two observations matter: there is a repeating pattern, and the repetition follows strict rules.

The desire to find such groupings in images automatically reaches far back in computer vision, and a large number of contributions have emerged over time. Most of them deal with the grouping of image primitives, e.g. edge points and contours, under the assumption of pseudo-orthographic projection. Geometric approaches that also handle perspective distortions are comparatively new.

A significant drawback of most earlier grouping approaches is their heavy use of combinatorial methods, which has a very adverse effect on efficiency. In addition, these systems are tailored to the detection of a single grouping type and rely on only a few specific features, which moreover often have to be specified by the user.

The system proposed in this dissertation remedies many deficits of earlier approaches. Extensive combinatorial methods are strictly avoided at all stages. A further novelty is that our grouping approach is more general, as it is based on planar homologies. Periodicities, mirror-symmetries and point-symmetries are thus detected in one stroke, while taking perspective distortions into account. This is achieved by means of invariance-based hashing methods and Hough techniques.

Our system is based on the concept of so-called fixed structures. These are points and lines that remain fixed under both the original symmetry operation in space and its mapping in the image. Once these structures are known, the complexity (number of degrees of freedom) is reduced considerably. The goal is therefore to find these fixed structures in an efficient way.

In a first step, repetitions of small, planar segments are sought. Affinely invariant neighbourhoods are employed for this purpose. The way such neighbourhoods are extracted makes them immune to affine geometric distortions as well as linear photometric changes. Each such neighbourhood is characterized by a feature vector composed of moment invariants. This vector in turn describes the intensity profile of affinely invariant neighbourhoods in an invariant way. The detection of repeating image segments thus reduces to the identification of clusters in this feature space. Indexing techniques further increase efficiency.

In a second step, the detected repetitions of such image segments are checked for their regularity. This is achieved with a special version of the Hough transform, which delivers candidates for fixed structures as its end product, again in a non-combinatorial way. A single point correspondence then suffices to set up a grouping hypothesis. A correlation-based procedure verifies the hypothesis for correctness and thereby segments the grouping in the image. The performance of the system is demonstrated on a large variety of images.

Acknowledgement

First, I would like to thank my supervisor, Prof. Dr. Luc Van Gool, for both his guidance and his valuable support during the entire duration of my dissertation. Apart from his brilliant professional skills, I highly appreciated his willingness to provide me with everything that I needed for the daily work. I also appreciated his offers to travel to remote locations for meetings and conferences around the globe; a necessity for establishing contacts with the Computer Vision community. I also thank Prof. Dr. Bernt Schiele for his role as co-referee and his advice on various technical problems.

Special thanks go to Dr. Tinne Tuytelaars, whose aid substantially shaped this dissertation. With her as a designated tutor from the very beginning, I really enjoyed the privilege of a close collaboration with a skilled and experienced researcher who followed my work with great interest. Our fruitful exchange of ideas, problems, solutions, software and data on an almost daily basis was of inestimable value.

I am grateful to all members of the Computer Vision Laboratory at ETH ("BIWI") who supported me during the work on my PhD thesis. Furthermore, I am especially thankful to our system manager Manuel Oetiker, whose technical help and skilled management of a complex computing infrastructure provided the foundation so essential for work in Computer Vision.

I especially want to express my thanks to my parents, Marko and Helga Turina, who made my studies at ETH Zurich possible and who gave me their unconditional support in all phases of my life.

Andreas Turina

Contents

Abstract
Kurzfassung
Contents
List of Figures
List of Tables

1 Introduction
1.1 Rationale
1.2 Possible Applications
1.3 Main Contributions
1.4 Strategy and System Overview
1.4.1 Regular Repetitions
1.4.2 The Danger of Resorting to Combinatorics
1.4.3 Efficient Detection of Repetitions
1.4.4 Efficient Detection of Regularities
1.5 Outline of the Thesis

2 Tour d'horizon: From the Early Days to State of the Art
2.1 Gestalt Laws
2.2 Grouping Based on Gestalt Laws
2.3 Grouping Based on Geometry
2.3.1 The Affine Case
2.3.2 The Perspective Case
2.4 Analysis
2.4.1 Generality
2.4.2 Features
2.4.3 Efficiency
2.5 Summary and Conclusions

3 Fixed Structures - Key to Efficiency
3.1 Plane Projective Transformations
3.1.1 Coarse Structure
3.2 Fixed Structures and Subgroups
3.2.1 Fixed Structures
3.2.2 Subgroups Defined by Fixed Structures
3.3 Fixed Structures for Grouping
3.3.1 Conjugate Symmetry
3.4 Planar Homologies
3.5 Elations
3.6 Summary and Conclusions

4 Basic Technologies I: Affinely Invariant Neighbourhoods
4.1 Motivation
4.2 Affinely Invariant Neighbourhoods
4.2.1 Geometry-based Neighbourhoods
4.2.2 Intensity-based Neighbourhood Extraction
4.3 Neighbourhood Description
4.4 Conclusion

5 Basic Technologies II: The Cascaded Hough Transform
5.1 The Hough Transform Revisited
5.2 The Cascaded Hough Transform
5.2.1 The CHT-point
5.2.2 Homogeneous Representation of CHT-points
5.3 CHT Arithmetics
5.3.1 Image Frame → CHT Frame
5.3.2 CHT Frame → Image Frame
5.4 Applying the CHT
5.4.1 Hough Transform
5.4.2 Peak Extraction
5.4.3 Peak Validation
5.5 Example
5.6 Discussion
5.6.1 Accuracy vs. Resolution
5.6.2 Computational Complexity
5.6.3 Peak Extraction
5.6.4 Alternative Parameterization
5.7 Summary and Conclusion

6 Detection of Repetitions
6.1 Introduction
6.2 Invariant Description
6.2.1 Generic Affinely Invariant Feature Vectors
6.2.2 Normalized Feature Vectors
6.3 Neighbourhood Comparison
6.3.1 Feature Vector Comparison
6.3.2 Correlation-based Comparison of Affinely Invariant Neighbourhoods
6.3.3 Other Comparison Methods
6.4 Matching / Clustering
6.5 Example
6.6 Discussion
6.7 Summary and Conclusions

7 Detection of Regularities
7.1 Introduction
7.2 Finding Fixed Structures
7.2.1 Candidate Pencils of Fixed Lines
7.2.2 Candidate Lines of Fixed Points
7.2.3 Example
7.3 Finding the Groupings
7.4 Hypotheses Validation
7.5 Discussion
7.5.1 Advantages of the CHT
7.5.2 Parameters
7.5.3 Computation Times
7.5.4 CHT vs. Gaussian Sphere
7.6 Summary and Conclusions

8 Experimental Results
8.1 Introduction
8.2 General Planar Homologies
8.3 Elations
8.4 Conclusion

9 Conclusion
9.1 Summary
9.2 Discussion and Outlook
9.2.1 Improvements
9.2.2 Future Work

A Linear Discriminant Analysis
A.1 Principle
A.2 Covariance Matrix Based on Tracking Experiments

B Image Database Overview

Bibliography

List of Figures

1.1 A regular repetition of floor tiles, distorted by perspective skew.
3.1 Classificatory structure of subgroups for fixed points and lines.
3.2 Distortion of a mirror-symmetry
3.3 Planar homology examples
3.4 Visualization of group action
4.1 Effects of perspective skew
4.2 Neighbourhood example
4.3 Harris corner points
4.4 Local intensity extrema.
4.5 Neighbourhood construction for curved edges
4.6 Neighbourhood construction for straight edges
4.7 Neighbourhood construction for homogeneous regions
4.8 Example of homogeneous neighbourhoods
4.9 Intensity-based neighbourhood construction.
4.10 Intensity-based neighbourhood example
5.1 CHT subspaces
5.2 Different point representations
5.3 Effect of smoothing
5.4 CHT buffer example
5.5 Buffer sampling
5.6 CHT example: input
5.7 CHT example: buffers
5.8 CHT example: collinear structures
5.9 CHT example: second Hough
5.10 CHT example: pencils of fixed lines
6.1 Original image and neighbourhoods
6.2 Feature space
6.3 Clusters in the image and feature space
7.1 Pencil of fixed lines example
7.2 Fixed structures example
7.3 Lines of fixed points example
7.4 Effect of a global warp
7.5 Validation result
8.1 Butterfly example I
8.2 Butterfly example II
8.3 Carpet example I
8.4 Carpet example II
8.5 Books example I
8.6 Book example II
8.7 Beer-box example I
8.8 Beer-box example II
8.9 Building facade example I
8.10 Building facade example II
8.11 Visualization of the symmetry density.
8.12 Router example
A.1 Initial cluster configuration.
A.2 Transformed dataset after rotation and scaling.
A.3 Situation after the second transform.
B.1 Example images the system was applied to.
B.2 Example images the system was applied to (ctd.)

List of Tables

2.1 Classificatory structure
3.1 Hierarchy of subgroups
6.1 Moment invariants used for comparing the patterns within an invariant neighbourhood.
6.1 Moment invariants used for comparing the patterns within an invariant neighbourhood (ctd.).
6.2 Moment invariants used for comparing the patterns within a parallelogram-shaped invariant neighbourhood after normalization of the neighbourhood to a reference square.
6.3 Moment invariants used for comparing the underlying intensity and color information within an elliptic invariant neighbourhood after normalization to a reference circular neighbourhood.
7.1 Strategy for extracting fixed structure candidates working on both large and small clusters of affinely invariant neighbourhoods. Structures used as input are printed in a sans-serif font, and their corresponding outputs are printed in boldface. The numbers in the rightmost column indicate the CHT level numbers.
7.2 Computation times for finding the pencil of fixed lines candidates on a 440 MHz SUN Ultra 10.
A.1 Inter- (left column) and intra- (right column) cluster distances obtained using a global covariance matrix estimate (top row) and the covariance matrix based on tracking experiments (bottom row).

1 Introduction

Our visual system, especially the visual cortex, processes all visual information reliably in a very short time, and we simply take this capability for granted. We immediately recognize an infinite variety of different objects and the surrounding environment, and we are able to perform this task almost irrespective of their pose, location and illumination conditions (except for total darkness). Interestingly, it seems that we also have an inherent ability to perceive symmetries. They somehow automatically attract our attention, without any special concentration on our part. In addition, this performance is generated continuously; we just have to keep our eyes open.

As a consequence, it is not surprising that we do not realize the underlying complexity of this process. Once we try to transfer the same skills to a machine, we become fully aware that this is a problem of extraordinary complexity. Although a lot of research has been invested in machine vision for several decades, a generic solution is not in sight.

In computer vision, the detection of symmetries is closely related to the problem of object recognition. Given an object and its appropriate representation stored in a database, the task is to recognize this object again in images, irrespective of pose, location, illumination conditions and distance to the camera. Similarly, repetitions of patterns normally also suffer from these distortions when viewed obliquely. If one wants to design a vision system for the detection of symmetries, capable of operating in a general-purpose domain, solutions for these difficulties have to be worked out.

The human visual system, on the other hand, seems to handle these complexities easily, such that we immediately perceive symmetries as outstanding structures. For us, it is not so much the specific nature of symmetric objects or symmetrically arranged patterns that is of importance; it is rather the regularity (the laws of repetition) that makes symmetries salient. Saliency in this context means our inborn capability to perceive the symmetric layout of the single parts as a self-contained entity. In short, we perform grouping without even being aware of the active nature of this process.

Consequently, grouping is an important step in vision that combines segments of visual information within an image into higher-order, perceptually salient structures that are more amenable to semantic interpretation. As such, it is an important stepping stone between low-level vision and scene understanding, leading towards a deeper understanding of observed shapes, structure and scene organization.

1.1 Rationale

Grouping is a longstanding problem in computer vision. In the literature, rather intuitive concepts like 'goodness' and 'non-accidentalness' have been used to compile catalogues of grouping types. These are very useful, as they list special configurations that a good grouping approach should be able to find. However, starting from perceptual impressions rarely hints at effective ways to do the underlying computations.

The situation is different if we consider groupings as similar planar (not necessarily coplanar) patterns in special relative positions, i.e. patterns that appear repeatedly in the image. Under these assumptions, and in combination with a simple pinhole camera model, geometric relations can be derived. Such a quantitative description eases the more systematic detection of groupings in images, as opposed to the rather ad-hoc grouping rules mentioned above. And indeed, patterns that appear repeatedly in a regular manner are ubiquitous; we encounter them in our daily life in brick walls, floor tilings etc.

Such regularities are salient configurations for humans, but for computer vision systems they are relatively hard to pick up. The difficulty is that, for a computer, a digital image is just a bunch of pixels, an array of numbers between 0 and 255, without any further meaning. However, if there are quantitative relations between repeating patterns in the image, these relations can be formalized as an algorithm, and a computer can start a methodical analysis of this bunch of pixels.

We therefore believe that a geometry-driven approach is an efficient option for detecting such types of regularities. This dissertation focuses on grouping planar, but not necessarily coplanar patterns, with the following goals:

Principled Approach: We propose a more systematic and hierarchical classification of grouping types, albeit from a specifically geometric point of view. Directly tied to the classification is an approach for their detection.

Perspective Effects: Grouping has often been carried out under the assumption of (pseudo-)orthographic projection. This has to do with the fact that many more cues survive the corresponding affine skewing than the projective skewing that amounts to the more realistic, perspective model. Here, the full perspective nature of projection will be taken into account.

Efficiency: Grouping is about combining parts into larger configurations. Hence, there is a risk of combinatorial search. Here, we avoid extensive combinatorics through the combined use of invariance and Hough techniques.

1.2 Possible Applications

Apart from rather abstract applications like scene understanding and scene organization, the knowledge or extraction of groupings in images might be useful in many respects.

Image Descriptor The rapid expansion of computer networks and the dramatically falling costs of data storage are making multimedia databases increasingly common. Digital information in the form of images, music and video is quickly gaining importance for business and entertainment. Consequently, the growth of multimedia databases creates the need for more effective search and access techniques, especially for image data. Knowledge about regular repetitions (symmetries) can be used as an additional, valuable image descriptor for content-based image retrieval (CBIR).

Wide-baseline Stereo Also known as the correspondence problem, the objective can be briefly summarized as follows: given two images of the same object or scene and a feature in one image, where is the corresponding feature (i.e. the projection of the same 3D feature) in the other image? This is presently a very active field of research, and many interesting automatic systems have been developed, assuming uncalibrated cameras. However, the existence of pattern repetitions in one or both images complicates this task due to the combinatorial variety of possible matches. The detection of groupings prior to matching might offer a way to resolve such ambiguities.

3D Reconstruction It has been shown that e.g. bilateral symmetry can be translated to two different views of the same object. With this information, it is already possible to infer estimates of the slant and tilt of the object plane with respect to the image plane. In addition, specific knowledge about regular repetitions also makes it possible to deal with occlusions: if a basic repeating unit, together with the laws of repetition, can be determined, partial occlusions can be removed by exploiting the redundancy that repetitions bring.


1.3 Main Contributions

Before we proceed with a more detailed description of our strategy and the tools involved, it seems useful to summarize the main contributions which have been realized in this work:

- We have developed a unified framework for the detection of regular pattern repetitions that can deal with more than one grouping type. The proposed framework is able to detect groupings under the more general class of planar homologies. These include, for instance, mirror-symmetries and point-symmetries, but also periodicities. Furthermore, we take perspective effects fully into account. This is in contrast to previous systems that focus on one specific grouping type only and / or assume a weaker projection model.

- Efficiency was a principal design goal for the proposed system, and the combined use of invariance and Hough techniques allows us to ban expensive combinatorial techniques from all processing steps. Combinatorics is typical for most earlier systems, and as a consequence their required computational effort is accordingly high.

- Our system processes normal images without any kind of presegmentation. Pattern repetitions and symmetries do not need to be delineated manually beforehand. This is thanks to affinely invariant neighbourhoods that work on a full wealth of features. Other systems tend to use only a very limited number of specific features for the detection of repetitions.

1.4 Strategy and System Overview

We pointed out the importance of efficiency for grouping. This is because most previous grouping systems do not stand up to the computational complexity of the problem at hand. Algorithms presented so far were mainly developed to illustrate the outcome of theoretical considerations. Yet these algorithms lack the computational efficiency needed by an application to work autonomously in a general-purpose domain. It is therefore not surprising that even invariance-based approaches still apply computationally expensive combinatorial techniques to some extent.

This section outlines the basic ideas of how exhaustive combinatorial approaches are banned from the principal stages of the proposed grouping framework.


1.4.1 Regular Repetitions

In principle, the detection of groupings in images can be seen as a rather straightforward task. Assuming no 'a priori' knowledge about the scene and the camera parameters, one has to obtain information about what is repeated and how it is repeated. As simple as this task might appear, several notions must be defined before an automatic grouping application can be designed.

In the context discussed here, the 'what' can appear in various different forms (think of e.g. windows on a building facade or bricks of a wall) and is usually not known in advance. This emphasizes the need for abstraction: the 'what' is a basic unit with multiple repetitions. In contrast to the 'what', more can be said about the 'how'. Regularity implies a formal mathematical law of repetition in the scene, and this law can be quantified in algebraic and geometric terms.
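To make the notion of a law of repetition concrete, here is a small illustration in our own notation (not taken from the thesis): if the repetition is generated by one fixed projective transformation H of the image plane, then, in homogeneous coordinates,

```latex
% Illustration (our notation): a regular repetition generated by one
% fixed projective transformation H acting on homogeneous points x.
\[
  \mathbf{x}_{k} \sim H^{k}\,\mathbf{x}_{0}, \qquad k = 0, 1, 2, \ldots
\]
% A periodicity corresponds to repeated application of one and the
% same H; a mirror-symmetry corresponds to an involution, H^2 \sim I.
```

A periodicity then corresponds to repeated application of one and the same H, while a mirror-symmetry corresponds to an involutive H: applying it twice returns every point to where it started.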

For grouping, the 'whats' and 'hows' are even related. If we had a clear idea about the specific nature of a repeating entity, this would certainly help in determining how this entity repeats throughout the scene. On the other hand, if we knew the underlying 'laws' of a repetition, it would be easier to determine what part of the image is being repeated. From this point of view, grouping can be seen as a classical 'chicken-and-egg' problem.

Note that we deliberately leave open the specific nature of such repeating patterns for the discussion in this chapter (a later chapter is devoted to them). We only require them to be planar. In fact, a pattern itself is not of particular interest, but rather the way it repeats.

In addition, we consider the geometric relations between repeating patterns in the image to be planar homologies, which excludes rotational symmetries. We will explain the inherent properties of planar homologies later in this report. For the time being, it is sufficient to know that this class of projective transformations is capable of capturing (geometrically) a wide variety of repetitions and symmetries, such as the frequently occurring periodicities and mirror-symmetries.

1.4.2 The Danger of Resorting to Combinatorics

The first problem is the detection of one or several basic units whose repetitions comprise the unknown grouping. A basic unit is a small, planar patch. Regardless of the nature of the basic unit under consideration, the most natural way to detect its repeating instances is by pairwise comparisons: a prototype of a basic unit is identified, and repetitions are found by pairwise comparisons among the set of candidates. Only those patches that fulfill certain similarity criteria are promising candidates for repeating instances of the current prototype.


The most commonly used method for measuring the similarity of planar patches is cross-correlation. In the context of intra-image grouping, simple correlation-based methods have indeed been applied in the absence of perspective skew. In such cases, the computation of correlations is not much of a problem, since repeating patches do not differ in shape and size.

[Figure 1.1: A regular repetition of floor tiles, distorted by perspective skew.]

This situation changes, however, when perspective effects are included, and these are almost omnipresent in normal images. Under such circumstances, traditional correlation-based techniques with a fixed window are no longer applicable. In addition, mirror-symmetric patterns cannot be detected that way.

The example shown in Figure 1.1 illustrates the problem: a basic unit (a floor tile) is repeated in a regular manner, and the shape of a tile varies as it repeats throughout the image. Under perspective distortion, the change in shape and size between two arbitrary tiles can be captured by an 8-parameter projective transformation. Such a transformation is necessary to register two planar patches for correlation. As can easily be seen, measuring the similarity of two patches anywhere in the image by just positioning fixed-sized correlation windows at the corresponding locations no longer works. In fact, this process now has to be accompanied by the determination of the transformation parameters, which results in a tremendous growth in computational complexity.
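To make this registration step concrete, here is a minimal sketch (our illustration, not the thesis's implementation), assuming NumPy and OpenCV are available; the function names and the patch parameterization are ours:

```python
import cv2
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized gray patches."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def homography_similarity(gray, H, x, y, w, h):
    """Similarity between the patch at (x, y, w, h) and the image region
    that the 3x3 projective transformation H (8 free parameters, since H
    is only defined up to scale) maps it onto."""
    # Pull the pixels that H maps the patch onto back into the patch's
    # own frame: warped(p) = gray(H p) for every pixel p.
    warped = cv2.warpPerspective(
        gray, H, (gray.shape[1], gray.shape[0]),
        flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    return ncc(gray[y:y + h, x:x + w], warped[y:y + h, x:x + w])
```

The point of the sketch is the cost, not the code: without any knowledge of the fixed structures, the eight parameters of H would have to be searched or estimated for every candidate pair of patches.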

Another strategy for finding similar repeated patterns (as applied by [Leung and Malik 1996]) starts from a point of interest and examines its immediate neighbourhood for similar patterns. Restricting the search space in this way allows the perspective skew therein to be approximated by an affine transformation. As a result, the spatial arrangement of similar patterns is represented as a graph, where two nodes are related by an affinity map. The assumption of affine geometric relations between two 'adjacent' patterns is indeed reasonable and requires fewer parameters to solve for. On the other hand, the affine approximation for adjacency in a topological sense fails under severe perspective distortion, or if the Euclidean distance between two adjacent patterns is so large that the amount of skew goes beyond an affine transformation. This strategy lends itself better to periodicities than to mirror-symmetries.

These two possibilities for finding repeating patterns make the difficulties apparent: exhaustive pairwise comparisons in combination with the determination of transformation parameters. The latter are needed for the geometric registration of two patterns, which is a prerequisite for the application of similarity measures.


1.4.3 Efficient Detection of Repetitions

[System overview diagram: Image → Affinely invariant neighbourhoods]

Fortunately, such brute-force approaches can be avoided. The strategy applied in this thesis starts with an efficient detection of repeating basic units. We propose the use of affinely invariant neighbourhoods to find them. These neighbourhoods are small, local patches that are extracted near points of interest, such as Harris corner points or intensity extrema. The central idea is that such neighbourhoods can be extracted in isolation and in a way that makes their enclosed surface region immune against affine geometric transformations and linear photometric changes.

Affinely invariant neighbourhoods were developed for object recognition and wide-baseline stereo applications, where correspondences between different images of the same scene, taken from different viewpoints, must be established. The apparent changes of sufficiently small parts of a scene when imaged from different viewpoints can be approximated as affine. As affinely invariant neighbourhoods are robust against such changes (they are also robust against changes in illumination, as we will see later), they cover the same part of an object's surface independent of the viewpoint and without reference to other views. This idea is applied in the context of intra-image grouping, where affinely invariant neighbourhoods adapt themselves to the effects of perspective distortion to some extent. As a consequence, they independently cover repeating planar image patches. More information about affinely invariant neighbourhoods is given in Chapter 4.

The fact that the invariant neighbourhoods are 'only' robust against affine transformations seems to contradict the idea of dealing with perspective distortions. Due to their local character, though, the geometric relations between them can be considered affine at the initial stages of grouping.
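As a rough, hedged illustration of how a local patch can be made insensitive to affine deformations, the sketch below uses affine shape adaptation via the second-moment matrix, a common normalization idea from the invariant-features literature; it is meant as an analogy only, not as one of the neighbourhood constructions actually used in Chapter 4 (all names are ours, NumPy assumed):

```python
import numpy as np

def second_moment_matrix(patch):
    """2x2 second-moment (structure) matrix of a gray image patch."""
    gy, gx = np.gradient(patch.astype(np.float64))
    return np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                     [np.sum(gx * gy), np.sum(gy * gy)]])

def affine_normalizer(patch):
    """Whitening transform A = M^(-1/2) of the patch coordinates.
    Affinely deformed copies of the same surface region yield (ideally)
    the same normalized shape, up to a residual rotation."""
    M = second_moment_matrix(patch)
    w, V = np.linalg.eigh(M)  # M is symmetric positive semi-definite
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
```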

Matching

Affinely invariant neighbourhoods must be matched to find similar ones among the entire set that has been extracted. Special care has to be taken at this stage not to fall back on combinatorial techniques such as those described in Section 1.4.2. To maintain efficiency during the matching stage, each affinely invariant region can be associated with a feature vector that consists of affinely invariant moment invariants [Mindru et al. 1999a]. Such moment invariants capture the underlying intensity pattern in a way that makes them again insensitive to both affine geometric distortions and linear photometric changes. Neighbourhood characterization via moment invariants allows the use of hashing and indexing techniques. In particular, such techniques allow for an efficient identification of clusters of similar neighbourhoods with respect to their feature vectors, thus avoiding exhaustive pairwise comparisons.
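To illustrate why indexing beats exhaustive comparison, the following sketch groups feature vectors with a k-d tree instead of explicitly testing all pairs; it assumes NumPy and SciPy and merely stands in for the hashing and clustering machinery of Chapter 6 (function names and the radius parameter are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def feature_clusters(features, radius):
    """Group n feature vectors into clusters of mutual proximity: a k-d
    tree answers all radius queries at once, and union-find merges the
    resulting neighbour pairs, avoiding a quadratic sweep over all pairs."""
    tree = cKDTree(features)
    parent = list(range(len(features)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in tree.query_pairs(radius):
        parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(features)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Example: three tight groups of 5-dimensional invariant vectors.
rng = np.random.default_rng(0)
centres = rng.normal(size=(3, 5))
features = np.vstack([c + 0.01 * rng.normal(size=(10, 5)) for c in centres])
print(sorted(len(c) for c in feature_clusters(features, radius=0.1)))  # [10, 10, 10]
```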

Clusters represent candidates of similar repeating affinely invariant neighbourhoods. In this thesis we propose a partition of the feature space into regions of low and high density, i.e. regions where small and large numbers of feature vectors gather.

[System overview diagram: Affinely invariant neighbourhoods → Matching]

The reason why high and low density clusters are of special interest is the spatial arrangement of their corresponding neighbourhoods in the image. High density clusters denote a large number of similar neighbourhoods, which is typical for e.g. periodicities like the repeating floor tiles in Figure 1.1. Low density clusters indicate a rather small number of repeating neighbourhoods, which occurs in situations like e.g. a mirror-symmetric configuration.

The process of finding repetitions (i.e. feature vector clusters) is explained in more detail in Chapter 6. More important at the moment is the role of the proposed invariant feature clusters with respect to efficiency: they allow small repeating planar patterns to be found without the combinatorial pitfalls so typical of earlier approaches.

1.4.4 Efficient Detection of Regularities

After having identified sets of similar repeating planar patches, i.e. sets of similar affinely invariant neighbourhoods, these have to be analyzed for their spatial configuration.

[System overview diagram: Matching → Cascaded Hough → Hypothesis]

More precisely, we want to know if there is a geometric transformation that explains their spatial arrangement, or if their layout is irregular. A geometric transformation is said to 'explain' a set of regularly repeating patterns if it maps them onto one another, which is in accordance with the mathematical definition of symmetry.

Here we look for planar homologies that relate repeating patches. Planar homologies are projectivities that have a line of fixed points and a pencil of fixed lines as the structures that they keep fixed.
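For concreteness, planar homologies admit a standard parameterization in terms of exactly these fixed structures (our notation, following the general projective geometry literature rather than the thesis): with vertex v (the centre of the pencil of fixed lines), axis a (the line of fixed points) and a characteristic ratio mu, one can write H = I + (mu - 1) v aᵀ / (vᵀ a). A minimal NumPy check:

```python
import numpy as np

def planar_homology(v, a, mu):
    """Standard parameterization H = I + (mu - 1) v a^T / (v . a).
    H fixes every point on the axis a (line of fixed points) and every
    line through the vertex v (pencil of fixed lines); v itself is a
    fixed point with eigenvalue mu."""
    v, a = np.asarray(v, float), np.asarray(a, float)
    return np.eye(3) + (mu - 1.0) * np.outer(v, a) / (v @ a)

v = np.array([1.0, 2.0, 1.0])     # vertex (homogeneous image point)
a = np.array([0.5, -1.0, 3.0])    # axis (homogeneous image line)
H = planar_homology(v, a, mu=2.0)

x_on_axis = np.array([2.0, 4.0, 1.0])          # satisfies a . x = 0
print(np.allclose(H @ x_on_axis, x_on_axis))   # True: axis is pointwise fixed
print(np.allclose(H @ v, 2.0 * v))             # True: vertex fixed (scale mu)

# mu = -1 yields a harmonic homology, the perspectively distorted form
# of a mirror-symmetry; it is an involution.
H_m = planar_homology(v, a, mu=-1.0)
print(np.allclose(H_m @ H_m, np.eye(3)))       # True
```

Given v and a, only mu remains free, which is why a single point match suffices to pin down the transformation, as described next.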

If the fixed structures of the corresponding homology are known in advance, then the degrees of freedom are drastically reduced, and only one point match is needed to fix the transformation. In our framework, we extract the unknown fixed structures by a cascaded application of the Hough transform. How this can be achieved is explained in Chapter 7. Most important is the fact that the extraction of fixed structures is non-combinatorial, thus maintaining efficiency during this important stage.
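The CHT itself is described in Chapter 5. As background, the following minimal sketch (our illustration, NumPy assumed) shows the plain Hough transform idea it builds on: each point votes for every line through it, and peaks in the accumulator reveal well-supported lines without any pairwise combinatorics:

```python
import numpy as np

def hough_lines(points, n_theta=180, n_rho=200, rho_max=200.0):
    """Basic Hough transform: each point votes for all lines
    rho = x cos(theta) + y sin(theta) passing through it; peaks in the
    (theta, rho) accumulator correspond to lines supported by many
    points -- no pairwise point matching is needed."""
    acc = np.zeros((n_theta, n_rho), dtype=int)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)
        bins = np.round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        ok = (bins >= 0) & (bins < n_rho)
        acc[np.arange(n_theta)[ok], bins[ok]] += 1
    return acc, thetas

# Ten collinear points (y = 2x + 1) produce a dominant peak.
pts = [(x, 2 * x + 1) for x in range(10)]
acc, thetas = hough_lines(pts)
t, r = np.unravel_index(acc.argmax(), acc.shape)
print(acc.max(), "votes near theta =", round(float(np.degrees(thetas[t])), 1), "deg")
```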

Once fixed structures and grouping hypotheses have been set up, they are verified for correctness. We apply a correlation-based approach that segments the image into areas that are in agreement with the hypothesis under investigation. False hypotheses can thus be rejected quickly.

1.5 Outline of the Thesis

This report is structured as follows.

In Chapter 2, we discuss earlier work in the context of grouping. As the term grouping is rather ambiguous, the amount of literature is accordingly vast. This chapter is by no means an exhaustive overview; nevertheless, we believe it covers the most important work related to this thesis.

Chapter 3 takes a closer look at the geometric concepts that the presented system is based on. In particular, we introduce planar homologies and their fixed structures and explain their relations to grouping.

Loosely speaking, one half of the backbone of our system is formed by the affinely invariant neighbourhoods explained in Chapter 4. Four different types of neighbourhoods have been developed to date, and we cover their extraction methods and properties in more detail.

The second half of the backbone is the cascaded Hough transform (CHT) presented in Chapter 5. The CHT is an iterated application of a Hough transform, where the output of a previous transform can be used as input for a subsequent one. This chapter only describes the basic mechanisms of the CHT and the transformations between the different coordinate frames.

In Chapter 6, we explain how repetitions are found efficiently. We discuss measures of similarity and address the problems of obtaining representative statistics.

Next, Chapter 7 shows how the CHT is applied to extract the fixed structures, given clusters of similar affinely invariant neighbourhoods as input. We also explain how this leads to planar homology candidates and present a validation scheme needed for the verification of grouping hypotheses.

Experimental results are shown in Chapter 8 for various grouping types, and Chapter 9 finally concludes this thesis with some suggestions for improvements and further work.

2 Tour d'horizon: From the Early Days to State of the Art

The automatic detection of symmetries and groupings in images is a long-researched topic that reaches back to the early days of computer vision. The concept of grouping in the vision literature is not precisely defined and is also strongly associated with perceptual organization. In fact, grouping is applicable to a number of cognitive activities, not just vision. In vision, grouping can be applied at a number of stages, and it can make use of different types of features. As a consequence, a large number of contributions have evolved over time. This state of affairs gives rise to some ambiguity in the term 'grouping'. Previous contributions on grouping differ from one another with respect to the types of features they comprise, the dimensions over which the groupings are sought, the underlying assumptions about the data acquisition process, and so on.

Although the concept of perceptual organization and grouping can even be extended to 'higher-dimensional' data, e.g. range images, 3D volume data, 2D + motion etc., this thesis addresses the problem of finding groupings in 2D images, and so does the literature survey in this chapter. Due to the large number of contributions devoted to grouping and perceptual organization in general, the overview given here is by no means complete. The goal is a classification scheme to structure earlier work. A classification is useful for illustrating the progress achieved so far in grouping research in computer vision.

The organization of image features into structures at a higher semantic level is of particular interest in machine vision for various reasons. A human observer is capable of performing grouping tasks in (almost) real time, unaware of the necessary computational complexity. Systematic investigations of human perception were carried out by psychologists, and their results inspired researchers in computer vision in their early contributions to grouping.


Gestalt-based
  ad-hoc: The goal is the grouping of low-level features, such as interrupted contour edges, emerging from the same object, mostly in the context of object recognition.

Geometry-based
  Orthographic: Detection of symmetries and regularities assuming orthographic projection, mainly in the context of 3D reconstruction.
  Perspective: Detection of symmetries and regularities assuming realistic perspective projection, mainly in the context of 3D reconstruction and scene understanding.

Table 2.1: Classificatory structure

2.1 Gestalt Laws

Gestalt is a German word that roughly translates to 'organized structure'. Gestalt theory is a very general psychological theory that can be used to study and understand aspects of human behaviour. The grouping capability of human vision was studied by the early Gestalt psychologists [Wertheimer 1923]. The emphasis in the Gestalt approach was on the configuration of the elements rather than on the elements per se. This emphasis is reflected in the credo of the Gestalt psychologists: the whole is different from the sum of the parts.

Unfortunately, this important component of human vision has been missing from most computer vision systems, presumably due to the lack of a clear computational theory for the role of perceptual organization in the overall functioning of vision. One of the basic goals underlying research on perceptual organization has been to discover some principle that could unify the various grouping phenomena of human vision.

Although the Gestaltists did not provide a precise physiological or computational model of how the visual system processes information, they did come up with a set of laws specifying what will be grouped with what, and what we will perceive as figure versus ground.

2.2 Grouping Based on Gestalt Laws

Based on the results of Gestalt research in the 1930s, it has been suggested that local geometric relations can be used to structure image features into higher-level organizations. This problem is approached by looking for non-accidental properties, i.e. properties that are frequently shared by features originating in a single object, but that would very rarely appear by accident.

Motivations for grouping arose e.g. from the field of object recognition, where features of a 3D model have to be matched against their 2D counterparts projected onto the image. While it is true that the appearance of a three-dimensional object can change completely as it is viewed from different viewpoints, it is also true that many aspects of an object's projection (examples include instances of connectivity, collinearity etc.) remain invariant over large changes of viewpoint.

The features most commonly used in early recognition systems were of a geometric nature, like curved edges and straight lines, and most systems worked on simplified objects like polygons and polyhedra. Results in the field of object recognition soon stressed the necessity of some type of grouping (or selection) for the establishment of tentative matches between image features and an object model, in order to render the combinatorics of object recognition manageable. Many object recognition systems now exploit simple grouping techniques.

The use of non-accidental properties for grouping has been developed by Witkin and Tennenbaum [Witkin and Tennenbaum 1983], Binford [Binford 1981], Kanade [Kanade 1981] and Richards and Jepson [Richards and Jepson 1992]. According to these authors, the human visual system is sensitive to properties that are commonly produced by a single object or process and rarely occur at random.

Lowe [Lowe 1985] was one of the first to explore data-driven grouping in a recognition system. To the best of our knowledge, he was also the first to introduce the term 'non-accidentalness' explicitly in this context. His system, SCERPO, forms local groups of edges based on proximity, parallelism and collinearity to reduce the amount of search for model matches. Lowe developed a quantitative statistical framework to judge whether perceptual organizations of line segments are significant or have arisen by accident. The underlying assumption is the normal distribution of line segments with respect to position, orientation and location.

Jacobs [Jacobs 1989, Jacobs 1996] extended the work by Lowe by including local geometric relations to form nonlocal groups of edges. His system finds groups of image edges that could have arisen from a convex object in the scene. Although convexity is not among the classic Gestalt properties, Jacobs emphasizes its importance for object recognition. Huttenlocher and Wayner [Huttenlocher and Wayner 1992] extended the work by Jacobs by incorporating graph-theoretical methods to speed up recognition systems.

By combining more than one cue in a probabilistic framework, better performance can be achieved, which seems to be the experience of many researchers (Jacobs [Jacobs 1989], Lowe [Lowe 1985], Sha'ashua and Ullman [Sha'ashua and Ullman 1988]).

Summary Most of these early grouping contributions focus on the organization of low-level image features originating from a single object. The main motivation is the reduction of computational complexity for object recognition tasks. These grouping techniques use ad-hoc lists of Gestalt rules as a basis and are restricted to edges and contours, without using additional sources of information such as color and texture. Due to the lack of a quantitative description of the Gestalt rules, the aforementioned grouping types are of a rather intuitive nature.

2.3 Grouping Based on Geometry

In contrast to Gestalt-based grouping techniques, geometry-driven approaches benefit from a clear mathematical theory that quantifies the relations between the features that are to be organized. Expressing the image formation process in terms of geometry constrains the relations between features to be grouped.

The motivation for the geometric approaches arose mainly from recognition and shape recovery tasks. The fundamental problem is: given a single image of an arbitrary shape (or repeating instances thereof), how much information can we obtain about the true shape if no camera and object parameters are known?

Under certain assumptions, e.g. about the kind of image projection and inherent properties of the shape, information about e.g. its orientation can be obtained. In particular, relational constraints between parts of a single object (such as bilateral symmetry), or relations describing the way multiple objects repeat, have turned out to be of significant importance. For a human observer, the knowledge or assumption of symmetry translates into an impression of the slant and tilt of the object with respect to the image plane. The relational constraints rely upon precisely known mathematical relationships, and certain invariant descriptions of such constraints survive image projection. As a consequence, these can be exploited in the image, in spite of the skew induced by the image formation process.

To this end, geometric grouping approaches can be roughly classified according to the image projection model used. Early geometric grouping systems assume an orthographic (or pseudo-orthographic) projection model, which results in affine geometric relations among the entities in an image. In the case of weak perspective effects, an orthographic projection model is a good approximation, and many more cues survive the projection onto the image than in the perspective case. However, grouping systems assuming an orthographic projection model break down under serious perspective skew, which limits the range of possible applications.

In the last few years, research has been invested to deal with the full perspective

case, and in fact image cues that remain invariant under certain classes of projective

transformations can successfully be exploited in the context of intra-image grouping.

In what follows, we will look at the history of the geometric grouping approaches in

more detail. Although many authors have developed geometric grouping systems in


the absence of perspective or affine skew (assumption of a head-on view), these will

not be treated in this chapter.

2.3.1 The Affine Case

Grouping systems that assume an orthographic projection model result in affine rela-

tions between geometric image features. Early grouping contributions — as related

to this thesis — addressed the problem of symmetry detection under orthographic

viewing conditions.

Skewed Symmetry

The problem of skewed symmetry has received a lot of attention in computer vision

literature. Skewed symmetry is the type of pattern that emerges when a (mirror)

symmetric planar shape is viewed obliquely. From a geometrical viewpoint, a skew

symmetric figure is obtained when the points in a symmetric figure are mapped with

a shear transformation to their numerically equivalent points measured in oblique

coordinates. It was understood early on that the presence of such a cue helps in per-

forming a wide variety of tasks such as object recognition and deprojection [Kanade

1981].

Friedberg [Friedberg 1986] approached the problem of detecting skewed symmetry

axes based on the standard matrix of second-order moments of a shape. For a planar

object exhibiting a bilateral symmetry, the moment matrix becomes diagonal. This

property can be further exploited since the skew-symmetry operation in the image

and on the object are related by a conjugation, leading to what Friedberg terms the

Fundamental Symmetry Constraint. This constraint is applied to solve for a pair

of values (α, β) (rotation and skew), reducing this two dimensional search space by

one dimension. Since the fundamental symmetry constraint is a necessary, but not

sufficient condition for skewed symmetry, it is used to constrain the search space.

Ponce [Ponce 1988] derives a local method based on the curvature of a mirror-

symmetric contour. Pairs of contour points are exhaustively compared to determine

when a necessary condition is satisfied. In contrast to the work by Friedberg, Ponce’s

technique relies on local contour features, thus being less sensitive to occlusions.

In a similar vein, Gross and Boult incorporated both a global (moments of contours)

and a local (tangents at contours) approach into their SYMAN system [Gross and

Boult 1991, Gross and Boult 1994]. For the global method, the authors establish

relations between measured skewed image contour moments and the symmetry axes

of a planar shape, whereas the local method relies on the fact that contour tangents

at skew-symmetric point pairs intersect on the skewed symmetry axis. In both cases,


axis orientation and the angle of skew are to be determined; translation invariance

is achieved by starting from the centroid of the contour under investigation. Special

attention is paid to the problem of skew ambiguity, which is of interest for certain

classes of shapes, such as circles, ellipses and isosceles triangles.

Van Gool et al. [Van Gool et al. 1995c] introduced a more general symmetry concept

based on the invariant parameterization of contours. Symmetry is interpreted in a

broader sense as repeated shape fragments lying in parallel planes. The construction

of the Arc Length Space (ALS) allows the efficient detection and analysis of both

mirror and rotational symmetries under oblique viewing conditions. In addition,

undetected symmetries can be inferred by exploiting the properties of the ALS.

In [Van Gool et al. 1995b], a comprehensive description of skewed symmetries is

presented. Orthographically skewed symmetry is characterized by two features that

are present in perfect mirror symmetry and that are preserved under the skewing,

i.e. that are invariant under affine transformations: parallelism of the chords and

collinearity of the midpoints. Using these as points of departure, a set of invariants

is derived that skewed mirrored point pairs or contour segments should satisfy. It is

shown that, once the direction of the chords is known, a two-dimensional subgroup of the affine transformations can be found, which in turn allows the derivation of invariants suited for skewed symmetry. From a more practical point of view, one can also impose a set of constraints 'dual' to chord-parallelism and midpoint-collinearity, namely the equiaffinity (area preservation) and involution constraints, which allows hypotheses to be set up in a much more efficient way.

Invariants also play an important role in the work by Mukherjee et al. [Mukherjee

et al. 1995]. They focus on skewed mirror-symmetry mainly in the context of depro-

jection. In the case of affine mirror-symmetry, invariants under this 3 dof subgroup

are easier to handle than under general (6 dof) affine transforms. In particular,

transformation properties of skewed mirror symmetry (such as e.g. the involution

constraint) are exploited, together with distinguished points on the contour of the

object. Contour segments are labeled with invariant signatures, which allows effi-

cient matching for hypothesis generation using invariant-based hashing. Although matching can be performed with a complexity of O(n), the authors implemented a simpler O(n^2) algorithm, arguing that n is comparatively small in their case.

Wallpaper Symmetries

Liu and Collins focused on the automatic analysis of wallpaper patterns, that is

pattern repetitions related by translations, reflections, glide reflections and rotations.

A first contribution [Liu and Collins 2000] classified wallpaper-symmetric patterns

under the assumption of a head-on view. In later work [Liu and Collins 2001]

the authors extend the concept of skewed symmetry to skewed symmetry groups.


More precisely, they show that particular symmetry groups survive general affine

skewing, and certain symmetry groups ’migrate’ to some others. Based on peaks

in the autocorrelation function of the symmetric image, their system constructs

the generating lattice of the underlying symmetry. The structure of this lattice is

investigated in more detail to detect — after deprojection — one of the 17 wallpaper

groups constituting the pattern, and to identify meaningful repeating basic patterns.

Repetitions

A completely different type of grouping deals with repetitions of particular image

features. In contrast to the past (contour-based) work mentioned before (except for

maybe [Van Gool et al. 1995c]), the paper by Leung and Malik [Leung and Malik

1996] deals with grouping of irregularly repeating texture elements. The motivation

is the same as in the case of skewed symmetry, namely the recovery of 3D scene

structure, because repeating texture elements can be regarded as multiple views in

a single image. Although Leung and Malik deal with perspective images, we assign

their method to the affine case. Quite similar to the grouping strategy presented

in this thesis, the authors start with so-called points of interest to detect repeating

distinctive elements, and it is assumed that the geometric relations between adjacent

elements are affine. The outcome boils down to a graph representation of the spatial

relationship between texture elements, where nodes represent repeating patches and

arcs denote affine maps that best warp the two patches onto each other. In this way,

even weak perspective effects can be gradually dealt with.

2.3.2 The Perspective Case

The assumption of orthographic / pseudo-orthographic projection models for group-

ing is certainly valid for a wide range of applications. However, such assumptions are

no longer valid when strong perspective effects are present. Under such conditions,

affine grouping systems are no longer applicable. Taking perspective effects fully

into account complicates the problem, as symmetric objects or parts thereof are

now related by the more general class of projective transformations or projectivities

for short. General projective transformations have more degrees of freedom than

their affine counterparts, and fewer cues that can be utilized for establishing geometric grouping correspondences are preserved under projectivities.

Nevertheless, plane projective transformations (PL(3)) are well understood, and their algebraic structure can be taken advantage of. In fact, more invariants can be derived for particular subclasses of projective transformations than just the general projective invariant, i.e. the cross-ratio. The first related contributions have focused on the

generalization of skewed symmetry towards the projective case.


Mirror Symmetry

Glachet [Glachet et al. 1993] was one of the first to carry the concept of skewed symmetry over into the projective domain. The tools exploited in the affine case are modified: the parallelism of the chords translates to a common vanishing point, and the midpoint invariance becomes the harmonic cross-ratio. Starting with

the contour of an object, a first coarse estimation for the symmetry axis and the

vanishing point is sought, followed by a verification/refinement step. Later on it is

shown that — given the vanishing point and the axis — the whole object can be

uniquely deprojected if its size is known, or up to a scale factor otherwise.

Bruckstein and Shaked [Bruckstein and Shaked 1998] present an approach that

deals with the detection of mirror symmetries of contours under both affine and

perspective skew. They argue that symmetries of a contour manifest themselves

as special structures in a projection-invariant signature function, thereby reducing

the problem of symmetry detection and analysis to that of analyzing a periodic 1D

function.

More General Configurations

In [Van Gool and Proesmans 1995, Van Gool et al. 1998], planar homologies are

introduced as a special subgroup of the projectivities useful for grouping and recog-

nition tasks. Although Glachet et al. never mentioned planar homologies explicitly,

they make use of their properties, namely the concept of fixed structures. Apart

from planar mirror-symmetric figures, Van Gool et al. show that planar homologies

can deal with a greater variety of inter and/or intra-object relations: scene objects

(or parts thereof) related by a 3D perspectivity are related by planar homologies in

the image. It can be shown that planar homologies with a common line of fixed points and pencil of fixed lines form a subgroup of the projectivities, and simpler invariants can be constructed. Their usefulness is underlined by a shadow-based cartographic tool that assists a human operator in delineating building and shadow boundaries more accurately.

In [Van Gool 1997, Van Gool 1998], Van Gool gives a more principled approach

to grouping based on the concept of fixed structures. The basic grouping configuration under investigation is two planar shapes in 3D related by a 2D projec-

tive transformation. According to the structures that are kept fixed (points, lines

and combinations thereof), the corresponding subgroups of the projectivities can

be classified. As a consequence, subgroup-specific invariants can be constructed,

which allows a more efficient detection of specific grouping configurations. A ma-

jor design goal is efficiency, that is the reduction of combinatorics to an absolute

minimum. Apart from efficiently matching curve segments using subgroup-specific


invariants, Van Gool proposes a cascaded version of the Hough transform to extract

fixed structures, again in a non-combinatorial way.

Cham and Cipolla [Cham and Cipolla 1996] came up with a curve-based approach, developed not specifically in the context of grouping (e.g. the detection of symmetry axes) but for the more general problem of curve matching.

They tackled the problem of automatically establishing curve correspondences un-

der 2D projective transformations without the use of landmark points. Specifically,

seedpoints on curves (for instance locations having a high cornerness) are used as

pivot points for establishing point-correspondences on two curves, and these pivot

points are allowed to drift over a short distance along the curve. Letting the points

drift allows a more precise hypothesis estimation, which leads to a minimization

problem for the particular hypothesis under scrutiny. The quality of the transfor-

mation basis points, which is important in the presence of highly symmetric curves,

is quantified using the concept of geometric saliency.

Most recent work by Turina et al. [Turina et al. 2001b, Turina et al. 2001a,

Tuytelaars et al. 2002] picks up the concept of fixed structures developed by Van

Gool et al. As groupings are related by planar homologies, their approach deals with

more than one grouping type. In a first step, repetitions of small, planar patches are

detected using affinely invariant neighbourhoods. Matching is performed in a feature

space, thus avoiding computationally costly pairwise comparisons. Repetitions are

then analyzed for regularity through a cascaded version of the Hough transform,

which yields candidates for fixed structures. Grouping hypotheses are validated

with correlation-based schemes. In [Turina et al. 2001a], possible solutions for the

detection of grouping hierarchies are suggested.

Repetitions

The work that comes closest to this thesis is that by Schaffalitzky and Zis-

serman [Schaffalitzky and Zisserman 1998, Schaffalitzky and Zisserman 2000]. In

their contributions, the authors deal with the detection of periodicities, i.e. regular

translations of planar image features. Under the assumption of a simple pinhole

camera model, it is shown that such translationally symmetric patterns are related

by elations in the image (see Section 3.5). Here again, fixed structures (vanishing

line, vanishing point) play an important role to cut down complexity. Similar to

Leung and Malik, Schaffalitzky and Zisserman start from points of interest and ex-

amine their local neighbourhood for similar patches. This way, pattern repetitions

are detected. RANSAC [Fischler and Bolles 1981] is then applied to obtain elation

hypotheses followed by a maximum-likelihood re-estimation.


2.4 Analysis

After having briefly outlined what we could find as the most relevant earlier work

related to geometric grouping, we want to give a more detailed analysis with respect

to some topics that are considered to be important. The issues discussed in this

section are

Generality: Which geometric configurations can be detected, and what kind of images is a system applicable to?

Features: Which features are to be grouped, or what needs to be present in the image so that a particular grouping system is applicable at all?

Efficiency: A key issue when it comes to grouping. So far, grouping approaches

are known to be computationally expensive, and tend to be characterized by

the extensive use of combinatorics.

2.4.1 Generality

Here, generality is understood in a geometric sense. The question is what geometric

grouping configurations can be handled, rather than the variety of image features

used for grouping, which will be discussed later on.

Different Grouping Types

So far, most previous grouping contributions have been dedicated to a specific group-

ing type. As an example, the approach by [Glachet et al. 1993] works on general-

purpose contour images (in principle), but they assume a cross-ratio value of -1 for

the detection of mirror-symmetries, which amounts to planar harmonic homologies.

As a consequence, their approach is restricted to planar mirror-symmetric shapes;

non-planar mirror-symmetric configurations, such as objects in front of a tilted mirror like the example shown in Chapter 8, are not detectable.

Roughly speaking, grouping systems devoted to mirror-symmetry are unable to deal

with repetitions and vice versa. An exception is the work by Van Gool [Van Gool

1997, Van Gool et al. 1998] in that planar homologies (or even the more general

concept of fixed structures) are proposed as foundations for grouping. For example,

if one remembers that harmonic homologies and elations are ’degenerate’ cases of

the more general class of planar homologies, then a grouping system that deals with

planar homologies is able to deal with both mirror-symmetries and regular repeti-

tions in perspective images. However, no such generic system has been implemented

for fully-automated grouping.


Completeness

Another issue related to generality is how systematically the grouping is carried out,

especially in case of regular repetitions. Schaffalitzky and Zisserman’s grid-grouper,

for instance, detects elations on 2D repeating patterns such as a brick wall or a

tiled floor. In fact, such highly symmetric patterns exhibit many elations, yet their

system picks out only one or two, without analyzing their interrelations (such as

linear dependencies among different periodicities etc.).

Applicability / Preprocessing

Many earlier systems assume a certain amount of preprocessing before grouping can

be carried out at all. For instance, the skewed symmetry analyzer in [Friedberg

1986] uses contours of pre-segmented shapes (or binary images without background

or clutter) as input. Similarly, the systems described in [Bruckstein and Shaked 1998,

Glachet et al. 1993, Ponce 1988] were applied to artificial images and line drawings.

Contributions by [Van Gool et al. 1995b, Mukherjee et al. 1995, Cham and Cipolla

1996, Van Gool 1997] presented results on ’real’ example images, but these images

contain only a close-up view of the object(s) to be grouped in front of a homogeneous

background. Such situations are certainly easier to analyze, as relevant features

(e.g. edges) can be extracted more reliably. Although the examples are real, they

produce a somewhat artificial impression.

Some other systems need a certain amount of user interaction in order to extract some of the required features (reference points for the invariant signature in [Van Gool et

al. 1995b], selection of edges in [Gross and Boult 1991]).

In a similar vein, Liu and Collins' wallpaper analyzer only works with images fully covered by a wallpaper-symmetric pattern (since such patterns extend ad infinitum),

i.e. their system is not able to automatically segment out wallpaper patterns in

an image for further analysis. Although results on real images are presented, the

wallpaper-symmetric patches must be segmented manually beforehand.

Only the most recent work (incl. those of the author) [Schaffalitzky and Zisserman

2000, Leung and Malik 1996, Schaffalitzky and Zisserman 1998, Turina et al. 2001b, Turina et al. 2001a, Tuytelaars et al. 2002] presented systems and results that

underlined their performance on images of real scenes containing perspective skew.

Stated otherwise, these systems do not rely on preprocessing and can deal more or

less automatically with general-purpose image scenes (cluttered background etc.).


2.4.2 Features

Another important issue is the image features that can be exploited for grouping.

Almost all geometric grouping contributions extract geometric primitives like edges

and contours, and grouping is then carried out on these entities.

Even if one concentrates on grouping edges and contours, a further loss of performance can follow from additional processing. Ponce [Ponce 1988] for instance bases his

system on curvature, but curves are not always guaranteed to be sufficiently smooth.

Obviously his system fails to detect skewed symmetries for shapes composed of only

straight edges, which is quite common for man-made objects.

Worth mentioning in this context are again [Schaffalitzky and Zisserman 1998,

Schaffalitzky and Zisserman 2000, Leung and Malik 1996] whose work is not only

based on contour and curve information. Leung and Malik first look for ’distinctive

elements’ using the second-order moment matrix followed by intensity-based cross-

correlation to find similar patterns. Schaffalitzky and Zisserman combined both

geometric and photometric information in their search for repetitive patterns. More

precisely, Harris corner points and straight lines (and intersections thereof) are used

together with closed contours described by the affine texture moment invariants

proposed by [Van Gool et al. 1996].

Global Features

The features used for grouping can either be global or local, and the choice has a

significant influence on both robustness and efficiency. Depending on the context,

global may refer to images or shapes.

Global features have been used to analyze entire shapes for symmetry, for instance

when a shape can be precisely delineated through its closed contour. Friedberg [Friedberg 1986] and Gross and Boult [Gross and Boult 1991, Gross and Boult 1994]

worked with global contour moments.

The wallpaper analyzer by Liu and Collins can be regarded as global, since a Fourier

transformation of the entire image has to be applied for computing its autocorre-

lation function. Liu and Collins do not explicitly mention the computation of the

autocorrelation in the Fourier domain, but they adapted a procedure by Lin [Lin

et al. 1997], where the autocorrelation is computed in the Fourier domain. Experi-

ments done during this thesis confirm the noticeable superiority of the frequency domain over the spatial domain for computing the autocorrelation function.

Global cues add to the efficiency as there is no need for exhaustive pairwise compar-

isons. Moreover, global methods are more robust to noise. The disadvantage is that

the shape must be fully visible, i.e. global methods are more sensitive to occlusions

and imperfect symmetry.


Local Features

Global features are no longer applicable when symmetric patterns are only partially

visible due to e.g. occlusions. In normal images, this might occur quite often. In

such situations local approaches are appropriate. Even from a conceptual point of

view, the effects of symmetry operations are easier to apprehend; for instance, a

mirror-symmetry maps one contour segment onto another one — understanding the

same process in terms of global contour moments is certainly less straightforward.

Clearly, local features suffer from several shortcomings. The most serious danger

lies in the nature of the local approach per se as there is a high risk of falling into

combinatorics. Local features are also more error-prone regarding their extraction

(noise). For the case of serious perspective distortion, large scale differences may

pose a problem.

In [Ponce 1988], every contour point is used as a local feature. Others approxi-

mate the contour by its convex hull ([Glachet et al. 1993]) or only concentrate on

polygonal shapes ([Bruckstein and Shaked 1998]), thereby using line segments and

endpoints as features.

Apart from a contour itself, ’identifiable points’ ([Van Gool et al. 1995c]) like inflec-

tion points, and ’curve markers’ ([Mukherjee et al. 1995]) such as bi-tangent contact

points are used for establishing point correspondences.

Also in the case of repetitions, both Schaffalitzky and Zisserman and Leung and

Malik start from points of interest and use local geometric and photometric patches

to derive grouping hypotheses. More precisely, the system by Schaffalitzky and

Zisserman carries out grouping rather selectively (equally spaced coplanar lines,

repetitions on translations in a plane and on a regular grid).

In [Turina et al. 2001a, Turina et al. 2001b], the range of features is much wider

than the aforementioned systems. This is due to the use of different types of affinely

invariant neighbourhoods (see Chapter 4). As a result, repetitions consisting of a

whole variety of features can be dealt with.

2.4.3 Efficiency

Efficiency issues can be considered of outstanding importance for grouping. In

principle, identifying groupings in images is quite easy: compare one feature with

all others in an image and check whether certain constraints are fulfilled, then go on

with the next feature etc. As easy as this process might be, it is computationally

infeasible, yet most of the earlier grouping strategies employ combinatorics like this

at one stage or another.


Combinatorial Strategies

At the heart of the combinatorial strategies are exhaustive pairwise comparisons,

usually of complexity O(n^2). Exhaustive in this context means a large number of

features to be compared (large n), thus ending up with long computation times.

Ponce’s approach [Ponce 1988] tests every pair of contour points for a local curvature-

based constraint, which is hardly feasible in practice. Cham and Cipolla [Cham

and Cipolla 1996] proceed in a similar way, both by considering the intersection points of each pair of contour tangents and by introducing the local skewed

symmetry field, i.e. a spatial representation of symmetry evaluation for each pair of

contour points.

The extreme amount of combinatorics as proposed by Ponce can be tolerably soft-

ened by approximating a contour's shape by its convex hull ([Glachet et al. 1993]), or

by considering only polygonal shapes ([Bruckstein and Shaked 1998]). Such approx-

imations certainly reduce the number of pairs to be considered, yet a complex shape

still needs a substantial number of points for its convex hull, maybe even more than

the fixed 20 segments used by Glachet for the rough estimates of the symmetry axis

and the vanishing point.

Also computationally expensive is the system by Leung and Malik [Leung and Malik

1996], where each distinctive scene patch is compared to its eight neighbours, in

combination with the estimation of an affine map (minimization of an error measure).

This procedure is repeated until no more similar patches are available.

Schaffalitzky and Zisserman [Schaffalitzky and Zisserman 2000, Schaffalitzky and

Zisserman 1998] attempt to alleviate the amount of combinatorics through the use

of RANSAC, which typically is already much more efficient than simple pairwise

comparisons. In brief, RANSAC [Fischler and Bolles 1981] is an algorithm that simultaneously fits parameters and rejects outliers: model parameters are repeatedly fitted to small, randomly drawn subsets of the data, and the model consistent with the largest number of samples is retained; samples not consistent with this model are rejected as outliers. Schaffalitzky and Zisserman

employ RANSAC to determine e.g. salient vanishing points from parallel scene lines

and for the generation of elation hypotheses.

A critical parameter for RANSAC is the percentage of outliers. In situations where

most of the data are inliers, RANSAC is superior to earlier approaches in that

meaningful models (i.e. grouping hypotheses) can be found with less computational

effort than pairwise comparisons. However, a loss of performance occurs when the

number of outliers reaches parity with the number of inliers. As a consequence,

the computational complexity might again be of order O(n^2), thereby losing its

superiority over classical combinatorial approaches.
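For illustration, the following is a minimal, generic RANSAC sketch in Python/NumPy, fitting a line to 2D points contaminated with outliers. It is a toy example of the sampling scheme, not Schaffalitzky and Zisserman's implementation; all names and parameter values are illustrative.

    import numpy as np

    def ransac_line(points, n_iters=200, inlier_tol=0.02, seed=0):
        # Minimal RANSAC: repeatedly fit a line to a random minimal sample
        # (two points) and keep the model with the largest consensus set.
        rng = np.random.default_rng(seed)
        best_inliers = np.zeros(len(points), dtype=bool)
        for _ in range(n_iters):
            i, j = rng.choice(len(points), size=2, replace=False)
            # Homogeneous line through the two sampled points.
            l = np.cross([*points[i], 1.0], [*points[j], 1.0])
            norm = np.hypot(l[0], l[1])
            if norm < 1e-12:
                continue                     # degenerate sample
            l = l / norm                     # normalize for point-line distances
            dist = np.abs(points @ l[:2] + l[2])
            inliers = dist < inlier_tol
            if inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        return best_inliers

The number of iterations needed grows quickly with the outlier ratio, which is exactly the performance caveat discussed above.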


Efficiency-devoted Strategies

The introduction of invariant descriptions for certain features in computer vision

has led to ways of efficiently establishing tentative correspondences (with respect

to geometry and intensity) between them. Here, the term ’efficient’ means the

avoidance of computationally costly combinatorial approaches.

In general, we only consider those systems as efficient that make use of invariance to

cut down heavy combinatorics. Invariance in combination with hashing techniques

can render such systems even more powerful.
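Schematically, invariant-based hashing amounts to the following (a toy sketch in Python; the signature function stands for any invariant feature description and is a hypothetical placeholder):

    from collections import defaultdict

    def hash_by_invariant(features, signature, cell=0.05):
        # Bucket features by a quantized invariant signature. Features whose
        # signatures agree up to the quantization land in the same bucket and
        # become tentative matches in O(n), without pairwise comparisons.
        table = defaultdict(list)
        for f in features:
            table[round(signature(f) / cell)].append(f)
        return {key: bucket for key, bucket in table.items() if len(bucket) > 1}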

Invariant signatures in the case of skewed contour symmetries were employed by [Mukherjee et al. 1995, Van Gool et al. 1995b], and the construction of the arc

length space in [Van Gool et al. 1995c] is also based on an invariant contour param-

eterization.

For the more general approach given by [Van Gool 1998, Van Gool 1997], efficiency

also comes through the concept of fixed structures — fixed points, fixed lines, lines

of fixed points and combinations — that are characteristic of particular subgroups

of the projectivities. Depending on the grouping configuration sought (i.e. the

configuration defined by a specific subgroup), fixed structures might already lift

many degrees of freedom, which significantly improves the efficiency. It is suggested

that fixed structures can be detected efficiently through the use of a cascaded Hough

transform [Tuytelaars et al. 1998a].

Efficiency is also a principal design goal for this thesis. In [Turina et al. 2001b,

Turina et al. 2001a], the efficient detection of both mirror-symmetries and reg-

ularities was based on the concept of fixed structures. In these contributions, a

line of fixed points and a pencil of fixed lines were extracted using invariant-based

matching, together with a cascaded Hough transform.

2.5 Summary and Conclusions

To summarize this overview of previous contributions to grouping, we briefly mention

some of the issues that have led to the design of the grouping approach presented

in this thesis:

Efficiency: Combinatorics pervades most grouping contributions. Although efficient approaches have been proposed for the affine case, no efficient

strategies have been reported that can also deal with mirror-symmetries un-

der perspective distortions. The situation is somewhat better for periodicities

through the use of RANSAC as shown by Schaffalitzky and Zisserman. The

goal of this thesis is to handle even serious perspective effects efficiently, that


is with an absolute minimum of combinatorics through the use of invariance

and hashing techniques.

Features: Most earlier work performs grouping on contours only; other sources of information are not considered. Only Schaffalitzky and Zisserman made use of multiple features, even though rather selectively (scene-dependent). We

want to use a variety of different features (geometric and photometric) in a

consistent way.

Preprocessing: Many authors in the past developed grouping systems that as-

sume a substantial amount of preprocessing (pre-segmentation etc.) or demon-

strated their results on artificial data only. The strategy that we propose is

applicable to general purpose images without any form of preprocessing and

pre-segmentation.

Grouping Types: So far, geometric grouping approaches are dedicated to one

specific grouping type, and no generic system has yet been presented that is

able to deal with more than one grouping type. We regard the geometric

concept of fixed structures as a promising foundation for a more generic design

that can deal with such groupings, e.g. mirror-symmetries, regular repetitions

etc.

3 Fixed Structures - Key to Efficiency

Most earlier grouping contributions assumed weak perspective effects.

Weak perspective is a limiting form of perspective which occurs when

the depth of objects along the line of sight is small compared to the

viewing distance. Affine transformations are a good approximation to the distortions that arise under weak perspective, as they include the typical linear geometric transformations such as rotation, translation, scaling and skewing. More features are preserved under affine transformations, and invariants can be constructed more easily than for the more general projective case.

In this thesis, though, we want to detect groupings effectively under the more re-

alistic perspective case. Past work has shown that invariant-based methods yield

an enhancement over traditional combinatorial methods in this respect. The crux

of the matter is that only a few robust invariants are known for the general pro-

jective case. On the other hand, subgroups of the projectivities offer promising

opportunities with respect to efficiency and robustness. In short, these subgroups

are defined by the geometric structures that they preserve. They will be denoted as

fixed structures from this point onwards.

In fact, fixed structures might indeed occur as visible features in images containing

symmetries or regularities, such as the symmetry axis of a mirror-symmetry or the

horizon line of a plane with a periodicity. Depending on the grouping type sought,

the knowledge of the corresponding fixed structures might drastically cut down

complexity, hence we consider them as a key feature for achieving efficiency.

Of course, the resulting increase in efficiency is in vain if fixed structures can only

be extracted with exhaustive, combinatorial techniques. For the time being, it

is assumed that fixed structures can indeed be found efficiently. We will see in

Chapter 7 how this can be done.

In this chapter, we briefly introduce projective transformations and their basic alge-

braic and geometric properties in the first part. The second part gives a more concise



definition of fixed structures and leads on to subgroups of the projectivities. The

third part describes how such subgroups can be exploited for the efficient detection

of specific grouping types. The chapter finishes with a more detailed introduction

to the important class of planar homologies.

3.1 Plane Projective Transformations

As proposed by Felix Klein in his famous “Erlangen Program” in 1872, geometry is

the study of properties invariant under groups of transformations. From this point

of view, projective geometry is the study of properties of the projective plane1 IP 2

that are invariant under a group of transformations known as projectivities.

A projectivity is an invertible mapping from points in IP^2 to points in IP^2 that maps

lines to lines. More precisely ([Hartley and Zisserman 2000]),

Definition 3.1 A projectivity is an invertible mapping h from IP^2 to itself such that three points x1, x2 and x3 lie on the same line if and only if h(x1), h(x2) and

h(x3) do.

It can easily be seen that projectivities form a group in the strict mathematical

sense (the inverse of a projectivity is also a projectivity, and so is the composi-

tion of two projectivities). Projectivities are often called collineations, projective

transformations or homographies.

Note that Definition 3.1 is coordinate-free. An equivalent algebraic definition of a

projectivity can be given based on the following result:

Theorem 3.1 A mapping h: IP^2 → IP^2 is a projectivity if and only if there exists a non-singular 3 × 3 matrix H such that for any point in IP^2 represented by a vector

x it is true that h(x) = Hx.

The projective linear group of n × n matrices is denoted by PL(n). In the case of

projective transformations of the plane, n = 3.
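To make Theorem 3.1 concrete, the following minimal sketch (Python/NumPy; the matrix entries are illustrative only) applies a projectivity to homogeneous points and checks that collinearity is preserved:

    import numpy as np

    # An arbitrary non-singular 3 x 3 matrix, acting as a projectivity on IP^2.
    H = np.array([[1.0,  0.2,  5.0],
                  [0.1,  0.9,  3.0],
                  [1e-3, 2e-3, 1.0]])

    def apply_projectivity(H, x):
        # h(x) = Hx in homogeneous coordinates, renormalized so that w = 1.
        y = H @ x
        return y / y[2]

    # Three collinear points (on the line y = x) remain collinear after mapping:
    pts = np.array([[0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [2.0, 2.0, 1.0]])
    imgs = np.array([apply_projectivity(H, p) for p in pts])
    print(np.linalg.det(imgs))   # ~0: the mapped points are still collinear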

3.1.1 Coarse Structure

Important subgroups of PL(3) can be identified by looking at the algebraic def-

inition, represented as a 3 × 3 matrix. The affine group as a subgroup of PL(3)

consists of matrices for which the last row in H is (0, 0, 1). The Euclidean group,


which in turn is a subgroup of the affine group, has an additional orthogonal upper

2 × 2 submatrix.

One can define a hierarchy of transformations, starting from the most specialized,

the Euclideans, and progressively generalizing until projective transformations are

reached.

    Group         Degrees of freedom
    Euclidean     3 dof
    Similarity    4 dof
    Affine        6 dof
    Projective    8 dof

    Table 3.1: Hierarchy of subgroups

A more detailed explanation of the individual subgroups and their properties would be beyond the scope of this thesis. For a more formal approach,

we refer to the work by Semple and Kneebone [Semple and Kneebone 1952] and

Springer [Springer 1964]. An excellent description of projective transforms with

respect to Computer Vision can also be found in [Hartley and Zisserman 2000].

3.2 Fixed Structures and Subgroups

3.2.1 Fixed Structures

In the previous chapter, we mentioned that certain geometric entities remain fixed

under certain symmetry operations in the scene and their associated projectivities

in the image. In this section we develop this thought more thoroughly. For the

following, the source and destination planes are the same so that the transformation

maps points x to points x′ in the same coordinate system.

The key idea is that an eigenvector corresponds to a fixed point of the transformation, since for an eigenvector e with eigenvalue λ,

H e = λ e, and e ≡ λ e, (3.1)

as homogeneous vectors that differ only by a non-zero scale factor represent the same point.

A 3 × 3 matrix has three eigenvalues, and consequently a plane projective transformation has up to three fixed points. As the characteristic equation is a cubic in this case, either one or three of the eigenvalues (and corresponding eigenvectors) are real.

Fixed lines can be treated in a similar way: since lines transform as l' = H^{-T} l, fixed lines correspond to eigenvectors of H^T (a matrix and its inverse share the same eigenvectors).

Note that fixed lines are fixed as a set, not fixed pointwise, i.e. a point on the line is

mapped to another point on the same line, but in general the source and destination

points will differ.
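Since fixed points are (real) eigenvectors of H and fixed lines are (real) eigenvectors of H^T, both can be read off directly from an eigendecomposition. A minimal sketch (Python/NumPy; illustrative only, not the thesis implementation):

    import numpy as np

    def fixed_structures(H, tol=1e-9):
        # Fixed points: real eigenvectors of H. Fixed lines: real eigenvectors
        # of H^T. Both are defined only up to homogeneous scale.
        def real_eigvecs(M):
            vals, vecs = np.linalg.eig(M)
            return [np.real(vecs[:, i]) for i in range(3)
                    if abs(np.imag(vals[i])) < tol]
        return real_eigvecs(H), real_eigvecs(H.T)

    # Example: a rotation about the origin. Its only real fixed point is the
    # rotation centre (0, 0, 1)^T; its only real fixed line is the line at
    # infinity, also (0, 0, 1)^T.
    t = np.deg2rad(30.0)
    R = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t),  np.cos(t), 0.0],
                  [0.0,        0.0,       1.0]])
    points, lines = fixed_structures(R)
    print(points, lines)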


A further specialization concerns repeated eigenvalues: suppose two of the eigen-

values (e.g. λ2 and λ3) are identical, and that there are two distinct eigenvectors

(e2, e3) corresponding to λ2 = λ3. Then the line spanned by the eigenvectors e2, e3

will be fixed pointwise, i.e. it is a line of fixed points.

This line of thought can be continued more systematically by investigating all possi-

ble configurations of eigenvalues and eigenvectors. We do not go into much detail here

and refer to the classical textbooks ([Semple and Kneebone 1952, Springer 1964])

instead.

3.2.2 Subgroups Defined by Fixed Structures

All projective transformations that keep the same structures fixed (e.g. a specific

line or point) form subgroups of the projectivities [Van Gool et al. 1995a], and these

subgroups can be categorized based on their fixed structures. More precisely, the

classification is based on a combination of fixed points and fixed lines that projective

transformations can share.

These combinations are summarized schematically in Fig. 3.1. Each square corre-

sponds to a different type of subgroup, with a qualitatively different combination of

fixed structures. A point in such a square indicates a specific (but arbitrary) fixed

point; the same holds for a line. Note that sometimes a fixed point lies on a fixed

line.

Thick lines indicate lines where every point on such a line is a fixed point, hence thick lines are lines of fixed points.

Bunches of concurrent lines indicate pencils of fixed lines,

where all lines through a point (the vertex) remain fixed.

The vertex is a fixed point. Pencils of fixed lines are

the projective duals of lines of fixed points. The black

square at the bottom represents the trivial case, where

all points are fixed points, which is the identity.

While going down the categorization scheme, additional

fixed structures are added, thereby gradually decreasing

the dimensionality of the subgroups. The dimension of

the corresponding subgroup is indicated on the right.

For an in-depth discussion of the inherent properties of

these subgroups, we refer to [Van Gool et al. 1994].

Subgroups having a line of fixed points and a pencil of

fixed lines are of special interest since these represent

planar homologies, the principal subgroup used in this thesis. In Fig. 3.1, the planar

homologies are highlighted.


A Word about Invariants

Clearly, the fixed structures that define certain subgroups of the projectivities are

invariant under the action of that particular subgroup, but not necessarily under

general projective transformations. In [Van Gool 1998, Van Gool 1997], it is shown

how additional invariants (point / line configurations and curve parameterizations)

can be derived for certain subgroups.

Figure 3.1: Classificatory structure of subgroups for fixed points and lines (subgroup dimensionality ranging from 6 dof down to 0 dof).


3.3 Fixed Structures for Grouping

Van Gool ([Van Gool 1998]) pointed out that the use of general projective invariants

is not necessarily the optimal approach for the detection of specific grouping config-

urations. Since the grouping process can be regarded as matching objects (and/or

parts thereof) onto their symmetric counterparts, far too many matches might result

when using general projective invariants.

This can easily be understood by the following example: Consider multiple rep-

etitions of e.g. mirror-symmetric patterns. All halves are projectively equivalent

under general projective invariants, although we are primarily interested in mirror-

symmetric configurations.

Symmetry-specific invariants, however, can increase the efficiency considerably as

they selectively pick out those objects and object parts that are in symmetric po-

sitions. And this is the guiding principle in this thesis: The knowledge of fixed

structures gives away important information about specific grouping configurations

such that grouping hypotheses can be determined without combinatorial procedures.

The efficient detection of specific grouping configurations hinges on the existence

of projective subgroups to which such skewed configurations would have to belong.

For the rest of this section we explain the principal ideas in more detail.

3.3.1 Conjugate Symmetry

Our point of departure is a symmetric configuration in 3D space. More precisely, the

symmetries that we are interested in are translational symmetry (e.g. floor tilings,

windows on the facade of a building) and mirror symmetries like the one shown in

Fig. 3.2. In general, the symmetric, planar parts in the scene are either related by a translation (in a particular direction) or by a perspectivity, where in this case the symmetric patches need not be coplanar.

Figure 3.2: Mirror-symmetric configuration when viewed head-on (left) and obliquely (right).

symmetric patches don’t need to be coplanar.

These symmetry operations, when applied to planar objects in the scene, have struc-

tures that are kept fixed. For instance, mirror-symmetries map all points on the

symmetry axis onto themselves. Less obvious, on the other hand, are translational

symmetries. They keep the line at infinity and the direction of the translation

unaltered.

It was shown in earlier publications that such fixed structures in the scene have their

corresponding counterparts in the image, i.e. they survive the image projection.

This applies to both the translational symmetries ([Schaffalitzky and Zisserman

2000]) and planar patterns that are perspectively related ([Van Gool et al. 1998]).

In the image, the geometric relations between symmetric parts manifest themselves

as planar homologies.

For the case of coplanar symmetric patterns, the fixed structures can even be quan-

tified mathematically. The transformation between the projected patterns in the

image, expressed by the non-singular 3 × 3 matrix H2, is similar to the original

projectivity H3 in the algebraic sense, i.e.

H2 = P H3 P^{-1}, (3.2)

where P is the perspectivity that maps the scene plane onto the image plane. Hence, H2 and H3 share the same eigenvalues, and their fixed structures correspond via P.
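The conjugacy relation (3.2) is easy to verify numerically: similar matrices share their eigenvalues, so H2 inherits the eigenvalue pattern (and with it the type of fixed structures) of H3. A small sketch with made-up matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    H3 = rng.standard_normal((3, 3))   # symmetry operation in the scene plane
    P = rng.standard_normal((3, 3))    # perspectivity scene plane -> image plane
    H2 = P @ H3 @ np.linalg.inv(P)     # induced transformation in the image, eq. (3.2)

    # If H3 e = lambda e, then H2 (P e) = lambda (P e): fixed structures
    # correspond via P, and the eigenvalues coincide.
    print(np.sort_complex(np.linalg.eigvals(H3)))
    print(np.sort_complex(np.linalg.eigvals(H2)))   # identical up to rounding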

It can easily be seen how fixed structures in Fig. 3.2 become apparent: all points

lying on the symmetry axis are mapped onto themselves (left) in 3D, yet the trans-

formation that maps corresponding points onto each other in the perspective image

(right) still has an axis on which all points remain fixed and a pencil of fixed lines

connecting corresponding points.

It is important to keep in mind that not too many features survive the projection

onto the image. Returning to the example in Fig. 3.2, the most obvious symme-

try characteristics disappear: the joins connecting mirror-symmetric patches are no

longer parallel, their intersection angle with the symmetry axis is no longer orthog-

onal, symmetric points no longer have the same distance to the axis, etc. However,

H2 still has an axis on which all points remain fixed and a pencil of fixed lines con-

necting corresponding points. Also note the simple nature of the fixed structures —

lines and points — regardless of the complexity of the repeated patterns.

3.4 Planar Homologies

From a geometric point of view, planar homologies arise when two planar shapes

in the scene are related by a 3D perspectivity ([Van Gool and Proesmans 1995]).


The practical importance of planar homologies will be illustrated with examples

throughout this thesis.

Definition 3.2 A plane projective transformation is a planar homology if it has a

line of fixed points (axis) together with a fixed point not on the line (vertex).

An algebraically equivalent definition is that the 3 × 3 matrix H has two equal and one distinct eigenvalue, λ0, λ0, λ2. The axis is the join of the eigenvectors corresponding to the degenerate eigenvalues. The third eigenvector corresponds to the vertex. The ratio of the third to the other eigenvalue, µ := λ2/λ0 (cross-ratio, modulus), is a characteristic invariant of the homology.

Note that the set of all such transformations does not form a group, but those with the same vertex and axis do. The cross-ratio defined by the vertex V, a pair of corresponding points P, P' and the intersection of the line joining these points with the line of fixed points is the same for all point pairs related by the homology. One has

the line of fixed points, is the same for all points related by the homology. One has

therefore 5 dof in specifying a planar homology:

vertex v = (x, y, w)^T (2 dof)

axis a = (a, b, c)^T (2 dof)

characteristic cross-ratio µ (modulus) (1 dof)

The special case in which the modulus µ is -1 (harmonic cross-ratio) is also known

as a planar harmonic homology. It is then involutory, that is H^2 = I, and has four

dof. As seen earlier, in perspective images of a plane object with coplanar bilat-

eral symmetry, corresponding points in the image are related by planar harmonic

homologies.

Parameterization

Projective transformations representing planar homologies can be parameterized

directly in terms of their fixed structures and characteristic cross-ratio [Hartley and

Zisserman 2000]:

H = I + (µ − 1) v a^T / (v^T a),    H^{-1} = I + (1/µ − 1) v a^T / (v^T a)    (3.3)

where I is the identity matrix.
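The parameterization (3.3) is straightforward to instantiate. The following sketch (illustrative values, not the thesis implementation) builds a planar homology from a vertex, an axis and a modulus, and verifies the expected eigenvalue pattern λ0, λ0, λ2 with µ = λ2/λ0:

    import numpy as np

    def planar_homology(v, a, mu):
        # Direct parameterization, eq. (3.3): H = I + (mu - 1) v a^T / (v^T a).
        v, a = np.asarray(v, float), np.asarray(a, float)
        return np.eye(3) + (mu - 1.0) * np.outer(v, a) / (v @ a)

    v = np.array([0.0, 0.0, 1.0])     # vertex (illustrative)
    a = np.array([0.0, 1.0, -1.0])    # axis, with v not on a (v.a != 0)
    mu = 3.0                          # characteristic cross-ratio (modulus)
    H = planar_homology(v, a, mu)

    # Double eigenvalue 1 spans the axis (line of fixed points); the distinct
    # eigenvalue mu belongs to the vertex.
    print(np.sort(np.linalg.eigvals(H).real))   # [1. 1. 3.]
    assert np.allclose(H @ v, mu * v)           # the vertex is a fixed point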


Figure 3.3: Examples of shapes related by planar homologies; a shadow configuration (left) and a harmonic and general mirror-symmetric configuration (right).

Planar Homologies and Grouping

Obviously, once the fixed structures of a particular configuration are known, only

one dof (cross-ratio / modulus) remains that can easily be lifted by a point match to

fix the planar homology. As mentioned earlier, all projective transformations that

keep a line of fixed points and a point fixed, form a subgroup. The members of this

one-parameter subgroup differ only in the value of the cross-ratio.
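To illustrate how a single point match lifts this last degree of freedom: from (3.3), H p = p + k v with k = (µ − 1)(a^T p)/(v^T a), and the observed point p' equals H p only up to an unknown homogeneous scale. The following hedged sketch (the helper name is hypothetical) recovers µ from one correspondence by a small least-squares solve:

    import numpy as np

    def modulus_from_match(v, a, p, p_prime):
        # Solve alpha * p' - k * v = p for (alpha, k), then invert
        # k = (mu - 1) (a.p) / (v.a) for the modulus mu.
        A = np.column_stack([p_prime, -v])
        (alpha, k), *_ = np.linalg.lstsq(A, p, rcond=None)
        return 1.0 + k * (v @ a) / (a @ p)

    # Synthetic check: build a homology with known modulus and recover it.
    v = np.array([0.0, 0.0, 1.0])
    a = np.array([0.0, 1.0, -1.0])
    mu = 3.0
    H = np.eye(3) + (mu - 1.0) * np.outer(v, a) / (v @ a)
    p = np.array([1.0, 2.0, 1.0])               # any point off the axis
    print(modulus_from_match(v, a, p, H @ p))   # ~3.0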

For a more detailed description of planar homologies and their applications to com-

puter vision, we refer to [Van Gool et al. 1998].

Figure 3.3 shows some examples of objects that are related by planar homologies.

On the left, a ladder is shown that casts a shadow on the wall to which it is fixed.

The vertex in the image is the light source (the sun), that is normally not visible in

the image.

The image to the right shows two books in a mirror-symmetric configuration. The

white book and its mirror-symmetric counterpart are in the same plane, i.e. they

are related by a planar harmonic homology. The red book pair, however, is not

coplanar, hence these two books are related by a regular planar homology. Note

that the two mirror-symmetric configurations share a common pencil of fixed lines,

however their symmetry axes are different.

3.5 Elations

A special (or degenerate) case of planar homologies arises when the fixed point is incident with the line of fixed points; such a transformation is known as an elation. Algebraically, the matrix has three equal eigenvalues, but the eigenspace is 2-dimensional.


An elation has 4 dof, one less than a general planar homology due to the constraint

that the vertex of the pencil of fixed lines lies on the line of fixed points. To uniquely

determine an elation, one must specify a

line of fixed points a = (a, b, c)^T (2 dof)

position of the vertex v = (x, y, z)^T on the line of fixed points (1 dof)

parameter µ ('scale' of the vertex, 1 dof)

In short, an elation can be determined by two point matches. Elations often arise in

practice as conjugate translations. For instance, if a pattern repeats by a translation,

like identical windows on a wall of a building, then these repeating patterns are

related by an elation in the image. The parameter µ quantifies the amount of

translation in this case.

Parameterization

Elations can be parameterized directly like general planar homologies [Hartley and

Zisserman 2000]:

H = I + µ v a^T, with a^T v = 0 (3.4)

with a the line of fixed points and v the vertex. The constraint in (3.4) expresses the incidence of the vertex with the line of fixed points.
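Equation (3.4) can be instantiated just as directly; note that v must lie on a. With a the line at infinity and v a direction, the elation reduces to a pure translation, which matches the conjugate-translation interpretation above. A minimal sketch with illustrative values:

    import numpy as np

    def elation(v, a, mu):
        # H = I + mu v a^T with a^T v = 0, eq. (3.4).
        assert abs(a @ v) < 1e-12, "vertex must lie on the line of fixed points"
        return np.eye(3) + mu * np.outer(v, a)

    a = np.array([0.0, 0.0, 1.0])   # line of fixed points: the line at infinity
    v = np.array([1.0, 0.0, 0.0])   # vertex: a direction, incident with a
    p = np.array([0.0, 0.0, 1.0])   # a point, e.g. a tile corner
    for mu in (1.0, 2.0, 3.0):      # one-parameter group action, cf. Fig. 3.4
        print(elation(v, a, mu) @ p)    # p translated by mu 'units'

    # Composing two elations adds their parameters (since a^T v = 0),
    # confirming the one-parameter subgroup structure:
    assert np.allclose(elation(v, a, 1.0) @ elation(v, a, 2.0), elation(v, a, 3.0))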

Group Action

To demonstrate the effect of a one-parameter subgroup, we start with an elation

whose parameterization is given in equation (3.4). When both a and v are fixed,

we have a one-parameter subgroup with µ as variable group parameter. This is the

situation of the floor tiling shown in Fig. 3.4 (top left), where the tiled floor is a

planar periodicity. The line of fixed points a is the horizon line of the floor plane,

the vertex v can be seen as the fixed direction of a translation and µ quantifies its

magnitude.

Note that there are many elations (e.g. horizontal, vertical, diagonal etc.) that make

up the floor tiling, and all of them are equally valid. The set of these elations shares

the same line of fixed points a, but repetitions in different directions have different

pencils of fixed lines (different vertices v) and thus belong to different subgroups.

The slightly faded images in Fig. 3.4 (top right and bottom row) show the effect of a

particular elation when applied to a single floor tile. The direction of the translation

is along the ’horizontal’ tile row towards the right border of the image. As elations


Figure 3.4: Original image (top left) and the resulting translation of a tile under

the action of a particular elation as a one-parameter subgroup of the projectivities.

Results are shown for three different values of µ1, µ2 and µ3, respectively.

with identical axes and vertices are one-parameter subgroups of the projectivities

(i.e. 1 dof), the parameter µ moves a particular tile to the right along a fixed line

towards the vertex v of the pencil of fixed lines. Figure 3.4 shows the effect for three

different choices of µ such that the resulting translation corresponds to one, two and

three ’units’ (tile length).


3.6 Summary and Conclusions

The detection of regular repetitions in images boils down to the determination of

grouping hypotheses that map repeating patterns onto each other. Under perspec-

tive skew, projective transformations capture the induced deformations of planar

patches when mapped onto their related counterparts. The determination of such 8

dof projectivities is computationally costly without prior knowledge.

The geometric concept of fixed structures offers a way out, as it allows us to home in

on specific grouping types. Fixed structures are geometric entities — like points and

lines — that remain fixed under both the original symmetry operation in the scene

and the corresponding 2D projective transformation in the image. All projective

transformations that keep the same structures fixed form subgroups of the projec-

tivities, and these can be classified based on their fixed structures. In this thesis we

focus on planar homologies. Planar homologies have 5 dof and keep a line of fixed

points and a pencil of fixed lines unaffected. Once these fixed structures are known,

the remaining dof and thus the hypothesis can be fixed efficiently by a single point

match, which is a significant reduction of complexity.

4 Basic Technologies I: Affinely Invariant Neighbourhoods

As mentioned earlier, the detection of repetitions is the first step in the

proposed grouping system. Here, we consider repetitions of small, planar

patches, and the repetitions obey to some underlying mathematical laws

(rotations excluded). The efficient detection of such patches (that are

not known in advance) calls for a generic, appearance-based representation that

is also invariant against the distortions that they undergo throughout the image.

This chapter explains how repeating patterns can be efficiently detected using local,

affinely invariant neighbourhoods. The methods for the extraction of such patterns

are vital for the understanding of the proposed grouping system and deserve a

detailed explanation. This chapter therefore presents the first of the basic technologies that our strategy is built on.

We start with the rationale that motivates the use of these sophisticated con-

structs as invariant representations of interest points. In the second section, we take

a closer look at the different neighbourhood types and their extraction. The third

section deals with moment invariants that characterize the neighbourhoods in an

invariant way again.

4.1 Motivation

The problem we face is the extraction of repeating planar patches, given no

hints about their shape and texture. Without any a priori information about the

grouping configurations that we want to detect, the question is what to look for in

this first stage of the process.

If we had a clear idea about the kind of repeating patterns that we seek, the search could be restricted to specific image features, yet at the cost of generality.

Our system, on the other hand, should be able to tackle the detection of repeating



patterns regardless of their nature and complexity. So how can this task be accomplished? We think that all those locations with a substantial change in image

intensities are promising points to start the search. These will be denoted as points

of interest. Perspective distortions certainly cause the intensity variations around

these points to be distorted as well. Hence, the efficient detection of similar points of

interest calls for a representation invariant against such distortions, in combination

with a similarity measure.

The reasons for an invariant representation for specific image features can be demon-

strated with an example. Consider again the floor tiling shown in figure 4.1. It is

evident that the shape of a tile near the lower left corner of the figure is noticeably

different from a tile in another, more remote portion of the floor. The left part

of figure 4.1 picks three arbitrary tiles from different parts of the floor. Although

they are identical in the scene, in the image these three tiles differ in shape, size and

brightness due to the perspective distortion and slightly varying illumination (right).

If each tile is characterized in a manner that is invariant under these operations, similar ones can easily be identified — their invariant description is the same.

Figure 4.1: Deformations induced by perspective skew shown on the basis of single tiles that make up a periodicity (left). The three highlighted tiles shown again (right) next to each other.

In the context of intra-image grouping, we can draw the following conclusions that are of importance for an efficient, reliable detection of repeating patterns:

Generality: Apart from planarity, the system has no a priori information about the

specific nature of repeating patterns that make up groupings. This complicates

the problem, as there is a vast diversity of possible shapes and textures. In

the past, some researchers focused on the repetitions of specific features, like

lines and line intersections ([Schaffalitzky and Zisserman 1998]), thus hinging

their systems on the presence of such features in the image. However, these


kinds of restrictions are unacceptable for a system that should deal with the

largest possible variety of regular repetitions.

Locality: Local features make it possible to overcome the drawbacks that arise due to partial

occlusions and image clutter. In addition, they avoid the need for segmentation

prior to grouping.

Invariance: Repeating patterns in images with perspective effects suffer from both

geometric and photometric distortions. Invariance allows the system to gen-

eralize from a single instance of a repeating pattern, and hence makes the

system robust to the aforementioned distortions. Through the invariance, in-

herent properties of patterns that change with each instance are filtered out.

The need for an invariant description of local features arose in the context of object

recognition and wide-baseline stereo and has now become a widely studied field of

research. This development was triggered by the work of Schmid and Mohr [Schmid

and Mohr 1997] who first identified points of interest, e.g. corners, and further on

concentrated on these points only. Each interest point is described by a rotation-

invariant feature vector of local characteristics based on local graylevel invariants.

An additional scale space is applied to overcome the changes in scale between a query image and images in a database.

The disadvantage of their approach is the level of invariance (rotation, translation,

scale) under geometric transformations. However, invariance under a wider class of

transformations is needed for various applications that work on real images.

Tuytelaars et al. came up with a number of contributions that extend the work of Schmid and Mohr towards invariance, on a local scale, under the more general class of geometric affine transformations and linear changes in intensities in each of the three colorbands [Tuytelaars and Van Gool 1999, Tuytelaars and Van Gool 2000]. Several different invariant representations are used in combination with multiple features in the immediate neighbourhood of interest points.

Many other authors have addressed the same problem in the wide-baseline stereo context (e.g., most recently, [Baumberg 2000, Matas et al. 2002]), but they exploit fewer features for their invariant representation.

We consider the affinely invariant representation of interest points as proposed by

Tuytelaars et al. well suited for the task of geometric grouping for the following

reasons:

Multiple, complementary types of invariant neighbourhoods exploit the imme-

diate environment of interest points by taking into account different features

(geometric information and raw intensity data). Depending on what is on offer

in the image, the responses of particular types might be different, but a larger


variety of repetitions can be dealt with this way — in contrast to relying on a

single type of neighbourhood only.

Affine geometric transformations and linear photometric changes are an ap-

propriate approximation for the relations between small, planar, repeating

patches that belong to a particular grouping configuration. Figure 4.1 clearly

illustrates that geometric transformations beyond translations, scalings and

rotations are required to bring these tiles into registration. Invariance un-

der affinities might at first seem contradictory to the fact that the system should

fully deal with perspective effects. However, since we consider only small, local

planar patches at this stage, this approximation is appropriate in practice.

Invariance under these transformations renders the detection of such small,

repeating patches efficient through the use of invariant-based indexing tech-

niques.

In the following we look at this invariant representation in more detail.

4.2 Affinely Invariant Neighbourhoods

The affinely invariant neighbourhoods as proposed in [Tuytelaars and Van Gool 1999,

Tuytelaars and Van Gool 2000] and [Turina et al. 2001b] are used to extract local

regions of interest in the image. These are small image patches attached to interest

points that change their shape in the image (affine transformations) in order to

cover identical physical parts of a surface independent of the relative pose with

respect to the camera (under the assumption of local planarity). As an example,

Figure 4.2 shows some invariant neighbourhoods that have been extracted on the

front plane of a box and its image in the mirror. The invariant neighbourhoods do

indeed represent the same parts of the box. The crux of the matter is that they

were extracted independently, i.e. without any information about the symmetric

neighbourhood. This is important from both a computational and practical point

of view, as no pairwise comparisons between neighbourhoods are necessary for their

extraction (→ low computational complexity), and one is not limited to a predefined

set of pattern viewing angles (→ general viewing conditions).

In addition to the affine geometric invariance, the neighbourhoods are also invariant

to linear photometric changes. It can be shown that for a Lambertian surface a change

in position of the light source results in an overall scaling of the intensities [Oren and

Nayar 1994] with the same scaling factor. A change in the illumination color, on the

other hand, corresponds to a different scale-factor for each of the three colorbands.

In short, a different scaling factor for each spectral band suffices to model the effect

of changing illumination for a Lambertian reflection.


Figure 4.2: Some affinely invariant neighbourhoods found on a box and its mirror

image. Note how the shape of the neighbourhoods is adapted such that symmet-

ric neighbourhoods cover identical parts of the box. Nevertheless, each of these

neighbourhoods was extracted independently from the others.

An additional offset for each spectral band has been shown to better model the

combined effect of diffuse and specular reflection [Wolff 1994] and to give better

performance [Reiss 1993]. This results in the following model for the changes in intensities between two small, repeating planar patterns:

$$\begin{pmatrix} R' \\ G' \\ B' \end{pmatrix} = \begin{pmatrix} s_R & 0 & 0 \\ 0 & s_G & 0 \\ 0 & 0 & s_B \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} + \begin{pmatrix} o_R \\ o_G \\ o_B \end{pmatrix} \qquad (4.1)$$
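To make the model concrete, the following minimal sketch (our own illustration, not code from this system; the scale and offset values are hypothetical) applies the per-band scaling and offset of Equation 4.1 to an RGB patch:

import numpy as np

# Per-band scaling and offset of Eq. (4.1); s and o are illustrative values.
def apply_photometric_model(patch, s=(1.2, 1.0, 0.8), o=(10.0, 5.0, 0.0)):
    """patch: H x W x 3 float array; returns the transformed patch."""
    s = np.asarray(s, dtype=float)   # per-band scale factors (s_R, s_G, s_B)
    o = np.asarray(o, dtype=float)   # per-band offsets (o_R, o_G, o_B)
    return patch * s + o             # broadcasting applies the diagonal model

patch = np.random.rand(16, 16, 3) * 255.0
transformed = apply_photometric_model(patch)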

As said, affinely invariant neighbourhoods are extracted around points of interest

that are from now on referred to as anchor points.

Anchor Points

The selection of appropriate anchor points is an important step as it reduces the

needed computation time, since not each image pixel has to be considered. Good

anchor points result in stable invariant representations, are repeatable and easy to

44 Chapter 4. Basic Technologies I: Affinely Invariant Neighbourhoods

detect with a minimum of computation time. Repeatability in the context of group-

ing means that anchor points attached to repeating patterns should be found wher-

ever instances of such patterns appear. We use two different types of anchor points:

Harris corner points [Harris and Stephens 1988] and intensity extrema. These points

typically are relatively stable under the aforementioned geometric and photometric

changes.

Harris Corner Points: A Harris corner detector selects points whose intensity profile shows substantial changes along the gradient direction as well as substantial bending orthogonal to it. As a consequence, not only corners in the classical sense are detected, but also T-junctions, endpoints of lines, points on an edge with high curvature and so on.
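As an aside, a minimal sketch of such a detector (assuming OpenCV; the input file name and the relative threshold of 0.01 are illustrative choices, not values from this system) could look as follows:

import cv2
import numpy as np

# Harris response [Harris and Stephens 1988] on a grayscale image; cells
# well above a fraction of the maximum response are kept as anchor points.
img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE).astype(np.float32)
response = cv2.cornerHarris(img, blockSize=3, ksize=3, k=0.04)
corners = np.argwhere(response > 0.01 * response.max())  # (row, col) pairs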

Harris corners are not really affinely invariant as the support over which the intensity

profile is computed is not adapted to affine deformations. Nevertheless, a recent

comparison of several different interest point detectors [Schmid et al. 2000] showed

that the Harris corner detector obtained the best score with respect to repeatability,

i.e. robustness to viewpoint and illumination change. Another advantage is that

Harris corner points typically contain a large amount of information, resulting in a

high discriminative power.

A drawback of Harris points as anchor points is the violation of the planarity as-

sumption in their immediate neighbourhood: Being corners, they often tend to lie

near the border of an object, close to a depth discontinuity.

Local Intensity Extrema: A complementary type of interest point starts from local intensity extrema of the image brightness I(x, y). After first applying some Gaussian smoothing to reduce the effect of noise (otherwise we would end up with too many unstable candidates), we apply a non-maximum suppression algorithm to extract the local extrema.
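A minimal sketch of this step (assuming SciPy; the smoothing sigma and window size are illustrative choices) could be:

import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

# Smooth first, then keep pixels that are the maximum (or minimum) of their
# local window -- a simple form of non-maximum suppression.
def intensity_extrema(I, sigma=2.0, size=9):
    S = gaussian_filter(I.astype(float), sigma)
    maxima = (S == maximum_filter(S, size))
    minima = (S == minimum_filter(S, size))
    return np.argwhere(maxima | minima)   # (row, col) anchor point coordinates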

In spite of the fact that these extrema cannot as accurately be localized as Harris

corners, they can withstand any continuous geometric deformation and monotonic

transformation of the intensity. In addition, they are less likely to lie near the border

of an object as compared to the Harris corners. From a practical viewpoint, local

intensity extrema unfold their power in situations where there are no clear, dominant

edges, for instance with blob-like structures.

The two types of anchor points yield two major classes of invariant neighbourhoods:

geometry-based neighbourhoods use geometric structures (such as corners, edges and

fitted lines) for their extraction, whereas intensity-based neighbourhoods are purely

based on image intensities.


Figure 4.3: Harris corner points.

Next, we summarize the extraction of the different invariant neighbourhood types,

although for a more detailed discussion we refer to [Van Gool et al. 2001, Tuytelaars

and Van Gool 2000, Tuytelaars and Van Gool 1999]. For an in-depth study, we

recommend [Tuytelaars 2000].

4.2.1 Geometry-based Neighbourhoods

Geometry-based neighbourhoods make use of a Harris corner point and a nearby

edge, extracted with the Canny edge detector [Canny 1986]. We have developed

two neighbourhood types that make use of either curved or straight edges for their

extraction. A third type deals with homogeneous patches (i.e. untextured patches

of uniform color), surrounded by straight edges. Homogeneity is of special interest: although such patches lack underlying texture information, they occur very often in man-made scenes (think of e.g. brick walls, floor tilings, etc.).

Curved Edges

The extraction of neighbourhoods based on curved edges starts from a Harris corner

point h close to an edge. Two points p1 and p2 move away from the corner in both


Figure 4.4: Local intensity extrema.


Figure 4.5: Geometry-based neighbourhood construction for the case of curved

edges.


directions along the edge. Their relative speed is coupled through the equality of

relative affinely invariant parameters l1 and l2 (see also Fig. 4.5):

$$l_i = \int \operatorname{abs}\left(\left|\,\mathbf{p}_i^{(1)}(s_i) \;\;\; \mathbf{h} - \mathbf{p}_i(s_i)\,\right|\right) ds_i \qquad (4.2)$$

with $s_i$ an arbitrary curve parameter (in two different directions), $\mathbf{p}_i^{(1)}(s_i)$ the first derivative of $\mathbf{p}_i(s_i)$ with respect to $s_i$, abs() the absolute value and $|\ldots|$ the deter-

minant. This condition prescribes that the areas between the joint < h,p1 > and

the edge and between the joint < h,p2 > and the edge remain identical. This is an

affinely invariant criterion indeed. Both l1 and l2 are relative affine invariants, but

their ratio l1/l2 is an absolute affine invariant and the association of a point on one

edge with a point on the other edge is also affinely invariant. From now on, we will

simply use l when referring to l1 = l2.

For each value l, the two points p1(l) and p2(l) together with the corner h define a

parallelogram Ω(l): the parallelogram spanned by the vectors p1(l) − h and p2(l) − h.

This yields a one-dimensional family of parallelogram-shaped neighbourhoods. From

this 1D family we select one or a few for which some photometric quantities of the

texture covered by the parallelogram go through an extremum. More precisely, the

photometric quantities we use are:

$$\mathrm{Inv}_1 = \operatorname{abs}\left(\frac{\left|\,\mathbf{p}_1 - \mathbf{p}_g \;\;\; \mathbf{p}_2 - \mathbf{p}_g\,\right|}{\left|\,\mathbf{h} - \mathbf{p}_1 \;\;\; \mathbf{h} - \mathbf{p}_2\,\right|}\right) \frac{M^1_{00}}{\sqrt{M^2_{00} M^0_{00} - (M^1_{00})^2}}$$

$$\mathrm{Inv}_2 = \operatorname{abs}\left(\frac{\left|\,\mathbf{h} - \mathbf{p}_g \;\;\; \mathbf{q} - \mathbf{p}_g\,\right|}{\left|\,\mathbf{h} - \mathbf{p}_1 \;\;\; \mathbf{h} - \mathbf{p}_2\,\right|}\right) \frac{M^1_{00}}{\sqrt{M^2_{00} M^0_{00} - (M^1_{00})^2}}$$

$$\text{with} \quad M^n_{pq} = \int_\Omega I^n(x, y)\, x^p y^q \, dx\, dy, \qquad \mathbf{p}_g = \left(\frac{M^1_{10}}{M^1_{00}}, \frac{M^1_{01}}{M^1_{00}}\right) \qquad (4.3)$$

with $M^n_{pq}$ the $n$th order, $(p+q)$th degree moment computed over the neighbourhood Ω(l), $\mathbf{p}_g$ the center of gravity of the neighbourhood, weighted with intensity I(x, y) (one of the three color bands R, G or B), and $\mathbf{q}$ the corner of the parallelogram opposite to the corner point $\mathbf{h}$ (see Figure 4.5). These photometric quantities

typically reach a minimum when the center of gravity passes through one of the di-

agonals of the parallelogram. The four parallelogram-shaped neighbourhoods shown

in Figure 4.2 were extracted with this method.
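For illustration, the moments $M^n_{pq}$ entering these quantities are plain weighted sums over the parallelogram; a minimal numpy sketch (our own, with the region given as a boolean mask) is:

import numpy as np

# Intensity moments M^n_pq of Eq. (4.3) over a region Omega given by a
# boolean mask; I is one color band as a float array.
def moment(I, mask, n, p, q):
    ys, xs = np.nonzero(mask)                  # pixels inside Omega
    return np.sum((I[ys, xs] ** n) * (xs ** p) * (ys ** q))

def inv_normalizer(I, mask):
    """sqrt(M^2_00 M^0_00 - (M^1_00)^2), the normalizer in Inv1 and Inv2."""
    m2 = moment(I, mask, 2, 0, 0)
    m0 = moment(I, mask, 0, 0, 0)
    m1 = moment(I, mask, 1, 0, 0)
    return np.sqrt(m2 * m0 - m1 ** 2)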

Straight Edges

In the case of straight edges, the method described above cannot be applied, since

l = 0 along the entire edge. However, since intersections of two straight edges occur

quite often, we cannot simply neglect this case.


A straightforward extension of the previous technique would then be to search for

local extrema in a 2D search-space spanned by two arbitrary parameters s1 and

s2 for the two straight edges, instead of a 1D search-space over l. However, the

functions Inv1(Ω) and Inv2(Ω) we used for the curved-edges case do not show

clear, well-defined extrema in the 2D case. Rather, we have some shallow valleys of

low values (corresponding to cases where the center of gravity lies on or close to one

of the diagonals).

Instead of taking the inaccurate local extrema of one function, we combine the two photometric quantities given in Equation 4.3: the intersections of the two 'valleys' of local minima are taken to fix the parameters s1 and s2 of the invariant

neighbourhoods, as shown in Figure 4.6. The special case where the two valleys (al-

most) coincide must be detected and rejected, since the intersection is not accurate

in that case.


Figure 4.6: Geometry-based neighbourhood construction for the straight edges

case: the intersection of the “valleys” of two different functions is used instead of a

local extremum.

Homogeneous Neighbourhoods

Finally, in case of homogeneous neighbourhoods delineated by straight edges, the

above neighbourhood extraction mechanism fails due to the lack of sufficient texture

information. However, such situations occur very often in image scenes like man-

made periodicities shown in Fig. 4.1. However, due to the homogeneity of the

neighbourhood between edges, no clear extrema emerge for the function that is

evaluated for the extraction of straight neighbourhoods and therefore this method

becomes more sensitive to noise. This results in affinely invariant neighbourhoods

that cover areas with no (perceptual) meaning.

To overcome the need for texture, we designed, in the course of this thesis, an extra neighbourhood type that is tailored to this particular situation. The idea is to make use of

the boundaries of homogeneous areas, i.e. edges, where a sudden change in intensity

occurs.



Figure 4.7: Geometry-based neighbourhood construction for the case of homoge-

neous areas bounded by straight edges. The slightly darkened homogeneous area is

part of the two-dimensional search space where f(x′, y′) reaches its extremum at the

opposite corner.

Starting from a corner point and two neighbouring straight edges again, we search for

a local extremum of a function in a two-dimensional search space (upon smoothing;

Figure 4.7). This function uses gradients along lines parallel to the edges and yields

significant responses only at the boundaries of the homogeneous areas, i.e. at the

intersection of straight edges. The function is:

$$f(x', y') = \frac{1}{x' y'} \left[\, \sum_{j=0}^{y'} D_{x'} I(x', y'_j) \cdot \sum_{i=0}^{x'} D_{y'} I(x'_i, y') \,\right] \qquad (4.4)$$

where $D_{x'}$ and $D_{y'}$ denote finite difference approximations to the gradients, $I(x', y')$ is the image intensity and $(x', y')$ a coordinate axes frame fixed to the straight edges, as

indicated in Fig. 4.7. It is assumed that the borders of a homogeneous area consist

of step discontinuities.

Among the many local extrema that might emerge in the search space, we select

only those that lead to a neighbourhood with maximal homogeneity. This is easily

achieved by an additional check for edges inside the neighbourhood spanned by the

extrema candidate. Figure 4.8 shows the neighbourhoods that have been extracted

with this method in an example image.
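As an illustration, the following sketch (our own simplification, assuming the intensity patch has already been resampled so that the corner lies at the origin and the two edges run along the axes) evaluates f(x', y') of Equation 4.4 on a grid:

import numpy as np

# Homogeneity function of Eq. (4.4); absolute finite differences are an
# illustrative robustness choice of this sketch.
def f(I):
    Dx = np.abs(np.diff(I, axis=1, prepend=I[:, :1]))  # differences along x'
    Dy = np.abs(np.diff(I, axis=0, prepend=I[:1, :]))  # differences along y'
    H, W = I.shape
    out = np.zeros((H, W))
    for y in range(1, H):
        for x in range(1, W):
            # sums of gradients along lines parallel to the two edges
            out[y, x] = Dx[:y + 1, x].sum() * Dy[y, :x + 1].sum() / (x * y)
    return out  # responds at the opposite corners of homogeneous areas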

Parameters Extracting geometry-based neighbourhoods requires a large number

of parameters in its current implementation ([Tuytelaars 2000]). Two important

parameters should be mentioned here that specify the minimum and maximum

neighbourhood size. These values delimit the search area over which local extrema

of the functions are retained. We have used values of 5 and 60 pixels for the minimum

and maximum size, respectively.


Figure 4.8: Homogeneous neighbourhoods detected in an image with a large num-

ber of homogeneous areas.

Performance The geometry-based neighbourhood extraction requires a relatively large amount of computation time. For instance, for the image shown in Figure 4.8,

we needed up to 90 seconds for the combined extraction of neighbourhoods starting

from straight edges, from curved edges and homogeneous neighbourhoods, while the

image shown in Figure 4.2 took about four minutes. This may seem prohibitively

expensive. However, we believe that a good feature extraction is vital for obtaining

a good and sufficiently general grouping system.

4.2.2 Intensity-based Neighbourhood Extraction

A drawback of the geometry-based methods described in the previous section is that

they rely to a great extent on the accurate detection of corners and edges. Any failure

in the detection of the geometric entities, such as missed corners, interrupted edges

or edges that are connected in a different way, causes the neighbourhood extraction

to fail as well. This is why we have also developed a complementary method that

uses only intensity information. Given a local extremum in intensity, the intensity

function along rays emanating from the extremum is studied, as shown in Figure 4.9.

The following function is evaluated along each ray:

$$f_I(t) = \frac{\operatorname{abs}(I(t) - I_0)}{\max\left(\dfrac{\int_0^t \operatorname{abs}(I(t) - I_0)\, dt}{t},\ d\right)} \qquad (4.5)$$



Figure 4.9: Intensity-based neighbourhood construction.

with t an arbitrary parameter along the ray, I(t) the intensity at position t, I0 the

intensity value at the extremum and d a small number which has been added to

prevent a division by zero. The point for which this function reaches an extremum

is invariant under the aforementioned affine geometric and linear photometric trans-

formations (given the ray). Typically, a maximum is reached at positions where

the intensity suddenly increases or decreases drastically. Although fI(t) as such is not invariant to the geometric and photometric transformations we consider, the positions of its local extrema are. We therefore select the points where this function reaches an extremum to make a robust selection.

Note that in theory, leaving out the denominator in the expression for fI(t) would

yield a simpler function which still has invariant positions for its local extrema. In

practice, however, this simpler function does not give as good results since its local

extrema are more shallow, resulting in inaccurate positions along the rays and hence

inaccurate neighbourhoods. With the denominator added, on the other hand, the

local extrema are in most cases more accurately localized.

Moreover, disturbances in the position of the local extremum due to a flat extremum

in the intensities hardly affect the positions of these points. Indeed, the portion in

the integral for which I(t) = I0 has only a small effect on the computed values for

fI(t).

Next, all points corresponding to maxima of fI(t) along rays originating from the

same local extremum are linked to enclose an (affinely invariant) neighbourhood (see

Figure 4.9). This often irregularly-shaped neighbourhood is replaced by an ellipse

having the same shape moments up to the second order. This ellipse-fitting is again

affinely invariant. Finally, the size of the ellipses is doubled. This leads to more

distinctive neighbourhoods, due to a more diversified texture pattern within the

neighbourhood and hence facilitates the matching process, at the cost of a higher

risk of non-planarity due to the less local character of the neighbourhoods.
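A minimal sketch (illustrative only, not the actual implementation) of evaluating f_I(t) of Equation 4.5 along a single ray of intensity samples:

import numpy as np

# ray: intensities sampled along a ray from the extremum; I0: intensity at
# the extremum; d: small constant preventing division by zero.
def f_I(ray, I0, d=1e-3):
    diff = np.abs(ray - I0)
    t = np.arange(1, len(ray) + 1)            # avoid t = 0
    running_mean = np.cumsum(diff) / t        # (1/t) * integral of |I - I0|
    return diff / np.maximum(running_mean, d)

# The boundary point on this ray sits where f_I reaches its maximum:
# boundary_index = np.argmax(f_I(ray_samples, I0))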

Figure 4.10 shows the intensity-based neighbourhoods for a detail of the image shown


in Figure 4.4. Extracting all the intensity-based invariant neighbourhoods over the

entire image was done in about 4 seconds of computation time.

Figure 4.10: Affinely invariant neighbourhoods found with the intensity-based

neighbourhood extraction method for a detail of the image shown in Figure 4.4.

4.3 Neighbourhood Description

All neighbourhoods (except for those covering purely homogeneous areas) are char-

acterized by feature vectors that are invariant under affine geometric changes and

scalings and offsets in the different color bands. More precisely, the feature vector

consists of geometric/photometric moment invariants that are composed of Gener-

alized Color Moments [Mindru et al. 1999b]. These moments contain powers of the

image coordinates and of intensities of the different color bands:

$$M^{abc}_{pq} = \iint_\Omega x^p y^q\, [R(x, y)]^a\, [G(x, y)]^b\, [B(x, y)]^c \, dx\, dy$$
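In discrete form these are again simple weighted sums; a minimal numpy sketch (our own, with the neighbourhood given as a boolean mask) is:

import numpy as np

# Generalized color moments M^{abc}_{pq} [Mindru et al. 1999b] over a
# neighbourhood mask; R, G, B are float arrays of the three color bands.
def generalized_color_moment(R, G, B, mask, p, q, a, b, c):
    ys, xs = np.nonzero(mask)
    return np.sum((xs ** p) * (ys ** q) *
                  (R[ys, xs] ** a) * (G[ys, xs] ** b) * (B[ys, xs] ** c))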

The moment invariants characterize the color patterns within the neighbourhoods.

In our experiments we used a feature vector of 18 moment invariants for the geometry-


based neighbourhoods and 9 moment invariants for the intensity-based neighbour-

hoods, composed of moments up to the first order and second degree. These in-

variant descriptions make it possible to find similar neighbourhoods without combinatorics by

using hashing techniques (→ low computational complexity). For the homogeneous

neighbourhoods covering patches with constant color, the moment invariants can-

not be used for the characterization as these become sensitive to noise. Instead, we

simply use color ratios, to obtain invariance under a single scale-factor for all three

colorbands. An assortment of all moment invariants used in our experiments can be

found in Chapter 6.

Based on this invariant description, repetitions of a pattern can be detected as a

cluster of invariant neighbourhoods in feature space. Moreover, this can be imple-

mented efficiently without resorting to combinatorics using hashing techniques (→ low computational complexity). How this is done is explained in Chapter 6.

4.4 Conclusion

The reliable and efficient detection of small, planar and unknown repeating patterns

in perspective images is far from being straightforward. Repetitions in the image

usually suffer from geometric deformations and varying illumination. Occlusion

might occur as well, which emphasizes the need for a local method. These effects

have to be overcome, and invariance offers a way out. While it is nearly impossible

to achieve invariance under projective deformations, the situation is different for the

affine case. We assume that the geometric relations between small, planar repeating

patches are indeed affine.

The first procedure in the proposed grouping system is the extraction of interest

points and their invariant representation. The affinely invariant neighbourhoods by

Tuytelaars et al. are the tools of our choice to arrive at such a representation. They

exploit various different features in the immediate environment around points of

interest, and they are invariant under affine geometric distortions and linear photo-

metric changes. We do not use a single neighbourhood type that works for all pos-

sible images under all possible circumstances; rather, we prefer a more opportunistic

system that exploits several neighbourhood types simultaneously, depending on the

image content. As a consequence, the different extraction methods might perform

variably well, but chances are good to obtain sufficient invariant neighbourhoods to

get the grouping process started.

To this end, we use two different methods for the extraction of such neighbour-

hoods: geometry-based and intensity-based. Starting with Harris corner points, the

geometry-based methods make use of nearby edges and straight lines. We further dis-

tinguish between curved edges, straight edges and homogeneous regions delineated


by straight edges. The intensity-based method works around intensity extrema and

offers advantages in cases where only insufficient geometric information is available.

Affinely invariant neighbourhoods are characterized by moment invariants that, in

turn, are made up of generalized color moments. Moment invariants characterize

the underlying texture in an affinely invariant way again. This invariant description

finally allows an efficient detection of similar neighbourhoods without resorting to

combinatorics.

5 Basic Technologies II: The Cascaded Hough Transform

Another vital tool that plays a key role in our grouping system is the

Cascaded Hough Transform or CHT for short. It boils down to an

iterated application of the traditional Hough transform for straight lines.

The CHT is the second key technique that allows a non-combinatorial

analysis of clusters of affinely invariant neighbourhoods for their regularity.

This chapter is a more formal introduction to the CHT. Later on in Chapter 7, we

explain in more detail how the CHT is applied for the extraction of fixed structures.

In the first section, the underlying ideas of the general Hough transform are reviewed.

In Section 5.2, we give an introduction to the Cascaded Hough Transform. A third

section deals with various conversions that are essential for switching between the

different Hough spaces and coordinate frames. Section 5.4 focuses on technical

aspects and Section 5.5 illustrates the application of the CHT with a real example.

The last section discusses some improvements.

5.1 The Hough Transform Revisited

The Hough transform [Illingworth and Kittler 1988, Leavers 1993] is a global, robust

technique for the detection of parameterized shapes in images, especially straight

lines. It is based on the transformation of the line points to a parameter space. Each

of these line points is characterized as the solution to some particular equation. The

most widely used and simplest form in which to express a line is the slope-intercept

form:

y = mx + b (5.1)

where m is the slope of the line and b is the y-intercept (the y value of the line when

it crosses the y axis). Any line can be characterized by these two parameters m and

b.


If we start reasoning in the dual way, we regard a point as the intersection of all

possible lines passing through it. We can characterize each of the lines passing

through this point (x, y) as having coordinates (m, b) in some slope-intercept space. In fact, for all the lines that pass through a given point, there is a unique value of b for each m:

b = y − mx (5.2)

The central idea of the Hough transform is that the set of (m, b) values corresponding

to the lines passing through point (x, y) forms a line in (m, b) space. In short, every

point in image space (x, y) corresponds to a line in parameter space (m, b), and each

point in (m, b) space corresponds to a line in (x, y) space.

The Hough transform works by letting each feature point (x, y) vote in (m, b) space

for each possible line passing through it. These votes are totaled in an accumulator.
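A minimal sketch of this voting scheme (illustrative only; the parameter ranges and bin counts are arbitrary choices) in slope-intercept space:

import numpy as np

# Each feature point (x, y) votes for the line b = y - m*x over a
# discretized range of slopes; peaks in acc correspond to lines in the data.
def hough_mb(points, m_range=(-2, 2), b_range=(-2, 2), bins=200):
    acc = np.zeros((bins, bins), dtype=int)
    ms = np.linspace(m_range[0], m_range[1], bins)
    for x, y in points:
        bs = y - ms * x                        # b for every sampled slope m
        idx = np.round((bs - b_range[0]) / (b_range[1] - b_range[0]) * (bins - 1))
        valid = (idx >= 0) & (idx < bins)
        acc[idx[valid].astype(int), np.arange(bins)[valid]] += 1
    return acc                                 # acc[b_index, m_index]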

5.2 The Cascaded Hough Transform

Here, straight lines are given a slope-intercept parametric representation, i.e. using

parameters (a, b) according to

ax + b + y = 0, (5.3)

which brings out the projective duality between points and lines explicitly through

its perfect symmetry between line coordinates (a, b) and image coordinates (x, y).

The CHT maps a pair of edge point coordinates (x, y) to a line in the (a, b) parameter space and vice versa. Indeed, the parameters a and b are to the image space (x, y) what x

and y are to the Hough space (a, b). Lines in one space can be detected as points in

the other space and, vice versa, for every point there is also a corresponding line. As

a result, the output of one Hough transform can be used directly as input for another.

This way, we can detect lines, line intersections and collinear line intersections in a

manner explained shortly. Hough schemes for the extraction of both lines and their

intersections have been proposed by others as well [Lutton et al. 1994, Xu 1988] but

not through identical cascaded transforms as in [Tuytelaars et al. 1998b].

The (a, b)-parameterization is known to cause problems as this space is unbounded.

Both a and b can take infinite values. Therefore, the polar (ρ, θ) line parameter-

ization has been introduced [Duda and Hart 1972]. This parameterization yields

a bounded parameter space. But now, a point is transformed to a cosine in pa-

rameter space, instead of a line. Hence, the symmetry between image space and

parameter space is broken. Yet, rather than going to the polar representation and

thereby losing the point/line symmetry, such problems can be avoided by splitting

the (a, b)-space into three bounded subspaces (see Figure 5.1).

Here, we will stick to the slope-intercept representation in order to preserve the

duality between the image and parameter coordinate frames.



Figure 5.1: To preserve the duality between points and lines and the simple line

parametrization while avoiding problems with an unbounded space, the original

(a, b) space is split into three subspaces.

Parameter Space The first subspace also has coordinates a and b, but is used

only for |a| ≤ 1 and |b| ≤ 1. If |a| > 1 and |b| ≤ |a|, the point (a, b) turns up in

the second subspace, with coordinates 1/a and b/a. If, finally, |b| > 1 and |a| < |b|, we use a third subspace with coordinates 1/b and a/b. In this way, the unbounded

(a, b)-space is split into three subspaces with coordinates restricted to the interval

[−1, 1], while a point (x, y) in the original space is still transformed into a line in

each of the three subspaces.
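A minimal sketch of this case analysis (our own illustration):

# Assign a point of the unbounded (a, b) space to one of the three bounded
# CHT subspaces, returning the bounded coordinates and the subspace label l.
def to_subspace(a, b):
    if abs(a) <= 1 and abs(b) <= 1:
        return (a, b, 1)            # subspace 1: coordinates (a, b)
    if abs(a) > 1 and abs(b) <= abs(a):
        return (1 / a, b / a, 2)    # subspace 2: coordinates (1/a, b/a)
    return (1 / b, a / b, 3)        # subspace 3: coordinates (1/b, a/b)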

Image Space The same parameterization is also used for the image coordinates

(x, y), yielding three subspaces with coordinates (x, y), (1/x, y/x) and (1/y, x/y).

This stands to reason as the (x, y)-space is in fact also an unbounded space, not

restricted to the size of the image itself. Vanishing points (vertices of pencils of

fixed lines) tend to fall far outside the dimensions of the image. The proposed

parameterization includes points lying at or near infinity in a natural way. Moreover,

the original image is rescaled, such that it fits entirely within the first subspace

(without changing the aspect ratio of the image). The largest dimension of the image

(usually the horizontal one) is taken to be the unit. This way, the parameterization

makes explicit positional references such as left from, right from, above or below

the field of view, depending on the subspace in which structures are found. This

representation can be interpreted as the projection onto a unit cube centered at the

focal point of the camera.


As can be seen from Figure 5.1, the CHT parameterization can also be interpreted

as an inhomogeneous discretization of the unbounded parameter space, with cells

growing larger as they get further away from the origin. This is in keeping with

the fact that points and structures lying further away are normally determined less

accurately and similar shifts in their position have less impact in the image the

further away they are. For a more detailed description we refer to [Tuytelaars et al.

1998b].

To recapitulate, points and lines in the image can be associated with points in the

CHT coordinate frame, and these are the peaks that emerge in the Hough spaces.

Yet before we discuss the necessary transformations between the image and the

CHT coordinate spaces, we first introduce the concept of the CHT-point and its

homogeneous representation. To avoid confusion, we use different notations for the

different coordinate frames:

Coordinate Frame      Notation      Example
Image                 regular       ax + b + y = 0
CHT-point             typewriter    p = (x, y, l)
Homog. CHT-point      sans serif    p = (x, y, z)

5.2.1 The CHT-point

An image point [image line] is given three parameters in this CHT representation:

a coordinate pair (x, y) [(a, b)] and a subspace label l.

$$\mathtt{p} = (\mathtt{x}, \mathtt{y}, \mathtt{l}) \quad \text{or} \quad \mathtt{l} = (\mathtt{a}, \mathtt{b}, \mathtt{l}), \qquad \mathtt{x}, \mathtt{y}, \mathtt{a}, \mathtt{b} \in [-1, 1], \quad \mathtt{l} \in \{1, 2, 3\} \qquad (5.4)$$

Such a representation is from now on referred to as a CHT-point. The term 'point'

might be misleading, since this representation holds for lines as well. However, due

to the dual nature of points and lines, such an expression is absolutely valid. The

concept of the CHT-point offers a more generic description that fully integrates the

dual relationship between points and lines.

5.2.2 Homogeneous Representation of CHT-points

The representation of image-points and image-lines through CHT-points is very

compact as it includes structures that are even far beyond the boundaries of an

image, i.e. points and lines at infinity. For practical computations, though, the

CHT-point is rather cumbersome.

A more elegant method leads to the homogeneous representation (x, y, z) of a CHT-

point (x, y, l). Basically, in the homogeneous representation each point is expressed


in terms of the coordinates of the first subspace. For a better distinction, we use

small letters for CHT-points (x, y) in the first subspace and capital letters (X, Y) for

the coordinates of CHT-points in subspaces 2 or 3. To arrive at the homogeneous

form, we proceed as follows:

l = 1: If the point (x, y, 1) is in the first subspace, the conversion is trivial: (x, y, z) =

(x, y, 1), that is the z-value equals 1.

l = 2: If the point (X, Y, 2) is in the second subspace, this means that its coordi-

nates actually represent 1/x and y/x. So we have

x = 1/X and y = Y/X

multiplying by X yields the homogeneous form:

(x/z, y/z) = (1/X, Y/X) → (x, y, z) = (1, Y, X)

l = 3: Similarly, for (X, Y, 3) the coordinate pair (X, Y) actually represents 1/y and

x/y. So we have

y = 1/X and x = y · Y = Y/X

which results in (x, y, z) = (Y, 1, X).

Thus, the homogeneous representation captures the ’subspace-membership’ in an

inherent way which makes it well suited for certain computational tasks.
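A minimal sketch of this conversion (our own illustration of the case analysis above):

# Convert a CHT-point (X, Y, l) to its homogeneous representation (x, y, z)
# expressed in first-subspace coordinates.
def cht_to_homogeneous(X, Y, l):
    if l == 1:
        return (X, Y, 1.0)   # already in the first subspace
    if l == 2:
        return (1.0, Y, X)   # (X, Y) = (1/x, y/x)
    return (Y, 1.0, X)       # l == 3: (X, Y) = (1/y, x/y)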

5.3 CHT Arithmetics

As simple as the concept of the CHT might at first seem, its actual practical usage

is not always obvious. This has to do with the different spaces, their corresponding

coordinate frames and the transformations between them (Figure 5.2). We therefore

consider a more detailed introduction to the different conversion routines appropriate

at this point.

Applying the CHT in practice boils down to toggling information between different

coordinate frames. For instance, image point coordinates (such as a pixel position in

integer coordinates or an edge point with sub-pixel accuracy) usually range between

0 and the width/height of the image (e.g. 640× 480). The same image point in its

CHT representation, however, translates to two coordinates in the range of [−1, +1]

and a subspace label (after rescaling).

Since the CHT (and Hough techniques in general) is a voting mechanism, the three

subspaces must be discretized to act as accumulators. They are usually stored


in the computer memory as buffers (multidimensional arrays) of a predefined size.

For the sake of simplicity, we assume such buffers to be two-dimensional arrays.

Thus, a CHT-point has its corresponding position in such a 2D-array: the coordinates of a voting cell.

Figure 5.2: A point in the image (left) and its representation as a CHT-point (middle) in the first subspace. As the CHT subspaces are accumulators, a CHT-point corresponds to a pixel in an accumulator (right).

The mapping of a CHT-point to its corresponding

accumulator cell is a sort of linear scaling. A CHT-point coordinate c is transformed

to its buffer position by

(c + 1) ·R/2 (5.5)

where R is the size of the (square) accumulator array in pixels. The resulting

value is rounded. Since the CHT-subspaces and accumulators are both squares, no

distinction is necessary for the x and y coordinate. The reverse mapping from the

accumulator to CHT-point coordinates is simply the inverse of (5.5).
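A minimal sketch of this mapping and its inverse (our own; boundary clamping is omitted for brevity):

# CHT-point coordinate c in [-1, 1] <-> accumulator cell index, for a square
# accumulator of R x R cells (R = 401 in our experiments, see Section 5.6).
def to_cell(c, R=401):
    return int(round((c + 1) * R / 2))    # Eq. (5.5)

def to_coordinate(cell, R=401):
    return 2 * cell / R - 1               # inverse of Eq. (5.5)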

In the following, we explain the less-trivial conversion routines of points and lines

between the image and CHT coordinate frames more thoroughly. First, we show

how to get the CHT-point representations for points and lines in the image. The

second part deals with the other direction.

5.3.1 Image Frame −→ CHT Frame

Before the CHT can be applied, its input must first be transformed to the CHT

coordinate frame, i.e. the inputs are CHT-points. In short, the features that we use

as CHT input are image points and image lines, see Chapter 6 for more details.

Practically, the lines that we use as input are always given by two points. The

question now is: Given a point or line in the image, how can we express it as a CHT-point?


Image Point → CHT-point

As mentioned at the beginning, image points (within the image boundaries) are

first rescaled so that they fit entirely within the first subspace, i.e. the original

pixel coordinates (x, y) are transformed such that they fall entirely in the interval

[−1, +1]×[−1, +1]. This is achieved through some sort of anisotropic scaling. Hence,

the CHT-point representation (x, y, l) of an image point (x, y) is obtained by

$$\mathtt{x} = \frac{x + o_x}{\Delta} - 1, \qquad \mathtt{y} = \frac{y + o_y}{\Delta} - 1, \qquad \mathtt{l} \equiv 1 \qquad (5.6)$$

with $\Delta = \max\{w, h\}/2$, $w, h$ the width and the height of the image, resp., and

ox, oy the offsets in x and y direction. These offsets account for the deviations of an

image from a square, such that the image center falls to the origin of the first CHT

subspace after rescaling. The meaning of the offsets can be seen from Figure 5.2.

The example image in this figure is larger in width than in height, so there is only

an offset in the y direction.

Note that points within the width and height of an image are transformed to the

first subspace (x, y ∈ [−1, 1], l ≡ 1). The situation is different for points beyond the

image boundaries: Their coordinate values in the CHT frames have values larger

than 1, so they will be either in subspace 2 or 3.

Image Line → CHT-point

We assume that an image line can always be specified by two image points. The two

points are first transformed to their corresponding normalized homogeneous CHT

representations (xi/zi, yi/zi, 1). We end up with two equations for two unknowns:

ax1 + b + y1 = 0

ax2 + b + y2 = 0,(5.7)

which can easily be solved for a. If we consider the homogeneous form of the line

equations in (5.7), we get the following:

$$a = \frac{y_2 - y_1}{x_1 - x_2} \equiv \frac{\mathsf{a}}{\mathsf{c}} \qquad (5.8)$$

Hence, two out of three parameters (a and c) are known and b can be determined

by substituting the results into (5.7). Finally, one obtains the homogeneous line


representation (a, b, c) in the CHT coordinate frame:

$$\mathsf{a} = y_2 - y_1, \qquad \mathsf{b} = x_2 y_1 - x_1 y_2, \qquad \mathsf{c} = x_1 - x_2$$

The line representation as a CHT-point is easily obtained by determining the sub-

space for the ratios (a/c, b/c) as explained in Section 5.2.
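A minimal sketch (our own) of this two-point construction, consistent with Equation 5.8:

# Homogeneous line (a, b, c) through two normalized points for the CHT line
# parameterization a*x + b + y = 0, with slope parameters (a/c, b/c).
def line_through(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    a = y2 - y1
    b = x2 * y1 - x1 * y2
    c = x1 - x2
    return a, b, c    # map the ratios (a/c, b/c) to a subspace (Section 5.2)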

5.3.2 CHT-Frame → Image-Frame

The conversions given here are needed when peaks detected in the CHT subspaces

must be given their appropriate coordinates / parameters in the image. Again, one

must distinguish between points and lines.

CHT-Point → Image-Point

The transformation of a CHT-point (x, y, l ≡ 1) back to the image coordinate frame

is straightforward:

$$x = (\mathtt{x} + 1) \cdot \Delta - o_x, \qquad y = (\mathtt{y} + 1) \cdot \Delta - o_y \qquad (5.9)$$

which is simply the reverse of (5.6). For CHT-points in subspaces 2 or 3, the

coordinate X actually corresponds to 1/x (subspace 2) or 1/y (subspace 3), and

similarly for the Y coordinate. Solving (X, Y) for (x, y) yields a point in subspace 1 with coordinate values > 1, which means that the point is outside of the image

boundaries. Nevertheless, (5.9) yields the correct result for these cases as well.

For a homogeneous CHT-point $(\mathsf{x}, \mathsf{y}, \mathsf{z})^\top$, the conversion back to the image coordinate frame is more elegant. A multiplication with

$$\begin{pmatrix} \Delta & 0 & \Delta - o_x \\ 0 & \Delta & \Delta - o_y \\ 0 & 0 & 1 \end{pmatrix}, \qquad \Delta = \max\{w, h\}/2 \qquad (5.10)$$

does the job and yields its homogeneous counterpart $(x, y, z)^\top$ in the image coordi-

nate frame. Its true location is easily determined after normalization (i.e. division

by z).


CHT-Point → Image-Line

The backmapping of CHT-points that correspond to lines is somewhat more difficult.

Most graphical drawing programs and computer vision libraries represent lines in

the image coordinate frame by

ax + by + cz = 0, (5.11)

which is clearly different from the CHT line parameterization ax + b + y = 0. The link

between the two forms can be established through homogeneous coordinates again:

$$\frac{a}{c} \cdot \frac{x}{z} + \frac{b}{c} \cdot \frac{z}{z} + \frac{y}{z} = 0 \qquad (5.12)$$

Multiplying (5.12) with cz yields the homogeneous line equation that is more similar

to the classic parameterization (5.11):

ax + bz + cy = 0

Note that b and c are interchanged! Hence, to arrive at the classical form (5.11)

one uses the inverse transpose of (5.10), i.e. the homogeneous CHT-point $(\mathsf{a}, \mathsf{c}, \mathsf{b})^\top$ is multiplied such that

$$\begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ o_x - \Delta & o_y - \Delta & \Delta \end{pmatrix} \cdot \begin{pmatrix} \mathsf{a} \\ \mathsf{c} \\ \mathsf{b} \end{pmatrix} \qquad (5.13)$$

Notice again that b and c on the right-hand side of (5.13) are interchanged!

5.4 Applying the CHT

The foregoing discussion gave us the necessary tools for switching back and forth

between the image and the CHT coordinate frames. Now we turn to the Hough-

properties of the CHT.

5.4.1 Hough Transform

Technically, the CHT is applied on CHT-points. This is convenient in that the actual

nature of the input (points or lines) is unimportant. If the CHT-point represents a

point, applying the Hough on it yields all those values of a and b such that the condition

(5.3) is fulfilled. In practice, the value of each voting cell that corresponds to the

correct value of a and b is increased by 1. This way, we obtain a one-pixel wide line

in the accumulators.


More precisely, CHT-points that serve as input are mapped to their corresponding

input-buffers first (see Figure 5.2). In the end, these buffers are the actual input

to the CHT. Of course, there is no difficulty in applying the transform directly on

CHT-points. However, having the input CHT-points in such buffers enables the

peak-validation mechanism explained in Section 5.4.3.

In addition, the cascading is made possible this way: once the output buffers of a first Hough transform have been filtered, these very buffers are utilized as input for a subsequent Hough.

Because of noise, discretization of both the image and accumulators, and factors

inherent to the application itself (imprecisions in the data used as input, see Chap-

ter 7), we want to allow a little tolerance in fitting the lines to the input data. This is

done by allowing a feature point to vote not just for a sharp line in the accumulator,

but to cast fuzzy votes for nearby accumulator cells. In essence, this votes not just

for all lines that pass through that feature point but also for those that pass close by.

More precisely, instead of only increasing one accumulator cell that corresponds to

the position (a, b), we also add votes to the neighbouring cells, where the votes are

weighted with a Gaussian. That is, the further away a neighbouring cell, the smaller

the value by which it is incremented. Figure 5.3 illustrates this effect when applied

to four perfectly aligned and four less perfectly aligned collinear input points.
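A minimal sketch (our own) of such a fuzzy vote around an accumulator cell:

import numpy as np

# Cast a Gaussian-weighted vote around cell (row, col) of a float
# accumulator instead of incrementing a single cell; sigma = 5 matches the
# setting of Figure 5.3, the truncation radius is an illustrative choice.
def cast_fuzzy_vote(acc, row, col, sigma=5.0, radius=10):
    R, C = acc.shape
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = row + dr, col + dc
            if 0 <= r < R and 0 <= c < C:
                acc[r, c] += np.exp(-(dr * dr + dc * dc) / (2 * sigma * sigma))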

5.4.2 Peak Extraction

The important, non-accidental structures that we are interested in (e.g. vanishing

points and vanishing lines) emerge as peaks in the different Hough subspaces. The

goal is the localization and extraction of such peaks. The more salient such peaks

are, the higher the chance that they indeed correspond to non-accidental structures.

In practice, however, Hough buffers tend to be noisy and crowded, hence relevant

peaks are not always easy to extract. This has to do with the number of the outliers

among the input used, as well as with their accuracy. Discretization errors contribute

to the problem of peak detection as well. In spite of the large Hough literature, a

generic solution to buffer filtering does not yet exist.

Those accumulator cells that obtained the highest number of votes are certainly

the most promising candidates to start with. The detection of the cell with the

largest number of votes is a trivial task when only the highest peak is of interest.

Our situation is different in that we are interested in a reliable extraction of multiple

peaks, because these might all correspond to important structures. Figure 5.4 shows

a typical Hough subspace with a few salient peaks. As can be seen, the majority

of the accumulator cells obtained votes, but only a few of them are of importance.

Finding these few important cells boils down to the detection of local maxima.


Figure 5.3: Resulting accumulator buffers after a Hough transform on four collinear

input points (magnified cutout of 30×30 pixels around the intersection point). Top

row: Part of the buffer when the neighbourhood of the sampled accumulator cells

are incremented according to a Gaussian with σ = 5. Bottom row: Only the cells

at discrete sampling locations were incremented. Left column: All four input points

are perfectly collinear. Right column: Input points slightly deviated from perfect

collinearity. Clearly, the intersection peak is more outspoken when smoothing is

applied (right column, top), whereas several spurious peaks emerge when smoothing

is omitted (right column, bottom).


Figure 5.4: A CHT subspace accumulator (left) with the second peak (from above)

magnified (right). The darker the pixels, the more votes they received. Accumulator

cells that received zero votes are shown in white.

This task is further complicated in that a local maximum is not clearly apparent

even for a human observer: the magnified peak in Figure 5.4 (right) illustrates the

problem of determining a local maximum with a rather fuzzy structure.

We use a sort of non-maximum suppression for the extraction of local peaks. More

precisely, we start with the accumulator cell that received the maximal number of

votes among all three subspaces. We clear the neighbourhood around the current

peak, thereby setting all those cells to zero that have a gradually decreasing number

of votes. When no more such cells are left, the candidate peak is removed and the

procedure starts again.
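A simplified sketch of this procedure (our own reading; the flood fill over non-increasing neighbours approximates the clearing step described above):

import numpy as np

# Iteratively take the strongest cell, record it, and clear the
# monotonically decreasing region around it before searching again.
def extract_peaks(acc, n_peaks=10, min_votes=1):
    acc = acc.copy()
    peaks = []
    for _ in range(n_peaks):
        r, c = np.unravel_index(np.argmax(acc), acc.shape)
        if acc[r, c] < min_votes:
            break
        peaks.append((r, c, acc[r, c]))
        stack = [(r, c)]
        while stack:                          # clear the peak's slope
            y, x = stack.pop()
            v = acc[y, x]
            if v == 0:
                continue
            acc[y, x] = 0
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < acc.shape[0] and 0 <= nx < acc.shape[1] \
                        and 0 < acc[ny, nx] <= v:
                    stack.append((ny, nx))
    return peaks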

Salient peaks having a sharp shape are rather the exception than the rule. Based on

our experiments, the general shape of a peak looks rather blurred, and no definite

local maximum can be identified. The reason is that several adjacent cells received

the same number of votes. In this case we start the local peak extraction at the

average position.

Our experiments have shown that the sole extraction of peaks in the Hough spaces

is insufficient. Depending on the situation, peaks that obtained many votes by

the CHT might be erroneous for the reasons mentioned in Section 5.4.1. As a

consequence, far too many candidates for fixed structures might result.

An additional validation mechanism offers a possibility to discard those peaks that

form local maxima (with respect to the peak extraction process), but do not cor-

respond to actual non-accidental structures, i.e. they are a product of noise or

other effects. In essence, our validation technique returns to the previous input level


and checks for the support for each individual peak under validation. Through the

cascading property of the CHT, this can be carried out efficiently.

5.4.3 Peak Validation

Figure 5.5: Schematic sampling of an input buffer. Points met within the sampling swath (like the one shown here) belong to the support of the peak under validation.

The peak extraction method extracts many local peaks in the accumulator spaces; however, not all of them correspond to real structures that are present in the input

data. The iteration of the transforms worsens the situation: spurious input peaks

at the current level result in even more buffer noise at the next level, which in turn

renders the peak extraction more difficult. We therefore try to keep the number of

input points low.

Fortunately, the dual nature of the CHT parameterization allows a fast rejection of

accidental and spurious structures. It amounts to applying the CHT in the ’reverse’

direction: given a candidate peak, e.g. (a0, b0), apply a Hough transform on it. The

set of points (x, y) fulfilling the condition a0x + b0 + y = 0 is a line passing through

the (dual) subspaces. But instead of incrementing the corresponding cells of new

accumulators, the cells of the existing input buffer(s) at the previous level are sam-

pled along this line. Input points met along the sampling path are recorded and are

said to support the peak under validation. If the sampling does not ’hit’ a sufficient

number of points, it can be assumed that the peak is incorrect. Many wrong, spu-

rious peaks can be rejected with this method: they are less likely to have sufficient

support at the previous level. Furthermore, the support of a particular structure

is again validated. This way, the support is tracked down to the very beginning of

the CHT cascade. The validation procedure also emphasizes the advantage of both

applying the Hough on an input buffer (instead of on CHT-points directly) and the

smoothing described in Section 5.4.1.
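A minimal sketch of this support check (our own simplification that ignores the three-subspace bookkeeping; to_cell is the coordinate-to-cell mapping from Section 5.3):

# Given a candidate peak (a0, b0), sample the previous-level input buffer
# along the dual line a0*x + b0 + y = 0 and count the input points met.
def has_support(input_buffer, a0, b0, to_cell, min_support=3):
    R = input_buffer.shape[0]
    hits = 0
    for col in range(R):
        x = 2 * col / R - 1          # accumulator column -> coordinate in [-1, 1]
        y = -a0 * x - b0             # point on the dual line
        row = to_cell(y)
        if 0 <= row < R and input_buffer[row, col] > 0:
            hits += 1
    return hits >= min_support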


5.5 Example

Here we illustrate how the mechanism of the CHT is applied to analyze a set of points for its spatial layout. In particular, we are interested in whether vanishing points can be detected for the situation shown in Figure 5.6. Note that we come to the

general strategy for the detection of fixed structures in Chapter 7. For the moment,

two successive applications of the Hough transform do the job. A first transform

yields collinear arrangements among the points under investigation, and a second

one finds those locations where collinear structures intersect, i.e. vanishing points.


Figure 5.6: A cluster of affinely invariant neighbourhoods (left) whose centerpoints

are used as input to the CHT. Right: the input points shown in the CHT coordinate

frame.

We start with a cluster of affinely invariant neighbourhoods (Figure 5.6 left) whose

centerpoints are used as input for the CHT. Figure 5.6 (right) shows the cluster

centerpoints as CHT-points in the first subspace. Next, the Hough transform is

applied for the first time, leading to the unfiltered accumulators shown in the top row

of Figure 5.7. The middle row shows the resulting local maxima found by the peak

extraction method. The bottom row shows the remaining peaks after the validation.

Figure 5.8 makes the effect of the peak validation apparent in the image. On the left,

all peaks in the middle row of Figure 5.7 are drawn as image lines after conversion

to the image coordinate frame. The same is done for the bottom row of Figure 5.7,

leading to the configuration in Figure 5.8 right. Obviously, most of the lines did not

get beyond the validation stage. Almost all peaks in the second subspace (Figure 5.7,

middle and bottom row) were discarded. However, the remaining lines point indeed

to the principal directions of the floor layout. Note that most of the rejected peaks

of the second subspace correspond to the converging vertical lines in Figure 5.8 (left).

Although they form a meaningful pencil of fixed lines, each of these fixed lines is supported by only two input points, which is insufficient to be considered non-accidental.



Figure 5.7: Unfiltered accumulator buffers (subspaces 1,2 and 3) after the first

Hough transform (top row). Middle row: Local maxima obtained by the peak

extraction method described in the text. Bottom row: Remaining peaks after the

validation routine in Section 5.4.3 was applied. Darker peaks received more votes

than brighter ones.


Figure 5.8: Collinear structures among the neighbourhood center points before

(left) and after application of the peak validation routine (right).

The peaks shown in the bottom row of Figure 5.7 now serve as input for a next

Hough.

Figure 5.9: Resulting accumulator buffers after the second Hough transform. The three peaks that correspond to common intersection points of collinear structures (vertices of pencils of fixed lines) are the ones emerging in the second subspace (middle, marked with a circle).

This second transform picks up collinear peaks from the first Hough, which correspond to intersection points of collinear structures among the neighbourhood

centers in the image. The output of the second Hough is shown in Figure 5.9,

with vanishing point candidates in the second subspace. The original input points

re-emerge as peaks in the first subspace, but they are not further considered; see

Chapter 7 for details. Figure 5.10 shows the position of two of the three peaks in

the image coordinate frame, together with their contributing lines.



Figure 5.10: The line intersections (peaks 1 and 3 from Figure 5.9) shown in

the image coordinate frame, together with their supporting lines, i.e. the collinear

structures that contributed to the peaks. The proximity to the origin of peak 2

translates to a location far beyond the image boundaries.

5.6 Discussion

The cascaded Hough transform is an efficient tool for the detection of spatial struc-

tures. It rests on the premise of the point-line duality, which enables the application

of one Hough transform on the output of a previous one. The CHT offers advantages over alternative fitting techniques such as RANSAC, but it should be mentioned that there are some computational drawbacks and open questions as well. In the following, we take a closer look at them.

5.6.1 Accuracy vs. Resolution

In principle, the accuracy of the CHT output (point/line parameters) can be well

adapted to the specific needs at hand. Accuracy is primarily adjusted via the size of

the accumulator buffers. A high degree of accuracy is certainly desirable; however, this comes at the cost of computation time and storage requirements. While it is generally true that modern desktop computers are much more powerful than in the early years of the Hough transform (the 1960s), overly large accumulator sizes still cause additional computational expenses.

In our experiments, we used fixed accumulator buffers of size 401 × 401 pixels,

which has proven appropriate for normal images.

5.6.2 Computational Complexity

Hough transforms are generally known to be computationally rather expensive. In

our situation, the computational complexity is highly dependent on the particular


image, especially on the number of input points. The most expensive part is not

the Hough transform per se, but the extraction of the peaks. As an unsurprising

rule of thumb, the more ’noise’ present in the Hough spaces, the more time the peak

extraction needs.

5.6.3 Peak Extraction

Plenty of room is available for optimization when it comes to the extraction of peaks.

Some ideas that we consider to be promising:

Currently, each CHT subspace is treated in isolation during the peak extraction

process. This is a drawback for peaks very close to the borders of a subspace:

trailing edges might well extend into adjacent subspace(s). A peak extraction

mechanism that does not stop at the boundaries might improve accuracy and

robustness.

In our experiments, we used accumulators of fixed size. However, a system that

incorporates multiple resolutions might help to better 'separate the wheat from the chaff': Similar to scale-space approaches, peaks that really correspond to important structures can be traced across multiple resolutions. Based on our observations, it is not so much the absolute height of a peak that matters, but rather its local neighbourhood. A smaller, isolated peak is more likely to correspond to a non-accidental, meaningful structure than a larger one immediately surrounded by others. Similar observations were reported by [Liu and Collins 2000] for the extraction of peaks in autocorrelation functions. Although we do apply a separate peak evaluation technique that helps in reducing noise, not all incorrect peaks can be rejected that way. We therefore think that the tracking of peaks across multiple resolutions is a promising approach to keep only relevant structures; however, a trade-off must be found between the number of resolutions and the resulting increase in computation time.
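To make the role of isolation concrete, the following minimal Python sketch (our own illustration, not the thesis implementation; the window radius and the scipy-based filters are choices made here for brevity) ranks the local maxima of an accumulator by how much they dominate their surroundings rather than by absolute height:

```python
import numpy as np
from scipy.ndimage import maximum_filter, uniform_filter

def isolated_peaks(acc, radius=5):
    """Rank local maxima of accumulator 'acc' by isolation: a peak that
    clearly stands out from its neighbourhood scores higher than an
    equally tall one embedded among other peaks."""
    size = 2 * radius + 1
    is_max = (acc == maximum_filter(acc, size=size)) & (acc > 0)
    surround = uniform_filter(acc, size=size)      # local average height
    score = np.where(is_max, acc - surround, -np.inf)
    ys, xs = np.nonzero(is_max)
    order = np.argsort(-score[ys, xs])
    return list(zip(ys[order], xs[order]))         # most isolated peaks first
```

Tracing such peaks across accumulators of different resolutions would then amount to running this extraction per resolution and linking nearby peak positions.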

5.6.4 Alternative Parameterization

A slope-intercept line parameterization that is equally symmetric with respect to

(a, b) and (x, y) is

ax + by + 1 = 0 (5.14)

This parameterization offers the additional advantage that it is closely related to the

classical line equation in homogeneous coordinates, ax + by + cz = 0. As a consequence, a line-corresponding CHT-point (a, b, l) can be converted to its homogeneous form in the same way as point-corresponding CHT-points, which directly yields the image-line parameters (a, b, c) of the classical parameterization.
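As an illustration of this conversion, consider the following minimal Python sketch. It assumes the convention that the three bounded subspaces store the coordinate pairs (a, b), (1/a, b/a) and (1/b, a/b), respectively (our reading of the subspace construction; the function name is ours as well):

```python
def cht_line_to_homogeneous(u, v, subspace):
    """Convert the CHT coordinates (u, v) of a line to homogeneous
    image-line parameters (a, b, c) with a*x + b*y + c = 0."""
    if subspace == 1:    # (u, v) = (a, b):     a*x + b*y + 1 = 0
        return (u, v, 1.0)
    if subspace == 2:    # (u, v) = (1/a, b/a): scale by a -> (1, v, u)
        return (1.0, v, u)
    if subspace == 3:    # (u, v) = (1/b, a/b): scale by b -> (v, 1, u)
        return (v, 1.0, u)
    raise ValueError("subspace must be 1, 2 or 3")

# Example: the image line x + 2y + 1 = 0 has |b| > |a|, so it is stored in
# the third subspace as (u, v) = (1/2, 1/2); the conversion returns
# (1/2, 1, 1/2), the same homogeneous line up to scale.
```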


5.7 Summary and Conclusion

The CHT allows the application of a Hough transform on the output of a previous Hough transform through a symmetric slope-intercept parameterization of lines.

This way, one can detect collinear structures, intersection points of collinear struc-

tures, collinear configurations of intersection points etc.

Since both (a, b) and (x, y) are unbounded, the entire parameter space is split into

three bounded subspaces. Points and lines in the image can be represented as points

in CHT-subspaces. We introduced the concept of a CHT-point to handle the various conversions between the image and the CHT coordinate frames.

Peaks in the CHT subspaces are extracted using a form of non-maximum suppression, followed by a peak-validation routine whose goal is to reject wrong and spurious peaks at the earliest possible stage.

Of course, all kinds of refinements are possible, such as an in-depth analysis of the

effects of the resolution of the parameter spaces and trade-offs thereof, or improve-

ments of the peak extraction technique. Finally, an alternative slope-intercept line parameterization that also fully exploits the point/line duality is possible as well.

6 Detection of Repetitions

After having dealt with the basic technologies in the previous two chap-

ters, we can now turn the attention to the actual grouping process. As

mentioned in the introductory part of this report, the entire grouping

process consists of two principal steps. First, we look for repetitions in

the image, thereby using the affinely invariant neighbourhoods. The second step

then analyzes the spatial configuration of the found repetitions through the use of

the cascaded Hough transform. The most important aspect here is that the two

steps are carried out without the use of extensive combinatorics.

This chapter deals with the detection of repetitions irrespective of their regularity

and is organized as follows. After a short introduction, Section 6.2 explains how affinely invariant neighbourhoods are described by means of moment invariants. Section 6.3 describes how affinely invariant neighbourhoods can be compared. Next, in

Section 6.4, a matching technique is proposed that looks for clusters of invariant

neighbourhoods in the feature space. Section 6.5 demonstrates the entire process

with a real example. Section 6.6 discusses some aspects of this strategy, and Sec-

tion 6.7 concludes this chapter.

6.1 Introduction

Here, we focus on the detection of small, repeating planar patches, in particular the

affinely invariant neighbourhoods described in Chapter 4. These features are more

general than the lines, line intersections and fixed-sized patches used by others (e.g. [Leung and Malik 1996, Schaffalitzky and Zisserman 1998]). Moreover, the specific

kind of features to be used in an individual image is not selected beforehand by the

user of the system.

We simply run each type of neighbourhood extractor described in Chapter 4 on each

image. The local character of these neighbourhoods improves the robustness of the

system to occlusions, while the invariance to pattern pose and illumination changes


makes it possible to detect the repetition under oblique viewpoints and non-uniform

illumination.

The fact that only invariance under affine geometric deformations is considered may

seem to contradict the aim of dealing with perspective distortions. This restriction

is acceptable in practice though, since the neighbourhoods themselves are rather

small and affinities are a good model for the perspective deformations on such a

local scale. Furthermore, the later, less local steps of the grouping process will deal with the full

perspective effects.

Simplifying further to similarity- or Euclidean transformations is unacceptable for

the kind of images we want to work with. Look for instance at the example of a

tiled floor shown in Figure 1.1. Tiles on the left cannot be mapped onto the tiles on

the right of the figure simply by translation, rotation and adjustment of scale. An

affine transformation, however, is well suited for this task.

In the next section, we explain in more detail how the affinely invariant neighbour-

hoods can be characterized in an invariant way again. This invariant characteriza-

tion is essential for finding similar neighbourhoods under the considered group of

transformations.

6.2 Invariant Description

The techniques presented in Chapter 4 allow the delineation of local, affinely invari-

ant neighbourhoods. With them, local affine invariants can be extracted, computed over the corresponding support independently of viewpoint and illumination. Such invariants are indispensable for an efficient matching of local features

even under strong perspective distortions. As in the neighbourhood extraction step,

we consider invariance both under affine geometric changes and linear photometric

changes (see Equation (4.1) in Chapter 4) in each of the three colorbands.

Each neighbourhood is characterized by a feature vector of moment invariants. The

sole exception is formed by the homogeneous neighbourhoods (see Section 4.2.1). These cover

purely homogeneous patches, which is a rather degenerate case for moments (noise).

For this situation, we use average color ratios for their characterization, to be at

least partially invariant against illumination changes.

The moments we use are generalized color moments [Mindru et al. 1999b]. These

moments integrate powers of the image coordinates and the intensity information

over a neighbourhood:

$$M^{abc}_{pq} = \iint_{\Omega} x^p y^q\,[R(x, y)]^a\,[G(x, y)]^b\,[B(x, y)]^c\; dx\, dy \qquad (6.1)$$


with order p + q and degree a + b + c. In fact, they implicitly characterize the shape,

the intensity and the color distribution of the underlying neighbourhood pattern in

a uniform manner. Moment invariants use all information available on a local scale

(geometry, texture and color).
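For concreteness, a direct discretization of Equation (6.1) can be written in a few lines of Python (a sketch under the assumption that the neighbourhood Ω is given as a binary mask; function and variable names are ours):

```python
import numpy as np

def generalized_color_moment(img, mask, p, q, a, b, c):
    """Discrete approximation of Eq. (6.1): the generalized color moment
    M^{abc}_{pq} of the region delineated by 'mask'.

    img  -- float array of shape (H, W, 3) with the R, G, B bands
    mask -- boolean array of shape (H, W) delineating the neighbourhood
    """
    ys, xs = np.nonzero(mask)                 # pixel coordinates inside Omega
    R = img[ys, xs, 0]
    G = img[ys, xs, 1]
    B = img[ys, xs, 2]
    return float(np.sum(xs**p * ys**q * R**a * G**b * B**c))

# For instance, M^{000}_{00} is the area of the neighbourhood, and
# M^{100}_{10} / M^{100}_{00} is the x-coordinate of the R-weighted centroid.
```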

6.2.1 Generic Affinely Invariant Feature Vectors

Several sets of feature vectors have been tested in the course of this work. The first set of

affine moment invariants is also the most generic one, as it can be used for any kind

of affinely invariant neighbourhood, in contrast to the other set (explained in Sec-

tion 6.2.2) that exploits some special properties of specific types of neighbourhoods.

The first set of the more generic invariants makes up a feature vector that is composed of 18 moment invariants. These are invariant functions of moments up to the first

order and third degree. In [Mindru et al. 1999b], it has been proven that the 18

invariants form a basis for all invariants under the considered group of geometric

and photometric transformations involving this kind of moments. An overview of

all 18 invariants is given in Table 6.1.

The reason why these expressions are so complex is that they also take the photo-

metric transformations fully into account, i.e. both a scaling and an offset for each

spectral band. As a result, it is hard to give a physical interpretation. Nevertheless,

the meaning of the 18 moment invariants listed in Table 6.1 can be summarized as interactions of areas, centers of gravity (weighted with one or more colorbands), correlations of colorbands, relative positions of centers of gravity weighted with two different colorbands, and several combinations thereof. By combining geometric with

photometric information, the compensation for illumination changes gets more and

more difficult, hence the growing complexity of the expressions. For a more detailed

interpretation, we refer to [Mindru et al. 1999b, Tuytelaars 2000].

6.2.2 Normalized Feature Vectors

The feature vector presented in the previous section is generally applicable to any

type of affinely invariant neighbourhood. The advantage is that this allows treating all neighbourhoods in the same way, and new neighbourhood types can easily be added.

However, sometimes knowledge about the neighbourhood extraction can be ex-

ploited to derive simpler invariant expressions with lower order moments and hence

resulting in more stable feature descriptions. This can be achieved by normalizing

the neighbourhood against such transformations.


Table 6.1: Moment invariants used for comparing the patterns within an invariant neighbourhood.

$$S^{R}_{12} = \frac{\left[\begin{array}{l}
M^{200}_{10}M^{100}_{01}M^{000}_{00} - M^{200}_{10}M^{100}_{00}M^{000}_{01} - M^{200}_{01}M^{100}_{10}M^{000}_{00}\\
+\, M^{200}_{01}M^{100}_{00}M^{000}_{10} + M^{200}_{00}M^{100}_{10}M^{000}_{01} - M^{200}_{00}M^{100}_{01}M^{000}_{10}
\end{array}\right]^2}{(M^{000}_{00})^2\left[M^{200}_{00}M^{000}_{00} - (M^{100}_{00})^2\right]^3}$$

$$D1^{RG}_{12} = \frac{\left[\begin{array}{l}
M^{100}_{10}M^{010}_{01}M^{000}_{00} - M^{100}_{10}M^{010}_{00}M^{000}_{01} - M^{100}_{01}M^{010}_{10}M^{000}_{00}\\
+\, M^{100}_{01}M^{010}_{00}M^{000}_{10} + M^{100}_{00}M^{010}_{10}M^{000}_{01} - M^{100}_{00}M^{010}_{01}M^{000}_{10}
\end{array}\right]^2}{(M^{000}_{00})^4\left[M^{200}_{00}M^{000}_{00} - (M^{100}_{00})^2\right]\left[M^{020}_{00}M^{000}_{00} - (M^{010}_{00})^2\right]}$$

$$D2^{RG}_{12} = \frac{\left[\begin{array}{l}
(M^{000}_{00})^2M^{100}_{10}M^{020}_{01} - M^{000}_{00}M^{100}_{10}M^{000}_{01}M^{020}_{00}\\
-\, 2M^{000}_{00}M^{010}_{01}M^{010}_{00}M^{100}_{10} + 2M^{000}_{01}(M^{010}_{00})^2M^{100}_{10}\\
-\, M^{000}_{00}M^{000}_{10}M^{100}_{00}M^{020}_{01} + 2M^{000}_{10}M^{010}_{00}M^{100}_{00}M^{010}_{01}\\
-\, (M^{000}_{00})^2M^{100}_{01}M^{020}_{10} + M^{000}_{00}M^{100}_{01}M^{000}_{10}M^{020}_{00}\\
+\, 2M^{000}_{00}M^{010}_{10}M^{010}_{00}M^{100}_{01} - 2M^{000}_{10}(M^{010}_{00})^2M^{100}_{01}\\
+\, M^{000}_{00}M^{000}_{01}M^{100}_{00}M^{020}_{10} - 2M^{010}_{10}M^{100}_{00}M^{000}_{01}M^{010}_{00}
\end{array}\right]^2}{(M^{000}_{00})^4\left[M^{200}_{00}M^{000}_{00} - (M^{100}_{00})^2\right]\left[M^{020}_{00}M^{000}_{00} - (M^{010}_{00})^2\right]^2}$$

inv[1] = S^{R}_{12}
inv[2] = S^{G}_{12} (similar)
inv[3] = S^{B}_{12} (similar)

$$inv[4] = D^{RG}_{02} = \frac{\left[M^{110}_{00}M^{000}_{00} - M^{100}_{00}M^{010}_{00}\right]^2}{\left[M^{200}_{00}M^{000}_{00} - (M^{100}_{00})^2\right]\left[M^{020}_{00}M^{000}_{00} - (M^{010}_{00})^2\right]}$$

inv[5] = D^{GB}_{02} (similar)
inv[6] = D^{BR}_{02} (similar)
inv[7] = D1^{RG}_{12}
inv[8] = D1^{GB}_{12} (similar)
inv[9] = D1^{BR}_{12} (similar)
inv[10] = D2^{RG}_{12}
inv[11] = D2^{GB}_{12} (similar)
inv[12] = D2^{BR}_{12} (similar)


Table 6.1: Moment invariants used for comparing the patterns within an invariant

neighbourhood (ctd.).

$$D3^{RG}_{12} = \frac{\left[\begin{array}{l}
(M^{000}_{00})^2M^{010}_{10}M^{200}_{01} - M^{000}_{00}M^{010}_{10}M^{000}_{01}M^{200}_{00}\\
-\, 2M^{000}_{00}M^{100}_{01}M^{100}_{00}M^{010}_{10} + 2M^{000}_{01}(M^{100}_{00})^2M^{010}_{10}\\
-\, M^{000}_{00}M^{000}_{10}M^{010}_{00}M^{200}_{01} + 2M^{000}_{10}M^{100}_{00}M^{010}_{00}M^{100}_{01}\\
-\, (M^{000}_{00})^2M^{010}_{01}M^{200}_{10} + M^{000}_{00}M^{010}_{01}M^{000}_{10}M^{200}_{00}\\
+\, 2M^{000}_{00}M^{100}_{10}M^{100}_{00}M^{010}_{01} - 2M^{000}_{10}(M^{100}_{00})^2M^{010}_{01}\\
+\, M^{000}_{00}M^{000}_{01}M^{010}_{00}M^{200}_{10} - 2M^{100}_{10}M^{100}_{00}M^{000}_{01}M^{010}_{00}
\end{array}\right]^2}{(M^{000}_{00})^4\left[M^{200}_{00}M^{000}_{00} - (M^{100}_{00})^2\right]^2\left[M^{020}_{00}M^{000}_{00} - (M^{010}_{00})^2\right]}$$

$$D4^{RG}_{12} = \frac{\left[\begin{array}{l}
(M^{000}_{00})^2M^{100}_{10}M^{110}_{01} - M^{000}_{00}M^{100}_{10}M^{000}_{01}M^{110}_{00}\\
-\, M^{000}_{00}M^{100}_{10}M^{100}_{00}M^{010}_{01} + M^{100}_{10}M^{100}_{00}M^{000}_{01}M^{010}_{00}\\
-\, M^{000}_{00}M^{000}_{10}M^{100}_{00}M^{110}_{01} + M^{000}_{10}(M^{100}_{00})^2M^{010}_{01}\\
-\, M^{000}_{10}M^{100}_{00}M^{010}_{00}M^{100}_{01} - (M^{000}_{00})^2M^{100}_{01}M^{110}_{10}\\
+\, M^{000}_{00}M^{100}_{01}M^{000}_{10}M^{110}_{00} + M^{000}_{00}M^{100}_{01}M^{100}_{00}M^{010}_{10}\\
+\, M^{000}_{00}M^{000}_{01}M^{100}_{00}M^{110}_{10} - M^{000}_{01}(M^{100}_{00})^2M^{010}_{10}
\end{array}\right]^2}{(M^{000}_{00})^4\left[M^{200}_{00}M^{000}_{00} - (M^{100}_{00})^2\right]^2\left[M^{020}_{00}M^{000}_{00} - (M^{010}_{00})^2\right]}$$

inv[13] = D3^{RG}_{12}
inv[14] = D3^{GB}_{12} (similar)
inv[15] = D3^{BR}_{12} (similar)
inv[16] = D4^{RG}_{12}
inv[17] = D4^{GB}_{12} (similar)
inv[18] = D4^{BR}_{12} (similar)

Normalization for Geometry-based Neighbourhoods

For neighbourhoods extracted with the geometry-based method, we know that the

neighbourhood is parallelogram-shaped. Skew and scale changes can simply be re-

moved by applying an additional affine transformation that maps the neighbourhood to a square reference neighbourhood of fixed size. Moreover, we also know which

corner belongs to the original anchor point (Harris corner point). Based on this

information, rotation can be compensated for. The only geometric deformation left

is a possible ’switching’ of the two axes (due to the edges being taken in a different

order). In fact, for grouping purposes, where symmetric features may be mirrored

variations of one another, switching the axes must be considered, otherwise mirror-

symmetries might be ’invisible’ in feature space. We therefore also use a variation

of the invariant feature vector to compensate for this effect.


inv[1] = M^{110}_{00}/M^{000}_{00}    inv[2] = M^{011}_{00}/M^{000}_{00}    inv[3] = M^{101}_{00}/M^{000}_{00}
inv[4] = M^{100}_{10}/M^{100}_{00}    inv[5] = M^{010}_{10}/M^{010}_{00}    inv[6] = M^{001}_{10}/M^{001}_{00}
inv[7] = M^{100}_{01}/M^{100}_{00}    inv[8] = M^{010}_{01}/M^{010}_{00}    inv[9] = M^{001}_{01}/M^{001}_{00}
inv[10] = M^{100}_{11}/M^{100}_{00}   inv[11] = M^{010}_{11}/M^{010}_{00}   inv[12] = M^{001}_{11}/M^{001}_{00}
inv[13] = M^{100}_{20}/M^{100}_{00}   inv[14] = M^{010}_{20}/M^{010}_{00}   inv[15] = M^{001}_{20}/M^{001}_{00}
inv[16] = M^{100}_{02}/M^{100}_{00}   inv[17] = M^{010}_{02}/M^{010}_{00}   inv[18] = M^{001}_{02}/M^{001}_{00}

Table 6.2: Moment invariants used for comparing the patterns within a parallelogram-shaped invariant neighbourhood after normalization of the neighbourhood to a reference square.

In this way, the affine transformations have been completely compensated for. In

fact, the normalization corresponds to using the points p, p1 and p2 as an affine

basis, and describing the color and intensity profile with respect to this basis.

We also normalize against the photometric changes. However, this cannot be achieved

by exploiting some extra knowledge about the region extraction. Instead, the nor-

malization is directly based on the intensity profile itself, by replacing each intensity value I by I′ = aI + b, with a and b chosen such that the average intensity becomes 128 and the spread (standard deviation) of the intensities becomes 50.

For an overview of the invariants used in this case, see Table 6.2. As all transfor-

mations have been compensated for through normalization, the invariants become

much simpler now. In fact, any measurement in the normalized reference neigh-

bourhood can be used as an invariant. The reason why we stick to moments is that

they are quite robust to noise.

Note that the invariants in Table 6.2 do not necessarily form a basis. Actually, they

were selected rather ad hoc based on their physical interpretation. inv[1] to inv[3]

are related to the correlation between two colorbands. inv[4] to inv[6] and inv[7] to inv[9] are the x- and y-coordinates, respectively, of the centers of gravity

weighted with one colorband, while inv[10] to inv[18] are combinations of higher

order moments.


Normalization for Intensity-based Neighbourhoods

Using the intensity-based extraction method results in elliptic neighbourhoods. A

similar normalization as for the geometry-based, parallelogram-shaped neighbour-

hoods can be applied, this time using a circular reference neighbourhood instead.

However, there remains one degree of freedom to be determined, namely the orien-

tation of the underlying color or intensity profile within the circular reference neigh-

bourhood. The invariants used in this case need to be rotation-invariant. Again,

normalization against illumination changes is applied as well.

inv[1] = M^{110}_{00}/M^{000}_{00}    inv[2] = M^{011}_{00}/M^{000}_{00}    inv[3] = M^{101}_{00}/M^{000}_{00}

$$inv[4] = \frac{\sqrt{\begin{array}{l}
(M^{100}_{10}M^{000}_{00})^2 + (M^{000}_{10}M^{100}_{00})^2 + (M^{100}_{01}M^{000}_{00})^2 + (M^{000}_{01}M^{100}_{00})^2\\
-\, 2\,M^{100}_{10}M^{000}_{00}M^{000}_{10}M^{100}_{00} - 2\,M^{100}_{01}M^{000}_{00}M^{000}_{01}M^{100}_{00}
\end{array}}}{M^{100}_{00}M^{000}_{00}}$$

inv[5] = (similar)    inv[6] = (similar)

$$inv[7] = \frac{\sqrt{\begin{array}{l}
M^{100}_{20}(M^{000}_{00})^2 - 2M^{000}_{10}M^{100}_{10}M^{000}_{00} + M^{100}_{00}(M^{000}_{10})^2\\
+\, M^{100}_{02}(M^{000}_{00})^2 - 2M^{000}_{01}M^{100}_{01}M^{000}_{00} + M^{100}_{00}(M^{000}_{01})^2
\end{array}}}{M^{100}_{00}(M^{000}_{00})^2}$$

inv[8] = (similar)    inv[9] = (similar)

Table 6.3: Moment invariants used for comparing the underlying intensity and color information within an elliptic invariant neighbourhood after normalization to a reference circular neighbourhood.

Table 6.3 summarizes the invariants for this case. Only 9 invariants are given here

and were used in the experiments, although it is certainly possible to find more by

combining several colorbands and using second order moments.

inv[1] to inv[3] are identical to the corresponding invariants for the geometry-

based, normalized neighbourhoods (representing the correlation between several

colorbands). inv[4] to inv[6] are the distances between the center of gravity of the

neighbourhood weighted with one colorband and the center point of the neighbour-

hood, while inv[7] to inv[9] correspond to the weighted average squared distances

to the center of the neighbourhood (xm, ym), i.e.

$$inv[7] = \sqrt{\frac{\iint \left[(x - x_m)^2 + (y - y_m)^2\right] R(x, y)\; dx\, dy}{\iint R(x, y)\; dx\, dy}}$$


where R(x, y) denotes the intensity function of the red color band.

Color vs. Grayscale

Whenever color information is available, it is highly recommended to use it for the

invariant neighbourhood description. As opposed to grayscale images, the three

spectral bands constitute additional sources of information, although they are of-

ten not independent as the different bands are correlated. Nevertheless, different

neighbourhoods are easier to discriminate in the feature space. In case of grayscale

images, only the single-banded invariants can be used, which drastically reduces the

number of invariants of second order and first degree (see Table 6.1) without nor-

malization. This is clearly insufficient for an efficient distinction of neighbourhoods.

Alternative ways for the construction of more invariants (e.g. higher order and/or

degrees) are possible (see [Mindru et al. 1998]). In this thesis, though, only color

images have been used for the experiments.

6.3 Neighbourhood Comparison

In grouping, comparison techniques play an essential role, and a wealth of such techniques exists. Basically, a comparison method yields a similarity (or dissimilarity)

measure given two features. Depending on the features to compare and the context

of the application, some comparison methods are better suited than others. In the

context of efficient grouping, with usually a large number of features (i.e. affinely

invariant neighbourhoods) to inspect, comparison must be performed with minimal

computational effort.

In the first place, we compare affinely invariant neighbourhoods through their fea-

ture vectors. Repetitive neighbourhoods have similar feature vectors close to one

another in the feature space, and identifying such clusters there is exceedingly more

efficient than a direct, pixel-wise search on the neighbourhood contents. Once can-

didates for repetitions have been found in the feature space, we make an additional

cross-correlation check of their intensity patterns for the reasons mentioned in Sec-

tion 6.3.2.

Whatever comparison techniques and similarity measures are applied, we only com-

pare affinely invariant neighbourhoods of the same type, that is neighbourhoods that

were extracted with the same method. Comparing geometry-based, parallelogram-

shaped neighbourhoods to intensity-based, elliptical ones is pointless even on in-

tuitive grounds. Treating all parallelogram-shaped neighbourhoods in the same

manner is problematic as well, because different extraction methods are involved


for the individual neighbourhoods. Hence, to better model the properties of par-

ticular neighbourhood types and their corresponding invariant feature vectors, only

neighbourhoods of the same type are compared.

6.3.1 Feature Vector Comparison

In theory, two feature vectors computed over corresponding neighbourhoods, with

one of them subject to an affine transformation and/or illumination change, should

be identical due to their invariant nature. In practice, though, they will never be

completely equal due to noise, discretization errors and/or misalignments.

Another source of error lies in deviations from the model used to approximate the geo-

metric and photometric relations among affinely invariant neighbourhoods. Remem-

ber that we assumed affine geometric deformations and linear photometric changes.

These models are not adequate if there are strong perspective distortions in the

image, if the surface is not completely planar, if patches are partially occluded or

if there are strong specular reflections. In particular, specular reflections can ren-

der the intensity profile of two neighbourhoods completely different, thus making

it impossible to match them. Specular reflections might indeed be quite common

in images of e.g. man-made repetitions (think of the facade of a building with re-

peating windows on a nice day with the sun reflecting on the windows). However,

as this mostly occurs occasionally and on a local scale, this case can safely be ne-

glected. The other deviations are usually small, such that two feature vectors are

still relatively close to one another. In grouping, especially when dealing with a large number of repetitions, missing a few matches can be coped with.

Mahalanobis Distance

The problem we face is the selection of a distance measure between two multi-

dimensional feature vectors to quantify their similarity, i.e. to find out whether two

feature vectors represent two repeating instances of the same neighbourhood. The

spread or variation of their different components might be totally different, which

clearly disqualifies the Euclidean distance as a similarity measure. The variance

of one invariant might be several orders of magnitude larger than the variance of

another invariant. At the same time, several invariants might be correlated as well.

The Mahalanobis distance is a better similarity measure in this case, as it correctly

takes into account this different variability of the elements of the feature vector.

The Mahalanobis distance between two vectors x1 and x2 is given as follows:

$$d_M(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{(\mathbf{x}_1 - \mathbf{x}_2)^\top\, \Sigma^{-1}\, (\mathbf{x}_1 - \mathbf{x}_2)} \qquad (6.2)$$


where Σ is the covariance matrix: Σ_{ii} represents the spread (i.e. the variance σ²) of the i-th element of the feature vector, and Σ_{ij}/√(Σ_{ii} Σ_{jj}), (i ≠ j), the correlation between the i-th and j-th components.
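A minimal numpy sketch of this distance (not the thesis code; it assumes the covariance matrix Σ has already been estimated, e.g. as described in Appendix A):

```python
import numpy as np

def mahalanobis(x1, x2, cov):
    """Mahalanobis distance of Eq. (6.2) between feature vectors x1 and x2,
    using the inverse of the covariance matrix 'cov'."""
    d = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))
```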

Achieving Maximum Separability

The occurrence of multiple repetitions (of different features) in the image corre-

sponds to multiple clusters in the feature space. If clusters form well separated,

compact entities, their automatic identification is certainly eased.

At the same time, one is also interested in a reduction of the dimensionality of the

feature space, be it e.g. for visualization or a cut-down in computational complexity.

However, a lower-dimensional space is in vain if the separability of the original

features is lost. It is therefore of importance that well separated features remain

well separated in the reduced space.

Reasonable distance measures involving the Mahalanobis distance greatly depend on

the covariance matrix Σ. The crux of the matter is that Σ strongly depends on the

degree of viewpoint and illumination change, and last but not least on the underlying

intensity profile of the affinely invariant neighbourhood. As a consequence, it is not straightforward to obtain a good estimate of Σ.

In most cases, the variability of a single feature over different viewing conditions

is hardly related to the overall variability of that feature. Some features may be

very stable to image noise or changing viewing conditions, resulting in a low intra-

class variability, while they still vary a lot between different neighbourhoods due to

the different intensity profile. Other features, especially those using higher order

moments, are significantly less discriminative, as they are more sensitive to noise,

misalignment of the neighbourhood or deviations from the model, regardless of the

overall variability of the feature.

Actually, the optimal feature is a feature with a high discriminative power, i.e.

the combination of a high variability between neighbourhoods covering different

patterns with a low variability of the same neighbourhood over different viewing and

illumination conditions. In contrast to techniques like e.g. the principle component

analysis, a linear discriminant analysis (LDA) is well suited for this task, as it

maximizes the ratio of the inter-class to the intra-class variabilities through two

consecutive coordinate transformations.

The LDA rests on the less strict assumption of a common covariance matrix. It is assumed that all neighbourhoods share the same intra-class statistics

(independent of the color or intensity profile). However, these statistics can be dif-

ferent from the overall, inter-class statistics. It has been shown that, if the number

of training samples is small, this assumption leads to higher classification accuracy,


even if the covariance matrix of each class greatly differs [Friedman 1989]. Ap-

pendix A illustrates the LDA as applied in this thesis in more detail and shortly

explains how we obtained estimates for neighbourhood-specific covariance matrices.

6.3.2 Correlation-based Comparison of Affinely Invariant Neighbourhoods

Apart from comparing the invariant feature vectors of two neighbourhoods, one can

also directly compare the pixel intensities in the two neighbourhoods by computing

the cross-correlation. The definition of the normalized cross-correlation is given by

$$d_C = \frac{\sum_i \sum_j \left[I(x+i,\, y+j) - \bar{I}\right]\left[I'(x'+i,\, y'+j) - \bar{I}'\right]}{\sqrt{\sum_i \sum_j \left[I(x+i,\, y+j) - \bar{I}\right]^2\; \sum_i \sum_j \left[I'(x'+i,\, y'+j) - \bar{I}'\right]^2}} \qquad (6.3)$$

with (x, y) and (x′, y′) the corresponding pixels in the first and the second neighbourhood respectively, I(x, y) and I′(x′, y′) the respective intensities, and $\bar{I}$ and $\bar{I}'$ the average intensities.

The cross-correlation as such is sensitive even to small distortions. It assumes that the pixel corresponding to (x + i, y + j) is located at (x′ + i, y′ + j), which is only

the case for a pure two-dimensional translation. Therefore, affine shape normal-

ization is required, that is, the original neighbourhoods must first be transformed to a unit square (parallelogram-shaped neighbourhoods) or a circular reference neighbourhood (elliptical neighbourhoods). In this way, the geometric de-

formations are compensated for, except for the rotational component of elliptical

neighbourhoods, which is found by maximizing the cross-correlation.
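A minimal sketch of this measure (assuming two equally-sized patches that have already been mapped to the same reference shape; for elliptical neighbourhoods one would additionally maximize over the rotation, as described above):

```python
import numpy as np

def normalized_cross_correlation(patch1, patch2):
    """Normalized cross-correlation d_C of Eq. (6.3) between two
    shape-normalized patches of identical size."""
    a = patch1 - patch1.mean()
    b = patch2 - patch2.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2)))
```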

dC can be interpreted as a similarity measure with values always ranging from -1 to

+1. It is equal to ±1 if and only if the two intensity patterns are related by a linear

relation. The lower the absolute value of dC, the less correlated the patterns are. In

the rest of this text, we simply use the term ’cross-correlation’ when referring to the

normalized cross-correlation.

Cross-correlation is a more powerful measure than the Mahalanobis distance. How-

ever, it is by far not as efficient as the Mahalanobis distance, due to its computational complexity. In spite of the larger computational effort, we combine both

methods for the detection of repetitions, as explained in more detail in Section 6.4.

6.3.3 Other Comparison Methods

The cross-correlation-based comparison of two neighbourhoods is based on the as-

sumption of linear changes in image intensities. Indeed, the cross-correlation directly


measures the linear dependency of two intensity profiles. Similarly, the invariants

used for the Mahalanobis-based comparison also assume a linear model for the inten-

sity changes. However, as shown in [Tuytelaars 2000], the underlying linear model

is not always exact.

The modeling of illumination effects is still an open area of research, and previous publications draw mixed conclusions concerning intensity changes and the achievement of illumination invariance. Invariants based on a more complex, full affine

illumination model (in combination with perspective, geometric deformations) have

been developed recently [Mindru et al. 2001]. The authors report a gain of 10% in

recognition performance (for natural images), thereby confirming the superiority of

the new, but more complex invariants for such images.

Other measures based on maximum likelihood can correctly deal with non-linear and

unmodelled intensity transformations. However, these are not yet implemented in

our system, but might offer promising alternatives for further improvements.

Mutual Information

Mutual information makes it possible to measure the similarity between two neighbourhoods

without a specific model for the relationship between corresponding intensity values.

This is a technique based on information theory and was introduced by Viola and

Wells [Viola and Wells 1997] in the context of registration and recognition. Mutual

information is measured as follows:

$$I(I, J) = \sum_i \sum_j p(i, j)\, \log\!\left(\frac{p(i, j)}{p(i)\, p(j)}\right) \qquad (6.4)$$

with p(i) the probability of intensity i in image I, p(j) the probability of intensity

j in image J and p(i, j) the joint probability of intensity i in image I and intensity

j in image J . Basically, this measure expresses the idea that the color or intensity

profile of one neighbourhood should be predictable to a high degree based on the

color or intensity profile of the other neighbourhood. The mere existence of this statistical relationship suffices to obtain high scores with this measure, without a

model for the exact form of the relationship. Mutual information measures the

general dependence, while correlation quantifies the linear dependence between two

neighbourhoods.
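A sketch of how mutual information could be estimated in practice (our own illustration; the probabilities are approximated by a joint intensity histogram, and the number of bins is an arbitrary choice here):

```python
import numpy as np

def mutual_information(patch1, patch2, bins=32):
    """Mutual information of Eq. (6.4) between two equally-sized,
    shape-normalized patches, estimated from a joint histogram."""
    joint, _, _ = np.histogram2d(patch1.ravel(), patch2.ravel(), bins=bins)
    pij = joint / joint.sum()                    # joint probabilities p(i, j)
    pi = pij.sum(axis=1, keepdims=True)          # marginal p(i)
    pj = pij.sum(axis=0, keepdims=True)          # marginal p(j)
    nz = pij > 0                                 # avoid log(0)
    return float(np.sum(pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])))
```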

Correlation Ratio

As an alternative to statistical relationships, the correlation ratio tests for the exis-

tence of a functional relationship [Roche et al. 1999]:

$$\eta^2(I, J) = 1 - \frac{\mathrm{Var}\left[I - \phi(J)\right]}{\mathrm{Var}\left[I\right]} \qquad (6.5)$$


with φ(J) the least-squares optimal non-linear approximation of I in terms of J, and Var[·] the variance of the expression between brackets. Each color or intensity

value in the first neighbourhood is mapped to a color or intensity value in the other

neighbourhood (or, more precisely, a distribution around a single color or intensity

value). The fundamental difference to mutual information is that the correlation

ratio is based on the variance instead of the entropy.

6.4 Matching / Clustering

Once we have extracted the different types of affinely invariant neighbourhoods, we

try to find the repeating patterns by looking for clusters of invariant neighbourhoods

in the feature space. This is the space spanned by the elements of the feature vector

of moment invariants described earlier.

For the rest of the text, we distinguish between large clusters (consisting of more

than 6 neighbourhoods) and small clusters (consisting of anything between 2 and

6 neighbourhoods). To avoid heavy combinatorics, the large clusters (typically be-

longing to periodicities) will sometimes be dealt with in a different manner than the

small clusters (typically belonging to mirror-symmetries and/or point-symmetries).

Large and small clusters translate to regions of high- and low densities in the feature

space, respectively.

The actual matching and clustering proceeds as follows. For each different type of neigh-

bourhood (e.g. geometry-based with curved edges or intensity-based), a separate

feature space is built. To better separate the different clusters and to reduce the di-

mensionality of the feature space, a linear discriminant analysis is performed. In fact,

a dimensionality reduction is necessary, since an 18-dimensional (9-dimensional) fea-

ture space with only a few hundred datapoints (feature vectors) is too sparse for

representative statistics.

Clusters are then identified in this reduced space through a non-parametric density

estimation using an unimodal Gaussian kernel. The choice of the width σ of the convolution kernel is usually a crucial issue in density estimation [Scott 1992]. Setting

the value for σ too large can be interpreted as an oversmoothing of the data, which

prevents important structures from becoming ’visible’. Too small a value, on the

other hand, results in too many, fine-grained density peaks. Since the distribution of

the overall features is unknown, a value for σ has to be set based on the underlying

data. We adapted a method described by [Pauwels and Frederix 1999] to set σ in a

data-driven way.

More precisely, given an n-dimensional dataset $\{\mathbf{x}_i \in \mathbb{R}^n;\ i = 1 \ldots N\}$, a density f(x) is obtained by convolving the dataset with the unimodal density kernel $K_\sigma(\mathbf{x})$:

$$f(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(\mathbf{x} - \mathbf{x}_i) \qquad (6.6)$$

where σ is the (beforehand) unknown size-parameter for the kernel, measuring its

spread. In particular, the unimodal (rotation-invariant) Gaussian is given by

$$K_\sigma(\mathbf{x}) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\|\mathbf{x}\|^2 / (2\sigma^2)} \qquad (6.7)$$

The spread parameter σ is now taken proportional to the average radius of the ball

that encloses the k nearest neighbours of a datapoint. This has the advantage that

σ is completely determined by the data and scales with the size and range of the

dataset. The number k of nearest neighbours is fixed to be one percent of the total

number of datapoints, but with a minimum of k = 10, i.e.

$$k = \max\{0.01\,N,\; 10\} \qquad (6.8)$$
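The following sketch puts Equations (6.6) to (6.8) together (not the thesis code; the proportionality factor between σ and the average k-NN radius is set to 1 here, and N > k is assumed):

```python
import numpy as np

def density_estimate(X, query, k=None):
    """Kernel density estimate of Eq. (6.6) with the unimodal Gaussian of
    Eq. (6.7), the spread sigma being derived from the k-NN rule of Eq. (6.8).

    X     -- dataset of shape (N, n)
    query -- points of shape (M, n) at which the density is evaluated
    """
    N, n = X.shape
    k = k or max(int(0.01 * N), 10)                       # Eq. (6.8)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D.sort(axis=1)                                        # D[:, 0] is 0 (self)
    sigma = D[:, 1:k + 1].max(axis=1).mean()              # avg. k-NN radius
    d2 = np.sum((query[:, None, :] - X[None, :, :])**2, axis=2)
    K = (2 * np.pi * sigma**2) ** (-n / 2) * np.exp(-d2 / (2 * sigma**2))
    return K.mean(axis=1), sigma                          # f(query) and sigma
```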

The neighbourhood with maximum density is then selected as seed for a new cluster

and will be used as a cluster prototype. Other neighbourhoods are added to the

cluster if they are within a predefined distance d from the moving average of the

cluster, provided that they also yield a good score on the cross-correlation test (w.r.t.

the prototype). This is an additional, final test that has been added to reduce the

number of errors (neighbourhoods belonging to a cluster even though they don’t

cover the same pattern). Here again, the correlation is computed after normaliza-

tion to a square or circular reference neighbourhood. In case of intensity-based,

elliptical neighbourhoods, the correct rotation between the two circular reference

neighbourhoods is determined by maximizing the cross-correlation.

Next, all neighbourhoods belonging to the cluster are removed from the feature

space and the same process is repeated until no more clusters are found.
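The overall loop can be sketched as follows (again an illustration rather than the thesis code; it reuses the mahalanobis(), normalized_cross_correlation() and density_estimate() helpers sketched earlier, and 'cov' is the precomputed intra-class covariance matrix). Note that the default thresholds match the parameters given below:

```python
import numpy as np

def extract_clusters(features, patches, cov, d_max=3.0, cc_min=0.7):
    """Greedy cluster extraction: density peaks seed clusters; members must
    stay within the Mahalanobis threshold of the moving cluster average and
    pass the cross-correlation test against the prototype."""
    clusters = []
    remaining = list(range(len(features)))
    while len(remaining) >= 2:
        dens, _ = density_estimate(features[remaining], features[remaining])
        proto = remaining[int(np.argmax(dens))]        # highest-density seed
        members = [proto]
        mean = features[proto].copy()
        for i in remaining:
            if i == proto:
                continue
            if (mahalanobis(features[i], mean, cov) < d_max and
                    normalized_cross_correlation(patches[i],
                                                 patches[proto]) > cc_min):
                members.append(i)
                mean = features[members].mean(axis=0)  # moving cluster average
        if len(members) < 2:                           # no repetition left
            break
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters
```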

Parameters

A Mahalanobis distance threshold of d = 3 for two features to be similar has pro-

duced good results. Assuming that the covariance matrix has correctly been es-

timated and that the invariants have a Gaussian distribution, the probability of

not being within this boundary, although the neighbourhoods have correctly been

extracted, is smaller than 5%.
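To make this bound concrete (assuming, as described below, a two-dimensional reduced feature space): for Gaussian invariants, $d_M^2$ follows a chi-square distribution with 2 degrees of freedom, so $P(d_M > 3) = e^{-9/2} \approx 1.1\%$, indeed well below the quoted 5% bound.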

Concerning the dimensionality reduction of the feature space, we have found that a

projection of the feature vectors to the first two principal components corresponding

to the two largest eigenvalues is sufficient without a loss of essential information.

The threshold for the cross-correlation was set to 0.7 in all our experiments.


6.5 Example

Here we show the matching/clustering step with a typical elation example. In order to avoid scene clutter, we only show one neighbourhood type. The advertising panel in Figure 6.1 shows a planar repetition of beer cans, and we wish to 'capture' these repeating elements in terms of intensity-based, affinely invariant neighbourhoods.

Figure 6.1: The original image (left) and all intensity-based elliptic neighbourhoods extracted around intensity extrema. A total of 182 neighbourhoods were found.

529 intensity extrema led to the extraction of 182 elliptical neighbourhoods (right)

in about 3 seconds.¹ The computation of all feature vectors took about 7 seconds,

incl. the transformations needed by the discriminant analysis. The two principal

components of the dataset corresponding to the two largest eigenvalues are shown

in Figure 6.2 after the density estimation, with dark grayvalues indicating high

densities. Density estimation takes less than a second for this example, resulting

in a σ of 1.58 for the convolution kernel. Three isolated regions of higher density

can be spotted. The datapoint with the highest density is located in the second

quadrant and serves as the prototype for the clustering algorithm.

In this part of the feature space, a cluster of 22 neighbourhoods was found in ap-

proximately 3 seconds. The corresponding neighbourhoods are shown in Figure 6.3

left, both in the image and the feature space. After removal of the cluster and a

density re-estimation, the datapoint with the highest density acts as the prototype

again for a new clustering run. This time, the system took approx. 7 seconds to

extract a second cluster of 12 neighbourhoods.

¹Sun Ultra 10, 440 MHz. Image size: 700 × 525 pixels.



Figure 6.2: The first two principal components of the feature space after the LDA

transformations and the density estimation. The mean was subtracted from all

datapoints. The grayvalues of the points are proportional to their density values.

The three regions of higher density (darker points) are indeed in agreement with what a coarse visual inspection would identify as cluster candidates.

6.6 Discussion

Usually a few hundred affinely invariant neighbourhoods are extracted per image

(depending on the neighbourhood type and image content), and the corresponding

feature spaces of 18 (9) dimensions are quite sparse such that a reduction in di-

mensionality is appropriate. A crucial point is that the features must keep their

discriminative power in the reduced space, which emphasizes the need for a linear

discriminant analysis.

An LDA results in better separability of clusters by minimizing intra-cluster dis-

tances while maximizing inter-cluster distances. However, this procedure does not

always yield the desired result. This might be the case in situations where only one

cluster is present. The net effect is a disruption of the cluster, with some correct

neighbourhood matches going undetected.

Without the LDA, feature vectors usually occupy a smaller portion of the feature

space, i.e. they are closer to each other. As a consequence, many more datapoints

are within the predefined Mahalanobis distance around the prototype — and not

only those belonging to the cluster. As a result, one might obtain more correct

matches, but at the cost of a tremendous increase in computation time. For an



Figure 6.3: Top: Two clusters of intensity-based neighbourhoods in the image, found using the clustering method described in Section 6.4. Bottom:

Enlarged part of the feature space from Figure 6.2 with the two clusters encircled.

The first cluster to the left was removed from the feature space before looking for

the second cluster (right).


average of about 200 affinely invariant neighbourhoods, the detection of clusters

would be in the range of several minutes (per cluster).

Another issue is the question whether a re-estimation of densities is necessary after a cluster has been removed from the feature space. Clearly, the removal of datapoints affects the overall density structure of the feature space. Our experience has shown that a

re-estimation leads to a faster detection of other clusters (if there are any). To better

explain this effect, imagine a situation similar to the example shown in the previous

section, with a first cluster comparatively larger than a second one. After removal

of the first cluster, datapoints within or close to the region of that cluster, but not

being members of it, still might have larger densities than those points in the second

cluster (without density re-estimation). As a consequence, the search for the second

cluster starts at a wrong location. A re-estimation corrects this effect. We therefore

consider an update of feature densities (after the removal of a cluster) as necessary.

6.7 Summary and Conclusions

A preliminary grouping step is performed with the goal of finding similar, small planar

patches, irrespective of their regularity. Through the use of the affinely invariant

neighbourhoods introduced in Chapter 4, similar such neighbourhoods can be found

efficiently by describing each by a feature vector consisting of moment invariants.

This allows the use of indexing techniques, which improves efficiency. Repetitions

are then detected by looking for similar feature vectors in the feature space in

combination with a normalized cross-correlation. Several alternative comparison

methods are presented as well.

A linear discriminant analysis with tracking-based covariance matrices leads to a

substantially better discriminative power of the features, which again contributes to

the overall efficiency.

Repetitive neighbourhoods correspond to clusters in the feature space. A non-

parametric density estimation identifies regions where several feature vectors gather,

and density peaks serve as prototypes for examining their immediate surroundings w.r.t. a predefined Mahalanobis distance, thereby updating the moving average of the

prototype as new cluster candidates are added. Once a cluster has been found and

removed from the feature space, the procedure is repeated.

7 Detection of Regularities

Regularities are repetitions of planar patterns in a regular, well defined

spatial arrangement. This implies the existence of certain ’rules’ or

’guidelines’ that make up the entire pattern. Formally, the spatial con-

figuration can be seen as the effect of an underlying mathematical law

responsible for its creation, hence the name regular. And indeed, the general mathe-

matical description of symmetries by transformation groups provides an insight into

the mechanisms that build symmetric patterns.

This chapter deals with the detection of regularities of repeating patterns, that is

the governing mathematical laws. After a brief introduction, in Section 7.2 it is

explained how the cascaded Hough transform introduced in Chapter 5 is applied

to clusters of similar affinely invariant neighbourhoods to extract fixed structure

candidates. Section 7.3 describes the instantiation of planar homology hypotheses,

and Section 7.4 describes a method for their verification. In Section 7.5, some pros

and cons are discussed and Section 7.6 concludes the chapter.

7.1 Introduction

Now that we have found one or several repeating patterns in the image, the goal

is to find the regularities behind them (if there are any). As mentioned earlier, we

assume that the regular repetition of an image pattern can be characterized by a

planar homology. Our goal here is to find that homology.

We recall that the corresponding projectivity H has a line of fixed points and a pencil

of fixed lines. To hypothesize H, we first extract fixed structure candidates. This is

achieved in a non-combinatorial way, using the CHT (see Chapter 5). Next, a single

neighbourhood match suffices to lift the remaining degree of freedom. Why do we

need an additional neighbourhood match? Remember from Chapter 3 that general

planar homologies have 5 dof, yet the fixed structures lift 4 dof in total. Hence,

an additional point match is needed. The situation is only slightly different in the


case of elations. Here, the fixed structures lift 3 dof (the vertex of the pencil lies on the line of fixed points), again leaving us with one remaining dof and thus the need for an additional point match.

To sum up, finding a grouping characterized by a general planar homology amounts

to the determination of a transformation with 5 dof. The prior knowledge of the

fixed structures cuts down the complexity of the problem considerably to 1 dof.

However, a system can only benefit from this reduction if no additional, unnecessary

complexity is introduced for the extraction of the fixed structures. In the following,

we explain how this can be achieved efficiently.

7.2 Finding Fixed Structures

The first step in the analysis for regularity comprises the extraction of fixed structure

candidates, and this process has to be carried out efficiently, i.e. without resorting

to combinatorial methods. We propose the use of the CHT to get to the desired

result.

To shortly outline the strategy, we use two procedures for extracting candidate pencils of fixed lines and candidate lines of fixed points, each of which applies the CHT twice in succession. Essentially, both procedures are identical in what they do. Their only

difference is that they act on spaces dual to each other, see Table 7.1.

7.2.1 Candidate Pencils of Fixed Lines

Large Clusters

To find good candidates for pencils of fixed lines, we use the center points of invariant

neighbourhoods belonging to a cluster as input for the CHT. Collinear arrangements

of neighbourhood centers can be detected by applying a Hough transform on these

center points. A second Hough transform applied to the peaks of the output of the

first one yields intersections of straight neighbourhood alignments, i.e. candidates

for pencils of multiple fixed lines.


Level 0 (pencils of fixed lines): small clusters: none; large clusters: center points (in)
    ↓ Hough ↓
Level 1 (pencils of fixed lines): small: joins (in); large: collinear arrangements (out)
Level 1 (line of fixed points):   small: none; large: characteristic lines (in)
    ↓ Hough ↓
Level 2 (pencils of fixed lines): small: intersections of joins (out); large: intersections of collinear arrangements (out)
Level 2 (line of fixed points):   small: intersections of characteristic lines (in); large: intersections of characteristic lines (out)
    ↓ Hough ↓
Level 3 (line of fixed points):   small and large: collinear configurations of characteristic-line intersections (out)

Table 7.1: Strategy for extracting fixed structure candidates, working on both large and small clusters of affinely invariant neighbourhoods. Structures used as input are marked (in), and the corresponding outputs are marked (out). The level numbers indicate the CHT levels.

Small Clusters

For the small clusters, the joins connecting the centers of neighbourhoods belonging

to the same cluster are added as direct input before taking the second Hough.

Adding these lines helps in detecting the vertices of the pencils of fixed lines in cases where there is only a limited number of repetitions (e.g. mirror-symmetries). Adding

lines between pairs of neighbourhood centers seems to undermine our goal of avoiding

combinatorial steps. However, since this measure is restricted to neighbourhoods

belonging to small clusters, relatively few such lines are constructed (maximum 15

lines per cluster).

When applying two successive Hough transforms, the original input will re-emerge

as peaks in the Hough spaces. As a result, the original neighbourhood centers pop

up in the space where we look for pencils of fixed lines. However, since we know

which points have been used as input, these peaks can be identified and ignored for

further processing.


7.2.2 Candidate Lines of Fixed Points

Large Clusters

To detect candidates for lines of fixed points, we apply exactly the same scheme as for

the detection of the candidate pencils of fixed lines, but in the dual spaces. As input

for the first Hough transform, we use characteristic lines of the neighbourhoods.

These are sides and diagonals of the parallelogram-shaped neighbourhoods, and a

photometric invariant variant of the axes of inertia of the elliptical neighbourhoods.

In more detail, these axes of inertia can be found by first mapping an elliptic neigh-

bourhood to a circular reference neighbourhood. Next, the major and minor axes are

then extracted as the lines passing through the center O with orientations θmax, θmin

defined by the solution of

$$\tan^2\theta + \frac{m_{20} - m_{02}}{m_{11}}\,\tan\theta - 1 = 0 \qquad (7.1)$$

with m_{pq} the (p + q)-th order, first degree moment centered at the neighbourhood's

geometric center. It can be shown that these axes are invariant under both linear

intensity changes and rotation, in the sense that they cover the same part of the

neighbourhood after a rotation. For more details we refer to [Ferrari et al. 2001]. It

must also be mentioned that this problem is ill-conditioned if elliptic neighbourhoods

cover patterns with a perfect rotational symmetry. The resulting axes of inertia can

no longer be used. In such situations, we proceed according to the strategy for

the extraction of pencils of fixed lines (using the centerpoints; see previous section)

and apply the Hough for a third time. This way, fixed structures can be found

when parallelogram-shaped neighbourhoods offer no alternative, like in the situation

shown in Figure 4.10.
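Equation (7.1) is just a quadratic in tan θ, so the two orientations can be obtained directly, as in the following sketch (our own illustration; note the ill-conditioning when m_{11} approaches zero, which is exactly the rotationally symmetric case mentioned above):

```python
import numpy as np

def inertia_axes(m20, m02, m11):
    """Solve Eq. (7.1) for the orientations of the two axes of inertia:
    tan^2(theta) + ((m20 - m02)/m11) * tan(theta) - 1 = 0."""
    p = (m20 - m02) / m11             # ill-conditioned for m11 -> 0
    roots = np.roots([1.0, p, -1.0])  # the two tan(theta) solutions
    return np.arctan(roots)           # orientations theta_max, theta_min

# The product of the two roots is -1, so the resulting axes are
# perpendicular to each other, as expected for axes of inertia.
```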

By applying a first Hough transform, points where many of these lines intersect can

be detected. A second Hough transform applied to the peaks of the output of the

first one yields collinear arrangements of intersection points. These correspond to

the candidate lines of fixed points.

Small Clusters

Again, for the small clusters, we add some additional input before taking the second

Hough transform. In this case, these are intersections of corresponding characteristic

lines (e.g. intersections of corresponding sides and diagonals of parallelogram-shaped

neighbourhoods). This makes it possible to detect lines of fixed points, even if the

number of repeating patterns is low.

Since we use several input lines for each neighbourhood, spurious peaks will pop up

after applying the Hough transform. Indeed, starting from the sides and diagonals


of a parallelogram-shaped neighbourhood, it is obvious that the corners and center

of the parallelogram will be detected as intersection points, although they are not

really of interest to us. The same holds for the elliptical neighbourhoods, where the

neighbourhood centers will be detected as intersection points of the axes of inertia.

These are not the non-accidental intersections we are interested in (since they are

not related to the regularity at all). Hence, they have to be removed before taking

the second Hough transform.

7.2.3 Example

Let us take the image shown in Figure 4.8 as an example. The relevant planar

homologies in this example are the different elations that correspond to translations

from one tile to another in the ground plane. The candidate pencils of fixed lines to

be detected have their vertices in the vanishing points of these translation directions,

while the common line of fixed points corresponds to the horizon line.

Pencils of Fixed Lines

The centers of the neighbourhoods of the floor tile clusters shown in Figure 4.8

were used as input to the CHT to find the candidate vertices of pencils of fixed

lines (top row in Figure 7.1). The middle row of the same figure shows the three

unfiltered subspaces after applying the Hough for the first time (level 1). Peaks in

these spaces correspond to collinear arrangements of neighbourhood centers. Note

that the peaks in the first and especially in the second subspace are again placed in

collinear arrangements. This is because they represent a set of convergent lines.

It is this collinearity of the peaks in level 1 that is detected by the second Hough

transform. The bottom row of Figure 7.1 shows the unfiltered output of this sec-

ond Hough transform. This time, the peaks indicate locations (both inside and

outside the image) where collinear structures intersect. These include the original

input points (the peaks in the first subspace) as well as the vanishing points that

correspond to the pencil vertices. After removal of the re-emerging peaks, only six

candidate vertices for pencils of fixed lines remain. Figure 7.2 shows the candidate

vertices for pencils of fixed lines (except for the one that fell too far outside the im-

age boundaries to be displayed), together with the lines (collinear structures) that

contributed to them.

It should be mentioned here that a cluster of 42 elliptical neighbourhoods was found

as well, with almost circular shapes around the tile centers. The centerpoints of this

cluster coincide very well with a subset of the cluster from Figure 4.8 and indeed yield two of the candidate pencils shown in Figure 7.2. However, their axes of



Figure 7.1: Detection of the candidate pencils of fixed lines based on the CHT

for the largest cluster found in the image shown in Figure 4.8: the input (centers

of neighbourhoods belonging to one cluster) (top), the three unfiltered subspaces

after applying a first Hough transform (middle), and after applying a second Hough

transform (bottom). Apart from the original input points (in the first subspace), additional peaks arise in the second and third subspace that correspond to the

vertices of the pencils of fixed lines.


Figure 7.2: The candidate pencils of fixed lines and the most dominant vanishing

line, as detected by the CHT, after conversion to the image coordinate frame. Pencils

of fixed lines are shown together with their vertices (filled circles). Different sizes

indicate different support.

inertia are ill-conditioned and thus inapplicable for the extraction of line-of-fixed-points candidates.

Lines of Fixed Points

To find the candidate lines of fixed points, the sides and diagonals of the parallelogram-

shaped neighbourhoods of the cluster were used as level 1 input to the CHT. The

top row in Figure 7.3 shows the corresponding input spaces.

After applying a first Hough transform, we obtain the three (unfiltered) subspaces

(level 2) shown in the middle row of the same figure. The most salient peaks (in

the second and third subspaces) correspond to vanishing points 1. Most of the

peaks in the first subspace are removed before taking the second Hough transform,

since they correspond to neighbourhood centers or corners instead of non-accidental

alignments.

Finally, the result of applying a second Hough transform to the peaks of the output

of the previous level is shown in the bottom row. The peaks in the first subspace

(left) correspond to lines that have been used as input two levels before, so they

don’t bring any new information and are rejected as re-emerging peaks. Only the

intersection of the three lines in the third subspace (right) is non-accidental, hence

it is a promising line-of-fixed-points candidate for the regularity of the kitchen floor.

It is also shown in Figure 7.2.

1 The fact that the output is so similar to the bottom row of Figure 7.1 is due to the fact that we are dealing with elations, so the vertices of the pencils of fixed lines coincide with the structures that contribute to the line of fixed points.


[Figure 7.3 panels: columns show the first, second and third subspace; rows show levels 1, 2 and 3.]

Figure 7.3: Detection of the candidate lines of fixed points based on the CHT

for the largest neighbourhood cluster of the image shown in Figure 4.8. The input

spaces (characteristic lines of the neighbourhoods belonging to the cluster) (top), the

three unfiltered subspaces after applying a first Hough transform (middle), and after

applying a second Hough transform (bottom).


7.3 Finding the Groupings

In order to hypothesize a planar homology (incl. elation), we start by selecting a good pair, consisting of a candidate line of fixed points and a candidate pencil of fixed lines. These are structures that both received many votes in the CHT and to which the

same repeating neighbourhoods have contributed. Once the fixed structures have

been hypothesized, a single pair of repeating neighbourhoods fixes the last remaining

degree of freedom of the planar homology H.

In the case of large clusters, only pairs of neighbourhoods close to one another are examined. These correspond to the smallest repetition distance, which intuitively becomes clear for the example shown in Figure 7.2: a pair of nearby neighbourhoods corresponds to a translation of roughly one tile. Moreover, we

only consider pairs of neighbourhoods that both contributed to the extraction of

both fixed structures and that can be mapped onto each other by a member of the

subgroup of the projectivities defined by the fixed structures. The peak validation

described in Section 5.4.3 enables a fast identification of neighbourhoods that con-

tributed to a particular fixed structure, because the validation can iteratively be

applied to any CHT level further down. Hence, for each fixed structure candidate,

we can trace down the support until we arrive at the input level (level 0 or 1). From

here, the corresponding neighbourhoods can then easily be identified.
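A minimal sketch of this support tracing follows; the votes structure, mapping each peak at a given level to the peaks or points one level down that contributed to it, is an assumed piece of bookkeeping for illustration, not the actual implementation.

    def trace_support(peak, level, votes, base=0):
        # Recursively collect the input-level elements (level `base`, i.e. 0
        # or 1) that voted for `peak`; votes[level][peak] is assumed to list
        # the supporting peaks/points one level further down.
        if level == base:
            return {peak}
        support = set()
        for p in votes[level][peak]:
            support |= trace_support(p, level - 1, votes, base)
        return support

The elements returned at the input level then index the contributing neighbourhoods directly.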

The peak validation yields not only the neighbourhoods that contributed to a fixed

structure, but also imposes a spatial organization on this set with respect to the

subgroup at hand: as an example, the neighbourhoods that contributed to a pencil

of fixed lines must necessarily lie on a fixed line, and the peak validation routine

quickly identifies them.

These measures avoid slipping into combinatorics during the process of hypothesis

instantiation. Note also that the number of planar homology hypotheses to be

validated is much smaller than the number of pairs of close neighbourhoods, since

typically many pairs result in the same hypothesis.

Finally, a hypothesis can be instantiated in practice using Equations (3.3) and (3.4),

respectively, by solving for the unknown parameter µ (the remaining dof).
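For illustration only, the sketch below instantiates such a hypothesis using the standard vertex/axis parameterization of a planar homology, H = I + (µ − 1) v aᵀ / (vᵀa), as given in [Hartley and Zisserman 2000]; whether this coincides exactly with Equations (3.3) and (3.4) is an assumption here, and all names are illustrative.

    import numpy as np

    def planar_homology(v, a, mu):
        # Planar homology with vertex v (pencil of fixed lines), axis a (line
        # of fixed points) and cross-ratio mu, all homogeneous 3-vectors.
        # For an elation (vertex on the axis, v . a = 0) this form degenerates
        # and H = I + mu * outer(v, a) is used instead.
        v, a = np.asarray(v, float), np.asarray(a, float)
        return np.eye(3) + (mu - 1.0) * np.outer(v, a) / (v @ a)

    def solve_mu(v, a, x, x2):
        # Recover mu from one neighbourhood (point) match x <-> x2. Since
        # H x = x + (mu - 1) * (a . x)/(v . a) * v, requiring
        # cross(x2, H x) = 0 is linear in mu; solve it in least squares.
        v, a, x, x2 = (np.asarray(u, float) for u in (v, a, x, x2))
        k = (a @ x) / (v @ a)
        c1, c2 = np.cross(x2, x), k * np.cross(x2, v)
        return 1.0 - (c1 @ c2) / (c2 @ c2)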

7.4 Hypotheses Validation

Once fixed structures and thus groupings are hypothesized, these need to be verified.

In particular, the planar homology hypotheses need further testing, with a threefold

goal:


Efficiency: The CHT might yield several candidates for fixed structures. As this

leads to a hypothesize-and-verify method, wrong candidates must be rejected

with minimal computational effort.

Extent: Given a hypothesized planar homology, we want to find out exactly the

support in the image for this specific hypothesis, i.e. segment the image into

a consistent and a non-consistent part.

Correctness: From this point onwards perspective effects should be taken fully

into account. Also, by pulling more information from the image, a more accu-

rate estimate of the transformation can be obtained.

These goals are achieved by a region-growing algorithm that compares the original

image to its warped version for conformity based on normalized cross-correlation.

Warped in this context means the pixel-wise transformation of the original image

with the hypothesized planar homology H. An example is shown in Figure 7.4.

Figure 7.4: Semi-transparent overlay of the original image and its warped version.

The hypothesis in this case is a translation of one ’tile-unit’ to the left (elation).

We use the center of the repeating neighbourhoods that contributed to the detection

of the fixed structures as seed points for our region-growing algorithm. The correla-

tion is computed locally for corresponding pixels in both images using a correlation


window of fixed size. If the correlation value for pixel p_{i,j} is larger than a predefined threshold, then p_{i,j} is considered to be in agreement with the hypothesis H, and the same procedure is repeated for the adjacent pixels p_{i,j-1}, p_{i,j+1}, p_{i-1,j} and p_{i+1,j}. The region-growing algorithm stops when there are no candidate pixels left to

be evaluated. The whole procedure is repeated for all remaining centerpoints that

have not fallen inside an already grown region. As a consequence, even disconnected

groupings can be segmented.

To compensate for inaccuracies in the grouping hypotheses and/or imperfect sym-

metries, we allow the correlation window to drift a distance of one pixel starting

from the average displacement at neighbouring pixels. In this way large but gradual

deviations can be compensated for while using a correlation window shift of only

one pixel, which keeps the computation time low. At the same time, we limit the

total drifting distance to half the Euclidean distance between a pixel at (i, j) and

its warped location (i′, j′, 1)^T = H (i, j, 1)^T, to avoid too large deviations from the

original hypothesis. Allowing the correlation mask to slide gradually has proven

to yield good results in situations where the symmetry in the image is not perfect,

and/or the hypothesized symmetry is noisy.

For a hypothesized planar homology that is not correct, the correlation value almost

immediately drops below the threshold value, giving very small segmentations that

can easily be rejected. This results in a fast rejection of false hypotheses.
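A minimal sketch of this validation step, using the 25 × 25 window (win = 12) and 0.7 correlation threshold reported in the Discussion (Section 7.5.2) but omitting the drift compensation, could look as follows; all function and variable names are illustrative.

    import numpy as np
    from collections import deque

    def ncc(p, q):
        # Normalized cross-correlation of two equally sized float patches.
        p, q = p - p.mean(), q - q.mean()
        d = np.sqrt((p * p).sum() * (q * q).sum())
        return (p * q).sum() / d if d > 0 else 0.0

    def grow_regions(img, warped, seeds, win=12, thresh=0.7):
        # 4-connected region growing: a pixel agrees with hypothesis H if the
        # window around it correlates between the image and its warped version.
        h, w = img.shape
        consistent = np.zeros((h, w), bool)
        seen = np.zeros((h, w), bool)
        queue = deque(seeds)                  # seeds: neighbourhood centers
        while queue:
            i, j = queue.popleft()
            if i < win or j < win or i >= h - win or j >= w - win or seen[i, j]:
                continue
            seen[i, j] = True
            if ncc(img[i-win:i+win+1, j-win:j+win+1],
                   warped[i-win:i+win+1, j-win:j+win+1]) >= thresh:
                consistent[i, j] = True
                queue.extend([(i, j-1), (i, j+1), (i-1, j), (i+1, j)])
        return consistent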

Example (ctd.)

For the example shown in Figure 4.8, we combined the strongest candidate pencil

of fixed lines with the only candidate line of fixed points. From all those neigh-

bourhoods belonging to the cluster, only one planar homology hypothesis emerged,

corresponding to a translation over one ’tile-unit’. We then warped the image ac-

cording to this transformation, as shown in Figure 7.4.

Note how the original floor tiling coincides with its warped version, while the non-

repeating objects appear motion-blurred, e.g. the dog in the middle and the

drawers to the right. This is exactly what is being detected during the hypothesis

verification stage.

Figure 7.5 shows the resulting segmentation. The part of the image that is not

darkened was found to be consistent with the hypothesized transformation. The

’holes’ in the foreground arise due to the fixed-size correlation window and the

homogeneity of the tiles. Note that the extension of the segmentation over part of

the cupboard at the upper left part of the image is correct: since we are considering

a ’horizontal’ translation, this part of the image is indeed consistent, as can also be

seen from Figure 7.4. The computation time needed to validate this hypothesis was

1 minute and 45 seconds.


Figure 7.5: Validation result (segmentation) of the hypothesis shown in Figure 7.4.

Darker pixels are considered inconsistent with the hypothesized transformation.

7.5 Discussion

7.5.1 Advantages of the CHT

Using the CHT allows many neighbourhoods to contribute to the selection of can-

didate fixed structures right from the start. This reduces the influence of possible

imprecisions in their individual positions, which have a much stronger impact in the

case of RANSAC [Fischler and Bolles 1981] (e.g. used in the work of Schaffalitzky

and Zisserman [Schaffalitzky and Zisserman 2000]).

Another advantage over more heuristic methods like RANSAC is the superior per-

formance with respect to outliers. This is especially the case when the number

of outliers equals the number of inliers. In such situations, RANSAC may need many more sampling iterations to produce a model (without any guarantee of correctness), and thus become prohibitively expensive in terms of computational complexity.

7.5.2 Parameters

In our experiments, we have used accumulator buffer sizes of 401 × 401 pixels for

each subspace. Concerning the peak validation, a minimum support of three input

points/lines is set as the threshold for a peak to be accepted, i.e. to be considered non-accidental.

A correlation window size of 25 × 25 pixels was used for the validation with a

correlation threshold set to a value of 0.7. To decrease the computation time for


the hypothesis validation, we downscale the entire image by a factor of 2 in each dimension; stated differently, for the pixel-wise validation, only every fourth pixel is evaluated.

Obviously, the choice of the parameters has a substantial impact on the final result.

For instance, increasing the threshold for non-accidentalness (number of collinear

structures during peak validation) might cause important fixed structures to go

undetected, whereas too low a value might result in too many grouping hypotheses to

be validated. Too large a correlation window for the hypothesis validation increases the total validation time, thus affecting the overall effectiveness. On the other hand, too small a window size results in segmentations susceptible to even small misalignments and noise.

The problem here is very similar to estimating a representative covariance matrix that accounts for the overall variability of all features used to characterize affinely invariant neighbourhoods. Again, the situation is highly image-dependent, and it is nearly impossible to obtain parameter values that give the best results for all possible images. We have therefore set these parameters on empirical grounds. The above values are a compromise such that optimal outcomes were achieved with our collection of test images (normally of size 640 × 480).

7.5.3 Computation Times

Some information about computation times for the kitchen floor example shown

throughout the chapter is summarized in Table 7.2. The centerpoints of 114 affinely

invariant neighbourhoods were used as input. It should be noted though that the

computation times differ for each image, depending on the size of the cluster used

as input and the amount of clutter in the Hough spaces. 'Filtering' in this table refers to the detection of peaks (non-maximum suppression), support checking, and the removal of re-emerging peaks.

Step                        Time (ms)    # peaks
First Hough transform       400          -
Level 1: peak extraction    16120        617
Level 1: filtering          440          384 left
Second Hough transform      780          -
Level 2: peak extraction    2430         95
Level 2: filtering          400          6 left

Table 7.2: Computation times for finding the pencil of fixed lines candidates on a

440 MHz SUN Ultra 10.


7.5.4 CHT vs. Gaussian Sphere

Many alternative methods have been developed for the automatic extraction of

vanishing points. In this context, the concept of the Gaussian Sphere is worth

mentioning [Barnard 1983]. The basic idea is to use the unit sphere (Gaussian

sphere) as an accumulator space for vanishing point detection. Common intersection

points for line segments in the image (i.e. vanishing points) translate to common

pairs of intersection points of great circles on this sphere.
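As a small illustration of this geometry (a sketch of the accumulation principle, not of the cited implementations): with a known calibration matrix K, an image line back-projects to a plane through the camera centre, the unit normal of that plane defines the line's great circle, and the intersection of two great circles is a candidate vanishing direction.

    import numpy as np

    def great_circle_normal(line, K):
        # An image line l (homogeneous 3-vector) back-projects to the plane
        # through the camera centre with normal K^T l; normalize to the sphere.
        n = K.T @ np.asarray(line, float)
        return n / np.linalg.norm(n)

    def vanishing_direction(l1, l2, K):
        # The two great circles intersect where both plane normals are
        # orthogonal to the direction: the cross product of the normals.
        d = np.cross(great_circle_normal(l1, K), great_circle_normal(l2, K))
        return d / np.linalg.norm(d)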

Although the Gaussian sphere is less versatile than the CHT (to date, no reports of an iterated application are known to the author) and requires known camera parameters, authors have proposed probabilistic reasoning about the locations of vanishing points on the sphere [Gallagher 2002].

In particular, an occurrence of mutually orthogonal sets of lines is often observed in

man-made scenes, and lines with a vertical orientation are dominant. Furthermore,

it is mostly true that cameras are held upright with respect to the scene. These

assumptions can be exploited so that each point on the Gaussian sphere can be

assigned a likelihood of being a vanishing point [Gallagher 2002].

Even if the above reasoning might seem somewhat heuristic, we have made similar

observations concerning the location of vanishing points and lines (vanishing points

tend to fall outside of the image). If prior knowledge about the preferred locations of

fixed structures in the CHT subspaces could be obtained, the detection of vanishing

points and lines might be facilitated.

7.6 Summary and Conclusions

In this chapter, we address the problem of analyzing repeating patterns (clusters of

affinely invariant neighbourhoods) for their regularity efficiently, that is without the

use of extensive combinatorics.

We first apply the cascaded Hough transform for the extraction of fixed structures (pencils of fixed lines and lines of fixed points). We utilize two successive iterations of the CHT for the detection of both pencil-of-fixed-lines and line-of-fixed-points candidates, thereby treating large and small clusters slightly differently.

After fixed structures have been hypothesized, a single neighbourhood match suf-

fices to lift the remaining degree of freedom. The usually huge number of possible

neighbourhood matches is cut down through the constraints that the CHT imposes

on a cluster of neighbourhoods: only those that contributed both to a pencil of fixed

lines and a line of fixed points are considered.


After having set up a grouping hypothesis, it is validated for its correctness based on

normalized cross-correlation. After a pixel-wise transformation of the entire image

with the hypothesized planar homology, the validation procedure segments those

parts in the image that are in agreement with the hypothesis. For wrong hypotheses,

the correlation value almost immediately drops below the threshold, such that they can be rejected at little computational cost.

8 Experimental Results

In this chapter, the performance of the proposed grouping framework is

tested on real images containing symmetric patterns that are related by

planar homologies. More precisely, we want to know if the system is able

to reliably detect such symmetries in ordinary images taken with commercially available digital cameras.

After an introductory section, we show some results when the system is applied to

groupings related by general planar homologies in Section 8.2. In Section 8.3, we

demonstrate how elations and periodicities are dealt with. Section 8.4 concludes

this chapter.

8.1 Introduction

Before proceeding with the presentation of experimental results, one word about

the goals of an experimental validation. Of course, the principal goal is to detect

the groupings in a wide variety of different images. However, it must also be said

that it is nearly impossible to conduct a full, systematic investigation. This is

due to the fact that a large number of parameters has piled up for a system like

ours, which is assembled from many sophisticated modules in a processing chain. The

overall performance is adjusted with 57 (!) parameters.1 Optimal values were found

empirically such that the best results are achieved with the same set of parameter

values for all images.

By applying the grouping system on many images exhibiting a wide diversity of

symmetric patterns, it was possible to get a more qualitative feeling for the overall

influence of certain parameters. This helps to better understand their role in the entire system; however, a quantitative analysis of how they affect the final outcome is infeasible.

1 And these are only the most important ones.



So this chapter aims at a demonstration of the performance of the system when

applied to many different types of symmetric scenes.

8.2 General Planar Homologies

Here, we show some results obtained when the system is applied to symmetric

patterns related by general planar homologies. In the image, they represent mirror-

symmetries.

As a first example, the method is applied to the butterfly image shown in Figure 8.1

(left). The pencil of fixed lines and the line of fixed points (axis) were correctly

Figure 8.1: The mirror-symmetric wings of a butterfly. Left: Original image. Right: Clusters of affinely invariant neighbourhoods. These small clusters are used as input to the CHT.

determined using the small clusters in Figure 8.1 as input to the CHT. One neigh-

bourhood match then completely determines the planar homology that geometrically

relates the two wings of the butterfly. In Figure 8.2 (right), the resulting planar ho-

mology hypothesis was applied to the original image. The result of the transform

is a mapping of the left wing onto the right one and vice versa. To better see the

accuracy of the hypothesis, the warped image is shown together with the original,

undistorted image in a semi-transparent overlay. The areas outside the bright poly-

gon in the middle are those pixels that fall beyond the image boundaries. The right

image in Figure 8.2 shows the result of the hypothesis validation. As can be seen,

the system was able to correctly segment this mirror-symmetric configuration of the

butterfly wings.

Another example exhibits a mirror-symmetry on the hand-woven carpet shown in

Figure 8.3. Here, small clusters yielding six correct pairwise matches were detected for the principal mirror-symmetric arrangements of patterns on the carpet. Note that most of the individual patterns are again highly symmetric. However, these


Figure 8.2: A semi-transparent overlay of the original image with its warped

version (left). Right: the resulting segmentation of the image after the validation.

Clearly, the hypothesis is correct.

Figure 8.3: Left: Mirror-symmetry on a hand-woven carpet and the matches found

by the system (right; the corresponding neighbourhoods are not shown).

Figure 8.4: Left: Semi-transparent overlay of the transformed image with the original one. Again, the darker areas outside the bright polygonal shape are neglected for the validation. Right: The result of the hypothesis validation.


are too small to be detected at this local scale. Figure 8.4 left shows the warped

image together with the original one as in the previous example. As this carpet is

hand-woven, the symmetry is not perfect. Ground-truth measurements indeed yield deviations from a perfect bilaterally symmetric layout (about 2 cm at a distance of

20 cm from the symmetry axis). The result of the hypothesis validation is shown

in the right part of Figure 8.4. Only small areas near the symmetry axis would

have been segmented without the slight drift of the correlation window. Obviously,

as the quality of the symmetry decreases with increasing distance from the axis,

this example confirms the capabilities of the system to deal with even imperfect

symmetries.

As a third example, the system is applied to two books in front of a mirror, shown

in Figure 8.5. This scene consists of two different groupings that have a common

Figure 8.5: Two books in front of a mirror (left) and the fixed structures as

detected by the system (right).

pencil of fixed lines. The common pencil of fixed lines is an indication of some

hidden relation between the two groupings (the fact that they are placed in front of

the same mirror). On each book, a few pairwise matches were found, enough to find

the common vertex of the pencil of fixed lines and the two different lines of fixed

points. The left and the right part of Figure 8.6 show the resulting segmentations

for both hypotheses.

8.3 Elations

Translational symmetries in the form of a floor tiling were used as an illustration

of the processing steps throughout the preceding chapters. Such floor tilings are textbook examples of periodicities, yet the system is also able to deal with periodicities of less regular structure. For example, consider the pile of beer boxes shown in


Figure 8.6: Resulting segmentations for the hypotheses of both the red (left) and

white book (right).

Figure 8.7: Pile of beer boxes, arranged rather irregularly. The original image

(left), the cluster of affinely invariant neighbourhoods for the black holes (middle)

and an enlarged view of the neighbourhoods covering the white labels (right).

Figure 8.7. Note that the boxes are placed rather irregularly in two different orienta-

tions (either with the black hole or the white label facing the camera). The system

detected two distinct clusters of affinely invariant neighbourhoods (black holes /

white labels, Figure 8.7 middle and right), and for each cluster the correct fixed

structures were detected. As each side of the beer boxes has a different length, the

different rows exhibit different planar homologies (same fixed structure, but different

cross-ratio), resulting in two different segmentations for the horizontal directions.

Another example deals with the building facade shown in Figure 8.9. Due to the

large number of repetitions (a cluster of 158 affinely invariant neighbourhoods was

extracted), the vanishing line of the wall plane clearly emerges in the Hough spaces.

This corresponds to a common line of fixed points of all elations mapping one small

window onto another. More precisely, the valid elations differ in their pencils of fixed

lines (directions) and cross-ratios (translational distances). Figure 8.10 shows the

resulting segmentations found for the vertical, horizontal and one diagonal direction.

Due to the homogeneous regions in between the window units, they are sometimes


Figure 8.8: Resulting hypotheses segmentations obtained from horizontal (left),

and two vertical (middle, right) point matches for both clusters. The hypothesis in

the middle column was formed by a vertical point match of two immediate adjacent

neighbourhoods, whereas the hypothesis in the right column was obtained by a

vertical point match of two black holes (white labels) with one white labeled (black

holed) box in between. Note that the unit length of the transformations (box height) is 1 in the left part of the pile and 2 in the right part, and they are in

agreement for both box cluster regularities.

Figure 8.9: Regular repetitions of small windows are grouped together in blocks of

9× 9 that again repeat in a regular manner at a higher level (left). Right: a cluster

of homogeneous neighbourhoods.


Figure 8.10: The presence of a high degree of symmetry becomes apparent as

the large window blocks are in agreement with elations in vertical (left), horizontal

(middle) and diagonal directions (right).

merged into larger segmentations for some directions.

Obviously, the window blocks are themselves arranged in a regular manner.

Indeed, there is a hierarchy of groupings at two scales. At this point, the natural

question arises about how a system can detect this additional regularity at the

larger scale. Clearly, both the small scale groupings (repetitions of the windows)

and the large scale grouping (repetitions of the window blocks) share the same fixed

structures.

Figure 8.11: Visualization of the symmetry density.

For such highly symmetric structures, one possibility for the delineation of the large scale grouping exploits the concept of symmetry density.

In particular, the segmented areas of different

valid homology hypotheses are accumulated in

a buffer. Those areas with the largest values ex-

hibit the highest degree of symmetry. The visu-

alization of the symmetry density for the build-

ing example is shown in Figure 8.11, where the

window blocks become apparent as regions with

the highest degree of symmetry. Although no

such system has been developed yet, a symme-

try density image might serve as a good starting

point. Future work will therefore deal with a more systematic exploitation thereof,

hopefully leading to a comprehensive framework for the detection of hierarchical

groupings.
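Since the accumulation itself is simple, a sketch suffices (the mask format and all names are assumptions):

    import numpy as np

    def symmetry_density(masks):
        # Accumulate the binary segmentation masks of all validated homology
        # hypotheses; pixels with the largest counts lie in the regions with
        # the highest degree of symmetry (cf. Figure 8.11).
        return np.sum([m.astype(np.int32) for m in masks], axis=0)

Thresholding such a density image and extracting its connected components would be one conceivable way to delineate the window blocks before re-applying the CHT at the larger scale.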

The next example deals with more complicated symmetries in the same vein. Fig-

ure 8.12 (top left) shows the original situation. Here we have two principal translational

symmetries in horizontal and vertical directions that make up the regularity of the

plugs. From a geometrical viewpoint, these are two elations sharing a common van-

ishing line. More precisely, the translations in the vertical direction even share both


Figure 8.12: Symmetric arrangements of plugs with different elations that all

have the vanishing line in common. Top left: Original image. Top right: Resulting segmentation for the horizontal direction; the vertical directions are shown in the bottom row.

fixed structures, but differ in the value of the cross-ratio. The system was able to ex-

tract the fixed structures, and the resulting segmentations are shown in Figure 8.12

for the horizontal (top right) and vertical directions (lower row). The left picture

in the bottom row shows the validation for a hypothesis that maps the upper two

rows of plugs onto each other, while the right picture illustrates the resulting seg-

mentation for the lower two rows. Note that the group of plugs of the second row

(from above) is actually in agreement with this hypothesis, and this was correctly

segmented by the system.

8.4 Conclusion

Experiments were conducted to demonstrate the overall capability of the system

to deal with symmetries related by planar homologies in a wide variety of different


images. All images were taken using regular digital cameras, and correction for radial distortion was applied where necessary. Generally speaking, the system works

reasonably well and is able to detect groupings where the repeating patterns consist

of a rich diversity of textures. This is in contrast to previous contributions that

focused on only one type of grouping and / or exploited only a narrow range of

features.

One shortcoming is the limited robustness to changes in scale during the detection of

repetitions, which is especially the case for the geometry-based invariant neighbour-

hoods. At the time of writing, experiments are being carried out to improve

the extraction of affinely invariant neighbourhoods with respect to changes in scale.

From these experiments, we can also conclude that our approach is efficient: the

average computation time required, from the extraction of interest points to hypothesis validation in cluttered scenes, is on the order of several minutes.

This emphasizes the superiority of our strategy over traditional, combinatorial meth-

ods.

9 Conclusion

Grouping in its many flavors has attracted the interest of researchers since

the early days of computer vision. The informal definition of grouping as

’putting together what belongs together’ arose with the ongoing devel-

opment of vision systems, with a trend to increasing complexity. Yet

regardless of their complexity, many systems rely on grouping as a necessary prepro-

cessing step, where it is mostly performed at the lowest level, e.g. the organization

of edgels and lines. This explains why grouping on the lowest level is still of interest

even in present days.

It is only in recent years that attention has turned towards the detection of groupings

at a higher level in ordinary images, without the need for (manual) preprocessing

or presegmentation. These newer contributions mostly focus on regular repetitions,

and regularity implies a quantitative description that is formalized by the laws of ge-

ometry, especially projective geometry. And this is the point where the framework

developed during this dissertation enters the scene. In the following, Section 9.1

briefly recapitulates our contributions and revisits the technologies employed in the

framework. Section 9.2 finishes this report with ideas and suggestions for improve-

ments and further research.

9.1 Summary

The most similar work to ours is the grouping system by Schaffalitzky and Zisserman [Schaffalitzky and Zisserman 2000, Schaffalitzky and Zisserman 1998]. The authors

also attack the problem of finding regular repetitions in images, although their

system is limited to the case of elations. It would be mistaken to consider their

work as a starting point for this dissertation, since the strategy, techniques and

generality of our approach are completely different.

Our main contributions to intra-image grouping are manifold. First of all, our

system is able to detect regular pattern repetitions related by the more general class



of planar homologies, which includes periodicities, mirror-symmetries and reflection

about points. This is in contrast to earlier work that focused on one particular

grouping type only. We have also shown that grouping can be performed efficiently

by banning heavy combinatorics from all processing steps. Efficiency is a crucial

issue when it comes to grouping, and here lies the novelty of our approach. Most

other systems are virtually characterized by the excessive combinatorial techniques

that they apply. Finally, our framework is more generic in the repeating features

that it is able to detect (affinely invariant neighbourhoods); most earlier approaches

are very limited in this respect.

The grouping strategy developed during this dissertation is based on the geometric

concept of fixed structures of planar homologies that relate repeating patterns in

the image. Fixed structures are geometric entities, like points and lines, that remain

fixed under a certain group of transformations. Regardless of their rather abstract

nature, fixed structures might indeed correspond to visible features in images, like for

instance a horizon line. The reason why fixed structures are of special interest is that

they lift many degrees of freedom of the transformation sought. If fixed structures

are known, the problem of finding a general 5 dof planar homology is reduced to

lifting only a single dof, which is a substantial reduction in complexity. To arrive at

candidates for fixed structures, we first detect repetitions of small, planar patches.

A second step analyzes these repetitions for their regularity, yielding fixed structures

as output.

Points of interest serve as starting points for the delineation of affinely invariant

neighbourhoods in the image. These are small, local, planar patches that self-

adapt to the underlying intensity profile, and the extraction process is invariant

against affine geometric and linear photometric changes. Each such neighbourhood

is described by a feature vector of moment invariants (color-ratios, resp.), which

allows to find neighbourhoods that cover similar patterns (i.e. repetitions) efficiently.

Several neighbourhood extraction methods are used (geometry-based / intensity-

based), and — depending on the image content — some extraction methods have

a better response than others. The idea is to have an opportunistic system that

exploits what is on offer in a specific image, such that enough affinely invariant

neighbourhoods can be extracted to get the grouping process started.

Once clusters of affinely invariant neighbourhoods have been extracted, these are

analyzed for their regularity (in a non-combinatorial way again) using a cascaded

version of the Hough transform (CHT). A line parameterization that is symmetric in

both (a, b) and (x, y) enables the iterated application of the Hough transform, where

the output of a previous transform can be used as input for a subsequent one. The

CHT yields fixed structure candidates (if any), and a single neighbourhood-match

suffices to lift the remaining dof and hence to arrive at the long awaited planar

homology hypothesis. Each hypothesis is then validated for its correctness, which

results in a segmentation of the image into symmetric parts.


9.2 Discussion and Outlook

9.2.1 Improvements

The proposed framework can still be improved in a number of ways. This holds

especially for a comprehensive system like ours that is a synthesis of methods and

knowhow from rather diverse corners of Computer Vision. Each method has its own

specific advantages and shortcomings, and improvements to each of them pay off in the overall effectiveness and robustness of the entire grouping system.

First of all, the extraction of affinely invariant neighbourhoods is a self-contained,

complex system with plenty of room for optimization. Many suggestions for im-

provements were already pointed out by Tuytelaars ([Tuytelaars 2000]). Some of

them have been realized during this thesis, for instance more efficient coding and thus speed improvements towards almost real-time performance (for some neighbourhood types). However, one shortcoming is the lack of robustness against major changes

in scale. Experiments have already been made to render the neighbourhood extrac-

tion more stable in this respect, with promising results. Further work will therefore

incorporate these extensions.

Another problem is the increasing number of parameters as more functionality is

added to the system. As mentioned earlier, the values for the 57 most important

parameters have been found empirically based on a diversity of test images. For cases

where the system fails to find groupings using the set of default values, a detection

can nevertheless be enforced by tuning the parameters accordingly. However, this

runs against the goal of an automated system. We have not paid much attention yet

to the determination of parameter values in an adaptive, data-driven or smart way.

As many parameters are geometry-specific, an additional processing stage could be

inserted between low-level feature extraction and the extraction of affinely invariant

neighbourhoods. The collection of statistics about geometric primitives (such as the

median of edge lengths etc.) would be the purpose of this stage, which would adapt

the following neighbourhood extraction stage to the realities in the image.

In the same spirit, automatic parameter adaptation would improve the cascaded

Hough transform as well. The question is whether the coarseness of quantization

of Hough spaces can be set adaptively based on the geometric structures in the

image and the desired accuracy. Although one might argue that CPU and memory

requirements are no longer a problem nowadays (remember Moore’s law), filter-

ing operations still pose a computational burden that can certainly be lowered by

avoiding unnecessary accumulator sizes.

Also interesting is a more systematic exploitation of color information, though this is arguably a notoriously difficult task and still an open field of research. Is it possible to

obtain a higher discriminative power of features by choosing a different color space,


such that invariance against linear photometric changes is preserved? For instance,

homogeneous affinely invariant neighbourhoods might clearly benefit from this im-

provement, as the metric of the currently used RGB color space does not always agree with what a human observer perceives. Similarity measures based on

normalized cross-correlation of graylevel values are pervasive throughout the entire

grouping system. Can correlation be made more effective by including the spec-

tral information in a more appropriate way? Reports on generalized correlation

can be found in [Jawahar and Narayanan 2002] and might indeed help in a better

discrimination of features.

9.2.2 Future Work

After being able to deal with groupings related by planar homologies, it is natural

to ask for extensions towards other types of symmetries, in particular rotational sym-

metries. To date, only a few authors have tackled the problem of recognizing

rotational symmetries ([Forsyth et al. 1992]), or have shown that 3D objects with fi-

nite rotational symmetry induce geometric relations in the image ([Liu et al. 1995]).

However, no automatic system has been reported yet. Here, it is desirable to extend

the concept of fixed structures for rotational symmetries, to be nicely integrated in

the existing framework. For repeating patterns in a rotation-symmetric configura-

tion, the fixed structures correspond to pencils of conics, and these require more

than two parameters for their specification. How they can be extracted efficiently

is the subject of ongoing research.

Apart from rotational symmetries, the systematic analysis of interrelations between

different groupings in the same image poses a further challenge. Examples shown

earlier have already led us to this problem, and common fixed structures are al-

ready a good indication for hidden relations. Clearly, these are all non-accidental

arrangements. Our observations have shown that the segmented regions (hypothe-

sis validations) might play an important role: Different groupings (with or without

common fixed structures) might have different (isolated) or overlapping segmen-

tations. Those areas in the image with overlapping segmentations are of special

interest, as they indicate a ’higher degree’ of symmetry. In this context, it would be

interesting to establish a link to wallpaper groups, although we only face truncated

versions thereof in images. Nevertheless, an analysis for wallpaper regularities is

possible by moving the fixed structures to infinity, thereby removing perspective

distortions. In this respect, the work by Liu and Collins ([Liu and Collins 2001,

Liu and Collins 2000]) complements our work very well in that our system is able

to delineate regions of high symmetry automatically (the system by Liu and Collins

requires manual selection of symmetric image parts).

Alternatively, a more elaborate theoretical treatment might be based solely on fixed structures. Analyzing them would allow us to infer information about missing fixed


structures that went undetected by the CHT. It can indeed be shown that, for

three planar homologies in a triangular configuration, a classificatory structure can

be derived. Encouraging preliminary experiments have already been made in this

respect ([Tuytelaars et al. 2002]).

Finally, groupings may occur at different hierarchical levels. For instance, each half

of a mirror-symmetry might contain some regularity itself. Or think of the building

facade with repeating windows shown in Figure 8.9 in Chapter 8. Such groupings

need also to be found. To that end, the concept of the symmetry density image has

already coarsely delineated the window blocks, and principally the CHT might again

be applied to e.g. the centers of gravity of these regions of high symmetry to detect

the grouping at the larger scale. Concepts and strategies for the integration of these

extensions still have to be determined, and this will be the focus of future work,

which clearly emphasizes the yet undiscovered potential of this strand of research.

A Linear Discriminant Analysis

One of the recurring problems encountered in applying statistical techniques to

pattern recognition problems is the reduction of dimensionality. Procedures that are

analytically or computationally manageable in low-dimensional spaces can become

completely impractical in high-dimensional spaces. Thus, various techniques have

been developed for reducing the dimensionality of the feature space in the hope of

obtaining a more manageable problem.

The dimensionality can be reduced from d dimensions to one dimension if the d-dimensional data is merely projected onto a line. Of course, even if the samples

form well-separated, compact clusters in d-space, projection onto an arbitrary line

will usually produce a confused mixture of samples from all of the classes. However,

by moving the line around, an orientation might be found for which the projected

samples are well separated. And this is exactly the goal we wish to achieve here

for the more general case, that is the reduction of dimensionality from d to k with

k < d.

A.1 Principle

We illustrate the linear discriminant analysis as applied in this dissertation with an

artificial example. Figure A.1 shows three well-separated clusters in 2D. The specific

nature of the features plotted here is not of interest at the moment. The data points, whose number differs from cluster to cluster, were randomly drawn from four bivariate normal distributions that differ only in the mean. Important here is that all clusters have approximately the same spread, in accordance with the common covariance matrix assumption.

Finding a projection line for which the samples are still well separated after pro-

jection is quite straightforward for the configuration in Figure A.1. The x-axis is

obviously the best choice. However, the task gets more difficult when the feature




Figure A.1: Initial cluster configuration.

space exceeds three dimensions and the projection line or plane (or subspace) must

be found automatically.

The first step is a rotation of the coordinate frame in the direction of the largest

(cluster-specific) spread, followed by whitening of the data. Whitening is a rescaling

of the axes, resulting in ’sphere’-like clusters. The effect of this step can be seen in

Figure A.2. Although hardly visible, the clusters are now sphere-like, i.e. the vari-


Figure A.2: Transformed dataset after rotation and scaling.

ances of each cluster are now normalized in both dimensions in this new coordinate

frame. From a computational viewpoint, the transformation is performed based on

the singular value decomposition of the cluster-specific covariance matrix Σ_C:

Σ_C = U · D · V^T    (A.1)


with U and V orthogonal and D diagonal. The entries D_ii on the diagonal are the variances, sorted in ascending order.

Note the increase of the inter-cluster distance in this particular example. This is

due to the fact that the standard deviation of the first component (w.r.t the original

coordinate frame in Figure A.1) is smaller than one. The net effect is a compression

in one direction and a dilation in the other, which results in larger distances

between the clusters for this particular configuration.

Next, a rotation is applied to the transformed dataset, but this time based on the

singular value decomposition of the global covariance matrix Σ_G of the transformed

data. Roughly speaking, the global covariance matrix takes into account the overall

variability of the clusters. Applied to the example shown here, this results again

in a 90 degree rotation, as shown in Figure A.3. Actually, this second transform


Figure A.3: Situation after the second transform.

corresponds to a principal component analysis. In general, a reduction of the dimen-

sionality can now be obtained by projection onto the first few principal components

(the x-axis in Figure A.3), thereby keeping the different clusters well separated.
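A compact sketch of this two-step procedure is given below; the pooling over mean-subtracted clusters in the first step matches the covariance construction described in Section A.2, while the function and variable names are illustrative.

    import numpy as np

    def lda_projection(X, labels, k):
        # X: (N, d) feature vectors, labels: (N,) cluster ids.
        # Step 1: pooled within-cluster deviations give Sigma_C; its SVD
        # (Eq. A.1) yields the rotation plus axis rescaling (whitening).
        devs = np.vstack([X[labels == c] - X[labels == c].mean(axis=0)
                          for c in np.unique(labels)])
        U, d, _ = np.linalg.svd(np.cov(devs, rowvar=False))
        W = U / np.sqrt(d)               # note: numpy sorts d in descending order
        # Step 2: principal components of the global covariance of the
        # whitened data; keep the first k directions.
        Y = (X - X.mean(axis=0)) @ W
        U2, _, _ = np.linalg.svd(np.cov(Y, rowvar=False))
        return W @ U2[:, :k]             # (d, k) projection matrix

    # Reduced features: X_red = (X - X.mean(axis=0)) @ lda_projection(X, labels, k)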

A.2 Covariance Matrix Based on Tracking Experiments

In the following, the term feature refers to the feature vector of an affinely invariant

neighbourhood. Simply speaking, we want to arrive at an estimate for a feature-

specific covariance matrix that represents the overall variability of a feature to the

best possible extent. Remember that different neighbourhood types have different


feature spaces, hence each neighbourhood type has its own specific covariance matrix

Σ.

We obtained estimates of Σ by tracking invariant neighbourhoods throughout a video

sequence: In the first frame, three affinely invariant neighbourhoods covering differ-

ent parts on the physical surface of an object were manually identified. Next, the

camera was gradually moved and the illumination slightly changed. This way, the

scene is imaged from different viewpoints (under varying illumination conditions)

and the three appointed neighbourhoods were manually identified in the consecutive

frames.

For each cluster, the mean was determined and subtracted from all feature vectors

belonging to that cluster. This allows us to look at the deviations only. Then, all

clusters were put together again to compute the covariance matrix. This yields an

estimate of the covariance matrix based on the average of some clusters.

Of course, this is only a rudimentary estimate, since so many factors account for the overall variability (see Section 6.3.1) that it can never be fully covered by a tracking experiment. Nevertheless, given the same dataset used in the tracking experiment, we determined the average separability of the clusters using both the pooled and the feature-specific covariance matrix. Table A.1 confirms that the ratio of inter- to intra-cluster distances is remarkably larger when using LDA.

Global covariance matrix estimate:

    Cluster      1         2         3
    1            0         4.3554    4.193
    2            4.3554    0         4.2047
    3            4.193     4.2047    0

    Cluster    Intra-cluster distance
    1          3.8594
    2          3.9993
    3          3.1486

Covariance matrix based on tracking experiments:

    Cluster      1         2         3
    1            0         23.90     29.33
    2            23.90     0         26.13
    3            29.33     26.13     0

    Cluster    Intra-cluster distance
    1          4.4317
    2          4.2403
    3          3.7756

Table A.1: Inter-cluster (matrices) and intra-cluster (lists) distances obtained using a global covariance matrix estimate (top) and the covariance matrix based on tracking experiments (bottom).

The distances shown in Table A.1 are the averaged distances between all the feature vectors of different clusters and within the same cluster.

B Image Database Overview

Figure B.1: Example images the system was applied to.



Figure B.2: Example images the system was applied to (ctd.)

Bibliography

[Barnard 1983] S. Barnard. Interpreting perspective images. Artificial Intelligence,

21:435–462, 1983.

[Baumberg 2000] A. Baumberg. Reliable feature matching across widely separated

views. In IEEE Computer Society Conference on Computer Vision and Pattern

Recognition, pages 774–781. IEEE, 2000.

[Binford 1981] T. Binford. Inferring Surfaces from Images. Artificial Intelligence,

17:205–244, 1981.

[Bruckstein and Shaked 1998] A.M. Bruckstein and D. Shaked. Skew-Symmetry

Detection via Invariant Signatures. Pattern Recognition, 31(2):181–192, 1998.

[Canny 1986] J. Canny. A computational approach to edge detection. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence, 8:679–698, 1986.

[Cham and Cipolla 1996] T. Cham and R. Cipolla. Geometric Saliency of Curve

Correspondences and Grouping of Symmetric Contours. In European Confer-

ence on Computer Vision, volume 1, pages 385–398, Cambridge, UK, April 1996.

Springer.

[Duda and Hart 1972] R.O. Duda and P.E. Hart. Use of the Hough transform to detect lines and curves in pictures. Communications of the ACM, 15:11–15, 1972.

[Ferrari et al. 2001] V. Ferrari, T. Tuytelaars, and L. Van Gool. Markerless aug-

mented reality with a real-time affine region tracker. In Procs. of the IEEE and

ACM Intl. Symposium on Augmented Reality, 2001.

[Fischler and Bolles 1981] M. A. Fischler and R.C. Bolles. Random sample con-

sensus: A paradigm for model fitting with applications to image analysis and

automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[Forsyth et al. 1992] D.A. Forsyth, J. Mundy, A. Zisserman, and C.A. Rothwell.

Recognising rotationally symmetric surfaces from their outlines. In European

Conference on Computer Vision, pages 639–647, 1992.

[Friedberg 1986] S. A. Friedberg. Finding Axes of Skewed Symmetry. Computer

Vision, Graphics, and Image Processing, 34:138–155, 1986.



[Friedman 1989] J. Friedman. Regularized discriminant analysis. Journal of the

American Statistical Association, 84(405):165–175, March 1989.

[Gallagher 2002] A.C. Gallagher. A ground truth based vanishing point detection

algorithm. Pattern Recognition, 35:1527–1543, 2002.

[Glachet et al. 1993] R. Glachet, J.T. Lapreste, and M. Dhome. Locating and Mod-

elling a Flat Symmetric Object from a Single Perspective Image. Computer

Vision, Graphics, and Image Processing: Image Understanding, 57(2):219–226,

March 1993.

[Gross and Boult 1991] A. Gross and T. Boult. SYMAN: A SYMetry ANalyzer. In

IEEE Computer Society Conference on Computer Vision and Pattern Recognition,

pages 744–746. IEEE, 1991.

[Gross and Boult 1994] A. Gross and T. Boult. Analyzing skewed symmetries. In-

ternational Journal of Computer Vision, 13(1):91–111, 1994.

[Harris and Stephens 1988] C. Harris and M. Stephens. A combined corner and edge

detector. In Proc. 4th Alvey Vision Conf., pages 147–151, 1988.

[Hartley and Zisserman 2000] R. Hartley and A. Zisserman. Multiple View Geome-

try in Computer Vision. Cambridge University Press, 2 edition, 2000.

[Huttenlocher and Wayner 1992] D. Huttenlocher and P. Wayner. Finding Convex

Edge Groupings in an Image. International Journal of Computer Vision, 8(1):7–

27, 1992.

[Illingworth and Kittler 1988] J. Illingworth and J. Kittler. A survey of the Hough

transform. Computer Vision, Graphics, and Image Processing, 44:87–116, 1988.

[Jacobs 1989] D. Jacobs. Groups for recognition. MIT AI Memo 1177, 1989.

[Jacobs 1996] D. Jacobs. Robust and efficient detection of salient convex groups.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1), 1996.

[Jawahar and Narayanan 2002] C.V. Jawahar and P.J. Narayanan. Generalised cor-

relation for multi-feature correspondence. Pattern Recognition, 35:1303–1313,

2002.

[Kanade 1981] T. Kanade. Recovery of the Three-Dimensional Shape of an Object

from a Single View. Artificial Intelligence, 17:409–460, 1981.

[Leavers 1993] V. F. Leavers. Which Hough transform? Computer Vision, Graph-

ics, and Image Processing, 58(2):250–264, 1993.

[Leung and Malik 1996] T. Leung and J. Malik. Detecting, localizing and grouping

repeated scene elements from an image. In European Conference on Computer

Vision, volume 1, pages 546–555, England, April 1996.

[Lin et al. 1997] H.-C. Lin, L.-L. Wang, and S.-N. Yang. Extracting periodicity of a

regular texture based on autocorrelation functions. Pattern Recognition Letters,

18:433–443, 1997.


[Liu and Collins 2000] J. Liu and T. Collins. A Computational Model for Repeated

Pattern Perception using Frieze and Wallpaper Groups. In IEEE Computer So-

ciety Conference on Computer Vision and Pattern Recognition, 2000.

[Liu and Collins 2001] J. Liu and T. Collins. Skewed Symmetry Groups. In IEEE

Computer Society Conference on Computer Vision and Pattern Recognition, De-

cember 2001.

[Liu et al. 1995] J. Liu, J. Mundy, and A. Zisserman. Grouping and structure re-

covery for images of objects with finite rotational symmetry. In Proc. Asian

Conference on Computer Vision, volume 1, pages 379–382, 1995.

[Lowe 1985] D. Lowe. Perceptual Organization in Visual Recognition. Kluwer Aca-

demic Publishers, 1985.

[Lutton et al. 1994] E. Lutton, H. Maître, and J. Lopez-Krahe. Contribution to the determination of vanishing points using Hough transform. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 16(4), 1994.

[Matas et al. 2002] J. Matas, S. Obdrzalek, and O. Chum. Local affine frames for

wide-baseline stereo. In International Conference on Pattern Recognition, August

2002.

[Mindru et al. 1998] F. Mindru, T. Moons, and L. Van Gool. Color-based moment

invariants for viewpoint and illumination independent recognition of planar color

patterns. In Intl. Conf. on Advances in Pattern Recognition, pages 113–122, 1998.

[Mindru et al. 1999a] F. Mindru, T. Moons, and L. Van Gool. Recognizing color

patterns irrespective of viewpoint and illumination. In IEEE Computer Society

Conference on Computer Vision and Pattern Recognition, volume 1, pages 368–

373, 1999.

[Mindru et al. 1999b] F. Mindru, T. Moons, and L. Van Gool. Recognizing color

patterns irrespective of viewpoint and illumination. In IEEE Computer Society

Conference on Computer Vision and Pattern Recognition, pages 368–373, 1999.

[Mindru et al. 2001] F. Mindru, T. Moons, and L. Van Gool. The influence of in-

tensity transformation models on illumination and viewpoint independent color

pattern recognition. In IEEE Computer Society Conference on Computer Vision

and Pattern Recognition, December 2001. Post-Conference workshop on Identify-

ing Objects Across Variations in Lighting: Psychophysics and Computation.

[Mukherjee et al. 1995] D. P. Mukherjee, A. Zisserman, and M. Brady. Shape from

Symmetry – Detecting and Exploiting Symmetry in Affine Images. In Phil. Trans.

R. Soc. Lond. A, pages 77–106. 1995.

[Oren and Nayar 1994] M. Oren and S. Nayar. Seeing beyond Lambert's law. In

European Conference on Computer Vision, pages 269–280, 1994.


[Pauwels and Frederix 1999] E. Pauwels and G. Frederix. Finding salient regions in images: Non-parametric clustering for image segmentation and grouping. Computer Vision and Image Understanding, 75, Jul./Aug. 1999.

[Ponce 1988] J. Ponce. Ribbons, symmetries and skewed symmetries. In ARPA Im-

age Understanding Workshop, volume 2, pages 1074–1079, Massachusetts, 1988.

[Reiss 1993] T. Reiss. Recognizing planar objects using invariant image features.

LNCS. Springer, 1993.

[Richards and Jepson 1992] W. Richards and A. Jepson. What Makes a Good Fea-

ture ? MIT AI Memo 1356, 1992.

[Roche et al. 1999] A. Roche, G. Malandain, and N. Ayache. Unifying maximum

likelihood approaches in medical image registration. Technical Report 3741, IN-

RIA, 1999.

[Schaffalitzky and Zisserman 1998] F. Schaffalitzky and A. Zisserman. Geometric

grouping of repeated elements within images. In Proc. 9’th British Machine Vision

Conference, pages 13–22, Southampton, 1998.

[Schaffalitzky and Zisserman 2000] F. Schaffalitzky and A. Zisserman. Planar

grouping for automatic detection of vanishing lines and points. Image and Vision

Computing, 18(9):647–658, June 2000.

[Schmid and Mohr 1997] C. Schmid and R. Mohr. Local greyvalue invariants for

image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence,

19(6):872–877, May 1997.

[Schmid et al. 2000] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest

point detectors. International Journal of Computer Vision, 37(2):151–172, 2000.

[Scott 1992] D.W. Scott. Multivariate Density Estimation. John Wiley & Sons,

1992.

[Semple and Kneebone 1952] J.G. Semple and G.T. Kneebone. Algebraic Projective

Geometry. Oxford University Press, 1952.

[Sha’ashua and Ullman 1988] A. Sha’ashua and S. Ullman. Structural Saliency:

The Detection of Globally Salient Structures Using a Locally Connected Network.

In Intl. Conf. on Computer Vision, pages 321–327, 1988.

[Springer 1964] C. Springer. Geometry and Analysis of Projective Spaces. Freeman,

1964.

[Turina et al. 2001a] A. Turina, T. Tuytelaars, and L. Van Gool. Efficient grouping

under perspective skew. In IEEE Computer Society Conference on Computer Vi-

sion and Pattern Recognition, volume 1, pages 247–254, Kauai, Hawaii, December

2001. IEEE Computer Society.

[Turina et al. 2001b] A. Turina, T. Tuytelaars, T. Moons, and L. Van Gool. Group-

ing via the matching of repeated patterns. In S. Singh, N. Murshed, and


W. Kropatsch, editors, Intl. Conf. on Advances in Pattern Recognition, number

2013 in Lecture Notes in Computer Science, pages 250–259, March 2001.

[Tuytelaars and Van Gool 1999] T. Tuytelaars and L. Van Gool. Content-based

image retrieval based on local, affinely invariant regions. In Proc. Third Intl.

Conf. on Visual Information Systems, pages 493–500, 1999.

[Tuytelaars and Van Gool 2000] T. Tuytelaars and L. Van Gool. Wide baseline

stereo based on local, affinely invariant regions. In Proc. British Machine Vision

Conf., pages 412–422, 2000.

[Tuytelaars et al. 1998a] T. Tuytelaars, L. Van Gool, M. Proesmans, and T. Moons.

The cascaded Hough transform as an aid in aerial image interpretation. In Intl.

Conf. on Computer Vision, pages 67–72, January 1998.

[Tuytelaars et al. 1998b] T. Tuytelaars, L. Van Gool, M. Proesmans, and T. Moons.

A cascaded Hough transform as an aid in aerial image interpretation. In Intl. Conf.

on Computer Vision, pages 67–72, 1998.

[Tuytelaars et al. 2002] T. Tuytelaars, A. Turina, and L. Van Gool. Non-combinatorial detection of regular repetitions under perspective skew. Accepted for IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

[Tuytelaars 2000] T. Tuytelaars. Local, invariant features for registration and recognition. PhD thesis, Katholieke Universiteit Leuven, December 2000.

[Van Gool and Proesmans 1995] L. Van Gool and M. Proesmans. Grouping and Invariants using Planar Homologies. In R. Mohr and W. Chengke, editors, Europe-China Workshop on Geometrical Modelling and Invariants for Computer Vision, pages 182–189. Xidian University Press, Xi’an, 1995.

[Van Gool et al. 1994] L. Van Gool, T. Moons, and M. Proesmans. Groups, fixed sets, symmetries and invariants. Technical Report KUL/ESAT/MI2/9426, Katholieke Universiteit Leuven, 1994.

[Van Gool et al. 1995a] L. Van Gool, T. Moons, and M. Proesmans. Groups for grouping: a strategy for the exploitation of geometrical constraints. In Proc. 6th Int. Conf. on Computer Analysis of Images and Patterns, pages 1–8, Prague, Czechia, 1995.

[Van Gool et al. 1995b] L. Van Gool, T. Moons, D. Ungureanu, and A. Oosterlinck. The Characterization and Detection of Skewed Symmetries. Computer Vision and Image Understanding, 61(1):138–195, 1995.

[Van Gool et al. 1995c] L. Van Gool, T. Moons, D. Ungureanu, and E. Pauwels. Symmetry from Shape and Shape from Symmetry. Int. J. of Robotics Research, 14(5):407–424, 1995.

[Van Gool et al. 1996] L. Van Gool, T. Moons, and D. Ungureanu. Geometric/photometric invariants for planar intensity patterns. In European Conference on Computer Vision, volume 1 of Lecture Notes in Computer Science, pages 642–651, Cambridge, UK, April 1996. Springer.

[Van Gool et al. 1998] L. Van Gool, M. Proesmans, and A. Zisserman. Planar Homologies as a basis for Grouping and Recognition. Image and Vision Computing, 16(1):21–26, 1998.

[Van Gool et al. 2001] L. Van Gool, T. Tuytelaars, and A. Turina. Local features for image retrieval. In R. C. Veltkamp, H. Burkhardt, and H.-P. Kriegel, editors, State-of-the-Art in Content-Based Image and Video Retrieval, volume 22 of Computational Imaging and Vision, pages 21–41. Kluwer Academic Publishers, 2001.

[Van Gool 1997] L. Van Gool. A Systematic Approach to Geometry-Based Grouping and Non-accidentalness. In G. Sommer and J. Koenderink, editors, Algebraic Frames for the Perception-Action Cycle (AFPAC’97), volume 1315 of Lecture Notes in Computer Science, pages 126–147, Kiel, Germany, September 1997. Springer.

[Van Gool 1998] L. Van Gool. Projective subgroups for grouping. Phil. Trans. R. Soc. Lond. A, 356(1740):1251–1266, 1998.

[Viola and Wells 1997] P. Viola and W. Wells. Alignment by maximization of mutual information. International Journal of Computer Vision, 24(2):137–154, 1997.

[Wertheimer 1923] M. Wertheimer. Untersuchungen zur Lehre von der Gestalt II [Investigations in the theory of Gestalt, II]. Psychol. Forschung, 4:301–350, 1923.

[Witkin and Tennenbaum 1983] A. Witkin and J. Tennenbaum. On the Role of Structure in Vision. In J. Beck, B. Hope, and A. Rosenfeld, editors, Human and Machine Vision. Academic Press, New York, 1983.

[Wolff 1994] L. Wolff. On the relative brightness of specular and diffuse reflection. In European Conference on Computer Vision, pages 369–376, 1994.

[Xu 1988] L. Xu. A method for recognizing configurations consisting of line sets and its application to discrimination of seismic face structures. In International Conference on Pattern Recognition, pages 610–612, 1988.

Curriculum Vitae

Andreas Turina

Date of birth: 4th of June, 1971
Place of birth: Zurich, Switzerland
Citizenship: Fällanden, ZH

Education:

1978–1984  Primary School in Pfaffhausen (ZH).
1984–1989  High School, Matura Type B (Realgymnasium Rämibühl, Zurich).
1990–1992  Studies of Physics at the Swiss Federal Institute of Technology Zurich.
1993–1999  Studies of Electrical Engineering at the Swiss Federal Institute of Technology Zurich. Graduation with the degree Dipl. El.-Ing. ETH.
1992, 1993, 1996  Initial military service and officers’ school at the Swiss Air Force.
1999–2002  Doctoral student at the Swiss Federal Institute of Technology (ETH) Zurich.

Occupations:

1992–1993  Zurich State Police, Airport Division.
1997  Internship at Sulzer Carbomedics, Austin, TX.
1999–2002  Research assistant at ETH Zurich, Computer Vision Group.