Improved Detection and Tracking of Objects in
Surveillance Video
by
Simon Paul Denman, BEng (Hons, 1st Class),
BIT (Dist)
PhD Thesis
Submitted in Fulfilment
of the Requirements
for the Degree of
Doctor of Philosophy
at the
Queensland University of Technology
Image and Video Research Laboratory
Faculty of Built Environment and Engineering
May 2009
Keywords
Object Tracking, Motion Detection, Optical Flow, Condensation Filter, Particle
Filter, Thermal Imagery, Multi-Sensor Fusion, Multi-Spectral Tracking.
Abstract
Surveillance networks are typically monitored by a few people, viewing several
monitors displaying the camera feeds. It is then very difficult for a human op-
erator to effectively detect events as they happen. Recently, computer vision
research has begun to address ways to automatically process some of this data,
to assist human operators. Object tracking, event recognition, crowd analysis and
human identification at a distance are being pursued as a means to aid human
operators and improve the security of areas such as transport hubs.
The task of object tracking is key to the effective use of more advanced technolo-
gies. To recognise an event, people and objects must be tracked. Tracking also
enhances the performance of tasks such as crowd analysis or human identification.
Before an object can be tracked, it must be detected. Motion segmentation tech-
niques, widely employed in tracking systems, produce a binary image in which
objects can be located. However, these techniques are prone to errors caused by
shadows and lighting changes. Detection routines often fail, either due to erro-
neous motion caused by noise and lighting effects, or due to the detection routines
being unable to split occluded regions into their component objects. Particle fil-
ters can be used as a self-contained tracking system, and make it unnecessary
for the task of detection to be carried out separately except for an initial (of-
ten manual) detection to initialise the filter. Particle filters use one or more
extracted features to evaluate the likelihood of an object existing at a given point
in each frame. Such systems however do not easily allow for multiple objects to be
tracked robustly, and do not explicitly maintain the identity of tracked objects.
This dissertation investigates improvements to the performance of object tracking
algorithms through improved motion segmentation and the use of a particle filter.
A novel hybrid motion segmentation / optical flow algorithm, capable of simulta-
neously extracting multiple layers of foreground and optical flow in surveillance
video frames, is proposed. The algorithm is shown to perform well in the presence
of adverse lighting conditions, and the optical flow is capable of extracting a mov-
ing object. The proposed algorithm is integrated within a tracking system and
evaluated using the ETISEO (Evaluation du Traitement et de l'Interpretation de
Sequences vidEO - Evaluation for video understanding) database, and signifi-
cant improvement in detection and tracking performance is demonstrated when
compared to a baseline system. A Scalable Condensation Filter (SCF), a particle
filter designed to work within an existing tracking system, is also developed. The
creation and deletion of modes and the maintenance of identity are handled by the
underlying tracking system, while the tracking system benefits from the improved
performance that a particle filter provides in uncertain conditions arising from
occlusion and noise. The system is evaluated using the ETISEO database.
The dissertation then investigates fusion schemes for multi-spectral tracking sys-
tems. Four fusion schemes for combining a thermal and visual colour modality are
evaluated using the OTCBVS (Object Tracking and Classification in and Beyond
the Visible Spectrum) database. It is shown that a middle fusion scheme yields
the best results and demonstrates a significant improvement in performance when
compared to a system using either mode individually.
Findings from the thesis contribute to improving the performance of semi-
automated video processing, and therefore to improving security in areas under
surveillance.
Contents
Abstract i
List of Tables xi
List of Figures xvii
Notation xxvii
Acronyms & Abbreviations xli
List of Publications xliii
Certification of Thesis xlvii
Acknowledgments xlix
Chapter 1 Introduction 1
1.1 Motivation and Overview . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Improvements to Motion Segmentation . . . . . . . . . . . 4
1.2.2 Improvements to Particle Filters . . . . . . . . . . . . . . . 5
1.2.3 Improvements to Multi-Modal Fusion in Tracking Systems 6
1.3 Scope of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Original Contributions and Publications . . . . . . . . . . . . . . 7
1.5 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Chapter 2 Literature Review 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Foreground Segmentation and Motion Detection . . . . . . . . . . 17
2.2.1 Background Subtraction and Background modelling . . . . 19
2.2.2 Optical Flow Approaches . . . . . . . . . . . . . . . . . . . 28
2.2.3 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4 Auxiliary Processes . . . . . . . . . . . . . . . . . . . . . . 32
2.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 Detecting and Tracking Objects . . . . . . . . . . . . . . . . . . . 37
2.3.1 Object Detection . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.2 Matching and Tracking Objects . . . . . . . . . . . . . . . 42
2.3.3 Handling Occlusions . . . . . . . . . . . . . . . . . . . . . 49
2.3.4 Alternative Approaches to Tracking . . . . . . . . . . . . . 53
2.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.4 Prediction Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4.1 Motion Models . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4.2 Kalman Filters . . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.3 Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.5 Multi Camera Tracking Systems . . . . . . . . . . . . . . . . . . . 71
2.5.1 System Designs . . . . . . . . . . . . . . . . . . . . . . . . 73
2.5.2 Track Handover and Occlusion Handling . . . . . . . . . . 75
2.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 3 Tracking System Framework 91
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.2.1 Tracking Algorithm Overview . . . . . . . . . . . . . . . . 94
3.3 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.3.1 Person Detection . . . . . . . . . . . . . . . . . . . . . . . 99
3.3.2 Vehicle Detection . . . . . . . . . . . . . . . . . . . . . . . 104
3.3.3 Blob Detection . . . . . . . . . . . . . . . . . . . . . . . . 107
3.4 Baseline Tracking System . . . . . . . . . . . . . . . . . . . . . . 108
3.5 Evaluation Process and Benchmarks . . . . . . . . . . . . . . . . 111
3.5.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . 112
3.5.2 Evaluation Data and Configuration . . . . . . . . . . . . . 118
3.5.3 Tracking Output Description . . . . . . . . . . . . . . . . . 127
3.5.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Chapter 4 Motion Detection 143
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.2 Multi-Modal Background modelling . . . . . . . . . . . . . . . . . 145
4.3 Core Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.3.1 Variable Threshold . . . . . . . . . . . . . . . . . . . . . . 149
4.3.2 Lighting Compensation . . . . . . . . . . . . . . . . . . . . 153
4.3.3 Shadow Detection . . . . . . . . . . . . . . . . . . . . . . . 158
4.4 Computing Optical Flow Simultaneously . . . . . . . . . . . . . . 161
4.4.1 Detecting Overlapping Objects . . . . . . . . . . . . . . . 168
4.5 Detecting Stopped Motion . . . . . . . . . . . . . . . . . . . . . . 170
4.5.1 Feedback From External Source . . . . . . . . . . . . . . . 172
4.6 Evaluation and Testing . . . . . . . . . . . . . . . . . . . . . . . . 173
4.6.1 Synthetic Data Tests . . . . . . . . . . . . . . . . . . . . . 174
4.6.2 Real World Data Tests . . . . . . . . . . . . . . . . . . . . 185
4.6.3 Optical Flow and Overlap Detection Evaluation . . . . . . 189
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Chapter 5 Object Detection 201
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
5.2 Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.3 Using Static Foreground . . . . . . . . . . . . . . . . . . . . . . . 203
5.4 Detecting Overlaps . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.5 Integration into the Tracking System . . . . . . . . . . . . . . . . 211
5.6 Evaluation and Testing . . . . . . . . . . . . . . . . . . . . . . . . 214
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Chapter 6 The Scalable Condensation Filter 227
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.2 Scalable Condensation Filter . . . . . . . . . . . . . . . . . . . . . 229
6.2.1 Dynamic Sizing . . . . . . . . . . . . . . . . . . . . . . . . 230
6.2.2 Dynamic Feature Selection and Occlusion Handling . . . . 234
6.2.3 Adding Tracks and Incorporating Detection Results . . . . 239
6.3 Tracking Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
6.3.1 Handling Static Foreground . . . . . . . . . . . . . . . . . 245
6.4 Incorporation into Tracking System . . . . . . . . . . . . . . . . . 248
6.4.1 Matching Candidate Objects to Tracked Objects . . . . . . 250
6.4.2 Occlusion Handling . . . . . . . . . . . . . . . . . . . . . . 253
6.5 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . 255
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Chapter 7 Advanced Object Tracking and Applications 271
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.2 Multi-Camera Tracking . . . . . . . . . . . . . . . . . . . . . . . . 272
7.2.1 System Description . . . . . . . . . . . . . . . . . . . . . . 272
7.2.2 Track Handover and Matching Objects in Different Views . 273
7.2.3 Occlusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.2.4 Evaluation using ETISEO database . . . . . . . . . . . . . 276
7.3 Multi-Spectral Tracking . . . . . . . . . . . . . . . . . . . . . . . 285
7.3.1 Evaluation of Fusion Points . . . . . . . . . . . . . . . . . 285
7.3.2 Evaluation of Fusion Techniques using OTCVBS Database 291
7.3.3 Proposed Fusion System . . . . . . . . . . . . . . . . . . . 300
7.3.4 Evaluation of System using OTCVBS Database . . . . . . 305
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Chapter 8 Conclusions and Future Work 309
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.2 Summary of Contribution . . . . . . . . . . . . . . . . . . . . . . 310
8.3 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Appendix A Baseline Tracking System Results 317
Appendix B Tracking System with Improved Motion Segmentation
Results 320
Appendix C Tracking System with SCF Results 323
Appendix D Multi-Camera Tracking System Results 326
Bibliography 329
List of Tables
3.1 Transition Conditions . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.2 Evaluation Metric Standard Definitions . . . . . . . . . . . . . . . 114
3.3 System Parameters - Fixed parameters for all data sets. . . . . . . 122
3.4 System Parameters - Parameters specific to the RD data sets. . . 124
3.5 System Parameters - Parameters specific to the BC data sets. . . 125
3.6 System Parameters - Parameters specific to the AP data sets. . . 126
3.7 System Parameters - Parameters specific to the BE data sets. . . 127
3.8 Baseline Tracking System Results . . . . . . . . . . . . . . . . . . 129
3.9 Overall Baseline Tracking System Results . . . . . . . . . . . . . . 129
3.10 Baseline Tracking System Throughput . . . . . . . . . . . . . . . 142
4.1 Synthetic Motion Detection Performance for AESOS Set 1 . . . . 176
4.2 Synthetic Motion Detection Performance for AESOS Set 2 . . . . 177
4.3 Synthetic Motion Detection Performance for AESOS Set 3 . . . . 177
4.4 Synthetic Motion Detection Performance for AESOS Set 4 . . . . 178
4.5 Synthetic Lighting Normalisation Performance . . . . . . . . . . . 183
4.6 Motion Detection Results for Real World Sequence . . . . . . . . 188
5.1 System Parameters - Additional parameters for system configuration. 214
5.2 System Parameters - Additional parameters for system configura-
tion specific to each dataset group. . . . . . . . . . . . . . . . . . 215
5.3 Overall Tracking System Performance using a Global Variable
Threshold for Motion Detection (see Section 3.5.1 for an expla-
nation of metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.4 Overall Tracking System Performance using Individual Variable
Thresholds for Motion Detection (see Section 3.5.1 for an explana-
tion of metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
5.5 Performance of BE Cameras Using Different Motion Thresh-
old Approaches (DOv, LOv and TOv are overall detection, localisa-
tion and tracking metric results respectively, see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 216
5.6 Tracking System with Improved Motion Detection Results (see
Section 3.5.1 for an explanation of metrics) . . . . . . . . . . . . . 218
5.7 Tracking System with Improved Motion Detection Overall Results
(see Section 3.5.1 for an explanation of metrics) . . . . . . . . . . 218
5.8 Proposed Tracking System Throughput . . . . . . . . . . . . . . . 224
6.1 System Parameters - Additional parameters for system configuration. 256
6.2 System Parameters - Additional parameters for system configura-
tion specific to each dataset group. . . . . . . . . . . . . . . . . . 256
6.3 Tracking System with SCF Results (see Section 3.5.1 for an expla-
nation of metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.4 Tracking System with SCF Overall Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 257
6.5 Improvements using SCF (see Section 3.5.1 for an explanation of
metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.6 Overall Improvements using SCF (see Section 3.5.1 for an expla-
nation of metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.7 Proposed Tracking System Throughput . . . . . . . . . . . . . . . 267
7.1 Multi-Camera Tracking System Results (see Section 3.5.1 for an
explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . . . 277
7.2 Multi-Camera Tracking System Overall Results (see Section 3.5.1
for an explanation of metrics) . . . . . . . . . . . . . . . . . . . . 278
7.3 Improvements using a Multi-Camera System (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 278
7.4 Overall Improvements using a Multi-Camera System (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 279
7.5 Fusion Algorithm Evaluation - Set 1 Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 292
7.6 Fusion Algorithm Evaluation - Overall Set 1 Results (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 292
7.7 Fusion Algorithm Evaluation - Set 2 Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 295
7.8 Fusion Algorithm Evaluation - Overall Set 2 Results (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 295
7.9 Proposed Fusion Algorithm - Set 1 Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 305
7.10 Proposed Fusion Algorithm - Overall Set 1 Results (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 305
7.11 Proposed Fusion Algorithm - Set 2 Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 306
7.12 Proposed Fusion Algorithm - Overall Set 2 Results (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 306
A.1 RD Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 317
A.2 Overall RD Dataset Results . . . . . . . . . . . . . . . . . . . . . 317
A.3 BC Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 318
A.4 Overall BC Dataset Results . . . . . . . . . . . . . . . . . . . . . 318
A.5 AP Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 318
A.6 Overall AP Dataset Results . . . . . . . . . . . . . . . . . . . . . 318
A.7 BE Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 318
A.8 Overall BE Dataset Results . . . . . . . . . . . . . . . . . . . . . 319
B.1 RD Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 320
B.2 Overall RD Dataset Results . . . . . . . . . . . . . . . . . . . . . 320
B.3 BC Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 321
B.4 Overall BC Dataset Results . . . . . . . . . . . . . . . . . . . . . 321
B.5 AP Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 321
B.6 Overall AP Dataset Results . . . . . . . . . . . . . . . . . . . . . 321
B.7 BE Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 321
B.8 Overall BE Dataset Results . . . . . . . . . . . . . . . . . . . . . 322
C.1 RD Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 323
C.2 Overall RD Dataset Results . . . . . . . . . . . . . . . . . . . . . 323
C.3 BC Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 324
C.4 Overall BC Dataset Results . . . . . . . . . . . . . . . . . . . . . 324
C.5 AP Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 324
C.6 Overall AP Dataset Results . . . . . . . . . . . . . . . . . . . . . 324
C.7 BE Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 324
C.8 Overall BE Dataset Results . . . . . . . . . . . . . . . . . . . . . 325
D.1 AP Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 326
D.2 Overall AP Dataset Results . . . . . . . . . . . . . . . . . . . . . 326
D.3 BE Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 327
D.4 Overall BE Dataset Results . . . . . . . . . . . . . . . . . . . . . 327
List of Figures
1.1 Example of a Surveillance Scene . . . . . . . . . . . . . . . . . . . 2
1.2 Example of Changing Scene Conditions . . . . . . . . . . . . . . . 3
2.1 A Basic Top-Down System . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Foreground Mask for a Scene (Hand Segmented). Areas of motion
are represented as white in the foreground mask. . . . . . . . . . . 17
2.3 Temporarily Stopped Objects . . . . . . . . . . . . . . . . . . . . 22
2.4 The Particle Filter Process (Merwe et al. [122]) . . . . . . . . . . 62
2.5 Sequential Importance Re-sampling. . . . . . . . . . . . . . . . . . 63
2.6 The Condensation Process (Isard and Blake [77]) . . . . . . . . . 64
2.7 Incorporating Adaboost Detections into the BPF Distribution . . 67
2.8 Multi-camera system architecture 1. . . . . . . . . . . . . . . . . . 71
2.9 Multi-camera system architecture 2. . . . . . . . . . . . . . . . . . 72
2.10 Surveillance network containing disjoint cameras. . . . . . . . . . 86
3.1 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.2 Tracking Algorithm Flowchart . . . . . . . . . . . . . . . . . . . . 95
3.3 State Diagram for a Tracked Object . . . . . . . . . . . . . . . . . 96
3.4 Head Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.5 Height Map Generation . . . . . . . . . . . . . . . . . . . . . . . . 103
3.6 Ellipse Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.7 Vehicle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.8 Detecting Overlapping Vehicles . . . . . . . . . . . . . . . . . . . 107
3.9 Ground truth and result overlaps . . . . . . . . . . . . . . . . . . 115
3.10 Examples of Evaluation Data . . . . . . . . . . . . . . . . . . . . 120
3.11 Zones for RD Datasets . . . . . . . . . . . . . . . . . . . . . . . . 123
3.12 Mask Image for RD Datasets . . . . . . . . . . . . . . . . . . . . 123
3.13 Mask Image for BC Datasets . . . . . . . . . . . . . . . . . . . . . 125
3.14 Zones for BE Datasets . . . . . . . . . . . . . . . . . . . . . . . . 126
3.15 Mask Images for BE Datasets . . . . . . . . . . . . . . . . . . . . 127
3.16 Example output from the tracking system . . . . . . . . . . . . . 128
3.17 Example output from RD7 - Loss of track due to the target object
(car in blue rectangle on the far side of the road) being stationary
for several hundred frames . . . . . . . . . . . . . . . . . . . . . . 130
3.18 Example output from BC16 - Detection and localisation errors
caused by shadow/reflection of the moving object . . . . . . . . . 131
3.19 Example output from BC16 - Total occlusion, resulting in loss of
track identities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.20 Example output from BE19-C1 - Impact of poor motion detection
performance on tracking, the person leaving the building is tracked
poorly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.21 Example output from BE19-C3 - Impact of poor motion detection
performance on tracking, the two people are localised poorly . . . 134
3.22 Sensitivity of Baseline System to Variations in τfit . . . . . . . . . 137
3.23 Sensitivity of Baseline System to Variations in τOccPer . . . . . . . 138
3.24 Sensitivity of Baseline System to Variations in τlum . . . . . . . . 139
4.1 Flowchart of proposed motion detection algorithm . . . . . . . . . 144
4.2 Motion Detection - The input image (on the left) is converted into
clusters (on the right) by pairing pixels . . . . . . . . . . . . . . . 146
4.3 Pixel Noise - Indoor Scene . . . . . . . . . . . . . . . . . . . . . . 150
4.4 Pixel Noise - Outdoor Scene . . . . . . . . . . . . . . . . . . . . . 151
4.5 Sample Outdoor Scene With Changing Light Conditions . . . . . 154
4.6 Background Difference Across Whole Scene . . . . . . . . . . . . . 155
4.7 Partitioning of Image for Localised Lighting Compensation . . . . 155
4.8 Background Difference Across Region (2,2) . . . . . . . . . . . . . 156
4.9 Background Difference Across Region (3,5) . . . . . . . . . . . . . 157
4.10 Shadows cast over section of the background . . . . . . . . . . . . 159
4.11 Search Order for Optical Flow . . . . . . . . . . . . . . . . . . . . 162
4.12 Matching Across Cluster Boundaries - Bold lines indicate cluster
groupings, the red cluster location in the current image is being
compared to the blue cluster (actually split across two clusters) in
the previous image. . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.13 Optical Flow Tracking, for Lopf = 4 . . . . . . . . . . . . . . . . . 167
4.14 Optical Flow Pixel States . . . . . . . . . . . . . . . . . . . . . . 169
4.15 Static Layer Matching Flowchart . . . . . . . . . . . . . . . . . . 171
4.16 AESOS Database Example, GL is the grey level of the synthetic
figures in the scene . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.17 Synthetic Motion Detection Performance for AESOS Set 1, GL150 179
4.18 Synthetic Motion Detection Performance for AESOS Set 3, GL80 180
4.19 AESOS Lighting Variation Database . . . . . . . . . . . . . . . . 182
4.20 Synthetic Lighting Normalisation Performance using Set 02 . . . . 183
4.21 Synthetic Lighting Normalisation Performance using Set 03 . . . . 184
4.22 Motion Detection Results for Real World Sequence . . . . . . . . 186
4.23 Motion Detection Results for Real World Sequence . . . . . . . . 187
4.24 Optical Flow Performance - CAVIAR Set WalkByShop1front,
Frame 1640 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.25 Optical Flow Performance - CAVIAR Set OneStopNoEnter2front,
Frame 541 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.26 Optical Flow Performance - CAVIAR Set OneStopNoEnter2front,
Frame 1101 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.27 Overlap Detection - Example 1 . . . . . . . . . . . . . . . . . . . 198
4.28 Overlap Detection - Example 2 . . . . . . . . . . . . . . . . . . . 198
4.29 Overlap Detection - Example 3 . . . . . . . . . . . . . . . . . . . 199
5.1 State Diagram Incorporating Static Objects . . . . . . . . . . . . 204
5.2 Static Object Detection using the Template Image . . . . . . . . . 206
5.3 Example Image Containing a Discontinuity . . . . . . . . . . . . . 208
5.4 Flow Status Vertical Projections . . . . . . . . . . . . . . . . . . . 209
5.5 Detected Discontinuities . . . . . . . . . . . . . . . . . . . . . . . 209
5.6 Example Image Containing Different Types of Discontinuities . . 210
5.7 Flow Status Vertical Projections . . . . . . . . . . . . . . . . . . . 211
5.8 Classified Discontinuities . . . . . . . . . . . . . . . . . . . . . . . 211
5.9 Tracking Algorithm Flowchart with Modified Object Detection
Routines (additions/changes to baseline system shown in yellow) . 212
5.10 Process for Detecting a Known Object . . . . . . . . . . . . . . . 212
5.11 Example output from RD7 - Maintaining Tracking of Temporarily
Stopped Objects (the car on the far side of the road) . . . . . . . 219
5.12 Example output from RD7 - Improved detection and localisation
of objects that have been stationary for long periods of time (the
car on the far side of the road) . . . . . . . . . . . . . . . . . . 219
5.13 Example output from BC16 - Improved Detection Results due to
Proposed Motion Detection Routine . . . . . . . . . . . . . . . . . 220
5.14 Example output from AP12-C7 - Tracking Example for AP Dataset 221
5.15 Example output from AP11-C4 - Tracking Example for AP Dataset 221
5.16 Example Output from BE19-C1 - Maintaining tracks for stationary
objects (the parked car) . . . . . . . . . . . . . . . . . . . . . . . 223
5.17 Example Output from BE20-C3 - Occlusion handling . . . . . . . 223
6.1 Dynamic Sizing of Particle Filter . . . . . . . . . . . . . . . . . . 232
6.2 Sequential Importance Re-sampling for the SCF . . . . . . . . . . 233
6.3 A Typical Occlusion between Two Objects . . . . . . . . . . . . . 236
6.4 Calculating particle weights for occluded objects . . . . . . . . . . 237
6.5 Scalable Condensation Filter Process . . . . . . . . . . . . . . . . 240
6.6 Dividing input image for Appearance Model . . . . . . . . . . . . 242
6.7 Integration of the SCF into the Tracking System . . . . . . . . . . 250
6.8 Determining match probability using particles . . . . . . . . . . . 251
6.9 Update State Diagram Incorporating Occluded and Predicted States 253
6.10 Example Output from RD7 - Occlusion handling using the SCF
(the person walking behind the car on the far side of the road) . . 259
6.11 Example Output from RD7 - Occlusion handling using the SCF
(the group of people in the bottom right corner of the scene) . . . 260
6.12 Example Output from BC16 - Initialisation of tracks from spurious
motion (the person entering through the door at the top of the scene) 261
6.13 Example Output from BC16 - Improved occlusion handling using
the SCF (one person walks behind another) . . . . . . . . . . . . 263
6.14 Example Output from AP11-C4 - Errors as an object leaves the
scene using the SCF. . . . . . . . . . . . . . . . . . . . . . . . . . 264
6.15 Example Output from BE19-C1 - Tracking Performance using the
SCF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.16 Example Output from BE20-C3 - Localisation Performance using
the SCF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.1 A Severe Occlusion Observed from Two Views . . . . . . . . . . . 276
7.2 Example System Results AP11 - Object leaving and re-entering
field of view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.3 Example System Results AP12 - Object leaving and re-entering
field of view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.4 Example System Results BE19 - Track matching and occlusion
handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.5 Example System Results BE19 - Re-associating objects after de-
tection and tracking failure . . . . . . . . . . . . . . . . . . . . . . 283
7.6 Example System Results BE20 - Occlusion Performance . . . . . . 284
7.7 The points for fusion in the system . . . . . . . . . . . . . . . . . 286
7.8 Fusing Visual and Thermal Information into a YCbCr Image for
use in the Motion Detection . . . . . . . . . . . . . . . . . . . . . 287
7.9 Noise in Colour Images for Set 1 . . . . . . . . . . . . . . . . . . . 293
7.10 Example System Results for Set 1 - Occlusion . . . . . . . . . . . 294
7.11 Example System Results for Set 2 . . . . . . . . . . . . . . . . . . 298
7.12 Example System Results for Set 2 . . . . . . . . . . . . . . . . . . 299
7.13 Flowchart for Proposed Fusion System . . . . . . . . . . . . . . . 301
7.14 Updated State Diagram . . . . . . . . . . . . . . . . . . . . . . . 304
7.15 Example System Results for Set 2 . . . . . . . . . . . . . . . . . . 307
Notation
Ax X dimension of an appearance model.
Ay Y dimension of an appearance model.
Aec(x, y, t) Error measure for the colour component of an ap-
pearance model.
Aeopf (x, y, t) Error measure for the optical flow component of
an appearance model.
Ac(x, y, t, k) Value of the kth colour channel of an appearance
model.
Au(x,y,t) Value of the horizontal velocity of an appearance
model.
Av(x, y, t) Value of the vertical velocity of an appearance
model.
Am(x, y, t) Value of the motion occupancy of an appearance
model.
As(x, y, t) Value of the motion state for an appearance model.
Avehicle The number of motion pixels within the bounds of
a vehicle candidate.
B(T i, Ck) The Bhattacharya coefficient for the comparison of
histograms belonging to a candidate object and a
tracked object.
C(x, y, t, k) A cluster at pixel x, y in a background model.
Cy1(x, y, t, k) First luminance value of a cluster.
Cy2(x, y, t, k) Second luminance value of a cluster.
Cghy1 (x, y, t, k) First horizontal luminance gradient value of a clus-
ter.
Cghy2 (x, y, t, k) Second horizontal luminance gradient value of a
cluster.
Cgvy1 (x, y, t, k) First vertical luminance gradient value of a cluster.
Cgvy2 (x, y, t, k) Second vertical luminance gradient value of a clus-
ter.
Ccb(x, y, t, k) Blue chrominance value of a cluster.
Ccr(x, y, t, k) Red chrominance value of a cluster.
Cw(x, y, t, k) Weight value of a cluster.
Cob(x, y, t) Colour of the matching motion layer (i.e. active
or one of the static layers) for an appearance com-
parison involving static motion.
Cos(x, y, t) Colour of the matching static layer for an appear-
ance comparison involving static motion.
coccluded Counter monitoring the time an object spends in
the occluded state.
cpredicted Counter monitoring the time an object spends in
the predicted state.
card(x) The cardinality (size) of a set x.
χ Allowed variation in lighting compensation value
between two frames.
Di A candidate object, index i.
Dix X position of a candidate object.
Diy Y position of a candidate object.
Diw Width of a candidate object.
Dih Height of a candidate object.
DiffChr(x, y, t) Chrominance difference when matching an input
pixel pair to a cluster in a background model.
DiffLum(x, y, t) Luminance difference when matching an input
pixel pair to a cluster in a background model.
dmin Minimum dimension bounds for particles in the
SCF.
dmax Maximum dimension bounds for particles in the
SCF.
dxmin Minimum bounds for the x dimension of particles
in the SCF.
dxmax Maximum bounds for the x dimension of particles
in the SCF.
dymin Minimum bounds for the y dimension of particles
in the SCF.
dymax Maximum bounds for the y dimension of particles
in the SCF.
dwmin Minimum bounds for the width dimension of par-
ticles in the SCF.
dwmax Maximum bounds for the width dimension of par-
ticles in the SCF.
dhmin Minimum bounds for the height dimension of par-
ticles in the SCF.
dhmax Maximum bounds for the height dimension of par-
ticles in the SCF.
E(i, j) An ellipse mask used during person detection.
Earea(Ci, T j) Error in area between a candidate object and a
tracked object.
Eposition(Ci, T j) Error in position between a candidate object and
a tracked object.
emax The maximum allowed drift of a particle from
frame to frame.
exmax The maximum allowed drift of a particle from
frame to frame for the x dimension.
eymax The maximum allowed drift of a particle from
frame to frame for the y dimension.
ewmax The maximum allowed drift of a particle from
frame to frame for the width dimension.
ehmax The maximum allowed drift of a particle from
frame to frame for the height dimension.
Fc(x, y, t, k) Value of the kth colour channel of a feature ex-
tracted from an incoming image to build/compare
to an appearance model.
F activec (x′, y′, t, k) Intermediary value used when computing an ap-
pearance model that involves both active and
static motion.
F staticc (x′, y′, t, k) Intermediary value used when computing an ap-
pearance model that involves both active and
static motion.
Fu(x,y,t) Value of the horizontal velocity of a feature ex-
tracted from an incoming image to build/compare
to an appearance model.
Fv(x, y, t) Value of the vertical velocity of a feature extracted
from an incoming image to build/compare to an
appearance model.
Fm(x, y, t) Value of the motion occupancy of a feature ex-
tracted from an incoming image to build/compare
to an appearance model.
F (Ci, T j) A fit function between a tracked object and a can-
didate object.
F̄ (Ci, T j) A fit function between a tracked object and a can-
didate object that incorporates SCF output.
Farea(Ci, T j) A fit function between a tracked object and a can-
didate object considering only the area.
Fposition(Ci, T j) A fit function between a tracked object and a can-
didate object considering only the position.
FSCF (Ci, T j) A fit function between a tracked object and a can-
didate object considering the output of the SCF.
fgnd(t) The set of all clusters that are classified as fore-
ground.
fcount The number of frames a cluster in motion has been
tracked for.
Gc(x′, y′, t) The number of standard deviations an observation
is from the mean for the colour component of an
appearance model.
Gopf (x′, y′, t) The number of standard deviations an observation
is from the mean for the optical flow component
of an appearance model.
γi The initial position of an object (x, y, w, h).
Φµ,σ The cumulative density function for a Gaussian
distribution.
hproj(j) A vector containing a horizontal projection of a
motion image.
H(T i, n) The nth bin of the histogram of a tracked object.
I(t) Input Image.
Imobj Candidate image for an object.
IS Image of pixel states.
IST,n Static image template for an object, n.
Iy(t) The luminance channel of an input image in
YCbCr 4:2:2 format.
IZ Image composed of static cluster colours.
i Index into a list/vector/image. See text for con-
text.
j Index into a list/vector/image. See text for con-
text.
K Total number of clusters per pixel in a background
model.
Ks Number of stationary clusters per pixel in a back-
ground model.
Kb Number of background clusters per pixel in a back-
ground model.
k Index of a cluster at a pixel in a background model.
κ Index of the matching cluster for a pixel in a back-
ground model.
L Learning rate of a background model.
Lopf Learning rate of the average velocity for a cluster.
Lfus Learning rate of the performance metrics for object
fusion.
λi Movement of an object, i, from one frame to the
next.
M A motion image.
Ma Active motion image.
MIR Motion image from an IR camera.
Ms Static motion image.
MV is Motion image from visible spectrum camera.
Mk Value used to update cluster weights.
min(x, y) The minimum of two values, x and y.
max(x, y) The maximum of two values, x and y.
N(t) The number of objects detected at the current
time.
Nthermal(t) The number of objects detected in the thermal do-
main at the current time.
Nvisual(t) The number of objects detected in the visual do-
main at the current time.
νinit Initial number of particles allocated to each object
in the SCF.
νadd Additional number of particles that can be added
for each occlusion level increase for each object in
the SCF.
Ofused(t) List of objects created by fusing detection lists for
the visual and thermal domains.
Othermal(t) List of objects detected using input from the ther-
mal domain.
Ovisible(t) List of objects detected using input from the visi-
ble spectrum.
Olum(r, t) Weighted average of luminance changes for a re-
gion within the scene.
Operson The occupancy (percentage of motion pixels within
the bounds) of a person candidate.
Ovehicle The occupancy (percentage of motion pixels within
the bounds) of a vehicle candidate.
Ovhoriz Horizontal overlap.
Ovvert Vertical overlap.
Θ(x, y, t) The luminance at a pixel.
p(xt|xt−1) Probability distribution of the SCF.
pi(xi,t|xi,t−1) Probability distribution of the particles for the ith
track within the SCF.
p(x, y, t) A pixel at located at x, y at time t.
P (x, y, t) A pixel pair, located at x, y at time t.
Pcount Size of the set, Wfgnd(x, y, t).
PTj ,t Image of back projected particle probabilities for
a tracked object.
pm(t) Overall performance for the modality m’s object
detection.
pthermal(t) Overall performance for the thermal domain’s ob-
ject detection.
pvisual(t) Overall performance for the visual domain’s object
detection.
Qp The number of successive frames a cluster's posi-
tion can be estimated.
Qv The number of frames a cluster’s motion must have
been tracked for before an overlap can be recorded
involving the cluster.
qi(xi,t|xi,t−1) Probability distribution formed by object detec-
tion results for the ith track.
qthermal(t) Quality measure for the object detection perfor-
mance in the thermal domain for the current
frame.
qvisual(t) Quality measure for the object detection perfor-
mance in the visual domain for the current frame.
R Number of regions a scene is divided into.
Ri, j Reprojection error when transferring two tracked
object’s positions between camera views.
Rm Ratio between static foreground and active fore-
ground.
r Index of a subregion of the scene.
ρ A random vector used when updating particles in
the SCF.
S A pixel state, one of new, continuous, overlap or
ended.
S(x, y, t) The state of pixel pair.
si,n,t The nth particle for the ith track in the SCF.
σarea The standard deviation of the area error for match-
ing candidate objects to tracked objects.
σ(t)2Chr Variance of the chrominance differences when
matching clusters for a background model.
σ(t)2Chrinit Initial chrominance variance for a background
model.
σ(t)2Lum Variance of the luminance differences when match-
ing clusters for a background model.
σ(t)2Luminit Initial luminance variance for a background model.
σlum(r, t) Weighted standard deviation of the luminance
changes for a region for a given frame.
σpos The standard deviation of the position error for
matching candidate objects to tracked objects.
T j A tracked object, index j.
T jx X position of a tracked object.
T jx,i X position of a tracked object in view i.
T jx,ω X position of a tracked object in world coordinates.
T jy Y position of a tracked object.
T jy,i Y position of a tracked object in view i.
T jy,ω Y position of a tracked object in world coordinates.
T ju X velocity of a tracked object.
T jv Y velocity of a tracked object.
T jw Width of a tracked object.
T jh Height of a tracked object.
τAreaV eh Threshold applied to Avehicle to determine if the
detected candidate is valid.
τa Threshold to determine if a modality is performing
well enough for objects to be added based solely
on detections from that modality.
τactive Number of frames of continuous detection required
for an object to enter the active state after cre-
ation.
τB Threshold for determining a match between two
histograms when matching objects across different
camera views.
τChr Chrominance threshold for a background model.
τChrShad Inverse scaling factor for the chrominance thresh-
old when detecting shadows.
τd Threshold for determining if two views of an object
(i.e. the same object observed in different cam-
eras) have transferred coordinates close enough to
remain paired.
τF1 Threshold for fusing motion images from multiple
modalities.
τF2 Threshold for fusing motion images from multiple
modalities.
τFScale Scaling factor for object detection using multiple
modalities.
τfit Threshold on fit scores.
τforeground Threshold to separate foreground from background
in a background model.
τGrad Gradient threshold for shadow detection in a back-
ground model.
τ gradov Threshold on gradient overlaps.
τ gradprox Threshold on gradient overlap proximity for merg-
ing.
τLum Luminance threshold for a background model.
τLumShad Threshold for the luminance difference when de-
tecting shadows.
τmc Threshold to determine if the percentage change
in the amount of motion from frame to frame is
acceptable.
τMaxChr Maximum chrominance threshold for the motion
detection when a variable threshold is used.
τMaxLum Maximum luminance threshold for the motion de-
tection when a variable threshold is used.
τMinChr Minimum chrominance threshold for the motion
detection when a variable threshold is used.
τMinLum Minimum luminance threshold for the motion de-
tection when a variable threshold is used.
τnearby Threshold to determine nearby objects for
adding/removing particles from the SCF.
τoccluded Number of frames an object is allowed to exist in
the occluded state for before it is removed.
τOccPer Threshold applied to Operson to determine if the
detected candidate is valid.
τOccV eh Threshold applied to Ovehicle to determine if the
detected candidate is valid.
τR Threshold for determining if two objects in dif-
ferent cameras have transferred coordinates close
enough to be considered the same object.
τS Threshold applied to Rm when determining if a
region is stationary.
τSOv Threshold for overlap between the active state and
discontinuity states for detecting overlaps.
τSOvType Threshold for the overlap between the new and
overlap states for determining the type of overlap
present.
τstatic Threshold on object velocity to determine if an
object is stationary.
τvel Threshold for velocity error when extracting a can-
didate image from optical flow images.
t Current time step.
tstatic A period of time that a cluster has remained sta-
tionary for.
txj X coordinate that has been transferred from view
j.
tyj Y coordinate that has been transferred from view
j.
U(T i, T j, Ck) The uncertainty between two tracked objects and
a candidate object.
U(x, y, t) Horizontal flow image.
uave The average horizontal velocity of a cluster.
υx The x position of a cluster.
υy The y position of a cluster.
V (x, y, t) Vertical flow image.
vcontour(i) A vector containing the top contour of a motion
image.
vHeightMap(i) A weighted sum of a vcontour(i) and a vproj(i).
vproj(i) A vector containing a vertical projection of a mo-
tion image.
vave The average vertical velocity of a cluster.
vx X velocity.
vy Y velocity.
W (x1 : x2, y1 : y2, t) A window of clusters.
Wfgnd(x, y, t) A subset of W (x1 : x2, y1 : y2, t) where all clusters
within the subset were in the foreground at t− 1.
WLum Average luminance difference when matching two
windows of clusters.
WGrad Average gradient difference when matching two
windows of clusters.
wk Weight for a cluster in the background model, in-
dex k.
wi,n,t The weight for the nth particle for the ith track in
the SCF.
wthermal(t) The weight assigned to the thermal domain during
fusion of detected object lists.
wvisual(t) The weight assigned to the visual domain during
fusion of detected object lists.
X Image width.
x X coordinate.
Y Image height.
y y coordinate.
Z(x, y, t, z) A static layer within a motion model.
Zc(x, y, t, z) Counter for a static layer.
z Depth of a static layer.
zj,i,t The jth particle filter feature for the ith tracked
object.
z−k,i,t The kth negative particle filter feature for the ith
tracked object.
ζi,ω Function to transform image coordinates to world
coordinates.
ζi,j Function to transform image coordinates between
two cameras.
Acronyms & Abbreviations
2D 2-dimensional
3D 3-dimensional
AO Abandoned Object
AOD Abandoned Object Detection
BPF Boosted Particle Filter
CAVIAR Context Aware Vision using Image-based Active Recognition
ETISEO Evaluation du Traitement et de l'Interpretation de Sequences Video - Evaluation for video understanding
FOV Field of View
FN False Negative
FP False Positive
FPS Frames Per Second
GHz Gigahertz
GL Grey Level
GMM Gaussian Mixture Model
GT Ground Truth
HMM Hidden Markov Model
HSV Hue, Saturation, Value
ID Identity
IR Infrared
MPF Mixture Particle Filter
OTCBVS Object Tracking and Classification in and Beyond the Visible Spectrum
PDF Probability Density Function
PETS Performance Evaluation of Tracking Systems
PTZ Pan-Tilt-Zoom
RGB Red, Green, Blue
SAD Sum of Absolute Differences
SCF Scalable Condensation Filter
SIR Sequential Importance Re-sampling
TN True Negative
TP True Positive
Y’CbCr Luminance, Blue Chrominance, Red Chrominance
XML Extensible Markup Language
List of Publications
The journal articles that have been published as part of this research are as
follows:
1. S. Denman, V. Chandran, and S. Sridharan, ‘An Adaptive Optical Flow
Technique for Person Tracking Systems,' Elsevier Pattern Recognition Let-
ters, vol. 28, pp. 1232-1239, 15 July 2007.
2. S. Denman, T. Lamb, C. Fookes, and S. Sridharan, ‘Multi-Spectral Fusion
for Surveillance Systems,’ International Journal on Computers and Electri-
cal Engineering, Accepted and published online on 30th December 2008,
DOI:10.1016/j.compeleceng.2008.11.011
The book chapters that have been published as part of this research are as follows:
1. F. Lin, S. Denman, V. Chandran, and S. Sridharan, Computational Foren-
sics, ch. Improved Subject Identification in Surveillance Video using Super-
resolution. Springer-Verlag, 2008. (Accepted 11 March 2008)
The conference articles that have been published as part of this research are as
follows:
1. S. Denman, V. Chandran, and S. Sridharan, ‘Tracking People in 3D Using
Position, Size and Shape,’ in Eighth International Symposium on Signal
Processing and Its Applications, Sydney, Australia, 2005, pp. 611-614.
2. S. Denman, V. Chandran, and S. Sridharan, ‘Adaptive Optical Flow for
Person Tracking,’ in Digital Image Computing: Techniques and Applica-
tions, Cairns, Australia, 2005, pp. 44-50.
3. S. Denman, V. Chandran, and S. Sridharan, ‘Person Tracking using Mo-
tion Detection and Optical Flow,' in The 4th Workshop on the Internet,
Telecommunications and Signal Processing, Noosa, Australia, 2005, pp.
242-247.
4. S. Denman, V. Chandran, and S. Sridharan, ‘A Multi-Class Tracker us-
ing a Scalable Condensation Filter,’ in Advanced Video and Signal Based
Surveillance, Sydney, 2006, DOI:10.1109/AVSS.2006.7.
5. S. Denman, C. Fookes, J. Cook, C. Davoren, A. Mamic, G. Farquharson,
D. Chen, B. Chen, and S. Sridharan, ‘Multi-view Intelligent Vehicle Surveil-
lance System,’ in Advanced Video and Signal Based Surveillance, Sydney,
2006, DOI:10.1109/AVSS.2006.78.
6. S. Denman, V. Chandran, and S. Sridharan, ‘Robust Multi-Layer Fore-
ground Segmentation for Surveillance Applications,’ in IAPR Conference
on Machine Vision Applications, The University of Tokyo, Japan, 2007, pp.
496-499.
7. F. Lin, S. Denman, V. Chandran, and S. Sridharan, ‘Automatic Track-
ing, Super-Resolution and Recognition of Human Faces from Surveillance
Video,’ in IAPR Conference on Machine Vision Applications, Tokyo, 2007,
pp. 37-40.
8. S. Denman, T. Lamb, C. Fookes, S. Sridharan, and V. Chandran, ‘Multi-
Sensor Tracking using a Scalable Condensation Filter,’ in International Con-
ference on Signal Processing and Communication Systems (ICSPCS), Gold
Coast, QLD, 2007, pp. 429-438.
9. S. Denman, S. Sridharan, and V. Chandran, ‘Abandoned Object Detection
Using Multi-Layer Motion Detection,’ in International Conference on Sig-
nal Processing and Communication Systems (ICSPCS), Gold Coast, QLD,
2007, pp. 439-448.
10. S. Denman, C. Fookes, V. Chandran, S. Sridharan, ‘Object Tracking using
Multiple Motion Modalities’, in International Conference on Signal Process-
ing and Communication Systems (ICSPCS), Gold Coast, QLD, 2008.
11. D. Ryan, S. Denman, C. Fookes, S. Sridharan, ‘Scene Invariant Crowd
Counting for Real-Time Surveillance’, in International Conference on Sig-
nal Processing and Communication Systems (ICSPCS), Gold Coast, QLD,
2008.
Certification of Thesis
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher educational institution. To the best of my
knowledge and belief, the thesis contains no material previously published or
written by another person except where due reference is made.
Signed:
Date:
Acknowledgments
I would firstly like to thank my supervisors, Associate Professor Vinod Chandran
and Professor Sridha Sridharan for their support and guidance throughout my
PhD. Thanks also to Clinton Fookes for his support and assistance.
I would also like to thank everyone in the Speech, Audio, Image and Video Tech-
nology (SAIVT) laboratory for the games of cards, entertaining (and at times
strange and confusing) conversations, and assistance.
I would also like to thank my family for their support throughout the PhD. Finally
I would like to thank my wife Pamela, for her support and understanding when
things weren’t working and deadlines were looming, and especially for the endless
supply of chocolate chip biscuits.
Simon Paul Denman
Queensland University of Technology
May 2009
Chapter 1
Introduction
1.1 Motivation and Overview
This dissertation is an investigation into object tracking and how it can be applied
to surveillance footage. Surveillance networks are typically monitored by one or
more people, looking at several monitors displaying the camera feeds. Each person
may potentially be responsible for monitoring hundreds of cameras, making it very
difficult for the human operator to effectively detect events as they happen. Given
the rise in security concerns in recent years, more and more security cameras have
been deployed, placing further strain on human operators.
Recently, computer vision research has begun to address ways to automatically
process some of this data, to assist human operators. Various areas of research
such as object tracking, event recognition, crowd analysis and human identifi-
cation at a distance are being pursued as a means to aid human operators and
improve the security of areas such as transport hubs (i.e. airports, train stations).
The task of object tracking is key to the effective use of more advanced technolo-
gies (i.e. to recognise the actions of a person, they need to be tracked first), and
can be used as an aid to enhance the utility of others (i.e. crowd analysis may
be able to detect an object moving against the flow of traffic, which can then be
tracked using object tracking techniques).
Object tracking itself is the task of following one or more objects about a scene,
from when they first appear to when they leave the scene. An object may be
anything of interest within the scene that can be detected, and depends on the
requirements of the scene itself. Figure 1.1 shows an example of a typical view
from a surveillance camera, with various objects that need to be tracked outlined
(the colour indicates an ID; note that the ID assigned to each object is consistent
over time). In this scene, vehicles and people are of interest. To track the objects
in the scene, they must be detected and matched from frame to frame. To match
objects detected in one frame to those present in a previous frame, one or more
features that can be used to compare the objects are required. Problems such as
occlusions, and errors in detection algorithms (either missed detections, or false
detections) must also be overcome.
Figure 1.1: Example of a Surveillance Scene (frames 500, 600 and 700)
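To make the matching step concrete, the sketch below gives a minimal, hypothetical fit function that scores candidate detections against tracked objects using position and area error (loosely following the Fposition and Farea terms listed in the Notation) and then pairs them greedily. The field names, standard deviations and threshold are illustrative assumptions only, not the matching scheme developed in Chapter 3.

```python
import math

def fit(candidate, track, sigma_pos=20.0, sigma_area=500.0):
    """Score a candidate detection against a tracked object.

    candidate/track are dicts with keys x, y, w, h (assumed layout).
    The score combines a position error and an area error, each scaled
    by an assumed standard deviation; higher scores mean a better match.
    """
    e_pos = math.hypot(candidate['x'] - track['x'], candidate['y'] - track['y'])
    e_area = abs(candidate['w'] * candidate['h'] - track['w'] * track['h'])
    return math.exp(-e_pos / sigma_pos) * math.exp(-e_area / sigma_area)

def match(candidates, tracks, threshold=0.1):
    """Greedily pair candidates with tracks, best fit first."""
    scores = sorted(
        ((fit(c, t), ci, ti) for ci, c in enumerate(candidates)
                             for ti, t in enumerate(tracks)),
        reverse=True)
    used_c, used_t, pairs = set(), set(), []
    for s, ci, ti in scores:
        if s < threshold:
            break
        if ci not in used_c and ti not in used_t:
            pairs.append((ci, ti))
            used_c.add(ci)
            used_t.add(ti)
    return pairs  # unmatched candidates may become new tracks
```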
Systems are required to operate in a variety of conditions. Figure 1.2 shows an
example of the same scene at two different times. In a surveillance situation,
it is not feasible for a human operator to change configuration values as the
scene conditions change. Any algorithm must be able to adjust to the new scene
conditions, with minimal error.
Figure 1.2: Example of Changing Scene Conditions
In the remainder of this chapter the aims and objectives of this thesis will be
described. The scope of the thesis will be defined, and an outline of the thesis
will be presented. Finally, the contributions made in this thesis will be stated.
1.2 Aims and Objectives
This thesis aims to improve the performance of object tracking systems by making
original contributions to three components: (a) motion segmentation, (b) particle
filters for object tracking and (c) fusion of information. These contributions will
be in the nature of:
1. Improvements to motion segmentation to enhance performance in adverse
conditions (shadow detection, variable thresholds, detection and manage-
ment of changing lighting conditions) and to improve the general utility of
the algorithm (simultaneously computing optical flow, segmenting motion
into multiple layers of static motion and active motion). See Section 1.2.1
for more detail.
2. Improvements to particle filters to allow a particle filter to be used within
an existing tracking system, rather than as a self-contained tracking system.
See Section 1.2.2 for more detail.
3. Improvements to multi-modal fusion in tracking systems to determine the
optimal fusion point for a tracking system, and to develop a fusion algorithm
that is able to dynamically alter the fusion parameters according to the
performance of each mode. See Section 1.2.3 for more detail.
1.2.1 Improvements to Motion Segmentation
Motion segmentation forms the basis of many tracking systems. In such systems,
detection results and thus tracking performance are ultimately reliant on the mo-
tion segmentation performance. Shadows and lighting fluctuations can introduce
further errors into the motion segmentation process and have an adverse effect on
a tracking system. Other tracking systems use optical flow information as a basis
for detecting and tracking objects. Optical flow offers advantages in its ability
to separate objects moving in different directions, but does complicate the initial
detection of objects. Optical flow is also prone to error in the presence of in-
consistent lighting and discontinuities, such as those caused by the self occlusion
present when a person walks.
This research aims to improve motion segmentation for object tracking appli-
cations, and investigate ways to use additional motion information to improve
object tracking performance. Two key additions to an existing motion segmen-
tation algorithm [18] are proposed:
1. The simultaneous computation of optical flow and motion segmentation,
using motion information from the previous frames to reduce discontinuities
caused by a moving object against the background and eliminate the need
for the previous frame to be stored.
2. The ability to segment the scene into layers, consisting of multiple layers of
stationary foreground and a single layer moving foreground.
Each of these additions results in additional output being generated. As such,
techniques are proposed that allow a tracking system to utilise this information
in addition to the standard motion mask, to improve detection and tracking
results. It will be shown in Chapter 5 that the additional information provided
by the proposed motion detection algorithm (described in Chapter 4) results in
a significant improvement in system performance.
1.2.2 Improvements to Particle Filters
Particle filters [120] allow objects to be tracked without the need to detect the
objects every frame. Features for the objects of interest can be extracted, and
used to locate the object in future frames. However, such systems do not allow
for the features to adapt to the changing appearance of the tracked objects, or
provide simple methods through which tracked objects can be added or removed
from the particle filter.
This research aims to investigate ways to integrate a particle filter with a frame-
by-frame, detect and update, tracking system. As a result of this integration,
the proposed particle filter (presented in Chapter 6) is able to use a time varying
number of features and particles for each tracked object, as well as being able to
use different types of features for different tracked objects where appropriate.
1.2.3 Improvements to Multi-Modal Fusion in Tracking
Systems
Tracking systems normally use a single video feed for input. Each type of video
modality, however, has its own set of weaknesses. A visible light modality (colour
or grey scale) is susceptible to lighting changes and shadows, and performs poorly
in low light conditions. For a thermal modality, however, the largest challenge is
the lack of the texture and colour information that enables objects to be
distinguished from one another.
This research aims to determine the most appropriate point for fusion, in a multi-
modal tracking system using a visual colour modality and a thermal modality,
by evaluating four simple fusion schemes at different points in an object tracking
system. Based on these findings (presented in Chapter 7), an improved fusion
system is proposed.
1.3 Scope of Thesis
The scope of this thesis is defined by the following research questions:
1. Can optical flow and multi-layer motion segmentation (separating the fore-
ground into stationary foreground regions and moving foreground regions)
be computed simultaneously?
2. Does the combination of optical flow, and motion segmentation (itself a
combination of stationary foreground and moving foreground) for object
detection result in improved object tracking performance compared to using
motion segmentation on its own?
3. Does a particle filter integrated into a frame-by-frame detect and update
tracking system result in improved object tracking performance when com-
pared to a frame-by-frame detect and update tracking system on its own?
4. Where is the optimal point for fusion in a multi-modal tracking system?
The tracking systems proposed within this thesis do not consider using any
learned model approaches to object detection, relying solely on simple detection
techniques based on the analysis of motion segmentation output. Learned mod-
els have not been considered due to the requirements in training such models to
accurately locate people and cars across a wide range of data, and due to motion
segmentation research being a main component of this thesis. Ultimately, the
use of the simple motion based detection techniques allows a wider range of data
to be tested than would have otherwise been possible. The ETISEO database
[130], and evaluation tool are used to evaluate the performance of object tracking
systems. The research into multi-modal tracking systems is restricted to a single
visual colour modality being fused with a single thermal modality. Performance
is evaluated using the OTCBVS database [42], and the ETISEO evaluation tool.
1.4 Original Contributions and Publications
The original contributions made in this thesis include:
(i) Simultaneous computation of multi-layer motion segmentation and
optical flow
Motion segmentation is a key early step in object tracking, and poor segmen-
tation performance can have an adverse impact on the performance of any
tracking algorithm. This research proposes an improved motion segmentation
algorithm that is able to simultaneously compute optical flow and multi-layer
motion segmentation. Motion information is split into multiple layers of static
foreground (objects that have entered the scene and come to a stop) and a layer
of active foreground (currently moving objects). The proposed algorithm is
evaluated using the AESOS database, the CAVIAR database [48], and data
captured in house, and significant improvement is shown.
(ii) Incorporation of multi-layer motion and optical flow into object
tracking
Object tracking systems typically use either motion segmentation (with only
a single layer of output), optical flow, or a learned model for object detection
and tracking. As tracking systems are generally aimed at real-time performance,
computing multiple motion modes (i.e. motion segmentation and optical flow) is
not ideal. This research proposes implementing the proposed hybrid multi-layer
motion segmentation / optical flow algorithm into a tracking system, and using
the multiple modes of output to improve detection and tracking performance.
The resultant tracking system is evaluated using the ETISEO database [130],
and significant improvement over a baseline tracking system (using a single layer
motion segmentation algorithm) is shown.
(iii) Improved object tracking through the Scalable Condensation
Filter (SCF)
Particle filters allow tracking to be performed in environments where detection
is difficult and unreliable, but still require a form of detection (this may include
manual instantiation) to initialise the particle filter. This research aims to
develop a condensation filter that can be used within a tracking framework
that detects and updates objects on a frame by frame basis, rather than as
a self contained tracking system. This allows the condensation filter to have
access to continuously updated features, and use observations from the detection
algorithms to augment the condensation filter distribution. The proposed
condensation filter is able to dynamically scale the number of particles used for
each track, in addition to the number and type of features used, according to
the system complexity. The proposed tracking system is evaluated using the
ETISEO database [130], and improvement in tracking performance, particularly
occlusion handling, is demonstrated.
(iv) Investigation into and development of methods to fuse multiple
modalities for object tracking
This research aims to investigate the most appropriate way to fuse a visual colour
modality and thermal modality for the task of object tracking. Four simple fusion
schemes are evaluated:
1. Fusion during the motion detection process.
2. Fusion of the motion detection output.
3. Fusion of the object detection results.
4. Fusion of the tracked object lists.
These fusion schemes are evaluated using the OTCBVS database [42] and the
ETISEO evaluation tool [130]. It is shown that fusion of the object detection
results is most effective, and a more sophisticated fusion scheme at this point in
the system is proposed. It is shown that this scheme outperforms each modality
on its own, as well as the earlier evaluated schemes.
1.5 Outline of Thesis
The thesis is outlined as follows:
Chapter 2: Literature Review
• Provides a detailed survey of literature related to motion detection, object
tracking, particle filters and abandoned object detection.
Chapter 3: Tracking System Framework
• Outlines the structure, both algorithmic and programmatic, of the tracking
systems used in this thesis.
• Details the baseline tracking system.
• Outlines the evaluation method used for evaluating the performance of the
tracking systems within this thesis.
• Presents benchmark scores for the baseline tracking system.
Chapter 4: Motion Detection
• Presents a novel motion detection algorithm capable of simultaneously cal-
culating optical flow and segmenting a multi-layered foreground.
• Illustrates the improvement that can be achieved using this algorithm, on
synthetic and real world data.
Chapter 5: Object Detection
• Describes how the proposed motion detection algorithm (Chapter 4) can be
incorporated into the baseline tracking system, to improve object detection
and thus tracking.
• Presents evaluation results for the modified tracking system proposed in
this chapter, and shows the improvement over the baseline system.
Chapter 6: The Scalable Condensation Filter
• Proposes a novel condensation filter, the Scalable Condensation Filter
(SCF), that is able to dynamically resize, and dynamically change features
as the system requires.
• Describes how the SCF can be implemented into the tracking system pro-
posed in Chapter 5.
• Presents evaluation results for the modified tracking system, and shows the
improvement achieved by using the SCF.
Chapter 7: Advanced Object Tracking and Applications
• Proposes a multi-camera tracking system based on the systems proposed
earlier in this thesis, and demonstrates the improvement in performance that
can be gained in situations where occlusions are present.
• Investigates multi-modal fusion approaches for a tracking system.
Chapter 8: Conclusions and Future Work
• Provides a summary of the research as well as possible avenues of future
work.
Chapter 2
Literature Review
2.1 Introduction
Tracking is the process of following an object of interest within a sequence of
frames, from its first appearance to its last. The type of object and its description
within the system depends on the application. During the time that it is present
in the scene, it may be occluded (either partially or fully) by other objects of
interest or fixed obstacles within the scene. A tracking system should be able
to predict the position of any occluded objects through the occlusion, ensuring
that the object is not temporarily lost and only detected again when the object
appears after the occlusion. The process can be extended to multiple cameras,
which can help to overcome occlusions by being able to observe the objects of
interest from multiple angles, but also requires that the same objects in different
views are grouped together.
Object tracking systems are typically geared toward surveillance applications
where it is desired to monitor people and/or vehicles moving about an area.
Systems such as these need to perform in real time, and be able to deal with real
world environments and effects such as changes in lighting and spurious move-
ment in the background (such as trees moving in the wind). Other surveillance
applications include data mining applications, where the aim is to annotate video
after the event. Applications outside of surveillance have been found in the realm
of sport. The ball tracking system, ‘Hawk-eye’ [137], has become a standard
feature of tennis and cricket broadcasts, and uses object tracking techniques to
locate and track the ball as it moves about the court or pitch. The development of
‘smart rooms’, where a central intelligence is able to monitor a room’s occupants
and perform tasks according to the actions of occupants [50, 102] is another area
where object tracking techniques are being applied.
There are two distinct approaches to the tracking problem, top-down and bottom-
up [182]. Top-down methods are goal orientated, and the bulk of tracking systems
are designed in this manner. These typically involve some sort of segmentation
to locate regions of interest, from which objects and features can be extracted for
tracking.
Bottom-up systems respond to stimulus, and behave according to observed
changes. These systems are often used for gesture recognition tasks, and use
descriptor functions (such as optical flow) combined with filters to provide the
stimulus. Bottom-up systems, or stimulus driven systems, have been more focused
on data-mining applications. Efros et al. [45] uses optical flow as the stimulus
to classify actions in a sporting context, either on a soccer field, tennis court or
ballet stage. Zhang [182] uses motion descriptors derived from optical flow and
HMMs to index sports footage such as that from a basketball or volleyball match.
Recently surveillance applications designed to perform crowd monitoring [3] have
been developed using a bottom-up approach.
A top-down approach is the most popular method for developing surveillance
systems for real time tracking. Systems have a common structure (see Figure
2.1) consisting of a segmentation step, a detection step and a tracking step.
Segmentation and object detection are commonly done as a two step process [66,
184], whereby motion detection precedes object detection, which uses the resultant
motion image. However some systems [131, 145] effectively merge these processes
by using model based object detection.
Figure 2.1: A Basic Top-Down System
Predictors are often used to predict the motion of the tracked object, and thus its
position in the next frame. This is used to aid in matching or in the event of the
object being temporarily lost (occluded). Features such as colour are popular to
aid in matching tracks [40, 113], with various forms of histogram matching and
colour clustering used to maintain the identity of tracked objects.
Recently, systems using particle filtering techniques [135, 163] have begun to use
both top-down and bottom-up approaches within the one system. The particle
filters are used to track previously detected objects, and are updated by using
stimulus from the input image(s). New objects are detected and added using a
top-down approach.
Tracking systems are required to function in a wide variety of conditions. Systems
must be able to function in both indoor and outdoor environments, and need to
be able to deal with challenges such as illumination changes and changing weather
conditions (i.e. fog, rain, snow). Other challenges such as occlusions are
commonplace in real world scenarios, and systems must be able to maintain an object’s
identity and a reasonable approximation of its position during these occlusions.
In order to help overcome some of these problems, algorithms that are able to
utilise multi-camera setups have been developed [95, 118]. This however intro-
duces additional challenges such as camera calibration, track handover between
views, and how to utilise Pan-Tilt-Zoom (PTZ) cameras.
In order to gauge the performance of tracking algorithms, several evaluations
have been run, and a growing collection of databases are available. To date, a
large amount of research has used privately collected data to evaluate algorithms,
making comparison between different algorithms difficult.
This chapter will discuss the main areas of research within the field of object
tracking, and will be structured as follows:
• Section 2.2 will discuss motion detection techniques, such as background
segmentation, optical flow and auxiliary processes such as shadow detection.
• Section 2.3 will present various tracking systems for both people and vehi-
cles. The detection and matching of objects, as well as occlusion handling
and features used for tracking will be discussed.
• Section 2.4 will discuss the main types of predictors used within the object
tracking systems.
• Section 2.5 will discuss tracking systems that use multiple cameras, how
such systems are designed, and how they handle problems such as track
handover between cameras and occlusions.
2.2 Foreground Segmentation and Motion De-
tection
Foreground segmentation is the process of dividing a scene into two classes, fore-
ground and background. The background is the region of the scene that is fixed,
such as roads, buildings and furniture. Whilst the background is fixed, its ap-
pearance can be expected to change over time, due to factors such as changing
weather or lighting conditions. The foreground is any element of the scene that
is moving, or expected to move, and some foreground elements may actually be
stationary for long periods of time (such as parked cars, which may be stationary
for hours at a time). It is also possible that some elements of background may
actually move, such as trees moving in a breeze. An example of a motion mask
is shown in Figure 2.2.
Figure 2.2: Foreground Mask for a Scene (Hand Segmented); input image and corresponding foreground mask. Areas of motion are represented as white in the foreground mask.
There are two main approaches to locating foreground objects within surveillance
systems:
1. Background modelling/Subtraction - incoming pixels are compared to a
background model to determine if they are foreground or background.
2. Optical Flow Approaches - compare consecutive images to determine flow
vectors (movement of each pixel) which can be used to detect moving (i.e.
foreground) objects.
Background modelling and subtraction techniques compare incoming images to a
learned background. In the case of background subtraction, single mode models
of the background are used. An image subtraction followed by thresholding is
performed to determine foreground pixels,
$I_{motion} = |I_{input} - I_{background}| > T$, (2.1)
where $I_{motion}$ is the output motion image, $I_{input}$ is the input image, $I_{background}$ is
the background image and T is the motion threshold. Depending on the imple-
mentation, $I_{background}$ may be a fixed image initialised at start up, or may adapt
to changes in the scene (i.e. lighting fluctuations).
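To make this concrete, the following is a minimal sketch of the subtraction and thresholding test of Equation 2.1, written in Python with NumPy; the function and variable names are illustrative only, and the optional running-average update is one simple way (among many) of letting the background adapt.

import numpy as np

def background_subtraction(frame, background, T=30, alpha=0.0):
    # Equation 2.1: a pixel is in motion if |I_input - I_background| > T.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    motion = diff > T
    if alpha > 0.0:
        # Optional running average so the background adapts to gradual
        # changes (e.g. lighting fluctuations) in non-moving areas.
        background[~motion] = ((1.0 - alpha) * background[~motion]
                               + alpha * frame[~motion])
    return motion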
Background modelling techniques [18, 97, 157] use a multi-modal approach, where
various modes of the background are stored for each pixel,
$I(x, y)_{background} = \{c, w, n\}, \quad n = [1..N]$, (2.2)
where (x, y) is a pixel in the background model; {c, w, n} is a single model of
background with c equal to the colour (this may be multiple values), w is the
weight and n is the index; and N is the number of models in the background.
Incoming images are compared to all possible background modes in order of
weight (highest to lowest) to determine if the pixel is foreground or not,
$|I(x, y, n)_{background} - I(x, y)_{input}| < T, \quad n = [1..N]$ (2.3)
where (x, y) is the coordinate of the pixel being tested, n is the background mode
being compared to, and N is the total number of background modes. Once a
match is found, the weight of the match is used in determining if the pixel is
foreground. If there is no match, then the pixel must be foreground. Background
segmentation routines update themselves overtime, removing modes that have
low probabilities as new modes appear, and gradually adjusting other modes as
the scene gradually changes (i.e. lighting changes due to the time of day).
Background subtraction, being a simpler approach, results in faster execution
times, but is less robust when exposed to complex scenes containing environmen-
tal effects such as lighting fluctuations, or trees moving in the wind.
Optical flow approaches [9, 72, 114] do not explicitly determine where motion is
in the scene. Instead they attempt to determine what motion each pixel has
undergone between subsequent frames.
Systems belonging to each of these areas shall be discussed, as well as systems
that combine one or more techniques, or do not fall into one of these classifications.
Finally, auxiliary processes such as those responsible for shadow detection or
illumination correction shall be discussed.
2.2.1 Background Subtraction and Background Modelling
Background modelling methods build a model of the background, often multi-
modal, and compare this to each incoming frame. Early approaches used a fixed
model such as an image of the expected background and directly compared this
to incoming frames. Whilst this works in ideal situations, it cannot cope with any
variations in lighting and so is very limited. A moving average of the background
image can be used to add some adaptability to the model, but this is still limited.
Recently multi-modal solutions [18, 157] have been proposed that are able to
cope better with real world problems such as lighting fluctuations and
a time varying background.
The choice of colour space and colour model is also important. Colour spaces
such as YCbCr and HSV separate the colour from the intensity making tasks
such as shadow detection simpler. The choice of colour model in the system can
have a similar effect.
Horprasert et al. [73] proposed a colour model based in RGB space which allowed
brightness and colour distortion to be measured. A given colour in RGB space
is represented as a line from the origin to the colour, its chromaticity line. The
brightness distortion is a scalar value that brings the observed colour back to the
chromaticity line, while the colour distortion is the distance between the observed
colour and the chromaticity line.
and its variance, and the variance of the brightness distortion and colour distor-
tion is used. The use of such a colour model allows shadows or highlights to be
picked up and eliminated from the foreground image.
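As a rough illustration of this colour model, the sketch below computes the brightness distortion (the scaling that projects the observed colour onto the chromaticity line of the expected background colour) and the colour distortion (the distance from that line). It omits the per-channel variance normalisation used in [73], so it should be read as a simplification rather than the published formulation; all names are assumptions.

import numpy as np

def brightness_and_colour_distortion(observed_rgb, expected_rgb):
    i = np.asarray(observed_rgb, dtype=float)
    e = np.asarray(expected_rgb, dtype=float)
    # Brightness distortion: the scalar alpha minimising |i - alpha * e|,
    # i.e. the projection of the observed colour onto the chromaticity line.
    alpha = np.dot(i, e) / np.dot(e, e)
    # Colour distortion: distance from the observed colour to that line.
    cd = np.linalg.norm(i - alpha * e)
    return alpha, cd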
Thongkamwitoon et al. [160] proposed an adaptive background subtraction algo-
rithm based on [73] that could vary the learning rate for different parts of the
scene. A constant learning rate can result in errors in two ways:
1. True background may be lost at areas of high activity when a fast learning
rate is used.
2. New background objects will be incorporated very slowly if a slow learning
rate is used.
To overcome this problem, they define a vivacity factor which can be used as
a substitute for the learning rate. The vivacity is calculated by observing the
changes at a pixel over a window of frames, and can be used to compensate for
different levels of motion within the scene.
Stauffer and Grimson [157] proposed a multi-modal background model using a
GMM to model each pixel, and incoming pixels were compared to the GMM to
determine how well they matched the background. This allowed multi-modal
backgrounds to be effectively modelled, and for the model to learn and adapt to
changes in the background. This mixture of Gaussians (MOGS) approach (or
systems which use an approximation to it [18]) has become popular and several
improvements and variations have been proposed since.
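Variants of this mixture of Gaussians model are available in common vision libraries; the sketch below uses OpenCV's MOG2 implementation (a later variant, not the original algorithm of [157]) purely to illustrate how such a model is driven frame by frame. The input path and parameter values are placeholders.

import cv2

cap = cv2.VideoCapture("surveillance.avi")  # placeholder input path
# history controls how quickly the model adapts; varThreshold is the squared
# Mahalanobis distance used to match a pixel to one of its Gaussian modes.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow, 0 = background
cap.release()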
Harville et al. [68] proposed a system based on the work of Stauffer and Grimson
[157] and the work of Gordon et al. [56], to produce a system that performed
foreground segmentation based on depth and colour information within a MOGS
framework. The proposed algorithm is able to modulate the learning rate accord-
ing to the amount of activity at a given pixel. Pixels that are more active learn
more slowly to preserve the background, while inactive pixels have the learning rate
increased, as it is likely that they represent a new background object.
Bowden and Kaewtrakulpong [13] improved upon the system proposed by Stauffer
and Grimson [157] by improving the learning rate so that it converged on a stable
background faster. Changes were also proposed that allow the system to better
deal with shadows. Further improvements have been proposed by Wang and Suter
[169], who proposed the addition of shadow removal and a foreground support
map to aid in the updating of the background. The MOGs approach has also
been applied to a moving camera situation by Hayman and Eklundh [70], who
adapted [157] to a system with a camera capable of panning and tilting (such as
may be found on a robot, or in video conferencing).
One problem that may arise with these approaches is that objects that are not
actually part of the background may become incorporated into the background
model. When an object first stops, it will still be detected as foreground, how-
ever, after a period of time the object will have been present in the scene long
enough to be considered background. If this occurs, and the object then begins to
move again, it is possible that motion will be incorrectly detected at the location
where the object was. Whilst this could be partially overcome by using a slower
learning rate, this will then affect other aspects of the system such as the ability
to learn changes in the background due to changing light conditions (something
that is necessary in an outdoor scene). Within a tracking system, this inability
can lead to a higher number of occlusions, placing increased demands on other
segmentation steps, or tracking and predictions steps in the case of a surveillance
system.
Figure 2.3: Temporarily Stopped Objects (frames 0, 800, 1900 and 2450)
Figure 2.3 shows several frames of a sequence where cars are temporarily stopping,
and then leaving again. In this situation, a car is stopped within the scene for
over 2000 frames (80 seconds if the frames are captured at 25fps). In the same
time that the car is stopped, there is significant change in the scene’s lighting (note
the difference between frames 1900 and 2450 in Figure 2.3). In such a situation,
if a slow learning rate was used to counter the temporarily stopping objects, the
same slow learning rate would result in excessive false motion detected after the
lighting change.
Harville [67] proposed an extension to the MOGs approach to overcome this
problem. Harville allowed a higher level process to impose positive or negative
feedback to force changes in the background model. In a situation where there
is a stationary foreground object in the scene, feedback can be applied to ensure
that the weights of the mixture components associated with the object remain
sufficiently low that the object is still considered foreground.
Javed et al. [83] added gradient information to the MOGs model to help overcome
illumination changes, as the gradient of the background will remain relatively
stable during an illumination change, even though the colour
does not. Possible gradient distributions could be calculated from the background
model using the various Gaussians that represent these background modes. For
each input pixel, the gradient (magnitude and direction) could be computed and
if it matched one of the possible gradient distributions, then the pixel belonged
to the background.
Zang and Klette [180] proposed a mixture of Gaussians model called PixelMap,
which combines a MOGs approach with region level and frame level considerations
to eliminate holes in objects and help reduce noise. Three processes are combined
in this approach.
1. A standard MOGs background model is used to locate foreground pixels.
2. A frame level process that considers a window of three frames (frames t, t − 1
and t + 1). Two difference images are created, between t + 1 and t, and between t
and t − 1; the logical AND of these two difference images is added
to the mask created by the background model process.
3. A region level process is applied to fill in object holes and remove noise. A
window (5 × 5) is moved across the image. For each point that is in motion,
the motion within the window is analysed to check if it is connected, and if
the window is over 50% full. If this is the case, then the remainder of the
window is filled in, otherwise the centre pixel is set to 0. This third process
replaces the more conventional binary closing routines that are commonly
used, as these routines often struggle to close large holes, or to do so require
large kernels that distort other details.
The resultant algorithm produces motion images with less noise and fewer holes
within the extracted objects.
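A sketch of the third, region level step is given below, assuming a binary motion mask as input and using the 50% fill rule described above; the connectivity check is omitted for brevity, so this is an approximation of the published process rather than the exact algorithm.

import numpy as np

def region_level_process(mask, win=5, fill_ratio=0.5):
    # mask: binary motion image (0 = background, 1 = motion).
    h, w = mask.shape
    r = win // 2
    out = mask.copy()
    ys, xs = np.nonzero(mask)
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        if mask[y0:y1, x0:x1].mean() > fill_ratio:
            # Window is over 50% full: fill in the remainder of the window.
            out[y0:y1, x0:x1] = 1
        else:
            # Otherwise the centre pixel is treated as noise and cleared.
            out[y, x] = 0
    return out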
Hu et al. [75] described a new colour model (cone-shape illumination model,
CSIM) to distinguish between foreground, shadow and highlights, and incorpo-
rated this into a mixture of Gaussians background model, that used both a long
term and short term background model. The proposed CSIM is similar to the
model proposed by [73] in that its centre axis lies on the line from the colour to
the origin. It uses a 3D cone centred at the mean with each axis 2.5 standard
deviations long. Hu et al. [75] also made use of a gradient based background
model [83] to help deal with dramatic lighting changes.
Li et al. [108] proposed a background subtraction algorithm that varied the learn-
ing rate for different areas of the scene according to their detected context. This
modification could be applied to any existing background subtraction scheme.
Two different contextual background region types are defined; fixed public facili-
ties such as counters, stores etc; and homogeneous surfaces such as walls and floor.
Orientation histogram representation (OHR) and principal colour representation
(PCR) are used to distinguish between the different background contexts.
Modelling each pixel with a set of GMMs is very processor intensive however,
and not ideal when foreground segmentation is only the first step in a multi-step
process (i.e. surveillance). To address this, Butler et al. [18] proposed an
adaptive background segmentation algorithm, using an approximation to a GMM,
where each pixel is modelled as a group of clusters. A cluster consists of a
centroid, describing the pixel’s colour; and a weight, denoting the frequency of
its occurrence.
The motion detection uses colour images in Y’CbCr 4:2:2 format as input. Pix-
els are paired to create a cluster which consists of two luminance values (y1
and y2), a blue chrominance value (Cb), and red chrominance value (Cr) to de-
scribe the colour; and a weight, w. For each pixel pair, a set of K clusters,
C(x, y, t, 1..K) = (y1, y2, Cb, Cr, w), is stored, which represents a multi-modal
PDF. The pairing of pixels to form clusters means C(x, y, t, 1..K) is formed by the
pixels at (x, y) and (x+ 1, y). Each pixel is only used once, so for C(x, y, t, 1..K),
x must be even, and the algorithm requires images that have an even horizontal
dimension.
Clusters are ordered from highest to lowest weight; and the current matching
cluster, C(x, y, t,m) (where m is the index of the matching cluster in the range
1..K), for each pixel is stored, giving an approximation of the image.
For each (x, y, t) the algorithm makes a decision assigning it to one of the sets
(background, or a motion layer) by matching C(x, y, t, k), where k is an index in
the range 1 to K, to the pixels in the incoming image. Clusters are matched to
incoming pixels by finding the highest weighted cluster which satisfies,
$|y_1 - C_{y1}(x, y, t, k)| + |y_2 - C_{y2}(x, y, t, k)| < \tau_{Lum}$, (2.4)
$|Cb - C_{Cb}(x, y, t, k)| + |Cr - C_{Cr}(x, y, t, k)| < \tau_{Chr}$, (2.5)
where $y_1$ is the luminance value at (x, y), $y_2$ is the luminance value at (x + 1, y),
Cb is the chrominance value at (x, y), and Cr is the chrominance value at (x +
1, y). Thresholds are applied to the luminance and chrominance, and if both
are satisfied, then the pixel is suitably close to the cluster to be a match. By
separating luminance and chrominance a certain amount of tolerance to shadows
is inbuilt. The centroid of the matching cluster is adjusted to reflect the current
pixel colour, and the weights of all clusters in the pixels group are adjusted to
reflect the new state,
$w_k = w_k + \frac{1}{L}(M_k - w_k)$, (2.6)
where $w_k$ is the weight of the cluster being adjusted; L is the inverse of the traditional
learning rate, α; and $M_k$ is 1 for the matching cluster and 0 for all others. If
there is no match, then the lowest weighted cluster is replaced with a new cluster
representing the incoming pixels.
Based on the accumulated pixel information, the frame can be classified into
foreground,
$fgnd = \forall(x, y, t) \;\text{where}\; \sum_{i=0}^{m} C(x, y, t, i)(w) < T(x, y, t)$, (2.7)
where $T(x, y, t)$ is the foreground/background threshold; and background.
The clusters and weights are gradually adjusted over time as more frames are
processed, allowing the system to adapt to changes in the background model.
This means that new objects can be added to the scene (i.e. a box may be
placed on the floor), and over time these objects will be incorporated into the
background model.
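A sketch of the matching and weight update steps (Equations 2.4 to 2.6) for a single pixel pair is given below. The cluster representation, threshold values and the handling of unmatched pixels are illustrative assumptions, and the centroid adjustment is omitted, so this should not be read as the exact implementation of [18].

def match_and_update(clusters, y1, y2, cb, cr, tau_lum=40, tau_chr=20, L=100):
    # clusters: list of [y1, y2, Cb, Cr, weight], ordered highest weight first.
    match = None
    for k, c in enumerate(clusters):
        if (abs(y1 - c[0]) + abs(y2 - c[1]) < tau_lum and
                abs(cb - c[2]) + abs(cr - c[3]) < tau_chr):
            match = k  # highest weighted cluster satisfying (2.4) and (2.5)
            break
    if match is None:
        # No match: the lowest weighted cluster is replaced by the incoming pixels.
        clusters[-1] = [float(y1), float(y2), float(cb), float(cr), 1.0 / L]
    # Weight update (2.6): w_k = w_k + (1/L)(M_k - w_k), for every cluster.
    for k, c in enumerate(clusters):
        m = 1.0 if k == match else 0.0
        c[4] += (m - c[4]) / L
    clusters.sort(key=lambda c: c[4], reverse=True)
    return clusters, match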
Kim et al. [96] proposed a model where background values were quantized into
codebooks. Each codeword contained the RGB vector, the minimum and max-
imum brightness that the codeword matched to, its frequency, the longest time
that it was not seen, and its first and last access times. The system was similar
in construction to Stauffer and Grimson [157] and used a colour model that was
based on [73], allowing the system to handle shadows and highlights. Each pixel
used a different codebook size, depending on the variation at the pixel. The sys-
tem however requires a training sequence to initialise and does not adapt over
time.
Techniques such as those proposed by Butler et al. [18], Stauffer and Grimson
[157] and Kim et al. [96] provide a certain amount of tolerance for lighting changes
and shadows (depending on thresholds and colour models being used). However
they are still susceptible to rapid lighting changes that may be caused by (in the
case of an outdoor scenario) the sun moving behind a cloud, or (in the case of an
indoor scenario) a light being turned on/off. These techniques may also require a
number of frames to learn the background model. If there are no moving objects
present, then a single frame is often sufficient, but if there are moving objects
already present then the number of frames required will depend on the rate of
movement, and the system’s learning rate.
A limitation of these techniques is their inability to distinguish between objects
that have been temporarily stationary, and those that are continuing to move.
Kim et al. [97] proposed a modification to their earlier system (Kim et al. [96])
that was able to distinguish short term background (i.e. a car that has stopped)
from motion and long term background. Statistics for each possible code are
recorded to determine which codes belong to background and foreground, and
which belong to short-term background (i.e. stopped cars).
Motion detection techniques such as those proposed by Stauffer and Grimson
[157], Butler et al. [18], Kim et al. [96] and their derivatives all work at the pixel
level, detecting individual changes at each pixel to determine motion. In the
event of small camera motion (possibly caused by wind in the case of an outdoor
camera), these methods would all, incorrectly, detect false motion in large por-
tions of the image. This problem could be overcome by using image registration to
ensure that any incoming images are aligned with the background model; how-
ever, this is computationally costly. Adelson [2] proposed a method of modelling
a scene using pixel layers (where a layer is a group of similar pixels), and this
approach has since been applied to the tasks of motion detection (Patwardhan
et al. [138]), video segmentation (Khan and Shah [94], Criminisi et al. [34]) and
object tracking (Tao et al. [159], Zhou and Tao [187]).
Patwardhan et al. [138] proposed using a layered approach to motion detection.
A training sequence is used to locate the layers within a scene, with layers created
according to the similarity of individual pixels’ colour values. When processing
a frame, pixels are compared to a stack of previous frames to determine the
likelihood that they belong to one of the layers that exists at the pixel (using
a set of images allows for multiple layers to be considered, as a pixel may exist
near a layer boundary, or the layer itself may move, i.e. vegetation moving in
the wind). If the pixel does not belong to one of the background layers, it is
assigned to the foreground layer, and the system continues to learn new layers
and update existing ones as more frames are processed. Foreground layers that
become stationary can be added to the background model, and removed again
once the object begins to move (i.e. a parked car). The use of layers also allows
for overlapping foreground regions to be identified.
2.2.2 Optical Flow Approaches
Optical flow is a process which attempts to determine the motion each pixel in a
scene has undergone between subsequent images,
$I(x, y, t) = I(x + \delta u, y + \delta v, t + \delta t)$, (2.8)
where u is the horizontal image velocity and v is the vertical image velocity, t is
the current time step and δt is the time difference between frames. In order to
determine u and v, two assumptions are commonly made:
1. That for a given region across two frames, its appearance will not change
due to lighting (constant luminance).
2. That a pixel present in a frame will still be present in the next frame (no
spatial discontinuities).
Both these assumptions can be restrictive and, when broken, can lead to errors.
Most optical flow techniques are either gradient based methods (Horn and
Schunck [72], Lucas and Kanade [114]), or block matching based methods (Bergen
et al. [6], Burt et al. [17]). Gradient based methods have been preferred due
to speed and performance considerations. Gradient based methods analyse the
change in intensity and gradient (using partial spatial and temporal derivatives)
to determine the optical flow. Block matching based methods rely on determin-
ing the correspondence between the two images. This typically involves matching
‘blocks’ of one image to ‘blocks’ of the other to determine how far that region has
moved.
Both methods perform best when determining flow at or around clearly defined
features, and make assumptions of constant luminance and spatial continuity.
As a result, when objects are not clearly defined (perhaps due to clutter) or the
lighting conditions vary, errors can occur in the optical flow output. Performance
also suffers when trying to determine the flow for uniform regions where there is
little to no texture.
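Dense optical flow implementations are readily available; the sketch below uses OpenCV's Farneback algorithm (a polynomial expansion method, not one of the techniques cited above) simply to show the form of the output: a per-pixel (u, v) velocity field from which a crude motion mask can be derived. The input path and magnitude threshold are placeholders.

import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.avi")  # placeholder input path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # flow[..., 0] holds the horizontal velocity u, flow[..., 1] the vertical v.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    moving = np.linalg.norm(flow, axis=2) > 1.0  # crude motion mask from flow magnitude
    prev_gray = gray
cap.release()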
To try and overcome the limitations of existing methods, Black and Anandan
[9] proposed a robust method based around a robust estimation framework. The
estimation framework reduces the outliers caused by motion discontinuities and
violations of the constant luminance assumption. However, this approach is too
slow for a real time system, and so not suitable for surveillance applications.
For surveillance applications, optical flow images can be restrictive as they will
detect all motion. Background modelling techniques can be trained to filter out
repetitive motion such as trees swaying in the breeze, but optical flow will always
detect this as motion. As such, extensive additional processing may be required
to filter the motion to determine what is caused by the target objects in the scene.
2.2.3 Other Methods
Temporal thresholding processes monitor the variance of pixels over a period of
time. A moving average (or similar construct) is used to calculate the variance,
to which a threshold is applied to determine if there is motion. Temporal thresh-
olding processes, however, leave a trail of motion behind moving objects while the
variance stabilises, making them undesirable for use in applications that require
accurate segmentation.
To overcome this, Joo and Zheng [84] and Abdelkader et al. [1] have proposed
methods that combine temporal thresholding and background subtraction to
achieve a more robust motion detection technique. The mean and variance of
each pixel are calculated over a window of several frames, and recursively updated
for each new frame. A simple exponential decay function is used to update the
filter. To overcome the limitations of temporal thresholding (leaving a trail after
the moving object), a simple background model is used in combination with the
temporal threshold approach. A second set of means and variances is kept to model
the background. The background model uses a much larger window (slower learn-
ing rate), and its update process is selective in that only pixels that (according
to the variance) are not possible foreground pixels are incorporated. A confi-
dence weight that denotes the confidence of a pixel being part of the foreground
is extracted from the background model, which is multiplied with the variance
obtained from the temporal thresholding. The resulting value is thresholded to
determine if the pixel is in motion.
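The sketch below illustrates how such a combination might be structured: an exponentially decaying mean and variance provide the temporal threshold, a second, more slowly updated set of statistics provides the background model and its foreground confidence, and only non-foreground pixels are fed back into the background. The update rules, weighting and thresholds are illustrative assumptions rather than the parameters of [1, 84].

import numpy as np

class TemporalMotionDetector:
    def __init__(self, shape, alpha_fast=0.05, alpha_slow=0.005):
        self.mean_f = np.zeros(shape); self.var_f = np.zeros(shape)  # temporal statistics
        self.mean_b = np.zeros(shape); self.var_b = np.ones(shape)   # background statistics
        self.a_f, self.a_b = alpha_fast, alpha_slow

    def apply(self, frame, thresh=50.0):
        frame = frame.astype(float)
        # Recursively updated mean and variance (exponential decay).
        d = frame - self.mean_f
        self.mean_f += self.a_f * d
        self.var_f = (1.0 - self.a_f) * (self.var_f + self.a_f * d * d)
        # Confidence that a pixel is foreground, taken from the background model.
        conf = np.abs(frame - self.mean_b) / np.sqrt(self.var_b + 1e-6)
        motion = (self.var_f * conf) > thresh
        # Selective background update: only pixels unlikely to be foreground.
        keep = ~motion
        db = frame - self.mean_b
        self.mean_b[keep] += self.a_b * db[keep]
        self.var_b[keep] = (1.0 - self.a_b) * (self.var_b[keep] + self.a_b * db[keep] ** 2)
        return motion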
These methods have an advantage in that they do not require a training period.
Models can be built with motion occurring in the scene without incurring sig-
nificant errors. The background model proposed however only considers a single
mode of background, and so only simple scenes can be effectively modelled.
Latzel et al. [105] proposed using interlaced images to detect motion. By ex-
ploiting the motion artifacts found in interlaced video, it is possible to extract
the edges of motion regions (and of edges within the motion regions) from the
sequence. This results in a system that is robust to lighting changes and environ-
mental changes. However the system is unable to deal with shadows, or moving
objects that temporarily stop, and is obviously not applicable to video streams
that are not interlaced.
Grabner and Bischof [57] proposed a background subtraction method based on
on-line adaboost. The input images are divided into a grid of small overlapping
rectangles and a boosted classifier is trained for each. The resultant system is
very sensitive and able to detect changes in low contrast scenes, however, this
also makes the approach susceptible to noise. The proposed approach is also not
as fast as other algorithms (reported to run between 5 and 10 fps), making it less
suitable for real time systems.
Thermal imagery motion detectors have been proposed by Davis and Sharma
[41] and Latecki et al. [104]. Thermal imagery is often noisier than colour, and
is subject to problems such as thermal halos. Davis and Sharma [41] described a
contour based system for use with IR imagery. A background subtraction routine
is first applied [157] to extract regions of interest. These regions typically contain
the target objects as well as an undesired thermal halo. A contour saliency map
which describes the likelihood of a pixel being on an object boundary is generated
by analysing gradient strengths within the foreground region and background
model. Thinning, thresholding and amplification routines are run over the map
image to produce a contour image for the moving objects. To overcome the
problem of incomplete contours, the watershed transform is used to locate possible
contour completions which are used to close any open contours. The watershed
transform is a method for segmenting images using watershed lines [33, 164]. The
watershed transform treats the input image as a topographical map where the
grey level indicates the elevation. When applied to a gradient map, the watershed
lines are found along gradient ridges.
Latecki et al. [104] proposed using a series of images to detect motion in thermal
imagery. Due to the additional noise present in thermal imagery (when compared
with visual feeds), multiple frames were used to test for motion by measuring
texture spread across a time and space window. A high texture spread indicates
motion at the region.
2.2.4 Auxiliary Processes
Many systems rely on auxiliary processes to aid the motion detection. These
processes are aimed at removing shadows and/or reflections, or at dealing with
lighting fluctuations. Some systems [13] have incorporated these processes di-
rectly into the motion detection. Having such processes incorporated directly
into any motion detection is highly desirable, as it improves motion detection
performance and avoids the need to run additional processes which may compli-
cate the system.
Commonly, shadows are detected by analysing regions of motion in a colour
space that separates intensity and colour information (HSV and Y’CbCr are two
examples). Fung et al. [55] developed a shadow detection system in which it was
proposed that there are two types of shadows, self shadows and cast shadows.
Self shadows occur when one side of an object is not illuminated, whereas cast
shadows occur when an object occludes a light source and casts a shadow onto the
background or other objects. Cast shadows have several distinguishing properties
that allow them to be located and removed:
• The luminance of a cast shadow is lower than that of the background.
• The chrominance of the cast shadow is approximately equal to that of the
background.
• The gradient density between the cast shadow and the background is lower
than that between the object and the background.
• The shadow lies on the edge of the bounding region of movement.
Fung et al. [55] derived metrics or tests for each of these conditions to test for a
pixel being a shadow. From these tests, a ‘shadow confidence score’ is calculated.
A threshold is applied to this score to remove shadows from the image.
Nadimi and Bhanu [127, 128] proposed a physics based approach to shadow detec-
tion. The approach requires a training phase where the body colour (the colour
of the material under white light) of surfaces that may come under shadow in the
scene is calculated. This method uses physical properties of shadows to detect
them using a series of tests. A shadow must result in the intensity of the pixel be-
ing reduced, so a reduction in value across the R, G, and B channels is expected.
A blue ratio test is applied, as it is observed that for shadows cast outdoors onto
neutral surfaces, there is a higher ratio of blue due to the illumination by the
blue sky. An albedo ratio segmentation step is performed, to segment the image
into regions of uniform reflectance. An ambient illumination correction is per-
formed to remove the effect of sky illumination, and then body colour estimation
is performed to determine the true colour of the object. A verification step then
matches the various surfaces with their expected body colours to determine which
regions lie in shadow.
Wang et al. [168] proposed a method which analysed each shadow as a series of
sub-regions. Cast shadows are split into three groups:
1. Deep Umbra Shadow - caused by blocking the sunlight, with no environmen-
tal/reflected light to lessen the shadow.
2. Shallow Umbra Shadow - caused by blocking the sunlight, with environ-
mental/reflected light to lessen the shadow.
3. Penumbra Shadow - partly illuminated by sunlight and environmental light,
this is very difficult for humans to detect.
Each type of shadow is segmented and dealt with separately to improve segmen-
tation accuracy and reduce the number of false shadows found. The percentage of
grey level reduction for pixel regions can be used to roughly segment the three dif-
ferent shadow types. Colour for the pixels is compared to that of the background
image. For a shadow region, the colour should not change significantly.
Grest et al. [59] observed that a region in shadow is a scaled-down version (darker)
of the same region in the background model. As such, normalised cross correlation
can be used to detect shadows. Jacques et al. [79] proposed a variation on the
work of [59] by adding an additional step that used the statistics of local pixel
ratios. Potential shadow regions are detected using NCC as described by [59].
The ratio of the input image and background image is calculated for the pixels
in the local neighbourhood of the candidate shadow. If the standard deviation
of these ratios is beneath a predefined threshold (i.e. if the region has undergone
a constant illumination decrease), then the region is classified as a shadow.
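A sketch of this two stage test, for a single candidate patch, is given below; the thresholds and the exact normalisation are assumptions made for illustration rather than the published values.

import numpy as np

def is_shadow_patch(patch, bg_patch, ncc_thresh=0.95, ratio_std_thresh=0.05):
    p = patch.astype(float).ravel()
    b = bg_patch.astype(float).ravel()
    # Normalised cross correlation [59]: a cast shadow is a darker but
    # structurally similar version of the corresponding background patch.
    pc, bc = p - p.mean(), b - b.mean()
    ncc = np.dot(pc, bc) / (np.linalg.norm(pc) * np.linalg.norm(bc) + 1e-6)
    if ncc < ncc_thresh:
        return False
    # Local pixel ratio statistics [79]: a shadow dims the background by a roughly
    # constant factor, so the standard deviation of the ratios should be small.
    ratios = p / (b + 1e-6)
    return ratios.std() < ratio_std_thresh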
Shadow detection methods such as those proposed by [55, 59, 79, 128, 168] are
complex, multi-step processes and as such are not ideal for use in a tracking
system, or for incorporation into an existing motion detection algorithm. Tech-
niques such as these can only be effectively applied as a post process to any
motion detection.
Edges can also be used to aid in shadow detection. Xu et al. [174] proposed using a
Canny edge detector on the foreground image to separate the various regions (both
shadow and non shadow) in the foreground. Through multi-frame integration,
region growing and edge matching, the shadow regions can be identified and
removed. Zhang et al. [183] was able to detect shadows using the ratio edge.
The ratio edge is the ratio between neighbouring pixels, and it is shown to be
illumination invariant. The location of this edge can be analysed to segment
shadows from moving objects. Once again however, these approaches are only
suitable when used in post processing.
Martel-Brisson and Zaccarin [119] proposed using GMMs to detect shadows (al-
lowing for easy integration with a MOGs foreground segmentation processes).
A Gaussian mixture shadow model (GMSM) is built which contains all possible
shadow states. The YUV colour space is used, and shadows are initially detected
by looking for a constant attenuation across the Y, U and V channels. At each
time step, the foreground state with the largest a priori probability is processed
to determine if it describes a cast shadow. If it does, then the shadow state can be
incorporated into the GMSM, either by combining it with another shadow state,
or as a new state (a complex scene may have two or three states per pixel for
shadows). The GMSM can be used to detect shadows by testing if the foreground
pixels match any of the shadow states of the GMSM.
Cucchiara et al. [37] proposed a shadow detection algorithm to be used within the
tracking system, SAKBOT (Cucchiara et al. [35]). HSV colour space is used, and
a shadow is defined as a point where the intensity is reduced, and the ratio of the
reduction lies in the range α to β (where β is less than 1 to avoid detecting points
that have been slightly altered by noise, and α is determined by the strength of
the light source to describe how dark shadows could be); the saturation is slightly
reduced and the hue is relatively unchanged.
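A per-pixel sketch of this test is shown below, using OpenCV's HSV conversion; the values of alpha, beta and the saturation and hue tolerances are illustrative assumptions, as the original work leaves them scene dependent.

import cv2
import numpy as np

def hsv_shadow_mask(frame_bgr, background_bgr, alpha=0.4, beta=0.9,
                    sat_tol=60, hue_tol=50):
    f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(float)
    b = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV).astype(float)
    ratio = f[..., 2] / (b[..., 2] + 1e-6)             # intensity (V) ratio
    hue_diff = np.abs(f[..., 0] - b[..., 0])
    hue_diff = np.minimum(hue_diff, 180.0 - hue_diff)   # hue is circular (0-179 in OpenCV)
    return ((ratio >= alpha) & (ratio <= beta)           # darker, but not too dark
            & (np.abs(f[..., 1] - b[..., 1]) <= sat_tol)  # saturation changes little
            & (hue_diff <= hue_tol))                      # hue relatively unchanged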
This approach (Cucchiara et al. [37]) was later extended (Cucchiara et al. [36]) to
a system capable of discriminating between moving objects and their shadows as
well as ‘ghosts’ (objects detected due to errors in the motion segmentation, they
do not correspond to any actual motion) and their shadows within an image. An
analysis of the optical flow of detected moving objects can be used to distinguish
between ghost objects and moving objects. Moving objects should exhibit a high
average optical flow, while ghosts should be close to 0, as they do not actually
represent any motion (and as such the motion in the region is either zero or
inconsistent).
Shastry and Ramakrishnan [151] modified the system proposed by Cucchiara et al.
[37] so that an improvement in speed was achieved by using track information. As
the shadow detection was being used to detect shadows associated with tracked
moving objects, it could be assumed that the shadows are moving at the same
speed as the objects. A shadow at point (x, y) at time t will be at (x+mx, y+my)
at time t + 1 where (mx, my) is the velocity of the moving object. This prediction
results in small errors in the location of shadows, and over an extended time those
errors can increase. To prevent errors in the shadow detection building up due to
incorrect prediction of shadow movements, the shadow mask is recomputed every
N frames.
2.2.5 Summary
Within a surveillance environment, foreground segmentation and motion detec-
tion techniques can be used to locate objects of interest. Such environments are
often quite complex, and may contain complex lighting and a changing back-
ground. As such the use of a technique that is able to cope with a multi-modal
background (Butler et al. [18], Kim et al. [97], Patwardhan et al. [138], Stauffer
and Grimson [157]) and can learn changes within the background is very impor-
tant. Depending on the environment in which systems are intended to operate,
additional techniques such as shadow detection, highlight detection and the abil-
ity to handle lighting changes are also important. Ideally, the ability to cope
with these issues should be part of any segmentation algorithm (Bowden and
Kaewtrakulpong [13], Hu et al. [75]), rather than additional processes that are
run afterwards.
Whilst optical flow can also be used to detect regions of motion, unless a robust
method is used (i.e. Black and Anandan [9]), errors caused by violations of the
assumptions of constant luminance and no spatial discontinuities are likely to
cause the results to be inaccurate, and potentially unusable as a sole mode of de-
tection. In a surveillance situation (particularly one where there is natural light,
or fluorescent light which causes a noticeable flicker in video footage), with poten-
tially high levels of unpredictable movement, these assumptions will be violated
regularly.
A lack of common data for evaluation makes the comparison of different tech-
niques difficult, and as such, no explicit comparison is presented. However, the
relative strengths and weaknesses have been discussed in general terms based on
the presented literature and their reported results.
2.3 Detecting and Tracking Objects
The process of object tracking can be approached in two ways:
1. In each frame, detect all objects and match these to the list of objects from
the last frame;
2. Detect an object once and extract one or more features to describe the
object, then follow the object using the extracted features.
The first approach is the most common approach to object tracking, and is ad-
dressed further in this section. Examples of the second approach are techniques
such as the mean shift algorithm (Fukunaga [54]) and its derivatives (such as
CAMShift, Continuous Adaptive Mean Shift, Bradski [14]), and particle filters
(which are discussed in Section 2.4). These systems often rely on external input
to initialise the tracking process, as a continuous detection process that allows
automatic discovery of new modes, requires detected modes to be matched to de-
termine which objects are already being tracked (i.e. the first approach described
above).
For the task of automated surveillance, it is desirable to be able to automatically
discover objects as they enter the scene, as relying on a human operator to flag
incoming objects on behalf of the tracking system defeats the original purpose of
the tracking system (i.e. to ease the burden on the human operators).
2.3.1 Object Detection
In order to track an object, it must be able to be reliably and consistently de-
tected, and features that can be observed and matched from frame to frame must
be extracted. These features can be simple distance and position based features,
or more complex colour and texture based features. The acquisition parameters
(colour or grey scale, image resolution, camera field of view) and environment
(indoor/outdoor, day/night) in which the system is intended to operate are likely
to play a large role in determining the type of features used.
Many systems use motion detection to detect objects for tracking. Haritaoglu
et al. [65] detects people by locating blobs of motion and computing the vertical
histogram of the silhouette (heads will lie at local maxima), and combining this
information with the results of convex hull-corner vertices to determine where
potential heads lie within the region. This approach can be used effectively to
segment groups of people. Fuentes and Velastin [53] performed motion detection
using luminance contrast and formed blobs, characterised by a bounding box,
centroid, width and height, to represent tracked people. A blob is a group of
regions that are connected according to spatial constraints. As such, a detected blob
may be formed by a single, large, connected (either 8- or 4-connected) region, or by
a cluster of several smaller connected components located close to one another.
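As a rough illustration of the vertical projection idea (a sketch only, not the implementation of Haritaoglu et al. [65]; the blob-area threshold and peak spacing below are illustrative), candidate head positions can be taken from local maxima of the column-wise foreground count:

    import numpy as np
    from scipy.ndimage import label
    from scipy.signal import argrelmax

    def candidate_heads(motion_mask, min_blob_area=200, peak_spacing=5):
        # Label connected foreground blobs in the binary motion mask.
        labelled, n_blobs = label(motion_mask > 0)
        heads = []
        for blob_id in range(1, n_blobs + 1):
            blob = labelled == blob_id
            if blob.sum() < min_blob_area:
                continue
            # Vertical projection histogram: foreground pixel count per column.
            projection = blob.sum(axis=0)
            # Columns at local maxima of the projection are head candidates.
            for x in argrelmax(projection, order=peak_spacing)[0]:
                top_y = np.nonzero(blob[:, x])[0].min()
                heads.append((int(x), int(top_y)))
        return heads

The returned (column, top row) pairs would then be refined by the convex hull analysis described above.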
Rather than simply tracking blobs, Zhao and Nevatia [184] proposed a system
that used an ellipsoid shape model to locate and segment people from the motion
image. The system was set up such that the camera was deployed a few metres
above the ground looking down, to help overcome the occlusion problems that
occur with ground level cameras. People are detected via an iterative process.
The following two steps are repeated until no more people are detected in the
scene.
1. Locate all heads and fit an ellipsoid person model at the head. If there is
sufficient motion within the ellipse, the person is accepted and their motion
is removed.
2. Perform geometric shadow analysis to remove shadow regions belonging to
the detected people (using the date, location and time of day to determine the
sun's position, and the orientation of any shadows).
Kang et al. [89] also applied head detection after background segmentation to
locate people, who are characterised by bounding boxes. In addition, results
from the previous frame’s head detection are used in the next frame through a
feedback loop to aid the process.
Similar motion based approaches have been used to detect vehicles. Koller et al.
[99] used an adaptive background model to locate vehicles on the road. The shape
of the object is first derived from both the gradient image and the motion mask.
The shape is expressed as a convex polygon that encloses the object, smoothed
with the application of cubic spline parameters to the points.
A problem that can occur when using motion segmentation results as the basis
for object detection is that spurious motion can be included as part of a detected
object. To help overcome this, Lei and Xu [106] proposed a tracking system that
uses a variant of the brightness distortion metric [73] to remove shadows and
highlights from the motion image. Shadow/highlight removal is performed at
two threshold levels (tight and loose). Blobs that are unconnected in the tight
threshold output are grouped according to connectivity in the loose threshold
output. This allows the blob grouping from the loose thresholds to be retained
whilst ensuring that more of the spurious motion is removed.
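A minimal sketch of this two-threshold idea (an interpretation of the description above, not the authors' code) is given below: connectivity is taken from the loose mask, but a loose region is only retained if it contains pixels that also survive the tight threshold.

    import numpy as np
    from scipy.ndimage import label

    def group_blobs(tight_mask, loose_mask):
        loose_labels, n_regions = label(loose_mask > 0)
        grouped = np.zeros_like(loose_labels)
        next_id = 1
        for region_id in range(1, n_regions + 1):
            region = loose_labels == region_id
            # Loose regions with no support in the tight mask are treated as
            # spurious motion (e.g. shadow/highlight residue) and discarded.
            if np.any(tight_mask[region]):
                grouped[region] = next_id
                next_id += 1
        return grouped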
Tracking systems that use colour images as input such as Matsumura et al. [121]
and Wang et al. [167] use skin detection to locate people within the image. These
skin regions are tracked, and a template is constructed from the extracted skin
segment. This skin segment template can then be used to match candidates
in future frames. Wang et al. [167] and Matsumura et al. [121] applied motion
detection first to simplify the skin detection by removing regions of no interest.
Matsumura et al. [121] searched for skin colour in frames, and tracked large skin
regions. These systems, however, rely on being able to detect the face or other
skin regions to track and this may not always be possible.
Other systems have used a learned model approach rather than relying on motion
detection. Motion based approaches are ultimately reliant on the performance
of the motion detection, with poor motion detection likely to lead to poor object
detection. A model based approach can overcome this limitation, but relies on a
suitable model (or models) being trained.
Rigoll et al. [145] used Pseudo 2D HMMs (P2DHMM) to track people. This
avoided using motion detection, and so avoided problems such as cars/other
people/trees causing additional movement that results in false objects being de-
tected in the scene or valid tracks being lost. A P2DHMM trained on over 600
people was used to locate people in a frame. The P2DHMM uses the tracked
object's centroid, velocity, and bounding box height and width as inputs into a
Kalman filter [172]. The Kalman filter feeds the next prediction back to the HMM to aid
in the tracking, and as tracking progresses the HMM adapts its model to better
fit the object that it is currently tracking. The major benefit of a system such as
this is that there is no motion detection, meaning the camera can zoom and pan
without resulting in any additional complexity being added to the system.
Nguyen et al. [131] also made use of Markov models for recognising the actions of
people within a tracking system. An Abstract Hidden Markov Model (AHMM)
was developed which replaces the Markov chains of the standard HMM with
Markov policies. Policies could be defined in a hierarchy such that higher level
policies could be built from simple low-level policies. Like a HMM, the behaviours
were learned off-line by observing training data. Kato et al. [91] applied HMMs
to vehicle tracking and traffic monitoring.
Seitner and Lovell [150] used a Viola-Jones [165] detector to detect people and
subsequently track them in a scene. Yang et al. [177] and Okuma et al. [135] have
also used the Viola-Jones detector [165] with particle filters to track people.
Despite the clear advantage of being able to cope with lighting changes and
camera noise that would have a severe negative impact on methods that rely on
the analysis of image masks, learned model approaches do have their drawbacks.
Within a surveillance environment, it can be expected that the objects of interest
(i.e. people or vehicles) will be viewed from various angles (i.e. front on, side
on, and anywhere in between), and any detection method will need to be view
invariant. This invariance can be achieved in two ways:
1. Train the model to recognise the object from any angle
2. Train several models for the different view angles
Given the variation observed when viewing an object such as a person from
all angles, any single model that is trained to detect a person at any angle is
likely to be too general and perform poorly (particularly in a real world situation
with a complex background). Whilst training several models solves this problem,
training several models to detect each object class is very demanding and not
ideal.
2.3.2 Matching and Tracking Objects
Once objects have been detected in a frame, it is necessary to match the objects
that have been detected to those that were detected in the previous frame. Zhao
and Nevatia [184] matches detected objects iteratively. Tracks are matched one
by one (each track compared to all located objects) in order of their depth in
the scene (determined by position in the frame and camera calibration). Fuentes
and Velastin [52] matches detected and tracked objects using a two way matrix
matching algorithm (matching forwards and reverse). In order to perform this
matching however, some form of feature (or features) needs to be extracted for
comparison.
The features used to match tracks vary greatly, and the type of feature used is
partially dependent on the system requirements and type of input the system is
receiving (i.e. grey scale or colour images). The types of features used can be
broadly grouped as follows:
1. Geometric features (i.e. object position, bounding box position/size [106,
121]) - can be extracted directly from the object detection results with no
further image processing. Reliability of the features is directly dependent
on the object detection performance, however the features are very quick to
extract and compare. Due to their simplicity, geometric features are often
used in combination with more complex features [106, 121].
2. Edge features (i.e. silhouettes [63, 66]) - can be extracted from a motion
mask, or similar mask image (a mask could possibly be extracted using
colour segmentation techniques). Performance of edge features is depen-
dent on the accuracy of the mask; segmentation errors will lead to poor
performance.
3. Colour/Texture features (i.e. histograms [113, 129], appearance models
[25, 63, 74, 89, 144]) - can be extracted using a combination of the object
detection results, the input images and any mask images. Such features are
more robust to object detection and segmentation errors, but the features
are more computationally demanding.
Tracking systems may use multiple features to track objects [40, 63, 66, 178].
Different features may be used at different times, or depending on the state of
the tracked object. For example, when the object has been observed for several
frames in succession and there is little complexity in the scene, simple geometric
features may be sufficient to match the object. If the object has not been observed
for several frames then position and size estimates are less reliable, and so colour
or texture features may be more appropriate.
Colour feature approaches have focused on using histograms, as they are simple
to compute and compare. The colour histogram is relatively unaffected by pose
change or motion, and so is also a reliable metric for matching after occlusion. His-
tograms are matched by calculating the histogram intersection. Lu and Tan [113]
uses motion detection to find people and characterises them using their bounding
rectangle, size and a colour histogram. Ng and Ranganath [129] proposed an im-
provement to the colour histogram model such that the histogram uses variable
bin widths, resulting in comparable performance with a five-component GMM.
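For reference, histogram intersection of two normalised colour histograms can be sketched as follows (the bin count and colour space are illustrative choices, not those of the cited systems):

    import numpy as np

    def colour_histogram(patch, bins=8):
        # Normalised joint RGB histogram of an image patch (H x W x 3, uint8).
        hist, _ = np.histogramdd(patch.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 256),) * 3)
        return hist / max(hist.sum(), 1)

    def histogram_intersection(h1, h2):
        # Similarity in [0, 1]; 1 indicates identical colour distributions.
        return float(np.minimum(h1, h2).sum())

A detection is then matched to the track whose stored histogram gives the highest intersection score.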
One limitation of histograms is that they do not contain any position information.
Two objects that have very similar colour histograms may have dramatically
different appearances due to the distribution of the colours. For example one
person may be wearing a white shirt and black pants whilst a second is wearing
a black shirt and white pants. Whilst these people may have quite distinct
appearances, they would have very similar histograms. To overcome this fault,
Hu et al. [74] extracted three histograms from each person, one each for the
head, torso and legs, to not only allow for matching based on colour, but also
on distribution of colour. An ellipsoid shape model is used to characterise the
person, and tracking is performed using a condensation filter [77].
Chien et al. [25] proposed a colour model (Human Colour Structure Descriptor -
HCSD) that aims to capture the distribution of colours in a human body. Three
colours are used to represent the colour of the body, legs and shoes, and positions
are defined to describe the position of body and legs relative to the shoes. The
model is generated by first extracting the silhouette for the object; then extracting
skeletons for the shoes, legs and body; from which colours and positions are
obtained.
Kang et al. [89] makes use of colour clusters to aid in tracking. Motion estimation
is the primary method of people detection and tracking, and is used in the case of
tracking a single person (or the occluder in a total occlusion). Colour is used for
partial occlusions, or when an object re-enters. Once a person is detected, colour
clusters are obtained by calculating the colour histogram, calculating means for
each bin, and then merging similar bins. About three clusters per person are
obtained, representing the three major colours belonging to that person. Each
cluster has a weight that is a function of its size, duration, frequency and the
existence of other nearby objects (nearby objects cause inaccuracies as they may
contribute part of their colour to the object). This colour model can be used to
match people after occlusions, or after they have left and re-entered the scene.
The colour correlogram (Rao et al. [144]) is a variant of the colour histogram,
where geometric information is encoded as well as colour information according
to predefined geometric configurations. Zhao and Tao [185] proposed a simplified
colour correlogram [144] for use in tracking systems. As the original correlogram
[144] is too processor intensive for real time tracking, a simplified version which
only considers pixels lying on the major and auxiliary (perpendicular to the ma-
jor) axis of the object is proposed. This is much simpler to compute and still able
to deal with rotational variations. To track using the simplified correlogram, a
modified mean shift algorithm is proposed that is capable of determining rotation
changes within the search rather than separately.
Bourezak and Bilodeau [12] also make use of correlograms to track objects of
interest. Bourezak and Bilodeau [12] proposed a system that used histograms
to perform background segmentation. The image is divided into a series of sub-
regions, and a reference histogram is computed for each. This reference histogram
is then compared with the histogram for the incoming frame to determine if the
region is in motion. This can be performed iteratively to refine the detection.
Using normalised histograms and a coarse to fine approach means that small
amounts of noise can be ignored and the system is invariant to changing lighting
conditions. Correlograms and histograms are then used to track the objects,
capturing both colour and texture information.
To improve reliability and tracking performance, systems such as Darrell et al.
[40] and Yang et al. [178] use multiple modalities to track people. Darrell et al.
[39, 40] combined the use of stereo, colour and face detection to track people.
Models are integrated according to their strengths and weaknesses, and reliability
of each mode. Face detection results are given greater precedence, and the other
two modalities are used to update when the face is not available. To detect
people after leaving and re-entering the scene (long term tracking), Darrell et al.
[40] uses visual clues such as height, skin colour, hair colour and face pattern.
These can all be used in the short term (up to a couple of hours), however, for
tracking of over a day, compensation for lighting changes is needed. This can be
done by mean shifting all colours and excluding any clothing colour information
from matching.
Yang et al. [178] combined motion, depth and colour, and merged these by treat-
ing the observations from each module as Gaussian distributions. Features are all
tracked separately and all have individual Kalman filters for tracking. Features
are fused late in the process and all tracking parameters have an ‘uncertainty’
value associated with them. The depth module performs SAD (sum of absolute
differences) matching on regions of interest, and computes U = D_s/D_s' for all
regions (where D_s is the minimum value from the depth module, D_s' is the second
smallest, and U is the uncertainty). If
U and D_s satisfy thresholds, then a right-left consistency check is performed, the
resulting depth map is smoothed, and holes within are filled. The objects result-
ing from this detection are translated into an overhead view and the uncertainty
for each candidate is,
Udepth = a × |p - W| / W + c, (2.9)
where a and c are controlling constants, p is the width of the detected object, W
is the expected width of a person and Udepth is the uncertainty for the detected
object.
Skin colour is used for colour tracking. A locus model is used to group skin
regions into face shapes. The system tries to detect lips to distinguish the face
from other skin regions. The uncertainty for this modality is defined as,
Ucolour = a × |s - 1.5| / 1.5 + b × (t - 0.1) / 0.1 + c, (2.10)
where a, b and c are constants, s is the aspect ratio of the face bounding box and
t is the ratio of the lip colour.
The motion module uses a temporal subtraction technique to locate blobs and
nearby regions are merged to locate candidates. The uncertainty is,
Umotion = a × |s - 1.5| / 1.5 + c, (2.11)
where a and c are constants and s is the aspect ratio of the bounding box.
To merge candidates from each module, the candidate with the lowest uncertainty
is taken and integrated with candidates from other modules that are within a de-
fined distance threshold. Candidates are combined as though they are Gaussian
distributions, and the uncertainty represents the standard deviation. Final can-
didates are tracked by a Kalman filter.
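The thesis does not reproduce the exact fusion arithmetic of Yang et al. [178]; the sketch below simply illustrates the stated idea of treating each candidate as a Gaussian whose standard deviation is the module uncertainty, which leads to an inverse-variance weighted combination:

    import numpy as np

    def fuse_candidates(positions, uncertainties):
        # positions: (k, 2) candidate positions from the different modules.
        # uncertainties: (k,) per-module uncertainty, used as a standard deviation.
        positions = np.asarray(positions, dtype=float)
        precisions = 1.0 / np.square(np.asarray(uncertainties, dtype=float))
        fused = (positions * precisions[:, None]).sum(axis=0) / precisions.sum()
        fused_sigma = np.sqrt(1.0 / precisions.sum())
        return fused, fused_sigma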
W4 [63, 66] uses a combination of silhouette matching and an appearance model to
track objects. Frame to frame tracking is achieved using silhouettes. Silhouettes
are compared by matching over a 5 × 3 window and performing a binary edge
correlation (typically dominated by the head and torso as these are slower moving
than the legs). Whilst this approach is suitable for frame to frame tracking, if
a person is lost for more than a couple of frames the differences between the
silhouettes may be too great to match a person to their current silhouette.
To match people who have been occluded, or who have left and re-entered the
scene, Haritaoglu et al. [63][66] propose an appearance model where data per-
taining to the texture and position of the subject is recorded, and can be used to
determine the identity of a person who has just ceased to be occluded,
Ψ_t(x, y) = (I(x, y) + w_{t-1}(x, y) × Ψ_{t-1}(x, y)) / (w_{t-1}(x, y) + 1), (2.12)
where Ψ is the texture model, x, y is the pixel being updated, I is the input image,
and w is an occupancy map describing how many times the pixel x, y has been
classified as foreground in the last N frames. The texture model can be matched
to a new object to determine if the re-entering object is the same person,
C(p, r) = [ Σ_{(x,y)∈S_p} |S_p^t(x, y) - Ψ_r^t(x, y)| × w_r^t(x, y) ] / Σ w_r^t(x, y), (2.13)
where p is the person who has been tracked and r is the person who has dis-
appeared for a period of time. The tracked person's grey scale silhouette (S_p^t) is
compared to person r’s texture model to determine if they are the same person.
This model allowed people to be re-detected if they had been lost for several
frames due to occlusion, or had left and re-entered the scene.
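A compact sketch of Equations 2.12 and 2.13 is given below (maintenance of the occupancy map w over the last N frames is omitted, and variable names are illustrative):

    import numpy as np

    def update_texture_model(model, occupancy, frame_grey, fg_mask):
        # Equation 2.12: running average of the grey level at each foreground pixel,
        # weighted by how often the pixel has recently been classified as foreground.
        updated = model.copy()
        w = occupancy[fg_mask].astype(float)
        updated[fg_mask] = (frame_grey[fg_mask] + w * model[fg_mask]) / (w + 1.0)
        return updated

    def texture_match_cost(silhouette_grey, fg_mask, model, occupancy):
        # Equation 2.13: occupancy-weighted mean absolute difference between the
        # tracked person's grey scale silhouette and the stored texture model.
        w = occupancy[fg_mask].astype(float)
        diff = np.abs(silhouette_grey[fg_mask].astype(float) - model[fg_mask]) * w
        return diff.sum() / max(w.sum(), 1e-6)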
Matsumura et al. [121] makes use of position, speed, a template image and an
object state to track the regions. The tracked regions' positions are estimated
using a Kalman filter, and the object state (which could be one of absence,
emergence, tracked, lapped or lost) allows the system to know how to (or whether to)
update the model.
Siebel and Maybank [152] combined a region tracker, head detection and active
shape tracker (AST) to obtain more robust tracking, where each component is
able to make use of others to improve results. Fusion is achieved by allowing
trackers to use each other’s output as well as historical output. The region tracker
makes use of tracking status and a history database to track over time. If the
region tracker is unsure about a track or cannot detect an object, it uses the active
shape tracker. The AST is also used to split large regions. The head detector
uses regions from the region tracker, and its head positions are in turn used by
the AST. The AST uses the other modules to initialise tracks. The combined
results of the modules are refined and filtered. The system checks for the same
object in multiple trackers and then selects the best track for use, discarding the
others. However, multiple tracks of one object are kept if they are considered
possibly valid, although only the best tracks appear in the output.
Lei and Xu [106] tracks objects using several simple features (position, shape and
colour based), by comparing detected object features to those of the track and
dividing by the variance to obtain a cost for the match. Spurious objects can be
detected and removed by observing the variance of the position and velocity of
the tracked object, to determine if it represents an actual moving object.
2.3.3 Handling Occlusions
In real world tracking situations, occlusions are inevitable. Within a tracking
system, there are two main types of occlusion:
• Object and Environment - a tracked object is obscured by a fixed item in
the environment, such as moving behind a pillar or tree.
• Object and Object - one tracked object obscures another.
It is important for a tracking system to be able to handle occlusions, and resume
tracking after an occlusion has passed. In order to do this, it can be advantageous
to detect or anticipate occlusion events. Lu and Tan [113] anticipates occlusion
events by using the minimum bounding rectangle for the object being tracked, to
determine when two objects are likely to intersect and one will occlude the other.
Koller et al. [99] estimates the depth position of the tracked objects. The depth
order is used in an explicit occlusion reasoning module that sorts objects based
on their vertical position, and combines this with expected positions (determined
by a Kalman filter) to determine occlusions.
Rad and Jamzad [142] explicitly deals with occlusions by predicting and detecting
when they occur using three criteria:
1. By examining the trajectory of each vehicle, if the centre points of the
tracked regions will be too close to each other, occlusion is predicted.
2. If the size of a region exceeds a given threshold, then it is assumed that
the region includes more than one object.
3. If the size of a region changes by a significant amount between frames then
it can be assumed that an occlusion-related merge or split has occurred.
If an occlusion is detected, the region is split by examining the bounding contour
of the motion mask. The region is split through the contour point farthest away
from the minimum bounding rectangle.
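These three criteria can be expressed as a simple predicate; the thresholds below are purely illustrative and would need tuning for a given scene:

    import numpy as np

    def occlusion_suspected(track_centres, region_area, prev_region_area,
                            min_separation=30.0, max_single_area=5000,
                            rel_change=0.5):
        centres = np.asarray(track_centres, dtype=float)
        # 1. Predicted centres of two tracked regions come too close together.
        for i in range(len(centres)):
            for j in range(i + 1, len(centres)):
                if np.linalg.norm(centres[i] - centres[j]) < min_separation:
                    return True
        # 2. A region is larger than a single object is expected to be.
        if region_area > max_single_area:
            return True
        # 3. A region's size changes abruptly between frames (merge or split).
        if prev_region_area > 0 and \
                abs(region_area - prev_region_area) / prev_region_area > rel_change:
            return True
        return False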
Another approach to overcoming occlusion problems is to use depth information.
The use of depth information allows the order of objects (i.e. closest to the cam-
era) in the scene to be determined, and allows occluded objects to be segmented
provided they are at sufficiently distinct depths. Whilst depth ordering can be
approximated by using the pixel coordinates of the object region that is touch-
ing the ground (in most systems where the camera is mounted such that objects
appear vertical in the image, this is the bottom edge of the bounding box), this
method relies on correct segmentation at the base of the object, and thus is prone
to inaccuracies. Haritaoglu et al. [64] applied object detection to both intensity
and disparity images, with the stereo modality proving most useful when there is
a sudden change in the illumination, there are shadows, or regions in the intensity
image split (less likely to split in disparity).
Harville and Li [69] proposed a system which tracks people within a plan view
(a view from above the scene, looking down). The system's input is ‘colour with
depth’: three channels of colour and one channel of depth. Height and occupancy
maps are generated in the plan view, showing candidate heights and the amount
of motion within the candidate region. By using a plan view for the tracking,
the system overcomes many of the occlusion problems found in other systems.
Beymer [7] also uses a plan view, sourced from a stereo camera mounted such that
it looked directly down to the ground, to count people as they enter a doorway.
Feature points can also be used to help overcome occlusions. For a given object, it
is likely that several feature points can be extracted. As the object moves, it can
be expected that some feature points will not be visible at times, either due to self
occlusion or occlusion with other objects, but some feature points are likely to
be always visible. Coifman et al. [26] proposed a system that used feature points
to track vehicles. As it is likely that each vehicle will have multiple feature points,
the use of these for tracking helps to mitigate the problems with occlusions as it
is likely that even in the event of an occlusion, one of the feature points will be
visible. No motion detection or background detection is used; corners are drawn
from a pre-defined detection area on the video input where vehicles are expected
to enter the scene. If these points then move, they are considered to be part of
a moving vehicle. Points that satisfy a common motion constraint are grouped,
and the system then considers each group representative of a single vehicle.
Tang and Tao [158] proposed a dynamic feature graph to represent tracked ob-
jects. Objects are modelled as a group of invariant features (SIFT [111]) and
their relationship is encoded into an attributed relational graph. This can overcome
problems associated with other colour models such as histograms as it models the
structure and distribution of the object features. Feature relations are defined
by three items; the Euclidean distance between the features, the scale difference
and the orientation difference. As relative measures are used in determining the
relations, the model is invariant to rotations and translations. Features are added
to and removed from the model after they have been observed or absent for a
period of frames. Relaxation labelling is used to match the graphs.
Bunyak et al. [16] proposed a novel tracking approach based on a graph structure,
where nodes represent detected objects in consecutive frames, and edges represent
the confidence of a match between the nodes (objects). Over time, the graph
can be pruned, to eliminate the false trajectories as more information becomes
available. Appearance similarity is computed using colour features, and location
similarity is computed using centroids. The similarity measures are combined
to obtain a similarity confidence for the match. A separation confidence is also
obtained (which describes how distinct the match is) which is combined with the
similarity confidence using a weighted sum. The objects are filtered and the graph
is pruned at a variety of stages (object detection, similarity matching, evaluating
confidences, and by eliminating short segments that start or end unexpectedly).
Source and sink areas where occlusions may arise are identified in advance. A
Kalman filter is used to determine possible future positions of an occluded object,
which can be matched when the object reappears. An approach such as this allows
for tracking errors made as a result of occlusions or false/missed detections to be
corrected as more evidence is gathered.
2.3.4 Alternative Approaches to Tracking
Systems that rely on visual cameras may suffer from problems relating to lighting
and weather conditions that can be avoided by using thermal sensors. Latecki
et al. [104] proposed a method adapted for detection and tracking in infrared
videos. A spatio-temporal representation was used, to provide a more robust
method of motion detection to counter the increased noise present in IR imagery
compared to visual.
Rather than using only colour or grey scale cameras within the visible spec-
trum, Han and Bhanu [61] proposed a system that uses a combination of thermal
infrared and colour sensors to detect human movement. Two approximately iden-
tical images are obtained, one from a colour camera and one from an IR camera,
and these are registered. The data from the two images can be fused to provide
more accurate human locations and better performance in adverse conditions.
O’Conaire et al. [31, 32, 133] experimented with fusion for object segmentation,
background modelling and tracking using colour and thermal infrared images.
Fusion for tracking is done in the appearance model by using a multi-dimensional
Gaussian to represent each pixel. The scores from the visible and thermal spec-
tra in the appearance model are fused in different ways to match the model to
the incoming image. The different methods of combining scores are compared to
ascertain the best method for this form of fusion. Blum and Liu [11] proposed
different methods of early image fusion using the wavelet transform and the pyra-
mid transform. These early fusion methods can be used to fuse the images before
they are fed into a tracking system, allowing a conventional single mode tracking
algorithm to be used. Han and Bhanu [62] proposed techniques for using
colour and infrared images for moving human silhouette extraction, as well as
using these silhouettes for automatic image registration between the
infrared and colour images.
Optical flow is another method commonly used to track objects [117, 134, 136,
162, 175, 179], and is often used as an alternative to motion detection for locating
moving objects in a scene. Whilst optical flow is more prone to noise than motion
detection, it does offer additional information in the form of the direction of
movement. This can be used to aid the prediction of future positions and segment
occlusions between objects moving in different directions.
Lucena et al. [117] uses the Lucas and Kanade algorithm to track objects using
particle filters. The probabilistic approach of the particle tracker works well for
optical flow, as it can naturally handle the incomplete or imprecise data that the
optical flow estimations provide. Lucena et al. [115, 116] propose an observation
model to track contours using optical flow within the condensation framework [77].
The discontinuities in flow between the inside and the outside of the tracked object's
contour are used to determine the accuracy of the match from
the model to the input image. The area inside should have an optical flow close
to that predicted by the model, while the area outside should be significantly
different.
Optical flow algorithms perform best around clearly defined features, and areas
that contain sparse levels of detail are often handled poorly. To counter this,
Yamane et al. [175] proposed a method using optical flow and uniform brightness
regions (a section where the optical flow cannot be detected due to a lack of
texture) to track people. Optical flow is used to detect general areas of motion,
and areas of uniform brightness are found and tracked within the object's bounding
box.
Other tracking approaches that utilise optical flow include Oshima et al. [136], who
proposed using the mean shift algorithm [29, 30] with optical flow [72] and a near in-
frared camera to track people in low contrast surveillance imagery; and Yokoyama
and Poggio [179], who proposed using optical flow in conjunction with a Canny
edge detector [20] to extract and track contours. Okada et al. [134] uses optical
flow and depth information for tracking, whilst Tsutsui et al. [162] applied optical
flow in a multiple camera system.
Optical flow can also be utilised to recognise actions in a sporting context [45]
and to detect abnormalities in crowd motion [3].
Tracking techniques have also been applied to detecting and tracking individual
body parts. This allows the movement of individual limbs to be monitored,
facilitating gesture detection. Wren et al. [173] developed a system (Pfinder)
where the person is modelled as a series of ‘blobs’, with each blob corresponding to
a major body part. This method allowed gesture recognition to be performed by
analysing the movement of the blobs. The system relied on a constant background
(due to a simple background detection method) and struggled when multiple
people entered the scene. Ramanan and Forsyth [143] proposed a similar method
where the human body is modelled as a 2D puppet, consisting of 9 rectangles
representing body parts (i.e. arms, legs, torso, head etc.). Through the use of
kinematic constraints, the parts can be joined together to model the person(s)
being tracked.
2.3.5 Summary
The problem of object tracking can be split into two main tasks, detection and
matching. There are two main approaches to detection, analysis of a mask image
(such as a motion image) to locate objects of interest (Fuentes and Velastin
[53], Haritaoglu et al. [65], Zhao and Nevatia [184]), or the use of learned models
(Rigoll et al. [145], Seitner and Lovell [150]). Given the difficulties using a learned
model approach to detection (training suitable model(s) that are able to cope
with the wide variations in viewing angle), the analysis of mask images for object
detection has been more widely used.
To match objects, a wide variety of features are used from simple position and
geometric based features (Haritaoglu et al. [66], Matsumura et al. [121]), to his-
togram based colour models (Hu et al. [74], Lu and Tan [113]) and more complex
appearance models (Chien et al. [25], Haritaoglu et al. [66]). Appearance models
that can encode position and colour information are ideal, as these prove more ro-
bust and discriminative than histograms alone. Features can also be used to help
resolve occlusions by checking identity once the occlusion has passed. The ideal
choice of features for a system is not clear, and to a large extent it depends on
the application. However, using multiple features (Darrell et al. [40], Siebel and
Maybank [152], Yang et al. [178]) provides greater protection against switching
the identities of tracks and recovery from occlusion. The impact of occlusions can
also be lessened by anticipating occlusions (Lu and Tan [113], Rad and Jamzad
[142]), or modelling objects in such a way that they can be tracked through
occlusions (i.e. through the use of feature points [26]).
A lack of common evaluation data makes direct comparison of individual tech-
niques difficult, as the vast majority of evaluation is performed on privately cap-
tured datasets, with any performance metrics used varying from author to author.
For this reason, no comparison of performance is given.
2.4 Prediction Methods
An important part of a tracking system is the ability to predict where an object
will be in the next frame. This is needed to aid in matching the tracks to detected
objects, and to predict position during occlusions. There are three common
approaches to predicting an object's position:
1. Motion Models.
2. Kalman Filters.
3. Particle Filters.
Motion models and Kalman filters will be discussed in brief. A more detailed
discussion on particle filtering will be presented, as these techniques have become
the method of choice for tracking systems.
2.4.1 Motion Models
Motion models are a simple type of predictor and are common in basic tracking
systems. Motion models aim to predict the next position based on a number of
past observations. They may or may not make use of acceleration, and can be
expressed as,
p(t+ 1) = p(t) + v(t), (2.14)
where p(t+ 1) is the expected position at the next time step, p(t) is the position
at the current time step, and v(t) is the velocity at the current time step.
For the simplest implementation,
v(t) = p(t)− p(t− 1). (2.15)
Other implementations use the history of the object to determine its velocity,
such that,
v(t) = (p(t) - p(t - N)) / N, (2.16)
where N is the size of the history being used. Using a smaller history (or none)
means that the model can react faster to changes in direction by the tracked
object. However, it also makes the model more sensitive to errors in the object’s
position (caused by segmentation or detection faults) which can result in the poor
prediction of future positions.
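A minimal sketch of Equations 2.14 to 2.16 follows (assuming positions are stored with the most recent last; the history length is illustrative):

    import numpy as np

    def predict_next_position(history, N=3):
        # history: list of past positions (e.g. centroids), most recent last.
        history = np.asarray(history, dtype=float)
        p_t = history[-1]
        N = min(N, len(history) - 1)
        if N < 1:
            return p_t                         # no history yet: assume stationary
        v_t = (p_t - history[-1 - N]) / N      # velocity averaged over N frames
        return p_t + v_t                       # Equation 2.14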
2.4.2 Kalman Filters
The Kalman filter (Kalman [86]) is a linear predictive filter, and can be used to
predict the state of a system in the presence of noise. The filter estimates the
process state at the next time step, and uses the measurement at that time
step as feedback. Equations can be split into time update equations (predict
the next state of the process) and measurement update equations (incorporate
the new information into the system, to improve future estimations). A detailed
explanation of the equations and the tuning of parameters is provided in [172].
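For concreteness, a minimal constant-velocity Kalman filter for 2D position tracking is sketched below; the noise parameters are illustrative and would be tuned as described in Welch and Bishop [172]:

    import numpy as np

    class ConstantVelocityKalman:
        # State is (x, y, vx, vy); only the position (x, y) is measured.
        def __init__(self, q=1e-2, r=1.0):
            self.F = np.array([[1, 0, 1, 0],
                               [0, 1, 0, 1],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]], dtype=float)   # state transition
            self.H = np.array([[1, 0, 0, 0],
                               [0, 1, 0, 0]], dtype=float)   # measurement matrix
            self.Q = q * np.eye(4)                            # process noise
            self.R = r * np.eye(2)                            # measurement noise
            self.x = np.zeros(4)
            self.P = np.eye(4)

        def predict(self):
            # Time update: project the state and covariance forward one frame.
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:2]

        def update(self, z):
            # Measurement update: correct the prediction with the observed position.
            y = np.asarray(z, dtype=float) - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P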
Kalman filters are, however, limited by their inability to effectively handle non-
Gaussian distributions, and are constrained by requiring the process being es-
timated, and the measurements’ relationship to the process, to be linear. Ex-
tensions have been proposed to the Kalman filter such as the Extended Kalman
Filter (EKF) and the Unscented Kalman Filter (UKF) (Julier and Uhlman [85])
to try and overcome this limitation.
The EKF allows non-linear relationships by linearising about the current mean
and covariance. The EKF can break down when faced with highly non-linear
models, which cannot be adequately approximated (see Welch and Bishop [172]
for more information). The UKF [85] addresses some of the approximation issues
of the EKF by using the actual nonlinear models rather than approximations.
The UKF uses a set of deterministically chosen sample points (selected from
around the mean), which are propagated through the nonlinear system to obtain
the mean and covariance for the posterior distribution (see Julier and Uhlman
[85] for more information).
2.4.3 Particle Filters
Particle filters [120] are a sequential Monte Carlo method based on a particle
representation of probability densities. Particle filters have an advantage over
Kalman filters in that they can model any multi-modal distribution, whereas
Kalman filters are constrained by the assumptions that state and sensory models
are linear, and that noise and posterior distributions are Gaussian.
Sequential Monte Carlo methods using sequential importance sampling were ini-
tially proposed in the 1950s (Hammersley and Morton [60], Rosenbluth and
Rosenbluth [146]) for use in physics and statistics. Particle filters use a set of
samples (particles) to approximate the posterior PDF. Like a Kalman filter, the
process contains two major steps each time step, prediction and update.
The state of the filter at time t is represented by xt, and its history is Xt =
(x1, x2, .., xt). The observation at time t is zt, and its history is Zt = (z1, z2, .., zt).
It is assumed that the object dynamics form a temporal first-order Markov chain,
so that the next state depends only on the immediately previous state,
p(xt|Xt−1) = p(xt|xt−1). (2.17)
Observations are considered to be independent (mutually and with respect to the
process). The observation process is defined by specifying the conditional density,
p(zt|xt) at each time, t,
p(Zt|Xt) = ∏(i=1..t) p(zi|xi). (2.18)
The conditional state density at time t is defined as,
pt(xt) = p(xt|Zt). (2.19)
State density is propagated over time according to the rule,
p(xt|Zt) = ktp(zt|xt)p(xt|Zt−1), (2.20)
where,
p(xt|Zt-1) = ∫ p(xt|xt-1) p(xt-1|Zt-1) dxt-1, (2.21)
and kt is a normalisation constant that does not depend on xt.
In a computational environment, we approximate the posterior, p(xt|Zt), as a set
of N samples, {s_t^i}, i = 1..N, where each sample has an importance weight, w_t^i. When
the filter is initialised, this distribution is drawn from the prior density, p(x). At
each time step, the distribution is re-sampled to generate an un-weighted particle
set. Re-sampling is done according to the importance weights. The generic
particle filter algorithm is outlined below, and illustrated in Figure 2.4:
Initialisation: at t = 0
• For i = 1..N, select samples s_0^i from the prior distribution p(x_0)
Iterate: for t = 1, 2..
1. Importance Sampling
(a) Predict each sample's next position,
s_t^i ∼ p(x_t|x_{t-1} = s_{t-1}^i). (2.22)
This prediction process is governed by the needs of the individual
system, but typically involves adjusting the position according to pre-
defined system dynamics and adding noise.
(b) Evaluate the importance weights based on the measured features, z_t,
w_t^i = p(z_t|x_t = s_t^i). (2.23)
(c) Normalise the importance weights,
w_t^i = w_t^i / Σ_{j=1}^{N} w_t^j. (2.24)
2. Re-sampling
(a) Resample from s_t^i so that samples with a high weight, w_t^i, are em-
phasised (sampled multiple times) and samples with a low weight are
suppressed (re-sampled few times, if at all).
(b) Set w_t^i to 1/N for i = 1..N.
(c) The resultant sample set can be used to approximate the posterior
distribution.
The re-sampling step (2(a)) is required to avoid degeneracy in the algorithm
(see Kong et al. [100] for more details), by ensuring that all the weight does not
become contained within a single particle. Sequential Importance Re-sampling
(SIR) is a commonly used re-sampling scheme that uses the following process to
select a new sample. The process is applied for i = 1..N; a sketch of the full predict, weight and re-sample loop is given after these steps.
1. Generate a random number, r ∈ [0..1].
2. Find the smallest j which satisfies Σ_{k=1}^{j} w_t^k ≥ r, j ∈ [1..N].
3. Set s'_t^i = s_{t-1}^j.
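A minimal sketch of one predict, weight and re-sample iteration is given below; the random-walk dynamics and Gaussian likelihood used in the usage example are placeholders for whatever dynamics and observation model a real tracker would supply:

    import numpy as np

    def particle_filter_step(samples, dynamics, likelihood, rng):
        # samples: (N, d) array of particles from the previous time step.
        N = len(samples)
        samples = dynamics(samples, rng)              # step 1(a): prediction
        weights = likelihood(samples)                 # step 1(b): importance weights
        weights = weights / weights.sum()             # step 1(c): normalisation
        idx = rng.choice(N, size=N, p=weights)        # step 2(a): SIR re-sampling
        samples = samples[idx]
        weights = np.full(N, 1.0 / N)                 # step 2(b): reset weights to 1/N
        return samples, weights

    # Illustrative usage: 2D random-walk dynamics and a Gaussian likelihood.
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(500, 2))
    dynamics = lambda s, r: s + r.normal(scale=0.5, size=s.shape)
    measurement = np.array([1.0, -1.0])
    likelihood = lambda s: np.exp(-0.5 * ((s - measurement) ** 2).sum(axis=1))
    samples, weights = particle_filter_step(samples, dynamics, likelihood, rng)

The call to rng.choice is equivalent to the cumulative-sum selection described in the SIR steps above.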
Figure 2.4: The Particle Filter Process (Merwe et al. [122])
Particles with a higher weight are more likely to be sampled, resulting in higher
weighted samples being given greater emphasis in the next time step. Figure 2.5
illustrates this re-sampling process. Various implementations (Doucet [44], Pitt
and Shephard [139]) have shown that this algorithm can be implemented with
O(N) complexity. Other re-sampling schemes, such as residual re-sampling and
minimum variance sampling, have also been proposed (Higuchi [71], Liu and Chen
[110], Kitagawa [98]).
A key issue when using particle filters is the number of particles needed. Using
fewer particles results in a computationally faster system, so much effort has been
expended on developing methods to lower the number of particles required.
Other research has focused on how to use particle filters in systems where multiple
objects are being tracked. While the output of the particle filter can be multi-
modal, the system will tend toward a single mode in the long term. This means
that for a filter that is tracking multiple people, tracks can be lost as the number
Figure 2.5: Sequential Importance Re-sampling.
of particles representing a track decreases due to the re-sampling procedure.
The condensation algorithm (Isard and Blake [77]) is a specific implementation
of particle filtering, proposed to track curves in images.
Like particle filtering, it is an iterative process where the sample set for time t
is generated by re-sampling from the sample set for time t − 1. It differs from
the particle filter described above in that the re-sampling step is done first. Thus
the output at the end of the time step is a weighted particle set rather than an
un-weighted particle set (see Figure 2.6).
Tracking systems such as those proposed by Kang et al. [90] and Zeng and Ma
[181] use particle filtering techniques to track people. Kang et al. [90] modified
the condensation algorithm to track multiple people in a crowded environment.
A discrete human model was created to allow people to be represented as a
single discrete valued parameter; and a competition rule was introduced, such
that each tracker suppresses the weights of samples around features tracked by
another tracker, helping to avoid multi-modal distributions that can occur when
multiple objects are close to one another. Zeng and Ma [181] used active particle
filtering to track heads. Active particle filtering involves combining traditional
Figure 2.6: The Condensation Process (Isard and Blake [77])
particle filtering with curve fitting. Each particle is fitted to the closest local
maxima of the approximated PDF prior to weighting. The modifications allow
the system to use fewer particles to track objects.
While particle filters are able to represent multi-modal distributions, over time
the most dominant mode within the distribution comes to dominate the filter.
When tracking multiple objects, this means that objects may be lost over time,
as modes with a stronger response draw particles
away from the weaker modes during re-sampling. One approach that has been
used to overcome this is to use a distribution that allows the presence of multiple
modes. Isard and MacCormick [78] developed BraMBLe, a Bayesian multiple-
blob tracker. A particle filtering implementation (Isard and Blake [77]) was mod-
ified by formulating a multi-blob likelihood function to express the likelihood of
a particular configuration of objects resulting in the observed image. This en-
abled the system to function with an unknown, time varying number of objects,
allowing the tracking of multiple objects. The proposed system was also able to
detect new modes as they entered, and remove modes as they left.
An alternative approach to multi-target tracking was proposed by Vermaak et al.
[163] who proposed the Mixture Particle Filter (MPF). The mixture particle
filter addresses the problem caused by a multi-modal posterior distribution (due
to ambiguities or multiple targets) resulting in poor performance. Each mode
(target) is effectively modelled by its own set of particles, which forms part of the
overall mixture. Given this, the overall distribution becomes
p(x_t|x_{t-1}) = Σ_{m=1}^{M} π_{m,t} p_m(x_t|x_{t-1}), (2.25)
where M is the number of mixtures currently in the system, m is the index of the
current mixture, π_{m,t} is the weight for mixture m at time t, and p_m(x_t|x_{t-1}) is the
component distribution for mixture m, approximated by a set of particles used
only by this mixture. Each mixture has its own set of particles, s_t^i, and weights
w_t^i, where i ∈ I_m. The weights of the particles for each mixture are calculated
such that they sum to one,
Σ_{i∈I_m} w_t^i = 1, (2.26)
and the mixture weights also sum to one,
Σ_{m=1}^{M} π_{m,t} = 1. (2.27)
When re-sampling, mixture components are re-sampled individually, ensuring
that modes are not lost during the procedure. Initial weights for the particles are
set according to the set size,
w_t^i = 1 / card(I_m), (2.28)
where card(Im) is the cardinality (size) of the particle set Im.
Throughout the mixture particle filter process, the individual filters only inter-
act through the computation of the weights. This multi-modal filtering approach
overcomes problems associated with previous multi-target trackers where the sam-
ples for a given target could be deleted and the target lost. However, the
system still maintains just a single particle filter for the whole system, rather
than one for each tracked object.
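A sketch of the per-mixture re-sampling of Equations 2.26 to 2.28 follows; it assumes the weights passed in are the unnormalised importance weights computed for the current frame, and the mixture weight update shown is only one plausible reading of the scheme:

    import numpy as np

    def resample_mixture(samples, weights, component_ids, mixture_weights, rng):
        # samples: (N, d) particles; component_ids: (N,) index of the mixture each
        # particle belongs to; weights: unnormalised importance weights this frame.
        new_samples = samples.copy()
        new_weights = weights.copy()
        evidence = np.zeros(len(mixture_weights))
        for m in range(len(mixture_weights)):
            idx = np.flatnonzero(component_ids == m)
            if len(idx) == 0:
                continue
            w = weights[idx]
            evidence[m] = mixture_weights[m] * w.sum()
            p = w / max(w.sum(), 1e-12)
            chosen = rng.choice(idx, size=len(idx), p=p)   # re-sample within m only
            new_samples[idx] = samples[chosen]
            new_weights[idx] = 1.0 / len(idx)              # Equation 2.28
        mixture_weights = evidence / max(evidence.sum(), 1e-12)   # sums to one (2.27)
        return new_samples, new_weights, mixture_weights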
The MPF uses k-means clustering to determine when modes have split or merged.
Splitting of a mode indicates that a new mode has appeared (i.e. a new tracked
object has entered the scene), and as such a new mixture is added to the MPF.
When two modes merge, it indicates that a mode has left (i.e. a tracked object
has left the scene) and a mixture is removed from the model. This allows the MPF
to model a scene with a dynamic number of objects of interest. This approach is
limited however by requiring each target object to match the same feature (i.e.
only a single histogram is used to locate all objects), and this feature is also
constant. As a result, the MPF is limited to situations where all targets have an
identical, unchanging appearance.
Okuma et al. [135] proposed the Boosted Particle Filter (BPF). This work ex-
tended that of Vermaak et al. [163] and used a cascaded AdaBoost (Viola and
Jones [165]) algorithm to detect the target objects to guide the particle filter,
rather than user initialisation followed by k-means clustering. A colour observa-
tion model is used (using HSV space to separate colour from intensity) to measure
likelihoods of the observations. To initialise the system, and allow new modes to
be detected, the AdaBoost results are incorporated into the proposal distribution,
so that when the AdaBoost detection performs well, the BPF distribution can
incorporate this information,
q*_B = α q_ada(x_t|x_{t-1}, y_t) + (1 - α) p(x_t|x_{t-1}). (2.29)
The term q_ada is a Gaussian distribution dependent on the current observation
from the AdaBoost detection, y_t. By increasing the value of α, more weight is
applied to the AdaBoost detection; however, when the AdaBoost detector cannot
detect the target (due to clutter or lighting changes), α can be set to 0 so that the
system operates as an MPF.
Figure 2.7: Incorporating Adaboost Detections into the BPF Distribution
Figure 2.7 illustrates this process. In this instance, the AdaBoost detection pro-
cess detects a mode that is not tracked by the BPF. This information is incorpo-
rated into the final distribution. When re-sampling occurs, there is now a greater
probability that samples that represent the newly detected mode will be propa-
gated, and the new mode can be tracked. Like the MPF however, the BPF uses
a single unchanging feature, and therefore is also limited to situations where all
targets have a very similar unchanging appearance.
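A loose sketch of sampling from the BPF proposal of Equation 2.29 is shown below; associating particles with detections by random assignment, and the noise scales, are simplifications made for illustration:

    import numpy as np

    def boosted_proposal(particles, detections, alpha, dynamics_std, det_std, rng):
        # particles: (N, d) current states; detections: (k, d) AdaBoost detections.
        N = len(particles)
        # Default behaviour: propagate every particle through the dynamics model.
        proposed = particles + rng.normal(scale=dynamics_std, size=particles.shape)
        if len(detections) > 0 and alpha > 0:
            # With probability alpha, redraw a particle from a Gaussian centred
            # on one of the detections.
            use_det = rng.random(N) < alpha
            det_idx = rng.integers(0, len(detections), size=N)
            boosted = detections[det_idx] + rng.normal(scale=det_std,
                                                       size=particles.shape)
            proposed[use_det] = boosted[use_det]
        return proposed

Setting alpha to 0 recovers purely dynamics-driven proposals, mirroring the fall-back to MPF behaviour described above.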
To overcome the limitation of a single feature for the whole filter, Wang et al.
[168] proposed a mixture filtering approach [163] where a separate histogram
is used for each target. This enables multiple people to be tracked in a scene
whilst maintaining identity through occlusions. However the proposed approach
is unable to discover new modes, and relies on manual initialisation.
Ryu and Huber [148] proposed tracking multiple objects using an individual
particle filter for each target, with a joint observation model indicating the likelihood
of occlusions, instead of the mixture approach of [135, 163]. The observation model reprojects
particle weights into image space to determine the most likely object at a given
pixel, and determine when occlusions are present. An observation model for
hidden targets, which uses the expected value of the measurement model, is
used to ensure that occluded objects are not lost whilst they are not visible. A
background filter that is able to detect new modes entering the scene is used to
instantiate new tracks, while tracks are removed by monitoring the total particle
weight associated with each track and the expected value in the observation
model. Such a system allows new modes to be added dynamically (as in [135,
163]), but still allows different features to be used for different targets (as in
[168]).
Pérez et al. [141] used particle filters to track faces by adapting the colour his-
togram based tracking approaches such as [14, 23, 29] to a particle filter frame-
work. The HSV colour space was used to separate intensity from the colour.
Multi-part colour models were used to improve the performance, by encoding
some spatial information within the colour model. Pérez et al. [141] also made
use of the known background to improve performance. Rather than just comparing
the histogram of the region in the current frame with that of the reference, it is
also compared with the histogram of the background image at the same region.
This aims to prevent the tracking from shifting from the target to a background
region.
Particle filters have also found uses in multi-modal systems. Checka et al. [22]
utilised particle filters to track people and monitor their speaker activity. Loy
et al. [112] made use of particle filters to track objects using a variety of hy-
potheses, depending on conditions. Using a particle filter allowed the multiple
hypotheses to be maintained, and allowed tracks to be updated at different fre-
quencies. Like Loy et al. [112], Breit and Rigoll [15] use particle filters to facilitate
multi-modal object tracking. Using the condensation algorithm, Breit and Rigoll
[15] combine a pseudo 2-dimensional hidden Markov model, a skin detector and
a motion detector to track people. The use of a particle filter facilitates simple in-
tegration of the three modes, and allows them to overcome the weaknesses posed
by each mode individually. The modes are combined during the calculation of
the sample weights, w_t^i,
w_t^i = ∏_j p(z_j|x_t^i)^{w_j}. (2.30)
A weighted product of the probabilities is used to determine the final weights,
where the weight of each mode, w_j, is manually selected according to the relia-
bility of that mode.
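The weighted product of Equation 2.30 amounts to the following per-particle computation (the cue names in the comment are illustrative):

    import numpy as np

    def fused_weight(mode_likelihoods, mode_weights):
        # mode_likelihoods: p(z_j | x_t^i) for each cue j (e.g. P2DHMM, skin, motion).
        # mode_weights: manually chosen reliabilities w_j.
        l = np.asarray(mode_likelihoods, dtype=float)
        w = np.asarray(mode_weights, dtype=float)
        return float(np.prod(l ** w))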
As particle filters require a model to guide the particle movement from frame
to frame (in addition to a random offset), approaches have been proposed that
use Kalman filters [51, 122] to model the particle movement. Merwe et al. [122]
proposed a system that incorporated the unscented Kalman filter (UKF) by using
the UKF as the proposal distribution for the filter. Similar proposals have been
made using the EKF instead of the UKF (Freitas et al. [51]).
Yang et al. [177] proposed a Hierarchical Particle Filter that characterised objects
using their colour and edge orientation histogram features. This system was able
to evaluate likelihoods in a coarse to fine manner (similar to that used by Viola
and Jones [165] in their object detector) to allow the system to focus on more
promising regions and quickly discard those with zero probability.
Other applications have been to use particle filters to track 3D human models,
either in a single camera (Green and Guan [58]) or in multiple cameras (Sigal
et al. [153]), or to guide robots (Kwolek [103], Schulz et al. [149]). Green and
Guan [58] and Sigal et al. [153] use particle filters to track the human body in
3D. Green and Guan use a single camera, tracking each joint angle with a particle
filter, while Sigal et al. use a particle filter adapted to work on general graphs to
track the limbs of a person from multiple cameras.
2.4.4 Summary
Being able to estimate the position of an object in a scene, either to aid in de-
tection in subsequent frames or to track through occlusions, is important within a
tracking system. Motion models and Kalman filters (Kalman [86], Welch and
Bishop [172]) take a direct approach, allowing an approximate future po-
sition to be estimated given the object's previous state. Particle filters (Isard and
Blake [77], Maskell and Gordon [120]) do not explicitly provide a set of coordi-
nates where the object should be; rather, they produce a probability density that
describes the likelihood of the object being at any given location.
The design of particle filters is such that particles that have a higher likelihood are
more likely to be re-sampled, at the cost of particles that have a low likelihood.
This process helps the filter avoid a situation where poorly fitting particles come
to constitute the majority of the particle set, but also means that when tracking
a multi-modal distribution, over time the system is likely to lose all but the
most dominant mode. To ensure that multiple modes can be tracked, mixture
filters (Okuma et al. [135], Vermaak et al. [163]) have been proposed, which treat
the overall distribution as the sum of several sub-distributions, each of which
represents just a single mode. The sub-distributions are re-sampled separately,
to ensure that modes are not lost in the re-sampling process. Alternatively,
systems can simply use multiple particle filters, such that each target object is
tracked by its own filter (Ryu and Huber [148]).
2.5 Multi Camera Tracking Systems
The use of multiple cameras in a tracking system allows additional information
to be extracted from a scene, by either observing the same area from two or
more different positions, or by observing more of a scene than a single
camera could. Most multi-camera systems are simply extensions of single camera
systems, facing the same problems with tracking and maintaining identity, only
having to do this across multiple views. Some problems, such as handling and
resolving occlusions, can be dealt with more effectively as an occluded object may
be visible in multiple views, and is unlikely to be occluded in all. Other problems,
such as maintaining identity for tracked objects, can become more complex as
tracked objects must be matched across multiple views.
There are two commonly used architectures for multi-camera systems, illustrated
in Figures 2.8 and 2.9.
Figure 2.8: Multi-camera system architecture 1.
The first approach (see Figure 2.8) applies single camera tracking techniques to
each camera feed, tracking the objects that are visible in that feed. The lists of
tracked objects for each view can then be combined by using camera calibration
information (and potentially other information such as colour and/or appearance
models) to form a list of global tracked objects. In such a system, information
Figure 2.9: Multi-camera system architecture 2.
from the global list may then be passed back to the single camera trackers to
improve performance. Systems proposed in Cheng et al. [24], Kazuyuki et al.
[92], Piva et al. [140], Wei and Piater [170] are examples of this architecture.
The second approach (see Figure 2.9) merges the different views prior to tracking.
After a detection stage, the detected objects are transferred to a global coordinate
scheme, and the objects (now represented in a 3D coordinate scheme) are tracked
by an object tracker. Such systems may merge the views after motion detection,
mapping the detected motion to a common ground plane and then performing
object detection. Systems proposed in Auvinet et al. [5], Krahnstoever et al. [101]
are examples of this architecture.
Systems that use the second architecture require accurate camera calibration as
they need to be able to register detected motion or objects to a common ground
plane or coordinate system very accurately. Systems that use the first architecture
may also utilise camera calibration, but can also make use of additional cues
(such as the direction of motion, colour/appearance) when matching objects across
views. These systems can also be implemented in situations where very little of
the camera calibration information is actually known (such as the field of view
extents and overlaps only), or where the calibration is learnt as the system runs.
The first type of system is also better suited to networks that contain areas not covered by any camera (i.e. disjoint views), which require hand off to be performed blind, as such systems are better equipped to recover identity after a period of occlusion.
2.5.1 System Designs
Multi-camera networks require a large amount of data to be processed, and for
communication to occur between the various software modules responsible for
data collection and tracking. As camera networks become larger, more process-
ing power, and more advanced communication protocols are required. Camera
networks can be set up to communicate in two ways,
1. Each camera/tracker communicates directly with other trackers to deter-
mine positions of objects, hand overs and overlaps;
2. Each camera/tracker communicates with a central server, which combines
all data and provides information back to the trackers as required.
Marchesotti et al. [118] proposed a system using software agents (independently acting program modules which can reason with other agents and make decisions) to track people across a multiple camera system. Camera agents, re-
sponsible for acquiring images and performing image processing tasks, pass in-
formation to a top agent, which is responsible for organising the data from the
cameras and passing it to the appropriate simple agents. The simple agents are
responsible for the tracking and data fusion for multiple cameras. The simple
agents negotiate with each other to determine if they are tracking the same ob-
ject. For every new agent that is added to the scene, a multicast message is
sent containing the position and histogram of the object. All other trackers then
evaluate matching metrics for this data to determine if they are seeing the same
object. The top agent is able to spawn new simple agents as required.
Focken and Stiefelhagen [50] proposed a system designed in the style of a dis-
tributed sensor network. Each machine in the network is synchronised and all
images are timestamped to ensure synchronisation. Each camera in the system
contains a background subtraction module, and produces a constant stream of
features that are sent to a tracking agent. The tracking agent is responsible for
collating this data and determining 3D tracks.
Krumm et al. [102] proposed a multi-camera system consisting of two stereo
cameras in their ‘EasyLiving’ project. Each stereo head uses a dedicated PC to
process the incoming images, and locate people in the scene. The results of this
process are passed to a third PC which performs the person tracking. Position
and histogram information is then passed back from the tracking PC to the stereo
head controllers.
Atsushi et al. [4] proposed a system that used a series of cameras that com-
municated directly with one another rather than through a central server. An
environment map is generated at the start of tracking by each camera, using the
calibration information for the camera network. This map shows where the cam-
era is in relation to the other cameras in the network, allowing it to determine
when another camera will be able to see an object that it is tracking, or which
camera a tracked object that has left the network's field of view is likely to enter.
The cameras communicate using three messages:
• Acquisition - a position is sent and the camera is to attempt to acquire a
track for a person at that location;
• Acquisition on border - a field of view edge is sent and the camera is to
attempt to acquire a track for a person at that location;
• Stop acquisition - the camera is to stop attempting to acquire a track.
Acquisition messages are sent when a track is in an overlapping field of view area,
acquisition on border messages are sent when a track leaves the field of view
of the network and is expected to arrive at the edge of another camera's FOV, and stop
acquisition messages are sent when a station detects a person who has entered
from an unwatched area. For all messages sent, an acknowledgment message is
sent back. This allows the system to dynamically determine if a sensor in the
network is down and adapt accordingly.
Micheloni et al. [123] proposed a system composed of static camera systems (SCS, consisting of fixed cameras with overlapping views) and active camera systems (ACS, consisting of a pan-tilt camera), together with a method for their communication and cooperation. The SCS is responsible for tracking all objects within its views,
and fusing data across different cameras to improve performance. The SCS is able
to request the ACS to track a specific target. Communication is performed over
a wireless network using a simple protocol that allows a network to,
1. Issue request commands such as requesting an ACS to track an object,
2. Transmit the 2-D position of all objects tracked by that network.
Messages can be sent either to an individual network, or to all networks.
2.5.2 Track Handover and Occlusion Handling
Multi-camera tracking systems add an additional level of complexity to tracking
systems, as tracked objects must be matched across camera views. Tracking
systems solve the problem of determining object correspondence in two ways:
1. Transfer coordinates of detected objects and/or extracted features to a
world coordinate scheme (such as an extracted ground plane) and perform
tracking within the 3D work space [5, 101].
2. Apply single camera tracking techniques to each view, and determine the
correspondences between the tracks from separate views [24, 92, 140, 170].
The first approach requires a high degree of accuracy in camera calibration, as we
need to be able to accurately transfer all detected objects to a world coordinate
scheme. However, as all tracking (and potentially object detection) is transferred
directly to a world coordinate scheme there is no need to deal with the problem
of object handover, or matching tracked objects in individual views.
The second approach is able to work on more loosely calibrated cameras, as
only knowledge of the field of view (FOV) extents is needed. However, more
sophisticated calibration schemes can allow position and velocity to be used as
features, and provide greater accuracy when determining correspondences. The
remainder of this section will discuss techniques used to match objects in different
camera views, focusing on the features and matching techniques used.
Features that are often used to determine correspondence include
• Position, either by translating to a global coordinate scheme [10, 19, 21, 38,
50, 118] to extract 3D coordinates, and optionally trajectories [46, 118] and
velocities [140]; or by using FOV boundaries [8, 93, 95] in the 2D images
and knowledge of the order of the cameras (i.e. which cameras overlap with
each other).
• Appearance, incorporating shape/aspect features [38, 140]; or colour, using
histograms [92, 126, 186] or dominant colours [24].
Colour and appearance based features rely on good quality imaging, and (depending on the type of model used) require the cameras to be located near one another and at a similar angle to the subject. Position and shape/aspect features can be easily
translated into a common 3D coordinate system and compared between multiple
cameras and may be more suitable when the subjects cannot be reliably compared
using colour or appearance. When position cannot be computed accurately, due
to poor calibration or the subjects being too far away, trajectory can be used.
Trajectory, however, relies on the objects in the individual cameras being reliably
tracked for a period of time so that there are valid trajectories for comparison.
Position has proved a popular feature as it can be quickly computed, and when
camera calibration is available, it is often desirable to use a 3D coordinate scheme
to switch between cameras and provide output to an end user. In situations where
there are few objects being tracked in the scene, it is also very reliable. A track
in a single view can have its position (and trajectory/velocity) transferred to a
common coordinate scheme where matching between multiple views can be done
purely by position. Ellis [46] tracked objects in a single view using a combination
of shape, position and colour information, but relied only on position to match
objects across different views. The resultant 3D position was then tracked by a
separate Kalman filter in 3D. Tracking in 3D also allows for easy comparison to
other views, and for views that are occluded to receive input from views that are
not, improving localisation of the objects in the occluded view. However, small
errors in the 2D segmentation (i.e. failing to cleanly segment a person's legs,
where they contact the ground plane) can result in large errors in the equivalent
3D coordinates, which can lead to incorrect matching between views.
Black et al. [10] proposes using a homography to map between overlapping views.
Transferred objects are matched based on the error obtained when transferring coor-
dinates between views. Each view’s coordinates are transferred into one another,
and the errors are squared and summed. If this error is below a threshold, a match
has been found. Objects are simultaneously tracked in 2D and 3D using Kalman
filters, recording position and velocity (in pixels for 2D, real world coordinates
for 3D). The output of the 2D Kalman filters is transferred to the image plane to
provide a measure of uncertainty for the measurements, and allow improvement in
the observation uncertainty for the 3D Kalman filter, which is used when matching objects and
updating the filter. The measurement uncertainty increases as the object moves
further from the viewpoint, where segmentation errors have a greater impact on
the translated coordinates.
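A minimal C++ sketch of the symmetric transfer error test described for Black et al. [10] is given below; the structure names and the threshold are illustrative assumptions rather than details taken from that work.

#include <array>
#include <cmath>

struct Point { double x, y; };

// Apply a 3x3 homography (row-major) to a 2D point.
Point applyHomography(const std::array<double, 9>& H, const Point& p)
{
    double X = H[0] * p.x + H[1] * p.y + H[2];
    double Y = H[3] * p.x + H[4] * p.y + H[5];
    double W = H[6] * p.x + H[7] * p.y + H[8];
    return { X / W, Y / W };
}

// Sum of squared errors when transferring each view's coordinates into the
// other; a pair of tracks is accepted as the same object when this error
// falls below a threshold, as in the matching scheme described above.
bool isMatch(const Point& pA, const Point& pB,
             const std::array<double, 9>& H_AtoB,
             const std::array<double, 9>& H_BtoA,
             double threshold)
{
    Point a2b = applyHomography(H_AtoB, pA);
    Point b2a = applyHomography(H_BtoA, pB);
    double err = std::pow(a2b.x - pB.x, 2) + std::pow(a2b.y - pB.y, 2)
               + std::pow(b2a.x - pA.x, 2) + std::pow(b2a.y - pA.y, 2);
    return err < threshold;
}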
Cupillard et al. [38] uses multiple cameras with overlapping fields of view to locate,
track, and identify the behaviour of groups. Motion detection is used to locate
moving regions in each camera, from which a set of 3D numerical parameters are
extracted and a semantic type (i.e. person, crowd) is assigned. The weighted sum
of the similarity between the 3D position, the semantic type and the fusion results
for related objects (as the system is aimed at the analysis of groups, associated
objects in each view are grouped together) is used to determine object matches
across the camera views.
Focken and Stiefelhagen [50] combine data from multiple cameras using position
information. Motion detection is performed on each camera feed, and the blobs
detected have their position translated into a 3D coordinate scheme. Blobs are
matched based on the 3D positions, with a threshold used to determine valid
matches, and provide a measure of confidence for each grouping. The regions
are tracked using a multiple hypothesis tracker, that maintains a list of possible
hypotheses for each track, based on the confidence in the location of the target
region (in 3D coordinates), and the distance moved from the previous position
(modelled as a Gaussian distribution, such that regions that are further from the
last position are weighted less). As tracking progresses and more information is
gathered, the less likely track trajectories can be discarded.
Micheloni et al. [123] proposed a system that combined networks of static cameras
(cameras with a fixed view) with networks of active cameras (cameras that can
change their view, i.e. Pan-Tilt-Zoom cameras). Observations from the different
views are fused using their real-world position and an appearance ratio, that pro-
vides a degree of confidence for the blob extracted from the camera view. Those
cameras that provide more reliable measurements are weighted higher when fusing
tracks to ensure that the use of multiple cameras does not degrade performance.
2D ground coordinates are translated into pan and tilt angles, allowing the active
camera system (ACS) to track the target object. The ACS aims to keep the
track in the centre of its view, and compares its position against the static camera
system (SCS) regularly to ensure that the correct target is being tracked. As the
ACS only uses pan and tilt operations, a simple image translation can be used
to register the camera view before and after moving, based on detected image
features. Feature points are also used to track the target object within the ACS.
Using field of view (FOV) boundaries [80, 93, 95] can reduce the impact of poor
object segmentation, as it is much simpler to accurately detect when a person is
entering or leaving a camera's FOV. When a person crosses a FOV boundary, the
system can check the appropriate FOV line in other cameras, with objects that lie
along that line being candidates for a match. This approach does leave the system
susceptible to errors caused by multiple objects being in the transition area at
the time of handover, and (depending on the layout of the camera network) only
allows a small window of time when correspondence can be determined. Bhuyan
et al. [8] also makes use of FOV lines to determine correspondence across multiple views, proposing a multi-camera system in which each camera uses a single camera tracker (using the MPEG-7 ART descriptor as a tracking feature and the unscented Kalman filter to predict object position).
Colour and appearance matching approaches are less sensitive to errors in seg-
mentation, but may struggle if there are differences in colour balance between
the views, or if the subject’s clothing is not a constant colour, or pattern. To
help negate this difference in colour, colour spaces that separate intensity and
colour information are often used (HSV, YUV). Other approaches such as that
proposed by Kazuyuki et al. [92], Morioka and Hashimoto [126] combine the local
colour histograms that are used to track in an individual camera to form a global
histogram, which is used to identify across multiple cameras.
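As an illustration (not the specific formulation used in the cited works), a global colour histogram built in this way could be compared across cameras with a simple similarity measure such as the Bhattacharyya coefficient, sketched below in C++.

#include <cmath>
#include <numeric>
#include <vector>

// Normalise a colour histogram so that its bins sum to one.
std::vector<double> normalise(const std::vector<double>& h)
{
    double total = std::accumulate(h.begin(), h.end(), 0.0);
    std::vector<double> out(h.size(), 0.0);
    if (total > 0.0)
        for (std::size_t i = 0; i < h.size(); ++i) out[i] = h[i] / total;
    return out;
}

// Bhattacharyya coefficient between two normalised histograms:
// 1.0 for identical distributions, 0.0 for non-overlapping ones.
double bhattacharyya(const std::vector<double>& a, const std::vector<double>& b)
{
    double coeff = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        coeff += std::sqrt(a[i] * b[i]);
    return coeff;
}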
Often, features are fused or used in a hierarchy to achieve greater accuracy.
Marchesotti et al. [118] uses position and a colour histogram, and requires match-
ing constraints for both to be satisfied for a match to be made. Cheng et al.
[24] uses geometry and colour information to determine correspondence between
cameras. Geometry is applied first and if there are ambiguities the major colour
components (likely to represent clothing colours, hair and skin colours) are se-
lected and compared to determine the correspondence. Piva et al. [140] combines
position, speed, a shape factor (aspect ratio) and a chromatic characteristic.
Each is independently considered and a similarity is calculated, after which the
features are merged to determine if there is a match.
Krumm et al. [102] uses position and colour to match people across multiple
cameras. Simple velocity calculations are used to predict a person's next location, and a person's past locations are stored within the system. The person tracker
(in the view being initialised) searches the area around the expected location of
the person for a person shaped blob. If multiple blobs are in the area, histogram
matching is applied to locate the correct blob. Atsushi et al. [4] also determines
correspondence using colour and position. Colour and position are compared and
if the differences are within a limit, they are grouped. For the duration of this
grouping, the track for each view maintains a confidence that it is observing the
same track as all other objects in the group. If this confidence drops below a
threshold, then the object breaks from its group, potentially being grouped with
other tracks.
Yang et al. [176] describe a system to track people through an 18 camera in-
door surveillance network. Track hand off is performed in two distinct ways for
situations where the cameras overlap, and those where they do not (blind hand
off). For situations where there is overlap, the matching ratio is the fraction of
time (of the overlap) that the two tracks are within a distance threshold of one
another in world coordinates. For blind hand off, the matching metric is the
similarity in colour between the models, as well as an additional constraint that
the tracks must appear within a set time limit (upper and lower bounds) based
on the distance between the camera views.
Other systems use probabilistic frameworks to group objects between views. Wei
and Piater [170] proposed the use of particle filters and belief propagation to track
an object across multiple views. The object is tracked in each view using a particle
filter, and the information from these separate views is passed using sequential
belief propagation (Hua and Wu [76]) (SBP) to obtain a global coordinate, and
allow views to share information. Wei et al. [171] also used particle filters to
track objects in multiple views. However rather than use a particle filter for each
view, a single particle filter is used for the entire system. When an occlusion is
present (or likely) camera collaboration is used to handle the occlusion problem,
and coordinates are transferred between the views to maintain tracking.
Chang and Gong [21] proposed a system using Bayesian networks to classify
people across a multiple camera system. Multiple modalities are used for tracking,
covering various recognition (height and colour) and geometric (epipolar lines,
homographies, and use of scene landmarks) techniques. Comparison functions are
defined for each modality and the match results are combined within a Bayesian
framework to determine the likelihood that tracks in separate views correspond.
Cai and Aggarwal [19] also used a Bayesian classifier to match people across
a multi camera system. Their system uses features based on intensity and geometry to formulate a GMM to parameterise the features. A
Bayesian classifier can then be used to find the most appropriate match. This
system tracks a person from a single camera while that subject can be observed
well from that camera. Once the subject becomes occluded or leaves the field of
view for that camera, the system finds the next best camera and resumes tracking.
A simple location based prediction method is used to find the next best camera,
with one of the key requirements being that the camera that is switched to is the one that will minimise the amount of future switching.
Kang et al. [87] registers the cameras in their system using a ground plane ho-
mography (this method relies on there being a common ground plane between
the cameras used in the system, though this is very common in man-made en-
vironments). An appearance based model (based on a polar representation that
allows for a rotation invariant model) and velocity based model (both in 2D and
3D, using Kalman filters) are used and combined using a joint probability data
association filter (JPDAF). The JPDAF is used to handle occlusions and cam-
era hand off. Using the JPDAF, the probability of a tracked object occupying a
given position is defined as the product of the appearance, 2D position and 3D
position probabilities. The optimum position of the track then becomes the position
that maximises these three probabilities.
Fleuret et al. [49] propose an approach to track people within a four camera
network, where the cameras are mounted at eye level. Such a configuration leads
to a large number of occlusions. Tracking is performed across a sliding window
of frames, rather than on a frame by frame basis, and the optimal track across
the whole window is taken. Tracks that are detected well in previous windows
are optimised first in future windows, so that troublesome tracks (which are more
likely to be unstable and ’jump’ to another tracked object) are optimised last, and
so cannot steal a track that has already been allocated. Track probabilities are
determined using an occupancy map and an appearance model. The occupancy map
is generated by transferring the results of background subtraction to the ground
plane, and formulating the probability that a given location of the ground plane
contains a person. This occupancy map is combined with the appearance model
to determine the most likely trajectory for a person over a window of frames.
Collins et al. [27] uses a group of pan-tilt-zoom cameras rather than static cam-
eras, to track a single person moving through a scene. Each camera tracks the
object using the mean shift algorithm [28, 30] and histogram matching (the mean
shift algorithm is robust to camera movement). Cameras are calibrated and share
the position of the object they are tracking with one another, to aid cameras that are tracking poorly. Everts et al. [47] proposed using multiple configured pan-tilt-zoom (PTZ) cameras to cooperatively track objects. The cameras are calibrated
by collecting a number of real world points and their corresponding camera po-
sitions (pan and tilt values), and using an optimizer to determine the camera
parameters that allow for the transform from world coordinates to camera posi-
tion. This calibration assumes that there is a common ground plane, does not
incorporate zoom, and is performed offline. It is important to note that it is unlikely that this method will scale effectively, and it is unclear what effect the number of position pairs used has on calibration accuracy and computation time. Like Collins et al.
[27], objects are tracked using the mean shift algorithm [28, 30], as this technique
is not adversely affected by camera movement. At the end of each frame (once
the target has been located), the cameras are shifted so that the target object is in the centre of the frame. To hand over the track between cameras, the colour
of the object, as well as the assumption that the object will be roughly centred
within the frame, are used. The proposed system is limited by the use of colour
as a discriminating factor. When an object with a non-discriminative colour is tracked, the system may completely fail to locate the object in the second camera
and fail catastrophically.
Tsutsui et al. [162] applies optical flow based person tracking in a multiple camera
environment. A tracking window for the subject being tracked is shifted according
to the mean flow of the frame for the next frame of the sequence. For a single
camera system, the person is modelled as a 2D plane; in a multi-camera system
the person can be modelled as a 3D cylinder, with motion vectors within the
volume being analysed. When an occlusion occurs, the tracking window and
velocity can be transferred to another view until the occlusion is resolved.
Mittal and Davis [124, 125] proposed a system to track people using colour and
position in multiple cameras through complex occlusions. People are modelled
using a cylinder, which is divided into several regions of equal height, each of which
has its own colour model. This models the vertical distribution of tracked person
colour (it is not possible to model the horizontal distribution effectively with 3D
reconstruction). A Bayesian classification scheme is used to segment the image
into regions that belong to a particular person and background, using the colour
models for the tracked person in combination with the position of the pixel being
segmented (relative to the expected position of the person). This segmentation
also takes into account distance from the camera and occlusions, so a person closer
to a camera will yield higher likelihoods (they are less likely to be occluded). The
results from this segmentation across the whole camera network are combined by
projecting the detected regions into the other views to determine the positions of
the people in 3D coordinates, after which the person models (appearance and a
Kalman filter for motion) are updated.
Systems such as those proposed by Auvinet et al. [5] and Krahnstoever et al.
[101] avoid the problem of merging objects across views and track hand off by
performing all tracking in a world coordinate domain. Auvinet et al. [5] proposed
a system which merged the results of motion detection into the ground plane of
a four camera network. This allows all motion in the scene to be viewed from an
overhead perspective, and helps to overcome occlusions. From this perspective,
blobs can be tracked, simple event recognition can be performed and immobile
objects can be detected. To merge the views, each motion image undergoes a
homographic transformation. The images are overlaid and regions that have
three or more silhouettes overlapping (i.e. three of the cameras views report
motion in the same portion of the ground plane) are accepted as blobs for tracking.
As the transform does not consider height off the ground plane, these overlapping
regions should represent the parts of objects that are touching the ground plane
(i.e. feet). Requiring three silhouettes to overlap reduces the likelihood that a
blob will be incorrectly created by objects above the ground plane (i.e. people's heads and upper bodies) being mapped to the same location. The system of Krahnstoever et al. [101] performs target detection in the individual camera views, before
transferring the tracking to a calibrated ground plane.
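The following C++ sketch illustrates the ground-plane fusion step in the spirit of Auvinet et al. [5], assuming the per-camera motion masks have already been warped onto a common ground-plane grid; the types and the function name are hypothetical.

#include <vector>

// A binary occupancy grid over the ground plane (row-major, true = motion).
using GroundGrid = std::vector<std::vector<bool>>;

// Combine per-camera motion masks that have already been warped onto a
// common ground-plane grid.  A cell is accepted as foreground only when at
// least 'minViews' cameras report motion there, which suppresses phantom
// blobs created by heads and upper bodies projecting to the same location.
GroundGrid fuseGroundPlane(const std::vector<GroundGrid>& warpedMasks, int minViews)
{
    if (warpedMasks.empty()) return {};
    std::size_t rows = warpedMasks[0].size();
    std::size_t cols = rows ? warpedMasks[0][0].size() : 0;

    GroundGrid fused(rows, std::vector<bool>(cols, false));
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c) {
            int votes = 0;
            for (const GroundGrid& mask : warpedMasks)
                if (mask[r][c]) ++votes;
            fused[r][c] = (votes >= minViews);
        }
    return fused;
}

With minViews set to three, this reproduces the requirement that three or more silhouettes overlap before a blob is created.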
Handling Disjoint Views
In many real life camera networks, it is likely that not all of the scene will be covered by cameras. It is therefore possible that at some point a tracked object will leave
the field of view of the network, only to reappear in another camera a few seconds
later. Just as it is important to be able to consistently label objects when the
camera views overlap, it is also important to be able to identify when a person
entering a camera's field of view is the same person that left another camera a
few seconds previously. Figure 2.10 shows an example of such a network, where
each room contains one or more cameras, but the hallway connecting the rooms
is unsupervised. Ideally, a system should be able to tell when a person entering
room one has come from room two and vice-versa.
Figure 2.10: Surveillance network containing disjoint cameras.
Leoputra et al. [107] and Lim et al. [109] proposed a system to track objects in a
non-overlapping camera environment using a particle filter and environment map.
The particle filter operates as normal when the target is visible in the camera
view; however, when the target is between camera views, the particle filter uses the
environment map (providing knowledge of possible pathways) to help determine
where the person will reappear, and propagate particles along the possible paths
that the person may have taken. The proposed system combines the particle
filter results with those of a histogram match for any new objects detected. The
particle filter results are weighted less the longer the person has been missing, reflecting the growing number of possibilities for where the person may have gone.
Kang et al. [88] extended the system proposed in [87], to track people in a multi-
camera environment consisting of two non-overlapping fixed cameras, and one
moving camera that can pan between the other two views. A spatio-temporal
joint probability data association filter (JPDAF) is proposed to aid in overcoming
the gaps in the system's field of view. The spatio-temporal JPDAF uses a buffer
of images to improve performance. By allowing multiple images to be taken into
account when formulating probability, errors due to occlusions can be overcome.
The two stationary cameras are registered against a mosaic from the moving
camera, using a common ground plane. When detecting objects in multiple views,
the feet position of detected objects is translated to other views to register common
objects. Once registered using position, the information from all views is fed into
the JPDAF.
Stauffer [156] proposed methods to determine the transition correspondence mod-
els (TCM) within a camera network (i.e. correspondence between exit and entry
locations in different cameras). The goal of a TCM is to estimate the likelihood
that a given observation at a given sink (exit point) was the result of the same ob-
ject that earlier produced an observation at a given source (entry point), where
the sink and source points may be in different camera views. An unsupervised
hypothesis method is proposed which is able to approximate the likelihood of
transitions from one camera to another. The system was evaluated on synthetic
data simulating a traffic scenario (containing a traffic light, and a fork in the
road) and was shown to effectively determine the camera transitions.
It is also important to be able to determine the location of sources and sinks
within the camera network. Knowledge of where objects are allowed to enter can
improve the initialisation of tracks, while knowing where objects are allowed to
exit can help prevent lost tracks. Stauffer [155] used a two state hidden state
model (similar to a HMM, except all sequences are length two, and the model is
not shared across time) to estimate the source and sink positions within a scene,
given a set of track sequences. The first state in the model corresponds to the
sources, and the second to the sinks. An iterative optimisation routine is applied
to find the optimal placement of the sources and sinks.
Javed et al. [81, 82] extended the system proposed in [80, 93, 95] to be able
to learn the topology of a network containing disjoint cameras. The proposed
system uses source and sink locations, the velocity of tracked objects, the time it
takes objects to move between cameras and the appearance of tracked objects to
determine the configuration of the camera network. A training sequence is used
to learn an initial estimation of the system, and the parameters are continuously
updated during system operation. This updating also allows new behaviours to be
learned and added to the system, while transitions that are obsolete are forgotten
(i.e. people may be more likely to travel a different path in the afternoon than in the
morning, thus some camera transitions are more likely at different times). Javed
et al. [82] also proposed modelling the difference in colour histograms between
the disjoint cameras. Once a correspondence model has been learned (and it is
possible to be sure of correspondences) the histograms of an object observed in
two cameras can be compared, and the difference modelled. As the cameras are
likely to be configured differently, use different lenses, or be different cameras altogether, it is important to be able to measure the colour difference between the
two views, to improve the accuracy of comparisons. The difference is modelled
using a Gaussian model.
2.5.3 Summary
The process of tracking objects within a multi-camera network can be approached
in two ways:
1. Apply single camera tracking techniques to each camera and use camera
calibration and/or feature matching to determine object correspondence
in the different views (Cheng et al. [24], Kazuyuki et al. [92], Piva et al.
[140], Wei and Piater [170]).
2. Transfer results of a detection process from all cameras to a common co-
ordinate scheme and track them in the same manner as in a single camera
situation (Auvinet et al. [5], Krahnstoever et al. [101]).
The first method can be implemented in either a distributed manner (each camera communicates with the other cameras and data is shared between them) or a centralised manner (each camera communicates with a central server, which combines data and issues commands to the views), whilst the second requires a centralised implementation. The second approach does, however, avoid the need to match objects across the different views. In sufficiently large networks, systems could be implemented that use such designs.
Being able to consistently label an object across the camera views is important
within a multi-camera network. A method to match objects between views is
required, and it needs to be able to handle the difference in pose between the
views. Given this requirement, position (Cai and Aggarwal [19], Chang and
Gong [21], Cupillard et al. [38], Marchesotti et al. [118], Piva et al. [140]) and
simple colour models (Cheng et al. [24], Kazuyuki et al. [92], ZhiHua and Komiya
[186]) are the most suitable and popular methods. Complex appearance models
are not suitable, as these are typically pose specific. In situations where there are
disjoint views, knowledge of the camera network and velocity of the target object
can be combined with colour models to match objects (Javed et al. [81], Kang
et al. [88], Lim et al. [109]).
Chapter 3
Tracking System Framework
3.1 Introduction
The following chapter describes the tracking framework that has been developed
as part of this thesis. The framework has been developed in C++ using VXL (the vision-something-libraries)¹ as a base to provide basic image processing
functionality such as:
• Image structures for image storage and manipulation (including loading and
saving).
• Basic image processing such as morphology, edge detection, resizing.
• Vector and matrix structures, and associated maths functions.
The proposed framework allows for the development and testing of multi-camera
tracking systems. The framework makes extensive use of abstract base classes
¹ VXL can be downloaded from http://vxl.sourceforge.net/
and polymorphism, allowing new types of trackers, detectors, or storage classes
to be implemented quickly. Configuration is performed using one or more XML
configuration files which are loaded at startup. All system parameters, such as the
number and type of trackers, detectors and objects to be tracked are contained
within this file. Parameters cannot be changed once loaded.
3.2 System Design
The framework uses abstract base classes and polymorphism to allow for an
extensible system. There are four main classes which form the tracking system,
these are:
• ObjectView - a tracked object, as seen in a single camera view. Contains
only 2D information about the object's position (in pixel coordinates), and
has no knowledge of other views. Each ObjectView has a track ID and a
view ID associated with it to identify it within the system.
• TrackedObject - a collection of up to N ObjectView's (where N is the number of inputs to the system). This class also contains 3D information about the object's position, if camera calibration information is available. It is created with an array of N ObjectView pointers, which initially all point to NULL. As the object moves through the scene, these pointers are changed to point to the valid ObjectView structures. Each TrackedObject has a track ID associated with it. This ID is shared by the ObjectView's associated with the TrackedObject.
• ObjectTracker - a tracking class, responsible for tracking objects within a single camera view. The ObjectTracker only sees the ObjectView's that are within its camera view. It has no knowledge of the TrackedObject's, or of other camera views. The class contains two lists of ObjectView's: a shared list that contains the objects known to the system at the start of the processing of the current frame, and an internal list that stores objects that are detected in the current frame.
• Manager - a collection of N ObjectTracker's. The Manager sees the complete list of TrackedObject's and is responsible for maintenance tasks such as the creation, deletion and transfer of objects between views. A minimal sketch of these four classes is given below.
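The following C++ sketch illustrates one possible layout of the four classes. It is illustrative only: the member and method names shown (beyond the class names and the lists described above) are assumptions, not the actual interfaces of the thesis framework.

#include <cstddef>
#include <memory>
#include <vector>

// A tracked object as seen in a single camera view (2D pixel coordinates only).
class ObjectView {
public:
    int trackId = -1;   // shared with the owning TrackedObject
    int viewId  = -1;   // identifies the camera view this observation belongs to
    // ... 2D bounding box, appearance model, state, counters, etc.
};

// A global object: up to N ObjectView pointers (one slot per input view),
// initially all NULL, plus 3D position when calibration is available.
class TrackedObject {
public:
    explicit TrackedObject(std::size_t numViews) : views(numViews, nullptr) {}
    int trackId = -1;
    std::vector<ObjectView*> views;
    // ... 3D position, if camera calibration information is available
};

// Tracks objects within a single camera view; knows nothing of other views.
class ObjectTracker {
public:
    std::vector<ObjectView*> sharedList;                     // objects known at the start of the frame
    std::vector<std::unique_ptr<ObjectView>> internalList;   // objects detected in the current frame
    virtual void processFrame(/* const Image& frame */) = 0; // abstract: concrete trackers inherit
    virtual ~ObjectTracker() = default;
};

// Owns the trackers and the global object list, and performs maintenance.
class Manager {
public:
    std::vector<std::unique_ptr<ObjectTracker>> trackers;
    std::vector<std::unique_ptr<TrackedObject>> objects;
    void performMaintenance() { /* promote new detections, delete Dead views (see text) */ }
};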
These objects relate to one another as shown in Figure 3.1. The black arrows
indicate ownership (i.e. responsibility for creation/deletion/storage) within the
system (i.e. the ObjectTracker shares ownership of the ObjectView's with the TrackedObject's, and is owned by the Manager). The red arrows indicate the
inputs to each object (i.e. the ObjectView receives data from the ObjectTracker
and TrackedObject, and sends data to the ObjectTracker and TrackedObject).
Input images are passed directly to the appropriate ObjectTracker for processing.
Figure 3.1: System Design
The creation and deletion of new ObjectView's and TrackedObject's requires com-
munication between the ObjectTracker and Manager classes. When a new object
enters the system, it is first detected by the ObjectTracker responsible for the
view in which the object appears. This tracker creates a new ObjectView and
adds it to the internal list of the ObjectTracker. When the Manager performs
maintenance, it checks the contents of this internal list and observes that a new
object has been placed there. A new TrackedObject is created and this ObjectView
is associated with it. The ObjectView is then removed from the internal list and
placed in the common list. Object deletion is also performed by the Manager.
An ObjectTracker may change the state of an ObjectView to be Dead (see Sec-
tion 3.2.1), marking it for deletion. When the Manager performs maintenance
(typically at the end of each frame), any ObjectView's that are in the Dead state
are deleted. If the ObjectView deleted is the only ObjectView associated with its
controlling TrackedObject, the TrackedObject is also deleted.
Each of these base classes is used as a platform for building more advanced
tracking systems, through inheritance.
3.2.1 Tracking Algorithm Overview
The tracking algorithm used in this work is a top-down system (see Figure 3.2).
Motion detection is used to perform initial segmentation, and the resultant motion
mask is used by one or more object detectors (possibly in combination with the
input image) to detect the target objects. The resulting list of candidate objects,
DObj(t) is compared to the list of tracked objects, TObj(t). Candidate objects
are compared to tracked objects to determine the quality of matches using a fit
function, F , which returns a value in the range of 0 to 1. A fit of 1 indicates a
perfect match, and a fit of 0 indicates no match. The candidate and track pair
which yield the highest fit score are matched, followed by the next highest, until
all candidate-track pairs that have a valid match (determined by a threshold on
the fit scores, the threshold is typically set to 0.5) are paired. Any remaining
candidates are added as new objects, and any unmatched tracked objects are
updated via prediction.
Figure 3.2: Tracking Algorithm Flowchart
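A minimal C++ sketch of the greedy association step described above is given below; the structure and function names are illustrative, and the fit function F itself is left abstract.

#include <functional>
#include <utility>
#include <vector>

struct Candidate { /* detected object: bounding box, type, ... */ };
struct Track     { /* tracked object: bounding box, state, motion model, ... */ };

// Greedy association: repeatedly take the highest remaining fit score above
// the threshold (typically 0.5), pair that candidate and track, and remove
// both from further consideration.  Unmatched candidates become new tracks;
// unmatched tracks are updated by prediction (handled by the caller).
std::vector<std::pair<int, int>> associate(
    const std::vector<Candidate>& candidates,
    const std::vector<Track>& tracks,
    const std::function<double(const Candidate&, const Track&)>& fit,
    double threshold = 0.5)
{
    std::vector<bool> candUsed(candidates.size(), false);
    std::vector<bool> trackUsed(tracks.size(), false);
    std::vector<std::pair<int, int>> matches;  // (candidate index, track index)

    while (true) {
        double best = threshold;
        int bestC = -1, bestT = -1;
        for (std::size_t c = 0; c < candidates.size(); ++c) {
            if (candUsed[c]) continue;
            for (std::size_t t = 0; t < tracks.size(); ++t) {
                if (trackUsed[t]) continue;
                double f = fit(candidates[c], tracks[t]);
                if (f > best) { best = f; bestC = (int)c; bestT = (int)t; }
            }
        }
        if (bestC < 0) break;   // no remaining pair above the threshold
        candUsed[bestC] = true;
        trackUsed[bestT] = true;
        matches.emplace_back(bestC, bestT);
    }
    return matches;
}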
All tracks have a state associated with them and two counters, which define how the system handles the track. Counters are kept for the number of successive frames the object is correctly detected (cdetected), and the number of successive frames that the system fails to detect the object (coccluded). There are five possible states within the system:
1. Preliminary - Entered into when a track is first created. Tracks in this state
must be continually detected.
2. Transferred - Tracks that are moved from another camera view are created
in the transferred state. This is similar to the Preliminary state, but allows
for more leeway when detecting and matching the object.
3. Active - The track has been observed for several frames. Tracks spend most
of their time in this state. It indicates that the track has been located in
the last frame and its position is known.
4. Occluded - Indicates that the track has not been located in the last frame,
either due to occlusion or system error.
5. Dead - The track is to be removed from the system. Tracks in this state
are deleted when the current frame’s processing ends.
The state transitions are shown in Figure 3.3, and the transition conditions are
outlined in Table 3.1.
Figure 3.3: State Diagram for a Tracked Object
a. Prelim → Active: The tracked object is detected and matched for τactive successive frames (cactive ≥ τactive).
b. Transferred → Active: The tracked object is detected and matched once.
c. Prelim → Dead: The tracked object is not detected and matched for a single frame.
d. Transferred → Dead: The tracked object is not detected and matched for a single frame.
e. Active → Occluded: The tracked object is not detected and matched for a single frame.
f. Occluded → Active: The tracked object is detected and matched for a single frame.
g. Active → Dead: The tracked object is explicitly deleted by the system.
h. Occluded → Dead: The tracked object is not detected and matched for τoccluded consecutive frames (coccluded ≥ τoccluded).
Table 3.1: Transition Conditions
In the proposed system, τactive is set to 3, and τoccluded is set to 10. These param-
eters are used throughout testing unless explicitly specified elsewhere.
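The state and counter handling summarised in Figure 3.3 and Table 3.1 can be sketched in C++ as follows; the update order and naming are illustrative rather than the framework's actual implementation, and the explicit Active → Dead deletion (transition g) is assumed to be triggered elsewhere by the system.

// Track states and counter-driven transitions (see Table 3.1).
enum class TrackState { Preliminary, Transferred, Active, Occluded, Dead };

struct TrackStatus {
    TrackState state = TrackState::Preliminary;
    int cDetected    = 0;   // successive frames detected and matched
    int cOccluded    = 0;   // successive frames the object was missed
};

// Per-frame update, using tau_active = 3 and tau_occluded = 10 as in the text.
void updateState(TrackStatus& s, bool detectedThisFrame,
                 int tauActive = 3, int tauOccluded = 10)
{
    if (detectedThisFrame) { ++s.cDetected; s.cOccluded = 0; }
    else                   { ++s.cOccluded; s.cDetected = 0; }

    switch (s.state) {
    case TrackState::Preliminary:             // must be detected every frame
        if (!detectedThisFrame)                s.state = TrackState::Dead;
        else if (s.cDetected >= tauActive)     s.state = TrackState::Active;
        break;
    case TrackState::Transferred:              // one detection confirms the transfer
        s.state = detectedThisFrame ? TrackState::Active : TrackState::Dead;
        break;
    case TrackState::Active:
        if (!detectedThisFrame)                s.state = TrackState::Occluded;
        break;
    case TrackState::Occluded:
        if (detectedThisFrame)                 s.state = TrackState::Active;
        else if (s.cOccluded >= tauOccluded)   s.state = TrackState::Dead;
        break;
    case TrackState::Dead:
        break;                                 // removed at the end of the frame
    }
}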
Each tracked object stores several values that describe the object, major items
stored include:
• Position - 2D position in image coordinates, the position is stored as the
bounding box of the object.
• Histogram/Appearance Model - one or more colour/appearance models are
stored, to use for matching the object in ambiguous situations.
• Motion model - a motion model is stored, to allow a prediction of the
object’s position to be made in the event of an occlusion or ambiguities
when matching. This may be a constant velocity model, Kalman filter, or
particle filter.
• ID - an ID for the tracked object.
• Track type - type for the object, may be person, vehicle or unknown. This
determines what matching parameters are used, and what detection rou-
tines are employed.
• State - the state of the tracked object (see Figure 3.3) and its associated
counters, to determine how the object is handled and the action to be taken
in the event that the object cannot be detected.
Values such as the ID and Track type are set when the object is created, and are
unlikely to change for the life of the object. Other values such as the position,
histogram/appearance model, and state may change regularly.
The algorithm also allows zones to be defined within the scene. A zone allows
specific behaviour to be permitted or denied within a given part of the scene
(specified as a polygon). The following types of zone are defined by the system:
• Active - area where objects are allowed to exist.
• Entry - allow objects to enter within this region.
• Entry Priority - allow objects to enter within this region, with priority given
to one class of objects.
• Transfer - allow objects to be transferred to other camera views in this
region.
• Alarm - raise an alarm when an object is detected in this region.
• Inactive - region where tracks cannot appear, objects in this region will be
discarded.
By default (if no zones are specified), all locations are active and entry for all
classes of objects. Each zone can also be specified with an object type, so that
different entries, allowed locations and alarm zones can be defined for people and
vehicles. The entry priority zone is handled slightly differently, as it allows all
objects to enter at the specified region, but gives the specified object class priority
(i.e. in the specified region, cars are added before people). This can be used to
help control incorrect detections, by ensuring that the more likely object class at
a given position is processed first (i.e. on the footpath, person objects would be given priority, while on the road, vehicle objects would be processed first).
3.3 Object Detection
The system detects three different types of objects:
1. People - a region of motion with the major axis of the region vertically
aligned in the image, that contains a vertical peak in the motion image,
with a drop either side of the peak.
2. Vehicles - a rectangular region of motion (allowed range of aspects defined
by the system configuration) with a high ratio of motion pixels within the
region.
3. Blobs - a rectangular region of motion (allowed range of aspects the same
as, or more inclusive than the range for vehicles) with a ratio of motion
pixels within the region less than or equal to that required for vehicles and
people.
A region is defined as one or more 8-connected groups of pixels that are grouped
according to spatial constraints (proximity of region bounds and centroid to one
another).
These three object types are used as they encompass all objects that are observed
within the testing data. Additional object types can be added to the system if
required. The object detection routines (and thus the ability to track that type of
object) are enabled within the configuration file (i.e. a system can be configured
to track only people and ignore all vehicles). All object detection routines use a
binary image as a basis, such as a motion image.
3.3.1 Person Detection
Once a motion image has been obtained, it must be analysed to determine the
location of any people present. Motion images can contain significant errors,
either as motion being detected where there is none, or motion not being detected
where it should be. Any detection techniques should be robust to these errors.
As motion is being used for detection, there is no texture information available, only
information relating to size and silhouette. To extract people from a motion
image, the following process is used [66, 184]:
1. Locate areas of the image which contain a significant amount of motion (one
or more connected components that are closely located and satisfy minimum
size requirements, parameters such as required size of grouping distances
are defined in a configuration file and vary between applications/datasets)
and are likely to contain people.
2. Locate the heads of people within those regions using vertical histograms
and the top contour of the motion region.
3. Fit ellipses at the head locations to determine if there is sufficient motion
to constitute a person.
This process requires that people appear vertically in the image (i.e. parallel to
the left and right image bounds).
Motion images are analysed and broken into smaller segments containing patches
of motion to allow people who are vertically aligned (occupy a similar set of
columns at different heights in the image) to be detected (the use of vertical his-
tograms and the top contour means that only one head can exist in any given
column). These regions are processed separately, so if there is spatial separa-
tion between two vertically aligned people, their motion regions will be analysed
separately and each person can be detected. During this same process, small, un-
connected regions of motion can be removed, as their presence may lead to other
inaccuracies. These are likely to be errors, or motion caused by objects too small
to track (i.e. a piece of rubbish being blown across the ground by the wind). The
remaining regions can be grouped into spatial groups, and analysed individually.
Figure 3.4 shows the input and motion images, and the resultant head detection.
A single region of interest is located, and based on the height map of that region,
a single head is detected ((c) shows a white dot at the detected head on the height
map that corresponds to the motion image).
Figure 3.4: Head Detection. (a) Input Image, (b) Motion Mask, (c) Detected Heads.
A person’s head in a silhouette image typically has the following properties:
1. It is the highest point on the person’s silhouette.
2. The surrounding area is roughly curved and symmetrical.
The second condition may not hold if the person is wearing a hat, has an unusual
hairstyle, or if there are errors in the segmentation. As such, the first property
will be used as the basis for detection.
It is assumed that people will appear in the image vertically (i.e. their spine will
be parallel to the vertical edge of the image), and so the image is analysed on a
column by column basis to determine the pixel height of the region. To determine
the height two approaches can be used:
1. Vertical Projection - vproj(i) = Σ_{j=0}^{N−1} M(i, j), where vproj(i) is the vertical projection at column i, j is the row index and N is the number of rows (height) of the mask image, M.
2. Top Contour - vcontour(i) = N − min{ j : M(i, j) > 0 }, where vcontour(i) is the top contour. It is assumed that the mask image (M) is zero indexed and the top left corner is at the coordinate (0, 0).
The vertical projection counts the number of motion pixels in each column, so a
region such as the head, which should have motion all the way below to the feet,
should lie at a global maximum. However, if the motion image contains errors
such as missing regions (i.e. a large portion of the person's shirt is not detected
as motion), the vertical projection may not contain the head at a maxima. The
top contour is simply the topmost pixel in each column that is in motion; the accuracy of this, however, depends on the accuracy of the motion detection around
the edge of the person. Either one of these, or both in combination, can be used
to detect the head of a person. Using both in combination can help overcome the
individual weaknesses of each modality and improve detection results (see figure
3.5),
vHeightMap(i) = αvproj(i) + βvcontour(i), (3.1)
where vHeightMap(i) is the combined height map, α is the weight of the vertical
projection, and β is the weight of the top contour. A mean filter is applied to
the height map to reduce noise and remove small local maxima (see Figure 3.5
(e)). This height map can then be searched for maxima, which are the likely
location of heads. The global maximum will provide a good estimate of the head of
one person. If multiple people are present in the area being analysed, then local
maxima will represent one or more of their heads. Analysis of the maxima, such
as looking at their prominence and proximity to other maxima, can be used to
determine which of these are likely to represent the heads of the people in the
region.
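A minimal C++ sketch of the height map computation (Equation 3.1) is shown below; the weights α and β and the mean filter width are illustrative parameters, not values prescribed by the thesis.

#include <vector>

// Binary motion mask, M[row][col], stored row-major; true indicates motion.
using Mask = std::vector<std::vector<bool>>;

// Combined height map of Equation 3.1: a weighted sum of the vertical
// projection (motion pixels per column) and the top contour (height of the
// topmost motion pixel per column), followed by a simple mean filter.
std::vector<double> heightMap(const Mask& M, double alpha = 0.5,
                              double beta = 0.5, int halfWindow = 2)
{
    const int rows = (int)M.size();
    const int cols = rows ? (int)M[0].size() : 0;

    std::vector<double> h(cols, 0.0);
    for (int i = 0; i < cols; ++i) {
        int vproj = 0, vcontour = 0;
        for (int j = 0; j < rows; ++j) {
            if (M[j][i]) {
                ++vproj;                                  // vertical projection
                if (vcontour == 0) vcontour = rows - j;   // top contour
            }
        }
        h[i] = alpha * vproj + beta * vcontour;
    }

    // Mean filter to remove small local maxima before searching for heads.
    std::vector<double> smoothed(cols, 0.0);
    for (int i = 0; i < cols; ++i) {
        double sum = 0.0; int n = 0;
        for (int k = -halfWindow; k <= halfWindow; ++k)
            if (i + k >= 0 && i + k < cols) { sum += h[i + k]; ++n; }
        smoothed[i] = sum / n;
    }
    return smoothed;
}

The maxima of the returned height map are then the candidate head locations described above.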
Once the heads have been located, ellipses are fitted at the head points. Ellipses
are oriented such that the major axis is vertical, and the length of the major axis
Figure 3.5: Height Map Generation. (a) Input Image, (b) Vertical Projection, (c) Top Contour, (d) Height Map, (e) Mean Filtered Height Map.
of the ellipse is set to the height of the head as detected by the head detector. The
length of the minor axis is chosen by performing further analysis of the height map. The area surrounding the detected head in the height map is searched to find the position on either
side where the height drops to below a predefined ratio of the total height (i.e.
50%), or a minima in between two maxima. The maximum of the left and right
distance is used as the width of the minor axis, and the ellipse is cropped at the
smaller of the two (see Figure 3.6 (c)).
Figure 3.6: Ellipse Fitting. (a) Input Image, (b) Person Bounds, (c) Ellipse.
After the ellipse dimensions have been determined, a filled ellipse can be drawn
overlaying the detected person (see figure 3.6), and the amount of motion within
can be calculated such that,
Operson = ( Σ_{E(i,j)>0} M(i, j) ) / ( Σ E(i, j) ), (3.2)
where Operson is the percentage of the ellipse that contains motion, i and j are the
image coordinates, M is the motion image and E is the ellipse mask. If Operson
is above a threshold, τOccPer, then the candidate region is accepted as a valid person
candidate. τOccPer is set to 0.3 in the proposed system. The motion for that
person can now be removed from the motion image, to ensure that it is not used
to detect a second person later. In Figure 3.6, the occupancy for the ellipse is
86%, and so the candidate region is accepted.
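The occupancy test of Equation 3.2 can be sketched in C++ as follows, using an axis-aligned elliptical mask; the rasterisation approach and parameter names are assumptions made for illustration.

#include <vector>

using Mask = std::vector<std::vector<bool>>;   // binary motion mask, row-major

// Fraction of the ellipse (centred at (cx, cy), semi-axes a and b, axis-aligned
// with the major axis vertical) that is covered by motion pixels, as in
// Equation 3.2.  A candidate is accepted if this exceeds tau_OccPer (0.3 in
// the proposed system).
double ellipseOccupancy(const Mask& M, double cx, double cy, double a, double b)
{
    const int rows = (int)M.size();
    const int cols = rows ? (int)M[0].size() : 0;

    long insideEllipse = 0, insideMotion = 0;
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            double dx = (x - cx) / b;   // b: horizontal (minor) semi-axis
            double dy = (y - cy) / a;   // a: vertical (major) semi-axis
            if (dx * dx + dy * dy <= 1.0) {
                ++insideEllipse;
                if (M[y][x]) ++insideMotion;
            }
        }
    }
    return insideEllipse ? (double)insideMotion / insideEllipse : 0.0;
}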
3.3.2 Vehicle Detection
Vehicles are detected by locating large areas of motion, where there is a high
concentration of motion pixels in the region’s bounding box (i.e. most pixels
are in motion), as most vehicles are roughly rectangular in shape. The detection
process runs in two stages, the first simply groups large regions of motion together
to form a list of initial vehicle candidates. The second analyses this initial list
further, checking for overlapping objects to create a list of final vehicle candidates,
which is then used by the system to update existing tracks and create new tracks.
Initial candidate vehicles are formed by locating regions of motion and grouping
nearby regions in to a single candidate. The ratio of motion pixels to total
bounding box size,
Ovehicle = ( Σ_{x=L..R, y=T..B} M(x, y) ) / (W × H), (3.3)
where L, R, T, and B are the bounds of the candidate object, W and H are the
width and height of the candidate object, M is a binary motion mask and Ovehicle
is the ratio of motion pixels to bounding box size; as well as overall size,
Avehicle = Σ_{x=L..R, y=T..B} M(x, y), (3.4)
where Avehicle is the total motion area of the candidate object; is used to validate
detected candidates (i.e. if the region is too small, or contains too little motion
relative to its size, it is discarded). If either Ovehicle or Avehicle is less than its
corresponding threshold, τOccV eh or τAreaV eh, the candidate is discarded.
Figure 3.7 shows the results of this initial vehicle detection process (top line shows
the input frame, bottom line shows the motion mask with any detected vehicles
surrounded by a red box).
Figure 3.7: Vehicle Detection.
If an initial candidate is larger than expected, it may be due to the candidate
being the result of two vehicles overlapping. The candidate is analysed using
vertical and horizontal projection histograms,
vproj(i) = Σ_{j=0}^{N−1} M(i, j), (3.5)
hproj(j) = Σ_{i=0}^{O−1} M(i, j), (3.6)
where vproj(i) is the vertical projection at column i, j is the row index and N is
the number of rows (height) of the mask image, M, and hproj(j) is the horizontal projection at row j, i is the column index and O is the number of columns
(width) of the mask image; to determine if the candidate may be formed by two
vehicles overlapping (see Figure 3.8). If the initial candidate is deemed to be an
acceptable size, it is accepted as a candidate vehicle.
An abrupt change in either dimension is likely to indicate an overlapping area.
Overlapping vehicles can be separated by detecting these changes and segmenting
accordingly. The gradient of the vertical and horizontal projection histograms is
analysed to detect overlaps,
Ovhoriz = |hproj(j) − hproj(j − 1)| ≥ O × τ^grad_ov, (3.7)
Ovvert = |vproj(i) − vproj(i − 1)| ≥ N × τ^grad_ov, (3.8)
where Ovhoriz and Ovvert are the detected horizontal and vertical overlaps respectively and τ^grad_ov is a scaling value used to determine a gradient threshold based on the candidate size (25% in the proposed algorithm). Detected overlaps that are within close proximity to one another (5% of the region size, τ^grad_prox) are merged and the average position of the merged overlaps is used.
In Figure 3.8, the two cars and cyclist are detected as a single vehicle candidate,
due to the cars overlapping and the cyclist being nearby. Computing the hor-
izontal and vertical projections for the candidate, it can be seen that there are
two significant changes in gradient in each direction (denoted by the red lines in
Figure 3.8: Detecting Overlapping Vehicles. (a) Input Image, (b) Vehicle Candidate, (c) Vertical Projection, (d) Horizontal Projection, (e) Detected Vehicles, (f) Detected Vehicles.
Figure 3.8 (c) and (d), and shown overlaid on the original motion image in (e)).
The initial candidate can be segmented using these boundaries to locate the three
separate candidate vehicles (Figure 3.8 (f)).
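A short C++ sketch of the overlap detection of Equations 3.7 and 3.8 is given below; the merging of nearby split points (within τ^grad_prox of one another) is omitted for brevity, and the function name is illustrative.

#include <vector>

// Indices where the projection histogram jumps by more than tau_grad * extent,
// as in Equations 3.7 and 3.8.  'extent' is the candidate's size in the
// opposite dimension (N for the vertical projection, O for the horizontal
// projection).  Merging of split points within tau_prox of one another is
// omitted here.
std::vector<int> detectOverlaps(const std::vector<int>& projection,
                                int extent, double tauGrad = 0.25)
{
    std::vector<int> splits;
    for (std::size_t i = 1; i < projection.size(); ++i) {
        int diff = projection[i] - projection[i - 1];
        if (diff < 0) diff = -diff;                  // absolute gradient
        if (diff >= tauGrad * extent)
            splits.push_back((int)i);
    }
    return splits;
}

Applying this to both the vertical and horizontal projections yields the split boundaries used to segment an oversized candidate into separate vehicles.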
3.3.3 Blob Detection
A third object detection routine is used to catch any remaining objects that have
not been detected by either the person or vehicle detectors. The sole purpose of
this detector is to catch objects that the other detectors failed to detect, possibly
due to errors in the binary input image. For a standard system configuration,
objects detected by this detection routine can only be used to update the position
of existing tracked objects (i.e. an object detected by the blob detector cannot
result in a new object being added to the system).
The blob detector simply locates regions of motion and groups the regions ac-
cording to spatial constraints. If the grouped regions are within size bounds, they
are accepted as candidates. This process is essentially the same as that which
searches for vehicle candidates (see section 3.3.2), with looser constraints on the
merging of nearby regions of motion, occupancy and size. Constraints on size
and region grouping are relaxed as any objects detected by this process could not
be detected by the detection routine intended to find them (most likely due to
segmentation errors), and so using the constraints applied in the more specific de-
tection routines will result in no detection being made. As objects detected in this
process are only used to update known objects, this is deemed to be acceptable.
3.4 Baseline Tracking System
The baseline tracking system uses the algorithm described in Section 3.2.1. The
system uses the motion detection system proposed by Butler et al [18] (an
overview of this algorithm is provided in Section 2.2.1, and more details can
be found in Section 4.2), and the object detection routines described in Section
3.3.
A simple constant velocity motion model is used to predict object positions,
T^i_x(t+1) = T^i_x(t) + \frac{1}{N}\left(T^i_x(t) - T^i_x(t-N)\right),    (3.9)

T^i_y(t+1) = T^i_y(t) + \frac{1}{N}\left(T^i_y(t) - T^i_y(t-N)\right),    (3.10)

T^i_h(t+1) = T^i_h(t) + \frac{1}{N}\left(T^i_h(t) - T^i_h(t-N)\right),    (3.11)

T^i_w(t+1) = T^i_w(t) + \frac{1}{N}\left(T^i_w(t) - T^i_w(t-N)\right),    (3.12)

where T^i_x and T^i_y are the x and y image coordinates for track i, T^i_w and T^i_h are the width and height (in pixels) of track i, N is the size of the motion model and t is the current time step. The whole bounding box is used within the
model and t is the current time step. The whole bounding box is used within the
motion model to smooth the width and height and counter any fluctuations due
to segmentation errors. The size of the motion model (N) depends on frame rate
and the speed of objects, but is typically set to 10 frames for most systems.
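The prediction step can be summarised by the short sketch below; the representation of the track history as a list of bounding boxes is an assumption made purely for illustration.

```python
def predict_track(history, n=10):
    """Constant velocity prediction of a track's bounding box (Eqs. 3.9 - 3.12).

    history : list of past bounding boxes for the track, each as (x, y, w, h),
              ordered oldest to newest; at least n + 1 entries are assumed.
    n       : size of the motion model (number of frames used to estimate velocity).
    Returns the predicted (x, y, w, h) for the next frame.
    """
    current = history[-1]
    past = history[-1 - n]
    # Each component moves by 1/N of the displacement observed over the last N frames.
    return tuple(c + (c - p) / float(n) for c, p in zip(current, past))
```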
Objects are detected using the object detection routines outlined in Sections 3.3.1
to 3.3.3 for people, vehicles and blobs respectively. Each object detection routine results in a set of objects being detected, D^i, i = [1..D], where D is the total number
of objects detected.
The fit for an object to a candidate is calculated by comparing the position and
size of the detected object, Di, to the tracked object, T j. Errors in position and
area are calculated using,
E_{position}(D^i, T^j) = \sqrt{\left(\frac{D^i_x - T^j_x}{T^j_w}\right)^2 + \left(\frac{D^i_y - T^j_y}{T^j_h}\right)^2},    (3.13)

E_{area}(D^i, T^j) = \frac{\left|D^i_w \times D^i_h - T^j_w \times T^j_h\right|}{T^j_w \times T^j_h},    (3.14)

where E_position(D^i, T^j) is the error in the median position between the candidate object D^i and the tracked object T^j, and E_area(D^i, T^j) is the error in area between the objects. D^i_x, D^i_y, D^i_w and D^i_h are the x and y position, the width and the height respectively. The errors are expressed as a percentage of the object's size (i.e. for
an object which is very large, a given change in position will be less significant
than for an object which is very small).
Errors are evaluated using two Gaussian distributions (one for position, one for area) with user specified standard deviations (σ_pos and σ_area for the position and area distributions respectively; these are expressed as percentages and specified as inputs in the configuration file) and a mean of 0 (no error, the value is the same from one frame to the next). The likelihood of the position and size (area) are determined separately such that,

F_{position}(D^i, T^j) = \Phi_{0,\sigma_{pos}}(E_{position}(D^i, T^j)),    (3.15)

F_{area}(D^i, T^j) = \Phi_{0,\sigma_{area}}(E_{area}(D^i, T^j)),    (3.16)

where F_position(D^i, T^j) is the fit of the position component, F_area(D^i, T^j) is the fit of the area component, and \Phi_{\mu,\sigma} is the cumulative distribution function for the Gaussian distribution. The product of these fits is used as a measure of the fit of the candidate to the tracked object,

F(D^i, T^j) = F_{position}(D^i, T^j) \times F_{area}(D^i, T^j),    (3.17)

where F(D^i, T^j) is the fit between the track T^j and the candidate D^i. If F(D^i, T^j) is greater than τ_fit and the track is in either the active, occluded or entry state, a valid match has been found. For the transfer state, the threshold is lowered to τ_fit/2, to account for potential errors when transferring coordinates between views.
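The matching computation is sketched below. The thesis evaluates each error with the Gaussian cumulative distribution function (Eqs. 3.15 and 3.16); in this sketch the complementary form 2(1 − Φ) is used so that zero error yields a fit of 1 and large errors decay towards 0, which is assumed to be the intended behaviour. The σ defaults are placeholders for the user specified configuration values.

```python
import math

def gaussian_fit(error, sigma):
    """Map a non-negative error to a fit score in [0, 1] using a zero-mean Gaussian.

    Uses 2 * (1 - Phi(error)) rather than Phi(error) itself, so that small errors
    produce high fit scores (an assumption about the intended behaviour of
    Eqs. 3.15 and 3.16).
    """
    phi = 0.5 * (1.0 + math.erf(error / (sigma * math.sqrt(2.0))))
    return 2.0 * (1.0 - phi)

def match_fit(det, track, sigma_pos=0.2, sigma_area=0.3):
    """Fit between a detection and a track, both given as dicts with x, y, w, h keys.

    The position and area errors are expressed relative to the track's size
    (Eqs. 3.13 and 3.14), and the overall fit is their product (Eq. 3.17).
    """
    e_pos = math.sqrt(((det['x'] - track['x']) / track['w']) ** 2 +
                      ((det['y'] - track['y']) / track['h']) ** 2)
    e_area = abs(det['w'] * det['h'] - track['w'] * track['h']) / (track['w'] * track['h'])
    return gaussian_fit(e_pos, sigma_pos) * gaussian_fit(e_area, sigma_area)
```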
A histogram is used to compare objects when there is uncertainty regarding a
match. The histogram is calculated in YCbCr, with the luminance component
independent from the chrominance. The uncertainty of a match is determined by
the ratio of the top two fits for a given object,

U(T^i, T^j, D^k) = \frac{\min(F(T^i, D^k), F(T^j, D^k))}{\max(F(T^i, D^k), F(T^j, D^k))},    (3.18)

where U(T^i, T^j, D^k) is the uncertainty for the tracks T^i and T^j matching the object D^k. If U(T^i, T^j, D^k) is greater than a threshold, τ_unc, the histograms of the tracks are compared to the histogram of the object (for our system, τ_unc is set to 0.5). Histograms are compared using the Bhattacharyya coefficient,

B(T^i, D^k) = \sqrt{\sum_{n=1}^{N} \sqrt{H(T^i, n) \times H(D^k, n)}},    (3.19)

where B(T^i, D^k) is the Bhattacharyya coefficient, H(T^i, n) is the nth bin of the histogram belonging to T^i, and N is the total number of bins in the histogram. To simplify comparison and ensure any results are within fixed bounds, the histogram comparison is performed using histograms with their bin weights normalised such that they sum to 1,

\sum_{n=1}^{N} H(T^i, n) = 1.    (3.20)

This will return 1 for a perfect match, and 0 for no match. B(T^i, D^k) is multiplied with the original fit of the two tracks, such that,

F'(T^i, D^k) = F(T^i, D^k) \times B(T^i, D^k),    (3.21)

F'(T^j, D^k) = F(T^j, D^k) \times B(T^j, D^k).    (3.22)

Whichever track has the greatest value of F' is deemed the matching track.
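A sketch of the histogram-based disambiguation is given below; hist_det is the detection's histogram, and the outer square root follows Eq. (3.19) as written (omitting it gives the more common form of the coefficient and does not change the ordering of matches). All names are illustrative.

```python
import numpy as np

def bhattacharyya(hist_a, hist_b):
    """Histogram similarity following Eq. (3.19), with bins normalised to sum to 1 (Eq. 3.20)."""
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    a /= a.sum()
    b /= b.sum()
    # 1 for identical histograms, 0 when the histograms do not overlap at all.
    return float(np.sqrt(np.sum(np.sqrt(a * b))))

def resolve_uncertain_match(fit_i, fit_j, hist_i, hist_j, hist_det, tau_unc=0.5):
    """Disambiguate two tracks competing for one detection (Eqs. 3.18, 3.21 and 3.22)."""
    uncertainty = min(fit_i, fit_j) / max(fit_i, fit_j)      # Eq. (3.18)
    if uncertainty <= tau_unc:
        # The fits are sufficiently different; the higher fit wins outright.
        return 'i' if fit_i >= fit_j else 'j'
    f_i = fit_i * bhattacharyya(hist_i, hist_det)            # Eq. (3.21)
    f_j = fit_j * bhattacharyya(hist_j, hist_det)            # Eq. (3.22)
    return 'i' if f_i >= f_j else 'j'
```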
3.5 Evaluation Process and Benchmarks
The proposed tracking system and its improvements are evaluated using a subset of the ETISEO database [130] and the ETISEO evaluation tool². The ETISEO evaluation was run in 2006 to evaluate tracking and event recognition systems.

² ETISEO resources such as the database and evaluation tool can be downloaded at http://www-sop.inria.fr/orion/ETISEO/index.htm
The throughput of the tracking systems proposed in this thesis is also evaluated. The average number of frames processed per second (fps) for each group of
datasets (see Section 3.5.2) is used to evaluate the system performance in terms
of data throughput. Whilst it is not expected that the tracking systems pro-
posed in this thesis are capable of processing at 25 fps (the implementations of
the tracking system used within this thesis are unoptimised, only run on a single
processor core and load data to and from a hard drive), they are still ultimately
intended to be used for processing live data feeds of 5 fps or faster (the frame
rate required to process a live data feed depends on the data itself). As such, it
is important that the proposed systems approach or exceed this frame rate.
Section 3.5.1 outlines the metrics proposed by the ETISEO evaluation, which are
used in this evaluation; Section 3.5.2 details the subset of the ETISEO database
that is used, and describes the major configuration settings for each dataset; and
Section 3.5.4 contains benchmarks for the baseline tracking system.
3.5.1 Evaluation Metrics
As part of the ETISEO evaluation, an evaluation tool was developed and several
metrics were proposed for comparing the performance of tracking systems [154].
These metrics fall into five areas:
1. Detection of objects.
2. Localisation of objects.
3. Tracking of objects.
4. Classification of objects.
5. Event Recognition.
For each of these areas, there are several metrics that measure specific performance criteria, and an overall metric that is created by averaging
the simpler metrics. All metrics return a value in the range [0..1]. A value of 1
indicates best possible performance, 0 indicates worst possible. The evaluation
of work in this thesis will focus on metrics from the first three areas (detection,
localisation, tracking). The tracking systems discussed in this thesis do not per-
form event recognition (with the exception of abandoned object detection), and
classification of objects within the proposed systems is very limited (the ETISEO
evaluation uses a much greater range of classification types than the proposed
systems). As a result, metrics in these areas are deemed to be unnecessary for
the evaluation of the proposed systems.
The use of the ETISEO evaluation format and the metrics allows the system per-
formance to be analysed across a wide range of criteria. High level comparisons
can be made by comparing the overall metrics, whilst analysis of the simpler,
component metrics can be used to better understand the reasons for any im-
provements gained. The use of an existing evaluation process also avoids the
challenges involved in developing an evaluation tool to compare tracking data to
ground truth (and formulation of separate metrics).
The metrics used in the evaluations contained in this thesis are briefly outlined
here. Many metrics can be computed over either individual frames, or over the
whole sequence. Only metrics computed over the whole sequence are considered in
our evaluation. More detailed information on these metrics can be found in [154].
The following standard definitions are used for many of the metrics:
True Positive (TP)     Detected situation exists in the ground truth and results
True Negative (TN)     A situation that does not exist in either the ground truth or results
False Positive (FP)    The algorithm has detected a situation that does not exist in the ground truth
False Negative (FN)    The algorithm has failed to detect an event that exists in the ground truth

Table 3.2: Evaluation Metric Standard Definitions
From these values, four scores can be calculated, the Precision, Sensitivity, Speci-
ficity and F-Score,
Precision = \frac{TP}{TP + FP},    (3.23)

Sensitivity = \frac{TP}{TP + FN},    (3.24)

Specificity = \frac{TN}{FP + TN},    (3.25)

F\text{-}score = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}.    (3.26)
It should be noted that not every metric computes all these measures.
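For reference, the four scores can be computed directly from the raw counts as sketched below; the guards against empty denominators are an assumption, as the metric definitions do not state how such cases are handled.

```python
def evaluation_scores(tp, tn, fp, fn):
    """Compute the scores defined in Eqs. (3.23) - (3.26) from raw TP/TN/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (fp + tn) if (fp + tn) else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if (precision + sensitivity) else 0.0)
    return {'precision': precision, 'sensitivity': sensitivity,
            'specificity': specificity, 'f_score': f_score}
```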
Four distance measures are also defined. These are used to match objects in
the ground truth to objects in the result data. As such, they are used as part
of determining many other metrics. Overlaps need to be determined for both
spatial and temporal information (see Figure 3.9). In Figure 3.9, GT represents
the ground truth, and R the result.
Figure 3.9: Ground truth and result overlaps - (a) Spatial Overlap, (b) Temporal Overlap
The four metrics (E1-4: the Dice coefficient, overlapping, Bertozzi and maximum deviation measures respectively) are defined as follows,

E1 = \frac{2 \times card(GT \cap R)}{card(GT) + card(R)},    (3.27)

E2 = \frac{card(GT \cap R)}{card(GT)},    (3.28)

E3 = \frac{card(GT \cap R)^2}{card(GT) \times card(R)},    (3.29)

E4 = \max\left(\frac{card(R/GT)}{card(R)}, \frac{card(GT/R)}{card(GT)}\right),    (3.30)
where card(S) is the cardinality (number of elements) in the set, S. Each of these
measures can be used as is, or have a threshold applied to determine a match.
In instances where a distance metric is used, a result is calculated using each
distance metric, and the average of these scores is taken as the final metric score.
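Treating the ground truth and the result as sets (for example of pixel coordinates or of frame indices), the four distance measures reduce to simple cardinality ratios, as sketched below; the set representation, and the assumption that neither set is empty, are illustrative readings of the definitions.

```python
def distance_measures(gt, result):
    """The four ETISEO distance measures (Eqs. 3.27 - 3.30) between two non-empty sets."""
    inter = len(gt & result)
    e1 = 2.0 * inter / (len(gt) + len(result))           # E1: Dice coefficient
    e2 = inter / float(len(gt))                          # E2: overlapping
    e3 = inter ** 2 / float(len(gt) * len(result))       # E3: Bertozzi
    e4 = max(len(result - gt) / float(len(result)),      # E4: maximum deviation
             len(gt - result) / float(len(gt)))
    return e1, e2, e3, e4
```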
The evaluation within this thesis does not include individual scores for precision,
sensitivity, specificity or f-score for each metric. Nor does it provide results for
each distance metric (E1-4). The evaluation presents the overall score for each
metric. This is the average of the precision, sensitivity, specificity or f-score (or
whichever of these are computed). If the metric requires E1-4, then the overall
score is the average of the precision, sensitivity, specificity or f-score for each
distance metric.
The evaluation also computes overall scores for detection, localisation and track-
ing. These scores are the average of all precision, sensitivity, specificity and f-score
results across E1-4, for all metrics in the category.
In the evaluations within this thesis, the overall metrics for detection, localisation
and tracking are used, as well as the individual metrics detailed below. Metrics are
outlined briefly here to describe what they are evaluating. Detailed information
on the precise formulation of the metrics is not provided here, and can be found
in [154]. Other metrics not used in our evaluation (such as those for object
classification and event recognition) can also be found in [154].
Detection Metrics
Two detection metrics are defined:
• Number of physical objects (D1)- The number of objects detected
compared to the number of objects present in the ground truth (does not
consider if the objects detected are valid, i.e. in the expected positions).
• Number of physical objects using their bounding box (D2) - The
number of detected objects that have a significant overlap with a ground
truth object. Only one detected object can match a ground truth object,
any additional detections are designated as false positives.
Localisation Metrics
Four localisation metrics are defined:
• Physical objects area (L1) - Evaluate the 2D object position in each
frame. Uses the overlap of the ground truth and result bounding boxes to
compute the metric.
• Physical object area fragmentation (splitting) (L2) - Determine ob-
ject fragmentation by checking the number of detected objects that corre-
spond to (overlap with) each ground truth object. Ideally, each detected
object should overlap with a single ground truth object.
• Physical object area integration (merging) (L3) - Determine object
merging by checking the number of ground truth objects that correspond
to (overlap with) each detected object. Ideally, each ground truth object
should overlap with a single detected object.
• Physical object centroid localisation (L4) - Evaluates the detection ac-
curacy based on the distance between the detected centroid and the ground
truth centroid.
Metrics L2 and L3 do not compute the true/false positives/negatives, and define
a separate metric for matching (see [154]).
Tracking Metrics
Five tracking metrics are defined:
• Number of objects being tracked during time (T1) - measures the
ability of the system to detect and follow an object over time. Analyses
object occurrences and object persistence over time, and uses the distance
between the ground truth and the result data to determine performance.
• Tracking time (T2) - Measures the percentage of time that an object is
tracked for. Assumes that the object ID will be constant over the object
life.
• Physical object ID fragmentation (T3) - Determine how well tracked
objects remain associated with ground truth objects. Measures the number
of result objects (tracks recorded in the test data) that are associated with
each ground truth object to determine how well objects remain tracked.
• Physical object ID confusion (T4) - Measure the rate at which result
tracks switch to other ground truth tracks (i.e. becomes confused).
• Physical object 2D trajectories (T5) - Determines if 2D trajectories
are correctly detected over time. A match between two given trajectories is
calculated by obtaining the distance between them using a distance metric and
thresholding the result.
Metrics T2, T3 and T4 do not compute the true/false positives/negatives, and
define a separate metric for matching (see [154]).
3.5.2 Evaluation Data and Configuration
The evaluation uses the following datasets (example images are shown in Figure
3.10):
• ETI-VS2-RD6 - a dataset showing a road with a combination of cars and
pedestrians to track.
• ETI-VS2-RD7 - a dataset showing a road with a combination of cars and
pedestrians to track.
• ETI-VS2-BC16 - a dataset showing the corridor of a building, with several
doors on either side through which people enter and exit.
• ETI-VS2-BC17 - a dataset showing the corridor of a building, with several
doors on either side through which people enter and exit.
• ETI-VS2-AP11 - a two camera dataset (ETI-VS2-AP11-C4 and ETI-VS2-
AP11-C7) showing part of an airport tarmac, with various vehicles moving
about.
• ETI-VS2-AP12 - a two camera dataset (ETI-VS2-AP12-C4 and ETI-VS2-
AP12-C7) showing part of an airport tarmac, with various vehicles moving
about.
• ETI-VS2-BE19 - a two camera dataset (ETI-VS2-BE19-C1 and ETI-VS2-
BE19-C3), showing the entrance to a building, with people entering and
leaving the building, and vehicles entering and leaving the parking lot.
• ETI-VS2-BE20 - a two camera dataset (ETI-VS2-BE20-C1 and ETI-VS2-
BE20-C3), showing the entrance to a building, with people entering and
leaving the building, and vehicles entering and leaving the parking lot.

Figure 3.10: Examples of Evaluation Data - (a) RD Example, (b) BC Example, (c) AP C4 Example, (d) AP C7 Example, (e) BE C1 Example, (f) BE C3 Example
The configurations used for the system are kept the same when testing different datasets captured from the same camera (i.e. RD6 and RD7 both use the same configuration, which is different to that used by BC16 and BC17). Configurations are also kept as similar as possible when testing different tracking systems on the same dataset (see the testing performed in Chapters 5 and 6). For elements common to the trackers, the same configuration is used, with the configuration only altered to configure the additional features.
When performing all tests, the first image the tracker receives (initialisation im-
age) is an empty scene. Whilst the motion detectors used within the tracking
systems are able to learn the background over several hundred frames (the num-
ber required depends on the amount of motion within the scene and the learning
rate), this is not ideal as it means that an early portion of every dataset is devoted
to training the background model. An alternative method is to ensure that the
first image is an image of an empty scene. This ensures that the initial model is
of an empty scene, and allows object tracking to commence immediately.
All images are resized to 320 × 240 pixels for processing. Tracking results are
scaled back to the original image dimensions for comparison to the ground truth.
Processing is performed at the smaller, and uniform dimensions, to improve sys-
tem speed and conserve storage space when performing the testing, and provide
a better simulation of real world conditions (real time performance is not pos-
sible for images sized 720 × 576 pixels or similar). Common image sizes across
datasets also allow more of the configuration parameters to remain fixed between
datasets.
For each dataset, the following configuration parameters are selected:
• Motion detector thresholds (τLum and τChr) and learning rate (L)
• Expected person size (minimum and maximum height)
• Expected vehicle size (minimum and maximum size specified as height and
aspect ratio, minimum amount of motion pixels)
These parameters can be quickly established by looking at a small set of frames
(at most 10) from the dataset. Motion detection parameters are chosen based
on the observed noise in the scene and the contrast between the moving objects
and the background. Expected sizes are chosen based on the size of objects
observed in the scene. For scenes where one object class is not present, the
appropriate detector is disabled and not configured. It is realistic to expect that
any commercial deployment of an intelligent surveillance system would involve at least this much configuration, if not significantly more, for each deployed camera. It is also unrealistic to expect a single configuration to be optimal for all possible video feeds.
All other parameters are fixed for all datasets. These parameters are listed in
Table 3.3.
Parameter                                   Value
Motion Detection Parameters
    K                                       6
Person Detection Parameters
    τ_OccPer                                0.3
    Minimum Aspect Ratio                    0.15
    Maximum Aspect Ratio                    0.5
Vehicle Detection Parameters
    τ_OccVeh                                0.4
    τ_ov^grad                               0.25
    τ_prox^grad                             0.05
Object Detection and Matching Parameters
    τ_active                                3
    τ_occluded                              10
    τ_fit                                   0.5
    τ_unc                                   0.5

Table 3.3: System Parameters - Fixed parameters for all data sets.
RD Dataset Configuration
The RD datasets are configured to track both people and vehicles. Four zones
are defined:
1. Active Zone - the entire scene, for all object classes.
2. Entry Priority Zone - the roadway, with vehicles given priority.
3. Entry Zone - the sidewalk nearest the camera, for the person class.
4. Entry Zone - the sidewalk furthermost from the camera, for the person
class.
This configuration means that only people can be created on the sidewalk, and
either class can be created on the road with vehicles given priority. Once an
object is created, any position within the scene is valid (i.e. a vehicle may be
tracked on the footpath, as long as it is created on the road). Figure 3.11 shows
the zones, with the numbers indicated in the image corresponding to the above
list.
Figure 3.11: Zones for RD Datasets
A mask image is also used, and is shown in Figure 3.12. This mask forces the
portion at the top of the image to be ignored. This region is deemed to be of
no interest as it contains the overlayed text that changes (causing motion to be
detected where there is nothing of interest), a tree that blocks the roadway almost
entirely (meaning vehicles cannot be tracked), and the side of the building. This
use of the mask also means that no motion is detected as a result of any movement
from the tree. The regions of images that are shaded dark are masked out.
Figure 3.12: Mask Image for RD Datasets
Configuration parameters specific to the RD datasets are listed in Table 3.4.

Parameter                      Value
Motion Detection Parameters
    τ_Lum                      50
    τ_Chr                      30
    L                          9
Person Detection Parameters
    Minimum Height (pixels)    15
    Maximum Height (pixels)    60
Vehicle Detection Parameters
    Minimum Height (pixels)    15
    Maximum Height (pixels)    120
    Minimum Aspect Ratio       1/3
    Maximum Aspect Ratio       3
    τ_AreaVeh                  200

Table 3.4: System Parameters - Parameters specific to the RD data sets.
BC Dataset Configuration
The BC datasets are configured to track people only. No zones are defined (all
locations are valid for track existence and entry). A mask image is used, and
is shown in Figure 3.13. This mask simply removes portions of the image which
cannot contain people, allowing an improvement in processing speed. The regions
of images that are shaded dark are masked out.

Figure 3.13: Mask Image for BC Datasets
Configuration parameters specific to the BC datasets are listed in Table 3.5. Note
that for the BC datasets, vehicle detection is disabled.
AP Dataset Configuration
The AP datasets are configured to track vehicles only. No zones are defined, and
no mask image is used.
Parameter                      Value
Motion Detection Parameters
    τ_Lum                      50
    τ_Chr                      30
    L                          9
Person Detection Parameters
    Minimum Height (pixels)    35
    Maximum Height (pixels)    200

Table 3.5: System Parameters - Parameters specific to the BC data sets.
Configuration parameters specific to the AP datasets are listed in Table 3.6. Note that for the AP datasets, person detection is disabled.

Parameter                      Value
Motion Detection Parameters
    τ_Lum                      50
    τ_Chr                      30
    L                          9
Vehicle Detection Parameters
    Minimum Height (pixels)    10
    Maximum Height (pixels)    120
    Minimum Aspect Ratio       1/5
    Maximum Aspect Ratio       5
    τ_AreaVeh                  50

Table 3.6: System Parameters - Parameters specific to the AP data sets.
BE Dataset Configuration
The BE datasets are configured to track people and vehicles. The following zones
are defined across the two views:
1. Active Zone - the entire scene, for all object classes.
2. Entry Priority Zone - the driveway entry, with vehicles given priority.
3. Entry Zone - the building doorway, for the person class.
4. Entry Zone - the driveway and car park area, for both classes.
Figure 3.14 shows the zones, with the numbers indicated in the image correspond-
ing to the above list.
Figure 3.14: Zones for BE Datasets - (a) BE C1 Zones, (b) BE C3 Zones
Mask images are also used, and are shown in Figure 3.15. The masks remove
regions where tracks cannot appear (such as the gardens, the side of the building).
This use of the masks also means that no motion is detected as a result of any
movement from the trees. The regions of images that are shaded dark are masked
out.
Configuration parameters specific to the BE datasets are listed in Table 3.7.
Figure 3.15: Mask Images for BE Datasets - (a) BE C1 Mask, (b) BE C3 Mask
Parameter                      Value
Motion Detection Parameters
    τ_Lum                      60
    τ_Chr                      50
    L                          9
Person Detection Parameters
    Minimum Height (pixels)    20
    Maximum Height (pixels)    80
Vehicle Detection Parameters
    Minimum Height (pixels)    15
    Maximum Height (pixels)    120
    Minimum Aspect Ratio       1/3
    Maximum Aspect Ratio       3
    τ_AreaVeh                  500

Table 3.7: System Parameters - Parameters specific to the BE data sets.
3.5.3 Tracking Output Description
The object tracker annotates output images to indicate what objects have been
detected and are being tracked. All tracking output in the evaluations contained
within this thesis is annotated using the described approach. An example of the
tracking output is shown in Figure 3.16.
Objects may be annotated in two ways depending on what state they are in.
Objects that have only just entered the system and are in either the Preliminary
or Transferred state are annotated using a single yellow box. An example of this is visible in Figure 3.16 (c). All other objects are annotated by drawing two boxes around the object. These boxes indicate the object's ID and its type.

Figure 3.16: Example output from the tracking system - (a) Frame 240, (b) Frame 340, (c) Frame 440

The outer bounding box indicates the object ID. The same object should be
drawn with the same colour bounding box each frame (see Figure 3.16, the car
on the far side of the road is annotated with a blue box each frame). If the colour
of the outer bounding box changes, it indicates that the object’s ID has changed
which may indicate that an error has occurred. The ID bounding box can be one
of 16 colours (the EGA colours are used). The system will reuse colours if more
than 16 objects are tracked during the system's run time.
The inner bounding box indicates the object's type. There are three types within
the system:
1. Person - Indicated by a red box.
2. Vehicle - Indicated by a cyan box.
3. Unknown - Indicated by a white box.
The unknown object class should only be observed if an error occurs.
3.5.4 Benchmarks
Overall results are shown in Tables 3.8 and 3.9. Detailed results for each dataset
are contained in Appendix A.
Data Set   Detection      Localisation                   Tracking
           D1     D2      L1     L2     L3     L4        T1     T2     T3     T4     T5
RD         0.77   0.64    0.80   1.00   0.98   1.00      0.39   0.31   0.77   0.95   0.32
BC         0.77   0.41    0.73   1.00   0.98   0.99      0.45   0.30   0.74   0.85   0.34
AP         0.82   0.64    0.76   1.00   1.00   1.00      0.57   0.39   0.97   1.00   0.45
BE         0.66   0.25    0.55   0.81   0.79   1.03      0.19   0.11   0.47   0.64   0.21
Table 3.8: Baseline Tracking System Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.67                0.93                   0.44
BC         0.48                0.91                   0.46
AP         0.68                0.93                   0.59
BE         0.33                0.74                   0.25
Table 3.9: Overall Baseline Tracking System Results
RD Dataset Benchmarks
Benchmarks for the RD datasets (ETI-VS2-RD6 and ETI-VS2-RD7) are shown
in Tables A.1 and A.2 in Appendix A. Difficulties are experienced by the tracking
system due to the presence of objects that stop during the scene (see Figure 3.17,
the car on the far side of the road (blue bounding box) is lost after almost 500
frames of tracking), and several occlusions.
The reason objects are lost after being stationary for a period of time is that
the motion detection algorithm learns these objects as being part of the back-
ground. Whilst the problem of objects being lost can be overcome by using a
slower learning rate, this is not ideal, and thus not implemented. Within the
evaluated datasets, some tracked objects stay stationary for over 2000 frames.
Figure 3.17: Example output from RD7 - Loss of track due to the target object (car in blue rectangle on the far side of the road) being stationary for several hundred frames. Frames 140, 240, 340, 440, 540 and 640 are shown.
A learning rate suitably slow to ensure that such objects are not incorporated
into the background will result in a background model that is close to static. In
situations such as that shown in Figure 3.17, detection and tracking errors also
arise when the object does begin to move again. As the vehicle has been learned
as part of the background, when it begins to move the space that it used to
occupy may be considered foreground (depending on the learning rate, and the
time spent stationary) and thus be detected as an object of interest.
Detection and tracking results are also diminished by vehicles that are present
and parked in the entire scene (see Figure 3.17, the white van on the far side of the
road is present for the entire scene). The tracking system requires motion to be
observed to detect objects. As these vehicles are always stationary, there is never
any motion observed to enable detection. However, as the ETISEO ground truth
annotates these objects, the performance of the detection and tracking metrics is
reduced.
BC Dataset Benchmarks
Benchmarks for the BC datasets (ETI-VS2-BC16 and ETI-VS2-BC17) are shown
in Tables A.3 and A.4 in Appendix A. The BC datasets are captured in a hallway,
with a shallow field of view angle, resulting in a large number of occlusions. This
is reflected in the overall tracking performance, and the poor performance for the
tracking metric T2 (tracking time). The colour of the hallway and floors also poses a problem, as it is similar to the colour of the skin and clothing of many of the subjects in the scene. There is also a significant shadow/reflection on the floor caused by the people walking about the scene (see Figure 3.18, where the person is poorly localised as their shadow is detected as being part of them), which results in detection and localisation errors and, when multiple people are present, can compound tracking problems arising from occlusions.
Figure 3.18: Example output from BC16 - Detection and localisation errors caused by shadow/reflection of the moving object. Motion and output are shown for frames 665, 675, 685 and 695.
A typical occlusion within this dataset is shown in Figure 3.19. Due to the
shallow field of view angle of the camera, a total occlusion is caused as the people
walk away. This results in the identity of one person being lost, the identity of
the second being switched to the first.

Figure 3.19: Example output from BC16 - Total occlusion, resulting in loss of track identities. Motion and output are shown for frames 1250, 1275, 1300 and 1325.

Occlusions such as this are common in the
BC datasets. The motion detector output in Figure 3.19 also highlights another
problem affecting the performance of the BC datasets. The white shirt worn by
the person on the left cannot be distinguished from the floor and walls by the
motion detection procedure (thus it is not detected as being a region of motion).
This results in poor detection of this person. This problem affects several other
people in the scene. Whilst it can be solved by using a lower motion detection
threshold, this results in the shadows and reflections on the floor being even more
problematic (see Figure 3.18).
AP Dataset Benchmarks
Benchmarks for the AP datasets (ETI-VS2-AP11 and ETI-VS2-AP12, Cameras
4 and 7) are shown in Tables A.5 and A.6 in Appendix A. The AP datasets are
captured outdoors and, like the RD datasets, contain objects that are present in the scene for the entire sequence (without moving) and thus are not detected.
Like the RD datasets, these objects are annotated in the ground truth, and so the
system's failure to track them diminishes performance. The AP datasets contain
few occlusions, and as such there are few tracking errors, and the performance
errors recorded within the metrics can be largely attributed to the objects that
do not move within the scene.
BE Dataset Benchmarks
Benchmarks for the BE datasets (ETI-VS2-BE19 and ETI-VS2-BE20, Cameras 1
and 3) are shown in Tables A.7 and A.8 in Appendix A. The BE datasets contain
significant camera noise. To handle this, motion detection thresholds are raised
(less sensitive), however this makes the detection of some objects (particularly
people) difficult, as many people are dressed (at least partially) in dark clothing
and are moving across a dark road surface.
Figure 3.20: Example output from BE19-C1 - Impact of poor motion detection performance on tracking; the person leaving the building is tracked poorly. Motion and output are shown for frames 100, 200, 300 and 400.
Figure 3.20 shows examples of the noise present in the data, with false motion
detected under the tree. There is also motion detected at the office front door
through which the person leaves, which whilst valid motion, does potentially
cause additional problems for the object detection.
Figure 3.21: Example output from BE19-C3 - Impact of poor motion detection performance on tracking; the two people are localised poorly. Motion and output are shown for frames 475, 500, 525 and 550.
Figure 3.21 shows a second example of tracking within the BE datasets. Once
again noise is present in the data, and the motion associated with the objects in
the scene is incomplete (i.e. the woman walking towards the office front door,
dressed in dark colours similar to the roadway).
The BE-C3 datasets pose an additional challenge in that the camera is mounted
such that the people appear at an angle (see Figure 3.21 for an example). The
proposed person detection routine is intended to work in an environment where
the people appear vertical. Whilst the slant present in BE-C3 is not severe,
it is sufficiently large to reduce the D2 (detection based on bounding box), T2
(tracking time) and T5 (2D trajectories) metrics to, or almost to, 0. The slanted
camera angle, whilst not making detection impossible, does result in the system
tending to detect only a part of the person. T2 and T5 being 0 illustrates the
effect that poor detection has on the object tracking. Whilst part of the object
is detected and tracked by the system, this part is typically about 50% of the
person's area, and this is sufficiently low to ensure that there is not sufficient overlap
between the detected object and ground truth object to record a correct detection.
As a result, tracking metrics such as T2 and T5 are reduced to 0, as there is no
trajectory that closely matches that in the ground truth.
Within the BE datasets, the motion detection is configured to try and allow
a suitable compromise between the noise and detection problems, and detection
routines are also relaxed. Part of this configuration involves using a faster learning
rate, which further exacerbates the problem of objects being incorporated into the
background (see Figure 3.17). As a result of the motion detector performance and
relaxed detection parameters, there is an increase in false detections. However
this is required to ensure that the objects of interest are detected at all. There is
also a prolonged occlusion between three people in BE20, which is handled very
poorly (as illustrated by the performance metrics for BE20, particularly camera
3).
Sensitivity to Thresholds
The baseline tracking system uses a large number of thresholds during the tracking
process. In any system where decisions need to be made, thresholds are required.
In many cases these thresholds are fixed for all cases, whilst other thresholds need
to be set at an appropriate level for the dataset. Ideally, the system should be
functional with the thresholds set within a range of values (i.e. precise tuning of
each threshold should not be required for the system to function). However, it is
expected that for some thresholds, there will be some values (or ranges of values)
which result in system failure.
To assess the sensitivity of the baseline system's thresholds, three thresholds are
evaluated:
1. τfit - The fit threshold for matching detected objects to tracked objects.
This threshold is set globally for all configurations, at 0.4.
2. τOccPer - The threshold for determining if a person candidate is to be ac-
cepted given the amount of motion within the detected object bounds. This
threshold is set globally for all configurations, at 0.3.
3. τLum - One of two thresholds for performing cluster matching within the
motion detection algorithm [18] (the other being τchr). This threshold is set
for each group of datasets (i.e. RD, BC etc.), as each camera has different
scene and noise characteristics. For the BC datasets, it is set to 50.
Each threshold is evaluated using the ETI-VS1-BC-12 dataset from set 1 of the
ETISEO evaluation. This dataset is very similar to VS2-BC16 and VS2-BC17,
and the same configuration is used (however no mask image is used due to the
slightly different camera angle). To evaluate sensitivity to a given threshold, the
baseline system is evaluated with the threshold in question set to different values
while the rest of the configuration is unchanged.
The overall detection and tracking metric scores (see Section 3.5.1) are plotted for
each evaluated threshold. Localisation is not considered as localisation scores only
consider correctly detected objects (i.e. false detections, or missed detections, are
not considered, so poor detection does not affect the localisation scores). As a result of this, localisation scores do not vary significantly as the thresholds are changed, and are of little interest.
Figure 3.22 shows the performance of the overall detection and tracking metrics
as τfit is varied from 0 to 1 in 0.1 increments. τfit primarily has an effect on the
tracking performance, as this threshold is used after detection has taken place.
When τfit is set to very low values, any detected object can be matched to any
tracked object. This can result in incorrect matches, and tracks swapping identity.

Figure 3.22: Sensitivity of Baseline System to Variations in τfit
This will only occur however, if the object detection produces errors, resulting
in erroneous detections being available for tracked objects to be matched to. As
the object detection performance is constant, there is little change in the system
performance for low values of τfit (0 to 0.4, see Figure 3.22).
As τfit is increased, obviously incorrect matches are eliminated and performance
improves, reaching its peak at 0.5 (see Figure 3.22). However as τfit is increased
further, performance begins to decline as matches which are in fact correct, are
ruled invalid due to the high value of τfit. It can be seen that whilst τfit performs
best at 0.5, detection does not decline severely until τfit exceeds 0.8, and the
system is capable of operating at a range of values.
Figure 3.23 shows the performance of the overall detection and tracking metrics as
τOccPer is varied from 0 to 1 in 0.1 increments. τOccPer is used when selecting valid
candidates during person detection. As such, it has an effect on both detection
and tracking performance.
When τOccPer is very low, more candidates can be detected by the system. However, many of these candidates are invalid, leading to the creation of false tracks.

Figure 3.23: Sensitivity of Baseline System to Variations in τOccPer
As τOccPer is increased, fewer false candidates are detected, and tracking and de-
tection improves (see Figure 3.23, increase in tracking performance from 0.3 to
0.5).
At 0.5, tracking performance reaches its peak, however detection performance
drops. Fewer candidates are detected at τOccPer = 0.5, leading to not only a
decrease in the detection of false candidates, but also an increase in missed detec-
tions of valid candidates. This increase in missed detections ultimately leads to a
performance drop for overall detection. However, the tracking system is able to
predict positions to overcome these missed detections, and fewer false tracks are
spawned due to the reduced number of false candidates that have been detected.
As such tracking performance increases while detection performance decreases.
However, once τOccPer exceeds 0.5, performance decreases rapidly (see Figure
3.23). The detection algorithm struggles to detect any people at all, as a result
there are no objects to track, and the system fails. Once τOccPer reaches 1 (every
pixel within the person bounds must be in motion), the system fails to detect
and track any people.
Figure 3.24: Sensitivity of Baseline System to Variations in τlum
τlum is one of two thresholds used in performing the motion detection (the other
being τchr). Motion detection thresholds are set independently for each camera
view, as each camera has different noise and scene characteristics. Figure 3.24
shows the performance of the overall detection and tracking metrics as τlum is
varied. Motion thresholds have an effect on both detection and tracking results,
as the motion detection results are used by the object detection (and object
detection results are used by the tracking).
When τlum is set to low values, the motion detection algorithm is more sensitive,
detecting more motion. When the algorithm is too sensitive (τlum = 0), too
much motion is detected, and no valid candidate objects can be found as the
majority of the scene is classified as being in motion due to sensor noise. As τlum
is increased, it becomes possible to detect candidates, and the system is able to
track objects. Tracking performance varies as the threshold is increased, how-
ever detection performance continuously decreases. The BC datasets have a low
contrast between the people and the background, so sensitive motion detection
is required. However, a highly reflective floor and visible noise result in a large
number of false candidates being detected. As the threshold is increased (less
sensitive motion detection) the tracking performance begins to improve again, as
fewer false candidates are detected allowing for more accurate tracking of those
people that can be detected. However, once the threshold is set to above 80, per-
formance drops once again as the system is unable to detect motion associated
with the people, and thus is unable to track them.
It should be noted that having the chrominance threshold (τchr) set at 30 limits
the effects of increasing τlum. When performing motion detection, both τlum
and τchr thresholds must be satisfied for a match between the background model
and input image to occur. If both thresholds can not be satisfied by one of the
background modes, then motion will be detected at the pixel in question. As
τlum is increased, more and more foreground pixels will be detected as a result of
violating τchr, and once τlum becomes sufficiently high, it will cease to have any
noticeable effect on the motion detection.
It has been shown that the three thresholds that have been evaluated are all
capable of operating reasonably over a wide range of values, indicating that the
system is not overly sensitive to threshold values. However, all these thresholds do
have optimal values, and some values which can cause the system performance
to decrease dramatically (in the case of τOccPer and τlum the system can fail
completely).
Whilst the thresholds do have some values that can result in poor performance,
these values are all very intuitive. It is obvious that setting a threshold such as
τOccPer to very high values (greater than 0.8) will result in very poor performance
unless the motion detection results feeding the object detection process are ex-
tremely accurate, and even then the fact that people are not a clearly defined
shape will still result in missed detections. Likewise, very low values of τlum (less
than 10) are going to result in very poor performance unless the input video
stream is free from all noise and environmental effects (fluorescent lighting flicker,
shadows, reflections etc.).
Additional thresholds are not examined due to the high computational demand
of performing such an evaluation, however it is expected that the other thresh-
olds would perform similarly. It should also be noted that not all datasets would
exhibit similar results. For example, a dataset with lower noise and a greater con-
trast between foreground and background objects would not suffer the detection
performance drop off experienced when τlum is varied.
Data Throughput Benchmarks
The average frame rate at which the datasets were processed is shown in Table
3.10. These throughput rates are for the tracking system running on a single core
of a 2.4GHz Quad-Core Intel Core 2 CPU (i.e. there is no multi-threading). All
data is loaded from disk at the start of each frame, and images are then resized
to 320x240 pixels.
The BC datasets are processed at approximately 4 fps slower than the other
datasets. This is due to the larger objects present in the BC datasets. The
motion detection process is fastest when there is no motion detected (i.e. the
incoming pixel matches the first background mode it is compared to). As the
BC datasets contain larger objects than the other datasets, the motion detection
execution time is slower when compared to the other datasets. The speed of
the object detection routines is also dependent on the size of the region they are
processing. As the dataset contains larger objects, the object detection routines
are executed more slowly when compared to other datasets.
Data Set    Frame Rate (fps)
RD          18.84
BC          14.47
AP          19.89
BE          18.36
Table 3.10: Baseline Tracking System Throughput
Chapter 4
Motion Detection
4.1 Introduction
Motion detection forms the basis of many object tracking systems. Motion detec-
tion is used to locate objects of interest for tracking, and as such, it is important
to have a robust and adaptive motion detection system. Failure of the motion
detection in such a system may result in failure to detect and track objects, or in
false objects being detected and tracked. The following chapter outlines a pro-
posed multi-modal background modelling system. The proposed system extends
an algorithm proposed by Butler [18].
A flowchart of the proposed algorithm is shown in Figure 4.1. This shows the
process that a single pixel undergoes.

Figure 4.1: Flowchart of proposed motion detection algorithm
The proposed algorithm extends Butler’s [18] to include the following:
• A variable threshold (see Section 4.3.1)
• Lighting compensation (see Section 4.3.2)
• Shadow detection (see Section 4.3.3)
• Segmentation of the detected motion into motion caused by moving objects
and motion caused by stationary objects (see Section 4.5)
• Computation of optical flow for pixels that are detected as being in motion
(see Section 4.4)
It should be noted that the variable threshold can be determined either for the
whole scene or for each pixel (as shown in Figure 4.1), and the lighting compen-
sation values are determined for regions of the scene (not for each pixel).
4.2 Multi-Modal Background modelling
An efficient method of foreground segmentation that is robust and adapts to
lighting and background changes was proposed by Butler [18]. This approach
is similar in design to the Mixture of Gaussians (MoG) approach proposed by Stauffer and Grimson [157], in that each pixel is modelled by a group of weighted modes that describe the likely appearance of the pixel. Unlike the MoG approach, cluster structures consisting of four colour values (stored as
two pairs) and a weight are used to represent the pixel modes.
The algorithm uses Y’CbCr 4:2:2 images as input, and clusters are formed by
pixel pairs (see Figure 4.2). Each pixel in the incoming image has two values, a
luminance and a single chrominance, which alternates between blue chrominance
and red chrominance. Pixels are paired for use in the motion detector, such that
each pixel pair contains two pairs of values (centroids), one of luminance values
and one of chrominance values. This pairing results in motion detection being
effectively performed at half the horizontal resolution of the original image, with
the benefit being increased speed.
Figure 4.2: Motion Detection - The input image (on the left) is converted into clusters (on the right) by pairing pixels
Let p(x_i, y_i, t) be a pixel in the incoming Y'CbCr 4:2:2 image, I(x_i, y_i, t), where [x_i, y_i] is in [0..X−1, 0..Y−1] and t is in [0, T]. A pixel pair, P(x, y, t) (where [x, y] is in [0..X/2−1, 0..Y−1]) is formed from p(x_i, y_i, t) = [y, cb] and p(x_i+1, y_i, t) = [y, cr] to obtain four colour values, P(x, y, t) = [y1, cb, y2, cr] (where x_i = x × 2, and y_i = y). These four values are treated as two centroids ((y1, y2) and (cb, cr)).
Each image pixel, p(xi, yi, t), is only used once when forming pixel pairs P (x, y, t).
As the algorithm pairs consecutive pixels, input images must have an even width.
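The pairing step can be sketched as follows, assuming the 4:2:2 frame is held as an array in which every pixel stores its luminance sample and its alternating chrominance sample; the array layout is an assumption made purely for illustration.

```python
import numpy as np

def form_pixel_pairs(frame_ycbcr422):
    """Pair consecutive Y'CbCr 4:2:2 pixels into cluster centroids.

    frame_ycbcr422 is assumed to be a (Y, X, 2) array in which each pixel holds a
    luminance value and an alternating chrominance value (Cb on even columns, Cr
    on odd columns). Returns a (Y, X/2, 4) array of [y1, cb, y2, cr] pixel pairs,
    i.e. the (y1, y2) and (cb, cr) centroids used by the motion detector.
    """
    height, width, _ = frame_ycbcr422.shape
    assert width % 2 == 0, "consecutive pixels are paired, so the width must be even"

    even = frame_ycbcr422[:, 0::2, :]   # [y1, cb] samples
    odd = frame_ycbcr422[:, 1::2, :]    # [y2, cr] samples
    return np.concatenate([even, odd], axis=2)  # [y1, cb, y2, cr] per pair
```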
Let f(x, y, t) be a frame sequence, and P (x, y, t′) be a pixel pair in the frame at
time t′. Pixel colour history is recorded,
C(x, y, t, 0..K − 1) = [y1, y2, Cb, Cr, w], (4.1)
which represents a multi-modal PDF. K is the number of modes stored for each
pixel. Increasing K results in more memory being required to store the algorithm's data structures and an increase in the time taken to search and sort cluster lists.
However a higher value of K means the algorithm can store a greater number of
modes and maintain a more accurate history for the pixel. Typically, K is set to
6.
Each cluster contains two luminance values (y1 and y2), a blue chrominance value
(Cb), and red chrominance value (Cr) to describe the colour; and a weight, w.
Colour values are accessed and adjusted separately (i.e. the cluster is not a
vector). The weight describes the likelihood of the colour described by that
cluster being observed at that position in the image. Clusters are stored in order
of highest to lowest weight.
For each P (x, y, t) the algorithm makes a decision assigning it to background
or foreground by matching P (x, y, t) to C(x, y, t, k), where k is an index in the
range 0 to K− 1. Clusters are matched to incoming pixels by finding the highest
weighted cluster which satisfies,
|y_1 - C_{y1}(k)| + |y_2 - C_{y2}(k)| < \tau_{Lum},    (4.2)

|Cb - C_{Cb}(k)| + |Cr - C_{Cr}(k)| < \tau_{Chr},    (4.3)

where y1 and Cb are the luminance and chrominance values for the image pixel p(x × 2, y), y2 and Cr are the luminance and chrominance values for the image pixel p(x × 2 + 1, y), C(k) = C(x, y, t, k), and τ_Lum and τ_Chr are fixed thresholds for evaluating matches. The centroid of the matching cluster is adjusted to reflect the current pixel colour,

C(x, y, t, \kappa) = C(x, y, t, \kappa) + \frac{1}{L}\left(P(x, y, t) - C(x, y, t, \kappa)\right),    (4.4)

where κ is the index of the matching cluster; and the weights of all clusters in the pixel's group are adjusted to reflect the new state,

w'_k = w_k + \frac{1}{L}(M_k - w_k),    (4.5)

where w_k is the weight of the cluster being adjusted; L is the inverse of the traditional learning rate, α; and M_k is 1 for the matching cluster and 0 for all others.
If P (x, y, t) does not match any C(x, y, t, k), then the lowest weighted cluster,
C(x, y, t,K − 1), is replaced with a new cluster representing the incoming pixels.
Clusters are gradually adjusted and removed as required, allowing the system to
adapt to changes in the background.
After the updating of weights and clusters, the cluster weights are normalised to
ensure they sum to one using

w_k = \frac{w_k}{\sum_{k=0}^{K-1} w_k}.    (4.6)
When a new cluster is added to the model, its weight is set to 0.01.
Based on the accumulated pixel information, the frame can be classified into
foreground,
fgnd = \forall (x, y, t) \ \text{where} \ \sum_{i=0}^{\kappa} C_w(x, y, t, i) < \tau_{foreground},    (4.7)

where τ_foreground is the foreground/background threshold and κ is the matching cluster index; and background. Pixels that are classified as foreground are said to be in motion.
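The per-pixel-pair processing described above is sketched below. The cluster storage, the re-sorting step and the default values of L, the initial weight and τ_foreground are illustrative assumptions; only the matching, update and classification rules follow Equations (4.2) to (4.7).

```python
import numpy as np

def update_pixel_model(pair, clusters, weights, tau_lum, tau_chr,
                       L=9, tau_foreground=0.7, new_weight=0.01):
    """One step of the multi-modal background model for a single pixel pair.

    pair     : [y1, cb, y2, cr] values for the incoming pixel pair P(x, y, t).
    clusters : (K, 4) float array of stored centroids, ordered by descending weight.
    weights  : length-K float array of cluster weights.
    Returns True if the pair is classified as foreground (in motion).
    """
    y1, cb, y2, cr = pair
    match = None
    for k in range(len(clusters)):
        cy1, ccb, cy2, ccr = clusters[k]
        # Eqs. (4.2) and (4.3): both luminance and chrominance must be close enough;
        # the first (highest weighted) cluster satisfying both is the match.
        if (abs(y1 - cy1) + abs(y2 - cy2) < tau_lum and
                abs(cb - ccb) + abs(cr - ccr) < tau_chr):
            match = k
            break

    if match is None:
        # No match: the lowest weighted cluster is replaced by the incoming pair.
        clusters[-1] = pair
        weights[-1] = new_weight
        match = len(clusters) - 1
    else:
        # Eq. (4.4): move the matching centroid towards the incoming pair.
        clusters[match] += (np.asarray(pair, dtype=float) - clusters[match]) / L

    # Eq. (4.5): adjust all weights, then normalise them to sum to one (Eq. 4.6).
    m = np.zeros(len(weights))
    m[match] = 1.0
    weights += (m - weights) / L
    weights /= weights.sum()

    # Keep clusters ordered from highest to lowest weight.
    order = np.argsort(weights)[::-1]
    clusters[:] = clusters[order]
    weights[:] = weights[order]
    match = int(np.where(order == match)[0][0])

    # Eq. (4.7): foreground if the cumulative weight up to and including the
    # matching cluster is below the foreground threshold.
    return float(weights[:match + 1].sum()) < tau_foreground
```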
4.3 Core Improvements
Three core improvements are made to Butler’s algorithm [18]. These changes
are intended to improve performance of the motion detection, without generating
additional modes of output (i.e. optical flow, multi-layer segmentation). Three
improvements are proposed:
1. Variable threshold - Factors such as scene lighting and variation, the type
of camera being used and type of objects being observed can all affect the
threshold that is used for detecting motion. The use of a variable threshold
is examined, to provide greater flexibility for the motion detection. A global
variable threshold and per pixel variable threshold are proposed.
2. Lighting compensation - Within an outdoor scene, the level of lighting can
change rapidly due to the changing weather (e.g. the sun moving behind
clouds). Sudden changes such as this can result in a large amount of false
motion being detected, leading to poor object detection. An approach to
detect and compensate for lighting changes is proposed.
3. Shadow detection - All objects cast shadows, however when detecting and
tracking objects it is preferable to be able to ignore their shadow. A method
to detect and remove shadows during the motion detection process is pro-
posed.
4.3.1 Variable Threshold
Different surveillance situations contain differing cameras and lighting conditions,
and as such, require different thresholds to evaluate the presence of motion at a
pixel. The algorithm outlined in Section 4.2 calculates two differences for each
pixel every frame. In ideal situations, where there is no noise and no motion,
these differences should be 0. Assuming that there is no motion present in a
video sequence, the difference observed will be a direct result of any noise present
in the sequence.
Analysis of two sequences (see Figures 4.3 and 4.4) shows that the pixel noise
distribution for a surveillance camera is a Gaussian distribution, with a mean of 0.
For two image sequences, each 100 frames in length and containing no motion or
discernible changes in the background (i.e. tree movement, lighting changes), the
differences observed when comparing to a background frame (the first frame in the
sequence) are calculated. Differences are recorded separately for the luminance
channel and for the chrominance channels (blue and red chrominance observations
are combined as they are also combined in the motion detection algorithm). The
distribution of the differences is shown to be approximately Gaussian.

Figure 4.3: Pixel Noise - Indoor Scene - (a) Sample Input, (b) Luminance Distribution, (c) Chrominance Distribution

It should be
noted that these distributions (see Figures 4.3 and 4.4) include differences across
all pixels, and further analysis has shown that each pixel has approximately the
same noise distribution (i.e. sensor noise is constant across the whole sensor).
Figure 4.4: Pixel Noise - Outdoor Scene - (a) Sample Input, (b) Luminance Distribution, (c) Chrominance Distribution

Variable thresholds, τ_Lum(t) and τ_Chr(t), to replace τ_Lum and τ_Chr can be calculated
Variable thresholds, τLum(t) and τChr(t), to replace τLum and τChr, can be calculated by observing the standard deviation of the matching differences,

DiffLum(x, y, t) = |Py1(x, y, t) − Cy1(x, y, t, κ)| + |Py2(x, y, t) − Cy2(x, y, t, κ)|,    (4.8)

σ²Lum(t) = (1 / ((X/2) × Y)) Σ_{x=0, y=0}^{x=X/2−1, y=Y−1} DiffLum(x, y, t)²,    (4.9)

DiffChr(x, y, t) = |PCb(x, y, t) − CCb(x, y, t, κ)| + |PCr(x, y, t) − CCr(x, y, t, κ)|,    (4.10)

σ²Chr(t) = (1 / ((X/2) × Y)) Σ_{x=0, y=0}^{x=X/2−1, y=Y−1} DiffChr(x, y, t)²,    (4.11)
over time (where κ is the index of the matching cluster, and X and Y are the
dimensions of the input image). The variance can be calculated progressively
as each pixel is processed, and the new threshold can then be determined at
the start of each frame. In situations where a matching cluster is not found
(i.e. a new mode is observed), the same values that are used as the initial luminance and chrominance variances, σ²Lum,init and σ²Chr,init, are added instead of DiffLum(x, y, t)² and DiffChr(x, y, t)².
It is assumed that the only source of error in a correct match is sensor noise, and
that the noise forms a Gaussian distribution with a mean of 0. For a Gaussian
distribution, 99% of the values are known to be within 3 standard deviations of
the mean. Given this, the standard deviations are multiplied by three to obtain
the matching thresholds,
τLum(t) = 3 × √(σ²Lum(t)),    (4.12)

τChr(t) = 3 × √(σ²Chr(t)).    (4.13)
To avoid sudden changes in the threshold caused by a high level of sensor noise, or by other errors such as an incorrect match, it is subjected to the same learning rate as the cluster centroids,

σ²Lum(t) = ((L − 1)/L) σ²Lum(t − 1) + (1/L) σ²Lum(t).    (4.14)
The threshold is bounds checked (upper and lower) to ensure that excessive levels of motion or noise do not result in the threshold becoming too high, and that high levels of inactivity do not result in it becoming too sensitive. These bounds are set at extreme threshold values: minimum and maximum luminance thresholds are typically 20 and 130 respectively, and minimum and maximum chrominance thresholds are 10 and 130 respectively.
This process can also be applied at a pixel level, where each pixel has its own
thresholds, τLum(x, y, t) and τChr(x, y, t), based on the variances observed when
matching only that pixel.
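A minimal sketch of the variable-threshold bookkeeping described above is given below, assuming the squared matching differences of Equation 4.9 are accumulated as each cluster position is processed and the smoothed threshold of Equations 4.12 and 4.14 is produced at the start of the next frame. The class and method names are illustrative only; the default bounds follow the luminance values quoted in the text.

```python
import numpy as np

class VariableThreshold:
    """Global variable threshold of Section 4.3.1 (Equations 4.9, 4.12 and 4.14).

    Squared matching differences are accumulated as each cluster position is
    processed; at the start of the next frame the smoothed variance gives a
    threshold of three standard deviations, clamped to fixed bounds.
    """

    def __init__(self, sigma_sq_init, L=9, lower=20.0, upper=130.0):
        self.sigma_sq_init = sigma_sq_init  # variance used when no cluster matches
        self.sigma_sq = sigma_sq_init       # running (smoothed) variance estimate
        self.L = L                          # inverse learning rate, as for centroids
        self.lower, self.upper = lower, upper
        self._sum, self._count = 0.0, 0

    def accumulate(self, diff, matched):
        # diff: per-position matching differences for the current frame;
        # where no cluster matched, the initial variance is added instead.
        sq = np.where(matched, np.asarray(diff, dtype=float) ** 2, self.sigma_sq_init)
        self._sum += float(sq.sum())
        self._count += sq.size

    def next_threshold(self):
        frame_var = self._sum / max(self._count, 1)                     # Eq. 4.9
        self.sigma_sq = ((self.L - 1) / self.L) * self.sigma_sq \
                        + frame_var / self.L                            # Eq. 4.14
        self._sum, self._count = 0.0, 0
        tau = 3.0 * np.sqrt(self.sigma_sq)                              # Eq. 4.12
        return float(np.clip(tau, self.lower, self.upper))
```

The same bookkeeping applied to a per-pixel array of variances, rather than a single scalar, gives the per-pixel variant described above.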
4.3.2 Lighting Compensation
In surveillance situations, particularly outdoor scenarios, lighting levels can
change rapidly resulting in large amounts of erroneous motion. A sample outdoor
scene changing over time due to the sun is shown in Figure 4.5. As this figure
shows, significant changes can occur very rapidly (within a few seconds) and un-
less a high learning rate (or high frame rate) is used, these changes will not be
able to be incorporated into the background quickly enough to avoid errors. As
such, it is important to perform additional processing to negate these lighting
changes.
The distribution of changes observed between the background state (frame 11900)
and the other three images is shown in Figure 4.6. Whilst there is little change
in the chrominance values when compared to the background, there is significant
change in the luminance. The changes in the luminance are not concentrated
about a single value either. This is likely due to a combination of factors, such
as the different materials in the scene and their colours, the angle of the camera
and the angle of the light source (which also alters the shadows that are cast).
It would be ideal to be able to use a single constant value to adjust for the luminance change in a given frame. However, as the luminance change is not constant across a scene, the scene is divided into several small regions, and each is treated separately. Figure 4.7 shows a simple, even partitioning of the image that breaks the image into 25 sub-regions.
Figure 4.5: Sample Outdoor Scene With Changing Light Conditions. Panels: (a) Frame 11900, t = 0 seconds; (b) Frame 12000, t = 4 seconds; (c) Frame 12150, t = 10 seconds; (d) Frame 12500, t = 24 seconds.
It can be seen in Figures 4.8 and 4.9 that the changes in colour are much more uni-
form (once again there is negligible change in the chrominance) when considering
only a small region.
These distributions can be described as the noise distribution (see section 4.3.1)
plus an offset, OLum, indicating the average change in luminance.
The calculation of a distinct luminance offset, OLum(r, t) (where r is the region index in the range [0..R − 1]; in Figure 4.7, R = 25), for each sub-region of a scene is proposed.
Figure 4.6: Background Difference Across Whole Scene
Figure 4.7: Partitioning of Image for Localised Lighting Compensation
At each time step, the weighted average of luminance changes is calculated for each region,

OLum(r, t) = Σ DiffLum(x, y, t) × Cw(x, y, t, κ) / Σ Cw(x, y, t, κ), ∀(x, y) ∈ r,    (4.15)
where κ is the index of the matching cluster.
Figure 4.8: Background Difference Across Region (2,2)
The use of a weighted sum allows pixels that have only recently been created, and so were potentially created partially under the present lighting conditions, to be weighted less relative to those that have been present longer. Provided this value is within a percentage threshold of the previous luminance offset, it is accepted and used for the next frame,

χ ≤ OLum(r, t) / OLum(r, t − 1) ≤ 1/χ,    (4.16)
where χ is the change threshold for the luminance offset and is in the range [0..1].
χ is set to 0.5 for the proposed algorithm, and is used only as a failsafe for extreme
conditions.
If the change in the luminance offset is outside of the acceptable limit, it indicates that one of two things has occurred:
1. A very rapid lighting change has occurred.
2. A large object has entered the area.
Figure 4.9: Background Difference Across Region (3,5)
If the first possibility has occurred, then the change should be accepted; if the
second possibility has occurred, the change should be discarded and the previous
value for the offset should be used. To determine which of these events has taken
place, the weighted standard deviation of the luminance offset is calculated,
σLum(r, t) = √[ Σ (OLum(r, t) − DiffLum(x, y, t))² × Cw(x, y, t, κ) / Σ Cw(x, y, t, κ) ], ∀(x, y) ∈ r.    (4.17)
If a rapid lighting change has occurred, then the change across the region should
be relatively uniform, and the standard deviation low. If a large object has sud-
denly entered (i.e. the region has gone from being empty to being half full), then
(unless under very specific circumstances relating to the colour of the object and
the scene) the change should be highly varied, and the standard deviation would
be high. Given this, a threshold is applied to the weighted standard deviation
such that if the standard deviation is below the threshold the change is accepted,
otherwise it is discarded and the previous value for the luminance offset is used.
The luminance offset is incorporated into the match equation by subtracting half of the luminance offset from each pixel difference, and taking the absolute value to allow for a single comparison,

|P(y1) − C(k)(y1) − OLum(r, t)/2| + |P(y2) − C(k)(y2) − OLum(r, t)/2| < τLum(t).    (4.18)
In situations where coloured lighting is present, the same approach could be
applied to the chrominance threshold to compensate.
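The per-region offset update of Equations 4.15 to 4.17 could be sketched as follows. The function name and the value of the standard deviation threshold (sigma_max) are assumptions made for illustration; the thesis leaves that threshold scene dependent, while χ = 0.5 follows the text.

```python
import numpy as np

def update_luminance_offset(diff_lum, weights, prev_offset,
                            chi=0.5, sigma_max=30.0):
    """Per-region luminance offset update (Equations 4.15 to 4.17).

    diff_lum, weights: matching luminance differences and matching-cluster
    weights for the pixels in region r; prev_offset is O_Lum(r, t-1).
    """
    w_sum = float(weights.sum())
    if w_sum <= 0.0:
        return prev_offset
    offset = float((diff_lum * weights).sum()) / w_sum               # Eq. 4.15

    # Fail-safe: is the new offset within a factor of chi of the previous one?
    ratio = abs(offset) / max(abs(prev_offset), 1e-6)
    if not (chi <= ratio <= 1.0 / chi):                              # Eq. 4.16
        # Distinguish a rapid lighting change (uniform change, low sigma)
        # from a large object entering the region (varied change, high sigma).
        sigma = np.sqrt(float((((offset - diff_lum) ** 2) * weights).sum()) / w_sum)  # Eq. 4.17
        if sigma > sigma_max:
            return prev_offset    # large object: keep the previous offset
    return offset                 # accept the new offset for the next frame
```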
4.3.3 Shadow Detection
Shadows can result in motion being detected where there is none. As such,
it is important to recognise shadows and ensure that they are not recorded as
motion. Shadows can be characterised by the fact that they alter the luminance
component of the object’s colour, but have minimal effect on the chrominance.
This is shown in Figure 4.10, which shows two examples of a shadow being cast
across part of the background (the areas inside the boxes). For images (a) and
(b), the mean luminance change is −123.1 while the mean chrominance change
is negligible at 1.2. The standard deviations for luminance and chrominance
differences are 3.97 and 4.45 respectively, indicating that the luminance change
is fairly uniform, and the chrominance change is very close to the existing noise
distribution. Images (c) and (d) yield similar results, with the mean changes (for
luminance and chrominance) being −92.6 and 1.6 with standard deviations of
37.34 and 6.23 respectively. The increased standard deviation for the luminance
can be attributed to the soft nature of the shadow.
Gradient can also be used to aid in shadow detection, as the area under shadow
Figure 4.10: Shadows cast over a section of the background. Panels (a)-(d).
still retains its original texture (it has only been darkened by a shadow). However,
shadow edges will have a high gradient. Four gradient values can be calculated
for each cluster,
y_gv1 = Iy(xi, yi) − Iy(xi, yi − 1),    (4.19)

y_gh1 = Iy(xi, yi) − Iy(xi − 1, yi),    (4.20)

y_gv2 = Iy(xi + 1, yi) − Iy(xi + 1, yi − 1),    (4.21)

y_gh2 = Iy(xi + 1, yi) − Iy(xi − 1, yi),    (4.22)
where y_gv1 is the vertical gradient for y1, y_gh1 is the horizontal gradient for y1, and Iy is the luminance channel of the input image. As the gradients are being calculated for cluster pairs, x must be even. These values can then be incorporated into the background model so that a cluster becomes

C(x, y, t, k) = [y1, y2, Cb, Cr, y_gv1, y_gh1, y_gv2, y_gh2, w].    (4.23)
The clusters along the top edge of the image have a vertical gradient of 0, and those on the left edge of the image have a horizontal gradient of 0 (as there is no pixel to subtract to obtain a gradient). The gradient values are updated in the same manner as the other values within the cluster; however, they are only used for shadow detection and optical flow calculations (see Section 4.4), as they can cause problems when matching clusters to determine motion: the gradient of a pixel at (x, y) is affected by motion at the pixels at (x − 1, y) and (x, y − 1), which would lead to additional false positives. When an area is under shadow, it
can be assumed that the surrounding pixels are also under the same (or similar)
shadow. As such, the gradient of the pixel should be the same as it is for the
background mode.
Shadow detection is added to the algorithm by adding additional constraints
when matching the incoming pixels to the clusters, and by comparing gradients.
If the initial matching constraints are not satisfied (i.e. motion is detected),
the following constraints are checked to determine if the motion is caused by a
shadow:
0 < (Cy1(k) − y1) + (Cy2(k) − y2) < τLumShad,    (4.24)

|Cb − CCb(k)| + |Cr − CCr(k)| < τChr(t) / τChrShad,    (4.25)

|y_gv1 − C_y_gv1(k)| + |y_gh1 − C_y_gh1(k)| < τGrad,    (4.26)

|y_gv2 − C_y_gv2(k)| + |y_gh2 − C_y_gh2(k)| < τGrad.    (4.27)
If there is a positive difference in the luminance that is less than the prescribed shadow threshold, τLumShad, only a small difference in the chrominance (determined by dividing the chrominance threshold, τChr(t), by an integer τChrShad) and only a small difference in the gradient, less than a gradient threshold, τGrad, then a shadow is present and motion is not detected at P. τLumShad must be greater than τLum(t), and is set to 200 in the proposed system, τChrShad is set to 2, and τGrad is set to 50. These values are fixed (i.e. they do not adapt to changing lighting conditions), and are sufficient to detect most shadows. However, strong shadows cast in bright conditions will often not be detected using these settings.
A pixel that is detected to be in shadow needs to be handled differently elsewhere
in the system. When adjusting the lighting model, and the variable threshold,
pixels that are in shadow are ignored, as the difference incurred when matching
has been significantly altered by the presence of a shadow. As it is not possible
to accurately estimate what effect the shadow has had, it is disregarded.
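A sketch of the shadow test of Equations 4.24 to 4.27, applied only when the normal cluster match fails, might look as follows. The data layout (a dictionary of cluster values) and the function name are illustrative; the threshold values follow those quoted above.

```python
def is_shadow(y1, y2, cb, cr, cluster, tau_chr,
              tau_lum_shad=200, tau_chr_shad=2, tau_grad=50, grads=None):
    """Shadow test applied when the normal match fails (Equations 4.24 to 4.27).

    cluster: background cluster values (y1, y2, cb, cr and the four gradient
    components); grads: gradients of the incoming pixel pair, or None if they
    are unavailable.
    """
    # Luminance has dropped, but by less than the shadow threshold (Eq. 4.24).
    lum_drop = (cluster['y1'] - y1) + (cluster['y2'] - y2)
    if not (0 < lum_drop < tau_lum_shad):
        return False
    # Chrominance is almost unchanged (Eq. 4.25).
    if abs(cb - cluster['cb']) + abs(cr - cluster['cr']) >= tau_chr / tau_chr_shad:
        return False
    # Texture (gradient) is preserved under the shadow (Eqs. 4.26 and 4.27).
    if grads is not None:
        if abs(grads['v1'] - cluster['gv1']) + abs(grads['h1'] - cluster['gh1']) >= tau_grad:
            return False
        if abs(grads['v2'] - cluster['gv2']) + abs(grads['h2'] - cluster['gh2']) >= tau_grad:
            return False
    return True   # treat the pixel pair as shadow, not motion
```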
4.4 Computing Optical Flow Simultaneously
Optical flow algorithms attempt to determine the motion each pixel has under-
gone from one frame to the next, rather than simply if a pixel is in motion. The
incorporation of optical flow into the proposed motion detection algorithm to
provide additional information to processes that use the motion detection (i.e.
tracking systems) is proposed.
Optical flow calculations require the previous frame to be compared to the cur-
rent to determine motion. The need for comparison with the previous frame is
avoided by maintaining a record of the matching cluster for each pixel for the
last frame, essentially storing an approximation of the last frame. The accuracy
of the approximation depends on the thresholds used in the motion detection
(tighter thresholds lead to a more accurate approximation).
Matching is performed over cluster windows and at a pixel resolution (i.e. not
cluster resolution, which is down sampled by 2 in the horizontal direction). Let
W (x1 : x2, y1 : y2, t) be the window of pixels extracted from the incoming frame
centred about (x, y) (the position of the cluster that the flow is being determined for), and W(x1′ : x2′, y1′ : y2′, t − 1) be the window of clusters from
the previous frame that will be compared to W (x1 : x2, y1 : y2, t), centred about
(x′, y′) (a possible position for the cluster in the previous frame). For a compari-
son to be made, the pixel in W (x1′ : x2′, y1′ : y2′, t−1) must have been in motion
at time t − 1. This ensures that no attempts are made to match to parts of the
background. The set of clusters that are compared then becomes
Wfgnd(x, y, t) = P (x, y, t− 1) ∈ fgnd(t− 1) (4.28)
where (x, y) ∈ W (x1′ : x2′, y1′ : y2′, t− 1),
where Wfgnd(x, y, t) is the set of clusters for the window centred at x, y that were
in the foreground last frame. The size of this set (i.e. the number of clusters in
the set) is defined as Pcount.
If this condition is not enforced, then for a cluster that lies on an object boundary,
part of the comparison for the window will be performed against the background
each frame. As the object is moving, the part of the background being compared
to will be constantly changing, and so it can be assumed that there will be no
match between these sections of the window. These poor matches may result in
the whole window being ruled a poor match, thus impeding the ability of the
system to estimate motion.
When motion is detected at a cluster, its surrounding region is examined to
determine the optical flow for that cluster. The surrounding area is analysed
outwards in rings. The centre pixel is checked first, and if a suitable match is
found, searching stops. If there is no match, then the next ’ring’ (at a distance
of one pixel) is searched in full, and so on until a match is found. Each ring
is searched in full, and the best match within the ring (if a match is present at
all) is accepted. Rings may be ’truncated’ to a pair of rows (or columns) if the
maximum horizontal and vertical accelerations are not equal (see Figure 4.11).
Figure 4.11: Search Order for Optical Flow
This method of searching attempts to minimise the acceleration of a pixel, by
taking the first match when searching outwards, rather than taking the best match
in the whole search area. Although the approach aims to minimise acceleration (constant velocity assumption), no restriction is placed on the velocity, as the pixel can continue to accelerate gradually over the course of several frames.
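The ring-by-ring search order of Figure 4.11 could be generated as in the sketch below, which yields the centre first and then successively larger rings, truncated when the maximum horizontal and vertical displacements differ. The generator name and parameters are assumptions made for illustration; a caller would evaluate every candidate in a ring and accept the best match in the first ring that contains one, which is what keeps the implied acceleration small.

```python
def ring_offsets(max_dx, max_dy):
    """Yield candidate displacements ring by ring (search order of Figure 4.11).

    The centre is tried first; each later ring holds the offsets whose largest
    component equals the ring index, truncated to a pair of rows or columns
    when the maximum horizontal and vertical displacements differ.
    """
    yield [(0, 0)]
    for r in range(1, max(max_dx, max_dy) + 1):
        ring = [(dx, dy)
                for dy in range(-min(r, max_dy), min(r, max_dy) + 1)
                for dx in range(-min(r, max_dx), min(r, max_dx) + 1)
                if max(abs(dx), abs(dy)) == r]   # keep only positions on this ring
        yield ring
```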
As the algorithm is intended to function at a pixel resolution, a method to match
across cluster boundaries is required (see Figure 4.12). As the chrominance information is down sampled, it becomes unreliable at pixel resolution and can have a negative effect on the matching. As a result, chrominance is not used for the optical flow calculations; only luminance and luminance gradient information are used for matching. When matching between two complete clusters (i.e. at an even horizontal
distance) the same matching equations as used when comparing clusters for the
motion detection (or for shadow detection in the case of the gradient information)
can be reused (see Equation 4.2 for luminance comparison and Equations 4.26
and 4.27 for gradient comparison). However, note that rather than comparing
cluster values to pixels in the input image, cluster values for the matching cluster
in the current frame are compared to cluster values for the matching cluster in
the previous frame. It is not possible to compare directly to the pixel values from
the previous frame as the previous frame is not stored.
When matching across a cluster boundary, the comparison is effectively between
one cluster in the current image to two in the previous image. In Figure 4.12, the
area marked in red is the cluster in the current image we are trying to find a match
for (C(x, y, t)), and the area marked in blue is the previous frame’s cluster we are trying to match to. The blue area is actually two clusters. C(x′ − 1, y′, t − 1)
is defined as the left most of the two, and C(x′, y′, t − 1) is the right most. The
Figure 4.12: Matching Across Cluster Boundaries - Bold lines indicate cluster groupings; the red cluster location in the current image is being compared to the blue cluster (actually split across two clusters) in the previous image.
comparison between these clusters then becomes
DiffLum(x, y, t) = |Cy1(x, y, t) − Cy1(x′ − 1, y′, t − 1)| + |Cy2(x, y, t) − Cy2(x′, y′, t − 1)|,    (4.29)

DiffGrad(x, y, t) = |C_y_gv1(x, y, t) − C_y_gv1(x′ − 1, y′, t − 1)| + |C_y_gv2(x, y, t) − C_y_gv2(x′, y′, t − 1)| + |C_y_gh1(x, y, t) − C_y_gh1(x′ − 1, y′, t − 1)| + |C_y_gh2(x, y, t) − C_y_gh2(x′, y′, t − 1)|.    (4.30)
When matching across a cluster boundary, C(x′ − 1, y′, t− 1) and C(x′, y′, t− 1)
are checked separately to determine if they were in motion at time t − 1. If
only one was, then only the portion of the comparison that involves that cluster
is performed. In this instance, Pcount is incremented by 0.5 (as only half the
comparison has been performed).
A matching score for a cluster window is obtained by calculating the average error in the luminance and gradient matches (WLum and WGrad) for all foreground pixels in the window,

WLum = (1 / Pcount) Σ_{x,y} DiffLum(x, y, t) where (x, y) ∈ Wfgnd(x, y, t),    (4.31)

WGrad = (1 / Pcount) Σ_{x,y} DiffGrad(x, y, t) where (x, y) ∈ Wfgnd(x, y, t).    (4.32)
WLum and WGrad are compared to a threshold, and if this threshold is met, and
if Pcount > 1, then a potential match has been found (whether this is the actual
match depends on what, if any, other matching windows are detected in the
current search ring). Thresholds for WLum and WGrad are fixed at 50 and 100 for
the proposed system (WGrad is twice WLum as it contains twice the comparisons).
These thresholds do not need to vary as the comparisons taking place are between
colour values recorded in consecutive frames, so there should be little variation
due to environmental changes. If Pcount is less than or equal to 1, the match is discarded
as an invalid match. The primary concern is with detecting optical flow for large
objects, and it is assumed that for these there should be adjacent pixels that are
also in motion. If such pixels are not present (i.e. a match has been made to an isolated pixel), then it is likely the match has been made to noise.
As the algorithm works at pixel resolution, flow is determined to integer precision.
Sub-pixel precision is not pursued, as this level of accuracy is not required for a
tracking application. Once movement for a cluster has been determined, its next
position is predicted. Optical flow is tracked within the system by a simple linear
prediction method. Optical flow information is propagated through the system
from frame to frame, tracking the movement of pixels over time. For each cluster
at time t that has non-zero optical flow, a prediction is propagated forward to
the expected position at time t+ 1 assuming a constant velocity model,
υx(t + 1) = υx(t) + (υx(t) − υx(t − 1)),    (4.33)

υy(t + 1) = υy(t) + (υy(t) − υy(t − 1)),    (4.34)

where υx(t − 1) and υy(t − 1) are the positions of the cluster υ in the previous frame, υx(t) and υy(t) are the positions of the cluster υ in the current frame, and υx(t + 1) and υy(t + 1) are the expected positions of the cluster υ in the next
frame. At time t+1, when the system is processing a cluster that has a prediction
associated with it, it will use the prediction provided as a starting point for the
search, improving system performance. Multiple predictions are allowed to be
propagated forward to a single pixel (i.e. multiple pixels can potentially occupy
the same position in the next frame). When determining flow at that pixel in
the next frame, the centre point of each prediction is checked first, and the best
match (if there is a match at all) is taken as the flow. If there is no match, the
surrounding areas of the each prediction is analysed as described above until a
match is found.
These predictions are stored with an accumulated average velocity (uave and vave
for the horizontal and vertical velocities respectively), and a counter (fcount) to
indicate how many successive frames the pixel has been observed in motion for.
Average velocities are calculated as
uave(t) = uave(t − 1) + (1/Lopf)(u(t) − uave(t − 1)),    (4.35)

vave(t) = vave(t − 1) + (1/Lopf)(v(t) − vave(t − 1)),    (4.36)
where u and v are the horizontal and vertical flows for the current frame, and Lopf
is the learning rate for the average optical flow. Figure 4.13 shows an example
of the optical flow for a cluster being propagated and updated over a series of
frames. At time t, the motion and optical flow are first detected, and the optical flow of the prediction is initialised with the flow detected at time t. At time t + 1 and t + 2, the average flow information is copied from the cluster that was matched to, and is updated to reflect the current state.
Figure 4.13: Optical Flow Tracking, for Lopf = 4
In the event that the optical flow cannot be determined for a cluster (i.e. a
matching window of clusters from the previous frame cannot be found), the list
of predictions for that cluster is used to estimate the flow. The prediction with
the highest counter (fcount) is assumed to indicate the likely flow for this pixel,
and is propagated through as a prediction, and a flag is set to indicate that it
is a prediction. A cluster’s flow information can only be propagated through for
Qp successive frames. Qp is kept small (< 3) as only a simple linear prediction
model is used.
If a cluster is not detected as being in motion, it is stationary and optical flow is
not calculated for that pixel.
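Equations 4.33 to 4.36 amount to a constant-velocity prediction plus a running average of the tracked flow; a minimal sketch is given below, with Lopf = 4 as in Figure 4.13. The function names are illustrative only.

```python
def predict_next_position(pos_t, pos_t_minus_1):
    """Constant-velocity prediction of a cluster position (Equations 4.33 and 4.34)."""
    x, y = pos_t
    xp, yp = pos_t_minus_1
    return (x + (x - xp), y + (y - yp))

def update_average_flow(u_ave, v_ave, u, v, L_opf=4):
    """Running average of the tracked flow (Equations 4.35 and 4.36)."""
    u_ave = u_ave + (u - u_ave) / L_opf
    v_ave = v_ave + (v - v_ave) / L_opf
    return u_ave, v_ave
```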
4.4.1 Detecting Overlapping Objects
A cluster in motion can take on one of four states within the system:
1. New - the first appearance of a cluster, its flow cannot be determined as
there is no appropriate matching cluster.
2. Continuous - the cluster is in motion and a match to the previous frame
has been found (i.e. it has been in motion for two or more frames).
3. Overlap - the cluster was in motion last frame and cannot be found this frame. The space that it should occupy this frame is occupied by another cluster that has moved in from a different direction.
4. Ended - the cluster cannot be found and there is no overlap condition, so
the motion must have ceased.
These four states are illustrated in Figure 4.14.
The state, S(x, y, t) of a pixel pair P (x, y, t) is determined using the tracked
optical flow information.
Figure 4.14: Optical Flow Pixel States. Panels: (a) New, (b) Continuous, (c) Overlap, (d) Ended.
When motion is detected at a cluster, p(x, y, t), and its optical flow is determined (U(x, y, t) and V(x, y, t)), the flow information for p(x, y, t) is obtained by copying from the previous cluster, p(x − U(x, y, t), y − V(x, y, t), t − 1), and updating it to reflect the new situation. The motion arising from the cluster in the previous frame is marked as accounted for.
At the end of the frame, any cluster whose motion from the previous frame has not been accounted for is either involved in an overlap, or its motion has ended. The prediction for the pixel is checked (if there are multiple predictions, then the prediction with the highest fcount) and if the predicted position is occupied by motion, then an overlap has occurred. If there is no motion, then the motion has ended. An overlap can only occur if the cluster that is obscured has been observed for Qv successive frames (i.e. the motion of that cluster has been observed for a period of time and is considered reliable; Qv is fixed at 3 in the proposed system). This helps to reduce erroneous overlaps. Also, for a cluster whose optical flow is obscured by an overlap, the flow information is still propagated through for Qp frames. If
new motion is then detected at the point where the obscured pixel is expected to
reappear, it is assumed that this motion is actually caused by the obscured pixel
reappearing.
The detection of overlapping pixels can aid in detecting overlaps between moving
objects in a scene.
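The end-of-frame decision between the Overlap and Ended states could be sketched as below, assuming a binary motion mask for the current frame and the prediction with the highest fcount for the unaccounted cluster. The enum and function names are illustrative, and Qv = 3 follows the text.

```python
from enum import Enum

class FlowState(Enum):
    NEW = 0         # first appearance; no matching cluster in the previous frame
    CONTINUOUS = 1  # matched to a cluster that was in motion last frame
    OVERLAP = 2     # previous motion unaccounted for; predicted spot is occupied
    ENDED = 3       # previous motion unaccounted for; predicted spot is empty

def classify_unaccounted(prediction, motion_mask, f_count, Qv=3):
    """End-of-frame test for a cluster whose motion was not accounted for.

    prediction: predicted (x, y) of the cluster (the prediction with the
    highest f_count); motion_mask: binary motion image for the current frame;
    Qv: minimum number of successive observed frames before an overlap is accepted.
    """
    x, y = prediction
    if f_count >= Qv and motion_mask[y, x]:
        return FlowState.OVERLAP
    return FlowState.ENDED
```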
4.5 Detecting Stopped Motion
A scene can be further classified into active foreground, temporarily stopped
(static) foreground, and background. Active foreground is defined as motion
that is observed to be moving through the scene, such that any given cluster of
active foreground is only a given colour for a small number of frames. Static
foreground can be defined as motion that remains a constant colour for a period
of tstatic frames or more. To discriminate between active and static foreground,
the algorithm needs to compare the current cluster at a given pixel, to the last
cluster at that location, as well as any static foreground objects that are present
there.
When C(x, y, t, κ) = C(x, y, t − 1, κ), P (x, y, t) has a static layer, Z(x, y, t, z),
initialised, where z is the depth of the layer. Each layer has a counter, c,
and a colour, (y1, y2, Cb, Cr) associated with it. For subsequent frames where
C(x, y, t, κ) = C(x, y, t− 1, κ), Zc(x, y, t, z) is incremented, otherwise it is decre-
mented. Static pixels can be defined as,
∀(x, y, t) ∈ fgnd where Zc(x, y, t, z) >= tstatic. (4.37)
Static pixels can be further organised into layers depending on when the pixel
appears. Layers can be built one on top of the other, as new objects appear and
come to a stop atop an existing static layer. Layers remain until the observed
cluster is matched to either a lower layer, or the background.
The number of static layers available, Ks, is determined by the parameters of
the background model and the requirements of the scene. At least one cluster
must be dedicated to the active foreground and there must be one cluster per
background mode. Given this, the maximum number of static layers is,
Ks = K − Kb − 1,    (4.38)
where K is the total number of clusters in the background model and Kb is the
number of background modes. Typically, Ks = 2 and K = 6.
The algorithm for detecting and updating the static layers for a single pixel is
outlined in Figure 4.15. If the pixel already has static layers, these are compared
against first. If there are no layers, or no matches to existing layers, checks are
performed to see if there is possibly a new static layer forming (last two frames
have the same colour at the pixel). If this is the case, a new static layer is created.
Figure 4.15: Static Layer Matching Flowchart
Each static layer is monitored by a counter which is updated each time step, and
used to determine the state of the layer (i.e. static, to be removed). Counters
are incremented when the layer is detected, and decremented only when a lower
level static layer (or background) is detected. When a higher level static layer
(or active layer) is detected counters are unchanged as the static layer may be
hidden below. Counters are decremented gradually to provide error tolerance
for incorrect cluster matching, or noise. The decrement rate depends on the
scene, with more challenging scenes requiring a slower decrement rate due to
the increased chance of an erroneous cluster match. Layers are removed when
the counter reaches zero, and counters are capped to guarantee that a layer can
be removed in a set number of frames. Increment, decrement, static thresholds
and caps are fixed, and are set to 1, 3, 50, and 100 for the proposed algorithm.
However, these parameters are unlikely to be optimal for all scene configurations.
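A sketch of a single static-layer counter update is shown below. It assumes the quoted parameters (increment 1, decrement 3, cap 100) are applied per frame, which is one possible reading of the text; the caller removes a layer whose counter reaches zero and treats it as static once the counter reaches tstatic = 50 (Equation 4.37).

```python
def update_static_layer(counter, matched_this_layer, lower_layer_seen,
                        increment=1, decrement=3, cap=100):
    """Update the counter of one static layer (Section 4.5).

    The counter grows when this layer is observed, shrinks only when a lower
    layer (or the background) shows through, and is capped so a stale layer
    can always be removed in a bounded number of frames. When a higher layer
    or active motion is seen, the counter is left unchanged, since this layer
    may simply be hidden underneath.
    """
    if matched_this_layer:
        return min(counter + increment, cap)
    if lower_layer_seen:
        return max(counter - decrement, 0)
    return counter
```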
The algorithm has some limitations, in that it is not possible to determine when a lower level static object leaves while a higher level static object remains, or when a lower level object moves in behind a higher level object, due to the relevant pixels being obscured.
4.5.1 Feedback From External Source
It is important to allow changes to occur in the background model as the scene varies, but it is also important to prevent foreground objects of interest from being incorporated into the background. Objects such as stopped cars may remain stationary in the scene for several minutes (or longer), in which time the clusters that model the car will accumulate enough weight that the car is considered part of the background, despite it still being of interest in the scene. One way this can be overcome is by having a very slow learning rate; however, this will then mean that legitimate changes (caused by changing light, or objects being placed in the scene that are not of interest) will also take a very long time to be incorporated into the background. An alternative method is to allow an external process (such as a surveillance system) to impose changes on the background model.
The inverse of the weight adjustment algorithm can be used to prevent the object
from being incorporated into the background model, by effectively stopping all
weight updates so that objects of interest remain in the foreground,
w′k = (L wk − Mk) / (L − 1),    (4.39)
where wk is the weight of the cluster being adjusted; L is the inverse of the learning
rate (lower values will result in background changes being incorporated faster);
and Mk is 1 for the matching cluster and 0 for all others. An external process can
be used to provide a mask image back to the motion detection algorithm, which
can be used to apply the weight reversal.
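The weight reversal of Equation 4.39, applied only inside a mask supplied by the external process, might be sketched as follows. The function name and the mask representation are assumptions made for illustration.

```python
import numpy as np

def reverse_weight_update(w, matched, mask, L=9):
    """Reverse the weight update inside a feedback mask (Equation 4.39).

    w: cluster weights after the normal update; matched: 1 for the matching
    cluster and 0 otherwise; mask: True where the external process has flagged
    an object of interest; L: inverse learning rate. Inside the mask the update
    is undone, so objects of interest are not absorbed into the background.
    """
    reversed_w = (L * w - matched) / (L - 1)
    return np.where(mask, reversed_w, w)
```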
4.6 Evaluation and Testing
The proposed algorithm is evaluated using synthetic data (see Section 4.6.1) and real world data (see Section 4.6.2). The proposed algorithm is compared with the
system proposed by Butler [18]. Motion detection results from the algorithms are
compared to ground truth images to measure the performance of the algorithms.
Performance is measured in terms of false negatives (FN, motion present in the ground truth but not detected) and false positives (FP, motion detected but not present in the ground truth). True positives, true negatives, false positives and false negatives are defined as

TP = GT(x, y) = 1 & M(x, y) = 1,    (4.40)

TN = GT(x, y) = 0 & M(x, y) = 0,    (4.41)

FP = GT(x, y) = 0 & M(x, y) = 1,    (4.42)

FN = GT(x, y) = 1 & M(x, y) = 0,    (4.43)
where GT is the ground truth image and M is the motion result image. Each of these images is a binary image.
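For completeness, the pixel counts of Equations 4.40 to 4.43 can be computed directly from the two binary images, as in the sketch below (the function name is illustrative).

```python
import numpy as np

def confusion_counts(gt, motion):
    """Pixel counts of Equations 4.40 to 4.43 from two binary images."""
    gt = gt.astype(bool)
    motion = motion.astype(bool)
    tp = int(np.sum(gt & motion))     # motion correctly detected
    tn = int(np.sum(~gt & ~motion))   # background correctly rejected
    fp = int(np.sum(~gt & motion))    # motion detected but not in ground truth
    fn = int(np.sum(gt & ~motion))    # motion in ground truth but missed
    return tp, tn, fp, fn
```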
The performance of the optical flow (see Section 4.6.3) is evaluated by attempting
to segment a moving object from a scene, and using visual inspection of the
results to determine performance. The performance of the proposed optical flow
algorithm is compared to the Lucas-Kanade [114] algorithm, the Horn-Schunck
[72] algorithm, and a block matching algorithm.
4.6.1 Synthetic Data Tests
Testing is conducted using synthetic data from the AESOS database 1. A portion of the ACV Motion Detection Database is used to evaluate the proposed system.
This database contains four separate scenes (three outdoor, one indoor) with
animated models drawn on top of the background to simulate the motion. Each
scene contains 16 sets, each of which contains the figures drawn at a different
grey level (from grey level 60 to grey level 220 in increments of 10). An example
of the database is shown in Figure 4.16.
For testing, three sets at different grey levels (80, 150, 210) from each sequence
are used to compare the systems. Thresholds for the motion detectors were set the same across the tests. Butler's algorithm has thresholds set to 50 and 30 for τLum and τChr respectively, while the proposed algorithm uses 20 and 10 as the minimum and 80 and 50 as the maximum of τLum and τChr respectively, with
initial values of 50 and 30. Each system uses K = 6 (number of clusters) and
L = 9 (inverse learning rate).
Four test configurations were run using the proposed system:
1 This database was provided by Advanced Computer Vision GmbH - ACV.
Figure 4.16: AESOS Database Example; GL is the grey level of the synthetic figures in the scene. Panels: (a) Seq 1, GL = 80; (b) Seq 2, GL = 150; (c) Seq 3, GL = 150; (d) Seq 4, GL = 210.
1. Single variable threshold for the whole system with other improvements
(shadow detection and lighting normalisation) disabled.
2. Variable threshold for each pixel with other improvements (shadow detec-
tion and lighting normalisation) disabled.
3. Single variable threshold for the whole system with other improvements
enabled.
4. Variable threshold for each pixel with other improvements enabled.
These configurations are used to assess the benefits of using a single variable threshold for the system versus a threshold for each pixel, and the effects that the other improvements have on the system when neither lighting variation nor shadows are present.
                        GL 80              GL 150             GL 210
Algorithm               FP       FN        FP       FN        FP       FN
Butler                  0.11%    48.13%    0.06%    63.00%    0.03%    78.08%
Proposed (Config 1)     0.14%    42.84%    0.10%    50.74%    0.07%    64.07%
Proposed (Config 2)     0.13%    43.43%    0.095%   51.70%    0.06%    65.17%
Proposed (Config 3)     0.12%    49.00%    0.08%    56.71%    0.07%    63.50%
Proposed (Config 4)     0.11%    49.72%    0.07%    57.66%    0.06%    64.50%

Table 4.1: Synthetic Motion Detection Performance for AESOS Set 1
Thresholds specific to the proposed system are set to τLumShad = 150, τChrShad = 2, τGrad = 50 and χ = 0.75. Note that optical flow performance is not assessed in these tests, as the data set uses textureless synthetic models to provide motion.
Tables 4.1 to 4.4 and Figures 4.17 and 4.18 show the test results and sample
output respectively.
The results show an overall increase in performance between the algorithm of
Butler[18] and the proposed configurations. There is a significant overall decrease
in the rate of false negatives (up to 14.58%) and only a small increase in false
positives (no greater than 0.23%). This improvement is most pronounced in
the datasets where the moving object colour is most similar to the background
(i.e. in set 1, 2 and 3, which are all bright scenes, this is GL210 where the
moving objects are most white). When the moving object’s colour is more distinct
from the background, the performance gain is less, and when shadow detection
and lighting normalisation are enabled, it can actually lead to a performance
                        GL 80              GL 150             GL 210
Algorithm               FP       FN        FP       FN        FP       FN
Butler                  0.06%    52.21%    0.05%    61.99%    0.07%    70.03%
Proposed (Config 1)     0.19%    45.48%    0.17%    51.66%    0.30%    57.07%
Proposed (Config 2)     0.07%    46.12%    0.06%    52.88%    0.06%    58.11%
Proposed (Config 3)     0.18%    51.97%    0.16%    54.21%    0.28%    58.03%
Proposed (Config 4)     0.07%    52.66%    0.06%    55.41%    0.06%    59.07%

Table 4.2: Synthetic Motion Detection Performance for AESOS Set 2
                        GL 80              GL 150             GL 210
Algorithm               FP       FN        FP       FN        FP       FN
Butler                  0.15%    38.74%    0.13%    46.97%    0.16%    33.11%
Proposed (Config 1)     0.23%    27.85%    0.22%    31.95%    0.24%    25.88%
Proposed (Config 2)     0.18%    38.65%    0.17%    37.25%    0.19%    27.28%
Proposed (Config 3)     0.21%    28.83%    0.20%    32.79%    0.22%    26.43%
Proposed (Config 4)     0.17%    39.52%    0.16%    38.07%    0.19%    27.81%

Table 4.3: Synthetic Motion Detection Performance for AESOS Set 3
                        GL 80              GL 150             GL 210
Algorithm               FP       FN        FP       FN        FP       FN
Butler                  0.01%    97.96%    0.01%    98.50%    0.04%    81.30%
Proposed (Config 1)     0.01%    84.24%    0.08%    90.24%    0.13%    71.15%
Proposed (Config 2)     0.02%    84.74%    0.02%    90.89%    0.07%    72.00%
Proposed (Config 3)     0.03%    93.44%    0.04%    89.68%    0.09%    70.86%
Proposed (Config 4)     0.01%    93.70%    0.02%    90.68%    0.06%    71.74%

Table 4.4: Synthetic Motion Detection Performance for AESOS Set 4
Figure 4.17: Synthetic Motion Detection Performance for AESOS Set 1, GL150 (rows show output from Butler and Configs 1-4 across four sample frames)
decrease. This performance drop off is observed primarily with the GL80 sets,
where the foreground objects are significantly darker than the background, and
can be attributed to the shadow detection falsely detecting shadows (and thus
causing a pixel in motion to be classified as non-motion). This is due in part to the nature of the synthetic dataset.
Figure 4.18: Synthetic Motion Detection Performance for AESOS Set 3, GL80 (rows show output from Butler and Configs 1-4 across four sample frames)
The shadow detection works by checking if there has been a luminance decrease, with little change in the chrominance. In datasets 1, 2 and 4 the background is primarily grayscale with little texture, so overlaying a grayscale object will (if the object has lower pixel values than the background) result in a shadow being falsely detected. The second test performed by the shadow detection is to look at the gradient change, with little change in gradient
more likely to indicate a shadow. All four datasets contain environments that are,
for large parts of the scene, devoid of texture and so have gradients of (about)
0. As the motion model also has no texture it also has a gradient of (about) 0,
satisfying the second condition for a shadow. This is shown in Figure 4.17 (fourth
and fifth rows), where it can be seen that the motion is detected effectively at the
edge of the object (where there is a gradient change), but poorly in the centre of
the object where there is little texture. The gradient change at the object edge
results in the system correctly detecting motion rather than a shadow. However,
it can be expected that this process will also result in the edges of some (if not
all) shadows being incorrectly detected as motion.
The configurations that used a single variable threshold (1 and 3) achieved a lower
rate of false negatives, while those that used a variable threshold (2 and 4) for
each pixel achieved a lower rate of false positives. Both approaches outperform
Butler [18] (which uses a fixed threshold) in terms of false negatives, but have
slightly worse performance when considering false positives.
A single threshold will be less impacted by local events in a scene, and so will
respond slower unless the entire scene undergoes a change (i.e. increased noise at
5% of pixels will have little to no effect, increased noise across the whole sensor
will cause the threshold to change). With a threshold for each pixel, if a pixel
is in motion often, and as such is adding new clusters to the model to depict
the scene state, the threshold will increase faster and may reach a point where motion is not detected as effectively. However, this same mechanism that allows
the threshold to increase faster will also result in the threshold dropping faster
when there is no motion.
Given this behaviour, it can be expected that a single variable threshold will result
in a lower rate of false negatives, as activity in a small portion of the scene (i.e.
the addition of new clusters to model the motion that is occurring) will have only
a small effect on the threshold. When using a variable threshold for each pixel,
these thresholds will respond more rapidly to motion and increase, becoming less
sensitive. This increase will result in fewer false positives.
The lighting normalisation is evaluated using the lighting variation dataset within the AESOS database, which contains three datasets that artificially depict varying degrees of lighting variation. These datasets are based on a low grey level scene from Set 1. Figure 4.19 contains an example from the database. In set 1, the lighting oscillates very rapidly, but consistently. The lighting in set 2 oscillates more slowly, but has a variable period and amplitude, and the lighting in set 3 changes in a more random fashion.
Figure 4.19: AESOS Lighting Variation Database. Panels: Set 1, frames 1-4; Set 2, frames 40, 50 and 60; Set 3, frames 50, 55, 60 and 65.
Testing is performed using the same thresholds and parameters used in the earlier tests. For the proposed algorithm, a single variable threshold is used for the whole scene and lighting normalisation is enabled, but shadow detection is disabled, as there are no shadows in the scene.
                Set 01             Set 02             Set 03
Algorithm       FP       FN        FP       FN        FP       FN
Butler          14.31%   39.19%    8.66%    43.93%    9.56%    58.29%
Proposed        0.63%    50.09%    0.77%    43.38%    1.71%    58.66%

Table 4.5: Synthetic Lighting Normalisation Performance
As demonstrated in the previous testing, the shadow detection can have an adverse effect when used with this data due to the lack of colour and texture. The proposed algorithm is once again compared to Butler's [18]. Table
4.5 and Figures 4.20 and 4.21 show the results.
Figure 4.20: Synthetic Lighting Normalisation Performance using Set 02 (input, Butler and proposed outputs at frames 250, 500, 750 and 1000)
As the results show, the proposed lighting compensation approach is effectively
able to reduce the number of false positives whilst still delivering good perfor-
mance when detecting motion pixels. The poor performance in set 1 (when
Figure 4.21: Synthetic Lighting Normalisation Performance using Set 03 (input, Butler and proposed outputs at frames 250, 500, 750 and 1000)
comparing the false negative rate of the proposed system to Butler’s [18]) can be attributed to the extremely rapid rate of lighting change. The proposed approach assumes that the luminance is changing at a constant rate,

Θ(x, y, t + 1) = Θ(x, y, t) + (Θ(x, y, t) − Θ(x, y, t − 1)),    (4.45)
where Θ(x, y, t) is the luminance at a pixel, x, y at time t. This assumption
approximately holds for real world effects such as cloud cover or AGC changes.
However in set 1, the lighting is oscillating extremely rapidly (and therefore chang-
ing direction very often) in a manner which is highly unlikely to be seen in a real
world situation. Due to the frequent changes in direction of the lighting change,
the assumption on which the proposed approach is based is frequently broken.
This causes increased errors in cluster matching, which results in more relaxed
matching thresholds and a higher false negative rate. Butler’s algorithm [18]
however, converges on the average background luminance and as a result, whilst
still producing noisy output (particularly at first when learning the average back-
ground), it outperforms the proposed approach in terms of false negatives for this
set.
4.6.2 Real World Data Tests
Testing is conducted using a 10,000 frame sequence of real world data acquired at
a public passenger drop off area. Twenty frames, which illustrate various effects such as lighting variation, shadows, temporarily stopped objects and overlapping objects, are hand segmented for comparison (it is not practical to hand segment the entire sequence).
The algorithm’s overall performance was compared to Butler’s [18] (see Table
4.6). Incorrect detection of the motion type results in a false negative (FN) and a false positive (FP) being recorded for the appropriate motion types (i.e. active foreground detected when static is expected - FN for static, FP for active; static detected in layer two when expected in layer one - FN and FP for static). The performance of the algorithm at classifying active foreground, static foreground and shadows is measured, to provide an indication of the performance of each component. Shadow detection is measured purely in terms of false positives, as it is expected that no motion should be detected at a shadow (i.e. errors only occur
when shadows are detected as motion). A simple object detector was applied to
the output of the proposed algorithm to locate large foreground objects and apply
feedback to the region they occupy. No morphological operations were applied to
the output of either system.
Thresholds for Butler's algorithm are set to 80 and 50 for τLum and τChr respectively, while the proposed algorithm uses 30 and 20 as the minimum and 130 and 80 as the maximum of τLum and τChr respectively, with initial values of 80 and
50. Each system uses K = 6 (number of clusters) and L = 9 (inverse learning rate). All other parameters are the same as used in the synthetic data tests. The
proposed system is evaluated using both a single variable threshold for the whole
system (Configuration 1) and an independent variable threshold for each pixel
(Configuration 2).
Figure 4.22: Motion Detection Results for Real World Sequence (rows: input, ground truth, Butler's, Config 1 and Config 2 at frames 1175, 1900, 2500 and 3300)
Figure 4.23: Motion Detection Results for Real World Sequence (rows: input, ground truth, Butler's, Config 1 and Config 2 at frames 5575, 6600, 7525 and 7585)
Figures 4.22 and 4.23 show a sample of the output. The top row shows the
original images; the second row shows the ground truth; third is the output
from Butler [18]; the fourth and fifth rows are the output from the proposed algorithm (configurations 1 and 2 respectively).
                 Proposed (Config 1)   Proposed (Config 2)   Butler's Algorithm [18]
                 FP        FN           FP        FN           FP        FN
Active Motion    2.68%     19.86%       1.56%     21.98%       N/A       N/A
Shadow Motion    17.37%    N/A          16.80%    N/A          64.95%    N/A
Static Motion    2.60%     35.60%       1.09%     41.84%       N/A       N/A
Total Motion     5.13%     26.70%       2.57%     30.61%       8.46%     55.49%

Table 4.6: Motion Detection Results for Real World Sequence
In the ground truth and the output from the proposed algorithm, green indicates active foreground, blue indicates static foreground, and red (in the ground truth images only) indicates shadow (which is expected to be detected as no motion in the bottom row).
As Table 4.6 and Figures 4.22 and 4.23 illustrate, the system performs well and is able to discern between static and active foreground objects, as well as cope with lighting changes (see frames 1,900, 2,500 and 3,300 in Figure 4.22) and shadows.
However, the system does struggle to deal with lighting variations where the
background is widely varied, due to the different textures in the region (i.e. the
area around the rails on the left edge of the image, see frame 2,500 and 5,575).
The shadow detection can also affect the motion detection when dark objects
enter, such as the windscreen and windows of the car in frames 7,525 and 7,585.
Despite the limitations of the proposed changes, however, they result in a significant improvement in performance, clearly reducing the rate of false positives and false negatives when compared to [18].
4.6.3 Optical Flow and Overlap Detection Evaluation
To evaluate the performance of the proposed optical flow algorithm, attempts are
made to extract people from several test images. Expected motion is determined
using the ground truth data from the CAVIAR database. The difference between
the median locations in the previous and current frame is used as the expected
average velocity of the object. Extraction is performed by finding pixels where the combined error in the horizontal and vertical flow is less than a threshold,

Imobj = |U − vx| + |V − vy| < τvel,    (4.46)
where U and V are the horizontal and vertical flow images; vx and vy are the
expected movements and Imobj is the extracted object image. τvel is set to 1.5 in
this evaluation.
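Equation 4.46 reduces to a simple thresholded comparison between the flow images and the expected velocity; a sketch is given below, with the function name chosen for illustration only.

```python
import numpy as np

def extract_object(U, V, vx, vy, tau_vel=1.5):
    """Flow-based object extraction of Equation 4.46.

    U, V: horizontal and vertical flow images; vx, vy: expected velocity taken
    from the change in the object's median location between frames; tau_vel
    follows the value used in this evaluation.
    """
    return (np.abs(U - vx) + np.abs(V - vy)) < tau_vel
```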
The performance of the proposed algorithm is compared with that of three other optical flow algorithms from the OpenCV library 2: the Lucas-Kanade [114] algorithm, the Horn-Schunck [72] algorithm, and a block matching algorithm. For
these other optical flow techniques, the input images were first converted to grey
scale. Extraction results are also masked against the motion image generated by
the proposed algorithm to improve clarity when comparing the extracted regions.
Without the masking operation, additional noise would be visible in the extrac-
tion output from the other algorithms. Due to practical considerations, the people
within the images have not been hand segmented to test performance. Instead,
we have simply visually compared the performance of the various algorithms.
As Figures 4.24 to 4.26 show, the proposed algorithm is significantly better at extracting a moving object from the scene; however, the segmentation is unable
2 The Open Source Computer Vision Library is used courtesy of the Intel Corporation and is available for public download from the World Wide Web at http://www.intel.com/research/mrl/research/opencv/.
to extract the entire object, as not all pixels within the object meet the flow
criteria. This could be overcome by either using a morphological close operation,
or increasing the threshold used (i.e. detect pixels that fall within a larger range
of flow values). However, any increase in the threshold will also result in an
increase in noise, or an increased likelihood that pixels belonging to a different
object will be detected.
The other methods suffer from discontinuities around the edge of the person (mov-
ing object), and struggle with patches of movement that are a single colour (i.e.
the person’s clothes). Perhaps their biggest problem however is that they fail to
distinguish background from foreground, resulting in the detection of movement
in the background (this can be seen in the horizontal and vertical flow images),
however as the object extraction uses the motion image as a mask, these errors
are not seen in the extracted object. These errors are brought about by the as-
sumptions made by these techniques. Due to the lighting in the scene, there are
slight fluctuations in the colour of background regions from frame to frame. Lucas
et al.[114] and Horn et al.[72] use spatial intensity gradient information, whereas
the block matching technique uses correlation between image regions to obtain
the flow. However both methods rely on the intensity of corresponding regions
in the images being very similar, and the small fluctuations result in the average
intensity of these corresponding regions varying from frame to frame. When this
occurs on a uniform, featureless surface (i.e. floors, walls), these fluctuations can
result in motion being detected. The proposed algorithm does not suffer from this
problem as it uses a variable threshold to detect motion. This threshold adapts
to the level of noise in the scene to ensure that the small fluctuations observed
under conditions such as fluorescent lighting do not adversely affect the system's performance.
The overlap detection is evaluated using sample sequences from the CAVIAR
database that contain two people overlapping. The optical flow status images that are produced are used to detect areas where there is a high proportion of flow discontinuities, likely to be caused by the edge of a moving object, either
overlapping objects or the edge of a region of motion. Only edges that run
vertically are detected in this test. Further details on the process that is used to
determine overlaps/edges can be found in Chapter 5. Examples of the flow status
image output, and the detected object edges are shown in Figures 4.27, 4.28 and
4.29.
The example sequences (see Figures 4.27, 4.28 and 4.29) show the motion detec-
tion output in the top row, the flow status in the second and the original frame
with the overlaps marked (shaded red bars) in the bottom row. The flow status
images can be interpreted as follows:
• Black - no motion is associated with this pixel.
• Green - new motion is found at this pixel (New).
• Yellow - the motion at this pixel is continuing from last frame (Continuous).
• Red - an overlap is present at this pixel (Overlap).
• Blue - motion was expected at this pixel, but could not be found (Ended).
The example sequences show that overlaps can be detected by analysis of the flow
status images (see Figure 4.27 (j) and (k), Figure 4.28 (j) and (k) and Figure 4.29
(j) and (k)). Several object edges are also detected (see Figure 4.27 (i), Figure
4.28 (i) and (l) and Figure 4.29 (l)), whilst two detections (see Figure 4.27 (l) and (j)) are erroneous, and detect an overlap/edge through the middle of an isolated person. The detections at the edges can be attributed to increased instances of pixels in the new and stopped states at the boundary of the person (i.e. a discontinuity). Further analysis of these detections, analysing the ratio of overlap pixels to new and stopped pixels, could separate these edge detections from overlaps.
The flow status images show a greater concentration of overlap pixels being detected when an occlusion is occurring. There are also isolated overlap pixels detected elsewhere, which can be partially attributed to the dataset. As can be seen in the motion detection results, the motion detection performs poorly around the legs of the people (due to the dark edge at the bottom of the shop front, which is a very similar colour to the pants worn by all subjects) and results in large amounts of new motion being detected about the legs. This results in new motion being detected at the legs in every frame, some of which forms false overlaps in later frames. Despite this, in the three samples shown only 2 false object edges are detected.
4.7 Summary
This chapter has presented a new algorithm for calculating multi-layer foreground
segmentation and optical flow simultaneously. The proposed approach uses the
motion information to help resolve discontinuities when computing the optical flow, in addition to simple short term tracking of flow vectors to improve performance. Any discontinuities that are present are also recorded for use in detecting
events such as overlaps between two moving objects. Several improvements aimed
at improving the motion segmentation performance have also been proposed:
• A variable threshold for use when matching the incoming image to the
background model.
• A lighting compensation method to handle fluctuations in natural lighting
conditions.
• Shadow detection, using luminance and gradient to avoid dark foreground
objects being classed as shadows.
• Detection of stationary foreground regions and the layering of such regions, and the separation of these from moving foreground regions for use in tracking systems.
• A feedback mechanism to allow an external process to alter background
model weights to keep objects of interest out of the background, or force
segmentation errors into the background.
The proposed approach has been evaluated on a public database (AESOS) as
well as in-house captured data, and is shown to offer significant improvements.
A comparison of the use of a single variable threshold for the whole system and a
variable threshold for each pixel has also been performed, and it has been shown
that each can be effective depending on the system requirements and nature of
the data used.
Despite the good performance observed, the proposed system has the following
limitations:
• The proposed lighting compensation technique assumes an approximately
linear rate of change for the scene lighting, resulting in a drop in perfor-
mance when strobe effects are encountered.
• Shadow detection in scenes with textureless foreground and background
and little to no colour can result in motion being incorrectly classified as
shadows.
• Stationary foreground detection can be impeded by objects arriving behind
or leaving from behind an existing stationary layer, due to the pixels of
interest being occluded.
• Optical flow is calculated at pixel resolution, and performs inconsistently
for objects travelling at less than 1 pixel per frame.
As these limitations are only encountered in highly specific situations (strobe
lighting and textureless scenes are only likely to be encountered in synthetic
data), or have no impact on the core motion detection performance (stationary
foreground and optical flow computations do not impact upon motion segmenta-
tion as they take place afterwards), these limitations are considered to be minor.
Figure 4.24: Optical Flow Performance - CAVIAR Set WalkByShop1front, Frame 1640 (panels: Input Image, Motion Image; H-Flow, V-Flow and Object output for the Proposed, LK, HS and BM methods)
Figure 4.25: Optical Flow Performance - CAVIAR Set OneStopNoEnter2front, Frame 541 (panels: Input Image, Motion Image; H-Flow, V-Flow and Object output for the Proposed, LK, HS and BM methods)
Figure 4.26: Optical Flow Performance - CAVIAR Set OneStopNoEnter2front, Frame 1101 (panels: Input Image, Motion Image; H-Flow, V-Flow and Object output for the Proposed, LK, HS and BM methods)
Figure 4.27: Overlap Detection - Example 1 (panels (a)-(l))
Figure 4.28: Overlap Detection - Example 2 (panels (a)-(l))
Figure 4.29: Overlap Detection - Example 3 (panels (a)-(l))
Chapter 5
Object Detection
5.1 Introduction
To take full advantage of a motion detection routine capable of determining op-
tical flow and distinguishing moving objects from those that are temporarily
stopped (see Chapter 4), object detection routines need to be adapted to in-
corporate the additional information. The proposed motion detection method
provides three additional cues that require special handling:
• Optical flow.
• Detection of static foreground.
• Detection of overlaps between moving objects.
This chapter will address how this information is used in the object detection pro-
cess, and evaluate the improvements that are gained by using these in a tracking
system.
5.2 Optical Flow
Optical flow can be used to aid in the detection of objects that are currently
moving in the scene. If an object has been observed for several frames, a reason-
able estimate of its velocity can be made. This velocity can be used to extract a
candidate region based on matching optical flow values, using the equation,
I_{Obj}(n, t) = |U(t) - T^n_u| + |V(t) - T^n_v| < \tau_{vel},    (5.1)
where I_{Obj}(n, t) is the candidate image based on the optical flow information for
the tracked object T^n at time t, U and V are the horizontal and vertical
flow images, T^n_u and T^n_v are the expected horizontal and vertical velocities for
the tracked object n, and τvel is the optical flow error tolerance for the object
detection. τvel is typically set to 1.5 in the proposed system.
The operation is applied over a region that corresponds to the expected position of
the target object. This position is based on the previous observed position, offset
by the expected movement. The region is padded (expanded) by a few pixels (the
exact amount varies depending on the dataset, size and speed of objects, frame
rate etc.) to account for errors in the previous detection, or changes in direction.
The extracted region is likely to be incomplete (i.e. the region may contain
holes), due to inconsistencies in the optical flow. To counter this, a morphological
close operation is performed on the region. This region is then processed in
the same manner as for regular object detection (see Section 3.3), except any
detected candidate can only be matched to the intended target track (n). Like
the previously discussed object detection procedures, any detected and matched
region is removed from the motion detection images to prevent the same motion
being assigned to multiple objects.
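As an illustration, the candidate extraction of Equation 5.1 could be implemented along the following lines (a sketch only, assuming NumPy arrays for the flow images and OpenCV for the morphological close; the function name, the 3x3 structuring element and the padding value are illustrative rather than those of the implementation used in this work):

import cv2
import numpy as np

def flow_candidate_mask(U, V, track_u, track_v, bbox, tau_vel=1.5, pad=5):
    """Extract a candidate object mask by matching the optical flow to a
    track's expected velocity (Equation 5.1), within a padded search region
    around the track's predicted position."""
    rows, cols = U.shape
    x0, y0, x1, y1 = bbox
    # Pad the expected region to allow for detection error or direction changes.
    x0, y0 = max(0, x0 - pad), max(0, y0 - pad)
    x1, y1 = min(cols, x1 + pad), min(rows, y1 + pad)

    # |U - T^n_u| + |V - T^n_v| < tau_vel, evaluated inside the search region.
    err = np.abs(U[y0:y1, x0:x1] - track_u) + np.abs(V[y0:y1, x0:x1] - track_v)
    mask = np.zeros((rows, cols), dtype=np.uint8)
    mask[y0:y1, x0:x1] = (err < tau_vel).astype(np.uint8)

    # Close small holes caused by inconsistencies in the optical flow.
    kernel = np.ones((3, 3), dtype=np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

The padded search region mirrors the expansion described above, and any holes caused by inconsistent flow are closed before the region is passed to the regular object detection stage.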
This detection process can only be applied to objects in the Active state (see
Section 3.2.1). This ensures that the object has been tracked for a suitable
number of frames to allow a reasonable estimate of the velocity to be made.
5.3 Using Static Foreground
When an object stops, the optical flow for the object (or at least a significant
portion of the object) becomes zero, and is therefore not an ideal mode for de-
tection (the background, as well as any number of other stopped objects will also
have an optical flow of zero). The static foreground output from the proposed
motion detector, and colour can be used to detect objects in instances such as
these.
To allow the system to effectively manage the tracking of moving and stationary
objects, the Active state (see Section 3.2.1 for details on the states of tracked
objects within the tracking framework) is divided into two sub-states, Moving
and Static. Figure 5.1 shows the updated state diagram that incorporates the
two types of tracked objects (moving and static). Objects can only enter and exit
the Active state as moving objects. Static objects can only occur when an object
that has been observed comes to a stop (i.e. an object cannot suddenly appear in
the image and then not move), and cannot suddenly disappear either (if detection
of a static object fails, it is assumed that this is due to the object moving, so it is no
longer static). Objects enter the static state when the average velocity calculated
according to median bounding box position drops below τstatic. τstatic is set to 0.5
in the proposed system (i.e. the median pixel moves less than 1 pixel every two
frames).
To detect objects in the static foreground image, a method to determine when an
object becomes stationary is required. This can be achieved by monitoring the
Figure 5.1: State Diagram Incorporating Static Objects
object’s average velocity over several frames (the average velocity is defined as
the average movement of the centre of the object's bounding box). When this velocity
approaches zero (less than a threshold τstatic), the object is considered stationary
and the tracked object enters the Static sub-state. Detection of this track is
now possible using the static foreground image, though it may take several more
frames for static pixels to appear in the motion images depending on the threshold
for static pixels in the motion detector (see Section 4.5). Optical flow is not used
to ascertain if an object is stationary, as a stationary object does not necessarily
have an average optical flow of zero. For example, a person might be standing
still but waving their arms, which will yield only a small (if any) change in the
bounding box (and average velocity calculated based on position), but result in
non-zero optical flow.
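A rough sketch of this stationarity test is given below (the window length is an assumption; τstatic = 0.5 as in the proposed system):

import numpy as np

def is_stationary(centre_history, tau_static=0.5, window=10):
    """Return True when the average per-frame movement of the bounding-box
    centre over the last `window` frames falls below tau_static pixels."""
    if len(centre_history) < window:
        return False
    centres = np.asarray(centre_history[-window:], dtype=float)
    step_sizes = np.linalg.norm(np.diff(centres, axis=0), axis=1)
    return step_sizes.mean() < tau_static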
When a tracked object, T n, enters the static sub-state, a template image is created
(IST,n, set to a size equal to the width and height of T n, plus a small tolerance to
account for any detection and segmentation error in recent frames, typically no
more than 3 pixels) to indicate what pixels belong to T n and their motion mode
(static foreground and the layer, active foreground, or pixel does not belong to
T n). A static tracked object may consist of some active pixels (i.e. a person may
be standing still except for their head), and some pixels may change state from
active to static and vice versa whilst the object is static (i.e. a person may be
standing still, move an arm, and then be still again).
IST,n does not store colour, as this would be redundant. Assuming that a static
pixel remains present at a given location, the colour for that pixel is unchanging,
whilst it can be assumed that any active pixels are likely to have a changing
colour. When a new static pixel is added to the IST,n, its colour is checked to see
if it is present in the histogram of T n, and is only accepted if the colour is present
in the histogram (i.e. the object contains pixels of that colour). For any active
pixels that are present, their colour is verified each frame, as there is no way of
knowing that the active pixel present at x, y at time t, is the same pixel at time
t+ 1.
IST,n is used to detect the object in subsequent frames after its creation. For each
pixel in IST,n, the algorithm checks if the state indicated by the template is still
valid (i.e. template indicates a static pixel, layer 1 - check if the static foreground
image has a static pixel at layer 1 present), and if so, this pixel has been detected
and verified. If the expected state cannot be detected, then additional states
are checked (i.e. check the active foreground). Once T n has been flagged as
stationary, and IST,n has been created, IST,n cannot be resized (i.e. the detected
object will remain the same size while stationary).
Figure 5.2 shows an example of the detection and update process using the tem-
plate image. In Figure 5.2, the input template contains pixels in both the static
foreground state (blue) and active foreground state (green). For pixels in the ini-
tial template, the system checks if the mode indicated is still valid, and if so, that
state remains. For pixels in the template where the state no longer exists (such
as the static pixels in the upper middle of the template), the system checks for
other motion modes which may be valid. The resultant updated object template
is then stored and used in the next frame.
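The per-pixel template update can be sketched as follows (a simplified illustration assuming integer state codes and a per-pixel static-foreground layer image from the motion detector; the colour verification of new static pixels and of active pixels described above is omitted for brevity):

import numpy as np

# Illustrative template states.
NOT_OBJECT, STATIC, ACTIVE = 0, 1, 2

def update_static_template(template, static_layer, static_fg, active_fg):
    """Re-verify each template pixel against the current motion output.

    template     : array of NOT_OBJECT / STATIC / ACTIVE codes for the object
    static_layer : static-foreground layer expected for this object's pixels
    static_fg    : per-pixel static-foreground layer index (0 where none)
    active_fg    : boolean active-foreground mask
    """
    updated = template.copy()
    layer_ok = static_fg == static_layer

    # Pixels expected to be static: keep them if the layer is still present,
    # otherwise check whether they have become active foreground instead.
    static_px = template == STATIC
    updated[static_px & ~layer_ok & active_fg] = ACTIVE
    updated[static_px & ~layer_ok & ~active_fg] = NOT_OBJECT

    # Pixels expected to be active: they may have become static, remain
    # active, or no longer belong to the object.
    active_px = template == ACTIVE
    updated[active_px & layer_ok] = STATIC
    updated[active_px & ~layer_ok & ~active_fg] = NOT_OBJECT
    return updated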
Figure 5.2: Static Object Detection using the Template Image
Only the template image is used for detection until the object begins to move
again. Movement can be detected by a significant increase in the amount of
active foreground present in the template image, or by a decrease in the number
of pixels detected as being part of the objects. Movement may also result in failure
to detect the object depending on scene characteristics (it is expected that for a
static object, it should be detected every frame, and a failure to detect indicates
that the object is no longer static). In the case of the object detection failing, the
object immediately ceases to be static and object detection is re-attempted using
the other detection methods. If movement is detected either by a decrease in the
number of pixels belonging to the object, or a large decrease in the number of
static pixels, then in the next frame the system will revert to the default detection
routines (see Section 3.3) to locate the object.
5.4 Detecting Overlaps
The proposed motion detection algorithm (see Chapter 4) is able to associate
a state with every pixel (no motion, new, continuous, overlap, ended), that in-
dicates the type of motion observed. This information can be used to detect
discontinuities in regions of motion, such as instances where two objects are over-
lapping. Unlike using static foreground (see Section 5.3), this process does not
require one of the objects to be stationary.
Of the five potential states for a pixel, three indicate a discontinuity:
1. New - new motion has been detected, likely cause is an object entering or
reappearing from occlusion.
2. Overlap - a motion mode that was observed last frame cannot be found and
the predicted location for it is occupied by a different mode, likely cause is
two objects overlapping.
3. Ended - motion has not been detected where it was expected, likely cause
is an object leaving or becoming obscured behind an obstacle.
All three states can also arise as a result of inaccurate optical flow computation.
One of the states indicates regular motion (continuous) and the other indicates
that there is no motion present.
To detect overlaps in an image, groups of discontinuities need to be found. A
vertical projection of the pixel states can be used to determine the amount of each
type of motion in each column of the image. Regions where there is a higher ratio
of discontinuity pixels to continuous pixels indicate that an overlap is occurring.
An example of an overlap is shown in Figure 5.3. The flow status image shows
the type of motion detected at the pixels. In this image, yellow represents the
continuous state, green is new, red is overlap and blue is ended.
Figure 5.3: Example Image Containing a Discontinuity ((a) Input Image, (b) Motion Image, (c) Flow Status Image)
Given the input images in Figure 5.3, vertical projections for the states can be
calculated using
v_{proj}(i, S) = \sum_{j=0}^{N-1} I_S(i, j),    (5.2)
where v_{proj}(i, S) is the vertical projection at column i for the pixel state S, j is
the row index and N is the number of rows (height) of the input image, and I_S is
a binary image which equals 1 for pixels that have the state S, and 0 otherwise.
Figure 5.4 (a) shows a plot of the vertical projections for these states. To detect
overlaps, the states that represent discontinuities are summed using the equation,
v_{proj}(discont) = v_{proj}(overlap) + \alpha_1 v_{proj}(new) + \alpha_2 v_{proj}(ended),    (5.3)
where vproj(discont) is the summed vertical projection, vproj(overlap), vproj(new)
and vproj(ended) are the vertical projections for the overlap state, new state and
ended state respectively, and α1 and α2 are weights applied to the new and
ended state respectively. These weights are used to compensate for situations where
the optical flow is performing poorly, and there is a large proportion of motion
detected in the new state. In the proposed system, both weights are set to 0.5.
The combined vertical projection can be compared to that of the continuous state
to determine the location of overlaps (see Figure 5.4 (b)). An overlap is detected
when,

v_{proj}(i)(discont) / v_{proj}(i)(continuous) \geq \tau_{SOv},    (5.4)
where i is the image column being analysed, vproj(continuous) is the vertical
projection for the continuous state and τSOv is the threshold for an overlap (set
to 1 in the proposed system). Figure 5.5 shows the input image with the detected
overlap area shaded.
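A minimal sketch of Equations 5.2 to 5.4 is given below (assuming the flow status image is stored as an array of small integer state codes; the codes themselves are illustrative):

import numpy as np

# Illustrative codes for the flow status image.
NONE, NEW, CONTINUOUS, OVERLAP, ENDED = 0, 1, 2, 3, 4

def vertical_projection(status, state):
    """Equation 5.2: number of pixels with the given state in each column."""
    return (status == state).sum(axis=0)

def detect_overlap_columns(status, alpha1=0.5, alpha2=0.5, tau_sov=1.0):
    """Equations 5.3 and 5.4: flag columns where discontinuity pixels
    outweigh continuous pixels as containing an overlap."""
    v_discont = (vertical_projection(status, OVERLAP)
                 + alpha1 * vertical_projection(status, NEW)
                 + alpha2 * vertical_projection(status, ENDED))
    v_cont = vertical_projection(status, CONTINUOUS)
    # Guard against division by zero in columns with no continuous motion.
    return v_discont / np.maximum(v_cont, 1) >= tau_sov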
Figure 5.4: Flow Status Vertical Projections (panels (a) and (b))
Figure 5.5: Detected Discontinuities
In some instances, discontinuities will also be detected at the edge of a moving
object (see Figure 5.6). This is caused by an increased rate of pixels in the new
and ended states at discontinuities such as object edges. It is important to be
able to distinguish between discontinuities caused by the object edge, and by
overlapping objects.
Figure 5.6: Example Image Containing Different Types of Discontinuities ((a) Input Image, (b) Motion Image, (c) Flow Status Image, (d) Detected Discontinuities)
The ratio of the pixel states that represent a discontinuity, as well as the amount
of motion detected either side of the discontinuity can be used for classification.
Figure 5.7 (b) shows the vertical projections of the different discontinuity types.
For the first discontinuity, the overlap state is highest, whilst for the second the
new state dominates. If the equation,
v_{proj}(i)(overlap) / v_{proj}(i)(new) \geq \tau_{SOvType},    (5.5)
is true, then the discontinuity is caused by an overlap, otherwise, it represents
the edge of an object.
In the event that an edge is detected, the amount of motion either side of the
discontinuity is analysed (sum of the vertical projection for n pixels either side,
n is set to 5% of the image width). If the amounts of motion are approximately
equal, then the edge is discarded as a false detection (i.e. it is not a valid edge,
and as the overlap state is less prominent than the new state it is not a valid
overlap either). For non-rigid objects (i.e. people) it is likely that discontinu-
ities will be detected at the edges (potentially every frame), as the motion of the
Figure 5.7: Flow Status Vertical Projections (panels (a) and (b))
arms and legs will violate the constant velocity assumption made in the optical
flow calculations. For a person walking through a scene, this means that a dis-
continuity may be detected in between their legs, and divide the person in two.
However, this discontinuity is likely to be caused by the detection of new motion,
and so by checking the amount of motion either side, it can be determined that
the discontinuity is in fact not at an edge and be discarded. Figure 5.8 shows the
classified discontinuities (Red - Overlap, Green - Edge).
Figure 5.8: Classified Discontinuities
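The classification step of Equation 5.5, together with the motion-balance test used to discard false edges, can be sketched as follows (the 5% window follows the description above, while the helper signature and the threshold used to judge "approximately equal" motion are assumptions for illustration):

import numpy as np

def classify_discontinuity(v_overlap, v_new, v_motion, col,
                           tau_type=1.0, side_frac=0.05, balance=0.8):
    """Classify the discontinuity at column `col` as an overlap, an object
    edge, or a false detection.

    v_overlap, v_new : vertical projections of the overlap and new states
    v_motion         : vertical projection of all moving pixels
    """
    # Equation 5.5: overlap pixels dominate the new pixels.
    if v_overlap[col] >= tau_type * max(v_new[col], 1):
        return "overlap"
    # Candidate edge: compare the motion either side of the discontinuity.
    n = max(1, int(side_frac * len(v_motion)))
    left = v_motion[max(0, col - n):col].sum()
    right = v_motion[col + 1:col + 1 + n].sum()
    if max(left, right) > 0 and min(left, right) / max(left, right) >= balance:
        return "false"   # motion roughly equal on both sides, discard
    return "edge"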
5.5 Integration into the Tracking System
The additions to the detection system are incorporated into the system as shown
in Figure 5.9.
Figure 5.9: Tracking Algorithm Flowchart with Modified Object Detection Routines (additions/changes to baseline system shown in yellow)
Known objects (those that have been detected in the last frame) are detected and
updated first, using the methods described in Sections 5.2 and 5.3 (depending on
if the object is stationary). This process is shown in Figure 5.10. For known
objects that are successfully detected, their motion is removed from the motion
images. The adjusted motion images are then processed by the object detection
routines to locate any remaining objects in the scene. These routines are modified
to incorporate the detection of overlaps as described in Section 5.4. At the end
of each frame, the locations of the known objects are used to provide feedback
to the motion detector, to ensure that motion that has been associated with an
object remains separate from the background. Motion that is not associated with
an object will gradually be incorporated into the background.
Figure 5.10: Process for Detecting a Known Object
The process that detects known objects is shown in Figure 5.10. For an object to
be detected, it must be in either the Active or Occluded state. Objects are required to be
in the system for several frames prior to this detection to allow an estimate of the
optical flow to be made for detection (static detection will not occur this quickly,
as the time required for a pixel to be considered stationary far exceeds the time
required for an object to be considered Active or Occluded).
produce one or more candidate objects which are matched to the intended target
in the same manner as the original system. If a match is found, the target is
updated. If not, the system attempts to detect the object using the standard
detection methods once all other known objects have been processed.
The process for updating the histogram is modified to handle the
multiple modes of motion. Updates that occur when the tracked object is not
stationary (i.e. in the moving sub-state) are processed as they otherwise would
be. When the object is stationary, there is a possibility that static foreground
comprises part of the object, and that the static foreground component could be
behind an active foreground component, so the colour at that pixel in the input
image is not the colour of the static layer and the object in question. As such,
when an object is stationary and has a valid static template, the template is used
as a guide to update the histogram. The pixels that the static template indicates
to be present are used when updating the histogram.
The histogram comparison that occurs when dealing with an ambiguous match
(see Section 3.4) between a detected object and two tracked objects does not need
to be modified, as it is not possible for a stationary object to be involved in such
a comparison. The process of detecting the stationary objects occurs prior to the
standard object detection (see Section 3.3), so any stationary objects will already
be matched (and thus not involved in any further comparisons). Furthermore, a
stationary object that is not detected by the stationary object detection process is
considered to be no longer stationary, resulting in the stationary object template
being reset. For objects in this situation (i.e. having ceased to be stationary as
of the current frame), it is unknown which, if any pixels are in static foreground
as the template is no longer valid and has been reset (there should be very few,
if any pixels in static foreground). As such, any comparison can only use active
foreground images, and no additional modifications are required.
5.6 Evaluation and Testing
The modified system is evaluated using the same procedure detailed in Section
3.5, and a comparison is made to the results of the baseline system (see Section
3.5.4). Details on the metrics used and annotation of the tracking output can be
found in Sections 3.5.1 and 3.5.3 respectively. Configuration files used to test the
baseline system are modified to configure the new motion detector, and detection
routines. All existing configuration parameters are left unchanged. The initial
threshold for the proposed motion detection routine is set to the threshold used
in the baseline system (see Section 3.5.2 for baseline system parameters). Values
for the new system parameters for each dataset group are shown in Table 5.1 for
parameters that are constant for all configurations, and Table 5.2 for parameters
that vary between datasets.
Parameter     Value
Motion Detection Parameters
τMinLum       30
τMaxLum       130
τMinChr       20
τMaxChr       90
τChrShad      2
τGrad         25
Object Detection Parameters
τVel          1.5

Table 5.1: System Parameters - Additional parameters for system configuration.
The proposed motion detector is capable of using either a global variable thresh-
old, or a variable threshold for each pixel (see Section 4.3.1), and it was found
Parameter     RD    BC    AP    BE
Motion Detection Parameters
τLumShad      300   100   300   300

Table 5.2: System Parameters - Additional parameters for system configuration specific to each dataset group.
that whilst the use of a single variable threshold resulted in a lower rate of false
negatives, the use of a variable threshold per pixel resulted in a lower rate of
false positives (see Section 4.6). This difference in performance results makes it
unclear which configuration is better suited to tracking, or if the type of threshold
used should be dictated by the characteristics of the data.
Testing is performed using both a single variable threshold, and an individual
variable threshold per pixel. Overall results are shown in Tables 5.3 (single vari-
able threshold) and 5.4 (individual variable threshold).
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.67                0.94                   0.53
BC         0.49                0.91                   0.46
AP         0.65                0.92                   0.60
BE         0.37                0.86                   0.31

Table 5.3: Overall Tracking System Performance using a Global Variable Threshold for Motion Detection (see Section 3.5.1 for an explanation of metrics)
Comparing the performance when using a single variable threshold for the whole
system against using a variable threshold for each pixel, it can be seen that for
the BC, the use of a variable pixel per threshold results in a performance im-
provement, whilst for the RD and AP datasets, there is a performance reduction
when using individual variable thresholds. The overall results shown for the BE
dataset are misleading. Table 5.5 shows a breakdown of the results for the BE
datasets by camera.
It can be seen that whilst a global variable threshold offers better performance
for camera 1, an individual variable threshold performs better for camera 3.
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.67                0.94                   0.53
BC         0.49                0.91                   0.49
AP         0.62                0.92                   0.59
BE         0.36                0.78                   0.31

Table 5.4: Overall Tracking System Performance using Individual Variable Thresholds for Motion Detection (see Section 3.5.1 for an explanation of metrics)
Camera   Global Variable Threshold   Individual Variable Thresholds
         DOv    LOv    TOv           DOv    LOv    TOv
C1       0.57   0.91   0.36          0.55   0.90   0.32
C3       0.16   0.81   0.26          0.17   0.66   0.29

Table 5.5: Performance of BE Cameras Using Different Motion Threshold Approaches (DOv, LOv and TOv are overall detection, localisation and tracking metric results respectively, see Section 3.5.1 for an explanation of metrics)
The varying performance of different thresholding approaches on different
datasets can be explained by the nature of each dataset. The RD and AP datasets
contain little camera noise, and do not encounter significant problems with com-
plex shadows or reflections, or objects that are difficult to distinguish from the
background. As a result, the use of a single variable threshold for the whole
scene results in a tighter threshold (fewer false negatives, more false positives),
which allows the moving objects to be more effectively segmented from the background.
As the data is not prone to excessive spurious motion, the small increase in false
positives does not have a significant impact.
The BC and BE datasets (particularly BE-C3) however contain more challenging
conditions for motion segmentation. The BC datasets contain complex reflections
and shadows on the floor of the hallway in which the dataset is captured, whilst
the BE datasets contain significant camera noise. These datasets also contain
several people with clothing a very similar colour to the background, making
detection of these people more difficult. In these situations, the ability to have a
threshold for each pixel is advantageous. In regions of the scene where there is
noise present, the threshold can quickly be raised, whilst in regions where there
is little noise, the threshold can be lowered to aid in detecting moving objects
when they enter the scene. Whilst the BC datasets don’t contain significant camera
noise, they still benefit from the individual thresholds. The BC datasets contain
large amounts of motion when compared to the other datasets, meaning that
a global variable threshold is less likely to drop quickly to help detect hard to
distinguish foreground regions. The disadvantage of this however, is that the more
sensitive thresholds make the detection and removal of shadows more difficult,
and so little to no improvement in handling shadows is noticed when compared to
the baseline system.
Given the differences between the datasets (i.e. capture environment), and their
suitability to the different modes of thresholding, the proposed tracking system
will be evaluated using a single variable threshold for the RD and AP datasets,
and variable thresholds for each pixel for the BC and BE datasets. Whilst BE-C1
does achieve better performance using a global threshold, for simplicity, and given
that BE-C3 is the more challenging camera view, individual variable thresholds
will be used for all BE datasets. In a real-world deployment of such a system,
system parameters would be tailored to the scene's requirements. As such, the
selection of an appropriate thresholding mode for each dataset is valid.
The overall results for the improved system are shown in Tables 5.6 and 5.7.
Detailed results for each dataset are shown in Appendix B.
The performance of the RD datasets (see Table B.1 and B.2 in Appendix B)
is significantly better when using the modified system incorporating the new
motion detection and detection routines. Significant increases in both the overall
detection and tracking metrics are observed, as a result of the modified system's
ability to continue to track objects once they have stopped for a period of
Data   Detection      Localisation               Tracking
Set    D1    D2       L1    L2    L3    L4       T1    T2    T3    T4    T5
RD     0.82  0.63     0.81  1.00  0.98  1.00     0.50  0.36  0.88  0.92  0.46
BC     0.80  0.41     0.73  1.00  0.97  0.99     0.49  0.30  0.74  0.83  0.36
AP     0.82  0.61     0.75  1.00  1.00  1.00     0.60  0.39  0.97  0.97  0.42
BE     0.68  0.28     0.57  0.87  0.85  1.00     0.24  0.15  0.78  0.87  0.21

Table 5.6: Tracking System with Improved Motion Detection Results (see Section 3.5.1 for an explanation of metrics)
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.67                0.94                   0.53
BC         0.49                0.91                   0.49
AP         0.65                0.92                   0.60
BE         0.36                0.78                   0.31

Table 5.7: Tracking System with Improved Motion Detection Overall Results (see Section 3.5.1 for an explanation of metrics)
time (through the use of multi-layer motion detection and feedback to prevent
the motion from being incorporated into the background). An example of this is
shown in Figure 5.11 (top row shows the output of the baseline system, bottom
row shows the output of the tracking system with the proposed modifications).
The loss of tracking on stationary objects due to them becoming part of the
background also results in invalid tracks when the objects begin to move again.
This results in objects being detected in the place where the car was, as the
motion detector has wrongly learned that the primary background mode for that
region is the car.
The use of static foreground and a detection routine to locate objects that have
stopped, and are visible in static foreground, also results in improved detection
and tracking when the objects begin to move again. Figure 5.12 shows a situation
where a car that has been parked begins to move again (on the far side of the
road). The baseline system is able to detect a car, but is unable to correctly
localise it due to the errors in motion detection caused by the car being incor-
Figure 5.11: Example output from RD7 - Maintaining Tracking of Temporarily Stopped Objects (the car on the far side of the road); frames 150-650, baseline system output (top row) and proposed system output (bottom row)
porated into the background. As a result, the car is not tracked correctly and
an object is falsely detected at the location where the car was, which results in
further tracking errors when a second car passes later on. The improved tracking
system using the proposed motion detection does not suffer from these problems,
as the parked car is never moved into the background, and so there is no false
motion when it begins to move again.
Figure 5.12: Example output from RD7 - Improved detection and localisation of objects that have been stationary for long periods of time (the car on the far side of the road); frames 2100-2400
The performance of the BC datasets is shown in Table B.3 and B.4 in Appendix
B. The proposed system suffers from many of the same problems as the base-
line system with the BC database (shallow FOV angle, complex shadows and
reflections). The improved motion detection routine, and use of shadow detec-
tion, does result in some improvement in the detection of people within the scene
(see Figure 5.13, top row shows the output from the baseline system, bottom
row is the output from the proposed system). This improvement is reflected in
the small improvements observed in the overall detection metrics and each of
the individual detection metrics, as well as increase in the tracking performance.
However the improvement in tracking performance is only small, as the frequent
total occlusions present in the database still pose a large problem.
Figure 5.13: Example output from BC16 - Improved Detection Results due to Proposed Motion Detection Routine (frames 1750-1790, baseline system output on the top row and proposed system output on the bottom row)
The main area of improvement with the proposed system is more accurate de-
tection of people, due to the improved motion detection. Figure 5.13 shows a
situation where two people are walking down the hallway, each person wearing a
white shirt that is similar in colour to the floor and walls. As a result, the baseline
system performs poorly, failing to detect portions of the shirt as being in mo-
tion. This results in only the legs of these people being properly tracked. The
proposed system is able to correctly detect the shirts as being in motion, and
segment the two people correctly. This improvement can be attributed to the
use of the variable threshold in the proposed motion detection algorithm. Whilst
this improvement does obviously lead to improvements in tracking performance,
it does not aid in the resolution of the occlusions.
Tables B.5 and B.6 (see Appendix B) show the results of the evaluation on the
AP datasets. Overall, the AP datasets perform very similarly to the baseline
system, with a small decrease in detection and small improvement in tracking
performance being observed.
Figure 5.14: Example output from AP12-C7 - Tracking Example for AP Dataset (frames 900-1100)
Figure 5.15: Example output from AP11-C4 - Tracking Example for AP Dataset (frames 375-725)
Figures 5.14 and 5.15 show examples of the tracking output for the baseline and
proposed tracking systems (top row of images are the output from the baseline
system, bottom row is from the proposed system). Throughout the AP datasets,
both systems track the objects in the scene successfully. The small performance
decrease in the detection can be attributed to the proposed system not detecting
objects as quickly (i.e. five frames slower) when they enter the scene. For the
AP12 dataset, which is captured in full sun, this is partially caused by the shadow
detection. The use of shadow detection reduces the size of the detected objects
(i.e. shadow pixels may have otherwise been included), and for small objects
at the back of the scene, this can (for the first few frames that they are present)
result in the object being too small to be considered (this could be counteracted
by adjusting object detection parameters to accept smaller objects). The small
tracking improvement results from an improvement in the T1 metric (number of
objects being tracked over time) and is a result of the more consistent motion
detection providing more accurate object locations over time.
The performance of the BE datasets is shown in Tables B.3 and B.4 in Appendix
B. The proposed system results in improved performance over the baseline in
detection, localisation and tracking. This is largely due to the improved per-
formance on the BE19 dataset, specifically when handling the parked car. As
was found in the baseline system (see Section 3.5.4), the camera angle of BE-C3
once again results in extremely poor performance in the D2, T2 and T5 metrics.
This is to be expected, as whilst the motion segmentation algorithm has been
significantly changed, the detection algorithm which uses this information has
remained unchanged.
Figure 5.16 (the output from the baseline system is shown on the top row, and
the output from the proposed system is shown on the bottom row) shows a car
that has just parked in the parking lot, with two people also moving through the
scene (one of which has got out of the car). The baseline system loses track of
the object after a few hundred frames, and also performs worse when tracking
the people.
The proposed system does however detect a spurious person at the parked car
(see Figure 5.16 (h)). Due to the noisy nature of the BE dataset, the static
motion tends to be less stable than for the RD dataset. As a result, a certain
amount of active motion is detected over the static motion. In some situations,
this can result in a false track being spawned for a short period of time.
Figure 5.16: Example Output from BE19-C1 - Maintaining tracks for stationary objects (the parked car); frames 450-750, baseline system output (top row) and proposed system output (bottom row)
There is little to no improvement when handling the occlusion in BE20 however
(see Figure 5.17, the output from the baseline system is shown on the top row,
and the output from the proposed system is shown on the bottom row).
Figure 5.17: Example Output from BE20-C3 - Occlusion handling (frames 600-800, baseline system output on the top row and proposed system output on the bottom row)
Figure 5.17 shows the occlusion in BE20, for camera 3. The proposed system is
unable to perform any better in handling this occlusion, as the combination of the
camera noise and poor viewing angle make it difficult to separate the individual
people. The proposed system also performs poorly when tracking people as they
move away from the occlusions, as the additional detection cues such as optical
flow are not able to be well used when there is a high amount of noise present (this
makes the matching process for determining the optical flow very unreliable).
Table 5.8 shows the data throughput benchmarks for the proposed tracking sys-
tem. These throughput rates are calculated under the same conditions used
when benchmarking the baseline system (see Section 3.5.4). The incorporation
of the proposed motion detection routines results in a drop in throughput for all
datasets. This is to be expected, as a significant amount of additional processing
is performed by the proposed motion detection routine, and the accompanying
object detection routines.
The RD dataset experiences the largest relative decrease (32% decrease) due to
the large amount of static foreground present in the scenes. Static foreground
results in an increase in processing time for the motion detection algorithm, and
the object detection routine to locate static objects is more computationally in-
tensive than other object detection processes. The AP and BE datasets also
contain static objects, explaining why they suffer a bigger performance drop than
the BC dataset (no static objects). However as they have less static foreground
than the RD dataset, there is less of a performance decrease.
Data Set   Frame Rate (fps)
RD         12.90
BC         11.65
AP         14.52
BE         14.26

Table 5.8: Proposed Tracking System Throughput
Despite the drop in data throughput when compared to the baseline system,
all datasets are processed at greater than 11 fps. Given that these results are
achieved executing on a single core, and that significant optimisations can be
made to the motion segmentation, it is feasible that the proposed system could
process data in real time.
5.7 Summary
This chapter has presented methods for incorporating multiple modalities of mo-
tion information (active foreground, stationary foreground, optical flow) into a
single tracking system. Detection routines capable of using the additional infor-
mation as well as an approach to integrate the routines into an existing system
have been proposed. The proposed tracking system has been evaluated using the
ETISEO [130] database and compared against a baseline tracking system (see
Chapter 3) and significant improvement has been shown.
Despite the observed improvements, the proposed algorithm is limited in that
noise in the stationary foreground image can result in false objects being de-
tected (and false tracks created), and keeping stationary foreground out of the
background relies on the object detection and tracking process finding the ob-
ject so that feedback can be applied to the motion detection algorithm. Also, the
proposed algorithm, whilst improving motion detection and object segmentation,
does not provide any additional aid in dealing with severe occlusions. However,
as the baseline system is unable to discriminate between stationary foreground
objects and moving foreground objects, or apply feedback to prevent objects from
being added to the background, these are seen as minor limitations.
Chapter 6
The Scalable Condensation Filter
6.1 Introduction
The performance of a tracking system can be improved by using more advanced
prediction methods such as particle filters. Particle filters are able to use features
that are extracted from one or more target objects to locate them in future frames.
Particle filters are not constrained by linearity and Gaussian assumptions like
Kalman filters are, and so are able to be used in a wide range of situations.
To date, many particle filter implementations have relied on using a common
feature to locate all objects in the scene [78, 135, 163], complicating the task of
maintaining the identity of the individual tracked objects. Manual initialisation
of tracks [163, 168] is also common, and limits the use of such systems in real
world applications.
In this chapter the Scalable Condensation Filter (SCF) is proposed and the pro-
cess by which it can be integrated into the tracking system proposed in Chapters
3 and 5 is described. The SCF is a variant of the mixture particle filter [163],
that is able to alter the number of particles used by each independent mixture as
the scene changes, and allows the features used (and type of feature used) to vary
between tracks and from frame to frame, according to the system needs. This
allows the system to be more efficient as high particle counts and high complexity
features are only used when the scene is sufficiently complex.
The SCF is integrated into the proposed tracking system such that the proposed
object detection methods are able to be used to instantiate new tracks, update
existing tracks when possible, and provide additional stimulus to the individual
mixture distributions based on the current detection results. This integration
also aids in maintaining the identity of the individual mixtures as it allows each
mixture to use its own set of features. Importantly, these features are also able
to be updated on a frame to frame basis (according to how often the object can
be detected and updated), such that if the appearance of a tracked object does
change the features used are also able to change, helping to improve tracking
performance. Finally, the integration with the proposed tracking system also
allows for tracks to be removed from the SCF when appropriate.
The SCF is designed such that the type and number of features used by each
mixture component (each of which represents a tracked object) as well as the
number of particles used by each mixture component can vary from frame to
frame and mixture component to mixture component. This allows the SCF to
adapt to the needs of the scene.
6.2 Scalable Condensation Filter
A condensation filter[77] is used to aid in tracking objects within the system. The
Scalable Condensation Filter (SCF), an extension of the Mixture Particle Filter
(MPF)[163] and Boosted Particle Filter (BPF)[135], is proposed. A single filter
is used for the entire system, and the particle count is scaled according to the
number of objects being tracked. In addition, the number of particles for each
track is allowed to vary according to the complexity of the surrounding area,
and resampled in such a way that ensures that particles for a track (and thus
the track itself) are not lost owing to re-sampling (see Section 6.2.1). Features
used for tracking may also vary from frame to frame, and track to track. This
is important when tracking different classes of object within the one system (i.e.
people and vehicles) which may be better suited to different features, and also
allows objects that are in more complex regions of the scene (i.e. occluded) to
use more advanced features.
The distribution modelled by the SCF is the sum of the distributions of the
individual tracks,
p(x_t | x_{t-1}) = \sum_{i=1}^{N} p_i(x_{i,t} | x_{i,t-1}),    (6.1)
where pi(xi,t|xi,t−1) is the component distribution for a single track, i, and N is
the total number of tracks within the system.
For the tracking system proposed in this thesis, the SCF particles are four dimen-
sional and describe a bounding box (a centre position - x and y pixel coordinates,
and the height and width),
s_{i,n,t} = \{x, y, h, w\},    (6.2)
where n is the index of the particle s, in the range [0..N−1]. Each variable is free to
move within the dimension limits, {dmin, dmax}, which are defined by the system
(i.e. the limits of x and y are governed by the image size). The distribution of each
dimension is Gaussian, and independent from the other dimensions. The standard
deviation of each dimension is equal to the maximum expected movement of a
dimension from one frame to the next, emax.
6.2.1 Dynamic Sizing
Rather than have a fixed number of samples for the filter, the sample count is
dynamically altered as object’s enter and leave the scene, and as people move
about and occlude one another. For each track, an arbitrary number of samples,
νinit, are created about the object’s initial position and associated with that
object,
s_{i,n,t} = \gamma_i + 3 \times \rho    (6.3)
where si,n,t is the new sample, γi is the new object's state, and ρ is a vector
of random values, in the range −emax to +emax. Note that the values of emax
potentially vary for each dimension (i.e. x, y, h and w, may be expected to
change at different rates). A multiplier of 3 is used when initialising the particles
to ensure that the initial distribution is not too closely packed about γi (the
object’s initial position).
The particles initially associated with the given track remain associated with that
track for the duration of that track's life, and the particle count for any individual
track cannot be diminished unless it is specifically desired. This initialisation
gives each tracked object a set of samples to model it immediately, rather than
needing to allow a period of frames for the system to adapt to its presence. When
an object leaves, νinit samples are removed from the system.
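Initialisation of a track's particles (Equation 6.3) might look like the following sketch (NumPy is assumed, the state ordering follows Equation 6.2, and the default value of νinit is illustrative):

import numpy as np

def init_track_particles(gamma, e_max, nu_init=100, rng=None):
    """Create nu_init particles spread about a new track's initial state
    (Equation 6.3).

    gamma : initial state {x, y, h, w} of the new track
    e_max : maximum expected per-frame change in each dimension
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.asarray(gamma, dtype=float)
    e_max = np.asarray(e_max, dtype=float)
    # rho is uniform in [-e_max, +e_max]; the factor of 3 widens the initial
    # spread so that the particles are not packed too closely about gamma.
    rho = rng.uniform(-e_max, e_max, size=(nu_init, gamma.size))
    return gamma + 3.0 * rho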
When tracked objects are close together, additional particles can be added and
more advanced features can be used to aid in the tracking. Three levels of occlu-
sion are defined for each track:
1. Level 0 (No Occlusion) - The tracked object is isolated within the scene,
there are no other objects nearby.
2. Level 1 (Object Nearby) - Another tracked object’s bounding box is within
a distance τnearby. τnearby is set at a fixed value for each class and depends
on the expected size of the objects being tracked.
3. Level 2 (Overlap) - Another tracked object’s bounding box is overlapping.
When a track is first created, and added to the SCF, it is at occlusion level 0
and is created with the standard number of particles (νinit). For each occlusion
level increase, an additional νadd particles are added to the SCF for that track;
and νadd samples are removed for each occlusion level decrease. Additional (or
fewer) levels of occlusion could be defined based on the proximity of tracks, and
the severity of the overlap.
Figure 6.1 shows a system that is tracking two objects (blue and yellow). At
time t, these objects are suitably far apart that there is no occlusion, and so each
object is tracked with the standard number of particles (4). At time t + 1, the
objects are considered to be in an occlusion state. The system over-samples the
particle set to generate a set of 8 particles for each of the tracked objects. At
time t + 2, the occlusion has passed, and so the sample sets are under-sampled
such that each object is once again tracked by the standard number of particles.
Particle counts for tracked objects are altered during the re-sampling procedure
by either under-sampling or over-sampling. Resizing the system in this manner
ensures that no unnecessary updates are done (i.e. object 1 may have ceased
to be occluded by object 2, requiring a drop in the number of particles, but
have become occluded by object 3, requiring an increase in particles, such that
Figure 6.1: Dynamic Sizing of Particle Filter
ultimately no net change in the particle count is required), and improves CPU
utilisation.
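The occlusion levels and the resulting per-track particle budget can be sketched as below (bounding boxes are assumed to be (x0, y0, x1, y1) tuples; the values of τnearby, νinit and νadd are illustrative):

def occlusion_level(bbox, other_bboxes, tau_nearby=20.0):
    """Return 0 (isolated), 1 (another object nearby) or 2 (boxes overlap).
    Bounding boxes are (x0, y0, x1, y1) tuples."""
    def overlaps(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    def gap(a, b):
        dx = max(b[0] - a[2], a[0] - b[2], 0)
        dy = max(b[1] - a[3], a[1] - b[3], 0)
        return (dx * dx + dy * dy) ** 0.5

    level = 0
    for other in other_bboxes:
        if overlaps(bbox, other):
            return 2
        if gap(bbox, other) <= tau_nearby:
            level = 1
    return level

def particle_budget(level, nu_init=100, nu_add=50):
    """nu_init particles at level 0, plus nu_add for each occlusion level."""
    return nu_init + level * nu_add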
A Sequential Importance Re-sampling (SIR)[44, 139] procedure is used to update
the sample set. Each new particle is adjusted according to a motion model
associated with the tracked object responsible for the particle. The expected
movement according to this motion model (based on a window of Q previous
observations) is added to the particle as well as a noise vector,
s_{i,n,t+1} = s_{i,n,t} + \lambda_i + \rho,    (6.4)
where s(i,n,t+1) is the nth sample for track i at the next time step; s(i,n,t) is the nth
sample for track i at the current time step; ρ is the noise vector, which is within
the range [−emax..+ emax], and λi is the expected movement for the track, i. As
part of all particle updating and creation, a set of limits is applied to each
particle, to ensure that it describes a valid object (if a dimension exceeds a limit,
it is set to the limit). Whilst SIR would ensure that any particles that describe
invalid objects are not propagated (they would have 0 probability), performing
this test on the particles at this point avoids the need to check for valid image
coordinates when matching features, which allows the system to be more efficient.
Normally when re-sampling using SIR, random values in the range [0..1] are
selected and mapped to the corresponding particle according to the cumulative
probability in order to select the particles for re-sampling. For the SCF, re-
sampling is performed on a track by track basis, where the random value selected
is in the range that corresponds to the cumulative probability range of the track’s
particles. Figure 6.2 shows an example of this.
Figure 6.2: Sequential Importance Re-sampling for the SCF
In Figure 6.2, the SCF distribution contains particles for two tracks. Particles
for each track are stored in a continuous block. When re-sampling, particles for
the first track are re-sampled first by selecting random values in the range [0..x].
Once the correct number of particles have been re-sampled for the first track, the
second track is re-sampled by selecting random values in the range [x..1]. This
process can be easily scaled for additional tracked objects.
This approach relies on ensuring that the particles for each track are stored
continuously in the particle list (i.e. the list cannot have νinit particles for track
1, νinit particles for track 2, and then another νadd particles for track 1). Provided
particles are only added to and removed from the filter during re-sampling, this
is easily achieved.
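A minimal sketch of the prediction step (Equation 6.4) and of the track-by-track re-sampling is shown below (NumPy is assumed; each track's particles and weights are taken to be stored as contiguous arrays, and the helper names are illustrative):

import numpy as np

def predict(particles, lam, e_max, limits, rng=None):
    """Equation 6.4: move each particle by the track's expected motion plus
    uniform noise in [-e_max, +e_max], then clamp to the dimension limits."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(-e_max, e_max, size=particles.shape)
    return np.clip(particles + lam + noise, limits[0], limits[1])

def resample_track(particles, weights, n_out, rng=None):
    """Sequential Importance Re-sampling restricted to one track's block of
    particles; n_out may differ from the current count to grow or shrink
    the track as its occlusion level changes."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    idx = rng.choice(len(particles), size=n_out, p=w / w.sum())
    return particles[idx]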
6.2.2 Dynamic Feature Selection and Occlusion Handling
Each track is able to use multiple features. Using inheritance and polymorphism,
the types of features used by tracks can be allowed to vary depending on circum-
stances and the class of object being tracked, without any change required in the
condensation filter itself. This approach allows different types of objects to use
features more suited to their individual properties.
Two types of features, each of which has various sub-types, are proposed:
1. Histograms
2. Appearance Models
Each of these features can optionally use motion detection and optical flow as
additional aids (i.e. a pixel must be in motion and must be moving in the same
direction as the object being tracked), and this can be changed dynamically de-
pending on the system's status (i.e. if motion detection is unreliable for a period of
time due to environmental effects, this can be omitted when matching features).
Features are initially built when the tracked object is first detected, and are up-
dated every subsequent frame. As such, a track’s features learn the appearance
of the track and are able to vary over time to accommodate any changes that
may occur in the objects appearance. As the appearance models are computed
and compared in different manners, and model different aspects of the objects
appearance, they are assumed to be independent.
Histograms simply model colour distributions, and so while being quicker to
compute, do not take geographical information into consideration (i.e. a per-
son wearing blue pants and a red shirt will have a very similar histogram to a
person wearing red pants and a blue shirt, despite having a distinct appearance).
Appearance models encode position information as well as colour information,
and so are more discriminative. They are however more processor intensive.
The features used by the system are varied as the complexity changes. A his-
togram feature is used by default, and when a track’s occlusion level increases
above 0 (see Section 6.2.1), an appearance model feature is used as well. When
multiple features are used, the probability for the particle is the product of the
probabilities for each feature,
w_{i,n,t} = \prod_{j=1}^{M} p(z_{j,i,t} | x_{i,t} = s_{i,n,t}),    (6.5)
where wi,n,t is the weight of particle n for track i at time t, M is the total number
of features for track i, zj,i,t is the jth feature for track i (xi,t), and si,n,t is the
particle from xi,t’s distribution we are matching the feature to.
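Under this independence assumption, Equation 6.5 reduces to a simple product of feature likelihoods, e.g. (feature objects with a likelihood method are an assumed abstraction, not the interface of the implementation):

def particle_weight(features, particle):
    """Equation 6.5: product of the independent feature likelihoods."""
    w = 1.0
    for feature in features:
        w *= feature.likelihood(particle)   # p(z_j | x = s)
    return w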
When a track’s occlusion level reaches two (occlusion occurring), the process to
calculate weights is altered. In such a situation, the tracked object can either
be obscured (i.e. the blue object in Figure 6.3), or be obscuring another object
(i.e. the yellow object in Figure 6.3). Each of these situations must be handled
differently.
Figure 6.3: A Typical Occlusion between Two Objects
When an object is obscured (partially or fully), the SCF is likely to receive poor
responses to its features as it cannot be seen. To overcome this, features can be
matched only to regions where it is believed that an occlusion is not taking place,
and a fixed probability can be used when considering regions that cannot be seen.
At the end of processing each frame, the tracking system determines which objects
are in occlusion and updates the SCF with this information. For objects that
are partially obscured, the locations (bounding box) of the obscuring objects
are passed. When evaluating features, these regions are avoided, to prevent the
features being matched to a mix of the target object, and any objects that may
be obscuring the object. This however will also result in a reduced probability,
which when the occlusion is suitably severe (i.e. half of the object of more is
hidden) may still lead to SCF losing the track. To overcome this, regions that
are obscured are assigned a fixed probability, β. Using this process, the weight
of a particle belonging to an object that is occluded becomes,
w_{i,n,t} = \alpha_{vis} \prod_{j=1}^{M} p(z_{j,i,t} | x_{i,t} = s_{i,n,t}) + (1 - \alpha_{vis}) \times \beta,    (6.6)
where αvis is the fraction of the object that was visible in the last frame, and β
is a constant between 0 and 1 that denotes the probability of an occluded object
being at an obscured location (typically set to 0.5). αvis is needed to ensure that
the probability for a particle does not exceed 1, and the constant, β, is used to
ensure that when an object is totally (or almost totally) occluded, the probability
for the object does not drop to zero and the track is not lost.
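Equation 6.6 can then be sketched as (reusing the same assumed feature abstraction; β = 0.5 as in the proposed system):

def occluded_particle_weight(features, particle, alpha_vis, beta=0.5):
    """Equation 6.6: blend the feature response over the visible fraction of
    the object with a fixed probability beta for the obscured fraction."""
    w = 1.0
    for feature in features:
        # Features are evaluated only over unoccluded regions (Figure 6.4).
        w *= feature.likelihood(particle)
    return alpha_vis * w + (1.0 - alpha_vis) * beta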
Figure 6.4: Calculating particle weights for occluded objects ((a) Particle Location, (b) Motion Image, (c) Occlusion Map, (d) Motion Image with Occluding Regions Removed)
Figure 6.4 shows an example where the weight for a particle (the red box in (a))
is being calculated for an object that is partially obscured (the blue object in (a),
obscured by the yellow object). The expected location of the occluding object(s)
is used to create an occlusion map (c), which shows the regions where the target
object is expected to be obscured. The occlusion map is also used to alter the
motion image for the target. The original motion image (b) has the motion in
the region where the occluding object is expected to be removed, resulting in the
motion mask shown in (d).
When one object obscures another, it is still entirely visible, so there is no danger
that the occlusion will result in the SCF receiving poor responses to the track’s
features due to the object not being visible. If, however, the objects involved
in the occlusion have a similar appearance, there is a possibility that the SCF
will begin to track multiple objects within the one mixture component, possibly
leading to object identities being swapped or lost.
To overcome this problem, the idea of negative features is proposed. Within the
SCF, the probability of a particle is given by Equation 6.5. However, in an
occlusion there is a risk that, if the occluded object is of a similar appearance to
the target (occluding) object, the particle weights may be affected by a partial
match to the occluded object. When a severe occlusion is taking place, the
mixture component associated with the occluding object uses the features of the
occluded object in a negative capacity, such that the particle weight becomes,
w_{i,n,t} = \prod_{j=1}^{M_j} p(z_{j,i,t} \mid x_{i,t} = s_{i,n,t}) \times \left(1 - \prod_{k=1}^{M_i} p(z^{-}_{k,i,t} \mid x_{i,t} = s_{i,n,t})\right),    (6.7)
where Mj is the number of positive features for the track, Mi is the number of
features for track i that are to be used in a negative way, and z−k,i,t is the kth
negative feature for track i. Note that the negative features are not actually
created by, updated by, or owned by track i. Rather they are copied from the
other object(s) involved in the occlusion. This results in the SCF yielding a
higher response to regions that match the appearance of the target object well
and match that of the other object(s) involved in the occlusion poorly.
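The negative-feature weighting of Equation 6.7 can be sketched as follows; the probabilities passed in are assumed to come from the usual feature evaluation.

def weight_with_negative_features(positive_probs, negative_probs):
    """Particle weight using negative features (Eq. 6.7).

    positive_probs: likelihoods of the track's own (positive) features.
    negative_probs: likelihoods of features copied from the other object(s)
                    involved in the occlusion, used in a negative capacity.
    """
    pos = 1.0
    for p in positive_probs:
        pos *= p
    neg = 1.0
    for p in negative_probs:
        neg *= p
    return pos * (1.0 - neg)

# A region matching the target well (0.8) but the occluded object poorly (0.1)
# scores higher than one matching both objects well (0.8 and 0.7).
print(weight_with_negative_features([0.8], [0.1]))  # 0.72
print(weight_with_negative_features([0.8], [0.7]))  # 0.24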
As each tracked object has its probabilities normalised and particles re-sampled
separately, there is no danger of the additional matching constraints (using ad-
ditional features, negative features, or measures to overcome the object being
obscured) reducing a track’s probabilities to the extent that the track’s particles
are removed from the system by the re-sampling procedure. It is feasible that
multiple or different appearance models and histograms could be used for each
track under appropriate circumstances.
6.2.3 Adding Tracks and Incorporating Detection Results
When a new object enters the scene, the SCF needs to begin tracking it. The
MPF [163] uses K-means clustering to detect split and merge events within the
existing modalities to detect and initialise new tracks, whilst the BPF [135] relies
on adaboost detection results being incorporated into the distribution to initialise
new tracks. As the SCF is designed to be integrated into an existing tracking
system, the discovery of new modes can be handled by the underlying tracking
system. When the detection methods utilised within this system detect a new
track, a new component mixture corresponding to the new track is added to the
SCF.
Like the boosted particle filter [135], the SCF may also use detection results from
the object detection routines in the tracking system when calculating particle
weights. Within the BPF, this is intended to help initialise new modes for track-
ing. Within the SCF, it is used to provide additional information to the system,
to help the component mixtures better represent and maintain the locations of
the tracked objects. The mixture distributions are combined using an additive
process (as is used in the BPF),
p'(x_t \mid x_{t-1}) = \sum_{i=1}^{N} \alpha_i q_i(x_{i,t} \mid x_{i,t-1}, y_{i,t}) + (1 - \alpha_i) p_i(x_{i,t} \mid x_{i,t-1}),    (6.8)
where qi is a Gaussian distribution dependent on the current observations and
matches from the tracking system, yi,t, for a specific track, i, and N is the num-
ber of tracked objects. The mixture q_i(x_{i,t} | x_{i,t-1}, y_{i,t}) is created by summing
Gaussians for each object detection (y_{i,t}) recorded for the tracked object (x_{i,t}).
The standard deviation of the Gaussian is set to emax.
The value of αi varies between different tracks, and is set according to the occlu-
sion level. Object detections that arise from tracks that are at a lower occlusion
level are weighted higher (larger α) as they are less likely to be in error. De-
tections that arise during occlusion are less reliable, and so are weighted lower.
Within the proposed system, weights for α are set to 0.5, 0.25 and 0.125 for oc-
clusion levels of 0, 1 and 2 respectively. For each track, multiple detection results
may be used when updating each distribution (see Section 6.4).
This update allows the mixture distributions to become multi-modal themselves
temporarily, to help cope with ambiguities.
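A one-dimensional sketch of how detections might be mixed into the re-sampling step (Equation 6.8); the Gaussian proposal, the simplified particle representation and all values used are assumptions made for the example, not the actual SCF implementation.

import numpy as np

def sample_proposal(prev_particles, detections, alpha, e_max, rng):
    """Draw new particle states from the additive mixture of Eq. 6.8 (1-D sketch).

    With probability alpha a particle is drawn from a Gaussian centred on a
    detection reported by the tracking system (std. dev. e_max); otherwise it
    is drawn from the motion-propagated prediction of a previous particle.
    alpha is set from the track's occlusion level (0.5, 0.25 or 0.125).
    """
    new_particles = []
    for _ in range(len(prev_particles)):
        if detections and rng.random() < alpha:
            centre = rng.choice(detections)            # q_i: detection-driven component
            new_particles.append(rng.normal(centre, e_max))
        else:
            base = rng.choice(prev_particles)          # p_i: prediction from previous state
            new_particles.append(rng.normal(base, e_max))
    return new_particles

rng = np.random.default_rng(0)
print(sample_proposal([100.0, 102.0, 98.0], [110.0], alpha=0.5, e_max=2.0, rng=rng))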
Figure 6.5: Scalable Condensation Filter Process
The updating of particle weights based on observed detections is added in to the
process as shown in Figure 6.5. The usual condensation filter update is performed
(resample, re-weight based on features), after which the object detection and
updating of tracked objects takes place. These detection results can then be
incorporated into the SCF distribution at the end of the current time step (t), as
shown in Figure 6.5, such that at time t+ 1, when the distribution is re-sampled,
the detection results will influence the re-sampling process.
6.3 Tracking Features
The system uses two types of tracking features, histograms and appearance mod-
els. An appearance model is defined as a model that encodes both intensity and
position information, whilst a histogram simply encodes colour information.
An appearance model is proposed that utilises the proposed motion detection
routine (see Chapter 4), by incorporating colour, motion state, and optical flow
into a single model. The appearance model, A, is a grid of Ax by Ay squares,
with an average colour (Ac(k), where k is the colour channel), velocity (Au and
Av for the horizontal and vertical velocity respectively, derived from the optical
flow), and motion occupancy (Am) stored for each square. An error value for
the colour (Aec) and optical flow (Aeopf ) is also stored for each square. The input
image, I(t), is divided into a grid of dimensions Ax by Ay. It is assumed that
these dimensions will be significantly smaller than those of the input images (see
Figure 6.6). It is also assumed that the object detection results will be similar
from frame to frame (either correct or consistently incorrect), to ensure that
the contents of each square are reasonably consistent from frame to frame (changes
are expected over time as the target moves about, however these should
occur over several frames, rather than as a dramatic change from one frame to the
next).
Figure 6.6: Dividing input image for Appearance Model
For each grid square in I(t), the average colour, percentage of motion, and optical
flow (horizontal and vertical) are computed,
F_c(x', y', t, k) = \frac{1}{card(M(x, y, t))} \sum I(x, y, t, k) \text{ where } (x, y) \in M(t),    (6.9)

F_u(x', y', t) = \frac{1}{card(M(x, y, t))} \sum U(x, y, t) \text{ where } (x, y) \in M(t),    (6.10)

F_v(x', y', t) = \frac{1}{card(M(x, y, t))} \sum V(x, y, t) \text{ where } (x, y) \in M(t),    (6.11)

F_m(x', y', t) = \frac{card(M(t))}{card(I(t))},    (6.12)
where F is a feature extracted for the current image, x′, y′ are in the range
[0..Ax − 1, 0..Ay − 1], U and V are the input horizontal and vertical flow images,
M is the input motion image and M(t) is the set of all pixels that are in motion,
and x, y is in the range that corresponds to the grid square x′, y′.
Given the features for the incoming image, the appearance model components
are updated according to the equation,
A(t+1) = A(t) + (F(t) - A(t)) \times L,    (6.13)

where L is the learning rate. L is defined as,

L = \frac{1}{T} \text{ for } T < W,    (6.14)

L = \frac{1}{W} \text{ for } T \geq W,    (6.15)
where W is the number of frames used in the model, and T is the number of
updates performed on the model. This ensures that the image that the model is
initialised with does not dominate the model for a significant number of frames.
Instead, the information is incorporated quickly when the model is new to provide
a better representation of the tracked object being modelled sooner.
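A small sketch of the update rule of Equations 6.13 to 6.15 for a single scalar component of the model; the window size and observation values are illustrative only.

def update_appearance_value(current, observed, updates_done, window):
    """Running update of one appearance model component (Eqs. 6.13-6.15).

    The learning rate is 1/T while fewer than W updates have been applied
    (so early observations are absorbed quickly), then settles at 1/W.
    """
    T = updates_done + 1                      # index of this update (assumed convention)
    L = 1.0 / T if T < window else 1.0 / window
    return current + (observed - current) * L

# The value initialised from the first frame (50) does not dominate for long:
value = 50.0
for t, obs in enumerate([80.0, 80.0, 80.0, 80.0]):
    value = update_appearance_value(value, obs, updates_done=t + 1, window=30)
print(round(value, 2))  # moves quickly towards 80 while the model is new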
An error measure is kept for both the optical flow and colour components of the
model,
F^e_c(x', y', t) = \sum_{k=1}^{K} |A_c(x', y', t, k) - F_c(x', y', t, k)|,    (6.16)

F^e_{opf}(x', y', t) = |A_u(x', y', t) - F_u(x', y', t)| + |A_v(x', y', t) - F_v(x', y', t)|,    (6.17)

where F^e_c and F^e_{opf} are the frame errors for colour and optical flow respectively,
and K is the number of colour channels in the appearance model.
The errors are updated over time using equations 6.14 and 6.15. The cumulative
error is used as an approximation to the standard deviation (it is assumed that
the observations over time form a Gaussian distribution) of the error, as it is
not practical to re-compute the standard deviation each frame, and not ideal
to assume a fixed standard deviation. Given that the standard deviation for a
sample set is defined as,
\sigma = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\mu - s_n)^2},    (6.18)
and in the proposed appearance model, for each grid square there is one obser-
vation at each time step (N = 1), so the standard deviation at a given time step
is,
\sigma = \sqrt{(\mu - s)^2} = |A(x', y', t) - F(x', y', t)|,    (6.19)
which is the proposed error measure.
When matching the model to an input image, average colour, flow and motion
occupancy is computed for the image in the same manner as for an update.
Errors for the colour and optical flow are calculated and these are compared to
the cumulative errors for the model,
G_c(x', y', t) = \frac{F^e_c(x', y', t)}{A^e_c(x', y', t)},    (6.20)

G_{opf}(x', y', t) = \frac{F^e_{opf}(x', y', t)}{A^e_{opf}(x', y', t)},    (6.21)

where G_c(x', y', t) and G_{opf}(x', y', t) are the number of standard deviations from
the mean that the observation (input image) is. A^e_c(x', y', t) and A^e_{opf}(x', y', t) are
determined using Equation 6.13.
determine the probability that these observations have arisen from the model,
which yields P (Fc(x′, y′, t)|Ac(x′, y′, t)) as the probability that the colour ob-
servation belongs to the distribution described in the appearance model, and
P (Fopf (x′, y′, t)|Aopf (x′, y′, t)) as the probability that the optical flow observation
belongs to the distribution described in the appearance model.
The probability that a given grid square matches the corresponding area in the
input image is then defined as,
P(F(x', y', t) \mid A(x', y', t)) = P(F_c(x', y', t) \mid A_c(x', y', t)) \times P(F_{opf}(x', y', t) \mid A_{opf}(x', y', t)),    (6.22)
where F (x′, y′, t) is the set of features for a grid square in the input image, and
A(x′, y′, t) is the set of features for a given grid square in the appearance model.
The motion occupancy component of the model is used as a weight when com-
puting the match across the whole model. A higher motion occupancy indicates
that there is more motion, and thus more information, in a given grid square.
Given this, the match for the model to an input image is,
P(I(t) \mid A(t)) = \frac{\sum_{x'=1; y'=1}^{x'=A_x; y'=A_y} P(F(x', y', t) \mid A(x', y', t)) \times A_m(x', y', t)}{\sum_{x'=1; y'=1}^{x'=A_x; y'=A_y} A_m(x', y', t)}.    (6.23)

A_m(x', y', t) is determined using Equation 6.13.
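The matching process of Equations 6.20 to 6.23 can be sketched as below, assuming the normal distribution look-up returns the two-sided tail probability for a deviation of G standard deviations; all function names here are illustrative, not the thesis implementation.

import math

def gaussian_tail_prob(num_std_devs):
    """Probability of an observation at least this many std. devs from the mean
    (stands in for the normal distribution look-up table)."""
    return math.erfc(abs(num_std_devs) / math.sqrt(2.0))

def square_match_prob(frame_err_colour, model_err_colour,
                      frame_err_flow, model_err_flow):
    """Probability that one grid square matches the model (Eqs. 6.20-6.22)."""
    g_c = frame_err_colour / max(model_err_colour, 1e-6)
    g_opf = frame_err_flow / max(model_err_flow, 1e-6)
    return gaussian_tail_prob(g_c) * gaussian_tail_prob(g_opf)

def model_match_prob(square_probs, motion_occupancy):
    """Whole-model match: occupancy-weighted average of square matches (Eq. 6.23)."""
    num = sum(p * m for p, m in zip(square_probs, motion_occupancy))
    den = sum(motion_occupancy)
    return num / den if den > 0 else 0.0

# Two squares: one close to the model, one far from it (errors in model-error units).
probs = [square_match_prob(1.0, 2.0, 0.5, 1.0), square_match_prob(6.0, 2.0, 4.0, 1.0)]
print(model_match_prob(probs, motion_occupancy=[0.9, 0.3]))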
6.3.1 Handling Static Foreground
The motion detection algorithm that is used within the tracking system described
in this thesis can distinguish between objects that are moving in the scene (active
foreground) and objects that have stopped moving, but are not part of the back-
ground (static foreground). Any appearance model needs to be able to handle
the different types of foreground.
The proposed appearance model divides the object being modelled into a grid.
Each grid location is modelled separately, with its own colour, optical flow and
expected motion occupancy. This can be extended by allowing each grid to have
its own state, AS(x, y, t), indicating what type of motion is expected at this
square. This type may be Active (there is no static foreground expected at this
location), Static (there is no active foreground expected at this location) or Both
(foreground of both types is expected).
The state of each grid square is determined during the object update of each
frame, and uses the static object template (see Section 5.3) to determine the
state of each grid square. If the template has not been initialised (i.e. the object
is not stationary) then all squares must be in the Active state. If the static
template is initialised, then the ratio of active foreground to static foreground
within the template region that corresponds to each grid square is calculated,
R_m = \frac{card(I_{ST} = Active)}{card(I_{ST} = Static)},    (6.24)
where IST is the template image for the track that owns the appearance model,
card(IST = Active) is the number of pixels within the template image that are
detected to be in a state of active foreground, card(IST = Static) is the number
of template image pixels in a state of static foreground, and Rm is the ratio of
active foreground to static foreground within the template. A threshold, τs, is
applied to this ratio to determine the state such that,
R_m \leq \tau_s \implies A_S(x, y, t) = Static,    (6.25)

R_m \geq \frac{1}{\tau_s} \implies A_S(x, y, t) = Active,    (6.26)

\tau_s < R_m < \frac{1}{\tau_s} \implies A_S(x, y, t) = Both.    (6.27)
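A sketch of the state decision of Equations 6.24 to 6.27; the value of τ_s used here (0.2) and the guard for squares with no static pixels are assumptions made for the example only.

def grid_square_state(num_active, num_static, tau_s=0.2):
    """Classify a grid square as Active, Static or Both (Eqs. 6.24-6.27).

    tau_s is assumed to be less than 1: a ratio of active to static pixels
    below tau_s means the square is essentially static, above 1/tau_s it is
    essentially active, and anything in between is the transient Both state.
    """
    if num_static == 0:
        return "Active"          # assumed guard: no static foreground at all
    r_m = num_active / num_static
    if r_m <= tau_s:
        return "Static"
    if r_m >= 1.0 / tau_s:
        return "Active"
    return "Both"

print(grid_square_state(2, 50))    # Static
print(grid_square_state(50, 2))    # Active
print(grid_square_state(20, 30))   # Both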
A grid square that is in the Active state is processed as described in
Section 6.3. For squares that are in either the Static or Both states, an optical
flow comparison is not performed. Squares that are in the Static state contain
only static foreground and so will have zero flow, making comparison needless.
Squares that are in the Both state are likely to be in transition. Both is a transient
state, occurring either as an object stops and becomes static (during which time
the flow is changing to 0), or as an object begins to move again (during which
time the flow is changing from 0). In either situation, comparing the optical flow
is unreliable due to the known state change, and so it is not performed. Updates
to the optical flow are also not performed when a square is in one of these states.
The optical flow component is also reset when the Active state is re-entered (after
being in either the Static or Both state) as the object may have started moving
in a different direction and thus invalidated the flow component.
When updating a block that is in the Static state, the colour and motion features
become,
F_c(x', y', t, k) = \frac{\sum I_Z(x, y, t, I_{ST}(x, y, t), k) \text{ where } I_{ST}(x, y, t) = Static}{card(I_{ST}(x, y, t) = Static)},    (6.28)

F_m(x', y', t) = \frac{card(I_{ST}(x, y, t) = Static)}{card(I_{ST}(x, y, t))},    (6.29)
where IZ(x, y, t, IST (x, y, t), k) is the colour of channel k for the IST (x, y, t)th
static layer. IZ is an image that contains the colours of all the static pixels. For
each pixel, this image contains a list that contains the colour of each static layer
ordered according to their depth. The value of IST at the pixel can be used as an
index to access the appropriate colour.
When updating a block that is in the Both state, static and active foreground
must be considered. The colour and motion features in this situation become,
F^{static}_c(x', y', t, k) = \sum I_Z(x, y, t, I_{ST}(x, y, t), k) \text{ where } I_{ST}(x, y, t) = Static,    (6.30)

F^{active}_c(x', y', t, k) = \sum I(x, y, t, k) \text{ where } I_{ST}(x, y, t) = Active,    (6.31)

F_c(x', y', t, k) = \frac{F^{static}_c(x', y', t, k) + F^{active}_c(x', y', t, k)}{card(I_{ST}(x, y, t) = Static) + card(I_{ST}(x, y, t) = Active)},    (6.32)

F_m(x', y', t) = \frac{card(I_{ST}(x, y, t) = Static) + card(I_{ST}(x, y, t) = Active)}{card(I_{ST}(x, y, t))}.    (6.33)
The values calculated for F_c(x', y', t, k) and F_m(x', y', t) are used to update
the appearance model in the same manner previously described.
Performing a comparison with a model that contains static foreground (in either
the Static or Both state) differs from an update, as I_ST is likely to no longer be
valid when the comparison is performed. As comparisons should only occur prior
to an update (either within the SCF process, or when directly comparing two tracks
to a detected object), the template image stored by the track will be from the
previous frame, and thus be inaccurate. For
squares in the Static state, the static layer colour image (see Section 5.3) is used.
For each pixel in the grid square, the static colour that is closest to that of the
appearance model is used as the colour for that pixel; as such, the colour becomes,
C_{os}(x, y, t) = \arg\min_{n=[1..I_{ST}(x,y,t)]} |A_c(x', y', t) - I_Z(x, y, t, n)|,    (6.34)

F_c(x', y', t, k) = \frac{\sum C_{os}(x, y, t) \text{ where } (x, y) \in M_s(x, y, t)}{card(M_s(x, y, t))},    (6.35)
where C_{os}(x, y, t) is the colour of the static layer that is the closest match to the
appearance model, I_Z is an image that contains the colours of each static pixel (i.e.
such that if the pixel at (x, y) has n static layers, the corresponding location in I_Z
has n channels corresponding to the different layers), and M_s(x, y, t) is the static
foreground image. The closest matching colour is selected by comparing all k
colour channels. The motion occupancy for squares in the Static state becomes,
F_m(x', y', t) = \frac{card(M_s(x, y, t))}{card(I(t))}.    (6.36)
For squares in the Both state, the static layer colour image and the input colour
image are used. Matching is performed in the same manner as for the Static
state; the closest matching colour is used when calculating the total error,
C_{ob}(x, y, t) = \min(C_{os}(x, y, t), A_c(x', y', t) - I(x, y, t, k)),    (6.37)

F_c(x', y', t, k) = \frac{\sum C_{ob}(x, y, t) \text{ where } (x, y) \in M_s(x, y, t) \cup M_a(x, y, t)}{card(M_s(x, y, t) \cup M_a(x, y, t))},    (6.38)
where Cob(x, y, t) is the colour of the static layer or active layer that is the closest
match to the appearance model, and Ma(x, y, t) is the active foreground image.
The motion occupancy for the square is the number of pixels that have any motion
present (active or static),
F_m(x', y', t) = \frac{card(M_s(x, y, t) \cup M_a(x, y, t))}{card(I(t))}.    (6.39)
The values calculated for Fc(x′, y′, t, k) and Fm(x′, y′, t) are used to update the
appearance model in the same manner previously described. The system does
not consider the ratio of static foreground and active foreground within a grid
square, as it is likely to be changing rapidly due to the Both state being transient.
6.4 Incorporation into Tracking System
The integration of the scalable condensation filter into the tracking system results
in the following changes in the system (see Figure 6.7):
1. After motion detection has been performed, the SCF is updated (resample
particles and determine particle weights). This is performed after motion
detection as the output of the motion detector (motion images and opti-
cal flow) is used as input for the SCF (depending on the configuration of
features used by the objects being tracked) in addition to the input colour
image.
2. When matching detected objects to tracked objects, the SCF is used to aid
in matching objects. The distribution of the SCF for the track in question
can be checked to determine the likelihood of the detected object being the
track in question (see Section 6.4.1).
3. When updating a tracked object, the tracked object passes to the SCF the
current features that have been extracted for that object, and the object’s
movement for the frame. Object detection information is also passed, and
is used to update the SCF as described in Section 6.2.3.
4. When an object cannot be detected and its position in the frame needs
to be predicted, the distribution of the SCF for the track in question is used
to determine the most likely position for the track in question.
5. Any new objects added to the system result in a new component being
added to the SCF. Also, whenever a tracked object is removed from the
system, the component mixture for that track is removed from the SCF.
The use of the SCF to determine matches between detected objects and tracked
objects renders the use of a histogram (or appearance model) to determine which
of two detected objects is the better match for a given tracked object (see Section
3.4) redundant. As the match between a detected and tracked object is now par-
tially calculated from the SCF distribution, which itself uses the match between
the tracked object features (histogram/appearance model) and the image, the
SCF based tracking system by default uses the tracked object features to match
detected objects to tracked objects.
Figure 6.7: Integration of the SCF into the Tracking System
When adding detection results to the SCF mixture components, the detected
object that matches the tracked object as well as any detected objects that have
an uncertainty (see Equation 3.18, Section 3.4) within a threshold, are used when
updating the component mixture for that track. All such detections are given
an equal weight when adding them to the mixture. This ensures that in situations
where ambiguities arise, the SCF distribution will encompass both modes until
the situation is resolved.
6.4.1 Matching Candidate Objects to Tracked Objects
By using a condensation filter [77] rather than a traditional particle filter, the
tracking system has access to a weighted particle set derived from the features
from the tracked objects at the previous frame and the current images. This par-
ticle set represents the (approximate) probability of the locations of the tracked
objects in the current frame. As such, the output of the SCF can be used to aid
in evaluating matches between candidate objects and the list of tracked objects.
To enable candidate objects to be compared to tracked objects, particles are back
projected into a map, P_{T_j,t} (where T_j is the tracked object to which the map relates),
to indicate the likelihood of an object occupying a given pixel in the current
frame.
Within the SCF, each particle describes a bounding box and has a probability
associated with it. The maximum probability from all particles at a location
is used (rather than an average or total) as it is unaffected by the number of
particles, or their location in space. Consider the example situations shown in
Figure 6.8 (the yellow ellipse is the object of interest and the coloured rectangles
represent different particles; in these examples, the more accurately the particles
inscribe the object of interest, the higher their probability).
Figure 6.8: Determining match probability using particles. (a) Example 1; (b) Example 2.
In Example 1, if an average measure of particle probabilities is used, PTj ,t would
show a reduced likelihood on the right side of the object, due to the poor response
of the blue particle, despite the good response achieved by the red particle in the
same area. In Example 2, if a summation of all particle probabilities is used, PTj ,t
would show an increased likelihood on the right side of the object due to the large
number of particles covering this area, despite none of these particles yielding a
particularly high response. In each of these cases, the use of the maximum particle
weight results in strong probabilities in locations covered by particles with strong
responses, and areas with particles that have only weak responses will have low
probabilities.
Using this approach, PTj ,t at a given pixel location can be defined as,
P_{T_j,t}(x, y) = \max_{n = N_{j,min} .. N_{j,max}} (w_{j,n,t}) \text{ where } (x, y) \in s(j, n, t),    (6.40)
where s(j, n, t) is the nth particle for Tj, and wj,n,t is the corresponding weight,
Nj,min and Nj,max are the indexes of the first and last particles for Tj, and x, y is
the location being evaluated. As each s(j, n, t) describes a bounding box, it is very
simple to determine if the present location, x, y, is within the region described by
the particle.
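A minimal sketch of building the back-projected map of Equation 6.40, assuming each particle is stored as an axis-aligned bounding box (x, y, w, h); the particle values are illustrative.

import numpy as np

def back_project_particles(particles, weights, width, height):
    """Build the per-track probability map P_{T_j,t} of Eq. 6.40.

    Each particle is a bounding box (x, y, w, h); every pixel it covers takes
    the maximum weight of any particle covering that pixel.
    """
    p_map = np.zeros((height, width), dtype=float)
    for (x, y, w, h), weight in zip(particles, weights):
        x0, y0 = max(0, int(x)), max(0, int(y))
        x1, y1 = min(width, int(x + w)), min(height, int(y + h))
        region = p_map[y0:y1, x0:x1]
        np.maximum(region, weight, out=region)
    return p_map

particles = [(10, 10, 20, 40), (15, 12, 20, 40)]
weights = [0.3, 0.8]
p_map = back_project_particles(particles, weights, width=320, height=240)
print(p_map[20, 12], p_map[20, 25])  # 0.3 in the first box only, 0.8 where boxes overlap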
The likelihood of a candidate object matching a tracked object can then be defined
as,
F_{SCF}(C_i, T_j) = \frac{1}{W \times H} \sum_{x=X_0, y=Y_0}^{X_0+W, Y_0+H} P_{T_j,t}(x, y),    (6.41)
where FSCF (Ci, Tj) is the fit between the track Tj and the candidate Ci, X0 and
Y0 are the coordinates of the top left corner for the bounding box of Ci, and W
and H are the width and height of the bounding box of Ci. An average probability
is taken (rather than a total) to prevent the system from simply favouring the
largest object.
A limitation of this approach is that it does not consider the object size when
evaluating the match (i.e. it is possible for two objects of wildly different sizes to
have a similar appearance, so overall object size needs to be considered). Note
that it is not appropriate for Equation 6.41 to not normalise for size, as this
would result in larger objects being favoured in all matches. To overcome this,
the proposed likelihood is combined with the fit measure used in the baseline
system (see Equation 3.17, Section 3.4) such that the match between a candidate
object and a tracked object is,
\bar{F}(C_i, T_j) = \sqrt{F_{SCF}(C_i, T_j) \times F(C_i, T_j)}.    (6.42)
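A sketch of Equations 6.41 and 6.42, assuming the back-projected map of Equation 6.40 has already been computed and that the baseline fit F(Ci, Tj) is supplied by the existing tracking system; the example values are illustrative.

import numpy as np

def scf_fit(p_map, bbox):
    """Average back-projected particle probability under a candidate box (Eq. 6.41)."""
    x0, y0, w, h = bbox
    region = p_map[y0:y0 + h, x0:x0 + w]
    return float(region.mean()) if region.size else 0.0

def combined_fit(p_map, bbox, baseline_fit):
    """Geometric mean of the SCF fit and the baseline fit measure (Eq. 6.42)."""
    return (scf_fit(p_map, bbox) * baseline_fit) ** 0.5

p_map = np.zeros((240, 320))
p_map[50:100, 60:100] = 0.9                       # particles respond strongly here
print(round(combined_fit(p_map, (60, 50, 40, 50), baseline_fit=0.8), 3))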
6.4.2 Occlusion Handling
The use of the SCF also allows for improved occlusion handling, as the scalable
condensation filter can continue to provide probabilities relating to the location
of the occluded object. To take advantage of this ability, the Occluded state is
redefined, and a new state is created, Predicted. Figure 6.9 shows the modified
state diagram.
Figure 6.9: Update State Diagram Incorporating Occluded and Predicted States
Previously, the Occluded state was entered when an object could not be detected
for a single frame. The Occluded state is now entered when an object cannot
be detected at a given frame and also has an occlusion level of two (overlapping
with another object). The Occluded state assumes that the object detection has
failed due to the object being obscured. The Predicted state is entered when an
object cannot be detected at a given frame and has an occlusion level of less than
two (i.e. based on the previous frame, the tracked object is not obscured by any
other objects). In this situation, it is assumed that the object detection has failed
due to the object no longer being present in the scene. Given this, there is only
a timeout on the Predicted state, such that after τoccluded consecutive frames, the
track will pass to the Dead state.
From the Occluded state, an object can either move to the Active or Predicted
states. An object moves to the Active state if it is detected and matched. An
object moves to the Predicted state if it is deemed that the SCF no longer has
a sufficiently good estimate of the location of the object (i.e. all particles are
providing a very poor response to the features for the track), or if the object
ceases to be at an occlusion level of 2 and is not detected. In the event of either
of these occurring, the counter that limits the time spent in the Predicted state
is set to,
c_{predicted} = \min\left(\frac{\tau_{occluded}}{2}, c_{occluded}\right),    (6.43)
where cpredicted is the counter monitoring the time in the Predicted state, and
coccluded is a counter indicating how much time was spent in the Occluded state.
It is assumed that if an object transitions to Predicted from Occluded, it is because
it is no longer in the scene (the SCF can no longer track it, and the object detection
routines cannot find it). Whilst the system still enters the Predicted state briefly
(in case this assumption is wrong), the time spent in the state is significantly
reduced.
Whilst it is possible to implement a similar system of occlusion handling without
the use of a particle filter, such a system has no way of knowing when
the object has actually left, aside from monitoring the occlusion level. Such an
approach is likely to lead to additional tracking errors (such as track identities
being swapped or objects remaining after they have left the scene), and as a result
it is not implemented in the baseline system.
This approach also allows a smaller value of τ_occluded to be used, as it is known that
only objects whose detection has failed because they cannot be found (rather than
because they are occluded) are in this state.
6.5 Evaluation and Results
The proposed system using the SCF is evaluated using the same procedure de-
tailed in Section 3.5, and a comparison is made to the results of the baseline
system (see Section 3.5.4) and the system proposed in Chapter 5. Details on
the metrics used and annotation of the tracking output can be found in Sections
3.5.1 and 3.5.3 respectively. The proposed SCF based tracking system is derived
from that proposed in Chapter 5, and thus uses the alternative motion detection
routine and object detection routines proposed. Configuration files used to test
the system described in Chapter 5 are modified to include configuration for the
SCF. Values for the new system parameters for each dataset group are shown in
Table 6.1 for parameters that are constant for all configurations, and Table 6.2
for parameters that vary between datasets. d^x_min is the value of d_min for the x
dimension of the SCF, d^x_max is the value of d_max for the x dimension of the SCF,
and e^x_max is the value of e_max for the x dimension of the SCF.
Note that different types of particles are created for each different object type in
the system (i.e. for the RD datasets, there are types of particles for people and
vehicles). In these situations, particle bounds are applied that are appropriate
to the type of object the particle is following (particles used to track a vehicle
will have height and width bounds that match up with the size constraints of
the vehicle). Likewise, different types of objects have different noise distribution
bounds as they are expected to move at different rates (vehicles are faster than
people).
Parameter   Value
Condensation Filter Parameters
ν_init      50
ν_add       25
d^x_min     0
d^x_max     319
d^y_min     0
d^y_max     239
d^w_min     Minimum Height × Minimum Aspect
d^w_max     Maximum Height × Maximum Aspect
d^h_min     Minimum Height
d^h_max     Maximum Height
Table 6.1: System Parameters - Additional parameters for system configuration.
Parameter   RD    BC    AP    BE
Person Particle Parameters
e^x_max     2     4     N/A   2
e^y_max     2     4     N/A   2
e^w_max     1     2     N/A   1
e^h_max     1     2     N/A   1
Vehicle Particle Parameters
e^x_max     4     N/A   4     4
e^y_max     4     N/A   4     4
e^w_max     2     N/A   2     2
e^h_max     2     N/A   2     2

Table 6.2: System Parameters - Additional parameters for system configuration specific to each dataset group.
The SCF is configured to initially use 50 particles per track, with an increase of 25
for each occlusion level increase (there are three occlusion levels, so a maximum
of 100 particles for a track). Each tracked object uses a histogram as its standard
feature, and the appearance model described in Section 6.3 is also used when
the occlusion level is raised above 0 (no occlusion). All existing configuration
parameters are left unchanged.
The overall results for the tracking system incorporating the SCF are shown in
Tables 6.3 and 6.4. Detailed results for each dataset are shown in Appendix C.
Data  Detection    Localisation             Tracking
Set   D1    D2     L1    L2    L3    L4     T1    T2    T3    T4    T5
RD    0.83  0.66   0.82  1.00  0.98  1.00   0.56  0.37  0.89  0.91  0.45
BC    0.80  0.42   0.73  0.99  0.97  0.99   0.49  0.30  0.71  0.83  0.41
AP    0.82  0.60   0.75  1.00  1.00  1.00   0.60  0.38  0.94  1.00  0.44
BE    0.68  0.26   0.57  0.97  0.98  1.00   0.28  0.13  0.72  0.86  0.18

Table 6.3: Tracking System with SCF Results (see Section 3.5.1 for an explanation of metrics)
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.69                0.94                   0.57
BC         0.49                0.91                   0.49
AP         0.65                0.92                   0.60
BE         0.34                0.85                   0.32

Table 6.4: Tracking System with SCF Overall Results (see Section 3.5.1 for an explanation of metrics)
The overall change in performance resulting from the use of the SCF is shown
in Tables 6.5 (individual metrics) and 6.6 (overall metrics). As these results
show, the use of the SCF results in an incremental improvement, with small
improvements in the tracking performance of the RD and BE datasets, and little
change in the BC and AP datasets.
The metrics for the RD datasets are shown in Tables C.1 and C.2 (see Appendix
C). The use of the SCF has little impact on detection and localisation perfor-
mance, only resulting in incremental improvements. This is expected, as the
detection methods used are unchanged. The small improvement is due to the im-
proved performance when predicting positions in the event of missed detections,
and improved handling of occlusions when the object detection methods perform
poorly. The improvement in occlusion handling also results in the small improve-
ment in tracking performance shown by the increase in T1 (number of objects
tracked during time). A small improvement in the fragmentation (T3) is offset
by a small decrease in confusion (T4). Whilst the SCF does improve the system's
Data  Detection     Localisation             Tracking
Set   D1    D2      L1    L2    L3    L4     T1     T2     T3     T4     T5
RD    0.01  0.03    0.01  0.00  0.00  0.00   0.06   0.00   0.01   -0.02  0.00
BC    0.00  0.00    0.00  0.00  0.00  0.00   -0.01  0.01   -0.03  0.00   0.05
AP    0.00  -0.01   0.00  0.00  0.00  0.00   0.00   -0.01  -0.03  0.03   0.02
BE    0.00  -0.02   0.00  0.10  0.12  0.00   0.04   -0.02  -0.06  0.00   -0.04

Table 6.5: Improvements using SCF (see Section 3.5.1 for an explanation of metrics)
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.03                0.00                   0.04
BC         0.00                0.00                   0.00
AP         0.00                0.00                   0.00
BE         -0.02               0.07                   0.01

Table 6.6: Overall Improvements using SCF (see Section 3.5.1 for an explanation of metrics)
ability to follow objects through more complex scenes (improvement in T3), it is
susceptible to errors when a new modality is first initialised (decrease in T4) and
the features are able to describe multiple objects well. Within the RD dataset,
many of the people are dressed similarly (dark clothing), and enter in groups.
Likewise, many cars are similar colours (silver, white) and, at times, also enter
in pairs. In these situations, where the detection routines do not immediately
detect the two objects, it is possible for the identities to swap before the features
are well defined.
Examples of the difference in occlusion handling are shown in Figures 6.10 and 6.11
(in both figures, the top row shows results without using the SCF, the bottom
row with the SCF).
In Figure 6.10, a person walks behind a stopped car, resulting in only the torso
being visible and the person detection routine not being able to consistently locate
them. Without the SCF, the system loses track of the person, and creates a new
track several frames later when they can be detected correctly once more. Using
Figure 6.10: Example Output from RD7 - Occlusion handling using the SCF (the person walking behind the car on the far side of the road). Panels show frames 1495, 1500, 1505 and 1510; top row without the SCF, bottom row with the SCF.
the SCF however, the system is able to continue to track the person through the
occlusion.
Figure 6.11 shows an occlusion between three moving people. Whilst neither
system handles this occlusion correctly, the SCF is able to maintain the identities
of the individual people for longer. Ultimately however, the occlusion continues
for too long for the SCF to be able to properly resolve the situation.
The metrics for the BC datasets are shown in Tables C.3 and C.4 in Appendix
C. There is no change in the overall metrics for the BC datasets, however there
are small changes within the individual tracking metrics. No change is expected
within detection and localisation, as the use of the SCF has little effect on the
object detection (it only affects the process of matching detections to tracks).
The change in the tracking performance is varied, with a decrease in T1 and T3
(number of objects being tracked and fragmentation respectively) and an increase
in T2 and T5 (tracking time and 2D trajectories respectively). The drop in T1
and T3 is a result of uncertainty when new objects enter the scene. Within the
Figure 6.11: Example Output from RD7 - Occlusion handling using the SCF (the group of people in the bottom right corner of the scene). Panels show frames 2150, 2200, 2220, 2240, 2260 and 2280; top rows without the SCF, bottom rows with the SCF.
RD datasets this uncertainty resulted in a decrease in T4. In these datasets,
similarly coloured objects often enter in pairs and if the detection algorithms are
not able to detect both objects consistently, there is a possibility that the SCF may
begin to track the second object. Within the BC datasets, people frequently enter
through doors, which creates motion that is detected by the motion detector. In
several instances, the door is detected as it opens, and as the person enters from
behind the door, the system swaps from the door to the person and tracks the
person. However, when using the SCF in such a situation, the features used by the
SCF (histogram and appearance model) are created using the initial detections
of the door. As a result, the SCF often fails to transfer the newly created track
from the door to the person, resulting in poorer performance for the T1 and T3
metrics. An example of this situation is shown in Figure 6.12.
Figure 6.12: Example Output from BC16 - Initialisation of tracks from spurious motion (the person entering through the door at the top of the scene). Panels show frames 262 to 272; top rows without the SCF, bottom rows with the SCF.
Figure 6.12 shows an example of a person entering through a door. The person is
not easily distinguishable from the background, and the majority of the motion in
the region is actually caused by the door. As a result, the system initially tracks
the door. When the SCF is not used (Figure 6.12, top two rows), the system is
able to switch from the door to the person once the door motion subsides. As the
door and person are still close to one another, the system is able to make this
association. However, when the SCF is used (Figure 6.12, bottom two rows), the
system fails to make this association. The features for the tracked object, which
are used by the SCF to locate it in future frames, have been initialised using the
incorrect detections on the door and therefore are modelling the door. As such,
the system continues to track the door, eventually losing track of it once all
motion from the door stops.
The improvement in T2 and T5 can be attributed to the improved occlusion
handling. The BC datasets contain a large number of occlusions, and whilst
not all are able to be correctly resolved using the SCF, there is a noticeable
improvement in the system’s occlusion handling abilities. Figure 6.13 shows an
example of an occlusion and how it is handled with the SCF (bottom two rows)
and without (top two rows).
In Figure 6.13 one person walks behind another, and is totally occluded for ap-
proximately 20 frames. Without the SCF, the person is lost. The system is unable
to separate the two people when the occlusion begins, and cannot re-detect the
person until after the occlusion has passed. The SCF however, is able to resolve
this occlusion successfully, despite a false object being briefly detected (in Frame 2720,
the shadow of the occluded person is tracked for a few frames).
The results for the AP datasets are shown in Tables C.5 and C.6 in Appendix
C. Detection and localisation performance is unchanged, as is overall tracking
performance. Whilst small changes are observed in the individual tracking metrics
(a mix of gains and drops), these can be attributed to the differing performance
of the systems when creating new tracks, rather than any significant tracking
errors or improvements. Due to the relative simplicity of the AP datasets (no
major occlusions, few objects in the scene at a given time) and the already good
Figure 6.13: Example Output from BC16 - Improved occlusion handling using the SCF (one person walks behind another). Panels show frames 2690 to 2780; top rows without the SCF, bottom rows with the SCF.
performance achieved with these datasets, these results are as expected. An
example of tracking output is shown in Figure 6.14.
Figure 6.14 illustrates an error that can occur using the SCF. As the vehicle exits
the scene (travelling behind the aero bridge), the SCF has difficulty tracking the object.
The system begins to track the vehicle poorly at frame 470, losing track of the
vehicle, before re-creating the track just before the object ultimately exits. The
fact that the majority of the vehicle is obscured, and that it is of a similar
colour to both the tarmac and the aero bridge, leads to confusion for the SCF and
Figure 6.14: Example Output from AP11-C4 - Errors as an object leaves the scene using the SCF. Panels show frames 430 to 500.
poor tracking. This could be overcome by using more bins in the histogram to
achieve greater separation between the colours, however this comes at the cost of
memory and computational speed. This problem is observed at other times in the
AP datasets, and leads to a small drop in the T3 metric (object fragmentation).
Tables C.7 and C.8 (see Appendix C) show the results for the BE datasets.
Overall localisation and tracking performance are improved slightly by using the
SCF, while there is a slight decrease in overall detection performance. As was
found in previous evaluations, very poor results for D2, T2 and T5 are obtained
due to the camera angle of BE-C3.
Figure 6.15 shows the tracking performance resulting from the use of the SCF (top
row shows the system without the SCF, bottom row with the SCF). There is little
noticeable difference in performance between the two systems, as the main
problem within the dataset is the poor detection performance. The occlusions
present are simple and can be handled effectively by both systems.
Figure 6.15: Example Output from BE19-C1 - Tracking Performance using the SCF. Panels show frames 520, 570, 620, 670 and 730; top row without the SCF, bottom row with the SCF.
Figure 6.16 shows the difference in performance when processing BE20-C3 when
using the SCF and when not (top two rows are without the SCF, bottom two
rows are with). It can be seen that there is little difference in how the systems
perform when confronted with the occlusion. In each instance several incorrect
tracks are created, and tracks are frequently lost before being recreated. The
poor motion and object detection in this dataset, combined with the three people
in the occlusion being dressed in similar coloured clothing means that the SCF is
able to easily swap between people, creating errors. The problem of similarly coloured
clothing is compounded by the poor detection, as tracks are frequently lost and recreated,
meaning that the appearance models and histograms are initialised on data that
is not correctly segmented.
Figure 6.16: Example Output from BE20-C3 - Localisation Performance using the SCF. Panels show frames 260 to 710; top two rows without the SCF, bottom two rows with the SCF.
Overall, the use of the SCF does not have the same impact on system perfor-
mance as the addition of an improved motion detection routine. Despite this,
improvement is still observed when handling occlusions and tracking in difficult
circumstances. The tight integration with the previously proposed tracking sys-
tem allows for features to be updated continuously, and for new tracks to be
easily added and removed from the tracking system.
Whilst the SCF is effective at resolving occlusions and handling situations where
there are segmentation errors, it is very vulnerable when a track is initialised
using poor segmentation results. This often results in the SCF associating the
track with the source of the segmentation error (i.e. a door), rather than the
moving object. When the erroneous motion stops, the tracked object is then
often lost, and an additional new track must be spawned to track the object of
interest.
Table 6.7 shows the data throughput benchmarks for the proposed tracking sys-
tem. These throughput rates are calculated under the same conditions used when
benchmarking the baseline system (see Section 3.5.4). As expected, the incorpo-
ration of the SCF results in a drop in throughput for all datasets.
The RD and BE datasets suffer the most significant drop in performance (41%
and 39% respectively when compared to the system running without the SCF).
Once again, this can be attributed to the presence of static foreground. Compar-
ing features that contain static foreground is significantly more computationally
intensive than comparing features that contain only active foreground. The pro-
cess that extracts the colour of static layers from the motion segmentation (to
allow comparison of features that contain static foreground) is primarily respon-
sible for this increase in processing time.
The AP datasets suffer only a small performance penalty when using the SCF, due to
the simple nature of the data. The BC datasets suffer more of an impact, due
to the high number of occlusions (resulting in additional particles and features)
and the large size of the objects. The computational demands of all the features
used by the SCF increase linearly with the size of the region being processed.
Data Set   Frame Rate (fps)
RD         7.57
BC         9.99
AP         13.22
BE         8.75
Table 6.7: Proposed Tracking System Throughput
Despite the drop in data throughput when compared to the baseline system and
the proposed system without the SCF, all datasets are processed at greater than
7 fps. Given that these results are achieved executing on a single core, and that
significant optimisations can be made to the motion segmentation and processing
of object features, it is feasible that the proposed system could process data in
real time.
6.6 Summary
This chapter has presented a new implementation of the condensation filter, the
Scalable Condensation Filter (SCF). The SCF is a derivative of the mixture
particle filter [163] that is intended to operate in tandem with an existing tracking
system, rather than as a self contained tracking system. The tight integration of
the SCF with an existing tracking system allows the SCF to:
• Add and remove new mixtures from the filter without any user intervention.
• Use progressively updated features for each tracked object.
• Use a time varying number of particles for each tracked object, based on
the system status as determined by the underlying tracking system.
• Use a varied number and type of features for each tracked object, based on
the system status as determined by the underlying tracking system.
• Incorporate detection results for each tracked object into each mixture com-
ponent, allowing each mixture to monitor multiple modes temporarily
when object detection is uncertain (i.e. in the presence of occlusions, or
poor segmentation/detection performance).
• Use the tracking system to flag occlusions, and take appropriate action for
both the occluded and occluding objects to ensure that neither object is
lost.
The use of the SCF within the existing tracking system provides improved occlu-
sion handling, and improved performance in situations where segmentation and
detection are poor. The scalability of the system (time varying number of parti-
cles and number/type of features) also ensures that there is not a severe impact
on system performance. To utilise the previously proposed motion segmentation
algorithm (see Chapter 4), a new tracking feature has also been proposed.
The proposed tracking system, with the SCF integrated, has been tested using
the ETISEO database [130] and improvement, particularly when handling occlu-
sions, has been shown compared to the same system not using the SCF. However,
it has been shown that the SCF can also introduce problems in some situations,
particularly when there is poor contrast between the foreground and background,
and when the initial detection results are poor as the object first enters the scene. In
these situations, tracks can be switched to either erroneous motion in the back-
ground, or potentially between objects. Improving the performance of the object
detection, and using more discriminative object features would help overcome
these problems.
Chapter 7
Advanced Object Tracking and
Applications
7.1 Introduction
Single camera tracking techniques can be extended to work with multiple cameras,
either by tracking objects through a network of cameras or through the fusion
of multiple views of the same area, and can be used as an initial step in more
advanced surveillance systems such as event and action recognition. This chapter
will apply the single camera tracking techniques previously discussed in this thesis
to two areas:
1. Multi-camera tracking.
2. Sensor Fusion for Object Tracking.
7.2 Multi-Camera Tracking
7.2.1 System Description
The tracking framework described in Chapter 3 is able to support multiple camera
inputs. Each camera input (view) is assigned an object tracker to track all objects
within the view. Each individual tracker has no knowledge of the other cameras
in the network, and as such is responsible for tracking only the objects within its
field of view.
The information from the multiple views is combined by a separate module after
each tracker has processed the current frame. This module aims to ensure that
every object is tracked in every view, and as such will attempt to group objects
in different views together, based on matching criteria. To achieve this,
the module performs the following tasks:
• Translate the 2D image coordinates of the tracked objects into a 3D coor-
dinate system.
• Match ungrouped objects across different views.
• Check that objects that have been paired are still a valid match.
• Check for occlusions affecting paired objects, and ensure that the system
does not attempt to detect an object that is badly occluded in one view if
it can be clearly seen in another.
To perform these tasks, camera calibration information is required that is able to
translate image coordinates (in pixels) to a world coordinate scheme,
[T^n_{\omega x}, T^n_{\omega y}] = \zeta_{i,\omega}(T^n_{x,i}, T^n_{y,i}),    (7.1)

and translate image coordinates in one camera view to another,

[T^n_{x,j}, T^n_{y,j}] = \zeta_{i,j}(T^n_{x,i}, T^n_{y,i}),    (7.2)

where T^n_{x,i} and T^n_{y,i} are the pixel coordinates of an object, n, in camera i, T^n_{\omega x} and
T^n_{\omega y} are the world coordinates for the object n, \zeta_{i,\omega} is a transform that translates
image coordinates in camera i to the world coordinate scheme and \zeta_{i,j} is a
transform that translates image coordinates in camera i to camera j. It is assumed
that any image coordinates that are translated are located on the ground plane
(i.e. z = 0).
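As an illustration of these coordinate translations only, the sketch below substitutes hypothetical ground-plane homographies for the Tsai calibration transforms ζ; the matrices are invented for the example and are not taken from either database.

import numpy as np

def apply_homography(H, x, y):
    """Map an image point to another coordinate frame with a 3x3 homography.

    A ground-plane homography stands in for the calibration transforms
    zeta_{i,omega} and zeta_{i,j}; points are assumed to lie on the ground
    plane (z = 0).
    """
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# Hypothetical calibration: camera-1 image coordinates to world coordinates (metres),
# and camera-1 to camera-2 via the composed transform inv(H_2w) @ H_1w.
H_1w = np.array([[0.02, 0.0, -3.0], [0.0, 0.03, -2.0], [0.0, 0.001, 1.0]])
H_2w = np.array([[0.025, 0.0, -4.0], [0.0, 0.028, -1.5], [0.0, 0.0012, 1.0]])
wx, wy = apply_homography(H_1w, 160.0, 200.0)                          # Eq. 7.1
cx, cy = apply_homography(np.linalg.inv(H_2w) @ H_1w, 160.0, 200.0)    # Eq. 7.2
print((round(wx, 2), round(wy, 2)), (round(cx, 1), round(cy, 1)))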
The databases used for multi-camera tracking within this thesis (ETISEO [130]
and PETS 2006 [132]) each provide camera calibration created using Tsai’s cam-
era calibration algorithm [161], which provides a world coordinate scheme (mea-
sured in meters) and the ability to transfer image coordinates between the world
coordinates and the alternative camera views.
7.2.2 Track Handover and Matching Objects in Different
Views
Objects can be paired between views in two different ways:
1. Detected in each view and then matched according to one or more features.
2. Detected in a single view, with its position then translated to a second.
The second approach is very simple, in that no matching of features is required.
The bounding box is simply translated using the camera coordinates. Objects
translated between views in this manner begin in the Transfer state (see Section
3.2.1).
The first approach requires features to be extracted from each view and compared,
the proposed multi-camera system uses two features:
1. Position
2. Colour Histogram
Appearance models are not used, as, unless the cameras are observing the scene
from very similar angles, the appearance models will record a poor match. Colour
histograms however are view invariant (with the exception of non-uniformly
coloured objects), and so are better suited to matching across different cam-
eras. It is assumed that there is no significant difference in the colour calibration
of the cameras in the network (i.e. no coloured filters, no large differences in gain
or exposure).
Position is matched by projecting the coordinates in each view into one another,
and calculating the Euclidean error, such that,
R(i, j) = \frac{\sqrt{(T^n_{x,i} - tx_j)^2 + (T^n_{y,i} - ty_j)^2} + \sqrt{(T^n_{x,j} - tx_i)^2 + (T^n_{y,j} - ty_i)^2}}{2},    (7.3)

where R(i, j) is the average reprojection error for tracks i and j, T^n_{x,i} and T^n_{y,i} are
the x and y image coordinates for track i, and txj and tyj are the translated
image coordinates for track j.
Colour histograms are compared using the Bhattacharya coefficient (see Section
3.4). Histogram bins are normalised to sum to one, to ensure that B (the match
between the two histograms) is in the range [0..1] (0 being no match, 1 being a perfect match). As it
is expected that the objects will not be the same size in the two views, comparing
actual bin counts would not be suitable.
Provided the matches for the two features satisfy the equations,
B(i, j) > \tau_B,    (7.4)

R(i, j) < \tau_R,    (7.5)
where B(i, j) is the Bhattacharya coefficient when comparing the histograms of
track i and j, R(i, j) is the average reprojection error, and τB and τR are the
thresholds for these measures, the objects in the two views are paired. In each
frame, the positions of the paired objects are checked again to ensure that they are still
within the position threshold, τd. τB, τR and τd are set to 0.75 (75% similarity
between the histograms), 1 (within 1 meter on the ground plane) and 2 (within
2 meters on the ground plane) respectively. If the positions exceed this threshold
for three consecutive frames, the pair is broken.
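A sketch of the pairing test of Equations 7.3 to 7.5, assuming positions have already been translated onto a common ground plane (in metres); the example coordinates are illustrative.

import math

def reprojection_error(pos_i, pos_j_translated, pos_j, pos_i_translated):
    """Average reprojection error between two tracks (Eq. 7.3)."""
    d1 = math.dist(pos_i, pos_j_translated)   # track j projected into view i
    d2 = math.dist(pos_j, pos_i_translated)   # track i projected into view j
    return (d1 + d2) / 2.0

def should_pair(bhattacharyya, reproj_error, tau_b=0.75, tau_r=1.0):
    """Pair two tracks across views when both thresholds are met (Eqs. 7.4-7.5)."""
    return bhattacharyya > tau_b and reproj_error < tau_r

r = reprojection_error((2.0, 3.0), (2.3, 3.1), (5.0, 1.0), (5.2, 0.8))
print(round(r, 2), should_pair(0.82, r))   # histogram match 82%, error well under 1 metre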
7.2.3 Occlusions
In a single camera tracking system, the tracking algorithm must attempt to detect
and track all objects within the scene, even when those objects may be involved
in severe occlusions. This may often lead to tracked objects being lost, or the
identities of two or more tracked objects being swapped. Within a multi-camera
system however, it is possible that the area containing the occlusion may be
visible from multiple positions, and so whilst a person may be severely occluded
in one view they may not be in another. Figure 7.1 shows an example of such
an occlusion. There are three people in the scene, but in camera one, the third
person is obscured behind the first two, and in the second camera, only the third
person is clearly visible (the others are partially occluded).
Rather than try to detect all three people in each camera view, a better approach
is to attempt to detect only the first two people in camera one, and only the
third in camera two. Using the camera calibration information, the system can
Figure 7.1: A Severe Occlusion Observed from Two Views. (a) Camera 1; (b) Camera 2.
still maintain the position of all people in all cameras, without the burden of
attempting to detect severely occluded people indefinitely.
Occlusions can be detected within a single camera by analysing the location of
the tracked objects relative to one another. Using the camera calibration, it is
possible to determine the order of the objects within the scene (i.e. which object
is closest to the camera, and thus in full view), and which objects are obscured.
If an object is obscured in one view, but visible in another, the obscured view is
flagged such that the system will not attempt to locate it.
For this approach to function properly, the tracked objects need to be initially
detected (and tracked) prior to the occlusion, as detecting the objects during the
occlusion is likely to prove unreliable.
7.2.4 Evaluation using ETISEO database
The multi-camera extensions to the tracking system are evaluated on the AP and
BE datasets of the ETISEO database, used in the previous evaluations within the
thesis. These datasets are each two camera datasets, and have camera calibration
supplied, calculated using the Tsai camera calibration toolbox [161]. The other
datasets (RD and BC) are only single camera datasets, and thus are of no interest
when testing a multi camera system. Details on the metrics used and annotation
of the tracking output can be found in Sections 3.5.1 and 3.5.3 respectively.
The ETISEO evaluation software uses the ID of each tracked object in both
the ground truth data and system output to determine the effectiveness of the
system at maintaining an object’s identity over time. When matching objects
across different camera views, the multi-camera system sets both object IDs to
the same value. This is done to indicate within the tracking system that this is the
same object viewed from two angles. However, it does result in an error being
recorded by the evaluation software. This is due to the object being detected and
tracked (and thus logged in the output file) for several frames under a different
ID. Normally, a change in ID indicates that tracking has been lost and
the object has been re-created. However, with a multi-camera system it may be
because a track in one view has been paired with a track in another. Whilst this
is not a tracking error, the evaluation tool is unable to distinguish between the
two situations. As such, it is expected that there will be a small performance
drop in the tracking metrics as a result of this.
The overall results for the tracking system incorporating the SCF are shown in
Tables 7.1 and 7.2. Detailed results for each dataset are shown in Appendix D.
            Detection     Localisation              Tracking
Data Set    D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
AP          0.83  0.64    0.77  1.00  1.00  1.00    0.64  0.44  0.94  1.00  0.61
BE          0.64  0.22    0.54  0.88  0.91  1.01    0.24  0.09  0.66  0.89  0.16

Table 7.1: Multi-Camera Tracking System Results (see Section 3.5.1 for an explanation of metrics)
Tables 7.3 and 7.4 show the difference in performance between tracking in single
cameras and tracking with multiple cameras on these datasets (both tracking
systems are using the SCF, see Chapter 6, and differences are measured against
Data Set    Overall Detection    Overall Localisation    Overall Tracking
AP          0.68                 0.93                    0.66
BE          0.31                 0.79                    0.28

Table 7.2: Multi-Camera Tracking System Overall Results (see Section 3.5.1 for an explanation of metrics)
those recorded in Section 6.5). There is a considerable increase in the performance
of the AP datasets, whilst the BE datasets have recorded a slight drop across most
metrics, and all overall metrics.
            Detection      Localisation                   Tracking
Data Set    D1     D2      L1     L2     L3     L4        T1     T2     T3     T4     T5
AP          0.01   0.04    0.02   0.00   0.00   0.00      0.04   0.06   -0.01  0.00   0.17
BE          -0.04  -0.04   -0.03  -0.09  -0.06  0.01      -0.04  -0.04  -0.06  0.03   -0.02

Table 7.3: Improvements using a Multi-Camera System (see Section 3.5.1 for an explanation of metrics)
Results for the AP datasets are shown in Tables D.1 and D.2 (see Appendix D).
As these results show, the use of a multi-camera system improves the performance
of the tracking system.
Figures 7.2 and 7.3 show examples of the AP dataset tracked by two single camera
systems and by a single multi-camera system. The top two rows of these figures
show the output from single camera tracking systems for cameras C4 and C7, the
bottom two rows show the output of the proposed multi-camera tracking system.
Figure 7.2 shows a situation where an object, visible in both cameras, leaves
the field of view of one camera before re-entering, whilst remaining visible in
the second camera. By using a multi-camera system, the object is assigned its
original identity again when it reenters the scene (denoted by the colour of the
bounding box drawn around the object, in this case black).
Figure 7.3 shows a situation where two objects move from camera C7 (second
and fourth rows) to camera C4 (first and third rows). The use of multi-camera
Data Set    Overall Detection    Overall Localisation    Overall Tracking
AP          0.03                 0.01                    0.06
BE          -0.04                -0.06                   -0.03

Table 7.4: Overall Improvements using a Multi-Camera System (see Section 3.5.1 for an explanation of metrics)
Figure 7.2: Example System Results AP11 - Object leaving and re-entering field of view (frames 400, 450, 500, 550 and 600 for each of the four rows; panels (a)-(t) labelled row by row)
system ensures that the same object is assigned the same ID in each view. The
system is also alerted when an object moves from one camera view into another,
helping to initialise tracking of the object in the second view slightly more quickly.
In Figure 7.3, it can be seen that the multi-camera system does not provide any
advantage when tracking the plane at the top of C4. This object is only visible
in a single view, and so no benefit is gained.
Figure 7.3: Example System Results AP12 - Object leaving and re-entering field of view (frames 50, 100, 150, 200 and 250 for each of the four rows; panels (a)-(t) labelled row by row)
Results for the BE datasets are shown in Tables D.3 and D.4 in Appendix D. In
the case of the BE dataset, the use of the multi-camera system does not result in
an improvement in performance. In fact, a small drop in overall performance is
recorded. The drop in tracking performance can be attributed in part to the
switching of identities, and in part to the unstable detection proce-
dures. The poor detection performance in C3 is once again present, and this also
has an effect on the overall system performance.
Figures 7.4, 7.5 and 7.6 show examples of the BE dataset tracked by two single
camera systems and by a single multi-camera system. The top two rows of these
figures show the output from single camera tracking systems for cameras C1 and
C3, the bottom two rows show the output of the proposed multi-camera tracking
system.
Figure 7.4 shows an example of the system output from BE19. In this sequence,
a person is exiting the building through the front doors and then walks down the
driveway. Like the single camera system, a false track is spawned at the door. As
the door is only covered by one camera, similar performance is expected. As the
person walks down the steps and enters the view of C3, the single camera system
spawns an extra track. Due to the camera angle and segmentation errors arising
from the black pants against the dark roadway, two people are detected where
there is only one. After a short period of time however, the second track is lost.
When using the multi-camera system, a false track is created in the same manner
as the false tracking in the single camera system. However, unlike the single
camera system, this track persists for much longer. The multi-camera system
correctly matches the track for this person in C1 with the initial track created
in C3, however, where the second track is incorrectly spawned, an occlusion is
created (albeit, between two tracked objects tracking the same person). The
occlusion handling of the multi-camera system is then able to, incorrectly, keep
both tracks alive far longer than in the single camera system.
Figure 7.5 shows another example of the tracking from BE19. The multi-camera
system is able to continue to track the person leaving the car for slightly longer
than the single camera system (note in (f), the blue bounding box, that is the
track previously associated with this person). Due to there being continuing
non-detections, this person is lost later on (see (s)). However, when they are
detected again, they are quickly re-associated with the corresponding track in
the other view, and they return to having the same track ID (denoted by the
bounding box colour, see (t)) that they had previously. This process however,
ultimately results in several ID changes which are registered by the system as
Figure 7.4: Example System Results BE19 - Track matching and occlusion handling (frames 390, 420, 450, 480 and 510 for each of the four rows; panels (a)-(t) labelled row by row)
errors. The predicted positions shown in frames 570 and 580 ((q) and (r)), where
the predictions are taken from the position in the other camera view, are also
treated as false detections as they do not match the target object well. Given
this, a missed detection for the person is still recorded.
Figure 7.6 shows an example from BE20. As can be seen, even with combining
the two camera views the occlusions are still not properly resolved.
Despite the improvement in performance of the BE19 dataset, there is no im-
provement in BE20. This can be explained by the manner in which the camera
calibration matches tracks across the views. To transfer points between views, it
Figure 7.5: Example System Results BE19 - Re-associating objects after detection and tracking failure (frames 560, 570, 580, 590 and 600 for each of the four rows; panels (a)-(t) labelled row by row)
is desirable that they are on the ground plane (i.e. z = 0). The feet of a person
should be on the ground plane, and so the feet position (taken as the centre of
the bottom of the bounding box inscribing the tracked object) is used to deter-
mine the proximity of objects in different views to one another. However, if the
segmentation is inaccurate, then this foot position is also likely to be inaccurate
(either too low, if part of a shadow is grouped with the track; or too high if the
lower legs are missed). Within the BE datasets, and particularly C3, the seg-
mentation performs very poorly due to a combination of the high noise levels and
people appearing at an angle (due to the way the camera is mounted). To try
and counter this with the BE datasets, the ground plane distance allowed between objects
in different views for them to be considered a match is set quite high (2m), which also
Figure 7.6: Example System Results BE20 - Occlusion Performance (frames 950, 1000, 1050, 1100 and 1150 for each of the four rows; panels (a)-(t) labelled row by row)
results in several false matches. Setting this lower results in almost no matches.
This problem is exacerbated the further from the camera the person is, as errors
further from the camera translate to larger distances on the ground plane.
Problems such as these can potentially be overcome by using different strategies
to transfer points between views. The use of F-matrices allows a point in one
view to be matched to a line in a second; by transferring the head and feet points
from one view to two lines in the second view, and searching for an object bounded
by those lines, performance could potentially be improved.
The motion image could also be transferred to a common coordinate scheme [5]
to try and determine where ground plane overlaps are in the two views.
7.3 Multi-Spectral Tracking
Object tracking and abandoned object detection systems typically rely on a single
colour modality for their input. As a result, performance can be compromised
when low lighting, shadowing, smoke, dust or unstable backgrounds are present,
or when the objects of interest are a similar colour to the background. Thermal
images are not affected by lighting changes or shadowing, and are not overtly
affected by smoke, dust or unstable backgrounds. However, thermal images lack
colour information which makes distinguishing between different people or objects
of interest within the same scene difficult. Using modalities from both the visible
and thermal infra-red spectra allows more information to be extracted from a
scene, and can help overcome the problems associated with using either modality
individually.
There are a variety of fusion points available when fusing visual colour and ther-
mal image feeds for use in a surveillance application. To determine the most
appropriate point, four approaches for fusing visual and thermal images for use
in a person tracking system (two early fusion methods, one mid fusion and one
late fusion method), are evaluated. A final fusion approach is proposed based on
this initial evaluation.
7.3.1 Evaluation of Fusion Points
To determine the most appropriate method for fusing the thermal infrared and
visible light images for object tracking, four different fusion approaches are pro-
posed (see Figure 7.7):
1. Fusing images during the motion detection by interlacing the images.
2. Fusing the motion detection results of each image.
3. Fusing when updating the tracked objects using detected object lists from
each modality.
4. Fusing the results of two object trackers, which each track a modality in-
dependently.
Figure 7.7: The points for fusion in the system
For each of these proposed systems, the tracking system described in Chapter 6
is used (with any required modifications made to allow for the fusion process). In
all cases, the scalable condensation filter is used to support the tracking, using a
histogram model and the proposed appearance model (see Section 6.3).
Fusion in the Motion Detector
The first fusion method involves fusing the images prior to the motion detec-
tion by interlacing the luminance channel of the visible light image with the grey
scale thermal infrared image. This approach is facilitated by using a motion
detector which requires YCbCr 4:2:2 input [43]. The motion detector analyses
images in 2 pixel (four value, two luminance, one blue chrominance and one red
chrominance) blocks from which clusters containing two centroids (a luminance
and chrominance cluster, {Y1, Y2;Cb,Cr}) are formed. The centroids of the clus-
ters in the background model are compared to those in the incoming image to
determine foreground/background.
Rather than convert the colour image to YCbCr 4:2:2 format as would be done in
normal circumstances, it is converted to YCbCr 4:4:4. The thermal information
is then interlaced with the colour information. By treating the thermal informa-
tion as additional luminance data and doubling the luminance information, we
effectively create a YCbCr 4:2:2 image (see Figure 7.8) that can be fed directly
into the tracking system without any further modifications.
Figure 7.8: Fusing Visual and Thermal Information into a YCbCr Image for use in the Motion Detection
This results in the motion detector clusters becoming {Y, T ;Cb,Cr}. This
method of fusion has the advantage of consuming little processing resources on
top of our existing system (the only additional load is when performing motion
detection), and is also very simple to implement. It does however require that the
colour and thermal images be correctly registered, which may require additional
processing, or in some situations, not be possible.
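A minimal sketch of this interlacing step, assuming registered colour and thermal frames of identical size (the function name and array layout are illustrative; only the {Y, T; Cb, Cr} pairing follows the text):

```python
import cv2
import numpy as np

def fuse_to_ycbcr422(colour_bgr, thermal_gray):
    """Pair each pixel's visible luminance with its thermal value.

    The colour frame is converted to YCbCr 4:4:4 and the thermal value is added
    as a second luminance sample, giving per-pixel {Y, T; Cb, Cr} blocks with the
    same layout as the YCbCr 4:2:2 input the motion detector expects.
    """
    ycrcb = cv2.cvtColor(colour_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    return np.stack([y, thermal_gray, cb, cr], axis=-1)
```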
Fusion After Motion Detection
The use of middle or late fusion allows for greater control over the information
contained in the images that can be used by the tracking process. This informa-
tion can be used to greatly improve the accuracy and robustness of the detection
and tracking system. In both the second and third (see Section 7.3.1) of the
proposed fusion systems we compute motion detection for each of the images. If
either image shows an abnormal increase in motion, it is disregarded. In the un-
likely event that both show such an abnormality, the more consistent of the two
is chosen. The abnormality of the images is assessed by examining the increase
in the in-motion pixel count,
\frac{card(M(t))}{card(M(t-1))} > τ_mc,    (7.6)
where card(M(t)) is the amount of motion in the image, t is the time step and
τmc is the threshold for determining invalid motion detection results. This test is
not performed if the overall percentage of pixels in motion in the scene is beneath
a threshold (10% in our system), as if there is very little motion then something
such as a person entering the scene may be enough to result in an invalid image.
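A sketch of this validity test on binary motion masks; the 10% floor follows the text, while the ratio threshold value and the function name are assumptions:

```python
import numpy as np

TAU_MC = 3.0         # allowed growth ratio of the motion count (assumed value)
MIN_FRACTION = 0.10  # below 10% of pixels in motion the test is skipped

def motion_is_valid(mask_t, mask_prev):
    """Return False when the in-motion pixel count grows abnormally (Equation 7.6)."""
    count_t = np.count_nonzero(mask_t)
    count_prev = max(np.count_nonzero(mask_prev), 1)  # guard against divide-by-zero
    if count_t / mask_t.size < MIN_FRACTION:
        return True  # too little motion for the ratio test to be meaningful
    return count_t / count_prev <= TAU_MC
```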
Our second proposed fusion scheme involves fusing directly after the motion de-
tection. Once the motion detection masks are obtained for each of the visible light
and the thermal infrared modalities, they are combined to obtain a single mask
for the scene. Rather than simply apply a logical “and” or “or” operation, the
images are fused using the following equations,
(M_IR(x, y, t) > τ_F1) & (M_Vis(x, y, t) > τ_F1),    (7.7)

M_IR(x, y, t) > τ_F2,    (7.8)

M_Vis(x, y, t) > τ_F2,    (7.9)

where M_IR is the thermal motion image, M_Vis is the visual motion image, and τ_F1
and τ_F2 are thresholds to control the fusion (τ_F2 > τ_F1). If any of these equations
are satisfied, the fused motion mask at (x, y, t) is set to indicate motion. The
resultant mask is used in the remainder of the system described in Chapter 6.
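A sketch of this mask fusion rule, assuming the per-pixel motion images hold confidence scores; the text only constrains τ_F2 > τ_F1, so the specific threshold values here are illustrative:

```python
import numpy as np

TAU_F1 = 0.4  # lower threshold: must be exceeded in both modalities (assumed value)
TAU_F2 = 0.8  # higher threshold: sufficient in a single modality (assumed value)

def fuse_motion_masks(m_ir, m_vis):
    """Combine thermal and visible motion images into one binary mask (Eqs 7.7-7.9)."""
    both = (m_ir > TAU_F1) & (m_vis > TAU_F1)  # Equation 7.7
    ir_only = m_ir > TAU_F2                    # Equation 7.8
    vis_only = m_vis > TAU_F2                  # Equation 7.9
    return (both | ir_only | vis_only).astype(np.uint8)
```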
Fusion After Object Detection
A second mid-fusion scheme is evaluated whereby motion detection and object
detection is carried out on both modalities, and the two object lists are used
to update the central list of tracked objects. Objects that have been previously
detected can be updated by a detection from either domain. For a new object to
be added, the object must be detected in both, or in the modality where it is not
detected there must be a given amount of motion within the region where the
object has been detected. The amount of motion required in the second modality
(where the object has not been detected) is the pixel count for the detected object
multiplied by a value, τFScale, which is set to 0.5 in our system. This attempts to
ensure that a false detection in one modality does not lead to a non-existent
track being initialised.
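A sketch of this rule for admitting a new track; the detection structure is illustrative, while τ_FScale = 0.5 follows the text:

```python
import numpy as np

TAU_F_SCALE = 0.5

def can_add_new_object(detection, detected_in_both, other_motion_mask):
    """Allow a new track only with support from the second modality."""
    if detected_in_both:
        return True
    x1, y1, x2, y2 = detection['box']
    # Motion present in the other modality within the detected region.
    support = np.count_nonzero(other_motion_mask[y1:y2, x1:x2])
    return support >= TAU_F_SCALE * detection['pixel_count']
```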
The proposed appearance model (see Section 6.3) is extended to contain infor-
mation from both motion detection routines. Additional motion and optical
flow components are added, such that the model consists of a shared colour com-
ponent, a motion and optical flow component for the visual domain input and
a motion and optical flow component for the thermal domain. The model can
be used to compare a detected object to either domain individually, or to both
simultaneously (an update can also be performed on only a single domain, or
both).
Fusion After Tracking
A late fusion scheme is evaluated where each modality is tracked individually
and the resultant tracked object lists are fused in the same manner as for a
multi-camera network. Each view is processed separately, and a list of tracked
objects from each view is generated and tracked independently. At the end of
each frame, a camera management module attempts to determine which objects
being tracked by the individual trackers represent the same real-world
objects. As our multi-camera network consists of two cameras observing
exactly the same area, there is no need to transfer to a world coordinate scheme
or rely on camera calibration; pixel coordinates can be used directly. However, as
one view is in the colour domain and one is in the thermal, we are unable to use
colour/appearance as an additional metric. Given this, we simply use the overlap
of the bounding boxes to group objects.
At the end of each frame, the object lists are compared. It is expected that all
objects should be tracked in both modalities. For those objects that are being
tracked in only one modality, the tracks in the second modality (that are not
already associated with a track in the first) are searched to find a match based on
the overlap of the bounding box. If a matching track cannot be found (presumably
due to an inability to detect the object due to poor motion detection), one is
created and the system will attempt to begin tracking in the next frame (in this
case, the new track is initialised without initialising histograms and appearance
models, and as a result the condensation filter cannot be used until these have
been initialised).
Tracked objects that have been paired across the views are compared each frame,
to check that they are, in fact, a valid match. If the overlap between these two
objects drops below a threshold for two consecutive frames, the pair is broken
up, and at the end of the next frame the system will attempt to pair the tracks
again (it is possible that they will be paired with each other again).
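A sketch of the late-fusion pairing step based on bounding-box overlap; the overlap measure, threshold and track structures are illustrative assumptions:

```python
def overlap_ratio(a, b):
    # Intersection area over the area of the smaller (x1, y1, x2, y2) box.
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (iw * ih) / smaller if smaller > 0 else 0.0

def pair_tracks(colour_tracks, thermal_tracks, pairs, tau_overlap=0.5):
    """Pair unmatched colour and thermal tracks whose boxes overlap sufficiently."""
    for c in colour_tracks:
        if c['id'] in pairs:
            continue
        unmatched = [t for t in thermal_tracks if t['id'] not in pairs.values()]
        best = max(unmatched, default=None,
                   key=lambda t: overlap_ratio(c['box'], t['box']))
        if best is not None and overlap_ratio(c['box'], best['box']) > tau_overlap:
            pairs[c['id']] = best['id']
    return pairs
```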
7.3.2 Evaluation of Fusion Techniques using OTCBVS Database
The OTCBVS Benchmark Dataset Collection[42] is used to evaluate the four fu-
sion tracking systems. This is a publicly available dataset that contains aligned
thermal infrared and colour image sequences of two different outdoor scenes con-
taining pedestrians. The sequences include a variety of situations of interest with
multiple pedestrians to test the system. We test the performance of the proposed
fusion system as well as tracking with both modalities individually.
Seven sub-sequences from the database are selected to highlight various situations
of interest such as stationary people, occlusions, people moving in shadowed areas,
and shadowing caused by cloud cover. Two sequences from the second location
(set 1 in our evaluation), and five from the first (set 2 in our evaluation) are used.
Separate results are shown for each set of sequences, as the first set (taken from
Location 2 in the database) contains significantly simpler scenarios than those in
the second. Ground truth tracking data has been computed for each of these sub-
sequences using the VIPER toolkit [166], and tracking performance is evaluated
using the ETISEO evaluation tool and metrics [130] as discussed in Chapter 3.
Details on annotation of the tracking output can be found in Section 3.5.3.
Results for the first set of two sequences are shown in Tables 7.5 and 7.6, and
Figure 7.10.
As Tables 7.5 and 7.6 show, there is little to no performance benefit from using
the proposed fusion schemes for the simple scenarios contained in set 1, when
            Detection     Localisation              Tracking
Algorithm   D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
Colour      0.88  0.59    0.80  0.98  1.00  1.00    0.60  0.52  0.63  0.92  0.79
Thermal     0.98  0.72    0.86  1.00  1.00  1.00    1.00  0.90  1.00  1.00  1.00
Fusion 1    0.97  0.53    0.82  1.00  1.00  1.00    0.92  0.65  0.88  1.00  0.92
Fusion 2    0.98  0.65    0.84  0.99  1.00  1.00    1.00  0.81  1.00  1.00  1.00
Fusion 3    0.98  0.66    0.85  1.00  1.00  1.00    1.00  0.85  1.00  1.00  1.00
Fusion 4    0.98  0.70    0.86  1.00  1.00  1.00    1.00  0.91  1.00  1.00  1.00

Table 7.5: Fusion Algorithm Evaluation - Set 1 Results (see Section 3.5.1 for an explanation of metrics)
Algorithm   Overall Detection    Overall Localisation    Overall Tracking
Colour      0.65                 0.93                    0.64
Thermal     0.77                 0.96                    0.99
Fusion 1    0.62                 0.94                    0.88
Fusion 2    0.72                 0.95                    0.97
Fusion 3    0.73                 0.95                    0.98
Fusion 4    0.75                 0.96                    0.99

Table 7.6: Fusion Algorithm Evaluation - Overall Set 1 Results (see Section 3.5.1 for an explanation of metrics)
compared to the performance of the thermal modality alone. The colour modality
performs the worst due to noise present in the colour dataset, which resulted
in poor performance for the motion detection (when compared to the thermal
modality). An example of this is shown in Figure 7.9. As the datasets used are
only of a short length (200-300 frames each), the motion detection algorithm is
not able to adjust to cope with the noise, and setting initial thresholds that do
not detect the noise results in large amounts of the motion also being missed.
The noise in the colour images only had a significant impact on the first fusion
scheme, with its detection and tracking performance falling well below that of
the thermal modality on its own. The other three fusion schemes were able to
effectively overcome the failings of the colour modality and perform at a similar
level to the thermal modality. No systems outperformed the thermal modality
alone, however this can be attributed to the fact that all objects were tracked
Figure 7.9: Noise in Colour Images for Set 1 (frames 100 to 103 for each of the two rows; panels (a)-(h) labelled row by row)
correctly within this modality (tracking performance is 0.99 out of a maximum
of 1.0).
Figure 7.10 shows an example of the tracking output from set 1, where two people
cross paths, causing an occlusion. The top row shows the output of tracking using
colour images only, the second row shows the output of tracking using the thermal
images only, the third row shows results of tracking using fusion scheme 1, the
fourth row shows results of tracking using fusion scheme 2, the fifth row shows
results of tracking using fusion scheme 3 and the sixth row shows tracking results
using fusion scheme 4.
With the exception of the colour only modality, each configuration is able to re-
solve the occlusion correctly. The failure in the colour modality can be attributed
to the poor motion and object detection performance as a result of the noise (see
Figure 7.9 for an example of the noise in the colour images). Despite this failure
in the colour modality, the fusion systems are all able to overcome this and track
the two objects through the occlusion correctly.
Results for the second set of five sequences are shown in Tables 7.7 and 7.8, and
Figure 7.10: Example System Results for Set 1 - Occlusion (frames 50, 60, 70, 80 and 90 for each of the six rows; panels (a)-(ad) labelled row by row)
Figures 7.11 and 7.12.
As Tables 7.7 and 7.8 show, the third proposed fusion scheme achieves the best
performance, ahead of the thermal modality individually. Examples of the system
output are shown in Figures 7.11 and 7.12. Within these figures, the top row
shows the output of tracking using colour images only, the second row shows the
            Detection     Localisation              Tracking
Algorithm   D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
Colour      0.72  0.42    0.64  1.00  1.00  1.00    0.43  0.34  0.90  0.97  0.40
Thermal     0.90  0.50    0.77  1.00  1.00  1.00    0.61  0.45  0.86  0.91  0.51
Fusion 1    0.80  0.43    0.66  1.00  1.00  1.00    0.55  0.50  0.86  0.97  0.44
Fusion 2    0.85  0.46    0.72  1.00  1.00  1.00    0.58  0.45  0.91  0.98  0.47
Fusion 3    0.90  0.55    0.77  1.00  1.00  1.00    0.64  0.51  0.89  0.96  0.52
Fusion 4    0.88  0.52    0.76  1.00  1.00  1.00    0.54  0.47  0.75  0.94  0.47

Table 7.7: Fusion Algorithm Evaluation - Set 2 Results (see Section 3.5.1 for an explanation of metrics)
Algorithm   Overall Detection    Overall Localisation    Overall Tracking
Colour      0.48                 0.89                    0.49
Thermal     0.58                 0.93                    0.61
Fusion 1    0.51                 0.89                    0.58
Fusion 2    0.54                 0.91                    0.59
Fusion 3    0.62                 0.93                    0.64
Fusion 4    0.59                 0.92                    0.56

Table 7.8: Fusion Algorithm Evaluation - Overall Set 2 Results (see Section 3.5.1 for an explanation of metrics)
output of tracking using the thermal images only, the third row shows results of
tracking using fusion scheme 1, the fourth row shows results of tracking using
fusion scheme 2, the fifth row shows results of tracking using fusion scheme 3 and
the sixth row shows tracking results using fusion scheme 4.
All systems perform significantly worse on the second set of data, due to the more
complex nature of the data. The scenes contain more people, the people being
tracked are smaller in the image, there are heavy shadows cast by the people and
the environment as well as shadows caused by moving clouds. As a result of the
shadowing present, the colour modality performs very poorly, resulting in many
false tracks being created as shadows from moving clouds are cast over the scene
(see Figure 7.12). The thermal modality does not suffer from these problems,
and very few false tracks are created.
Within the colour modality, the presence of such severe shadowing results in a
large amount of false motion, which spawns a large number of false tracks. Whilst
the shadows can be partially removed using shadow detection, shadow detection
also results in some motion caused by the people in the scene being lost (i.e. it
is falsely classified as being caused by shadows). This is due to most people in
the scene appearing significantly darker than the background, and there being
little texture on either the background or people (i.e. fairly constant colours,
low gradient). As such, more aggressive shadow detection, whilst able to remove
more of the shadows, also removes more of the motion caused by people.
The fusion systems all see some improvement over the colour modality, however
all except the third are outperformed by the thermal modality alone. The first,
second and fourth fusion schemes are less effective at being able to completely
ignore a modality when it is performing poorly. The first and fourth fusion
schemes will always use the available information in the same manner regardless
of performance. The first fusion scheme achieves some improvement as the effect
of shadows is diluted due to the luminance channel being expanded. When each
cluster is compared in the motion detection, only a single luminance value is from
the colour modality (in the colour only system, both pixels are from the colour
modality). The effect of this is that the shadow detection performs better, but
still does not remove all erroneous motion. As a result some false tracks are still
spawned. The fourth fusion scheme assumes that each modality is producing
correct results, and attempts to merge the two lists of tracked objects through
the camera management. This results in the system being able to maintain the
identity of the valid objects more effectively (through the thermal modality), but
does not assist in handling or preventing the invalid tracks that are spawned in
the colour modality.
The second scheme is able to disregard an input in the event of suspected failure
(the same mechanism is used by the third fusion scheme), but this will not nec-
essarily register an error when a shadow moves gradually across the scene, and
is better suited to dealing with errors caused by automatic gain control,
or indoor situations where lights are turned on/off. The shadows that appear
in these scenes gradually move across the scene, and there is no rapid change
in motion levels. As such, no error is detected and the colour modality is used,
despite the significant errors present. Whilst the difference between the levels of
motion in the thermal and colour modality could be used to indicate a problem
(i.e. they should each see a similar amount of motion in ideal circumstances), it is
hard to determine which modality is in error in a situation such as this (i.e. it is
just as likely that the thermal camera could be missing large portions of motion).
The third proposed scheme is better equipped to ignore the motion caused by
shadows as it does not appear in the thermal images, and so new tracks cannot
be spawned as a result of shadow motion (at least some motion is required in
both images to create a new track). This same mechanism also helps to deal
with errors in the thermal images (see Figure 7.11 - in (g) a second track
is created along the building side as a result of a door being opened, however the
third fusion scheme is able to avoid this).
Under appropriate conditions, all fusion schemes can offer some level of improve-
ment over using either modality alone. Overall however, our third proposed
fusion scheme (fusion after object detection) performs the best, outperforming
each camera on its own and the other fusion schemes. Fusion schemes one and
two are directly reliant on the quality of the motion detection from the colour and
thermal images. If either image contains excessive noise (sensor noise, or environ-
mental effects such as shadowing) the whole system suffers as the fusion has been
performed before any object detection processes, and so the object detection for
the whole system is degraded. Fusion scheme 4 performs object detection and
tracking independently on each image, and merges results. Poor performance in
Figure 7.11: Example System Results for Set 2 (frames 50, 70, 90, 110 and 130 for each of the six rows; panels (a)-(ad) labelled row by row)
one modality cannot be corrected by the other modality.
Depending on the conditions of the scene, fusion schemes 1, 2 and 4 may still al-
low some improvement over either modality individually, however at other times
they can result in reduced performance. This can possibly be overcome by modifying
Figure 7.12: Example System Results for Set 2 (frames 50, 70, 90, 110 and 130 for each of the six rows; panels (a)-(ad) labelled row by row)
the early fusion schemes to determine fusion parameters dynamically, or adding
additional intelligence to the multi-camera systems in fusion scheme 4 (possibly
a similar system to that used in the third scheme). Fusion after the object detec-
tion overcomes this problem more effectively, as in the event that one modality
produces poor results, the system can ignore this modality entirely and fall back
on the second to update the system until both modalities are producing usable
results.
The results from set two show that even when one modality (the colour modality
in this case) is producing very poor results, it can still allow improvements in the
detection and in the tracking over time of objects (see Table 7.8) due to the added
colour information, which allows for better matching using appearance models
and histograms, used by the condensation filter. This fusion scheme weights both
inputs equally, assuming that either one is equally likely to produce valid/invalid
data. The thermal modality could be weighted higher for tasks such as initial
object detection to initialise tracks (so that fewer false tracks are spawned), yet
the discriminating power offered by the colour modality when tracking known
objects is not lost.
7.3.3 Proposed Fusion System
It has been shown in Section 7.3.2 that the best approach for multi-sensor object
tracking within the object tracking framework used in this thesis is a middle
fusion approach, where motion detection and object detection are performed on
each modality. The results of the individual object detection routines are used
to update a single list of tracked objects. Given these results, a more advanced
approach to fusion at this point in the system is proposed. Figure 7.13 shows a
flowchart of the proposed system.
It is important to monitor the performance of each modality, to ensure that if one
modality is severely affected by noise, objects detected within that modality are
weighted less heavily, or ignored. The performance of the motion detection
is measured and evaluated as described in Section 7.3.1. However, as shown in
Section 7.3.2, this approach is only able to handle severe problems and issues
Figure 7.13: Flowchart for Proposed Fusion System
such as shadows cast by clouds are unlikely to result in errors being detected. To
overcome this, a method that can gauge the performance of the object detection
is required.
Motion detection and object detection are performed on each modality, resulting
in two object lists, O_visible(t) and O_thermal(t), of size N_visual(t) and N_thermal(t)
respectively. Ideally, each of these object lists should contain the number of
objects presently in the scene, N(t). The number of objects detected compared
to the number of objects present is used to determine the performance of each
modality at a given time,
q_visual(t) = q_thermal(t) = 1    if N(t) < α,    (7.10)

q_visual(t) = 1 - \frac{min(max(|N(t) - N_visual(t)| - α, 0), N(t))}{max(max(|N(t) - N_visual(t)| - α, 0), N(t))},    (7.11)

q_thermal(t) = 1 - \frac{min(max(|N(t) - N_thermal(t)| - α, 0), N(t))}{max(max(|N(t) - N_thermal(t)| - α, 0), N(t))},    (7.12)
A tolerance of α objects is allowed (within the proposed system α is set to 1). This
tolerance ensures that when the system contains no objects, the appearance of
an object does not result in the performance of the system dropping significantly
(this is also dealt with through the use of a learning rate to curb rapid changes
in the performance metric, see Equations 7.13 and 7.14). Whilst multiple objects
entering and exiting the scene will result in a drop in performance for the modality,
ideally this drop should be uniform across each modality. As it is the difference
in performance between the modalities that is important, this is not a problem.
The performance for a given frame is incorporated into a global performance
metric which is adjusted gradually,
p_visual(t) = p_visual(t-1) + \frac{q_visual(t) - p_visual(t-1)}{L_fus},    (7.13)

p_thermal(t) = p_thermal(t-1) + \frac{q_thermal(t) - p_thermal(t-1)}{L_fus},    (7.14)
where Lfus is the learning rate for the performance metric. Initial performance
metrics may be specified within the system configuration, by default weighting
one modality above another (i.e. if it is known that one modality is less reliable
for a given scene it can be by default set to a lower value).
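A sketch of the per-frame quality measure and its smoothed update (Equations 7.10 to 7.14); α = 1 follows the text, while the learning rate value is an assumption:

```python
ALPHA = 1     # tolerance, in objects (per the text)
L_FUS = 20.0  # learning rate for the smoothed performance metric (assumed value)

def frame_quality(n_true, n_detected):
    """Detection quality of one modality for a single frame (Equations 7.10-7.12)."""
    if n_true < ALPHA:
        return 1.0
    diff = max(abs(n_true - n_detected) - ALPHA, 0)
    return 1.0 - min(diff, n_true) / max(diff, n_true)

def update_performance(p_prev, q):
    """Gradual update of the global performance metric (Equations 7.13 and 7.14)."""
    return p_prev + (q - p_prev) / L_FUS
```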
These performance metrics are then used to determine the weighting applied to
each modality when fusing object lists, and adding objects. The relative strength
of each modality for the task of object detection is calculated,
w_visual(t) = \frac{p_visual(t)}{p_visual(t) + p_thermal(t)},    (7.15)

w_thermal(t) = \frac{p_thermal(t)}{p_visual(t) + p_thermal(t)},    (7.16)
where w_visual(t) is the performance of the visual modality relative to the thermal,
and w_thermal(t) is the performance of the thermal modality relative to the visual.
This process ensures that the weights of the two modalities sum to 1, which simplifies
the process of merging objects.
The two object lists are merged, by determining the overlap between the objects.
If the overlap between the two objects is greater than a threshold, T^{fusion}_{ov}, the
objects are merged. For each object, there are several parameters such as the
bounding box, centroid and velocities. Each of these values is merged according
to the equation,
O_fused(t, i) = O_visual(t, j) × w_visual(t) + O_thermal(t, k) × w_thermal(t),    (7.17)

where O_visual(t, j) is the visual object being merged, O_thermal(t, k) is the thermal
object being merged and O_fused(t, i) is the resultant fused object.
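A sketch of the relative weighting and the weighted merge of a matched pair of detections (Equations 7.15 to 7.17); the dictionary-based object representation is an assumption:

```python
import numpy as np

def modality_weights(p_visual, p_thermal):
    # Relative strength of each modality; the two weights sum to one.
    total = p_visual + p_thermal
    return p_visual / total, p_thermal / total

def merge_objects(obj_visual, obj_thermal, w_visual, w_thermal):
    """Weighted fusion of matched objects (Equation 7.17).

    Each object is a dict of numeric parameters, e.g. 'box', 'centroid', 'velocity'.
    """
    return {key: w_visual * np.asarray(obj_visual[key]) +
                 w_thermal * np.asarray(obj_thermal[key])
            for key in obj_visual}
```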
This yields three object lists, O_fused(t), O'_visible(t) and O'_thermal(t), representing
the fused objects, the remaining visible objects and the remaining thermal objects
respectively. The updated lists of visible and thermal objects are defined as,
O'_visible(t) = {o ∈ O_visible(t) : o ∉ O_fused(t)},    (7.18)

O'_thermal(t) = {o ∈ O_thermal(t) : o ∉ O_fused(t)}.    (7.19)
The object lists are used to update the known tracks and add new tracks to the
system. This process is performed in the following steps:
1. Match objects in the merged list, Ofused(t), to the tracked list.
2. Match objects in the individual lists (O′visible(t) and O′thermal(t)) to the
tracked list, such that the best fitting object from either list is matched
in turn.
3. Add new objects within the merged list.
4. Add new objects within the individual lists.
The fourth stage involves additional checks to ensure that invalid objects are not
added, and an additional state is added to the system to accommodate this pro-
cess. A prerequisite amount of motion must be present within the other modality
for such a detection to be valid (this amount is specified in a configuration file,
rather than derived from the performance metrics), and the performance of the
modality must be greater than a threshold, τa (set to 0.5 within the proposed
system). Figure 7.14 shows the updated state diagram.
Figure 7.14: Updated State Diagram
The Preliminary Single Modality state is used for objects that are added after
detection in a single modality. It is similar in behaviour to the Entry state, in
that objects must be continuously detected when in this state, or they will be
removed from the system. The difference regarding this state is the time that an
object spends in this state before entering the Active state. This is determined
as follows,
τ_active(i) = \frac{τ_active}{p_m(t)},    (7.20)
where τactive(i) is the active threshold for object i (the object being added), τactive
is the default threshold and pm(t) is the performance for the modality m, from
which the object is being added. However, when an object that is in this state is
detected in both modalities (i.e. it is updated from an object in the Ofused(t) list),
the threshold is decremented by one (along with the increment in the detection
count, in effect this counts as two detections).
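A sketch of the promotion rule for this state (Equation 7.20); the default threshold value and the counter handling are illustrative assumptions:

```python
TAU_ACTIVE = 10  # default detections required before a track becomes Active (assumed value)

def active_threshold(p_modality):
    # Equation 7.20: scale the default threshold by the modality's performance.
    return TAU_ACTIVE / max(p_modality, 1e-6)

def update_preliminary_track(track, detected_in_fused_list):
    # A detection in both modalities effectively counts as two detections.
    track['detections'] += 2 if detected_in_fused_list else 1
    if track['detections'] >= track['threshold']:
        track['state'] = 'Active'
```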
The system's integration with the particle filter is unchanged (i.e. the particle
filter does not weight one modality above another).
7.3.4 Evaluation of System using OTCBVS Database
The proposed fusion approach is evaluated using the same datasets and metrics as
used in Section 7.3.2. Results are compared to the visual and thermal modalities
individually as well as the third proposed fusion scheme (see Section 7.3.1) upon
which this system is based.
            Detection     Localisation              Tracking
Algorithm   D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
Colour      0.88  0.59    0.80  0.98  1.00  1.00    0.60  0.52  0.63  0.92  0.79
Thermal     0.98  0.72    0.86  1.00  1.00  1.00    1.00  0.90  1.00  1.00  1.00
Fusion 3    0.98  0.66    0.85  1.00  1.00  1.00    1.00  0.85  1.00  1.00  1.00
Proposed    0.98  0.69    0.85  1.00  1.00  1.00    1.00  0.83  1.00  1.00  1.00

Table 7.9: Proposed Fusion Algorithm - Set 1 Results (see Section 3.5.1 for an explanation of metrics)
Algorithm   Overall Detection    Overall Localisation    Overall Tracking
Colour      0.65                 0.93                    0.64
Thermal     0.77                 0.96                    0.99
Fusion 3    0.73                 0.95                    0.98
Proposed    0.75                 0.95                    0.98

Table 7.10: Proposed Fusion Algorithm - Overall Set 1 Results (see Section 3.5.1 for an explanation of metrics)
Tables 7.9 and 7.10 show the results for set 1, using the proposed algorithm.
The performance of the proposed algorithm is very similar to that of the thermal
modality alone, and the third evaluated fusion scheme (see Section 7.3.1) upon
which the proposed algorithm is based. This is expected given that the thermal
modality and original fusion schemes performed very well for these datasets, and
so no significant improvement, or change in performance was expected.
Tables 7.11 and 7.12, and Figure 7.15 show the results for set 2 using the proposed
fusion algorithm. As can be seen, the proposed fusion algorithm offers a significant
improvement over both individual modalities, and a noticeable improvement over
the third fusion scheme on which the proposed algorithm is based.
            Detection     Localisation              Tracking
Algorithm   D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
Colour      0.72  0.42    0.64  1.00  1.00  1.00    0.43  0.34  0.90  0.97  0.40
Thermal     0.90  0.50    0.77  1.00  1.00  1.00    0.61  0.45  0.86  0.91  0.51
Fusion 3    0.90  0.55    0.77  1.00  1.00  1.00    0.64  0.51  0.89  0.96  0.52
Proposed    0.93  0.56    0.76  1.00  1.00  1.00    0.70  0.49  0.95  0.97  0.60

Table 7.11: Proposed Fusion Algorithm - Set 2 Results (see Section 3.5.1 for an explanation of metrics)
Algorithm   Overall Detection    Overall Localisation    Overall Tracking
Colour      0.48                 0.89                    0.49
Thermal     0.58                 0.93                    0.61
Fusion 3    0.62                 0.93                    0.64
Proposed    0.63                 0.92                    0.69

Table 7.12: Proposed Fusion Algorithm - Overall Set 2 Results (see Section 3.5.1 for an explanation of metrics)
Figure 7.15 shows a situation where there is heavy shadowing caused by a cloud
moving across the scene. The top row is the output of the colour modality only,
the second row is the thermal modality only, the third row is the output of
the third evaluated fusion scheme (see Section 7.3.1) and the fourth row is the
proposed fusion algorithm. In this situation, the colour modality alone performs
very poorly, spawning false tracks before losing track of the majority of the objects
in the scene as the motion detector attempts to adjust to cope with the changes.
The third of the initially proposed fusion approaches performs much better, but
a false track is still spawned (bottom centre of frame) due to noise in the thermal
image coinciding with errors caused by shadows in the visual image. The proposed
fusion algorithm does not have this problem, as it is able to identify that the
detection in the visual modality is failing catastrophically, and ignore the modality
far more effectively.
Figure 7.15: Example System Results for Set 2 (frames 50, 70, 90, 110 and 130 for each of the four rows; panels (a)-(t) labelled row by row)
7.4 Summary
This chapter has presented two extensions to the single camera tracking systems
proposed in this thesis:
1. A multi-camera tracking system.
2. A multi-spectral tracking system.
It has been shown that by using multiple cameras to cover the same area of a
scene, system performance can be improved. This can allow a system to antic-
ipate when objects are entering, and help recover from errors in a single view.
However, the use of multiple cameras does not overcome problems caused by poor
segmentation and detection, and in some cases (depending on how the camera
management is implemented) it may in fact exacerbate the problem.
This chapter has also evaluated approaches for use in a multi-spectral tracking
system, where a visual colour modality and thermal modality are combined. It
has been shown that a middle fusion approach, where motion detection and object
detection are applied to each image and the object detection results are fused, is
optimal. An improved middle fusion scheme has been proposed and is shown to
offer significant improvements over either modality individually.
Chapter 8
Conclusions and Future Work
8.1 Introduction
This thesis has examined methods to improve the performance of motion segmen-
tation algorithms and particle filtering techniques for object tracking applications;
and examined methods for multi-modal fusion in an object tracking system.
Motion segmentation is a key step in many tracking algorithms as it forms the
basis of object detection. Improving segmentation results, as well as being able
to extract additional information such as optical flow and motion status (i.e.
stationary or active), allows for improved object detection and thus tracking. Oc-
clusions and difficulties in object detection are a major source of error in tracking
systems. However, a strength of particle filters is their ability to track objects in
adverse situations. Integrating a particle filter within a standard tracking system
allows the particle filter to use progressively updated features and aids in main-
taining identity of the tracked objects, and provides the tracking system with an
effective means to handle occlusions.
The research in the above areas has led to four main contributions, these being:
1. The simultaneous computation of multi-layer motion segmentation and op-
tical flow information.
2. The combined use of multi-layer motion and optical flow to improve object
detection and tracking performance.
3. The development of the scalable condensation filter (SCF, a mixture con-
densation filter that can dynamically scale the number of particles and
number and type of features, for each mixture component) and its inte-
gration into an existing tracking system to allow for improved occlusion
handing and tracking in adverse conditions.
4. An investigation into multi-sensor fusion for object tracking, and the devel-
opment of a fusion scheme for fusing a visual colour modality and thermal
modality for object tracking.
These innovations have been shown to improve the performance of object tracking
in adverse conditions, such as in the presence of complex lighting, or situations
where there are frequent occlusions. In the following section a summary of these
four contributions is provided.
8.2 Summary of Contribution
The four original contributions in this thesis are:
(i) Simultaneous computation of multi-layer motion segmentation and
optical flow
A novel motion segmentation technique that can simultaneously calculate optical
flow as well as multi-layer motion segmentation has been proposed. Regions of
motion are divided into multiple layers of stationary foreground (i.e. objects that
have entered the scene and come to a stop) and a single layer of active foreground
(objects that are currently moving). Optical flow is calculated using a window
matching approach, incorporating previous optical flow results as well as previous
motion detection results, such that the algorithm does not attempt to match to
regions that were background in the previous frame, to help reduce discontinu-
ities. The optical flow is calculated to pixel precision, but as it is targeted at
tracking applications, sub-pixel resolution is not required. The proposed system
also uses the optical flow information and a short term history to compute a mo-
tion consistency map, indicating probable overlaps and other discontinuities (i.e.
new motion).
As well as the additional modes of output, lighting compensation, a variable
threshold, shadow detection and feedback approach have been proposed to im-
prove the segmentation performance. The merit of a variable threshold for each
pixel against a single global variable threshold has also been investigated.
The proposed algorithm has been evaluated using the AESOS database, the
CAVIAR database [48] and data captured in-house, and significant improvement
over the baseline (see Section 4.6) has been demonstrated.
(ii) Incorporation of multi-layer motion and optical flow into object
tracking
The proposed motion segmentation algorithm has been integrated into a tracking
system capable of utilising the multiple modes of output (optical flow, multi-layer
foreground segmentation, motion consistency map). Novel methods to utilise
these modes of input within a tracking system have been proposed; these include:
1. The use of optical flow to extract moving objects.
2. The use of stationary foreground information to locate temporarily stopped
objects, or detect abandoned objects.
3. The use of the motion consistency map to detect overlaps and provide fur-
ther segmentation during the detection stage.
The proposed methods have been tested using the ETISEO database [130]
and achieved up to 24% improvement in tracking performance over the base-
line system (amount of improvement varies for different datasets, see Section 5.6).
(iii) Improved object tracking through the Scalable Condensation
Filter (SCF)
The Scalable Condensation Filter (SCF) has been proposed. The SCF is a novel
implementation of condensation filter, incorporating elements of the mixture par-
ticle filter [163] and boosted particle filter [135]. The SCF allows:
1. A time varying number of objects to be tracked by independent mixtures.
2. Each mixture to use a time varying number of particles according to the
system complexity.
3. Each mixture to use a time varying number of features, which may be of
differing types to other mixtures.
4. Results of the object detection routines to be incorporated into the mix-
tures.
The SCF is integrated into the previously proposed tracking system. This allows:
1. The underlying tracking system to maintain the identity of each tracked
object, and add and remove mixture components from the SCF as appro-
priate.
2. The SCF to use progressively updated features which adapt to changes in
the object’s appearance as they move through the scene, and are unique for
each feature.
3. The tracking system to utilise the SCF when handling occlusions and situ-
ations where detection is unreliable, rather than relying on predicting the
object’s position until it reappears, or a timeout is reached.
The proposed system is tested using the ETISEO database [130], and improve-
ments in the tracking performance and occlusion handling are shown (see Section
6.5). However, the SCF does make the system more susceptible to errors when
object detection is poor, or when there is low contrast between foreground
objects and the background.
(iv) Investigation into and development of methods to fuse multiple
modalities for object tracking
An evaluation of fusion schemes for multi-sensor fusion for object tracking has
been performed. Four simple fusion schemes have been evaluated for a multi-
modal system consisting of a visual colour modality and thermal modality:
1. Fusion during the motion detection process.
2. Fusion of the motion detection output.
3. Fusion of the object detection results.
4. Fusion of the tracked object lists.
An evaluation using the OTCBVS database [42] has shown that fusion of the
object detection results in optimal performance (see Section 7.3.2). A novel
multi-spectral tracking system, that performs fusion on the object detection
results, has been proposed and is shown to offer significant advantages over
either modality alone (see Section 7.3.4).
8.3 Future Research
This thesis has contributed to several areas of object tracking and intelligent
surveillance, however there are still several areas that future work could address. Areas
of future work that further improve the techniques proposed in this thesis, as well as
potential new research directions, are listed below:
• The proposed object tracking systems will continue to be evaluated as new
datasets are made available, to confirm the results on a wider range of data,
and further refine the proposed algorithms.
• The proposed motion segmentation algorithm could be extended to be able
to recognise, and ignore motion from, cast light such as vehicle headlights.
This would improve performance of motion segmentation and vehicle track-
ing systems in dark conditions.
• The proposed motion segmentation system could be modified such that
connected regions of the same classification are grouped (i.e. a
shadow region, active foreground region, etc.). Regions could be analysed
to remove errors (such as spurious motion in the middle of a shadow, or
in the middle of a region of stationary foreground) and provide additional
feedback to the motion segmentation algorithm to improve performance in
future frames.
• The image segmentation technique, GrabCut [147], can be used to segment
images based on a selected region of interest, and GMMs of positive and
negative features extracted from this initial selection. The GrabCut al-
gorithm and proposed motion segmentation algorithm could be combined,
such that regions of motion are used to initialise the GrabCut process and
improve segmentation performance. If successful, such a system could be
integrated into a tracking system such that each track stores its own GMM,
and this GMM, combined with a coarse position estimation and motion de-
tection results could be used to segment the target object in future frames.
• Tracking systems presently rely on a single frame for detection results. De-
tection could be performed over a sliding frame window, such that at frame
t, detection results for frames t− 1, t, and t+ 1 are combined and used to
update the list of tracked objects. Given that tracking systems typically
process video at 15-25fps and surveillance cameras have a wide field of view,
the difference in position from frame to frame is small and combining de-
tection results from consecutive frames is likely to improve detection and
tracking performance, without adversely affecting the localisation of the
tracked objects.
• The use of the SCF within the proposed tracking system can be further
improved, to overcome the problem of poorly initialised tracked objects
(due to detection/segmentation errors) being lost. Further improvements
can also be made to better handle long occlusions, and ensure that tracks
are not lost during a prolonged occlusion.
Appendix A
Baseline Tracking System Results
            Detection     Localisation              Tracking
Data Set    D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
RD6         0.83  0.73    0.84  1.00  0.98  1.00    0.44  0.35  0.75  0.97  0.44
RD7         0.71  0.56    0.76  1.00  0.98  1.00    0.34  0.27  0.79  0.93  0.20
Average     0.77  0.64    0.80  1.00  0.98  1.00    0.39  0.31  0.77  0.95  0.32

Table A.1: RD Dataset Results
Data Set    Overall Detection    Overall Localisation    Overall Tracking
RD6         0.75                 0.95                    0.48
RD7         0.59                 0.92                    0.39
Average     0.67                 0.93                    0.44

Table A.2: Overall RD Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BC16       0.77  0.44  0.75  1.00  0.98  0.99  0.46  0.28  0.75  0.81  0.38
BC17       0.77  0.38  0.72  1.00  0.98  0.99  0.44  0.33  0.73  0.88  0.29
Average    0.77  0.41  0.73  1.00  0.98  0.99  0.45  0.30  0.74  0.85  0.34
Table A.3: BC Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BC16       0.50                0.91                   0.47
BC17       0.45                0.91                   0.46
Average    0.48                0.91                   0.46
Table A.4: Overall BC Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
AP11-C4    0.86  0.67  0.79  1.00  1.00  1.00  0.47  0.25  1.00  1.00  0.17
AP11-C7    0.86  0.77  0.84  1.00  1.00  1.00  0.65  0.51  0.88  1.00  0.60
AP12-C4    0.76  0.58  0.69  1.00  1.00  1.00  0.50  0.51  1.00  1.00  0.67
AP12-C7    0.80  0.54  0.72  1.00  1.00  1.00  0.66  0.30  1.00  1.00  0.38
Average    0.82  0.64  0.76  1.00  1.00  1.00  0.57  0.39  0.97  1.00  0.45
Table A.5: AP Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
AP11-C4    0.71                0.93                   0.47
AP11-C7    0.79                0.95                   0.66
AP12-C4    0.61                0.90                   0.60
AP12-C7    0.59                0.91                   0.61
Average    0.68                0.93                   0.59
Table A.6: Overall AP Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BE19-C1    0.83  0.46  0.73  1.00  0.95  1.00  0.29  0.21  0.54  0.86  0.29
BE19-C3    0.74  0.03  0.46  0.50  0.50  1.09  0.44  0.00  0.00  0.00  0.00
BE20-C1    0.75  0.50  0.74  0.99  0.97  1.00  0.05  0.24  0.32  0.71  0.54
BE20-C3    0.33  0.01  0.27  0.75  0.75  1.04  0.00  0.01  1.00  1.00  0.00
Average    0.66  0.25  0.55  0.81  0.79  1.03  0.19  0.11  0.47  0.64  0.21
Table A.7: BE Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BE19-C1    0.53                0.90                   0.34
BE19-C3    0.17                0.54                   0.29
BE20-C1    0.55                0.91                   0.21
BE20-C3    0.07                0.62                   0.14
Average    0.33                0.74                   0.25
Table A.8: Overall BE Dataset Results
Appendix B
Tracking System with Improved
Motion Segmentation Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
RD6        0.85  0.71  0.83  1.00  0.99  1.00  0.54  0.41  0.96  0.96  0.54
RD7        0.79  0.55  0.79  1.00  0.98  1.00  0.45  0.32  0.80  0.89  0.37
Average    0.82  0.63  0.81  1.00  0.98  1.00  0.50  0.36  0.88  0.92  0.46
Table B.1: RD Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD6        0.74                0.94                   0.58
RD7        0.60                0.93                   0.47
Average    0.67                0.94                   0.53
Table B.2: Overall RD Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BC16       0.83  0.44  0.75  1.00  0.97  0.99  0.53  0.29  0.78  0.76  0.40
BC17       0.77  0.38  0.71  1.00  0.98  0.99  0.46  0.30  0.71  0.89  0.32
Average    0.80  0.41  0.73  1.00  0.97  0.99  0.49  0.30  0.74  0.83  0.36
Table B.3: BC Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BC16       0.52                0.91                   0.51
BC17       0.46                0.90                   0.47
Average    0.49                0.91                   0.49
Table B.4: Overall BC Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
AP11-C4    0.87  0.69  0.80  1.00  1.00  1.00  0.47  0.26  1.00  1.00  0.17
AP11-C7    0.90  0.79  0.85  1.00  1.00  1.00  0.75  0.53  0.88  1.00  0.60
AP12-C4    0.66  0.45  0.64  1.00  1.00  1.00  0.46  0.49  1.00  0.88  0.57
AP12-C7    0.83  0.51  0.72  1.00  1.00  1.00  0.73  0.27  1.00  1.00  0.33
Average    0.82  0.61  0.75  1.00  1.00  1.00  0.60  0.39  0.97  0.97  0.42
Table B.5: AP Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
AP11-C4    0.73                0.94                   0.47
AP11-C7    0.81                0.95                   0.72
AP12-C4    0.50                0.89                   0.55
AP12-C7    0.57                0.91                   0.65
Average    0.65                0.92                   0.60
Table B.6: Overall AP Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BE19-C1    0.75  0.49  0.73  1.00  0.94  1.00  0.27  0.29  0.83  0.90  0.31
BE19-C3    0.86  0.14  0.52  1.00  1.00  0.99  0.45  0.09  1.00  1.00  0.00
BE20-C1    0.77  0.50  0.74  0.99  0.98  1.00  0.18  0.22  0.29  0.56  0.54
BE20-C3    0.33  0.00  0.27  0.50  0.50  1.00  0.05  0.01  1.00  1.00  0.00
Average    0.68  0.28  0.57  0.87  0.85  1.00  0.24  0.15  0.78  0.87  0.21
Table B.7: BE Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BE19-C1    0.54                0.90                   0.36
BE19-C3    0.28                0.85                   0.41
BE20-C1    0.56                0.91                   0.27
BE20-C3    0.07                0.47                   0.17
Average    0.36                0.78                   0.31
Table B.8: Overall BE Dataset Results
Appendix C
Tracking System with SCF
Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
RD6        0.86  0.73  0.84  1.00  0.98  1.00  0.54  0.40  0.92  0.95  0.48
RD7        0.80  0.59  0.80  1.00  0.98  1.00  0.58  0.33  0.86  0.86  0.43
Average    0.83  0.66  0.82  1.00  0.98  1.00  0.56  0.37  0.89  0.91  0.45
Table C.1: RD Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD6        0.76                0.95                   0.57
RD7        0.63                0.93                   0.56
Average    0.69                0.94                   0.57
Table C.2: Overall RD Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BC16       0.83  0.46  0.75  0.99  0.97  0.99  0.49  0.30  0.71  0.78  0.45
BC17       0.77  0.38  0.71  0.99  0.98  0.99  0.48  0.31  0.71  0.88  0.36
Average    0.80  0.42  0.73  0.99  0.97  0.99  0.49  0.30  0.71  0.83  0.41
Table C.3: BC Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BC16       0.53                0.91                   0.50
BC17       0.45                0.90                   0.48
Average    0.49                0.91                   0.49
Table C.4: Overall BC Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
AP11-C4    0.88  0.75  0.83  1.00  1.00  1.00  0.51  0.36  0.88  1.00  0.51
AP11-C7    0.90  0.79  0.86  1.00  1.00  1.00  0.64  0.54  0.88  1.00  0.73
AP12-C4    0.68  0.50  0.64  1.00  1.00  1.00  0.48  0.46  0.87  1.00  0.63
AP12-C7    0.83  0.53  0.73  1.00  1.00  1.00  0.66  0.31  1.00  1.00  0.38
Average    0.82  0.64  0.76  1.00  1.00  1.00  0.57  0.42  0.90  1.00  0.56
Table C.5: AP Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
AP11-C4    0.78                0.95                   0.55
AP11-C7    0.81                0.96                   0.68
AP12-C4    0.54                0.89                   0.56
AP12-C7    0.59                0.92                   0.62
Average    0.68                0.93                   0.60
Table C.6: Overall AP Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BE19-C1    0.79  0.42  0.74  1.00  0.94  1.00  0.42  0.21  0.58  0.83  0.33
BE19-C3    0.84  0.13  0.53  1.00  1.00  0.99  0.49  0.07  1.00  1.00  0.00
BE20-C1    0.76  0.48  0.73  0.99  0.98  1.00  0.09  0.23  0.30  0.63  0.38
BE20-C3    0.33  0.01  0.27  0.89  1.00  1.00  0.11  0.01  1.00  1.00  0.00
Average    0.68  0.26  0.57  0.97  0.98  1.00  0.28  0.13  0.72  0.86  0.18
Table C.7: BE Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BE19-C1    0.50                0.90                   0.42
BE19-C3    0.27                0.85                   0.43
BE20-C1    0.54                0.91                   0.21
BE20-C3    0.07                0.74                   0.21
Average    0.34                0.85                   0.32
Table C.8: Overall BE Dataset Results
Appendix D
Multi-Camera Tracking System
Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
AP11-C4    0.87  0.76  0.82  1.00  1.00  1.00  0.51  0.36  0.88  1.00  0.51
AP11-C7    0.89  0.79  0.86  1.00  1.00  1.00  0.90  0.63  1.00  1.00  0.90
AP12-C4    0.73  0.49  0.68  0.99  1.00  1.00  0.51  0.46  0.87  1.00  0.63
AP12-C7    0.83  0.54  0.73  1.00  1.00  1.00  0.66  0.31  1.00  1.00  0.38
Average    0.83  0.64  0.77  1.00  1.00  1.00  0.64  0.44  0.94  1.00  0.61
Table D.1: AP Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
AP11-C4    0.78                0.95                   0.55
AP11-C7    0.81                0.96                   0.88
AP12-C4    0.54                0.90                   0.58
AP12-C7    0.59                0.92                   0.62
Average    0.68                0.93                   0.66
Table D.2: Overall AP Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BE19-C1    0.71  0.42  0.71  1.00  0.93  1.00  0.29  0.17  0.71  0.92  0.33
BE19-C3    0.85  0.07  0.50  1.00  1.00  0.99  0.38  0.03  1.00  1.00  0.00
BE20-C1    0.69  0.40  0.66  0.97  0.98  1.00  0.27  0.16  0.19  0.65  0.31
BE20-C3    0.31  0.01  0.27  0.55  0.75  1.04  0.00  0.01  0.75  1.00  0.00
Average    0.64  0.22  0.54  0.88  0.91  1.01  0.24  0.09  0.66  0.89  0.16
Table D.3: BE Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BE19-C1    0.47                0.89                   0.35
BE19-C3    0.23                0.85                   0.36
BE20-C1    0.46                0.88                   0.28
BE20-C3    0.07                0.56                   0.13
Average    0.31                0.79                   0.28
Table D.4: Overall BE Dataset Results
Bibliography
[1] M. Abdelkader, R. Chellappa, Q. Zheng, and A. Chan, “Integrated motion
detection and tracking for visual surveillance,” in IEEE International Con-
ference on Computer Vision Systems (ICVS), p. 28, 2006.
[2] E. H. Adelson, “Layered representation for vision and video,” in IEEE Work-
shop on Representation of Visual Scenes, p. 3, 1995.
[3] E. L. Andrade, S. Blunsden, and R. B. Fisher, “Modelling crowd scenes for
event detection,” in International Conference on Pattern Recognition, vol. 1,
pp. 175 – 178, 2006.
[4] N. Atsushi, K. Hirokazu, H. Shinsaku, and I. Seiji, “Tracking multiple people
using distributed vision systems,” in Proceedings 2002 IEEE International
Conference on Robotics and Automation, vol. 3, (Washington, DC, USA),
pp. 2974–2981, IEEE, 2002.
[5] E. Auvinet, E. Grossmann, C. Rougier, M. Dahmane, and J. Meunier, “Left-
luggage detection using homographies and simple heuristics,” in IEEE In-
ternational Workshop on PETS, (New York), pp. 51–58, 2006.
[6] J. Bergen, P. Burt, R. Hingorani, and S. Peleg, “Computing two motions
from three frames,” in 3rd Int. Conf. on Computer Vision, pp. 27–32, 1990.
[7] D. Beymer, “Person counting using stereo,” Workshop on Human Motion,
pp. 127 – 133, 2000.
[8] M. K. Bhuyan, B. C. Lovell, and A. Bigdeli, “Tracking with multiple cameras
for video surveillance,” in Digital Image Computing Techniques and Applica-
tions, 9th Biennial Conference of the Australian Pattern Recognition Society
on (B. C. Lovell, ed.), pp. 592–599, 2007.
[9] M. Black and P. Anandan, “A framework for the robust estimation of optical
flow,” in Fourth International Conference on Computer Vision, pp. 231 –
236, 1993.
[10] J. Black, T. Ellis, and P. Rosin, “Multi view image surveillance and track-
ing,” in Motion and Video Computing, 2002. Proceedings. Workshop on
(T. Ellis, ed.), pp. 169–174, 2002.
[11] R. S. Blum and Z. Liu, Multi-Sensor Image Fusion and Its Applications.
Boca Raton, FL: CRC Press, 2006.
[12] R. Bourezak and G. Bilodeau, “Object detection and tracking using iterative
division and correlograms,” in Computer and Robot Vision, 2006. The 3rd
Canadian Conference on, p. 38, 2006.
[13] R. Bowden and P. Kaewtrakulpong, “An improved adaptive background
mixture model for real-time tracking with shadow detection,” in AVBS01,
2001.
[14] G. R. Bradski, “Computer vision face tracking for use in a perceptual user
interface,” Intel Tech Journal, pp. 1–15, 1998.
[15] H. Breit and G. Rigoll, “A flexible multimodal object tracking system,” in
International Conference on Image Processing, vol. 3, pp. 133–136, 2003.
[16] F. Bunyak, I. Ersoy, and S. Subramanya, “A multi-hypothesis approach for
salient object tracking in visual surveillance,” in Image Processing, 2005.
ICIP 2005. IEEE International Conference on, vol. 2, pp. II–446–9, 2005.
[17] P. Burt, J. Bergen, R. Hingorani, R. Kolczynski, W. Lee, A. Leung, J. Lubin,
and J. Shvaytser, “Object tracking with a moving camera; an application
of dynamic motion analysis,” in IEEE Workshop on Visual Motion, (Irvine,
CA), 1989.
[18] D. Butler, S. Sridharan, and V. M. Bove Jr, “Real-time adaptive background
segmentation,” in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), vol. 3, pp. 349–352, 2003.
[19] Q. Cai and J. Aggarwal, “Tracking human motion in structured environ-
ments using a distributed-camera system,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1241 – 1247, 1999.
[20] J. Canny, “A computational approach to edge detection,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–
698, 1986.
[21] T.-H. Chang and S. Gong, “Tracking multiple people with a multi-camera
system,” in IEEE Workshop on Multi-Object Tracking, pp. 19 – 26, 2001.
[22] N. Checka, K. Wilson, M. Siracusa, and T. Darrell, “Multiple person and
speaker activity tracking with a particle filter,” in Acoustics, Speech, and Sig-
nal Processing, 2004. Proceedings. (ICASSP ’04). IEEE International Con-
ference on, vol. 5, pp. V–881–4 vol.5, 2004.
[23] H. Chen and T. Liu, “Trust-region methods for real-time tracking,” in In-
ternational Conference on Computer Vision, vol. 2, (Vancouver, Canada),
pp. 717–722, 2001.
[24] Y.-S. Cheng, C.-M. Huang, and L.-C. Fu, “Multiple people visual tracking in
a multi-camera system for cluttered environments,” in 2006 IEEE/RSJ In-
ternational Conference on Intelligent Robots and Systems, (Beijing, China),
pp. 675–680, 2006.
[25] S.-Y. Chien, W.-K. Chan, D.-C. Cherng, and J.-Y. Chang, “Human object
tracking algorithm with human color structure descriptor for video surveil-
lance systems,” in Multimedia and Expo, 2006 IEEE International Confer-
ence on, pp. 2097–2100, 2006.
[26] B. Coifman, D. Beymer, P. McLauchlan, and J. Malik, “A real-time com-
puter vision system for vehicle tracking and traffic surveillance,” Transporta-
tion Research: Part C, vol. 6, no. 4, pp. 271–288, 1998.
[27] R. Collins, O. Amidi, and T. Kanade, “An active camera system for acquiring
multi-view video,” in Proceedings of ICIP 2002 International Conference on
Image Processing, vol. 1, (Rochester, NY, USA), pp. 517–520, IEEE, 2002.
[28] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature
space analysis,” IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol. 24, no. 5, pp. 603–619, 2002.
[29] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid ob-
jects using mean shift,” in IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR 2000), vol. 2, pp. 2142–2149, 2000.
[30] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,”
IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 5,
pp. 564–575, 2003.
[31] C. O. Conaire, E. Cooke, N. O’Connor, N. Murphy, and A. Smearson, “Back-
ground modelling in infrared and visible spectrum video for people tracking,”
in IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, vol. 3, pp. 20–20, 2005.
[32] C. O. Conaire, N. E. O’Connor, E. Cooke, and A. F. Smeaton, “Multi-
spectral object segmentation and retrieval in surveillance video,” in IEEE
International Conference on Image Processing (ICIP), pp. 2381–2384, 2006.
[33] M. Couprie and G. Bertrand, “Topological grayscale watershed transforma-
tion,” in SPIE Vision Geometry V, vol. 3168, pp. 136–146, 1997.
[34] A. Criminisi, A. Cross, G. Blake, and V. Kolmogorov, “Bilayer segmentation
of live video,” in IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, 2006.
[35] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Statistical and
knowledge-based moving object detection in traffic scene,” in IEEE Inter-
national Conference on Intelligent Transportation Systems, pp. 27–32, 2000.
[36] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving ob-
jects, ghosts, and shadows in video streams,” IEEE Trans. on Pattern Anal-
ysis and Machine Intelligence, vol. 25, no. 10, pp. 1337–1342, 2003.
[37] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, and S. Sirotti, “Improving
shadow suppression in moving object detection with hsv color information,”
in Fourth International IEEE Conference on Intelligent Transportation Sys-
tems, (Oaklan, CA, USA), pp. 334–339, 2001.
[38] F. Cupillard, F. Bremond, and M. Thonnat, “Group behavior recognition
with multiple cameras,” in Applications of Computer Vision, 2002. (WACV
2002). Proceedings. Sixth IEEE Workshop on, pp. 177–183, 2002.
[39] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, “Integrated person
tracking using stereo, color, and pattern detection,” in Computer vision
and pattern recognition, (Santa Barbara; CA), pp. 601–609, IEEE Computer
Society, 1998.
[40] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, “Integrated person
tracking using stereo, color, and pattern detection,” International Journal
of Computer Vision, vol. 37, no. 2, pp. 175–185, 2000.
[41] J. Davis and V. Sharma, “Robust background-subtraction for person detec-
tion in thermal imagery,” in Conference on Computer Vision and Pattern
Recognition Workshop, p. 128, 2004.
[42] J. Davis and V. Sharma, “IEEE OTCBVS WS Series Bench: Fusion-based
background-subtraction using contour saliency,” in IEEE International
Workshop on Object Tracking and Classification Beyond the Visible Spec-
trum, 2005.
[43] S. Denman, V. Chandran, and S. Sridharan, “An adaptive optical flow tech-
nique for person tracking systems,” Elsevier Pattern Recognition Letters,
vol. 28, no. 10, pp. 1232–1239, 2007.
[44] A. Doucet, “On sequential simulation-based methods for bayesian filtering,”
technical report cued/f-infeng/tr 310, Department of Engineering, Cam-
bridge University, 1998.
[45] A. Efros, A. Berg, G. Mori, and J. Malik, “Recognizing action at a dis-
tance,” in Computer Vision, 2003. Proceedings. Ninth IEEE International
Conference on, Vol., Iss., 13-16 Oct. 2003, pp. 726– 733 vol.2, 2003.
[46] T. Ellis, “Multi-camera video surveillance,” in Proceedings IEEE 36th
Annual 2002 International Carnahan Conference on Security Technology
(L. Sanson, ed.), (Atlantic City, NJ, USA), pp. 228–233, IEEE, 2002.
[47] I. Everts, N. Sebe, and G. A. Jones, “Cooperative object tracking with
multiple ptz cameras,” in Image Analysis and Processing, 2007. ICIAP 2007.
14th International Conference on (N. Sebe, ed.), pp. 323–330, 2007.
[48] R. Fisher, J. Santos-Victor, and J. Crowley, “Caviar: Con-
text aware vision using image-based active recognition,
(http://homepages.inf.ed.ac.uk/rbf/caviar/),” Last Accessed 23 Feb
2008, 2002.
[49] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multicamera peo-
ple tracking with a probabilistic occupancy map,” Transactions on Pattern
Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267–282, 2008.
[50] D. Focken and R. Stiefelhagen, “Towards vision-based 3-d people tracking
in a smart room,” in Proceedings Fourth IEEE International Conference on
Multimodal Interfaces, (Pittsburgh, PA, USA), pp. 400–405, IEEE Comput.
Soc, 2002.
[51] J. F. G. d. Freitas, M. Niranjan, A. H. Gee, and A. Doucet, “Sequential
monte carlo methods to train neural network models,” Neural Computation,
vol. 12, no. 4, pp. 955–993, 2000.
[52] L. M. Fuentes and S. Velastin, “People tracking in surveillance applications,”
2nd IEEE International Workshop on Performance Evaluation of Tracking
and Surveillance (PETS2001), 2001.
[53] L. M. Fuentes and S. A. Velastin, “Tracking people for automatic surveillance
applications,” in Pattern recognition and image analysis (F. J. P. O. Perales,
ed.), (Puerto de Andratx, Spain), pp. 238–245, Berlin; Springer;, 2003.
[54] K. Fukunaga, Introduction to Statistical Pattern Recognition. Boston: Aca-
demic Press, 1990.
[55] G. S. K. Fung, N. H. C. Yung, G. K. H. Pang, and A. H. S. Lai, “Effective
moving cast shadow detection for monocular color image,” vol. -, no. -, pp. –
409, 2001.
[56] G. Gordon, T. Darrell, M. Harville, and J. Woodfill, “Background estimation
and removal based on range and color,” in IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition, pp. 459–464, 1999.
[57] H. Grabner and H. Bischof, “On-line boosting and vision,” in IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition,
vol. 1, pp. 260 – 267, 2006.
[58] R. Green and L. Guan, “Tracking human movement patterns using particle
filtering,” in International Conference on Multimedia and Expo (ICME),
vol. 3, pp. 117–120, 2003.
[59] D. Grest, J.-M. Frahm, and R. Koch, “A color similarity measure for ro-
bust shadow removal in real time,” in Vision, Modeling and Visualization
Conference, (Munich, Germany), pp. 253–260, 2003.
[60] J. M. Hammersley and K. W. Morton, “Poor man’s monte carlo,” Journal
of the Royal Statistical Society B, vol. 16, pp. 23–28, 1954.
[61] J. Han and B. Bhanu, “Detecting moving humans using color and infrared
video,” in IEEE International Conference on Multisensor Fusion and Inte-
gration for Intelligent Systems, pp. 228 – 233, 2003.
[62] J. Han and B. Bhanu, “Fusion of color and infrared video for moving human
detection,” Pattern Recognition, vol. 40, no. 6, pp. 1771–1784, 2007.
[63] I. Haritaoglu, D. Harwood, and L. Davis, “W4: Who? when? where? what?
a real time system for detecting and tracking people,” in Third IEEE Inter-
national Conference on Automatic Face and Gesture Recognition, pp. 222 –
227, 1998.
[64] I. Haritaoglu, D. Harwood, and L. Davis, “W4s: A real time system for
detecting and tracking people in 2 1/2 d,” in European Conference on Computer
Vision, pp. 962–968, 1998.
[65] I. Haritaoglu, D. Harwood, and L. S. Davis, “Hydra: multiple people detec-
tion and tracking using silhouettes,” in International Conference on Image
Analysis and Processing, pp. 280 – 285, 1999.
[66] I. Haritaoglu, D. Harwood, and L. Davis, “An appearance-based body model
for multiple people tracking,” in 15th International Conference on Pattern
Recognition, vol. 4, (Barcelona, Spain), pp. 184–187, 2000.
[67] M. Harville, “A framework for high-level feedback to adaptive, per-pixel,
mixture-of-gaussian background models,” in 7th European Conference on
Computer Vision, vol. 3, (Copenhagen, Denmark), pp. 37–49, 2002.
[68] M. Harville, G. G. Gordon, and J. Woodfill, “Foreground segmentation us-
ing adaptive mixture models in color and depth,” in IEEE Workshop on
Detection and Recognition of Events in Video, pp. 3–11, 2001.
[69] M. Harville and D. Li, “Fast, integrated person tracking and activity recog-
nition with plan-view templates from a single stereo camera,” in Computer
Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004
IEEE Computer Society Conference on, pp. II–398– II–405 Vol.2, 2004.
[70] E. Hayman and J. Eklundh, “Statistical background subtraction for a mobile
observer,” in International Conference on Computer Vision (ICCV), vol. 1,
pp. 67 – 74, 2003.
[71] T. Higuchi, “Monte carlo filter using the genetic algorithm operators,” Jour-
nal of Statistical Computation and Simulation, vol. 59, no. 1, pp. 1–23, 1997.
[72] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intel-
ligence, vol. 17, pp. 185–203, 1981.
[73] T. Horprasert, D. Harwood, and L. Davis, “A statistical approach for real-
time robust background subtraction and shadow detection,” in ICCV Frame-
rate Workshop, 1999.
[74] M. Hu, W. Hu, and T. Tan, “Tracking people through occlusions,” in Pat-
tern Recognition, 2004. ICPR 2004. Proceedings of the 17th International
Conference on, pp. 724– 727, 2004.
[75] J.-S. Hu, T.-M. Su, and S.-C. Jeng, “Robust background subtraction with
shadow and highlight removal for indoor surveillance,” in Intelligent Robots
and Systems, 2006 IEEE/RSJ International Conference on, pp. 4545–4550,
2006.
[76] G. Hua and Y. Wu, “Multi-scale visual tracking by sequential belief propa-
gation,” in IEEE Conference on Computer Vision and Pattern Recognition,
(Washington, DC), pp. 826–833, 2004.
[77] M. Isard and A. Blake, “Condensation - conditional density propagation for
visual tracking,” International Journal of Computer Vision, vol. 29, no. 1,
pp. 5–28, 1998.
[78] M. Isard and J. MacCormick, “Bramble: a bayesian multiple-blob tracker,”
in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE Interna-
tional Conference on, vol. 2, pp. 34–41 vol.2, 2001.
[79] J. Jacques, C. Jung, and S. Musse, “Background subtraction and shadow
detection in grayscale video sequences,” in Computer Graphics and Image
Processing, 2005. SIBGRAPI 2005. 18th Brazilian Symposium on, pp. 189–
196, 2005.
[80] O. Javed, S. Khan, Z. Rasheed, and M. Shah, “Camera handoff: tracking in
multiple uncalibrated stationary cameras,” in Workshop on Human Motion,
pp. 113 – 118, 2000.
[81] O. Javed, Z. Rasheed, O. Alatas, and M. Shah, “Knightm: A real time
surveillance system for multiple overlapping and non-overlapping cameras,”
in International Conference on Multimedia Expo (ICME), pp. 649–652, 2003.
[82] O. Javed, Z. Rasheed, K. Shafique, and M. Shah, “Tracking across multiple
cameras with disjoint views,” in The Ninth IEEE Conference on Computer
Vision (ICCV), vol. 2, (Nice, France), pp. 952–957, 2003.
[83] O. Javed, K. Shafique, and M. Shah, “A hierarchical approach to robust
background subtraction using color and gradient information,” in IEEE
Workshop on Motion and Video Computing, pp. 22–27, 2002.
[84] S. Joo and Q. Zheng, “A temporal variance-based moving target detector,”
in IEEE International Workshop on Performance Evaluation of Tracking
Systems (PETS), 2005.
[85] S. Julier and J. Uhlmann, “A consistent, debiased method for converting be-
tween polar and cartesian coordinate systems,” in 11th International Sym-
posium on Aerospace/Defence Sensing, Simulation and Controls, vol. Multi
Sensor Fusion, Tracking and Resource Management II, (Orlando, Florida),
pp. 110–121, 1997.
[86] R. E. Kalman, “A new approach to linear filtering and prediction problems,”
Transactions of the ASME–Journal of Basic Engineering, vol. 82, no. Series
D, pp. 35–45, 1960.
[87] J. Kang, I. Cohen, and G. Medioni, “Tracking people in crowded scenes
across multiple cameras,” in Asian Conference on Computer Vision
(ACCV), 2004.
[88] J. Kang, I. Cohen, and G. Medioni, “Persistent objects tracking across mul-
tiple non overlapping cameras,” in Motion and Video Computing, 2005.
WACV/MOTIONS ’05 Volume 2. IEEE Workshop on (I. Cohen, ed.), vol. 2,
pp. 112–119, 2005.
[89] S. Kang, B.-W. Hwang, and S.-W. Lee, “Multiple people tracking based on
temporal color feature,” International Journal of Pattern Recognition and
Artificial Intelligence, vol. 17, no. 6, pp. 931–949, 2003.
[90] H. Kang, D. Kim, and S. Y. Bang, “Real-time multiple people tracking
using competitive condensation,” in Image Processing. 2002. Proceedings.
2002 International Conference on, vol. 3, pp. III–325–III–328 vol.3, 2002.
[91] J. Kato, T. Watanabe, S. Joga, Y. Liu, and H. Hase, “An hmm/mrf-based
stochastic framework for robust vehicle tracking,” IEEE Transactions on
Intelligent Transportation Systems, vol. 5, no. 3, pp. 142 – 154, 2004.
[92] M. Kazuyuki, M. Xuchu, and H. Hideki, “Global color model based object
matching in the multi-camera environment,” in IEEE/RSJ International
Conference on Intelligent Robots and Systems, 2006, pp. 2644–2649, 2006.
[93] S. Khan, O. Javed, Z. Rasheed, and M. Shah, “Human tracking in multiple
cameras,” in Eighth IEEE International Conference on Computer Vision,
vol. 1, pp. 331 – 336, 2001.
[94] S. Khan and M. Shah, “Object based segmentation of video using color,
motion and spatial information,” in IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, vol. 2, pp. 746–751, 2001.
[95] S. Khan and M. Shah, “Consistent labeling of tracked objects in multiple
cameras with overlapping fields of view,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1355 – 1360, 2003.
[96] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, “Background
modeling and subtraction by codebook construction,” IEEE International
Conference on Image Processing (ICIP), 2004.
[97] K. Kim, D. Harwood, and L. S. Davis, “Background updating for visual
surveillance,” in ISVC, pp. 337–346, 2005.
[98] G. Kitagawa, “Monte carlo filter and smoother for non-gaussian nonlin-
ear state space models,” Journal of Computational and Graphical Statistics,
vol. 5, no. 1, pp. 1–25, 1996.
[99] D. Koller, J. Weber, and J. Malik, “Robust multiple car tracking with oc-
clusion reasoning,” in ECCV, vol. 1, pp. 189–196, 1994.
[100] A. Kong, J. S. Liu, and W. H. Wong, “Sequential imputations and bayesian
missing data problems,” Journal of the American Statistical Association,
vol. 89, no. 425, pp. 278–288, 1994.
[101] N. Krahnstoever, P. Tu, T. Sebastian, A. Perera, and R. Collins, “Multi-
view detection and tracking of travelers and luggage in mass transit environ-
ments,” in IEEE International Workshop on PETS, (New York), pp. 67–74,
2006.
[102] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer,
“Multi-camera multi-person tracking for easyliving,” in Third IEEE Inter-
national Workshop on Visual Surveillance, pp. 3 – 10, 2000.
[103] B. Kwolek, “Person following and mobile camera localization using particle
filters,” in Robot Motion and Control, 2004. RoMoCo’04. Proceedings of the
Fourth International Workshop on, pp. 265–270, 2004.
[104] L. Latecki, R. Miezianko, and D. Pokrajac, “Tracking motion objects in
infrared videos,” in IEEE Conference on Advanced Video and Signal Based
Surveillance (AVSS), pp. 99–104, 2005.
[105] M. Latzel, E. Darcourt, and J. Tsotsos, “People tracking using robust mo-
tion detection and estimation,” in Computer and Robot Vision, 2005. Pro-
ceedings. The 2nd Canadian Conference on, pp. 270–275, 2005.
[106] B. Lei and L.-Q. Xu, “From pixels to objects and trajectories: a generic
real-time outdoor video surveillance system,” in Imaging for Crime Detection
and Prevention, 2005. ICDP 2005. The IEE International Symposium on,
pp. 117–122, 2005.
[107] W. Leoputra, T. Tele, and L. Fee Lee, “Non-overlapping distributed track-
ing using particle filter,” in Pattern Recognition, 2006. ICPR 2006. 18th
International Conference on (T. Tele, ed.), vol. 3, pp. 181–185, 2006.
[108] L. Li, R. Luo, W. Huang, and H.-L. Eng, “Context-controlled adaptive
background subtraction,” in IEEE International Workshop on PETS, New
York, June 18, 2006, (New York), pp. 31–38, 2006.
[109] F. L. Lim, W. Leoputra, and T. Tan, “Non-overlapping distributed tracking
system utilizing particle filter,” Journal of VLSI Signal Processing, vol. 49,
pp. 343–362, 2007.
[110] J. S. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic
systems,” Journal of the American Statistical Association, vol. 93, no. 443,
pp. 1032–1044, 1998.
[111] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”
International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[112] G. Loy, L. Fletcher, N. Apostoloff, and A. Zelinsky, “An adaptive fusion ar-
chitecture for target tracking,” in Automatic Face and Gesture Recognition,
2002. Proceedings. Fifth IEEE International Conference on, pp. 248–253,
2002.
[113] W. Lu and Y.-P. Tan, “A color histogram based people tracking system,”
in 2001 IEEE International Symposium on Circuits and Systems, vol. 2,
pp. 137 – 140, 2001.
[114] B. Lucas and T. Kanade, “An iterative image registration technique with
an application to stereo vision,” in 7th International Joint Conference on
Artificial Intelligence (IJCAI), pp. 674–679, 1981.
[115] M. Lucena, J. Fuertes, N. de la Blanca, and A. Garrido, “An optical flow
probabilistic observation model for tracking,” in Image Processing, 2003.
ICIP 2003. Proceedings. 2003 International Conference on, vol. 3, pp. III–
957–60 vol.2, 2003.
[116] M. Lucena, J. Fuertes, J. Gomez, N. de la Blanca, and A. Garrido, “Optical
flow-based probabilistic tracking,” in Signal Processing and Its Applications,
2003. Proceedings. Seventh International Symposium on, vol. 2, pp. 219–222
vol.2, 2003.
[117] M. Lucena, J. Fuertes, J. Gomez, N. de la Blanca, and A. Garrido, “Track-
ing from optical flow,” in Image and Signal Processing and Analysis, 2003.
ISPA 2003. Proceedings of the 3rd International Symposium on, Vol.2, Iss.,
18-20 Sept. 2003, pp. 651– 655 Vol.2, 2003.
[118] L. Marchesotti, S. Piva, and C. Regazzoni, “An agent-based approach for
tracking people in indoor complex environments,” in 12th International Con-
ference Image Analysis and Processing, pp. 99–102, 2003.
[119] N. Martel-Brisson and A. Zaccarin, “Moving cast shadow detection from a
gaussian mixture shadow model,” in Computer Vision and Pattern Recog-
nition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2,
pp. 643–648 vol. 2, 2005.
[120] S. Maskell and N. Gordon, “A tutorial on particle filters for on-line
nonlinear/non-gaussian bayesian tracking,” in Target Tracking: Algorithms
and Applications (Ref. No. 2001/174), IEE, vol. Workshop, pp. 2/1–2/15
vol.2, 2001.
[121] A. Matsumura, Y. Iwai, and M. Yachida, “Tracking people by using color
information from omnidirectional images,” in 41st SICE Annual Conference,
vol. 3, pp. 1772 – 1777, 2002.
[122] R. v. d. Merwe, N. d. Freitas, A. Doucet, and E. Wan, “The unscented
particle filter,” Advances in Neural Information Processing Systems, vol. 13,
no. Nov, 2001.
[123] C. Micheloni, G. L. Foresti, and L. Snidaro, “A network of co-operative
cameras for visual surveillance,” Vision, Image and Signal Processing, IEE
Proceedings -, vol. 152, no. 2, pp. 205–212, 2005.
[124] A. Mittal and L. Davis, “Unified multi-camera detection and tracking using
region-matching,” in Multi-Object Tracking, 2001. Proceedings. 2001 IEEE
Workshop on, pp. 3–10, 2001.
[125] A. Mittal and L. Davis, “M/sub 2/tracker: a multi-view approach to seg-
menting and tracking people in a cluttered scene,” International Journal of
Computer Vision, vol. 51, no. 3, pp. 189–203, 2003.
[126] K. Morioka and H. Hashimoto, “Color appearance based object identifica-
tion in intelligent space,” pp. 505–510, 2004.
[127] S. Nadimi and B. Bhanu, “Moving shadow detection using a physics-based
approach,” in 16th International Conference on Pattern Recognition,, vol. 2,
pp. 701–704, 2002.
[128] S. Nadimi and B. Bhanu, “Physical models for moving shadow and ob-
ject detection in video,” Pattern Analysis and Machine Intelligence, IEEE
Transactions on, vol. 26, no. 8, pp. 1079–1087, 2004.
[129] K. P. Ng and S. Ranganath, “Tracking people,” in 16th International Con-
ference on Pattern Recognition, vol. 2, pp. 370 – 373, 2002.
[130] A. T. Nghiem, F. Bremond, and M. T. V. Valentin, “Etiseo, performance
evaluation for video surveillance systems,” in IEEE Conference on Advanced
Video and Signal Based Surveillance (AVSS), (London, UK), pp. 476–481,
2007.
[131] N. Nguyen, S. Venkatesh, G. West, and H. Bui, “Hierarchical monitoring
of people’s behaviors in complex environments using multiple cameras,” in
16th International Conference on Pattern Recognition, vol. 1, pp. 13 – 16,
2002.
[132] “Ninth IEEE International Workshop on Performance Evaluation of Tracking
and Surveillance,” 2006.
[133] C. O’Conaire, N. E. O’Connor, E. Cooke, and A. F. Smeaton, “Comparison
of fusion methods for thermo-visual surveillance tracking,” in 9th Interna-
tional Conference on Information Fusion (ICIF), pp. 1–7, 2006.
[134] R. Okada, Y. Shirai, and J. Miura, “Tracking a person with 3-d motion by
integrating optical flow and depth,” in Automatic Face and Gesture Recogni-
tion, 2000. Proceedings. Fourth IEEE International Conference on, pp. 336–
341, 2000.
[135] K. Okuma, A. Taleghani, N. d. Freitas, J. Little, and D. Lowe, “A boosted
particle filter: Multitarget detection and tracking,” in 8th European Confer-
ence on Computer Vision (ECCV), vol. 1, (Prague, Czech Republic), pp. 28–
39, 2004.
[136] N. Oshima, T. Saitoh, and R. Konishi, “Real time mean shift tracking
using optical flow distribution,” in SICE-ICASE, 2006. International Joint
Conference, pp. 4316–4320, 2006.
[137] N. Owens, C. Harris, and C. Stennett, “Hawk-eye tennis system,” in Vi-
sual Information Engineering, 2003. VIE 2003. International Conference on,
pp. 182–185, 2003.
[138] K. Patwardhan, G. Sapiro, and V. Morellas, “Robust foreground detection
in video using pixel layers,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol. 30, no. 4, pp. 746–751, 2008.
[139] M. K. Pitt and N. Shephard, “Filtering via simulation: Auxiliary particle
filters,” Journal of the American Statistical Association, vol. 94, no. 446,
pp. 590–599, 1999.
[140] S. Piva, A. Calbi, D. Angiati, and C. S. Regazzoni, “A multi-feature ob-
ject association framework for overlapped field of view multi-camera video
surveillance systems,” pp. 505–510, 2005.
[141] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic
tracking,” in 7th European Conference on Computer Vision, pp. 661 – 675,
2002.
[142] R. Rad and M. Jamzad, “Real-time classification and tracking of multiple
vehicles in highways,” Pattern Recognition Letters, vol. 26, pp. 1597–1607,
2005.
[143] D. Ramanan and D. Forsyth, “Finding and tracking people from the bottom
up,” in IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, vol. 2, pp. 467 – 474, 2003.
[144] A. Rao, R. Srihari, and Z. Zhang, “Geometric histogram: A distribution
of geometric configurations of color subsets,” in SPIE: Internet Imaging,
vol. 3964, pp. 91–101, 2000.
[145] G. Rigoll, S. Eickeler, and S. Muller, “Person tracking in real-world sce-
narios using statistical methods,” in Automatic face and gesture recognition,
(Grenoble, France), pp. 342–347, IEEE; 2000, 2000.
[146] M. N. Rosenbluth and A. W. Rosenbluth, “Monte carlo calculation of the
average extension of molecular chains,” Journal of Chemical Physics, vol. 23,
pp. 356–359, 1955.
[147] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive fore-
ground extraction using iterated graph cuts,” in ACM SIGGRAPH, pp. 309
– 314, 2004.
[148] H. Ryu and M. Huber, “A particle filter approach for multi-target tracking,”
in IEEE/RSJ International Conference on Intelligent Robots and Systems,
(San Diego, CA, USA), pp. 2753–2760, 2007.
[149] D. Schulz, W. Burgard, D. Fox, and A. Cremers, “Tracking multiple moving
targets with a mobile robot using particle filters and statistical data associa-
tion,” in IEEE International Conference on Robotics and Automation, vol. 2,
pp. 1665–1670, 2001.
[150] F. Seitner and B. C. Lovell, “Pedestrian tracking based on colour and spatial
information,” in Proceedings of the Digital Image Computing: Techniques
and Applications (DICTA 2005), (Cairns, Australia), pp. 36 – 43, 2005.
[151] P. Shastry and K. Ramakrishnan, “Fast technique for moving shadow detec-
tion in image sequences,” in Signal Processing and Communications, 2004.
SPCOM ’04. 2004 International Conference on, pp. 359–362, 2004.
[152] N. Siebel and S. Maybank, “Fusion of multiple tracking algorithms for ro-
bust people tracking,” in Computer Vision - ECCV 2002. 7th European
Conference on Computer Vision. Proceedings, Part IV (A. Heyden, G. Sparr,
M. Nielsen, and P. Johansen, eds.), (Copenhagen, Denmark), pp. 373–387,
Springer-Verlag, 2002.
[153] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard, “Tracking loose-limbed
people,” in Proceedings of the 2004 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, vol. 1, pp. 421–428, 2004.
[154] Silogic and Inria, “Etiseo metrics definition (http://www-
sop.inria.fr/orion/etiseo/download.htm),” tech. rep., 6th January 2006.
[155] C. Stauffer, “Estimating tracking sources and sinks,” in Event Mining
Workshop, (Madison, WI), 2003.
[156] C. Stauffer, “Learning to track objects through unobserved regions,” in Mo-
tion and Video Computing, 2005. WACV/MOTIONS ’05 Volume 2. IEEE
Workshop on, vol. 2, pp. 96–102, 2005.
[157] C. Stauffer and W. Grimson, “Adaptive background mixture models for
real-time tracking,” in IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, vol. 2, p. 252, 1999.
[158] F. Tang and H. Tao, “Object tracking with dynamic feature graph,” in Vi-
sual Surveillance and Performance Evaluation of Tracking and Surveillance,
2005. 2nd Joint IEEE International Workshop on, pp. 25–32, 2005.
[159] H. Tao, H. S. Sawhney, and R. Kumar, “Object tracking with bayesian esti-
mation of dynamic layer representations,” IEEE Trans. on Pattern Analysis
and Machine Intelligence, vol. 24, no. 1, pp. 75–89, 2002.
[160] T. Thongkamwitoon, S. Aramvith, and T. Chalidabhongse, “An adaptive
real-time background subtraction and moving shadows detection,” in IEEE
International Conference on Multimedia and Expo (ICME), vol. 2, pp. 1459–
1462, 2004.
[161] R. Y. Tsai, “An efficient and accurate camera calibration technique for
3d machine vision,” in IEEE Conference on Computer Vision and Pattern
Recognition, (Miami Beach, FL), pp. 364–374, 1986.
[162] H. Tsutsui, J. Miura, and Y. Shirai, “Optical flow-based person tracking by
multiple cameras,” in International Conference on Multisensor Fusion and
Integration for Intelligent Systems, pp. 91 – 96, 2001.
[163] J. Vermaak, A. Doucet, and P. Perez, “Maintaining multi-modality through
mixture tracking,” in Ninth IEEE International Conference on Computer
Vision (ICCV’03), vol. 2, (Nice, France), pp. 1110–1116, 2003.
[164] L. Vincent and P. Soille, “Watersheds in digital spaces: An efficient al-
gorithm based on immersion simulations,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 13, no. 6, pp. 583–598, 1991.
[165] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of
simple features,” in CVPR, 2001.
[166] “Viper-gt, the ground truth authoring tool,
http://vipertoolkit.sourceforge.net/docs/gt/.”
[167] L. Wang, W. Hu, and T. Tan, “Face tracking using motion-guided dynamic
template matching,” ACCV’2002, 2002.
[168] M.-L. Wang, C.-C. Huang, and H.-Y. Lin, “An intelligent surveillance sys-
tem based on an omnidirectional vision sensor,” in Cybernetics and Intelli-
gent Systems, 2006 IEEE Conference on, pp. 1–6, 2006.
[169] H. Wang and D. Suter, “A re-evaluation of mixture-of-gaussian background
modeling,” in IEEE International Conference on Acoustics, Speech, and Sig-
nal Processing, (Philadelphia, PA, USA), pp. 1017–1020, 2005.
[170] D. Wei and J. Piater, “Data fusion by belief propagation for multi-camera
tracking,” pp. 1–8, 2006.
[171] Q. Wei, D. Schonfeld, and M. Mohamed, “Decentralized multiple camera
multiple object tracking,” in Multimedia and Expo, 2006 IEEE International
Conference on (D. Schonfeld, ed.), pp. 245–248, 2006.
[172] G. Welch and G. Bishop, “An introduction to the kalman filter,” technical
report tr95-041, University of North Carolina at Chapel Hill, 1995.
[173] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: real-time
tracking of the human body,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.
[174] D. Xu, J. Liu, Z. Liu, and X. Tang, “Indoor shadow detection for video
segmentation,” in Multimedia and Expo, 2004. ICME ’04. 2004 IEEE Inter-
national Conference on, vol. 1, pp. 41–44 Vol.1, 2004.
[175] T. Yamane, Y. Shirai, and J. Miura, “Person tracking by integrating optical
flow and uniform brightness regions,” in Robotics and Automation, 1998.
Proceedings. 1998 IEEE International Conference on, vol. 4, pp. 3267–3272
vol.4, 1998.
[176] T. Yang, F. Chen, D. Kimber, and J. Vaughan, “Robust people
detection and tracking in a multi-camera indoor visual surveillance system,”
in Multimedia and Expo, 2007 IEEE International Conference on (F. Chen,
ed.), pp. 675–678, 2007.
[177] C. Yang, R. Duraiswami, and L. Davis, “Fast multiple object tracking via
a hierarchical particle filter,” in Tenth IEEE International Conference on
Computer Vision, pp. 212 – 219, 2005.
[178] M.-T. Yang, Y.-C. Shih, and S.-C. Wang, “People tracking by integrating
multiple features,” in Pattern Recognition, 2004. ICPR 2004. Proceedings of
the 17th International Conference on, pp. 929– 932, 2004.
[179] M. Yokoyama and T. Poggio, “A contour-based moving object detection
and tracking,” in Visual Surveillance and Performance Evaluation of Track-
ing and Surveillance, 2005. 2nd Joint IEEE International Workshop on,
pp. 271–276, 2005.
[180] Q. Zang and R. Klette, “Robust background subtraction and maintenance,”
in Proceedings of the 17th International Conference on Pattern Recognition
(ICPR), vol. 2, pp. 90–93, 2004.
[181] Z. Zeng and S. Ma, “Head tracking by active particle filtering,” in Auto-
matic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE Inter-
national Conference on, pp. 82–87, 2002.
[182] H.-J. Zhang, “New schemes for video content representation and their appli-
cations (tutorial),” in IEEE International Conference on Image Processing,
(Singapore), 2004.
[183] W. Zhang, X. Z. Fang, and X. Yang, “Moving cast shadows detection based
on ratio edge,” in Pattern Recognition, 2006. ICPR 2006. 18th International
Conference on, vol. 4, pp. 73–76, 2006.
[184] T. Zhao and R. Nevatia, “Tracking multiple humans in complex situations,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26,
no. 9, pp. 1208–1221, 2004.
[185] Q. Zhao and H. Tao, “Object tracking using color correlogram,” in Vi-
sual Surveillance and Performance Evaluation of Tracking and Surveillance,
2005. 2nd Joint IEEE International Workshop on, pp. 263–270, 2005.
[186] L. ZhiHua and K. Komiya, “Region-wide automatic visual search and
pursuit surveillance system of vehicles and people using networked intelli-
gent cameras,” in 6th International Conference on Signal Processing, vol. 2,
pp. 945 – 948, 2002.
[187] Y. Zhou and H. Tao, “A background layer model for object tracking through
occlusion,” in Ninth IEEE International Conference on Computer Vision,
vol. 2, pp. 1079–1085, 2003.