Improved Detection and Tracking of Objects in
Surveillance Video
by
Simon Paul Denman, BEng (Hons, 1st Class),
BIT (Dist)
PhD Thesis
Submitted in Fulfilment
of the Requirements
for the Degree of
Doctor of Philosophy
at the
Queensland University of Technology
Image and Video Research Laboratory
Faculty of Built Environment and Engineering
May 2009
Keywords
Object Tracking, Motion Detection, Optical Flow, Condensation Filter, Particle
Filter, Thermal Imagery, Multi-Sensor Fusion, Multi-Spectral Tracking.
Abstract
Surveillance networks are typically monitored by a few people, viewing several
monitors displaying the camera feeds. It is then very difficult for a human op-
erator to effectively detect events as they happen. Recently, computer vision
research has begun to address ways to automatically process some of this data,
to assist human operators. Object tracking, event recognition, crowd analysis and
human identification at a distance are being pursued as a means to aid human
operators and improve the security of areas such as transport hubs.
The task of object tracking is key to the effective use of more advanced technolo-
gies. To recognise an event, people and objects must be tracked. Tracking also
enhances the performance of tasks such as crowd analysis or human identification.
Before an object can be tracked, it must be detected. Motion segmentation tech-
niques, widely employed in tracking systems, produce a binary image in which
objects can be located. However, these techniques are prone to errors caused by
shadows and lighting changes. Detection routines often fail, either due to erro-
neous motion caused by noise and lighting effects, or due to the detection routines
being unable to split occluded regions into their component objects. Particle fil-
ters can be used as a self-contained tracking system, and make it unnecessary
for the task of detection to be carried out separately except for an initial (of-
ten manual) detection to initialise the filter. Particle filters use one or more
extracted features to evaluate the likelihood of an object existing at a given point
in each frame. Such systems however do not easily allow for multiple objects to be
tracked robustly, and do not explicitly maintain the identity of tracked objects.
This dissertation investigates improvements to the performance of object tracking
algorithms through improved motion segmentation and the use of a particle filter.
A novel hybrid motion segmentation / optical flow algorithm, capable of simulta-
neously extracting multiple layers of foreground and optical flow in surveillance
video frames, is proposed. The algorithm is shown to perform well in the presence
of adverse lighting conditions, and the optical flow is capable of extracting a mov-
ing object. The proposed algorithm is integrated within a tracking system and
evaluated using the ETISEO (Evaluation du Traitement et de l'Interpretation de
Sequences vidEO - Evaluation for video understanding) database, and signifi-
cant improvement in detection and tracking performance is demonstrated when
compared to a baseline system. A Scalable Condensation Filter (SCF), a particle
filter designed to work within an existing tracking system, is also developed. The
creation and deletion of modes and the maintenance of identity are handled by the
underlying tracking system, while the tracking system benefits from the improved
performance that a particle filter provides in uncertain conditions arising from
occlusion and noise. The system is evaluated using the ETISEO database.
The dissertation then investigates fusion schemes for multi-spectral tracking sys-
tems. Four fusion schemes for combining a thermal and visual colour modality are
evaluated using the OTCBVS (Object Tracking and Classification in and Beyond
the Visible Spectrum) database. It is shown that a middle fusion scheme yields
the best results and demonstrates a significant improvement in performance when
compared to a system using either mode individually.
Findings from the thesis contribute to improving the performance of semi-
automated video processing, and therefore to improving security in areas under
surveillance.
Contents
Abstract i
List of Tables xi
List of Figures xvii
Notation xxvii
Acronyms & Abbreviations xli
List of Publications xliii
Certification of Thesis xlvii
Acknowledgments xlix
Chapter 1 Introduction 1
1.1 Motivation and Overview . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Improvements to Motion Segmentation . . . . . . . . . . . 4
1.2.2 Improvements to Particle Filters . . . . . . . . . . . . . . . 5
1.2.3 Improvements to Multi-Modal Fusion in Tracking Systems 6
1.3 Scope of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Original Contributions and Publications . . . . . . . . . . . . . . 7
1.5 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Chapter 2 Literature Review 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Foreground Segmentation and Motion Detection . . . . . . . . . . 17
2.2.1 Background Subtraction and Background modelling . . . . 19
2.2.2 Optical Flow Approaches . . . . . . . . . . . . . . . . . . . 28
2.2.3 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.4 Auxiliary Processes . . . . . . . . . . . . . . . . . . . . . . 32
2.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 Detecting and Tracking Objects . . . . . . . . . . . . . . . . . . . 37
2.3.1 Object Detection . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.2 Matching and Tracking Objects . . . . . . . . . . . . . . . 42
2.3.3 Handling Occlusions . . . . . . . . . . . . . . . . . . . . . 49
2.3.4 Alternative Approaches to Tracking . . . . . . . . . . . . . 53
2.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.4 Prediction Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4.1 Motion Models . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4.2 Kalman Filters . . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.3 Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.5 Multi Camera Tracking Systems . . . . . . . . . . . . . . . . . . . 71
2.5.1 System Designs . . . . . . . . . . . . . . . . . . . . . . . . 73
2.5.2 Track Handover and Occlusion Handling . . . . . . . . . . 75
2.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 3 Tracking System Framework 91
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.2.1 Tracking Algorithm Overview . . . . . . . . . . . . . . . . 94
3.3 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.3.1 Person Detection . . . . . . . . . . . . . . . . . . . . . . . 99
3.3.2 Vehicle Detection . . . . . . . . . . . . . . . . . . . . . . . 104
3.3.3 Blob Detection . . . . . . . . . . . . . . . . . . . . . . . . 107
3.4 Baseline Tracking System . . . . . . . . . . . . . . . . . . . . . . 108
3.5 Evaluation Process and Benchmarks . . . . . . . . . . . . . . . . 111
3.5.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . 112
3.5.2 Evaluation Data and Configuration . . . . . . . . . . . . . 118
3.5.3 Tracking Output Description . . . . . . . . . . . . . . . . . 127
3.5.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Chapter 4 Motion Detection 143
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.2 Multi-Modal Background modelling . . . . . . . . . . . . . . . . . 145
4.3 Core Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.3.1 Variable Threshold . . . . . . . . . . . . . . . . . . . . . . 149
4.3.2 Lighting Compensation . . . . . . . . . . . . . . . . . . . . 153
4.3.3 Shadow Detection . . . . . . . . . . . . . . . . . . . . . . . 158
4.4 Computing Optical Flow Simultaneously . . . . . . . . . . . . . . 161
4.4.1 Detecting Overlapping Objects . . . . . . . . . . . . . . . 168
4.5 Detecting Stopped Motion . . . . . . . . . . . . . . . . . . . . . . 170
4.5.1 Feedback From External Source . . . . . . . . . . . . . . . 172
4.6 Evaluation and Testing . . . . . . . . . . . . . . . . . . . . . . . . 173
4.6.1 Synthetic Data Tests . . . . . . . . . . . . . . . . . . . . . 174
4.6.2 Real World Data Tests . . . . . . . . . . . . . . . . . . . . 185
4.6.3 Optical Flow and Overlap Detection Evaluation . . . . . . 189
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Chapter 5 Object Detection 201
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
5.2 Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.3 Using Static Foreground . . . . . . . . . . . . . . . . . . . . . . . 203
5.4 Detecting Overlaps . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.5 Integration into the Tracking System . . . . . . . . . . . . . . . . 211
5.6 Evaluation and Testing . . . . . . . . . . . . . . . . . . . . . . . . 214
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Chapter 6 The Scalable Condensation Filter 227
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.2 Scalable Condensation Filter . . . . . . . . . . . . . . . . . . . . . 229
6.2.1 Dynamic Sizing . . . . . . . . . . . . . . . . . . . . . . . . 230
6.2.2 Dynamic Feature Selection and Occlusion Handling . . . . 234
6.2.3 Adding Tracks and Incorporating Detection Results . . . . 239
6.3 Tracking Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
6.3.1 Handling Static Foreground . . . . . . . . . . . . . . . . . 245
6.4 Incorporation into Tracking System . . . . . . . . . . . . . . . . . 248
6.4.1 Matching Candidate Objects to Tracked Objects . . . . . . 250
6.4.2 Occlusion Handling . . . . . . . . . . . . . . . . . . . . . . 253
6.5 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . 255
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Chapter 7 Advanced Object Tracking and Applications 271
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.2 Multi-Camera Tracking . . . . . . . . . . . . . . . . . . . . . . . . 272
7.2.1 System Description . . . . . . . . . . . . . . . . . . . . . . 272
7.2.2 Track Handover and Matching Objects in Different Views . 273
7.2.3 Occlusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
7.2.4 Evaluation using ETISEO database . . . . . . . . . . . . . 276
7.3 Multi-Spectral Tracking . . . . . . . . . . . . . . . . . . . . . . . 285
7.3.1 Evaluation of Fusion Points . . . . . . . . . . . . . . . . . 285
7.3.2 Evaluation of Fusion Techniques using OTCVBS Database 291
7.3.3 Proposed Fusion System . . . . . . . . . . . . . . . . . . . 300
7.3.4 Evaluation of System using OTCVBS Database . . . . . . 305
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Chapter 8 Conclusions and Future Work 309
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.2 Summary of Contribution . . . . . . . . . . . . . . . . . . . . . . 310
8.3 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Appendix A Baseline Tracking System Results 317
Appendix B Tracking System with Improved Motion Segmentation
Results 320
Appendix C Tracking System with SCF Results 323
Appendix D Multi-Camera Tracking System Results 326
Bibliography 329
List of Tables
3.1 Transition Conditions . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.2 Evaluation Metric Standard Definitions . . . . . . . . . . . . . . . 114
3.3 System Parameters - Fixed parameters for all data sets. . . . . . . 122
3.4 System Parameters - Parameters specific to the RD data sets. . . 124
3.5 System Parameters - Parameters specific to the BC data sets. . . 125
3.6 System Parameters - Parameters specific to the AP data sets. . . 126
3.7 System Parameters - Parameters specific to the BE data sets. . . 127
3.8 Baseline Tracking System Results . . . . . . . . . . . . . . . . . . 129
3.9 Overall Baseline Tracking System Results . . . . . . . . . . . . . . 129
3.10 Baseline Tracking System Throughput . . . . . . . . . . . . . . . 142
4.1 Synthetic Motion Detection Performance for AESOS Set 1 . . . . 176
4.2 Synthetic Motion Detection Performance for AESOS Set 2 . . . . 177
4.3 Synthetic Motion Detection Performance for AESOS Set 3 . . . . 177
4.4 Synthetic Motion Detection Performance for AESOS Set 4 . . . . 178
4.5 Synthetic Lighting Normalisation Performance . . . . . . . . . . . 183
4.6 Motion Detection Results for Real World Sequence . . . . . . . . 188
5.1 System Parameters - Additional parameters for system configuration. 214
5.2 System Parameters - Additional parameters for system configura-
tion specific to each dataset group. . . . . . . . . . . . . . . . . . 215
5.3 Overall Tracking System Performance using a Global Variable
Threshold for Motion Detection (see Section 3.5.1 for an expla-
nation of metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.4 Overall Tracking System Performance using Individual Variable
Thresholds for Motion Detection (see Section 3.5.1 for an explana-
tion of metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
5.5 Performance of BE Cameras Using Different Motion Thresh-
old Approaches (DOv, LOv and TOv are overall detection, localisa-
tion and tracking metric results respectively, see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 216
5.6 Tracking System with Improved Motion Detection Results (see
Section 3.5.1 for an explanation of metrics) . . . . . . . . . . . . . 218
5.7 Tracking System with Improved Motion Detection Overall Results
(see Section 3.5.1 for an explanation of metrics) . . . . . . . . . . 218
5.8 Proposed Tracking System Throughput . . . . . . . . . . . . . . . 224
6.1 System Parameters - Additional parameters for system configuration. 256
6.2 System Parameters - Additional parameters for system configura-
tion specific to each dataset group. . . . . . . . . . . . . . . . . . 256
6.3 Tracking System with SCF Results (see Section 3.5.1 for an expla-
nation of metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.4 Tracking System with SCF Overall Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 257
6.5 Improvements using SCF (see Section 3.5.1 for an explanation of
metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.6 Overall Improvements using SCF (see Section 3.5.1 for an expla-
nation of metrics) . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.7 Proposed Tracking System Throughput . . . . . . . . . . . . . . . 267
7.1 Multi-Camera Tracking System Results (see Section 3.5.1 for an
explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . . . 277
7.2 Multi-Camera Tracking System Overall Results (see Section 3.5.1
for an explanation of metrics) . . . . . . . . . . . . . . . . . . . . 278
7.3 Improvements using a Multi-Camera System (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 278
7.4 Overall Improvements using a Multi-Camera System (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 279
7.5 Fusion Algorithm Evaluation - Set 1 Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 292
7.6 Fusion Algorithm Evaluation - Overall Set 1 Results (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 292
7.7 Fusion Algorithm Evaluation - Set 2 Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 295
7.8 Fusion Algorithm Evaluation - Overall Set 2 Results (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 295
7.9 Proposed Fusion Algorithm - Set 1 Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 305
7.10 Proposed Fusion Algorithm - Overall Set 1 Results (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 305
7.11 Proposed Fusion Algorithm - Set 2 Results (see Section 3.5.1 for
an explanation of metrics) . . . . . . . . . . . . . . . . . . . . . . 306
7.12 Proposed Fusion Algorithm - Overall Set 2 Results (see Section
3.5.1 for an explanation of metrics) . . . . . . . . . . . . . . . . . 306
A.1 RD Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 317
A.2 Overall RD Dataset Results . . . . . . . . . . . . . . . . . . . . . 317
A.3 BC Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 318
A.4 Overall BC Dataset Results . . . . . . . . . . . . . . . . . . . . . 318
A.5 AP Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 318
A.6 Overall AP Dataset Results . . . . . . . . . . . . . . . . . . . . . 318
A.7 BE Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 318
A.8 Overall BE Dataset Results . . . . . . . . . . . . . . . . . . . . . 319
B.1 RD Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 320
B.2 Overall RD Dataset Results . . . . . . . . . . . . . . . . . . . . . 320
B.3 BC Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 321
B.4 Overall BC Dataset Results . . . . . . . . . . . . . . . . . . . . . 321
B.5 AP Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 321
B.6 Overall AP Dataset Results . . . . . . . . . . . . . . . . . . . . . 321
B.7 BE Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 321
B.8 Overall BE Dataset Results . . . . . . . . . . . . . . . . . . . . . 322
C.1 RD Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 323
C.2 Overall RD Dataset Results . . . . . . . . . . . . . . . . . . . . . 323
C.3 BC Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 324
C.4 Overall BC Dataset Results . . . . . . . . . . . . . . . . . . . . . 324
C.5 AP Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 324
C.6 Overall AP Dataset Results . . . . . . . . . . . . . . . . . . . . . 324
C.7 BE Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 324
C.8 Overall BE Dataset Results . . . . . . . . . . . . . . . . . . . . . 325
D.1 AP Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 326
D.2 Overall AP Dataset Results . . . . . . . . . . . . . . . . . . . . . 326
D.3 BE Dataset Results . . . . . . . . . . . . . . . . . . . . . . . . . . 327
D.4 Overall BE Dataset Results . . . . . . . . . . . . . . . . . . . . . 327
List of Figures
1.1 Example of a Surveillance Scene . . . . . . . . . . . . . . . . . . . 2
1.2 Example of Changing Scene Conditions . . . . . . . . . . . . . . . 3
2.1 A Basic Top-Down System . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Foreground Mask for a Scene (Hand Segmented). Areas of motion
are represented as white in the foreground mask. . . . . . . . . . . 17
2.3 Temporarily Stopped Objects . . . . . . . . . . . . . . . . . . . . 22
2.4 The Particle Filter Process (Merwe et al. [122]) . . . . . . . . . . 62
2.5 Sequential Importance Re-sampling. . . . . . . . . . . . . . . . . . 63
2.6 The Condensation Process (Isard and Blake [77]) . . . . . . . . . 64
2.7 Incorporating Adaboost Detections into the BPF Distribution . . 67
2.8 Multi-camera system architecture 1. . . . . . . . . . . . . . . . . . 71
2.9 Multi-camera system architecture 2. . . . . . . . . . . . . . . . . . 72
2.10 Surveillance network containing disjoint cameras. . . . . . . . . . 86
3.1 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.2 Tracking Algorithm Flowchart . . . . . . . . . . . . . . . . . . . . 95
3.3 State Diagram for a Tracked Object . . . . . . . . . . . . . . . . . 96
3.4 Head Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.5 Height Map Generation . . . . . . . . . . . . . . . . . . . . . . . . 103
3.6 Ellipse Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.7 Vehicle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.8 Detecting Overlapping Vehicles . . . . . . . . . . . . . . . . . . . 107
3.9 Ground truth and result overlaps . . . . . . . . . . . . . . . . . . 115
3.10 Examples of Evaluation Data . . . . . . . . . . . . . . . . . . . . 120
3.11 Zones for RD Datasets . . . . . . . . . . . . . . . . . . . . . . . . 123
3.12 Mask Image for RD Datasets . . . . . . . . . . . . . . . . . . . . 123
3.13 Mask Image for BC Datasets . . . . . . . . . . . . . . . . . . . . . 125
3.14 Zones for BE Datasets . . . . . . . . . . . . . . . . . . . . . . . . 126
3.15 Mask Images for BE Datasets . . . . . . . . . . . . . . . . . . . . 127
3.16 Example output from the tracking system . . . . . . . . . . . . . 128
3.17 Example output from RD7 - Loss of track due to the target object
(car in blue rectangle on the far side of the road) being stationary
for several hundred frames . . . . . . . . . . . . . . . . . . . . . . 130
3.18 Example output from BC16 - Detection and localisation errors
caused by shadow/reflection of the moving object . . . . . . . . . 131
3.19 Example output from BC16 - Total occlusion, resulting in loss of
track identities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.20 Example output from BE19-C1 - Impact of poor motion detection
performance on tracking, the person leaving the building is tracked
poorly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.21 Example output from BE19-C3 - Impact of poor motion detection
performance on tracking, the two people are localised poorly . . . 134
3.22 Sensitivity of Baseline System to Variations in τfit . . . . . . . . . 137
3.23 Sensitivity of Baseline System to Variations in τOccPer . . . . . . . 138
3.24 Sensitivity of Baseline System to Variations in τlum . . . . . . . . 139
4.1 Flowchart of proposed motion detection algorithm . . . . . . . . . 144
4.2 Motion Detection - The input image (on the left) is converted into
clusters (on the right) by pairing pixels . . . . . . . . . . . . . . . 146
4.3 Pixel Noise - Indoor Scene . . . . . . . . . . . . . . . . . . . . . . 150
4.4 Pixel Noise - Outdoor Scene . . . . . . . . . . . . . . . . . . . . . 151
4.5 Sample Outdoor Scene With Changing Light Conditions . . . . . 154
4.6 Background Difference Across Whole Scene . . . . . . . . . . . . . 155
4.7 Partitioning of Image for Localised Lighting Compensation . . . . 155
4.8 Background Difference Across Region (2,2) . . . . . . . . . . . . . 156
4.9 Background Difference Across Region (3,5) . . . . . . . . . . . . . 157
4.10 Shadows cast over section of the background . . . . . . . . . . . . 159
4.11 Search Order for Optical Flow . . . . . . . . . . . . . . . . . . . . 162
4.12 Matching Across Cluster Boundaries - Bold lines indicate cluster
groupings, the red cluster location in the current image is being
compared to the blue cluster (actually split across two clusters) in
the previous image. . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.13 Optical Flow Tracking, for Lopf = 4 . . . . . . . . . . . . . . . . . 167
4.14 Optical Flow Pixel States . . . . . . . . . . . . . . . . . . . . . . 169
4.15 Static Layer Matching Flowchart . . . . . . . . . . . . . . . . . . 171
4.16 AESOS Database Example, GL is the grey level of the synthetic
figures in the scene . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.17 Synthetic Motion Detection Performance for AESOS Set 1, GL150 179
4.18 Synthetic Motion Detection Performance for AESOS Set 3, GL80 180
4.19 AESOS Lighting Variation Database . . . . . . . . . . . . . . . . 182
4.20 Synthetic Lighting Normalisation Performance using Set 02 . . . . 183
4.21 Synthetic Lighting Normalisation Performance using Set 03 . . . . 184
4.22 Motion Detection Results for Real World Sequence . . . . . . . . 186
4.23 Motion Detection Results for Real World Sequence . . . . . . . . 187
4.24 Optical Flow Performance - CAVIAR Set WalkByShop1front,
Frame 1640 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.25 Optical Flow Performance - CAVIAR Set OneStopNoEnter2front,
Frame 541 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
4.26 Optical Flow Performance - CAVIAR Set OneStopNoEnter2front,
Frame 1101 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
4.27 Overlap Detection - Example 1 . . . . . . . . . . . . . . . . . . . 198
4.28 Overlap Detection - Example 2 . . . . . . . . . . . . . . . . . . . 198
4.29 Overlap Detection - Example 3 . . . . . . . . . . . . . . . . . . . 199
5.1 State Diagram Incorporating Static Objects . . . . . . . . . . . . 204
5.2 Static Object Detection using the Template Image . . . . . . . . . 206
5.3 Example Image Containing a Discontinuity . . . . . . . . . . . . . 208
5.4 Flow Status Vertical Projections . . . . . . . . . . . . . . . . . . . 209
5.5 Detected Discontinuities . . . . . . . . . . . . . . . . . . . . . . . 209
5.6 Example Image Containing Different Types of Discontinuities . . 210
5.7 Flow Status Vertical Projections . . . . . . . . . . . . . . . . . . . 211
5.8 Classified Discontinuities . . . . . . . . . . . . . . . . . . . . . . . 211
5.9 Tracking Algorithm Flowchart with Modified Object Detection
Routines (additions/changes to baseline system shown in yellow) . 212
5.10 Process for Detecting a Known Object . . . . . . . . . . . . . . . 212
5.11 Example output from RD7 - Maintaining Tracking of Temporarily
Stopped Objects (the car on the far side of the road) . . . . . . . 219
5.12 Example output from RD7 - Improved detection and localisation
of objects that have been stationary for long periods of time (the
car on the far side of the road) . . . . . . . . . . . . . . . . . . 219
5.13 Example output from BC16 - Improved Detection Results due to
Proposed Motion Detection Routine . . . . . . . . . . . . . . . . . 220
5.14 Example output from AP12-C7 - Tracking Example for AP Dataset 221
5.15 Example output from AP11-C4 - Tracking Example for AP Dataset 221
5.16 Example Output from BE19-C1 - Maintaining tracks for stationary
objects (the parked car) . . . . . . . . . . . . . . . . . . . . . . . 223
5.17 Example Output from BE20-C3 - Occlusion handling . . . . . . . 223
6.1 Dynamic Sizing of Particle Filter . . . . . . . . . . . . . . . . . . 232
6.2 Sequential Importance Re-sampling for the SCF . . . . . . . . . . 233
6.3 A Typical Occlusion between Two Objects . . . . . . . . . . . . . 236
6.4 Calculating particle weights for occluded objects . . . . . . . . . . 237
6.5 Scalable Condensation Filter Process . . . . . . . . . . . . . . . . 240
6.6 Dividing input image for Appearance Model . . . . . . . . . . . . 242
6.7 Integration of the SCF into the Tracking System . . . . . . . . . . 250
6.8 Determining match probability using particles . . . . . . . . . . . 251
6.9 Update State Diagram Incorporating Occluded and Predicted States 253
6.10 Example Output from RD7 - Occlusion handling using the SCF
(the person walking behind the car on the far side of the road) . . 259
6.11 Example Output from RD7 - Occlusion handling using the SCF
(the group of people in the bottom right corner of the scene) . . . 260
6.12 Example Output from BC16 - Initialisation of tracks from spurious
motion (the person entering through the door at the top of the scene) 261
6.13 Example Output from BC16 - Improved occlusion handling using
the SCF (one person walks behind another) . . . . . . . . . . . . 263
6.14 Example Output from AP11-C4 - Errors as an object leaves the
scene using the SCF. . . . . . . . . . . . . . . . . . . . . . . . . . 264
6.15 Example Output from BE19-C1 - Tracking Performance using the
SCF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.16 Example Output from BE20-C3 - Localisation Performance using
the SCF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.1 A Severe Occlusion Observed from Two Views . . . . . . . . . . . 276
7.2 Example System Results AP11 - Object leaving and re-entering
field of view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
7.3 Example System Results AP12 - Object leaving and re-entering
field of view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
7.4 Example System Results BE19 - Track matching and occlusion
handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.5 Example System Results BE19 - Re-associating objects after de-
tection and tracking failure . . . . . . . . . . . . . . . . . . . . . . 283
7.6 Example System Results BE20 - Occlusion Performance . . . . . . 284
7.7 The points for fusion in the system . . . . . . . . . . . . . . . . . 286
7.8 Fusing Visual and Thermal Information into a YCbCr Image for
use in the Motion Detection . . . . . . . . . . . . . . . . . . . . . 287
7.9 Noise in Colour Images for Set 1 . . . . . . . . . . . . . . . . . . . 293
7.10 Example System Results for Set 1 - Occlusion . . . . . . . . . . . 294
7.11 Example System Results for Set 2 . . . . . . . . . . . . . . . . . . 298
7.12 Example System Results for Set 2 . . . . . . . . . . . . . . . . . . 299
7.13 Flowchart for Proposed Fusion System . . . . . . . . . . . . . . . 301
7.14 Updated State Diagram . . . . . . . . . . . . . . . . . . . . . . . 304
7.15 Example System Results for Set 2 . . . . . . . . . . . . . . . . . . 307
Notation
Ax X dimension of an appearance model.
Ay Y dimension of an appearance model.
Aec(x, y, t) Error measure for the colour component of an ap-
pearance model.
Aeopf (x, y, t) Error measure for the optical flow component of
an appearance model.
Ac(x, y, t, k) Value of the kth colour channel of an appearance
model.
Au(x,y,t) Value of the horizontal velocity of an appearance
model.
Av(x, y, t) Value of the vertical velocity of an appearance
model.
Am(x, y, t) Value of the motion occupancy of an appearance
model.
As(x, y, t) Value of the motion state for an appearance model.
Avehicle The number of motion pixels within the bounds of
a vehicle candidate.
B(T i, Ck) The Bhattacharya coefficient for the comparison of
histograms belonging to a candidate object and a
tracked object.
C(x, y, t, k) A cluster at pixel x, y in a background model.
Cy1(x, y, t, k) First luminance value of a cluster.
Cy2(x, y, t, k) Second luminance value of a cluster.
Cghy1 (x, y, t, k) First horizontal luminance gradient value of a clus-
ter.
Cghy2 (x, y, t, k) Second horizontal luminance gradient value of a
cluster.
Cgvy1 (x, y, t, k) First vertical luminance gradient value of a cluster.
Cgvy2 (x, y, t, k) Second vertical luminance gradient value of a clus-
ter.
Ccb(x, y, t, k) Blue chrominance value of a cluster.
Ccr(x, y, t, k) Red chrominance value of a cluster.
Cw(x, y, t, k) Weight value of a cluster.
Cob(x, y, t) Colour of the matching motion layer (i.e. active
or one of the static layers) for an appearance com-
parison involving static motion.
Cos(x, y, t) Colour of the matching static layer for an appear-
ance comparison involving static motion.
coccluded Counter monitoring the time an object spends in
the occluded state.
cpredicted Counter monitoring the time an object spends in
the predicted state.
card(x) The cardinality (size) of a set x.
χ Allowed variation in lighting compensation value
between two frames.
Di A candidate object, index i.
Dix X position of a candidate object.
Diy Y position of a candidate object.
Diw Width of a candidate object.
Dih Height of a candidate object.
DiffChr(x, y, t) Chrominance difference when matching an input
pixel pair to a cluster in a background model.
DiffLum(x, y, t) Luminance difference when matching an input
pixel pair to a cluster in a background model.
dmin Minimum dimension bounds for particles in the
SCF.
dmax Maximum dimension bounds for particles in the
SCF.
dxmin Minimum bounds for the x dimension of particles
in the SCF.
dxmax Maximum bounds for the x dimension of particles
in the SCF.
dymin Minimum bounds for the y dimension of particles
in the SCF.
dymax Maximum bounds for the y dimension of particles
in the SCF.
dwmin Minimum bounds for the width dimension of par-
ticles in the SCF.
dwmax Maximum bounds for the width dimension of par-
ticles in the SCF.
dhmin Minimum bounds for the height dimension of par-
ticles in the SCF.
dhmax Maximum bounds for the height dimension of par-
ticles in the SCF.
E(i, j) An ellipse mask used during person detection.
Earea(Ci, T j) Error in area between a candidate object and a
tracked object.
Eposition(Ci, T j) Error in position between a candidate object and
a tracked object.
emax The maximum allowed drift of a particle from
frame to frame.
exmax The maximum allowed drift of a particle from
frame to frame for the x dimension.
eymax The maximum allowed drift of a particle from
frame to frame for the y dimension.
ewmax The maximum allowed drift of a particle from
frame to frame for the width dimension.
ehmax The maximum allowed drift of a particle from
frame to frame for the height dimension.
Fc(x, y, t, k) Value of the kth colour channel of a feature ex-
tracted from an incoming image to build/compare
to an appearance model.
F activec (x′, y′, t, k) Intermediary value used when computing an ap-
pearance model that involves both active and
static motion.
F staticc (x′, y′, t, k) Intermediary value used when computing an ap-
pearance model that involves both active and
static motion.
Fu(x,y,t) Value of the horizontal velocity of a feature ex-
tracted from an incoming image to build/compare
to an appearance model.
Fv(x, y, t) Value of the vertical velocity of a feature extracted
from an incoming image to build/compare to an
appearance model.
Fm(x, y, t) Value of the motion occupancy of a feature ex-
tracted from an incoming image to build/compare
to an appearance model.
F (Ci, T j) A fit function between a tracked object and a can-
didate object.
F̄ (Ci, T j) A fit function between a tracked object and a can-
didate object that incorporates SCF output.
Farea(Ci, T j) A fit function between a tracked object and a can-
didate object considering only the area.
Fposition(Ci, T j) A fit function between a tracked object and a can-
didate object considering only the position.
FSCF (Ci, T j) A fit function between a tracked object and a can-
didate object considering the output of the SCF.
fgnd(t) The set of all clusters that are classified as fore-
ground.
fcount The number of frames a cluster in motion has been
tracked for.
Gc(x′, y′, t) The number of standard deviations an observation
is from the mean for the colour component of an
appearance model.
Gopf (x′, y′, t) The number of standard deviations an observation
is from the mean for the optical flow component
of an appearance model.
γi The initial position of an object (x, y, w, h).
Φµ,σ The cumulative density function for a Gaussian
distribution.
hproj(j) A vector containing a horizontal projection of a
motion image.
H(T i, n) The nth bin of the histogram of a tracked object.
I(t) Input Image.
Imobj Candidate image for an object.
IS Image of pixel states.
IST,n Static image template for an object, n.
Iy(t) The luminance channel of an input image in
YCbCr 4:2:2 format.
IZ Image composed of static cluster colours.
i Index into a list/vector/image. See text for con-
text.
j Index into a list/vector/image. See text for con-
text.
K Total number of clusters per pixel in a background
model.
Ks Number of stationary clusters per pixel in a back-
ground model.
Kb Number of background clusters per pixel in a back-
ground model.
k Index of a cluster at a pixel in a background model.
κ Index of the matching cluster for a pixel in a back-
ground model.
L Learning rate of a background model.
Lopf Learning rate of the average velocity for a cluster.
Lfus Learning rate of the performance metrics for object
fusion.
λi Movement of an object, i, from one frame to the
next.
M A motion image.
Ma Active motion image.
MIR Motion image from an IR camera.
Ms Static motion image.
MV is Motion image from visible spectrum camera.
Mk Value used to update cluster weights.
min(x, y) The minimum of two values, x and y.
max(x, y) The maximum of two values, x and y.
N(t) The number of objects detected at the current
time.
Nthermal(t) The number of objects detected in the thermal do-
main at the current time.
Nvisual(t) The number of objects detected in the visual do-
main at the current time.
νinit Initial number of particles allocated to each object
in the SCF.
νadd Additional number of particles that can be added
for each occlusion level increase for each object in
the SCF.
Ofused(t) List of objects created by fusing detection lists for
the visual and thermal domains.
Othermal(t) List of objects detected using input from the ther-
mal domain.
Ovisible(t) List of objects detected using input from the visi-
ble spectrum.
Olum(r, t) Weighted average of luminance changes for a re-
gion within the scene.
Operson The occupancy (percentage of motion pixels within
the bounds) of a person candidate.
Ovehicle The occupancy (percentage of motion pixels within
the bounds) of a vehicle candidate.
Ovhoriz Horizontal overlap.
Ovvert Vertical overlap.
Θ(x, y, t) The luminance at a pixel.
p(xt|xt−1) Probability distribution of the SCF.
pi(xi,t|xi,t−1) Probability distribution of the particles for the ith
track within the SCF.
p(x, y, t) A pixel at located at x, y at time t.
P (x, y, t) A pixel pair, located at x, y at time t.
Pcount Size of the set, Wfgnd(x, y, t).
PTj ,t Image of back projected particle probabilities for
a tracked object.
pm(t) Overall performance for the modality m’s object
detection.
pthermal(t) Overall performance for the thermal domain’s ob-
ject detection.
pvisual(t) Overall performance for the visual domain’s object
detection.
Qp The number of successive frames a cluster's posi-
tion can be estimated.
Qv The number of frames a cluster’s motion must have
been tracked for before an overlap can be recorded
involving the cluster.
qi(xi,t|xi,t−1) Probability distribution formed by object detec-
tion results for the ith track.
qthermal(t) Quality measure for the object detection perfor-
mance in the thermal domain for the current
frame.
qvisual(t) Quality measure for the object detection perfor-
mance in the visual domain for the current frame.
R Number of regions a scene is divided into.
Ri, j Reprojection error when transferring two tracked
object’s positions between camera views.
Rm Ratio between static foreground and active fore-
ground.
r Index of a subregion of the scene.
ρ A random vector used when updating particles in
the SCF.
S A pixel state, one of new, continuous, overlap or
ended.
S(x, y, t) The state of pixel pair.
si,n,t The nth particle for the ith track in the SCF.
σarea The standard deviation of the area error for match-
ing candidate objects to tracked objects.
σ(t)2Chr Variance of the chrominance differences when
matching clusters for a background model.
σ(t)2Chrinit Initial chrominance variance for a background
model.
σ(t)2Lum Variance of the luminance differences when match-
ing clusters for a background model.
σ(t)2Luminit Initial luminance variance for a background model.
σlum(r, t) Weighted standard deviation of the luminance
changes for a region for a given frame.
σpos The standard deviation of the position error for
matching candidate objects to tracked objects.
T j A tracked object, index j.
T jx X position of a tracked object.
T jx,i X position of a tracked object in view i.
T jx,ω X position of a tracked object in world coordinates.
T jy Y position of a tracked object.
T jy,i Y position of a tracked object in view i.
T jy,ω Y position of a tracked object in world coordinates.
T ju X velocity of a tracked object.
T jv Y velocity of a tracked object.
T jw Width of a tracked object.
T jh Height of a tracked object.
τAreaV eh Threshold applied to Avehicle to determine if the
detected candidate is valid.
τa Threshold to determine if a modality is performing
well enough for objects to be added based solely
on detections from that modality.
τactive Number of frames of continuous detection required
for an object to enter the active state after cre-
ation.
τB Threshold for determining a match between two
histograms when matching objects across different
camera views.
τChr Chrominance threshold for a background model.
τChrShad Inverse scaling factor for the chrominance thresh-
old when detecting shadows.
τd Threshold for determining if two views of an object
(i.e. the same object observed in different cam-
eras) have transferred coordinates close enough to
remain paired.
τF1 Threshold for fusing motion images from multiple
modalities.
τF2 Threshold for fusing motion images from multiple
modalities.
τFScale Scaling factor for object detection using multiple
modalities.
τfit Threshold on fit scores.
τforeground Threshold to separate foreground from background
in a background model.
τGrad Gradient threshold for shadow detection in a back-
ground model.
τ gradov Threshold on gradient overlaps.
τ gradprox Threshold on gradient overlap proximity for merg-
ing.
τLum Luminance threshold for a background model.
τLumShad Threshold for the luminance difference when de-
tecting shadows.
τmc Threshold to determine if the percentage change
in the amount of motion from frame to frame is
acceptable.
τMaxChr Maximum chrominance threshold for the motion
detection when a variable threshold is used.
τMaxLum Maximum luminance threshold for the motion de-
tection when a variable threshold is used.
τMinChr Minimum chrominance threshold for the motion
detection when a variable threshold is used.
τMinLum Minimum luminance threshold for the motion de-
tection when a variable threshold is used.
τnearby Threshold to determine nearby objects for
adding/removing particles from the SCF.
τoccluded Number of frames an object is allowed to exist in
the occluded state for before it is removed.
τOccPer Threshold applied to Operson to determine if the
detected candidate is valid.
τOccV eh Threshold applied to Ovehicle to determine if the
detected candidate is valid.
τR Threshold for determining if two objects in dif-
ferent cameras have transferred coordinates close
enough to be considered the same object.
τS Threshold applied to Rm when determining if a
region is stationary.
τSOv Threshold for overlap between the active state and
discontinuity states for detecting overlaps.
τSOvType Threshold for the overlap between the new and
overlap states for determining the type of overlap
present.
τstatic Threshold on object velocity to determine if an
object is stationary.
τvel Threshold for velocity error when extracting a can-
didate image from optical flow images.
t Current time step.
tstatic A period of time that a cluster has remained sta-
tionary for.
txj X coordinate that has been transferred from view
j.
tyj Y coordinate that has been transferred from view
j.
U(T i, T j, Ck) The uncertainty between two tracked objects and
a candidate object.
U(x, y, t) Horizontal flow image.
uave The average horizontal velocity of a cluster.
υx The x position of a cluster.
υy The y position of a cluster.
V (x, y, t) Vertical flow image.
vcontour(i) A vector containing the top contour of a motion
image.
vHeightMap(i) A weighted sum of a vcontour(i) and a vproj(i).
vproj(i) A vector containing a vertical projection of a mo-
tion image.
vave The average vertical velocity of a cluster.
vx X velocity.
vy Y velocity.
W (x1 : x2, y1 : y2, t) A window of clusters.
Wfgnd(x, y, t) A subset of W (x1 : x2, y1 : y2, t) where all clusters
within the subset were in the foreground at t− 1.
WLum Average luminance difference when matching two
windows of clusters.
WGrad Average gradient difference when matching two
windows of clusters.
wk Weight for a cluster in the background model, in-
dex k.
wi,n,t The weight for the nth particle for the ith track in
the SCF.
wthermal(t) The weight assigned to the thermal domain during
fusion of detected object lists.
wvisual(t) The weight assigned to the visual domain during
fusion of detected object lists.
X Image width.
x X coordinate.
Y Image height.
y y coordinate.
Z(x, y, t, z) A static layer within a motion model.
Zc(x, y, t, z) Counter for a static layer.
z Depth of a static layer.
zj,i,t The jth particle filter feature for the ith tracked
object.
z−k,i,t The kth negative particle filter feature for the ith
tracked object.
ζi,ω Function to transform image coordinates to world
coordinates.
ζi,j Function to transform image coordinates between
two cameras.
Acronyms & Abbreviations
2D 2-dimensional
3D 3-dimensional
AO Abandoned Object
AOD Abandoned Object Detection
BPF Boosted Particle Filter
CAVIAR Context Aware Vision using Image-based Active Recognition
ETISEO Evaluation du Traitement et de l'Interpretation de Sequences Video - Evaluation for video understanding
FOV Field of View
FN False Negative
FP False Positive
FPS Frames Per Second
GHz Gigahertz
GL Grey Level
GMM Gaussian Mixture Model
GT Ground Truth
HMM Hidden Markov Model
HSV Hue, Saturation, Value
ID Identity
IR Infrared
MPF Mixture Particle Filter
OTCBVS Object Tracking and Classification in and Beyond the Visible Spectrum
PDF Probability Density Function
PETS Performance Evaluation of Tracking Systems
PTZ Pan-Tilt-Zoom
RGB Red, Green, Blue
SAD Sum of Absolute Differences
SCF Scalable Condensation Filter
SIR Sequential Importance Re-sampling
TN True Negative
TP True Positive
Y’CbCr Luminance, Blue Chrominance, Red Chrominance
XML Extensible Markup Language
List of Publications
The journal articles that have been published as part of this research are as
follows:
1. S. Denman, V. Chandran, and S. Sridharan, ‘An Adaptive Optical Flow
Technique for Person Tracking Systems,' Elsevier Pattern Recognition Let-
ters, vol. 28, pp. 1232-1239, 15 July 2007.
2. S. Denman, T. Lamb, C. Fookes, and S. Sridharan, ‘Multi-Spectral Fusion
for Surveillance Systems,’ International Journal on Computers and Electri-
cal Engineering, Accepted and published online on 30th December 2008,
DOI:10.1016/j.compeleceng.2008.11.011
The book chapters that have been published as part of this research are as follows:
1. F. Lin, S. Denman, V. Chandran, and S. Sridharan, Computational Foren-
sics, ch. Improved Subject Identification in Surveillance Video using Super-
resolution. Springer-Verlag, 2008. (Accepted 11 March 2008)
The conference articles that have been published as part of this research are as
follows:
1. S. Denman, V. Chandran, and S. Sridharan, ‘Tracking People in 3D Using
Position, Size and Shape,’ in Eighth International Symposium on Signal
Processing and Its Applications, Sydney, Australia, 2005, pp. 611-614.
2. S. Denman, V. Chandran, and S. Sridharan, ‘Adaptive Optical Flow for
Person Tracking,’ in Digital Image Computing: Techniques and Applica-
tions, Cairns, Australia, 2005, pp. 44-50.
3. S. Denman, V. Chandran, and S. Sridharan, ‘Person Tracking using Mo-
tion Detection and Optical Flow,' in The 4th Workshop on the Internet,
Telecommunications and Signal Processing, Noosa, Australia, 2005, pp.
242-247.
4. S. Denman, V. Chandran, and S. Sridharan, ‘A Multi-Class Tracker us-
ing a Scalable Condensation Filter,’ in Advanced Video and Signal Based
Surveillance, Sydney, 2006, DOI:10.1109/AVSS.2006.7.
5. S. Denman, C. Fookes, J. Cook, C. Davoren, A. Mamic, G. Farquharson,
D. Chen, B. Chen, and S. Sridharan, ‘Multi-view Intelligent Vehicle Surveil-
lance System,’ in Advanced Video and Signal Based Surveillance, Sydney,
2006, DOI:10.1109/AVSS.2006.78.
6. S. Denman, V. Chandran, and S. Sridharan, ‘Robust Multi-Layer Fore-
ground Segmentation for Surveillance Applications,’ in IAPR Conference
on Machine Vision Applications, The University of Tokyo, Japan, 2007, pp.
496-499.
7. F. Lin, S. Denman, V. Chandran, and S. Sridharan, ‘Automatic Track-
ing, Super-Resolution and Recognition of Human Faces from Surveillance
Video,’ in IAPR Conference on Machine Vision Applications, Tokyo, 2007,
pp. 37-40.
8. S. Denman, T. Lamb, C. Fookes, S. Sridharan, and V. Chandran, ‘Multi-
Sensor Tracking using a Scalable Condensation Filter,’ in International Con-
ference on Signal Processing and Communication Systems (ICSPCS), Gold
Coast, QLD, 2007, pp. 429-438.
9. S. Denman, S. Sridharan, and V. Chandran, ‘Abandoned Object Detection
Using Multi-Layer Motion Detection,’ in International Conference on Sig-
nal Processing and Communication Systems (ICSPCS), Gold Coast, QLD,
2007, pp. 439-448.
10. S. Denman, C. Fookes, V. Chandran, S. Sridharan, ‘Object Tracking using
Multiple Motion Modalities’, in International Conference on Signal Process-
ing and Communication Systems (ICSPCS), Gold Coast, QLD, 2008.
11. D. Ryan, S. Denman, C. Fookes, S. Sridharan, ‘Scene Invariant Crowd
Counting for Real-Time Surveillance’, in International Conference on Sig-
nal Processing and Communication Systems (ICSPCS), Gold Coast, QLD,
2008.
Certification of Thesis
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher educational institution. To the best of my
knowledge and belief, the thesis contains no material previously published or
written by another person except where due reference is made.
Signed:
Date:
Acknowledgments
I would firstly like to thank my supervisors, Associate Professor Vinod Chandran
and Professor Sridha Sridharan for their support and guidance throughout my
PhD. Thanks also to Clinton Fookes for his support and assistance.
I would also like to thank everyone in the Speech, Audio, Image and Video Tech-
nology (SAIVT) laboratory for the games of cards, entertaining (and at times
strange and confusing) conversations, and assistance.
I would also like to thank my family for their support throughout the PhD. Finally
I would like to thank my wife Pamela, for her support and understanding when
things weren’t working and deadlines were looming, and especially for the endless
supply of chocolate chip biscuits.
Simon Paul Denman
Queensland University of Technology
May 2009
Chapter 1
Introduction
1.1 Motivation and Overview
This dissertation is an investigation into object tracking and how it can be applied
to surveillance footage. Surveillance networks are typically monitored by one or
more people, looking at several monitors displaying the camera feeds. Each person
may potentially be responsible for monitoring hundreds of cameras, making it very
difficult for the human operator to effectively detect events as they happen. Given
the rise in security concerns in recent years, more and more security cameras have
been deployed, placing further strain on human operators.
Recently, computer vision research has begun to address ways to automatically
process some of this data, to assist human operators. Various areas of research
such as object tracking, event recognition, crowd analysis and human identifi-
cation at a distance are being pursued as a means to aid human operators and
improve the security of areas such as transport hubs (i.e. airports, train stations).
The task of object tracking is key to the effective use of more advanced technolo-
gies (i.e. to recognise the actions of a person, they need to be tracked first), and
can be used as an aid to enhance the utility of others (i.e. crowd analysis may
be able to detect an object moving against the flow of traffic, which can then be
tracked using object tracking techniques).
Object tracking itself is the task of following one or more objects about a scene,
from when they first appear to when they leave the scene. An object may be
anything of interest within the scene that can be detected, and depends on the
requirements of the scene itself. Figure 1.1 shows an example of a typical view
from a surveillance camera, with various objects that need to be tracked outlined
(the colour indicates an ID; note that the ID assigned to each object is consistent
over time). In this scene, vehicles and people are of interest. To track the objects
in the scene, they must be detected and matched from frame to frame. To match
objects detected in one frame to those present in a previous frame, one or more
features that can be used to compare the objects are required. Problems such as
occlusions, and errors in detection algorithms (either missed detections, or false
detections) must also be overcome.
Figure 1.1: Example of a Surveillance Scene (frames 500, 600 and 700)
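To make the matching step concrete, the sketch below gives a minimal, hypothetical fit function that scores candidate detections against tracked objects using position and area error (loosely following the Fposition and Farea terms listed in the Notation) and then pairs them greedily. The field names, standard deviations and threshold are illustrative assumptions only, not the matching scheme developed in Chapter 3.

```python
import math

def fit(candidate, track, sigma_pos=20.0, sigma_area=500.0):
    """Score a candidate detection against a tracked object.

    candidate/track are dicts with keys x, y, w, h (assumed layout).
    The score combines a position error and an area error, each scaled
    by an assumed standard deviation; higher scores mean a better match.
    """
    e_pos = math.hypot(candidate['x'] - track['x'], candidate['y'] - track['y'])
    e_area = abs(candidate['w'] * candidate['h'] - track['w'] * track['h'])
    return math.exp(-e_pos / sigma_pos) * math.exp(-e_area / sigma_area)

def match(candidates, tracks, threshold=0.1):
    """Greedily pair candidates with tracks, best fit first."""
    scores = sorted(
        ((fit(c, t), ci, ti) for ci, c in enumerate(candidates)
                             for ti, t in enumerate(tracks)),
        reverse=True)
    used_c, used_t, pairs = set(), set(), []
    for s, ci, ti in scores:
        if s < threshold:
            break
        if ci not in used_c and ti not in used_t:
            pairs.append((ci, ti))
            used_c.add(ci)
            used_t.add(ti)
    return pairs  # unmatched candidates may become new tracks
```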
Systems are required to operate in a variety of conditions. Figure 1.2 shows an
example of the same scene at two different times. In a surveillance situation,
it is not feasible for a human operator to change configuration values as the
scene conditions change. Any algorithm must be able to adjust to the new scene
conditions, with minimal error.
Figure 1.2: Example of Changing Scene Conditions
In the remainder of this chapter the aims and objectives of this thesis will be
described. The scope of the thesis will be defined, and an outline of the thesis
will be presented. Finally, the contributions made in this thesis will be stated.
1.2 Aims and Objectives
This thesis aims to improve the performance of object tracking systems by making
original contributions to three components: (a) motion segmentation, (b) particle
filters for object tracking and (c) fusion of information. These contributions will
be in the nature of:
1. Improvements to motion segmentation to enhance performance in adverse
conditions (shadow detection, variable thresholds, detection and manage-
ment of changing lighting conditions) and to improve the general utility of
the algorithm (simultaneously computing optical flow, segmenting motion
into multiple layers of static motion and active motion). See Section 1.2.1
for more detail.
2. Improvements to particle filters to allow a particle filter to be used within
an existing tracking system, rather than as a self-contained tracking system.
See Section 1.2.2 for more detail.
3. Improvements to multi-modal fusion in tracking systems to determine the
optimal fusion point for a tracking system, and to develop a fusion algorithm
that is able to dynamically alter the fusion parameters according to the
performance of each mode. See Section 1.2.3 for more detail.
1.2.1 Improvements to Motion Segmentation
Motion segmentation forms the basis of many tracking systems. In such systems,
detection results and thus tracking performance are ultimately reliant on the mo-
tion segmentation performance. Shadows and lighting fluctuations can introduce
further errors into the motion segmentation process and have an adverse effect on
a tracking system. Other tracking systems use optical flow information as a basis
for detecting and tracking objects. Optical flow offers advantages in its ability
to separate objects moving in different directions, but does complicate the initial
detection of objects. Optical flow is also prone to error in the presence of in-
consistent lighting and discontinuities, such as those caused by the self occlusion
present when a person walks.
This research aims to improve motion segmentation for object tracking appli-
cations, and investigate ways to use additional motion information to improve
object tracking performance. Two key additions to an existing motion segmen-
tation algorithm [18] are proposed:
1. The simultaneous computation of optical flow and motion segmentation,
using motion information from the previous frames to reduce discontinuities
caused by a moving object against the background and eliminate the need
for the previous frame to be stored.
2. The ability to segment the scene into layers, consisting of multiple layers of
stationary foreground and a single layer moving foreground.
Each of these additions results in additional output being generated. As such,
techniques are proposed that allow a tracking system to utilise this information
in addition to the standard motion mask, to improve detection and tracking
results. It will be shown in Chapter 5 that the additional information provided
by the proposed motion detection algorithm (described in Chapter 4) results in
a significant improvement in system performance.
1.2.2 Improvements to Particle Filters
Particle filters [120] allow objects to be tracked without the need to detect the
objects every frame. Features for the objects of interest can be extracted, and
used to locate the object in future frames. However, such systems do not allow
for the features to adapt to the changing appearance of the tracked objects, or
provide simple methods through which tracked objects can be added or removed
from the particle filter.
This research aims to investigate ways to integrate a particle filter with a frame-
by-frame, detect and update, tracking system. As a result of this integration,
the proposed particle filter (presented in Chapter 6) is able to use a time varying
number of features and particles for each tracked object, as well as being able to
use different types of features for different tracked objects where appropriate.
1.2.3 Improvements to Multi-Modal Fusion in Tracking
Systems
Tracking systems normally use a single video feed for input. Each type of video
modality, however, has its own set of weaknesses. A visible light modality (colour
or grey scale) is susceptible to lighting changes and shadows, and performs poorly
in low light conditions. For a thermal modality, however, the largest challenge is
the lack of the texture and colour information that enables objects to be
distinguished from one another.
This research aims to determine the most appropriate point for fusion, in a multi-
modal tracking system using a visual colour modality and a thermal modality,
by evaluating four simple fusion schemes at different points in an object tracking
system. Based on these findings (presented in Chapter 7), an improved fusion
system is proposed.
1.3 Scope of Thesis
The scope of this thesis is defined by the following research questions:
1. Can optical flow and multi-layer motion segmentation (separating the fore-
ground into stationary foreground regions and moving foreground regions)
be computed simultaneously?
2. Does the combination of optical flow, and motion segmentation (itself a
combination of stationary foreground and moving foreground) for object
detection result in improved object tracking performance compared to using
motion segmentation on its own?
3. Does a particle filter integrated into a frame-by-frame detect and update
tracking system result in improved object tracking performance when com-
pared to a frame-by-frame detect and update tracking system on its own?
4. Where is the optimal point for fusion in a multi-modal tracking system?
The tracking systems proposed within this thesis do not consider using any
learned model approaches to object detection, relying solely on simple detection
techniques based on the analysis of motion segmentation output. Learned mod-
els have not been considered due to the requirements in training such models to
accurately locate people and cars across a wide range of data, and due to motion
segmentation research being a main component of this thesis. Ultimately, the
use of the simple motion based detection techniques allows a wider range of data
to be tested than would have otherwise been possible. The ETISEO database
[130], and evaluation tool are used to evaluate the performance of object tracking
systems. The research into multi-modal tracking systems is restricted to a single
visual colour modality being fused with a single thermal modality. Performance
is evaluated using the OTCBVS database [42], and the ETISEO evaluation tool.
1.4 Original Contributions and Publications
The original contributions made in this thesis include:
(i) Simultaneous computation of multi-layer motion segmentation and
optical flow
Motion segmentation is a key early step in object tracking, and poor segmen-
tation performance can have an adverse impact on the performance of any
tracking algorithm. This research proposes an improved motion segmentation
algorithm that is able to simultaneously compute optical flow and multi-layer
motion segmentation. Motion information is split into multiple layers of static
foreground (objects that have entered the scene and come to a stop) and a layer
of active foreground (currently moving objects). The proposed algorithm is
evaluated using the AESOS database, the CAVIAR database [48], and data
captured in house, and significant improvement is shown.
(ii) Incorporation of multi-layer motion and optical flow into object
tracking
Object tracking systems typically use either motion segmentation (with only
a single layer of output), optical flow, or a learned model for object detection
and tracking. As tracking systems are generally aimed at real-time performance,
computing multiple motion modes (i.e. motion segmentation and optical flow) is
not ideal. This research proposes implementing the proposed hybrid multi-layer
motion segmentation / optical flow algorithm into a tracking system, and using
the multiple modes of output to improve detection and tracking performance.
The resultant tracking system is evaluated using the ETISEO database [130],
and significant improvement over a baseline tracking system (using a single layer
motion segmentation algorithm) is shown.
(iii) Improved object tracking through the Scalable Condensation
Filter (SCF)
Particle filters allow tracking to be performed in environments where detection
is difficult and unreliable, but still require a form of detection (this may include
manual instantiation) to initialise the particle filter. This research aims to
develop a condensation filter that can be used within a tracking framework
that detects and updates objects on a frame by frame basis, rather than as
a self contained tracking system. This allows the condensation filter to have
access to continuously updated features, and use observations from the detection
algorithms to augment the condensation filter distribution. The proposed
condensation filter is able to dynamically scale the number of particles used for
each track, in addition to the number and type of features used, according to
the system complexity. The proposed tracking system is evaluated using the
ETISEO database [130], and improvement in tracking performance, particularly
occlusion handling, is demonstrated.
(iv) Investigation into and development of methods to fuse multiple
modalities for object tracking
This research aims to investigate the most appropriate way to fuse a visual colour
modality and thermal modality for the task of object tracking. Four simple fusion
schemes are evaluated:
1. Fusion during the motion detection process.
2. Fusion of the motion detection output.
3. Fusion of the object detection results.
4. Fusion of the tracked object lists.
These fusion schemes are evaluated using the OTCBVS database [42] and the
ETISEO evaluation tool [130]. It is shown that fusion of the object detection
results is most effective, and a more sophisticated fusion scheme at this point in
the system is proposed. It is shown that this scheme outperforms each modality
on its own, as well as the earlier evaluated schemes.
1.5 Outline of Thesis
The thesis is outlined as follows:
Chapter 2: Literature Review
• Provides a detailed survey of literature related to motion detection, object
tracking, particle filters and abandoned object detection.
Chapter 3: Tracking System Framework
• Outlines the structure, both algorithmic and programmatic, of the tracking
systems used in this thesis.
• Details the baseline tracking system.
• Outlines the evaluation method used for evaluating the performance of the
tracking systems within this thesis.
• Presents benchmark scores for the baseline tracking system.
Chapter 4: Motion Detection
• Presents a novel motion detection algorithm capable of simultaneously cal-
culating optical flow and segmenting a multi-layered foreground.
• Illustrates the improvement that can be achieved using this algorithm, on
synthetic and real world data.
Chapter 5: Object Detection
• Describes how the proposed motion detection algorithm (Chapter 4) can be
incorporated into the baseline tracking system, to improve object detection
and thus tracking.
• Presents evaluation results for the modified tracking system proposed in
this chapter, and shows the improvement over the baseline system.
Chapter 6: The Scalable Condensation Filter
• Proposes a novel condensation filter, the Scalable Condensation Filter
(SCF), that is able to dynamically resize, and dynamically change features
as the system requires.
• Describes how the SCF can be implemented into the tracking system pro-
posed in Chapter 5.
• Presents evaluation results for the modified tracking system, and shows the
improvement achieved by using the SCF.
Chapter 7: Advanced Object Tracking and Applications
• Proposes a multi-camera tracking system based on the systems proposed
earlier in this thesis, and demonstrates the improvement in performance that
can be gained in situations where occlusions are present.
• Investigates multi-modal fusion approaches for a tracking system.
Chapter 8: Conclusions and Future Work
• Provides a summary of the research as well as possible avenues of future
work.
Chapter 2
Literature Review
2.1 Introduction
Tracking is the process of following an object of interest within a sequence of
frames, from its first appearance to its last. The type of object and its description
within the system depends on the application. During the time that it is present
in the scene, it may be occluded (either partially or fully) by other objects of
interest or fixed obstacles within the scene. A tracking system should be able
to predict the position of any occluded objects through the occlusion, ensuring
that the object is not temporarily lost and only detected again when the object
appears after the occlusion. The process can be extended to multiple cameras,
which can help to overcome occlusions by being able to observe the objects of
interest from multiple angles, but also requires that the same objects in different
views are grouped together.
Object tracking systems are typically geared toward surveillance applications
where it is desired to monitor people and/or vehicles moving about an area.
Systems such as these need to perform in real time, and be able to deal with real
world environments and effects such as changes in lighting and spurious move-
ment in the background (such as trees moving in the wind). Other surveillance
applications include data mining applications, where the aim is to annotate video
after the event. Applications outside of surveillance have been found in the realm
of sport. The ball tracking system, ‘Hawk-eye’ [137], has become a standard
feature of tennis and cricket broadcasts, and uses object tracking techniques to
locate and track the ball as it moves about the court or pitch. The development of
‘smart rooms’, where a central intelligence is able to monitor a room’s occupants
and perform tasks according to the actions of occupants [50, 102] is another area
where object tracking techniques are being applied.
There are two distinct approaches to the tracking problem, top-down and bottom-
up [182]. Top-down methods are goal orientated, and the bulk of tracking systems
are designed in this manner. These typically involve some sort of segmentation
to locate regions of interest, from which objects and features can be extracted for
tracking.
Bottom-up systems respond to stimulus, and behave according to observed
changes. These systems are often used for gesture recognition tasks, and use
descriptor functions (such as optical flow) combined with filters to provide the
stimulus. Bottom-up systems, or stimulus driven systems, have been more focused
on data-mining applications. Efros et al. [45] uses optical flow as the stimulus
to classify actions in a sporting context, either on a soccer field, tennis court or
ballet stage. Zhang [182] uses motion descriptors derived from optical flow and
HMMs to index sports footage such as that from a basketball or volleyball match.
Recently surveillance applications designed to perform crowd monitoring [3] have
been developed using a bottom-up approach.
A top-down approach is the most popular method for developing surveillance
systems for real time tracking. Systems have a common structure (see Figure
2.1) consisting of a segmentation step, a detection step and a tracking step.
Segmentation and object detection are commonly done as a two step process [66,
184], whereby motion detection precedes object detection, which uses the resultant
motion image. However some systems [131, 145] effectively merge these processes
by using model based object detection.
Figure 2.1: A Basic Top-Down System
Predictors are often used to predict the motion of the tracked object, and thus its
position in the next frame. This is used to aid in matching or in the event of the
object being temporarily lost (occluded). Features such as colour are popular to
aid in matching tracks [40, 113], with various forms of histogram matching and
colour clustering used to maintain the identity of tracked objects.
Recently, systems using particle filtering techniques [135, 163] have begun to use
both top-down and bottom-up approaches within the one system. The particle
filters are used to track previously detected objects, and are updated by using
stimulus from the input image(s). New objects are detected and added using a
top-down approach.
Tracking systems are required to function in a wide variety of conditions. Systems
must be able to function in both indoor and outdoor environments, and need to
be able to deal with challenges such as illumination changes and changing weather
conditions (i.e. fog, rain, snow). Other challenges such as occlusions are
commonplace in real world scenarios, and systems must be able to maintain an object’s
identity and a reasonable approximation of its position during these occlusions.
In order to help overcome some of these problems, algorithms that are able to
utilise multi-camera setups have been developed [95, 118]. This however intro-
duces additional challenges such as camera calibration, track handover between
views, and how to utilise Pan-Tilt-Zoom (PTZ) cameras.
In order to gauge the performance of tracking algorithms, several evaluations
have been run, and a growing collection of databases are available. To date, a
large amount of research has used privately collected data to evaluate algorithms,
making comparison between different algorithms difficult.
This chapter will discuss the main areas of research within the field of object
tracking, and will be structured as follows:
• Section 2.2 will discuss motion detection techniques, such as background
segmentation, optical flow and auxiliary processes such as shadow detection.
• Section 2.3 will present various tracking systems for both people and vehi-
cles. The detection and matching of objects, as well as occlusion handling
and features used for tracking will be discussed.
• Section 2.4 will discuss the main types of predictors used within the object
tracking systems.
• Section 2.5 will discuss tracking systems that use multiple cameras, how
such systems are designed, and how they handle problems such as track
handover between cameras and occlusions.
2.2 Foreground Segmentation and Motion De-
tection
Foreground segmentation is the process of dividing a scene into two classes, fore-
ground and background. The background is the region of the scene that is fixed,
such as roads, buildings and furniture. Whilst the background is fixed, its ap-
pearance can be expected to change over time, due to factors such as changing
weather or lighting conditions. The foreground is any element of the scene that
is moving, or expected to move, and some foreground elements may actually be
stationary for long periods of time (such as parked cars, which may be stationary
for hours at a time). It is also possible that some elements of background may
actually move, such as trees moving in a breeze. An example of a motion mask
is shown in Figure 2.2.
Figure 2.2: Foreground Mask for a Scene (Hand Segmented); input image and corresponding foreground mask. Areas of motion are represented as white in the foreground mask.
There are two main approaches to locating foreground objects within surveillance
systems:
1. Background modelling/Subtraction - incoming pixels are compared to a
background model to determine if they are foreground or background.
2. Optical Flow Approaches - compare consecutive images to determine flow
vectors (movement of each pixel) which can be used to detect moving (i.e.
foreground) objects.
Background modelling and subtraction techniques compare incoming images to a
learned background. In the case of background subtraction, single mode models
of the background are used. An image subtraction followed by thresholding is
performed to determine foreground pixels,
$I_{motion} = |I_{input} - I_{background}| > T$, (2.1)
where $I_{motion}$ is the output motion image, $I_{input}$ is the input image, $I_{background}$ is
the background image and T is the motion threshold. Depending on the imple-
mentation, $I_{background}$ may be a fixed image initialised at start up, or may adapt
to changes in the scene (i.e. lighting fluctuations).
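To make this concrete, the following is a minimal sketch of the subtraction and thresholding test of Equation 2.1, written in Python with NumPy; the function and variable names are illustrative only, and the optional running-average update is one simple way (among many) of letting the background adapt.

import numpy as np

def background_subtraction(frame, background, T=30, alpha=0.0):
    # Equation 2.1: a pixel is in motion if |I_input - I_background| > T.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    motion = diff > T
    if alpha > 0.0:
        # Optional running average so the background adapts to gradual
        # changes (e.g. lighting fluctuations) in non-moving areas.
        background[~motion] = ((1.0 - alpha) * background[~motion]
                               + alpha * frame[~motion])
    return motion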
Background modelling techniques [18, 97, 157] use a multi-modal approach, where
various modes of the background are stored for each pixel,
$I(x, y)_{background} = \{c, w, n\}, \quad n = [1..N]$, (2.2)
where (x, y) is a pixel in the background model; {c, w, n} is a single model of
background with c equal to the colour (this may be multiple values), w is the
weight and n is the index; and N is the number of models in the background.
Incoming images are compared to all possible background modes in order of
weight (highest to lowest) to determine if the pixel is foreground or not,
$|I(x, y, n)_{background} - I(x, y)_{input}| < T, \quad n = [1..N]$ (2.3)
where (x, y) is the coordinate of the pixel being tested, n is the background mode
being compared to, and N is the total number of background modes. Once a
match is found, the weight of the match is used in determining if the pixel is
foreground. If there is no match, then the pixel must be foreground. Background
segmentation routines update themselves overtime, removing modes that have
low probabilities as new modes appear, and gradually adjusting other modes as
the scene gradually changes (i.e. lighting changes due to the time of day).
Background subtraction, being a simpler approach, results in faster execution
times, but is less robust when exposed to complex scenes containing environmen-
tal effects such as lighting fluctuations, or trees moving in the wind.
Optical flow approaches [9, 72, 114] do not explicitly determine where motion is
in the scene. Instead they attempt to determine what motion each pixel has
undergone between subsequent frames.
Systems belonging to each of these areas shall be discussed, as well as systems
that combine one or more techniques, or do not fall into one of these classifications.
Finally, auxiliary processes such as those responsible for shadow detection or
illumination correction shall be discussed.
2.2.1 Background Subtraction and Background Modelling
Background modelling methods build a model of the background, often multi-
modal, and compare this to each incoming frame. Early approaches used a fixed
model such as an image of the expected background and directly compared this
to incoming frames. Whilst this works in ideal situations, it cannot cope with any
variations in lighting and so is very limited. A moving average of the background
image can be used to add some adaptability to the model, but this is still limited.
Recently multi-modal solutions [18, 157] have been proposed that are able to
cope better with real world problems such as lighting fluctuations and
a time varying background.
The choice of colour space and colour model is also important. Colour spaces
such as YCbCr and HSV separate the colour from the intensity making tasks
such as shadow detection simpler. The choice of colour model in the system can
have a similar effect.
Horprasert et al. [73] proposed a colour model based in RGB space which allowed
brightness and colour distortion to be measured. A given colour in RGB space
is represented as a line from the origin to the colour, its chromaticity line. The
brightness distortion is a scalar value that brings the observed colour back to the
chromaticity line, while the colour distortion is the distance between the observed
colour and the chromaticity line.
and its variance, and the variance of the brightness distortion and colour distor-
tion is used. The use of such a colour model allows shadows or highlights to be
picked up and eliminated from the foreground image.
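As a rough illustration of this colour model, the sketch below computes the brightness distortion (the scaling that projects the observed colour onto the chromaticity line of the expected background colour) and the colour distortion (the distance from that line). It omits the per-channel variance normalisation used in [73], so it should be read as a simplification rather than the published formulation; all names are assumptions.

import numpy as np

def brightness_and_colour_distortion(observed_rgb, expected_rgb):
    i = np.asarray(observed_rgb, dtype=float)
    e = np.asarray(expected_rgb, dtype=float)
    # Brightness distortion: the scalar alpha minimising |i - alpha * e|,
    # i.e. the projection of the observed colour onto the chromaticity line.
    alpha = np.dot(i, e) / np.dot(e, e)
    # Colour distortion: distance from the observed colour to that line.
    cd = np.linalg.norm(i - alpha * e)
    return alpha, cd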
Thongkamwitoon et al. [160] proposed an adaptive background subtraction algo-
rithm based on [73] that could vary the learning rate for different parts of the
scene. A constant learning rate can result in errors in two ways:
1. True background may be lost at areas of high activity when a fast learning
rate is used.
2. New background objects will be incorporated very slowly if a slow learning
rate is used.
To overcome this problem, they define a vivacity factor which can be used as
a substitute for the learning rate. The vivacity is calculated by observing the
changes at a pixel over a window of frames, and can be used to compensate for
different levels of motion within the scene.
Stauffer and Grimson [157] proposed a multi-modal background model using a
GMM to model each pixel, and incoming pixels were compared to the GMM to
determine how well they matched the background. This allowed multi-modal
backgrounds to be effectively modelled, and for the model to learn and adapt to
changes in the background. This mixture of Gaussians (MOGS) approach (or
systems which use an approximation to it [18]) has become popular and several
improvements and variations have been proposed since.
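Variants of this mixture of Gaussians model are available in common vision libraries; the sketch below uses OpenCV's MOG2 implementation (a later variant, not the original algorithm of [157]) purely to illustrate how such a model is driven frame by frame. The input path and parameter values are placeholders.

import cv2

cap = cv2.VideoCapture("surveillance.avi")  # placeholder input path
# history controls how quickly the model adapts; varThreshold is the squared
# Mahalanobis distance used to match a pixel to one of its Gaussian modes.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow, 0 = background
cap.release()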
Harville et al. [68] proposed a system based on the work of Stauffer and Grimson
[157] and the work of Gordon et al. [56], to produce a system that performed
foreground segmentation based on depth and colour information within a MOGS
framework. The proposed algorithm is able to modulate the learning rate accord-
ing to the amount of activity at a given pixel. Pixels that are more active learn
more slowly to preserve the background, while inactive pixels have the learning rate
increased, as it is likely that they represent a new background object.
Bowden and Kaewtrakulpong [13] improved upon the system proposed by Stauffer
and Grimson [157] by improving the learning rate so that it converged on a stable
background faster. Changes were also proposed that allow the system to better
deal with shadows. Further improvements have been proposed by Wang and Suter
[169], who proposed the addition of shadow removal and a foreground support
map to aid in the updating of the background. The MOGs approach has also
been applied to a moving camera situation by Hayman and Eklundh [70], who
adapted [157] to a system with a camera capable of panning and tilting (such as
may be found on a robot, or in video conferencing).
One problem that may arise with these approaches is that objects that are not
actually part of the background may become incorporated into the background
model. When an object first stops, it will still be detected as foreground, how-
ever, after a period of time the object will have been present in the scene long
enough to be considered background. If this occurs, and the object then begins to
move again, it is possible that motion will be incorrectly detected at the location
where the object was. Whilst this could be partially overcome by using a slower
learning rate, this will then affect other aspects of the system such as the ability
to learn changes in the background due to changing light conditions (something
that is necessary in an outdoor scene). Within a tracking system, this inability
can lead to a higher number of occlusions, placing increased demands on other
segmentation steps, or tracking and predictions steps in the case of a surveillance
system.
Figure 2.3: Temporarily Stopped Objects (frames 0, 800, 1900 and 2450)
Figure 2.3 shows several frames of a sequence where cars are temporarily stopping,
and then leaving again. In this situation, a car is stopped within the scene for
over 2000 frames (80 seconds if the frames are captured at 25fps). In the same
time that the car is stopped, there is significant change in the scene’s lighting (note
the difference between frames 1900 and 2450 in Figure 2.3). In such a situation,
if a slow learning rate was used to counter the temporarily stopping objects, the
same slow learning rate would result in excessive false motion detected after the
lighting change.
Harville [67] proposed an extension to the MOGs approach to overcome this
problem. Harville allowed a higher level process to impose positive or negative
feedback to force changes in the background model. In a situation where there
is a stationary foreground object in the scene, feedback can be applied to ensure
that the weights of the mixture components associated with the object remain
sufficiently low that the object is still considered foreground.
Javed et al. [83] added gradient information to the MOGs model to help overcome
illumination changes, as the gradient of the background will remain relatively
stable during an illumination change, even though the colour
does not. Possible gradient distributions could be calculated from the background
model using the various Gaussians that represent these background modes. For
each input pixel, the gradient (magnitude and direction) could be computed and
if it matched one of the possible gradient distributions, then the pixel belonged
to the background.
Zang and Klette [180] proposed a mixture of Gaussians model called PixelMap,
which combines a MOGs approach with region level and frame level considerations
to eliminate holes in objects and help reduce noise. Three processes are combined
in this approach.
1. A standard MOGs background model is used to locate foreground pixels.
2. A frame level process that considers a window of three frames (frames t, t − 1
and t + 1). Two difference images are created, between t + 1 and t, and between t
and t − 1; the logical AND of these two difference images is added
to the mask created by the background model process.
3. A region level process is applied to fill in object holes and remove noise. A
window (5 × 5) is moved across the image. For each point that is in motion,
the motion within the window is analysed to check if it is connected, and if
the window is over 50% full. If this is the case, then the remainder of the
window is filled in, otherwise the centre pixel is set to 0. This third process
replaces the more conventional binary closing routines that are commonly
used, as these routines often struggle to close large holes, or to do so require
large kernels that distort other details.
The resultant algorithm produces motion images with less noise and fewer holes
within the extracted objects.
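A sketch of the third, region level step is given below, assuming a binary motion mask as input and using the 50% fill rule described above; the connectivity check is omitted for brevity, so this is an approximation of the published process rather than the exact algorithm.

import numpy as np

def region_level_process(mask, win=5, fill_ratio=0.5):
    # mask: binary motion image (0 = background, 1 = motion).
    h, w = mask.shape
    r = win // 2
    out = mask.copy()
    ys, xs = np.nonzero(mask)
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        if mask[y0:y1, x0:x1].mean() > fill_ratio:
            # Window is over 50% full: fill in the remainder of the window.
            out[y0:y1, x0:x1] = 1
        else:
            # Otherwise the centre pixel is treated as noise and cleared.
            out[y, x] = 0
    return out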
Hu et al. [75] described a new colour model (cone-shape illumination model,
CSIM) to distinguish between foreground, shadow and highlights, and incorpo-
rated this into a mixture of Gaussians background model, that used both a long
term and short term background model. The proposed CSIM is similar to the
model proposed by [73] in that its centre axis lies on the line from the colour to
the origin. It uses a 3D cone centred at the mean with each axis 2.5 standard
deviations long. Hu et al. [75] also made use of a gradient based background
model [83] to help deal with dramatic lighting changes.
Li et al. [108] proposed a background subtraction algorithm that varied the learn-
ing rate for different areas of the scene according to their detected context. This
modification could be applied to any existing background subtraction scheme.
Two different contextual background region types are defined; fixed public facili-
ties such as counters, stores etc; and homogeneous surfaces such as walls and floor.
Orientation histogram representation (OHR) and principal colour representation
(PCR) are used to distinguish between the different background contexts.
Modelling each pixel with a set of GMMs is very processor intensive however,
and not ideal when foreground segmentation is only the first step in a multi-step
process (i.e. surveillance). To address this, Butler et al. [18] proposed an
adaptive background segmentation algorithm, using an approximation to a GMM,
where each pixel is modelled as a group of clusters. A cluster consists of a
centroid, describing the pixel’s colour; and a weight, denoting the frequency of
its occurrence.
The motion detection uses colour images in Y’CbCr 4:2:2 format as input. Pix-
els are paired to create a cluster which consists of two luminance values (y1
and y2), a blue chrominance value (Cb), and red chrominance value (Cr) to de-
scribe the colour; and a weight, w. For each pixel pair, a set of K clusters,
C(x, y, t, 1..K) = (y1, y2, Cb, Cr, w), is stored, which represents a multi-modal
PDF. The pairing of pixels to form clusters means C(x, y, t, 1..K) is formed by the
pixels at (x, y) and (x+ 1, y). Each pixel is only used once, so for C(x, y, t, 1..K),
x must be even, and the algorithm requires images that have an even horizontal
dimension.
Clusters are ordered from highest to lowest weight; and the current matching
cluster, C(x, y, t,m) (where m is the index of the matching cluster in the range
1..K), for each pixel is stored, giving an approximation of the image.
For each (x, y, t) the algorithm makes a decision assigning it to one of the sets
(background, or a motion layer) by matching C(x, y, t, k), where k is an index in
the range 1 to K, to the pixels in the incoming image. Clusters are matched to
incoming pixels by finding the highest weighted cluster which satisfies,
$|y_1 - C_{y1}(x, y, t, k)| + |y_2 - C_{y2}(x, y, t, k)| < \tau_{Lum}$, (2.4)
$|Cb - C_{Cb}(x, y, t, k)| + |Cr - C_{Cr}(x, y, t, k)| < \tau_{Chr}$, (2.5)
where $y_1$ is the luminance value at (x, y), $y_2$ is the luminance value at (x + 1, y),
Cb is the chrominance value at (x, y), and Cr is the chrominance value at (x +
1, y). Thresholds are applied to the luminance and chrominance, and if both
are satisfied, then the pixel is suitably close to the cluster to be a match. By
separating luminance and chrominance a certain amount of tolerance to shadows
is inbuilt. The centroid of the matching cluster is adjusted to reflect the current
pixel colour, and the weights of all clusters in the pixels group are adjusted to
reflect the new state,
$w_k = w_k + \frac{1}{L}(M_k - w_k)$, (2.6)
where $w_k$ is the weight of the cluster being adjusted; L is the inverse of the traditional
learning rate, α; and $M_k$ is 1 for the matching cluster and 0 for all others. If
there is no match, then the lowest weighted cluster is replaced with a new cluster
representing the incoming pixels.
Based on the accumulated pixel information, the frame can be classified into
foreground,
$fgnd = \forall(x, y, t) \;\text{where}\; \sum_{i=0}^{m} C(x, y, t, i)(w) < T(x, y, t)$, (2.7)
where $T(x, y, t)$ is the foreground/background threshold; and background.
The clusters and weights are gradually adjusted over time as more frames are
processed, allowing the system to adapt to changes in the background model.
This means that new objects can be added to the scene (i.e. a box may be
placed on the floor), and over time these objects will be incorporated into the
background model.
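A sketch of the matching and weight update steps (Equations 2.4 to 2.6) for a single pixel pair is given below. The cluster representation, threshold values and the handling of unmatched pixels are illustrative assumptions, and the centroid adjustment is omitted, so this should not be read as the exact implementation of [18].

def match_and_update(clusters, y1, y2, cb, cr, tau_lum=40, tau_chr=20, L=100):
    # clusters: list of [y1, y2, Cb, Cr, weight], ordered highest weight first.
    match = None
    for k, c in enumerate(clusters):
        if (abs(y1 - c[0]) + abs(y2 - c[1]) < tau_lum and
                abs(cb - c[2]) + abs(cr - c[3]) < tau_chr):
            match = k  # highest weighted cluster satisfying (2.4) and (2.5)
            break
    if match is None:
        # No match: the lowest weighted cluster is replaced by the incoming pixels.
        clusters[-1] = [float(y1), float(y2), float(cb), float(cr), 1.0 / L]
    # Weight update (2.6): w_k = w_k + (1/L)(M_k - w_k), for every cluster.
    for k, c in enumerate(clusters):
        m = 1.0 if k == match else 0.0
        c[4] += (m - c[4]) / L
    clusters.sort(key=lambda c: c[4], reverse=True)
    return clusters, match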
Kim et al. [96] proposed a model where background values were quantized into
codebooks. Each codeword contained the RGB vector, the minimum and max-
imum brightness that the codeword matched to, its frequency, the longest time
that it was not seen, and its first and last access times. The system was similar
in construction to Stauffer and Grimson [157] and used a colour model that was
based on [73], allowing the system to handle shadows and highlights. Each pixel
used a different codebook size, depending on the variation at the pixel. The sys-
tem however requires a training sequence to initialise and does not adapt over
time.
Techniques such as those proposed by Butler et al. [18], Stauffer and Grimson
[157] and Kim et al. [96] provide a certain amount of tolerance for lighting changes
and shadows (depending on thresholds and colour models being used). However
they are still susceptible to rapid lighting changes that may be caused by (in the
case of an outdoor scenario) the sun moving behind a cloud, or (in the case of an
indoor scenario) a light being turned on/off. These techniques may also require a
number of frames to learn the background model. If there are no moving objects
present, then a single frame is often sufficient, but if there are moving objects
already present then the number of frames required will depend on the rate of
movement, and the system’s learning rate.
A limitation of these techniques is their inability to distinguish between objects
that have been temporarily stationary, and those that are continuing to move.
Kim et al. [97] proposed a modification to their earlier system (Kim et al. [96])
that was able to distinguish short term background (i.e. a car that has stopped)
from motion and long term background. Statistics for each possible code are
recorded to determine which codes belong to background and foreground, and
which belong to short-term background (i.e. stopped cars).
Motion detection techniques such as those proposed by Stauffer and Grimson
[157], Butler et al. [18], Kim et al. [96] and their derivatives all work at the pixel
level, detecting individual changes at each pixel to determine motion. In the
event of small camera motion (possibly caused by wind in the case of an outdoor
camera), these methods would all, incorrectly, detect false motion in large por-
tions of the image. This problem could be overcome by using image registration to
ensure that any incoming images are aligned with the background model; how-
ever, this is computationally costly. Adelson [2] proposed a method of modelling
a scene using pixel layers (where a layer is a group of similar pixels), and this
approach has since been applied to the tasks of motion detection (Patwardhan
et al. [138]), video segmentation (Khan and Shah [94], Criminisi et al. [34]) and
object tracking (Tao et al. [159], Zhou and Tao [187]).
Patwardhan et al. [138] proposed using a layered approach to motion detection.
A training sequence is used to locate the layers within a scene, with layers created
according to the similarity of individual pixels’ colour values. When processing
a frame, pixels are compared to a stack of previous frames to determine the
likelihood that they belong to one of the layers that exists at the pixel (using
a set of images allows for multiple layers to be considered, as a pixel may exist
near a layer boundary, or the layer itself may move, i.e. vegetation moving in
the wind). If the pixel does not belong to one of the background layers, it is
assigned to the foreground layer, and the system continues to learn new layers
and update existing ones as more frames are processed. Foreground layers that
become stationary can be added to the background model, and removed again
once the object begins to move (i.e. a parked car). The use of layers also allows
for overlapping foreground regions to be identified.
2.2.2 Optical Flow Approaches
Optical flow is a process which attempts to determine the motion each pixel in a
scene has undergone between subsequent images,
$I(x, y, t) = I(x + \delta u, y + \delta v, t + \delta t)$, (2.8)
where u is the horizontal image velocity and v is the vertical image velocity, t is
the current time step and δt is the time difference between frames. In order to
determine u and v, two assumptions are commonly made:
1. That for a given region across two frames, its appearance will not change
due to lighting (constant luminance).
2. That a pixel present in a frame will still be present in the next frame (no
spatial discontinuities).
Both these assumptions can be restrictive and, when broken, can lead to errors.
Most optical flow techniques are either gradient based methods (Horn and
Schunck [72], Lucas and Kanade [114]), or block matching based methods (Bergen
et al. [6], Burt et al. [17]). Gradient based methods have been preferred due
to speed and performance considerations. Gradient based methods analyse the
change in intensity and gradient (using partial spatial and temporal derivatives)
to determine the optical flow. Block matching based methods rely on determin-
ing the correspondence between the two images. This typically involves matching
‘blocks’ of one image to ‘blocks’ of the other to determine how far that region has
moved.
Both methods perform best when determining flow at or around clearly defined
features, and make assumptions of constant luminance and spatial continuity.
As a result, when objects are not clearly defined (perhaps due to clutter) or the
lighting conditions vary, errors can occur in the optical flow output. Performance
also suffers when trying to determine the flow for uniform regions where there is
little to no texture.
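Dense optical flow implementations are readily available; the sketch below uses OpenCV's Farneback algorithm (a polynomial expansion method, not one of the techniques cited above) simply to show the form of the output: a per-pixel (u, v) velocity field from which a crude motion mask can be derived. The input path and magnitude threshold are placeholders.

import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.avi")  # placeholder input path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # flow[..., 0] holds the horizontal velocity u, flow[..., 1] the vertical v.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    moving = np.linalg.norm(flow, axis=2) > 1.0  # crude motion mask from flow magnitude
    prev_gray = gray
cap.release()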
To try and overcome the limitations of existing methods, Black and Anandan
[9] proposed a robust method based around a robust estimation framework. The
estimation framework reduces the outliers caused by motion discontinuities and
violations of the constant luminance assumption. However, this approach is too
slow for a real time system, and so not suitable for surveillance applications.
For surveillance applications, optical flow images can be restrictive as they will
detect all motion. Background modelling techniques can be trained to filter out
repetitive motion such as trees swaying in the breeze, but optical flow will always
detect this as motion. As such, extensive additional processing may be required
to filter the motion to determine what is caused by the target objects in the scene.
2.2.3 Other Methods
Temporal thresholding processes monitor the variance of pixels over a period of
time. A moving average (or similar construct) is used to calculate the variance,
to which a threshold is applied to determine if there is motion. Temporal thresh-
olding processes, however, leave a trail of motion behind moving objects while the
variance stabilises, making them undesirable for use in applications that require
accurate segmentation.
To overcome this, Joo and Zheng [84] and Abdelkader et al. [1] have proposed
methods that combine temporal thresholding and background subtraction to
achieve a more robust motion detection technique. The mean and variance of
each pixel are calculated over a window of several frames, and recursively updated
for each new frame. A simple exponential decay function is used to update the
filter. To overcome the limitations of temporal thresholding (leaving a trail after
the moving object), a simple background model is used in combination with the
temporal threshold approach. A second set of means and variances is kept to model
the background. The background model uses a much larger window (slower learn-
ing rate), and its update process is selective in that only pixels that (according
to the variance) are not possible foreground pixels are incorporated. A confi-
dence weight that denotes the confidence of a pixel being part of the foreground
is extracted from the background model, which is multiplied with the variance
obtained from the temporal thresholding. The resulting value is thresholded to
determine if the pixel is in motion.
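The sketch below illustrates how such a combination might be structured: an exponentially decaying mean and variance provide the temporal threshold, a second, more slowly updated set of statistics provides the background model and its foreground confidence, and only non-foreground pixels are fed back into the background. The update rules, weighting and thresholds are illustrative assumptions rather than the parameters of [1, 84].

import numpy as np

class TemporalMotionDetector:
    def __init__(self, shape, alpha_fast=0.05, alpha_slow=0.005):
        self.mean_f = np.zeros(shape); self.var_f = np.zeros(shape)  # temporal statistics
        self.mean_b = np.zeros(shape); self.var_b = np.ones(shape)   # background statistics
        self.a_f, self.a_b = alpha_fast, alpha_slow

    def apply(self, frame, thresh=50.0):
        frame = frame.astype(float)
        # Recursively updated mean and variance (exponential decay).
        d = frame - self.mean_f
        self.mean_f += self.a_f * d
        self.var_f = (1.0 - self.a_f) * (self.var_f + self.a_f * d * d)
        # Confidence that a pixel is foreground, taken from the background model.
        conf = np.abs(frame - self.mean_b) / np.sqrt(self.var_b + 1e-6)
        motion = (self.var_f * conf) > thresh
        # Selective background update: only pixels unlikely to be foreground.
        keep = ~motion
        db = frame - self.mean_b
        self.mean_b[keep] += self.a_b * db[keep]
        self.var_b[keep] = (1.0 - self.a_b) * (self.var_b[keep] + self.a_b * db[keep] ** 2)
        return motion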
These methods have an advantage in that they do not require a training period.
Models can be built with motion occurring in the scene without incurring sig-
nificant errors. The background model proposed however only considers a single
mode of background, and so only simple scenes can be effectively modelled.
Latzel et al. [105] proposed using interlaced images to detect motion. By ex-
ploiting the motion artifacts found in interlaced video, it is possible to extract
the edges of motion regions (and of edges within the motion regions) from the
sequence. This results in a system that is robust to lighting changes and environ-
mental changes. However the system is unable to deal with shadows, or moving
objects that temporarily stop, and is obviously not applicable to video streams
that are not interlaced.
Grabner and Bischof [57] proposed a background subtraction method based on
on-line adaboost. The input images are divided into a grid of small overlapping
rectangles and a boosted classifier is trained for each. The resultant system is
very sensitive and able to detect changes in low contrast scenes, however, this
also makes the approach susceptible to noise. The proposed approach is also not
as fast as other algorithms (reported to run between 5 and 10 fps), making it less
suitable for real time systems.
Thermal imagery motion detectors have been proposed by Davis and Sharma
[41] and Latecki et al. [104]. Thermal imagery is often noisier than colour, and
is subject to problems such as thermal halos. Davis and Sharma [41] described a
contour based system for use with IR imagery. A background subtraction routine
is first applied [157] to extract regions of interest. These regions typically contain
the target objects as well as an undesired thermal halo. A contour saliency map
which describes the likelihood of a pixel being on an object boundary is generated
by analysing gradient strengths within the foreground region and background
model. Thinning, thresholding and amplification routines are run over the map
image to produce a contour image for the moving objects. To overcome the
problem of incomplete contours, the watershed transform is used to locate possible
contour completions which are used to close any open contours. The watershed
transform is a method for segmenting images using watershed lines [33, 164]. The
watershed transform treats the input image as a topographical map where the
grey level indicates the elevation. When applied to a gradient map, the watershed
lines are found along gradient ridges.
Latecki et al. [104] proposed using a series of images to detect motion in thermal
imagery. Due to the additional noise present in thermal imagery (when compared
with visual feeds), multiple frames were used to test for motion by measuring
texture spread across a time and space window. A high texture spread indicates
motion at the region.
2.2.4 Auxiliary Processes
Many systems rely on auxiliary processes to aid the motion detection. These
processes are aimed at removing shadows and/or reflections, or at dealing with
lighting fluctuations. Some systems [13] have incorporated these processes di-
rectly into the motion detection. Having such processes incorporated directly
into any motion detection is highly desirable, as it improves motion detection
performance and avoids the need to run additional processes which may compli-
cate the system.
Commonly, shadows are detected by analysing regions of motion in a colour
space that separates intensity and colour information (HSV and Y’CbCr are two
examples). Fung et al. [55] developed a shadow detection system in which it was
proposed that there are two types of shadows, self shadows and cast shadows.
Self shadows occur when one side of an object is not illuminated, whereas cast
shadows occur when an object occludes a light source and casts a shadow onto the
background or other objects. Cast shadows have several distinguishing properties
that allow them to be located and removed:
• The luminance of a cast shadow is lower than that of the background.
• The chrominance of the cast shadow is approximately equal to that of the
background.
• The gradient density between the cast shadow and the background is lower
than that between the object and the background.
• The shadow lies on the edge of the bounding region of movement.
Fung et al. [55] derived metrics or tests for each of these conditions to test for a
pixel being a shadow. From these tests, a ‘shadow confidence score’ is calculated.
A threshold is applied to this score to remove shadows from the image.
Nadimi and Bhanu [127, 128] proposed a physics based approach to shadow detec-
tion. The approach requires a training phase where the body colour (the colour
of the material under white light) of surfaces that may come under shadow in the
scene is calculated. This method uses physical properties of shadows to detect
them using a series of tests. A shadow must result in the intensity of the pixel be-
ing reduced, so a reduction in value across the R, G, and B channels is expected.
A blue ratio test is applied, as it is observed that for shadows cast outdoors onto
neutral surfaces, there is a higher ratio of blue due to the illumination by the
blue sky. An albedo ratio segmentation step is performed, to segment the image
into regions of uniform reflectance. An ambient illumination correction is per-
formed to remove the effect of sky illumination, and then body colour estimation
is performed to determine the true colour of the object. A verification step then
matches the various surfaces with their expected body colours to determine which
regions lie in shadow.
Wang et al. [168] proposed a method which analysed each shadow as a series of
sub-regions. Cast shadows are split into three groups:
1. Deep Umbra Shadow - caused by blocking the sunlight, with no environmen-
tal/reflected light to lessen the shadow.
2. Shallow Umbra Shadow - caused by blocking the sunlight, with environ-
mental/reflected light to lessen the shadow.
3. Penumbra Shadow - partly illuminated by sunlight and environmental light,
this is very difficult for humans to detect.
Each type of shadow is segmented and dealt with separately to improve segmen-
tation accuracy and reduce the number of false shadows found. The percentage of
grey level reduction for pixel regions can be used to roughly segment the three dif-
ferent shadow types. Colour for the pixels is compared to that of the background
image. For a shadow region, the colour should not change significantly.
Grest et al. [59] observed that a region in shadow is a scaled-down version (darker)
of the same region in the background model. As such, normalised cross correlation
can be used to detect shadows. Jacques et al. [79] proposed a variation on the
work of [59] by adding an additional step that used the statistics of local pixel
ratios. Potential shadow regions are detected using NCC as described by [59].
The ratio of the input image and background image is calculated for the pixels
in the local neighbourhood of the candidate shadow. If the standard deviation
of these ratios is beneath a predefined threshold (i.e. if the region has undergone
a constant illumination decrease), then the region is classified as a shadow.
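A sketch of this two stage test, for a single candidate patch, is given below; the thresholds and the exact normalisation are assumptions made for illustration rather than the published values.

import numpy as np

def is_shadow_patch(patch, bg_patch, ncc_thresh=0.95, ratio_std_thresh=0.05):
    p = patch.astype(float).ravel()
    b = bg_patch.astype(float).ravel()
    # Normalised cross correlation [59]: a cast shadow is a darker but
    # structurally similar version of the corresponding background patch.
    pc, bc = p - p.mean(), b - b.mean()
    ncc = np.dot(pc, bc) / (np.linalg.norm(pc) * np.linalg.norm(bc) + 1e-6)
    if ncc < ncc_thresh:
        return False
    # Local pixel ratio statistics [79]: a shadow dims the background by a roughly
    # constant factor, so the standard deviation of the ratios should be small.
    ratios = p / (b + 1e-6)
    return ratios.std() < ratio_std_thresh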
Shadow detection methods such as those proposed by [55, 59, 79, 128, 168] are
complex, multi-step processes and as such are not ideal for use in a tracking
system, or for incorporation into an existing motion detection algorithm. Tech-
niques such as these can only be effectively applied as a post process to any
motion detection.
Edges can also be used to aid in shadow detection. Xu et al. [174] proposed using a
Canny edge detector on the foreground image to separate the various regions (both
shadow and non shadow) in the foreground. Through multi-frame integration,
region growing and edge matching, the shadow regions can be identified and
removed. Zhang et al. [183] was able to detect shadows using the ratio edge.
The ratio edge is the ratio between neighbouring pixels, and it is shown to be
illumination invariant. The location of this edge can be analysed to segment
shadows from moving objects. Once again however, these approaches are only
suitable when used in post processing.
Martel-Brisson and Zaccarin [119] proposed using GMMs to detect shadows (al-
lowing for easy integration with a MOGs foreground segmentation processes).
A Gaussian mixture shadow model (GMSM) is built which contains all possible
shadow states. The YUV colour space is used, and shadows are initially detected
by looking for a constant attenuation across the Y, U and V channels. At each
time step, the foreground state with the largest a priori probability is processed
to determine if it describes a cast shadow. If it does, then the shadow state can be
incorporated into the GMSM, either by combining it with another shadow state,
or as a new state (a complex scene may have two or three states per pixel for
shadows). The GMSM can be used to detect shadows by testing if the foreground
pixels match any of the shadow states of the GMSM.
Cucchiara et al. [37] proposed a shadow detection algorithm to be used within the
tracking system, SAKBOT (Cucchiara et al. [35]). HSV colour space is used, and
a shadow is defined as a point where the intensity is reduced, and the ratio of the
reduction lies in the range α to β (where β is less than 1 to avoid detecting points
that have been slightly altered by noise, and α is determined by the strength of
the light source to describe how dark shadows could be); the saturation is slightly
reduced and the hue is relatively unchanged.
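A per-pixel sketch of this test is shown below, using OpenCV's HSV conversion; the values of alpha, beta and the saturation and hue tolerances are illustrative assumptions, as the original work leaves them scene dependent.

import cv2
import numpy as np

def hsv_shadow_mask(frame_bgr, background_bgr, alpha=0.4, beta=0.9,
                    sat_tol=60, hue_tol=50):
    f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(float)
    b = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV).astype(float)
    ratio = f[..., 2] / (b[..., 2] + 1e-6)             # intensity (V) ratio
    hue_diff = np.abs(f[..., 0] - b[..., 0])
    hue_diff = np.minimum(hue_diff, 180.0 - hue_diff)   # hue is circular (0-179 in OpenCV)
    return ((ratio >= alpha) & (ratio <= beta)           # darker, but not too dark
            & (np.abs(f[..., 1] - b[..., 1]) <= sat_tol)  # saturation changes little
            & (hue_diff <= hue_tol))                      # hue relatively unchanged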
This approach (Cucchiara et al. [37]) was later extended (Cucchiara et al. [36]) to
a system capable of discriminating between moving objects and their shadows as
well as ‘ghosts’ (objects detected due to errors in the motion segmentation, they
do not correspond to any actual motion) and their shadows within an image. An
analysis of the optical flow of detected moving objects can be used to distinguish
between ghost objects and moving objects. Moving objects should exhibit a high
average optical flow, while ghosts should be close to 0, as they do not actually
represent any motion (and as such the motion in the region is either zero or
inconsistent).
Shastry and Ramakrishnan [151] modified the system proposed by Cucchiara et al.
[37] so that an improvement in speed was achieved by using track information. As
the shadow detection was being used to detect shadows associated with tracked
moving objects, it could be assumed that the shadows are moving at the same
speed as the objects. A shadow at point (x, y) at time t will be at (x+mx, y+my)
at time t + 1 where (mx, my) is the velocity of the moving object. This prediction
results in small errors in the location of shadows, and over an extended time those
errors can increase. To prevent errors in the shadow detection building up due to
incorrect prediction of shadow movements, the shadow mask is recomputed every
N frames.
2.2.5 Summary
Within a surveillance environment, foreground segmentation and motion detec-
tion techniques can be used to locate objects of interest. Such environments are
often quite complex, and may contain complex lighting and a changing back-
ground. As such the use of a technique that is able to cope with a multi-modal
background (Butler et al. [18], Kim et al. [97], Patwardhan et al. [138], Stauffer
and Grimson [157]) and can learn changes within the background is very impor-
tant. Depending on the environment in which systems are intended to operate,
additional techniques such as shadow detection, highlight detection and the abil-
ity to handle lighting changes are also important. Ideally, the ability to cope
with these issues should be part of any segmentation algorithm (Bowden and
Kaewtrakulpong [13], Hu et al. [75]), rather than additional processes that are
run afterwards.
Whilst optical flow can also be used to detect regions of motion, unless a robust
method is used (i.e. Black and Anandan [9]), errors caused by violations of the
assumptions of constant luminance and no spatial discontinuities are likely to
cause the results to be inaccurate, and potentially unusable as a sole mode of de-
tection. In a surveillance situation (particularly one where there is natural light,
or fluorescent light which causes a noticeable flicker in video footage), with poten-
tially high levels of unpredictable movement, these assumptions will be violated
regularly.
A lack of common data for evaluation makes the comparison of different tech-
niques difficult, and as such, no explicit comparison is presented. However, the
relative strengths and weaknesses have been discussed in general terms based on
the presented literature and their reported results.
2.3 Detecting and Tracking Objects
The process of object tracking can be approached in two ways:
1. In each frame, detect all objects and match these to the list of objects from
the last frame;
2. Detect an object once and extract one or more features to describe the
object, then follow the object using the extracted features.
The first approach is the most common approach to object tracking, and is ad-
dressed further in this section. Examples of the second approach are techniques
such as the mean shift algorithm (Fukunaga [54]) and its derivatives (such as
CAMShift, Continuous Adaptive Mean Shift, Bradski [14]), and particle filters
(which are discussed in Section 2.4). These systems often rely on external input
to initialise the tracking process, as a continuous detection process that allows
automatic discovery of new modes, requires detected modes to be matched to de-
termine which objects are already being tracked (i.e. the first approach described
above).
For the task of automated surveillance, it is desirable to be able to automatically
discover objects as they enter the scene, as relying on a human operator to flag
incoming objects on behalf of the tracking system defeats the original purpose of
the tracking system (i.e. to ease the burden on the human operators).
2.3.1 Object Detection
In order to track an object, it must be able to be reliably and consistently de-
tected, and features that can be observed and matched from frame to frame must
be extracted. These features can be simple distance and position based features,
or more complex colour and texture based features. The acquisition parameters
(colour or grey scale, image resolution, camera field of view) and environment
(indoor/outdoor, day/night) in which the system is intended to operate are likely
to play a large role in determining the type of features used.
Many systems use motion detection to detect objects for tracking. Haritaoglu
et al. [65] detects people by locating blobs of motion and computing the vertical
histogram of the silhouette (heads will lie at local maxima), and combining this
information with the results of convex hull-corner vertices to determine where
potential heads lie within the region. This approach can be used effectively to
segment groups of people. Fuentes and Velastin [53] performed motion detection
using luminance contrast and formed blobs, characterised by a bounding box,
centroid, width and height, to represent tracked people. A blob is a group of
regions that are connected according to spatial constraints. As such, a detected blob
may be formed by a single, large, connected (either 8- or 4-connected) region, or by
a cluster of several smaller connected components located close to one another.
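As a rough illustration of the vertical projection idea (a sketch only, not the implementation of Haritaoglu et al. [65]; the blob-area threshold and peak spacing below are illustrative), candidate head positions can be taken from local maxima of the column-wise foreground count:

    import numpy as np
    from scipy.ndimage import label
    from scipy.signal import argrelmax

    def candidate_heads(motion_mask, min_blob_area=200, peak_spacing=5):
        # Label connected foreground blobs in the binary motion mask.
        labelled, n_blobs = label(motion_mask > 0)
        heads = []
        for blob_id in range(1, n_blobs + 1):
            blob = labelled == blob_id
            if blob.sum() < min_blob_area:
                continue
            # Vertical projection histogram: foreground pixel count per column.
            projection = blob.sum(axis=0)
            # Columns at local maxima of the projection are head candidates.
            for x in argrelmax(projection, order=peak_spacing)[0]:
                top_y = np.nonzero(blob[:, x])[0].min()
                heads.append((int(x), int(top_y)))
        return heads

The returned (column, top row) pairs would then be refined by the convex hull analysis described above.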
Rather than simply tracking blobs, Zhao and Nevatia [184] proposed a system
that used an ellipsoid shape model to locate and segment people from the motion
image. The system was set up such that the camera was deployed a few metres
above the ground looking down, to help overcome the occlusion problems that
occur with ground level cameras. People are detected via an iterative process.
The following two steps are repeated until no more people are detected in the
scene.
1. Locate all heads and fit an ellipsoid person model at the head. If there is
sufficient motion within the ellipse, the person is accepted and their motion
is removed.
2. Perform geometric shadow analysis to remove shadow regions belonging to
the detected people (using the date, location and time of day to determine the
sun's position, and the orientation of any shadows).
Kang et al. [89] also applied head detection after background segmentation to
locate people, who are characterised by bounding boxes. In addition, results
from the previous frame’s head detection are used in the next frame through a
feedback loop to aid the process.
Similar motion based approaches have been used to detect vehicles. Koller et al.
[99] used an adaptive background model to locate vehicles on the road. The shape
of the object is first derived from both the gradient image and the motion mask.
The shape is expressed as a convex polygon that encloses the object, smoothed
with the application of cubic spline parameters to the points.
A problem that can occur when using motion segmentation results as the basis
for object detection is that spurious motion can be included as part of a detected
object. To help overcome this, Lei and Xu [106] proposed a tracking system that
uses a variant of the brightness distortion metric [73] to remove shadows and
highlights from the motion image. Shadow/highlight removal is performed at
two threshold levels (tight and loose). Blobs that are unconnected in the tight
threshold output are grouped according to connectivity in the loose threshold
output. This allows the blob grouping from the loose thresholds to be retained
whilst ensuring that more of the spurious motion is removed.
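A minimal sketch of this two-threshold idea (an interpretation of the description above, not the authors' code) is given below: connectivity is taken from the loose mask, but a loose region is only retained if it contains pixels that also survive the tight threshold.

    import numpy as np
    from scipy.ndimage import label

    def group_blobs(tight_mask, loose_mask):
        loose_labels, n_regions = label(loose_mask > 0)
        grouped = np.zeros_like(loose_labels)
        next_id = 1
        for region_id in range(1, n_regions + 1):
            region = loose_labels == region_id
            # Loose regions with no support in the tight mask are treated as
            # spurious motion (e.g. shadow/highlight residue) and discarded.
            if np.any(tight_mask[region]):
                grouped[region] = next_id
                next_id += 1
        return grouped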
Tracking systems that use colour images as input such as Matsumura et al. [121]
and Wang et al. [167] use skin detection to locate people within the image. These
skin regions are tracked, and a template is constructed from the extracted skin
segment. This skin segment template can then be used to match candidates
in future frames. Wang et al. [167] and Matsumura et al. [121] applied motion
detection first to simplify the skin detection by removing regions of no interest.
Matsumura et al. [121] searched for skin colour in frames, and tracked large skin
regions. These systems, however, rely on being able to detect the face or other
skin regions to track and this may not always be possible.
Other systems have used a learned model approach rather than relying on motion
detection. Motion based approaches are ultimately reliant on the performance
of the motion detection, with poor motion detection likely to lead to poor object
detection. A model based approach can overcome this limitation, but relies on a
suitable model (or models) being trained.
Rigoll et al. [145] used Pseudo 2D HMMs (P2DHMM) to track people. This
avoided using motion detection, and so avoided problems such as cars/other
people/trees causing additional movement that results in false objects being de-
tected in the scene or valid tracks being lost. A P2DHMM trained on over 600
people was used to locate people in a frame. The P2DHMM uses the tracked
object's centroid, velocity, and bounding box height and width as inputs into a
Kalman filter [172]. The Kalman filter feeds the next prediction back to the HMM to aid
in the tracking, and as tracking progresses the HMM adapts its model to better
fit the object that it is currently tracking. The major benefit of a system such as
this is that there is no motion detection, meaning the camera can zoom and pan
without resulting in any additional complexity being added to the system.
Nguyen et al. [131] also made use of Markov models for recognising the actions of
people within a tracking system. An Abstract Hidden Markov Model (AHMM)
was developed which replaces the Markov chains of the standard HMM with
Markov policies. Policies could be defined in a hierarchy such that higher level
policies could be built from simple low-level policies. Like a HMM, the behaviours
were learned off-line by observing training data. Kato et al. [91] applied HMMs
to vehicle tracking and traffic monitoring.
Seitner and Lovell [150] used a Viola-Jones [165] detector to detect people and
subsequently track them in a scene. Yang et al. [177] and Okuma et al. [135] have
also used the Viola-Jones detector [165] with particle filters to track people.
Despite the clear advantage of being able to cope with lighting changes and
camera noise that would have a severe negative impact on methods that rely on
the analysis of image masks, learned model approaches do have their drawbacks.
Within a surveillance environment, it can be expected that the objects of interest
(i.e. people or vehicles) will be viewed from various angles (i.e. front on, side
on, and anywhere in between), and any detection method will need to be view
invariant. This invariance can be achieved in two ways:
1. Train the model to recognise the object from any angle
2. Train several models for the different view angles
Given the variation observed when viewing an object such as a person from
all angles, any single model that is trained to detect a person at any angle is
likely to be too general and perform poorly (particularly in a real world situation
with a complex background). Whilst training several models solves this problem,
training several models to detect each object class is very demanding and not
ideal.
2.3.2 Matching and Tracking Objects
Once objects have been detected in a frame, it is necessary to match the objects
that have been detected to those that were detected in the previous frame. Zhao
and Nevatia [184] matches detected objects iteratively. Tracks are matched one
by one (each track compared to all located objects) in order of their depth in
the scene (determined by position in the frame and camera calibration). Fuentes
and Velastin [52] matches detected and tracked objects using a two way matrix
matching algorithm (matching forwards and reverse). In order to perform this
matching however, some form of feature (or features) needs to be extracted for
comparison.
The features used to match tracks vary greatly, and the type of feature used is
partially dependent on the system requirements and type of input the system is
receiving (i.e. grey scale or colour images). The types of features used can be
broadly grouped as follows:
1. Geometric features (i.e. object position, bounding box position/size [106,
121]) - can be extracted directly from the object detection results with no
further image processing. Reliability of the features is directly dependent
on the object detection performance, however the features are very quick to
extract and compare. Due to their simplicity, geometric features are often
used in combination with more complex features [106, 121].
2. Edge features (i.e. silhouettes [63, 66]) - can be extracted from a motion
mask, or similar mask image (a mask could possibly be extracted using
colour segmentation techniques). Performance of edge features is depen-
dent on the accuracy of the mask; segmentation errors will lead to poor
performance.
3. Colour/Texture features (i.e. histograms [113, 129], appearance models
[25, 63, 74, 89, 144]) - can be extracted using a combination of the object
detection results, the input images and any mask images. Such features are
more robust to object detection and segmentation errors, but the features
are more computationally demanding.
Tracking systems may use multiple features to track objects [40, 63, 66, 178].
Different features may be used at different times, or depending on the state of
the tracked object. For example, when the object has been observed for several
frames in succession and there is little complexity in the scene, simple geometric
features may be sufficient to match the object. If the object has not been observed
for several frames then position and size estimates are less reliable, and so colour
or texture features may be more appropriate.
Colour feature approaches have focused on using histograms, as they are simple
to compute and compare. The colour histogram is relatively unaffected by pose
change or motion, and so is also a reliable metric for matching after occlusion. His-
tograms are matched by calculating the histogram intersection. Lu and Tan [113]
uses motion detection to find people and characterises them using their bounding
rectangle, size and a colour histogram. Ng and Ranganath [129] proposed an im-
provement to the colour histogram model such that the histogram uses variable
bin widths, resulting in comparable performance with a five-component GMM.
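For reference, histogram intersection of two normalised colour histograms can be sketched as follows (the bin count and colour space are illustrative choices, not those of the cited systems):

    import numpy as np

    def colour_histogram(patch, bins=8):
        # Normalised joint RGB histogram of an image patch (H x W x 3, uint8).
        hist, _ = np.histogramdd(patch.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 256),) * 3)
        return hist / max(hist.sum(), 1)

    def histogram_intersection(h1, h2):
        # Similarity in [0, 1]; 1 indicates identical colour distributions.
        return float(np.minimum(h1, h2).sum())

A detection is then matched to the track whose stored histogram gives the highest intersection score.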
One limitation of histograms is that they do not contain any position information.
Two objects that have very similar colour histograms may have dramatically
different appearances due to the distribution of the colours. For example one
person may be wearing a white shirt and black pants whilst a second is wearing
a black shirt and white pants. Whilst these people may have quite distinct
appearances, they would have very similar histograms. To overcome this fault,
Hu et al. [74] extracted three histograms from each person, one each for the
head, torso and legs, to not only allow for matching based on colour, but also
on distribution of colour. An ellipsoid shape model is used to characterise the
person, and tracking is performed using a condensation filter [77].
Chien et al. [25] proposed a colour model (Human Colour Structure Descriptor -
HCSD) that aims to capture the distribution of colours in a human body. Three
colours are used to represent the colour of the body, legs and shoes, and positions
are defined to describe the position of body and legs relative to the shoes. The
model is generated by first extracting the silhouette for the object; then extracting
skeletons for the shoes, legs and body; from which colours and positions are
obtained.
Kang et al. [89] makes use of colour clusters to aid in tracking. Motion estimation
is the primary method of people detection and tracking, and is used in the case of
tracking a single person (or the occluder in a total occlusion). Colour is used for
partial occlusions, or when an object re-enters. Once a person is detected, colour
clusters are obtained by calculating the colour histogram, calculating means for
each bin, and then merging similar bins. About three clusters per person are
obtained, representing the three major colours belonging to that person. Each
cluster has a weight that is a function of its size, duration, frequency and the
existence of other nearby objects (nearby objects cause inaccuracies as they may
contribute part of their colour to the object). This colour model can be used to
match people after occlusions, or after they have left and re-entered the scene.
The colour correlogram (Rao et al. [144]) is a variant of the colour histogram,
where geometric information is encoded as well as colour information according
to predefined geometric configurations. Zhao and Tao [185] proposed a simplified
colour correlogram [144] for use in tracking systems. As the original correlogram
[144] is too processor intensive for real time tracking, a simplified version which
only considers pixels lying on the major and auxiliary (perpendicular to the ma-
jor) axis of the object is proposed. This is much simpler to compute and still able
to deal with rotational variations. To track using the simplified correlogram, a
modified mean shift algorithm is proposed that is capable of determining rotation
changes within the search rather than separately.
Bourezak and Bilodeau [12] also make use of correlograms to track objects of
interest. Bourezak and Bilodeau [12] proposed a system that used histograms
to perform background segmentation. The image is divided into a series of sub-
regions, and a reference histogram is computed for each. This reference histogram
is then compared with the histogram for the incoming frame to determine if the
region is in motion. This can be performed iteratively to refine the detection.
Using normalised histograms and a coarse to fine approach means that small
amounts of noise can be ignored and the system is invariant to changing lighting
conditions. Correlograms and histograms are then used to track the objects,
capturing both colour and texture information.
To improve reliability and tracking performance, systems such as Darrell et al.
[40] and Yang et al. [178] use multiple modalities to track people. Darrell et al.
[39, 40] combined the use of stereo, colour and face detection to track people.
Models are integrated according to their strengths and weaknesses, and reliability
of each mode. Face detection results are given greater precedence, and the other
two modalities are used to update when the face is not available. To detect
people after leaving and re-entering the scene (long term tracking), Darrell et al.
[40] uses visual clues such as height, skin colour, hair colour and face pattern.
These can all be used in the short term (up to a couple of hours), however, for
tracking of over a day, compensation for lighting changes is needed. This can be
done by mean shifting all colours and excluding any clothing colour information
from matching.
Yang et al. [178] combined motion, depth and colour, and merged these by treat-
ing the observations from each module as Gaussian distributions. Features are all
tracked separately and all have individual Kalman filters for tracking. Features
are fused late in the process and all tracking parameters have an ‘uncertainty’
value associated with them. The depth module performs SAD (sum of absolute
differences) matching on regions of interest, and computes U = D_s/D_s' for all
regions (where D_s is the minimum value from the depth module, D_s' is the second
smallest, and U is the uncertainty). If
U and D_s satisfy thresholds, then a right-left consistency check is performed, the
resulting depth map is smoothed, and holes within are filled. The objects result-
ing from this detection are translated into an overhead view and the uncertainty
for each candidate is,
Udepth = a × |p - W| / W + c, (2.9)
where a and c are controlling constants, p is the width of the detected object, W
is the expected width of a person and Udepth is the uncertainty for the detected
object.
Skin colour is used for colour tracking. A locus model is used to group skin
regions into face shapes. The system tries to detect lips to distinguish the face
from other skin regions. The uncertainty for this modality is defined as,
Ucolour = a × |s - 1.5| / 1.5 + b × (t - 0.1) / 0.1 + c, (2.10)
where a, b and c are constants, s is the aspect ratio of the face bounding box and
t is the ratio of the lip colour.
The motion module uses a temporal subtraction technique to locate blobs and
nearby regions are merged to locate candidates. The uncertainty is,
Umotion = a × |s - 1.5| / 1.5 + c, (2.11)
where a and c are constants and s is the aspect ratio of the bounding box.
To merge candidates from each module, the candidate with the lowest uncertainty
is taken and integrated with candidates from other modules that are within a de-
fined distance threshold. Candidates are combined as though they are Gaussian
distributions, and the uncertainty represents the standard deviation. Final can-
didates are tracked by a Kalman filter.
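The thesis does not reproduce the exact fusion arithmetic of Yang et al. [178]; the sketch below simply illustrates the stated idea of treating each candidate as a Gaussian whose standard deviation is the module uncertainty, which leads to an inverse-variance weighted combination:

    import numpy as np

    def fuse_candidates(positions, uncertainties):
        # positions: (k, 2) candidate positions from the different modules.
        # uncertainties: (k,) per-module uncertainty, used as a standard deviation.
        positions = np.asarray(positions, dtype=float)
        precisions = 1.0 / np.square(np.asarray(uncertainties, dtype=float))
        fused = (positions * precisions[:, None]).sum(axis=0) / precisions.sum()
        fused_sigma = np.sqrt(1.0 / precisions.sum())
        return fused, fused_sigma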
W4 [63, 66] uses a combination of silhouette matching and an appearance model to
track objects. Frame to frame tracking is achieved using silhouettes. Silhouettes
are compared by matching over a 5 × 3 window and performing a binary edge
correlation (typically dominated by the head and torso as these are slower moving
than the legs). Whilst this approach is suitable for frame to frame tracking, if
a person is lost for more than a couple of frames the differences between the
silhouettes may be too great to match a person to their current silhouette.
To match people who have been occluded, or who have left and re-entered the
scene, Haritaoglu et al. [63][66] propose an appearance model where data per-
taining to the texture and position of the subject is recorded, and can be used to
determine the identity of a person who has just ceased to be occluded,
Ψ_t(x, y) = (I(x, y) + w_{t-1}(x, y) × Ψ_{t-1}(x, y)) / (w_{t-1}(x, y) + 1), (2.12)
where Ψ is the texture model, x, y is the pixel being updated, I is the input image,
and w is an occupancy map describing how many times the pixel x, y has been
classified as foreground in the last N frames. The texture model can be matched
to a new object to determine if the re-entering object is the same person,
C(p, r) = [ Σ_{(x,y)∈S_p} |S_p^t(x, y) - Ψ_r^t(x, y)| × w_r^t(x, y) ] / Σ w_r^t(x, y), (2.13)
where p is the person who has been tracked and r is the person who has dis-
appeared for a period of time. The tracked person's grey scale silhouette (S_p^t) is
compared to person r’s texture model to determine if they are the same person.
This model allowed people to be re-detected if they had been lost for several
frames due to occlusion, or had left and re-entered the scene.
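A compact sketch of Equations 2.12 and 2.13 is given below (maintenance of the occupancy map w over the last N frames is omitted, and variable names are illustrative):

    import numpy as np

    def update_texture_model(model, occupancy, frame_grey, fg_mask):
        # Equation 2.12: running average of the grey level at each foreground pixel,
        # weighted by how often the pixel has recently been classified as foreground.
        updated = model.copy()
        w = occupancy[fg_mask].astype(float)
        updated[fg_mask] = (frame_grey[fg_mask] + w * model[fg_mask]) / (w + 1.0)
        return updated

    def texture_match_cost(silhouette_grey, fg_mask, model, occupancy):
        # Equation 2.13: occupancy-weighted mean absolute difference between the
        # tracked person's grey scale silhouette and the stored texture model.
        w = occupancy[fg_mask].astype(float)
        diff = np.abs(silhouette_grey[fg_mask].astype(float) - model[fg_mask]) * w
        return diff.sum() / max(w.sum(), 1e-6)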
Matsumura et al. [121] makes use of position, speed, a template image and an
object state to track the regions. The tracked regions' positions are estimated
using a Kalman filter, and the object state (which could be one of absence,
emergence, tracked, lapped or lost) allows the system to know how to (or whether to)
update the model.
Siebel and Maybank [152] combined a region tracker, head detection and active
shape tracker (AST) to obtain more robust tracking, where each component is
able to make use of others to improve results. Fusion is achieved by allowing
trackers to use each other’s output as well as historical output. The region tracker
makes use of tracking status and a history database to track over time. If the
region tracker is unsure about a track or cannot detect an object, it uses the active
shape tracker. The AST is also used to split large regions. The head detector
uses regions from the region tracker, and its head positions are in turn used by
the AST. The AST uses the other modules to initialise tracks. The combined
results of the modules are refined and filtered. The system checks for the same
object in multiple trackers and then selects the best track for use, discarding the
others. However, multiple tracks of one object are kept if they are considered
possibly valid, although only the best tracks appear in the output.
Lei and Xu [106] tracks objects using several simple features (position, shape and
colour based), by comparing detected object features to those of the track and
dividing by the variance to obtain a cost for the match. Spurious objects can be
detected and removed by observing the variance of the position and velocity of
the tracked object, to determine if it represents an actual moving object.
2.3.3 Handling Occlusions
In real world tracking situations, occlusions are inevitable. Within a tracking
system, there are two main types of occlusion:
• Object and Environment - a tracked object is obscured by a fixed item in
the environment, such as moving behind a pillar or tree.
• Object and Object - one tracked object obscures another.
It is important for a tracking system to be able to handle occlusions, and resume
tracking after an occlusion has passed. In order to do this, it can be advantageous
to detect or anticipate occlusion events. Lu and Tan [113] anticipates occlusion
events by using the minimum bounding rectangle for the object being tracked, to
determine when two objects are likely to intersect and one will occlude the other.
Koller et al. [99] estimates the depth position of the tracked objects. The depth
order is used in an explicit occlusion reasoning module that sorts objects based
on their vertical position, and combines this with expected positions (determined
by a Kalman filter) to determine occlusions.
Rad and Jamzad [142] explicitly deals with occlusions by predicting and detecting
when they occur using three criteria:
1. By examining the trajectory of each vehicle, if the centre points of the
tracked regions will be too close to each other, occlusion is predicted.
2. If the size of a region exceeds a given threshold, then it is assumed that
the region includes more than one object.
3. If the size of a region changes by a significant amount between frames then
it can be assumed that an occlusion-related merge or split has occurred.
If an occlusion is detected, the region is split by examining the bounding contour
of the motion mask. The region is split through the contour point farthest away
from the minimum bounding rectangle.
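These three criteria can be expressed as a simple predicate; the thresholds below are purely illustrative and would need tuning for a given scene:

    import numpy as np

    def occlusion_suspected(track_centres, region_area, prev_region_area,
                            min_separation=30.0, max_single_area=5000,
                            rel_change=0.5):
        centres = np.asarray(track_centres, dtype=float)
        # 1. Predicted centres of two tracked regions come too close together.
        for i in range(len(centres)):
            for j in range(i + 1, len(centres)):
                if np.linalg.norm(centres[i] - centres[j]) < min_separation:
                    return True
        # 2. A region is larger than a single object is expected to be.
        if region_area > max_single_area:
            return True
        # 3. A region's size changes abruptly between frames (merge or split).
        if prev_region_area > 0 and \
                abs(region_area - prev_region_area) / prev_region_area > rel_change:
            return True
        return False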
Another approach to overcoming occlusion problems is to use depth information.
The use of depth information allows the order of objects (i.e. closest to the cam-
era) in the scene to be determined, and allows occluded objects to be segmented
provided they are at sufficiently distinct depths. Whilst depth ordering can be
approximated by using the pixel coordinates of the object region that is touch-
ing the ground (in most systems where the camera is mounted such that objects
appear vertical in the image, this is the bottom edge of the bounding box), this
method relies on correct segmentation at the base of the object, and thus is prone
to inaccuracies. Haritaoglu et al. [64] applied object detection to both intensity
and disparity images, with the stereo modality proving most useful when there is
a sudden change in the illumination, there are shadows, or regions in the intensity
image split (less likely to split in disparity).
Harville and Li [69] proposed a system which tracks people within a plan view
(a view from above the scene, looking down). The system's input is ‘colour with
depth’: three channels of colour and one channel of depth. Height and occupancy
maps are generated in the plan view, showing candidate heights and the amount
of motion within the candidate region. By using a plan view for the tracking,
the system overcomes many of the occlusion problems found in other systems.
Beymer [7] also uses a plan view, sourced from a stereo camera mounted such that
it looked directly down to the ground, to count people as they enter a doorway.
Feature points can also be used to help overcome occlusions. For a given object, it
is likely that several feature points can be extracted. As the object moves, it can
be expected that some feature points will not be visible at times, either due to self
occlusion or occlusion with other objects, but some feature points are likely to
be always visible. Coifman et al. [26] proposed a system that used feature points
to track vehicles. As it is likely that each vehicle will have multiple feature points,
the use of these for tracking helps to mitigate the problems with occlusions as it
is likely that even in the event of an occlusion, one of the feature points will be
visible. No motion detection or background detection is used; corners are drawn
from a pre-defined detection area on the video input where vehicles are expected
to enter the scene. If these points then move, they are considered to be part of
a moving vehicle. Points that satisfy a common motion constraint are grouped,
and the system then considers each group representative of a single vehicle.
Tang and Tao [158] proposed a dynamic feature graph to represent tracked ob-
jects. Objects are modelled as a group of invariant features (SIFT [111]) and
their relationship is encoded into an attributed relational graph. This can overcome
problems associated with other colour models such as histograms as it models the
structure and distribution of the object features. Feature relations are defined
by three items; the Euclidean distance between the features, the scale difference
and the orientation difference. As relative measures are used in determining the
relations, the model is invariant to rotations and translations. Features are added
to and removed from the model after they have been observed or absent for a
period of frames. Relaxation labelling is used to match the graphs.
Bunyak et al. [16] proposed a novel tracking approach based on a graph structure,
where nodes represent detected objects in consecutive frames, and edges represent
the confidence of a match between the nodes (objects). Over time, the graph
can be pruned, to eliminate the false trajectories as more information becomes
available. Appearance similarity is computed using colour features, and location
similarity is computed using centroids. The similarity measures are combined
to obtain a similarity confidence for the match. A separation confidence is also
obtained (which describes how distinct the match is) which is combined with the
similarity confidence using a weighted sum. The objects are filtered and the graph
is pruned at a variety of stages (object detection, similarity matching, evaluating
confidences, and by eliminating short segments that start or end unexpectedly).
Source and sink areas where occlusions may arise are identified in advance. A
Kalman filter is used to determine possible future positions of an occluded object,
which can be matched when the object reappears. An approach such as this allows
for tracking errors made as a result of occlusions or false/missed detections to be
corrected as more evidence is gathered.
2.3.4 Alternative Approaches to Tracking
Systems that rely on visual cameras may suffer from problems relating to lighting
and weather conditions that can be avoided by using thermal sensors. Latecki
et al. [104] proposed a method adapted for detection and tracking in infrared
videos. A spatio-temporal representation was used, to provide a more robust
method of motion detection to counter the increased noise present in IR imagery
compared to visual.
Rather than using only colour or grey scale cameras within the visible spec-
trum, Han and Bhanu [61] proposed a system that uses a combination of thermal
infrared and colour sensors to detect human movement. Two approximately iden-
tical images are obtained, one from a colour camera and one from an IR camera,
and these are registered. The data from the two images can be fused to provide
more accurate human locations and better performance in adverse conditions.
O’Conaire et al. [31, 32, 133] experimented with fusion for object segmentation,
background modelling and tracking using colour and thermal infrared images.
Fusion for tracking is done in the appearance model by using a multi-dimensional
Gaussian to represent each pixel. The scores from the visible and thermal spec-
tra in the appearance model are fused in different ways to match the model to
the incoming image. The different methods of combining scores are compared to
ascertain the best method for this form of fusion. Blum and Liu [11] proposed
different methods of early image fusion using the wavelet transform and the pyra-
mid transform. These early fusion methods can be used to fuse the images before
they are fed into a tracking system, allowing a conventional single mode tracking
algorithm to be used. Han and Bhanu [62] proposed techniques for using
colour and infrared images for moving human silhouette extraction, as well as
using these silhouettes for automatic image registration between the
infrared and colour images.
Optical flow is another method commonly used to track objects [117, 134, 136,
162, 175, 179], and is often used as an alternative to motion detection for locating
moving objects in a scene. Whilst optical flow is more prone to noise than motion
detection, it does offer additional information in the form of the direction of
movement. This can be used to aid the prediction of future positions and segment
occlusions between objects moving in different directions.
Lucena et al. [117] uses the Lucas and Kanade algorithm to track objects using
particle filters. The probabilistic approach of the particle tracker works well for
optical flow, as it can naturally handle the incomplete or imprecise data that the
optical flow estimations provide. Lucena et al. [115, 116] propose an observation
model to track contours using optical flow within the condensation framework [77].
The discontinuities in flow between the inside and the outside of the tracked object's
contour are used to determine the accuracy of the match from
the model to the input image. The area inside should have an optical flow close
to that predicted by the model, while the area outside should be significantly
different.
Optical flow algorithms perform best around clearly defined features, and areas
that contain sparse levels of detail are often handled poorly. To counter this,
Yamane et al. [175] proposed a method using optical flow and uniform brightness
regions (a section where the optical flow cannot be detected due to a lack of
texture) to track people. Optical flow is used to detect general areas of motion,
and areas of uniform brightness are found and tracked within the object's bounding
box.
Other tracking approaches that utilise optical flow include Oshima et al. [136], who
proposed using the mean shift algorithm [29, 30] with optical flow [72] and a near in-
frared camera to track people in low contrast surveillance imagery; and Yokoyama
and Poggio [179], who proposed using optical flow in conjunction with a Canny
edge detector [20] to extract and track contours. Okada et al. [134] uses optical
flow and depth information for tracking, whilst Tsutsui et al. [162] applied optical
flow in a multiple camera system.
Optical flow can also be utilised to recognise actions in a sporting context [45]
and to detect abnormalities in crowd motion [3].
Tracking techniques have also been applied to detecting and tracking individual
body parts. This allows the movement of individual limbs to be monitored,
facilitating gesture detection. Wren et al. [173] developed a system (Pfinder)
where the person is modelled as a series of ‘blobs’, with each blob corresponding to
a major body part. This method allowed gesture recognition to be performed by
analysing the movement of the blobs. The system relied on a constant background
(due to a simple background detection method) and struggled when multiple
people entered the scene. Ramanan and Forsyth [143] proposed a similar method
where the human body is modelled as a 2D puppet, consisting of 9 rectangles
representing body parts (i.e. arms, legs, torso, head etc.). Through the use of
kinematic constraints, the parts can be joined together to model the person(s)
being tracked.
2.3.5 Summary
The problem of object tracking can be split into two main tasks, detection and
matching. There are two main approaches to detection, analysis of a mask image
(such as a motion image) to locate objects of interest (Fuentes and Velastin
[53], Haritaoglu et al. [65], Zhao and Nevatia [184]), or the use of learned models
(Rigoll et al. [145], Seitner and Lovell [150]). Given the difficulties using a learned
model approach to detection (training suitable model(s) that are able to cope
with the wide variations in viewing angle), the analysis of mask images for object
detection has been more widely used.
To match objects, a wide variety of features are used from simple position and
geometric based features (Haritaoglu et al. [66], Matsumura et al. [121]), to his-
togram based colour models (Hu et al. [74], Lu and Tan [113]) and more complex
appearance models (Chien et al. [25], Haritaoglu et al. [66]). Appearance models
that can encode position and colour information are ideal, as these prove more ro-
bust and discriminative than histograms alone. Features can also be used to help
resolve occlusions by checking identity once the occlusion has passed. The ideal
choice of features for a system is not clear, and to a large extent it depends on
the application. However, using multiple features (Darrell et al. [40], Siebel and
Maybank [152], Yang et al. [178]) provides greater protection against switching
the identities of tracks and recovery from occlusion. The impact of occlusions can
also be lessened by anticipating occlusions (Lu and Tan [113], Rad and Jamzad
[142]), or modelling objects in such a way that they can be tracked through
occlusions (i.e. through the use of feature points [26]).
A lack of common evaluation data makes direct comparison of individual tech-
niques difficult, as the vast majority of evaluation is performed on privately cap-
tured datasets, with any performance metrics used varying from author to author.
For this reason, no comparison of performance is given.
2.4 Prediction Methods
An important part of a tracking system is the ability to predict where an object
will be in the next frame. This is needed to aid in matching the tracks to detected
objects, and to predict position during occlusions. There are three common
approaches to predicting an object's position:
1. Motion Models.
2. Kalman Filters.
3. Particle Filters.
Motion models and Kalman filters will be discussed in brief. A more detailed
discussion on particle filtering will be presented, as these techniques have become
the method of choice for tracking systems.
2.4.1 Motion Models
Motion models are a simple type of predictor and are common in basic tracking
systems. Motion models aim to predict the next position based on a number of
past observations. They may or may not make use of acceleration, and can be
expressed as,
p(t+ 1) = p(t) + v(t), (2.14)
where p(t+ 1) is the expected position at the next time step, p(t) is the position
at the current time step, and v(t) is the velocity at the current time step.
For the simplest implementation,
v(t) = p(t)− p(t− 1). (2.15)
Other implementations use the history of the object to determine its velocity,
such that,
v(t) = (p(t) - p(t - N)) / N, (2.16)
where N is the size of the history being used. Using a smaller history (or none)
means that the model can react faster to changes in direction by the tracked
object. However, it also makes the model more sensitive to errors in the object’s
position (caused by segmentation or detection faults) which can result in the poor
prediction of future positions.
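A minimal sketch of Equations 2.14 to 2.16 follows (assuming positions are stored with the most recent last; the history length is illustrative):

    import numpy as np

    def predict_next_position(history, N=3):
        # history: list of past positions (e.g. centroids), most recent last.
        history = np.asarray(history, dtype=float)
        p_t = history[-1]
        N = min(N, len(history) - 1)
        if N < 1:
            return p_t                         # no history yet: assume stationary
        v_t = (p_t - history[-1 - N]) / N      # velocity averaged over N frames
        return p_t + v_t                       # Equation 2.14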
2.4.2 Kalman Filters
The Kalman filter (Kalman [86]) is a linear predictive filter, and can be used to
predict the state of a system in the presence of noise. The filter estimates the
process state at the next time step, and uses the measurement at that time
step as feedback. Equations can be split into time update equations (predict
the next state of the process) and measurement update equations (incorporate
the new information into the system, to improve future estimations). A detailed
explanation of the equations and the tuning of parameters is provided in [172].
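For concreteness, a minimal constant-velocity Kalman filter for 2D position tracking is sketched below; the noise parameters are illustrative and would be tuned as described in Welch and Bishop [172]:

    import numpy as np

    class ConstantVelocityKalman:
        # State is (x, y, vx, vy); only the position (x, y) is measured.
        def __init__(self, q=1e-2, r=1.0):
            self.F = np.array([[1, 0, 1, 0],
                               [0, 1, 0, 1],
                               [0, 0, 1, 0],
                               [0, 0, 0, 1]], dtype=float)   # state transition
            self.H = np.array([[1, 0, 0, 0],
                               [0, 1, 0, 0]], dtype=float)   # measurement matrix
            self.Q = q * np.eye(4)                            # process noise
            self.R = r * np.eye(2)                            # measurement noise
            self.x = np.zeros(4)
            self.P = np.eye(4)

        def predict(self):
            # Time update: project the state and covariance forward one frame.
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:2]

        def update(self, z):
            # Measurement update: correct the prediction with the observed position.
            y = np.asarray(z, dtype=float) - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P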
Kalman filters are, however, limited by their inability to effectively handle non-
Gaussian distributions, and are constrained by requiring the process being es-
timated, and the measurements’ relationship to the process, to be linear. Ex-
tensions have been proposed to the Kalman filter such as the Extended Kalman
Filter (EKF) and the Unscented Kalman Filter (UKF) (Julier and Uhlman [85])
to try and overcome this limitation.
The EKF allows non-linear relationships by linearising about the current mean
and covariance. The EKF can break down when faced with highly non-linear
models, which cannot be adequately approximated (see Welch and Bishop [172]
for more information). The UKF [85] addresses some of the approximation issues
of the EKF by using the actual nonlinear models rather than approximations.
The UKF uses a set of deterministically chosen sample points (selected from
around the mean), which are propagated through the nonlinear system to obtain
the mean and covariance for the posterior distribution (see Julier and Uhlman
[85] for more information).
2.4.3 Particle Filters
Particle filters [120] are a sequential Monte Carlo method based on a particle
representation of probability densities. Particle filters have an advantage over
Kalman filters in that they can model any multi-modal distribution, whereas
Kalman filters are constrained by the assumptions that state and sensory models
are linear, and that noise and posterior distributions are Gaussian.
Sequential Monte Carlo methods using sequential importance sampling were ini-
tially proposed in the 1950s (Hammersley and Morton [60], Rosenbluth and
Rosenbluth [146]) for use in physics and statistics. Particle filters use a set of
samples (particles) to approximate the posterior PDF. Like a Kalman filter, the
process contains two major steps each time step, prediction and update.
The state of the filter at time t is represented by xt, and its history is Xt =
(x1, x2, .., xt). The observation at time t is zt, and its history is Zt = (z1, z2, .., zt).
It is assumed that the object dynamics form a temporal first-order Markov chain,
so that the next state depends only on the immediately previous state,
p(xt|Xt−1) = p(xt|xt−1). (2.17)
Observations are considered to be independent (mutually and with respect to the
process). The observation process is defined by specifying the conditional density,
p(zt|xt) at each time, t,
p(Zt|Xt) = ∏(i=1..t) p(zi|xi). (2.18)
The conditional state density at time t is defined as,
pt(xt) = p(xt|Zt). (2.19)
State density is propagated over time according to the rule,
p(xt|Zt) = ktp(zt|xt)p(xt|Zt−1), (2.20)
where,
p(xt|Zt-1) = ∫ p(xt|xt-1) p(xt-1|Zt-1) dxt-1, (2.21)
and kt is a normalisation constant that does not depend on xt.
In a computational environment, we approximate the posterior, p(xt|Zt), as a set
of N samples, {s_t^i}, i = 1..N, where each sample has an importance weight, w_t^i. When
the filter is initialised, this distribution is drawn from the prior density, p(x). At
each time step, the distribution is re-sampled to generate an un-weighted particle
set. Re-sampling is done according to the importance weights. The generic
particle filter algorithm is outlined below, and illustrated in Figure 2.4:
Initialisation: at t = 0
• For i = 1..N, select samples s_0^i from the prior distribution p(x_0)
Iterate: for t = 1, 2..
1. Importance Sampling
(a) Predict each sample's next position,
s_t^i ∼ p(x_t|x_{t-1} = s_{t-1}^i). (2.22)
This prediction process is governed by the needs of the individual
system, but typically involves adjusting the position according to pre-
defined system dynamics and adding noise.
(b) Evaluate the importance weights based on the measured features, z_t,
w_t^i = p(z_t|x_t = s_t^i). (2.23)
(c) Normalise the importance weights,
w_t^i = w_t^i / Σ_{j=1}^{N} w_t^j. (2.24)
2. Re-sampling
(a) Resample from s_t^i so that samples with a high weight, w_t^i, are em-
phasised (sampled multiple times) and samples with a low weight are
suppressed (re-sampled few times, if at all).
(b) Set w_t^i to 1/N for i = 1..N.
(c) The resultant sample set can be used to approximate the posterior
distribution.
The re-sampling step (2(a)) is required to avoid degeneracy in the algorithm
(see Kong et al. [100] for more details), by ensuring that all the weight does not
become contained within a single particle. Sequential Importance Re-sampling
(SIR) is a commonly used re-sampling scheme that uses the following process to
select a new sample. The process is applied for i = 1..N; a sketch of the full predict, weight and re-sample loop is given after these steps.
1. Generate a random number, r ∈ [0..1].
2. Find the smallest j which satisfies Σ_{k=1}^{j} w_t^k ≥ r, j ∈ [1..N].
3. Set s'_t^i = s_{t-1}^j.
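A minimal sketch of one predict, weight and re-sample iteration is given below; the random-walk dynamics and Gaussian likelihood used in the usage example are placeholders for whatever dynamics and observation model a real tracker would supply:

    import numpy as np

    def particle_filter_step(samples, dynamics, likelihood, rng):
        # samples: (N, d) array of particles from the previous time step.
        N = len(samples)
        samples = dynamics(samples, rng)              # step 1(a): prediction
        weights = likelihood(samples)                 # step 1(b): importance weights
        weights = weights / weights.sum()             # step 1(c): normalisation
        idx = rng.choice(N, size=N, p=weights)        # step 2(a): SIR re-sampling
        samples = samples[idx]
        weights = np.full(N, 1.0 / N)                 # step 2(b): reset weights to 1/N
        return samples, weights

    # Illustrative usage: 2D random-walk dynamics and a Gaussian likelihood.
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(500, 2))
    dynamics = lambda s, r: s + r.normal(scale=0.5, size=s.shape)
    measurement = np.array([1.0, -1.0])
    likelihood = lambda s: np.exp(-0.5 * ((s - measurement) ** 2).sum(axis=1))
    samples, weights = particle_filter_step(samples, dynamics, likelihood, rng)

The call to rng.choice is equivalent to the cumulative-sum selection described in the SIR steps above.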
Figure 2.4: The Particle Filter Process (Merwe et al. [122])
Particles with a higher weight are more likely to be sampled, resulting in higher
weighted samples being given greater emphasis in the next time step. Figure 2.5
illustrates this re-sampling process. Various implementations (Doucet [44], Pitt
and Shephard [139]) have shown that this algorithm can be implemented with
O(N) complexity. Other re-sampling schemes, such as residual re-sampling and
minimum variance sampling, have also been proposed (Higuchi [71], Liu and Chen
[110], Kitagawa [98]).
A key issue when using particle filters is the number of particles needed. Using
fewer particles results in a computationally faster system, so much effort has been
expended on developing methods to lower the number of particles required.
Other research has focused on how to use particle filters in systems where multiple
objects are being tracked. While the output of the particle filter can be multi-
modal, the system will tend toward a single mode in the long term. This means
that for a filter that is tracking multiple people, tracks can be lost as the number
Figure 2.5: Sequential Importance Re-sampling.
of particles representing a track decreases due to the re-sampling procedure.
The condensation algorithm (Isard and Blake [77]) is a specific implementation
of particle filtering, proposed to track curves in images.
Like particle filtering, it is an iterative process where the sample set for time t
is generated by re-sampling from the sample set for time t − 1. It differs from
the particle filter described above in that the re-sampling step is done first. Thus
the output at the end of the time step is a weighted particle set rather than an
un-weighted particle set (see Figure 2.6).
Tracking systems such as those proposed by Kang et al. [90] and Zeng and Ma
[181] use particle filtering techniques to track people. Kang et al. [90] modified
the condensation algorithm to track multiple people in a crowded environment.
A discrete human model was created to allow people to be represented as a
single discrete valued parameter; and a competition rule was introduced, such
that each tracker suppresses the weights of samples around features tracked by
another tracker, helping to avoid multi-modal distributions that can occur when
multiple objects are close to one another. Zeng and Ma [181] used active particle
filtering to track heads. Active particle filtering involves combining traditional
Figure 2.6: The Condensation Process (Isard and Blake [77])
particle filtering with curve fitting. Each particle is fitted to the closest local
maxima of the approximated PDF prior to weighting. The modifications allow
the system to use fewer particles to track objects.
While particle filters are able to represent multi-modal distributions, over time
the most dominant mode within the distribution comes to dominate the filter.
When tracking multiple objects, this means that objects may be lost over time,
as modes with a stronger response draw particles
away from the weaker modes during re-sampling. One approach that has been
used to overcome this is to use a distribution that allows the presence of multiple
modes. Isard and MacCormick [78] developed BraMBLe, a Bayesian multiple-
blob tracker. A particle filtering implementation (Isard and Blake [77]) was mod-
ified by formulating a multi-blob likelihood function to express the likelihood of
a particular configuration of objects resulting in the observed image. This en-
abled the system to function with an unknown, time varying number of objects,
allowing the tracking of multiple objects. The proposed system was also able to
detect new modes as they entered, and remove modes as they left.
An alternative approach to multi-target tracking was proposed by Vermaak et al.
[163] who proposed the Mixture Particle Filter (MPF). The mixture particle
filter addresses the problem caused by a multi-modal posterior distribution (due
to ambiguities or multiple targets) resulting in poor performance. Each mode
(target) is effectively modelled by its own set of particles, which forms part of the
overall mixture. Given this, the overall distribution becomes
p(x_t|x_{t-1}) = Σ_{m=1}^{M} π_{m,t} p_m(x_t|x_{t-1}), (2.25)
where M is the number of mixtures currently in the system, m is the index of the
current mixture, π_{m,t} is the weight for mixture m at time t, and p_m(x_t|x_{t-1}) is the
component distribution for mixture m, approximated by a set of particles used
only by this mixture. Each mixture has its own set of particles, s_t^i, and weights
w_t^i, where i ∈ I_m. The weights of the particles for each mixture are calculated
such that they sum to one,
Σ_{i∈I_m} w_t^i = 1, (2.26)
and the mixture weights also sum to one,
Σ_{m=1}^{M} π_{m,t} = 1. (2.27)
When re-sampling, mixture components are re-sampled individually, ensuring
that modes are not lost during the procedure. Initial weights for the particles are
set according to the set size,
w_t^i = 1 / card(I_m), (2.28)
where card(Im) is the cardinality (size) of the particle set Im.
Throughout the mixture particle filter process, the individual filters only inter-
act through the computation of the weights. This multi-modal filtering approach
overcomes problems associated with previous multi-target trackers where the sam-
ples for a given target could be deleted and the target lost. However, the
system still maintains just a single particle filter for the whole system, rather
than one for each tracked object.
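A sketch of the per-mixture re-sampling of Equations 2.26 to 2.28 follows; it assumes the weights passed in are the unnormalised importance weights computed for the current frame, and the mixture weight update shown is only one plausible reading of the scheme:

    import numpy as np

    def resample_mixture(samples, weights, component_ids, mixture_weights, rng):
        # samples: (N, d) particles; component_ids: (N,) index of the mixture each
        # particle belongs to; weights: unnormalised importance weights this frame.
        new_samples = samples.copy()
        new_weights = weights.copy()
        evidence = np.zeros(len(mixture_weights))
        for m in range(len(mixture_weights)):
            idx = np.flatnonzero(component_ids == m)
            if len(idx) == 0:
                continue
            w = weights[idx]
            evidence[m] = mixture_weights[m] * w.sum()
            p = w / max(w.sum(), 1e-12)
            chosen = rng.choice(idx, size=len(idx), p=p)   # re-sample within m only
            new_samples[idx] = samples[chosen]
            new_weights[idx] = 1.0 / len(idx)              # Equation 2.28
        mixture_weights = evidence / max(evidence.sum(), 1e-12)   # sums to one (2.27)
        return new_samples, new_weights, mixture_weights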
The MPF uses k-means clustering to determine when modes have split or merged.
Splitting of a mode indicates that a new mode has appeared (i.e. a new tracked
object has entered the scene), and as such a new mixture is added to the MPF.
When two modes merge, it indicates that a mode has left (i.e. a tracked object
has left the scene) and a mixture is removed from the model. This allows the MPF
to model a scene with a dynamic number of objects of interest. This approach is
limited however by requiring each target object to match the same feature (i.e.
only a single histogram is used to locate all objects), and this feature is also
constant. As a result, the MPF is limited to situations where all targets have an
identical, unchanging appearance.
Okuma et al. [135] proposed the Boosted Particle Filter (BPF). This work ex-
tended that of Vermaak et al. [163] and used a cascaded AdaBoost (Viola and
Jones [165]) algorithm to detect the target objects to guide the particle filter,
rather than user initialisation followed by k-means clustering. A colour observa-
tion model is used (using HSV space to separate colour from intensity) to measure
likelihoods of the observations. To initialise the system, and allow new modes to
be detected, the AdaBoost results are incorporated into the proposal distribution,
so that when the AdaBoost detection performs well, the BPF distribution can
incorporate this information,
q*_B = α q_ada(x_t|x_{t-1}, y_t) + (1 - α) p(x_t|x_{t-1}). (2.29)
The term q_ada is a Gaussian distribution dependent on the current observation
from the AdaBoost detection, y_t. By increasing the value of α, more weight is
applied to the AdaBoost detection; however, when the AdaBoost detector cannot
detect the target (due to clutter or lighting changes), α can be set to 0 so that the
system operates as an MPF.
Figure 2.7: Incorporating Adaboost Detections into the BPF Distribution
Figure 2.7 illustrates this process. In this instance, the AdaBoost detection pro-
cess detects a mode that is not tracked by the BPF. This information is incorpo-
rated into the final distribution. When re-sampling occurs, there is now a greater
probability that samples that represent the newly detected mode will be propa-
gated, and the new mode can be tracked. Like the MPF however, the BPF uses
a single unchanging feature, and therefore is also limited to situations where all
targets have a very similar unchanging appearance.
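A loose sketch of sampling from the BPF proposal of Equation 2.29 is shown below; associating particles with detections by random assignment, and the noise scales, are simplifications made for illustration:

    import numpy as np

    def boosted_proposal(particles, detections, alpha, dynamics_std, det_std, rng):
        # particles: (N, d) current states; detections: (k, d) AdaBoost detections.
        N = len(particles)
        # Default behaviour: propagate every particle through the dynamics model.
        proposed = particles + rng.normal(scale=dynamics_std, size=particles.shape)
        if len(detections) > 0 and alpha > 0:
            # With probability alpha, redraw a particle from a Gaussian centred
            # on one of the detections.
            use_det = rng.random(N) < alpha
            det_idx = rng.integers(0, len(detections), size=N)
            boosted = detections[det_idx] + rng.normal(scale=det_std,
                                                       size=particles.shape)
            proposed[use_det] = boosted[use_det]
        return proposed

Setting alpha to 0 recovers purely dynamics-driven proposals, mirroring the fall-back to MPF behaviour described above.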
To overcome the limitation of a single feature for the whole filter, Wang et al.
[168] proposed a mixture filtering approach [163] where a separate histogram
is used for each target. This enables multiple people to be tracked in a scene
whilst maintaining identity through occlusions. However the proposed approach
is unable to discover new modes, and relies on manual initialisation.
Ryu and Huber [148] proposed tracking multiple objects using an individual
particle filter for each target, with a joint observation model indicating the likelihood
of occlusions, instead of the mixture approach of [135, 163]. The observation model reprojects
particle weights into image space to determine the most likely object at a given
pixel, and determine when occlusions are present. An observation model for
hidden targets, which uses the expected value of the measurement model, is
used to ensure that occluded objects are not lost whilst they are not visible. A
background filter that is able to detect new modes entering the scene is used to
instantiate new tracks, while tracks are removed by monitoring the total particle
weight associated with each track and the expected value in the observation
model. Such a system allows new modes to be added dynamically (as in [135,
163]), but still allows different features to be used for different targets (as in
[168]).
Pérez et al. [141] used particle filters to track faces by adapting the colour his-
togram based tracking approaches such as [14, 23, 29] to a particle filter frame-
work. The HSV colour space was used to separate intensity from the colour.
Multi-part colour models were used to improve the performance, by encoding
some spatial information within the colour model. Pérez et al. [141] also made
use of the known background to improve performance. Rather than just comparing
the histogram of the region in the current frame with that of the reference, it is
also compared with the histogram of the background image at the same region.
This aims to prevent the tracking from shifting from the target to a background
region.
Particle filters have also found uses in multi-modal systems. Checka et al. [22]
utilised particle filters to track people and monitor their speaker activity. Loy
et al. [112] made use of particle filters to track objects using a variety of hy-
potheses, depending on conditions. Using a particle filter allowed the multiple
hypotheses to be maintained, and allowed tracks to be updated at different fre-
quencies. Like Loy et al. [112], Breit and Rigoll [15] use particle filters to facilitate
multi-modal object tracking. Using the condensation algorithm, Breit and Rigoll
[15] combine a pseudo 2-dimensional hidden Markov model, a skin detector and
a motion detector to track people. The use of a particle filter facilitates simple in-
tegration of the three modes, and allows them to overcome the weaknesses posed
by each mode individually. The modes are combined during the calculation of
the sample weights, w_t^i,
w_t^i = ∏_j p(z_j|x_t^i)^{w_j}. (2.30)
A weighted product of the probabilities is used to determine the final weights,
where the weight of each mode, w_j, is manually selected according to the relia-
bility of that mode.
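The weighted product of Equation 2.30 amounts to the following per-particle computation (the cue names in the comment are illustrative):

    import numpy as np

    def fused_weight(mode_likelihoods, mode_weights):
        # mode_likelihoods: p(z_j | x_t^i) for each cue j (e.g. P2DHMM, skin, motion).
        # mode_weights: manually chosen reliabilities w_j.
        l = np.asarray(mode_likelihoods, dtype=float)
        w = np.asarray(mode_weights, dtype=float)
        return float(np.prod(l ** w))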
As particle filters require a model to guide the particle movement from frame
to frame (in addition to a random offset), approaches have been proposed that
use Kalman filters [51, 122] to model the particle movement. Merwe et al. [122]
proposed a system that incorporated the unscented Kalman filter (UKF) by using
the UKF as the proposal distribution for the filter. Similar proposals have been
made using the EKF instead of the UKF (Freitas et al. [51]).
Yang et al. [177] proposed a Hierarchical Particle Filter that characterised objects
using their colour and edge orientation histogram features. This system was able
to evaluate likelihoods in a coarse to fine manner (similar to that used by Viola
and Jones [165] in their object detector) to allow the system to focus on more
promising regions and quickly discard those with zero probability.
Other applications have been to use particle filters to track 3D human models,
either in a single camera (Green and Guan [58]) or in multiple cameras (Sigal
et al. [153]), or to guide robots (Kwolek [103], Schulz et al. [149]). Green and
Guan [58] and Sigal et al. [153] use particle filters to track the human body in
3D. Green and Guan use a single camera, tracking each joint angle with a particle
filter, while Sigal et al. use a particle filter adapted to work on general graphs to
track the limbs of a person from multiple cameras.
2.4.4 Summary
Being able to estimate the position of an object in a scene, either to aid in de-
tection in subsequent frames or to track through occlusions, is important within a
tracking system. Motion models and Kalman filters (Kalman [86], Welch and
Bishop [172]) take a direct approach, allowing an approximate future po-
sition to be estimated given the object's previous state. Particle filters (Isard and
Blake [77], Maskell and Gordon [120]) do not explicitly provide a set of coordi-
nates where the object should be; rather, they produce a probability density that
describes the likelihood of the object being at any given location.
The design of particle filters is such that particles that have a higher likelihood are
more likely to be re-sampled, at the cost of particles that have a low likelihood.
This process helps the filter avoid a situation where poorly fitting particles come
to constitute the majority of the particle set, but also means that when tracking
a multi-modal distribution, over time the system is likely to lose all but the
most dominant mode. To ensure that multiple modes can be tracked, mixture
filters (Okuma et al. [135], Vermaak et al. [163]) have been proposed, which treat
the overall distribution as the sum of several sub-distributions, each of which
represents just a single mode. The sub-distributions are re-sampled separately,
to ensure that modes are not lost in the re-sampling process. Alternatively,
systems can simply use multiple particle filters, such that each target object is
tracked by its own filter (Ryu and Huber [148]).
2.5 Multi Camera Tracking Systems
The use of multiple cameras in a tracking system allows additional information
to be extracted from a scene, by either observing the same area from two or
more different positions, or by observing more of a scene than a single
camera could. Most multi-camera systems are simply extensions of single camera
systems, facing the same problems with tracking and maintaining identity, only
having to do this across multiple views. Some problems, such as handling and
resolving occlusions, can be dealt with more effectively as an occluded object may
be visible in multiple views, and is unlikely to be occluded in all. Other problems,
such as maintaining identity for tracked objects, can become more complex as
tracked objects must be matched across multiple views.
There are two commonly used architectures for multi-camera systems, illustrated
in Figures 2.8 and 2.9.
Figure 2.8: Multi-camera system architecture 1.
The first approach (see Figure 2.8) applies single camera tracking techniques to
each camera feed, tracking the objects that are visible in that feed. The lists of
tracked objects for each view can then be combined by using camera calibration
information (and potentially other information such as colour and/or appearance
models) to form a list of global tracked objects. In such a system, information
Figure 2.9: Multi-camera system architecture 2.
from the global list may then be passed back to the single camera trackers to
improve performance. Systems proposed in Cheng et al. [24], Kazuyuki et al.
[92], Piva et al. [140], Wei and Piater [170] are examples of this architecture.
The second approach (see Figure 2.9) merges the different views prior to tracking.
After a detection stage, the detected objects are transferred to a global coordinate
scheme, and the objects (now represented in a 3D coordinate scheme) are tracked
by an object tracker. Such systems may merge the views after motion detection,
mapping the detected motion to a common ground plane and then performing
object detection. Systems proposed in Auvinet et al. [5], Krahnstoever et al. [101]
are examples of this architecture.
Systems that use the second architecture require accurate camera calibration as
they need to be able to register detected motion or objects to a common ground
plane or coordinate system very accurately. Systems that use the first architecture
may also utilise camera calibration, but can also make use of additional cues
(such as the direction of motion, colour/appearance) when matching objects across
views. These systems can also be implemented in situations where very little of
the camera calibration information is actually known (such as the field of view
extents and overlaps only), or where the calibration is learnt as the system runs.
The first type of system is also better suited to networks that contain areas not covered by any camera (i.e. disjoint views), which require hand off to be performed blind, as such systems are better equipped to recover identity after a period of occlusion.
2.5.1 System Designs
Multi-camera networks require a large amount of data to be processed, and for
communication to occur between the various software modules responsible for
data collection and tracking. As camera networks become larger, more process-
ing power, and more advanced communication protocols are required. Camera
networks can be set up to communicate in two ways,
1. Each camera/tracker communicates directly with other trackers to deter-
mine positions of objects, hand overs and overlaps;
2. Each camera/tracker communicates with a central server, which combines
all data and provides information back to the trackers as required.
Marchesotti et al. [118] proposed a system using software agents (independently acting program modules which can reason with other agents and make decisions) to track people across a multiple camera system. Camera agents, re-
sponsible for acquiring images and performing image processing tasks, pass in-
formation to a top agent, which is responsible for organising the data from the
cameras and passing it to the appropriate simple agents. The simple agents are
responsible for the tracking and data fusion for multiple cameras. The simple
agents negotiate with each other to determine if they are tracking the same ob-
ject. For every new agent that is added to the scene, a multicast message is
sent containing the position and histogram of the object. All other trackers then
evaluate matching metrics for this data to determine if they are seeing the same
object. The top agent is able to spawn new simple agents as required.
Focken and Stiefelhagen [50] proposed a system designed in the style of a dis-
tributed sensor network. Each machine in the network is synchronised and all
images are timestamped to ensure synchronisation. Each camera in the system
contains a background subtraction module, and produces a constant stream of
features that are sent to a tracking agent. The tracking agent is responsible for
collating this data and determining 3D tracks.
Krumm et al. [102] proposed a multi-camera system consisting of two stereo
cameras in their ‘EasyLiving’ project. Each stereo head uses a dedicated PC to
process the incoming images, and locate people in the scene. The results of this
process are passed to a third PC which performs the person tracking. Position
and histogram information is then passed back from the tracking PC to the stereo
head controllers.
Atsushi et al. [4] proposed a system that used a series of cameras that com-
municated directly with one another rather than through a central server. An
environment map is generated at the start of tracking by each camera, using the
calibration information for the camera network. This map shows where the cam-
era is in relation to the other cameras in the network, allowing it to determine
when another camera will be able to see an object that it is tracking, or which
camera a tracked object that has left the network's field of view is likely to enter.
The cameras communicate using three messages:
• Acquisition - a position is sent and the camera is to attempt to acquire a
track for a person at that location;
• Acquisition on border - a field of view edge is sent and the camera is to
attempt to acquire a track for a person at that location;
• Stop acquisition - the camera is to stop attempting to acquire a track.
Acquisition messages are sent when a track is in an overlapping field of view area,
acquisition on border messages are sent when a track leaves the field of view
of the network and is expected to arrive at the edge of another camera's FOV, and stop
acquisition messages are sent when a station detects a person who has entered
from an unwatched area. For all messages sent, an acknowledgment message is
sent back. This allows the system to dynamically determine if a sensor in the
network is down and adapt accordingly.
Micheloni et al. [123] proposed a system composed of static camera systems (SCS, consisting of fixed cameras with overlapping views) and active camera systems (ACS, consisting of a pan-tilt camera), together with a method for their communication and cooperation. The SCS is responsible for tracking all objects within its views,
and fusing data across different cameras to improve performance. The SCS is able
to request the ACS to track a specific target. Communication is performed over
a wireless network using a simple protocol that allows a network to,
1. Issue request commands such as requesting an ACS to track an object,
2. Transmit the 2-D position of all objects tracked by that network.
Messages can be sent either to an individual network, or to all networks.
2.5.2 Track Handover and Occlusion Handling
Multi-camera tracking systems add an additional level of complexity to tracking
systems, as tracked objects must be matched across camera views. Tracking
systems solve the problem of determining object correspondence in two ways:
1. Transfer coordinates of detected objects and/or extracted features to a
world coordinate scheme (such as an extracted ground plane) and perform
tracking within the 3D work space [5, 101].
2. Apply single camera tracking techniques to each view, and determine the
correspondences between the tracks from separate views [24, 92, 140, 170].
The first approach requires a high degree of accuracy in camera calibration, as we
need to be able to accurately transfer all detected objects to a world coordinate
scheme. However, as all tracking (and potentially object detection) is transferred
directly to a world coordinate scheme there is no need to deal with the problem
of object handover, or matching tracked objects in individual views.
The second approach is able to work on more loosely calibrated cameras, as
only knowledge of the field of view (FOV) extents is needed. However, more
sophisticated calibration schemes can allow position and velocity to be used as
features, and provide greater accuracy when determining correspondences. The
remainder of this section will discuss techniques used to match objects in different
camera views, focusing on the features and matching techniques used.
Features that are often used to determine correspondence include
• Position, either by translating to a global coordinate scheme [10, 19, 21, 38,
50, 118] to extract 3D coordinates, and optionally trajectories [46, 118] and
velocities [140]; or by using FOV boundaries [8, 93, 95] in the 2D images
and knowledge of the order of the cameras (i.e. which cameras overlap with
each other).
• Appearance, incorporating shape/aspect features [38, 140]; or colour, using
histograms [92, 126, 186] or dominant colours [24].
Colour and appearance based features rely on good quality imaging, and (depending on the type of model used) require the cameras to be located near one another and at a similar angle to the subject. Position and shape/aspect features can be easily
translated into a common 3D coordinate system and compared between multiple
cameras and may be more suitable when the subjects cannot be reliably compared
using colour or appearance. When position cannot be computed accurately, due
to poor calibration or the subjects being too far away, trajectory can be used.
Trajectory, however, relies on the objects in the individual cameras being reliably
tracked for a period of time so that there are valid trajectories for comparison.
Position has proved a popular feature as it can be quickly computed, and when
camera calibration is available, it is often desirable to use a 3D coordinate scheme
to switch between cameras and provide output to an end user. In situations where
there are few objects being tracked in the scene, it is also very reliable. A track
in a single view can have its position (and trajectory/velocity) transferred to a
common coordinate scheme where matching between multiple views can be done
purely by position. Ellis [46] tracked objects in a single view using a combination
of shape, position and colour information, but relied only on position to match
objects across different views. The resultant 3D position was then tracked by a
separate Kalman filter in 3D. Tracking in 3D also allows for easy comparison to
other views, and for views that are occluded to receive input from views that are
not, improving localisation of the objects in the occluded view. However, small
errors in the 2D segmentation (i.e. failing to cleanly segment a person's legs,
where they contact the ground plane) can result in large errors in the equivalent
3D coordinates, which can lead to incorrect matching between views.
Black et al. [10] proposes using a homography to map between overlapping views.
Transferred objects are matched based on the error obtained when transferring coor-
dinates between views. Each view’s coordinates are transferred into one another,
and the errors are squared and summed. If this error is below a threshold, a match
has been found. Objects are simultaneously tracked in 2D and 3D using Kalman
filters, recording position and velocity (in pixels for 2D, real world coordinates
for 3D). The output of the 2D Kalman filters is transferred to the image plane to
provide a measure of uncertainty for the measurements, and allow improvement in
the observation uncertainty for the 3D Kalman filter, which is used when matching objects and
updating the filter. The measurement uncertainty increases as the object moves
further from the viewpoint, where segmentation errors have a greater impact on
the translated coordinates.
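A minimal C++ sketch of the symmetric transfer error test described for Black et al. [10] is given below; the structure names and the threshold are illustrative assumptions rather than details taken from that work.

#include <array>
#include <cmath>

struct Point { double x, y; };

// Apply a 3x3 homography (row-major) to a 2D point.
Point applyHomography(const std::array<double, 9>& H, const Point& p)
{
    double X = H[0] * p.x + H[1] * p.y + H[2];
    double Y = H[3] * p.x + H[4] * p.y + H[5];
    double W = H[6] * p.x + H[7] * p.y + H[8];
    return { X / W, Y / W };
}

// Sum of squared errors when transferring each view's coordinates into the
// other; a pair of tracks is accepted as the same object when this error
// falls below a threshold, as in the matching scheme described above.
bool isMatch(const Point& pA, const Point& pB,
             const std::array<double, 9>& H_AtoB,
             const std::array<double, 9>& H_BtoA,
             double threshold)
{
    Point a2b = applyHomography(H_AtoB, pA);
    Point b2a = applyHomography(H_BtoA, pB);
    double err = std::pow(a2b.x - pB.x, 2) + std::pow(a2b.y - pB.y, 2)
               + std::pow(b2a.x - pA.x, 2) + std::pow(b2a.y - pA.y, 2);
    return err < threshold;
}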
Cupillard et al. [38] uses multiple cameras with overlapping fields of view to locate,
track, and identify the behaviour of groups. Motion detection is used to locate
moving regions in each camera, from which a set of 3D numerical parameters are
extracted and a semantic type (i.e. person, crowd) is assigned. The weighted sum
of the similarity between the 3D position, the semantic type and the fusion results
for related objects (as the system is aimed at the analysis of groups, associated
objects in each view are grouped together) is used to determine object matches
across the camera views.
Focken and Stiefelhagen [50] combine data from multiple cameras using position
information. Motion detection is performed on each camera feed, and the blobs
detected have their position translated into a 3D coordinate scheme. Blobs are
matched based on the 3D positions, with a threshold used to determine valid
matches, and provide a measure of confidence for each grouping. The regions
are tracked using a multiple hypothesis tracker, that maintains a list of possible
hypotheses for each track, based on the confidence in the location of the target
region (in 3D coordinates), and the distance moved from the previous position
(modelled as a Gaussian distribution, such that regions that are further from the
last position are weighted less). As tracking progresses and more information is
gathered, the less likely track trajectories can be discarded.
Micheloni et al. [123] proposed a system that combined networks of static cameras
(cameras with a fixed view) with networks of active cameras (cameras that can
change their view, i.e. Pan-Tilt-Zoom cameras). Observations from the different
views are fused using their real-world position and an appearance ratio, that pro-
vides a degree of confidence for the blob extracted from the camera view. Those
cameras that provide more reliable measurements are weighted higher when fusing
tracks to ensure that the use of multiple cameras does not degrade performance.
2D ground coordinates are translated into pan and tilt angles, allowing the active
camera system (ACS) to track the target object. The ACS aims to keep the
track in the centre of its view, and compares its position against the static camera
system (SCS) regularly to ensure that the correct target is being tracked. As the
ACS only uses pan and tilt operations, a simple image translation can be used
to register the camera view before and after moving, based on detected image
features. Feature points are also used to track the target object within the ACS.
Using field of view (FOV) boundaries [80, 93, 95] can reduce the impact of poor
object segmentation, as it is much simpler to accurately detect when a person is
entering or leaving a camera's FOV. When a person crosses a FOV boundary, the
system can check the appropriate FOV line in other cameras, with objects that lie
along that line being candidates for a match. This approach does leave the system
susceptible to errors caused by multiple objects being in the transition area at
the time of handover, and (depending on the layout of the camera network) only
allows a small window of time when correspondence can be determined. Bhuyan
et al. [8] also makes use of FOV lines to determine correspondence across multiple views, proposing a multi-camera system in which each camera uses a single camera tracker (using the MPEG-7 ART descriptor as a tracking feature and the unscented Kalman filter to predict object position).
Colour and appearance matching approaches are less sensitive to errors in seg-
mentation, but may struggle if there are differences in colour balance between
the views, or if the subject’s clothing is not a constant colour, or pattern. To
help negate this difference in colour, colour spaces that separate intensity and
colour information are often used (HSV, YUV). Other approaches such as that
proposed by Kazuyuki et al. [92], Morioka and Hashimoto [126] combine the local
colour histograms that are used to track in an individual camera to form a global
histogram, which is used to identify across multiple cameras.
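As an illustration (not the specific formulation used in the cited works), a global colour histogram built in this way could be compared across cameras with a simple similarity measure such as the Bhattacharyya coefficient, sketched below in C++.

#include <cmath>
#include <numeric>
#include <vector>

// Normalise a colour histogram so that its bins sum to one.
std::vector<double> normalise(const std::vector<double>& h)
{
    double total = std::accumulate(h.begin(), h.end(), 0.0);
    std::vector<double> out(h.size(), 0.0);
    if (total > 0.0)
        for (std::size_t i = 0; i < h.size(); ++i) out[i] = h[i] / total;
    return out;
}

// Bhattacharyya coefficient between two normalised histograms:
// 1.0 for identical distributions, 0.0 for non-overlapping ones.
double bhattacharyya(const std::vector<double>& a, const std::vector<double>& b)
{
    double coeff = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        coeff += std::sqrt(a[i] * b[i]);
    return coeff;
}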
Often, features are fused or used in a hierarchy to achieve greater accuracy.
Marchesotti et al. [118] uses position and a colour histogram, and requires match-
ing constraints for both to be satisfied for a match to be made. Cheng et al.
[24] uses geometry and colour information to determine correspondence between
cameras. Geometry is applied first and if there are ambiguities the major colour
components (likely to represent clothing colours, hair and skin colours) are se-
lected and compared to determine the correspondence. Piva et al. [140] combines
position, speed, a shape factor (aspect ratio) and a chromatic characteristic.
Each is independently considered and a similarity is calculated, after which the
features are merged to determine if there is a match.
Krumm et al. [102] uses position and colour to match people across multiple
cameras. Simple velocity calculations are used to predict a person's next location, and a person's past locations are stored within the system. The person tracker
(in the view being initialised) searches the area around the expected location of
the person for a person shaped blob. If multiple blobs are in the area, histogram
matching is applied to locate the correct blob. Atsushi et al. [4] also determines
correspondence using colour and position. Colour and position are compared and
if the differences are within a limit, they are grouped. For the duration of this
grouping, the track for each view maintains a confidence that it is observing the
same track as all other objects in the group. If this confidence drops below a
threshold, then the object breaks from its group, potentially being grouped with
other tracks.
Yang et al. [176] describe a system to track people through an 18 camera in-
door surveillance network. Track hand off is performed in two distinct ways for
situations where the cameras overlap, and those where they do not (blind hand
off). For situations where there is overlap, the matching ratio is the fraction of
time (of the overlap) that the two tracks are within a distance threshold of one
another in world coordinates. For blind hand off, the matching metric is the
similarity in colour between the models, as well as an additional constraint that
the tracks must appear within a set time limit (upper and lower bounds) based
on the distance between the camera views.
Other systems use probabilistic frameworks to group objects between views. Wei
and Piater [170] proposed the use of particle filters and belief propagation to track
an object across multiple views. The object is tracked in each view using a particle
filter, and the information from these separate views is passed using sequential
belief propagation (Hua and Wu [76]) (SBP) to obtain a global coordinate, and
allow views to share information. Wei et al. [171] also used particle filters to
track objects in multiple views. However rather than use a particle filter for each
view, a single particle filter is used for the entire system. When an occlusion is
present (or likely) camera collaboration is used to handle the occlusion problem,
and coordinates are transferred between the views to maintain tracking.
Chang and Gong [21] proposed a system using Bayesian networks to classify
people across a multiple camera system. Multiple modalities are used for tracking,
covering various recognition (height and colour) and geometric (epipolar lines,
homographies, and use of scene landmarks) techniques. Comparison functions are
defined for each modality and the match results are combined within a Bayesian
framework to determine the likelihood that tracks in separate views correspond.
Cai and Aggarwal [19] also used a Bayesian classifier to match people across
a multi camera system. Their system uses features based on intensity and geometry to formulate a GMM to parameterise the features. A
Bayesian classifier can then be used to find the most appropriate match. This
system tracks a person from a single camera while that subject can be observed
well from that camera. Once the subject becomes occluded or leaves the field of
view for that camera, the system finds the next best camera and resumes tracking.
A simple location based prediction method is used to find the next best camera,
with one of the key requirements being that the camera that is switched to is the one that will minimise the amount of future switching.
Kang et al. [87] registers the cameras in their system using a ground plane ho-
mography (this method relies on there being a common ground plane between
the cameras used in the system, though this is very common in man-made en-
vironments). An appearance based model (based on a polar representation that
allows for a rotation invariant model) and velocity based model (both in 2D and
3D, using Kalman filters) are used and combined using a joint probability data
association filter (JPDAF). The JPDAF is used to handle occlusions and cam-
era hand off. Using the JPDAF, the probability of a tracked object occupying a
given position is defined as the product of the appearance, 2D position and 3D
position probabilities. The optimum position of the track then becomes the position
that maximises these three probabilities.
Fleuret et al. [49] propose an approach to track people within a four camera
network, where the cameras are mounted at eye level. Such a configuration leads
to a large number of occlusions. Tracking is performed across a sliding window
of frames, rather than on a frame by frame basis, and the optimal track across
the whole window is taken. Tracks that are detected well in previous windows
are optimised first in future windows, so that troublesome tracks (which are more
likely to be unstable and ’jump’ to another tracked object) are optimised last, and
so cannot steal a track that has already been allocated. Track probabilities are
determined using an occupancy map and an appearance model. The occupancy map
is generated by transferring the results of background subtraction to the ground
plane, and formulating the probability that a given location of the ground plane
contains a person. This occupancy map is combined with the appearance model
to determine the most likely trajectory for a person over a window of frames.
Collins et al. [27] uses a group of pan-tilt-zoom cameras rather than static cam-
eras, to track a single person moving through a scene. Each camera tracks the
object using the mean shift algorithm [28, 30] and histogram matching (the mean
shift algorithm is robust to camera movement). Cameras are calibrated and share
the position of the object they are tracking with one another, to aid cameras that are tracking poorly. Everts et al. [47] proposed using multiple configured pan-tilt-zoom (PTZ) cameras to cooperatively track objects. The cameras are calibrated
by collecting a number of real world points and their corresponding camera po-
sitions (pan and tilt values), and using an optimizer to determine the camera
parameters that allow for the transform from world coordinates to camera posi-
tion. This calibration assumes that there is a common ground plane, does not
incorporate zoom, and is performed offline. It is important to note that it is unlikely that this method will scale effectively, and it is unclear what effect the number of position pairs used has on calibration accuracy and computation time. Like Collins et al.
[27], objects are tracked using the mean shift algorithm [28, 30], as this technique
is not adversely affected by camera movement. At the end of each frame (once
the target has been located), the cameras are shifted so that the target object is in the centre of the frame. To hand over the track between cameras, the colour
of the object, as well as the assumption that the object will be roughly centred
within the frame, are used. The proposed system is limited by the use of colour
as a discriminating factor. When an object with a non-discriminative colour is tracked, the system may completely fail to locate the object in the second camera
and fail catastrophically.
Tsutsui et al. [162] applies optical flow based person tracking in a multiple camera
environment. A tracking window for the subject being tracked is shifted according
to the mean flow of the frame for the next frame of the sequence. For a single
camera system, the person is modelled as a 2D plane; in a multi-camera system
the person can be modelled as a 3D cylinder, with motion vectors within the
volume being analysed. When an occlusion occurs, the tracking window and
velocity can be transferred to another view until the occlusion is resolved.
Mittal and Davis [124, 125] proposed a system to track people using colour and
position in multiple cameras through complex occlusions. People are modelled
using a cylinder, which is divided into several regions of equal height, each of which
has its own colour model. This models the vertical distribution of tracked person
colour (it is not possible to model the horizontal distribution effectively with 3D
reconstruction). A Bayesian classification scheme is used to segment the image
into regions that belong to a particular person and background, using the colour
models for the tracked person in combination with the position of the pixel being
segmented (relative to the expected position of the person). This segmentation
also takes into account distance from the camera and occlusions, so a person closer
to a camera will yield higher likelihoods (they are less likely to be occluded). The
results from this segmentation across the whole camera network are combined by
projecting the detected regions into the other views to determine the positions of
the people in 3D coordinates, after which the person models (appearance and a
Kalman filter for motion) are updated.
Systems such as those proposed by Auvinet et al. [5] and Krahnstoever et al.
[101] avoid the problem of merging objects across views and track hand off by
performing all tracking in a world coordinate domain. Auvinet et al. [5] proposed
a system which merged the results of motion detection into the ground plane of
a four camera network. This allows all motion in the scene to be viewed from an
overhead perspective, and helps to overcome occlusions. From this perspective,
blobs can be tracked, simple event recognition can be performed and immobile
objects can be detected. To merge the views, each motion image undergoes a
homographic transformation. The images are overlaid and regions that have
three or more silhouettes overlapping (i.e. three of the cameras views report
motion in the same portion of the ground plane) are accepted as blobs for tracking.
As the transform does not consider height off the ground plane, these overlapping
regions should represent the parts of objects that are touching the ground plane
(i.e. feet). Requiring three silhouettes to overlap reduces the likelihood that a
blob will be incorrectly created by objects above the ground plane (i.e. people's heads and upper bodies) being mapped to the same location. The system of Krahnstoever et al. [101] performs target detection in the individual camera views, before
transferring the tracking to a calibrated ground plane.
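The following C++ sketch illustrates the ground-plane fusion step in the spirit of Auvinet et al. [5], assuming the per-camera motion masks have already been warped onto a common ground-plane grid; the types and the function name are hypothetical.

#include <vector>

// A binary occupancy grid over the ground plane (row-major, true = motion).
using GroundGrid = std::vector<std::vector<bool>>;

// Combine per-camera motion masks that have already been warped onto a
// common ground-plane grid.  A cell is accepted as foreground only when at
// least 'minViews' cameras report motion there, which suppresses phantom
// blobs created by heads and upper bodies projecting to the same location.
GroundGrid fuseGroundPlane(const std::vector<GroundGrid>& warpedMasks, int minViews)
{
    if (warpedMasks.empty()) return {};
    std::size_t rows = warpedMasks[0].size();
    std::size_t cols = rows ? warpedMasks[0][0].size() : 0;

    GroundGrid fused(rows, std::vector<bool>(cols, false));
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c) {
            int votes = 0;
            for (const GroundGrid& mask : warpedMasks)
                if (mask[r][c]) ++votes;
            fused[r][c] = (votes >= minViews);
        }
    return fused;
}

With minViews set to three, this reproduces the requirement that three or more silhouettes overlap before a blob is created.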
Handling Disjoint Views
In many real life camera networks, it is likely that not all of the scene will be covered by cameras. It is therefore possible that at some point a tracked object will leave
the field of view of the network, only to reappear in another camera a few seconds
later. Just as it is important to be able to consistently label objects when the
camera views overlap, it is also important to be able to identify when a person
entering a camera's field of view is the same person that left another camera a
few seconds previously. Figure 2.10 shows an example of such a network, where
each room contains one or more cameras, but the hallway connecting the rooms
is unsupervised. Ideally, a system should be able to tell when a person entering
room one has come from room two and vice-versa.
Figure 2.10: Surveillance network containing disjoint cameras.
Leoputra et al. [107] and Lim et al. [109] proposed a system to track objects in a
non-overlapping camera environment using a particle filter and environment map.
The particle filter operates as normal when the target is visible in the camera
view; however, when the target is between camera views, the particle filter uses the
environment map (providing knowledge of possible pathways) to help determine
where the person will reappear, and propagate particles along the possible paths
that the person may have taken. The proposed system combines the particle
filter results with those of a histogram match for any new objects detected. The
particle filter results are weighted less the longer the person has been missing, reflecting the growing number of possibilities for where the person may have gone.
Kang et al. [88] extended the system proposed in [87], to track people in a multi-
camera environment consisting of two non-overlapping fixed cameras, and one
moving camera that can pan between the other two views. A spatio-temporal
joint probability data association filter (JPDAF) is proposed to aid in overcoming
the gaps in the system's field of view. The spatio-temporal JPDAF uses a buffer
of images to improve performance. By allowing multiple images to be taken into
account when formulating probability, errors due to occlusions can be overcome.
The two stationary cameras are registered against a mosaic from the moving
camera, using a common ground plane. When detecting objects in multiple views,
the feet position of detected objects is translated to other views to register common
objects. Once registered using position, the information from all views is fed into
the JPDAF.
Stauffer [156] proposed methods to determine the transition correspondence mod-
els (TCM) within a camera network (i.e. correspondence between exit and entry
locations in different cameras). The goal of a TCM is to estimate the likelihood
that a given observation at a given sink (exit point) was the result of the same ob-
ject that earlier produced an observation at a given source (entry point), where
the sink and source points may be in different camera views. An unsupervised
hypothesis method is proposed which is able to approximate the likelihood of
transitions from one camera to another. The system was evaluated on synthetic
data simulating a traffic scenario (containing a traffic light, and a fork in the
road) and was shown to effectively determine the camera transitions.
It is also important to be able to determine the location of sources and sinks
within the camera network. Knowledge of where objects are allowed to enter can
improve the initialisation of tracks, while knowing where objects are allowed to
exit can help prevent lost tracks. Stauffer [155] used a two state hidden state
model (similar to a HMM, except all sequences are length two, and the model is
not shared across time) to estimate the source and sink positions within a scene,
given a set of track sequences. The first state in the model corresponds to the
sources, and the second to the sinks. An iterative optimisation routine is applied
to find the optimal placement of the sources and sinks.
Javed et al. [81, 82] extended the system proposed in [80, 93, 95] to be able
to learn the topology of a network containing disjoint cameras. The proposed
system uses source and sink locations, the velocity of tracked objects, the time it
takes objects to move between cameras and the appearance of tracked objects to
determine the configuration of the camera network. A training sequence is used
to learn an initial estimation of the system, and the parameters are continuously
updated during system operation. This updating also allows new behaviours to be
learned and added to the system, while transitions that are obsolete are forgotten
(i.e. people may be more likely to travel a different path in the afternoon than in the
morning, thus some camera transitions are more likely at different times). Javed
et al. [82] also proposed modelling the difference in colour histograms between
the disjoint cameras. Once a correspondence model has been learned (and it is
possible to be sure of correspondences) the histograms of an object observed in
two cameras can be compared, and the difference modelled. As the cameras are
likely to be configured differently, use different lenses, or be different cameras altogether, it is important to be able to measure the colour difference between the
two views, to improve the accuracy of comparisons. The difference is modelled
using a Gaussian model.
2.5.3 Summary
The process of tracking objects within a multi-camera network can be approached
in two ways:
1. Apply single camera tracking techniques to each camera and use camera
calibration and/or feature matching to determine object correspondence
in the different views (Cheng et al. [24], Kazuyuki et al. [92], Piva et al.
[140], Wei and Piater [170]).
2. Transfer results of a detection process from all cameras to a common co-
ordinate scheme and track them in the same manner as in a single camera
situation (Auvinet et al. [5], Krahnstoever et al. [101]).
The first method can be implemented in either a distributed manner (each camera communicates with the other cameras and data is shared between them) or a centralised manner (each camera communicates with a central server, which combines data and issues commands to the views), whilst the second requires a centralised implementation. The second approach does, however, avoid the need to match objects across the different views. In sufficiently large networks, systems could be implemented that use such designs.
Being able to consistently label an object across the camera views is important
within a multi-camera network. A method to match objects between views is
required, and it needs to be able to handle the difference in pose between the
views. Given this requirement, position (Cai and Aggarwal [19], Chang and
Gong [21], Cupillard et al. [38], Marchesotti et al. [118], Piva et al. [140]) and
simple colour models (Cheng et al. [24], Kazuyuki et al. [92], ZhiHua and Komiya
[186]) are the most suitable and popular methods. Complex appearance models
are not suitable, as these are typically pose specific. In situations where there are
disjoint views, knowledge of the camera network and velocity of the target object
can be combined with colour models to match objects (Javed et al. [81], Kang
et al. [88], Lim et al. [109]).
Chapter 3
Tracking System Framework
3.1 Introduction
The following chapter describes the tracking framework that has been developed
as part of this thesis. The framework has been developed in C++ using VXL (the vision-something-libraries)¹ as a base to provide basic image processing
functionality such as:
• Image structures for image storage and manipulation (including loading and
saving).
• Basic image processing such as morphology, edge detection, resizing.
• Vector and matrix structures, and associated maths functions.
The proposed framework allows for the development and testing of multi-camera
tracking systems. The framework makes extensive use of abstract base classes
¹ VXL can be downloaded from http://vxl.sourceforge.net/
and polymorphism, allowing new types of trackers, detectors, or storage classes
to be implemented quickly. Configuration is performed using one or more XML
configuration files which are loaded at startup. All system parameters, such as the
number and type of trackers, detectors and objects to be tracked are contained
within this file. Parameters cannot be changed once loaded.
3.2 System Design
The framework uses abstract base classes and polymorphism to allow for an
extensible system. There are four main classes which form the tracking system,
these are:
• ObjectView - a tracked object, as seen in a single camera view. Contains
only 2D information about the object's position (in pixel coordinates), and
has no knowledge of other views. Each ObjectView has a track ID and a
view ID associated with it to identify it within the system.
• TrackedObject - a collection of up to N ObjectView's (where N is the number of inputs to the system). This class also contains 3D information about the object's position, if camera calibration information is available. It is created with an array of N ObjectView pointers, which initially all point to NULL. As the object moves through the scene, these pointers are changed to point to the valid ObjectView structures. Each TrackedObject has a track ID associated with it. This ID is shared by the ObjectView's associated with the TrackedObject.
• ObjectTracker - a tracking class, responsible for tracking objects within a single camera view. The ObjectTracker only sees the ObjectView's that are within its camera view. It has no knowledge of the TrackedObject's, or of other camera views. The class contains two lists of ObjectView's: a shared list that contains the objects known to the system at the start of the processing of the current frame, and an internal list that stores objects that are detected in the current frame.
• Manager - a collection of N ObjectTracker's. The Manager sees the complete list of TrackedObject's and is responsible for maintenance tasks such as the creation, deletion and transfer of objects between views. A minimal sketch of these four classes is given below.
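The following C++ sketch illustrates one possible layout of the four classes. It is illustrative only: the member and method names shown (beyond the class names and the lists described above) are assumptions, not the actual interfaces of the thesis framework.

#include <cstddef>
#include <memory>
#include <vector>

// A tracked object as seen in a single camera view (2D pixel coordinates only).
class ObjectView {
public:
    int trackId = -1;   // shared with the owning TrackedObject
    int viewId  = -1;   // identifies the camera view this observation belongs to
    // ... 2D bounding box, appearance model, state, counters, etc.
};

// A global object: up to N ObjectView pointers (one slot per input view),
// initially all NULL, plus 3D position when calibration is available.
class TrackedObject {
public:
    explicit TrackedObject(std::size_t numViews) : views(numViews, nullptr) {}
    int trackId = -1;
    std::vector<ObjectView*> views;
    // ... 3D position, if camera calibration information is available
};

// Tracks objects within a single camera view; knows nothing of other views.
class ObjectTracker {
public:
    std::vector<ObjectView*> sharedList;                     // objects known at the start of the frame
    std::vector<std::unique_ptr<ObjectView>> internalList;   // objects detected in the current frame
    virtual void processFrame(/* const Image& frame */) = 0; // abstract: concrete trackers inherit
    virtual ~ObjectTracker() = default;
};

// Owns the trackers and the global object list, and performs maintenance.
class Manager {
public:
    std::vector<std::unique_ptr<ObjectTracker>> trackers;
    std::vector<std::unique_ptr<TrackedObject>> objects;
    void performMaintenance() { /* promote new detections, delete Dead views (see text) */ }
};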
These objects relate to one another as shown in Figure 3.1. The black arrows
indicate ownership (i.e. responsibility for creation/deletion/storage) within the
system (i.e. the ObjectTracker shares ownership of the ObjectView's with the TrackedObject's, and is owned by the Manager). The red arrows indicate the
inputs to each object (i.e. the ObjectView receives data from the ObjectTracker
and TrackedObject, and sends data to the ObjectTracker and TrackedObject).
Input images are passed directly to the appropriate ObjectTracker for processing.
Figure 3.1: System Design
The creation and deletion of new ObjectView's and TrackedObject's requires com-
munication between the ObjectTracker and Manager classes. When a new object
enters the system, it is first detected by the ObjectTracker responsible for the
view in which the object appears. This tracker creates a new ObjectView and
adds it to the internal list of the ObjectTracker. When the Manager performs
maintenance, it checks the contents of this internal list and observes that a new
object has been placed there. A new TrackedObject is created and this ObjectView
is associated with it. The ObjectView is then removed from the internal list and
placed in the common list. Object deletion is also performed by the Manager.
An ObjectTracker may change the state of an ObjectView to be Dead (see Sec-
tion 3.2.1), marking it for deletion. When the Manager performs maintenance
(typically at the end of each frame), any ObjectView's that are in the Dead state
are deleted. If the ObjectView deleted is the only ObjectView associated with its
controlling TrackedObject, the TrackedObject is also deleted.
Each of these base classes is used as a platform for building more advanced
tracking systems, through inheritance.
3.2.1 Tracking Algorithm Overview
The tracking algorithm used in this work is a top-down system (see Figure 3.2).
Motion detection is used to perform initial segmentation, and the resultant motion
mask is used by one or more object detectors (possibly in combination with the
input image) to detect the target objects. The resulting list of candidate objects,
DObj(t) is compared to the list of tracked objects, TObj(t). Candidate objects
are compared to tracked objects to determine the quality of matches using a fit
function, F , which returns a value in the range of 0 to 1. A fit of 1 indicates a
perfect match, and a fit of 0 indicates no match. The candidate and track pair
which yield the highest fit score are matched, followed by the next highest, until
all candidate-track pairs that have a valid match (determined by a threshold on
the fit scores, the threshold is typically set to 0.5) are paired. Any remaining
candidates are added as new objects, and any unmatched tracked objects are
updated via prediction.
Figure 3.2: Tracking Algorithm Flowchart
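A minimal C++ sketch of the greedy association step described above is given below; the structure and function names are illustrative, and the fit function F itself is left abstract.

#include <functional>
#include <utility>
#include <vector>

struct Candidate { /* detected object: bounding box, type, ... */ };
struct Track     { /* tracked object: bounding box, state, motion model, ... */ };

// Greedy association: repeatedly take the highest remaining fit score above
// the threshold (typically 0.5), pair that candidate and track, and remove
// both from further consideration.  Unmatched candidates become new tracks;
// unmatched tracks are updated by prediction (handled by the caller).
std::vector<std::pair<int, int>> associate(
    const std::vector<Candidate>& candidates,
    const std::vector<Track>& tracks,
    const std::function<double(const Candidate&, const Track&)>& fit,
    double threshold = 0.5)
{
    std::vector<bool> candUsed(candidates.size(), false);
    std::vector<bool> trackUsed(tracks.size(), false);
    std::vector<std::pair<int, int>> matches;  // (candidate index, track index)

    while (true) {
        double best = threshold;
        int bestC = -1, bestT = -1;
        for (std::size_t c = 0; c < candidates.size(); ++c) {
            if (candUsed[c]) continue;
            for (std::size_t t = 0; t < tracks.size(); ++t) {
                if (trackUsed[t]) continue;
                double f = fit(candidates[c], tracks[t]);
                if (f > best) { best = f; bestC = (int)c; bestT = (int)t; }
            }
        }
        if (bestC < 0) break;   // no remaining pair above the threshold
        candUsed[bestC] = true;
        trackUsed[bestT] = true;
        matches.emplace_back(bestC, bestT);
    }
    return matches;
}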
All tracks have a state associated with them and two counters, which define how the system handles the track. Counters are kept for the number of successive frames the object is correctly detected (cdetected), and the number of successive frames that the system fails to detect the object (coccluded). There are five possible states within the system:
1. Preliminary - Entered into when a track is first created. Tracks in this state
must be continually detected.
2. Transferred - Tracks that are moved from another camera view are created
in the transferred state. This is similar to the Preliminary state, but allows
for more leeway when detecting and matching the object.
3. Active - The track has been observed for several frames. Tracks spend most
of their time in this state. It indicates that the track has been located in
the last frame and its position is known.
4. Occluded - Indicates that the track has not been located in the last frame,
either due to occlusion or system error.
5. Dead - The track is to be removed from the system. Tracks in this state
are deleted when the current frame’s processing ends.
The state transitions are shown in Figure 3.3, and the transition conditions are
outlined in Table 3.1.
Figure 3.3: State Diagram for a Tracked Object
a. Prelim → Active: The tracked object is detected and matched for τactive successive frames (cactive ≥ τactive).
b. Transferred → Active: The tracked object is detected and matched once.
c. Prelim → Dead: The tracked object is not detected and matched for a single frame.
d. Transferred → Dead: The tracked object is not detected and matched for a single frame.
e. Active → Occluded: The tracked object is not detected and matched for a single frame.
f. Occluded → Active: The tracked object is detected and matched for a single frame.
g. Active → Dead: The tracked object is explicitly deleted by the system.
h. Occluded → Dead: The tracked object is not detected and matched for τoccluded consecutive frames (coccluded ≥ τoccluded).
Table 3.1: Transition Conditions
In the proposed system, τactive is set to 3, and τoccluded is set to 10. These param-
eters are used throughout testing unless explicitly specified elsewhere.
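The state and counter handling summarised in Figure 3.3 and Table 3.1 can be sketched in C++ as follows; the update order and naming are illustrative rather than the framework's actual implementation, and the explicit Active → Dead deletion (transition g) is assumed to be triggered elsewhere by the system.

// Track states and counter-driven transitions (see Table 3.1).
enum class TrackState { Preliminary, Transferred, Active, Occluded, Dead };

struct TrackStatus {
    TrackState state = TrackState::Preliminary;
    int cDetected    = 0;   // successive frames detected and matched
    int cOccluded    = 0;   // successive frames the object was missed
};

// Per-frame update, using tau_active = 3 and tau_occluded = 10 as in the text.
void updateState(TrackStatus& s, bool detectedThisFrame,
                 int tauActive = 3, int tauOccluded = 10)
{
    if (detectedThisFrame) { ++s.cDetected; s.cOccluded = 0; }
    else                   { ++s.cOccluded; s.cDetected = 0; }

    switch (s.state) {
    case TrackState::Preliminary:             // must be detected every frame
        if (!detectedThisFrame)                s.state = TrackState::Dead;
        else if (s.cDetected >= tauActive)     s.state = TrackState::Active;
        break;
    case TrackState::Transferred:              // one detection confirms the transfer
        s.state = detectedThisFrame ? TrackState::Active : TrackState::Dead;
        break;
    case TrackState::Active:
        if (!detectedThisFrame)                s.state = TrackState::Occluded;
        break;
    case TrackState::Occluded:
        if (detectedThisFrame)                 s.state = TrackState::Active;
        else if (s.cOccluded >= tauOccluded)   s.state = TrackState::Dead;
        break;
    case TrackState::Dead:
        break;                                 // removed at the end of the frame
    }
}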
Each tracked object stores several values that describe the object, major items
stored include:
• Position - 2D position in image coordinates, the position is stored as the
bounding box of the object.
• Histogram/Appearance Model - one or more colour/appearance models are
stored, to use for matching the object in ambiguous situations.
• Motion model - a motion model is stored, to allow a prediction of the
object’s position to be made in the event of an occlusion or ambiguities
when matching. This may be a constant velocity model, Kalman filter, or
particle filter.
• ID - an ID for the tracked object.
• Track type - type for the object, may be person, vehicle or unknown. This
determines what matching parameters are used, and what detection rou-
tines are employed.
• State - the state of the tracked object (see Figure 3.3) and its associated
counters, to determine how the object is handled and the action to be taken
in the event that the object cannot be detected.
Values such as the ID and Track type are set when the object is created, and are
unlikely to change for the life of the object. Other values such as the position,
histogram/appearance model, and state may change regularly.
The algorithm also allows zones to be defined within the scene. A zone allows
specific behaviour to be permitted or denied within a given part of the scene
(specified as a polygon). The following types of zone are defined by the system:
• Active - area where objects are allowed to exist.
• Entry - allow objects to enter within this region.
• Entry Priority - allow objects to enter within this region, with priority given
to one class of objects.
• Transfer - allow objects to be transferred to other camera views in this
region.
• Alarm - raise an alarm when an object is detected in this region.
• Inactive - region where tracks cannot appear, objects in this region will be
discarded.
By default (if no zones are specified), all locations are active and entry for all
classes of objects. Each zone can also be specified with an object type, so that
different entries, allowed locations and alarm zones can be defined for people and
vehicles. The entry priority zone is handled slightly differently, as it allows all
objects to enter at the specified region, but gives the specified object class priority
(i.e. in the specified region, cars are added before people). This can be used to
help control incorrect detections, by ensuring that the more likely object class at
a given position is processed first (i.e. on the footpath, person objects would be given priority, while on the road, vehicle objects would be processed first).
3.3 Object Detection
The system detects three different types of objects:
1. People - a region of motion with the major axis of the region vertically
aligned in the image, that contains a vertical peak in the motion image,
with a drop either side of the peak.
2. Vehicles - a rectangular region of motion (allowed range of aspects defined
by the system configuration) with a high ratio of motion pixels within the
region.
3. Blobs - a rectangular region of motion (allowed range of aspects the same
as, or more inclusive than the range for vehicles) with a ratio of motion
pixels within the region less than or equal to that required for vehicles and
people.
A region is defined as one or more 8-connected groups of pixels that are grouped
according to spatial constraints (proximity of region bounds and centroid to one
another).
These three object types are used as they encompass all objects that are observed
within the testing data. Additional object types can be added to the system if
required. The object detection routines (and thus the ability to track that type of
object) are enabled within the configuration file (i.e. a system can be configured
to track only people and ignore all vehicles). All object detection routines use a
binary image as a basis, such as a motion image.
3.3.1 Person Detection
Once a motion image has been obtained, it must be analysed to determine the
location of any people present. Motion images can contain significant errors,
either as motion being detected where there is none, or motion not being detected
where it should be. Any detection techniques should be robust to these errors.
As motion is being used for detection, there is no texture information available, only
information relating to size and silhouette. To extract people from a motion
image, the following process is used [66, 184]:
1. Locate areas of the image which contain a significant amount of motion (one
or more connected components that are closely located and satisfy minimum
size requirements, parameters such as required size of grouping distances
are defined in a configuration file and vary between applications/datasets)
and are likely to contain people.
2. Locate the heads of people within those regions using vertical histograms
and the top contour of the motion region.
3. Fit ellipses at the head locations to determine if there is sufficient motion
to constitute a person.
This process requires that people appear vertically in the image (i.e. parallel to
the left and right image bounds).
Motion images are analysed and broken into smaller segments containing patches
of motion to allow people who are vertically aligned (occupy a similar set of
columns at different heights in the image) to be detected (the use of vertical his-
tograms and the top contour means that only one head can exist in any given
column). These regions are processed separately, so if there is spatial separa-
tion between two vertically aligned people, their motion regions will be analysed
separately and each person can be detected. During this same process, small, un-
connected regions of motion can be removed, as their presence may lead to other
inaccuracies. These are likely to be errors, or motion caused by objects too small
to track (i.e. a piece of rubbish being blown across the ground by the wind). The
remaining regions can be grouped into spatial groups, and analysed individually.
Figure 3.4 shows the input and motion images, and the resultant head detection.
A single region of interest is located, and based on the height map of that region,
a single head is detected ((c) shows a white dot at the detected head on the height
map that corresponds to the motion image).
Figure 3.4: Head Detection. (a) Input Image, (b) Motion Mask, (c) Detected Heads.
A person’s head in a silhouette image typically has the following properties:
1. It is the highest point on the person’s silhouette.
2. The surrounding area is roughly curved and symmetrical.
The second condition may not hold if the person is wearing a hat, has an unusual
hairstyle, or if there are errors in the segmentation. As such, the first property
will be used as the basis for detection.
It is assumed that people will appear in the image vertically (i.e. their spine will
be parallel to the vertical edge of the image), and so the image is analysed on a
column by column basis to determine the pixel height of the region. To determine
the height two approaches can be used:
1. Vertical Projection - vproj(i) = Σ_{j=0}^{N−1} M(i, j), where vproj(i) is the vertical projection at column i, j is the row index and N is the number of rows (height) of the mask image, M.
2. Top Contour - vcontour(i) = N − min{ j : M(i, j) > 0 }, where vcontour(i) is the top contour. It is assumed that the mask image (M) is zero indexed and the top left corner is at the coordinate (0, 0).
The vertical projection counts the number of motion pixels in each column, so a
region such as the head, which should have motion all the way below to the feet,
should lie at a global maximum. However, if the motion image contains errors
such as missing regions (i.e. a large portion of the person's shirt is not detected
as motion), the vertical projection may not contain the head at a maxima. The
top contour is simply the topmost pixel in each column that is in motion; the accuracy of this, however, depends on the accuracy of the motion detection around
the edge of the person. Either one of these, or both in combination, can be used
to detect the head of a person. Using both in combination can help overcome the
individual weaknesses of each modality and improve detection results (see figure
3.5),
vHeightMap(i) = αvproj(i) + βvcontour(i), (3.1)
where vHeightMap(i) is the combined height map, α is the weight of the vertical
projection, and β is the weight of the top contour. A mean filter is applied to
the height map to reduce noise and remove small local maxima (see Figure 3.5
(e)). This height map can then be searched for maxima, which are the likely
location of heads. The global maximum will provide a good estimate of the head of
one person. If multiple people are present in the area being analysed, then local
maxima will represent one or more of their heads. Analysis of the maxima, such
as looking at their prominence and proximity to other maxima, can be used to
determine which of these are likely to represent the heads of the people in the
region.
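A minimal C++ sketch of the height map computation (Equation 3.1) is shown below; the weights α and β and the mean filter width are illustrative parameters, not values prescribed by the thesis.

#include <vector>

// Binary motion mask, M[row][col], stored row-major; true indicates motion.
using Mask = std::vector<std::vector<bool>>;

// Combined height map of Equation 3.1: a weighted sum of the vertical
// projection (motion pixels per column) and the top contour (height of the
// topmost motion pixel per column), followed by a simple mean filter.
std::vector<double> heightMap(const Mask& M, double alpha = 0.5,
                              double beta = 0.5, int halfWindow = 2)
{
    const int rows = (int)M.size();
    const int cols = rows ? (int)M[0].size() : 0;

    std::vector<double> h(cols, 0.0);
    for (int i = 0; i < cols; ++i) {
        int vproj = 0, vcontour = 0;
        for (int j = 0; j < rows; ++j) {
            if (M[j][i]) {
                ++vproj;                                  // vertical projection
                if (vcontour == 0) vcontour = rows - j;   // top contour
            }
        }
        h[i] = alpha * vproj + beta * vcontour;
    }

    // Mean filter to remove small local maxima before searching for heads.
    std::vector<double> smoothed(cols, 0.0);
    for (int i = 0; i < cols; ++i) {
        double sum = 0.0; int n = 0;
        for (int k = -halfWindow; k <= halfWindow; ++k)
            if (i + k >= 0 && i + k < cols) { sum += h[i + k]; ++n; }
        smoothed[i] = sum / n;
    }
    return smoothed;
}

The maxima of the returned height map are then the candidate head locations described above.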
Once the heads have been located, ellipses are fitted at the head points. Ellipses
are oriented such that the major axis is vertical, and the length of the major axis
Figure 3.5: Height Map Generation. (a) Input Image, (b) Vertical Projection, (c) Top Contour, (d) Height Map, (e) Mean Filtered Height Map.
of the ellipse is set to the height of the head as detected by the head detector. The
length of the minor axis is chosen by performing further analysis of the height map. The area surrounding the detected head in the height map is searched to find the position on either
side where the height drops to below a predefined ratio of the total height (i.e.
50%), or a minima in between two maxima. The maximum of the left and right
distance is used as the width of the minor axis, and the ellipse is cropped at the
smaller of the two (see Figure 3.6 (c)).
Figure 3.6: Ellipse Fitting. (a) Input Image, (b) Person Bounds, (c) Ellipse.
After the ellipse dimensions have been determined, a filled ellipse can be drawn
overlaying the detected person (see figure 3.6), and the amount of motion within
can be calculated such that,
Operson = ( Σ_{E(i,j)>0} M(i, j) ) / ( Σ E(i, j) ), (3.2)
where Operson is the percentage of the ellipse that contains motion, i and j are the
image coordinates, M is the motion image and E is the ellipse mask. If Operson
is above a threshold, τOccPer, then the candidate region is accepted as a valid person
candidate. τOccPer is set to 0.3 in the proposed system. The motion for that
person can now be removed from the motion image, to ensure that it is not used
to detect a second person later. In Figure 3.6, the occupancy for the ellipse is
86%, and so the candidate region is accepted.
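The occupancy test of Equation 3.2 can be sketched in C++ as follows, using an axis-aligned elliptical mask; the rasterisation approach and parameter names are assumptions made for illustration.

#include <vector>

using Mask = std::vector<std::vector<bool>>;   // binary motion mask, row-major

// Fraction of the ellipse (centred at (cx, cy), semi-axes a and b, axis-aligned
// with the major axis vertical) that is covered by motion pixels, as in
// Equation 3.2.  A candidate is accepted if this exceeds tau_OccPer (0.3 in
// the proposed system).
double ellipseOccupancy(const Mask& M, double cx, double cy, double a, double b)
{
    const int rows = (int)M.size();
    const int cols = rows ? (int)M[0].size() : 0;

    long insideEllipse = 0, insideMotion = 0;
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            double dx = (x - cx) / b;   // b: horizontal (minor) semi-axis
            double dy = (y - cy) / a;   // a: vertical (major) semi-axis
            if (dx * dx + dy * dy <= 1.0) {
                ++insideEllipse;
                if (M[y][x]) ++insideMotion;
            }
        }
    }
    return insideEllipse ? (double)insideMotion / insideEllipse : 0.0;
}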
3.3.2 Vehicle Detection
Vehicles are detected by locating large areas of motion, where there is a high
concentration of motion pixels in the region’s bounding box (i.e. most pixels
are in motion), as most vehicles are roughly rectangular in shape. The detection
process runs in two stages, the first simply groups large regions of motion together
to form a list of initial vehicle candidates. The second analyses this initial list
further, checking for overlapping objects to create a list of final vehicle candidates,
which is then used by the system to update existing tracks and create new tracks.
Initial candidate vehicles are formed by locating regions of motion and grouping
nearby regions in to a single candidate. The ratio of motion pixels to total
bounding box size,
Ovehicle = ( Σ_{x=L..R, y=T..B} M(x, y) ) / (W × H), (3.3)
where L, R, T, and B are the bounds of the candidate object, W and H are the
width and height of the candidate object, M is a binary motion mask and Ovehicle
is the ratio of motion pixels to bounding box size; as well as overall size,
Avehicle = Σ_{x=L..R, y=T..B} M(x, y), (3.4)
where Avehicle is the total motion area of the candidate object; is used to validate
detected candidates (i.e. if the region is too small, or contains too little motion
relative to its size, it is discarded). If either Ovehicle or Avehicle is less than its
corresponding threshold, τOccV eh or τAreaV eh, the candidate is discarded.
Figure 3.7 shows the results of this initial vehicle detection process (top line shows
the input frame, bottom line shows the motion mask with any detected vehicles
surrounded by a red box).
Figure 3.7: Vehicle Detection.
If an initial candidate is larger than expected, it may be due to the candidate
being the result of two vehicles overlapping. The candidate is analysed using
vertical and horizontal projection histograms,
vproj(i) = Σ_{j=0}^{N−1} M(i, j), (3.5)
hproj(j) = Σ_{i=0}^{O−1} M(i, j), (3.6)
where vproj(i) is the vertical projection at column i, j is the row index and N is
the number of rows (height) of the mask image, M, and hproj(j) is the horizontal projection at row j, i is the column index and O is the number of columns
(width) of the mask image; to determine if the candidate may be formed by two
vehicles overlapping (see Figure 3.8). If the initial candidate is deemed to be an
acceptable size, it is accepted as a candidate vehicle.
An abrupt change in either dimension is likely to indicate an overlapping area.
Overlapping vehicles can be separated by detecting these changes and segmenting
accordingly. The gradient of the vertical and horizontal projection histograms is
analysed to detect overlaps,
Ovhoriz = |hproj(j) − hproj(j − 1)| ≥ O × τ^grad_ov, (3.7)
Ovvert = |vproj(i) − vproj(i − 1)| ≥ N × τ^grad_ov, (3.8)
where Ovhoriz and Ovvert are the detected horizontal and vertical overlaps respectively and τ^grad_ov is a scaling value used to determine a gradient threshold based on the candidate size (25% in the proposed algorithm). Detected overlaps that are within close proximity to one another (5% of the region size, τ^grad_prox) are merged and the average position of the merged overlaps is used.
In Figure 3.8, the two cars and cyclist are detected as a single vehicle candidate,
due to the cars overlapping and the cyclist being nearby. Computing the hor-
izontal and vertical projections for the candidate, it can be seen that there are
two significant changes in gradient in each direction (denoted by the red lines in
Figure 3.8: Detecting Overlapping Vehicles. (a) Input Image, (b) Vehicle Candidate, (c) Vertical Projection, (d) Horizontal Projection, (e) Detected Vehicles, (f) Detected Vehicles.
Figure 3.8 (c) and (d), and shown overlaid on the original motion image in (e)).
The initial candidate can be segmented using these boundaries to locate the three
separate candidate vehicles (Figure 3.8 (f)).
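A short C++ sketch of the overlap detection of Equations 3.7 and 3.8 is given below; the merging of nearby split points (within τ^grad_prox of one another) is omitted for brevity, and the function name is illustrative.

#include <vector>

// Indices where the projection histogram jumps by more than tau_grad * extent,
// as in Equations 3.7 and 3.8.  'extent' is the candidate's size in the
// opposite dimension (N for the vertical projection, O for the horizontal
// projection).  Merging of split points within tau_prox of one another is
// omitted here.
std::vector<int> detectOverlaps(const std::vector<int>& projection,
                                int extent, double tauGrad = 0.25)
{
    std::vector<int> splits;
    for (std::size_t i = 1; i < projection.size(); ++i) {
        int diff = projection[i] - projection[i - 1];
        if (diff < 0) diff = -diff;                  // absolute gradient
        if (diff >= tauGrad * extent)
            splits.push_back((int)i);
    }
    return splits;
}

Applying this to both the vertical and horizontal projections yields the split boundaries used to segment an oversized candidate into separate vehicles.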
3.3.3 Blob Detection
A third object detection routine is used to catch any remaining objects that have
not been detected by either the person or vehicle detectors. The sole purpose of
this detector is to catch objects that the other detectors failed to detect, possibly
due to errors in the binary input image. For a standard system configuration,
objects detected by this detection routine can only be used to update the position
of existing tracked objects (i.e. an object detected by the blob detector cannot
result in a new object being added to the system).
The blob detector simply locates regions of motion and groups the regions ac-
cording to spatial constraints. If the grouped regions are within size bounds, they
are accepted as candidates. This process is essentially the same as that which
searches for vehicle candidates (see section 3.3.2), with looser constraints on the
merging of nearby regions of motion, occupancy and size. Constraints on size
and region grouping are relaxed as any objects detected by this process could not
be detected by the detection routine intended to find them (most likely due to
segmentation errors), and so using the constraints applied in the more specific de-
tection routines will result in no detection being made. As objects detected in this
process are only used to update known objects, this is deemed to be acceptable.
3.4 Baseline Tracking System
The baseline tracking system uses the algorithm described in Section 3.2.1. The
system uses the motion detection system proposed by Butler et al [18] (an
overview of this algorithm is provided in Section 2.2.1, and more details can
be found in Section 4.2), and the object detection routines described in Section
3.3.
A simple constant velocity motion model is used to predict object positions,
T^i_x(t+1) = T^i_x(t) + \frac{1}{N}\left(T^i_x(t) - T^i_x(t-N)\right),    (3.9)

T^i_y(t+1) = T^i_y(t) + \frac{1}{N}\left(T^i_y(t) - T^i_y(t-N)\right),    (3.10)

T^i_h(t+1) = T^i_h(t) + \frac{1}{N}\left(T^i_h(t) - T^i_h(t-N)\right),    (3.11)

T^i_w(t+1) = T^i_w(t) + \frac{1}{N}\left(T^i_w(t) - T^i_w(t-N)\right),    (3.12)

where T^i_x and T^i_y are the x and y image coordinates for track i, T^i_w and T^i_h are the width and height (in pixels) of track i, N is the size of the motion model and t is the current time step. The whole bounding box is used within the
model and t is the current time step. The whole bounding box is used within the
motion model to smooth the width and height and counter any fluctuations due
to segmentation errors. The size of the motion model (N) depends on frame rate
and the speed of objects, but is typically set to 10 frames for most systems.
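The prediction step can be summarised by the short sketch below; the representation of the track history as a list of bounding boxes is an assumption made purely for illustration.

```python
def predict_track(history, n=10):
    """Constant velocity prediction of a track's bounding box (Eqs. 3.9 - 3.12).

    history : list of past bounding boxes for the track, each as (x, y, w, h),
              ordered oldest to newest; at least n + 1 entries are assumed.
    n       : size of the motion model (number of frames used to estimate velocity).
    Returns the predicted (x, y, w, h) for the next frame.
    """
    current = history[-1]
    past = history[-1 - n]
    # Each component moves by 1/N of the displacement observed over the last N frames.
    return tuple(c + (c - p) / float(n) for c, p in zip(current, past))
```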
Objects are detected using the object detection routines outlined in Sections 3.3.1
to 3.3.3 for people, vehicles and blobs respectively. Each object detection routine results in a set of objects being detected, D^i, i = [1..D], where D is the total number
of objects detected.
The fit for an object to a candidate is calculated by comparing the position and
size of the detected object, Di, to the tracked object, T j. Errors in position and
area are calculated using,
E_{position}(D^i, T^j) = \sqrt{\left(\frac{D^i_x - T^j_x}{T^j_w}\right)^2 + \left(\frac{D^i_y - T^j_y}{T^j_h}\right)^2},    (3.13)

E_{area}(D^i, T^j) = \frac{\left|D^i_w \times D^i_h - T^j_w \times T^j_h\right|}{T^j_w \times T^j_h},    (3.14)

where E_position(D^i, T^j) is the error in the median position between the candidate object D^i and the tracked object T^j, and E_area(D^i, T^j) is the error in area between the objects. D^i_x, D^i_y, D^i_w and D^i_h are the x and y position, the width and the height respectively. The errors are expressed as a percentage of the object's size (i.e. for
an object which is very large, a given change in position will be less significant
than for an object which is very small).
Errors are evaluated using two Gaussian distributions (one for position, one for area) with user specified standard deviations (σ_pos and σ_area for the position and area distributions respectively; these are expressed as percentages and specified as inputs in the configuration file) and a mean of 0 (no error, the value is the same from one frame to the next). The likelihood of the position and size (area) are determined separately such that,

F_{position}(D^i, T^j) = \Phi_{0,\sigma_{pos}}(E_{position}(D^i, T^j)),    (3.15)

F_{area}(D^i, T^j) = \Phi_{0,\sigma_{area}}(E_{area}(D^i, T^j)),    (3.16)

where F_position(D^i, T^j) is the fit of the position component, F_area(D^i, T^j) is the fit of the area component, and \Phi_{\mu,\sigma} is the cumulative distribution function for the Gaussian distribution. The product of these fits is used as a measure of the fit of the candidate to the tracked object,

F(D^i, T^j) = F_{position}(D^i, T^j) \times F_{area}(D^i, T^j),    (3.17)

where F(D^i, T^j) is the fit between the track T^j and the candidate D^i. If F(D^i, T^j) is greater than τ_fit and the track is in either the active, occluded or entry state, a valid match has been found. For the transfer state, the threshold is lowered to τ_fit/2, to account for potential errors when transferring coordinates between views.
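The matching computation is sketched below. The thesis evaluates each error with the Gaussian cumulative distribution function (Eqs. 3.15 and 3.16); in this sketch the complementary form 2(1 − Φ) is used so that zero error yields a fit of 1 and large errors decay towards 0, which is assumed to be the intended behaviour. The σ defaults are placeholders for the user specified configuration values.

```python
import math

def gaussian_fit(error, sigma):
    """Map a non-negative error to a fit score in [0, 1] using a zero-mean Gaussian.

    Uses 2 * (1 - Phi(error)) rather than Phi(error) itself, so that small errors
    produce high fit scores (an assumption about the intended behaviour of
    Eqs. 3.15 and 3.16).
    """
    phi = 0.5 * (1.0 + math.erf(error / (sigma * math.sqrt(2.0))))
    return 2.0 * (1.0 - phi)

def match_fit(det, track, sigma_pos=0.2, sigma_area=0.3):
    """Fit between a detection and a track, both given as dicts with x, y, w, h keys.

    The position and area errors are expressed relative to the track's size
    (Eqs. 3.13 and 3.14), and the overall fit is their product (Eq. 3.17).
    """
    e_pos = math.sqrt(((det['x'] - track['x']) / track['w']) ** 2 +
                      ((det['y'] - track['y']) / track['h']) ** 2)
    e_area = abs(det['w'] * det['h'] - track['w'] * track['h']) / (track['w'] * track['h'])
    return gaussian_fit(e_pos, sigma_pos) * gaussian_fit(e_area, sigma_area)
```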
A histogram is used to compare objects when there is uncertainty regarding a
match. The histogram is calculated in YCbCr, with the luminance component
independent from the chrominance. The uncertainty of a match is determined by
the ratio of the top two fits for a given object,

U(T^i, T^j, D^k) = \frac{\min(F(T^i, D^k), F(T^j, D^k))}{\max(F(T^i, D^k), F(T^j, D^k))},    (3.18)

where U(T^i, T^j, D^k) is the uncertainty for the tracks T^i and T^j matching the object D^k. If U(T^i, T^j, D^k) is greater than a threshold, τ_unc, the histograms of the tracks are compared to the histogram of the object (for our system, τ_unc is set to 0.5). Histograms are compared using the Bhattacharyya coefficient,

B(T^i, D^k) = \sqrt{\sum_{n=1}^{N} \sqrt{H(T^i, n) \times H(D^k, n)}},    (3.19)

where B(T^i, D^k) is the Bhattacharyya coefficient, H(T^i, n) is the nth bin of the histogram belonging to T^i, and N is the total number of bins in the histogram. To simplify comparison and ensure any results are within fixed bounds, the histogram comparison is performed using histograms with their bin weights normalised such that they sum to 1,

\sum_{n=1}^{N} H(T^i, n) = 1.    (3.20)

This will return 1 for a perfect match, and 0 for no match. B(T^i, D^k) is multiplied with the original fit of the two tracks, such that,

F'(T^i, D^k) = F(T^i, D^k) \times B(T^i, D^k),    (3.21)

F'(T^j, D^k) = F(T^j, D^k) \times B(T^j, D^k).    (3.22)

Whichever track has the greatest value of F' is deemed the matching track.
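A sketch of the histogram-based disambiguation is given below; hist_det is the detection's histogram, and the outer square root follows Eq. (3.19) as written (omitting it gives the more common form of the coefficient and does not change the ordering of matches). All names are illustrative.

```python
import numpy as np

def bhattacharyya(hist_a, hist_b):
    """Histogram similarity following Eq. (3.19), with bins normalised to sum to 1 (Eq. 3.20)."""
    a = np.asarray(hist_a, dtype=float)
    b = np.asarray(hist_b, dtype=float)
    a /= a.sum()
    b /= b.sum()
    # 1 for identical histograms, 0 when the histograms do not overlap at all.
    return float(np.sqrt(np.sum(np.sqrt(a * b))))

def resolve_uncertain_match(fit_i, fit_j, hist_i, hist_j, hist_det, tau_unc=0.5):
    """Disambiguate two tracks competing for one detection (Eqs. 3.18, 3.21 and 3.22)."""
    uncertainty = min(fit_i, fit_j) / max(fit_i, fit_j)      # Eq. (3.18)
    if uncertainty <= tau_unc:
        # The fits are sufficiently different; the higher fit wins outright.
        return 'i' if fit_i >= fit_j else 'j'
    f_i = fit_i * bhattacharyya(hist_i, hist_det)            # Eq. (3.21)
    f_j = fit_j * bhattacharyya(hist_j, hist_det)            # Eq. (3.22)
    return 'i' if f_i >= f_j else 'j'
```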
3.5 Evaluation Process and Benchmarks
The proposed tracking system and its improvements are evaluated using a subset of the ETISEO database [130] and the ETISEO evaluation tool². The ETISEO evaluation was run in 2006 to evaluate tracking and event recognition systems.

² ETISEO resources such as the database and evaluation tool can be downloaded at http://www-sop.inria.fr/orion/ETISEO/index.htm
The throughput of the tracking systems proposed in this thesis is also evaluated. The average number of frames processed per second (fps) for each group of
datasets (see Section 3.5.2) is used to evaluate the system performance in terms
of data throughput. Whilst it is not expected that the tracking systems pro-
posed in this thesis are capable of processing at 25 fps (the implementations of
the tracking system used within this thesis are unoptimised, only run on a single
processor core and load data to and from a hard drive), they are still ultimately
intended to be used for processing live data feeds of 5 fps or faster (the frame
rate required to process a live data feed depends on the data itself). As such, it
is important that the proposed systems approach or exceed this frame rate.
Section 3.5.1 outlines the metrics proposed by the ETISEO evaluation, which are
used in this evaluation; Section 3.5.2 details the subset of the ETISEO database
that is used, and describes the major configuration settings for each dataset; and
Section 3.5.4 contains benchmarks for the baseline tracking system.
3.5.1 Evaluation Metrics
As part of the ETISEO evaluation, an evaluation tool was developed and several
metrics were proposed for comparing the performance of tracking systems [154].
These metrics fall into five areas:
1. Detection of objects.
2. Localisation of objects.
3. Tracking of objects.
4. Classification of objects.
5. Event Recognition.
For each of these areas, there are several metrics that measure specific performance criteria, and an overall metric that is created by averaging
the simpler metrics. All metrics return a value in the range [0..1]. A value of 1
indicates best possible performance, 0 indicates worst possible. The evaluation
of work in this thesis will focus on metrics from the first three areas (detection,
localisation, tracking). The tracking systems discussed in this thesis do not per-
form event recognition (with the exception of abandoned object detection), and
classification of objects within the proposed systems is very limited (the ETISEO
evaluation uses a much greater range of classification types than the proposed
systems). As a result, metrics in these areas are deemed to be unnecessary for
the evaluation of the proposed systems.
The use of the ETISEO evaluation format and the metrics allows the system per-
formance to be analysed across a wide range of criteria. High level comparisons
can be made by comparing the overall metrics, whilst analysis of the simpler,
component metrics can be used to better understand the reasons for any im-
provements gained. The use of an existing evaluation process also avoids the
challenges involved in developing an evaluation tool to compare tracking data to
ground truth (and formulation of separate metrics).
The metrics used in the evaluations contained in this thesis are briefly outlined
here. Many metrics can be computed over either individual frames, or over the
whole sequence. Only metrics computed over the whole sequence are considered in
our evaluation. More detailed information on these metrics can be found in [154].
The following standard definitions are used for many of the metrics:
True Positive (TP)     Detected situation exists in the ground truth and results
True Negative (TN)     A situation that does not exist in either the ground truth or results
False Positive (FP)    The algorithm has detected a situation that does not exist in the ground truth
False Negative (FN)    The algorithm has failed to detect an event that exists in the ground truth

Table 3.2: Evaluation Metric Standard Definitions
From these values, four scores can be calculated, the Precision, Sensitivity, Speci-
ficity and F-Score,
Precision = \frac{TP}{TP + FP},    (3.23)

Sensitivity = \frac{TP}{TP + FN},    (3.24)

Specificity = \frac{TN}{FP + TN},    (3.25)

F\text{-}score = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity}.    (3.26)
It should be noted that not every metric computes all these measures.
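For reference, the four scores can be computed directly from the raw counts as sketched below; the guards against empty denominators are an assumption, as the metric definitions do not state how such cases are handled.

```python
def evaluation_scores(tp, tn, fp, fn):
    """Compute the scores defined in Eqs. (3.23) - (3.26) from raw TP/TN/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (fp + tn) if (fp + tn) else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if (precision + sensitivity) else 0.0)
    return {'precision': precision, 'sensitivity': sensitivity,
            'specificity': specificity, 'f_score': f_score}
```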
Four distance measures are also defined. These are used to match objects in
the ground truth to objects in the result data. As such, they are used as part
of determining many other metrics. Overlaps need to be determined for both
spatial and temporal information (see Figure 3.9). In Figure 3.9, GT represents
the ground truth, and R the result.
Figure 3.9: Ground truth and result overlaps - (a) Spatial Overlap, (b) Temporal Overlap
The four metrics (E1-4: the Dice coefficient, overlapping, Bertozzi and maximum deviation measures respectively) are defined as follows,

E1 = \frac{2 \times card(GT \cap R)}{card(GT) + card(R)},    (3.27)

E2 = \frac{card(GT \cap R)}{card(GT)},    (3.28)

E3 = \frac{card(GT \cap R)^2}{card(GT) \times card(R)},    (3.29)

E4 = \max\left(\frac{card(R/GT)}{card(R)}, \frac{card(GT/R)}{card(GT)}\right),    (3.30)
where card(S) is the cardinality (number of elements) in the set, S. Each of these
measures can be used as is, or have a threshold applied to determine a match.
In instances where a distance metric is used, a result is calculated using each
distance metric, and the average of these scores is taken as the final metric score.
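Treating the ground truth and the result as sets (for example of pixel coordinates or of frame indices), the four distance measures reduce to simple cardinality ratios, as sketched below; the set representation, and the assumption that neither set is empty, are illustrative readings of the definitions.

```python
def distance_measures(gt, result):
    """The four ETISEO distance measures (Eqs. 3.27 - 3.30) between two non-empty sets."""
    inter = len(gt & result)
    e1 = 2.0 * inter / (len(gt) + len(result))           # E1: Dice coefficient
    e2 = inter / float(len(gt))                          # E2: overlapping
    e3 = inter ** 2 / float(len(gt) * len(result))       # E3: Bertozzi
    e4 = max(len(result - gt) / float(len(result)),      # E4: maximum deviation
             len(gt - result) / float(len(gt)))
    return e1, e2, e3, e4
```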
The evaluation within this thesis does not include individual scores for precision,
sensitivity, specificity or f-score for each metric. Nor does it provide results for
each distance metric (E1-4). The evaluation presents the overall score for each
metric. This is the average of the precision, sensitivity, specificity or f-score (or
whichever of these are computed). If the metric requires E1-4, then the overall
score is the average of the precision, sensitivity, specificity or f-score for each
distance metric.
The evaluation also computes overall scores for detection, localisation and track-
ing. These scores are the average of all precision, sensitivity, specificity and f-score
results across E1-4, for all metrics in the category.
In the evaluations within this thesis, the overall metrics for detection, localisation
and tracking are used, as well as the individual metrics detailed below. Metrics are
outlined briefly here to describe what they are evaluating. Detailed information
on the precise formulation of the metrics is not provided here, and can be found
in [154]. Other metrics not used in our evaluation (such as those for object
classification and event recognition) can also be found in [154].
Detection Metrics
Two detection metrics are defined:
• Number of physical objects (D1)- The number of objects detected
compared to the number of objects present in the ground truth (does not
consider if the objects detected are valid, i.e. in the expected positions).
• Number of physical objects using their bounding box (D2) - The
number of detected objects that have a significant overlap with a ground
truth object. Only one detected object can match a ground truth object,
any additional detections are designated as false positives.
Localisation Metrics
Four localisation metrics are defined:
• Physical objects area (L1) - Evaluate the 2D object position in each
frame. Uses the overlap of the ground truth and result bounding boxes to
compute the metric.
• Physical object area fragmentation (splitting) (L2) - Determine ob-
ject fragmentation by checking the number of detected objects that corre-
spond to (overlap with) each ground truth object. Ideally, each detected
object should overlap with a single ground truth object.
• Physical object area integration (merging) (L3) - Determine object
merging by checking the number of ground truth objects that correspond
to (overlap with) each detected object. Ideally, each ground truth object
should overlap with a single detected object.
• Physical object centroid localisation (L4) - Evaluates the detection ac-
curacy based on the distance between the detected centroid and the ground
truth centroid.
Metrics L2 and L3 do not compute the true/false positives/negatives, and define
a separate metric for matching (see [154]).
Tracking Metrics
Five tracking metrics are defined:
• Number of objects being tracked during time (T1) - measures the
ability of the system to detect and follow an object over time. Analyses
object occurrences and object persistence over time, and uses the distance
between the ground truth and the result data to determine performance.
• Tracking time (T2) - Measures the percentage of time that an object is
tracked for. Assumes that the object ID will be constant over the object
life.
• Physical object ID fragmentation (T3) - Determine how well tracked
objects remain associated with ground truth objects. Measures the number
of result objects (tracks recorded in the test data) that are associated with
each ground truth object to determine how well objects remain tracked.
• Physical object ID confusion (T4) - Measure the rate at which result
tracks switch to other ground truth tracks (i.e. becomes confused).
• Physical object 2D trajectories (T5) - Determines if 2D trajectories
are correctly detected over time. A match between two given trajectories is
calculated by obtaining the distance between them using a distance metric and
thresholding the result.
Metrics T2, T3 and T4 do not compute the true/false positives/negatives, and
define a separate metric for matching (see [154]).
3.5.2 Evaluation Data and Configuration
The evaluation uses the following datasets (example images are shown in Figure
3.10):
• ETI-VS2-RD6 - a dataset showing a road with a combination of cars and
pedestrians to track.
• ETI-VS2-RD7 - a dataset showing a road with a combination of cars and
pedestrians to track.
• ETI-VS2-BC16 - a dataset showing the corridor of a building, with several
doors on either side through which people enter and exit.
• ETI-VS2-BC17 - a dataset showing the corridor of a building, with several
doors on either side through which people enter and exit.
• ETI-VS2-AP11 - a two camera dataset (ETI-VS2-AP11-C4 and ETI-VS2-
AP11-C7) showing part of an airport tarmac, with various vehicles moving
about.
• ETI-VS2-AP12 - a two camera dataset (ETI-VS2-AP12-C4 and ETI-VS2-
AP12-C7) showing part of an airport tarmac, with various vehicles moving
about.
• ETI-VS2-BE19 - a two camera dataset (ETI-VS2-BE19-C1 and ETI-VS2-
BE19-C3), showing the entrance to a building, with people entering and
leaving the building, and vehicles entering and leaving the parking lot.
• ETI-VS2-BE20 - a two camera dataset (ETI-VS2-BE20-C1 and ETI-VS2-
BE20-C3), showing the entrance to a building, with people entering and
leaving the building, and vehicles entering and leaving the parking lot.

Figure 3.10: Examples of Evaluation Data - (a) RD Example, (b) BC Example, (c) AP C4 Example, (d) AP C7 Example, (e) BE C1 Example, (f) BE C3 Example
The configurations used for the system are kept the same when testing different datasets captured from the same camera (i.e. RD6 and RD7 both use the same configuration, which is different to that used by BC16 and BC17). Configurations are also kept as similar as possible when testing different tracking systems on the same dataset (see the testing performed in Chapters 5 and 6). For elements common to the trackers, the same configuration is used, with the configuration only altered to configure the additional features.
When performing all tests, the first image the tracker receives (initialisation im-
age) is an empty scene. Whilst the motion detectors used within the tracking
systems are able to learn the background over several hundred frames (the num-
ber required depends on the amount of motion within the scene and the learning
rate), this is not ideal as it means that an early portion of every dataset is devoted
to training the background model. An alternative method is to ensure that the
first image is an image of an empty scene. This ensures that the initial model is
of an empty scene, and allows object tracking to commence immediately.
All images are resized to 320 × 240 pixels for processing. Tracking results are
scaled back to the original image dimensions for comparison to the ground truth.
Processing is performed at the smaller, and uniform dimensions, to improve sys-
tem speed and conserve storage space when performing the testing, and provide
a better simulation of real world conditions (real time performance is not pos-
sible for images sized 720 × 576 pixels or similar). Common image sizes across
datasets also allow more of the configuration parameters to remain fixed between
datasets.
For each dataset, the following configuration parameters are selected:
• Motion detector thresholds (τLum and τChr) and learning rate (L)
• Expected person size (minimum and maximum height)
• Expected vehicle size (minimum and maximum size specified as height and
aspect ratio, minimum amount of motion pixels)
These parameters can be quickly established by looking at a small set of frames
(at most 10) from the dataset. Motion detection parameters are chosen based
on the observed noise in the scene and the contrast between the moving objects
and the background. Expected sizes are chosen based on the size of objects
observed in the scene. For scenes where one object class is not present, the
appropriate detector is disabled and not configured. It is realistic to expect that
any commercial deployment of an intelligent surveillance system would involve at least this much configuration, if not significantly more, for each deployed camera. It is also unrealistic to expect a single configuration to be optimal for all possible video feeds.
All other parameters are fixed for all datasets. These parameters are listed in
Table 3.3.
Parameter                                   Value
Motion Detection Parameters
    K                                       6
Person Detection Parameters
    τ_OccPer                                0.3
    Minimum Aspect Ratio                    0.15
    Maximum Aspect Ratio                    0.5
Vehicle Detection Parameters
    τ_OccVeh                                0.4
    τ_ov^grad                               0.25
    τ_prox^grad                             0.05
Object Detection and Matching Parameters
    τ_active                                3
    τ_occluded                              10
    τ_fit                                   0.5
    τ_unc                                   0.5

Table 3.3: System Parameters - Fixed parameters for all data sets.
RD Dataset Configuration
The RD datasets are configured to track both people and vehicles. Four zones
are defined:
1. Active Zone - the entire scene, for all object classes.
2. Entry Priority Zone - the roadway, with vehicles given priority.
3. Entry Zone - the sidewalk nearest the camera, for the person class.
4. Entry Zone - the sidewalk furthermost from the camera, for the person
class.
This configuration means that only people can be created on the sidewalk, and
either class can be created on the road with vehicles given priority. Once an
object is created, any position within the scene is valid (i.e. a vehicle may be
tracked on the footpath, as long as it is created on the road). Figure 3.11 shows
the zones, with the numbers indicated in the image corresponding to the above
list.
Figure 3.11: Zones for RD Datasets
A mask image is also used, and is shown in Figure 3.12. This mask forces the
portion at the top of the image to be ignored. This region is deemed to be of
no interest as it contains the overlayed text that changes (causing motion to be
detected where there is nothing of interest), a tree that blocks the roadway almost
entirely (meaning vehicles cannot be tracked), and the side of the building. This
use of the mask also means that no motion is detected as a result of any movement
from the tree. The regions of images that are shaded dark are masked out.
Figure 3.12: Mask Image for RD Datasets
Configuration parameters specific to the RD datasets are listed in Table 3.4.

Parameter                      Value
Motion Detection Parameters
    τ_Lum                      50
    τ_Chr                      30
    L                          9
Person Detection Parameters
    Minimum Height (pixels)    15
    Maximum Height (pixels)    60
Vehicle Detection Parameters
    Minimum Height (pixels)    15
    Maximum Height (pixels)    120
    Minimum Aspect Ratio       1/3
    Maximum Aspect Ratio       3
    τ_AreaVeh                  200

Table 3.4: System Parameters - Parameters specific to the RD data sets.
BC Dataset Configuration
The BC datasets are configured to track people only. No zones are defined (all
locations are valid for track existence and entry). A mask image is used, and
is shown in Figure 3.13. This mask simply removes portions of the image which
cannot contain people, allowing an improvement in processing speed. The regions
of images that are shaded dark are masked out.

Figure 3.13: Mask Image for BC Datasets
Configuration parameters specific to the BC datasets are listed in Table 3.5. Note
that for the BC datasets, vehicle detection is disabled.
AP Dataset Configuration
The AP datasets are configured to track vehicles only. No zones are defined, and
no mask image is used.
Parameter                      Value
Motion Detection Parameters
    τ_Lum                      50
    τ_Chr                      30
    L                          9
Person Detection Parameters
    Minimum Height (pixels)    35
    Maximum Height (pixels)    200

Table 3.5: System Parameters - Parameters specific to the BC data sets.
Configuration parameters specific to the AP datasets are listed in Table 3.6. Note that for the AP datasets, person detection is disabled.

Parameter                      Value
Motion Detection Parameters
    τ_Lum                      50
    τ_Chr                      30
    L                          9
Vehicle Detection Parameters
    Minimum Height (pixels)    10
    Maximum Height (pixels)    120
    Minimum Aspect Ratio       1/5
    Maximum Aspect Ratio       5
    τ_AreaVeh                  50

Table 3.6: System Parameters - Parameters specific to the AP data sets.
BE Dataset Configuration
The BE datasets are configured to track people and vehicles. The following zones
are defined across the two views:
1. Active Zone - the entire scene, for all object classes.
2. Entry Priority Zone - the driveway entry, with vehicles given priority.
3. Entry Zone - the building doorway, for the person class.
4. Entry Zone - the driveway and car park area, for both classes.
Figure 3.14 shows the zones, with the numbers indicated in the image correspond-
ing to the above list.
Figure 3.14: Zones for BE Datasets - (a) BE C1 Zones, (b) BE C3 Zones
Mask images are also used, and are shown in Figure 3.15. The masks remove
regions where tracks cannot appear (such as the gardens, the side of the building).
This use of the masks also means that no motion is detected as a result of any
movement from the trees. The regions of images that are shaded dark are masked
out.
Configuration parameters specific to the BE datasets are listed in Table 3.7.
Figure 3.15: Mask Images for BE Datasets - (a) BE C1 Mask, (b) BE C3 Mask
Parameter                      Value
Motion Detection Parameters
    τ_Lum                      60
    τ_Chr                      50
    L                          9
Person Detection Parameters
    Minimum Height (pixels)    20
    Maximum Height (pixels)    80
Vehicle Detection Parameters
    Minimum Height (pixels)    15
    Maximum Height (pixels)    120
    Minimum Aspect Ratio       1/3
    Maximum Aspect Ratio       3
    τ_AreaVeh                  500

Table 3.7: System Parameters - Parameters specific to the BE data sets.
3.5.3 Tracking Output Description
The object tracker annotates output images to indicate what objects have been
detected and are being tracked. All tracking output in the evaluations contained
within this thesis is annotated using the described approach. An example of the
tracking output is shown in Figure 3.16.
Objects may be annotated in two ways depending on what state they are in.
Objects that have only just entered the system and are in either the Preliminary
or Transferred state are annotated using a single yellow box. An example of this is visible in Figure 3.16 (c). All other objects are annotated by drawing two boxes around the object. These boxes indicate the object's ID and its type.

Figure 3.16: Example output from the tracking system - (a) Frame 240, (b) Frame 340, (c) Frame 440

The outer bounding box indicates the object ID. The same object should be
drawn with the same colour bounding box each frame (see Figure 3.16, the car
on the far side of the road is annotated with a blue box each frame). If the colour
of the outer bounding box changes, it indicates that the object’s ID has changed
which may indicate that an error has occurred. The ID bounding box can be one
of 16 colours (the EGA colours are used). The system will reuse colours if more
than 16 objects are tracked during the system's run time.
The inner bounding box indicates the object's type. There are three types within
the system:
1. Person - Indicated by a red box.
2. Vehicle - Indicated by a cyan box.
3. Unknown - Indicated by a white box.
The unknown object class should only be observed if an error occurs.
3.5.4 Benchmarks
Overall results are shown in Tables 3.8 and 3.9. Detailed results for each dataset
are contained in Appendix A.
Data Set   Detection      Localisation                   Tracking
           D1     D2      L1     L2     L3     L4        T1     T2     T3     T4     T5
RD         0.77   0.64    0.80   1.00   0.98   1.00      0.39   0.31   0.77   0.95   0.32
BC         0.77   0.41    0.73   1.00   0.98   0.99      0.45   0.30   0.74   0.85   0.34
AP         0.82   0.64    0.76   1.00   1.00   1.00      0.57   0.39   0.97   1.00   0.45
BE         0.66   0.25    0.55   0.81   0.79   1.03      0.19   0.11   0.47   0.64   0.21
Table 3.8: Baseline Tracking System Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.67                0.93                   0.44
BC         0.48                0.91                   0.46
AP         0.68                0.93                   0.59
BE         0.33                0.74                   0.25
Table 3.9: Overall Baseline Tracking System Results
RD Dataset Benchmarks
Benchmarks for the RD datasets (ETI-VS2-RD6 and ETI-VS2-RD7) are shown
in Tables A.1 and A.2 in Appendix A. Difficulties are experienced by the tracking
system due to the presence of objects that stop during the scene (see Figure 3.17,
the car on the far side of the road (blue bounding box) is lost after almost 500
frames of tracking), and several occlusions.
The reason objects are lost after being stationary for a period of time is that
the motion detection algorithm learns these objects as being part of the back-
ground. Whilst the problem of objects being lost can be overcome by using a
slower learning rate, this is not ideal, and thus not implemented. Within the
evaluated datasets, some tracked objects stay stationary for over 2000 frames.
Figure 3.17: Example output from RD7 - Loss of track due to the target object (car in blue rectangle on the far side of the road) being stationary for several hundred frames. Frames 140, 240, 340, 440, 540 and 640 are shown.
A learning rate suitably slow to ensure that such objects are not incorporated
into the background will result in a background model that is close to static. In
situations such as that shown in Figure 3.17, detection and tracking errors also
arise when the object does begin to move again. As the vehicle has been learned
as part of the background, when it begins to move the space that it used to
occupy may be considered foreground (depending on the learning rate, and the
time spent stationary) and thus be detected as an object of interest.
Detection and tracking results are also diminished by vehicles that are present
and parked in the entire scene (see Figure 3.17, the white van on the far side of the
road is present for the entire scene). The tracking system requires motion to be
observed to detect objects. As these vehicles are always stationary, there is never
any motion observed to enable detection. However, as the ETISEO ground truth
annotates these objects, the performance of the detection and tracking metrics is
reduced.
BC Dataset Benchmarks
Benchmarks for the BC datasets (ETI-VS2-BC16 and ETI-VS2-BC17) are shown
in Tables A.3 and A.4 in Appendix A. The BC datasets are captured in a hallway,
with a shallow field of view angle, resulting in a large number of occlusions. This
is reflected in the overall tracking performance, and the poor performance for the
tracking metric T2 (tracking time). The colour of the hallway and floors also poses a problem, as it is similar to the colour of the skin and clothing of many of the subjects in the scene. There is also a significant shadow/reflection on the floor caused by the people walking about the scene (see Figure 3.18, where the person is poorly localised as their shadow is detected as being part of them), which results in detection and localisation errors and, when multiple people are present, can compound tracking problems arising from occlusions.
Figure 3.18: Example output from BC16 - Detection and localisation errors caused by shadow/reflection of the moving object. Motion and output are shown for frames 665, 675, 685 and 695.
A typical occlusion within this dataset is shown in Figure 3.19. Due to the
shallow field of view angle of the camera, a total occlusion is caused as the people
walk away. This results in the identity of one person being lost, the identity of
the second being switched to the first.

Figure 3.19: Example output from BC16 - Total occlusion, resulting in loss of track identities. Motion and output are shown for frames 1250, 1275, 1300 and 1325.

Occlusions such as this are common in the
BC datasets. The motion detector output in Figure 3.19 also highlights another
problem affecting the performance of the BC datasets. The white shirt worn by
the person on the left cannot be distinguished from the floor and walls by the
motion detection procedure (thus it is not detected as being a region of motion).
This results in poor detection of this person. This problem affects several other
people in the scene. Whilst it can be solved by using a lower motion detection
threshold, this results in the shadows and reflections on the floor being even more
problematic (see Figure 3.18).
AP Dataset Benchmarks
Benchmarks for the AP datasets (ETI-VS2-AP11 and ETI-VS2-AP12, Cameras
4 and 7) are shown in Tables A.5 and A.6 in Appendix A. The AP datasets are
captured outdoors and, like the RD datasets, contain objects that are present in the scene for the entire sequence (without moving) and thus are not detected.
Like the RD datasets, these objects are annotated in the ground truth, and so the
system's failure to track them diminishes performance. The AP datasets contain
few occlusions, and as such there are few tracking errors, and the performance
errors recorded within the metrics can be largely attributed to the objects that
do not move within the scene.
BE Dataset Benchmarks
Benchmarks for the BE datasets (ETI-VS2-BE19 and ETI-VS2-BE20, Cameras 1
and 3) are shown in Tables A.7 and A.8 in Appendix A. The BE datasets contain
significant camera noise. To handle this, motion detection thresholds are raised
(less sensitive), however this makes the detection of some objects (particularly
people) difficult, as many people are dressed (at least partially) in dark clothing
and are moving across a dark road surface.
Figure 3.20: Example output from BE19-C1 - Impact of poor motion detection performance on tracking; the person leaving the building is tracked poorly. Motion and output are shown for frames 100, 200, 300 and 400.
Figure 3.20 shows examples of the noise present in the data, with false motion
detected under the tree. There is also motion detected at the office front door
through which the person leaves, which whilst valid motion, does potentially
cause additional problems for the object detection.
Figure 3.21: Example output from BE19-C3 - Impact of poor motion detection performance on tracking; the two people are localised poorly. Motion and output are shown for frames 475, 500, 525 and 550.
Figure 3.21 shows a second example of tracking within the BE datasets. Once
again noise is present in the data, and the motion associated with the objects in
the scene is incomplete (i.e. the woman walking towards the office front door,
dressed in dark colours similar to the roadway).
The BE-C3 datasets pose an additional challenge in that the camera is mounted
such that the people appear at an angle (see Figure 3.21 for an example). The
proposed person detection routine is intended to work in an environment where
the people appear vertical. Whilst the slant present in BE-C3 is not severe,
it is sufficiently large to reduce the D2 (detection based on bounding box), T2
(tracking time) and T5 (2D trajectories) metrics to, or almost to, 0. The slanted
camera angle, whilst not making detection impossible, does result in the system
tending to detect only a part of the person. T2 and T5 being 0 illustrates the
effect that poor detection has on the object tracking. Whilst part of the object
is detected and tracked by the system, this part is typically about 50% of the
person's area, and this is sufficiently low to ensure that there is not sufficient overlap
between the detected object and ground truth object to record a correct detection.
As a result, tracking metrics such as T2 and T5 are reduced to 0, as there is no
trajectory that closely matches that in the ground truth.
Within the BE datasets, the motion detection is configured to try and allow
a suitable compromise between the noise and detection problems, and detection
routines are also relaxed. Part of this configuration involves using a faster learning
rate, which further exacerbates the problem of objects being incorporated into the
background (see Figure 3.17). As a result of the motion detector performance and
relaxed detection parameters, there is an increase in false detections. However
this is required to ensure that the objects of interest are detected at all. There is
also a prolonged occlusion between three people in BE20, which is handled very
poorly (as illustrated by the performance metrics for BE20, particularly camera
3).
Sensitivity to Thresholds
The baseline tracking system uses a large number of thresholds during the tracking
process. In any system where decisions need to be made, thresholds are required.
In many cases these thresholds are fixed for all cases, whilst other thresholds need
to be set at an appropriate level for the dataset. Ideally, the system should be
functional with the thresholds set within a range of values (i.e. precise tuning of
each threshold should not be required for the system to function). However, it is
expected that for some thresholds, there will be some values (or ranges of values)
which result in system failure.
To assess the sensitivity of the baseline system's thresholds, three thresholds are
evaluated:
1. τfit - The fit threshold for matching detected objects to tracked objects.
This threshold is set globally for all configurations, at 0.4.
2. τOccPer - The threshold for determining if a person candidate is to be ac-
cepted given the amount of motion within the detected object bounds. This
threshold is set globally for all configurations, at 0.3.
3. τLum - One of two thresholds for performing cluster matching within the
motion detection algorithm [18] (the other being τchr). This threshold is set
for each group of datasets (i.e. RD, BC etc.), as each camera has different
scene and noise characteristics. For the BC datasets, it is set to 50.
Each threshold is evaluated using the ETI-VS1-BC-12 dataset from set 1 of the
ETISEO evaluation. This dataset is very similar to VS2-BC16 and VS2-BC17,
and the same configuration is used (however no mask image is used due to the
slightly different camera angle). To evaluate sensitivity to a given threshold, the
baseline system is evaluated with the threshold in question set to different values
while the rest of the configuration is unchanged.
The overall detection and tracking metric scores (see Section 3.5.1) are plotted for
each evaluated threshold. Localisation is not considered as localisation scores only
consider correctly detected objects (i.e. false detections, or missed detections, are
not considered, so poor detection does not affect the localisation scores). As a result of this, localisation scores do not vary significantly as the thresholds are changed, and are of little interest.
Figure 3.22 shows the performance of the overall detection and tracking metrics
as τfit is varied from 0 to 1 in 0.1 increments. τfit primarily has an effect on the
tracking performance, as this threshold is used after detection has taken place.
When τfit is set to very low values, any detected object can be matched to any
tracked object. This can result in incorrect matches, and tracks swapping identity.

Figure 3.22: Sensitivity of Baseline System to Variations in τfit
This will only occur however, if the object detection produces errors, resulting
in erroneous detections being available for tracked objects to be matched to. As
the object detection performance is constant, there is little change in the system
performance for low values of τfit (0 to 0.4, see Figure 3.22).
As τfit is increased, obviously incorrect matches are eliminated and performance
improves, reaching its peak at 0.5 (see Figure 3.22). However as τfit is increased
further, performance begins to decline as matches which are in fact correct, are
ruled invalid due to the high value of τfit. It can be seen that whilst τfit performs
best at 0.5, detection does not decline severely until τfit exceeds 0.8, and the
system is capable of operating at a range of values.
Figure 3.23 shows the performance of the overall detection and tracking metrics as
τOccPer is varied from 0 to 1 in 0.1 increments. τOccPer is used when selecting valid
candidates during person detection. As such, it has an effect on both detection
and tracking performance.
When τOccPer is very low, more candidates can be detected by the system. However, many of these candidates are invalid, leading to the creation of false tracks.

Figure 3.23: Sensitivity of Baseline System to Variations in τOccPer
As τOccPer is increased, fewer false candidates are detected, and tracking and de-
tection improves (see Figure 3.23, increase in tracking performance from 0.3 to
0.5).
At 0.5, tracking performance reaches its peak, however detection performance
drops. Fewer candidates are detected at τOccPer = 0.5, leading to not only a
decrease in the detection of false candidates, but also an increase in missed detec-
tions of valid candidates. This increase in missed detections ultimately leads to a
performance drop for overall detection. However, the tracking system is able to
predict positions to overcome these missed detections, and fewer false tracks are
spawned due to the reduced number of false candidates that have been detected.
As such tracking performance increases while detection performance decreases.
However, once τOccPer exceeds 0.5, performance decreases rapidly (see Figure
3.23). The detection algorithm struggles to detect any people at all, as a result
there are no objects to track, and the system fails. Once τOccPer reaches 1 (every
pixel within the person bounds must be in motion), the system fails to detect
and track any people.
Figure 3.24: Sensitivity of Baseline System to Variations in τlum
τlum is one of two thresholds used in performing the motion detection (the other
being τchr). Motion detection thresholds are set independently for each camera
view, as each camera has different noise and scene characteristics. Figure 3.24
shows the performance of the overall detection and tracking metrics as τlum is
varied. Motion thresholds have an effect on both detection and tracking results,
as the motion detection results are used by the object detection (and object
detection results are used by the tracking).
When τlum is set to low values, the motion detection algorithm is more sensitive,
detecting more motion. When the algorithm is too sensitive (τlum = 0), too
much motion is detected, and no valid candidate objects can be found as the
majority of the scene is classified as being in motion due to sensor noise. As τlum
is increased, it becomes possible to detect candidates, and the system is able to
track objects. Tracking performance varies as the threshold is increased, how-
ever detection performance continuously decreases. The BC datasets have a low
contrast between the people and the background, so sensitive motion detection
is required. However, a highly reflective floor and visible noise result in a large
number of false candidates being detected. As the threshold is increased (less
sensitive motion detection) the tracking performance begins to improve again, as
fewer false candidates are detected allowing for more accurate tracking of those
people that can be detected. However, once the threshold is set to above 80, per-
formance drops once again as the system is unable to detect motion associated
with the people, and thus is unable to track them.
It should be noted that having the chrominance threshold (τchr) set at 30 limits
the effects of increasing τlum. When performing motion detection, both τlum
and τchr thresholds must be satisfied for a match between the background model
and input image to occur. If both thresholds can not be satisfied by one of the
background modes, then motion will be detected at the pixel in question. As
τlum is increased, more and more foreground pixels will be detected as a result of
violating τchr, and once τlum becomes sufficiently high, it will cease to have any
noticeable effect on the motion detection.
It has been shown that the three thresholds that have been evaluated are all
capable of operating reasonably over a wide range of values, indicating that the
system is not overly sensitive to threshold values. However, all these thresholds do
have optimal values, and some values which can cause the system performance
to decrease dramatically (in the case of τOccPer and τlum the system can fail
completely).
Whilst the thresholds do have some values that can result in poor performance,
these values are all very intuitive. It is obvious that setting a threshold such as
τOccPer to very high values (greater than 0.8) will result in very poor performance
unless the motion detection results feeding the object detection process are ex-
tremely accurate, and even then the fact that people are not a clearly defined
shape will still result in missed detections. Likewise, very low values of τlum (less
than 10) are going to result in very poor performance unless the input video
stream is free from all noise and environmental effects (fluorescent lighting flicker,
shadows, reflections etc.).
Additional thresholds are not examined due to the high computational demand
of performing such an evaluation, however it is expected that the other thresh-
olds would perform similarly. It should also be noted that not all datasets would
exhibit similar results. For example, a dataset with lower noise and a greater con-
trast between foreground and background objects would not suffer the detection
performance drop off experienced when τlum is varied.
Data Throughput Benchmarks
The average frame rate at which the datasets were processed is shown in Table
3.10. These throughput rates are for the tracking system running on a single core
of a 2.4GHz Quad-Core Intel Core 2 CPU (i.e. there is no multi-threading). All
data is loaded from disk at the start of each frame, and images are then resized
to 320x240 pixels.
The BC datasets are processed at approximately 4 fps slower than the other
datasets. This is due to the larger objects present in the BC datasets. The
motion detection process is fastest when there is no motion detected (i.e. the
incoming pixel matches the first background mode it is compared to). As the
BC datasets contain larger objects than the other datasets, the motion detection
execution time is slower when compared to the other datasets. The speed of
the object detection routines is also dependent on the size of the region they are
processing. As the dataset contains larger objects, the object detection routines
are executed more slowly when compared to other datasets.
Data Set    Frame Rate (fps)
RD          18.84
BC          14.47
AP          19.89
BE          18.36
Table 3.10: Baseline Tracking System Throughput
Chapter 4
Motion Detection
4.1 Introduction
Motion detection forms the basis of many object tracking systems. Motion detec-
tion is used to locate objects of interest for tracking, and as such, it is important
to have a robust and adaptive motion detection system. Failure of the motion
detection in such a system may result in failure to detect and track objects, or in
false objects being detected and tracked. The following chapter outlines a pro-
posed multi-modal background modelling system. The proposed system extends
an algorithm proposed by Butler [18].
A flowchart of the proposed algorithm is shown in Figure 4.1. This shows the
process that a single pixel undergoes.

Figure 4.1: Flowchart of proposed motion detection algorithm
The proposed algorithm extends Butler’s [18] to include the following:
• A variable threshold (see Section 4.3.1)
• Lighting compensation (see Section 4.3.2)
• Shadow detection (see Section 4.3.3)
• Segmentation of the detected motion into motion caused by moving objects
and motion caused by stationary objects (see Section 4.5)
• Computation of optical flow for pixels that are detected as being in motion
(see Section 4.4)
It should be noted that the variable threshold can be determined either for the
whole scene or for each pixel (as shown in Figure 4.1), and the lighting compen-
sation values are determined for regions of the scene (not for each pixel).
4.2 Multi-Modal Background modelling
An efficient method of foreground segmentation that is robust and adapts to
lighting and background changes was proposed by Butler [18]. This approach
is similar in design to the Mixture of Gaussians (MoG) approach proposed by Stauffer and Grimson [157], in that each pixel is modelled by a group of weighted modes that describe the likely appearance of the pixel. Unlike the MoG approach, cluster structures consisting of four colour values (stored as
two pairs) and a weight are used to represent the pixel modes.
The algorithm uses Y’CbCr 4:2:2 images as input, and clusters are formed by
pixel pairs (see Figure 4.2). Each pixel in the incoming image has two values, a
luminance and a single chrominance, which alternates between blue chrominance
and red chrominance. Pixels are paired for use in the motion detector, such that
each pixel pair contains two pairs of values (centroids), one of luminance values
and one of chrominance values. This pairing results in motion detection being
effectively performed at half the horizontal resolution of the original image, with
the benefit being increased speed.
Figure 4.2: Motion Detection - The input image (on the left) is converted into clusters (on the right) by pairing pixels
Let p(x_i, y_i, t) be a pixel in the incoming Y'CbCr 4:2:2 image, I(x_i, y_i, t), where [x_i, y_i] is in [0..X−1, 0..Y−1] and t is in [0, T]. A pixel pair, P(x, y, t) (where [x, y] is in [0..X/2−1, 0..Y−1]) is formed from p(x_i, y_i, t) = [y, cb] and p(x_i+1, y_i, t) = [y, cr] to obtain four colour values, P(x, y, t) = [y1, cb, y2, cr] (where x_i = x × 2, and y_i = y). These four values are treated as two centroids ((y1, y2) and (cb, cr)).
Each image pixel, p(xi, yi, t), is only used once when forming pixel pairs P (x, y, t).
As the algorithm pairs consecutive pixels, input images must have an even width.
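The pairing step can be sketched as follows, assuming the 4:2:2 frame is held as an array in which every pixel stores its luminance sample and its alternating chrominance sample; the array layout is an assumption made purely for illustration.

```python
import numpy as np

def form_pixel_pairs(frame_ycbcr422):
    """Pair consecutive Y'CbCr 4:2:2 pixels into cluster centroids.

    frame_ycbcr422 is assumed to be a (Y, X, 2) array in which each pixel holds a
    luminance value and an alternating chrominance value (Cb on even columns, Cr
    on odd columns). Returns a (Y, X/2, 4) array of [y1, cb, y2, cr] pixel pairs,
    i.e. the (y1, y2) and (cb, cr) centroids used by the motion detector.
    """
    height, width, _ = frame_ycbcr422.shape
    assert width % 2 == 0, "consecutive pixels are paired, so the width must be even"

    even = frame_ycbcr422[:, 0::2, :]   # [y1, cb] samples
    odd = frame_ycbcr422[:, 1::2, :]    # [y2, cr] samples
    return np.concatenate([even, odd], axis=2)  # [y1, cb, y2, cr] per pair
```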
Let f(x, y, t) be a frame sequence, and P (x, y, t′) be a pixel pair in the frame at
time t′. Pixel colour history is recorded,
C(x, y, t, 0..K − 1) = [y1, y2, Cb, Cr, w], (4.1)
which represents a multi-modal PDF. K is the number of modes stored for each
pixel. Increasing K results in more memory being required to store the algorithm's data structures and an increase in the time taken to search and sort cluster lists.
However a higher value of K means the algorithm can store a greater number of
modes and maintain a more accurate history for the pixel. Typically, K is set to
6.
Each cluster contains two luminance values (y1 and y2), a blue chrominance value
(Cb), and red chrominance value (Cr) to describe the colour; and a weight, w.
Colour values are accessed and adjusted separately (i.e. the cluster is not a
vector). The weight describes the likelihood of the colour described by that
cluster being observed at that position in the image. Clusters are stored in order
of highest to lowest weight.
For each P (x, y, t) the algorithm makes a decision assigning it to background
or foreground by matching P (x, y, t) to C(x, y, t, k), where k is an index in the
range 0 to K− 1. Clusters are matched to incoming pixels by finding the highest
weighted cluster which satisfies,
|y_1 - C_{y1}(k)| + |y_2 - C_{y2}(k)| < \tau_{Lum},    (4.2)

|Cb - C_{Cb}(k)| + |Cr - C_{Cr}(k)| < \tau_{Chr},    (4.3)

where y1 and Cb are the luminance and chrominance values for the image pixel p(x × 2, y), y2 and Cr are the luminance and chrominance values for the image pixel p(x × 2 + 1, y), C(k) = C(x, y, t, k), and τ_Lum and τ_Chr are fixed thresholds for evaluating matches. The centroid of the matching cluster is adjusted to reflect the current pixel colour,

C(x, y, t, \kappa) = C(x, y, t, \kappa) + \frac{1}{L}\left(P(x, y, t) - C(x, y, t, \kappa)\right),    (4.4)

where κ is the index of the matching cluster; and the weights of all clusters in the pixel's group are adjusted to reflect the new state,

w'_k = w_k + \frac{1}{L}(M_k - w_k),    (4.5)

where w_k is the weight of the cluster being adjusted; L is the inverse of the traditional learning rate, α; and M_k is 1 for the matching cluster and 0 for all others.
If P (x, y, t) does not match any C(x, y, t, k), then the lowest weighted cluster,
C(x, y, t,K − 1), is replaced with a new cluster representing the incoming pixels.
Clusters are gradually adjusted and removed as required, allowing the system to
adapt to changes in the background.
After the updating of weights and clusters, the cluster weights are normalised to
ensure they sum to one using

w_k = \frac{w_k}{\sum_{k=0}^{K-1} w_k}.    (4.6)
When a new cluster is added to the model, its weight is set to 0.01.
Based on the accumulated pixel information, the frame can be classified into
foreground,
fgnd = \forall (x, y, t) \ \text{where} \ \sum_{i=0}^{\kappa} C_w(x, y, t, i) < \tau_{foreground},    (4.7)

where τ_foreground is the foreground/background threshold and κ is the matching cluster index; and background. Pixels that are classified as foreground are said to be in motion.
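The per-pixel-pair processing described above is sketched below. The cluster storage, the re-sorting step and the default values of L, the initial weight and τ_foreground are illustrative assumptions; only the matching, update and classification rules follow Equations (4.2) to (4.7).

```python
import numpy as np

def update_pixel_model(pair, clusters, weights, tau_lum, tau_chr,
                       L=9, tau_foreground=0.7, new_weight=0.01):
    """One step of the multi-modal background model for a single pixel pair.

    pair     : [y1, cb, y2, cr] values for the incoming pixel pair P(x, y, t).
    clusters : (K, 4) float array of stored centroids, ordered by descending weight.
    weights  : length-K float array of cluster weights.
    Returns True if the pair is classified as foreground (in motion).
    """
    y1, cb, y2, cr = pair
    match = None
    for k in range(len(clusters)):
        cy1, ccb, cy2, ccr = clusters[k]
        # Eqs. (4.2) and (4.3): both luminance and chrominance must be close enough;
        # the first (highest weighted) cluster satisfying both is the match.
        if (abs(y1 - cy1) + abs(y2 - cy2) < tau_lum and
                abs(cb - ccb) + abs(cr - ccr) < tau_chr):
            match = k
            break

    if match is None:
        # No match: the lowest weighted cluster is replaced by the incoming pair.
        clusters[-1] = pair
        weights[-1] = new_weight
        match = len(clusters) - 1
    else:
        # Eq. (4.4): move the matching centroid towards the incoming pair.
        clusters[match] += (np.asarray(pair, dtype=float) - clusters[match]) / L

    # Eq. (4.5): adjust all weights, then normalise them to sum to one (Eq. 4.6).
    m = np.zeros(len(weights))
    m[match] = 1.0
    weights += (m - weights) / L
    weights /= weights.sum()

    # Keep clusters ordered from highest to lowest weight.
    order = np.argsort(weights)[::-1]
    clusters[:] = clusters[order]
    weights[:] = weights[order]
    match = int(np.where(order == match)[0][0])

    # Eq. (4.7): foreground if the cumulative weight up to and including the
    # matching cluster is below the foreground threshold.
    return float(weights[:match + 1].sum()) < tau_foreground
```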
4.3 Core Improvements
Three core improvements are made to Butler’s algorithm [18]. These changes
are intended to improve performance of the motion detection, without generating
additional modes of output (i.e. optical flow, multi-layer segmentation). Three
improvements are proposed:
1. Variable threshold - Factors such as scene lighting and variation, the type
of camera being used and type of objects being observed can all affect the
threshold that is used for detecting motion. The use of a variable threshold
is examined, to provide greater flexibility for the motion detection. A global
variable threshold and per pixel variable threshold are proposed.
2. Lighting compensation - Within an outdoor scene, the level of lighting can
change rapidly due to the changing weather (e.g. the sun moving behind
clouds). Sudden changes such as this can result in a large amount of false
motion being detected, leading to poor object detection. An approach to
detect and compensate for lighting changes is proposed.
3. Shadow detection - All objects cast shadows, however when detecting and
tracking objects it is preferable to be able to ignore their shadow. A method
to detect and remove shadows during the motion detection process is pro-
posed.
4.3.1 Variable Threshold
Different surveillance situations contain differing cameras and lighting conditions,
and as such, require different thresholds to evaluate the presence of motion at a
pixel. The algorithm outlined in Section 4.2 calculates two differences for each
pixel every frame. In ideal situations, where there is no noise and no motion,
these differences should be 0. Assuming that there is no motion present in a
video sequence, the difference observed will be a direct result of any noise present
in the sequence.
Analysis of two sequences (see Figures 4.3 and 4.4) shows that the pixel noise
distribution for a surveillance camera is a Gaussian distribution, with a mean of 0.
For two image sequences, each 100 frames in length and containing no motion or
discernible changes in the background (i.e. tree movement, lighting changes), the
differences observed when comparing to a background frame (the first frame in the
sequence) are calculated. Differences are recorded separately for the luminance
channel and for the chrominance channels (blue and red chrominance observations
are combined as they are also combined in the motion detection algorithm). The
distribution of the differences is shown to be approximately Gaussian.

Figure 4.3: Pixel Noise - Indoor Scene - (a) Sample Input, (b) Luminance Distribution, (c) Chrominance Distribution

It should be
noted that these distributions (see Figures 4.3 and 4.4) include differences across
all pixels, and further analysis has shown that each pixel has approximately the
same noise distribution (i.e. sensor noise is constant across the whole sensor).
Figure 4.4: Pixel Noise - Outdoor Scene - (a) Sample Input, (b) Luminance Distribution, (c) Chrominance Distribution

Variable thresholds, τ_Lum(t) and τ_Chr(t), to replace τ_Lum and τ_Chr can be calculated
Variable thresholds, τLum(t) and τChr(t), to replace τLum and τChr, can be calculated by observing the standard deviation of the matching differences,

DiffLum(x, y, t) = |Py1(x, y, t) − Cy1(x, y, t, κ)| + |Py2(x, y, t) − Cy2(x, y, t, κ)|,    (4.8)

σ²Lum(t) = (1 / ((X/2) × Y)) Σ_{x=0, y=0}^{x=X/2−1, y=Y−1} DiffLum(x, y, t)²,    (4.9)

DiffChr(x, y, t) = |PCb(x, y, t) − CCb(x, y, t, κ)| + |PCr(x, y, t) − CCr(x, y, t, κ)|,    (4.10)

σ²Chr(t) = (1 / ((X/2) × Y)) Σ_{x=0, y=0}^{x=X/2−1, y=Y−1} DiffChr(x, y, t)²,    (4.11)
over time (where κ is the index of the matching cluster, and X and Y are the
dimensions of the input image). The variance can be calculated progressively
as each pixel is processed, and the new threshold can then be determined at
the start of each frame. In situations where a matching cluster is not found
(i.e. a new mode is observed), the same values that are used as the initial luminance and chrominance variances, σ²Lum,init and σ²Chr,init, are added instead of DiffLum(x, y, t)² and DiffChr(x, y, t)².
It is assumed that the only source of error in a correct match is sensor noise, and
that the noise forms a Gaussian distribution with a mean of 0. For a Gaussian
distribution, 99% of the values are known to be within 3 standard deviations of
the mean. Given this, the standard deviations are multiplied by three to obtain
the matching thresholds,
τLum(t) = 3 × √(σ²Lum(t)),    (4.12)

τChr(t) = 3 × √(σ²Chr(t)).    (4.13)
To avoid sudden changes in the threshold caused by a high level of sensor noise, or by other errors such as an incorrect match, it is subjected to the same learning rate as the cluster centroids,

σ²Lum(t) = ((L − 1)/L) σ²Lum(t − 1) + (1/L) σ²Lum(t).    (4.14)
The threshold is bounds checked (upper and lower) to ensure that excessive levels of motion or noise do not result in the threshold becoming too high, and that high levels of inactivity do not result in it becoming too sensitive. These bounds are set at extreme threshold values: minimum and maximum luminance thresholds are typically 20 and 130 respectively, and minimum and maximum chrominance thresholds are 10 and 130 respectively.
This process can also be applied at a pixel level, where each pixel has its own
thresholds, τLum(x, y, t) and τChr(x, y, t), based on the variances observed when
matching only that pixel.
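A minimal sketch of the variable-threshold bookkeeping described above is given below, assuming the squared matching differences of Equation 4.9 are accumulated as each cluster position is processed and the smoothed threshold of Equations 4.12 and 4.14 is produced at the start of the next frame. The class and method names are illustrative only; the default bounds follow the luminance values quoted in the text.

```python
import numpy as np

class VariableThreshold:
    """Global variable threshold of Section 4.3.1 (Equations 4.9, 4.12 and 4.14).

    Squared matching differences are accumulated as each cluster position is
    processed; at the start of the next frame the smoothed variance gives a
    threshold of three standard deviations, clamped to fixed bounds.
    """

    def __init__(self, sigma_sq_init, L=9, lower=20.0, upper=130.0):
        self.sigma_sq_init = sigma_sq_init  # variance used when no cluster matches
        self.sigma_sq = sigma_sq_init       # running (smoothed) variance estimate
        self.L = L                          # inverse learning rate, as for centroids
        self.lower, self.upper = lower, upper
        self._sum, self._count = 0.0, 0

    def accumulate(self, diff, matched):
        # diff: per-position matching differences for the current frame;
        # where no cluster matched, the initial variance is added instead.
        sq = np.where(matched, np.asarray(diff, dtype=float) ** 2, self.sigma_sq_init)
        self._sum += float(sq.sum())
        self._count += sq.size

    def next_threshold(self):
        frame_var = self._sum / max(self._count, 1)                     # Eq. 4.9
        self.sigma_sq = ((self.L - 1) / self.L) * self.sigma_sq \
                        + frame_var / self.L                            # Eq. 4.14
        self._sum, self._count = 0.0, 0
        tau = 3.0 * np.sqrt(self.sigma_sq)                              # Eq. 4.12
        return float(np.clip(tau, self.lower, self.upper))
```

The same bookkeeping applied to a per-pixel array of variances, rather than a single scalar, gives the per-pixel variant described above.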
4.3.2 Lighting Compensation
In surveillance situations, particularly outdoor scenarios, lighting levels can
change rapidly resulting in large amounts of erroneous motion. A sample outdoor
scene changing over time due to the sun is shown in Figure 4.5. As this figure
shows, significant changes can occur very rapidly (within a few seconds) and un-
less a high learning rate (or high frame rate) is used, these changes will not be
able to be incorporated into the background quickly enough to avoid errors. As
such, it is important to perform additional processing to negate these lighting
changes.
The distribution of changes observed between the background state (frame 11900)
and the other three images is shown in Figure 4.6. Whilst there is little change
in the chrominance values when compared to the background, there is significant
change in the luminance. The changes in the luminance are not concentrated
about a single value either. This is likely due to a combination of factors, such
as the different materials in the scene and their colours, the angle of the camera
and the angle of the light source (which also alters the shadows that are cast).
It would be ideal to be able to use a single constant value to adjust for the luminance change in a given frame. However, as the luminance change is not constant across a scene, the scene is divided into several small regions, and each is treated separately. Figure 4.7 shows a simple, even partitioning of the image that breaks the image into 25 sub-regions.
Figure 4.5: Sample Outdoor Scene With Changing Light Conditions. Panels: (a) Frame 11900, t = 0 seconds; (b) Frame 12000, t = 4 seconds; (c) Frame 12150, t = 10 seconds; (d) Frame 12500, t = 24 seconds.
It can be seen in Figures 4.8 and 4.9 that the changes in colour are much more uni-
form (once again there is negligible change in the chrominance) when considering
only a small region.
These distributions can be described as the noise distribution (see section 4.3.1)
plus an offset, OLum, indicating the average change in luminance.
The calculation of a distinct luminance offset, OLum(r, t) (where r is the region index in the range [0..R − 1]; in Figure 4.7, R = 25), for each sub-region of a scene is proposed.
Figure 4.6: Background Difference Across Whole Scene
Figure 4.7: Partitioning of Image for Localised Lighting Compensation
At each time step, the weighted average of luminance changes is calculated for each region,

OLum(r, t) = Σ DiffLum(x, y, t) × Cw(x, y, t, κ) / Σ Cw(x, y, t, κ), ∀(x, y) ∈ r,    (4.15)
where κ is the index of the matching cluster.
Figure 4.8: Background Difference Across Region (2,2)
The use of a weighted sum allows pixels that have only recently been created, and so were potentially created partially under the present lighting conditions, to be weighted less relative to those that have been present longer. Provided this value is within a percentage threshold of the previous luminance offset, it is accepted and used for the next frame,

χ ≤ OLum(r, t) / OLum(r, t − 1) ≤ 1/χ,    (4.16)
where χ is the change threshold for the luminance offset and is in the range [0..1].
χ is set to 0.5 for the proposed algorithm, and is used only as a failsafe for extreme
conditions.
If the change in the luminance offset is outside of the acceptable limit, it indicates that one of two things has occurred:
1. A very rapid lighting change has occurred.
2. A large object has entered the area.
Figure 4.9: Background Difference Across Region (3,5)
If the first possibility has occurred, then the change should be accepted; if the
second possibility has occurred, the change should be discarded and the previous
value for the offset should be used. To determine which of these events has taken
place, the weighted standard deviation of the luminance offset is calculated,
σLum(r, t) = √[ Σ (OLum(r, t) − DiffLum(x, y, t))² × Cw(x, y, t, κ) / Σ Cw(x, y, t, κ) ], ∀(x, y) ∈ r.    (4.17)
If a rapid lighting change has occurred, then the change across the region should
be relatively uniform, and the standard deviation low. If a large object has sud-
denly entered (i.e. the region has gone from being empty to being half full), then
(unless under very specific circumstances relating to the colour of the object and
the scene) the change should be highly varied, and the standard deviation would
be high. Given this, a threshold is applied to the weighted standard deviation
such that if the standard deviation is below the threshold the change is accepted,
otherwise it is discarded and the previous value for the luminance offset is used.
The luminance offset is incorporated into the match equation by subtracting half of the luminance offset from each pixel difference, and taking the absolute value to allow for a single comparison,

|P(y1) − C(k)(y1) − OLum(r, t)/2| + |P(y2) − C(k)(y2) − OLum(r, t)/2| < τLum(t).    (4.18)
In situations where coloured lighting is present, the same approach could be
applied to the chrominance threshold to compensate.
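The per-region offset update of Equations 4.15 to 4.17 could be sketched as follows. The function name and the value of the standard deviation threshold (sigma_max) are assumptions made for illustration; the thesis leaves that threshold scene dependent, while χ = 0.5 follows the text.

```python
import numpy as np

def update_luminance_offset(diff_lum, weights, prev_offset,
                            chi=0.5, sigma_max=30.0):
    """Per-region luminance offset update (Equations 4.15 to 4.17).

    diff_lum, weights: matching luminance differences and matching-cluster
    weights for the pixels in region r; prev_offset is O_Lum(r, t-1).
    """
    w_sum = float(weights.sum())
    if w_sum <= 0.0:
        return prev_offset
    offset = float((diff_lum * weights).sum()) / w_sum               # Eq. 4.15

    # Fail-safe: is the new offset within a factor of chi of the previous one?
    ratio = abs(offset) / max(abs(prev_offset), 1e-6)
    if not (chi <= ratio <= 1.0 / chi):                              # Eq. 4.16
        # Distinguish a rapid lighting change (uniform change, low sigma)
        # from a large object entering the region (varied change, high sigma).
        sigma = np.sqrt(float((((offset - diff_lum) ** 2) * weights).sum()) / w_sum)  # Eq. 4.17
        if sigma > sigma_max:
            return prev_offset    # large object: keep the previous offset
    return offset                 # accept the new offset for the next frame
```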
4.3.3 Shadow Detection
Shadows can result in motion being detected where there is none. As such,
it is important to recognise shadows and ensure that they are not recorded as
motion. Shadows can be characterised by the fact that they alter the luminance
component of the object’s colour, but have minimal effect on the chrominance.
This is shown in Figure 4.10, which shows two examples of a shadow being cast
across part of the background (the areas inside the boxes). For images (a) and
(b), the mean luminance change is −123.1 while the mean chrominance change
is negligible at 1.2. The standard deviations for luminance and chrominance
differences are 3.97 and 4.45 respectively, indicating that the luminance change
is fairly uniform, and the chrominance change is very close to the existing noise
distribution. Images (c) and (d) yield similar results, with the mean changes (for
luminance and chrominance) being −92.6 and 1.6 with standard deviations of
37.34 and 6.23 respectively. The increased standard deviation for the luminance
can be attributed to the soft nature of the shadow.
Gradient can also be used to aid in shadow detection, as the area under shadow
Figure 4.10: Shadows cast over a section of the background. Panels (a)-(d).
still retains its original texture (it has only been darkened by a shadow). However,
shadow edges will have a high gradient. Four gradient values can be calculated
for each cluster,
y_gv1 = Iy(xi, yi) − Iy(xi, yi − 1),    (4.19)

y_gh1 = Iy(xi, yi) − Iy(xi − 1, yi),    (4.20)

y_gv2 = Iy(xi + 1, yi) − Iy(xi + 1, yi − 1),    (4.21)

y_gh2 = Iy(xi + 1, yi) − Iy(xi − 1, yi),    (4.22)
where y_gv1 is the vertical gradient for y1, y_gh1 is the horizontal gradient for y1, and Iy is the luminance channel of the input image. As the gradients are being calculated for cluster pairs, x must be even. These values can then be incorporated into the background model so that a cluster becomes

C(x, y, t, k) = [y1, y2, Cb, Cr, y_gv1, y_gh1, y_gv2, y_gh2, w].    (4.23)
The clusters along the top edge of the image have a vertical gradient of 0, and those on the left edge of the image have a horizontal gradient of 0 (as there is no pixel to subtract to obtain a gradient). The gradient values are updated in the same manner as the other values within the cluster; however, they are only used for shadow detection and optical flow calculations (see Section 4.4), as they can cause problems when matching clusters to determine motion: the gradient of a pixel at (x, y) is affected by motion at the pixels at (x − 1, y) and (x, y − 1), which would lead to additional false positives. When an area is under shadow, it
can be assumed that the surrounding pixels are also under the same (or similar)
shadow. As such, the gradient of the pixel should be the same as it is for the
background mode.
Shadow detection is added to the algorithm by adding additional constraints
when matching the incoming pixels to the clusters, and by comparing gradients.
If the initial matching constraints are not satisfied (i.e. motion is detected),
the following constraints are checked to determine if the motion is caused by a
shadow:
0 < (Cy1(k) − y1) + (Cy2(k) − y2) < τLumShad,    (4.24)

|Cb − CCb(k)| + |Cr − CCr(k)| < τChr(t) / τChrShad,    (4.25)

|y_gv1 − C_y_gv1(k)| + |y_gh1 − C_y_gh1(k)| < τGrad,    (4.26)

|y_gv2 − C_y_gv2(k)| + |y_gh2 − C_y_gh2(k)| < τGrad.    (4.27)
If there is a positive difference in the luminance that is less than the prescribed shadow threshold, τLumShad, only a small difference in the chrominance (determined by dividing the chrominance threshold, τChr(t), by an integer τChrShad) and only a small difference in the gradient, less than a gradient threshold, τGrad, then a shadow is present and motion is not detected at P. τLumShad must be greater than τLum(t), and is set to 200 in the proposed system, τChrShad is set to 2, and τGrad is set to 50. These values are fixed (i.e. they do not adapt to changing lighting conditions), and are sufficient to detect most shadows. However, strong shadows cast in bright conditions will often not be detected using these settings.
A pixel that is detected to be in shadow needs to be handled differently elsewhere
in the system. When adjusting the lighting model, and the variable threshold,
pixels that are in shadow are ignored, as the difference incurred when matching
has been significantly altered by the presence of a shadow. As it is not possible
to accurately estimate what effect the shadow has had, it is disregarded.
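A sketch of the shadow test of Equations 4.24 to 4.27, applied only when the normal cluster match fails, might look as follows. The data layout (a dictionary of cluster values) and the function name are illustrative; the threshold values follow those quoted above.

```python
def is_shadow(y1, y2, cb, cr, cluster, tau_chr,
              tau_lum_shad=200, tau_chr_shad=2, tau_grad=50, grads=None):
    """Shadow test applied when the normal match fails (Equations 4.24 to 4.27).

    cluster: background cluster values (y1, y2, cb, cr and the four gradient
    components); grads: gradients of the incoming pixel pair, or None if they
    are unavailable.
    """
    # Luminance has dropped, but by less than the shadow threshold (Eq. 4.24).
    lum_drop = (cluster['y1'] - y1) + (cluster['y2'] - y2)
    if not (0 < lum_drop < tau_lum_shad):
        return False
    # Chrominance is almost unchanged (Eq. 4.25).
    if abs(cb - cluster['cb']) + abs(cr - cluster['cr']) >= tau_chr / tau_chr_shad:
        return False
    # Texture (gradient) is preserved under the shadow (Eqs. 4.26 and 4.27).
    if grads is not None:
        if abs(grads['v1'] - cluster['gv1']) + abs(grads['h1'] - cluster['gh1']) >= tau_grad:
            return False
        if abs(grads['v2'] - cluster['gv2']) + abs(grads['h2'] - cluster['gh2']) >= tau_grad:
            return False
    return True   # treat the pixel pair as shadow, not motion
```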
4.4 Computing Optical Flow Simultaneously
Optical flow algorithms attempt to determine the motion each pixel has under-
gone from one frame to the next, rather than simply if a pixel is in motion. The
incorporation of optical flow into the proposed motion detection algorithm to
provide additional information to processes that use the motion detection (i.e.
tracking systems) is proposed.
Optical flow calculations require the previous frame to be compared to the cur-
rent to determine motion. The need for comparison with the previous frame is
avoided by maintaining a record of the matching cluster for each pixel for the
last frame, essentially storing an approximation of the last frame. The accuracy
of the approximation depends on the thresholds used in the motion detection
(tighter thresholds lead to a more accurate approximation).
Matching is performed over cluster windows and at a pixel resolution (i.e. not
cluster resolution, which is down sampled by 2 in the horizontal direction). Let
W (x1 : x2, y1 : y2, t) be the window of pixels extracted from the incoming frame
centred about (x, y) (the position of the cluster that the flow is being determined for), and W(x1′ : x2′, y1′ : y2′, t − 1) be the window of clusters from
the previous frame that will be compared to W (x1 : x2, y1 : y2, t), centred about
(x′, y′) (a possible position for the cluster in the previous frame). For a compari-
son to be made, the pixel in W (x1′ : x2′, y1′ : y2′, t−1) must have been in motion
at time t − 1. This ensures that no attempts are made to match to parts of the
background. The set of clusters that are compared then becomes
Wfgnd(x, y, t) = P (x, y, t− 1) ∈ fgnd(t− 1) (4.28)
where (x, y) ∈ W (x1′ : x2′, y1′ : y2′, t− 1),
where Wfgnd(x, y, t) is the set of clusters for the window centred at x, y that were
in the foreground last frame. The size of this set (i.e. the number of clusters in
the set) is defined as Pcount.
If this condition is not enforced, then for a cluster that lies on an object boundary,
part of the comparison for the window will be performed against the background
each frame. As the object is moving, the part of the background being compared
to will be constantly changing, and so it can be assumed that there will be no
match between these sections of the window. These poor matches may result in
the whole window being ruled a poor match, thus impeding the ability of the
system to estimate motion.
When motion is detected at a cluster, its surrounding region is examined to
determine the optical flow for that cluster. The surrounding area is analysed
outwards in rings. The centre pixel is checked first, and if a suitable match is
found, searching stops. If there is no match, then the next ’ring’ (at a distance
of one pixel) is searched in full, and so on until a match is found. Each ring
is searched in full, and the best match within the ring (if a match is present at
all) is accepted. Rings may be ’truncated’ to a pair of rows (or columns) if the
maximum horizontal and vertical accelerations are not equal (see Figure 4.11).
Figure 4.11: Search Order for Optical Flow
This method of searching attempts to minimise the acceleration of a pixel, by
taking the first match when searching outwards, rather than taking the best match
in the whole search area. Although the approach aims to minimise acceleration (constant velocity assumption), no restriction is placed on the velocity, as the pixel can continue to accelerate gradually over the course of several frames.
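The ring-by-ring search order of Figure 4.11 could be generated as in the sketch below, which yields the centre first and then successively larger rings, truncated when the maximum horizontal and vertical displacements differ. The generator name and parameters are assumptions made for illustration; a caller would evaluate every candidate in a ring and accept the best match in the first ring that contains one, which is what keeps the implied acceleration small.

```python
def ring_offsets(max_dx, max_dy):
    """Yield candidate displacements ring by ring (search order of Figure 4.11).

    The centre is tried first; each later ring holds the offsets whose largest
    component equals the ring index, truncated to a pair of rows or columns
    when the maximum horizontal and vertical displacements differ.
    """
    yield [(0, 0)]
    for r in range(1, max(max_dx, max_dy) + 1):
        ring = [(dx, dy)
                for dy in range(-min(r, max_dy), min(r, max_dy) + 1)
                for dx in range(-min(r, max_dx), min(r, max_dx) + 1)
                if max(abs(dx), abs(dy)) == r]   # keep only positions on this ring
        yield ring
```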
As the algorithm is intended to function at a pixel resolution, a method to match
across cluster boundaries is required (see Figure 4.12). As the chrominance information is down sampled, it becomes unreliable at pixel resolution and can have a negative effect on the matching. As a result, chrominance is not used for the optical flow calculations; only luminance and luminance gradient information are used for matching. When matching between two complete clusters (i.e. at an even horizontal
distance) the same matching equations as used when comparing clusters for the
motion detection (or for shadow detection in the case of the gradient information)
can be reused (see Equation 4.2 for luminance comparison and Equations 4.26
and 4.27 for gradient comparison). However, note that rather than comparing
cluster values to pixels in the input image, cluster values for the matching cluster
in the current frame are compared to cluster values for the matching cluster in
the previous frame. It is not possible to compare directly to the pixel values from
the previous frame as the previous frame is not stored.
When matching across a cluster boundary, the comparison is effectively between
one cluster in the current image to two in the previous image. In Figure 4.12, the
area marked in red is the cluster in the current image we are trying to find a match
for (C(x, y, t)), and the area marked in blue is the previous frame’s cluster we are trying to match to. The blue area is actually two clusters. C(x′ − 1, y′, t − 1)
is defined as the left most of the two, and C(x′, y′, t − 1) is the right most. The
Figure 4.12: Matching Across Cluster Boundaries - Bold lines indicate cluster groupings; the red cluster location in the current image is being compared to the blue cluster (actually split across two clusters) in the previous image.
comparison between these clusters then becomes
DiffLum(x, y, t) = |Cy1(x, y, t) − Cy1(x′ − 1, y′, t − 1)| + |Cy2(x, y, t) − Cy2(x′, y′, t − 1)|,    (4.29)

DiffGrad(x, y, t) = |C_y_gv1(x, y, t) − C_y_gv1(x′ − 1, y′, t − 1)| + |C_y_gv2(x, y, t) − C_y_gv2(x′, y′, t − 1)| + |C_y_gh1(x, y, t) − C_y_gh1(x′ − 1, y′, t − 1)| + |C_y_gh2(x, y, t) − C_y_gh2(x′, y′, t − 1)|.    (4.30)
When matching across a cluster boundary, C(x′ − 1, y′, t− 1) and C(x′, y′, t− 1)
are checked separately to determine if they were in motion at time t − 1. If
only one was, then only the portion of the comparison that involves that cluster
is performed. In this instance, Pcount is incremented by 0.5 (as only half the
comparison has been performed).
A matching score for a cluster window is obtained by calculating the average error in the luminance and gradient matches (WLum and WGrad) for all foreground pixels in the window,

WLum = (1 / Pcount) Σ_{x,y} DiffLum(x, y, t) where (x, y) ∈ Wfgnd(x, y, t),    (4.31)

WGrad = (1 / Pcount) Σ_{x,y} DiffGrad(x, y, t) where (x, y) ∈ Wfgnd(x, y, t).    (4.32)
WLum and WGrad are compared to a threshold, and if this threshold is met, and
if Pcount > 1, then a potential match has been found (whether this is the actual
match depends on what, if any, other matching windows are detected in the
current search ring). Thresholds for WLum and WGrad are fixed at 50 and 100 for
the proposed system (WGrad is twice WLum as it contains twice the comparisons).
These thresholds do not need to vary as the comparisons taking place are between
colour values recorded in consecutive frames, so there should be little variation
due to environmental changes. If Pcount is less than or equal to 1, the match is discarded
as an invalid match. The primary concern is with detecting optical flow for large
objects, and it is assumed that for these there should be adjacent pixels that are
also in motion. If such pixels are not present (i.e. a match has been made to an isolated pixel), then it is likely the match has been made to noise.
As the algorithm works at pixel resolution, flow is determined to integer precision.
Sub-pixel precision is not pursued, as this level of accuracy is not required for a
tracking application. Once movement for a cluster has been determined, its next
position is predicted. Optical flow is tracked within the system by a simple linear
prediction method. Optical flow information is propagated through the system
from frame to frame, tracking the movement of pixels over time. For each cluster
at time t that has non-zero optical flow, a prediction is propagated forward to
the expected position at time t+ 1 assuming a constant velocity model,
υx(t + 1) = υx(t) + (υx(t) − υx(t − 1)),    (4.33)

υy(t + 1) = υy(t) + (υy(t) − υy(t − 1)),    (4.34)

where υx(t − 1) and υy(t − 1) are the positions of the cluster υ in the previous frame, υx(t) and υy(t) are the positions of the cluster υ in the current frame, and υx(t + 1) and υy(t + 1) are the expected positions of the cluster υ in the next
frame. At time t+1, when the system is processing a cluster that has a prediction
associated with it, it will use the prediction provided as a starting point for the
search, improving system performance. Multiple predictions are allowed to be
propagated forward to a single pixel (i.e. multiple pixels can potentially occupy
the same position in the next frame). When determining flow at that pixel in
the next frame, the centre point of each prediction is checked first, and the best
match (if there is a match at all) is taken as the flow. If there is no match, the
surrounding areas of the each prediction is analysed as described above until a
match is found.
These predictions are stored with an accumulated average velocity (uave and vave
for the horizontal and vertical velocities respectively), and a counter (fcount) to
indicate how many successive frames the pixel has been observed in motion for.
Average velocities are calculated as
uave(t) = uave(t − 1) + (1/Lopf)(u(t) − uave(t − 1)),    (4.35)

vave(t) = vave(t − 1) + (1/Lopf)(v(t) − vave(t − 1)),    (4.36)
where u and v are the horizontal and vertical flows for the current frame, and Lopf
is the learning rate for the average optical flow. Figure 4.13 shows an example
of the optical flow for a cluster being propagated and updated over a series of
frames. At time t, the motion and optical flow are first detected, and the optical flow of the prediction is initialised with the flow detected at time t. At time t + 1 and t + 2, the average flow information is copied from the cluster that was matched to, and is updated to reflect the current state.
Figure 4.13: Optical Flow Tracking, for Lopf = 4
In the event that the optical flow cannot be determined for a cluster (i.e. a
matching window of clusters from the previous frame cannot be found), the list
of predictions for that cluster is used to estimate the flow. The prediction with
the highest counter (fcount) is assumed to indicate the likely flow for this pixel,
and is propagated through as a prediction, and a flag is set to indicate that it
is a prediction. A cluster’s flow information can only be propagated through for
Qp successive frames. Qp is kept small (< 3) as only a simple linear prediction
model is used.
If a cluster is not detected as being in motion, it is stationary and optical flow is
not calculated for that pixel.
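Equations 4.33 to 4.36 amount to a constant-velocity prediction plus a running average of the tracked flow; a minimal sketch is given below, with Lopf = 4 as in Figure 4.13. The function names are illustrative only.

```python
def predict_next_position(pos_t, pos_t_minus_1):
    """Constant-velocity prediction of a cluster position (Equations 4.33 and 4.34)."""
    x, y = pos_t
    xp, yp = pos_t_minus_1
    return (x + (x - xp), y + (y - yp))

def update_average_flow(u_ave, v_ave, u, v, L_opf=4):
    """Running average of the tracked flow (Equations 4.35 and 4.36)."""
    u_ave = u_ave + (u - u_ave) / L_opf
    v_ave = v_ave + (v - v_ave) / L_opf
    return u_ave, v_ave
```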
4.4.1 Detecting Overlapping Objects
A cluster in motion can take on one of four states within the system:
1. New - the first appearance of a cluster, its flow cannot be determined as
there is no appropriate matching cluster.
2. Continuous - the cluster is in motion and a match to the previous frame
has been found (i.e. it has been in motion for two or more frames).
3. Overlap - the cluster was in motion last frame and cannot be found this frame. The space that it should occupy this frame is occupied by another cluster that has moved in from a different direction.
4. Ended - the cluster cannot be found and there is no overlap condition, so
the motion must have ceased.
These four states are illustrated in Figure 4.14.
The state, S(x, y, t) of a pixel pair P (x, y, t) is determined using the tracked
optical flow information.
Figure 4.14: Optical Flow Pixel States. Panels: (a) New, (b) Continuous, (c) Overlap, (d) Ended.
When motion is detected at a cluster, p(x, y, t), and its optical flow is determined (U(x, y, t) and V(x, y, t)), the flow information for p(x, y, t) is obtained by copying from the previous cluster, p(x − U(x, y, t), y − V(x, y, t), t − 1), and updating it to reflect the new situation. The motion arising from the cluster in the previous frame is marked as accounted for.
At the end of the frame, any cluster whose motion from the previous frame has not been accounted for is either involved in an overlap, or its motion has ended. The prediction for the pixel is checked (if there are multiple predictions, then the prediction with the highest fcount) and if the predicted position is occupied by motion, then an overlap has occurred. If there is no motion, then the motion has ended. An overlap can only occur if the cluster that is obscured has been observed for Qv successive frames (i.e. the motion of that cluster has been observed for a period of time and is considered reliable; Qv is fixed at 3 in the proposed system). This helps to reduce erroneous overlaps. Also, for a cluster whose optical flow is obscured by an overlap, the flow information is still propagated through for Qp frames. If
new motion is then detected at the point where the obscured pixel is expected to
reappear, it is assumed that this motion is actually caused by the obscured pixel
reappearing.
The detection of overlapping pixels can aid in detecting overlaps between moving
objects in a scene.
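The end-of-frame decision between the Overlap and Ended states could be sketched as below, assuming a binary motion mask for the current frame and the prediction with the highest fcount for the unaccounted cluster. The enum and function names are illustrative, and Qv = 3 follows the text.

```python
from enum import Enum

class FlowState(Enum):
    NEW = 0         # first appearance; no matching cluster in the previous frame
    CONTINUOUS = 1  # matched to a cluster that was in motion last frame
    OVERLAP = 2     # previous motion unaccounted for; predicted spot is occupied
    ENDED = 3       # previous motion unaccounted for; predicted spot is empty

def classify_unaccounted(prediction, motion_mask, f_count, Qv=3):
    """End-of-frame test for a cluster whose motion was not accounted for.

    prediction: predicted (x, y) of the cluster (the prediction with the
    highest f_count); motion_mask: binary motion image for the current frame;
    Qv: minimum number of successive observed frames before an overlap is accepted.
    """
    x, y = prediction
    if f_count >= Qv and motion_mask[y, x]:
        return FlowState.OVERLAP
    return FlowState.ENDED
```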
4.5 Detecting Stopped Motion
A scene can be further classified into active foreground, temporarily stopped
(static) foreground, and background. Active foreground is defined as motion
that is observed to be moving through the scene, such that any given cluster of
active foreground is only a given colour for a small number of frames. Static
foreground can be defined as motion that remains a constant colour for a period
of tstatic frames or more. To discriminate between active and static foreground,
the algorithm needs to compare the current cluster at a given pixel, to the last
cluster at that location, as well as any static foreground objects that are present
there.
When C(x, y, t, κ) = C(x, y, t − 1, κ), P (x, y, t) has a static layer, Z(x, y, t, z),
initialised, where z is the depth of the layer. Each layer has a counter, c,
and a colour, (y1, y2, Cb, Cr) associated with it. For subsequent frames where
C(x, y, t, κ) = C(x, y, t− 1, κ), Zc(x, y, t, z) is incremented, otherwise it is decre-
mented. Static pixels can be defined as,
∀(x, y, t) ∈ fgnd where Zc(x, y, t, z) >= tstatic. (4.37)
Static pixels can be further organised into layers depending on when the pixel
appears. Layers can be built one on top of the other, as new objects appear and
come to a stop atop an existing static layer. Layers remain until the observed
cluster is matched to either a lower layer, or the background.
The number of static layers available, Ks, is determined by the parameters of
the background model and the requirements of the scene. At least one cluster
must be dedicated to the active foreground and there must be one cluster per
background mode. Given this, the maximum number of static layers is,
Ks = K − Kb − 1,    (4.38)
where K is the total number of clusters in the background model and Kb is the
number of background modes. Typically, Ks = 2 and K = 6.
The algorithm for detecting and updating the static layers for a single pixel is
outlined in Figure 4.15. If the pixel already has static layers, these are compared
against first. If there are no layers, or no matches to existing layers, checks are
performed to see if there is possibly a new static layer forming (last two frames
have the same colour at the pixel). If this is the case, a new static layer is created.
Figure 4.15: Static Layer Matching Flowchart
Each static layer is monitored by a counter which is updated each time step, and
used to determine the state of the layer (i.e. static, to be removed). Counters
are incremented when the layer is detected, and decremented only when a lower
level static layer (or background) is detected. When a higher level static layer
(or active layer) is detected counters are unchanged as the static layer may be
hidden below. Counters are decremented gradually to provide error tolerance
for incorrect cluster matching, or noise. The decrement rate depends on the
scene, with more challenging scenes requiring a slower decrement rate due to
the increased chance of an erroneous cluster match. Layers are removed when
the counter reaches zero, and counters are capped to guarantee that a layer can
be removed in a set number of frames. Increment, decrement, static thresholds
and caps are fixed, and are set to 1, 3, 50, and 100 for the proposed algorithm.
However, these parameters are unlikely to be optimal for all scene configurations.
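A sketch of a single static-layer counter update is shown below. It assumes the quoted parameters (increment 1, decrement 3, cap 100) are applied per frame, which is one possible reading of the text; the caller removes a layer whose counter reaches zero and treats it as static once the counter reaches tstatic = 50 (Equation 4.37).

```python
def update_static_layer(counter, matched_this_layer, lower_layer_seen,
                        increment=1, decrement=3, cap=100):
    """Update the counter of one static layer (Section 4.5).

    The counter grows when this layer is observed, shrinks only when a lower
    layer (or the background) shows through, and is capped so a stale layer
    can always be removed in a bounded number of frames. When a higher layer
    or active motion is seen, the counter is left unchanged, since this layer
    may simply be hidden underneath.
    """
    if matched_this_layer:
        return min(counter + increment, cap)
    if lower_layer_seen:
        return max(counter - decrement, 0)
    return counter
```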
The algorithm has some limitations, in that it is not possible to determine when a lower level static object leaves while a higher level static object remains, or when a lower level object moves in behind a higher level object, due to the relevant pixels being obscured.
4.5.1 Feedback From External Source
It is important to allow changes to occur in the background model as the scene varies, but it is also important to prevent foreground objects of interest from being incorporated into the background. Objects such as stopped cars may remain stationary in the scene for several minutes (or longer), in which time the clusters that model the car will accumulate enough weight that the car is considered part of the background, despite it still being of interest in the scene. One way this can be overcome is by having a very slow learning rate; however, this will then mean that legitimate changes (caused by changing light, or objects being placed in the scene that are not of interest) will also take a very long time to be incorporated into the background. An alternative method is to allow an external process (such as a surveillance system) to impose changes on the background model.
The inverse of the weight adjustment algorithm can be used to prevent the object
from being incorporated into the background model, by effectively stopping all
weight updates so that objects of interest remain in the foreground,
w′k = (L wk − Mk) / (L − 1),    (4.39)
where wk is the weight of the cluster being adjusted; L is the inverse of the learning
rate (lower values will result in background changes being incorporated faster);
and Mk is 1 for the matching cluster and 0 for all others. An external process can
be used to provide a mask image back to the motion detection algorithm, which
can be used to apply the weight reversal.
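The weight reversal of Equation 4.39, applied only inside a mask supplied by the external process, might be sketched as follows. The function name and the mask representation are assumptions made for illustration.

```python
import numpy as np

def reverse_weight_update(w, matched, mask, L=9):
    """Reverse the weight update inside a feedback mask (Equation 4.39).

    w: cluster weights after the normal update; matched: 1 for the matching
    cluster and 0 otherwise; mask: True where the external process has flagged
    an object of interest; L: inverse learning rate. Inside the mask the update
    is undone, so objects of interest are not absorbed into the background.
    """
    reversed_w = (L * w - matched) / (L - 1)
    return np.where(mask, reversed_w, w)
```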
4.6 Evaluation and Testing
The proposed algorithm is evaluated using synthetic data (see Section 4.6.1) and real world data (see Section 4.6.2). The proposed algorithm is compared with the
system proposed by Butler [18]. Motion detection results from the algorithms are
compared to ground truth images to measure the performance of the algorithms.
Performance is measured in terms of false negatives (FN, motion present in the ground truth but not detected) and false positives (FP, motion detected but not present in the ground truth). True positives, true negatives, false positives and false negatives are defined as

TP = GT(x, y) = 1 & M(x, y) = 1,    (4.40)

TN = GT(x, y) = 0 & M(x, y) = 0,    (4.41)

FP = GT(x, y) = 0 & M(x, y) = 1,    (4.42)

FN = GT(x, y) = 1 & M(x, y) = 0,    (4.43)
where GT is the ground truth image and M is the motion result image. Each of these images is a binary image.
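For completeness, the pixel counts of Equations 4.40 to 4.43 can be computed directly from the two binary images, as in the sketch below (the function name is illustrative).

```python
import numpy as np

def confusion_counts(gt, motion):
    """Pixel counts of Equations 4.40 to 4.43 from two binary images."""
    gt = gt.astype(bool)
    motion = motion.astype(bool)
    tp = int(np.sum(gt & motion))     # motion correctly detected
    tn = int(np.sum(~gt & ~motion))   # background correctly rejected
    fp = int(np.sum(~gt & motion))    # motion detected but not in ground truth
    fn = int(np.sum(gt & ~motion))    # motion in ground truth but missed
    return tp, tn, fp, fn
```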
The performance of the optical flow (see Section 4.6.3) is evaluated by attempting
to segment a moving object from a scene, and using visual inspection of the
results to determine performance. The performance of the proposed optical flow
algorithm is compared to the Lucas-Kanade [114] algorithm, the Horn-Schunck
[72] algorithm, and a block matching algorithm.
4.6.1 Synthetic Data Tests
Testing is conducted using synthetic data from the AESOS database 1. A portion of the ACV Motion Detection Database is used to evaluate the proposed system.
This database contains four separate scenes (three outdoor, one indoor) with
animated models drawn on top of the background to simulate the motion. Each
scene contains 16 sets, each of which contains the figures drawn at a different
grey level (from grey level 60 to grey level 220 in increments of 10). An example
of the database is shown in Figure 4.16.
For testing, three sets at different grey levels (80, 150, 210) from each sequence
are used to compare the systems. Thresholds for the motion detectors were set the same across the tests. Butler's algorithm has thresholds set to 50 and 30 for τLum and τChr respectively, while the proposed algorithm uses 20 and 10 as the minimum and 80 and 50 as the maximum of τLum and τChr respectively, with
initial values of 50 and 30. Each system uses K = 6 (number of clusters) and
L = 9 (inverse learning rate).
Four test configurations were run using the proposed system:
1 This database was provided by Advanced Computer Vision GmbH - ACV.
Figure 4.16: AESOS Database Example; GL is the grey level of the synthetic figures in the scene. Panels: (a) Seq 1, GL = 80; (b) Seq 2, GL = 150; (c) Seq 3, GL = 150; (d) Seq 4, GL = 210.
1. Single variable threshold for the whole system with other improvements
(shadow detection and lighting normalisation) disabled.
2. Variable threshold for each pixel with other improvements (shadow detec-
tion and lighting normalisation) disabled.
3. Single variable threshold for the whole system with other improvements
enabled.
4. Variable threshold for each pixel with other improvements enabled.
These configurations are used to assess the benefits of using a single variable threshold for the system versus a threshold for each pixel, and the effects that the other improvements have on the system when neither lighting variation nor shadows are present.
                        GL 80              GL 150             GL 210
Algorithm               FP       FN        FP       FN        FP       FN
Butler                  0.11%    48.13%    0.06%    63.00%    0.03%    78.08%
Proposed (Config 1)     0.14%    42.84%    0.10%    50.74%    0.07%    64.07%
Proposed (Config 2)     0.13%    43.43%    0.095%   51.70%    0.06%    65.17%
Proposed (Config 3)     0.12%    49.00%    0.08%    56.71%    0.07%    63.50%
Proposed (Config 4)     0.11%    49.72%    0.07%    57.66%    0.06%    64.50%

Table 4.1: Synthetic Motion Detection Performance for AESOS Set 1
Thresholds specific to the proposed system are set to τLumShad = 150, τChrShad = 2, τGrad = 50 and χ = 0.75. Note that optical flow performance is not assessed in these tests, as the data set uses textureless synthetic models to provide motion.
Tables 4.1 to 4.4 and Figures 4.17 and 4.18 show the test results and sample
output respectively.
The results show an overall increase in performance between the algorithm of
Butler[18] and the proposed configurations. There is a significant overall decrease
in the rate of false negatives (up to 14.58%) and only a small increase in false
positives (no greater than 0.23%). This improvement is most pronounced in
the datasets where the moving object colour is most similar to the background
(i.e. in set 1, 2 and 3, which are all bright scenes, this is GL210 where the
moving objects are most white). When the moving object’s colour is more distinct
from the background, the performance gain is less, and when shadow detection
and lighting normalisation are enabled, it can actually lead to a performance
                        GL 80              GL 150             GL 210
Algorithm               FP       FN        FP       FN        FP       FN
Butler                  0.06%    52.21%    0.05%    61.99%    0.07%    70.03%
Proposed (Config 1)     0.19%    45.48%    0.17%    51.66%    0.30%    57.07%
Proposed (Config 2)     0.07%    46.12%    0.06%    52.88%    0.06%    58.11%
Proposed (Config 3)     0.18%    51.97%    0.16%    54.21%    0.28%    58.03%
Proposed (Config 4)     0.07%    52.66%    0.06%    55.41%    0.06%    59.07%

Table 4.2: Synthetic Motion Detection Performance for AESOS Set 2
                        GL 80              GL 150             GL 210
Algorithm               FP       FN        FP       FN        FP       FN
Butler                  0.15%    38.74%    0.13%    46.97%    0.16%    33.11%
Proposed (Config 1)     0.23%    27.85%    0.22%    31.95%    0.24%    25.88%
Proposed (Config 2)     0.18%    38.65%    0.17%    37.25%    0.19%    27.28%
Proposed (Config 3)     0.21%    28.83%    0.20%    32.79%    0.22%    26.43%
Proposed (Config 4)     0.17%    39.52%    0.16%    38.07%    0.19%    27.81%

Table 4.3: Synthetic Motion Detection Performance for AESOS Set 3
                        GL 80              GL 150             GL 210
Algorithm               FP       FN        FP       FN        FP       FN
Butler                  0.01%    97.96%    0.01%    98.50%    0.04%    81.30%
Proposed (Config 1)     0.01%    84.24%    0.08%    90.24%    0.13%    71.15%
Proposed (Config 2)     0.02%    84.74%    0.02%    90.89%    0.07%    72.00%
Proposed (Config 3)     0.03%    93.44%    0.04%    89.68%    0.09%    70.86%
Proposed (Config 4)     0.01%    93.70%    0.02%    90.68%    0.06%    71.74%

Table 4.4: Synthetic Motion Detection Performance for AESOS Set 4
Figure 4.17: Synthetic Motion Detection Performance for AESOS Set 1, GL150 (rows show output from Butler and Configs 1-4 across four sample frames)
decrease. This performance drop off is observed primarily with the GL80 sets,
where the foreground objects are significantly darker than the background, and
can be attributed to the shadow detection falsely detecting shadows (and thus
causing a pixel in motion to be classified as non-motion). This is due in part to the nature of the synthetic dataset.
Figure 4.18: Synthetic Motion Detection Performance for AESOS Set 3, GL80 (rows show output from Butler and Configs 1-4 across four sample frames)
The shadow detection works by checking if there has been a luminance decrease, with little change in the chrominance. In datasets 1, 2 and 4 the background is primarily grayscale with little texture, so overlaying a grayscale object will (if the object has lower pixel values than the background) result in a shadow being falsely detected. The second test performed by the shadow detection is to look at the gradient change, with little change in gradient
more likely to indicate a shadow. All four datasets contain environments that are,
for large parts of the scene, devoid of texture and so have gradients of (about)
0. As the motion model also has no texture it also has a gradient of (about) 0,
satisfying the second condition for a shadow. This is shown in Figure 4.17 (fourth
and fifth rows), where it can be seen that the motion is detected effectively at the
edge of the object (where there is a gradient change), but poorly in the centre of
the object where there is little texture. The gradient change at the object edge
results in the system correctly detecting motion rather than a shadow. However,
it can be expected that this process will also result in the edges of some (if not
all) shadows being incorrectly detected as motion.
The configurations that used a single variable threshold (1 and 3) achieved a lower
rate of false negatives, while those that used a variable threshold (2 and 4) for
each pixel achieved a lower rate of false positives. Both approaches outperform
Butler [18] (which uses a fixed threshold) in terms of false negatives, but have
slightly worse performance when considering false positives.
A single threshold will be less impacted by local events in a scene, and so will
respond slower unless the entire scene undergoes a change (i.e. increased noise at
5% of pixels will have little to no effect, increased noise across the whole sensor
will cause the threshold to change). With a threshold for each pixel, if a pixel
is in motion often, and as such is adding new clusters to the model to depict
the scene state, the threshold will increase faster and may reach a point where motion is not detected as effectively. However, this same mechanism that allows
the threshold to increase faster will also result in the threshold dropping faster
when there is no motion.
Given this behaviour, it can be expected that a single variable threshold will result
in a lower rate of false negatives, as activity in a small portion of the scene (i.e.
the addition of new clusters to model the motion that is occurring) will have only
a small effect on the threshold. When using a variable threshold for each pixel,
these thresholds will respond more rapidly to motion and increase, becoming less
sensitive. This increase will result in fewer false positives.
The lighting normalisation is evaluated using the lighting variation dataset within the AESOS database, which contains three datasets that artificially depict varying degrees of lighting variation. These datasets are based on a low grey level scene from Set 1. Figure 4.19 contains an example from the database. In set 1, the lighting oscillates very rapidly, but consistently. The lighting in set 2 oscillates more slowly, but has a variable period and amplitude, and the lighting in set 3 changes in a more random fashion.
Figure 4.19: AESOS Lighting Variation Database. Panels: Set 1, frames 1-4; Set 2, frames 40, 50 and 60; Set 3, frames 50, 55, 60 and 65.
Testing is performed using the same thresholds and parameters used in the earlier tests. For the proposed algorithm, a single variable threshold is used for the whole scene and lighting normalisation is enabled, but shadow detection is disabled, as there are no shadows in the scene.
                Set 01             Set 02             Set 03
Algorithm       FP       FN        FP       FN        FP       FN
Butler          14.31%   39.19%    8.66%    43.93%    9.56%    58.29%
Proposed        0.63%    50.09%    0.77%    43.38%    1.71%    58.66%

Table 4.5: Synthetic Lighting Normalisation Performance
As demonstrated in the previous testing, the shadow detection can have an adverse effect when used with this data due to the lack of colour and texture. The proposed algorithm is once again compared to Butler's [18]. Table
4.5 and Figures 4.20 and 4.21 show the results.
Figure 4.20: Synthetic Lighting Normalisation Performance using Set 02 (input, Butler and proposed outputs at frames 250, 500, 750 and 1000)
As the results show, the proposed lighting compensation approach is effectively
able to reduce the number of false positives whilst still delivering good perfor-
mance when detecting motion pixels. The poor performance in set 1 (when
Figure 4.21: Synthetic Lighting Normalisation Performance using Set 03 (input, Butler and proposed outputs at frames 250, 500, 750 and 1000)
comparing the false negative rate of the proposed system to Butler’s [18]) can be attributed to the extremely rapid rate of lighting change. The proposed approach assumes that the luminance is changing at a constant rate,

Θ(x, y, t + 1) = Θ(x, y, t) + (Θ(x, y, t) − Θ(x, y, t − 1)),    (4.45)
where Θ(x, y, t) is the luminance at a pixel, x, y at time t. This assumption
approximately holds for real world effects such as cloud cover or AGC changes.
However in set 1, the lighting is oscillating extremely rapidly (and therefore chang-
ing direction very often) in a manner which is highly unlikely to be seen in a real
world situation. Due to the frequent changes in direction of the lighting change,
the assumption on which the proposed approach is based is frequently broken.
This causes increased errors in cluster matching, which results in more relaxed
matching thresholds and a higher false negative rate. Butler’s algorithm [18]
however, converges on the average background luminance and as a result, whilst
still producing noisy output (particularly at first when learning the average back-
ground), it outperforms the proposed approach in terms of false negatives for this
set.
4.6.2 Real World Data Tests
Testing is conducted using a 10,000 frame sequence of real world data acquired at
a public passenger drop off area. Twenty frames, which illustrate various effects such as lighting variation, shadows, temporarily stopped objects and overlapping objects, are hand segmented for comparison (it is not practical to hand segment the entire sequence).
The algorithm’s overall performance was compared to Butler’s [18] (see Table
4.6). Incorrect detection of the motion type results in a false negative (FN) and a false positive (FP) being recorded for the appropriate motion types (i.e. active foreground detected when static is expected - FN for static, FP for active; static detected in layer two when expected in layer one - FN and FP for static). The performance of the algorithm at classifying active foreground, static foreground and shadows is measured, to provide an indication of the performance of each component. Shadow detection is measured purely in terms of false positives, as it is expected that no motion should be detected at a shadow (i.e. errors only occur
when shadows are detected as motion). A simple object detector was applied to
the output of the proposed algorithm to locate large foreground objects and apply
feedback to the region they occupy. No morphological operations were applied to
the output of either system.
Thresholds for Butler's algorithm are set to 80 and 50 for τLum and τChr respectively, while the proposed algorithm uses 30 and 20 as the minimum and 130 and 80 as the maximum of τLum and τChr respectively, with initial values of 80 and
50. Each system uses K = 6 (number of clusters) and L = 9 (inverse learning rate). All other parameters are the same as used in the synthetic data tests. The
proposed system is evaluated using both a single variable threshold for the whole
system (Configuration 1) and an independent variable threshold for each pixel
(Configuration 2).
Figure 4.22: Motion Detection Results for Real World Sequence (rows: input, ground truth, Butler's, Config 1 and Config 2 at frames 1175, 1900, 2500 and 3300)
Figure 4.23: Motion Detection Results for Real World Sequence (rows: input, ground truth, Butler's, Config 1 and Config 2 at frames 5575, 6600, 7525 and 7585)
Figures 4.22 and 4.23 show a sample of the output. The top row shows the
original images; the second row shows the ground truth; third is the output
from Butler [18]; the fourth and fifth rows are the output from the proposed algorithm (configurations 1 and 2 respectively).
                 Proposed (Config 1)   Proposed (Config 2)   Butler's Algorithm [18]
                 FP        FN           FP        FN           FP        FN
Active Motion    2.68%     19.86%       1.56%     21.98%       N/A       N/A
Shadow Motion    17.37%    N/A          16.80%    N/A          64.95%    N/A
Static Motion    2.60%     35.60%       1.09%     41.84%       N/A       N/A
Total Motion     5.13%     26.70%       2.57%     30.61%       8.46%     55.49%

Table 4.6: Motion Detection Results for Real World Sequence
In the ground truth and the output from the proposed algorithm, green indicates active foreground, blue indicates static foreground, and red (in the ground truth images only) indicates shadow (which is expected to be detected as no motion in the bottom row).
As Table 4.6 and Figures 4.22 and 4.23 illustrate, the system performs well and is able to discern between static and active foreground objects, as well as cope with lighting changes (see frames 1,900, 2,500 and 3,300 in Figure 4.22) and shadows.
However, the system does struggle to deal with lighting variations where the
background is widely varied, due to the different textures in the region (i.e. the
area around the rails on the left edge of the image, see frame 2,500 and 5,575).
The shadow detection can also affect the motion detection when dark objects
enter, such as the windscreen and windows of the car in frames 7,525 and 7,585.
Despite the limitations of the proposed changes, however, they result in a significant improvement in performance, clearly reducing the rate of false positives and false negatives when compared to [18].
4.6.3 Optical Flow and Overlap Detection Evaluation
To evaluate the performance of the proposed optical flow algorithm, attempts are
made to extract people from several test images. Expected motion is determined
using the ground truth data from the CAVIAR database. The difference between
the median locations in the previous and current frame is used as the expected
average velocity of the object. Extraction is performed by finding pixels where the combined error in the horizontal and vertical flow is less than a threshold,

Imobj = |U − vx| + |V − vy| < τvel,    (4.46)
where U and V are the horizontal and vertical flow images; vx and vy are the
expected movements and Imobj is the extracted object image. τvel is set to 1.5 in
this evaluation.
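Equation 4.46 reduces to a simple thresholded comparison between the flow images and the expected velocity; a sketch is given below, with the function name chosen for illustration only.

```python
import numpy as np

def extract_object(U, V, vx, vy, tau_vel=1.5):
    """Flow-based object extraction of Equation 4.46.

    U, V: horizontal and vertical flow images; vx, vy: expected velocity taken
    from the change in the object's median location between frames; tau_vel
    follows the value used in this evaluation.
    """
    return (np.abs(U - vx) + np.abs(V - vy)) < tau_vel
```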
The performance of the proposed algorithm is compared with that of three other optical flow algorithms from the OpenCV library 2: the Lucas-Kanade [114] algorithm, the Horn-Schunck [72] algorithm, and a block matching algorithm. For
these other optical flow techniques, the input images were first converted to grey
scale. Extraction results are also masked against the motion image generated by
the proposed algorithm to improve clarity when comparing the extracted regions.
Without the masking operation, additional noise would be visible in the extrac-
tion output from the other algorithms. Due to practical considerations, the people
within the images have not been hand segmented to test performance. Instead,
we have simply visually compared the performance of the various algorithms.
As Figures 4.24 to 4.26 show, the proposed algorithm is significantly better at extracting a moving object from the scene; however, the segmentation is unable
2 The Open Source Computer Vision Library is used courtesy of the Intel Corporation and is available for public download from the World Wide Web at http://www.intel.com/research/mrl/research/opencv/.
to extract the entire object, as not all pixels within the object meet the flow
criteria. This could be overcome by either using a morphological close operation,
or increasing the threshold used (i.e. detect pixels that fall within a larger range
of flow values). However, any increase in the threshold will also result in an
increase in noise, or an increased likelihood that pixels belonging to a different
object will be detected.
The other methods suffer from discontinuities around the edge of the person (mov-
ing object), and struggle with patches of movement that are a single colour (i.e.
the person’s clothes). Perhaps their biggest problem however is that they fail to
distinguish background from foreground, resulting in the detection of movement
in the background (this can be seen in the horizontal and vertical flow images),
however as the object extraction uses the motion image as a mask, these errors
are not seen in the extracted object. These errors are brought about by the as-
sumptions made by these techniques. Due to the lighting in the scene, there are
slight fluctuations in the colour of background regions from frame to frame. Lucas
et al.[114] and Horn et al.[72] use spatial intensity gradient information, whereas
the block matching technique uses correlation between image regions to obtain
the flow. However both methods rely on the intensity of corresponding regions
in the images being very similar, and the small fluctuations result in the average
intensity of these corresponding regions varying from frame to frame. When this
occurs on a uniform, featureless surface (i.e. floors, walls), these fluctuations can
result in motion being detected. The proposed algorithm does not suffer from this
problem as it uses a variable threshold to detect motion. This threshold adapts
to the level of noise in the scene to ensure that the small fluctuations observed
under conditions such as fluorescent lighting do not adversely affect the system's performance.
The overlap detection is evaluated using sample sequences from the CAVIAR
database that contain two people overlapping. The optical flow status images that are produced are used to detect areas where there is a high proportion of flow discontinuities, likely to be caused by the edge of a moving object, either
overlapping objects or the edge of a region of motion. Only edges that run
vertically are detected in this test. Further details on the process that is used to
determine overlaps/edges can be found in Chapter 5. Examples of the flow status
image output, and the detected object edges are shown in Figures 4.27, 4.28 and
4.29.
The example sequences (see Figures 4.27, 4.28 and 4.29) show the motion detec-
tion output in the top row, the flow status in the second and the original frame
with the overlaps marked (shaded red bars) in the bottom row. The flow status
images can be interpreted as follows:
• Black - no motion is associated with this pixel.
• Green - new motion is found at this pixel (New).
• Yellow - the motion at this pixel is continuing from last frame (Continuous).
• Red - an overlap is present at this pixel (Overlap).
• Blue - motion was expected at this pixel, but could not be found (Ended).
The example sequences show that overlaps can be detected by analysis of the flow
status images (see Figure 4.27 (j) and (k), Figure 4.28 (j) and (k) and Figure 4.29
(j) and (k)). Several object edges are also detected (see Figure 4.27 (i), Figure
4.28 (i) and (l) and Figure 4.29 (l)), whilst two detections (see Figure 4.27 (l) and (j)) are erroneous, and detect an overlap/edge through the middle of an isolated person. The detections at the edges can be attributed to increased instances of pixels in the new and stopped states at the boundary of the person (i.e. a discontinuity). Further analysis of these detections, analysing the ratio of overlap pixels to new and stopped pixels, could separate these edge detections from overlaps.
The flow status images show a greater concentration of overlap pixels being detected when an occlusion is occurring. There are also isolated overlap pixels detected elsewhere, which can be partially attributed to the dataset. As can be seen in the motion detection results, the motion detection performs poorly around the legs of the people (due to the dark edge at the bottom of the shop front, which is a very similar colour to the pants worn by all subjects) and results in large amounts of new motion being detected about the legs. This results in new motion being detected at the legs in every frame, some of which forms false overlaps in later frames. Despite this, in the three samples shown only 2 false object edges are detected.
4.7 Summary
This chapter has presented a new algorithm for calculating multi-layer foreground
segmentation and optical flow simultaneously. The proposed approach uses the
motion information to help resolve discontinuities when computing the optical flow, in addition to simple short term tracking of flow vectors to improve performance. Any discontinuities that are present are also recorded for use in detecting
events such as overlaps between two moving objects. Several improvements aimed
at improving the motion segmentation performance have also been proposed:
• A variable threshold for use when matching the incoming image to the
background model.
• A lighting compensation method to handle fluctuations in natural lighting
conditions.
• Shadow detection, using luminance and gradient to avoid dark foreground
objects being classed as shadows.
• Detection of stationary foreground regions and the layering of such regions, and the separation of these from moving foreground regions for use in tracking systems.
• A feedback mechanism to allow an external process to alter background
model weights to keep objects of interest out of the background, or force
segmentation errors into the background.
The proposed approach has been evaluated on a public database (AESOS) as
well as in-house captured data, and is shown to offer significant improvements.
A comparison of the use of a single variable threshold for the whole system and a
variable threshold for each pixel has also been performed, and it has been shown
that each can be effective depending on the system requirements and nature of
the data used.
Despite the good performance observed, the proposed system has the following
limitations:
• The proposed lighting compensation technique assumes an approximately
linear rate of change for the scene lighting, resulting in a drop in perfor-
mance when strobe effects are encountered.
• Shadow detection in scenes with textureless foreground and background
and little to no colour can result in motion being incorrectly classified as
shadows.
• Stationary foreground detection can be impeded by objects arriving behind
or leaving from behind an existing stationary layer, due to the pixels of
interest being occluded.
• Optical flow is calculated at pixel resolution, and performs inconsistently
for objects travelling at less than 1 pixel per frame.
As these limitations are only encountered in highly specific situations (strobe
lighting and textureless scenes are only likely to be encountered in synthetic
data), or have no impact on the core motion detection performance (stationary
foreground and optical flow computations do not impact upon motion segmenta-
tion as they take place afterwards), these limitations are considered to be minor.
Figure 4.24: Optical Flow Performance - CAVIAR Set WalkByShop1front, Frame 1640 (panels: Input Image, Motion Image; H-Flow, V-Flow and Object output for the Proposed, LK, HS and BM methods)
Figure 4.25: Optical Flow Performance - CAVIAR Set OneStopNoEnter2front, Frame 541 (panels: Input Image, Motion Image; H-Flow, V-Flow and Object output for the Proposed, LK, HS and BM methods)
Figure 4.26: Optical Flow Performance - CAVIAR Set OneStopNoEnter2front, Frame 1101 (panels: Input Image, Motion Image; H-Flow, V-Flow and Object output for the Proposed, LK, HS and BM methods)
Figure 4.27: Overlap Detection - Example 1 (panels (a)-(l))
Figure 4.28: Overlap Detection - Example 2 (panels (a)-(l))
Figure 4.29: Overlap Detection - Example 3 (panels (a)-(l))
Chapter 5
Object Detection
5.1 Introduction
To take full advantage of a motion detection routine capable of determining op-
tical flow and distinguishing moving objects from those that are temporarily
stopped (see Chapter 4), object detection routines need to be adapted to in-
corporate the additional information. The proposed motion detection method
provides three additional cues that require special handling:
• Optical flow.
• Detection of static foreground.
• Detection of overlaps between moving objects.
This chapter will address how this information is used in the object detection pro-
cess, and evaluate the improvements that are gained by using these in a tracking
system.
5.2 Optical Flow
Optical flow can be used to aid in the detection of objects that are currently
moving in the scene. If an object has been observed for several frames, a reason-
able estimate of its velocity can be made. This velocity can be used to extract a
candidate region based on matching optical flow values, using the equation,
I_{Obj}(n, t) = |U(t) - T^n_u| + |V(t) - T^n_v| < \tau_{vel},    (5.1)
where I_{Obj}(n, t) is the candidate image based on the optical flow information for
the tracked object T^n at time t, U and V are the horizontal and vertical
flow images, T^n_u and T^n_v are the expected horizontal and vertical velocities for
the tracked object n, and τvel is the optical flow error tolerance for the object
detection. τvel is typically set to 1.5 in the proposed system.
The operation is applied over a region that corresponds to the expected position of
the target object. This position is based on the previous observed position, offset
by the expected movement. The region is padded (expanded) by a few pixels (the
exact amount varies depending on the dataset, size and speed of objects, frame
rate etc.) to account for errors in the previous detection, or changes in direction.
The extracted region is likely to be incomplete (i.e. the region may contain
holes), due to inconsistencies in the optical flow. To counter this, a morphological
close operation is performed on the region. This region is then processed in
the same manner as for regular object detection (see Section 3.3), except any
detected candidate can only be matched to the intended target track (n). Like
the previously discussed object detection procedures, any detected and matched
region is removed from the motion detection images to prevent the same motion
being assigned to multiple objects.
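As an illustration, the candidate extraction of Equation 5.1 could be implemented along the following lines (a sketch only, assuming NumPy arrays for the flow images and OpenCV for the morphological close; the function name, the 3x3 structuring element and the padding value are illustrative rather than those of the implementation used in this work):

import cv2
import numpy as np

def flow_candidate_mask(U, V, track_u, track_v, bbox, tau_vel=1.5, pad=5):
    """Extract a candidate object mask by matching the optical flow to a
    track's expected velocity (Equation 5.1), within a padded search region
    around the track's predicted position."""
    rows, cols = U.shape
    x0, y0, x1, y1 = bbox
    # Pad the expected region to allow for detection error or direction changes.
    x0, y0 = max(0, x0 - pad), max(0, y0 - pad)
    x1, y1 = min(cols, x1 + pad), min(rows, y1 + pad)

    # |U - T^n_u| + |V - T^n_v| < tau_vel, evaluated inside the search region.
    err = np.abs(U[y0:y1, x0:x1] - track_u) + np.abs(V[y0:y1, x0:x1] - track_v)
    mask = np.zeros((rows, cols), dtype=np.uint8)
    mask[y0:y1, x0:x1] = (err < tau_vel).astype(np.uint8)

    # Close small holes caused by inconsistencies in the optical flow.
    kernel = np.ones((3, 3), dtype=np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

The padded search region mirrors the expansion described above, and any holes caused by inconsistent flow are closed before the region is passed to the regular object detection stage.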
This detection process can only be applied to objects in the Active state (see
Section 3.2.1). This ensures that the object has been tracked for a suitable
number of frames to allow a reasonable estimate of the velocity to be made.
5.3 Using Static Foreground
When an object stops, the optical flow for the object (or at least a significant
portion of the object) becomes zero, and is therefore not an ideal mode for de-
tection (the background, as well as any number of other stopped objects will also
have an optical flow of zero). The static foreground output from the proposed
motion detector, and colour can be used to detect objects in instances such as
these.
To allow the system to effectively manage the tracking of moving and stationary
objects, the Active state (see Section 3.2.1 for details on the states of tracked
objects within the tracking framework) is divided into two sub-states, Moving
and Static. Figure 5.1 shows the updated state diagram that incorporates the
two types of tracked objects (moving and static). Objects can only enter and exit
the Active state as moving objects. Static objects can only occur when an object
that has been observed comes to a stop (i.e. an object cannot suddenly appear in
the image and then not move), and cannot suddenly disappear either (if detection
of a static object fails, it is assumed that this is due to the object moving, so it is no
longer static). Objects enter the static state when the average velocity calculated
according to median bounding box position drops below τstatic. τstatic is set to 0.5
in the proposed system (i.e. the median pixel moves less than 1 pixel every two
frames).
To detect objects in the static foreground image, a method to determine when an
object becomes stationary is required. This can be achieved by monitoring the
Figure 5.1: State Diagram Incorporating Static Objects
object’s average velocity over several frames (the average velocity is defined as
the average movement of the centre of the object's bounding box). When this velocity
approaches zero (less than a threshold τstatic), the object is considered stationary
and the tracked object enters the Static sub-state. Detection of this track is
now possible using the static foreground image, though it may take several more
frames for static pixels to appear in the motion images depending on the threshold
for static pixels in the motion detector (see Section 4.5). Optical flow is not used
to ascertain if an object is stationary, as a stationary object does not necessarily
have an average optical flow of zero. For example, a person might be standing
still but waving their arms, which will yield only a small (if any) change in the
bounding box (and average velocity calculated based on position), but result in
non-zero optical flow.
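A rough sketch of this stationarity test is given below (the window length is an assumption; τstatic = 0.5 as in the proposed system):

import numpy as np

def is_stationary(centre_history, tau_static=0.5, window=10):
    """Return True when the average per-frame movement of the bounding-box
    centre over the last `window` frames falls below tau_static pixels."""
    if len(centre_history) < window:
        return False
    centres = np.asarray(centre_history[-window:], dtype=float)
    step_sizes = np.linalg.norm(np.diff(centres, axis=0), axis=1)
    return step_sizes.mean() < tau_static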
When a tracked object, T n, enters the static sub-state, a template image is created
(IST,n, set to a size equal to the width and height of T n, plus a small tolerance to
account for any detection and segmentation error in recent frames, typically no
more than 3 pixels) to indicate what pixels belong to T n and their motion mode
(static foreground and the layer, active foreground, or pixel does not belong to
T n). A static tracked object may consist of some active pixels (i.e. a person may
be standing still except for their head), and some pixels may change state from
active to static and vice versa whilst the object is static (i.e. a person may be
standing still, move an arm, and then be still again).
IST,n does not store colour, as this would be redundant. Assuming that a static
pixel remains present at a given location, the colour for that pixel is unchanging,
whilst it can be assumed that any active pixels are likely to have a changing
colour. When a new static pixel is added to the IST,n, its colour is checked to see
if it is present in the histogram of T n, and is only accepted if the colour is present
in the histogram (i.e. the object contains pixels of that colour). For any active
pixels that are present, their colour is verified each frame, as there is no way of
knowing that the active pixel present at x, y at time t, is the same pixel at time
t+ 1.
IST,n is used to detect the object in subsequent frames after its creation. For each
pixel in IST,n, the algorithm checks if the state indicated by the template is still
valid (i.e. template indicates a static pixel, layer 1 - check if the static foreground
image has a static pixel at layer 1 present), and if so, this pixel has been detected
and verified. If the expected state cannot be detected, then additional states
are checked (i.e. check the active foreground). Once T n has been flagged as
stationary, and IST,n has been created, IST,n cannot be resized (i.e. the detected
object will remain the same size while stationary).
Figure 5.2 shows an example of the detection and update process using the tem-
plate image. In Figure 5.2, the input template contains pixels in both the static
foreground state (blue) and active foreground state (green). For pixels in the ini-
tial template, the system checks if the mode indicated is still valid, and if so, that
state remains. For pixels in the template where the state no longer exists (such
as the static pixels in the upper middle of the template), the system checks for
other motion modes which may be valid. The resultant updated object template
is then stored and used in the next frame.
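The per-pixel template update can be sketched as follows (a simplified illustration assuming integer state codes and a per-pixel static-foreground layer image from the motion detector; the colour verification of new static pixels and of active pixels described above is omitted for brevity):

import numpy as np

# Illustrative template states.
NOT_OBJECT, STATIC, ACTIVE = 0, 1, 2

def update_static_template(template, static_layer, static_fg, active_fg):
    """Re-verify each template pixel against the current motion output.

    template     : array of NOT_OBJECT / STATIC / ACTIVE codes for the object
    static_layer : static-foreground layer expected for this object's pixels
    static_fg    : per-pixel static-foreground layer index (0 where none)
    active_fg    : boolean active-foreground mask
    """
    updated = template.copy()
    layer_ok = static_fg == static_layer

    # Pixels expected to be static: keep them if the layer is still present,
    # otherwise check whether they have become active foreground instead.
    static_px = template == STATIC
    updated[static_px & ~layer_ok & active_fg] = ACTIVE
    updated[static_px & ~layer_ok & ~active_fg] = NOT_OBJECT

    # Pixels expected to be active: they may have become static, remain
    # active, or no longer belong to the object.
    active_px = template == ACTIVE
    updated[active_px & layer_ok] = STATIC
    updated[active_px & ~layer_ok & ~active_fg] = NOT_OBJECT
    return updated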
Figure 5.2: Static Object Detection using the Template Image
Only the template image is used for detection until the object begins to move
again. Movement can be detected by a significant increase in the amount of
active foreground present in the template image, or by a decrease in the number
of pixels detected as being part of the objects. Movement may also result in failure
to detect the object depending on scene characteristics (it is expected that for a
static object, it should be detected every frame, and a failure to detect indicates
that the object is no longer static). In the case of the object detection failing, the
object immediately ceases to be static and object detection is re-attempted using
the other detection methods. If movement is detected either by a decrease in the
number of pixels belonging to the object, or a large decrease in the number of
static pixels, then in the next frame the system will revert to the default detection
routines (see Section 3.3) to locate the object.
5.4 Detecting Overlaps
The proposed motion detection algorithm (see Chapter 4) is able to associate
a state with every pixel (no motion, new, continuous, overlap, ended), that in-
dicates the type of motion observed. This information can be used to detect
discontinuities in regions of motion, such as instances where two objects are over-
lapping. Unlike using static foreground (see Section 5.3), this process does not
require one of the objects to be stationary.
Of the five potential states for a pixel, three indicate a discontinuity:
1. New - new motion has been detected, likely cause is an object entering or
reappearing from occlusion.
2. Overlap - a motion mode that was observed last frame cannot be found and
the predicted location for it is occupied by a different mode, likely cause is
two objects overlapping.
3. Ended - motion has not been detected where it was expected, likely cause
is an object leaving or becoming obscured behind an obstacle.
All three states can also arise as a result of inaccurate optical flow computation.
One of the states indicates regular motion (continuous) and the other indicates
that there is no motion present.
To detect overlaps in an image, groups of discontinuities need to be found. A
vertical projection of the pixel states can be used to determine the amount of each
type of motion in each column of the image. Regions where there is a higher ratio
of discontinuity pixels to continuous pixels indicate that an overlap is occurring.
An example of an overlap is shown in Figure 5.3. The flow status image shows
the type of motion detected at the pixels. In this image, yellow represents the
continuous state, green is new, red is overlap and blue is ended.
Figure 5.3: Example Image Containing a Discontinuity ((a) Input Image, (b) Motion Image, (c) Flow Status Image)
Given the input images in Figure 5.3, vertical projections for the states can be
calculated using
v_{proj}(i, S) = \sum_{j=0}^{N-1} I_S(i, j),    (5.2)
where v_{proj}(i, S) is the vertical projection at column i for the pixel state S, j is
the row index and N is the number of rows (height) of the input image, and I_S is
a binary image which equals 1 for pixels that have the state S, and 0 otherwise.
Figure 5.4 (a) shows a plot of the vertical projections for these states. To detect
overlaps, the states that represent discontinuities are summed using the equation,
v_{proj}(discont) = v_{proj}(overlap) + \alpha_1 v_{proj}(new) + \alpha_2 v_{proj}(ended),    (5.3)
where vproj(discont) is the summed vertical projection, vproj(overlap), vproj(new)
and vproj(ended) are the vertical projections for the overlap state, new state and
ended state respectively, and α1 and α2 are weights applied to the new and
ended state respectively. These weights are used to compensate for situations where
the optical flow is performing poorly, and there is a large proportion of motion
detected in the new state. In the proposed system, both weights are set to 0.5.
The combined vertical projection can be compared to that of the continuous state
to determine the location of overlaps (see Figure 5.4 (b)). An overlap is detected
when,

v_{proj}(i)(discont) / v_{proj}(i)(continuous) \geq \tau_{SOv},    (5.4)
where i is the image column being analysed, vproj(continuous) is the vertical
projection for the continuous state and τSOv is the threshold for an overlap (set
to 1 in the proposed system). Figure 5.5 shows the input image with the detected
overlap area shaded.
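A minimal sketch of Equations 5.2 to 5.4 is given below (assuming the flow status image is stored as an array of small integer state codes; the codes themselves are illustrative):

import numpy as np

# Illustrative codes for the flow status image.
NONE, NEW, CONTINUOUS, OVERLAP, ENDED = 0, 1, 2, 3, 4

def vertical_projection(status, state):
    """Equation 5.2: number of pixels with the given state in each column."""
    return (status == state).sum(axis=0)

def detect_overlap_columns(status, alpha1=0.5, alpha2=0.5, tau_sov=1.0):
    """Equations 5.3 and 5.4: flag columns where discontinuity pixels
    outweigh continuous pixels as containing an overlap."""
    v_discont = (vertical_projection(status, OVERLAP)
                 + alpha1 * vertical_projection(status, NEW)
                 + alpha2 * vertical_projection(status, ENDED))
    v_cont = vertical_projection(status, CONTINUOUS)
    # Guard against division by zero in columns with no continuous motion.
    return v_discont / np.maximum(v_cont, 1) >= tau_sov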
Figure 5.4: Flow Status Vertical Projections (panels (a) and (b))
Figure 5.5: Detected Discontinuities
In some instances, discontinuities will also be detected at the edge of a moving
object (see Figure 5.6). This is caused by an increased rate of pixels in the new
and ended states at discontinuities such as object edges. It is important to be
able to distinguish between discontinuities caused by the object edge, and by
overlapping objects.
Figure 5.6: Example Image Containing Different Types of Discontinuities ((a) Input Image, (b) Motion Image, (c) Flow Status Image, (d) Detected Discontinuities)
The ratio of the pixel states that represent a discontinuity, as well as the amount
of motion detected either side of the discontinuity can be used for classification.
Figure 5.7 (b) shows the vertical projections of the different discontinuity types.
For the first discontinuity, the overlap state is highest, whilst for the second the
new state dominates. If the equation,
v_{proj}(i)(overlap) / v_{proj}(i)(new) \geq \tau_{SOvType},    (5.5)
is true, then the discontinuity is caused by an overlap, otherwise, it represents
the edge of an object.
In the event that an edge is detected, the amount of motion either side of the
discontinuity is analysed (sum of the vertical projection for n pixels either side,
n is set to 5% of the image width). If the amounts of motion are approximately
equal, then the edge is discarded as a false detection (i.e. it is not a valid edge,
and as the overlap state is less prominent than the new state it is not a valid
overlap either). For non-rigid objects (i.e. people) it is likely that discontinu-
ities will be detected at the edges (potentially every frame), as the motion of the
Figure 5.7: Flow Status Vertical Projections (panels (a) and (b))
arms and legs will violate the constant velocity assumption made in the optical
flow calculations. For a person walking through a scene, this means that a dis-
continuity may be detected in between their legs, and divide the person in two.
However, this discontinuity is likely to be caused by the detection of new motion,
and so by checking the amount of motion either side, it can be determined that
the discontinuity is in fact not at an edge and be discarded. Figure 5.8 shows the
classified discontinuities (Red - Overlap, Green - Edge).
Figure 5.8: Classified Discontinuities
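The classification step of Equation 5.5, together with the motion-balance test used to discard false edges, can be sketched as follows (the 5% window follows the description above, while the helper signature and the threshold used to judge "approximately equal" motion are assumptions for illustration):

import numpy as np

def classify_discontinuity(v_overlap, v_new, v_motion, col,
                           tau_type=1.0, side_frac=0.05, balance=0.8):
    """Classify the discontinuity at column `col` as an overlap, an object
    edge, or a false detection.

    v_overlap, v_new : vertical projections of the overlap and new states
    v_motion         : vertical projection of all moving pixels
    """
    # Equation 5.5: overlap pixels dominate the new pixels.
    if v_overlap[col] >= tau_type * max(v_new[col], 1):
        return "overlap"
    # Candidate edge: compare the motion either side of the discontinuity.
    n = max(1, int(side_frac * len(v_motion)))
    left = v_motion[max(0, col - n):col].sum()
    right = v_motion[col + 1:col + 1 + n].sum()
    if max(left, right) > 0 and min(left, right) / max(left, right) >= balance:
        return "false"   # motion roughly equal on both sides, discard
    return "edge"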
5.5 Integration into the Tracking System
The additions to the detection system are incorporated into the system as shown
in Figure 5.9.
Figure 5.9: Tracking Algorithm Flowchart with Modified Object Detection Routines (additions/changes to baseline system shown in yellow)
Known objects (those that have been detected in the last frame) are detected and
updated first, using the methods described in Sections 5.2 and 5.3 (depending on
if the object is stationary). This process is shown in Figure 5.10. For known
objects that are successfully detected, their motion is removed from the motion
images. The adjusted motion images are then processed by the object detection
routines to locate any remaining objects in the scene. These routines are modified
to incorporate the detection of overlaps as described in Section 5.4. At the end
of each frame, the locations of the known objects are used to provide feedback
to the motion detector, to ensure that motion that has been associated with an
object remains separate from the background. Motion that is not associated with
an object will gradually be incorporated into the background.
Figure 5.10: Process for Detecting a Known Object
The process that detects known objects is shown in Figure 5.10. For an object to
be detected, it must be in either the Active or Occluded state. Objects are required to be
in the system for several frames prior to this detection to allow an estimate of the
optical flow to be made for detection (static detection will not occur this quickly,
as the time required for a pixel to be considered stationary far exceeds the time
required for an object to be considered Active or Occluded).
produce one or more candidate objects which are matched to the intended target
in the same manner as the original system. If a match is found, the target is
updated. If not, the system attempts to detect the object using the standard
detection methods once all other known objects have been processed.
The process for updating the histogram is modified to handle the
multiple modes of motion. Updates that occur when the tracked object is not
stationary (i.e. in the moving sub-state) are processed as they otherwise would
be. When the object is stationary, there is a possibility that static foreground
comprises part of the object, and that the static foreground component could be
behind an active foreground component, so the colour at that pixel in the input
image is not the colour of the static layer and the object in question. As such,
when an object is stationary and has a valid static template, the template is used
as a guide to update the histogram. The pixels that the static template indicates
to be present are used when updating the histogram.
The histogram comparison that occurs when dealing with an ambiguous match
(see Section 3.4) between a detected object and two tracked objects does not need
to be modified, as it is not possible for a stationary object to be involved in such
a comparison. The process of detecting the stationary objects occurs prior to the
standard object detection (see Section 3.3), so any stationary objects will already
be matched (and thus not involved in any further comparisons). Furthermore, a
stationary object that is not detected by the stationary object detection process is
considered to be no longer stationary, resulting in the stationary object template
being reset. For objects in this situation (i.e. having ceased to be stationary as
of the current frame), it is unknown which, if any pixels are in static foreground
as the template is no longer valid and has been reset (there should be very few,
if any pixels in static foreground). As such, any comparison can only use active
foreground images, and no additional modifications are required.
5.6 Evaluation and Testing
The modified system is evaluated using the same procedure detailed in Section
3.5, and a comparison is made to the results of the baseline system (see Section
3.5.4). Details on the metrics used and annotation of the tracking output can be
found in Sections 3.5.1 and 3.5.3 respectively. Configuration files used to test the
baseline system are modified to configure the new motion detector, and detection
routines. All existing configuration parameters are left unchanged. The initial
threshold for the proposed motion detection routine is set to the threshold used
in the baseline system (see Section 3.5.2 for baseline system parameters). Values
for the new system parameters for each dataset group are shown in Table 5.1 for
parameters that are constant for all configurations, and Table 5.2 for parameters
that vary between datasets.
Parameter     Value
Motion Detection Parameters
τMinLum       30
τMaxLum       130
τMinChr       20
τMaxChr       90
τChrShad      2
τGrad         25
Object Detection Parameters
τVel          1.5

Table 5.1: System Parameters - Additional parameters for system configuration.
The proposed motion detector is capable of using either a global variable thresh-
old, or a variable threshold for each pixel (see Section 4.3.1), and it was found
Parameter     RD    BC    AP    BE
Motion Detection Parameters
τLumShad      300   100   300   300

Table 5.2: System Parameters - Additional parameters for system configuration specific to each dataset group.
that whilst the use of a single variable threshold resulted in a lower rate of false
negatives, the use of a variable threshold per pixel resulted in a lower rate of
false positives (see Section 4.6). This difference in performance results makes it
unclear which configuration is better suited to tracking, or if the type of threshold
used should be dictated by the characteristics of the data.
Testing is performed using both a single variable threshold, and an individual
variable threshold per pixel. Overall results are shown in Tables 5.3 (single vari-
able threshold) and 5.4 (individual variable threshold).
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.67                0.94                   0.53
BC         0.49                0.91                   0.46
AP         0.65                0.92                   0.60
BE         0.37                0.86                   0.31

Table 5.3: Overall Tracking System Performance using a Global Variable Threshold for Motion Detection (see Section 3.5.1 for an explanation of metrics)
Comparing the performance when using a single variable threshold for the whole
system against using a variable threshold for each pixel, it can be seen that for
the BC, the use of a variable pixel per threshold results in a performance im-
provement, whilst for the RD and AP datasets, there is a performance reduction
when using individual variable thresholds. The overall results shown for the BE
dataset are misleading. Table 5.5 shows a breakdown of the results for the BE
datasets by camera.
It can be seen that whilst a global variable threshold offers better performance
for camera 1, an individual variable threshold performs better for camera 3.
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.67                0.94                   0.53
BC         0.49                0.91                   0.49
AP         0.62                0.92                   0.59
BE         0.36                0.78                   0.31

Table 5.4: Overall Tracking System Performance using Individual Variable Thresholds for Motion Detection (see Section 3.5.1 for an explanation of metrics)
Camera   Global Variable Threshold   Individual Variable Thresholds
         DOv    LOv    TOv           DOv    LOv    TOv
C1       0.57   0.91   0.36          0.55   0.90   0.32
C3       0.16   0.81   0.26          0.17   0.66   0.29

Table 5.5: Performance of BE Cameras Using Different Motion Threshold Approaches (DOv, LOv and TOv are overall detection, localisation and tracking metric results respectively, see Section 3.5.1 for an explanation of metrics)
The varying performance of different thresholding approaches on different
datasets can be explained by the nature of each dataset. The RD and AP datasets
contain little camera noise, and do not encounter significant problems with com-
plex shadows or reflections, or objects that are difficult to distinguish from the
background. As a result, the use of a single variable threshold for the whole
scene results in a tighter threshold (fewer false negatives, more false positives),
which allows the moving objects to be more effectively segmented from the background.
As the data is not prone to excessive spurious motion, the small increase in false
positives does not have a significant impact.
The BC and BE datasets (particularly BE-C3) however contain more challenging
conditions for motion segmentation. The BC datasets contain complex reflections
and shadows on the floor of the hallway in which the dataset is captured, whilst
the BE datasets contain significant camera noise. These datasets also contain
several people with clothing a very similar colour to the background, making
detection of these people more difficult. In these situations, the ability to have a
threshold for each pixel is advantageous. In regions of the scene where there is
noise present, the threshold can quickly be raised, whilst in regions where there
is little noise, the threshold can be lowered to aid in detecting moving objects
when they enter the scene. Whilst the BC datasets don’t contain significant camera
noise, they still benefit from the individual thresholds. The BC datasets contain
large amounts of motion when compared to the other datasets, meaning that
a global variable threshold is less likely to drop quickly to help detect hard to
distinguish foreground regions. The disadvantage of this however, is that the more
sensitive thresholds make the detection and removal of shadows more difficult,
and so little to no improvement in handling shadows is noticed when compared to
the baseline system.
Given the differences between the datasets (i.e. capture environment), and their
suitability to the different modes of thresholding, the proposed tracking system
will be evaluated using a single variable threshold for the RD and AP datasets,
and variable thresholds for each pixel for the BC and BE datasets. Whilst BE-C1
does achieve better performance using a global threshold, for simplicity, and given
that BE-C3 is the more challenging camera view, individual variable thresholds
will be used for all BE datasets. In a real-world deployment of such a system,
system parameters would be tailored to the scene's requirements. As such, the
selection of an appropriate thresholding mode for each dataset is valid.
The overall results for the improved system are shown in Tables 5.6 and 5.7.
Detailed results for each dataset are shown in Appendix B.
The performance of the RD datasets (see Table B.1 and B.2 in Appendix B)
is significantly better when using the modified system incorporating the new
motion detection and detection routines. Significant increases in both the overall
detection and tracking metrics are observed, as a result of the modified system's
ability to continue to track objects once they have stopped for a period of
Data   Detection      Localisation               Tracking
Set    D1    D2       L1    L2    L3    L4       T1    T2    T3    T4    T5
RD     0.82  0.63     0.81  1.00  0.98  1.00     0.50  0.36  0.88  0.92  0.46
BC     0.80  0.41     0.73  1.00  0.97  0.99     0.49  0.30  0.74  0.83  0.36
AP     0.82  0.61     0.75  1.00  1.00  1.00     0.60  0.39  0.97  0.97  0.42
BE     0.68  0.28     0.57  0.87  0.85  1.00     0.24  0.15  0.78  0.87  0.21

Table 5.6: Tracking System with Improved Motion Detection Results (see Section 3.5.1 for an explanation of metrics)
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.67                0.94                   0.53
BC         0.49                0.91                   0.49
AP         0.65                0.92                   0.60
BE         0.36                0.78                   0.31

Table 5.7: Tracking System with Improved Motion Detection Overall Results (see Section 3.5.1 for an explanation of metrics)
time (through the use of multi-layer motion detection and feedback to prevent
the motion from being incorporated into the background). An example of this is
shown in Figure 5.11 (top row shows the output of the baseline system, bottom
row shows the output of the tracking system with the proposed modifications).
The loss of tracking on stationary objects due to them becoming part of the
background also results in invalid tracks when the objects begin to move again.
This results in objects being detected in the place where the car was, as the
motion detector has wrongly learned that the primary background mode for that
region is the car.
The use of static foreground and a detection routine to locate objects that have
stopped, and are visible in static foreground, also results in improved detection
and tracking when the objects begin to move again. Figure 5.12 shows a situation
where a car that has been parked begins to move again (on the far side of the
road). The baseline system is able to detect a car, but is unable to correctly
localise it due to the errors in motion detection caused by the car being incor-
Figure 5.11: Example output from RD7 - Maintaining Tracking of Temporarily Stopped Objects (the car on the far side of the road); frames 150-650, baseline system output (top row) and proposed system output (bottom row)
porated into the background. As a result, the car is not tracked correctly and
an object is falsely detected at the location where the car was, which results in
further tracking errors when a second car passes later on. The improved tracking
system using the proposed motion detection does not suffer from these problems,
as the parked car is never moved into the background, and so there is no false
motion when it begins to move again.
Figure 5.12: Example output from RD7 - Improved detection and localisation of objects that have been stationary for long periods of time (the car on the far side of the road); frames 2100-2400
The performance of the BC datasets is shown in Table B.3 and B.4 in Appendix
B. The proposed system suffers from many of the same problems as the base-
line system with the BC database (shallow FOV angle, complex shadows and
reflections). The improved motion detection routine, and use of shadow detec-
tion, does result in some improvement in the detection of people within the scene
(see Figure 5.13, top row shows the output from the baseline system, bottom
row is the output from the proposed system). This improvement is reflected in
the small improvements observed in the overall detection metrics and each of
the individual detection metrics, as well as increase in the tracking performance.
However the improvement in tracking performance is only small, as the frequent
total occlusions present in the database still pose a large problem.
Figure 5.13: Example output from BC16 - Improved Detection Results due to Proposed Motion Detection Routine (frames 1750-1790, baseline system output on the top row and proposed system output on the bottom row)
The main area of improvement with the proposed system is more accurate de-
tection of people, due to the improved motion detection. Figure 5.13 shows a
situation where two people are walking down the hallway, each person wearing a
white shirt that is similar in colour to the floor and walls. As a result, the baseline
system performs poorly, failing to detect portions of the shirt as being in mo-
tion. This results in only the legs of these people being properly tracked. The
proposed system is able to correctly detect the shirts as being in motion, and
segment the two people correctly. This improvement can be attributed to the
use of the variable threshold in the proposed motion detection algorithm. Whilst
this improvement does obviously lead to improvements in tracking performance,
it does not aid in the resolution of the occlusions.
Tables B.5 and B.6 (see Appendix B) show the results of the evaluation on the
AP datasets. Overall, the AP datasets perform very similarly to the baseline
system, with a small decrease in detection and small improvement in tracking
performance being observed.
Figure 5.14: Example output from AP12-C7 - Tracking Example for AP Dataset (frames 900-1100)
Figure 5.15: Example output from AP11-C4 - Tracking Example for AP Dataset (frames 375-725)
Figures 5.14 and 5.15 show examples of the tracking output for the baseline and
proposed tracking systems (top row of images are the output from the baseline
system, bottom row is from the proposed system). Throughout the AP datasets,
both systems track the objects in the scene successfully. The small performance
decrease in the detection can be attributed to the proposed system not detecting
objects as quickly (i.e. five frames slower) when they enter the scene. For the
AP12 dataset, which is captured in full sun, this is partially caused by the shadow
detection. The use of shadow detection reduces the size of the detected objects
(i.e. shadow pixels may have otherwise been included), and for small objects
at the back of the scene, this can (for the first few frames that they are present)
result in the object being too small to be considered (this could be counteracted
by adjusting object detection parameters to accept smaller objects). The small
tracking improvement results from an improvement in the T1 metric (number of
objects being tracked over time) and is a result of the more consistent motion
detection providing more accurate object locations over time.
The performance of the BE datasets is shown in Tables B.3 and B.4 in Appendix
B. The proposed system results in improved performance over the baseline in
detection, localisation and tracking. This is largely due to the improved per-
formance on the BE19 dataset, specifically when handling the parked car. As
was found in the baseline system (see Section 3.5.4), the camera angle of BE-C3
once again results in extremely poor performance in the D2, T2 and T5 metrics.
This is to be expected, as whilst the motion segmentation algorithm has been
significantly changed, the detection algorithm which uses this information has
remained unchanged.
Figure 5.16 (the output from the baseline system is shown on the top row, and
the output from the proposed system is shown on the bottom row) shows a car
that has just parked in the parking lot, with two people also moving through the
scene (one of which has got out of the car). The baseline system loses track of
the object after a few hundred frames, and also performs worse when tracking
the people.
The proposed system does however detect a spurious person at the parked car
(see Figure 5.16 (h)). Due to the noisy nature of the BE dataset, the static
motion tends to be less stable than for the RD dataset. As a result, a certain
amount of active motion is detected over the static motion. In some situations,
this can result in a false track being spawned for a short period of time.
Figure 5.16: Example Output from BE19-C1 - Maintaining tracks for stationary objects (the parked car); frames 450-750, baseline system output (top row) and proposed system output (bottom row)
There is little to no improvement when handling the occlusion in BE20 however
(see Figure 5.17, the output from the baseline system is shown on the top row,
and the output from the proposed system is shown on the bottom row).
Figure 5.17: Example Output from BE20-C3 - Occlusion handling (frames 600-800, baseline system output on the top row and proposed system output on the bottom row)
Figure 5.17 shows the occlusion in BE20, for camera 3. The proposed system is
unable to perform any better in handling this occlusion, as the combination of the
camera noise and poor viewing angle make it difficult to separate the individual
people. The proposed system also performs poorly when tracking people as they
move away from the occlusions, as the additional detection cues such as optical
flow are not able to be well used when there is a high amount of noise present (this
makes the matching process for determining the optical flow very unreliable).
Table 5.8 shows the data throughput benchmarks for the proposed tracking sys-
tem. These throughput rates are calculated under the same conditions used
when benchmarking the baseline system (see Section 3.5.4). The incorporation
of the proposed motion detection routines results in a drop in throughput for all
datasets. This is to be expected, as a significant amount of additional processing
is performed by the proposed motion detection routine, and the accompanying
object detection routines.
The RD dataset experiences the largest relative decrease (32% decrease) due to
the large amount of static foreground present in the scenes. Static foreground
results in an increase in processing time for the motion detection algorithm, and
the object detection routine to locate static objects is more computationally in-
tensive than other object detection processes. The AP and BE datasets also
contain static objects, explaining why they suffer a bigger performance drop than
the BC dataset (no static objects). However as they have less static foreground
than the RD dataset, there is less of a performance decrease.
Data Set   Frame Rate (fps)
RD         12.90
BC         11.65
AP         14.52
BE         14.26

Table 5.8: Proposed Tracking System Throughput
Despite the drop in data throughput when compared to the baseline system,
all datasets are processed at greater than 11 fps. Given that these results are
achieved executing on a single core, and that significant optimisations can be
made to the motion segmentation, it is feasible that the proposed system could
process data in real time.
5.7 Summary
This chapter has presented methods for incorporating multiple modalities of mo-
tion information (active foreground, stationary foreground, optical flow) into a
single tracking system. Detection routines capable of using the additional infor-
mation as well as an approach to integrate the routines into an existing system
have been proposed. The proposed tracking system has been evaluated using the
ETISEO [130] database and compared against a baseline tracking system (see
Chapter 3) and significant improvement has been shown.
Despite the observed improvements, the proposed algorithm is limited in that
noise in the stationary foreground image can result in false objects being de-
tected (and false tracks created), and keeping stationary foreground out of the
background relies on the object detection and tracking process finding the ob-
ject so that feedback can be applied to the motion detection algorithm. Also, the
proposed algorithm, whilst improving motion detection and object segmentation,
does not provide any additional aid in dealing with severe occlusions. However,
as the baseline system is unable to discriminate between stationary foreground
objects and moving foreground objects, or apply feedback to prevent objects from
being added to the background, these are seen as minor limitations.
Chapter 6
The Scalable Condensation Filter
6.1 Introduction
The performance of a tracking system can be improved by using more advanced
prediction methods such as particle filters. Particle filters are able to use features
that are extracted from one or more target objects to locate them in future frames.
Particle filters are not constrained by linearity and Gaussian assumptions like
Kalman filters are, and so are able to be used in a wide range of situations.
To date, many particle filter implementations have relied on using a common
feature to locate all objects in the scene [78, 135, 163], complicating the task of
maintaining the identity of the individual tracked objects. Manual initialisation
of tracks [163, 168] is also common, and limits the use of such systems in real
world applications.
In this chapter the Scalable Condensation Filter (SCF) is proposed and the pro-
cess by which it can be integrated into the tracking system proposed in Chapters
3 and 5 is described. The SCF is a variant of the mixture particle filter [163],
that is able to alter the number of particles used by each independent mixture as
the scene changes, and allows the features used (and type of feature used) to vary
between tracks and from frame to frame, according to the system needs. This
allows the system to be more efficient as high particle counts and high complexity
features are only used when the scene is sufficiently complex.
The SCF is integrated into the proposed tracking system such that the proposed
object detection methods are able to be used to instantiate new tracks, update
existing tracks when possible, and provide additional stimulus to the individual
mixture distributions based on the current detection results. This integration
also aids in maintaining the identity of the individual mixtures as it allows each
mixture to use its own set of features. Importantly, these features are also able
to be updated on a frame to frame basis (according to how often the object can
be detected and updated), such that if the appearance of a tracked object does
change the features used are also able to change, helping to improve tracking
performance. Finally, the integration with the proposed tracking system also
allows for tracks to be removed from the SCF when appropriate.
The SCF is designed such that the type and number of features used by each
mixture component (each of which represents a tracked object) as well as the
number of particles used by each mixture component can vary from frame to
frame and mixture component to mixture component. This allows the SCF to
adapt to the needs of the scene.
6.2 Scalable Condensation Filter
A condensation filter[77] is used to aid in tracking objects within the system. The
Scalable Condensation Filter (SCF), an extension of the Mixture Particle Filter
(MPF)[163] and Boosted Particle Filter (BPF)[135], is proposed. A single filter
is used for the entire system, and the particle count is scaled according to the
number of objects being tracked. In addition, the number of particles for each
track is allowed to vary according to the complexity of the surrounding area,
and resampled in such a way that ensures that particles for a track (and thus
the track itself) are not lost owing to re-sampling (see Section 6.2.1). Features
used for tracking may also vary from frame to frame, and track to track. This
is important when tracking different classes of object within the one system (i.e.
people and vehicles) which may be better suited to different features, and also
allows objects that are in more complex regions of the scene (i.e. occluded) to
use more advanced features.
The distribution modelled by the SCF is the sum of the distributions of the
individual tracks,
p(x_t | x_{t-1}) = \sum_{i=1}^{N} p_i(x_{i,t} | x_{i,t-1}),    (6.1)
where pi(xi,t|xi,t−1) is the component distribution for a single track, i, and N is
the total number of tracks within the system.
For the tracking system proposed in this thesis, the SCF particles are four dimen-
sional and describe a bounding box (a centre position - x and y pixel coordinates,
and the height and width),
s_{i,n,t} = \{x, y, h, w\},    (6.2)
where n is the index of the particle s, in the range [0..N−1]. Each variable is free to
move within the dimension limits, {dmin, dmax}, which are defined by the system
(i.e. the limits of x and y are governed by the image size). The distribution of each
dimension is Gaussian, and independent from the other dimensions. The standard
deviation of each dimension is equal to the maximum expected movement of a
dimension from one frame to the next, emax.
6.2.1 Dynamic Sizing
Rather than have a fixed number of samples for the filter, the sample count is
dynamically altered as object’s enter and leave the scene, and as people move
about and occlude one another. For each track, an arbitrary number of samples,
νinit, are created about the object’s initial position and associated with that
object,
s_{i,n,t} = \gamma_i + 3 \times \rho    (6.3)
where si,n,t is the new sample, γi is the new object's state, and ρ is a vector
of random values, in the range −emax to +emax. Note that the values of emax
potentially vary for each dimension (i.e. x, y, h and w, may be expected to
change at different rates). A multiplier of 3 is used when initialising the particles
to ensure that the initial distribution is not too closely packed about γi (the
object’s initial position).
The particles initially associated with the given track remain associated with that
track for the duration of that track's life, and the particle count for any individual
track cannot be diminished unless it is specifically desired. This initialisation
gives each tracked object a set of samples to model it immediately, rather than
needing to allow a period of frames for the system to adapt to its presence. When
an object leaves, νinit samples are removed from the system.
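Initialisation of a track's particles (Equation 6.3) might look like the following sketch (NumPy is assumed, the state ordering follows Equation 6.2, and the default value of νinit is illustrative):

import numpy as np

def init_track_particles(gamma, e_max, nu_init=100, rng=None):
    """Create nu_init particles spread about a new track's initial state
    (Equation 6.3).

    gamma : initial state {x, y, h, w} of the new track
    e_max : maximum expected per-frame change in each dimension
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.asarray(gamma, dtype=float)
    e_max = np.asarray(e_max, dtype=float)
    # rho is uniform in [-e_max, +e_max]; the factor of 3 widens the initial
    # spread so that the particles are not packed too closely about gamma.
    rho = rng.uniform(-e_max, e_max, size=(nu_init, gamma.size))
    return gamma + 3.0 * rho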
When tracked objects are close together, additional particles can be added and
more advanced features can be used to aid in the tracking. Three levels of occlu-
sion are defined for each track:
1. Level 0 (No Occlusion) - The tracked object is isolated within the scene,
there are no other objects nearby.
2. Level 1 (Object Nearby) - Another tracked object’s bounding box is within
a distance τnearby. τnearby is set at a fixed value for each class and depends
on the expected size of the objects being tracked.
3. Level 2 (Overlap) - Another tracked object’s bounding box is overlapping.
When a track is first created, and added to the SCF, it is at occlusion level 0
and is created with the standard number of particles (νinit). For each occlusion
level increase, an additional νadd particles are added to the SCF for that track;
and νadd samples are removed for each occlusion level decrease. Additional (or
fewer) levels of occlusion could be defined based on the proximity of tracks, and
the severity of the overlap.
Figure 6.1 shows a system that is tracking two objects (blue and yellow). At
time t, these objects are suitably far apart that there is no occlusion, and so each
object is tracked with the standard number of particles (4). At time t + 1, the
objects are considered to be in an occlusion state. The system over-samples the
particle set to generate a set of 8 particles for each of the tracked objects. At
time t + 2, the occlusion has passed, and so the sample sets are under-sampled
such that each object is once again tracked by the standard number of particles.
Particle counts for tracked objects are altered during the re-sampling procedure
by either under-sampling or over-sampling. Resizing the system in this manner
ensures that no unnecessary updates are done (i.e. object 1 may have ceased
to be occluded by object 2, requiring a drop in the number of particles, but
have become occluded by object 3, requiring an increase in particles, such that
Figure 6.1: Dynamic Sizing of Particle Filter
ultimately no net change in the particle count is required), and improves CPU
utilisation.
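The occlusion levels and the resulting per-track particle budget can be sketched as below (bounding boxes are assumed to be (x0, y0, x1, y1) tuples; the values of τnearby, νinit and νadd are illustrative):

def occlusion_level(bbox, other_bboxes, tau_nearby=20.0):
    """Return 0 (isolated), 1 (another object nearby) or 2 (boxes overlap).
    Bounding boxes are (x0, y0, x1, y1) tuples."""
    def overlaps(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    def gap(a, b):
        dx = max(b[0] - a[2], a[0] - b[2], 0)
        dy = max(b[1] - a[3], a[1] - b[3], 0)
        return (dx * dx + dy * dy) ** 0.5

    level = 0
    for other in other_bboxes:
        if overlaps(bbox, other):
            return 2
        if gap(bbox, other) <= tau_nearby:
            level = 1
    return level

def particle_budget(level, nu_init=100, nu_add=50):
    """nu_init particles at level 0, plus nu_add for each occlusion level."""
    return nu_init + level * nu_add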
A Sequential Importance Re-sampling (SIR)[44, 139] procedure is used to update
the sample set. Each new particle is adjusted according to a motion model
associated with the tracked object responsible for the particle. The expected
movement according to this motion model (based on a window of Q previous
observations) is added to the particle as well as a noise vector,
s_{i,n,t+1} = s_{i,n,t} + \lambda_i + \rho,    (6.4)
where s(i,n,t+1) is the nth sample for track i at the next time step; s(i,n,t) is the nth
sample for track i at the current time step; ρ is the noise vector, which is within
the range [−emax..+ emax], and λi is the expected movement for the track, i. As
part of all particle updating and creation, a set of limits is applied to each
particle, to ensure that it describes a valid object (if a dimension exceeds a limit,
it is set to the limit). Whilst SIR would ensure that any particles that describe
invalid objects are not propagated (they would have 0 probability), performing
this test on the particles at this point avoids the need to check for valid image
coordinates when matching features, which allows the system to be more efficient.
Normally when re-sampling using SIR, random values in the range [0..1] are
selected and mapped to the corresponding particle according to the cumulative
probability in order to select the particles for re-sampling. For the SCF, re-
sampling is performed on a track by track basis, where the random value selected
is in the range that corresponds to the cumulative probability range of the track’s
particles. Figure 6.2 shows an example of this.
Figure 6.2: Sequential Importance Re-sampling for the SCF
In Figure 6.2, the SCF distribution contains particles for two tracks. Particles
for each track are stored in a continuous block. When re-sampling, particles for
the first track are re-sampled first by selecting random values in the range [0..x].
Once the correct number of particles have been re-sampled for the first track, the
second track is re-sampled by selecting random values in the range [x..1]. This
process can be easily scaled for additional tracked objects.
This approach relies on ensuring that the particles for each track are stored
continuously in the particle list (i.e. the list cannot have νinit particles for track
1, νinit particles for track 2, and then another νadd particles for track 1). Provided
particles are only added to and removed from the filter during re-sampling, this
is easily achieved.
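A minimal sketch of the prediction step (Equation 6.4) and of the track-by-track re-sampling is shown below (NumPy is assumed; each track's particles and weights are taken to be stored as contiguous arrays, and the helper names are illustrative):

import numpy as np

def predict(particles, lam, e_max, limits, rng=None):
    """Equation 6.4: move each particle by the track's expected motion plus
    uniform noise in [-e_max, +e_max], then clamp to the dimension limits."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(-e_max, e_max, size=particles.shape)
    return np.clip(particles + lam + noise, limits[0], limits[1])

def resample_track(particles, weights, n_out, rng=None):
    """Sequential Importance Re-sampling restricted to one track's block of
    particles; n_out may differ from the current count to grow or shrink
    the track as its occlusion level changes."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    idx = rng.choice(len(particles), size=n_out, p=w / w.sum())
    return particles[idx]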
6.2.2 Dynamic Feature Selection and Occlusion Handling
Each track is able to use multiple features. Using inheritance and polymorphism,
the types of features used by tracks can be allowed to vary depending on circum-
stances and the class of object being tracked, without any change required in the
condensation filter itself. This approach allows different types of objects to use
features more suited to their individual properties.
Two types of features, each of which has various sub-types, are proposed:
1. Histograms
2. Appearance Models
Each of these features can optionally use motion detection and optical flow as
additional aids (i.e. a pixel must be in motion and must be moving in the same
direction as the object being tracked), and this can be changed dynamically de-
pending on the system's status (i.e. if motion detection is unreliable for a period of
time due to environmental effects, this can be omitted when matching features).
Features are initially built when the tracked object is first detected, and are up-
dated every subsequent frame. As such, a track’s features learn the appearance
of the track and are able to vary over time to accommodate any changes that
may occur in the objects appearance. As the appearance models are computed
and compared in different manners, and model different aspects of the objects
appearance, they are assumed to be independent.
Histograms simply model colour distributions, and so while being quicker to
compute, do not take geographical information into consideration (i.e. a per-
son wearing blue pants and a red shirt will have a very similar histogram to a
person wearing red pants and a blue shirt, despite having a distinct appearance).
Appearance models encode position information as well as colour information,
and so are more discriminative. They are however more processor intensive.
The features used by the system are varied as the complexity changes. A his-
togram feature is used by default, and when a track’s occlusion level increases
above 0 (see Section 6.2.1), an appearance model feature is used as well. When
multiple features are used, the probability for the particle is the product of the
probabilities for each feature,
w_{i,n,t} = \prod_{j=1}^{M} p(z_{j,i,t} | x_{i,t} = s_{i,n,t}),    (6.5)
where wi,n,t is the weight of particle n for track i at time t, M is the total number
of features for track i, zj,i,t is the jth feature for track i (xi,t), and si,n,t is the
particle from xi,t’s distribution we are matching the feature to.
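Under this independence assumption, Equation 6.5 reduces to a simple product of feature likelihoods, e.g. (feature objects with a likelihood method are an assumed abstraction, not the interface of the implementation):

def particle_weight(features, particle):
    """Equation 6.5: product of the independent feature likelihoods."""
    w = 1.0
    for feature in features:
        w *= feature.likelihood(particle)   # p(z_j | x = s)
    return w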
When a track’s occlusion level reaches two (occlusion occurring), the process to
calculate weights is altered. In such a situation, the tracked object can either
be obscured (i.e. the blue object in Figure 6.3), or be obscuring another object
(i.e. the yellow object in Figure 6.3). Each of these situations must be handled
differently.
Figure 6.3: A Typical Occlusion between Two Objects
When an object is obscured (partially or fully), the SCF is likely to receive poor
responses to its features as it cannot be seen. To overcome this, features can be
matched only to regions where it is believed that an occlusion is not taking place,
and a fixed probability can be used when considering regions that cannot be seen.
At the end of processing each frame, the tracking system determines which objects
are in occlusion and updates the SCF with this information. For objects that
are partially obscured, the locations (bounding box) of the obscuring objects
are passed. When evaluating features, these regions are avoided, to prevent the
features being matched to a mix of the target object, and any objects that may
be obscuring the object. This however will also result in a reduced probability,
which when the occlusion is suitably severe (i.e. half of the object of more is
hidden) may still lead to SCF losing the track. To overcome this, regions that
are obscured are assigned a fixed probability, β. Using this process, the weight
of a particle belonging to an object that is occluded becomes,
w_{i,n,t} = \alpha_{vis} \prod_{j=1}^{M} p(z_{j,i,t} | x_{i,t} = s_{i,n,t}) + (1 - \alpha_{vis}) \times \beta,    (6.6)
where αvis is the fraction of the object that was visible in the last frame, and β
is a constant between 0 and 1 that denotes the probability of an occluded object
being at an obscured location (typically set to 0.5). αvis is needed to ensure that
the probability for a particle does not exceed 1, and the constant, β, is used to
ensure that when an object is totally (or almost totally) occluded, the probability
for the object does not drop to zero and the track is not lost.
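Equation 6.6 can then be sketched as (reusing the same assumed feature abstraction; β = 0.5 as in the proposed system):

def occluded_particle_weight(features, particle, alpha_vis, beta=0.5):
    """Equation 6.6: blend the feature response over the visible fraction of
    the object with a fixed probability beta for the obscured fraction."""
    w = 1.0
    for feature in features:
        # Features are evaluated only over unoccluded regions (Figure 6.4).
        w *= feature.likelihood(particle)
    return alpha_vis * w + (1.0 - alpha_vis) * beta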
Figure 6.4: Calculating particle weights for occluded objects ((a) Particle Location, (b) Motion Image, (c) Occlusion Map, (d) Motion Image with Occluding Regions Removed)
Figure 6.4 shows an example where the weight for a particle (the red box in (a))
is being calculated for an object that is partially obscured (the blue object in (a),
obscured by the yellow object). The expected location of the occluding object(s)
is used to create an occlusion map (c), which shows the regions where the target
object is expected to be obscured. The occlusion map is also used to alter the
motion image for the target. The original motion image (b) has the motion in
the region where the occluding object is expected to be removed, resulting in the
motion mask shown in (d).
When one object obscures another, it is still entirely visible, so there is no danger
that the occlusion will result in the SCF receiving poor responses to the track’s
features due to the object not being visible. If, however, the objects involved
in the occlusion have a similar appearance, there is a possibility that the SCF
will begin to track multiple objects within the one mixture component, possibly
leading to object identities being swapped or lost.
To overcome this problem, the idea of negative features is proposed. Within the
SCF, the probability of a particle is given by Equation 6.5. However, in an
occlusion there is a risk that, if the occluded object is of a similar appearance to
the target (occluding) object, the particle weights may be affected by a partial
match to the occluded object. When a severe occlusion is taking place, the
mixture component associated with the occluding object uses the features of the
occluded object in a negative capacity, such that the particle weight becomes,
w_{i,n,t} = \prod_{j=1}^{M_j} p(z_{j,i,t} \mid x_{i,t} = s_{i,n,t}) \times \left(1 - \prod_{k=1}^{M_i} p(z^{-}_{k,i,t} \mid x_{i,t} = s_{i,n,t})\right),    (6.7)
where Mj is the number of positive features for the track, Mi is the number of
features for track i that are to be used in a negative way, and z−k,i,t is the kth
negative feature for track i. Note that the negative features are not actually
created by, updated by, or owned by track i. Rather they are copied from the
other object(s) involved in the occlusion. This results in the SCF yielding a
higher response to regions that match the appearance of the target object well
and match that of the other object(s) involved in the occlusion poorly.
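The negative-feature weighting of Equation 6.7 can be sketched as follows; the probabilities passed in are assumed to come from the usual feature evaluation.

def weight_with_negative_features(positive_probs, negative_probs):
    """Particle weight using negative features (Eq. 6.7).

    positive_probs: likelihoods of the track's own (positive) features.
    negative_probs: likelihoods of features copied from the other object(s)
                    involved in the occlusion, used in a negative capacity.
    """
    pos = 1.0
    for p in positive_probs:
        pos *= p
    neg = 1.0
    for p in negative_probs:
        neg *= p
    return pos * (1.0 - neg)

# A region matching the target well (0.8) but the occluded object poorly (0.1)
# scores higher than one matching both objects well (0.8 and 0.7).
print(weight_with_negative_features([0.8], [0.1]))  # 0.72
print(weight_with_negative_features([0.8], [0.7]))  # 0.24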
As each tracked object has its probabilities normalised and particles re-sampled
separately, there is no danger of the additional matching constraints (using ad-
ditional features, negative features, or measures to overcome the object being
obscured) reducing a track’s probabilities to the extent that the track’s particles
are removed from the system by the re-sampling procedure. It is feasible that
multiple or different appearance models and histograms could be used for each
track under appropriate circumstances.
6.2.3 Adding Tracks and Incorporating Detection Results
When a new object enters the scene, the SCF needs to begin tracking it. The
MPF [163] uses K-means clustering to detect split and merge events within the
existing modalities to detect and initialise new tracks, whilst the BPF [135] relies
on adaboost detection results being incorporated into the distribution to initialise
new tracks. As the SCF is designed to be integrated into an existing tracking
system, the discovery of new modes can be handled by the underlying tracking
system. When the detection methods utilised within this system detect a new
track, a new component mixture corresponding to the new track is added to the
SCF.
Like the boosted particle filter [135], the SCF may also use detection results from
the object detection routines in the tracking system when calculating particle
weights. Within the BPF, this is intended to help initialise new modes for track-
ing. Within the SCF, it is used to provide additional information to the system,
to help the component mixtures better represent and maintain the locations of
the tracked objects. The mixture distributions are combined using an additive
process (as is used in the BPF),
p'(x_t \mid x_{t-1}) = \sum_{i=1}^{N} \alpha_i q_i(x_{i,t} \mid x_{i,t-1}, y_{i,t}) + (1 - \alpha_i) p_i(x_{i,t} \mid x_{i,t-1}),    (6.8)
where qi is a Gaussian distribution dependent on the current observations and
matches from the tracking system, yi,t, for a specific track, i, and N is the num-
ber of tracked objects. The mixture q_i(x_{i,t} | x_{i,t-1}, y_{i,t}) is created by summing
Gaussians for each object detection (y_{i,t}) recorded for the tracked object (x_{i,t}).
The standard deviation of the Gaussian is set to emax.
The value of αi varies between different tracks, and is set according to the occlu-
sion level. Object detections that arise from tracks that are at a lower occlusion
level are weighted higher (larger α) as they are less likely to be in error. De-
tections that arise during occlusion are less reliable, and so are weighted lower.
Within the proposed system, weights for α are set to 0.5, 0.25 and 0.125 for oc-
clusion levels of 0, 1 and 2 respectively. For each track, multiple detection results
may be used when updating each distribution (see Section 6.4).
This update allows the mixture distributions to become multi-modal themselves
temporarily, to help cope with ambiguities.
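A one-dimensional sketch of how detections might be mixed into the re-sampling step (Equation 6.8); the Gaussian proposal, the simplified particle representation and all values used are assumptions made for the example, not the actual SCF implementation.

import numpy as np

def sample_proposal(prev_particles, detections, alpha, e_max, rng):
    """Draw new particle states from the additive mixture of Eq. 6.8 (1-D sketch).

    With probability alpha a particle is drawn from a Gaussian centred on a
    detection reported by the tracking system (std. dev. e_max); otherwise it
    is drawn from the motion-propagated prediction of a previous particle.
    alpha is set from the track's occlusion level (0.5, 0.25 or 0.125).
    """
    new_particles = []
    for _ in range(len(prev_particles)):
        if detections and rng.random() < alpha:
            centre = rng.choice(detections)            # q_i: detection-driven component
            new_particles.append(rng.normal(centre, e_max))
        else:
            base = rng.choice(prev_particles)          # p_i: prediction from previous state
            new_particles.append(rng.normal(base, e_max))
    return new_particles

rng = np.random.default_rng(0)
print(sample_proposal([100.0, 102.0, 98.0], [110.0], alpha=0.5, e_max=2.0, rng=rng))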
Figure 6.5: Scalable Condensation Filter Process
The updating of particle weights based on observed detections is added in to the
process as shown in Figure 6.5. The usual condensation filter update is performed
(resample, re-weight based on features), after which the object detection and
updating of tracked objects takes place. These detection results can then be
incorporated into the SCF distribution at the end of the current time step (t), as
shown in Figure 6.5, such that at time t+ 1, when the distribution is re-sampled,
the detection results will influence the re-sampling process.
6.3 Tracking Features
The system uses two types of tracking features, histograms and appearance mod-
els. An appearance model is defined as a model that encodes both intensity and
position information, whilst a histogram simply encodes colour information.
An appearance model is proposed that utilises the proposed motion detection
routine (see Chapter 4), by incorporating colour, motion state, and optical flow
into a single model. The appearance model, A, is a grid of Ax by Ay squares,
with an average colour (Ac(k), where k is the colour channel), velocity (Au and
Av for the horizontal and vertical velocity respectively, derived from the optical
flow), and motion occupancy (Am) stored for each square. An error value for
the colour (Aec) and optical flow (Aeopf ) is also stored for each square. The input
image, I(t), is divided into a grid of dimensions Ax by Ay. It is assumed that
these dimensions will be significantly smaller than those of the input images (see
Figure 6.6). It is also assumed that the object detection results will be similar
from frame to frame (either correct or consistently incorrect), to ensure that
the contents of each square are reasonably consistent from frame to frame (changes
are expected over time as the target moves about, however these should
occur over several frames, rather than as a dramatic change from one frame to the
next).
Figure 6.6: Dividing input image for Appearance Model
For each grid square in I(t), the average colour, percentage of motion, and optical
flow (horizontal and vertical) are computed,
F_c(x', y', t, k) = \frac{1}{card(M(x, y, t))} \sum I(x, y, t, k) \text{ where } (x, y) \in M(t),    (6.9)

F_u(x', y', t) = \frac{1}{card(M(x, y, t))} \sum U(x, y, t) \text{ where } (x, y) \in M(t),    (6.10)

F_v(x', y', t) = \frac{1}{card(M(x, y, t))} \sum V(x, y, t) \text{ where } (x, y) \in M(t),    (6.11)

F_m(x', y', t) = \frac{card(M(t))}{card(I(t))},    (6.12)
where F is a feature extracted for the current image, x′, y′ are in the range
[0..Ax − 1, 0..Ay − 1], U and V are the input horizontal and vertical flow images,
M is the input motion image and M(t) is the set of all pixels that are in motion,
and x, y is in the range that corresponds to the grid square x′, y′.
Given the features for the incoming image, the appearance model components
are updated according to the equation,
A(t+1) = A(t) + (F(t) - A(t)) \times L,    (6.13)

where L is the learning rate. L is defined as,

L = \frac{1}{T} \text{ for } T < W,    (6.14)

L = \frac{1}{W} \text{ for } T \geq W,    (6.15)
where W is the number of frames used in the model, and T is the number of
updates performed on the model. This ensures that the image that the model is
initialised with does not dominate the model for a significant number of frames.
Instead, the information is incorporated quickly when the model is new to provide
a better representation of the tracked object being modelled sooner.
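A small sketch of the update rule of Equations 6.13 to 6.15 for a single scalar component of the model; the window size and observation values are illustrative only.

def update_appearance_value(current, observed, updates_done, window):
    """Running update of one appearance model component (Eqs. 6.13-6.15).

    The learning rate is 1/T while fewer than W updates have been applied
    (so early observations are absorbed quickly), then settles at 1/W.
    """
    T = updates_done + 1                      # index of this update (assumed convention)
    L = 1.0 / T if T < window else 1.0 / window
    return current + (observed - current) * L

# The value initialised from the first frame (50) does not dominate for long:
value = 50.0
for t, obs in enumerate([80.0, 80.0, 80.0, 80.0]):
    value = update_appearance_value(value, obs, updates_done=t + 1, window=30)
print(round(value, 2))  # moves quickly towards 80 while the model is new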
An error measure is kept for both the optical flow and colour components of the
model,
F^e_c(x', y', t) = \sum_{k=1}^{K} |A_c(x', y', t, k) - F_c(x', y', t, k)|,    (6.16)

F^e_{opf}(x', y', t) = |A_u(x', y', t) - F_u(x', y', t)| + |A_v(x', y', t) - F_v(x', y', t)|,    (6.17)

where F^e_c and F^e_{opf} are the frame errors for colour and optical flow respectively,
and K is the number of colour channels in the appearance model.
The errors are updated over time using equations 6.14 and 6.15. The cumulative
error is used as an approximation to the standard deviation (it is assumed that
the observations over time form a Gaussian distribution) of the error, as it is
not practical to re-compute the standard deviation each frame, and not ideal
to assume a fixed standard deviation. Given that the standard deviation for a
sample set is defined as,
\sigma = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\mu - s_n)^2},    (6.18)
and in the proposed appearance model, for each grid square there is one obser-
vation at each time step (N = 1), so the standard deviation at a given time step
is,
\sigma = \sqrt{(\mu - s)^2} = |A(x', y', t) - F(x', y', t)|,    (6.19)
which is the proposed error measure.
When matching the model to an input image, average colour, flow and motion
occupancy is computed for the image in the same manner as for an update.
Errors for the colour and optical flow are calculated and these are compared to
the cumulative errors for the model,
G_c(x', y', t) = \frac{F^e_c(x', y', t)}{A^e_c(x', y', t)},    (6.20)

G_{opf}(x', y', t) = \frac{F^e_{opf}(x', y', t)}{A^e_{opf}(x', y', t)},    (6.21)

where G_c(x', y', t) and G_{opf}(x', y', t) are the number of standard deviations from
the mean that the observation (input image) is. A^e_c(x', y', t) and A^e_{opf}(x', y', t) are
determined using Equation 6.13.
determine the probability that these observations have arisen from the model,
which yields P (Fc(x′, y′, t)|Ac(x′, y′, t)) as the probability that the colour ob-
servation belongs to the distribution described in the appearance model, and
P (Fopf (x′, y′, t)|Aopf (x′, y′, t)) as the probability that the optical flow observation
belongs to the distribution described in the appearance model.
The probability that a given grid square matches the corresponding area in the
input image is then defined as,
P(F(x', y', t) \mid A(x', y', t)) = P(F_c(x', y', t) \mid A_c(x', y', t)) \times P(F_{opf}(x', y', t) \mid A_{opf}(x', y', t)),    (6.22)
where F (x′, y′, t) is the set of features for a grid square in the input image, and
A(x′, y′, t) is the set of features for a given grid square in the appearance model.
The motion occupancy component of the model is used as a weight when com-
puting the match across the whole model. A higher motion occupancy indicates
that there is more motion, and thus more information, in a given grid square.
Given this, the match for the model to an input image is,
P(I(t) \mid A(t)) = \frac{\sum_{x'=1; y'=1}^{x'=A_x; y'=A_y} P(F(x', y', t) \mid A(x', y', t)) \times A_m(x', y', t)}{\sum_{x'=1; y'=1}^{x'=A_x; y'=A_y} A_m(x', y', t)}.    (6.23)

A_m(x', y', t) is determined using Equation 6.13.
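The matching process of Equations 6.20 to 6.23 can be sketched as below, assuming the normal distribution look-up returns the two-sided tail probability for a deviation of G standard deviations; all function names here are illustrative, not the thesis implementation.

import math

def gaussian_tail_prob(num_std_devs):
    """Probability of an observation at least this many std. devs from the mean
    (stands in for the normal distribution look-up table)."""
    return math.erfc(abs(num_std_devs) / math.sqrt(2.0))

def square_match_prob(frame_err_colour, model_err_colour,
                      frame_err_flow, model_err_flow):
    """Probability that one grid square matches the model (Eqs. 6.20-6.22)."""
    g_c = frame_err_colour / max(model_err_colour, 1e-6)
    g_opf = frame_err_flow / max(model_err_flow, 1e-6)
    return gaussian_tail_prob(g_c) * gaussian_tail_prob(g_opf)

def model_match_prob(square_probs, motion_occupancy):
    """Whole-model match: occupancy-weighted average of square matches (Eq. 6.23)."""
    num = sum(p * m for p, m in zip(square_probs, motion_occupancy))
    den = sum(motion_occupancy)
    return num / den if den > 0 else 0.0

# Two squares: one close to the model, one far from it (errors in model-error units).
probs = [square_match_prob(1.0, 2.0, 0.5, 1.0), square_match_prob(6.0, 2.0, 4.0, 1.0)]
print(model_match_prob(probs, motion_occupancy=[0.9, 0.3]))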
6.3.1 Handling Static Foreground
The motion detection algorithm that is used within the tracking system described
in this thesis can distinguish between objects that are moving in the scene (active
foreground) and objects that have stopped moving, but are not part of the back-
ground (static foreground). Any appearance model needs to be able to handle
the different types of foreground.
The proposed appearance model divides the object being modelled into a grid.
Each grid location is modelled separately, with its own colour, optical flow and
expected motion occupancy. This can be extended by allowing each grid to have
its own state, AS(x, y, t), indicating what type of motion is expected at this
square. This type may be Active (there is no static foreground expected at this
location), Static (there is no active foreground expected at this location) or Both
(foreground of both types is expected).
The state of each grid square is determined during the object update of each
frame, and uses the static object template (see Section 5.3) to determine the
state of each grid square. If the template has not been initialised (i.e. the object
is not stationary) then all squares must be in the Active state. If the static
template is initialised, then the ratio of active foreground to static foreground
within the template region that corresponds to each grid square is calculated,
R_m = \frac{card(I_{ST} = Active)}{card(I_{ST} = Static)},    (6.24)
where IST is the template image for the track that owns the appearance model,
card(IST = Active) is the number of pixels within the template image that are
detected to be in a state of active foreground, card(IST = Static) is the number
of template image pixels in a state of static foreground, and Rm is the ratio of
active foreground to static foreground within the template. A threshold, τs, is
applied to this ratio to determine the state such that,
R_m \leq \tau_s \implies A_S(x, y, t) = Static,    (6.25)

R_m \geq \frac{1}{\tau_s} \implies A_S(x, y, t) = Active,    (6.26)

\tau_s < R_m < \frac{1}{\tau_s} \implies A_S(x, y, t) = Both.    (6.27)
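A sketch of the state decision of Equations 6.24 to 6.27; the value of τ_s used here (0.2) and the guard for squares with no static pixels are assumptions made for the example only.

def grid_square_state(num_active, num_static, tau_s=0.2):
    """Classify a grid square as Active, Static or Both (Eqs. 6.24-6.27).

    tau_s is assumed to be less than 1: a ratio of active to static pixels
    below tau_s means the square is essentially static, above 1/tau_s it is
    essentially active, and anything in between is the transient Both state.
    """
    if num_static == 0:
        return "Active"          # assumed guard: no static foreground at all
    r_m = num_active / num_static
    if r_m <= tau_s:
        return "Static"
    if r_m >= 1.0 / tau_s:
        return "Active"
    return "Both"

print(grid_square_state(2, 50))    # Static
print(grid_square_state(50, 2))    # Active
print(grid_square_state(20, 30))   # Both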
A grid square that is in the Active state is processed as described in
Section 6.3. For squares that are in either the Static or Both states, an optical
flow comparison is not performed. Squares that are in the Static state contain
only static foreground and so will have zero flow, making comparison needless.
Squares that are in the Both state are likely to be in transition. Both is a transient
state, occurring either as an object stops and becomes static (during which time
the flow is changing to 0), or as an object begins to move again (during which
time the flow is changing from 0). In either situation, comparing the optical flow
is unreliable due to the known state change, and so it is not performed. Updates
to the optical flow are also not performed when a square is in one of these states.
The optical flow component is also reset when the Active state is re-entered (after
being in either the Static or Both state) as the object may have started moving
in a different direction and thus invalidated the flow component.
When updating a block that is in the Static state, the colour and motion features
become,
F_c(x', y', t, k) = \frac{\sum I_Z(x, y, t, I_{ST}(x, y, t), k) \text{ where } I_{ST}(x, y, t) = Static}{card(I_{ST}(x, y, t) = Static)},    (6.28)

F_m(x', y', t) = \frac{card(I_{ST}(x, y, t) = Static)}{card(I_{ST}(x, y, t))},    (6.29)
where IZ(x, y, t, IST (x, y, t), k) is the colour of channel k for the IST (x, y, t)th
static layer. IZ is an image that contains the colours of all the static pixels. For
each pixel, this image contains a list that contains the colour of each static layer
ordered according to their depth. The value of IST at the pixel can be used as an
index to access the appropriate colour.
When updating a block that is in the Both state, static and active foreground
must be considered. The colour and motion features in this situation become,
F^{static}_c(x', y', t, k) = \sum I_Z(x, y, t, I_{ST}(x, y, t), k) \text{ where } I_{ST}(x, y, t) = Static,    (6.30)

F^{active}_c(x', y', t, k) = \sum I(x, y, t, k) \text{ where } I_{ST}(x, y, t) = Active,    (6.31)

F_c(x', y', t, k) = \frac{F^{static}_c(x', y', t, k) + F^{active}_c(x', y', t, k)}{card(I_{ST}(x, y, t) = Static) + card(I_{ST}(x, y, t) = Active)},    (6.32)

F_m(x', y', t) = \frac{card(I_{ST}(x, y, t) = Static) + card(I_{ST}(x, y, t) = Active)}{card(I_{ST}(x, y, t))}.    (6.33)
The values calculated for F_c(x', y', t, k) and F_m(x', y', t) are used to update
the appearance model in the same manner previously described.
Performing a comparison with a model that contains static foreground (in either
the Static or Both state) differs from an update, as I_ST is likely to no longer be
valid when the comparison is performed. As comparisons should only occur prior
to an update (either within the SCF process, or when directly comparing two tracks
to a detected object), the template image stored by the track will be from the
previous frame, and thus be inaccurate. For
squares in the Static state, the static layer colour image (see Section 5.3) is used.
For each pixel in the grid square, the static colour that is closest to that of the
appearance model is used as the colour for that pixel; as such, the colour becomes,
C_{os}(x, y, t) = \arg\min_{n=[1..I_{ST}(x,y,t)]} |A_c(x', y', t) - I_Z(x, y, t, n)|,    (6.34)

F_c(x', y', t, k) = \frac{\sum C_{os}(x, y, t) \text{ where } (x, y) \in M_s(x, y, t)}{card(M_s(x, y, t))},    (6.35)
where C_{os}(x, y, t) is the colour of the static layer that is the closest match to the
appearance model, I_Z is an image that contains the colours of each static pixel (i.e.
such that if the pixel at (x, y) has n static layers, the corresponding location in I_Z
has n channels corresponding to the different layers), and M_s(x, y, t) is the static
foreground image. The closest matching colour is selected by comparing all k
colour channels. The motion occupancy for squares in the Static state becomes,
F_m(x', y', t) = \frac{card(M_s(x, y, t))}{card(I(t))}.    (6.36)
For squares in the Both state, the static layer colour image and the input colour
image are used. Matching is performed in the same manner as for the Static
state; the closest matching colour is used when calculating the total error,
C_{ob}(x, y, t) = \min(C_{os}(x, y, t), A_c(x', y', t) - I(x, y, t, k)),    (6.37)

F_c(x', y', t, k) = \frac{\sum C_{ob}(x, y, t) \text{ where } (x, y) \in M_s(x, y, t) \cup M_a(x, y, t)}{card(M_s(x, y, t) \cup M_a(x, y, t))},    (6.38)
where Cob(x, y, t) is the colour of the static layer or active layer that is the closest
match to the appearance model, and Ma(x, y, t) is the active foreground image.
The motion occupancy for the square is the number of pixels that have any motion
present (active or static),
F_m(x', y', t) = \frac{card(M_s(x, y, t) \cup M_a(x, y, t))}{card(I(t))}.    (6.39)
The values calculated for Fc(x′, y′, t, k) and Fm(x′, y′, t) are used to update the
appearance model in the same manner previously described. The system does
not consider the ratio of static foreground and active foreground within a grid
square, as it is likely to be changing rapidly due to the Both state being transient.
6.4 Incorporation into Tracking System
The integration of the scalable condensation filter into the tracking system results
in the following changes in the system (see Figure 6.7):
1. After motion detection has been performed, the SCF is updated (resample
particles and determine particle weights). This is performed after motion
detection as the output of the motion detector (motion images and opti-
cal flow) is used as input for the SCF (depending on the configuration of
features used by the objects being tracked) in addition to the input colour
image.
2. When matching detected objects to tracked objects, the SCF is used to aid
in matching objects. The distribution of the SCF for the track in question
can be checked to determine the likelihood of the detected object being the
track in question (see Section 6.4.1).
3. When updating a tracked object, the tracked object passes to the SCF the
current features that have been extracted for that object, and the object’s
movement for the frame. Object detection information is also passed, and
is used to update the SCF as described in Section 6.2.3.
4. When an object cannot be detected and its position in the frame needs
to be predicted, the distribution of the SCF for the track in question is used
to determine the most likely position for the track in question.
5. Any new objects added to the system result in a new component being
added to the SCF. Also, whenever a tracked object is removed from the
system, the component mixture for that track is removed from the SCF.
The use of the SCF to determine matches between detected objects and tracked
objects renders the use of a histogram (or appearance model) to determine which
of two detected objects is the better match for a given tracked object (see Section
3.4) redundant. As the match between a detected and tracked object is now par-
tially calculated from the SCF distribution, which itself uses the match between
the tracked object features (histogram/appearance model) and the image, the
SCF based tracking system by default uses the tracked object features to match
detected objects to tracked objects.
Figure 6.7: Integration of the SCF into the Tracking System
When adding detection results to the SCF mixture components, the detected
object that matches the tracked object as well as any detected objects that have
an uncertainty (see Equation 3.18, Section 3.4) within a threshold, are used when
updating the component mixture for that track. All such detections are given
an equal weight when adding them to the mixture. This ensures that in situations
where ambiguities arise, the SCF distribution will encompass both modes until
the situation is resolved.
6.4.1 Matching Candidate Objects to Tracked Objects
By using a condensation filter [77] rather than a traditional particle filter, the
tracking system has access to a weighted particle set derived from the features
from the tracked objects at the previous frame and the current images. This par-
ticle set represents the (approximate) probability of the locations of the tracked
objects in the current frame. As such, the output of the SCF can be used to aid
in evaluating matches between candidate objects and the list of tracked objects.
To enable candidate objects to be compared to tracked objects, particles are back
projected into a map, P_{T_j,t} (where T_j is the tracked object to which the map relates),
to indicate the likelihood of an object occupying a given pixel in the current
frame.
Within the SCF, each particle describes a bounding box and has a probability
associated with it. The maximum probability from all particles at a location
is used (rather than an average or total) as it is unaffected by the number of
particles, or their location in space. Consider the example situations shown in
Figure 6.8 (the yellow ellipse is the object of interest and the coloured rectangles
represent different particles; in these examples, the more accurately the particles
inscribe the object of interest, the higher their probability).
Figure 6.8: Determining match probability using particles. (a) Example 1; (b) Example 2.
In Example 1, if an average measure of particle probabilities is used, PTj ,t would
show a reduced likelihood on the right side of the object, due to the poor response
of the blue particle, despite the good response achieved by the red particle in the
same area. In Example 2, if a summation of all particle probabilities is used, PTj ,t
would show an increased likelihood on the right side of the object due to the large
number of particles covering this area, despite none of these particles yielding a
particularly high response. In each of these cases, the use of the maximum particle
weight results in strong probabilities in locations covered by particles with strong
responses, and areas with particles that have only weak responses will have low
probabilities.
Using this approach, PTj ,t at a given pixel location can be defined as,
P_{T_j,t}(x, y) = \max_{n = N_{j,min} .. N_{j,max}} (w_{j,n,t}) \text{ where } (x, y) \in s(j, n, t),    (6.40)
where s(j, n, t) is the nth particle for Tj, and wj,n,t is the corresponding weight,
Nj,min and Nj,max are the indexes of the first and last particles for Tj, and x, y is
the location being evaluated. As each s(j, n, t) describes a bounding box, it is very
simple to determine if the present location, x, y, is within the region described by
the particle.
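A minimal sketch of building the back-projected map of Equation 6.40, assuming each particle is stored as an axis-aligned bounding box (x, y, w, h); the particle values are illustrative.

import numpy as np

def back_project_particles(particles, weights, width, height):
    """Build the per-track probability map P_{T_j,t} of Eq. 6.40.

    Each particle is a bounding box (x, y, w, h); every pixel it covers takes
    the maximum weight of any particle covering that pixel.
    """
    p_map = np.zeros((height, width), dtype=float)
    for (x, y, w, h), weight in zip(particles, weights):
        x0, y0 = max(0, int(x)), max(0, int(y))
        x1, y1 = min(width, int(x + w)), min(height, int(y + h))
        region = p_map[y0:y1, x0:x1]
        np.maximum(region, weight, out=region)
    return p_map

particles = [(10, 10, 20, 40), (15, 12, 20, 40)]
weights = [0.3, 0.8]
p_map = back_project_particles(particles, weights, width=320, height=240)
print(p_map[20, 12], p_map[20, 25])  # 0.3 in the first box only, 0.8 where boxes overlap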
The likelihood of a candidate object matching a tracked object can then be defined
as,
F_{SCF}(C_i, T_j) = \frac{1}{W \times H} \sum_{x=X_0, y=Y_0}^{X_0+W, Y_0+H} P_{T_j,t}(x, y),    (6.41)
where FSCF (Ci, Tj) is the fit between the track Tj and the candidate Ci, X0 and
Y0 are the coordinates of the top left corner for the bounding box of Ci, and W
and H are the width and height of the bounding box of Ci. An average probability
is taken (rather than a total) to prevent the system from simply favouring the
largest object.
A limitation of this approach is that it does not consider the object size when
evaluating the match (i.e. it is possible for two objects of wildly different sizes to
have a similar appearance, so overall object size needs to be considered). Note
that it is not appropriate for Equation 6.41 to not normalise for size, as this
would result in larger objects being favoured in all matches. To overcome this,
the proposed likelihood is combined with the fit measure used in the baseline
system (see Equation 3.17, Section 3.4) such that the match between a candidate
object and a tracked object is,
\bar{F}(C_i, T_j) = \sqrt{F_{SCF}(C_i, T_j) \times F(C_i, T_j)}.    (6.42)
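A sketch of Equations 6.41 and 6.42, assuming the back-projected map of Equation 6.40 has already been computed and that the baseline fit F(Ci, Tj) is supplied by the existing tracking system; the example values are illustrative.

import numpy as np

def scf_fit(p_map, bbox):
    """Average back-projected particle probability under a candidate box (Eq. 6.41)."""
    x0, y0, w, h = bbox
    region = p_map[y0:y0 + h, x0:x0 + w]
    return float(region.mean()) if region.size else 0.0

def combined_fit(p_map, bbox, baseline_fit):
    """Geometric mean of the SCF fit and the baseline fit measure (Eq. 6.42)."""
    return (scf_fit(p_map, bbox) * baseline_fit) ** 0.5

p_map = np.zeros((240, 320))
p_map[50:100, 60:100] = 0.9                       # particles respond strongly here
print(round(combined_fit(p_map, (60, 50, 40, 50), baseline_fit=0.8), 3))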
6.4.2 Occlusion Handling
The use of the SCF also allows for improved occlusion handling, as the scalable
condensation filter can continue to provide probabilities relating to the location
of the occluded object. To take advantage of this ability, the Occluded state is
redefined, and a new state is created, Predicted. Figure 6.9 shows the modified
state diagram.
Figure 6.9: Update State Diagram Incorporating Occluded and Predicted States
Previously, the Occluded state was entered when an object could not be detected
for a single frame. The Occluded state is now entered when an object cannot
be detected at a given frame and also has an occlusion level of two (overlapping
with another object). The Occluded state assumes that the object detection has
failed due to the object being obscured. The Predicted state is entered when an
object cannot be detected at a given frame and has an occlusion level of less than
two (i.e. based on the previous frame, the tracked object is not obscured by any
other objects). In this situation, it is assumed that the object detection has failed
due to the object no longer being present in the scene. Given this, there is only
a timeout on the Predicted state, such that after τoccluded consecutive frames, the
track will pass to the Dead state.
From the Occluded state, an object can either move to the Active or Predicted
states. An object moves to the Active state if it is detected and matched. An
object moves to the Predicted state if it is deemed that the SCF no longer has
a sufficiently good estimate of the location of the object (i.e. all particles are
providing a very poor response to the features for the track), or if the object
ceases to be at an occlusion level of 2 and is not detected. In the event of either
of these occurring, the counter that limits the time spent in the Predicted state
is set to,
c_{predicted} = \min\left(\frac{\tau_{occluded}}{2}, c_{occluded}\right),    (6.43)
where cpredicted is the counter monitoring the time in the Predicted state, and
coccluded is a counter indicating how much time was spent in the Occluded state.
It is assumed that if an object transitions to Predicted from Occluded, it is because
it is no longer in the scene (the SCF can no longer track it, and the object detection
routines cannot find it). Whilst the system still enters the Predicted state briefly
(in case this assumption is wrong), the time spent in the state is significantly
reduced.
Whilst it is possible to implement a similar system of occlusion handling without
the use of a particle filter, such a system has no way of knowing when
the object has actually left, aside from monitoring the occlusion level. Such an
approach is likely to lead to additional tracking errors (such as track identities
being swapped or objects remaining after they have left the scene), and as a result
it is not implemented in the baseline system.
This approach also allows a smaller value of τ_occluded to be used, as it is known that
only objects whose detection has failed because they cannot be found (rather than
because they are occluded) are in this state.
6.5 Evaluation and Results
The proposed system using the SCF is evaluated using the same procedure de-
tailed in Section 3.5, and a comparison is made to the results of the baseline
system (see Section 3.5.4) and the system proposed in Chapter 5. Details on
the metrics used and annotation of the tracking output can be found in Sections
3.5.1 and 3.5.3 respectively. The proposed SCF based tracking system is derived
from that proposed in Chapter 5, and thus uses the alternative motion detection
routine and object detection routines proposed. Configuration files used to test
the system described in Chapter 5 are modified to include configuration for the
SCF. Values for the new system parameters for each dataset group are shown in
Table 6.1 for parameters that are constant for all configurations, and Table 6.2
for parameters that vary between datasets. d^x_min is the value of d_min for the x
dimension of the SCF, d^x_max is the value of d_max for the x dimension of the SCF,
and e^x_max is the value of e_max for the x dimension of the SCF.
Note that different types of particles are created for each different object type in
the system (i.e. for the RD datasets, there are types of particles for people and
vehicles). In these situations, particle bounds are applied that are appropriate
to the type of object the particle is following (particles used to track a vehicle
will have height and width bounds that match up with the size constraints of
the vehicle). Likewise, different types of objects have different noise distribution
bounds as they are expected to move at different rates (vehicles are faster than
people).
Parameter   Value
Condensation Filter Parameters
ν_init      50
ν_add       25
d^x_min     0
d^x_max     319
d^y_min     0
d^y_max     239
d^w_min     Minimum Height × Minimum Aspect
d^w_max     Maximum Height × Maximum Aspect
d^h_min     Minimum Height
d^h_max     Maximum Height
Table 6.1: System Parameters - Additional parameters for system configuration.
Parameter   RD    BC    AP    BE
Person Particle Parameters
e^x_max     2     4     N/A   2
e^y_max     2     4     N/A   2
e^w_max     1     2     N/A   1
e^h_max     1     2     N/A   1
Vehicle Particle Parameters
e^x_max     4     N/A   4     4
e^y_max     4     N/A   4     4
e^w_max     2     N/A   2     2
e^h_max     2     N/A   2     2

Table 6.2: System Parameters - Additional parameters for system configuration specific to each dataset group.
The SCF is configured to initially use 50 particles per track, with an increase of 25
for each occlusion level increase (there are three occlusion levels, so a maximum
of 100 particles for a track). Each tracked object uses a histogram as its standard
feature, and the appearance model described in Section 6.3 is also used when
the occlusion level is raised above 0 (no occlusion). All existing configuration
parameters are left unchanged.
The overall results for the tracking system incorporating the SCF are shown in
Tables 6.3 and 6.4. Detailed results for each dataset are shown in Appendix C.
Data  Detection    Localisation             Tracking
Set   D1    D2     L1    L2    L3    L4     T1    T2    T3    T4    T5
RD    0.83  0.66   0.82  1.00  0.98  1.00   0.56  0.37  0.89  0.91  0.45
BC    0.80  0.42   0.73  0.99  0.97  0.99   0.49  0.30  0.71  0.83  0.41
AP    0.82  0.60   0.75  1.00  1.00  1.00   0.60  0.38  0.94  1.00  0.44
BE    0.68  0.26   0.57  0.97  0.98  1.00   0.28  0.13  0.72  0.86  0.18

Table 6.3: Tracking System with SCF Results (see Section 3.5.1 for an explanation of metrics)
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.69                0.94                   0.57
BC         0.49                0.91                   0.49
AP         0.65                0.92                   0.60
BE         0.34                0.85                   0.32

Table 6.4: Tracking System with SCF Overall Results (see Section 3.5.1 for an explanation of metrics)
The overall change in performance resulting from the use of the SCF is shown
in Tables 6.5 (individual metrics) and 6.6 (overall metrics). As these results
show, the use of the SCF results in an incremental improvement, with small
improvements in the tracking performance of the RD and BE datasets, and little
change in the BC and AP datasets.
The metrics for the RD datasets are shown in Tables C.1 and C.2 (see Appendix
C). The use of the SCF has little impact on detection and localisation perfor-
mance, only resulting in incremental improvements. This is expected, as the
detection methods used are unchanged. The small improvement is due to the im-
proved performance when predicting positions in the event of missed detections,
and improved handling of occlusions when the object detection methods perform
poorly. The improvement in occlusion handling also results in the small improve-
ment in tracking performance shown by the increase in T1 (number of objects
tracked during time). A small improvement in the fragmentation (T3) is offset
by a small decrease in confusion (T4). Whilst the SCF does improve the system's
Data  Detection     Localisation             Tracking
Set   D1    D2      L1    L2    L3    L4     T1     T2     T3     T4     T5
RD    0.01  0.03    0.01  0.00  0.00  0.00   0.06   0.00   0.01   -0.02  0.00
BC    0.00  0.00    0.00  0.00  0.00  0.00   -0.01  0.01   -0.03  0.00   0.05
AP    0.00  -0.01   0.00  0.00  0.00  0.00   0.00   -0.01  -0.03  0.03   0.02
BE    0.00  -0.02   0.00  0.10  0.12  0.00   0.04   -0.02  -0.06  0.00   -0.04

Table 6.5: Improvements using SCF (see Section 3.5.1 for an explanation of metrics)
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD         0.03                0.00                   0.04
BC         0.00                0.00                   0.00
AP         0.00                0.00                   0.00
BE         -0.02               0.07                   0.01

Table 6.6: Overall Improvements using SCF (see Section 3.5.1 for an explanation of metrics)
ability to follow objects through more complex scenes (improvement in T3), it is
susceptible to errors when a new modality is first initialised (decrease in T4) and
the features are able to describe multiple objects well. Within the RD dataset,
many of the people are dressed similarly (dark clothing), and enter in groups.
Likewise, many cars are similar colours (silver, white) and, at times, also enter
in pairs. In these situations, where the detection routines do not immediately
detect the two objects, it is possible for the identities to swap before the features
are well defined.
Examples of the difference in occlusion handling are shown in Figures 6.10 and 6.11
(in both figures, the top row shows results without using the SCF, the bottom
row with the SCF).
In Figure 6.10, a person walks behind a stopped car, resulting in only the torso
being visible and the person detection routine not being able to consistently locate
them. Without the SCF, the system loses track of the person, and creates a new
track several frames later when they can be detected correctly once more. Using
Figure 6.10: Example Output from RD7 - Occlusion handling using the SCF (the person walking behind the car on the far side of the road). Panels show frames 1495, 1500, 1505 and 1510; top row without the SCF, bottom row with the SCF.
the SCF however, the system is able to continue to track the person through the
occlusion.
Figure 6.11 shows an occlusion between three moving people. Whilst neither
system handles this occlusion correctly, the SCF is able to maintain the identities
of the individual people for longer. Ultimately however, the occlusion continues
for too long for the SCF to be able to properly resolve the situation.
The metrics for the BC datasets are shown in Tables C.3 and C.4 in Appendix
C. There is no change in the overall metrics for the BC datasets, however there
are small changes within the individual tracking metrics. No change is expected
within detection and localisation, as the use of the SCF has little effect on the
object detection (it only affects the process of matching detections to tracks).
The change in the tracking performance is varied, with a decrease in T1 and T3
(number of objects being tracked and fragmentation respectively) and an increase
in T2 and T5 (tracking time and 2D trajectories respectively). The drop in T1
and T3 is a result of uncertainty when new objects enter the scene. Within the
Figure 6.11: Example Output from RD7 - Occlusion handling using the SCF (the group of people in the bottom right corner of the scene). Panels show frames 2150, 2200, 2220, 2240, 2260 and 2280; top rows without the SCF, bottom rows with the SCF.
RD datasets this uncertainty resulted in a decrease in T4. In these datasets,
similarly coloured objects often enter in pairs and if the detection algorithms are
not able to detect both objects consistently, there is a possibility that the SCF may
begin to track the second object. Within the BC datasets, people frequently enter
through doors, which creates motion that is detected by the motion detector. In
several instances, the door is detected as it opens, and as the person enters from
behind the door, the system swaps from the door to the person and tracks the
person. However, when using the SCF in such a situation, the features used by the
SCF (histogram and appearance model) are created using the initial detections
of the door. As a result, the SCF often fails to transfer the newly created track
from the door to the person, resulting in poorer performance for the T1 and T3
metrics. An example of this situation is shown in Figure 6.12.
Figure 6.12: Example Output from BC16 - Initialisation of tracks from spurious motion (the person entering through the door at the top of the scene). Panels show frames 262 to 272; top rows without the SCF, bottom rows with the SCF.
Figure 6.12 shows an example of a person entering through a door. The person is
not easily distinguishable from the background, and the majority of the motion in
the region is actually caused by the door. As a result, the system initially tracks
the door. When the SCF is not used (Figure 6.12, top two rows), the system is
able to switch from the door to the person once the door motion subsides. As the
door and person are still close to one another, the system is able to make this
association. However, when the SCF is used (Figure 6.12, bottom two rows), the
system fails to make this association. The features for the tracked object, which
are used by the SCF to locate it in future frames, have been initialised using the
incorrect detections on the door and therefore are modelling the door. As such,
the system continues to track the door, eventually losing track of it once all
motion from the door stops.
The improvement in T2 and T5 can be attributed to the improved occlusion
handling. The BC datasets contain a large number of occlusions, and whilst
not all are able to be correctly resolved using the SCF, there is a noticeable
improvement in the system’s occlusion handling abilities. Figure 6.13 shows an
example of an occlusion and how it is handled with the SCF (bottom two rows)
and without (top two rows).
In Figure 6.13 one person walks behind another, and is totally occluded for ap-
proximately 20 frames. Without the SCF, the person is lost. The system is unable
to separate the two people when the occlusion begins, and cannot re-detect the
person until after the occlusion has passed. The SCF however, is able to resolve
this occlusion successfully, despite a false object being briefly detected (in Frame 2720,
the shadow of the occluded person is tracked for a few frames).
The results for the AP datasets are shown in Tables C.5 and C.6 in Appendix
C. Detection and localisation performance is unchanged, as is overall tracking
performance. Whilst small changes are observed in the individual tracking metrics
(a mix of gains and drops), these can be attributed to the differing performance
of the systems when creating new tracks, rather than any significant tracking
errors or improvements. Due to the relative simplicity of the AP datasets (no
major occlusions, few objects in the scene at a given time) and the already good
Figure 6.13: Example Output from BC16 - Improved occlusion handling using the SCF (one person walks behind another). Panels show frames 2690 to 2780; top rows without the SCF, bottom rows with the SCF.
performance achieved with these datasets, these results are as expected. An
example of tracking output is shown in Figure 6.14.
Figure 6.14 illustrates an error that can occur using the SCF. As the vehicle exits
the scene (travelling behind the aero bridge), the SCF has difficulty tracking the object.
The system begins to track the vehicle poorly at frame 470, losing track of the
vehicle, before re-creating the track just before the object ultimately exits. The
fact that the majority of the vehicle is obscured, and that it is of a similar
colour to both the tarmac and the aero bridge, leads to confusion for the SCF and
Figure 6.14: Example Output from AP11-C4 - Errors as an object leaves the scene using the SCF. Panels show frames 430 to 500.
poor tracking. This could be overcome by using more bins in the histogram to
achieve greater separation between the colours, however this comes at the cost of
memory and computational speed. This problem is observed at other times in the
AP datasets, and leads to a small drop in the T3 metric (object fragmentation).
Tables C.7 and C.8 (see Appendix C) show the results for the BE datasets.
Overall localisation and tracking performance are improved slightly by using the
SCF, while there is a slight decrease in overall detection performance. As was
found in previous evaluations, very poor results for D2, T2 and T5 are obtained
due to the camera angle of BE-C3.
Figure 6.15 shows the tracking performance resulting from the use of the SCF (top
row shows the system without the SCF, bottom row with the SCF). There is little
noticeable difference in performance between the two systems, as the main
problem within the dataset is the poor detection performance. The occlusions
present are simple and can be handled effectively by both systems.
Figure 6.15: Example Output from BE19-C1 - Tracking Performance using the SCF. Panels show frames 520, 570, 620, 670 and 730; top row without the SCF, bottom row with the SCF.
Figure 6.16 shows the difference in performance when processing BE20-C3 when
using the SCF and when not (top two rows are without the SCF, bottom two
rows are with). It can be seen that there is little difference in how the systems
perform when confronted with the occlusion. In each instance several incorrect
tracks are created, and tracks are frequently lost before being recreated. The
poor motion and object detection in this dataset, combined with the three people
in the occlusion being dressed in similar coloured clothing means that the SCF is
able to easily swap between people, creating errors. The problem of similarly coloured
clothing is compounded by the poor detection, as tracks are frequently lost and recreated,
meaning that the appearance models and histograms are initialised on data that
is not correctly segmented.
Figure 6.16: Example Output from BE20-C3 - Localisation Performance using the SCF. Panels show frames 260 to 710; top two rows without the SCF, bottom two rows with the SCF.
Overall, the use of the SCF does not have the same impact on system perfor-
mance as the addition of an improved motion detection routine. Despite this,
improvement is still observed when handling occlusions and tracking in difficult
circumstances. The tight integration with the previously proposed tracking sys-
tem allows for features to be updated continuously, and for new tracks to be
easily added and removed from the tracking system.
Whilst the SCF is effective at resolving occlusions and handling situations where
there are segmentation errors, it is very vulnerable when a track is initialised
using poor segmentation results. This often results in the SCF associating the
track with the source of the segmentation error (i.e. a door), rather than the
moving object. When the erroneous motion stops, the tracked object is then
often lost, and an additional new track must be spawned to track the object of
interest.
Table 6.7 shows the data throughput benchmarks for the proposed tracking sys-
tem. These throughput rates are calculated under the same conditions used when
benchmarking the baseline system (see Section 3.5.4). As expected, the incorpo-
ration of the SCF results in a drop in throughput for all datasets.
The RD and BE datasets suffer the most significant drop in performance (41%
and 39% respectively when compared to the system running without the SCF).
Once again, this can be attributed to the presence of static foreground. Compar-
ing features that contain static foreground is significantly more computationally
intensive than comparing features that contain only active foreground. The pro-
cess that extracts the colour of static layers from the motion segmentation (to
allow comparison of features that contain static foreground) is primarily respon-
sible for this increase in processing time.
The AP datasets suffer only a small performance penalty when using the SCF, due to
the simple nature of the data. The BC datasets suffer more of an impact, due
to the high number of occlusions (resulting in additional particles and features)
and the large size of the objects. The computational demands of all the features
used by the SCF increase linearly with the size of the region being processed.
Data Set   Frame Rate (fps)
RD         7.57
BC         9.99
AP         13.22
BE         8.75
Table 6.7: Proposed Tracking System Throughput
Despite the drop in data throughput when compared to the baseline system and
the proposed system without the SCF, all datasets are processed at greater than
7 fps. Given that these results are achieved executing on a single core, and that
significant optimisations can be made to the motion segmentation and processing
of object features, it is feasible that the proposed system could process data in
real time.
6.6 Summary
This chapter has presented a new implementation of the condensation filter, the
Scalable Condensation Filter (SCF). The SCF is a derivative of the mixture
particle filter [163] that is intended to operate in tandem with an existing tracking
system, rather than as a self contained tracking system. The tight integration of
the SCF with an existing tracking system allows the SCF to:
• Add and remove new mixtures from the filter without any user intervention.
• Use progressively updated features for each tracked object.
• Use a time varying number of particles for each tracked object, based on
the system status as determined by the underlying tracking system.
• Use a varied number and type of features for each tracked object, based on
the system status as determined by the underlying tracking system.
• Incorporate detection results for each tracked object into each mixture com-
ponent, allowing each mixture to monitor multiple modes temporarily
when object detection is uncertain (i.e. in the presence of occlusions, or
poor segmentation/detection performance).
• Use the tracking system to flag occlusions, and take appropriate action for
both the occluded and occluding objects to ensure that neither object is
lost.
The use of the SCF within the existing tracking system provides improved occlu-
sion handling, and improved performance in situations where segmentation and
detection are poor. The scalability of the system (time varying number of parti-
cles and number/type of features) also ensures that there is not a severe impact
on system performance. To utilise the previously proposed motion segmentation
algorithm (see Chapter 4), a new tracking feature has also been proposed.
The proposed tracking system, with the SCF integrated, has been tested using
the ETISEO database [130] and improvement, particularly when handling occlu-
sions, has been shown compared to the same system not using the SCF. However,
it has been shown that the SCF can also introduce problems in some situations,
particularly when there is poor contrast between the foreground and background,
and when the initial detection results are poor as the object first enters the scene. In
these situations, tracks can be switched to either erroneous motion in the back-
ground, or potentially between objects. Improving the performance of the object
detection, and using more discriminative object features would help overcome
these problems.
Chapter 7
Advanced Object Tracking and
Applications
7.1 Introduction
Single camera tracking techniques can be extended to work with multiple cameras,
either by tracking objects through a network of cameras or through the fusion
of multiple views of the same area, and can be used as an initial step in more
advanced surveillance systems such as event and action recognition. This chapter
will apply the single camera tracking techniques previously discussed in this thesis
to two areas:
1. Multi-camera tracking.
2. Sensor Fusion for Object Tracking.
7.2 Multi-Camera Tracking
7.2.1 System Description
The tracking framework described in Chapter 3 is able to support multiple camera
inputs. Each camera input (view) is assigned an object tracker to track all objects
within the view. Each individual tracker has no knowledge of the other cameras
in the network, and as such is responsible for tracking only the objects within its
field of view.
The information from the multiple views is combined by a separate module after
each tracker has processed the current frame. This module aims to ensure that
every object is tracked in every view, and as such will attempt to group objects
in different views together, based on matching criteria. To achieve this,
the module performs the following tasks:
• Translate the 2D image coordinates of the tracked objects into a 3D coor-
dinate system.
• Match ungrouped objects across different views.
• Check that objects that have been paired are still a valid match.
• Check for occlusions affecting paired objects, and ensure that the system
does not attempt to detect an object that is badly occluded in one view if
it can be clearly seen in another.
To perform these tasks, camera calibration information is required that is able to
translate image coordinates (in pixels) to a world coordinate scheme,
[T^n_{\omega x}, T^n_{\omega y}] = \zeta_{i,\omega}(T^n_{x,i}, T^n_{y,i}),    (7.1)

and translate image coordinates in one camera view to another,

[T^n_{x,j}, T^n_{y,j}] = \zeta_{i,j}(T^n_{x,i}, T^n_{y,i}),    (7.2)

where T^n_{x,i} and T^n_{y,i} are the pixel coordinates of an object, n, in camera i, T^n_{\omega x} and
T^n_{\omega y} are the world coordinates for the object n, \zeta_{i,\omega} is a transform that translates
image coordinates in camera i to the world coordinate scheme and \zeta_{i,j} is a
transform that translates image coordinates in camera i to camera j. It is assumed
that any image coordinates that are translated are located on the ground plane
(i.e. z = 0).
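As an illustration of these coordinate translations only, the sketch below substitutes hypothetical ground-plane homographies for the Tsai calibration transforms ζ; the matrices are invented for the example and are not taken from either database.

import numpy as np

def apply_homography(H, x, y):
    """Map an image point to another coordinate frame with a 3x3 homography.

    A ground-plane homography stands in for the calibration transforms
    zeta_{i,omega} and zeta_{i,j}; points are assumed to lie on the ground
    plane (z = 0).
    """
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]

# Hypothetical calibration: camera-1 image coordinates to world coordinates (metres),
# and camera-1 to camera-2 via the composed transform inv(H_2w) @ H_1w.
H_1w = np.array([[0.02, 0.0, -3.0], [0.0, 0.03, -2.0], [0.0, 0.001, 1.0]])
H_2w = np.array([[0.025, 0.0, -4.0], [0.0, 0.028, -1.5], [0.0, 0.0012, 1.0]])
wx, wy = apply_homography(H_1w, 160.0, 200.0)                          # Eq. 7.1
cx, cy = apply_homography(np.linalg.inv(H_2w) @ H_1w, 160.0, 200.0)    # Eq. 7.2
print((round(wx, 2), round(wy, 2)), (round(cx, 1), round(cy, 1)))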
The databases used for multi-camera tracking within this thesis (ETISEO [130]
and PETS 2006 [132]) each provide camera calibration created using Tsai’s cam-
era calibration algorithm [161], which provides a world coordinate scheme (mea-
sured in meters) and the ability to transfer image coordinates between the world
coordinates and the alternative camera views.
7.2.2 Track Handover and Matching Objects in Different
Views
Objects can be paired between views in two different ways:
1. Detected in each view and then matched according to one or more features.
2. Detected in a single view, with its position then translated to a second.
The second approach is very simple, in that no matching of features is required.
The bounding box is simply translated using the camera coordinates. Objects
translated between views in this manner begin in the Transfer state (see Section
3.2.1).
The first approach requires features to be extracted from each view and compared,
the proposed multi-camera system uses two features:
1. Position
2. Colour Histogram
Appearance models are not used, as, unless the cameras are observing the scene
from very similar angles, the appearance models will record a poor match. Colour
histograms however are view invariant (with the exception of non-uniformly
coloured objects), and so are better suited to matching across different cam-
eras. It is assumed that there is no significant difference in the colour calibration
of the cameras in the network (i.e. no coloured filters, no large differences in gain
or exposure).
Position is matched by projecting the coordinates in each view into one another,
and calculating the Euclidean error, such that,
R(i, j) = \frac{\sqrt{(T^n_{x,i} - tx_j)^2 + (T^n_{y,i} - ty_j)^2} + \sqrt{(T^n_{x,j} - tx_i)^2 + (T^n_{y,j} - ty_i)^2}}{2},    (7.3)

where R(i, j) is the average reprojection error for tracks i and j, T^n_{x,i} and T^n_{y,i} are
the x and y image coordinates for track i, and txj and tyj are the translated
image coordinates for track j.
Colour histograms are compared using the Bhattacharya coefficient (see Section
3.4). Histogram bins are normalised to sum to one, to ensure that B (the match
between the two histograms) is in the range [0..1] (0 being no match, 1 being a perfect match). As it
is expected that the objects will not be the same size in the two views, comparing
actual bin counts would not be suitable.
Provided the matches for the two features satisfy the equations,
B(i, j) > \tau_B,    (7.4)

R(i, j) < \tau_R,    (7.5)
where B(i, j) is the Bhattacharya coefficient when comparing the histograms of
track i and j, R(i, j) is the average reprojection error, and τB and τR are the
thresholds for these measures, the objects in the two views are paired. In each
frame, the positions of the paired objects are checked again to ensure that they are still
within the position threshold, τd. τB, τR and τd are set to 0.75 (75% similarity
between the histograms), 1 (within 1 meter on the ground plane) and 2 (within
2 meters on the ground plane) respectively. If the positions exceed this threshold
for three consecutive frames, the pair is broken.
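A sketch of the pairing test of Equations 7.3 to 7.5, assuming positions have already been translated onto a common ground plane (in metres); the example coordinates are illustrative.

import math

def reprojection_error(pos_i, pos_j_translated, pos_j, pos_i_translated):
    """Average reprojection error between two tracks (Eq. 7.3)."""
    d1 = math.dist(pos_i, pos_j_translated)   # track j projected into view i
    d2 = math.dist(pos_j, pos_i_translated)   # track i projected into view j
    return (d1 + d2) / 2.0

def should_pair(bhattacharyya, reproj_error, tau_b=0.75, tau_r=1.0):
    """Pair two tracks across views when both thresholds are met (Eqs. 7.4-7.5)."""
    return bhattacharyya > tau_b and reproj_error < tau_r

r = reprojection_error((2.0, 3.0), (2.3, 3.1), (5.0, 1.0), (5.2, 0.8))
print(round(r, 2), should_pair(0.82, r))   # histogram match 82%, error well under 1 metre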
7.2.3 Occlusions
In a single camera tracking system, the tracking algorithm must attempt to detect
and track all objects within the scene, even when those objects may be involved
in severe occlusions. This may often lead to tracked objects being lost, or the
identities of two or more tracked objects being swapped. Within a multi-camera
system however, it is possible that the area containing the occlusion may be
visible from multiple positions, and so whilst a person may be severely occluded
in one view they may not be in another. Figure 7.1 shows an example of such
an occlusion. There are three people in the scene, but in camera one, the third
person is obscured behind the first two, and in the second camera, only the third
person is clearly visible (the others are partially occluded).
Rather than try to detect all three people in each camera view, a better approach
is to attempt to detect only the first two people in camera one, and only the
third in camera two. Using the camera calibration information, the system can
Figure 7.1: A Severe Occlusion Observed from Two Views. (a) Camera 1; (b) Camera 2.
still maintain the position of all people in all cameras, without the burden of
attempting to detect severely occluded people indefinitely.
Occlusions can be detected within a single camera by analysing the location of
the tracked objects relative to one another. Using the camera calibration, it is
possible to determine the order of the objects within the scene (i.e. which object
is closest to the camera, and thus in full view), and which objects are obscured.
If an object is obscured in one view, but visible in another, the obscured view is
flagged such that the system will not attempt to locate it.
For this approach to function properly, the tracked objects need to be initially
detected (and tracked) prior to the occlusion, as detecting the objects during the
occlusion is likely to prove unreliable.
7.2.4 Evaluation using ETISEO database
The multi-camera extensions to the tracking system are evaluated on the AP and
BE datasets of the ETISEO database, used in the previous evaluations within the
thesis. These datasets are each two camera datasets, and have camera calibration
supplied, calculated using the Tsai camera calibration toolbox [161]. The other
datasets (RD and BC) are only single camera datasets, and thus are of no interest
when testing a multi camera system. Details on the metrics used and annotation
of the tracking output can be found in Sections 3.5.1 and 3.5.3 respectively.
The ETISEO evaluation software uses the ID of each tracked object in both
the ground truth data and system output to determine the effectiveness of the
system at maintaining an object’s identity over time. When matching objects
across different camera views, the multi-camera system sets both object IDs to
the same value. This is done to indicate within the tracking system that this is the
same object viewed from two angles. However, it does result in an error being
recorded by the evaluation software. This is due to the object being detected and
tracked (and thus logged in the output file) for several frames under a different
ID. Normally, a change in ID indicates that tracking has been lost and
the object has been re-created. However, with a multi-camera system it may be
because a track in one view has been paired with a track in another. Whilst this
is not a tracking error, the evaluation tool is unable to distinguish between the
two situations. As such, it is expected that there will be a small performance
drop in the tracking metrics as a result of this.
The overall results for the tracking system incorporating the SCF are shown in
Tables 7.1 and 7.2. Detailed results for each dataset are shown in Appendix D.
            Detection     Localisation              Tracking
Data Set    D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
AP          0.83  0.64    0.77  1.00  1.00  1.00    0.64  0.44  0.94  1.00  0.61
BE          0.64  0.22    0.54  0.88  0.91  1.01    0.24  0.09  0.66  0.89  0.16

Table 7.1: Multi-Camera Tracking System Results (see Section 3.5.1 for an explanation of metrics)
Tables 7.3 and 7.4 show the difference in performance between tracking in single
cameras and tracking with multiple cameras on these datasets (both tracking
systems are using the SCF, see Chapter 6, and differences are measured against
Data Set    Overall Detection    Overall Localisation    Overall Tracking
AP          0.68                 0.93                    0.66
BE          0.31                 0.79                    0.28

Table 7.2: Multi-Camera Tracking System Overall Results (see Section 3.5.1 for an explanation of metrics)
those recorded in Section 6.5). There is a considerable increase in the performance
of the AP datasets, whilst the BE datasets have recorded a slight drop across most
metrics, and all overall metrics.
            Detection      Localisation                   Tracking
Data Set    D1     D2      L1     L2     L3     L4        T1     T2     T3     T4     T5
AP          0.01   0.04    0.02   0.00   0.00   0.00      0.04   0.06   -0.01  0.00   0.17
BE          -0.04  -0.04   -0.03  -0.09  -0.06  0.01      -0.04  -0.04  -0.06  0.03   -0.02

Table 7.3: Improvements using a Multi-Camera System (see Section 3.5.1 for an explanation of metrics)
Results for the AP datasets are shown in Tables D.1 and D.2 (see Appendix D).
As these results show, the use of a multi-camera system improves the performance
of the tracking system.
Figures 7.2 and 7.3 show examples of the AP dataset tracked by two single camera
systems and by a single multi-camera system. The top two rows of these figures
show the output from single camera tracking systems for cameras C4 and C7, the
bottom two rows show the output of the proposed multi-camera tracking system.
Figure 7.2 shows a situation where an object, visible in both cameras, leaves
the field of view of one camera before re-entering, whilst remaining visible in
the second camera. By using a multi-camera system, the object is assigned its
original identity again when it reenters the scene (denoted by the colour of the
bounding box drawn around the object, in this case black).
Figure 7.3 shows a situation where two objects move from camera C7 (second
and fourth rows) to camera C4 (first and third rows). The use of multi-camera
Data Set    Overall Detection    Overall Localisation    Overall Tracking
AP          0.03                 0.01                    0.06
BE          -0.04                -0.06                   -0.03

Table 7.4: Overall Improvements using a Multi-Camera System (see Section 3.5.1 for an explanation of metrics)
Figure 7.2: Example System Results AP11 - Object leaving and re-entering field of view (frames 400, 450, 500, 550 and 600 for each of the four rows; panels (a)-(t) labelled row by row)
system ensures that the same object is assigned the same ID in each view. The
system is also alerted when an object moves from one camera view into another,
helping to initialise tracking of the object in the second view slightly more quickly.
In Figure 7.3, it can be seen that the multi-camera system does not provide any
advantage when tracking the plane at the top of C4. This object is only visible
in a single view, and so no benefit is gained.
Figure 7.3: Example System Results AP12 - Object leaving and re-entering field of view (frames 50, 100, 150, 200 and 250 for each of the four rows; panels (a)-(t) labelled row by row)
Results for the BE datasets are shown in Tables D.3 and D.4 in Appendix D. In
the case of the BE dataset, the use of the multi-camera system does not result in
an improvement in performance. In fact, a small drop in overall performance is
recorded. The drop in tracking performance can be attributed in part to the
switching of identities, and in part to the unstable detection proce-
dures. The poor detection performance in C3 is once again present, and this also
has an effect on the overall system performance.
Figures 7.4, 7.5 and 7.6 show examples of the BE dataset tracked by two single
camera systems and by a single multi-camera system. The top two rows of these
figures show the output from single camera tracking systems for cameras C1 and
C3, the bottom two rows show the output of the proposed multi-camera tracking
system.
Figure 7.4 shows an example of the system output from BE19. In this sequence,
a person is exiting the building through the front doors and then walks down the
driveway. Like the single camera system, a false track is spawned at the door. As
the door is only covered by one camera, similar performance is expected. As the
person walks down the steps and enters the view of C3, the single camera system
spawns an extra track. Due to the camera angle and segmentation errors arising
from the black pants against the dark roadway, two people are detected where
there is only one. After a short period of time however, the second track is lost.
When using the multi-camera system, a false track is created in the same manner
as the false tracking in the single camera system. However, unlike the single
camera system, this track persists for much longer. The multi-camera system
correctly matches the track for this person in C1 with the initial track created
in C3, however, where the second track is incorrectly spawned, an occlusion is
created (albeit, between two tracked objects tracking the same person). The
occlusion handling of the multi-camera system is then able to, incorrectly, keep
both tracks alive far longer than in the single camera system.
Figure 7.5 shows another example of the tracking from BE19. The multi-camera
system is able to continue to track the person leaving the car for slightly longer
than the single camera system (note in (f), the blue bounding box, that is the
track previously associated with this person). Due to there being continuing
non-detections, this person is lost later on (see (s)). However, when they are
detected again, they are quickly re-associated with the corresponding track in
the other view, and they return to having the same track ID (denoted by the
bounding box colour, see (t)) that they had previously. This process however,
ultimately results in several ID changes which are registered by the system as
Figure 7.4: Example System Results BE19 - Track matching and occlusion handling (frames 390, 420, 450, 480 and 510 for each of the four rows; panels (a)-(t) labelled row by row)
errors. The predicted positions shown in frames 570 and 580 ((q) and (r)), where
the predictions are taken from the position in the other camera view, are also
treated as false detections as they do not match the target object well. Given
this, a missed detection for the person is still recorded.
Figure 7.6 shows an example from BE20. As can be seen, even with combining
the two camera views the occlusions are still not properly resolved.
Despite the improvement in performance of the BE19 dataset, there is no im-
provement in BE20. This can be explained by the manner in which the camera
calibration matches tracks across the views. To transfer points between views, it
Figure 7.5: Example System Results BE19 - Re-associating objects after detection and tracking failure (frames 560, 570, 580, 590 and 600 for each of the four rows; panels (a)-(t) labelled row by row)
is desirable that they are on the ground plane (i.e. z = 0). The feet of a person
should be on the ground plane, and so the feet position (taken as the centre of
the bottom of the bounding box inscribing the tracked object) is used to deter-
mine the proximity of objects in different views to one another. However, if the
segmentation is inaccurate, then this foot position is also likely to be inaccurate
(either too low, if part of a shadow is grouped with the track; or too high if the
lower legs are missed). Within the BE datasets, and particularly C3, the seg-
mentation performs very poorly due to a combination of the high noise levels and
people appearing at an angle (due to the way the camera is mounted). To try
and counter this with the BE datasets, the ground plane distance allowed between objects
in different views for them to be considered a match is set quite high (2m), which also
Figure 7.6: Example System Results BE20 - Occlusion Performance (frames 950, 1000, 1050, 1100 and 1150 for each of the four rows; panels (a)-(t) labelled row by row)
results in several false matches. Setting this lower results in almost no matches.
This problem is exacerbated the further from the camera the person is, as errors
further from the camera translate to larger distances on the ground plane.
Problems such as these can potentially be overcome by using different strategies
to transfer points between views. The use of F-matrices allows a point in one
view to be matched to a line in a second; by transferring the head and feet points
from one view to two lines in the second view, and searching for an object bounded
by those lines, performance could potentially be improved.
The motion image could also be transferred to a common coordinate scheme [5]
to try and determine where ground plane overlaps are in the two views.
7.3 Multi-Spectral Tracking
Object tracking and abandoned object detection systems typically rely on a single
colour modality for their input. As a result, performance can be compromised
when low lighting, shadowing, smoke, dust or unstable backgrounds are present,
or when the objects of interest are a similar colour to the background. Thermal
images are not affected by lighting changes or shadowing, and are not overtly
affected by smoke, dust or unstable backgrounds. However, thermal images lack
colour information which makes distinguishing between different people or objects
of interest within the same scene difficult. Using modalities from both the visible
and thermal infra-red spectra allows more information to be extracted from a
scene, and can help overcome the problems associated with using either modality
individually.
There are a variety of fusion points available when fusing visual colour and ther-
mal image feeds for use in a surveillance application. To determine the most
appropriate point, four approaches for fusing visual and thermal images for use
in a person tracking system (two early fusion methods, one mid fusion and one
late fusion method), are evaluated. A final fusion approach is proposed based on
this initial evaluation.
7.3.1 Evaluation of Fusion Points
To determine the most appropriate method for fusing the thermal infrared and
visible light images for object tracking, four different fusion approaches are pro-
posed (see Figure 7.7):
1. Fusing images during the motion detection by interlacing the images.
2. Fusing the motion detection results of each image.
3. Fusing when updating the tracked objects using detected object lists from
each modality.
4. Fusing the results of two object trackers, which each track a modality in-
dependently.
Figure 7.7: The points for fusion in the system
For each of these proposed systems, the tracking system described in Chapter 6
is used (with any required modifications made to allow for the fusion process). In
all cases, the scalable condensation filter is used to support the tracking, using a
histogram model and the proposed appearance model (see Section 6.3).
Fusion in the Motion Detector
The first fusion method involves fusing the images prior to the motion detec-
tion by interlacing the luminance channel of the visible light image with the grey
scale thermal infrared image. This approach is facilitated by using a motion
detector which requires YCbCr 4:2:2 input [43]. The motion detector analyses
images in 2 pixel (four value, two luminance, one blue chrominance and one red
chrominance) blocks from which clusters containing two centroids (a luminance
and chrominance cluster, {Y1, Y2;Cb,Cr}) are formed. The centroids of the clus-
ters in the background model are compared to those in the incoming image to
determine foreground/background.
Rather than convert the colour image to YCbCr 4:2:2 format as would be done in
normal circumstances, it is converted to YCbCr 4:4:4. The thermal information
is then interlaced with the colour information. By treating the thermal informa-
tion as additional luminance data and doubling the luminance information, we
effectively create a YCbCr 4:2:2 image (see Figure 7.8) that can be fed directly
into the tracking system without any further modifications.
Figure 7.8: Fusing Visual and Thermal Information into a YCbCr Image for use in the Motion Detection
This results in the motion detector clusters becoming {Y, T ;Cb,Cr}. This
method of fusion has the advantage of consuming little processing resources on
top of our existing system (the only additional load is when performing motion
detection), and is also very simple to implement. It does however require that the
colour and thermal images be correctly registered, which may require additional
processing, or in some situations, not be possible.
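A minimal sketch of this interlacing step, assuming registered colour and thermal frames of identical size (the function name and array layout are illustrative; only the {Y, T; Cb, Cr} pairing follows the text):

```python
import cv2
import numpy as np

def fuse_to_ycbcr422(colour_bgr, thermal_gray):
    """Pair each pixel's visible luminance with its thermal value.

    The colour frame is converted to YCbCr 4:4:4 and the thermal value is added
    as a second luminance sample, giving per-pixel {Y, T; Cb, Cr} blocks with the
    same layout as the YCbCr 4:2:2 input the motion detector expects.
    """
    ycrcb = cv2.cvtColor(colour_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    return np.stack([y, thermal_gray, cb, cr], axis=-1)
```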
Fusion After Motion Detection
The use of middle or late fusion allows for greater control over the information
contained in the images that can be used by the tracking process. This informa-
tion can be used to greatly improve the accuracy and robustness of the detection
and tracking system. In both the second and third (see Section 7.3.1) of the
proposed fusion systems we compute motion detection for each of the images. If
either image shows an abnormal increase in motion, it is disregarded. In the un-
likely event that both show such an abnormality, the more consistent of the two
is chosen. The abnormality of the images is assessed by examining the increase
in the in-motion pixel count,
\frac{card(M(t))}{card(M(t-1))} > τ_mc,    (7.6)
where card(M(t)) is the amount of motion in the image, t is the time step and
τmc is the threshold for determining invalid motion detection results. This test is
not performed if the overall percentage of pixels in motion in the scene is beneath
a threshold (10% in our system), as if there is very little motion then something
such as a person entering the scene may be enough to result in an invalid image.
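A sketch of this validity test on binary motion masks; the 10% floor follows the text, while the ratio threshold value and the function name are assumptions:

```python
import numpy as np

TAU_MC = 3.0         # allowed growth ratio of the motion count (assumed value)
MIN_FRACTION = 0.10  # below 10% of pixels in motion the test is skipped

def motion_is_valid(mask_t, mask_prev):
    """Return False when the in-motion pixel count grows abnormally (Equation 7.6)."""
    count_t = np.count_nonzero(mask_t)
    count_prev = max(np.count_nonzero(mask_prev), 1)  # guard against divide-by-zero
    if count_t / mask_t.size < MIN_FRACTION:
        return True  # too little motion for the ratio test to be meaningful
    return count_t / count_prev <= TAU_MC
```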
Our second proposed fusion scheme involves fusing directly after the motion de-
tection. Once the motion detection masks are obtained for each of the visible light
and the thermal infrared modalities, they are combined to obtain a single mask
for the scene. Rather than simply apply a logical “and” or “or” operation, the
images are fused using the following equations,
(M_IR(x, y, t) > τ_F1) & (M_Vis(x, y, t) > τ_F1),    (7.7)

M_IR(x, y, t) > τ_F2,    (7.8)

M_Vis(x, y, t) > τ_F2,    (7.9)

where M_IR is the thermal motion image, M_Vis is the visual motion image, and τ_F1
and τ_F2 are thresholds to control the fusion (τ_F2 > τ_F1). If any of these equations
are satisfied, the fused motion mask at (x, y, t) is set to indicate motion. The
resultant mask is used in the remainder of the system described in Chapter 6.
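A sketch of this mask fusion rule, assuming the per-pixel motion images hold confidence scores; the text only constrains τ_F2 > τ_F1, so the specific threshold values here are illustrative:

```python
import numpy as np

TAU_F1 = 0.4  # lower threshold: must be exceeded in both modalities (assumed value)
TAU_F2 = 0.8  # higher threshold: sufficient in a single modality (assumed value)

def fuse_motion_masks(m_ir, m_vis):
    """Combine thermal and visible motion images into one binary mask (Eqs 7.7-7.9)."""
    both = (m_ir > TAU_F1) & (m_vis > TAU_F1)  # Equation 7.7
    ir_only = m_ir > TAU_F2                    # Equation 7.8
    vis_only = m_vis > TAU_F2                  # Equation 7.9
    return (both | ir_only | vis_only).astype(np.uint8)
```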
Fusion After Object Detection
A second mid-fusion scheme is evaluated whereby motion detection and object
detection is carried out on both modalities, and the two object lists are used
to update the central list of tracked objects. Objects that have been previously
detected can be updated by a detection from either domain. For a new object to
be added, the object must be detected in both, or in the modality where it is not
detected there must be a given amount of motion within the region where the
object has been detected. The amount of motion required in the second modality
(where the object has not been detected) is the pixel count for the detected object
multiplied by a value, τFScale, which is set to 0.5 in our system. This attempts to
ensure that a false detection in one modality does not lead to a non-existent
track being initialised.
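A sketch of this rule for admitting a new track; the detection structure is illustrative, while τ_FScale = 0.5 follows the text:

```python
import numpy as np

TAU_F_SCALE = 0.5

def can_add_new_object(detection, detected_in_both, other_motion_mask):
    """Allow a new track only with support from the second modality."""
    if detected_in_both:
        return True
    x1, y1, x2, y2 = detection['box']
    # Motion present in the other modality within the detected region.
    support = np.count_nonzero(other_motion_mask[y1:y2, x1:x2])
    return support >= TAU_F_SCALE * detection['pixel_count']
```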
The proposed appearance model (see Section 6.3) is extended to contain infor-
mation from both motion detection routines. Additional motion and optical
flow components are added, such that the model consists of a shared colour com-
ponent, a motion and optical flow component for the visual domain input and
a motion and optical flow component for the thermal domain. The model can
be used to compare a detected object to either domain individually, or to both
simultaneously (an update can also be performed on only a single domain, or
both).
Fusion After Tracking
A late fusion scheme is evaluated where each modality is tracked individually
and the resultant tracked object lists are fused in the same manner as for a
multi-camera network. Each view is processed separately, and a list of tracked
objects from each view is generated and tracked independently. At the end of
each frame, a camera management module attempts to determine which objects
being tracked by the individual trackers represent the same real-world
objects. As our multi-camera network consists of two cameras observing
exactly the same area, there is no need to transfer to a world coordinate scheme
or rely on camera calibration; pixel coordinates can be used directly. However, as
one view is in the colour domain and one is in the thermal, we are unable to use
colour/appearance as an additional metric. Given this, we simply use the overlap
of the bounding boxes to group objects.
At the end of each frame, the object lists are compared. It is expected that all
objects should be tracked in both modalities. For those objects that are being
tracked in only one modality, the tracks in the second modality (that are not
already associated with a track in the first) are searched to find a match based on
the overlap of the bounding box. If a matching track cannot be found (presumably
due to an inability to detect the object due to poor motion detection), one is
created and the system will attempt to begin tracking in the next frame (in this
case, the new track is initialised without initialising histograms and appearance
models, and as a result the condensation filter cannot be used until these have
been initialised).
Tracked objects that have been paired across the views are compared each frame,
to check that they are, in fact, a valid match. If the overlap between these two
objects drops below a threshold for two consecutive frames, the pair is broken
up, and at the end of the next frame the system will attempt to pair the tracks
again (it is possible that they will be paired with each other again).
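A sketch of the late-fusion pairing step based on bounding-box overlap; the overlap measure, threshold and track structures are illustrative assumptions:

```python
def overlap_ratio(a, b):
    # Intersection area over the area of the smaller (x1, y1, x2, y2) box.
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (iw * ih) / smaller if smaller > 0 else 0.0

def pair_tracks(colour_tracks, thermal_tracks, pairs, tau_overlap=0.5):
    """Pair unmatched colour and thermal tracks whose boxes overlap sufficiently."""
    for c in colour_tracks:
        if c['id'] in pairs:
            continue
        unmatched = [t for t in thermal_tracks if t['id'] not in pairs.values()]
        best = max(unmatched, default=None,
                   key=lambda t: overlap_ratio(c['box'], t['box']))
        if best is not None and overlap_ratio(c['box'], best['box']) > tau_overlap:
            pairs[c['id']] = best['id']
    return pairs
```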
7.3.2 Evaluation of Fusion Techniques using OTCBVS Database
The OTCBVS Benchmark Dataset Collection[42] is used to evaluate the four fu-
sion tracking systems. This is a publicly available dataset that contains aligned
thermal infrared and colour image sequences of two different outdoor scenes con-
taining pedestrians. The sequences include a variety of situations of interest with
multiple pedestrians to test the system. We test the performance of the proposed
fusion system as well as tracking with both modalities individually.
Seven sub-sequences from the database are selected to highlight various situations
of interest such as stationary people, occlusions, people moving in shadowed areas,
and shadowing caused by cloud cover. Two sequences from the second location
(set 1 in our evaluation), and five from the first (set 2 in our evaluation) are used.
Separate results are shown for each set of sequences, as the first set (taken from
Location 2 in the database) contains significantly simpler scenarios than those in
the second. Ground truth tracking data has been computed for each of these sub-
sequences using the VIPER toolkit [166], and tracking performance is evaluated
using the ETISEO evaluation tool and metrics [130] as discussed in Chapter 3.
Details on annotation of the tracking output can be found in Section 3.5.3.
Results for the first set of two sequences are shown in Tables 7.5 and 7.6, and
Figure 7.10.
As Tables 7.5 and 7.6 show, there is little to no performance benefit from using
the proposed fusion schemes for the simple scenarios contained in set 1, when
            Detection     Localisation              Tracking
Algorithm   D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
Colour      0.88  0.59    0.80  0.98  1.00  1.00    0.60  0.52  0.63  0.92  0.79
Thermal     0.98  0.72    0.86  1.00  1.00  1.00    1.00  0.90  1.00  1.00  1.00
Fusion 1    0.97  0.53    0.82  1.00  1.00  1.00    0.92  0.65  0.88  1.00  0.92
Fusion 2    0.98  0.65    0.84  0.99  1.00  1.00    1.00  0.81  1.00  1.00  1.00
Fusion 3    0.98  0.66    0.85  1.00  1.00  1.00    1.00  0.85  1.00  1.00  1.00
Fusion 4    0.98  0.70    0.86  1.00  1.00  1.00    1.00  0.91  1.00  1.00  1.00

Table 7.5: Fusion Algorithm Evaluation - Set 1 Results (see Section 3.5.1 for an explanation of metrics)
Algorithm   Overall Detection    Overall Localisation    Overall Tracking
Colour      0.65                 0.93                    0.64
Thermal     0.77                 0.96                    0.99
Fusion 1    0.62                 0.94                    0.88
Fusion 2    0.72                 0.95                    0.97
Fusion 3    0.73                 0.95                    0.98
Fusion 4    0.75                 0.96                    0.99

Table 7.6: Fusion Algorithm Evaluation - Overall Set 1 Results (see Section 3.5.1 for an explanation of metrics)
compared to the performance of the thermal modality alone. The colour modality
performs the worst due to noise present in the colour dataset, which resulted
in poor performance for the motion detection (when compared to the thermal
modality). An example of this is shown in Figure 7.9. As the datasets used are
only of a short length (200-300 frames each), the motion detection algorithm is
not able to adjust to cope with the noise, and setting initial thresholds that do
not detect the noise results in large amounts of the motion also being missed.
The noise in the colour images only had a significant impact on the first fusion
scheme, with its detection and tracking performance falling well below that of
the thermal modality on its own. The other three fusion schemes were able to
effectively overcome the failings of the colour modality and perform at a similar
level to the thermal modality. No systems outperformed the thermal modality
alone, however this can be attributed to the fact that all objects were tracked
Figure 7.9: Noise in Colour Images for Set 1 (frames 100 to 103 for each of the two rows; panels (a)-(h) labelled row by row)
correctly within this modality (tracking performance is 0.99 out of a maximum
of 1.0).
Figure 7.10 shows an example of the tracking output from set 1, where two people
cross paths, causing an occlusion. The top row shows the output of tracking using
colour images only, the second row shows the output of tracking using the thermal
images only, the third row shows results of tracking using fusion scheme 1, the
fourth row shows results of tracking using fusion scheme 2, the fifth row shows
results of tracking using fusion scheme 3 and the sixth row shows tracking results
using fusion scheme 4.
With the exception of the colour only modality, each configuration is able to re-
solve the occlusion correctly. The failure in the colour modality can be attributed
to the poor motion and object detection performance as a result of the noise (see
Figure 7.9 for an example of the noise in the colour images). Despite this failure
in the colour modality, the fusion systems are all able to overcome this and track
the two objects through the occlusion correctly.
Results for the second set of five sequences are shown in Tables 7.7 and 7.8, and
Figure 7.10: Example System Results for Set 1 - Occlusion (frames 50, 60, 70, 80 and 90 for each of the six rows; panels (a)-(ad) labelled row by row)
Figures 7.11 and 7.12.
As Tables 7.7 and 7.8 show, the third proposed fusion scheme achieves the best
performance, ahead of the thermal modality individually. Examples of the system
output are shown in Figures 7.11 and 7.12. Within these figures, the top row
shows the output of tracking using colour images only, the second row shows the
            Detection     Localisation              Tracking
Algorithm   D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
Colour      0.72  0.42    0.64  1.00  1.00  1.00    0.43  0.34  0.90  0.97  0.40
Thermal     0.90  0.50    0.77  1.00  1.00  1.00    0.61  0.45  0.86  0.91  0.51
Fusion 1    0.80  0.43    0.66  1.00  1.00  1.00    0.55  0.50  0.86  0.97  0.44
Fusion 2    0.85  0.46    0.72  1.00  1.00  1.00    0.58  0.45  0.91  0.98  0.47
Fusion 3    0.90  0.55    0.77  1.00  1.00  1.00    0.64  0.51  0.89  0.96  0.52
Fusion 4    0.88  0.52    0.76  1.00  1.00  1.00    0.54  0.47  0.75  0.94  0.47

Table 7.7: Fusion Algorithm Evaluation - Set 2 Results (see Section 3.5.1 for an explanation of metrics)
Algorithm   Overall Detection    Overall Localisation    Overall Tracking
Colour      0.48                 0.89                    0.49
Thermal     0.58                 0.93                    0.61
Fusion 1    0.51                 0.89                    0.58
Fusion 2    0.54                 0.91                    0.59
Fusion 3    0.62                 0.93                    0.64
Fusion 4    0.59                 0.92                    0.56

Table 7.8: Fusion Algorithm Evaluation - Overall Set 2 Results (see Section 3.5.1 for an explanation of metrics)
output of tracking using the thermal images only, the third row shows results of
tracking using fusion scheme 1, the fourth row shows results of tracking using
fusion scheme 2, the fifth row shows results of tracking using fusion scheme 3 and
the sixth row shows tracking results using fusion scheme 4.
All systems perform significantly worse on the second set of data, due to the more
complex nature of the data. The scenes contain more people, the people being
tracked are smaller in the image, there are heavy shadows cast by the people and
the environment as well as shadows caused by moving clouds. As a result of the
shadowing present, the colour modality performs very poorly, resulting in many
false tracks being created as shadows from moving clouds are cast over the scene
(see Figure 7.12). The thermal modality does not suffer from these problems,
and very few false tracks are created.
Within the colour modality, the presence of such severe shadowing results in a
large amount of false motion, which spawns a large number of false tracks. Whilst
the shadows can be partially removed using shadow detection, shadow detection
also results in some motion caused by the people in the scene being lost (i.e. it
is falsely classified as being caused by shadows). This is due to most people in
the scene appearing significantly darker than the background, and there being
little texture on either the background or people (i.e. fairly constant colours,
low gradient). As such, more aggressive shadow detection, whilst able to remove
more of the shadows, also removes more of the motion caused by people.
The fusion systems all see some improvement over the colour modality, however
all except the third are outperformed by the thermal modality alone. The first,
second and fourth fusion schemes are less effective at being able to completely
ignore a modality when it is performing poorly. The first and fourth fusion
schemes will always use the available information in the same manner regardless
of performance. The first fusion scheme achieves some improvement as the effect
of shadows is diluted due to the luminance channel being expanded. When each
cluster is compared in the motion detection, only a single luminance value is from
the colour modality (in the colour only system, both pixels are from the colour
modality). The effect of this is that the shadow detection performs better, but
still does not remove all erroneous motion. As a result some false tracks are still
spawned. The fourth fusion scheme assumes that each modality is producing
correct results, and attempts to merge the two lists of tracked objects through
the camera management. This results in the system being able to maintain the
identity of the valid objects more effectively (through the thermal modality), but
does not assist in handling or preventing the invalid tracks that are spawned in
the colour modality.
The second scheme is able to disregard an input in the event of suspected failure
(the same mechanism is used by the third fusion scheme), but this will not nec-
essarily register an error when a shadow moves gradually across the scene, and
is better suited to dealing with errors caused by automatic gain control,
or indoor situations where lights are turned on/off. The shadows that appear
in these scenes gradually move across the scene, and there is no rapid change
in motion levels. As such, no error is detected and the colour modality is used,
despite the significant errors present. Whilst the difference between the levels of
motion in the thermal and colour modality could be used to indicate a problem
(i.e. they should each see a similar amount of motion in ideal circumstances), it is
hard to determine which modality is in error in a situation such as this (i.e. it is
just as likely that the thermal camera could be missing large portions of motion).
The third proposed scheme is better equipped to ignore the motion caused by
shadows as it does not appear in the thermal images, and so new tracks cannot
be spawned as a result of shadow motion (at least some motion is required in
both images to create a new track). This same mechanism also helps to deal
with errors in the thermal images (see Figure 7.11 - in (g) a second track
is created along the building side as a result of a door being opened, however the
third fusion scheme is able to avoid this).
Under appropriate conditions, all fusion schemes can offer some level of improve-
ment over using either modality alone. Overall however, our third proposed
fusion scheme (fusion after object detection) performs the best, outperforming
each camera on its own and the other fusion schemes. Fusion schemes one and
two are directly reliant on the quality of the motion detection from the colour and
thermal images. If either image contains excessive noise (sensor noise, or environ-
mental effects such as shadowing) the whole system suffers as the fusion has been
performed before any object detection processes, and so the object detection for
the whole system is degraded. Fusion scheme 4 performs object detection and
tracking independently on each image, and merges results. Poor performance in
Figure 7.11: Example System Results for Set 2 (frames 50, 70, 90, 110 and 130 for each of the six rows; panels (a)-(ad) labelled row by row)
one modality cannot be corrected by the other modality.
Depending on the conditions of the scene, fusion schemes 1, 2 and 4 may still al-
low some improvement over either modality individually, however at other times
they can result in reduced performance. This can possibly be overcome by modifying
Figure 7.12: Example System Results for Set 2 (frames 50, 70, 90, 110 and 130 for each of the six rows; panels (a)-(ad) labelled row by row)
the early fusion schemes to determine fusion parameters dynamically, or adding
additional intelligence to the multi-camera systems in fusion scheme 4 (possibly
a similar system to that used in the third scheme). Fusion after the object detec-
tion overcomes this problem more effectively, as in the event that one modality
produces poor results, the system can ignore this modality entirely and fall back
on the second to update the system until both modalities are producing usable
results.
The results from set two show that even when one modality (the colour modality
in this case) is producing very poor results, it can still allow improvements in the
detection and in the tracking over time of objects (see Table 7.8) due to the added
colour information, which allows for better matching using appearance models
and histograms, used by the condensation filter. This fusion scheme weights both
inputs equally, assuming that either one is equally likely to produce valid/invalid
data. The thermal modality could be weighted higher for tasks such as initial
object detection to initialise tracks (so that fewer false tracks are spawned), yet
the discriminating power offered by the colour modality when tracking known
objects is not lost.
7.3.3 Proposed Fusion System
It has been shown in Section 7.3.2 that the best approach for multi-sensor object
tracking within the object tracking framework used in this thesis is a middle
fusion approach, where motion detection and object detection are performed on
each modality. The results of the individual object detection routines are used
to update a single list of tracked objects. Given these results, a more advanced
approach to fusion at this point in the system is proposed. Figure 7.13 shows a
flowchart of the proposed system.
It is important to monitor the performance of each modality, to ensure that if one
modality is severely affected by noise, objects detected within that modality are
weighted less heavily, or ignored. The performance of the motion detection
is measured and evaluated as described in Section 7.3.1. However, as shown in
Section 7.3.2, this approach is only able to handle severe problems and issues
Figure 7.13: Flowchart for Proposed Fusion System
such as shadows cast by clouds are unlikely to result in errors being detected. To
overcome this, a method that can gauge the performance of the object detection
is required.
Motion detection and object detection are performed on each modality, resulting
in two object lists, O_visible(t) and O_thermal(t), of size N_visual(t) and N_thermal(t)
respectively. Ideally, each of these object lists should contain the number of
objects presently in the scene, N(t). The number of objects detected compared
to the number of objects present is used to determine the performance of each
modality at a given time,
q_visual(t) = q_thermal(t) = 1    if N(t) < α,    (7.10)

q_visual(t) = 1 - \frac{min(max(|N(t) - N_visual(t)| - α, 0), N(t))}{max(max(|N(t) - N_visual(t)| - α, 0), N(t))},    (7.11)

q_thermal(t) = 1 - \frac{min(max(|N(t) - N_thermal(t)| - α, 0), N(t))}{max(max(|N(t) - N_thermal(t)| - α, 0), N(t))},    (7.12)
A tolerance of α objects is allowed (within the proposed system α is set to 1). This
tolerance ensures that when the system contains no objects, the appearance of
an object does not result in the performance of the system dropping significantly
(this is also dealt with through the use of a learning rate to curb rapid changes
in the performance metric, see Equations 7.13 and 7.14). Whilst multiple objects
entering and exiting the scene will result in a drop in performance for the modality,
ideally this drop should be uniform across each modality. As it is the difference
in performance between the modalities that is important, this is not a problem.
The performance for a given frame is incorporated into a global performance
metric which is adjusted gradually,
p_visual(t) = p_visual(t-1) + \frac{q_visual(t) - p_visual(t-1)}{L_fus},    (7.13)

p_thermal(t) = p_thermal(t-1) + \frac{q_thermal(t) - p_thermal(t-1)}{L_fus},    (7.14)
where Lfus is the learning rate for the performance metric. Initial performance
metrics may be specified within the system configuration, by default weighting
one modality above another (i.e. if it is known that one modality is less reliable
for a given scene it can be by default set to a lower value).
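A sketch of the per-frame quality measure and its smoothed update (Equations 7.10 to 7.14); α = 1 follows the text, while the learning rate value is an assumption:

```python
ALPHA = 1     # tolerance, in objects (per the text)
L_FUS = 20.0  # learning rate for the smoothed performance metric (assumed value)

def frame_quality(n_true, n_detected):
    """Detection quality of one modality for a single frame (Equations 7.10-7.12)."""
    if n_true < ALPHA:
        return 1.0
    diff = max(abs(n_true - n_detected) - ALPHA, 0)
    return 1.0 - min(diff, n_true) / max(diff, n_true)

def update_performance(p_prev, q):
    """Gradual update of the global performance metric (Equations 7.13 and 7.14)."""
    return p_prev + (q - p_prev) / L_FUS
```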
These performance metrics are then used to determine the weighting applied to
each modality when fusing object lists, and adding objects. The relative strength
of each modality for the task of object detection is calculated,
w_visual(t) = \frac{p_visual(t)}{p_visual(t) + p_thermal(t)},    (7.15)

w_thermal(t) = \frac{p_thermal(t)}{p_visual(t) + p_thermal(t)},    (7.16)
where w_visual(t) is the performance of the visual modality relative to the thermal,
and w_thermal(t) is the performance of the thermal modality relative to the visual.
This process ensures that the weights of the two modalities sum to 1, which simplifies
the process of merging objects.
The two object lists are merged, by determining the overlap between the objects.
If the overlap between the two objects is greater than a threshold, T^{fusion}_{ov}, the
objects are merged. For each object, there are several parameters such as the
bounding box, centroid and velocities. Each of these values is merged according
to the equation,
O_fused(t, i) = O_visual(t, j) × w_visual(t) + O_thermal(t, k) × w_thermal(t),    (7.17)

where O_visual(t, j) is the visual object being merged, O_thermal(t, k) is the thermal
object being merged and O_fused(t, i) is the resultant fused object.
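A sketch of the relative weighting and the weighted merge of a matched pair of detections (Equations 7.15 to 7.17); the dictionary-based object representation is an assumption:

```python
import numpy as np

def modality_weights(p_visual, p_thermal):
    # Relative strength of each modality; the two weights sum to one.
    total = p_visual + p_thermal
    return p_visual / total, p_thermal / total

def merge_objects(obj_visual, obj_thermal, w_visual, w_thermal):
    """Weighted fusion of matched objects (Equation 7.17).

    Each object is a dict of numeric parameters, e.g. 'box', 'centroid', 'velocity'.
    """
    return {key: w_visual * np.asarray(obj_visual[key]) +
                 w_thermal * np.asarray(obj_thermal[key])
            for key in obj_visual}
```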
This yields three object lists, O_fused(t), O'_visible(t) and O'_thermal(t), representing
the fused objects, the remaining visible objects and the remaining thermal objects
respectively. The updated lists of visible and thermal objects are defined as,
O'_visible(t) = {o ∈ O_visible(t) : o ∉ O_fused(t)},    (7.18)

O'_thermal(t) = {o ∈ O_thermal(t) : o ∉ O_fused(t)}.    (7.19)
The object lists are used to update the known tracks and add new tracks to the
system. This process is performed in the following steps:
1. Match objects in the merged list, Ofused(t), to the tracked list.
2. Match objects in the individual lists (O′visible(t) and O′thermal(t)) to the
tracked list, such that the best fitting object from either list is matched
in turn.
3. Add new objects within the merged list.
4. Add new objects within the individual lists.
The fourth stage involves additional checks to ensure that invalid objects are not
added, and an additional state is added to the system to accommodate this pro-
cess. A prerequisite amount of motion must be present within the other modality
for such a detection to be valid (this amount is specified in a configuration file,
rather than derived from the performance metrics), and the performance of the
modality must be greater than a threshold, τa (set to 0.5 within the proposed
system). Figure 7.14 shows the updated state diagram.
Figure 7.14: Updated State Diagram
The Preliminary Single Modality state is used for objects that are added after
detection in a single modality. It is similar in behaviour to the Entry state, in
that objects must be continuously detected when in this state, or they will be
removed from the system. The difference regarding this state is the time that an
object spends in this state before entering the Active state. This is determined
as follows,
τ_active(i) = \frac{τ_active}{p_m(t)},    (7.20)
where τactive(i) is the active threshold for object i (the object being added), τactive
is the default threshold and pm(t) is the performance for the modality m, from
which the object is being added. However, when an object that is in this state is
detected in both modalities (i.e. it is updated from an object in the Ofused(t) list),
the threshold is decremented by one (along with the increment in the detection
count, in effect this counts as two detections).
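A sketch of the promotion rule for this state (Equation 7.20); the default threshold value and the counter handling are illustrative assumptions:

```python
TAU_ACTIVE = 10  # default detections required before a track becomes Active (assumed value)

def active_threshold(p_modality):
    # Equation 7.20: scale the default threshold by the modality's performance.
    return TAU_ACTIVE / max(p_modality, 1e-6)

def update_preliminary_track(track, detected_in_fused_list):
    # A detection in both modalities effectively counts as two detections.
    track['detections'] += 2 if detected_in_fused_list else 1
    if track['detections'] >= track['threshold']:
        track['state'] = 'Active'
```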
The system's integration with the particle filter is unchanged (i.e. the particle
filter does not weight one modality above another).
7.3.4 Evaluation of System using OTCBVS Database
The proposed fusion approach is evaluated using the same datasets and metrics as
used in Section 7.3.2. Results are compared to the visual and thermal modalities
individually as well as the third proposed fusion scheme (see Section 7.3.1) upon
which this system is based.
            Detection     Localisation              Tracking
Algorithm   D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
Colour      0.88  0.59    0.80  0.98  1.00  1.00    0.60  0.52  0.63  0.92  0.79
Thermal     0.98  0.72    0.86  1.00  1.00  1.00    1.00  0.90  1.00  1.00  1.00
Fusion 3    0.98  0.66    0.85  1.00  1.00  1.00    1.00  0.85  1.00  1.00  1.00
Proposed    0.98  0.69    0.85  1.00  1.00  1.00    1.00  0.83  1.00  1.00  1.00

Table 7.9: Proposed Fusion Algorithm - Set 1 Results (see Section 3.5.1 for an explanation of metrics)
Algorithm   Overall Detection    Overall Localisation    Overall Tracking
Colour      0.65                 0.93                    0.64
Thermal     0.77                 0.96                    0.99
Fusion 3    0.73                 0.95                    0.98
Proposed    0.75                 0.95                    0.98

Table 7.10: Proposed Fusion Algorithm - Overall Set 1 Results (see Section 3.5.1 for an explanation of metrics)
Tables 7.9 and 7.10 show the results for set 1, using the proposed algorithm.
The performance of the proposed algorithm is very similar to that of the thermal
modality alone, and the third evaluated fusion scheme (see Section 7.3.1) upon
which the proposed algorithm is based. This is expected given that the thermal
modality and original fusion schemes performed very well for these datasets, and
so no significant improvement, or change in performance was expected.
Tables 7.11 and 7.12, and Figure 7.15 show the results for set 2 using the proposed
fusion algorithm. As can be seen, the proposed fusion algorithm offers a significant
improvement over both individual modalities, and a noticeable improvement over
the third fusion scheme on which the proposed algorithm is based.
            Detection     Localisation              Tracking
Algorithm   D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
Colour      0.72  0.42    0.64  1.00  1.00  1.00    0.43  0.34  0.90  0.97  0.40
Thermal     0.90  0.50    0.77  1.00  1.00  1.00    0.61  0.45  0.86  0.91  0.51
Fusion 3    0.90  0.55    0.77  1.00  1.00  1.00    0.64  0.51  0.89  0.96  0.52
Proposed    0.93  0.56    0.76  1.00  1.00  1.00    0.70  0.49  0.95  0.97  0.60

Table 7.11: Proposed Fusion Algorithm - Set 2 Results (see Section 3.5.1 for an explanation of metrics)
Algorithm   Overall Detection    Overall Localisation    Overall Tracking
Colour      0.48                 0.89                    0.49
Thermal     0.58                 0.93                    0.61
Fusion 3    0.62                 0.93                    0.64
Proposed    0.63                 0.92                    0.69

Table 7.12: Proposed Fusion Algorithm - Overall Set 2 Results (see Section 3.5.1 for an explanation of metrics)
Figure 7.15 shows a situation where there is heavy shadowing caused by a cloud
moving across the scene. The top row is the output of the colour modality only,
the second row is the thermal modality only, the third row is the output of
the third evaluated fusion scheme (see Section 7.3.1) and the fourth row is the
proposed fusion algorithm. In this situation, the colour modality alone performs
very poorly, spawning false tracks before losing track of the majority of the objects
in the scene as the motion detector attempts to adjust to cope with the changes.
The third of the initially proposed fusion approaches performs much better, but
a false track is still spawned (bottom centre of frame) due to noise in the thermal
image coinciding with errors caused by shadows in the visual image. The proposed
fusion algorithm does not have this problem, as it is able to identify that the
detection in the visual modality is failing catastrophically, and ignore the modality
far more effectively.
Figure 7.15: Example System Results for Set 2 (frames 50, 70, 90, 110 and 130 for each of the four rows; panels (a)-(t) labelled row by row)
7.4 Summary
This chapter has presented two extensions to the single camera tracking systems
proposed in this thesis:
1. A multi-camera tracking system.
2. A multi-spectral tracking system.
It has been shown that by using multiple cameras to cover the same area of a
scene, system performance can be improved. This can allow a system to antic-
ipate when objects are entering, and help recover from errors in a single view.
However, the use of multiple cameras does not overcome problems caused by poor
segmentation and detection, and in some cases (depending on how the camera
management is implemented) it may in fact exacerbate the problem.
This chapter has also evaluated approaches for use in a multi-spectral tracking
system, where a visual colour modality and thermal modality are combined. It
has been shown that a middle fusion approach, where motion detection and object
detection are applied to each image and the object detection results are fused, is
optimal. An improved middle fusion scheme has been proposed and is shown to
offer significant improvements over either modality individually.
Chapter 8
Conclusions and Future Work
8.1 Introduction
This thesis has examined methods to improve the performance of motion segmen-
tation algorithms and particle filtering techniques for object tracking applications;
and examined methods for multi-modal fusion in an object tracking system.
Motion segmentation is a key step in many tracking algorithms as it forms the
basis of object detection. Improving segmentation results, as well as being able
to extract additional information such as optical flow and motion status (i.e.
stationary or active), allows for improved object detection and thus tracking. Oc-
clusions and difficulties in object detection are a major source of error in tracking
systems. However, a strength of particle filters is their ability to track objects in
adverse situations. Integrating a particle filter within a standard tracking system
allows the particle filter to use progressively updated features and aids in main-
taining identity of the tracked objects, and provides the tracking system with an
effective means to handle occlusions.
The research in the above areas has led to four main contributions, these being:
1. The simultaneous computation of multi-layer motion segmentation and op-
tical flow information.
2. The combined use of multi-layer motion and optical flow to improve object
detection and tracking performance.
3. The development of the scalable condensation filter (SCF, a mixture con-
densation filter that can dynamically scale the number of particles and
number and type of features, for each mixture component) and its inte-
gration into an existing tracking system to allow for improved occlusion
handing and tracking in adverse conditions.
4. An investigation into multi-sensor fusion for object tracking, and the devel-
opment of a fusion scheme for fusing a visual colour modality and thermal
modality for object tracking.
These innovations have been shown to improve the performance of object tracking
in adverse conditions, such as in the presence of complex lighting, or situations
where there are frequent occlusions. In the following section a summary of these
four contributions is provided.
8.2 Summary of Contribution
The four original contributions in this thesis are:
(i) Simultaneous computation of multi-layer motion segmentation and
optical flow
A novel motion segmentation technique that can simultaneously calculate optical
flow as well as multi-layer motion segmentation has been proposed. Regions of
motion are divided into multiple layers of stationary foreground (i.e. objects that
have entered the scene and come to a stop) and a single layer of active foreground
(objects that are currently moving). Optical flow is calculated using a window
matching approach, incorporating previous optical flow results as well as previous
motion detection results, such that the algorithm does not attempt to match to
regions that were background in the previous frame, to help reduce discontinu-
ities. The optical flow is calculated to pixel precision, but as it is targeted at
tracking applications, sub-pixel resolution is not required. The proposed system
also uses the optical flow information and a short term history to compute a mo-
tion consistency map, indicating probable overlaps and other discontinuities (i.e.
new motion).
As well as the additional modes of output, lighting compensation, a variable
threshold, shadow detection and feedback approach have been proposed to im-
prove the segmentation performance. The merit of a variable threshold for each
pixel against a single global variable threshold has also been investigated.
The proposed algorithm has been evaluated using the AESOS database, the
CAVIAR database [48] and data captured in-house, and significant improvement
over the baseline (see Section 4.6) has been demonstrated.
(ii) Incorporation of multi-layer motion and optical flow into object
tracking
The proposed motion segmentation algorithm has been integrated into a tracking
system capable of utilising the multiple modes of output (optical flow, multi-layer
foreground segmentation, motion consistency map). Novel methods to utilise
these modes of input within a tracking system have been proposed; these include:
1. The use of optical flow to extract moving objects.
2. The use of stationary foreground information to locate temporarily stopped
objects, or detect abandoned objects.
3. The use of the motion consistency map to detect overlaps and provide fur-
ther segmentation during the detection stage.
The proposed methods have been tested using the ETISEO database [130]
and achieved up to 24% improvement in tracking performance over the base-
line system (amount of improvement varies for different datasets, see Section 5.6).
(iii) Improved object tracking through the Scalable Condensation
Filter (SCF)
The Scalable Condensation Filter (SCF) has been proposed. The SCF is a novel
implementation of condensation filter, incorporating elements of the mixture par-
ticle filter [163] and boosted particle filter [135]. The SCF allows:
1. A time varying number of objects to be tracked by independent mixtures.
2. Each mixture to use a time varying number of particles according to the
system complexity.
3. Each mixture to use a time varying number of features, which may be of
differing types to other mixtures.
4. Results of the object detection routines to be incorporated into the mix-
tures.
The SCF is integrated into the previously proposed tracking system. This allows:
1. The underlying tracking system to maintain the identity of each tracked
object, and add and remove mixture components from the SCF as appro-
priate.
2. The SCF to use progressively updated features which adapt to changes in
the object’s appearance as they move through the scene, and are unique for
each feature.
3. The tracking system to utilise the SCF when handling occlusions and situ-
ations where detection is unreliable, rather than relying on predicting the
object’s position until it reappears, or a timeout is reached.
The proposed system is tested using the ETISEO database [130], and improve-
ments in the tracking performance and occlusion handling are shown (see Section
6.5). However, the SCF does make the system more susceptible to errors when
object detection is poor, or when there is low contrast between foreground
objects and the background.
(iv) Investigation into and development of methods to fuse multiple
modalities for object tracking
An evaluation of fusion schemes for multi-sensor fusion for object tracking has
been performed. Four simple fusion schemes have been evaluated for a multi-
modal system consisting of a visual colour modality and thermal modality:
1. Fusion during the motion detection process.
2. Fusion of the motion detection output.
3. Fusion of the object detection results.
4. Fusion of the tracked object lists.
An evaluation using the OTCBVS database [42] has shown that fusion of the
object detection results in optimal performance (see Section 7.3.2). A novel
multi-spectral tracking system, that performs fusion on the object detection
results, has been proposed and is shown to offer significant advantages over
either modality alone (see Section 7.3.4).
8.3 Future Research
This thesis has contributed to several areas of object tracking and intelligent
surveillance, however there are still several areas that future work could address. Areas
of future work that further improve the techniques proposed in this thesis, as well as
potential new research directions, are listed below:
• The proposed object tracking systems will continue to be evaluated as new
datasets are made available, to confirm the results on a wider range of data,
and further refine the proposed algorithms.
• The proposed motion segmentation algorithm could be extended to be able
to recognise, and ignore motion from, cast light such as vehicle headlights.
This would improve performance of motion segmentation and vehicle track-
ing systems in dark conditions.
• The proposed motion segmentation system could be modified such that
connected regions of the same classification are grouped (i.e. a
shadow region, active foreground region, etc.). Regions could be analysed
to remove errors (such as spurious motion in the middle of a shadow, or
in the middle of a region of stationary foreground) and provide additional
feedback to the motion segmentation algorithm to improve performance in
future frames.
• The image segmentation technique, GrabCut [147], can be used to segment
images based on a selected region of interest, and GMMs of positive and
negative features extracted from this initial selection. The GrabCut al-
gorithm and proposed motion segmentation algorithm could be combined,
such that regions of motion are used to initialise the GrabCut process and
improve segmentation performance. If successful, such a system could be
integrated into a tracking system such that each track stores its own GMM,
and this GMM, combined with a coarse position estimation and motion de-
tection results could be used to segment the target object in future frames.
• Tracking systems presently rely on a single frame for detection results. De-
tection could be performed over a sliding frame window, such that at frame
t, detection results for frames t− 1, t, and t+ 1 are combined and used to
update the list of tracked objects. Given that tracking systems typically
process video at 15-25fps and surveillance cameras have a wide field of view,
the difference in position from frame to frame is small and combining de-
tection results from consecutive frames is likely to improve detection and
tracking performance, without adversely affecting the localisation of the
tracked objects.
• The use of the SCF within the proposed tracking system can be further
improved, to overcome the problem of poorly initialised tracked objects
(due to detection/segmentation errors) being lost. Further improvements
can also be made to better handle long occlusions, and ensure that tracks
are not lost during a prolonged occlusion.
Appendix A
Baseline Tracking System Results
            Detection     Localisation              Tracking
Data Set    D1    D2      L1    L2    L3    L4      T1    T2    T3    T4    T5
RD6         0.83  0.73    0.84  1.00  0.98  1.00    0.44  0.35  0.75  0.97  0.44
RD7         0.71  0.56    0.76  1.00  0.98  1.00    0.34  0.27  0.79  0.93  0.20
Average     0.77  0.64    0.80  1.00  0.98  1.00    0.39  0.31  0.77  0.95  0.32

Table A.1: RD Dataset Results
Data Set    Overall Detection    Overall Localisation    Overall Tracking
RD6         0.75                 0.95                    0.48
RD7         0.59                 0.92                    0.39
Average     0.67                 0.93                    0.44

Table A.2: Overall RD Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BC16       0.77  0.44  0.75  1.00  0.98  0.99  0.46  0.28  0.75  0.81  0.38
BC17       0.77  0.38  0.72  1.00  0.98  0.99  0.44  0.33  0.73  0.88  0.29
Average    0.77  0.41  0.73  1.00  0.98  0.99  0.45  0.30  0.74  0.85  0.34
Table A.3: BC Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BC16       0.50                0.91                   0.47
BC17       0.45                0.91                   0.46
Average    0.48                0.91                   0.46
Table A.4: Overall BC Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
AP11-C4    0.86  0.67  0.79  1.00  1.00  1.00  0.47  0.25  1.00  1.00  0.17
AP11-C7    0.86  0.77  0.84  1.00  1.00  1.00  0.65  0.51  0.88  1.00  0.60
AP12-C4    0.76  0.58  0.69  1.00  1.00  1.00  0.50  0.51  1.00  1.00  0.67
AP12-C7    0.80  0.54  0.72  1.00  1.00  1.00  0.66  0.30  1.00  1.00  0.38
Average    0.82  0.64  0.76  1.00  1.00  1.00  0.57  0.39  0.97  1.00  0.45
Table A.5: AP Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
AP11-C4    0.71                0.93                   0.47
AP11-C7    0.79                0.95                   0.66
AP12-C4    0.61                0.90                   0.60
AP12-C7    0.59                0.91                   0.61
Average    0.68                0.93                   0.59
Table A.6: Overall AP Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BE19-C1    0.83  0.46  0.73  1.00  0.95  1.00  0.29  0.21  0.54  0.86  0.29
BE19-C3    0.74  0.03  0.46  0.50  0.50  1.09  0.44  0.00  0.00  0.00  0.00
BE20-C1    0.75  0.50  0.74  0.99  0.97  1.00  0.05  0.24  0.32  0.71  0.54
BE20-C3    0.33  0.01  0.27  0.75  0.75  1.04  0.00  0.01  1.00  1.00  0.00
Average    0.66  0.25  0.55  0.81  0.79  1.03  0.19  0.11  0.47  0.64  0.21
Table A.7: BE Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BE19-C1    0.53                0.90                   0.34
BE19-C3    0.17                0.54                   0.29
BE20-C1    0.55                0.91                   0.21
BE20-C3    0.07                0.62                   0.14
Average    0.33                0.74                   0.25
Table A.8: Overall BE Dataset Results
Appendix B
Tracking System with Improved
Motion Segmentation Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
RD6        0.85  0.71  0.83  1.00  0.99  1.00  0.54  0.41  0.96  0.96  0.54
RD7        0.79  0.55  0.79  1.00  0.98  1.00  0.45  0.32  0.80  0.89  0.37
Average    0.82  0.63  0.81  1.00  0.98  1.00  0.50  0.36  0.88  0.92  0.46
Table B.1: RD Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD6        0.74                0.94                   0.58
RD7        0.60                0.93                   0.47
Average    0.67                0.94                   0.53
Table B.2: Overall RD Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BC16       0.83  0.44  0.75  1.00  0.97  0.99  0.53  0.29  0.78  0.76  0.40
BC17       0.77  0.38  0.71  1.00  0.98  0.99  0.46  0.30  0.71  0.89  0.32
Average    0.80  0.41  0.73  1.00  0.97  0.99  0.49  0.30  0.74  0.83  0.36
Table B.3: BC Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BC16       0.52                0.91                   0.51
BC17       0.46                0.90                   0.47
Average    0.49                0.91                   0.49
Table B.4: Overall BC Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
AP11-C4    0.87  0.69  0.80  1.00  1.00  1.00  0.47  0.26  1.00  1.00  0.17
AP11-C7    0.90  0.79  0.85  1.00  1.00  1.00  0.75  0.53  0.88  1.00  0.60
AP12-C4    0.66  0.45  0.64  1.00  1.00  1.00  0.46  0.49  1.00  0.88  0.57
AP12-C7    0.83  0.51  0.72  1.00  1.00  1.00  0.73  0.27  1.00  1.00  0.33
Average    0.82  0.61  0.75  1.00  1.00  1.00  0.60  0.39  0.97  0.97  0.42
Table B.5: AP Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
AP11-C4    0.73                0.94                   0.47
AP11-C7    0.81                0.95                   0.72
AP12-C4    0.50                0.89                   0.55
AP12-C7    0.57                0.91                   0.65
Average    0.65                0.92                   0.60
Table B.6: Overall AP Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BE19-C1    0.75  0.49  0.73  1.00  0.94  1.00  0.27  0.29  0.83  0.90  0.31
BE19-C3    0.86  0.14  0.52  1.00  1.00  0.99  0.45  0.09  1.00  1.00  0.00
BE20-C1    0.77  0.50  0.74  0.99  0.98  1.00  0.18  0.22  0.29  0.56  0.54
BE20-C3    0.33  0.00  0.27  0.50  0.50  1.00  0.05  0.01  1.00  1.00  0.00
Average    0.68  0.28  0.57  0.87  0.85  1.00  0.24  0.15  0.78  0.87  0.21
Table B.7: BE Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BE19-C1    0.54                0.90                   0.36
BE19-C3    0.28                0.85                   0.41
BE20-C1    0.56                0.91                   0.27
BE20-C3    0.07                0.47                   0.17
Average    0.36                0.78                   0.31
Table B.8: Overall BE Dataset Results
Appendix C
Tracking System with SCF
Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
RD6        0.86  0.73  0.84  1.00  0.98  1.00  0.54  0.40  0.92  0.95  0.48
RD7        0.80  0.59  0.80  1.00  0.98  1.00  0.58  0.33  0.86  0.86  0.43
Average    0.83  0.66  0.82  1.00  0.98  1.00  0.56  0.37  0.89  0.91  0.45
Table C.1: RD Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
RD6        0.76                0.95                   0.57
RD7        0.63                0.93                   0.56
Average    0.69                0.94                   0.57
Table C.2: Overall RD Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BC16       0.83  0.46  0.75  0.99  0.97  0.99  0.49  0.30  0.71  0.78  0.45
BC17       0.77  0.38  0.71  0.99  0.98  0.99  0.48  0.31  0.71  0.88  0.36
Average    0.80  0.42  0.73  0.99  0.97  0.99  0.49  0.30  0.71  0.83  0.41
Table C.3: BC Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BC16       0.53                0.91                   0.50
BC17       0.45                0.90                   0.48
Average    0.49                0.91                   0.49
Table C.4: Overall BC Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
AP11-C4    0.88  0.75  0.83  1.00  1.00  1.00  0.51  0.36  0.88  1.00  0.51
AP11-C7    0.90  0.79  0.86  1.00  1.00  1.00  0.64  0.54  0.88  1.00  0.73
AP12-C4    0.68  0.50  0.64  1.00  1.00  1.00  0.48  0.46  0.87  1.00  0.63
AP12-C7    0.83  0.53  0.73  1.00  1.00  1.00  0.66  0.31  1.00  1.00  0.38
Average    0.82  0.64  0.76  1.00  1.00  1.00  0.57  0.42  0.90  1.00  0.56
Table C.5: AP Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
AP11-C4    0.78                0.95                   0.55
AP11-C7    0.81                0.96                   0.68
AP12-C4    0.54                0.89                   0.56
AP12-C7    0.59                0.92                   0.62
Average    0.68                0.93                   0.60
Table C.6: Overall AP Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BE19-C1    0.79  0.42  0.74  1.00  0.94  1.00  0.42  0.21  0.58  0.83  0.33
BE19-C3    0.84  0.13  0.53  1.00  1.00  0.99  0.49  0.07  1.00  1.00  0.00
BE20-C1    0.76  0.48  0.73  0.99  0.98  1.00  0.09  0.23  0.30  0.63  0.38
BE20-C3    0.33  0.01  0.27  0.89  1.00  1.00  0.11  0.01  1.00  1.00  0.00
Average    0.68  0.26  0.57  0.97  0.98  1.00  0.28  0.13  0.72  0.86  0.18
Table C.7: BE Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BE19-C1    0.50                0.90                   0.42
BE19-C3    0.27                0.85                   0.43
BE20-C1    0.54                0.91                   0.21
BE20-C3    0.07                0.74                   0.21
Average    0.34                0.85                   0.32
Table C.8: Overall BE Dataset Results
Appendix D
Multi-Camera Tracking System
Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
AP11-C4    0.87  0.76  0.82  1.00  1.00  1.00  0.51  0.36  0.88  1.00  0.51
AP11-C7    0.89  0.79  0.86  1.00  1.00  1.00  0.90  0.63  1.00  1.00  0.90
AP12-C4    0.73  0.49  0.68  0.99  1.00  1.00  0.51  0.46  0.87  1.00  0.63
AP12-C7    0.83  0.54  0.73  1.00  1.00  1.00  0.66  0.31  1.00  1.00  0.38
Average    0.83  0.64  0.77  1.00  1.00  1.00  0.64  0.44  0.94  1.00  0.61
Table D.1: AP Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
AP11-C4    0.78                0.95                   0.55
AP11-C7    0.81                0.96                   0.88
AP12-C4    0.54                0.90                   0.58
AP12-C7    0.59                0.92                   0.62
Average    0.68                0.93                   0.66
Table D.2: Overall AP Dataset Results
           Detection   Localisation            Tracking
Data Set   D1    D2    L1    L2    L3    L4    T1    T2    T3    T4    T5
BE19-C1    0.71  0.42  0.71  1.00  0.93  1.00  0.29  0.17  0.71  0.92  0.33
BE19-C3    0.85  0.07  0.50  1.00  1.00  0.99  0.38  0.03  1.00  1.00  0.00
BE20-C1    0.69  0.40  0.66  0.97  0.98  1.00  0.27  0.16  0.19  0.65  0.31
BE20-C3    0.31  0.01  0.27  0.55  0.75  1.04  0.00  0.01  0.75  1.00  0.00
Average    0.64  0.22  0.54  0.88  0.91  1.01  0.24  0.09  0.66  0.89  0.16
Table D.3: BE Dataset Results
Data Set   Overall Detection   Overall Localisation   Overall Tracking
BE19-C1    0.47                0.89                   0.35
BE19-C3    0.23                0.85                   0.36
BE20-C1    0.46                0.88                   0.28
BE20-C3    0.07                0.56                   0.13
Average    0.31                0.79                   0.28
Table D.4: Overall BE Dataset Results
Bibliography
[1] M. Abdelkader, R. Chellappa, Q. Zheng, and A. Chan, “Integrated motion
detection and tracking for visual surveillance,” in IEEE International Con-
ference on Computer Vision Systems (ICVS), p. 28, 2006.
[2] E. H. Adelson, “Layered representation for vision and video,” in IEEE Work-
shop on Representation of Visual Scenes, p. 3, 1995.
[3] E. L. Andrade, S. Blunsden, and R. B. Fisher, “Modelling crowd scenes for
event detection,” in International Conference on Pattern Recognition, vol. 1,
pp. 175 – 178, 2006.
[4] N. Atsushi, K. Hirokazu, H. Shinsaku, and I. Seiji, “Tracking multiple people
using distributed vision systems,” in Proceedings 2002 IEEE International
Conference on Robotics and Automation, vol. 3, (Washington, DC, USA),
pp. 2974–2981, IEEE, 2002.
[5] E. Auvinet, E. Grossmann, C. Rougier, M. Dahmane, and J. Meunier, “Left-
luggage detection using homographies and simple heuristics,” in IEEE In-
ternational Workshop on PETS, (New York), pp. 51–58, 2006.
[6] J. Bergen, P. Burt, R. Hingorani, and S. Peleg, “Computing two motions
from three frames,” in 3rd Int. Conf. on Computer Vision, pp. 27–32, 1990.
[7] D. Beymer, “Person counting using stereo,” Workshop on Human Motion,
pp. 127 – 133, 2000.
[8] M. K. Bhuyan, B. C. Lovell, and A. Bigdeli, “Tracking with multiple cameras
for video surveillance,” in Digital Image Computing Techniques and Applica-
tions, 9th Biennial Conference of the Australian Pattern Recognition Society
on (B. C. Lovell, ed.), pp. 592–599, 2007.
[9] M. Black and P. Anandan, “A framework for the robust estimation of optical
flow,” in Fourth International Conference on Computer Vision, pp. 231 –
236, 1993.
[10] J. Black, T. Ellis, and P. Rosin, “Multi view image surveillance and track-
ing,” in Motion and Video Computing, 2002. Proceedings. Workshop on
(T. Ellis, ed.), pp. 169–174, 2002.
[11] R. S. Blum and Z. Liu, Multi-Sensor Image Fusion and Its Applications.
Boca Raton, FL: CRC Press, 2006.
[12] R. Bourezak and G. Bilodeau, “Object detection and tracking using iterative
division and correlograms,” in Computer and Robot Vision, 2006. The 3rd
Canadian Conference on, p. 38, 2006.
[13] R. Bowden and P. Kaewtrakulpong, “An improved adaptive background
mixture model for real-time tracking with shadow detection,” in AVBS01,
2001.
[14] G. R. Bradski, “Computer vision face tracking for use in a perceptual user
interface,” Intel Tech Journal, pp. 1–15, 1998.
[15] H. Breit and G. Rigoll, “A flexible multimodal object tracking system,” in
International Conference on Image Processing, vol. 3, pp. 133–136, 2003.
[16] F. Bunyak, I. Ersoy, and S. Subramanya, “A multi-hypothesis approach for
salient object tracking in visual surveillance,” in Image Processing, 2005.
ICIP 2005. IEEE International Conference on, vol. 2, pp. II–446–9, 2005.
[17] P. Burt, J. Bergen, R. Hingorani, R. Kolczynski, W. Lee, A. Leung, J. Lubin,
and J. Shvaytser, “Object tracking with a moving camera; an application
of dynamic motion analysis,” in IEEE Workshop on Visual Motion, (Irvine,
CA), 1989.
[18] D. Butler, S. Sridharan, and V. M. Bove Jr, “Real-time adaptive background
segmentation,” in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), vol. 3, pp. 349–352, 2003.
[19] Q. Cai and J. Aggarwal, “Tracking human motion in structured environ-
ments using a distributed-camera system,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1241 – 1247, 1999.
[20] J. Canny, “A computational approach to edge detection,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–
698, 1986.
[21] T.-H. Chang and S. Gong, “Tracking multiple people with a multi-camera
system,” in IEEE Workshop on Multi-Object Tracking, pp. 19 – 26, 2001.
[22] N. Checka, K. Wilson, M. Siracusa, and T. Darrell, “Multiple person and
speaker activity tracking with a particle filter,” in Acoustics, Speech, and Sig-
nal Processing, 2004. Proceedings. (ICASSP ’04). IEEE International Con-
ference on, vol. 5, pp. V–881–4 vol.5, 2004.
[23] H. Chen and T. Liu, “Trust-region methods for real-time tracking,” in In-
ternational Conference on Computer Vision, vol. 2, (Vancouver, Canada),
pp. 717–722, 2001.
[24] Y.-S. Cheng, C.-M. Huang, and L.-C. Fu, “Multiple people visual tracking in
a multi-camera system for cluttered environments,” in 2006 IEEE/RSJ In-
ternational Conference on Intelligent Robots and Systems, (Beijing, China),
pp. 675–680, 2006.
[25] S.-Y. Chien, W.-K. Chan, D.-C. Cherng, and J.-Y. Chang, “Human object
tracking algorithm with human color structure descriptor for video surveil-
lance systems,” in Multimedia and Expo, 2006 IEEE International Confer-
ence on, pp. 2097–2100, 2006.
[26] B. Coifman, D. Beymer, P. McLauchlan, and J. Malik, “A real-time com-
puter vision system for vehicle tracking and traffic surveillance,” Transporta-
tion Research: Part C, vol. 6, no. 4, pp. 271–288, 1998.
[27] R. Collins, O. Amidi, and T. Kanade, “An active camera system for acquiring
multi-view video,” in Proceedings of ICIP 2002 International Conference on
Image Processing, vol. 1, (Rochester, NY, USA), pp. 517–520, IEEE, 2002.
[28] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature
space analysis,” IEEE Trans. on Pattern Analysis and Machine Intelligence,
vol. 24, no. 5, pp. 603–619, 2002.
[29] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid ob-
jects using mean shift,” in IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR 2000), vol. 2, pp. 2142–2149, 2000.
[30] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,”
IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 5,
pp. 564–575, 2003.
[31] C. O. Conaire, E. Cooke, N. O’Connor, N. Murphy, and A. Smearson, “Back-
ground modelling in infrared and visible spectrum video for people tracking,”
in IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, vol. 3, pp. 20–20, 2005.
[32] C. O. Conaire, N. E. O’Connor, E. Cooke, and A. F. Smeaton, “Multi-
spectral object segmentation and retrieval in surveillance video,” in IEEE
International Conference on Image Processing (ICIP), pp. 2381–2384, 2006.
[33] M. Couprie and G. Bertrand, “Topological grayscale watershed transforma-
tion,” in SPIE Vision Geometry V, vol. 3168, pp. 136–146, 1997.
[34] A. Criminisi, A. Cross, G. Blake, and V. Kolmogorov, “Bilayer segmentation
of live video,” in IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, 2006.
[35] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Statistical and
knowledge-based moving object detection in traffic scene,” in IEEE Inter-
national Conference on Intelligent Transportation Systems, pp. 27–32, 2000.
[36] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving ob-
jects, ghosts, and shadows in video streams,” IEEE Trans. on Pattern Anal-
ysis and Machine Intelligence, vol. 25, no. 10, pp. 1337–1342, 2003.
[37] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, and S. Sirotti, “Improving
shadow suppression in moving object detection with hsv color information,”
in Fourth International IEEE Conference on Intelligent Transportation Sys-
tems, (Oaklan, CA, USA), pp. 334–339, 2001.
[38] F. Cupillard, F. Bremond, and M. Thonnat, “Group behavior recognition
with multiple cameras,” in Applications of Computer Vision, 2002. (WACV
2002). Proceedings. Sixth IEEE Workshop on, pp. 177–183, 2002.
[39] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, “Integrated person
tracking using stereo, color, and pattern detection,” in Computer vision
and pattern recognition, (Santa Barbara; CA), pp. 601–609, IEEE Computer
Society, 1998.
[40] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, “Integrated person
tracking using stereo, color, and pattern detection,” International Journal
of Computer Vision, vol. 37, no. 2, pp. 175–185, 2000.
[41] J. Davis and V. Sharma, “Robust background-subtraction for person detec-
tion in thermal imagery,” in Conference on Computer Vision and Pattern
Recognition Workshop, p. 128, 2004.
[42] J. Davis and V. Sharma, “IEEE OTCBVS WS Series Bench: Fusion-based
background-subtraction using contour saliency,” in IEEE International
Workshop on Object Tracking and Classification Beyond the Visible Spec-
trum, 2005.
[43] S. Denman, V. Chandran, and S. Sridharan, “An adaptive optical flow tech-
nique for person tracking systems,” Elsevier Pattern Recognition Letters,
vol. 28, no. 10, pp. 1232–1239, 2007.
[44] A. Doucet, “On sequential simulation-based methods for bayesian filtering,”
technical report cued/f-infeng/tr 310, Department of Engineering, Cam-
bridge University, 1998.
[45] A. Efros, A. Berg, G. Mori, and J. Malik, “Recognizing action at a dis-
tance,” in Computer Vision, 2003. Proceedings. Ninth IEEE International
Conference on, Vol., Iss., 13-16 Oct. 2003, pp. 726– 733 vol.2, 2003.
[46] T. Ellis, “Multi-camera video surveillance,” in Proceedings IEEE 36th
Annual 2002 International Carnahan Conference on Security Technology
(L. Sanson, ed.), (Atlantic City, NJ, USA), pp. 228–233, IEEE, 2002.
[47] I. Everts, N. Sebe, and G. A. Jones, “Cooperative object tracking with
multiple ptz cameras,” in Image Analysis and Processing, 2007. ICIAP 2007.
14th International Conference on (N. Sebe, ed.), pp. 323–330, 2007.
[48] R. Fisher, J. Santos-Victor, and J. Crowley, “Caviar: Con-
text aware vision using image-based active recognition,
(http://homepages.inf.ed.ac.uk/rbf/caviar/),” Last Accessed 23 Feb
2008, 2002.
[49] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multicamera peo-
ple tracking with a probabilistic occupancy map,” Transactions on Pattern
Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267–282, 2008.
[50] D. Focken and R. Stiefelhagen, “Towards vision-based 3-d people tracking
in a smart room,” in Proceedings Fourth IEEE International Conference on
Multimodal Interfaces, (Pittsburgh, PA, USA), pp. 400–405, IEEE Comput.
Soc, 2002.
[51] J. F. G. d. Freitas, M. Niranjan, A. H. Gee, and A. Doucet, “Sequential
monte carlo methods to train neural network models,” Neural Computation,
vol. 12, no. 4, pp. 955–993, 2000.
[52] L. M. Fuentes and S. Velastin, “People tracking in surveillance applications,”
2nd IEEE International Workshop on Performance Evaluation of Tracking
and Surveillance (PETS2001), 2001.
[53] L. M. Fuentes and S. A. Velastin, “Tracking people for automatic surveillance
applications,” in Pattern recognition and image analysis (F. J. P. O. Perales,
ed.), (Puerto de Andratx, Spain), pp. 238–245, Berlin; Springer;, 2003.
[54] K. Fukunaga, Introduction to Statistical Pattern Recognition. Boston: Aca-
demic Press, 1990.
[55] G. S. K. Fung, N. H. C. Yung, G. K. H. Pang, and A. H. S. Lai, “Effective
moving cast shadow detection for monocular color image,” vol. -, no. -, pp. –
409, 2001.
[56] G. Gordon, T. Darrell, M. Harville, and J. Woodfill, “Background estimation
and removal based on range and color,” in IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition, pp. 459–464, 1999.
[57] H. Grabner and H. Bischof, “On-line boosting and vision,” in IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition,
vol. 1, pp. 260 – 267, 2006.
[58] R. Green and L. Guan, “Tracking human movement patterns using particle
filtering,” in International Conference on Multimedia and Expo (ICME),
vol. 3, pp. 117–120, 2003.
[59] D. Grest, J.-M. Frahm, and R. Koch, “A color similarity measure for ro-
bust shadow removal in real time,” in Vision, Modeling and Visualization
Conference, (Munich, Germany), pp. 253–260, 2003.
[60] J. M. Hammersley and K. W. Morton, “Poor man’s monte carlo,” Journal
of the Royal Statistical Society B, vol. 16, pp. 23–28, 1954.
[61] J. Han and B. Bhanu, “Detecting moving humans using color and infrared
video,” in IEEE International Conference on Multisensor Fusion and Inte-
gration for Intelligent Systems, pp. 228 – 233, 2003.
[62] J. Han and B. Bhanu, “Fusion of color and infrared video for moving human
detection,” Pattern Recognition, vol. 40, no. 6, pp. 1771–1784, 2007.
[63] I. Haritaoglu, D. Harwood, and L. Davis, “W4: Who? when? where? what?
a real time system for detecting and tracking people,” in Third IEEE Inter-
national Conference on Automatic Face and Gesture Recognition, pp. 222 –
227, 1998.
[64] I. Haritaoglu, D. Harwood, and L. Davis, “W4s: A real time system for
detecting and tracking people in 2 1/2 d,” in European Conference on Computer
Vision, pp. 962–968, 1998.
[65] I. Haritaoglu, D. Harwood, and L. S. Davis, “Hydra: multiple people detec-
tion and tracking using silhouettes,” in International Conference on Image
Analysis and Processing, pp. 280 – 285, 1999.
[66] I. Haritaoglu, D. Harwood, and L. Davis, “An appearance-based body model
for multiple people tracking,” in 15th International Conference on Pattern
Recognition, vol. 4, (Barcelona, Spain), pp. 184–187, 2000.
[67] M. Harville, “A framework for high-level feedback to adaptive, per-pixel,
mixture-of-gaussian background models,” in 7th European Conference on
Computer Vision, vol. 3, (Copenhagen, Denmark), pp. 37–49, 2002.
[68] M. Harville, G. G. Gordon, and J. Woodfill, “Foreground segmentation us-
ing adaptive mixture models in color and depth,” in IEEE Workshop on
Detection and Recognition of Events in Video, pp. 3–11, 2001.
[69] M. Harville and D. Li, “Fast, integrated person tracking and activity recog-
nition with plan-view templates from a single stereo camera,” in Computer
Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004
IEEE Computer Society Conference on, pp. II–398– II–405 Vol.2, 2004.
[70] E. Hayman and J. Eklundh, “Statistical background subtraction for a mobile
observer,” in International Conference on Computer Vision (ICCV), vol. 1,
pp. 67 – 74, 2003.
[71] T. Higuchi, “Monte carlo filter using the genetic algorithm operators,” Jour-
nal of Statistical Computation and Simulation, vol. 59, no. 1, pp. 1–23, 1997.
[72] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intel-
ligence, vol. 17, pp. 185–203, 1981.
[73] T. Horprasert, D. Harwood, and L. Davis, “A statistical approach for real-
time robust background subtraction and shadow detection,” in ICCV Frame-
rate Workshop, 1999.
[74] M. Hu, W. Hu, and T. Tan, “Tracking people through occlusions,” in Pat-
tern Recognition, 2004. ICPR 2004. Proceedings of the 17th International
Conference on, pp. 724– 727, 2004.
[75] J.-S. Hu, T.-M. Su, and S.-C. Jeng, “Robust background subtraction with
shadow and highlight removal for indoor surveillance,” in Intelligent Robots
and Systems, 2006 IEEE/RSJ International Conference on, pp. 4545–4550,
2006.
[76] G. Hua and Y. Wu, “Multi-scale visual tracking by sequential belief propa-
gation,” in IEEE Conference on Computer Vision and Pattern Recognition,
(Washington, DC), pp. 826–833, 2004.
[77] M. Isard and A. Blake, “Condensation - conditional density propagation for
visual tracking,” International Journal of Computer Vision, vol. 29, no. 1,
pp. 5–28, 1998.
[78] M. Isard and J. MacCormick, “Bramble: a bayesian multiple-blob tracker,”
in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE Interna-
tional Conference on, vol. 2, pp. 34–41 vol.2, 2001.
[79] J. Jacques, C. Jung, and S. Musse, “Background subtraction and shadow
detection in grayscale video sequences,” in Computer Graphics and Image
Processing, 2005. SIBGRAPI 2005. 18th Brazilian Symposium on, pp. 189–
196, 2005.
[80] O. Javed, S. Khan, Z. Rasheed, and M. Shah, “Camera handoff: tracking in
multiple uncalibrated stationary cameras,” in Workshop on Human Motion,
pp. 113 – 118, 2000.
[81] O. Javed, Z. Rasheed, O. Alatas, and M. Shah, “Knightm: A real time
surveillance system for multiple overlapping and non-overlapping cameras,”
in International Conference on Multimedia Expo (ICME), pp. 649–652, 2003.
[82] O. Javed, Z. Rasheed, K. Shafique, and M. Shah, “Tracking across multiple
cameras with disjoint views,” in The Ninth IEEE Conference on Computer
Vision (ICCV), vol. 2, (Nice, France), pp. 952–957, 2003.
[83] O. Javed, K. Shafique, and M. Shah, “A hierarchical approach to robust
background subtraction using color and gradient information,” in IEEE
Workshop on Motion and Video Computing, pp. 22–27, 2002.
[84] S. Joo and Q. Zheng, “A temporal variance-based moving target detector,”
in IEEE International Workshop on Performance Evaluation of Tracking
Systems (PETS), 2005.
[85] S. Julier and J. Uhlmann, “A consistent, debiased method for converting be-
tween polar and cartesian coordinate systems,” in 11th International Sym-
posium on Aerospace/Defence Sensing, Simulation and Controls, vol. Multi
Sensor Fusion, Tracking and Resource Management II, (Orlando, Florida),
pp. 110–121, 1997.
[86] R. E. Kalman, “A new approach to linear filtering and prediction problems,”
Transactions of the ASME–Journal of Basic Engineering, vol. 82, no. Series
D, pp. 35–45, 1960.
[87] J. Kang, I. Cohen, and G. Medioni, “Tracking people in crowded scenes
across multiple cameras,” in Asian Conference on Computer Vision
(ACCV), 2004.
[88] J. Kang, I. Cohen, and G. Medioni, “Persistent objects tracking across mul-
tiple non overlapping cameras,” in Motion and Video Computing, 2005.
WACV/MOTIONS ’05 Volume 2. IEEE Workshop on (I. Cohen, ed.), vol. 2,
pp. 112–119, 2005.
[89] S. Kang, B.-W. Hwang, and S.-W. Lee, “Multiple people tracking based on
temporal color feature,” International Journal of Pattern Recognition and
Artificial Intelligence, vol. 17, no. 6, pp. 931–949, 2003.
[90] H. Kang, D. Kim, and S. Y. Bang, “Real-time multiple people tracking
using competitive condensation,” in Image Processing. 2002. Proceedings.
2002 International Conference on, vol. 3, pp. III–325–III–328 vol.3, 2002.
[91] J. Kato, T. Watanabe, S. Joga, Y. Liu, and H. Hase, “An hmm/mrf-based
stochastic framework for robust vehicle tracking,” IEEE Transactions on
Intelligent Transportation Systems, vol. 5, no. 3, pp. 142 – 154, 2004.
[92] M. Kazuyuki, M. Xuchu, and H. Hideki, “Global color model based object
matching in the multi-camera environment,” in IEEE/RSJ International
Conference on Intelligent Robots and Systems, 2006, pp. 2644–2649, 2006.
[93] S. Khan, O. Javed, Z. Rasheed, and M. Shah, “Human tracking in multiple
cameras,” in Eighth IEEE International Conference on Computer Vision,
vol. 1, pp. 331 – 336, 2001.
[94] S. Khan and M. Shah, “Object based segmentation of video using color,
motion and spatial information,” in IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, vol. 2, pp. 746–751, 2001.
[95] S. Khan and M. Shah, “Consistent labeling of tracked objects in multiple
cameras with overlapping fields of view,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1355 – 1360, 2003.
[96] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, “Background
modeling and subtraction by codebook construction,” IEEE International
Conference on Image Processing (ICIP), 2004.
[97] K. Kim, D. Harwood, and L. S. Davis, “Background updating for visual
surveillance,” in ISVC, pp. 337–346, 2005.
[98] G. Kitagawa, “Monte carlo filter and smoother for non-gaussian nonlin-
ear state space models,” Journal of Computational and Graphical Statistics,
vol. 5, no. 1, pp. 1–25, 1996.
[99] D. Koller, J. Weber, and J. Malik, “Robust multiple car tracking with oc-
clusion reasoning,” in ECCV, vol. 1, pp. 189–196, 1994.
[100] A. Kong, J. S. Liu, and W. H. Wong, “Sequential imputations and bayesian
missing data problems,” Journal of the American Statistical Association,
vol. 89, no. 425, pp. 278–288, 1994.
[101] N. Krahnstoever, P. Tu, T. Sebastian, A. Perera, and R. Collins, “Multi-
view detection and tracking of travelers and luggage in mass transit environ-
ments,” in IEEE International Workshop on PETS, (New York), pp. 67–74,
2006.
[102] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer,
“Multi-camera multi-person tracking for easyliving,” in Third IEEE Inter-
national Workshop on Visual Surveillance, pp. 3 – 10, 2000.
[103] B. Kwolek, “Person following and mobile camera localization using particle
filters,” in Robot Motion and Control, 2004. RoMoCo’04. Proceedings of the
Fourth International Workshop on, pp. 265–270, 2004.
[104] L. Latecki, R. Miezianko, and D. Pokrajac, “Tracking motion objects in
infrared videos,” in IEEE Conference on Advanced Video and Signal Based
Surveillance (AVSS), pp. 99–104, 2005.
[105] M. Latzel, E. Darcourt, and J. Tsotsos, “People tracking using robust mo-
tion detection and estimation,” in Computer and Robot Vision, 2005. Pro-
ceedings. The 2nd Canadian Conference on, pp. 270–275, 2005.
[106] B. Lei and L.-Q. Xu, “From pixels to objects and trajectories: a generic
real-time outdoor video surveillance system,” in Imaging for Crime Detection
and Prevention, 2005. ICDP 2005. The IEE International Symposium on,
pp. 117–122, 2005.
[107] W. Leoputra, T. Tele, and L. Fee Lee, “Non-overlapping distributed track-
ing using particle filter,” in Pattern Recognition, 2006. ICPR 2006. 18th
International Conference on (T. Tele, ed.), vol. 3, pp. 181–185, 2006.
[108] L. Li, R. Luo, W. Huang, and H.-L. Eng, “Context-controlled adaptive
background subtraction,” in IEEE International Workshop on PETS, New
York, June 18, 2006, (New York), pp. 31–38, 2006.
[109] F. L. Lim, W. Leoputra, and T. Tan, “Non-overlapping distributed tracking
system utilizing particle filter,” Journal of VLSI Signal Processing, vol. 49,
pp. 343–362, 2007.
[110] J. S. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic
systems,” Journal of the American Statistical Association, vol. 93, no. 443,
pp. 1032–1044, 1998.
[111] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”
International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[112] G. Loy, L. Fletcher, N. Apostoloff, and A. Zelinsky, “An adaptive fusion ar-
chitecture for target tracking,” in Automatic Face and Gesture Recognition,
2002. Proceedings. Fifth IEEE International Conference on, pp. 248–253,
2002.
[113] W. Lu and Y.-P. Tan, “A color histogram based people tracking system,”
in 2001 IEEE International Symposium on Circuits and Systems, vol. 2,
pp. 137 – 140, 2001.
[114] B. Lucas and T. Kanade, “An iterative image registration technique with
an application to stereo vision,” in 7th International Joint Conference on
Artificial Intelligence (IJCAI), pp. 674–679, 1981.
[115] M. Lucena, J. Fuertes, N. de la Blanca, and A. Garrido, “An optical flow
probabilistic observation model for tracking,” in Image Processing, 2003.
ICIP 2003. Proceedings. 2003 International Conference on, vol. 3, pp. III–
957–60 vol.2, 2003.
[116] M. Lucena, J. Fuertes, J. Gomez, N. de la Blanca, and A. Garrido, “Optical
flow-based probabilistic tracking,” in Signal Processing and Its Applications,
2003. Proceedings. Seventh International Symposium on, vol. 2, pp. 219–222
vol.2, 2003.
[117] M. Lucena, J. Fuertes, J. Gomez, N. de la Blanca, and A. Garrido, “Track-
ing from optical flow,” in Image and Signal Processing and Analysis, 2003.
ISPA 2003. Proceedings of the 3rd International Symposium on, Vol.2, Iss.,
18-20 Sept. 2003, pp. 651– 655 Vol.2, 2003.
[118] L. Marchesotti, S. Piva, and C. Regazzoni, “An agent-based approach for
tracking people in indoor complex environments,” in 12th International Con-
ference Image Analysis and Processing, pp. 99–102, 2003.
[119] N. Martel-Brisson and A. Zaccarin, “Moving cast shadow detection from a
gaussian mixture shadow model,” in Computer Vision and Pattern Recog-
nition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2,
pp. 643–648 vol. 2, 2005.
[120] S. Maskell and N. Gordon, “A tutorial on particle filters for on-line
nonlinear/non-gaussian bayesian tracking,” in Target Tracking: Algorithms
and Applications (Ref. No. 2001/174), IEE, vol. Workshop, pp. 2/1–2/15
vol.2, 2001.
[121] A. Matsumura, Y. Iwai, and M. Yachida, “Tracking people by using color
information from omnidirectional images,” in 41st SICE Annual Conference,
vol. 3, pp. 1772 – 1777, 2002.
[122] R. v. d. Merwe, N. d. Freitas, A. Doucet, and E. Wan, “The unscented
particle filter,” Advances in Neural Information Processing Systems, vol. 13,
no. Nov, 2001.
[123] C. Micheloni, G. L. Foresti, and L. Snidaro, “A network of co-operative
cameras for visual surveillance,” Vision, Image and Signal Processing, IEE
Proceedings -, vol. 152, no. 2, pp. 205–212, 2005.
[124] A. Mittal and L. Davis, “Unified multi-camera detection and tracking using
region-matching,” in Multi-Object Tracking, 2001. Proceedings. 2001 IEEE
Workshop on, pp. 3–10, 2001.
[125] A. Mittal and L. Davis, “M/sub 2/tracker: a multi-view approach to seg-
menting and tracking people in a cluttered scene,” International Journal of
Computer Vision, vol. 51, no. 3, pp. 189–203, 2003.
[126] K. Morioka and H. Hashimoto, “Color appearance based object identifica-
tion in intelligent space,” pp. 505–510, 2004.
[127] S. Nadimi and B. Bhanu, “Moving shadow detection using a physics-based
approach,” in 16th International Conference on Pattern Recognition,, vol. 2,
pp. 701–704, 2002.
[128] S. Nadimi and B. Bhanu, “Physical models for moving shadow and ob-
ject detection in video,” Pattern Analysis and Machine Intelligence, IEEE
Transactions on, vol. 26, no. 8, pp. 1079–1087, 2004.
[129] K. P. Ng and S. Ranganath, “Tracking people,” in 16th International Con-
ference on Pattern Recognition, vol. 2, pp. 370 – 373, 2002.
[130] A. T. Nghiem, F. Bremond, and M. T. V. Valentin, “Etiseo, performance
evaluation for video surveillance systems,” in IEEE Conference on Advanced
Video and Signal Based Surveillance (AVSS), (London, UK), pp. 476–481,
2007.
[131] N. Nguyen, S. Venkatesh, G. West, and H. Bui, “Hierarchical monitoring
of people’s behaviors in complex environments using multiple cameras,” in
16th International Conference on Pattern Recognition, vol. 1, pp. 13 – 16,
2002.
[132] “Ninth IEEE International Workshop on Performance Evaluation of Tracking
and Surveillance,” 2006.
[133] C. O’Conaire, N. E. O’Connor, E. Cooke, and A. F. Smeaton, “Comparison
of fusion methods for thermo-visual surveillance tracking,” in 9th Interna-
tional Conference on Information Fusion (ICIF), pp. 1–7, 2006.
[134] R. Okada, Y. Shirai, and J. Miura, “Tracking a person with 3-d motion by
integrating optical flow and depth,” in Automatic Face and Gesture Recogni-
tion, 2000. Proceedings. Fourth IEEE International Conference on, pp. 336–
341, 2000.
[135] K. Okuma, A. Taleghani, N. d. Freitas, J. Little, and D. Lowe, “A boosted
particle filter: Multitarget detection and tracking,” in 8th European Confer-
ence on Computer Vision (ECCV), vol. 1, (Prague, Czech Republic), pp. 28–
39, 2004.
[136] N. Oshima, T. Saitoh, and R. Konishi, “Real time mean shift tracking
using optical flow distribution,” in SICE-ICASE, 2006. International Joint
Conference, pp. 4316–4320, 2006.
[137] N. Owens, C. Harris, and C. Stennett, “Hawk-eye tennis system,” in Vi-
sual Information Engineering, 2003. VIE 2003. International Conference on,
pp. 182–185, 2003.
[138] K. Patwardhan, G. Sapiro, and V. Morellas, “Robust foreground detection
in video using pixel layers,” IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol. 30, no. 4, pp. 746–751, 2008.
[139] M. K. Pitt and N. Shephard, “Filtering via simulation: Auxiliary particle
filters,” Journal of the American Statistical Association, vol. 94, no. 446,
pp. 590–599, 1999.
[140] S. Piva, A. Calbi, D. Angiati, and C. S. Regazzoni, “A multi-feature ob-
ject association framework for overlapped field of view multi-camera video
surveillance systems,” pp. 505–510, 2005.
[141] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic
tracking,” in 7th European Conference on Computer Vision, pp. 661 – 675,
2002.
[142] R. Rad and M. Jamzad, “Real-time classification and tracking of multiple
vehicles in highways,” Pattern Recognition Letters, vol. 26, pp. 1597–1607,
2005.
[143] D. Ramanan and D. Forsyth, “Finding and tracking people from the bottom
up,” in IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, vol. 2, pp. 467 – 474, 2003.
[144] A. Rao, R. Srihari, and Z. Zhang, “Geometric histogram: A distribution
of geometric configurations of color subsets,” in SPIE: Internet Imaging,
vol. 3964, pp. 91–101, 2000.
[145] G. Rigoll, S. Eickeler, and S. Muller, “Person tracking in real-world sce-
narios using statistical methods,” in Automatic face and gesture recognition,
(Grenoble, France), pp. 342–347, IEEE; 2000, 2000.
[146] M. N. Rosenbluth and A. W. Rosenbluth, “Monte carlo calculation of the
average extension of molecular chains,” Journal of Chemical Physics, vol. 23,
pp. 356–359, 1955.
[147] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut: Interactive fore-
ground extraction using iterated graph cuts,” in ACM SIGGRAPH, pp. 309
– 314, 2004.
[148] H. Ryu and M. Huber, “A particle filter approach for multi-target tracking,”
in IEEE/RSJ International Conference on Intelligent Robots and Systems,
(San Diego, CA, USA), pp. 2753–2760, 2007.
[149] D. Schulz, W. Burgard, D. Fox, and A. Cremers, “Tracking multiple moving
targets with a mobile robot using particle filters and statistical data associa-
tion,” in IEEE International Conference on Robotics and Automation, vol. 2,
pp. 1665–1670, 2001.
[150] F. Seitner and B. C. Lovell, “Pedestrian tracking based on colour and spatial
information,” in Proceedings of the Digital Image Computing: Techniques
and Applications (DICTA 2005), (Cairns, Australia), pp. 36 – 43, 2005.
[151] P. Shastry and K. Ramakrishnan, “Fast technique for moving shadow detec-
tion in image sequences,” in Signal Processing and Communications, 2004.
SPCOM ’04. 2004 International Conference on, pp. 359–362, 2004.
[152] N. Siebel and S. Maybank, “Fusion of multiple tracking algorithms for ro-
bust people tracking,” in Computer Vision - ECCV 2002. 7th European
Conference on Computer Vision. Proceedings, Part IV (A. Heyden, G. Sparr,
M. Nielsen, and P. Johansen, eds.), (Copenhagen, Denmark), pp. 373–387,
Springer-Verlag, 2002.
[153] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard, “Tracking loose-limbed
people,” in Proceedings of the 2004 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, vol. 1, pp. 421–428, 2004.
[154] Silogic and Inria, “Etiseo metrics definition (http://www-
sop.inria.fr/orion/etiseo/download.htm),” tech. rep., 6th January 2006.
[155] C. Stauffer, “Estimating tracking sources and sinks,” in Event Mining
Workshop, (Madison, WI), 2003.
[156] C. Stauffer, “Learning to track objects through unobserved regions,” in Mo-
tion and Video Computing, 2005. WACV/MOTIONS ’05 Volume 2. IEEE
Workshop on, vol. 2, pp. 96–102, 2005.
[157] C. Stauffer and W. Grimson, “Adaptive background mixture models for
real-time tracking,” in IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, vol. 2, p. 252, 1999.
[158] F. Tang and H. Tao, “Object tracking with dynamic feature graph,” in Vi-
sual Surveillance and Performance Evaluation of Tracking and Surveillance,
2005. 2nd Joint IEEE International Workshop on, pp. 25–32, 2005.
[159] H. Tao, H. S. Sawhney, and R. Kumar, “Object tracking with bayesian esti-
mation of dynamic layer representations,” IEEE Trans. on Pattern Analysis
and Machine Intelligence, vol. 24, no. 1, pp. 75–89, 2002.
[160] T. Thongkamwitoon, S. Aramvith, and T. Chalidabhongse, “An adaptive
real-time background subtraction and moving shadows detection,” in IEEE
International Conference on Multimedia and Expo (ICME), vol. 2, pp. 1459–
1462, 2004.
[161] R. Y. Tsai, “An efficient and accurate camera calibration technique for
3d machine vision,” in IEEE Conference on Computer Vision and Pattern
Recognition, (Miami Beach, FL), pp. 364–374, 1986.
[162] H. Tsutsui, J. Miura, and Y. Shirai, “Optical flow-based person tracking by
multiple cameras,” in International Conference on Multisensor Fusion and
Integration for Intelligent Systems, pp. 91 – 96, 2001.
[163] J. Vermaak, A. Doucet, and P. Perez, “Maintaining multi-modality through
mixture tracking,” in Ninth IEEE International Conference on Computer
Vision (ICCV’03), vol. 2, (Nice, France), pp. 1110–1116, 2003.
[164] L. Vincent and P. Soille, “Watersheds in digital spaces: An efficient al-
gorithm based on immersion simulations,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 13, no. 6, pp. 583–598, 1991.
[165] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of
simple features,” in CVPR, 2001.
[166] “Viper-gt, the ground truth authoring tool,
http://vipertoolkit.sourceforge.net/docs/gt/.”
[167] L. Wang, W. Hu, and T. Tan, “Face tracking using motion-guided dynamic
template matching,” ACCV’2002, 2002.
[168] M.-L. Wang, C.-C. Huang, and H.-Y. Lin, “An intelligent surveillance sys-
tem based on an omnidirectional vision sensor,” in Cybernetics and Intelli-
gent Systems, 2006 IEEE Conference on, pp. 1–6, 2006.
[169] H. Wang and D. Suter, “A re-evaluation of mixture-of-gaussian background
modeling,” in IEEE International Conference on Acoustics, Speech, and Sig-
nal Processing, (Philadelphia, PA, USA), pp. 1017–1020, 2005.
[170] D. Wei and J. Piater, “Data fusion by belief propagation for multi-camera
tracking,” pp. 1–8, 2006.
[171] Q. Wei, D. Schonfeld, and M. Mohamed, “Decentralized multiple camera
multiple object tracking,” in Multimedia and Expo, 2006 IEEE International
Conference on (D. Schonfeld, ed.), pp. 245–248, 2006.
[172] G. Welch and G. Bishop, “An introduction to the kalman filter,” technical
report tr95-041, University of North Carolina at Chapel Hill, 1995.
[173] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, “Pfinder: real-time
tracking of the human body,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 19, no. 7, pp. 780–785, 1997.
[174] D. Xu, J. Liu, Z. Liu, and X. Tang, “Indoor shadow detection for video
segmentation,” in Multimedia and Expo, 2004. ICME ’04. 2004 IEEE Inter-
national Conference on, vol. 1, pp. 41–44 Vol.1, 2004.
[175] T. Yamane, Y. Shirai, and J. Miura, “Person tracking by integrating optical
flow and uniform brightness regions,” in Robotics and Automation, 1998.
Proceedings. 1998 IEEE International Conference on, vol. 4, pp. 3267–3272
vol.4, 1998.
[176] T. Yang, F. Chen, D. Kimber, and J. Vaughan, “Robust people
detection and tracking in a multi-camera indoor visual surveillance system,”
in Multimedia and Expo, 2007 IEEE International Conference on (F. Chen,
ed.), pp. 675–678, 2007.
[177] C. Yang, R. Duraiswami, and L. Davis, “Fast multiple object tracking via
a hierarchical particle filter,” in Tenth IEEE International Conference on
Computer Vision, pp. 212 – 219, 2005.
[178] M.-T. Yang, Y.-C. Shih, and S.-C. Wang, “People tracking by integrating
multiple features,” in Pattern Recognition, 2004. ICPR 2004. Proceedings of
the 17th International Conference on, pp. 929– 932, 2004.
[179] M. Yokoyama and T. Poggio, “A contour-based moving object detection
and tracking,” in Visual Surveillance and Performance Evaluation of Track-
ing and Surveillance, 2005. 2nd Joint IEEE International Workshop on,
pp. 271–276, 2005.
[180] Q. Zang and R. Klette, “Robust background subtraction and maintenance,”
in Proceedings of the 17th International Conference on Pattern Recognition
(ICPR), vol. 2, pp. 90–93, 2004.
[181] Z. Zeng and S. Ma, “Head tracking by active particle filtering,” in Auto-
matic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE Inter-
national Conference on, pp. 82–87, 2002.
[182] H.-J. Zhang, “New schemes for video content representation and their appli-
cations (tutorial),” in IEEE International Conference on Image Processing,
(Singapore), 2004.
[183] W. Zhang, X. Z. Fang, and X. Yang, “Moving cast shadows detection based
on ratio edge,” in Pattern Recognition, 2006. ICPR 2006. 18th International
Conference on, vol. 4, pp. 73–76, 2006.
[184] T. Zhao and R. Nevatia, “Tracking multiple humans in complex situations,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26,
no. 9, pp. 1208–1221, 2004.
[185] Q. Zhao and H. Tao, “Object tracking using color correlogram,” in Vi-
sual Surveillance and Performance Evaluation of Tracking and Surveillance,
2005. 2nd Joint IEEE International Workshop on, pp. 263–270, 2005.
[186] L. ZhiHua and K. Komiya, “Region-wide automatic visual search and
pursuit surveillance system of vehicles and people using networked intelli-
gent cameras,” in 6th International Conference on Signal Processing, vol. 2,
pp. 945 – 948, 2002.
[187] Y. Zhou and H. Tao, “A background layer model for object tracking through
occlusion,” in Ninth IEEE International Conference on Computer Vision,
vol. 2, pp. 1079–1085, 2003.