
Optical vehicle tracking - A framework and tracking solution

November 15, 2004

Ronald Highet

Supervisor: Richard Green

Computer Science Department

Canterbury University

Christchurch, New Zealand


Abstract

There are a large number of different approaches to vehicle tracking currently available. In this report we look at different problems faced when tracking vehicles, such as background subtraction and vehicle recognition, and look at possible solutions to these. We present a layered approach to vehicle tracking, which includes the use of background subtraction based on a statistical colour model, shape approximation using contour creation algorithms, and two-dimensional object recognition using colour histograms and geometric moments. We improve on the standard statistical RGB background removal model by adding a second pass HSV shadow removal filter and demonstrate that this provides cleaner background segmentation than other approaches, including optical flow. We improve on prior research into simple vehicle tracking solutions by combining object recognition based on both colour histograms and geometric moments, and demonstrate the robust nature of our solution through a number of example scenarios.


Acknowledgements

Firstly I would like to thank my supervisor Richard Green, for many hours of late night proof-reading and enthusiastic advice. Thank you to Oliver Hunt for his informative discussion with me on the general theory of moment functions and help with debugging geometric moment related code. Finally I would like to thank Lara Rennie for proof reading this document.


Contents

1 Introduction

2 Related Work
  2.1 View Independent Vehicle/Person Classification[3]
  2.2 Simultaneous Tracking of Pedestrians and Vehicles by the Spatio-Temporal Markov Random Field Model[33]

3 Research Goals
  3.1 Motivation
    3.1.1 Commercial Interest
    3.1.2 Lack of simple solutions
  3.2 Objectives

4 Background
  4.1 The OpenCV library
    4.1.1 Data Types
    4.1.2 Error Handling
    4.1.3 Hardware Requirements
    4.1.4 Software Requirements
  4.2 Background Subtraction
    4.2.1 Simple Background subtraction
  4.3 Optical Flow
  4.4 Polygonal Approximation
    4.4.1 K-Cosine Algorithm
    4.4.2 Rosenfeld-Johnston algorithm[27]
    4.4.3 Teh-Chin Algorithm[5]
  4.5 Moment Functions
    4.5.1 Geometric moments
  4.6 Colour Histograms and Back projection
    4.6.1 Colour based Back projection

5 System Development
  5.1 Requirements
    5.1.1 Low Cost
    5.1.2 Computational Efficiency
    5.1.3 Software Compatibility
    5.1.4 Modularity
  5.2 System Design
    5.2.1 Video Acquisition
    5.2.2 Motion Segmentation
    5.2.3 Vehicle detection
    5.2.4 Vehicle recognition
    5.2.5 Vehicle tracking
  5.3 Apparatus
    5.3.1 Cameras
    5.3.2 Processor
  5.4 Method
    5.4.1 Video Acquisition
    5.4.2 Motion Segmentation
    5.4.3 Vehicle Detection
    5.4.4 Vehicle Recognition
    5.4.5 Vehicle Tracking

6 Summary of Results
  6.1 Performance Results
  6.2 Quality related Results

7 Conclusion


List of Figures

1.1 Traffic taxonomy

2.1 Sample tracking results from view independent tracking system
2.2 Sample results from spatio-temporal random field model based tracking

4.1 Features provided by the OpenCV library
4.2 Data types provided by OpenCV
4.3 Optical flow example
4.4 Ambiguity in determining optical flow

5.1 Overview of System Modules
5.2 Output data format
5.3 Comparing Digital video Cameras
5.4 Computer platform specifications
5.5 Frame rate of camera vs. time for processing each frame
5.6 DSE XH5096 Sample output
5.7 Scenes shot using the ADS Turbo Firewire and USB 2.0 cameras
5.8 Incorrect exposure setting resulting in washed out image
5.9 Simple RGB-based background subtraction indoors
5.10 Simple RGB-based background subtraction with shadows
5.11 Comparing results between shadow removal and no shadow removal
5.12 The effect of size based object thresholding on the object motion map
5.13 Comparison of polygonal approximation methods[27][5]
5.14 Simple vehicle types
5.15 Example vehicle path
5.16 Example of road cone back projection

6.1 Frame rate of tracking application across sample video
6.2 Comparing background subtraction methodologies
6.3 Velocity calculation error


Chapter 1

Introduction

“Computer vision as a field is an intellectual frontier. Like any frontier, it is exciting and disorganised; there is often no reliable authority to appeal to; many useful ideas have no theoretical grounding, and some theories are useless in practice; developed areas are widely scattered, and often one looks completely inaccessible from the other.”¹

¹Computer Vision - A Modern Approach[1]

Computer vision is the process of using statistical models to understand digital images. These models are often constructed with the aid of geometry, physics and learning theory. Computer vision enables us both to make inferences about the real world based on single pixel values and to combine the information obtained from multiple video cameras into a single coherent model. Computer vision also enables us to impose order on groups of pixels to separate them from each other, and to recognise these groups of pixels either based on geometric information such as shape and size or using probabilistic techniques. Computer vision has a large variety of applications including robot navigation, industrial inspection and object tracking. Object tracking adds a temporal aspect to the attributes of computer vision discussed previously; the aim is not simply to assign meaning to objects in a visual scene but also to track their movement through a sequence of images. To achieve this, we must be able to quickly and reliably distinguish between objects in a video stream. An object tracking system[20, 21] must be capable of assigning the same meaning to an object in a digital image independent of its spatial location.

Traffic monitoring systems[13, 36, 33, 38, 34, 19, 8, 18] are software systems designed to collect statistics on traffic in a given geographical region. Traffic types can be classified using the taxonomy illustrated in figure 1.1.

Traffic type   Description
Pedestrian     People walking
Self powered   Bikes, skateboards, tricycles
Automotive     Cars, trucks

Figure 1.1: Traffic taxonomy

This report presents an automotive traffic monitoring framework designed to operate on a standard desktop PC. One primary objective is to keep cost to a minimum, and for this reason only a single optical sensor (camera) is used. The field of optically based traffic monitoring systems is well researched and a large number of systems [18, 33, 38, 34, 19, 8, 24, 22] have been presented that tackle this task in a variety of different ways. Our system improves on these by combining a number of different approaches. We initially approached the problem domain by developing a modularised framework. This modular approach to the development of our tracking system enables us to 'snap together' different algorithms for each layer without affecting subsequent sections of our system. There are five layers in the system: Video Acquisition, Background Removal, Vehicle Detection, Vehicle Recognition and Vehicle Tracking. For each of these layers a number of algorithms have been evaluated and the most promising selected for use in the final system. The final tracking system includes the use of statistical colour-model based background subtraction (section 4.2), shape approximation using polygonal approximation algorithms (section 4.4) and two-dimensional object recognition using geometric moments (section 4.5) and colour histograms (section 4.6). The polygonal approximation based approach is similar to that presented in Gerald Kuhne et al. (2001)[9]. However, we significantly improve the reliability by combining it with colour-based object recognition, similar to the methodologies presented in Katja Nummiaro et al. (2001, 2003)[15, 16]. We improve on the standard background subtraction algorithm by adding a second pass shadow removal filter and demonstrate that this provides cleaner background segmentation than other approaches, including optical flow.

An introduction to traffic monitoring systems has been presented in this chapter. Our report continues in Chapter 2 by introducing a number of previously developed traffic monitoring systems. Chapter 3 presents the motivations and aims of this research. Following this, chapter 4 provides the background information necessary to fully understand the system. Chapter 5 then discusses the development of our traffic monitoring system, providing an overview of the system requirements, the system's design and implementation details. A summary of the research results is then presented in chapter 6. Finally, the concluding chapter summarises the research presented in this report and discusses possible extensions to the system and future applications of this research.


Chapter 2

Related Work

There is currently a large amount of commercial interest in vehicle tracking and traffic monitoring systems. Traffic monitoring systems are particularly useful in the age of terrorism, to increase the effectiveness of sparsely distributed security forces. The ability to automatically count the number of vehicles that enter or leave an area or that pass a given location, as well as the ability to search for specific types of cars across multiple simultaneous video streams, can drastically improve the effectiveness of security personnel. While large numbers of these systems are not publicly released, there is still a large amount of research publicly available on the development of similar systems. This chapter discusses a number of these systems that are directly related to our research, and discusses their strengths and weaknesses.

2.1 View Independent Vehicle/Person Classification[3]

This paper presents an object classification system for digital surveillance that can be used from an arbitrary camera viewpoint. The system is designed to distinguish humans from vehicles in an arbitrary scene. The system employs a two phase approach. In the primary phase, human and vehicle recognition is performed using traditional feature-based classification; this phase initialises the view-normalisation parameters. The second phase performs object classification based on these normalised features. This normalisation also enables absolute identification of size and speed, which can be used to identify vehicles and to search for objects of certain sizes and speeds.

This system has a relatively straight-forward object classification approach. Moving objects are detected by combining evidence from differences in colour and motion. This background subtraction approach is identical to that developed in the paper by Connel, J. et al. (2003)[6]. The resulting saliency map is smoothed and holes removed using morphological operators. Several mechanisms are built into this phase of the system to deal with changing lighting conditions and scene composition. Detected objects are tracked using both size and colour properties. Our approach to vehicle tracking is based on the approach presented in this paper, combining both colour and motion cues to produce a robust tracking solution. Our approach does not need to handle pedestrians, however, and is thus considerably simpler.

2.2 Simultaneous Tracking of Pedestrians and Vehicles by the Spatio-Temporal Markov Random Field Model[33]

This paper presents a traffic monitoring system capable of tracking both pedestrians and vehicles, similar to the system presented by Lisa M. Brown (2003)[3]. The system uses a predefined Spatio-Temporal Markov Random Field Model (S-T MRF) to divide an image into blocks of related pixels.


Figure 2.1: Sample tracking results from view independent tracking system

The system performs background subtraction by processing only those blocks which have a significantly different texture to that of the background.

The system defines the following functions:

1. G(t − 1) = g, G(t) = h: an image G at times t − 1 and t has the values g and h respectively. The value of each pixel is defined by G(t − 1; i, j) = g(i, j) and G(t; i, j) = h(i, j).

2. X(t − 1) = x, X(t) = y: an object map X at times t − 1 and t is estimated to have the label distributions x and y respectively. For each block k in the map, X_k(t − 1) = x_k and X_k(t) = y_k.

3. V(t − 1; t) = v: a motion-vector map V from time t − 1 to t is estimated with respect to each block.

Using the S-T MRF model, the system simultaneously determines the object map X(t) and the motion vector field V(t), which are then used to calculate a maximum a posteriori probability field. The paper improves on the standard S-T MRF by simultaneously optimising segmentation boundaries and motion vectors, referring to both texture and labelling correlations across the video stream (across the temporal axis). The results show that the system is very robust for tracking flexible and fixed objects such as pedestrians and vehicles, even with frequent occlusion.

Figure 2.2: Sample results from spatio-temporal random field model based tracking


Chapter 3

Research Goals

This chapter introduces our research into vehicle tracking. Section 3.1 introduces the motivations behind this research project, discussing the reasons why this project was undertaken. Section 3.2 then presents the objectives of this research project.

3.1 Motivation

The motivations behind this research project are two-fold: there is much commercial interest in tracking solutions, and simultaneously there is a lack of readily available simple approaches to vehicle tracking. These motivations are further discussed below.

3.1.1 Commercial Interest

Since September 11, 2001, there has been a large interest in terrorism prevention. The United States Department of Defense has allocated many millions of dollars to the development of systems that are capable of tracking vehicles and people. These systems have been deployed throughout the United States, both in airports and other public facilities. While these have largely been operational failures, these projects have served as a catalyst for commercial interest in security systems based on computer vision. In Christchurch, New Zealand, a number of companies are looking at the development of systems capable of tracking vehicles across a number of video sources. The aim is to develop a system that would lighten the load on security forces by automating as much of the surveillance work as possible. The system developed in this paper prototypes a number of the facets of such a security system and provides a base for future development.

3.1.2 Lack of simple solutions

Most vehicle tracking solutions presented in research publications are highly specialised, and the algorithms developed in these papers are never extended. Large numbers of these algorithms achieve the same results, yet there is no comparison between the algorithm presented and similar past research. The motivation behind this project is to evaluate a large number of possible solutions to the problems encountered when building an optical tracking system, and to select the most promising solutions for further development. This avoids the unnecessary development of solutions to problems which have already been solved.


3.2 Objectives

The objectives of this research project are as follows:

1. Develop a framework for the development of vehicle tracking applications. This framework should be platform independent and support the rapid prototyping of vehicle tracking solutions.

2. Using the aforementioned development framework, develop a simple vehicle tracking system that uses a single camera and relies on optical cues for tracking.

3. Utilize the development of the tracking system to evaluate a number of recent approaches to vehicle tracking, and select the most promising approaches for further development.


Chapter 4

Background

This chapter introduces relevant background information: concepts that are critical to the development of the tracking solution. OpenCV, a computer vision software development kit used as a base for system development, is presented in section 4.1. A simple approach to background subtraction is presented in section 4.2. Another technique used in system development for both background subtraction and object identification, called optical flow, is presented in section 4.3. Polygonal approximation algorithms are covered in section 4.4, followed by a discussion of moment functions in section 4.5. Finally, colour histograms and colour-based back projection of images are discussed in section 4.6.

4.1 The OpenCV library

The OpenCV library is a software library designed to aid the development of software projects involving computer vision. It implements a wide variety of tools for computer vision applications and is compatible with the Intel Image Processing Library (IPL), which implements a large range of very low level operations for manipulating digital images. OpenCV provides a small range of primitive functions, but is in essence a high-level library. Figure 4.1 lists a number of features provided by the OpenCV library.

OpenCV Features
Camera calibration
Optical feature detection
Optical feature tracking, including optical flow
Shape analysis, including geometry and contour processing functionality
Motion analysis, including motion templates and estimators
Three-dimensional reconstruction, including view morphing
Object segmentation
Object recognition, including histograms
Markov models and eigen objects

Figure 4.1: Features provided by the OpenCV library

4.1.1 Data Types

There are a number of fundamental and helper data types supplied with the OpenCV library. The most important of these are summarised in figure 4.2.

Data Type     Description
IplImage      Generic image structure
CvMat         Generic matrix structure
CvSeq         Deque collection
CvSet         Set structure
CvGraph       Generic graph structure
CvHistogram   Generic histogram structure
CvPoint       Two-dimensional point
CvMoments     Spatial moment structure

Figure 4.2: Data types provided by OpenCV

The presence of these data types simplifies the development of computer vision projects, as it minimizes the amount of low level coding required. This is especially useful in projects similar in nature to ours, because more time can be devoted to research and implementation of new algorithms.

4.1.2 Error Handling

OpenCV uses a set of global error status flags that can be set or retrieved using the presupplied functions cvError and cvGetErrStatus. The error-handling mechanisms are fully adjustable and user-specifiable.
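For illustration, a minimal sketch of this mechanism (assuming the OpenCV C API of this era; cvSetErrMode, cvGetErrStatus and cvErrorStr are library calls, while the surrounding program is our own):

```cpp
#include <cv.h>
#include <stdio.h>

int main()
{
    // Switch OpenCV from its default "terminate on error" behaviour to
    // "record the error and return", so the status flag can be polled.
    cvSetErrMode(CV_ErrModeParent);

    IplImage* img = cvCreateImage(cvSize(320, 240), IPL_DEPTH_8U, 3);

    // ... image processing calls go here ...

    // Poll the global error status set by the most recent OpenCV call.
    int status = cvGetErrStatus();
    if (status < 0)
        printf("OpenCV error: %s\n", cvErrorStr(status));

    cvReleaseImage(&img);
    return 0;
}
```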

4.1.3 Hardware Requirements

The OpenCV library is capable of running on most personal computers, but it is optimized only for Intel-based systems; as such it is highly recommended that an Intel system be used for any OpenCV software development. The OpenCV library is available on both Windows and Linux platforms, and is supplied as a package with the Debian distribution of Linux.

4.1.4 Software Requirements

The OpenCV library is written in C++ and as such could feasibly be ported to any system that has a C++ compiler. The library is supplied with a Visual Studio 6 project file; however, a number of the functions used throughout the library are not compatible with Microsoft's latest Visual Studio .NET. This problem was remedied early in the development of the system, with each of the problem functions rewritten so that they work within the .NET environment.

4.2 Background Subtraction

Background subtraction, also known as background segmentation or motion segmentation, is the removal of sections of the source video that are not relevant to the project. In the case of our tracking system, the term 'background' is defined as the set of motionless pixels, or pixels that do not belong to any object moving laterally across the field of view of our camera. In many situations, background subtraction can be achieved by subtracting an estimate of the appearance of the background from the image and looking for large absolute values in the result. The main issue associated with these types of background subtraction is obtaining a good estimate of the background; different background subtraction methods provide different solutions to this problem. A number of approaches to background subtraction are presented below.

4.2.1 Simple Background subtraction

The simplest background model assumes that the brightness of every background pixel varies independently, according to a normal distribution. Background characteristics can be calculated


by accumulating a large number of video frames, as well as their squares. This means it is necessary to find the sum of pixel values $\text{scene}(x, y)$ and the sum of squared values $\text{scene}^2(x, y)$ for every pixel location. Given that N represents the number of video frames collected so far, the mean pixel intensity can be calculated using the following equation:

$$\text{mean}(x, y) = \frac{\sum \text{scene}(x, y)}{N}$$

The standard deviation of pixel intensity at that point can be calculated as follows:

$$\sigma(x, y) = \sqrt{\frac{\sum \text{scene}^2(x, y)}{N} - \left(\frac{\sum \text{scene}(x, y)}{N}\right)^2}$$

After this, a pixel is regarded as being in motion if it satisfies the following condition:

$$\left|\text{mean}(x, y) - \text{scene}(x, y)\right| > 3\sigma(x, y)$$

This approach produces poor results however, as the background typically changes slowly over time. An example of such a change is a road darkening as it rains and lightening again as the weather dries up. The algorithm can be improved by using a moving average. In this approach, we estimate the value of a particular background pixel as a weighted average of the previous values. Pixels in the distant past should be weighted with 0, and the weights should smoothly increase towards the present. The moving average should track the changes in the background: if the weather changes quickly, relatively few frames should have nonzero weights, and if changes are slow then the number of nonzero weighted frames should increase. The background subtraction formula based on a running average is shown below:

$$\text{mean}(x, y) = \frac{1}{N}\sum_{j=i}^{i+N-1} \text{scene}_j(x, y)$$
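To make the model concrete, the following minimal sketch (our own illustration, assuming the OpenCV C API described in section 4.1; the function and image names are ours) accumulates the per-pixel sums and squared sums with cvAcc and cvSquareAcc, then flags as motion any pixel deviating from the mean by more than 3σ:

```cpp
#include <cv.h>

// Accumulators: 32-bit float images, same size as the 8-bit grey input.
// sum   += frame          (cvAcc)
// sqsum += frame * frame  (cvSquareAcc)
void accumulateFrame(IplImage* grey, IplImage* sum, IplImage* sqsum)
{
    cvAcc(grey, sum);
    cvSquareAcc(grey, sqsum);
}

// After N frames, build mean and sigma, then threshold a new frame:
// motion(x,y) = 255 if |mean - scene| > 3*sigma, else 0.
void segmentMotion(IplImage* grey, IplImage* sum, IplImage* sqsum,
                   int N, IplImage* motion /* 8U, 1 channel */)
{
    CvSize size = cvGetSize(grey);
    IplImage* mean  = cvCreateImage(size, IPL_DEPTH_32F, 1);
    IplImage* sigma = cvCreateImage(size, IPL_DEPTH_32F, 1);
    IplImage* tmp   = cvCreateImage(size, IPL_DEPTH_32F, 1);
    IplImage* diff  = cvCreateImage(size, IPL_DEPTH_32F, 1);

    cvConvertScale(sum, mean, 1.0 / N, 0);     // mean = sum / N
    cvConvertScale(sqsum, sigma, 1.0 / N, 0);  // E[x^2]
    cvMul(mean, mean, tmp, 1.0);               // (E[x])^2
    cvSub(sigma, tmp, sigma);                  // variance
    cvPow(sigma, sigma, 0.5);                  // sigma = sqrt(variance)

    cvConvert(grey, tmp);                      // current frame as float
    cvAbsDiff(mean, tmp, diff);                // |mean - scene|
    cvConvertScale(sigma, sigma, 3.0, 0);      // 3 * sigma
    cvCmp(diff, sigma, motion, CV_CMP_GT);     // 255 where in motion

    cvReleaseImage(&mean);  cvReleaseImage(&sigma);
    cvReleaseImage(&tmp);   cvReleaseImage(&diff);
}
```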

4.3 Optical Flow

Optical flow[11] is the apparent motion of brightness patterns in an image. Optical flow algorithms produce a velocity vector field (motion field) that can warp one image to another. An example of two images and the corresponding vector field is shown in figure 4.3.

These vector fields can be used for object detection and tracking by grouping vectors of similar length and direction. They can also be used for background subtraction, by regarding those pixels with a small associated vector as being part of the background. The seminal paper on optical flow by Horn and Schunck (1980)[12] provides a reliable algorithm for calculating optical flow motion vector fields. The process of calculating optical flow is now described.

Assuming that the intensity of an image can be represented as a two-dimensional density distribution I(x, y) which varies over time t, we can define the function I(x, y, t), where the intensity is a function of t as well as of x and y. Further assumptions are detailed below.

• Brightness I(x, y, t) varies smoothly with the x and y coordinates across the majority of the image.

• Brightness of every point of a moving or static object does not change over time.


Figure 4.3: Optical flow example

To calculate the change in intensity over time, we differentiate with respect to time.

$$I(x + dx, y + dy, t + dt) = I(x, y, t) \tag{4.1}$$

$$\frac{dI}{dt} = \frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} \tag{4.2}$$

We now assume that the image intensity of each visible scene point does not change over time, i.e. that shadows and illumination are not changing due to motion. This leads to the following equation:

$$\frac{dI}{dt} = 0 \tag{4.3}$$

This implies that

$$0 = \frac{\partial I}{\partial x}\frac{dx}{dt} + \frac{\partial I}{\partial y}\frac{dy}{dt} + \frac{\partial I}{\partial t} \tag{4.4}$$

Writing the pixel velocities as

$$\frac{dx}{dt} = u \tag{4.5}$$

$$\frac{dy}{dt} = v \tag{4.6}$$

u and v represent the speed at which the pixel is moving in the x and y directions respectively. Thus, taking the limit as dt tends towards zero,

$$I_x u + I_y v + I_t = 0 \tag{4.7}$$

where the partial derivatives of I are denoted by the subscripts x, y and t, and u and v are the components of the optical flow motion vector. Equation 4.7 is known as the optical flow constraint equation, as it expresses a constraint on the components u and v of the optical flow vector field. The optical flow constraint equation is often rewritten as shown below:


$$(I_x, I_y) \cdot (u, v) = -I_t \tag{4.8}$$

or

$$-\frac{\partial I}{\partial t} = \frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v \tag{4.9}$$

Here $\partial I/\partial t$ at a given pixel represents how fast the intensity is changing over time, while $\partial I/\partial x$ and $\partial I/\partial y$ represent how rapidly intensity changes along the x and y axes respectively. We wish to calculate u and v. However, we only have one equation per pixel and two unknown variables, so we cannot unambiguously calculate our vector components. An example of this is shown in Figure 4.4.

Figure 4.4: Ambiguity in determining optical flow

The lines are contours of identical intensity in both images. The ambiguity occurs when attempting to tell whether the part of the image represented by point A in the first image has moved to point B, B' or any other point of the same intensity in the second image. We require additional information to unambiguously determine the values of u and v for each vector in our optical flow vector field. A partial solution to this problem lies in the observation that the motion observed at adjacent pixels is very similar, except near the edges of moving objects. A measure of how much the optical flow deviates from this smoothly varying ideal can be calculated by evaluating the following integral over the whole image:

$$S = \iint_{\text{image}} \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 dx\, dy \tag{4.10}$$

The smoothness constraint represented in equation 4.10 sometimes conflicts with the optical flow constraint. It is possible to show how far the solution provided by our smoothness constraint deviates from the condition required by the optical flow constraint by evaluating the following equation:

$$C = \iint_{\text{image}} \left(\frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t}\right)^2 dx\, dy \tag{4.11}$$

In order to combine these two constraints, it is necessary to use a technique called Lagrangian multipliers[31]. We attempt to find a solution for u and v which minimises S + kC, where k is a scalar multiplier. Typically k is large if our intensity measurements are accurate (when C is small) and small if our accuracy is low (when C is large). Minimising the resulting integral produces the following equations for u and v which need to be satisfied:

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = k\left(\frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t}\right)\frac{\partial I}{\partial x} \tag{4.12}$$

$$\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2} = k\left(\frac{\partial I}{\partial x}u + \frac{\partial I}{\partial y}v + \frac{\partial I}{\partial t}\right)\frac{\partial I}{\partial y} \tag{4.13}$$

The optical flow motion vector field can then be produced by iteratively stepping across the image and solving these equations for each pixel, using the intensity I for each pixel obtained from the original image and the weighting factor k described previously.
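OpenCV exposes the Horn and Schunck scheme directly through cvCalcOpticalFlowHS, so a flow field can be produced without hand-coding this iteration. A brief sketch; the lambda value and iteration limits are illustrative assumptions, not values taken from this report:

```cpp
#include <cv.h>

// Compute a dense Horn-Schunck flow field between two 8-bit grey frames.
// velx/vely receive the u and v components for every pixel.
void hornSchunckFlow(IplImage* prev, IplImage* curr,
                     IplImage* velx, IplImage* vely /* 32F, 1 channel */)
{
    // Iterate at most 100 times, or until updates fall below 0.01.
    CvTermCriteria crit = cvTermCriteria(CV_TERMCRIT_ITER | CV_TERMCRIT_EPS,
                                         100, 0.01);
    // The 0.5 is the smoothness weighting (the scalar k above).
    cvCalcOpticalFlowHS(prev, curr, 0 /* don't reuse previous flow */,
                        velx, vely, 0.5, crit);
}
```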

4.4 Polygonal Approximation

Polygonal approximation algorithms aim to approximate the shape of a curve using a number of fundamental mathematical functions. They provide simple representations for relatively complex objects, and produce data that can easily be processed further. The fundamental idea behind all of these algorithms is to find and keep only the dominant points of any curve. These are the points which are critical in defining the shape of the curve, and hence the points where the local maxima of the curvature angle are located on the digital curve. In order to approximate a curve, it is necessary to first represent the curve discretely. The curvature of a continuous curve at any one point can be determined as the speed at which the tangent angle is changing, as shown below:

$$k = \frac{x'y'' - x''y'}{\left(x'^2 + y'^2\right)^{3/2}} \tag{4.14}$$

This continuous curve can be quantised using a simple approximation called L1 curvature:

$$c_i = \left((f_i - f_{i-1} + 4) \bmod 8\right) - 4 \tag{4.15}$$

where f_i denotes the Freeman chain code of the ith curve segment.

This method produces a value whose magnitude varies between 0 and 4, representing the change in direction between two successive segments of the curve: 0 represents no change in curvature (a straight line) and 4 represents a complete reversal in direction. A number of more complex algorithms can be used to get more accurate discrete approximations of curves. A selection of these are discussed below.
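As a concrete illustration of equation 4.15, a minimal sketch (our own) computing the L1 curvature of a curve supplied as Freeman chain codes (8-connectivity directions 0-7):

```cpp
#include <vector>

// L1 curvature of a digital curve given as Freeman chain codes (0..7).
// Result at i is ((f[i] - f[i-1] + 4) mod 8) - 4: 0 for straight runs,
// +/-4 for a full reversal of direction.
std::vector<int> l1Curvature(const std::vector<int>& f)
{
    std::vector<int> c(f.size(), 0);
    for (size_t i = 1; i < f.size(); ++i) {
        int d = f[i] - f[i - 1] + 4;
        d = ((d % 8) + 8) % 8;      // proper non-negative modulo
        c[i] = d - 4;
    }
    return c;
}
```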

4.4.1 K-Cosine Algorithm

For any given point (x_i, y_i) on the curve, the radius m_i defines the neighbourhood of points that can be selected as the next node in the curve. Some algorithms use a constant value of m_i for all points on the curve, while many others calculate a suitable value of m_i for every point. The following values are calculated for all pairs (x_{i−k}, y_{i−k}) and (x_{i+k}, y_{i+k})¹:

$$c_{ik} = \cos(a_{ik}, b_{ik}) \tag{4.16}$$

where

$$a_{ik} = (x_{i-k} - x_i,\; y_{i-k} - y_i) \tag{4.17}$$

$$b_{ik} = (x_{i+k} - x_i,\; y_{i+k} - y_i) \tag{4.18}$$

The index value h_i is then calculated such that $c_{im} < c_{i,m-1} < \cdots < c_{ih_i} \ge c_{i,h_i-1}$. The value $c_{ih_i}$ is regarded as the curvature value of the ith point on the curve. The value of $c_{ik}$ can vary between −1 and 1, representing minimum and maximum curvature respectively.

¹k = 1 ... m


4.4.2 Rosenfeld-Johnston algorithm[27]

This algorithm is one of the earliest algorithms for polygonal approximation. The algorithm requires that the parameter m be between 1/10th and 1/15th of the number of points on the digital curve. The algorithm removes every point i of the curve for which there exists a point j satisfying

$$|i - j| \le \frac{h_i}{2} \quad \text{and} \quad c_{ih_i} < c_{jh_j} \tag{4.19}$$

The remaining points are treated as the dominant points on the curve.

The primary weakness of this algorithm is the necessity of choosing the parameter m, which must be identical for all points; this can result in either very rough approximations or excessively precise approximations.

4.4.3 Teh Chin Algorithm[5]

This polygonal approximation algorithm was proposed by Teh and Chin (1989)[5]. The algorithm makes several passes through the curve and deletes some points at each pass. First, all points with zero curvature are removed. For every other point, the region of support m_i is calculated along with the angle of curvature c_i. The algorithm then performs non-maxima suppression, deleting points whose curvature is not a local maximum within their region of support. Finally, the algorithm replaces groups of two successive remaining points with a single point, and groups of three or more successive points with a pair consisting of the first and the last point. This algorithm requires no input apart from the original curve representing the outline of the object we wish to represent.
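In OpenCV, polygonal approximation is reached through cvApproxPoly, whose built-in method is the Douglas-Peucker variant rather than the algorithms above; the sketch below therefore illustrates the library interface rather than the Teh-Chin algorithm itself. The accuracy parameter is an assumption to be tuned:

```cpp
#include <cv.h>

// Approximate every contour found in a binary motion map with a polygon.
// 'accuracy' is the Douglas-Peucker distance tolerance in pixels.
// Note: cvFindContours modifies its input image.
CvSeq* approximateContours(IplImage* motionMap /* 8U binary */,
                           CvMemStorage* storage, double accuracy)
{
    CvSeq* contours = 0;
    cvFindContours(motionMap, storage, &contours, sizeof(CvContour),
                   CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, cvPoint(0, 0));

    // Approximate the whole contour sequence; each output polygon keeps
    // only dominant points within 'accuracy' of the original curve.
    return cvApproxPoly(contours, sizeof(CvContour), storage,
                        CV_POLY_APPROX_DP, accuracy, 1 /* all contours */);
}
```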

4.5 Moment Functions

Moment functions are used throughout image analysis. Examples of applications are object classification, object recognition, pose estimation, image compression and invariant pattern recognition. A set of moments calculated for a given image generally represents global characteristics of the image shape and reveals a large amount of information about the geometric features of the image. This feature representation capability has been used in computer vision and robotics for object recognition. This project utilises the computationally inexpensive geometric moments (section 4.5.1) for object recognition.

An image can be represented as a two-dimensional density distribution f(x, y), where f(x, y) represents the intensity at the pixel co-ordinate (x, y). If ζ represents the domain of the function f(x, y) (the size of the image in pixels), a general definition of all moment functions φ_pq of order (p + q) is shown below:

$$\phi_{pq} = \iint_{\zeta} \psi_{pq}(x, y)\, f(x, y)\, dx\, dy \qquad p, q = 0, 1, 2, \ldots \tag{4.20}$$

where ψ_pq(x, y) is the moment weighting kernel, a continuous function of (x, y). The indices (p, q) of this function denote the degrees of the coordinates (x, y), as defined inside the function ψ.

The first and most significant paper regarding geometric moments was published by Hu[14] in 1962. In this paper Hu used the geometric moments to generate a set of invariants that could be used for automatic character recognition. Geometric moments have also been used in a number of other object recognition applications, such as recognising ships by Smith[7] and recognising aircraft by Dudani[29] in 1977.


4.5.1 Geometric moments

Geometric moments are the simplest of the moment functions. Their kernel is defined as the product of the pixel coordinates. The primary advantage of these geometric moments is that transforms in the image space can be easily expressed in terms of corresponding transforms in the moment space. Geometric moments are computationally less expensive than other moment calculations due to the simplicity of their kernel function. Functions of geometric moments that are invariant in the image plane are used throughout computer vision for object recognition.

Geometric moments are defined with the basis set x^p y^q. With the (p + q)th order 2D geometric moment denoted as m_pq, they can be expressed as follows:

$$m_{pq} = \iint_{\zeta} x^p y^q\, f(x, y)\, dx\, dy \qquad p, q = 0, 1, 2, 3, \ldots \tag{4.21}$$

where ζ represents the image space in which the intensity function f(x, y) is defined.

Each order of geometric moments represents different spatial characteristics of the image for which they were calculated. A set of these moments of different orders can therefore be used to create an overall descriptor of the image's shape. A number of the early order geometric moments are explained below.

The order zero geometric moment m00 represents the total intensity of the image. For a silhouette image, m00 represents the total area of the image region.

The first order geometric moments m10 and m01 represent the intensity moments along the x and y axes respectively. The centre of mass or centroid (x0, y0) of an image can be calculated from m00, m01 and m10 as follows:

$$x_0 = \frac{m_{10}}{m_{00}} \qquad y_0 = \frac{m_{01}}{m_{00}} \tag{4.22}$$

$$\text{centroid} = (x_0, y_0) \tag{4.23}$$

The second order moments measure the variance of the image’s intensity distribution aroundthe origin of the image.
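A short sketch of how the area and centroid fall out of OpenCV's moment routines (cvMoments and cvGetSpatialMoment are library calls; the silhouette argument is assumed to be a binary motion map with a non-empty region):

```cpp
#include <cv.h>

// Compute the area and centroid of a silhouette from its geometric moments.
void silhouetteCentroid(IplImage* silhouette /* 8U binary */,
                        double* area, double* x0, double* y0)
{
    CvMoments m;
    cvMoments(silhouette, &m, 1 /* treat non-zero pixels as 1 */);

    double m00 = cvGetSpatialMoment(&m, 0, 0);   // area (equation 4.21)
    double m10 = cvGetSpatialMoment(&m, 1, 0);
    double m01 = cvGetSpatialMoment(&m, 0, 1);

    *area = m00;
    *x0 = m10 / m00;   // centroid x = m10 / m00 (equation 4.22)
    *y0 = m01 / m00;   // centroid y = m01 / m00
}
```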

Central geometric moments

Given that (x0, y0) represents the optical centre of the image, we can shift the origin of our reference system to this centre. This transformation means that the moment computation will be independent of the image's reference system. Moments computed with respect to the optical centroid are called central moments, defined as follows:

$$\eta_{pq} = \iint_{\zeta} (x - x_0)^p\, (y - y_0)^q\, f(x, y)\, dx\, dy \qquad p, q = 0, 1, 2, 3, \ldots \tag{4.24}$$

The order zero central moment η00 is therefore directly equivalent to m00. The first ordercentral moments η10 and η01 are both equal to zero. The second order central moments η20 andη02 measure the variance in the intensity distribution around the optical center of the image, andcovariance measure is given by η11.

Functions of these geometric moments, which are invariant with regard to image-plane trans-forms, are used throughout object identification. The central moments are by definition invariantto translation. This is because the image’s centroid moves with the image during translation, and

Page 21: Optical vehicle tracking - A framework and tracking solution › research › reports › ... · Optical vehicle tracking - A framework and tracking solution 2 Acknowledgements Firstly

Optical vehicle tracking - A framework and tracking solution 20

central moments are defined with the centroid as their origin. It is possible to calculate the secondand third order central moments from the ordinary geometric moments as shown below:

$$\eta_{02} = m_{02} - y_0 m_{01} \tag{4.25}$$

$$\eta_{20} = m_{20} - x_0 m_{10} \tag{4.26}$$

$$\eta_{11} = m_{11} - y_0 m_{10} \tag{4.27}$$

$$\eta_{30} = m_{30} - 3x_0 m_{20} + 2x_0^2 m_{10} \tag{4.28}$$

$$\eta_{03} = m_{03} - 3y_0 m_{02} + 2y_0^2 m_{01} \tag{4.29}$$

$$\eta_{21} = m_{21} - 2x_0 m_{11} - y_0 m_{20} + 2x_0^2 m_{01} \tag{4.30}$$

$$\eta_{12} = m_{12} - 2y_0 m_{11} - x_0 m_{02} + 2y_0^2 m_{10} \tag{4.31}$$
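Once cvMoments has run, OpenCV can return these central moments directly via cvGetCentralMoment, avoiding the manual algebra above; a minimal sketch (the function name and output layout are ours):

```cpp
#include <cv.h>

// Extract the second and third order central moments for shape description.
void centralMoments(IplImage* silhouette, double eta[7])
{
    CvMoments m;
    cvMoments(silhouette, &m, 1);

    // cvGetCentralMoment returns mu_pq computed about the centroid,
    // matching equations 4.25-4.31.
    eta[0] = cvGetCentralMoment(&m, 2, 0);   // eta20
    eta[1] = cvGetCentralMoment(&m, 0, 2);   // eta02
    eta[2] = cvGetCentralMoment(&m, 1, 1);   // eta11
    eta[3] = cvGetCentralMoment(&m, 3, 0);   // eta30
    eta[4] = cvGetCentralMoment(&m, 0, 3);   // eta03
    eta[5] = cvGetCentralMoment(&m, 2, 1);   // eta21
    eta[6] = cvGetCentralMoment(&m, 1, 2);   // eta12
}
```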

M. Hu (1962)[14] has shown that a set of seven features derived from the second and third-order centralised geometric moments is invariant to translation, rotation, and scale change. These seven features are calculated as follows:

$$h_1 = \eta_{20} + \eta_{02} \tag{4.32}$$

$$h_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \tag{4.33}$$

$$h_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \tag{4.34}$$

$$h_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \tag{4.35}$$

$$h_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \tag{4.36}$$

$$h_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \tag{4.39}$$

$$h_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \tag{4.41}$$

From these seven invariant feature descriptors, three similarity measures I_1, I_2 and I_3 can be calculated as shown below:

$$I_1(A, B) = \sum_{i=1}^{7} \left| -\frac{1}{m_i^A} + \frac{1}{m_i^B} \right| \tag{4.44}$$

$$I_2(A, B) = \sum_{i=1}^{7} \left| -m_i^A + m_i^B \right| \tag{4.45}$$

$$I_3(A, B) = \max_i \left| \frac{m_i^A - m_i^B}{m_i^A} \right| \tag{4.46}$$

where

$$m_i^A = \operatorname{sgn}(h_i^A) \log_{10} |h_i^A| \tag{4.47}$$

$$m_i^B = \operatorname{sgn}(h_i^B) \log_{10} |h_i^B| \tag{4.48}$$

and sgn() determines the sign of a number, returning 1 if the number is positive, 0 if it is 0, and −1 if it is negative.

4.6 Colour Histograms and Back projection

A histogram is a discrete approximation of a continuous variable's probability distribution. Histograms are widely used in image processing and computer vision. Uses of one-dimensional histograms include grey scale image enhancement, automatic threshold calculation, and colour segmentation via hue histogram back projection. Multi-dimensional histograms are used in a variety of situations, including analysing and segmenting colour images normalised to brightness (e.g. red-green or hue-saturation images), analysing and segmenting motion fields (x-y or magnitude-angle histograms), analysing shapes, content based retrieval, and Bayesian-based object recognition.

Histograms represent a simple statistical description of an object, e.g. an image. The object's characteristics are measured by iterating through that object; for example, colour histograms for an image are built from pixel values in one of the colour spaces. All possible values of that multi-dimensional characteristic are quantised on each coordinate. The histogram can be viewed as a multi-dimensional array. Each dimension of the histogram corresponds to a different object feature, for example the three channels of an object in HSV colour format. Each array element represents a histogram bin and contains the number of measurements of the object with a quantised value equal to the bin's identifier. Histograms can be used to compare objects as follows:

$$D_{L1}(H, K) = \sum_i |h_i - k_i| \tag{4.49}$$

or

$$D(H, K) = \sqrt{(h - k)^T A (h - k)} \tag{4.51}$$

These methods suffer from a number of severe problems. The metric D_L1 sometimes gives too small a difference when there is no exact correspondence between histogram bins, that is, when the bins of one histogram are slightly shifted. D_L1 can also inaccurately indicate a significant difference due to cumulative error properties. Rubner et al.[39] present a robust metric, called the Earth Mover's Distance, which significantly reduces the errors associated with matching histograms.
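A sketch of building and comparing colour histograms with the OpenCV C API (cvCalcHist and cvCompareHist are library calls; the 30-bin hue histogram and the use of histogram intersection are illustrative choices):

```cpp
#include <cv.h>

// Build a 30-bin hue histogram from a BGR image and return it.
CvHistogram* hueHistogram(IplImage* bgr)
{
    IplImage* hsv = cvCreateImage(cvGetSize(bgr), IPL_DEPTH_8U, 3);
    IplImage* hue = cvCreateImage(cvGetSize(bgr), IPL_DEPTH_8U, 1);
    cvCvtColor(bgr, hsv, CV_BGR2HSV);
    cvSplit(hsv, hue, 0, 0, 0);           // keep only the hue plane

    int bins = 30;
    float hueRange[] = { 0, 180 };        // OpenCV stores hue as 0..179
    float* ranges[] = { hueRange };
    CvHistogram* hist = cvCreateHist(1, &bins, CV_HIST_ARRAY, ranges, 1);
    cvCalcHist(&hue, hist, 0, 0);

    cvReleaseImage(&hsv);
    cvReleaseImage(&hue);
    return hist;
}

// Compare two hue histograms; intersection is large for similar objects.
double compareHueHists(CvHistogram* a, CvHistogram* b)
{
    cvNormalizeHist(a, 1.0);
    cvNormalizeHist(b, 1.0);
    return cvCompareHist(a, b, CV_COMP_INTERSECT);
}
```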

4.6.1 Colour based Back projection

Back projection is a technique whereby an image can be selectively masked according to a given colour distribution. Given an image in the HSV colour format and a one-dimensional histogram representing the hue distribution of the colour mask, it is possible to black out all pixels whose hue value does not occur above a threshold in the mask histogram. This has the effect of leaving visible only those pixels with colours similar to the distribution described in the histogram.
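Continuing the sketch above, the back projection itself is a single call to cvCalcBackProject; the threshold of 50 is an assumed illustrative value:

```cpp
#include <cv.h>

// Mask an image so that only pixels matching a hue distribution survive.
// 'hist' is a 1-D hue histogram, e.g. built by hueHistogram() above.
void backProjectMask(IplImage* bgr, CvHistogram* hist,
                     IplImage* mask /* 8U, 1 channel */)
{
    IplImage* hsv = cvCreateImage(cvGetSize(bgr), IPL_DEPTH_8U, 3);
    IplImage* hue = cvCreateImage(cvGetSize(bgr), IPL_DEPTH_8U, 1);
    cvCvtColor(bgr, hsv, CV_BGR2HSV);
    cvSplit(hsv, hue, 0, 0, 0);

    // Each mask pixel receives the histogram bin value of its hue.
    cvCalcBackProject(&hue, mask, hist);

    // Black out pixels whose hue is rare in the model distribution.
    cvThreshold(mask, mask, 50, 255, CV_THRESH_BINARY);

    cvReleaseImage(&hsv);
    cvReleaseImage(&hue);
}
```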


Chapter 5

System Development

The primary purpose of this project is to develop a simple yet robust vehicle tracking system. The basic design of this system is discussed in section 5.2. Section 5.4 discusses the implementation in much greater detail, looking at each of the five major layers in the system individually. Section 5.4.1 looks at the video acquisition system. The background removal sub-system is discussed in section 5.4.2; background subtraction is essentially the removal of sections of the source video that remain constant across frames and are therefore deduced to be non-moving. Section 5.4.3 then discusses our approach to vehicle detection, that is, the detection of vehicles in the source video after background subtraction. Vehicle recognition is discussed in section 5.4.4, and vehicle tracking in section 5.4.5.

5.1 Requirements

The following section covers the four primary requirements of our system: that it be inexpensive, computationally efficient, compatible with the OpenCV software development kit, and modular.

5.1.1 Low Cost

A large proportion of vehicle tracking and traffic monitoring systems[24, 38, 30] rely on custom-built hardware and elevated platforms. The cost associated with such projects can often run into hundreds of thousands of dollars[24]. We aim to produce a system capable of being placed beside any two-lane road, with little or no setup cost involved. To maintain this low overall cost, only off-the-shelf components have been used in the development of this project. Expenditure has been kept to a minimum by using a standard laptop computer and a single digital video camera. Section 5.3 covers the exact specifications of the required hardware in more detail.

In addition to monetary cost, a significant amount of effort has been applied to avoiding unnecessary software development time. The OpenCV library discussed previously in section 4.1 was selected as a base for system development.

5.1.2 Computational Efficiency

The system should be kept as simple as possible without affecting the performance or quality of the tracking system. This means that we favour algorithms that require little or no manual calibration. Many similar tracking systems are capable of producing impressive results, but require extensive manual set up. This setup often includes choosing thresholds based on "trial and error". This affects not only the initial deployment cost of any such system; if these thresholds are incorrectly set, it also significantly affects the performance of the system.


5.1.3 Software Compatibility

The tracking framework is designed to be compatible with the OpenCV software development kit. This means it must be developed exclusively in C++ and be compatible with both the Windows and Linux versions of the OpenCV library.

5.1.4 Modularity

It is an important priority that the framework be distinctly segmented. Each segment, or 'module', should be capable of operating independently of those around it. Each module should have a predefined input data format and a predefined output data format, shown in figure 5.2. This modular approach to the development of our tracking system enables us to 'snap together' different algorithms for each layer without affecting subsequent sections of our system.

5.2 System Design

This section discusses the design of our tracking framework and looks at how the aforementioned system requirements have been realised in the software solution. The main driving force behind the design of the framework is that it should simplify further system development. The requirement that the system be modular is pivotal to supporting easy development: by having a modular design we minimise the amount of effort required to test new solutions for each module. There are five primary modules in the system; Video Acquisition, Motion Segmentation, Vehicle Detection, Vehicle Recognition and Vehicle Tracking.

Figure 5.1: Overview of System Modules

Each module can be developed completely separately from all other modules in the tracking system, and modules can be easily interchanged.

Each of these modules has distinct responsibilities, which are discussed in the following sections.

5.2.1 Video Acquisition

This module is responsible for retrieving each of the video frames from the camera. The input data format for this module can vary, as a number of video cameras such as the ADS USB 2.0 or the Sony Handycam produce full resolution uncompressed video, while other cameras produce compressed video.


Figure 5.2: Output data format

Cameras producing compressed video include the Sony Eye Toy and the Logitech Web Cam. The output format is a stream of IplImages, the generic image format supported by OpenCV. This means that modules handling interaction with compression-enabled cameras must decompress the video before producing the image stream.

5.2.2 Motion Segmentation

This module is responsible for motion segmentation: maintaining the background model, updating this model, and producing a 1-bit motion map. This motion map, represented as a black and white IplImage, marks pixels in motion as one and stationary pixels as zero.

5.2.3 Vehicle detection

This module is responsible for using the motion map to detect groups of pixels with similar motion patterns that could represent vehicles. It is responsible for thresholding the motion map (for example, removing objects which are too small to be of interest) and grouping areas of similar motion into objects. This object map, represented by the OpenCV data structure CvSeq, is then passed to the next module.


5.2.4 Vehicle recognition

The vehicle recognition module is responsible for utilising the output of the previous modules to identify whether the objects detected in the scene are vehicles, and if they are, whether the vehicle has been seen before. In the event that a moving object is recognised as a vehicle, but not recognised as a vehicle previously seen, this module is responsible for generating a model of the vehicle that can be used for recognition in the future. This model generation can use input from a number of modules, including the video images produced by the Video Acquisition module, the motion map produced by the Motion Segmentation module and the object map produced by the Vehicle Detection module. This reliance on a large number of different modules makes it the most complex module in the tracking framework. The vehicle recognition module is then responsible for passing the vehicle identifier to the vehicle tracking module, along with the vehicle's current position in the source video.

5.2.5 Vehicle tracking

This is the final module in the tracking framework and is responsible for maintaining information on the position of vehicles in the source video. This module processes the position information produced by the vehicle recognition module and calculates the path of each vehicle through the source video.

5.3 Apparatus

The system must be capable of running on a desktop computer. The definition of what represents a 'standard desktop computer' can differ depending on the context in which the term is used. For this reason the following section outlines the exact specifications of the hardware used to complete this project.

5.3.1 Cameras

The first limiting factor of any optical tracking system is the video camera. Digital video cameras vary in three major areas: price, resolution and frame rate. Price refers to how much the camera costs, resolution to the level of detail of the image (pixels per frame) produced by the camera, and frame rate to how fast (frames per second) the camera can take pictures. Higher resolutions and frame rates generally result in higher costs. Figure 5.3 shows an overview of the cameras that were evaluated for use with this project.

Figure 5.3: Comparing Digital video Cameras


As video resolution increases, each frame tells us more about the real world. This increased information load means it takes longer to capture each frame, and thus the frame rate decreases. The lower the frame rate of a video source, the less information can be obtained about how objects are moving in a scene. This results in the need for a delicate balance between resolution and frame rate.

5.3.2 Processor

Once a video frame reaches the application level of a tracking system, the application must process the image and track the target objects. The time required to process an image is directly related to two factors: the resolution of the image and the speed of the computer doing the processing. Figure 5.4 details the exact specifications of the computer on which all experiments have been run.

Component              Details
CPU                    Intel Pentium IV 2.4 GHz
RAM                    512 MB DDR333
Hard drive             80 GB HDD, 7200 rpm
Graphics card          NVIDIA FX 4600, 128 MB
Front side bus speed   533 MHz
Motherboard            ASUS P4B533-VM

Figure 5.4: Computer platform specifications

While the speed at which the computer can process each frame directly affects the frame rate of the tracking application, the frame rate of the application can never exceed the frame rate of the camera supplying the video. As the frame rate of the camera increases, the time available to process each frame decreases: at 30 fps, for example, the application has at most about 33 ms per frame. Figure 5.5 shows the relationship between the frame rate of the camera and the time available for processing each frame.

Figure 5.5: Frame rate of camera vs. time for processing each frame


5.4 Method

The following section discusses the vehicle tracking system in detail. Results and method for each module are presented together. The reason for this structure is that the results of one module affect the design and results of subsequent modules. Without an understanding of the output of previous modules it would be impossible to fully understand the modules higher up in the stack.

5.4.1 Video Acquisition

A number of different cameras were experimented with in the development of the video acquisition module. Initially the USB 1.0 DSE XH5096 web cam was used for video capture. This camera is designed for desktop video conferencing and is the cheapest of the tested cameras. Image quality is reasonable, as shown in Figure 5.6; however, the camera has on-chip white balancing that cannot be turned off. White balancing cameras automatically adjust the relative levels of colour in the video so that the average colour across the image is white. Computer Vision applications are in general not compatible with white balancing cameras, as objects in the scene can rapidly alter their apparent colours. This rapid alteration in object colour renders a large number of tracking algorithms completely useless.

Figure 5.6: DSE XH5096 Sample output

The next low cost USB 1.0 camera tested was the Sony Eyetoy webcam. The Eyetoy achieves a relatively high frame rate at 640x480 resolution by using an on-board image compression chip. This escapes the raw-video bandwidth limit of standard uncompressed video across USB 1.0 and achieves approximately 25 fps. There is very little driver support on Windows, although the camera worked well once installed. However, because of the on-board JPG compression it was necessary to adjust the OpenCV library's video capture routines to include JPG decompression and conversion to the IplImage format. This significantly slowed down the frame rate of the camera, resulting in less than 14 fps when running on the desktop computer discussed in section 5.3.

The third family of cameras tested were the ADS Turbo Pro USB 2.0 and FireWire cameras. Initially the FireWire camera was tested. However, because it required an additional FireWire controller card, and because of the orange tinge of its video shown in Figure 5.7, it was swapped for the USB 2.0 version of the camera. As shown in Figure 5.7, this produced much cleaner, clearer images. Both cameras had adjustable colour thresholds and it was possible to turn white balancing off. The frame rate of both cameras was similar, at approximately 30 fps at 800x600 resolution in 32-bit RGB.

While both of the ADS cameras performed well indoors, both suffered from near total sun blindness when deployed outside. Sun blindness occurs when the camera cannot cope with the increased amount of light available outside; images appear completely washed out, as shown in Figure 5.8. Exposure is generally used to control the level of light allowed into the lens. However, on both ADS Turbo models the minimum exposure setting was not enough to correct the problem. This leaves the video completely useless for Computer Vision applications, as no objects are visible.

Figure 5.7: Scenes shot using the ADS Turbo FireWire and USB 2.0 cameras

Figure 5.8: Incorrect exposure setting resulting in washed out image

The final digital video camera tested was the more expensive Sony DCRGC1000 MiniDV Digital Handycam. Unlike the web cams discussed previously in this section, this camera is designed for use outdoors. The automatic exposure control meant that the image was no longer washed out, and in addition the white balance is configurable. The camera has a FireWire interface, so a full 50 fps at 800x600 was possible. The actual configuration used during system development was an uncompressed 640x480 stream at 30 fps, as a higher frame rate or resolution meant subsequent modules would not operate in real time.

5.4.2 Motion Segmentation

The motion segmentation module has two distinct variants: motion segmentation using a statistical colour model, and optical flow based motion segmentation. Both methods of background subtraction are discussed in the following sections:

Statistical colour model based motion segmentation

This background subtraction method is discussed in section 4.2. After initial trials it was found that the algorithm required severe thresholding, resulting in Equation 4.2.1 being modified as shown below:

|mean(x,y) − scene(x,y)| > threshold ∗ σ(x,y)

This new thresholded approach to background subtraction works very well indoors, as shown in Figure 5.9.

Figure 5.9: Simple RGB-based background subtraction indoors
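To make this per-pixel test concrete, the following minimal sketch applies the thresholded model to one RGB frame. The PixelStats layout, the segmentMotion name and the choice to flag a pixel when any single channel exceeds its threshold are illustrative assumptions, not the exact implementation used in our system.

    // Sketch: thresholded statistical background test (one mean/sigma per
    // pixel per RGB channel). Flags a pixel as moving when any channel
    // deviates from its mean by more than threshold * sigma.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct PixelStats { float mean[3]; float sigma[3]; };

    std::vector<std::uint8_t> segmentMotion(const std::vector<PixelStats>& bg,
                                            const std::vector<std::uint8_t>& rgb, // 3 bytes per pixel
                                            float threshold)
    {
        std::vector<std::uint8_t> motion(bg.size(), 0); // 1 = moving, 0 = stationary
        for (std::size_t p = 0; p < bg.size(); ++p)
            for (int c = 0; c < 3; ++c)
                if (std::fabs(bg[p].mean[c] - rgb[3 * p + c]) > threshold * bg[p].sigma[c]) {
                    motion[p] = 1;
                    break;
                }
        return motion;
    }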

Indoors, light is very ambient in nature: there are many sources, and shadows are rarely visible. Outdoors there is only one predominant source of light, the sun, and as such shadows occur much more frequently. The standard background subtraction model is incapable of dealing with these shadows, which are classified as objects in motion. This problem is fully illustrated in Figure 5.10.

Figure 5.10: Simple RGB-based background subtraction with shadows

To solve the problem of shadows being classified as objects in motion it was necessary to add a second-pass filter to the background subtraction algorithm. This filter is based on the HSV colour model, and on the observation that shadows generally represent dull areas in the image. In terms of the HSV colour model, shadows have low intensity and saturation components. The filter can therefore classify a pixel as being motionless if both its intensity component and saturation component are below certain thresholds. The modified equation is shown below:


(|mean(x,y) − scene(x,y)| > threshold ∗ σ(x,y)) ∩ ((I_scene(x,y) > intensity_thresh) ∪ (S_scene(x,y) > saturation_thresh))

where I_scene(x,y) and S_scene(x,y) denote the intensity and saturation of the scene pixel.

The contrast between the background subtracted image before shadow removal and the image after shadow removal is shown in Figure 5.11.

Figure 5.11: Comparing results between shadow removal and no shadow removal
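The second-pass filter can be sketched as follows, reusing the motion map from the previous sketch. The minimal RGB-to-HSV conversion (only saturation and value are needed) and the removeShadows name are assumptions; the thresholds correspond to the manually set saturation and intensity bounds described above.

    // Sketch: second-pass HSV shadow filter. A pixel already flagged as
    // moving is reclassified as stationary when both its saturation and
    // its intensity (HSV value) fall below the manual thresholds.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    void removeShadows(std::vector<std::uint8_t>& motion,
                       const std::vector<std::uint8_t>& rgb, // 3 bytes per pixel
                       std::uint8_t saturationThresh,
                       std::uint8_t intensityThresh)
    {
        for (std::size_t p = 0; p < motion.size(); ++p) {
            if (!motion[p]) continue;
            std::uint8_t r = rgb[3 * p], g = rgb[3 * p + 1], b = rgb[3 * p + 2];
            std::uint8_t v = std::max({r, g, b});        // intensity (value)
            std::uint8_t s = (v == 0) ? 0                // saturation in [0, 255]
                : static_cast<std::uint8_t>(255 * (v - std::min({r, g, b})) / v);
            if (s < saturationThresh && v < intensityThresh)
                motion[p] = 0;                           // dull pixel: treat as shadow
        }
    }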

Optical flow based motion segmentation

This method of background subtraction is based on the algorithms for calculating optical flow discussed in section 4.3. Background subtraction using optical flow is achieved by thresholding the velocity vector field produced by the optical flow algorithm. The resulting background subtraction algorithm defines a pixel as stationary if its associated velocity vector has a magnitude less than the threshold.
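As a sketch, this thresholding step reduces to a per-pixel magnitude test on the flow field; the Flow structure and the thresholdFlow name are illustrative.

    // Sketch: motion map from a velocity vector field. A pixel is
    // stationary when its flow vector magnitude is below minSpeed.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Flow { float vx, vy; };

    std::vector<std::uint8_t> thresholdFlow(const std::vector<Flow>& field, float minSpeed)
    {
        std::vector<std::uint8_t> motion(field.size(), 0);
        for (std::size_t p = 0; p < field.size(); ++p)
            if (std::hypot(field[p].vx, field[p].vy) > minSpeed)
                motion[p] = 1;
        return motion;
    }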

Optical flow is sensitive to changes in lighting and the algorithm is computationally expensive. The initial implementation that we developed proved much too slow to achieve anywhere near real-time background subtraction. The OpenCV implementation, which is optimised for use on Intel-based machines, is a lot faster; however, when used outside, the motion segmentation was not clean enough to be used in the final tracking application.

5.4.3 Vehicle Detection

The vehicle detection module is responsible for processing the background subtracted image. This module can be further subdivided into the following steps: object identification, object thresholding, shape approximation and shape recognition.

Object identification

This phase of the vehicle detection module processes the motion map produced by the motion segmentation module. This motion map is passed to the vehicle detection module as a black and white image: white areas represent areas in motion and black areas represent stationary or background areas. Object identification is performed using a connected components algorithm. The algorithm used is the 4-connected algorithm, which groups pixels only if they are direct orthogonal (horizontal or vertical) neighbours, as sketched below.
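The report does not reproduce the labelling code, so the following breadth-first flood fill is a minimal sketch of a 4-connected pass; the labelComponents name and the integer label map are illustrative.

    // Sketch: 4-connected component labelling by flood fill. Returns a
    // per-pixel label map (0 = background, 1..N = object id).
    #include <cstdint>
    #include <queue>
    #include <utility>
    #include <vector>

    std::vector<int> labelComponents(const std::vector<std::uint8_t>& motion,
                                     int width, int height)
    {
        std::vector<int> label(motion.size(), 0);
        const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1}; // orthogonal neighbours only
        int next = 0;
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x) {
                if (!motion[y * width + x] || label[y * width + x]) continue;
                label[y * width + x] = ++next;           // start a new component
                std::queue<std::pair<int, int>> q;
                q.push({x, y});
                while (!q.empty()) {
                    auto [cx, cy] = q.front();
                    q.pop();
                    for (int d = 0; d < 4; ++d) {
                        int nx = cx + dx[d], ny = cy + dy[d];
                        if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                        int np = ny * width + nx;
                        if (motion[np] && !label[np]) { label[np] = next; q.push({nx, ny}); }
                    }
                }
            }
        return label;
    }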


Object thresholding

Object thresholding is the removal of groups of pixels, generated by the object identification algorithm, that are too small to represent the objects of interest. These small groups of pixels often represent errors in the background segmentation or movement in the background area of the video. This is a manually thresholded approach, so the user of the system must set the minimum number of pixels, or size, at which objects should be passed to the shape approximation algorithms. Figure 5.12 shows an example of the size threshold set at different levels.

Figure 5.12: The effect of size based object thresholding on the object motion map

In addition to the removal of small objects, objects which are only partially visible are also removed. This means any contour that includes pixels at the edge of the image is discarded, because such a contour does not accurately represent the shape of the vehicle and would cause problems later in the tracking stages of the system. A sketch of both filters follows.
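This is a minimal sketch of both filters, operating on the label map from the previous sketch; the thresholdObjects name is illustrative, and minPixels corresponds to the manually set size threshold.

    // Sketch: drop components smaller than minPixels or touching the
    // image border (partially visible objects), reclassifying them as
    // background.
    #include <algorithm>
    #include <set>
    #include <vector>

    void thresholdObjects(std::vector<int>& label, int width, int height, int minPixels)
    {
        int maxLabel = 0;
        for (int l : label) maxLabel = std::max(maxLabel, l);
        std::vector<int> area(maxLabel + 1, 0);
        std::set<int> touchesBorder;
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x) {
                int l = label[y * width + x];
                if (!l) continue;
                ++area[l];
                if (x == 0 || y == 0 || x == width - 1 || y == height - 1)
                    touchesBorder.insert(l);
            }
        for (int& l : label)
            if (l && (area[l] < minPixels || touchesBorder.count(l) > 0))
                l = 0; // reclassify as background
    }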

Shape approximation

In order to further analyse the moving objects identified in the video stream, the next step is to generate approximations of each object's shape. Given a set of points denoted by the function O(x, y) and an image I(x, y) in which the object exists, it is possible to remove all points which do not lie on the border of the object. B(x, y) denotes the points that belong to the border of the object O, which means that B is a subset of O. The points generated by the function B can then be passed to a polygonal approximation method. These algorithms, also known as contour creation algorithms, are discussed in section 4.4. They return a further subset of the points in B that represents an approximation of the shape of the object O. During initial development the Rosenfeld-Johnston algorithm [27] was used to produce the vehicle contours. This algorithm requires manual thresholding of the parameter m (see Section 4.4 for more information), and often produces excessively rough shape estimations. The variations in polygonal approximations generated by this algorithm are shown in Figure 5.13. The Teh-Chin polygonal approximation algorithm [5] was then tested. This algorithm does not require manual thresholding and as such produces a much more reliable shape approximation. The results of this polygonal approximation algorithm are also shown in Figure 5.13. A sketch of the border extraction step is shown below.
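Under the definitions above, a pixel of O belongs to B when at least one of its 4-neighbours lies outside the object. The extractBorder name and the point-list output in this sketch are illustrative; the resulting points are what the polygonal approximation algorithms consume.

    // Sketch: extract the border point set B of object O from a label
    // map. A pixel is a border pixel when a 4-neighbour is not in O.
    #include <utility>
    #include <vector>

    std::vector<std::pair<int, int>> extractBorder(const std::vector<int>& label,
                                                   int width, int height, int objectId)
    {
        auto inObject = [&](int x, int y) {
            return x >= 0 && y >= 0 && x < width && y < height &&
                   label[y * width + x] == objectId;
        };
        std::vector<std::pair<int, int>> border;
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                if (inObject(x, y) &&
                    (!inObject(x + 1, y) || !inObject(x - 1, y) ||
                     !inObject(x, y + 1) || !inObject(x, y - 1)))
                    border.push_back({x, y});
        return border;
    }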


Figure 5.13: Comparison of polygonal approximation methods[27][5]

5.4.4 Vehicle Recognition

Once the contours representing the objects in the source video have been created, the next step is to evaluate the optical properties of each object. This evaluation is accomplished in two phases: an initial shape based thresholding and a secondary vehicle recognition based on shape and colour.

Shape based thresholding

For each vehicle the contour is drawn as a binary silhouette. The set of geometric moments (see section 4.5) is then calculated. These geometric moments are used to obtain information about the object:

• m00, the zero order geometric moment, is used to calculate the area of the object in the source video.

• m01 and m10, the first order moments, are used to calculate the centroid, or optical center of mass, of the object in the video frame.

These geometric moments are then used to calculate the central geometric moments of the object (as described in Section 4.5). The centralised geometric moments are in turn used to generate a set of seven invariant feature descriptors that can be used for shape similarity evaluation. A library of three very simple contours is maintained in memory and the object's contour is compared to these to evaluate the likely vehicle type. This set of basic vehicle types is shown in Figure 5.14, and a sketch of the moment calculations follows the figure.

Figure 5.14: Simple vehicle types (van, wagon and sedan)
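The moment calculations can be sketched as below for a binary silhouette; only the first Hu invariant (phi1 = eta20 + eta02) is shown, and the Moments structure is an illustrative container, not the representation used in our database. The sketch assumes a non-empty silhouette.

    // Sketch: area (m00), centroid (m10/m00, m01/m00) and the first Hu
    // invariant from normalised central moments of a binary silhouette.
    #include <cstdint>
    #include <vector>

    struct Moments { double area, cx, cy, phi1; };

    Moments silhouetteMoments(const std::vector<std::uint8_t>& mask, int width, int height)
    {
        double m00 = 0, m10 = 0, m01 = 0;
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                if (mask[y * width + x]) { m00 += 1; m10 += x; m01 += y; }
        double cx = m10 / m00, cy = m01 / m00;     // centroid
        double mu20 = 0, mu02 = 0;                 // second order central moments
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                if (mask[y * width + x]) {
                    mu20 += (x - cx) * (x - cx);
                    mu02 += (y - cy) * (y - cy);
                }
        double norm = m00 * m00;                   // eta_pq = mu_pq / m00^((p+q)/2 + 1)
        return { m00, cx, cy, (mu20 + mu02) / norm };
    }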

This comparison produces a match metric which varies between zero and one. Each of the vehicle models is compared to the vehicle contour and the highest value is determined to be the most likely form of the vehicle being tracked. Once again, a threshold is applied: if the match metric of all vehicle types is less than the threshold value, the object is discarded. The reason for this thresholding is twofold. First, it removes large moving objects that do not represent vehicles from the video, for example trees swaying in the background or people walking on the pavement.


Second, it limits the number of comparisons needed to correctly identify a vehicle, by assigning each vehicle a class. This is best explained with an example: if the system begins to track a vehicle and this vehicle is recognised as being similar in shape to a van, then there is no need to check the database to see if the vehicle matches any of the vehicles in the sedan or wagon classes. This instantly reduces the number of comparisons needed and thus increases the efficiency of the tracking algorithm.

Shape and Colour based vehicle recognition

After the initial shape-based thresholding is complete, the secondary recognition phase begins. This stage introduces the use of colour-based tracking. Without this colour based approach it would be possible for two vehicles of similar model to pass the tracking system and be classified as the same vehicle, irrespective of the fact that one car was green and the other red. Colour-based tracking is achieved by using the pixels within the contour representing the vehicle being tracked to generate an RGB colour histogram. The vehicle's contour is then compared to the contour of each vehicle previously seen in the video stream. If the contour does not match within a given (manually set) threshold then a new vehicle entry is created in the database. This vehicle entry includes the vehicle's contour and the pre-calculated geometric moments, centralised moments, seven invariant feature descriptors and the colour histogram. If the vehicle's contour does match a contour in the vehicle database then the contour image of the vehicle is back projected using the colour histogram of the matching database vehicle, as detailed in section 4.6. If the number of pixels back projected within the vehicle's contour is higher than a manually set threshold then the match is deemed to be correct. The vehicle in the database then has its path updated with the new position of the vehicle's center of mass. A sketch of the histogram matching step is shown below.
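The following sketch shows the histogram construction and a back projection count. The 8x8x8 bin layout, the function names and the simple occupied-bin test are illustrative assumptions; section 4.6 describes the actual back projection method.

    // Sketch: coarse RGB histogram over the pixels inside a contour mask,
    // and the fraction of a candidate's pixels whose colour falls in an
    // occupied bin of a stored histogram.
    #include <array>
    #include <cstdint>
    #include <vector>

    using Hist = std::array<int, 8 * 8 * 8>; // 8 bins per channel

    static int binOf(std::uint8_t r, std::uint8_t g, std::uint8_t b)
    {
        return (r / 32) * 64 + (g / 32) * 8 + (b / 32);
    }

    Hist buildHistogram(const std::vector<std::uint8_t>& rgb,   // 3 bytes per pixel
                        const std::vector<std::uint8_t>& mask)  // 1 = inside contour
    {
        Hist h{};
        for (std::size_t p = 0; p < mask.size(); ++p)
            if (mask[p]) ++h[binOf(rgb[3 * p], rgb[3 * p + 1], rgb[3 * p + 2])];
        return h;
    }

    double backProjectScore(const Hist& stored,
                            const std::vector<std::uint8_t>& rgb,
                            const std::vector<std::uint8_t>& mask)
    {
        int inside = 0, total = 0;
        for (std::size_t p = 0; p < mask.size(); ++p)
            if (mask[p]) {
                ++total;
                if (stored[binOf(rgb[3 * p], rgb[3 * p + 1], rgb[3 * p + 2])] > 0)
                    ++inside;
            }
        return total ? static_cast<double>(inside) / total : 0.0;
    }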

5.4.5 Vehicle Tracking

This module is primarily responsible for rendering the path of a vehicle on the screen. When a vehicle enters the video stream, a new entry is created in the database by the vehicle recognition module. The position of the center of mass is then rendered by the tracking module on an image. As the vehicle moves through the video frames the path is updated and rendered to the screen. This module is also responsible for removing vehicle identifiers that have been stagnant for a set period of time. Examples of generated vehicle paths are shown in Figure 5.15.

Figure 5.15: Example vehicle path

Initially the period of time that a vehicle remains in the database was set by a manual threshold; however, a better approach was developed. By artificially generating an orange-biased colour histogram we can locate two road cones placed 20 meters apart on the road using back projection (see Figure 5.16).

This enables us to calculate how many pixels represent 20 meters in the real scene. As mentioned previously, as the vehicle moves through the scene the position of the vehicle's center of mass is stored. We can then calculate the relative velocity of the vehicle in the scene. This velocity can then be used to predict the length of time the vehicle will remain in the video stream. When this time decreases below a certain threshold, the vehicle is removed from the database.

Figure 5.16: Example of road cone back projection
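The calibration and velocity calculation can be sketched as follows; the Point structure, function names and the km/h conversion are illustrative, while the 20 meter cone spacing is the one described above.

    // Sketch: pixels-to-metres scale from the two detected cones, and an
    // approximate speed from two successive centroid positions.
    #include <cmath>

    struct Point { double x, y; };

    double metresPerPixel(Point coneA, Point coneB)
    {
        return 20.0 / std::hypot(coneA.x - coneB.x, coneA.y - coneB.y);
    }

    // Speed in km/h for centroids observed one frame apart at 'fps'.
    double vehicleSpeedKmh(Point prev, Point curr, double scale, double fps)
    {
        double metresPerFrame = scale * std::hypot(curr.x - prev.x, curr.y - prev.y);
        return metresPerFrame * fps * 3.6; // m/s to km/h
    }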


Chapter 6

Summary of Results

This section gives an overview of the results obtained through the development of this research. Because of the young state of Computer Vision as a field of Computer Science, there is very little in the way of formalised methods for evaluating tracking solutions. Due to the modular nature of the system, a cyclic development model was used. This means both that each module was developed in order initially and that results from one module affected the design of subsequent modules. Chapter 5, System Development, presents the results of each module alongside an explanation of its design. This chapter summarises these results and looks at the system as a whole. It presents information obtained by manual analysis of a 30 second clip of sample video. This sample video begins with no objects visible; after 5 seconds have elapsed a single white wagon enters the video stream travelling at approximately 15 km/h. It takes approximately 7 seconds to pass through the video stream, at which point there are no vehicles visible. At 15 seconds a second vehicle enters the video stream and exits at approximately 22 seconds. For the last 8 seconds of the video there are no visible vehicles. This chapter is divided into two sections: performance related results and quality related results.

6.1 Performance Results

The primary measure of performance for a tracking application is the frame rate of the tracking solution. Figure 6.1 shows the frame rate of the tracking solution across the 30 second sample video. If the frame rate drops below that of the camera, video frames are simply discarded until the tracking modules are ready to accept the next frame.

Figure 6.1: Frame rate of tracking application across sample video

Looking at this graph, it becomes obvious that as the initial vehicle enters the video stream the frame rate dramatically decreases, and as the second vehicle enters the video stream the frame rate decreases again. This is because when there is no movement in the video stream, the layers above the background subtraction model do not operate. Once an object enters the video stream, a significant amount of processing time is required to correctly track the vehicle.

6.2 Quality Related Results

There are two quality related results that can be easily calculated from the 30 second example video clip: the accuracy of our background subtraction and the accuracy of our velocity calculations.

It is possible to compare the background subtraction algorithms by manually counting the number of pixels that make up the vehicle in the source video and then comparing them to the number of pixels contained within the vehicle's contour. The resulting statistics let us analyse the quality of our background subtraction model, thereby also analysing those modules built on top of the background subtraction module.

Figure 6.2: Comparing background subtraction methodologies

Finally, the last module in the tracking application is the vehicle tracking module. This module is responsible for making real world inferences about vehicle movement based on the information obtained from the modules beneath it. Figure 6.3 demonstrates the accuracy of the vehicle velocity calculations.

Figure 6.3: Velocity calculation error


Chapter 7

Conclusion

The result of this research is a modularised approach to object tracking. This enables easy comparison of algorithms and fast prototyping of computer vision applications. We have shown that at each stage in the development process a number of options can be rapidly prototyped and tested. In combination with the OpenCV SDK, which provides a large number of computer vision algorithms and a reliable framework for video capture, we have produced a simple yet reliable solution to vehicle tracking. We have improved the default background segmentation algorithm provided by OpenCV by implementing a second pass shadow removal algorithm. This shadow removal algorithm is based on the simple observation that shadows have the same hue as, but lower intensity than, the surrounding environment. For this reason the algorithm is based on the HSV colour model, in which the second and third components correspond directly to the saturation and intensity of the colour. Pixels are regarded as being in shadow if they fit inside the bounds of the initial RGB background model but have low intensity and saturation components when represented in the HSV colour model. Because the position of our camera is orthogonal to the road plane, we achieve accurate two dimensional silhouettes after background segmentation. This means that the process of vehicle recognition is more robust on our system than it is on systems where the camera is located in an elevated position. In addition to this efficient platform design, our vehicle tracking and recognition phases depend on two distinctly different algorithms: shape based recognition (reliant on Teh-Chin polygonal approximation and Hu invariants) and colour based recognition (using colour histograms and back projection). Both of these approaches have been used individually for object tracking; here they reinforce each other, providing the system with a more robust overall tracking approach.

There are two primary aspects to the research discussed in this report: the development of a tracking framework and the development of a system built on top of this framework. The design of the framework, the modular approach to vehicle tracking, has few limitations. The tracking system, however, has a much larger set of limitations. First, when the system is deployed it is necessary to capture a significant amount of un-occluded background video; the more complex the background scene, the higher the number of frames needed. This is because the number of frames captured corresponds directly to the quality of the background colour model used for background subtraction. Second, the system does not cope very well with occlusion. If two cars enter the scene at different ends of the video stream then the system is capable of differentiating between them. The problem occurs, however, if the cars are always slightly occluded: the system is unable to fit a vehicle model to the single optical feature representing two vehicles and thus discards the object. In addition to this weakness, tracking calculations such as vehicle velocity are based on the frame rate of the camera; should the frame rate of the camera change at any point, all velocity calculations would become inaccurate. Each of these errors, while small by itself, when combined with the others means that further development would be needed before any commercial deployment could be considered.

The modular design of our system means that it would be possible to extend it in the future


with other algorithms at any stage in the pipeline. Each of these modules could be improved in a number of different ways; a selection of the more important improvements is presented below:

• Video Acquisition

– Integration of stereo camera depth information to improve occlusion handling

• Motion Segmentation

– Adaptation of the background removal algorithm currently used to include the ability to reclassify objects as part of the background if they maintain a static location in the scene for a given period of time.

• Vehicle Detection

– Automation of minimum and maximum object size thresholds, possibly by calculating the average size of vehicles being tracked and removing all objects more than 2 standard deviations from this mean.

• Vehicle Recognition

– Use of more accurate two dimensional models for initial vehicle classification

– Evaluation of the Earth Mover's Distance comparison metric for colour histogram matching.

• Vehicle Tracking

– Integration of a particle filter into vehicle tracking.

Vehicle tracking solutions have a large range of possible applications. Perhaps the most important applications are security related. Given a system such as the one presented in this paper, it is possible to monitor the speed and position of a number of vehicles in a given area. This can be used to alert security personnel to suspicious vehicles, such as those parked too close to vital buildings or left stationary for a given period of time. This system could also be used to locate all vehicles of a certain colour or shape in any number of separate video streams, helping with the location of stolen vehicles. In addition to these security applications, a larger set of more mundane applications also exists. One example is an automated parking monitor for a car park. If a system such as the one presented in this paper was placed at all the entrances and exits to a complex such as Canterbury University, a tally of the total number of cars currently inside the grounds could be kept. If this number exceeded the total number of car parks available, signs at the entrances could indicate that it was unlikely that a park would be available for any new vehicles entering the complex.


Bibliography

[1] Forsyth and Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2003.

[2] D. Beymer, P. McLauchlan, B. Coifman, and J. Malik. A real-time computer vision system for measuring traffic parameters, 1997.

[3] Lisa M. Brown. View independent vehicle/person classification. In Proceedings of the 2nd ACM International Workshop on Video Surveillance & Sensor Networks, pages 114–123. ACM Press, 2004.

[4] T. Camus. Real time quantized optical flow. The Journal of Real-Time Imaging (Special issue on Real-Time Motion Analysis), 1997.

[5] C.H. Teh and R.T. Chin. Detection of dominant points on digital curves. IEEE, 1989.

[6] J. Connell et al. Detecting and tracking in the IBM people vision system. ICME, 2003.

[7] F.W. Smith and M.H. Wright. Automatic ship photo interpretation by the method of moments. IEEE Trans. on Computers, 20(9):1089–1095, 1971.

[8] G.D. Sullivan, K.D. Baker, A.D. Worrall, C.I. Attwood and P.R. Remagnino. Model-based vehicle detection and classification using orthographic approximations. IEEE, 1996.

[9] Gerald Kuhne, Stephan Richter and Markus Beier. Motion-based segmentation and contour-based classification of video objects. Praktische Informatik IV, University of Mannheim, L 14,16, 68131 Mannheim, Germany, 2001.

[10] Young-Kee Jung and Yo-Sung Ho. Traffic parameter extraction using video-based vehicle tracking. Intelligent Transportation Systems, 1999. Proceedings. 1999 IEEE/IEEJ/JSAI International Conference on, 1999.

[11] B.K.P. Horn and B.G. Schunck. Determining optical flow. AI Memo 572, Massachusetts Institute of Technology, 1980.

[12] B.K.P. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185–204, 1980.

[13] Jianguang Lou, Tieniu Tan and Weiming Hu. Visual vehicle tracking algorithm. Electronics Letters, 38(18):1024–1025, 2002.

[14] M. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 1962.

[15] Katja Nummiaro, Esther Koller-Meier and Luc Van Gool. Object tracking with an adaptive color-based particle filter. Katholieke Universiteit Leuven, ESAT/VISICS, Belgium, 2001.

[16] Katja Nummiaro, Esther Koller-Meier and Tomas Svoboda. Color-based object tracking in multi-camera environments. Katholieke Universiteit Leuven, ESAT/VISICS, Belgium, 2003.


[17] J.B. Kim, C.W. Lee, K.M. Lee, T.S. Yun and H.J. Kim. Wavelet-based vehicle tracking for automatic traffic surveillance. Electrical and Electronic Technology, TENCON. Proceedings of IEEE Region 10 International Conference on, 1(19-22):313–316, 2001.

[18] Dieter Koller, Joseph Weber, and Jitendra Malik. Robust multiple car tracking with occlusion reasoning. In ECCV (1), pages 189–196, 1994.

[19] H. Leuck and H.-H. Nagel. Model-based initialisation of vehicle tracking: dependency on illumination. Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, 2001.

[20] A. Lipton, H. Fujiyoshi, and R. Patil. Moving target classification and tracking from real-time video, 1998.

[21] John MacCormick and Andrew Blake. A probabilistic exclusion principle for tracking multiple objects. International Journal of Computer Vision, 39(1):57–71, 2000.

[22] H.-H. Nagel and M. Haag. Bias-corrected optical flow estimation for road vehicle tracking. Computer Vision, 1998. Sixth International Conference on, 1998.

[23] D. Noll, M. Werner and W. von Seelen. Real-time vehicle tracking and classification. Intelligent Vehicles '95 Symposium, Proceedings of the, pages 101–106, 1995.

[24] T. Okada, S. Tsujimichi and Y. Kosuge. Multisensor vehicle tracking method for intelligent highway system. SICE 2000. Proceedings of the 39th SICE Annual Conference, International Session Papers, pages 291–296, July 2000.

[25] R. Mukundan and K.R. Ramakrishnan. Moment Functions in Image Analysis: Theory and Applications. World Scientific, 1998.

[26] Zhimin Fan, Jie Zhou, Dashan Gao and Gang Rong. Robust contour extraction for moving vehicle tracking. Image Processing, 2002. Proceedings. 2002 International Conference on, 3(24-28):625–628, 2002.

[27] A. Rosenfeld and E. Johnston. Angle detection on digital curves. IEEE Trans. Computers, 22:875–878, 1973.

[28] Russell and Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995. Figure 24.9, pg. 737.

[29] S.A. Dudani. Aircraft identification by moment invariants. IEEE Trans. on Computers, 26(1):39–45, 1977.

[30] M. Schmid. An approach to model-based 3-D recognition of vehicles in real time by machine vision. Intelligent Robots and Systems '94: 'Advanced Robotic Systems and the Real World', IROS '94. Proceedings of the IEEE/RSJ/GI International Conference on, 1994.

[31] Steve Sengels. Explanation of Lagrangian multipliers. http://www.cs.toronto.edu/~sengels/tutorials/lagrangian.html.

[32] S.M. Smith. ASSET-2: real-time motion segmentation and shape tracking. Proc. Int. Conf. on Computer Vision 4, 1995.

[33] Ching-Po Lin, Jen-Chao Tai and Kai-Tai Song. Simultaneous tracking of pedestrians and vehicles by the spatio-temporal Markov random field model. Systems, Man and Cybernetics, 2003. IEEE International Conference on, 2003.

[34] Ching-Po Lin, Jen-Chao Tai and Kai-Tai Song. Traffic monitoring based on real-time image tracking. Robotics and Automation, 2003. Proceedings. ICRA '03. IEEE International Conference on, 2003.


[35] Chris Stauffer and W. Eric L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747–757, 2000.

[36] Weiming Hu, Xuejuan Xiao, Dan Xie and Tieniu Tan. Traffic accident prediction using vehicle tracking and trajectory analysis. Intelligent Transportation Systems, 2003. Proceedings. 2003 IEEE, Volume 1, 12-15 Oct., 2003.

[37] J. van Leuven, M.B. van Leeuwen and F.C.A. Groen. Real-time vehicle tracking in image sequences. Instrumentation and Measurement Technology Conference, 2001. IMTC 2001. Proceedings of the 18th IEEE, Vol. 3, 2001.

[38] Weiming Hu, Xuejuan Xiao, D. Xie, Tieniu Tan and S. Maybank. Traffic accident prediction using 3-D model-based vehicle tracking. Vehicular Technology, IEEE Transactions on, 2004.

[39] Y. Rubner, C. Tomasi and L.J. Guibas. The earth mover's distance as a metric for image retrieval. Technical Report STAN-CS-TN-98-86, 1998.