
ISSN 1054-6618, Pattern Recognition and Image Analysis, 2009, Vol. 19, No. 2, pp. 271–276. © Pleiades Publishing, Ltd., 2009.

Object Tracking in Video-Surveillance


D. Moroni and G. Pieri

Institute of Information Science and Technologies (ISTI), Italian National Research Council (CNR), Pisa, Italy; e-mail: {davide.moroni, gabriele.pieri}@isti.cnr.it

Abstract—This paper addresses the automatic object tracking problem in a video-surveillance task. A previously selected and identified target has to be retrieved in the scene under investigation when it is lost due to masking, occlusions, or quick and unexpected movements. A two-step procedure is used: first, motion detection determines a candidate target in the scene; second, semantic categorization and Content-Based Image Retrieval techniques establish whether or not the candidate target is the one that was lost. Content-Based Image Retrieval supports the search problem and is performed against a reference database populated a priori.

DOI: 10.1134/S1054661809020096

Received January 14, 2009


1. INTRODUCTION

Recognizing and tracking people in real-time video sequences is nowadays a very relevant and challenging task. There is a need for automatic tools to perform the identification and to follow the target under investigation. Most tools for automatic tracking, however, need to run under specific environments and/or constraints, such as complete visibility of the target or the impossibility of recovering lost targets in a scene; privacy issues may also play an important role in the efficiency of a tool.

In this paper we first describe the global task of active video-surveillance; the focus will then be on the subtask of automatically recovering a lost target during an automatic object tracking task.

Global task of active video-surveillance. The general problem of detecting and tracking a moving target in a video-surveillance application has been faced in [1], in particular processing infrared (IR) video. Images acquired from the video are processed to detect the target to be tracked in the current frame; subsequently, active tracking is performed through a Hierarchical Artificial Neural Network (HANN) for recognizing the actual target. If the target cannot be recognized and is lost for some specific reason (e.g., occlusion), the automatic target retrieval phase is performed to identify and localize the lost target; this search is performed on an a priori populated database and is based on the paradigm of Content-Based Image Retrieval (CBIR).

Active tracking in real-time video surveillance is performed through two subphases:

—target identification: to localize the target spatially within the current frame of the sequence;

—target recognition: to recognize whether or not the target is the correct one to follow.

¹ The article is published in the original.

As initialization to the tracking task, a motion detection algorithm can be used to detect a moving target in the scene; this can be done efficiently by exploiting the thermal characteristics of a human target in comparison to the general background (independently of illumination conditions). The target can be localized and extracted from the frame under analysis through a segmentation algorithm. Once the target has been identified, a set of features can be computed and used to assign the target to the semantic class it belongs to (i.e., for humans: upstanding person, crawling, crouched, …).

In order to perform target recognition, the set of computed features, including the computed semantic class, is sent as input to a HANN trained a priori off-line. If the network recognizes the target as belonging to the correct semantic class, active tracking is considered successful and is repeated on the following frame acquired from the video sequence. If a wrong object has been identified and is not recognized, the automatic search for the lost target object is started, based on the CBIR paradigm and supported by the reference database. The identification of a wrong object can be due to various reasons: e.g., masking, a partial occlusion of the target, or a quick and unpredictable movement in an unexpected direction.

Using CBIR allows information to be accessed at a perceptual level, i.e., using visual features automatically extracted with an appropriate similarity model [6].

The outcome of the automatic target search can be positive (i.e., the target object is tracked again) or negative (i.e., the target object is not yet identified or recognized). In the first case, active tracking will be performed again from the next frame, while in the second case the automatic search of the target must continue in the following frame. A stop condition for this search after a number of frames is set and will be defined in the following sections. When this stop condition is reached with no positive result, the final decision and the control of the system are given back to the user, with a warning about the lost target.

In Fig. 1 the general algorithm implementing the on-line approach for the global task of automatic tracking in video-surveillance is shown.

In order to use the neural network, which needs to be trained over known samples, and the database, which needs to be populated for the CBIR, both in real time, an off-line stage is performed to tune both components over the scene under surveillance. The database used for CBIR is organized into predefined semantic classes; for each class, separate sets of representative shapes are stored to take into account the many possible events to be recognized, such as partial masking of the target and different orientations.

In the following sections the feature selection for target characterization and the automatic target search using CBIR are described in detail; finally, some discussion and conclusions are drawn.

2. TARGET CHARACTERIZATION

The segmented target can then be characterized through a set of meaningful extracted features. The extracted features are grouped into four main categories, according to the intrinsic characteristics they relate to:

—morphology;

—geometry;

—intensity;

—semantics.

The features are extracted from the region enclosed by the target contour, which is defined by a sequence of N points (in our case, N = 16) with coordinates (x_s, y_s); these can be described as the target delimiting points.

Morphology. Extracting parameters from a shape, thus obtaining contour descriptors, provides a characterization of the morphology of the target. The shape can be obtained efficiently by computing frame differences during object segmentation. The frame difference is computed on a temporal window spanning 3 frames, in order to prevent inconsistencies and problems due to intersections of the shapes. Let Δ(j, j − 1) be the modulus of the difference between the actual frame F_j and the previous frame F_{j−1}. Otsu's thresholding is applied to Δ(j, j − 1) in order to obtain a binary image Bin(j, j − 1). Letting TS_j be the target shape in frame F_j, heuristically we have:

Bin(j, j − 1) = TS_j ∪ TS_{j−1}.

Thus, considering the actual frame F_j, the target shape is approximated by the formula:

TS_j = Bin(j, j − 1) ∩ Bin(j + 1, j).

Once the target shape is extracted, two steps are performed: first, edge detection is applied in order to obtain the shape contour; second, the normal is computed at selected points of the contour in order to get a better characterization of the target.

Two low-level morphological features are computed following examples reported in [2]: the normal orientation and the curvature. Considering the extracted contour, 64 equidistant points (s_p, t_p) are selected. Each point is characterized by the orientation Θ_p of its normal and by its curvature K_p. To define these local features, a local chart is used to represent the curve as the graph of a degree-2 polynomial. More precisely, assuming without loss of generality that in a neighborhood of (s_p, t_p) the abscissas are monotone, the fitting problem

t = as² + bs + c

is solved in the least-squares sense. Then we define:

Θ_p = arctan(−1 / (2as_p + b)),   (1)

K_p = 2a / (1 + (2as_p + b)²)^{3/2}.   (2)

More information can be obtained by discretizing the normal orientation into 16 different bins corresponding to equiangular directions.
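The two steps above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes grayscale frames as 8-bit NumPy arrays, uses a simple hand-rolled Otsu threshold, and evaluates Eqs. (1) and (2) from a least-squares quadratic fit of the contour neighborhood.

```python
import numpy as np

def otsu_threshold(img):
    """Simple Otsu threshold on an 8-bit image (illustrative version)."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    total = img.size
    csum = np.cumsum(hist)
    cmean = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = csum[t - 1], total - csum[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cmean[t - 1] / w0
        m1 = (cmean[-1] - cmean[t - 1]) / w1
        var = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binary_diff(f_a, f_b):
    """Bin(b, a): Otsu-thresholded modulus of the frame difference."""
    d = np.abs(f_b.astype(int) - f_a.astype(int)).astype(np.uint8)
    return d > otsu_threshold(d)

def target_shape(f_prev, f_cur, f_next):
    """TS_j approximated as Bin(j, j-1) AND Bin(j+1, j), 3-frame window."""
    return binary_diff(f_prev, f_cur) & binary_diff(f_cur, f_next)

def normal_features(s, t, p):
    """Orientation Theta_p (Eq. 1) and curvature K_p (Eq. 2) at contour
    point index p, from a least-squares fit t = a s^2 + b s + c."""
    a, b, _c = np.polyfit(s, t, 2)
    d = 2 * a * s[p] + b                 # tangent slope at s_p
    theta = np.arctan(-1.0 / d)          # normal orientation
    k = 2 * a / (1 + d ** 2) ** 1.5      # curvature
    return theta, k
```

The window sizes and dtype handling are assumptions; in the paper the segmentation operates on IR radiation values rather than generic grayscale.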

Fig. 1. The general algorithm for the tracking in the video-surveillance task [1].


Such a histogram is invariant under scale transformations and thus independent of the distance of the target; hence it will be used for a more precise characterization of the semantic class of the target. This distribution represents an additional feature for the classification of the target: e.g., a standing person will have a far different normal distribution than a crouched one. A vector [v(Θ_p)] of the normals at all the points of the contour is defined, associated with a particular distribution of the histogram data.
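A sketch of this scale-invariant orientation histogram follows; normalizing by the number of contour points is an assumption (the paper only specifies 16 equiangular bins, not the normalization).

```python
import numpy as np

def normal_histogram(thetas, n_bins=16):
    """Discretize normal orientations into 16 equiangular bins.

    Dividing by the number of points makes the histogram independent of
    the number and spacing of contour samples, hence of target scale."""
    wrapped = np.mod(thetas, 2 * np.pi)      # map angles into [0, 2*pi)
    hist, _ = np.histogram(wrapped, bins=n_bins, range=(0.0, 2 * np.pi))
    return hist / len(thetas)
```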

Geometry. Taking into account the shape and contour previously extracted, other features regarding the geometry of the segmented object can be computed. In particular, two basic features are computed, the area and the perimeter of the current shape:

Area = (1/2) Σ_{s=1}^{N} (x_s y_{s+1} − y_s x_{s+1}),

Perimeter = Σ_{s=1}^{N} sqrt((x_s − x_{s+1})² + (y_s − y_{s+1})²).

Intensity. In our case study, specific images were acquired, i.e., infrared images; we therefore chose to exploit their specific characteristic of carrying a thermal radiation value of the target at each point of the image, which also characterizes the intensity of the image, to improve the efficiency of the system. The more specific, IR-oriented features are described below.

Average intensity:

I = (1/Area) Σ_{t ∈ Target} F_j(t).

Standard deviation:

σ = sqrt( (1/(Area − 1)) Σ_{t ∈ Target} (F_j(t) − I)² ).

Skewness:

γ_1 = µ_3 / µ_2^{3/2}.

Kurtosis:

β_2 = µ_4 / µ_2².

Entropy:

E = −Σ_{t ∈ Target} F_j(t) log_2 F_j(t),

where F_j(t) represents the thermal radiation value acquired for frame j at the image point t, and µ_r are the moments of order r of F_j(t).
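The geometric and intensity features can be sketched as follows. This is a hedged illustration with two stated assumptions: the moments µ_r are taken as central moments, and the entropy normalizes the radiation values into a probability distribution so the logarithm is well defined (the paper applies log_2 to F_j(t) directly).

```python
import numpy as np

def geometry_features(xs, ys):
    """Shoelace area and polygon perimeter of the N-point contour
    (point N+1 wraps around to point 1)."""
    x2, y2 = np.roll(xs, -1), np.roll(ys, -1)
    area = abs(np.sum(xs * y2 - ys * x2)) / 2.0
    perimeter = float(np.sum(np.hypot(xs - x2, ys - y2)))
    return area, perimeter

def intensity_features(values, area):
    """Mean, standard deviation, skewness, kurtosis, and entropy of the
    thermal values F_j(t) over the target region."""
    i_mean = values.sum() / area
    sigma = np.sqrt(np.sum((values - i_mean) ** 2) / (area - 1))
    mu = lambda r: np.mean((values - values.mean()) ** r)  # central moments
    skew = mu(3) / mu(2) ** 1.5        # gamma_1 = mu_3 / mu_2^(3/2)
    kurt = mu(4) / mu(2) ** 2          # beta_2 = mu_4 / mu_2^2
    p = values / values.sum()          # normalized (assumption, see text)
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log2(p)))
    return i_mean, sigma, skew, kurt, entropy
```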

Semantics. In order to reduce DB access time, the DB itself is organized by exploiting the predefined semantic classes; this allows class-specific searches (i.e., on a reduced database set), which are more selective and efficient [5].

Each target, once identified through its extracted features, belongs to a specific semantic class (e.g., for a human: upstanding person, crouched, crawling, etc.). This class can be considered an additional feature, computed from the combination of the features defined above and assigned to the target from a predefined set of possible choices.

Another characterization of the tracked object regarding its semantics is what we call a Class-Change event; this occurs when, across successive frames (i.e., over time), the semantic class assigned to the target changes. The Class-Change is defined as a pair ⟨S_bef, S_aft⟩ associated with the target, representing the modification from the semantic class S_bef computed before to the semantic class S_aft computed in the actual frame. This event can be considered when more complex situations need to be evaluated in the context of video-surveillance. Since the semantic classes are defined a priori, a matrix of class-changes can be defined, giving a different specific weight to each class-change: e.g., a target recognized as a standing person before (S_bef) and computed to be crouched in the current frame (S_aft) could be given more attention (i.e., a higher weight) than a different class-change.
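The class-change matrix could be sketched as a small lookup table. The posture classes come from the paper, but the numeric weights below are purely hypothetical, chosen only to illustrate the idea of weighting transitions differently:

```python
# Semantic classes as in the paper; the weights are illustrative only.
CLASSES = ("upstanding", "crouched", "crawling")

# weight[(S_bef, S_aft)]: higher means the transition deserves more
# attention, e.g. a standing person suddenly found crouched.
CHANGE_WEIGHT = {
    ("upstanding", "crouched"): 0.9,
    ("upstanding", "crawling"): 1.0,
    ("crouched", "upstanding"): 0.4,
    ("crouched", "crawling"): 0.6,
    ("crawling", "upstanding"): 0.5,
    ("crawling", "crouched"): 0.3,
}

def class_change_weight(s_bef, s_aft):
    """Weight of the class-change event <S_bef, S_aft>."""
    if s_bef == s_aft:
        return 0.0                     # no class change occurred
    return CHANGE_WEIGHT[(s_bef, s_aft)]
```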

The most important aspect for understanding class-changes is represented by morphology, in particular wecan consider an index of the normal histogram distribu-tion.

All the extracted information is passed to the recognition phase, in order to assess whether or not the localized target is correct.

3. TARGET SEARCH USING CBIR

When the identified target has not been recognized (i.e., a wrong target recognition has occurred, e.g., because of masking, occlusion, or quick movements), the target search phase begins.

Systems based on CBIR generate low-level feature vectors representing both the query and the image to retrieve. In this system a semantic-based image retrieval is performed instead: a semantic concept is defined by means of a sufficient number of training examples containing that concept [5]. Once the semantic concept is defined, it is used to make a selective and more efficient access to the DB, and the resulting subset is used to perform typical CBIR on low-level features.

A preliminary filtering of the DB is obtained using the reference semantic class to access the information in a more efficient manner [7]. The features extracted from the identified target are compared to those recorded in the reference DB using a quick similarity function for each feature class [3]. In particular, we considered intensity-value matching (using percentages), shape matching (using the cross-correlation criterion), and the vector [v(Θ_p)] representing the distribution histogram of the normal. A weight is associated with each similarity value, so that a final single (and global) similarity measure can be obtained. Possible variations of the initial shape are recorded for each semantic class. In particular, the shapes used for the similarity comparison are retrieved from the database using a set obtained by joining the shape information stored at the time of the initial target selection with that of the last valid shape of the target object. If the final similarity value of the identified target is within a fixed tolerance threshold of at least one of the shapes in the obtained set, the target can be considered valid. Otherwise, the target search starts again in the next frame [4].
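The weighted combination and tolerance test might look as follows. The weights, the [0, 1] similarity convention, and the tolerance value are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def global_similarity(feature_sims, weights):
    """Weighted average of per-feature similarity values in [0, 1]
    (intensity match, shape cross-correlation, normal-histogram match)."""
    w = np.asarray(weights, dtype=float)
    s = np.asarray(feature_sims, dtype=float)
    return float(np.dot(w, s) / w.sum())

def validate_target(sims_per_db_shape, weights, tol=0.15):
    """Accept the candidate if its global similarity to at least one of
    the stored shapes is within `tol` of a perfect match (1.0)."""
    scores = [global_similarity(s, weights) for s in sims_per_db_shape]
    best = max(scores)
    return best >= 1.0 - tol, best
```

For instance, with weights (0.5, 0.3, 0.2), a candidate matching a stored shape with per-feature similarities (0.9, 0.95, 0.9) would be accepted.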

In Fig. 2 a sketch of the CBIR in the case of automatic target search is shown, under the assumption that the database was previously defined (i.e., off-line) and considering a comprehensive vector of features ⟨Ft_k⟩ for all the groups described above.

Thus, when a pattern of the DB found through CBIR has a similarity value higher than the prefixed threshold, the automatic target search has reached its goal: the target has been grabbed back, and automatic active tracking can continue in the next frame. If this does not happen, the automatic target search is performed again in the next frame, taking as a starting point not the last frame but, again, the last valid shape of the searched target (i.e., the last correct target).

A stop condition has to be set for this automatic search, namely when after m frames the correct target has not yet been retrieved. In such a condition the control is given back to the user with a warning about the lost-target situation. The value of m is computed when the target search starts, as the Euclidean distance between the centroid C of the last valid target object and the edge point E_r of the frame along the search direction r (the search direction vector is computed on the basis of the movements of the target over the last f = 5 frames), divided by the average speed of the target measured over the last f frames:

m = |C − E_r| / ( (1/f) Σ_{i=1}^{f} Leap_i ),   (3)

where Leap_i represents the distance covered by the target between frames i and i − 1.

4. CONCLUSIONS

The object recognition and tracking problem has been described in the frame of a general case study of video surveillance for the control of access to restricted areas. Particular focus has been given to the sub-task of recovering a target that is lost during real-time tracking because of events such as occlusion or unexpected quick movements. The specific camera used to acquire the videos was a thermocamera working in the 8–12 µm wavelength range. The camera was mounted on a robotic moving structure covering 360° pan and 90° tilt, and equipped with 12° and 24° optics. The spatial resolution of the video images to process was 320 × 240 pixels.

Fig. 2. Automatic target search with the support of the Content-Based Image Retrieval and driven by the semantic class feature.

Fig. 3. The original frame (top), shape extraction by frame difference (bottom). Left and right represent two different postures of a tracked person.
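The stop condition of Eq. (3) can be sketched as follows; rounding up to a whole number of frames is an assumption, since the paper does not specify how m is rounded.

```python
import numpy as np

def stop_frame_count(centroid, edge_point, leaps, f=5):
    """Number of frames m after which the automatic search gives up:
    distance from the last valid centroid C to the frame edge point E_r
    along the search direction, divided by the average leap of the
    target over the last f frames."""
    recent = leaps[-f:]
    avg_speed = sum(recent) / len(recent)
    dx, dy = (np.asarray(edge_point, float) - np.asarray(centroid, float))
    return int(np.ceil(np.hypot(dx, dy) / avg_speed))
```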

In our case study, the off-line stage to build the database for the CBIR was performed taking into account various image sequences relative to different classes of the scenario under surveillance. Regarding in particular the choice of semantic classes for the human class, three different postures were defined (i.e., upstanding, crouched, crawling), while also taking into account three possible people categories (short, medium, tall).

An estimation of the number of operations computed for each frame in real-time tracking yields the following values: about 5 × 10⁵ operations for the identification and characterization phases, and about 4 × 10³ operations for active tracking. This ensures the real-time functioning of the procedure on a personal computer of medium power, also because the heaviest operations are performed during the off-line stage. The automatic search process can require a higher number of operations, but it is performed when the target is partially occluded or lost due to some obstacles, so it is reasonable to spend more time finding it, thus losing some frames. Among the many variables on which the number of operations depends, the relative dimension of the target to be followed is an important one: bigger targets require a higher effort to be segmented and characterized, even if they prove not to be the correct ones.

In Fig. 3, example frames of the video sequences are shown (top) together with the corresponding shape extraction, while in Fig. 4 the same frames are processed and the shape contour with the distribution histogram of the normal is shown.

A methodology has been proposed for object tracking and target recovery, based on the Content-Based Image Retrieval paradigm, in a video-surveillance task. Target identification during active tracking has been performed using a Hierarchical Artificial Neural Network (HANN). For automatic searching, when the tracked object is lost or occluded and needs to be grabbed back, a CBIR database search has been used to retrieve and compare the currently extracted features with the previously stored reliable ones. In order to optimize the storing and search in the reference database, an approach based on semantic categorization of the information has been defined.

ACKNOWLEDGMENTS

This work was partially supported by the EU NoE MUSCLE (FP6-507752). We would like to thank Eng. M. Benvenuti, head of the R&D Dept. at TDGroup S.p.A., Pisa, for allowing the use of proprietary instrumentation for test purposes.

REFERENCES

1. D. Moroni and G. Pieri, "Active Video-Surveillance Based on Stereo and Infrared Imaging," EURASIP J. Appl. Signal Process. 2008, Article ID 380210, 8 pages (2007).

2. S. Berretti, A. Del Bimbo, and P. Pala, "Retrieval by Shape Similarity with Perceptual Distance and Effective Indexing," IEEE Trans. Multimedia 4 (2), 225 (2000).

3. P. Tzouveli, G. Andreou, G. Tsechpenakis, and Y. Avrithis, "Intelligent Visual Descriptor Extraction from Video Sequences," Lecture Notes in Computer Science (Adaptive Multimedia Retrieval) 3094, 132 (2004).

4. M. G. Di Bono, G. Pieri, and O. Salvetti, "Multimedia Target Tracking through Feature Detection and Database Retrieval," in Proc. 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 2005, pp. 19–22.

5. M. M. Rahman, P. Battarcharya, and B. C. Desai, "A Framework for Medical Image Retrieval Using Machine Learning and Statistical Similarity Matching Techniques with Relevance Feedback," IEEE Trans. Inform. Tech. Biomed. 11 (1), 58 (2007).

6. A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. Pattern Anal. Mach. Intell. 22 (12), 1349–1380 (2000).

7. Y. Chen, J. Z. Wang, and R. Krovetz, "CLUE: Cluster-Based Retrieval of Images by Unsupervised Learning," IEEE Trans. Image Process. 14 (8), 1187 (2005).

Fig. 4. Shape contour with normal vector on 64 points (top), distribution histogram of the normal (bottom). Left and right represent two different postures of a tracked person (same frames as in Fig. 3).


Davide Moroni (Magenta, 1977). M.Sc. in Mathematics (honours degree) from the University of Pisa in 2001, diploma at the Scuola Normale Superiore of Pisa in 2002, Ph.D. in Mathematics at the University of Rome 'La Sapienza' in 2006. He is a research fellow at the Institute of Information Science and Technologies of the Italian National Research Council, in Pisa. His main interests include geometric modelling, computational topology, image processing, and medical imaging. At present he is involved in a number of European research projects, working on discrete geometry and scene analysis.

Gabriele Pieri (Pescia, 1974). M.Sc. (2000) in Computer Science from the University of Pisa. In 2001 he joined the "Signals and Images" Laboratory at ISTI-CNR, Pisa, working in the field of image analysis. His main interests include neural networks, machine learning, industrial diagnostics, and medical imaging. He is the author of more than twenty papers.