Neurocomputing 83 (2012) 165–175

Multi-target tracking with occlusions via skeleton points assignment

Huan Ding*, Wensheng Zhang

Institute of Automation, Chinese Academy of Sciences, No. 95 Zhongguancun East Road, Beijing 100190, China

a r t i c l e i n f o

Article history:

Received 19 May 2011

Received in revised form

6 December 2011

Accepted 10 December 2011

Communicated by Tao Mei

Available online 29 December 2011

Keywords:

Skeleton points assignment

Occlusion segmentation

Occlusion compensation

Multi-target tracking

* Corresponding author. Tel.: +86 010 82614489; fax: +86 010 62545229.

E-mail addresses: [email protected] (H. Ding), [email protected] (W. Zhang).

Abstract

Multiple-target tracking in complex scenes is one of the most complicated problems in computer vision. Handling the occlusion between objects is the key issue in multiple-target tracking. This paper introduces the method of motion segmentation into the object tracking system, and presents a Skeleton Points Assignment (SPA) based occlusion segmentation approach to track multiple people through complex situations captured by static monocular cameras. In the proposed method, we first select the skeleton points and evaluate their occlusion states by low-level information such as optical flow; then we assign these points to different objects using higher-level semantic information, such as appearance, motion and color; finally, a dense classification of foreground pixels is exploited to accomplish occlusion segmentation, and a blob-based compensation strategy is utilized to estimate the missing information of occluded objects. Object tracking is handled by a particle filter-based tracking framework, in which a probabilistic appearance model is used to find the best particle. Experiments are conducted on the challenging public dataset PETS 2009. The results show that this approach improves the performance of an existing tracking approach and handles dynamic occlusions better.

Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.

1. Introduction

Despite a lot of attention being dedicated to tracking multiple people in video sequences over the last 20 years, the problem remains challenging in many computer vision and video processing tasks, such as video surveillance and event inference. In particular, the major challenge of multi-target tracking is the frequent presence of visual occlusions. Occlusions make the current observation totally or partially unavailable for some time intervals. The problem of dealing with occlusions correctly is still an open subject.

Solutions to managing occlusions can be decomposed into two groups, depending on whether a scene is captured by a single stationary camera or by multiple cameras. Multi-view methods fuse information from multiple views to localize people on multiple scene planes [1]. These methods work well, except for special geometries of the camera views and people locations. These special geometries provide insufficient information to generate a unique signal for tracking, due to visual occlusions. In these cases more specialized tracking methods are needed to maintain track identity. In addition, for some applications, multiple views are not always available.


Thus, this paper focuses on designing a monocular tracking methodology, with the goal of handling relatively complex occlusion scenarios.

A considerable amount of research has been reported on the treatment of occlusions from a stationary monocular camera during the last decade. Most of it considers occlusion segmentation as part of the object model, and embeds it into the tracking process. These works build object models using color [2], appearance [2–5] and motion [6–8] information. Unfortunately, these models are learned for describing the postures or actions of the tracking targets, and do not fit well with occlusion segmentation.

There are also many attempts to deal with the problem of occlusions explicitly. Hai et al. [9] decomposed video frames into coherent 2D motion layers and introduced a complete dynamic motion layer representation in which spatial and temporal constraints on shape, motion and appearance are estimated using the EM algorithm. This method has been applied in an airborne vehicle tracking system, with demonstrated examples of tracking vehicles through complex interactions. Qian et al. [10] proposed a framework for treating the general multiple target tracking problem, formulated in terms of finding the best spatial and temporal association of observations that maximizes the consistency of both motion and appearance of object trajectories. Papadourakis et al. [11] presented a robust object tracking algorithm which automatically builds appropriate object representations from color and handles spatially extended and temporally long object occlusions. The majority of the above methods rest on the simple assumption that object color follows either a single Gaussian or a Gaussian mixture distribution model.


Meanwhile, these approaches do not take the historical information in the scene into account. Thus all the methods mentioned above perform poorly at occlusion segmentation and multiple-target tracking when objects have similar colors.

Recently, some important issues in utilizing motion segmentation in tracking systems have been discussed. In the context of motion segmentation, the literature can be divided into two kinds: direct methods and feature-based methods [12]. Direct methods recover the unknown parameters directly from measurable image quantities at each pixel in the image. This is in contrast with the feature-based methods, which first extract a sparse set of distinct features from each image separately, and then recover and analyze their correspondences in order to determine motion. Feature-based methods minimize an error measure that is based on distances between a few corresponding features, while direct methods minimize a global error measure that is based on direct image information collected from all pixels in the image. It is important to observe that with direct methods the pixel correspondence/classification is performed directly with the measurable image quantities at each pixel, while in feature-based methods this is done indirectly, based on independent feature measurements in a set of sparse pixels. An important property of the direct methods is that they can successfully estimate global motion even in the presence of multiple motions and/or outliers [13]. However, computational time is wasted by including in the minimization a large number of pixels where no flow can be reliably estimated. On the other hand, feature-based methods initially ignore areas of low information, resulting in a problem with fewer parameters to be estimated and good convergence even for long sequences. Hence, considering tracking efficiency, we only focus on the feature-based methods.

Feature-based methods for motion segmentation usually consist of two independent stages: (1) feature selection and/or correspondence and (2) motion parameter estimation [13]. The second stage is often performed through factorization methods [14], although simpler clustering strategies can be used [15]. Several methods have been proposed for sparse feature selection and/or correspondence; among them, the most popular are the Harris corner detector [16,17] and SIFT [18]. However, these sparse feature-based methods compute feature correspondences independently. Thus, they are very sensitive to outliers, making them susceptible to errors in motion parameter estimation/segmentation. Moreover, homogeneous regions of a frame may present none or few features, which makes motion estimation/segmentation difficult (or even impossible) in large areas of the video frames. In the object tracking field, Papadakis et al. [19] utilized a graph cuts approach and separated each object into visible and occluded parts using an energy function which contains terms based on position and motion information. Silva et al. [12] obtained a pixel-wise segmentation by clustering a set of adaptively sampled points in the space and time domains. These methods still tend to emphasize the motion segmentation, which focuses on the low-level information of the pixels. They rarely use the high-level semantic features associated with the tracking target. This leads to low tracking performance and a high computational cost.

In this paper, we present a novel approach for multi-target tracking where the scene is captured by a stationary monocular camera. Based on skeleton points assignment (SPA), our approach combines the advantages of feature-based motion segmentation methods and the probabilistic appearance-based particle filter tracking framework, as described next. Initially, a set of sparse points are computed, which we call the skeleton points. Instead of computing point correspondences independently (as done in many feature-based methods), neighboring particles are treated as if they were linked, reducing the occurrence of outliers and

avoiding the aperture problem. Moreover, the density of skeleton points is adaptive, and denser distributions are used in regions where precision is more important, so we can save computation without neglecting homogeneous regions. To compute point correspondences in a video sequence, we use the approach proposed by Sand and Teller [20]. After the skeleton points are selected, we classify them into different types according to their presence in three successive frames. The classified points are then treated separately with respect to their types. Each point is assigned to an appropriate tracking target, where high-level information, such as appearance, motion and color distribution, is utilized. After that, a pixel-wise dense clustering strategy is applied, and every foreground pixel in the scene is clustered to the skeleton points. Then, the missing parts of occluded objects are estimated by a blob-based compensation approach. Finally, we obtain the tracking result in a probabilistic appearance-based particle filter tracking framework.

Compared with the existing methods, our contribution is as follows. Firstly, we define the matching strategy and the state transition matrix of the skeleton points, to decide their states during the tracking process. Secondly, we establish the skeleton points assignment model, based on high-level features such as appearance, color distribution and motion. Finally, we densely cluster the foreground with the skeleton points as kernels, which obtains the full segmentation of occlusions at a low computational cost. The proposed approach selects the skeleton points by low-level motion information, while it treats the assignment problem of skeleton points with high-level information. The proposed method potentially has the ability to efficiently generate a robust object appearance through complex occlusions, which is adequate for tracking.

This paper is organized as follows. Section 1 discusses the state of the art in methods of multi-target tracking with occlusions, and contextualizes our work. An overview of the proposed approach is presented in Section 2. In Section 3, we describe the strategy of skeleton points extraction and matching. The estimation of the occlusion states of skeleton points along the video sequence is described in Section 4, as well as the classification method which is used to assign these points to corresponding targets. In Section 5, we present the dense representation for the moving targets. Section 6 introduces a rough compensation approach to estimate the missing parts of the occluded objects. Section 7 presents some experimental results obtained with the proposed approach. Finally, Section 8 summarizes the main contributions of this paper, and discusses some limitations of the proposed approach for future work.

2. Method overview

The structure of the proposed multi-target tracking approach can be divided into three main parts, as shown in Fig. 1.

1. Background modeling and foreground detection.
2. Occlusion segmentation via skeleton points assignment.
3. Multi-target tracking in a probabilistic appearance-based particle filter tracking framework.

All the stages of the proposed approach are processed sequentially. Every stage is performed for the entire video before going to the next stage.

The first part concerns foreground detection in the scene. We utilize the Gaussian mixture model [21] to estimate the moving points. This stage takes as input the original video frames, and outputs a set of foreground points.
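As a concrete illustration of this stage, the sketch below uses OpenCV's MOG2 background subtractor as a stand-in for the Gaussian mixture model of [21]; the input file name, the shadow threshold and the median filtering are illustrative assumptions rather than details from the paper.

```python
import cv2

cap = cv2.VideoCapture("pets2009_view1.avi")  # hypothetical input clip
# MOG2 is OpenCV's Gaussian-mixture background model, a descendant of [21].
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)               # per-pixel foreground mask
    mask = cv2.medianBlur(mask, 5)               # suppress isolated noise pixels
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadow pixels
    foreground = cv2.bitwise_and(frame, frame, mask=mask)       # foreground points
cap.release()
```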

Fig. 1. System framework diagram. (Pipeline: input images → background subtraction (Gaussian mixture model) → skeleton extraction and matching (optical flow feature points, greedy matching algorithm) → skeleton assignment (color distribution model, space/motion information) → dense segmentation → occlusion compensation (sequential information) → particle filter tracking (probabilistic appearance model, model update).)


The second part deals with the segmentation of occlusions, and this is the key point of our study. All the foreground points are assigned to an appropriate object. This stage takes as input the original video frames and the foreground point set computed in the first stage, and returns the corresponding segmentation label for each foreground point as output, representing the index of the target. The occlusion segmentation process can be divided into four steps.

- Skeleton points selection and matching (see Section 3): in this step, we extract skeleton points by using the optical flow in the scene. Considering a successive video sequence consisting of the current frame, the previous frame and the next frame, we estimate the matching situation of all the skeleton points, and establish a matching matrix, which is utilized to determine the occlusion states of these points.
- Skeleton points assignment (see Section 4): here, all the matching matrices computed in the previous step are processed simultaneously to produce a unique state transition matrix of the skeleton points in the current frame. We estimate the occlusion type of these points according to the state transition matrix, and then assign these points to the appropriate target in the scene by considering high-level information.
- Dense segmentation extraction (see Section 5): in this step, the dense segmentation is done by creating implicit functions for each foreground pixel, based on color, motion and spatial position. Foreground pixels are clustered to the skeleton points which were labeled in the previous step, and are marked with the same label as their clustering kernel. This representation of foreground points can be employed to obtain an efficient observation of each target in the scene.
- Occlusion compensation (see Section 6): in this step, we locate the occluded objects by their foreground change rate. A blob-based matching strategy and historical foreground information are utilized to estimate the missing parts of the occluded objects.

The third and final part of the proposed tracking method is the particle filter tracking [22,23] based on a probabilistic appearance model [5]. This stage takes as input the original video frames, as well as the segmentation labels returned by the second stage, and returns as output the final tracking results in the form of bounding boxes.
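For orientation, a generic single step of such a particle filter is sketched below. The random-walk motion model and the appearance_likelihood callback are placeholders; the actual system scores particles with the probabilistic appearance model of [5].

```python
import numpy as np

def particle_filter_step(particles, appearance_likelihood, noise=5.0):
    """One predict-weight-resample cycle; particles is an (N, 2) array of box centers."""
    rng = np.random.default_rng()
    # 1. Propagate particles with a random-walk motion model.
    particles = particles + rng.normal(0.0, noise, particles.shape)
    # 2. Weight each particle by the appearance model's likelihood.
    weights = np.array([appearance_likelihood(p) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    # 3. Resample to concentrate particles on high-likelihood states.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]
```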

3. Selection and matching of skeleton points

In this section, we show how the skeleton points are selected in each video frame, and how their correspondences in a successive 3-frame video sequence are computed. A matching matrix established here is adopted to estimate the occlusion states of the skeleton points.

3.1. Selection of skeleton points

The main goal of the skeleton points in this paper is to provide the basis for the occlusion segmentation, so some important properties of these points should be considered.

1. First, the sampling density of skeleton points should be adaptive, in the sense that regions with more details are sampled with more points, while homogeneous regions are sampled with fewer points. Thus, higher segmentation precision can be obtained in regions with more motion information, while saving computation in other regions.
2. Second, the location of the skeleton points should be as stable as possible. This ensures that the correspondences of the skeleton points in pairs of adjacent frames can give valuable information about occlusion.
3. Third, the skeleton points should handle the matching between non-rigid targets well, because most tracking targets, especially pedestrians, are non-rigid.

The most commonly used feature selection methods, such as Harris, KLT and SIFT, tend to have a high density on the boundaries in the frame, as well as a low density in homogeneous regions. Meanwhile, these methods are mainly designed for the matching problems of rigid objects, and have a considerable error rate when dealing with non-rigid objects. In addition, the number of points these methods produce is not large enough to capture most of the information of the moving targets. Thus, these methods are not suitable for extracting the skeleton points in the proposed method.

The approach proposed by Sand and Teller [20] is adopted here to generate a set of optical flow feature points, which are used as the skeleton points in our work. This method is based on frame-to-frame optical flow, and considers the influence of scale changes while computing the optical flow. As shown in Fig. 2, the feature point sampling density is adaptive: regions with moving objects have many more points than homogeneous ones. The skeleton points in this paper focus on the occlusions of foreground pixels, so we make a few modifications for computational efficiency. Suppose that the original feature point set obtained by Sand and Teller [20] in Frame t is $P_t$, and the foreground point set is $F_t$. The skeleton point set is given by

$Sp_t = \{p \mid p \in P_t \cap F_t\}$,  (1)

which shows that we only care about the points that are foreground.
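In code, Eq. (1) amounts to masking the feature point set with the binary foreground image; a minimal sketch (the array layout is an assumption):

```python
import numpy as np

def skeleton_points(feature_points, fg_mask):
    """feature_points: (N, 2) integer (x, y) coordinates inside the frame;
    fg_mask: HxW binary foreground image F_t."""
    xs, ys = feature_points[:, 0], feature_points[:, 1]
    keep = fg_mask[ys, xs] > 0      # keep points lying on foreground pixels
    return feature_points[keep]     # Sp_t = {p | p in P_t and p in F_t}
```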

Fig. 2. Example of skeleton points set.

Fig. 3. Skeleton points match between two objects.

Table 1
$MatchMap_{t_1,t_2}$ of skeleton points in Fig. 3.

      sp1  sp2  sp3  sp4  sp5  sp6
sp1    1    0    0    0    0    0
sp2    0    1    0    0    0    0
sp3    0    0    1    0    0    0
sp4    0    0    0    0    1    0
sp5    0    0    0    0    0    1


3.2. Matching of skeleton points

The occurrence of a skeleton point in a video sequence should be computed first, in order to determine the occlusion states of that point. To estimate that occurrence, we need to know the correspondences of the skeleton points in a successive video sequence. Thus we construct the successive video sequence from the current frame $I_t$, the previous frame $I_{t-1}$, and the following frame $I_{t+1}$. The matching matrix between every two frames in the sequence is computed as below.

Definition 1. Assume the number of skeleton points in Frame t is $M_t$; then the matching matrix between Frames $t_1$ and $t_2$ is a binary matrix named $MatchMap_{t_1,t_2}$. The element of this matrix is

$MatchMap^{t_1,t_2}_{i,j} = \begin{cases} 1 & \text{if the } i\text{th point of } Sp_{t_1} \text{ matches the } j\text{th point of } Sp_{t_2}, \\ 0 & \text{otherwise.} \end{cases}$  (2)

A greedy algorithm is adopted to find the best match of the skeleton points. We first select one point randomly from the skeleton point set $SP_{t_1}$, and find the most similar point in $SP_{t_2}$. Here, color and spatial position are used to define the similarity function, which can be described as

$SpLikelihood(sp_{t_1}, sp_{t_2}) = e^{-spaceDis(sp_{t_1},sp_{t_2})/\sigma^2_{space}}\, e^{-colorDis(sp_{t_1},sp_{t_2})/\sigma^2_{color}}$,  (3)

where $spaceDis(sp_{t_1},sp_{t_2})$ and $colorDis(sp_{t_1},sp_{t_2})$ are the Euclidean and RGB distances between the two skeleton points $sp_{t_1}$ and $sp_{t_2}$ respectively. These distances are given by

$spaceDis(sp_{t_1},sp_{t_2}) = (x_{sp_{t_1}} - x_{sp_{t_2}})^2 + (y_{sp_{t_1}} - y_{sp_{t_2}})^2$,  (4)

$colorDis(sp_{t_1},sp_{t_2}) = (r_{sp_{t_1}} - r_{sp_{t_2}})^2 + (g_{sp_{t_1}} - g_{sp_{t_2}})^2 + (b_{sp_{t_1}} - b_{sp_{t_2}})^2$,  (5)

where $(x, y)$ and $(r, g, b)$ are respectively the coordinates and the RGB value of a skeleton point. The details of the matching algorithm are described in Algorithm 1.

Algorithm 1. Skeleton points matching algorithm.


Input: $SP_{t_1}$, $SP_{t_2}$, $I_{t_1}$, $I_{t_2}$ and similarity threshold $\theta_r$
Output: $MatchMap_{t_1,t_2}$
Initialize: $MatchMap_{t_1,t_2} = 0$;
repeat
    select a point i randomly from $SP_{t_1}$;
    forall the points in $SP_{t_2}$ do
        compute the similarity between the current point and i using (3);
    end
    find the point j in $SP_{t_2}$ with the maximum similarity maxLikelihood;
    if maxLikelihood > $\theta_r$ then
        $MatchMap^{t_1,t_2}_{i,j} = 1$;
        delete i, j from $SP_{t_1}$, $SP_{t_2}$ respectively;
    else
        delete j from $SP_{t_2}$;
    end
until either $SP_{t_1}$ or $SP_{t_2}$ is empty;

Algorithm 1 is a greedy algorithm, which means the order of choosing points from $SP_{t_1}$ may lead to different matching results. However, according to (3), the matching similarity is constrained by both color and spatial distance, so matching errors only occur between skeleton points with similar color and location. Such skeleton points are always in a small homogeneous region. Thus, these matching deviations are acceptable in the subsequent stage.
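A compact sketch of Algorithm 1 with the similarity of Eqs. (3)-(5) could look as follows; the point layout (x, y, r, g, b), the sigma defaults and the threshold value are illustrative assumptions.

```python
import numpy as np

def sp_likelihood(p, q, sigma_space=8.0, sigma_color=7.0):
    space_dis = (p[0] - q[0])**2 + (p[1] - q[1])**2          # Eq. (4)
    color_dis = sum((p[i] - q[i])**2 for i in (2, 3, 4))     # Eq. (5)
    return np.exp(-space_dis / sigma_space**2) * np.exp(-color_dis / sigma_color**2)

def greedy_match(sp1, sp2, theta_r=0.1):
    """Greedy matching of Algorithm 1; returns {index in SP_t1: index in SP_t2}."""
    match, rng = {}, np.random.default_rng(0)
    free1, free2 = set(range(len(sp1))), set(range(len(sp2)))
    while free1 and free2:
        i = int(rng.choice(sorted(free1)))        # random point from SP_t1
        j = max(free2, key=lambda k: sp_likelihood(sp1[i], sp2[k]))
        if sp_likelihood(sp1[i], sp2[j]) > theta_r:
            match[i] = j                          # record the match (Eq. (2))
            free1.remove(i)
            free2.remove(j)
        else:
            free2.remove(j)                       # discard j, as in Algorithm 1
    return match
```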

As shown in Fig. 3, there are five skeleton points in Frame $t_1$ and six skeleton points in Frame $t_2$. The matched point pairs obtained by Algorithm 1 are marked with red links, and the corresponding match matrix is in Table 1. If point $sp_i$ ($i = 1, 2, \ldots, M_{t_1}$) in $SP_{t_1}$ has a matched point $sp_j$ in $SP_{t_2}$, then the element $MatchMap^{t_1,t_2}_{i,j}$ is labeled "1", meaning that one skeleton point exists in both Frames $t_1$ and $t_2$. Meanwhile, rows (columns) that are all zeros indicate points in Frame $t_1$ ($t_2$) that have no corresponding point in Frame $t_2$ ($t_1$). The matching matrices of every two frames, namely $MatchMap_{t,t-1}$, $MatchMap_{t,t+1}$ and $MatchMap_{t-1,t+1}$, are obtained for the 3-frame video sequence by the method above, and provide the occurrence of the skeleton points, which is used for estimating the occlusion states in the next step.

4. Skeleton assignment

In this section, a state transition matrix is established to analyze the occlusion states of skeleton points. These points are divided into occluded points, exposed points or new points. Then, the visible skeleton points are assigned to appropriate targets according to color, position and motion information.

4.1. Tracking model of the targets

Before the skeleton points are assigned, the tracking model of each target should be discussed first. A probabilistic appearance based particle filter tracking system [5] is employed in our approach. The appearance model is a single Gaussian process, composed of three template elements: the RGB appearance of the person $\mu_n$, the corresponding variance matrix for each model pixel $\Sigma_n$, and the foreground probability template $P_{F,n}$, where n is the identification number associated with each target. Several entry regions are selected manually before tracking. When a new blob is detected in these regions, we first run the multiple-human segmentation algorithm proposed in [2], an iterative process that estimates the head pixel of the target. This segmentation algorithm is utilized to separate individual targets from a group of people. It works well for a small number of overlapping people that do not have severe occlusion; a severely occluded object, however, will not be detected. The method is not sensitive to blob fragmentation if a large portion of the object still appears in the foreground. Then the appearance model is initialized by the bounding box which is located by the head pixel and the feet pixel (see [5] for details). Thus a tracking model is established for each target in the scene, and the skeleton points are assigned to the targets in the following step.

4.2. Analysis of the occlusion state of skeleton points

Once the matching matrices $MatchMap_{t,t-1}$, $MatchMap_{t,t+1}$ and $MatchMap_{t-1,t+1}$ are computed, the occlusion states of the skeleton points can be analyzed. In a successive video sequence, the skeleton points can be decomposed into two main groups: visible points and invisible points. The first group includes the exposed points and new points. The exposed points always exist in the video sequence, while the new points are generated by the appearance change of the targets.

Table 2
The analysis of skeleton points states.

$ST^t = (s_{t-1}, s_t, s_{t+1})$   Occurrence description                 Occlusion state
0 0 0    Exists in no frames                    Impossible
0 0 1    Only exists in Frame t+1               New point in Frame t+1 or error point in Frame t
0 1 0    Only exists in Frame t                 Error point in Frame t
0 1 1    Exists in two successive frames        New point in Frame t
1 0 0    Disappears in two successive frames    Missing point in Frame t-1
1 0 1    Only disappears in Frame t             Error point in Frame t
1 1 0    Only disappears in Frame t+1           Missing point in Frame t or error point in Frame t+1
1 1 1    Exists in all frames                   Exposed point

The second group includes the occluded points and the disappeared points, which are produced by the appearance change or the absence of targets. A state transition matrix $STMap^t$ is established to analyze the occlusion states of the skeleton points in a 3-frame video sequence.

Definition 2. The occlusion state transition of a single skeleton point is represented by a vector $ST^t = (s_{t-1}, s_t, s_{t+1})$. If the point exists in Frame l ($l = t-1, t, t+1$), $s_l$ is labeled "1", otherwise it is labeled "0". The state transition matrix of the skeleton points in Frame t is then described as a binary matrix $STMap^t = (ST^t_1, ST^t_2, \ldots, ST^t_{N_t})^T$, where $N_t$ is the total number of skeleton points in Frames t-1, t and t+1.

Table 2 gives a comprehensive investigation and analysis of all the possible values of the occlusion state transition vector. We can see that the occlusion state of a skeleton point in Frame t can be completely inferred from its state transition from Frame t-1 to t+1. So we establish a state transition matrix by gathering $MatchMap_{t,t-1}$, $MatchMap_{t,t+1}$ and $MatchMap_{t-1,t+1}$ to estimate the occlusion states of skeleton points in Frame t. The approach to generate the state transition matrix is described in Algorithm 2.

Algorithm 2. State transition generation algorithm.

Input: $MatchMap_{t,t-1}$, $MatchMap_{t,t+1}$, and $MatchMap_{t-1,t+1}$
Output: $STMap^t$
Initialize: $STMap^t = 0$;
1. for the ith row of $STMap^t$, $1 \le i \le M_t$, where $M_t$ is the number of skeleton points in Frame t, do
    1) set $STMap^t_i.s_t = 1$;
    2) if there is a matching point j in $SP_{t-1}$ then
         $STMap^t_i.s_{t-1} = 1$;
         mark j in the point set $SP_{t-1}$ as processed;
       else
         $STMap^t_i.s_{t-1} = 0$;
       end
    3) if there is a matching point k in $SP_{t+1}$ then
         $STMap^t_i.s_{t+1} = 1$;
         mark k in the point set $SP_{t+1}$ as processed;
       else
         $STMap^t_i.s_{t+1} = 0$;
       end
   end
2. for the ith row of $STMap^t$, $M_t+1 \le i \le M_t+N_t$, where $N_t$ is the number of unmarked skeleton points in Frame t-1, do
    1) set $STMap^t_i.s_{t-1} = 1$, $STMap^t_i.s_t = 0$;
    2) for the unprocessed point l in $SP_{t-1}$: if there is a matching point j in $SP_{t+1}$ then
         $STMap^t_i.s_{t+1} = 1$;
         mark j in the point set $SP_{t+1}$ as processed;
       else
         $STMap^t_i.s_{t+1} = 0$;
       end
   end
3. for the ith row of $STMap^t$, $M_t+N_t+1 \le i \le M_t+N_t+K_t$, where $K_t$ is the number of unmarked skeleton points in Frame t+1, do
    these points exist only in Frame t+1, so set
    $STMap^t_i.s_{t-1} = 0$, $STMap^t_i.s_t = 0$, $STMap^t_i.s_{t+1} = 1$;
   end

All the skeleton points in Frame t are divided into three point sets according to $STMap^t$: the exposed-point set $V_t = \{sp_i \mid ST^t_i = (1,1,1)\}$, the new-point set $N_t = \{sp_i \mid ST^t_i = (0,1,1)\}$, and the missing-point set $M_t = \{sp_i \mid ST^t_i = (1,0,0)\}$.
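The partition into $V_t$, $N_t$ and $M_t$ reduces to a lookup on the state-transition rows; a minimal sketch:

```python
def classify_points(st_rows):
    """st_rows: list of (point_id, (s_prev, s_cur, s_next)) tuples from STMap^t."""
    exposed, new, missing = [], [], []
    for pid, st in st_rows:
        if st == (1, 1, 1):
            exposed.append(pid)     # V_t: visible in all three frames
        elif st == (0, 1, 1):
            new.append(pid)         # N_t: new point appearing in Frame t
        elif st == (1, 0, 0):
            missing.append(pid)     # M_t: point lost after Frame t-1
    return exposed, new, missing
```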

4.3. Assignment of the skeleton points

After estimating the occlusion states of the skeleton points, we assign the points in $V_t$ and $N_t$ to an appropriate target using color, spatial and motion information. The approach proposed by Nummiaro [24] is adopted to construct the RGB color histogram of the target. Suppose that the color distributions are discretized into m bins. The histograms are produced with the function $h(x_i)$, which assigns the color at location $x_i$ to the corresponding bin. In our method, the histograms are typically calculated in the RGB space using 8×8×8 bins. Thus, the color distribution of the nth target in Frame t is represented by $p_{col,n}(t) = \{p_{col,n}(t)(u)\}_{u=1,\ldots,m}$, and is given by

$p_{col,n}(t)(u) = f \sum_{r \in \Omega_n} F(r,t)\, \delta\{h[I(r,t)] - u\}$,  (6)

where $f = 1/\sum_{r \in \Omega_n} F(r,t)$; $\delta$ is the Kronecker delta function; $\Omega_n$ is the coordinate set of the nth target; $r \in \Omega_n$ is the coordinate of a skeleton point; and I and F are the original image and the foreground image respectively.
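A direct rendering of Eq. (6), assuming 8-bit RGB pixels and a binary foreground weight F(r, t) per coordinate of the target:

```python
import numpy as np

def color_distribution(pixels, fg):
    """pixels: (N, 3) uint8 RGB values over Omega_n; fg: (N,) 0/1 foreground flags."""
    hist = np.zeros((8, 8, 8))
    for (r, g, b), w in zip(pixels // 32, fg):   # h[I(r,t)]: 8x8x8 binning
        hist[r, g, b] += w                       # foreground-weighted count
    total = hist.sum()                           # 1/f = sum of F(r,t)
    return hist / total if total > 0 else hist
```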

Based on (6), the color confidence probability that a skeleton point sp belongs to the nth target can be defined as

$Likelihood_{col} = p_{col,n}(t)(h[I(sp,t)])$.  (7)

Also, we can define the spatial and motion confidence probability functions as

$Likelihood_{space} = e^{-[(x_{sp}-x_n)^2 + (y_{sp}-y_n)^2]/\sigma^2_{space}}$,  (8)

$Likelihood_{motion} = e^{-[(u_{sp}-u_n)^2 + (v_{sp}-v_n)^2]/\sigma^2_{motion}}$,  (9)

where $(x_n, y_n)$ is the centroid of the nth target, calculated from all the coordinates of the foreground points in $\Omega_n$ in Frame t-1, weighted by their distances to the centroid.

Assume the visible point set of the nth target is $V_n$. The final confidence probability that sp belongs to $V_n$ is

$p(sp \in V_n) = Likelihood_{col} \cdot Likelihood_{space} \cdot Likelihood_{motion}$.  (10)

For each target in the scene, we compute the probability that a skeleton point belongs to it using (10), and select the target with the maximum probability as the assignment result for that point. As shown in Fig. 4, the assignment approach proposed in this paper can well distinguish skeleton points belonging to different interacting targets.

Fig. 4. Results of skeleton points assignment.
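Putting Eqs. (7)-(10) together, the assignment of one skeleton point can be sketched as below. The target record layout is an assumption; the sigma defaults follow the experimental settings reported in Section 7.2.

```python
import numpy as np

def assign_point(sp, targets, s_space=8.0, s_motion=2.0):
    """sp: dict with keys x, y, u, v, rgb; targets: list of dicts with keys
    cx, cy (centroid), un, vn (mean motion) and hist (8x8x8 RGB histogram)."""
    bins = tuple(min(int(c) // 32, 7) for c in sp["rgb"])   # h[I(sp, t)]
    best, best_p = None, -1.0
    for n, tg in enumerate(targets):
        l_col = tg["hist"][bins]                                        # Eq. (7)
        l_spa = np.exp(-((sp["x"] - tg["cx"])**2 +
                         (sp["y"] - tg["cy"])**2) / s_space**2)         # Eq. (8)
        l_mot = np.exp(-((sp["u"] - tg["un"])**2 +
                         (sp["v"] - tg["vn"])**2) / s_motion**2)        # Eq. (9)
        p = l_col * l_spa * l_mot                                       # Eq. (10)
        if p > best_p:
            best, best_p = n, p
    return best          # index of the most probable target
```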

5. Dense segmentation extraction

In the particle filter tracking framework, it is necessary to extract a dense representation of the target; that is, we must determine which pixels are assigned to each moving object. In this section, we assign each foreground pixel to the most similar skeleton point. To perform this task, we compare foreground pixels with the sets of skeleton points in terms of color, spatial and motion proximity using implicit functions.

Let $P = \{sp_1, sp_2, \ldots, sp_n\}$ be the whole set of skeleton points already assigned to different targets, and $l_{sp} \in \{1, \ldots, K\}$ be the label that indicates the target of each skeleton point. To each sp in Frame t, represented by its spatial coordinates $(x^t_{sp}, y^t_{sp})$, a multivariate Gaussian kernel is assigned (see (11)):

$G^t_{sp}(x,y,u,v,c) = \frac{1}{(2\pi)^2 |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}\, \mathbf{d}^T \Sigma^{-1} \mathbf{d}\right)$, with $\mathbf{d} = (x - x^t_{sp},\, y - y^t_{sp},\, u - u^t_{sp},\, v - v^t_{sp},\, c - c^t_{sp})^T$,  (11)

where $u^t_{sp} = x^t_{sp} - x^{t-1}_{sp}$ and $v^t_{sp} = y^t_{sp} - y^{t-1}_{sp}$ represent the displacement of sp in Frame t, $c^t_{sp}$ is the color of sp, and $\Sigma$ is the covariance matrix given by

$\Sigma = \mathrm{diag}(\sigma_S, \sigma_S, \sigma_M, \sigma_M, \sigma_C)$,

where $\sigma_S$, $\sigma_M$ and $\sigma_C$ are the spatial, motion and color standard deviations respectively. These coefficients are chosen as a compromise between precision and smoothness of the object boundaries, and can be modified according to the application.

According to (11), pixels located near the skeleton point sp, with motion and color similar to sp, will yield large $G^t_{sp}$ values, while pixels far away from sp, or with distinct motion or color vectors, will yield small $G^t_{sp}$ values. In order to assign a foreground pixel to a target, we compare it with all the skeleton points using (11), and assign the pixel to the label whose skeleton points yield the largest sum of Gaussian kernels at the corresponding pixel position:

$\max_N \sum_{\{sp \mid l_{sp} = N\}} G^t_{sp}(x,y,u,v,c)$.  (12)

Fig. 6(a)–(d) shows the final result of the proposed SPA occlusion segmentation approach. Most of the foreground pixels are assigned to the right target. This matters for multi-target tracking in the probabilistic appearance based particle filter framework for two reasons: (1) the likelihood of the probabilistic appearance in the observation stage of the particle filter would be badly influenced by foreground points that do not belong to the target; (2) information from occluding objects is mixed with the real target when updating the probabilistic appearance of occluded targets, which yields an accumulating deviation of the appearance model. The proposed method largely excludes the impact of occlusion, and the labeled foreground can accurately reflect the information of the targets. Thus, the performance of multi-target tracking improves.

6. Occlusion compensation

In this section, a compensation strategy is presented to estimate the missing parts of the occluded objects. We assume that the foreground pixel count of an object does not change greatly in a successive video sequence. Thus, the change rate of the foreground count is defined as

$FC = n_{t-1}/n_t$,  (13)

where $n_t$ represents the foreground pixel count in Frame t. FC measures how fast the foreground changes between two frames. When its value is larger than a threshold th (set to 0.7 in our experiments), there is a rapid reduction of foreground pixels in Frame t during a very short time. The reduction may be caused by: (1) the object moving out of the camera view, or (2) occlusions and deformations. The first situation is excluded in our work by setting a group of entry regions before tracking begins; rapid reductions occurring outside these regions are considered occlusions, which need to be compensated.

Sequential information is utilized in this work to complement the missing parts of the occluded or deformed objects. Firstly, the foreground of the object in Frame t-1 is divided into a set of blobs, each 5 pixels wide and 5 pixels high. Then, a blob-based matching strategy is adopted to find the best position for the foreground $F_{t-1}$ in Frame t, where the color distance between blobs is calculated as the Euclidean distance of the color vectors. Finally, the unmatched blobs in Frame t-1 are considered the missing parts of the object, and both the corresponding binary foreground image and the corresponding appearance image are added to the foreground in Frame t. Thus, the occluded parts of a target in Frame t are compensated by the most similar parts of that target in Frame t-1. The results of occlusion compensation are shown in Fig. 5. When a short-term occlusion occurs, our compensation approach gives a rough estimate of the missing parts of the occluded object, and provides more information to the particle filter tracking framework.

Fig. 5. Examples of occlusion compensation. (a) Input image. (b) Foreground of the man without compensation. (c) Appearance of the man after compensation. (d) Foreground of the man after compensation.
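The trigger of Eq. (13) is a one-liner; the sketch below mirrors the rule as stated in the paper, with the 0.7 threshold, while the guard against an empty foreground is an added assumption.

```python
def needs_compensation(n_prev, n_cur, th=0.7):
    """Flag a rapid foreground reduction in Frame t, per Eq. (13)."""
    fc = n_prev / max(n_cur, 1)   # FC = n_{t-1} / n_t; avoid division by zero
    return fc > th                # th = 0.7 in the paper's experiments
```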

7. Experimental results

In this section, we present some experimental results to illustrate the performance of our proposed approach.

7.1. Analysis of occlusion segmentation results

Fig. 6 describes an example of occlusion. In Fig. 6(e) the number of assigned foreground points of each target is shown. Initially, the man, coming toward the camera, has an increasing number of foreground points, while the woman, moving in the opposite direction, has a decreasing number of foreground points. When the occlusion occurs at about Frame 455, the foreground count of the occluded man decreases conspicuously, while that of the woman does not change abnormally. After the occlusion, the foreground count of both targets returns to an ordinary level at about Frame 465.

Occlusions under a stationary monocular camera usually include static occlusions and dynamic occlusions. Fig. 7 shows the performance of the proposed approach on several kinds of occlusions. We can distinguish the background object from the moving target (Fig. 7a); meanwhile, we can divide two moving targets going in opposite directions (Fig. 7b). Moreover, when there are more targets with similar moving directions in the scene, our method still provides good performance (Fig. 7c). The segmentation errors for the two targets on the right of Fig. 7c arise because the colors of these two targets are similar in the original video, and the optical flow information is inaccurate, which influences the assignment of the foreground pixels.

7.2. Multi-target tracking

The described tracking algorithm was tested on the PETS 2009 S2.L1 dataset. The scenes are captured from multiple views, and the input images are RGB color pictures of 768×576 pixels. Table 3 reports statistics about events occurring in the sequences. Our algorithm worked well in all 795 frames and handled almost all the complex situations, such as people passing by, walking together, etc. Most views of this dataset were tested under Visual Studio 2010 on a desktop computer (Intel Pentium Dual-Core CPU E5200, 2.5 GHz, 2 GB RAM), with the Intel OpenCV library [25]. The parameters of the proposed method are selected according to [2,20] and [5]. In our experiments, the values are set as follows: $\sigma_{space} = 8$; $\sigma_{color} = [7\ 7\ 7]$; $\sigma_{motion} = 2$.

On a typical video frame, the proposed approach spends about 0.03 s extracting foreground pixels, 0.5 s selecting and matching the skeleton points, 0.1 s assigning skeleton points, 0.02 s running dense segmentation, and 0.04 s compensating foreground pixels. The final tracking algorithm requires about 0.05 s per target per frame. It takes about 1 s per frame on average in our experiments, and most of the time is spent on optical flow computation. Thus, our approach is not suitable for real-time tracking, and its efficiency could be greatly improved by a high-speed optical flow estimation algorithm.

Fig. 8 depicts some representative tracking results from the View1 sequence. The colored boxes are the tracking boxes. The whole dataset can be found at http://www.cvg.rdg.ac.uk/PETS2009/a.html.

Fig. 7. Examples of occlusion segmentation. (a) Static occlusion. (b) Dynamic occlusion between two people. (c) Occlusion segmentation between three people.

Table 3
Challenging events.

Event                  Count
Pass-by                52
Walking in group       2
Reappearance           9
Static occlusion       44
New person             8
3-Person interaction   2

Fig. 6. (a)–(d) Occlusion segmentation result of four sample frames with two people: the man, coming from the left, is occluded by the woman coming from the right; (e) number of assigned foreground points of the occluded and occluding targets over Frames 430–470 during the occlusion.


In Frame 425, there is no interaction between targets, and people are correctly detected and tracked. From Frames 62–68, a dynamic occlusion occurs between two people in the upper left corner of the scene, and both targets are tracked correctly before and after the occlusion. Static occlusions exist in Frames 157 and 339, where targets are occluded by the sign in the scene. In this situation, the historical information recorded by the probabilistic appearance model in the particle filter framework is utilized to infer the real location of the targets, and it works well in our system. In Frame 162, the largely occluded target reappears in the scene, and he is identified and tracked correctly. We can see from the experiments that, even when two or more targets have similar appearance, the proposed tracking approach can separate them in most situations.

Fig. 9 shows the tracking results on another view of the PETS 2009 dataset. In this view, the depression angle of the camera is smaller, which yields more serious occlusions.

Fig. 9. Results of occlusion segmentation (PETS2009, Task S2, Video L1, View5).

Fig. 8. Results of occlusion segmentation (PETS2009, Task S2, Video L1, View1). (a) Frame 62, (b) Frame 68, (c) Frame 157, (d) Frame 162, (e) Frame 339, (f) Frame 425.


For example, there is large overlap between targets in Frames 679–686 and in Frames 700–720. The experimental results demonstrate that the proposed method adapts well to temporary occlusions.

A quantitative performance evaluation was performed using the framework proposed by Smith [26]; the results are reported in Table 4 and Fig. 10. We compared our method with some widely used models adopted for occlusion segmentation in multi-target tracking systems, including the color histogram model [23], the improved probabilistic appearance model


[5], the K-means based occlusion segmentation model reported in another work of ours, and the recently proposed L1 tracker [27].

The evaluation has been carried out in terms of False Positives (FP), False Negatives (FN), Multiple Objects (MO), and Multiple Trackers (MT) [26].

Table 4
Tracking error analysis (PETS2009, Task S2, Video L1).

View   Method                     FP     FN     MT     MO
View1  Color histogram            0.052  0.253  0.187  0.065
       Probabilistic appearance   0.046  0.235  0.162  0.054
       K-means segmentation       0.061  0.169  0.104  0.026
       L1 tracker                 0.043  0.217  0.157  0.022
       Without compensation       0.057  0.169  0.106  0.024
       Proposed method            0.055  0.167  0.104  0.024
View5  Color histogram            0.065  0.283  0.257  0.175
       Probabilistic appearance   0.058  0.279  0.245  0.164
       K-means segmentation       0.066  0.269  0.201  0.086
       L1 tracker                 0.061  0.277  0.254  0.041
       Without compensation       0.063  0.267  0.203  0.087
       Proposed method            0.062  0.266  0.199  0.091
View7  Color histogram            0.055  0.258  0.252  0.085
       Probabilistic appearance   0.042  0.251  0.244  0.082
       K-means segmentation       0.051  0.250  0.215  0.073
       L1 tracker                 0.040  0.258  0.242  0.057
       Without compensation       0.050  0.247  0.217  0.059
       Proposed method            0.048  0.248  0.211  0.068
View8  Color histogram            0.051  0.287  0.198  0.094
       Probabilistic appearance   0.052  0.280  0.187  0.087
       K-means segmentation       0.055  0.272  0.164  0.045
       L1 tracker                 0.052  0.279  0.188  0.031
       Without compensation       0.055  0.276  0.163  0.037
       Proposed method            0.053  0.276  0.162  0.042

Fig. 10. Tracking error analysis. (Bar charts of the FP, FN, MT and MO error rates for Views 1, 5, 7 and 8, comparing the proposed method, K-means segmentation, probabilistic appearance, color histogram and the L1 tracker.)

FP means that an estimate exists that is not associated with a ground truth object, FN means that a ground truth object exists that is not associated with an estimate, MT means that two or more estimates are associated with the same ground truth object, and MO means that two or more ground truth objects are associated with the same estimate. From the experiments we find that the probabilistic appearance model considers spatial information on top of color information, so it has a lower error rate than the color histogram. The proposed method assigns the skeleton points to the corresponding targets and obtains more accurate foregrounds of the targets, so it decreases the probability that multiple targets are tracked by one tracker (MO) and that a tracker drifts to other targets (MT). FN is also improved, because we obtain more accurate information about the occluded targets. However, the proposed method does not improve FP. The reason is that the skeleton points are extracted by optical flow; when a target moves very slowly, or two targets have similar color, motion and appearance, the occlusion segmentation does not work well and the tracking result may be influenced.

The L1 tracker is mostly used for single-target tracking in [27]. This algorithm shows very good performance in solving tracking problems. In our experiments, we initialized an L1 tracker for each target, and compared its multi-target tracking performance with our method. The bounding boxes of both methods are precise when the targets are moving alone, and those of the L1 tracker are more stable. However, the L1 tracker does not consider the mutual information of interacting targets, so its bounding boxes sometimes drifted when targets were near each other. The L1 tracker represents targets by a set of appearance templates; its bounding boxes sometimes drift to other targets but seldom drift to locations that are not people. Thus, the L1 tracker has a lower FP as well as a higher FN. The drifts mentioned above also lead to the situation where multiple trackers track the same target (MT). The size of the L1 bounding boxes always fits the size of the targets, so MO is low.



The effect of our occlusion compensation algorithm can also be seen in Table 4. By adding a compensation stage to the system, the degradation of the tracking bounding boxes caused by occlusion is clearly reduced, because the appearance of the targets is more complete. Thus the FP of the proposed method is lower than that of the method without compensation. Meanwhile, the increased size of the tracking box of occluded targets introduces the chance that one tracker includes multiple targets, so the MO of the proposed method is a little higher.

8. Conclusion

A method for multi-target tracking with occlusions under a stationary monocular camera was proposed in this work. This technique provides a new way of linking low-level information in videos to high-level concepts. The proposed method extracts the skeleton points in the scene using optical flow, and assigns these points to corresponding targets according to their color, motion and appearance. Then, the foreground pixels of the targets are estimated by a dense segmentation stage and a blob-based compensation approach. Finally, the probabilistic appearance based particle filter tracking framework is employed to obtain the final tracking result. Experimental results obtained with the proposed method were presented. Comparisons with a few widely used methods on the public PETS 2009 dataset indicate that our method can perform well on multi-target tracking in complex environments, especially when there are dynamic occlusions among targets.

Our method does have several limitations. The skeleton points are extracted according to the optical flow; however, when occlusion occurs, the optical flow velocity is not always correct, because of the constraints of optical flow methods and the interaction between targets. The performance of occlusion segmentation is therefore sometimes affected, and more robust optical flow estimation methods should be explored in future work.

Acknowledgments

This work was supported by the Hi-tech Research and Development Program of China (863) (2008AA01Z121) and the National Science Foundation of China (90924026). The authors would like to thank Dr. P. Sand and Dr. Haibing Ling for providing the optical flow code used in [20] and the L1 code in [27], respectively.

References

[1] M. Shah, S. Khan, Tracking multiple occluding people by localizing on multiple scene planes, IEEE Trans. Pattern Anal. Mach. Intell. 31 (3) (2009) 505–519.
[2] T. Zhao, R. Nevatia, Tracking multiple humans in complex situations, IEEE Trans. Pattern Anal. Mach. Intell. 26 (9) (2004) 1208–1221.
[3] L. Xu, P. Puig, B. Res, B. Venturing, A hybrid blob- and appearance-based framework for multi-object tracking through complex occlusions, in: The International Conference on Computer Communications and Networks, Washington, DC, 2005, pp. 73–80.
[4] A. Senior, A. Hampapur, Y.L. Tian, L. Brown, S. Pankanti, R. Bolle, Appearance models for occlusion handling, Image Vision Comput. 24 (11) (2006) 1233–1243.
[5] J. Yang, P.A. Vela, Z. Shi, J. Teizer, Probabilistic multiple people tracking through complex situations, in: 11th IEEE International Workshop on PETS, Miami, 2009, pp. 79–86.
[6] W. Du, J. Piater, Tracking by cluster analysis of feature points using a mixture particle filter, in: IEEE Conference on Advanced Video and Signal Based Surveillance, Como, Italy, 2005, pp. 165–170.
[7] A. Bugeau, P. Perez, Detection and segmentation of moving objects in highly dynamic scenes, in: IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, 2007, pp. 2102–2109.
[8] M. Kristan, J. Pers, S. Kovacic, A. Leonardis, A local-motion-based probabilistic model for visual tracking, Pattern Recognition 42 (9) (2009) 2160–2168.
[9] T. Hai, Object tracking with Bayesian estimation of dynamic layer representations, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 75–89.
[10] Y. Qian, G. Medioni, Multiple-target tracking by spatiotemporal Monte Carlo Markov chain data association, IEEE Trans. Pattern Anal. Mach. Intell. 31 (12) (2009) 2196–2210.
[11] V. Papadourakis, A. Argyros, Multiple objects tracking in the presence of long-term occlusions, Comput. Vision Image Understanding 114 (7) (2010) 835–846.
[12] L.S. Silva, J. Scharcanski, Video segmentation based on motion coherence of particles in a video sequence, IEEE Trans. Image Process. 19 (4) (2010) 1036–1049.
[13] M. Irani, P. Anandan, About direct methods, in: B. Triggs, A. Zisserman, R. Szeliski (Eds.), Vision Algorithms: Theory and Practice, Lecture Notes in Computer Science, vol. 1883, Springer, Berlin, Heidelberg, 2000, pp. 267–277.
[14] J. Yan, M. Pollefeys, A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate, in: A. Leonardis, H. Bischof, A. Pinz (Eds.), Computer Vision - ECCV 2006, Lecture Notes in Computer Science, vol. 3954, Springer, Berlin, Heidelberg, 2006, pp. 94–106.
[15] S. Smith, J. Brady, ASSET-2: real-time motion segmentation and shape tracking, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 814–820.
[16] C. Harris, M. Stephens, A combined corner and edge detector, in: The 4th Alvey Vision Conference, 1988, pp. 147–151.
[17] J. Shi, C. Tomasi, Good features to track, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600.
[18] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2004) 91–110.
[19] N. Papadakis, A. Bugeau, Tracking with occlusions via graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 33 (1) (2011) 144–157.
[20] P. Sand, S. Teller, Particle video: long-range motion estimation using point trajectories, in: IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2195–2202.
[21] C. Stauffer, W.E.L. Grimson, Learning patterns of activity using real-time tracking, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 144–157.
[22] M. Isard, A. Blake, Condensation - conditional density propagation for visual tracking, Int. J. Comput. Vision 29 (1) (1998) 5–28.
[23] Z. Wang, X. Yang, Y. Xu, S. Yu, CamShift guided particle filter for visual tracking, Pattern Recognition Lett. 30 (4) (2009) 407–413.
[24] K. Nummiaro, An adaptive color-based particle filter, Image Vision Comput. 21 (1) (2003) 99–110.
[25] OpenCV library. [Online]. Available: http://opencvlibrary.sourceforge.net/.
[26] K. Smith, D. Gatica-Perez, J.-M. Odobez, S. Ba, Evaluating multi-object tracking, in: Computer Vision and Pattern Recognition Workshops, 2005, pp. 36–43.
[27] X. Mei, H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 99 (2011), preprint.

Huan Ding is a Ph.D. candidate in the State Key Laboratory of Intelligent Control and Management of Complex Systems at the Institute of Automation, Chinese Academy of Sciences. He received his Bachelor degree in Electrical Engineering from Peking University, China, in 2007. His research interests include image processing, computer vision and pattern recognition.

Wensheng Zhang received the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences (CAS), in 2000. He joined the Institute of Software, CAS, in 2001. He is a Professor of Machine Learning and Data Mining and the Director of the Research and Development Department, Institute of Automation, CAS. He has published over 32 papers in the areas of modeling complex systems, statistical machine learning and data mining. His research interests include computer vision, pattern recognition, artificial intelligence and computer-human interaction.

