Multi-target tracking with occlusions via skeleton points assignment
Neurocomputing 83 (2012) 165–175

Huan Ding *, Wensheng Zhang

Institute of Automation, Chinese Academy of Sciences, No. 95 Zhongguancun East Road, Beijing 100190, China

* Corresponding author. Tel.: 86 010 82614489; fax: 86 010 62545229.
E-mail addresses: email@example.com (H. Ding), firstname.lastname@example.org (W. Zhang).

Article info

Received 19 May 2011
Received in revised form 6 December 2011
Accepted 10 December 2011
Communicated by Tao Mei
Available online 29 December 2011

Keywords: Skeleton points assignment

Abstract

Multiple-target tracking in complex scenes is one of the most complicated problems in computer vision. Handling the occlusions between objects is the key issue in multiple-target tracking. This paper introduces the method of motion segmentation into the object tracking system, and presents an occlusion segmentation approach based on skeleton points assignment (SPA) to track multiple people through complex situations captured by static monocular cameras. In the proposed method, we first select the skeleton points and evaluate their occlusion states by low-level information like optical flow; then we assign these points to different objects using advanced semantic information, such as appearance, motion and color; finally, a dense classification of foreground pixels is used to accomplish occlusion segmentation, and a blob-based compensation strategy is utilized to estimate the missing information of occluded objects. Object tracking is handled by a particle filter-based tracking framework, in which a probabilistic appearance model is used to find the best particle. Experiments are conducted on the challenging public dataset PETS 2009. Results show that this approach can improve the performance of the existing tracking approach and handle dynamic occlusions better.

Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.

1. Introduction

Despite a lot of attention being devoted to tracking multiple people in video sequences over the last 20 years, the problem remains a major concern in many computer vision and video processing tasks, such as video surveillance and event inference. In particular, the major challenge of multi-target tracking is the frequent presence of visual occlusions. Occlusions make the current observation totally or partially unavailable for some time intervals. The problem of dealing with occlusions correctly is still an open subject.

Solutions to managing occlusions can be decomposed into two groups, depending on whether a scene is captured by a single stationary camera or by multiple cameras. Multi-view methods fuse information from multiple views to localize people on multiple scene planes. These methods work well, except for special geometries of the camera views and people locations. These special geometries provide insufficient information to generate a unique signal for tracking, due to visual occlusions. In these cases more specialized tracking methods are needed to maintain track identity. In addition, for some applications, multiple views are not always available. Thus, this paper focuses on designing a monocular tracking methodology, with the goal of handling relatively complex occlusion scenarios.

A considerable amount of research has been reported on the treatment of occlusions from a stationary monocular camera during the last decade. Most of these works consider occlusion segmentation as part of the object model, and embed it into the tracking process. They build object models using color, appearance and motion information. Unfortunately, these models are learned for describing the postures or actions of the tracking targets, and do not fit well with occlusion segmentation.

There are also many attempts to deal with the problem of occlusions explicitly. Hai et al. decomposed video frames into coherent 2D motion layers and introduced a complete dynamic motion layer representation in which spatial and temporal constraints on shape, motion and appearance are estimated using the EM algorithm. This method has been applied in an airborne vehicle tracking system, and examples of tracking vehicles in complex interactions are demonstrated. Qian et al. proposed a framework for treating the general multiple target tracking problem, which was formulated in terms of finding the best spatial and temporal association of observations that could maximize the consistency of both motion and appearance of object trajectories. Papadourakis et al. presented a robust object tracking algorithm which could automatically build appropriate object representations by color and handle spatially extended and temporally long object occlusions. The majority of the above methods rely on the simple assumption that object color satisfies either a single Gaussian or a Gaussian mixture distribution.
Meanwhile, these approaches do not take the historical information in the scene into account. Thus all the methods mentioned above perform poorly at occlusion segmentation and multi-target tracking for objects with similar color.

Recently, some important issues in utilizing motion segmentation in tracking systems have been discussed. In the context of motion segmentation, the literature can be divided into two kinds: direct methods and feature-based methods. Direct methods recover the unknown parameters directly from measurable image quantities at each pixel in the image. This is in contrast with the feature-based methods, which first extract a sparse set of distinct features from each image separately, and then recover and analyze their correspondences in order to determine motion. Feature-based methods minimize an error measure that is based on distances between a few corresponding features, while direct methods minimize a global error measure that is based on direct image information collected from all pixels in the image. It is important to observe that with direct methods the pixel correspondence/classification is performed directly with the measurable image quantities at each pixel, while in feature-based methods this is done indirectly, based on independent feature measurements in a set of sparse pixels. An important property of the direct methods is that they can successfully estimate global motion even in the presence of multiple motions and/or outliers. However, computational time is wasted by including in the minimization a large number of pixels where no flow can be reliably estimated. On the other hand, feature-based methods initially ignore areas of low information, resulting in a problem with fewer parameters to be estimated, with good convergence even for long sequences. Hence, considering the tracking efficiency, we only focus on the feature-based methods.

Feature-based methods for motion segmentation usually consist of two independent stages: (1) feature selection and/or correspondence and (2) motion parameter estimation. The second stage is often performed through factorization methods, although some simpler clustering strategy can be used. Several methods have been proposed for sparse feature selection and/or correspondence, and among them, the most popular are the Harris Corner Detector [16,17] and SIFT. However, these sparse feature-based methods compute feature correspondences independently. Thus, they are very sensitive to outliers, making them susceptible to errors in motion parameter estimation/segmentation. Moreover, homogeneous regions of a frame may present few or no features, which makes motion estimation/segmentation difficult (or even impossible) in large areas of the video frames. In the object tracking field, Papadakis et al. utilized a graph cuts approach and separated each object into visible and occluded parts using an energy function, which contains terms based on position and motion information. Silva et al. obtained a pixel-wise segmentation by clustering a set of adaptively sampled points in the space and time domains. These methods still tend to emphasize motion segmentation, which focuses on the low-level information of the pixels. They rarely use the high-level semantic features associated with the tracking target. This leads to low tracking performance and a high computational cost.

In this paper, we present a novel approach for multi-target tracking where the scene is captured by a stationary monocular camera. Based on skeleton points assignment (SPA), our approach combines the advantages of feature-based motion segmentation methods and the probabilistic appearance-based particle filter tracking framework, as described next. Initially, a set of sparse points is computed, which we call the skeleton points. Instead of computing point correspondences independently (as done in many feature-based methods), neighboring particles are treated as if they were linked, reducing the occurrence of outliers and avoiding the aperture problem. Moreover, the density of skeleton points is adaptive, and denser distributions are used in regions where precision is more important, so we can save computation without neglecting homogeneous regions. To compute point correspondences in a video sequence, we use the approach proposed by Sand and Teller. After the skeleton points are selected, we classify them into different types according to their presence in three successive frames. The classified points are then treated separately with respect to their types. Each point is assigned to an appropriate tracking target, where high-level information, such as appearance, motion and color distribution, is utilized. After that, a pixel-wise dense clustering strategy is utilized, and every foreground pixel in the scene is clustered to the skeleton points. Then, the missing parts of occluded objects are estimated by a blob-based compensation approach. Finally, we get the tracking result in a probabilistic appearance-based particle filter tracking framework.

Compared with the existing methods, our contribution is stated as follows: firstly, we define the matching strategy and the state transition matrix of the skeleton points, to decide their states during the tracking process. Secondly, we establish the skeleton points assignment model, based on high-level features such as appearance, color distribution and motion. Finally, we densely cluster the foreground with skeleton points as kernels, which yields the full segmentation of occlusions at a lower computational cost. The proposed approach selects the skeleton points by the low-level motion information; meanwhile it treats the assignment problem of skeleton points with the high-level information. The proposed method potentially has the ability to efficiently generate a robust object appearance through complex occlusions, which is adequate for tracking.

This paper is organized as follows. Section 1 discusses the state of the art in the methods of multi-target tracking with occlusions, and contextualizes our work. An overview of the proposed approach is presented in Section 2. In Section 3, we describe the strategy of skeleton points extraction and matching. The estimation of the occlusion states of skeleton points along the video sequence is described in Section 4, as well as the classification method which is used to assign these points to corresponding targets. In Section 5, we present the dense representation for the moving targets. Section 6 introduces a rough compensation approach to estimate the missing parts of the occluded objects. Section 7 presents some experimental results obtained with the proposed approach. Finally, Section 8 summarizes the main contributions of this paper, and discusses some limitations of the proposed approach for future work.

2. Method overview

The structure of the proposed multi-target tracking approach can be divided into three main parts, as shown in Fig. 1.

1. Background modeling and foreground detection.
2. Occlusion segmentation via skeleton points assignment.
3. Multi-target tracking in a probabilistic appearance-based particle filter tracking framework.

All the stages of the proposed approach are processed sequentially. Every stage is performed for the entire video before going to the next stage.

The first part concerns the foreground detection in the scene. We utilize the Gaussian mixture model to estimate the moving points. This stage takes as input the original video frames, and outputs a set of foreground points.
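
The foreground detection stage can be illustrated with a minimal sketch. The text only states that a Gaussian mixture model is used, so the per-pixel update below is a simplified Stauffer-Grimson style scheme chosen for illustration. The names `PixelMixture` and `detect_foreground`, the grayscale flat-frame representation, and every parameter value (learning rate, match threshold, background weight, variances) are assumptions, not the authors' settings.

```python
import math


class PixelMixture:
    """Simplified mixture of up to k Gaussians over one pixel's grayscale value.

    All default parameter values are illustrative assumptions.
    """

    def __init__(self, k=3, alpha=0.05, match_sigmas=2.5,
                 bg_weight=0.7, init_var=36.0, min_var=4.0):
        self.k = k
        self.alpha = alpha                # learning rate
        self.match_sigmas = match_sigmas  # match threshold in standard deviations
        self.bg_weight = bg_weight        # cumulative weight regarded as background
        self.init_var = init_var
        self.min_var = min_var            # floor that keeps the variance from collapsing
        self.modes = []                   # each mode is [weight, mean, variance]

    def update(self, x):
        """Absorb observation x and return True if x is classified as foreground."""
        if not self.modes:
            # First observation: nothing to compare against, so treat it as background.
            self.modes.append([1.0, float(x), self.init_var])
            return False

        # Rank modes by weight / standard deviation, most background-like first.
        self.modes.sort(key=lambda m: m[0] / math.sqrt(m[2]), reverse=True)

        # Count the leading modes whose cumulative weight exceeds bg_weight.
        cumulative, n_bg = 0.0, 0
        for weight, _, _ in self.modes:
            cumulative += weight
            n_bg += 1
            if cumulative > self.bg_weight:
                break

        # Index of the first mode that x falls within, or None if there is no match.
        matched = next(
            (i for i, (_, mean, var) in enumerate(self.modes)
             if (x - mean) ** 2 <= self.match_sigmas ** 2 * var),
            None,
        )
        is_foreground = matched is None or matched >= n_bg

        # Decay all weights and reinforce the matched mode.
        for i, mode in enumerate(self.modes):
            mode[0] = (1 - self.alpha) * mode[0] + (self.alpha if i == matched else 0.0)

        if matched is not None:
            mode = self.modes[matched]
            diff = x - mode[1]
            mode[1] += self.alpha * diff
            mode[2] = max(self.min_var, mode[2] + self.alpha * (diff * diff - mode[2]))
        else:
            new_mode = [self.alpha, float(x), self.init_var]
            if len(self.modes) < self.k:
                self.modes.append(new_mode)
            else:
                self.modes[-1] = new_mode  # replace the lowest-ranked mode

        total = sum(mode[0] for mode in self.modes)
        for mode in self.modes:
            mode[0] /= total
        return is_foreground


def detect_foreground(frames):
    """Return one boolean foreground mask per frame (frames are flat lists of intensities)."""
    models = [PixelMixture() for _ in frames[0]]
    return [[model.update(value) for model, value in zip(models, frame)]
            for frame in frames]


if __name__ == "__main__":
    static = [[100, 100, 100, 100]] * 20
    masks = detect_foreground(static + [[100, 100, 200, 100]])
    print(masks[-1])
```

In this toy run the scene is static for 20 frames and one pixel then changes sharply, so only that pixel is flagged in the last mask. A practical system would operate on full color frames and would typically add morphological cleanup before passing the foreground points to the next stage.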
Fig. 1. System framework diagram (recoverable block labels: Gaussian Mixture Model, Optical Flow, Feature Points).

The second part deals with the segmentation of occlusions, and this is the key point of our study. All the foreground points are assigned to an appropriate object. This stage takes as input the original video frames and the foreground point set computed in the first stage, and returns the corresponding segmentation labels for each foreground point as outputs, representing the index of the targets. The occlusion segmentation process can be divided into four steps.
1. Skeleton points selection and matching (see Section 3): in this step, we extract skeleton points by using the optical flow in the scene. Considering a successive video sequence which consists of the current frame, the previous frame and the next frame, we estimate the matching situation of all the skeleton points, and establish a matching matrix, which is utilized to determine the occlusion states of these points.
2. Skeleton points assignment (see Section 4): here, all the matching matrices computed in the previous step are processed simultaneously to produce a unique state transition matrix of the skeleton points in the current frame. We estimate the occlusion type of these points according to the state transition matrix, and then assign these points to the appropriate target in the scene by considering high-level information.
3. Dense segmentation extraction (see Section 5): in this step, the dense segmentation is done by creating implicit functions for each foreground pixel, based on color, motion and spatial position. Foreground pixels are clustered to the skeleton points which are already labeled in the previous step, and are marked with the same label as their clustering kernel. This representation of foreground points can be employed to obtain an efficient observation of each target in the scene.
4. Occlusion comp...