multi-view people tracking via hierarchical web.cs.ucla.edu/~yuanluxu/publications/mvot_ people...
Post on 05-May-2019
Embed Size (px)
Multi-view People Tracking via Hierarchical Trajectory Composition
Yuanlu Xu1, Xiaobai Liu2, Yang Liu1 and Song-Chun Zhu11Dept. Computer Science and Statistics, University of California, Los Angeles (UCLA)
2Dept. Computer Science, San Diego State University (SDSU)email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
This paper presents a hierarchical composition ap-proach for multi-view object tracking. The key idea is toadaptively exploit multiple cues in both 2D and 3D, e.g.,ground occupancy consistency, appearance similarity, mo-tion coherence etc., which are mutually complementarywhile tracking the humans of interests over time. Whilefeature online selection has been extensively studied in thepast literature, it remains unclear how to effectively sched-ule these cues for the tracking purpose especially when en-countering various challenges, e.g. occlusions, conjunc-tions, and appearance variations. To do so, we proposea hierarchical composition model and re-formulate multi-view multi-object tracking as a problem of compositionalstructure optimization. We setup a set of composition cri-teria, each of which corresponds to one particular cue.The hierarchical composition process is pursued by exploit-ing different criteria, which impose constraints between agraph node and its offsprings in the hierarchy. We learnthe composition criteria using MLE on annotated data andefficiently construct the hierarchical graph by an iterativegreedy pursuit algorithm. In the experiments, we demon-strate superior performance of our approach on three pub-lic datasets, one of which is newly created by us to test var-ious challenges in multi-view multi-object tracking.
1. IntroductionMulti-view multi-object tracking has attracted lots of at-
tentions in the literature . Tracking objects from mul-tiple views is by nature a composition optimization prob-lem. For example, a 3D trajectory of a human can be hier-archically decomposed into trajectories of individual views,trajectory fragments, and bounding boxes. While exist-ing trackers have exploited the above principles more orless, they enforced strong assumptions over the validity ofThis work is supported by DARPA SIMPLEX Award N66001-15-C-
4035, ONR MURI project N00014-16-1-2007, and NSF IIS 1423305. Cor-respondence author is Xiaobai Liu.
(c) (d) (e)
Figure 1. An illustration of utilizing different cues at different pe-riods for the task multi-view multi-object tracking.
a particular cue, e.g. appearance similarity , motionconsistency , sparsity [30, 50], 3D localization coinci-dence , etc., which are not always correct. Actually,different cues may dominate different periods over objecttrajectories, especially for complicated scenes. In this pa-per, we are interested in automatically discovering the op-timal compositional hierarchy for object trajectories fromvarious cues, in order to handle a wider variety of trackingscenarios.
As illustrated in Fig. 1, suppose we would like to trackthe highlighted subject and obtain its complete trajectory(e). The optimal strategy for tracking may vary over spaceand time. For example, in (a), since the subject shares thesame appearance within certain time period, we apply anappearance based tracker to get a 2D tracklet; in (b) and(c), since the subject can be fully observed from two differ-ent views, we can group these two boxes into a 3D trackletby testing the proximity of their 3D locations; in (d), sincethe subject is fully occluded in this view, we consider sam-pling its position from the 3D trajectory curve constrainedby background occupancy.
In this work, we formulate multi-view multi-objecttracking as a structure optimization problem described bya hierarchical composition model. As illustrated in Fig. 2,our objective is to discover composition gradients of each
object in the hierarchical graph. We start from structure-less tracklets, i.e., object bounding boxes, and graduallycompose them into tracklets of larger size and eventuallyinto trajectories. Each trajectory entity may be observed insingle view or multiple views. The composition process isguided by a set of criteria, which describe the compositionfeasibility in the hierarchical structure.
Each criterion focuses on one certain cue and in fact isequivalent to a simple tracker, e.g., appearance tracker [29,45], geometry tracker , motion tracker , etc., whichgroups tracklets of the same view or different views intotracklets of larger sizes. Composition criteria lie in the heartof our method: feasible compositions can be conducted re-cursively and thus the criteria can be efficiently utilized.
To infer the compositional structure, we divest MCMCsampling-based algorithms due to their heavy computationcomplexity. We approximate the hierarchy by a progressivecomposing process. The composition scheduling problemis solved by an iterative greedy pursuit algorithm. At eachstep, we first greedily find and apply the composition withmaximum probability and then re-estimate parameters forthe incremental part.
In the experiments, we evaluate the proposed methodon a set of challenging sequences and the results demon-strate superior performance over other state-of-the-art ap-proaches. Furthermore, we design a series of comparisonexperiments to systematically analyze the effectiveness ofeach criterion.
The main contributions of this work are two-fold. Firstly,we re-frame multi-view multi-object tracking as a hier-archical structure optimization problem and present threetracklet-based composition criteria to jointly exploit differ-ent kinds of cues. Secondly, we establish a new dataset tocover more challenges, to present richer visual informationand to provide more detailed annotations than existing ones.
The rest of this paper is organized as follows. We reviewthe related work in Section 2, introduce the formulation ofour approach in Section 3, and discuss the learning and in-ference procedures in Section 4. The experiments and com-parisons are presented in Section 5, and finally comes theconclusion in Section 6.
2. Related WorkOur work is closely related to the following four research
streams.Multi-object tracking has been extensively studied in
the last decades. In the literature, the tracking-by-detectionpipeline [47, 20, 33, 41, 7, 8] attracts wide-spreaded atten-tions and acquires impressive results, thanks to the consid-erable progress in object detection [12, 37, 34], as well asin data association [48, 32, 6]. In particular, network flowbased methods [32, 6] organize detected bounding boxesinto directed multiple Markov chains with chronological or-
der and pursue the trajectory as finding pathes. Andriyenkoet al.  propose to track objects in discrete space and usesplines to model trajectories in continuous space. Our ap-proach also follows this pipeline but considers boundingboxes as structureless elements. With preliminary associa-tions to preserve locality, we can better explore the nonlocalproperties  of trajectories in the time domain. For ex-ample, tracklets with evident appearance similarities can begrouped together without considering the time interval.
Multi-view object tracking is usually addressed as adata association problem across cameras. The typical solu-tions include, homography constraints [24, 4], ground prob-abilistic occupancy , network flow optimization [42, 6,25], marked point process , joint reconstruction andtracking , multi-commodity network  and muti-view SVM . All these methods have certain strong as-sumptions and thus are restricted to certain specific scenar-ios. In contrast, we are interested in discovering the optimalcomposition structure to obtain complete trajectories in awide variety of scenarios.
Hierarchical model receives heated endorsement for itseffectiveness in modeling diverse tasks. In , a stochas-tic grammar model was proposed and applied to solve theimage parsing problem. After that, Zhao et al.  and Liuet al.  introduced generative grammar models for sceneparsing. Pero et al.  further built a generative scenegrammar to model the constitutionality of Manhattan struc-tures in indoor scenes. Ross et al. presented a discriminativegrammar for the problem of object detection . Grosseet al.  formulated matrix decomposition as a structurediscovery problem and solved it by a context-free gram-mar model. In this paper, our representation can be analo-gized as a special hierarchical attributed grammar model,with similar hierarchical structures, composition criteria asproduction rules, and soft constraints as probabilistic gram-mars. The difference lies in that our model is fully recursiveand without semantics in middle levels.
Combinatorial optimization receives considerable at-tentions in the surveillance literature . When the solu-tion space is discrete and the structure cannot be topologi-cally sorted (e.g., loopy graphs), there comes the problemof combinatorial optimization. Among all the solutions,MCMC techiques are widely acknowledged. For example,Khan et al.  integrated the MCMC sampling within theparticle filer tracking framework. Yu et al.  utilized thesingle site sampler for associating foreground blobs to tra-jectories. Liu et al.  introduced a spatial-temporalgraph to jointly solve the region labeling and object trackingproblem by Swendsen-Wang Cut . In this work, thoughfacing a similar combinatorial optimization problem, wepropose a very efficient inference algorithm with acceptabletrade-off.
3. RepresentationIn this section, we first introduce the compositional hi-
erarchy representation, and then discuss the proposed prob-lem formulation for multi-view multi-object tracking.