
EDIC RESEARCH PROPOSAL 1

People Tracking in Crowded Scenes
Bo Chen

CVLAB, I&C, EPFL

Abstract—In this research proposal, we first review three related papers: a general survey of pedestrian detection [1], a state-of-the-art approach to object detection (the deformable part-based model) [2], and a tracking method that incorporates higher-order motion information and is optimized with Lagrangian relaxation [3]. We then summarize our preliminary work on consecutive ballistic trajectory recovery and mixed-template POM. Lastly, we outline a future plan to overcome existing problems, e.g., temporal inconsistency.

Index Terms—People Tracking, Pedestrian Detection, Motion Model, Constrained Optimization, Temporal Consistency

I. INTRODUCTION

PEOPLE tracking has been an active research topic in computer vision because of its potential applications in daily life, e.g., surveillance, robotics, and entertainment. However, people tracking is quite challenging and suffers from many physical limits such as low resolution [1], high intra-class variability [2], and occlusion.

To overcome the aforementioned problems, many solutions have been proposed. Most follow the tracking-by-detection paradigm, in which possible people locations are first detected and then linked in the tracking stage based on proximity or appearance cues. These two stages

Proposal submitted to committee: May 21st, 2014; Candidacy exam date: May 28th, 2014; Candidacy exam committee: Prof. Viktor Kuncak, Prof. Pascal Fua, Dr. François Fleuret

This research plan has been approved:

Date: ————————————

Doctoral candidate: ————————————(name and signature)

Thesis director: ————————————(name and signature)

Thesis co-director: ————————————(if applicable) (name and signature)

Doct. prog. director:————————————(B. Falsafi) (signature)

EDIC-ru/05.05.2009

are considered relatively independent, and are therefore sometimes studied separately.

In the detection stage, despite the variety of existing techniques, most share some key components: a multi-scale sliding-window paradigm, trained binary classifiers (e.g., SVM, boosting), and extracted image features (e.g., HOG [4]), followed by non-maximum suppression. A thorough comparison of these methods in an automotive application setting is reported in [1].

However, few of the approaches mentioned above handle occlusion correctly, even though occlusion is common in real life, especially in crowded scenes. An intuitive solution is to use part-based detections: besides the general people detector, part detectors for the face, arms, and legs are also considered, so that individuals can still be recognized even when partly occluded. As one of the most advanced part-based detectors, the deformable model [2] is reviewed in this proposal.

Another existing technique for dealing with occlusion is the Probabilistic Occupancy Map (POM) [5]. In POM, a generative model of foreground blobs is proposed, and the probability of each 2D pedestrian location is estimated so as to match the output of background subtraction. We are particularly interested in these two detection methods and hope to extend them to more applications.

On the other hand, current approaches to the tracking stage can be roughly divided into recursive methods and global-optimization methods. In the former, such as Kalman filtering [6], [7], detections are linked together recursively. Effective as some of these approaches are, they are prone to identity switches when people get close. Moreover, because of their recursive nature, mistakes, once made, are difficult to recover from.

To remedy these shortcomings, techniques for global trajectory optimization have been introduced, such as dynamic programming and linear programming. The classical network-flow formulation is widely adopted in these methods. In [8], an efficient way (k-shortest paths) to solve the min-cost flow problem is proposed. As demonstrated experimentally, the results are quite promising.

One possible way to further improve min-cost network-flow techniques is to take a motion model into account. The expectation is that trajectories become smoother and more realistic. However, a higher-order motion model makes the network flow much more difficult to optimize. Nevertheless, some methods [3], [9] have been proposed that include motion models.

In fact, one part of our preliminary work is to recover consecutive ballistic trajectories from noisy detections, which also exploits higher-order motion information. Recently, we turned to the more general problem of tracking people and other (small) objects simultaneously. To start, reliable detections of people and objects must first be made.

In our detection experiments, some limitations of current detectors became apparent, such as temporal inconsistency. Therefore, we propose to overcome these limits as part of the future research plan. Once we have decent detection output, we will focus on the tracking stage, where we can benefit from higher-order motion models. The overall goal is to build a better tool for tracking people and nearby objects.

The remainder of this proposal is organized as follows: in Section II, we analyze state-of-the-art pedestrian detectors based on the thorough evaluation in [1]; in Sections III and IV, two specific techniques for detection (the deformable part-based model [2]) and tracking (network flow with Lagrangian relaxation [3]) are presented. We summarize our preliminary work and propose the future thesis plan in Section V.

II. SURVEY OF PEDESTRIAN DETECTION

Detectors play an important role in the overall tracking system. Over years of research, many approaches to people detection have been suggested. However, it is almost impossible to compare these methods directly, because each work uses different evaluation criteria and datasets.

Thanks to the work of [1], a comparison of 16 state-of-the-art pedestrian detection techniques is finally available. In this section, we first discuss the three most notable aspects of this survey:

• a novel and large Caltech dataset with occlusion labels;
• a discussion of evaluation criteria;
• a comprehensive evaluation across 16 people detectors;

and then conclude with some comments on the survey.

A. Datasets and Statistics

The Caltech dataset was collected from a CCD camera (resolution: 640 × 480) mounted on a vehicle. Approximately 10 hours of 30 Hz video were recorded while the vehicle drove through regular traffic in an urban environment with a relatively high concentration of pedestrians. The overall image quality is lower than that of static images because of camera motion.

After video stabilization, 250k frames were annotated with a total of 350k bounding boxes around 2300 unique pedestrians. For every annotated pedestrian, a full bounding box (BB-full) indicating the full extent of the person is marked. In case of occlusion, a second bounding box delineates the visible region (BB-vis). Three labels are used: “Person” for individuals, “People” for large groups of pedestrians, and “Person?” when identification is unclear. The survey also reports several statistics about this dataset.

1) Scale Statistics: A histogram of pedestrian heights shows that height is lognormally distributed with a log-average of 50 pixels. The pedestrians are therefore grouped into three scales: near (80 or more pixels), medium (30-80 pixels), and far (30 or fewer pixels). Similarly, the aspect ratio is also lognormally distributed, with a log-average of 0.41.

Under a camera setup that mirrors expected automotive settings (640 × 480 resolution, 27-degree vertical field of view, 7.5 mm focal length) and assuming a fixed pedestrian height (1.8 m) and vehicle speed (55 km/h), an 80-pixel person is about 1.5 s away and a 30-pixel person about 4 s away. In this setting, it is most important for the system to detect medium-scale pedestrians.
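These distance/time figures follow from simple pinhole-camera geometry; the short check below is our own sketch (the function name and defaults are illustrative):

```python
import math

def time_to_contact(pixel_height, img_height=480, vfov_deg=27.0,
                    person_height_m=1.8, speed_kmh=55.0):
    """Pinhole-camera estimate of the time before the vehicle reaches
    a pedestrian who appears `pixel_height` pixels tall."""
    # Focal length in pixels from the vertical field of view.
    f_px = img_height / (2.0 * math.tan(math.radians(vfov_deg) / 2.0))
    # Distance at which a 1.8 m person projects to `pixel_height` pixels.
    distance_m = f_px * person_height_m / pixel_height
    return distance_m / (speed_kmh / 3.6)
```

With these numbers, an 80-pixel pedestrian is roughly 22.5 m (about 1.5 s) away and a 30-pixel one roughly 60 m (about 4 s) away, consistent with the survey's figures.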

2) Occlusion Statistics: In the Caltech dataset, 29% of pedestrians are never occluded, 53% are occluded in some frames, and the rest are always occluded. Thus around 70% of pedestrians are occluded at some point, underscoring the importance of detecting occluded people.

Statistics on occluded pedestrians were then compiled: 10% are partially occluded (< 35% of area), 35% are heavily occluded (35-80% of area), and the rest are fully occluded (> 80% of area). The authors also investigated which regions of a person are most likely to be occluded: the lower portion of a pedestrian is much more likely to be occluded than the top portion.

3) Position Statistics: Pedestrians are typically located in a narrow band running horizontally across the center of the image, although this only holds in the setting of this paper.

B. Evaluation Criterion

The evaluation criterion is a crucial component for quantifying and ranking detector performance. We start with an overview of full-image evaluation in Section II-B1. Then we discuss evaluation using subsets of the ground truth and detections in Section II-B2. Last of all, we compare the per-window evaluation criterion against the full-image one in Section II-B3.

1) Full Image Evaluation: Evaluation is performed on the final output, a list of detected bounding boxes. The PASCAL measure

α_o = area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt) > 0.5    (1)

is used to evaluate detections. Each BB_dt and each BB_gt can be matched at most once. Conflicts are resolved greedily: (i) detections with higher scores are matched first; (ii) in case of multiple overlaps, the match with the highest overlap percentage is kept. An unmatched BB_dt counts as a false positive and an unmatched BB_gt as a false negative.
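The greedy matching step can be sketched as follows (our own minimal version, assuming boxes are (x1, y1, x2, y2) tuples and detections are already sorted by descending score):

```python
def iou(a, b):
    """PASCAL overlap: intersection area over union area of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def greedy_match(dets, gts, thr=0.5):
    """Match detections (sorted by descending score) to ground-truth boxes.

    Returns (matches, false_positive_ids, false_negative_ids)."""
    unmatched_gt = set(range(len(gts)))
    matches, fps = [], []
    for d_idx, d in enumerate(dets):
        # Among still-unmatched ground truths, keep the highest-overlap one.
        best = max(unmatched_gt, key=lambda g: iou(d, gts[g]), default=None)
        if best is not None and iou(d, gts[best]) > thr:
            matches.append((d_idx, best))
            unmatched_gt.remove(best)
        else:
            fps.append(d_idx)
    return matches, fps, sorted(unmatched_gt)
```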

To display performance, the miss rate is plotted against false positives per image (FPPI) on log scales. The log-average miss rate is used to summarize performance; it is computed by averaging the miss rates at nine FPPI values uniformly distributed in log space.
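A minimal sketch of this summary statistic, assuming the FPPI range 10⁻² to 10⁰ used in [1] and reading "log-average" as a mean taken in log space (function names are ours):

```python
import math

def log_average_miss_rate(miss_rate_at, fppi_lo=1e-2, fppi_hi=1.0, n=9):
    """Mean (in log space) of the miss rate sampled at n FPPI values
    evenly spaced in log space between fppi_lo and fppi_hi."""
    lo, hi = math.log10(fppi_lo), math.log10(fppi_hi)
    logs = []
    for i in range(n):
        fppi = 10.0 ** (lo + i * (hi - lo) / (n - 1))
        # Clamp to avoid log(0) for perfect detectors.
        logs.append(math.log(max(miss_rate_at(fppi), 1e-12)))
    return math.exp(sum(logs) / n)
```

For a detector whose miss rate is constant, the log-average equals that constant, which makes the statistic easy to sanity-check.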

2) Filtering Ground Truth and Detections: To evaluate a detector's performance under a specific setting, e.g., medium-scale pedestrians with partial occlusion, parts of the ground-truth labels and detections must first be excluded.

The specific filtering details are skipped here due to space limits; we refer the reader to Sections 3.3 and 3.4 of the original paper.

3) Per-Window Evaluation: Another protocol for evaluating binary-classifier-based detectors is per-window (PW) evaluation. It is generally used to compare different detectors, under the assumption that better PW performance leads to better overall performance.


Fig. 1. Comparison of Evaluated Detectors. This figure is a screenshot of Table 2 in [1]; all reference numbers correspond to the references in [1].

Nonetheless, the experimental results of the paper indicate that PW and full-image performance are only weakly correlated. The reason is that a whole detector also consists of other components, e.g., the sliding window and non-maximum suppression. Moreover, a window tested during PW evaluation is usually not identical to a ground-truth BB, which biases the measured performance.

To summarize, PW evaluation does not represent the general performance of a detector; full-image evaluation is therefore preferred.

C. Comparison of Pedestrian Detectors

Sixteen representative, advanced pedestrian detectors were chosen. In almost all cases the pretrained detectors, as trained by their original authors, are used. In this section, these detectors are tested under various settings and over all datasets, and we conclude with an analysis of the experimental results. Many details are skipped here due to limited space; consult Sections 4 and 5 of [1] for more.

1) Evaluated Detectors: Even though various ways of detecting people have been developed, most share some key components: a multi-scale sliding-window paradigm, binary classifiers, discriminative features, and non-maximum suppression. Fig. 1 gives an overview of the components used in these methods.

2) Experiments on the Caltech Dataset: The detectors are tested under six conditions on the Caltech test data: overall, near and medium scale, no and partial occlusion, and clearly visible ("reasonable").

a) Overall: Absolute performance is poor, with a log-average miss rate of over 80%.

b) Scale: On the near scale (no occlusion), the best detector, MULTIFTR+MOTION [10], achieves a log-average miss rate of 22%; the other detectors are around 40%. On the medium scale, performance degrades dramatically, with miss rates around 77%.

c) Occlusion: Performance drops significantly even under partial occlusion. Surprisingly, part-based detectors degrade as severely as holistic detectors. We suspect the reason is that the resolution is too low (height ≥ 50 pixels) for the parts to be recognized.

d) Reasonable: Pedestrians over 50 pixels tall with no or partial occlusion constitute the "reasonable" evaluation setting. Here, MULTIFTR+MOTION [10] again performs best, with a miss rate of 51%.

3) Experiments on Multiple Datasets: Experiments were also performed on other datasets. Overall, the detector ranking is reasonably consistent across datasets.

4) Statistical Significance: To remove the effect of inconsistent dataset difficulty, a key insight is to use algorithm rank instead of absolute performance. After obtaining sufficiently many performance samples, the algorithms can be compared using mean rank. Again, the best detector is MULTIFTR+MOTION [10]; however, a significance test shows that the differences among the top six detectors are not statistically significant.

D. Discussion

Under the reasonable setting, the best detector is MULTIFTR+MOTION [10]. If running time is also considered, CHNFTRS [11] and FPDW [12] are better options. The key finding of the survey is that detectors degrade significantly under low resolution and heavy occlusion.

On the other hand, the main limitation is that this survey is highly task dependent, e.g., low camera resolution and a monocular camera only. Its conclusions therefore cannot represent performance in other settings.

Besides, it does not seem fair to directly compare methods trained on different datasets. We also notice an obvious inconsistency between the training and test sets: people in the INRIA training set [4] have a median height of 279 pixels, while the test set has a median height of only 48 pixels. This inconsistency undoubtedly contributes to the degradation in overall detection performance.

III. DETECTION WITH DEFORMABLE PART-BASED MODEL

In the previous section, we reviewed a survey of pedestrian detectors. We now focus on a specific detection approach [2]. The survey showed that, at relatively high resolutions, this method is quite effective.

The detection is based on a deformable part-based model: (i) "deformable" means the object can be non-rigid, so its spatial structure is exploited; (ii) "part-based" provides the ability to handle occlusion. Beyond the model itself, several optimization techniques, e.g., hard-negative data mining and latent SVM, are used to optimize the semi-convex formulation. The following subsections introduce this approach in more detail.

A. Model

All of the models involve linear filters applied to dense feature maps. A feature map is an array whose entries are d-dimensional feature vectors computed at dense locations of an image. A filter is a rectangular template composed of d-dimensional weights. The score of a filter F on feature map G at position (x, y) is the dot product of the filter and the subwindow of G with top-left corner at (x, y):

∑_{u,v} F(u, v) · G(x + u, y + v)

In practice, the feature maps must be scanned over all scales, so a feature pyramid is constructed from the image pyramid. Each octave contains λ feature maps; hence we go down λ levels in the pyramid to double the resolution.

Let F be a w × h filter, H a feature pyramid, and p = (x, y, l) a position (x, y) in the l-th level of the pyramid. Let φ(H, p) denote the vector obtained by concatenating the feature vectors in the w × h subwindow at p in H, and let F′ denote the filter weight vector; then the score is F′ · φ(H, p).
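The per-level filter score is plain 2D cross-correlation over feature vectors; a minimal pure-Python sketch (our own illustration, with feature maps stored as 2D grids of d-dimensional lists):

```python
def filter_response(G, F, x, y):
    """Score of filter F on feature map G with top-left corner at (x, y).

    G and F are 2D (row-major) lists of equal-length feature vectors; the
    score sums, over all filter cells, the dot product between the filter
    weights and the corresponding feature vectors of G."""
    score = 0.0
    for v in range(len(F)):
        for u in range(len(F[0])):
            score += sum(fw * gw for fw, gw in zip(F[v][u], G[y + v][x + u]))
    return score
```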

1) Deformable Part Models: The star models are defined by a coarse root filter and higher-resolution part filters (λ levels down). A model for an object with n parts is formally defined by an (n + 2)-tuple (F_0, P_1, ..., P_n, b):

• F_0 is the root filter;
• P_i = (F_i, v_i, d_i) specifies a part model, with F_i the part filter, v_i the standard displacement (anchor) of part i relative to the root, and d_i the deformation cost coefficients;
• b is a bias term used to make mixture models comparable.

An object hypothesis specifies the location of each filter, z = (p_0, ..., p_n), where p_i = (x_i, y_i, l_i) is the location of filter i. The level l_i of each part filter is required to be λ less than l_0, so that the part filters are computed at twice the resolution of the root.

The score of an object hypothesis is:

score(z) = ∑_{i=0}^{n} F_i′ · φ(H, p_i) − ∑_{i=1}^{n} d_i · φ_d(dx_i, dy_i) + b    (2)

(dx_i, dy_i) = (x_i, y_i) − (2(x_0, y_0) + v_i)    (3)

φ_d(dx, dy) = (dx, dy, dx², dy²)    (4)

It can also be expressed as a dot product β · ψ(H, z):

β = (F_0′, ..., F_n′, d_1, ..., d_n, b)    (5)

ψ(H, z) = (φ(H, p_0), ..., φ(H, p_n), −φ_d(dx_1, dy_1), ..., −φ_d(dx_n, dy_n), 1)    (6)

2) Matching: To detect objects in an image, an overall score for each root location is computed using the best possible placement of the parts:

score(p_0) = max_{p_1, ..., p_n} score(p_0, ..., p_n)    (7)

Dynamic programming and the generalized distance transform [13] are used to compute the best part locations as a function of the root location. The resulting method is very efficient, taking O(nk) time once the filter responses are precomputed (n: number of parts, k: number of locations).

The procedure is as follows:

R_{i,l}(x, y) = F_i′ · φ(H, (x, y, l))    (8)

D_{i,l}(x, y) = max_{dx,dy} ( R_{i,l}(x + dx, y + dy) − d_i · φ_d(dx, dy) )    (9)

score(p_0) = R_{0,l_0}(x_0, y_0) + ∑_{i=1}^{n} D_{i,l_0−λ}(2(x_0, y_0) + v_i) + b    (10)

In Eq. 8, the responses of all part filters are precomputed over all locations. A generalized distance transform [13] is then applied in Eq. 9; after the transform, D_{i,l}(x, y) is the maximum contribution of part i to the score when its anchor is placed at (x, y, l). In Eq. 10, the overall score of placing the root filter at p_0 is computed.
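To make Eq. 9 concrete, here is a brute-force 1D sketch of the deformation-aware maximization (quadratic time; [13] shows how to compute the same lower envelope in linear time):

```python
def deformation_transform(R, d1, d2):
    """D(x) = max_dx ( R[x + dx] - d1*dx - d2*dx^2 ), brute force over dx.

    R is a list of filter responses along one dimension; (d1, d2) are the
    linear and quadratic deformation cost coefficients."""
    k = len(R)
    D = []
    for x in range(k):
        # Scan every legal displacement dx keeping x + dx inside the map.
        D.append(max(R[x + dx] - d1 * dx - d2 * dx * dx
                     for dx in range(-x, k - x)))
    return D
```

For example, with a single strong response in the middle and a purely quadratic cost, the transform spreads that response to the neighbors, discounted by the squared displacement.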

3) Mixture Models: A mixture model with m components is defined by an m-tuple M = (M_1, ..., M_m). An object hypothesis for a mixture model specifies not only the locations of the filters but also the mixture component, z = (c, p_0, ..., p_{n_c}). As in the single-component case, the score of a mixture hypothesis can be expressed as a dot product β · ψ(H, z), where ψ(H, z′) occupies the slot of component c:

β = (β_1, ..., β_m)    (11)

z′ = (p_0, ..., p_{n_c})    (12)

ψ(H, z) = (0, ..., 0, ψ(H, z′), 0, ..., 0)    (13)

B. Latent SVM

The classifier takes the form of Eq. 15, where Z(x) defines the set of all possible latent values for example x:

Φ(x, z) = ψ(H(x), z)    (14)

f_β(x) = max_{z ∈ Z(x)} β · Φ(x, z)    (15)

L_D(β) = (1/2)||β||² + C ∑_{i=1}^{n} max(0, 1 − y_i f_β(x_i))    (16)

The model parameters β are trained by minimizing the standard hinge-loss objective of Eq. 16, where y_i ∈ {−1, 1} and D is the set of training examples.


1) Semi-convexity and Optimization Framework: It is easy to show that for negative examples (y_i = −1) the hinge loss is convex in β, since f_β is a maximum of linear functions. However, for positive examples (y_i = 1) the hinge loss is generally not convex, because it is the maximum of a convex function and a concave function.

Nevertheless, if there is only one latent value for each positive example, the objective function becomes convex again. Following this idea, we can iteratively minimize Eq. 16 by:

1) Relabel positive examples: optimize L_D(β, Z_p) with respect to Z_p by choosing the highest-scoring latent value for each positive example, z_i = arg max_{z ∈ Z(x_i)} β · Φ(x_i, z).

2) Optimize β: optimize L_D(β, Z_p) over β by solving the convex optimization problem defined by L_D(Z_p)(β).

With this optimization scheme, careful initialization of β is necessary.
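The alternation can be sketched on a toy problem as follows (our own illustration: each example carries a finite set of candidate feature vectors Φ(x, z), and a plain subgradient loop stands in for the convex solver):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def latent_svm_train(pos, neg, dim, C=1.0, lr=0.01, outer=5, inner=200):
    """Toy latent-SVM coordinate descent.

    pos / neg: lists of examples, each a list of candidate feature
    vectors Phi(x, z). Positives are relabeled to a single latent value
    (step 1), then beta is optimized by subgradient descent on the now
    convex hinge objective (step 2)."""
    beta = [0.0] * dim
    for _ in range(outer):
        # Step 1: relabel positives with their highest-scoring latent value.
        zp = [max(cands, key=lambda f: dot(beta, f)) for cands in pos]
        # Step 2: subgradient descent on L_D(Z_p)(beta).
        for _ in range(inner):
            grad = list(beta)  # gradient of the 0.5*||beta||^2 term
            for f in zp:       # positives: fixed latent value
                if dot(beta, f) < 1:
                    grad = [g - C * fi for g, fi in zip(grad, f)]
            for cands in neg:  # negatives: keep the max over latent values
                f = max(cands, key=lambda v: dot(beta, v))
                if -dot(beta, f) < 1:
                    grad = [g + C * fi for g, fi in zip(grad, f)]
            beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta
```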

2) Stochastic Gradient Descent: Step 2 (optimizing β) can be solved by stochastic gradient descent. Let z_i(β) = arg max_{z ∈ Z(x_i)} β · Φ(x_i, z), so that f_β(x_i) = β · Φ(x_i, z_i(β)). A subgradient of the LSVM objective is:

∇L_D(β) = β + C ∑_{i=1}^{n} h(β, x_i, y_i)    (17)

h(β, x_i, y_i) = 0 if y_i f_β(x_i) ≥ 1; −y_i Φ(x_i, z_i(β)) if y_i f_β(x_i) < 1    (18)

What follows are standard stochastic gradient descent steps, which we skip due to space limits.

3) Data Mining Hard Examples: When training a model, the huge number of negative examples makes it infeasible to consider all of them simultaneously. Instead, it is common to use "hard negative" examples to represent the whole negative set.

The idea of hard-example mining is to iteratively train on, and then collect, incorrectly classified negative examples. After several iterations, only genuinely hard negatives are kept, and they suffice to represent the whole negative set. In practice, a margin-sensitive definition, y f_β(x) < 1, is used for hard examples.
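A schematic of the mining loop (illustrative only; `train_svm` and `score` stand in for the real learner and scoring function):

```python
def mine_hard_negatives(positives, all_negatives, train_svm, score,
                        rounds=3, margin=1.0):
    """Iteratively retrain on a negative cache of margin violators.

    Starts from an initial cache, retrains, then re-scans all negatives
    and keeps those with y*f(x) < margin (y = -1, i.e. -f(x) < margin)."""
    cache = all_negatives[:100]  # initial negative cache
    model = None
    for _ in range(rounds):
        model = train_svm(positives, cache)
        # Hard negatives: negatives the current model does not separate
        # by the margin.
        cache = [x for x in all_negatives if -score(model, x) < margin]
    return model, cache
```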

C. Training Models

1) Parameter Learning: For each positive example, an image I and a bounding box B are provided; for each negative example, only an image J is given. For each (I, B) ∈ P, Z(x) is defined such that the detection window of the root filter overlaps B by at least 70 percent. For negative examples, Z(x) ranges over all locations in the feature pyramid.

The procedure Train is outlined in Fig. 2:

• lines 3 to 6 are the "relabel positive examples" step;
• lines 8 to 11 collect hard negative examples;
• line 12 runs gradient descent to optimize β;
• line 13 discards easy negative examples.

2) Initialization: The training procedure above is susceptible to local minima and therefore needs good initialization.

Fig. 2. Procedure Train, a screenshot of training steps in [2]

a) Initializing Root Filters: The bounding boxes of the positive examples are first sorted by aspect ratio and divided into m groups. Each group provides the positive examples for training one component model; negative examples are sampled randomly. A standard SVM is used to train each root filter.

b) Merging Components: The root filters (without part filters) are retrained with the procedure Train.

c) Initializing Part Filters: The top n high-energy (norm of weights) regions of the root filter are selected as part filters and interpolated to twice the spatial resolution.

D. Features

HOG-based features are adopted in this approach. The orientation θ(x, y) and magnitude r(x, y) are first computed at each pixel, and the gradient orientation is discretized into p values. Spatial aggregation then groups nearby pixels into cell histograms. After aggregation, each histogram is normalized and component-wise truncated. Finally, dimensionality reduction shrinks the feature dimension while preserving performance.
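A minimal sketch of the first two steps (gradients by central differences, orientations folded into [0, π) and discretized into p bins; the single-cell accumulation is our simplification):

```python
import math

def gradient_histogram(img, p=9):
    """Per-pixel gradient orientation and magnitude via central differences,
    accumulated into a single p-bin orientation histogram (one 'cell')."""
    h, w = len(img), len(img[0])
    hist = [0.0] * p
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = img[y][x + 1] - img[y][x - 1]
            dy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(dx, dy)
            theta = math.atan2(dy, dx) % math.pi  # fold into [0, pi)
            bin_idx = min(int(theta / math.pi * p), p - 1)
            hist[bin_idx] += mag  # magnitude-weighted vote
    return hist
```

A vertical step edge, for instance, puts all of its votes into the horizontal-gradient bin.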

E. Postprocessing

Several post-processing techniques are used to further improve performance.

1) Bounding Box Prediction: A linear least-squares regression combines the part- and root-filter locations to predict the final bounding-box location.

2) Non-maximum Suppression: As usual, greedy non-maximum suppression is applied to suppress highly overlapping detections.
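The standard greedy scheme can be sketched as (our own minimal version, using the PASCAL intersection-over-union as the overlap measure):

```python
def nms(boxes, scores, thr=0.5):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2) tuples.

    Visits boxes in order of decreasing score and keeps a box only if it
    does not overlap an already-kept box by more than `thr`."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        u = area(a) + area(b) - inter
        return inter / u if u > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thr for j in keep):
            keep.append(i)
    return keep
```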


3) Contextual Information: Contextual information, namely the highest detection score in each object category, is used to rescore detections. The idea is that objects from other categories provide evidence for (or against) the presence of a given object.

F. Experiment

The deformable part model is very powerful, obtaining the best AP score in 9 of the 20 categories and the second best in 8 (PASCAL VOC 2008). The main failure modes are confusion among classes and insufficient overlap area.

G. Discussion

The deformable part model is powerful because it uses not only appearance cues but also the spatial structure among parts. Apart from the model itself, the optimization techniques used also contribute greatly to its high performance.

The part-based nature gives the model the potential to cope with occlusion, which could be useful for us. Moreover, in theory the method can detect any category of object and is thus quite general.

IV. NETWORK FLOW WITH LAGRANGIAN RELAXATION

So far we have introduced two papers related to detection; in this section we discuss a tracking technique. Even though the standard min-cost network-flow formulation works well in experiments, it cannot exploit higher-order motion information. In [3], higher-order motion information is incorporated, and Lagrangian relaxation is used to optimize the resulting network.

A. Problem Formulation

Let l be the length of a video sequence, F_k the set of observations in frame k, and r_k the size of F_k:

F_k = {ob_1^k, ..., ob_{r_k}^k},  k = 1, ..., l    (19)

Candidate matches between observations in consecutive frames are found based on appearance similarity and spatial proximity. The i-th candidate match between frames k and k + 1 is a pair of observations

m_i^k = (ob1_{m_i^k}, ob2_{m_i^k}),  ob1_{m_i^k} ∈ F_k,  ob2_{m_i^k} ∈ F_{k+1}    (20)

and the set of all such matches is

P_k = {m_1^k, ..., m_{n_k}^k}    (21)

where n_k is the total number of candidate matches between frame k and frame k + 1. The whole matching set M therefore contains n = ∑_{t=1}^{l−1} n_t matching pairs.

A graph G = (V, E) is then generated:

• Vertex set V: a source node s, a sink node t, and two linked nodes 2i − 1 and 2i for each matched observation pair:

V = {s, t, 1, 2, ..., 2n − 1, 2n}    (22)

• Edge set E: (i) edges (2i − 1, 2i) between matched observations; (ii) edges (2i, 2j − 1) linking matching pairs in consecutive frames that share a common observation; (iii) edges (s, 2i − 1) from the source node to all incoming nodes, and (2i, t) from all outgoing nodes to the sink node.

Fig. 3. A graph example, copied from [3]

Fig. 3 shows an example of the generated graph G. In addition to the nodes and edges, further constraints must be imposed. For each observation a ∈ F_k, the set of match pairs {(a, ∗)} ⊂ P_k forms a conflict set in which at most one edge can be selected. Similarly, the set of match pairs {(∗, a)} ⊂ P_{k−1} forms another conflict set. In short, reasoning over every observation yields q conflict sets, denoted EC_i, i = 1, ..., q.

Now the whole tracking problem can be formulated as:

min f(x) = ∑_{(i,j) ∈ E} c_{ij} x_{ij}    (23)

s.t.  x_{ij} ∈ {0, 1},  ∀(i, j) ∈ E    (24)

∑_{(i,j) ∈ E} x_{ij} = ∑_{(j,k) ∈ E} x_{jk},  ∀j ∈ V − {s, t}    (25)

∑_{(i,j) ∈ EC_k} x_{ij} ≤ 1,  k = 1, ..., q    (26)

In Eq. 23, c_{ij} is the cost of an edge, which can take appearance similarity and spatial relationships into account. Eqs. 23-25 form the standard min-cost network-flow problem and can be solved with existing methods. However, the constraint in Eq. 26 breaks the standard formulation, so the problem can no longer be optimized with the standard approaches.
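For concreteness, the conflict sets EC_k can be built directly from the candidate match lists (a sketch; the (frame, source, target) triple representation is our own):

```python
from collections import defaultdict

def build_conflict_sets(matches):
    """matches: list of (k, a, b), meaning observation a in frame k is a
    candidate match of observation b in frame k+1. Returns the conflict
    sets: for each observation, the indices of the matches that leave it
    (as first element) or enter it (as second element)."""
    out_edges = defaultdict(list)  # (k, a) -> match indices with source a
    in_edges = defaultdict(list)   # (k+1, b) -> match indices with target b
    for idx, (k, a, b) in enumerate(matches):
        out_edges[(k, a)].append(idx)
        in_edges[(k + 1, b)].append(idx)
    # Only sets with at least two members actually constrain the flow.
    return [s for s in list(out_edges.values()) + list(in_edges.values())
            if len(s) >= 2]
```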

B. Lagrangian Relaxation

To optimize the objective defined in the previous subsection, Lagrangian relaxation is used. The key idea is to convert the hard constraint in Eq. 26 into a soft one:

min f(x) = ∑_{(i,j) ∈ E} c_{ij} x_{ij} + ∑_{k=1}^{q} λ_k ( ∑_{(i,j) ∈ EC_k} x_{ij} − 1 )    (27)

s.t.  x_{ij} ∈ {0, 1},  ∀(i, j) ∈ E    (28)

∑_{(i,j) ∈ E} x_{ij} = ∑_{(j,k) ∈ E} x_{jk},  ∀j ∈ V − {s, t}    (29)

With the soft constraint (λ_k > 0), the hard constraint can now be violated, but at a price. The new model is a standard min-cost network-flow formulation, which can be solved efficiently. Moreover, we can gradually increase λ_k so that violating the constraints becomes more and more expensive, pushing the soft constraints toward hard ones.

However, even though we can gradually increase λ_k, the optimization is not guaranteed to converge to a solution satisfying the hard constraint. In such cases, post-processing is needed to resolve the remaining violations; in practice, a greedy post-processing step is used.
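The penalty of Eq. 27 and the multiplier update can be sketched as follows. This is an illustrative subgradient-style update under assumed names (`relaxed_cost`, `raise_multipliers`, edge labels `e1`-`e3`), not the exact scheme of [3].

```python
def relaxed_cost(x, cost, conflict_sets, lam):
    """Eq. 27: original edge costs plus Lagrangian penalty terms."""
    base = sum(cost[e] * xe for e, xe in x.items())
    penalty = sum(lam[k] * (sum(x[e] for e in cs) - 1)
                  for k, cs in enumerate(conflict_sets))
    return base + penalty

def raise_multipliers(x, conflict_sets, lam, step):
    """Subgradient step: raise lambda_k where conflict set k is violated,
    lower it (but never below 0) where the set is slack."""
    return [max(0.0, lam[k] + step * (sum(x[e] for e in cs) - 1))
            for k, cs in enumerate(conflict_sets)]

# Hypothetical solution selecting two edges from the same conflict set.
x = {"e1": 1, "e2": 1, "e3": 0}
cost = {"e1": 1.0, "e2": 2.0, "e3": 4.0}
conflict_sets = [["e1", "e2"]]
lam = [0.0]
c0 = relaxed_cost(x, cost, conflict_sets, lam)   # violation is free
lam = raise_multipliers(x, conflict_sets, lam, step=0.5)
c1 = relaxed_cost(x, cost, conflict_sets, lam)   # violation now costs extra
```

Each round of increases makes the violating solution less attractive to the min-cost flow solver, which is the mechanism described above.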

C. Discussion

The experiments in [3] demonstrate the superiority of the Lagrangian relaxation method. However, we notice that the frame rates used in the experiments are fairly low (1-3 fps), and at the higher end (3 fps) the improvement over other approaches is already minimal. It appears that the method is most effective in low-frame-rate settings. Another possible explanation is that the dataset used is not suitable for demonstrating the potential of a higher-order motion model.

We also consider the mismatch error (mme) an inappropriate evaluation criterion and suggest the gmme criterion [14] instead. Overall, the framework is quite open and handles higher-order motion models in a principled way.

V. PRELIMINARY WORK AND THESIS PROPOSAL

A. Consecutive Ballistic Trajectory Recovery

The first part of our preliminary work is consecutive ballistic trajectory recovery. The goal is to recover from noisy detections a trajectory composed of consecutive parabolas; the assumption is correspondingly strong: the original trajectory consists only of multiple parabolas.

A simple model with the following motion-dynamics constraints works well under this assumption:

min ∑_{t=1}^{T} ∑_{i=1}^{N(t)} W_t^i ||X_t − Y_t^i||² + λ ∑_{t=1}^{T−1} ||A_t||    (30)
s.t. X_{t+1} = X_t + V_t dt    (31)
V_{t+1} = V_t + A_t dt + G dt    (32)

Y_t^i stands for the 3D location of the i-th detection at instant t, W_t^i is its associated weight, and X_t is the position we are trying to recover. We use the sparsity (l1-norm) of the acceleration A_t to model the assumption of consecutive parabolas.
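The model of Eq. 30-32 can be illustrated with a 1-D sketch. The actual work is 3-D with multiple weighted detections per frame and an iterative l1 minimization; the functions `rollout` and `objective` and the numerical values below are hypothetical.

```python
def rollout(x0, v0, accels, g, dt):
    """Integrate the dynamics of Eq. 31-32 in 1-D from an initial state."""
    xs, x, v = [x0], x0, v0
    for a in accels:
        x = x + v * dt           # Eq. 31: X_{t+1} = X_t + V_t dt
        v = v + (a + g) * dt     # Eq. 32: V_{t+1} = V_t + A_t dt + G dt
        xs.append(x)
    return xs

def objective(xs, dets, weights, accels, lam):
    """Eq. 30 in 1-D with one detection per frame."""
    data = sum(w * (x - y) ** 2 for x, y, w in zip(xs, dets, weights))
    sparsity = lam * sum(abs(a) for a in accels)  # l1 norm of acceleration
    return data + sparsity

# Pure parabola (A_t = 0): launched upward at 10 m/s with g = -10, dt = 1.
xs = rollout(0.0, 10.0, [0.0, 0.0], -10.0, 1.0)
cost = objective(xs, [0.0, 10.0, 10.0], [1.0] * 3, [0.0, 0.0], 0.1)
```

A purely ballistic motion has A_t = 0 everywhere, so the l1 penalty vanishes; non-zero accelerations are paid for, which is exactly how the sparsity term encodes the consecutive-parabola assumption.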

An iterative approach is used to minimize the objective function. Fig. 4 shows the result of a trajectory recovery: green dots are noisy detections, blue dots are the recovered result, and red dots are the ground truth. The experiment shows that the model works well, but the main limitation of this approach is that the assumption is so strong that it can only be applied to the volleyball dataset.

Fig. 4. Result of Trajectory Recovery

B. Mix Template POM

After trajectory recovery, our main focus is now on mix template POM. The original POM [5] is designed to detect one category of object from background-subtraction output. When we simply feed POM with two different templates, new problems arise:

• A large foreground blob can be explained either by one big template or by several smaller templates stacked together.

• Background subtraction is imperfect, so a player detected as a person in one frame may become several balls in the next frame.

• Occasionally, we want to impose a constraint on the number of detections. For example, there is only one ball in a ball-game sequence.

Fig. 5. Left: POM; Right: sparse POM

The first problem has already been solved by introducing a sparsity constraint, see Fig. 5. We are currently working on multi-label POM; the hope is that by creating a mixed label in the input, occlusion between objects of different categories can be handled better.


C. Thesis Proposal

Motivated by the current unsolved problems mentioned in section V-B, we will focus on the following topics in the future:

1) Temporal Consistency: Temporal consistency needs to be introduced so that template matching in POM depends not only on the current frame but also takes neighboring frames into account.

For example, we could modify the distance function Ψ in POM to also include the distance between the current synthetic image and the previous synthetic image. A better approach may be to directly require the grid probability distributions in consecutive frames to be similar, instead of using synthetic images.

However, if we modify POM this way, a new optimization approach is needed, since we now want to optimize several frames simultaneously rather than independently.
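The second idea above, comparing grid probability distributions across frames, could be sketched as a smoothness penalty. This is only a sketch; the function name `temporal_penalty` and the weight `mu` are hypothetical, and how this term is actually coupled into the POM objective remains open.

```python
def temporal_penalty(prob_maps, mu):
    """Penalize change between consecutive per-frame occupancy maps.

    prob_maps: list of per-frame lists of grid-cell occupancy probabilities.
    mu: hypothetical weight trading temporal smoothness against per-frame fit.
    """
    total = 0.0
    for prev, cur in zip(prob_maps, prob_maps[1:]):
        total += sum((p - q) ** 2 for p, q in zip(prev, cur))
    return mu * total

# Two frames over a two-cell grid: occupancy jumps between the frames.
penalty = temporal_penalty([[0.0, 1.0], [0.5, 0.5]], mu=2.0)
```

Because every frame's map appears in the term for both of its neighbors, minimizing an objective containing this penalty couples all frames together, which is why a joint multi-frame optimizer becomes necessary.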

2) Quantity Constraint: A quantity constraint should also be imposed to suppress false positives. For instance, only one ball exists in most ball games. Moreover, our tests show that simple normalization does not work, so further possibilities need to be explored.
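One naive alternative to normalization would be hard top-k pruning of the detection probabilities. This is purely an illustrative sketch of one possibility to explore, not a proposed solution; `enforce_quantity` is a hypothetical name.

```python
def enforce_quantity(probs, max_count):
    """Crude quantity constraint: keep only the max_count most probable cells."""
    if max_count >= len(probs):
        return list(probs)
    keep = set(sorted(range(len(probs)),
                      key=lambda i: probs[i], reverse=True)[:max_count])
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]

# At most one ball: only the strongest detection survives.
pruned = enforce_quantity([0.2, 0.9, 0.5], max_count=1)
```

A hard cut like this discards calibration information, which is one reason softer formulations are worth exploring.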

3) Tracking Stage: Once the detection problems are fixed, we will continue improving the tracking technique, e.g., by considering object interaction and higher-order motion models.

The ultimate goal is to build a better tracking tool that simultaneously tracks people and the (moving) objects around them, and to extend its application to real-life scenarios.

REFERENCES

[1] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 4, pp. 743–761, 2012.

[2] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 9, pp. 1627–1645, 2010.

[3] A. A. Butt and R. T. Collins, “Multi-target tracking by Lagrangian relaxation to min-cost network flow,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 1846–1853.

[4] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.

[5] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multicamera people tracking with a probabilistic occupancy map,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 2, pp. 267–282, 2008.

[6] J. Black, T. Ellis, and P. Rosin, “Multi view image surveillance and tracking,” in Motion and Video Computing, 2002. Proceedings. Workshop on. IEEE, 2002, pp. 169–174.

[7] A. Mittal and L. S. Davis, “M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene,” International Journal of Computer Vision, vol. 51, no. 3, pp. 189–203, 2003.

[8] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 9, pp. 1806–1819, 2011.

[9] A. Andriyenko and K. Schindler, “Globally optimal multi-target tracking on a hexagonal lattice,” in Computer Vision–ECCV 2010. Springer, 2010, pp. 466–479.

[10] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New features and insights for pedestrian detection,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1030–1037.

[11] P. Dollar, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” in BMVC, vol. 2, no. 3, 2009, p. 5.

[12] P. Dollar, S. Belongie, and P. Perona, “The fastest pedestrian detector in the west,” in BMVC, vol. 2, no. 3. Citeseer, 2010, p. 7.

[13] P. Felzenszwalb and D. Huttenlocher, “Distance transforms of sampled functions,” Cornell University, Tech. Rep., 2004.

[14] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, “Multi-commodity network flow for tracking multiple people,” 2013.