adaptive structural model for video based pedestrian detectionbyang/papers/adaped_accv14.pdf ·...

15
Adaptive Structural Model for Video Based Pedestrian Detection Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences {jjyan,bin.yang,zlei,szli}@nlpr.ia.ac.cn Abstract. The performance of generic pedestrian detector usually declines se- riously for videos in novel scenes, which is one of the major bottlenecks for current pedestrian detection techniques. The conventional works improve pedes- trian detection in video by mining new instances from detections and adapting the detector according to the collected instances. However, when treating the two tasks separately, the detector adaptation suffers from the defective output of in- stance mining. In this paper, we propose to jointly handle the instance mining and detector adaption using an adaptive structural model. The regularization function of the model is applied on detector to prevent overfitting in adaption, and the loss function is designed to evaluate the combination of mined instances set and detector. Particularly, we extend the Deformable Part Model (DPM) to adaptive DPM, where an adaptive feature transformation defined on low-level HOG cell is learned to reduce the domain shift, and the regularization function for the detec- tor is conducted on the transformation. The loss of the instance set and detector is measured by a cost-flow network structure which incorporates both the appear- ance of frame-wise detections and their spatio-temporal continuity. We demon- strate an alternating minimization procedure to optimize the model. The proposed method is evaluated on ETHZ, PETS2009 and Caltech datasets, and outperforms baseline DPM by 7% in terms of mean miss rate. 1 Introduction Pedestrian detection has been a hot research topic for decades. Benefitting from the advances in low-level feature and high-level model, static image based pedestrian de- tection has achieved impressive progresses in both effectiveness [1–8] and efficiency [9–13]. With well-designed feature and model, current detectors trained on a large set can handle some occlusions, pose and viewpoint variations. However, the performance on novel scenes may drop disastrously due to the domain shift. For example, according to the evaluation in [14], the state-of-the-art pedestrian detector Crosstalk [12] achieves 19% mean miss rate on INRIA test set, while increases to 54% on Caltech Pedestrian Benchmark. To handle the domain shift, one promising solution is the automatic adaption of the generic detector to the target scenes, as recently explored in [15–22]. Most of the works followed an unsupervised paradigm since the annotations in novel scenes are often unavailable. These works usually considered two tasks in detector adaptation. The

Upload: others

Post on 21-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

Adaptive Structural Model for Video Based PedestrianDetection

Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li

Center for Biometrics and Security Research & National Laboratory of Pattern Recognition,Institute of Automation, Chinese Academy of Sciences

{jjyan,bin.yang,zlei,szli}@nlpr.ia.ac.cn

Abstract. The performance of generic pedestrian detector usually declines se-riously for videos in novel scenes, which is one of the major bottlenecks forcurrent pedestrian detection techniques. The conventional works improve pedes-trian detection in video by mining new instances from detections and adaptingthe detector according to the collected instances. However, when treating the twotasks separately, the detector adaptation suffers from the defective output of in-stance mining. In this paper, we propose to jointly handle the instance mining anddetector adaption using an adaptive structural model. The regularization functionof the model is applied on detector to prevent overfitting in adaption, and theloss function is designed to evaluate the combination of mined instances set anddetector. Particularly, we extend the Deformable Part Model (DPM) to adaptiveDPM, where an adaptive feature transformation defined on low-level HOG cell islearned to reduce the domain shift, and the regularization function for the detec-tor is conducted on the transformation. The loss of the instance set and detectoris measured by a cost-flow network structure which incorporates both the appear-ance of frame-wise detections and their spatio-temporal continuity. We demon-strate an alternating minimization procedure to optimize the model. The proposedmethod is evaluated on ETHZ, PETS2009 and Caltech datasets, and outperformsbaseline DPM by 7% in terms of mean miss rate.

1 Introduction

Pedestrian detection has been a hot research topic for decades. Benefitting from theadvances in low-level feature and high-level model, static image based pedestrian de-tection has achieved impressive progresses in both effectiveness [1–8] and efficiency[9–13]. With well-designed feature and model, current detectors trained on a large setcan handle some occlusions, pose and viewpoint variations. However, the performanceon novel scenes may drop disastrously due to the domain shift. For example, accordingto the evaluation in [14], the state-of-the-art pedestrian detector Crosstalk [12] achieves19% mean miss rate on INRIA test set, while increases to 54% on Caltech PedestrianBenchmark.

To handle the domain shift, one promising solution is the automatic adaption ofthe generic detector to the target scenes, as recently explored in [15–22]. Most of theworks followed an unsupervised paradigm since the annotations in novel scenes areoften unavailable. These works usually considered two tasks in detector adaptation. The

Page 2: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

2 Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li

first is to mine new positive and negative instances of the target video in an unsupervisedmanner, and the second is to adapt the the generic detector when the training instancesof the target video are collected. The standard paradigm in these works is as follows(Fig. 1 (a)), (1) conduct frame-wise detection on video; (2) use various information(e.g., tracking, background substraction and optical flow) to mine instances from thedetection result; (3) take the mined instances as online training samples to update thedetector.

Fig. 1. Different paradigms of detector adaptation for video based pedestrian detection. The con-ventional methods take the training instance mining and detector adaptation as two separate tasks.In this paper, to explore the benefit from each other, we propose to optimize the instance miningand detector adaptation jointly in a structural model.

The motivation of our approach is that the instance mining and detector adaptationprocedures should be explored jointly. For example, the detector can benefit from theconfidently mined clear instances in adaptation, and the well adapted detector can fur-ther improve the quality of mined instances. Given the frame-wise detection result, webuild a joint structural model to find an optimal combination of adapted detector andnew instances from the video (Fig. 1 (b)). Particularly, we build a frame-wise detectorwith DPM. To avoid the complexity in shifting the high dimensional DPM parametersdirectly, we propose to use a linear transformation to capture the domain shift on low-level HOG cell, which can effectively capture the variations in different conditions withmuch less parameters. The loss function in the structural model is built on a cost-flownetwork to capture the structure in video, where both the frame-wise appearance andthe video continuity among frames are encoded. We show that when the instance set isfixed, the optimal detector can be solved by standard quadratic programming, and whenthe detector is fixed, the model can be solved by efficient successive shortest-path algo-rithm. In optimizing the structural model, we conduct an alternating scheme to conductframe-wise detection and structural model adaptation iteratively.

We validate the detection performance following the protocol provided in [14] onchallenging videos from ETHZ, Caltech, and PETS2009. Our structural model de-

Page 3: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

Adaptive Structural Model for Video Based Pedestrian Detection 3

creases mean miss rate by more than 7% compared to the baseline DPM (version 5),and outperforms the best published results by a 2% margin.

The rest of the paper is organized as follows. We discuss the related work in Section2 and provide the background of DPM and cost-flow network based data associationin Section 3. The joint structural model and the corresponding optimization method aredescribed in Section 4. We demonstrate the experimental results in Section 5 and finallyconclude the paper in Section 6.

2 Related work

There are numerous works on pedestrian detection, and we refer readers to [23, 24,14] for the detailed survey. Our work is most related to the work on adapting thegeneric pedestrian detector to videos, which is recently explored in [16–22, 25, 26].These works differ in the training instances mining and detector adaptation methods.Methods presented in [16, 18, 19] mined new instances from detections according to apre-defined threshold . To reduce the noise in the detected results, context cues wereapplied to refine the detections. [19, 20, 25] explored context cues on background to re-move the detections with high scores. [16] used multiple target tracking to associate de-tections with trajectories, and took the non-associated detections as negative instances.[18] conducted KLT tracking to collect positive instances with low detection scores.[17] proposed an unsupervised tree coding method to cluster the detections. [20] pro-posed a confidence score SVM to encode the confidence scores in the model updating.The learning algorithms largely depend on the models used in the generic detector. Forexample, [25, 16, 21, 26] were built on boosting detectors, [17–20] were built on SVMdetectors, [22] was built on the deep neural network. To the best of our knowledge, thisis the first work to jointly consider the instance mining and detector adaptation in oneobjective function.

The problem setting is also related to the domain adaptation, which has been studiedextensively in computer vision. We learn a feature transformation between the sourceand the target domain, as explored in [27–29] in an unsupervised manner. The mostsimilar models as ours are the unsupervised approaches proposed in [30, 31]. How-ever, these unsupervised approaches were based on generative models, making them-selves unsuitable for real world detection tasks where discriminative models are alwaysadopted. In addition, these work are designed for images instead of videos.

Our detector is built on DPM (Deformable Part Model) [6], which is one of the state-of-the-art detectors for generic detection tasks on static images. However, as evaluatedin [14], its performance is unsatisfying for videos in real applications (e.g., Caltechbenchmark [14]). Our method can be seen as an extension of DPM to adaptive-DPM forvideos, where the detector is adapted automatically and the spatio-temporal continuityin videos is explored. Since the feature dimension in DPM is very high (more than 20K),it’s infeasible to adapt it directly. Instead we introduce a feature transform on the celllevel of HOG features, which can effectively capture the holistic low-level appearancechange caused by domain shift. In the description of the model, we use a bilinear formof DPM used in [32] to simplify our notation.

Page 4: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

4 Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li

We employ cost-flow network as part of our structural model, which is related to de-tection based multiple-target tracking [33–39]. Previous detection based multiple-targettracking methods rely on a fixed detector, while we adapt the detector automatically.The tracker outputs the trajectories of detections of multiple targets, while we focuson improving the detection performance. Due to the noticeable improvement of objectdetection in videos, our work can serve as a more reliable initialization for detectionbased tracking.

3 Preliminaries

In this section, we briefly introduce the bilinear form of DPM [6] and the cost-flowbased data association [33]. The former model is the basis of our pedestrian detector,and the latter one is the structure on which we define the objective function with bothappearance and spatio-temporal constraints.

3.1 Bilinear DPM

The popular DPM provides a hierarchical representation for pedestrians (as well asother objects). It contains a root filter and a set of deformable part filters. Without lossof generalization, we take the root as a special part here. Given an image I and a config-uration of parts ζ = {l0, l1, · · · , lm} in the detection window, we define the detectionscore of the configuration with respect to the DPM detector F as

score(F , I, ζ) =m∑i=0

wTi φa(I, li) + wT

s φs(ζ), (1)

wherewi is the filter of the ith part, and φa(I, li) is the HOG [1] feature vector extractedat li. ws is the shape prior which prefers a particular configuration, and φs(ζ) is thespatial feature vector of the configuration ζ. Here l0 is the location of the root, and li isthe location of the ith part. It is straightforward to introduce the mixture components inEq. 1, so we leave them out to simplify the exposition.

In this paper, we use the bilinear form of DPM originally introduced in [32]. Itequals to the standard DPM, but can simplify the notation of our adaptive DPM intro-duced in the next section. Similar formulation is also proposed in [8]. The HOG featureof the ith part is denoted as a nf × nk dimensional matrix φa(I, li) , where nk is thenumber of cells in the part, and nf is the dimension of gradient histogram feature vec-tor for a cell. Each column in φa(I, li) is a feature vector of a cell. φa(I, li) are furtherconcatenated to be a large matrix Φa(I, ζ) = [φa(I, l0), φa(I, l1), · · ·φa(I, lm)]. Theappearance filters in the detector are concatenated to be a matrix Wa in the same way.With these notations, the detection score of DPM [6] equals

score(F , I, ζ) = Tr(WTa Φa(I, ζ)) + wT

s φs(ζ), (2)

where Tr(·) is the trace operation which is defined as summation of the elements onthe main diagonal of a matrix. The bilinear form DPM detector is parameterized by the

Page 5: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

Adaptive Structural Model for Video Based Pedestrian Detection 5

appearance parameter matrix Wa and spatial parameter matrix ws. For a scanning win-dow in detection, only the root location l0 is given, and all the part locations are takenas latent variables which are optimized at runtime. The detection score of the slidingwindow is denoted as score(I, ζ∗), where ζ∗ is the best possible part configurationwhen the root location is fixed to be l0. Using quadratic function to model the spatialdeformation of each part, the problem can be effectively solved with linear complexity[6].

3.2 Cost-flow based Data Association

Cost-flow based data association is proposed in [33] to associate detections in a videoto be long trajectories. Finding the globally optimal trajectories for detections in videois reformulated as finding the min-cost flow in a network. Let us define the detectionset in a video as D = {d1, · · · , dn}, where di = {ζi, σi, ti} and n is the number ofdetections. ζi, σi, and ti stand for the location, scale, and frame index respectively.The detection di corresponds to an edge from node ui to vi in the network. The ciis the weight to represent the cost for di to be a pedestrian activation. For detectionsbetween different frames that have the possibility to belong to the same trajectory, anedge (vi, uj) is created and ci,j is used to represent the cost for the transition betweenvi and uj in one trajectory. To start the flow, source node s and sink node t are addedto the network, where s links to all the ui with cost cs,i and all the vi are linked tothe sink node t with cost ct,i. The cost cs,i and ct,i are used to punish the number oftrajectories. For each edge in the flow, there is an indicator to represent whether theedge is included in one trajectory, which is denoted as yi, yi,j , ys,i and yt,i for theedge (ui, vi),(vi, uj),(s, ui) and (vi, t), respectively. To interpret the network flow asno overlap trajectory, the model uses the following constraints

ys,i +

n∑j=1

yj,i = yi = yt,i +

n∑j=1

yi,j ,∀i (3)

yi, ys,i, yt,i, yi,j ∈ {0, 1},

where yi is 1 when the detection di is included in current trajectory, and otherwise 0.The above constraints guarantee that no paths share a common edge. The flow in thenetwork is specified by Y = {yi, yi,j , ys,i, yt,i}. Given the network and a configurationof Y , the total cost is

L(D,Y) =n∑

i=1

ciyi +

n∑i=1

cs,iys,i +

n∑i=1

ct,iyt,i +

n∑i=1

n∑j=1

ci,jyi,j , (4)

where costs of all activated edges are summarized. When the cost terms are properly de-fined, finding the globally optimal trajectories is equivalent to solving the min-cost flowproblem, where the cost is defined in Eq. 4 with constrains in Eq. 3. The optimizationproblem has been well explored. For example, [34] has shown an efficient successiveshortest-paths algorithm with the complexity of O(knlogn), where k is the number oftrajectories and n is the number of detections. It is efficient enough in real applications(e.g., fewer than 10 seconds for a 103-frame video with 106 detections).

Page 6: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

6 Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li

4 Adaptive Structural Model

Conventional methods consider the instance mining and detector adaptation as two sep-arate tasks. In this way, the errors in instance mining could result in the drift of thedetector, and the drift in detector adaptation could further harm the instance mining. Toavoid the vicious circle, in this work, we propose to capture instance mining and detec-tor adaptation jointly, where the instance mining and detector adaptation are handled inone objective function, and the joint optimization procedure outputs a combination ofnew instances and adapted detector.

Given the target video, we first conduct frame-wise detection and denote the detec-tion result as D. Due to the noise in detection, the detections in D cannot be taken asground truth, instead we take them as latent variables, and label D by Y via an instancemining module. Here the indicator set Y = {y1, · · · , yn} and yi ∈ {0, 1}, where yi = 1indicates that the detection di is taken as the true positive, otherwise yi = 0. Given theoriginal detector F0 and the detection set D, we propose to find the optimal adapteddetector F∗, and the new indicator set Y∗ with our designed objective function as

(F∗,Y∗) = argminF,Y

R(F ,F0) + ηL(D,F ,Y), (5)

where R(F ,F0) is used to regularize the new detector F by the original detector F0,and L(F ,D,Y) is the loss term to measure the fitness between the adapted model Fand the final detection result, which is specified by D and its indicator set Y . The lossfunction is designed to encode the structural information in the target video. Two kindsof information can be encoded, the first is the appearance in detections, and the secondis the spatio-temporal continuity in the video. The objective function in Eq. 5 natu-rally combines the instances mining and detector adaptation in a unified framework,and enables two tasks to benefit from each other. In the following parts, we define theregularization and loss function, and show how to optimize the objective function.

4.1 Adaptive DPM and Regularization

We use the generic DPM detector [6] as the initial detector F0 and aim to find an opti-mized DPM detectorF on the target video. To avoid the direct adaptation of parametersof high dimensionality in DPM, we introduce a simple but effective adaptive DPM andshow how to regularize it in the structural model defined below.

In DPM based representation, pedestrian consists of a number of local parts, andeach part is represented by HOG cells. In applying the generic pedestrian detector to anovel scene, we only consider the domain shift in appearance (e.g., illumination, imag-ing condition) and ignore the variations in viewpoint, since the viewpoint variationscould be naturally handled by the DPM mixture model. Under this assumption, weargue that the structure of parts and HOG spatial relationship between different partsshould remain unchanged in detector adaptation process, while domain shift can becaptured at feature level. Particularly, we use a linear transformation P to model themapping between the source and target domain in HOG cell level, which is a nf × nf

Page 7: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

Adaptive Structural Model for Video Based Pedestrian Detection 7

dimensional matrix. When the transformation matrix P is given, the detection score inthe adaptive DPM for a part configuration ζ is defined as

score(F , I, ζ) = Tr(WTa PΦa(I, ζ)) + wT

s φs(ζ), (6)

where an additional feature transformation P is conducted before the feature Φa(I, ζ),which is then fed into the appearance filter Wa. Here the model F is specified by Wa,ws and P . The Eq. 2 can be taken as a special case of Eq. 6 when the transformationmatrix P is the identity matrix. To avoid the overfitting in model adaptation, we use theidentity matrix I to regularize P by

R(F ,F0) = ‖P − I‖2F , (7)

where the Frobenius norm ‖ · ‖F is defined as the square root of the sum of the absolutesquares of the elements in a matrix. It is of particular importance for video based detec-tion since the feature vector is high dimensional while the number of mined instances isusually about a few hundred. The number of variables needed for adaptation is nf ×ncin the original DPM, where nc is the number of HOG cells in all parts. In adaptiveDPM, we only need to adapt nf × nf parameters, which brings about more efficiency(in typical DPM models, nc is one order larger than nf ).

4.2 Loss Function

The loss function is used to measure the detector F and indicator set Y on the targetvideo. In this paper, two kinds of information are considered. The first is the frame-wise detection information, which means that detections activated in Y should have lowappearance loss (i.e. high detection score). The second is that the activated detectionsshould satisfy the video continuity, for example a stand alone detection in video isvery likely to be a false positive and should be indicated as a false detection in Y . Tocapture the above two types of information, we borrow the idea from cost-flow baseddata association, and measure the loss of indicator set Y and detector F jointly with thefollowing function

L(F ,D,Y) =n∑

i=1

(ct,iyt,i + cs,iys,i + ciyi) (8)

where ci = max(ξ1, ξ2 − score(F , I, ζi))

s.t. ys,i +

n∑j=1

yj,i = yi = yt,i +

n∑j=1

yi,j ,∀i

and yi, yi,j , ys,i, yt,i ∈ {0, 1},

where the indicator Y now includes auxiliary variables. The score(F , I, ζi) is definedin adaptive DPM as Eq. 6. The above problem can be seen as an instantiation of the gen-eral cost-flow based data association problem introduced in Eq. 4. The appearance costci is defined as a generalized hinge loss of adaptive DPM detection score. The intuitioninside the definition is that detections with high appearance scores should be activated

Page 8: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

8 Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li

with a negative loss value, while the detections with low appearance scores should besuppressed with a positive value. The parameter ξ1 and ξ2 can be tuned according tothe range of detection scores. The costs cs,i and ct,i which involve the source and sinknodes are fixed to be positive constraint. They can be considered as a punishment to thenumber of trajectory, which can help to remove the discontinuous false positives.

4.3 Adaptive Optimization

In video based detection, we need to determine the new detector F specified by P , theframe-wise detection set D, and the indicator set Y of the detection set. Advocated byrecent latent structural learning works, we adopt the alternating minimization procedureto optimize them.

Algorithm 1 Adaptive Structural Optimization for Video based Pedestrian Detection.1: Input:

The video V , and the generic detector F0.2: Set F = F0, i.e. P = I .3: for i=1 to T1 do4: Conduct the frame-wise adaptive DPM detection procedure with detector F by Eq. 6, and

get the detection set D of the video.5: for j=1 to T2 do6: Fix the P , and solve the optimal indicator set Y by minimizing L(P,D,Y) with the

successive shortest-paths algorithm.7: Fix the Y , and solve the optimal P in Eq. 11 with standard quadratic programming

procedure.8: end for9: end for

10: return (F ,D,Y).

The whole optimization procedure is shown in Algorithm 1. In the outer loop, weconduct the adaptive DPM detector for frame-wise detection to get the detection setD. When D is fixed, the detection indicator Y and the adapted detector F are jointlyoptimized by the following problem

(Y∗, P ∗) = argminY,P‖P − I‖2F + ηL(P,D,Y) (9)

s.t. ys,i +

n∑j=1

yj,i = yi = yt,i +

n∑j=1

yi,j ,∀i

and yi, ys,i, yt,i, yi,j ∈ {0, 1},

where the L(P,D,Y) is exactly the L(F ,D,Y), since F can be specified by P . Thefilters Wa and ws in F are from the generic detector and fixed in the whole procedure.It is a difficult mixed programming non-convex problem when both Y and P are free.We therefore resort to an iterative algorithm based on the fact that solving Y given P ,

Page 9: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

Adaptive Structural Model for Video Based Pedestrian Detection 9

and solving P given Y are convex problems, and there exists off-the-shelf solvers. Indetail, we solve the following two problems.

Fix Y to solve P When Y is given, the constrains in Eq. 9 can be removed, and theproblem becomes to be

argminP‖P − I‖2F (10)

n∑i=1

yi ·max(ξ1, ξ2 − (Tr(WTa PΦa(I, ζi)) + wT

s φs(ζi))),

Since Tr(WTa PΦa(I, ζi)) is equal to Tr(PΦa(I, ζi)W

Ta ), the above problem equals

argminP‖vec(P )− vec(I)‖2 (11)

n∑i=1

yi ·max(ξ1, ξ2 − (vec(P )T vec(Φa(I, ζi)WTa ) + wT

s φs(ζ))),

where vec(·) is the operator to reshape the matrix to be a vector in a column-wise man-ner. The above problem can be solved effectively by standard quadratic programmingsolvers [40].

Fix P to solve Y When the transformation matrix P is given, Eq. 5 becomesargminY L(P,D,Y) under the cost-flow constraint, which can be effectively solvedby successive shortest-paths algorithm described in [34]. In the algorithm, we itera-tively find the minimum-cost parts γ from the source to the sink in the residual graph,and update the flow by pushing the unit-flow along γ if the total cost of the path isnegative.

Since the objective value is reduced in both of the two subproblems of the innerloop, it can be easily proved that the inner loop will converge to a local minima. We setthe loop number T1 to be 5, the loop number T2 to be 8 and validate the convergence ofthe whole optimization procedure in experiments.

5 Experiment

Experiments are conducted on challenging videos from ETHZ [41], Caltech [10] andPETS2009 1 pedestrian datasets. The ETHZ and Caltech are captured from movingcamera, while the PETS2009 is captured from stationary camera. Particularly, the Bahn-hof sequences from ETHZ, S2-L2 from PETS2009, and 8 sequences from Caltech withmost people are selected for evaluation. The ETHZ-Bahnhof sequences contain 999frames and 8467 pedestrians; the PETS2009-S2-L2 sequences contain 436 frames and8927 pedestrians; the 8 sequences from Caltech testset are with the length of about 1800frames. These videos are challenging for cluttered background, large illumination varia-tions, and heavy occlusion. The DPM detector (Version 5) trained on the INRIA dataset[1] is taken as the baseline. Since the detector can only detect pedestrians of above 120pixels in height, we resize every video frame with a scale of 2.5, and only measure

1 http://www.cvg.rdg.ac.uk/PETS2009/

Page 10: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

10 Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li

pedestrians of above 50 pixels in height as suggested in [14]. For all the experiments,ξ1, ξ2, cs,i and ct,i used in Eq. 8 are fixed to be -1, 0.2, 10 and 10, respectively.

We follow the publicly available evaluation protocol in [14], except that the eval-uations are conducted for each video separately. Full ROC curve and mean miss rate2

are used to compare different algorithms. In the following parts, we compare differ-ent detector adaptation methods and examine the convergence of the outer loop in Al-gorithm 1, and finally compare the detection performance with other state-of-the-artdetectors.

5.1 Different Methods for Video based Detection

In this part, we compare four different approaches for video based pedestrian detectionon PETS2009-S2-L2, which is challenging for appearance variations in illuminationand occlusion. These approaches include: (1) Generic DPM, the baseline DPM detector(version 5) learned on INRIA; (2) DPM + Adaptation, which iteratively adds new train-ing instances according to the frame-wise detection score, and then adapts the DPMdetector using the collected instances; (3) DPM + Tracking + Adaptation, which usesthe tracked detections as the new instances to adapt the DPM detector, where the track-ing is solved by cost-flow network; (4) The proposed method that jointly considersinstance mining and detector adaptation.

10−3

10−2

10−1

100

101

.10

.20

.30

.40

.50

.64

.80

1

false positives per image

mis

s r

ate

62.85 DPM

60.29 DPM+Adaptation

52.80 DPM+Tracking+Adaptation

48.69 Proposed method

Fig. 2. Pedestrian detection results of different methods on PETS2009-S2-L2.

ROC curves and mean miss rates of the four methods are demonstrated in Fig. 2.Due to the noise in training instances used, direct adaptation could cause the drift prob-lem, and in this experiment it only improves a small margin over the original generic

2 The mean miss rate defined in P. Dollar’s toolbox is used here, which is the average miss rateat 0.0100, 0.0178, 0.0316, 0.0562, 0.1000, 0.1778, 0.3162, 0.5623 and 1.0000 false-positive-per-image.

Page 11: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

Adaptive Structural Model for Video Based Pedestrian Detection 11

detector. Since a lot of false positives can be removed by optimizing the cost-flow net-work, the instances used in adapting the detector are clear enough in the DPM + Track-ing + Adaptation procedure, and it improves the performance with quite a large margin.Benefiting from the joint model, our approach achieves the best performance and out-performs the baseline generic DPM with 14% reduction in mean miss rate. Comparedwith the sequential instance mining and detector adaptation approach, the proposedjoint learning method further reduces the mean miss rate by 4%.

5.2 The Convergence

In this part, we validate the convergence of the proposed method. 8 videos from Cal-tech testset with most pedestrians are selected for evaluation. Since the inner loop inAlgorithm 1 is sure to converge to a local stable point, in this part, we only validatethe convergence of the outer loop, which iteratively conducts frame-wise detection andoptimizes the structural model. The selected video ID and the mean miss rate at eachloop are reported in Fig. 3.

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

L O O P 0 L O O P 1 L O O P 2 L O O P 3 L O O P 4

Set06-V002

Set07-V000

Set07-V005

Set07-V010

Set07-V011

Set10-V009

Set10-V010

Set10-V011

Mea

nM

iss R

ate

Fig. 3. The convergence illustration of the proposed optimization method in video based pedes-trian detection.

From Fig. 3, we can find the noticeable performance improvement and the fast con-vergence rate of our approach. On the 8 videos, the proposed method has an averageof 16% reduction in mean miss rate and the performance is close to convergence after3 loops. Since the first loop can mine most of the instances, it contributes most to theperformance.

5.3 Comparisons with State-of-the-art Methods

In this part, we compare the proposed method with other state-of-the-art algorithms,collected in [14], including Viola-Jones [42], Shapelet [43], LatSVM-V1, LatSVM-V2

Page 12: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

12 Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li

[6], PoseInv [44], HOGLbp [4], HikSVM [3], HOG[1], FtrMine [45], MultFtr [43],MultiFtr+CSS [43], Pls [46], MultiFtr+Motion [43], FPDW [10], FeatSynth [47], Chn-Ftrs [48], MultiResC [7], Veryfast [11], and CrossTalk [12]. We show the results of thevideo Bahnhof in ETHZ, the set07-V000 and set07-V011 in Caltech 3.

10−3

10−2

10−1

100

101

.05

.10

.20

.30

.40

.50

.64

.80

1

false positives per image

mis

s r

ate

91.77 PoseInv

90.41 Shapelet

87.53 VJ

77.92 LatSvm−V1

69.03 HikSvm

59.56 HOG

59.39 MultiFtr+Motion

58.05 MultiFtr+CSS

57.31 MultiFtr

55.09 FPDW

54.72 VeryFast

52.65 ChnFtrs

51.67 HogLbp

50.36 Pls

49.23 LatSvm−V2

47.60 DPM−V5

45.88 CrossTalk

40.30 Proposed Method

(a) ROC on ETHZ BAHNHOF

10−3

10−2

10−1

100

101

.05

.10

.20

.30

.40

.50

.64

.80

1

false positives per image

mis

s r

ate

96.51 Shapelet

96.21 VJ

92.84 PoseInv

85.71 LatSvm−V1

81.13 FtrMine

77.97 HikSvm

76.69 MultiFtr

73.56 HOG

71.67 HogLbp

63.65 LatSvm−V2

63.44 MultiFtr+CSS

63.23 FeatSynth

61.97 FPDW

60.33 MultiResC

58.13 Pls

57.52 DPM−V5

53.39 MultiFtr+Motion

49.84 ChnFtrs

47.85 Proposed method

(b) ROC on Caltech Set07-V000

10−3

10−2

10−1

100

101

.05

.10

.20

.30

.40

.50

.64

.80

1

false positives per image

mis

s r

ate

99.74 VJ

97.47 Shapelet

92.71 FtrMine

90.92 HikSvm

89.19 MultiFtr

88.88 PoseInv

87.15 HOG

84.98 MultiFtr+CSS

84.09 HogLbp

81.61 MultiFtr+Motion

79.12 FPDW

78.67 LatSvm−V2

77.66 ChnFtrs

76.92 Pls

75.98 DPM−V5

74.06 LatSvm−V1

73.02 FeatSynth

72.87 MultiResC

68.88 Proposed method

(c) ROC on Caltech Set07-V011

Fig. 4. Quantitative evaluations on ETHZ and Caltech.

Fig. 4 illustrates the quantitative results of different methods. On all the three videos,the proposed method outperforms the baseline DPM (version 5) by more than 7%, andoutperforms the published state-of-the-art results. It improves about 5% on the ETHZ

3 The two videos are selected as they contain more people than other videos.

Page 13: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

Adaptive Structural Model for Video Based Pedestrian Detection 13

Bahnhof, 2% on Caltech Set07-V000, and 4% on Caltech Set07-V011 than the bestpublished results. Some qualitative examples are shown in Fig. 5.

While the structural model optimization step is very efficient, most of the calculationis spent on frame-wise detection. In our implementation, we modify the code of the FFTbased implementation [49] for fast convolution computation. Some techniques can beused to further accelerate the loop, such as the cascade detection or only detecting asubset in the early steps of the outer loop in Algorithm 1, and we leave it in futurework.

Fig. 5. Qualitative results of the proposed video base pedestrian detection on the ETHZ, Caltechand PETS2009.

6 Conclusion

In this paper, we propose a joint structural model to adapt the generic pedestrian detec-tor for video based pedestrian detection. The instance mining and detector adaptationare formulated in one objective function, and an alternating minimization procedureis adopted to optimize it. The DPM is extended to be adaptive-DPM, where a featuretransformation defined on low-level HOG cell is used to reduce the domain shift. Wedemonstrate noticeable improvement over the methods that treat the two tasks indepen-dently, and other state-of-the-art detectors on challenging videos from Caltech, ETHZand PETS2009.

Acknowledgement. This work was supported by the Chinese National Natural Sci-ence Foundation Projects #61105023, #61103156, #61105037, #61203267, #61375037,#61473291, National Science and Technology Support Program Project #2013BAK02B01,Chinese Academy of Sciences Project No. KGZD-EW-102-2, and AuthenMetric R&DFunds.

Page 14: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

14 Junjie Yan, Bin Yang, Zhen Lei, Stan Z. Li

References

1. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR,IEEE (2005)

2. Yan, J., Lei, Z., Yi, D., Li, S.Z.: Multi-pedestrian detection in crowded scenes: A globalview. In: CVPR, IEEE (2012)

3. Maji, S., Berg, A., Malik, J.: Classification using intersection kernel support vector machinesis efficient. In: CVPR, IEEE (2008)

4. Wang, X., Han, T., Yan, S.: An hog-lbp human detector with partial occlusion handling. In:ICCV, IEEE (2009)

5. Walk, S., Majer, N., Schindler, K., Schiele, B.: New features and insights for pedestriandetection. In: CVPR, IEEE (2010)

6. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discrim-inatively trained part-based models. TPAMI (2010)

7. Park, D., Ramanan, D., Fowlkes, C.: Multiresolution models for object detection. ECCV(2010)

8. Yan, J., Zhang, X., Lei, Z., Liao, S., Li, S.Z.: Robust multi-resolution pedestrian detectionin traffic scenes. In: CVPR, IEEE (2013)

9. Huang, C., Nevatia, R.: High performance object detection by collaborative learning of jointranking of granules features. In: CVPR, IEEE (2010)

10. Dollar, P., Belongie, S., Perona, P.: The fastest pedestrian detector in the west. BMVC 2010(2010)

11. Benenson, R., Mathias, M., Timofte, R., Van Gool, L.: Pedestrian detection at 100 framesper second. In: CVPR, IEEE (2012)

12. Dollar, P., Appel, R., Kienzle, W.: Crosstalk cascades for frame-rate pedestrian detection.In: ECCV, Springer (2012)

13. Yan, J., Lei, Z., Wen, L., Li, S.Z.: The fastest deformable part model for object detection.In: CVPR. (2014)

14. Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation of the stateof the art. TPAMI (2012)

15. Yang, M., Zhu, S., Lv, F., Yu, K.: Correspondence driven adaptation for human profilerecognition. In: CVPR, IEEE (2011)

16. Sharma, P., Huang, C., Nevatia, R.: Unsupervised incremental learning for improved objectdetection in a video. In: CVPR, IEEE (2012)

17. Wang, X., Hua, G., Han, T.X.: Detection by detections: Non-parametric detector adaptationfor a video. In: CVPR, IEEE (2012)

18. Tang, K., Ramanathan, V., Fei-Fei, L., Koller, D.: Shifting weights: Adapting object detectorsfrom image to video. In: NIPS. (2012)

19. Wang, M., Wang, X.: Automatic adaptation of a generic pedestrian detector to a specifictraffic scene. In: CVPR, IEEE (2011)

20. Wang, M., Li, W., Wang, X.: Transferring a generic pedestrian detector towards specificscenes. In: CVPR, IEEE (2012)

21. Sharma, P., Nevatia, R.: Efficient detector adaptation for improved object detection in avideo. In: CVPR, IEEE (2013)

22. Yang, Y., Shu, G., Shah, M.: Semi-supervised learning of feature hierarchies for objectdetection in a video. In: CVPR, IEEE (2013)

23. Enzweiler, M., Gavrila, D.: Monocular pedestrian detection: Survey and experiments.TPAMI (2009)

24. Geronimo, D., Lopez, A., Sappa, A., Graf, T.: Survey of pedestrian detection for advanceddriver assistance systems. PAMI (2010)

Page 15: Adaptive Structural Model for Video Based Pedestrian Detectionbyang/papers/adaped_accv14.pdf · generic pedestrian detector to videos, which is recently explored in [16–22,25,26]

Adaptive Structural Model for Video Based Pedestrian Detection 15

25. Roth, P.M., Sternig, S., Grabner, H., Bischof, H.: Classifier grids for robust adaptive objectdetection. In: CVPR, IEEE (2009)

26. Pang, J., Huang, Q., Yan, S., Jiang, S., Qin, L.: Transferring boosted detectors towardsviewpoint and scene adaptiveness. TIP (2011)

27. Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new do-mains. In: ECCV. Springer (2010)

28. Kulis, B., Saenko, K., Darrell, T.: What you saw is not what you get: Domain adaptationusing asymmetric kernel transforms. In: CVPR, IEEE (2011)

29. Gao, T., Stark, M., Koller, D.: What makes a good detector?–structured priors for learningfrom few examples. In: ECCV. Springer (2012)

30. Gopalan, R., Li, R., Chellappa, R.: Domain adaptation for object recognition: An unsuper-vised approach. In: ICCV, IEEE (2011)

31. Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domainadaptation. In: CVPR, IEEE (2012)

32. Pirsiavash, H., Ramanan, D.: Steerable part models. In: CVPR, IEEE (2012)33. Zhang, L., Li, Y., Nevatia, R.: Global data association for multi-object tracking using network

flows. In: CVPR, IEEE (2008)34. Pirsiavash, H., Ramanan, D., Fowlkes, C.C.: Globally-optimal greedy algorithms for tracking

a variable number of objects. In: CVPR, IEEE (2011)35. Berclaz, J., Fleuret, F., Fua, P.: Multiple object tracking using flow linear programming. In:

PETS-Winter, IEEE (2009)36. Jiang, H., Fels, S., Little, J.J.: A linear programming approach for multiple object tracking.

In: CVPR, IEEE (2007)37. Yang, B., Huang, C., Nevatia, R.: Learning affinities and dependencies for multi-target track-

ing using a crf model. In: CVPR, IEEE (2011)38. Andriyenko, A., Schindler, K.: Globally optimal multi-target tracking on a hexagonal lattice.

In: ECCV. Springer (2010)39. Wen, L., Li, W., Yan, J., Lei, Z., Yi, D., Li, S.Z.: Multiple target tracking based on undirected

hierarchical relation hypergraph. (2014)40. Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge university press (2009)41. Ess, A., Leibe, B., Schindler, K., , van Gool, L.: A mobile vision system for robust multi-

person tracking. In: CVPR, IEEE (2008)42. Viola, P., Jones, M., Snow, D.: Detecting pedestrians using patterns of motion and appear-

ance. IJCV (2005)43. Wojek, C., Schiele, B.: A performance evaluation of single and multi-feature people detec-

tion. DAGM (2008)44. Lin, Z., Davis, L.: A pose-invariant descriptor for human detection and segmentation. ECCV

(2008)45. Dollar, P., Tu, Z., Tao, H., Belongie, S.: Feature mining for image classification. In: CVPR,

IEEE (2007)46. Schwartz, W., Kembhavi, A., Harwood, D., Davis, L.: Human detection using partial least

squares analysis. In: ICCV, IEEE (2009)47. Bar-Hillel, A., Levi, D., Krupka, E., Goldberg, C.: Part-based feature synthesis for human

detection. ECCV (2010)48. Dollar, P., Tu, Z., Perona, P., Belongie, S.: Integral channel features. In: BMVC. (2009)49. Dubout, C., Fleuret, F.: Exact acceleration of linear object detectors. ECCV (2012)