
Non-Rigid Object Tracking with Elastic Structure of Local Patches and Hierarchical Sampling

Kwang Moo Yi, Soo Wan Kim, Hawook Jeong, Songhwai Oh, Jin Young Choi

Department of Electrical Engineering and Computer Science, Seoul National University, ASRI, PIRC
Email: [email protected], [email protected], {hwjeong,songhwai,jychoi}@snu.ac.kr

Abstract

To solve the problem of real-time object tracking under partial occlusions and non-rigid deformations, we propose a tracking method based on sequential Bayesian inference. The proposed method consists mainly of two parts: (1) modeling the target object using an elastic structure of local patches for robust performance; and (2) an efficient hierarchical sampling method to obtain an acceptable solution in real-time. The elastic structure of local patches allows the proposed method to handle partial occlusions and non-rigid deformations through the relationship among neighboring patches. The proposed hierarchical sampling method generates samples from the region where the posterior is concentrated to reduce computation time. The method is extensively tested on a number of challenging image sequences with occlusion and non-rigid deformation, demonstrating its real-time capability and robustness under different situations.

Keywords: Visual Tracking, Local Patches, Markov Random Field, Particle Filter, Sampling

1 Introduction

Object tracking is an important problem in the field of computer vision. Tracking is used in many applications, such as robot vision, visual surveillance, video analysis, home automation, and behavior recognition. These applications require accurate tracking results in order for the whole system to work correctly. Conventional tracking algorithms, such as the kernel-based tracking algorithm [1], particle filtering [2], and subspace-based trackers [3], have proved to be successful for these applications. However, the performance of these algorithms is somewhat limited to “lab environments”, and they do not show satisfactory results when tested on real-world scenarios.

The reason for this limitation is that conventional methods make strong assumptions about the input video sequence, such as constant movement and consistent views. Also, in real-world scenarios, frequent occlusions can occur together with non-rigid deformations, limiting the applicability of conventional methods even more. For instance, when a group of people walk together, a common case, people tend to occlude one another with non-rigid movements, making it extremely difficult for conventional methods to successfully track the target person throughout the image sequence.

Kernel-based tracking [1] is a widely used method because of its computational efficiency and robust performance, which make it suitable for real-time purposes. However, since the method uses local optimization, it is easy for the tracker to fall into local maxima (or minima). Also, the method is easily distracted by the background and has no clear way of adapting to scale and illumination changes. Particle filtering based tracking methods have become popular in recent years, since their first introduction in [2]. These methods try to solve the tracking problem using Monte Carlo (MC) simulation. Unlike kernel-based trackers, particle filtering considers not only translational movements but also affine motions. Many variations have been proposed with different measurement models, e.g., [3], [4], and [5]. Among these, subspace-based measurements [3], inspired by the work of Black et al. [6], have proven to be successful. The subspace-based methods are able to adapt to various changes, such as changes in view and illumination and changes within the model. However, these methods assume slow changes of the target object and do not consider occlusions during learning, causing the drift problem.

Recently, methods that treat the tracking problem as a Maximum A Posteriori - Markov Random Field (MAP-MRF) problem [7], and methods that model the target object as a collection of local patches [8], have drawn attention. Under this formulation, the target is modeled using an MRF, and the tracking problem becomes finding the MAP solution of this MRF. In [7], tracking is performed through feature matching, as a problem of finding the transformation which maximizes the posterior energy in the MRF formulation. Also, the rigidity and adaptiveness of the model are automatically determined from the tracking results, which enables the tracker to track various types of movements. Since this algorithm is based on features, it is robust to background clutter and occlusion problems. However, for relatively small objects or cases with noise from camera motion, feature points are unreliable and the tracking performance may degrade. Also, the method only considers the relationship between two subsequent frames without exploiting the dynamic properties of a target.

In [8], a star model of local patches is used as an MRF configuration. The target object is described with local patches, and Adaptive Basin Hopping Monte Carlo (A-BHMC) sampling is used to minimize the posterior energy. The patches describing the model are constantly updated through a heuristic scheme, which enables the tracker to adapt to drastic changes in the appearance and shape of the target. The A-BHMC reduces the number of particles required for tracking and keeps the computation time tractable. However, the heuristic update scheme is not mathematically well defined and does not have a complete problem formulation, which can cause tracking failures in situations which are not taken into account beforehand (e.g. partial occlusions are not considered in their method). Also, the star model of this method does not preserve the local dependency of each patch, making each patch dependent only on the global position of the target object. This makes the algorithm weak against occlusions, since the occluded patches will not be tracked and will be removed from the update scheme. Moreover, even with A-BHMC, this algorithm still requires a large amount of computation, as demonstrated in Section 6, which makes it difficult to run in real-time.

978-1-4244-9630-3/10/$26.00 ©2010 IEEE

Our method aims to solve the problem of partial occlusion and non-rigid deformation in real-time. The proposed algorithm models the target object through an elastic structure of local patches with spring-type connections, formulated under the MAP-MRF framework. Each local patch is considered to be connected with its neighbors, and therefore the local structures of the target object are embedded into the MRF structure. Since the patches that are not occluded will enforce the positions of the neighboring occluded patches through the relationship between patches, our method becomes robust to partial occlusions. Non-rigid deformations are also well described, since the tracking result is described as a collection of local motions of each patch. This is a distinctive feature of our method compared with other methods using MRF. Also, the proposed method does not match detected features in subsequent frames as in [7], nor does it employ a heuristic update scheme using feature detection for patches as in [8]. Rather, the proposed method embeds the MRF structure inside the sequential Bayesian inference. The proposed method therefore considers not only subsequent frames but also all images observed before the current observation. In addition, since we use many local patches, the search space is very large, and a hierarchical sampling approach is proposed to achieve real-time computation under this condition.

To demonstrate the effectiveness of our method, we have experimented with a number of challenging image sequences. The experimental results show that our method is robust against partial occlusions and non-rigid deformations, compared with other methods. In particular, our method runs in real-time, whereas the methods proposed in [3] and [8] run at only a few frames per second.

2 Sequential Bayesian Inference Framework

For our work, we treat the problem of object tracking as a sequential Bayesian inference problem. Let $X_t = (X_t^1, X_t^2, \cdots, X_t^m)$ denote the object state at time $t$, where $X_t^k$ denotes the state of the $k$th local patch of the object (in our method, just the position of the patch), and $m$ is the total number of local patches. Then, if we denote the observations (input images) up to time $t$ as $Y_{1:t}$, the problem of object tracking can be defined as follows:

$$\hat{X}_t = \arg\max_{X_t} P(X_t \mid Y_{1:t}). \tag{1}$$

For sequential Bayesian inference, the posterior probability $P(X_t \mid Y_{1:t})$ can be sequentially updated as follows:

$$P(X_t \mid Y_{1:t}) \propto P(Y_t \mid X_t) \int P(X_t \mid X_{t-1})\, P(X_{t-1} \mid Y_{1:t-1})\, dX_{t-1}. \tag{2}$$

Here, $P(Y_t \mid X_t)$ is the likelihood between the current state $X_t$ and the current observation $Y_t$, and $P(X_t \mid X_{t-1})$ is the transition probability from $X_{t-1}$ to $X_t$. Typically, for object tracking, since we consider many types of movements (e.g. translation, rotation, scale, and affine motions), searching the whole solution space is intractable. In our formulation in particular, the dimension of the solution space increases as the number of local patches increases. Therefore, as in [2], particle filtering (also known as sequential Monte Carlo sampling) is used to solve the problem.

Our method differs from traditional particle filtering methods in that the likelihood $P(Y_t \mid X_t)$ is obtained in an MRF-style manner. Therefore, object tracking is performed to maximize both the individual likelihood of each patch and the relationship among the patches. This is similar to the method proposed in [8], but they used a star model, whereas our method uses an MRF model which holds the local structure of each patch. Therefore, the structured local patches, which will be explained in Section 3, have the advantage that the resultant posterior distribution considers the local structures. To allow the proposed method to obtain a meaningful solution within the real-time constraint, unlike conventional particle filter based algorithms, our method does not use a simple random-walk model. Instead, we adopt a hierarchical sampling scheme, which focuses on sampling from the region of the solution space where the posterior is concentrated. The details of the proposed method are described in the following sections.

Figure 1: Example of elastic structure of local patches used to describe the target object.

3 Elastic Structure of Local Patches

The target object is described by local patches of size $n \times n$, and the number of patches is automatically computed according to the size of the target object, as in Fig. 1 (denoted by black boxes). The patch size $n$ is a pre-defined parameter. The local patches are connected in a spring-type elastic structure to describe the target object. Since each local patch is connected with its neighbors, forming an MRF, the local structure of the target object can be preserved. Therefore, if some of the patches are occluded, the other un-occluded patches will direct the occluded patches to the correct position, making our model robust against partial occlusions. Also, since we describe the target object using local patches, we are able to describe non-rigid deformations well.

We use the posterior probability of this structured local patches model as the likelihood $P(Y_t \mid X_t)$ in (2). Therefore, $P(Y_t \mid X_t)$ is defined as

$$P(Y_t \mid X_t) \propto \prod_{k=1}^{m} P(Y_t \mid X_t^k) \prod_{j \in N_k} P(X_t^k \mid X_t^j), \tag{3}$$

where $P(Y_t \mid X_t^k)$ is the likelihood of a single patch, $P(X_t^k \mid X_t^j)$ is the prior probability describing the relationship among neighboring patches, and $N_k$ denotes the neighbors of the $k$th patch. In the energy form, if we denote the total energy of the configuration as $E(Y_t; X_t)$, the energy of a single patch as $E(Y_t; X_t^k)$, and the energy from the relationship between patches as $E(X_t^k, X_t^j)$, we can write the total energy of the MRF model (which is simply the sum of the observation and neighborhood energy of all patches) as

$$E(Y_t; X_t) \equiv Z + \sum_{k=1}^{m} \left[ E(Y_t; X_t^k) + \sum_{j \in N_k} E(X_t^k, X_t^j) \right], \tag{4}$$

where $Z$ is a normalizing constant. Here, the relationship between (3) and (4) is that Probability $\propto \exp(-\text{Energy})$, assuming the Gibbs distribution. Now we can easily define the probabilities in terms of energy.
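As an illustration of how (3) and (4) relate, the following minimal Python sketch sums the per-patch and neighborhood energies and converts the result to an unnormalized likelihood. The function names and the container layout (`unary`, `pairwise`, `neighbors`) are assumptions made for this example only, and the constant $Z$ is dropped because it does not affect the ranking of samples.

```python
import math


def total_energy(unary, pairwise, neighbors):
    """Total MRF energy in the form of Eq. (4), without the constant Z.

    unary[k]         : E(Y_t; X_t^k), observation energy of patch k
    pairwise[(k, j)] : E(X_t^k, X_t^j), spring energy of a connected pair
    neighbors[k]     : indices j in N_k
    (all three containers are hypothetical names for this sketch)
    """
    total = 0.0
    for k, e_k in enumerate(unary):
        total += e_k
        for j in neighbors[k]:
            # As written in Eq. (4), each connection is visited from both ends.
            total += pairwise[(k, j)]
    return total


def unnormalized_likelihood(energy):
    """Gibbs relation: probability is proportional to exp(-energy)."""
    return math.exp(-energy)


# Three patches connected in a chain 0 - 1 - 2.
unary = [0.2, 0.5, 0.1]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
pairwise = {(0, 1): 0.05, (1, 0): 0.05, (1, 2): 0.1, (2, 1): 0.1}
energy = total_energy(unary, pairwise, neighbors)
likelihood = unnormalized_likelihood(energy)
```

Because the exponential turns sums into products, the likelihood computed this way equals the product of the per-patch and per-connection factors in (3) up to a constant.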

The local patches are described by 12-D features. The first nine dimensions are HOG (Histogram of Oriented Gradients) features. The nine bins denote eight directions of the image gradients and one bin for no gradient. When counting the edges, only edges with an intensity change larger than 5 are considered, to increase robustness. The remaining three dimensions are the average RGB values of the local patch. This feature is similar to the one used in [9], but one feature is assigned to a single patch, not to a single pixel as in [9]. The $E(Y_t; X_t^k)$ in (4) is defined as the distance between the model of the patch and the observation in this feature space. This 12-dimensional feature can be obtained through integral images, greatly reducing the computation time. The distance in the feature space is obtained using a truncated L-2 norm with different weights for different feature dimensions. We use a truncated L-2 norm to prevent one patch from having too large an energy and dominating all others. In other words, it is used to reduce the effect of outliers. For the $k$th local patch, if we denote the 9-D HOG of the observation as $f^{o,k}_{HOG}$, the 9-D HOG of the model as $f^{m,k}_{HOG}$, the 3-D RGB features of the observation as $f^{o,k}_{RGB}$, and the 3-D RGB features of the model as $f^{m,k}_{RGB}$, then $E(Y_t; X_t^k)$ can be defined as

$$E(Y_t; X_t^k) = \sqrt{\mu \left\langle f^{o,k}_{HOG}, f^{m,k}_{HOG} \right\rangle_{HOG} + \nu \left\langle f^{o,k}_{RGB}, f^{m,k}_{RGB} \right\rangle_{RGB}}, \tag{5}$$

where $\langle a, b \rangle_{HOG}$ and $\langle a, b \rangle_{RGB}$ are truncated L-2 norms. The $\mu$ and $\nu$ here are parameters which control the influence of HOG and RGB.
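The patch energy in (5) can be sketched in Python as below. This is only one reading of the text: the truncation is assumed to clamp the Euclidean distance at a fixed limit, the square root is assumed to cover the whole weighted sum, and the default values of `mu`, `nu`, and the two limits are taken from the settings reported in Section 6. The function names are hypothetical.

```python
import math


def truncated_l2(a, b, limit):
    """Euclidean distance between feature vectors a and b, clamped at `limit`.

    Clamping is an assumed interpretation of the "truncated L-2 norm".
    """
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(distance, limit)


def patch_energy(obs_hog, model_hog, obs_rgb, model_rgb,
                 mu=1.0, nu=8.0,
                 hog_limit=math.sqrt(8 * 0.5 ** 2),
                 rgb_limit=math.sqrt(3 * 0.26 ** 2)):
    """Observation energy of one patch, following the form of Eq. (5)."""
    d_hog = truncated_l2(obs_hog, model_hog, hog_limit)
    d_rgb = truncated_l2(obs_rgb, model_rgb, rgb_limit)
    return math.sqrt(mu * d_hog + nu * d_rgb)


# A patch whose observation matches the model exactly has zero energy.
model_hog = [0.1] * 9
model_rgb = [0.5, 0.4, 0.3]
same = patch_energy(model_hog, model_hog, model_rgb, model_rgb)

# A badly mismatched (e.g. occluded) patch saturates at the truncation limits,
# so it cannot dominate the total energy of the configuration.
outlier = patch_energy([10.0] * 9, [0.0] * 9, [1.0] * 3, [0.0] * 3)
```

The saturation in the second call is what limits the influence of outlier patches on the sum in (4).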

The relationship between neighboring patches is modeled to behave as if the neighboring patches are connected by springs. As in Fig. 1 (left), nearest-neighbors are connected (denoted by red lines). The energy between connected local patches $E(X_t^k, X_t^j)$ in (4) is defined in the form of the potential energy of a spring. Mathematically, if we denote the neighborhood strength by $\beta$, the energy is defined as

$$E(X_t^k, X_t^j) = \beta \, \frac{\left\| l_c(j,k) - l_m(j,k) \right\|_2^2}{\left\| l_m(j,k) \right\|_2^2}, \tag{6}$$

where $l_c(j,k)$ and $l_m(j,k)$ denote the distances between the $j$th and $k$th patches in the current (sample) configuration and in the model, respectively. Here, in order to deal with global scale changes, the relative change of the distance between local patches is used. With this relative change, the same energy is assigned to the same proportional change at different scales. The model is updated by a weighted average, with weight 0.01 given to the new input and 0.99 to the current model.
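A minimal Python sketch of the spring energy in (6) and of the weighted-average model update is given below. It assumes that $l_c(j,k)$ is the scalar Euclidean distance between two patch positions and that the update is applied element-wise to a model vector; both function names are hypothetical, and the default `beta` and `rate` are the values reported in Section 6.

```python
import math


def spring_energy(pos_j, pos_k, model_length, beta=0.4):
    """Spring-type energy in the form of Eq. (6) for one connection.

    pos_j, pos_k : (x, y) positions of the two patches in the current sample
    model_length : distance l_m(j, k) between the same patches in the model
    """
    current_length = math.dist(pos_j, pos_k)
    return beta * (current_length - model_length) ** 2 / model_length ** 2


def update_model(model, observation, rate=0.01):
    """Weighted average: (1 - rate) on the current model, rate on the new input."""
    return [(1.0 - rate) * m + rate * o for m, o in zip(model, observation)]


# Doubling the spring length costs the same energy at two different scales.
small = spring_energy((0.0, 0.0), (20.0, 0.0), model_length=10.0)
large = spring_energy((0.0, 0.0), (40.0, 0.0), model_length=20.0)

# One update step moves the model 1% of the way toward the observation.
new_model = update_model([1.0, 0.5], [0.0, 0.5])
```

Dividing by the squared model length is what makes `small` and `large` equal, which matches the stated goal of handling global scale changes.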

4 Hierarchical Sampling

When sampling using particle filtering to solve the tracking problem, a random-walk model is typically used at the diffusion step, i.e., when estimating $P(X_t \mid X_{t-1})$. However, in our model, the dimension of the solution space is too large to simply apply a random walk in all state dimensions. For example, if tracking with a single patch requires 100 particles to follow the movement, we would need $100^m$ particles when using $m$ patches. This requires an intractable amount of computation, which makes the method impossible to run in real-time. Therefore, simple diffusion using a random walk is not feasible.
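To make the scale of this growth concrete, consider a hypothetical target covered by $m = 9$ patches (the value 9 is chosen only for illustration), with 100 particles needed per patch:

$$100^{m} \Big|_{m=9} = 100^{9} = 10^{18},$$

which is many orders of magnitude more than the 1000 particles used in the experiments of Section 6.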

To use a small number of samples, we sample from the region where the actual solution exists with high probability. Under the assumption that the deformation of the target object is not large between subsequent frames, which holds in most situations, we diffuse particles hierarchically in two steps: globally for the motion and locally for the deformation. We first diffuse all local patches equally according to a Gaussian distribution with a relatively large variance, then diffuse each patch separately according to a Gaussian distribution with a relatively small variance. This is illustrated in Fig. 2. In the global step, the samples are diffused so that the relative positions of local patches in a sample are preserved. Then, in the local step, each local patch is diffused independently. Mathematically, the proposed hierarchical sampling method can be described as

$$X_t^{k,(l)} = X_{t-1}^{k,(l)} + \Delta^{(l)} + \delta^{k,(l)}, \tag{7}$$

where $X_t^{k,(l)}$ denotes the state of the $k$th local patch of the $l$th sample at time $t$, $\Delta^{(l)}$ denotes the 3-dimensional global diffusion (translation in the x and y directions, and scale change for the whole object) for the $l$th sample, and $\delta^{k,(l)}$ denotes the 2-dimensional local diffusion (translation in the x and y directions for a local patch) for the $k$th local patch of the $l$th sample. Here, $\Delta^{(l)} \sim \mathcal{N}(0, \sigma_G^2)$ and $\delta^{k,(l)} \sim \mathcal{N}(0, \sigma_L^2)$, where $\mathcal{N}(0, \sigma^2)$ denotes a Gaussian distribution with zero mean and standard deviation $\sigma$; $\sigma_G$ and $\sigma_L$ are parameters for the diffusion. The proposed sampling produces an acceptable solution with fewer particles than the simple random-walk approach.

Figure 2: Sampling to increase meaningful samples. Global movement of the total configuration of patches is sampled first, then local movement of individual patches is sampled.

Figure 3: Minimum energy obtained for each frame of the (a) caviar sequence and the (b) pedestrian sequence (1000 samples are used). Solid line denotes the minimum energy obtained using hierarchical sampling and dashed line denotes the minimum energy obtained using simple Gaussian diffusion for each local patch. For simple Gaussian diffusion, variance of 5 was used for both x and y directions.
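The two-step diffusion in (7) can be sketched in Python as follows. The text does not specify how the scale component of the global diffusion acts on patch positions, so this sketch assumes it rescales the patches about their centroid; that choice, the function name `diffuse_sample`, and the `rng` argument are assumptions of the example. The default standard deviations are the values reported in Section 6.

```python
import random


def diffuse_sample(patches, sigma_g=(5.0, 5.0, 0.03), sigma_l=(2.0, 2.0),
                   rng=random):
    """Diffuse one sample: one shared global move, then a small move per patch.

    patches : list of (x, y) patch positions of a single sample
    sigma_g : standard deviations of the global x, y, and scale diffusion
    sigma_l : standard deviations of the local x, y diffusion
    """
    # Global step: a single draw shared by every patch of this sample.
    dx = rng.gauss(0.0, sigma_g[0])
    dy = rng.gauss(0.0, sigma_g[1])
    scale = 1.0 + rng.gauss(0.0, sigma_g[2])

    # Assumption: scale changes are applied about the centroid of the patches.
    cx = sum(x for x, _ in patches) / len(patches)
    cy = sum(y for _, y in patches) / len(patches)

    diffused = []
    for x, y in patches:
        gx = cx + scale * (x - cx) + dx
        gy = cy + scale * (y - cy) + dy
        # Local step: an independent, smaller draw for each patch.
        diffused.append((gx + rng.gauss(0.0, sigma_l[0]),
                         gy + rng.gauss(0.0, sigma_l[1])))
    return diffused


# Diffuse a 2 x 2 grid of patches with a seeded generator for repeatability.
grid = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
moved = diffuse_sample(grid, rng=random.Random(0))
```

With the local standard deviations set to zero and no scale change, every patch receives the same offset, so the relative layout of the patches is preserved exactly, which is the property the global step is meant to have.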

The effectiveness of the proposed hierarchical sampling method is demonstrated in Fig. 3. As more samples are “focussed” where the solution is likely to exist, the energy $E(Y_t; X_t)$ of the MAP solution obtained by particle filtering is generally lower than with the simple random walk (Gaussian diffusion for each local patch). In Fig. 3 we can see some points where the minimum energy of the proposed sampling method is larger than that of simple Gaussian diffusion (around frame 130 in (a), and around frame 50 in (b)). These are cases where the tracking result of simple Gaussian diffusion gets stuck in some region due to tracking failures, and the object model learns a wrong object. When we do not use hierarchical sampling, the number of samples needed for successful tracking becomes too large to run in real-time.

5 Summary of the Proposed Method

The proposed method uses particle filtering to get a MAP solution for the object tracking problem. For the proposed method, the initial model of the target needs to be given manually. The model is updated as a weighted mean between the current observation and the model. Given the initial model $\{ f^{m,k}_{HOG}, f^{m,k}_{RGB}, l_m(j,k); \forall k \}$, the proposed method can be summarized as Algorithm 1. As described in the preceding sections, the evaluation through the structured local patches model gives robust results, and the hierarchical sampling allows the proposed method to run in real-time.

6 Experiments

The proposed method was evaluated on several image sequences. Each image sequence contains different types of situations (occlusion, out-of-plane motion, non-rigid deformation, etc.). To let particle filtering focus more on samples with lower energies, the concentration parameter $\alpha$ for the weights of the particle filter was set to 2. The truncation was done at $\sqrt{8 \times 0.5^2}$ for HOG and $\sqrt{3 \times 0.26^2}$ for RGB (RGB values were scaled to range between 0 and 1), the update weight was set to 0.01, and the number of particles was set to 1000. The parameter $\beta$ determining the strength of the neighborhood was set to 0.4, and, except for the robot sequence, $\sigma_G = (5.0, 5.0, 0.03)$ and $\sigma_L = (2.0, 2.0)$. The parameters $\mu$ and $\nu$, which control the influence of HOG and RGB, were set to around 1 and 8, respectively, according to the quality of the image sequence. The implementation was done in C++. All experiments were run on a 3.0 GHz desktop at about 20-60 frames per second, depending on the number of patches. In all result images, the thick red box denotes the overall tracking result and the thin blue boxes denote the tracking results for the local patches.
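For reference, the two truncation thresholds evaluate to the following values:

$$\sqrt{8 \times 0.5^2} = \sqrt{2} \approx 1.414, \qquad \sqrt{3 \times 0.26^2} = \sqrt{0.2028} \approx 0.450.$$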

Fig. 4 shows the result of the proposed method for the robot sequence. Since the sequence contains abrupt horizontal motion, $\sigma_G$ in Section 4 was set to $\sigma_G = (8.0, 5.0, 0.03)$. As shown in the figure, even with large occlusion in (e) and (g), or non-rigid deformation in (d) and (h) (the robot bends and lies down), the proposed method was able to track the target object correctly. For comparison, we have tested the Mean-Shift tracker [1] and the methods proposed in [3] and [8]. As in Fig. 4 (i), the Mean-Shift tracker is not able to describe complex movements and its tracking results are not accurate compared to the proposed method. In Fig. 4 (j), the method of [3] cannot describe the non-rigid deformation and tracks only a part of the object, whereas our method successfully tracks the object by describing the non-rigid deformation as a group of motions of local patches. Eventually, after learning the wrong part of the object, their method fails to track the object. As in Fig. 4 (k), the method proposed in [8] fails to track objects under occlusions, whereas our method successfully tracks the object. Also, our method ran at over 60 frames per second, whereas the method of [3] ran at 1-2 frames per second with 600 particles and the method of [8] ran at only about 1 frame per second (using the implementation the authors uploaded in [11]), neither of which is real-time.

Fig. 5 shows the tracking results for the caviar sequence. This sequence is available at [10]. As in (a)-(d), our method successfully tracks the object throughout the sequence. The Mean-Shift algorithm (e)-(h) fails under heavy occlusion and tracks a wrong person. In (i)-(l), since the method of [3] does not consider occlusions, as soon as occlusions occur, the tracking result shrinks to a small box and fails to give any meaningful result. The method of [8] (m)-(p) does not lose track of the object. However, as occlusion occurs, only the parts that are not occluded are followed. Also, their method tends to avoid tracking lower body parts, which move consistently.

Fig. 6 shows the tracking results for the highjump sequence provided in [11]. This scene contains a large amount of non-rigid deformation, as the person almost performs a back-flip. For this sequence, the Mean-Shift tracker and the method of [3] fail to track the target person. As shown in the figure, the tracking results of the proposed method and the method of [8] are similar. However, the method of [8] runs at only about one frame per second according to the authors' C++ implementation on their web site [11]. This is certainly not real-time, whereas our algorithm runs comfortably at over 60 frames per second, assuring real-time performance.

Fig. 7 shows the tracking results for the pedestrian sequence. The scene contains a lot of motion blur and the scene itself is also blurry. Even in these situations, the proposed method tracks the target object accurately. However, the method of [8] gives some inaccurate results, due to the blurriness of the scene.

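Algorithm 1 Tracking with Local Patches (for each frame)

1: For each sample, compute

$$E(Y_t; X_t) \equiv Z + \sum_{k=1}^{m} \left[ E(Y_t; X_t^k) + \sum_{j \in N_k} E(X_t^k, X_t^j) \right] \tag{4}$$

where

$$E(Y_t; X_t^k) = \sqrt{\mu \left\langle f^{o,k}_{HOG}, f^{m,k}_{HOG} \right\rangle_{HOG} + \nu \left\langle f^{o,k}_{RGB}, f^{m,k}_{RGB} \right\rangle_{RGB}} \tag{5}$$

$$E(X_t^k, X_t^j) = \beta \, \frac{\left\| l_c(j,k) - l_m(j,k) \right\|_2^2}{\left\| l_m(j,k) \right\|_2^2} \tag{6}$$

2: Assign sample weights $w$ according to the likelihood: $w \propto P(Y_t \mid X_t) \propto \exp(-\alpha \times E(Y_t; X_t))$

3: Find the MAP solution: $\hat{X}_t = \arg\max_{X_t} P(X_t \mid Y_{1:t})$ (1), (2)

4: Update object model

5: Re-sample according to weights

6: Diffuse samples: $X_{t+1}^{k,(l)} = X_t^{k,(l)} + \Delta^{(l)} + \delta^{k,(l)}$ (7)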
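Steps 2, 3, and 5 of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' C++ implementation: taking the highest-weight sample as the MAP estimate and using multinomial resampling are assumptions of the sketch, as are all function names. The default `alpha` is the value reported in Section 6.

```python
import math
import random


def sample_weights(energies, alpha=2.0):
    """Step 2: normalized weights proportional to exp(-alpha * E)."""
    # Subtracting the minimum energy avoids underflow and cancels on normalization.
    e_min = min(energies)
    raw = [math.exp(-alpha * (e - e_min)) for e in energies]
    total = sum(raw)
    return [w / total for w in raw]


def map_index(weights):
    """Step 3 (assumed approximation): index of the highest-weight sample."""
    return max(range(len(weights)), key=weights.__getitem__)


def resample(samples, weights, rng=random):
    """Step 5 (assumed multinomial): draw samples with probability ~ weight."""
    return rng.choices(samples, weights=weights, k=len(samples))


# Three samples with total energies from Eq. (4); the lowest energy wins.
energies = [3.0, 1.0, 2.0]
weights = sample_weights(energies)
best = map_index(weights)
survivors = resample(["s0", "s1", "s2"], weights, rng=random.Random(0))
```

A larger `alpha` sharpens the weights, so resampling concentrates more strongly on the low-energy samples.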

(a) Frame #2 (b) Frame #46 (c) Frame #214 (d) Frame #349

(e) Frame #535 (f) Frame #550 (g) Frame #561 (h) Frame #633

(i) Mean-Shift (#349) (j) Method [3] (#46) (k) Method [8] (#535)

Figure 4: Tracking results for the robot sequence. (a)-(h) The proposed method, (i) Mean-Shift [1], (j) the method of [3], and (k) the method of [8].

(i) Frame #27 (j) Frame #60 (k) Frame #85 (l) Frame #109

(m) Frame #27 (n) Frame #60 (o) Frame #85 (p) Frame #109

Figure 5: Tracking results for the caviar sequence. (a)-(d) The proposed method, (e)-(h) Mean-Shift, (i)-(l) the method of [3], and (m)-(p) the method of [8].

Figure 6: Tracking results for the highjump sequence. (a)-(d) The proposed method and (e)-(h) the method of [8].

Figure 7: Tracking results for the pedestrian sequence. (a)-(d) The proposed method and (e)-(h) the method of [8].

7 Conclusion

A new tracking method based on sequential Bayesian inference is proposed. The proposed method tackles the occlusion and non-rigid deformation problems of object tracking by modeling the target object with an elastic structure of local patches, and by performing hierarchical sampling in the solution space. By modeling the target object with an elastic structure of local patches, the proposed method was able to track objects with partial occlusions and non-rigid deformations. Also, through hierarchical sampling, an acceptable solution was obtainable in real-time. The method was evaluated on several image sequences with large occlusions and non-rigid deformations. The experimental results show the robustness of the proposed method against occlusions and non-rigid deformations compared to existing methods.

References

[1] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking”, IEEE Trans. on PAMI, vol. 25, pp. 564–575, 2003.

[2] M. Isard and A. Blake, “CONDENSATION - Conditional density propagation for visual tracking”, Int’l J. on Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.

[3] J. Lim, D. Ross, R. S. Lin, and M. H. Yang, “Incremental learning for visual tracking”, in Advances in Neural Information Processing Systems, 2004.

[4] P. Perez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking”, in Proc. European Conf. on Computer Vision, vol. 1, pp. 661–675, 2002.

[5] J. Kwon, K. M. Lee, and F. C. Park, “Visual Tracking via Geometric Particle Filtering on the Affine Group with Optimal Importance Functions”, in Proc. IEEE Conf. on CVPR, 2009.

[6] M. J. Black and A. D. Jepson, “EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation”, Int’l J. on Computer Vision, vol. 26, pp. 63–84, 1998.

[7] M. Yang and Y. Wu, “Granularity and Elasticity Adaptation in Visual Tracking”, in Proc. IEEE Conf. on CVPR, 2008.

[8] J. Kwon and K. M. Lee, “Tracking of Non-Rigid Object via Patch-based Dynamic Appearance Modeling and Adaptive Basin Hopping Monte Carlo Sampling”, in Proc. IEEE Conf. on CVPR, 2009.

[9] S. Avidan, “Ensemble Tracking”, IEEE Trans. on PAMI, vol. 29, pp. 261–271, 2007.

[10] Caviar project page, “Caviar Test Case Scenarios”, http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/, visited on 9/15/2010.

[11] Tracking of a Non-Rigid Object via Patch-based Dynamic Appearance Modeling and Adaptive Basin Hopping Monte Carlo Sampling, “Home Page”, http://cv.snu.ac.kr/research/ bhmctracker/index.html, visited on 9/15/2010.
