
Orientation and Scale Invariant Kernel-Based

Object Tracking with Probabilistic Emphasizing

Kwang Moo Yi, Soo Wan Kim, and Jin Young Choi

ASRI, PIRC, Dept. of Electrical Engineering and Computer Science, Seoul National University, Seoul, Korea
{kmyi,swkim,jychoi}@neuro.snu.ac.kr

Abstract. Tracking objects with complex movements under background clutter is a challenging problem. The widely used mean-shift algorithm shows unsatisfactory results in such situations. To solve this problem, we propose a new mean-shift based tracking algorithm. Our method consists of three parts. First, a new objective function for mean-shift is proposed to handle background clutter problems. Second, an orientation estimation method is proposed to extend the dimension of trackable movements. Third, a method using a new scale descriptor is proposed to adapt to scale changes of the object. To demonstrate the effectiveness of our method, we tested it on several image sequences. Our algorithm is shown to be robust to background clutter and is able to track complex movements very accurately even in shaky scenarios.

1 Introduction

Tracking objects using the mean-shift algorithm is a popular approach in the field of object tracking. The algorithm has the advantages that it is relatively easy to implement, does not require heavy computation, and shows robust results in practical object tracking tasks. However, the original mean-shift algorithm performs poorly when the object exhibits complicated movements and there are objects similar to the target in the nearby background region. This is due to three major problems of the original mean-shift algorithm. The first problem is the background clutter effect during mean-shift iterations, which may lead to tracking failures. The second is the lack of ability to track elaborate movements such as in-plane rotation. The third is its inability to adapt to scale changes (i.e. the kernel bandwidth is fixed), which is critical to tracking performance. These problems greatly affect object tracking results, but have not been clearly solved.

The first problem has usually been approached with the use of anisotropic kernels such as ellipsoids [1]. This reduces the amount of background information in the object model, but the model still contains background pixels, which causes background clutter problems. A. Yilmaz proposed a method using level-set kernels which take the exact shape of the object [2]. This level-set kernel has no restriction in shape and succeeds in excluding the background information inside the object model. Unfortunately, even when using

H. Zha, R.-i. Taniguchi, and S. Maybank (Eds.): ACCV 2009, Part II, LNCS 5995, pp. 130–139, 2010. © Springer-Verlag Berlin Heidelberg 2010


these kernels, background information is still included inside the target candidate. R. Collins et al. used online selection of discriminative features to overcome the effect of the background information [3]. Their method succeeds in obtaining features which make the target object more discriminative against the background. But, as they noted in their paper, the number of features to be selected is not certain. Moreover, the computation time increases in proportion to the number of selected features. The second problem has not been covered much in the field of mean-shift object tracking, since only translational movements can be estimated through the mean-shift vector. Rather, to track elaborate movements, other well-known tracking algorithms such as the particle filter are used [4], [5], [6], [7]. However, for very complex movements, tracking with a particle filter is hard to do in real time. Other silhouette tracking methods, such as tracking via direct minimization of a contour energy function [8], are also capable of tracking elaborate movements, but require even more computation. The third problem was intuitively addressed in [1] by the 10% method, but this method does not work well because it tends to prefer the smaller kernel. R. Collins proposed a method using a difference of Gaussian (DOG) mean-shift kernel in scale space [10] to solve this problem. However, this method is computationally expensive. C. Yang et al. [11] succeeded in tracking objects with scale changes using the joint feature-spatial space concept [12], but their method adapts to scale changes without considering the regional changes of the template. Yi et al. seemed to solve all three of these problems [14] but, as they noted, the estimation results are somewhat unstable.

In this paper, to overcome these three problems of mean-shift, we propose a new mean-shift based object tracking method. Our method consists of three parts. First, we propose an altered objective function for mean-shift, which makes the tracker robust to background clutter. Second, we propose an orientation estimation method to track objects with in-plane rotation. Third, we propose a method which utilizes a new scale descriptor to adapt to scale changes of the object. The test results show that the proposed algorithm is superior to the original mean-shift algorithm and is also comparable to another popular tracking algorithm, the particle filter.

The paper is organized as follows. Section 2 briefly reviews the original mean-shift algorithm for reference. Next, the proposed method is explained in detail in Section 3. Experimental results of our proposed algorithm are given in Section 4 and, finally, we conclude the paper in Section 5.

2 Mean Shift Tracking: Brief Review

In this section, we give a brief review of the original mean-shift algorithm [1]. The mean-shift method is a fast way of finding the local maxima of a sample distribution iteratively from a given starting position. In the field of object tracking, this sample is the color observed at a pixel x. For this x, the sample weight w(x) is defined as

w(x) = \sqrt{h_m(I(x)) / h_c(I(x))},   (1)


where I(x) is the color of pixel x, and h_m and h_c are the color histograms generated from the model and candidate object regions, respectively. Let the initial hypothesized position be y_{old}, the computed new position be y_{new}, the pixels inside the candidate region be x_i, \Delta y = y_{new} - y_{old}, and let K(\cdot) be the radially symmetric kernel defining the tracked object region. Then, using the sample weight (1), the mean shift vector is computed as

\Delta y = \frac{\sum_i K(x_i - y_{old}) w(x_i) (x_i - y_{old})}{\sum_i K(x_i - y_{old}) w(x_i)}.   (2)

This mean shift vector is an estimate of the gradient of the sample distribution. Using this mean shift vector, tracking of the object is performed iteratively.
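For concreteness, the sample weight (1) and one mean-shift step (2) can be sketched in NumPy as below; the quantized color indices, toy histograms, and the truncated-quadratic kernel profile are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def sample_weights(colors, h_model, h_cand):
    # Eq. (1): w(x) = sqrt(h_m(I(x)) / h_c(I(x))), with colors given as
    # pre-quantized histogram-bin indices per pixel
    return np.sqrt(h_model[colors] / (h_cand[colors] + 1e-10))

def mean_shift_step(pixels, colors, y_old, bandwidth, h_model, h_cand):
    """One mean-shift iteration, Eq. (2), returning Delta-y."""
    w = sample_weights(colors, h_model, h_cand)
    d = pixels - y_old                           # x_i - y_old
    r2 = np.sum((d / bandwidth) ** 2, axis=1)
    K = np.maximum(1.0 - r2, 0.0)                # radially symmetric kernel K(x_i - y_old)
    num = np.sum((K * w)[:, None] * d, axis=0)
    den = np.sum(K * w) + 1e-10
    return num / den
```

Pixels whose color is likelier under the model histogram than under the candidate histogram get larger weights, so the step is pulled toward them.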

The weight w(x_i) in (1) is derived from the Taylor expansion of the Bhattacharyya coefficient used in [1]. The Bhattacharyya coefficient is defined as \rho(y) \equiv \rho[p_c(y), p_m] = \int \sqrt{p_{cz}(y) p_{mz}} \, dz, where p_c(y) denotes the probability distribution of the candidate when y is the center of the candidate, p_m denotes the probability distribution of the object model, and z indexes the feature. In our case, we use color histograms as features, therefore the estimate of the Bhattacharyya coefficient can be defined as

\hat{\rho}(y) \equiv \hat{\rho}[p_c(y), p_m] = \sum_{\nu} \sqrt{h_c(\nu, y) h_m(\nu)},   (3)

where \hat{} denotes the estimator and h_c(\nu, y) is the histogram value for color \nu of the candidate when the candidate is at position y. Using a Taylor expansion around some point y_0 and kernel estimation, this equation can be approximated as follows [1]:

\hat{\rho}[p_c(y), p_m] \approx \frac{1}{2} \sum_{\nu} \sqrt{h_c(\nu, y_0) h_m(\nu)} + \frac{C_h}{2} \sum_i w(x_i) K(y - x_i),   (4)

where C_h denotes a normalizing constant.

3 The Proposed Method

3.1 Probabilistic Emphasizing

To solve the background clutter problem, we propose an altered objective function for mean-shift. This objective function emphasizes features that are more likely to belong to the target object rather than the background. Generally, in mean-shift, color histograms are used as features to describe an object. Therefore, in our work, we also used the probability distribution of colors, i.e. color histograms, as features. We first start by obtaining the probability of some color \nu being in the object model. If we denote the probability of being in the object model as p(Obj), the probability of being in the model for \nu can be denoted as


p(Obj|\nu). This p(Obj|\nu) can be interpreted as the probability of a pixel with color \nu being in the object model, i.e. 1 - p(Obj|\nu) is the probability of that pixel being in the background. Then, using Bayes' rule, p(Obj|\nu) can be obtained by p(Obj|\nu) = p(\nu|Obj) p(Obj) / p(\nu). Here, p(\nu|Obj) can be estimated with the color histogram of the object model, p(Obj) with the area ratio of the object region and the selected background region, and p(\nu) with the color histograms of the background and object regions. Therefore, if we denote the color histogram of the background region as h_{BG}(\cdot), the area of the candidate region as A_c, and the area of the background region as A_{BG}, we can write the estimate of p(Obj|\nu) as

\hat{p}(Obj|\nu) = \frac{A_c h_m(\nu)}{A_{BG} h_{BG}(\nu) + A_c h_m(\nu)}.   (5)
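A minimal sketch of the estimate (5), assuming pre-normalized model and background color histograms and scalar region areas:

```python
import numpy as np

def p_obj_given_color(h_model, h_bg, area_cand, area_bg):
    """Eq. (5): Bayesian estimate of p(Obj | nu) per color bin, from the
    model and background color histograms and the two region areas."""
    num = area_cand * h_model
    den = area_bg * h_bg + area_cand * h_model
    # guard empty bins: a color seen in neither region gets probability 0
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)
```

Colors that appear only in the model get probability near 1; colors dominated by the (typically larger) background region are pushed toward 0.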

Next, we use a function of (5) as a penalty function [13] for finding the maxima of \hat{\rho}(y) in (3). If we denote this penalty function as \Phi(\nu), then (3) can be modified as

\hat{\rho}(y) \equiv \hat{\rho}[p_c(y), p_m] = \sum_{\nu} \Phi(\nu) \sqrt{h_c(\nu, y) h_m(\nu)}.   (6)

In our work, we used \Phi(\nu) = \sqrt{\hat{p}(Obj|\nu)}. Then, if we denote \Phi(I(x_i)) w(x_i) as \tilde{w}(x_i), the mean-shift equation (2) simply becomes

\Delta y = \frac{\sum_i K(x_i - y_{old}) \tilde{w}(x_i) (x_i - y_{old})}{\sum_i K(x_i - y_{old}) \tilde{w}(x_i)}.   (7)

This proposed objective function emphasizes the weights from pixels that are more likely to be in the object model than in the background. Therefore the tracker tends to follow features that are more discriminative against the background, i.e. the tracker becomes more robust to background clutter problems.

3.2 Orientation Estimation

Our proposed method for orientation estimation uses color histograms constructed for each orientation division, as in [14]. Within this sub-section, the term "orientation division" will be used often, so we first define it clearly. If we denote by \Omega the object (or candidate) region, by x_c the center of the object (or candidate) region, and by N_\alpha the number of orientation divisions, we can define the orientation divisions as

\alpha_i \triangleq \{x \mid F_1(\arg(x - x_c)) \in [\eta_i, \eta_{i+1}), \; x \in \Omega\},   (8)

where F_1(\cdot) is a function that restricts the value of \arg(x - x_c) to lie in [-\pi/2, \pi/2) and \eta_i is the boundary of each orientation division.
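The division assignment of (8) can be sketched as below; the concrete choice of F_1 (wrapping the angle into [-π/2, π/2) modulo π) and the uniform boundaries η_i are assumptions for illustration:

```python
import numpy as np

def orientation_division(pixels, center, n_div):
    """Assign each pixel to one of n_div orientation divisions, Eq. (8)."""
    d = pixels - center
    ang = np.arctan2(d[:, 1], d[:, 0])
    # F1: wrap the angle of (x - x_c) into [-pi/2, pi/2), modulo pi
    ang = np.mod(ang + np.pi / 2, np.pi) - np.pi / 2
    # uniform division boundaries eta_i over [-pi/2, pi/2)
    edges = np.linspace(-np.pi / 2, np.pi / 2, n_div + 1)
    return np.clip(np.digitize(ang, edges) - 1, 0, n_div - 1)
```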

At the last steps of the mean-shift iteration, when tracking of the translationof the object is almost finished, most of the object is likely to be inside the


tracking window, and also, since the time difference between frames is very small, the orientation of the object is likely to change little. This allows the assumption that the color distribution of the target candidate has not changed much from the object model in the last steps of the mean shift iteration. If we let \hat{p}_m and \hat{p}_c be the probability estimates with respect to the object model and the target candidate, respectively, under this assumption we can assume \hat{p}_c(\alpha_i|\nu) \approx \hat{p}_m(\alpha_i|\nu) [14], where \hat{p}_c(\alpha_i|\nu) and \hat{p}_m(\alpha_i|\nu) denote the probability of color \nu being in the i-th orientation division of the candidate and the model, and \hat{} denotes the estimator. From this approximation, we can derive

\hat{p}_c(\alpha_j|\alpha_i) = \sum_{\nu} \hat{p}_c(\alpha_j|\nu) \hat{p}_c(\nu|\alpha_i)   (9)

\approx \sum_{\nu} \hat{p}_m(\alpha_j|\nu) \hat{p}_c(\nu|\alpha_i).   (10)

This \hat{p}_c(\alpha_j|\alpha_i) is the probability of orientation division \alpha_i being orientation division \alpha_j, i.e. it tells us what each orientation division is likely to be. In (10), \hat{p}_m(\alpha_j|\nu) can be obtained by

\hat{p}_m(\alpha_j|\nu) = \frac{p_m(\nu|\alpha_j) p_m(\alpha_j)}{\sum_j p_m(\nu|\alpha_j) p_m(\alpha_j)}.   (11)

The probability of color values for each \alpha_j, p_m(\nu|\alpha_j), can be calculated using the color histograms constructed for each orientation division, and p_m(\alpha_j) is simply the area ratio of \alpha_j to the object region in the object model. If we denote the previously estimated orientation of the tracker as \theta_{old}, the newly estimated orientation as \theta_{new}, \Delta\theta = \theta_{new} - \theta_{old}, and the mean orientations of \alpha_i and \alpha_j as \theta_i and \theta_j, respectively, then using the results of (10) and (11) we can obtain \Delta\theta by

\Delta\theta = \sum_i \Big[ \sum_j \hat{p}_c(\alpha_j|\alpha_i) F_2(\theta_j - \theta_i) \Big] \hat{p}_c(\alpha_i),   (12)

where \hat{p}_c(\alpha_i) is the area ratio of \alpha_i to the object region in the target candidate, and the function F_2(\cdot) enforces \theta_j - \theta_i to lie inside [-\pi/2, \pi/2); this is possible since, from our definition of orientation divisions in (8), |\theta_i| < \pi/2 and |\theta_j| < \pi/2. Next, to make our orientation estimation result robust to background clutter, we use the result of (5) to modify p_m(\cdot) and p_c(\cdot) in (10) and (11). Instead of using p_m(\nu|\alpha_j) and p_c(\nu|\alpha_i), we use \tilde{p}_m(\nu|\alpha_j) and \tilde{p}_c(\nu|\alpha_i), which are modified as

\tilde{p}_k(\nu|\alpha_j) \triangleq \frac{p_k(\nu|\alpha_j) \hat{p}(Obj|\nu)}{\sum_{\nu} p_k(\nu|\alpha_j) \hat{p}(Obj|\nu)}, \quad k \in \{m, c\}.   (13)

Substituting (13) in (10), (11), and (12), we obtain the final equation

\Delta\theta = \sum_i \Big[ \sum_j \tilde{p}_c(\alpha_j|\alpha_i) F_2(\theta_j - \theta_i) \Big] \tilde{p}_c(\alpha_i).   (14)
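Putting (10)–(12) together, the orientation update can be sketched as follows; the inputs are per-division color histograms (rows indexed by color, columns by division), the division priors, and the mean division orientations. The wrap used for F_2 is an illustrative choice:

```python
import numpy as np

def delta_theta(p_m_color_given_div, p_c_color_given_div, p_m_div, p_c_div, theta):
    """Orientation update, Eqs. (10)-(12): estimate Delta-theta from the
    per-division color histograms of the model and the candidate."""
    # Eq. (11): p_m(alpha_j | nu) by Bayes over the model divisions
    joint = p_m_color_given_div * p_m_div[None, :]            # [n_colors, n_div]
    p_div_given_color = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
    # Eq. (10): p_c(alpha_j | alpha_i) ~= sum_nu p_m(alpha_j|nu) p_c(nu|alpha_i)
    p_j_given_i = p_div_given_color.T @ p_c_color_given_div   # rows j, columns i
    # Eq. (12): F2 wraps theta_j - theta_i into [-pi/2, pi/2)
    dtheta = theta[:, None] - theta[None, :]
    dtheta = np.mod(dtheta + np.pi / 2, np.pi) - np.pi / 2
    return float(np.sum((p_j_given_i * dtheta).sum(axis=0) * p_c_div))
```

For a candidate identical to the model this returns 0; a candidate whose divisions match the model's divisions shifted by one returns the corresponding (wrapped) angular offset.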


3.3 Scale Adaptation

To adapt to scale changes, we need to know the current status of our tracker, i.e. what the target candidate is looking at. If we could figure out which part of the distribution our target candidate is observing, then the whole problem of scale adaptation would be easily solved. Roughly speaking, our distribution of weights must be similar to the shape of the kernel we used for mean-shift: larger values closer to the center and smaller values further from the center. This is because, when doing mean-shift, we adopt a kernel to estimate the probabilistic distribution for tracking, and therefore color histograms are constructed with larger values closer to the center. We confirmed this with experimental data from some actual tracking situations (Figure 1).

[Figure 1: plots of the average of weights versus scale (5 divisions) for two tracking situations, with descriptor values of approximately 0.509, 0.434, 0.515, and 0.476.]

Fig. 1. Example of scale vs. average of weights, with scale divided into 5 scale divisions. (a) and (d) are the original images, where the inner red box denotes the target candidate region and the outer blue circle denotes the background region. (b) and (e) are the results using w(x_i), and (c) and (f) are the results using \tilde{w}(x_i).

To use this idea, we first define the scale divisions of a kernel. If we denote the relative distance as \sigma(x) (ranging from 0 to 1), following the same notation for the object (or candidate) region from sub-section 3.2, we can define the scale divisions as \varsigma_i \triangleq \{x \mid \sigma(x) \in [\zeta_i, \zeta_{i+1}), x \in \Omega\}, where \zeta_i is given by \zeta_i \triangleq (i-1)/N_\varsigma and N_\varsigma is the number of scale divisions. Then, to observe which part of the original weight distribution with respect to scale the target candidate is looking at, we can define the following descriptor for scale [14]:

\Sigma = \sum_j \Big[ \frac{w_{avg,j}}{\sum_i w_{avg,i}} \sigma_{avg,j} \Big],   (15)

where w_{avg,j} = \frac{1}{N_{\varsigma_j}} \sum_{x_i \in \varsigma_j} w(x_i), \sigma_{avg,j} = \frac{1}{N_{\varsigma_j}} \sum_{x_i \in \varsigma_j} \sigma(x_i), and N_{\varsigma_j} is the number of pixels inside \varsigma_j. Since the change in scale between consecutive frames is small, this descriptor is sufficient for describing how the distribution of weights has changed. However, this \Sigma may be inaccurate due to the inclusion of background information when obtaining w(x_i). To overcome this limitation, we use \tilde{w}(x_i) instead of the plain w(x_i). This gives us the final equation for our newly proposed scale descriptor:

\tilde{\Sigma} = \sum_j \Big[ \frac{\tilde{w}_{avg,j}}{\sum_i \tilde{w}_{avg,i}} \sigma_{avg,j} \Big].   (16)
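A sketch of the descriptor (15) (or (16), when the emphasized weights \tilde{w} are passed in); the uniform scale-division boundaries follow \zeta_i = (i-1)/N_\varsigma:

```python
import numpy as np

def scale_descriptor(weights, sigma, n_div):
    """Scale descriptor, Eq. (15)/(16): the per-division average relative
    distance sigma_avg, averaged with per-division average weights."""
    # scale division of each pixel from its relative distance in [0, 1)
    idx = np.clip((sigma * n_div).astype(int), 0, n_div - 1)
    w_avg = np.array([weights[idx == j].mean() if np.any(idx == j) else 0.0
                      for j in range(n_div)])
    s_avg = np.array([sigma[idx == j].mean() if np.any(idx == j) else 0.0
                      for j in range(n_div)])
    return float(np.sum(w_avg / w_avg.sum() * s_avg))
```

When the weights concentrate near the kernel center (the candidate window is larger than the object), the descriptor shrinks relative to the uniform-weight case, which is what drives the adaptation toward the model's \tilde{\Sigma}_0.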


Using this descriptor, we adapt the scale descriptor of the current target, \tilde{\Sigma}_{candidate}, to match the initial scale descriptor of the model, \tilde{\Sigma}_0. With this adaptation, we can track objects under scale change without much increase in computation time.

3.4 Algorithm Summary

When tracking objects, mean-shift finds the most probable position of the target object through iteration. During this iteration, orientation and scale estimates are only reliable once the target candidate has nearly stopped moving. Thus, our method estimates orientation and adapts to scale only when the target candidate is moving in small amounts, i.e. when ||\Delta y|| is smaller than some threshold \varepsilon'.

Given the object model q (the kernel, the color histogram of the model, the color histograms constructed for each orientation division, and \tilde{\Sigma}_0), the tracking algorithm can be summarized as follows:

Algorithm 1. Tracking
1: Create the target candidate model p
2: Compute \Delta y using q (7)
3: y_{new} \leftarrow y_{old} + \Delta y
4: If ||\Delta y|| > \varepsilon', go to 1
5: \sigma_{new} \leftarrow (\tilde{\Sigma}_{candidate} / \tilde{\Sigma}_0) \sigma_{old} (16)
6: \theta_{new} \leftarrow \theta_{old} + \Delta\theta (14)
7: Repeat steps 1 to 6 until ||\Delta y|| < \varepsilon'',

where \varepsilon' is the threshold for orientation estimation and scale adaptation and \varepsilon'' is the threshold for convergence.
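Algorithm 1 can be sketched as a loop in which the per-step computations are injected as callables; these helpers, and the default thresholds, are hypothetical placeholders rather than the paper's code:

```python
import numpy as np

def track_frame(y, theta, sigma_scale, compute_dy, compute_dtheta, scale_ratio,
                eps_move=2.0, eps_conv=0.9, max_iter=20):
    """Schematic of Algorithm 1: translation is iterated first; orientation and
    scale are only updated once the mean-shift step ||dy|| falls below eps_move.
    compute_dy, compute_dtheta, and scale_ratio stand in for Eqs. (7), (14),
    and the ratio in step 5 based on Eq. (16)."""
    for _ in range(max_iter):
        dy = compute_dy(y)                        # steps 1-2
        y = y + dy                                # step 3
        if np.linalg.norm(dy) > eps_move:         # step 4: still translating fast
            continue
        sigma_scale = scale_ratio(y) * sigma_scale  # step 5
        theta = theta + compute_dtheta(y)         # step 6
        if np.linalg.norm(dy) < eps_conv:         # step 7: converged
            break
    return y, theta, sigma_scale
```

With a toy `compute_dy` that halves the remaining distance to a fixed target, the loop skips the orientation/scale updates while the steps are large and applies them only near convergence.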

4 Experiments

Our algorithm was implemented in C++ with a 16 × 16 × 16 RGB color histogram, 5 orientation divisions, 4 scale divisions, \varepsilon' = 2, and \varepsilon'' = 0.9. An anisotropic kernel in the shape of a rectangle was used for kernel density estimation in mean-shift. For the background region, we used the area inside a circle slightly bigger than our target candidate. In actual tracking scenarios, orientation changes are not large between consecutive frames; therefore, we clipped the orientation estimation result to [-2.5°, 2.5°]. This prevents erroneous orientation estimates, since our assumption in sub-section 3.2 holds only for small orientation changes. All experiments were run on a 2.0 GHz PC at over 90 fps.

Figure 2 shows the tracking results for one selected frame of an image sequence of cars moving on a highway. Using this image sequence, we compared the proposed algorithm with the original mean-shift and mean-shift using 10% scale adaptation. The trackers were applied to the car on the top right. The original


(a) (b) (c)

Fig. 2. Result of the original mean shift tracker (a), the original mean shift tracker using the 10% scale adaptation (b), and the proposed method (c)

mean shift algorithm without scale adaptation (a) resulted in tracking failure, since it could not adapt to the scale change and another similar object entered the target candidate region. Mean-shift with the 10% method (b) failed to adapt to the scale change and the tracker shrank to a small box. Our method (c) shows some minor errors in following the orientation of the object, due to the drastic change in scale and a minor change in viewpoint, but succeeded in following the target object.

(a) #3 (b) #76 (c) #122 (d) #148

(e) #3 (f) #76 (g) #122 (h) #148

Fig. 3. Tracking results for the IVT tracker (without subspace update) with 80 particles (above) and the proposed method (below)

Figure 3 shows the tracking result of the proposed method compared with a particle filter tracker on a shaky image sequence of a medicine bottle captured by a hand-held webcam. In the image sequence, the medicine bottle shows abrupt translational, orientational, and random movements, i.e. sudden changes in scale and orientation occur. Sub-figures (a), (b), (c), and (d) are the results of the IVT tracker [6] with 80 particles without subspace update, and (e), (f), (g), and (h) are the results of the proposed method. Each sub-caption denotes the frame number in the image sequence. The IVT algorithm (a particle filter combined with an eigen-space method) was used for comparison. The reason we used 80


Fig. 4. Tracking of a person in a shaky scenario

particles was to achieve real-time performance (over 20 fps) for comparison with our proposed method. Also, since with the subspace update the IVT tracker was never able to follow the target no matter how many particles were used, we did not use the subspace update method. As shown in (b) and (f), frame 76, the IVT tracker fails to adapt to the fast orientation and translation change, whereas the proposed method succeeds. In (c), frame 122, the IVT tracker fails to track and follows an object similar to the target. But in (g), since our method is robust to background clutter problems, we can see that our method succeeds in tracking the medicine bottle. The IVT tracker without subspace update was able to follow the medicine bottle using 600 particles, but runs at 4 fps, whereas our method runs at over 90 fps (both implemented in C++).

We also tested our proposed method on a shaky scene recorded with a hand-held digital camcorder. The tracking results for selected frames are given in Figure 4. The recorded video is very shaky, and therefore some frames are blurred. The fourth selected frame in Figure 4 is an example of this situation; there, it is hard to recognize the legs of the tracked person even with human eyes. In the tracking results, there are some frames in which our proposed method fails to adapt to the scale change. These frames have abrupt changes in the position of the person due to the shaking of the camcorder. However, our proposed method successfully re-adapts to the scale change and, ultimately, does not lose track of the scale change of the object. The orientation estimation results in (d) show similar behavior to the scale adaptation results, in that they show some errors when abrupt motion occurs. But this is not the common case, and we can see that our method successfully follows the orientation change of the person.

5 Conclusion

We proposed a new object tracking method to solve the problems of the original mean-shift algorithm. The method consists of three parts. To handle background clutter problems, we proposed a new objective function which emphasizes features that are more likely to belong to the object model. We also proposed


an orientation estimation method to track objects with orientation changes. Finally, to adapt to scale changes of the object, we proposed a scale adaptation method which utilizes a new scale descriptor. Experimental results show that the proposed method was able to track objects with scale and orientation changes even in shaky scenarios. In comparison with other tracking algorithms, the proposed method was shown to be superior to the traditional mean-shift and also comparable to the particle filter.

Acknowledgment

This work has been supported by the Korean Ministry of Knowledge Economy, Samsung Techwin, and the BK 21 program.

References

1. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 564–575 (2003)

2. Yilmaz, A.: Object tracking by asymmetric kernel mean shift with automatic scale and orientation selection. In: IEEE Conf. on Computer Vision and Pattern Recognition (2007)

3. Collins, R.T., Liu, Y., Leordeanu, M.: Online Selection of Discriminative Tracking Features. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(10), 1631–1643 (2005)

4. Isard, M., Blake, A.: CONDENSATION - Conditional density propagation for visual tracking. International Journal on Computer Vision 29(1), 5–28 (1998)

5. Perez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-based probabilistic tracking. In: European Conf. on Computer Vision, vol. 1, pp. 661–675 (2002)

6. Lim, J., Ross, D., Lin, R.S., Yang, M.H.: Incremental learning for visual tracking. In: Advances in Neural Information Processing Systems (2004)

7. Nummiaro, K., Koller-Meier, E., Gool, L.: An adaptive color-based particle filter. Image and Vision Computing 21(1), 99–110 (2003)

8. Yilmaz, A., Li, X., Shah, M.: Contour-Based Object Tracking with Occlusion Handling in Video Acquired Using Mobile Cameras. IEEE Trans. on Pattern Analysis and Machine Intelligence 26(11), 1531–1536 (2004)

9. Yilmaz, A., Javed, O., Shah, M.: Object Tracking: A Survey. ACM Computing Surveys 38(4) (2006), http://dx.doi.org/10.1145/1177352.1177355

10. Collins, R.: Mean-shift blob tracking through scale space. In: IEEE Conf. on Computer Vision and Pattern Recognition (2003)

11. Yang, C., Duraiswami, R., Davis, L.: Efficient Mean-Shift Tracking via a New Similarity Measure. In: IEEE Conf. on Computer Vision and Pattern Recognition (2005)

12. Elgammal, A., Duraiswami, R., Davis, L.: Probabilistic tracking in joint feature-spatial spaces. In: IEEE Conf. on Computer Vision and Pattern Recognition (2003)

13. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

14. Yi, K.M., Ahn, H.S., Choi, J.Y.: Orientation and Scale Invariant Mean Shift Using Object Mask-Based Kernel. In: Int'l Conf. on Pattern Recognition (2008)