
Machine Vision and Applications (2012) 23:557–577
DOI 10.1007/s00138-010-0317-5

ORIGINAL PAPER

Region-based pose tracking with occlusions using 3D models

Christian Schmaltz · Bodo Rosenhahn · Thomas Brox · Joachim Weickert

Received: 23 April 2010 / Revised: 17 December 2010 / Accepted: 23 December 2010 / Published online: 29 January 2011
© Springer-Verlag 2011

Abstract Despite great progress achieved in 3-D pose tracking during the past years, occlusions and self-occlusions are still an open issue. This is particularly true in silhouette-based tracking, where even visible parts cannot be tracked as long as they do not affect the object silhouette. Multiple cameras or motion priors can overcome this problem. However, multiple cameras or appropriate training data are not always readily available. We propose a framework in which the pose of 3-D models is found by minimising the 2-D projection error through minimisation of an energy function depending on the pose parameters. This framework makes it possible to handle occlusions and self-occlusions by tracking multiple objects and object parts simultaneously. To this end, each part is described by its own image region, each of which is modelled by one probability density function. This allows occlusions to be handled explicitly, which includes self-occlusions between different parts of the same object as well as occlusions between different objects. The results we present for simulations and real-world scenes demonstrate the improvements achieved in monocular and multi-camera settings. These improvements are substantiated by quantitative evaluations, e.g. based on the HumanEVA benchmark.

We gratefully acknowledge funding by the German Research Foundation (DFG) under the project We 2602/5-2.

C. Schmaltz (B) · J. Weickert
Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Saarland University, Campus E1 1, 66041 Saarbrücken, Germany
e-mail: [email protected]

J. Weickert
e-mail: [email protected]

B. Rosenhahn
Faculty of Electrical Engineering and Computer Science, Leibniz University of Hannover, Appelstr. 9A, 30167 Hannover, Germany
e-mail: [email protected]

T. Brox
Department of Computer Science, Computer Vision Lab, Albert-Ludwigs-University Freiburg, 79110 Freiburg, Germany
e-mail: [email protected]

Keywords Pose estimation · Model-based tracking · Kinematic chain · Computer vision · Human motion analysis · Occlusion handling

1 Introduction

Following the 3-D position, orientation and, if present, the articulations of an object in a video is necessary for many applications, ranging from self-localisation and object grasping in robotics, body language interpretation, human-computer interaction, traffic or security surveillance, character animation, and the analysis of athletes up to content-based video retrieval. For a detailed overview of the field, we refer to the surveys by Gavrila [18], Forsyth et al. [15], Moeslund et al. [30], and Poppe [36].

This task, which is called pose tracking, is bound to some kind of feature matching between entities of the 3-D model and their corresponding features in the image. Such features can be points as in Abidi and Chandra [1], lines [5,29], or more complex features such as vertices, T-junctions, cusps, three-tangent junctions, limb and edge injections, and curvature L-junctions [25]. Drummond and Cipolla [14] used edge detection to achieve real-time tracking of articulated objects. In Pressigout and Marchand [37], a combination of two tracking algorithms that use edges and texture information, respectively, is used to achieve improved results. In Sundaresan and Chellappa [52], the pose is predicted using pixel displacements and improved afterwards using image-based cues such as silhouettes.

Fig. 1 Examples of problems that can occur in feature-based and in variational segmentation algorithms: Left: SIFT features [28] found in two consecutive frames are depicted. The green crosses denote the positions of SIFT features found in the frame shown, while the yellow boxes show the corresponding feature positions in the previous frame. Here, the tea box was to be tracked. As can be seen, no SIFT feature on the tea box has been found for these two frames, making tracking with SIFT features infeasible. See [16] for further details. Middle: A part of the segmentation is concave although the object to be tracked is convex, and there is an undesired split-up into several connected components. Right: The spout is not accurately segmented due to oversmoothing. Furthermore, holes generated in the segmentation are not always correct. From Schmaltz et al. [44] (color figure online)

Techniques building on keypoint tracking are usually fast, but they also share some common problems. First of all, a sufficient number of common keypoints must be detected in two frames, which is not always the case; see the left image in Fig. 1. Moreover, when not matching all images to a common keyframe, e.g. the first frame in the sequence, tracking errors accumulate and result in an undesired drift.

Drift is avoided in detection-based tracking approaches, where a description of an object model is assumed and the instance of this model in the image is sought to be detected in each frame. Reliable detection is a very hard problem. Thus, various assumptions that simplify the problem are commonly applied. Many approaches assume a static background in order to detect the object silhouette with background subtraction, e.g. [3]. Other approaches assume sufficiently many discriminative features on the object to detect it in a reliable manner [35]. Sun et al. [51] learn different object parts from a number of viewpoints to recognise the viewpoint of new images. Ottlik and Nagel [34] segment the optic flow to initialise the position of cars in inner-city sequences. Ramanan et al. [39] rely on detection only in frames where it is reliable and follow a tracking approach for the frames in between. Rosenhahn et al. [41] assume a good initialisation of the pose of a 3-D model in one frame and limited motion between frames to update the silhouette and the pose parameters in an iterative approach, where the object model serves as a shape prior for the segmentation. While the requirement of an initial pose and limited motion resembles a classical tracking approach, the matching of the surface model to the silhouette in the image avoids drift and involves the silhouette of the object as a general descriptor (see [40]).

If we are only interested in the pose of the object and not in its contour, the drawback of the technique in Rosenhahn et al. [41] is that it solves a much harder problem than actually necessary. In particular, it estimates an intermediate contour in an infinite-dimensional space only to derive a (6+n)-dimensional pose vector, where n is the number of articulations. The many degrees of freedom increase the computation time, decrease the robustness, and lead to inaccuracies in the contour; see Fig. 1.

Building upon the ideas we presented in an earlier conference paper [44], we propose an energy function that involves optimisation only in a finite low-dimensional space. This method also uses silhouettes as features for tracking. However, instead of separately estimating 2-D segmentation and 3-D pose parameters, we directly estimate the 3-D pose parameters by minimising the projection error of a given 3-D model in the respective 2-D images. Thus, the contours obtained with our algorithm are by construction consistent with the object model in the estimated 3-D pose. Since we only need to estimate a small number of pose parameters instead of an infinite-dimensional level set function, the runtime of the algorithm decreases significantly compared to the algorithm described in Rosenhahn et al. [41]. Nevertheless, we can use the same general statistical representation of regions. Estimating a separate silhouette may be advantageous if the provided surface model does not fit the tracked object. However, as shown in Balan and Black [11], there are ways to express the typical variations of a surface model in a parametric form. As we do not have access to a rich surface database, we stick to fixed surface models from a 3-D scanner, but the approach would also work with parametric adaptive surfaces. While we restrict the model here to shape deformations at predefined joints (see Sect. 3.6), the model and optimisation scheme could be extended to include more general shape variations in a similar way as recently shown in Sandhu et al. [43]. Note that the 3-D object models used here do not include any appearance information (such as colour or texture information).

One reason why contemporary solutions to the tracking task are not yet satisfactory is that there are a lot of issues that make pose tracking challenging. In particular, partial occlusions are a big problem for many pose trackers. As a second contribution of this paper, we show how to deal with partial occlusions between different objects by making use of the estimated 3-D positions in order to model occlusions explicitly. To this end, we minimise an energy function jointly defined over multiple objects. In a similar manner, self-occlusions of articulated objects can be handled using object models consisting of multiple components. This tackles a typical shortcoming of silhouette-based methods, where even visible parts cannot be tracked if their presence is not seen in the object silhouette.

The approach in Huang and Essa [21] can handle occlusions between an unknown number of objects. However, the objects must be very clearly distinguishable from each other and from the background. In Joshi et al. [23], images from a camera array are aggregated, and the resulting image is used for successful tracking despite occlusions. In these approaches, tracking is only performed in 2-D, though. In Kim and Davis [24], a multi-hypothesis approach to track multiple occluded persons was proposed. It relies on human appearance models and ground plane homographies and tracks the position at which the person is standing. In Ablavsky et al. [2], several occlusion layers are used to detect occlusions with static occluders. Those layers are generated by rendering via computer graphics. Grabner et al. [20] propose to handle occlusions using supporters found in the visual context around the tracked object.

Another way to deal with occlusion is using motion priors (see [33]). As such a prior incorporates information on the most likely continuation of a motion pattern, it in principle allows tracking even objects that are completely occluded. However, this is only possible if there are only few possible motion patterns, and if the object to be tracked approximately follows the pattern as stated in the prior. Consequently, the results depend on the choice and quality of the motion priors. Furthermore, the tracking results are biased towards the priors, which is undesired in applications in which deviations from the usual motion shall be found. This is the case, e.g., in medical applications or when analysing the motion of athletes. The method presented in this paper allows the incorporation of motion priors, but they are only required to deal with full occlusions. In all our experiments, we did not use any prior data from motion databases. Nevertheless, our tracking approach yields results that surpass those of most other state-of-the-art algorithms, as we will demonstrate using the HumanEVA-II benchmark.

This article is organised as follows: In Sect. 2, we review some basic mathematical concepts and pose estimation from point correspondences needed for our region-based approach introduced in Sect. 3. The method is extended to multiple objects in Sect. 4 and to models with multiple components in Sect. 5. In Sect. 6, both approaches are combined such that several objects with multiple components can be tracked. Experiments presented after each section illustrate the achieved improvements. Section 7 concludes the paper with a summary.

2 Basics

This section reviews the concept of representing rigid motions as twists, the concept of Plücker lines, and the basic point-based pose tracking algorithm used later.

An isomorphism that preserves orientation and distances is called a rigid motion. Such isomorphisms are of interest to us, since a rigid body can only perform a rigid motion. Any rigid body motion in 3-D can be represented as $m(x) := Rx + t$, with a translation vector $t \in \mathbb{R}^3$ and a rotation matrix $R \in SO(3)$, where $SO(3) := \{R \in \mathbb{R}^{3\times3} : R^\top R = I,\ \det(R) = 1\}$. Using homogeneous coordinates, one can also write $m$ as a $4 \times 4$ matrix $M$:

$$m\big((x_1, x_2, x_3)^\top\big) = M\,(x_1, x_2, x_3, 1)^\top = \begin{pmatrix} R_{3\times3} & t_{3\times1} \\ 0_{1\times3} & 1 \end{pmatrix} (x_1, x_2, x_3, 1)^\top. \qquad (1)$$

The set containing all matrices of this form is the so-called Lie group $SE(3)$. To every Lie group, there is an associated Lie algebra, whose underlying vector space is the tangent space of the Lie group evaluated at the identity. The Lie algebras associated with $SO(3)$ and $SE(3)$ are $so(3) := \{A \in \mathbb{R}^{3\times3} \mid A^\top = -A\}$ and $se(3) := \{(\nu, \hat\omega) \mid \nu \in \mathbb{R}^3,\ \hat\omega \in so(3)\}$, respectively. The elements of $se(3)$ are called twists. Elements of a Lie group can be converted to elements of the corresponding Lie algebra, and vice versa. In particular, a rigid motion can be written as a twist. Since a rigid motion given as an element of $SE(3)$ has 12 parameters while a twist has six, it makes sense to prefer estimating twists instead of rigid motions given as matrices. Moreover, when solving for the pose parameters in Sect. 2.2, a twist can easily be linearised.

Since elements of $so(3)$ and $se(3)$ can be written either as vectors $\omega = (\omega_1, \omega_2, \omega_3)$, $\xi = (\omega_1, \omega_2, \omega_3, \nu_1, \nu_2, \nu_3)$ or as matrices

$$\hat\omega = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{pmatrix} \in so(3), \qquad (2)$$

$$\hat\xi = \begin{pmatrix} \hat\omega & \nu \\ 0_{1\times3} & 0 \end{pmatrix} \in se(3), \qquad (3)$$

we distinguish these two ways of representing elements by a hat sign. A twist $\hat\xi$ can be converted to an element of the Lie group $M \in SE(3)$ by the exponential function $\exp(\hat\xi) = M$, which can be computed efficiently with the Rodrigues formula (see [32]).
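To make the twist representation concrete, the following minimal Python sketch implements the hat map of Eqs. (2)-(3) and the exponential map. The function names are ours, and scipy's generic matrix exponential stands in for the closed-form Rodrigues formula the paper refers to; this is an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm  # generic matrix exponential; a Rodrigues-type closed form would be faster

def hat(xi):
    """Map a twist vector xi = (w1, w2, w3, v1, v2, v3) to its 4x4 matrix form, cf. eqs. (2)-(3)."""
    w, v = xi[:3], xi[3:]
    xi_hat = np.zeros((4, 4))
    xi_hat[:3, :3] = [[0., -w[2], w[1]],
                      [w[2], 0., -w[0]],
                      [-w[1], w[0], 0.]]   # the skew-symmetric matrix of eq. (2)
    xi_hat[:3, 3] = v
    return xi_hat

def twist_to_rigid_motion(xi):
    """Convert a twist to an element M of SE(3) via the exponential map."""
    return expm(hat(xi))

# Example: a small rotation about the z-axis combined with a translation.
M = twist_to_rigid_motion(np.array([0., 0., 0.1, 0.5, 0., 0.]))
```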


Fig. 2 Identifiers used in the proof that the distance between a point $x$ and a Plücker line $L = (n, m)$ is given by $\|x \times n - m\|$

2.1 Plücker forms

3-D lines can be represented in different ways. Here, we use the Plücker form from Shevlin [47] to represent lines. In the context of this article, we consider projection rays, i.e. lines containing all points which are projected to a certain image point. Consequently, these lines always pass through this image point and the camera centre.

A line in Plücker form $L = (n, m)$ is given by a normalised vector $n$ (pointing in the direction of the line) and a moment $m := x' \times n$ for a point $x'$ on the line. The distance of a point $x$ to such a line $L = (n, m)$ can easily be computed as $\|x \times n - m\|$. To understand this, we write $x$ as the sum of $x'$, a vector $x_1 = \lambda n$ parallel to $n$, and a vector $x_2$ perpendicular to $n$, i.e. $x = x' + x_1 + x_2$ (see Fig. 2), and compute

$$\|x \times n - m\| = \|(x' + \lambda n + x_2) \times n - m\| = \|\underbrace{x' \times n}_{=m} + \lambda \underbrace{n \times n}_{=0} + x_2 \times n - m\| = \|x_2 \times n\| = \|x_2\|. \qquad (4)$$
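As a small illustration of Eq. (4), this hedged numpy sketch (function names are ours) builds a Plücker line from two points, e.g. the camera centre and a back-projected image point, and evaluates the point-to-line distance:

```python
import numpy as np

def plucker_line(p, q):
    """Plücker form L = (n, m) of the line through two 3-D points p and q:
    n is the normalised direction, m = p x n is the moment (p lies on the line)."""
    n = (q - p) / np.linalg.norm(q - p)
    return n, np.cross(p, n)

def point_line_distance(x, line):
    """Distance of a 3-D point x to the line L = (n, m), cf. eq. (4)."""
    n, m = line
    return np.linalg.norm(np.cross(x, n) - m)

# The distance from (0, 1, 0) to the x-axis through the origin is 1:
L = plucker_line(np.array([0., 0., 0.]), np.array([1., 0., 0.]))
assert np.isclose(point_line_distance(np.array([0., 1., 0.]), L), 1.0)
```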

2.2 Pose estimation with 2-D–3-D point correspondences

The region-based pose tracking approach in this paper uses point correspondences as well as the point-based pose estimation algorithm described in Rosenhahn et al. [41] to implement a gradient descent. Let $(x, X)$ be a 2-D–3-D point correspondence, i.e. let $X \in \mathbb{R}^4$ be a point on the 3-D silhouette of the object model in homogeneous coordinates and $x \in \mathbb{R}^2$ its position in the image. Furthermore, let $L = (n, m)$ be the Plücker line through $x$ and the respective camera origin. We will explain in Sect. 3 how such point correspondences emerge from a gradient descent.

In order to track the object, we need to find a twist $\hat\xi$ that maps the transformed model points $\exp(\hat\xi)X_i$ as close as possible to the corresponding Plücker lines $L_i$:

$$\arg\min_{\hat\xi} \sum_i \left\| \big(\exp(\hat\xi)\, X_i\big)_{3\times1} \times n_i - m_i \right\|^2, \qquad (5)$$

where the function $(\cdot)_{3\times1} : \mathbb{R}^4 \mapsto \mathbb{R}^3$ removes the last entry, which is 1. This results in a system of equations that is hard to solve due to the exponential function. However, since the twist $\hat\xi$ corresponds to the pose change, it is rather "small". As in Drummond and Cipolla [14], we can linearise the exponential function by its first-order Taylor expansion

$$\exp(\hat\xi) = \sum_{k=0}^{\infty} \frac{\hat\xi^k}{k!} \approx I + \hat\xi \qquad (6)$$

with an identity matrix $I$ without introducing a large error. Then, we get

$$\arg\min_{\hat\xi} \sum_i \left\| \big(\exp(\hat\xi)\, X_i\big)_{3\times1} \times n_i - m_i \right\|^2 \qquad (7)$$

$$\approx \arg\min_{\hat\xi} \sum_i \left\| \big((I + \hat\xi)\, X_i\big)_{3\times1} \times n_i - m_i \right\|^2. \qquad (8)$$

To solve this least-squares problem, the cross product is evaluated, resulting in three linear equations of rank two for each correspondence $(x_i, X_i)$. Thus, three non-collinear correspondences are sufficient to obtain a unique solution for the six parameters in the twist. Since correspondences are usually not accurate, however, it is advantageous to consider more correspondences. One thus obtains an overdetermined least-squares problem, which can be solved efficiently with standard methods, e.g. the Householder algorithm [49]. In order to further reduce the error introduced by the Taylor expansion, we iterate this minimisation process.
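A minimal numpy sketch of this linearised least-squares step follows. The helper names are ours, and the matrix form below is one standard way to expand the cross product in Eq. (8) (using $(\omega \times X) \times n = [n]_\times [X]_\times\, \omega$ and $\nu \times n = -[n]_\times \nu$), not necessarily the exact factorisation used by the authors:

```python
import numpy as np

def skew(a):
    """Skew-symmetric matrix [a]_x such that [a]_x b = a x b."""
    return np.array([[0., -a[2], a[1]],
                     [a[2], 0., -a[0]],
                     [-a[1], a[0], 0.]])

def solve_twist(points, lines):
    """One linearised step for eq. (8): estimate the twist xi = (w, v) from
    3-D model points and their Plücker lines (n, m).  The residual
    (X + w x X + v) x n - m is linear in (w, v)."""
    A, b = [], []
    for X, (n, m) in zip(points, lines):
        Nx = skew(n)
        A.append(np.hstack([Nx @ skew(X), -Nx]))  # coefficients of (w, v)
        b.append(m - np.cross(X, n))              # right-hand side
    xi, *_ = np.linalg.lstsq(np.vstack(A), np.hstack(b), rcond=None)
    return xi  # (w1, w2, w3, v1, v2, v3); apply exp and iterate as in the text
```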

3 Region-based model fitting

Previous approaches usually try to match a contour obtained by segmentation to the projected model surface. Since segmentation and matching can be time-consuming and erroneous, our idea is to avoid both the explicit contour computation and the matching step. Instead, we directly optimise the pose parameters such that all images are optimally partitioned into an object and a background region by the projected surface model. To simplify the description, we explain the algorithm for a sequence created with a single camera. The extension to multiple views is straightforward, though.

3.1 Energy model

To find the set of pose parameters that splits the image domain $\Omega$ into a foreground region $\Omega_{\mathrm{in}}$ and a background region $\Omega_{\mathrm{out}}$, we minimise the energy function

$$E(\xi) = -\int_\Omega \big( P_\xi(x) \log p_{\mathrm{in}} + (1 - P_\xi(x)) \log p_{\mathrm{out}} \big)\, \mathrm{d}x, \qquad (9)$$


Table 1 For each image sequence shown in this paper, the following information is given: the number of views (Views), the degrees of freedom per object (Unknowns), the size of each image (Size), the average number of iterations necessary (Iterations), the average amount of real time (or wall-clock time) required per tracked frame (RTime), the average amount of processor time required per tracked frame (PTime), the number of frames tracked (Frames), whether information from the texture space proposed in [7] was used (Texture), whether probability density functions from the last frame were used (PDFs), and the figure(s) in this article in which results are shown (Figure(s))

Sequence | Views | Unknowns | Size | Iterations | RTime (s) | PTime (s) | Frames | Texture | PDFs | Figure(s)
Tea box | 2 | 6 | 384×288 | 12.7 | 5.05 | 4.75 | 395 | Yes | New | 7, 8
Tea box 2 | 2 | 6 | 384×288 | 4.5 | 0.79 | 0.51 | 395 | No | New | 8
Giraffe | 1 | 6 | 384×288 | 280.0 | 6.94 | 6.63 | 84 | Yes | Old | 9
Puncher | 1 | 6 | 376×284 | 50.2 | 1.04 | 0.92 | 171 | No | Old | 10
Aeroplane | 1 | 6 | 640×480 | 39.4 | 13.79 | 13.43 | 48 | No | Old | 11
Flip | 4 | 27 | 500×380 | 59.7 | 19.81 | 18.75 | 260 | No | Old | 12
Duplo2 | 2 | 6/6 | 640×480 | 36.7 | 24.77 | 24.22 | 110 | Yes | Old | 14
Bike | 4 | 6/28 | 640×480 | 23.2 | 42.86 | 42.17 | 550 | No | Old | 15
Arm | 1 | 30 | 640×480 | 155.7 | 24.09 | 23.42 | 59 | No | Old | 17
HumanEva | 4 | 28 | 656×490 | 18.3 | 15.48 | 14.63 | 1,257 | No | Old | 18, 19
Duplo3 | 2 | 6/6/6 | 640×480 | 12.8 | 4.89 | 4.27 | 380 | No | Old | 20

All times are from computations on an Intel Pentium 4 with 3.2 GHz. Since the wall-clock time is influenced by other running processes and delays due to accessing a hard drive, we also include the processor time, which is unaffected by these problems

where $P_\xi(x) := P(\xi, x)$ with $P : \mathbb{R}^6 \times \Omega \mapsto \{0, 1\}$ is an indicator function for the projected object surface, i.e. it is 1 if and only if the surface of the 3-D model with pose $\xi$ projects to the point $x$ in the image plane. The functions $p_{\mathrm{in}}, p_{\mathrm{out}} : F \times \Omega \mapsto \mathbb{R}$ are two probability density functions (abbreviated as "PDF") that model the different feature distributions. Both take the features $F$ of the image (such as colour or texture) and an image point as input and return the probability that the features at the point $x$ occur inside or outside the object region, respectively. To make the formulas more readable, the arguments of the PDFs are omitted in the equations. Similar PDFs appear in Rosenhahn et al. [41] for estimating the contour. In contrast, the energy in (9) does not require the estimation of an intermediate contour but only the optimisation of the six pose parameters. This simplifies the estimation considerably. Also, the length constraint on the contour used in Rosenhahn et al. [41] is no longer required.

Good and straightforward choices for the input features are the image intensity (for grey-scale images) or the colour in the CIELAB colour space [22] (for colour images). These features are used in every experiment shown here. Additionally, we employ the texture feature space of Brox and Weickert [7] to improve tracking results for objects with a more complex appearance (see Table 1). Any other features which can be densely computed in the image, such as Gabor filters, can also be used.

To keep the model tractable, we assume that all pixels and feature channels are independently distributed. That is, the probability densities $p_{\mathrm{in/out}}$ are given by the product

$$p_{\mathrm{in/out}} = \prod_{i=1}^{N} p^i_{\mathrm{in/out}}, \qquad (10)$$

where $N$ is the number of feature channels and $p^i_{\mathrm{in}}$ and $p^i_{\mathrm{out}}$ are the estimated PDFs of the $i$-th feature channel for the inside or outside region, respectively. This is also one reason why we use the CIELAB colour space, as it separates the different channels well. If the model's appearance is not too complex, we model the PDFs with a non-parametric Parzen density

$$p^i_{\mathrm{in/out}}(F_i(x)) = \frac{1}{|\Omega_{\mathrm{in/out}}|} \int_{\Omega_{\mathrm{in/out}}} K_\sigma\big(F_i(x) - F_i(y)\big)\, \mathrm{d}y, \qquad (11)$$

where $F_i(x)$ is the $i$-th feature at the image position $x$, and $K_\sigma(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\big({-\frac{z^2}{2\sigma^2}}\big)$ is a Gaussian kernel with standard deviation $\sigma = \sqrt{30}$, approximated by three box filters of width 11. For more complex appearances, we employ a local Gaussian distribution as introduced by Rosenhahn et al. [41], which is also very common in recent segmentation approaches, e.g. [31] or [27]:

$$p^i_{\mathrm{in/out}}(F_i(x), x) = \frac{1}{\sqrt{2\pi}\,\sigma^i_{\mathrm{in/out}}(x)} \exp\left(-\frac{\big(F_i(x) - \mu^i_{\mathrm{in/out}}(x)\big)^2}{2\,\sigma^i_{\mathrm{in/out}}(x)^2}\right), \qquad (12)$$

where the local means and standard deviations are computed as

$$\mu^i_{\mathrm{in/out}}(x) = \frac{\int_{\Omega_{\mathrm{in/out}}} K_\sigma(y - x)\, F_i(y)\, \mathrm{d}y}{\int_{\Omega_{\mathrm{in/out}}} K_\sigma(y - x)\, \mathrm{d}y}, \qquad (13)$$


$$\sigma^i_{\mathrm{in/out}}(x)^2 = \frac{\int_{\Omega_{\mathrm{in/out}}} K_\sigma(y - x)\, \big(F_i(y) - \mu^i_{\mathrm{in/out}}(x)\big)^2\, \mathrm{d}y}{\int_{\Omega_{\mathrm{in/out}}} K_\sigma(y - x)\, \mathrm{d}y}. \qquad (14)$$
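Eqs. (13)-(14) are Gaussian-weighted averages restricted to a region, so they can be computed with Gaussian convolutions of masked images. A hedged numpy/scipy sketch (the function name is ours; `mask` plays the role of $\Omega_{\mathrm{in}}$ or $\Omega_{\mathrm{out}}$):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_gaussian_stats(F, mask, sigma):
    """Local means and standard deviations of one feature channel F inside
    the region given by the binary mask, cf. eqs. (13)-(14)."""
    m = mask.astype(float)
    w = gaussian_filter(m, sigma) + 1e-12                    # denominator: mass of K_sigma inside the region
    mean = gaussian_filter(F * m, sigma) / w                 # eq. (13)
    # eq. (14), expanded: E[F^2] - mean^2 equals the weighted squared deviation from mean(x)
    var = gaussian_filter(F * F * m, sigma) / w - mean ** 2
    return mean, np.sqrt(np.maximum(var, 1e-12))
```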

3.2 Minimisation

The gradient of the energy function (9) with respect to the pose parameters reads

$$-\nabla E(\xi) = \int_\Omega \Big( \nabla P_\xi(x) \log p_{\mathrm{in}} - \nabla P_\xi(x) \log p_{\mathrm{out}} + P_\xi(x)\, \nabla(\log p_{\mathrm{in}}) + (1 - P_\xi(x))\, \nabla(\log p_{\mathrm{out}}) \Big)\, \mathrm{d}x \approx \int_\Omega \nabla P_\xi(x) \big(\log p_{\mathrm{in}} - \log p_{\mathrm{out}}\big)\, \mathrm{d}x. \qquad (15)$$

The approximation neglects the dependency of the PDFs on the pose parameters. Both are actually not independent, since changing the pose usually leads to a different object region in which the PDFs are estimated. Several works have discussed the use of the complete shape gradient in the scope of segmentation, e.g. [4]. Our own experiments did not reveal a practical advantage, such as considerably faster convergence, and iterations are much faster when using the approximation. The approximation is motivated by the fact that the PDFs usually change very slowly when changing the pose parameters. In our implementation, we use Sobel operators (see [19]) to approximate $\nabla P$.

Since $\nabla E(\xi)$ is zero unless we are at the 2-D silhouette $c$ of the projected object, we can rewrite (15) as

$$-\nabla E(\xi) \approx \int_c \nabla P_\xi(c(s)) \big(\log p_{\mathrm{in}} - \log p_{\mathrm{out}}\big)\, \mathrm{d}s, \qquad (16)$$

where $s$ is the arc-length parameterisation of $c$. In the discrete setting, only a small number of object points is projected onto the 2-D silhouette. Denoting the set of these points by $O_s$, we have

$$-\nabla E(\xi) \approx \sum_{x_s \in O_s} \nabla P_\xi(x_s) \big(\log p_{\mathrm{in}} - \log p_{\mathrm{out}}\big). \qquad (17)$$

Interpreting (17), the energy function (9) is minimised by moving each contour point along the direction indicated by the gradient $\nabla P$. Speed and sign of this motion are given by the log-likelihood ratio $\log\big(\frac{p_{\mathrm{in}}}{p_{\mathrm{out}}}\big)$. Employing the methodology from Sect. 2.2, this motion can be transferred to corresponding 3-D points on the surface model. Thus, the gradient vectors created this way are reprojected back to 3-D space in order to find the most consistent 3-D rigid body motion that corresponds to the gradient in the image. Vice versa, the estimated rigid body motion changes the 2-D silhouette such that the features are separated more clearly.

In the practical implementation, we create 2-D–3-D point correspondences $(x_i, X_i)$ by projecting silhouette points $X_i$ using the current pose $\xi$ to the image plane, where they yield $x_i$. If the PDF for the inside region evaluated at $x_i$, i.e. $p_{\mathrm{in}}(x_i)$, exceeds the corresponding value for the outside region ($p_{\mathrm{out}}(x_i)$), we suppose that $x_i$ belongs to the object region. Thus, $x_i$ will be moved in outward normal direction to a new point $x'_i$. Conversely, points where $p_{\mathrm{in}}(x_i) < p_{\mathrm{out}}(x_i)$ holds will be shifted in the opposite direction.

According to the gradient (17), the shift vector should have a length of $l := |\log p_{\mathrm{in}} - \log p_{\mathrm{out}}|$. We noticed that setting $l$ to a fixed value tends to work better. We believe that this is because in the optimum the silhouette will usually be close to an image boundary. If the model does not exactly fit the object, parts of the contour will still be in the wrong region and induce large gradient vectors. Note that the absolute value of the difference of the logarithms states how likely it is that a point belongs to a certain region, which usually does not correspond to the distance of the point to the object boundary, and thus not to the optimal length of the shift vector. Moreover, image boundaries are often blurred and lead to pixel values that do not fit well to either PDF. Capturing only the sign of the log-likelihood ratio can be regarded as a robust variant of the gradient, where all points along the silhouette have equal weights when voting for the direction of movement. Especially in case of unexpected occlusions, a constant $l$ helps to reduce the influence of points assigned to the wrong region. We tested various other methods to adapt the length of the gradient vector, but the results were usually inferior. An example result where $l$ was not fixed is shown in the last image of Fig. 10. Note that the optimal value of $l$ depends on various factors such as the acceleration of the object, the type of movement, and the number of silhouette points. Thus, it is currently a free parameter in our tracker. All experiments shown use values between 0.1 and 2.
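The resulting update of the 2-D correspondence points can be sketched as follows (plain numpy; the array names and the outward-normal convention are our assumptions, not the authors' code):

```python
import numpy as np

def shift_silhouette_points(points_2d, outward_normals, p_in, p_out, l=1.0):
    """Shift each projected silhouette point by the fixed length l along its
    outward normal; the direction is the sign of log(p_in) - log(p_out):
    points that fit the object PDF move outwards, the others inwards."""
    eps = 1e-12  # guard against log(0)
    sign = np.sign(np.log(p_in + eps) - np.log(p_out + eps))
    return points_2d + l * sign[:, None] * outward_normals
```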

A gradient descent in the pose parameters very similar to the one we first proposed in Schmaltz et al. [44] has recently been presented in Dambreville et al. [13]. There, the gradient is derived in 3-D space, which yields an additional multiplicative factor depending on the curvature of the object. In their implementation, however, this factor was neglected. Moreover, they use the exact gradient without discarding the magnitude $|\log p_{\mathrm{in}} - \log p_{\mathrm{out}}|$. While this makes the optimisation theoretically more sound, their approach would probably also benefit from discarding this factor.

After each gradient descent step in the pose parameters, the PDFs in the object and background region are recomputed. In order to have better initial pose parameters, which includes a region that better fits the object in the image, we predict the object's pose in a successive frame by linearly extrapolating the pose from the previous two frames. This is also beneficial when linearising the exponential function. A more sophisticated approach, e.g. using a physical simulation as in Vondrak et al. [53], could be used here as well.
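The prediction step can be sketched as below. For the rigid part we repeat the last inter-frame motion in SE(3), while the joint angles of an articulated model are extrapolated linearly; this is a sketch under our own conventions, not necessarily the authors' exact scheme:

```python
import numpy as np

def predict_rigid_pose(M_prev2, M_prev):
    """Extrapolate a rigid motion: apply the motion observed between the two
    previous frames once more, M_pred = (M_prev M_prev2^-1) M_prev."""
    delta = M_prev @ np.linalg.inv(M_prev2)
    return delta @ M_prev

def predict_joint_angles(theta_prev2, theta_prev):
    """Linear extrapolation of the joint angles of a kinematic chain."""
    return 2.0 * theta_prev - theta_prev2
```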


Fig. 3 Overview of the steps performed by the basic pose tracking algorithm. For each frame:
Extrapolate the new pose from previous poses and compute image features. Then iterate:
− Project the 3-D object model onto the image plane
− Generate PDFs for the interior/exterior of the projected model (Sect. 3.1)
− Adapt the 2-D–3-D point correspondences $(x_i, X_i)$ to $(x'_i, X_i)$ (Sect. 3.2)
− Construct projection rays from $(x'_i, X_i)$ (Sect. 2.1)
− Generate and solve the system of equations to get the new pose (Sect. 2.2)

Fig. 4 Upper left corner: Contour of an initial pose guess (red) and the contour after tracking (green). Upper right corner: PDFs generated for the luminance channel. Lower two figures: PDFs generated for the colour channels a and b. Each plot shows the value of the PDF over the respective feature channel. Red lines: PDFs before pose estimation. Green lines: PDFs after pose estimation is finished. Thick lines: PDFs for the interior of the object. Thin lines: PDFs for the exterior of the object. The black arrows indicate how the PDFs evolved. Note that, as explained in the text, the PDFs for the background region changed only marginally (color figure online)

However, even if the initial pose estimated for a frame is quite bad, the optimisation will usually find the correct pose, as we will show later in an experiment. Figure 3 depicts an overview of the algorithm.

3.3 Illustrative example

Figure 4 illustrates the adaptation of the probability densities. The image in this example shows the contour of the pose estimated by our prediction step. PDFs are generated using the Parzen model for each feature channel and for the object and background region of the projected model. The algorithm then tries to optimise the pose in order to separate these PDFs. The three plots show the PDFs generated from the initial pose guess (red lines) and at the end of the pose estimation step (green lines) for the three channels used (i.e. the luminance and the two colour channels in the CIELAB colour system), for the interior (thick lines) and exterior (thin lines) of the puncher shown. It can be seen that the initial pose contains parts of the cyan mouse pad, which has values around (51, −15, −5) in CIELAB colour space, and of the orange measuring tape, whose channels are approximately (55, 40, 55). This is also clearly visible in the PDFs (thick red lines). After pose tracking, the projected pose extends almost entirely over the white puncher in the image. As a consequence, the peak in the luminance around 53, as well as those in the colour channels around 35 and 50, respectively, have disappeared (thick green lines).


Fig. 5 From left to right Initial guess, and pose estimated after 10, 30, 50, and 70 iterations (magnified)


Fig. 6 From left to right: (a) Input image. The puncher is to be tracked. (b) Projection of the model in an inaccurate pose onto the image (magnified). The two marked points are the points referred to in Sect. 3.3. (c) The 2-D contour of the projection (magnified). The arrows show in which directions these points should move in our algorithm. From Schmaltz et al. [44]

Also note that the difference between the PDFs after pose tracking (green lines) is greater than before pose tracking (red lines). The difference between the PDFs for the outside regions before and after pose tracking is very small because only a small fraction of the outside area has changed. The evolution of the pose is shown in Fig. 5.

Figure 6 illustrates how the force vectors affect the pose. Figure 6b depicts the surface model corresponding to the puncher, which has been projected in an inaccurate pose onto the input image (Fig. 6a). Figure 6c shows the boundary between the interior and the exterior of the projected model. Since most of the interior is white, the white point marked by the circle on the right side of the image fits the statistical model of the object region better than that of the background. Thus, it is shifted away from the object, i.e. to the right. The other marked point, which is cyan, better fits the PDF of the background and is thus shifted towards the interior of the object, and thus also roughly to the right.

If the estimated pose is close to the optimal one, the pose changes induced by the force vectors will mutually cancel out. Thus, the process is iterated until the average pose change over up to three iterations is smaller than a threshold $T$. The optimal threshold depends on the sequence and the parameter $l$. For example, the farther the object is from the camera, the larger $T$ should be.

3.4 Experiments

The experiment in Fig. 7 demonstrates the abilities of the basic tracker. A tea box is tracked in a stereo setup. Despite partial occlusions, complete rotations and specular highlights, tracking worked very well. As in all other experiments, we use a given pose in the first frame (usually created by hand) as initialisation.

In Fig. 8 we highlight the quality-speed tradeoff by comparing two variants of the tracking algorithm: a fast version where only intensity and colour are considered, the time step size in the gradient descent is larger, and a less restrictive stopping criterion is used (upper row), and a more accurate version where we also incorporate the texture features from [7] (lower row). As can clearly be seen in the diagrams, the version with texture features and finer time steps is much more precise, as errors are mostly below 3 mm. On the other hand, the result in the upper row was computed nearly an order of magnitude faster, as indicated in Table 1. In many applications the fast result may already be sufficiently precise. The worst tracking error of the fast tracker appeared in frame 220, which is the one depicted in Fig. 8.

3.5 Reusing old probability densities

The extrapolated pose used as the initial guess for a new frame is often inaccurate. Therefore, the PDFs based on this pose might differ strongly from the PDFs corresponding to the correct object position in the image. This can lead to a slower convergence rate or even to an unpleasant local optimum far from the sought solution.

Fig. 7 From left to right: Tracking results for frames 121, 214, 269 and 274 in one of the two available views of a tea box sequence. Only the silhouettes of the tracking results are shown so that the orientation of the tea box and the specular highlights can be seen

Fig. 8 Illustration of the quality/speed tradeoff of the pose tracking algorithm. From left to right: Tracking results for frame 220 in both cameras, and the change of the three translation parameters (in mm) over the first 160 frames, in which the static tea box was partially occluded. Upper row: Fast but imprecise results. Lower row: Precise but slower results. In addition to being imprecise, the fast results also oscillate more strongly

One way to deal with this problem is to replace the simple extrapolation step by a more sophisticated method. For example, a 2-D pose tracking or an optical flow algorithm can be incorporated to improve the initial guess in a new frame, as done in Brox et al. [8].

As an alternative to such rather complex approaches, we propose to diminish the problem of inaccurate PDFs at the beginning of a new frame by reusing the PDF estimates from the previous frame. The inherent assumption is that the appearance of the object and background, or more precisely their description by a PDF, changes only marginally from frame to frame. This assumption is usually well satisfied, though care has to be taken in case of the local Gaussian model. Consequently, we recompute the probability density functions only after the estimation of the object position is completed, i.e. once per frame. Since this requires significantly fewer PDF estimations, it also leads to a speedup of the algorithm.

Keeping the PDFs from the previous frame has another advantage. Since the PDFs $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ are constant in every iteration, the derivatives $\nabla(\log p_{\mathrm{in}})$ and $\nabla(\log p_{\mathrm{out}})$ are zero, and (15) is no longer an approximation.

As mentioned above, the situation is a bit more complex when locally varying densities are employed. If the object (and possibly the background) moves between two frames, the positional information of the local densities from the previous frame is no longer valid. There are two possible approximations. The PDFs for a 2-D–3-D point correspondence $(x, X)$ can be evaluated either (a) at $x$, or (b) at the position to which $X$ was projected in the previous frame. The second alternative is not available in pose tracking algorithms that use explicit segmentation, since no real 3-D information is incorporated into the segmentation step.

Both approximations have advantages and disadvantages. Method (a) results in better approximations of static backgrounds. However, the background is often not static. Moreover, since method (a) ignores the motion of the object, the approximation of the object model is often better when using method (b). Furthermore, method (b) has slightly higher computational costs and memory requirements, as more information from the previous frame must be stored and evaluated. In practice, however, both methods yield very similar results and the additional costs of method (b) are negligible. Since the results are similar, we use method (a) because it is a bit faster in our implementation.

Fig. 9 Estimated pose for five frames of a colour sequence with a wooden giraffe. From left to right: Frames 32, 48, 64, 74, and 84. Since the surface model consists of a single closed point grid, it is possible to look through the projected pose

Figure 9 shows five frames of a monocular sequence with a wooden toy giraffe. Since only one view is available, and since the appearance of the wooden giraffe is rather complex, tracking is quite hard in this sequence. Still, the projection of the estimated model fits the object in the image well. For this sequence, as well as for all experiments described from now on, we used the probability density functions from the previous frame as described before.

Fig. 10 In the upper row, the estimated poses for frames 10, 16, 134, and 142 of a monocular sequence are shown. The last row shows the estimated contours for frames 101, 102 and 103 in black, followed by the tracking result in frame 103 when the length of the shift vector l is not set to a fixed value, in turquoise. Note the fast movement between frames 10 and 16, the fact that the object partially left the image around frame 142, and the rather sudden turnaround in frames 101–103. Also note the motion blur visible in frame 103. The red pose shown in the image for frame 103 is the pose estimated by our initial prediction step. Although it is initially far away from the correct pose, our algorithm is able to successfully track the puncher if l is fixed. The position of the puncher estimated by our algorithm is indicated by the black contour. Note that only the camera moves while the scene itself is static (color figure online)

Figure 10 shows an example of a scene with fast movements and strong accelerations. Although only one view was available, the white puncher was tracked even though it moved by more than 50 pixels per frame. Moreover, the algorithm was able to deal with accelerations of more than 70 pixels/frame². With the proposed algorithm, 171 frames were successfully tracked. The best result we achieved without reusing PDFs was a successful tracking of 25 frames, after which tracking failed completely. An example result obtained with a non-constant $l$ is also shown in this figure.

Figure 11 illustrates that the proposed algorithm can also handle movements towards the camera, even if only one view is available. Furthermore, there is a second aeroplane visible in the background with an appearance very similar to that of the aeroplane being tracked. Nevertheless, tracking was successful. In the very last frame, in which only a very small part of the aeroplane is visible, the tracking results still look good but are in fact quite bad, as can be seen in the last image and the plot in Fig. 11.

3.6 Articulated objects

This section explains an extension of the model that makes it possible to handle articulated objects by employing kinematic chains as introduced in Bregler and Malik [6]. A kinematic chain is a system of rigid bodies connected by joints, where each joint has only one degree of freedom. Since every joint angle is an additional parameter that must be estimated, and due to the self-occlusions, pose tracking becomes far more challenging. When incorporating kinematic chains, the function $P$, and thus the energy function (9), does not only depend on the twist $\xi$ but also on the joint angles in the chain. Apart from the additional parameters, there is no change in the energy.

A joint can be modelled by a twist of the form $\theta_i \hat\xi_i$ with unknown joint angle $\theta_i \in \mathbb{R}$ and known joint axis $\hat\xi_i$ (the rotation axis is a part of the model representation, and there is no translation). The complete pose that must be estimated for a model with $n$ joints is given by $\chi = (\xi, \theta_1, \ldots, \theta_n) \in \mathbb{R}^{6+n}$.


Fig. 11 Tracking results for frames 0, 10, 20, 30, 40, 44, and 47 of a monocular, synthetic scene with a moving aeroplane, illustrated as cyan contours in the original images. Four images are cropped, and some lines illustrating the coordinate system were added in frame 30. The plot shows the translational tracking errors (X, Y, Z) per frame in metres. Although there is another aeroplane with similar appearance in the background, the tracking results look good. According to the quantitative evaluation, there is some inaccuracy due to depth ambiguity, and the last frame was tracked badly due to the fact that most of the aeroplane was not visible. Input sequence generated and kindly provided by Richard Steffen from the University of North Carolina at Chapel Hill (color figure online)

A point $X_i$ behind the $j$-th joint then has to satisfy

$$\big(\exp(\hat\xi)\, \exp(\theta_1 \hat\xi_1) \cdots \exp(\theta_j \hat\xi_j)\, X_i\big)_{3\times1} \times n_i - m_i = 0 \qquad (18)$$

in order to lie on the Plücker line $L_i = (n_i, m_i)$. After all exponential functions are linearised in the same way as in Sect. 2.2, one such constraint results in three equations of rank two in the six pose parameters and the joint angles.
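For illustration, the chained transformation on the left-hand side of Eq. (18) can be evaluated directly by multiplying the exponentials. This sketch (our own helper names, with scipy's matrix exponential standing in for a closed-form evaluation) assumes the joint axes are given as twist vectors:

```python
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """4x4 matrix form of a twist xi = (w, v), cf. eqs. (2)-(3)."""
    w, v = xi[:3], xi[3:]
    m = np.zeros((4, 4))
    m[:3, :3] = [[0., -w[2], w[1]], [w[2], 0., -w[0]], [-w[1], w[0], 0.]]
    m[:3, 3] = v
    return m

def transform_chain_point(X_h, xi, joint_twists, joint_angles):
    """Apply exp(xi_hat) exp(theta_1 xi_1_hat) ... exp(theta_j xi_j_hat) to a
    homogeneous point X_h behind the j-th joint, cf. the left-hand side of (18)."""
    M = expm(hat(xi))                       # rigid body motion of the root
    for theta, xi_k in zip(joint_angles, joint_twists):
        M = M @ expm(theta * hat(xi_k))     # one factor per joint along the chain
    return M @ X_h
```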

A problem typically arising with kinematic chains is that there might be no 2-D silhouette points for a rigid body in the chain, either because it is completely surrounded by other parts of the object (e.g. a small hand in front of a big torso) or because it is occluded. In both cases, the angle of the corresponding joint cannot be estimated. This problem can be alleviated by including prior knowledge on joint angle configurations; see for instance [9] for further details of this optional extension. However, since a prior is never as good as a reliable measurement, particularly when there is little training data, we rather propose to improve the silhouette model. This will be done in Sect. 5, where we better exploit the available information provided by the image.

Tracking results with an articulated object model are depicted in Fig. 12. For this scene, grey-scale images from four cameras as well as prior information were used. Apart from smaller problems due to the fact that a human cannot be perfectly modelled by a kinematic chain, the only tracking inaccuracies occur at the feet. This is due to the feet being white while the rest of the object is black.

4 Simultaneous tracking of multiple objects

As shown above, the basic pose tracking approach can handle partial occlusions in some situations. However, once the percentage of the occluded area is too big, tracking will fail. This is because a large part of the initial estimate of the object region covers the background and thus the estimated PDF for that region will be very inaccurate.¹ As a consequence, the estimated PDFs cannot be used to distinguish between interior and exterior anymore.

In this section, we show how to handle occlusions that are due to multiple objects by a coupled tracking of all involved objects. The basic idea was presented in a preliminary conference paper [45].

4.1 Basics of multiple object tracking

Fig. 12 A sequence where a person turns two backflips. Only one out of four camera views is shown. The left image (frame 50) shows the start of the second flip. The two images in the middle show frames 96 and 148, and in the image on the right (frame 190) the backflip is completed. The bottom row shows the estimated pose in zoomed images

¹ When reusing the PDFs from the previous frame, this effect is postponed to the successive frame. The problem is not solved, though.

In order to track multiple objects, we assume that a model and a pose initialisation are available for each object. The goal is to find the poses of all objects. That is, if $N$ is the number of objects to be tracked, the algorithm should find the pose $\chi_i$ of each object $i$. For the methodology we present here, it is not important whether we track rigid or articulated objects, or both. Hence, we will always write $\chi_i$ for the pose parameters even though $\chi_i$ could be simplified to $\xi_i$ in the case of rigid objects.

The simplest way to track multiple objects with the approach introduced in the last section is to estimate PDFs for the interior and exterior of each object, and to minimise an energy function that is the sum of energy functions of the form (9):

$$E(\chi_1, \ldots, \chi_N) = -\sum_{i=1}^{N} \int_\Omega \big( P_{\chi_i,i}(x) \log p_{i,\mathrm{in}} + (1 - P_{\chi_i,i}(x)) \log p_{i,\mathrm{out}} \big)\, \mathrm{d}x. \qquad (19)$$

Here, $P_{\chi_i,i}$ is a projection function $\Omega \mapsto \{0, 1\}$ which projects the $i$-th object with pose $\chi_i$ to the image plane. Minimising this energy function can be done in basically the same way as with the old energy function: Each object is projected, PDFs are estimated, the 2-D parts of the 2-D–3-D point correspondences are adapted according to the gradient derived from the corresponding PDFs, and a system of equations is solved for each object. As soon as the pose update of one object is small, its position can be considered final and only the poses of the remaining objects are searched. This approach treats all objects independently. As soon as two objects partially occlude each other, however, the independence assumption is no longer satisfied.

4.2 Occluding objects

In order to understand the problem of occluding objects, consider the illustrative example in Fig. 13a, in which a green puncher and a yellow box are to be tracked. Suppose the 3-D pose of both objects is already correctly estimated. When the green puncher is projected onto the image plane, it splits the image into the background region (blue region in Fig. 13b) and the foreground region (green and yellow regions in Fig. 13b). Clearly, the foreground region not only includes the green region, but also a considerable area of the yellow tea box. As a consequence, the PDF estimated for the interior of the puncher has two modes: one from the green and one from the yellow area. Therefore, yellow parts of the image are falsely believed to belong to the interior of the puncher. Those silhouette points which are yellow (the silhouette is shown in Fig. 13c) consequently generate gradient vectors that point towards the outside of the object. The tracked puncher gets stuck in the yellow area and tracking fails.

4.3 Coupled tracking

In order to prevent problems due to occlusions, it is necessary to make sure that each image point is only assigned to the visible object. Hence, we introduce a visibility function [50] $v_i$ for each object $i$. Let $O_i(\chi_i, x)$ be the set of those points on the $i$-th object model with pose $\chi_i$ that are projected onto the image point $x$. Next we define

$$d_i(\chi_i, x) := d(O_i(\chi_i, x), C) = \min_{y \in O_i(\chi_i, x)} \{ d(y, C) \} \qquad (20)$$


Fig. 13 Illustration of two problems occurring when one of the tracked objects occludes another, and how our algorithm solves them by utilising the visibility functions explained in Sect. 4: Suppose the initial poses of the two objects to be tracked in the artificial input image (a) are correct. The area (b) used for PDF estimation and the silhouette (c) used without visibility functions are suboptimal. When using the visibility functions, the area (d) and silhouette (e) are more reasonable

as the minimal distance of the set $O_i$ to the camera origin $C$. Finally, the visibility functions are given by

$$v_i(\chi_1, \ldots, \chi_N, x) = \begin{cases} 1 & \text{if } d_i(\chi_i, x) = \min_{j \in \{1, \ldots, N\}} \{ d_j(\chi_j, x) \}, \\ 0 & \text{else.} \end{cases} \qquad (21)$$

That is, a point is considered to be visible if there is no point of another object closer to the camera origin projecting to the same image point. This local visibility testing is clearly superior to approaches which simply assume all parts of one object to be in front of all parts of another.
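In practice the visibility functions can be obtained from rendered depth maps (the paper projects the objects with OpenGL, as mentioned later in this section). A minimal depth-buffer-style numpy sketch of Eq. (21), assuming one depth map per object with np.inf where the object does not project:

```python
import numpy as np

def visibility_masks(depth_maps):
    """Per-pixel visibility v_i of eq. (21): object i is visible wherever its
    depth is minimal over all objects.  depth_maps has shape (N, H, W), with
    np.inf at pixels where object i does not project."""
    depth_maps = np.asarray(depth_maps)
    d_min = depth_maps.min(axis=0)                        # closest surface per pixel
    vis = (depth_maps == d_min) & np.isfinite(depth_maps)
    background = ~np.isfinite(d_min)                      # v_0: no object visible
    return vis, background
```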

The arguments of the visibility functions (the pose of all involved objects and the image point) will be omitted to keep the formulas readable.

In Schmaltz et al. [45], the energy function

$$E(\chi_1, \ldots, \chi_N) = -\sum_{i=1}^{N} \int_\Omega \big( v_i P_{\chi_i,i}(x) \log p_{i,\mathrm{in}} + (1 - v_i P_{\chi_i,i}(x)) \log p_{i,\mathrm{out}} \big)\, \mathrm{d}x \qquad (22)$$

was used for tracking multiple objects: When multiplying each projection function $P_{\chi_i,i}$ with the corresponding visibility function $v_i$, occluded points are not considered as being inside the interior of the object region. In Fig. 13a, only the green part of Fig. 13d is therefore used to estimate the PDF of the interior region of the green object. Consequently, the PDF is perfectly accurate here.

However, the PDFs estimated for the outside of each object region still overlap and are thus inaccurate. Hence, we propose to use a single background region here instead of one outside region per object.

The necessary functions for the background region are $v_0(\chi_1, \ldots, \chi_N, x) := \prod_{i=1}^{N} (1 - v_i)$ (the background is visible if no other object can be seen) and $P_0(\chi_1, \ldots, \chi_N, x) := 1$ (ignoring visibility, the background covers the whole image), while the energy function is

$$E(\chi_1, \ldots, \chi_N) = -\sum_{i=0}^{N} \int_\Omega \big( v_i P_{\chi_i,i}(x) \log p_i \big)\, \mathrm{d}x. \qquad (23)$$

The minimisation is basically the same as in the single object case. For each object, a silhouette is found by projecting the object, and the point correspondences on the silhouette are adapted by comparing the PDFs of the two regions next to the considered silhouette point, i.e. the PDF of the region in which the point lies and that of the next region in outward normal direction. The visibility functions can be computed efficiently by projecting the objects with OpenGL.

In addition to the improved way of estimating the PDFs, we can further benefit from the explicit modelling of occluding objects by considering only true silhouette points and neglecting occluding contours, as illustrated in Fig. 13d, e.

It is advantageous to normalise the lengths of the force vectors depending on the number of correspondences $c_i$ available for each object $i$. Thus, we linearly scale the length $l$ of the force vectors depending on $c_i$ in each iteration step: For a correspondence on object $i$, the length of the force vector is set to $\frac{c_i}{c_m}\, l$, where $c_m = \max_i c_i$. This is beneficial since, when only few correspondences are available, each force vector has a larger influence on the pose. Thus, the step size should be smaller to prevent oscillations.
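A one-line sketch of this normalisation (our own helper; `counts` holds the numbers of correspondences $c_i$ per object):

```python
import numpy as np

def scaled_force_lengths(counts, l):
    """Scale the force-vector length per object by c_i / c_max, so objects
    with few correspondences take smaller steps (Sect. 4.3)."""
    counts = np.asarray(counts, dtype=float)
    return l * counts / counts.max()
```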

Figure 14 shows a scene in which two Lego Duplo® objects were successfully tracked. Despite the cluttered backgrounds and occlusions in both views, both objects were tracked accurately.

Figure 15 shows an example in which there are mutual occlusions: The bike partially occludes the cyclist and vice versa in all views throughout the complete sequence. For this sequence, we used the soft constraints from Rosenhahn et al. [42] that encourage the hands to stay close to the handlebar and the feet to perform a circular trajectory.

5 Objects with multiple components

Fig. 14 From left to right: Four frames (15, 45, 75, and 105) of a sequence with two objects. Top: View 1. Bottom: View 2. Despite simultaneous occlusions in both frames, the tracking was successful

Fig. 15 From left to right: Four frames (210, 240, 270, and 510) of a sequence with a bike and a cyclist. Only one of the four views is shown. It can be seen that the tracker can handle mutual occlusions. These occlusions are illustrated for frame 240: The parts where the cyclist occludes the bike are shown in green while it is vice versa for those parts shown in blue (color figure online)

As long as all objects to be tracked are rigid, the only problematic occlusions occur because one object occludes another, since self-occlusions are usually unproblematic. Once kinematic chains are tracked, however, there can be self-occlusions that result in ambiguities. An example of such a situation is shown in Fig. 17. For some frames, the right forearm is completely inside the silhouette of the person. As a result, an approach using a single silhouette can only guess the joint angles of the forearm from the prior distribution. Even in the case of a learned prior, there remains a lot of uncertainty in the forearm estimate although the arm is clearly visible in the image. We will now present an extension of the tracking approach in order to make better use of the image data in such situations.

5.1 Energy function using multiple internal regions

We follow the idea discussed in Schmaltz et al. [46], which is to assign different rigid parts of the articulated object to be tracked into one of several components. Then, each component has a separate PDF and is projected separately into the image plane, yielding an increased number of silhouettes for tracking. Note the difference between "part" and "component" used throughout this section: "Part" refers to a rigid part of the kinematic chain, while "component" is a set of one or more parts with similar appearance. For the moment, assume that a good splitting into l different components M_i, i = 1, …, l is already given. We will explain in Sect. 5.2 how the number and composition of the components can be found.

As in the case of multiple objects, visibility functions should be defined that account for occlusions. Thus, we similarly define O_j(χ, x) as the set of those points on the j-th component M_j with pose χ that are projected onto the image point x, and

d_j(χ, x) := d(O_j(χ, x), C) = min_{y ∈ O_j(χ, x)} d(y, C)   (24)

as the minimal distance of the set O_j to the camera origin C. The visibility functions for the case of multiple internal regions are then given by

v_j(χ, x) := 1 if d_j(χ, x) = min_{k ∈ {1,…,l}} d_k(χ, x), and 0 else.   (25)

Defining the visibility and projection functions for the background as v_0(χ, x) := ∏_{j=1}^{l} (1 − v_j(χ, x)) (the background is visible if no other object can be seen) and P_0(χ, x) := 1 (ignoring visibility, the background covers the whole image) leads to the following energy function:

E(χ) = −∑_{j=0}^{l} ∫_Ω v_j(χ, x) P^j_χ(x) log p_j dx .   (26)

There is another potential problem that can be solved by this approach: Some parts of the object might look too similar to the background, such that including these parts degrades the results. For example, this can be the case when tracking someone standing in front of a large white background while wearing a white T-shirt. In this case, white pixels are more likely in the background. Consequently, the tracker tries to reduce the number of white points in the object region, even those of the white T-shirt. This challenge can be easily handled in our approach: All that must be done is to exclude the corresponding parts of the object from all regions M_i. They are then automatically assigned to the background region. Even if the removed parts are not at one end of the kinematic chain, this works without any problems, since the approach does not require that each part of the object is in one component. Furthermore, it is also possible that a component contains parts that are not connected, e.g. both arms but not the torso.

Fig. 16 Result of the automatic splitting explained in Sect. 5.2. Leftmost: Input image (cropped). Left: Object in initial pose, with the rigid parts labelled (body, head, left arm, right arm, left leg, right leg). Right: Similarity matrix of the first step. The darker a dot, the more similar two parts are. The circles show the two region pairs which are merged first: Head and lower right arm (green), followed by the two hands (blue). Rightmost: Final splitting suggested with a splitting threshold in the interval [0.15, 0.33] (color figure online)

Fig. 17 Tracking results for a monocular scene. From left to right: Pose and silhouette estimated in frames 10, 38 and 54. Although no information about the right arm is included in the silhouette, the arm is correctly tracked

5.2 Automatic component generation

The tracking approach using multiple internal regions requires that the appearances of the different components can be distinguished. Thus, we assume that parts with similar appearance should be in the same component.

To automatically split an articulated object model, we start by putting each rigid part of the articulated object into a component of its own. If there are significant appearance differences in rigid parts, these parts should also be separated into several components. We have done this manually for the experiment shown in Fig. 20. However, it should also be possible to automate this step, e.g. by segmenting the 3-D model based on its appearance visible in the image.

Next, each component is projected into the image plane, and a PDF for each region is estimated. Then, the two components whose PDFs are most similar are combined. After computing the PDF for the new, combined region, these steps are repeated until the difference between all PDFs exceeds a threshold α.

For this algorithm, it is necessary to compare two PDFs. Of the several possible ways to compare PDFs, the Jensen–Shannon divergence [54], which is a symmetric and smooth variant of the Kullback–Leibler divergence [26], works best in our experiments. It is given by

JSD(p, q) := ( J(p, M) + J(q, M) ) / 2 ,   (27)


Fig. 18 Tracking results for Sequence S4 of HumanEva II. The leftmost image shows the tracking error (in mm, plotted over the frame number) with and without internal regions. The left image in each block shows the tracking result without using internal regions, the right when using internal regions. The same colours have been used to allow an easy comparison of the results. Without using multiple components, tracking can go wrong due to occlusions (middle images, frame 50), or because the estimated PDFs are inaccurate (rightmost images, frame 200). Only one of the four available views is shown. Images have been cropped

where p and q are the PDFs to be compared, M is equal to (p + q)/2, and where J is the Kullback–Leibler divergence

J(p, q) := ∑_i p(i) log ( p(i) / q(i) ) .   (28)
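Equations (27) and (28), together with the greedy merging described above, can be sketched as follows. We assume the component appearances are given as normalised histograms over common bins; the weighted averaging used to combine two histograms is a simplification of re-estimating the PDF from the merged image region, and all function names are our own:

import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence J(p, q) of Eq. (28)."""
    mask = p > 0                 # terms with p(i) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence of Eq. (27), with M = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))

def merge_components(hists, alpha):
    """Greedily merge the two most similar components (Sect. 5.2)
    until all pairwise divergences exceed the threshold alpha.

    hists: list of 1-D normalised histograms, one per rigid part.
    Returns sets of part indices forming the final components.
    """
    comps = [{i} for i in range(len(hists))]
    hists = [h.astype(float) for h in hists]
    while len(comps) > 1:
        pairs = [(jsd(hists[a], hists[b]), a, b)
                 for a in range(len(comps))
                 for b in range(a + 1, len(comps))]
        dist, a, b = min(pairs)
        if dist > alpha:          # all remaining PDFs differ enough
            break
        # Combine b into a; size-weighted averaging stands in for
        # re-estimating the PDF from the merged image region.
        wa, wb = len(comps[a]), len(comps[b])
        hists[a] = (wa * hists[a] + wb * hists[b]) / (wa + wb)
        comps[a] |= comps[b]
        del comps[b], hists[b]
    return comps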

5.3 Experiments

Figure 16 demonstrates the automatic splitting algorithm. Note that upper arms and torso are clustered in the same component, because both have the appearance of the shirt.

In Fig. 17, we show tracking results of a monocular scene in which internal regions and the splitting suggested by the automatic splitting algorithm with threshold α = 0.25 are used. As can be seen, the tracking works quite well in this monocular sequence, despite the fact that one arm is completely inside the silhouette of the whole object.

Figure 18 shows more results using internal components obtained with α = 0.25. The sequence in this experiment is Sequence S4 of the HumanEva-II database [48], for which four cameras are available. The advantage of this benchmark is that it is possible to obtain the average tracking error in millimetres per frame via an online interface at Brown University, i.e. a quantitative evaluation is possible. The sequence illustrates that multiple-component models can also be useful in multi-camera settings. It is not surprising that the arms are better tracked around frame 50 (see middle images in Fig. 18). However, there are also improvements that look surprising at first glance: when using a model with a single component, the two legs crossed around frame 200 (see right images in Fig. 18). When using the split model, more accurate PDFs are estimated. As a result, the crossing of the legs disappears.

Later in the sequence, tracking without using internal regions fails, as indicated by the error plot in Fig. 18. Sample frames for the whole sequence, which includes three different types of movement, are shown in Fig. 19. The bottom row also shows the frame with the largest tracking error.

Table 2 shows the average tracking error obtained with our method when using a single component, and when using multiple components, for the whole sequence, i.e. until frame 1257. It can be seen that the average tracking error is much smaller with the multi-component model. Moreover, these results show that our tracking approach yields very accurate results: In Corazza et al. [12], an average tracking error of 80 mm is reported for the same sequence. However, only the first 150 frames were used, as tracking became less robust and started failing afterwards. For these frames, we obtain an average error of 32.13 mm. Brubaker et al. [10] achieve a mean error of 54 mm using only 2 of the 4 cameras. Their approach only handles the lower body part, as a physical model of walking is used. Thus, only the walking part of the sequence is evaluated (frames 15–350). We obtain an average error of 33.4 mm in these frames.

Gall et al. [17] proposed to perform a global optimisation based on interacting simulated annealing, followed by a local optimisation. Such a global optimisation step prevents tracking failures from propagating into the next frame, resulting in an average error of 32.01 mm for the whole sequence. Such an approach could also be incorporated as a first step into the tracking approach used here, but would increase the run time of our algorithm approximately by a factor of five.

6 Simultaneous tracking of multi-part objects

The two methods described in the last two sections, i.e. tracking multiple objects and tracking an object with multiple internal regions, can be combined into a single energy function in a straightforward manner. By minimising this function, multiple objects with multiple internal regions can be tracked.


Fig. 19 Each row shows tracking results in the four camera views for sequence S4 of the HumanEva-II benchmark. From top to bottom: A frame in the walking part (frame 222), a frame in the jogging part (frame 444), a frame in the "balancing" part (frame 888), and the frame with the worst results (frame 757)

Table 2 Some statistics of the tracking error with and without using internal regions for the sequence shown in Fig. 18

Model                 Avg. error    Std. deviation    Max. error
Multi-part object     4.87          21.94             156.53
Single-part object    142.48        89.82             300.30

Given are the average and standard deviation of the tracking error, as well as the maximal tracking error. All numbers are in millimetres

6.1 Tracking multiple multi-part objects

To track n objects M_i (1 ≤ i ≤ n) with unknown poses χ_i, where object i consists of l_i parts M_i^1, …, M_i^{l_i} (as explained above), we use functions similar to those used before: We define P^j_{χ_i,i} as the projection function of the component M_i^j, p_i^j := p_i^j(χ_1, …, χ_n, x) as the PDF of M_i^j, and O_i^j(χ_i, x) as the set of those points on M_i^j that are projected onto the image point x. Furthermore, we set

d_i^j(χ_i, x) := d(O_i^j(χ_i, x), C) = min_{y ∈ O_i^j(χ_i, x)} d(y, C)   (29)

as the minimal distance of O_i^j to the camera origin C, and

v_i^j(χ_i, x) := 1 if d_i^j(χ_i, x) = min_{k ∈ {1,…,n}} min_{m ∈ {1,…,l_k}} d_k^m(χ_k, x), and 0 else,   (30)

as the visibility function of the component M_i^j.

Then the function to be minimised reads:

E(χ_1, …, χ_n) = −∑_{i=0}^{n} ∑_{j=1}^{l_i} ∫_Ω v_i^j P^j_{χ_i,i}(x) log p_i^j dx .   (31)

For the background, we set l_0 = 1, since it has only a single component. Furthermore, we set

v_0^1(χ_1, …, χ_n, x) = ∏_{i=1}^{n} ∏_{j=1}^{l_i} (1 − v_i^j(χ_1, …, χ_n, x))

(again, the background is visible if no other object is visible) and P_0^1(χ_1, …, χ_n, x) = 1 in this case. Minimisation of this energy function is performed in a similar way as minimising the energy function (26). The only difference is that the poses of all objects are optimised simultaneously, as in (23).
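To close the loop, a sketch of Eq. (31) in the same style as before: per-pixel visibility is now decided across all components of all objects, and the background term is added where nothing projects. Array layouts and names are again our own assumptions:

import numpy as np

def combined_energy(depths, proj, pdf_values, bg_pdf_values):
    """Discrete version of Eq. (31) for n multi-part objects.

    depths, proj, pdf_values: dicts indexed by (i, j), i.e. component
    j of object i, each holding an (H, W) array: the depth d_i^j
    (np.inf off the projection), the projection mask, and p_i^j
    evaluated at the image colours.  bg_pdf_values is the (H, W)
    array of the background density p_0^1.
    """
    eps = 1e-12
    keys = list(depths)
    closest = np.stack([depths[k] for k in keys]).min(axis=0)
    e = 0.0
    for k in keys:
        # v_i^j of Eq. (30): frontmost among all components of all objects.
        vis = (depths[k] == closest) & np.isfinite(depths[k])
        e -= np.sum(vis * proj[k] * np.log(np.maximum(pdf_values[k], eps)))
    # Background (l_0 = 1): visible where no component projects at all.
    bg_vis = ~np.isfinite(closest)
    e -= np.sum(bg_vis * np.log(np.maximum(bg_pdf_values, eps)))
    return e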


Fig. 20 Relevance of using multiple internal regions. Left two columns, from top to bottom: The two input images for frame 139, result with internal regions (magnified), and with a single region (magnified). Only the silhouette is shown for one view to make the details better visible. Right two columns, from top to bottom: Tracking results in frames 20, 190, and 350 using internal regions

6.2 Experiments

In Fig. 20, we track another scene with three Lego Duplo objects that was first used in Schmaltz et al. [45]. Two of those objects are red, and the third consists of a large green board with blue Duplos attached to it. We track this multiple-object sequence once without internal regions, and once after splitting the last object into a green and a blue part. Without using internal regions, tracking is quite inaccurate for some frames. Trying to track the blue/green object without tracking the red objects fails completely. These experiments clearly show that using internal regions is also advantageous when tracking rigid objects.

7 Summary

In this paper, we have presented a region-based pose tracking algorithm that can incorporate various statistical models for object and background region without explicitly estimating a contour in the image. We showed that this is advantageous, since it leads to a significant speedup compared to other region-based methods that require costly segmentation and contour matching steps. Furthermore, results are often better than those achieved with alternating segmentation and pose estimation steps. Moreover, we demonstrated that this framework can be extended to kinematic chains in order to track human motion. We also proposed to reuse probability density functions from previous frames. This has led to substantial speedups and increased the robustness of the algorithm in case of fast object motions with high accelerations.

In order to deal with partial occlusions in tracking, we suggested a generative approach which tracks all objects simultaneously. Thanks to the depth reasoning in this approach, occlusions can be dealt with, which leads to significantly better performance than treating occlusions merely as model noise. The same concept can deal with self-occlusions, which have been a nuisance in silhouette-based approaches. We also showed a way to generate reasonable multi-component object models from a single model automatically. The methodology was evaluated in various settings, and the experiments revealed a clear improvement in performance due to the occlusion reasoning.

Table 1 summarises all the presented experiments. As can be seen, the number of required iterations varies strongly from sequence to sequence: More iterations are needed if the prediction of the object's pose for the next frame is bad. Furthermore, rotating objects (or object parts) require more iterations, especially in monocular settings. Similarly, the computation time per frame also varies strongly. In all experiments except tea box 2, we optimised the parameters independently for each sequence to get optimal results. Thus, the runtimes are often quite high. This is due to the tradeoff between quality and speed mentioned before. Note that real-time performance was obtained with an algorithm similar to our basic tracker in Prisacariu and Reid [38] using CUDA. Thus, one can expect significantly lower runtimes by performing the processing on the GPU using NVIDIA's CUDA framework.


Even if partial occlusions are handled, there are still situations in which our approach is likely to fail, e.g. if one (part of an) object can barely be distinguished from the surrounding background, or if the projection of the initial pose guess does not overlap with the correct object region due to strong accelerations. This is the topic of our ongoing research. Similarly, we plan to estimate the parameters of our algorithm (the length of the shift vector l, the threshold T, and possibly even the models underlying the PDFs, as well as the image features used) automatically in the future.

References

1. Abidi, M.A., Chandra, T.: Pose estimation for camera calibration and landmark tracking. In: Proceedings of the International Conference on Robotics and Automation, vol. 1, pp. 420–426. Cincinnati (1990)

2. Ablavsky, V., Thangali, A., Sclaroff, S.: Layered graphical models for tracking partially-occluded objects. In: Proc. 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society Press, New York (2008)

3. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 44–58 (2006)

4. Aubert, G., Barlaud, M., Faugeras, O., Jehan-Besson, S.: Image segmentation using active contours: calculus of variations or shape gradients? SIAM J. Appl. Math. 63(6), 2128–2154 (2003)

5. Beveridge, J.: Local search algorithms for geometric object recognition: optimal correspondence and pose. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst (1993)

6. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: Proc. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 8–15. Santa Barbara (1998)

7. Brox, T., Weickert, J.: A TV flow based local scale estimate and its application to texture discrimination. J. Vis. Commun. Image Represent. 17(5), 1053–1073 (2006)

8. Brox, T., Rosenhahn, B., Cremers, D., Seidel, H.P.: High accuracy optical flow serves 3-D pose tracking: exploiting contour and flow based constraints. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, vol. 3952, pp. 98–111. Springer, Graz (2006)

9. Brox, T., Rosenhahn, B., Kersting, U., Cremers, D.: Nonparametric density estimation for human pose tracking. In: Franke, K., Müller, K., Nickolay, B., Schäfer, R. (eds.) Pattern Recognition, Lecture Notes in Computer Science, vol. 4174, pp. 546–555. Springer, Berlin (2006)

10. Brubaker, M.A., Fleet, D.J., Hertzmann, A.: Physics-based person tracking using the anthropomorphic walker. Int. J. Comput. Vis. 87(1/2), 140–155 (2010)

11. Balan, A.O., Black, M.J.: The naked truth: estimating body shape under clothing. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) Computer Vision – ECCV 2008, Part II, Lecture Notes in Computer Science, vol. 5303, pp. 15–29. Springer, Berlin (2008)

12. Corazza, S., Mündermann, L., Gambaretto, E., Ferrigno, G., Andriacchi, T.P.: Markerless motion capture through visual hull, articulated ICP and subject specific model generation. Int. J. Comput. Vis. 87(1/2), 156–169 (2010)

13. Dambreville, S., Sandhu, R., Yezzi, A., Tannenbaum, A.: Robust 3D pose estimation and efficient 2D region-based segmentation from a 3D shape prior. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) Computer Vision – ECCV 2008, Part II, Lecture Notes in Computer Science, vol. 5303, pp. 169–182. Springer, Berlin (2008)

14. Drummond, T., Cipolla, R.: Real-time tracking of multiple articulated structures in multiple views. In: Vernon, D. (ed.) Computer Vision – ECCV 2000, Part II, Lecture Notes in Computer Science, vol. 1843, pp. 20–36. Springer, Berlin (2000)

15. Forsyth, D.A., Arikan, O., Ikemoto, L., O'Brien, J., Ramanan, D.: Computational studies of human motion: part 1, tracking and motion synthesis. Found. Trends Comput. Graph. Vis. 1(2–3), 77–254 (2005)

16. Gall, J., Rosenhahn, B., Seidel, H.P.: Drift-free tracking of rigid and articulated objects. In: Proc. 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, New York (2008)

17. Gall, J., Rosenhahn, B., Brox, T., Seidel, H.P.: Optimization and filtering for human motion capture. Int. J. Comput. Vis. 87(1/2), 75–92 (2010)

18. Gavrila, D.M.: The visual analysis of human movement: a survey. Comput. Vis. Image Underst. 73(1), 82–98 (1999)

19. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. Addison-Wesley, Reading (2002)

20. Grabner, H., Matas, J., Van Gool, L., Cattin, P.: Tracking the invisible: learning where the object might be. In: Proc. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1285–1292. IEEE Computer Society Press, New York (2010)

21. Huang, Y., Essa, I.: Tracking multiple objects through occlusions. In: Proc. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1051–1058. IEEE Computer Society Press, New York (2005)

22. ISO/CIE: CIE colorimetry – part 4: 1976 L*a*b* colour space. ISO 11664-4:2008(E)/CIE S 014-4/E:2007 (2007)

23. Joshi, N., Avidan, S., Matusik, W., Kriegman, D.: Synthetic aperture tracking: tracking through occlusions. In: Proc. Eleventh International Conference on Computer Vision. IEEE Computer Society Press, New York (2007)

24. Kim, K., Davis, L.: Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) Computer Vision – ECCV 2006, Part II, Lecture Notes in Computer Science, vol. 3953, pp. 98–109. Springer, Berlin (2006)

25. Kriegman, D., Vijayakumar, B., Ponce, J.: Constraints for recognizing and locating curved 3D objects from monocular image features. In: Sandini, G. (ed.) Computer Vision – ECCV '92, Lecture Notes in Computer Science, vol. 588, pp. 829–833. Springer, Berlin (1992)

26. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951)

27. Lankton, S., Tannenbaum, A.: Localizing region-based active contours. IEEE Trans. Image Process. 17(11), 2029–2039 (2008)

28. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)

29. Lowe, D.G.: Fitting parameterized three-dimensional models to images. IEEE Trans. Pattern Anal. Mach. Intell. 13(2), 441–450 (1991)

30. Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104(2–3), 90–126 (2006)

31. Mory, B., Ardon, R., Thiran, J.: Variational segmentation using fuzzy region competition and local non-parametric probability density functions. In: Proc. Eleventh International Conference on Computer Vision. IEEE Computer Society Press, New York (2007)

32. Murray, R.M., Li, Z., Sastry, S.S.: A Mathematical Introduction to Robotic Manipulation. CRC Press, Boca Raton (1994)

33. Ormoneit, D., Sidenbladh, H., Black, M.J., Hastie, T.: Learning and tracking cyclic human motion. In: Leen, T.K., Dietterich, T.G., Tresp, V. (eds.) Advances in Neural Information Processing Systems 13, pp. 894–900. The MIT Press, Cambridge (2001)

34. Ottlik, A., Nagel, H.H.: Initialization of model-based vehicle tracking in video sequences of inner-city intersections. Int. J. Comput. Vis. 80(2), 211–225 (2008)

35. Özuysal, M., Lepetit, V., Fleuret, F., Fua, P.: Feature harvesting for tracking-by-detection. In: Computer Vision – ECCV 2006, Part III, Lecture Notes in Computer Science, vol. 3953, pp. 592–605. Springer, Graz (2006)

36. Poppe, R.: Vision-based human motion analysis: an overview. Comput. Vis. Image Underst. 108(1–2), 4–18 (2007)

37. Pressigout, M., Marchand, E.: Real-time 3D model-based tracking: combining edge and texture information. In: IEEE Int. Conf. on Robotics and Automation, ICRA'06, pp. 2726–2731. Orlando (2006)

38. Prisacariu, V.A., Reid, I.D.: PWP3D: real-time segmentation and tracking of 3D objects. In: Proceedings of the 20th British Machine Vision Conference (2009)

39. Ramanan, D., Forsyth, D.A., Zisserman, A.: Tracking people by learning their appearance. IEEE Trans. Pattern Anal. Mach. Intell. 29(1), 65–81 (2007)

40. Rosenhahn, B., Sommer, G.: Adaptive pose estimation for different corresponding entities. In: Van Gool, L. (ed.) Pattern Recognition, Lecture Notes in Computer Science, vol. 2449, pp. 265–273. Springer, Berlin (2002)

41. Rosenhahn, B., Brox, T., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose tracking. Int. J. Comput. Vis. 73(3), 243–262 (2007)

42. Rosenhahn, B., Schmaltz, C., Brox, T., Weickert, J., Cremers, D., Seidel, H.P.: Markerless motion capture of man-machine interaction. In: Proc. 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press, Washington (2008)

43. Sandhu, R., Dambreville, S., Yezzi, A., Tannenbaum, A.: Non-rigid 2D-3D pose estimation and 2D image segmentation. In: Proc. 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 786–793. IEEE Computer Society Press, Washington (2009)

44. Schmaltz, C., Rosenhahn, B., Brox, T., Cremers, D., Weickert, J., Wietzke, L., Sommer, G.: Region-based pose tracking. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds.) Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, vol. 4478, pp. 56–63. Springer, Girona (2007)

45. Schmaltz, C., Rosenhahn, B., Brox, T., Weickert, J., Cremers, D., Wietzke, L., Sommer, G.: Occlusion modeling by tracking multiple objects. In: Hamprecht, F., Schnörr, C., Jähne, B. (eds.) Pattern Recognition, Lecture Notes in Computer Science, vol. 4713, pp. 173–183. Springer, Berlin (2007)

46. Schmaltz, C., Rosenhahn, B., Brox, T., Weickert, J., Wietzke, L., Sommer, G.: Dealing with self-occlusion in region based motion capture by means of internal regions. In: Perales, F.J., Fisher, R.B. (eds.) Articulated Motion and Deformable Objects, Lecture Notes in Computer Science, vol. 5098, pp. 102–111. Springer, Berlin (2008)

47. Shevlin, F.: Analysis of orientation problems using Plücker lines. In: Proc. 14th International Conference on Pattern Recognition, vol. 1, pp. 685–689. IEEE Computer Society Press, Washington (1998)

48. Sigal, L., Balan, A., Black, M.: HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vis. 87(1/2), 4–27 (2010)

49. Stoer, J., Bulirsch, R.: Introduction to Numerical Analysis. Springer, Berlin (1980)

50. Sudderth, E.B., Mandel, M.I., Freeman, W.T., Willsky, A.S.: Distributed occlusion reasoning for tracking with nonparametric belief propagation. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 17, pp. 1369–1376. MIT Press, Cambridge (2005)

51. Sun, M., Su, H., Savarese, S., Fei-Fei, L.: A multi-view probabilistic model for 3D object classes. In: Proc. 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1247–1254. IEEE Computer Society Press, New York (2009)

52. Sundaresan, A., Chellappa, R.: Multicamera tracking of articulated human motion using shape and motion cues. IEEE Trans. Image Process. 18(9), 2114–2126 (2009)

53. Vondrak, M., Sigal, L., Jenkins, O.C.: Physical simulation for probabilistic motion tracking. In: Proc. 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society Press, New York (2008)

54. Wong, A.K.C., You, M.: Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 7(5), 599–609 (1985)

Author biographies

Christian Schmaltz received a Diplom (Master) degree in computer science from Saarland University, Saarbrücken, Germany, in 2004, and a master degree in applied mathematics in 2009. Since April 2006, he has been with the Mathematical Image Analysis Group, Department of Mathematics and Computer Science, Saarland University, where he is currently pursuing his Ph.D. degree in computer science. His current research interests include 3-D pose tracking, image and video compression, and halftoning.

Bodo Rosenhahn studied Computer Science (minor subject Medicine) at the University of Kiel. He received the Dipl.-Inf. and Dr.-Ing. degrees from the University of Kiel in 1999 and 2003, respectively. From 10/2003 till 10/2005, he worked as a Post-Doc at the University of Auckland (New Zealand), funded with a scholarship from the German Research Foundation (DFG). From 11/2005 to 08/2008, he worked as a senior researcher at the Max-Planck Institute for Computer Science in Saarbrücken. Since 09/2008, he is a Full Professor at the Leibniz University of Hannover, heading a group on automated image interpretation.


Thomas Brox received the Ph.D. degree in computer science from Saarland University, Saarbrücken, Germany, in 2005. During his studies, he spent 3 months as a visiting researcher at the INRIA Sophia-Antipolis, France. After his Ph.D., he joined the Computer Vision Group at the University of Bonn, Germany, as a postdoctoral researcher. From October 2007 to October 2008, he headed the Intelligent Systems Group at the University of Dresden, Germany, as a temporary faculty member. He was a postdoctoral fellow in the Computer Vision Group of Jitendra Malik at the U.C. Berkeley for 2 years. He is now a full professor heading the Computer Vision Lab at the Albert-Ludwigs-University Freiburg, Germany. Prof. Brox is associate editor of the Image and Vision Computing Journal and regularly reviews for all major computer vision journals. In 2004, he received the Longuet-Higgins Best Paper Award at the European Conference on Computer Vision.

Joachim Weickert is professor of Mathematics and Computer Science at Saarland University (Saarbrücken, Germany), where he heads the Mathematical Image Analysis Group. He graduated and obtained his Ph.D. from the University of Kaiserslautern (Germany) in 1991 and 1996. He worked as a Post-Doctoral Researcher at the University Hospital of Utrecht (The Netherlands) and the University of Copenhagen (Denmark), and as an Assistant Professor at the University of Mannheim (Germany). Joachim Weickert has developed many models and efficient algorithms for image processing and computer vision using partial differential equations and optimisation principles. In particular, he has contributed to diffusion filtering, optical flow computation, processing of tensor fields, and image compression. His scientific work covers about 230 refereed publications. He has served in the editorial boards of nine international journals and given more than 130 invited talks. In 2010, he received the Gottfried Wilhelm Leibniz Prize, which is the highest German science award.
