
Multi-camera Tracking of Articulated Human Motion Using Motion and Shape Cues

Aravind Sundaresan and Rama Chellappa

Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA

{aravinds, rama}@cfar.umd.edu
http://www.cfar.umd.edu/users/aravinds/

P.J. Narayanan et al. (Eds.): ACCV 2006, LNCS 3852, pp. 131–140, 2006. © Springer-Verlag Berlin Heidelberg 2006

Abstract. We present a framework and algorithm for tracking articulated motion for humans. We use multiple calibrated cameras and an articulated human shape model. Tracking is performed using motion cues as well as image-based cues (such as silhouettes and "motion residues", hereafter referred to as spatial cues), as opposed to constructing a 3D volume image or visual hulls. Our algorithm consists of a predictor and a corrector: the predictor estimates the pose at t + 1 using motion information between images at t and t + 1. The error in the estimated pose is then corrected using spatial cues from images at t + 1. In our predictor, we use robust multi-scale parametric optimisation to estimate the pixel displacement for each body segment. We then use an iterative procedure to estimate the change in pose from the pixel displacement of points on the individual body segments. We present a method for fusing information from different spatial cues such as silhouettes and "motion residues" into a single energy function. We then express this energy function in terms of the pose parameters, and find the optimum pose for which the energy is minimised.

1 Introduction

The complex articulated structure of human beings makes tracking articulated human motion a difficult task. It is necessary to use multiple cameras to deal with occlusion and kinematic singularities. We also need shape models to deal with the large number of body segments and to exploit their articulated structure. In our work, we use shape models, whose parameters are known, to build a system that can track articulated human body motion using multiple cameras in a robust and accurate manner. A tracking system works better when more observations are available to estimate the pose, and to that end our system uses different kinds of cues that can be estimated from the images. We use both motion information (in the form of pixel displacements) and spatial information (such as silhouettes and "motion residues", hereafter referred to as spatial cues). The motion and spatial cues are complementary in nature. We present a framework for unifying different spatial cues into a single energy image. The energy of a pose can be described in terms of this energy image. We can then obtain the pose that possesses the least energy using optimisation techniques.


Much of the work in the past has focussed on using either motion or spatial parameters. In this paper we present an algorithm that fuses together information from these two kinds of cues. Since we use both motion and spatial cues in our tracking algorithm, we are able to better deal with cases where the body segments are close to each other, such as when the arms are by the side of the body. Purely silhouette-based methods typically experience difficulties in such cases. Silhouette- or edge-based methods also suffer from the weakness that they cannot deal with rotation about the axis of a body segment.

Estimating the initial pose is a different problem from tracking and is difficult due to the large number of unknown parameters (joint angles). It is computationally intensive and typically requires several additional algorithms such as head detectors or hand detectors. Stochastic algorithms such as particle filtering or optimisation methods are required for the sake of robustness. While the methods we present in this paper can be used for initialisation as well, we concentrate on the tracking aspect.

Fig. 1. Overview of the algorithm

Fig. 2. 3D model comparison: (a) 3D scan, (b) super-quadric

In our work, we use eight cameras that are placed around the subject. We use parametric shape models connected in an articulated tree to represent the human body, as described in §1.2. Our system, the block diagram of which is presented in Figure 1, consists of two parts: a predictor and a corrector. We assume that the initial pose is known. The tracking algorithm is as follows.

• Compute 2D pixel displacement between frames at times t and t + 1.
• Predict the 3D pose at t + 1 based on 2D motion from multiple cameras.
• Compute an energy function that fuses information from different spatial cues.
• Use the energy function to refine the estimate of the pose at t + 1.

We represent the pose, ϕ_t, in parametric form as a vector comprising the position of the base-body (6 degrees of freedom) and the joint angles of the various articulated body segments (3 degrees of freedom for each joint). δ represents the incremental pose vector.
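To make the parametrisation concrete, the following is a minimal Python (numpy) sketch of such a pose vector; the class and attribute names are ours, not the paper's.

```python
import numpy as np

class BodyPose:
    """Sketch of the pose vector: 6-DoF base body + 3-DoF angles per joint."""

    def __init__(self, n_joints):
        self.base = np.zeros(6)                # base-body position + orientation
        self.joints = np.zeros((n_joints, 3))  # 3 joint angles per segment

    def as_vector(self):
        # Flatten into the single parameter vector phi used by the tracker.
        return np.concatenate([self.base, self.joints.ravel()])

    def increment(self, delta):
        # phi_{t+1} = phi_t + delta_t, cf. equation (3) in Section 2.3.
        out = BodyPose(self.joints.shape[0])
        v = self.as_vector() + delta
        out.base, out.joints = v[:6], v[6:].reshape(-1, 3)
        return out
```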

We summarise prior work in articulated tracking in §1.1. We then describe the models in §1.2 and the details of our algorithm in §2. We validate our algorithm using real images captured from eight cameras, and the results are presented in §3.


1.1 Prior Work

We address the problem of tracking articulated human motion using multiple cameras. Gavrila and Davis [1], Aggarwal and Cai [2], and Moeslund and Granum [3] provide surveys of human motion tracking and analysis methods. We look at some existing methods that use either motion-based methods or silhouette- or edge-based methods to perform tracking. Yamamoto and Koshikawa [4] analyse human motion based on a robot model, and Yamamoto et al. [5] track human motion using multiple cameras. Gavrila and Davis [6] discuss a multi-view approach for 3D model-based tracking of humans in action. They use a generate-and-test algorithm in which they search for poses in a parameter space and match them using a variant of Chamfer matching. Bregler and Malik [7] use an orthographic camera model and optical flow. Rehg and Morris [8] and Rehg et al. [9] describe ambiguities and singularities in the tracking of articulated objects, and Cham and Rehg [10] propose a 2D scaled prismatic model. Sidenbladh et al. [11] provide a framework to track 3D human figures using 2D image motion and particle filters, with a constrained motion model that restricts the kinds of motions that can be tracked. Kakadiaris and Metaxas [12] use silhouettes from multiple cameras to estimate 3D motion. Plaenkers and Fua [13] use articulated soft objects with an underlying articulated skeleton as a model, and use stereo and silhouette data for shape and motion recovery. Theobalt et al. [14] project the texture of the model obtained from silhouette-based methods and refine the pose using the flow field. Delamarre and Faugeras [15] use 3D articulated models for tracking with silhouettes. They apply forces to the contours obtained from the projection of the 3D model so that they move towards the silhouette contours obtained from multiple images. Cheung et al. [16] use shape-from-silhouette to estimate human body kinematics. Chu et al. [17] use volume data to acquire and track a human body model. Wachter and Nagel [18] track persons in monocular image sequences; they use an IEKF with a constant motion model, and use edge and region information in the pose update step. Moeslund and Granum [19] use multiple cues for model-based human motion capture and use kinematic constraints to estimate the pose of a human arm. The multiple cues are depth (obtained from a stereo rig) and the extracted silhouette, whereas the kinematic constraints are applied in order to restrict the parameter space by excluding impossible poses. Sigal et al. [20, 21] use non-parametric belief propagation to track in a multi-view setup. Lan and Huttenlocher [22] use hidden Markov temporal models. Demirdjian et al. [23] constrain pose vectors based on kinematic models using SVMs. Rohr [24] performs automated initialisation of the pose for single-camera motion. Krahnstoever [25] addresses the issue of model acquisition and initialisation. Mikic et al. [26] automatically extract the model and pose using voxel data. Ramanan and Forsyth [27] also suggest an algorithm that performs rough pose estimation and can be used in an initialisation step. Sminchisescu and Triggs present a method for monocular video sequences using robust image matching, joint limits, and non-self-intersection constraints [28]. They also try to remove kinematic ambiguities in monocular pose estimation efficiently [29].


Our method is different in that we use both motion and spatial cues to track the pose, as opposed to using volume- or visual-hull-based techniques or only optical flow. We use spatial and motion cues obtained from multiple views in order to obtain robust results that overcome occlusions and kinematic singularities. We also present a novel method to use spatial cues such as silhouettes and motion residues. It is also possible to incorporate edges in our method. Furthermore, we do not constrain the motion or the pose parameters to specific types of motion (such as walking), and hence our method is general.

1.2 Models

A good human shape model should allow the system to represent the human body in all of its postures, and yet be simple enough to minimise the number of parameters required to represent the body accurately. We use tapered super-quadrics to represent the different body segments. We can use more complex triangular mesh models if we can acquire the parameters of such models. We illustrate the 3D model used in our experiments in Figure 2. The dimensions of the super-quadrics are obtained manually with the help of the 3D scanned model in the figure. The motion of the different body segments is constrained by the articulated structure of the body. The base body (trunk) has 6 degree-of-freedom (DoF) motion. All other body segments are attached to the base body in a kinematic chain and have at most 3-DoF rotational motion with respect to the parent node. In addition to the shape of each body segment, the body model includes the locations of the joints between the body segments.
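For concreteness, the sketch below samples points on a tapered super-quadric surface. It uses the standard super-quadric parametrisation with a linear taper along the limb axis; this exact formulation is an assumption of ours, since the paper does not spell out its taper model.

```python
import numpy as np

def tapered_superquadric(a1, a2, a3, e1, e2, taper, n=32):
    """Sample a tapered super-quadric (assumed standard parametrisation).

    a1, a2, a3: semi-axis lengths; e1, e2: squareness exponents;
    taper: linear taper factor applied to the cross-section along z.
    """
    eta = np.linspace(-np.pi / 2, np.pi / 2, n)   # latitude angle
    om = np.linspace(-np.pi, np.pi, n)            # longitude angle
    eta, om = np.meshgrid(eta, om)
    f = lambda w, e: np.sign(w) * np.abs(w) ** e  # signed-power function
    x = a1 * f(np.cos(eta), e1) * f(np.cos(om), e2)
    y = a2 * f(np.cos(eta), e1) * f(np.sin(om), e2)
    z = a3 * f(np.sin(eta), e1)
    t = 1 + taper * z / a3                        # linear taper along the z axis
    return x * t, y * t, z
```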

2 Algorithm

We compute the pose at time t + 1 given the pose at time t, using the images at times t and t + 1. The pose at t + 1 is estimated in two steps, the prediction step and the correction step. The steps required to estimate the pose at time t + 1 are first listed and then described in detail in the sections that follow.

1. Pixel-body registration at time t using the known pose at t.
2. Estimate the pixel displacement between times t and t + 1.
3. Predict the pose at time t + 1 using the pixel displacement.
4. Combine the silhouettes and the "motion residue" for each body segment into an "energy image" for each image.
5. Correct the predicted pose at time t + 1 using the "energy image" obtained in step 4.
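The overall loop can be summarised with the following Python sketch. The five callables are placeholders for the stages above, whose implementations are sketched in §2.1 to §2.4; the interfaces and names are ours, not the paper's.

```python
def track_sequence(images, pose_0, register, displace, predict, build_energy, correct):
    """Predictor-corrector tracking loop over steps 1-5 above.

    images[t] holds the frames from all cameras at time t; the five
    callables are hypothetical stand-ins for the stages described in
    the sections that follow.
    """
    poses = [pose_0]
    for t in range(len(images) - 1):
        masks = register(images[t], poses[t])              # step 1
        disp = displace(images[t], images[t + 1], masks)   # step 2
        pose_pred = predict(poses[t], disp)                # step 3
        energy = build_energy(images[t + 1], masks, disp)  # step 4
        poses.append(correct(pose_pred, energy))           # step 5
    return poses
```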

2.1 Pixel-Body Registration

Fig. 3. Pixel registration: (a) View 1, (b) View 2

Fig. 4. Pixel displacement and motion residue: (a) mask, (b) image difference, (c) motion residue (MR), (d) flow

Pixel-body registration is the process of registering each pixel in each image to a body segment, as well as obtaining approximate 3D coordinates of the point. We thus obtain a 2D mask for each body segment that we can use while estimating the pixel displacement. We convert each body segment into a triangular mesh, project it onto each image, and compute the depth at each pixel by interpolating the depths of the triangle vertices. We can thus fairly easily extend our algorithm to use triangular mesh models instead of super-quadrics. Since the depths of all pixels are known, we can compute occlusions. Figure 3 illustrates the projection of the body onto images from two cameras; different colours indicate different body segments. We compute approximate 3D coordinates of pixels in a similar fashion.
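A minimal sketch of the per-triangle depth pass is given below: a plain z-buffer with barycentric interpolation, which is linear in image space and hence yields the "approximate" coordinates mentioned above. The function and buffer names are ours, not the paper's; zbuf should be initialised to +inf and labelbuf to a background label.

```python
import numpy as np

def rasterize_triangle_depth(tri2d, depths, zbuf, labelbuf, label):
    """Fill per-pixel depth and segment label for one projected triangle.

    tri2d:  (3, 2) projected vertex coordinates (x, y) in the image.
    depths: (3,) camera-space depths of the three vertices.
    zbuf, labelbuf: per-pixel depth and body-segment label images;
        the nearest segment wins, which resolves occlusions.
    """
    xmin, ymin = np.floor(tri2d.min(axis=0)).astype(int)
    xmax, ymax = np.ceil(tri2d.max(axis=0)).astype(int)
    # Matrix mapping homogeneous (x, y, 1) to barycentric coordinates.
    T = np.linalg.inv(np.vstack([tri2d.T, np.ones(3)]))
    for y in range(max(ymin, 0), min(ymax + 1, zbuf.shape[0])):
        for x in range(max(xmin, 0), min(xmax + 1, zbuf.shape[1])):
            bary = T @ np.array([x, y, 1.0])
            if np.all(bary >= 0):      # pixel lies inside the triangle
                z = bary @ depths      # interpolated depth at this pixel
                if z < zbuf[y, x]:
                    zbuf[y, x] = z
                    labelbuf[y, x] = label
```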

2.2 Estimating Pixel Displacement

As we use pixel displacement between frames to estimate the change in 3D pose, we are not dependent on specific optical flow algorithms. Figure 4 illustrates how we obtain the pixel displacement of a single body segment, the example being the left forearm shown in Figure 3. We use a robust parametric model for the motion of the rigid objects, so that the displacement, ∆x_i, at pixel x_i is given by ∆(x_i, φ), where φ = [u, v, θ, s]. The elements of φ are the displacements along the x and y axes, the rotation, and the scale, respectively. We find this parametric representation more intuitive and more robust than an affine model. We obtain the value of φ ∈ [φ_0 − φ_B, φ_0 + φ_B] that minimises the residue given by eᵀe, where

[e]_j = I_t(x_ij) − I_{t+1}(x_ij + ∆(x_ij, φ)),

and {x_ij : j = 1, 2, . . .} is the set of all points in the mask obtained in §2.1 and illustrated in Figure 4 (a). φ_0 denotes zero motion and φ_B denotes the bounds that we impose on the motion. Figure 4 (a) shows the mask over the smoothed intensity image at time t. Figure 4 (b) is the difference between the images at times t and t + 1, i.e., with zero motion; it has large values in the mask region, signifying that there is some motion. Figure 4 (c) is the difference between the image at time t and the image at time t + 1 warped according to the estimated motion, and is called the "motion residue" for the optimal φ. The value of the pixels in the region of the mask is close to zero where the estimated pixel displacement agrees with the actual pixel displacement. The "motion residue" provides us with a rough delineation of the location of the body segment, even when the original mask does not exactly match the body segment.
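The warp ∆(x, φ) and the residue above can be sketched in Python as follows. Rotating and scaling about the mask centroid is our assumption (the paper does not state the centre of the transform), and nearest-neighbour sampling stands in for the interpolation and multi-scale pyramid of the actual implementation.

```python
import numpy as np

def displacement(pts, phi, center):
    """Displacement field Delta(x, phi) for phi = [u, v, theta, s].

    pts: (N, 2) integer pixel coordinates of the segment mask.
    center: point about which rotation/scaling is applied (assumed
    here to be the mask centroid).
    """
    u, v, theta, s = phi
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    warped = s * (pts - center) @ R.T + center + np.array([u, v])
    return warped - pts

def residue(I_t, I_t1, pts, phi, center):
    """Per-pixel residue e_j = I_t(x_j) - I_{t+1}(x_j + Delta(x_j, phi))."""
    q = np.rint(pts + displacement(pts, phi, center)).astype(int)
    q[:, 0] = np.clip(q[:, 0], 0, I_t1.shape[1] - 1)  # keep inside image
    q[:, 1] = np.clip(q[:, 1], 0, I_t1.shape[0] - 1)
    return I_t[pts[:, 1], pts[:, 0]] - I_t1[q[:, 1], q[:, 0]]
```

The optimal φ within [φ_0 − φ_B, φ_0 + φ_B] can then be found by applying any bounded least-squares routine to this residue.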


2.3 Pose Prediction

The pose parameter we need to estimate is the vector ϕ, which consists of the 6-DoF parameters for the base-body and the 3-DoF joint angles for each of the remaining body segments. The state vector in our state-space formulation is ϕ_t, as in (1)–(2).

State update: ϕ_{t+1} = h(ϕ_t) + δ_t    (1)
Observation: f(∆x_t, ϕ_t, ϕ_{t+1}) = 0    (2)

In our case the function h(·) is linear (3), and the pixel position x(·) in (4) is a non-linear function of the pose, ϕ, and the incremental pose, δ. However, it is well approximated by a linear function locally.

ϕ_{t+1} = ϕ_t + δ_t    (3)
f(∆x_t, ϕ_t, δ_t) = ∆x_t − (x(ϕ_t + δ_t) − x(ϕ_t))    (4)

Let us consider the observation, the measured (noisy) pixel displacement, ∆x′_t = ∆x_t + η, where η is the measurement noise and ∆x_t is the true pixel displacement. We expand f(∆x_t, ϕ_t, δ_t) in a Taylor series about (∆x′_t, ϕ̂_t, δ̂_t) as

f(∆x′_t, ϕ̂_t, δ̂_t) + (∂f/∂∆x_t)(∆x_t − ∆x′_t) + (∂f/∂ϕ_t)(ϕ_t − ϕ̂_t) + (∂f/∂δ_t)(δ_t − δ̂_t) + O(· · ·).    (5)

The left-hand side, f(∆x_t, ϕ_t, δ_t), is 0. The first term, f(∆x′_t, ϕ̂_t, δ̂_t), is given by ∆x′_t − (x(ϕ̂_t + δ̂_t) − x(ϕ̂_t)). The second term can be simplified as (∂f/∂∆x_t)(∆x_t − ∆x′_t) = 1 · (−η) = −η. The third term in (5), (∂f/∂ϕ_t)(ϕ_t − ϕ̂_t), is negligible because the function f(·) is not very sensitive to the current pose, ϕ_t, and we expect the term ϕ_t − ϕ̂_t to be negligible as well. We assume, without loss of generality, that δ_t is a linear function of time t, so that δ_t = δ · t, where δ is a constant. We note that (6) follows from the fact that the pixel velocity, ∂x(ϕ_t)/∂t, at a given point is a linear function of the rate of change of pose, δ [30].

∂f(∆x, ϕ, δ_t)/∂δ_t = −(∂x(ϕ + δ_t)/∂t)/(∂δ_t/∂t) = −F(ϕ + δ_t) δ/δ = −F(ϕ_t + δ_t)    (6)

The fourth term is therefore (∂f/∂δ_t)|_(∆x′_t, ϕ̂_t, δ̂_t) = −F(ϕ̂_t + δ̂_t). We neglect the higher-order terms in (5) and obtain the following linearised observation equation (7).

∆x′_t − (x(ϕ̂_t + δ̂_t) − x(ϕ̂_t)) + η = F(ϕ̂_t + δ̂_t)(δ_t − δ̂_t)    (7)

We solve (7) for δ_t iteratively. We set δ_t^(0) = 0 and perform the following steps until we obtain numerical convergence, which happens within a few iterations. We finally set ϕ_{t+1} = ϕ_t + δ_t^(N).

• Set F^(i) = F(ϕ_t + δ_t^(i)).
• Set ∆x_t^(i) = ∆x′_t − (x(ϕ_t + δ_t^(i)) − x(ϕ_t)).
• Update pose: δ_t^(i+1) = δ_t^(i) + (F^(i)ᵀ F^(i))^(−1) F^(i)ᵀ ∆x_t^(i).
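To make the iteration concrete, here is a minimal Python sketch of the predictor. The helpers x_of(ϕ) (stacked predicted pixel positions over all cameras) and F_of(ϕ) (the Jacobian F of pixel position with respect to pose, whose form is derived in [30]) are assumptions of the sketch, as are all names.

```python
import numpy as np

def predict_pose(phi_t, dx_obs, x_of, F_of, n_iter=10, tol=1e-6):
    """Iteratively solve the linearised observation (7) for delta_t.

    dx_obs: measured pixel displacements, stacked over all cameras.
    x_of(phi): predicted pixel positions for pose phi (assumed helper).
    F_of(phi): Jacobian of pixel positions w.r.t. pose (assumed helper).
    Returns the predicted pose phi_{t+1} = phi_t + delta_t^(N).
    """
    delta = np.zeros_like(phi_t)
    x0 = x_of(phi_t)
    for _ in range(n_iter):
        F = F_of(phi_t + delta)                    # F^(i)
        dx_i = dx_obs - (x_of(phi_t + delta) - x0) # Delta x_t^(i)
        # Least-squares step equals (F^T F)^{-1} F^T dx_i.
        step, *_ = np.linalg.lstsq(F, dx_i, rcond=None)
        delta = delta + step
        if np.linalg.norm(step) < tol:             # numerical convergence
            break
    return phi_t + delta
```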

2.4 Computing Spatial Energy Function

We combine different types of spatial cues into an energy image for each body segment. This allows us to use the framework irrespective of which spatial cues are available. In our work, we use silhouette information as well as the "motion residue" obtained during motion estimation.

Figure 4 (c) is the "motion residue" for that segment, and provides us with the region that agrees with the motion of the mask. We combine the "motion residue" with the silhouette as shown in Figure 5. We can form energy images even if the quality of the silhouette is not very good. There are a number of outliers in the silhouettes, and though these may affect other silhouette-based algorithms, they do not affect our algorithm much.

Fig. 5. Obtaining the unified energy image for the forearm: (a) silhouette, (b) silhouette, (c) motion residue, (d) energy, (e) object mask showing old and new positions, (f) 2D pose parameters: displacement (dx, dy) and rotation

Once we have the pixel-wise energy image for each camera and a given body segment, we compute the energy for different values of 2D parameters such as displacement and rotation. We have a mask for the body segment in a given image, as illustrated in Figure 5 (e). We can move this mask by a translation (dx, dy) or a rotation θ, as illustrated in Figure 5 (f). We can find the "energy" of the mask in each position by summing the energy of all the pixels that belong to the mask. Thus we can express the energy as a function of (dx, dy, θ) in the neighbourhood of (dx, dy, θ) = (0, 0, 0). When the body segment moves in 3D space by a translation and rotation, we can project the new axis onto the image and find the corresponding 2D configuration parameters in each of the images. We can then find the energy of the 3D pose by summing the energies of the mask in the 2D configurations in each image.
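A sketch of this mask-energy evaluation follows. Applying the rotation about a supplied centre point is our assumption (the paper moves the projected segment axis), and the names are illustrative. The energy of a 3D pose is then the sum of this quantity over all cameras, each evaluated at the projected 2D configuration.

```python
import numpy as np

def mask_energy(energy_img, mask_pts, center, dx, dy, theta):
    """Energy of a body-segment mask under a 2D displacement and rotation.

    energy_img: combined per-pixel energy image for this segment.
    mask_pts: (N, 2) integer pixel coordinates (x, y) of the segment mask.
    Sums the per-pixel energy over the moved mask.
    """
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    moved = np.rint((mask_pts - center) @ R.T + center + [dx, dy]).astype(int)
    moved[:, 0] = np.clip(moved[:, 0], 0, energy_img.shape[1] - 1)
    moved[:, 1] = np.clip(moved[:, 1], 0, energy_img.shape[0] - 1)
    return energy_img[moved[:, 1], moved[:, 0]].sum()
```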

We minimise this energy function in the local neighbourhood using a Levenberg-Marquardt optimisation technique, initialised to the current 3D position. We show the new position of the axis of the body segment after optimisation in Figure 6. The red line represents the initial position of the axis of the body segment and the cyan line represents the new position. We thus correct the pose using spatial cues.


Fig. 6. Minimum energy configuration (energy images for cameras 1–4, 6, and 8)

3 Experimental Results and Conclusions

In the experiments performed, we use grey-scale images from eight cameras with a spatial resolution of 648 × 484. Calibration is performed using Tomas Svoboda's algorithm [31] and a simple calibration device to compute the scale. We use images that have been undistorted based on the radial calibration parameters of the cameras. We use a perspective projection model for the cameras. Experiments were conducted on different kinds of sequences, and we present the results of two such experiments. The subject performs motions that exercise several joint angles in the body. Our results show that using only motion cues for tracking causes the pose estimator to lose track eventually, as we are estimating only the difference in the pose and the error therefore accumulates. This underlines the need for "correcting" the pose estimated using motion cues. We show that the "correction" step of the algorithm prevents drift in the tracking. In Figure 7, we present results in which we have superimposed the model, in the estimated pose, over the images obtained from two cameras. The length of the first sequence is 10 seconds (300 frames), during which there is considerable movement and bending of the arms, and occlusions at various times in different cameras. The second sequence is that of the subject walking; the body parts are successfully tracked in both cases.

Fig. 7. Tracking results using both motion and spatial cues


We note that the method is fairly accurate and robust despite the fact that the human body model used is not very accurate, given that it was obtained manually using visual feedback. Specifically, the method is sensitive to joint locations, and it is important to estimate the joint locations accurately during the model acquisition stage. We also note that the method scales with the accuracy of the human body model. While we use super-quadrics to represent body segments, we could easily use triangular meshes instead, provided they can be obtained. We need to consider more flexible models that allow the locations of certain joints, such as the shoulder joints, to vary with respect to the trunk, to better model the human body.

References

1. Gavrila, D.M.: The visual analysis of human movement: A survey. Computer Vision and Image Understanding 73 (1999) 82–98
2. Aggarwal, J., Cai, Q.: Human motion analysis: A review. Computer Vision and Image Understanding 73 (1999) 428–440
3. Moeslund, T., Granum, E.: A survey of computer vision-based human motion capture. Computer Vision and Image Understanding (2001) 231–268
4. Yamamoto, M., Koshikawa, K.: Human motion analysis based on a robot arm model. In: CVPR. (1991) 664–665
5. Yamamoto, M., Sato, A., Kawada, S., Kondo, T., Osaki, Y.: Incremental tracking of human actions from multiple views. In: CVPR. (1998) 2–7
6. Gavrila, D., Davis, L.: 3-D model-based tracking of humans in action: A multi-view approach. In: Computer Vision and Pattern Recognition. (1996) 73–80
7. Bregler, C., Malik, J.: Tracking people with twists and exponential maps. In: CVPR. (1998) 8–15
8. Rehg, J.M., Morris, D.: Singularity analysis for articulated object tracking. In: Computer Vision and Pattern Recognition. (1998) 289–296
9. Rehg, J., Morris, D.D., Kanade, T.: Ambiguities in visual tracking of articulated objects using two- and three-dimensional models. International Journal of Robotics Research 22 (2003) 393–418
10. Cham, T.J., Rehg, J.M.: A multiple hypothesis approach to figure tracking. In: Computer Vision and Pattern Recognition. Volume 2. (1999)
11. Sidenbladh, H., Black, M.J., Fleet, D.J.: Stochastic tracking of 3D human figures using 2D image motion. In: ECCV. (2000) 702–718
12. Kakadiaris, I., Metaxas, D.: Model-based estimation of 3D human motion. IEEE PAMI 22 (2000) 1453–1459
13. Plankers, R., Fua, P.: Articulated soft objects for video-based body modeling. In: ICCV. (2001) 394–401
14. Theobalt, C., Carranza, J., Magnor, M.A., Seidel, H.P.: Combining 3D flow fields with silhouette-based human motion capture for immersive video. Graph. Models 66 (2004) 333–351
15. Delamarre, Q., Faugeras, O.: 3D articulated models and multi-view tracking with silhouettes. In: ICCV. (1999) 716–721
16. Cheung, K.M., Baker, S., Kanade, T.: Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In: IEEE Conference on Computer Vision and Pattern Recognition. (2003) 77–84
17. Chu, C.W., Jenkins, O.C., Mataric, M.J.: Markerless kinematic model and motion capture from volume sequences. In: CVPR (2). (2003) 475–482
18. Wachter, S., Nagel, H.H.: Tracking persons in monocular image sequences. Computer Vision and Image Understanding 74 (1999) 174–192
19. Moeslund, T., Granum, E.: Multiple cues used in model-based human motion capture. In: International Conference on Face and Gesture Recognition. (2000)
20. Sigal, L., Isard, M., Sigelman, B.H., Black, M.J.: Attractive people: Assembling loose-limbed models using non-parametric belief propagation. In: NIPS. (2003)
21. Sigal, L., Bhatia, S., Roth, S., Black, M.J., Isard, M.: Tracking loose-limbed people. In: CVPR. (2004) 421–428
22. Lan, X., Huttenlocher, D.P.: A unified spatio-temporal articulated model for tracking. In: CVPR (1). (2004) 722–729
23. Demirdjian, D., Ko, T., Darrell, T.: Constraining human body tracking. In: ICCV. (2003) 1071–1078
24. Rohr, K.: Human Movement Analysis Based on Explicit Motion Models. Kluwer Academic (1997)
25. Krahnstoever, N., Sharma, R.: Articulated models from video. In: Computer Vision and Pattern Recognition. (2004) 894–901
26. Mikic, I., Trivedi, M., Hunter, E., Cosman, P.: Human body model acquisition and tracking using voxel data. International Journal of Computer Vision 53 (2003) 199–223
27. Ramanan, D., Forsyth, D.A.: Finding and tracking people from the bottom up. In: CVPR (2). (2003) 467–474
28. Sminchisescu, C., Triggs, B.: Covariance scaled sampling for monocular 3D body tracking. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, USA. Volume 1. (2001) 447–454
29. Sminchisescu, C., Triggs, B.: Kinematic jump processes for monocular 3D human tracking. In: International Conference on Computer Vision & Pattern Recognition. (2003) I 69–76
30. Sundaresan, A., RoyChowdhury, A., Chellappa, R.: Multiple view tracking of human motion modelled by kinematic chains. In: International Conference on Image Processing, Singapore. (2004)
31. Svoboda, T., Martinec, D., Pajdla, T.: A convenient multi-camera self-calibration for virtual environments. PRESENCE: Teleoperators and Virtual Environments 14 (2005) To appear.