
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 12, DECEMBER 2013 4627

Data-Free Prior Model for Upper Body Pose Estimation and Tracking

Jixu Chen, Member, IEEE, Siqi Nie, Student Member, IEEE, Qiang Ji, Senior Member, IEEE

Abstract— Video based human body pose estimation seeks to estimate the human body pose from an image or a video sequence, which captures a person exhibiting some activities. To handle noise and occlusion, a pose prior model is often constructed and is subsequently combined with the pose estimated from the image data to achieve a more robust body pose tracking. Various body prior models have been proposed. Most of them are data-driven, typically learned from 3D motion capture data. In addition to being expensive and time-consuming to collect, these data-based prior models cannot generalize well to activities and subjects not present in the motion capture data. To alleviate this problem, we propose to learn the prior model from anatomic, biomechanics, and physical constraints, rather than from the motion capture data. For this, we propose methods that can effectively capture different types of constraints and systematically encode them into the prior model. Experiments on benchmark data sets show the proposed prior model, compared with data-based prior models, achieves comparable performance for body motions that are present in the training data. It, however, significantly outperforms the data-based prior models in generalization to different body motions and to different subjects.

Index Terms— Body pose estimation, body pose model, knowledge-based model.

I. INTRODUCTION

TRACKING human body pose is of great interest in various applications such as human computer interaction, video surveillance, remote collaboration, and computer animation. Using commercial motion capture (Mocap) systems, body pose tracking is made simpler by placing markers on the human body. However, marker-based systems are often unnatural and distressing, and the cost of a commercial Mocap system can be prohibitive for an ordinary user. Marker-less body pose tracking, in contrast, uses only images as input, without any specific marking on the subject. Marker-less body pose tracking is natural and can be implemented without any intrusion on the user, but it is also much more difficult, particularly when the image observation is noisy due to occlusion, clothing and illumination. To address this challenge, the marker-less approach is often combined with a body motion prior model, which is

Manuscript received May 9, 2012; revised November 4, 2012, March 6, 2013, and May 5, 2013; accepted July 9, 2013. Date of publication July 24, 2013; date of current version September 26, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jean-Philippe Thiran.

J. Chen is with the Computer Vision Laboratory, GE Global Research Center, Niskayuna KW-C410, NY 12308 USA (e-mail: [email protected]).

S. Nie and Q. Ji are with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180-3590 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2013.2274748

constructed offline [1], [2], to yield a more accurate and robust body pose estimation and tracking.

To realize such a hybrid approach, the Bayesian formulation is widely used. The Bayesian formulation infers the posterior of the body pose s, given the image observation o. Based on Bayes rule, the posterior can be factorized as p(s|o) ∝ p(o|s)p(s), where p(o|s) is the image likelihood representing the closeness between the pose and the image observations, and p(s) is the prior probability of the body pose. When the image likelihood term is not reliable due to noise or inadequate training data, a good prior model can improve the robustness of the final result. This paper focuses on developing a robust and generalizable prior model of body pose under natural body movement.
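The factorization p(s|o) ∝ p(o|s)p(s) can be illustrated with a toy one-dimensional sketch (all distributions and numbers below are illustrative, not from the paper): a Gaussian prior over a scalar "pose" is combined with a Gaussian likelihood, and the normalized product gives the posterior.

```python
import numpy as np

# Toy 1-D illustration of p(s|o) ∝ p(o|s) p(s) over a discretized "pose" s.
poses = np.linspace(-1.0, 1.0, 201)            # candidate pose values
prior = np.exp(-0.5 * (poses / 0.5) ** 2)      # p(s): prefers poses near 0
obs = 0.8                                      # noisy image measurement
likelihood = np.exp(-0.5 * ((poses - obs) / 0.3) ** 2)   # p(o|s)

posterior = likelihood * prior                 # unnormalized p(s|o)
posterior /= posterior.sum()                   # normalize over the grid

# The MAP estimate sits between the raw observation and the prior mean.
s_map = poses[np.argmax(posterior)]
```

The MAP estimate is pulled from the raw observation toward the prior, which is how a prior model compensates for a noisy likelihood.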

In recent years, various body prior models have been proposed. Most of them are learned from training data, obtained by using a motion capture system to capture the 3D body poses while the subject is performing various activities. Besides being expensive to collect, a model learned from limited training data cannot generalize well to activities and subjects not present in the training data.

In contrast to the data-based pose prior model learned from training data, we propose a data-free body prior model which is learned exclusively from the related theories and principles, from different disciplines, that govern body pose and body movement. We generically call such body theories and principles body knowledge, which is differentiated, in the context of this paper, from the knowledge learnt from the training data (which we call "data knowledge"). We argue that such body knowledge and the models learned from it are applicable to different natural body movements and to different subjects. There are several facets to our work.

1) First, we systematically identify and represent the body knowledge in the form of anatomic, biomechanics, and physical constraints.

2) Second, we propose advanced machine learning methods to effectively encode various types of body constraints into one body prior model. Though not able to capture body motion statistics, our prior model constrains the body to physically and anatomically plausible movements.

The experiment shows the importance of using body knowledge/constraints. While each body constraint may be weak, the constraints, when combined, collectively form a strong prior model that improves body pose estimation, in particular when the image measurement of the body is poor. The proposed data-free body prior model significantly outperforms the data-based prior models in generalization to new body movements or new users.

1057-7149 © 2013 IEEE

II. RELATED WORK

In this section, we review related work on constructing the prior model. Note that this paper focuses on the Bayesian formulation. The discriminative formulation [3], [4], the other popular body pose estimation framework, which learns the mapping from image to body pose directly, is not discussed.

A. Data-Based Prior Model

Due to the high dimensionality of the body pose (25–50 dimensions are not uncommon [5]), modeling its prior probability is still a challenging problem. Most current prior models of body pose can be classified into two categories: prior models using dimensionality reduction and prior models using probabilistic graphical models (PGM). Besides these two most popular methods, example-based models, such as kernel density functions [6] and hashing [7], directly use the training examples to build the prior model.

One popular method to model the high-dimensional body pose is to project it to a low-dimensional subspace. Linear dimensionality reduction methods, such as principal component analysis (PCA) [8], [9], have been popular due to their simplicity and efficiency. However, due to the simple linear projection assumption, PCA may result in a poor approximation for complex data [10]. In contrast to the linear methods, most recent prior models perform a probabilistic nonlinear mapping from the low-dimensional latent space to the original pose. These models are also called latent variable models, and include the Gaussian Process Latent Variable Model (GPLVM) [10]–[12], the Gaussian Process Dynamic Model (GPDM) [13], [14], the Switching Gaussian Process Dynamic Model (SGPDM) [15], the coordinated mixture of factor analyzers (CMFA) [2] and the conditional restricted Boltzmann machine (CRBM) [1].

Another popular method to address the high dimensionality of the state space is to decompose the body pose into the states of the individual body parts. Using probabilistic graphical models (PGM), the prior probability can be factorized as the product of local conditional probability distributions (CPDs) or potential functions. PGM models can be classified into tree-structured models [16]–[18], non-tree-structured models [19], and mixtures of tree models [20].

B. Prior Model from Data and Body Constraints

So far, the most successful prior models are highly dependent on training data (manually labeled data [12], [17] or motion capture data [1], [2], [11], [18]). However, collecting a large amount of training data for various activities and subjects is both difficult and expensive. The development of current technology such as the Kinect sensor has significantly reduced data collection time. But the large pose space (over 50 dimensions for the whole body), the large number of possible body motions, and the significant variations in individual body motions exponentially increase the number of samples needed to learn a good and generalizable prior model. On the other hand, there are established theories of physics, anatomy and biomechanics governing body motions. It would be better to reuse these established theories rather than reinvent the wheel by learning them from data.

Hybrid approaches [16], [21]–[24], based on combining training data with generic human body constraints, have been introduced recently to alleviate the problems of data-based models. Body constraints are derived from generic physics, biomechanics and anatomical rules, and are applicable to all natural body poses. In general, these constraints can be imposed in body tracking either at the front end or at the rear end. Front-end methods are based on the Bayesian formulation, i.e., all body pose constraints are imposed on a prior model before tracking, and this prior model is then combined with the image likelihood during body tracking. In rear-end methods, body pose constraints are directly applied during body tracking through constrained optimization. In the rest of this section, we first discuss the typical front-end and rear-end methods, and then discuss the recently proposed physics models. Other representative methods are discussed at the end.

A typical front-end method imposes constraints on the PGM. For example, the 'elastic' connectivity constraint used in [25] and [26] is enforced by constraining the CPD of the links connecting two adjacent body parts. Some work [16], [27] also proposes to incorporate a range constraint on the relative angle between two adjacent body parts by manually setting the CPD to a uniform distribution over the feasible angle range. However, only a few simple constraints on two adjacent body parts are imposed directly in current PGMs.

A typical rear-end method directly solves the constrained optimization problem in body tracking. Demirdjian [28] solves pose estimation by minimizing the distance between the observed 3D points and the surface of the 3D limbs, subject to the connectivity and non-penetrating constraints. Sminchisescu et al. [21] introduce 3D joint angle range and non-intersection constraints, and solve the body tracking problem as a constrained optimization using covariance scaled sampling (CSS). However, the high dimensionality of the body space and the non-convex nature of the objective functions and the constraints make a global optimum difficult, if not impossible, to find. Recently, physical principles have been employed to produce realistic body motions. A physics-based model is used to predict the next pose from the current pose [22], [23], [29], [30] by computing the forces or torques imposed on the body. A physics-based model, however, requires monitoring every constraint during tracking. It is further hindered by the estimation of the force, as force cannot be directly observed from video. Constraints can also be imposed during tracking using a particle filter [31]–[33]. Compared with the front-end approach, the rear-end approach is simple to implement. But checking for the satisfaction of each constraint during tracking can slow down the process and makes real-time implementation difficult, especially if there are many constraints and/or the constraints are complex.

In summary, a majority of current methods are data-driven, requiring 3D motion capture data to construct the prior body pose model. Methods have also been proposed to utilize certain body constraints to supplement the data. These methods, however, have several limitations.

• Constraints are mainly used as a supplement to data. The cumbersome and time-consuming data collection is still required for model training.

• The constraints used tend to be simple, limited in diversity, and are typically implemented separately. In addition, different methods are used to implement the different constraints. There is no unified method to implement different constraints simultaneously.

To overcome these limitations, in this paper, besides introducing additional constraints, we propose a method that constructs the prior body model exclusively from related body pose knowledge. By using generic knowledge, the proposed method not only saves the time and effort of data acquisition but, more importantly, can be applied to different activities of different subjects.

III. RELATED KNOWLEDGE ON BODY POSE

In this section, we introduce the knowledge and constraints from different disciplines that govern natural body pose and motion.

A. Body Pose Parameterization

First, we define the parameters to represent the body pose, focusing on the upper body. We focus on upper-body pose tracking in this research, though our approach can be readily applied to full-body pose tracking. The upper body is composed of six rigid body parts, each modeled as a cylinder as shown in Fig. 1. We use the torso coordinate system, which takes the upper center of the torso as the origin O. In general, we use three angles αij, βij, γij to model the rotation of part i relative to its neighboring part j, around the local x, y and z axes respectively. Thus, the pose state can be represented by a vector s ∈ R^15, which includes the joint angles of five joints, namely the neck, right/left shoulder and right/left elbow, i.e., s = (α21, β21, γ21, α31, β31, γ31, α43, β43, γ43, α51, β51, γ51, α65, β65, γ65)^T.

B. Body Anatomy

Human anatomy is the study of body structure and body composition. It offers knowledge that constrains the body structure and its movement. We consider four anatomical constraints: the connectivity constraint, body length constraint, kinesiology constraint and symmetry constraint.

1) Connectivity and Length Constraints: Based on human body anatomy, the human body is composed of several body parts such that adjacent body parts are connected via joints. We call this the Connectivity Constraint. Each body part has a constant length; we call this the Body Length Constraint. Here, we use average body lengths from anthropometric data [34]. The body pose s is a set of joint angles. Given the above constraints, the joint positions can be computed from the joint angles as follows:

Fig. 1. Upper body is composed of 6 rigid body parts.

Fig. 2. The relationship between the elbow joint and the shoulder joint for natural arm motion. The dashed line segments show the initial positions of the arm, forearm, and muscles; the solid line segments show their new positions. (a) Flexed arm following the center of gravity. (b) Flexion of the biceps.

First, the rotation matrix of the torso, R1(s) = I, is the identity matrix in the torso coordinate system. Then, the rotation matrix of part j can be recovered sequentially, from the torso to the forearm:

Rj = Ri Rx(αji) Ry(βji) Rz(γji)   (1)

where Ri denotes the rotation matrix of the body part i connected to part j; Rj denotes the rotation matrix of body part j; and Rx(·), Ry(·) and Rz(·) represent the rotation matrices that perform rotations around the x, y and z axes, respectively.
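As a concrete sketch of Eq. (1), and of the sequential position computations that follow, the code below chains elementary rotations to recover a part rotation and the elbow/wrist positions. The torso radius, limb lengths and joint angles are placeholder values (not anthropometric data), and the elbow is treated as a single-DOF hinge (β43 = γ43 = 0) for brevity:

```python
import numpy as np

def Rx(a):
    # rotation about the local x-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(a):
    # rotation about the local y-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(a):
    # rotation about the local z-axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def part_rotation(R_i, alpha, beta, gamma):
    # Eq. (1): rotation of part j from its parent part i
    return R_i @ Rx(alpha) @ Ry(beta) @ Rz(gamma)

# Placeholder geometry and angles (illustrative only):
r, l_RA, l_RFA = 0.15, 0.30, 0.25     # torso radius, arm and forearm lengths
a31, b31, g31 = 0.3, 0.1, 0.0         # right-shoulder angles (radians)
a43 = 0.5                             # elbow flexion, treated as a hinge

# The torso rotation is the identity in the torso coordinate system.
R_arm = part_rotation(np.eye(3), a31, b31, g31)

# Right elbow from the shoulder point (r, 0, 0)^T and the arm rotation.
shoulder = np.array([r, 0.0, 0.0])
elbow = shoulder + R_arm @ np.array([0.0, 0.0, l_RA])

# Right wrist chains the forearm rotation onto the arm rotation.
wrist = elbow + R_arm @ Rx(a43) @ np.array([0.0, 0.0, l_RFA])
```

Because rotations preserve length, the elbow-to-shoulder and wrist-to-elbow distances equal the arm and forearm lengths, which is the body length constraint in action.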

Given the rotation matrices, the body part positions are computed sequentially based on the connectivity constraints. For example, given the radius r of the torso cylinder, the right shoulder point is (r, 0, 0)^T. Because the right arm and torso are connected through this shoulder, the right elbow point eR is computed from both the shoulder (r, 0, 0)^T and the rotation R3 of the right arm:

eR = (r, 0, 0)^T + Rx(α31)Ry(β31)Rz(γ31)(0, 0, lRA)^T   (2)

where lRA is the length of the right arm. Similarly, since the forearm and arm are connected through the elbow point, the right wrist point wR can be computed from the elbow eR and the rotation of the forearm:

wR = eR + Rx(α31)Ry(β31)Rz(γ31)Rx(α43)Ry(β43)Rz(γ43)(0, 0, lRFA)^T   (3)

where lRFA is the length of the right forearm.

2) Kinesiology Constraints: Kinesiology studies the motion

of the human body under normal or pathological conditions. Though there are many more rules in kinesiology, most of them apply only to specific motion conditions. In this paper, we apply only one of the most generic kinesiology constraints. Based on the kinesiology study [35], while one can change each joint angle independently, the joint angles are in fact related to each other during natural body motion. For example, Fig. 2(a) shows that the flexion of the elbow is typically followed by the arm moving backward in the shoulder, because the center of gravity of the system falls behind the center of the joint. Fig. 2(b) shows that the flexion of the elbow is combined with the flexion of the arm in the shoulder joint when both joints are moved together by the biceps.

Fig. 3. Ranges for the neck, shoulder and elbow joints from biomechanics constraints. The thick red line shows the neutral position when the angle is equal to zero. Images are adapted from [34].

TABLE I. Joint angle ranges from [34].

Based on kinesiology studies, we identify the constraint that the shoulder angles and the elbow angle depend on each other.

3) Symmetry Constraint: The symmetry constraint further restricts these relationships. For example, the relationship between the shoulder and elbow should be the same for both the right and left arms. Specifically, this constraint requires that the dependencies among the right arm angles (α31, β31, γ31, α43) and the dependencies among the left arm angles (α51, β51, γ51, α65) be the same.

C. Biomechanics

Biomechanics provides knowledge that can further restrict the degrees of freedom and range of the body pose. We utilize two biomechanics constraints on the joints: the joint degree-of-freedom (DOF) constraint and the joint angle range constraint.

1) Joint Degree-of-Freedom (DOF) Constraint: This constraint means that each 3D joint has a limited number of DOFs. In biomechanics, the shoulder joint is a ball-and-socket joint with three DOFs, while the elbow is a hinge joint with one DOF. Based on these constraints, the dimension of the pose state is reduced to eleven joint angles: s = (α21, β21, γ21, α31, β31, γ31, α43, α51, β51, γ51, α65)^T. In our current experiment, the head rotation γ21 is difficult to estimate due to low image resolution, so we tracked only the ten other angles.

2) Joint Angle Range Constraint: This constraint specifies the lower and upper limits for each joint angle of a normal human pose, as shown in Fig. 3 and Table I.

D. Physical Knowledge

Physical constraints are imposed to exclude physically infeasible relationships between the parts. In our work, we impose a non-penetrating constraint between the arm/forearm and the torso. Specifically, this constraint means that the elbow point and the wrist point of the forearm cannot be inside the torso, and the two arms cannot penetrate each other.

For instance, from Eqs. (2) and (3), we can compute the right elbow position eR = (x_e^R, y_e^R, z_e^R)^T and the right wrist position wR = (x_w^R, y_w^R, z_w^R)^T. Since the torso axis is the y-axis in our coordinate system, the non-penetrating constraint is imposed as a lower bound on the distance from the elbow/wrist to the y-axis:

√((x_e^R)² + (z_e^R)²) > r,  √((x_w^R)² + (z_w^R)²) > r   (4)

where r is the radius of the torso cylinder.
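A minimal sketch of the check in Eq. (4), with a placeholder torso radius and hypothetical elbow/wrist positions:

```python
import numpy as np

def outside_torso(point, r):
    # Eq. (4): distance from the torso axis (the y-axis) must exceed r.
    x, _, z = point
    return np.hypot(x, z) > r

r = 0.15                                  # placeholder torso radius
elbow = np.array([0.25, -0.10, 0.05])     # hypothetical elbow position
wrist = np.array([0.05, -0.30, 0.02])     # hypothetical wrist, inside the torso

pose_feasible = outside_torso(elbow, r) and outside_torso(wrist, r)
```

Here the hypothetical wrist lies within the torso cylinder, so the pose is rejected as physically infeasible.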

E. Body Dynamics

In addition to the constraints on the static body pose, natural body motion also follows certain dynamic principles that further restrict the body movement. We consider the following dynamic constraints:

1) Angular Dynamic Constraint: This constraint restricts the joint angle movement speed. Since the state of the body pose s is a vector of joint angles, the angular dynamic constraint can be written as:

‖s_i^{t+1} − s_i^t‖ < Δt · w_i  (i = 1, . . . , 10)   (5)

where s_i^t and s_i^{t+1} are the ith joint angle in the current and the next pose, respectively, w_i is the maximum angular velocity for the ith joint angle, and Δt is the time between the two frames, which depends on the frame rate.
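Eq. (5) amounts to a per-joint bound on angular velocity. A sketch, where the frame rate and the maximum angular velocities are assumed placeholder values:

```python
import numpy as np

def satisfies_angular_dynamics(s_t, s_t1, w, dt):
    # Eq. (5): each joint angle changes by less than dt * w_i per frame.
    return bool(np.all(np.abs(s_t1 - s_t) < dt * w))

dt = 1.0 / 30.0                       # assumed 30 fps video
w = np.full(10, np.deg2rad(180.0))    # placeholder max angular velocities
s_t = np.zeros(10)                    # current pose (10 tracked angles)
s_t1 = s_t + np.deg2rad(3.0)          # next pose: 3 degrees per frame

ok = satisfies_angular_dynamics(s_t, s_t1, w, dt)   # 90 deg/s < 180 deg/s
```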

2) Smoothness Constraint: This constraint ensures that the body joints move smoothly. Specifically, it is written as:

dp_j(s)/dt = Σ_i (∂p_j/∂s_i)(ds_i/dt) = 0   (6)

where p_j(s) is the jth joint position, which is a function of the pose s, and s_i is the ith joint angle. For example, because the right wrist position depends on four joint angles in Eq. (3), the smoothness constraint on the right wrist can be written as:

(∂wR/∂α31)(dα31/dt) + (∂wR/∂β31)(dβ31/dt) + (∂wR/∂γ31)(dγ31/dt) + (∂wR/∂α43)(dα43/dt) = 0   (7)

where the partial derivative of wR with respect to each joint angle (e.g., ∂wR/∂α31) can be computed directly from Eq. (3), and the derivative of each joint angle (e.g., dα31/dt) represents its angular speed. Similar smoothness constraints are imposed on the other joint positions. Note that this joint smoothness constraint is different from the angular dynamic constraint, which restricts the angular velocity: a small change in angle does not always guarantee a smooth joint position movement.
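The left-hand side of Eq. (7) can be evaluated numerically. The sketch below rebuilds the wrist position of Eq. (3) (elbow treated as a hinge, placeholder limb lengths), estimates the Jacobian by central differences, and forms the wrist velocity as J(q) q_dot:

```python
import numpy as np

def Rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def wrist(q, r=0.15, l_ra=0.30, l_rfa=0.25):
    # Right wrist position from the four angles it depends on (Eq. 3),
    # with placeholder torso radius and limb lengths, elbow as a hinge.
    a31, b31, g31, a43 = q
    R_arm = Rx(a31) @ Ry(b31) @ Rz(g31)
    elbow = np.array([r, 0.0, 0.0]) + R_arm @ np.array([0.0, 0.0, l_ra])
    return elbow + R_arm @ Rx(a43) @ np.array([0.0, 0.0, l_rfa])

def wrist_velocity(q, q_dot, eps=1e-6):
    # Eq. (7): dw_R/dt = J(q) q_dot, with J estimated by central differences.
    J = np.empty((3, 4))
    for i in range(4):
        dq = np.zeros(4)
        dq[i] = eps
        J[:, i] = (wrist(q + dq) - wrist(q - dq)) / (2.0 * eps)
    return J @ q_dot

q = np.array([0.3, 0.1, 0.0, 0.5])
v_static = wrist_velocity(q, np.zeros(4))       # stationary joints
v_moving = wrist_velocity(q, np.array([1.0, 0.0, 0.0, 0.0]))
```

With zero angular speeds the wrist velocity vanishes, while a nonzero shoulder rate produces a nonzero wrist velocity, illustrating why bounded angular changes alone do not guarantee smooth joint positions.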

IV. DATA-FREE BODY PRIOR MODEL LEARNING

Given the various constraints, we need methods to effectively embed them into the body prior model. Before discussing the methods to implement the constraints, we first introduce the structure of the prior body model.


A. Body Prior Model

Following the body pose tracking with the PGM approach, we use a Bayesian network (BN) to represent the body pose. In a BN, nodes denote random variables and the links between nodes denote the conditional dependencies among the variables. Each dependency is characterized by a conditional probability distribution (CPD).

In our body pose BN model, as shown in Fig. 5, the nodes denote the joint angles and the CPD of each node is parameterized as a mixture of Gaussians (MoGs). For example, the CPD of one node x given its parent nodes y (y is a vector composed of all the parents) can be written as:

p(x|y) = Σ_m p(x|y, m) p(m)   (8)

where p(x|y, m) = N_m(μ_m + Wy, σ_m²) is the Gaussian distribution of the mth component of the MoG; μ_m, σ_m² and W denote its mean, variance and regression matrix, respectively; and p(m) represents the prior probability of this component, i.e., its weight. Similar to [18], we use a mixture of four Gaussians in our model.

B. Learning the Prior Model from Constraints

Given the BN body model, we propose methods to automatically learn the body model only from the above constraints.

1) Constraints on Body Model Structure: Some of the constraints can be directly applied to the BN model structure. Specifically, the kinesiology constraint requires that the shoulder angle and the elbow angle depend on each other. This is realized by adding a link that directly connects the two joints, as shown in Fig. 5. The symmetry constraint is realized by requiring that the links among the right arm angles (α31, β31, γ31, α43) and the links among the left arm angles (α51, β51, γ51, α65) be the same, as shown in Fig. 5.

2) Pseudo Data Generation Under Body Constraints: For constraints that cannot be directly imposed, we propose an efficient sampling method that produces pseudo data from the constraints in order to effectively represent them. Pseudo data generation has been applied in different applications (computer vision [36], robotics [37], and computer graphics [38]) to enrich the training data set.

We need to generate samples that cover the feasible regions in the high-dimensional pose space. To avoid a brute-force search of the feasible regions, we design a proposal distribution to explore this space more efficiently.

The basic idea of the efficient sampling is to generate more samples from the currently unexplored region. Specifically, we define the proposal distribution of the nth sample conditioned on the previous samples, p(s^(n)|s^(n−1), . . . , s^(1)). Given the previous samples, we first define a kernel density function with a Gaussian kernel:

q(s^(n)|s^(n−1), . . . , s^(1)) = (1/(n−1)) Σ_{j=1}^{n−1} (1/(2πσ²)^{D/2}) exp{−‖s^(n) − s^(j)‖² / (2σ²)}   (9)

Fig. 4. Generated upper body pose samples.

where D is the dimension of the pose space and σ is thestandard deviation (σ is empirically selected based on therange of joint angles. We use sigma = 10 degree in ourexperiment). This density function has high density in theregions close to previous samples. Since we need to explorethe regions which have not been sampled, we use a proposaldistribution as follows:

p(s^(n) | s^(n−1), …, s^(1)) ∝ 1/(2πσ²)^{D/2} − q(s^(n) | s^(n−1), …, s^(1))    (10)

where 1/(2πσ²)^{D/2} is the largest possible value of q(s^(n)|s^(n−1), …, s^(1)). This proposal distribution has higher density in regions not covered by the previous samples. We now need to generate a new sample s^(n) according to this proposal distribution.

Considering the constraints, we use rejection sampling as follows.

1) We first uniformly generate a sample s^(n) within the feasible angle range.

2) If n = 1, this sample is always accepted. Otherwise, the sample is accepted with probability p(s^(n)|s^(n−1), …, s^(1)) / [1/(2πσ²)^{D/2}].

3) If s^(n) is rejected, go back to Step 1 and generate another sample, until a new sample is accepted.

4) Compute the joint positions p_j(s^(n)) from the new sample, based on the connectivity and body length constraints. If all these joint positions satisfy the non-penetrating constraints, add the new sample to the sample set, s^(n) → C; otherwise reject this sample and go back to Step 1.

5) If the sample set size |C| is smaller than N, go back to Step 1.

Finally, a concise sample set is generated to represent the constraints. Some samples are shown in Fig. 4. The generated poses can be complex and some may be rare, but they all satisfy the basic body constraints.
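The five-step rejection sampler above can be sketched as follows. This is an illustrative reading of Eqs. 9-10, not the paper's implementation: the joint-angle limits are assumed to be a per-dimension box, and the remaining static checks (connectivity, body length, non-penetration) are abstracted into a `feasible` callback.

```python
import numpy as np

def space_filling_sampler(n_samples, dim, low, high, sigma=10.0,
                          feasible=lambda s: True, rng=None):
    """Rejection sampler for pseudo poses (Eqs. 9-10).

    low/high: per-dimension joint-angle limits (a box, in degrees).
    feasible: stand-in for the remaining static constraint checks.
    sigma: kernel width of the density in Eq. 9.
    """
    rng = np.random.default_rng(rng)
    peak = 1.0 / (2.0 * np.pi * sigma**2) ** (dim / 2.0)  # max of the kernel density
    samples = []
    while len(samples) < n_samples:
        s = rng.uniform(low, high, size=dim)              # step 1: uniform in limits
        if samples:
            d2 = np.sum((np.asarray(samples) - s) ** 2, axis=1)
            q = peak * np.mean(np.exp(-d2 / (2.0 * sigma**2)))  # Eq. 9
            if rng.random() > (peak - q) / peak:          # steps 2-3: accept w.p.
                continue                                  #   (peak - q)/peak, else retry
        if feasible(s):                                   # step 4: constraint checks
            samples.append(s)                             # step 5: grow the set to N
    return np.asarray(samples)
```

Because q is small far from previous samples, acceptance is nearly certain in unexplored regions, which is what gives the sampler its space-filling behavior.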

Through the above efficient sampling, we can generate a pseudo-data set satisfying all the static constraints. To incorporate the dynamic constraints, we propose to generate pseudo-data pairs (s_t, s_{t+1}) using dynamic sampling as follows.

1) Sample the current pseudo pose s_t using the above efficient sampling method.

2) Given s_t, we generate the next pose s_{t+1} under the angular dynamic constraint. Because this constraint is imposed on each dimension of the body pose separately, we sample each dimension of s_{t+1} independently from the uniform distribution over [s_i^t − Δt·w_i, s_i^t + Δt·w_i].

3) Check s_{t+1} against all the static constraints, and check the pseudo pose pair (s_t, s_{t+1}) against the smoothness constraint


Fig. 5. Structure learning of the BN. (The dotted circle denotes the three joint angles for each elbow.) (a) The prior BN B0. (b) BN (BC) learned from pseudo-poses.

(we use Δs_i/Δt to approximate the derivative of the joint angle, ds_i/dt, in the smoothness constraint (Eq. 6)). If this pair is infeasible, reject it and go back to Step 1.
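The dynamic sampling loop can be sketched in the same spirit. Here `static_sampler`, `w`, `feasible`, and `smooth_ok` are placeholders for the paper's static sampler, per-joint velocity limits, static constraint checks, and smoothness check, respectively.

```python
import numpy as np

def sample_pose_pair(static_sampler, w, dt=1.0,
                     feasible=lambda s: True,
                     smooth_ok=lambda s, s_next: True, rng=None):
    """Generate one pseudo-pose pair (s_t, s_{t+1}) under the dynamic constraints.

    static_sampler(rng): draws one feasible static pose s_t (step 1).
    w: per-joint angular-velocity limits, so each dimension of s_{t+1} is
       uniform over [s_i - dt*w_i, s_i + dt*w_i] (step 2).
    feasible, smooth_ok: stand-ins for the static and smoothness checks (step 3).
    """
    rng = np.random.default_rng(rng)
    w = np.asarray(w, dtype=float)
    while True:
        s_t = np.asarray(static_sampler(rng), dtype=float)
        s_next = rng.uniform(s_t - dt * w, s_t + dt * w)   # per-dimension dynamics
        if feasible(s_next) and smooth_ok(s_t, s_next):    # reject infeasible pairs
            return s_t, s_next
```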

3) Learning BN Parameters and Structure from Constraints: In the following sections, we focus on learning the prior pose model from pseudo-data. We first introduce a constrained learning algorithm to learn the BN parameters and structure in this section. We then extend the BN to a Dynamic Bayesian Network (DBN) by adding dynamic links, and discuss DBN learning.

Given the pseudo-data, it is easy to learn the BN parameters for a given structure: simply learn the CPDs (MoGs) of each node separately. For each CPD in Eq. 8, we learn the mixture-of-Gaussians parameters, including μ_m, σ_m, p(m), and the linear translation matrix W. W can be learned by solving a linear regression between x and y. The mixture of Gaussians is learned via the standard iterative Expectation-Maximization algorithm with K-means initialization. In this section, we focus on learning the BN structure.
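The parameter-learning step just described can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: a one-dimensional parent and child are assumed, and the K-means initialization is replaced by a residual-quantile split for brevity.

```python
import numpy as np

def learn_cpd(x, y, n_mix=2, n_iter=50):
    """Sketch of learning one MoG CPD p(y|x): a linear map W by least squares,
    then a 1-D mixture of Gaussians fitted to the residuals by EM."""
    X = np.column_stack([x, np.ones(len(x))])        # affine regression y ≈ W·[x, 1]
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ W                                    # residuals carry the mixture
    mu = np.quantile(r, np.linspace(0.25, 0.75, n_mix))  # quantile init (not K-means)
    sig = np.full(n_mix, r.std() + 1e-6)
    pm = np.full(n_mix, 1.0 / n_mix)
    for _ in range(n_iter):                          # EM iterations
        dens = pm * np.exp(-0.5 * ((r[:, None] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
        dens = np.maximum(dens, 1e-300)              # numerical floor
        resp = dens / dens.sum(axis=1, keepdims=True)            # E-step
        nk = resp.sum(axis=0)
        mu = (resp * r[:, None]).sum(axis=0) / nk                # M-step
        sig = np.sqrt((resp * (r[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
        pm = nk / len(r)
    return W, mu, sig, pm
```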

Structure learning for probabilistic graphical models (PGMs) has been studied extensively in machine learning. We chose a score-based method [39] to learn the structure.

We first manually set a prior BN structure (B0 in Fig. 5(a)) based on the anatomical structure of the body and the body constraints. The links from the neck angles (α21, β21) to the right/left shoulder angles ((α31, β31, γ31)/(α51, β51, γ51)), and the links from the shoulder angles to the right/left elbow angles (α43/α65), are initialized based on the hierarchical structure of the human body. Based on the symmetry constraint, the links are symmetric for the right and left arms. Based on the kinesiology constraint, all three shoulder angles are related to the elbow angle.

The Bayesian score [39], which describes the fitness of each possible structure B to the training data, is defined as:

Score(B) = log p(D, B) = log p(D|B) + log p(B). (11)

The first term is the log-likelihood and the second term is the log prior probability of the structure B. For a large database, the log-likelihood can be approximated with the Bayesian information criterion (BIC) [40]:

log p(D|B) ≈ log p(D|θ̂_B, B) − (d/2) log(K)    (12)

where D is the set of pseudo-data; θ_B denotes the parameters of network B; θ̂_B is the maximum likelihood estimate of θ_B; d is the number of free parameters in B; and K is the number of samples in the pseudo-data D. Thus, the first term measures how well the model fits the pseudo-data, and the second term is a penalty that punishes structure complexity.

Instead of giving an equal prior p(B) to all possible structures, we assign high probability to structures close to the prior structure B0. Let Π_i(B) be the set of parent nodes of the ith node in B. δ_i is the number of nodes in the symmetric difference of Π_i(B) and Π_i(B0):

δ_i(B, B0) = |(Π_i(B) ∪ Π_i(B0)) \ (Π_i(B) ∩ Π_i(B0))|    (13)

where Π_i(B) ∪ Π_i(B0) represents the union of the sets Π_i(B) and Π_i(B0), Π_i(B) ∩ Π_i(B0) represents their intersection, and '\' denotes set difference.

Then, the prior of any network B is defined as follows:

p(B) ∝ κ^{δ(B,B0)}    (14)

where δ(B, B0) = Σ_{i=1}^{N} δ_i(B, B0), and 0 < κ < 1 is a predefined constant factor. We empirically set κ = 0.1 in our experiments. Finally, we obtain the optimal structure B̂ by maximizing

the score, i.e., B̂ = arg max_B Score(B) in Eq. 11, using the following iterated hill-climbing procedure.

1) Initialize the starting BN structure as B0: Bs = B0.

2) Starting from Bs, compute the score of each neighbor of Bs, generated from Bs by adding, deleting, or reversing a single link, subject to the kinesiology constraint, i.e., links between the shoulder and elbow angles cannot be removed. Based on the symmetry constraint, if a link in the right/left arm is manipulated, its counterpart in the left/right arm is manipulated in the same way.

3) Update Bs with the neighbor that has the maximum score and go back to the previous step, until no neighbor has a higher score than the current structure.

4) Randomly perturb the current Bs as a new starting point and go back to Step 2, until convergence.
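The scoring (Eqs. 11-14) and the greedy search can be sketched as follows. This is a simplified sketch, not the paper's implementation: link reversal, acyclicity checking, and the random restarts of Step 4 are omitted, and the log-likelihood is assumed to be supplied by the caller.

```python
import itertools
import numpy as np

def structure_score(loglik, n_params, n_samples, parents, parents0, kappa=0.1):
    """Bayesian score of a structure B (Eqs. 11-14): BIC term plus a prior that
    penalizes each parent-set difference from B0 by a factor kappa.
    parents/parents0: dicts mapping each node to its parent set in B and B0."""
    bic = loglik - 0.5 * n_params * np.log(n_samples)            # Eq. 12
    delta = sum(len(parents[i] ^ parents0[i]) for i in parents)  # Eq. 13 (symmetric diff.)
    return bic + delta * np.log(kappa)                           # Eqs. 11 and 14

def hill_climb(nodes, score_fn, init_edges, protected=frozenset(), mirror=None):
    """Greedy structure search over add/delete moves (Steps 1-3).
    protected: links that may never be removed (kinesiology constraint).
    mirror: maps each link to its left/right counterpart (symmetry constraint)."""
    mirror = mirror or {}
    current = frozenset(init_edges)
    best = score_fn(current)
    improved = True
    while improved:
        improved = False
        for e in itertools.permutations(nodes, 2):   # every candidate directed link
            group = {e, mirror.get(e, e)}            # edit symmetric links together
            if e in current:
                if group & protected:
                    continue                         # kinesiology links stay
                cand = current - group               # delete move
            else:
                cand = current | group               # add move
            s = score_fn(cand)
            if s > best:
                current, best, improved = frozenset(cand), s, True
    return current, best
```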

The learned structure is shown in Fig. 5(b). We can see that some links are removed from the initial structure. This means that, in general, the neck rotation does not have a strong correlation with the shoulder angles. Several links are added in the learned structure. The link β31 → γ31 represents the strong relationship among the shoulder angles. The links α51 → α31 and β51 → β31 represent the correlation between the right and left shoulders. These links are not directly reflected in the constraints; rather, they result from the interaction among different constraints. Furthermore, when generating pseudo data, we imposed the non-penetrating constraint that the right and left arms do not intersect each other. Thus, the right and left angles are statistically related in the pseudo data.

4) Learning the DBN Model: In this section, we extend our model to a DBN to capture the dynamic dependencies among joint angles.

A DBN can be defined by a pair of BNs (B1, B→): (1) the static network B1, which is the same as the BN learned above; it captures the static relationships among joint angles and represents the static joint probability p(s_1) in the first time slice; (2) the transition network B→, which specifies the transition probability p(s_{t+1}|s_t) for all t in a finite time-slice sequence, i.e., the dynamic dependencies among joint angles.


Fig. 6. DBN learned from pseudo data pairs. The self-transition link at each node indicates the temporal relationship of a single joint angle from the previous time to the current time. The links from the nodes at t − 1 to the nodes at t indicate the temporal relationships between different joint angles.

The static network has already been learned in the above sections. Here, we focus on learning the transition network B→, which consists of two types of links: inter-slice links and intra-slice links. The inter-slice links are the dynamic links connecting the temporal variables of two successive time slices. In contrast, the intra-slice links connect the variables within a single time slice; they are the same as the static network structure.

Given the pseudo-data pairs generated in Section IV-B2, the learning procedure for B→ is the same as the BN learning, but there are more coherent structural constraints on the transition network. First, the variables s_t = (s_1^t, …, s_N^t) in the first slice do not have parents. Also, the inter-slice links can only have one direction, from the current to the next time slice. Finally, based on the stationarity assumption, the intra-slice links among joint angles must be the same in every time slice. Here, we fix the intra-slice links to the previously learned static network.

The transition network learned from pseudo-data pairs is shown in Fig. 6. We learned two types of dynamic relationships between joint angles. First, because of the angular dynamic constraint on each single joint angle, a self-transition link for each node is expected in the learned DBN. Second, because the smoothness constraint is imposed on the dynamics of a set of joint angles, there are dynamic links among different joint angles that depict more complex dynamic dependencies than simple self-transitions.

V. DEMONSTRATION EXPERIMENT

Before applying our method to the body tracking problem, we first demonstrate its effectiveness in learning a general PGM from constraints. Since learning a DBN is similar to learning a two-slice BN, in this experiment we only demonstrate the learning of a BN with three nodes, for clarity and simplicity. Here, we imposed constraints on three variables (x1, x2, x3)^T. Through our knowledge-based method, we can learn a BN model of them from constraints alone.

A. Constraints on Three Variables

We imposed different constraints on the three variables as follows.

TABLE II

LIST OF LEARNED STRUCTURES AND THEIR CORRESPONDING CONSTRAINTS

1) Constraints on One Variable (Limit Constraints): Like the joint angle limit constraint in our pseudo-pose data, we set the limit constraints as 0 < x_i < 1, i = 1, 2, 3.

2) Constraints on Multiple Variables: We imposed various linear and non-linear constraints on the three variables, as shown in Table II.
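For concreteness, a minimal sampler for this demo setup might look as follows; it enforces the limit constraints plus, as one multi-variable example from Table II, x1 + x2 < 1.

```python
import numpy as np

def demo_pseudo_data(n=10_000, rng=None):
    """Pseudo data for the three-variable demo: limit constraints 0 < x_i < 1
    plus the example multi-variable constraint x1 + x2 < 1."""
    rng = np.random.default_rng(rng)
    data = []
    while len(data) < n:
        x = rng.uniform(0.0, 1.0, size=3)  # limit constraints hold by construction
        if x[0] + x[1] < 1.0:              # keep only samples meeting the constraint
            data.append(x)
    return np.asarray(data)
```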

B. BN Learned from Pseudo Data

In this experiment, we generated 10,000 sets of pseudo-data from the constraints, and used our structure learning algorithm to learn the BN model from the pseudo-data.

Using different constraints, the learned models have different structures. The constraints and the corresponding learned model structures are listed in Table II, where constraints in the same category appear in the right column and their learned structure is shown in the left column. (If there are several Markov-equivalent structures, only one of them is shown.)

This table shows that a link between two variables is learned only when those variables are involved in a common constraint. For example, if only limit constraints are imposed, the learned structure (Structure A in Table II) has no links among the nodes, because a limit constraint involves only a single variable. On the other hand, when x1 and x2 appear in one constraint, e.g., x1 + x2 < 1, they are connected in Structure B. If a combination of constraints is imposed, e.g., x1 + x2 < 1 and x1 + x3 < 1, two corresponding links (x1 → x2 and x1 → x3) are learned in Structure C. Finally, when all three variables are involved in one constraint, e.g., x1 + x2 + x3 < 1, the three nodes are fully connected in Structure D.

Besides the model structure, the model parameters (i.e., the CPDs) are also learned with our learning algorithm. The BN


Fig. 7. Six BN models learned from different constraints: (a) limit constraints, (b) limit constraints and x1 + x2 < 1, (c) limit constraints and x1² + x2² < 1, (d) limit constraints, x1² + x2² < 1, and x1^{1/2} + x2^{1/2} > 1, (e) limit constraints and x1 · x2 < 0.1, (f) x1 + x2 + x3 < 1. Row 1 shows the learned BN structure. Rows 2-3 show the samples from the constraints and those from the BN, respectively.

structure and its CPDs together define the joint probability and represent the constraints. To evaluate the effectiveness of our BN model, we drew N = 10,000 samples from its joint probability. These samples were compared to samples drawn directly from the constraints. The constraints, the samples directly from the constraints, the learned BN structure, and the samples drawn from the learned BN model are shown in Fig. 7 (for clarity, only the first two variables, x1 and x2, of the samples are shown). It is clear that the distributions of the samples drawn from the BNs match well with the distributions of the samples from the constraints. Although some BN models share the same structure, they can represent different constraints because of different parameters, as shown in Fig. 7(b), (c), (d), and (e). This experiment shows that, given a set of constraints on some random variables and their relationships, we can learn a BN that accurately, concisely, and conveniently captures the constraints. Moreover, given the BN, we can then perform any kind of inference on the random variables.

VI. EXPERIMENTS WITH REAL DATA

In this section, we apply the learned body prior model to the body tracking problem using the challenging HumanEva-I database [33]. Please note again that, while our method is so far limited to upper-body pose modeling, because of the lack of an upper-body benchmark we decided to use HumanEva and the CMU data, both of which are for full-body pose. For a fair comparison, we do not apply the proposed prior model to update the joint angles of the lower body.

A. Body Pose Estimation Through BN/DBN Inference

In the above sections, we have learned the BN/DBN model to represent the prior probability of the body pose. When we estimate the body pose from an image, this prior model is combined with the likelihood of the measurement to estimate the posterior of the body pose. Here, the measurements are extracted through particle filtering based on silhouette and edge information from three views. We use 1000 particles for all the testing sequences. To the best of our knowledge, we utilize the same edge- and silhouette-based likelihood and the same test sets as [1].

Then, these measurements are used as evidence to estimate the true state of the joint angles through BN/DBN inference. For example, in the BN model (Fig. 5(b)), we first associate each joint angle node with a measurement node. The CPD of the link from the joint angle node s_i to its measurement node o_i is defined as a Gaussian distribution p(o_i|s_i) = N(o_i | μ = s_i, σ_i²), where the variance σ_i² represents the uncertainty of the measurement.

In BN inference, the posterior probability of the body pose can be estimated by combining the likelihood from the measurements with the prior probability of the body pose:

p(s_1, …, s_N | o_1, …, o_N) ∝ ∏_{i=1}^{N} p(o_i|s_i) · ∏_{i=1}^{N} p(s_i|Pa(s_i))    (15)

The second term is the product of the conditional probabilities of each joint angle s_i given its parents Pa(s_i), which are quantified in our BN model. In practice, the posterior probability can be estimated efficiently through the belief propagation algorithm [41]. To handle loops in the structure, we apply the junction tree algorithm [42], which can perform exact belief propagation in a multiply-connected BN, i.e., a BN with loops.

The DBN inference is similar to the BN inference, except for the dynamic transitions. Let s_t = (s_1^t, …, s_N^t) represent the joint angles at time t. Given the evidence up to time t, o_{1:t} = (o_1^{1:t}, …, o_N^{1:t}), the posterior probability p(s_t|o_{1:t}) can be factorized and estimated iteratively, as in 'filtering' for DBNs [42].
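As a toy illustration of the filtering recursion, here is a discrete-state version; the paper's DBN uses continuous joint angles with MoG CPDs, but the predict/correct structure is the same.

```python
import numpy as np

def dbn_filter(obs_loglik, trans, prior):
    """Discrete-state sketch of DBN filtering:
    p(s_t|o_{1:t}) ∝ p(o_t|s_t) · Σ_{s_{t-1}} p(s_t|s_{t-1}) p(s_{t-1}|o_{1:t-1}).

    obs_loglik: (T, S) per-frame observation log-likelihoods.
    trans: (S, S) matrix with trans[i, j] = p(s_t = j | s_{t-1} = i).
    prior: (S,) distribution p(s_1) from the static network.
    """
    obs_loglik = np.asarray(obs_loglik, dtype=float)
    trans = np.asarray(trans, dtype=float)
    belief = np.asarray(prior, dtype=float) * np.exp(obs_loglik[0])
    belief /= belief.sum()
    for ll in obs_loglik[1:]:
        belief = np.exp(ll) * (belief @ trans)  # predict with B->, correct with o_t
        belief /= belief.sum()
    return belief
```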

The body tracking algorithm combines measurements from the particle filter with the BN/DBN prior model. The tracking algorithm is summarized as follows.

1) Measurement extraction using the particle filter.

a) Re-sample the previous pose samples (particles) {s_{t−1,(m)}}_{m=1..1000} based on their weights {w_{t−1,(m)}}_{m=1..1000}.

b) Propagate the particles to the next frame based on the transition probability p(s_t|s_{t−1}) defined in the DBN (for the BN, simple zero-order dynamics, i.e., s_t = s_{t−1} up to additive Gaussian noise, is used in this particle filter). s_t = (s_1^t, …, s_N^t) represents all joint angles in the current frame.

c) Re-weight the current samples based on the edge- and silhouette-based image likelihood defined in [1].

d) Extract the sample with the highest likelihood as the measurement o_t in the current frame.
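Steps 1a-1d can be sketched as one function. Here `propagate` and `image_loglik` are placeholders for the DBN transition p(s_t|s_{t−1}) and the edge/silhouette likelihood of [1], which this sketch does not attempt to reproduce.

```python
import numpy as np

def extract_measurement(particles, weights, propagate, image_loglik, rng=None):
    """One measurement-extraction step of the tracker (steps 1a-1d).

    particles: (M, D) previous pose samples; weights: (M,) their weights.
    propagate(p, rng): stand-in for sampling p(s_t | s_{t-1} = p).
    image_loglik(p): stand-in for the image log-likelihood of pose p.
    """
    rng = np.random.default_rng(rng)
    m = len(particles)
    idx = rng.choice(m, size=m, p=weights / weights.sum())        # (a) re-sample
    new = np.array([propagate(p, rng) for p in particles[idx]])   # (b) propagate
    ll = np.array([image_loglik(p) for p in new])                 # (c) re-weight
    w = np.exp(ll - ll.max())
    o_t = new[np.argmax(ll)]                                      # (d) best -> o_t
    return o_t, new, w / w.sum()
```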


2) BN/DBN inference.

a) For the BN model, the posterior probability of the body pose is estimated from the current measurement o_t based on Eq. 15.

b) For the DBN model, the posterior probability of the body pose is factorized as p(s_t|o_{1:t}) ∝ p(o_t|s_t) · ∫_{s_{t−1}} p(s_t|s_{t−1}) p(s_{t−1}|o_{1:t−1}). Given the current measurement o_t, this posterior probability can be estimated recursively from the previous posterior p(s_{t−1}|o_{1:t−1}) [42].

B. Comparison With Models from Training Data

The key point distinguishing the proposed data-free prior model from traditional models is that our model is learned without any training data from a motion capture system. In this section, we compare our model with models learned from motion capture data.

1) Training and Testing Data: We evaluate data-based models learned from different training data sets. First, we train the model on the HumanEva-I database [33]. This database includes 43,365 training poses from four subjects and five different activities, i.e., walking, jogging, throw-and-catch, gestures, and boxing. We train one generic HumanEva model with all five activities and five specific models for the five activities, respectively. Besides the training data in the HumanEva database, we also learn the model from the larger CMU Mocap database [43], which includes 995,354 body poses from 144 subjects and various activities. Note that the CMU Mocap database shares some activities, e.g., walking and jogging, with the HumanEva database, but they are collected from different subjects with different body motion styles. The models learned from the different data sets are listed in Table III.

Finally, we test the above models on five different activities in the HumanEva-I database (the first 150 frames of each activity of subject S1). We use the online HumanEva evaluation system provided by the authors [33] to compute the average distance error of the joint positions. The results for different test activities and different models are listed in Table IV. We also compare with a baseline system that directly uses the result of particle filtering without a prior model.

2) Performance of the Models from Training Data: Compared with the baseline system, the models learned from training data can improve the body tracking results when the test activity is present in the training data. For example, the BN model learned from walking data (BWalk) reduces the tracking error on the walking sequence from 107.97 mm to 60.67 mm. Similarly, BJog reduces the tracking error on the jogging sequence from 115.76 mm to 53.48 mm. However, if the testing and training activities are different, the results are very poor. For example, applying the walking model BWalk to the jogging sequence increases the tracking error to 168.22 mm, which is even worse than the baseline system (115.76 mm), where no prior model was applied.

The BN learned from all the activities in HumanEva (BHumanEva) can simultaneously improve the tracking results

TABLE III

MODEL LIST IN OUR EXPERIMENT

of the five test sequences. The average error over the five testing sequences decreases from 90.47 mm to 60.80 mm compared to the baseline. However, this does not mean that the data-based BN model can always be improved by adding more training data from more activities. For example, although the CMU database is very large and already includes some walking sequences, the BN model learned from the CMU database (BCMU) still cannot track the walking sequence in HumanEva very well (the error is 103.26 mm). This is due to the subject and movement style differences between the CMU and HumanEva databases.

3) Performance of the Knowledge-Based Model: Compared to the data-based BN models, the proposed knowledge-based BN model achieves comparable results for specific activities (the data-based model is still a little better when the training and testing activities are the same). In addition, by using only generic constraints, the proposed model generalizes much better to different activities. The average error over the five testing activities reaches 61.19 mm when using the knowledge-based BN (BC). To evaluate our structure learning algorithm, we also test the BN (B0) with the initial structure (see Sec. IV-B3 for more details). The tracking errors of the five activities using B0 are summarized in Table IV, and the average error is 106.27 mm. Note that the parameters of both B0 and BC are learned from pseudo poses; the only difference is the model structure. We can see that BC performs much better than B0, because some important relations (links) between joint angles are missing in B0.

We also tested the knowledge-based DBN model (DBNC) learned from pseudo-data pairs. Like the BN model, this DBN model can be applied to different activities using only generic constraints. The results for all five activities are improved because the DBN model uses dynamic constraints. However, the improvements differ across activities. For activities with simple dynamics, the improvement is small. For example, in the 'gestures' activity (the subject only waves his right hand in a small range), the error is reduced from 48.20 mm to 46.49 mm compared to the BN model. The improvement is significant when tracking more complex dynamics; e.g., the error of the 'boxing' activity decreases from 74.09 mm to 61.58 mm (a 16.9 percent improvement).


TABLE IV

TEST ON DIFFERENT ACTIVITIES WITH DIFFERENT PRIOR MODELS. (TEST ON SUBJECT S1. THE ERROR IS IN mm)

TABLE V

TEST ON SUBJECT S2 AND S3, WHILE TRAINED ON SUBJECT S1. (THE ERROR IS IN mm)

Fig. 8. Structure learning of the BN from training data. (a) BBox. (b) BGestures. (c) BCMU. (d) BBoxWithConstraint. (e) BGesturesWithConstraint.

Finally, we evaluate our knowledge-based model (DBNC) on two other subjects (S2 and S3), and compare it with data-driven models. Note that we compare with action-specific data-driven models. For instance, when we test on the walking sequences of S2 and S3, we use a model that is also learned from walking data of subject S1. The results of these data-driven S1_Models are summarized in Table V. We can see that most of the time the S1_Models can be applied to S2 and S3, because they perform very similar actions. However, in the 'Gestures' action, the body poses of S1 and S2 are different (S1 waves the right hand but S2 waves the left hand). In this case, the model learned from S1 is difficult to apply to S2, and the tracking error (120.39 mm) is much larger than that of DBNC. Because DBNC does not depend on any person-specific training data, it can be applied to different subjects successfully. The average error of DBNC is smaller than that of the data-driven models on S2 and S3.

4) Analysis of Data- and Knowledge-Based Models: The data- and knowledge-based models differ in their structures and their underlying probabilities. First, we compare the model structures. For the data-based model, the structure learning algorithm is similar to the algorithm in Sec. IV-B3, but the model is learned from motion capture data and we do not impose the symmetry and kinesiology constraints. Two activity-specific models from HumanEva data and one model from CMU Mocap data are shown in

Fig. 8(a), (b), and (c), respectively. Here, 'Boxing' is the most complex upper-body activity in the HumanEva data and 'Gestures' is the simplest, including only one arm waving. First, we observe that these model structures are different from the knowledge-based model in Fig. 5(b). We can also see that more links are added to the 'Box' model to represent the complex relationships among the upper-body joints, while a relatively simpler model is learned for 'Gestures'.

If we also impose the symmetry and kinesiology constraints when learning the data-based model, the learned structures of the 'Box' and 'Gestures' models are as shown in Fig. 8(d) and (e). Since the CMU model structure in Fig. 8(c) already satisfies these constraints, it does not change. The tracking results of these models are shown in Table VI. Compared to the data-based models in Table IV, by using constraints the average tracking error is reduced from 125.39 mm to 115.74 mm for the 'Gestures' model and from 100.01 mm to 77.50 mm for the 'Box' model. However, these constraints are only useful when generalizing a data-based model to new test data, e.g., applying BGesturesWithConstraint to activities other than gestures. The result is worse when testing on the same activity, e.g., applying BBoxWithConstraint to the boxing activity.

Notice that some links, such as the links from shoulder to elbow, are shared by the data-based and the knowledge-based models. This only means that shoulder and elbow are related across different activities; the relationship itself differs because


TABLE VI

COMPARISON OF DATA-BASED MODEL WITH AND WITHOUT CONSTRAINTS (TEST ON SUBJECT S1. THE ERROR IS IN mm)

TABLE VII

COMPARISON BETWEEN CRBM AND OUR MODEL BC ON DIFFERENT ACTIVITIES. (TEST ON SUBJECT S1. THE ERROR IS IN mm)


Fig. 9. Visualization of body pose samples from five different BN models: (a) BBoxing, learned from the 'boxing' activity; (b) BGestures, learned from the 'gestures' activity; (c) BCMU, learned from CMU data; (d) BC, learned from constraints; (e) BResampleCMU, learned from re-sampled CMU data. Only two joint angles of the pose are visualized, i.e., the x- and y-axes represent α31 and β31, respectively.

the learned CPDs of these links are different for different models.

Given different structures and CPDs, the data-based and knowledge-based models capture fundamentally different prior probabilities of the body pose. To compare the probabilities, 2,000 body pose samples are randomly drawn from each model. The distributions of two joint angles, α31 and β31, are shown in Fig. 9. We can see that the distributions of the two activity-specific models concentrate in small areas of the joint angle space. The distribution of BCMU covers a larger area because of the larger training data size and more training activities. However, the CMU data is biased towards the 'resting' pose, i.e., α31 = −90 and β31 = 0 degrees (Fig. 3). For the knowledge-based model BC, the distribution covers the whole space of feasible joint angles. This explains its good generalization capability.

Compared to the knowledge-based model, BCMU is biased toward the most frequent activities in the CMU database. This may be the reason why it cannot generalize well to the test data in the HumanEva database. To reduce this bias, we re-sample the CMU data to make it uniformly distributed. The re-sampling procedure is similar to the sampling of pseudo data in Section IV-B2, but instead of generating samples in the feasible pose range, we randomly select samples from the CMU data. The sampling probability of a new sample is given by Eq. 10. We learn a new model, BResampleCMU, from the re-sampled CMU data. As shown in Fig. 9(e), the distribution of this model is more scattered and less 'biased'. Compared to BCMU, using BResampleCMU reduces the average tracking error over the five activities from 85.00 mm to 67.62 mm. This shows that uniform re-sampling does help when generalizing a data-based model to new test data. However, it only works when there is enough training data to cover a sufficiently large body pose space. Finally, BResampleCMU is still worse than the knowledge-based model.

C. Comparison With the State-of-the-Art

Most recently, Taylor et al. [1], [44] employed the conditional restricted Boltzmann machine (CRBM) to model the body pose prior probability with binary latent variables. They reported state-of-the-art results on the HumanEva dataset compared to other prior models, such as the motion correlation model [45] and the coordinated mixture of factor analyzers (CMFA) [2], [46].

Here, we compare our body pose prior model with CRBMs trained on different data sets. Since their published results only include testing on the same activities as the training data, we trained and tested different CRBMs using the source code released by the authors. To the best of our knowledge, we utilized the same particle filter with 1000 particles, the same edge- and silhouette-based likelihood, and the same configurations of the CRBM models. The testing results for different CRBMs and different testing sequences are shown in Table VII. The CRBM is similar to the BN model in that it achieves good results when the training and testing sequences include the same activities, but it has problems generalizing to new activities. In contrast, our model achieves comparable performance for specific activities, but significantly outperforms the CRBM in generalizing to new activities.

In this experiment, we process three camera views on a 2.67 GHz PC with 4 GB RAM. Body pose estimation using the BN/DBN takes about 18 seconds per frame, which is similar to the processing time of the CRBM, but faster than another state-of-the-art method [2] that takes 0.6 minutes per frame. In our method, the most time-consuming step is the pseudo-data generation (about 1 hour), but it is done offline and does not affect the online tracking speed.


VII. CONCLUSION

In this paper, we proposed a knowledge-based framework to effectively learn a prior body pose model from multiple anatomic, biomechanics, and physical constraints. We first systematically identified and represented the human body constraints from different sources. Then, we proposed different methods to effectively incorporate these constraints into the body prior model. Some of the constraints can be directly imposed on the model structure. For constraints that cannot be directly applied, we proposed an efficient sampling method to first transfer the constraints into pseudo-data, and then learn the model from these pseudo-data. Unlike the traditional data-based model, where body constraints are used as a supplement to the data, our model does not need any motion capture data or manually labeled data for training. Using only generic constraints, our model can be applied to different activities and subjects.

Compared with state-of-the-art methods on benchmark datasets, the proposed prior model achieves comparable performance for specific body motions that are present in the training data. The proposed model, however, significantly outperforms the data-based prior models when generalizing to new body motions and new subjects. Since our approach is general and can easily incorporate different constraints and domain knowledge, we believe it can also be applied to other computer vision problems where training data are difficult to collect but domain knowledge is available.

The contribution of this research is not the additional constraints we introduced. Rather, it is the proposed generic framework that can effectively capture different types of body constraints and systematically encode them into the prior model. Also, through this paper, we want to emphasize that we are not advocating abandoning the data-driven approach. Rather, we advocate a hybrid approach that combines data with domain knowledge to achieve robust and generalizable visual understanding. We plan to apply this philosophy to other vision problems in the future.

REFERENCES

[1] G. Taylor, L. Sigal, D. Fleet, and G. Hinton, "Dynamical binary latent variable models for 3D human pose tracking," in Proc. IEEE CVPR, Jun. 2010, pp. 631–638.

[2] R. Li, T.-P. Tian, S. Sclaroff, and M.-H. Yang, "3D human motion tracking with a coordinated mixture of factor analyzers," Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 170–190, Mar. 2010.

[3] R. Urtasun and T. Darrell, "Sparse probabilistic regression from activity-independent human pose inference," in Proc. IEEE CVPR, Jun. 2008, pp. 1–8.

[4] C. Ionescu, F. Li, and C. Sminchisescu, "Latent structured models for human pose estimation," in Proc. IEEE ICCV, Nov. 2011, pp. 2220–2227.

[5] A. Agarwal and B. Triggs, "Recovering 3D human pose from monocular images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 1, pp. 44–58, Jan. 2006.

[6] T. Brox, B. Rosenhahn, D. Cremers, and H.-P. Seidel, "Nonparametric density estimation with adaptive, anisotropic kernels for human motion tracking," in Proc. 2nd Conf. Human Motion, Understand., Model., Capture Anim., Oct. 2007, pp. 152–165.

[7] G. Shakhnarovich, P. Viola, and T. Darrell, "Fast pose estimation with parameter-sensitive hashing," in Proc. 9th IEEE ICCV, vol. 2, Oct. 2003, pp. 750–757.

[8] R. Urtasun, D. J. Fleet, and P. Fua, "Monocular 3D tracking of the golf swing," in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 2, Jun. 2005, pp. 932–938.

[9] H. Sidenbladh, M. J. Black, and D. J. Fleet, "Stochastic tracking of 3D human figures using 2D image motion," in Proc. ECCV, Jul. 2000, pp. 702–718.

[10] A. Geiger, R. Urtasun, and T. Darrell, "Rank priors for continuous non-linear dimensionality reduction," in Proc. IEEE Conf. CVPR, 2009, pp. 1–2.

[11] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua, "Priors for people tracking from small training sets," in Proc. 10th IEEE ICCV, vol. 1, Oct. 2005, pp. 403–410.

[12] T.-P. Tian, R. Li, and S. Sclaroff, "Articulated pose estimation in a learned smooth space of feasible solutions," in Proc. IEEE Comput. Soc. Conf. CVPR Learn. Workshop, Jun. 2005, p. 50.

[13] R. Urtasun, D. J. Fleet, and P. Fua, "3D people tracking with Gaussian process dynamical models," in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2006, pp. 238–245.

[14] R. Urtasun, D. J. Fleet, A. Geiger, J. Popovic, T. J. Darrell, and N. D. Lawrence, "Topologically-constrained latent variable models," in Proc. 25th ICML, 2008, pp. 1080–1087.

[15] J. Chen, M. Kim, Y. Wang, and Q. Ji, "Switching Gaussian process dynamic models for simultaneous composite motion tracking and recognition," in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 2655–2662.

[16] D. Ramanan, D. A. Forsyth, and A. Zisserman, "Strike a pose: Tracking people by finding stylized poses," in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2005, pp. 271–278.

[17] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," Int. J. Comput. Vis., vol. 61, no. 1, pp. 55–79, Jan. 2005.

[18] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard, "Tracking loose-limbed people," in Proc. IEEE Comput. Soc. Conf. CVPR, vol. 1, Jul. 2004, pp. I-421–I-428.

[19] X. Lan and D. P. Huttenlocher, "Beyond trees: Common factor models for 2D human pose recovery," in Proc. 10th IEEE ICCV, vol. 1, Oct. 2005, pp. 470–477.

[20] Y. Wang and G. Mori, "Multiple tree models for occlusion and spatial constraints in human pose estimation," in Proc. 10th ECCV, 2008, pp. 710–724.

[21] C. Sminchisescu and B. Triggs, "Covariance scaled sampling for monocular 3D body tracking," in Proc. CVPR, vol. 1, 2001, pp. I-447–I-454.

[22] M. Vondrak, L. Sigal, and O. C. Jenkins, "Physical simulation for probabilistic motion tracking," in Proc. CVPR, 2008, pp. 1–8.

[23] M. A. Brubaker, L. Sigal, and D. J. Fleet, "Physics-based human motion modelling for people tracking," in Proc. ICCV, Sep. 2009, pp. 1–48.

[24] H. Kjellstrom, D. Kragic, and M. J. Black, "Tracking people interacting with objects," in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 747–754.

[25] D. Ramanan and D. A. Forsyth, "Finding and tracking people from the bottom up," in Proc. IEEE Comput. Soc. Conf. CVPR, Jun. 2003, pp. II-467–II-474.

[26] Y. Wu, G. Hua, and T. Yu, "Tracking articulated body by dynamic Markov network," in Proc. 9th IEEE ICCV, vol. 2, Oct. 2003, pp. 1094–1101.

[27] P. Noriega and O. Bernier, "Multicues 3D monocular upper body tracking using constrained belief propagation," in Proc. BMVC, 2007, pp. 1–10.

[28] D. Demirdjian, "Enforcing constraints for human body tracking," in Proc. CVPRW, Jun. 2003, p. 102.

[29] M. A. Brubaker and D. J. Fleet, "The kneed walker for human pose tracking," in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1–8.

[30] C. R. Wren and A. P. Pentland, "Dynamic modeling of human motion," in Proc. 3rd IEEE Int. Conf. Autom. Face Gesture Recognit., Apr. 1998, pp. 22–27.

[31] J. Bandouch and M. Beetz, "Tracking humans interacting with the environment using efficient hierarchical sampling and layered observation models," in Proc. 12th IEEE ICCV Workshop, Oct. 2009, pp. 2040–2047.

[32] J. Bandouch, F. Engstler, and M. Beetz, "Accurate human motion capture using an ergonomics-based anthropometric human model," in Proc. 5th Int. Conf. AMDO, Jul. 2008, pp. 248–258.

[33] L. Sigal, A. Balan, and M. Black, "HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion," Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 4–27, Mar. 2010.

[34] NASA. (2010, Dec.). NASA-STD-3000: Man-Systems Integration Standards, Washington, DC, USA [Online]. Available: http://msis.jsc.nasa.gov/sections/section03.htm

[35] A. Steindler, Kinesiology of the Human Body under Normal and Pathological Conditions. Springfield, IL, USA: Charles C Thomas, 1955.


[36] W. Gao, S. Shan, X. Chai, and X. Fu, "Virtual face image generation for illumination and pose insensitive face recognition," in Proc. ICME, Jul. 2003, pp. 149–152.

[37] A. Saxena, J. Driemeyer, and A. Y. Ng, "Robotic grasping of novel objects using vision," Int. J. Robot. Res., vol. 27, no. 2, pp. 157–173, Feb. 2008.

[38] A. Witkin and M. Kass, "Spacetime constraints," Comput. Graph., vol. 22, no. 4, pp. 159–168, Aug. 1988.

[39] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, T. Dietterich, Ed. Cambridge, MA, USA: MIT Press, 2009.

[40] G. Schwarz, "Estimating the dimension of a model," Ann. Stat., vol. 6, no. 2, pp. 461–464, 1978.

[41] J. Pearl, Probabilistic Reasoning in Intelligent Systems. San Mateo, CA, USA: Morgan Kaufmann, 1988.

[42] K. Murphy, "Dynamic Bayesian networks: Representation, inference and learning," Ph.D. dissertation, Dept. Comput. Sci., Univ. California, Berkeley, Berkeley, CA, USA, 2002.

[43] (2010, Dec.). Carnegie Mellon University Motion Capture Database [Online]. Available: http://mocap.cs.cmu.edu/

[44] G. Taylor, G. Hinton, and S. Roweis, "Modeling human motion using binary latent variables," in Proc. NIPS, 2007, pp. 1345–1352.

[45] X. Xu and B. Li, "Learning motion correlation for tracking articulated human body with a Rao-Blackwellised particle filter," in Proc. ICCV, 2007.

[46] R. Li, T.-P. Tian, and S. Sclaroff, "Simultaneous learning of nonlinear manifold and dynamical models for high-dimensional time series," in Proc. ICCV, 2007, pp. 1–8.

Jixu Chen received the Ph.D. degree in electrical engineering from the Rensselaer Polytechnic Institute, Troy, NY, USA, in 2011. He is currently a Researcher with the Computer Vision Laboratory, GE Global Research, Niskayuna, NY, USA. His current research interests include computer vision, machine learning, and human-computer interaction. He is a member of the IEEE Computer Society.

Siqi Nie received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2011. He is currently pursuing the Ph.D. degree with the Rensselaer Polytechnic Institute, Troy, NY, USA. His current research interests include learning and inference in probabilistic graphical models and their applications in computer vision.

Qiang Ji received the Ph.D. degree in electrical engineering from the University of Washington, Seattle, WA, USA. He is currently a Professor with the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute (RPI), Troy, NY, USA. He recently served as a Program Director with the National Science Foundation (NSF), where he managed NSF's computer vision and machine learning programs. He held teaching and research positions with the Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL, USA, the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA, the Department of Computer Science, University of Nevada at Reno, Reno, NV, USA, and the U.S. Air Force Research Laboratory. He currently serves as the Director of the Intelligent Systems Laboratory, RPI. His current research interests include computer vision, probabilistic graphical models, information fusion, and their applications in various fields. He has published over 160 papers in peer-reviewed journals and conferences. His research has been supported by major governmental agencies, including NSF, NIH, DARPA, ONR, ARO, and AFOSR, as well as by major companies, including Honda and Boeing. He is an Editor of several related IEEE and international journals, and he has served as a General Chair, Program Chair, Technical Area Chair, and Program Committee Member for numerous international conferences and workshops. He is a fellow of IAPR.