

International Journal of Computer Vision. The original publication is available at www.springerlink.com.

Learning Articulated Structure and Motion

David A. Ross · Daniel Tarlow · Richard S. Zemel

Received: 21 July 2008 / Accepted: 9 Feb 2010 / Published 2 March 2010.

DOI 10.1007/s11263-010-0325-y

Abstract Humans demonstrate a remarkable ability to parse complicated motion sequences into their constituent structures and motions. We investigate this problem, attempting to learn the structure of one or more articulated objects, given a time series of two-dimensional feature positions. We model the observed sequence in terms of “stick figure” objects, under the assumption that the relative joint angles between sticks can change over time, but their lengths and connectivities are fixed. The problem is formulated as a single probabilistic model that includes multiple sub-components: associating the features with particular sticks, determining the proper number of sticks, and finding which sticks are physically joined. We test the algorithm on challenging datasets of 2D projections of optical human motion capture and feature trajectories from real videos.

Keywords structure from motion · graphical models · non-rigid motion

1 Introduction

An important aspect of analyzing dynamic scenes involves segmenting the scene into separate moving objects and constructing detailed models of each object’s motion. For scenes represented by trajectories of features on the objects, structure-from-motion methods are capable of grouping the features and inferring the object poses when the features belong to multiple independently moving rigid objects. Recently, however, research has been increasingly devoted to more complicated versions of this problem, when the moving objects are articulated and non-rigid.

D.A. Ross · D. Tarlow · R.S. Zemel
University of Toronto, 10 King’s College Road, Toronto, ON, M5S 3G4, Canada
E-mail: {dross, dtarlow, zemel}@cs.toronto.edu

In this article we investigate the problem, attempting to learn the structure of an articulated object while simultaneously inferring its pose at each frame of the sequence, given a time series of feature positions. We propose a single probabilistic model for describing the observed sequence in terms of one or more “stick figure” objects. We define a “stick figure” as a collection of line segments (bones or sticks) joined at their endpoints. The structure of a stick figure—the number and lengths of the component sticks, the association of each feature point with exactly one stick, and the connectivity of the sticks—is assumed to be temporally invariant, while the angles (at joints) between the sticks are allowed to change over time. We begin with no information about the figures in a sequence, as the model parameters and structure are all learned. An example of a stick figure learned by applying our model to 2D feature observations from a video of a walking giraffe is shown in Figure 1.

Learned models of skeletal structure have many possible uses. For example, detailed, manually constructed skeletal models are often a key component in full-body tracking algorithms. The ability to learn skeletal structure could help to automate the process, potentially producing models more flexible and accurate than those constructed manually. Additionally, skeletons are necessary for converting feature point positions into joint angles, a standard way to encode motion for animation. Furthermore, knowledge of the skeleton can be used to improve the reliability of optical motion capture, permitting disambiguation of marker correspondence and occlusion (Herda et al., 2001). Finally, a learned skeleton might be used as a rough prior on shape to help guide image segmentation (Bray et al., 2006).

Fig. 1 Four frames from a video of a walking giraffe, with the articulated skeleton learned by our model superimposed. Each black line represents a stick, and each white circle a joint between sticks. The tracked features, which serve as the only input, are shown as coloured markers. Features associated with the same stick are assigned markers of the same colour and shape.

In the following section we discuss other recent approaches to modelling articulated figures from tracked feature points. In Section 3 we formulate the problem as a probabilistic model, and in Section 4 we propose an algorithm for learning the model from data. Learning proceeds in a stage-wise fashion, building up the structure incrementally to maximize the joint probability of the model variables.

In Section 5 we test the algorithm on a range of datasets. In the final section we describe assumptions and limitations of the approach, and discuss future work.

Research presented in this paper is a continuation of Ross et al. (2008), and includes results from Ross (2008a).

2 Related Work

Humans demonstrate a remarkable ability to parse complicated motion sequences, even from apparently sparse streams of information. One field where this is readily apparent is in the study of human response to point light displays. A point light display (PLD), as depicted in Figure 2, is constructed by attaching a number of point light sources to an object, then recording (only) the positions of these lights as the object moves. The canonical example is to instrument a human’s limbs and body with lights, then to record their positions as he or she performs motions such as walking, running, or swinging a golf club. PLDs have received considerable attention in psychology research (e.g. Johansson, 1973) due to one remarkable property. Despite the apparently limited information they contain, biological motion depicted in PLDs is almost instantly recognizable by humans. From a PLD of a person or animal, humans are able to understand the structure of the display (how the lights are connected via the performer’s underlying skeleton), and the motions that are performed.

Fig. 2 A point light display of a human, in four different poses.

Point light displays are also common in several domains of computer science research. The field of motion capture, in essence, is the study of recording and analyzing PLDs. In computer animation, PLDs obtained via motion capture are used to animate synthetic character models. Finally, in computer vision many applications choose to represent digital image sequences in terms of feature point trajectories. When the original image data is discarded, the feature point locations are equivalent to a PLD.

What follows is a discussion of three recent approaches to modelling articulated figures from tracked feature points. Each of these approaches addresses the problem from a different viewpoint: the first as structure from motion, the second as geometrical constraints in motion capture data, and the third as learning the structure of a probabilistic graphical model.

2.1 Articulated Structure From Motion

The first work we will consider is “Automatic Kinematic Chain Building from Feature Trajectories of Articulated Objects” by Yan and Pollefeys (2006b, 2008). This work builds on a history of solutions for the structure from motion (SFM) problem, extending them to handle articulated objects. We begin with a brief overview of this evolution, before describing the Yan and Pollefeys approach.

2.1.1 Standard Structure From Motion

Given a set of feature points observed at a number of frames, the goal of SFM is to recover the structure—the time-invariant relative 3D positions of the points—while simultaneously solving for the motion—the per-frame pose of the object(s) relative to the camera—that produced the observations. Generally, the input for SFM is assumed to be two-dimensional observations (image coordinates) of points on an inherently three-dimensional object. However, most algorithms, including the ones presented here, work equally well given 3D inputs.

When the trajectories come from one rigid object (or equivalently, the scene is static and only the camera moves), and the camera is assumed to be orthographic, Tomasi and Kanade (1992) have shown that structure and motion can be recovered by using the singular value decomposition (SVD) to obtain a low-rank factorization of the matrix of feature point trajectories.

Suppose we are given a matrix W where each column contains the x and y image coordinates of one of the observed points, at all time frames. Thus, given P points and F frames, the size of W is 2F × P (or 3F × P for three-dimensional observations). Considering the generative process that produced the observations (and disregarding noise), W is the product of a motion matrix and a structure matrix,

W = MS,

both of which are rank 4. The structure S is a 4 × P matrix containing the time-invariant (homogeneous) 3D coordinates of the points. At each frame f, the observations are produced by applying a rigid-body motion—a rotation R_f and a translation t_f—to S, and projecting the points onto the image plane:

[x_{f,1} … x_{f,P}; y_{f,1} … y_{f,P}] = [1 0 0; 0 1 0] [R_f t_f] S.

Hence, M is formed by stacking the first two rows of each of these F motion matrices. From W, M and S can be recovered by taking the singular value decomposition¹:

W = UΣVᵀ ⟹ M = UΣ^{1/2}, S = Σ^{1/2}Vᵀ.

¹ In most cases, although the columns of U and V span the correct subspaces, they are actually linear transformations of the columns of M and S respectively. This can be corrected by solving, via nonlinear optimization, for a transformation that satisfies the constraints on the rotational components of M (Tomasi and Kanade, 1992).

In practice, feature trajectories will be contaminated by noise, giving W a rank larger than 4. In this case Tomasi and Kanade suggest retaining only the columns of U, Σ and V corresponding to the four largest singular values, which is the optimal rank-4 approximation to W (under squared error).
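Concretely, the truncated factorization is a few lines of numpy; this is a minimal sketch of the rank-4 recipe above (function and variable names are ours), ignoring the metric upgrade discussed in footnote 1:

```python
import numpy as np

def factor_rank4(W):
    # W: 2F x P matrix of stacked image coordinates.
    U, sv, Vt = np.linalg.svd(W, full_matrices=False)
    U4, s4, Vt4 = U[:, :4], sv[:4], Vt[:4, :]   # optimal rank-4 truncation
    M = U4 * np.sqrt(s4)            # motion,    2F x 4
    S = np.sqrt(s4)[:, None] * Vt4  # structure,  4 x P
    return M, S
```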

Despite the elegance and popularity of this solution, Tomasi and Kanade (1992) assume a rather unrealistic camera model—scaled orthography—for the projection of three-dimensional points down to two dimensions. As such, this does not represent a complete solution to rigid-body SFM.² However, when the input consists of three-dimensional points (e.g. obtained from a motion capture system), scaled orthography is a perfectly reasonable assumption.

² Solutions based on the more realistic projective camera, perhaps using the above method as an initialization, can be obtained via an algorithm for bundle adjustment (Hartley and Zisserman, 2003).

2.1.2 Multibody SFM

Recovering structure and motion when the scene contains multiple objects moving independently is more challenging. Consider the case in which the point trajectories arise from two independent rigid objects. If the columns of W are sorted so that all points from object 1 come first, and the points from object 2 come second, the low-rank factorization can be written as follows:

W = MS = [M_1 M_2] [S_1 0; 0 S_2].   (1)

In this case the ranks of the motion and structure matrices (and hence, of W) have increased to 8, or 4× the number of objects. If the grouping of point trajectories into objects was known, the structure and motion of each object, M_i and S_i, could be recovered independently, using the method described earlier. The problem now becomes: how to group the points?

The solution proposed by Costeira and Kanade (1996, 1998) involves considering what they term the shape-interaction matrix, Q ≡ VVᵀ. When the columns of W are correctly sorted, as in (1), Q assumes a distinctive block-diagonal structure³

Q ≡ VVᵀ = SᵀΣ^{-1}S = [S_1ᵀΣ_1^{-1}S_1 0; 0 S_2ᵀΣ_2^{-1}S_2],

where V and Σ again arise from the SVD of W. Regardless of the sorting of the points, Q_{i,j} is nonzero if points i and j are part of the same rigid object, and 0 otherwise. The shape-interaction matrix has the advantage of being invariant to object motion, image scale, and choice of coordinate system.

³ VVᵀ is also block-diagonal if we allow Vᵀ to more generally be an invertible linear transformation of the true structure: S = A^{-1}Σ^{1/2}Vᵀ (Costeira and Kanade, 1998).

Costeira and Kanade suggest that grouping point trajectories can now be accomplished by reordering the points to make Q block-diagonal. This problem, however, is NP-complete, thus the greedy algorithm they propose obtains only sub-optimal solutions. Interestingly, Q can be interpreted as a pairwise affinity matrix. In fact, VVᵀ is simply a weighted version of the inner product matrix WᵀW. This interpretation suggests that other ways of normalizing the shape-interaction matrix are possible, and that points could be grouped by any clustering algorithm which takes as input an affinity matrix, such as spectral clustering (Shi and Malik, 2000; Culverhouse and Wang, 2003; Weiss, 1999) or Affinity Propagation (Frey and Dueck, 2007).

The primary disadvantage to this approach is that the shape-interaction matrix is highly sensitive to noise in the observations (Gruber and Weiss, 2004). First of all, in the presence of noise Q_{i,j} is no longer zero when i and j come from different objects. Furthermore, computing Q requires knowing the rank of W, which is the number of columns of V retained after the SVD. (Note that if we retain all columns of V, then Q = VVᵀ = I.) In the simplest case, this rank is 4× the number of objects, but it can be less when an object does not express all its degrees of mobility. Noise makes the rank of W difficult to determine, requiring an often-unreliable analysis of the eigenspectrum. One approach for dealing with this, in the presence of noise, is described by Gear (1998).
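For reference, the shape-interaction matrix itself is easy to compute once the rank is known; in the sketch below (our names), the rank argument must be supplied externally, which is exactly the fragility noted above. The result could be handed as an affinity matrix to spectral clustering or Affinity Propagation:

```python
import numpy as np

def shape_interaction(W, rank):
    # Q = V V^T restricted to the top `rank` right singular vectors.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[:rank, :].T          # P x rank
    return V @ V.T              # Q[i, j] ~ 0 for points on different rigid objects
```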

2.1.3 Probabilistic SFM

Gruber and Weiss (2003) have noted that the approach of Tomasi and Kanade can be reinterpreted as a probabilistic graphical model, specifically factor analysis. In factor analysis, each observed data vector is generated by taking a linear combination of a set of basis vectors, and adding diagonal-covariance Gaussian noise. In the context of single-body SFM each row w_i of W, the x or y coordinates of all feature points in one frame, is generated by taking a linear combination m_i of the rows of S. Including a standard Gaussian prior on the rows of the motion matrix produces the following model:

w_i = m_i S + n_i, where
n_i ∼ N(0, diag(ψ_i))
m_i ∼ N(0, I).

Structure and motion can be recovered by fitting the model using the standard Expectation Maximization (EM) algorithm for factor analysis (Ghahramani and Hinton, 1996a). An advantage of this formulation is that missing observations can be dealt with easily; setting the corresponding variances to ∞ has the effect of eliminating them from the calculations (Gruber and Weiss, 2003).
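To make the infinite-variance trick concrete, here is a sketch (our notation, not Gruber and Weiss's code) of the E-step posterior for a single motion row m_i under the model above; observations marked missing via ψ_p = ∞ contribute nothing, because only the inverse variances enter:

```python
import numpy as np

def posterior_motion_row(w, S, psi):
    # w: (P,) one row of W; S: (4, P) structure; psi: (P,) noise variances.
    # Missing observations are encoded as psi[p] = np.inf.
    inv_psi = np.where(np.isinf(psi), 0.0, 1.0 / psi)   # inverse variances
    prec = np.eye(S.shape[0]) + (S * inv_psi) @ S.T     # 4 x 4 posterior precision
    mean = np.linalg.solve(prec, (S * inv_psi) @ w)     # (4,) posterior mean
    return mean, prec
```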

Another key innovation of the Gruber and Weiss approach is to assume temporal coherence of motions. This allows them to take advantage of the fact, when estimating motions, that motions for adjacent frames should be similar. In the graphical model, temporal coherence is incorporated easily through the use of a Markov chain prior (a Kalman filter) over the latent motion variables. The result is closely related to the EM algorithm for learning linear dynamical systems (Ghahramani and Hinton, 1996b).

Multibody factorization

The probabilistic approach has also been extended to handle multiple independent rigid objects (Gruber and Weiss, 2004). Structure and motion are modeled in much the same way as (Costeira and Kanade, 1996): one independent factor analyzer of dimension 4 for each object. However, the approach of Gruber and Weiss to grouping point trajectories is quite different.

Instead of grouping points by clustering a pairwise affinity matrix, Gruber and Weiss incorporate additional discrete latent variables that assign each of the points to one of the motions. With this addition, the grouping, together with the structures and motions, can be estimated jointly using EM. This provides a distinct advantage over the method of Costeira and Kanade which, once it has grouped the points, is unable to reestimate the grouping based on subsequent information. Although fitting with EM often leads to local minima, in the presence of noise it outperforms Costeira and Kanade.

The core of this model is the same as Multiple Cause Factor Analysis (Ross and Zemel, 2006), independently proposed for simultaneous segmentation and appearance modelling of images.

2.1.4 Articulated Structures

The motion of an articulated object can be described as a collection of rigid motions, one per part, with the added constraint that the motions of connected parts must be spatially coherent. Yan and Pollefeys (2005a) have shown that this constraint causes the motion subspaces of two connected objects to intersect, making them linearly dependent. In particular, for each pair of connected parts, the motion subspaces share one dimension (translation) if they are joined at a point and two dimensions (translation and one angle of rotation) if they are joined at an axis of rotation. As a result of this dependence, the method of Costeira and Kanade (1996) for grouping points is no longer applicable.

To illustrate this, consider two parts that are connected by a rotational joint. Without loss of generality the shape matrices of the objects, S_1 and S_2 (dropping the homogeneous coordinate), can be adjusted to place this joint at the origin. Now, because the objects are connected at the joint, at each frame the translation components of their motions must be identical. Thus the ranks of W, M, and S have been reduced to at most 7 (Yan and Pollefeys, 2005a,b).

W = MS = [R_1 R_2 t] [S_1 0; 0 S_2; 1 1]

From this equation, we can see that the off-diagonal blocks of the shape interaction matrix, VVᵀ = SᵀΣ^{-1}S, are no longer zero, so clustering it will not effect the grouping of point trajectories.

Recognizing this, Yan and Pollefeys (2006a,b, 2008) propose an alternative affinity matrix to use for grouping points, and an approach for recovering the full articulated structure and motion of the sequence. Their method consists of four key steps: (1) segmenting the feature point trajectories into a number of rigid parts, (2) computing an affinity measure indicating the likelihood that each pair of parts is connected by a joint, (3) obtaining a spanning tree that connects parts while maximizing affinity, and finally (4) solving for the locations of joints.

When specifying the affinity between a pair of features, instead of relying on the dot product (angle) between rows v_i and v_j of V, they suggest that a more robust measure could be obtained by comparing the subspace spanned by v_i and its nearest neighbors with that of v_j and its neighbors. Given these two subspaces, they compute the principal angles θ_1, …, θ_m between them, and define the affinity between i and j to be

exp(−∑_n sin²(θ_n)).
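A sketch of this affinity, using the standard characterization of principal angles as singular values of the product of orthonormal bases (names are ours; Vi and Vj hold the rows of V for each trajectory together with its nearest neighbors):

```python
import numpy as np

def principal_angle_affinity(Vi, Vj):
    # Vi, Vj: (k x d) matrices of row vectors spanning the two local subspaces.
    Qi, _ = np.linalg.qr(Vi.T)          # orthonormal basis, d x k
    Qj, _ = np.linalg.qr(Vj.T)
    cos_t = np.linalg.svd(Qi.T @ Qj, compute_uv=False)  # cosines of principal angles
    cos_t = np.clip(cos_t, 0.0, 1.0)    # numerical safety
    return np.exp(-np.sum(1.0 - cos_t ** 2))            # sin^2 = 1 - cos^2
```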


The affinity is used as input for spectral clustering (Shi and Malik, 2000), thereby producing a grouping of feature point trajectories.

Principal angles are also used as a basis for learning the articulated structure. Noting that the four-dimensional motions (and hence shape subspaces) of parts connected by an articulated joint will have at least one dimension in common, at least one of the principal angles between the parts should be zero. Using minimum principal angle as an edge weight, Yan and Pollefeys set up a fully connected graph and solve for the articulated structure by finding the minimum spanning tree. The method can be extended to finding multiple articulated objects in a scene simply by disallowing edges with weight exceeding a manually specified threshold.

Finally, the locations of the joints can be obtained from the intersections of the motion subspaces of connected parts, as described in (Yan and Pollefeys, 2005a).

Due to the reliance on estimating subspaces, this method requires each body part to have at least as many feature points as the dimensionality of its motion subspace. (In practice, segmenting two independent objects requires at least five points per object, using at least three neighbors to estimate the local subspace, in the noise-free case.) However, relying on subspaces provides an additional advantage: the approach is able to deal with non-rigid body parts—single subspaces with rank higher than four.

Alternative approaches to articulated structure from motion are presented by Tresadern and Reid (2005) and Sminchisescu and Triggs (2003).

2.2 Geometric Analysis of Motion Capture Data

When observations are the 3D world locations of feature points, rather than 2D projections, the geometry of recovering 3D skeletal structure becomes easier. Based on a simple analysis of the distance between feature points, and following roughly the same four steps as Yan and Pollefeys (2006b), Kirk et al. (2005) are able to automatically recover skeletal structure from motion capture data. This is an improvement upon existing methods of fitting a skeleton to motion capture data (e.g. Silaghi et al., 1998; Abdel-Malek et al., 2004), which often require a user to manually associate markers with positions on a generic human skeleton.

The key property motivating the approach of Kirk et al. (2005) is: if two feature points are attached to the same rigid body part, then the distance between these points is constant. Furthermore, if two body parts are connected by a rotational joint, then the distances between the joint and the points belonging to both parts should also be constant. Feature points are grouped, to obtain body parts, by computing the standard deviation of the distance between each pair of points and using that as the (negative or inverse) affinity matrix for spectral clustering (Ng et al., 2002). The number of body parts is chosen manually, or again by analysis of the eigenspectrum.
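A minimal sketch of this grouping affinity, assuming fully observed 3D marker positions (the array layout is ours); the resulting matrix would then be fed to spectral clustering as described:

```python
import numpy as np

def rigid_affinity(X):
    # X: F x P x 3 marker positions. Markers on the same rigid part
    # keep a constant separation, so the std over frames is near zero.
    diff = X[:, :, None, :] - X[:, None, :, :]   # F x P x P x 3
    dist = np.linalg.norm(diff, axis=-1)         # F x P x P
    return -dist.std(axis=0)                     # P x P, higher = more rigid pair
```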

When determining the skeletal connectivity of the body parts, Kirk et al. define a joint cost, which is the average variance in the distance from a putative joint to each of the points in the two parts it connects. Joint costs are computed for each pair of body parts. Evaluating the joint cost requires non-linear conjugate gradient minimization, but also returns the optimal joint location at each frame. Note that joint locations can be estimated as long as one stick has at least two observed markers and the other stick has at least one. Finally, the skeletal structure is obtained by running a minimum spanning tree algorithm, using the joint costs as edge weights.

This method has a few drawbacks. First, it is only able to work on 3D observations—none of the distance constraints it relies upon apply when points are projected into 2D. Second, like (Yan and Pollefeys, 2006b), it consists of a sequence of steps without feedback or reestimation. Finally, beyond computing the positions of joints in each frame, the method does not produce a time-invariant model of structure or a set of motion parameters. As such, filling in missing observations or computing joint angles would require further processing.

One further caveat regarding this method is that, contrary to the images included in the paper (Kirk et al., 2005), its output is not actually a “stick figure”—a collection of line segments (bones or sticks) joined at their endpoints. Instead, in the learned graph, parts of the body are nodes and joints are edges, which is a more difficult structure to visualize.

2.3 Learning a Graphical Model Structure

Another approach to the analysis of PLDs is to model the relationships between feature point locations with a probabilistic graphical model. In this setting, recovering the skeleton is a matter of learning the graph structure and parameters of the model. This is the approach taken by Song et al. (2001, 2003), with a goal of automatically detecting human motion in cluttered scenes.

Treating each frame as an independent, identically distributed sample, Song et al. construct a model in which each variable node represents the position and velocity of one of the observed points. No latent variables are included; instead each feature point is treated as a unique part of the body. This presumes a much sparser set of features than (Yan and Pollefeys, 2006b) and (Kirk et al., 2005), which require each part to give rise to multiple feature point trajectories. The set of graphs considered is restricted to a particular class, decomposable triangulated graphs, in which all cliques are of size three. The limitation placed on the structure ensures that, although these graphs are more complicated than trees, efficient exact inference is still possible. The clique potentials, over triplets of nodes, are multivariate Gaussian distributions over the velocities and relative positions of the parts.

The maximum likelihood (ML) graph is the one that minimizes the empirical entropy of each feature point given its parents. Unfortunately no tractable algorithm exists for computing the ML graph, so Song et al. propose the following approximate greedy algorithm. Assuming all nodes are initially disconnected, choose the first edge in the graph by connecting the nodes B and C that minimize the joint entropy h(B,C). Then, for all possible ways of choosing a pair of connected parents (B,C) already in the graph, find the child A that minimizes the conditional entropy h(A|B,C) and connect it to the graph. Continue connecting child nodes to the graph until it has reached the desired size, or the entropy of the best putative child exceeds a threshold. The cost of this algorithm is O(n⁴), where n is the number of feature points.
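The greedy construction can be sketched as follows; h_joint and h_cond are hypothetical helpers standing in for the empirical entropy estimates, and the threshold-based stopping test is omitted for brevity:

```python
from itertools import combinations

def greedy_triangulated_graph(h_joint, h_cond, n):
    # h_joint(b, c) -> h(B, C); h_cond(a, b, c) -> h(A | B, C).
    first = min(combinations(range(n), 2), key=lambda bc: h_joint(*bc))
    used = set(first)
    edges = {first}                 # candidate parent pairs
    triangles = []
    while len(used) < n:
        # best child A over all connected parent pairs (B, C)
        a, b, c = min(((a, b, c) for (b, c) in edges for a in range(n)
                       if a not in used),
                      key=lambda abc: h_cond(*abc))
        used.add(a)
        edges.update({(a, b), (a, c)})
        triangles.append((a, b, c))
    return triangles
```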

Note that if the class of graphical models considered is restricted to trees, the graph structure can be found efficiently, by calculating the mutual information between each pair of body parts and solving for the maximum spanning tree (Taycher et al., 2002; Song et al., 2003).

Song et al. further extend their approach to handle cluttered scenes, obtained by automatically tracking features in video. Since the results of tracking are invariably noisy, this requires solving the correspondence problem at each frame (identifying which feature points are body parts, which come from the background, and which body parts are occluded). Learning can now be accomplished via an EM-like algorithm, which alternates optimizing the feature correspondence with learning the graphical model structure and parameters.

Although the authors are able to show some interesting results, this approach has a number of drawbacks. First, learned models are specific to the 3D position and orientation of the subject, accounting only for invariance to translation parallel to the image plane. Thus a model trained on a person walking from left to right is unable to detect a person walking from right to left (Song et al., 2003). Secondly, a single time-invariant model is learned on the data from all frames, thereby confounding structure and motion. Instead of trying to model these two latent factors separately, the presence of motion serves only to increase uncertainty in the graphical model.

3 Model

Here we formulate a probabilistic graphical model for sequences generated from articulated skeletons. By fitting this model to a set of feature point trajectories (the observed locations of a set of features across time), we are able to parse the sequence into one or more articulated skeletons and recover the corresponding motion parameters for each frame. The observations are assumed to be 2D, whether tracked from video or projected from 3D motion capture, and the goal is to learn skeletons that capture the full 3D structure. Fitting the model is performed entirely via unsupervised learning; the only inputs are the observed trajectories, with manually tuned parameters restricted to a small set of thresholds on Gaussian variances.

The observations for this model are the locations w^f_p of feature points p in frames f. A discrete latent variable R assigns each point to one of S sticks. Each stick s consists of a set of time-invariant 3D local coordinates L_s, describing the relative positions of all points belonging to the stick. L_s is mapped to the observed world coordinate system by a different motion matrix M^f_s at every frame f (see Figure 3). For example, in a noiseless system, where r_{p,1} = 1, indicating that point p has been assigned to stick 1, M^f_1 l_{1,p} = w^f_p.

If all of the sticks are unconnected and move independently, then this model essentially describes multibody SFM (Costeira and Kanade, 1998; Gruber and Weiss, 2004), or equivalently an instance of Multiple Cause Factor Analysis (Ross and Zemel, 2006). However, for an articulated structure, with connections between sticks, the stick motion variables are not independent (Yan and Pollefeys, 2006a). Allowing connectivity between sticks makes the problems of describing the constraints between motions and inferring motions from the observations considerably more difficult.

To deal with this complexity, we introduce variables to model the connectivity between sticks, and the (unobserved) locations of stick endpoints and joints in each frame. Every stick has two endpoints, each of which is assigned to exactly one vertex. Each vertex can correspond to one or more stick endpoints (vertices assigned two or more endpoints are joints). We will let k_i specify the coordinates of endpoint i relative to the local coordinate system of its stick, s(i), and v^f_j and e^f_i represent the world coordinate location of vertex j and endpoint i in frame f, respectively. Again, in a noiseless system, e^f_i = M^f_{s(i)} k_i for every frame f. Noting the similarity between the e^f_i variables and the observed feature positions w^f_p, these endpoint locations can be interpreted as a set of pseudo-observations, inferred from the data rather than directly observed.

Fig. 3 The generative process for the observed feature positions, and the imputed positions of the stick endpoints. For each stick, the relative positions of its feature points and endpoints are represented in a time-invariant local coordinate system (left). For each frame in the sequence (right), motion variables attempt to fit the observed feature positions (e.g. w^f_P) by mapping local coordinates to world coordinates, while maintaining structural cohesion by mapping stick endpoints to inferred vertex (joint) locations.

Vertices are used to enforce a key constraint: for all the sticks that share a given vertex, the motion matrices should map their local endpoint locations to a consistent world coordinate. This restricts the range of possible motions to only those resulting in appropriately connected figures. For example, in Figure 3, endpoint 2 (of stick 1) is connected to endpoint 4 (of stick 2); both are assigned to vertex 2. Thus in every frame f both endpoints should map to the same world location, the location of the knee joint, i.e. v^f_2 = e^f_2 = e^f_4.

The utility of introducing these additional variables is that, given the vertices V and endpoints E, the problem of estimating the motions and local geometries (M and L) factorizes into S independent structure-from-motion problems, one for each stick. Latent variable g_{i,j} = 1 indicates that endpoint i is assigned to vertex j; hence G indirectly describes the connectivity between sticks. The assumed generative process for the feature observations and the vertex and endpoint pseudo-observations is shown in Figure 3, and the corresponding probabilistic model is shown in Figure 4.

The complete joint probability of the model can be decomposed into a product of two likelihood terms, one for the true feature observations and the second for the endpoint pseudo-observations, and priors over the remaining variables in the model:

P = P(W|M,L,R) P(E|M,K,V,φ,G) P(V) P(φ) P(M) P(L) P(K) P(R) P(G)   (2)

Assuming isotropic Gaussian noise with precision (inverse variance) τ_w, the likelihood function is

P(W|M,L,R) = ∏_{f,p,s} N(w^f_p | M^f_s l_{s,p}, τ_w^{-1} I)^{r_{p,s}}   (3)

where r_{p,s} is a binary variable equal to 1 if and only if point p has been assigned to stick s. This distribution captures the constraint that for feature point p, its predicted world location, based on the motion matrix and its location in the local coordinate system for the stick to which it belongs (r_{p,s} = 1), should match its observed world location. Note that dealing with missing observations is simply a matter of removing the corresponding factors from this likelihood expression.⁴

⁴ This likelihood is applicable if the observations w^f_p are 2D or 3D. In the 2D case, we assume an affine camera projection. However, it would be possible to extend this to a projective camera by making the mean depend non-linearly on M^f_s l_{s,p}.
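For concreteness, the log of equation (3) might be evaluated as below; the tensor layout (F × D_o × P observations, per-frame per-stick motions in homogeneous form) is our own convention, with missing observations encoded as NaN and dropped, as the text prescribes:

```python
import numpy as np

def obs_log_likelihood(W, M, L, R, tau_w):
    # W: F x Do x P observed positions (NaN where missing).
    # M: F x S x Do x 4 per-frame stick motions [R | t].
    # L: S x 4 x P homogeneous local coordinates.
    # R: P x S one-hot stick assignments.
    F, Do, P = W.shape
    ll = 0.0
    for f in range(F):
        for p in range(P):
            if np.isnan(W[f, :, p]).any():
                continue                    # missing: remove the factor entirely
            s = int(np.argmax(R[p]))        # the stick with r_{p,s} = 1
            err = W[f, :, p] - M[f, s] @ L[s, :, p]
            ll += 0.5 * Do * np.log(tau_w / (2 * np.pi)) - 0.5 * tau_w * err @ err
    return ll
```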

Fig. 4 The graphical model. The bottom half shows the model for independent multibody SFM; the top half describes the vertices and endpoints, which account for motion dependencies introduced by the articulated joints.

Each motion variable consists of a 2 × 3 rotation matrix R^f_s and a 2 × 1 translation vector t^f_s: M^f_s ≡ [R^f_s t^f_s]. The motion prior P(M) is uniform, with the stipulation that all rotations be orthogonal: R^f_s (R^f_s)ᵀ = I.

We define the missing-data likelihood of an endpoint location as the product of two Gaussians, based on the predictions of the appropriate vertex and stick:

P(E|M,K,V,φ,G) ∝ ∏_{f,i} N(e^f_i | M^f_{s(i)} k_i, τ_m^{-1} I) ∏_{f,i,j} N(e^f_i | v^f_j, φ_j^{-1} I)^{g_{i,j}}   (4)

Here τ_m is the precision of the isotropic Gaussian noise on the endpoint locations with respect to the stick, and g_{i,j} is a binary variable equal to 1 if and only if endpoint i has been assigned to vertex j. The second Gaussian in this product captures the requirement that endpoints belonging to the same vertex should be coincident. Instead of making this a hard constraint, connectivity is softly enforced, allowing the model to accommodate a certain degree of non-rigidity in the underlying structure, as illustrated by the mismatch between endpoint and vertex positions in Figure 3. The vertex precision variables φ_j capture the degree of “play” in the joints, and are assigned Gamma prior distributions:

P(φ) = ∏_j Gamma(φ_j | α_j, β_j).   (5)

The prior on the vertex locations incorporates a temporal smoothness constraint, with precision τ_t:

P(V) = ∏_{f,j} N(v^f_j | v^{f−1}_j, τ_t^{-1} I)   (6)

The priors for feature and endpoint locations in the local coordinate frames, L and K, are zero-mean Gaussians, with isotropic precision τ_p:

P(L) = ∏_{s,p} N(l_{s,p} | 0, τ_p^{-1} I)   P(K) = ∏_i N(k_i | 0, τ_p^{-1} I)

Finally, the priors for the variables defining the structure of the skeleton, R and G, are multinomial. Each point p selects exactly one stick s (enforced mathematically by the constraint ∑_s r_{p,s} = 1) with prior probability c_s, and each endpoint i selects one vertex j (similarly ∑_j g_{i,j} = 1) with probability d_j:

P(R) = ∏_{p,s} (c_s)^{r_{p,s}}   P(G) = ∏_{i,j} (d_j)^{g_{i,j}}.

4 Learning

Given a set of observed feature point trajectories, we propose to fit this model in an entirely unsupervised fashion, by maximum likelihood learning. Conceptually, we divide learning into two challenges: recovering the skeletal structure of the model, and, given a structure, fitting the model’s remaining parameters. Structure learning involves grouping the observed trajectories into a number of rigid sticks, including determining the number of sticks, as well as determining the connectivity between them. Parameter learning involves determining the local geometries and motions of each stick, as well as imputing the locations of the stick endpoints and joints—all while respecting the connectivity constraints imposed by the structure.

Both learning tasks seek to optimize the same objective function—the expected complete log-likelihood of the data given the model—using different, albeit related, approaches. Given a structure, parameters are learned using the standard variational expectation maximization algorithm. Structure learning is formulated as an “outer loop” of learning: beginning with a fully disjoint multibody SFM solution, we incrementally merge stick endpoints, at each step greedily choosing the merge that maximizes the objective. Finally, the expected complete log-likelihood can be used for model comparison and selection.

A summary of the proposed learning algorithm is provided in Figure 5.

4.1 Learning the model parameters

Given a particular model structure, indicated by a specific setting of R and G, the remaining model parameters are fit using the variational expectation-maximization (EM) algorithm (Neal and Hinton, 1998; Dempster et al., 1977). This well-known algorithm takes an iterative approach to learning: beginning with an initial setting of the parameters, each parameter is updated in turn, by choosing the value that maximizes the expected complete log-likelihood objective function, given the values (or expectations) of the other parameters.

The objective function—also known as the negative Free Energy (Neal and Hinton, 1998)—is formed by assuming a fully factorized variational posterior distribution Q over a subset of the model parameters, then computing the expectation of the model’s log probability (2) with respect to Q, plus an entropy term:

L = E_Q[log P] − E_Q[log Q].   (7)

For this model, we define Q over the variables V, E, and φ, involved in the world-coordinate locations of the joints. The variational posterior for v^f_j is a multivariate Gaussian with mean parameter μ(v^f_j) and precision parameter τ(v^f_j); for e^f_i it is also a Gaussian, with mean μ(e^f_i) and precision τ(e^f_i); and for φ it is a Gamma distribution with parameters α(φ_j) and β(φ_j):

Q = Q(V) Q(E) Q(φ)

Q(V) = ∏_{f,j} N(v^f_j | μ(v^f_j), τ(v^f_j)^{-1})
Q(E) = ∏_{f,i} N(e^f_i | μ(e^f_i), τ(e^f_i)^{-1})
Q(φ) = ∏_j Gamma(φ_j | α(φ_j), β(φ_j)).

The EM update equations are obtained by differentiating the objective function L with respect to each parameter, and solving for the maximum given the other parameters. We now present the parameter updates, with detailed derivation of L and the updates appearing in (Ross, 2008b). As a reminder, the constants appearing in these equations denote: D_o the dimensionality of the observations (generally 2, but 3 will also work); F the number of observation frames; J the number of vertices; P the number of data points; S the number of sticks.

τ_w^{-1} = (∑_{f,p,s} r_{p,s} ‖w^f_p − M^f_s l_{s,p}‖²) / (F P D_o)

τ_m^{-1} = (∑_{f,i} ‖μ(e^f_i) − M^f_{s(i)} k_i‖²) / (2 F S D_o) + (∑_{f,i} τ(e^f_i)^{-1}) / (2 F S)

τ_t^{-1} = (∑_{f=2..F} ∑_j ‖μ(v^f_j) − μ(v^{f−1}_j)‖²) / ((F − 1) J D_o) + (∑_{f,j} τ(v^f_j)^{-1} 2^{h(f)}) / ((F − 1) J),

where h(f) = 1 if 1 < f < F and 0 otherwise.

τ(e^f_i) = ∑_j g_{i,j} α(φ_j)/β(φ_j) + τ_m

μ(v^f_j) = ( (α(φ_j)/β(φ_j)) ∑_i g_{i,j} μ(e^f_i) + [f > 1] τ_t μ(v^{f−1}_j) + [f < F] τ_t μ(v^{f+1}_j) ) / ( (α(φ_j)/β(φ_j)) ∑_i g_{i,j} + τ_t 2^{h(f)} )

τ(v^f_j) = (α(φ_j)/β(φ_j)) ∑_i g_{i,j} + τ_t 2^{h(f)}

α(φ_j) = α_j + (F D_o / 2) ∑_i g_{i,j}

β(φ_j) = β_j + (1/2) ∑_{f,i} g_{i,j} ‖μ(e^f_i) − μ(v^f_j)‖² + (D_o / 2) ∑_{f,i} g_{i,j} [τ(e^f_i)^{-1} + τ(v^f_j)^{-1}]

α_j = α(φ_j)   β_j = β(φ_j)

The update for the motion matrices is slightly more challenging due to the orthogonality constraint on the rotations. A straightforward approach is to separate the rotation and translation components of the motion and to solve for each individually. The update for translation is obtained simply via differentiation:

M^f_s = [R^f_s t^f_s]

t_{s,f} = ( τ_w ∑_p r_{p,s} (w^f_p − R^f_s l_{s,p}) + τ_m ∑_{i|s(i)=s} (μ(e^f_i) − M^f_s k_{s,i}) ) / ( τ_w ∑_p r_{p,s} + 2 τ_m )

To deal with the orthogonality constraint on R^f_s, its update can be posed as an orthogonal Procrustes problem (Golub and Van Loan, 1996; Viklands, 2006). Given matrices A and B, the goal of orthogonal Procrustes is to obtain the matrix R that minimizes ‖A − RB‖², subject to the constraint that the rows of R form an orthonormal basis. Computing the most likely rotation involves maximizing the likelihood of the observations (3) and of the endpoints (4), which can be written as the minimization of ∑_p ‖(w^f_p − t_{s,f}) − R^f_s l_{s,p}‖² and ∑_{i|s(i)=s} ‖(μ(e^f_i) − t_{s,f}) − R^f_s k_{s,i}‖² respectively. Concatenating the two problems together, weighted by their respective precisions, allows the update of R^f_s to be written as a single orthogonal Procrustes problem argmin_{R^f_s} ‖A − R^f_s B‖², where

A = [ [√τ_w r_{p,s} (w^f_p − t_{s,f})]_{p=1..P}   [√τ_m (μ(e^f_i) − t_{s,f})]_{i|s(i)=s} ]
B = [ [√τ_w r_{p,s} l_{s,p}]_{p=1..P}   [√τ_m k_i]_{i|s(i)=s} ].

The solution is to compute the singular value decomposition BAᵀ = UΣVᵀ, and let R = V I_{m×n} Uᵀ, where m and n are the numbers of rows in A and B respectively.
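The SVD recipe translates directly to code; a sketch under our naming, where A is m × N and B is n × N:

```python
import numpy as np

def procrustes_rotation(A, B):
    # Returns the m x n matrix R with orthonormal rows minimizing ||A - R B||^2.
    m, n = A.shape[0], B.shape[0]
    U, _, Vt = np.linalg.svd(B @ A.T)        # SVD of B A^T (n x m)
    return Vt.T @ np.eye(m, n) @ U.T         # R = V I_{m x n} U^T
```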

Given R^f_s and t^f_s, the updates for the local coordinates are:

l_{s,p} = ( ∑_f (R^f_s)ᵀ R^f_s + (τ_p/τ_w) I )^{-1} ∑_f (R^f_s)ᵀ (w^f_p − t_{s,f})

k_i = ( ∑_f (R^f_{s(i)})ᵀ R^f_{s(i)} + (τ_p/τ_m) I )^{-1} ∑_f (R^f_{s(i)})ᵀ (μ(e^f_i) − t^f_{s(i)})

The final issue to address for EM learning is initialization. Many ways to initialize the parameters are possible; here we settle on one simple method that produces satisfactory results. The motions and local coordinates, M and L, are initialized by solving SFM independently for each stick (Tomasi and Kanade, 1992). The vertex locations are initialized by averaging the observations of all sticks participating in the joint: μ(v^f_j) = (∑_{i,p} g_{i,j} r_{p,s(i)} w^f_p) / (∑_{i,p} g_{i,j} r_{p,s(i)}). The endpoints are initially coincident with their corresponding vertices, μ(e^f_i) = ∑_j g_{i,j} μ(v^f_j), and the Ks by averaging the backprojected endpoint locations: k_i = (1/F) ∑_f (R^f_{s(i)})ᵀ (μ(e^f_i) − t^f_{s(i)}). All precision parameters are initialized to constant values, as discussed in Section 5.1.

4.2 Learning the skeletal structure

Structure learning in this model entails estimating the assignments of feature points to sticks (including the number of sticks), and the connectivity of sticks, expressed via the assignments of stick endpoints to vertices. The space of possible structures is enormous. We therefore adopt an incremental approach to structure learning: beginning with a fully disconnected multibody-SFM model, we greedily add joints between sticks by merging vertices. After each merge the model parameters are updated via EM, and the assignments of observations to sticks are resampled. After performing the desired number of merges, model selection—that is, choosing the optimal number of joints—is guided by comparing the expected complete log-likelihood of each model.


The first step in structure learning involves hypothesizing an assignment of each observed feature trajectory to a stick. This is accomplished by clustering the trajectories using the Affinity Propagation algorithm of Frey and Dueck (2007). Affinity Propagation takes as input an affinity matrix, for which we supply the affinity measure from Yan and Pollefeys (2006a,b, 2008) as presented in Section 2.1.4 (or, for 3D data, Kirk et al. (2005), discussed in Section 2.2). During EM parameter learning, the stick assignments R are resampled every 10 iterations using the posterior probability distribution

P(r_{p,s}) ∝ c_s exp( −(τ_w/2) ∑_f ‖w^f_p − M^f_s l_{s,p}‖² )   s.t. ∑_{s′} r_{p,s′} = 1.

Instead of relying only on information available before model fitting begins (Costeira and Kanade, 1998; Kirk et al., 2005; Yan and Pollefeys, 2006b), resampling of stick assignments allows model probability to be improved by leveraging current best estimates of the model parameters.
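A sketch of this resampling step, reusing the tensor layout of the earlier likelihood sketch (rng is a numpy random Generator):

```python
import numpy as np

def resample_assignments(W, M, L, c, tau_w, rng):
    # Sample r_p from P(r_{p,s}) ∝ c_s exp(-(tau_w/2) Σ_f ||w^f_p - M^f_s l_{s,p}||²).
    F, Do, P = W.shape
    S = len(c)
    R = np.zeros((P, S), dtype=int)
    for p in range(P):
        sq = np.array([sum(float(np.sum((W[f, :, p] - M[f, s] @ L[s, :, p]) ** 2))
                           for f in range(F)) for s in range(S)])
        logp = np.log(c) - 0.5 * tau_w * sq
        prob = np.exp(logp - logp.max())    # stabilized normalization
        prob /= prob.sum()
        R[p, rng.choice(S, p=prob)] = 1     # enforces Σ_s r_{p,s} = 1
    return R
```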

The second step of structure learning involves determining which stick endpoints are joined together. As discussed earlier, connectivity is captured by assigning stick endpoints to vertices; each endpoint must be associated with one vertex, and vertices with two or more endpoints act as articulated joints. (Valid configurations include only cases in which endpoints of a given stick are assigned to different vertices.) We employ an incremental greedy scheme for inferring this graphical structure G, beginning from an initial structure that contains no joints between sticks. Thus, in terms of the model, we start with J = 2S vertices, one per stick endpoint, so g_{i,j} = 1 if and only if j = i. Given this initial structure, parameters are fit using variational EM.

A joint between sticks is introduced by merging together a pair of vertices. The choice of vertices to merge is guided by our objective function L. At each stage of merging we consider all valid pairs of vertices, putatively joining them and estimating (via 20 iterations of EM) the change in log-likelihood if this merge were accepted. The merge with the highest log-likelihood is performed, by modifying G accordingly, and the model parameters are re-optimized with 200 additional iterations of EM, including resampling of the stick assignments R. This process is repeated until no valid merges remain, or the desired maximum number of merges has been reached.

4.2.1 Computational Cost

By examining the updates presented in Section 4.1, it can be seen that the cost of each iteration of EM parameter learning scales linearly in the following quantities: F the number of frames, J the number of joints, P the number of observed feature point trajectories, and S the number of sticks. (Note that since the number of rows in A and B are fixed, each orthogonal Procrustes update of R^f_s has a cost that is linear in P—the initial multiplication ABᵀ—in addition to a constant-cost SVD and final multiplication.)

Each stage of greedy merging requires computing the expected log-likelihood for all of the possible pairs of vertices to be merged. The number of possible merges scales with O(J²), which, since J = 2S during the first stage, can be as high as 4S². In practice, however, it is possible to reduce the number that must be considered. Savings can be obtained by noting the symmetry of the merge operation, reducing the number of unique merges by a factor of two, as well as by disallowing self-merges between the two endpoints of a stick. A less obvious savings can be realized by avoiding duplication when merging with a stick that has two free endpoints, since the change in probability from merging to either of these otherwise unconstrained endpoints will be identical. During the initial stage, when the structure contains no joints, this reduces the number of unique merges by an additional factor of four. During later stages, there are fewer possible merges to consider since J, the number of vertices, decreases by one for each stage, and our previously mentioned restriction—that the endpoints of a stick cannot be assigned to the same vertex—eliminates a greater proportion of potential merges.

In our experiments these optimizations are sufficient to yield acceptable runtimes; however, given much larger models the number of possible merges could be reduced to O(J) by allowing each stick to merge with only a fixed number (e.g. five) of its nearest neighbors. It may also be possible to achieve further savings through caching—approximating the expected change in log-likelihood of a merge with its value from the previous stage, without recomputing (Ross et al., 2007).

5 Experimental Results and Analysis

We now present results of the proposed algorithm on a range of different feature point trajectory datasets. This includes data obtained by automatically tracking features in video, from optical motion capture (both 2D and 3D), as well as a challenging artificially generated sequence. In each experiment a model was learned on the first 70% of the sequence frames, with the remaining 30% held out as a test set used to measure the model’s performance. Learning was performed using the algorithm summarized in Figure 5, with greedy merging continuing (generally) until no valid merges remained. After each stage of merging, we saved the learned model and corresponding expected complete log-likelihood—the objective function learning maximizes. The likelihoods were plotted for comparison, and used to select the optimal model.

1. Obtain an initial grouping R by clustering the observed trajectories using Affinity Propagation. Initialize G to a fully disconnected structure.
2. Optimize the parameters M, L, K, V, φ, E, using 200 iterations of the variational EM updates, resampling R every 10 iterations.
3. For all vertex-pair merges, estimate the gain resulting from the proposed structure by updating the parameters with 20 EM iterations and noting the change in expected log-probability.
4. Choose the merge with the largest gain, modifying G accordingly. Re-optimize parameters using another 200 EM iterations, resampling R every 10th.
5. Go to Step 3 and repeat. Exit when there are no more valid merges, or the maximum number of merges has been reached.

Fig. 5 A summary of the learning algorithm
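In outline, the procedure of Figure 5 might be organized as below; `model` and all of its methods are hypothetical stand-ins for the corresponding steps, not an interface defined in the paper:

```python
def learn_structure(model):
    model.R = model.cluster_trajectories()          # step 1: Affinity Propagation
    model.G = model.disconnected_structure()
    model.run_em(iters=200, resample_every=10)      # step 2
    history = [model.snapshot()]                    # keep each stage for selection
    while model.valid_merges():                     # steps 3-5
        gains = {m: model.try_merge(m, em_iters=20)
                 for m in model.valid_merges()}
        model.apply_merge(max(gains, key=gains.get))
        model.run_em(iters=200, resample_every=10)
        history.append(model.snapshot())
    # model selection: the stage with the largest expected complete log-likelihood
    return max(history, key=lambda snap: snap.log_likelihood)
```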

The learned model’s performance was evaluated based on its ability to impute (reconstruct) the locations of missing observations. For each test sequence we generated a set of missing observations by simulating an occluder that sweeps across the scene, obscuring points as it passes. We augmented this set with an additional 5% of the observations chosen to be “missing at random”, to simulate drop-outs and measurement errors, resulting in an overall occlusion rate of 10–15%. The learned model was fit to the un-occluded points of the test sequence, and used to predict the location of the missing points. Performance was measured by computing the root-mean-squared error between the predictions and the locations of the held-out points. We compared the performance of our model against similar prediction errors made by single-body and multibody structure from motion models.

This section begins with a brief analysis of the effect of precision parameters during learning, followed by experimental results on five datasets: a video of an excavator, a video of a walking giraffe, 2D feature trajectories obtained from human motion capture, a synthetic dataset of a jointed ring, and an additional set of human motion data in 3D. Finally we conclude with a brief comparison against two related methods (Yan and Pollefeys, 2008; Kirk et al., 2005).

Videos of the experimental results may be found at http://www.cs.toronto.edu/~dross/articulated/.

5.1 Setting Precision Parameters During Learning

As presented, the model contains a number of precision parameters to be determined during learning: τ_w, τ_m, τ_t, τ_p, τ(v^f_j), τ(e^f_i), as well as the parameters of the prior distribution on the joint precisions, α_j and β_j. In practice, simply initializing these precisions to arbitrary values and allowing them to adapt freely during EM leads to poor results. Some of the precisions—particularly τ_w, τ_m, τ(v^f_j), and τ(e^f_i)—tend to grow unbounded, thus we have found it useful to specify a maximum precision of 50 (a standard trick during EM). In contrast, the joint precisions φ_j (given by α(φ_j)/β(φ_j)) tend towards relatively small values, resulting in a model that has very little cohesion in the joints. To counteract this we specify a very strong prior on φ_j encouraging it towards large values: α_j = 2 × 10⁵ × maximum precision and β_j = 10⁵, resulting in an expected value of 2 × maximum precision with limited variance. When fitting the motion of a stick, assuming other precisions saturate at the maximum, this means that keeping an endpoint near its vertex is at least twice as important as keeping a feature point near its observed location.

In our experiments, we have found temporal smoothing of the vertices, governed by the precision τ_t, to be a disadvantage during learning. Particularly at the beginning, when the structure contains no joints, smoothing causes the unconnected vertices and endpoints to drift away from the actual observations at each frame, towards their temporal mean. Thus, in all of the following experiments we disable smoothing during learning. However, when measuring test performance it is not uncommon for one or more adjacent sticks to be entirely occluded during a frame. When this happens, the smoothed locations of the vertices provide the only source of information about the location of the stick, and thus temporal smoothing is essential for limiting test error. When measuring test performance, therefore, we enable smoothing and set τ_t = 2000.

Finally, the precisions play an important role in determining the optimal number of joints. During model selection we seek the model with the largest expected complete log-likelihood, hoping that this will include as many plausible joints as possible. However, most terms in this objective function favour a disconnected model, with more vertices and fewer joints. To understand this, consider the problem of estimating the motion of an unconnected stick. Since there are no constraints on the vertices, they can be trivially placed to be coincident with the endpoints, so the motion variable need only focus on maximizing the probability of the observations. However, when two sticks are joined together, perfect placement of the vertices is generally not possible, requiring modelling compromises that introduce slight reductions in observation probability. The one term in the objective function that does not decrease as merges are performed is the entropy of the vertices, H(Q(V)) = −E_{Q(V)}[log Q(V)].

Assuming each precision parameter in Q(V) is equal to the maximum precision, p, this entropy is (F J D_o / 2) log(2πe/p). If p is greater than 2πe ≈ 17.08, then the differential entropy is negative (although unintuitive, negative differential entropies are perfectly acceptable; Cover and Thomas, 1991). The result is that decreasing the number of vertices J causes the log-likelihood to increase. In fact, a fixed cost of (F D_o / 2) log(2πe/p) is paid for each vertex in the model, giving us the desired bias towards connectivity. Plots of the relevant log-likelihood terms are included for the datasets presented below.
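To get a feel for the magnitude of this per-vertex cost, the following one-off computation plugs in the dimensions of the excavator sequence below (F = 176 frames, D_o = 2) and the maximum precision p = 50; the numbers are our own worked example of the formula above.

```python
import numpy as np

F, Do, p = 176, 2, 50.0          # frames, observation dims, max precision
cost_per_vertex = (F * Do / 2) * np.log(2 * np.pi * np.e / p)
print(cost_per_vertex)           # about -189 nats: every vertex eliminated
                                 # by a merge raises the objective this much
```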

5.2 Excavator

Our first dataset consisted of a video clip of an excavator. We used a Kanade-Lucas-Tomasi tracker (Shi and Tomasi, 1994) (with manual assistance to correct for frequent loss of track) to obtain 35 feature trajectories across 176 frames. Our algorithm processed the data in 4 minutes on a 2.8 GHz processor. The learned model at each stage of greedy merging is depicted in Figure 6. The optimal structure was chosen by comparing the log-likelihood at each stage, as plotted in Figure 7 (left). The four most significant terms comprising this objective function are plotted individually in Figure 7 (right). As can be seen, joining sticks adds constraints that reduce the expected probability of the observations (top left), the endpoints given the vertices (top right), and the endpoints given M_k (bottom left). In contrast, the vertex entropy term (bottom right) acts as a per-vertex penalty, which decreases as we merge vertices, favoring more highly connected models. Figure 7 (bottom) shows that the system's prediction error for occluded data was significantly better than that of either multibody or single-body SFM.

As can be seen in Figure 6, the model does a good job of recovering the structure—the grouping and connectivity—of the observed trajectories. The reconstruction shows some deviation between the inferred locations of the joints and their intuitive positions. The probable source of this inaccuracy is that the small range of motion exhibited by the excavator's arm permits a range of possible joint positions, while the Gaussian prior pulls the joints towards the center of mass of each stick. Evidently, while mathematically convenient, the Gaussian prior is not always the best choice. Nevertheless, the model fully captures the observed motion of the excavator's arm, despite the inaccurate joints.

Using the excavator data, we also examined the model's robustness to occlusions in the training data. When the occlusion scheme described earlier was used to generate a training set with missing observations, the learning algorithm was still able to recover the correct structure. Similarly, when training observations were withheld at random, rather than via structured occlusion, the correct structure was reliably recovered with up to 75% of the training observations missing.

5.3 Giraffe

Our second dataset consisted of a video of a walking giraffe. As before, features were tracked, producing 60 trajectories across 128 frames. Merging results are depicted in Figure 8. Using the objective function to guide model selection (Figure 9), the best structure corresponded to stage 10; this model is shown superimposed over the original video in Figure 1, at the start of this article.

5.4 2D Human

Our third dataset consisted of optical human motion capture data (courtesy of the Biomotion Lab, Queen's University, Canada), which we projected from 3D to 2D using an isometric projection. The data contained 53 features, tracked across a 1018-frame range-of-motion exercise (training data), and 318 frames of running on an inclined plane (test data). The structures learned during greedy merging are shown in Figure 10, of which stage 11 most closely matches human intuition.
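For concreteness, an isometric projection can be implemented as an orthographic projection after a fixed rotation. The sketch below uses the classic 45°/35.264° isometric viewing direction; that particular choice of axis is our assumption, as the text does not specify it.

```python
import numpy as np

def isometric_projection(X3d):
    """Project (..., 3) mocap points to (..., 2) with an isometric view:
    rotate into the standard isometric orientation, then drop depth."""
    a = np.deg2rad(45.0)                      # rotation about the vertical
    b = np.arcsin(np.tan(np.deg2rad(30.0)))   # ~35.264 degrees of elevation
    Ry = np.array([[ np.cos(a), 0.0, np.sin(a)],
                   [ 0.0,       1.0, 0.0      ],
                   [-np.sin(a), 0.0, np.cos(a)]])
    Rx = np.array([[1.0, 0.0,        0.0       ],
                   [0.0, np.cos(b), -np.sin(b)],
                   [0.0, np.sin(b),  np.cos(b)]])
    X = X3d @ (Rx @ Ry).T                     # apply Ry, then Rx
    return X[..., :2]                         # orthographic: discard depth
```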

By examining the plots in Figure 11, it can be noted that the expected log-likelihood of the various models forms a plateau, roughly between stages 8 and 11, rather than a sharp peak as seen for the excavator data. Although stage 11 is not actually the most likely model (stage 8 is slightly higher), the log-likelihood decreases rapidly after stage 11. This suggests that having too many joints—and thereby hampering the ability of sticks to move so as to fit the observations—is a bigger disadvantage to the model than simply having too few joints. Theoretically, it may be possible to produce a global maximum in log-likelihood at stage 11 by simply increasing the maximum precision (thereby penalizing stage 8, which has more vertices). However, recognizing our preference for models with as many plausible joints as possible, selecting the stage at the edge of the plateau—stage 11—seems a reasonable choice.


Fig. 6 Excavator Data: Shown at the top are the models learned in each of the five successive stages of greedy learning. Reconstructions of the observed markers are shown with different symbols depending on their stick assignments. The locations of vertices are shown as black o's, and black lines are drawn to connect each stick's pair of vertices. At the bottom, the selected model (stage 3) is used to reconstruct the observed feature trajectories, and the results are superimposed over the corresponding frames of the input video.


[Figure 7: plots of E[log prob] and its four most significant terms (E[log N(W|ML)], E[log N(E|V)], E[log N(E|MK)], and H(Q(v))) against merge stage, together with a bar plot of train/test RMS error for the multibody, articulated, and single-body models on the excavator occluder task.]

Fig. 7 Excavator Log-likelihood and Error. At the top-left we see that stage 3 of merging produces the model with the highest log-probability. At the top-right are individual plots of the four most significant terms comprising the log-probability. At the bottom, we can see that the learned model exhibits less reconstruction error than either single-body or multibody SFM models.


Again, the articulated model achieved a lower test error than either single-body or multibody SFM.

5.5 Synthetic Ring

In order to evaluate the performance of the model on data containing significant out-of-plane motion, we created a challenging synthetic dataset depicting a segmented ring deforming in space. The generated sequence consisted of 100 features across 300 frames, to which independent Gaussian noise of standard deviation 0.05 was added. (For comparison, each stick was approximately 0.5 units wide and 5 units long.) Six frames from the sequence are depicted in Figure 12.


Fig. 8 Giraffe structures learned during greedy merging. Stage 10 has the highest expected log-likelihood.



Fig. 9 Giraffe Log-likelihood and Error.


The models learned at the successive stages of merging are shown in Figure 13. The sharp downturn in log-likelihood between stages 8 and 9, shown in Figure 14, suggests selecting stage 8 as the best model. (Note that although stage 0, which is equivalent to multibody SFM, has a higher expected log-probability, stage 8 has the lower test error.) Unlike methods based on spanning trees, our approach was able to recover the correct closed-ring structure.

Interestingly, all of the learned structures grouped the feature points into eight sticks, three more than were in the true grouping used to generate the data, as illustrated in the bottom-right of Figure 13. Examination of the results shows that these extra groups arise from splitting three of the true sticks each into a pair of sticks connected by a joint. Although the learned structure is an over-segmentation of the ground-truth structure, it still provides a perfectly acceptable model of the data.

As a further analysis of the algorithm's inability to identify the correct number of sticks and joints, we performed an experiment in which the correct ground-truth segmentation for the ring data was provided as an initialization. From this starting point, the learning algorithm was able to recover the connectivity, joint locations, and parameters correctly. This suggests that the problem is not inherent in the representational capability of the model, but rather that the greedy/EM optimization algorithm has difficulty escaping from a poor initial segmentation, impairing its ability to identify the correct number of sticks.


Fig. 10 2D Human structures learned during greedy merging, of which stage 11 most closely matches human intuition.



Fig. 11 2D Human Log-likelihood and Error.

5.6 3D Human

Although recovering 3D structure from 3D observations is much simpler than from 2D data, it also receives attention in the literature. As mentioned previously, our model extends easily to 3D observations, so we include an additional experiment demonstrating this ability. Here we trained our model on optical human motion capture data obtained from the Carnegie Mellon University Motion Capture Database. The data consisted of 174 feature points tracked across 732 frames (downsampled by a factor of three from the original frame rate). The results of greedy merging are shown in Figure 15, and the corresponding log-likelihoods in Figure 16. Since learning from 3D observations is an easier problem, the most likely structure—stage 15—is visually more appealing than the structure learned earlier on the 2D human data.

5.7 Comparisons with Related Methods

Finally, as an additional qualitative comparison, we ran our method on two sequences from Yan and Pollefeys (2008), and ran a re-implementation of Kirk et al. (2005) on our 3D datasets.

The results of our method on Yan and Pollefeys's "puppet" and "dancing" sequences are shown in Figure 17, at the top left and top right respectively. (Please compare with Figures 10 and 11 in Yan and Pollefeys (2008).) As can be seen, our method does a good job of recovering the structure of the puppet, including its segmentation and joints. In contrast with Yan and Pollefeys's results, our approach finds more segments: the arms and legs are split into two segments each, instead of only one; and the head, neck, and chest are subdivided, instead of being combined into one segment. An unintuitive choice made by our algorithm was to place a joint connecting the legs, above the knees.


Fig. 12 Synthetic Ring Data: Six frames selected from a synthetic data sequence depicting the motion of a 5-segmented ring. The ring undergoes significant out-of-plane motion.

This placement makes more sense upon watching the entire sequence and noticing how little the legs appear to move. Specifically, the left leg is stationary, and the right leg moves only slightly, back and forth perpendicular to the image plane. In this case the algorithm favours a simpler model which still adequately captures the visible motion.

The results on the "dancing" sequence are similar. Again our method finds more segments; in particular, the chest is divided into four segments instead of one. Interestingly, the algorithm learned interconnections between these chest segments, producing a nearly fully connected graph. This shows that the model does a good job of capturing the near-rigidity of the chest segments. However, it also suggests a limitation in the extent to which initial segments that are actually tightly coupled can be combined into a single segment during learning. The segments which show the most motion, the forearms and head, are each reasonably modeled by sticks extending from the main body.

As described earlier, the method of Kirk et al. is designed to work on 3D optical motion capture data, thus we trained it on the 3D Human dataset used in Section 5.6, as well as on the 3D feature locations that gave rise to the 2D Human dataset of Section 5.4. In the original paper, Kirk et al. focus on fitting their model to "calibration" sequences, in which the actor fully flexes each of his individual joints. Indeed, as shown in Figure 17 (bottom right), the method does a good job of recovering the structure from the range-of-motion sequence. (For comparison, the results of our method trained on the 2D projection of the same sequence are shown in Figure 10.) In contrast, on the other 3D Human sequence, which depicts walking and sitting rather than range-of-motion exercises, Kirk's method fares much worse (Figure 17 (bottom left); cf. our method, Figure 15).

6 Discussion

We have demonstrated a single coherent model that can learn the structures and motion of articulated skeletons. This model can be applied to a variety of structures, requiring no input beyond the observed feature trajectories, and a minimum of manually adjusted precision parameters.

Our model makes a number of contributions to the state of the art. First, it is based on optimizing a single global objective function, which details how all aspects of learning—grouping, connectivity, and parameter fitting—contribute to the overall quality of the model. Having this objective function permits iteration between updates of the structure and the parameters, allowing information obtained from one stage to assist learning in the other. Moreover, the value of the objective function proves useful for model selection, determining the optimal number of joints.


Fig. 13 Synthetic Ring structures learned during greedy merging, of which stage 8 is the best. In comparison to the ground-truth structure, shown in the lower-right, the learned model over-segments the data into 8 sticks rather than 5. However, since this involves splitting three of the true sticks in half, the learned model still provides a good fit to the data.

Also, the noise in our generative model plays an important role, allowing a degree of non-rigidity in the motion with respect to the learned skeleton. This not only allows a feature point to move in relation to its associated stick, but also permits complexity in the joints, as the stick endpoints joined at a vertex need not coincide exactly. In addition, we presented a method for quantitative comparison, based on imputing the locations of occluded observations, and were able to demonstrate that our model performs measurably better than single-body or multibody structure from motion.

Our model has some limitations. First, as illustrated in the "excavator" example (Section 5.2), the choice of a Gaussian prior for joint locations, while mathematically convenient, is not ideal, since it encourages the model to place joints in the middle of each stick rather than at its endpoints.



Fig. 14 Synthetic Ring Log-likelihood and Error. The sharp downturn in log-likelihood at stage 9 suggests selecting the structure learned during stage 8.

In many cases the effect of the prior is minor, and the problem does not arise. However, when the range of observed motion at a particular joint is quite limited, the effect of the prior can be more pronounced, moving the inferred joint away from its true location. Secondly, when starting from a poor initial segmentation, the greedy/EM learning algorithm can have difficulty identifying the correct number of sticks, due to the challenge of escaping from local minima. This problem occurred in the synthetic ring experiment (Section 5.5), and suggests that alternative optimization procedures be investigated.

To obtain good results, the model requires a certain density of features, in particular because the affinity matrix used for initialization (Yan and Pollefeys, 2006a, 2008) requires at least 4 points per stick. In addition, the flexibility of learned models is limited to the degrees of freedom visible in the training data; if a joint is not exercised, then the body parts it connects cannot be distinguished. Finally, our model requires that the observations arise from a scene containing roughly articulated figures; it would be a poor model of an octopus, for example.

An important direction for future study is the ability of learned skeletal structures to generalize: applying them to new motions not seen during training, and to related sequences, such as using a model trained on one giraffe to parse the motion of another.

Acknowledgements The motion capture data used in this project was provided by the Biomotion Lab, Queen's University, Canada, and the Carnegie Mellon University Motion Capture Database, http://mocap.cs.cmu.edu/ (created with funding from NSF EIA-0196217). We would like to acknowledge funding from the Natural Sciences and Engineering Research Council of Canada, the Canadian Institute for Advanced Research, and a Microsoft/LiveLabs-University Support Agreement.


Fig. 15 3D Human structures learned during greedy merging. Stage 15 has the highest log-likelihood.



Fig. 16 3D Human Log-likelihood.

References

K. Abdel-Malek, J. Arora, S. Beck, M. Bhatti, J. Carroll, T. Cook, S. Dasgupta, N. Grosland, R. Han, H. Kim, J. Lu, C. Swan, A. Williams, and J. Yang. Digital human modeling and virtual reality for FCS. Technical Report VSR-04.02, The Virtual Soldier Research (VSR) Program, Center for Computer-Aided Design, College of Engineering, The University of Iowa, October 2004.

M. Bray, P. Kohli, and P. Torr. PoseCut: Simultaneous segmentation and 3D pose estimation of humans using dynamic graph-cuts. In ECCV (2), pages 642-655, 2006.

J. Costeira and T. Kanade. A multi-body factorization method for motion analysis. In Image Understanding Workshop, pages 1013-1026, 1996.

J. P. Costeira and T. Kanade. A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3):159-179, September 1998.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.

P. F. Culverhouse and H. Wang. Robust motion segmentation by spectral clustering. In British Machine Vision Conference, pages 639-648, 2003.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1-38, 1977.

B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972-976, February 2007.

C. W. Gear. Multibody grouping from motion images. International Journal of Computer Vision, 29(2):133-150, 1998. ISSN 0920-5691. doi: http://dx.doi.org/10.1023/A:1008026310903.

Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996a.

Z. Ghahramani and G. E. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, University of Toronto, 1996b.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.

Amit Gruber and Yair Weiss. Factorization with uncertainty and missing data: Exploiting temporal coherence. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Scholkopf, editors, Advances in Neural Information Processing Systems. MIT Press, 2003. ISBN 0-262-20152-6.

Amit Gruber and Yair Weiss. Multibody factorization with uncertainty and missing data using the EM algorithm. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 707-714, 2004.

R. Hartley and A. Zisserman. Multiple View Geometry. Cambridge University Press, 2003.

L. Herda, P. Fua, R. Plankers, R. Boulic, and D. Thalmann. Using skeleton-based tracking to increase the reliability of optical motion capture. Human Movement Science Journal, 20(3):313-341, 2001.

G. Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14:201-211, 1973.

Adam G. Kirk, James F. O'Brien, and David A. Forsyth. Skeletal parameter estimation from optical motion capture data. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2005. ISBN 0-7695-2372-2.

Fig. 17 A comparison of results by related methods. Our method was trained on the "puppet" and "dancing" sequences from Yan and Pollefeys (2008) (top row). The Kirk et al. (2005) method was trained on 3D feature locations from the two datasets of human motion capture (bottom row).

R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.

A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (NIPS), 2002.

David A. Ross. Learning Probabilistic Models for Visual Motion. PhD thesis, University of Toronto, Toronto, Ontario, Canada, 2008.

David A. Ross and Richard S. Zemel. Learning parts-based representations of data. Journal of Machine Learning Research, 7:2369-2397, November 2006.

David A. Ross, Daniel Tarlow, and Richard S. Zemel. Learning articulated skeletons from motion. In Workshop on Dynamical Vision at ICCV, 2007.

David A. Ross, Daniel Tarlow, and Richard S. Zemel. Unsupervised learning of skeletons from motion. In D. Forsyth, P. Torr, and A. Zisserman, editors, Proceedings of the 10th European Conference on Computer Vision (ECCV 2008). Springer, 2008.


J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, August 2000.

Jianbo Shi and Carlo Tomasi. Good features to track. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 593-600, 1994.

M. C. Silaghi, R. Plankers, R. Boulic, P. Fua, and D. Thalmann. Local and global skeleton fitting techniques for optical motion capture. In Modelling and Motion Capture Techniques for Virtual Environments, Lecture Notes in Artificial Intelligence, volume 1537, pages 26-40. Springer, 1998.

C. Sminchisescu and B. Triggs. Estimating articulated human motion with covariance scaled sampling. International Journal of Robotics Research, 22(6):371-393, 2003.

Y. Song, L. Goncalves, and P. Perona. Unsupervised learning of human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7):814-827, July 2003.

Yang Song, Luis Goncalves, and Pietro Perona. Learning probabilistic structure for human motion detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 771-777. IEEE Computer Society, 2001. ISBN 0-7695-1272-0.

Leonid Taycher, John W. Fisher III, and Trevor Darrell. Recovering articulated model topology from observed rigid motion. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems (NIPS), pages 1311-1318. MIT Press, 2002.

C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9:137-154, 1992.

P. Tresadern and I. Reid. Articulated structure from motion by factorization. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 1110-1115, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2372-2. doi: http://dx.doi.org/10.1109/CVPR.2005.75.

Thomas Viklands. Algorithms for the Weighted Orthogonal Procrustes Problem and Other Least Squares Problems. PhD thesis, Umeå University, Umeå, Sweden, 2006.

Y. Weiss. Segmentation using eigenvectors: a unifying view. In Proceedings of the International Conference on Computer Vision (ICCV), 1999.

Jingyu Yan and Marc Pollefeys. Factorization-based approach to articulated motion recovery. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005a.

Jingyu Yan and Marc Pollefeys. Articulated motion segmentation using RANSAC with priors. In ICCV Workshop on Dynamical Vision, 2005b.

Jingyu Yan and Marc Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In Computer Vision - ECCV 2006, 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part III, 2006a.

Jingyu Yan and Marc Pollefeys. Automatic kinematic chain building from feature trajectories of articulated objects. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006b.

Jingyu Yan and Marc Pollefeys. A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):865-877, 2008. ISSN 0162-8828. doi: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.70739.