

International Journal of Computer Vision 38(2), 99–127, 2000
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

Optical Flow Constraints on Deformable Models with Applications to Face Tracking

DOUGLAS DECARLO
Department of Computer Science, and Center for Cognitive Science, Rutgers University, Piscataway, NJ 08854-8019, USA
[email protected]

DIMITRIS METAXAS
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104-6389, USA
[email protected]

Abstract. Optical flow provides a constraint on the motion of a deformable model. We derive and solve a dynamic system incorporating flow as a hard constraint, producing a model-based least-squares optical flow solution. Our solution also ensures the constraint remains satisfied when combined with edge information, which helps combat tracking error accumulation. Constraint enforcement can be relaxed using a Kalman filter, which permits controlled constraint violations based on the noise present in the optical flow information, and enables optical flow and edge information to be combined more robustly and efficiently. We apply this framework to the estimation of face shape and motion using a 3D deformable face model. This model uses a small number of parameters to describe a rich variety of face shapes and facial expressions. We present experiments in extracting the shape and motion of a face from image sequences which validate the accuracy of the method. They also demonstrate that our treatment of optical flow as a hard constraint, as well as our use of a Kalman filter to reconcile these constraints with the uncertainty in the optical flow, are vital for improving the performance of our system.

Keywords: face tracking, cue integration, constraints, deformable models

1. Introduction

The apparent motion of brightness in an image—the optical flow—constrains but does not necessarily determine the three-dimensional motion of an observed object. In reconstructing three-dimensional motion using optical flow and other ambiguous cues, combining separate solutions derived from different cues can only compromise among limited guesses. In contrast, adding optical flow constraints to disambiguate a problem of motion estimation can derive a good guess consistent with all the data. This paper describes such a constraint approach to optical flow within a deformable model framework (Metaxas and Terzopoulos, 1993; Pentland and Sclaroff, 1991; Terzopoulos et al., 1988) for shape and motion estimation. We show that this approach can greatly improve the ability to maintain accurate track of a moving object. For the applications here, we will be specifically investigating the tracking of human faces.

Image cues provide constraints which the estimated model should satisfy as much as possible. Typically, constraints from multiple cues are reconciled by statistically combining constraint solutions, so as to weight each source of information according to its reliability. This formulation treats data constraints as soft, in that


the formulation biases the system towards satisfying the constraint, but does not enforce the constraint after combination. One of the distinguishing features of our approach is that we treat optical flow as a hard constraint on the extracted motion of the model, which guarantees enforcement. By placing constraints from one cue onto the model during estimation, we limit the choices of parameter combinations available for solutions using other cues (which are included as soft constraints), thereby eliminating a portion of the search space. We claim that this simplifies the estimation process and leads to improved robustness by not only producing a lower dimensional estimation problem, but also by avoiding local minima.

The optical flow constraints are based on noisy data, which can lead to problems when using hard constraints (since the desired portion of the search space could be discarded due to noise). Previous approaches which used soft constraints did not encounter this problem, since a noisy constraint would simply be violated (and hence ignored) when combined with other information. But it is precisely this property which prevents soft constraints from limiting the search space in the first place (and hence loses the benefits of efficiency and robustness). Instead, we use an iterated extended Kalman filter to relax the optical flow constraint to allow for constraint violations which increase as uncertainty in the flow increases. In Section 4, we will be more precise about what is meant by relaxing hard constraints, as well as how the constraint relaxation takes place. Basically, the Kalman filter finds a middle ground between the hard and soft constraint solutions that is in harmony with the level of uncertainty in the hard constraint. This retains the beneficial property of limiting the search space while being robust to noisy constraints.

Our approach can be summarized as follows. Within a deformable model framework, we start with a model-based version of the optical flow constraint equation, which constrains the velocities of the motion parameters of the model. In the theory of dynamic systems (Shabana, 1989), velocity constraints such as these are called non-holonomic. The velocities of the motion parameters are already accounted for as resulting from the application of edge-based forces; finding the equilibrium resulting from these forces amounts to a straightforward optimization problem. With the addition of the optical flow constraints, a constrained optimization problem results, which is solved using Lagrange multipliers. The constrained solution contains two kinds of forces. One provides the standard linear least-squares model-based solution to the optical flow (Li et al., 1993). The second is a constraint enforcement term which ensures the optical flow constraint remains satisfied when combined with edge forces. The presence of the constraint enforcement term yields a profitable combination of the optical flow solution with the edge forces. We use a Kalman filter to realize this combination in a way that accounts for the uncertainty in the flow. Problems with tracking error accumulation are alleviated using the edge forces, which now keep the model aligned with the image without a statistically relevant violation of the optical flow constraint.

The applications we address here concentrate on the problem of estimating the shape and motion of a human face. This problem has been widely addressed in recent research, having applications in human-machine interaction. Face tracking is a particularly natural testbed for our research for two reasons. The actual shape and motion of faces makes edge and optical flow information easy to use and advantageous to combine; and the abundance of data describing human face shape (Farkas, 1994) facilitates the development of three-dimensional models of faces.

We have constructed a model of the human face which captures the relevant aspects of face shape, motion and appearance. By using data from face anthropometry studies (Farkas, 1994), the range of shapes produced captures the variability seen in the shape and appearance of faces across the human population. The design of the facial motion model employs aspects of the Facial Action Coding System (FACS) (Ekman and Friesen, 1978), which is a manual coding method for describing facial movements in terms of "action units". Our model separately encodes information regarding an individual's appearance from their facial motions and expressions. Shape parameters describe unchanging features of an observed object and capture variations in shape across objects in a target class. Motion parameters describe how an observed object changes during a tracking session. This separation produces an easier tracking problem by requiring a smaller description of object state to be estimated in each frame. It also allows information to be applied more precisely, since optical flow information only registers changes in motion parameters, while edge information figures in both shape and motion parameter estimation.


1.1. Outline

After a review of related work, in Section 3 we present some preliminaries on our deformable model framework. Then, Section 4 describes how the optical flow is treated as a hard constraint in this framework, and how a Kalman filter is used to relax this constraint to account for uncertainty. We then present a series of experiments designed to assess the generality of our approach and its quantitative validity in Section 5. These experiments extract the shape of the face, and track its motion—even in the presence of large rotations and self-occlusion. They demonstrate that our treatment of optical flow as a hard constraint, as well as our use of a Kalman filter to reconcile these constraints with the uncertainty in the optical flow, are vital for improving the performance of our system.

2. Related Work

There is a wide variety of work that relates to what is presented here, concerning both the underlying techniques used and the application of tracking a human face. Virtually all work on face tracking takes advantage of the constrained situation: instead of using a generic tracking framework which views the observed face as an arbitrarily deforming object, a model-based approach is favored, which incorporates knowledge about facial deformations, motions and appearance. This facial model is used in concert with a number of model-based techniques.

Model-based edges and features: A prevalent model-based approach for tracking and shape estimation uses features and edges in a sequence of images to track an object (Lanitis et al., 1997; Lowe, 1991; Metaxas and Terzopoulos, 1993; Terzopoulos et al., 1988; Yuille et al., 1992). This requires aligning the model features with the data, and is typically formulated as an optimization problem where the parameter combination is sought which yields the best alignment. The alignment can be performed using either a 2D appearance model (Terzopoulos et al., 1988; Yuille et al., 1992) or on 2D features computed from a 3D model (Lowe, 1991; Metaxas and Terzopoulos, 1993; Terzopoulos and Waters, 1993). This optimization problem tends to be quite difficult, however, especially as the deviation between the model and data becomes large.

Model-based optical flow: Instead of computing an unconstrained flow field (a grid of arrows), a model-based approach explains the optical flow information in terms of motion parameters of the model (Adiv, 1985; Bergen et al., 1992; Choi et al., 1994; Horn and Weldon, 1988; Li et al., 1993; Negahdaripour and Horn, 1987; Netravali and Salz, 1985). While the problem is non-linear, these frameworks can use either a single step linear least-squares solution (Choi et al., 1994; Li et al., 1993; Netravali and Salz, 1985), or an iterative least-squares solution (Adiv, 1985; Bergen et al., 1992; Horn and Weldon, 1988; Negahdaripour and Horn, 1987). The motion model can be a 2D model of image motion (Bergen et al., 1992; Black and Yacoob, 1995) or a 3D model (rigid or non-rigid) of object motion (Bergen et al., 1992; Choi et al., 1994; Li et al., 1993) (along with a camera model to relate to the images).

It is also possible to compute an unconstrained optical flow field using standard techniques, and fit a parametric motion model to the resulting field (Basu et al., 1996; Essa and Pentland, 1997). The primary downside to this approach is that with the computation of an unconstrained flow field come the artifacts resulting from smoothness assumptions and the problem of finding motion discontinuities. The model-based methods above take this information from the model (instead of assuming it).

Preventing tracking drift: In this paper, we advocate the use of both optical flow and edges for face tracking. Face tracking methods using only optical flow (Basu et al., 1996; Black and Yacoob, 1995; Essa and Pentland, 1997) will suffer from tracking drift, since error will accumulate when only velocity information is used. As a result, long sequences are not tracked successfully. However, in the context of facial expression recognition (where the sequences are quite short), this might not be a serious problem.

In previous work (DeCarlo and Metaxas, 1996), we presented a framework for combining optical flow and edge information to avoid the problems with tracking drift. Aside from this, Li et al. (1993) is the other notable exception of a framework which uses model-based optical flow yet avoids drift. In that work, a render-feedback loop was used to prevent drift by locally searching for the best set of parameters which aligns the rendered model with the image.

Model-based constraints and Kalman filtering: The treatment of optical flow as a hard constraint on


the motion of the model in DeCarlo and Metaxas (1996) not only helps prevent tracking drift, but also makes the system more robust and efficient when coupled with a Kalman filter to handle uncertainty in the constraint.

Before DeCarlo and Metaxas (1996), hard constraints were used in estimation only as a modeling tool, where an articulated object was modeled as a set of rigid pieces held together by geometric constraints (which model the joints) (Hel-Or and Werman, 1996; Metaxas and Terzopoulos, 1993; Sharma et al., 1998). A method known as constraint fusion (Hel-Or and Werman, 1996; Sharma et al., 1998) combines constraints with measurement data to account for the fact that the joint configurations might not be known in advance. This fusion was performed using a Kalman filter in Hel-Or and Werman (1996), and ends up being closely related to the physics-based constraint method in Metaxas and Terzopoulos (1993), which adds constraint forces to data forces. More efficient means for dealing with such constraints have also been investigated (Fua and Brechbühler, 1997; Metaxas and Terzopoulos, 1993).

The use of soft constraints is much more common for fusing information. In Fua and Leclerc (1995), stereo and shading information are combined using a soft constraint (weighted terms from each source are added into the system energy). A physics-based sensor fusion method combining range and intensity data was presented in Zhang and Wallace (1993). Using a Kalman filter (Ayache and Faugeras, 1988; Kriegman et al., 1989) or Bayesian methods (Durrant-Whyte, 1987) for fusion combines solutions in a similar way. Aside from this, Kalman filtering has become a standard tool for estimation in dealing with noisy data (Bar-Shalom and Fortmann, 1988; Gelb, 1974; Maybeck, 1979; Metaxas, 1996; Pentland and Horowitz, 1991).

Face tracking: There is a vast body of work on tracking the human face, with applications ranging from motion capture to human-computer interaction. Among them, there are a number which bear similarity in some respect to the work presented in this paper.

Several 2D face models based on splines or deformable templates (Lanitis et al., 1997; Moses et al., 1995; Yuille et al., 1992) have been developed which track the contours of a face in an image sequence. In addition to motion, these methods provide rough 2D information about the observed individual's appearance. In Black and Yacoob (1995), the optical

flow field is parameterized based on the motion of the face (under projection) using a set of locally defined 2D motion regions. The extracted parameters are used for expression recognition.

In Essa and Pentland (1997) and Terzopoulos and Waters (1993), a physics-based 3D face model (with many degrees of freedom) is used, where motion is measured in terms of muscle activations. Edge forces from snakes are used in Terzopoulos and Waters (1993), while in Essa and Pentland (1997), activations are determined from an optical flow field and are later used for expression recognition. In Basu et al. (1996), a rigid ellipsoid model of the head is used to estimate motion parameters from a flow field.

Addressing the problem of image coding, Choi et al. (1994) and Li et al. (1993) estimate face motion using a simple 3D model and a model-based least-squares solution to the optical flow constraint equation. Li et al. (1993) improves performance using motion prediction, and avoids tracking drift using a render-feedback loop.

Our system, first presented in DeCarlo and Metaxas (1996), uses a combination of model-based optical flow and model-based edge tracking to estimate the shape and motion of a face. The flow and edges are combined by treating the flow as a hard constraint on the motion of the model. This combination prevents tracking error accumulation, as with the render-feedback loop in Li et al. (1993), although our use of a hard constraint produces a much more robust solution by making the model-based edge tracking problem easier to solve.

In addition to this, none of the previous work makes a serious attempt at extracting a detailed 3D shape description of the face from an image sequence. At best, only a rough shape description is derived. Furthermore, almost all of these approaches fail under large head rotations due to the use of a 2D model or the inability to handle self-occlusion.

3. Deformable Model Dynamics

Deformable models (Metaxas and Terzopoulos, 1993; Pentland and Sclaroff, 1991; Terzopoulos et al., 1988) are parameterized shapes that deform due to forces according to physical laws. For vision applications, physics provides a useful analogy for treating shape estimation (Metaxas and Terzopoulos, 1993), where forces are determined from visual cues such as edges in an image. The deformations that result produce a


shape that agrees with the data. The use of physics also makes available additional mathematical tools; for example, constraint techniques from physics will be used in Section 4 to incorporate the optical flow information. In this section, we review deformable models as presented in Metaxas and Terzopoulos (1993) and briefly describe our face model.

A three-dimensional deformable model x maps a domain Ω (of surface coordinates) to a set of points in R^3 which form the model's surface. It is parameterized by a vector of values q, meaning that changes in q register as geometric deformations of the surface. A particular point on the surface is written as x(q; u), with u ∈ Ω being used to identify a specific surface location. (Note that the dependency of x on q is often omitted, for reasons of conciseness.) The goal of a shape and motion estimation process is to recover the value of q for each image in a sequence of frames. We now present our deformable face model; following this, the remaining discussion will refer to this face model, although the development of the techniques will apply generally.

3.1. A Deformable Face Model

Across the human population, the faces of individuals exhibit a great deal of variation in their appearance, but they all still have a good deal of structure in common. A similar statement can be made about facial motion—while it is complex and non-rigid, the motions are still fairly constrained. We take advantage of this commonality in the construction of our model of the human face. Here, we briefly describe the face model used in the experiments in this paper.

Our deformable face model is a 3D polygon mesh, shown smoothly shaded in Fig. 1(a) and wireframe in

Figure 1. The deformable face model (in rest position).

(b) in its default configuration. The model is realized using a set of parameterized deformations (which depend on q) applied to this polygon mesh. The parameterization of this face model was constructed by hand; details concerning its construction are in Appendix B. Appendix C describes a system of anthropometric measurements of the face; we use data from published tables of these measurements to help bias the model away from producing unlikely faces during estimation. Our model does have limitations in its coverage, however. There are no means for representing large amounts of facial hair (such as a beard or mustache) or eyeglasses. Furthermore, there are many facial motions that cannot be expressed accurately, such as many of the lip deformations produced during speech. Effects of these limitations on system performance are discussed in Section 5.6. Better and easier methods of data acquisition are becoming available, and are making it possible to build a model constructed from examples (Kaucic and Blake, 1998; Cootes et al., 1998). It's likely that more automatic methods of model construction (using examples) will become the favored approach, as it is rather difficult to obtain a fully developed model of the face by hand.

3.2. Separation of Shape and Motion

In some applications (including face tracking), to distinguish the processes of shape estimation and motion tracking, the parameters in q are rearranged and separated into q_b—a static quantity—which describes the basic shape of the object, and into q_m—a dynamic quantity—which describes its motion (both rigid and non-rigid), so that q = (q_b^T, q_m^T)^T. Regarding human faces, q_b describes the unchanging features of an observed face and captures variations in appearance across the human population, while q_m describes how an observed face changes during a tracking session (head position, as well as facial displays and expressions). This separation produces an easier tracking problem by requiring a smaller description of object state to be estimated in each frame. This division is often built into face models (Black and Yacoob, 1995; Li et al., 1993; Moses et al., 1995; Terzopoulos and Waters, 1993) to simplify model construction or estimation, while Reynard et al. (1996) use this separation to permit learning the variability of motions for a class of objects. Note that there is no guarantee that the shape and motion of some class of objects is separable; this is a simplifying assumption that we make. For human faces, this separation is quite reasonable, and results in


the changes in q_b tending to zero as the shape of the observed object is established. Once this occurs, fitting need only continue for q_m. This suggests that including as many parameters as possible in q_b makes long-term estimation more efficient.

The model x is realized by applying deformation functions to an underlying shape s. For this paper, s is the polygon mesh in Fig. 1, and Ω is an index set used to refer to its vertices (when applications require it, we add additional structure to Ω to allow references to any point on its surface, instead of just the vertices). In other deformable model work (Metaxas and Terzopoulos, 1993), s is a solid primitive such as an ellipsoid given by its explicit parametric equation, with its domain Ω being an appropriate rectangle in R^2.

As with the parameters, the deformations applied to s are split into two separate deformation functions—one for shape (T_b) and one for motion (T_m)—as demonstrated in Fig. 2. These deformation functions (which depend on the parameters q) map R^3 to R^3. For faces, the shape deformation T_b is applied to the underlying polygon mesh first (since facial motion can be seen as deviations away from a particular individual's face), so that:

x(q; u) = T_m(q_m; T_b(q_b; s(u)))    (1)

The shape deformation T_b uses the parameters q_b to deform the underlying shape s. For faces, applying this deformation alone will produce a particular individual's face in rest position. On top of this is the motion deformation T_m with parameters q_m, which includes a rigid translation and rotation, as well as non-rigid deformations (such as raising eyebrows, frowning, smiling, and opening the mouth, as shown on the left of Fig. 2).

Figure 2. Example parameterized deformations of the face model (with separate shape and motion parameters).

Each of these deformations can be defined using a series of composed functions, allowing a more modular design (see Appendix A).
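To make (1) concrete, the following sketch (in Python with numpy; not from the original paper) composes a toy shape deformation and a toy motion deformation over a small mesh. The specific deformations here (a per-axis scaling standing in for T_b and a rigid translation standing in for T_m) are illustrative assumptions; the actual face parameterization is the hand-constructed one detailed in Appendix B.

```python
import numpy as np

# Toy underlying shape s: one 3D vertex per row of a small mesh.
s = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.5]])

def T_b(q_b, pts):
    # Hypothetical shape deformation: per-axis scaling by q_b.
    return pts * q_b

def T_m(q_m, pts):
    # Hypothetical motion deformation: rigid translation by q_m.
    return pts + q_m

# Eq. (1): x(q; u) = T_m(q_m; T_b(q_b; s(u))), applied to every vertex.
q_b = np.array([1.1, 0.9, 1.0])   # shape parameters
q_m = np.array([0.0, 0.0, 5.0])   # motion parameters
x = T_m(q_m, T_b(q_b, s))
print(x)
```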

3.3. Kinematics

The kinematics of the model can be determined in terms of the parameter velocities q̇. As the shape changes, the velocity at a point u on the model is given by:

\dot{x}(u) = L(q; u) \dot{q}    (2)

where L = ∂x/∂q is the model Jacobian (Metaxas and Terzopoulos, 1993). For cases where x is defined using a sequence of deformation functions, the Jacobian can be computed using the chain rule as in Appendix A. We also partition the Jacobian (as we did with the parameters) into blocks corresponding to q_b and q_m as L = [L_b  L_m]. A geometric interpretation for L(u) comes from viewing each column of L as corresponding to a particular parameter in q. Each column is a three-dimensional vector which "points" in the direction that x(u) moves as that parameter is increased. When considered over the entire model (over Ω), they form vector fields, which are shown in Fig. 3 for particular motion parameters of our face model.
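Since L is simply the Jacobian of the model with respect to its parameters, it can also be approximated numerically when an analytic chain-rule form (as in Appendix A) is unavailable. The sketch below, a hypothetical illustration rather than the paper's implementation, builds L by finite differences for a toy model and evaluates (2):

```python
import numpy as np

def numerical_jacobian(x_fn, q, u, eps=1e-6):
    # Finite-difference approximation of L = dx/dq at surface point u.
    # Column k "points" in the direction x(u) moves as q[k] increases.
    x0 = x_fn(q, u)
    L = np.zeros((3, len(q)))
    for k in range(len(q)):
        dq = np.zeros_like(q)
        dq[k] = eps
        L[:, k] = (x_fn(q + dq, u) - x0) / eps
    return L

# Toy model: uniform scale q[0] plus a z-offset q[1].
x_fn = lambda q, u: q[0] * u + np.array([0.0, 0.0, q[1]])

u = np.array([1.0, 2.0, 0.5])
q = np.array([1.0, 3.0])
L = numerical_jacobian(x_fn, q, u)

qdot = np.array([0.1, -0.2])
xdot = L @ qdot   # Eq. (2): point velocity induced by the parameter velocities
print(xdot)
```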

3.4. Perspective Projection of the Model

When modeling an object viewed in images, x needs to include a camera projection, resulting in a two-dimensional model (called x_p), which is projected flat from the original three-dimensional model. Under perspective projection (with a camera having focal length f), the point x(u) = (x, y, z)^T projects to the image point x_p(u) = (f/z)(x, y)^T.

Figure 3. Sample vector fields for various motion parameters.

The velocities of model points projected onto the image plane, ẋ_p, can be found in terms of ẋ. The Jacobian L_p = ∂x_p/∂q is given by:

\dot{x}_p(u) = \frac{\partial x_p}{\partial x} \dot{x}(u) = \left( \frac{\partial x_p}{\partial x} L(q; u) \right) \dot{q} = L_p(q; u) \dot{q}    (3)

where

\frac{\partial x_p}{\partial x} = \begin{bmatrix} f/z & 0 & -f x/z^2 \\ 0 & f/z & -f y/z^2 \end{bmatrix}    (4)

The matrix in (4) projects the columns of L (which are three-dimensional vectors) onto the image plane. As with L, we partition L_p as [L_bp  L_mp]. In fact, the vector fields in Fig. 3 are just renderings of L_mp.
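A minimal numerical sketch of (3) and (4) follows; the toy model Jacobian L and the focal length are illustrative assumptions. It also checks the projected Jacobian against a finite-difference derivative of the projected point:

```python
import numpy as np

def proj_jacobian(x, f):
    # The 2x3 matrix dx_p/dx of Eq. (4) at 3D point x = (x, y, z).
    X, Y, Z = x
    return np.array([[f / Z, 0.0,   -f * X / Z**2],
                     [0.0,   f / Z, -f * Y / Z**2]])

def project(x, f):
    # Perspective projection x_p = (f/z)(x, y).
    return f * x[:2] / x[2]

# Eq. (3): L_p = (dx_p/dx) L projects the columns of L onto the image.
x = np.array([0.2, -0.1, 5.0])
f = 1.0
L = np.array([[1.0, 0.0],          # toy model Jacobian (3 x dim q)
              [0.0, 1.0],
              [0.0, 0.5]])
L_p = proj_jacobian(x, f) @ L      # 2 x dim q

# Sanity check against a finite difference of the projected point.
eps = 1e-6
d = (project(x + eps * L[:, 0], f) - project(x, f)) / eps
assert np.allclose(d, L_p[:, 0], atol=1e-4)
```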

3.5. Estimation Using Dynamics

The models defined earlier are useful for applications such as shape and motion estimation when used in a physics-based framework (Metaxas and Terzopoulos, 1993). These techniques are a form of optimization whereby the deviation between the model and the data is minimized. The optimization is performed by integrating differential equations derived from the Euler-Lagrange equations of motion. In a typical vision application, the equations of motion are simplified to omit the mass term, which produces a model free of inertia. From an optimization point of view, this has the desirable property that the model state no longer changes once all forces vanish or equilibrate. In addition to

this, the damping matrix is set to be the identity and the stiffness term is omitted, resulting in the following simplified dynamic equations of motion:

\dot{q} = f_q    (5)

where the applied forces f_q are computed from three-dimensional forces f_3D and two-dimensional image forces f_image as:

f_q = \sum_j \left( L(u_j)^T f_{3D}(u_j) + L_p(u_j)^T f_{image}(u_j) \right)    (6)

The distribution of forces on the model is based in part on forces computed from the edges of an input image. We compute the image forces f_image(u_j) using the intensity gradient, as in Metaxas and Terzopoulos (1993) and Terzopoulos et al. (1988). Using this method, the image force applied to model point u_j (which corresponds by projection to pixel j) is the product of the intensity gradient and a weighting function (with range [0, 1]) which is a "probability" that the current model configuration would produce an edge visible near pixel j. As in Terzopoulos et al. (1988), we use a thresholded version of this weighting function—details on how potentially visible edges are determined for our face model are provided in Appendix D. Note that these image forces depend on q not only through L, but also in the determination of likely visible edges (the weighting function). Given an adequate model initialization, these forces will align features on the model with image features, thereby determining the object parameters. Using L and L_p, the applied forces are converted to forces which act on q


and are integrated over the model to find the total parameter force f_q. The dynamic system in (5) is solved by integrating over time, using standard (explicit) differential equation integration techniques (we use an Euler step):

q(t + 1) = q(t) + \dot{q}(t) \Delta t    (7)

The process used to initialize the system to determine the value of q(0) is described in Section 5. The next section describes how this framework is augmented to accommodate optical flow information.
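The fitting process of (5)–(7) thus reduces to repeated Euler steps. The sketch below uses a stand-in force that pulls q toward a fixed target (in place of the edge-derived forces of (6)); the model state stops changing once the force vanishes, mirroring the equilibrium behavior described above:

```python
import numpy as np

def fit_step(q, f_q_fn, dt=1.0):
    # Eqs. (5) and (7): qdot = f_q(q), then one Euler step.
    return q + f_q_fn(q) * dt

# Stand-in force pulling q toward a target (in place of Eq. (6)).
target = np.array([2.0, -1.0])
f_q_fn = lambda q: 0.5 * (target - q)

q = np.zeros(2)
for _ in range(50):       # integrate until the forces equilibrate
    q = fit_step(q, f_q_fn, dt=0.5)
print(q)                  # converges to the target
```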

4. Optical Flow Constraints

In the following, the use of hard optical flow constraints on deformable models is presented. The optical flow constraint equation, which expresses a constraint on the optical flow velocities, is reformulated as a system of dynamic constraints that constrain q̇, the velocity of the deformable model. The resulting information will be combined with the data forces f_q while leaving the constraint satisfied. The optical flow constraint equation is used at a number of select locations in the image to constrain the motion of the model, instead of explicitly computing an unconstrained optical flow field on the entire image. We will see below how the use of this constraint is related to model-based optical flow methods (which are also known as "direct methods" since they also do not explicitly compute a flow field). The use of optical flow information greatly improves the estimation of q_m, the motion parameters of the deformable model.

In deformable model frameworks, estimation is accomplished through an energy optimization process via equations of motion. Hard constraints impose a global requirement on this dynamic system whose solution is enforced at each iteration (either exactly or, if the system is overconstrained, in a least-squares sense), while soft constraints (such as spring forces) only bias the behavior of the system toward a certain goal (often involving the system energy). We will discuss how hard constraints provide a means for results obtained from one data source to guide the computation of the solution to another, potentially more difficult problem. This decreases the cost of this further computation and increases the likelihood that its solution will closely reflect the true state of the observed object.

Constraints which depend only on q are called holonomic constraints, and constrain the model to a set of allowable positions. They have been used in a deformable model formulation, for instance, to add point-to-point attachment constraints between the parts of an articulated object (Hel-Or and Werman, 1996; Metaxas and Terzopoulos, 1993; Sharma et al., 1998). A holonomic constraint C has the general form

C(q, t) = 0    (8)

Non-holonomic constraints additionally depend on the velocity of the parameters, q̇, and constrain the motion of the model. A non-holonomic constraint C has the general form

C(q, \dot{q}, t) = 0    (9)

In the following, we show how the optical flow constraints take this form and can be incorporated into the dynamic system using Lagrange multipliers. This results in hard constraints, since the constraints will be enforced exactly (or in a least-squares way). Following this, we will describe how this constrained system is solved, and how a Kalman filter is used to relax these hard constraints.

4.1. Optical Flow Constraints

Given the assumption of brightness constancy of small regions in an image, the optical flow constraint equation (Horn, 1986) at a pixel i in the image I takes the form:

\nabla I_i \begin{bmatrix} u_i \\ v_i \end{bmatrix} + I_{t_i} = 0    (10)

where ∇I = [I_x  I_y] are the spatial derivatives and I_t is the temporal derivative of the image intensity. u_i and v_i are the components of the optical flow velocities.

For a model under perspective projection, there exists a unique point u_i on the model that corresponds to the pixel i (provided it is not on an occluding boundary). The optical flow constraint equation can now be rewritten in terms of q̇ with this in mind. This rewriting uses an identification of the image velocity (u_i, v_i) at pixel i with its corresponding model velocity ẋ_p(u_i) from (3):

\begin{bmatrix} u_i \\ v_i \end{bmatrix} = \dot{x}_p(u_i) = L_{mp}(u_i) \dot{q}_m    (11)


Direct use of the optical flow information only provides motion information, and as a result, only q_m is affected. To clarify this: any observed motion is caused by dynamic changes in the true value of q_m. The true value of q_b is a static quantity—the meaning of q_b comes from the analogy of physics, where the value of q_b improves over the course of fitting (over time) as more data becomes available.

The non-holonomic constraint equation for the optical flow at a pixel i in the image can be found by rewriting the optical flow constraint Eq. (10) using (11):

\nabla I_i L_{mp}(u_i) \dot{q}_m + I_{t_i} = 0    (12)

Instead of using this constraint at every pixel in the image, n pixels are selected from the input image (where n \gg \dim q_m). Appendix E describes the criterion used to choose these particular points, and also describes how some of the known difficulties in the computation of the optical flow are avoided in this model-based approach.

For the n chosen pixels in the image, the system of equations based on (12) becomes:

\begin{bmatrix} \nabla I_1 L_{mp}(u_1) \\ \vdots \\ \nabla I_n L_{mp}(u_n) \end{bmatrix} \dot{q}_m + \begin{bmatrix} I_{t_1} \\ \vdots \\ I_{t_n} \end{bmatrix} = 0    (13)

which can be written compactly as

B \dot{q}_m + I_t = 0    (14)

This equation is simply a model-based version of the optical flow constraint equation (Adiv, 1985; Bergen et al., 1992; Choi et al., 1994; Horn and Weldon, 1988; Li et al., 1993; Negahdaripour and Horn, 1987; Netravali and Salz, 1985). Instead of solving it on its own, however, it is used as a hard constraint on the motion of the model.
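Assembling (13) and (14) is mechanical once the image derivatives and projected Jacobians at the n chosen pixels are available. A sketch with toy values (the dimensions and random data are purely illustrative, standing in for the quantities selected by the criterion of Appendix E):

```python
import numpy as np

def flow_constraint_system(grad_I, I_t, L_mp):
    # Stack the model-based flow constraints of Eqs. (13)/(14).
    #   grad_I : (n, 2)    spatial gradients [Ix, Iy] at the n pixels
    #   I_t    : (n,)      temporal derivatives at those pixels
    #   L_mp   : (n, 2, m) projected Jacobian L_mp(u_i) per pixel
    # Row i of B is  grad_I[i] @ L_mp[i].
    B = np.einsum('ij,ijk->ik', grad_I, L_mp)
    return B, I_t

# Toy numbers: n = 4 pixels, m = 2 motion parameters.
rng = np.random.default_rng(0)
B, I_t = flow_constraint_system(rng.normal(size=(4, 2)),
                                rng.normal(size=4),
                                rng.normal(size=(4, 2, 2)))

# Model-based least-squares flow solution: qdot_m = -B^+ I_t.
qdot_m = -np.linalg.pinv(B) @ I_t
```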

4.2. Solving the Dynamic System

Constraining the equations of motion with the model-based flow equation results in the constrained system:

\dot{q} = f_q \quad \text{subject to} \quad B \dot{q}_m + I_t = 0    (15)

This is solved using the method of Lagrange multipliers (Shabana, 1989; Strang, 1988). The Lagrange multiplier technique adds additional degrees of freedom (one for each degree of constraint), to form a larger, unconstrained system (with the constraints "built in"). The initial dynamic equation of motion (5), now split into two parts corresponding to q_b and q_m, is modified by adding the constraint force f_c to q̇_m:

\dot{q}_b = f_{q_b}, \quad \dot{q}_m = f_{q_m} + f_c    (16)

Adding a particular value of f_c will ensure the constraint equation is satisfied, in part by cancelling the components of f_{q_m} that violate the constraint. This constraint force is determined using the Lagrange multiplier λ as:

f_c = -B^T \lambda    (17)

We can combine Eqs. (14), (16) and (17) to form:

B B^T \lambda = B f_{q_m} + I_t    (18)

and can now determine the constraint force (by multiplying (18) on the left by B^+, the pseudo-inverse (Strang, 1988) of B):

f_c = -B^+ (B f_{q_m} + I_t) = -B^+ I_t - B^+ B f_{q_m}    (19)

which results in the unconstrained dynamic system (which is solved iteratively since f_q is highly non-linear):

\dot{q}_b = f_{q_b}, \quad \dot{q}_m = -B^+ I_t + [1 - B^+ B] f_{q_m}    (20)

The first term of q̇_m in (20), -B^+ I_t, is a model-based linear least-squares solution to the optical flow constraint equation (Li et al., 1993). A model-based solution to the optical flow constraint equations attributes the flow in the image to motion parameters in the model. This works as follows. A change to any motion parameter induces a characteristic motion field in the image. Figure 3 illustrates these vector fields for particular motion parameters of our face model. The linear combination of the fields L_{mp}(u) using the weights -B^+ I_t best satisfies (14) at the sampled pixels in a least-squares sense. The pseudo-inverse of B is determined by computing its singular value decomposition (Press et al., 1992; Strang, 1988). Each motion parameter in q will have a corresponding singular value—a singular value near zero is interpreted as a lack of motion in that particular parameter (although this could also be caused by the failure to gather enough information from the images to sample the motion). In solving (20) iteratively,


f_q is re-evaluated upon each iteration, while B is not. (We have empirically found that re-evaluating B does not change the performance of our system when high frame-rate cameras are used. This is not surprising, as the dependency of B on q is very small for the applications here, when given moderately sized changes in q.)

The second term in (20) contains the edge forces f_{q_m} scaled by the matrix [1 - B^+ B]. This projection matrix cancels the component of f_{q_m} that violates the constraint (14) on q̇_m. Unlike the hard optical flow constraint, the edge forces act as a soft constraint, but still prevent errors in q_m from accumulating, since the uncancelled component of the edge forces can further adjust the solution.
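The following sketch implements (20) for the motion parameters, with B built from toy data. B is made rank-deficient so that the flow constraint leaves some directions of q̇_m free; the assertion confirms that adding the projected edge forces does not disturb the least-squares satisfaction of the constraint:

```python
import numpy as np

def constrained_qdot_m(B, I_t, f_qm):
    # Eq. (20): qdot_m = -B^+ I_t + (1 - B^+ B) f_qm.
    B_pinv = np.linalg.pinv(B)              # via SVD, as in the text
    P = np.eye(B.shape[1]) - B_pinv @ B     # cancels constraint violations
    return -B_pinv @ I_t + P @ f_qm

rng = np.random.default_rng(1)
# Rank-deficient B: the flow leaves one parameter direction unconstrained.
B = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 3))
I_t = rng.normal(size=6)
f_qm = rng.normal(size=3)

qdot_m = constrained_qdot_m(B, I_t, f_qm)
# The edge term does not disturb the (least-squares) constraint residual:
r_combined = B @ qdot_m + I_t
r_flow_only = B @ (-np.linalg.pinv(B) @ I_t) + I_t
assert np.allclose(r_combined, r_flow_only)
```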

As it stands, however, this method will not be robust since B depends on noisy data. In DeCarlo and Metaxas (1996), an ad hoc method was used to relax the hard constraint by replacing the projection matrix in (20) with [1 - B^+ W B], where W is a diagonal matrix whose entries (in [0, 1]) represent the certainty of information provided by a particular pixel. A more principled approach to relaxing the constraint is described in the next section, where (20) is reformulated using a Kalman filter. But first, it's worth taking a closer look at what the hard constraint is actually doing, and what is meant by relaxing the hard constraint.

4.3. Discussion

Consider the problem of combining together the two sources of information (edges and flow) to compute q̇_m. Independently, the edges and flow produce two different solutions: q̇_m = f_{q_m} and q̇_m = -B^+ I_t. In the following discussion, we consider the different formulations that result depending on whether soft, hard, or relaxed hard constraints are used to combine the solutions.¹

Soft constraints are the typical means for combining these solutions. Statistical methods for combination weight these together (using matrices W_{flow} and W_{edge}) according to their reliability:

\dot{q}_m^{(soft)} = -W_{flow} B^+ I_t + W_{edge} f_{q_m}    (21)

These weighting matrices are typically formed from the covariance matrices for the individual solutions (Bryson and Ho, 1975; Durrant-Whyte, 1987). In non-linear situations, (21) is solved iteratively. This is also comparable to using a Kalman filter to combine sources together (or an iterated extended Kalman filter in the non-linear case). In a deformable model framework, this approach is achieved by adding together weighted combinations of forces (Terzopoulos, 1993; Terzopoulos et al., 1988) or energies (Fua and Leclerc, 1995) derived from data sources. The dynamic system produces a weighted least squares estimate similar to (21) as it converges.

With hard constraints, instead of combining solutions as above, we solve a constrained system: the equation originally used to solve for flow, B q̇_m + I_t = 0, will be used as a hard constraint on the solution of q̇_m = f_{q_m}. What results is the following:

\dot{q}_m^{(hard)} = -W_{flow} B^+ I_t + W_{edge} [1 - B^+ B] f_{q_m}    (22)

This solution gives precedence to the flow solution in an interesting way. There is now a projection matrix [1 - B^+ B] which cancels the component of the edge solution which violates the constraint before the solutions are combined together. This makes a substantial difference when (22) must be solved iteratively when the model-edge alignment problem (f_{q_m}) is non-linear. When solved alone, it is significantly more computationally expensive than solving for the flow. In this case, however, the projection matrix cancels out a portion of the search space for the model-edge alignment that the constraint makes impossible. This results in a lower dimensional problem in solving the model-edge alignment, and can improve efficiency, as well as decrease the chances of reaching a local minimum. Alternatively, the edge solution could be used as a hard constraint on the flow; but this would lose the efficiency benefits, as the edge solution is much more expensive to solve, so that its projection matrix would not be available in time to efficiently guide the flow solution.

In practice, the hard constraint depends on noisy data, in which case it is overly restrictive to fully cancel any component of the edge forces. Furthermore, during complex motions, it is not unreasonable for the flow solution to be non-degenerate, so that the projection matrix is zero (so that it cancels everything). One way of dealing with this is to relax the hard constraint: to permit small violations of the hard constraint where it is noisy while still projecting away much of the edge forces which violate it.

The ad hoc method that accomplished this in DeCarlo and Metaxas (1996) (mentioned earlier) was to replace the projection matrix in (22) with [1 - B^+ W_{constraint} B], using the diagonal matrix W_{constraint}.


This results in:

\dot{q}_m^{(relaxed)} = -W_{flow} B^+ I_t + W_{edge} [1 - B^+ W_{constraint} B] f_{q_m}    (23)

The best illustration of what this relaxed solution is doing comes from the special case when W_{constraint} has equal entries on its diagonal, so that W_{constraint} = α1 with α ∈ [0, 1]. Then the relaxed solution is simply a convex combination of the soft and hard constraint solutions:

\dot{q}_m^{(relaxed)} = \alpha \dot{q}_m^{(hard)} + (1 - \alpha) \dot{q}_m^{(soft)}    (24)

The hard constraint is enforced when α is 1, with a linear compression in the constraint's null space direction that decreases until α is 0, which is the soft constraint solution. This special case of the method in DeCarlo and Metaxas (1996) represents one of the simplest means of relaxing a hard constraint. The next section describes a more principled approach to relaxing the hard constraint using a Kalman filter.
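A small numerical check of this special case (with unit W_flow and W_edge, and toy random data) confirms that (23) with W_constraint = α1 reproduces the convex combination (24):

```python
import numpy as np

def relaxed_qdot_m(B, I_t, f_qm, alpha):
    # Eq. (23) with W_constraint = alpha * 1 and unit W_flow, W_edge.
    B_pinv = np.linalg.pinv(B)
    I = np.eye(B.shape[1])
    return -B_pinv @ I_t + (I - alpha * (B_pinv @ B)) @ f_qm

rng = np.random.default_rng(2)
B = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 3))
I_t = rng.normal(size=5)
f_qm = rng.normal(size=3)

hard = relaxed_qdot_m(B, I_t, f_qm, 1.0)    # full cancellation
soft = relaxed_qdot_m(B, I_t, f_qm, 0.0)    # no cancellation
mid  = relaxed_qdot_m(B, I_t, f_qm, 0.4)
assert np.allclose(mid, 0.4 * hard + 0.6 * soft)   # Eq. (24)
```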

4.4. Kalman Filtering and Hard Constraints

The optical flow constraint on q̇_m is imperfect due to noise and estimation errors. It is therefore desirable to have only a partial cancellation of f_{q_m}; one way this can be accomplished is through the use of a Kalman filter. This section describes how the computation from Section 4.2 is reformulated using an iterated extended Kalman filter.

Kalman filtering (Bar-Shalom and Fortmann, 1988; Gelb, 1974; Maybeck, 1979) has become a popular tool in computer vision, and the formulation here is, on the whole, similar to other applications (Ayache and Faugeras, 1988; Broida and Chellappa, 1986; Kriegman et al., 1989; Metaxas, 1996; Pentland and Horowitz, 1991): there is a measurement equation which models the noise inherent in the data gathering process, and there is a process model, which predicts the behavior of the system based on the current state. The initialization and tuning of the filter is accomplished using standard techniques. The significant difference here is that there is not only the edge data Eq. (5), which has been previously used as a filtering measurement equation (Metaxas and Terzopoulos, 1993), but there is also a data-based constraint Eq. (14). The first part of this section describes one reasonable way of using this constraint in the measurement equation. Alternative formulations are possible; ours corresponds to one which involves the solution of a hard constraint. The remainder of the section describes an iterated extended Kalman filter based in part on this measurement equation. First, we will describe our formulation of the filter. Following this, we will explain why this treatment allows for relaxed hard constraints.

By assuming a Gaussian noise model for both the measurements and state, a Kalman filter can maintain an estimate of the state y and the state covariance P. While the assumption of Gaussian noise might not be particularly accurate in describing the actual noise in the system, it permits a much simpler solution while still capturing a large amount of the uncertainty.

The measurement equation for the filter relates the measurements z to the state y using the measurement matrix H. Terms v_{f_q} and v_{I_t} are added to represent the assumed zero-mean Gaussian noise in f_q and I_t; they have covariances R_{f_q} and R_{I_t} respectively:

z(t) = H(t) y(t) + \begin{pmatrix} v_{f_q}(t) \\ v_{I_t}(t) \end{pmatrix}    (25)

where the construction of H, y and z in (25) comes from (14), (16) and (17).

H = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & B^T \\ 0 & B & 0 \end{bmatrix}, \quad y = \begin{pmatrix} \dot{q}_b \\ \dot{q}_m \\ \lambda \end{pmatrix} = \begin{pmatrix} \dot{q} \\ \lambda \end{pmatrix},    (26)

z = \begin{pmatrix} f_{q_b} \\ f_{q_m} \\ -I_t \end{pmatrix} = \begin{pmatrix} f_q \\ -I_t \end{pmatrix} = \begin{pmatrix} \sum_j L_p(u_j)^T f(u_j) \\ -I_t \end{pmatrix}

The state y consists of the parameter velocities q̇, together with the Lagrange multipliers λ used in the optical flow solution. This inclusion is for presentation only, because, as will be seen later, λ is effectively not part of the state. The discrete update equation for the state is given by (7).

z consists of the parameter forces f_q and the temporal image derivatives I_t. Note that the spatial image derivatives are not included in the measurements (even though they are used in the formation of B); doing so would greatly complicate the measurement equations. Similar simplifications can be found in image-based optical flow techniques (Simoncelli et al., 1991), where the noise in the spatial image derivatives is ignored to provide a Gaussian solution. Reasonably accurate estimates of the spatial image derivatives are usually


available (especially away from occlusion boundaries), making this a fairly safe assumption.

Note that H depends on the state y, so that the measurement equation is non-linear, and its solution requires the use of an extended Kalman filter. We also choose to iterate the solution, due to serious non-linearities in f_q. Recall from the previous section that while both B and f_q depend on q, we only re-evaluate f_q, and not B. This iterated extended Kalman filter is implemented in the standard way (Gelb, 1974; Maybeck, 1982), paying heed to the usual caveats concerning linearized filter convergence.

A more standard implementation would use the Kalman filter for data integration (Ayache and Faugeras, 1988; Kriegman et al., 1989) (a soft constraint approach) to fuse the flow and edge solutions. This solution simply lacks the Lagrange multipliers:

H_{soft} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & B \end{bmatrix}, \quad y_{soft} = \begin{pmatrix} \dot{q}_b \\ \dot{q}_m \end{pmatrix} = \dot{q},    (27)

z_{soft} = \begin{pmatrix} f_{q_b} \\ f_{q_m} \\ -I_t \end{pmatrix} = \begin{pmatrix} f_q \\ -I_t \end{pmatrix}

However, it is the solution that contains λ that produces a hard constraint solution. Comparing the pseudo-inverses of these two measurement matrices shows that H^+ results in (20):

H^+ = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 - B^+ B & B^+ \\ 0 & (B^+)^T & -(B^+)^T B^+ \end{bmatrix}    (28)

H_{soft}^+ = \begin{bmatrix} 1 & 0 & 0 \\ 0 & (B^T B + 1)^{-1} & (B^T B + 1)^{-1} B^T \end{bmatrix}

The inclusion of λ in (26) thus ensures the system reduces to the original unfiltered solution (in the presence of no a priori state information). The presence of λ is a result of the constraints on the dynamic system. However, it should not be considered part of the state. In fact, the Lagrange multipliers are not something that really needs to be estimated; but we must include them to effect a hard constraint solution. This decision comes with some complications. Each λ_j in λ is associated with a particular pixel from the optical flow computation. However, there is not necessarily any correspondence between the pixels (and hence the λ_j) across iterations. Even worse, the number of pixels used (the dimension of λ) varies across iterations. This means a subset of the state parameters are only present at one iteration, and their predicted values at time t are not based on any previously estimated values. An alternative interpretation would be to view these parameters λ as having infinite observation noise, or perhaps that the "observability" of λ is changing.
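The closed-form pseudo-inverse in (28) can be verified numerically against a generic SVD-based pseudo-inverse; the block sizes below are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(3)
nb, nm, nc = 2, 3, 4                  # dim q_b, dim q_m, # constraints
B = rng.normal(size=(nc, nm))
Bp = np.linalg.pinv(B)

# H from Eq. (26), in blocks over (qdot_b, qdot_m, lambda).
H = np.block([
    [np.eye(nb),         np.zeros((nb, nm)), np.zeros((nb, nc))],
    [np.zeros((nm, nb)), np.eye(nm),         B.T               ],
    [np.zeros((nc, nb)), B,                  np.zeros((nc, nc))]])

# Claimed closed form of H^+ from Eq. (28).
H_pinv = np.block([
    [np.eye(nb),         np.zeros((nb, nm)),  np.zeros((nb, nc))],
    [np.zeros((nm, nb)), np.eye(nm) - Bp @ B, Bp                ],
    [np.zeros((nc, nb)), Bp.T,                -Bp.T @ Bp        ]])

assert np.allclose(H_pinv, np.linalg.pinv(H), atol=1e-8)
```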

The discrete process equation for the Kalman filter gives an expression for the prediction of the state y(t+1) given the previous estimate y(t). In this case, this equation states that the predicted motion of the observed subject is the same as in the previous iteration, along with the added noise w (assumed to be independent zero-mean Gaussian noise with covariance Q) to form the primarily data-driven system:

y(t + 1) = y(t) + w(t), \quad p(w) \sim N(0, Q)    (29)

The prior estimates ofy andP used in the computationof the estimated state and covariance at timet are de-notedy andP. Sinceλ is treated as a distinct value ateach iteration, only the portions ofy(t−1) andP(t−1)that correspond toq are retained, resulting in:

y(t) =(

q(t − 1)

0

),

P(t) = P(t − 1)+Q(t − 1)

=

0Pq(t − 1)

0

0 0 0

+

0 0 0

0 Qqm(t − 1) 0

0 0 Qλ(t − 1)

(30)

(where P_q is the block of P(t-1) corresponding to q̇). Q is the covariance of the process noise, and represents the uncertainty in the process model. As estimation of q_b is static, its corresponding block in Q is zero (in practice, a small amount of stabilizing noise is needed). Q_{q_m} models the uncertainty of the actions of the observed subject. This way, the estimation of the static quantity q_b will eventually cease as the estimated covariance of these parameters converges, while q_m is interpreted as a dynamic quantity by the filter. Q_λ is used to relax the hard constraints; this will be explained below.


Computing the estimated mean and covariance of y involves forming the Kalman gain matrix, which is used to combine the solution using the current measurements with the solution from the previous iteration. In the following filtering equations, all quantities are taken at time t, but this dependence is omitted to improve readability. The Kalman gain matrix (Bar-Shalom and Fortmann, 1988; Gelb, 1974; Maybeck, 1979) is computed as:

K = \bar{P} H^T (H \bar{P} H^T + R)^{-1}    (31)

The covariance matrix R is the sum of terms resulting from the noise in f_q and I_t:

R = \begin{bmatrix} R_{f_q} & 0 \\ 0 & R_{I_t} \end{bmatrix}    (32)

The relative scale between R_{f_q} and R_{I_t} provides a control for tuning how much trust goes into the optical flow information relative to the edge information.

The estimated mean is computed as a sum of the current solution Kz and the weighted prior mean estimate ȳ, or as the sum of the prior estimate ȳ and the innovation (z - Hȳ) weighted by K:

\hat{y} = K z + (1 - K H) \bar{y} = \bar{y} + K (z - H \bar{y})    (33)

The estimated covariance (Bar-Shalom and Fortmann, 1988; Gelb, 1974; Maybeck, 1979) is computed from the prior covariance P̄ as:

\hat{P} = (1 - K H) \bar{P}    (34)
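For reference, one measurement update of (31), (33) and (34) in code. This is only the generic update step with toy dimensions; it omits the model-specific construction of H, z and R described above, and the iteration and linearization of the full extended filter:

```python
import numpy as np

def kalman_update(y_bar, P_bar, z, H, R):
    # Eqs. (31), (33), (34): gain, innovation update, covariance update.
    S = H @ P_bar @ H.T + R
    K = P_bar @ H.T @ np.linalg.inv(S)            # Eq. (31)
    y_hat = y_bar + K @ (z - H @ y_bar)           # Eq. (33)
    P_hat = (np.eye(len(y_bar)) - K @ H) @ P_bar  # Eq. (34)
    return y_hat, P_hat

# Toy usage: a noisy linear measurement of a 3-vector state.
rng = np.random.default_rng(4)
H = rng.normal(size=(4, 3))
R = 0.1 * np.eye(4)
y_bar, P_bar = np.zeros(3), np.eye(3)
z = H @ np.ones(3) + 0.01 * rng.normal(size=4)  # true state is all ones
y_hat, P_hat = kalman_update(y_bar, P_bar, z, H, R)
print(y_hat)
```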

4.4.1. Relaxing Hard Constraints. In understanding why this formulation relaxes the hard constraints, it is much clearer to consider the following alternative (and algebraically equivalent) formulation (Maybeck, 1979) of (31)–(34):

\hat{y} = \hat{P} H^T R^{-1} z + \hat{P} \bar{P}^{-1} \bar{y}, \quad \hat{P} = (H^T R^{-1} H + \bar{P}^{-1})^{-1}    (35)

When written this way, it is clear how the solution of the measurement equation is being combined with the a priori information, based on their corresponding covariances.

Consider the case when R = 1, P(t-1) = 0, Q_{q_m}(t-1) = 0 and Q_λ(t-1) = (1/α)1, with α > 0 (using P̄^+ for P̄^{-1} in the absence of prior information). The result simplifies to:

\hat{y} = \left( H^T H + \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \alpha 1 \end{bmatrix} \right)^{-1} H^T z    (36)

Without the addition of α1, this would be the hard constraint solution in (20). The addition of α1 relaxes the hard constraint, with more constraint violation as α increases. This is because the Lagrange multipliers that were used to enforce the constraint are gradually driven towards zero as α increases, since they are increasingly combined with the a priori value of λ in ȳ (which is zero). In fact, when α is sufficiently large, this solution approaches that of a soft constraint solution (i.e. one without Lagrange multipliers), since:

\lim_{\alpha \to \infty} \left( H^T H + \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \alpha 1 \end{bmatrix} \right)^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & (B^T B + 1)^{-1} & 0 \\ 0 & 0 & 0 \end{bmatrix}    (37)

which, when right multiplied by H^T, produces H_{soft}^+ (with additional rows of zeros).

In the general case, P(t-1) and Q_λ(t-1) will also cause constraint violation, but not in any controlled way. Rather, their presence causes the solution to be a balance between the measurement equation solution and the prediction in a way that doesn't respect the constraint. In other words, the Kalman filter can put more trust in the prediction (which can violate the constraint) at times when the estimate of q_m is noisy. In practice, the form of Q_λ(t-1) is still (1/α)1, with α determined during filter tuning (the determined value cancelled on average 97% of the component of the edge forces which violated the constraint, on each iteration of the extended Kalman filter). Keep in mind that the only distinction between this solution, and an ordinary use of a Kalman filter, is the inclusion of the Lagrange multipliers in the state variable, and their process update in (30). Neither of these betrays the assumptions made in the derivation of the iterated extended Kalman filter (Maybeck, 1982), and as a result, the solution here has the same stability properties as an ordinary solution.
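A numerical check of this limiting behavior: as α grows in (36), the q̇ portion of the solution approaches the least-squares solution of the soft system H_soft. The dimensions and random data below are toy values:

```python
import numpy as np

rng = np.random.default_rng(5)
nb, nm, nc = 2, 3, 4
B = rng.normal(size=(nc, nm))
H = np.block([
    [np.eye(nb),         np.zeros((nb, nm)), np.zeros((nb, nc))],
    [np.zeros((nm, nb)), np.eye(nm),         B.T               ],
    [np.zeros((nc, nb)), B,                  np.zeros((nc, nc))]])
z = rng.normal(size=nb + nm + nc)

def relaxed(alpha):
    # Eq. (36): (H^T H + diag(0, 0, alpha*1))^-1 H^T z.
    D = np.zeros((nb + nm + nc,) * 2)
    D[nb + nm:, nb + nm:] = alpha * np.eye(nc)
    return np.linalg.solve(H.T @ H + D, H.T @ z)

# Soft-constraint solution (no Lagrange multipliers), via H_soft.
H_soft = np.block([
    [np.eye(nb),         np.zeros((nb, nm))],
    [np.zeros((nm, nb)), np.eye(nm)        ],
    [np.zeros((nc, nb)), B                 ]])
soft = np.linalg.lstsq(H_soft, z, rcond=None)[0]

# The qdot part of the relaxed solution approaches the soft one.
y = relaxed(1e8)
assert np.allclose(y[:nb + nm], soft, atol=1e-5)
```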

The Kalman filter solution presented here has a number of advantages over the direct solution from (20), and the commonplace use of a Kalman filter for data fusion.


It makes the framework more robust to noise and small estimation errors. More importantly, it provides a valuable means for combining the edge forces and optical flow information; the optical flow constraint is now relaxed to a degree based on the error in the optical flow information in a way that makes the system more efficient and robust. Next, experiments will be presented which show that the use of a Kalman filter (in addition to treating optical flow as a hard constraint) was an important addition to the system.

5. Experiments and Discussion

This section contains the results from a series of face shape and motion estimation experiments. The first three experiments exhibit the generality of our system on a variety of subjects, while the next four experiments use a common observed subject and provide a quantitative validation of the shape and motion estimation. The last of the validation experiments compares a number of related frameworks mentioned in this paper.

5.1. Initialization

The entire estimation process is automatic except for the initialization, which requires the manual specification of several landmark features in the first frame of the sequence (the eyebrow centers, eye corners, nose tip, and mouth corners). The subject must also be at rest, and (approximately) facing forward, as in Fig. 4(a). In all the experiments except those used for motion validation, the shape of the face is estimated only from the images.

Using these marked features, forces are applied to the initial face model that deform the corresponding points on the face toward the desired locations in the image. Experience has shown that the initialization process is robust to small displacements (i.e. several pixels) in the

Figure 4. Model initialization.

selected landmark points. The rotation and translation, as well as coarse-scale face shape parameters (such as those which determine the positions and sizes of the face parts), are fitted using this information; the result is shown in Fig. 4(b). Once the model is roughly in place, both edge and anthropometry forces are applied to pull the face into the correct shape, as in Fig. 4(c). The distance from the initial face to the camera is determined under the assumption that the subject's face is the same size as the model.

The problem of automatically locating the face and its various features has been addressed elsewhere (Yacoob and Davis, 1994; Yuille et al., 1992), and such methods could be used to make this process automatic. No markers or make-up are used on the subject (markers are used for the validation of the method, however, as described below).

5.2. Tracking Experiments

The original image sequences are 8-bit gray images at NTSC resolution (480 vertical lines). In each of the sequences, the width of the face in the image averages 200 pixels, and the range of motion of features across the image sequence is typically 80 to 100 pixels. For each of the tracking examples, several frames from the image sequence are displayed, cropped appropriately. Below each, the same sequence is shown with the estimated face superimposed. In each case, a model initialization is performed as described above. The initialization process usually takes about 2 minutes of computation. Afterwards, processing each frame (using the extended Kalman filter formulation) takes an average of 1.4 seconds (all computation times are measured on a 175 MHz R10000 SGI O2).

The sequence shown in Fig. 5 was taken on an IndyCam at 5 fps. Figure 5 shows a subject turning her head in (a) through (d) and opening her mouth from (d) to (f). Based on the good alignment of the face model with the image, it appears the face model is able to capture the shape of her face, as well as the head rotation and mouth motion. The next two sequences were taken on a higher quality camera at 30 fps. Both Fig. 6 and Fig. 7 show a subject smiling and moving forward in (b) and (c), opening their mouth while turning their head in (e) and (f), and turning back, closing their mouth slightly in (g). All of these motions appear to be correctly tracked based on the observed motion. These three experiments involve different subjects, having very different appearances.


Figure 5. Motion and expression tracking example 1.

Figure 6. Motion and expression tracking example 2.

Figure 7. Motion and expression tracking example 3.


This suggests that the verification of the face model shape parameterization (described in Appendix B) was successful. The extracted face shape is quite individualized to the subject, although not to the point that it would be useful for certain applications in computer graphics. The extracted models for the three subjects in Figs. 5–7 are shown on the right side of Fig. 2 (the upper-right, lower-left and lower-right, respectively).

5.3. Shape Estimation Validation

This experiment provides a validation of the shape estimation accuracy of our system. The extracted shape (specified by $q_b$) is validated by comparison with a Cyberware range scan of the subject, shown in Fig. 8(a).

The shape estimation validation experiment in Fig. 9 shows the subject performing small head motions in (a) through (f) while smiling in (c) and (d), and finishing with a significant head rotation in (g). At each frame, Fig. 10 shows the extracted shape results as compared against the range scan of the subject. Note that for this comparison, all motion parameters are ignored, so that only the shape is compared. The RMS error is computed using the nodes of the model, and also includes a uniform scaling of the model so that the two faces are the same scale; this eliminates the depth ambiguity.

Figure 8. (a) Shaded range scan of subject, (b) Marker calibration images, (c) Resulting marked model.

Figure 9. Shape validation experiment.

In this case, the estimated model was compared at 96% scale. The rigid alignment (translation and rotation), as well as this uniform scaling, were computed using a semi-automatic alignment method (the chosen alignment had the smallest RMS error).

The RMS error, which starts at around 2 cm after initialization, shows a gradual reduction over the course of the experiment, ending around 1 cm. The large reduction in error around frame 50 corresponds to when the subject turned his head significantly to the side in Fig. 9(f) and (g), where the profile view contained good edge information for fitting the face shape.
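For reference, the kind of alignment used in this comparison can be sketched as follows (our sketch; the paper's alignment was semi-automatic and simply kept the result with the smallest RMS error). Assuming node-to-scan correspondences are given, a Procrustes-style fit removes translation and rotation and finds the uniform scale:

    import numpy as np

    def aligned_rms_error(model_pts, scan_pts):
        # model_pts, scan_pts: (n, 3) arrays of corresponding points.
        X = model_pts - model_pts.mean(axis=0)   # remove translation
        Y = scan_pts - scan_pts.mean(axis=0)
        U, S, Vt = np.linalg.svd(X.T @ Y)
        R = U @ Vt                               # rotation taking X toward Y
        # (a reflection check on det(R) is omitted for brevity)
        s = S.sum() / (X ** 2).sum()             # optimal uniform scale
        residual = s * (X @ R) - Y
        return np.sqrt((residual ** 2).sum(axis=1).mean()), s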

5.4. Motion Estimation Validation

The next three experiments use markers to allow for the validation of the motion tracking of our technique. Eleven small circular markers were placed on the face of a subject. Analysis of the accuracy of the motion estimation in $q_m$ is performed using these markers, which allow for alignment verification in the image plane (ground truth motion in 3D is not available).

For these three experiments, no shape estimation is performed. Instead, the face shape is provided by an off-line fitting of the face model to the range scan in Fig. 8(a); this way, any deviation can be attributed primarily to motion error, not shape error.

Figure 10. Results of shape validation experiment.


In addition, the fixed locations of the markers on the model are determined using additional images taken of the subject, shown in Fig. 8(b). The markers are fixed at particular locations of the polygon mesh (they have fixed coordinates in $\Omega$). The model resulting from this fitting and marker placement is shown in Fig. 8(c), with the marker locations shown as dark circles. The RMS error of the extracted model (comparing the extracted model with the range scan) is 0.26 cm.

First, the image locations of each of the markers in the image sequence are obtained using a semi-automatic tracking system. The rough location of the markers is tracked using the KLT² package (which is based on Shi and Tomasi (1994)), and fine-tuned using a deformable ellipse template. Simple calibration tests suggest this tracking technique has a variance of 0.35 pixels in measuring the center of a marker (markers are usually about 8 pixels across) in the image.

Care was taken so that the presence of the markers did not significantly affect the motion estimation, since these markers could provide useful information for tracking. The pixel selection method for the optical flow information was modified so that no points were selected within 3 pixels (the radius of the spatial derivative filters) of any point on a marker. In addition, any edges used to produce edge forces were similarly restricted to be distant from markers. Given that the markers were not placed directly on top of important facial features, it is unlikely that their presence detrimentally affected the experimental results.

Figure 11. Motion validation experiment 1 (no shape estimation performed).

In each of the following three motion validation experiments, there is an accompanying graph showing the displacement error for each frame. The displacement error of a marker is the Euclidean distance (in pixels) between the image location of the marker (if visible) and the predicted image location of the marker given the model (i.e. the projected image location of the model marker). The dark line on the graph shows the mean displacement error of all visible markers (one standard deviation is indicated by the gray region surrounding it). The dotted lines indicate the minimum and maximum displacement error.
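A sketch of this error measure (ours; project() stands in for the camera projection used by the model, and entries of marker_uv are None when a marker is not visible):

    import numpy as np

    def displacement_error_stats(marker_uv, model_xyz, project):
        # Euclidean pixel distance between each visible tracked marker and
        # the projected image location of the corresponding model marker.
        errs = [np.linalg.norm(uv - project(xyz))
                for uv, xyz in zip(marker_uv, model_xyz) if uv is not None]
        errs = np.array(errs)
        return errs.mean(), errs.std(), errs.min(), errs.max()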

The first two sequences were taken using an IndyCam at 5 fps. The final sequence was taken on a high quality camera (Pulnix TM-9701; grayscale, progressive scan) at 30 fps. Also note that this final sequence was taken at a different time than the first two: the markers were re-applied to the subject, and their locations were determined again, as in Fig. 8(b) and (c). Their new locations were roughly the same as in the earlier validation experiments (at most 1.5 cm difference).

The sequence in Fig. 11 shows predominantly non-rigid motion (facial expressions). The subject moves forward and frowns his eyebrows in (b), moves back and produces a surprise expression in (d), followed by a smile in (f). The average error shown in Fig. 12 is between 2 and 4 pixels which, given that the face is approximately 200 pixels across in the image, amounts to less than 2%. The maximum error of around 7 pixels corresponds to around 3.5% (roughly 0.5 cm).


The largest error is produced during the smile expression; possible reasons for this are discussed in the next section.

The second sequence in Fig. 13 is a combination of rigid and non-rigid motions. The subject turns his head from (a) through (d) while smiling, returning to the rest position in (f). The displacement error shown in Fig. 14 averages from 2 to 4 pixels (staying closer to 4 for a longer period), reaching a maximum of just over 7 pixels. The largest error is produced when the smile is viewed from the side, and is concentrated in the mouth area.

The last sequence in Fig. 15 is primarily a rigid-motion sequence that is significantly longer than the other experiments (760 frames). It includes head rotations in a variety of directions, as well as some large head translations (side-to-side and away from the camera).

Figure 12. Results of motion validation experiment 1.

Figure 13. Motion validation experiment 2 (no shape estimation performed).

Eyebrow raises and a smile are also present. This sequence demonstrates the ability of the system to maintain track over a long sequence without experiencing failure due to tracking drift. In this sequence, the face is approximately 140 pixels across in the image (somewhat smaller than in the previous experiments). The average pixel deviations shown in Fig. 16 range between 1.5 and 2.8 pixels, with a maximum error of 4.6 pixels, corresponding to about the same absolute distance error as in the previous experiments (roughly 0.5 cm). Hence, the apparently lower pixel deviations for this sequence amount to approximately the same error in actual distance. During the sequence, some of the motions were very close to the maximum limits of tracking speed (pixel velocities were about the same size as the derivative filter width).

Figure 14. Results of motion validation experiment 2.


Figure 15. Motion validation experiment 3 (no shape estimation performed).

Figure 16. Results of motion validation experiment 3.

In particular, the turning motion at frames 250–320 is the most serious, with other occurrences at frames 430–450 and 610–620. These motions manifest themselves in Fig. 16 as larger displacement errors. However, during the subsequent motions (which are well below this maximum velocity), the system recovers from these errors and improves the fit using edge information, returning to the baseline deviation of around 2 pixels. This baseline corresponds to the maximum accuracy of model-edge alignment and the limited precision of the marked model in Fig. 8(c).

Figure 17. Tracking performance of various frameworks.

5.5. Discussion

The successful tracking performed by this framework is primarily due to the use of optical flow as a constraint. This is empirically verified by disabling key components of our tracking system and observing the resulting performance decrease (in the form of tracking failures³). The results of running the experiment of Fig. 15 on the various frameworks are shown in Fig. 17.


Figure 18. Timing of various frameworks (175 MHz R10000 SGI O2).

The timings shown in Fig. 18 include the average execution time for a single frame (over all iterations) on a 175 MHz R10000 SGI O2, along with the average number of iterations required within a frame.

The results from the constraint-based Kalman filtering framework for the third experiment are shown in Fig. 16 (and also in Fig. 17 as the line marked “with constraint”). The framework which uses both optical flow and edges, but with the measurement equation in (27) that does not incorporate optical flow as a hard constraint, is shown as “without constraint” in Fig. 17. This system can experience tracking failure (as in frame 370) when it encounters a difficult model-edge alignment problem (when the deviation is large, or many parameters require adjustment). It is worth noting that a constraint-based Kalman filtering method without the relaxation (a framework like “with constraint” but with $Q_\lambda = 0$) had tracking performance virtually the same as the “without constraint” method (although it was just as fast as the method “with constraint”).

The framework in DeCarlo and Metaxas (1996) (labeled “CVPR 96”) used an ad hoc filtering method (to soften the constraint) instead of a Kalman filter. In other words, this system used flow as a hard constraint, but did not use a Kalman filter. While each iteration took less time, more iterations were required, resulting in roughly the same timing as the method which uses a Kalman filter. However, this method is not as robust, losing track around frame 290.

It is also worthwhile to test each data source separately (but still using a Kalman filter). When edges are not used, leaving only the model-based optical flow solution, errors in the estimation of $q_m$ accumulate (since this solution is integrating a velocity), causing the model to lose track quite quickly. This method is marked as “flow only” in Fig. 17. While a non-linear iterative solution (Bergen et al., 1992) would improve the accuracy, it would still not prevent tracking drift.

The method using only edge information (marked “edges only” in Fig. 17) often finds local minimum solutions (such as around frame 80 and frame 130), some of which can lead to tracking failure (near frame 280). As the model-image displacement increases, the model-edge alignment problem becomes quite difficult and expensive to solve. Tracking failure occurs in situations not unlike those that caused problems for the framework marked “without constraint”.

While the framework using hard constraints performed well in this sequence, noise can be added to the system to determine at what point it fails. This tracking experiment was run again (a number of times) to experimentally determine the minimum sustained deviation that causes tracking failure. Gaussian noise of increasing variance (until tracking failed) was added to the rigid motion parameters in $q_m$ at the start of each iteration. Tracking failure became common as average displacement errors went above 4.6 pixels (the incidence of failure went from non-existent below 4.5 to prevalent by 4.7). Alternatively, adding Gaussian noise directly to the images (again of increasing variance until tracking failed) produced a similar value (average displacement errors of 4.4 pixels, with a corresponding image noise variance of 15.5% of intensity). Incidences of tracking failure for the other systems (when noise was added) became noticeably worse during these tests. This suggests that our system using relaxed hard constraints has a comfortable margin of safety from tracking failure.
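Schematically, the noise test proceeds as follows (a sketch under our own assumptions; run_sequence is a hypothetical stand-in for the full tracker that reports whether tracking failure, a sustained 10-pixel deviation, occurred):

    import numpy as np

    def minimum_failure_variance(run_sequence, variances, n_trials=5):
        # Re-run the experiment with zero-mean Gaussian noise of increasing
        # variance added to the rigid parameters of q_m at each iteration,
        # returning the first variance at which failure becomes prevalent.
        for var in variances:
            failures = sum(run_sequence(noise_std=np.sqrt(var))
                           for _ in range(n_trials))
            if failures > n_trials // 2:
                return var
        return None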

Considering all the experiments, the error in the tracking results can have other (non-noise) sources besides motion estimation error. One possibility is poorly extracted marker locations (although this distance is less than a pixel). Another source can be the discrepancy between the face shape used and the shape of the observed subject. The RMS error between the face shape and the range scan for only the marker points is much lower than that for the whole model; it is 0.1 cm, which will cause at most 1 pixel of deviation in marker locations. Violation of the assumption of perspective projection is also a possible contributor to error, although in this case it is minimal, given the small depth range of the face compared to the distance of the face from the camera. From this, it can be concluded that a significant portion of the errors present here comes from motion estimation.

Upon closer examination, it can be seen that the larger errors present during non-rigid motions (in particular, smiling) are caused by the smile produced by the model not matching the smile of the subject. In particular, the subject's smile is more curved than the one produced by the model. Also, the smile produced by the model does not move the mouth back (into the face) far enough, which explains why the most error is present when the smile is viewed from the side.


These errors result from the inability to estimate the scaling constants used in (46). Attempting to estimate these constants for each individual using only edge forces does not produce reliable results.

5.6. Limitations

The many experiments in this section show the capabilities of the shape estimation and tracking systems described in this paper. On the other hand, they also reveal the limitations of the system.

First, some of the limitations of the system come directly from the assumptions made during design. Most obvious is the assumption of brightness constancy in the optical flow computation. Major lighting changes can cause tracking failure. Specularities also cause small problems, but tend not to affect the entire model, since they are fairly localized. In some cases, poor lighting will also lead to tracking failure. Typically, this occurs in situations where edges are washed out (opening the aperture too wide on a camera will do this).

Second, tracking fails when the motion simply exceeds the maximum tracking speed (determined by the derivative filter width). This problem can be addressed by using multi-scale optical flow methods. On a related note, the motions and edges can also become too small to be estimated accurately. When the face in the image is smaller than about 40 pixels across, there is not enough edge information to maintain track reliably.

Third are deviations from the model, where the images go past the coverage limits of the model. Attempting to track motions that are not represented produces relatively unpredictable effects. For example, lip puckering is not modeled: tracking this facial motion produces the best fit using the existing motion parameters (and can often be quite far off). This causes poor model-image alignment, which can lead to tracking failure if the unmodeled motion is very large. Note that during speech, however, the system retains good track of the head and brows, while the motion parameters affecting the mouth region are garbled. This is not surprising, as these unmodeled motions are attributed to other parameters in the same region (in a least-squares way). Large occlusions produce similar problems (such as a hand passing in front of the face). And of course, since a “mask” face model is used, our framework will lose track during head motions where the mask visibility becomes too small.

There is hope for detecting these problems automatically: many of these difficulties first appear as large increases in the constraint residual (localized to the region of model deviation).

Finally, there are the problems associated with the tracking of multiple, simultaneous motions. In the validation experiments, situations where head rotation was accompanied by a non-rigid expression deformation often produced higher pixel deviations. On occasion, this deviation can be serious enough to cause tracking failure. This is caused by the linearization in the model-based optical flow solution, which could perhaps be alleviated by using an iterative least-squares solution (Bergen et al., 1992). There can also be situations where motions are confused (given a particular model configuration, two parameterized motions may appear nearly identical). The problem arises when the model state changes so as to make the current motion estimate inconsistent. Multiple hypothesis estimation methods might provide a solution here, although it is likely the most robust solution (for some applications) would simply be to detect and recover from such a situation.

6. Conclusions

We have presented a method for treating optical flow information as a hard constraint on the motion of a deformable model. We have argued, as well as empirically demonstrated, that it was the treatment as a hard constraint which resulted in the benefits in efficiency and robustness. Furthermore, we showed how a hard constraint based on noisy data can be softened using a Kalman filter while preserving these beneficial aspects. Finally, hard constraints provided a means for combining information sources, which allowed edge information to be used along with optical flow in order to combat error accumulation in tracking.

Our use of a detailed three-dimensional model also helped a great deal. By accounting for the self-occlusion of the face, large amounts of head rotation can be successfully tracked. Our detailed shape model allowed accurate descriptions of facial shape to be extracted, the parameterization of which would not have been possible to implement without the use of face anthropometry data to control model coverage. Finally, by designing the model with a separable shape and motion parameterization, we can separate the problem of estimating the shape of an individual's face from estimating their motion, resulting in a much smaller dynamic estimation problem.


The current system does have a number of limitations, however. The most significant of these is the idealization of the optical flow constraint equation. For instance, the problems of photometric variation and self-shadowing, which violate the optical flow constraint equation, are not addressed. The presence of a three-dimensional model could prove useful in addressing these problems. Another limitation is in tracking large motions; at the moment, motions larger than the width of the derivative filters will not be tracked correctly. Multi-scale model-based optical flow techniques (Bergen et al., 1992) can be applied to address this.

Looking to the future, investigation of the recognition of faces using the shape parameterization, or of facial expressions using the motion description, is worth pursuing. Simplistic approaches that depend on a particular parameterization (such as directly thresholding the “smile” parameter) would not be robust. Also, having a more detailed motion parameterization will allow for the tracking of more complex facial motions. Methods for generating motion models from example data, which are becoming more commonplace as data gathering methods improve, would be particularly effective in building such a complex model.

Appendix A: Modularization of Global Deformations

The shape model $x$ is defined through the repeated application of $n$ global deformations $T_k : \mathbb{R}^3 \to \mathbb{R}^3$, where $k \in 1..n$, to the underlying shape $s$ (which has parameters $q_s$ in general deformable model frameworks) as:

$$x(q; u) = T_n\big(q_{T_n}; \ldots T_1\big(q_{T_1}; s(q_s; u)\big)\big) \tag{38}$$

where $q_{T_k}$ are the parameters used by $T_k$. The parameters used by all of the global deformations are accumulated into the vector $q_T$ as in:

$$q_T = \big(q_{T_1}^\top, \ldots, q_{T_n}^\top\big)^\top \tag{39}$$

so that $q$ can now be grouped as:

$$q = \big(q_s^\top, q_T^\top\big)^\top \tag{40}$$

For a particular set of deformation functions, closed-form expressions for the resulting shape can be derived. From these complex expressions, the Jacobian matrix can be derived (see Metaxas and Terzopoulos (1993) for an example), although this method is tedious and not modular.

Instead, a single expression for the resulting shape is not derived; rather, each deformation is applied separately given the definition in (38). The Jacobian matrix can be calculated in a similar way using the chain rule. First, define the deformation $\tau_k$ as the composition of the first $k$ deformation functions $T_1$ through $T_k$:

$$\tau_k(q_T; p) = T_k\big(q_{T_k}; \ldots T_1(q_{T_1}; p)\big), \quad p \in \mathbb{R}^3,\ k \in 1..n \tag{41}$$

with $\tau_0$ defined to be the identity. Given this definition of $\tau_k$, it follows how to compute $J_x$, the Jacobian of $x$ with respect to $q$, using the following recurrence:

$$J_{\tau_0} = J_s = \frac{\partial s}{\partial q_s}, \qquad J_{\tau_k} = \left[\, \frac{\partial T_k(p)}{\partial p}\, J_{\tau_{k-1}} \;\middle|\; \frac{\partial T_k}{\partial q_{T_k}} \,\right], \quad k \in 1..n \tag{42}$$

so that $J_x = J_{\tau_n}$. The left block in (42) uses the chain rule, so that the matrix $\partial T_k(p)/\partial p$ “deforms” the individual columns of the Jacobian matrix $J_{\tau_{k-1}}$. The right block in (42) contains the derivatives of the outermost deformation $T_k$ with respect to its parameters.

A naive technique for computing $J_x$ using this recurrence from the bottom up (starting with $J_s$) is particularly expensive in terms of both time and space complexity. This is especially a problem since the Jacobian needs to be re-evaluated at each iteration, over many points on the model. Instead, the quantity $J^\top f$ is computed, given an applied force $f$ such as in (6). The quantity $J^\top f$ can be computed efficiently in a top-down fashion as:

$$f_n = f, \qquad f_{k-1} = \left(\frac{\partial T_k(p)}{\partial p}\right)^{\!\top} f_k, \quad k \in 1..n \tag{43}$$

$$J_s^\top f = \left(\frac{\partial s}{\partial q_s}\right)^{\!\top} f_0, \qquad J_{T_k}^\top f = \left(\frac{\partial T_k}{\partial q_{T_k}}\right)^{\!\top} f_k, \quad k \in 1..n \tag{44}$$

If the actual columns of $J_x$ are required, as is the case for the optical flow computation (12), they can be found by three applications of the above technique using the unit vectors $\mathbf{i}$, $\mathbf{j}$, and $\mathbf{k}$ in the $x$, $y$ and $z$ directions, respectively, as:

$$J_x^\top = (J_x^\top \mathbf{i})\mathbf{i}^\top + (J_x^\top \mathbf{j})\mathbf{j}^\top + (J_x^\top \mathbf{k})\mathbf{k}^\top \tag{45}$$

since $\mathbf{i}\mathbf{i}^\top + \mathbf{j}\mathbf{j}^\top + \mathbf{k}\mathbf{k}^\top = \mathbf{1}$. For the optical flow computation, this construction is only required for the motion parameters in $q_m$.
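A sketch of this top-down computation (our illustration of (43)–(44); each entry of deformations holds the two derivative blocks of one $T_k$, evaluated at the partially deformed point under consideration):

    import numpy as np

    def jacobian_transpose_force(dS_dqs, deformations, f):
        # dS_dqs: 3 x |q_s| Jacobian of the underlying shape s.
        # deformations: list of (dT_dp, dT_dq) for T_1 .. T_n, where dT_dp
        # is 3x3 and dT_dq is 3 x |q_Tk|.
        blocks = []
        fk = f                                       # f_n = f, Eq. (43)
        for dT_dp, dT_dq in reversed(deformations):  # k = n down to 1
            blocks.append(dT_dq.T @ fk)              # (dT_k/dq_Tk)^T f_k
            fk = dT_dp.T @ fk                        # f_{k-1} = (dT_k/dp)^T f_k
        blocks.append(dS_dqs.T @ fk)                 # (ds/dq_s)^T f_0, Eq. (44)
        blocks.reverse()                             # order: q_s, q_T1 .. q_Tn
        return np.concatenate(blocks)

Each pass touches each deformation only once with matrix-vector products, avoiding the matrix-matrix products of the bottom-up recurrence in (42).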

Besides global deformations, it is also useful to include rigid motions (translations and rotations) and even camera projections. For the case of camera projections, however, the mapping becomes $T : \mathbb{R}^3 \to \mathbb{R}^2$, and (45) uses only $\mathbf{i}$ and $\mathbf{j}$, since the image forces are two-dimensional. The formulation of the projected Jacobians in (3) and (4) is simply an instance of the left block of (42).

This modular technique for computing the Jacobian matrix allows for significantly easier implementation at little computational expense. It also permits the choice of which deformations to use to be made on the fly.

Appendix B: Face Model Deformations

In order to capture the variations seen in the shape and motion of human faces, a mixture of scaling, bending, and rigid deformations is used in the construction of the face model. This section provides details on these deformations. The model designer carefully combines the deformations to produce a parameterized face model. The result of this construction is an underlying model (the polygon mesh) with a series of deformation functions applied to it, each having a small number of parameters, and each applied to a particular set of face parts, ranging from a single part to the entire face.

Rigid transformations such as translation and rotation are used for the placement of parts on the face. Scaling and bending deformations, shown in Fig. 19, allow for the representation of a variety of face shapes. Each of these deformations is defined with respect to particular landmark locations in the face mesh. By fixing the deformations into the mesh, the desired effect of any particular deformation is not lost due to the presence of other deformations (since the landmark points are deformed along with the rest of the mesh). Although varying degrees of continuity can be attained for these deformations, each of the following deformations is $C^1$ continuous.

A shape (before any deformation is applied) which contains the landmark points $p_0$, $p_1$ and $c$ is shown in Fig. 19(a).

Figure 19. Scaling and bending deformations.

Figure 19(b) shows the effect of scaling this shape along the displayed axis. The center point $c$ is a fixed point of the deformation, while the region between $p_0$ and $p_1$ is scaled to have length $d$ (the parameter of the scaling deformation). Portions of the shape outside this region are rigidly translated.

Bending is applied in Fig. 19(c), which shows the effect of bending the shape in (a) in a downward direction. The bending is applied to the area between $p_0$ and $p_1$, where $c$ is the center of the bending. Outside this area, the shape is rotated rigidly. Each plane perpendicular to the bending axis is rotated by an angle determined by the distance of this plane to the center point $c$. The amount of bending is specified by the parameters $\theta_0$ and $\theta_1$, which specify the rotation angles at $p_0$ and $p_1$, respectively.

In addition, the spatial extent of each of these deformations can be localized, as shown in Fig. 19(d). The influence of the scaling deformation varies in directions perpendicular to the axis, producing a tapering effect. Near the top of the shape, the object is fully scaled to length $d$, while the bottom of the object is unaffected by the deformation. The ability to restrict the effect of a deformation is vital in specifying the variations of shape seen in the face. We will now see how these deformations can be used to create the model.
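A one-axis version of the scaling deformation can be sketched as follows (our simplification: the transition is piecewise-linear and therefore only $C^0$, whereas the deformations actually used blend smoothly to achieve $C^1$ continuity, and the tapering of Fig. 19(d) is omitted):

    import numpy as np

    def axis_scale(points, axis, t0, t1, c, d):
        # Scale the slab t0 <= t <= t1 (t is the coordinate along `axis`)
        # about the fixed center c (assumed to lie between t0 and t1) so
        # its length becomes d; rigidly translate the portions outside the
        # slab to preserve continuity.
        out = points.copy()
        s = d / (t1 - t0)
        t = points[:, axis]
        inside = (t >= t0) & (t <= t1)
        out[inside, axis] = c + s * (t[inside] - c)
        out[t > t1, axis] += (c + s * (t1 - c)) - t1
        out[t < t0, axis] += (c + s * (t0 - c)) - t0
        return out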

B.1. Face Shape

The underlying shape $s$, which is the polygon mesh shown in Fig. 1, can take the shape of a variety of faces through the application of a number of spatial deformations. This parameterization of the model is specified by the model designer. The job of the designer is made easier by separating the face into parts, allowing each face model component to be treated separately.


Instead of describing the entire model (which would be extremely lengthy and not particularly enlightening), a short description is provided which illustrates the concepts necessary for its construction.

Deformations are defined over a particular set of face model parts, although most deformations affect only one part. Example deformations that parameterize multiple parts include those affecting the lower face, which deform the chin and both cheeks. All of the deformations are specified in a particular order, and are applied in sequence to the underlying shape (see Appendix A). All of the parameters describing the shape of the face at rest (there are approximately 80) are collected into $q_b$. The shape deformations are collected into a single deformation function $T_b$. Most of the parameters are independent due to spatial locality, which keeps the problem of estimation using this model fairly tractable.

Figure 20 shows some of the scaling deformations defined for the nose. Each arrow indicates a particular scaling parameter (in the vertical or horizontal direction) that affects the space between the enclosing lines. The results of applying some of the deformations to the nose are shown in Fig. 21. Figure 21(a) and (d) show two views of the default nose. Figure 21(b) shows a nose deformed using vertical scaling, while the pulled-up nose in (c) is produced using a localized bending deformation. Figure 21(e) and (f) show localized scaling affecting the width of the nose in different places.

Figure 20. Scaling deformations of the nose.

Figure 21. Example deformations affecting the nose.

Different faces produced using many deformations are shown on the right side of Fig. 2.

Verification of the face parameterization produced by the model designer can be accomplished by fitting the model to a series of randomly generated sets of facial measurements. This is effectively a Monte Carlo method of sampling the space of face measurements. The fitting is easily accomplished, given a set of measurements, using the anthropometric forces described in Appendix C. The model designer can alter the model parameterization when a particular set of face measurements cannot be satisfied by the model. We obtained a face model capable of representing a wide variety of faces after only a few design iterations.

B.2. Face Motion

The deformations corresponding to motions (such as facial expressions) are modeled using the same techniques employed for face shape. However, there is no available motion data that corresponds to the anthropometric data for shape (although technology for gathering such data is becoming available (Guenter et al., 1998)). The motion deformations are applied to the face in rest position, after the shape deformations, as in (1). Examples of modeled expressions are displayed on the left side of Fig. 2. The model is capable of frowning or raising each eyebrow (top-left, top-right), smiling (bottom-left) or opening the mouth (bottom-right). This results in a total of 6 expression parameters (2 brow frowning, 2 brow raising, 1 smiling, 1 mouth opening), each corresponding to a particular FACS action unit (Ekman and Friesen, 1978). In addition to these are the six parameters for rigid head motion (translation and rotation), resulting in a total of 12 parameters in $q_m$. These deformations can be applied to any face (different $q_b$), such as those on the bottom of Fig. 2.

The construction of expressions is simplified by decomposing each face motion into several component deformations. For example, the mouth opening deformation is decomposed into chin/cheek bending, lip scaling and lip translation. To facilitate tracking of these expressions by reducing the number of motion parameters, there is a single control parameter for each expression which uniquely determines all of its component parameters. Given a particular face motion constructed using a series of deformations with parameters $b_i$, the control parameter $e$ determines the value $b_i$ based on the formula:

$$b_i = s_i e \tag{46}$$


where $s_i$ is the scaling parameter used to form the linear relationship between $b_i$ and $e$. These scaling parameters are the expression-shape parameters included in $q_b$ (there are about 20 in total). For situations where these parameters are not estimated, they are treated as constants, average values for which are determined by the designer during construction of the model.

The set of face motion parameters $q_m$ consists of the control parameters for each of the expressions (which are initially all zero), and the rigid translation and rotation specifying the head position. The parameters $b_i$ are not estimated, but are determined directly by (46) using the estimated value of $e$. The motion deformations are collected together into the deformation $T_m$.

Appendix C: Anthropometry

Anthropometry is the biological science of human body measurement. Anthropometric data is used in a variety of applications that require knowledge of the distribution of measurements across human populations. For example, in medicine, quantitative comparison of anthropometric data with patients' measurements before and after surgery furthers the planning and assessment of plastic and reconstructive surgery (Farkas, 1994). This paper proposes a similar use of anthropometry, in the construction of a face model for computer vision.

In order to develop useful statistics from anthropometric measurements, the measurements are made in a strictly defined way (Hrdlicka, 1972). Particular locations on a subject, called landmark points, are defined in terms of visible or palpable features. A series of measurements between these landmarks is then taken using carefully specified procedures and measuring instruments (such as calipers, levels and measuring tape). A canonical coordinate system for the head and face is also defined in terms of landmarks, and provides a set of axes from which some measurements are taken. As a result, repeated measurements of the same individual (taken a few days apart) are very reliable, and measurements of different individuals can be successfully compared.

Farkas (1994) describes a widely used set of measurements for describing the human face. A large amount of anthropometric data using this system is available (Farkas, 1987; Farkas, 1994). The system uses a total of 47 landmark points to describe the face, and includes the following five types of facial measurements:

- the shortest distance between two landmarks (such as the separation of the eyes)
- the axial distance between two landmarks, measured along an axis (such as the vertical height of the nose)
- the tangential distance between two landmarks, measured along a prescribed path on the surface of the face (such as the arc length of the upper lip boundary)
- the angle of inclination between two landmarks with respect to an axis (such as the slope of the nose bridge)
- the angle between locations (such as the angle formed at the tip of the nose)

Farkas describes a total of 132 measurements on the face and head.

Systematic collection of anthropometric measurements has made possible a variety of statistical investigations of groups of subjects. Subjects have been grouped on the basis of gender, race and age. Means and variances for the measurements within a group, tabulated in Farkas (1994), effectively provide a set of measurements which captures virtually all of the variation that can occur within the group.

In addition to statistics on measurements, statistics on the proportions between measurements have also been used. Anthropometrists have found that proportions give useful information about the correlations between features, and can serve as more reliable indicators of group membership than simple measurements (Farkas, 1987). These proportions are averaged over a particular population group, and means and variances are provided in Farkas (1987).

C.1. Use of Anthropometry Data

Our face model includes a representation of the anthropometric measurements described above. Given the measurement descriptions in Farkas (1994), they are realized using a straightforward set of geometric operations performed on the face model: given a value of $q_b$, a set of measurements can be taken from the model.

Use of this data limits the coverage of a hand-crafted model to the space of faces made likely by a distribution of anthropometric measurements. Forces arising from this data are comparable to the internal stiffness forces used in other deformable model work (Terzopoulos et al., 1988). In that work, stiffness forces were used to determine a smooth surface in situations where the data was sparse or noisy.


Here, anthropometric forces maintain a believable face shape, avoiding the parameter combinations that result in unlikely or impossible faces.

For a particular set of model points $x_1 .. x_n$, a measurement $M_j$ is written as:

$$M_j(x_1, \ldots, x_n), \quad j \in 1..M \tag{47}$$

where $M$ is the number of measurements in Farkas' inventory. As an example, a shortest distance measurement is simply the following:

$$M_{\text{dist}}(x_1, x_2) = \|x_1 - x_2\| \tag{48}$$

where $x_1$ and $x_2$ are the model points corresponding to the two landmarks used by the measurement. Note that these points depend on the shape parameters $q_b$, but not on the motion parameters $q_m$ (which is effectively zeroed whenever anthropometric measurements are taken on the model, since this reflects the same “expressionless” conditions under which the data was originally gathered).

The statistical characterization of measurements and proportions can be built into the model in two ways. First, by using an average set of measurements, a set of parameters specifying the initial model is determined. This initial model is an anthropometrically “average” model, and is shown in Fig. 1. Second, this characterization provides a means of biasing the face model shape parameters ($q_b$) toward more commonly occurring individuals.

Given a particular set of population groups, average measurement values and variances are obtained from Farkas (1994) as:

$$\big(\mu_j, \sigma_j^2\big), \quad j \in 1..M \tag{49}$$

The biasing of the parameters is performed using three-dimensional spring-like forces (a soft constraint) applied to the polygonal face model, which softly enforce a measurement on the model. First, an energy is associated with each measurement:

$$E_j = \tfrac{1}{2}\big(M_j(x_1, \ldots, x_n) - \mu_j\big)^2 \tag{50}$$

Then, the force resulting from the energy $E_j$, which is applied to model domain point $u_i$ (corresponding to the point $x_i$ on the model surface), is obtained as:

$$f_{E_j}(u_i) = -\frac{\partial E_j}{\partial x_i} = -\big(M_j(x_1, \ldots, x_n) - \mu_j\big)\frac{\partial M_j}{\partial x_i} \tag{51}$$

The total anthropometric force applied to model domain point $u_i$ is computed as the weighted sum of all measurement forces at $u_i$:

$$f_{\text{ant}}(u_i) = \sum_{j \in 1..M} \left(1 - \frac{e^{-E_j/\sigma_j^2}}{\sqrt{2\pi}\,\sigma_j}\right)^{\!\rho} f_{E_j}(u_i) \tag{52}$$

Each force is weighted by a quantity which is a power ($\rho$) of how improbable the current measurement is (assuming a Gaussian distribution on the anthropometric measurements (Farkas, 1994)). This weighting prevents the model from actually attaining the average set of measurements; instead, it is simply biased towards them. For values of $\rho$ around 10, forces on measurements within one standard deviation of the mean are effectively ignored, making this primarily a prior on $q_b$.
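As an illustration (our sketch of (50)–(52) for a single measurement; mu and sigma stand for the tabulated statistics $\mu_j$ and $\sigma_j$):

    import numpy as np

    def anthropometric_force(M_val, dM_dx, mu, sigma, rho=10.0):
        # Spring force of Eq. (51), scaled by the improbability weight of
        # Eq. (52); dM_dx is the gradient of the measurement with respect
        # to the model point the force is applied to.
        E = 0.5 * (M_val - mu) ** 2                           # Eq. (50)
        g = np.exp(-E / sigma ** 2) / (np.sqrt(2 * np.pi) * sigma)
        weight = (1.0 - g) ** rho                             # Eq. (52) weight
        return -weight * (M_val - mu) * dM_dx                 # Eq. (51)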

For proportions, the energy involves two measurements:

$$E_{jk} = \tfrac{1}{2}\big(M_j(x_1, \ldots, x_n) - p_{jk} \cdot M_k(x'_1, \ldots, x'_{n'})\big)^2 \tag{53}$$

where $p_{jk}$ is the mean proportion between measurements $\mu_j$ and $\mu_k$. As with the measurements above, a force distribution for proportion data is obtained using this energy.

Appendix D: Face Feature and Edge Determination

The edge-based force field methods in Metaxas and Terzopoulos (1993) and Terzopoulos et al. (1988) require knowing which locations of the face model are likely to produce edges in an image. On the face, certain features are likely to produce edges in the image. The particular features chosen are the boundaries of the lips and eyes, and the top boundary of the eyebrows. Edges in the polygon mesh which correspond to these features were manually marked during the model construction, and are shown in Fig. 22(a).

Other likely candidates for producing edges are the regions of high curvature on the model. The base of the nose and the indentation on the chin are examples of high curvature edges, and can be seen in Fig. 22(b). Occluding boundaries on the model also produce edges in the image, and can be determined using the three-dimensional model.


Figure 22. Likely face features in an image.

The location of occlusion boundaries on the model will be useful when determining the quality of selected points for the measurement of optical flow.

Of course, for an edge to be produced in the image, the corresponding region of the face must be visible to the camera. This visibility determination is performed using the model and camera transformation. The model depth information can be used to determine the parts of the model that are visible to the camera (the front-most regions of the model). Figure 22(b) shows visible locations of the model (features, high curvature and occluding edges) that are likely to produce edges, given the model in (c).

Once the locations on the model which are likely to produce image edges are known, the intensity gradient can be weighted accordingly when forming the two-dimensional edge-based forces (Metaxas and Terzopoulos, 1993; Terzopoulos et al., 1988) that are applied to the model. These forces contribute to the value of $f_q$ (affecting parameters in both $q_b$ and $q_m$) based on (6). Over the course of fitting, these edge forces “pull” the model so that the model edges become aligned with their corresponding image edges.

Appendix E: Optical Flow Point Selection

The construction of the optical flow constraint on $q_m$ requires the selection of a set of image pixels from which to measure optical flow information. While it would be possible to use all pixels on the observed object, this would have two problems. Most obviously, it would be more expensive to solve the system; this is especially wasteful since most pixels do not provide a significant amount of useful information. Second, particular points actually provide harmful information, such as those near occlusion boundaries.

This section describes our method for the selection of pixels in the construction of (13).

Shi and Tomasi (1994) define good features for tracking by using the following criterion. The outer product of the image gradients at pixel $i$ is summed over a small window around that pixel:

$$\sum_{\text{window}(i)} \nabla I_i\, \nabla I_i^\top \tag{54}$$

A feature is selected when the smaller eigenvalue of this $2 \times 2$ matrix is greater than a threshold value. These features possess significant image gradients in two orthogonal directions, which makes them reliable tracking features as well as good sources of optical flow information. Features with one very large eigenvalue are also useful in our application, as these image points also provide good optical flow information.
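A direct (unoptimized) sketch of this selection criterion (ours; Ix and Iy are the spatial derivative images, and the threshold tau would be chosen empirically):

    import numpy as np

    def select_flow_pixels(Ix, Iy, half=3, tau=1.0):
        # Sum the outer product of the image gradient (Eq. (54)) over a
        # (2*half+1)^2 window and keep pixels whose smaller eigenvalue of
        # the resulting 2x2 matrix exceeds tau.
        H, W = Ix.shape
        good = np.zeros((H, W), dtype=bool)
        for y in range(half, H - half):
            for x in range(half, W - half):
                wx = Ix[y - half:y + half + 1, x - half:x + half + 1]
                wy = Iy[y - half:y + half + 1, x - half:x + half + 1]
                gxx = (wx * wx).sum()
                gyy = (wy * wy).sum()
                gxy = (wx * wy).sum()
                # smaller eigenvalue of [[gxx, gxy], [gxy, gyy]]
                lam_min = 0.5 * (gxx + gyy
                                 - np.sqrt((gxx - gyy) ** 2 + 4 * gxy ** 2))
                good[y, x] = lam_min > tau
        return good

In the full system, this mask would additionally be zeroed near predicted occlusion boundaries (and, in the validation experiments, near markers), as discussed next.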

However, not all pixels with significant gradient magnitude should be chosen. In particular, pixels on occlusion boundaries must be avoided, as they violate the optical flow constraint equation. The use of model-based techniques here provides a straightforward solution: assuming the model is at least roughly aligned with the image, pixels anywhere near the predicted occlusion boundaries of the model are simply not chosen. A detailed model, such as our face model, coupled with a method which computes occlusion boundaries (as in Appendix D), can be used to avoid these problems.

Traditional techniques for solving the optical flow constraint equation (10) often impose smoothness conditions on the flow field (Horn, 1986) to determine a solution. Smoothing is complicated by the fact that occlusion boundaries violate (10) and must be located to determine where to relax the smoothness conditions. The presence of a model entirely avoids the need for smoothing, as connectivity and discontinuity information is provided by the model. In addition, model-based optical flow techniques are more immune to the aperture problem (Horn, 1986), since information is combined over much larger image regions.

Besides providing the most accurate information possible, the set of chosen points must also adequately sample the motion information present in the image. The accurate measurement of a parameter in $q_m$ requires a sufficient number of pixels in the image corresponding to model points where the Jacobian of that parameter does not vanish. Note that some motion deformations may affect only a particular region of the face.


Using too few pixels in the computation results in a loss of accuracy, and can reach the point where the system loses track of the subject. Including too many pixels forces the pixel selection method to include pixels containing little useful information (such as those with a small gradient magnitude). It has been determined by experimentation that 10 to 20 pixels per parameter provide sufficient accuracy and robustness for the application of face tracking (at which point the results change negligibly when more pixels are used). Since there can be considerable overlap between the sets of pixels used to measure each parameter, the total number of pixels used can be fairly small. For each of the experiments here, $n$ is approximately 120 pixels.

Acknowledgments

We would like to thank Eero Simoncelli, Max Mintz, Matthew Turk, Pascal Fua, Frank Dellaert, Yaser Yacoob and the reviewers of this paper for their helpful comments and discussions. We are also grateful to Yaser Yacoob and the Center for Automation Research at the University of Maryland College Park for providing the image sequences in Figure 6 and Figure 7. This research is partially supported by ARPA DAMD-17-94-J-4486 and DAAH-049510067; ARO DURIP DAAH-049510023; NSF IRI93-09917, IRI95-04372 and NSF-MIP94-20393; and a RuCCS postdoctoral fellowship.

Notes

1. To keep this discussion informal, Eqs. (21)–(23) are merely suggestive of the process involved in statistical combination. As we shall see in the next section, the combination process can be formalized most perspicuously by recasting the entire solution process to take into account the statistical information from both solutions. Unfortunately, this presentation distracts from the high level differences between methods.

2. Stan Birchfield's KLT package is available at: http://vision.stanford.edu/~birch.

3. Tracking failure is simply defined as reaching a 10 pixel deviation; at this point the deviation typically increases, with tracking being regained only by luck.

References

Adiv, G. 1985. Determining 3-D motion and structure from optical flow generated by several moving objects. IEEE Pattern Analysis and Machine Intelligence, 7(4):384–401.

Ayache, N.J. and Faugeras, O.D. 1988. Building, registering, and fusing noisy visual maps. IJRR, 7(6):45–65.

Bar-Shalom, Y. and Fortmann, T. 1988. Tracking and Data Association. Academic Press.

Basu, S., Essa, I., and Pentland, A. 1996. Motion regularization for model-based head tracking. In Proceedings ICPR '96, p. C8A.3.

Bergen, J., Anandan, P., Hanna, K., and Hingorani, R. 1992. Hierarchical model-based motion estimation. In Proceedings ECCV '92, pp. 237–252.

Black, M. and Yacoob, Y. 1995. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In Proceedings ICCV '95, pp. 374–381.

Broida, T.J. and Chellappa, R. 1996. Estimation of object motion parameters from noisy images. IEEE Pattern Analysis and Machine Intelligence, 8(1):90–99.

Bryson, A. and Ho, Y. 1975. Applied Optimal Control. Halsted Press.

Choi, C., Aizawa, K., Harashima, H., and Takebe, T. 1994. Analysis and synthesis of facial image sequences in model-based image coding. IEEE Circuits and Systems for Video Technology, 4(3):257–275.

Cootes, T., Edwards, G., and Taylor, C. 1998. Active appearance models. In ECCV98, pp. II:484–498.

DeCarlo, D. and Metaxas, D. 1996. The integration of optical flow and deformable models with applications to human face shape and motion estimation. In Proceedings CVPR '96, pp. 231–238.

Durrant-Whyte, H.F. 1987. Consistent integration and propagation of disparate sensor observations. IJRR, 6(3):3–24.

Ekman, P. and Friesen, W. 1978. The Facial Action Coding System. Consulting Psychologist Press, Inc.

Essa, I.A. and Pentland, A.P. 1997. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Pattern Analysis and Machine Intelligence, 19(7):757–763.

Farkas, L. 1987. Anthropometric Facial Proportions in Medicine. Thomas Books.

Farkas, L. 1994. Anthropometry of the Head and Face. Raven Press.

Fua, P. and Brechbuhler, C. 1997. Imposing hard constraints on deformable models through optimization in orthogonal subspaces. Computer Vision and Image Understanding, 65(2):148–162.

Fua, P. and Leclerc, Y.G. 1995. Object-centered surface reconstruction: Combining multi-image stereo and shading. International Journal of Computer Vision, 16(1):35–56.

Gelb, A. 1974. Applied Optimal Estimation. MIT Press.

Guenter, B., Grimm, C., Wood, D., Malvar, H., and Pighin, F. 1998. Making faces. In Proceedings SIGGRAPH '98, pp. 55–66.

Hel-Or, Y. and Werman, M. 1996. Constraint fusion for recognition and localization of articulated objects. International Journal of Computer Vision, 19(1):5–28.

Horn, B. 1986. Robot Vision. McGraw-Hill.

Horn, B. and Weldon, E. 1988. Direct methods for recovering motion. International Journal of Computer Vision, 2(1):51–76.

Hrdlicka, A. 1972. Practical Anthropometry. AMS Press.

Kaucic, A. and Blake, A. 1998. Accurate, real-time, unadorned lip tracking. In ICCV98, pp. 370–375.

Kriegman, D.J., Triendl, E., and Binford, T.O. 1989. Stereo vision and navigation in buildings for mobile robots. RA, 5(6):792–803.

Lanitis, A., Taylor, C.J., and Cootes, T.F. 1997. Automatic interpretation and coding of face images using flexible models. IEEE Pattern Analysis and Machine Intelligence, 19(7):743–756.

Li, H., Roivainen, P., and Forchheimer, R. 1993. 3-D motion estimation in model-based facial image coding. IEEE Pattern Analysis and Machine Intelligence, 15(6):545–555.

Lowe, D.G. 1991. Fitting parameterized three-dimensional models to images. IEEE Pattern Analysis and Machine Intelligence, 13(5):441–450.

Maybeck, P. 1979. Stochastic Models, Estimation and Control, Volume 1. Academic Press.

Maybeck, P. 1982. Stochastic Models, Estimation and Control, Volume 2. Academic Press.

Metaxas, D. 1996. Physics-Based Deformable Models: Applications to Computer Vision, Graphics, and Medical Imaging. Kluwer Academic Publishers.

Metaxas, D. and Terzopoulos, D. 1993. Shape and nonrigid motion estimation through physics-based synthesis. IEEE Pattern Analysis and Machine Intelligence, 15(6):580–591.

Moses, Y., Reynard, D., and Blake, A. 1995. Robust real time tracking and classification of facial expressions. In Proceedings ICCV '95, pp. 296–301.

Negahdaripour, S. and Horn, B. 1987. Direct passive navigation. IEEE Pattern Analysis and Machine Intelligence, 9(1):168–176.

Netravali, A. and Salz, J. 1985. Algorithms for estimation of three-dimensional motion. AT&T Technical Journal, 64:335–346.

Pentland, A. and Horowitz, B. 1991. Recovery of nonrigid motion and structure. IEEE Pattern Analysis and Machine Intelligence, 13(7):730–742.

Pentland, A. and Sclaroff, S. 1991. Closed-form solutions for physically based shape modeling and recognition. IEEE Pattern Analysis and Machine Intelligence, 13(7):715–729.

Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. 1992. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press.

Reynard, D., Wildenberg, A., Blake, A., and Marchant, J. 1996. Learning dynamics of complex motions from image sequences. In Proceedings ECCV '96, pp. I:357–368.

Shabana, A. 1989. Dynamics of Multibody Systems. Wiley.

Sharma, R., Azoz, Y., and Devi, L. 1998. Reliable tracking of human arm dynamics by multiple cue integration and constraint fusion. In Proceedings CVPR '98.

Shi, J. and Tomasi, C. 1994. Good features to track. In Proceedings CVPR '94, pp. 593–600.

Simoncelli, E., Adelson, E., and Heeger, D. 1991. Probability distributions of optical flow. In Proceedings CVPR '91, pp. 310–315.

Strang, G. 1988. Linear Algebra and Its Applications. Harcourt, Brace, Jovanovich.

Terzopoulos, D. 1993. Physically-based fusion of visual data over space, time and scale. In Multisensor Fusion for Computer Vision, J. Aggarwal (Ed.). Springer-Verlag, pp. 63–69.

Terzopoulos, D. and Waters, K. 1993. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Pattern Analysis and Machine Intelligence, 15(6):569–579.

Terzopoulos, D., Witkin, A., and Kass, M. 1988. Constraints on deformable models: Recovering 3D shape and nonrigid motion. Artificial Intelligence, 36(1):91–123.

Yacoob, Y. and Davis, L.S. 1994. Computing spatio-temporal representations of human faces. In Proceedings CVPR '94, pp. 70–75.

Yuille, A.L., Cohen, D.S., and Halliman, P. 1992. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8:104–109.

Zhang, G.H. and Wallace, A. 1993. Physical modeling and combination of range and intensity edge data. Computer Vision, Graphics, and Image Processing, 58(2):191–220.