
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 57, NO. 8, AUGUST 2008

Three-Dimensional Head Tracking and Facial Expression Recovery Using an Anthropometric Muscle-Based Active Appearance Model

Marius D. Cordea, Emil M. Petriu, Fellow, IEEE, and Dorina C. Petriu, Senior Member, IEEE

Abstract—This paper describes a novel 3-D model-based tracking algorithm allowing the real-time recovery of the 3-D position, orientation, and facial expressions of a moving head. The method uses a 3-D anthropometric muscle-based active appearance model (3-D AMB AAM), a feature-based matching algorithm, and an extended Kalman filter (EKF) pose and expression estimator. Our model is an extension of the classical 2-D AAM and uses a generic 3-D wireframe model of the face based on two sets of controls: the anatomically motivated muscle actuators to model facial expressions and the statistically based anthropometrical controls to model different facial types.

Index Terms—Active appearance model (AAM), anthropometric, face analysis, face tracking, human–computer interaction (HCI), model-based video coding (MBVC), muscle based, 3-D head.

Manuscript received July 5, 2007; revised March 25, 2008. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and in part by Communications and Information Technology Ontario (CITO).

M. D. Cordea and E. M. Petriu are with the University of Ottawa, Ottawa, ON K1N 6N5, Canada. D. C. Petriu is with Carleton University, Ottawa, ON K1S 5B6, Canada.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIM.2008.923784

I. INTRODUCTION

AUTOMATIC 3-D tracking of human faces in image sequences is an important and challenging task in computer vision. Applications include human–computer interaction, facial recognition, computer graphics, and model-based video coding.

The 3-D face-tracking problem is technologically difficult, as the 3-D motion parameters have to be extracted from a 2-D video sequence. The effects of head motion and facial expressions are combined in video images, so it is crucial to successfully separate the rigid from the nonrigid motion of the head (a problem known as “pose/expression separation”). A tracking system trying to recover rigid and/or nonrigid facial movement also requires the use of a recursive estimation framework.

During the last decade, work has mainly concentrated on recovering the 3-D head pose (rigid motion) [1]–[3] and facial expressions (nonrigid motion) [4]–[7] from a sequence of images of the subject’s face.

In the late 1990s, Basu et al. [3] used the fusion of optical flow and a 3-D ellipsoidal model to track faces. Since the method does not use facial points to anchor the 3-D model to the face, the flow errors accumulate and the tracker eventually drifts off the face. DeCarlo and Metaxas [7] used a manually positioned wireframe head model to extract facial shape and expressions. Their method calculates the optical flow at feature points and regularizes it using the model motion. Kalman filtering and edge information are used to stabilize the measurements affected by optical flow error accumulation.

Statistical model-based tracking relies on a trained model to learn the different appearance and shape aspects that are encountered during tracking. In the case of face tracking, the model can be trained on a database containing different faces displaying various facial expressions under different illumination conditions.

Lately, the 2-D active appearance model (AAM) has increasingly been used in statistical model-based trackers. Some 2-D AAM implementations are able to track objects (faces, eyes, boxes, etc.) using small-sized model fitting [4], [6], [8]. Faces are tracked in a video stream at frame rates in the range of 5–10 frames/s with a low “shape-free” model resolution (64 × 64 pixels or less).

This paper describes a statistical model-based tracking algorithm that aims to recover the 3-D pose and facial expressions of a moving head. The proposed tracking algorithm uses an extended Kalman filter (EKF) to recover the head pose (global motion) and the 3-D anthropometric muscle-based AAM (3-D AMB AAM) [9] to recover the facial expressions (local motion). The EKF predicts both the rigid and nonrigid facial motions.

The presented solution enables fast and stable online tracking of extended sequences, despite noise and large variations in pose and expressions.

The proposed tracking approach is similar in a few ways to the recent work of Strom [2] and Ahlberg and Forchheimer [4]. Ahlberg and Forchheimer employ a 2-D AAM and the 3-D Candide face model [10] for face tracking. The facial expressions are described with the help of the MPEG-4 animation set, which is structure dependent. The system does not use a recursive motion estimator, which is crucial in attaining high tracking performance. As a result, the developed tracker works well only in constrained conditions and often fails due to fast head motion. Strom [2] extends the work of Azarbayejani and Pentland [1] in Structure-From-Motion (SFM) using a recursive EKF estimator and making the SFM tracker more robust by including recovery in case of failure with the help of the same 3-D Candide model. However, Strom’s system does not recover facial expressions.

II. 3-D TRACKING ALGORITHM

The first step in building the tracking system was the development of the 3-D AMB AAM. Our model is an extension of the classical AAM introduced by Edwards et al. [11]. The AAM is a statistical model based on the estimation of linear models of shape and texture variation. The object shape is defined by manually or automatically annotating each image example with landmarks. The shape examples are then aligned to a common mean using Gower’s generalized Procrustes analysis [12]. This geometrically normalized frame represents the shape-free reference to which the texture samples are warped and from which they are sampled. After geometrical normalization, principal component analysis (PCA) is used to model the shape and texture variability. Employing prior knowledge of the optimization space, the AAM can be fitted to unseen images using an analysis-by-synthesis (ABS) technique, given a close-to-target search initialization.
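As a concrete illustration of this training pipeline, the sketch below aligns landmark sets with a minimal generalized Procrustes loop and builds a truncated PCA basis. This is not the authors' code; the function names and defaults (including the 92% variance level used later in this paper) are illustrative.

```python
import numpy as np

def align_shapes(shapes):
    """Minimal generalized Procrustes alignment.
    shapes: array (n_shapes, n_points, dim) of annotated landmarks."""
    shapes = shapes - shapes.mean(axis=1, keepdims=True)                  # remove translation
    shapes = shapes / np.linalg.norm(shapes, axis=(1, 2), keepdims=True)  # remove scale
    mean = shapes[0]
    for _ in range(20):                                 # alternate rotation and mean updates
        for i, s in enumerate(shapes):
            u, _, vt = np.linalg.svd(s.T @ mean)        # Kabsch: best rotation of s onto mean
            shapes[i] = s @ (u @ vt)
        new_mean = shapes.mean(axis=0)
        new_mean /= np.linalg.norm(new_mean)
        if np.allclose(new_mean, mean, atol=1e-9):
            break
        mean = new_mean
    return shapes, mean

def pca_model(X, variance=0.92):
    """PCA of the row vectors in X; keep enough modes to explain `variance`."""
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    k = int(np.searchsorted(np.cumsum(explained), variance)) + 1
    return mu, Vt[:k].T        # sample mean and eigenvector basis (columns)
```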

The fitting algorithm generates synthetic examples to match the given images by efficiently adjusting the model parameters. The basic AAM fitting technique assumes a constant linear relationship between the error image and the incremental updates to the parameters. This relationship is learned offline to obtain a fast fitting process. The AAM was shown to be effective in segmenting objects in low-resolution 2-D images [11], [13], [14]. When modeling objects in high-resolution 2-D and 3-D images, this approach becomes infeasible due to excessive storage and computational requirements. Moreover, it was shown [13], [15], [16] that, due to their representational capabilities, 2-D AAMs can generate illegal model instances. Illegal AAM instances can also be generated by sources that break the linearity assumption of the 2-D AAM training set [15], such as variations in face orientation and errors induced by human-assisted shape labeling. In addition, when used for tracking [8], the 2-D AAM is not applicable to six degree-of-freedom (DOF) pose and expression estimation.

A. 3-D Face Model

The 3-D AMB AAM is a method for modeling the shape and appearance of human faces in 3-D using a constrained 3-D AAM. The classical AAM parameterizes the facial shape in a 2-D or 3-D space defined by the labeled landmark values. The 3-D AMB model uses a generic 3-D wireframe model of the face to solve the storage-, pose-, and labeling-related problems.

The shape of our model is described by two sets of controls, i.e., the anatomically motivated muscle actuators to model facial expressions and the anthropometrical controls to model different facial types. The 3-D face deformations are defined on the anthropometric expression (AE) space, which is derived from Waters’ muscle-based face model [17] and from the Facial Action Coding System (FACS) [18]. FACS defines the Action Unit (AU) as the basic visual facial movement controlled by one or more muscles. The muscle model is independent of the facial topology and is an efficient method that represents facial expressions by a reduced set of parameters.

Fig. 1. Modeled AUs (from [18]).

Fig. 2. Instances of the 3-D AMB model.

AUs occur alone or in combinations. The combinations may be additive, in which case the appearance caused by each AU is preserved, or nonadditive, in which case the appearance caused by each AU is changed. Nonadditive interactions, which are also called coarticulation effects, are difficult to recognize, and their exposed appearance is highly dependent on timing. To better control the muscular activity during the fitting process and eliminate nonadditive combinations, we built context-dependent AUs with values in [−1, 1] and called them Expression AUs (EAUs). First, the new muscles are composed of symmetrical left/right facial muscles, which deform simultaneously. Second, they represent combinations of the implemented AUs or pairs of muscles that pull in opposite directions. For example, the Lip Puller and Lip Depressor AUs can cancel each other out while making an expression. To avoid the coarticulation effects, a combination of the two muscles, named Mouth Corners, can activate only one AU at a time. The new muscle behaves as AU12 when pulling upward (for values in [0, 1]) or as AU15 when pulling downward (for values in [−1, 0]). This way, the newly developed EAUs only allow additive combinations.
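A minimal sketch of this signed-control idea, assuming the Mouth Corners EAU simply routes its value to one of the two opposing AUs (the function name and the returned dictionary are illustrative, not the authors' implementation):

```python
def mouth_corners_eau(value):
    """Map the Mouth Corners EAU in [-1, 1] onto the opposing AU pair:
    only one AU is active at a time, so combinations stay additive."""
    value = max(-1.0, min(1.0, value))
    return {"AU12": max(value, 0.0),    # Lip Puller: active for positive values
            "AU15": max(-value, 0.0)}   # Lip Depressor: active for negative values

print(mouth_corners_eau(0.7))   # {'AU12': 0.7, 'AU15': 0.0}
print(mouth_corners_eau(-0.4))  # {'AU12': 0.0, 'AU15': 0.4}
```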

We chose four EAUs as context-dependent AUs to better control the muscular activity during the image fitting process and eliminate expression coarticulation effects: Jaw Drop (AU26), Mouth Corners (AU12 and AU15), Eyebrow Middle (AU2 and AU4), and Eyebrow Inner (AU1 and AU4) (see Fig. 1).

We also define Anthropometry AUs (AAUs), which are chosen in conformity with anthropometrical statistics, to provide a way to deform the 3-D generic mesh into a personal one. The AAUs are deformation parameters used to make the jaw wider, the nose bridge narrower, etc. To keep the AAUs independent of the facial mesh structure, we built them from symmetric pairs of geometrically constrained facial muscles. We constrained the deformation to allow displacement in one dimension only. For example, Nose Width locally deforms the nose shape only in the horizontal direction.

Fig. 2 illustrates a few instances of the 3-D AMB model obtained by varying the anthropometric and muscle parameter values. The first face instance represents the neutral instance. The second face instance was obtained by pulling the Mandible Width, Nose Width, Nose Length, and Chin Height AAUs and the Eyebrow Inner EAU. The third face instance is the result of activating the following actuators: the Face Width, Mandible Width, Nose Width, and Nose Length AAUs and the Eyebrow Middle and Mouth Corners EAUs. The fourth face instance shows, in profile, an expression caused by the Jaw Drop and Mouth Corners EAUs.

Our 3-D wireframe model contains 512 vertices, four EAUs, and eight AAUs. The model appearance is represented by the texture map of the 3-D model vertices.

B. Active Model Parameterization

Model training aims to formulate a combined compact shape–appearance representation of faces and an efficient fitting strategy. To have better control over model deformation, we formulate our 3-D AAM on the combined AE shape and texture space. We project the 3-D model onto N images and adjust the AAU and EAU values so that the contours of the projected model best fit the facial image. Building the shape-free normalized frame is a simple matter of transforming each fitted 3-D face to a reference state (no pose and no expression). The shape-free framework is basically a normalization step that eliminates the geometrical differences between individuals. The normalization step creates a convex face space, which simplifies the finding of basis vectors that span the face set. The recorded shape values are used to extract the texture intensities from the 3-D textured model in a neutral state.

The first stage of the training process consists of applying PCA to the collected shape and texture vectors. Let $a$ and $g$ represent a synthesized AE shape and texture, and let $\bar{a}$ and $\bar{g}$ represent the corresponding sample means. New instances are generated by adjusting the eigenprojections $p_a$ and $p_g$ of the AE shape and texture:

$$a = \bar{a} + U_a p_a, \qquad g = \bar{g} + U_g p_g \tag{1}$$

where $U_a$ and $U_g$ represent the eigenvectors of the AE shape and texture variations estimated from the training set.

To remove the correlations between AE shape and texture, the two PCA spaces are coupled through another PCA transform as

$$\begin{pmatrix} W_a p_a \\ p_g \end{pmatrix} = U_c c \tag{2}$$

where $W_a$ is a diagonal weighting matrix employed to compensate for the differences between the AE shape and pixel intensity measurement units.

This way, we obtain a combined AE shape and texture parameterization defined by combined parameters. The model instances (physiognomies, expressions, and texture) are controlled by the $c$ parameters as

$$a = \bar{a} + Q_a c, \qquad g = \bar{g} + Q_g c \tag{3}$$

where $a$ and $g$ represent a synthesized AE shape (i.e., muscle actuators) and texture, and $\bar{a}$ and $\bar{g}$ are the corresponding sample means. $Q_a$ and $Q_g$ represent the modes of shape and texture variation in the combined AE texture space. Any facial shape or texture can be represented in the compressed shape–texture space by using the projection equations

$$c = Q_g^T (g - \bar{g}), \qquad c = Q_a^T (a - \bar{a}). \tag{4}$$

The eigenspaces of shape and texture and the combined modes ($p_a$, $p_g$, $c$) are truncated so that each represents 92% of the training set variance.
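The coupling of (1)–(3) can be summarized in a few lines of linear algebra. The sketch below reuses `pca_model` from the earlier sketch; the scalar choice for $W_a$ (ratio of total texture to total shape variance) is a common convention and an assumption here, not a detail stated in the paper.

```python
import numpy as np

def build_combined_model(A, G, variance=0.92):
    """Combined AE shape/texture parameterization of eqs. (1)-(3).
    A: (n, dim_a) AE shape vectors; G: (n, dim_g) texture vectors."""
    a_mean, Ua = pca_model(A, variance)          # eq. (1): shape eigenspace
    g_mean, Ug = pca_model(G, variance)          # eq. (1): texture eigenspace
    pa = (A - a_mean) @ Ua                       # per-example eigenprojections
    pg = (G - g_mean) @ Ug
    # W_a balances shape units against pixel intensities (scalar variant)
    wa = np.sqrt(pg.var(axis=0).sum() / pa.var(axis=0).sum())
    coupled = np.hstack([wa * pa, pg])           # stacked vectors of eq. (2)
    _, Uc = pca_model(coupled, variance)         # combined modes U_c
    na = Ua.shape[1]
    Qa = Ua @ (Uc[:na] / wa)                     # eq. (3): a = a_mean + Qa @ c
    Qg = Ug @ Uc[na:]                            # eq. (3): g = g_mean + Qg @ c
    return a_mean, g_mean, Qa, Qg
```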

The advantage of formulating the AAM in the AE space is that the anthropometric characteristics and facial expressions are directly obtained during the synthesis phase. Extensive image segmentation experiments [9] demonstrate the superior accuracy of the proposed algorithm over the classical 2-D AAM. In addition, our AAM incorporates the full six-DOF pose and animation parameters as part of the minimization procedure. This allows for 3-D pose estimation and facial expression recognition without any restriction on face geometry.

Each AAM comes with a search or fitting method used in image interpretation or segmentation tasks.

The appearance model parameters control the shape (facial types and expressions) and texture in the normalized model frame according to (3). Additionally, the 3-D model projection in the image plane also depends on the 3-D pose transformations, namely, the translations, rotations, and scaling $\gamma = (r_x, r_y, r_z, t_x, t_y, t_z)$, which means that the model appearances and deformations are defined by the parameter group $(c, \gamma)$. During fitting, we sample the pixels in the region of the image $g_i$ defined by $(c, \gamma)$ and project them into the texture of the model's normalized frame.

As the model fitting strategy, we adopt a simple additive-update AAM scheme, which was shown to produce the best results [19]. The model fitting is treated as an optimization problem that minimizes the difference between the new image and the one synthesized by the 3-D appearance model. Basically, the AAM search formulation uses a regression approach where the texture difference vectors $\partial g = g_i - g_m$ correspond to model and pose parameter displacement vectors $\partial d = (\partial c, \partial \gamma)$. To increase the search efficiency, this correspondence is learned offline and provides the corrections of the model parameters during the image search.
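A minimal sketch of such an additive-update search loop, assuming a regression matrix `R` learned offline and caller-supplied `sample` and `synthesize` functions (all names here are illustrative, not the authors' implementation):

```python
import numpy as np

def aam_search(image, params, R, sample, synthesize, max_iter=8):
    """Additive-update AAM fitting: the parameter correction is a fixed
    linear function (R) of the texture residual between image and model."""
    def residual(p):
        return sample(image, p) - synthesize(p)   # dg = g_i - g_m

    r = residual(params)
    err = float(r @ r)
    for _ in range(max_iter):
        dd = R @ r                                # predicted update dd = (dc, dgamma)
        for step in (1.0, 0.5, 0.25):             # damped retries if the error grows
            trial = params - step * dd
            r_trial = residual(trial)
            err_trial = float(r_trial @ r_trial)
            if err_trial < err:                   # accept the first improving step
                params, r, err = trial, r_trial, err_trial
                break
        else:
            break                                 # no improving step: stop
    return params, err
```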

The 3-D AMB AAM model can be trained for various tasks: image segmentation, expression recognition, face tracking, etc. For tracking, our 3-D AMB AAM model was trained using 227 faces and is described by nine shape modes, 34 texture modes, and 20 modes that retain 92% of the combined shape and texture variation of the training set of faces. The resultant mean facial texture patch contained, on average, 2200 pixels.

C. Tracking Algorithm

The EKF offers a recursive and temporal solution to the facial pose and expression tracking problem. The proposed tracking system augments an SFM–EKF framework [20] with the 3-D AMB AAM employed to recover facial expressions. During each iteration, the EKF provides an optimal estimate of the current state using the current input measurement and predicts the future state using the underlying state model. The EKF state and measurement equations can be expressed as

$$s(k+1) = A s(k) + \xi(k), \qquad m(k) = H s(k) + \eta(k) \tag{5}$$

where $s$ is the state vector, $m$ is the measurement vector, $A$ is the state transition matrix, $H$ is the Jacobian that relates state to measurement, and $\xi$ and $\eta$ are error terms modeled as Gaussian white noise. The EKF framework implements two main modules:

1) the tracking module, delivering the measurement vector $m = (p_i(u_i, v_i), a_j)$, that is, the 2-D point measurements $p_i(u_i, v_i)$ of the tracked facial features and the facial expressions $a_j$ measured by fitting the 3-D active model on each frame;

2) the estimator module (for the estimation of 3-D motion and expressions), delivering a state vector

$$s = (t_x, t_y, t_z, \alpha, \beta, \lambda, f, a_j)$$

where $(t_x, t_y, t_z, \alpha, \beta, \lambda)$ are the six 3-D camera/object relative motion parameters, namely, translation and rotation, $f$ is the camera focal length, and the $a_j$ are the 3-D model's actuators (muscles) that generate facial expressions, with $j = 1, \ldots, N_a$, where $N_a$ is the number of actuators.

The set of 2-D features to be tracked is obtained by projecting their corresponding 3-D rigid model points, such as eye corners, nose tip, etc. The EKF observations are the 2-D feature coordinates $(u, v)$ and the expressions $a_j$, which are concatenated into a measurement vector at each time step. The EKF state vector includes the facial expression actuators that we wish to filter and estimate ($a_1$–$a_4$ = Jaw Drop, Mouth Corners, Eyebrow Middle, Eyebrow Inner).

D. Motion and Measurement Models

The dynamic model of the head tracking is a discrete-time physical model of rigid body motion. Similar to [5], the dynamics model is chosen simply as an identity transform plus noise. The state equation in (5) can be written as

$$\begin{pmatrix} t_i \\ \omega_i \\ f \\ a_j \end{pmatrix}_{k+1} = \begin{pmatrix} I & 0 & 0 & 0 \\ 0 & I & 0 & 0 \\ 0 & 0 & I & 0 \\ 0 & 0 & 0 & I \end{pmatrix} \cdot \begin{pmatrix} t_i \\ \omega_i \\ f \\ a_j \end{pmatrix}_k + \xi(k) \tag{6}$$

where $i = x, y, z$ is the index of the coordinate axes of the camera reference frame, $I$ is the identity matrix, and $\omega_i$ represents an incremental rotation, similar to the one used in [5] to estimate the interframe rotation. The incremental rotation computed at each frame step is combined into a global quaternion vector $(q_0, q_1, q_2, q_3)$ used in the EKF linearization process and the rotation of the 3-D model.
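A minimal sketch of the corresponding EKF time update under this identity-dynamics assumption (the state layout and the process noise covariance `Q` are illustrative):

```python
import numpy as np

def ekf_predict(s, P, Q):
    """Time update for the identity-dynamics model of eq. (6):
    s = (tx, ty, tz, wx, wy, wz, f, a1..a4) is propagated unchanged,
    while its covariance P grows by the process noise Q."""
    A = np.eye(len(s))            # block-identity transition matrix of eq. (6)
    s_pred = A @ s                # constant-state prediction
    P_pred = A @ P @ A.T + Q      # covariance propagation
    return s_pred, P_pred
```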

The measurement model relates the state vector $s$ to the 2-D feature locations $p_i(u_i, v_i)$. Points $p_i(X_i, Y_i, Z_i)$ of the object reference frame become points $p_{ci}(X_{ci}, Y_{ci}, Z_{ci})$ of the camera reference frame, whereas the camera reference points become points $p_i(u_i, v_i)$ of the image plane using a perspective projection as

$$\begin{pmatrix} X_{ci} \\ Y_{ci} \\ Z_{ci} \end{pmatrix} = T(t_x, t_y, t_z) + R(\alpha, \beta, \lambda) \begin{pmatrix} X_i \\ Y_i \\ Z_i \end{pmatrix}$$

$$\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \frac{f}{Z_{ci}} \begin{pmatrix} X_{ci} \\ Y_{ci} \end{pmatrix}, \qquad i = 1, \ldots, N_p \tag{7}$$

where $T$ and $R$ represent the object (or camera) translation and rotation matrices, $f$ is the camera focal length, and $N_p$ is the number of tracked points.

At each filter cycle, we have to calculate the partial derivatives of the measurements with respect to each of the unknown state parameters. Lowe [21] proposed a reparameterization of the projection equations that simplifies the calculation of the Jacobian $H$ by expressing the translations in the camera coordinate system rather than in model coordinates. In this case, (7) takes the following form:

$$\begin{pmatrix} X_{ci} \\ Y_{ci} \\ Z_{ci} \end{pmatrix} = R(\alpha, \beta, \lambda) \begin{pmatrix} X_i \\ Y_i \\ Z_i \end{pmatrix}$$

$$\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \begin{pmatrix} \dfrac{f}{Z_{ci} + t_z} X_{ci} + t_x \\ \dfrac{f}{Z_{ci} + t_z} Y_{ci} + t_y \end{pmatrix}. \tag{8}$$
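The reparameterized projection of (8) is straightforward to implement; a small sketch, vectorized over the tracked points, with illustrative names:

```python
import numpy as np

def project_points(points, R, t, f):
    """Lowe-style projection of eq. (8).
    points: (N, 3) model points; R: 3x3 rotation; t = (tx, ty, tz); f: focal length."""
    tx, ty, tz = t
    pc = points @ R.T                 # camera-frame coordinates (Xci, Yci, Zci)
    d = f / (pc[:, 2] + tz)           # f / (Zci + tz)
    u = d * pc[:, 0] + tx             # u = f/(Zci + tz) * Xci + tx
    v = d * pc[:, 1] + ty             # v = f/(Zci + tz) * Yci + ty
    return np.stack([u, v], axis=1)
```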

The first partial derivative (Jacobian) of each measurement with respect to each state variable, used in the EKF minimization process, has the form

$$H_{i,j} = \left( \frac{\partial h_i(s)}{\partial s_j} \right), \qquad i = 1, \ldots, 2N_p + N_a, \quad j = 1, \ldots, M \tag{9}$$

where $s$ is the $M$-component state vector, $h$ is the function that relates the state $s$ to the measurement $m$, $N_p$ is the number of tracked features, and $N_a$ is the number of expression actuators.

The partial derivatives of the $(X_{ci}, Y_{ci}, Z_{ci})$ coordinates with respect to the small rotation angles $(\omega_x, \omega_y, \omega_z)$ in Lowe's formulation simplify the computation of the $H$ matrix as

$$\begin{pmatrix} 0 & -Z_{ci} & Y_{ci} \\ Z_{ci} & 0 & -X_{ci} \\ -Y_{ci} & X_{ci} & 0 \end{pmatrix}. \tag{10}$$

Considering a state vector parameter $p$ and using the chain rule, we have

$$\frac{\partial u_i}{\partial p} = f d \left( \frac{\partial X_{ci}}{\partial p} - d\, X_{ci} \frac{\partial Z_{ci}}{\partial p} \right)$$

$$\frac{\partial v_i}{\partial p} = f d \left( \frac{\partial Y_{ci}}{\partial p} - d\, Y_{ci} \frac{\partial Z_{ci}}{\partial p} \right) \tag{11}$$

where $d = 1/(Z_{ci} + t_z)$ (see Table I).
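Combining (10) and (11) gives the rotation block of the Jacobian for one tracked feature. A sketch under the same conventions as the projection snippet above (names are illustrative):

```python
import numpy as np

def rotation_jacobian(pc, tz, f):
    """Rows du/dw and dv/dw for one feature, from eqs. (10)-(11).
    pc = (Xci, Yci, Zci) camera-frame point; tz: z-translation; f: focal length."""
    Xc, Yc, Zc = pc
    d = 1.0 / (Zc + tz)
    # eq. (10): derivatives of (Xci, Yci, Zci) w.r.t. (wx, wy, wz)
    dXc = np.array([0.0, -Zc,  Yc])
    dYc = np.array([ Zc, 0.0, -Xc])
    dZc = np.array([-Yc,  Xc, 0.0])
    du = f * d * (dXc - d * Xc * dZc)   # eq. (11), u-row
    dv = f * d * (dYc - d * Yc * dZc)   # eq. (11), v-row
    return du, dv
```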


TABLE I
JACOBIAN FEATURE COMPONENTS WITH RESPECT TO THE POSE AND FOCAL LENGTH

Fig. 3. Block diagram of the EKF tracking system.

The state actuator parameters ($a_j$, $j = 1, \ldots, 4$) are fully observable using the 3-D AMB AAM fitting, and the part of the Jacobian related to them is easy to compute, i.e.,

$$\frac{\partial a_j}{\partial p} = \begin{cases} 1, & \text{if } a_j = p \\ 0, & \text{if } a_j \neq p \end{cases} \tag{12}$$

where $p$ is a state parameter.

We compute a three-parameter incremental rotation $(\omega_x, \omega_y, \omega_z)$, similar to that used in [5], to estimate the interframe rotation. The incremental rotation computed at each frame step is combined into a global quaternion vector $(q_0, q_1, q_2, q_3)$ used in the EKF linearization process and the rotation of the 3-D model.
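A minimal sketch of folding the per-frame incremental rotation into the global quaternion (axis-angle conversion plus a Hamilton product; the composition order assumes the increment is expressed in the camera frame):

```python
import numpy as np

def update_quaternion(q, w):
    """Compose the global quaternion q = (q0, q1, q2, q3), scalar part first,
    with the incremental rotation w = (wx, wy, wz) estimated for the frame."""
    angle = np.linalg.norm(w)
    if angle < 1e-12:
        return q                                   # negligible rotation
    axis = w / angle
    dq = np.concatenate([[np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis])
    w0, x0, y0, z0 = dq                            # Hamilton product dq * q
    w1, x1, y1, z1 = q
    q_new = np.array([w0*w1 - x0*x1 - y0*y1 - z0*z1,
                      w0*x1 + x0*w1 + y0*z1 - z0*y1,
                      w0*y1 - x0*z1 + y0*w1 + z0*x1,
                      w0*z1 + x0*y1 - y0*x1 + z0*w1])
    return q_new / np.linalg.norm(q_new)           # renormalize against drift
```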

The EKF provides a convenient mechanism for fusing all the information about the reliability of the measurements (feature positions and muscle intensities) into an estimate of the 3-D rigid and nonrigid facial motion and the focal length of the camera.

E. Tracking Update

The EKF tracking update stage is illustrated in Fig. 3. At each frame, the EKF computes an estimate of the rigid motion, camera focal length, and actuator values (facial expressions). For a given sequence, the vector $(t_x, t_y, t_z, \alpha, \beta, \lambda, a_1, \ldots, a_4)$ describes the time-dependent pose and animation of the 3-D wireframe model. A 2-D gradient feature tracking method performs the 2-D tracking, reinforced by the EKF pose estimation output.

Fig. 4. ABS component of the EKF.

The functionality of the ABS estimator is illustrated in Fig. 4. The AAM fitting process needs the facial pose and the $c$ parameters that define the combined AE texture space. The estimated expression parameters $a_j^+$ from the EKF are projected onto the $c$ parameters using (4). These parameters and the EKF-estimated pose are used by the ABS component to refine the image fit based on image differences. The corrected $c$ parameters are back projected into the muscle actuators $a_j$ using (3) and are used as input for the EKF. Then, the EKF performs another step of filtering/estimation, and the cycle continues.
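One cycle of this EKF/ABS coupling could look as follows; the `ekf` and `aam` objects and all their methods are illustrative stand-ins for the components of Figs. 3 and 4, not the authors' API:

```python
def track_frame(frame, ekf, aam):
    """One EKF/ABS tracking cycle, as described above (hypothetical API)."""
    pose_pred, a_pred = ekf.predict()            # EKF time update: pose + expressions
    c = aam.expressions_to_c(a_pred)             # project a_j^+ onto c, eq. (4)
    c, pose = aam.refine(frame, c, pose_pred)    # ABS fit refines (c, pose) on the image
    a = aam.c_to_expressions(c)                  # back project c into actuators, eq. (3)
    features = aam.track_features(frame, pose)   # 2-D gradient feature measurements
    ekf.update(features, a)                      # EKF measurement update
    return pose, a
```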

The main task of the Kalman filter in this implementation is to filter and predict the global and local motion and, hence, to ease the load on the AAM search space. Without the EKF, the fitted appearance model of the current frame is propagated to the next frame and used as the input for the next image fit. In this case, because of large pose variations, fast motion, or local minima, the tracking quickly fails. In the EKF implementation, at each frame, the expression part of the AE shape vector is replaced by the estimated EKF expression vector. The performed experiments show that integrating the ABS fitting procedure in the EKF recursive scheme results in robust expression tracking. The 3-D model is also used to initialize the 3-D EKF tracking system and to test for feature occlusions during tracking.

Fig. 5. Tracking instances from the synthetic-pose-expression sequence (original, normalized, and synthetic frames). The face is from the Man Machine Interaction (MMI) database [22].

III. EXPERIMENTS

The described system can track the 3-D head pose and four EAUs (Jaw Drop, Mouth Corners, Eyebrow Middle, and Eyebrow Inner) at 10 frames/s without optimizations. The tracking performance is validated with three sets of experiments. The input consists of synthetic and real sequences of a human face moving in front of the camera and performing different expressions. The first set of experiments performs a calibration of the tracking system using a synthetic sequence of a 3-D face model performing rigid and nonrigid motion (synthetic-pose-expression experiments). The second set of experiments uses ten real image sequences from the database in [22], where the faces display different expressions but have limited rigid motion (real-expression experiments). The third set of experiments is performed on three real image sequences from the Boston University database of video sequences and ground truth for the evaluation of 3-D head tracking [23], which contains faces with a large range of rigid motion and spontaneous expressions, used for the real-pose-expression experiments [24].

During tracking, we use the facial texture symmetry in the warping procedure for yaw angles greater than 20°. The texture half to be mirrored is dictated by the yaw angle of the 3-D model, which measures the face orientation about the vertical axis. This way, the synthesized shape-free texture will be more accurate, since severe distortions and occlusions are introduced by rotation in depth.
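A sketch of that mirroring decision, assuming textures stored as (rows, cols) arrays split at the face midline and a sign convention in which positive yaw turns the left half away from the camera (both assumptions are illustrative):

```python
import numpy as np

def symmetric_texture(yaw_deg, left_half, right_half, threshold=20.0):
    """Replace the occluded half-face texture with the mirror of the
    visible half before assembling the shape-free texture."""
    if yaw_deg > threshold:                    # left half rotated away: mirror right
        left_half = right_half[:, ::-1]
    elif yaw_deg < -threshold:                 # right half rotated away: mirror left
        right_half = left_half[:, ::-1]
    return np.concatenate([left_half, right_half], axis=1)
```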

A. Synthetic-Pose-Expression Experiments

In the synthetic experiments, a previously recorded sequence of 2-D images representing 3-D head model poses and expressions is played as “live” video and tracked with our EKF system. The estimated motion values are compared with the measured motion values of the synthetic image sequence. The sequence contained 225 frames, displaying a synthetic head performing large rotations in the range of (−45°, 45°) for the yaw angle and in the range of (−20°, 20°) for the roll and pitch angles, and spontaneous facial expressions (Jaw Drop, Mouth Corners, Eyebrow Middle, and Eyebrow Inner) in the range of [−1, 1]. We minimized the errors by fine-tuning the initialization process of the EKF. During the calibration process, we evaluate the tracker accuracy by computing the root mean square error (RMSE) between the true and estimated rotation angles and between the true and estimated expressions (EAUs), i.e.,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{\mathrm{true}} - y_{\mathrm{estim}})^2} \tag{13}$$

where $y_{\mathrm{true}}$ and $y_{\mathrm{estim}}$ are the ground-truth and estimated angle or EAU parameters, and $n$ is the number of frames in the sequence. We experimentally found in one case that the calibrated RMSEs for the rotation angles (in degrees) and expressions (in deformation units) are

RMSE_X = 2.785 (Pitch)
RMSE_Y = 3.005 (Yaw)
RMSE_Z = 1.150 (Roll)
RMSE_JD = 0.095 (Jaw Drop)
RMSE_MC = 0.122 (Mouth Corners)
RMSE_EM = 0.256 (Eyebrow Middle)
RMSE_EI = 0.105 (Eyebrow Inner).
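A direct check of (13); the `skip` argument reflects the observation below that the first few frames, needed for the EKF to converge, inflate the expression RMSEs (the helper name is illustrative):

```python
import numpy as np

def rmse(y_true, y_estim, skip=0):
    """Root mean square error of eq. (13), optionally skipping the
    first `skip` frames of the sequence."""
    e = np.asarray(y_true[skip:], dtype=float) - np.asarray(y_estim[skip:], dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))
```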

The pose statistics are comparable to the Polhemus sensor accuracy, indicating that the vision estimate can be as accurate as the Polhemus sensor [1].

Fig. 5 shows instances of the 3-D tracking of a synthetic head. In some frames, the expression fitting was less accurate due to significant pose change and occlusion.

The experiment showed that the accuracy of the expression tracking increased when the face was close to the frontal position. The expression tracking drifted for extreme rotation angles, which cause a reduction of the expression information. Our experiments also showed that the RMSE of each expression is lower if we do not take into consideration the first few frames necessary for the EKF to converge.

B. Real-Expression Experiments

The real-expression experiments use ten real image sequences from the web-based database presented in [22], where the faces display different expressions (Jaw Drop, Mouth Corners, Eyebrow Middle, and Eyebrow Inner) with limited rigid motion. These experiments allow us to study the effect of including the deformation parameters in the EKF. In a first test, a face is tracked using only the 3-D AMB AAM, by sequentially fitting the active model in each frame. At each frame, the ABS iterative search starts with the previous frame's estimate of the model state (shape–texture parameters). A second test included the animation parameters in the recursive framework. The advantage of using a recursive estimator in tracking facial expressions (Jaw Drop and Mouth Corners) is illustrated in Fig. 6. The dashed gray line represents the ground truth of the AU throughout the sequence. The light-colored line shows the output animation parameter synthesized by the fitting process without using the EKF. The dark line shows the “smoothing” effect of the EKF over the expression parameters. Fig. 6 also illustrates snapshots of the tracking. Each snapshot shows the normalized and reconstructed face overlaid onto the original one. The frame number, the iterations needed for convergence, and the fitting similarity are also shown.

The tracking system was tested on ten real-expression sequences. Tracking performed well in all the image sequences; neither the rigid nor the nonrigid module failed. This is because the faces have frontal orientation, limited rigid motion, and no occlusion, which allowed both tracking methods to follow the expression signature without failing.

Fig. 6. Typical real-expression EKF tracking sequence (face from the MMI database [22]).

Fig. 7. Typical real-pose-expression EKF tracking sequence (face from the BU database [23]).

C. Real-Pose-Expression Experiments

The tracking system was tested on three pose-expression image sequences from the Boston University (BU) database [23], which contains faces with a large range of rigid motion and facial expressions. The images of the sequences were captured at a resolution of 320 × 240 pixels under different lighting conditions. Each sequence contains over 400 frames (8–15 s long) captured at a sampling rate of 15 Hz. In addition, each sequence provides the ground truth of the head pose in each frame. We chose a subset of this database, namely, three movie files where the persons perform facial expressions as well (Jaw Drop and Mouth Corners). No face from the test sequences was present in the training set. The tracking system can cope with different degrees of rigid motion and nonrigid deformations, as illustrated in Fig. 7. The tracking instances in Fig. 7 show the normalized and reconstructed face overlaid onto the original one. Each snapshot also shows the frame number, the iterations needed for convergence, and the fitting similarity.

The estimated orientation during tracking against the ground truth for the displayed example is illustrated in Fig. 8. The effect of the EKF on the nonrigid tracking is dramatic in this case.

As shown in Fig. 9, the tracking of facial expressions would have been difficult without the Kalman filter. The extreme values illustrated by the graph in light color indicate ABS tracking failures, when no EKF was used to predict and filter the facial expressions. Since no ground truth for the expressions was provided, the tracking accuracy was judged subjectively. For the illustrated example (Fig. 7), the subject displays a “happiness” expression starting in the second half of the sequence (i.e., the mouth- and eyebrow-related AUs were activated). Despite large rigid motions, the rigid tracking did not fail in any of the sequences. Based on the residual error, the nonrigid tracking drifted away between two and four times in two sequences, mainly due to the pose changes, but recovered in the next frame with the help of the EKF pose and expression estimations. The nonrigid tracking was considered lost when its residual error had a high value for several frames. In this case, the ABS search discarded the estimated model parameters and restarted with the EKF-estimated pose and the mean of the model parameters. The rotation angle RMSEs (in degrees) for the three real-pose-expression sequences are

RMSE_X = 2.123, 1.465, 2.795 (Pitch)
RMSE_Y = 5.408, 2.255, 4.207 (Yaw)
RMSE_Z = 1.476, 1.288, 4.731 (Roll).

Again, the pose statistics indicate an accuracy comparable to that of the Polhemus sensor.

The tracking system was implemented in C++ under a Windows environment. The current implementation runs at 10 frames/s on a PC with a 2.3-GHz Pentium-4 CPU and an ATI Mobility Radeon 9700 graphics card. During the initialization phase, when the 3-D active model needs to adapt to the facial image, a maximum of eight iterations is allowed. Since the initial estimate is near optimum during tracking, fewer than eight iterations are needed to fit the active model on each frame. Based on this, a maximum of four iterations is allowed for model fitting during tracking. In this implementation, the active model search needs about 25 ms/iteration, that is, the 3-D AMB AAM fitting algorithm takes approximately 0.1 s for each frame.

Fig. 8. Real versus estimated orientation for a real-pose-expression experiment.

Fig. 9. Effect of the EKF on expression tracking for a real-pose-expression experiment.

IV. CONCLUSION

This paper has described a statistical model-based tracking algorithm that aims to recover the 3-D pose and facial expressions of a moving head. The described tracking algorithm uses an EKF to recover the head pose and the 3-D AMB AAM to recover the facial expressions. The EKF predicts both the rigid and nonrigid facial motions. The presented solution enables stable online tracking of extended sequences, despite noise and large variations in pose and expressions. The system was implemented on a 2.3-GHz Pentium-4 Windows PC platform, processing video images at over 10 frames/s. The resulting motion tracking system was shown to work in a realistic environment: without makeup on the face, with an uncalibrated camera, and under unknown lighting conditions and background.

The head tracking was successfully validated on synthetic and real standard image sequences. The developed tracking system demonstrated the great value of including facial expression tracking in the EKF recursive framework. Since our tracking framework filters and predicts muscle intensities, the choice of facial expression model remains open to the researcher.

A promising future research objective would be to study the effects of replacing the 3-D AMB AAM in the EKF formulation with a simpler, faster, and more accurate variant based on Active Shape Models (ASMs) [25]. The derived 3-D AMB ASM would use the projection of the 3-D facial contours depicted in our wireframe face model. We observed that serious illumination inconsistency increases the tracker's inability to adapt to changes in object appearance. Another promising future research direction would be the formulation of specific illumination models needed to handle lighting changes in the scene.


REFERENCES

[1] A. Azarbayejani and A. Pentland, “Recursive estimation of motion, structure, and focal length,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 6, pp. 562–575, Jun. 1995.

[2] J. Strom, “Model-based face tracking and coding,” Ph.D. dissertation, Dept. Electr. Eng., Linköping Univ., Linköping, Sweden, Feb. 2002, Dissertation No. 733.

[3] S. Basu, I. Essa, and A. Pentland, “Motion regularization for model-based head tracking,” in Proc. 13th IEEE ICPR, Vienna, Austria, 1996, pp. 611–616.

[4] J. Ahlberg and R. Forchheimer, “Face tracking for model-based coding and face animation,” Int. J. Imaging Syst. Technol., vol. 13, no. 1, pp. 8–22, 2003.

[5] T. Jebara and A. Pentland, “Parameterized structure from motion for 3D adaptive feedback tracking of faces,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., San Juan, Puerto Rico, 1997, pp. 144–150.

[6] F. Dornaika and J. Ahlberg, “Fast and reliable active appearance model search for 3-D face tracking,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 4, pp. 1838–1853, Aug. 2004.

[7] D. DeCarlo and D. Metaxas, “The integration of optical flow and deformable models with applications to human face shape and motion estimation,” in Proc. IEEE Conf. CVPR, San Francisco, CA, 1996, pp. 231–238.

[8] M. B. Stegmann, “Object tracking using active appearance models,” in Proc. 10th Danish Conf. Pattern Recog. Image Anal., 2001, vol. 1, pp. 54–60.

[9] M. D. Cordea and E. M. Petriu, “A 3-D anthropometric-muscle-based active appearance model,” IEEE Trans. Instrum. Meas., vol. 55, no. 1, pp. 91–98, Feb. 2006.

[10] M. Rydfalk, “CANDIDE, a parameterized face,” Dept. Electr. Eng., Linköping Univ., Linköping, Sweden, Rep. LiTH-ISY-I-866, 1987.

[11] G. J. Edwards, T. F. Cootes, and C. J. Taylor, “Interpreting face images using active appearance models,” in Proc. 3rd IEEE Int. Conf. Autom. Face Gesture Recog. (FG), Nara, Japan, Apr. 1998, pp. 300–305.

[12] J. C. Gower, “Generalized Procrustes analysis,” Psychometrika, vol. 40, no. 1, pp. 33–50, Mar. 1975.

[13] T. F. Cootes and C. J. Taylor, “Statistical models of appearance for computer vision,” Univ. Manchester, Manchester, U.K., Draft Tech. Rep., 2001.

[14] M. B. Stegmann, R. Fisker, B. K. Ersbøll, H. H. Thodberg, and L. Hyldstrup, “Active appearance models: Theory and cases,” in Proc. 9th Danish Conf. Pattern Recog. Image Anal., Aalborg, Denmark, 2000, pp. 49–57.

[15] R. H. Davies, C. J. Twining, T. F. Cootes, J. C. Waterton, and C. J. Taylor, “A minimum description length approach to statistical shape modelling,” IEEE Trans. Med. Imag., vol. 21, no. 5, pp. 525–537, May 2002.

[16] J. Xiao, S. Baker, I. Matthews, and T. Kanade, “Real-time combined 2D + 3D active appearance models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2004, vol. 2, pp. 535–542.

[17] K. Waters, “A muscle model for animating three-dimensional facial expression,” ACM SIGGRAPH Comput. Graph., vol. 21, no. 4, pp. 17–24, Jul. 1987.

[18] P. Ekman and W. Friesen, Manual for the Facial Action Coding System. Palo Alto, CA: Consulting Psychologists Press, 1986.

[19] T. F. Cootes and P. Kittipanya-ngam, “Comparing variations on the active appearance model algorithm,” in Proc. BMVC, 2002, vol. 2, pp. 837–846.

[20] M. D. Cordea, D. C. Petriu, E. M. Petriu, N. D. Georganas, and T. E. Whalen, “3-D head pose recovery for interactive virtual reality avatars,” IEEE Trans. Instrum. Meas., vol. 51, no. 4, pp. 640–644, Aug. 2002.

[21] D. G. Lowe, “Three-dimensional object recognition from single two-dimensional images,” Artif. Intell., vol. 31, no. 3, pp. 355–395, Mar. 1987.

[22] M. Pantic, M. F. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in Proc. IEEE ICME, Amsterdam, The Netherlands, 2005, pp. 317–321.

[23] BU database of video sequences and ground truth used in the evaluation of 3D head tracking, Image Video Comput. Group, Comput. Sci. Dept., Boston Univ., Boston, MA (last visited Jan. 6, 2008). [Online]. Available: ftp://csr.bu.edu/headtracking/

[24] M. La Cascia, S. Sclaroff, and V. Athitsos, “Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 4, pp. 322–336, Apr. 2000.

[25] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Comparing active shape models with active appearance models,” in Proc. Brit. Mach. Vis. Conf., 1999, vol. 1, pp. 173–182.

Marius D. Cordea received the Engineering degree from the Polytechnic University of Cluj, Cluj-Napoca, Romania, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Ottawa, Ottawa, ON, Canada, in 1999 and 2007, respectively.

He is currently with the University of Ottawa. His research interests include pattern recognition, face analysis, interactive virtual environments, and animation languages.

Emil M. Petriu (M’86–SM’88–F’01) received the Dipl. Eng. and Dr. Eng. degrees from the Polytechnic University of Timisoara, Timisoara, Romania, in 1969 and 1978, respectively.

He is a Professor and a University Research Chair with the School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, Canada. He has published more than 250 technical papers, authored two books, edited two other books, and received two patents. His research interests include robot sensing and perception, intelligent sensors, interactive virtual environments, soft computing, and digital integrated circuit testing.

Dr. Petriu is a Fellow of the Canadian Academy of Engineering and the Engineering Institute of Canada. He is a corecipient of the IEEE Donald G. Fink Prize Paper Award for 2003 and the recipient of the 2003 IEEE Instrumentation and Measurement Society Award. He is currently serving as Chair of TC-15 Virtual Systems and Co-Chair of TC-28 Instrumentation and Measurement for Robotics and Automation and TC-30 Security and Contraband Detection of the IEEE Instrumentation and Measurement Society. He is an Associate Editor of the IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT and a member of the Editorial Board of the IEEE I&M Magazine.

Dorina C. Petriu (M’91–SM’03) received the Dipl. Eng. degree in computer engineering from the Polytechnic University of Timisoara, Timisoara, Romania, and the Ph.D. degree in electrical engineering from Carleton University, Ottawa, ON, Canada.

She is a Professor with the Department of Systems and Computer Engineering, Carleton University. She is a contributor to two OMG standards. Her current research projects include the automatic derivation of software performance models from UML design specifications and software development and modeling for virtual environments and robotics. Her research interests are in the areas of performance modeling and model-driven engineering, with emphasis on integrating performance engineering into the software development process.

Dr. Petriu is a Fellow of the Engineering Institute of Canada and a member of the Professional Engineers of Ontario and the Association for Computing Machinery.