Face verification through tracking facial features

Baoxin Li

Sharp Laboratories of America, 5750 N.W. Pacific Rim Boulevard, Camas, Washington 98607

Rama Chellappa

Center for Automation Research, University of Maryland, College Park, Maryland 20742

Received September 8, 2000; accepted May 7, 2001; revised manuscript received May 25, 2001

We propose an algorithm for face verification through tracking facial features by using sequential importance sampling. Specifically, we first formulate tracking as a Bayesian inference problem and propose to use Markov chain Monte Carlo techniques for obtaining an empirical solution. A reparameterization is introduced under a parametric motion assumption, which facilitates the empirical estimation and also allows verification to be addressed along with tracking. The facial features to be tracked are defined on a grid with Gabor attributes (jets). The motion of facial feature points is modeled as a global two-dimensional (2-D) affine transformation (accounting for head motion) plus a local deformation (accounting for residual motion that is due to inaccuracies in 2-D affine modeling and other factors such as facial expression). Motion of both types is processed simultaneously by the tracker: The global motion is estimated by importance sampling, and the residual motion is handled by incorporating local deformation into the measurement likelihood in computing the weight of a sample. Experiments with a real database of face image sequences are presented. © 2001 Optical Society of America

OCIS codes: 100.0100, 100.5010, 150.0150.

1. INTRODUCTION

Facial feature tracking is useful in applications such as facial expression analysis, face-based person authentication, video-based face recognition, etc. Face verification has a wide range of applications, such as surveillance, access control, etc. When a short video sequence is available, face recognition can be performed more reliably if the temporal information is properly exploited. In applications such as access control, it is reasonable to assume that a short image sequence is available. Under this condition, instead of using a single frame, we can consider verification or recognition using a short video sequence of the face subject to small motion (this is natural when the head is moving or the camera is panning during the acquisition process).

Exploiting the temporal information available in a video sequence to improve face recognition has been an active research topic for years. For example, video-based segmentation can be used to assist in face detection, as in Refs. 1–3. Also, solving for structure from motion could help face recognition if three-dimensional face models are available. Many video-based face recognition algorithms/systems have been proposed, using different techniques (e.g., see Refs. 2–6). However, one can find that, in most cases, the temporal information is exploited only for tracking or detection (e.g., motion-based segmentation). For example, most of the current approaches/systems use tracking only as a preprocessing step: After some "good" frames are selected through tracking, still-image-based recognition techniques are then applied to these frames. Using video to achieve better segmentation or detection is, of course, one way of exploiting the temporal information available in an image sequence. One can also use "voting": recognition is done on each frame, and a vote is then taken to give the final decision. This simple technique is straightforward and often works well, but it can be expensive if the single-frame-based approach requires significant computation. Moreover, much information is left out in this simple approach, such as the coherence in the shape change of an object in consecutive frames. We now give an example to illustrate how other temporal information can be exploited, in addition to voting, to facilitate face verification.

Figure 1 shows an example in which Gabor feature points on a grid in the first image are matched with two other frames by using elastic graph matching (EGM).7 (For simplicity, we use a grid to define the feature points, as in Ref. 7, but it is obvious that our discussion is valid for features defined by other methods.) When a short sequence is available, one can, of course, still do recognition on a frame-by-frame basis and then take a vote. For example, given a face sequence of person A, if we match images of persons A and B (different from A) to the sequence, we should obtain a higher matching score for A (at least) for most of the frames in the sequence. There is, however, other information that can be exploited. For example, the matched grid points in the sequence should have more coherent patterns for A than for B, since the matched points for A are supposed to be the same set of physical points on the face if the matching is perfect, whereas for B, this assumption is not necessarily valid.

Figure 2 shows the results of matching two templates to a short image sequence; Fig. 2(a) shows the matched grid points when the sequence of 20 frames and the template are from the same person, while Fig. 2(b) is from the template of a different person. A close look at the plots reveals that they have different patterns. In fact, if the tracked trajectories of the feature points are plotted, as shown in Figs. 2(c) and 2(d) [for Figs. 2(a) and 2(b), respectively], we find that with the true hypothesis, the motion trajectories display a more coherent pattern. Exploiting this type of information is the focus of this paper.

In this paper, we present a facial feature tracking algorithm based on sequential importance sampling (SIS) and demonstrate how the approach can be used for face verification. The feature points are defined by the outputs of a set of Gabor filters on a grid covering the face region (collectively called jets in the literature). The motion of the set of jets is modeled as a global two-dimensional (2-D) affine transformation, which accounts for global head motion, plus a local deformation, which accounts for residual motion that is due to inaccuracies in affine modeling and other factors such as facial expressions. These two types of motion are estimated simultaneously by the tracker: Global motion is tracked by SIS, while residual motion is handled by incorporating local deformation into the measurement likelihood in the SIS steps. Since the tracking results can also be utilized for verification tasks, tracking as defined here can have a different meaning than conventionally understood. In a pure-tracking application, the tracker is initialized by computing the set of jets from the first frame. However, the proposed parameterization also allows a set of jets from a template face to be tracked in a given sequence. Thus, with different templates, such a tracker allows verification to be addressed simultaneously with tracking. We call this kind of application tracking for verification.

The paper is organized as follows. In Section 2, we propose a sampling-based algorithm for facial feature tracking and show that it addresses verification simultaneously as a result of the formulation. Experimental results on a real database are then presented in Section 3, along with comparisons with alternative approaches. We relate our work to other relevant research in Section 4. Finally, we point out future directions of the research in Section 5.

Fig. 1. Example of facial feature tracking: (a) feature points defined by Gabor attributes on a grid, (b) and (c) corresponding feature points in two subsequent frames.

2. FACE VERIFICATION THROUGH TRACKING FACIAL FEATURES

In this section, we first formulate a general tracking problem as a Bayesian inference problem and introduce SIS as an empirical solution to the inference problem. Considering that SIS would have difficulty in high-dimensional spaces, we propose a reparameterization to facilitate the computation. We also point out that, with the use of this reparameterization, verification can be simultaneously addressed with tracking. After defining facial features by a grid of Gabor attributes, we propose an SIS-based tracking algorithm that unifies both global and local motion. With this formulation, verification is just a natural by-product of the tracking algorithm.

A. Tracking Using Sequential Importance Sampling
In general, tracking is defined as the processing of measurements obtained from an object in order to maintain an estimate of its current state, which typically consists of its position, velocity, etc. Let $X_t$ denote the state to be estimated, and let $Z_t$ be the measurements (observations) up to time $t$. The subscript $t$ denotes a discrete time index. Both $X_t$ and $Z_t$ are, in general, random quantities. Let $p(X_t)$ be the distribution of $X_t$; we then have the joint distribution $p(X_t, Z_t)$:

$$p(X_t, Z_t) = p(X_t)\,p(Z_t \mid X_t),$$

where $p(Z_t \mid X_t)$ is the likelihood of $X_t$, having observed $Z_t$. Using Bayes's theorem, we have

$$p(X_t \mid Z_t) = \frac{p(X_t)\,p(Z_t \mid X_t)}{\int p(X_t)\,p(Z_t \mid X_t)\,\mathrm{d}X_t},$$

which is the posterior distribution of $X_t$ and is what Bayesian inference attempts to estimate. Assuming that we have obtained $p(X_t \mid Z_t)$, tracking is solved; knowing $X_t$, by definition we know everything about the current state of the object. Thus tracking can be formulated as a Bayesian inference problem, with $p(X_t \mid Z_t)$ as the object of the inference. Note that in this formulation, the posterior distribution $p(X_t \mid Z_t)$ is a time-varying quantity.

Fig. 2. Corresponding feature points obtained from 20 frames: (a) result of matching the same person to a video, (b) result of matching a different person to the video, (c) trajectories of (a), (d) trajectories of (b).

In reality, instead of obtaining the posterior density itself, a Bayesian inference task may focus on estimating only some properties of the density, such as moments, quantiles, highest posterior density regions, etc. All these quantities can be expressed in terms of posterior expectations of functions of $X_t$ as

$$E[f(X_t) \mid Z_t] = \frac{\int f(X_t)\,p(X_t)\,p(Z_t \mid X_t)\,\mathrm{d}X_t}{\int p(X_t)\,p(Z_t \mid X_t)\,\mathrm{d}X_t}.$$

The integration in this expression has until recently been the source of most of the practical difficulties in Bayesian inference, especially in high dimensions. In most applications where the system dynamics and the uncertainty models are complex (typically nonlinear/non-Gaussian), analytic evaluation of $E[f(X_t) \mid Z_t]$ is impossible. In recent years, with increased computing power, Monte Carlo methods have emerged as effective empirical tools for the analysis of non-Gaussian/nonlinear dynamic systems (e.g., Refs. 8 and 9). SIS is an elegant approach of this type. In an SIS approach, the analysis of a dynamic system is formulated as the evaluation of the conditional probability density $p(X_t \mid Z_t)$. At time $t$, $p(X_t \mid Z_t)$ is approximated by a set of its samples, and each sample is associated with a weight reflecting its significance in representing the underlying density (importance sampling).10,11 The general framework of SIS described in Ref. 12 covers applications such as target position tracking,13 the Bayesian missing data problem,14 and contour tracking.15 A brief description of the SIS algorithm is given in Appendix A, where a dynamic system is defined as a sequence of evolving probability distributions $\pi_t(X_t)$ (for a complete treatment, see Ref. 12). By setting $\pi_t(X_t) = p(X_t \mid Z_t)$, we then solve the tracking problem formulated above by the SIS algorithm. In non-Gaussian/nonlinear situations, sampling-based approaches typically have advantages over the conventional Kalman filter.9,12,15

B. Dimensionality Reduction
Unfortunately, when $X$ is high dimensional (for example, the positions and the velocities of the individual graph vertices), posterior estimation through empirical approaches is not realistic with the SIS method. If, however, the object can be characterized by a vector of low dimensionality, then the SIS method would be an effective tool for analyzing a dynamic system. To this end, we derive a reparameterization.

Consider a rigid object subject to motion that can be modeled by a transformation $f$ parameterized by a parameter vector $\theta$. Let $X_0$ denote an original parameterization of the object. $X_0$ can be, for example, a set of jets. Let $X = f(\theta, X_0)$ denote the transformation of $X_0$ into $X$. Under a small- and continuous-motion assumption, $X$ would be similar to $X_0$, meaning that $\theta$ has only a "small" difference from $\theta_0$, with $\theta_0$ being the parameter for the identity transform: $X_0 = f(\theta_0, X_0)$. Expanding $X = f(\theta, X_0)$ at $\theta_0$ gives

$$X = f(\theta, X_0) = f(\theta_0, X_0) + J_\theta(\theta_0)(\theta - \theta_0) + o(\cdot) \approx X_0 + J_\theta(\theta_0)(\theta - \theta_0), \tag{1}$$

where $o(\cdot)$ denotes higher-order terms and $J_\theta(\cdot)$ is the Jacobian matrix with respect to $\theta$. The above expansion shows that the transformed object $X$ can be viewed as the original $X_0$ plus a changing term caused by $\Delta\theta \stackrel{\mathrm{def}}{=} \theta - \theta_0$. Therefore, given $X_0$, the vector $\Delta\theta$ is a good parameterization of all possible $X$ under a small-motion assumption. From a practical point of view, only the difference is important: knowing the difference, one can establish a temporal correspondence between $X$ and $X_0$. Thus we propose to use $\Delta\theta$ as the state vector. The Jacobian matrix $J_\theta(\theta_0)$ is easy to obtain for 2-D affine or simpler transformations. For example, let $X_0 = (x_0, y_0)$ be the location of one feature point. For a 2-D affine transformation $f(\theta, \cdot)$ with $\theta \stackrel{\mathrm{def}}{=} (a_{11}, a_{12}, a_{21}, a_{22}, T_x, T_y)^T$, equivalent to, say,

$$f(\theta, \cdot) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}(\cdot) + \begin{pmatrix} T_x \\ T_y \end{pmatrix},$$

the Jacobian matrix is computed as

$$J_\theta(\theta_0) = \left. \frac{\partial X}{\partial \theta} \right|_{\theta_0} = \begin{bmatrix} x_0 & y_0 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_0 & y_0 & 0 & 1 \end{bmatrix}. \tag{2}$$

With the reparameterization, tracking is solved by analyzing the dynamic system governing the evolution of $\Delta\theta$. Note that the 2-D affine group is linear in each component of $\theta$; therefore the higher-order terms in Eq. (1) are zero, implying that the parameterization with $\Delta\theta$ is accurate for a 2-D planar object, even under large motion, if the motion is 2-D affine.
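To make the $\Delta\theta$ parameterization concrete, the following minimal NumPy sketch (our own helper names, not from the paper) stacks the per-point Jacobian of Eq. (2) for an $n$-point grid and displaces template points by a candidate $\Delta\theta$:

```python
import numpy as np

def affine_jacobian(points):
    """Stack the Jacobian of Eq. (2) for n feature points; theta is ordered
    (a11, a12, a21, a22, Tx, Ty) and the expansion point is the identity."""
    n = points.shape[0]
    J = np.zeros((2 * n, 6))
    J[0::2, 0] = points[:, 0]  # dx/da11 = x0
    J[0::2, 1] = points[:, 1]  # dx/da12 = y0
    J[0::2, 4] = 1.0           # dx/dTx
    J[1::2, 2] = points[:, 0]  # dy/da21 = x0
    J[1::2, 3] = points[:, 1]  # dy/da22 = y0
    J[1::2, 5] = 1.0           # dy/dTy
    return J

def apply_delta(points, dtheta):
    """X = X0 + J_theta(theta0) * dtheta, i.e., Eq. (1)."""
    return points + (affine_jacobian(points) @ dtheta).reshape(-1, 2)
```

Because the 2-D affine map is linear in $\theta$, `apply_delta` is exact rather than first order, in line with the remark above.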

The reparameterization originates from a simple Taylor expansion. However, its novelty comes from the fact that one can choose different $X_0$ for different purposes. If $X_0$ corresponds to a feature set from the first frame of a sequence, then the reparameterization is suitable for pure tracking. However, if $X_0$ represents some template from a candidate list, then the reparameterization is naturally good for tracking for verification: When a template and the sequence belong to the same person, tracking results should reflect a coherent motion induced by the same underlying shape; on the other hand, a more random motion pattern will be observed when the template and the sequence belong to different persons. This idea will become clearer in the experiments section (Section 3).

C. Gabor Wavelets as Feature Detectors
We define the facial features as a set of grid points with Gabor attributes. The motivation behind this is mainly the successful application of Gabor filters to face recognition (e.g., Refs. 7, 16, and 17). The set of jets is obtained as follows. An image $I(\mathbf{x})$ is first filtered by a Gabor wavelet as

$$(WI)(\mathbf{k}, \mathbf{x}_0) = \int \phi_{\mathbf{k}}(\mathbf{x}_0 - \mathbf{x})\,I(\mathbf{x})\,\mathrm{d}\mathbf{x},$$

where $\phi_{\mathbf{k}}$ is a convolution kernel defined by

$$\phi_{\mathbf{k}}(\mathbf{x}) = \frac{k^2}{\sigma^2} \exp\!\left(-\frac{k^2 x^2}{2\sigma^2}\right) \left[\exp(i\mathbf{k}\cdot\mathbf{x}) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right],$$

with $\mathbf{k} = k_{\nu\mu} = k_\nu \exp(i\phi_\mu)$ controlling the orientation and the scale of the wavelets. The outputs at the grid points are then saved as components of the jets. In this work, to obtain the jets, we have used 24 wavelets [three frequencies and eight orientations given by $\nu = 0, 1, 2$ and $\mu = 0, \ldots, 7$, with $k_\nu = (\pi/2)/\sqrt{2^\nu}$ and $\phi_\mu = \mu\pi/8$]. Figure 3 shows some examples of the Gabor wavelets computed with the above parameters.

Fig. 3. Examples of Gabor wavelets with different $\sigma$, $\mu$, and $\nu$.
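As a rough illustration, a 24-filter bank with the stated frequencies and orientations could be generated as follows; the kernel width $\sigma$ and the spatial support size are not given in the text, so the values below are placeholders:

```python
import numpy as np

def gabor_kernel(nu, mu, size=33, sigma=np.pi):
    """Gabor kernel phi_k sampled on a size x size grid, with
    k_nu = (pi/2)/sqrt(2)**nu and phi_mu = mu*pi/8 as in the text;
    sigma and size are assumptions."""
    k = (np.pi / 2) / np.sqrt(2) ** nu
    kx, ky = k * np.cos(mu * np.pi / 8), k * np.sin(mu * np.pi / 8)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    r2 = x**2 + y**2
    envelope = (k**2 / sigma**2) * np.exp(-(k**2) * r2 / (2 * sigma**2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-(sigma**2) / 2)
    return envelope * carrier

# three frequencies (nu = 0, 1, 2) times eight orientations (mu = 0..7)
bank = [gabor_kernel(nu, mu) for nu in range(3) for mu in range(8)]
```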

D. Proposed Algorithm
Under a weak perspective camera model, the motion (in the image domain) of a planar rigid object can be approximated by the 2-D affine group. Although the set of jets is defined on a face, which is not an ideal planar object, if deformation of each grid point is allowed, one can still get a good approximation to the motions of the jets. Thus we model the motions of the jets as a 2-D affine transformation plus a local deformation. When the temporal evolution of the jet positions is modeled as a dynamic system, tracking is solved by analyzing this system, which, in general, is non-Gaussian and nonlinear. Note that we did not specify any uncertainty model for individual feature points, which may be too complex to be modeled by a simple function, since it needs to account for inaccuracies in the 2-D approximation, uncertainty that is due to noise, nonrigidity that is due to expression, etc. Instead, the local deformation at each jet is used to account for these factors. We now give a facial feature tracking algorithm as follows, which also solves verification at the same time (note that, as explained in Subsection 2.B, the state $X$ is now actually $\Delta\theta$):

Algorithm.
Initialization. Rectify the template jets onto the first frame by using elastic graph matching (EGM). Draw $N_s$ random samples from $\pi_0(X_0)$.
Tracking and verification.
Updating. At time $t > 0$, invoke the SIS algorithm to obtain an updated set of samples for $\pi_t(X_t)$. To compute the likelihood of each sample, perform a local search around each node to account for the deformation before computing the matching error.
Mean shape evaluation. At any time $t > 0$, the tracked set of jets is given by Eq. (1) with $\Delta\theta = E_\pi(X_t)$, plus a local search; a final matching score is calculated by using the mean shape; a posterior probability is computed in an interval around $E_\pi(X_t)$.
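The control flow of the algorithm can be sketched as follows. This is structure only: every callable in `helpers` (EGM rectification, prior sampling, SIS update, weighted mean, matching score, posterior probability) is a hypothetical stand-in for the corresponding step above, not code from the paper:

```python
def track_and_verify(frames, template_jets, helpers, n_samples=200):
    """Skeleton of the tracking-for-verification loop; `helpers` bundles
    caller-supplied callables for each step of the algorithm."""
    dtheta = helpers.egm_rectify(template_jets, frames[0])    # initialization
    samples, weights = helpers.draw_prior(dtheta, n_samples)  # N_s draws from pi_0
    scores, posteriors = [], []
    for frame in frames[1:]:
        # SIS update; sample likelihoods include a local search at each node
        samples, weights = helpers.sis_update(samples, weights, frame, template_jets)
        mean = helpers.weighted_mean(samples, weights)        # E_pi(X_t), mean shape
        scores.append(helpers.matching_score(template_jets, frame, mean))
        posteriors.append(helpers.posterior_mass(samples, weights, mean))
    return scores, posteriors
```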

In the initialization step, $X_0$ gives the transform that rectifies (warps) the template onto the first frame (for pure tracking, $X_0$ corresponds to the identity transform, since the template is defined on the first frame directly). We assumed that $\pi_0(\cdot)$ is joint Gaussian (with independent components), which is a reasonable and convenient choice under a small-motion assumption.

In the updating step, one needs to specify the measurement likelihood functions, which indicate the likelihood of a sample given the observations. It could be difficult to specify such functions for complex problems. In an importance-sampling-based approach, the key is to increase the weights of those samples that incur less prediction-measurement error. Thus if the measurement likelihood is too complex to specify, as an approximation, we can use functions that monotonically decrease with prediction-measurement error. In our experiments, we have assumed that the likelihood of sample $j$ is a truncated Gaussian:

$$L^{(j)} = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma_0} \exp\{-[e_t^{(j)}]^2/\sigma_0^2\} & \text{if } |e_t^{(j)}| < \delta \\ c & \text{otherwise} \end{cases},$$

where $\delta$ is a threshold and $c$ is a constant. The error $e_t$ is computed as

$$e_t = \frac{1}{N_J} \sum_k \frac{J_m^{(k)} \cdot J_s^{(k)}}{\|J_m^{(k)}\|\,\|J_s^{(k)}\|},$$

where $N_J$ is the number of jets in the model, $J_m^{(k)}$ is the $k$th jet (a vector) of the model, and $J_s^{(k)}$ is its counterpart in the current frame. It needs to be emphasized that $J_s^{(k)}$ is found by applying the global motion followed by a local search. Therefore, even though the algorithm does not explicitly model uncertainties in individual feature points, the uncertainties have been absorbed into the estimated density (which is for the global motion).
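A literal transcription of this weight computation might look as follows, with jets taken as real vectors (e.g., magnitude responses); $\sigma_0$, $\delta$, and $c$ are not reported in the paper, so the defaults here are placeholders:

```python
import numpy as np

def sample_weight(model_jets, frame_jets, sigma0=1.0, delta=2.0, c=1e-6):
    """Truncated-Gaussian likelihood L(j) of one sample, computed from the
    averaged normalized jet correlation e_t; parameter values are placeholders."""
    sims = [np.dot(jm, js) / (np.linalg.norm(jm) * np.linalg.norm(js))
            for jm, js in zip(model_jets, frame_jets)]
    e_t = np.mean(sims)  # (1/N_J) * sum over jets of normalized correlations
    if abs(e_t) < delta:
        return np.exp(-e_t**2 / sigma0**2) / (np.sqrt(2 * np.pi) * sigma0)
    return c
```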

As shown above, the algorithm is initialized by rectifying the templates using the EGM method in the first frame of the sequence (i.e., locating the best match of the template in the first frame). Starting from the second frame, the matched grid will be found only through tracking using the algorithm. It needs to be pointed out that the system dynamics update only the locations of the jets, and the jet response remains intact.

To evaluate the motion coherence in a shape, we calculate the posterior probabilities from the estimated densities $\pi_t(\cdot)$ on a region centered around the mean shape. Without going into a lengthy discussion (see Ref. 18 for an explanation from a Bayesian point of view), here we give a short but intuitive explanation of why the posterior probabilities serve as a measure for the coherence of motion and thus can be used in verification in addition to the matching scores. Under the true hypothesis, assuming that the motion model is accurate enough for describing the underlying motion of the point set, in an ideal situation there will be a unique motion parameter relating an instance of the point set to a template. In practice, this appears instead as a high peak in the density, which is most likely unimodal in a local region because of the continuity of the object motion. On the other hand, for a false hypothesis, these arguments do not necessarily hold, since the underlying constraint (tracked feature points should correspond to the same physical points) does not hold. Thus a coherent motion from a true hypothesis would give a larger posterior probability. This will be demonstrated in the experiments in Section 3.
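As a minimal sketch of how such a posterior probability can be read off the weighted sample set (the region radius is our assumption; the paper computes the probability over an unspecified interval around $E_\pi(X_t)$):

```python
import numpy as np

def posterior_mass(samples, weights, radius):
    """Posterior probability of a ball around the mean state: a peaked
    (coherent) posterior concentrates more weight near its mean than a
    diffuse one, so this value is larger under the true hypothesis."""
    s = np.asarray(samples, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalize the weights
    mean = np.average(s, axis=0, weights=w)    # E_pi(X_t)
    inside = np.linalg.norm(s - mean, axis=1) < radius
    return w[inside].sum()
```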

3. EXPERIMENTS AND COMPARISONS

A. Database and Experimental Setup
We tested our algorithm on a database containing 19 individuals. In building the database, we asked each person to sit on a chair at a fixed distance from the camera, so that the scale was approximately the same for all persons. Two sequences were acquired under different lighting for each person while the person was asked to move his/her head and make any desired facial expression, simulating an automatic teller machine or an access-controlled scenario. Although this database was intended for authentication experiments, it is well suited for testing the proposed tracking algorithm, since it is intended for tracking for verification.

Two sample sequences of the same person under differ-ent lighting are shown as an example in Fig. 4.

As mentioned in Subsection 2.C, we use 24 Gabor filters for feature extraction. Note that, in practice, it is computationally very expensive if one attempts filtering (convolution) in the spatial domain (there would be 24 convolution operations for each frame). A faster approach is to use the fast Fourier transform (FFT). When the FFT is used, the convolution in the spatial domain is realized by the inverse FFT of the product of two signals (a frame and the kernel) in the frequency domain. This is what was used in the experiments. For an even faster implementation, special hardware can be used to compute the FFT.
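A minimal sketch of this FFT route is shown below; note that zero-padding the kernel to the frame size yields circular convolution, which is adequate away from the image borders:

```python
import numpy as np

def gabor_responses_fft(frame, kernels):
    """Filter one frame with a bank of Gabor kernels via the FFT:
    one forward FFT of the frame, then a product and an inverse FFT
    per kernel, instead of 24 spatial convolutions."""
    H, W = frame.shape
    F = np.fft.fft2(frame)
    return [np.fft.ifft2(F * np.fft.fft2(k, s=(H, W))) for k in kernels]
```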

In the experiments reported in this section, the local search region in the updating step of the algorithm is 5 × 5. The frame size is 90 × 80 pixels. The number of samples $N_s$ during sampling is 200. The Gabor filter parameters are as given in Section 2.

Fig. 4. Two sample sequences (top and bottom rows) for the same person under different lighting conditions.


Fig. 5. Results of pure tracking: tracking results from the proposed algorithm (top) and from a Kalman-filter-based tracker plus a local search (bottom). See also movies 1 and 4.

Fig. 6. Results of pure tracking: tracking results from the proposed algorithm (top) and from a Kalman-filter-based tracker plus a local search (bottom). Note the rotation in depth. See also movies 2 and 5.

Fig. 7. Results of pure tracking: tracking results from the proposed algorithm (top) and from a Kalman-filter-based tracker plus a local search (bottom). Note the large in-plane rotation and the error in some locations of feature points. See also movies 3 and 6.

B. Pure Tracking
We first present experiments of pure tracking, where the task is to track the set of feature points defined on the first frame of a sequence. Although in the database there are some sequences with only small head motion (thus making tracking an easier job), here we present experiments only on those sequences involving challenging motion (for example, those subject to rotation in depth).

The first example, as shown in Fig. 5 (top row), involves moderate up-and-down head motion; therefore the "height" of the face varies across frames. It is interesting to note that the tracker captured this type of variation, in addition to maintaining track on individual feature points. See also movie 1 (movies referred to in this paper can be viewed at http://www.cfar.umd.edu/~baoxin/facemovies.html).

The second example is shown in Fig. 6 (top row). In this case, there is a large rotation in depth, yet the tracker was able to do its job. Note that there are also expression changes in this sequence. See also movie 2.


In the third example [Fig. 7 (top row)], the head is swaying to a large extent. The global motion is captured by the tracker, although there are noticeable errors on individual feature points. See also movie 3.

In summary, we find that our facial feature trackingapproach works well, even in challenging situations asshown above.

It must be emphasized that the deformed grids in the examples are not obtained by the EGM method, except for the first frame; rather, they are tracked from the video sequence. We could have used the EGM method on a frame-by-frame basis to obtain the matched grid points at each frame (hence achieve tracking), but the disadvantage is that the temporal coherence is not naturally exploited, and a separate procedure is needed for analyzing this temporal information. Also, a large amount of rotation (head sway) is difficult for the EGM method, since a rigid search plus an elastic search is not enough to account for the motion in this case. In addition, it is computationally expensive to use the EGM method frame by frame.

C. Comparison with Kalman-Filter-Based Tracking
We argued in Section 2 that the adoption of the SIS-based algorithm can handle tracking better than the Kalman filter when the system dynamics and the measurement uncertainties are not linear and Gaussian. There are plenty of discussions on this topic in the literature (e.g., Refs. 12 and 15). Here we present only experimental results on tracking using a Kalman-filter-based tracker and compare them with those from our algorithm. In the implementation of the Kalman filter, we let the state $S = (a_{11}, a_{12}, a_{21}, a_{22}, T_x, T_y)^T$, where $\{a_{11}, a_{12}, a_{21}, a_{22}, T_x, T_y\}$ are the affine parameters, as mentioned in Section 2. We have assumed that the system dynamics are a first-order random walk, since it is not easy to specify a system equation that governs the evolution of the state $S$. Since the affine parameters are used as the state, the measurement equation is linear. Mathematically, let measurement $Z$ be the coordinates of all the jets, i.e., $Z = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T$, with $(x_i, y_i)$ being the position of the $i$th jet; then we can write the system equation and the measurement equation as

$$S_{i+1} = H S_i + w_i, \qquad Z_i = F S_i + v_i,$$

where $H$ is an identity matrix, $w_i$ and $v_i$ are white noise, and $F$ is a matrix whose every two rows are like the Jacobian matrix given in Eq. (2):

$$F = \begin{bmatrix} x_{01} & y_{01} & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{01} & y_{01} & 0 & 1 \\ & & \cdots & & & \\ x_{0i} & y_{0i} & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{0i} & y_{0i} & 0 & 1 \\ & & \cdots & & & \\ x_{0n} & y_{0n} & 0 & 0 & 1 & 0 \\ 0 & 0 & x_{0n} & y_{0n} & 0 & 1 \end{bmatrix},$$

with $(x_{0i}, y_{0i})$ being the coordinates of the $i$th jet in the first frame (recall that all the jets form the template).
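Note that $F$ is exactly the per-point Jacobian of Eq. (2) stacked over all template jets, so the `affine_jacobian` helper sketched in Subsection 2.B builds it directly; `template_xy` below is a hypothetical array of first-frame jet coordinates:

```python
import numpy as np

# Hypothetical (n, 2) array of first-frame jet coordinates (x0i, y0i).
template_xy = np.array([[10.0, 12.0], [34.0, 12.0], [22.0, 40.0]])

F = affine_jacobian(template_xy)  # measurement matrix, shape (2n, 6)
H = np.eye(6)                     # first-order random walk: S_{i+1} = S_i + w_i
```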

With the above formulation, a standard Kalman filter is then applied to estimate the state at each time index. Note that by using the above Kalman filter, we can estimate only the global motion. To get a fair comparison, after obtaining the global motion estimation through the Kalman filter, we further did a local search around each feature point, just as we have done in our algorithm. The results are presented in the bottom rows of Figs. 5, 6, and 7 (see also movies 4, 5, and 6, respectively). A close look (more obvious by viewing the movies) would reveal that although Kalman-filter-based tracking works well for the first sequence, it failed to track the head motion for the second and the third sequences, where the motion is more complex. Specifically, for the sequence in Fig. 6, the Kalman-filter-based tracker did not catch the "width" change of the face as a result of rotation in depth; and in Fig. 7, the dramatic head sway baffled the Kalman-filter-based tracker. Note that in both cases, the Kalman-filter-based tracker worked fine at the beginning of the sequences but later failed to catch up with the motion. In short, the Kalman-filter-based tracker did not capture the global motion as accurately as our algorithm did.

In the implementation of the Kalman filter, the measurements at each time were provided by an EGM module.

Fig. 8. Sample frames from the video (top) and the templates (bottom). The templates will be referred to as face 1, face 2, ..., etc., counting from left, and obviously face 3 in the bottom row is the true hypothesis.


Fig. 9. Posterior probabilities from the true hypothesis (face 3 in Fig. 8) (solid curves) and, in (a), (b), (c), (d), and (e), from faces 1, 2, 4, 5, and 6, respectively (dashed curves).

D. Tracking for Verification
We mentioned above that, by using the proposed reparameterization, one can easily incorporate a template into the tracker, so that verification can be simultaneously addressed. We now give such an example. The example is intended to show that while tracking, we can also compute the posterior probability (as shown in the algorithm, the posterior probabilities are computed from the estimated densities on a region centered around the mean shape) and use the probability to assist in verification. In Fig. 2, we showed tracked feature trajectories and found that the trajectories display more coherent motion patterns under a true hypothesis than under false hypotheses. We now demonstrate that such information is also reflected in the posterior probabilities. In this experiment, we assume that a short sequence is given, that there are several hypotheses (face templates), and that the tracker is required to verify which of them is true (i.e., which person is in the sequence).

In Fig. 8, the top row shows sample frames from a sequence, and the bottom row shows six possible candidates. All the templates and the sequence were obtained under the same lighting condition. Figure 9 shows the computed probabilities for each false-hypothesis template together with that for the true hypothesis. Figure 10 shows the matching scores using the computed mean shape. Note that the scores are not from the EGM method.

Figure 11 shows two templates of the same person under different lighting conditions. The template on the left is the one that we used in Fig. 8, and the one on the right is used here, as the true hypothesis, to compare with the templates in Fig. 8. Figure 12 shows the computed probabilities for each false hypothesis together with that for the true hypothesis. This example shows that even though the true hypothesis is acquired under a lighting condition different from that of the video it is tested against, the algorithm is still able to do the verification based on the computed posterior probability. (In this example, all false hypotheses have the same lighting as the video does; this makes the problem more difficult.)

E. Comparison with Frame-by-Frame Matching and Voting
We have argued that, although voting could provide a potential solution when video is available, it does not utilize temporal information to its full advantage. In our experiments, we have found that when the template and the video clips are under the same lighting condition, voting is indeed enough to give a correct verification result for this small database. However, when the template and the video clips are from different lighting, a voting strategy has difficulties in verification. For simplicity, again, we use those faces in Fig. 8 as templates to verify against the sequence in Fig. 4 (top row). With the EGM algorithm, we computed the matched grid at each frame and plotted the matching score in Fig. 13. It is found that in this case, scores of the true hypothesis do not differ much from those of the false hypotheses (for easy comparison, we have intentionally kept Figs. 10 and 13 at the same resolution). In fact, by voting, one finds that face 5 would be favored over face 3 (which is the true hypothesis). However, by using the posterior probabilities (Fig. 12), one should have no problem making the correct decision.

Fig. 10. Matching scores computed by using the mean shape. The solid curves are from the true hypothesis, and the dashed curves are from false hypotheses.

Fig. 11. Two face templates of the same person under different lighting. The left is face 3 in Fig. 8. The right is the same person under different lighting and will be called face 7.

Fig. 12. Posterior probabilities from the true hypothesis [face 7 in Fig. 11 (right)] (solid curves) and, in (a), (b), (c), (d), and (e), from faces 1, 2, 4, 5, and 6, respectively (dashed curves).


Fig. 13. Matching scores computed by using EGM for each frame. The solid curves are from the true hypothesis, and the dashed curves are from false hypotheses. The resolution has been kept the same as that in Fig. 10. It is obvious that in this situation, the differences between the score from the true hypothesis and those from false hypotheses become much smaller, which would render a training process difficult (see text).

For this small database, we have found that if the true-hypothesis template and the test video are under different lighting while false hypotheses are free to be either of the two lighting conditions, then simple frame-by-frame matching and voting can achieve only a 68% recognition rate (13 out of 19 individuals). Since lighting is only one of many variations in the real world, one can expect that this number will drop with more difficult data. On the other hand, with the proposed approach, by using the posterior probabilities, we did not find any failure case for this small database.

We need to point out that so far the verification is based on comparison with alternative hypotheses. In an application where the algorithm has to verify a claimed identity without referring to all the possible templates, a certain type of training is needed first. In this situation, the proposed approach may still have advantages over a frame-by-frame matching and voting strategy in the following respect: From Fig. 13, one can find that even if only the lighting condition changed, the difference (in terms of matching score) between true and false hypotheses becomes much smaller (compared with Fig. 10, where lighting is the same for all the templates and the test sequence). However, in either Fig. 9 or Fig. 12, we can observe that the posterior probability of a true hypothesis never drops as low as that of false hypotheses. Thus, if a training process is to be utilized, using the posterior probability would apparently make the task easier.

4. RELATED WORK

The proposed approach is aimed at analyzing the motion of facial features with uncertainties as illustrated in Fig. 1. From a geometrical point of view, facial features in the image domain are just a set of 2-D points. Statistical shape theory, a discipline pioneered by Kendall,19–21 is the study of 2-D shape and its statistical properties in the presence of uncertainties. Other contributions to this field can be found, for example, in Refs. 22–26. In the computer vision field, the problem of uncertainty propagation in transforms has also been discussed extensively (e.g., Ref. 27), even though the term shape theory is seldom used. An explicit application of shape theory to a computer vision problem (and one of the earliest) may be the face detection approach reported in Ref. 28, where facelike objects are characterized by a few points corresponding to the eyes, the nose, etc. The work was later extended in Ref. 29 to include affine-invariant shape. The authors also pointed out that the approach is well suited for recognizing object classes (e.g., face detection).

It could be challenging, however, to directly use a shape theory approach to discriminate similar objects (such as a few faces that are represented by only the positions of a few feature points). In a complex recognition problem, there are some potential challenges, in addition to difficulties in obtaining the density of shape (either analytically or empirically) for an underlying set of feature points. First, in a problem illustrated in Figs. 1 and 2, one can find that a decision using the shape density could be optimal only for an individual frame. In other words, a frame-by-frame-based shape-theoretic approach does not pay attention to motion coherence. The motion coherence between adjacent frames is not considered other than in the fact that, of course, one can still use voting. Second, although a certain group such as affine is often used to model the object motion (e.g., the head motion in Fig. 1), the real motion never traverses through all possibilities in the group. On the contrary, usually only a very limited subregion of the group is meaningful. For example, normal head motion will never flip the left and right sides of a face, although this "motion" is obviously in the affine group [given by $\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}(\cdot) + \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, assuming that the origin is in the center of the face]. Thus, for practical problems, the density of shape may, in a sense, take too much into account. Those unnecessary possibilities of the group, tangled with uncertainties in individual points, would lead to a density that obscures the distinction between two slightly different objects and thus make recognition difficult (however, for recognition of an object class, this may still be fine, as was reported in Refs. 29 and 30). Third, some objects may be distinguishable from each other only when the motion is restricted to a subspace of a group, and it may not be so easy to distinguish them in the shape space.

5. CONCLUSIONS AND FUTURE WORK

We have proposed a Gabor attribute tracking algorithm based on statistical sampling methods. As described in the introduction (Section 1), there are many methods for facial feature tracking. Our algorithm differs from other facial feature tracking methods (e.g., Refs. 31–33) in that tracking is intended for verification: the model templates can be easily incorporated into the tracker because of its formulation and parameterization. Meanwhile, as shown in the pure-tracking experiments, our algorithm can handle tracking very well, even under large motion including rotation in depth. With the random sampling and local search components, the algorithm also lends itself to an interpretation from a shape-theoretic point of view. Experiments with a real database show that the algorithm works as desired.

It is worth pointing out that since the early development of the EGM method for face recognition,7 more refined work has been done, such as feature selection based on the significance of the feature (resulting in an irregular grid), etc. These refinements of the EGM method can be easily incorporated into our tracking algorithm, since the algorithm works as long as an ordered set of points with attributes is given. We also believe that work along this line will lead to better verification results than those from the current scheme (which uses a fixed grid and thus may include some points that are insignificant for verification).

Time complexity is another issue that needs to be addressed. In the examples given above, the tracker runs at 5 frames per second with $N_s = 200$ on an Ultra-Sparc 5 workstation, excluding the jet formation procedure. However, its speed depends on factors such as the number of jets, the number of attributes at each grid point, the extent to which deformation is allowed, etc. Further work is needed on accurate speed performance characterization and real-time implementation. Future work also includes testing on a large database.

APPENDIX A

Definition (Ref. 12). A random sample $X$ drawn from a distribution $g$ is said to be properly weighted by a weighting function $w(\cdot)$ with respect to the distribution $\pi(\cdot)$ if for any integrable function $h$,

$$E_g\{h(X)\,w(X)\} = E_\pi\{h(X)\}.$$

A set of random draws and weights $(x^{(j)}, w^{(j)})$, $j = 1, 2, \ldots$, is said to be properly weighted with respect to $\pi$ if

$$\lim_{m \to \infty} \frac{\sum_{j=1}^{m} h(x^{(j)})\,w^{(j)}}{\sum_{j=1}^{m} w^{(j)}} = E_\pi\{h(X)\} \tag{A1}$$

for any integrable function $h$. In a practical sense, we can think of $\pi$ as being approximated by the discrete distribution supported on the $x^{(j)}$ with probabilities proportional to the weights $w^{(j)}$.

SIS algorithm (Ref. 12). Let $S_t = \{x_t^{(j)}, j = 1, \ldots, m\}$ denote a set of random draws that are properly weighted (see the definition above) by the set of weights $W_t = \{w_t^{(j)}, j = 1, \ldots, m\}$ with respect to $\pi_t$. At each time step $t$:

Step 1. Draw $X_{t+1} = x_{t+1}^{(j)}$ from $g_{t+1}(x_{t+1} \mid x_t^{(j)})$.
Step 2. Compute

$$u_{t+1}^{(j)} = \frac{\pi_{t+1}(x_{t+1}^{(j)})}{\pi_t(x_t^{(j)})\,g_{t+1}(x_{t+1}^{(j)} \mid x_t^{(j)})}, \qquad w_{t+1}^{(j)} = u_{t+1}^{(j)}\,w_t^{(j)}.$$

Then $(x_{t+1}^{(j)}, w_{t+1}^{(j)})$ is a properly weighted sample of $\pi_{t+1}$.
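A generic one-step implementation of this update might look as follows; the target densities `pi_t`, `pi_next` and the trial distribution (`propose`, `trial_pdf`) are caller-supplied callables, i.e., assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def sis_step(samples, weights, pi_t, pi_next, propose, trial_pdf):
    """One SIS update: draw x' from g_{t+1}(. | x), then rescale the weight
    by u = pi_{t+1}(x') / (pi_t(x) * g_{t+1}(x' | x))."""
    new_samples, new_weights = [], []
    for x, w in zip(samples, weights):
        x_new = propose(x, rng)                               # Step 1
        u = pi_next(x_new) / (pi_t(x) * trial_pdf(x_new, x))  # Step 2
        new_samples.append(x_new)
        new_weights.append(u * w)
    return new_samples, new_weights
```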

ACKNOWLEDGMENTThe authors thank the anonymous reviewers for their in-sightful comments.

Baoxin Li can be contacted by phone, 360-817-8459; fax, 360-817-8436; or e-mail, [email protected]. Rama Chellappa can be contacted by phone, 301-405-3656; fax, 301-314-9115; or e-mail, [email protected].

REFERENCES
1. M. Turk and A. Pentland, "Eigenfaces for recognition," J. Cogn. Neurosci. 3, 72–86 (1991).
2. A. J. Howell and H. Buxton, "Towards unconstrained face recognition from image sequences," in Proceedings of the International Conference on Automatic Face and Gesture Recognition (IEEE Computer Society, Los Alamitos, Calif., 1996), pp. 224–229.
3. H. Wechsler, V. Kakkad, J. Huang, S. Gutta, and V. Chen, "Automatic video-based person authentication using the RBF network," in Proceedings of the International Conference on Audio- and Video-Based Person Authentication (Springer, New York, 1997), pp. 85–92.
4. S. J. McKenna and S. Gong, "Non-intrusive person authentication for access control by visual tracking and face recognition," in Proceedings of the International Conference on Audio- and Video-Based Person Authentication (Springer, New York, 1997), pp. 177–184.
5. M. Seibert and A. M. Waxman, "Combining evidence from multiple views of 3-D objects," in Sensor Fusion IV: Control Paradigms and Data Structures, P. S. Schenker, ed., Proc. SPIE 1611, 178–189 (1991).
6. J. Steffens, E. Elagin, and H. Neven, "PersonSpotter—fast and robust system for human detection, tracking and recognition," in Proceedings of the International Conference on Automatic Face and Gesture Recognition (IEEE Computer Society, Los Alamitos, Calif., 1998), pp. 516–521.
7. M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, C. v.d. Malsburg, and W. Konen, "Distortion invariant object recognition in the dynamic link architecture," IEEE Trans. Comput. 42, 300–311 (1993).
8. S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6, 721–741 (1984).
9. G. Kitagawa, "Monte Carlo filter and smoother for non-Gaussian nonlinear state space models," J. Comput. Graph. Stat. 5, 1–25 (1996).
10. J. Hammersley and D. Handscomb, Monte Carlo Methods (Wiley, New York, 1964).
11. M. H. Kalos and P. A. Whitlock, Monte Carlo Methods (Wiley, New York, 1986).
12. J. Liu and R. Chen, "Sequential Monte Carlo methods for dynamic systems," J. Am. Stat. Assoc. 93, 1031–1041 (1998).
13. N. Gordon, D. Salmond, and A. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proc. F, Radar Signal Process. 140, 107–113 (1993).
14. A. Kong, J. Liu, and W. Wong, "Sequential imputations and Bayesian missing data problems," J. Am. Stat. Assoc. 89, 278–288 (1994).
15. M. Isard and A. Blake, "Contour tracking by stochastic propagation of conditional density," in Proceedings of the European Conference on Computer Vision (Springer-Verlag, Cambridge, UK, 1996), Vol. I, pp. 343–356.
16. B. S. Manjunath and R. Chellappa, "A feature based approach to face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE Computer Society, Los Alamitos, Calif., 1992), pp. 373–378.
17. K. Okada, J. Steffens, T. Maurer, H. Hong, E. Elagin, H. Neven, and C. v.d. Malsburg, "The Bochum/USC face recognition system," in Face Recognition: From Theory to Applications, H. Wechsler, P. J. Phillips, V. Bruce, F. Soulie, and T. Huang, eds. (Springer-Verlag, New York, 1998).
18. B. Li and R. Chellappa, "Simultaneous tracking and verification via sequential importance sampling," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE Computer Society, Los Alamitos, Calif., 2000), Vol. II, pp. 110–117.
19. D. Kendall, "Shape manifolds, procrustean metrics and complex projective spaces," Bull. London Math. Soc. 16, 81–121 (1984).
20. D. Kendall, "Further developments and applications of the statistical theory of shape," Theory Probab. Appl. 31, 407–412 (1986).
21. D. Kendall, "A survey of the statistical theory of shape," Stat. Sci. 4, 87–120 (1989).
22. F. L. Bookstein, "Size and shape spaces for landmark data in two dimensions," Stat. Sci. 1, 181–242 (1986).
23. I. L. Dryden and K. V. Mardia, "General shape distribution in a plane," Adv. Appl. Probab. 23, 259–276 (1991).
24. K. V. Mardia and I. L. Dryden, "Shape distribution for landmark data," Adv. Appl. Probab. 21, 742–755 (1989).
25. C. R. Goodall and K. V. Mardia, "Multivariate aspects of shape theory," Ann. Stat. 21, 848–866 (1993).
26. R. Berthilsson, "A statistical theory of shape," Tech. Rep. (Department of Mathematics, Lund Institute of Technology, Lund, Sweden, 1997).
27. W. Grimson, D. Huttenlocher, and D. Jacobs, "A study of affine matching with bounded sensor error," in Proceedings of the European Conference on Computer Vision, Vol. 588 of Lecture Notes in Computer Science (Springer, New York, 1992), pp. 291–306.
28. M. C. Burl, T. K. Leung, and P. Perona, "Face localization via shape statistics," in Proceedings of the Workshop on Automatic Face and Gesture Recognition (IEEE Computer Society, Los Alamitos, Calif., 1995), pp. 194–199.
29. T. K. Leung, M. C. Burl, and P. Perona, "Probabilistic affine invariants for recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE Computer Society, Los Alamitos, Calif., 1998), pp. 678–684.
30. M. Weber, M. Welling, and P. Perona, "Towards automatic discovery of object categories," in Proceedings of the Conference on Computer Vision and Pattern Recognition (IEEE Computer Society, Los Alamitos, Calif., 2000), Vol. II, pp. 101–106.
31. M. Black and Y. Yacoob, "Tracking and recognizing facial expressions in image sequences, using local parametrized models of image motion," Tech. Rep. CS-TR-3401 (Center for Automation Research, University of Maryland, College Park, Md., 1995).
32. S. J. McKenna, S. Gong, R. P. Wurtz, and J. Tanner, "Tracking facial feature points with Gabor wavelets and shape models," in Proceedings of the International Conference on Audio- and Video-Based Biometric Person Authentication (Springer, New York, 1997), pp. 35–43.
33. D. Freedman and M. Brandstein, "A subset approach to contour tracking in clutter," in Proceedings of the International Conference on Computer Vision (IEEE Computer Society, Los Alamitos, Calif., 1999), pp. 99–104.