
Signal Processing 66 (1998) 249—260

Tracking of the motion of important facial features in model-based coding

Paul M. Antoszczyszyn*, John M. Hannah, Peter M. Grant

Department of Electrical Engineering, The University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh, Scotland EH9 3JL, UK

Received 14 April 1997; revised 26 September 1997

Abstract

A new method of tracking the position of important facial features for semantic wire-frame-based moving-image coding is presented. Reliable and fast tracking of the facial features in head-and-shoulders scenes is of paramount importance for reconstruction of the speaker's motion in videophone systems if wire-frame-based coding is used. The proposed method is based on eigenvalue decomposition of the sub-images extracted from subsequent frames of the video sequence. Motion of each facial feature (the left eye, the right eye, the nose and the lips) is tracked separately. No restrictions other than the presence of the speaker's face were imposed on the actual contents of the scene. The algorithm was tested on widely used head-and-shoulders video sequences containing moderate head pan, rotation and zoom, with remarkably good results. Tracking was maintained even when the facial features were occluded. The algorithm can also be used in other semantic-based systems. © 1998 Elsevier Science B.V. All rights reserved.


*Corresponding author. Tel.: +44 131 650 5613; fax: +44 131 650 6554; e-mail: plma@ee.ed.ac.uk.



Fig. 1. Candide wire-frame model.


1. Introduction

Extremely low bit-rate moving image transmission systems are designed to handle video communication in environments where data rates are not allowed to exceed 10 kbit/s. These include PSTN lines in certain countries and mobile communication. It is widely acknowledged that currently available systems are not capable of delivering a satisfactory quality of moving image at such low data rates. Most of these methods are derived from a block-based motion-compensated discrete cosine transform [10]. At extremely low bit-rates (below 9600 bit/s) the application of block-based algorithms results in disturbing artefacts [15]. Also, the frame rate is often reduced to around 5 frames per second. Despite the introduction of other moving-image coding techniques based on vector quantization [8], fractal theory [11] and wavelet analysis [1], it is still not possible to send video over extremely low bit-rate channels with acceptable quality. A very promising video coding approach was proposed by Musmann et al. [18]. Also, the most recent contributions of the MPEG-4 group are focused on object-based techniques [21]. However, it seems that only the application of model-based techniques will allow transmission at extremely low bit rates (1-10 kbit/s). A model-based approach is also very useful for scene analysis purposes in research areas such as virtual reality headsets, virtual speakers and lip reading systems.

Aizawa et al. [6] and Forchheimer [7] independently introduced semantic wire-frame models and computer graphics to the area of extremely low bit-rate communication. According to their assessments it is possible to obtain data rates below 10 kbit/s for head-and-shoulders video sequences. The concept of model-based communication can be briefly explained in the following way. A model of the human face (e.g. Candide, Fig. 1) is shared by the transmitter and the receiver. With each subsequent frame of the video sequence the position of the vertices of the wire frame is automatically tracked. The initial and subsequent positions of the wire frame are transmitted in the form of 3D coordinates over the low bit-rate channel, along with the texture of the face from the initial frame of the sequence. Knowing the texture of the scene from the initial frame and the 3D positions of the vertices of the wire frame in subsequent frames, it is possible to reconstruct the entire sequence by mapping the texture of the initial frame at the locations indicated by the transmitted vertices.
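To make the bandwidth argument concrete, a rough sketch of what such a system transmits is given below. This is our own illustration, not the authors' bitstream format; the type and field names are hypothetical and a real system would encode this information far more compactly. The texture and full wire-frame are sent once, and thereafter only the updated vertex coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vertex = Tuple[float, float, float]        # 3D coordinates of one wire-frame vertex

@dataclass
class InitialPayload:
    face_texture: bytes                    # texture of the face from the first frame
    vertices: List[Vertex]                 # initial positions of the wire-frame vertices

@dataclass
class FrameUpdate:
    vertices: List[Vertex]                 # tracked vertex positions for one new frame
```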

We have focused our efforts on techniques utilising models of the head-and-shoulders scene. In our research we use the Candide [7] wire-frame (Fig. 1).



Fig. 2. Tracking system.

The two main problems in content-based moving-image coding are automatic fitting of the wire frame and automatic tracking of the scene. A limited number of approaches to automatic wire-frame fitting have been proposed. The method proposed by Welsh [23] utilises the idea of 'snakes' (active contours). Different approaches were presented by Reinders et al. [19] and Seferidis [20]. We have proposed the application of a facial database [3,4]. However, the issue of automatic tracking still remains unresolved. Existing proposals employ optical-flow analysis [14] and correlation [13]. Here a solution to the problem of automatic scene tracking is presented as a major development of our previous approach [2]. We have developed an algorithm capable of tracking facial features in head-and-shoulders scenes and tested it on a range of widely used video sequences.

2. Tracking in head-and-shoulders scenes

The successive frames of a head-and-shoulders video sequence are relatively similar to each other. All frames of such a sequence have one feature in common: they contain the face of the speaker. While classification of the image as a face does not present any problem for the human brain, it is extremely difficult for a machine to perform the same task reliably. However, an image of a face contains certain distinct features that can be identified automatically. We have concentrated our efforts on tracking the motion of the left eye, the right eye, the nose and the lips. These facial features will be referred to as important facial features. Each facial feature should be tracked separately so that its 2D coordinates can be used to determine the current position of the speaker's head. This also allows implementation of the algorithm using a parallel architecture.

In the proposed tracking algorithm we use sequences of sub-images containing important facial features (for tracking of global motion, Fig. 2, left) and sequences of sub-images containing the vicinity of the vertices of the Candide model describing a particular feature (for tracking of local motion, Fig. 2, right, shown for the left eye). These sequences of sub-images are extracted from the initial frames of the analysed sequence, either manually or using the method described in [3]. Each sequence of sub-images is subsequently used to create a separate principal component space that is employed in further processing.

In Section 3 we describe the method of principal components applied to a sequence of images. Section 4 contains a detailed description of our tracking algorithm. The results of its application to a range of test sequences are presented in Section 5.

3. Eigenvalue decomposition for a sequence of images

In trying to derive a reliable mathematical approach to fitting and tracking of wire-frames in model-based coding we have compared methods based on correlation and principal component analysis (PCA). As the method based on PCA gave better results [5], we decided to employ it in all our subsequent research. Principal component analysis is also referred to as the Hotelling [9] transform, or a discrete case of the Karhunen-Loève [12,16] transform. The method of principal components is a data-analytic technique transforming a group of correlated variables such that certain optimal conditions are met. The most important of these conditions is that the transformed variables are uncorrelated. We will show that the method of principal components is a reliable mathematical tool for tracking the motion of facial features in head-and-shoulders video sequences.

In the first step of the method of principal components, the eigenvectors of the covariance (dispersion) matrix S of the sequence X of M N-dimensional input column vectors,

X = [x_1 \; x_2 \; \cdots \; x_M], \qquad x_j = [x_{ji}], \quad i = 1, \ldots, N, \quad j = 1, \ldots, M,

must be found. In our analysis the input sequence consists of images (2D matrices). They have to be converted into 1D column vectors. This can be done by scanning the image line by line or column by column. An image consisting of R rows and C columns would therefore produce a column input vector consisting of N = C \times R rows. We can obtain the covariance matrix from the following relationship:

S = Y Y^T,   (1)

where the columns of matrix Y are vectors y_j that differ from the input vectors x_j by the expected value m_x of the sequence X:

Y = [x_1 - m_x \;\; x_2 - m_x \;\; \cdots \;\; x_M - m_x],   (2)

m_x = \frac{1}{M} \sum_{i=1}^{M} x_i.   (3)

If we develop Eq. (1) using Eqs. (2) and (3), we will obtain a symmetric non-singular matrix:

S"Cs21

s12

s1N

s12

s22

2 s2N

F F } F

s1N

s2N 2 s2

ND, (4)

where s2i

is the variance of the ith variable and sij,

iOj, is the covariance between the ith and jthvariable. The images in the analysed sequence willbe correlated, therefore the elements of the matrixS that are not on the leading diagonal will benon-zero. The objective of the method of principalcomponents is to find the alternative coordinatesystem Z for an input sequence X in which all theelements off the leading diagonal of the covariancematrix S

Z(where index Z denotes the new coordi-

nate system) are zeros. According to matrix algebrasuch a matrix can be constructed if the eigenvectorsof the covariance matrix are known:

SZ"UTSU, (5)

where º"[u1u22

uN] and u

i, i"12N, is the ith

eigenvector of covariance matrix S. The eigenvec-tors can be calculated from the following relation-ship:

Sui"l

iui, (6)

where li

is the eigenvalue of the ith eigenvector,i"1,2 , N.

As can be seen from Eq. (6), calculation of the eigenvectors involves operations on the covariance matrix S. Even if small images are used as an input sequence, the size of the covariance matrix can be too large to handle on common computing equipment (e.g. a sequence of images consisting of 100 columns and 100 rows would result in a 100^2 \times 100^2 covariance matrix). If, however, the number of images M in the sequence X is considerably smaller than the dimension of the images themselves (N = C \times R), it is possible to reduce the computational effort considerably by application of singular value decomposition (SVD) [17]. SVD allows us to express the eigenvectors of the matrix S = Y Y^T as linear combinations of the eigenvectors of the matrix C = Y^T Y. Since matrix C is M \times M, the computational cost of finding the eigenvectors of the matrix S is greatly reduced (in our research M < 20).
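As an illustration of this reduction, the sketch below builds the principal-component space for one facial feature from its initial-set sub-images using the small-matrix trick. This is a minimal NumPy sketch of our own, not the authors' implementation, and the function and variable names are hypothetical.

```python
import numpy as np

def feature_space(initial_set):
    """Build the principal-component space for one facial feature.

    initial_set: array of shape (M, R, C), the M sub-images (R x C pixels)
    of that feature cut from the first M frames of the sequence.
    Returns the mean vector m_x (N x 1) and a matrix U whose columns are
    the leading eigenvectors of S = Y Y^T.
    """
    M = initial_set.shape[0]
    X = initial_set.reshape(M, -1).astype(float).T    # scan line by line: N x M, one column per image
    m_x = X.mean(axis=1, keepdims=True)               # expected value, Eq. (3)
    Y = X - m_x                                       # mean-subtracted columns, Eq. (2)

    # Small-matrix trick: eigenvectors of C = Y^T Y (M x M) yield those of
    # S = Y Y^T (N x N) via u = Y v, so the huge matrix S is never formed.
    C = Y.T @ Y
    eigvals, V = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                 # strongest components first
    keep = eigvals[order] > 1e-10                     # discard the zero-variance direction
    U = Y @ V[:, order][:, keep]
    U /= np.linalg.norm(U, axis=0, keepdims=True)     # normalise each eigenvector
    return m_x, U
```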

The Hotelling transformation can be expressed by the following relationship:

z_i = u_i^T (x_i - m_x),   (7)

or in vector form:

Z = U^T Y,   (8)



where z_i is the ith principal component. If we pre-multiply both sides of Eq. (8) by matrix U and take into account the fact that matrix U is orthonormal (thus U^T = U^{-1}), we will obtain the reverse Hotelling transformation:

Y = U Z.   (9)

The U matrix was derived using the input sequence X. If we analyse an image which was not originally a part of the input sequence, we can no longer expect its principal component representation to be optimal. Such an image, when transformed onto the principal component space (created using the input sequence X) and transformed back using relationship (9), will look different. The difference will indicate the 'distance' of the analysed image from all the images used to generate the principal component space. If, however, the analysed image is similar to the images from the input vector X, it is very likely that sequential application of the Hotelling transform and its reversed form will result in an image similar to the analysed one. Again, the difference will indicate how similar the analysed image is to the images used to generate the principal component space. This ability to analyse unknown images using a fixed ensemble of training images is utilised in face recognition algorithms [22].
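A minimal sketch of Eqs. (7)-(9) in the same NumPy setting (again our own illustration): an unknown sub-image is projected onto the component space and reconstructed, and the size of the residual indicates how far it lies from the training ensemble.

```python
import numpy as np

def project(U, m_x, image):
    """Hotelling transform, Eqs. (7)/(8): principal components of a sub-image."""
    y = image.reshape(-1, 1).astype(float) - m_x
    return U.T @ y                                    # z = U^T y

def reconstruct(U, m_x, z):
    """Reverse Hotelling transform, Eq. (9), mapped back to image space."""
    return U @ z + m_x

def distance_from_space(U, m_x, image):
    """Residual norm between a sub-image and its reconstruction; small values
    mean the image resembles the initial set used to build U."""
    y = image.reshape(-1, 1).astype(float) - m_x
    return float(np.linalg.norm(y - U @ (U.T @ y)))
```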

4. Automatic tracking algorithm

The input matrix X is formed from the sequence of sub-images containing important facial features (Fig. 2, left) extracted from the M initial frames of the analysed sequence. The matrix X is formed separately for each facial feature. In the case of tracking the centres of the facial features, there are four X matrices: for the left eye, the right eye, the nose and the lips. If, for example, the number of initial images was 16 and the dimensions of the extracted sub-images were 50 x 50, then M and N in Eqs. (2)-(4) would be 16 and 2500, respectively.

The automatic tracking commences with frame M+1. The initial position of the tracked facial feature in frame M+1 (the current frame) is assumed to be the same as in frame M (the previous frame). This assumption is subsequently verified in the following way. Sub-images within the search range are extracted from the current frame (e.g. a 15 x 15 search range would result in 225 sub-images). In order to avoid confusion we call these images the extracted set (Fig. 2). The term initial set will be used to describe the sub-images extracted from the M initial frames (matrix X). The size of the search range is adjustable, but the dimensions of the images from the extracted set are identical to those from the initial set. Since the extracted set images are similar to those from the initial set, we can assume that they can be projected onto the principal component space created by the input vector X. The further analysis of the extracted set is performed in the M-dimensional principal component space. It is our goal to find among the images in the extracted set the best match image. Its 2D coordinates will mark the position of the tracked facial feature in the current (M+1) frame.
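As a sketch of how the extracted set might be gathered (our own illustration; the paper gives no code, and the names are hypothetical), each offset within the search range yields one candidate sub-image of the same size as the initial-set images:

```python
import numpy as np

def extracted_set(frame, top_left, patch_shape, search=7):
    """Collect the candidate sub-images around the feature's previous position.

    A (2*search+1) x (2*search+1) search range (15 x 15 when search=7) gives
    one candidate patch per offset.  Returns the candidates flattened into
    rows and the list of (row, col) offsets they correspond to.
    """
    rows, cols = patch_shape
    r0, c0 = top_left
    candidates, offsets = [], []
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            rr, cc = r0 + dr, c0 + dc
            if rr < 0 or cc < 0:
                continue                               # offset falls above or left of the frame
            patch = frame[rr:rr + rows, cc:cc + cols]
            if patch.shape == (rows, cols):            # skip offsets beyond the frame border
                candidates.append(patch.astype(float).ravel())
                offsets.append((dr, dc))
    return np.array(candidates), offsets
```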

The most straightforward method of establishing the difference between two images is calculation of the Euclidean distance separating them in the principal component space created by the input vector X:

e_{ij} = \| a_i - b_j \|,   (10)

where a_i is the projection of the ith image vector from the initial set onto the M-dimensional principal component space according to relationship (7), and b_j is the projection of the jth image from the extracted set onto the same space. The distance e_{ij} is calculated for every possible pair of images from the initial set and the extracted set. The jth image from the extracted set for which the e_{ij} distance is minimal is the best match image. Its 2D coordinates mark the position of the particular facial feature in the next frame.
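A sketch of method A under the same assumptions as the earlier snippets (our own illustration): here initial_vectors is the N x M matrix whose columns are the initial-set image vectors, and candidates comes from the search-range extraction above.

```python
import numpy as np

def best_match_method_a(U, m_x, initial_vectors, candidates):
    """Method A, Eq. (10): pick the candidate whose projection lies closest,
    in principal-component space, to the projection of any initial-set image."""
    A = U.T @ (initial_vectors - m_x)            # columns a_i, initial set
    B = U.T @ (candidates.T - m_x)               # columns b_j, extracted set
    e = np.linalg.norm(A[:, :, None] - B[:, None, :], axis=0)   # e_ij for every pair
    return int(np.unravel_index(np.argmin(e), e.shape)[1])      # index j of the best match
```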

We have also tested a different method of derivation of the best-match image. The best result found using relationship (10) points to the image which is most similar to one particular image from the initial set. However, it seems that finding the image that is most similar to all the images from the initial set might provide us with a better measure of distance. Let us examine the following equation:

d_i = \left\| y_i - \sum_{j=1}^{M} z_j u_j \right\|,   (11)

where y_i denotes the normalised ith image from the extracted set, u_j the jth eigenvector of the covariance matrix S and z_j the jth principal component. The more similar the unknown image is to the initial set, the smaller will be the distance d_i. Large differences between the input and output images will result in a large value of d_i. Similarly to the previous case, the ith image from the extracted set for which the d_i distance is minimal is the best match image, and its 2D coordinates mark the position of the particular facial feature in the next frame.

Table 1. Mean tracking error and standard deviation: Miss America

                 Method A (150 frames)              Method B (150 frames)
Facial feature   Mean error     Std. deviation      Mean error     Std. deviation
                 (pixels)       (pixels)            (pixels)       (pixels)
Left eye         0.707          0.606               0.633          0.611
Right eye        1.324          1.232               0.944          0.749
Nose             1.230          0.828               0.548          0.606
Lips             1.833          1.367               0.712          0.623
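Method B, Eq. (11), can be sketched in the same way (a hypothetical illustration, not the authors' code): the candidate with the smallest reconstruction error is the best match, and its offset in the search range gives the feature's new position.

```python
import numpy as np

def best_match_method_b(U, m_x, candidates):
    """Method B, Eq. (11): smallest residual between each normalised candidate
    and its reconstruction from the feature's principal-component space."""
    Yc = candidates.T - m_x                            # one normalised candidate per column
    d = np.linalg.norm(Yc - U @ (U.T @ Yc), axis=0)    # d_i for every candidate
    return int(np.argmin(d))
```

In a tracking step, the returned index would be mapped back through the list of offsets produced by the search-range extraction to update the feature's 2D position.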

We initially focused our efforts on tracking the geometrical centres of the important facial features. In practical terms this means tracking the centre of the corona of the eye, the point mid-way between the nostrils and the centre of the lips. The coordinates of the important facial features tracked throughout the sequence form a rigid triangle (e.g. the left eye, the right eye and the nose, or the left eye, the right eye and the lips). It is therefore possible to deduce global motion (motion of the speaker's head) from the motion of this triangle.
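The paper does not detail how the global motion is recovered from this triangle; one possible way, shown here purely as an assumed illustration, is a least-squares 2D similarity fit (translation, rotation and zoom) between the triangle in a reference frame and in the current frame.

```python
import numpy as np

def global_motion(ref_pts, cur_pts):
    """Least-squares 2D similarity transform mapping the reference triangle of
    tracked feature centres onto the current one.

    ref_pts, cur_pts: arrays of shape (3, 2) holding (x, y) coordinates.
    Returns (scale, angle_in_radians, (tx, ty)).
    """
    ref_pts = np.asarray(ref_pts, dtype=float)
    cur_pts = np.asarray(cur_pts, dtype=float)
    p = ref_pts[:, 0] + 1j * ref_pts[:, 1]      # points as complex numbers
    q = cur_pts[:, 0] + 1j * cur_pts[:, 1]
    pc, qc = p - p.mean(), q - q.mean()
    a = np.vdot(pc, qc) / np.vdot(pc, pc)       # a = scale * exp(i * angle), least squares
    t = q.mean() - a * p.mean()                 # translation as a complex number
    return abs(a), float(np.angle(a)), (float(t.real), float(t.imag))
```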

5. Experimental results

In an attempt to determine the best measure of distance we initially tested our algorithm using Eq. (10) as method A and Eq. (11) as method B. We used the Miss America (CIF-sized) video sequence for this purpose. In both experiments tracking was maintained throughout the entire sequence for all facial features, as shown by subjective tests using a created movie with white crosses centred on the tracked points of the speaker's face. The centres of the crosses were constantly kept within the corona of the eye (when visible), between the nostrils, and in the centre of the lips. Tracking was maintained even when the speaker closed her eyes (quite frequent in the Miss America sequence) or opened and closed her lips. However, subjective observation showed the results obtained using method B to be more acceptable: the crosses centred on facial features moved smoothly and the response to head pan was immediate.

In order to assess the accuracy of the tracking method more precisely, the 2D locations of the important facial features were extracted manually from all 150 frames of the Miss America sequence. The error was measured as the Euclidean distance between the 2D coordinates of the features tracked automatically (in methods A and B) and manually. As can be seen from Table 1, which shows the feature tracking error, the application of method B gives superior results. The mean error in method B is less than one pixel for all the tracked facial features. In method A, the distance in principal component space between two individual images (one from the initial set and one from the extracted set) is measured. On the other hand, in method B, it is the difference between the normalised image from the extracted set and its reconstruction which is measured. As the reconstruction is carried out using eigenvectors of the principal component space created using all images from the initial set, this latter measurement is likely to give a better indication of the similarity to all the images from the initial set. Because of its superior performance we employed method B for all subsequent work.

A number of widely used head-and-shoulders video sequences were used for further testing. These are listed in Table 2, and are readily available from well-known WWW sites. They reflect a wide variety of possible head-and-shoulders videophone situations containing moderate speaker's head pan, rotation and zoom. CIF-sized sequences (Miss America, Claire, Salesman, Trevor) as well as QCIF sequences (Car Phone, Grandma) are included. The subjects of two video sequences (Trevor, Grandma) wear glasses and the Car Phone sequence was shot in a moving car.

Table 2. Sequences tested

Sequence       Horizontal size  Vertical size  Length    Tested samples
               (pixels)         (pixels)       (frames)  (frames)
Miss America   352              240            150       30
Claire         360              288            168       34
Car Phone      176              144            400       80
Grandma        176              144            768       154
Salesman       360              288            400       80
Trevor         256              256            100       20

Table 3. Tracking error results for Miss America and Claire

                 Miss America (30 of 150 frames)    Claire (34 of 168 frames)
Facial feature   Mean error     Std. deviation      Mean error     Std. deviation
                 (pixels)       (pixels)            (pixels)       (pixels)
Left eye         0.624          0.678               0.407          0.533
Right eye        0.844          0.686               0.638          0.741
Nose             0.569          0.596               0.784          0.553
Lips             0.489          0.582               1.018          0.621

Our algorithm produced reliable and consistent results for all tested sequences. In all our tests, tracking of the facial features was maintained throughout the entire sequence. In order to obtain a measure of the accuracy of the algorithm we used a method similar to the one described above. However, because of the length and the number of test sequences, manual extraction of the 2D positions of the facial features was only undertaken for every fifth frame of each sequence. This yielded 20 comparison frames for the Trevor sequence (lower limit) and 154 frames for the Grandma sequence. The resultant mean error and standard deviation for these sample frames are given in Tables 3-5. As can be seen, the mean error for all the facial features in almost all the sequences was less than 1 pixel. Figs. 3 and 4 show the actual error profiles for the sequence with the largest errors (Car Phone), which was also the one with the largest amount of motion. Especially in the case of the QCIF sequences (Car Phone, Grandma), the reference manual fitting can also introduce additional error, since it is sometimes difficult to judge manually the position of the centre of a particular facial feature.

The left and right eye seem to be the most reliably tracked facial features, possibly due to the combination of very light (white of the eye) and very dark (corona of the eye) pixels. However, the track was also maintained during periods when the speakers close their eyes. There were no problems with tracking eyes partially occluded by glasses. Despite the use of a static initial set, the algorithm coped well with temporary occlusions (Salesman). Indeed, further experimental work has confirmed that application of a dynamic set does not result in an improvement in tracking performance. Our overall results clearly demonstrate that the algorithm is able to maintain tracking of important facial features for a variety of head-and-shoulders sequences. Stills which illustrate the range of movement of the facial features in the test sequences are presented in Figs. 5-10. The full sequences illustrating the performance of the algorithm are available from our Internet site: http://www.ee.ed.ac.uk/~plma/.

Table 4. Tracking error results for Car Phone and Grandma

                 Car Phone (80 of 400 frames)       Grandma (154 of 768 frames)
Facial feature   Mean error     Std. deviation      Mean error     Std. deviation
                 (pixels)       (pixels)            (pixels)       (pixels)
Left eye         0.878          0.715               0.856          0.600
Right eye        0.886          0.781               0.829          0.648
Nose             1.0594         0.873               0.372          0.528
Lips             1.663          1.042               0.731          0.670

Table 5. Tracking error results for Salesman and Trevor

                 Salesman (80 of 400 frames)        Trevor (20 of 100 frames)
Facial feature   Mean error     Std. deviation      Mean error     Std. deviation
                 (pixels)       (pixels)            (pixels)       (pixels)
Left eye         0.862          0.784               0.844          0.752
Right eye        0.827          0.963               0.839          0.886
Nose             0.845          0.692               0.853          0.583
Lips             0.939          1.314               0.695          0.701

Fig. 3. Car Phone: the left eye (left) and the right eye (right) tracking error profiles.

We have also investigated the effectiveness of our algorithm for tracking the shape of individual facial features. Since we wish to utilise the Candide wire-frame, in order to reconstruct the local motion (e.g. lips closing and opening, eyes closing and opening) we must be able to track reliably the motion of the vertices assigned to the selected facial features (Fig. 11).



Fig. 4. Car Phone: The nose (left) and the lips (right) tracking error profiles.

Fig. 5. Tracking Miss America frames 20, 85 and 120.

Fig. 6. Tracking Claire frames 0, 95 and 140.

Fig. 7. Tracking Car Phone frames 80, 170 and 290.



Fig. 8. Tracking Grandma frames 340, 410 and 440.

Fig. 9. Tracking Salesman frames: 120, 140, 210.

Fig. 10. Tracking Trevor frames: 20, 30, 40.

In the case of the eyes, the eyebrows and the lips this involves tracking of at least four vertices. We utilised the same algorithm, but this time the initial set images were centred on the points of the image that corresponded to the positions of the wire-frame vertices of a particular facial feature (Fig. 11; Fig. 2, right). Even if the shape of the facial feature changes radically, the contents of the image from the initial set will change only slightly. For example, tracking a corner of an opening eye effectively means tracking a dark triangle increasing its area on a frame-by-frame basis. Also, since the images describing the vicinity of the feature's vertex are smaller than the images centred on the feature itself, a considerable reduction of the computational load is achieved for tracking of the particular vertex. Using this approach it was possible to track closing and opening of the lips and eyes.



Fig. 11. Tracking points for the left eye and the left eyebrow (left) and the lips (right) of the Candide model.

Fig. 12. Tracking anchor vertices (left) used to manipulate the Candide model (right).

Observation of test video sequences re-created from the results of the algorithm showed very good tracking performance. The tracked vertices were subsequently used as anchors for vertices of the Candide wire-frame model (Fig. 12), and the wire-frame model was driven by both the global motion of the speaker's head and the local motion of the facial features. Again, subjective observation showed this to operate very effectively.

6. Conclusions and further research

We have developed a new and reliable algorithm for automatically tracking the motion of facial features in head-and-shoulders scenes, based on eigenvalue decomposition of sub-images containing facial features such as the eyes, the nose and the lips. The algorithm was tested on sequences containing limited pan, rotation and zoom of the speaker's head and gave very good results. We were able to recover both global and local motion and use it for driving the Candide wire-frame. The use of SVD allowed us to track the motion of the facial features in almost real time on a Pentium-based computer. As all the facial features are tracked independently, the algorithm could easily be adapted for use on a parallel processing system.



In future research we intend to reconstruct the tracked video sequence using texture mapping and to apply the techniques described here to implement a software-based videophone system capable of operation at extremely low data rates. We also envisage application of this algorithm in virtual reality systems.

Acknowledgements

The authors wish to thank Professor Don Pearson of The University of Essex for valuable comments. Paul Antoszczyszyn acknowledges the support of The University of Edinburgh through a Postgraduate Studentship.

References

[1] M. Antonini, M. Barlaud, P. Mathieu, I. Daubechies, Image coding using wavelet transform, IEEE Trans. Image Process. 1 (2) (April 1992) 205-220.

[2] P.M. Antoszczyszyn, J.M. Hannah, P.M. Grant, Facial features motion analysis for wire-frame tracking in model-based moving image coding, Proc. 1997 IEEE Internat. Conf. Acoust. Speech Signal Process., ICASSP'97, Munich, Germany, 20-24 April 1997, Vol. IV, pp. 2669-2672.

[3] P.M. Antoszczyszyn, J.M. Hannah, P.M. Grant, Facial features model fitting in semantic-based scene analysis, IEE Electron. Lett. 33 (10) (8 May 1997) 855-857.

[4] P.M. Antoszczyszyn, J.M. Hannah, P.M. Grant, Automatic frame fitting for semantic-based moving image coding using a facial code-book, Proc. 8th European Signal Processing Conf. EUSIPCO'96, Trieste, Italy, 10-13 September 1996, Vol. II, pp. 1369-1372.

[5] P.M. Antoszczyszyn, J.M. Hannah, P.M. Grant, A comparison of detailed automatic wire-frame fitting methods, Proc. 1997 IEEE Internat. Conf. on Image Processing, ICIP'97, Santa Barbara, USA, 26-29 October 1997, Vol. 1, pp. 468-472.

[6] K. Aizawa, H. Harashima, T. Saito, Model-based analysis synthesis image coding (MBASIC) system for a person's face, Signal Processing: Image Communication 1 (2) (October 1989) 139-152.

[7] R. Forchheimer, T. Kronander, Image coding: from waveforms to animation, IEEE Trans. Acoust. Speech Signal Process. 37 (12) (December 1989) 2008-2023.

[8] A. Gersho, On the structure of vector quantizers, IEEE Trans. Inform. Theory 28 (2) (March 1982) 157-166.

[9] H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educational Psychol. 24 (September 1933) 417-441 and 498-520.

[10] ITU-T Draft H.263: Line transmission of non-telephone signals, Video coding for low bitrate communication, July 1995.

[11] A.E. Jacquin, Image coding based on a fractal theory of iterated contractive image transformations, IEEE Trans. Image Process. 1 (1) (January 1992) 18-30.

[12] K. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung, Ann. Acad. Sci. Fennicae, Ser. A1, Math. Phys. 37 (1946). Translated by I. Selin, On linear methods in probability theory, Doc. T-131, Rand Corp., Santa Monica, CA, 1960.

[13] M. Kokuer, A.F. Clark, Feature and model tracking for model-based coding, Proc. 1992 IEE Internat. Conf. on Image Processing and its Applications, Maastricht, Netherlands, 7-9 April 1992, pp. 135-138.

[14] H. Li, R. Forchheimer, Two-view facial movement estimation, IEEE Trans. Circuits Systems Video Technol. 4 (3) (June 1994) 276-287.

[15] H. Li, A. Lundmark, R. Forchheimer, Image sequence coding at very low bit rates: a review, IEEE Trans. Image Process. 3 (5) (September 1994) 589-609.

[16] M. Loève, Fonctions aléatoires de second ordre, in: P. Lévy (Ed.), Processus Stochastiques et Mouvement Brownien, Hermann, Paris, France, 1948.

[17] H. Murakami, V. Kumar, Efficient calculation of primary images from a set of images, IEEE Trans. Pattern Anal. Mach. Intell. 4 (5) (September 1982) 511-515.

[18] H.G. Musmann, M. Hötter, J. Ostermann, Object-oriented analysis-synthesis coding of moving images, Signal Processing: Image Communication 1 (2) (October 1989) 117-138.

[19] M.J.T. Reinders, P.J.L. van Beek, B. Sankur, J.C.A. van der Lubbe, Facial feature localization and adaptation of a generic face model for model-based coding, Signal Processing: Image Communication 7 (1) (March 1995) 57-74.

[20] V. Seferidis, Facial feature estimation for model-based coding, Electron. Lett. 27 (24) (November 1991) 2226-2228.

[21] Special Issue on MPEG-4, IEEE Trans. Circuits Systems Video Technol. 7 (1) (February 1997).

[22] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71-86.

[23] B. Welsh, Model-based coding of video images, Electron. Comm. Eng. J. 3 (1) (February 1991) 29-38.
