
Reliable tracking of facial features in semantic-based video coding

P.M. Antoszczyszyn, J.M. Hannah and P.M. Grant, FIEE

Indexing terms: Tracking, Image coding

Abstract: A new method of tracking the position of important facial features for semantic-based moving image coding is presented. Reliable and fast tracking of the facial features in head-and-shoulders scenes is of paramount importance for reconstruction of the speaker's motion in videophone systems. The proposed method is based on eigenvalue decomposition of the sub-images extracted from subsequent frames of the video sequence. Motion of each facial feature (the left eye, the right eye, the nose and the lips) is tracked separately; this means that the algorithm can be easily adapted for a parallel machine. No restrictions, other than the presence of the speaker's face, were imposed on the actual contents of the scene. The algorithm was tested on numerous widely used head-and-shoulders video sequences containing moderate head pan, rotation and zoom, with remarkably good results. Tracking was maintained even when the facial features were occluded. The algorithm can also be used in other semantic-based systems.

1 Introduction

Extremely low bit-rate moving image transmission systems are designed to handle video communication in environments where data rates are not allowed to exceed 10 kbit/s. These include PSTN lines in certain countries and mobile communications. It is widely acknowledged that currently available systems are not capable of delivering a satisfactory quality of moving image at such low data rates. Most of these systems are derived from block-based motion-compensated discrete cosine transform coding [1]. At extremely low bit rates (below 9600 bit/s), the application of block-based algorithms results in disturbing artefacts [2]. Despite the introduction of other moving image coding techniques based on vector quantisation [3], fractal theory [4] and wavelet analysis [5], it is still not possible to send video over extremely low bit-rate channels with acceptable quality. A promising approach to scene analysis has been proposed by Musmann et al. [6, 7]. However, it seems that only the application of semantic-based techniques will allow transmission at extremely low bit rates.

© IEE, 1998. Paper first received 24th March 1997 and in revised form 9th March 1998. IEE Proceedings online no. 19982153. The authors are with the Department of Electrical Engineering, University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh EH9 3JL, UK.

Fig. 1 'Candide' wire-frame model

Aizawa et al. [8] and Forchheimer [9] independently introduced semantic wire-frame models and computer graphics to the area of extremely low bit-rate communication. According to their assessments, it is possible to obtain data rates below 10 kbit/s for head-and-shoulders video sequences. The concept of semantic-based communication can be briefly explained as follows. A semantic model of the human face (e.g. Candide in Fig. 1) is shared by the transmitter and the receiver. With each subsequent frame of the video sequence, the position of the vertices of the wire-frame is automatically tracked. The initial and subsequent positions of the wire-frame are transmitted in the form of 3-D co-ordinates over the low bit-rate channel, along with the texture of the face from the initial frame of the sequence. Knowing the texture of the scene from the initial frame and the 3-D positions of the vertices of the wire-frame in subsequent frames, it is possible to reconstruct the entire sequence by mapping the texture of the initial frame at the locations indicated by the transmitted vertices.
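The data flow can be sketched as follows (an illustrative Python outline only; the names 'fit' and 'track' stand in for the fitting and tracking stages discussed below and are not part of the original system):

    def encode_sequence(frames, wireframe, fit, track):
        # Transmit the face texture once, with the initial wire-frame fit ...
        vertices = fit(frames[0], wireframe)
        yield ('texture', frames[0]), ('vertices', vertices)
        # ... then only the tracked 3-D vertex positions for later frames.
        for frame in frames[1:]:
            vertices = track(frame, vertices)
            yield ('vertices', vertices)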

We have focused our efforts on techniques utilising semantic models of the head-and-shoulders scene. In our research we use the Candide [9] semantic wire-frame (Fig. 1). The two main problems in content-based moving image coding are automatic fitting of the wire-frame and automatic tracking of the scene. A limited number of approaches to automatic wire-frame fitting have been proposed. The method proposed by Welsh [10] utilises the idea of 'snakes' (active contours).


Different approaches were presented by Reinders et al. [11] and Seferidis [12]. We have proposed the application of a facial database [13, 14]. However, the issue of automatic tracking still remains unresolved. Existing proposals employ optical flow analysis [15] and correlation [16]. Extraction and tracking of the eyes is the subject of numerous contributions, e.g. based on Kalman filtering [17], on deformable templates [18] and on a statistical approach [19].

Here, a solution to the problem of automatic scene tracking is presented as a major development of our previous approach [20]. We have developed an algorithm capable of tracking facial features in head-and-shoulders scenes. It has been tested on numerous, widely used, video sequences.

2 Eigenvalue decomposition for a sequence of images

Principal component analysis is also referred to as the Hotelling [21] transform, or a discrete case of the Karhunen-Loeve [22, 23] transform. In the first step of the method of principal components, the eigenvectors must be found of the covariance (dispersion) matrix S of the sequence X of M N-dimensional input column vectors: $X = [x_1\ x_2\ \ldots\ x_M]$, $x_j = [x_{ij}]$, $i = 1 \ldots N$, $j = 1 \ldots M$. The images are converted into 1-D column vectors using line-by-line scanning. An image consisting of R rows and C columns would therefore produce a column input vector consisting of $N = C \times R$ rows. We can obtain the covariance matrix from the following relationship:

$S = YY^T$   (1)

where the columns of matrix Y are vectors $y_j$ that differ from the input vectors $x_j$ by the expected value $m_x$ of the sequence X:

$Y = [x_1 - m_x \quad x_2 - m_x \quad \ldots \quad x_M - m_x]$   (2)

$m_x = \frac{1}{M} \sum_{j=1}^{M} x_j$   (3)
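As a concrete illustration of eqns. 1-3, a minimal numpy sketch might look as follows (function and variable names are ours, not from the original system):

    import numpy as np

    def covariance_from_subimages(subimages):
        """Build S = Y Y^T (eqns. 1-3) from equally sized grey-scale
        sub-images, each scanned line by line into a column vector."""
        # X is N x M: one flattened sub-image (N = C x R pixels) per column
        X = np.column_stack([img.reshape(-1).astype(float) for img in subimages])
        m_x = X.mean(axis=1, keepdims=True)   # expected value of the sequence (eqn. 3)
        Y = X - m_x                           # mean-removed image vectors (eqn. 2)
        S = Y @ Y.T                           # covariance matrix (eqn. 1)
        return X, Y, m_x, S

Note that for 50 x 50 sub-images S is 2500 x 2500, which is why the SVD shortcut discussed below matters in practice.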

If we develop eqn. 1 using eqns. 2 and 3, we obtain a symmetric non-singular matrix:

$S = \begin{bmatrix} s_1^2 & s_{12} & \cdots & s_{1N} \\ s_{21} & s_2^2 & \cdots & s_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ s_{N1} & s_{N2} & \cdots & s_N^2 \end{bmatrix}$   (4)

where $s_i^2$ is the variance of the ith variable and $s_{ij}$, $i \neq j$, is the covariance between the ith and jth variables. The objective of the method of principal components is to find an alternative co-ordinate system Z for the input sequence X, in which all the elements off the leading diagonal of the covariance matrix $S_Z$ (where the index Z denotes the new co-ordinate system) are zero. According to matrix algebra, such a matrix can be constructed if the eigenvectors of the covariance matrix are known:

$S_Z = U^T S U$   (5)

where $U = [u_1\ u_2\ \ldots\ u_N]$ and $u_i$, $i = 1 \ldots N$, is the ith eigenvector of covariance matrix S. The eigenvectors can be calculated from the following relationship:

$S u_i = \lambda_i u_i$   (6)

where $\lambda_i$ is the eigenvalue of the ith eigenvector, and $i = 1 \ldots N$. It is possible to reduce the computational effort involved in calculating the eigenvectors of the S matrix by application of singular value decomposition (SVD) [24]. The Hotelling transformation can be expressed by the following relationship:

$z_i = U^T (x_i - m_x)$   (7)

or, in matrix form:

$Z = U^T Y$   (8)

where $z_i$ is the ith principal component.
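Since N (pixels) is much larger than M (images), the eigenvectors of S can be obtained from the SVD of Y, whose left singular vectors are exactly the eigenvectors of $YY^T$. A sketch of eqns. 5-8 under this assumption (again with our own names):

    import numpy as np

    def principal_axes(Y, R):
        """Leading R eigenvectors of S = Y Y^T via the SVD of Y; the
        left singular vectors of Y are the eigenvectors of Y Y^T, and
        the eigenvalues of S are the squared singular values."""
        U, sigma, _ = np.linalg.svd(Y, full_matrices=False)  # U is N x M
        return U[:, :R]                   # keep R <= M principal axes

    def hotelling_transform(U_R, x, m_x):
        """Project a flattened sub-image onto the principal component
        space (eqn. 7)."""
        return U_R.T @ (x - m_x.ravel())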

The Hotelling transform has been applied to the analysis of images of human faces in previous research [25]. The ability to analyse unknown images using a generic ensemble of training images has been utilised for recognition purposes [26]. An in-depth comparison of face recognition techniques has also been presented [27].

3 Automatic tracking algorithm

The image of a face contains certain distinct features that can be identified automatically. We have concentrated our efforts on tracking motion of the important facial features (the left eye, the right eye, the nose and the lips). The input matrix X is formed from the sub-images containing important facial features (left of Fig. 2) extracted from the M = 16 initial frames of the analysed sequence. The fact that the initial set is created from sub-images extracted from the analysed sequence (either manually or automatically [14]) increases the robustness of the algorithm and decreases the overall tracking error. The matrix X is formed separately for each facial feature. In our case, there are four X matrices: for the left eye, the right eye, the nose and the lips. If, for example, the number of initial images was 16 and the dimensions of the extracted sub-images were 50 x 50, then M and N in eqns. 2-4 would be 16 and 2500, respectively.
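The construction of one such initial set might be sketched as follows; the feature centres are assumed to come from the manual or automatic fitting stage, and image-border checks are omitted for brevity:

    import numpy as np

    def initial_set(frames, centres, size=50):
        """Cut one sub-image per initial frame, centred on the tracked
        feature, and stack the flattened patches as the columns of X
        (M = len(frames), N = size * size)."""
        half = size // 2
        cols = []
        for frame, (row, col) in zip(frames, centres):
            patch = frame[row - half:row + half, col - half:col + half]
            cols.append(patch.reshape(-1).astype(float))
        return np.column_stack(cols)      # N x M matrix X for one feature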

The automatic tracking commences with frame M + 1. The initial position of the tracked facial feature in frame M + 1 (the current frame) is assumed to be the same as in frame M (the previous frame). This assumption is subsequently verified in the following way. Sub-images within the search range are extracted from the current frame (e.g. a 15 x 15 search range would result in 225 sub-images). In order to avoid confusion, we call these images the extracted set (Fig. 2). The term 'initial set' is used to describe the sub-images extracted from the M initial frames (matrix X). The size of the search range is adjustable, but the dimensions of the images from the extracted set are identical to those from the initial set. Since the extracted set images are similar to those from the initial set, we can assume that they can be projected onto the principal component space created by input vector X. Further analysis of the extracted set is performed in R-dimensional (R <= M) principal component space. It is our goal to find the best match image among the images in the extracted set. Its 2-D co-ordinates will mark the position of the tracked facial feature in the current (M + 1) frame.
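A sketch of how the extracted set might be gathered for one frame (illustrative only; border handling is again omitted):

    def extracted_set(frame, prev_centre, search=15, size=50):
        """Sub-images within the search range around the feature's
        previous position; a 15 x 15 range yields 225 candidates."""
        half, s = size // 2, search // 2
        r0, c0 = prev_centre
        candidates = {}
        for dr in range(-s, s + 1):
            for dc in range(-s, s + 1):
                r, c = r0 + dr, c0 + dc
                patch = frame[r - half:r + half, c - half:c + half]
                candidates[(r, c)] = patch.reshape(-1).astype(float)
        return candidates                 # keyed by candidate 2-D position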

Note that the dimensionality of the principal component space can be reduced to less than M, as we are interested in correct recognition of the most similar sub-image, rather than its reconstruction. Analysis of the variance of a particular principal component (i.e. the absolute value of the characteristic root associated with it) can lead to its elimination from further processing.

Fig. 2 Tracking system

The most straightforward method of establishing the difference between two images is calculation of the Euclidean distance separating them in the principal component space created by vector X:

$e_{ij} = \sqrt{\sum_{k=1}^{K} (a_{ik} - b_{jk})^2}$   (9)

where $a_i$ is the projection of the ith image vector from the initial set onto the K-dimensional principal component space according to eqn. 7, and $b_j$ is the projection of the jth image from the extracted set onto the same space. The distance $e_{ij}$ is calculated for every possible pair of images from the initial set and the extracted set. The jth image from the extracted set for which the $e_{ij}$ distance is minimal is the best match image. Its 2-D co-ordinates mark the position of the particular facial feature in the next frame.
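In code, method A reduces to a nearest-pair search in the projected space. A minimal sketch, assuming A holds the projected initial set column-wise and the candidates are keyed by 2-D position:

    import numpy as np

    def best_match_method_a(A, projected_candidates):
        """Method A (eqn. 9): choose the candidate whose projection is
        closest to any single projected initial-set image. A is K x M;
        projected_candidates maps 2-D position -> K-vector b_j."""
        best_pos, best_e = None, np.inf
        for pos, b in projected_candidates.items():
            e = np.linalg.norm(A - b[:, None], axis=0).min()  # nearest a_i
            if e < best_e:
                best_pos, best_e = pos, e
        return best_pos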

We have also tested a different method of deriving the best match image. The best result found using eqn. 9 points to the image which is the most similar to one particular image from the initial set. However, it seems that finding the image that is most similar to all the images from the initial set might provide a better measure of distance. Examine the following equation:

$d_j = \left\| y_j - \sum_{i=1}^{M} z_i u_i \right\|$   (10)

where $y_j$ denotes the normalised jth image from the extracted set, $u_i$ is the ith eigenvector of covariance matrix S, and $z_i$ is the ith principal component. The more similar the unknown image is to the initial set, the smaller is the distance $d_j$. Large differences between the input and output images result in a large value of $d_j$. Similarly to the previous case, the jth image from the extracted set for which the distance $d_j$ is minimal is the best match image, and its 2-D co-ordinates mark the position of the particular facial feature in the next frame.
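Method B therefore compares each candidate with its own reconstruction from the initial-set eigenvectors; a sketch under the same assumptions as before:

    import numpy as np

    def best_match_method_b(U_R, m_x, candidates):
        """Method B (eqn. 10): pick the candidate whose normalised image
        is best reconstructed from the eigenvectors of the initial set."""
        best_pos, best_d = None, np.inf
        for pos, x in candidates.items():
            y = x - m_x.ravel()              # normalise against the initial-set mean
            z = U_R.T @ y                    # principal components (eqn. 8)
            d = np.linalg.norm(y - U_R @ z)  # residual distance d_j (eqn. 10)
            if d < best_d:
                best_pos, best_d = pos, d
        return best_pos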

We initially focused our efforts on tracking the geometrical centres of important facial features. In practical terms, this means tracking the centre of the corona of the eye, the mid-point between the nostrils, and the centre of the lips. Co-ordinates of the important facial features tracked throughout the sequence form a rigid triangle (e.g. the left eye, the right eye and the nose, or the left eye, the right eye and the lips). It is therefore possible to deduce global motion (motion of the speaker's head) from the motion of this triangle.
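One straightforward way to recover this global motion is a least-squares 2-D similarity fit between the triangle corners in consecutive frames. The following sketch illustrates the idea (this particular fitting method is our illustrative choice, not one prescribed above):

    import numpy as np

    def global_motion(prev_pts, curr_pts):
        """Scale, rotation and translation mapping the previous feature
        triangle (e.g. left eye, right eye, nose) onto the current one."""
        P = np.asarray(prev_pts, float)           # 3 x 2 corner co-ordinates
        Q = np.asarray(curr_pts, float)
        Pc, Qc = P - P.mean(0), Q - Q.mean(0)
        p = Pc[:, 0] + 1j * Pc[:, 1]              # complex form of the
        q = Qc[:, 0] + 1j * Qc[:, 1]              # 2-D Procrustes problem
        a = (np.conj(p) @ q) / (np.conj(p) @ p)   # scale * exp(i * rotation)
        return abs(a), np.angle(a), Q.mean(0) - P.mean(0)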

4 Experimental results

We have tested the tracking algorithm on numerous head-and-shoulders video sequences with variable amount of motion, frame size and sequence length. Our experiments consist of three stages. In the first (preliminary) stage, we establish a reliable distance measure using one of the video sequences. In the second (main) stage, we test the overall performance of the tracking algorithm on all the video sequences. In the final (concluding) stage, we test the suitability of the algorithm for tracking the shape of particular facial features.

In an attempt to determine the best measure of distance, we have tested our algorithm using eqn. 9 as method A and eqn. 10 as method B. We used the 'Miss America' (CIF-sized) video sequence for this purpose. In both experiments, tracking was maintained throughout the entire sequence for all facial features, as shown by subjective tests using a created movie with white crosses centred on the tracked points of the speaker's face. The centres of the crosses were constantly kept within the corona of the eye (when visible), between the nostrils, and in the centre of the lips. Tracking was maintained even when the speaker closed her eyes (quite frequent in the 'Miss America' sequence), or opened and closed her lips. However, subjective observation showed the results obtained using method B to be more acceptable: the crosses centred on facial features moved smoothly, and the response to head pan was immediate.

Table 1: Mean tracking error and standard deviation for 'Miss America'

                 Method A (150 frames)          Method B (150 frames)
Facial           Mean error    Std deviation    Mean error    Std deviation
feature          (pixels)      (pixels)         (pixels)      (pixels)
Left eye         0.707         0.606            0.633         0.611
Right eye        1.324         1.232            0.944         0.749
Nose             1.230         0.828            0.548         0.606
Lips             1.833         1.367            0.712         0.623


In order to assess the accuracy of the tracking method more precisely, the 2-D locations of the important facial features were extracted manually from all 150 frames of the 'Miss America' sequence. The error was measured as the Euclidean distance between the 2-D co-ordinates of the features tracked automatically (in methods A and B) and manually. Tracking error profiles are presented in Figs. 3 and 4. As can be seen from Table 1, the application of method B gives superior results. The mean error in method B is less than 1 pixel for all the tracked facial features. In method A, the distance in principal component space between two individual images (one from the initial set and one from the extracted set) is measured. On the other hand, in method B, it is the difference between the normalised image from the extracted set and its reconstruction which is measured. As the reconstruction is carried out using eigenvectors of the principal component space created using all images from the initial set, this latter measurement (in method B) is likely to yield a better indication of the similarity to all the images from the initial set. Owing to its superior performance, we employed method B for all subsequent work.

Fig. 3 Methods A and B left eye tracking error profiles: (a) method A, (b) method B (tracking error in pixels against frame number, frames 0-150)

Table 2: Sequences tested in the main stage

Sequence        Horizontal size    Vertical size    Length      Sample
                (pixels)           (pixels)         (frames)    (frames)
Miss America    352                240              150         30
Claire          360                288              168         34
Car Phone       176                144              400         80
Grandma         176                144              768         154
Salesman        360                288              400         80
Trevor          256                256              100         20

The head-and-shoulders video sequences used for further testing are listed in Table 2 (all of these are readily available from well known WWW sites). The sequences reflect a wide variety of possible head-and-shoulders videophone situations containing moderate speaker's head pan, rotation and zoom. CIF-sized sequences ('Miss America', 'Claire', 'Salesman', 'Trevor') and QCIF sequences ('Car Phone', 'Grandma') are included. The subjects of two video sequences ('Trevor', 'Grandma') wear spectacles, and the 'Car Phone' sequence was shot in a moving car.

Fig. 4 Methods A and B lips tracking error profiles: (a) method A, (b) method B (tracking error in pixels against frame number, frames 0-150)

Our algorithm produced reliable and consistent results for all tested sequences. In all our tests, tracking of all important facial features was maintained throughout the entire sequence. In order to obtain a measure of accuracy of the algorithm, we used a method similar to the one described above. However, because of the length and the number of test sequences, we chose to extract manually the 2-D positions of the important facial features from every fifth frame of each sequence only. This results in 20 frames in the case of 'Trevor' (lower limit) and 154 frames in the case of 'Grandma'. The resultant mean error and standard deviation are given in Tables 3-5. As can be seen, the mean error for all the facial features in almost all the sequences was less than 1 pixel.

Table 3: Mean tracking error results for 'Miss America' and 'Claire'

                 Miss America (30 of 150 frames)   Claire (34 of 168 frames)
Facial           Mean error    Std deviation       Mean error    Std deviation
feature          (pixels)      (pixels)            (pixels)      (pixels)
Left eye         0.624         0.678               0.407         0.533
Right eye        0.844         0.686               0.638         0.741
Nose             0.569         0.596               0.784         0.553
Lips             0.489         0.582               1.018         0.621


Fig. 5 Tracking 'Miss America' frames 20, 85 and 120

Fig. 6 Tracking 'Claire' frames 0, 95 and 140

Fig. 7 Tracking 'Car Phone' frames 80, 170 and 290

Fig. 8 Tracking 'Grandma' frames 340, 410 and 440

Table 4: Tracking error results for 'Car Phone' and 'Grandma'

                 Car Phone (80 of 400 frames)      Grandma (154 of 768 frames)
Facial           Mean error    Std deviation       Mean error    Std deviation
feature          (pixels)      (pixels)            (pixels)      (pixels)
Left eye         0.862         0.784               0.844         0.752
Right eye        0.827         0.963               0.839         0.886
Nose             0.845         0.692               0.853         0.583
Lips             0.939         1.314               0.695         0.701

Table 5: Tracking error results for 'Salesman' and 'Trevor'

                 Salesman (80 of 400 frames)       Trevor (20 of 100 frames)
Facial           Mean error    Std deviation       Mean error    Std deviation
feature          (pixels)      (pixels)            (pixels)      (pixels)
Left eye         0.878         0.715               0.856         0.600
Right eye        0.886         0.781               0.829         0.648
Nose             1.059         0.873               0.372         0.528
Lips             1.663         1.042               0.731         0.670

The left and the right eye seem to be the most reliably tracked facial features, possibly due to the combination of very light (white of the eye) and very dark (corona of the eye) pixels. However, the track was also maintained during periods when the speakers closed their eyes. There were no problems with tracking eyes partially occluded by spectacles. Despite the use of a static initial set, the algorithm coped well with temporary occlusions ('Salesman'). Inclusion of 'fail-safe' mechanisms (i.e. recovery of the tracking of a certain feature based on the positions of the remaining features) proved unnecessary. The overall results clearly demonstrate that the algorithm is able to maintain tracking of important facial features for a variety of head-and-shoulders sequences (Figs. 5-8).

In the final part of the experiment, we investigated the effectiveness of our algorithm for tracking the shape of individual facial features. Since we wish to utilise the 'Candide' wire-frame, in order to reconstruct the local motion (e.g. lips close-open, eyes close-open) we must be able to track reliably the motion of the vertices assigned to the selected facial features (Fig. 9). In the case of the eyes, the eyebrows and the lips, this involves tracking at least four vertices. We utilised the same algorithm, but this time the initial set images were centred on the points of the image that corresponded to the positions of the wire-frame vertices of a particular facial feature (Fig. 9, right of Fig. 2). Even if the shape of the facial feature changes radically, the contents of the image from the initial set will change only slightly. This assures continuity of tracking. The other advantage of the decrease in size of the images from the initial set is a considerable reduction in the computational load required to track a particular vertex.

Fig. 9 Tracking points for the 'Candide' model: (a) left eye and left eyebrow, (b) lips

Using this approach, it was possible to track closing and opening of lips and eyes. Observation of test video sequences recreated from the results of the algorithm again showed an excellent tracking performance. The tracked vertices were subsequently used as anchors for vertices of the 'Candide' wire-frame model (Fig. 10), and the wire-frame model was driven by both the global motion of the speaker's head and the local motion of the facial features. Subjective observation showed this to operate very effectively.

Fig. 10 Tracking anchor vertices (a) used to manipulate the 'Candide' model (b)

5 Conclusions and further research

The fact that there is a lower bandwidth limit to waveform-based techniques seems to be widely accepted, as is the fact that model-based techniques face far greater problems than 'traditional' approaches. We have addressed one of the most important issues concerning model-based coding. It was our intention to develop and test a tool that could be used within the scope of currently emerging standards (MPEG-4 [28]). We have developed a reliable algorithm for automatically tracking the motion of facial features in head-and-shoulders scenes, based on the eigenvalue decomposition of sub-images containing facial features such as the eyes, the nose and the lips. The algorithm was tested on numerous sequences containing limited pan, rotation and zoom of the speaker's head, with excellent results. We were able to recover both global and local motion, and use it for driving the 'Candide' wire-frame. The use of SVD allows tracking at speeds of more than 10 frames per second per feature on an entry-level Pentium processor-based computer (120 MHz, 16 MB RAM). A further increase in processing speed can be easily achieved by using a faster machine.

In future research, we intend to implement a real-time version of the algorithm (using a parallel processor or DSP hardware). This will allow us to reconstruct the tracked video sequence using texture mapping, and to apply the techniques described here to implement a videophone system capable of operation at extremely low data rates. We also envisage application of the described algorithm in virtual reality systems.

6 Acknowledgments

The authors wish to thank Professor Don Pearson, University of Essex, for valuable comments. Paul Antoszczyszyn acknowledges the support of the University of Edinburgh through a Postgraduate Studentship.

7 References

1 ITU-T Draft H.263: 'Line transmission of non-telephone signals: video coding for low bitrate communication', 1995

2 LI, H., LUNDMARK, A., and FORCHHEIMER, R.: 'Image sequence coding at very low bitrates: a review', IEEE Trans. Image Process., 1994, 3, (5), pp. 589-609

3 GERSHO, A.: 'On the structure of vector quantizers', IEEE Trans. Inf. Theory, 1982, 28, (2), pp. 157-166

4 JACQUIN, A.E.: 'Image coding based on a fractal theory of iterated contractive image transformations', IEEE Trans. Image Process., 1992, 1, (1), pp. 18-30

5 ANTONINI, M., BARLAUD, M., MATHIEU, P., and DAUBECHIES, I.: 'Image coding using wavelet transform', IEEE Trans. Image Process., 1992, 1, (2), pp. 205-220

6 MUSMANN, H.G.: ‘A layered coding system for very low bit rate video coding’, Signal Process. Image Commun., 1995, 7, (4-6), pp. 267-278

7 MUSMANN, H.G., HOTTER, M., and OSTERMANN, J.: 'Object-oriented analysis-synthesis coding of moving images', Signal Process. Image Commun., 1989, 1, (2), pp. 117-138

8 AIZAWA, K., HARASHIMA, H., and SAITO, T.: 'Model-based analysis synthesis image coding (MBASIC) system for a person's face', Signal Process. Image Commun., 1989, 1, (2), pp. 139-152

9 FORCHHEIMER, R., and KRONANDER, T.: 'Image coding - from waveforms to animation', IEEE Trans. Acoust. Speech Signal Process., 1989, 37, (12), pp. 2008-2023

10 WELSH, B.: 'Model-based coding of video images', Electron. Commun. Eng. J., 1991, 3, (1), pp. 29-38

11 REINDERS, M.J.T., VAN BEEK, P.J.L., SANKUR, B., and VAN DER LUBBE, J.C.A.: 'Facial feature localization and adaptation of a generic face model for model-based coding', Signal Process. Image Commun., 1995, 7, (1), pp. 57-74


12 SEFERIDIS, V.: 'Facial feature estimation for model-based coding', Electron. Lett., 1991, 27, (24), pp. 2226-2228

13 ANTOSZCZYSZYN, P.M., HANNAH, J.M., and GRANT, P.M.: 'Automatic frame fitting for semantic-based moving image coding using a facial code-book'. Proceedings of EUSIPCO'96, Trieste, Italy, September 1996, Vol. 2, pp. 1369-1372

14 ANTOSZCZYSZYN, P.M., HANNAH, J.M., and GRANT, P.M.: 'Facial features model fitting in semantic-based scene analysis', Electron. Lett., 1997, 33, (10), pp. 855-857

15 LI, H., and FORCHHEIMER, R.: ‘Two-view facial movement estimation’, IEEE Trans. Circuits Syst. Video Technol., 1994, 4, (3), pp. 276-287

16 KOKUER, M., and CLARK, A.F.: ‘Feature and model tracking for model-based coding’. IPA’92, Maastricht, Netherlands, April 1992, pp. 135-138

17 XIE, X., SUDHAKAR, R., and ZHUANG, H.: 'Real-time eye feature tracking from a video image sequence using Kalman filter', IEEE Trans. Syst. Man Cybern., 1995, 25, (12), pp. 1568-1577

18 YUILLE, A.L.: 'Deformable templates for face recognition', J. Cognitive Neurosci., 1991, 3, (1), pp. 60-70

19 SUNG, K.J., and ANDERSON, D.J.: 'A video eye tracking system based on a statistical algorithm'. Proceedings of 36th IEEE Midwest symposium on Circuits and systems, August 1993, pp. 438-441

20 ANTOSZCZYSZYN, P.M., HANNAH, J.M., and GRANT, P.M.: ‘Facial features motion analysis for wire-frame tracking in model-based moving image coding’. Proceedings of ICASSP’97, Munich, Germany, April 1997, Vol. IV, pp. 2669-2672

21 HOTELLING, H.: 'Analysis of a complex of statistical variables into principal components', J. Educ. Psychol., 1933, 24, pp. 417-441, 498-520

22 KARHUNEN, K.: 'Über lineare Methoden in der Wahrscheinlichkeitsrechnung', Ann. Acad. Sci. Fenn., Ser. A1, Math. Phys., 1946, 37

23 LOEVE, M.: 'Fonctions aléatoires de second ordre' in LEVY, P. (Ed.): 'Processus stochastiques et mouvement Brownien' (Hermann, Paris, France, 1948)

24 MURAKAMI, H., and KUMAR, V.: 'Efficient calculation of primary images from a set of images', IEEE Trans. Pattern Anal. Mach. Intell., 1982, 4, (5), pp. 511-515

25 KIRBY, M., and SIROVICH, L.: 'Application of the Karhunen-Loeve procedure for the characterization of human faces', IEEE Trans. Pattern Anal. Mach. Intell., 1990, 12, (1), pp. 103-108

26 TURK, M., and PENTLAND, A.: 'Eigenfaces for recognition', J. Cognitive Neurosci., 1991, 3, (1), pp. 71-86

27 BRUNELLI, R., and POGGIO, T.: 'Face recognition: features versus templates', IEEE Trans. Pattern Anal. Mach. Intell., 1993, 15, (10), pp. 1042-1052

28 CHIARIGLIONE, L.: 'MPEG and multimedia communications', Signal Process. Image Commun., 1997, 7, (1), pp. 5-18
