
[19] R. V. Hogg and A. T. Craig. Introduction to Mathematical Statistics. Collier-Macmillan, 1978.
[20] P. J. Huber. Robust Statistics. Wiley, 1981.
[21] J. Makhoul, S. Roucos, and H. Gish. Vector Quantization in Speech Coding. Proceedings of the IEEE, 73(11):1551-1588, 1985.
[22] T. Matsui and S. Furui. Similarity normalization method for speaker verification based on a posteriori probability. In Proceedings of the ESCA Workshop on Automatic Speaker Recognition, Identification, Verification, pages 59-62, Martigny, Switzerland, April 1994.
[23] D. O'Shaughnessy. Speech Communication. Addison-Wesley, 1987.
[24] Yoh-Han Pao. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, MA, USA, 1989.
[25] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, 1990.
[26] T. Poggio and L. Stringa. A Project for an Intelligent System: Vision and Learning. International Journal of Quantum Chemistry, 42:727-739, 1992.
[27] A. E. Rosenberg, J. DeLong, C. H. Lee, B. H. Juang, and F. K. Soong. The use of cohort normalized scores for speaker verification. In Proceedings of ICSLP, volume 1, pages 599-602, Banff, Canada, October 1992.
[28] A. E. Rosenberg and F. K. Soong. Evaluation of a Vector Quantization Talker Recognition System in Text Independent and Text Dependent Modes. Computer Speech and Language, 2(3-4):143-157, 1987.
[29] S. B. Davis and P. Mermelstein. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357-366, 1980.
[30] F. K. Soong and A. E. Rosenberg. On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(6):871-879, 1988.
[31] L. Stringa. Automatic Face Recognition Using Directional Derivatives. Technical Report 9205-04, I.R.S.T., 1991.
[32] L. Stringa. Eyes Detection for Face Recognition. Applied Artificial Intelligence, 7:365-382, 1993.
[33] L. Xu, A. Krzyzak, and C. Y. Suen. Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition. IEEE Transactions on Systems, Man, and Cybernetics, 22(3):418-435, 1992.

formance is evaluated on data acquired during real interactions of the users in the reference database. The performance of the two techniques is similar.

The current implementation of the system runs on an HP 735 workstation with a Matrox Magic frame grabber. To optimize system throughput, it relies on a hierarchical match with the face database. The incoming picture, represented by a set of features, is compared at low resolution with the complete database. For each person in the database, the most similar feature among the set of available images is chosen and the location of the best matching position is stored. The search then continues at the next resolution level, restricted to the most promising candidates from the previous level. These candidates are selected by integrating their face scores according to the procedure described in Section 4.1. All available data must be used to secure a reliable normalization of the scores. However, new scores at higher resolution are computed only for a selected subset of persons, and this poses a problem for the integration procedure: scores from image comparisons at different levels would be mixed, and similarity values deriving from lower resolutions are usually higher. To overcome this difficulty, the scores from the previous level are reduced (scaled) by the highest reduction factor obtained by comparing the newly computed scores to the corresponding previous ones. The performance, measured on the datasets used for the reported experiments, does not decrease, and the overall identification time (face and voice processing) is approximately 5 seconds. The same approach, using codebooks of reduced size, could be applied to the speaker identification system, thereby increasing system throughput.
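The coarse-to-fine pruning just described can be summarized in code. The following is a minimal sketch, not the actual implementation: pyramid construction is omitted, `similarity` is a hypothetical stand-in for the L1-based measure of eqn. (12), the `keep` sizes are invented, and the rescaling line reflects one reading of the "highest reduction factor" rule above.

```python
import numpy as np

def similarity(a, b):
    # Hypothetical stand-in for the L1-based measure of eqn. (12).
    return 1.0 / (1.0 + np.abs(np.asarray(a, float) - np.asarray(b, float)).mean())

def hierarchical_identify(query_levels, db_levels, keep=(10, 3)):
    """Coarse-to-fine pruning of the face database.

    query_levels: query feature images, coarse to fine.
    db_levels:    {person: [one image per level]}, same ordering.
    keep:         candidates surviving each refinement step (invented sizes).
    """
    candidates = list(db_levels)
    scores = {p: similarity(query_levels[0], db_levels[p][0]) for p in candidates}
    for level in range(1, len(query_levels)):
        # continue only with the most promising candidates
        candidates = sorted(candidates, key=scores.get, reverse=True)
        candidates = candidates[:keep[min(level - 1, len(keep) - 1)]]
        new = {p: similarity(query_levels[level], db_levels[p][level])
               for p in candidates}
        # shrink the previous-level scores by the largest new/old reduction
        # so that scores from the two levels can be mixed
        factor = min(new[p] / scores[p] for p in candidates)
        scores = {p: s * factor for p, s in scores.items()}
        scores.update(new)
    return max(candidates, key=scores.get), scores
```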

Adding a subject to the database is a simple task for both subsystems, due to the modularity of the databases: each subject is described independently of the others. The integration strategy itself does not require any update. The rejection and the combined identification/rejection procedures do require updating. However, the training of the linear perceptron and of the HyperBF network can be configured more as a refinement of a suboptimal solution (available from the previous database) than as the computation of a completely unknown set of optimal parameters.

While the system, as presented, is mainly an identification system, a small modification transforms it into a verification system. For each person in the database it is possible to select a subset containing the most similar people (as determined by the identification system). When the user must be verified, the identification system can be run on the appropriate subset, thereby limiting the computational effort and verifying the identity of the user using the techniques reported in the paper. Future work will aim at further improving the global efficiency of the system through the investigation of more accurate and reliable rejection methods.

Acknowledgement
The authors would like to thank Dr. L. Stringa, Prof. T. Poggio and Prof. R. de Mori for valuable suggestions and discussions. The authors are grateful to the referees for many valuable comments.

References
[1] D. H. Ballard and C. M. Brown. Computer Vision. Prentice Hall, Englewood Cliffs, NJ, 1982.
[2] P. B. Bonissone and K. S. Decker. Selecting uncertainty calculi and granularity: an experiment in trading off precision and complexity. In L. N. Kanal and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence, pages 217-247. North Holland, 1986.
[3] P. B. Bonissone, S. S. Gans, and K. S. Decker. RUM: a layered architecture for reasoning with uncertainty. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence, pages 891-898, Milan, August 1987.
[4] R. Brunelli. On Training Neural Nets through Stochastic Minimization. Technical Report 9212-06, I.R.S.T., 1992. To appear in Neural Networks.
[5] R. Brunelli. Estimation of Pose and Illuminant Direction for Face Processing. A.I. Memo No. 1499, Massachusetts Institute of Technology, 1994.
[6] R. Brunelli, D. Falavigna, T. Poggio, and L. Stringa. A recognition system, particularly for recognising people. Patent No. 93112738, 1993. Priority IT/11.08.92/IT TO920695.
[7] R. Brunelli and S. Messelodi. Robust Estimation of Correlation: an Application to Computer Vision. Technical Report 9310-05, I.R.S.T., 1993. To appear in Pattern Recognition.
[8] R. Brunelli and T. Poggio. Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042-1052, 1993.
[9] R. Brunelli, T. Poggio, D. Falavigna, and L. Stringa. Automatic Person Recognition by Using Acoustic and Geometric Features. Technical Report 9307-43, I.R.S.T., 1993. To appear in Machine Vision and Applications.
[10] R. Brunelli and G. Tecchiolli. Stochastic minimization with adaptive memory. Technical Report 9211-14, I.R.S.T., 1992. To appear in Journal of Computational and Applied Mathematics.
[11] P. J. Burt. Smart sensing within a pyramid vision machine. Proceedings of the IEEE, 76(8):1006-1015, 1988.
[12] G. Carli and R. Gretter. A start-end point detection algorithm for a real-time acoustic front-end based on a DSP32C VME board. In Proceedings of ICSPAT, pages 1011-1017, Boston, November 1992.
[13] G. R. Doddington. Speaker Recognition, Identifying People by Their Voices. Proceedings of the IEEE, 73(11), 1985.
[14] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[15] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[16] S. Furui. Cepstral Analysis Technique for Automatic Speaker Verification. IEEE Transactions on Acoustics, Speech and Signal Processing, 29(1):254-272, 1981.
[17] P. W. Hallinan. Recognizing Human Eyes. In SPIE Proceedings, volume 1570, pages 214-226, 1991.
[18] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, 1986.

Fig. 9. The total error achieved by networks with different numbers of units. The total error is computed by summing the percentages of accepted strangers, misrecognized and rejected database people. For each net size a threshold was chosen to minimize the cumulative error.

Errors                  ω_0    (ω_0 + ω_1)/2    ω_1
Stranger accepted (%)   0.5    0.5              0.0
Familiar rejected (%)   3.0    3.0              3.5
Familiar misrecog. (%)  0.0    0.0              0.0

TABLE IV. The performance of the system when using a HyperBF network with 21 units to perform score integration.

strangers, misrecognized and rejected database persons. In Figure 9, the total error is reported as a function of the network size. Note that the threshold is computed on the test set, so that it gives an optimistic estimate.

To obtain a correct estimate of system performance, a cross-validation approach was used for the net giving the best (optimistic) total error estimate. Let [ω_0, ω_1] be the interval over which the total error assumes its minimum value (see Figure 10). The threshold value can be chosen as:
- ω_0, favouring acceptance over rejection;
- (ω_0 + ω_1)/2;
- ω_1, favouring rejection over acceptance.
The resulting performance is reported in Table IV. Note that using ω_1 the system was able to reject all of the strangers, which is the ultimate requirement for a reliable system, missing only 3.5% of the known users.

5. Conclusions
A system that combines acoustic and visual cues in order to identify a person has been described. The speaker recognition subsystem is based on Vector Quantization of the acoustic parameter space and includes an adaptation phase of the codebooks to the test environment. A different method to perform speaker recognition, which makes use of the Hidden Markov Model technique and pitch information, is under investigation.

Fig. 10. Error percentages as a function of the rejection threshold for a Gaussian based expansion.

A face recognition subsystem was also described. It is based on the comparison of facial features at the pixel level using a similarity measure based on the L1 norm. The two subsystems provide a multiple classifier system. In the implementation described, 5 classifiers (2 acoustic and 3 visual) were considered. The multiple classifier operates in two steps. In the first one, the input scores are normalized using robust estimators of location and scale. In the second step, the scores are combined using a weighted geometric average. The weights are adaptive and depend on the score distributions. While normalization is fundamental to compensate for input variations (e.g. variations of illumination, background noise conditions, utterance length and speaker voices), weighting emphasizes the classification power of the most reliable classifiers. The use of multiple cues, acoustic and visual, proved to be effective in improving performance: the correct identification rate of the integrated system is 98%, which represents a significant improvement with respect to the 88% and 91% rates provided by the speaker and face recognition systems respectively. Future use of the Hidden Markov Model technique is expected to improve the performance of the VQ based speaker recognizer.

An important capability of the multiple classifier itself is the rejection of the input data when they cannot be matched with sufficient confidence to any of the database entries. An accept/reject rule is introduced by means of a linear classifier based on measurement and rank information derived from the five recognition systems. A novel, alternative approach to the integration of multiple classifiers at the hybrid rank/measurement level is also presented. The problem of combining the outputs of a set of classifiers is considered as a learning task. A mapping from the scores of the classifiers and their ranks into the interval (0, 1) is approximated using a HyperBF network. A final rejection/acceptance threshold is then introduced using the cross-validation technique. System per-

sifier can then be regarded as a list of couples {(S_ij, r_ij)}, i = 1, ..., I, where I represents the number of people in the reference database. A mapping L_{01} is sought such that:

L_{01}(S_{i1}, r_{i1}, \ldots, S_{i5}, r_{i5}) = \begin{cases} 1 & \text{if } \mathrm{label}(X) = i \\ 0 & \text{otherwise} \end{cases}    (25)

If, after mapping the list of scores, more than one label qualifies, the system rejects the identification. It is possible to relax the definition of L_{01} by letting the value of the mapping span the whole interval [0, 1]. In this way the measurement level character of the classification can be retained. The new mapping L can be interpreted as a fuzzy predicate. The following focuses on the fuzzy variant, from which the original formulation can be obtained by introducing a threshold ω:

L_{01} = \theta(L(\mathbf{x}_i) - \omega)    (26)

where θ(·) is the Heaviside unit-step function and x_i = (S_{i1}, r_{i1}, ..., S_{i5}, r_{i5}) is a ten dimensional vector containing the feature normalized matching scores and corresponding ranks. The goal is to approximate the characteristic function of the correct matching vectors as a sum of Gaussian bumps. Therefore the search for L is conducted within the following family of functions:

L(\mathbf{x}; \{(c_\alpha, \mathbf{t}_\alpha)\}_\alpha, \Sigma) = \sigma\!\left(\sum_\alpha c_\alpha\, G(|\mathbf{x} - \mathbf{t}_\alpha|_\Sigma)\right)    (27)

where

G(x) = e^{-x^2}    (28)

\sigma(x) = \frac{1}{1 + e^{4(x - 1/2)}}    (29)

|\mathbf{x} - \mathbf{t}_\alpha|_\Sigma = \sqrt{(\mathbf{x} - \mathbf{t}_\alpha)^T \Sigma^{-1} (\mathbf{x} - \mathbf{t}_\alpha)}    (30)

Σ^{-1} being a diagonal matrix with positive entries, x, t_α ∈ R^10 and c_α ∈ R. The approximating function can be represented as a HyperBF network [25] whose topology is reported in Figure 8. The sigmoidal mapping is required to ensure that the co-domain is restricted to the interval (0, 1). The location t_α, shape Σ and height c_α of each bump are chosen by minimizing the following error measure:

E = \sum_{ij} \left[ y_{ij} - \sigma\!\left(\sum_\alpha c_\alpha\, G(|\mathbf{x}_{ij} - \mathbf{t}_\alpha|_\Sigma)\right) \right]^2    (31)

where {(x_ij, y_ij)} is a set of examples (points at which the value of the mapping to be recovered is known).
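A minimal sketch of eqns. (27)-(31) follows. The parameter shapes (K bumps over ten-dimensional inputs) are the only assumptions beyond the text, and the stochastic minimizer of [10] is not reproduced; any optimizer over {c_α, t_α, Σ} could be driven by `total_error`.

```python
import numpy as np

def hyperbf(x, centers, heights, inv_sigma):
    """L(x) of eqns. (27)-(30): a sigmoid of a weighted sum of Gaussian
    bumps under a diagonal metric.

    x:         (10,) vector of normalized scores and ranks
    centers:   (K, 10) bump locations t_alpha
    heights:   (K,) coefficients c_alpha
    inv_sigma: (10,) positive diagonal of Sigma^{-1}
    """
    d2 = ((x - centers) ** 2 * inv_sigma).sum(axis=1)  # |x - t_alpha|^2_Sigma
    g = np.exp(-d2)                                    # G of eqn. (28)
    a = heights @ g
    return 1.0 / (1.0 + np.exp(4.0 * (a - 0.5)))       # sigma of eqn. (29)

def total_error(centers, heights, inv_sigma, examples):
    """E of eqn. (31) over a list of labelled pairs (x_ij, y_ij)."""
    return sum((y - hyperbf(x, centers, heights, inv_sigma)) ** 2
               for x, y in examples)
```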

Fig. 8. The function used to approximate the mapping from the score/rank domain into the interval (0, 1) can be represented as a HyperBF network.

The first subscript i denotes the database entry from which x_ij is derived and the second subscript j represents the example. The required value of the mapping at x_ij is 1 when i is the correct label (class) for the j-th example and 0 otherwise. The error measure E is minimized over the parameter space {(c_α, t_α)}_α, Σ by means of a stochastic algorithm with adaptive memory [10]. The number of free parameters involved in the minimization process dictates the use of a large set of examples. As a limited number of real interactions was available, a leave-one-out strategy was used for training and testing the system, as for the linear classifier previously described. From each of the available user-system interactions, a virtual interaction was derived by removing from the database the entry of the interacting user, thereby simulating an interaction with a stranger. For each interaction j:
1. the vector corresponding to the correct database entry provides a positive example;
2. the vectors of the first ten non-correct entries of the real interaction (as derived from sorting the integrated scores of Section 4.1) and the vectors of the first ten entries of the virtual interaction provide the negative examples.
The reason for using only the first ten non-correct entries is that the matching scores decay quickly with rank position in the final score list, and additional examples would not provide more information. Data from different interactions of the same user were then grouped. The resulting set of examples was used to generate an equal number of different training/testing set pairs. Each set was used in turn for testing, leaving the remaining ones for training. The problem of matching the number of free parameters in the approximation function to the complexity of the problem was solved by testing the performance of networks of increasing size. For each network size, a value for the threshold ω of eqn. (26) was computed to minimize the total error defined as the sum of the percentage of accepted

Fig. 7. System performance when false positives and false negatives are weighted differently.

Note that the LHSs of eqns. (23)-(24) represent the signed distance, in arbitrary units, of point d from the plane defined by w that divides the space into two semispaces. Points lying in the correct semispace contribute to E inversely to their distance from plane w. Points lying near the plane contribute with α or β, while points lying in the wrong semispace and at a great distance from the discriminating plane contribute with 2α or 2β. If the two classes of points are linearly separable, it is possible to drive E to zero (see [14], [24]). A stochastic minimization algorithm [4], [10] was used to minimize E. When the system is required to work in a strict mode (no errors allowed, that is, no strangers accepted), β >> α should be used in the training phase. Note that a similar discriminant function can be computed for each of the recognition subsystems (i.e. face recognition and voice recognition), thereby enabling the system to reject an identification when it is not sufficiently certain, even when not all of the identification cues are available.

The training/test of the classifier followed a leave-one-out strategy to maximize the number of data available in the training phase [15]. The classifier is trained by using all but one of the available samples and tested on the excluded one. The performance of the classifier can be evaluated by excluding in turn each of the available samples and averaging the classification error. In the reported experiments, the available examples were grouped per interacting user. The leave-one-out method was then applied to the resulting 87 sets (the number of users that interacted with the system) to guarantee the independence of the training and test sets. Each set was used in turn for testing, leaving the remaining 86 for training. The results are reported in Table III. A complete operating characteristic curve for the integrated performance shown in Table III is reported in Figure 7, where the stranger-accepted and familiar-rejected rates at different β/α ratios are plotted.

                        Error (%)
Face        Stranger accepted    4.0
            Familiar rejected    8.0
            Familiar misrecog.   0.5
Voice       Stranger accepted   14.0
            Familiar rejected   27.0
            Familiar misrecog.   1.0
Integrated  Stranger accepted    0.5
            Familiar rejected    1.5
            Familiar misrecog.   0.0

TABLE III. Error rates of the subsystems and of the complete system when a rejection threshold is introduced. Data are based on the subset of interactions for which both face and speech data were available (155 out of 164).

Similar experiments were run on the acoustic and visual features separately and are also reported in Table III. The results show that the use of the complete set of features provides a relevant increase in reliable performance over the separate subsystems.

4.3 Hybrid level integration
In this subsection, a hybrid rank/measurement level at which multiple classifiers can be combined will be introduced. The approach is to reconstruct a mapping from the sets of scores, and corresponding ranks, into the set {0, 1}. The match to each of the database entries, as described by a vector of five scores and the corresponding ranks, should be mapped to 1 if it corresponds to the correct label and to 0 otherwise. The reconstruction of the mapping proceeds along the following steps:
1. find a set of positive and negative examples;
2. choose a parametric family of mappings;
3. choose the set of parameters for which the corresponding mapping minimizes a suitable error measure over the training examples.
Another way to look at the reconstruction of the mapping is to consider the problem as a learning task where, given a set of acceptable and non-acceptable inputs, the system should be able to appropriately classify unseen data.

Let {C_j} be the set of classifiers. Each of them associates to each person X some numerical data X_j that can be considered a vector. By comparison with the i-th database entry, a normalized similarity score S_ij can be computed. Each score S_ij can be associated to its rank r_ij in the list of scores produced by classifier C_j. The output of each clas-

      S1      S2          F1          F2          F3
S1    1 (1)   0.64 (1.0)  0.06 (0.7)  0.08 (0.8)  0.04 (0.7)
S2            1.00 (1.0)  0.03 (0.6)  0.07 (0.8)  0.03 (0.6)
F1                        1.00 (1.0)  0.11 (0.8)  0.10 (0.8)
F2                                    1.00 (1.0)  0.50 (1.0)
F3                                                1.00 (1.0)

TABLE II. The rank correlation values of the couples of features. The parenthesized values represent the significance of the correlation. S1 and S2 represent the dynamic and static acoustic features respectively; F1, F2, F3 represent the eyes, nose and mouth.

is distributed approximately as a Student's t distribution with I - 2 degrees of freedom [19]. It is then possible to assess the dependence of the different features used by computing the rank correlation of each couple and by testing the corresponding significance. Results for the features used in the developed system are given in Table II. The acoustic features are clearly correlated, as are the nose and mouth features. The latter correlation is due to the overlapping of the nose and mouth regions, which was found to be necessary in order to use facial regions characterized by the same coordinates for the whole database. Acoustic and visual features are independent, as could be expected.

The feasibility of using a linear classifier was investigated by looking at the distribution of acceptable and non-acceptable² best candidates in a 3D space whose coordinates are the integrated score, a normalized ratio of the first to second best score, and the standard deviation of the rankings. As can be seen in Figure 6, a linear classifier seems to be appropriate. The full vector d ∈ R^18 used as input to the linear classifier is given by:
1. the integrated score, S_{i_1}, of the best candidate;
2. the normalized ratio of the first to the second best integrated score:

R = \frac{S_{i_1} - 0.5}{S_{i_2} - 0.5}    (21)

3. the minimum and maximum ranks of the first and second final best candidates (4 entries);
4. the rank standard deviations of the first and second final best candidates (2 entries);
5. the individual ranks of the first and second final best candidates (10 entries).
To train the linear classifier the following procedure was used. A set of positive examples {p_i} is derived

² Non-acceptable best candidates derive from two sources: misclassified users from real interactions, and best candidates from virtual interactions, characterized by the removal of the user entry from the database.

Fig. 6. Let us represent the match with the database entries by means of the integrated score, the standard deviation of the rankings across the different features, and the normalized ratio of the first to second best integrated score. The resulting three dimensional points are plotted and marked with a box if they represent a correct match or with a cross if the match is incorrect. Visual inspection of the resulting point distribution shows that the two classes of points can be separated well by using a plane.

from the data relative to the persons correctly classified by the system. A set of negative examples {n_j} is given by the data relative to the best candidate when the system did not classify the user correctly. The set of negative examples can be augmented by the data of the best candidate when the correct entry is removed from the database, thereby simulating the interaction with a stranger. The linear discriminant function defined by the vector w can be found by minimizing the following error:

E = \alpha \sum_i \left( 1 - \frac{1 - e^{-(\sum_{k=1}^{l} w_k p_{ik} + w_{l+1})}}{1 + e^{-(\sum_{k=1}^{l} w_k p_{ik} + w_{l+1})}} \right)^2 + \beta \sum_j \left( 1 + \frac{1 - e^{-(\sum_{k=1}^{l} w_k n_{jk} + w_{l+1})}}{1 + e^{-(\sum_{k=1}^{l} w_k n_{jk} + w_{l+1})}} \right)^2    (22)

where α and β represent the weights to be attributed to false negatives and to false positives respectively, and l = 18 is the dimensionality of the input vectors. When α = β = 1, E represents the output error of a linear perceptron with a symmetric sigmoidal unit. Final acceptance or rejection of an identification associated to a vector d is done according to the simple rule:

\sum_{i=1}^{l} w_i d_i + w_{l+1} > 0 \quad \text{accept}    (23)

\sum_{i=1}^{l} w_i d_i + w_{l+1} \le 0 \quad \text{reject}    (24)
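Eqns. (22)-(24) are compact enough to transcribe directly; note that (1 - e^{-a})/(1 + e^{-a}) = tanh(a/2), which the sketch below exploits. The minimizer itself (a stochastic algorithm, see [4], [10]) is not reproduced.

```python
import numpy as np

def decide(w, d):
    """Accept/reject rule of eqns. (23)-(24); w holds the l weights plus
    the bias w_{l+1}, d is the 18-dimensional input vector."""
    return "accept" if w[:-1] @ d + w[-1] > 0 else "reject"

def error(w, positives, negatives, alpha=1.0, beta=1.0):
    """E of eqn. (22). Since (1 - e^{-a}) / (1 + e^{-a}) = tanh(a / 2),
    this is the squared error of a perceptron with a symmetric sigmoid."""
    out = lambda d: np.tanh((w[:-1] @ d + w[-1]) / 2.0)
    return (alpha * sum((1.0 - out(p)) ** 2 for p in positives)
            + beta * sum((1.0 + out(n)) ** 2 for n in negatives))
```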

Feature     Recognition (%)    R
Voice            88           1.14
  Static         77           1.08
  Dynamic        71           1.08
Face             91           1.56
  Eyes           80           1.25
  Nose           77           1.25
  Mouth          83           1.28
ALL              98           1.65

TABLE I. The recognition performance and average separation ratio R for each single feature and for their integration. Data are based on 164 real interactions and a database of 89 users.

The ratio R_x measures the separation of the correct match S'_x from the wrong ones. This ratio is invariant against the scale and location parameters of the integrated score distribution and can be used to compare different integration strategies (weighted/unweighted geometric average, adaptive/fixed normalization). The weighted geometric average of the adaptively normalized scores exhibits the best performance and separation among the various schemes on the available data.

Experiments were carried out using data acquired during 3 different test sessions. Of the 89 persons stored in the database, 87 interacted with the system in one or more sessions. One of the three test sessions was used to adapt the acoustic and visual databases (in the latter case the images of the session were simply added to those available); therefore, session 1 was used to adapt session 2, and session 2 to adapt session 3. As each adapted session consisted of 82 interactions, the system was tested on a total of 164 interactions. The recognition performance and the average value of R_x for the different separate features and for their integration are reported in Table I.

4.2 Rejection
An important capability of a classifier is to reject input patterns that cannot be classified in any of the available classes with a sufficiently high degree of confidence. For a person verification system, the ability to reject an impostor is critical. The following paragraphs introduce a rejection strategy that takes into account the level of agreement of all the different classifiers in the identification of the best candidate. A simple measure of confidence is given by the integrated score itself: the higher the value, the

higher the confidence of the identification. Another is given by the difference of the two best scores: it is a measure of how sound the ranking of the best candidate is. The use of independent features (or feature sets) also provides valuable information in the form of the rankings of the identification labels across the classifier outputs: if the pattern does not belong to any of the known classes, its rank will vary significantly from classifier to classifier. On the contrary, if the pattern belongs to one of the known classes, rank agreement will be consistently high. The average rank and the rank dispersion across the classifiers can then be used to quantify the agreement of the classifiers in the final identification. The confidence in the final identification can thus be quantified through several measures, and the decision about whether the confidence is sufficient to accept the system output can be based on one or several of them. In the proposed system, a linear classifier, based on absolute and relative scores, ranks and their dispersion, will be used to accept/reject the final result. The following issues will be discussed:
1. the degree of dependence of the features used;
2. the choice of the confidence measures to be used in the accept/reject rule;
3. the training and test of the linear classifier used to implement the accept/reject rule.

As a preliminary step, the independence of the features used in the identification process will be evaluated. It is known that the higher the degree of independence, the higher the information provided to the classifier. Let us consider a couple of features X and Y. Let {(x_i, y_i)}, i = 1, ..., I, represent the corresponding normalized scores. They can be considered as random samples from a population with a bivariate distribution function. Let A_i be the rank of x_i among x_1, ..., x_I when they are arranged in descending order, and B_i the rank of y_i among y_1, ..., y_I, defined similarly to A_i. Spearman's rank correlation [19] is defined by:

r_s = \frac{\sum_i (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_i (A_i - \bar{A})^2}\,\sqrt{\sum_i (B_i - \bar{B})^2}}    (19)

where \bar{A} and \bar{B} are the average values of {A_i} and {B_i} respectively. An important characteristic of rank correlation is its non-parametric nature. To assess the independence of the features it is not necessary to know the bivariate distribution from which the (X_i, Y_i) are drawn, since the distribution of their ranks is known under the assumption of independence. It turns out that

t = r_s \sqrt{\frac{I - 2}{1 - r_s^2}}    (20)
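A direct transcription of eqns. (19)-(20), assuming no tied scores (the text does not discuss ties):

```python
import numpy as np

def spearman(x, y):
    """r_s of eqn. (19) and t of eqn. (20). Under independence, t is
    distributed approximately as Student's t with I - 2 degrees of
    freedom."""
    x, y = np.asarray(x), np.asarray(y)
    a = np.argsort(np.argsort(-x)) + 1.0   # ranks in descending order
    b = np.argsort(np.argsort(-y)) + 1.0
    a, b = a - a.mean(), b - b.mean()
    rs = (a @ b) / np.sqrt((a @ a) * (b @ b))
    return rs, rs * np.sqrt((len(x) - 2) / (1.0 - rs ** 2))
```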

A first step towards the normalization of the scores is to reverse the sign of the distances, thereby making them concordant with the correlation values: the higher the value, the more similar the input patterns. Inspection of the score distributions shows them to be markedly unimodal and roughly symmetrical. A simple way to normalize scores is to estimate their average values and standard deviations, so that the distributions can be translated and rescaled to have zero average and unit variance. The values can then be forced into a standard interval, such as (0, 1), by means of a hyperbolic tangent mapping. The normalization of the scores can rely on a fixed set of parameters, estimated from the score distributions of a certain number of interactions, or can be adaptive, estimating the parameters from the score distribution of the current interaction. The latter strategy was chosen, mainly because of its ability to cope with variations such as different speech utterance lengths without the need to re-estimate the normalization parameters.

The estimation of the location and scale parameters of the distribution should make use of robust statistical techniques [18], [20]. The usual arithmetic average and standard deviation are not well suited to the task: they are highly sensitive to outlier points and could give grossly erroneous estimates. Alternative estimators exist that are sensitive to the main bulk of the scores (i.e. the central part of a unimodal symmetric distribution) and are not easily misled by points in the extreme tails of the distribution. The median and the Median Absolute Deviation (MAD) are examples of such location and scale estimators and can be used to reliably normalize the distribution of the scores. However, the median and MAD estimators have a low efficiency relative to the usual arithmetic average and standard deviation. A class of robust estimators with higher efficiency was introduced by Hampel under the name of tanh-estimators and is used in the current implementation of the system (see [18] for a detailed description). Therefore each list of scores {S_ij}, i = 1, ..., I, from classifier j, I being the number of people in the reference database, can be transformed into a normalized list by the following mapping:

S'_{ij} = \frac{1}{2}\left[\tanh\left(0.01\,\frac{S_{ij} - \mu_{\tanh}}{\sigma_{\tanh}}\right) + 1\right] \in (0, 1)    (15)

where μ_tanh and σ_tanh are the average and standard deviation estimates of the scores {S_ij}, i = 1, ..., I, as given by the Hampel estimators. An example of the distributions of the resulting normalized scores is reported in Figure 5 for each of the five features used in the classification.
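A sketch of the mapping of eqn. (15). The paper estimates μ_tanh and σ_tanh with Hampel's tanh-estimators; since their definition is not reproduced here, the sketch substitutes the median/MAD pair discussed above, which is also robust but less efficient:

```python
import numpy as np

def normalize_scores(scores):
    """Eqn. (15) with median/MAD in place of the Hampel tanh-estimators."""
    s = np.asarray(scores, dtype=float)
    mu = np.median(s)
    sigma = 1.4826 * np.median(np.abs(s - mu))  # MAD, scaled to be
                                                # consistent for Gaussians
    return 0.5 * (np.tanh(0.01 * (s - mu) / sigma) + 1.0)
```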

Fig. 5. The density distribution of the normalized scores for each of the classifiers: S1, S2 represent the static and dynamic speech scores, while F1, F2 and F3 represent the eyes, nose and mouth scores respectively.

In the following formulas, a subscript index i_m indicates the m-th entry within the set of scores sorted by decreasing value. The normalized scores can be integrated using a weighted geometric average:

S_i = \left(\prod_j {S'_{ij}}^{\,w_j}\right)^{1/\sum_j w_j}    (16)

where the weights w_j represent an estimate of the score dispersion in the right tail of the corresponding distributions:

w_j = \frac{S'_{i_1 j} - 0.5}{S'_{i_2 j} - 0.5} - 1.0    (17)

The main reason suggesting the use of the geometric average for the integration of scores relies on probability: if we assume that the features are independent, the probability that a feature vector corresponds to a given person can be computed by taking the product of the probabilities of each single feature. The normalized scores could then be considered as equivalent to probabilities. Another way of looking at the geometric average is that of predicate conjunction using a continuous logic [2], [3]. The weights reflect the importance of the different features (or predicates). As defined in eqn. (17), each feature is given an importance proportional to the separation of the two best scores. If the classification provided by a single feature is ambiguous, it is given low weight. A major advantage of eqn. (16) is that it does not require a detailed knowledge of how each feature is distributed (as would be necessary when using a Bayes approach). This eases the task of building a system that integrates many features.

The main performance measure of the system is the percentage of persons correctly recognized. Performance can be further qualified by the average value of the following ratio R_x:

R_x = \frac{S'_x - S'_{i_I}}{\max_{i \ne x}(S'_i) - S'_{i_I}}, \quad 1 \le i \le I    (18)
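Eqns. (16)-(17) in code, assuming the scores have already been normalized into (0, 1) by eqn. (15) and that the two best scores of each classifier exceed 0.5 (so that the logarithms are defined and the weights non-negative):

```python
import numpy as np

def integrate(norm_scores):
    """Weighted geometric average of eqns. (16)-(17).

    norm_scores: (I, J) array of normalized scores, one row per person
                 in the database, one column per classifier.
    Returns the integrated score S_i for every person."""
    s = np.asarray(norm_scores, dtype=float)
    top2 = np.sort(s, axis=0)[::-1][:2]           # two best scores per classifier
    w = (top2[0] - 0.5) / (top2[1] - 0.5) - 1.0   # eqn. (17)
    return np.exp((np.log(s) @ w) / w.sum())      # eqn. (16)
```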

Fig. 4. The distributions of the correlation values for corresponding features of the same person and of different people.

two images over the larger one. A major advantage of the image similarity computed according to eqn. (12) over the more common estimate given by the cross-correlation coefficient [1], based on the L2 norm, is its reduced sensitivity to small amounts of unusually high differences between corresponding pixels. These differences are often due to noise or image specularities such as iris highlights. A detailed analysis of the similarity measure defined in eqn. (12) is given in [7]. An alternative technique for face identification is reported in [31].

Let us denote with {U_km}, m = 1, ..., p_k, the set of images available for the k-th user. A comparison can now be made between a set of regions of the unknown image N and the corresponding regions of the database images. The regions currently used by the system correspond to the eyes, nose and mouth. A list of similarity scores is obtained for each region F_α of image U_km:

\{s_{k\alpha}\} = \{\max_m C(R_\alpha(N), F_\alpha(U_{km}))\}    (14)

where R_α(N) represents a region of N containing F_α with a frame whose size is related to the interocular distance. The lists of matching scores corresponding to eyes, nose and mouth are then available for further processing. The distributions of the correlation values for corresponding features of the same person and of different people are reported in Figure 4.

Integration with the scores derived from the acoustic analysis can now be performed with a single or double step process. In the first case, the two acoustic and the three visual scores are combined simultaneously, while in the second the acoustic and visual scores are first combined separately and the final score is given by the integration of the outputs of the speaker and face recognition systems (see [9] for an example of the latter). The next section will introduce two single-step integration strategies for classifiers working at the measurement level.
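A direct sketch of eqn. (12) and of the sliding match used in eqn. (14); `standardize` applies the intensity normalization of eqn. (13), and the sweep is deliberately unoptimized:

```python
import numpy as np

def standardize(img):
    """Zero-average intensity and unit L1 scale (eqn. 13)."""
    v = np.asarray(img, dtype=float)
    v = v - v.mean()
    return v / np.abs(v).mean()

def l1_similarity(x, y):
    """C(x, y) of eqn. (12): 1 - d_L1(x, y) / (||x||_L1 + ||y||_L1)."""
    x, y = np.ravel(x), np.ravel(y)
    return 1.0 - np.abs(x - y).sum() / (np.abs(x).sum() + np.abs(y).sum())

def match(template, region):
    """Maximum of C over all placements of the smaller image inside the
    larger one. Assumes both inputs are already standardized and that
    region is at least as large as template."""
    th, tw = template.shape
    rh, rw = region.shape
    return max(l1_similarity(template, region[i:i + th, j:j + tw])
               for i in range(rh - th + 1)
               for j in range(rw - tw + 1))
```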

4. Integration
The use of multiple cues, such as face and voice, provides in a natural way the information necessary to build a reliable, high performance system. Specialized subsystems can identify (or verify) each of the previous cues, and the resulting outputs can then be combined into a unique decision by some integration process. The objective of this section is to describe and evaluate some integration strategies. The use of multiple cues for person recognition proved beneficial for both system performance and reliability¹.

A simplified taxonomy of multiple classifier systems is reported in [33]. Broadly speaking, a classifier can output information at one of the following levels:
- the abstract level: the output is a subset of the possible identification labels, without any qualifying information;
- the rank level: the output is a subset of the possible labels, sorted by decreasing confidence (which is not supplied);
- the measurement level: the output is a subset of labels qualified by a confidence measure.
The level at which the different classifiers of a composite system work clearly constrains the ways their responses can be merged. The first of the following sections will address the integration of the speaker/face recognition systems at the measurement level. The possibility of rejecting a user as unknown will then be discussed. Finally, a novel, hybrid level approach to the integration of a set of classifiers will be presented.

4.1 Measurement level integration
The acoustic and visual identification systems already constitute a multiple classifier system. However, both the acoustic and visual classifiers can be further split into several subsystems, each one based on a single type of feature. In our system, five classifiers were considered (see secs. 2, 3), working on the static and dynamic acoustic features and on the eyes, nose and mouth regions.

A critical point in the design of an integration procedure at the measurement level is that of measurement normalization. In fact, the responses of the different classifiers usually have different scales (and possibly offsets), so that a sensible combination of the outputs can proceed only after the scores are properly normalized. As already detailed, the outputs of the identification systems are not homogeneous: the acoustic features provide distances while the visual ones provide correlation values.

¹ Two aspects of reliability are critical for a person identification system: the first is the ability to reject a user as unknown, the second is the possibility of working with a reduced input, such as only the speech signal or the face image.

and the interocular distance and the direction of the eye-to-eye axis at predefined values.
Under the assumption that the user face is approximately vertical in the digitized image, a good estimate of the coordinate S of the symmetry axis is given by

S = \mathrm{median}\{ P_V(|I \ast K_V|)_i \}    (8)

where ∗ represents convolution, I the image, K_V the convolution kernel [-1, 0, 1]^t, and P_V the vertical projection, whose index i runs over the columns of the image. The face can then be split vertically into two slightly overlapping parts containing the left and right eye respectively. The illumination under which the image is taken can impair the template matching process used to locate the eye. To minimize this effect a filter, N(I), is applied to image I:

N = \begin{cases} N' & \text{if } N' \le 1 \\ 2 - 1/N' & \text{if } N' > 1 \end{cases}    (9)

where

N' = \frac{I}{I \ast K_G(\sigma)}    (10)

and K_G(σ) is a Gaussian kernel whose σ is related to the expected interocular distance Δ_ee. The arithmetic operations act on the values of corresponding pixels. The process mapping I into N reduces the influence of ambient lighting while keeping the necessary image details. This is mainly due to the removal of linear intensity gradients, which are mapped to the constant value 1. Extensive experiments, using ray-tracing and texture-mapping techniques to generate synthetic images under a wide range of lighting directions, have shown that the local contrast operator of eqn. (9) exhibits a lower illumination sensitivity than other operators such as the laplacian, the gradient magnitude or direction [5], and that there is an optimal value of the parameter σ (approximately equal to the iris radius). The same filter is applied to the eye templates.

The template matching process is based on the algorithm of hierarchical correlation proposed by Burt [11]. Its final result is a map of correlation values, the center of gravity of the pixels with maximum value representing the location of the eye. Once the two eyes have been located, the confidence of the localization is expressed by a coefficient, C_E, that measures the symmetry of the eye positions with respect to the symmetry axis, the horizontal alignment, and the scale relative to that of the eye templates:

C_E = \frac{C_l + C_r}{2} \cdot \frac{\min(C_l, C_r)}{\max(C_l, C_r)} \cdot e^{-(s-1)^2/2\sigma_s^2} \cdot e^{-\Delta\theta^2/2\sigma_\theta^2}    (11)

where C_l and C_r represent the (maximum) correlation values for the left/right eye, s the interocular

distance expressed as a multiple of the interocular distance of the eyes used as templates, Δθ represents the angle of the interocular axis with respect to the horizontal axis, while σ_θ and σ_s represent tolerances on the deviations from the prototype scale and orientation. The first factor in the RHS of eqn. (11) is the average correlation value of the left and right eye: the higher it is, the better the match with the eye templates. The second factor represents the symmetry of the correlation values and equals 1 when the two values are identical. The third and fourth factors weigh the deviations from the assumed scale and (horizontal) orientation of the interocular axis, respectively. The parameters of the Gaussians, σ_s and σ_θ, were determined by the analysis of a set of interactions. If the value of C_E is too low, the face recognition system declares failure and the identification proceeds using the acoustic features alone. Otherwise, the image is translated, scaled and rotated to match the location of the pupils to that of the database images. In the reported experiments the interocular distance was set equal to 28 pixels. Alternative techniques for locating eyes are reported in [17], [32].

Due to the geometrical standardization, the subimages containing the eyes, nose and mouth are approximately characterized by the same coordinates in every image. These regions are extracted from the image of the user face and compared in turn to the corresponding regions extracted from the database entries, previously filtered according to eqns. (9)-(10). Let us introduce a similarity measure C based on the computation of the L1 norm of a vector, ||x||_{L1} = Σ_i |x_i|, and on the corresponding distance d_{L1}(x, y) = ||x - y||_{L1}:

C(\mathbf{x}, \mathbf{y}) = 1 - \frac{d_{L_1}(\mathbf{x}, \mathbf{y})}{\|\mathbf{x}\|_{L_1} + \|\mathbf{y}\|_{L_1}}    (12)

The L1 distance of two vectors is mapped by C(·,·) into the interval [0, 1], higher values representing smaller distances. This definition can easily be adapted to the comparison of images. For the comparison to be useful when applied to real images, it is necessary to normalize the images so that they have the same average intensity μ and standard deviation (or scale) σ. The latter is particularly sensitive to values far from the average μ, so the scale of the image intensity distribution is better estimated by the following quantity:

\sigma_{L_1} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \mu|    (13)

where the image is considered as a one dimensional vector x. The matching of an image B to an image A can then be quantified by the maximum value of C(A, B), obtained by sliding the smaller of the

Adaptation should also take into account variations in time of the speaker's voice (intraspeaker variations). Adaptation requires only a few utterances to modify the codebook, as it is not necessary to design it from scratch (this would require at least 30-40 seconds of speech). In our case, the adaptation vectors are derived from the digit strings uttered by the users during a single test session. The dynamic codebooks were not adapted, since they represent temporal variations of the speech spectra and are therefore less sensitive to both intraspeaker voice variability and acquisition channel variations.

The adaptation process of the i-th codebook, C_i, can be summarized as follows (a sketch is given below):
1. the mean vectors μ_i and ν_i of the adaptation vectors and of the given codebook respectively are evaluated;
2. the difference vector Δ_i = μ_i - ν_i is evaluated;
3. the vectors of C_i are shifted by a quantity equal to Δ_i, obtaining a new set C'_i = {c_{i1} + Δ_i, ..., c_{iM} + Δ_i}; the set C'_i is placed in the region of the adaptation vectors;
4. the adaptation vectors are clustered using the set C'_i as the initial estimate of the centroids; a new set of centroids O_i = {o_{i1}, ..., o_{iM}} and the corresponding cell occupancies N_i = {n_{i1}, ..., n_{iM}} are thereby evaluated;
5. the adapted codebook \hat{Γ}_i is obtained according to the following equation:

\hat{\gamma}_{im} = c'_{im} + (1 - e^{-\theta n_{im}})(o_{im} - c'_{im}), \quad 1 \le m \le M    (7)

In the equation above the occupancy n_{im} determines the fraction of the deviation vector δ_{im} = (o_{im} - c'_{im}) that is added to the initial centroid c'_{im}. Eqn. (7) is a simple method to modify the centroids of a codebook according to the number of data available for their estimates; n_{im} can be zero when the utterance used for adaptation does not contain sounds whose spectra are related to the m-th centroid. For the system, θ was chosen equal to 0.1. The two shifts applied by the adaptation procedure can be interpreted as follows:
1. Δ_i, the major shift, accounts for environment and channel variations with respect to training;
2. δ_{im}, the minor shift, accounts for intraspeaker voice variations in time.
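A sketch of the five adaptation steps and eqn. (7). One simplification beyond the text: the clustering pass uses plain Euclidean distance rather than the W-weighted distance of eqn. (5).

```python
import numpy as np

def adapt_codebook(codebook, adapt_vectors, theta=0.1, n_iter=10):
    """Codebook adaptation, steps 1-5 above.

    codebook:      (M, P) centroids c_im of the codebook C_i
    adapt_vectors: (K, P) vectors derived from the adaptation utterances
    """
    c = np.asarray(codebook, dtype=float)
    v = np.asarray(adapt_vectors, dtype=float)
    # steps 1-3: shift the codebook into the region of the adaptation data
    c_prime = c + (v.mean(axis=0) - c.mean(axis=0))
    # step 4: cluster the adaptation vectors, starting from C'_i
    o = c_prime.copy()
    n = np.zeros(len(o))
    for _ in range(n_iter):
        d2 = ((v[:, None, :] - o[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        n = np.bincount(labels, minlength=len(o)).astype(float)
        for m in range(len(o)):
            if n[m] > 0:
                o[m] = v[labels == m].mean(axis=0)
    # step 5, eqn. (7): move each centroid toward its new estimate by a
    # fraction that grows with the cell occupancy n_im
    return c_prime + (1.0 - np.exp(-theta * n))[:, None] * (o - c_prime)
```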

3. Face recognition
Person identification through face recognition is the most familiar among the possible identification strategies. Several automatic or semiautomatic systems have been realized since the early seventies, albeit with varying degrees of success. Different techniques have been proposed, ranging from the geometrical description of salient facial features to the expansion of a digitized image of the face on an appropriate basis of images (see [8] for references). The strategy used by the described system is essentially based on the comparison, at the pixel level, of selected regions of the face [8]. A set of regions, respectively encompassing the eyes, nose and mouth of the user to be identified, is compared with the corresponding regions stored in the database for each reference user (see Figure 3). The images should represent a frontal view of the user face without marked expressions. As will be clear from the detailed description, these constraints could be relaxed at the cost of storing a higher number of images per user in the database.

Fig. 3. The highlighted regions represent the templates used for identification.

The fundamental steps of the face recognition process are the following:
1. acquisition of a frontal view of the user face;
2. geometrical normalization of the digitized image;
3. intensity normalization of the image;
4. comparison with the images stored in the database.
The image of the user face is acquired with a CCD camera and digitized with a frame grabber. To compare the resulting image with those stored in the database, it is necessary to register the image: it has to be translated, scaled and rotated so that the coordinates of a set of reference points take corresponding standard values. As frontal views are considered, the centers of the pupils represent a natural set of control points that can be located with good accuracy. Eyes can be found through the following steps:
1. locate the (approximate) symmetry axis of the face;
2. locate the left/right eye by using an eye template for which the location of the pupil is known; if the confidence of the eye location is not sufficiently high, declare failure (the identification system will use only acoustic information);
3. achieve translation, scale and rotation invariance by fixing the origin of the coordinate system at the midpoint of the interocular segment

The acoustic analysis of each frame is performed as follows:
1. the power spectrum of the sequence y_t(n) is evaluated;
2. a bank of Q = 24 triangular filters, spaced according to a logarithmic scale (Mel scale), is applied to the power spectrum, and the energy outputs s_tq, 1 ≤ q ≤ Q, of each filter are evaluated;
3. the Mel Frequency Cepstrum Coefficients (MFCC) [29], φ_tp, 1 ≤ p ≤ P = 8, are computed from the filterbank outputs according to the following equation (a sketch is given below):

\phi_{tp} = \sum_{q=1}^{Q} [\log(s_{tq})] \cos\left(p \pi \frac{q - 1/2}{Q}\right)    (3)

the MFCCs are arranged into a vector, φ_t, which is called static, since it refers to a single speech frame;
4. to account for the transitional information contained in the speech signal, a linear fit is applied to the components of 7 adjacent MFCC vectors; the resulting regression coefficients are arranged into a vector that is called dynamic;
5. a binary variable is finally evaluated that allows marking the frame as speech or background noise; this parameter is computed by means of the algorithm described in [12].

The Mel scale is motivated by auditory analysis of sounds. The inverse Fourier transform of the log-spectrum (cepstrum) provides parameters that improve performance at both speech and speaker recognition [23], [29]. Furthermore, the Euclidean distance between two cepstral vectors represents a good measure for comparing the corresponding speech spectra. The static and dynamic 8-dimensional vectors related to windows marked as background noise are not considered during either system training or testing.

As previously said, VQ is used to design the static and dynamic codebooks of a given reference speaker, say the i-th one. Starting from a set of training vectors (static or dynamic) Φ_i = {φ_i1, ..., φ_iK}, derived from a certain number of utterances, the objective is to find a new set Γ_i = {γ_i1, ..., γ_iM}, with M ≪ K, that represents well the acoustic characteristics of the given speaker. To do this a clustering algorithm, similar to that described in [21], is applied to the set Φ_i. The algorithm makes use of an iterative procedure that determines the codebook centroids, γ_im, by minimizing their average distance, D(Φ_i, Γ_i), from the training vectors:

D(\Phi_i, \Gamma_i) = \frac{1}{K} \sum_{k=1}^{K} \min_{m=1}^{M} [d(\phi_{ik}, \gamma_{im})]    (4)
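Eqn. (3) in code; the construction of the triangular Mel-spaced filterbank is assumed to be available and is not shown:

```python
import numpy as np

def mfcc(power_spectrum, mel_filters, p_max=8):
    """MFCCs of eqn. (3) from Q filterbank energies.

    power_spectrum: (n_bins,) power spectrum of one frame
    mel_filters:    (Q, n_bins) triangular Mel-spaced filters
    """
    q = mel_filters.shape[0]
    log_e = np.log(mel_filters @ power_spectrum)   # log filter energies
    ps = np.arange(1, p_max + 1)[:, None]          # p = 1..P
    qs = np.arange(1, q + 1)[None, :]              # q = 1..Q
    basis = np.cos(ps * np.pi * (qs - 0.5) / q)    # cosine basis of eqn. (3)
    return basis @ log_e                           # (P,) static vector
```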

The distance d(φ_ik, γ_im) is defined as follows:

d(\phi_{ik}, \gamma_{im}) = (\phi_{ik} - \gamma_{im})^t\, W^{-1}\, (\phi_{ik} - \gamma_{im})    (5)

In the equation above, t denotes transposition and W is the covariance matrix of the training vectors. The matrix W is estimated from the training data of all the speakers in the reference database. This matrix was found to be approximately diagonal, so that only the diagonal elements are used to evaluate distances.

In the recognition phase the distances, D_Si and D_Di, between the static and dynamic vector sequences derived from the input signal and the corresponding speaker codebooks are evaluated and sent to the integration module. If Φ = {φ_1, ..., φ_T} is the static (or dynamic) input sequence and Γ_i is the i-th static (or dynamic) codebook, then the total static (or dynamic) distance is:

D_i(\Phi, \Gamma_i) = \frac{1}{T} \sum_{t=1}^{T} \min_{m=1}^{M} [d(\phi_t, \gamma_{im})], \quad 1 \le i \le I    (6)

where I is the total number of speakers in the reference database.

To train the system, 200 isolated utterances of the Italian digits (from 0 to 9) were collected for each reference user. The recordings were made by means of a Digital Audio Tape (DAT) recorder: the signal on the DAT tape, sampled at 48 kHz, was downsampled to 16 kHz, manually end-pointed, and stored on a computer disk. The speech training material was analyzed and clustered as previously described. As demonstrated in [28], system performance depends on both input utterance length and codebook size; preliminary experiments suggested that the speaker to be identified should utter a string of at least 7 digits, in a continuous way and in whatever order. In the reported experiments the number of digits was kept equal to 7, and the codebook size was set to M = 64 because higher values did not improve recognition accuracy. Furthermore, if the input signal duration is too short, the system requires the user to repeat the digit string.

To evaluate integrated system performance (see Section 4.1), the reference users interacted 3 times with the system during 3 different sessions. The test sessions were carried out in an office environment using an ARIEL board as the acquisition channel. Furthermore, the test phase was performed about five months after the training recordings. Due to the differences in background noise and acquisition conditions between training and test, the codebooks must be adapted. Adaptation means designing a new codebook, starting from a given one, that better resembles the acoustic characteristics of both the operating environment and the acquisition channel.
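Eqns. (5)-(6) in code, using only the diagonal of W as stated above:

```python
import numpy as np

def codebook_distance(frames, codebook, w_diag):
    """D_i of eqn. (6): the average, over the input frames, of the
    distance to the nearest centroid under the diagonal Mahalanobis
    metric of eqn. (5).

    frames:   (T, P) static or dynamic vectors of the utterance
    codebook: (M, P) centroids of one reference speaker
    w_diag:   (P,) diagonal of the pooled covariance matrix W
    """
    diff = frames[:, None, :] - codebook[None, :, :]   # (T, M, P)
    d = (diff ** 2 / w_diag).sum(axis=2)               # (T, M)
    return d.min(axis=1).mean()

# Identification picks the speaker whose codebooks give the smallest
# distances, e.g.:
#   best = min(range(I), key=lambda i: codebook_distance(phi, books[i], w))
```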

cussed. Finally, the novel rank/measurement level integration strategy using a HyperBF network is introduced, with a detailed report on system performance.

2. Speaker recognition
The voice signal contains two types of information: individual and phonetic. They have mutual effects and are difficult to separate; this represents one of the main problems in the development of automatic speaker and speech recognition systems. The consequence is that speaker recognition systems perform better on speech segments having specific phonetic contents, while speech recognition systems provide higher accuracy when tuned to the voice of a particular speaker. Usually the acoustic parameters for a speech/speaker recognizer are derived by applying a bank of band-pass filters to adjacent short time windows of the input signal. The energy outputs of the filters, for the various frames, provide a good domain representation. Figure 1 gives an example of such an analysis. The speech waveforms correspond to utterances of the Italian digit 4 (/kwat:ro/) by two different speakers. The energy outputs of a 24 triangular band-pass filter bank are represented below the speech waveforms (darker regions correspond to higher energy values).

Fig. 1. Acoustic analysis of two utterances of the digit 4 (/kwat:ro/) by two different speakers.

In past years, several methods and systems for speaker identification [13], [16] were proposed that perform more or less efficiently depending on the text the user is required to utter (in general, systems can be distinguished into text dependent and text independent), the length of the input utterance, the number of people in the reference database and, finally, the time interval between test and training recordings. For security applications, it is desirable that the user utter a different sentence during each inter-

action. The content of the utterance can then be verified to ensure that the system is not cheated by prerecorded messages. For this work, a text independent speaker recognition system based on Vector Quantization (VQ) [28] was built. While it cannot yet verify the content of the utterance, it can be modified (using supervised clustering or other techniques) to obtain this result.

A block diagram of the system is depicted in Figure 2. In the system, each reference speaker is represented by means of two sets of vectors (codebooks) that describe his/her acoustic characteristics. During identification, two sets of acoustic features (static and dynamic), derived from the short time spectral analysis of the input speech signal, are classified by evaluating their distances from the prototype vectors contained in the speaker codebook couples. In this way, two lists of scores are sent to the integration module. In the following, both the spectral analysis and the vector quantization techniques will be described in more detail (see also [21] and a reference book such as [23]).

Since the power spectrum of the speech signal decreases as frequency increases, a preemphasis filter that enhances the higher frequencies is applied to the sampled input signal. The transfer function of the filter is H(z) = 1 - 0.95 z^{-1}. The preemphasized signal, x(n), 1 ≤ n ≤ N, is subdivided into frames y_t(n), 1 ≤ t ≤ T, having length L. Each frame is obtained by multiplying x(n) by a Hamming window h_t(n):

y_t(n) = x(n) \cdot h_t(n), \quad 1 \le t \le T = N/S    (1)

h_t(n) = 0.54 - 0.46 \cdot \cos\left(\frac{2\pi (n - tS)}{L}\right), \quad tS - \frac{L}{2} < n \le tS + \frac{L}{2}    (2)

#1 #2 #I

Voice

Signal

Static Parameters

Static Score List

Dynamic Score ListAnalysis

Acoustic

Distance Computation

Distance Computation

Dynamic Codebooks

Static Codebooks

Dynamic ParametersFig. 2. The speaker recognition system based on VectorQuantization.In the equation above L represents the length,in samples, of the Hamming window and S is theanalysis step (also expressed in samples). For thesystem L and S were chosen to correspond to 20ms and 10 ms respectively.The signal is multiplied by an Hamming window(raised cosine) to minimize the sidelobe e�ects onthe spectrum of the resulting sequence yt(n).2
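The preemphasis filter and the framing of eqns. (1)-(2) are straightforward to transcribe; the frame-boundary convention (windows starting at multiples of S) is a minor assumption:

```python
import numpy as np

def preemphasize(x, a=0.95):
    """H(z) = 1 - 0.95 z^{-1}: boost the higher frequencies."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - a * x[:-1])

def frames(x, fs=16000, win_ms=20, step_ms=10):
    """Hamming-windowed frames y_t(n) of eqns. (1)-(2)."""
    L = int(fs * win_ms / 1000)    # window length, in samples
    S = int(fs * step_ms / 1000)   # analysis step, in samples
    h = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(L) / L)
    return np.stack([x[t:t + L] * h for t in range(0, len(x) - L + 1, S)])
```

With fs = 16 kHz, the defaults give L = 320 and S = 160 samples, i.e. the 20 ms and 10 ms quoted above.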

Person identification using multiple cues
Roberto Brunelli and Daniele Falavigna
Istituto per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, ITALY
e-mail: [email protected], [email protected]

Abstract: This paper presents a person identification system based on acoustic and visual features. The system is organized as a set of non-homogeneous classifiers whose outputs are integrated after a normalization step. In particular, two classifiers based on acoustic features and three based on visual ones provide data for an integration module whose performance is evaluated. A novel technique for the integration of multiple classifiers at a hybrid rank/measurement level is introduced using HyperBF networks. Two different methods for the rejection of an unknown person are introduced. The performance of the integrated system is shown to be superior to that of the acoustic and visual subsystems. The resulting identification system can be used to log personal access and, with minor modifications, as an identity verification system.

Keywords: template matching, robust statistics, correlation, face recognition, speaker recognition, learning, classification.

1. Introduction
The identification of a person interacting with computers represents an important task for automatic systems in the areas of information retrieval, automatic banking, control of access to security areas, buildings and so on. The need for a reliable identification of interacting users is obvious. At the same time, it is well known that the security of such systems is too often violated in everyday life. The possibility of integrating multiple identification cues, such as password, identification card, voice, face, fingerprints and the like will, in principle, enhance the security of a system to be used by a selected set of people.

This paper describes in detail the theoretical foundations and design methodologies of a person recognition system that is part of MAIA, the integrated AI project under development at IRST [26]. Previous works on speaker recognition [30], [16] have proposed methods for classifying and combining acoustic features and for normalizing [27], [22] the various classifier scores. In particular, score normalization is a fundamental step when a system is required to confirm or reject the identity claimed by the user (user verification): in this case, in fact, the identity is accepted or rejected according to a comparison with a preestimated threshold. Since the integration of voice and images in an identification system is a new concept, new methods for both classifier normalization and integration were investigated. Effective ways of rejecting an unknown

person by considering score and rank information and of comparing images with improved similarity measures are proposed. A simple method for adapting the acoustic models of the speakers to a real operating environment was also developed.

The speaker and face recognition systems are decomposed into two and three single-feature classifiers respectively. The resulting five classifiers produce non-homogeneous lists of scores that are combined using two different approaches. In the first approach, the scores are normalized through a robust estimate of the location and scale parameters of the corresponding distributions. The normalized scores are then combined using a weighted geometric average, and the final identification is accepted or rejected according to the output of a linear classifier based on score and rank information derived from the available classifiers. Within the second approach, the problem of combining the normalized outputs of multiple classifiers and of accepting/rejecting the resulting identification is considered a learning task. A mapping from the scores and ranks of the classifiers into the interval (0, 1) is approximated using a HyperBF network. A final threshold is then introduced based on cross-validation. System performance is evaluated and discussed for both strategies.

Because of the novelty of the problem, standard databases for system training and test are not yet available. For this reason, the experiments reported in this paper are based on data collected at IRST. A system implementation operating in real-time is available and was tested on a variety of IRST researchers and visitors. The joint use of acoustic and visual features proved effective in increasing system performance and reliability.

The system described here represents an improvement over a recently patented identification system based on voice and face recognition [6], [9]. The two systems differ in many ways: in the latter, the speaker and face recognition systems are not further decomposed into classifiers, the score normalization does not rely on robust statistical techniques and, finally, the rejection problem is not addressed.

The next sections will introduce the speaker and face recognition systems. The first approach to the integration of classifiers and the linear accept/reject rule for the final system identification are then dis-