
IEEE Transactions on Consumer Electronics, Vol. 54, No. 4, NOVEMBER 2008

Contributed Paper. Manuscript received August 6, 2008. 0098-3063/08/$20.00 © 2008 IEEE.


Multimodal Biometric Authentication using Teeth Image and Voice in Mobile Environment

Dong-Ju Kim and Kwang-Seok Hong

Abstract — Mobile devices such as smart-phones, PDAs, and mobile-phones are vulnerable to theft and loss due to their small size and the characteristics of the environments in which they are used. A simple and convenient authentication system is required to protect the private information stored in such devices. Therefore, this paper proposes a new multimodal biometric authentication approach using the teeth image and voice as biometric traits. The individual matching scores obtained from the teeth image and voice are combined using a weighted-summation operation, and the fused score is used to classify an unknown user as accepted or rejected. The proposed method is evaluated using 1,000 teeth images and voice samples collected with a smart-phone, i.e., a single mobile device, from 50 subjects. The proposed method achieves an equal error rate (EER) of 2.13%, demonstrating its effectiveness.

Index Terms — multimodal biometric authentication, teeth image authentication, voice authentication, mobile environment.

I. INTRODUCTION

Mobile devices such as smart-phones, PDAs, and mobile-phones have brought users an unprecedented level of convenience and flexibility. Moreover, these devices are fast becoming a pervasive part of our lifestyle because of their mobility, miniaturization, decreasing price, and increasing computational power. However, the security and privacy issues related to mobile devices are unfortunately becoming a major problem, since the devices are often exposed in public places such as taxis, coffee houses, and airports, where they are vulnerable to theft or loss. Along with the value of the lost hardware, users also worry about the exposure of sensitive information: private data stored in a mobile device, such as names, addresses, short messages, and images, may be stolen. Traditional safeguards such as tokens and PINs (personal identification numbers) are easy to implement, but they are constantly at risk of being stolen or forgotten. Therefore, users need a more reliable and convenient means of identification or authentication, and biometrics, which identifies or authenticates an individual based on distinguishing biological or behavioral characteristics, has emerged as one of the most popular and promising alternatives.

D.-J. Kim (e-mail: [email protected]) and K.-S. Hong (e-mail: [email protected]) are with the School of Information and Communication Engineering, Sungkyunkwan University, 300 Chunchun-dong, Jangan-gu, Suwon 440-746, Korea.

In the literature, only a few published studies have investigated the problem of biometric authentication on mobile devices. For instance, face authentication has been studied by Venkataramani et al. [1], Qian et al. [2], and Hadid et al. [3]. In other research, Czyz et al. [4] studied the performance degradation, i.e., scalability, of face and speech identification systems when used on lower-capability devices such as mobile devices. Other multimodal biometric systems, using ear/speech [5] and face/signature [6], have also recently been investigated for mobile devices. Face information, in combination with speech, is frequently used for authentication, since most mobile devices are equipped with a camera and a microphone. However, biometrics based on facial features can be unreliable due to changes in make-up, mustache, beard, hair style, and so on. Thus, this paper suggests the teeth image, instead of the facial image, as a biometric trait for individual authentication. Although teeth authentication is affected by variable illumination, like face authentication, it is far less sensitive to factors such as facial expression, make-up, mustache, beard, and hair style. Biometrics utilizing teeth images is additionally expected to achieve reliable and robust individual authentication, since teeth are unique to each individual and hardly change during adulthood.

We present a new multimodal biometric authentication approach based on the teeth image and voice for mobile device security. The proposed multimodal combination of teeth image and voice has not, until now, been researched in a mobile environment. Biometrics using teeth images was first proposed by Tae-Woo Kim et al. [7] for individual identification. Their method consists of teeth image acquisition followed by teeth recognition, which comprises teeth region detection and pattern recognition based on linear discriminant analysis (LDA) as sequential steps. In addition, Prajuabklang et al. [8] proposed a real-time individual identification method using a modified principal component analysis (PCA), and Nadee et al. [9] proposed an improved PCA-based individual identification method using the invariance moment. Recently, Dong-Ju Kim et al. [10] conducted a comparative study of teeth recognition; they compared the performance of various recognition algorithms on teeth images, and the best performance was achieved using the two-dimensional discrete cosine transform (2D-DCT) with an embedded hidden Markov model (EHMM). In other research, Jain et al. [11] studied the matching of dental X-ray (radiograph) images for postmortem identification in forensic investigations; however, the requirements of mobile device security obviously differ from those of forensics.

In this paper, teeth authentication consists of image acquisition, teeth region detection with a pre-processing procedure, and an authentication phase, performed as sequential steps. In the authentication phase, we use the 2D-DCT and EHMM for teeth authentication, since that combination performed best [10]. Along with teeth authentication, voice authentication is incorporated into the biometric authentication system. The voice signal used in voice authentication is obtained simultaneously by pronouncing /i/ during teeth image acquisition, so it has the characteristics of a single vowel. For voice authentication, we use the pitch and mel-frequency cepstral coefficients (MFCC) as feature parameters and a Gaussian mixture model (GMM) to model the voice signal. Consequently, the multimodal biometric authentication system, which operates fully automatically, combines teeth and voice authentication. In the fusion phase, we apply a weighted-summation operation, since it has a comparatively simple structure and superior performance. The proposed system is evaluated using a database collected with a smart-phone, i.e., one kind of mobile device. The database consists of 1,000 images and voice samples from 50 subjects, i.e., 20 images and voice samples per individual.

The remainder of this paper is organized as follows. Sections II and III outline teeth authentication and voice authentication, respectively. Section IV describes the fusion method used by the proposed system to combine the two unimodal systems. Experimental results are presented in Section V, and we conclude in Section VI.

II. TEETH AUTHENTICATION

This section describes teeth authentication using the teeth image as a biometric trait. Teeth authentication consists of image acquisition, teeth region detection with a pre-processing procedure, feature extraction from the teeth image, and an authentication phase, performed as sequential steps. We use the AdaBoost algorithm based on Haar-like features for teeth region detection, and the EHMM algorithm with 2D-DCT feature vectors for the authentication step.

A. Teeth region detection

The first phase of teeth authentication is the detection of the teeth region in images acquired by the digital camera of a mobile device. We utilize the AdaBoost algorithm based on Haar-like features, introduced by Viola and Jones [12]. Haar-like features are used to search for the teeth region, and prototypes are trained through the AdaBoost learning algorithm to accurately represent it. AdaBoost constructs a strong classifier by combining multiple weak classifiers. A weak classifier is a single-layer perceptron, defined by equation (1). Each weak classifier $h_j$ is associated with a feature $f_j$ and a threshold $\theta_j$; $x$ denotes a sub-window, whose size is set to 24×12 pixels in this paper.

$$h_j(x) = \begin{cases} 1 & \text{if } f_j(x) < \theta_j \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

The Haar-like features obtained by the AdaBoost learning algorithm are organized into classification stages, as shown in Figure 1. An image is scanned at all possible locations and scales by a sub-window, and a cascade of classifiers is applied to each sub-window. The initial stage eliminates a large number of negative examples with very little processing, and every sub-window that passes through the entire cascade is classified as a "teeth" or "non-teeth" region. Higher classification stages process more Haar-like features than lower stages and can therefore discriminate teeth regions in more detail.

Fig. 1. Structure of cascade for teeth region detection.
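As an illustration, this cascade detection step can be reproduced with OpenCV's standard Haar-cascade machinery. This is only a sketch: the cascade file name below is hypothetical, standing in for a classifier trained on teeth regions with OpenCV's own AdaBoost training tools, and the scan parameters are assumptions rather than values from the paper.

```python
import cv2

# Hypothetical cascade file: a Haar/AdaBoost classifier trained on teeth
# regions (the paper trains its own cascade; no public model is implied).
cascade = cv2.CascadeClassifier("teeth_cascade.xml")

img = cv2.imread("capture.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Scan all locations and scales with a sliding sub-window; each hit is
# returned as an (x, y, w, h) rectangle, mirroring the cascade scan of Fig. 1.
regions = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                   minNeighbors=3, minSize=(24, 12))
for (x, y, w, h) in regions:
    teeth = gray[y:y + h, x:x + w]  # candidate teeth region for pre-processing
```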

After the teeth region is detected using the AdaBoost algorithm based on Haar-like features (Figure 2(a)), the detected region undergoes a pre-processing procedure, i.e., rotated-angle compensation, to improve performance, as shown in Figure 2.


Fig. 2. The pre-processing procedure of the teeth image: (a) detected teeth region in the captured image, (b) detected teeth image, (c) thresholded image, (d) centers of mass, (e) horizontal line connecting the two centers, (f) rotated image, (g) pre-processed teeth image.

The teeth image detected by the AdaBoost algorithm is illustrated in Figure 2(b), and the detected image is first converted into a thresholded image (Figure 2(c)). Since both sides of the teeth image are generally dark, this step can be performed reliably. The thresholding is given by

$$g(x,y) = \begin{cases} 0 & \text{if } f(x,y) \le T \\ 255 & \text{if } f(x,y) > T \end{cases} \qquad (2)$$


where $f(x,y)$ and $g(x,y)$ are the intensity values of the detected teeth image and the thresholded image at position $(x,y)$, respectively, and $T$ is the threshold value [13]. The centers of mass $C_L$ and $C_R$ of the two side areas are obtained from the thresholded image (Figure 2(d)). The rotation angle can then be calculated from the horizontal line connecting the two centers, as shown in Figure 2(e); an example rotated image is shown in Figure 2(f). Finally, the rotated-angle-compensated, i.e., pre-processed, image is acquired by cropping the teeth region, with the white area replaced by the corresponding pixel values from Figure 2(a), and normalizing its size, as shown in Figure 2(g). The pre-processed image is then used for feature extraction.
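The following sketch reproduces this rotated-angle compensation with OpenCV and NumPy. The threshold value, the left/right split of the image, and the output-size convention are assumptions made for illustration; the paper defers the threshold choice to [13].

```python
import cv2
import numpy as np

def preprocess_teeth(region, T=60):
    """Rotated-angle compensation for a detected teeth region (grayscale).
    T is an assumed threshold value for eq. (2)."""
    # Eq. (2): binarize; the dark side areas become foreground after inversion.
    _, binary = cv2.threshold(region, T, 255, cv2.THRESH_BINARY)
    dark = 255 - binary

    # Centers of mass C_L and C_R of the left and right halves.
    h, w = dark.shape
    mL = cv2.moments(dark[:, :w // 2])
    mR = cv2.moments(dark[:, w // 2:])
    cL = (mL["m10"] / mL["m00"], mL["m01"] / mL["m00"])
    cR = (w // 2 + mR["m10"] / mR["m00"], mR["m01"] / mR["m00"])

    # Angle of the line connecting the centers, then rotate to level it.
    angle = np.degrees(np.arctan2(cR[1] - cL[1], cR[0] - cL[0]))
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    rotated = cv2.warpAffine(region, M, (w, h))

    # Normalize to the paper's 40x80-pixel size (height x width assumed).
    return cv2.resize(rotated, (80, 40))
```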

B. Feature extraction

The discrete cosine transform is a well-known signal analysis tool, used especially in compression standards due to its compact representation power. Recently, the two-dimensional discrete cosine transform (2D-DCT) has been employed in face recognition to reduce dimensionality. The advantage of the 2D-DCT is that it is data-independent: its basis images are fixed rather than derived from the set of training images. It can also be implemented with a fast algorithm. Feature extraction from a teeth image using the 2D-DCT consists of two steps [14].

In the first step, the pre-processed teeth image is divided into small image blocks, as shown in Figure 3, which also illustrates the block extraction method for 2D-DCT analysis.

Fig. 3. Block extraction in a teeth image.

Let $P \times L$ be the window size of the 2D-DCT and $Q \times M$ be the overlap size in the horizontal and vertical directions, respectively. Then, for an image of width $W$ and height $H$, the number of blocks is

$$T = \frac{W - Q}{P - Q} \times \frac{H - M}{L - M} \qquad (3)$$

In the next step, the 2D-DCT coefficients of each image block $f(x,y)$ are calculated. Assuming $P$ and $L$ are both equal to $N$ ($P = L = N$), the 2D-DCT coefficients $C(u,v)$ are defined by

$$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y)\,\beta(x,y,u,v) \qquad (4)$$

for $u, v = 0, 1, 2, \ldots, N-1$, where

$$\alpha(u), \alpha(v) = \begin{cases} \sqrt{1/N} & \text{for } u, v = 0 \\ \sqrt{2/N} & \text{for } u, v = 1, 2, \ldots, N-1 \end{cases}$$

and

$$\beta(x,y,u,v) = \cos\left[\frac{(2x+1)u\pi}{2N}\right] \times \cos\left[\frac{(2y+1)v\pi}{2N}\right].$$

Usually, all blocks have the same size; we use 8×8 blocks in this paper [15]. The computed coefficients are then ordered in a zig-zag fashion. The purpose of this ordering is to reflect the amount of information stored in each coefficient: lower-order coefficients usually contain more information. After ordering, the 2D-DCT coefficients computed from all image blocks are used to form the observation vectors of the EHMM.
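A compact sketch of this block-wise 2D-DCT feature extraction is given below, using SciPy's orthonormal DCT. The overlap values and the number of zig-zag-ordered coefficients kept per block are illustrative assumptions; the paper fixes only the 8×8 block size.

```python
import numpy as np
from scipy.fftpack import dct

def block_dct_features(img, P=8, L=8, Q=4, M=4, n_coef=6):
    """Extract overlapping P x L blocks (Q x M overlap, eq. 3) and keep the
    first n_coef 2D-DCT coefficients of each block in zig-zag order."""
    H, W = img.shape
    # Zig-zag ordering of block coordinates by anti-diagonal (eq. 4 indices).
    zigzag = sorted(((u, v) for u in range(L) for v in range(P)),
                    key=lambda p: (p[0] + p[1],
                                   p[1] if (p[0] + p[1]) % 2 else p[0]))
    obs = []
    for y in range(0, H - L + 1, L - M):
        for x in range(0, W - P + 1, P - Q):
            block = img[y:y + L, x:x + P].astype(float)
            # Eq. (4): separable orthonormal 2D-DCT of the block.
            c = dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")
            obs.append([c[u, v] for (u, v) in zigzag[:n_coef]])
    return np.array(obs)  # one observation vector per block
```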

C. Embedded hidden Markov model

An HMM is a Markov chain with a finite number of unobservable states [16]. Although the Markov states are not directly observable, each state has a probability distribution associated with the set of possible observations. The EHMM is an extension of the one-dimensional HMM to two-dimensional data such as images and videos. It was first introduced for character recognition by Kuo and Agazzi [17], and was applied as a new approach to face recognition by Nefian et al. [14]. Since this method showed the best performance in [10], we adopt Nefian's EHMM for modeling teeth images in this paper. An EHMM comprises a set of super-states, each associated with a set of embedded states. The super-states represent the primary image regions along the vertical direction, while the embedded states within each super-state describe the image regions along the horizontal direction in detail. In this structure, the sequence of super-states models a horizontal slice of the image along the vertical direction, and the sequence of embedded states within a super-state models a block of the image along the horizontal direction. The elements of the EHMM are defined as follows [14].

1) $N_0$: the number of super-states in the vertical direction.
2) $\Pi_0$: the initial super-state probability distribution, i.e., $\Pi_0 = \{\pi_{0,i} : 1 \le i \le N_0\}$, where $\pi_{0,i}$ is the initial probability of being in the $i$-th super-state.
3) $A_0$: the super-state transition probability matrix, i.e., $A_0 = \{a_{0,ij} : 1 \le i,j \le N_0\}$, where $a_{0,ij}$ is the probability of a transition from the $i$-th super-state to the $j$-th super-state.
4) $\Lambda_0$: the set of one-dimensional HMMs of the super-states, i.e., $\Lambda_0 = \{\lambda^k : 1 \le k \le N_0\}$, where $\lambda^k$ denotes the model parameters of the embedded states in the $k$-th super-state. Each $\lambda^k$ is represented by the following one-dimensional HMM parameters:
• $N_1^k$ is the number of embedded states in the $k$-th super-state.
• $\Pi_1^k = \{\pi_{1,i}^k : 1 \le i \le N_1^k\}$ is the initial state probability distribution, where $\pi_{1,i}^k$ is the probability of being in the $i$-th state of the $k$-th super-state.
• $A_1^k = \{a_{1,ij}^k : 1 \le i,j \le N_1^k\}$ is the state transition probability matrix, where $a_{1,ij}^k$ specifies the probability of a transition from the $i$-th state to the $j$-th state within the $k$-th super-state.
• $B^k = \{b_i^k(O_{t_0,t_1}) : 1 \le i \le N_1^k\}$ is the observation probability matrix, where $O_{t_0,t_1}$ is the observation vector at row $t_0$ and column $t_1$, and $b_i^k(O_{t_0,t_1})$ denotes the probability of observing $O_{t_0,t_1}$ in the $i$-th state of the $k$-th super-state. In a continuous-density HMM, the states are characterized by a continuous observation density function, typically represented as a mixture of Gaussians:

$$b_i^k(O_{t_0,t_1}) = \sum_{m=1}^{M} c_{i,m}^k\, N(O_{t_0,t_1}, \mu_{i,m}^k, U_{i,m}^k), \qquad (5)$$

where $1 \le i \le N_1^k$, $c_{i,m}^k$ denotes the mixture coefficient of the $m$-th mixture in the $i$-th embedded state of the $k$-th super-state, and $N(O_{t_0,t_1}, \mu_{i,m}^k, U_{i,m}^k)$ is a Gaussian probability density function with mean vector $\mu_{i,m}^k$ and covariance matrix $U_{i,m}^k$.

Since $\Lambda_0$ denotes the set of one-dimensional HMMs of the super-states, the EHMM is completely specified by the parameter set

$$\lambda = \{\Pi_0, A_0, \Lambda_0\}, \qquad (6)$$

where $\Lambda_0 = \{\lambda^k : 1 \le k \le N_0\}$, and the $k$-th super-state is defined by the parameter set $\lambda^k = \{\Pi_1^k, A_1^k, B^k\}$. Although the EHMM is more complex than a one-dimensional HMM, it is better suited to 2-D images. Figure 4 illustrates the state structure of an EHMM with three super-states for teeth image modeling: the super-states form an overall top-to-bottom HMM, and a left-to-right HMM is assigned to each super-state. To model the teeth image, we compose an EHMM with three super-states containing three, five, and three embedded states, respectively, as in Figure 4. Each super-state represents vertical teeth features, namely the upper region of the teeth, the teeth region itself, and the lower region of the teeth, while each embedded state within a super-state represents horizontal local features. Overall, teeth authentication consists of image acquisition, teeth region detection, feature extraction, and pattern representation, performed as sequential steps.

Fig. 4. An illustration of EHMM for modeling teeth.
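For concreteness, the embedded-state observation density of eq. (5) can be evaluated in log-space as below. This is a minimal sketch assuming diagonal covariance matrices; the full EHMM decoding (a doubly embedded Viterbi pass over super-states and embedded states) is beyond a short example.

```python
import numpy as np

def mixture_log_density(o, weights, means, variances):
    """log b_i^k(O_{t0,t1}) from eq. (5) for one embedded state:
    a weighted sum of D-dimensional Gaussians with diagonal covariances."""
    logs = []
    for c, mu, var in zip(weights, means, variances):
        d = o - mu
        # log N(o; mu, diag(var)) for a D-dimensional observation vector
        log_n = -0.5 * (np.sum(d * d / var) + np.sum(np.log(2 * np.pi * var)))
        logs.append(np.log(c) + log_n)
    return np.logaddexp.reduce(logs)  # log of the mixture sum
```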

III. VOICE AUTHENTICATION

This section describes the voice authentication method. Voice authentication is added to the multimodal biometric authentication system alongside teeth authentication. During image acquisition, the voice signal is acquired simultaneously by pronouncing /i/. We use the MFCC and pitch as voice features, and the GMM algorithm is applied in the voice authentication process.

A. Feature extraction

The MFCC and pitch are used as voice signal features for voice authentication in this paper. The MFCC is the most popular front-end for speaker recognition in general. To obtain the MFCCs, the voice is first pre-emphasized using a pre-emphasis filter in order to spectrally flatten the signal, and the pre-emphasized speech is then separated into short segments called frames. The frame length is set to 32 ms (512 samples) to guarantee that the signal is stationary within a frame, and a 16 ms (256 samples) overlap between adjacent frames ensures continuity across frame boundaries. A frame can be seen as the result of multiplying the speech waveform by a rectangular pulse whose width equals the frame length; this introduces significant high-frequency noise at the beginning and end of the frame because of the sudden changes from zero to signal and from signal to zero. To reduce these familiar edge effects, a 512-point Hamming window is applied to each frame. Next, the pre-processed voice is converted into the frequency domain using the discrete Fourier transform (DFT), and the log magnitude of the complex spectrum is obtained. Mel-scaling is then performed on the log-magnitude spectrum using triangular filters that are linearly spaced from 0 to 1 kHz and non-linearly placed above that according to the mel-scale approximation. The mel-scale conversion is given by [18]

$$F_{mel} = 2595 \log_{10}\left(1 + \frac{F_{in}}{700}\right), \qquad (7)$$

where $F_{mel}$ denotes the frequency in mels and $F_{in}$ is the input frequency in hertz. The signal resulting from the filtering is then transformed into the cepstral domain using an inverse DFT, and the lower-order coefficients are selected as the feature parameters, i.e., the MFCCs, of the voice signal. We use 13 MFCC coefficients in this paper.

The pitch parameter and the MFCC are commonly used together as features for voice authentication; the pitch period is considered an important parameter in designing a successful speaker authentication system. The pitch signal, also known as the glottal waveform, is produced by vocal fold vibration. Two features related to the pitch signal are widely used, namely the pitch frequency and the glottal air velocity at the instant the vocal folds open. The elapsed time between two successive vocal fold openings is called the pitch period $T$, while the vibration rate of the vocal folds is the fundamental frequency $F_0$. A widespread pitch extraction method is based on the autocorrelation; however, this method sometimes fails to provide an accurate result due to the wide range of variation present in real voice signals [19]. Nevertheless, we can use the autocorrelation to extract pitch here, because the input voice signal is a simple single-vowel signal. An example voice waveform and the pitch period extracted by the autocorrelation method are shown in Figure 5, confirming that the pitch period is extracted stably.

Fig. 5. Waveform and pitch period of the voice /i/.

After the MFCC and pitch parameters are extracted, they are combined into a new feature set by simple concatenation, as shown in Figure 6, and the feature vector $x_i$ of the $i$-th frame is then used as the observation vector of the GMM algorithm.

Fig. 6. A procedure to obtain the GMM observation vector.
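A sketch of this front-end is shown below, with librosa standing in for the paper's own MFCC implementation. The pre-emphasis coefficient and the 60-400 Hz pitch search range are assumptions; the frame length (512 samples), hop (256 samples), Hamming window, and 13 MFCC orders follow the text.

```python
import numpy as np
import librosa

def voice_features(y, sr=16000, frame=512, hop=256):
    """Per-frame feature vectors x_i = [13 MFCCs, pitch], as in Fig. 6."""
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])   # pre-emphasis filter
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame, hop_length=hop, center=False)
    win = np.hamming(frame)
    rows = []
    for t in range(mfcc.shape[1]):
        seg = y[t * hop : t * hop + frame] * win
        # Pitch period from the autocorrelation peak; the single-vowel /i/
        # keeps this estimate stable, as noted above.
        ac = np.correlate(seg, seg, mode="full")[frame - 1:]
        lo, hi = sr // 400, sr // 60             # assumed 60-400 Hz range
        f0 = sr / (lo + np.argmax(ac[lo:hi]))
        rows.append(np.append(mfcc[:, t], f0))   # concatenate MFCC and pitch
    return np.array(rows)
```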

B. Gaussian mixture model

Many classifier approaches, such as vector quantization (VQ), Bayesian discriminants, dynamic time warping (DTW), the Gaussian mixture model (GMM), the hidden Markov model (HMM), and neural networks (NN), have been studied for speaker recognition. Among these approaches, the GMM yields the best performance, especially for text-independent applications [20]. The GMM is a powerful approach to modeling a speaker's characteristics, owing to its flexibility in approximating the underlying probability distribution in a high-dimensional space.

A GMM comprises a weighted sum of $M$ component densities:

$$p(x \mid \lambda) = \sum_{i=1}^{M} c_i\, b_i(x), \qquad (8)$$

where $x$ is a $D$-dimensional feature vector, $b_i(x)$, $i = 1, \ldots, M$, are the component densities, and $c_i$, $i = 1, \ldots, M$, are the mixture weights. Each component density is a $D$-variate Gaussian function:

$$b_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right], \qquad (9)$$

where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix. The mixture weights must satisfy the constraint

$$\sum_{i=1}^{M} c_i = 1. \qquad (10)$$

The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices, and mixture weights of all component densities, collectively represented as

$$\lambda = \{c_i, \mu_i, \Sigma_i\}, \quad i = 1, \ldots, M. \qquad (11)$$

In voice authentication, these parameters are estimated by the EM algorithm from the training voice signals, and each user is represented by a GMM $\lambda$. For a sequence of $T$ test vectors $X = x_1, x_2, \ldots, x_T$, the GMM log-likelihood is calculated by the following equation and used for voice authentication:

$$L(X \mid \lambda) = \log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda). \qquad (12)$$
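In practice, eqs. (8)-(12) correspond directly to an off-the-shelf GMM such as scikit-learn's, sketched below. The mixture count and covariance type are assumptions (the paper does not report them), and train_frames/test_frames are hypothetical arrays of the per-frame feature vectors from Fig. 6.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One model per enrolled user; EM estimates the eq. (11) parameters
# {c_i, mu_i, Sigma_i}. M = 16 diagonal components is an assumed choice.
user_gmm = GaussianMixture(n_components=16, covariance_type="diag",
                           random_state=0)
user_gmm.fit(train_frames)            # rows are training vectors x_t

# Eq. (12): sum of per-frame log-likelihoods of the test utterance,
# used as the raw voice score for this claimed identity.
raw_voice_score = user_gmm.score_samples(test_frames).sum()
```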


IV. MULTIMODAL BIOMETRIC AUTHENTICATION

In this paper, we combine the results of teeth and voice authentication to improve system performance. Figure 7 shows a block diagram of the multimodal biometric authentication system using teeth and voice. We use the AdaBoost algorithm based on Haar-like features for teeth region detection in an image; the detected teeth image undergoes pre-processing, and 2D-DCT features are extracted. The raw score of the teeth image is calculated by the EHMM algorithm and used for teeth authentication. In voice authentication, the voice signal is first windowed and pre-emphasized; features such as the MFCC and pitch are then extracted from the pre-processed voice signal and combined into a new feature set by simple concatenation. Using this feature set, we calculate the raw score of the voice with the GMM algorithm. Since the raw scores of teeth and voice have different distributions, we apply the sigmoid function to normalize them to the interval [0, 1]. Finally, we compose the multimodal biometric authentication system by fusing these normalized scores with a weighted-summation method; the fused score is used to classify the unknown user as accepted or rejected.

A. Score normalization

Normalization transforms the raw scores obtained from different modalities into a common domain using a mapping function. Common normalization methods include z-score, min-max, decimal scaling, and the sigmoid function. We use the sigmoid function to normalize the raw scores of teeth and voice. This normalization maps the raw scores to the interval [0, 1] [21], and is defined by

$$o_i = \frac{1}{1 + \exp(-\tau_i(o_{i,orig}))}, \qquad (13)$$

where $\tau_i(o_{i,orig}) = [o_{i,orig} - (\mu_i - 2\sigma_i)]/(2\sigma_i)$, $o_{i,orig}$ is the raw score of the $i$-th modality, $o_i$ is the normalized score, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the raw scores, respectively.

B. Fusion

Once normalized, the scores obtained from teeth and voice authentication are combined using a simple weighted-summation operation. This fusion method requires no training phase [22]. It is given by

$$S_m = p\,S_t + (1 - p)\,S_v, \quad 0 \le p \le 1, \qquad (14)$$

where $S_t$ and $S_v$ are the normalized scores of teeth and voice, respectively, and $S_m$ is the fused score. $p$ is the weight of the normalized score from teeth authentication, while $(1 - p)$ is the weight of the normalized score from voice authentication.
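The normalization and fusion steps amount to only a few lines; a sketch follows. The example numbers and the decision threshold are placeholders: the paper selects its operating point from the FAR/FRR curves rather than stating a threshold.

```python
import numpy as np

def sigmoid_normalize(raw, mu, sigma):
    """Eq. (13): map a raw score to [0, 1], given the mean and standard
    deviation of that modality's raw-score distribution."""
    tau = (raw - (mu - 2.0 * sigma)) / (2.0 * sigma)
    return 1.0 / (1.0 + np.exp(-tau))

def fuse(s_teeth, s_voice, p=0.5):
    """Eq. (14): weighted summation of the two normalized scores."""
    return p * s_teeth + (1.0 - p) * s_voice

# Example with made-up numbers: raw scores and score statistics would
# come from the matchers and the training data, respectively.
s_t = sigmoid_normalize(-412.7, mu=-450.0, sigma=35.0)    # teeth (EHMM)
s_v = sigmoid_normalize(-1523.4, mu=-1600.0, sigma=80.0)  # voice (GMM)
accept = fuse(s_t, s_v, p=0.5) >= 0.5                     # assumed threshold
```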

V. EXPERIMENTAL RESULTS

A. Database collection

The proposed multimodal biometric authentication system is implemented using Microsoft eMbedded Visual C++ 4.0 on an HP iPAQ rw6100 mobile device equipped with a camera and a sound-recording device. To evaluate the proposed system, we construct a database of teeth images and voice samples under constrained conditions, i.e., controlled illumination and a noiseless environment. In particular, the teeth images were captured as frontal views with the teeth in occlusion, and the illumination intensity was maintained at about 200 lux indoors. Images are captured at a resolution of 640×480 pixels and normalized to 40×80 pixels after the pre-processing procedure, and the voice is recorded at a sampling rate of 16 kHz with 16 bits per sample. In total, the database contains 1,000 images and voice samples from 50 subjects, i.e., 20 images and voice samples per individual. We use 250 images and voice samples, i.e., five per individual, for training, and the remaining ones to evaluate performance.

Fig. 7. Multimodal biometric authentication system using teeth and voice.


B. Performance evaluation

In the experiments, the teeth region detection rate was 98.87%, and the average teeth region detection time was 2.92 s per image. We also investigated the computational complexity in terms of the training time per model and the authentication time per attempt: on average, training a user model took 55.97 s and authenticating a user took 10.76 s. Figure 8 shows an example screen of the proposed multimodal biometric authentication system. In this figure, the user is authenticated; the end-point detection result for the voice and the teeth region detection result for the input image are also visible.

Fig. 8. Implemented display of proposed system.

Figure 9 shows the two-dimensional distribution of the normalized scores from teeth and voice authentication for genuine users and imposters. As observed in Figure 9, the imposter scores are concentrated in a region of small values, while the genuine-user scores take larger values.

Fig. 9. Distribution of teeth and voice normalized-scores.

The performance indicator adopted in this paper is the equal error rate (EER), the error rate at which the false acceptance rate (FAR) and the false rejection rate (FRR) assume the same value; it is commonly adopted as a single measure characterizing the security level of a biometric system. Figure 10 plots the EER obtained by combining the normalized scores with the weighted-summation method as the teeth-score weight $p$ varies from 0 to 1. The results show that the EERs of the unimodal methods are 6.42% for teeth authentication and 6.24% for voice authentication. By combining the two modalities with the weighted-summation method, the EER drops to 2.13% at $p = 0.5$. We also examine the receiver operating characteristic (ROC) curves at $p = 0.5$, shown in Figure 11. As observed in the ROC curves, the proposed multimodal system outperforms both unimodal methods, i.e., teeth authentication and voice authentication, confirming the performance improvement achieved by fusion.
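As a reference for how such numbers are obtained, a simple EER computation over fused genuine and imposter score sets might look as follows; the score arrays are hypothetical stand-ins for the evaluation set.

```python
import numpy as np

def equal_error_rate(genuine, imposter):
    """Sweep the decision threshold over all observed fused scores and
    return the point where FAR and FRR are (nearly) equal."""
    thresholds = np.sort(np.concatenate([genuine, imposter]))
    far = np.array([(imposter >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])

# Hypothetical fused scores S_m for genuine and imposter attempts.
rng = np.random.default_rng(0)
eer = equal_error_rate(rng.normal(0.8, 0.1, 500), rng.normal(0.3, 0.15, 2000))
```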

Fig. 10. Resultant EER against weight of teeth score.

Fig. 11. ROC of proposed system.


VI. CONCLUSIONS

Since mobile devices are vulnerable to theft and loss, the private information of their users may be compromised. We have presented a new multimodal biometric authentication approach to protect this information. The approach implemented in this study operates as fully automatic authentication using the teeth image and voice. Teeth authentication comprises image acquisition, teeth region detection with pre-processing, and an authentication procedure based on the 2D-DCT and EHMM, performed as sequential steps. Voice is authenticated using the combined features of pitch and MFCC together with the GMM algorithm. Finally, in the fusion phase, we build the multimodal authentication system using a weighted-summation operation, which has a simple structure and superior performance, fusing the normalized scores of teeth and voice authentication to improve overall performance. The proposed system is evaluated using 1,000 images and voice samples acquired with a mobile device from 50 subjects. In the experiments, we obtain EERs of 6.42% for teeth authentication and 6.24% for voice authentication, while the proposed multimodal system exhibits an EER of 2.13%. Thus, the performance of the proposed system is better than that obtained using teeth or voice individually, demonstrating the effectiveness of the proposed method.

In future work, we plan to reduce the processing time in the mobile environment through improved algorithms and optimization. Furthermore, we would like to extend this study to consider the effects of variable noise and illumination, and to investigate other fusion methods such as the Bayesian classifier, the Fisher classifier, and the support vector machine (SVM) classifier.

ACKNOWLEDGMENT

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF-2006-0889-000).

REFERENCES

[1] K. Venkataramani, S. Qidwai, and B. VijayaKumar, "Face authentication from cell phone camera with illumination and temporal variations," IEEE Trans. on Systems, Man and Cybernetics, vol. 35, no. 3, pp. 411-418, 2005.
[2] Q. Tao and R. N. J. Veldhuis, "Biometric authentication for a mobile personal device," in International Conference on Mobile and Ubiquitous Systems, pp. 1-3, 2006.
[3] A. Hadid, J. Y. Heikkila, O. Silven, and M. Pietikainen, "Face and eye detection for person authentication in mobile phones," in International Conference on Distributed Smart Cameras, pp. 101-108, 2007.
[4] J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe, "Scalability analysis of audio-visual person authentication," in International Conference on Audio and Video Based Biometric Person Identification, pp. 752-760, 2003.
[5] K. Iwano, T. Miyazaki, and S. Furui, "Multimodal speaker verification using ear image features extracted by PCA and ICA," LNCS 3546, pp. 588-596, 2005.
[6] D. J. Lee, K. C. Kwak, J. O. Min, and M. G. Chun, "Multi-modal biometrics system using face and signature," LNCS 3043, pp. 635-644, 2004.
[7] T.-W. Kim and T.-K. Cho, "Teeth image recognition for biometrics," IEICE Transactions on Information and Systems, vol. E89-D, no. 3, pp. 1309-1313, 2006.
[8] K. Prajuabklang, P. Kumhom, T. Maneewarn, and K. Chamnongthai, "Real-time personal identification from teeth-image using modified PCA," in Proc. 4th Information and Computer Engineering Postgraduate Workshop, vol. 4, no. 1, pp. 172-175, 2004.
[9] C. Nadee, P. Kumhom, and K. Chamnongthai, "Improved PCA-based personal identification method using invariance moment," in Third International Conference on Intelligent Sensing and Information Processing, December 14-17, 2005.
[10] D.-J. Kim, J.-B. Jeon, and K.-S. Hong, "Performance evaluation of feature vectors for teeth image recognition," in The 4th Conference on New Exploratory Technologies (NEXT 2007 KOREA), October 25-27, 2007.
[11] A. Jain and H. Chen, "Matching of dental X-ray images for human identification," Pattern Recognition, vol. 37, pp. 1519-1532, July 2004.
[12] P. Viola and M. J. Jones, "Robust real-time object detection," Technical Report CRL 2001/01, Compaq Cambridge Research Laboratory, Feb. 2001.
[13] R. C. Gonzalez and R. E. Woods, "Digital Image Processing," Addison-Wesley, 1992.
[14] A. Nefian and M. Hayes, "An embedded HMM-based approach for face detection and recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, vol. 6, 1999.
[15] S. Eickeler, S. Müller, and G. Rigoll, "Recognition of JPEG compressed face images based on statistical methods," Image and Vision Computing, vol. 18, no. 4, pp. 279-287, 2000.
[16] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[17] S. Kuo and O. Agazzi, "Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, pp. 842-848, August 1994.
[18] D. O'Shaughnessy, "Speech Communication: Human and Machine," Addison-Wesley, New York, 1987.
[19] S. Kadambe, "The application of time-frequency and time-scale representations in speech analysis," Ph.D. thesis, Dept. of Electrical Engineering, University of Rhode Island, 1991.
[20] D. A. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
[21] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, "Acoustic-labial speaker authentication," Pattern Recognition Letters, vol. 18, pp. 853-858, 1997.
[22] A. Ross and A. K. Jain, "Information fusion in biometrics," Pattern Recognition Letters, vol. 24, pp. 2115-2125, 2003.

Dong-Ju Kim received the B.S. and M.S. degrees in Radio and Science Engineering from Chungbuk National University. He is currently a Ph.D. candidate in the Department of Information and Communication Engineering, Sungkyunkwan University. His current research focuses on digital signal processing and biometrics.

Kwang-Seok Hong received the B.S., M.S., and Ph.D. degrees in electronic engineering from Sungkyunkwan University, Seoul, Korea, in 1985, 1988, and 1992, respectively. In March 1990, he joined Seoul Health College, Seoul, Korea, as a full-time lecturer in Computer Engineering. From March 1993 to February 1995, he was a full-time lecturer at Cheju University, Cheju, Korea. Since March 1995, he has been a professor at Sungkyunkwan University, Suwon, Korea. His current research focuses on VoiceXML and the integration and representation of the five senses.