
Speech Communication 22 (1997) 369–384

Telephone speech recognition based on Bayesian adaptation of hidden Markov models

Jen-Tzung Chien ¹, Hsiao-Chuan Wang *

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, 30043, Taiwan, ROC

Received 22 October 1996; revised 3 April 1997; accepted 1 July 1997

Abstract

This paper presents an adaptation method of speech hidden Markov models (HMMs) for telephone speech recognition. Our goal is to automatically adapt the HMM parameters so that the adapted parameters match the telephone environment. In this study, two kinds of transformation-based adaptation are investigated: the bias transformation and the affine transformation. A Bayesian estimation technique which incorporates prior knowledge into the transformation is applied for estimating the transformation parameters. Experiments show that the proposed approach can be successfully employed for self adaptation as well as supervised adaptation. In addition, the performance of telephone speech recognition using Bayesian adaptation is shown to be superior to that using maximum-likelihood adaptation, and the affine transformation is demonstrated to be significantly better than the bias transformation. © 1997 Elsevier Science B.V.

Résumé

This article presents an adaptation method for hidden Markov models (HMMs) for telephone speech recognition. Our goal is to automatically adapt the HMM parameters to the telephone environment. In this article, two types of transformation-based adaptation are studied: the bias transformation and the affine transformation. To estimate the transformation parameters, a Bayesian estimation technique that incorporates prior knowledge into the transformation is applied. Experiments show that the proposed approach can be applied successfully to both self adaptation and supervised adaptation. Moreover, telephone speech recognition using Bayesian adaptation is shown to perform better than maximum-likelihood adaptation, and the affine transformation is also shown to be clearly more effective than the bias transformation. © 1997 Elsevier Science B.V.

Keywords: Telephone speech recognition; Model adaptation; Hidden Markov model; Maximum a posteriori estimation; Affine transformation

* Corresponding author. E-mail: [email protected].
¹ Jen-Tzung Chien is now an assistant professor at the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC.

0167-6393/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0167-6393(97)00033-2


1. Introduction

Telecommunication technology has been growing rapidly in recent years, and people can easily inquire about or reserve a variety of information through telephone services. To automate such telephone services, it is essential to develop speech recognition techniques that are robust under telephone environments (Juang, 1991; Gong, 1995; Lee, 1997). In telephone networks, a major problem of speech recognition comes from the acoustic mismatch between training and testing environments. The mismatch due to speaker, telephone handset, transmission line and ambient noise usually causes serious degradation of recognition performance. To improve the performance, a large amount of telephone data containing all the acoustic variabilities of telephone environments could be collected for generating robust speech models; however, this approach is impractical in real applications. Thus, several robust algorithms, such as codeword-dependent cepstral normalization (CDCN) (Acero and Stern, 1990), the relative spectral (RASTA) method (Hermansky and Morgan, 1994), and signal bias removal (SBR) (Rahim and Juang, 1996), were developed for reducing the acoustic mismatch between training and testing environments without model retraining. In addition, a practical approach for telephone speech recognition is to adapt a given set of speech hidden Markov models (HMMs) so that the adapted HMM parameters are acoustically close to the real telephone environment. Using the adapted HMM parameters, the recognition performance can be significantly improved. In practice, the adaptation approaches can be employed in two cases: (1) self adaptation and (2) supervised adaptation. In self adaptation, the adaptation of HMM parameters is performed on the testing data in an unsupervised manner (Chien et al., 1995a; Sankar and Lee, 1996). On the other hand, the adaptation can also be performed in a supervised manner. In supervised adaptation, the HMM parameters are adapted to a new speaker and telephone channel by using some adaptation data for which the true transcriptions are given (Takahashi and Sagayama, 1994; Takagi et al., 1995); the telephone speech is then recognized without adaptation in the testing phase. Besides, speaker adaptation techniques which adapt existing speaker-independent (SI) HMM parameters to a new speaker are also feasible for telephone speech recognition, since the difference between speakers (i.e., vocal tract systems) may be equated to the channel mismatch of telephone utterances.

Under the assumption of a linear channel mismatch, the quantity of channel mismatch can be approximately characterized by a cepstral bias. Therefore, we usually apply a bias transformation which adapts the HMM parameters by adding a bias (Cox and Bridle, 1989). However, the bias transformation may be insufficient for modeling the variabilities of telephone environments. Thus, a more sophisticated adaptation using the affine transformation (also called the linear regression transformation) is developed. According to the affine transformation, the HMM parameters are linearly scaled and then shifted by a bias. In (Digalakis et al., 1995), a constrained transformation of HMM parameters was presented for speaker adaptation; their constrained transformation had the form of an affine transformation. In (Leggetter and Woodland, 1995), a maximum likelihood linear regression (MLLR) approach was proposed for adapting the continuous-density HMM (CDHMM) parameters. They employed the maximum likelihood (ML) theory for calculating the linear regression transformation which adapted the HMM mean vectors. Furthermore, an ML-based stochastic matching method for decreasing the acoustic mismatch between testing features and given HMM parameters was presented in (Sankar and Lee, 1996). In their studies, the affine transformation was considered as a feature transformation function for transforming the testing features to match the given HMM parameters.

As mentioned above, the parameters of transformation-based adaptation in previous works were estimated via the ML theory and have been successfully applied for model adaptation. However, if we can adequately incorporate the prior knowledge of the transformation parameters into the adaptation, the recognition performance may be further improved. Accordingly, we are motivated to apply the maximum a posteriori (MAP) theory for estimating the transformation parameters and use them for adapting the HMM parameters to a target telephone environment. Using the MAP theory, the transformation parameters are optimally estimated by maximizing the posterior density, which consists of a likelihood function and a prior density. In this study, a bias transformation y = x + b_1 with parameter θ_b = b_1 and an affine transformation y = Ax + b_2 with parameters θ_a = (A, b_2) are considered as the model transformation functions for transforming the original sampled data x into its adapted version y. Generally, the estimated scaling matrix A should not be an identity matrix, so that the extra unknown effects appearing in telephone utterances can be compensated. In the experiments, we compare the performance of ML adaptation and MAP adaptation, with the bias transformation and the affine transformation serving as the transformation functions. Results show that the best performance is achieved by applying the MAP adaptation using the affine transformation. When the amount of adaptation data is increased, the ML adaptation and MAP adaptation have comparable performance. Besides, we find that the proposed approach is applicable to both self adaptation and supervised adaptation. By performing self adaptation and supervised adaptation simultaneously, the performance can be further improved.

This paper is organized as follows. In Section 2, the adaptation approaches using bias and affine transformations are described. In Section 3, the formulas of ML and MAP estimation of the various transformation parameters are derived. In Section 4, we address the experimental setup and databases. In Section 5, the estimation of prior parameters and the histograms of transformation parameters are illustrated. The experiments on telephone speech recognition using three adaptation techniques are reported in Section 6. Finally, the conclusions are given in Section 7.

2. Transformation functions for model adaptation

When the testing data mismatches a given set of HMM parameters, two approaches are available for suppressing the mismatch. One is to remove the mismatch sources from the testing data, e.g., by an operation of feature bias removal (Chien et al., 1995b; Rahim and Juang, 1996). The other is to adapt the given HMM parameters to the testing environment. In (Chien et al., 1996; Sankar and Lee, 1996), experiments reported that the performance of model adaptation was better than that of feature bias removal. In this study, we apply the model adaptation approach for telephone speech recognition. Both the bias transformation and the affine transformation are investigated. To simplify the expressions, we derive the formulation by assuming a single Gaussian output distribution for each HMM state. This derivation can be easily extended to the case of HMM states with multiple mixture components.

2.1. A bias transformation

Let Y = {y_t} and S = {s_t} denote the observation sequence and the state sequence of the telephone data, respectively. Assuming that we are given a set of CDHMM parameters Λ_x trained from a training corpus X, the output probability density of state s_t generating an observation vector y_t of dimension n is expressed by the likelihood function

$$P(y_t \mid s_t) = (2\pi)^{-n/2}\,|\Sigma_{s_t}|^{-1/2}\exp\Big\{-\tfrac{1}{2}(y_t-\mu_{s_t})'\,\Sigma_{s_t}^{-1}(y_t-\mu_{s_t})\Big\},\tag{1}$$

where (μ_{s_t}, Σ_{s_t}) are the Gaussian parameters of state s_t. If there is an acoustic mismatch between the training corpus X and the telephone data Y, the HMM parameters Λ_x can be adapted by adding a bias vector. The likelihood function of Eq. (1) is then modified as

$$P(y_t \mid s_t, \theta_b = b_1) = (2\pi)^{-n/2}\,|\Sigma_{s_t}|^{-1/2}\exp\Big\{-\tfrac{1}{2}(y_t-\mu_{s_t}-b_1)'\,\Sigma_{s_t}^{-1}(y_t-\mu_{s_t}-b_1)\Big\}.\tag{2}$$

In this study, all speech features are represented by cepstra. By adding a cepstral bias b_1 to the HMM mean vectors, the linear channel mismatch between the telephone data Y and the HMM parameters Λ_x can be compensated.
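As a minimal numerical sketch of Eqs. (1) and (2), the following NumPy snippet evaluates the log-likelihood of a single frame under a diagonal-covariance state, first with the original mean and then with a bias-shifted mean. The function names and the toy values are illustrative assumptions, not part of the original system.

```python
import numpy as np

def log_gaussian_diag(y, mean, var):
    """Log of Eq. (1) for a diagonal-covariance Gaussian state."""
    n = y.shape[0]
    return -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(var))
                   + np.sum((y - mean) ** 2 / var))

def log_likelihood_bias(y, mean, var, b1):
    """Log of Eq. (2): the state mean is shifted by the cepstral bias b1."""
    return log_gaussian_diag(y, mean + b1, var)

# toy 12-dimensional cepstral frame
rng = np.random.default_rng(0)
y = rng.normal(size=12)
mean, var = np.zeros(12), np.ones(12)
b1 = 0.1 * np.ones(12)
print(log_gaussian_diag(y, mean, var), log_likelihood_bias(y, mean, var, b1))
```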


2.2. An affine transformation

Although the channel mismatch is a major problem in telephone speech recognition, extra effects such as ambient noise and speaker variabilities (including gender, accent, age, stress, Lombard effect, etc.) cannot be compensated by the bias transformation alone. However, it is difficult to model the mismatch sources of ambient noise and speaker variabilities by using a simple transformation function. Thus, we employ a more complicated transformation, i.e. an affine transformation, to achieve a better modeling of the environmental mismatch. When the affine transformation is employed, the HMM mean vector μ_{s_t} is adapted by multiplying a scaling matrix A and adding a bias vector b_2. Simultaneously, the HMM covariance matrix Σ_{s_t} is modified as A Σ_{s_t} A'. The resulting likelihood function is expressed by the following equation:

$$P(y_t \mid s_t, \theta_a = (A, b_2)) = (2\pi)^{-n/2}\,|A\Sigma_{s_t}A'|^{-1/2}\exp\Big\{-\tfrac{1}{2}(y_t-A\mu_{s_t}-b_2)'\,(A\Sigma_{s_t}A')^{-1}(y_t-A\mu_{s_t}-b_2)\Big\}.\tag{3}$$

3. Bayesian estimation technique

According to the MAP principle, the transformation parameters are estimated by maximizing the posterior density, which is composed of a likelihood function and a prior density. If the prior knowledge of the transformation parameters is deterministic or noninformative, i.e., the prior density equals a constant, the MAP estimation reduces to the ML estimation, which only maximizes the likelihood function. Theoretically, if the observed data is limited and the prior statistics are reliable, the estimation accuracy of MAP estimation may exceed that of ML estimation. In this study, we exploit the transformation-based adaptation approach using MAP estimation and apply it to telephone speech recognition.

Let the posterior density of the affine transformation parameter θ_a be denoted as P(θ_a | Y). The MAP estimate θ_{a,MAP} is obtained by maximizing the posterior density P(θ_a | Y) as follows:

$$\theta_{a,\mathrm{MAP}} = \arg\max_{\theta_a} P(\theta_a \mid Y).\tag{4}$$

However, it is difficult to directly derive the MAP estimate θ_{a,MAP} from Eq. (4). Instead of maximizing the posterior density of the observed (or incomplete) data P(θ_a | Y), the expectation-maximization (EM) algorithm (Dempster et al., 1977) provides a general solution which locally maximizes the posterior density of the complete data with the current estimate and iteratively updates the new MAP estimate in an optimal sense. On the other hand, the segmental MAP algorithm is also a good candidate for solving the incomplete data problem. In (Lee et al., 1991; Gauvain and Lee, 1994), the segmental MAP algorithm was proposed for estimating the Gaussian mixture components of the HMM parameters. In this study, we apply the segmental MAP algorithm for deriving the formulas of the MAP estimate θ_{a,MAP}. Using this algorithm, the MAP estimate θ_{a,MAP} is derived by maximizing the joint posterior density P(θ_a, S | Y) over the parameter θ_a and the state sequence S,

$$\theta_{a,\mathrm{MAP}} = \arg\max_{\theta_a}\max_{S} P(\theta_a, S \mid Y) = \arg\max_{\theta_a}\max_{S}\frac{P(Y, S \mid \theta_a)\,P(\theta_a)}{P(Y)} = \arg\max_{\theta_a}\max_{S} P(Y, S \mid \theta_a)\,P(\theta_a).\tag{5}$$


Then, the MAP estimate θ_{a,MAP} is obtained by starting from any estimate θ_a^{(l)} and alternately solving the following two equations:

$$S^{(l)} = \arg\max_{S} P(Y, S \mid \theta_a^{(l)}),\tag{6}$$

$$\theta_a^{(l+1)} = \arg\max_{\theta_a} P(Y, S^{(l)} \mid \theta_a)\,P(\theta_a),\tag{7}$$

where l is the iteration index. Similar to the segmental K-means algorithm (Juang and Rabiner, 1990), it can be proved that every combined iteration of Eqs. (6) and (7) guarantees that the joint posterior density P(θ_a, S | Y) does not decrease, i.e., P(θ_a^{(l)}, S^{(l)} | Y) ≤ P(θ_a^{(l+1)}, S^{(l+1)} | Y). In Eqs. (6) and (7), the most likely state sequence S^{(l)} is determined by using the Viterbi decoder (Viterbi, 1967). Given the most likely state sequence S^{(l)}, the new MAP estimate θ_a^{(l+1)} is then obtained by using Eq. (7). Thus, it is essential to solve Eq. (7) for obtaining the MAP estimates of the transformation parameters. However, because the posterior density consists of a likelihood function and a prior density, we are motivated to introduce a weighting factor α into Eq. (7) (Chien et al., 1996) for adjusting the contribution of these two terms in the MAP estimation. Accordingly, by taking the logarithm of the posterior density, the MAP estimation is modified as

$$\theta_{a,\mathrm{MAP}} = \arg\max_{\theta_a}\big\{\alpha\log P(Y, S \mid \theta_a) + (1-\alpha)\log P(\theta_a)\big\},\tag{8}$$

where 0 ≤ α ≤ 1. Here, the iteration index l is dropped for ease of expression. In Eq. (8), when α = 0.5, the likelihood function and the prior density are equally weighted in the MAP estimation. When α = 1, the MAP estimation is converted to the ML estimation, i.e.,

$$\theta_{a,\mathrm{ML}} = \arg\max_{\theta_a}\log P(Y, S \mid \theta_a).\tag{9}$$

When α = 0, the MAP estimation is totally dominated by the prior density. In the following paragraphs, we will derive the formulas of the ML and MAP estimates of θ_a.
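The alternation of Eqs. (6)–(8) can be summarized as a short skeleton, shown below under the assumption that a Viterbi aligner and the parameter-update formulas of Section 3 are available as callbacks; `viterbi_align` and `update_transformation` are placeholders, not routines defined in the paper.

```python
def segmental_map_adaptation(Y, hmm, theta, alpha, viterbi_align,
                             update_transformation, n_iter=2):
    """Alternate Eq. (6) (state alignment) and Eqs. (7)/(8) (parameter update).

    Y:      observation sequence of the adaptation (or testing) utterance
    hmm:    the given HMM parameter set Lambda_x
    theta:  initial transformation parameters (e.g. the prior means)
    alpha:  weighting factor of Eq. (8); alpha = 1 reduces to ML, Eq. (9)
    """
    for _ in range(n_iter):
        S = viterbi_align(Y, hmm, theta)                          # Eq. (6)
        theta = update_transformation(Y, S, hmm, theta, alpha)    # Eqs. (7)/(8)
    return theta
```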

3.1. ML estimates for transformation parameters

Maximum likelihood (ML) estimation is a popular approach for parameter estimation. Based on the ML principle, the transformation parameters can be effectively estimated (Digalakis et al., 1995; Leggetter and Woodland, 1995; Sankar and Lee, 1996). In this study, the ML estimate of θ_a is derived by solving Eq. (9). Because the likelihood function P(Y, S | θ_a) satisfies

$$P(Y, S \mid \theta_a) = P(Y \mid S, \theta_a)\,P(S \mid \theta_a) = P(Y \mid S, \theta_a)\,P(S),\tag{10}$$

where S and θ_a are assumed to be independent, we can extend Eq. (9) over the frame sequence as follows:

$$\theta_{a,\mathrm{ML}} = \arg\max_{\theta_a}\log P(Y \mid S, \theta_a) = \arg\max_{\theta_a}\sum_{t=1}^{T}\log P(y_t \mid s_t, \theta_a = (A, b_2)).\tag{11}$$

The observation vectors Y = {y_t} are constrained to be i.i.d. Also, Eq. (11) can be changed to

$$\theta_{a,\mathrm{ML}} = (A_{\mathrm{ML}}, b_{2,\mathrm{ML}}) = \arg\min_{(A,\, b_2)} Q_{\mathrm{ML}}(A, b_2),\tag{12}$$

where

$$Q_{\mathrm{ML}}(A, b_2) = \sum_{t=1}^{T}\Big[(y_t - A\mu_{s_t} - b_2)'\,(A\Sigma_{s_t}A')^{-1}(y_t - A\mu_{s_t} - b_2) + \log|A\Sigma_{s_t}A'|\Big].\tag{13}$$


By taking the gradient of Q_ML(A, b_2) with respect to A and b_2, respectively, and setting the results to zero, the ML estimate θ_{a,ML} can be derived. In this study, we simplify the scaling matrix A and the covariance matrix Σ_{s_t} to be diagonal, i.e., A = diag(a_i) and Σ_{s_t} = diag(σ²_{s_t,i}). Then, the transformation parameters θ_{a,ML} = {a_{ML,i}, b_{2,ML,i}, i = 1, ..., n} can be derived independently for each vector component. To simplify the expressions, we omit the vector component index i in the following formulation. The ML estimates a_ML and b_{2,ML} are given by

$$T a_{\mathrm{ML}}^{2} + \left(\sum_{t=1}^{T}\frac{y_t\,\mu_{s_t}}{\sigma_{s_t}^{2}} - \frac{\sum_{t=1}^{T}\frac{\mu_{s_t}}{\sigma_{s_t}^{2}}\sum_{t=1}^{T}\frac{y_t}{\sigma_{s_t}^{2}}}{\sum_{t=1}^{T}\frac{1}{\sigma_{s_t}^{2}}}\right) a_{\mathrm{ML}} + \frac{\left(\sum_{t=1}^{T}\frac{y_t}{\sigma_{s_t}^{2}}\right)^{2}}{\sum_{t=1}^{T}\frac{1}{\sigma_{s_t}^{2}}} - \sum_{t=1}^{T}\frac{y_t^{2}}{\sigma_{s_t}^{2}} = 0\tag{14}$$

and

$$b_{2,\mathrm{ML}} = \frac{\sum_{t=1}^{T}\dfrac{y_t - a_{\mathrm{ML}}\,\mu_{s_t}}{\sigma_{s_t}^{2}}}{\sum_{t=1}^{T}\dfrac{1}{\sigma_{s_t}^{2}}}.\tag{15}$$

As shown in Eqs. (14) and (15), the ML estimate a_ML is calculated by solving a quadratic equation. By substituting the ML estimate a_ML into Eq. (15), the corresponding ML estimate b_{2,ML} is generated. Similarly, the ML estimate of the bias transformation parameter θ_b = b_1 can be derived straightforwardly by letting A = I in Eqs. (12) and (13). The derived formula is (Neumeyer et al., 1994; Chien et al., 1995a)

$$b_{1,\mathrm{ML}} = \frac{\sum_{t=1}^{T}\dfrac{y_t - \mu_{s_t}}{\sigma_{s_t}^{2}}}{\sum_{t=1}^{T}\dfrac{1}{\sigma_{s_t}^{2}}}.\tag{16}$$
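A hedged per-component sketch of Eqs. (14)–(16), assuming NumPy and frame-level arrays y, mu, var that hold the observations and the Viterbi-aligned state means and variances for one cepstral component. The rule for selecting the root of the quadratic is not stated in the paper; taking the positive root is an assumption.

```python
import numpy as np

def ml_affine_component(y, mu, var):
    """ML estimates (a_ML, b2_ML) for one cepstral component, Eqs. (14)-(15)."""
    T = len(y)
    S1, Sy, Sm = np.sum(1 / var), np.sum(y / var), np.sum(mu / var)
    Sym, Syy = np.sum(y * mu / var), np.sum(y * y / var)
    # Eq. (14): T a^2 + (Sym - Sm*Sy/S1) a + (Sy^2/S1 - Syy) = 0
    roots = np.roots([T, Sym - Sm * Sy / S1, Sy ** 2 / S1 - Syy])
    roots = roots[np.isreal(roots)].real
    a = roots.max()                            # assumption: keep the positive root
    b2 = np.sum((y - a * mu) / var) / S1       # Eq. (15)
    return a, b2

def ml_bias_component(y, mu, var):
    """ML bias estimate b1_ML for one component, Eq. (16) (A = I)."""
    return np.sum((y - mu) / var) / np.sum(1 / var)
```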

3.2. MAP estimates for transformation parameters

The MAP estimation is a popular technique for parameter estimation. Due to the incorporation of prior knowledge, the MAP estimation may improve the estimation accuracy. In (Gauvain and Lee, 1992), the MAP estimation of CDHMM parameters was proposed. A similar framework based on the discrete HMM (DHMM) and the semi-continuous HMM (SCHMM) was presented in (Huo et al., 1995). In (Zavaliagkos et al., 1995a,b), the extended MAP (EMAP) estimation technique (Lasry and Stern, 1984) was described for transformation functions shared across some HMM parameters. In this paper, we derive the MAP estimates of the bias and affine transformation functions and apply them to telephone speech recognition.

One key issue of MAP estimation is the choice of the prior density family. In this study, we assume that the components of the scaling matrix A and the bias vector b_2 of the affine transformation parameters are separately modeled by Gaussian distributions, i.e., a ~ N(μ_a, σ_a²) and b_2 ~ N(μ_{b_2}, σ_{b_2}²), where μ_a = E{a}, σ_a² = E{(a − μ_a)²}, μ_{b_2} = E{b_2} and σ_{b_2}² = E{(b_2 − μ_{b_2})²}. However, Eq. (15) shows that the transformation parameters a and b_2 are dependent in essence. Hence, we should derive the MAP estimates θ_{a,MAP} = {a_{MAP,i}, b_{2,MAP,i}, i = 1, ..., n} by considering the joint density of a and b_2. A practical candidate for modeling the joint density of a and b_2 is the joint Gaussian density (Mendel, 1995) written as

$$P(\theta_a) = P(a, b_2) = (2\pi)^{-1}\,|\Sigma_{\theta_a}|^{-1/2}\exp\left\{-\tfrac{1}{2}\begin{bmatrix} a-\mu_a & b_2-\mu_{b_2}\end{bmatrix}\Sigma_{\theta_a}^{-1}\begin{bmatrix} a-\mu_a \\ b_2-\mu_{b_2}\end{bmatrix}\right\},\tag{17}$$

where

$$\Sigma_{\theta_a} = \begin{bmatrix}\sigma_a^{2} & m_{a,b_2}\\ m_{a,b_2} & \sigma_{b_2}^{2}\end{bmatrix},\tag{18}$$

and m_{a,b_2} = E{(a − μ_a)(b_2 − μ_{b_2})}. Here, the vector component index i is also omitted. To facilitate the formulation of MAP estimation, we first express the inverse matrix Σ_{θ_a}^{-1} in terms of the following decomposed components:

$$\Sigma_{\theta_a}^{-1} = \begin{bmatrix} u & w\\ w & v\end{bmatrix}.\tag{19}$$

Then, the MAP estimation of Eq. (8) is extended by

$$(a_{\mathrm{MAP}}, b_{2,\mathrm{MAP}}) = \arg\min_{(a,\, b_2)} Q_{\mathrm{MAP}}(a, b_2),\tag{20}$$

where

$$Q_{\mathrm{MAP}}(a, b_2) = \alpha\sum_{t=1}^{T}\left[\frac{(y_t - a\mu_{s_t} - b_2)^{2}}{a^{2}\sigma_{s_t}^{2}} + \log\big(a^{2}\sigma_{s_t}^{2}\big)\right] + (1-\alpha)\Big[u(a-\mu_a)^{2} + 2w(a-\mu_a)(b_2-\mu_{b_2}) + v(b_2-\mu_{b_2})^{2}\Big].\tag{21}$$

By differentiating Q_MAP(a, b_2) with respect to b_2 and setting it to zero, the MAP estimate of b_2 is derived as

$$b_{2,\mathrm{MAP}} = \frac{\alpha\sum_{t=1}^{T}\dfrac{y_t - a_{\mathrm{MAP}}\,\mu_{s_t}}{\sigma_{s_t}^{2}} + (1-\alpha)\big(v\mu_{b_2} - w(a_{\mathrm{MAP}}-\mu_a)\big)\,a_{\mathrm{MAP}}^{2}}{\alpha\sum_{t=1}^{T}\dfrac{1}{\sigma_{s_t}^{2}} + (1-\alpha)\,v\,a_{\mathrm{MAP}}^{2}}.\tag{22}$$

Unfortunately, it is difficult to derive a closed-form solution for the MAP estimate a_MAP by differentiating Q_MAP(a, b_2) with respect to a. Thus, we apply the steepest-descent algorithm (Widrow and Stearns, 1985) to search for the MAP estimate a_MAP. Based on the steepest-descent algorithm, the parameter a_MAP is iteratively updated by

$$a_{\mathrm{MAP}}^{(l+1)} = a_{\mathrm{MAP}}^{(l)} - \eta\left.\frac{dQ_{\mathrm{MAP}}(a, b_2)}{da}\right|_{a = a_{\mathrm{MAP}}^{(l)},\; b_2 = b_{2,\mathrm{MAP}}^{(l)}},\tag{23}$$

where η is a positive-valued step size and

$$\frac{dQ_{\mathrm{MAP}}(a, b_2)}{da} = -\sum_{t=1}^{T}\frac{2\alpha\,y_t^{2}}{a^{3}\sigma_{s_t}^{2}} - \sum_{t=1}^{T}\frac{2\alpha\,b_2^{2}}{a^{3}\sigma_{s_t}^{2}} + \sum_{t=1}^{T}\frac{2\alpha\,\mu_{s_t} y_t}{a^{2}\sigma_{s_t}^{2}} + \sum_{t=1}^{T}\frac{4\alpha\,y_t b_2}{a^{3}\sigma_{s_t}^{2}} - \sum_{t=1}^{T}\frac{2\alpha\,b_2\mu_{s_t}}{a^{2}\sigma_{s_t}^{2}} + \frac{2\alpha T}{a} + 2u(1-\alpha)(a-\mu_a) + 2w(1-\alpha)(b_2-\mu_{b_2}).\tag{24}$$


The iteration stops when the change of a_MAP in an updating step is less than a predetermined threshold. The convergence of a_MAP has been shown in Eqs. (5)–(7). Then, by applying Eqs. (22)–(24) with an initial guess of a_MAP, the MAP estimate θ_{a,MAP} can be jointly and iteratively calculated (Chien et al., 1997). In the estimation of the bias transformation parameter θ_b = b_1, we assume that each vector component of the transformation parameter b_1 is also characterized by a Gaussian density, i.e., b_1 ~ N(μ_{b_1}, σ_{b_1}²), where μ_{b_1} = E{b_1} and σ_{b_1}² = E{(b_1 − μ_{b_1})²}. The MAP estimate b_{1,MAP} can be derived and is given by (Chien et al., 1996)

$$b_{1,\mathrm{MAP}} = \frac{\alpha\sum_{t=1}^{T}\dfrac{y_t - \mu_{s_t}}{\sigma_{s_t}^{2}} + (1-\alpha)\dfrac{\mu_{b_1}}{\sigma_{b_1}^{2}}}{\alpha\sum_{t=1}^{T}\dfrac{1}{\sigma_{s_t}^{2}} + (1-\alpha)\dfrac{1}{\sigma_{b_1}^{2}}}.\tag{25}$$
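The MAP formulas above can be sketched per component as follows, again assuming NumPy. The prior hyperparameters follow Eqs. (17)–(19); the step size, initial guess and stopping tolerance of the steepest-descent search are implementation choices not specified in the paper.

```python
import numpy as np

def map_affine_component(y, mu, var, prior, alpha, a0=1.0, eta=1e-4,
                         tol=1e-6, max_iter=200):
    """MAP estimates (a_MAP, b2_MAP) for one component via Eqs. (22)-(24)."""
    mu_a, mu_b2, var_a, var_b2, cov_ab2 = prior
    inv_prior = np.linalg.inv(np.array([[var_a, cov_ab2], [cov_ab2, var_b2]]))
    u, w, v = inv_prior[0, 0], inv_prior[0, 1], inv_prior[1, 1]     # Eq. (19)
    T = len(y)
    S1, Sy, Sm = np.sum(1 / var), np.sum(y / var), np.sum(mu / var)
    Sym, Syy = np.sum(y * mu / var), np.sum(y * y / var)
    a = a0
    for _ in range(max_iter):
        # Eq. (22): closed-form b2 given the current a
        b2 = (alpha * (Sy - a * Sm)
              + (1 - alpha) * (v * mu_b2 - w * (a - mu_a)) * a ** 2) \
             / (alpha * S1 + (1 - alpha) * v * a ** 2)
        # Eq. (24): gradient of Q_MAP with respect to a
        grad = (2 * alpha * (-Syy / a ** 3 - b2 ** 2 * S1 / a ** 3
                             + Sym / a ** 2 + 2 * b2 * Sy / a ** 3
                             - b2 * Sm / a ** 2 + T / a)
                + 2 * (1 - alpha) * (u * (a - mu_a) + w * (b2 - mu_b2)))
        a_new = a - eta * grad                                      # Eq. (23)
        if abs(a_new - a) < tol:
            a = a_new
            break
        a = a_new
    return a, b2

def map_bias_component(y, mu, var, mu_b1, var_b1, alpha):
    """MAP bias estimate b1_MAP for one component, Eq. (25)."""
    num = alpha * np.sum((y - mu) / var) + (1 - alpha) * mu_b1 / var_b1
    den = alpha * np.sum(1 / var) + (1 - alpha) / var_b1
    return num / den
```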

From the derived formulas, we make two observations. First, comparing Eqs. (15) and (22), we find that the MAP estimate b_{2,MAP} is asymptotically identical to the ML estimate b_{2,ML} when the number of observation vectors T is infinite. This is reasonable because the prior statistics become improper or noninformative as T → ∞, and the MAP estimation converges to the ML estimation. Second, comparing Eqs. (22) and (25), we observe that the derived MAP estimate b_{1,MAP} of the bias transformation has a simpler form than the MAP estimate b_{2,MAP} of the affine transformation. The MAP estimate of the bias transformation b_{1,MAP} contains two factors. One is the cepstral average of the difference between the observations and their corresponding mean values. The other is the prior knowledge of the transformation parameter. They are weighted by variances and linearly interpolated by the weighting factor α. If the prior statistics are properly chosen, we can expect that the estimation inaccuracy caused by an unreliable state sequence can be adequately compensated. In Section 5, we describe the estimation procedure for the prior statistics. Before that description, we introduce our experimental setup and databases.

4. Experimental setup and databases

Our task is to recognize Mandarin speech (Chien, 1997). Mandarin is a syllabic and tonal language. Each Chinese character corresponds to a Mandarin syllable. Without considering the tonal information, the total number of Mandarin syllables is 408. In general, each Mandarin syllable is composed of an initial part and a final part. Some Mandarin syllables only have a final part; for these syllables, a null initial needs to be modeled. The initial part corresponds to a consonant and the final part corresponds to a vowel, some with a nasal ending. In total, there are 59 subsyllables (21 initials and 38 finals). Due to the coarticulation between initials and finals, we employed 93 context-dependent (CD) initials and 38 context-independent (CI) finals in our experiments. Additionally, 33 null initials were generated. The tonal information was not considered in this study.

In the experiments, the speech signal was sampled at 8 kHz. The feature vector consisted of 26 components: 12 LPC-derived cepstral coefficients, 12 delta cepstral coefficients, 1 delta log energy and 1 delta-delta log energy. The delta features were calculated according to (Lee et al., 1992). Three databases provided by the Telecommunication Laboratories, Chunghwa Telecom, Taiwan, ROC, are described below. We use these databases in the following experiments, and the purpose of each database is also given.

DB1: This is a Chinese-name database collected over the public telephone network. This database was used for testing. In total, 1000 testing utterances spoken by 37 males and 36 females were recorded via ten telephone handsets. There were 250 staff names of the Telecommunication Laboratories in this database, and the recognition task is to recognize these 250 Chinese names. Each Chinese name contained three Mandarin syllables. Besides, to evaluate the supervised adaptation task, we additionally sampled three utterances from each testing speaker, giving 219 adaptation utterances. The content of each adaptation utterance was also a Chinese name.


DB2: This database consisted of 5045 phonetically balanced Mandarin words spoken by 51 males and 50 females. Each Mandarin word contained two to four Mandarin syllables. All 408 Mandarin syllables were covered in this database. The speech data was recorded in an office room via a high-quality microphone. We used this database to train the HMM parameters; the trained HMM parameters are referred to as the clean speech models. When the telephone speech is recognized by using the clean speech models, a seriously mismatched condition occurs. Our proposed approach is applied for adapting the clean speech models to match the telephone environment.

DB3: This database was collected over the public telephone network. It comprised 4426 phonetically balanced Mandarin words spoken by 56 males and 44 females. All 408 Mandarin syllables were included. This database has two purposes. One is to provide a set of utterances for estimating the prior statistics. The other is to train the telephone-quality speech models so that the recognition result of the matched condition can be obtained for comparison.

In this study, the CD initials, CI finals and null initials were characterized by three, four and two HMM states, respectively. Thus, parameters of 498 HMM states (279 for CD initials, 152 for CI finals, 66 for null initials and 1 for background silence), covering all acoustical characteristics of the 408 Mandarin syllables, were trained from the training database. Each state had four mixture components. The recognition system was constructed within the CDHMM framework. The HMM parameters were estimated by using the segmental K-means algorithm (Juang and Rabiner, 1990). Because the experiments used Gaussian mixture CDHMMs, both the state sequence and the mixture component sequence needed to be decoded via the Viterbi decoder. In our proposed adaptation method, we compute a shared transformation function by using the speech frames only (i.e., silence frames are excluded). Then, the transformation function is used for adapting all HMM states except the background silence state.

5. Prior statistics

In theory, the prior statistics can be empirically estimated from a large amount of speech data which covers all the variabilities of model adaptation, and the resulting MAP estimation will then be more reliable. However, it is not easy to collect enough training data in real applications. Thus, we sampled 80 telephone utterances from DB3 to estimate the prior statistics. To extensively cover the acoustical properties of speakers, each sampled utterance was spoken by a different speaker. These 80 speakers were excluded from those of the testing database DB1. For each utterance, we calculate the ML estimates of the transformation parameters (θ_{b,ML}, θ_{a,ML}) using Eqs. (6), (7), (14)–(16). Then, the hyperparameters of the prior density Θ = (μ_{b_1}, σ_{b_1}², μ_a, σ_a², μ_{b_2}, σ_{b_2}², m_{a,b_2}) are determined by taking the mean, variance and covariance over these 80 sets of ML estimates (b_{1,ML}, a_ML, b_{2,ML}). To increase the estimation accuracy, the ML estimation of the transformation parameters is supervised. The ML estimation procedure is as follows (Chien et al., 1995a). Given any transformation parameters (θ_b^{(l)}, θ_a^{(l)}), telephone observations Y, and HMM parameters Λ_x, the corresponding state sequences (S_b^{(l)}, S_a^{(l)}) are first determined by passing Y through the Viterbi decoder. Then, using Eqs. (14)–(16), the new transformation parameters (θ_b^{(l+1)}, θ_a^{(l+1)}) are calculated. By applying the new estimates (θ_b^{(l+1)}, θ_a^{(l+1)}), the new state sequences (S_b^{(l+1)}, S_a^{(l+1)}) are updated. Based on this procedure, the ML estimates (θ_{b,ML}, θ_{a,ML}) are iteratively obtained. In what follows, we illustrate the histograms of the ML estimates (θ_{b,ML}, θ_{a,ML}).

In Fig. 1, the hyperparameters of the scaling factors of the affine transformation {μ_{a,i}, i = 1, ..., 26} are plotted. These values are calculated by averaging the 80 ML estimates of the scaling factors {a_{ML,i}, i = 1, ..., 26}. In this figure, we find that the values of the different vector components are near one but not equal to one. This shows that additional effects such as ambient noise, speaker variation or even some unknown effects may be partially compensated by applying the affine transformation. Besides, the histograms of the 80 ML estimates of the transformation parameters are illustrated in Fig. 2. There are three histograms: the first scaling factor of the affine transformation a_1 (Fig. 2(a)), the first bias factor of the affine transformation b_{2,1} (Fig. 2(b)), and the first bias factor of the bias transformation b_{1,1} (Fig. 2(c)).

Fig. 1. Hyperparameters of the scaling factors of the affine transformation {μ_{a,i}, i = 1, ..., 26}.

Fig. 2. Histograms of transformation parameters: (a) the first scaling factor of the affine transformation a_1, (b) the first bias factor of the affine transformation b_{2,1}, and (c) the first bias factor of the bias transformation b_{1,1}.

6. Experimental results

A multispeaker speech recognition task for 250 Chinese names was conducted to demonstrate the performance of the proposed method. In the experiments, the clean speech models serve as the model parameters. The speech recognizer without model adaptation is referred to as the baseline system. The results of the cepstral mean normalization (CMN) method (Atal, 1974; Furui, 1981) and of the matched condition are also included for comparison. As shown in Table 1, the recognition rate improves from 37.3% for the baseline system to 79.4% when using CMN, while the performance under the matched condition is 88.3%. This reveals that the CMN method is an efficient scheme for overcoming the environmental variation of the telephone system. However, the performance of the CMN method is limited because acoustic information of the telephone utterances is also cancelled by the operation of cepstral normalization. Using the proposed Bayesian adaptation method, the recognition rates can be significantly improved over the CMN method. To evaluate the effectiveness of the proposed method, we perform three adaptation techniques: self adaptation, supervised adaptation and two-stage adaptation. The effect of different amounts of adaptation data is also investigated.

Table 1
Recognition rates (%) of the baseline system, the cepstral mean normalization (CMN) method and the matched condition

  Baseline            37.3
  CMN                 79.4
  Matched condition   88.3

6.1. Evaluation of self adaptation

Self adaptation is an unsupervised adaptation technique. When the Bayesian adaptation of HMM parameters is employed for self adaptation, the transformation parameters are estimated from the unknown testing utterance itself. Fig. 3 shows the block diagram of the telephone speech recognition system based on Bayesian adaptation of HMM parameters using the self adaptation technique. Two passes of Viterbi decoding are included. In the first-pass Viterbi decoding, the telephone utterance is matched against the 250 Chinese-name patterns, and the corresponding state and mixture component sequences of the closest pattern are extracted. Using these sequences, the MAP estimates of the transformation parameters are calculated and the adaptation of the HMM parameters is then performed. Based on this procedure, the HMM parameters are repeatedly adapted until the state and mixture component sequences converge. Given the adapted HMM parameters, the telephone speech is recognized through the second pass of Viterbi decoding. In this technique, the prior statistics serve as the initial transformation parameters. Convergence is achieved rapidly, within two iterations; hence, the number of iterations is fixed to two for both self adaptation and supervised adaptation in the following experiments.

Fig. 3. A block diagram of the telephone speech recognition system based on Bayesian adaptation of HMM parameters using the self adaptation technique.

The recognition results of self adaptation are given in the first column of Table 2. In this set of experiments, the ML adaptation (i.e., α = 1) and the MAP adaptation with α = 0.3, 0.5 and 0.8, using the bias and affine transformations, are respectively carried out. We find that the affine transformation is superior to the bias transformation for the various cases of adaptation and weighting factors. The performance of MAP adaptation is also better than that of ML adaptation. The best results of the bias and affine transformations are 81.3% and 83%, respectively. These results outperform that of the CMN method.

Table 2
Comparison of recognition rates (%) of self adaptation, one-utterance supervised adaptation and two-stage adaptation using ML and MAP adaptation for various weighting factors: (a) bias transformation, (b) affine transformation

                              Self adaptation   One-utterance supervised adaptation   Two-stage adaptation
  (a) Bias (ML)                     80.4                      82.2                            83
  (a) Bias (MAP, α = 0.8)           80.7                      83                              83.3
  (a) Bias (MAP, α = 0.5)           81.1                      83.7                            84
  (a) Bias (MAP, α = 0.3)           81.3                      83.8                            84.5
  (b) Affine (ML)                   81.8                      85.2                            85.9
  (b) Affine (MAP, α = 0.8)         82.4                      86                              86.6
  (b) Affine (MAP, α = 0.5)         82.6                      86.1                            86.8
  (b) Affine (MAP, α = 0.3)         83                        86.5                            87

6.2. Evaluation of supervised adaptation

On the other hand, the Bayesian adaptation of HMM parameters is also applicable to supervised adaptation. Using this technique, each testing speaker utters a small set of adaptation utterances before speech recognition begins. In this set of experiments, we collect only one adaptation utterance from each testing speaker. When the supervised adaptation is performed, the transformation parameters are calculated from this speaker-specific utterance and utilized for adapting the existing HMM parameters to the new speaker and telephone environment. In supervised adaptation, the true transcriptions of the adaptation utterances are known in advance. Given the adapted HMM parameters, the telephone speech is recognized without adaptation in the testing phase. We refer to this technique as one-utterance supervised adaptation. To demonstrate the convergence speed of supervised adaptation, we plot in Fig. 4 the average log-likelihood versus the iteration number of Bayesian adaptation using the bias and affine transformations. It shows that the proposed Bayesian adaptation converges within two iterations. Besides, we also observe that the average log-likelihood of the affine transformation is larger than that of the bias transformation, which means that the model adaptation using the affine transformation fits the data better than that using the bias transformation. The recognition results of one-utterance supervised adaptation are listed in the second column of Table 2. We can see that the affine transformation outperforms the bias transformation, and the MAP adaptation is again superior to the ML adaptation for the various weighting factors. For example, in the case of the bias transformation, the recognition rate increases from 82.2% to 83.8% by using MAP adaptation instead of ML adaptation; in the case of the affine transformation, it improves from 85.2% to 86.5%. Besides, we also find that the one-utterance supervised adaptation performs better than the self adaptation. The best performance of these two adaptation techniques is achieved when α = 0.3.

Fig. 4. Convergence of Bayesian adaptation using bias and affine transformations.


6.3. Evaluation of two-stage adaptation

In practice, the self adaptation and the one-utterance supervised adaptation can be combined sequentially to further improve the recognition rates. We refer to this technique as two-stage adaptation; its block diagram is plotted in Fig. 5. The procedure of two-stage adaptation is as follows. In the first stage, each testing speaker utters one adaptation utterance for adapting the existing HMM parameters to the new speaker and telephone channel via the one-utterance supervised adaptation technique. Then, in the second stage, the telephone utterances of this testing speaker are recognized by additionally performing the self adaptation technique on the adapted HMM parameters. Note that the Bayesian adaptation approach is applied in both stages. The hyperparameters of the second-stage adaptation are refreshed by executing the steps described in Section 5 once again; notably, the seed models should be replaced by the adapted HMM parameters obtained in the first-stage adaptation. By combining these two adaptation techniques, a better recognition performance can be obtained. The recognition results of two-stage adaptation are listed in the third column of Table 2. We find that the recognition rates of two-stage adaptation are consistently better than those of self adaptation and one-utterance supervised adaptation. The best results increase to 84.5% and 87% using the bias transformation and the affine transformation, respectively.

Fig. 5. A block diagram of the two-stage adaptation technique.

6.4. Evaluation of different amounts of adaptation data

In the last set of experiments, we evaluate the proposed adaptation approach for different amounts of adaptation data. In this task, the supervised adaptation technique using the affine transformation is employed, and the hyperparameters of the prior density are kept identical for the different amounts of adaptation data. As shown in Table 3, we compare the recognition rates of ML and MAP adaptation for one-utterance supervised adaptation and three-utterance supervised adaptation. Here, the weighting factor α of the MAP adaptation is fixed at 0.3. From these results, we find that the performance is improved when the number of adaptation utterances is increased from one to three. Besides, the result of MAP adaptation approaches that of ML adaptation in the case of three adaptation utterances. The reason is that the prior knowledge becomes improper, i.e. less influential, as the amount of adaptation data is increased. In our implementation, three adaptation utterances are sufficient for estimating a shared transformation function which adapts all the HMM parameters. Hence, MAP adaptation is preferable to ML adaptation when only a limited amount of adaptation data is available. Further, we also find the results of three-utterance supervised adaptation to be comparable to that of the matched condition.

Table 3
Comparison of recognition rates (%) of one-utterance supervised adaptation and three-utterance supervised adaptation based on ML and MAP adaptation. The affine transformation is applied

                              One-utterance supervised adaptation   Three-utterance supervised adaptation
  Affine (ML)                              85.2                                   88.2
  Affine (MAP, α = 0.3)                    86.5                                   88.1

7. Conclusion

We have proposed a transformation-based adaptation technique within the framework of MAP adaptation and successfully applied it to telephone speech recognition. In this study, the bias transformation and the affine transformation serve as the transformation functions, and the parameters of the transformation functions are derived such that the posterior likelihood is maximized. To demonstrate the performance of our method, we conducted a series of comparative experiments. The results of ML and MAP adaptation using the bias and affine transformations were respectively obtained, and we verified the proposed method by examining the cases of self adaptation, supervised adaptation and two-stage adaptation. All the experimental results show that the proposed method is superior to the CMN method. Other conclusions are as follows:
1. The proposed approach converges rapidly.
2. The model adaptation using the affine transformation performs significantly better than that using the bias transformation.
3. The MAP adaptation outperforms the ML adaptation. As the adaptation data increases, the ML and MAP adaptation have comparable performance.
4. The proposed approach can be employed in self adaptation as well as supervised adaptation. By combining both adaptation techniques, the performance can be further improved.
5. The recognition rate is improved when the amount of adaptation data is increased.

Acknowledgements

The authors acknowledge the substantial contribution of Dr. Chin-Hui Lee, Head of the Dialogue Systems Research Department at Bell Laboratories, Murray Hill, USA. They also thank the anonymous reviewers for their critical comments. This work was partially supported by the Telecommunication Laboratories, Chunghwa Telecom, Taiwan, ROC, under contract TL-85-5203.

References

Acero, A., Stern, R.M., 1990. Environmental robustness in automatic speech recognition. In: IEEE Proc. Internat. Conf. Acoust. Speech Signal Process., Vol. 1, pp. 849–852.
Atal, B., 1974. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Amer. 55, 1304–1312.
Chien, J.T., 1997. Speech recognition under telephone environments. Ph.D. Thesis, Department of Electrical Engineering, National Tsing Hua University, Taiwan.
Chien, J.T., Lee, L.M., Wang, H.C., 1995a. Channel estimation for reference model adaptation in telephone speech recognition. In: Proc. 4th European Conf. Speech Communication and Technology, Vol. 2, pp. 1541–1544.
Chien, J.T., Lee, L.M., Wang, H.C., 1995b. A channel-effect-cancellation method for speech recognition over telephone system. IEE Proc. Vision Image Signal Process. 142, 395–399.
Chien, J.T., Lee, L.M., Wang, H.C., 1996. Estimation of channel bias for telephone speech recognition. In: Proc. Internat. Conf. Spoken Language Processing, Vol. 3, pp. 1840–1843.
Chien, J.T., Wang, H.C., Lee, C.H., 1997. Bayesian affine transformation of HMM parameters for instantaneous and supervised adaptation in telephone speech recognition. In: Proc. 5th European Conf. Speech Communication and Technology (to appear).
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1–38.
Cox, S.J., Bridle, J.S., 1989. Unsupervised speaker adaptation by probabilistic spectrum fitting. In: IEEE Proc. Internat. Conf. Acoust. Speech Signal Process., pp. 294–297.
Digalakis, V.V., Rtischev, D., Neumeyer, L.G., 1995. Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech Audio Process. 3, 357–366.
Furui, S., 1981. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. 29, 254–272.
Gauvain, J.-L., Lee, C.-H., 1992. Bayesian learning for hidden Markov model with Gaussian mixture state observation densities. Speech Communication 11, 205–213.
Gauvain, J.-L., Lee, C.-H., 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2, 291–298.
Gong, Y., 1995. Speech recognition in noisy environments: A survey. Speech Communication 16, 261–291.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process. 2, 578–589.
Huo, Q., Chan, C., Lee, C.H., 1995. Bayesian adaptive learning of the parameters of hidden Markov model for speech recognition. IEEE Trans. Speech Audio Process. 3, 334–344.
Juang, B.H., 1991. Speech recognition in adverse environments. Computer Speech and Language 5, 275–294.
Juang, B.H., Rabiner, L.R., 1990. The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Trans. Acoust. Speech Signal Process. 38, 1639–1641.
Lasry, M.J., Stern, R.M., 1984. A posteriori estimation of correlated jointly Gaussian mean vectors. IEEE Trans. Pattern Anal. Machine Intell. 6, 530–535.
Lee, C.H., 1997. On feature and model compensation approach to robust speech recognition. In: Proc. ESCA–NATO Workshop Robust Speech Recognition for Unknown Communication Channels, pp. 45–54.
Lee, C.H., Lin, C.H., Juang, B.H., 1991. A study on speaker adaptation of the parameters of continuous density hidden Markov models. IEEE Trans. Signal Process. 39, 806–814.
Lee, C.H., Giachin, E., Rabiner, L.R., Pieraccini, R., Rosenberg, A.E., 1992. Improved acoustic modeling for large vocabulary continuous speech recognition. Computer Speech and Language 6, 103–127.
Leggetter, C.J., Woodland, P.C., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 9, 171–185.
Mendel, J.M., 1995. Lessons in Estimation Theory for Signal Processing, Communications, and Control. Prentice-Hall, Englewood Cliffs, NJ, pp. 165–166.
Neumeyer, L.G., Digalakis, V.V., Weintraub, M., 1994. Training issues and channel equalization techniques for the construction of telephone acoustic models using a high-quality speech corpus. IEEE Trans. Speech Audio Process. 2, 590–597.
Rahim, M.G., Juang, B.H., 1996. Signal bias removal by maximum likelihood estimation for robust telephone speech recognition. IEEE Trans. Speech Audio Process. 4, 19–30.
Sankar, A., Lee, C.H., 1996. A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Trans. Speech Audio Process. 4, 190–202.
Takagi, K., Hattori, H., Watanabe, T., 1995. Rapid environment adaptation for robust speech recognition. In: IEEE Proc. Internat. Conf. Acoust. Speech Signal Process., pp. 149–152.
Takahashi, J., Sagayama, S., 1994. Telephone line characteristics adaptation using vector field smoothing technique. In: Proc. Internat. Conf. Spoken Language Processing, pp. 991–994.
Viterbi, A.J., 1967. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inform. Theory 13, 260–269.
Widrow, B., Stearns, S.D., 1985. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, pp. 56–60.
Zavaliagkos, G., Schwartz, R., McDonough, J., Makhoul, J., 1995a. Adaptation algorithms for large scale HMM recognizers. In: Proc. 4th European Conf. Speech Communication and Technology, Vol. 2, pp. 1131–1133.
Zavaliagkos, G., Schwartz, R., Makhoul, J., 1995b. Batch, incremental and instantaneous adaptation techniques for speech recognition. In: IEEE Proc. Internat. Conf. Acoust. Speech Signal Process., pp. 676–679.