
Robust numeric recognition in spoken language dialogue

Mazin Rahim *, Giuseppe Riccardi, Lawrence Saul, Jerry Wright, Bruce Buntschuh, Allen Gorin

AT&T Labs – Research, Room E105, 180 Park Avenue, Florham Park, NJ 07932, USA

Speech Communication 34 (2001) 195–212. www.elsevier.nl/locate/specom

* Corresponding author. E-mail address: [email protected] (M. Rahim). Web: http://www.research.att.com/info/mazin

0167-6393/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0167-6393(00)00054-6

Abstract

This paper addresses the problem of automatic numeric recognition and understanding in spoken language dialogue. We show that accurate numeric understanding in fluent unconstrained speech demands maintaining robustness at several different levels of system design, including acoustic, language, understanding and dialogue. We describe a robust system for numeric recognition and present algorithms for feature extraction, acoustic and language modeling, discriminative training, utterance verification, and numeric understanding and validation. Experimental results from a field trial of a spoken dialogue system are presented that include customers' responses to credit card and telephone number requests. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Robustness; Spoken dialogue system; Speech recognition; Utterance verification; Discriminative training; Understanding; Language modeling; Numeric recognition; Digits

1. Introduction

Interactive spoken dialogue systems can play a significant role in automating a variety of services including customer care and information access (Lamel et al., 1999; Rahim et al., 1999; Riccardi and Gorin, 1999; Ramaswamy et al., 1999; Os et al., 1999). The success of these systems depends largely on their ability to accurately recognize and understand fluent unconstrained speech. The premise is that users would be able to engage with these systems in a natural open dialogue with minimal human assistance.

In this paper, we describe a class of applications for spoken dialogue systems that involve the use of a numeric language. This language includes a set of phrases that form the basis for recognizing and understanding credit card and telephone numbers, zip codes, social security numbers, etc. This language forms an integral part of a field-trial study of AT&T customers responding to the open-ended prompt "How May I Help You?". It consists of several distinct phrase classes such as digits, natural numbers, alphabets, restarts, city/country names and other miscellaneous phrases. The numeric language can be considered as a set of "salient" phrases that are relevant to the task of number recognition and which may be identified by exploiting the mapping from unconstrained language into machine action (Gorin, 1995; Wright et al., 1997, 1998).

An essential ingredient for facilitating natural interaction in spoken dialogue systems is maintaining system robustness. Although it is commonly considered as a mismatch problem between the acoustic model and the test data (Sankar and Lee, 1996; Rahim and Juang, 1996), robustness in spoken dialogue systems extends to several other dimensions:


· Robust acoustic modeling. Variations in the acoustic characteristics of the signal due to extraneous conditions (such as background noise, reverberation, channel distortion, etc.) should cause little or no degradation in the performance of the recognizer.

· Robust language modeling. Users should be able to express themselves naturally and freely, without being constrained by a highly structured language model.

· Robust language understanding. The presence of disfluencies (such as "ah", "mm", etc.) and recognition errors should have little or no impact on the behavior of the system.

· Robust dialogue strategy. The dialogue manager should guide both novice and skilled users through the application seamlessly and intelligently.

Maintaining robustness at these various levels is the key to the success of spoken dialogue systems in the telecommunication environment.

In this paper, we consider the problem of numeric language recognition in spoken dialogue systems as a large-vocabulary continuous speech recognition task (Rahim, 1999a). We address the technical challenges in developing a robust, accurate and real-time recognition system. We present algorithms for improved robustness at the acoustic, language and understanding levels of the system. We demonstrate that our system for numeric recognition provides robustness to a wide variety of inputs and environmental conditions.

The organization of this paper is as follows. Section 2 describes the challenges involved when dealing with fluent unconstrained speech, and provides evidence that demonstrates the need for utilizing a numeric language as opposed to merely digits. Section 3 provides a characterization of our experimental database for numeric recognition. In Section 4, we describe the various modules of the proposed system, including feature extraction, acoustic modeling and training, language modeling, utterance verification, and numeric understanding and validation. The performance of these various modules is presented in the experimental results of Section 5. Finally, a summary and a discussion are provided in Section 6.

2. From digits to numeric recognition

Connected digits play a vital role in many applications of speech recognition over the telephone. Digits are the basis for credit card and account number validation, phone dialing, menu navigation, etc.

Progress in connected digits recognition has been remarkable over the past decade (Cardin et al., 1993; Buhrke et al., 1994). Fig. 1 summarizes the performance of state-of-the-art digit recognizers on a variety of databases that range from high-quality read speech to conversational spontaneous speech (Chou et al., 1995; Buhrke et al., 1994). These results reflect the digit error rate when using a free digit grammar and with no rejection.¹ Under carefully monitored laboratory conditions, such as the Texas Instruments (TI) database, recognizers can generally achieve less than a 0.3% digit error rate (Cardin et al., 1993) (see Fig. 1, "READ"). Dealing with telephone speech adds new difficulties to this problem. Variations in the spectral characteristics due to different channel conditions, speaker populations, background noise and transducer equipment can cause a significant degradation in recognition performance. Nevertheless, through advances in acoustic modeling, discriminative training and robustness, many recognition systems today can operate at a 1–2% digit error rate (Chou et al., 1995; Rahim et al., 1996). This is illustrated in Fig. 1 ("FLUENT") for a variety of in-house databases that have been collected over the telephone, including Teletravel (TELE), Voice Card (VC), Universal Card Service (UCS), Mall'88 (ML88), Mall'91 (ML91) and Voice Recognition Call Processing (VRCP) (Buhrke et al., 1994; Chou et al., 1995).

Using a spoken dialogue system imposes a new set of challenges to the problem of recognizing digits embedded in natural spoken input. As opposed to using a system-initiative strategy, as is the case for the databases reported under "READ" and "FLUENT", a mixed-initiative dialogue implies fewer language constraints and more flexibility for users to speak openly (Chu-Carroll, 1999). Although this is clearly advantageous for building systems that facilitate natural human–machine communication, it can have a negative impact on the performance of the recognizer. This is precisely the scenario that we are examining in this study, in which during the course of the dialogue users are asked a variety of questions, among which "What number would you like to call?", "May I have your card number please?", "What was that number again?", etc. The difficulty here is not only to deal with fluent unconstrained speech, but to be able to design systems that can accurately recognize an entire string of numbers that may be encoded by digits, natural numbers and/or alphabets. In addition, these systems must be robust towards out-of-vocabulary words, hesitations, false starts and various other acoustic and language variabilities. Employing systems that accurately recognize merely digits in a spoken dialogue environment can lead to poor performance, as illustrated in Fig. 1 ("CONVERSATIONAL") for the two "How May I Help You?" (HMIHY-I, HMIHY-II) data collections.²

Fig. 1. Digit error rates for a variety of in-house data collections. [Bar chart; y-axis: digit error rate, 0–16%; tasks: TI ("READ"); TELE, VC, UCS, ML88, ML91, VRCP ("FLUENT"); HMIHY-I, HMIHY-II ("CONVERSATIONAL").]

¹ The digit error rate includes the insertion, deletion and substitution error rates normalized by the total digit count.

² HMIHY-I and HMIHY-II are customer care recordings that were collected in two separate sessions.

3. Experimental study using a spoken dialogue system

We are interested in the problem of understanding fluent speech within constrained task domains. In particular, we have conducted a number of field trial studies of AT&T customers responding to the open-ended prompt "How May I Help You?", with the aim of providing an automated customer service. The goal of this service is to recognize and understand customers' requests, whether they relate to billing, credit, call automation, etc. (Gorin et al., 1997; Riccardi and Gorin, 1999). Dialogue in this application is clearly necessary, since in many situations involving ambiguous requests or poor system robustness, customer service cannot be performed merely from a single input.

3.1. Database for numeric recognition

This paper will focus on specific parts of the dialogue where customers are prompted to say a credit card or a telephone number to obtain call automation or billing credit. Various types of prompts have been studied with the objective of stimulating maximally consistent and informative responses from large populations of users (Boyce and Gorin, 1996). These prompts are engineered towards asking users to say or repeat their number string without imposing a rigid speaking format. Our experimental database included over 30,000 transactions, with 2178 utterances representing responses to card and phone number queries, ranging from 1 to 45 words in length (referred to as the numeric database) (Riccardi and Gorin, 1999).

To calibrate the difficulty of this task, we subdivided the database based on two sets of results. The first, which is displayed in Fig. 2 in the form of pie charts, is a partitioning of the data according to three categories: (a) numeric phrases only (such as digits, natural numbers and alphabets), (b) embedded numerics, which include those numeric phrases that have been spoken among other vocabulary words, and (c) no numerics, which include utterances not containing any numeric phrases. The pie charts indicate that the distribution of users' responses based on spoken numeric expressions is different for the card and phone number prompts. Furthermore, a large proportion of users do not respond with numeric phrases alone. In the case of the phone number prompt, 43% of the utterances contained embedded numerics and 9% included no numerics.

Fig. 2. Pie charts of users' responses to (A) card number, and (B) phone number prompts based on spoken numeric expressions.

An alternative method for calibrating the difficulty of this task is illustrated in Fig. 3, which shows the classification of utterances as a function of their vocabulary contents and call characteristics (i.e., quality). Ten different classes are presented. They include digits only ("Digits", such as "one", "two", ..., "nine", "oh", "zero"); embedded digits ("eDigits", digits spoken among other vocabulary words); natural numbers ("nNumbers", such as "hundred", "eleven", etc.); alphabets ("Alphas", such as "A", "H", etc.); restarts ("Restart", false starts, hesitations and corrections); accent ("Accent", distinct regional dialects and strong foreign accents that contribute to mispronunciation of some words); fast speech ("Fast", over 1.5 times faster than the average speech rate); noise ("Noise", background speech, music and noise with signal-to-noise ratio below 15 dB); out-of-vocabulary ("Garbage", extraneous and uninformative speech); and cuts ("Cuts", utterances with words that are partially spoken).

The statistics of this database are significantly different from those of most databases that we have previously encountered. Nearly half of the utterances include only digits, as opposed to almost 100% for the databases reported by Chou et al. (1995). The new challenge this database presents is the need to accurately recognize embedded digits, natural numbers, alphabets, restarts and extraneous speech, which collectively constitute about half of the database. There are also high proportions of fast speech, cuts and severe background noise, which must all be dealt with appropriately.

Fig. 3. Histogram of the numeric database as a function of vocabulary and call characteristics. [Bar chart; x-axis: the ten classes above; y-axis: % of data available, 0–60.]

3.2. The numeric language

It is not necessary for a spoken dialogue system to recognize all words correctly in order to understand the meaning of a request, but only those words or phrases that are "salient" to the task (Gorin, 1995; Wright et al., 1997, 1998). Salient phrases are essential for interpreting fluent speech. They may be automatically acquired so as to maximize the accuracy of the mapping from unconstrained input to machine action (Wright et al., 1997).

In this paper, we will refer to those salient phrases that are relevant to our task as numerics. The numeric language consists of the set of phrases that are relevant to the task of understanding and interpreting customers' requests. For this application, we define six distinct phrase classes: digits, natural numbers, alphabets (Mitchell and Setlur, 1999), restarts, city/country names, and other miscellaneous phrases. Digits, natural numbers and alphabets are the basic building blocks of telephone and credit card numbers. For example, "my card number is one three hundred fifty five A four...". Restarts include the set of phrases that are indicative of false starts, corrections and hesitation. For example, "my telephone number is nine zero eight I'm sorry nine seven eight...". City/country names can be essential in reconstructing a telephone number when country or area codes are missing. For example, "I would like to call Italy and the number is three five...". Finally, there are a number of miscellaneous salient phrases that can alter the ordering of the numbers. Such phrases are "area code", "extension number", "expiration date", etc. In our application, the vocabulary size was over 3600 words, of which 100 phrases represented the numeric language.

4. System for numeric recognition

We consider the problem of numeric recognition in spoken dialogue systems as a large-vocabulary continuous speech recognition task where numerics are treated as a small subset of the active vocabulary in the lexicon.³ The challenge here is essentially to accurately model, recognize and understand the numeric language in fluent unconstrained speech. Fig. 4 shows a block diagram of the numeric recognition system. The main components of this architecture are described as follows.

Fig. 4. A block diagram of the numeric recognition system. [Block diagram with components: conversational speech, feature extraction, speech recognition (with acoustic model, lexicon and language model), utterance verification (with its own acoustic model), numeric understanding, utterance validation, dialogue manager, action.]

³ Our previous experiments have shown that this approach to numeric recognition can lead to improved system performance over keyword detection methods (Rahim, 1999a).

4.1. Feature extraction

The input signal, sampled at 8 kHz, is first pre-emphasized and grouped into frames of 30 ms duration at intervals of 10 ms. Each frame is Hamming windowed, Fourier transformed and then passed through a set of 22 triangular bandpass filters. Twelve mel cepstral coefficients are computed by applying the inverse discrete cosine transform to the log magnitude spectrum. To reduce channel variations while still maintaining real-time performance, each cepstral vector is normalized using cepstral mean subtraction with an operating look-ahead delay of 30 speech frames. To capture temporal information in the signal, each normalized cepstral vector, along with its frame log energy, is augmented with its first and second-order time derivatives. The energy coefficient, normalized at the operating look-ahead delay, is also used for end-pointing the speech signal (Rabiner and Juang, 1993).
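This front end is conventional enough to sketch. Below is a minimal numpy rendering of the pipeline just described (pre-emphasis, 30 ms Hamming frames at a 10 ms shift, a 22-filter mel bank, 12 cepstra from the log mel spectrum, and mean subtraction with a 30-frame look-ahead); the function names, FFT size, pre-emphasis coefficient and smoothing floors are our assumptions, not values given in the paper, and the derivative features are omitted for brevity.

```python
import numpy as np

def mel_filterbank(n_filters=22, n_fft=256, sr=8000):
    """Triangular filters spaced on the mel scale (a standard construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mel_cepstra(signal, sr=8000, frame_ms=30, shift_ms=10, n_ceps=12):
    """12 mel cepstral coefficients per 30 ms frame at a 10 ms shift."""
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    flen, fshift = sr * frame_ms // 1000, sr * shift_ms // 1000
    n_frames = 1 + max(0, (len(signal) - flen) // fshift)
    win, fb = np.hamming(flen), mel_filterbank()
    ceps = []
    for t in range(n_frames):
        frame = signal[t * fshift: t * fshift + flen] * win
        spec = np.abs(np.fft.rfft(frame, 256))
        logmel = np.log(fb @ spec + 1e-10)
        # inverse DCT of the log magnitude mel spectrum -> cepstrum
        n = np.arange(22)
        c = np.array([np.sum(logmel * np.cos(np.pi * k * (n + 0.5) / 22))
                      for k in range(1, n_ceps + 1)])
        ceps.append(c)
    return np.array(ceps)

def cms_lookahead(ceps, delay=30):
    """Cepstral mean subtraction using all frames seen so far plus a
    30-frame look-ahead, approximating the real-time behavior above."""
    out = np.empty_like(ceps)
    for t in range(len(ceps)):
        out[t] = ceps[t] - ceps[:min(t + delay, len(ceps))].mean(axis=0)
    return out
```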


4.2. Acoustic modeling

Accurate numeric recognition in fluent unconstrained speech clearly demands detailed acoustic modeling of the numeric language. It is also essential to accurately model non-numeric words, as they constituted over 11% of the numeric database. Accordingly, our design strategy has been to use two sets of subword units: one dedicated to the numeric language and the other to the remaining vocabulary words. Each set adopts left-to-right continuous-density hidden Markov models (HMMs) with no skip states.

For numeric recognition, context-dependent acoustic units have been used which capture all possible inter-numeric coarticulation (Pieraccini and Rosenberg, 1990; Lee et al., 1992). The basic philosophy is that each word is modeled by three segments: a head, a body and a tail. A word has one body, which has relatively stable acoustic characteristics, and multiple heads and tails depending on the preceding and following context. Thus, junctures between numerics are explicitly modeled. Since this results in a huge number of subword units, and due to the limited amount of training data, the head-body-tail design has been strictly applied to the eleven digits. This generated 273 units, which were assigned a 3-4-3 state topology corresponding to the heads, bodies and tails, respectively (Pieraccini and Rosenberg, 1990; Lee et al., 1992).

The second set of units was used for modeling non-numeric words and consisted of forty 3-state HMMs, each corresponding to a context-independent English phone (Shoup, 1980). Therefore, in contrast to traditional methods for digit recognition, out-of-vocabulary words (essentially the non-numeric words) are explicitly modeled by a dedicated set of subword units, rather than being treated as "filler" phrases. Transitional events between these units and the digit HMMs (limited by the availability of the data) were modeled by 16 additional units representing context switching from digits to speech events and vice versa. As will be shown later, this approach enables us to take full advantage of the language model for this task while maintaining real-time operation.

To model silence, background noise and extraneous events, four filler models were introduced. Each model was represented by either a 1-state or a 3-state HMM. They were trained on pre-segmented data that included silence, coughs, clicks, extraneous sounds and background speech.

In total, our system employed 333 units. Each state included 32 Gaussian components, with the exception of the filler models, which consisted of 64 Gaussian components. A unit duration model, approximated by a gamma distribution, was also used to augment the log likelihood scores (Rabiner and Juang, 1993).
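The head-body-tail bookkeeping is easy to get wrong, so the following toy sketch enumerates such an inventory. The exact context set behind the 273 digit units is not spelled out in the text, so the `contexts` argument here is purely illustrative; only the component totals (273 + 40 + 16 + 4 = 333) come from the paper.

```python
DIGITS = ["oh", "zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def head_body_tail(word, contexts):
    """One 4-state body per word plus a 3-state head and a 3-state tail
    per adjacent context (the 3-4-3 topology described above)."""
    units = [(f"{word}.body", 4)]
    units += [(f"{c}-{word}.head", 3) for c in contexts]
    units += [(f"{word}+{c}.tail", 3) for c in contexts]
    return units

# An illustrative context set: the other digits.  The paper's full
# context set (which yields 273 digit units) is not given.
digit_units = [u for w in DIGITS for u in head_body_tail(w, DIGITS)]

# Remaining inventory quoted in the text: 40 context-independent 3-state
# phone HMMs, 16 digit/speech transition units and 4 filler models,
# i.e. 273 + 40 + 16 + 4 = 333 units overall.
print(len(digit_units), "digit units under this illustrative context set")
```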

4.3. Language modeling

Robustness is a key ingredient in building language models for spoken dialogue systems. The challenge is to construct models that can recognize fluent spontaneous speech while giving users the flexibility to speak freely. Though the numeric language has quite a small vocabulary of very frequent words, the diverse and unpredictable set of responses makes this task a challenge for language modeling. Thus, traditional methods that rely on hand-crafted, deterministic grammars are generally inferior in these circumstances (Rahim, 1999a). On the other hand, training stochastic language models on transcriptions of responses to number requests is prone to data sparseness problems.

In this study, we automatically learn stochastic finite state machines from the training data using a variable n-gram model (Riccardi et al., 1996). This particular design includes a back-off mechanism and enables parsing of any arbitrary sequence of words sampled from a given vocabulary. It has previously been shown to be more efficient than standard n-gram language models (Riccardi et al., 1996).
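The back-off mechanism is the key idea: score a word against the longest observed history and fall back, with a penalty, to shorter histories otherwise. The following is a minimal sketch of that idea only, not of the full variable n-gram learner; the fixed back-off weight and the crude unigram smoothing are assumptions (the real model estimates and normalizes these per context).

```python
def backoff_prob(word, history, counts, alpha=0.4):
    """P(word | history) with recursive back-off to shorter histories.
    `counts` maps word tuples to corpus counts; `alpha` is an assumed
    fixed back-off weight."""
    ctx = tuple(history)
    num = counts.get(ctx + (word,), 0)
    denom = counts.get(ctx, 0)
    if num > 0 and denom > 0:
        return num / denom                  # history observed: relative frequency
    if not ctx:                             # unigram floor with crude smoothing
        total = sum(c for k, c in counts.items() if len(k) == 1)
        return max(counts.get((word,), 0), 0.5) / max(total, 1)
    return alpha * backoff_prob(word, history[1:], counts, alpha)

counts = {("area",): 40, ("code",): 38, ("area", "code"): 37}
p = backoff_prob("code", ["area"], counts)   # 37/40 from the observed bigram
```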

To improve the generalization capability of the language model, we have integrated semantic-syntactic knowledge into the estimate of the word sequence probabilities. In particular, by manually tagging the numeric language into word and phrase classes (e.g., <digits> = {ONE, TWO, ...}, <naturals> = {ELEVEN, HUNDRED, ...}, <country> = {ITALY, ENGLAND, ...}), we are effectively and robustly integrating prior knowledge into the language model. This has the benefit of producing probability estimates that are more robust against data sparseness, and language models that are both efficient in terms of storage and generalizable across different task domains (Riccardi et al., 1996).

Automatic learning of salient grammar fragments, such as classes of data, is essential for robust language modeling. Algorithms for the automatic acquisition of phrase grammars through word and phrase classes have been proposed in (Riccardi and Bangalore, 1998). An excerpt of the salient phrases automatically generated by these algorithms is the following:

<dig3> area code,
number is <dig10>,
<country> <dig14>,

where <dig3>, <dig10> and <dig14> are non-terminal symbols for different sequences of digits.

Incorporating language features as constraints on the probability distribution estimation helps in improving the prediction of words over standard n-gram language models. For example, the unigram model for the phrase <dig3> area code is estimated from the following set of constraints on the word probability distribution:

\[
\Pr_{\Theta}(\langle \mathrm{dig3} \rangle\ \text{area code}) = \theta_1 p_4 + \theta_2 p_1 p_2 p_3,
\]
\[
\Pr(\text{area}) = p_1, \quad
\Pr(\text{code}) = p_2, \quad
\Pr(\langle \mathrm{dig3} \rangle) = p_3, \quad
\Pr(\langle \mathrm{dig3} \rangle\ \text{area code}) = p_4, \tag{1}
\]

where the p_i are the priors and Θ = {θ_i} are the free parameters, which may be estimated through either the expectation-maximization algorithm or the cross-validation algorithm (Riccardi et al., 1996).

4.4. Model training

Training acoustic and language models for numeric recognition in a spoken language dialogue poses several new challenges. First, the training objective function should be coupled with the performance of the recognizer. For numeric recognition, an optimum recognizer is defined to be the one that minimizes the expected numeric string error rate. This performance measure is essential for speech understanding purposes, since a numeric string is considered erroneous if and only if any one of the numerics is recognized incorrectly. A misrecognition of a single digit, for example, would clearly result in an incorrect automation of a credit card or telephone number.

The second challenge is to design a framework that provides "flexible" interaction between the acoustic and language models. Though one needs to maximize the recognition performance in a task-specific domain, flexibility must be provided to enable acoustic model training to be performed relatively free of the constraints imposed by the stochastic language model.

Training is carried out in two phases using all the available training corpus, X: maximum likelihood estimation (MLE) is performed, followed by minimum classification error (MCE) training (Juang and Katagiri, 1992). Given a set of language model parameters, Θ, the objective in MLE training is to learn a new set of acoustic model parameters, Λ, by maximizing a log likelihood function. Given the correct word transcription, W₀, then

\[
\bar{\Lambda} = \arg\max_{\Lambda}\, g(X, W_0; \Lambda, \Theta), \tag{2}
\]

where

\[
g(X, W_0; \Lambda, \Theta) = \log \bigl[ \Pr(X \mid W_0; \Lambda) \cdot \Pr(W_0 \mid \Theta)^{\gamma} \bigr] \tag{3}
\]

and γ ≥ 0 defines a compensation factor, set empirically, that weights the contribution of the stochastic language model.

In MCE training, we aim to re-estimate both Λ and Θ by minimizing a smoothed string-based error function:

\[
(\bar{\Lambda}, \bar{\Theta}) = \arg\min_{\Lambda, \Theta}\, \bigl\{ 1 + e^{-\alpha\, d(X; \Lambda, \Theta)} \bigr\}^{-1}, \quad \alpha > 0, \tag{4}
\]

where d(X; Λ, Θ), the misclassification distance, is defined as

\[
d(X; \Lambda, \Theta) = -g(X, W_0; \Lambda, \Theta) + \log \left\{ \frac{1}{N} \sum_{n=1}^{N} e^{\,g(X, W_n; \Lambda, \Theta)} \right\}. \tag{5}
\]

The W_n are competing hypotheses and can be generated by an N-best search.⁴

⁴ In this study, MCE has been limited to training the acoustic model parameters only.

Training and acoustic modeling are performed iteratively. Two sets of context-independent subword units are initially optimized using MLE followed by MCE. Context-dependent HMMs are then designed for the numeric language and trained accordingly. In the final step, additional units are integrated to model transitional events between numerics and non-numerics, and training is repeated again. The resultant acoustic model consisted of 333 discriminatively-trained HMMs.

A few remarks are in order for MCE training:
1. Discriminative training is relatively fast. With an efficient implementation of MCE along with a fast N-best search, models have been trained at about real time.
2. The objective function in Eq. (4) minimizes the expected string error rate over the training data. We have observed that assigning one dedicated set of context-dependent units for modeling numerics and another for modeling other words, as opposed to a single set of units for all vocabulary words, results in a lower error rate over both the training and test data.
3. The out-of-vocabulary words, which amount to 1% of our training database, were used for improving the discrimination between numerics and the filler models. This effectively reduces both the insertion and deletion rates of the numeric language.
4. The factor γ in Eq. (3) played a key role during discriminative training in providing flexible interaction between the acoustic and language models. The higher the value of γ, the more emphasis was given to the language model. Therefore, setting γ reasonably low during training gave the acoustic models the freedom to be optimized with fewer language constraints.⁵

⁵ In this study, γ was set to 7 during training and 9 during recognition, thus causing the N-best search during the training process to be dominated more by the acoustic likelihood score.

4.5. Utterance verification

An important requirement of a robust spoken dialogue system is identifying out-of-vocabulary utterances and utterances that are incorrectly recognized. This is particularly important for numeric recognition, since it provides the dialogue manager with a confidence measure that may be used for call confirmation, repair or disambiguation. Associating each utterance with a confidence score is performed through utterance verification (Lleida and Rose, 1995; Rahim et al., 1997; Wendemuth et al., 1999). In this section, we describe utterance verification as a statistical hypothesis testing problem. We then describe our methods for extracting verification features and performing classification.

4.5.1. Statistical hypothesis testing

Consider Z = {Z₁, Z₂, ..., Z_n} to be n verification features that describe a single utterance having M words, {w_i | i = 1 : M}. In statistical hypothesis testing, if H₀ is the null hypothesis and H₁ is the alternate hypothesis, then the decision rule is stated in terms of a likelihood ratio as

\[
L(Z; \Theta) = \frac{\Pr_{\theta_0}(Z \mid H_0)}{\Pr_{\theta_1}(Z \mid H_1)}\ \underset{H_1}{\overset{H_0}{\gtrless}}\ \tau, \tag{6}
\]

where τ is the decision threshold and Θ = {θ₀, θ₁} are the parameters of the model, which are estimated by minimizing the average probability of error, Pr_{θ₁}(Z, H₀) + Pr_{θ₀}(Z, H₁) (Bickel and Doksum, 1977). If S_n denotes the numeric language set, then H₀ assumes that w_i ∈ S_n for at least one word, and that all numeric words are correctly recognized. Conversely, H₁ assumes that either w_i ∉ S_n for all i, or that at least one of the numeric words is incorrectly recognized.

In this study, we consider utterance verification as a binary classification problem where, for each utterance, the most probable hypothesis is selected as

\[
H^{*} = \arg\max_{i=0,1}\, \Pr_{\phi}(H_i \mid Z). \tag{7}
\]

4.5.2. Verification features

Verification features, Z, are designed to reflect the confidence in recognizing, essentially, the numeric language. Five features are used to encode each utterance. Pilot experiments have shown that each of these features is capable of providing some degree of separation between H₀ and H₁: the equal error rate is significantly better than chance level when employing any of the verification features individually (Rahim et al., 1997).

Verification score. To detect out-of-vocabulary words and numerics that are incorrectly recognized, a log likelihood ratio distance is computed based on a set of discriminatively-trained context-independent verification HMMs, Ψ, involving numeric phrases, Ψ⁽ᵏ⁾, anti-numeric phrases, Ψ⁽ᵃᵏ⁾, and fillers, Ψ⁽ᶠ⁾ (Rahim et al., 1997). Let X be the acoustic features for a given utterance (e.g., cepstrum, energy, etc.), {w_k | k = 1 : K} its numeric words, such that w_k ∈ S_n for all k, and X_k the corresponding acoustic vectors. The verification score is then defined as

\[
L(X; \Psi) = -\frac{1}{\eta} \log \left\{ \frac{1}{K} \sum_{k} \exp\{-\eta\, L(X_k; \Psi)\} \right\},
\]
\[
L(X_k; \Psi) = \log \left\{ \frac{\Pr(X_k \mid \Psi^{(k)})}{\alpha \Pr(X_k \mid \Psi^{(ak)}) + (1 - \alpha) \Pr(X_k \mid \Psi^{(f)})} \right\}, \tag{8}
\]

where η > 0 and 0 ≤ α ≤ 1. This measure has been commonly used (Lleida and Rose, 1995; Rahim et al., 1997).

N-best verification score. This score reflects the confidence in L(X; Ψ), estimated over the N-best candidates:

\[
dL(X; \Psi) = L_1(X; \Psi) - L_2(X; \Psi), \tag{9}
\]

where L₁(·) and L₂(·) are the verification scores for the best two candidates.

Likelihood-ratio distance. The ratio of the N-best likelihood scores for the recognition HMMs, Λ, provides a confidence measure that describes the best decoded path:

\[
dL(X; \Lambda) = L_1(X; \Lambda) - L_2(X; \Lambda), \quad
L(X; \Lambda) = \frac{1}{K} \sum_{k} \log \Pr(X_k, w_k \mid \Lambda), \tag{10}
\]

where L₁(·) and L₂(·) are the likelihood scores for the two best candidates.

Numeric cost function 1. Confusion among the numeric language for the top N-best candidates is a strong indication of a possible error. Thus, for w_k ∈ S_n,

\[
D_1 = \begin{cases} 1 & w_k^1 = w_k^2\ \ \forall k, \\ 0 & \text{otherwise}, \end{cases}
\]

where w_k^1 and w_k^2 are the numeric words for the two best candidates.

Numeric cost function 2. Since for some utterances the numeric language can be the only vocabulary in the N-best candidates, D₁ will erroneously be set to 0 in these situations. A companion cost is therefore introduced, such that

\[
D_2 = \begin{cases} 1 & w_k^1 \in S_n\ \ \forall k, \\ 0 & \text{otherwise}. \end{cases}
\]

4.5.3. Hierarchical mixture-of-experts

The action of the classifier is to map the input feature space into H₀ should the top hypothesis contain numeric words that are all correctly recognized, and into H₁ otherwise. A hierarchical mixture-of-experts (HME) has been used for this binary classification problem (Jordan and Jacobs, 1994). Unlike neural networks, HMEs apply the expectation-maximization algorithm for training, as opposed to gradient descent. HMEs are also discriminative: unlike Gaussian mixture models (GMMs), which maximize a likelihood function, HMEs minimize the probability of error. They generally converge reasonably fast, accepting both binary and continuous inputs.

Fig. 5 shows an HME of depth 2. This architecture enables non-linear supervised learning by dividing the input feature space into a nested set of regions and fitting simple surfaces to the data that fall in these regions. The output of the HME, y, is a discrete random variable having possible outcomes of 1, to denote H₀, or 0, to denote H₁.

Fig. 5. A hierarchical mixture-of-experts (after Jordan and Jacobs, 1994). [Binary tree of depth 2: gating networks at the non-terminals, expert networks at the leaves, all receiving the verification features Z.]

An HME is a binary tree with gating networks that sit at the non-terminals of the tree and expert networks that sit at the leaves. Both gating and expert networks receive the verification features Z and compute a conditional probability. For binary classification, the resulting hierarchical probability model is a mixture of Bernoulli densities (Jordan and Jacobs, 1994):

\[
\Pr_{\phi}(y \mid Z) = \sum_{i} F_i(Z; \phi_i) \sum_{j} F_{ij}(Z; \phi_{ij})\,
\Pr_{\phi_{ij}}(y \mid Z)^{\,y}\,\bigl(1 - \Pr_{\phi_{ij}}(y \mid Z)\bigr)^{1-y}. \tag{11}
\]

Thus, the probability of selecting H₀ is

\[
\Pr_{\phi}(y = 1 \mid Z) = \sum_{i=0,1} F_i(Z; \phi_i) \sum_{j} F_{ij}(Z; \phi_{ij})\, \Pr_{\phi_{ij}}(y = 1 \mid Z), \tag{12}
\]

where φ are the underlying parameters of the HME. The output of the HME may be considered as a confidence score. Based on the application design requirements, this output can be adjusted so as to achieve a desirable trade-off between false acceptance and false rejection rates.⁶

⁶ From a business point of view, threshold settings are directly related to the call automation rate and hence to cost revenue. The higher the threshold, the smaller the automation rate, but the fewer the errors (false acceptances) made by the system, and vice versa.

4.6. Numeric understanding

Speech, or language, understanding is an essential component in the design of spoken dialogue systems. It provides a link between the speech recognizer and the dialogue manager (see Fig. 4), with the prime responsibility of converting the recognition output into a machine action.⁷

⁷ For our particular task domain, we will refer to speech understanding as numeric understanding.

For numeric recognition, the purpose of the understanding module is to translate the output of the recognizer into a "valid" string of digits. However, in the event of an ambiguous request or poor recognition performance, the understanding module may provide several hypotheses to the dialogue manager for repair, disambiguation or clarification (Abella and Gorin, 1997).

In this study, a knowledge-based strategy has been implemented for numeric understanding, translating the recognition results (e.g., N-best hypotheses) into a simplified finite state machine (FSM) containing digits only. Six semantic classes for numeric understanding are illustrated in Table 1 (a toy rendering of these actions follows the table). A simplified example is shown in Fig. 6, which illustrates the basic actions of the understanding module when dealing with alphabets, natural numbers, restarts or corrections, and filtering of out-of-vocabulary phrases. It should be pointed out that the knowledge-based system is very much suited to our task given its scope and complexity. However, if sufficiently more data had been available, it would have been interesting to explore machine learning methods for this problem.

Table 1
Semantic classes for numeric understanding

  Class              Action       Example
  Digits             Translation  Six seven eight -> 6 7 8
  Naturals           Translation  One eight hundred two one three -> 1 8 0 0 2 1 3
  Restarts           Correction   Nine zero eight sorry nine one eight -> 9 1 8
  Alphabets          Translation  A Z one two three -> 2 9 1 2 3
  City/country       Translation  Calling Florham Park New Jersey -> 9 7 3
  Numeric phrases    Realignment  Nine one two area code nine zero one -> 9 0 1 9 1 2
  Out-of-vocabulary  Filtering    I do not know what you are talking about -> (empty)

Fig. 6. An example of an FSM: (a) before and (b) after numeric understanding. [A word FSM containing digits, naturals, alphabets, restarts and out-of-vocabulary words is reduced to a digit-only FSM.]
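To make the Table 1 actions concrete, here is a toy Python sketch of the translation rules. Everything about it is a simplification: the deployed module operates on FSMs rather than flat word lists, the restart heuristic (dropping the three digits before the restart word) is a crude stand-in for real correction logic, the alphabet mapping assumes the standard phone keypad, and city/country translation and phrase realignment are omitted because they require external tables.

```python
DIGITS = {"oh": "0", "zero": "0", "one": "1", "two": "2", "three": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9"}
NATURALS = {"hundred": "00", "eleven": "11", "twelve": "12"}   # excerpt only
KEYPAD = {c: d for d, cs in {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
          "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}.items() for c in cs}
RESTARTS = {"sorry", "correction", "no"}                       # excerpt only

def understand(words):
    """Toy version of the Table 1 actions: translate digits, naturals and
    alphabets, apply restarts as corrections, filter everything else."""
    out = []
    for w in words:
        if w in RESTARTS:
            out = out[:max(len(out) - 3, 0)]   # crude correction heuristic
        elif w in DIGITS:
            out.append(DIGITS[w])
        elif w in NATURALS:
            out.append(NATURALS[w])
        elif len(w) == 1 and w.upper() in KEYPAD:
            out.append(KEYPAD[w.upper()])
        # anything else is out-of-vocabulary and is filtered
    return "".join(out)

print(understand("my number is nine zero eight sorry nine one eight".split()))
# -> "918", matching the Restarts row of Table 1
```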

One variation of the understanding module is to process N-best hypotheses (or a lattice). This provides the dialogue system with much flexibility as to the appropriate action to take, especially if the top hypothesis is not a valid string, as will be illustrated in the next section.

4.6.1. String validation

Speech recognition systems employ language modeling for encoding knowledge of the task. This is advantageous from both accuracy and efficiency standpoints. In this study, class-based language models were used to integrate semantic-syntactic knowledge into the estimate of the word sequence probabilities. However, no constraints within the numeric classes were imposed; that is, users had the flexibility to speak the numeric language in any order. For example, users can say a card number intermixing alphabets, digits and natural numbers. In these situations, the system relies more strongly on the acoustic scores during decoding.

The motivation for this design strategy was that users, who commonly have no previous experience with the system, would be unlikely to follow any language structure. This is clearly a challenge when dealing with a mixed-initiative dialogue, as opposed to a system-initiative dialogue. We observed many instances where the user decided to change context, did not provide a complete telephone or credit card number, or wished not to provide any information and requested an operator. These varieties of problems are real and very much expected from a mixed-initiative dialogue system that deals with novice users. In the test database that was used in this study, we found nearly 24% of the calls to fit into this category, prohibiting full automation of those calls accordingly.

It is essential that these problematic situations are identified and reported to the dialogue manager. In this study, the output of the understanding system is passed through a validation module. This module performs a database query, which accesses live databases of credit cards and telephone numbers, reporting three possible outcomes:
1. Valid. The output digit string of the understanding system corresponds to an actual valid number (telephone or credit).
2. Invalid. The output digit string does not correspond to a valid number.
3. Partially valid. The output digit string is invalid, but at most one digit may be corrected to ensure validity of the string.
This information is then passed to the dialogue manager for confirmation, disambiguation or clarification.


Since a database of numbers can be represented by an FSM, it can be efficiently combined with the understanding FSM of Section 4.6. In this manner, N-best hypotheses may also be processed and passed to the dialogue manager. Hence, understanding and validation can be conducted simultaneously.
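As a stand-in for that FSM composition, the sketch below walks the N-best digit strings in order and returns the first one the database accepts, which is the behavior examined experimentally in Section 5. The function name and return convention are ours.

```python
def first_valid(nbest, database):
    """Return the first N-best digit string accepted by the number
    database, together with its rank; None if every candidate fails.
    This approximates composing the understanding FSM with a database FSM."""
    for rank, digits in enumerate(nbest, start=1):
        if digits in database:
            return digits, rank
    return None

hyp = first_valid(["9085551234", "9735551234"], {"9735551234"})
# -> ("9735551234", 2): the second-best candidate is the valid one
```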

Performing validation on the output of the understanding module is essential in spoken language dialogue. There are many situations where this can be advantageous. A speaker may provide an invalid area code, for example, or a valid telephone number that was either incorrectly recognized or misinterpreted by the system. Clearly, a checking step prior to automation is necessary.

Finally, it should be pointed out that this process of validation after recognition is different from traditional approaches that constrain the recognition process by means of a number grammar. The latter approach is unsuitable in spoken language dialogue, where errors can be made by either the speaker or the machine.

5. Experimental results

This section provides experimental results demonstrating the performance of the various modules of our system, including recognition, understanding, verification and validation. All experiments have been performed using the AT&T Watson speech recognition system (Sharp et al., 1997). A standard Viterbi beam search has been used with a lexicon of 3.6 K words, a perplexity of 14 and an out-of-vocabulary word rate of less than 5%, as measured on the test database (Riccardi and Gorin, 1999).

A variety of databases have been used for training the subword/numeric models. These databases, collected over a wide range of environmental conditions, help to provide broad acoustic models. They include (a) 12,000 utterances from this experimental study (of the total 30,000 utterances), with over 1500 utterances from users responding to a card or telephone number prompt (Gorin et al., 1997), (b) 11,500 connected digit strings from services and data trials performed over a 10 year period (Chou et al., 1995), (c) 3300 connected digit strings from various cellular environments (acquired from BRITE systems), and (d) an in-house database including 5000 strings of connected digits and natural numbers.

For language modeling, a variable n-gram stochastic automaton was estimated using the 30,000 transactions. Two separate language models were then generated for the card and phone numbers by adapting on their respective transcriptions (Riccardi and Gorin, 1999).

Our experiments have been conducted on 626 test utterances, each representing a separate transaction from a different speaker. The first set of results is presented in Fig. 7. They illustrate the performance of the recognizer for two different acoustic models. The first model, represented by the dashed line, consists of two sets of MLE-trained context-independent HMMs. This model was used in the initial stage of the training process, as pointed out in Section 4. The solid line represents the MCE-trained context-dependent model. The figure presents the average overall word error rate and the digit error rate (with no rejection) as a function of processing speed on an SGI R10000 machine.⁸ Varying the speed of the decoder was achieved by changing the operating beam width.

From the two results in Fig. 7, the following conclusions can be drawn. (a) Improved acoustic modeling and discriminative training have led to a 20–30% reduction in the word and digit error rates. The reduction in the digit error rate, in particular, is critical since it translates to over a 10% reduction in the absolute numeric string error rate (Rahim, 1999a). (b) For either method, the knee points on the curves are well positioned below the real-time (RT) mark. The computing time shown in Fig. 7 represents off-line processing of the entire system in Fig. 4.

The second set of experiments evaluates the speech understanding module. At this stage, we are interested in whether the interpretation of the spoken input following recognition matches that of the transcription. The understanding module converts an input text into a sequence of digits using the rules defined in Table 1. An action of this module is considered correct if the output digit sequences match between the recognized speech and its transcription. Fig. 8 displays the understanding error rates for the two previously described models (i.e., MLE-based and MCE-based) as a function of processing speed. These results echo our previous findings that a discriminatively trained context-dependent acoustic model outperforms a standard ML-trained context-independent model. A typical operating point is a 26% error rate with no rejection. This implies that the interpretation of 74% of the utterances is the same for the recognized speech and its transcription.

⁸ The word error rate includes the insertion, deletion and substitution error rates normalized by the total word count. The digit error rate includes the insertion, deletion and substitution error rates at the output of the understanding module, normalized by the total digit count.

Fig. 7. Word and digit error rates versus processing speed per audio second. The dashed and solid lines represent the performance of the MLE context-independent and the MCE context-dependent acoustic models, respectively. [Two panels; x-axes: CPU seconds/audio second, 0–1.6 with the RT mark; y-axes: word error rate, 11–18%, and digit error rate, 5–11%.]

Fig. 8. Understanding error rate versus processing speed per audio second. The dashed and solid lines represent the performance of the MLE context-independent and the MCE context-dependent acoustic models, respectively. [x-axis: CPU seconds/audio second, 0–1.6 with the RT mark; y-axis: understanding error rate, 24–42%.]

An important strategy for improving system performance is utilizing utterance verification. In the third set of experiments, we aim to use the output of the HME as a confidence score to identify when the interpretation of the understanding module is incorrect. Clearly, this provides valuable information for the dialogue manager when responding to the user. In this study, we have applied utterance verification for detecting recognition errors among utterances that included the numeric language. This is clearly a much tougher challenge than rejecting out-of-vocabulary words. Utterances that included solely out-of-vocabulary words were found not to be confused with the numeric language, strictly based on their recognized contents. Separating and rejecting those utterances was therefore trivial, without having to apply utterance verification. This conclusion is considered one of the advantages of the proposed approach to numeric recognition. Methods that rely on more constrained grammars often show larger confusability between in-vocabulary and out-of-vocabulary data.

Fig. 9(b) shows the distribution of the output of the HME when the data is correctly interpreted, i.e., Pr_φ(y = H₀ | Z), and when the data is incorrectly interpreted, i.e., Pr_φ(y = H₁ | Z). To baseline our results, Fig. 9(a) shows the distribution of the likelihood ratio scores when computed using the verification models. This is considered the classical approach to utterance verification, particularly for digit recognition (Rahim et al., 1997). The amount of overlap in the two distributions, which defines the false rejection rate and the false acceptance rate, is clearly smaller for the HME case; a result that is attributed to the additional input features. In particular, the equal error rate (i.e., false acceptance = false rejection) and the minimum error rate (min(false acceptance + false rejection)) are 14% and 23%, respectively, for the likelihood ratio score, and 10% and 18%, respectively, for the HME method.

Fig. 9. Utterance verification performance when using (a) a likelihood ratio score, and (b) the HME verification score. The solid and dashed lines represent the distribution of the scores when the data is incorrectly and correctly interpreted, respectively. [Two panels; x-axes: verification score, 0–1; y-axes: probability density.]

In the following results, we compare the performance of the HME with that of a GMM. Fig. 10 shows the receiver operating characteristic (ROC) curves for the two methods, that is, the detection rate (1 − false rejection rate) versus false alarms (false acceptance). The HME clearly demonstrates better performance than the GMM at all possible settings. At an operating point of a 10% false acceptance rate, for example, the detection rates are 79% and 90% for the GMM and the HME systems, respectively. Instrumenting utterance verification at this operating point translates to a boost in the understanding correct rate for the HME system from 74%, as shown in Fig. 8 when not using utterance verification, to 90% (Rahim, 1999b).

Fig. 10. ROC curves when using the GMM and the HME for utterance verification. [x-axis: false alarms, 0–30%; y-axis: detection rate, 30–100%.]

From a designer's perspective, Fig. 11 illustrates the trade-off between false acceptance and false rejection as a function of the confidence measure (or decision threshold). One interesting remark is that the output of the HME is a true probability, which reflects a mapping of the verification measurements into either a correct or an incorrect class decision. The minimum and equal error rates, which both seem to occur at similar thresholds, are less than 18% and 10%, respectively.

Fig. 11. False acceptance, false rejection and total error rates as a function of the decision threshold (confidence measure). [x-axis: confidence measure, 0–1; y-axis: understanding error rate, 0–70%.]

The last step before sending information to the dialogue manager is validation. Checking whether the sequence of digits at the output of the understanding system corresponds to an active telephone or card number is valuable information. Based on this knowledge, the dialogue manager may determine a particular strategy, such as to reprompt the user or to simply connect to a live operator.

Fig. 12 shows a breakdown of the telephone number strings based on whether or not they represented valid or invalid numbers. Validity of these strings was determined by an inquiry to a telephone number database. These telephone numbers were acquired at the output of the understanding system. The figure shows the results for the top 10 candidate strings, which were checked against the database in order. Should the top candidate correspond to an invalid number, for example, the next candidate would be checked, and so on. The results in Fig. 12 indicate that 61% of the recognized output strings (i.e., the top hypothesis) correspond to valid telephone numbers. For the 39% invalid strings, the figure shows that succeeding candidates are valid in some instances. Based on this finding, the dialogue manager may decide to examine those candidates instead.⁹

Fig. 12. Percentage of valid telephone number strings over the top 10 best candidates. [Bar chart; x-axis: N-best candidate rank, 1–10; y-axis: valid string rate, 0–70%.]

⁹ Performing the validation process on the transcribed data, as opposed to the recognized strings, resulted in 76% validity. This indicates that misrecognition in our system caused 15% (76 − 61) of the requests to be misinterpreted.

Table 2 presents the digit and string error rates based on the top candidate for the valid and invalid data. For the 61% valid strings, the digit and string error rates are 0.6% and 5.5%, respectively. For the invalid strings, these error rates are 18.5% and 55.0%, respectively. These remarkable results imply that valid telephone numbers at the output of the understanding system are very likely to be correctly recognized; a conclusion that is valuable in the design of spoken dialogue systems.

Table 2
Digit and string error rates (%) for valid and invalid telephone numbers

          Valid (top candidate)   Invalid
  Digit   0.6                     18.5
  String  5.5                     55.0

6. Discussion and summary

The problem we are trying to solve is the automatic recognition and understanding of fluent speech in spoken dialogue systems. In particular, we are focusing on task domains that involve the use of the numeric language: for example, tasks that require the use of credit card numbers, telephone numbers, zip codes, dates, times, etc. Irrespective of the application, it is our premise that these tasks are very well defined and that they can be integrated into a unified language framework.

The use of the numeric language in mixed-initiative spoken language dialogue presents several challenges, particularly when dealing with users who are unfamiliar with the operation or the design of the dialogue system. First, the spoken number string may be encoded by digits, natural numbers and alphabets that must all be correctly recognized to successfully automate the call. Second, the input speech may be embedded in a pool of out-of-vocabulary words with possible corrections, false starts and other problematic situations that are features of fluent and spontaneous speech. Third, the input speech may be ambiguous; that is, it may contain conflicting information, invalid numbers or incomplete information. Rather than trying to deal with these problems in a general way, our approach has been to narrow the problem down to consider only those "salient" phrases that are of interest to our task. The set of these phrases is referred to as the numeric language.

This paper presented our progress towards robust numeric language recognition in spoken language dialogue. Our strategy has been to consider this problem as a large-vocabulary speech recognition task that is especially tailored to provide high-quality, robust recognition of the numeric language. This approach is unlike traditional methods that instrument handcrafted grammars for digit recognition along with filler models to accommodate out-of-vocabulary words.

The experimental database presented in thisstudy represented automated customer service in-quiries for credit card and telephone numbers.This is a subset of a ®eld trial involving AT&Tcustomers responding to the open-ended prompt``How May I Help You?''. For this spoken lan-guage dialog, maintaining system robustness im-plies enabling any customer to access informationwhen calling from anywhere. This target demandsmaintaining robustness at several levels of systemdesign, including acoustic, language, understand-ing and dialogue. In this paper, we presentedmethods for acoustic modeling, discriminativetraining, class-based language modeling, utterance

9 Performing the validation process on the transcribed data,

as opposed to the recognized strings, resulted in 76% validity.

This indicates that misrecognition in our system caused 15%

(76ÿ 61) of the requests to be misinterpreted.

Table 2
Digit and string error rates (%) for valid and invalid telephone numbers

          Valid (top candidate)    Invalid
Digit     0.6                      18.5
String    5.5                      55.0


verification, numeric understanding and string validation. Collectively, these building blocks generate several string hypotheses which are passed to the dialogue manager along with confidence measures and information on whether or not these strings represent valid numbers. The action set of the dialogue manager includes call completion, disambiguation, confirmation, clarification and error recovery (Abella and Gorin, 1997).
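The recognizer-to-dialogue-manager interface just described can be pictured with a small data structure. This is an expository sketch, not our deployed implementation; the type and field names (NumericHypothesis, confidence, is_valid) and the confidence threshold are assumptions made for illustration.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class NumericHypothesis:
        # One entry of the N-best list passed to the dialogue manager.
        string: str        # interpreted numeric string, e.g. "9085551234"
        confidence: float  # utterance-verification score in [0, 1]
        is_valid: bool     # accepted by the number-validation grammar?

    def choose_action(nbest: List[NumericHypothesis],
                      threshold: float = 0.9) -> str:
        # Toy policy over the action set cited above.
        best = nbest[0]
        if best.is_valid and best.confidence >= threshold:
            return "call completion"
        if any(h.is_valid for h in nbest):
            return "confirmation"   # offer the first valid candidate
        return "error recovery"     # e.g., re-prompt the caller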

Experimental results on 626 utterances support the following conclusions:
1. Improved acoustic modeling and discriminative training for the numeric language lead to a 20–30% reduction in the numeric error rate. They also provide higher interpretation accuracy of the numeric data and reduce the absolute understanding error rate by 10%.
2. The combination of five verification features using an HME provides an accurate utterance verification system for identifying incorrectly interpreted strings. At an operating point of 10% false acceptance, for example, the detection rate increases from a baseline of 74% to over 90%.
3. Validating the output of the understanding system through a predefined grammar of telephone and card numbers helps tremendously in reducing the overall system error rate (a minimal sketch of such a check follows this list). For the 61% valid telephone numbers at the output of the recognizer, the digit error rate was observed to be 0.6%.
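As an illustration of the validation step in point 3, the following sketch accepts a recognized string only if it matches a North American ten-digit telephone-number pattern. The pattern is an assumption chosen for illustration; the grammar actually used in the system also covers card numbers and is not reproduced here.

    import re

    # Assumed NANP-style pattern for illustration: ten digits, with the
    # area code and exchange not beginning in 0 or 1.
    _PHONE = re.compile(r"[2-9]\d{2}[2-9]\d{6}")

    def is_valid_number(digit_string: str) -> bool:
        # Strip any spaces or hyphens, then require an exact match.
        digits = re.sub(r"\D", "", digit_string)
        return _PHONE.fullmatch(digits) is not None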

It should be pointed out that the majority of the data used in our experiments were collected from field trial experiments with real customers. Accordingly, the set of problems that the system encounters is more challenging than those experienced with data recordings from users who are familiar with either the technology or the system. Although the amount of test data used in this study was rather limited, it pointed to some interesting problems that need to be addressed in spoken language dialogue. It also suggests that maintaining robustness in dialogue systems requires tight integration and communication between the speech recognition module, the understanding module and the dialogue manager. Further research in this direction is clearly needed.

Acknowledgements

The authors would like to thank E. Bocchieri, R. Rose and the Watson development team for their technical help.

References

Abella, A., Gorin, A.L., 1997. Generating semantically consistent inputs to a dialog manager. In: Proceedings of the European Conference on Speech Communication and Technology, pp. 1879–1882.
Bickel, P., Doksum, K., 1977. Mathematical Statistics: Basic Ideas and Selected Topics. Prentice-Hall, Englewood Cliffs, NJ.
Boyce, S., Gorin, A.L., 1996. User interface issues for natural spoken dialogue systems. In: Proceedings of the International Symposium on Spoken Dialogue (ISSD), pp. 65–68.
Buhrke, E., Cardin, R., Normandin, Y., Rahim, M., Wilpon, J., 1994. Application of vector quantized hidden Markov models to the recognition of connected digit strings in the telephone network. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
Cardin, R., Normandin, Y., Millien, E., 1993. Inter-word coarticulation modeling and MMIE training for improved connected digit recognition. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 243–246.
Chou, W., Rahim, M.G., Buhrke, E., 1995. Signal conditioned minimum error rate training. In: Proceedings of the European Conference on Speech Communication and Technology, pp. 495–498.
Chu-Carroll, J., 1999. Form-based reasoning for mixed-initiative dialogue management in information-query systems. In: Proceedings of the European Conference on Speech Communication and Technology, pp. 1519–1522.
Gorin, A.L., 1995. On automated language acquisition. J. Acoust. Soc. Am. 97 (6), 3441–3461.
Gorin, A.L., Riccardi, G., Wright, J.H., 1997. How may I help you? Speech Communication 23, 113–127.
Jordan, M., Jacobs, R., 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comput. 6, 181–214.
Juang, B.-H., Katagiri, S., 1992. Discriminative learning for minimum error classification. IEEE Trans. Acoust., Speech, Signal Process. 40, 3043–3054.
Lamel, L., Rosset, S., Gauvain, J.-L., Bennacef, S., 1999. The LIMSI ARISE system for train travel information. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
Lee, C.-H., Giachin, E., Rabiner, L.R., Pieraccini, R., Rosenberg, A.E., 1992. Improved acoustic modeling for large vocabulary continuous speech recognition. Computer Speech and Language 6 (2), 103–127.
Lleida, E., Rose, R.C., 1995. Utterance verification in continuous speech recognition. In: Proceedings of the IEEE ASR Workshop.
Mitchell, C., Setlur, A., 1999. Improved spelling recognition using a tree-based fast lexical match. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
Os, E., Boves, L., Lamel, L., Baggia, P., 1999. Overview of the ARISE project. In: Proceedings of the European Conference on Speech Communication and Technology, pp. 1527–1530.
Pieraccini, R., Rosenberg, A.E., 1990. Coarticulation models for continuous digit recognition. In: Proceedings of the Acoustical Society of America, p. 106.
Rabiner, L., Juang, B.-H., 1993. Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ.
Rahim, M., 1999a. Recognizing connected digits in a natural spoken dialog. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
Rahim, M., 1999b. Utterance verification for the numeric language in a natural spoken dialogue. In: Proceedings of the European Conference on Speech Communication and Technology, pp. 495–498.
Rahim, M., Juang, B.-H., 1996. Signal bias removal by maximum likelihood estimation for robust speech recognition. IEEE Trans. Speech Audio Process. 4 (1), 19–30.
Rahim, M., Juang, B.-H., Chou, W., Buhrke, E., 1996. Signal conditioning techniques for robust speech recognition. IEEE Signal Process. Lett. 3 (2), 107–109.
Rahim, M., Lee, C.-H., Juang, B.-H., 1997. Discriminative utterance verification for connected digits recognition. IEEE Trans. Speech Audio Process. 5 (3), 266–277.
Rahim, M., Pieraccini, R., Eckert, W., Levin, E., Di Fabbrizio, G., Riccardi, G., Lin, C., Kamm, C., 1999. W99 – a spoken dialogue system for the ASRU99 workshop. In: Proceedings of the IEEE ASRU Workshop.
Ramaswamy, G., Kleindienst, J., Coffman, D., Gopalakrishnan, P., Neti, C., 1999. A pervasive conversational interface for information interaction. In: Proceedings of the European Conference on Speech Communication and Technology, pp. 2662–2666.
Riccardi, G., Bangalore, S., 1998. Automatic acquisition of phrase grammars for stochastic language modeling. In: Proceedings of the ACL Workshop on Very Large Corpora, pp. 188–196.
Riccardi, G., Gorin, A.L., 1999. Stochastic language adaptation over time and state in natural spoken dialog systems. IEEE Trans. Speech Audio Process. 8, 3–10.
Riccardi, G., Pieraccini, R., Bocchieri, E., 1996. Stochastic automata for language modeling. Computer Speech and Language 10, 265–293.
Sankar, A., Lee, C.-H., 1996. A maximum-likelihood approach to stochastic matching for robust speech recognition. IEEE Trans. Speech Audio Process. 4 (3), 190–202.
Sharp, R.D., Bocchieri, E., Castillo, C., Parthasarathy, S., Rath, C., Riley, M., Rowland, J., 1997. The Watson speech recognition engine. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 4065–4068.
Shoup, J., 1980. Phonological aspects of speech recognition. In: Lea, W. (Ed.), Trends in Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ.
Wendemuth, A., Rose, G., Dolfing, J., 1999. Advances in confidence measures for large vocabulary. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
Wright, J.H., Gorin, A.L., Abella, A., 1998. Spoken language understanding within dialogs using a graphical model of task structure. In: Proceedings of the ICSLP.
Wright, J.H., Gorin, A.L., Riccardi, G., 1997. Automatic acquisition of salient grammar fragments for call-type classification. In: Proceedings of Eurospeech, pp. 1419–1422.
