incorporating multiple-hmm acoustic modeling in a modular large vocabulary speech recognition system...

1
INCORPORATING MULTIPLE-HMM ACOUSTIC MODELING IN A INCORPORATING MULTIPLE-HMM ACOUSTIC MODELING IN A MODULAR LARGE VOCABULARY SPEECH RECOGNITION SYSTEM MODULAR LARGE VOCABULARY SPEECH RECOGNITION SYSTEM IN TELEPHONE ENVIRONMENT IN TELEPHONE ENVIRONMENT A. Gallardo-Antolín, J. Ferreiros, J. Macías-Guarasa, R. de Córdoba and J.M. Pardo Grupo de Politécnica de Madrid. Tecnología del Habla. Universidad Spain e-mail: [email protected], {jfl, macias, cordoba, pardo}@die.upm.es B A B A B B P P P 1 SYSTEM ARCHITECTURE SYSTEM ARCHITECTURE Previous work on this topic at EUROSPEECH'99: Flexible Large Vocabulary (up to 10000 words) Speaker independent. Isolated word Telephone speech Two stage - bottom up strategy Using Neural Networks as a novel approach to estimate preselection list length In this paper: Integrating multiple acoustic models in this system gender specific models SUMMARY SUMMARY TRAINING MULTIPLE-HMM ACOUSTIC MODELS TRAINING MULTIPLE-HMM ACOUSTIC MODELS INCORPORATING MULTIPLE-HMMs IN THE INCORPORATING MULTIPLE-HMMs IN THE RECOGNITION STAGE RECOGNITION STAGE EXPERIMENTAL SETUP EXPERIMENTAL SETUP VESTEL database realistic telephone speech corpus: Training set: 5810 utterances: 46,74 % male speakers 53,26 % female speakers Test set: 1434 utterances (vocabulary independent task) Vocabulary composed of 2000, 5000 and 10000 words CONCLUSIONS CONCLUSIONS The use of multiple acoustic models per phonetic unit allows increasing acoustic modeling robustness in difficult tasks. Best results are obtained using: Training stage: Independent-training. PSBU module: Single-set. LA module: Shared Costs. A relative error reduction around 25 % is achieved when using multiple-SCHMM modeling compared to the single-SCHMM system for a dictionary of 10000 words. EXPERIMENTAL RESULTS EXPERIMENTAL RESULTS 3% 5% 7% 9% 11% 13% 15% 17% 19% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% S ize ofpreselection list(% of dictionary size) Inclusion error rate (% ) Single Independent-training Joint-training (a = 1) Joint-training (a = 0,5) Joint-training (a = 1,25) Independent-training vs. Joint- Independent-training vs. Joint- training training Discrete HMMs. Dictionary of 2000 words. Combination 2 in the recognition stage. Alternatives for PSBU and LA Alternatives for PSBU and LA modules modules Discrete HMMs. Dictionary of 2000 words. Independent-training for multiple DHMMs. 3% 5% 7% 9% 11% 13% 15% 17% 19% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% S ize ofpreselection list(% ofdictonary size) Inclusion error rate (% ) Single C om bination 1 C om bination 2 C om bination 3 Single-SCHMM vs. Multiple-SCHMM Single-SCHMM vs. Multiple-SCHMM Independent-training for multiple SCHMMs. Single-set in PSBU module + Shared-costs in LA module. Dictionary of 2000, 5000 and 10000 words. 1% 3% 5% 7% 9% 11% 13% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% S ize ofpreselection list(% ofdictionary size) Inclusion error rate (% ) Single-2000 Multiple-2000 Single-5000 Multiple-5000 Single-10000 Multiple-10000 Joint training (I) Joint training (I) Each set of models is trained using all the utterances contained in the training database. A weighting function controls the influence of each utterance in the modeling of each set. We have developed two different methods for training gender-dependent models: Independent training Independent training Each set of models is trained using only the part of the training database assigned to it. Joint training (and II) Joint training (and II) P A and P B are the likelihoods for the utterance with the set of models A and B, respectively. A and B are the weights to be applied to the reestimation formulae in the training stage. is an adjustment factor, which allows to assign more training data to a particular set. Alternatives for Alternatives for PSBU PSBU Combined-sets Combined-sets Phonetic strings are composed by concatenating models coming from any set. Single-set Single-set Phonetic strings are forced to be generated by only one set of models (the one that produces the best score). Alternatives for LA Alternatives for LA Shared-costs Shared-costs All the allophones have the same behaviour in the LA stage, even if they have been generated from different set of models. Both sets share the same confusion matrix. Set-dependent costs Set-dependent costs Cost are gender-dependent. One square confusion matrix (for “single- set” PSBU strategy) per set. A single rectangular confusion matrix (for “combined sets”).

Post on 19-Dec-2015

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: INCORPORATING MULTIPLE-HMM ACOUSTIC MODELING IN A MODULAR LARGE VOCABULARY SPEECH RECOGNITION SYSTEM IN TELEPHONE ENVIRONMENT A. Gallardo-Antolín, J. Ferreiros,

INCORPORATING MULTIPLE-HMM ACOUSTIC MODELING IN A INCORPORATING MULTIPLE-HMM ACOUSTIC MODELING IN A MODULAR LARGE VOCABULARY SPEECH RECOGNITION MODULAR LARGE VOCABULARY SPEECH RECOGNITION

SYSTEM IN TELEPHONE ENVIRONMENTSYSTEM IN TELEPHONE ENVIRONMENTA. Gallardo-Antolín, J. Ferreiros, J. Macías-Guarasa, R. de Córdoba and J.M. Pardo

Grupo de Politécnica de Madrid. Tecnología del Habla. Universidad Spaine-mail: [email protected], {jfl, macias, cordoba, pardo}@die.upm.es

BABA

BB PP

P

1

SYSTEM ARCHITECTURESYSTEM ARCHITECTUREPrevious work on this topic at EUROSPEECH'99:

Flexible Large Vocabulary (up to 10000 words) Speaker independent. Isolated word Telephone speech Two stage - bottom up strategy Using Neural Networks as a novel approach to

estimate preselection list length

In this paper: Integrating multiple acoustic models in

this system gender specific models

SUMMARYSUMMARY

TRAINING MULTIPLE-HMM ACOUSTIC MODELSTRAINING MULTIPLE-HMM ACOUSTIC MODELS

INCORPORATING MULTIPLE-HMMs IN THE INCORPORATING MULTIPLE-HMMs IN THE RECOGNITION STAGERECOGNITION STAGE

EXPERIMENTAL SETUPEXPERIMENTAL SETUP VESTEL database realistic telephone speech

corpus:

Training set: 5810 utterances: 46,74 % male speakers 53,26 % female speakers

Test set: 1434 utterances (vocabulary independent task)

Vocabulary composed of 2000, 5000 and 10000 words

CONCLUSIONSCONCLUSIONS The use of multiple acoustic models per phonetic unit allows

increasing acoustic modeling robustness in difficult tasks.

Best results are obtained using: Training stage: Independent-training. PSBU module: Single-set. LA module: Shared Costs.

A relative error reduction around 25 % is achieved when using multiple-SCHMM modeling compared to the single-SCHMM system for a dictionary of 10000 words.

EXPERIMENTAL RESULTSEXPERIMENTAL RESULTS

3%

5%

7%

9%

11%

13%

15%

17%

19%

1% 2% 3% 4% 5% 6% 7% 8% 9% 10%

Size of preselection list (% of dictionary size)

Incl

usi

on

err

or

rate

(%

)

Single

Independent-training

Joint-training (a = 1)

Joint-training (a = 0,5)

Joint-training (a = 1,25)

Independent-training vs. Joint-trainingIndependent-training vs. Joint-training Discrete HMMs. Dictionary of 2000 words. Combination 2 in the recognition stage.

Alternatives for PSBU and LA modulesAlternatives for PSBU and LA modules Discrete HMMs. Dictionary of 2000 words. Independent-training for multiple DHMMs.

3%

5%

7%

9%

11%

13%

15%

17%

19%

1% 2% 3% 4% 5% 6% 7% 8% 9% 10%Size of preselection list (% of dictonary size)

Incl

usi

on

err

or

rate

(%

) Single

Combination 1

Combination 2

Combination 3

Single-SCHMM vs. Multiple-SCHMMSingle-SCHMM vs. Multiple-SCHMM Independent-training for multiple SCHMMs. Single-set in PSBU module + Shared-costs in LA module. Dictionary of 2000, 5000 and 10000 words.

1%

3%

5%

7%

9%

11%

13%

1% 2% 3% 4% 5% 6% 7% 8% 9% 10%

Size of preselection list (% of dictionary size)

Incl

usi

on

err

or

rate

(%

)

Single-2000 Multiple-2000

Single-5000 Multiple-5000

Single-10000 Multiple-10000

Joint training (I)Joint training (I) Each set of models is trained using all the

utterances contained in the training database.

A weighting function controls the influence of each utterance in the modeling of each set.

We have developed two different methods for training gender-dependent models:

Independent trainingIndependent training Each set of models is trained using only the part

of the training database assigned to it.

Joint training (and II)Joint training (and II)

PA and PB are the likelihoods for the utterance with the set of models A and B, respectively.

A and B are the weights to be applied to the reestimation formulae in the training stage.

is an adjustment factor, which allows to assign more training data to a particular set.

Alternatives for PSBUAlternatives for PSBUCombined-setsCombined-sets Phonetic strings are composed by

concatenating models coming from any set.

Single-setSingle-set Phonetic strings are forced to be generated by

only one set of models (the one that produces the best score).

Alternatives for LAAlternatives for LAShared-costsShared-costs All the allophones have the same behaviour in

the LA stage, even if they have been generated from different set of models.

Both sets share the same confusion matrix.

Set-dependent costsSet-dependent costs Cost are gender-dependent.

One square confusion matrix (for “single-set” PSBU strategy) per set.

A single rectangular confusion matrix (for “combined sets”).