A New Verification-Based Fast-Match for Large Vocabulary Continuous Speech Recognition


A New Verification-Based Fast-Match for Large Vocabulary Continuous Speech Recognition

Reporter: CHEN, TZAN HWEI

Author: M. Afify, F. Liu, H. Jiang and O. Siohan

Reference

M. Afify, F. Liu, H. Jiang and O. Siohan, “A New Verification-Based Fast-Match for Large Vocabulary Continuous Speech Recognition,” IEEE Trans. on Speech and Audio Processing, 2005.

H. Ney and S. Ortmanns, “Progress in dynamic programming search for LVCSR,” Proc. IEEE, 2000.

S. Ortmanns, A. Eiden, H. Ney and N. Coenen, “Look-ahead techniques for fast beam search,” ICASSP 1997.


Outline

Introduction to LVCSR

Proposed fast-match

Implementation of the fast-match

Experiment

Conclusion


Introduction to LVCSR

In the statistical approach to automatic speech recognition, the best word sequence is chosen by

$\hat{W} = \arg\max_{W} P(W \mid X)$

Large vocabulary applications, with vocabularies on the order of several thousand words, result in a very large state-space.

Look-ahead techniques are among the popular ways of reducing the search space.
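As a toy illustration of this decision rule (all scores and word sequences below are made up for the example), the search picks the hypothesis maximizing $\log P(X \mid W) + \log P(W)$ in log-space:

```python
# Toy Bayes decision rule: choose W maximizing P(W|X) ∝ P(X|W)·P(W),
# computed in log-space for numerical stability. Scores are hypothetical.
hypotheses = {
    ("the", "cat"): {"log_acoustic": -120.0, "log_lm": -4.0},
    ("the", "cap"): {"log_acoustic": -118.0, "log_lm": -9.0},
    ("a", "cat"):   {"log_acoustic": -125.0, "log_lm": -5.0},
}

def best_word_sequence(hyps):
    # argmax_W [ log P(X|W) + log P(W) ]
    return max(hyps, key=lambda w: hyps[w]["log_acoustic"] + hyps[w]["log_lm"])

print(best_word_sequence(hypotheses))  # -> ('the', 'cat')
```

In a real LVCSR decoder this maximization runs over an enormous state-space, which is exactly why pruning techniques such as look-ahead are needed.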


Introduction to LVCSR (cont)

Structure of a phoneme and search space


Introduction to LVCSR (cont)

Tree organized pronunciation lexicon


Introduction to LVCSR (cont)

Look-ahead: We “look ahead” in time, using some acoustic and/or language model probabilities, to predict hypotheses that will score poorly in the future, and hence discard them from detailed evaluation.

In this paper, we discuss only “acoustic look-ahead”, also simply called the “fast-match” (FM).

A “good” FM should accelerate the computation with minimal loss of accuracy.


Introduction to LVCSR (cont)

Look-ahead (cont)


Introduction to LVCSR (cont)

Look-ahead (cont)

Global fast-match (GFM): combines the node score with the look-ahead score in making the pruning decision.

Local fast-match (LFM): only the local look-ahead score is used in making the fast-match pruning decision.
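The two pruning rules can be sketched as follows (a hedged sketch: the function names and the beam parameterization are ours, not the paper's):

```python
# GFM combines the accumulated node score with the look-ahead score;
# LFM prunes on the local look-ahead score alone.

def gfm_keep(node_score, lookahead_score, best_combined, beam):
    """Global fast-match: survive if the combined score is within a
    beam of the best combined score seen so far."""
    return node_score + lookahead_score >= best_combined - beam

def lfm_keep(lookahead_score, threshold):
    """Local fast-match: survive if the local look-ahead score alone
    clears a threshold; the node score is ignored."""
    return lookahead_score >= threshold
```

Under GFM a locally poor look-ahead score can be compensated by a good node score, while LFM makes a purely local decision, which is cheaper but less informed.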


Introduction to LVCSR (cont)

Global fast-match (GFM)


Proposed fast-match

Hypothesis testing: a general statistical framework for deciding among several hypotheses based on some observations.

In general, binary hypothesis testing chooses one among two hypotheses, usually referred to as the null and alternative hypotheses, $H_0$ and $H_1$. The decision is based on the likelihood ratio

$\dfrac{P(X \mid H_0)}{P(X \mid H_1)}$

Proposed fast-match (cont)

Hypothesis testing (cont): Two types of errors can occur:

The probability of false alarm: $p_F = \Pr(\text{say } H_1 \mid H_0 \text{ is true})$

The probability of miss: $p_M = \Pr(\text{say } H_0 \mid H_1 \text{ is true})$

Here, we want to minimize $p_M$.

Proposed fast-match (cont)

For a fast-match, the null and alternative hypotheses for a phoneme $a$ can be written as:

$H_0$: $a$ starts at time $t$

$H_1$: $a$ does not start at time $t$

The first step toward developing a likelihood ratio test for the above hypothesis testing is to define suitable probability distributions for both hypotheses.

Proposed fast-match (cont)

For $H_0$: $P(X_t^{t+d_1} \mid a \text{ starts at } t)$

For $H_1$: $P(X_t^{t+d_2} \mid a \text{ does not start at } t)$

where $X_a^b$ represents the acoustic observations in the interval $[a, b]$, and $d_1$ and $d_2$ are (not necessarily equal) possible durations of the two events.

As phonemes are known to have variable length, it is not possible to determine the values of both durations beforehand. Here, we adopt the fixed-duration approach.

In this case, the two distributions reduce to $P(X_t^{t+d} \mid \lambda_a)$ and $P(X_t^{t+d} \mid \bar{\lambda}_a)$, where we set $d_1 = d_2 = d$, and $\lambda_a$ and $\bar{\lambda}_a$ denote the models of the null and alternative hypotheses, respectively.

Proposed fast-match (cont)

We define $S(t)$ as the log probability that $a$ starts at $t$ and ends at $t + d - 1$. Hence, we write $S(t) = \log P(X_t^{t+d} \mid \lambda_a)$, where $X_t^{t+d}$ is the observation sequence in the interval $[t, t+d]$ and $d$ is the look-ahead duration of $a$.

The log likelihood ratio $L(t)$ can be written as

$L(t) = S(t) - \bar{S}(t) \qquad (1)$

where $\bar{S}(t) = \log P(X_t^{t+d} \mid \bar{\lambda}_a)$, and $\bar{\lambda}_a$ stands for the alternate hypothesis model of phoneme $a$; hence $\bar{S}(t)$ represents the alternate hypothesis score.

The likelihood ratio test reduces to accepting $H_0$ when $L(t) \geq \tau$, and deciding $H_1$ otherwise, where $\tau$ is the decision threshold.
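A minimal sketch of the test in (1), assuming fixed-duration windows and illustrative per-frame log-density functions (all names here are ours, not the paper's):

```python
def log_likelihood_window(frames, log_pdf, d):
    """Sum per-frame log-likelihoods over a look-ahead window of d frames."""
    return sum(log_pdf(x) for x in frames[:d])

def fast_match_accept(frames, log_pdf_phone, log_pdf_anti, d, threshold):
    """Accept H0 (phoneme starts here) when L(t) = S(t) - S_bar(t) >= threshold."""
    s = log_likelihood_window(frames, log_pdf_phone, d)     # S(t), null hypothesis
    s_bar = log_likelihood_window(frames, log_pdf_anti, d)  # S_bar(t), alternative
    return (s - s_bar) >= threshold
```

For example, with toy densities `log_pdf_phone = lambda x: -abs(x - 1.0)` and `log_pdf_anti = lambda x: -abs(x)`, windows of frames near 1.0 clear a zero threshold and windows near 0.0 do not.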

Proposed fast-match (cont)

The null and alternate hypothesis scores are calculated for every time instance, and hence it would be interesting to incrementally calculate the score at time $t+1$ from the corresponding score at time $t$.

If a phoneme is represented by a 1-state HMM, the incremental calculation reduces to the following very simple formula:

$S(t+1) = S(t) + \log p(x_{t+d} \mid \lambda_a) - \log p(x_t \mid \lambda_a) \qquad (2)$

Proposed fast-match (cont)

In turn, the likelihood ratio can also be incrementally calculated as

$L(t+1) = L(t) + q(x_{t+d}) - q(x_t) \qquad (3)$

where

$q(x) = \log p(x \mid \lambda) - \log p(x \mid \bar{\lambda}) \qquad (4)$

The probability $p(x \mid \lambda)$ can be calculated as

$p(x \mid \lambda) = \sum_{m=1}^{M} c_m N(x; \mu_m, \Sigma_m) \approx \max_{m} c_m N(x; \mu_m, \Sigma_m)$
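The incremental update (3)-(4), together with the max approximation of the mixture likelihood, can be sketched as follows (1-D features and the mixture parameters are illustrative, not the paper's):

```python
import math

def log_gauss(x, mean, var):
    """Log of a 1-D Gaussian density."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_mixture_max(x, mixture):
    # log p(x|λ) ≈ max_m [ log c_m + log N(x; μ_m, σ²_m) ]
    return max(math.log(c) + log_gauss(x, mu, var) for c, mu, var in mixture)

def q(x, phone_gmm, anti_gmm):
    # q(x) = log p(x|λ) - log p(x|λ̄)   (4)
    return log_mixture_max(x, phone_gmm) - log_mixture_max(x, anti_gmm)

def ratio_scores(frames, phone_gmm, anti_gmm, d):
    """Yield L(t) for all window positions, updating incrementally as in (3)."""
    L = sum(q(x, phone_gmm, anti_gmm) for x in frames[:d])  # initial window
    scores = [L]
    for t in range(len(frames) - d):
        # L(t+1) = L(t) + q(x_{t+d}) - q(x_t): slide the window by one frame
        L += q(frames[t + d], phone_gmm, anti_gmm) - q(frames[t], phone_gmm, anti_gmm)
        scores.append(L)
    return scores
```

Each step costs two evaluations of $q$ instead of re-scoring the entire $d$-frame window, which is where the speed-up of the incremental form comes from.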

Implementation of the fast-match

Definition of alternate hypothesis or anti-phoneme models.

Parameter estimation of the phoneme and anti-phoneme Gaussian mixture models.

Determining the phoneme look-ahead durations and decision thresholds.


Implementation of the fast-match (cont)

Design of anti-phoneme models: A general trend in their design is to consider either phoneme-specific models or a shared model (background model).

In initial experiments, we obtained similar results, in terms of speed-accuracy trade-off, for both the background model and the phoneme-specific model.


Implementation of the fast-match (cont)

Parameter estimation of phoneme and anti-phoneme model :

First, a set of training utterances is segmented into phoneme units using forced alignment.

The training data for each phoneme is then defined by collecting all segments belonging to this phoneme.

For constructing a general background model, all training data are put together.

The models are trained by maximum likelihood (ML) estimation.
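The training steps above can be sketched as follows (a minimal sketch: the alignment output is assumed given, the data layout is hypothetical, and a single diagonal Gaussian per phoneme stands in for the paper's Gaussian mixtures):

```python
from collections import defaultdict
from statistics import fmean, pvariance

def train_models(aligned_segments):
    """aligned_segments: list of (phoneme_label, [frame, ...]) pairs, as
    produced by forced alignment. Returns per-phoneme ML models plus a
    shared background (anti-phoneme) model pooled over all data."""
    pooled = defaultdict(list)
    for label, frames in aligned_segments:
        pooled[label].extend(frames)           # per-phoneme training data
        pooled["<background>"].extend(frames)  # all data for the shared anti-model
    # ML estimates for a 1-D Gaussian: sample mean and population variance
    return {label: (fmean(xs), pvariance(xs)) for label, xs in pooled.items()}
```

Swapping the single Gaussian for an EM-trained mixture would follow the same data-collection step; only the estimation changes.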


Implementation of the fast-match (cont)

Calculation of look-ahead duration and decision thresholds:

After the segments belonging to each phoneme are identified using forced alignment of the training data, the look-ahead duration is computed as the average duration of these segments.

Implementation of the fast-match (cont)

Calculation of look-ahead duration and decision thresholds (cont):

For each phoneme, we evaluate the score, as in (1), of all segments in the training set belonging to this phoneme.

We calculate the mean $\mu$ and the standard deviation $\sigma$ of the scores of these segments.

The threshold is calculated as $\mu - n\sigma$, where $n$ is used to trade off the speed and accuracy.
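The threshold rule can be sketched as follows (the form $\mu - n\sigma$ is our reading of the garbled slide; a larger $n$ gives a lower, more permissive threshold):

```python
from statistics import fmean, pstdev

def decision_threshold(segment_scores, n):
    """Per-phoneme threshold from training-set segment scores (eq. (1) scores):
    mean minus n population standard deviations."""
    mu = fmean(segment_scores)
    sigma = pstdev(segment_scores)
    return mu - n * sigma
```

With `n = 0` the threshold sits at the mean score of correct segments, rejecting roughly half of them; increasing `n` moves it down, cutting the miss rate at the cost of weaker pruning.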


Experiments

Tested on a Japanese broadcast news transcription task, whose vocabulary is drawn from 20,000 words.

Training and test speech data, in addition to the trigram language model, are provided by the Japan Broadcasting Corporation (NHK).

Experiments (cont)

First, we illustrate the behavior of the fast-match algorithm on a small development set, and describe how tuning the threshold affects the performance of the system.

The training data consists of 90 h of speech.

The test set consists of 162 utterances from male speakers in a clean studio environment.

Experiments (cont)

The development-set perplexity is about 34 and the out-of-vocabulary (OOV) rate is 0.76%.

The baseline system runs in about 0.79 times real-time, for a WER of 4.04%.

The Gaussian mixture size is set to 8, 16, and 32, while the threshold is taken as discussed before, with $n \in \{4, 3.5, 3, 2, 1\}$.

Experiments (cont)

Fig. 2. Percentage word error rate (WER) and real time factor (RT) for the fast-match with mixture sizes 8, 16, and 32, and for different thresholds on the development set.


Experiments (cont)

Fig. 3. Percentage word error rate (WER) and real-time factor (RT); both likelihood ratio and likelihood scores are used here for the fast-match, for comparison.

Experiments (cont)

In a second series of experiments, we illustrate the performance of the proposed fast-match algorithm on a much larger data set.

TABLE I: LIST OF ALL EVALUATION CONDITIONS AND NUMBER OF TEST UTTERANCES PER CONDITION

Experiments (cont)

The test perplexity varies from less than 10 to about 80 depending on the environment, with OOV rates ranging from 0.25% to 2.5%.

The acoustic models used for recognition are built on about 170 hours of training data.

The threshold parameter $n$ is set to 3.5.

Experiments (cont)

TABLE II: WORD ERROR RATE (WER) AND REAL-TIME (RT) FACTOR ON THE NHK EVALUATION TEST SET WITH AND WITHOUT FAST-MATCH

Conclusion

It is shown that the test at the current frame can be incrementally calculated from the previous frame using a very simple computation.

It offers robustness, as evidenced by the multi-environment experiments.
