a new verification-based fast-match for large vocabulary continuous speech recognition
Post on 02-Jan-2016
A New Verification-Based Fast-Match for Large Vocabulary Continuous Speech Recognition
Reporter: CHEN, TZAN HWEI
Authors: M. Afify, F. Liu, H. Jiang and O. Siohan
References
M. Afify, F. Liu, H. Jiang and O. Siohan, “A New Verification-Based Fast-Match for Large Vocabulary Continuous Speech Recognition”, SAP 2005.
H. Ney and S. Ortmanns, “Progress in dynamic programming search for LVCSR”, Proc. IEEE, 2000.
S. Ortmanns, A. Eiden, H. Ney and N. Coenen, “Look-ahead techniques for fast beam search”, ICASSP 1997.
Outline
Introduction to LVCSR
Proposed fast-match
Implementation of the fast-match
Experiment
Conclusion
Introduction to LVCSR
In the statistical approach to automatic speech recognition, the best word sequence $\tilde{W}$ for an acoustic observation sequence $X$ is chosen by
$$\tilde{W} = \arg\max_{W} P(W \mid X)$$
Large vocabulary applications, on the order of several thousand words, result in a very large state space.
Look-ahead techniques are among the popular ways of reducing the search space.
Introduction to LVCSR (cont)
Structure of a phoneme and search space
Introduction to LVCSR (cont)
Tree organized pronunciation lexicon
Introduction to LVCSR (cont)
Look-ahead: we look ahead in time, using acoustic and/or language model probabilities, to predict which hypotheses will score poorly in the future, and discard them from detailed evaluation.
This paper discusses only acoustic look-ahead, also called simply fast-match (FM).
A good FM should accelerate the computation with minimal loss of accuracy.
Introduction to LVCSR (cont)
Look-ahead (cont)
Introduction to LVCSR (cont)
Look-ahead (cont)
Global fast-match (GFM): combines the node score with the look-ahead score in making the pruning decision.
Local fast-match (LFM): uses only the local look-ahead score in making the fast-match pruning decision.
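The difference between the two pruning rules can be sketched in a few lines of Python. This is only an illustration: the function names, the threshold comparison for LFM, and the beam-style comparison for GFM are assumptions, since the slides state only which scores enter each decision.

```python
def lfm_keep(lookahead_score: float, threshold: float) -> bool:
    """Local fast-match: decide from the look-ahead score alone."""
    return lookahead_score >= threshold


def gfm_keep(node_score: float, lookahead_score: float,
             best_combined: float, beam: float) -> bool:
    """Global fast-match: add node and look-ahead scores, then compare
    against the best combined score (beam pruning is an assumption here)."""
    return node_score + lookahead_score >= best_combined - beam


# Hypothetical log-domain scores for one candidate phoneme arc.
print(lfm_keep(lookahead_score=-12.0, threshold=-15.0))
print(gfm_keep(node_score=-40.0, lookahead_score=-12.0,
               best_combined=-45.0, beam=10.0))
```

With LFM a hypothesis can be discarded before its node score is consulted at all, which is what makes the local rule cheap.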
Introduction to LVCSR (cont)
Global fast-match (GFM)
Proposed fast-match
Hypothesis testing: a general statistical framework for deciding among several hypotheses based on some observations.
In general, binary hypothesis testing chooses one of two hypotheses, usually referred to as the null hypothesis $H_0$ and the alternative hypothesis $H_1$.
The decision is based on the likelihood ratio
$$\frac{P(X \mid H_0)}{P(X \mid H_1)}$$
Proposed fast-match (cont)
Hypothesis testing (cont): two types of errors can occur
The probability of false alarm: $p_F = \Pr(\text{say } H_0 \mid H_1 \text{ is true})$
The probability of miss: $p_M = \Pr(\text{say } H_1 \mid H_0 \text{ is true})$
Here, we want to minimize $p_M$.
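To make the two error types concrete, the sketch below estimates them empirically for a threshold test. It assumes that the null hypothesis is accepted when a score reaches the threshold, that a miss means rejecting the null hypothesis when it is true, and that a false alarm means accepting it when it is false. The score lists are made-up illustration data.

```python
def error_rates(null_scores, alt_scores, threshold):
    """Return (p_miss, p_false_alarm) for the rule
    'accept H0 if score >= threshold'."""
    misses = sum(1 for s in null_scores if s < threshold)
    false_alarms = sum(1 for s in alt_scores if s >= threshold)
    return misses / len(null_scores), false_alarms / len(alt_scores)


# Hypothetical test scores, labelled by which hypothesis was actually true.
null_scores = [2.1, 1.4, 0.3, 3.0, -0.5]
alt_scores = [-2.0, -1.1, 0.6, -3.2, -0.4]

for threshold in (1.0, 0.0, -1.0):
    p_miss, p_fa = error_rates(null_scores, alt_scores, threshold)
    print(threshold, p_miss, p_fa)
```

Lowering the threshold reduces misses at the cost of more false alarms, which is the trade-off a fast-match has to tune.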
Proposed fast-match (cont)
For a fast-match, the null and alternative hypotheses for a phoneme $\alpha$ can be written as:
$$H_0: \alpha \text{ starts at time } t$$
$$H_1: \alpha \text{ does not start at time } t$$
The first step toward developing a likelihood ratio test for the above hypotheses is to define suitable probability distributions for both hypotheses.
Proposed fast-match (cont)
For $H_0$: $P(X_t^{t+d_1} \mid \alpha \text{ starts at } t)$
For $H_1$: $P(X_t^{t+d_2} \mid \alpha \text{ does not start at } t)$
where $X_a^b$ represents the acoustic observations in the interval $[a, b]$, and $d_1$ and $d_2$ (not necessarily equal) are possible durations of both events.
As phonemes are known to have variable length, it is not possible to determine the values of both durations beforehand. Here, we adopt the fixed duration approach.
In this case the two distributions reduce to $P(X_t^{t+d} \mid \alpha)$ and $P(X_t^{t+d} \mid \bar{\alpha})$, where we set $d = d_1 = d_2$, and $\alpha$ and $\bar{\alpha}$ denote the models of the null and alternative hypotheses, respectively.
Proposed fast-match (cont)
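We define $S_t(\alpha)$ as the log probability that $\alpha$ starts at $t$ and ends at $t + d_\alpha$. Hence, we write
$$S_t(\alpha) = \log P(X_t^{t+d_\alpha} \mid \alpha)$$
where $X_t^{t+d_\alpha}$ is the observation sequence in $[t, t+d_\alpha]$ and $d_\alpha$ is the look-ahead duration of $\alpha$.
The log likelihood ratio $L_t(\alpha)$ can be written as
$$L_t(\alpha) = S_t(\alpha) - S_t(\bar{\alpha}) \qquad (1)$$
where $S_t(\bar{\alpha}) = \log P(X_t^{t+d_\alpha} \mid \bar{\alpha})$, and $\bar{\alpha}$ stands for the alternate hypothesis model of phoneme $\alpha$ and hence represents the alternate hypothesis $H_1$.
The likelihood ratio test reduces to accepting $H_0$ when $L_t(\alpha) \ge \tau_\alpha$ and deciding $H_1$ otherwise, where $\tau_\alpha$ is the decision threshold.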
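A minimal sketch of the test in (1) follows. It assumes one-dimensional features, a single Gaussian for both the phoneme and the anti-phoneme model, and independent frames within the look-ahead window; the model parameters, frames, and threshold are hypothetical values chosen for illustration.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def window_score(frames, t, d, mean, var):
    """Log probability of frames t .. t+d under one model,
    treating the frames as independent (an assumption)."""
    return sum(log_gauss(x, mean, var) for x in frames[t:t + d + 1])


def llr(frames, t, d, phone, anti):
    """Log likelihood ratio: phoneme score minus anti-phoneme score."""
    return window_score(frames, t, d, *phone) - window_score(frames, t, d, *anti)


frames = [0.1, 0.9, 1.1, 1.0, 0.8, -0.2, 0.0]
phone = (1.0, 0.25)  # hypothetical (mean, variance) of the phoneme model
anti = (0.0, 1.0)    # hypothetical anti-phoneme model
d, tau = 3, 0.0      # look-ahead duration in frames, decision threshold

for t in range(len(frames) - d):
    ratio = llr(frames, t, d, phone, anti)
    print(t, round(ratio, 3), "accept H0" if ratio >= tau else "prune")
```

Windows that overlap the frames near 1.0 favour the phoneme model, while the window starting at the last admissible frame falls below the threshold and would be pruned.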
Proposed fast-match (cont)
The null and alternate hypothesis scores are calculated for every time instance, so it is attractive to calculate the score at time $t+1$ incrementally from the corresponding score at time $t$.
If a phoneme is represented by a 1-state HMM, the incremental calculation reduces to the following very simple formula
$$S_{t+1}(\alpha) = S_t(\alpha) - \log p(x_t \mid \alpha) + \log p(x_{t+d_\alpha+1} \mid \alpha) \qquad (2)$$
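The incremental update can be checked numerically against a direct recomputation. The sketch assumes a 1-state model with a single 1-D Gaussian and reads the update as "subtract the frame that leaves the window, add the frame that enters it"; the data and parameters are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def direct_score(frames, t, d, mean, var):
    """Window score over frames t .. t+d, computed from scratch."""
    return sum(log_gauss(x, mean, var) for x in frames[t:t + d + 1])


def incremental_scores(frames, d, mean, var):
    """Return the score of every full window, sliding one frame at a time."""
    s = direct_score(frames, 0, d, mean, var)
    scores = [s]
    for t in range(len(frames) - d - 1):
        # Drop frame t, add frame t+d+1 to move from window t to window t+1.
        s = s - log_gauss(frames[t], mean, var) + log_gauss(frames[t + d + 1], mean, var)
        scores.append(s)
    return scores


frames = [0.1, 0.9, 1.1, 1.0, 0.8, -0.2, 0.0]
d, mean, var = 3, 1.0, 0.25
inc = incremental_scores(frames, d, mean, var)
ref = [direct_score(frames, t, d, mean, var) for t in range(len(frames) - d)]
print(max(abs(a - b) for a, b in zip(inc, ref)))
```

Each step costs two density evaluations instead of a sum over the whole window, independent of the look-ahead duration.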
Proposed fast-match (cont)
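In turn, the likelihood ratio can also be incrementally calculated as
$$L_{t+1}(\alpha) = L_t(\alpha) - q(x_t \mid \alpha) + q(x_{t+d_\alpha+1} \mid \alpha) \qquad (3)$$
where
$$q(x \mid \alpha) = \log p(x \mid \alpha) - \log p(x \mid \bar{\alpha}) \qquad (4)$$
The probability $p(x \mid \alpha)$ can be calculated as
$$p(x \mid \alpha) = \sum_{m=1}^{M} c_m N(x, \mu_m, \Sigma_m) \approx \max_{m} c_m N(x, \mu_m, \Sigma_m)$$
where $c_m$, $\mu_m$ and $\Sigma_m$ are the weight, mean and covariance of the $m$-th component of an $M$-component Gaussian mixture.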
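The effect of replacing the mixture sum by its largest component can be seen in a small sketch. It assumes a one-dimensional Gaussian mixture with hypothetical weights, means, and variances; both functions work in the log domain.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gmm_loglik_sum(x, weights, means, variances):
    """Exact mixture log-likelihood via a numerically stable log-sum-exp."""
    comps = [math.log(c) + log_gauss(x, m, v)
             for c, m, v in zip(weights, means, variances)]
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))


def gmm_loglik_max(x, weights, means, variances):
    """Approximation that keeps only the best-scoring component."""
    return max(math.log(c) + log_gauss(x, m, v)
               for c, m, v in zip(weights, means, variances))


# Hypothetical 3-component mixture.
weights, means, variances = [0.5, 0.3, 0.2], [0.0, 1.0, 3.0], [1.0, 0.5, 2.0]
for x in (-1.0, 0.8, 3.5):
    exact = gmm_loglik_sum(x, weights, means, variances)
    approx = gmm_loglik_max(x, weights, means, variances)
    print(x, round(exact, 4), round(approx, 4))
```

The approximation never exceeds the exact value and underestimates it by at most the log of the number of components, while avoiding the exponentials needed for the sum.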
Implementation of the fast-match
Definition of alternate hypothesis or anti-phoneme models.
Parameter estimation of the phoneme and anti-phoneme Gaussian mixture models.
Determining the phoneme look-ahead durations and decision thresholds.
Implementation of the fast-match (cont)
Design of anti-phoneme models:
A general trend in their design is to consider either phoneme-specific models or a shared model (background model).
In initial experiments we obtained similar results, in terms of speed-accuracy trade-off, for both the background model and the phoneme-specific models.
Implementation of the fast-match (cont)
Parameter estimation of the phoneme and anti-phoneme models:
First, a set of training utterances is segmented into phoneme units using forced alignment.
The training data for each phoneme is then defined by collecting all segments belonging to this phoneme.
For constructing a general background model, all training data are pooled together.
The models are trained by maximum likelihood (ML) estimation.
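The data flow above can be sketched as follows. To stay short, the sketch fits a single 1-D Gaussian per model with closed-form ML estimates rather than a Gaussian mixture (which would normally be trained with EM); the segment list stands in for forced-alignment output and is entirely hypothetical.

```python
from collections import defaultdict


def ml_gaussian(samples):
    """Closed-form ML estimate (mean, variance) of a 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


# Hypothetical forced-alignment output: (phoneme label, feature frames).
segments = [("a", [1.0, 1.2]), ("k", [-0.5, -0.7, -0.6]), ("a", [0.8])]

# Collect all frames belonging to each phoneme.
frames_by_phone = defaultdict(list)
for label, frames in segments:
    frames_by_phone[label].extend(frames)

# One model per phoneme, plus a background model on the pooled data.
phone_models = {p: ml_gaussian(f) for p, f in frames_by_phone.items()}
background = ml_gaussian([x for f in frames_by_phone.values() for x in f])
print(phone_models, background)
```

Using the pooled background model as the anti-phoneme model means only one extra model has to be evaluated per frame, regardless of the number of phonemes.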
Implementation of the fast-match (cont)
Calculation of look-ahead durations and decision thresholds:
After the segments belonging to each phoneme are identified using forced alignment of the training data, the look-ahead duration is computed as the average duration of these segments.
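As a small illustration, the average segment duration per phoneme can be computed directly from alignment boundaries. The alignment tuples, the frame units, and the rounding to a whole number of frames are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical forced-alignment output:
# (phoneme, start_frame, end_frame), with end_frame exclusive.
alignment = [("a", 0, 8), ("k", 8, 12), ("a", 12, 22), ("k", 22, 28)]

durations = defaultdict(list)
for phone, start, end in alignment:
    durations[phone].append(end - start)

# Look-ahead duration per phoneme: mean segment length, rounded to frames.
lookahead = {p: round(sum(d) / len(d)) for p, d in durations.items()}
print(lookahead)
```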
Implementation of the fast-match (cont)
Calculation of look-ahead durations and decision thresholds (cont):
For each phoneme we evaluate the score, as in (1), of all segments in the training set belonging to this phoneme.
We calculate the mean $\mu_\alpha$ and the standard deviation $\sigma_\alpha$ of the scores of these segments.
The threshold is calculated as $\tau_\alpha = \mu_\alpha - n\,\sigma_\alpha$, where $n$ is used to trade off speed and accuracy.
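A sketch of the threshold computation follows, assuming the threshold is the mean score minus n standard deviations (so that a larger n prunes less) and using the population standard deviation. The segment scores are hypothetical.

```python
import statistics


def decision_threshold(scores, n):
    """Mean of the training-segment scores minus n standard deviations
    (the exact form of the rule is an assumption of this sketch)."""
    return statistics.mean(scores) - n * statistics.pstdev(scores)


# Hypothetical likelihood-ratio scores of one phoneme's training segments.
scores = [4.0, 6.0, 5.0, 7.0, 3.0]
for n in (1, 2, 3, 3.5, 4):
    print(n, decision_threshold(scores, n))
```

A small n gives a high threshold, hence aggressive pruning and more speed; a large n keeps more hypotheses and protects accuracy.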
Experiment
The method is tested on a Japanese broadcast news transcription task with a vocabulary of 20000 words.
Training and test speech data, in addition to the trigram language model, are provided by the Japan Broadcasting Corporation (NHK).
Experiments (cont)
First, we illustrate the behavior of the fast match algorithm on a small development set, and describe how tuning the threshold affects the performance of the system.
The training data consists of 90 h of speech.
The test set consists of 162 utterances from male speakers in a clean studio environment.
Experiments (cont)
The development set perplexity is about 34 and the out-of-vocabulary (OOV) rate is 0.76%.
The baseline system runs in about 0.79 times real-time, for a WER of 4.04%.
The Gaussian mixture size is set to 8, 16, and 32, while the threshold is calculated as discussed before, with $n \in \{1, 2, 3, 3.5, 4\}$.
Experiments (cont)
Fig. 2. Percentage word error rate (WER) and real time factor (RT) for the fast-match with mixture sizes 8, 16, and 32, and for different thresholds on the development set.
Experiments (cont)
Fig. 3. Percentage word error rate (WER) and real time factor (RT). Both likelihood ratio and likelihood scores are used here for the fast-match, for comparison.
Experiments (cont)
In a second series of experiments, we illustrate the performance of the proposed fast-match algorithm on a much larger data set.
TABLE I. LIST OF ALL EVALUATION CONDITIONS AND NUMBER OF TEST UTTERANCES PER CONDITION
Experiments (cont)
The test perplexity varies from less than 10 to about 80 depending on the environment, with OOV rates ranging from 0.25% to 2.5%
The acoustic models used for recognition are built on about 170 hours of training data.
The threshold parameter $n$ is set to 3.5.
Experiments (cont)
TABLE II. WORD ERROR RATE (WER) AND REAL-TIME (RT) FACTOR ON NHK EVALUATION TEST SET WITH AND WITHOUT FAST-MATCH
Conclusion
It is shown that the test at the current frame can be incrementally calculated from the previous frame using very simple computation.
It offers robustness, as evidenced in the multi-environment experiments.