Minimum Phoneme Error Based Heteroscedastic Linear Discriminant Analysis For Speech Recognition
Bing Zhang and Spyros Matsoukas,
BBN Technologies, 50 Moulton St., Cambridge
Reporter : Chang Chih Hao
Introduction
• LDA and HLDA
  – Better classification accuracy
  – Some common limitations:
    • Neither assumes any prior knowledge of confusable hypotheses
    • Their objective functions do not directly relate to the word error rate (WER)
• Minimum Phoneme Error (MPE)
  – Minimizes phoneme errors in lattice-based training frameworks
  – Since this criterion is closely related to WER, MPE-HLDA tends to be more robust than other projection methods, which makes it potentially better suited for a wider variety of features.
MPE Objective Function
• MPE-HLDA model: each Gaussian m scores the projected observation x_t = A o_t with projected mean A μ_m and diagonal covariance

      C_m = diag(A Σ_m A^T)

• MPE-HLDA aims at minimizing the expected number of phoneme errors introduced by the MPE-HLDA model in a given hypothesis lattice, or equivalently maximizing the function

      F_MPE(O; A) = Σ_{r=1..R} Σ_{w_r} P(w_r | O_r) · RawAcc(w_r)    (4)

  where R is the total number of training utterances, O_r is the sequence of p-dimensional observation vectors in utterance r, and RawAcc(w_r) is the "raw accuracy" score of word hypothesis w_r.
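As a concrete illustration of the model above, here is a minimal numpy sketch (function name and shapes are my own, not from the paper) that evaluates the MPE-HLDA Gaussian N(A o_t ; A μ_m, diag(A Σ_m A^T)):

```python
import numpy as np

def mpe_hlda_loglik(o_t, A, mu_m, Sigma_m):
    """Log-likelihood of one frame under the MPE-HLDA Gaussian:
    N(A o_t ; A mu_m, C_m) with C_m = diag(A Sigma_m A^T)."""
    p = A.shape[0]
    x = A @ o_t                      # projected observation (p-dim)
    mean = A @ mu_m                  # projected mean
    c = np.diag(A @ Sigma_m @ A.T)   # diagonal variances C_m
    diff = x - mean
    return -0.5 * (p * np.log(2 * np.pi) + np.sum(np.log(c)) + np.sum(diff**2 / c))
```

Note that only the diagonal of A Σ_m A^T is kept, so correlations in the projected space are deliberately ignored, exactly as in diagonal-covariance HMM systems.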
MPE Objective Function
• P(w_r | O_r) is the posterior probability of hypothesis w_r in the lattice:

      P(w_r | O_r) = P(O_r | w_r)^k P(w_r) / Σ_{w'_r} P(O_r | w'_r)^k P(w'_r)

  where P(w_r) is the language model probability of hypothesis w_r, and k is a scaling factor applied in order to reduce the dynamic range of the acoustic scores, thereby avoiding the concentration of all posterior mass in the top-1 hypothesis of the lattice.
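A small sketch of Eq. (4) over an explicit hypothesis list rather than a real lattice (the function name is mine; I assume here that k scales only the acoustic log-scores, matching the slide's motivation):

```python
import numpy as np

def mpe_objective(log_acoustic, log_lm, raw_acc, k=0.1):
    """Expected raw accuracy (Eq. 4) over the hypotheses of one utterance.
    log_acoustic, log_lm, raw_acc: arrays over hypotheses.
    k < 1 flattens the posterior so the top-1 hypothesis does not dominate."""
    log_post = k * np.asarray(log_acoustic) + np.asarray(log_lm)
    log_post -= np.max(log_post)              # numerical stability
    post = np.exp(log_post)
    post /= post.sum()                        # P(w_r | O_r)
    return float(post @ np.asarray(raw_acc))  # expected raw accuracy
```

With k close to 0 the posterior approaches uniform and the objective approaches the mean raw accuracy; with large k it approaches the accuracy of the single best-scoring hypothesis.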
MPE Objective Function
• It can be shown that the derivative of (4) with respect to A is

      ∂F_MPE(O)/∂A = k Σ_r Σ_{q_r} D(q_r, r) · ∂ log P(O_r | q_r, A)/∂A    (6)

  where

      D(q_r, r) = P(q_r | O_r) · ( c̄(q_r, r) − c̄(r) )

  c̄(r) is the MPE score of utterance r (the average accuracy over all hypotheses), and c̄(q_r, r) is the average accuracy over all hypotheses that contain arc q_r.
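The arc differentials D(q, r) can be sketched from an explicit hypothesis list (a toy stand-in for the lattice forward-backward pass; the function name and representation are assumptions):

```python
import numpy as np

def arc_differentials(post, acc, arcs_in_hyp):
    """D(q, r) = P(q | O_r) * (avg acc of hyps containing q - utterance MPE score).
    post[i]     : posterior P(w_i | O_r) of hypothesis i
    acc[i]      : raw accuracy of hypothesis i
    arcs_in_hyp : list of arc-id sets, one per hypothesis
    """
    post = np.asarray(post); acc = np.asarray(acc)
    c_r = float(post @ acc)                  # utterance MPE score c(r)
    D = {}
    for q in sorted(set().union(*arcs_in_hyp)):
        mask = np.array([q in h for h in arcs_in_hyp])
        p_q = post[mask].sum()               # arc posterior P(q | O_r)
        c_q = post[mask] @ acc[mask] / p_q   # avg accuracy of hyps containing q
        D[q] = p_q * (c_q - c_r)
    return D
```

Arcs that appear in better-than-average hypotheses get positive weights, arcs in worse-than-average hypotheses negative ones, which is what steers the projection toward fewer phoneme errors.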
MPE Objective Function
• The arc log-likelihood derivative decomposes over the frames and Gaussians of the arc:

      ∂ log P(O_r | q_r, A)/∂A = Σ_{t=S_{q_r}..E_{q_r}} Σ_m γ_{mt}^{q_r} · ∂ log P(o_t | m)/∂A

  where S_{q_r} and E_{q_r} are the begin and end times of arc q_r, and γ_{mt}^{q_r} denotes the posterior probability of Gaussian m in arc q_r at time t.

• For a single Gaussian m,

      ∂ log P(o_t | m)/∂A = C_m^-1 [ (C_m^-1 P_t^m − I) A Σ_m − A R_t^m ]

  where

      P_t^m = diag( A (o_t − μ_m)(o_t − μ_m)^T A^T )
      R_t^m = (o_t − μ_m)(o_t − μ_m)^T
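The single-Gaussian derivative above can be implemented directly and checked against finite differences; this is a sketch with hypothetical function names, not the paper's code:

```python
import numpy as np

def loglik(A, o, mu, Sigma):
    """MPE-HLDA Gaussian log-likelihood (normalization constant included)."""
    d = A @ (o - mu)
    c = np.diag(A @ Sigma @ A.T)
    return -0.5 * (len(c) * np.log(2 * np.pi) + np.sum(np.log(c)) + np.sum(d**2 / c))

def dloglik_dA(A, o, mu, Sigma):
    """Analytic gradient: C^-1 [ (C^-1 P - I) A Sigma - A R ]."""
    e = o - mu
    R = np.outer(e, e)                        # R_t^m
    Cinv = np.diag(1.0 / np.diag(A @ Sigma @ A.T))
    P = np.diag(np.diag(A @ R @ A.T))         # P_t^m
    return Cinv @ ((Cinv @ P - np.eye(A.shape[0])) @ A @ Sigma - A @ R)
```

The log-determinant term contributes −C^-1 A Σ, the quadratic form contributes the P and R terms; a numerical gradient check confirms the closed form.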
MPE Objective Function
• Therefore, Eq. (6) can be rewritten as

      ∂F_MPE(O)/∂A = k Σ_m C_m^-1 [ (C_m^-1 g_m − d_m I) A Σ_m − A J_m ]    (12)

  where

      d_m = Σ_r Σ_{q_r} D(q_r, r) Σ_{t=S_{q_r}..E_{q_r}} γ_{mt}^{q_r}
      g_m = Σ_r Σ_{q_r} D(q_r, r) Σ_{t=S_{q_r}..E_{q_r}} γ_{mt}^{q_r} P_t^m
      J_m = Σ_r Σ_{q_r} D(q_r, r) Σ_{t=S_{q_r}..E_{q_r}} γ_{mt}^{q_r} R_t^m

  (matrix sizes noted on the slide: 39*39 and 39*162)
MPE-HLDA Implementation
• In theory, the derivative of the MPE-HLDA objective function can be computed based on Eq. (12), via a single forward-backward pass over the training lattices. In practice, however, it is not possible to fit all the full covariance matrices in memory.
• Two steps
  – First, run a forward-backward pass over the training lattices to accumulate the per-Gaussian statistics.
  – Second, use these statistics together with the full covariance matrices to synthesize the derivative.
• The paper uses gradient descent to update the projection matrix.
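The two-step procedure can be sketched as follows (function names, the frame-stream representation, and the mean-subtracted observations e = o_t − μ_m are my assumptions; only the diagonals of the g_m matrices are stored, since they are diagonal by construction):

```python
import numpy as np

def accumulate_stats(A, frames, n_gauss, n_dim):
    """Step 1: accumulate d_m, g_m, J_m of Eq. (12) in one pass.
    `frames` yields (m, w, e): Gaussian index, weight w = D(q, r) * gamma_mt,
    and mean-subtracted observation e = o_t - mu_m."""
    p = A.shape[0]
    d = np.zeros(n_gauss)
    g = np.zeros((n_gauss, p))              # diagonals of the g_m matrices
    J = np.zeros((n_gauss, n_dim, n_dim))
    for m, w, e in frames:
        d[m] += w
        g[m] += w * (A @ e)**2              # diag(A e e^T A^T) = P_t^m
        J[m] += w * np.outer(e, e)          # R_t^m
    return d, g, J

def synthesize_gradient(A, d, g, J, Sigmas, k=1.0):
    """Step 2: combine the statistics with the full covariances (Eq. 12)."""
    grad = np.zeros_like(A)
    for m in range(len(d)):
        Cinv = 1.0 / np.diag(A @ Sigmas[m] @ A.T)
        term = (Cinv * g[m] - d[m])[:, None] * (A @ Sigmas[m]) - A @ J[m]
        grad += k * Cinv[:, None] * term
    return grad
```

Only the statistics pass touches the lattices, so the full covariance matrices Σ_m are needed solely in the cheap second step, which is the point of the two-step split.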
MPE-HLDA Implementation
Experimental Framework
[Figure: global feature projection — an l×1 feature vector is transformed by L (n×l) and then by the MPE-HLDA matrix A (p×n), yielding a p×1 output.]
• Global feature projection
  – There is more useful information in longer contexts
  – Reduces the computational cost
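The l → n → p pipeline in the figure can be sketched for frame-concatenated cepstra (the context width, padding strategy, and function name are my assumptions, chosen so that 15 stacked 15-dimensional frames give l = 225 as in the CTS setup below):

```python
import numpy as np

def project_features(cepstra, L, A, context=7):
    """Concatenate 2*context+1 frames into l-dim super-vectors, apply
    L (n x l), then the MPE-HLDA projection A (p x n)."""
    T, d = cepstra.shape
    padded = np.pad(cepstra, ((context, context), (0, 0)), mode="edge")
    # stack neighbouring frames: one l = (2*context+1)*d vector per frame
    stacked = np.stack([padded[t:t + 2*context + 1].ravel() for t in range(T)])
    return stacked @ L.T @ A.T        # (T, p) projected features
```

Since L and A are both linear, they could be collapsed into a single p×l matrix; keeping them separate mirrors the figure, where only A is trained with MPE-HLDA.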
Experimentation
• Conversational Telephone Speech (CTS)
  – 2300 hours of training data
    • 800 hours : training the initial ML model
    • 1500 hours : held-out training data
      – Lattice generation
      – Discriminative training
      – MPE-HLDA : only 370 hours
  – Testing sets
    • Eval03
    • Dev04
Experimentation
• Conversational Telephone Speech (CTS)
  – Features
    • Frame-concatenated PLP cepstra
      – 15 frames, l = 225, n = 130, p = 60
Experimentation
• Broadcast News (BN)
  – 600 hours : training the initial model (Hub4 and TDT)
  – 330 hours : held-out data
Thanks