Minimum Phoneme Error Based Heteroscedastic Linear Discriminant Analysis for Speech Recognition


Minimum Phoneme Error Based Heteroscedastic Linear Discriminant Analysis for Speech Recognition

Bing Zhang and Spyros Matsoukas, BBN Technologies

Presented by Shih-Hung Liu, 2006/05/16


Outline

• Review PCA, LDA, HLDA
• Introduction
• MPE-HLDA
• Experimental results
• Conclusions


Review - PCA

PCA projects the data onto the leading eigenvectors of the total covariance matrix:

$$T = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{X})(x_i - \bar{X})^T, \qquad \bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

$$y_i = \theta_p^T x_i$$

where $\theta_p$ holds the $p$ leading eigenvectors of $T$.
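As an illustrative sketch (not from the slides; the function name and interface are invented), this PCA recipe in NumPy:

```python
import numpy as np

def pca_projection(X, p):
    """Project N x n data onto the p leading eigenvectors of the
    total covariance matrix T, i.e. y_i = theta_p^T x_i."""
    X_bar = X.mean(axis=0)                   # sample mean
    D = X - X_bar
    T = D.T @ D / X.shape[0]                 # total covariance (n x n)
    eigvals, eigvecs = np.linalg.eigh(T)     # eigenvalues in ascending order
    theta_p = eigvecs[:, ::-1][:, :p]        # p leading eigenvectors
    return X @ theta_p
```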


Review - LDA

LDA seeks the projection that maximizes between-class scatter relative to within-class scatter:

$$\hat{\theta}_p = \arg\max_{\theta_p} \frac{\left|\theta_p^T B\, \theta_p\right|}{\left|\theta_p^T W \theta_p\right|}$$

$$W = \frac{1}{N}\sum_{j=1}^{J} N_j W_j, \qquad B = \frac{1}{N}\sum_{j=1}^{J} N_j (\bar{X}_j - \bar{X})(\bar{X}_j - \bar{X})^T$$

$$W_j = \frac{1}{N_j}\sum_{i:\,g(i)=j}(x_i - \bar{X}_j)(x_i - \bar{X}_j)^T, \qquad j = 1,\ldots,J$$

$$y_i = \theta_p^T x_i$$
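A corresponding sketch for LDA (again an invented interface): it forms $W$ and $B$ as above and takes the leading generalized eigenvectors of $(B, W)$:

```python
import numpy as np

def lda_projection(X, labels, p):
    """Maximize |theta^T B theta| / |theta^T W theta| by solving the
    generalized eigenproblem W^-1 B v = lambda v."""
    N, n = X.shape
    X_bar = X.mean(axis=0)
    W = np.zeros((n, n))                     # within-class scatter
    B = np.zeros((n, n))                     # between-class scatter
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj, Xj_bar = len(Xj), Xj.mean(axis=0)
        Dj = Xj - Xj_bar
        W += Dj.T @ Dj                       # N_j * W_j
        d = (Xj_bar - X_bar)[:, None]
        B += Nj * (d @ d.T)
    W, B = W / N, B / N
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    theta_p = eigvecs[:, np.argsort(-eigvals.real)[:p]].real
    return X @ theta_p
```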


Review - HLDA

HLDA removes the shared-covariance assumption of LDA: a full $n \times n$ transform $\theta$ is estimated, and the Gaussian class models are constrained so that only the first $p$ dimensions carry class-dependent information:

$$P(x_i) = |\theta|\, P(y_i) = \frac{|\theta|}{(2\pi)^{n/2}\left|\Sigma_{g(i)}\right|^{1/2}}\, e^{-\frac{1}{2}\left(y_i - \mu_{g(i)}\right)^T \Sigma_{g(i)}^{-1}\left(y_i - \mu_{g(i)}\right)}, \qquad y_i = \theta^T x_i$$

$$\mu_j = \begin{bmatrix}\mu_j^{(p)} \\ \mu^{(n-p)}\end{bmatrix}, \qquad \Sigma_j = \begin{bmatrix}\Sigma_j^{(p)} & 0 \\ 0 & \Sigma^{(n-p)}\end{bmatrix}$$

with $\mu_j^{(p)}$, $\Sigma_j^{(p)}$ class-dependent and $\mu^{(n-p)}$, $\Sigma^{(n-p)}$ shared across classes.


Review - HLDA

The transform is found by maximizing the log-likelihood of the training data:

$$L(x;\theta) = \sum_{i=1}^{N}\log P(x_i) = -\frac{1}{2}\sum_{i=1}^{N}\left[\left(\theta^T x_i - \mu_{g(i)}\right)^T \Sigma_{g(i)}^{-1}\left(\theta^T x_i - \mu_{g(i)}\right) + n\log 2\pi + \log\left|\Sigma_{g(i)}\right|\right] + N\log|\theta|$$

For a fixed $\theta$, the ML parameter estimates are

$$\mu_j^{(p)} = \theta_p^T \bar{X}_j, \quad \Sigma_j^{(p)} = \theta_p^T W_j\, \theta_p, \qquad j = 1,\ldots,J$$

$$\mu^{(n-p)} = \theta_{n-p}^T \bar{X}, \quad \Sigma^{(n-p)} = \theta_{n-p}^T T\, \theta_{n-p}$$
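Substituting these ML estimates back into $L(x;\theta)$ gives an objective in $\theta$ alone (the usual Kumar-style HLDA criterion). A minimal sketch, assuming the per-class scatter matrices `W_list` with counts `N_list` and the total covariance `T` are precomputed (interface invented):

```python
import numpy as np

def hlda_objective(theta, p, W_list, N_list, T, N):
    """L(theta) = N log|theta| - (1/2) sum_j N_j log|theta_p^T W_j theta_p|
                  - (N/2) log|theta_{n-p}^T T theta_{n-p}|  (+ const)."""
    theta_p, theta_r = theta[:, :p], theta[:, p:]
    L = N * np.linalg.slogdet(theta)[1]       # N log|det(theta)|
    for Nj, Wj in zip(N_list, W_list):
        L -= 0.5 * Nj * np.linalg.slogdet(theta_p.T @ Wj @ theta_p)[1]
    L -= 0.5 * N * np.linalg.slogdet(theta_r.T @ T @ theta_r)[1]
    return L
```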


Introduction for MPE-HLDA

• In speech recognition systems, feature analysis is usually employed for better classification accuracy and complexity control.

• In recent years, extensions to classical LDA have been widely adopted.

• Among them, HDA seeks to remove the equal-variance constraint imposed by LDA.

• ML-HLDA takes the HMM structure (e.g., diagonal-covariance Gaussian mixture state distributions) into consideration.


Introduction for MPE-HLDA

• Despite the differences between the above techniques, they share some common limitations.

• First, none of them assumes any prior knowledge of confusable hypotheses, so the features they select can be suboptimal for recognition.

• Second, their objective functions are not directly related to the WER.

• For example, we found that HLDA could select completely non-discriminative features, while still improving its objective function, by mapping all training samples to a single point in space along some dimensions.


Introduction

• LDA and HLDA
– Better classification accuracy
– Some common limitations:
• None of them assumes any prior knowledge of confusable hypotheses
• Their objective functions do not directly relate to the word error rate (WER)
– As a result, it is often unknown whether the selected features will do well in testing just by looking at the values of the objective functions

• Minimum Phoneme Error (MPE)
– Minimizes phoneme errors in lattice-based training frameworks
– Since this criterion is closely related to WER, MPE-HLDA tends to be more robust than other projection methods, which makes it potentially better suited for a wider variety of features


MPE-HLDA

• MPE-HLDA model: given a global feature projection matrix $A$ ($p \times n$), each Gaussian $m$, with full mean $\mu_m$ and covariance $\Sigma_m$ in the original feature space, is projected as

$$\hat{\mu}_m = A\mu_m, \qquad \hat{C}_m = \mathrm{diag}\!\left(A\Sigma_m A^T\right), \qquad \hat{o}_t = A o_t$$

We will refer to $(\hat{\mu}_m, \hat{C}_m)$ as the MPE-HLDA model.

• MPE-HLDA aims at minimizing the expected number of phoneme errors introduced by the MPE-HLDA model in a given hypothesis lattice, or equivalently at maximizing the function

$$F_{MPE}(O) = \sum_{r=1}^{R}\sum_{w_r} P(w_r \mid O_r)\,\mathrm{Acc}(w_r) \qquad (4)$$

where $R$ is the total number of training utterances, $O_r$ is the sequence of $p$-dimensional observation vectors in utterance $r$, and $\mathrm{Acc}(w_r)$ is the "raw accuracy" score of word hypothesis $w_r$.
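A minimal sketch of the model projection (names invented), mapping one Gaussian's full-space statistics into the MPE-HLDA model:

```python
import numpy as np

def project_model(A, mu_m, Sigma_m):
    """mu_hat_m = A mu_m; C_hat_m = diag(A Sigma_m A^T), kept as a vector."""
    mu_hat = A @ mu_m
    C_hat = np.diag(A @ Sigma_m @ A.T)
    return mu_hat, C_hat
```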


MPE-HLDA

$P(w_r \mid O_r)$ is the posterior probability of hypothesis $w_r$ in the lattice:

$$P(w_r \mid O_r) = \frac{P(O_r \mid w_r)^k\, P(w_r)}{\sum_{w_r'} P(O_r \mid w_r')^k\, P(w_r')}$$

where $P(w_r)$ is the language model probability of hypothesis $w_r$, and $k$ is a scaling factor used to reduce the dynamic range of the acoustic scores, thereby avoiding the concentration of all posterior mass on the top-1 hypothesis of the lattice.
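In log-space the scaled posterior is a softmax over hypothesis scores; a small illustrative sketch (interface invented, not from the paper):

```python
import numpy as np

def hypothesis_posteriors(log_p_O_given_w, log_p_w, k):
    """P(w|O) proportional to P(O|w)^k P(w), normalized over the lattice
    hypotheses; k < 1 flattens the acoustic dynamic range."""
    s = k * np.asarray(log_p_O_given_w) + np.asarray(log_p_w)
    s -= s.max()                             # numerical stability
    post = np.exp(s)
    return post / post.sum()
```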


MPE-HLDA

• It can be shown that the derivative of (4) with respect to $A$ is

$$\frac{\partial F_{MPE}(O)}{\partial A} = k\sum_{r=1}^{R}\sum_{q_r} D(q_r, r)\,\frac{\partial \log P\!\left(O_r^{q_r} \mid q_r, A\right)}{\partial A} \qquad (6)$$

where

$$D(q, r) = P(q \mid O_r)\left[c(q) - c^{avg}(r)\right]$$

$c^{avg}(r)$ is the MPE score of utterance $r$ (average accuracy over all hypotheses), and $c(q)$ is the average accuracy over all hypotheses of utterance $r$ that contain arc $q$.
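The weight $D(q,r)$ is what gives the gradient its sign: arcs occurring in better-than-average hypotheses are reinforced, the rest are penalized. As a one-line sketch (names invented):

```python
def arc_weight(arc_posterior, arc_acc, utt_avg_acc):
    """D(q, r) = P(q|O_r) * (c(q) - c_avg(r))."""
    return arc_posterior * (arc_acc - utt_avg_acc)
```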


MPE-HLDA

The arc-level term decomposes over frames and Gaussians:

$$\frac{\partial \log P\!\left(O_r^{q_r} \mid q_r, A\right)}{\partial A} = \sum_{t=S_{q_r}}^{E_{q_r}} \sum_m \gamma_{tm}\,\frac{\partial \log P(\hat{o}_t \mid m)}{\partial A}$$

where $S_{q_r}$ and $E_{q_r}$ are the begin and end times of arc $q_r$, and $\gamma_{tm}$ denotes the posterior probability of Gaussian $m$ in arc $q_r$ at time $t$. For a single Gaussian,

$$\frac{\partial \log P(\hat{o}_t \mid m)}{\partial A} = \hat{C}_m^{-1}\left[\left(P_{mt}\hat{C}_m^{-1} - I\right) A\,\Sigma_m - A\, R_{mt}\right]$$

where

$$P_{mt} = \mathrm{diag}\!\left((\hat{o}_t - \hat{\mu}_m)(\hat{o}_t - \hat{\mu}_m)^T\right), \qquad R_{mt} = (o_t - \mu_m)(o_t - \mu_m)^T$$
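A direct NumPy transcription of the per-frame derivative, as a sketch under the notation above (shapes: `A` is p x n, `o_t` and `mu_m` are n-vectors; it exploits the fact that $\hat{C}_m$ and $P_{mt}$ are diagonal):

```python
import numpy as np

def dlogp_dA(A, o_t, mu_m, Sigma_m):
    """d log P(A o_t | m) / dA = C^-1 [(P C^-1 - I) A Sigma - A R]."""
    d = o_t - mu_m                           # residual, original space
    e = A @ d                                # residual, projected space
    C_hat = np.diag(A @ Sigma_m @ A.T)       # diagonal variances (p,)
    P_mt = e ** 2                            # diag of (o_hat - mu_hat)(...)^T
    R_mt = np.outer(d, d)
    term = (P_mt / C_hat - 1.0)[:, None] * (A @ Sigma_m) - A @ R_mt
    return term / C_hat[:, None]             # left-multiply by C^-1
```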


MPE-HLDA

• Therefore, Eq. (6) can be rewritten as

$$\frac{\partial F_{MPE}(O)}{\partial A} = k\sum_m \left[\hat{C}_m^{-1}\left(g_m \hat{C}_m^{-1} - \gamma_m I\right) A\,\Sigma_m - J_m\right] \qquad (12)$$

where

$$\gamma_m = \sum_r\sum_{q_r} D(q_r, r)\sum_{t=S_{q_r}}^{E_{q_r}} \gamma_{tm}$$

$$g_m = \sum_r\sum_{q_r} D(q_r, r)\sum_{t=S_{q_r}}^{E_{q_r}} \gamma_{tm} P_{mt}$$

$$J_m = \hat{C}_m^{-1} A \sum_r\sum_{q_r} D(q_r, r)\sum_{t=S_{q_r}}^{E_{q_r}} \gamma_{tm} R_{mt}$$

(The original slide annotates matrix dimensions of 39×39 and 39×162, consistent with a $p \times p$ term and a $p \times n$ projection matrix $A$.)
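Assembling Eq. (12) from per-Gaussian accumulators might look like the following sketch (the record layout is invented; `g` holds the diagonal of $g_m$, and `S_R` the $D(q,r)\,\gamma_{tm}$-weighted sum of $R_{mt}$):

```python
import numpy as np

def mpe_derivative(A, gaussians, k):
    """dF_MPE/dA per Eq. (12): k * sum_m [C^-1 (g C^-1 - gamma I) A Sigma - J_m]."""
    dF = np.zeros_like(A)
    for G in gaussians:
        C_hat = np.diag(A @ G["Sigma"] @ A.T)          # (p,)
        lead = ((G["g"] / C_hat - G["gamma"]) / C_hat)[:, None]
        dF += lead * (A @ G["Sigma"])                  # first term of (12)
        dF -= (A @ G["S_R"]) / C_hat[:, None]          # J_m term
    return k * dF
```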


MPE-HLDA Implementation

• In theory, the derivative of the MPE-HLDA objective function can be computed based on Eq. (12) via a single forward-backward pass over the training lattices. In practice, however, it is not possible to fit all of the full covariance matrices in memory.

• The computation is therefore split into two steps:

– First, run a forward-backward pass over the training lattices to accumulate the required statistics.

– Second, use these statistics together with the full covariance matrices to synthesize the derivative.

• The paper updates the projection matrix with gradient steps (sketched below):

$$A^{(i)}_{j+1} = A^{(i)}_j + \epsilon\,\frac{\partial F_{MPE}\!\left(O; A^{(i)}_j, \hat{M}^{(i)}\right)}{\partial A}$$
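The inner optimization could then alternate the statistics pass with gradient steps, as in this sketch (the forward-backward callable is a placeholder; `mpe_derivative` is the sketch above):

```python
def optimize_projection(A, full_stats, run_forward_backward,
                        mpe_derivative, k, lr, iters):
    """Alternate accumulation of gamma_m, g_m, J_m statistics with
    gradient steps on F_MPE."""
    for _ in range(iters):
        gaussians = run_forward_backward(A, full_stats)  # accumulate stats
        A = A + lr * mpe_derivative(A, gaussians, k)     # gradient step
    return A
```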


MPE-HLDA Overview

[These two overview slides restate, in compact form, the equations introduced above: the MPE-HLDA model $\hat{\mu}_m = A\mu_m$, $\hat{C}_m = \mathrm{diag}(A\Sigma_m A^T)$ with global feature projection matrix $A$; the objective $F_{MPE}(O)$ of Eq. (4); the arc weights $D(q,r)$; the per-frame derivative with $P_{mt}$ and $R_{mt}$; and Eq. (12) with the accumulators $\gamma_m$, $g_m$, $J_m$.]


Implementation

• 1. Initialize the feature projection matrix $A^{(0)}$ by LDA or HLDA, and the MPE-HLDA model $\hat{M}^{(0)}$.

• 2. Set $i = 1$.

• 3. Compute covariance statistics in the original feature space:
– (a) Do a maximum-likelihood update of the MPE-HLDA model in the feature space defined by $A^{(i-1)}$.
– (b) Do single-pass retraining using $\hat{M}^{(i-1)}$ to generate $\mu_m^{(i)}$ and $\Sigma_m^{(i)}$ in the original feature space.

• 4. Optimize the feature projection matrix:
– (a) Set $j = 0$, $A_0^{(i)} = A^{(i-1)}$.
– (b) Project $\mu_m^{(i)}$ and $\Sigma_m^{(i)}$ using $A_j^{(i)}$ to get the model $\hat{M}^{(i)}$ in the reduced subspace.
– (c) Run a forward-backward pass on the lattices using $\hat{M}^{(i)}$ to compute $\gamma_m$, $g_m$ and $J_m$.
– (d) Use the $\gamma_m$, $g_m$ and $J_m$ statistics from 4(c) to compute the MPE derivative.
– (e) Update $A_j^{(i)}$ to $A_{j+1}^{(i)}$ using the gradient step $A_{j+1}^{(i)} = A_j^{(i)} + \epsilon\,\partial F_{MPE}(O; A_j^{(i)}, \hat{M}^{(i)})/\partial A$.
– (f) Set $j = j + 1$ and go to 4(b) unless converged.

• 5. Optionally, set $i = i + 1$ and go to 3. (A sketch of this loop follows.)
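Putting steps 1 through 5 together, a schematic driver (every callable is a placeholder for the corresponding stage, reusing the `optimize_projection` and `mpe_derivative` sketches above; this is not BBN's implementation):

```python
def train_mpe_hlda(A, M, ml_update, single_pass_retrain,
                   run_forward_backward, k, lr=1e-3,
                   n_outer=2, n_inner=10):
    """Outer loop over i (steps 3 and 5), inner loop over j (step 4)."""
    for i in range(n_outer):
        M = ml_update(M, A)                          # 3(a): ML update under A
        full_stats = single_pass_retrain(M)          # 3(b): mu_m, Sigma_m
        A = optimize_projection(A, full_stats, run_forward_backward,
                                mpe_derivative, k, lr, n_inner)   # step 4
    return A, M
```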


Experiments

• DARPA EARS research project.

• CTS: 800/2300 hours for ML training, 370 hours of held-out data for MPE-HLDA training.

• BN: 600 hours from Hub4 and TDT for ML training, 330 hours of held-out data for MPE-HLDA estimation.

• Features: PLP cepstra (15-dim) plus first, second and third derivative coefficients (60-dim total).

• EARS 2003 evaluation test set.


Conclusions

• We have taken a first look at a new feature analysis method, MPE-HLDA.

• Results show that it is effective in reducing recognition error, and that it is more robust than other commonly used analysis methods such as LDA and HLDA.
