cepstral vector normalization based on stereo data for robust speech recognition

Cepstral Vector Normalization Based on Stereo Data for Robust Speech Recognition

Presenter: Shih-Hsiang Lin

Luis Buera, Eduardo Lleida, Antonio Miguel, Alfonso Ortega, and Óscar Saz

Communication Technologies Group (GTC), Aragon Institute of Engineering Research (I3A), University of Zaragoza, Spain

IEEE Transactions on Audio, Speech and Language Processing, Feb. 2007


Reference

• “Cepstral Vector Normalization Based on Stereo Data for Robust Speech Recognition,” IEEE Trans. on Audio, Speech and Language Processing, 2007

• “Multi-environment models based linear normalization for robust speech recognition in car conditions,” in Proc. ICASSP 2004

• “Multi-environment models based linear normalization for robust speech recognition,” in Proc. SPECOM 2004

• “Robust speech recognition in cars using phoneme dependent multi-environment linear normalization,” in Proc. EuroSpeech 2005

• “Recent advances in PD-MEMLIN for speech recognition in car conditions,” in Proc. ASRU 2005


Outline

• Introduction
• Approaches
  – Multi-Environment Model-based Linear Normalization (MEMLIN)
  – Polynomial MEMLIN (P-MEMLIN)
  – Multi-Environment Model-based Histogram Normalization (MEMHIN)
  – Phoneme-Dependent MEMLIN (PD-MEMLIN)
  – Blind PD-MEMLIN
• Experimental Results
• Conclusions

Introduction

• Robustness techniques have been developed along two main lines of research
  – Acoustic model adaptation methods
    • Require more data and computing time
      – MAP, MLLR, PMC
  – Feature vector adaptation/normalization methods
    • Map recognition-space feature vectors to the training space
      – High-pass filtering
        » The results produced by these methods are limited individually
        » CMN, RASTA processing
      – Model-based techniques
        » VTS, CDCN
      – Empirical compensation
        » Entirely data-driven
        » Needs a training phase in which some transformations are estimated by computing the frame-by-frame differences between the stereo data
        » SPLICE, POF

Introduction (cont.)

• This paper focuses on empirical feature vector normalization based on stereo data and the MMSE estimator
  – It is based on the joint modeling of the clean and noisy spaces
    • The noisy space is split into several basic environments, and each basic noisy feature space and the clean feature space are modeled using GMMs
  – A transformation between clean and noisy feature vectors is learned for each pair of clean and noisy model Gaussians

[Figure: noisy basic environment spaces and the clean space]

Noise Effects

Convolutional noise: shifts the mean of the coefficients

Additive noise: modifies the PDF, reducing the variances of the coefficients

Real car environment: modifies the mean and the variance jointly

MMSE-based Feature Vector Normalization Methods

• Given the noisy feature vector $y_t$, the estimated clean feature vector $\hat{x}_t$ is obtained by using the MMSE criterion as

$$\hat{x}_t = E[x \mid y_t] = \int_X x\, p(x \mid y_t)\, dx$$

  – Method 1: CMN
    • No assumptions are made in estimating $p(x \mid y_t)$
    • The clean feature vector is approximated as

$$x \approx \Psi(y_t, r) = y_t - r$$

$$\hat{x}_t = (y_t - r)\int_X p(x \mid y_t)\, dx = y_t - r$$

    • To estimate the bias vector transformation $r$, the mean square error $\varepsilon$ is defined and minimized w.r.t. $r$

$$r = \arg\min_r \varepsilon = \arg\min_r E\big[(x_t - \hat{x}_t)^2\big] = E[y_t] - E[x_t]$$
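The CMN bias reduces to a difference of means, which a short Python sketch can make concrete. This is only an illustration under stated assumptions: the function names `estimate_cmn_bias` and `apply_bias` and the toy two-dimensional frames are hypothetical and do not come from the paper.

```python
def estimate_cmn_bias(clean_frames, noisy_frames):
    """Return r = E[y] - E[x], computed independently for each dimension."""
    dims = len(clean_frames[0])
    n_clean, n_noisy = len(clean_frames), len(noisy_frames)
    mean_x = [sum(f[i] for f in clean_frames) / n_clean for i in range(dims)]
    mean_y = [sum(f[i] for f in noisy_frames) / n_noisy for i in range(dims)]
    return [my - mx for my, mx in zip(mean_y, mean_x)]


def apply_bias(noisy_frames, r):
    """Apply x_hat_t = y_t - r to every frame."""
    return [[y_i - r_i for y_i, r_i in zip(frame, r)] for frame in noisy_frames]


# Toy data (assumption): the noisy channel is the clean one shifted by (1.0, -0.5).
clean = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
noisy = [[1.0, 0.5], [3.0, 2.5], [5.0, 4.5]]
r = estimate_cmn_bias(clean, noisy)
restored = apply_bias(noisy, r)
```

Because a single bias is shared by all frames, this can only undo a global mean shift, which is why the later methods make the bias depend on Gaussians and environments.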

MMSE-based Feature Vector Normalization Methods (cont.)

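• In some cases, the mean of the clean feature vectors is removed before training the acoustic models, so the bias vector transformation is computed as

$$r = E[y_t]$$

  – Method 2: Multivariate Gaussian-based cepstral normalization (RATZ)
    • The clean space is modeled using a GMM

$$p(x) = \sum_{s_x} p(x \mid s_x)\, p(s_x), \qquad p(x \mid s_x) = N(x; \mu_{s_x}, \Sigma_{s_x})$$

    • The clean feature vector is approximated as

$$x \approx \Psi(y_t, s_x) = y_t - r_{s_x}$$

$$\hat{x}_t = y_t - \sum_{s_x} r_{s_x}\int_X p(x, s_x \mid y_t)\, dx = y_t - \sum_{s_x} r_{s_x}\, p(s_x \mid y_t)$$

    • The estimation of $p(s_x \mid y_t)$ can produce a mismatch, since a clean-space model is evaluated on noisy feature vectors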

MMSE-based Feature Vector Normalization Methods (cont.)

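  – Method 3: SPLICE
    • The noisy space, instead of the clean one, is modeled using a GMM

$$p(y_t) = \sum_{s_y} p(y_t \mid s_y)\, p(s_y), \qquad p(y_t \mid s_y) = N(y_t; \mu_{s_y}, \Sigma_{s_y})$$

    • The clean feature vector is approximated as

$$x \approx \Psi(y_t, s_y) = y_t - r_{s_y}$$

$$\hat{x}_t = y_t - \sum_{s_y} r_{s_y}\int_X p(x, s_y \mid y_t)\, dx = y_t - \sum_{s_y} r_{s_y}\, p(s_y \mid y_t)$$

  – Furthermore, extensions that handle several acoustic conditions have been developed
    • Interpolated RATZ (IRATZ)
    • SPLICE with environmental model selection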
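The SPLICE-style estimate, a noisy vector minus a posterior-weighted sum of per-Gaussian biases, can be sketched in one dimension as follows. The GMM parameters, the bias values and the function names are assumptions chosen for illustration, not values from the paper.

```python
import math


def gaussian_pdf(value, mean, var):
    """Density of a one-dimensional Gaussian N(value; mean, var)."""
    return math.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_posteriors(y, gmm):
    """p(s_y | y) for a GMM given as a list of (weight, mean, var) tuples."""
    joint = [w * gaussian_pdf(y, m, v) for (w, m, v) in gmm]
    total = sum(joint)
    return [j / total for j in joint]


def splice_estimate(y, noisy_gmm, biases):
    """x_hat = y - sum_s r_s * p(s | y), with one bias r_s per noisy Gaussian."""
    post = gaussian_posteriors(y, noisy_gmm)
    return y - sum(r * p for r, p in zip(biases, post))


# Hypothetical two-component noisy-space GMM and per-Gaussian biases.
noisy_gmm = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
biases = [0.5, 1.5]
post_mid = gaussian_posteriors(0.0, noisy_gmm)    # point halfway between both means
x_hat_mid = splice_estimate(0.0, noisy_gmm, biases)
```

At the midpoint both Gaussians are equally likely, so the applied correction is the average of the two biases; near either mean the correction approaches that Gaussian's own bias.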

Multi-Environment Model-based Linear Normalization (MEMLIN)

• MEMLIN approximation
  – The noisy space is divided into several basic environments $e$, and the noisy feature vectors are modeled as a GMM for each basic environment

$$p_e(y) = \sum_{s_y^e} p(y \mid s_y^e)\, p(s_y^e), \qquad p(y \mid s_y^e) = N(y; \mu_{s_y^e}, \Sigma_{s_y^e})$$

  – Clean feature vectors are modeled using a GMM

$$p(x) = \sum_{s_x} p(x \mid s_x)\, p(s_x), \qquad p(x \mid s_x) = N(x; \mu_{s_x}, \Sigma_{s_x})$$

  – Clean feature vectors are approximated as a linear function of the noisy feature vector, which depends on the basic environment and on the clean and noisy model Gaussians

$$x \approx \Psi(y_t, s_x, s_y^e) = y_t - r_{s_x, s_y^e}$$

    where $r_{s_x, s_y^e}$ is the bias vector transformation

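MEMLIN (Cont.)

• MEMLIN enhancement
  – Given the noisy feature vector $y_t$, the estimated clean feature vector $\hat{x}_t$ is obtained by using the MMSE criterion as

$$\hat{x}_t = y_t - \sum_{e}\sum_{s_y^e}\sum_{s_x} r_{s_x,s_y^e}\int_X p(x, s_x, e, s_y^e \mid y_t)\, dx$$

$$\hat{x}_t = y_t - \sum_{e}\sum_{s_y^e}\sum_{s_x} r_{s_x,s_y^e}\; p(e \mid y_t)\; p(s_y^e \mid y_t, e)\; p(s_x \mid y_t, e, s_y^e)$$

  – Calculate $p(e \mid y_t)$ recursively

$$p(e \mid y_t) = \beta\, p(e \mid y_{t-1}) + (1 - \beta)\,\frac{p_e(y_t)}{\sum_{e'} p_{e'}(y_t)}$$

    • $p(e \mid y_0)$ is considered to be uniformly distributed over all the environments
    • $\beta$ has to be close to 1

  – Calculate $p(s_y^e \mid y_t, e)$

$$p(s_y^e \mid y_t, e) = \frac{p(y_t \mid s_y^e)\, p(s_y^e)}{\sum_{s_y^{e\prime}} p(y_t \mid s_y^{e\prime})\, p(s_y^{e\prime})}$$

  – $p(s_x \mid y_t, e, s_y^e)$ is estimated in a training phase using stereo data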
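The recursive environment posterior described on this slide behaves like a first-order smoother over the per-frame normalized environment likelihoods. A minimal sketch follows; the default `beta=0.98`, the function names and the likelihood values are assumptions (the slide only states that the memory constant has to be close to 1).

```python
def update_environment_posterior(prev_post, env_likelihoods, beta=0.98):
    """One step: beta * p(e|y_{t-1}) + (1 - beta) * p_e(y_t) / sum_e' p_e'(y_t)."""
    total = sum(env_likelihoods)
    return [beta * prev + (1.0 - beta) * lik / total
            for prev, lik in zip(prev_post, env_likelihoods)]


def track_environments(frame_likelihoods, beta=0.98):
    """Run the recursion over a sequence of per-frame environment likelihoods."""
    n_env = len(frame_likelihoods[0])
    post = [1.0 / n_env] * n_env            # p(e | y_0): uniform over environments
    history = []
    for liks in frame_likelihoods:
        post = update_environment_posterior(post, liks, beta)
        history.append(post)
    return history


# Hypothetical likelihoods p_e(y_t) for 2 environments over 200 frames;
# environment 0 explains every frame three times better than environment 1.
frames = [[0.3, 0.1]] * 200
history = track_environments(frames, beta=0.98)
final = history[-1]
```

With a memory constant near 1 the posterior moves slowly toward the normalized likelihoods (here 0.75 and 0.25), which smooths out frame-level fluctuations at the cost of a slower reaction to a real change of environment.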

MEMLIN (Cont.)

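• MEMLIN training
  – Given a stereo data corpus for each environment

$$(X_e, Y_e) = \{(x_1^e, y_1^e), \ldots, (x_{T_e}^e, y_{T_e}^e)\}$$

  – The bias vector transformation $r_{s_x,s_y^e}$ is estimated by minimizing the weighted mean squared error $\varepsilon_{s_x,s_y^e}$

$$\varepsilon_{s_x,s_y^e} = \sum_{t_e} p(s_x \mid x_{t_e}^e, e)\; p(s_y^e \mid y_{t_e}^e, e)\; \big(x_{t_e}^e - y_{t_e}^e + r_{s_x,s_y^e}\big)^2$$

$$r_{s_x,s_y^e} = \arg\min_{r_{s_x,s_y^e}} \varepsilon_{s_x,s_y^e}, \qquad \frac{\partial \varepsilon_{s_x,s_y^e}}{\partial r_{s_x,s_y^e}} = 0$$

$$r_{s_x,s_y^e} = \frac{\sum_{t_e} p(s_x \mid x_{t_e}^e, e)\; p(s_y^e \mid y_{t_e}^e, e)\; \big(y_{t_e}^e - x_{t_e}^e\big)}{\sum_{t_e} p(s_x \mid x_{t_e}^e, e)\; p(s_y^e \mid y_{t_e}^e, e)}$$

  – The Gaussian posteriors are

$$p(s_x \mid x_{t_e}^e, e) = \frac{p(x_{t_e}^e \mid s_x)\, p(s_x)}{\sum_{s_x'} p(x_{t_e}^e \mid s_x')\, p(s_x')}$$

$$p(s_y^e \mid y_{t_e}^e, e) = \frac{p(y_{t_e}^e \mid s_y^e)\, p(s_y^e)}{\sum_{s_y^{e\prime}} p(y_{t_e}^e \mid s_y^{e\prime})\, p(s_y^{e\prime})}$$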
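The closed-form bias for one pair of clean and noisy Gaussians is a posterior-weighted average of the stereo differences. The sketch below assumes the posteriors have already been computed; `estimate_bias`, `weighted_error` and the toy frames are hypothetical names and data used only for illustration.

```python
def estimate_bias(clean, noisy, post_clean, post_noisy):
    """r = sum_t w_t (y_t - x_t) / sum_t w_t, with w_t = p(s_x|x_t) * p(s_y|y_t).

    clean, noisy: lists of stereo frames (lists of floats).
    post_clean, post_noisy: per-frame posteriors of ONE clean and ONE noisy Gaussian.
    """
    dims = len(clean[0])
    weights = [pc * pn for pc, pn in zip(post_clean, post_noisy)]
    total = sum(weights)
    return [sum(w * (y[i] - x[i]) for w, x, y in zip(weights, clean, noisy)) / total
            for i in range(dims)]


def weighted_error(clean, noisy, weights, r):
    """Weighted squared error sum_t w_t * ||x_t - y_t + r||^2 for a candidate bias r."""
    return sum(w * sum((x_i - y_i + r_i) ** 2 for x_i, y_i, r_i in zip(x, y, r))
               for w, x, y in zip(weights, clean, noisy))


# Hypothetical stereo frames and posteriors for one Gaussian pair.
clean = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
noisy = [[1.0, 0.5], [3.0, 1.5], [2.0, 4.0]]
post_clean = [1.0, 0.5, 0.0]
post_noisy = [0.5, 1.0, 1.0]
r = estimate_bias(clean, noisy, post_clean, post_noisy)
```

The third frame has zero clean posterior, so its large difference in the second dimension does not influence the bias of this Gaussian pair.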

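MEMLIN (Cont.)

  – Calculate $p(s_x \mid y_t, e, s_y^e)$
    • The cross probability is simplified by avoiding the time dependence given by the noisy feature vector

$$p(s_x \mid y_t, e, s_y^e) \approx p(s_x \mid s_y^e, e)$$

    • The term $p(s_x \mid s_y^e, e)$ can be estimated by using either a hard decision or a soft decision
      – Hard decision (using relative frequencies)

$$p(s_x \mid s_y^e, e) = \frac{C_N(s_x \mid s_y^e)}{N_{s_y^e}}$$

        where $C_N(s_x \mid s_y^e)$ counts the stereo frames whose most probable Gaussian pair is $(s_x, s_y^e)$, and $N_{s_y^e}$ counts the frames whose most probable noisy Gaussian is $s_y^e$

      – Soft decision

$$p(s_x \mid s_y^e, e) = \frac{\sum_{t_e} p(x_{t_e}^e \mid s_x)\, p(y_{t_e}^e \mid s_y^e)\, p(s_x)\, p(s_y^e)}{\sum_{t_e}\sum_{s_x'} p(x_{t_e}^e \mid s_x')\, p(y_{t_e}^e \mid s_y^e)\, p(s_x')\, p(s_y^e)}$$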
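The hard-decision estimate is a relative frequency over the stereo frames. A minimal sketch, assuming that the index of the most probable clean and noisy Gaussian has already been determined for every frame (the label lists and the function name `hard_cross_probability` are hypothetical):

```python
from collections import Counter


def hard_cross_probability(clean_labels, noisy_labels):
    """Estimate p(s_x | s_y) by relative frequency.

    clean_labels[t] / noisy_labels[t]: index of the most probable clean / noisy
    Gaussian for stereo frame t (assumed to be computed beforehand).
    """
    pair_counts = Counter(zip(clean_labels, noisy_labels))   # frames per (s_x, s_y)
    noisy_counts = Counter(noisy_labels)                     # frames per s_y
    return {(sx, sy): count / noisy_counts[sy]
            for (sx, sy), count in pair_counts.items()}


# Hypothetical most-probable-Gaussian labels for six stereo frames.
clean_labels = [0, 0, 1, 1, 1, 0]
noisy_labels = [0, 0, 0, 1, 1, 1]
cross = hard_cross_probability(clean_labels, noisy_labels)
```

Pairs that never co-occur are simply absent from the returned dictionary, which corresponds to a cross probability of zero; the soft decision avoids such hard zeros by accumulating likelihood products instead of counts.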

Experimental Results Using Basic MMSE-based Methods

• A set of experiments was performed using the Spanish SpeechDat Car database
  – Seven basic environments were defined
    • E1: car stopped, motor running
    • E2: town traffic, closed windows, and climatizer off (silent conditions)
    • E3: town traffic and noisy conditions (windows open and/or climatizer on)
    • E4: low speed, rough road, and silent conditions
    • E5: low speed, rough road, and noisy conditions
    • E6: high speed, good road, and silent conditions
    • E7: high speed, good road, and noisy conditions
  – Two channels were used (stereo data)
    • Close-talk channel (CLK)
    • Hands-free channel (HF)
• The recognition task is isolated and continuous digit recognition

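Experimental Results (cont.)

$$\mathrm{MIMP} = 100 \cdot \frac{\mathrm{MWER} - \mathrm{MWER}_{CLK\text{-}HF}}{\mathrm{MWER}_{CLK\text{-}CLK} - \mathrm{MWER}_{CLK\text{-}HF}}$$

(MIMP: mean improvement in WER; MWER: mean WER)

• The SPLICE MS method always produces better results than IRATZ because of the assumption made about the a posteriori probability

• MEMLIN performs better than IRATZ and SPLICE MS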
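The MIMP measure rescales a mean WER so that the CLK-HF reference maps to 0% and the CLK-CLK reference maps to 100%. A small sketch follows; the WER values are invented for illustration and are not results from the paper.

```python
def mimp(mwer, mwer_clk_clk, mwer_clk_hf):
    """Mean improvement in percent between the CLK-HF and CLK-CLK reference points."""
    return 100.0 * (mwer - mwer_clk_hf) / (mwer_clk_clk - mwer_clk_hf)


# Hypothetical mean WERs in percent (not taken from the paper).
baseline = mimp(20.0, 2.0, 20.0)    # same WER as the CLK-HF reference
ideal = mimp(2.0, 2.0, 20.0)        # same WER as the CLK-CLK reference
halfway = mimp(11.0, 2.0, 20.0)     # midway between both references
```

Expressing results this way makes methods comparable across environments whose absolute WERs differ widely.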

Improvement Over MEMLIN

• Two important approximations in the MEMLIN expressions can affect the final performance of the method
  – The selection of the linear model for $x$ associated with a pair of Gaussians

$$x \approx \Psi(y_t, s_x, s_y^e) = y_t - r_{s_x,s_y^e}$$

    • It compensates for the mean shift, but not for the modification of the variance
  – Treating all of the sounds in the same way
    • There is always a bias vector transformation that maps from a noisy model Gaussian to every clean model one
      – e.g., non-silence noisy feature vectors are mapped towards the clean silence

Polynomial MEMLIN (P-MEMLIN)

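• The transformation function $\Psi(y_t, s_x, s_y^e)$ for P-MEMLIN is

$$x(i) \approx \Psi\big(y_t(i), s_x, s_y^e\big) = a_{s_x,s_y^e}(i)\, y_t(i) + b_{s_x,s_y^e}(i), \qquad i:\ \text{dimension index}$$

  – Given the noisy feature vector $y_t$, the estimated clean feature vector $\hat{x}_t$ is obtained by using the MMSE criterion as

$$\hat{x}_t(i) = \sum_{e}\sum_{s_y^e}\sum_{s_x} \big(a_{s_x,s_y^e}(i)\, y_t(i) + b_{s_x,s_y^e}(i)\big)\; p(e \mid y_t)\; p(s_y^e \mid y_t, e)\; p(s_x \mid y_t, e, s_y^e)$$

  – $a_{s_x,s_y^e}(i)$ and $b_{s_x,s_y^e}(i)$ are computed in the training phase using stereo data

$$a_{s_x,s_y^e}(i) = \frac{\sigma^{x}_{s_x,s_y^e}(i)}{\sigma^{y}_{s_x,s_y^e}(i)}, \qquad b_{s_x,s_y^e}(i) = \mu^{x}_{s_x,s_y^e}(i) - \frac{\sigma^{x}_{s_x,s_y^e}(i)}{\sigma^{y}_{s_x,s_y^e}(i)}\, \mu^{y}_{s_x,s_y^e}(i)$$

    where, for $z \in \{x, y\}$,

$$\mu^{z}_{s_x,s_y^e}(i) = \frac{\sum_{t_e} p(s_x \mid x_{t_e})\, p(s_y^e \mid y_{t_e})\, z_{t_e}(i)}{\sum_{t_e} p(s_x \mid x_{t_e})\, p(s_y^e \mid y_{t_e})}$$

$$\big(\sigma^{z}_{s_x,s_y^e}(i)\big)^2 = \frac{\sum_{t_e} p(s_x \mid x_{t_e})\, p(s_y^e \mid y_{t_e})\, \big(z_{t_e}(i) - \mu^{z}_{s_x,s_y^e}(i)\big)^2}{\sum_{t_e} p(s_x \mid x_{t_e})\, p(s_y^e \mid y_{t_e})}$$

• If the standard deviation terms are equal ($\sigma^x = \sigma^y$), then $a = 1$ and $b = \mu^x - \mu^y = -r$, so the expressions are the same as those of MEMLIN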
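The P-MEMLIN coefficients for one dimension of one Gaussian pair follow directly from weighted means and standard deviations of the stereo data. The sketch below is an illustration under assumptions: the weights stand in for the products of clean and noisy Gaussian posteriors, and the function names and data are hypothetical.

```python
import math


def weighted_mean_std(values, weights):
    """Weighted mean and standard deviation of a list of scalars."""
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, math.sqrt(var)


def pmemlin_coefficients(clean, noisy, weights):
    """Return (a, b) for one cepstral dimension of one Gaussian pair.

    weights[t] stands for p(s_x | x_t) * p(s_y | y_t) of stereo frame t.
    """
    mu_x, sigma_x = weighted_mean_std(clean, weights)
    mu_y, sigma_y = weighted_mean_std(noisy, weights)
    a = sigma_x / sigma_y
    b = mu_x - a * mu_y
    return a, b


# Hypothetical data: the noisy coefficient is the clean one scaled by 0.5 and shifted by 3.
clean = [-2.0, 0.0, 2.0, 4.0]
noisy = [0.5 * x + 3.0 for x in clean]
weights = [1.0, 1.0, 1.0, 1.0]
a, b = pmemlin_coefficients(clean, noisy, weights)
restored = [a * y + b for y in noisy]
```

Unlike a pure bias, the slope `a` rescales the variance, so a variance reduction caused by noise can be compensated in addition to the mean shift.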

Multi-Environment Model-based Histogram Normalization (MEMHIN)

• Sometimes noise produces a more complex modification of the clean and noisy feature pdfs associated with a pair of Gaussians
  – In that case, the linear approximation $\Psi(y_t, s_x, s_y^e)$ of MEMLIN or P-MEMLIN is not the best option
  – Therefore, a nonlinear model based on histogram equalization is used

• The transformation function $\Psi(y_t, s_x, s_y^e)$ for MEMHIN is expressed as

$$x \approx \Psi(y_t, s_x, s_y^e) = C^{-1}_{x,s_x,s_y^e}\big(C_{y,s_x,s_y^e}(y_t)\big)$$

$$\hat{x}_t = \sum_{e}\sum_{s_y^e}\sum_{s_x} C^{-1}_{x,s_x,s_y^e}\big(C_{y,s_x,s_y^e}(y_t)\big)\; p(e \mid y_t)\; p(s_y^e \mid y_t, e)\; p(s_x \mid y_t, e, s_y^e)$$

  – $n$-band histograms associated with $s_x$ and $s_y^e$ for each component of the noisy and clean feature vectors are obtained in the training phase; $C_y$ and $C_x$ denote the corresponding cumulative histograms
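Histogram equalization maps a noisy value to the clean value that has the same cumulative probability. The sketch below illustrates that mapping for one coefficient of one Gaussian pair. As a simplifying assumption it uses empirical CDFs built from sorted training samples rather than the n-band histograms mentioned on the slide, and the function name and data are hypothetical.

```python
import math
from bisect import bisect_right


def equalize(y, noisy_sorted, clean_sorted):
    """Map y to C_x^{-1}(C_y(y)) using empirical CDFs of sorted training samples."""
    prob = bisect_right(noisy_sorted, y) / len(noisy_sorted)        # C_y(y)
    rank = math.ceil(round(prob * len(clean_sorted), 9)) - 1        # invert C_x
    rank = min(len(clean_sorted) - 1, max(0, rank))
    return clean_sorted[rank]


# Hypothetical training samples of one cepstral coefficient for one Gaussian pair;
# the noisy values are a nonlinear (cubic) distortion of the clean ones.
clean_sorted = [-2.0, -1.0, 0.0, 1.0, 2.0]
noisy_sorted = sorted(x ** 3 for x in clean_sorted)
mapped = [equalize(y, noisy_sorted, clean_sorted) for y in (-8.0, -1.0, 0.0, 1.0, 8.0)]
```

Because only the rank of the noisy value matters, any monotonic distortion is undone, which a bias or a first-order polynomial cannot do.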

Results from modifications

• P-MEMLIN and MEMHIN provide a significant improvement over MEMLIN when few Gaussians are considered
  – MIMP of 33.87% for MEMLIN
  – MIMP of 39.12% for P-MEMLIN
  – MIMP of 37.82% for MEMHIN
  – However, if the algorithms are evaluated using more than eight Gaussians per environment, the mean results of the three models are very similar

• Then, additive car noise was added to the clean signals of the Spanish SpeechDat Car database (5 dB noise)

[Figure: $\Psi(y_t, s_x, s_y^e)$, 4 Gaussians per environment]

Phoneme-Dependent MEMLIN (PD-MEMLIN)


PD-MEMLIN (Cont.)

• PD-MEMLIN approximation
  – The noisy space is split into several basic environments $e$. The noisy feature vectors associated with the different phonemes $ph$ of each basic environment are modeled as a GMM

$$p_{e,ph}(y) = \sum_{s_y^{e,ph}} p(y \mid s_y^{e,ph})\, p(s_y^{e,ph}), \qquad p(y \mid s_y^{e,ph}) = N(y; \mu_{s_y^{e,ph}}, \Sigma_{s_y^{e,ph}})$$

  – The clean feature vectors of each phoneme are modeled as a GMM

$$p_{ph}(x) = \sum_{s_x^{ph}} p(x \mid s_x^{ph})\, p(s_x^{ph}), \qquad p(x \mid s_x^{ph}) = N(x; \mu_{s_x^{ph}}, \Sigma_{s_x^{ph}})$$

  – The clean feature vector is approximated by a linear function that depends on the environment and on the phoneme-dependent Gaussians of the clean and noisy models

$$x \approx \Psi(y_t, s_x^{ph}, s_y^{e,ph}) = y_t - r_{s_x^{ph}, s_y^{e,ph}}$$

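PD-MEMLIN (Cont.)

• PD-MEMLIN enhancement
  – Given the noisy feature vector $y_t$, the estimated clean feature vector $\hat{x}_t$ is obtained by using the MMSE criterion as

$$\hat{x}_t = y_t - \sum_{e}\sum_{ph}\sum_{s_y^{e,ph}}\sum_{s_x^{ph}} r_{s_x^{ph},s_y^{e,ph}}\; p(e \mid y_t)\; p(ph \mid y_t, e)\; p(s_y^{e,ph} \mid y_t, e, ph)\; p(s_x^{ph} \mid y_t, e, ph, s_y^{e,ph})$$

  – Calculate $p(ph \mid y_t, e)$

$$p(ph \mid y_t, e) = \frac{p_{e,ph}(y_t)}{\sum_{ph'} p_{e,ph'}(y_t)}$$

• PD-MEMLIN training

$$\varepsilon_{s_x^{ph},s_y^{e,ph}} = \sum_{t_{e,ph}} p(s_x^{ph} \mid x_{t_{e,ph}}, e, ph)\; p(s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)\; \big(x_{t_{e,ph}} - y_{t_{e,ph}} + r_{s_x^{ph},s_y^{e,ph}}\big)^2$$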

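PD-MEMLIN (Cont.)

  – Cross probability

$$p(s_x^{ph} \mid y_t, e, ph, s_y^{e,ph}) \approx p(s_x^{ph} \mid s_y^{e,ph}, e, ph)$$

    • Hard solution

$$p(s_x^{ph} \mid s_y^{e,ph}, e, ph) = \frac{C_N(s_x^{ph} \mid s_y^{e,ph})}{N_{s_y^{e,ph}}}$$

    • Soft solution

$$p(s_x^{ph} \mid s_y^{e,ph}, e, ph) = \frac{\sum_{t_{e,ph}} p(x_{t_{e,ph}} \mid s_x^{ph})\, p(y_{t_{e,ph}} \mid s_y^{e,ph})\, p(s_x^{ph})\, p(s_y^{e,ph})}{\sum_{t_{e,ph}}\sum_{s_x^{ph\prime}} p(x_{t_{e,ph}} \mid s_x^{ph\prime})\, p(y_{t_{e,ph}} \mid s_y^{e,ph})\, p(s_x^{ph\prime})\, p(s_y^{e,ph})}$$

  – Bias vector transformation

$$r_{s_x^{ph},s_y^{e,ph}} = \arg\min_{r_{s_x^{ph},s_y^{e,ph}}} \varepsilon_{s_x^{ph},s_y^{e,ph}} = \frac{\sum_{t_{e,ph}} p(s_x^{ph} \mid x_{t_{e,ph}}, e, ph)\; p(s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)\; \big(y_{t_{e,ph}} - x_{t_{e,ph}}\big)}{\sum_{t_{e,ph}} p(s_x^{ph} \mid x_{t_{e,ph}}, e, ph)\; p(s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)}$$

For a more detailed description, please refer to the paper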

Results from PD-MEMLIN

• 25 Spanish phonemes and the silence
  – To make a fair comparison between the two methods, the results are plotted as a function of the number of Transformations per basic Environment (TpE)

$$TpE = \log_{10}\big(n_{ph} \cdot n_{s_y^{ph}} \cdot n_{s_x^{ph}}\big)$$

    • $n_{s_y^{ph}}$: the number of noisy Gaussians for phoneme $ph$
    • $n_{s_x^{ph}}$: the number of clean Gaussians for phoneme $ph$
    • $n_{ph}$: the number of phonemes (1 for MEMLIN)
    • Gaussians per phoneme evaluated: 2, 4, 8, 16, 32

• The results show that PD-MEMLIN yields significant improvements relative to MEMLIN, especially when more than four Gaussians per phoneme are used
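TpE is the base-10 logarithm of the number of bias vectors stored per basic environment, so configurations with and without phoneme dependence can be placed on a common axis. In the sketch below, the phoneme count of 26 (25 phonemes plus silence) and the chosen Gaussian counts are assumptions used only to show the arithmetic.

```python
import math


def tpe(n_ph, n_noisy, n_clean):
    """log10 of the number of transformations per basic environment."""
    return math.log10(n_ph * n_noisy * n_clean)


# MEMLIN: a single "phoneme" class with 32 noisy and 32 clean Gaussians.
memlin_tpe = tpe(1, 32, 32)
# PD-MEMLIN: 26 classes (assumed: 25 phonemes plus silence), 4 Gaussians each.
pd_memlin_tpe = tpe(26, 4, 4)
```

In this example PD-MEMLIN with four Gaussians per phoneme stores fewer transformations (416) than MEMLIN with 32 Gaussians (1024), so it sits further to the left on a TpE axis.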

Results from PD-MEMLIN (cont.)

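• To estimate the upper limit of PD-MEMLIN
  – Each frame was normalized using only the bias vector transformations of the “correct” phoneme (KPD-MEMLIN)

KPD-MEMLIN (the phoneme $ph$ of each frame is known):

$$\hat{x}_t = y_t - \sum_{e}\sum_{s_y^{e,ph}}\sum_{s_x^{ph}} r_{s_x^{ph},s_y^{e,ph}}\; p(e \mid y_t)\; p(s_y^{e,ph} \mid y_t, e, ph)\; p(s_x^{ph} \mid y_t, e, ph, s_y^{e,ph})$$

PD-MEMLIN:

$$\hat{x}_t = y_t - \sum_{e}\sum_{ph}\sum_{s_y^{e,ph}}\sum_{s_x^{ph}} r_{s_x^{ph},s_y^{e,ph}}\; p(e \mid y_t)\; p(ph \mid y_t, e)\; p(s_y^{e,ph} \mid y_t, e, ph)\; p(s_x^{ph} \mid y_t, e, ph, s_y^{e,ph})$$

MCP: mean correct phoneme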

Blind PD-MEMLIN

• In many cases, stereo data are not available
  – An iterative “blind” training procedure is needed
  – Assume that the noisy training feature vectors and the phoneme-dependent clean and noisy GMMs are available
  – The problem is to estimate the cross probability and the bias vector transformation
  – The procedure consists of an initialization and an iterative process

Blind PD-MEMLIN (cont.)

• Initialization
  – The cross probability $p_0(s_x^{ph} \mid s_y^{e,ph}, e, ph)$ is estimated using a modified Kullback-Leibler distance
    • It gives a similarity measure of $s_x^{ph}$ and $s_y^{e,ph}$ without considering the effects of the noise
  – Assume that the noise mainly modifies the mean vectors of the Gaussian model
    • So, the similarity is computed in terms of the a priori probabilities and the diagonal covariance matrices of the corresponding Gaussians

$$D_{KL}(X, Y) = \frac{1}{2}\left[\log\frac{\sigma_Y^2}{\sigma_X^2} + \frac{\sigma_X^2}{\sigma_Y^2} + \frac{(\mu_X - \mu_Y)^2}{\sigma_Y^2} - 1\right]$$

$$d_{KL}(s_y^{e,ph}, s_x^{ph}) = p(s_y^{e,ph})\log\frac{p(s_y^{e,ph})}{p(s_x^{ph})} + \frac{p(s_y^{e,ph})}{2}\sum_{i}\left[\log\frac{\sigma^2_{s_x^{ph}}(i)}{\sigma^2_{s_y^{e,ph}}(i)} + \frac{\sigma^2_{s_y^{e,ph}}(i)}{\sigma^2_{s_x^{ph}}(i)} - 1\right]$$

• Since the KL distance is not symmetric and is not proportional to the likelihood, a pseudo-likelihood is defined

$$pl_{KL}(s_y^{e,ph}, s_x^{ph}) = \frac{1}{d_{KL}(s_y^{e,ph}, s_x^{ph}) + d_{KL}(s_x^{ph}, s_y^{e,ph})}$$


• Finally, $p_0(s_x^{ph} \mid s_y^{e,ph}, e, ph)$ is estimated as

$$p_0(s_x^{ph} \mid s_y^{e,ph}, e, ph) = \frac{pl_{KL}(s_y^{e,ph}, s_x^{ph})}{\sum_{s_x^{ph\prime}} pl_{KL}(s_y^{e,ph}, s_x^{ph\prime})}$$

  – On the other hand, $r_{0,s_x^{ph},s_y^{e,ph}}$ is defined as

$$r_{0,s_x^{ph},s_y^{e,ph}} = \frac{\sum_{t_{e,ph}} p(s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)\; y_{t_{e,ph}}}{\sum_{t_{e,ph}} p(s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)} - \mu_{s_x^{ph}}$$

  – The mean improvement in WER over the seven basic environments, with four Gaussians per phoneme-dependent GMM, was 20.2%
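The blind initialization of the cross probability turns a mean-independent distance between Gaussians into normalized scores. The sketch below is one possible reading under explicit assumptions: the distance uses only priors and diagonal variances, the pseudo-likelihood is taken as the inverse of the symmetrized distance, and a small `floor` (not from the slides) guards against division by zero for identical Gaussians. All names and values are hypothetical.

```python
import math


def modified_kl(weight_a, var_a, weight_b, var_b):
    """Mean-independent KL-style distance from weighted diagonal Gaussian a to b."""
    prior_term = weight_a * math.log(weight_a / weight_b)
    var_term = sum(math.log(vb / va) + va / vb - 1.0 for va, vb in zip(var_a, var_b))
    return prior_term + 0.5 * weight_a * var_term


def pseudo_likelihood(noisy, clean, floor=1e-12):
    """Inverse of the symmetrized distance; each Gaussian is (weight, variances)."""
    forward = modified_kl(noisy[0], noisy[1], clean[0], clean[1])
    backward = modified_kl(clean[0], clean[1], noisy[0], noisy[1])
    return 1.0 / max(forward + backward, floor)


def initial_cross_probability(noisy, clean_gaussians):
    """Normalize the pseudo-likelihoods over all clean Gaussians of the phoneme."""
    scores = [pseudo_likelihood(noisy, c) for c in clean_gaussians]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical noisy Gaussian and two candidate clean Gaussians (weight, variances).
noisy_gaussian = (0.5, [1.0, 1.0])
clean_gaussians = [(0.5, [1.2, 0.9]), (0.5, [4.0, 0.25])]
p0 = initial_cross_probability(noisy_gaussian, clean_gaussians)
```

The first clean Gaussian has variances close to those of the noisy one, so it receives almost all of the initial probability mass regardless of where the means lie.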


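• Iterative process
  – Objective function (the posterior is computed with the current parameters, the likelihood inside the logarithm with the new ones)

$$Q = \sum_{t_{e,ph}}\sum_{s_y^{e,ph}}\sum_{s_x^{ph}} p(s_x^{ph}, s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)\; \log p(y_{t_{e,ph}}, s_x^{ph}, s_y^{e,ph} \mid e, ph)$$

$$Q = \sum_{t_{e,ph}}\sum_{s_y^{e,ph}}\sum_{s_x^{ph}} p(s_x^{ph}, s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)\left[-\frac{1}{2}\log\big|\Sigma_{s_x^{ph}}\big| - \frac{1}{2}\big(y_{t_{e,ph}} - r_{new,s_x^{ph},s_y^{e,ph}} - \mu_{s_x^{ph}}\big)^{T}\Sigma_{s_x^{ph}}^{-1}\big(y_{t_{e,ph}} - r_{new,s_x^{ph},s_y^{e,ph}} - \mu_{s_x^{ph}}\big)\right] + \text{const}$$

  – The value of $r_{new,s_x^{ph},s_y^{e,ph}}$ is obtained by taking the partial derivative of $Q$ with respect to it and setting it equal to zero

$$r_{new,s_x^{ph},s_y^{e,ph}} = \frac{\sum_{t_{e,ph}} p(s_x^{ph}, s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)\; \big(y_{t_{e,ph}} - \mu_{s_x^{ph}}\big)}{\sum_{t_{e,ph}} p(s_x^{ph}, s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)}$$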


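  – Update the cross probability and the bias vector transformation

$$r_{n,s_x^{ph},s_y^{e,ph}} = \frac{\sum_{t_{e,ph}} p(s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)\; p(s_x^{ph} \mid y_{t_{e,ph}}, s_y^{e,ph}, n-1)\; \big(y_{t_{e,ph}} - \mu_{s_x^{ph}}\big)}{\sum_{t_{e,ph}} p(s_y^{e,ph} \mid y_{t_{e,ph}}, e, ph)\; p(s_x^{ph} \mid y_{t_{e,ph}}, s_y^{e,ph}, n-1)}$$

$$p(s_x^{ph} \mid y_{t_{e,ph}}, s_y^{e,ph}, n-1) = \frac{N\big(y_{t_{e,ph}} - r_{n-1,s_x^{ph},s_y^{e,ph}};\ \mu_{s_x^{ph}}, \Sigma_{s_x^{ph}}\big)\; p(s_x^{ph})}{\sum_{s_x^{ph\prime}} N\big(y_{t_{e,ph}} - r_{n-1,s_x^{ph\prime},s_y^{e,ph}};\ \mu_{s_x^{ph\prime}}, \Sigma_{s_x^{ph\prime}}\big)\; p(s_x^{ph\prime})}$$

  – The mean improvement in WER in this case was 41.03% if n=1 and 46.90% if n=10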
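One iteration of the blind update alternates between evaluating the clean Gaussians on bias-compensated noisy frames and re-estimating each bias from the resulting posteriors. The one-dimensional sketch below handles a single noisy Gaussian and is only an illustration under assumptions: the function names, the GMM parameters, the frames and the zero initial biases are hypothetical, and the initialization from noisy and clean means is replaced by zeros for brevity.

```python
import math


def gaussian_pdf(value, mean, var):
    """Density of a one-dimensional Gaussian N(value; mean, var)."""
    return math.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def clean_posteriors(y, biases, clean_gmm):
    """Posterior of each clean Gaussian, evaluated at the compensated value y - r."""
    joint = [w * gaussian_pdf(y - r, m, v) for r, (w, m, v) in zip(biases, clean_gmm)]
    total = sum(joint)
    return [j / total for j in joint]


def update_biases(frames, noisy_post, biases, clean_gmm):
    """One iteration: new biases from the posteriors obtained with the previous ones."""
    num = [0.0] * len(clean_gmm)
    den = [0.0] * len(clean_gmm)
    for y, p_sy in zip(frames, noisy_post):
        post = clean_posteriors(y, biases, clean_gmm)
        for k, (_, mean, _) in enumerate(clean_gmm):
            num[k] += p_sy * post[k] * (y - mean)
            den[k] += p_sy * post[k]
    return [n / d for n, d in zip(num, den)]


# Hypothetical clean GMM (weight, mean, var), noisy frames and noisy-Gaussian posteriors.
clean_gmm = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
frames = [1.0, 3.0]
noisy_post = [1.0, 1.0]
biases = [0.0, 0.0]
for _ in range(10):
    biases = update_biases(frames, noisy_post, biases, clean_gmm)
```

No stereo data enters the loop: only noisy frames and the clean GMM are used, which is what makes the procedure blind. The result depends on the starting point, which is why the slides pair this iteration with a dedicated initialization.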

Results from Blind PD-MEMLIN

The results show that blind PD-MEMLIN produces improvements that are very similar to those of MEMLIN for all TpE values

It can be observed that PD-MEMLIN obtains the best improvement with the smallest TpE
