LOG-ENERGY DYNAMIC RANGE NORMALIZATION FOR ROBUST SPEECH RECOGNITION

Weizhong Zhu and Douglas O’Shaughnessy

INRS-EMT, University of Quebec

Montreal, Quebec, H5A 1K6, Canada

Presenter: Chen, Hung-Bin

ICASSP 2005

Outline

• Introduction

• Observation

• Energy dynamic range normalization method 1

• Energy dynamic range normalization method 2

• Experiment

• Conclusion

Introduction

• Automatic Speech Recognition (ASR) has been in commercial use for decades but still has severe limitations.

– The accuracy of speech recognition degrades rapidly when speech is distorted by noise.

• Methods to overcome the effects of noise must be applied to achieve good recognition accuracy in real speech recognition applications, where various types of noise may exist.

– Robust speech recognition is one of the most challenging areas of speech recognition.

Introduction (cont.)

• Methods for robust speech recognition can be classified into two approaches.

– Front-end processing suppresses the noise to obtain more robust parameters.

– Back-end processing compensates for the noise and adapts the parameters inside the HMM system.

• In this paper, we focus on the first approach.

– We try to find a more effective way to remove the effects of additive noise from the log-energy feature.

• We propose a log-energy dynamic range normalization (ERN) method to minimize the mismatch between training and testing data.

– The dynamic range of the log-energy feature sequence of an utterance is normalized to a target dynamic range.

Observation

• Comparing the log-energy feature sequence of noisy speech at 10 dB SNR with that of clean speech shows two differences:

1. The minimum value is elevated.

2. Valleys are buried by additive noise energy, while peaks are not affected as much.
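The two observations can be reproduced with a toy calculation. This is only a sketch: it assumes the noise is uncorrelated with the speech so that frame energies add, and the frame energies and noise level are hypothetical values chosen for illustration, not data from the paper.

```python
import math

def log_energies(frame_energies):
    """Natural-log energy of each frame (energies must be positive)."""
    return [math.log(e) for e in frame_energies]

def add_stationary_noise(frame_energies, noise_energy):
    """Assumption: uncorrelated additive noise, so frame energies simply add."""
    return [e + noise_energy for e in frame_energies]

# Hypothetical frame energies: a valley, a peak, and values in between.
clean = [1.0, 50.0, 10000.0, 200.0, 2.0]
noisy = add_stationary_noise(clean, noise_energy=100.0)

clean_log = log_energies(clean)
noisy_log = log_energies(noisy)
lift = [n - c for n, c in zip(noisy_log, clean_log)]

valley_lift = lift[clean.index(min(clean))]  # large: the valley is buried
peak_lift = lift[clean.index(max(clean))]    # small: the peak barely moves
```

In this toy case the valley rises by several log units while the peak moves by about a hundredth of a unit, which is the asymmetry the normalization tries to remove.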

Energy dynamic range normalization

• The larger difference in the valleys leads to a mismatch between clean and noisy speech.

• To minimize the mismatch:

– We suggest an algorithm that scales the log-energy feature sequence of clean speech, lifting the valleys while keeping the peaks unchanged.


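• Define the log-energy dynamic range of the sequence as follows:

$$\mathrm{D.R.}(\mathrm{dB}) = 10 \times \frac{\mathrm{Max}_{i=1..n}(\mathrm{LogEnergy}_i)}{\mathrm{Min}_{i=1..n}(\mathrm{LogEnergy}_i)}$$

where $\mathrm{Max}_{i=1..n}(\mathrm{LogEnergy}_i)$ is the maximum value and $\mathrm{Min}_{i=1..n}(\mathrm{LogEnergy}_i)$ is the minimum value.

• As we know, in the presence of noise, $\mathrm{Min}_{i=1..n}(\mathrm{LogEnergy}_i)$ is affected by additive noise, while $\mathrm{Max}_{i=1..n}(\mathrm{LogEnergy}_i)$ is not affected as much.

• Let

$$\mathrm{Min}_{i=1..n}(\mathrm{LogEnergy}_i) = \alpha \times \mathrm{Max}_{i=1..n}(\mathrm{LogEnergy}_i)$$

and define the target energy dynamic range as

$$X(\mathrm{dB}) = \frac{10}{\alpha}$$

• If we set the target log-energy dynamic range to 20 (dB), then $\alpha = 0.5$.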
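As a quick numerical check of the definition, the sketch below assumes the dynamic range is D.R.(dB) = 10 × Max / Min over the log-energy sequence and that the target range X(dB) maps to α = 10 / X, which is the reading that gives α = 0.5 at 20 dB. The function names and sample values are hypothetical.

```python
def dynamic_range_db(log_energy):
    """D.R.(dB) = 10 * Max / Min of a log-energy sequence (assumed definition)."""
    lo, hi = min(log_energy), max(log_energy)
    if lo <= 0:
        raise ValueError("this ratio-based definition needs positive log-energy values")
    return 10.0 * hi / lo

def alpha_for_target(target_db):
    """alpha such that Min = alpha * Max yields the target dynamic range."""
    if target_db <= 0:
        raise ValueError("target dynamic range must be positive")
    return 10.0 / target_db

sequence = [5.0, 10.0, 20.0]              # hypothetical log-energy values
measured_range = dynamic_range_db(sequence)
alpha_20db = alpha_for_target(20.0)
```

Because the definition is a ratio of log-energy values, it only makes sense when the minimum is positive, which is why the sketch rejects other inputs.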


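• Following are the steps of the proposed log-energy feature dynamic range normalization algorithm:

(1) Find $\mathrm{Max} = \mathrm{Max}_{i=1..n}(\mathrm{LogEnergy}_i)$ and $\mathrm{Min} = \mathrm{Min}_{i=1..n}(\mathrm{LogEnergy}_i)$

(2) Calculate the target minimum $\mathrm{T\_Min} = \alpha \times \mathrm{Max}_{i=1..n}(\mathrm{LogEnergy}_i)$

(3) If $\mathrm{Min}_{i=1..n}(\mathrm{LogEnergy}_i) < \mathrm{T\_Min}$ then go to (4), else do nothing

(4) For $i = 1..n$:

$$\mathrm{LogEnergy}_i = \mathrm{LogEnergy}_i + \frac{\mathrm{T\_Min} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}} \times (\mathrm{Max} - \mathrm{LogEnergy}_i)$$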
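The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes α = 10 / X for a target range of X dB, T_Min = α × Max, and a linear update that adds (T_Min − Min) / (Max − Min) × (Max − LogEnergy_i) to each frame. The function name and sample values are hypothetical.

```python
def ern_linear(log_energy, target_db):
    """Linear log-energy dynamic range normalization (sketch)."""
    alpha = 10.0 / target_db              # assumed mapping from target range to alpha
    # Step 1: find the maximum and minimum of the sequence.
    hi, lo = max(log_energy), min(log_energy)
    # Step 2: target minimum.
    t_min = alpha * hi
    # Step 3: only rescale when the minimum lies below the target minimum.
    if lo >= t_min or hi == lo:
        return list(log_energy)
    # Step 4: lift each frame in proportion to its distance from the maximum.
    scale = (t_min - lo) / (hi - lo)
    return [e + scale * (hi - e) for e in log_energy]

frames = [4.0, 10.0, 20.0]                # hypothetical log-energy values
normalized = ern_linear(frames, target_db=20.0)
```

With these values the minimum is lifted from 4.0 to the target minimum 10.0, the middle frame moves to 13.75, and the maximum stays at 20.0.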


• The linear scaling equation may not be the best solution.

• We modify the linear scaling equation into a non-linear scaling equation.

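Linear scaling:

$$\mathrm{LogEnergy}_i = \mathrm{LogEnergy}_i + \frac{\mathrm{T\_Min} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}} \times (\mathrm{Max} - \mathrm{LogEnergy}_i)$$

Non-linear scaling:

$$\mathrm{Log}(\mathrm{LogEnergy}_i) = \mathrm{Log}(\mathrm{LogEnergy}_i) + \frac{\mathrm{Log}(\mathrm{T\_Min}) - \mathrm{Log}(\mathrm{Min})}{\mathrm{Log}(\mathrm{Max}) - \mathrm{Log}(\mathrm{Min})} \times (\mathrm{Log}(\mathrm{Max}) - \mathrm{Log}(\mathrm{LogEnergy}_i))$$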
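One reading of the non-linear equation is that the same interpolation is carried out on the logarithm of the log-energy values rather than on the values themselves. The sketch below assumes that reading, together with α = 10 / X and T_Min = α × Max. It is an illustration under those assumptions, not a confirmed transcription of the paper's formula, and the function name and sample values are hypothetical.

```python
import math

def ern_nonlinear(log_energy, target_db):
    """Non-linear ERN sketch: interpolate in the log of the log-energy."""
    alpha = 10.0 / target_db              # assumed mapping from target range to alpha
    hi, lo = max(log_energy), min(log_energy)
    if lo <= 0:
        raise ValueError("taking Log(LogEnergy) needs positive log-energy values")
    t_min = alpha * hi
    # Leave the sequence untouched when the minimum already meets the target.
    if lo >= t_min or hi == lo:
        return list(log_energy)
    log_hi, log_lo = math.log(hi), math.log(lo)
    scale = (math.log(t_min) - log_lo) / (log_hi - log_lo)
    # Update in the log domain, then map back with exp.
    return [math.exp(math.log(e) + scale * (log_hi - math.log(e)))
            for e in log_energy]

frames = [4.0, 10.0, 20.0]                # hypothetical log-energy values
normalized = ern_nonlinear(frames, target_db=20.0)
```

As in the linear case, the minimum lands on the target minimum and the maximum is unchanged. Only the path taken by the intermediate frames differs.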



• Figure 2 shows a schematic representation of the scaling effect of the proposed algorithm.

– The scaling effect decreases as the log-energy value increases, and the maximum of the sequence is unchanged.

Experiment

• The proposed method was evaluated on the Aurora 2 database.

• All recognition tests were conducted using the HTK recognition toolkit with the settings defined for evaluation.

• Speech models are eleven whole-word HMMs fixed to 16 states with 3 diagonal Gaussian mixtures per state.

– Two silence models are defined.

• Data in Test A are corrupted by Subway, Babble, Car and Exhibition noises.

• Data in Test B are corrupted by Restaurant, Street, Airport and Station noises.

• In Test C, channel distortion is included in addition to the additive noise.

Recognition results

• The results in this section are expressed in terms of relative improvement (R.I.):

$$\mathrm{R.I.}(\%) = \frac{\mathrm{NewScore} - \mathrm{Baseline}}{100 - \mathrm{Baseline}} \times 100\%$$

– where NewScore and Baseline are the recognition accuracies for each test using the proposed and reference algorithms, respectively.
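A small worked example may help with reading the tables. The sketch assumes R.I.(%) = (NewScore − Baseline) / (100 − Baseline) × 100, that is, the fraction of the baseline's errors that the new method removes. The accuracies used are hypothetical, not results from the paper.

```python
def relative_improvement(new_score, baseline):
    """Relative improvement in percent, with accuracies given in percent."""
    if baseline >= 100.0:
        raise ValueError("baseline accuracy must be below 100%")
    return (new_score - baseline) / (100.0 - baseline) * 100.0

# Hypothetical accuracies: the baseline leaves 40% errors and the new method
# leaves 20%, so half of the baseline errors are removed.
example = relative_improvement(new_score=80.0, baseline=60.0)
```

Under this definition a method that matches the baseline scores 0%, and a method that removes every remaining error scores 100%.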

Experiment

• Table 1 shows the relative improvements for different target log-energy dynamic ranges.

Experiment

• The relative improvements for different target dynamic ranges using the non-linear normalization method are shown in Table 2.

• The highest overall relative improvement, 30.83%, is achieved when the target range is set to 14 dB.

Experiment

• Table 3 compares the linear and non-linear normalization methods in terms of average relative improvement at different SNR levels.

• The mean recognition accuracy for each test set is obtained by averaging the recognition accuracies measured at 20, 15, 10, 5 and 0 dB SNR.

Experiment

• The results of experiment 2 are shown in Table 4.

• In experiment 2, we answer two questions:

– (1) What are the results of techniques such as cepstral mean and variance normalization?

– (2) Can the proposed algorithms be combined with these techniques to get an even better result?

• CMN refers to cepstral mean normalization.

– Applied to all 13 parameters.

• CVN refers to cepstral variance normalization.

– Applied to all 13 parameters.

• ERN(L) refers to the proposed linear method.

• ERN(N) refers to the proposed non-linear method.
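For readers unfamiliar with CMN and CVN, the sketch below shows the usual per-utterance form: subtract each coefficient's mean over the utterance, then divide by its standard deviation. The slides do not give the exact variant used in the experiments, so the per-utterance statistics, the order of the two steps, and the function names are assumptions. The toy input has 2 coefficients per frame instead of 13.

```python
import math

def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every coefficient."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

def cepstral_variance_normalize(frames):
    """Scale every (already mean-normalized) coefficient to unit variance."""
    n = len(frames)
    dims = len(frames[0])
    stds = [math.sqrt(sum(f[d] ** 2 for f in frames) / n) for d in range(dims)]
    # Leave constant coefficients untouched to avoid dividing by zero.
    return [[f[d] / stds[d] if stds[d] > 0 else f[d] for d in range(dims)]
            for f in frames]

frames = [[1.0, 2.0], [3.0, 6.0]]         # hypothetical 2-dimensional features
mean_normalized = cepstral_mean_normalize(frames)
normalized = cepstral_variance_normalize(mean_normalized)
```

After both steps every coefficient has zero mean and unit variance over the utterance, whereas ERN rescales only the log-energy feature and leaves its maximum unchanged.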

Conclusion

• A log-energy dynamic range normalization technique is introduced to improve ASR performance in noisy conditions.

• Reducing mismatch in log-energy leads to a large recognition improvement.

• It is also confirmed that the proposed algorithm can be combined with the cepstral mean or variance normalization techniques to achieve an even better result.
