
Page 1: Noise Compensation for Speech Recognition with Arbitrary Additive Noise

Noise Compensation for Speech Recognition with Arbitrary Additive Noise

Ji Ming

School of Computer Science, Queen’s University Belfast, Belfast BT7 1NN, UK

Presented by Shih-Hsiang

IEEE Trans. on Audio, Speech, and Language Processing, Vol. 14, No. 3, May 2006

Page 2

Introduction

• Speech recognition performance is known to degrade dramatically when a mismatch occurs between training and testing conditions

• Traditional approaches for removing the mismatch, and thereby reducing the effect of noise on recognition, include
– Removing the noise from the test signal
• Noise filtering or speech enhancement: spectral subtraction, Wiener filtering, RASTA filtering
• Assumes the availability of a priori knowledge of the noise
– Constructing a new acoustic model to match the test environment
• Noise or environment compensation: model adaptation, parallel model combination (PMC), multi-condition training, SPLICE
• Real-world noisy training data is needed

• More recent studies focus on methods requiring less prior knowledge
– Such knowledge can be difficult to obtain in real-world applications

Page 3

Introduction (cont.)

• This paper investigates noise compensation for speech recognition
– Involving additive noise of any corruption type (e.g., full, partial, stationary, or time-varying)
– Assuming no knowledge of the noise characteristics and no training data from the noisy environment

• This paper proposes a method that focuses recognition only on reliable features, yet remains robust to full noise corruption affecting all time-frequency components of the speech representation
– It combines artificial noise compensation with the missing-feature method, to accommodate mismatches between the simulated noise condition and the actual noise condition
• This makes it possible to accommodate sophisticated spectral distortion, e.g., full, partial, white, colored, or none
– It requires only clean speech training data and simulated noise data
– The method is named “Universal Compensation (UC)”

Page 4

Methodology

• The UC method comprises three steps
– Construct a set of models for short-time speech spectra using artificial multi-condition speech data
• Generated by corrupting the clean training data with artificial wide-band flat-spectrum noise at consecutive SNRs
– Given a test spectrum
• Search for the spectral components in each model spectrum that best match the corresponding spectral components in the test spectrum
• Produce a score based on the matched components for each model spectrum
– Combine the scores from the individual model spectra to form an overall score for recognition

Page 5

Methodology (cont.)

• Step 1
– Generate noise by passing white noise through a low-pass filter
• Step 2
– Calculate a score for each model spectrum based only on the matched spectral components
• Step 3
– Combine the individual scores from the model spectra to produce an overall score

[Figure: clean training spectrum corrupted by artificial wide-band flat-spectrum noise, compared against the noisy test spectrum]
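Step 1 can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the filter order, sample rate, and function names are assumptions of this sketch (the paper specifies a low-pass filter with a 3-dB bandwidth of 3.5 kHz, used below as the cutoff).

```python
# Sketch of Step 1: generate artificial wide-band flat-spectrum noise by
# low-pass filtering white noise, then add it to clean speech at a target SNR.
import numpy as np
from scipy.signal import butter, lfilter

def make_flat_spectrum_noise(n_samples, fs=8000.0, cutoff_hz=3500.0, order=6):
    """White Gaussian noise shaped by a low-pass filter (3-dB cutoff ~3.5 kHz)."""
    white = np.random.randn(n_samples)
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    return lfilter(b, a, white)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db` exactly."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

clean = np.random.randn(16000)          # stand-in for a clean training utterance
noise = make_flat_spectrum_noise(len(clean))
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
```

Scaling the noise (rather than the speech) keeps the clean signal level unchanged across SNR conditions.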

Page 6

Methodology (cont.)

• A key to the success of the UC method is the accuracy of converting full-band corruption into partial-band corruption

• This accuracy is determined by two factors
– The frequency-band resolution
• Determines the bandwidth of each spectral component
• The smaller the bandwidth, the more accurate the approximation of an arbitrary noise spectrum by piecewise flat spectra
– But a small bandwidth usually results in a loss of correlation between the spectral components, and thus poor phonetic discrimination
• An optimum frequency-band subdivision, in terms of a good balance between noise spectral resolution and phonetic discrimination, remains a topic for study
– The amplitude resolution
• Refers to the number of steps used to quantize the SNR
• The finer the quantization steps, the more accurate the approximation of any given noise level
– But a large number of SNRs may result in low computational efficiency

Page 7

Formulation
A. Model and Training Algorithms

• Assume that each training frame is represented by a spectral vector X = (x1, x2, ..., xN) consisting of N sub-band spectral components
• Assume that L levels of SNR are used to generate the wide-band flat-spectrum noise forming the noisy training data
• Let p(X|s, l) represent a model spectrum, expressed as the probability distribution of the model spectral vector X, associated with speech state s and trained on SNR level l
• Let O = (o1, o2, ..., oN) be a test spectral vector
• Recognition involves classifying each test spectrum O into an appropriate speech state s, based on the probabilities p(O|s, l) of the test spectrum associated with the individual model spectra within the state
• Computing the probability p(O|s, l) for each model spectrum
– Only the matched spectral components O_{s,l} are retained
– The mismatched components Õ_{s,l} are ignored

Page 8

Formulation (cont.)
A. Model and Training Algorithms

• The probability p(O|s, l) can be approximated by p(O_{s,l}|s, l), the marginal distribution of O_{s,l} obtained from p(O|s, l) with the mismatched spectral components Õ_{s,l} ignored, to improve mismatch robustness

• Given p(O_{s,l}|s, l) for each model spectrum, the overall probability of O associated with speech state s can be obtained by combining p(O_{s,l}|s, l) over all L SNRs:

p(O|s) = Σ_{l=1}^{L} p(l|s) p(O_{s,l}|s, l)    (1)

• For simplicity, assume that the individual spectral components are independent of one another, so the probability of any subset O_sub ⊆ O can be written as

p(O_sub|s, l) = Π_{o_n ∈ O_sub} p(o_n|s, l)    (2)
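A small numerical sketch of (1) and (2), combining per-component scores over SNR levels. The per-subband Gaussian densities, the toy parameter values, and all function names are illustrative assumptions, not from the paper:

```python
# Sketch of (1) and (2): p(O_sub|s,l) as a product of per-component densities,
# and p(O|s) as a prior-weighted sum over SNR levels (computed in log domain).
import numpy as np

def log_gauss(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def log_p_subset(o, subset, means, variances):
    """log p(O_sub|s,l) = sum over matched components n of log p(o_n|s,l) -- eq (2)."""
    return sum(log_gauss(o[n], means[n], variances[n]) for n in subset)

def log_p_state(o, subsets, model, prior_l):
    """log p(O|s) = log sum_l p(l|s) p(O_{s,l}|s,l) -- eq (1)."""
    terms = [np.log(prior_l[l]) + log_p_subset(o, subsets[l], m, v)
             for l, (m, v) in enumerate(model)]
    return np.logaddexp.reduce(terms)

N, L = 6, 3
rng = np.random.default_rng(0)
model = [(rng.normal(size=N), np.ones(N)) for _ in range(L)]  # (means, vars) per SNR level
o = rng.normal(size=N)
subsets = [list(range(N))] * L        # here: treat all components as matched
prior_l = np.full(L, 1.0 / L)
score = log_p_state(o, subsets, model, prior_l)
```

The log-domain sum (`np.logaddexp.reduce`) avoids underflow when the per-component products in (2) become very small.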

Page 9

Formulation (cont.)
A. Model and Training Algorithms

• The model spectrum p(X|s, l) may be constructed in two different ways
– First, we may estimate each p(X|s, l) explicitly by using the training data corresponding to a specific SNR
– Alternatively, we may build the model by pooling the training data from all SNR conditions together, and training the model as a usual mixture model on the mixed dataset (more flexible)
• The EM algorithm decides the association between data, mixture components, and weights

Page 10

Formulation (cont.)
B. Recognition Algorithm

• Given a test spectral vector O, the mixture probability in (1) uses only a subset O_{s,l} of the data O for each of the mixture densities (s, l)
– This reduces the effect of mismatched noisy spectral components
– But we need to decide the matched subset O_{s,l} ⊆ O that contains all the matched components for each model spectrum
• If we can assume that the matched subset produces a large probability, then O_{s,l} may be defined as the subset that maximizes the probability p(O_sub|s, l) among all possible subsets O_sub in O
• However, (2) indicates that the values of p(O_sub|s, l) for different-sized subsets are of different orders of magnitude and are thus not directly comparable
– An appropriate normalization is needed for the probability p(O_sub|s, l)
– A possible solution is to replace the conditional probability of the test subset with the posterior probability of the model spectrum:

p(s, l|O_sub) = p(O_sub|s, l) p(s, l) / Σ_{s',l'} p(O_sub|s', l') p(s', l')

which always produces a value in the range [0, 1]

Page 11

Formulation (cont.)
B. Recognition Algorithm

• By maximizing the posterior probability p(s, l|O_sub), we should be able to obtain the subset for model spectrum (s, l) that contains all the matched components. The optimum decision (a MAP criterion) is

O_{s,l} = argmax_{O_sub ⊆ O} p(s, l|O_sub)

• Substituting O_{s,l} into (1) and applying Bayes’ rule, with p(s, l) = p(l|s) p(s):

p(O|s) = Σ_{l=1}^{L} p(l|s) p(O_{s,l}|s, l)
       = Σ_{l=1}^{L} p(l|s) p(s, l|O_{s,l}) p(O_{s,l}) / p(s, l)
       ∝ Σ_{l=1}^{L} p(s, l|O_{s,l})    [p(O_{s,l}) is a "don’t care" term; an equal prior p(s) is assumed for all states]
       = Σ_{l=1}^{L} max_{O_sub ⊆ O} p(s, l|O_sub)    (3)

• The above optimized posterior probability can be incorporated into an HMM to form the state-based emission probability
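The MAP subset search and the combined state score in (3) can be sketched by brute-force enumeration over subsets, which is feasible only for a handful of components; the toy per-component log-likelihood table, the uniform prior, and all names here are assumptions of this sketch:

```python
# Sketch of (3): O_{s,l} = argmax_{O_sub} p(s,l|O_sub), then
# p(O|s) proportional to sum_l max_{O_sub} p(s,l|O_sub).
from itertools import combinations
import numpy as np

def posterior(log_lik, s, l, subset, prior):
    """p(s,l|O_sub), normalized over all (s',l') -- always in [0, 1]."""
    def joint(si, li):
        return prior[si][li] * np.exp(sum(log_lik[si][li][n] for n in subset))
    num = joint(s, l)
    den = sum(joint(si, li) for si in range(len(log_lik))
                            for li in range(len(log_lik[si])))
    return num / den

def state_score(log_lik, s, prior, n_components):
    """sum over l of the best posterior over all non-empty subsets."""
    all_subsets = [c for r in range(1, n_components + 1)
                   for c in combinations(range(n_components), r)]
    return sum(max(posterior(log_lik, s, l, sub, prior) for sub in all_subsets)
               for l in range(len(log_lik[s])))

rng = np.random.default_rng(1)
S, L, N = 2, 3, 4                         # states, SNR levels, components (toy sizes)
log_lik = rng.normal(size=(S, L, N))      # stand-in for log p(o_n|s,l) on one test frame
prior = np.full((S, L), 1.0 / (S * L))
scores = [state_score(log_lik, s, prior, N) for s in range(S)]
```

With N components there are 2^N - 1 non-empty subsets, which is why the paper's discussion of frequency-band resolution matters in practice: the subset search grows exponentially with the number of subbands.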

Page 12

Experimental Evaluation
A. Databases

• Two databases are used to evaluate the performance of the UC method
– The first database is Aurora 2
• For speaker-independent recognition of digit sequences in noisy conditions
– The second database contains the highly confusing E-set words
• Used as an example to further examine the ability of the new UC model to deal with acoustically confusing recognition tasks
• The E-set words are b, c, d, e, g, p, t, v

Page 13

Experimental Evaluation (cont.)
Acoustic Modeling for Aurora 2

• The performance of the UC model is compared with that of four baseline systems
– The first is trained on the clean training set
• 3 mixtures per state for the digits / 6 mixtures per state for the silence
– The second is trained on the multi-condition training set
• 3 mixtures per state for the digits / 6 mixtures per state for the silence
– The third is an improved model corresponding to the complex back-end model
• 20 mixtures per state for the digits / 36 mixtures per state for the silence
– The fourth uses 32 mixtures for all states
• It thus has the same model complexity as the UC model
• The UC model is trained using only the clean training set
– Expanded by adding wide-band flat-spectrum noise to each of the utterances
– 10 different SNR levels, from 20 dB down to 2 dB in 2-dB steps
– The wide-band flat-spectrum noise is computer-generated white noise filtered by a low-pass filter with a 3-dB bandwidth of 3.5 kHz
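The training-set expansion described above can be sketched as follows; `corrupt` stands in for the flat-spectrum corruption at a given SNR (a hypothetical helper, not defined in the paper), and the toy arrays are placeholders for real utterances:

```python
# Sketch of the UC training-set expansion: each clean utterance is kept and
# also duplicated at 10 SNR levels, 20 dB down to 2 dB in 2-dB steps.
import numpy as np

snr_levels_db = list(range(20, 1, -2))   # [20, 18, ..., 2] -> 10 levels

def expand_training_set(clean_utterances, corrupt):
    """Return the clean utterances plus one corrupted copy per SNR level."""
    expanded = list(clean_utterances)
    for utt in clean_utterances:
        for snr in snr_levels_db:
            expanded.append(corrupt(utt, snr))
    return expanded

utts = [np.random.randn(800) for _ in range(3)]          # toy "utterances"
corrupt = lambda u, snr: u + np.random.randn(len(u)) * 10 ** (-snr / 20.0)  # toy stand-in
expanded = expand_training_set(utts, corrupt)            # 3 clean + 3 * 10 noisy copies
```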

Page 14

Experimental Evaluation (cont.)
Acoustic Modeling for Aurora 2

• The speech is divided into frames of 25 ms at a frame rate of 10 ms
• For each frame
– A 13-channel mel filter bank is used to obtain 13 log filter-bank amplitudes
– These 13 amplitudes are then decorrelated by a high-pass filter H(z) = 1 - z^{-1}, resulting in 12 decorrelated log filter-bank amplitudes, denoted by D = (d1, d2, ..., d12)
– The bandwidth of a subband can be increased conveniently by grouping neighboring subband components into a new subband component; for example, a 6-subband spectral vector can be expressed as D = ({d1, d2}, {d3, d4}, ..., {d11, d12}) = (o1, o2, ..., o6)
– In this paper, each feature vector consists of 18 components, D = (d1, d2, ..., d18)
• 6 static subband spectra, 6 delta subband spectra, and 6 delta-delta subband spectra
• The overall size of the feature vector for a frame is 18 x 2 = 36
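The per-frame feature path above can be sketched as below. The mel filter bank itself is omitted; `log_fbank` stands in for its 13 log amplitudes, and the function names are assumptions of this sketch:

```python
# Sketch of the per-frame features: 13 log filter-bank amplitudes, decorrelated
# across channels by H(z) = 1 - z^(-1) (a first difference), then paired into
# 6 subband components o1..o6.
import numpy as np

def decorrelate(log_fbank):
    """Apply H(z) = 1 - z^(-1) along the channel axis: d_k = a_{k+1} - a_k."""
    return np.diff(log_fbank)            # 13 amplitudes -> 12 components

def group_subbands(d, group_size=2):
    """Pair neighboring components: {d1,d2},{d3,d4},... -> o1..o6."""
    return d.reshape(-1, group_size)

log_fbank = np.random.randn(13)          # stand-in for one frame's log amplitudes
d = decorrelate(log_fbank)               # 12 decorrelated components
o = group_subbands(d)                    # 6 subband components of 2 channels each
```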

Page 15

Experimental Evaluation (cont.)
Tests on Aurora 2 Conditions

• The table shows the recognition results for clean test data
• For the clean data, the best accuracy rates were obtained by the multi-condition baseline models with 20 and 32 mixtures per state
• The UC model performed on average slightly better than the multi-condition model with 3 mixtures per state

Page 16

Experimental Evaluation (cont.)
Tests on Aurora 2 Conditions

• The tables show the recognition results on test set A and test set B
• The UC model improved significantly over the baseline model trained on clean data, and achieved an average performance close to that of the multi-condition model with three mixtures per state
• Car noise exhibits a less sophisticated spectral structure than babble noise, and thus may be more accurately matched by the piecewise flat spectra implemented in the UC model

Page 17

Experimental Evaluation (cont.)
Tests on Aurora 2 Conditions

• The table shows the recognition results on test set C
• The channel mismatch problem can be handled by Multi-20 and Multi-32
• The UC model also showed a capability of coping with this mismatch
– Its performance is little affected by channel mismatch
• The figure summarizes the average word accuracy results for the five systems

Page 18

Experimental Evaluation (cont.)
Tests on Noise Unseen in Aurora 2

• The purpose of this study is to further investigate the capability of the UC model to offer robustness to a wide variety of noises
– Three additional noises are used
• A polyphonic mobile phone ring, a pop song segment, and a broadcast news segment
– The spectral characteristics of the three noises are shown in the figure

[Figure: spectra of the polyphonic mobile phone ring, the pop song segment, and the broadcast news segment]

Page 19

Experimental Evaluation (cont.)
Tests on Noise Unseen in Aurora 2

• The UC model offered improved accuracy over all three baseline models
• The UC model produced particularly good results for the ringtone noise
– because this noise causes mainly partial corruption over the speech frequency band
• The table also indicates that increasing the number of mixtures in the mismatched baseline model
– produced only a small improvement for the news noise
– and no improvement for the phone-ring noise

Page 20

Experimental Evaluation (cont.)
Tests on Noise Unseen in Aurora 2

• The UC model, with a complexity similar to that of Multi-32, performed similarly to Multi-3 trained in matched conditions

• The UC model was able to outperform Multi-32 in the case of unknown/mismatched noise conditions

Page 21

Experimental Evaluation (cont.)
Discrimination Study on an E-Set Database

• This experiment examines the ability of the UC model to discriminate between acoustically confusing words
– While it reduces the mismatch between training and testing conditions, does it also reduce the discrimination between utterances of different words?
• The experiments use a new database containing the highly confusing E-set words (b, c, d, e, g, p, t, v), extracted from the Connex speaker-independent alphabetic database provided by British Telecom
– It contains three repetitions of each word by 104 speakers
• 53 male and 51 female
• Of the 104 speakers, 52 are used for training and the other 52 for testing
• For each word, about 156 utterances are available for training
• A total of 1219 utterances are available for testing
• Four different noises from Aurora 2 test set A are artificially added
• Two baseline HMMs are built
– One with the clean training set (1 mixture per state)
– The other with the multi-condition training set (11 mixtures per state)

Page 22

Experimental Evaluation (cont.)
Discrimination Study on an E-Set Database

• For the clean E-set, the UC model achieved a recognition accuracy close to that of the baseline model, with only a small loss in accuracy (84.91% → 83.33%)
• For the given noise conditions, the UC model achieved an average performance close to that of the multi-condition baseline model

Page 23

Experimental Evaluation (cont.)
Discrimination Study on an E-Set Database

• Finally, the performance of the UC model was tested with different resolutions for quantizing the SNR
– Three training sets are generated with increasing SNR resolution
• Coarse quantization (6 mixtures per state)
– Includes only five different SNRs, from 20 dB to 4 dB with a 4-dB step
• Medium-resolution quantization (11 mixtures per state)
– Includes ten different SNRs, from 20 dB to 2 dB with a 2-dB step
• Fine quantization (21 mixtures per state)
– Includes twenty different SNRs, from 20 dB to 2 dB with a 1-dB step
• Additionally, all three sets include the clean training data

Page 24

Experimental Evaluation (cont.)
Discrimination Study on an E-Set Database

• The models with medium and fine quantization produce quite similar recognition accuracy in many test conditions
• The model with coarse quantization, trained with 6 SNRs, produced poorer results than the other two, but still showed significant improvement over the baseline model trained on clean data

Page 25

Summary

• This paper investigated noise compensation for speech recognition
– Assuming no knowledge of the noise characteristics and no training data from the noisy environment
– Universal compensation (UC) is proposed as a possible solution to the problem
– The UC method involves a novel combination of the principle of multi-condition training and the principle of the missing-feature method

• Experiments on Aurora 2 have shown that the UC model has the potential to achieve recognition performance close to that of the multi-condition model, without assuming knowledge of the noise
• Further experiments with noises unseen in Aurora 2 have indicated the ability of the UC model to offer robust performance for a wide variety of noises
• Finally, the experimental results on an E-set database have demonstrated the ability of the UC model to deal with acoustically confusing recognition tasks