
Page 1:

Joint Uncertainty Decoding with Found Data

Hank Liao and Mark Gales

September 4, 2006

Cambridge University Engineering Department

Toshiba 2006 Presentation

Page 2:

Noise Robust Speech Recognition

• Goal: To improve speech recognition performance in noise.

• Traditionally, two main approaches to improving robustness to noise:

– Feature Compensation: clean the features to match clean speech
– Model Compensation: update the models to match the corrupted speech.

[Diagram: feature compensation maps the corrupted speech back to the clean feature space; model compensation maps the clean acoustic models to noisy acoustic models, linking training and test conditions.]

• Uncertainty decoding is a hybrid of these approaches

– fast feature-based processing with a simple, yet powerful model update.

• These all assume a “clean” acoustic model is available.


Page 3:

What Are “Clean” Acoustic Models?

• In practice “clean” data seldom exists

– must be prepared and collected, which is expensive
– rarely matches the target test environment anyway.

• Found data is free & plentiful, e.g. broadcast news or telephone conversations

– varying quality though: wide/narrowband, different speakers and noise.

• Multicondition or multistyle training yields models more robust to noise

– may be prepared artificially by corrupting clean data at various SNRs
– still have the issue of mismatch between training and test
– would like to use data from the actual target application.

• Adaptive training normalises training data

– transforms model the extraneous speaker, channel or noise factors
– the canonical acoustic model solely represents acoustic speech variation
– a truly clean acoustic model.


Page 4:

Noise Robustness Framework

• The effects of noise may be represented as a dynamic Bayesian network (DBN).

• Corrupted speech likelihood given by

p(yt|M, θt) =

x

p(yt|xt) p(xt|M, θt)dxt

p(yt|xt) =

n

p(yt|xt, nt) p(nt|θnt )dnt

– only p(yt|xt) depends on noise.

• Find an efficient approximation (see the sketch below):

– independent of clean model complexity
– appropriate form for integration.
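To make the marginalisation above concrete, here is a minimal Monte Carlo sketch, assuming scalar log-spectral features, a single-Gaussian clean model, the standard log-add mismatch y = log(e^x + e^n), and hypothetical parameter values throughout; it illustrates the integral, it is not an efficient decoder.

import numpy as np

rng = np.random.default_rng(0)

# hypothetical scalar models: clean speech x and additive noise n
mu_x, var_x = 2.0, 1.0
mu_n, var_n = 0.5, 0.25
obs_var = 0.01  # narrow p(y|x,n) centred on the mismatch function

def corrupted_likelihood(y, samples=100_000):
    # p(y|M) = E_{x,n}[ p(y|x,n) ] with p(y|x,n) = N(y; log(e^x + e^n), obs_var)
    x = rng.normal(mu_x, np.sqrt(var_x), samples)
    n = rng.normal(mu_n, np.sqrt(var_n), samples)
    f = np.logaddexp(x, n)
    dens = np.exp(-0.5 * (y - f) ** 2 / obs_var) / np.sqrt(2 * np.pi * obs_var)
    return dens.mean()

print(corrupted_likelihood(2.3))

Sampling makes the cost grow with the clean model size and the number of draws, which is exactly what the efficient approximations on the following pages avoid.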


Page 5:

VTS Noise Compensation

• Relationship between clean and corrupted speech and noise

y = x + h + C log(1 + exp(C⁻¹(n − x − h)))

– n is additive noise, h convolutional noise, and C is the DCT matrix
– assumes the availability of a clean acoustic model to compensate.

• Use a 1st-order VTS approximation to compensate the model parameters (the static compensation is sketched below)

  µ_y ≈ E{y_1vts} = µ_x + µ_h + C log(1 + exp(C⁻¹(µ_n − µ_x − µ_h)))

  Σ_y ≈ Var{y_1vts} = (∂y/∂x) Σ_x (∂y/∂x)ᵀ + (∂y/∂n) Σ_n (∂y/∂n)ᵀ, with the Jacobians evaluated at the expansion point µ_0

• Dynamic parameter update uses the Continuous-Time approximation ∆y ≈ ∂y/∂t
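A minimal numpy sketch of the static compensation above, assuming the log-spectral domain (so C = I) and diagonal covariances; all input values are hypothetical.

import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n, mu_h):
    # First-order VTS compensation of a diagonal Gaussian in the
    # log-spectral domain, where y = x + h + log(1 + exp(n - x - h)).
    d = mu_n - mu_x - mu_h          # expansion point at the means
    mu_y = mu_x + mu_h + np.log1p(np.exp(d))
    g = 1.0 / (1.0 + np.exp(-d))    # ∂y/∂n at the expansion point
    jx = 1.0 - g                    # ∂y/∂x (= ∂y/∂h)
    var_y = jx**2 * var_x + g**2 * var_n
    return mu_y, var_y

mu_y, var_y = vts_compensate(np.array([2.0]), np.array([1.0]),
                             np.array([0.5]), np.array([0.25]),
                             np.array([0.1]))
print(mu_y, var_y)

In the cepstral domain the same Jacobians are wrapped in C and C⁻¹, and the compensation is applied per Gaussian component, which is what makes full VTS expensive for large models.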


Page 6:

Model-based Joint Uncertainty Decoding

• Make the conditional p(y_t | x_t) dependent on the model components.

– As with MLLR, group acoustically similar components into R classes.

• The observed likelihood is a linear transform of the features with an offset on the model variance

  p(y_t | m, θ_t) ≈ |A^(r)| N(A^(r) y_t + b^(r); µ^(m), Σ^(m) + Σ_b^(r))

• The form of the parameters A^(r), b^(r) and Σ_b^(r) depends on p(y_t | x_t, r)

– in Joint Uncertainty Decoding, p(y_t | x_t, r) is derived from the joint distribution (see the sketch below).
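As a sketch, a decoder might score one component m in regression class r as follows, assuming full covariances and precomputed transform parameters; all names and values here are placeholders.

import numpy as np

def jud_log_likelihood(y, A, b, Sigma_b, mu_m, Sigma_m):
    # log of |A| N(A y + b; mu_m, Sigma_m + Sigma_b)
    d = len(y)
    z = A @ y + b                       # one feature transform per class r
    S = Sigma_m + Sigma_b               # variance offset per component m
    diff = z - mu_m
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_A = np.linalg.slogdet(A)  # log|det A|
    quad = diff @ np.linalg.solve(S, diff)
    return logdet_A - 0.5 * (d * np.log(2 * np.pi) + logdet_S + quad)

dim = 3
print(jud_log_likelihood(np.zeros(dim), np.eye(dim), np.zeros(dim),
                         0.1 * np.eye(dim), np.zeros(dim), np.eye(dim)))

The feature transform is shared by all components in class r, so the per-frame transform cost scales with R rather than with the number of Gaussians.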


Page 7:

Estimating Uncertainty Transform Parameters

• Estimate a maximum likelihood noise model given some test data

– determine the additive noise µ_n, Σ_n and channel mean µ_h.

• Given the clean model {µ_x^(r), Σ_x^(r)} and the noise model, compute the joint distribution

– the Joint form requires the cross-covariance Σ_xy^(r) and the noisy speech {µ_y^(r), Σ_y^(r)}
– showed previously how to generate the joint distribution using VTS.

• If (# of model classes R) = (# of components M), Joint converges to VTS

– the model-based Joint scheme is scalable (a sketch of the transform computation follows).
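A sketch of the transform computation from the class-r joint distribution, using standard conditional-Gaussian algebra; the joint statistics would come from the VTS construction noted above (e.g. Σ_yx ≈ (∂y/∂x) Σ_x), and the inputs below are placeholders rather than trained model values.

import numpy as np

def jud_transform(mu_x, Sigma_x, mu_y, Sigma_y, Sigma_yx):
    # Given the joint Gaussian of clean x and noisy y for class r, compute
    # the transform that yields |A| N(A y + b; mu_m, Sigma_m + Sigma_b):
    A = Sigma_x @ np.linalg.inv(Sigma_yx)
    b = mu_x - A @ mu_y
    Sigma_b = A @ Sigma_y @ A.T - Sigma_x
    return A, b, Sigma_b

d = 2
Sigma_x = np.eye(d)
Sigma_yx = 0.8 * np.eye(d)                       # cross-covariance from VTS
Sigma_y = 0.8 * Sigma_x * 0.8 + 0.3 * np.eye(d)  # J Σx Jᵀ plus a noise term
A, b, Sigma_b = jud_transform(np.zeros(d), Sigma_x, np.ones(d),
                              Sigma_y, Sigma_yx)
print(A, b, Sigma_b, sep="\n")

With R = M this reproduces per-component VTS compensation; with small R the transforms are cheap to estimate and apply.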


Page 8:

ML Noise Estimation for Joint Uncertainty Decoding

• Estimate the noise model from samples of the corrupted speech environment

• Before, derived the ML VTS noise estimate

– need a hypothesis and the full acoustic model
– use VTS to combine speech and noise to give the likelihood of the corrupted speech
– with EM, iteratively update the noise parameters to maximise the likelihood
– in the EM step, refine the VTS expansion point.

• Do the same for Joint compensation (a toy sketch follows)

– start from the VTS estimate, then refine
– maximise the Joint auxiliary function.
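A toy illustration of the expansion-point refinement, assuming a scalar log-spectral model with known clean statistics and an unknown additive noise mean; this is a simple Newton-style moment-matching loop, not the full EM auxiliary optimisation, and all values are synthetic.

import numpy as np

rng = np.random.default_rng(1)

mu_x, var_x = 2.0, 1.0                 # known clean model
true_mu_n, var_n = 0.8, 0.1            # noise mean to recover
x = rng.normal(mu_x, np.sqrt(var_x), 2000)
n = rng.normal(true_mu_n, np.sqrt(var_n), 2000)
y = np.logaddexp(x, n)                 # observed corrupted frames

mu_n = 0.0                             # initial noise estimate
for _ in range(20):
    d = mu_n - mu_x                    # linearise at the current estimate
    mu_y = mu_x + np.log1p(np.exp(d))  # predicted corrupted-speech mean
    g = 1.0 / (1.0 + np.exp(-d))       # ∂y/∂n at the expansion point
    mu_n += (y.mean() - mu_y) / max(g, 1e-3)  # Newton step on the residual

print("estimated mu_n:", mu_n)         # near 0.8, biased by the linearisation

Each pass re-linearises around the new estimate, mirroring the refinement of the VTS expansion point inside EM.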


Page 9:

Joint Adaptive Training

• Adaptive training removes unwanted factors

– e.g. environmental and speaker variability
– yields a noise-free “clean” acoustic model.

• Instead of CMN or CMLLR, the factor transform T is Joint

– ML estimation of the transform T̂ presented
– a more powerful representation of the noise effect.

• Iterative model parameter update formulae to estimate the canonical model parameters M (see the sketch below)

– no closed-form solution, use gradient descent
– various schemes to stabilise estimation.
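A scalar sketch of why a gradient scheme is needed for the canonical variance: it sits inside (Σ + Σ_b^(r)) for every noise condition r, so the usual closed-form update is unavailable. The per-condition statistics, step size and values below are hypothetical.

import numpy as np

def jat_var_update(stats, Sigma_b, Sigma0=1.0, lr=0.01, iters=300):
    # stats: (gamma_r, S_r) per condition r, where gamma_r = Σ_t γ_rt and
    # S_r = Σ_t γ_rt (z_rt − µ)² for transformed observations z = A_r y + b_r.
    log_s = np.log(Sigma0)             # optimise log-variance for positivity
    for _ in range(iters):
        s = np.exp(log_s)
        grad = 0.0
        for (gamma, S), sb in zip(stats, Sigma_b):
            tot = s + sb
            grad += 0.5 * (S / tot**2 - gamma / tot)  # dQ/dΣ for condition r
        log_s += lr * grad * s         # chain rule: dQ/d(log s) = s · dQ/ds
    return np.exp(log_s)

print(jat_var_update(stats=[(100.0, 150.0), (80.0, 300.0)],
                     Sigma_b=[0.2, 1.5]))

Optimising in the log domain is one simple way to keep the estimate positive, in the spirit of the stabilisation schemes mentioned above.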


Page 10:

Clean Speech Class Model

• Estimate µ_x^(r), Σ_x^(r) from the full clean models M

– the variance may be diagonal, but full is better, especially with few classes R.

• Issue with Joint adaptive training

– after step 2, the acoustic model is updated to M̂
– hence the clean speech class model changes
– now M̂ + Φ̂ does not give T̂ for step 3.

• Must check that the newly estimated transform is better than the previous one

– may have a situation where the current T̂ was estimated from a different clean speech class model and acoustic models.


Page 11:

Experiments on Resource Management Corpus

• 1000-word naval ARPA Resource Management (RM) database

– continuous read speech recorded in a sound-isolated room: 49 dB SNR.

• Baseline recogniser is CU HTK using RM recipe:

– trained on 3.8 hours of data with 109 speakers uttering 3990 sentences
– cross-word triphone models, tied states, 6 components per state
– MFCC features with 0th cepstrum, deltas and delta-deltas for 39 dimensions.

• Multistyle training on artificially corrupted data

– at SNRs of 8, 14, 20, 26, and 32 dB, but no “clean” speech.

• SI task over three test sets, 30 speakers, 900 utterances, reporting %WER

– artificially corrupted with NOISEX-92 Operating Room noise
– noise and transforms estimated at the per-speaker level.


Page 12:

System Overview


Page 13:

Baseline RM Clean and Multistyle Performance

Acoustic Model   Compensation   Clean   20 dB   14 dB
Clean            —              3.1     38.0    83.7
                 Joint          3.1     9.2     22.6
                 VTS            3.0     8.4     23.6
Multistyle       —              11.7    7.0     15.5
                 Joint          8.6     6.7     12.3
                 VTS            8.8     6.5     12.0
Matched          —              3.1     7.4     14.3

(%WER by test set SNR)

• Joint and VTS compensation give good gains on either model training method

– model-based Joint with 16 transforms close to VTS.

• Multistyle trained acoustic models superior for noisier conditions

– clean is an unseen condition: it was not included in the training data
– with compensation, exceeds matched results on noisy data.


Page 14:

Noise Estimation for Joint Uncertainty Decoding

Acoustic Model   Compensation   Noise Est. Type   Clean   20 dB   14 dB
Clean            —              —                 3.1     38.0    83.7
                 Joint          VTS               3.1     10.1    35.3
                 Joint          Joint             3.1     9.2     22.6
Multistyle       —              —                 11.7    7.0     15.5
                 Joint          VTS               9.0     8.6     15.9
                 Joint          Joint             8.6     6.7     12.3

• VTS ML noise estimates may be used to estimate Joint transforms

– good results on clean models, but Joint ML noise estimates better.

• On multistyle models, using VTS estimates is poorer than no compensation

– Joint noise estimates give far superior transforms and results.


Page 15:

Joint Adaptive Training

Acoustic Model   Compensation   14 dB   +CMLLR
Multistyle       —              15.5    13.8
                 Joint          12.3    11.7
JAT              Joint          11.4    10.9
Matched          —              14.3    12.6

(%WER on the 14 dB test set)

• Model-based Joint transforms complement CMLLR (two full-matrix transforms).

• Adaptively trained acoustic model gives better results than multistyle

– in the noisy conditions tested, beats matched and multistyle training
– but JAT WER on clean data is 5.6%, compared to 3.1% for clean training on clean data
– Joint transforms are effective at noise normalisation.


Page 16:

Experiments on Broadcast News

• 145 hours of training data from recorded news broadcasts such as CNN, ABC, CBS, and NPR.

• Based on 2003 CUHTK system

– same segmenter and clusterer
– 59k word dictionary
– cross-word triphone models, tied states
– 7k states, 120k components, 16 comp/GMM.

• MFCCs with 0th cepstrum, deltas and accelerations.

• Test sets include

– dev03, 2.5 hours of news from 2001
– eval98, 2.9 hours of news from 1998.

[Diagram: multi-pass decoding pipeline: segmentation and clustering; P1 initial transcription (one best); P2 lattice generation (lattices); P3 trigram rescoring (one best).]


Page 17:

Preliminary Broadcast News Experiments

• Wideband results only, narrowband provided by common system

• Noise estimated on per speaker level

                dev03                            eval98
Compensation    Overall   F0     F1     F4      Overall
—               20.8      21.2   10.6   21.8    20.8
Joint           18.8      19.4   9.9    21.1    17.5
VTS             18.8      19.1   10.0   21.0    17.2

F0 – baseline broadcast speech
F1 – spontaneous broadcast speech
F4 – speech under degraded acoustic conditions

• Joint with 256 transforms close to VTS results

– modest gains on cleaner broadcast news data (F0, F1)
– as expected, biggest gains on noisier data in the F4 focus condition.


Page 18:

Experiments on Toshiba Datasets

• 240 hours of artificially corrupted dictated training data

– 6h TIMIT data, 60h WSJ0 and 174h WSJ1
– multi-condition data with noise varying per utterance:
  12.5% uncorrupted, the rest with SNR distributed from 0 to 20 dB.

• Baseline multistyle recogniser is Toshiba CRL-STG system

– 10 MFCC parameters plus 0th cepstrum, deltas and accelerations
– 2.2k word dictionary, cross-word triphone models, tied states
– ∼800 states, 8k components, 10 comp/GMM.

• Three Appen test tasks examined in office, idle, city and highway conditions

– command and control task (60 utts/spkr)
– telephone numbers (30 utts/spkr), 80% local calls, 20% international
– 550 city names (30 utts/spkr).

– office: 20 speakers, close-talk mic, SNR ∼34 dB
– in car: 32 speakers each, AKG mirror mic, SNR ∼35 dB, 25 dB and 18 dB.


Page 19:

HLDA

CMN
Condition   C&C    Cities   Digits
Office      0.93   22.2     3.58
Idle        1.15   17.9     2.77
City        2.80   23.8     5.80
Highway     2.80   23.2     8.71

CMN + HLDA
Condition   C&C    Cities   Digits
Office      0.54   17.7     2.69
Idle        0.98   14.6     1.98
City        2.52   22.3     4.43
Highway     1.98   22.1     5.96

• For HLDA, add third derivatives to give a 44-dimensional feature vector (see the sketch below)

– project down to 33 dimensions with a full feature matrix
– equivalent to decoding with a shared full global covariance.

• Gives good improvement in cleaner conditions

– noise decreases the effectiveness of HLDA
– the decorrelating HLDA transform is no longer as accurate in noise.
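A sketch of the decode-time use of HLDA, assuming a trained 44×44 transform (a random stand-in here) and the dimensions from the slide:

import numpy as np

rng = np.random.default_rng(2)

D_FULL, D_USE = 44, 33  # statics + 1st/2nd/3rd derivatives, useful dims kept

# a real HLDA matrix is ML-trained on the training data; random stand-in here
A = rng.standard_normal((D_FULL, D_FULL))

def hlda_project(feat):
    # The 11 rejected dimensions share one global Gaussian across all
    # states, so they add the same constant to every hypothesis and can
    # simply be dropped at decode time.
    z = A @ feat
    return z[:D_USE]

print(hlda_project(rng.standard_normal(D_FULL)).shape)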


Page 20:

VTS and Joint Compensation

HLDA Baseline
Condition   C&C    Cities   Digits
Office      0.44   7.55     2.47
Idle        1.21   13.5     3.69
City        3.30   18.6     7.40
Highway     5.11   28.9     13.2

VTS
Condition   C&C    Digits
Office      0.65   3.68
Idle        1.42   4.52
City        4.09   5.75
Highway     2.36   6.60

16 Joint Xforms
Condition   C&C    Digits
Office      0.43   2.53
Idle        4.56   2.75
City        2.71   5.07
Highway     2.49   7.35

• VTS and Joint give gains over the baseline

– office/idle results (∼35 dB) degrade due to multistyle training
– in the highway condition (18 dB) the WER is halved
– Joint is better than VTS due to refinement of the noise estimates.


Page 21:

Conclusions

• Multistyle training gives better models than clean to compensate for noise

– VTS and Joint performance using multistyle beats matched
– CMLLR complements Joint uncertainty decoding
– positive results on real, found data.

• Joint adaptive training is superior to both clean and multistyle

– can yield a “clean” canonical model of speech acoustics
– a powerful method to factor noise out of acoustic models.

• Future work will look at

– Joint adaptive training on Broadcast News and Toshiba data
– combining Joint transforms with semi-tied covariances
– optimising and devising efficiencies.
