TRANSCRIPT
Joint Uncertainty Decoding with Found Data
Hank Liao and Mark Gales
September 4, 2006
Cambridge University Engineering Department
Toshiba 2006 Presentation
H. Liao and M.J.F. Gales
Noise Robust Speech Recognition
• Goal: To improve speech recognition performance in noise.
• Traditionally, two main approaches to improving robustness to noise:
– Feature Compensation: clean the features to match clean speech
– Model Compensation: update the models to match the corrupted speech.
[Figure: feature-space vs model-space compensation. Feature compensation maps corrupted speech back towards clean speech to match the clean acoustic models (training conditions); model compensation maps clean acoustic models to noisy acoustic models to match the corrupted speech (test conditions).]
• Uncertainty decoding is a hybrid of these two approaches
– fast-feature based processing with a simple, yet powerful model update.
• These all assume a “clean” acoustic model is available.
What Are “Clean” Acoustic Models?
• In practice “clean” data seldom exists
– must be prepared and collected, which is expensive
– rarely matches the target test environment anyway.
• Found data is free & plentiful, e.g. broadcast news or telephone conversations
– varying quality though: wide/narrowband, different speakers and noise.
• Multicondition or multistyle training yields models more robust to noise
– may be prepared artificially by corrupting clean data at various SNRs
– still an issue with mismatch between training and test
– would like to use data from the actual target application.
• Adaptive training normalises training data
– transforms model the extraneous speaker, channel or noise factors
– canonical acoustic model solely represents acoustic speech variation
– a truly clean acoustic model.
Noise Robustness Framework
• The effects of noise may be represented as a dynamic Bayesian network (DBN).
• Corrupted speech likelihood is given by

p(yt|M, θt) = ∫ p(yt|xt) p(xt|M, θt) dxt

p(yt|xt) = ∫ p(yt|xt, nt) p(nt|θt(n)) dnt

– only p(yt|xt) depends on the noise.
• Find an efficient approximation:
– independent of clean model complexity
– of an appropriate form for integration.
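In the simplest case, where the conditional is purely additive and linear, p(yt|xt) = N(yt; xt + µn, Σn), the integral above has a closed form. A minimal Monte Carlo sketch with made-up 1-D parameters (not values from the presentation) that checks the marginalisation numerically:

```python
import math
import random

# Hypothetical 1-D parameters: clean speech p(x) = N(mu_x, var_x) and a
# purely additive, linear-domain noise model p(y|x) = N(x + mu_n, var_n).
# The closed form of the integral is then p(y) = N(mu_x + mu_n, var_x + var_n).
mu_x, var_x = 2.0, 1.5
mu_n, var_n = 0.5, 0.4

def gauss(v, mean, var):
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def p_y_monte_carlo(y, n_samples=200000, seed=0):
    # Approximate p(y) = ∫ p(y|x) p(x) dx by sampling x ~ p(x)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu_x, math.sqrt(var_x))
        total += gauss(y, x + mu_n, var_n)
    return total / n_samples

mc = p_y_monte_carlo(3.0)
exact = gauss(3.0, mu_x + mu_n, var_x + var_n)
print(mc, exact)  # the two estimates agree closely
```

The nonlinear mismatch function on the following slides makes this integral intractable in general, which is why an efficient approximation is needed.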
VTS Noise Compensation
• Relationship between clean and corrupted speech and noise
y = x + h + C log(1 + exp(C⁻¹(n − x − h)))

– n is additive noise, h is convolutional noise, and C is the DCT matrix
– assumes a clean acoustic model is available to compensate.
• Use 1st-order VTS approximation to compensate model parameters
µy ≈ E{y1vts} = µx + µh + C log(1 + exp(C⁻¹(µn − µx − µh)))

Σy ≈ Var{y1vts} = (∂y/∂x)|µ0 Σx (∂y/∂x)ᵀ|µ0 + (∂y/∂n)|µ0 Σn (∂y/∂n)ᵀ|µ0

• Dynamic parameter update uses the Continuous-Time approximation: ∆y ≈ ∂y/∂t.
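The mismatch function and the first-order expansion above can be sketched numerically. This is a minimal illustration assuming an orthonormal DCT and made-up low-dimensional Gaussian parameters, not values from the presentation:

```python
import numpy as np

# Sketch of first-order VTS model compensation following the mismatch
# function y = x + h + C log(1 + exp(C^{-1}(n - x - h))). The dimension,
# DCT matrix and all Gaussian parameters are illustrative assumptions.
d = 4

def dct_matrix(d):
    # Orthonormal type-II DCT, so the inverse is the transpose
    C = np.array([[np.cos(np.pi * (j + 0.5) * i / d) for j in range(d)]
                  for i in range(d)])
    C[0] *= np.sqrt(1.0 / d)
    C[1:] *= np.sqrt(2.0 / d)
    return C

C = dct_matrix(d)
C_inv = C.T

mu_x = np.array([1.0, 0.5, -0.2, 0.1])    # clean speech mean
Sigma_x = np.diag([0.5, 0.3, 0.2, 0.1])   # clean speech covariance
mu_n = np.array([0.8, 0.0, 0.0, 0.0])     # additive noise mean
Sigma_n = np.diag([0.2, 0.1, 0.1, 0.05])  # additive noise covariance
mu_h = np.zeros(d)                        # convolutional noise mean

def mismatch(x, n, h):
    # y = x + h + C log(1 + exp(C^{-1}(n - x - h)))
    return x + h + C @ np.log1p(np.exp(C_inv @ (n - x - h)))

# First-order expansion about the point mu_0 = (mu_x, mu_n, mu_h)
u = C_inv @ (mu_n - mu_x - mu_h)
s = 1.0 / (1.0 + np.exp(u))
J_x = C @ np.diag(s) @ C_inv         # dy/dx at mu_0
J_n = C @ np.diag(1.0 - s) @ C_inv   # dy/dn at mu_0; the two sum to I

mu_y = mismatch(mu_x, mu_n, mu_h)
Sigma_y = J_x @ Sigma_x @ J_x.T + J_n @ Sigma_n @ J_n.T
print(mu_y)
print(np.diag(Sigma_y))
```

The two Jacobians share the DCT basis and sum to the identity, which keeps the compensation cheap to evaluate.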
Model-based Joint Uncertainty Decoding
• Make conditional p(yt|xt) dependent on model components.
– Like with MLLR, group acoustically similar components into R classes.
• Observed likelihood is a linear transform of the features with an offset model variance

p(yt|m, θt) ≈ |A(r)| N(A(r)yt + b(r); µ(m), Σ(m) + Σb(r))

• The form of the parameters A(r), b(r) and Σb(r) depends on p(yt|xt, r)
– in Joint Uncertainty Decoding, p(yt|xt, r) derived from joint distribution.
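A minimal sketch of evaluating this compensated likelihood for one component, with illustrative (made-up) transform and model parameters:

```python
import numpy as np

# Joint uncertainty decoding likelihood for one component m in regression
# class r. All numerical values are made up for illustration; in the real
# system A, b and Sigma_b are derived from the joint distribution of clean
# and corrupted speech for class r.
d = 2
A = np.array([[1.1, 0.1], [0.0, 0.9]])  # class transform A^(r)
b = np.array([0.2, -0.1])               # class bias b^(r)
Sigma_b = np.diag([0.3, 0.2])           # class uncertainty variance bias
mu_m = np.array([0.5, 1.0])             # component mean
Sigma_m = np.diag([0.4, 0.6])           # component covariance

def log_gauss(v, mean, cov):
    diff = v - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(v) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def jud_log_likelihood(y):
    # log p(y|m) ≈ log|A| + log N(Ay + b; mu_m, Sigma_m + Sigma_b)
    _, logdet_A = np.linalg.slogdet(A)
    return logdet_A + log_gauss(A @ y + b, mu_m, Sigma_m + Sigma_b)

print(jud_log_likelihood(np.array([0.4, 1.2])))
```

Note that A(r)yt + b(r) can be computed once per regression class and shared by every component in the class; only the variance bias Σb(r) is applied per component, which is what makes the scheme a fast feature-based operation with a model-space update.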
Estimating Uncertainty Transform Parameters
• Estimate maximum likelihood noise model given some test data
– determine additive noise µn, Σn and channel mean µh.
• Given the clean model {µx(r), Σx(r)} and the noise model, compute the joint distribution
– the Joint form requires the cross-covariance Σxy(r) and the noisy speech {µy(r), Σy(r)}
– showed previously how to generate the joint distribution using VTS.
• If (# of model classes R) = (# of components M), Joint converges to VTS
– model-based Joint scheme is scalable.
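The step from the joint distribution to the conditional uses standard Gaussian conditioning. A 1-D sketch with illustrative numbers (in the real system the cross-covariance is generated using VTS, as noted above):

```python
import numpy as np

# Standard Gaussian conditioning on a joint distribution over clean speech
# x and corrupted speech y. All values below are made up for illustration.
mu_x = np.array([1.0])
mu_y = np.array([1.6])
Sigma_xx = np.array([[0.5]])
Sigma_yy = np.array([[0.9]])
Sigma_xy = np.array([[0.4]])  # cross-covariance between clean and noisy

def condition_on_y(y):
    # p(x|y) = N(mu_x + S_xy S_yy^{-1}(y - mu_y),
    #            S_xx - S_xy S_yy^{-1} S_yx)
    gain = Sigma_xy @ np.linalg.inv(Sigma_yy)
    mean = mu_x + gain @ (y - mu_y)
    cov = Sigma_xx - gain @ Sigma_xy.T
    return mean, cov

mean, cov = condition_on_y(np.array([2.0]))
print(mean, cov)
```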
ML Noise Estimation for Joint Uncertainty Decoding
• Estimate the noise model from samples of the corrupted speech environment
• Before, derived the ML VTS noise estimate
– need a hypothesis and the full acoustic model
– use VTS to combine speech and noise to give the likelihood of the corrupted speech
– with EM, iteratively update the noise parameters to maximise the likelihood
– in the EM step, refine the VTS expansion point.
• Do the same for Joint compensation
– start from the VTS estimate, then refine
– maximise the Joint auxiliary function.
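The EM idea can be illustrated on a simplified linear toy problem, y = x + n with the mismatch nonlinearity dropped, so no VTS expansion point is needed. All parameters are made up:

```python
import math
import random

# Toy EM: estimate an additive noise mean from corrupted observations
# y = x + n, with x ~ N(mu_x, var_x) known and n ~ N(mu_n, var_n) hidden.
mu_x, var_x = 1.0, 0.5
true_mu_n, var_n = 2.0, 0.3

rng = random.Random(1)
ys = [rng.gauss(mu_x, math.sqrt(var_x)) + rng.gauss(true_mu_n, math.sqrt(var_n))
      for _ in range(5000)]

mu_n = 0.0                       # initial noise-mean estimate
for _ in range(20):              # EM iterations
    k = var_n / (var_x + var_n)  # posterior gain for E[n | y]
    # E-step: posterior expectation of the hidden noise for each frame
    exp_n = [mu_n + k * (y - mu_x - mu_n) for y in ys]
    # M-step: re-estimate the noise mean
    mu_n = sum(exp_n) / len(exp_n)

print(mu_n)  # close to the true value of 2.0
```

With the real mismatch function, the E-step requires the VTS linearisation and the expansion point is refined between iterations, as the slide describes.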
Joint Adaptive Training
• Adaptive training removes unwanted factors
– e.g. environmental and speaker variability
– yields a noise-free “clean” acoustic model.
• Instead of CMN or CMLLR, the factor transform T is Joint
– ML estimation of the transform T̂ presented
– a more powerful representation of the noise effect.
• Iterative model parameter update formula to estimate the canonical model parameters M
– no closed-form solution, use gradient descent
– various schemes to stabilise estimation.
Clean Speech Class Model
• Estimate µx(r), Σx(r) from the full clean models M
– variance may be diagonal, but full is better, especially with few classes R.
• Issue with Joint adaptive training
– after step 2, the acoustic model is updated to M̂
– hence the clean speech class model changes
– now M̂ + Φ̂ does not give T̂ for step 3.
• Must check that the newly estimated transform is better than the previous one
– may have a situation where the current T̂ was estimated from a different clean speech class model and acoustic models.
Experiments on Resource Management Corpus
• 1000 word naval ARPA Resource Management (RM) database
– continuous read speech recorded in a sound-isolated room: 49 dB SNR.
• Baseline recogniser is CU HTK using RM recipe:
– trained on 3.8 hours of data with 109 speakers uttering 3990 sentences
– cross-word triphone models, tied states, 6 components per state
– MFCC features with 0th cepstra, deltas and delta-deltas for 39 dimensions.
• Multistyle training on artificially corrupted data
– at SNRs of 8, 14, 20, 26, and 32 dB, but no “clean” speech.
• SI task over three test sets, 30 speakers, 900 utterances, reporting %WER
– artificially corrupted with NOISEX-92 Operating Room noise
– noise and transforms estimated at the per-speaker level.
System Overview
[Figure: system overview diagram.]
Baseline RM Clean and Multistyle Performance
Acoustic     Compensation   Test Set SNR
Model                       Clean   20 dB   14 dB
Clean        —              3.1     38.0    83.7
             Joint          3.1     9.2     22.6
             VTS            3.0     8.4     23.6
Multistyle   —              11.7    7.0     15.5
             Joint          8.6     6.7     12.3
             VTS            8.8     6.5     12.0
Matched      —              3.1     7.4     14.3
• Joint and VTS compensation give good gains on either model training method
– model-based Joint with 16 transforms close to VTS.
• Multistyle trained acoustic models superior for noisier conditions
– clean is an unseen condition: it was not included in the training data
– with compensation, exceeds matched results on noisy data.
Noise Estimation for Joint Uncertainty Decoding
Acoustic     Compensation   Noise Est.   Test Set SNR
Model                       Type         Clean   20 dB   14 dB
Clean        —              —            3.1     38.0    83.7
             Joint          VTS          3.1     10.1    35.3
             Joint          Joint        3.1     9.2     22.6
Multistyle   —              —            11.7    7.0     15.5
             Joint          VTS          9.0     8.6     15.9
             Joint          Joint        8.6     6.7     12.3
• VTS ML noise estimates may be used to estimate Joint transforms
– good results on clean models, but Joint ML noise estimates better.
• On multistyle models, using VTS noise estimates is poorer than no compensation
– Joint noise estimates give far superior transforms and results.
Joint Adaptive Training
Acoustic     Compensation   Test Set (14 dB)
Model                       —       +CMLLR
Multistyle   —              15.5    13.8
             Joint          12.3    11.7
JAT          Joint          11.4    10.9
Matched      —              14.3    12.6
• Model-based Joint transforms complement CMLLR (two full-matrix transforms).
• Adaptively trained acoustic model gives better results than multistyle
– in the noisy conditions tested, beats matched and multistyle training
– but JAT WER on clean data is 5.6%, compared to 3.1% for clean-trained models on clean data
– Joint transforms are effective at noise normalisation.
Experiments on Broadcast News
• 145 hours of training data from recorded news broadcasts such as CNN, ABC, CBS, and NPR.
• Based on 2003 CUHTK system
– same segmenter and clusterer
– 59k word dictionary
– cross-word triphone models, tied states
– 7k states, 120k components, 16 components/GMM.
• MFCCs with 0th cepstrum, deltas and accelerations.
• Test sets include
– dev03, 2.5 hours of news from 2001
– eval98, 2.9 hours of news from 1998.
[Figure: multi-pass decoding pipeline. Segmentation and clustering feed P1 (initial transcription, one best), then P2 (lattice generation, lattices), then P3 (trigram rescoring, one best).]
Preliminary Broadcast News Experiments
• Wideband results only; narrowband segments are handled by the common system.
• Noise estimated at the per-speaker level.
Compensation   dev03     eval98
               Overall   Overall   F0     F1     F4
—              20.8      21.2      10.6   21.8   20.8
Joint          18.8      19.4      9.9    21.1   17.5
VTS            18.8      19.1      10.0   21.0   17.2

F0 – baseline broadcast speech
F1 – spontaneous broadcast speech
F4 – speech under degraded acoustic conditions
• Joint with 256 transforms close to VTS results
– modest gains on cleaner broadcast news data (F0, F1)
– as expected, biggest gains on the noisier data in the F4 focus condition.
Experiments on Toshiba Datasets
• 240 hours of artificially corrupted dictated training data
– 6h TIMIT data, 60h WSJ0 and 174h WSJ1
– multi-condition data with noise varying per utterance:
12.5% uncorrupted, the rest with SNRs distributed from 0 to 20 dB.
• Baseline multistyle recogniser is Toshiba CRL-STG system
– 10 MFCC parameters plus 0th cepstral, deltas and accelerations
– 2.2k word dictionary, cross-word triphone models, tied states
– ∼800 states, 8k components, 10 components/GMM.
• Three Appen test tasks examined in office, idle, city and highway conditions
– command and control task (60 utts/spkr)
– telephone numbers (30 utts/spkr), 80% local calls, 20% international
– 550 city names (30 utts/spkr)
– office: 20 speakers, close-talk mic, SNR ∼34 dB
– in car: 32 speakers each, AKG mirror mic, SNR ∼35 dB, 25 dB and 18 dB.
HLDA
CMN
Condition   C&C    Cities   Digits
Office      0.93   22.2     3.58
Idle        1.15   17.9     2.77
City        2.80   23.8     5.80
Highway     2.80   23.2     8.71

CMN + HLDA
Condition   C&C    Cities   Digits
Office      0.54   17.7     2.69
Idle        0.98   14.6     1.98
City        2.52   22.3     4.43
Highway     1.98   22.1     5.96
• In HLDA, add third derivatives to give a 44-dimensional feature vector
– project down to 33 dimensions with a full feature matrix
– equivalent to decoding with a shared full global covariance.
• Gives good improvement in cleaner conditions
– noise decreases the effectiveness of HLDA
– the decorrelating HLDA transform is no longer as accurate in noise.
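The “shared full global covariance” remark reflects a standard change-of-variables identity, checked numerically below. The 2-D square matrix and parameters are illustrative assumptions; the actual HLDA transform is a 33×44 projection that additionally discards nuisance dimensions:

```python
import numpy as np

# Check: evaluating a diagonal-covariance Gaussian on linearly transformed
# features (with the log|A| Jacobian term) equals evaluating a single
# full-covariance Gaussian on the original features.
A = np.array([[1.2, 0.3], [-0.1, 0.8]])  # global feature transform (made up)
mu = np.array([0.5, -0.2])               # mean in the transformed space
diag_cov = np.diag([0.4, 0.7])           # diagonal covariance there

def log_gauss(v, mean, cov):
    diff = v - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(v) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

x = np.array([0.3, 0.1])
_, logdet_A = np.linalg.slogdet(A)

# Left side: transform the feature, add log|A|
lhs = logdet_A + log_gauss(A @ x, mu, diag_cov)

# Right side: full covariance A^{-1} S A^{-T} and mean A^{-1} mu
A_inv = np.linalg.inv(A)
rhs = log_gauss(x, A_inv @ mu, A_inv @ diag_cov @ A_inv.T)
print(lhs, rhs)  # identical up to rounding
```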
VTS and Joint Compensation
HLDA Baseline
Condition   C&C    Cities   Digits
Office      0.44   7.55     2.47
Idle        1.21   13.5     3.69
City        3.30   18.6     7.40
Highway     5.11   28.9     13.2

VTS
Condition   C&C    Digits
Office      0.65   3.68
Idle        1.42   4.52
City        4.09   5.75
Highway     2.36   6.60

16 Joint Xforms
Condition   C&C    Digits
Office      0.43   2.53
Idle        4.56   2.75
City        2.71   5.07
Highway     2.49   7.35
• VTS and Joint give gains over the baseline
– office/idle results (∼35 dB) degrade due to multistyle training
– in the highway condition (18 dB) the WER is halved
– Joint better than VTS due to refinement of the noise estimates.
Conclusions
• Multistyle training gives better models for noise compensation than clean training
– VTS and Joint performance using multistyle models beats matched training
– CMLLR complements Joint uncertainty decoding
– positive results on real, found data.
• Joint adaptive training is superior to both clean and multistyle training
– can yield a “clean” canonical model of speech acoustics
– a powerful method to factor noise out of acoustic models.
• Future work will look at
– Joint adaptive training on Broadcast News and Toshiba data
– combining Joint transforms with semi-tied covariances
– optimising and devising efficiencies.