May 15, 2004 EARS STT Workshop
Scaling Novel Approaches to the
Full CTS Task
Morgan, Qifeng Zhu, Barry Chen, ICSI
Andreas Stolcke, SRI
May, 2004
Overview
• Novel Approaches recap & status
• Issues in ANN scaling
• Roadmap and other NA work
• System issues with NA features
• Results with full NA-CTS system
Multiple scales in Time-Frequency features
[Figure: three input streams (narrowband 500 ms, 13 overlapping spectral slices; broadband 100 ms, 9 frames of PLP cepstra; broadband 25 ms, 1 frame of PLP cepstra); two MLPs; stages labeled "combine posteriors" and "concatenate".]
February Status
• Scaled up to 200+ hour system, w/ MMIE, better LM, etc.; significant error rate reduction still there
• Issues remained about retaining margin of improvement after adaptation across PLP and MFC streams [Andreas to explain progress here]
• ANN training very lengthy, not as easily distributed as GMM training; best use scales quadratically with training data, so it wasn’t feasible for the 2000 hours
• We seem to have found a preliminary fix: no silver bullet, but several incremental innovations
Training w/ more data: the problem
• Training with ~250 hours takes about a week (on 4 dual-processor hyper-threaded 2.8 GHz Xeons, for 4 nets: [TRAP, Tandem] x [Male, Female])
• Optimum ANN data size is the same as HMM training set
• Optimum net size is proportional to data size
• An increase by a factor of 8 in data -> factor of 64 in training for best net size, or ~64 weeks for EARS!
• Not a scalable solution, nor even feasible in the short run
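The factor of 64 follows from multiplying two factors of 8: eight times more training patterns, and a net eight times larger to match. A minimal sketch of that arithmetic, assuming training cost grows as (amount of data) x (net size) with the number of epochs held fixed; the function name is illustrative only:

```python
def projected_training_weeks(base_weeks, base_hours, target_hours):
    """Project training time when the net grows in proportion to the data.

    Assumption: cost ~ (amount of data) x (net size), same number of epochs.
    """
    data_factor = target_hours / base_hours   # 2000 / 250 = 8
    net_factor = data_factor                  # optimum net size ~ data size
    return base_weeks * data_factor * net_factor


# ~250 hours takes about 1 week; project to the 2000-hour EARS set.
print(projected_training_weeks(1, 250, 2000))  # 64.0
```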
Working with more data
• Optimizing the initial learning rate
• Priming the net with smaller data subsets
• Scaling up the data size as learning rate decreases
• Using different data subsets with later epochs
• Fewer epochs required for more data
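One way to picture "scaling up the data size as learning rate decreases" is a per-epoch schedule. The sketch below is hypothetical: the halving/doubling rule, the starting values, and the function name are assumptions for illustration, not the schedule used for the ICSI nets. It does not model the choice of a different subset in each epoch.

```python
def data_schedule(n_epochs, initial_lr, initial_fraction):
    """Yield (epoch, learning_rate, data_fraction) for each epoch.

    Assumed rule (not from the slides): the learning rate halves and the
    fraction of training data doubles each epoch, capped at the full set.
    """
    lr, fraction = initial_lr, initial_fraction
    for epoch in range(n_epochs):
        yield epoch, lr, min(fraction, 1.0)
        lr /= 2
        fraction *= 2


# Hypothetical starting values: 1/8 of the data and a learning rate of 0.008.
schedule = list(data_schedule(n_epochs=5, initial_lr=0.008, initial_fraction=0.125))
total_passes = sum(fraction for _, _, fraction in schedule)

for epoch, lr, fraction in schedule:
    print(epoch, lr, fraction)
print(total_passes)  # full-data-equivalent passes, versus 5 for full-data epochs
```

Under these assumed values the five epochs cost fewer than three full passes over the data, which is the kind of saving the bullets point at.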
Current projections
• 2000 hr ANN training starting now
• Expected to be complete in early July
• First SRI results on dev set expected in mid-July
• This allows one month for tuning whole system with the new features
• Passing the methods on to other sites will be for RT-05
Other work in EARS-NA
• Mostly applicable for RT-05 (e.g., freq-domain LP, graphical models)
• Some methods may get tested via rescoring for RT-04, perhaps as a contrast
Overview
• Novel Approaches recap & status
• Issues in ANN scaling
• Roadmap and other NA work
• System issues with NA features
• Results with full NA-CTS system
System Building Issues
• Work so far has focused on single frontend (PLP + NA features), single-pass recognition
• How to best use NA features in multi-frontend, multi-pass system?
• Issues:
  – Cross-adaptation
  – Lattice decoding
  – System combination
Recap: Single Pass Results
• ICSI features from largest ANN to date (trained on 120 hours per gender)
• HMMs trained on 200 hours/gender
• MMIE-PLP models, decode + rescore
• Recognition on RT-02 male set
• Relative WER reduction: 6.8%

System                  WER (bigram decoding)   WER (rescoring)
PLP baseline            35.2                    30.5
PLP + ICSI features     32.8                    28.4
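The relative WER reduction can be checked from the table values. A minimal sketch (the function name is illustrative): the quoted 6.8% matches the bigram-decoding column, and the rescoring column comes out slightly higher.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, rounded to one decimal place."""
    return round(100.0 * (baseline_wer - new_wer) / baseline_wer, 1)


bigram = relative_wer_reduction(35.2, 32.8)     # bigram decoding
rescoring = relative_wer_reduction(30.5, 28.4)  # after rescoring
print(bigram, rescoring)  # 6.8 6.9
```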
Recognition Framework
• Relevant features of SRI’s CTS system
  – First decoding and lattice generation using MFCC frontend, within-word triphones
  – Rescore 1st hyps, then use for MLLR on PLP models (cross-adaptation)
  – Lattice-decode using adapted, PLP cross-word triphone models
  – Rescore again; confusion network decoding (5xRT system ends here).
  – For 20xRT, do the same with MFCC/PLP roles reversed; final confusion network combination
  – (Plus some other details, see later talk.)
Cross-Adaptation Experiments
• Same training and test data as before
• Partial system used:
  – Decode with MFCC; rescore with 4-gram and other models
  – Decode with adapted MMIE-PLP CW models from lattices
  – Omit final rescoring (to amplify acoustic model differences)
  – MFCC+ICSI models used same ICSI features (PLP-based) as PLP+ICSI (avoid training a separate ANN for MFCC)

System                                                          WER
Baseline (no ICSI features anywhere)                            26.9
ICSI features in PLP models                                     26.2
ICSI features in MFCC + PLP models                              26.0
ICSI features in MFCC + PLP models and in lattice generation    25.7
Cross-Adaptation Lessons
• When cross-adapting it’s important to improve the first decoding pass, or else the second pass is “held back”.
• Baseline MFCC models in first pass give 0.1% worse WER than PLP+ICSI models in first pass (better features outweigh benefit of cross-adaptation).
• Even with thick lattices (4% oracle error rate) it’s important to also use matched models to generate lattices.
• Cutting corners hurts bottom line.
• Still, relative win reduced to 4.5%.
Experiments with Complete Systems
• Same training set as before (200 hours/gender)
• Tune on RT-02 subset, test on RT-02 and RT-03 males
• Run 5xRT and 20xRT CTS systems
• Use ICSI features in all branches of the system (both MFCC and PLP)
• Relative improvement: 2.8% with full system

System                   RT-02   RT-03
5xRT baseline            26.1    26.3
5xRT w/ICSI features     24.8    25.5
20xRT baseline           23.7    24.6
20xRT w/ICSI features    23.0    23.9
But Wait …
• Try combining baseline and ICSI-feature systems
• 6-way confusion network combination (3 per system)
• Run time now 40xRT (without further tweaking)
• Relative improvement over baseline: 6.5%
• Combination benefits from independent systems (no sharing of lattices or adaptation hypotheses)
• … but we’re using too much time, so …

System                   RT-02   RT-03
20xRT baseline           23.7    24.6
20xRT w/ICSI features    23.0    23.9
40xRT combined system    22.1    23.0
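As a cross-check on the quoted percentages, the relative gains over the 20xRT baseline can be tabulated per test set. A minimal sketch using the WER values above, assuming the first number in each row is RT-03 and the second RT-02 (the reading consistent with the quoted 2.8% and 6.5%); variable names are illustrative:

```python
baseline = {"RT-02": 23.7, "RT-03": 24.6}  # 20xRT baseline
systems = {
    "20xRT w/ICSI features": {"RT-02": 23.0, "RT-03": 23.9},
    "40xRT combined system": {"RT-02": 22.1, "RT-03": 23.0},
}


def relative_gain(base_wer, new_wer):
    """Relative WER improvement in percent, rounded to one decimal place."""
    return round(100.0 * (base_wer - new_wer) / base_wer, 1)


gains = {
    name: {test_set: relative_gain(baseline[test_set], wer[test_set])
           for test_set in baseline}
    for name, wer in systems.items()
}
for name, per_set in gains.items():
    print(name, per_set)
```

Under that column reading, both quoted figures (2.8% and 6.5%) correspond to RT-03; the RT-02 gains are slightly larger.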
20xRT -- Second Try
• Modify 20xRT system to leverage more of baseline/ICSI system combination.
• 3-out-of-6 model combination gives best results with:
  – Within-word MFCC+ICSI
  – Cross-word MFCC+ICSI
  – Cross-word PLP (no ICSI)
• Rebuild 20xRT system based on those 3 models

System                   RT-02   RT-03
20xRT baseline           23.7    24.6
20xRT w/ICSI features    23.0    23.9
Revised 20xRT w/ICSI     22.8    23.6
40xRT combined system    22.1    23.0
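Choosing the best 3 of 6 models can be pictured as an exhaustive comparison of all 20 three-model subsets. The slides do not say how the search was actually run, so the sketch below is only an illustration: the model names are placeholders, and `toy_score` is a made-up stand-in for "WER of the combined subset on the tuning set", not real data.

```python
from itertools import combinations


def best_subset(models, k, score):
    """Return the k-model subset with the lowest score, and the subset count."""
    candidates = list(combinations(models, k))
    return min(candidates, key=score), len(candidates)


# Placeholder model names (hypothetical, not the actual SRI/ICSI models).
models = [f"model_{i}" for i in range(6)]


def toy_score(subset):
    """Made-up score; a real one would run the combination and measure WER."""
    return sum(int(name.split("_")[1]) for name in subset)


chosen, n_candidates = best_subset(models, 3, toy_score)
print(chosen, n_candidates)
```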
Conclusions
• Preserving initial win from ICSI features in a complex system is nontrivial.
• Important to use ICSI features in early passes
  – For cross-adaptation
  – For lattice generation
• Best results by combining baseline with a system using ICSI features in all models (6.5% relative, 40xRT).
• Partial win by combining ICSI and baseline models in a single 20xRT system (4.0% relative).
• Will investigate other approaches to leverage more of the full potential in 20xRT.
• Also: vary ICSI features across subsystems.