scaling novel approaches to the full cts task · ears stt workshop may 15, 2004 other work in...

May 15, 2004, EARS STT Workshop

Scaling Novel Approaches to the Full CTS Task

Morgan, Qifeng Zhu, Barry Chen, ICSI

Andreas Stolcke, SRI

May, 2004


Overview

• Novel Approaches recap & status

• Issues in ANN scaling

• Roadmap and other NA work

• System issues with NA features

• Results with full NA-CTS system


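Multiple scales in Time-Frequency features

[Figure: three input streams at different time-frequency scales. Narrowband, 500 ms: 13 overlapping spectral slices. Broadband, 100 ms: 9 frames, PLP cepstra. Broadband, 25 ms: 1 frame, PLP cepstra. Diagram labels: MLP (x2), "posteriors combine", "concatenate".]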


February Status

• Scaled up to 200+ hour system, w/ MMIE, better LM, etc.; significant error rate reduction still there

• Issues remained about retaining margin of improvement after adaptation across PLP and MFC streams [Andreas to explain progress here]

• ANN training very lengthy, not as easily distributed as GMM training; best use scales quadratically with training data, so it wasn’t feasible for the 2000 hours

• We seem to have found a preliminary fix: no silver bullet, but several incremental innovations


Training w/ more data: the problem

• Training with ~250 hours takes about a week (on 4 dual-processor hyper-threaded 2.8 GHz Xeons, for 4 nets: [TRAP, Tandem] x [Male, Female])

• Optimum ANN data size is the same as HMM training set

• Optimum net size is proportional to data size

• An increase by a factor of 8 in data -> factor of 64 in training time for best net size, or ~64 weeks for EARS!

• Not a scalable solution, nor even feasible in the short run
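The 64-week figure can be reproduced with a short sketch. It assumes, as the bullets state, that the best net size grows in proportion to the data, and additionally that training cost is proportional to (amount of data) x (net size); the function name is illustrative and not part of any ICSI tool.

```python
def projected_training_weeks(base_weeks, data_factor):
    """Project training time when the best net size tracks the data size.

    Assumption: cost ~ (amount of data) x (number of weights), and the
    best number of weights is proportional to the amount of data, so the
    total cost grows with the square of the data factor.
    """
    net_factor = data_factor          # best net size scales with data size
    return base_weeks * data_factor * net_factor


# ~250 hours takes about one week; 2000 hours is a factor of 8 more data.
print(projected_training_weeks(base_weeks=1, data_factor=2000 / 250))
```

Under these assumptions the projection is quadratic in the data factor, which is why an 8x increase in data turns one week into roughly 64.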


Working with more data

• Optimizing the initial learning rate

• Priming the net with smaller data subsets

• Scaling up the data size as learning rate decreases

• Using different data subsets with later epochs

• Fewer epochs required for more data
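One way to picture "priming with smaller subsets" and "scaling up the data size as learning rate decreases" is a per-epoch schedule. The sketch below is only an illustration: the halving of the learning rate, the doubling of the data subset, the starting fraction, and all numeric values are assumptions, not the schedule actually used at ICSI. It also does not model the choice of different subsets in later epochs.

```python
def staged_schedule(total_hours, n_epochs, initial_lr, start_fraction=0.125):
    """Return one (epoch, learning_rate, hours_of_data) tuple per epoch.

    Illustrative assumptions: the learning rate halves every epoch and the
    data subset doubles every epoch until the full set is in use.
    """
    schedule = []
    lr = initial_lr
    fraction = start_fraction
    for epoch in range(1, n_epochs + 1):
        schedule.append((epoch, lr, total_hours * fraction))
        lr /= 2.0                              # learning rate decreases ...
        fraction = min(1.0, fraction * 2.0)    # ... while the data size grows
    return schedule


# Hypothetical values: 2000 hours in total, 5 epochs, initial rate 0.008.
for epoch, lr, hours in staged_schedule(2000, n_epochs=5, initial_lr=0.008):
    print(f"epoch {epoch}: lr={lr:g}, data={hours:g} h")
```

With a schedule of this shape the early, high-learning-rate epochs touch only a fraction of the data, which is where the saving in training time would come from.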


Current projections

• 2000 hr ANN training starting now

• Expected to be complete in early July

• First SRI results on dev set expected in mid-July

• This allows one month for tuning whole system with the new features

• Passing the methods on to other sites will be for RT-05


Other work in EARS-NA

• Mostly applicable for RT-05 (e.g., freq-domain LP, graphical models)

• Some methods may get tested via rescoring for RT-04, perhaps as a contrast


Overview

• Novel Approaches recap & status

• Issues in ANN scaling

• Roadmap and other NA work

• System issues with NA features

• Results with full NA-CTS system


System Building Issues

• Work so far has focused on single frontend (PLP + NA features), single-pass recognition

• How to best use NA features in multi-frontend, multi-pass system?

• Issues:

  – Cross-adaptation
  – Lattice decoding
  – System combination


Recap: Single Pass Results

• ICSI features from largest ANN to date (trained on 120 hours per gender)

• HMMs trained on 200 hours/gender

• MMIE-PLP models, decode + rescore

• Recognition on RT-02 male set

• Relative WER reduction: 6.8%

System                 WER, bigram decoding   WER, rescoring
PLP baseline           35.2                   30.5
PLP + ICSI features    32.8                   28.4
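The relative reduction quoted on this slide follows from the WERs in the table. A minimal sketch of the calculation, assuming the usual definition (baseline minus new, divided by baseline); the function name is illustrative.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WERs from the single-pass table (RT-02 male set).
bigram = relative_wer_reduction(35.2, 32.8)
rescored = relative_wer_reduction(30.5, 28.4)
print(f"bigram decoding: {bigram:.2f}%, rescoring: {rescored:.2f}%")
```

Both columns come out between 6.8% and 6.9%, consistent with the figure on the slide.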


Recognition Framework

• Relevant features of SRI’s CTS system

  – First decode and lattice generation using MFCC frontend, within-word triphones
  – Rescore 1st hyps, then use for MLLR on PLP models (cross-adaptation)
  – Lattice-decode using adapted PLP cross-word triphone models
  – Rescore again; confusion network decoding (5xRT system ends here)
  – For 20xRT, do the same with MFCC/PLP roles reversed; final confusion network combination
  – (Plus some other details, see later talk.)


Cross-Adaptation Experiments

• Same training and test data as before

• Partial system used:

  – Decode with MFCC; rescore with 4-gram and other models
  – Decode with adapted MMIE-PLP CW models from lattices
  – Omit final rescoring (to amplify acoustic model differences)
  – MFCC+ICSI models used same ICSI features (PLP-based) as PLP+ICSI (avoid training a separate ANN for MFCC)

System                                                          WER
Baseline (no ICSI features anywhere)                            26.9
ICSI features in PLP models                                     26.2
ICSI features in MFCC + PLP models                              26.0
ICSI features in MFCC + PLP models and in lattice generation    25.7


Cross-Adaptation Lessons

• When cross-adapting it’s important to improve the first decoding pass, or else the second pass is “held back”.

• Baseline MFCC models in first pass give 0.1% worse WER than PLP+ICSI models in first pass (better features outweigh benefit of cross-adaptation).

• Even with thick lattices (4% oracle error rate) it’s important to also use matched models to generate lattices.

• Cutting corners hurts bottom line.

• Still, relative win reduced to 4.5%.


Experiments with Complete Systems

• Same training set as before (200 hours/gender)

• Tune on RT-02 subset, test on RT-02 and RT-03 males

• Run 5xRT and 20xRT CTS systems

• Use ICSI features in all branches of the system (both MFCC and PLP)

• Relative improvement: 2.8% with full system

System                     RT-02   RT-03
5xRT baseline              26.1    26.3
5xRT w/ICSI features       24.8    25.5
20xRT baseline             23.7    24.6
20xRT w/ICSI features      23.0    23.9


But Wait …

• Try combining baseline and ICSI-feature systems

• 6-way confusion network combination (3 per system)

• Run time now 40xRT (without further tweaking)

• Relative improvement over baseline: 6.5%

• Combination benefits from independent systems (no sharing of lattices or adaptation hypotheses)

• … but we’re using too much time, so …

System                     RT-02   RT-03
20xRT baseline             23.7    24.6
20xRT w/ICSI features      23.0    23.9
40xRT combined system      22.1    23.0
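To make the idea of confusion network combination concrete, here is a toy sketch, not SRI's implementation. It assumes the networks from the different systems are already aligned slot by slot (real systems must perform that alignment), averages word posteriors per slot, and picks the top word. The data layout, the "<eps>" empty-word marker, and the example posteriors are all made up for illustration.

```python
from collections import defaultdict


def combine_confusion_networks(networks, weights=None):
    """Combine pre-aligned confusion networks by weighted posterior averaging.

    networks: list of networks; each network is a list of slots; each slot
    maps word -> posterior. Simplifying assumption: all networks have the
    same number of slots and are already aligned.
    """
    if weights is None:
        weights = [1.0 / len(networks)] * len(networks)
    n_slots = len(networks[0])
    if any(len(net) != n_slots for net in networks):
        raise ValueError("this sketch requires pre-aligned networks")
    hypothesis = []
    for slot_idx in range(n_slots):
        scores = defaultdict(float)
        for net, weight in zip(networks, weights):
            for word, posterior in net[slot_idx].items():
                scores[word] += weight * posterior
        best = max(scores, key=scores.get)
        if best != "<eps>":          # "<eps>" marks an empty (deleted) slot
            hypothesis.append(best)
    return hypothesis


# Three hypothetical systems, two slots each.
sys_a = [{"i": 0.9, "<eps>": 0.1}, {"scream": 0.6, "see": 0.4}]
sys_b = [{"i": 0.8, "eye": 0.2}, {"see": 0.7, "scream": 0.3}]
sys_c = [{"i": 0.7, "<eps>": 0.3}, {"see": 0.5, "sea": 0.5}]
print(combine_confusion_networks([sys_a, sys_b, sys_c]))
```

In the second slot the first system prefers "scream", but the averaged posteriors favor "see". This illustrates why combining systems that make independent errors can help.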


20xRT -- Second Try

• Modify 20xRT system to leverage more of baseline/ICSI system combination.

• 3-out-of-6 model combination gives best results with:

  – Within-word MFCC+ICSI
  – Cross-word MFCC+ICSI
  – Cross-word PLP (no ICSI)

• Rebuild 20xRT system based on those 3 models

System                     RT-02   RT-03
20xRT baseline             23.7    24.6
20xRT w/ICSI features      23.0    23.9
Revised 20xRT w/ICSI       22.8    23.6
40xRT combined system      22.1    23.0
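A 3-out-of-6 selection means searching over 20 candidate subsets. The sketch below enumerates them. The six model labels are an assumption (the slides name only the three winners), and `wer_of` is a hypothetical callback standing in for running the combination on the tuning set.

```python
from itertools import combinations

# Hypothetical labels for the six models (3 per system); only the three
# selected models are named on the slide.
MODELS = [
    "WW MFCC", "CW MFCC", "CW PLP",
    "WW MFCC+ICSI", "CW MFCC+ICSI", "CW PLP+ICSI",
]


def best_subset(models, k, wer_of):
    """Return the k-model subset with the lowest WER according to wer_of.

    wer_of is a caller-supplied function mapping a tuple of model names to
    a WER measured on tuning data; it is not implemented here.
    """
    return min(combinations(models, k), key=wer_of)


candidates = list(combinations(MODELS, 3))
print(len(candidates))  # number of 3-out-of-6 subsets
```

An exhaustive search is cheap in the number of subsets, though each evaluation would require a full combination run on the tuning set.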


Conclusions

• Preserving initial win from ICSI features in a complex system is nontrivial.

• Important to use ICSI features in early passes

  – For cross-adaptation
  – For lattice generation

• Best results by combining baseline with a system using ICSI features in all models (6.5% relative, 40xRT).

• Partial win by combining ICSI and baseline models in a single 20xRT system (4.0% relative).

• Will investigate other approaches to leverage more of the full potential in 20xRT.

• Also: vary ICSI features across subsystems.
