building a highly accurate mandarin speech recognizer

1 Building A Highly Accurate Mandarin Speech Recognizer Mei-Yuh Hwang, Gang Peng, Wen Wang (SRI), Arlo Faria (ICSI), Aaron Heidel (NTU) Mari Ostendorf 12/12/2007


Page 1: Building A Highly Accurate Mandarin Speech Recognizer

1

Building A Highly Accurate Mandarin Speech Recognizer

Mei-Yuh Hwang, Gang Peng,

Wen Wang (SRI), Arlo Faria (ICSI),

Aaron Heidel (NTU) Mari Ostendorf

12/12/2007

Page 2: Building A Highly Accurate Mandarin Speech Recognizer

2

Outline
Goal: A highly accurate Mandarin ASR
Baseline: System-2006
Improvement
  Acoustic segmentation
  Two complementary comparable systems
  Language models and adaptation
  More data
Error analysis
Future

Page 3: Building A Highly Accurate Mandarin Speech Recognizer

3

Background: System-2006
849M words of training text
60K-word lexicon
Static 5-gram rescoring
465 hrs of acoustic training data
Two AMs (same phone-72 pronunciation)
  MFCC+pitch (42-dim), SAT+fMPE, CW MPE, 3000x128 Gaussians
  MFCC+MLP+pitch (74-dim), SAT+fMPE, nonCW MPE, 3000x64 Gaussians
CER 18.4% on Eval06.

Page 4: Building A Highly Accurate Mandarin Speech Recognizer

4

2007 Increased Training Data
870 hours of acoustic training data. 3500x128 Gaussians.
1.2G words of training text. Trigrams and 4-grams.

LM | #bigrams | #trigrams | #4-grams | Dev07-IV perplexity
LM3 | 58M | 108M | --- | 325.7
qLM3 | 6M | 3M | --- | 379.8
LM4 | 58M | 316M | 201M | 297.8
qLM4 | 19M | 24M | 6M | 383.2
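A quick way to read the perplexity column is as the relative cost of the smaller models. The sketch below assumes that qLM3 and qLM4 are the reduced counterparts of LM3 and LM4, as the n-gram counts suggest; the function name is illustrative only.

```python
# Dev07-IV perplexities copied from the LM table above.
PERPLEXITY = {"LM3": 325.7, "qLM3": 379.8, "LM4": 297.8, "qLM4": 383.2}


def relative_increase(full: str, reduced: str) -> float:
    """Relative perplexity increase (in percent) of `reduced` over `full`."""
    return 100.0 * (PERPLEXITY[reduced] - PERPLEXITY[full]) / PERPLEXITY[full]


# ASSUMPTION: qLM3/qLM4 are the smaller versions of LM3/LM4.
for full, reduced in [("LM3", "qLM3"), ("LM4", "qLM4")]:
    print(f"{reduced} vs {full}: +{relative_increase(full, reduced):.1f}% perplexity")
```

Under that assumption the smaller trigram costs roughly 17% in perplexity and the smaller 4-gram roughly 29%.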

Page 5: Building A Highly Accurate Mandarin Speech Recognizer

5

Acoustic segmentation
The former segmenter caused high deletion errors: it misclassified some speech segments as noise.
Minimum speech segment duration: 18*30 = 540 ms ≈ 0.5 s

Vocabulary | Pronunciation
speech | 18+ fg
noise | rej rej
silence | bg bg

[Figure: segmenter network from Start (/null) to End (/null) over the words speech, silence, and noise]

Page 6: Building A Highly Accurate Mandarin Speech Recognizer

6

New Acoustic Segmenter
Allow shorter speech duration.
Model Mandarin vs. foreign (English) speech separately.

Vocabulary | Pronunciation
Mandarin1 | I1 F
Mandarin2 | I2 F
Foreign | forgn forgn
Noise | rej rej
Silence | bg bg

[Figure: segmenter network from Start (/null) to End (/null) over the words Foreign, silence, Mandarin 1, Mandarin 2, and noise]

Page 7: Building A Highly Accurate Mandarin Speech Recognizer

7

Improved Acoustic Segmentation
Pruned trigram, SI nonCW-MLP MPE, on Eval06

Segmenter | Sub | Del | Ins | Total
OLD | 9.7 | 7.0 | 1.9 | 18.6
NEW | 9.9 | 6.4 | 2.0 | 18.3
Oracle | 9.5 | 6.8 | 1.8 | 18.1
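The Total column is the character error rate, i.e. the sum of substitution, deletion, and insertion rates. A small sketch that rechecks the rows and isolates the deletion change between the OLD and NEW segmenters (the dictionary layout is illustrative only):

```python
# Error rates (%) copied from the segmenter comparison on Eval06.
ROWS = {
    "OLD":    {"sub": 9.7, "del": 7.0, "ins": 1.9, "total": 18.6},
    "NEW":    {"sub": 9.9, "del": 6.4, "ins": 2.0, "total": 18.3},
    "Oracle": {"sub": 9.5, "del": 6.8, "ins": 1.8, "total": 18.1},
}


def cer(row: dict) -> float:
    """CER as the sum of substitution, deletion, and insertion rates."""
    return row["sub"] + row["del"] + row["ins"]


for name, row in ROWS.items():
    print(f"{name}: {cer(row):.1f}% (reported {row['total']}%)")

# Relative change in deletions from the OLD to the NEW segmenter.
del_reduction = 100.0 * (ROWS["OLD"]["del"] - ROWS["NEW"]["del"]) / ROWS["OLD"]["del"]
print(f"Deletion errors reduced by {del_reduction:.1f}% relative")
```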

Page 8: Building A Highly Accurate Mandarin Speech Recognizer

8

Decoding Architecture

[Figure: multi-pass decoding diagram. Boxes: "MLP nonCW, qLM3"; "PLP CW SAT+fMPE, MLLR, LM3"; "MLP CW SAT, MLLR, LM3"; "qLM4 Adapt/Rescore" (twice); "Confusion Network Combination"; "Aachen"]

Page 9: Building A Highly Accurate Mandarin Speech Recognizer

9

Two Sets of Acoustic Models
For cross adaptation and system combo
  Different error behaviors
  Similar error rate performance

Property | System-MLP | System-PLP
Features | 74 (MFCC+pitch+MLP) | 42 (PLP+pitch)
fMPE | no | yes
Phones | 72 | 81

Page 10: Building A Highly Accurate Mandarin Speech Recognizer

10

MLP Phoneme Posterior Features

Compute Tandem features with pitch+PLP input.
Compute HATS features with 19 critical bands.
Combine Tandem and HATS posterior vectors into one.
PCA(Log(71)) → 32
MFCC + pitch + MLP = 74-dim
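The dimension bookkeeping on this slide (71 posteriors, log, PCA to 32, appended to 42-dim MFCC+pitch) can be sketched with NumPy. Everything numeric here is a random stand-in, and the way the two posterior streams are combined (a plain average) is an assumption: the slide only says they are combined into one vector.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200  # number of frames (toy value)


def random_posteriors(n_frames: int, n_phones: int = 71) -> np.ndarray:
    """Stand-in for MLP outputs: each row is a distribution over 71 phones."""
    logits = rng.normal(size=(n_frames, n_phones))
    expd = np.exp(logits - logits.max(axis=1, keepdims=True))
    return expd / expd.sum(axis=1, keepdims=True)


tandem = random_posteriors(T)  # placeholder for Tandem MLP posteriors
hats = random_posteriors(T)    # placeholder for HATS MLP posteriors

# ASSUMPTION: combine the two streams by simple averaging.
combined = 0.5 * (tandem + hats)

# Log of the 71-dim posterior vector, then PCA down to 32 dimensions.
log_post = np.log(combined + 1e-10)
centered = log_post - log_post.mean(axis=0)
# PCA via SVD; a real system would estimate the projection on training data.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
mlp_feats = centered @ vt[:32].T              # shape (T, 32)

mfcc_pitch = rng.normal(size=(T, 42))         # placeholder 42-dim MFCC+pitch
features = np.hstack([mfcc_pitch, mlp_feats])  # 42 + 32 = 74 dimensions
print(features.shape)
```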

Page 11: Building A Highly Accurate Mandarin Speech Recognizer

11

Tandem Features [T1,T2,…,T71]
Input: 9 frames of PLP+pitch
  PLP (39x9)
  Pitch (3x9)
MLP: (42x9)x15000x71

Page 12: Building A Highly Accurate Mandarin Speech Recognizer

12

HATS Features [H1,H2,…,H71]

[Figure: one 51x60x71 MLP per critical band, for band energies E1, E2, …, E19, feeding a (60*19)x8000x71 merger MLP]
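To get a feel for the size of these networks, the sketch below counts parameters. It assumes that a topology written as AxBxC means input x hidden x output layer sizes of a fully connected MLP with bias terms, which is a common reading but is not stated on the slides.

```python
def mlp_params(layers: list[int]) -> int:
    """Weights plus biases of a fully connected MLP with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))


# ASSUMPTION: "AxBxC" is read as input x hidden x output sizes.
tandem = mlp_params([42 * 9, 15000, 71])       # Tandem MLP
hats_band = mlp_params([51, 60, 71])           # one HATS critical-band MLP
hats_merger = mlp_params([60 * 19, 8000, 71])  # HATS merger MLP

print(f"Tandem MLP:       {tandem:,} parameters")
print(f"HATS band MLPs:   19 x {hats_band:,} = {19 * hats_band:,} parameters")
print(f"HATS merger MLP:  {hats_merger:,} parameters")
```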

Page 13: Building A Highly Accurate Mandarin Speech Recognizer

13

MLP and Pitch Features

HMM feature | MLP input | CER (%)
MFCC (39-dim) | None | 24.1
MFCC+F0 (42-dim) | None | 21.4
MFCC+F0+Tandem (74-dim) | PLP (39*9) | 20.3
MFCC+F0+Tandem (74-dim) | PLP+F0 (42*9) | 19.7

nonCW ML, Hub4 training, MLLR, LM2 on Eval04

Page 14: Building A Highly Accurate Mandarin Speech Recognizer

14

Phone-81: Diphthongs for BC
Add diphthongs (4x4=16) for fast speech and for modeling longer triphone context.
Maintain unique syllabification.
Syllable-ending W and Y are not needed anymore.

Example | Phone-72 | Phone-81
要 /yao4/ | a4 W | aw4
北 /bei3/ | E3 Y | ey3
有 /you3/ | o3 W | ow3
爱 /ai4/ | a4 Y | ay4
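The four example rows suggest a simple rewrite rule: a toned vowel followed by a syllable-ending glide becomes one toned diphthong. A minimal sketch of that rule follows; the function name and the initial "b" in the usage are hypothetical, and only the four pairs shown in the table are covered.

```python
# Vowel + glide pairs from the example table, mapped to Phone-81 diphthongs.
DIPHTHONGS = {("a", "W"): "aw", ("E", "Y"): "ey", ("o", "W"): "ow", ("a", "Y"): "ay"}


def merge_diphthongs(phones: list[str]) -> list[str]:
    """Rewrite Phone-72 'vowel+tone glide' pairs as one Phone-81 diphthong."""
    out = []
    i = 0
    while i < len(phones):
        cur = phones[i]
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        vowel, tone = cur[:-1], cur[-1]
        if tone.isdigit() and (vowel, nxt) in DIPHTHONGS:
            out.append(DIPHTHONGS[(vowel, nxt)] + tone)  # tone moves to the diphthong
            i += 2
        else:
            out.append(cur)
            i += 1
    return out


print(merge_diphthongs(["a4", "W"]))       # final of 要 /yao4/
print(merge_diphthongs(["b", "E3", "Y"]))  # 北 /bei3/ with a hypothetical initial "b"
```

This covers only the diphthong change, not the other Phone-81 modifications.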

Page 15: Building A Highly Accurate Mandarin Speech Recognizer

15

Phone-81: Frequent Neutral Tones for BC
Neutral tones are more common in conversation.
Neutral tones were not modeled; the 3rd tone was used as a replacement.
Add 3 neutral tones for frequent characters.

Example | Phone-72 | Phone-81
了 /le5/ | e3 | e5
吗 /ma5/ | a3 | a5
子 /zi5/ | i3 | i5

Page 16: Building A Highly Accurate Mandarin Speech Recognizer

16

Phone-81: Special CI Phones for BC
Filled pauses (hmm, ah) are common in BC. Add two CI phones for them.
Add CI /V/ for English.

Example | Phone-72 | Phone-81
victory | w | V
呃 /ah/ | o3 | fp_o
嗯 /hmm/ | e3 N | fp_en

Page 17: Building A Highly Accurate Mandarin Speech Recognizer

17

Phone-81: Simplification of Other Phones

Now 72+14+3+3 = 92 phones: too many triphones to model.
Merge similar phones to reduce the number of triphones.
I2 was modeled by I1; now it is i2.
92 - (4x3-1) = 81 phones.

Example | Phone-72 | Phone-81
安 /an1/ | A1 N | a1 N
词 /ci2/ | I1 | i2
池 /chi2/ | IH2 | i2
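The phone-count arithmetic can be retraced step by step. One step is an inference rather than something the slides state: the 14 in 72+14+3+3 is read here as the 16 new diphthongs minus the two syllable-ending glides (W and Y) that are no longer needed.

```python
base = 72                 # Phone-72 inventory
diphthongs = 4 * 4        # "Add diphthongs (4x4=16)"
dropped_glides = 2        # ASSUMPTION: syllable-ending W and Y are removed
neutral_tones = 3         # neutral tones for frequent characters
special_ci = 3            # two filled-pause phones plus /V/ for English

before_merge = base + (diphthongs - dropped_glides) + neutral_tones + special_ci

# Merging similar phones removes 4x3 phones, minus 1 because I2 did not
# exist as a separate phone before (it was modeled by I1).
merged_away = 4 * 3 - 1
after_merge = before_merge - merged_away

print(before_merge, after_merge)
```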

Page 18: Building A Highly Accurate Mandarin Speech Recognizer

18

Different Phone Sets
Pruned trigram, SI nonCW-PLP ML, on dev07

Phone set | BN | BC | Avg
Phone-81 | 7.6 | 27.3 | 18.9
Phone-72 | 7.4 | 27.6 | 19.0

Indeed different error behaviors --- good for system combo.

Page 19: Building A Highly Accurate Mandarin Speech Recognizer

19

PLP Models with fMPE Transform
PLP model with fMPE transform to compete with MLP model.
Smaller ML-trained Gaussian posterior model: 3500x32 CW+SAT
5 neighboring frames of Gaussian posteriors.
M is 42 x (3500*32*5), h is (3500*32*5) x 1.
Ref: Zheng ICASSP 07 paper

$y_t = (A^k x_t + b^k) + M h_t$
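A toy-sized NumPy sketch of the feature computation, assuming the transform has the form y_t = (A^k x_t + b^k) + M h_t, with A^k and b^k an affine transform for speaker k and h_t the Gaussian posteriors of 5 neighboring frames stacked into one vector. All values are random stand-ins, the number of Gaussians is shrunk from 3500*32 to 20, and edge frames are padded by repetition, which is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 42         # feature dimension, as on the slide
N_GAUSS = 20   # toy value; the slide uses 3500*32 Gaussians
CONTEXT = 5    # neighboring frames of posteriors
T = 50         # number of frames

x = rng.normal(size=(T, D))                        # input features x_t
A = np.eye(D) + 0.01 * rng.normal(size=(D, D))     # hypothetical speaker transform A^k
b = rng.normal(size=D)                             # hypothetical speaker offset b^k
M = 0.01 * rng.normal(size=(D, N_GAUSS * CONTEXT))  # projection; trained with MPE in a real system

post = rng.dirichlet(np.ones(N_GAUSS), size=T)     # stand-in Gaussian posteriors, shape (T, N_GAUSS)


def stack_context(p: np.ndarray, context: int) -> np.ndarray:
    """Concatenate each frame's posteriors with those of its neighbors."""
    half = context // 2
    padded = np.pad(p, ((half, half), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(p)] for i in range(context)])


h = stack_context(post, CONTEXT)   # shape (T, N_GAUSS * CONTEXT)
y = x @ A.T + b + h @ M.T          # y_t = (A x_t + b) + M h_t for every frame
print(h.shape, y.shape)
```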

Page 20: Building A Highly Accurate Mandarin Speech Recognizer

20

Topic-based LM Adaptation

Latent Dirichlet Allocation Topic Model

[Figure: timeline centered on one sentence at position 0, with context words {w | w in same story (4 secs)}]

A 4s window is used to make adaptation more robust against ASR errors.
{w} are weighted based on distance.

Page 21: Building A Highly Accurate Mandarin Speech Recognizer

21

Topic-based LM Adaptation
Training: one topic per sentence. Train 64 topic-dependent LMs.
Testing: top n topics per sentence, weighting on the neighboring 4s of speech.

$LM_{adapt} = \lambda \left( \sum_i \theta_i \, LM_i \right) + (1 - \lambda) \, qLM_4$
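The adaptation can be read as a linear interpolation: a weighted mixture of the top topic LMs, interpolated with the background qLM4. The sketch below illustrates this on toy unigram distributions; the symbol names (topic weights, interpolation weight `lam`), the vocabulary, and all probabilities are made up for illustration, and real LMs would be n-gram models rather than unigram dictionaries.

```python
import math


def adapt_lm(topic_lms, topic_weights, background_lm, lam):
    """lam * (sum_i w_i * LM_i) + (1 - lam) * background, word by word."""
    vocab = set(background_lm)
    for lm in topic_lms:
        vocab |= set(lm)
    adapted = {}
    for word in vocab:
        mixture = sum(w * lm.get(word, 0.0) for w, lm in zip(topic_weights, topic_lms))
        adapted[word] = lam * mixture + (1.0 - lam) * background_lm.get(word, 0.0)
    return adapted


# Hypothetical toy distributions (not from the slides).
finance_lm = {"market": 0.6, "rises": 0.4}
sports_lm = {"match": 0.5, "rises": 0.5}
background = {"market": 0.2, "rises": 0.3, "match": 0.5}

adapted = adapt_lm([finance_lm, sports_lm], [0.7, 0.3], background, lam=0.5)
print({w: round(p, 3) for w, p in sorted(adapted.items())})
```

Because the topic weights sum to one and each component is a proper distribution, the adapted model is again a proper distribution.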

Page 22: Building A Highly Accurate Mandarin Speech Recognizer

22

Topic-based LM Adaptation
LMi still 60K words?
Per-sentence adaptation?
Computational cost?

Page 23: Building A Highly Accurate Mandarin Speech Recognizer

23

LM Adaptation and CNC on Dev07

Dev07 | CW PLP | CW MLP | CNC
LM3 | 12.0 | 11.9 | ---
LM4 | 11.9 | 11.7 | 11.4
Adapted qLM4 | 11.7 | 11.4 | 11.2

UW 2 systems only

Page 24: Building A Highly Accurate Mandarin Speech Recognizer

24

LM Adaptation and CNC on Eval07

AM (adapt. hyps) | PLP(MLP) | MLP(PLP) | MLP(Aachen) | PLP(Aachen) | Rover
LM3 | 10.2 | 9.6 | 9.9 | 10.1 | --
qLM4 | 10.2 | 9.7 | 10.0 | 10.1 | --
LM4 | 10.0 | 9.6 | 9.8 | 10.0 | 9.1
Adapted qLM4 | 9.7 | 9.3 | 9.6 | 9.7 | 8.9

Page 25: Building A Highly Accurate Mandarin Speech Recognizer

25

Eval07

Team | CER
UW | 9.1%
RWTH | 12.1%
UW+RWTH | 8.9%
CU+BBN | 9.4%
IBM+CMU | 9.8%

Page 26: Building A Highly Accurate Mandarin Speech Recognizer

26

2006 vs. 2007 on Eval07

System | SUB | DEL | INS | TOTAL
2006 system | 7.2 | 6.5 | 0.4 | 14.1
2007 system | 5.5 | 3.0 | 0.4 | 8.9

37% relative improvement!!
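The 37% figure is the relative reduction in total CER. The same calculation per error type shows where the gain comes from; a small sketch using the numbers from the table (the helper name is illustrative):

```python
# Error rates (%) on Eval07, copied from the table above.
SYS_2006 = {"sub": 7.2, "del": 6.5, "ins": 0.4, "total": 14.1}
SYS_2007 = {"sub": 5.5, "del": 3.0, "ins": 0.4, "total": 8.9}


def relative_reduction(old: float, new: float) -> float:
    """Relative reduction in percent when going from `old` to `new`."""
    return 100.0 * (old - new) / old


for key in ("sub", "del", "ins", "total"):
    print(f"{key}: {relative_reduction(SYS_2006[key], SYS_2007[key]):.1f}% relative reduction")
```

Deletions drop by more than half, while insertions are unchanged.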

Page 27: Building A Highly Accurate Mandarin Speech Recognizer

27

Progress

Testset | 2006 | 2007-06 | 2007-12
Eval06 | 18.4% | 15.3% | 14.7%
Dev07 | --- | 11.2% | 9.6%*
Eval07 | 14.1% | 8.9% | ---

Page 28: Building A Highly Accurate Mandarin Speech Recognizer

28

RWTH Demo
UW acoustic segmenter.
RWTH single-system ASR.
Foreign (Korean) speech skipped.
Misrecognitions highlighted.
Manual sentence segmentation.
Machine translation.
Not real-time.

Page 29: Building A Highly Accurate Mandarin Speech Recognizer

29

MT Error Analysis on Extreme Cases

Snippet | Dur | CER | HTER
a) Worst BN | 87s | 10.9% | 47.73%
b) Worst BC | 72s | 24.9% | 48.37%
c) Best BN | 62s | 0% | 12.67%
d) Best BC | 77s | 15.2% | 14.20%

CER is not directly related to HTER; genre matters.
Better CER does ease MT.

Page 30: Building A Highly Accurate Mandarin Speech Recognizer

30

MT Error Analysis
(a) worst BN: OOV names
(b) worst BC: overlapped speech
(c) best BN: composite sentences
(d) best BC: simple sentences with disfluency and re-starts.
*.html, *.wav

Page 31: Building A Highly Accurate Mandarin Speech Recognizer

31

Error Analysis
OOV (especially names): problematic for ASR, MT, distillation.

徐昌霖, 徐成民, 徐长明: Xu, Chang-Lin
黄竹琴, 黄朱琴, 黄朱勤, 皇猪禽, 黄朱其: Huang, Zhu-Qin

Page 32: Building A Highly Accurate Mandarin Speech Recognizer

32

Error Analysis
MT BN high errors
  Composite syntax structure.
  Syntactic parsing would be useful.
MT BC high errors
  Overlapped speech
  ASR high errors due to disfluency
  Conjecture: MT on perfect BC ASR is easy, for its simple/short sentence structure

Page 33: Building A Highly Accurate Mandarin Speech Recognizer

33

Next ASR: Chinese Organization Names
Semi-automatic abbreviation generation for long words.
Segment a long word into a sequence of shorter words.
Extract the 1st character of each shorter word: 世界卫生组织 → 世卫
(Make sure they are in the MT translation table, too.)
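A minimal sketch of the abbreviation idea: take the first character of each segment and offer the prefixes as candidates for a human to confirm, in line with the semi-automatic spirit of the slide. The segmentation of 世界卫生组织 into three words and the prefix heuristic are assumptions made for illustration; the slides do not specify how candidates are generated or selected.

```python
def abbreviation_candidates(segments: list[str], min_len: int = 2) -> list[str]:
    """First character of each segment, offered as prefixes of growing length."""
    initials = "".join(seg[0] for seg in segments)
    return [initials[:k] for k in range(min_len, len(initials) + 1)]


# ASSUMPTION: a word segmenter splits the long word like this.
segments = ["世界", "卫生", "组织"]
print(abbreviation_candidates(segments))
```

The candidate list contains 世卫, the abbreviation given on the slide, alongside the longer 世卫组.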

Page 34: Building A Highly Accurate Mandarin Speech Recognizer

34

Next ASR: Chinese Person Names
Mandarin has a high rate of homophones: 408 syllables for 6000 common characters, i.e. about 14 homophone characters per syllable!!
Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But for MT, we don't care anyway, as long as the syllables are correct!!
Recognizing repetition of the same name in the same snippet: CNC at syllable level
  Xu {Chang, Cheng} {Lin, Min, Ming}
  Huang Zhu {Qin, Qi}
After syllable CNC, apply the same name to all occurrences in Pinyin.
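As a rough illustration of the syllable-level idea, the sketch below takes a plain majority vote per syllable position over several recognized variants of the same name. This is a simplification: real confusion network combination aligns hypotheses and weighs them by posterior probability, whereas this version assumes equal-length hypotheses and equal weights. The Pinyin strings are hand-written transliterations of the Huang Zhu-Qin variants from the error analysis.

```python
from collections import Counter


def vote_syllables(hypotheses: list[list[str]]) -> list[str]:
    """Majority vote per syllable position over equal-length hypotheses."""
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        raise ValueError("this sketch assumes equal-length hypotheses")
    return [
        Counter(h[i] for h in hypotheses).most_common(1)[0][0]
        for i in range(length)
    ]


# Pinyin of five recognized variants of the same spoken name.
variants = [
    ["huang", "zhu", "qin"],  # 黄竹琴
    ["huang", "zhu", "qin"],  # 黄朱琴
    ["huang", "zhu", "qin"],  # 黄朱勤
    ["huang", "zhu", "qin"],  # 皇猪禽
    ["huang", "zhu", "qi"],   # 黄朱其
]
consensus = vote_syllables(variants)
print(" ".join(consensus))
```

The consensus syllables can then be applied to every occurrence of the name in the snippet.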

Page 35: Building A Highly Accurate Mandarin Speech Recognizer

35

Next ASR: Foreign Names
English spelling in lexicon, with (multiple) Mandarin pronunciations:
  Bush /bu4 shi2/ or /bu4 xi1/
  Bin Laden /ben1 la1 deng1/ or /ben3 la1 deng1/
  John /yue1 han4/
  Sadr /sa4 de2 er3/
Name mapping from MT?
Need to do name tagging on training text (Yang Liu), convert Chinese names to English spelling, re-train n-gram.

Page 36: Building A Highly Accurate Mandarin Speech Recognizer

36

Next ASR: LM

LM adaptation with fine topics, each topic with small vocabulary size.
Spontaneous speech: n-gram backtraces to content words in search or N-best? Text paring modeling?
  我想那 (也) (也) 也是 → 我想那也是
  I think it, (too), (too), is, too. → I think it is, too.
If optimizing CER, the stm needs to be designed such that disfluency is optionally deletable: 小孩 (儿)

Page 37: Building A Highly Accurate Mandarin Speech Recognizer

37

Next ASR: AM
Add explicit tone modeling (Lei07).
  Prosody info: duration and pitch contour at word level
  Various backoff schemes for infrequent words
Better understanding of why outside regions are not helping with AM adaptation.
Add SD MLLR regression tree (Mandal06).
Improve auto speaker clustering
  Smaller clusters, better performance
  Gender ID first.

Page 38: Building A Highly Accurate Mandarin Speech Recognizer

38

ASR & MT Integration
Do we need to merge lexicon? ASR MT.
Do we need to use the same word segmenter?
Is word/char-level CNC output better for MT?
Open questions and feedback!!!