
Page 1: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech

The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech

Frank Seide

IEEE Transactions on Speech and Audio Processing 2005

Presented by shih-hung, 2005/09/29

Page 2: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Outline

• Introduction

• Review of (M+1)-gram Viterbi Decoding with reentrant tree

• Virtual Hypothesis Copies on word level

• Virtual Hypothesis Copies on sub-word level

• Virtual Hypothesis Copies for Long-Range Acoustic Lookahead (optional)

• Experimental Results

• Conclusion

Page 3: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

Page 4: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

• For LVCSR decoding, the most widely used algorithm is a time-synchronous Viterbi decoder that uses a tree-organized pronunciation lexicon with word-conditioned tree copies.

• The search space is organized as a reentrant network which is a composition of the state-level network (lexical tree) and the linguistic (M+1)-gram network.

– i.e. a distinct instance ("copy") of each HMM state in the lexical tree is needed for every linguistic state (M-word history).

• Practically, this copying is done on demand in conjunction with beam pruning.
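As a rough illustration of this on-demand copying (a sketch with assumed data structures and names, not the paper's implementation):

```python
from collections import defaultdict

NEG_INF = float("-inf")
NUM_TREE_STATES = 1000          # number of HMM states in the tree lexicon (assumed)

# One score array ("copy") of the lexical tree per active linguistic state
# (M-word history); a copy is only allocated when a path actually needs it.
active_copies = defaultdict(lambda: [NEG_INF] * NUM_TREE_STATES)

def enter_tree(history, log_score):
    """A path ends a word and re-enters the tree root for this history;
    the copy for `history` is instantiated here on demand."""
    copy = active_copies[history]
    copy[0] = max(copy[0], log_score)       # state 0 = tree root

def prune(beam_width):
    """Beam pruning: drop states, and whole copies, that fall outside the beam."""
    best = max((s for copy in active_copies.values() for s in copy), default=NEG_INF)
    for history in list(active_copies):
        copy = active_copies[history]
        for i, s in enumerate(copy):
            if s < best - beam_width:
                copy[i] = NEG_INF
        if all(s == NEG_INF for s in copy):
            del active_copies[history]      # the whole tree copy became inactive
```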

Page 5: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

Page 6: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

Page 7: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

• One observes that hypotheses for the same word generated from different tree copies are often identical.

– i.e. there is redundant computation.

• Can we exploit this redundancy and modify the algorithm such that word hypotheses are shared across multiple linguistic states?

[Figure: example tree copies with the words "frank", "funny", "seide"]

Page 8: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

• A successful approach to this is the two-pass algorithm by Ney and Aubert. It first generates a word lattice using the "word-pair approximation", and then searches the best path through this lattice using the full-range language model.

– Computation is reduced by sharing word hypotheses between two-word histories that end with the same word.

• An alternative approach is start-time conditioned search, which uses non-reentrant tree copies conditioned on the start time of the tree. Here, word hypotheses are shared across all possible linguistic states during word-level recombination.

Page 9: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

Page 10: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

Page 11: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

• In this paper, we propose a single-pass reentrant-network (M+1)-gram decoder that uses three novel approaches aimed at eliminating redundant copies of the search space.

• 1. State copies are conditioned on the phonetic history rather than the linguistic history.

– Phone-history approximation (PHA), analogous to the word-pair approximation (WPA).

• 2. Path hypotheses at word boundaries are saved at every frame in a data structure similar to a word lattice. To apply the (M+1)-gram at a word end, the needed linguistic path-hypothesis copies are recovered on the fly, similarly to lattice rescoring. We call the recovered copies virtual hypothesis copies (VHC).

Page 12: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Introduction

• 3. For further reduction of redundancy, multiple instances of the same context-dependent phone occurring in the same phonetic history are also dynamically replaced by a single instance. Incomplete path hypotheses at phoneme boundaries are temporarily saved in the lattice-like structure as well. To apply the tree lexicon, CD-phone instances associated with tree nodes are recovered on the fly (phone-level VHC).
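A minimal sketch of such a frame-indexed, lattice-like store of boundary hypotheses (all names and structures here are illustrative assumptions, not the paper's):

```python
from collections import defaultdict

NEG_INF = float("-inf")

# frame t -> { linguistic history W : log H(W; t) }, i.e. the path hypotheses
# saved at word boundaries; phone-level VHC would keep an analogous store at
# phoneme boundaries.
boundary_store = defaultdict(dict)

def save_boundary_hypothesis(t, history, log_score):
    """Remember the best word-boundary score of `history` at frame t."""
    if log_score > boundary_store[t].get(history, NEG_INF):
        boundary_store[t][history] = log_score

def recoverable_histories(t, cls, history_class):
    """All histories stored at frame t that fall into phonetic class `cls`;
    these are the copies later recovered on the fly (virtual hypothesis copies)."""
    return {W: v for W, v in boundary_store[t].items() if history_class(W) == cls}
```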

Page 13: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Review of (M+1)-gram Viterbi decoding with a reentrant tree

$Q_{W_M}(t, s)$ := probability of the best path up to time $t$ that ends in state $s$ of the lexical tree for history $W_M$.

$B_{W_M}(t, s)$ := time of the latest transition into the tree root on the best path up to time $t$ that ends in state $s$ of the lexical tree for history $W_M$ ("back-pointer").

$H(W_M; t)$ := probability that the acoustic observation vectors $o(1) \ldots o(t)$ are generated by a word/state sequence that ends with the $M$ words $W_M$ at time $t$.

Page 14: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Review of (M+1)-gram Viterbi decoding with a reentrant tree

• The dynamic-programming equations for the word-history conditioned (M+1)-gram search are as follows:

Within-word recombination (s>0)

$$Q_{W_M}(t, s) = q_s(o(t)) \cdot \max_{s'} \{\, P(s \mid s')\, Q_{W_M}(t-1, s') \,\}$$

$$B_{W_M}(t, s) = B_{W_M}(t-1,\, s_{\max}(t, s))$$

where $s$ and $s'$ denote tree states, $P(s \mid s')$ the transition probability, $q_s(o(t))$ the emission likelihood, and $s_{\max}(t, s)$ the maximizing predecessor state.
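A schematic Python rendering of this within-word recombination for one tree copy (helper names such as `predecessors`, `trans_logprob`, and `emission_logprob` are assumptions for illustration):

```python
NEG_INF = float("-inf")

def within_word_step(prev_Q, prev_B, predecessors, trans_logprob, emission_logprob, obs):
    """One frame of within-word recombination for a single tree copy (log domain).

    prev_Q[s], prev_B[s]   : score and back-pointer of state s at frame t-1
    predecessors[s]        : states s' that can transition into s
    trans_logprob(s, s1)   : log P(s | s')
    emission_logprob(s, o) : log q_s(o)
    """
    num_states = len(prev_Q)
    Q = [NEG_INF] * num_states
    B = [None] * num_states
    for s in range(1, num_states):              # s > 0: the root is re-seeded at word boundaries
        best_score, best_pred = NEG_INF, None
        for s_prev in predecessors[s]:
            score = trans_logprob(s, s_prev) + prev_Q[s_prev]
            if score > best_score:
                best_score, best_pred = score, s_prev
        if best_pred is not None and best_score > NEG_INF:
            Q[s] = emission_logprob(s, obs) + best_score   # q_s(o(t)) * max_{s'}{...}
            B[s] = prev_B[best_pred]                       # back-pointer is propagated
    return Q, B
```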

Page 15: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Review of (M+1)-gram Viterbi decoding with a reentrant tree

Word-boundary equations:

$$\hat H(W_M, w; t) = P(w \mid W_M)\, Q_{W_M}(t, S_w)$$

$$H((h, w); t) = \max_{w'} \{\, \hat H((w', h), w; t) \,\}$$

$$Q_{W_M}(t+1, 0) = H(W_M;\, t+1)$$

$$B_{W_M}(t+1, 0) = t + 1$$

where $S_w$ denotes a terminal state of the lexical tree for word $w$, $(w', h)$ denotes the history with the oldest word replaced by word $w'$, and $P(w \mid W_M)$ is the (M+1)-gram language model probability.
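A sketch of this word-boundary step, with the language model applied at word ends and the result recombined into new histories (log domain; all helper names are assumed):

```python
NEG_INF = float("-inf")

def word_boundary_step(copies, terminal_states, lm_logprob, vocab):
    """Apply the (M+1)-gram at word ends and recombine into new histories.

    copies[W][s]       : Q_W(t, s) for an M-word history tuple W
    terminal_states[w] : terminal tree state S_w of word w
    lm_logprob(w, W)   : log P(w | W)
    Returns H[(h, w)] = max over the oldest word w' of Ĥ((w', h), w; t).
    """
    H = {}
    for W, Q in copies.items():
        h = W[1:]                                   # drop the oldest word of the history
        for w in vocab:
            s_w = terminal_states[w]
            if Q[s_w] == NEG_INF:
                continue
            h_hat = lm_logprob(w, W) + Q[s_w]       # Ĥ(W, w; t)
            new_hist = h + (w,)
            if h_hat > H.get(new_hist, NEG_INF):
                H[new_hist] = h_hat                 # max over the replaced oldest word w'
    return H   # H[W'] then re-seeds the tree root: Q_{W'}(t+1, 0) = H(W'; t+1)
```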

Page 16: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Virtual hypothesis copies on word level

A. How it works

B. Word hypothesis

C. Word-Boundary assumption and Phonetic-History approximation

D. Virtual hypothesis copies: redundancy of $Q_{W_M}$

E. Choosing $\tilde W_M$

F. Collapsed hypothesis copies

G. Word-boundary equations

H. Collapsed (M+1)-gram search: Summary

I. Beam pruning

J. Language model lookahead

Page 17: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


How it works

• The optimal start time $B_{W_M}(t_e, S_w)$ of a word $w$ depends on its history $W_M$. The same word in different histories may have different optimal start times - this is the reason for copying.

• However, we observed that start times are often identical, in particular if the histories are acoustically similar. If for two linguistic histories $W_M$ and $W'_M$ we obtain the same optimal start time $t_s$,

$$B_{W_M}(t_e, S_w) = B_{W'_M}(t_e, S_w) = t_s,$$

then we have computed too much.

Page 18: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


How it works

• It would only have been necessary to perform the state-level Viterbi recursion for one of the two histories. This is because:

$$\frac{Q_{W_M}(t_e, S_w)}{Q_{W_M}(t_s, 0)} = \frac{Q_{W'_M}(t_e, S_w)}{Q_{W'_M}(t_s, 0)}$$

In other words, $Q_{W'_M}(t_e, S_w)$ can be computed (or recovered) from $Q_{W_M}(t_e, S_w)$:

$$Q_{W'_M}(t_e, S_w) = Q_{W_M}(t_e, S_w) \cdot \frac{Q_{W'_M}(t_s, 0)}{Q_{W_M}(t_s, 0)}$$
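In the log domain the recovery is just an additive correction; a toy numeric check of the relation above (made-up numbers):

```python
# Toy check of the recovery relation, in log probabilities (made-up numbers).
Q_W_root  = -120.0    # Q_W (t_s, 0)   : history W  at the shared start time t_s
Q_W2_root = -123.5    # Q_W'(t_s, 0)   : history W' at the same start time
Q_W_end   = -310.0    # Q_W (t_e, S_w) : computed by the state-level search for W

# Since both histories share the optimal start time t_s, the word-internal part
# of the score is identical, and Q_W'(t_e, S_w) can simply be recovered:
Q_W2_end = Q_W_end + (Q_W2_root - Q_W_root)
print(Q_W2_end)       # -313.5: no second state-level recursion was needed
```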

Page 19: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


How it works

• We are now ready to introduce our method of virtual hypothesis copying (word level). The method consists of

– 1. predicting the sets of histories for which the optimal start times are going to be identical - this information is needed already when a path enters a new word;

– 2. performing state-level Viterbi processing only for one copy per set;

– 3. for all other copies, recovering their accumulated path probabilities. Thus, on the state level, all but one copy per set are neither stored nor computed - we call them "virtual".

Page 20: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


How it works

• The art is to reliably predict these sets of histories that will lead to identical optimal start times. An exact prediction is impossible.

• We propose a heuristic, the phone-history approximation (PHA).

• The PHA assumes that a word’s optimal boundary depends only on the last N phones of the history.

We will step-wise eliminate part of the history-dependent state copies $\big(Q_{W_M}(t,s),\, B_{W_M}(t,s)\big)$ by replacing them with "collapsed copies" $\big(Q_{c(W_M)}(t,s),\, B_{c(W_M)}(t,s)\big)$, conditioned on entire classes of histories $c(W_M)$.

Page 21: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


How it works

[Figure: search-space comparison - regular bigram search vs. virtual hypothesis copies]

Page 22: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Word hypotheses

The quadruple $(w, t_s, t_e, h(w, t_s, t_e))$ is called a "word hypothesis", with the word-emission likelihood

$h(w, t_s, t_e)$ := probability that word $w$ produces the acoustic vectors $o(t_s+1) \ldots o(t_e)$, i.e. an acoustic score of the form $p(O \mid w)$.

For every tree copy conditioned on history $W_M$, a set of word hypotheses $(w, t_s, t_e, h(w, t_s, t_e))$ can be derived as

$$h(w, t_s, t_e) = \frac{Q_{W_M}(t_e, S_w)}{Q_{W_M}(t_s, 0)}, \qquad \text{with } t_s = B_{W_M}(t_e, S_w).$$
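A possible representation of such word hypotheses and their derivation from one tree copy (a log-domain sketch with assumed field names):

```python
from dataclasses import dataclass

@dataclass
class WordHypothesis:
    """The quadruple (w, t_s, t_e, h): word, start time, end time,
    and word-emission log likelihood h(w, t_s, t_e)."""
    word: str
    t_start: int
    t_end: int
    log_h: float

def derive_hypothesis(word, t_end, Q, B, s_terminal):
    """Derive a word hypothesis from one tree copy (Q indexed as Q[t][s]):
    h(w, t_s, t_e) = Q_W(t_e, S_w) / Q_W(t_s, 0), with t_s = B_W(t_e, S_w)."""
    t_start = B[t_end][s_terminal]
    log_h = Q[t_end][s_terminal] - Q[t_start][0]   # division becomes subtraction in log domain
    return WordHypothesis(word, t_start, t_end, log_h)
```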

Page 23: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Word-Boundary assumption and Phonetic-History approximation

We assume the equation below is always true, for every word $w$ and time $t$:

$$B_{W_M}(t, S_w) = B_{W'_M}(t, S_w) \quad \text{for all } W'_M \text{ with } c(W'_M) = c(W_M).$$

This common word boundary shall be denoted by the symbol

$$B_{c(W_M)}(t, S_w) \stackrel{\mathrm{def}}{=} B_{W_M}(t, S_w).$$

Page 24: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Word-Boundary assumption and Phonetic-History approximation

• Intuitively, the optimal word boundaries should not depend on the linguistic state, but rather on the phonetic context at the boundary.

• And words ending similarly should lead to the same boundary.

• Thus, we propose a phonetically motivated history-class definition, the phone-history approximation (PHA):

– A word's optimal start time depends on the word and its N-phone history.

For example, $c(W_M) = \{\text{the last two phones of } W_M\}$. $N$ may also be chosen to be variable, e.g. depending on phonotactic constraints such as syllable structure.
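A toy sketch of such a history-class function c(W_M) under the phone-history approximation (the pronunciations and N = 2 are illustrative assumptions):

```python
# Illustrative phone-history approximation: the class of a word history is
# its last N phones. The pronunciation lexicon below is a toy assumption.
LEXICON = {
    "frank": ["f", "r", "ae", "ng", "k"],
    "seide": ["s", "ay", "d"],
    "funny": ["f", "ah", "n", "iy"],
}

def history_class(history, n_phones=2):
    """c(W_M): map an M-word history to the tuple of its last n_phones phones."""
    phones = []
    for word in history:
        phones.extend(LEXICON[word])
    return tuple(phones[-n_phones:])

print(history_class(("funny", "frank")))   # ('ng', 'k')
print(history_class(("seide", "frank")))   # ('ng', 'k')  -> same class, one shared tree copy
```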

Page 25: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Virtual hypothesis copies: redundancy of $Q_{W_M}$

For $\tilde W_M$ with $c(\tilde W_M) = c(W_M)$:

$$Q_{W_M}(t, S_w) = Q_{W_M}\big(B_{c(W_M)}(t, S_w),\, 0\big)\cdot h\big(w,\, B_{c(W_M)}(t, S_w),\, t\big) = H\big(W_M;\, B_{c(W_M)}(t, S_w)\big)\cdot h\big(w,\, B_{c(W_M)}(t, S_w),\, t\big)$$

$$\frac{Q_{W_M}(t, S_w)}{Q_{\tilde W_M}(t, S_w)} = \frac{Q_{W_M}\big(B_{c(W_M)}(t, S_w),\, 0\big)}{Q_{\tilde W_M}\big(B_{c(W_M)}(t, S_w),\, 0\big)} = \frac{H\big(W_M;\, B_{c(W_M)}(t, S_w)\big)}{H\big(\tilde W_M;\, B_{c(W_M)}(t, S_w)\big)}$$

$$\Rightarrow\quad Q_{W_M}(t, S_w) = Q_{\tilde W_M}(t, S_w) \cdot \frac{H\big(W_M;\, B_{c(W_M)}(t, S_w)\big)}{H\big(\tilde W_M;\, B_{c(W_M)}(t, S_w)\big)}$$

Page 26: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Virtual hypothesis copies: redundancy of $Q_{W_M}$

• Every $Q_{W_M}(t, S_w)$ can be recovered from any other $Q_{\tilde W_M}(t, S_w)$ and $H(W_M; \cdot)$, for $c(\tilde W_M) = c(W_M)$.

• This gives way to reducing the search space by keeping only one tree copy per history class $c(W_M)$ and sharing the generated word hypotheses across the linguistic states belonging to a class.

• Since the recovered $Q_{W_M}(t, S_w)$ need be neither directly computed nor stored, we want to call them virtual hypothesis copies.
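In the log domain, recovering a virtual copy's word-end score is a single addition and subtraction; a sketch (parameter names assumed):

```python
def recover_word_end_score(log_Q_ref_end,   # Q_{W~}(t, S_w): the one copy actually computed
                           log_H_W,         # H(W;  B_{c(W)}(t, S_w)) for the "virtual" history W
                           log_H_ref):      # H(W~; B_{c(W)}(t, S_w)) for the reference history W~
    """Virtual hypothesis copy (word level): recover Q_W(t, S_w) without
    running the state-level search for history W.
    Probability domain:  Q_W(t, S_w) = Q_{W~}(t, S_w) * H(W; B) / H(W~; B)."""
    return log_Q_ref_end + log_H_W - log_H_ref
```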

Page 27: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Choosing $\tilde W_M$

$$\tilde W_M = \operatorname*{argmax}_{W'_M:\; c(W'_M) = c(W_M)} \{\, Q_{W'_M}(t, S_w) \,\}$$

$$Q_{\tilde W_M}(t, S_w) = \max_{W'_M:\; c(W'_M) = c(W_M)} \{\, Q_{W'_M}(t, S_w) \,\}$$

Page 28: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Collapsed hypothesis copies

• The most probable hypothesis is only known when the end of the word is reached - too late to reduce computation.

• However, we found that $\tilde W_M$ and $Q_{\tilde W_M}(t, S_w)$ can be determined by a dynamic-programming recursion, without having to compute all potential $Q_{W'_M}(t, S_w)$.

• We define the "collapsed" tree copy $Q_{c(W_M)}(t, s)$ as the state-wise maximum over the dependent copies $Q_{W'_M}(t, s)$:

$$Q_{c(W_M)}(t, s) = \max_{W'_M:\; c(W'_M) = c(W_M)} \{\, Q_{W'_M}(t, s) \,\}$$

Page 29: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Collapsed hypothesis copies

We rewrite the above equation by inserting the state-level dynamic-programming recursion:

$$Q_{c(W_M)}(t, s) = \max_{W'_M:\; c(W'_M) = c(W_M)} \{\, Q_{W'_M}(t, s) \,\}$$

$$= \max_{W'_M} \big\{\, q_s(o(t)) \cdot \max_{s'} \{ P(s \mid s')\, Q_{W'_M}(t-1, s') \} \,\big\}$$

$$= q_s(o(t)) \cdot \max_{s'} \big\{\, P(s \mid s') \cdot \max_{W'_M} \{ Q_{W'_M}(t-1, s') \} \,\big\}$$

$$= q_s(o(t)) \cdot \max_{s'} \{\, P(s \mid s')\, Q_{c(W_M)}(t-1, s') \,\}$$
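The step from the second to the third line works because neither the emission nor the transition score depends on the history, so the two maximizations can be exchanged; a toy numeric check of this exchange (made-up numbers):

```python
# Toy check: the state-level Viterbi step commutes with the max over histories,
# because emission and transition scores do not depend on the history.
import math

log_trans = {(1, 0): math.log(0.6), (1, 1): math.log(0.4)}   # log P(s=1 | s')
log_emit_s1 = math.log(0.3)                                  # log q_1(o(t))
prev_Q = {                                                   # Q_{W'}(t-1, s') for two histories of one class
    "W_a": {0: math.log(0.020), 1: math.log(0.010)},
    "W_b": {0: math.log(0.015), 1: math.log(0.030)},
}

def viterbi_step(prev):
    return log_emit_s1 + max(log_trans[(1, sp)] + prev[sp] for sp in (0, 1))

# per-history recursion, then max over histories:
lhs = max(viterbi_step(prev_Q[W]) for W in prev_Q)
# collapsed copy: max over histories first, then a single recursion:
collapsed_prev = {sp: max(prev_Q[W][sp] for W in prev_Q) for sp in (0, 1)}
rhs = viterbi_step(collapsed_prev)
print(abs(lhs - rhs) < 1e-12)    # True
```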

Page 30: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Word-boundary equations

$$\hat H(W_M, w; t) = P(w \mid W_M)\, Q_{W_M}(t, S_w) = P(w \mid W_M)\, Q_{c(W_M)}(t, S_w)\cdot \frac{H\big(W_M;\, B_{c(W_M)}(t, S_w)\big)}{H\big(\tilde W_M;\, B_{c(W_M)}(t, S_w)\big)}$$

with

$$H\big(\tilde W_M;\, B_{c(W_M)}(t, S_w)\big) = \max_{W'_M:\; c(W'_M) = c(W_M)} \big\{ H\big(W'_M;\, B_{c(W_M)}(t, S_w)\big) \big\} = Q_{c(W_M)}\big(B_{c(W_M)}(t, S_w),\, 0\big)$$

so that

$$\hat H(W_M, w; t) = P(w \mid W_M)\, Q_{c(W_M)}(t, S_w)\cdot \frac{H\big(W_M;\, B_{c(W_M)}(t, S_w)\big)}{Q_{c(W_M)}\big(B_{c(W_M)}(t, S_w),\, 0\big)}$$

The root re-initialization of the collapsed copy becomes

$$Q_{c(W_M)}(t+1, 0) = \max_{W'_M:\; c(W'_M) = c(W_M)} \{\, Q_{W'_M}(t+1, 0) \,\} = \max_{W'_M:\; c(W'_M) = c(W_M)} \{\, H(W'_M;\, t+1) \,\}$$
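A sketch of how these collapsed word-boundary equations could be evaluated, recovering the virtual copies from the frame-wise stored H values (log domain; every name here is an assumption, not the paper's code):

```python
NEG_INF = float("-inf")

def collapsed_word_boundary(Q_class, B_class, H, terminal_states,
                            lm_logprob, vocab, history_class):
    """One frame of the collapsed word-boundary equations (sketch, log domain).

    Q_class[c][s]      : collapsed copy Q_c(t, s) for history class c
    B_class[c][s]      : collapsed back-pointer B_c(t, s)
    H[tau][W]          : word-end score H(W; tau), stored frame-wise (lattice-like)
    terminal_states[w] : terminal tree state S_w of word w
    lm_logprob(w, W)   : log P(w | W), the (M+1)-gram score
    Returns new_H[(h, w)] for all word ends at the current frame.
    """
    new_H = {}
    for c, Q in Q_class.items():
        for w in vocab:
            s_w = terminal_states[w]
            if Q[s_w] == NEG_INF:
                continue
            tau = B_class[c][s_w]                       # common boundary of the class (PHA)
            in_class = {W: v for W, v in H.get(tau, {}).items() if history_class(W) == c}
            if not in_class:
                continue
            norm = max(in_class.values())               # = Q_c(tau, 0) = H(W~; tau)
            for W, logH_W in in_class.items():          # virtual copies, recovered on the fly
                h_hat = lm_logprob(w, W) + Q[s_w] + logH_W - norm    # Ĥ(W, w; t)
                new_hist = W[1:] + (w,)
                if h_hat > new_H.get(new_hist, NEG_INF):
                    new_H[new_hist] = h_hat             # H((h, w); t): max over the oldest word
    return new_H
```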

Page 31: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Collapsed (M+1)-gram search: Summary

Within-word recombination:

$$Q_{c(W_M)}(t, s) = q_s(o(t)) \cdot \max_{s'} \{\, P(s \mid s')\, Q_{c(W_M)}(t-1, s') \,\}$$

$$B_{c(W_M)}(t, s) = B_{c(W_M)}(t-1,\, s_{\max}(t, s))$$

Word-boundary equations:

$$\hat H(W_M, w; t) = P(w \mid W_M)\, Q_{c(W_M)}(t, S_w)\cdot \frac{H\big(W_M;\, B_{c(W_M)}(t, S_w)\big)}{Q_{c(W_M)}\big(B_{c(W_M)}(t, S_w),\, 0\big)}$$

$$H((h, w); t) = \max_{w'} \{\, \hat H((w', h), w; t) \,\}$$

$$Q_{c(W_M)}(t+1, 0) = \max_{W'_M:\; c(W'_M) = c(W_M)} \{\, H(W'_M;\, t+1) \,\}$$

$$B_{c(W_M)}(t+1, 0) = t + 1.$$

Page 32: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Language model lookahead

• M-gram lookahead aims at using language knowledge as early as possible in the lexical tree by pushing partial M-gram scores toward the tree root.
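A minimal sketch of such lookahead scores, computed as the best LM score of any word reachable below each tree node (the tree representation is an assumption; with the history ignored this reduces to unigram lookahead):

```python
def lm_lookahead_scores(tree_children, node_words, lm_logprob, history):
    """For every node of the lexical tree, the best (M+1)-gram log probability
    of any word reachable below it (a sketch, not the paper's implementation).

    tree_children[n] : child nodes of node n
    node_words[n]    : words ending at node n (leaves of the tree lexicon)
    The difference between a node's lookahead score and its parent's is the
    partial LM score added when a path crosses that arc.
    """
    lookahead = {}

    def visit(n):
        best = max((lm_logprob(w, history) for w in node_words.get(n, ())),
                   default=float("-inf"))
        for child in tree_children.get(n, ()):
            best = max(best, visit(child))
        lookahead[n] = best
        return best

    visit(0)                 # node 0 = tree root
    return lookahead
```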

Page 33: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Virtual hypothesis copies on the sub-word level

• In the word-level method, the state-level search can be interpreted as a “word-lattice generator” with (M+1)-gram “lattice rescoring” applied on the fly; and search-space reduction was achieved by sharing tree copies amongst multiple histories.

• We now want to apply the same idea to the subword level: the state-level search now becomes a sort of "subword generator"; subword hypotheses are incrementally matched against the lexical tree (frame-synchronously), and (M+1)-gram lattice rescoring is applied as before.

Page 34: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Virtual hypothesis copies on the sub-word level

Page 35: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Virtual hypothesis copies on the sub-word level

Page 36: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Virtual hypothesis copies on the sub-word level

Page 37: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental setup

• The Philips LVCSR system is based on continuous-mixture HMMs.

• MFCC features.

• Unigram lookahead.

• Corpora for Mandarin:

– MAT-2000, PCD, National Hi-Tech Project 863

• Corpora for English:

– Trained on WSJ0+1

– Tested on the 1994 ARPA NAB task

Page 38: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental result

Page 39: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental result

Page 40: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental result

Page 41: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental result

Page 42: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental result

Page 43: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental result

Page 44: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental result

Page 45: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Experimental result

Page 46: The Use of Virtual Hypothesis Copies in Decoding of Large-Vocabulary Continuous Speech


Conclusion

• We have presented a novel time-synchronous LVCSR Viterbi decoder for Mandarin based on the concept of virtual hypothesis copies (VHC).

• At no loss of accuracy, a reduction of the number of active states by 60-80% has been achieved for Chinese, and by 40-50% for American English.