Decoding Techniques for Automatic Speech Recognition
Florian Metze
Interactive Systems Laboratories
ESSLLI 2002, Trento, Aug 14, 2002
Outline
• Decoding in ASR
• Search Problem
• Evaluation Problem
• Viterbi Algorithm
• Tree Search
• Re-Entry
• Recombination
The ASR problem: argmax_W p(W|x)
• Two major knowledge sources
– Acoustic Model: p(x|W)
– Language Model: P(W)
• Bayes: p(W|x) p(x) = p(x|W) P(W)
• Search problem: argmax_W p(x|W) P(W)
• p(x|W) consists of Hidden Markov Models:
– Dictionary defines state sequence: "hello" = /hh eh l ow/
– Full model: concatenation of states (i.e. sounds)
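The Bayes decision rule above can be illustrated with a toy sketch. The candidate strings and scores below are invented for illustration; a real decoder searches over HMM state sequences, not a short candidate list.

```python
# Toy sketch of argmax_W p(x|W) P(W): combine acoustic and language
# model scores in log space and pick the best word sequence.
acoustic_logprob = {"hello world": -12.0, "yellow word": -11.5}  # log p(x|W)
lm_logprob = {"hello world": -2.0, "yellow word": -6.0}          # log P(W)

def decode(candidates):
    # p(x) is constant over W, so it can be ignored in the argmax
    return max(candidates, key=lambda W: acoustic_logprob[W] + lm_logprob[W])

best = decode(acoustic_logprob)
# the LM prefers "hello world" strongly enough to outweigh the
# slightly better acoustic score of "yellow word"
```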
Target Function/ Measure
• %WER = minimum edit distance between reference and hypothesis, divided by the number of reference words
• Example:
  REF: the quick brown fox jumps **  over
  HYP: *** quick brown fox jump  is  over
        D                  S     I        → 3 errors
  WER = 3/7 = 43%
• Different measure from max p(W|x)!
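The %WER computation is a standard Levenshtein distance over words; a minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate: minimum edit distance (substitutions, insertions,
    deletions, all cost 1) between reference and hypothesis word lists,
    divided by the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub/match
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)] / len(r)
```

On the slide's example the minimum edit distance is 3 (one deletion, one substitution, one insertion).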
A simpler problem: Evaluation
• So far we have:
– Dictionary: "hello" = /hh eh l ow/ …
– Acoustic Model: p_hh(x), p_eh(x), p_l(x), p_ow(x) …
– Language Model: P("hello world")
• State sequence: /hh eh l ow w er l d/
• Given W and x: Alignment needed!
[Figure: frames of x aligned to the states /hh eh l ow/]
The Viterbi Algorithm
• Beam search from left to right
• Resulting alignment is the best match between the state models p_s(x) and x
[Figure: trellis of local scores p_s(x) per frame for the states /hh eh l ow/ over time]
The Viterbi Algorithm (cont'd)
• Evaluation problem: ~ Dynamic Time Warping
• Best alignment for given W, x, and p_s(x) by locally adding scores (= -log p) for states and transitions
[Figure: the same trellis with accumulated Viterbi scores; the best alignment is found by backtracking from the lowest final score]
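The score accumulation can be sketched as a small dynamic program. A left-to-right topology with self-loops and single-step transitions is assumed, and transition scores are omitted for brevity:

```python
def viterbi_align(local_cost):
    """Time-synchronous Viterbi alignment for a left-to-right HMM.
    local_cost[s][t] = -log p_s(x_t); allowed moves: stay in state s
    or advance from state s-1. Returns the best total score for an
    alignment of all frames that ends in the last state."""
    S, T = len(local_cost), len(local_cost[0])
    INF = float("inf")
    D = [[INF] * T for _ in range(S)]
    D[0][0] = local_cost[0][0]
    for t in range(1, T):
        for s in range(S):
            prev = min(D[s][t - 1], D[s - 1][t - 1] if s > 0 else INF)
            if prev < INF:
                D[s][t] = local_cost[s][t] + prev
    return D[S - 1][T - 1]
```

Each cell adds its local score to the best predecessor, exactly the "locally adding scores" step described above.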
Pronunciation Prefix Trees (PPT)
• Tree Representation of the Search Dictionary
• Very compact → fast!
• Viterbi Algorithm also works for trees
BROADWAY: B R OA D W EY
BROADLY:  B R OA D L IE
BUT:      B AH T
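A PPT can be sketched as a nested dictionary; the `"<word>"` leaf marker is an implementation choice, not from the slides:

```python
def build_ppt(lexicon):
    """Build a pronunciation prefix tree: words sharing a phone prefix
    share the corresponding nodes; leaves carry the word identity."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.setdefault(ph, {})
        node["<word>"] = word  # mark the leaf
    return root

lexicon = {
    "BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
    "BROADLY":  ["B", "R", "OA", "D", "L", "IE"],
    "BUT":      ["B", "AH", "T"],
}
ppt = build_ppt(lexicon)
# All three words share the root arc "B"; BROADWAY and BROADLY
# additionally share the nodes "R OA D".
```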
Viterbi Search for PPTs
• A PPT is traversed in a time-synchronous way
• Apply the Viterbi Algorithm on
– state level (sub-phonemic units: -b -m -e), constrained by the HMM topology
– phone level, constrained by the PPT
• What do we do when we reach the end of a word?
Re-Entrant PPTs for Continuous Speech
• Isolated word recognition:
– Search terminates in the leaves of the PPT
• Decoding of word sequences:
– Re-enter the PPT and store the Viterbi path using a backpointer table
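One way to sketch such a backpointer table (the layout and function names are assumptions; real decoders store richer entries and resolve ties during the search rather than afterwards):

```python
# Each entry records a word-end event: (end_frame, word, start_frame, score).
# Scores are -log probabilities, so lower is better.
backpointers = []

def record_word_end(word, start_frame, end_frame, score):
    backpointers.append((end_frame, word, start_frame, score))

def trace_back(final_frame):
    """Recover the word sequence by following start frames backwards.
    Simplified: picks the locally best entry ending at each boundary."""
    words = []
    t = final_frame
    while t > 0:
        ending = [bp for bp in backpointers if bp[0] == t]
        _, word, start, _ = min(ending, key=lambda bp: bp[3])
        words.append(word)
        t = start
    return list(reversed(words))

record_word_end("hello", 0, 40, 10.0)
record_word_end("yellow", 0, 40, 12.0)
record_word_end("world", 40, 80, 5.0)
hypothesis = trace_back(80)
```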
Problem: Branching Factor
• Imagine a sequence of 3 words with a 10k vocabulary
– 10k ^ 3 = 1000G paths (potentially)
– Not everything will be expanded, of course
• Viterbi approximation → path recombination:
– Given P(Candy | "hi I am") = P(Candy | "hello I am"), only the better path needs to survive
[Figure: the paths "hi I am" and "hello I am" recombine before the word "Candy"]
Path Recombination
At time t:  Path1 = w1 … wN with score s1
            Path2 = v1 … vM with score s2
where:      s1 = p(x1 … xt | w1 … wN) * Π_i P(wi | wi-1 wi-2)
            s2 = p(x1 … xt | v1 … vM) * Π_i P(vi | vi-1 vi-2)
In the end, we‘re only interested in the best path!
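Under the Viterbi approximation, two paths whose future LM probabilities are identical can be recombined; for a trigram LM that means paths sharing the same two-word history. A hedged sketch (the tuple layout is an assumption):

```python
def recombine(paths):
    """paths: list of (word_sequence, score); scores are -log p, so
    lower is better. Keep only the best path per two-word history,
    since a trigram LM cannot distinguish them in the future."""
    best = {}
    for words, score in paths:
        key = tuple(words[-2:])  # trigram LM: last two words
        if key not in best or score < best[key][1]:
            best[key] = (words, score)
    return list(best.values())

paths = [(("hi", "I", "am"), 42.0), (("hello", "I", "am"), 40.5)]
survivors = recombine(paths)
# both paths end in ("I", "am"), so only the better one re-enters the tree
```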
Path Recombination (cont'd)
• To expand the search space into a new root:
– Pick the path with the best score so far (Viterbi approximation)
– Initialize scores and backpointers for the root node according to the best predecessor word
– Store the left-context model information with the last phone from the predecessor (context-dependent acoustic models: /s ih t/ vs. /l ih p/)
Problem with Re-Entry
• For a correct use of the Viterbi algorithm, the choice of the best path must include the score for the transition from the predecessor word to the successor word
• The word identity is not known at the root level; the choice of the best predecessor therefore cannot be made at this point
Consequences
1. Wrong predecessor words: language model information is only available at the leaf level
2. Wrong word boundaries: the starting point for the successor word is determined without any language model information
3. Incomplete linguistic information: open pruning thresholds are needed for beam search
Three-Pass Search Strategy
1. Search on a tree-organized lexicon (PPT)
• Aggressive path recombination at word ends
• Use linguistic information only approximately
• Generate a list of starting words for each frame
2. Search on a flat-organized lexicon
• Fix the word segmentation from the first pass
• Full use of the language model (often needs a third pass)
Three-Pass Decoder: Results
• Q4g system with cache for acoustic scores:
– 4000 acoustic models trained on BN+ESST
– 40k vocabulary
– Test on "readBN" data

Search Pass        Error Rate   Real-time factor
Tree Pass          22.0%        9.6
Flat Pass          18.8%        0.9
Lattice Rescoring  15.0%        0.2
One-Pass Decoder: Motivation
• The efficient use of all available knowledge sources as early as possible should result in faster decoding
• Use the same engine to decode with:
– Statistical n-gram language models with arbitrary n
– Context-free grammars (CFG)
– Word graphs
Linguistic States
• Linguistic state, examples:
– (n-1)-word history for a statistical n-gram LM
– Grammar state for CFGs
– (lattice node, word history) for word graphs
• To fully use the linguistic knowledge source, the linguistic state has to be kept during decoding
• Path recombination has to be delayed until the word identity is known
Linguistic context assignment
• Key idea: establish a linguistic polymorphism for each node of the PPT
• Maintain a list of linguistically morphed instances in each node
• Each instance stores its own backpointer and scores for each state of the underlying HMM with respect to the linguistic state of that instance
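The per-node bookkeeping described above might be sketched as follows; the class and field names are assumptions, not from the slides:

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One linguistically morphed instance of a PPT node."""
    linguistic_state: tuple  # e.g. two-word history for a trigram LM
    hmm_scores: list         # one score per HMM state (-b -m -e)
    backpointer: int = -1    # index into the backpointer table

@dataclass
class PPTNode:
    phone: str
    instances: dict = field(default_factory=dict)  # linguistic_state -> Instance
    children: dict = field(default_factory=dict)   # phone -> PPTNode

    def get_instance(self, lct, n_states=3):
        """Allocate an instance for linguistic context lct on demand."""
        if lct not in self.instances:
            self.instances[lct] = Instance(lct, [float("inf")] * n_states)
        return self.instances[lct]

node = PPTNode("B")
inst = node.get_instance(("bullets", "over"))
```

Instances are created lazily, so only linguistic contexts that actually survive pruning consume memory.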
PPT with Linguistically Morphed Instances
[Figure: PPT for BUT (B AH T), BROADWAY (B R OA D W EY), and BROADLY (B R OA D L IE) with instance lists attached to the nodes]
Typically: 3-gram LM, i.e. P(W) = Π_i P(wi | wi-1 wi-2),
e.g. P(broadway | "bullets over")
Language Model Lookahead
• Since the linguistic state is known, the complete LM information P(W) can be applied to the instances, given the possible successor words for that node of the PPT
• Let
  lct = linguistic context/state of instance i from node n
  path(w) = path of word w in the PPT
  λ(n, lct) = max { P(w | lct) : node n ∈ path(w) }
    (a minimum in the -log score domain)
  score(i) = p(x1 … xt | w1 … wN) * P(wN-1 | …) * λ(n, lct)
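The per-node lookahead score (written λ(n, lct) here) can be sketched as a recursive minimum over the -log LM scores of all words below a node. The nested-dict PPT layout with `"<word>"` leaf markers is an assumption:

```python
def lookahead(node, lm_score):
    """node: nested-dict PPT with "<word>" markers at leaves.
    lm_score: word -> -log P(word | linguistic context lct).
    Returns the best (minimum) LM score over all words whose
    path in the PPT passes through this node."""
    best = float("inf")
    for arc, child in node.items():
        if arc == "<word>":
            best = min(best, lm_score[child])
        else:
            best = min(best, lookahead(child, lm_score))
    return best

ppt = {"B": {"R": {"OA": {"D": {"W": {"EY": {"<word>": "BROADWAY"}},
                                "L": {"IE": {"<word>": "BROADLY"}}}}},
             "AH": {"T": {"<word>": "BUT"}}}}
lm = {"BROADWAY": 2.3, "BROADLY": 4.1, "BUT": 1.2}
```

In a real decoder these values are precomputed per linguistic context and cached, rather than recomputed per frame.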
LM Lookahead (cont'd)
• When the word becomes unique, the exact LM score is already incorporated and no explicit word transition needs to be computed
• The LM scores are updated on demand, based on a compressed PPT ("smearing" of LM scores)
• Tighter pruning thresholds can be used since the language model information is no longer delayed
Early Path Recombination
• Path recombination can be performed as soon as the word becomes unique, which is usually a few nodes before reaching the leaf. This reduces the number of distinct linguistic contexts and instances
• This is particularly effective for cross-word models due to the fan-out in the right-context models
One-Pass Decoder: Summary
• One-pass decoder based on
– One copy of the tree with dynamically allocated instances
– Early path recombination
– Full language model lookahead
• Linguistic knowledge sources
– Statistical n-grams with n > 3 possible
– Context-free grammars
Results

Task      Real-time factor     Error rate
          3-pass   1-pass      3-pass   1-pass
VM        6.8      4.0         26.9%    26.9%
readBN    12.2     4.2         14.7%    13.9%
Meeting   55       38          43.7%    43.4%
Remarks on Speed-Up
• Speed-up ranges from a factor of almost 3 for the readBN task to 1.4 for the meeting data
• Speed-up depends strongly on matched domain conditions
• The decoder profits from sharp language models
• LM lookahead is less effective for weak language models due to unmatched conditions
Memory Usage: Q4g

Module               3-pass    1-pass
Acoustic Models      44 MB     44 MB
Language Model       87 MB     82 MB
Overhead             16 MB     16 MB
Decoder (permanent)  120 MB    18 MB
Decoder (dynamic)    ~100 MB   ~20 MB
Total                367 MB    180 MB
Summary
• Decoding is time- and memory-consuming
• Search errors occur when beams are too tight (trade-off) or the Viterbi assumption is violated
• State-of-the-art: one-pass decoder
– Tree structure for efficiency
– Linguistically morphed instances of nodes and leaves
• Other approaches exist (stack decoding, a-posteriori decoding, …)