Decoding Techniques for Automatic Speech Recognition
Florian Metze
Interactive Systems Laboratories
ESSLLI 2002, Trento, Aug 14, 2002
Outline
• Decoding in ASR
• Search Problem
• Evaluation Problem
• Viterbi Algorithm
• Tree Search
• Re-Entry
• Recombination
The ASR problem: argmax_W p(W|x)
• Two major knowledge sources
– Acoustic Model: p(x|W)
– Language Model: P(W)
• Bayes: p(W|x) p(x) = p(x|W) P(W)
• Search problem: argmax_W p(x|W) P(W)
• p(x|W) consists of Hidden Markov Models:
– Dictionary defines state sequence: "hello" = /hh eh l ow/
– Full model: concatenation of states (i.e. sounds)
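The Bayes decision rule above can be illustrated with a toy sketch. The candidate strings and scores below are invented for illustration; a real decoder searches over HMM state sequences, not a short candidate list.

```python
# Toy sketch of argmax_W p(x|W) P(W): combine acoustic and language
# model scores in log space and pick the best word sequence.
acoustic_logprob = {"hello world": -12.0, "yellow word": -11.5}  # log p(x|W)
lm_logprob = {"hello world": -2.0, "yellow word": -6.0}          # log P(W)

def decode(candidates):
    # p(x) is constant over W, so it can be ignored in the argmax
    return max(candidates, key=lambda W: acoustic_logprob[W] + lm_logprob[W])

best = decode(acoustic_logprob)
# the LM prefers "hello world" strongly enough to outweigh the
# slightly better acoustic score of "yellow word"
```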
Target Function/ Measure
• %WER = minimum edit distance between reference and hypothesis, divided by the number of reference words
• Example:
  REF: the quick brown fox jumps **  over
  HYP: *** quick brown fox jump  is  over
        D                  S     I        → 3 errors
  WER = 3/7 = 43%
• Different measure from max p(W|x)!
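The %WER computation is a standard Levenshtein distance over words; a minimal sketch:

```python
def wer(ref, hyp):
    """Word error rate: minimum edit distance (substitutions, insertions,
    deletions, all cost 1) between reference and hypothesis word lists,
    divided by the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub/match
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)] / len(r)
```

On the slide's example the minimum edit distance is 3 (one deletion, one substitution, one insertion).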
A simpler problem: Evaluation
• So far we have:
– Dictionary: "hello" = /hh eh l ow/ …
– Acoustic Model: p_hh(x), p_eh(x), p_l(x), p_ow(x) …
– Language Model: P("hello world")
• State sequence: /hh eh l ow w er l d/
• Given W and x: Alignment needed!
[Figure: frames of x aligned to the states /hh eh l ow/]
The Viterbi Algorithm
• Beam search from left to right
• Resulting alignment is the best match between the state models p_s(x) and x
[Figure: trellis of local scores p_s(x) per frame for the states /hh eh l ow/ over time]
The Viterbi Algorithm (cont'd)
• Evaluation problem: ~ Dynamic Time Warping
• Best alignment for given W, x, and p_s(x) by locally adding scores (= -log p) for states and transitions
[Figure: the same trellis with accumulated Viterbi scores; the best alignment is found by backtracking from the lowest final score]
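The score accumulation can be sketched as a small dynamic program. A left-to-right topology with self-loops and single-step transitions is assumed, and transition scores are omitted for brevity:

```python
def viterbi_align(local_cost):
    """Time-synchronous Viterbi alignment for a left-to-right HMM.
    local_cost[s][t] = -log p_s(x_t); allowed moves: stay in state s
    or advance from state s-1. Returns the best total score for an
    alignment of all frames that ends in the last state."""
    S, T = len(local_cost), len(local_cost[0])
    INF = float("inf")
    D = [[INF] * T for _ in range(S)]
    D[0][0] = local_cost[0][0]
    for t in range(1, T):
        for s in range(S):
            prev = min(D[s][t - 1], D[s - 1][t - 1] if s > 0 else INF)
            if prev < INF:
                D[s][t] = local_cost[s][t] + prev
    return D[S - 1][T - 1]
```

Each cell adds its local score to the best predecessor, exactly the "locally adding scores" step described above.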
Pronunciation Prefix Trees (PPT)
• Tree Representation of the Search Dictionary
• Very compact → fast!
• Viterbi Algorithm also works for trees
BROADWAY: B R OA D W EY
BROADLY:  B R OA D L IE
BUT:      B AH T
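A PPT can be sketched as a nested dictionary; the `"<word>"` leaf marker is an implementation choice, not from the slides:

```python
def build_ppt(lexicon):
    """Build a pronunciation prefix tree: words sharing a phone prefix
    share the corresponding nodes; leaves carry the word identity."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.setdefault(ph, {})
        node["<word>"] = word  # mark the leaf
    return root

lexicon = {
    "BROADWAY": ["B", "R", "OA", "D", "W", "EY"],
    "BROADLY":  ["B", "R", "OA", "D", "L", "IE"],
    "BUT":      ["B", "AH", "T"],
}
ppt = build_ppt(lexicon)
# All three words share the root arc "B"; BROADWAY and BROADLY
# additionally share the nodes "R OA D".
```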
Viterbi Search for PPTs
• A PPT is traversed in a time-synchronous way
• Apply the Viterbi Algorithm on
– state level (sub-phonemic units: -b -m -e), constrained by the HMM topology
– phone level, constrained by the PPT
• What do we do when we reach the end of a word?
Re-Entrant PPTs for Continuous Speech
• Isolated word recognition:
– Search terminates in the leaves of the PPT
• Decoding of word sequences:
– Re-enter the PPT and store the Viterbi path using a backpointer table
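One way to sketch such a backpointer table (the layout and function names are assumptions; real decoders store richer entries and resolve ties during the search rather than afterwards):

```python
# Each entry records a word-end event: (end_frame, word, start_frame, score).
# Scores are -log probabilities, so lower is better.
backpointers = []

def record_word_end(word, start_frame, end_frame, score):
    backpointers.append((end_frame, word, start_frame, score))

def trace_back(final_frame):
    """Recover the word sequence by following start frames backwards.
    Simplified: picks the locally best entry ending at each boundary."""
    words = []
    t = final_frame
    while t > 0:
        ending = [bp for bp in backpointers if bp[0] == t]
        _, word, start, _ = min(ending, key=lambda bp: bp[3])
        words.append(word)
        t = start
    return list(reversed(words))

record_word_end("hello", 0, 40, 10.0)
record_word_end("yellow", 0, 40, 12.0)
record_word_end("world", 40, 80, 5.0)
hypothesis = trace_back(80)
```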
Problem: Branching Factor
• Imagine a sequence of 3 words with a 10k vocabulary
– 10k ^ 3 = 1000G paths (potentially)
– Not everything will be expanded, of course
• Viterbi approximation → path recombination:
– Given P(Candy | "hi I am") = P(Candy | "hello I am"), only the better path needs to survive
[Figure: the paths "hi I am" and "hello I am" recombine before the word "Candy"]
Path Recombination
At time t:  Path1 = w1 … wN with score s1
            Path2 = v1 … vM with score s2
where:      s1 = p(x1 … xt | w1 … wN) * Π_i P(wi | wi-1 wi-2)
            s2 = p(x1 … xt | v1 … vM) * Π_i P(vi | vi-1 vi-2)
In the end, we‘re only interested in the best path!
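Under the Viterbi approximation, two paths whose future LM probabilities are identical can be recombined; for a trigram LM that means paths sharing the same two-word history. A hedged sketch (the tuple layout is an assumption):

```python
def recombine(paths):
    """paths: list of (word_sequence, score); scores are -log p, so
    lower is better. Keep only the best path per two-word history,
    since a trigram LM cannot distinguish them in the future."""
    best = {}
    for words, score in paths:
        key = tuple(words[-2:])  # trigram LM: last two words
        if key not in best or score < best[key][1]:
            best[key] = (words, score)
    return list(best.values())

paths = [(("hi", "I", "am"), 42.0), (("hello", "I", "am"), 40.5)]
survivors = recombine(paths)
# both paths end in ("I", "am"), so only the better one re-enters the tree
```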
Path Recombination (cont'd)
• To expand the search space into a new root:
– Pick the path with the best score so far (Viterbi approximation)
– Initialize scores and backpointers for the root node according to the best predecessor word
– Store the left-context model information with the last phone from the predecessor (context-dependent acoustic models: /s ih t/ vs. /l ih p/)
Problem with Re-Entry
• For a correct use of the Viterbi algorithm, the choice of the best path must include the score for the transition from the predecessor word to the successor word
• The word identity is not known at the root level; the choice of the best predecessor therefore cannot be made at this point
Consequences
1. Wrong predecessor words: language model information is only available at the leaf level
2. Wrong word boundaries: the starting point for the successor word is determined without any language model information
3. Incomplete linguistic information: open pruning thresholds are needed for beam search
Three-Pass Search Strategy
1. Search on a tree-organized lexicon (PPT)
• Aggressive path recombination at word ends
• Use linguistic information only approximately
• Generate a list of starting words for each frame
2. Search on a flat-organized lexicon
• Fix the word segmentation from the first pass
• Full use of the language model (often needs a third pass)
Three-Pass Decoder: Results
• Q4g system with cache for acoustic scores:
– 4000 acoustic models trained on BN+ESST
– 40k vocabulary
– Test on "readBN" data

Search Pass        Error Rate   Real-time factor
Tree Pass          22.0%        9.6
Flat Pass          18.8%        0.9
Lattice Rescoring  15.0%        0.2
One-Pass Decoder: Motivation
• The efficient use of all available knowledge sources as early as possible should result in faster decoding
• Use the same engine to decode with:
– Statistical n-gram language models with arbitrary n
– Context-free grammars (CFG)
– Word graphs
Linguistic States
• Linguistic state, examples:
– (n-1)-word history for a statistical n-gram LM
– Grammar state for CFGs
– (lattice node, word history) for word graphs
• To fully use the linguistic knowledge source, the linguistic state has to be kept during decoding
• Path recombination has to be delayed until the word identity is known
Linguistic context assignment
• Key idea: establish a linguistic polymorphism for each node of the PPT
• Maintain a list of linguistically morphed instances in each node
• Each instance stores its own backpointer and scores for each state of the underlying HMM with respect to the linguistic state of that instance
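The per-node bookkeeping described above might be sketched as follows; the class and field names are assumptions, not from the slides:

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One linguistically morphed instance of a PPT node."""
    linguistic_state: tuple  # e.g. two-word history for a trigram LM
    hmm_scores: list         # one score per HMM state (-b -m -e)
    backpointer: int = -1    # index into the backpointer table

@dataclass
class PPTNode:
    phone: str
    instances: dict = field(default_factory=dict)  # linguistic_state -> Instance
    children: dict = field(default_factory=dict)   # phone -> PPTNode

    def get_instance(self, lct, n_states=3):
        """Allocate an instance for linguistic context lct on demand."""
        if lct not in self.instances:
            self.instances[lct] = Instance(lct, [float("inf")] * n_states)
        return self.instances[lct]

node = PPTNode("B")
inst = node.get_instance(("bullets", "over"))
```

Instances are created lazily, so only linguistic contexts that actually survive pruning consume memory.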
PPT with Linguistically Morphed Instances
[Figure: PPT for BUT (B AH T), BROADWAY (B R OA D W EY), and BROADLY (B R OA D L IE) with instance lists attached to the nodes]
Typically: 3-gram LM, i.e. P(W) = Π_i P(wi | wi-1 wi-2),
e.g. P(broadway | "bullets over")
Language Model Lookahead
• Since the linguistic state is known, the complete LM information P(W) can be applied to the instances, given the possible successor words for that node of the PPT
• Let
  lct = linguistic context/state of instance i from node n
  path(w) = path of word w in the PPT
  λ(n, lct) = max { P(w | lct) : node n ∈ path(w) }
    (a minimum in the -log score domain)
  score(i) = p(x1 … xt | w1 … wN) * P(wN-1 | …) * λ(n, lct)
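The per-node lookahead score (written λ(n, lct) here) can be sketched as a recursive minimum over the -log LM scores of all words below a node. The nested-dict PPT layout with `"<word>"` leaf markers is an assumption:

```python
def lookahead(node, lm_score):
    """node: nested-dict PPT with "<word>" markers at leaves.
    lm_score: word -> -log P(word | linguistic context lct).
    Returns the best (minimum) LM score over all words whose
    path in the PPT passes through this node."""
    best = float("inf")
    for arc, child in node.items():
        if arc == "<word>":
            best = min(best, lm_score[child])
        else:
            best = min(best, lookahead(child, lm_score))
    return best

ppt = {"B": {"R": {"OA": {"D": {"W": {"EY": {"<word>": "BROADWAY"}},
                                "L": {"IE": {"<word>": "BROADLY"}}}}},
             "AH": {"T": {"<word>": "BUT"}}}}
lm = {"BROADWAY": 2.3, "BROADLY": 4.1, "BUT": 1.2}
```

In a real decoder these values are precomputed per linguistic context and cached, rather than recomputed per frame.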
LM Lookahead (cont'd)
• When the word becomes unique, the exact LM score is already incorporated and no explicit word transition needs to be computed
• The LM scores are updated on demand, based on a compressed PPT ("smearing" of LM scores)
• Tighter pruning thresholds can be used since the language model information is no longer delayed
Early Path Recombination
• Path recombination can be performed as soon as the word becomes unique, which is usually a few nodes before reaching the leaf. This reduces the number of distinct linguistic contexts and instances
• This is particularly effective for cross-word models due to the fan-out in the right-context models
One-Pass Decoder: Summary
• One-pass decoder based on
– One copy of the tree with dynamically allocated instances
– Early path recombination
– Full language model lookahead
• Linguistic knowledge sources
– Statistical n-grams with n > 3 possible
– Context-free grammars
Results

Task      Real-time factor     Error rate
          3-pass   1-pass      3-pass   1-pass
VM        6.8      4.0         26.9%    26.9%
readBN    12.2     4.2         14.7%    13.9%
Meeting   55       38          43.7%    43.4%
Remarks on Speed-Up
• Speed-up ranges from a factor of almost 3 for the readBN task to 1.4 for the meeting data
• Speed-up depends strongly on matched domain conditions
• The decoder profits from sharp language models
• LM lookahead is less effective for weak language models due to unmatched conditions
Memory Usage: Q4g

Module               3-pass    1-pass
Acoustic Models      44 MB     44 MB
Language Model       87 MB     82 MB
Overhead             16 MB     16 MB
Decoder (permanent)  120 MB    18 MB
Decoder (dynamic)    ~100 MB   ~20 MB
Total                367 MB    180 MB
Summary
• Decoding is time- and memory-consuming
• Search errors occur when beams are too tight (trade-off) or the Viterbi assumption is violated
• State-of-the-art: one-pass decoder
– Tree structure for efficiency
– Linguistically morphed instances of nodes and leaves
• Other approaches exist (stack decoding, a-posteriori decoding, …)