Edinburgh MT Lecture 6: Decoding
TRANSCRIPT

Decoding
IBM Model 4

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。
(Gloss: Although north wind howls , but sky still very clear .)

The generative story runs in three steps.

Step 1: fertility. Each Chinese word chooses how many English words it will produce, e.g. p_f(1 | 虽然). Here 但 and 十分 receive fertility 0, while 天空 receives fertility 2:

虽然 北 风 呼啸 , 天空 天空 依然 清澈 。

Step 2: translation. Each copy is translated independently, e.g. p_t(However | 虽然):

However north wind strong , the sky remained clear . under the

(The words "under the" have no Chinese counterpart; they are produced by the special NULL word.)

Step 3: permutation (distortion). Each English word chooses its final position, e.g. p_d(0 | However), p_d(8 | north):

However , the sky remained clear under the strong north wind .

Alignments are many-to-one, the same as in IBM Models 1 & 2.

For a fixed alignment, the model is a product of the three factors:

p(English, alignment | Chinese) = p_f · p_t · p_d

Summing out the alignment:

p(English | Chinese) = Σ_alignments p_f · p_t · p_d
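Scoring one (English, alignment) pair is then just this product. A minimal Python sketch, where every table entry and probability value is invented for illustration (real tables come from EM training):

```python
import math

# Hypothetical component tables (all numbers invented for illustration).
p_f = {("虽然", 1): 0.6}           # fertility:    p_f(count | Chinese word)
p_t = {("However", "虽然"): 0.4}   # translation:  p_t(English | Chinese)
p_d = {("However", 0): 0.3}        # distortion:   p_d(position | English)

def score(steps):
    """p(English, alignment | Chinese) for one hypothesis, as a product of
    fertility, translation, and distortion probabilities (in log space)."""
    logp = 0.0
    for zh, en, fert, pos in steps:
        logp += math.log(p_f[(zh, fert)])   # how many words zh generates
        logp += math.log(p_t[(en, zh)])     # which word each copy becomes
        logp += math.log(p_d[(en, pos)])    # where that word ends up
    return logp

p = math.exp(score([("虽然", "However", 1, 0)]))   # 0.6 * 0.4 * 0.3
```

Working in log space, as here, is the usual way to avoid underflow when the products run over whole sentences.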
IBM Model 4

The full model factors into fertility, translation, and permutation (distortion) terms:

$$
\Pr(\tau,\pi \mid e) = \prod_{i=1}^{l} \Pr(\phi_i \mid \phi_1^{i-1}, e)\,\Pr(\phi_0 \mid \phi_1^{l}, e)
\times \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} \Pr(\tau_{ik} \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^{l}, e)
\times \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} \Pr(\pi_{ik} \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^{l}, \phi_0^{l}, e)
\times \prod_{k=1}^{\phi_0} \Pr(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{l}, \tau_0^{l}, \phi_0^{l}, e)
$$

The first factor is fertility, the second is translation, and the last two are permutation (distortion), including the NULL word.

Q: how hard is it to compute expectations?
A: hard, but they can be approximated using Markov chain Monte Carlo.
Decoding

We want to solve this problem:

e* = arg max_e p(e | f)

Naively, we must evaluate this expression for all English sentences.

Q: how many English sentences are there?
(Easier?) Q: how many English sentences receive non-zero probability, given f?
北 风 呼啸 。

Suppose every word has fertility 1 (with probability 1), and we have 5 translations per word. Then we have 5^n n! translations (15,000 for this example).

Given a sentence pair and an alignment, we can easily calculate p(English, alignment | Chinese). Can we decode without enumerating all translations?

Key Idea

There are 5^n n! target sentences, but there are only O(5n) ways to start them.
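The counts on this slide are easy to check; a quick sketch (n = 4 for 北 风 呼啸 。):

```python
from math import factorial

n = 4                             # source tokens in 北 风 呼啸 。
k = 5                             # translation options per word
total = k ** n * factorial(n)     # 5^n word choices times n! orderings
starts = k * n                    # but only 5n ways to begin a translation
print(total, starts)              # 15000 20
```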
Key Idea

北 风 呼啸 。

Build a lattice over partial translations, where each state records a coverage vector: which source words have been translated so far. The first step creates one state per choice of first English word:

p(north | START) · p(北 | north)
p(northern | START) · p(北 | northern)
p(strong | START) · p(呼啸 | strong)

Each state can then be extended, e.g. from north:

p(wind | north) · p(风 | wind)
p(strong | north) · p(呼啸 | strong)

This shares work among sentence beginnings. What about sentence endings?
Key Idea

Dynamic Programming

States that share a coverage vector and a final word can be merged, so work is shared among sentence endings too. Amount of work: O(5n · 2^n) — bad, but much better than 5^n n!.

Each edge is labelled with a weight and a word (or words), e.g. north, 0.014. The resulting lattices are weighted finite-state automata.
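The dynamic program can be sketched directly: a state is a (coverage vector, last English word) pair, and the chart keeps only the best path into each state. Everything below (the translation table, the constant stand-in for a bigram LM) is invented for illustration:

```python
import math

# Hypothetical translation options p_t(English | Chinese); numbers invented.
TRANS = {
    "北":  {"north": 0.7, "northern": 0.3},
    "风":  {"wind": 0.9, "winds": 0.1},
    "呼啸": {"strong": 0.5, "howls": 0.5},
    "。":  {".": 1.0},
}

def bigram(prev, word):
    return 0.1   # stand-in for a real bigram LM p(word | prev)

def viterbi_decode(src):
    n, full = len(src), (1 << len(src)) - 1
    # chart: (coverage bitmask, last English word) -> (log-prob, output so far)
    chart = {(0, "<s>"): (0.0, ())}
    for _ in range(n):                      # each pass covers one more word
        nxt = {}
        for (cov, prev), (lp, out) in chart.items():
            for i, zh in enumerate(src):
                if cov >> i & 1:
                    continue                # source word i already translated
                for en, pt in TRANS[zh].items():
                    state = (cov | 1 << i, en)
                    score = lp + math.log(pt) + math.log(bigram(prev, en))
                    if state not in nxt or score > nxt[state][0]:
                        nxt[state] = (score, out + (en,))   # recombination
        chart = nxt
    return max(v for (cov, _), v in chart.items() if cov == full)

best_lp, best_out = viterbi_decode(["北", "风", "呼啸", "。"])
```

With recombination the chart never holds more than one entry per (coverage, last word) pair, which is exactly where the O(5n · 2^n) bound comes from.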
Weighted languages

•The lattice describing the set of all possible translations is a weighted finite-state automaton.
•So is the language model.
•Since regular languages are closed under intersection, we can intersect the two automata and run shortest-path graph algorithms on the result.
•Intersection multiplies the weights, which is exactly the product needed to compute the probability under Bayes' rule.
Wait a second!

We want to solve this problem:

e* = arg max_e p(e | f) = arg max_e Σ_a p(e, a | f)

But now we're solving this problem:

e* = arg max_e max_a p(e, a | f)

This is often called the Viterbi approximation.

We can sum over alignments by weighted determinization. How expensive is that?

Nondeterministic: O(5n · 2^n). Deterministic: O(2^(5n · 2^n)).
I made the simplest machine translation model I could think of, and it blew up in my face.

Ok, let's stick with the Viterbi approximation. But… O(5n · 2^n) is still far too much work.

Can we do better?
Can we do better?

北 风 呼啸 。

Each arc is weighted by a translation probability plus a bigram probability (log-probabilities, so weights add along a path). Objective: find the shortest path that visits each word exactly once.

Probably not: this is the traveling salesman problem. (Replace the words with cities — London, Paris, NY, Tokyo — and it is exactly the classic formulation.) Even the Viterbi approximation is NP-hard!
Approximation: Pruning

Idea: prune states by the cost of the shortest path to them (the accumulated path length).

Ideal result: only high-probability paths are enumerated.

Actual result: longer paths have lower probability, so pruning by accumulated cost unfairly punishes hypotheses that already cover more words!

Solution: group states by the number of covered source words, and prune only within each group.

This is "stack" decoding: a linear-time approximation.
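Stack decoding as described here can be sketched as follows: stack c holds hypotheses covering exactly c source words, and each stack is pruned to a fixed beam before being extended. The translation table and the constant bigram score below are invented for illustration:

```python
import math

# Hypothetical translation options; all probabilities invented for illustration.
TRANS = {
    "北":  {"north": 0.7, "northern": 0.3},
    "风":  {"wind": 0.9},
    "呼啸": {"strong": 0.5, "howls": 0.5},
    "。":  {".": 1.0},
}
BEAM = 5             # keep at most this many hypotheses per stack
LM = math.log(0.1)   # stand-in for a real bigram LM score

def stack_decode(src):
    n = len(src)
    # stacks[c] = hypotheses covering c source words: (log-prob, coverage, output)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, 0, ()))
    for c in range(n):
        for lp, cov, out in stacks[c]:
            for i, zh in enumerate(src):
                if cov >> i & 1:
                    continue             # source word i already translated
                for en, pt in TRANS[zh].items():
                    stacks[c + 1].append((lp + math.log(pt) + LM,
                                          cov | 1 << i, out + (en,)))
        # Histogram pruning: hypotheses in one stack cover the same number of
        # words, so comparing their scores is fair.
        stacks[c + 1] = sorted(stacks[c + 1], reverse=True)[:BEAM]
    return max(stacks[n])

lp, cov, out = stack_decode(["北", "风", "呼啸", "。"])
```

For a fixed beam width, the number of hypotheses expanded grows only with sentence length instead of exponentially, at the cost of possibly pruning the true best path.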
Approximation: Distortion Limits

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。
(Partial hypothesis: the sky)

Without limits, the number of lattice vertices is O(2^n). Now impose a distortion window of width d = 4: every source word outside the window to the left must be covered, and every source word outside it to the right must be uncovered. The number of vertices drops to O(n · 2^d).
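The vertex count can be checked by brute force. Under a window of width d, every position left of the window is covered and every position right of it is uncovered, so only the d bits inside the window vary. A sketch (n and d chosen arbitrarily):

```python
n, d = 10, 4   # sentence length and distortion window width

def reachable(mask):
    # mask is reachable iff some window start j leaves bits < j all covered
    # and bits >= j + d all uncovered
    for j in range(n - d + 1):
        if (all(mask >> i & 1 for i in range(j))
                and not any(mask >> i & 1 for i in range(j + d, n))):
            return True
    return False

count = sum(reachable(m) for m in range(1 << n))
assert count <= n * 2 ** d          # the O(n * 2^d) bound from the slide
print(count, "reachable of", 1 << n, "coverage vectors")
```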
Summary

•We need every possible trick to make decoding fast.
•Viterbi approximation: from worse to bad.
•Dynamic programming: an exponential speedup over enumeration, but decoding is still NP-hard for the IBM models.
•NP-hardness means efficient exact solutions are unlikely.
•Heuristic approximations: stack decoding, distortion limits.
•Tradeoff: we might not find the true argmax.