Edinburgh MT Lecture 6: Decoding
TRANSCRIPT

Decoding
IBM Model 4

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。
(Gloss: Although north wind howls , but sky still very clear .)

The generative story runs in three steps.

Step 1: fertility. Each Chinese word chooses how many English words it will produce, e.g. p_f(1 | 虽然). Here 但 and 十分 receive fertility 0, while 天空 receives fertility 2:

虽然 北 风 呼啸 , 天空 天空 依然 清澈 。

Step 2: translation. Each copy is translated independently, e.g. p_t(However | 虽然):

However north wind strong , the sky remained clear . under the

(The words "under the" have no Chinese counterpart; they are produced by the special NULL word.)

Step 3: permutation (distortion). Each English word chooses its final position, e.g. p_d(0 | However), p_d(8 | north):

However , the sky remained clear under the strong north wind .

Alignments are many-to-one, the same as in IBM Models 1 & 2.

For a fixed alignment, the model is a product of the three factors:

p(English, alignment | Chinese) = p_f · p_t · p_d

Summing out the alignment:

p(English | Chinese) = Σ_alignments p_f · p_t · p_d
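Scoring one (English, alignment) pair is then just this product. A minimal Python sketch, where every table entry and probability value is invented for illustration (real tables come from EM training):

```python
import math

# Hypothetical component tables (all numbers invented for illustration).
p_f = {("虽然", 1): 0.6}           # fertility:    p_f(count | Chinese word)
p_t = {("However", "虽然"): 0.4}   # translation:  p_t(English | Chinese)
p_d = {("However", 0): 0.3}        # distortion:   p_d(position | English)

def score(steps):
    """p(English, alignment | Chinese) for one hypothesis, as a product of
    fertility, translation, and distortion probabilities (in log space)."""
    logp = 0.0
    for zh, en, fert, pos in steps:
        logp += math.log(p_f[(zh, fert)])   # how many words zh generates
        logp += math.log(p_t[(en, zh)])     # which word each copy becomes
        logp += math.log(p_d[(en, pos)])    # where that word ends up
    return logp

p = math.exp(score([("虽然", "However", 1, 0)]))   # 0.6 * 0.4 * 0.3
```

Working in log space, as here, is the usual way to avoid underflow when the products run over whole sentences.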
IBM Model 4

The full model factors into fertility, translation, and permutation (distortion) terms:

$$
\Pr(\tau,\pi \mid e) = \prod_{i=1}^{l} \Pr(\phi_i \mid \phi_1^{i-1}, e)\,\Pr(\phi_0 \mid \phi_1^{l}, e)
\times \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} \Pr(\tau_{ik} \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^{l}, e)
\times \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} \Pr(\pi_{ik} \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^{l}, \phi_0^{l}, e)
\times \prod_{k=1}^{\phi_0} \Pr(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{l}, \tau_0^{l}, \phi_0^{l}, e)
$$

The first factor is fertility, the second is translation, and the last two are permutation (distortion), including the NULL word.

Q: how hard is it to compute expectations?
A: hard, but they can be approximated using Markov chain Monte Carlo.
Decoding

We want to solve this problem:

e* = arg max_e p(e | f)

Naively, we must evaluate this expression for all English sentences.

Q: how many English sentences are there?
(Easier?) Q: how many English sentences receive non-zero probability, given f?
北 风 呼啸 。

Suppose every word has fertility 1 (with probability 1), and we have 5 translations per word. Then we have 5^n n! translations (15,000 for this example).

Given a sentence pair and an alignment, we can easily calculate p(English, alignment | Chinese). Can we decode without enumerating all translations?

Key Idea

There are 5^n n! target sentences, but there are only O(5n) ways to start them.
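The counts on this slide are easy to check; a quick sketch (n = 4 for 北 风 呼啸 。):

```python
from math import factorial

n = 4                             # source tokens in 北 风 呼啸 。
k = 5                             # translation options per word
total = k ** n * factorial(n)     # 5^n word choices times n! orderings
starts = k * n                    # but only 5n ways to begin a translation
print(total, starts)              # 15000 20
```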
Key Idea

北 风 呼啸 。

Build a lattice over partial translations, where each state records a coverage vector: which source words have been translated so far. The first step creates one state per choice of first English word:

p(north | START) · p(北 | north)
p(northern | START) · p(北 | northern)
p(strong | START) · p(呼啸 | strong)

Each state can then be extended, e.g. from north:

p(wind | north) · p(风 | wind)
p(strong | north) · p(呼啸 | strong)

This shares work among sentence beginnings. What about sentence endings?
Key Idea

Dynamic Programming

States that share a coverage vector and a final word can be merged, so work is shared among sentence endings too. Amount of work: O(5n · 2^n) — bad, but much better than 5^n n!.

Each edge is labelled with a weight and a word (or words), e.g. north, 0.014. The resulting lattices are weighted finite-state automata.
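The dynamic program can be sketched directly: a state is a (coverage vector, last English word) pair, and the chart keeps only the best path into each state. Everything below (the translation table, the constant stand-in for a bigram LM) is invented for illustration:

```python
import math

# Hypothetical translation options p_t(English | Chinese); numbers invented.
TRANS = {
    "北":  {"north": 0.7, "northern": 0.3},
    "风":  {"wind": 0.9, "winds": 0.1},
    "呼啸": {"strong": 0.5, "howls": 0.5},
    "。":  {".": 1.0},
}

def bigram(prev, word):
    return 0.1   # stand-in for a real bigram LM p(word | prev)

def viterbi_decode(src):
    n, full = len(src), (1 << len(src)) - 1
    # chart: (coverage bitmask, last English word) -> (log-prob, output so far)
    chart = {(0, "<s>"): (0.0, ())}
    for _ in range(n):                      # each pass covers one more word
        nxt = {}
        for (cov, prev), (lp, out) in chart.items():
            for i, zh in enumerate(src):
                if cov >> i & 1:
                    continue                # source word i already translated
                for en, pt in TRANS[zh].items():
                    state = (cov | 1 << i, en)
                    score = lp + math.log(pt) + math.log(bigram(prev, en))
                    if state not in nxt or score > nxt[state][0]:
                        nxt[state] = (score, out + (en,))   # recombination
        chart = nxt
    return max(v for (cov, _), v in chart.items() if cov == full)

best_lp, best_out = viterbi_decode(["北", "风", "呼啸", "。"])
```

With recombination the chart never holds more than one entry per (coverage, last word) pair, which is exactly where the O(5n · 2^n) bound comes from.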
Weighted languages

•The lattice describing the set of all possible translations is a weighted finite-state automaton.
•So is the language model.
•Since regular languages are closed under intersection, we can intersect the two automata and run shortest-path graph algorithms on the result.
•Intersection multiplies the weights, which is exactly the product needed to compute the probability under Bayes' rule.
Wait a second!

We want to solve this problem:

e* = arg max_e p(e | f) = arg max_e Σ_a p(e, a | f)

But now we're solving this problem:

e* = arg max_e max_a p(e, a | f)

This is often called the Viterbi approximation.

We can sum over alignments by weighted determinization. How expensive is that?

Nondeterministic: O(5n · 2^n). Deterministic: O(2^(5n · 2^n)).
I made the simplest machine translation model I could think of, and it blew up in my face.

Ok, let's stick with the Viterbi approximation. But… O(5n · 2^n) is still far too much work.

Can we do better?
Can we do better?

北 风 呼啸 。

Each arc is weighted by a translation probability plus a bigram probability (log-probabilities, so weights add along a path). Objective: find the shortest path that visits each word exactly once.

Probably not: this is the traveling salesman problem. (Replace the words with cities — London, Paris, NY, Tokyo — and it is exactly the classic formulation.) Even the Viterbi approximation is NP-hard!
Approximation: Pruning

Idea: prune states by the cost of the shortest path to them (the accumulated path length).

Ideal result: only high-probability paths are enumerated.

Actual result: longer paths have lower probability, so pruning by accumulated cost unfairly punishes hypotheses that already cover more words!

Solution: group states by the number of covered source words, and prune only within each group.

This is "stack" decoding: a linear-time approximation.
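Stack decoding as described here can be sketched as follows: stack c holds hypotheses covering exactly c source words, and each stack is pruned to a fixed beam before being extended. The translation table and the constant bigram score below are invented for illustration:

```python
import math

# Hypothetical translation options; all probabilities invented for illustration.
TRANS = {
    "北":  {"north": 0.7, "northern": 0.3},
    "风":  {"wind": 0.9},
    "呼啸": {"strong": 0.5, "howls": 0.5},
    "。":  {".": 1.0},
}
BEAM = 5             # keep at most this many hypotheses per stack
LM = math.log(0.1)   # stand-in for a real bigram LM score

def stack_decode(src):
    n = len(src)
    # stacks[c] = hypotheses covering c source words: (log-prob, coverage, output)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, 0, ()))
    for c in range(n):
        for lp, cov, out in stacks[c]:
            for i, zh in enumerate(src):
                if cov >> i & 1:
                    continue             # source word i already translated
                for en, pt in TRANS[zh].items():
                    stacks[c + 1].append((lp + math.log(pt) + LM,
                                          cov | 1 << i, out + (en,)))
        # Histogram pruning: hypotheses in one stack cover the same number of
        # words, so comparing their scores is fair.
        stacks[c + 1] = sorted(stacks[c + 1], reverse=True)[:BEAM]
    return max(stacks[n])

lp, cov, out = stack_decode(["北", "风", "呼啸", "。"])
```

For a fixed beam width, the number of hypotheses expanded grows only with sentence length instead of exponentially, at the cost of possibly pruning the true best path.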
Approximation: Distortion Limits

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。
(Partial hypothesis: the sky)

Without limits, the number of lattice vertices is O(2^n). Now impose a distortion window of width d = 4: every source word outside the window to the left must be covered, and every source word outside it to the right must be uncovered. The number of vertices drops to O(n · 2^d).
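The vertex count can be checked by brute force. Under a window of width d, every position left of the window is covered and every position right of it is uncovered, so only the d bits inside the window vary. A sketch (n and d chosen arbitrarily):

```python
n, d = 10, 4   # sentence length and distortion window width

def reachable(mask):
    # mask is reachable iff some window start j leaves bits < j all covered
    # and bits >= j + d all uncovered
    for j in range(n - d + 1):
        if (all(mask >> i & 1 for i in range(j))
                and not any(mask >> i & 1 for i in range(j + d, n))):
            return True
    return False

count = sum(reachable(m) for m in range(1 << n))
assert count <= n * 2 ** d          # the O(n * 2^d) bound from the slide
print(count, "reachable of", 1 << n, "coverage vectors")
```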
Summary

•We need every possible trick to make decoding fast.
•Viterbi approximation: from worse to bad.
•Dynamic programming: an exponential speedup over enumeration, but decoding is still NP-hard for the IBM models.
•NP-hardness means efficient exact solutions are unlikely.
•Heuristic approximations: stack decoding, distortion limits.
•Tradeoff: we might not find the true argmax.