edinburgh mt lecture 6: decoding

Decoding

Upload: alopezfoo

Post on 18-Jul-2015

TRANSCRIPT

Page 1: Edinburgh MT lecture 6: decoding

Decoding

Page 2: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

IBM Model 4

Page 3: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

IBM Model 4

Page 4: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

pf(1 | 虽然)

IBM Model 4

Page 5: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然 北 风 呼啸 , 天空 天空 依然 清澈 。

IBM Model 4

Page 9: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

IBM Model 4

Page 10: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

IBM Model 4

Page 11: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

However

IBM Model 4

Page 12: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

However

pt(However | 虽然)

IBM Model 4

Page 13: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

However

north wind strong , the sky remained clear . under the

IBM Model 4

Page 16: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

However

north wind strong , the sky remained clear . under the

pd(0|However)

IBM Model 4

Page 18: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

However

north wind strong , the sky remained clear . under the

pd(8|north)

IBM Model 4

Page 20: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

However

north wind strong , the sky remained clear . under the

IBM Model 4

Page 21: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

However

north wind strong , the sky remained clear . under the

IBM Model 4

Many-to-one, same as IBM Models 1 & 2

Page 22: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。Although north wind howls , but sky still very clear .

However , the sky remained clear under the strong north wind .

虽然

北 风 呼啸 , 天空 天空 依然 清澈 。

However

north wind strong , the sky remained clear . under the

p(English, alignment | Chinese) = ∏ pf · ∏ pt · ∏ pd

IBM Model 4

Page 23: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

p(English, alignment | Chinese) = ∏ pf · ∏ pt · ∏ pd

IBM Model 4

Page 24: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

However , the sky remained clear under the strong north wind .

p(English | Chinese) = Σ_alignments (∏ pf · ∏ pt · ∏ pd)

IBM Model 4

Page 25: Edinburgh MT lecture 6: decoding

Pr(τ, π | e) =
∏_{i=1}^{l} Pr(φ_i | φ_1^{i-1}, e) · Pr(φ_0 | φ_1^l, e)    [Fertility]

× ∏_{i=0}^{l} ∏_{k=1}^{φ_i} Pr(τ_{ik} | τ_{i1}^{k-1}, τ_0^{i-1}, φ_0^l, e)    [Translation]

× ∏_{i=1}^{l} ∏_{k=1}^{φ_i} Pr(π_{ik} | π_{i1}^{k-1}, π_1^{i-1}, τ_0^l, φ_0^l, e)    [Permutation (distortion)]

× ∏_{k=1}^{φ_0} Pr(π_{0k} | π_{01}^{k-1}, π_1^l, τ_0^l, φ_0^l, e)

IBM Model 4

Q: how hard is it to compute expectations?

Page 26: Edinburgh MT lecture 6: decoding

Pr(τ, π | e) =
∏_{i=1}^{l} Pr(φ_i | φ_1^{i-1}, e) · Pr(φ_0 | φ_1^l, e)    [Fertility]

× ∏_{i=0}^{l} ∏_{k=1}^{φ_i} Pr(τ_{ik} | τ_{i1}^{k-1}, τ_0^{i-1}, φ_0^l, e)    [Translation]

× ∏_{i=1}^{l} ∏_{k=1}^{φ_i} Pr(π_{ik} | π_{i1}^{k-1}, π_1^{i-1}, τ_0^l, φ_0^l, e)    [Permutation (distortion)]

× ∏_{k=1}^{φ_0} Pr(π_{0k} | π_{01}^{k-1}, π_1^l, τ_0^l, φ_0^l, e)

IBM Model 4

Q: how hard is it to compute expectations?
A: hard, but can be approximated using Markov chain Monte Carlo.

Page 27: Edinburgh MT lecture 6: decoding

Decoding

We want to solve this problem:

e* = argmax_e p(e | f)

To do this, we must evaluate the expression for all English sentences.

Q: how many English sentences are there?

Page 28: Edinburgh MT lecture 6: decoding

Decoding

We want to solve this problem:

e* = argmax_e p(e | f)

To do this, we must evaluate the expression for all English sentences.

Q: how many English sentences are there?

(Easier?) Q: how many English sentences receive non-zero probability, given f?

Page 29: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

Suppose we have 5 translations per word.
Suppose every word has fertility 1 (w/ prob. 1).

Page 30: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

Suppose we have 5 translations per word.
Suppose every word has fertility 1 (w/ prob. 1).

Then we have 5^n · n! translations! (15,000 for this example.)
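The count on this slide can be checked directly: pick one of 5 translations for each of the n words, then one of n! orderings. A minimal sketch:

```python
from math import factorial

def num_translations(n, t=5):
    """Candidate translations when each of the n source words has t
    possible translations and fertility exactly 1: choose a translation
    for each word (t**n) and then an ordering of the words (n!)."""
    return t**n * factorial(n)

print(num_translations(4))  # 5**4 * 4! = 625 * 24 = 15000, as on the slide
```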

Page 31: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

Given a sentence pair and an alignment, we can easily calculate

p(English, alignment|Chinese)

Can we decode without enumerating all translations?
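As a toy illustration of "easily calculate" (all probabilities below are invented, not from a trained model): for a fixed alignment, the score is just a product of fertility (pf), translation (pt), and distortion (pd) factors.

```python
# Invented toy parameter tables; a real Model 4 has one entry per event.
pf = {("虽然", 1): 0.6}          # fertility probabilities
pt = {("However", "虽然"): 0.3}  # translation probabilities
pd = {(0, "However"): 0.4}       # distortion probabilities

def score(fert, trans, dist):
    """Multiply the factors for one (sentence pair, alignment)."""
    p = 1.0
    for f in fert:
        p *= pf[f]
    for t in trans:
        p *= pt[t]
    for d in dist:
        p *= pd[d]
    return p

print(score([("虽然", 1)], [("However", "虽然")], [(0, "However")]))
# ≈ 0.6 * 0.3 * 0.4 = 0.072
```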

Page 32: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

There are 5^n · n! target sentences.

Key Idea

But there are only O(5n) ways to start them.

Page 33: Edinburgh MT lecture 6: decoding

coverage vector

Key Idea

north

northern

strong

p(north | START) · p(北 | north)

p(northern | START) · p(北 | northern)

p(strong | START) · p(呼啸 | strong)

Page 34: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

coverage vector

Key Idea

wind

p(wind | north) · p(风 | wind)

p(strong | north) · p(呼啸 | strong)

strong

north

northern

strong

Page 35: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

coverage vector

Key Idea

wind

p(wind | north) · p(风 | wind)

p(strong | north) · p(呼啸 | strong)

strong

north

northern

strong

This shares work among sentence beginnings
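The prefix-sharing idea can be sketched as a dynamic program over (coverage vector, last English word) states. The translation table and the constant stand-in bigram model below are invented for illustration only:

```python
from collections import defaultdict

src = ["北", "风", "呼啸"]
# Invented toy translation probabilities
trans = {"北": {"north": 0.5, "northern": 0.3},
         "风": {"wind": 0.6},
         "呼啸": {"strong": 0.2, "howls": 0.4}}

def bigram(prev, w):
    return 0.1  # stand-in: a real bigram LM would go here

# State = (coverage bitmask, last word); value = (best prob, best words)
best = {(0, "<s>"): (1.0, [])}
for _ in range(len(src)):
    nxt = defaultdict(lambda: (0.0, []))
    for (cov, prev), (p, words) in best.items():
        for i, f in enumerate(src):
            if cov & (1 << i):
                continue                  # source word already covered
            for e, pt in trans[f].items():
                q = p * pt * bigram(prev, e)
                state = (cov | (1 << i), e)
                if q > nxt[state][0]:     # keep best path into each state
                    nxt[state] = (q, words + [e])
    best = nxt

full = (1 << len(src)) - 1
p, words = max(v for (cov, _), v in best.items() if cov == full)
print(words, p)
```

Every hypothesis that begins the same way passes through the same state, which is exactly how work is shared among sentence beginnings.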

Page 36: Edinburgh MT lecture 6: decoding

Key IdeaThis shares work among

sentence beginnings

What about sentence endings?

Page 37: Edinburgh MT lecture 6: decoding

Key Idea

Dynamic Programming

amount of work: O(5n · 2^n)

bad, but much better than 5^n · n!

Page 39: Edinburgh MT lecture 6: decoding

Key Idea

Dynamic Programming

each edge labelled with a weight and a word (or words)

amount of work: O(5n · 2^n)

bad, but much better than 5^n · n!

Page 40: Edinburgh MT lecture 6: decoding

Key Idea

Dynamic Programming

each edge labelled with a weight and a word (or words)

north, 0.014

amount of work: O(5n · 2^n)

bad, but much better than 5^n · n!

Page 41: Edinburgh MT lecture 6: decoding

Key Idea

Dynamic Programming

each edge labelled with a weight and a word (or words)

north, 0.014

weighted finite-state automata

amount of work: O(5n · 2^n)

bad, but much better than 5^n · n!

Page 42: Edinburgh MT lecture 6: decoding

Weighted languages

•The lattice describing the set of all possible translations is a weighted finite state automaton.

•So is the language model.

•Since regular languages are closed under intersection, we can intersect the devices and run shortest path graph algorithms.

•Taking their intersection is equivalent to computing the probability under Bayes’ rule.
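A minimal sketch of the product construction behind that intersection (states, labels, and weights below are all invented): matching arc labels pair up, and their weights multiply.

```python
# Toy weighted automata as {(state, label): (next_state, weight)}.
lattice = {("q0", "north"): ("q1", 0.5), ("q1", "wind"): ("q2", 0.6)}
lm      = {("s0", "north"): ("s1", 0.2), ("s1", "wind"): ("s2", 0.3)}

# Product construction: pair states, keep arcs whose labels agree,
# and multiply the two weights (translation prob * LM prob).
product = {}
for (q, w), (q2, pw) in lattice.items():
    for (s, w2), (s2, pl) in lm.items():
        if w == w2:
            product[((q, s), w)] = ((q2, s2), pw * pl)

for arc, (dest, weight) in sorted(product.items()):
    print(arc, "->", dest, weight)
```

Best-path search in this product automaton is then the decoding step the slide describes.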

Page 43: Edinburgh MT lecture 6: decoding

Wait a second!

We want to solve this problem:

e* = argmax_e p(e | f) = argmax_e Σ_a p(e, a | f)

But now we're solving this problem:

e* = argmax_e max_a p(e, a | f)

Often called the Viterbi approximation
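A tiny made-up example of why the approximation matters: a translation with one strong alignment can beat one whose mass is spread over several alignments, even when the latter has more total probability.

```python
# Invented joint probabilities p(e, a | f) for two candidate
# translations e1, e2 and two alignments a1, a2.
p = {
    ("e1", "a1"): 0.30, ("e1", "a2"): 0.00,
    ("e2", "a1"): 0.20, ("e2", "a2"): 0.20,
}

def viterbi_score(e):
    """Keep only the single best alignment (Viterbi approximation)."""
    return max(v for (e_, a), v in p.items() if e_ == e)

def sum_score(e):
    """True objective: sum over all alignments."""
    return sum(v for (e_, a), v in p.items() if e_ == e)

print(max(["e1", "e2"], key=viterbi_score))  # e1: 0.30 > 0.20
print(max(["e1", "e2"], key=sum_score))      # e2: 0.40 > 0.30
```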

Page 44: Edinburgh MT lecture 6: decoding

We can sum over alignments by weighted determinization

Wait a second!

Page 45: Edinburgh MT lecture 6: decoding

How expensive is that?

We can sum over alignments by weighted determinization

Wait a second!

Page 46: Edinburgh MT lecture 6: decoding

How expensive is that?

We can sum over alignments by weighted determinization

nondeterministic: O(5n · 2^n)

Wait a second!

Page 47: Edinburgh MT lecture 6: decoding

How expensive is that?

We can sum over alignments by weighted determinization

nondeterministic: O(5n · 2^n)

deterministic: O(2^(5n · 2^n))

Wait a second!

Page 48: Edinburgh MT lecture 6: decoding

I made the simplest machine translation model I could think of

and it blew up in my face

Ok, let's stick with the Viterbi approximation. But… O(5n · 2^n) is still far too much work.

Page 49: Edinburgh MT lecture 6: decoding

I made the simplest machine translation model I could think of

and it blew up in my face

Ok, let's stick with the Viterbi approximation. But… O(5n · 2^n) is still far too much work.

Can we do better?

Page 50: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

Can we do better?

Page 51: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

Can we do better?

Each arc weighted by translation probability +

bigram probability

Page 52: Edinburgh MT lecture 6: decoding

北 风 呼啸 。

Can we do better?

Each arc weighted by translation probability +

bigram probability

Goal: find shortest path that visits each word once.

Page 53: Edinburgh MT lecture 6: decoding

London Paris NY Tokyo

Can we do better?

Each arc weighted by translation probability +

bigram probability

Objective: find shortest path that visits each word once.

Page 54: Edinburgh MT lecture 6: decoding

London Paris NY Tokyo

Can we do better?

Each arc weighted by translation probability +

bigram probability

Objective: find shortest path that visits each word once.

Probably not: this is the traveling salesman problem.

Page 55: Edinburgh MT lecture 6: decoding

London Paris NY Tokyo

Can we do better?

Each arc weighted by translation probability +

bigram probability

Objective: find shortest path that visits each word once.

Probably not: this is the traveling salesman problem.

Even the Viterbi approximation is NP-hard!

Page 56: Edinburgh MT lecture 6: decoding

Approximation: Pruning

Page 57: Edinburgh MT lecture 6: decoding

Approximation: Pruning

Idea: prune states by cost of shortest path to them

Page 58: Edinburgh MT lecture 6: decoding

Approximation: Pruning

Idea: prune states by accumulated path length

Page 59: Edinburgh MT lecture 6: decoding

Approximation: Pruning

Ideal result: only high-probability paths enumerated

Page 60: Edinburgh MT lecture 6: decoding

Approximation: Pruning

Actual result: longer paths have lower probability!

Page 63: Edinburgh MT lecture 6: decoding

Approximation: Pruning

Solution: Group states by number of covered words.

Page 65: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation
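A minimal sketch of stack decoding under these assumptions (invented toy translation table, no LM): hypotheses are grouped into "stacks" by number of covered source words, and each stack is pruned to the best k, so total work is linear in sentence length for a fixed beam size.

```python
import heapq

src = ["北", "风", "呼啸"]
# Invented toy translation probabilities
trans = {"北": [("north", 0.5)], "风": [("wind", 0.6)],
         "呼啸": [("strong", 0.2), ("howls", 0.4)]}
BEAM = 2

# stacks[m] holds hypotheses covering exactly m source words:
# (probability, coverage bitmask, English words so far)
stacks = [[] for _ in range(len(src) + 1)]
stacks[0] = [(1.0, 0, [])]
for m in range(len(src)):
    for p, cov, words in stacks[m]:
        for i, f in enumerate(src):
            if cov & (1 << i):
                continue                      # already covered
            for e, pt in trans[f]:
                stacks[m + 1].append((p * pt, cov | (1 << i), words + [e]))
    # prune: keep only the BEAM best hypotheses in the next stack
    stacks[m + 1] = heapq.nlargest(BEAM, stacks[m + 1], key=lambda h: h[0])

print(stacks[-1][0])  # best complete hypothesis that survived pruning
```

Grouping by number of covered words keeps long and short hypotheses from competing directly, which is the fix for "longer paths have lower probability".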

Page 66: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation

Page 67: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation

Page 68: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation

Page 69: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation

Page 70: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation

Page 71: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation

Page 72: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation

Page 73: Edinburgh MT lecture 6: decoding

Approximation: Pruning

“Stack” decoding: a linear-time approximation

Page 74: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

the sky

Approximation: Distortion Limits

Page 75: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

the sky

number of vertices: O(2^n)

Approximation: Distortion Limits

Page 76: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

the sky

number of vertices: O(2^n)

window d = 4

Approximation: Distortion Limits

Page 77: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

the sky

number of vertices: O(2^n)

window d = 4

outside window to left: covered

outside window to right: uncovered

Approximation: Distortion Limits

Page 78: Edinburgh MT lecture 6: decoding

虽然 北 风 呼啸 , 但 天空 依然 十分 清澈 。

the sky

number of vertices: O(n · 2^d)

window d = 4

outside window to left: covered

outside window to right: uncovered

Approximation: Distortion Limits
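A rough back-of-the-envelope count of the savings (the exact state count depends on the decoder, so treat the formula below as an illustration): with a window of size d, everything left of the window is covered and everything right of it is uncovered, so only the 2^d subsets inside the window distinguish coverage states.

```python
def num_states(n, d):
    """Approximate coverage-vector count with a distortion window of
    size d: one factor for where the window sits, one for the 2**d
    coverage patterns inside it (illustrative, not exact)."""
    return (n - d + 1) * 2**d

for n in [10, 20, 40]:
    print(n, 2**n, num_states(n, 4))  # exponential vs. roughly linear in n
```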

Page 79: Edinburgh MT lecture 6: decoding

Summary

•We need every possible trick to make decoding fast.

•Viterbi approximation: from worse to bad.

•Dynamic programming: exponential speedup over enumeration, but NP-hard for IBM Models.

•NP-hardness means efficient exact solutions are unlikely.

•Heuristic approximations: stack decoding, distortion limits.

•Tradeoff: might not find true argmax.