Edinburgh MT lecture 10: discriminative training

Discriminative Learning

Upload: alopezfoo

Post on 25-Jul-2015


Page 1: Edinburgh MT lecture 10: discriminative training

Discriminative Learning

Page 8: Edinburgh MT lecture 10: discriminative training

Moses off the shelf:

-1286.916461

13.2 14.0 10.8 11.2 13.0 14.6
19.3 19.8 24.0 24.0 24.6 23.0
26.2

[Figure: plot with horizontal axis ticks from -1500 to -1200 and vertical axis ticks from 10 to 30.]

Great example of fortuitous search error: bad search fixes a bad model.

It’s important to get the model right, then the search algorithm.

Page 9: Edinburgh MT lecture 10: discriminative training

Learning

Problem: why maximize likelihood if we care about BLEU?

$$\arg\max \frac{1}{Z} \exp\left( \sum_k \lambda_k h_k(\text{English}, \text{alignment}, \text{Chinese}) \right)$$
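As a rough illustration of the log-linear form on this slide, the sketch below normalizes exponentiated weighted feature sums over a small candidate list. The feature values and weights are invented for the example, and Z here sums only over the listed candidates, whereas in the model it ranges over all translations and alignments.

```python
import math

def loglinear_probs(feature_vectors, weights):
    """Return exp(sum_k lambda_k * h_k) / Z for each candidate in a finite list."""
    scores = [sum(w * h for w, h in zip(weights, hv)) for hv in feature_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)      # Z, restricted to the listed candidates (an assumption)
    return [e / z for e in exps]

# Hypothetical candidates, each with two made-up feature values h_1, h_2.
candidates = [[-2.0, -1.0], [-1.0, -3.0], [-4.0, -0.5]]
weights = [1.0, 1.0]
probs = loglinear_probs(candidates, weights)
print(probs)
```

Maximizing likelihood would adjust the weights so that the reference translation receives high probability under this distribution, which is a different target from BLEU.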

Page 10: Edinburgh MT lecture 10: discriminative training

Learning

Solution: maximize BLEU instead

$$\arg\max_{\lambda} \; \text{BLEU}\left( \arg\max_{\text{Eng},\,\text{align}} \sum_k \lambda_k h_k(\text{English}, \text{alignment}, \text{Chinese}) \right)$$

Page 11: Edinburgh MT lecture 10: discriminative training

The Noisy Channel

[Figure: plot with axes -log p(g | e) and -log p(e).]

Page 12: Edinburgh MT lecture 10: discriminative training

As a Linear Model

[Figure: plot with axes -log p(g | e) and -log p(e), showing the weight vector w.]

Page 15: Edinburgh MT lecture 10: discriminative training

As a Linear Model

[Figure: plot with axes -log p(g | e) and -log p(e), showing the weight vector w.]

Improvement 1:

change w to find better translations
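A small sketch of what changing w can do: the noisy channel corresponds to equal weights on the two costs -log p(g | e) and -log p(e), and a different weight vector can select a different hypothesis. The three hypotheses and their cost values below are invented for illustration.

```python
def best_hypothesis(hyps, w):
    """Return the name of the hypothesis with the lowest weighted cost w . h."""
    return min(hyps, key=lambda name: sum(wi * hi for wi, hi in zip(w, hyps[name])))

# Hypothetical hypotheses with costs (-log p(g | e), -log p(e)).
hyps = {"A": (2.0, 5.0), "B": (4.0, 2.0), "C": (3.0, 3.5)}

noisy_channel_choice = best_hypothesis(hyps, (1.0, 1.0))  # equal weights
reweighted_choice = best_hypothesis(hyps, (1.0, 0.2))     # language model down-weighted
print(noisy_channel_choice, reweighted_choice)
```

Because the features here are costs, the best hypothesis is the one with the smallest weighted sum.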

Page 19: Edinburgh MT lecture 10: discriminative training

As a Linear Model

[Figure: plot with axes -log p(g | e) and -log p(e), showing the weight vector w.]

Improvement 2:

Add dimensions to make points separable

Page 22: Edinburgh MT lecture 10: discriminative training

K-Best List Example

[Figure: K-best hypotheses #1 to #10 plotted on axes h1 and h2, with the weight vector w. Legend bins: [0.8, 1.0), [0.6, 0.8), [0.4, 0.6), [0.2, 0.4), [0.0, 0.2).]

Page 23: Edinburgh MT lecture 10: discriminative training

Training as Classification

• Pairwise Ranking Optimization

• Reduce training problem to binary classification with a linear model

• Algorithm

  • For i = 1 to N

    • Pick random pair of hypotheses (A, B) from K-best list

    • Use cost function to determine whether A or B is better

    • Create ith training instance

  • Train binary linear classifier
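The algorithm above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: the slide does not name a classifier, so a plain perceptron is used; each training instance is taken to be the feature difference h(A) - h(B) with label +1 if A is better and -1 otherwise; and `quality` is a hypothetical stand-in for the cost function, with higher values meaning a better hypothesis. The K-best list is invented.

```python
import random

def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def pro_train(kbest, n_samples=200, max_epochs=50, seed=0):
    """Learn weights from a K-best list of (features, quality) pairs."""
    rng = random.Random(seed)
    instances = []
    for _ in range(n_samples):
        (feat_a, q_a), (feat_b, q_b) = rng.sample(kbest, 2)  # random pair (A, B)
        if q_a == q_b:
            continue  # no preference between A and B
        diff = [a - b for a, b in zip(feat_a, feat_b)]
        instances.append((diff, 1 if q_a > q_b else -1))  # ith training instance

    # Binary linear classifier: a perceptron on the difference vectors.
    w = [0.0] * len(kbest[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for diff, label in instances:
            if label * dot(w, diff) <= 0:
                w = [wi + label * di for wi, di in zip(w, diff)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Hypothetical K-best list: (features (h1, h2), quality).
kbest = [
    ([1.0, 0.0], 1.0),
    ([0.8, 0.3], 0.5),
    ([0.5, 0.5], 0.0),
    ([0.3, 0.8], -0.5),
    ([0.0, 1.0], -1.0),
]
w = pro_train(kbest)
ranking = sorted(range(len(kbest)), key=lambda i: -dot(w, kbest[i][0]))
print(w, ranking)
```

Training on difference vectors means the classifier needs no bias term: a weight vector that classifies h(A) - h(B) correctly also scores A above B.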

Page 24: Edinburgh MT lecture 10: discriminative training

[Figure sequence (Pages 24 to 33): two plots, each with axes h1 and h2. One shows the K-best hypotheses #1 to #10 with legend bins [0.8, 1.0), [0.6, 0.8), [0.4, 0.6), [0.2, 0.4), [0.0, 0.2). Individual builds carry the labels "Worse!" (Pages 26, 27, 31) and "Better!" (Pages 29, 30, 32).]

Page 34: Edinburgh MT lecture 10: discriminative training

Fit a linear model

[Figure: plot with axes h1 and h2, showing the weight vector w.]

Page 36: Edinburgh MT lecture 10: discriminative training

K-Best List Example

[Figure: K-best hypotheses #1 to #10 plotted on axes h1 and h2, with the weight vector w. Legend bins: [0.8, 1.0), [0.6, 0.8), [0.4, 0.6), [0.2, 0.4), [0.0, 0.2).]

Page 37: Edinburgh MT lecture 10: discriminative training

Optimizing for BLEU

Notice:

$$\text{score}(\text{English} \mid \text{Chinese}) = \sum_i \lambda_i h_i(\text{Chinese}, \text{English})$$

$$= \lambda_x h_x(\text{Chinese}, \text{English}) + \sum_{i \neq x} \lambda_i h_i(\text{Chinese}, \text{English})$$

$$= a \lambda_x + b$$

where $a = h_x(\text{Chinese}, \text{English})$ and $b = \sum_{i \neq x} \lambda_i h_i(\text{Chinese}, \text{English})$ do not depend on $\lambda_x$.

just a line!
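The decomposition can be checked numerically. In this sketch the feature values are invented and the function name `line_coefficients` is hypothetical; it returns the slope a and intercept b for one hypothesis when only the weight at index x is allowed to vary.

```python
def line_coefficients(features, weights, x):
    """Return (a, b) such that the model score equals a * weights[x] + b."""
    a = features[x]
    b = sum(w * h for i, (w, h) in enumerate(zip(weights, features)) if i != x)
    return a, b

features = [2.0, -1.0, 0.5]  # made-up h_i values for one hypothesis
weights = [0.2, 0.4, 0.1]    # lambda_i
a, b = line_coefficients(features, weights, x=1)
score = sum(w * h for w, h in zip(weights, features))
print(a, b, score)
```

Each hypothesis in a K-best list yields its own (a, b), so varying one weight moves every hypothesis along its own line.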

Page 41: Edinburgh MT lecture 10: discriminative training

Optimizing for BLEU

[Figure sequence (Pages 41 to 54): model score plotted against $\lambda_x$; from Page 52 on, BLEU is also plotted against $\lambda_x$.]

Minimum Error Rate Training (Och 2003)
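A simplified sketch of the line search this idea leads to, for a single sentence. Since each hypothesis is a line in $\lambda_x$, the top-scoring hypothesis can only change where two lines cross, so it is enough to test one value of $\lambda_x$ per interval between crossings. Assumptions here: `quality` is a hypothetical per-hypothesis metric value standing in for BLEU, higher model score is better, and crossings are found by brute force over pairs. Och's method works over a whole corpus and is more efficient than this.

```python
def line_search(lines, quality):
    """lines: one (a, b) per hypothesis, with score = a * lam + b.
    quality: metric value per hypothesis (stand-in for BLEU).
    Returns (lam, quality of the 1-best hypothesis at lam) with the best quality."""
    # Crossing points of every pair of non-parallel lines.
    points = sorted({(b2 - b1) / (a1 - a2)
                     for i, (a1, b1) in enumerate(lines)
                     for (a2, b2) in lines[i + 1:] if a1 != a2})
    if not points:
        candidates = [0.0]
    else:
        # One test value inside every interval between consecutive crossings.
        candidates = ([points[0] - 1.0]
                      + [(p + q) / 2 for p, q in zip(points, points[1:])]
                      + [points[-1] + 1.0])
    best = None
    for lam in candidates:
        top = max(range(len(lines)), key=lambda i: lines[i][0] * lam + lines[i][1])
        if best is None or quality[top] > best[1]:
            best = (lam, quality[top])
    return best

# Hypothetical K-best list: three hypotheses as lines, with made-up quality values.
lines = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]
quality = [0.3, 0.8, 0.1]
best_lam, best_q = line_search(lines, quality)
print(best_lam, best_q)
```

The metric is piecewise constant in $\lambda_x$, which is why no gradient is needed: the search only compares a finite set of intervals.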

Page 55: Edinburgh MT lecture 10: discriminative training

Max Blues

Page 56: Edinburgh MT lecture 10: discriminative training

Optimizing for BLEU

• Lots of alternative learning algorithms!

• Margin Infused Relaxed Algorithm, MIRA (Chiang ’08)

• Ramp Loss Minimization (Gimpel & Smith ’12)

• Online Rank Learning (Watanabe ’12)

• many others…

Page 64: Edinburgh MT lecture 10: discriminative training

ok-voon ororok sprok .
at-voon bichat dat .

ok-drubel ok-voon anok plok sprok .
at-drubel at-voon pippat rrat dat .

erok sprok izok hihok ghirok .
totat dat arrat vat hilat .

ok-voon anok drok brok jok .
at-voon krat pippat sat lat .

wiwok farok izok stok .
totat jjat quat cat .

lalok sprok izok jok stok .
wat dat krat quat cat .

lalok farok ororok lalok sprok izok enemok .
wat jjat bichat wat dat vat eneat .

lalok brok anok plok nok .
iat lat pippat rrat nnat .

wiwok nok izok kantok ok-yurp .
totat nnat quat oloat at-yurp .

lalok mok nok yorok ghirok clok .
wat nnat gat mat bat hilat .

lalok nok crrrok hihok yorok zanzanok .
wat nnat arrat mat zanzanat .

lalok rarok nok izok hihok mok .
wat nnat forat arrat vat gat .

arrat : hihok
at-drubel : ok-drubel
at-voon : ok-voon
at-yurp : ok-yurp
bat : clok
bichat : ororok
cat : stok
dat : sprok
eneat : enemok
forat : rarok
gat : mok
hilat : ghirok
iat : lalok
jjat : farok
krat : jok
lat : brok
mat : yorok
nnat : nok
oloat : kantok
pippat : anok
quat : izok
rrat : plok
sat : drok
totat : erok
totat : wiwok
vat : izok
wat : lalok
zanzanat : zanzanok
??? : crrrok

language model + translation model + other features

iat lat pippat eneat hilat oloat at-yurp .

lalok brok anok enemok ghirok kantok ok-yurp .

Decoder

lalok brok anok ghirok enemok kantok ok-yurp .

accuracy = 0.83

feature weights 〈0.2, 0.4, 0.1, ...〉

Fairly reasonable approximation to how Google Translate and Bing Translator work
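As a toy illustration of the lookup step only, the sketch below translates the input sentence word by word with dictionary entries copied from the slide. It is far simpler than a real decoder: it ignores the language model, reordering, and the feature weights, and the function name is hypothetical.

```python
# Dictionary entries copied from the slide (only those the input sentence needs).
dictionary = {
    "iat": "lalok", "lat": "brok", "pippat": "anok", "eneat": "enemok",
    "hilat": "ghirok", "oloat": "kantok", "at-yurp": "ok-yurp",
}

def translate_word_by_word(sentence, dictionary):
    """Replace each token by its dictionary entry; unknown tokens pass through."""
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())

output = translate_word_by_word("iat lat pippat eneat hilat oloat at-yurp .", dictionary)
print(output)
```

The result is the same word sequence as the first of the two output-language sentences on the slide; the second differs only in the order of "enemok" and "ghirok", which a word-by-word lookup cannot produce.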

Page 65: Edinburgh MT lecture 10: discriminative training

• Key ingredients in Google Translate:

• Phrase-based translation models

• ... Learned heuristically from word alignments

• ... Coupled with a huge language model

• ... Very tight pruning heuristics

• ... And minimum error rate training for BLEU.