edinburgh mt lecture 8: measuring translation accuracy

50
Measuring translation accuracy

Upload: alopezfoo

Post on 25-Jul-2015

234 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Edinburgh MT lecture 8: Measuring translation accuracy

Measuring translation accuracy

Page 2: Edinburgh MT lecture 8: Measuring translation accuracy

More has been written about machine translation

evaluation than about machine translation itself.

Yorick Wilks

Page 3: Edinburgh MT lecture 8: Measuring translation accuracy

•Why evaluate?•Rank systems.•Evaluate incremental changes.•Assess new ideas empirically.

•Metric must be objective!

Page 4: Edinburgh MT lecture 8: Measuring translation accuracy
Page 5: Edinburgh MT lecture 8: Measuring translation accuracy

© 2010 IBM Corporation

IBM Research

55

What It Takes to compete against Top Human Jeopardy! PlayersOur Analysis Reveals the Winner’s Cloud

Winning Human Performance

Winning Human Performance

2007 QA Computer System2007 QA Computer System

Grand Champion Human Performance

Grand Champion Human Performance

Each dot – actual historical human Jeopardy! games

More ConfidentMore Confident Less ConfidentLess Confident

Computers?Not So Good.

Page 6: Edinburgh MT lecture 8: Measuring translation accuracy

© 2010 IBM Corporation

IBM Research

10

Baseline 12/06

v0.1 12/07

v0.3 08/08

v0.5 05/09

v0.6 10/09

v0.8 11/10

v0.4 12/08

DeepQA: Incremental Progress in Answering Precision on the Jeopardy Challenge: 6/2007-11/2010

v0.2 05/08

IBM WatsonPlaying in the Winners Cloud

V0.7 04/10

Page 7: Edinburgh MT lecture 8: Measuring translation accuracy

Claim: evaluation by humans is good and ???

Page 8: Edinburgh MT lecture 8: Measuring translation accuracy

Chinese people in the traditional Spring Festival is approaching, the CPC Central Committee this afternoon in Zhongnanhai on the 22nd non-Party

personages to convene a forum in Spring Festival, invited the central committees of democratic parties, the leadership of the National Federation

of Industry and Commerce and personages without party affiliation on behalf of comrades gathered together State yes, talked in length about the

friendship, to greet the Chinese New Year. CPC Central Committee General Secretary and State President and Central Military Commission Chairman

Hu Jintao on behalf of the CPC Central Committee, the State Council, to the central committees of democratic parties, leaders of the National Federation

of Industry and Commerce and personages without party affiliation, to members of the united front, to extend my New Year's blessing.

Page 9: Edinburgh MT lecture 8: Measuring translation accuracy

Design of the WMT Evaluation (2008-2011)system A

system B

system C

system D

system E

system F

system G

} 1. system C

2. system D

3. system A

4. system B

5. system G

6. system F

7. system E

{➡Costly: 361 hours of human effort in 2011.

Page 10: Edinburgh MT lecture 8: Measuring translation accuracy

Design of the WMT Evaluation (2008-2011)system A

system B

system C

system D

system E

system F

system G

} 1. system C

2. system D

3. system A

4. system B

5. system G

6. system F

7. system E

{Are you sure this is the correct ranking?

•In above example, there are 5040 possible rankings.•With 10 systems: 3 million possible rankings.•With 20 systems: 2 quintillion possible rankings.

Page 11: Edinburgh MT lecture 8: Measuring translation accuracy

Design of the WMT Evaluation (2008-2011)system A

system B

system C

system D

system E

system F

system G

reference =

➡ Sample input sentence.➡ Sample five translators of it from Systems ∪ {Reference}.➡ Sample an assessor.➡ Receive (partial) ranking of translations from assessor.

While (evaluation period is not over):

1. reference2. system C3. system A, system F4. system D

reference system Areference system Creference system Dreference system Fsystem A system Csystem A system Dsystem A system Fsystem C system Dsystem C system Fsystem D system F

����

⌘{WMT Raw Data:

pairwise rankings

Page 12: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C system D

•Directed edge between every pair of vertices.•Edge from A to B if A beats B in pairwise comparison.•Widely used to model: sports, web results, elections.

Landau, 1951. On dominance relations andthe structure of animal societies

•Used to model all WMT `10-`11 rankings (25 tasks).

Page 13: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C system D

If tournament is acyclic: topological sort

Page 14: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C system D

If tournament is acyclic: topological sort

Page 15: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C

system D

If tournament is acyclic: topological sort

Page 16: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A

system B

system C

system D

If tournament is acyclic: topological sort

Page 17: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A

system B

system C

system D

If tournament is acyclic: topological sort

Page 18: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C system D

1

23

1

1

2

What if tournament contains cycles?

16 out of 25 tasks in WMT ’10-’11 contain cycles!

Page 19: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C system D

1

23

1

1

2

What if tournament contains cycles?

One solution: Reverse a set of edges such that:(a) Resulting graph is acyclic.

(b) Sum of reversed edges weights is minimized.

Page 20: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C system D

1

23

1

1

2

What if tournament contains cycles?

Solution: Reverse a set of edges such that:(a) Graph is acyclic.

(b) Sum of reversed edges weights is minimized.

Page 21: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C system D

1

23

1

1

2

What if tournament contains cycles?

Set of reversed edges = minimum feedback arc set (MFAS).In theory, this optimization is NP-hard (Karp, 1972).

In practice, it’s not too hard.

Page 22: Edinburgh MT lecture 8: Measuring translation accuracy

Tournamentssystem A system B

system C system D

1

23

1

1

2

What if tournament contains cycles?

Important detail: What should the weight be?Following analysis uses #(wins - losses).

Dumb, but counts each observation equally.

Page 23: Edinburgh MT lecture 8: Measuring translation accuracy

onlineBrwth-combo

cmu-hyposel-combocambridge

liumdcu-combo

cmu-heafield-comboupv-combo

nrcuedin

jhulimsi

jhu-combolium-combo

ralilig

bbn-comborwth

cmu-statxferonlineAhuicong

dfkicu-zeman

geneva

Example: French-English 2010

Task Rankings

MFAS

Page 24: Edinburgh MT lecture 8: Measuring translation accuracy

Has WMT solved these problems?

Human evaluation is too slow and expensive!

Human evaluation isn’t reproducible!

With crowdsourcing, WMT has made a good dent in this problem.

Empirically true in the WMT data.

Page 25: Edinburgh MT lecture 8: Measuring translation accuracy

Human evaluation is fast and cheap!

Page 26: Edinburgh MT lecture 8: Measuring translation accuracy

美国愿和北韩谈判但拒绝再付出报酬

US willing to negotiate with North Korea but not to pay more compensation.

The United States is willing to hold talks with North Korea but refused to pay

remuneration.

Page 27: Edinburgh MT lecture 8: Measuring translation accuracy

“Progress” postponed because of mechanical hand into the sky.

Launch of “Endeavour” delayed by robotic arm problems.

“奋进”号因机械手故障推迟到升空

Page 28: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Page 29: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Page 30: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Edit distance = 163 substitutions

8 deletions5 insertions

Page 31: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Edit distance = 163 substitutions

8 deletions5 insertions

ed(i, j) = min

8<

:

ed(i� 1, j) + del(wi)ed(i, j � 1) + ins(w0

j)ed(i� 1, j � 1) + sub(wi, w0

j)

Page 32: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Edit distance = 163 substitutions

8 deletions5 insertions

ed(i, j) = min

8<

:

ed(i� 1, j) + del(wi)ed(i, j � 1) + ins(w0

j)ed(i� 1, j � 1) + sub(wi, w0

j)

Page 33: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Precision:7/15 tokens = 47%

Recall:7/12 tokens = 58%

Page 34: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Despite the strong northerly winds , the sky remains very clear .

The sky was still crystal clear , though the north wind was howling .

Although a north wind was howling , the sky remained clear and blue .

Precision: 11/15 tokens

Page 35: Edinburgh MT lecture 8: Measuring translation accuracy

sky very northern shrieked clear wind Although across the the , still was it .

However , the sky remained clear under the strong north wind .

Despite the strong northerly winds , the sky remains very clear .

The sky was still crystal clear , though the north wind was howling .

Although a north wind was howling , the sky remained clear and blue .

Precision: 11/15 tokens

Page 36: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Despite the strong northerly winds , the sky remains very clear .

The sky was still crystal clear , though the north wind was howling .

Although a north wind was howling , the sky remained clear and blue .

Precision: 11/15 tokens4/14 bigrams1/13 trigrams

Page 37: Edinburgh MT lecture 8: Measuring translation accuracy

sky very northern shrieked clear wind Although across the the , still was it .

However , the sky remained clear under the strong north wind .

Despite the strong northerly winds , the sky remains very clear .

The sky was still crystal clear , though the north wind was howling .

Although a north wind was howling , the sky remained clear and blue .

Precision: 11/15 tokens0/14 bigrams0/13 trigrams

Page 38: Edinburgh MT lecture 8: Measuring translation accuracy

very clear .

However , the sky remained clear under the strong north wind .

Despite the strong northerly winds , the sky remains very clear .

The sky was still crystal clear , though the north wind was howling .

Although a north wind was howling , the sky remained clear and blue .

Precision: 3/1 tokens2/2 bigrams1/1 trigrams

Page 39: Edinburgh MT lecture 8: Measuring translation accuracy

very clear . shrieked was still Although wind , across it northern the the sky

However , the sky remained clear under the strong north wind .

Despite the strong northerly winds , the sky remains very clear .

The sky was still crystal clear , though the north wind was howling .

Although a north wind was howling , the sky remained clear and blue .

Precision: 11/15 tokens4/14 bigrams1/13 trigrams

Page 40: Edinburgh MT lecture 8: Measuring translation accuracy

a north . the was and was the the the though the , the sky

However , the sky remained clear under the strong north wind .

Despite the strong northerly winds , the sky remains very clear .

The sky was still crystal clear , though the north wind was howling .

Although a north wind was howling , the sky remained clear and blue .

Precision: 11/15 tokens4/14 bigrams1/13 trigrams

Page 41: Edinburgh MT lecture 8: Measuring translation accuracy

BLEU

BP =⇢

1 if c > re1�r/c if c r

Bleu = BP · exp

NX

n=1

wn log pn

!

Page 42: Edinburgh MT lecture 8: Measuring translation accuracy

Details matter

length

BP

Page 43: Edinburgh MT lecture 8: Measuring translation accuracy

Details matter

Page 44: Edinburgh MT lecture 8: Measuring translation accuracy

Influence of BLEU

Page 45: Edinburgh MT lecture 8: Measuring translation accuracy

BLEU-1BLEU-4

BLEU-v11bBLEU-v12

METEOR-v0.6NIST-v11b

TER-v0.7.254-GRRATEC1AmberATEC3ATEC4

Meteor-v0.7TerrorCat

BEwT-EBadger

BadgerLiteBleu-sbpBleuSPCDer

DP-OrDP-OrpDR-OrEDPM

LETMETEOR-ranking

MaxSim

RTERose

SEPIA1SEPIA2

SNRSR-Or

SVM-RankTERpULCh

ULCoptinvWermBLEUmTER

Page 46: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

TER: Translation (Error|Edit) Distance

Page 47: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

TER: Translation (Error|Edit) Distance

Basically edit distance with swaps

Page 48: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

TER: Translation (Error|Edit) Distance

Basically edit distance with swapsHow hard is it to compute this?

Page 49: Edinburgh MT lecture 8: Measuring translation accuracy

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

TER: Translation (Error|Edit) Distance

Basically edit distance with swapsHow hard is it to compute this?

ter(i, j) = min

8>><

>>:

ter(i� 1, j) + del(wi)

ter(i, j � 1) + ins(w0j)

ter(i� 1, j � 1) + sub(wi, w0j)

maxk ter(i� 1, [1, ...k � 1, k + 1, ...j]) + 1

Page 50: Edinburgh MT lecture 8: Measuring translation accuracy

Automatic evaluation is fast(er) and cheap(er).