agenda for today - csee.ogi.eduroark/courses/cse506-tnl/lec12.pdf · source boeing recibe un pedido...


Upload: hakhanh

Post on 17-Mar-2018


Agenda for today

• Questions about homework 3 and final projects

• Machine-generated language evaluation

• MT evaluation: why it’s hard

• Human evaluation of MT output

• Automated evaluation metrics for MT

• Extrinsic evaluation


Homework 3 and Final project

• Applications track: Don’t forget that you can use DUC for homework 3 if that’s what you prefer.

• Projects: Sounded good. We’ll be giving feedback shortly if we haven’t already.


Machine-generated language

• Machine translation output

• Automatic summarization output

• Automatically generated paraphrases

• Question answering

• Natural language generation

• Problem: Most of the time there is no one “true” or “correct” way to say something.


Turing Test

Alan Turing (1950): Computing Machinery and Intelligence

Can computers think?

Can computers do what we can do?

Can a computer trick people in the “imitation game”?

http://testing.turinghub.com/


Evaluation Approaches

• Human evaluation
- Rate some feature: fluency, fidelity, coherence.
- Rate the sample as good or bad.
- Rank multiple translations from best to worst.
- Ask humans to fix it and see how many edits they made.

• Automated evaluation metrics
- BLEU, ROUGE, Meteor, TER...
- Metrics based on syntactic and semantic analysis.

• Extrinsic
- Utility: can the output be used for anything?
- Reading comprehension


Summarization output evaluation

Indians comprise less than 1% of the population but tribes differ widely. Tribal bureaucracies are scattered and ineffective. Many reservations are remote with little access to development. Indian communities have greater than average occurrences of such problems as drug and alcohol addiction, domestic violence, teen pregnancy, poor education, and unemployment. Poor communications facilities impact the low literacy rate. Long rides to school means students can't stay to use school resources. There is poor health care and higher than average death rates from alcohol, diabetes, suicide, and accidents. The nature of the land -- "held in trust" -- effectively shuts the Indians out of conventional home loan processes. Federal aid programs are inadequate and impacted by outdated maps and census figures. Privileges of tribal sovereignty include unique hunting and fishing rights, and the ability to sell gas and cigarettes tax-free, set zooming, and ban or sell alcohol. Reservations can set up casinos, bringing jobs and money. However, only a few casinos are profitable. Furthermore, there is concern that the casinos bring many non-Indians as customers and managers, and erode historic values and traditions. Tribal sovereignty leads to a number of other issues. Some reservations impose taxes on non-native Americans who work or live on their land and who do not have any voice in the local government. Indian cigarette factories, not having to pay tax, can undersell non-reservation factories. Fishing and hunting privileges cause concern among conservationists that regions will be depleted. Many feel different rights based on bloodline is unconstitutional.

Coin, executive director of the tribal gaming association. Coin, executive director of the tribal gaming association. If the burglar is a member of the tribe, then tribal police handle the case. If the burglar is a member of the tribe, then tribal police handle the case.' '' Garcia drove the nervous Molina to the tribal offices to interview for the program. Such status would give the tribe the right to negotiate gambling deals with the state. Such status would give the tribe the right to negotiate gambling deals with the state. Almost from the beginning, the U. The Hamptons are the playground of the powerful. The Hamptons are the playground of the powerful. '' The reservation is a world apart from the rest of the Hamptons. ' '' Garcia drove the nervous Molina to the tribal offices to interview for the program. 5 percent of the American Indian population. 5 percent of the American Indian population. The tribes will resist. But a non-Indian resident of a reservation has no say in tribal government. But a non-Indian resident of a reservation has no say in tribal government. On one side are the tribal trustees who hold nearly all the decision-making power in the lives of the 500 people on the reservation. On one side are the tribal trustees who hold nearly all the decision-making power in the lives of the 500 people on the reservation. tribes that offer legal casino or bingo gambling. Barring the use of U.


MT output evaluation

Source Boeing recibe un pedido récord de 18 mil millones de dólares.

Ref. Boeing gets a record $18-billion order.

System 1 Boeing received a record order for 18 billion dollars.

System 2 Boeing receives an order record of $18 billion.

System 3 Boeing receives a record order of $18 thousand million.

Which one is “better”?


MT output evaluation

• What makes a translation good?

- well-formed language (fluency, quality)
- meaning of source text preserved (fidelity, adequacy)
- does it make sense (coherence, readability)

• How do you get people to make these sorts of judgements reliably?

• How do you get computers to make the same kinds of judgments that people make?

• What is human judge reliability like?


MT evaluation: White 1993

• Early DARPA methodology.

• Adequacy

- How good is a translation, on a scale 1-3, 1-5?
- Is this an acceptable translation, yes or no?

• Fluency

- Count up syntactic, lexical, stylistic, orthographic errors.

• Utility

- Have readers take reading comprehension multiple-choice test after reading the translations. (More on this shortly.)


MT evaluation: Hovy 2002

• Framework for Machine Translation Evaluation

• Fidelity
- Scale from 1 to whatever: requires bilingual judges

• Readability
- Scale from 1 to 3
- Reading time measures
- Cloze test


Cloze test (Taylor 1953)

• Replace every nth word with ______ .

• Can a reader fill in the missing words?

• Percentage of words filled in correctly: a measure of readability.

When everything is said and _____, all of us, Republicans and _______ alike, all of us are _______; and we are all going _____ sink or swim together. We _____ moving through a perilous time. _____ with a terrible threat of __________, our Nation has embarked upon ____ great effort to help establish ____ kind of world in which _____ shall be secure. Peace is ___ goal-not peace at any price, ___ a peace based on freedom _____ justice. We are now in _____ midst of our effort to _____ that goal.

When everything is said and done, all of us, Republicans and Democrats alike, all of us are Americans; and we are all going to sink or swim together. We are moving through a perilous time. Faced with a terrible threat of aggression, our Nation has embarked upon a great effort to help establish the kind of world in which peace shall be secure. Peace is our goal-not peace at any price, but a peace based on freedom and justice. We are now in the midst of our effort to reach that goal.
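The cloze procedure above is simple enough to sketch in a few lines. The following is a minimal illustration, not Taylor's original protocol: the function names `make_cloze` and `cloze_score` are hypothetical, scoring assumes exact (case-insensitive) matches only, and `n=6` is chosen because the passage above appears to blank every sixth word.

```python
import re


def make_cloze(text, n=6, blank="_____"):
    """Blank out every nth word; return the cloze text and the removed words.

    Leading and trailing punctuation is kept in place so that "done,"
    becomes "_____," rather than losing the comma.
    """
    answers = []
    out = []
    for i, word in enumerate(text.split(), start=1):
        if i % n == 0:
            lead, core, trail = re.match(r"^(\W*)(.*?)(\W*)$", word).groups()
            answers.append(core)
            out.append(lead + blank + trail)
        else:
            out.append(word)
    return " ".join(out), answers


def cloze_score(answers, guesses):
    """Fraction of blanks the reader filled in with exactly the original word."""
    if not answers:
        return 0.0
    correct = sum(a.lower() == g.strip().lower() for a, g in zip(answers, guesses))
    return correct / len(answers)


if __name__ == "__main__":
    passage = "When everything is said and done, all of us, Republicans and Democrats alike"
    cloze_text, removed = make_cloze(passage, n=6)
    print(cloze_text)
    print(removed)
```

Exact-match scoring is the strictest choice; accepting synonyms would raise scores but requires a judgment call per blank.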


NIST MT evaluation: LDC 2005

• Fluency and adequacy

• Adequacy: how much of the source meaning is expressed in the translation?

- 5 = All, 4 = Most, 3 = Much, 2 = Little, 1 = None

• Fluency: how to describe the translation?

- 5 = Flawless English, 4 = Good English, 3 = Non-native English, 2 = Disfluent English, 1 = Incomprehensible

• Basically no other instructions (!)


Human ranking of MT output

• When comparing competing systems, have humans rank the output of several systems for the same input sentence.

• System with more high-ranked sentences is better than a system with more low-ranked sentences.

• Other interesting features to extract

- how quickly people rank
- how consistent they are when presented with the same set of candidate translations
- inter-rater reliability

• Lots of work on this by Chris Callison-Burch and colleagues as part of the SMT workshop.
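Inter-rater reliability, mentioned in the list above, is commonly summarized with a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is one way to compute it for two annotators who labeled the same items; the function name `cohens_kappa` and the toy labels are illustrative assumptions, and the slides do not say which statistic was used in the work cited.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items.

    kappa = (P_observed - P_expected) / (1 - P_expected), where P_expected
    is the agreement expected if both raters labeled at random according
    to their own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label: agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical "which translation is better?" judgments from two raters.
    rater_1 = ["sys1", "sys1", "sys2", "sys2"]
    rater_2 = ["sys1", "sys2", "sys1", "sys2"]
    print(cohens_kappa(rater_1, rater_2))
```

A kappa near 0 means the raters agree no more often than chance would predict, even if their raw agreement looks respectable.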


MT output ranking: Callison-Burch et al. 2007, 2008

• “Automatic measures are an imperfect substitute for human assessment of translation quality.”

• Instructions: Rank each whole sentence translation from Best to Worst relative to the other choices (ties are allowed).

• Much easier and more reliable than grading fluency and adequacy for individual sentences.

• I’ll show some correlations between human evaluations and automated evaluations at the end.


Translation edit rate (TER): Snover et al. 2006

• How many edits does a human have to make to create a fluent and adequate sentence from MT output?

• Edits:

- deletions
- insertions
- substitutions
- shifts (allow movement of strings of words)

• TER = # of edits / average # of reference words

• Note: can be approximated automatically.


Translation edit rate (TER)

Saudi Arabia denied this week information published in the American New York Times

This week the Saudis denied information published in the New York Times.

1 shift
2 substitutions
1 insertion
TER = 4/13 = 30.8%


Humans vs. the machine

• Native speakers are well equipped to evaluate language, but

- expensive
- time-consuming
- not necessarily reliable unless you present the task to them in a particular way

• Automatic evaluation metrics are necessary in order to measure incremental progress during development of an MT system.


Automatic MT evaluation metrics

• BLEU

• Meteor

• TER, TERplus

• GTM

• ParaEval

• Dependency overlap

• Semantic role overlap

• Metric combinations


BLEU: Papineni et al. 2002

• “n-gram overlap”

• Actually: geometric mean of “clipped” n-gram precision for n=1, n=2, n=3, and n=4, weighted by a brevity penalty.
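Written out, the description above corresponds to the standard BLEU definition with uniform weights over the four n-gram orders. The symbols are notation introduced here for illustration: p_n is the clipped n-gram precision, c the total candidate length, and r the effective reference length.

```latex
\[
\mathrm{BLEU} \;=\; \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{4} \tfrac{1}{4} \log p_n \right),
\qquad
\mathrm{BP} \;=\;
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\]
```

Because the precisions are combined as a geometric mean, a single p_n of zero drives the whole score to zero, which is one reason unsmoothed BLEU behaves poorly on individual sentences.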


BLEU: Papineni et al. 2002

Unigram precision: 7/7
Clipped unigram precision: 2/7
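The example sentences for these two numbers did not survive in the transcript. The figures match the well-known illustration in Papineni et al. (2002), so the sketch below assumes that example: a candidate consisting of "the" seven times, scored against two references. The function name `ngram_precision` is hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, references, n=1, clipped=True):
    """Return (matched, total) n-gram counts for a candidate.

    With clipped=True, each candidate n-gram is credited at most as many
    times as it occurs in any single reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    if clipped:
        matched = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    else:
        matched = sum(c for g, c in cand_counts.items() if g in max_ref_counts)
    return matched, sum(cand_counts.values())


if __name__ == "__main__":
    # Assumed example, taken from Papineni et al. (2002), not from the slide.
    candidate = "the the the the the the the".split()
    references = [
        "the cat is on the mat".split(),
        "there is a cat on the mat".split(),
    ]
    print(ngram_precision(candidate, references, clipped=False))  # (7, 7)
    print(ngram_precision(candidate, references, clipped=True))   # (2, 7)
```

Clipping is what stops a degenerate candidate from earning full credit by repeating one common word.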


Criticisms of BLEU

• BLEU is the de facto standard for MT evaluation right now.

• BLEU was reported to correlate well with human judgements (Papineni et al. 2002, Doddington 2002).

• Evidence that these correlations are not as high as previously thought (Callison-Burch 2006, Koehn and Monz, 2006).

• This is particularly true at the sentence level (Blatz et al. 2003).

• Good for measuring improvements within a system.

• Less than ideal for comparing performance across systems.


BLEU experiment

• 800 sentences in Urdu

• 4 different English translations of those sentences from 4 different Urdu speakers

god may grant you wisdom and vision .

may allah grant you sense and intellect .

may god give you a fresh mind and vision .

may allah bestow you with common sense and foresight .

Pairwise scores between the four reference translations:

        ref 1   ref 2   ref 3
ref 2   24.16
ref 3   17.33   18.75
ref 4   17.32   19.12   60.44


Meteor 1.3: Denkowski and Lavie (2011)

• Regularly updated: Banerjee and Lavie 2005, Lavie and Agarwal 2007, Agarwal and Lavie 2008

• Precision and recall of unigram overlap between reference and candidate but...

• ...allows matching of stems, synonyms, and paraphrases!

• Harmonic mean of precision and recall, multiplied by fragmentation penalty for reordering.

• Many parameters that can be tuned to correlate with human judgements.
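The scoring step described above can be sketched once the alignment is known. This is an illustration only: it takes the number of matched unigrams and the number of contiguous matched chunks as given rather than computing an alignment, and the default values for `alpha`, `beta`, and `gamma` are assumptions taken from the original Banerjee and Lavie (2005) formulation, not the tuned parameters of Meteor 1.3. The function name `meteor_like_score` is hypothetical.

```python
def meteor_like_score(matches, chunks, hyp_len, ref_len,
                      alpha=0.9, beta=3.0, gamma=0.5):
    """Weighted harmonic mean of precision and recall, discounted for fragmentation.

    matches: number of aligned unigrams (exact, stem, synonym, or paraphrase)
    chunks:  number of contiguous, identically ordered runs of matches;
             fewer chunks means less reordering
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # alpha close to 1 weights recall much more heavily than precision.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


if __name__ == "__main__":
    # Same six words matched, first in one block, then scattered in three.
    print(meteor_like_score(matches=6, chunks=1, hyp_len=6, ref_len=6))
    print(meteor_like_score(matches=6, chunks=3, hyp_len=6, ref_len=6))
```

The three parameters here are the kind of knobs the last bullet refers to: they can be fit so that scores track human judgments for a given language pair.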


Translation edit rate: TER, TERp

• TER can be calculated automatically (Snover et al. 2006)

- use dynamic programming to count deletions, substitutions, insertions

- use greedy search to find the minimum number of shifts

• TERplus (Snover et al. 2009)

- allows matching of stems, synonyms, paraphrases
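The dynamic-programming half of automatic TER can be sketched as a word-level edit distance. The code below deliberately omits the greedy shift search, so it is a shift-free approximation (closer to word error rate) and will generally report more edits than full TER on sentences with reordering. The function names are hypothetical, and dividing by the average reference length follows the TER definition given earlier.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn hyp into ref (classic dynamic programming, no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def shift_free_ter(hypothesis, references):
    """Edits to the closest reference divided by the average reference length."""
    hyp = hypothesis.lower().split()
    refs = [r.lower().split() for r in references]
    edits = min(word_edit_distance(hyp, r) for r in refs)
    average_ref_len = sum(len(r) for r in refs) / len(refs)
    return edits / average_ref_len


if __name__ == "__main__":
    hypothesis = "This week the Saudis denied information published in the New York Times"
    reference = ("Saudi Arabia denied this week information published "
                 "in the American New York Times")
    print(shift_free_ter(hypothesis, [reference]))
```

Without shifts, moving "this week" has to be paid for word by word, which is exactly the cost the shift operation is meant to avoid.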


Human-targeted automated techniques

• Instead of using existing translations as references, have a human create the appropriate reference for every candidate by editing the candidate until it sounds right.

• Then use TER, BLEU, or Meteor to compare the candidate with its specific human-generated reference.

• Only slightly less burdensome than doing actual human translation, but it might be more reliable.


Syntax for MT Evaluation

• Liu and Gildea (2005): Instead of counting n-gram overlap,

- count syntactic subtree (constituent) overlap
- count dependency relationship overlap

• Amigó et al. (2006): Dependency overlap.

• Owczarzak (2007): Like above, but match labeled LFG dependencies, allow approximate matches via WordNet.

• Gimenez and Marquez (2007): Overlap of general “linguistic elements” including POS, constituents, dependencies.


Semantics for MT evaluation

• Gimenez and Marquez (2007): semantic role overlap, combined with other linguistic elements.

• Padó et al. (2009): textual entailment in MT evaluation, combined with lexical features.


Extrinsic evaluation

• Jones (2005): Reading comprehension for evaluating MT. How well can a naive reader answer questions about a text that was machine translated?

• Schneider et al. (2010): Dialogue systems relying on MT.

• Wan et al. (2010): Multilingual summarization.

• Our very own Steven Bedrick has done lots of extrinsic evaluation of MT in a medical setting, which I think he will talk about in a few weeks.


Humans vs. the machines again


Humans vs. humans