machine translation - the university of edinburgh · • machine translation was one of the first...

54
Machine Translation Chris Callison-Burch 4 December 2003

Upload: buikiet

Post on 05-Jul-2018

245 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Machine TranslationChris Callison-Burch4 December 2003

Page 2: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Outline

• What is machine translation?

• Why is it difficult?

• How have people approached it in the past?

• How do we evaluate translation quality?

• What is Linear B's approach to MT?

Page 3: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

パキスタンのイスラム神学校は、タリバン勢力やテロ組織アル・カーイダに連なる過激派の温床として知られるが、東南アジア出身者が摘発されたのは初めて。しかし、JIの内部では、パキスタンの神学校で過激思想の理論的支柱を確立し、アフガニスタンなどで軍事訓練を受けるのが上級幹部の地位に就くための必須条件とされてきた。

Although the Islamic theological college of Pakistan is known as a hotbed of the extremists who stand in a row in the Taliban influence or terrorism organization Al KAIDA, it is the first. [ of the Southeast Asia graduate person having been exposed ] However, inside JI, the theoretical support of excessive thought was established in the theological college of Pakistan, and it has considered as the indispensable condition for receiving military training in Afghanistan etc. taking upper management's post.

What is Machine Translation?

Page 4: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

• Machine translation was one of the first applications envisioned for computers

• Warren Weaver (1949)“I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need to do is strip off the code in order to retrieve the information contained in the text.”

• First demonstrated by IBM in 1954 with a basic word-for-word translation system.

A long history

Page 5: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

• U.S. has invested in MT for intelligences purposes

• MT is popular on the web -- it is the most used of Google's special features

• EU spends more than €1,000,000,000 on translation costs each year. (Semi-) automating that could lead to huge savings

Commercial interest

Page 6: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Academic interest

• Machine translation requires many other NLP technologies

• Potentially: parsing, generation, word sense disambiguation, named entity recognition, transliteration, pronoun resolution, natural language understanding, and real-world knowledge

Page 7: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Outline

• What is machine translation?

• Why is MT difficult?

• How have people approached it in the past?

• How do we evaluate translation quality?

• What is Linear B's approach to MT?

Page 8: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

What makes MT hard?

• Word order

• Word sense

• Pronouns

• Tense

• Idioms

Page 9: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Differing word orders

• English word order is subject - verb - objectJapanese order is subject - object - verb

• English: IBM bought LotusJapanese: IBM Lotus bought

• English: Reporters said IBM bought LotusJapanese: Reporters IBM Lotus bought said

Page 10: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Word sense ambiguity

• `Bank' as in river`Bank' as in financial institution

• `Plant' as in a tree`Plant' as in a factory

• Different word senses will likely translate into different words in another language

Page 11: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Problem of pronouns

• Some languages like Spanish can drop subject pronouns

• In Spanish the verbal inflection often indicates which pronoun should be restored-o = I-as = you-a = he / she / it-amos = we

-an = they

• When should we use `she' or `he' or `it'?

Page 12: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Different tenses

• Spanish has two versions of the past tense:one for a definite time in the past, and one for an unknown time in the past

• When translating from English to Spanish we need to choose which version of the past tense to use

Page 13: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Idioms

• "to kick the bucket'' means "to die''

• "a bone of contention" does not have anything to do with skeletons

• "a lame duck", "tongue in cheek", "to cave in"

Page 14: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Outline

• What is machine translation?

• Why is it difficult?

• How have people approached it in the past?

• How do we evaluate translation quality?

• What is Linear B's approach to MT?

Page 15: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Various approaches

• Word-for-word translation

• Syntactic transfer

• Interlingual approaches

• Controlled language

• Example-based translation

• Statistical translation

Page 16: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Word-for-word translation

• Use a machine-readable bilingual dictionary to translate each work in a text

• Advantages: Easy to implement, results give a rough idea about what the text is about

• Disadvantages: Problems with word order means that this results in low-quality translation

Page 17: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Syntactic transferS

NP (SUBJ) VP

VS (OBJ)

NP (SUBJ) VP

VNP (OBJ)

Reporters

said

IBM

boughtLotus

S

NP (SUBJ) VP

V S (OBJ)

NP (SUBJ) VP

V NP (OBJ)

Reporters

said

IBM

bought Lotus

• Parse the sentence

• Rearrange constituents

• Then translate the words

Page 18: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Syntactic transfer

• Advantages:Deals with the word-order problem

• Disadvantages: - Must construct grammars for each language that you deal with- Sometimes there is syntactic mis-match

• Example:English: The bottle floated into the caveSpanish: La botella entro a la cuerva flotando = The bottle entered the cave floating

Page 19: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Interlingua

• Assign a logical form to sentences

• John must not go = OBLIGATORY(NOT(GO(JOHN)))John may not go = NOT(PERMITTED(GO(JOHN)))

• Use logical form to generate a sentence in another language

Page 20: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Interlingua

• Advantages: Single logical form means that we can translate between all languages and only write a parser/generator for each language once

• Disadvantages:Difficult to define a single logical form. English words in all capital letter probably won't cut it.

Page 21: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Controlled language

• Define a subset of a language which can be used to compose text to be translated

• Issued editorial guidelines that limit each word to only one word sense, and which forbid certain difficult constructions

• Apply syntactic transfer or interlingual approaches

Page 22: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Controlled language

• Advantages: Results in more reliable, higher quality translation for subset of language that it deals with

• Disadvantages: Does not cover all language use, so can only be applied in limited settings

Page 23: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Example-based MT

• Uses a translation memory or parallel corpus as a starting point

• When a human translator types a sentence that is similar to one in the memory, it is retrieved

• Some rules/heuristics to change the sentence to match the new sentence

Page 24: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Parallel corpusA-t-on acheté les actions ou

les biens des entreprises

nationalisées?

Quel était le genre de

travaux exécutés aux

termes de ces contrats?

Le recours est rejeté

comme manifestement

irrecevable

Les propositions ne seront

pas mises en application

maintenant.

La République française

supportera ses propres

dépens

Production domestique

exprimée en pourcentage

de l'utilisation domestique

La séance est ouverte à 2

heures.

. . .

Have the shares or properties

of nationalized companies

been purchased?

What was the nature of the

work performed under these

contracts?

The action is dismissed as

manifestly inadmissible

The proposal will not now be

implemented.

France was ordered to bear

its own costs

Domestic output as a % of

domestic use

The House met at 2 p.m.

. . .

Source Translation

Page 25: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Example-based MT

• Advantages: Uses human translations which are higher quality than machine translations

• Disadvantages: May have limited coverage depending on the size of the translation memory, and flexibility of heuristics

Page 26: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Statistical machine translation

• Find most probable English sentence given a French sentence

• Probabilities are determined automatically by training a statistical model using a parallel corpus

Page 27: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Statistical machine translation

• Advantages:- Has a way of dealing with lexical ambiguity- Can deal with idioms that occur in the training data- Requires minimal human effort- Can be created for any language pair that has enough training data

• Disadvantages:Does not explicitly deal with syntax

Page 28: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Outline

• What is machine translation?

• Why is it difficult?

• How have people approached it in the past?

• How do we evaluate translation quality?

• What is Linear B's approach to MT?

Page 29: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Human Evaluation

• Have people evaluate the quality of translations using some scale

• For example:- Almost perfect- Understandable- Barely understandable - Incomprehensible

Page 30: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Human Evaluation

• Expensive and not re-usable

• We probably want to be able to make changes to our system, or to evaluate new systems

• Automated evaluated can provide this

Page 31: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Automated Evaluation

• Compare MT translation to a reference human translation

• Measure the similarity between the translations

• Possible evaluation metric: word error rate

• Problem with WER: assumes linearity, and assumes that one perfect translation exists

Page 32: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Automated Evaluation

• Use multiple reference translations

• Look for n-grams that occur anywhere in the sentence

• `Bleu' evaluation metric does just that

Page 33: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Example BleuR1: It is a guide to action that ensures that the military will forever heed Party commands.R2: It is the Guiding Principle which guarantees the military forces always being under the command of the Party.R3: It is the practical guide for the army always to heed the directions of the party.

C1: It is to insure the troops forever hearing the activity guidebook that party direct.C2: It is a guide to action which ensures that the military always obeys the command of the party.

Page 34: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Example BleuR1: It is a guide to action that ensures that the military will forever heed Party commands.R2: It is the Guiding Principle which guarantees the military forces always being under the command of the Party.R3: It is the practical guide for the army always to heed the directions of the party.

C1: It is to insure the troops forever hearing the activity guidebook that party direct.

Page 35: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Example BleuR1: It is a guide to action that ensures that the military will forever heed Party commands.R2: It is the Guiding Principle which guarantees the military forces always being under the command of the Party.R3: It is the practical guide for the army always to heed the directions of the party.

C2: It is a guide to action which ensures that the military always obeys the command of the party.

Page 36: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Automated evaluation

• Because C2 has more n-grams and longer n-grams than C1 it receives a higher score

• Bleu has been shown to correlate with human judgments of translation quality

• Bleu has been adopted by DARPA in its annual machine translation evaluation

Page 37: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Multi-reference Evaluation

• Pang and Knight (2003) suggest using multi-reference evaluation to do exact match

• Combine references into a word graph with hundreds of paths through it

• Most paths correspond to good sentences

at

during

last

the

least

last

week

battle

fighting

were

died

lost

killed

in

their

12

persons

*e*

people

the

last

weekbattle

fighting

lives

in

last

week

fighting

’s

at

’s

least

fighting

fight

died

in

peopletwelve

the

at

least

died

week

fighting’s

people12

killed

took

the

at

of

lastweek

lives

least

of

twelve people

12

lives

persons

*e*

Page 38: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Evaluation drivesMT research

• Metrics can drive the research for the topics that they evaluate

• Bleu has lead to a focus on phrase-based translation

• Other metrics may similarly change the community's focus

Page 39: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Outline

• What is machine translation?

• Why is it difficult?

• How have people approached it in the past?

• How do we evaluate translation quality?

• What is Linear B's approach to MT?

Page 40: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

• Data-driven, statistical machine translation

• Language independent

• Can be deployed quickly

• Results in higher quality translation in a specific domain

Linear B’s approach

Page 41: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

• Convert source language into intermediate representation

• Convert the source language representation into equivalent target language representation

• Generate a sentence in the target language

Conventional Approaches to MT

Hand written rules are used to:

Page 42: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

• Requires huge staff of linguists and language experts to hand-craft rules

• Takes decades of person-years to create software for new languages

• Often fails on idiomatic language

Problems

Page 43: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

• Is automatically learned from example translations rather than hand crafted by linguists

• Therefore it can be produced in a matter of months rather than years, and is adaptable

Statistical machine translation

Page 44: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

• L'honorable Leonard J. Gustafson: Honorables sénateurs, tandis que la guerre en Irak entre dans sa troisième semaine, nous ne devons pas oublier qu'il faut prendre des mesures pour éviter une crise humanitaire dans la population civile. À cet égard, il y a de nombreux domaines dans lesquels les Canadiens doivent faire preuve de leadership.Maintenant que la guerre fait rage en Irak et que des pénuries de produits alimentaires et de fournitures médicales commencent à se produire, les pays qui ont accès à des ressources ont la responsabilité d'essayer de minimiser les effets négatifs du conflit sur la population irakienne. Je crois personnellement que le Canada devrait jouer un plus grand rôle à cet égard.Au Canada, nous disposons en abondance de blé et d'autres produits alimentaires. Nous pouvons également fournir des articles médicaux aux citoyens de l'Irak. Le Canada a une fière réputation pour ce qui est d'offrir une aide humanitaire aux gens quand ils en ont besoin.

Example translation

Page 45: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

• The Honourable Leonard J. Gustafson: Honourable senators, while the war in Iraq extending into a third week, the minister should not forget the one we must take some steps to prevent a crisis humanitarian face in the civilian populations. In this regard, there are a number of areas in which the Canadians must concentrate to show leadership.That the war is raging in Iraq and that a shortage of food and medical supplies are beginning to take place, those countries that have access to the resources are the responsibility for doing try to minimize the negative effects on workers on the people Irakienne. I personally believe that Canada should play a more significant role in this regard.In Canada, we have in abundance of wheat, as have other food products. We can also provision of medical articles to the people on the other Iraq. Canada has a proud record in so far, as would be to give humanitarian assistance to people when they need it.

Example translation

Page 46: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Parallel corpusA-t-on acheté les actions ou

les biens des entreprises

nationalisées?

Quel était le genre de

travaux exécutés aux

termes de ces contrats?

Le recours est rejeté

comme manifestement

irrecevable

Les propositions ne seront

pas mises en application

maintenant.

La République française

supportera ses propres

dépens

Production domestique

exprimée en pourcentage

de l'utilisation domestique

La séance est ouverte à 2

heures.

. . .

Have the shares or properties

of nationalized companies

been purchased?

What was the nature of the

work performed under these

contracts?

The action is dismissed as

manifestly inadmissible

The proposal will not now be

implemented.

France was ordered to bear

its own costs

Domestic output as a % of

domestic use

The House met at 2 p.m.

. . .

Source Translation

Page 47: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Word-level alignments

France

was

ordered

La

République

française

supportera

sesto bear

its

own

costs

propres

dépens

Page 48: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Bootstrapping phrases

France

was

ordered

La

République

française

supportera

sesto bear

its

own

costs

propres

dépens

• ses = its, propres = own, dépens = costs

• ses propres = its own, propres dépens = own costs

• ses propres dépens = its own costs

Page 49: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Translation Units

• Phrase-based translation uses longer segments of human translated text

• Longer segments results in higher quality translation

Je crois personnellement que le Canada devrait jouer un plus grand rôle à cet égard.

I personally believe that Canada should play a more significant role in this regard.

Page 50: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Translation Units

• Phrase-based translation blurs the boundaries between statistical and example-based translation

À cet égard , il y a de nombreux domaines dans lesquels les

In this regard , there are a number of areas in which the

Canadiens doivent faire preuve de leadership.

Canadians must concentrate to show leadership.

Page 51: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Data increases quality

0.16

0.18

0.2

0.22

0.24

0.26

0.28

10k 20k 40k 80k 160k 320k

Ble

u S

core

Training Corpus Size (number of words)

IBM Model 4Joint Probability Phrase Model

Word Alignment Induced Phrase Model

Page 52: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

Minimal Data

• More data leads to better quality translations

• Minimal data results in poor quality translations

• Data requirements for statistical translation are a problem for ``low density'' languages

Page 53: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

How much to create data from scratch?

• Germann (2001) estimated costs for building a Tamil-English corpus

• Cost for professional translation in the US = $0.36 per word

• 1,500,000 sentences * 20 words/sent * 0.36 $/word = $10 million.

Page 54: Machine Translation - The University of Edinburgh · • Machine translation was one of the first ... basic word-for-word translation system. ... command of the Party. R3:

My research

• I am looking at ways to reduce the cost of this through the use of Active Learning