CSA405: Advanced Topics in NLP Machine Translation I Introduction to MT

Upload: sybil

Post on 05-Jan-2016


Page 1: CSA405: Advanced Topics in NLP

CSA405: Advanced Topics in NLP

Machine Translation I

Introduction to MT


Jan 2005 CSA4050 MT I 2

Outline

• MT = Machine Translation

• Why MT is important

• What MT is and why MT is difficult

• MT and the Human Translator


Why Machine Translation is Important


Implications of Multilinguality

Number of Languages    Number of Language Pairs
2                      2
3                      6
10                     90
20                     380
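
The figures above count ordered (source, target) pairs: n languages give n × (n − 1) translation directions. A minimal sketch of that arithmetic, assuming each direction needs its own system; the function name is illustrative only.

```python
def translation_directions(n_languages: int) -> int:
    """Count ordered (source, target) pairs among n languages."""
    return n_languages * (n_languages - 1)


# Reproduce the rows listed on the slide.
for n in (2, 3, 10, 20):
    print(n, translation_directions(n))
```

The quadratic growth is the point: doubling the number of languages roughly quadruples the number of translation directions.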


Commercial Interest

• US has invested in MT for intelligence purposes

• MT is popular on the web - the most used of Google's special features

• EU spends more than €1B per annum on translation


Academic Interest

• Different NL technologies include
– parsing
– generation
– morphology
– pronoun resolution
– understanding ...


Misconceptions about MT

• MT is a waste of time because
– you will never make a machine that can translate Shakespeare.
– the quality of translation you can get from an MT system is very low.

• MT threatens the jobs of translators.

• MT systems are machines, and buying an MT system should be very much like buying a car.


Facts about MT

• There are many situations where the ability to produce reliable, if less than perfect, translations at high speed is valuable.

• MT systems can take over some of the boring, repetitive translation jobs and allow human translators to concentrate on more interesting specialist tasks.

• Building an MT system is an arduous and time-consuming job, involving the construction of grammars and very large monolingual and bilingual dictionaries.


The Place for MT

• Human Translators are good at
– Getting the right turn of phrase
– Preserving translation equivalence

• Human Translators are bad at
– Dictionary look-up
– Consistency of translation
– Translation of terminology

• MT can exploit these weaknesses


Summary

MT is important because
– There are too few human translators
– Availability of materials in the appropriate language has significant economic consequences.
– Scientifically, it is still one of the best test areas for language technology


Why Translation is Difficult


What Makes MT Hard

• Style and Meaning

• Word Order

• Word Sense

• Pronouns

• Tense

• Idioms


Style and Meaning

As recently as a decade ago it was widely believed that infectious disease was no longer much of a threat in the developed world. The remaining challenges to public health there, it was thought, stemmed from noninfectious conditions such as cancer, heart disease and degenerative diseases.

Il y a une dizaine d’années, on croyait que les pays industrialisés étaient débarrassés des risques liés aux maladies infectieuses et que la santé publique n’était menacée que par des maladies comme le cancer, les troubles cardiaques, et les anomalies génétiques.


Style and Meaning

English
• Two sentences
• infectious disease was no longer much of a threat in the developed world
• The remaining challenges to public health there
• noninfectious conditions

French
• One sentence
• les pays industrialisés étaient débarrassés des risques liés aux maladies infectieuses
• la santé publique n’était menacée que
• maladies


Different word orders

• English word order is subject - verb - object

• Japanese order is subject - object - verb
– English: IBM bought Lotus
– Japanese: IBM Lotus bought
– English: Reporters said IBM bought Lotus
– Japanese: Reporters IBM Lotus bought said
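
The reordering in these examples can be sketched as a recursive rearrangement of clauses. This is a toy illustration, not a real transfer component: it assumes the English sentence has already been parsed into hypothetical (subject, verb, object) tuples, where the object may itself be a clause.

```python
def to_sov(clause):
    """Linearize a (subject, verb, object) clause in subject-object-verb order.

    The object may be a plain word or a nested clause (another tuple).
    """
    subject, verb, obj = clause
    # An embedded clause is reordered first, then placed before the verb.
    obj_words = to_sov(obj) if isinstance(obj, tuple) else [obj]
    return [subject, *obj_words, verb]


simple = ("IBM", "bought", "Lotus")
embedded = ("Reporters", "said", simple)

print(" ".join(to_sov(simple)))    # IBM Lotus bought
print(" ".join(to_sov(embedded)))  # Reporters IBM Lotus bought said
```

The embedded example shows why word-by-word substitution is not enough: the verb "said" has to move past an entire clause.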


Word Sense Ambiguity

• Bank as in river

• Bank as in financial institution

• Plant as in tree

• Plant as in factory

• Different senses usually translate into different words
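
A small sketch of why sense matters for lexical choice, using English to French. The lexicon and the cue words are hypothetical toy data invented for illustration, and counting overlapping context words is only one simple way to pick a sense.

```python
# Hypothetical toy lexicon: (English word, sense) -> French translation.
LEXICON = {
    ("bank", "river"): "rive",
    ("bank", "financial institution"): "banque",
    ("plant", "tree"): "plante",
    ("plant", "factory"): "usine",
}

# Hypothetical cue words that suggest each sense.
CUES = {
    "river": {"river", "water", "shore"},
    "financial institution": {"money", "loan", "account"},
    "tree": {"leaves", "garden", "roots"},
    "factory": {"workers", "production", "machinery"},
}


def choose_sense(word, context_words):
    """Pick the sense whose cue words overlap most with the context."""
    senses = [sense for (w, sense) in LEXICON if w == word]
    if not senses:
        raise KeyError(word)
    context = {c.lower() for c in context_words}
    # With no overlapping cue, max() falls back to the first listed sense.
    return max(senses, key=lambda sense: len(CUES[sense] & context))


def translate(word, context_words):
    """Translate a word after resolving its sense from the context."""
    return LEXICON[(word, choose_sense(word, context_words))]


print(translate("bank", ["money", "loan"]))           # banque
print(translate("bank", ["river", "water"]))          # rive
print(translate("plant", ["workers", "production"]))  # usine
```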


Hutchins & Somers (1992)


Problems: Contextual Interpretation

OPEN

OPEN


Different Cultural Models

English: Health Insurance
German: Krankenversicherung
French: Assurance Maladie

English: validate
French: oblitérer


Differences in Marking of Semantic Information

• Head marking vs. dependent marking
– In English the possessive relation is marked on the dependent (the possessor): The man's house
– In Hungarian it is marked on the head (the possessed noun): The man house-his
– his house / sa maison

• Direction and manner of motion marking
– He ran into the room (English)
– He entered the room running (French)


Summary

• Translation is about more than equivalence of meaning.

• Translation may involve the resolution of ambiguity.

• Preservation of intention involves cultural background as well as linguistic knowledge.

• Translation is a hard problem – for humans let alone machines.


Similarities and Differences Between Languages

Differences
• Morphology
• Word order and syntactic structures
• Marking of semantic distinctions
• Lexical

Similarities
• Communicative function for survival
• Mechanisms for reference to people, eating, politeness, time.
• Syntactic complexity
• Nouns
• Verbs


Machine Translation and Human Translators


In the Beginning ... was the dream of FAMT

• Fully Automatic (High Quality) Machine Translation (Bar-Hillel 1960)

Source Language text → FAHQMT → Target Language text


FAMT

• Basic Characteristics
– No human intervention
– Arbitrary text

• Evaluation Criteria
– Quality of output
– Cost ($/page)
– Speed (pages/hour)


FAMT Success Story: TAUM METEO

• Written by Chevalier et al. 1978.

• Translation of weather reports from English to French

• Highly constrained subset of English:
– Small number of senses for each word
– Restricted syntactic constructions

• System determines whether a given sentence is within its capabilities

• Very fast, very accurate, no post-editing
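
The idea of a system deciding whether a sentence is within its capabilities can be sketched with a closed vocabulary. This is not how TAUM METEO itself worked; the word list and the vocabulary-only test are assumptions made for illustration, and a real sublanguage check would also look at syntactic constructions.

```python
# Hypothetical closed vocabulary for a toy weather sublanguage.
VOCABULARY = {
    "cloudy", "sunny", "with", "a", "chance", "of",
    "rain", "snow", "showers", "today", "tonight", "tomorrow",
}


def within_sublanguage(sentence, vocabulary=VOCABULARY):
    """Return True if every word of the sentence is in the closed vocabulary."""
    words = sentence.lower().replace(".", "").split()
    return bool(words) and all(word in vocabulary for word in words)


print(within_sublanguage("Cloudy with a chance of showers today."))  # True
print(within_sublanguage("The minister resigned today."))            # False
```

Sentences that fail such a check would be routed to a human translator rather than translated badly.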


FAMT: MORAL

• FAMT can work well but only if we give up one or more of the goals, e.g.
– Unrestricted text input
– High quality translation

• This observation has led to research on sub-languages

• And to the use of FALQT


FAMT is not the only way

• FAMT lies at one extreme of a continuum of ways in which technology can be brought to bear upon the translation problem

• At the other extreme there are word processing software, fax machines, and even mobile phones

• Between these two extremes there are other points of interest where technology can radically affect the productivity of the individual translator.


MAHT and HAMT

• Machine Aided Human Translation (MAHT)

• Human Aided Machine Translation (HAMT).

• The essential difference between these two lies not only in the way in which the person is involved but also in the extent of their involvement


MAHT - Translation Memories

• Systems consist of a database in which each source sentence of a translation is stored together with the target sentence (this is called a translation memory "unit")

• Each new source sentence is searched for in the database and a match value is calculated.

• When the match value is 100%, the translation of the source sentence from the database is inserted into the text being translated.


MAHT - Translation Memories

• If the match value is below 100% and above a certain user-definable percentage (i.e., "fuzzy match"), the old translation will be inserted as a translation proposal for the translator to review and edit.

• Sentences with match values below that margin have to be translated from scratch.

• New and changed translation proposals will then be stored in the database for future use.
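
The lookup procedure described above can be sketched as follows. This is a minimal illustration, not the algorithm of any particular TM product: the match value is assumed to be the character-level similarity ratio from Python's `difflib`, and the function name, the 0.75 threshold, and the sample memory are hypothetical.

```python
from difflib import SequenceMatcher


def lookup(source, memory, fuzzy_threshold=0.75):
    """Look a new source sentence up in a translation memory.

    Returns (status, proposal, score), where status is "exact", "fuzzy",
    or "none" (the sentence must be translated from scratch).
    """
    if source in memory:
        return "exact", memory[source], 1.0

    best_score, best_target = 0.0, None
    for stored_source, stored_target in memory.items():
        score = SequenceMatcher(None, source, stored_source).ratio()
        if score > best_score:
            best_score, best_target = score, stored_target

    if best_score >= fuzzy_threshold:
        return "fuzzy", best_target, best_score
    return "none", None, best_score


# Each entry is one translation memory unit: source sentence -> target sentence.
memory = {"Press the red button.": "Appuyez sur le bouton rouge."}

print(lookup("Press the red button.", memory)[0])       # exact
print(lookup("Press the green button.", memory)[0])     # fuzzy
print(lookup("The weather is nice today.", memory)[0])  # none
```

In the fuzzy case the stored translation is only a proposal: the translator still has to change "rouge" before the corrected unit is stored back in the memory.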


MAHT - Translation Memories – Advantages

• Avoid redoing translation of repeated material

• Use previous texts as a model for new translations

• Ensure consistency throughout a translation


MAHT - Translation Memories - Drawbacks

• If terminology changes between projects the content of a TM needs to be updated to reflect these changes.

• Blind faith in exact matches (without validation) can generate incorrect translation since there is no verification of the context where the new segment is used compared to where the original one was used.


MAHT - Translation Memories - Remarks

• Translation Process: TM tools may not easily fit into existing translation or localization processes; they work best where work can be signed off in pieces rather than as a whole.

• Customisation: TM tools rarely work straight out of the box. Menu adaptation and filters for desktop applications may require significant effort.

• Investment costs are high

• Setup and maintenance of TMs has to be factored in.

• OpenTag/TMX formats for exchanging TM data between competing systems


MAHT – Other Technology

• Communication/coordination amongst translators

• Integration of internet technologies and web services.

• Database technology, smart indexing, and networking

• Improvements can be achieved that are well within the scope of current technology.


HAMT – Human Assisted Machine Translation

• Machine retains the initiative but works in collaboration with human consultant.

• System translates autonomously until it recognises that a linguistic difficulty of a certain type has arisen, e.g.
– ambiguity
– pronoun reference
– unknown word
– unrecognised construction

• At this point it seeks help from the consultant.
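
The control flow described above can be sketched as a loop that only calls the human when it detects a difficulty. This is a toy word-by-word illustration under stated assumptions: the lexicon, the `ask_consultant` callback, and the scripted consultant are hypothetical, and only two of the listed difficulty types (ambiguity and unknown word) are detected.

```python
def translate_with_consultant(words, lexicon, ask_consultant):
    """Translate word by word, consulting a human only on difficulties.

    lexicon maps a source word to a list of candidate translations.
    ask_consultant(word, candidates) is called for unknown words (no
    candidates) and ambiguous words (more than one candidate).
    """
    output = []
    for word in words:
        candidates = lexicon.get(word, [])
        if len(candidates) == 1:
            # No difficulty: the machine keeps the initiative.
            output.append(candidates[0])
        else:
            # Difficulty detected: hand the decision to the consultant.
            output.append(ask_consultant(word, candidates))
    return output


# Hypothetical toy lexicon (English -> French).
lexicon = {"the": ["la"], "bank": ["rive", "banque"], "closed": ["a fermé"]}


def scripted_consultant(word, candidates):
    """Stand-in for a human: prefers "banque", keeps unknown words as-is."""
    if "banque" in candidates:
        return "banque"
    return candidates[0] if candidates else word


print(translate_with_consultant(["the", "bank", "closed"], lexicon, scripted_consultant))
# ['la', 'banque', 'a fermé']
```

Note that the consultant only needs to choose between candidates, which fits the claim that native competence may suffice.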


HAMT – Challenges

• Reliable identification/classification of difficulty.

• Reliable communication of difficulty to user.

• Tradeoff between quality and scope of translation.


HAMT - Advantages

• Modulo challenges – a high quality of translation can be guaranteed.

• Speed – if large sections of text can be translated automatically.

• Human consultant need not necessarily have all the skills of a human translator; native competence in one or both languages may suffice.


Summary

• Machine Translation is a continuum
– FAMT
– HAMT
– MAHT

• The utility of a given type of system cannot be assessed with very simple criteria

• The utility function involves at least the human cost, the machine cost, the quality of the result, and the nature of the translation requirements.


Some References

• Jonathan Slocum, Machine Translation: its History, Current Status, and Future Prospects, Proc ACL 1984, Stanford University, http://acl.ldc.upenn.edu/P/P84/P84-1116.pdf

• Martin Kay – Machine Translation, Computational Linguistics vol 11 numbers 2-3 1985.

• Richard Kittredge – Sublanguages, Computational Linguistics vol 11 numbers 2-3 1985.