cs460/626 : natural language processing/speech, nlp …cs626-460-2012/lecture_slides/cs626... ·...
TRANSCRIPT
-
CS460/626 : Natural Language Processing/Speech, NLP and the Web
(Lecture 31 Introduction to Machine Translation)
Pushpak BhattacharyyaCSE Dept., IIT Bombay
29th March, 2012
-
Introduction to MT Machine Translation (MT) is a technique to translate
texts from one natural language to another natural language using a machine
Translated text should have two desired properties:
Adequacy: Meaning should be conveyed correctly
Fluency: Text should be fluent in the target Fluency: Text should be fluent in the target language
Translation between distant languages is a difficult task
Handling Language Divergence is a major challenge
2
-
Kinds of MT Systems(point of entry from source to the target text)
Deep understanding level
Interlingual level
Ascendin
g transfe
rLogico-semantic level
Conceptual transfer
Semantic transfer
Ontological interlingua
Semantico-linguistic interlingua
SPA-structures (semantic& predicate-argument)
Ascendin
g transfe
r
Syntactico-functional level
Morpho-syntactic level
Syntagmatic level
Graphemic level Direct translation
Syntactic transfer (surface)
Syntactic transfer (deep)
Multilevel transfer
F-structures (functional)
C-structures (constituent)
Tagged text
Text
Mixing levels Multilevel description
Semi-direct translation Descending transfers
3
-
Why is MT difficult: Language Divergence
One of the main complexities of MT: Language Divergence
Languages have different ways of Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence
4
English-IL Language Divergence with illustrations from Hindi(Dave, Parikh, Bhattacharyya, Journal of MT, 2002)
-
Language Divergence Theory: Lexico-Semantic Divergences
Conflational divergence F: vomir; E: to be sick E: stab; H: churaa se maaranaa (knife-with hit) S: Utrymningsplan; E: escape plan
Structural divergence Structural divergence E: SVO; H: SOV
Categorial divergence Change is in POS category
Not communicative (AdjVerb)
5
-
Language Divergence Theory: Lexico-Semantic Divergences (cntd)
Head swapping divergence E: Prime Minister of India; H: bhaarat ke pradhaanmantrii (India-of Prime Minister)
Lexical divergence E: advise; H: paraamarsh denaa (advice give): NounIncorporation- very common Indian LanguageIncorporation- very common Indian LanguagePhenomenon
6
-
Language Divergence Theory: Syntactic Divergences
Constituent Order divergence
E: Singh, the PM of India, will address the nationtoday; H: bhaarat ke pradhaan mantrii, singh, (India-of PM, Singh)
Adjunction Divergence
E: She will visit here in the summer; H: vah yahaagarmii meM aayegii (she here summer-in will come)
E: She will visit here in the summer; H: vah yahaagarmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E: Who do you want to go with?; H: kisake saath aapjaanaa chaahate ho? (who with)
7
-
Language Divergence Theory: Syntactic Divergences
Null Subject Divergence
E: I will go; H: jaauMgaa (subject dropped)
Pleonastic Divergence
E: It is raining; H: baarish ho rahii haai (rainhappening is: no translation of it)happening is: no translation of it)
8
-
Language Divergence exists even for close cousins(Marathi-Hindi-English: case marking and postpositions do not transfer easily)
(immediate past) ? ! (M) ? (H) When did you come? Just now (I came) (H) kakhan ele? ei elaam/eshchhi (B)
(certainty in future) !
(certainty in future) ! (M) ! (H) He is in for a thrashing. (E) ekhan o maar khaabei/khaachhei
(assurance) ! . (M) $ (H) I will see you tomorrow. (E) aami kaal tomaar saathe dekhaa korbo/korchhi (B)
9
-
Can MT make a difference
Yes: MT is ubiquitoushttp://www.cfilt.iitb.ac.in/judicial/
http://www.cfilt.iitb.ac.in:11080/XeroxIITBSMT/Translate
http://translate.google.com
-
Helping a person in distress (1/3)
French Lady (FL): (in French, translated into English by Google translate) Do you know how far it is to the Bangalore you know how far it is to the Bangalore main bus stand from the airport?
Self: (in English, translated into French by GT) It is about 10 KM.
FL: Can I take a bus?
-
Helping a person in distress (2/3)
FL: I think that is expensive.
Self: Yes about 3000 Rupees, about 50 Self: Yes about 3000 Rupees, about 50 Euros.
FL: No, I will take a bus.
Self: Did you hurt your ankle. Do you need medical help?
-
Helping a person in distress (3/3)
Self: Should I look for a doctor?
FL: No, thank you, I will consult in FL: No, thank you, I will consult in Mysore.
All this conversation occurred via Google translation. The lady and I took turns to type in the sentences in respective translation boxes.
-
Translation is Ubiquitous
Between Languages
Delhi is the capital of India
Between dialects
Example next slide
Between registers
My mom not well.
My mother is unwell (in a leave application)
-
Between dialects (1/3)
Lage Raho Munnabhai: an excellent example
Scene: Munnabhai (Sanjay Dutt) is Prof. MurliPrasad Sharma being interviewed with some citizens asking questions in presence of Jahnavi(Vidya Baalan)(Vidya Baalan)
Question by citizen: , . $ .
-
Between dialects (2/3) Bapu from behind invisible to others:
Munnabhai
#
Bapu
% Munnabhai
full country % Bapu
* Munnabhai
,%-
-
Between dialects (3/3) Bapu
$ Munnabhai
( $ ,
Bapu
$ Munnabhai
/ 0 $ , , /, heart heart !
-
Interlingua Based MT
18
-
MT: EnConverion + Deconversion
EnglishHindi
Interlingua(UNL)
FrenchChinese
generation
Analysis
19
-
Challenges of interlingua generation
Mother of NLP problems - Extract meaning from a sentence!
Almost all NLP problems are sub-problems
Named Entity Recognition Named Entity Recognition
POS tagging
Chunking
Parsing
Word Sense Disambiguation
Multiword identification
and the list goes on...
-
UNL: a United Nations project
Started in 1996 10 year program 15 research groups across continents First goal: generatorsFirst goal: generators Next goal: analysers (needs solving various ambiguity
problems) Current active language groups
UNL_French (GETA-CLIPS, IMAG) UNL_Hindi (IIT Bombay with additional work on UNL_English)
UNL_Italian (Univ. of Pisa) UNL_Portugese (Univ of Sao Paolo, Brazil) UNL_Russian (Institute of Linguistics, Moscow) UNL_Spanish (UPM, Madrid)
21
-
World-wide Universal Networking Language (UNL) Project
UNL
English Russian
Marathi
22
Japanese
Hindi
Spanish
Language independent meaning representation.
Others
-
UNL represents knowledge: John eats rice with a spoon
Semantic relations
attributes
Universal words
Repositoryof 42SemanticRelations
and84 attributelabels
23
-
Hindi Generation from Interlingua (UNL)
(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani
MT Summit 2007)
-
HinD ArchitectureDeconversion = Transfer + Generation
-
Step-through the Deconverter Module Output
UNL Expression
obj(contact(
Lexeme Selection
contact farmer this you region taluka manchar khatav
Case Identification
* * *
contact farmer* this you region* taluka* manchar
you
this
farmer
agtobj
pur
plc
contact
orregion
nam
:01
khatav Morphology Generation
contact .@imperative farmer.@pl this you region
taluka manchar Khatav
Function Word Insertion
contact farmers this for you region
or taluka of Manchar Khatav
Linearization This for you manchar region or khatav
| taluka of farmers contact
nam
or
khatav
manchar
taluka
nam
-
Lexeme Selection[ ]{}"contact(icl>communicate(agt>person,obj>person)) (V,VOA,VOA-
ACT,VOA-COMM,VLTN,TMP,CJNCT,N-V,link,Va)
[ ]{}"contact(icl>representative)
(N,ANIMT,FAUNA,MML,PRSN,Na)
Lexical Choice is unambiguousLexical Choice is unambiguous
you
this farmer
agtobj contact
pur
-
Case Marking
Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V @past N
Depends on UNL Relation and the properties of the nodes Depends on UNL Relation and the properties of the nodes Case get transferred from head to modifiers
you
this *farmer
agtobj contact
pur
-
Morphology Generation: NounsThe boy saw me.
Boys saw me.
The King saw me.
Kings saw me.
Suffix Attribute values
uoM @N,@NU,@M,@pl,@oblique
U @N,@NU,@M,@sg,@oblique I @N,@NI,@F,@sg,@oblique iyoM @N,@NI,@F,@pl,@oblique
oM @N,@NA,@NOTCH,@F,@pl,@oblique
Kings saw me.
-
Verb MorphologySuffix Tense Aspect Mood N Gen P V
E
-e rahaa thaa
@past @progress - @sg @male 3rd e
-taa hai @present @custom - @sg @male 3rd -
-iyaa thaa @past @complete - @sg @male 3rd I
saktii hain @present - @ability @pl @female 3rd A
-
After Morphology Generation
you
this
farmer
agtobj contact
pur
you
this
farmer
agtobjcontact
pur
-
Function Word Insertion
Rel Par+ Par- Chi+ Chi- Ch/FW
Obj V VINT N#ANIMT @topic
you
this
farmer
agtobj contact
pur
Obj V VINT N#ANIMT @topic
-
Linearization
This you manchar region
you
this
farmer
agtobj
pur
plc
Contact
khatav taluka farmers
contact
nam
orregion
khatav
manchar
taluka
nam
:01
-
Syntax Planning: Assumptions The relative word order of a
UNL relations relata does not depend on:
Semantic Independence: the semantic properties of the this
contact
pur
semantic properties of the relata.
Context Independence: the rest of the expression.
The relative word order of various relations sharing a relatum does not depend on
Local Ordering: the rest of the expression.
this
you
this
farmer
agtobj
contact
pur
-
you
this
farmer
agtobj contact
pur
Syntax Planning: Strategy Divide a nodes relations in
Before_set = {obj,pur,agt}
After_set = {}
agt aoj obj ins
Agt L L L
Aoj L L
Obj R
Ins
obj pur
agt
Topo Sort each group: pur agt objthis you farmer
Final order: this you farmer contact
region
khatav
manch
taluka
-
Syntax Planning AlgoStack Before
Current
After
Current
Output
region
farmer
contact
this
you
manchar taluka
you
this
farmer
agtobj
pur
plc
Contact
manchar taluka
manchar
region
taluka
farmer
contact
region
taluka
farmer
Contact
this
you
manchar
taluka
farmer
contact
this
you
manchar
region
plc
nam
orregion
khatav
manchar
taluka
nam
:01
-
All Together (UNL -> Hindi) Module Output
UNL Expression
obj(contact(..
Lexeme Selection
contact farmer this you region taluka manchar khatav
Case Identification
* * *
contact farmer* this you region* taluka* manchar Identification
contact farmer* this you region* taluka* manchar
khatav Morphology Generation
contact .@imperative farmer.@pl this you region
taluka manchar Khatav
Function Word Insertion
contact farmers this for you region
or taluka of Manchar Khatav
Syntax Linearization
This for you manchar region or khatav
| taluka of farmers contact
-
How to Evaluate UNL Deconversion
UNL -> Hindi
Reference Translation needs to be generated from UNL
Needs expertise Needs expertise
Compromise: Generate reference translation from original English sentence from which UNL was generated
Works if you assume that UNL generation was perfect
Note that fidelity is not an issue
-
Manual Evaluation Guidelines
Fluency of the given translation is:(4) Perfect: Good grammar(3) Fair: Easy-to-understand but flawed grammar(2) Acceptable: Broken - understandable with effort(1) Nonsense: Incomprehensible(1) Nonsense: Incomprehensible
Adequacy: How much meaning of the reference sentence isconveyed in the translation:(4) All: No loss of meaning(3) Most: Most of the meaning is conveyed(2) Some: Some of the meaning is conveyed(1) None: Hardly any meaning is conveyed
-
Sample Translations
Hindi Output Fluency Adequacy
0.5 ! 10 )
2 3
,- / 0 2 34 3 4,- / 0 2 34 6 2 8 ,
3 4
8 3< 3<
1 1
)> 8 4 4
- B D
2 3
-
Results BLEU Fluency Adequacy
Geometric Average 0.34 2.54 2.84 Arithmetic Average 0.41 2.71 3.00 Standard Deviation 0.25 0.89 0.89 Pearson Cor. BLEU 1.00 0.59 0.50
Pearson Cor. Fluency 0.59 1.00 0.68
Fluency vs Adequacy
315 3 4
34
165
2662
155138
210 12
131
168
0
50
100
150
200
1 2 3 4
Fluency
Nu
mb
er o
f s
ente
nce
s
Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4
Good Correlation between Fluency and BLUE Strong Correlation betweenFluency and Adequacy Can do large scale evaluation using Fluency alone
Caution: Domain diversity, Speaker diversity
-
Summary: interlingua based MT
English to Hindi
Rule governed
High level of linguistic expertise needed High level of linguistic expertise needed
Takes a long time to build (since 1996)
But produces great insight, resources and tools
42