Source: ltrc.iiit.ac.in/iasnlp2012/slides/pushpak/wordnet-wsd… (7/13/2012)
Wordnet and Word Sense Disambiguation
Lecture delivered at the summer school on NLP, IIIT Hyderabad, 10 July 2012
by Pushpak Bhattacharyya
Computer Science and Engineering Department, IIT Bombay
{pb}@cse.iitb.ac.in
Background
NLP: Two pictures
[Figure: the NLP Trinity. Three axes: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Language (Hindi, Marathi, English, French) and Algorithm (HMM, MEMM, CRF; Statistics and Probability + Knowledge Based); NLP sits alongside Vision and Speech.]
NLP: Thy Name is Disambiguation
• I went with my friend to the bank to withdraw some money, but was disappointed to find it closed
• POS disambiguation: bank (NN/VM), withdraw (NN/VM), closed (JJ/VM)
• Sense Disambiguation: bank (finance/place), withdraw (take out/go_away)
• Co-reference disambiguation: it (bank/money), I, friend
• Pro-drop disambiguation: "… was disappointed …" (the pronoun I is dropped)
• Scope disambiguation: with (my friend / my friend to the bank)
Where there is a will,
There is a way.
Where there is a will,
There are hundreds of relatives.
(two readings of "will": determination vs. testament)
Stages of processing
• Phonetics and phonology
• Morphology
• Lexical Analysis
• Syntactic Analysis
• Semantic Analysis
• Pragmatics
• Discourse
Lexical Disambiguation
First step: Part of Speech disambiguation
• Dog as a noun (animal)
• Dog as a verb (to pursue)
Sense disambiguation
• Dog (as animal)
• Dog (as a very detestable person)
Needs word relationships in a context
• The chair emphasised the need for adult education
Very common in day-to-day communication:
• Satellite channel ad: Watch what you want, when you want (two senses of watch)
• e.g., ground-breaking ceremony / ground-breaking research
Lexical Analysis
• Essentially refers to dictionary access and obtaining the properties of the word, e.g. dog:
 - noun (lexical property)
 - takes -s in plural (morph property)
 - animate (semantic property)
 - 4-legged (semantic property)
 - carnivore (semantic property)
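In code, this kind of lexical analysis is little more than a keyed lookup; the sketch below uses an invented toy lexicon holding the slide's properties for dog:

```python
# Toy illustration of lexical analysis as dictionary access.
# The lexicon contents and field names are invented for this sketch.
LEXICON = {
    "dog": {
        "pos": "noun",                     # lexical property
        "morph": "takes -s in plural",     # morphological property
        "semantic": ["animate", "4-legged", "carnivore"],  # semantic properties
    }
}

def lookup(word):
    """Return the stored properties of a word, or None if it is absent."""
    return LEXICON.get(word.lower())
```

The challenge the slide names (word sense disambiguation) starts exactly where this lookup ends: the entry retrieved says nothing about which sense of an ambiguous word is active.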
Challenge: Lexical or word sense disambiguation
Ambiguity of Multiwords
• The grandfather kicked the bucket after suffering from cancer.
• This job is a piece of cake
• Put the sweater on
• He is the dark horse of the match
Google Translations of above sentences:
दादा कैंसर से पीड़ित होने के बाद बाल्टी लात मारी। ("kicked the bucket" rendered literally)
इस काम के केक का एक टुकड़ा है। ("piece of cake" rendered literally)
स्वेटर पर रखो। ("put on" rendered literally)
वह मैच के अंधेरे घोड़ा है। ("dark horse" rendered literally)
• Bengali: চঞ্চল সরকার বাড়িতে আছে
English (Google): Government is restless at home. (*)
Intended: Chanchal Sarkar is at home. ("Sarkar" is both a surname and the word for government)
• Hindi: दैनिक दबंग दुनिया
English (Google): everyday bold world
Actually the name of a Hindi newspaper in Indore.
• High degree of overlap between NEs and MWEs
• Treat differently: transliterate, do not translate.
Ambiguity of Named Entities
Challenges in Syntactic Processing: Structural Ambiguity
• Scope
 1. The old men and women were taken to safe locations
  (old men and women) vs. ((old men) and women)
 2. No smoking areas will allow hookahs inside
• Preposition Phrase Attachment• I saw the boy with a telescope
(who has the telescope?)• I saw the mountain with a telescope
(world knowledge: mountain cannot be an instrument of seeing)
• I saw the boy with the pony-tail(world knowledge: pony-tail cannot be an instrument of seeing)
Ubiquitous: newspaper headline "20 years later, BMC pays father 20 lakhs for causing son's death" (who caused the death?)
Textual Humour (1/2)
1. Teacher (angrily): Did you miss the class yesterday?
 Student: Not much.
2. A man coming back to his parked car sees the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill.
3. Son: Mother, I broke the neighbour's lamp shade.
 Mother: Then we have to give them a new one.
 Son: No need, aunty said the lamp shade is irreplaceable.
4. Ram: I got a Jaguar car for my unemployed youngest son.
 Shyam: That's a great exchange!
5. Shane Warne should bowl maiden overs, instead of bowling maidens over.
Textual Humour (2/2)
• It is not hard to meet the expenses nowadays; you find them everywhere.
• Teacher: What do you think is the capital of Ethiopia?
 Student: What do you think?
 Teacher: I do not think, I know.
 Student: I do not think I know.
Example of WSD
• Operation, surgery, surgical operation, surgical procedure, surgical process -- (a medical procedure involving an incision with instruments; performed to repair damage or arrest disease in a living body; "they will schedule the operation as soon as an operating room is available"; "he died while undergoing surgery") TOPIC->(noun) surgery#1
• Operation, military operation -- (activity by a military or naval force (as a maneuver or campaign); "it was a joint operation of the navy and air force") TOPIC->(noun) military#1, armed forces#1, armed services#1, military machine#1, war machine#1
• Operation -- ((computer science) data processing in which the result is completely specified by a rule (especially the processing that results from a single instruction); "it can perform millions of operations per second") TOPIC->(noun) computer science#1, computing#1
• mathematical process, mathematical operation, operation --((mathematics) calculation by mathematical methods; "the problems at the end of the chapter demonstrated the mathematical processes involved in the derivation"; "they were learning the basic operations of arithmetic") TOPIC->(noun) mathematics#1, math#1, maths#1
IS WSD NEEDED IN LARGE APPLICATIONS?
Word ambiguity → topic drift in IR
Query: "Madrid bomb blast case"
 {case, container}
 {case, suit, lawsuit}
 {suit, apparel}
Drifted topic due to expanded term!
Drifted topic due to inapplicable sense!
[Bar chart: error percentages due to various factors (transliteration, translation disambiguation, stemmer, dictionary, ranking) for Hindi-English and Marathi-English CLIR; our observations, CLEF 2007. Bars range from 0 to 43.75%.]
How about WSD and MT?
Zaheer Khan, the India fast bowler, has been ruled out of the remainder of the series against England.
He will return to India and will be replaced by left-arm seamer RP Singh.
Zaheer picked up a hamstring injury during the first Test at Lord's.
He had been withdrawn from the squad for India's recent Test series in the West Indies due to a right ankle injury.
भारत के तेज गेंदबाज, जहीर खान, इंग्लैंड के खिलाफ श्रृंखला के शेष के बाहर शासन किया गया है। (ruled in the administrative sense??)
वह भारत लौटने और बाएँ हाथ के तेज गेंदबाज आरपी सिंह द्वारा प्रतिस्थापित किया जाएगा।
जहीर लॉर्ड्स में पहले टेस्ट के दौरान हैमस्ट्रिंग चोट उठाया। (lifted??)
वह भारत की वेस्ट इंडीज में हाल ही में एक सही (correct??) टखने की चोट के कारण टेस्ट श्रृंखला के लिए टीम से वापस ले लिया गया था।
Wordnet
Psycholinguistic Theory
• Human lexical memory stores nouns as a hierarchy.
• Can a canary sing? - Fast response.
• Can a canary fly? - Slower response.
• Does a canary have skin? - Slowest response.
[Figure: hierarchy Animal (can move, has skin) → Bird (can fly) → canary (can sing)]
Wordnet: a lexical reference system based on psycholinguistic theories of human lexical memory.
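The response-time pattern falls out of a network in which each property is stored at the most general node it applies to; a lookup then has to climb the IS-A chain, and the hop count mirrors the measured latencies. This is a toy sketch, not a cognitive model:

```python
# Minimal inheritance network in the Collins-and-Quillian spirit.
# Properties live at the most general applicable node and are inherited.
ISA = {"canary": "bird", "bird": "animal"}
PROPS = {
    "canary": {"can sing"},
    "bird": {"can fly"},
    "animal": {"can move", "has skin"},
}

def has_property(concept, prop):
    """Return the number of IS-A hops needed to verify prop, or -1 on failure."""
    hops = 0
    node = concept
    while node is not None:
        if prop in PROPS.get(node, set()):
            return hops
        node = ISA.get(node)
        hops += 1
    return -1
```

"Can a canary sing?" is answered at the canary node itself (0 hops), "fly" one level up, "has skin" two levels up, matching the fast/slower/slowest ordering on the slide.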
Essential Resource for WSD: Wordnet
[Table: the lexical matrix of word meanings M1…Mm (rows) against word forms F1…Fn (columns); an entry Ei,j means form Fj expresses meaning Mi. The form "bank" appears against several meanings (the "depend/rely" sense, the "embankment" sense, …), showing polysemy; a meaning with several entries (e.g. bank, rely) shows synonymy.]
Wordnet: History
• The first wordnet in the world was for English, developed at Princeton over 15 years.
• EuroWordNet, a linked structure of European-language wordnets, was built over 3 years around 1998, with funding from the EC as a mission-mode project.
• The wordnets for Hindi and Marathi being built at IIT Bombay are amongst the first Indian-language wordnets.
• All of these are proposed to be linked into the IndoWordNet, which eventually will be linked to the English and Euro wordnets.
Basic Principle
• Words in natural languages are polysemous.
• However, when synonymous words are put together, a unique meaning often emerges.
• Use is made of Relational Semantics.
Lexical and Semantic relations in wordnet
1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy
1, 3 and 5 are lexical (word to word); the rest are semantic (synset to synset).
WordNet Sub-Graph
[Figure: WordNet sub-graph around the synset {house, home} with gloss "a place that serves as the living quarters of one or more families". Hypernymy: {dwelling, abode}; hyponymy: hermitage, cottage; meronymy: bedroom, kitchen, guestroom, veranda, backyard; the word "study" is reached through the gloss.]
Fundamental Design Question
• Syntagmatic vs. paradigmatic relations?
• Psycholinguistics is the basis of the design.
• When we hear a word, many words come to our mind by association.
• For English, about half of the associated words are syntagmatically related and half are paradigmatically related.
• For cat:
 - animal, mammal: paradigmatic
 - mew, purr, furry: syntagmatic
Stated Fundamental Application of Wordnet: Sense Disambiguation
Determination of the correct sense of the word:
The crane ate the fish vs. The crane was used to lift the load
(bird vs. machine)
The problem of Sense tagging
• Given a corpus, assign the correct sense to each word.
• This is sense tagging; it needs Word Sense Disambiguation (WSD).
• Highly important for Question Answering, Machine Translation and Text Mining tasks.
Classification of Words
[Tree: Word → Content Word (Verb, Noun, Adjective, Adverb) and Function Word (Preposition, Conjunction, Pronoun, Interjection)]
Example of sense marking: its need
एक_4187 नए शोध_1138 के अनुसार_3123 जिन लोगों_1189 का सामाजिक_43540 जीवन_125623 व्यस्त_48029 होता है उनके दिमाग_16168 के एक_4187 हिस्से_120425 में अधिक_42403 जगह_113368 होती है।
(According to new research, people who have a busy social life have more space in a part of
their brain.)
नेचर न्यूरोसाइंस में छपे एक_4187 शोध_1138 के अनुसार_3123 कई_4118 लोगों_1189 के दिमाग_16168 के स्कैन से पता_11431 चला कि दिमाग_16168 का एक_4187 हिस्सा_120425 एमिगडाला सामाजिक_43540 व्यस्तताओं_1438 के साथ_328602 सामंजस्य_166 के लिए थोड़ा_38861 बढ़_25368 जाता है। यह शोध_1138 58 लोगों_1189 पर किया गया जिसमें उनकी उम्र_13159 और दिमाग_16168 की साइज़ के आँकड़े_128065 लिए गए। अमरीकी_413405 टीम_14077 ने पाया_227806 कि जिन लोगों_1189 की सोशल नेटवर्किंग अधिक_42403 है उनके दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 बाकी_130137 लोगों_1189 की तुलना_में_38220 अधिक_42403 बड़ा_426602 है। दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 भावनाओं_1912 और मानसिक_42151 स्थिति_1652 से जुड़ा हुआ माना_212436 जाता है।
Ambiguity of लोग (people)
• लोग, जन, लोक, जनमानस, पब्लिक - एक से अधिक व्यक्ति "लोगों के हित में काम करना चाहिए"
 - (English synset) multitude, masses, mass, hoi_polloi, people, the_great_unwashed - the common people generally; "separate the warriors from the mass"; "power to the people"
• दुनिया, दुनियाँ, संसार, विश्व, जगत, जहाँ, जहान, ज़माना, जमाना, लोक, दुनियावाले, दुनियाँवाले, लोग - संसार में रहने वाले लोग "महात्मा गाँधी का सम्मान पूरी दुनिया करती है / मैं इस दुनिया की परवाह नहीं करता / आज की दुनिया पैसे के पीछे भाग रही है"
 - (English synset) populace, public, world - people in general considered as a whole; "he is a hero in the eyes of the public"
Basic Principle
• Words in natural languages are polysemous.
• However, when synonymous words are put together, a unique meaning often emerges.
• Use is made of Relational Semantics.
• Componential Semantics, where each word is a bundle of semantic features (as in the Schankian Conceptual Dependency system or Lexical Componential Semantics), is to be examined as a viable alternative.
Componential Semantics
• Consider cat and tiger. Decide on componential attributes.

         Furry  Carnivorous  Heavy  Domesticable
 cat       Y        Y          N         Y
 tiger     Y        Y          Y         N

• Complete and correct attributes are difficult to design.
Semantic relations in wordnet
1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy
1, 3 and 5 are lexical (word to word); the rest are semantic (synset to synset).
Synset: the foundation (house)
1. house -- (a dwelling that serves as living quarters for one or more families; "he has a house on Cape Cod"; "she felt she had to get out of the house")
2. house -- (an official assembly having legislative powers; "the legislature has two houses")
3. house -- (a building in which something is sheltered or located; "they had a large carriage house")
4. family, household, house, home, menage -- (a social unit living together; "he moved his family to Virginia"; "It was a good Christian household"; "I waited until the whole house was asleep"; "the teacher asked how many people made up his home")
5. theater, theatre, house -- (a building where theatrical performances or motion-picture shows can be presented; "the house was full")
6. firm, house, business firm -- (members of a business organization that owns or operates one or more establishments; "he worked for a brokerage house")
7. house -- (aristocratic family line; "the House of York")
8. house -- (the members of a religious community living together)
9. house -- (the audience gathered together in a theatre or cinema; "the house applauded"; "he counted the house")
10. house -- (play in which children take the roles of father or mother or children and pretend to interact like adults; "the children were playing house")
11. sign of the zodiac, star sign, sign, mansion, house, planetary house -- ((astrology) one of 12 equal areas into which the zodiac is divided)
12. house -- (the management of a gambling house or casino; "the house gets a percentage of every bet")
Creation of Synsets
Three principles:
• Minimality
• Coverage
• Replaceability
Synset creation (continued)
Home
 John's home was decorated with lights on the occasion of Christmas.
 Having worked for many years abroad, John returned home.
House
 John's house was decorated with lights on the occasion of Christmas.
 Mercury is situated in the eighth house of John's horoscope.
Synsets (continued)
{house} is ambiguous.
{house, home} has the sense of a social unit living together. Is this the minimal unit?
{family, house, home} will make the unit completely unambiguous.
For coverage: {family, household, house, home}, ordered according to frequency.
Replaceability of the most frequent words is a requirement.
Synset creation
From first principles:
 - Pick all the senses from good standard dictionaries.
 - Obtain synonyms for each sense.
 - Needs hard and long hours of work.
Synset creation (continued)
From the wordnet of another language in the same family:
 - Pick the synset and obtain the sense from the gloss.
 - Get the words of the target language.
 - Often the same words can be used, especially for tatsama words.
 - Translation, insertion and deletion.
Synset + Gloss + Example
Crucially needed for concept explication, wordnet building using another wordnet, and wordnet linking.
English Synset: {earthquake, quake, temblor, seism} -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)
Hindi Synset: भूकंप, भूचाल, भूडोल, जलजला, भूकम्प, भू-कंप, भू-कम्प, ज़लज़ला, भूमिकंप, भूमिकम्प - प्राकृतिक कारण से पृथ्वी के भीतरी भाग में कुछ उथल-पुथल होने से ऊपरी भाग के सहसा हिलने की क्रिया "२००१ में गुजरात में आये भूकंप में काफ़ी लोग मारे गये थे"
(shaking of the surface of the earth; many were killed in the earthquake in Gujarat)
Marathi Synset: धरणीकंप, भूकंप - पृथ्वीच्या पोटात क्षोभ होऊन पृष्ठभाग हालण्याची क्रिया "२००१ साली गुजरातमध्ये झालेल्या धरणीकंपात अनेक लोक मृत्युमुखी पडले"
Semantic Relations
• Hypernymy and Hyponymy
 - Relation between word senses (synsets)
 - X is a hyponym of Y if X is a kind of Y
 - Hyponymy is transitive and asymmetrical
 - Hypernymy is the inverse of Hyponymy
 (lion → animal → animate entity → entity)
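Because hypernymy chains bottom out at a unique top concept, the full chain for the slide's example can be read off with a simple loop (the toy hierarchy below is hard-coded from the slide):

```python
# Toy hypernym links from the slide: lion -> animal -> animate entity -> entity.
HYPERNYM = {"lion": "animal", "animal": "animate entity", "animate entity": "entity"}

def hypernym_chain(synset):
    """Follow hypernym links upward; transitivity makes every ancestor a hypernym."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain
```

In real WordNet a synset can have several hypernyms, so the chain generalizes to a DAG traversal; a single-parent dict is enough to show the transitivity.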
Semantic Relations (continued)
• Meronymy and Holonymy– Part-whole relation, branch is a part of tree– X is a meronymy of Y if X is a part of Y– Holonymy is the inverse relation of Meronymy{kitchen} ………………………. {house}
Lexical Relation
• Antonymy
 - Oppositeness in meaning
 - Relation between word forms
 - Often determined by phonetics, word length etc.
 ({rise, ascend} vs. {fall, descend})
[Figure: the WordNet sub-graph around {house, home}, repeated from an earlier slide.]
Troponym and Entailment
• Entailment: {snoring - sleeping}
• Troponymy: {limp, strut - walk}, {whisper - talk}
Entailment
Snoring entails sleeping. Buying entails paying.
• Proper temporal inclusion: the inclusion can go either way - sleeping temporally includes snoring, while buying temporally includes paying.
• Co-extensiveness (troponymy): limping is a manner of walking.
Opposition among verbs.
• {rise, ascend} vs. {fall, descend}
 - tie - untie (do - undo)
 - walk - run (slow - fast)
 - teach - learn (same activity, different perspective)
 - rise - fall (motion upward vs. downward)
• Opposition and entailment
 - hit or miss (both entail aim): backward presupposition
 - succeed or fail (both entail try)
• The causal relationship
 - show - see; give - have
 - Causation and entailment: giving entails having; feeding entails eating.
Kinds of Antonymy
 Size: small - big
 Quality: good - bad
 State: warm - cool
 Personality: Dr. Jekyll - Mr. Hyde
 Direction: east - west
 Action: buy - sell
 Amount: little - a lot
 Place: far - near
 Time: day - night
 Gender: boy - girl
Kinds of Meronymy
 Component-object: head - body
 Stuff-object: wood - table
 Member-collection: tree - forest
 Feature-activity: speech - conference
 Place-area: Palo Alto - California
 Phase-state: youth - life
 Resource-process: pen - writing
 Actor-act: physician - treatment
Gradation
 State: childhood, youth, old age
 Temperature: hot, warm, cold
 Action: sleep, doze, wake
Metonymy
• Associated with Metaphors which are epitomes of semantics
• Oxford Advanced Learners Dictionary definition: “The use of a word or phrase to mean something different from the literal meaning”
• Does it mean Careless Usage?!
Insight from Sanskritic Tradition
• Power of a word
 - Abhidha, Lakshana, Vyanjana
• Meaning of "hall":
 - The hall is packed (abhidha)
 - The hall burst into laughter (lakshana)
 - The hall is full (unsaid: and so we cannot enter) (vyanjana)
Metaphors in Indian Tradition
• upamana and upameya
 - upamana: the object compared with (the standard of comparison)
 - upameya: the object being compared
 - Puru was like a lion in the battle with Alexander
 (Puru: upameya; lion: upamana)
Upamana, rupak, atishayokti
• upamana: Explicit comparison– Puru was like a lion in the battle with Alexander
• rupak: Implicit comparison– Puru was a lion in the battle with Alexander
• Atishayokti (exaggeration): upamana and upameya dropped– Puru’s army fled. But the lion fought on.
Modern study (1956 onwards, Richards et al.)
• Three constituents of metaphor
 - Vehicle (item used metaphorically)
 - Tenor (the metaphorical meaning of the vehicle)
 - Ground (the basis for metaphorical extension)
• "The foot of the mountain"
 - Vehicle: "foot"
 - Tenor: "lower portion"
 - Ground: the spatial parallel between the relationship of the foot to the human body and that of the lower portion of the mountain to the rest of the mountain
Interaction of semantic fields(Haas)
• Core vs. peripheral semantic fields
• The interaction of two words in a metonymic relation brings in new semantic fields, with selective inclusion of features
• Leg of a table:
 - does not stretch or move
 - does stand and support
Lakoff’s (1987) contribution
• Source Domain• Target Domain• Mapping Relations
Mapping Relations: ontological correspondences
• Anger is heat of fluid in a container
 Heat ↔ Anger
 (i) Container ↔ Body
 (ii) Agitation of fluid ↔ Agitation of mind
 (iii) Limit of resistance ↔ Limit of ability to suppress
 (iv) Explosion ↔ Loss of control
Image Schemas
• Categories: container - contained
• Quantity
 - More is up, less is down: outputs rose dramatically; accident rates were lower
 - Linear scales and paths: Ram is by far the best performer
• Time
 - Stationary event: we are coming to exam time
 - Stationary observer: weeks rush by
• Causation: desperation drove her to extreme steps
Patterns of Metonymy
• Container for contained– The kettle boiled (water)
• Possessor for possessed/attribute– Where are you parked? (car)
• Represented entity for representative– The government will announce new targets
• Whole for part– I am going to fill up the car with petrol
Patterns of Metonymy (contd)
• Part for whole– I noticed several new faces in the class
• Place for institution– Lalbaug witnessed the largest Ganapati
Question: can you have part-part metonymy?
Purpose of Metonymy
• More idiomatic/natural way of expression
 - More natural to say the kettle is boiling as opposed to the water in the kettle is boiling
• Economy
 - Room 23 is answering (but not *is asleep)
• Ease of access to referent
 - He is in the phone book (but not *on the back of my hand)
• Highlighting of associated relation
 - The car in front decided to turn right (but not *to smoke a cigarette)
Feature sharing not necessary
• In a restaurant:
 - Jalebii ko abhi dudh chaiye ("the jalebi now wants some milk" - no feature sharing)
 - The elephant now wants some coffee (feature sharing)
Proverbs
• Describes a specific event or state of affairs which is applicable metaphorically to a range of events or states of affairs provided they have the same or sufficiently similar image-schematic structure
IndoWordNet
Linked Indian Language Wordnets
[Figure: hub-and-spoke diagram with the Hindi Wordnet at the centre of IndoWordNet, linked to the Marathi, Sanskrit, English, Bengali, Punjabi, Konkani, Urdu, Gujarati, Oriya and Kashmiri wordnets, and to the Dravidian-language and North-East-language wordnets.]
Size of Indian Language wordnets (June 2012) 1/2
 Assamese: 14,958 - Gauhati University, Guwahati, Assam
 Bengali: 23,765 - Indian Statistical Institute, Kolkata, West Bengal
 Bodo: 15,785 - Gauhati University, Guwahati, Assam
 Gujarati: 26,580 - Dharmsinh Desai University, Nadiad, Gujarat
 Kannada: 4,408 - Mysore University, Mysore, Karnataka
 Kashmiri: 23,982 - Kashmir University, Srinagar, Jammu and Kashmir
 Konkani: 25,065 - Goa University, Panaji, Goa
 Malayalam: 8,557 - Amrita University, Coimbatore, Tamil Nadu
 Manipuri: 16,351 - Manipur University, Imphal, Manipur
 Marathi: 24,954 - IIT Bombay, Mumbai, Maharashtra
Size of Indian Language wordnets (June 2012) 2/2
 Nepali: 11,713 - Assam University, Silchar, Assam
 Oriya: 31,454 - Hyderabad Central University, Hyderabad, Andhra Pradesh
 Punjabi: 22,332 - Thapar University and Punjabi University, Patiala, Punjab
 Sanskrit: 18,980 - IIT Bombay, Mumbai
 Tamil: 8,607 - Tamil University, Thanjavur, Tamil Nadu
 Telugu: 14,246 - Dravidian University, Kuppam, Andhra Pradesh
 Urdu: 23,071 - Jawaharlal Nehru University, New Delhi
Categories of Synsets (1/2)
• Universal: synsets which have an indigenous lexeme in all the languages (e.g. Sun, Earth).
• Pan-Indian: synsets which have an indigenous lexeme in all the Indian languages but no English equivalent (e.g. Paapad).
• In-Family: synsets which have an indigenous lexeme in the particular language family (e.g. the term for Bhatija in Dravidian languages).
Categories of Synsets (2/2)
• Language specific: synsets which are unique to a language (e.g. Bihu in Assamese).
• Rare: synsets which express technical terms (e.g. n-gram).
• Synthesized: synsets created in the language due to the influence of another language (e.g. Pizza).
Need for categorization
• To bring systematicity to the way the wordnet synsets are linked:
 Universal → Pan Indian → Language Family → Language → Synthesised → Rare
• All members have finished the Universal and Pan Indian synsets.
Categorization methodology
• 34,378 Hindi synsets were sent to all IndoWordNet groups in the tool, where each synset could be categorized Yes or No.
• Universal synsets: those categorized Yes which also have equivalent English words or synsets.
• Pan-Indian synsets: those categorized Yes which did not have equivalent English words or synsets.
Expansion approach: linking is a subtle and difficult process
• To link or not to link
• While linking:
 - face lexical and semantic chasms
 - syntactic divergences in the example sentences
  • change of POS
  • copula drop (Hindi → Bangla)
Case of Kashmiri
Linking kinship relations and fine-grained concepts:
[Figure: Relative → Uncle → Mama (maternal uncle) / Chacha (paternal uncle)]
पानी direct आब
पानी hypernym त्रेश
Important decision
• Two kinds of linkages:
 - Direct
 - Hypernymy
Case of Kashmiri: पानी direct आब; पानी hypernym त्रेश
How to express a concept not present in the language?
Transliteration: often employed
• Synset ID: 39; POS: adjective; Synonyms: सनाथ (sanaatha)
• Gloss: जिसका कोई पालन-पोषण या देखभाल करने वाला हो (opposite of orphan)
• Example statement: "सनाथ बालकों को अनाथ बालकों की मदद करनी चाहिए (children who are looked after should help the orphans) / साधक प्रभु का हो जाने पर अनाथ नहीं रहता, सनाथ हो जाता है"
• Transliterated and adopted by Bangla and Gujarati
Short phrase: often employed
 - Bangla
 - Urdu (meaning inauspicious)
Linking synsets across languages: influence on the Hindi Wordnet
Hindi wordnet has to add new synsets to accommodate language specific concepts, e.g., in Gujarati
ભૈરવજપ (bhairav jap)
ID :: 103040
CAT :: NOUN
CONCEPT :: मोक्ष के लिए जप करते हुए पर्वत पर से अपने आप को गिराना (taking God's name and throwing oneself from atop a mountain to attain liberation)
EXAMPLE :: गिरनार के शिखर पर से यात्रिक भैरवजप करते थे ऐसा माना जाता है। (it is believed that pilgrims used to do bhairav jap atop Girnar mountain)
SYNSET-HINDI :: भैरवजप
Overview of WSD techniques
Bird's eye view
[Figure: taxonomy of WSD approaches - Knowledge Based; Machine Learning (Supervised, Semi-supervised, Unsupervised); Hybrid]
OVERLAP BASED APPROACHES
• Require a Machine Readable Dictionary (MRD).
• Find the overlap between the features of different senses of an ambiguous word (sense bag) and the features of the words in its context (context bag).
• These features could be sense definitions, example sentences, hypernyms etc.
• The features could also be given weights.
• The sense which has the maximum overlap is selected as the contextually appropriate sense.
LESK'S ALGORITHM
Sense Bag: contains the words in the definition of a candidate sense of the ambiguous word.
Context Bag: contains the words in the definition of each sense of each context word.
E.g. "On burning coal we get ash."
From Wordnet:
• The noun ash has 3 senses (first 2 from tagged texts)
 1. (2) ash -- (the residue that remains when something is burned)
 2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved ornamental or timber trees of the genus Fraxinus)
 3. ash -- (strong elastic wood of any of various ash trees; used for furniture and tool handles and sporting goods such as baseball bats)
• The verb ash has 1 sense (no senses from tagged texts)
 1. ash -- (convert into ashes)
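A minimal sketch of the Lesk overlap count for the running example, using paraphrased stand-in glosses (not the verbatim WordNet glosses):

```python
# Toy gloss inventory: (word, sense_number) -> paraphrased gloss.
GLOSSES = {
    ("ash", 1): "the residue that remains when something is burned",
    ("ash", 2): "any of various deciduous ornamental or timber trees",
    ("coal", 1): "fossil fuel that is burned to produce heat",
}

def tokens(text):
    return set(text.lower().split())

def simplified_lesk(target, context_words):
    """Pick the sense of `target` whose gloss overlaps the context bag most."""
    # Context bag: words in the glosses of every sense of every context word.
    context_bag = set()
    for (word, _sense), gloss in GLOSSES.items():
        if word in context_words:
            context_bag |= tokens(gloss)
    # Sense bag per candidate sense is just its own gloss here.
    senses = [key for key in GLOSSES if key[0] == target]
    return max(senses, key=lambda k: len(tokens(GLOSSES[k]) & context_bag))
```

For "On burning coal we get ash", the coal gloss shares "burned" (among other words) with ash sense 1 and nothing with sense 2, so the residue sense wins.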
CRITIQUE
• Proper nouns in the context of an ambiguous word can act as strong disambiguators, e.g. "Sachin Tendulkar" will be a strong indicator of the category "sports": Sachin Tendulkar plays cricket.
• Proper nouns are not present in the thesaurus. Hence this approach fails to capture the strong clues provided by proper nouns.
• Accuracy: 50% when tested on 10 highly polysemous English words.
Extended Lesk’s algorithm
– The original algorithm is sensitive to the exact words in the definition.
– The extension includes glosses of semantically related senses from WordNet (e.g. hypernyms, hyponyms, etc.).
– The scoring function becomes:

 score_ext(S) = sum over s' ∈ rel(S) ∪ {S} of | context(W) ∩ gloss(s') |

 where:
 – gloss(S) is the gloss of sense S from the lexical resource.
 – context(W) is the gloss of each sense of each context word.
 – rel(S) gives the senses related to S in WordNet under some relations.
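The extended-Lesk score (overlap of the context bag with the glosses of a candidate sense and of its related senses) can be sketched directly; the glosses below are toy paraphrases, and a single hyponym stands in for rel(S):

```python
# Toy glosses; fly_ash stands in for the related senses contributed by rel(S).
GLOSS = {
    "ash#1": "residue that remains when something is burned",
    "ash#2": "deciduous ornamental or timber trees",
    "fly_ash#1": "fine particles of ash produced by the combustion of fuel",
}
RELATED = {"ash#1": ["fly_ash#1"], "ash#2": []}

def score_ext(sense, context_bag):
    """Sum the overlaps of the context bag with the sense's own gloss and
    with the glosses of its related senses."""
    related = [sense] + RELATED.get(sense, [])
    return sum(len(set(GLOSS[s].split()) & context_bag) for s in related)
```

With the context "On combustion of coal we get ash", the direct glosses alone give no useful overlap, but the hyponym's gloss matches "combustion", so the residue sense outscores the tree sense, which is exactly what the extension buys.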
[Figure: the WordNet sub-graph around {house, home}, repeated from an earlier slide.]
Example: Extended Lesk
• “On combustion of coal we get ash”
From Wordnet
• The noun ash has 3 senses (first 2 from tagged texts)
 1. (2) ash -- (the residue that remains when something is burned)
 2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved ornamental or timber trees of the genus Fraxinus)
 3. ash -- (strong elastic wood of any of various ash trees; used for furniture and tool handles and sporting goods such as baseball bats)
• The verb ash has 1 sense (no senses from tagged texts)
 1. ash -- (convert into ashes)
Example: Extended Lesk (cntd)
• “On combustion of coal we get ash”
From Wordnet (through hyponymy)
• ash -- (the residue that remains when something is burned)
 => fly ash -- (fine solid particles of ash that are carried into the air when fuel is combusted)
 => bone ash -- (ash left when bones burn; high in calcium phosphate; used as fertilizer and in bone china)
Critique of Extended Lesk
• Larger region of matching in WordNet
 - increased chance of matching
 BUT
 - increased chance of topic drift
WALKER'S ALGORITHM
• A thesaurus-based approach.
• Step 1: For each sense of the target word, find the thesaurus category to which that sense belongs.
• Step 2: Calculate the score for each sense using the context words: a context word adds 1 to the score of a sense if the thesaurus category of the word matches that of the sense.
 - E.g. The money in this bank fetches an interest of 8% per annum
 - Target word: bank
 - Clue words from the context: money, interest, annum, fetch

            Sense 1: Finance   Sense 2: Location
 money           +1                  0
 interest        +1                  0
 fetch            0                  0
 annum           +1                  0
 Total            3                  0
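The scoring in the table above takes only a few lines; the thesaurus categories below are invented stand-ins for Roget-style categories:

```python
# Invented thesaurus categories for the bank example (fetch has no topical one).
THESAURUS_CATEGORY = {"money": "finance", "interest": "finance",
                      "annum": "finance", "fetch": None}
SENSE_CATEGORIES = {"bank(finance)": "finance", "bank(location)": "location"}

def walker_scores(context_words):
    """Each context word whose category matches a sense's category adds 1."""
    scores = dict.fromkeys(SENSE_CATEGORIES, 0)
    for sense, category in SENSE_CATEGORIES.items():
        for word in context_words:
            if THESAURUS_CATEGORY.get(word) == category:
                scores[sense] += 1
    return scores
```

Running it on the clue words reproduces the 3-vs-0 table, and the argmax picks the finance sense.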
WSD USING CONCEPTUAL DENSITY (Agirre and Rigau, 1996)
• Select a sense based on the relatedness of that word-sense to the context.
• Relatedness is measured in terms of conceptual distance (i.e. how close the concept represented by the word and the concepts represented by its context words are).
• This approach uses a structured hierarchical semantic net (WordNet) for finding the conceptual distance.
• The smaller the conceptual distance, the higher the conceptual density (i.e. if all words in the context are strong indicators of a particular concept, then that concept will have a higher density).
CONCEPTUAL DENSITY FORMULA
Wish list:
• The conceptual distance between two words should be proportional to the length of the path between the two words in the hierarchical tree (WordNet).
• The conceptual distance between two words should be proportional to the depth of the concepts in the hierarchy.

 CD(c, m) = [ sum_{i=0}^{m-1} nhyp^(i^0.20) ] / [ sum_{i=0}^{h-1} nhyp^i ]

where:
 c = concept
 nhyp = mean number of hyponyms
 h = height of the sub-hierarchy
 m = number of senses of the word and senses of context words contained in the sub-hierarchy
 CD = conceptual density; 0.20 is the smoothing factor

[Figure: a sub-tree rooted at "entity" containing "finance" and "location"; "money", bank-1 and bank-2 fall inside; d marks the depth and h the height of the concept "location".]
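A numeric sketch of the conceptual-density formula as published by Agirre and Rigau (1996), CD(c, m) = sum_{i=0}^{m-1} nhyp^(i^0.20) / sum_{i=0}^{h-1} nhyp^i, assuming as in the paper that nhyp is constant over the sub-hierarchy:

```python
def conceptual_density(nhyp, h, m):
    """CD of a sub-hierarchy of height h and mean branching nhyp that
    contains m relevant senses; 0.20 is the smoothing factor."""
    numerator = sum(nhyp ** (i ** 0.20) for i in range(m))
    denominator = sum(nhyp ** i for i in range(h))
    return numerator / denominator
```

The denominator estimates the size of the sub-hierarchy, so for a fixed sub-hierarchy the density grows with m: packing more relevant senses into the same region yields a higher CD, which is the selection criterion of the method.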
-
7/13/2012
25
CONCEPTUAL DENSITY (contd.)
• The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context.
• The CD formula will yield the highest density for the sub-hierarchy containing more senses.
• The sense of W contained in the sub-hierarchy with the highest CD will be chosen.
CONCEPTUAL DENSITY (EXAMPLE)
The jury(2) praised the administration(3) and operation (8) of Atlanta Police Department(1)
Step 1: Make a lattice of the nouns in the context, their senses and hypernyms.
Step 2: Compute the conceptual density of resultant concepts (sub-hierarchies).
Step 3: The concept with the highest CD is selected.
Step 4: Select the senses below the selected concept as the correct sense for the respective words.
[Figure: sense lattice for the context nouns, built from hypernyms such as "operation", "division", "administrative_unit", "jury", "committee", "police department", "local department", "government department", "department" and "body"; the winning sub-hierarchy has CD = 0.256, the competing one CD = 0.062.]
CRITIQUE
• Resolves lexical ambiguity of nouns by finding a combination of senses that
maximizes the total Conceptual Density among senses.
• The Good
– Does not require a tagged corpus.
• The Bad
– Fails to capture the strong clues provided by proper nouns in the context.
• Accuracy
– 54% on Brown corpus.
WSD USING RANDOM WALK ALGORITHM (PageRank) (Sinha and Mihalcea, 2007)
Bell ring church Sunday
[Figure: weighted sense graph for the context "Bell ring church Sunday": each word contributes candidate sense vertices (S1, S2, S3), connected by edges carrying definition-similarity weights such as 0.46, 0.92, 0.97.]

Step 1: Add a vertex for each possible sense of each word in the text.
Step 2: Add weighted edges using definition-based semantic similarity (Lesk's method).
Step 3: Apply a graph-based ranking algorithm to find the score of each vertex (i.e., of each word sense).
Step 4: Select the vertex (sense) which has the highest score.
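The four steps can be sketched on a toy context. The sense names and glosses below are invented for illustration, and the Lesk similarity is reduced to raw gloss-word overlap; the actual system of Sinha and Mihalcea uses WordNet glosses and several similarity measures.

```python
# Toy sketch of the 4 steps above: vertices are word senses, edge
# weights come from a Lesk-style gloss overlap, and a weighted
# PageRank scores each sense. Glosses and sense names are invented.

SENSES = {  # word -> {sense: gloss}
    "bank": {
        "bank#river": "sloping land beside a river",
        "bank#fin": "financial institution that accepts deposits and lends money",
    },
    "money": {
        "money#1": "medium of exchange held as deposits or cash",
    },
}

def overlap(g1: str, g2: str) -> int:
    """Lesk-style similarity: number of shared gloss words."""
    return len(set(g1.split()) & set(g2.split()))

def disambiguate(word: str, d: float = 0.85, iters: int = 30) -> str:
    # Step 1: one vertex per sense of each word in the context.
    nodes = [(w, s, g) for w in SENSES for s, g in SENSES[w].items()]
    # Step 2: weighted edges between senses of *different* words.
    w_edges = {}
    for w1, s1, g1 in nodes:
        for w2, s2, g2 in nodes:
            if w1 != w2 and overlap(g1, g2) > 0:
                w_edges.setdefault(s1, {})[s2] = overlap(g1, g2)
    # Step 3: weighted PageRank over the sense graph.
    names = [s for _, s, _ in nodes]
    pr = {s: 1.0 / len(names) for s in names}
    for _ in range(iters):
        new = {}
        for s in names:
            incoming = sum(
                pr[v] * w_edges[v][s] / sum(w_edges[v].values())
                for v in w_edges if s in w_edges[v]
            )
            new[s] = (1 - d) / len(names) + d * incoming
        pr = new
    # Step 4: highest-scoring sense of the target word.
    return max(SENSES[word], key=lambda s: pr[s])

print(disambiguate("bank"))
```

Because only the finance sense of "bank" shares gloss words with "money", it receives incoming weight and outranks the river-bank sense.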
A look at Page Rank (from Wikipedia)
Developed at Stanford University by Larry Page (hence the name Page-Rank) and Sergey Brin as part of a research project about a new kind of search engine
The first paper about the project, describing PageRank and the initial prototype of the Google search engine, was published in 1998
Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine
While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools
A look at Page Rank (cntd)
PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.
Assume a small universe of four web pages: A, B, C and D.
The initial approximation of PageRank would be evenly divided between these four documents. Hence, each document would begin with an estimated PageRank of 0.25.
If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A. All PageRank PR( ) in this simplistic system would thus gather to A because all links would be pointing to A.
PR(A) = PR(B) + PR(C) + PR(D) = 0.75
A look at Page Rank (contd.)

Suppose that page B has a link to page C as well as to page A, while page D has links to all three pages.
The value of the link-votes is divided among all the outbound links on a page.
Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C.
Only one third of D's PageRank is counted for A's PageRank (approximately 0.083).
PR(A)=PR(B)/2+PR(C)/1+PR(D)/3
In general,
PR(U) = Σ_{V ∈ B(U)} PR(V)/L(V), where B(U) is the set of pages that link to U, and L(V) is the number of outbound links from V.
A look at Page Rank (damping factor)

The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking.
The probability, at any step, that the person will continue is a damping factor d.
PR(U) = (1-d)/N + d · Σ_{V ∈ B(U)} PR(V)/L(V)

where N = size of the document collection.
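The damped formula above can be iterated to a fixed point. A minimal power-iteration sketch on the four-page universe from the slides; the outlinks of page A are an assumption (A → B, C, D) so that no page is a dead end.

```python
# Power iteration for the damped PageRank formula above, on the
# four-page example. Outlinks of page A are assumed (A -> B, C, D)
# so that no page is dangling; B -> A, C and D -> A, B, C as stated.

LINKS = {  # page -> pages it links to
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

def pagerank(links, d=0.85, iters=50):
    n = len(links)
    pr = {p: 1.0 / n for p in links}  # initial estimate 1/N
    for _ in range(iters):
        new = {}
        for u in links:
            backlinks = [v for v in links if u in links[v]]  # B(U)
            new[u] = (1 - d) / n + d * sum(pr[v] / len(links[v])
                                           for v in backlinks)
        pr = new
    return pr

ranks = pagerank(LINKS)
print(ranks)  # A collects the most PageRank
```

Since every page has outlinks, the total PageRank mass stays 1, and page A, which every other page links to, ends up with the highest score.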
For WSD: Page Rank
• Given a graph G = (V, E)
• In(Vi) = predecessors of Vi
• Out(Vi) = successors of Vi
• In a weighted graph, the walker randomly selects an outgoing edge with higher probability of selecting edges with higher weight.
Other Link Based Algorithms
• HITS algorithm invented by Jon Kleinberg (used by Teoma and now Ask.com)
• IBM CLEVER project
• TrustRank algorithm
CRITIQUE
• Relies on random walks on graphs encoding label dependencies.
• The Good
  – Does not require any tagged data (a wordnet is sufficient).
  – The weights on the edges capture the definition-based semantic similarities.
  – Takes into account global data recursively drawn from the entire graph.
• The Bad
  – Poor accuracy.
• Accuracy
  – 54% on the SemCor corpus, which has a baseline accuracy of 37%.
KB Approaches – Comparisons
Algorithm                          | Accuracy
WSD using Selectional Restrictions | 44% on Brown corpus
Lesk's algorithm                   | 50-60% on short samples of "Pride and Prejudice" and some "news stories"
Extended Lesk's algorithm          | 32% on lexical samples from Senseval-2 (wider coverage)
WSD using conceptual density       | 54% on Brown corpus
WSD using random walk algorithms   | 54% on SemCor corpus (baseline accuracy 37%)
Walker's algorithm                 | 50% when tested on 10 highly polysemous English words
KB Approaches – Conclusions

• Drawbacks of WSD using Selectional Restrictions
  – Needs an exhaustive knowledge base.
• Drawbacks of overlap-based approaches
  – Dictionary definitions are generally very small.
  – Dictionary entries rarely take into account the distributional constraints of different word senses (e.g., selectional preferences, kinds of prepositions, etc.; "cigarette" and "ash" never co-occur in a dictionary).
  – Suffer from the problem of sparse match.
  – Proper nouns are not present in an MRD; hence these approaches fail to capture the strong clues provided by proper nouns.
SUPERVISED APPROACHES
NAÏVE BAYES
• The algorithm finds the winner sense using

    ŝ = argmax_{s ∈ senses} Pr(s | V_w)

• V_w is a feature vector consisting of:
  – POS of w
  – Semantic and syntactic features of w
  – Collocation vector (set of words around it): typically the next word (+1), the next-to-next word (+2), -2, -1 and their POS's
  – Co-occurrence vector (number of times w occurs in a bag of words around it)
• Applying Bayes rule and the naive independence assumption:

    ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1}^{n} Pr(V_w^i | s)
BAYES RULE AND INDEPENDENCE ASSUMPTION
ŝ = argmax_{s ∈ senses} Pr(s | V_w), where V_w is the feature vector.

• Apply Bayes rule:

    Pr(s | V_w) = Pr(s) · Pr(V_w | s) / Pr(V_w)

• Pr(V_w | s) can be approximated using the independence assumption:

    Pr(V_w | s) = Pr(V_w^1 | s) · Pr(V_w^2 | s, V_w^1) · … · Pr(V_w^n | s, V_w^1, …, V_w^{n-1})
                ≈ Π_{i=1}^{n} Pr(V_w^i | s)

Thus,

    ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1}^{n} Pr(V_w^i | s)
ESTIMATING PARAMETERS
• The parameters in probabilistic WSD are:
  – Pr(s)
  – Pr(V_w^i | s)
• Senses are marked with respect to the sense repository (WordNet).

Estimation from a sense-tagged corpus:

    Pr(s) = count(s, w) / count(w)
    Pr(V_w^i | s) = Pr(V_w^i, s) / Pr(s) = c(V_w^i, s, w) / c(s, w)
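Putting the estimation and the argmax together gives a very small classifier. The sense-tagged examples below are invented, the feature vector is reduced to a bag of context words (the full vector above also carries POS and collocations), and add-one smoothing is added so that unseen features do not zero out a sense's score.

```python
# Minimal Naive Bayes WSD over bag-of-words features only.
# Training data is a toy stand-in for a sense-tagged corpus;
# add-one smoothing is an addition to handle unseen features.
from collections import Counter
import math

TRAIN = [  # (context words, sense of "bank")
    (["money", "deposit"], "bank/finance"),
    (["money", "loan"], "bank/finance"),
    (["river", "water"], "bank/place"),
]

senses = Counter(s for _, s in TRAIN)                     # count(s, w)
feat = Counter((f, s) for ctx, s in TRAIN for f in ctx)   # c(V_w^i, s, w)
vocab = {f for ctx, _ in TRAIN for f in ctx}

def classify(context):
    def score(s):
        # log Pr(s) + sum_i log Pr(V_w^i | s), with add-one smoothing
        logp = math.log(senses[s] / len(TRAIN))
        total = sum(feat[(f, s)] for f in vocab)
        for f in context:
            logp += math.log((feat[(f, s)] + 1) / (total + len(vocab)))
        return logp
    return max(senses, key=score)

print(classify(["money"]))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once the feature vector is realistically large.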
DECISION LIST ALGORITHM
• Based on the 'One sense per collocation' property.
  – Nearby words provide strong and consistent clues to the sense of a target word.
• Collect a large set of collocations for the ambiguous word.
• Calculate word-sense probability distributions for all such collocations.
• Calculate the log-likelihood ratio:

    Log( Pr(Sense-A | Collocation_i) / Pr(Sense-B | Collocation_i) )

  (assuming there are only two senses for the word; this can easily be extended to k senses)
• Higher log-likelihood = more predictive evidence.
• Collocations are ordered in a decision list, with the most predictive collocations ranked highest.

[Figure: training data and the resultant decision list.]
DECISION LIST ALGORITHM (CONTD.)
Classification of a test sentence is based on the highest-ranking collocation found in the test sentence.

E.g., …plucking flowers affects plant growth…
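The ranking and classification steps can be sketched on the "plant" example. The collocation counts and the smoothing constant below are invented for illustration; a real decision list would gather them from a sense-tagged corpus.

```python
# Sketch of a decision list for "plant" (living thing vs. factory).
# Collocation counts are invented; ALPHA smooths zero counts. The
# list is sorted by |log-likelihood|, and a test sentence is labeled
# by the first (most predictive) collocation it contains.
import math

COUNTS = {  # collocation -> (count with sense A, count with sense B)
    "flowers": (8, 0),   # sense A = plant/living
    "growth": (4, 2),
    "car": (0, 6),       # sense B = plant/factory
    "union": (1, 5),
}
ALPHA = 0.1

def build_decision_list(counts):
    scored = []
    for colloc, (a, b) in counts.items():
        ll = math.log((a + ALPHA) / (b + ALPHA))  # log-likelihood ratio
        scored.append((colloc, ll))
    # most predictive evidence (largest |log-likelihood|) first
    return sorted(scored, key=lambda x: abs(x[1]), reverse=True)

def classify(sentence, dlist, default="plant/living"):
    tokens = set(sentence.lower().split())
    for colloc, ll in dlist:  # highest-ranking match decides
        if colloc in tokens:
            return "plant/living" if ll > 0 else "plant/factory"
    return default

dlist = build_decision_list(COUNTS)
print(dlist[0])  # most predictive collocation
print(classify("plucking flowers affects plant growth", dlist))
```

Note that "flowers" outranks "growth" even though both favor the living-thing sense: the decision is made by the single strongest clue, not by summing evidence.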
CRITIQUE
• Harnesses powerful, empirically observed properties of language.
• The Good
  – Does not require a large tagged corpus; simple implementation.
  – Simple semi-supervised algorithm which builds on an existing supervised algorithm.
  – Easy understandability of the resulting decision list.
  – Is able to capture the clues provided by proper nouns from the corpus.
• The Bad
  – The classifier is word-specific.
  – A new classifier needs to be trained for every word that you want to disambiguate.
• Accuracy
  – Average accuracy of 96% when tested on a set of 12 highly polysemous words.
Exemplar Based WSD (k-NN)

• An exemplar-based classifier is constructed for each word to be disambiguated.
• Step 1: From each sense-marked sentence containing the ambiguous word, a training example is constructed using:
  – POS of w as well as POS of neighboring words
  – Local collocations
  – Co-occurrence vector
  – Morphological features
  – Subject-verb syntactic dependencies
• Step 2: Given a test sentence containing the ambiguous word, a test example is similarly constructed.
• Step 3: The test example is then compared to all training examples and the k closest training examples are selected.
• Step 4: The sense most prevalent among these k examples is selected as the correct sense.
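Steps 3 and 4 can be sketched with toy exemplars. The feature sets below stand in for the full vectors listed above (POS, collocations, morphology), and feature overlap stands in for a proper distance metric; all training data is invented.

```python
# Toy k-NN exemplar classifier for "bank". Each training example is a
# set of context features; similarity is the size of the feature
# overlap with the test example, standing in for a real metric.

TRAIN = [  # (feature set, sense)
    ({"money", "deposit"}, "bank/finance"),
    ({"loan", "interest"}, "bank/finance"),
    ({"money", "account"}, "bank/finance"),
    ({"water", "shore"}, "bank/place"),
    ({"fish", "water"}, "bank/place"),
]

def knn_sense(test_features, k=3):
    # Step 3: rank training examples by similarity to the test example.
    ranked = sorted(TRAIN, key=lambda ex: len(ex[0] & test_features),
                    reverse=True)
    # Step 4: majority sense among the k closest examples.
    top = [sense for _, sense in ranked[:k]]
    return max(set(top), key=top.count)

print(knn_sense({"money", "loan"}))
```

Because classification defers all generalization to query time, adding newly sense-marked sentences to TRAIN immediately changes future decisions, with no retraining step.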
WSD Using SVMs

• An SVM is a binary classifier which finds the hyperplane with the largest margin that separates training examples into 2 classes.
• As SVMs are binary classifiers, a separate classifier is built for each sense of the word.
• Training phase: using a tagged corpus, for every sense of the word an SVM is trained using the following features:
  – POS of w as well as POS of neighboring words
  – Local collocations
  – Co-occurrence vector
  – Features based on syntactic relations (e.g., headword, POS of headword, voice of headword, etc.)
• Testing phase: given a test sentence, a test example is constructed using the above features and fed as input to each binary classifier.
• The correct sense is selected based on the labels returned by the classifiers.
WSD Using Perceptron Trained HMM

• WSD is treated as a sequence labeling task.
• The class space is reduced by using WordNet's super senses instead of actual senses.
• A discriminative HMM is trained using the following features:
  – POS of w as well as POS of neighboring words
  – Local collocations
  – Shape of the word and neighboring words, e.g., for s = "Merrill Lynch & Co", shape(s) = Xx*Xx*&Xx
• Lends itself well to NER, as labels like "person", "location", "time", etc. are included in the super sense tag set.
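The shape feature from the example above can be sketched as follows. The mapping (uppercase → X, lowercase → x, runs collapsed with '*', spaces dropped) is inferred from the single "Merrill Lynch & Co" example; the digit case is an assumption.

```python
# Word-shape feature: uppercase -> X, lowercase -> x, digits -> d
# (assumed), other characters kept as-is; runs longer than one
# character are collapsed with '*', and spaces are dropped.

def shape(s: str) -> str:
    runs = []  # list of [symbol, run_length]
    for ch in s:
        if ch.isspace():
            continue
        if ch.isupper():
            sym = "X"
        elif ch.islower():
            sym = "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch
        if runs and runs[-1][0] == sym:
            runs[-1][1] += 1
        else:
            runs.append([sym, 1])
    return "".join(sym + ("*" if n > 1 else "") for sym, n in runs)

print(shape("Merrill Lynch & Co"))  # -> Xx*Xx*&Xx
```

Collapsing runs keeps the feature space small: "Merrill" and "Lynch" both map to Xx*, so capitalization patterns generalize across words of different lengths.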
Supervised Approaches – Comparisons

Approach               | Avg. Precision | Avg. Recall    | Corpus                                                | Avg. Baseline Accuracy
Naïve Bayes            | 64.13%         | Not reported   | Senseval-3 All Words Task                             | 60.90%
Decision Lists         | 96%            | Not applicable | Tested on a set of 12 highly polysemous English words | 63.9%
Exemplar-based (k-NN)  | 68.6%          | Not reported   | WSJ6 containing 191 content words                     | 63.7%
SVM                    | 72.4%          | 72.4%          | Senseval-3 Lexical Sample Task (57 words)             | 55.2%
Perceptron-trained HMM | 67.60%         | 73.74%         | Senseval-3 All Words Task                             | 60.90%
Supervised Approaches – Conclusions

• General comments
  – Use corpus evidence instead of relying on dictionary-defined senses.
  – Can capture important clues provided by proper nouns, because proper nouns do appear in a corpus.
• Naïve Bayes
  – Suffers from data sparseness.
  – Since the scores are a product of probabilities, some weak features might pull down the overall score for a sense.
  – A large number of parameters need to be trained.
• Decision Lists
  – A word-specific classifier; a separate classifier needs to be trained for each word.
  – Uses the single most predictive feature, which eliminates the drawback of Naïve Bayes.
Multilingual resource constrained WSD
Long line of work…
• Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, "It Takes Two to Tango: A Bilingual Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization", 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.
• Mitesh Khapra, Salil Joshi, Arindam Chatterjee and Pushpak Bhattacharyya, "Together We Can: Bilingual Bootstrapping for WSD", Annual Meeting of the Association for Computational Linguistics (ACL 2011), Oregon, USA, June 2011.
• Mitesh Khapra, Saurabh Sohoney, Anup Kulkarni and Pushpak Bhattacharyya, "Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD", International Conference on Computational Linguistics (COLING 2010), Beijing, China, August 2010.
• Mitesh Khapra, Anup Kulkarni, Saurabh Sohoney and Pushpak Bhattacharyya, "All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision", Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.
• Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, "Domain-Specific Word Sense Disambiguation Combining Corpus Based and Wordnet Based Parameters", 5th International Conference on Global Wordnet (GWC 2010), Mumbai, January 2010.
• Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, "Projecting Parameters for Multilingual Word Sense Disambiguation", Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, August 2009.
• Mitesh Khapra, Pushpak Bhattacharyya, Shashank Chauhan, Soumya Nair and Aditya Sharma, "Domain Specific Iterative Word Sense Disambiguation in a Multilingual Setting", International Conference on NLP (ICON 2008), Pune, India, December 2008.
Algorithm for WSD
Iterative WSD

Motivated by the energy expression in the Hopfield network:

• Neuron → Synset
• Self-activation → Corpus sense distribution
• Weight of connection between two neurons → Weight as a function of corpus co-occurrence and wordnet distance measures between synsets
Iterative WSD
Algorithm 1: performIterativeWSD(sentence)
1. Tag all monosemous words in the sentence.
2. Iteratively disambiguate the remaining words in the sentence in increasing order of their degree of polysemy.
3. At each stage, select that sense for a word which maximizes the score given by the scoring function described above.
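The control flow of the algorithm can be sketched with a stand-in scoring function. The real score combines corpus sense distributions with wordnet-based weights, which are not reproduced here; the sense inventory and relatedness values below are invented for illustration.

```python
# Sketch of performIterativeWSD. A toy pairwise relatedness table
# stands in for the actual scoring function (corpus distribution +
# wordnet distance measures); senses and values are invented.

SENSES = {        # word -> candidate senses ([one] = monosemous)
    "river": ["river#1"],
    "bank": ["bank#finance", "bank#place"],
}
RELATEDNESS = {   # symmetric sense-pair relatedness, invented
    ("bank#place", "river#1"): 0.9,
    ("bank#finance", "river#1"): 0.1,
}

def rel(s1, s2):
    return RELATEDNESS.get((s1, s2), RELATEDNESS.get((s2, s1), 0.0))

def perform_iterative_wsd(words):
    # Step 1: tag all monosemous words.
    tagged = {w: SENSES[w][0] for w in words if len(SENSES[w]) == 1}
    # Step 2: remaining words in increasing order of polysemy.
    remaining = sorted((w for w in words if w not in tagged),
                       key=lambda w: len(SENSES[w]))
    for w in remaining:
        # Step 3: pick the sense that best fits the senses fixed so far.
        tagged[w] = max(SENSES[w],
                        key=lambda s: sum(rel(s, t)
                                          for t in tagged.values()))
    return tagged

print(perform_iterative_wsd(["river", "bank"]))
```

The monosemous word "river" anchors the context, so "bank" resolves to its river-bank sense; each newly fixed sense then becomes evidence for the words still to be disambiguated.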
Data
[Table: corpus statistics per language-domain pair — #polysemous words (tokens), #monosemous words, #polysemous unique words (types), token-to-type ratio, average degree of wordnet polysemy, and average degree of corpus polysemy; the values are not recoverable from this transcript.]
Performance of different algorithms: monolingual WSD
WSD is costly! (1)

WordNets
• Princeton WordNet: ~80,000 synsets; 30 man-years
• EuroWordNet: 12 man-years on average for each of 12 languages
• Hindi wordnet: 24 man-years
  – http://www.cfilt.iitb.ac.in/wordnet/webhwn/
• IndoWordNet: being created; 15 languages; 4 people on average; close to 15,000 synsets done in 1 year
• The scale of effort is really huge
• Tricky too, when it comes to expanding from one wordnet to another
Machine Learning based WSD is costly! (2)

Sense-annotated corpora for machine learning
• SemCor: ~200,000 sense-marked words
• SemEval/Senseval competitions: to generate sense-marked corpora
• Sense-marked corpora created at IIT Bombay
  – http://www.cfilt.iitb.ac.in/wsd/annotated_corpus
  – English: Tourism (~170,000), Health (~150,000)
  – Hindi: Tourism (~170,000), Health (~80,000)
  – Marathi: Tourism (~120,000), Health (~50,000)
  – 12 man-years for each combination
Cost-accuracy trade off
This is the dream! Spread from one combination to the others.
Language Adaptation scenarios
Related Work (Not mentioning references, because they are too many)
• Knowledge-based approaches
• Supervised approaches
• Unsupervised approaches
• Semi-supervised approaches
• Hybrid approaches
No single existing solution to WSD completely meets our requirements of multilinguality, high domain accuracy and good performance in the face of limited annotation.
Scenario 3: EM-based Unsupervised Approach
ESTIMATING SENSE DISTRIBUTIONS
If a sense-tagged Marathi corpus were available, we could have estimated the sense distributions directly; but such a corpus is not available.
Framework: Figure 1 and Figure 2
E-M steps
Points to note…
• Symmetric formulation
• E and M steps are identical except for the change in language
• Either can be treated as the E-step, making the other the M-step
• A back-and-forth traversal over translation correspondences in the two languages
• Does not require a parallel corpus; only in-domain corpus is needed
In General..
Experimental Setup
• Languages: Hindi, Marathi
• Domains: Tourism and Health (largest domain-specific sense-tagged corpus)
Algorithms Being Compared
• EM (our approach)
• Personalized PageRank (Agirre and Soroa, 2009)
• State-of-the-art bilingual approach using Mutual Information (Kaji and Morimoto, 2002)
• Random baseline
• Wordnet first-sense baseline (supervised baseline)
Results
• Performs better than other state-of-the-art knowledge-based and unsupervised approaches
• Does not beat the Wordnet First Sense baseline, which is a supervised baseline
Error Analysis – Non-Progressive Estimation

• Some words have the same translations in the target language across senses: saagar (Hindi) ↔ samudra (Marathi) ("large water body" as well as "limitless")
• Such words thus form a closed loop of translations
• In such cases the algorithm does not progress and gets stuck at the initial values
• The same is the case for some language-specific words for which corresponding synsets were not available in the other language
• Such words accounted for 17-19% of the total words in the test corpus
Results – Eliminating Words with the Problem of Non-Progressive Estimation

• Results are now closer to the Wordnet First Sense baseline
• For 2 out of the 4 language-domain pairs the results are slightly better than WFS, which is remarkable for an unsupervised approach
Demo
• IIT Bombay's system:
  www.cfilt.iitb.ac.in/UNL_enco
• Textual Entailment:
http://10.14.26.15:8084/TextualEntailmentInterface/index.jsp
Conclusions (1/2)

• NLP is all about processing ambiguity, with WSD as a fundamental task
• Resource constraints and multilinguality bring additional challenges
• Wordnet: Great unifier of India (similar to Adi Shankaracharya, Bollywood films…)
• Getting linked with English WN; would like to link with Eurowordnet
• Application in MT, Search, Language teaching, e-commerce
Future work
• Closer study needed for familially close languages
• Usage of language-specific properties, in particular morphology
• The projection idea can be used in other NLP problems like POS tagging and parsing
URLs
• For resources: www.cfilt.iitb.ac.in
• For publications: www.cse.iitb.ac.in/~pb
Thank you
Questions and comments?