

    Wordnet and Word Sense Disambiguation

Lecture delivered at the summer school on NLP, IIIT Hyderabad, 10 July 2012

by Pushpak Bhattacharyya

Computer Science and Engineering Department, IIT Bombay

    {pb}@cse.iitb.ac.in

    Background

NLP: Two pictures

[Diagram: the NLP Trinity. NLP, alongside Vision and Speech, is organised along three axes: Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Algorithm (CRF, HMM, MEMM; Statistics and Probability + Knowledge Based), and Language (Hindi, Marathi, English, French).]

    NLP: Thy Name is Disambiguation

    • I went with my friend to the bank to withdraw some money, but was disappointed to find it closed

    • POS disambiguation: bank (NN/VM), withdraw (NN/VM), closed (JJ/VM)

    • Sense Disambiguation: bank (finance/place), withdraw (take out/go_away)

• Co-reference disambiguation: it (bank/money)

• Pro-drop disambiguation: "… <I/friend> was disappointed …"

• Scope disambiguation for "with": (my friend) vs. (my friend to the bank)


Where there is a will,
There is a way.

Where there is a will,
There are hundreds of relatives.

    Stages of processing

• Phonetics and phonology
• Morphology
• Lexical analysis
• Syntactic analysis
• Semantic analysis
• Pragmatics
• Discourse


    Lexical Disambiguation

First step: part-of-speech disambiguation
• Dog as a noun (animal)
• Dog as a verb (to pursue)

Sense disambiguation
• Dog (as animal)
• Dog (as a very detestable person)

Needs word relationships in a context
• The chair emphasised the need for adult education

Very common in day-to-day communication. Satellite channel ad: "Watch what you want, when you want" (two senses of watch); e.g., "ground-breaking" ceremony/research.

    Lexical Analysis

    • Essentially refers to dictionary access and obtaining the properties of the word

e.g. dog
  noun (lexical property)
  takes '-s' in plural (morph property)
  animate (semantic property)
  4-legged (semantic property)
  carnivore (semantic property)

    Challenge: Lexical or word sense disambiguation

    Ambiguity of Multiwords

• The grandfather kicked the bucket after suffering from cancer.
• This job is a piece of cake.
• Put the sweater on.
• He is the dark horse of the match.

    Google Translations of above sentences:

दादा कैंसर से पीड़ित होने के बाद बाल्टी लात मारी. (the bucket is kicked literally)
इस काम के केक का एक टुकड़ा है. (a literal piece of cake)
स्वेटर पर रखो. (put [it] on the sweater)
वह मैच के अंधेरे घोड़ा है. (a literally dark horse)

Ambiguity of Named Entities

• Bengali: চঞ্চল সরকার বাড়িতে আছেন
  Google English: Government is restless at home. (*)
  Intended: Chanchal Sarkar is at home.

• Hindi: दैनिक दबंग दुनिया
  Google English: everyday bold world
  Actually the name of a Hindi newspaper in Indore.

• High degree of overlap between NEs and MWEs.

• Treat differently: transliterate, do not translate.


    Challenges in Syntactic Processing: Structural Ambiguity

• Scope
  1. The old men and women were taken to safe locations: (old (men and women)) vs. ((old men) and women)
  2. No smoking areas will allow hookahs inside.

• Preposition phrase attachment
  - I saw the boy with a telescope. (who has the telescope?)
  - I saw the mountain with a telescope. (world knowledge: a mountain cannot be an instrument of seeing)
  - I saw the boy with the pony-tail. (world knowledge: a pony-tail cannot be an instrument of seeing)

Very ubiquitous; newspaper headline: "20 years later, BMC pays father 20 lakhs for causing son's death"

    Textual Humour (1/2)

1. Teacher (angrily): Did you miss the class yesterday?
   Student: Not much.

    2. A man coming back to his parked car sees the sticker "Parking fine". He goes and thanks the policeman for appreciating his parking skill.

3. Son: Mother, I broke the neighbour's lamp shade.
   Mother: Then we have to give them a new one.
   Son: No need, aunty said the lamp shade is irreplaceable.

    4. Ram: I got a Jaguar car for my unemployed youngest son.Shyam: That's a great exchange!

    5. Shane Warne should bowl maiden overs, instead of bowling maidens over

    Textual Humour (2/2)

• It is not hard to meet the expenses nowadays, you find them everywhere.

• Teacher: What do you think is the capital of Ethiopia?
  Student: What do you think?
  Teacher: I do not think, I know.
  Student: I do not think I know.

Example of WSD

• Operation, surgery, surgical operation, surgical procedure, surgical process -- (a medical procedure involving an incision with instruments; performed to repair damage or arrest disease in a living body; "they will schedule the operation as soon as an operating room is available"; "he died while undergoing surgery") TOPIC->(noun) surgery#1

    • Operation, military operation -- (activity by a military or naval force (as a maneuver or campaign); "it was a joint operation of the navy and air force") TOPIC->(noun) military#1, armed forces#1, armed services#1, military machine#1, war machine#1

    • Operation -- ((computer science) data processing in which the result is completely specified by a rule (especially the processing that results from a single instruction); "it can perform millions of operations per second") TOPIC->(noun) computer science#1, computing#1

    • mathematical process, mathematical operation, operation --((mathematics) calculation by mathematical methods; "the problems at the end of the chapter demonstrated the mathematical processes involved in the derivation"; "they were learning the basic operations of arithmetic") TOPIC->(noun) mathematics#1, math#1, maths#1

    IS WSD NEEDED IN LARGE APPLICATIONS?


Word ambiguity → topic drift in IR

Query word: "Madrid bomb blast case"

Expansion: {case, container} → {case, suit, lawsuit} → {suit, apparel}

Drifted topic due to expanded term!!!
Drifted topic due to inapplicable sense!!!

[Bar chart: our observations on error percentages due to various factors, CLEF 2007. Error percentage (0-50) for Hindi-English and Marathi-English CLIR, broken down by transliteration, translation disambiguation, stemmer, dictionary and ranking.]

    How about WSD and MT?

    Zaheer Khan, the India fast bowler, has been ruled out of the remainder of the series against England.

    He will return to India and will be replaced by left-arm seamer RP Singh.

    Zaheer picked up a hamstring injury during the first Test at Lord's.

    He had been withdrawn from the squad for India's recent Test series in the West Indies due to a right ankle injury.

भारत के तेज गेंदबाज, जहीर खान, इंग्लैंड के खिलाफ श्रृंखला के शेष के बाहर शासन किया गया है. (ruled in the administrative sense??)

वह भारत लौटने और बाएँ हाथ के तेज गेंदबाज आरपी सिंह द्वारा प्रतिस्थापित किया जाएगा.

जहीर लॉर्ड्स में पहले टेस्ट के दौरान हैमस्ट्रिंग चोट उठाया. (lifted??)

वह भारत की वेस्ट इंडीज में हाल ही में एक सही (correct??) टखने की चोट के कारण टेस्ट श्रृंखला के लिए टीम से वापस ले लिया गया था.

    Wordnet


Psycholinguistic Theory
• Human lexical memory stores nouns as a hierarchy.
• Can a canary sing? - Pretty fast response.
• Can a canary fly? - Slower response.
• Does a canary have skin? - Slowest response.

[Hierarchy: Animal (can move, has skin) → Bird (can fly) → canary (can sing)]

Wordnet - a lexical reference system based on psycholinguistic theories of human lexical memory.

    Essential Resource for WSD: Wordnet

Word Meanings × Word Forms: an entry Ei,j is present when form Fj expresses meaning Mi; a row collects synonyms, a column shows a polysemous form (here F2 = bank):

              F1      F2 (bank)          F3 (rely)   …   Fn
M1 (depend)   E1,1    E1,2               E1,3
M2 (bank)             E2,2 (embankment)  E2,…
M3 (bank)             E3,2               E3,3
…                                                    …
Mm                                                       Em,n

    Wordnet: History

• The first wordnet in the world was for English, developed at Princeton over 15 years.

• EuroWordnet, a linked structure of European language wordnets, was built over 3 years, around 1998, with funding from the EC as a mission-mode project.

• The wordnets for Hindi and Marathi being built at IIT Bombay are amongst the first Indian language wordnets.

• All of these are proposed to be linked into IndoWordnet, which will eventually be linked to the English and Euro wordnets.

    Basic Principle

• Words in natural languages are polysemous.
• However, when synonymous words are put together, a unique meaning often emerges.
• Use is made of Relational Semantics.


    Lexical and Semantic relations in wordnet

1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy

1, 3 and 5 are lexical (word to word); the rest are semantic (synset to synset).

WordNet Sub-Graph

[Figure: synset {house, home}, gloss "a place that serves as the living quarters of one or more families"; hypernym {dwelling, abode}; hyponyms {hermitage}, {cottage}; meronyms {bedroom}, {kitchen}, {guestroom}, {veranda}, {backyard}, {study}.]

    Fundamental Design Question

• Syntagmatic vs. paradigmatic relations?
• Psycholinguistics is the basis of the design.
• When we hear a word, many words come to our mind by association.
• For English, about half of the associated words are syntagmatically related and half are paradigmatically related.
• For cat:
  - animal, mammal: paradigmatic
  - mew, purr, furry: syntagmatic

    Stated Fundamental Application of Wordnet: Sense Disambiguation

Determination of the correct sense of the word:
  The crane ate the fish vs.
  The crane was used to lift the load

    bird vs. machine


    The problem of Sense tagging

• Given a corpus, assign the correct sense to each word.

• This is sense tagging; it needs Word Sense Disambiguation (WSD).

• Highly important for Question Answering, Machine Translation and Text Mining tasks.

    Classification of Words

Word
  Content word: Verb, Noun, Adjective, Adverb
  Function word: Preposition, Conjunction, Pronoun, Interjection

    Example of sense marking: its need

एक_4187 नए शोध_1138 के अनुसार_3123 जिन लोगों_1189 का सामाजिक_43540 जीवन_125623 व्यस्त_48029 होता है उनके दिमाग_16168 के एक_4187 हिस्से_120425 में अधिक_42403 जगह_113368 होती है।

(According to new research, people who have a busy social life have larger space in a part of their brain.)

नेचर न्यूरोसाइंस में छपे एक_4187 शोध_1138 के अनुसार_3123 कई_4118 लोगों_1189 के दिमाग_16168 के स्कैन से पता_11431 चला कि दिमाग_16168 का एक_4187 हिस्सा_120425 एमिग्डाला सामाजिक_43540 व्यस्तताओं_1438 के साथ_328602 सामंजस्य_166 के लिए थोड़ा_38861 बढ़_25368 जाता है। यह शोध_1138 58 लोगों_1189 पर किया गया जिसमें उनकी उम्र_13159 और दिमाग_16168 की साइज़ के आँकड़े_128065 लिए गए। अमरीकी_413405 टीम_14077 ने पाया_227806 कि जिन लोगों_1189 की सोशल नेटवर्किंग अधिक_42403 है उनके दिमाग_16168 का एमिग्डाला वाला हिस्सा_120425 बाकी_130137 लोगों_1189 की तुलना_में_38220 अधिक_42403 बड़ा_426602 है। दिमाग_16168 का एमिग्डाला वाला हिस्सा_120425 भावनाओं_1912 और मानसिक_42151 स्थिति_1652 से जुड़ा हुआ माना_212436 जाता है।

Ambiguity of लोगों (people)

• लोग, जन, लोक, जनमानस, पब्लिक - एक से अधिक व्यक्ति "लोगों के हित में काम करना चाहिए"
  – (English synset) multitude, masses, mass, hoi_polloi, people, the_great_unwashed - the common people generally; "separate the warriors from the mass"; "power to the people"

• दुनिया, दुनियाँ, संसार, विश्व, जगत, जहाँ, जहान, ज़माना, जमाना, लोक, दुनियावाले, दुनियाँवाले, लोग - संसार में रहने वाले लोग "महात्मा गाँधी का सम्मान पूरी दुनिया करती है / मैं इस दुनिया की परवाह नहीं करता / आज की दुनिया पैसे के पीछे भाग रही है"
  – (English synset) populace, public, world - people in general considered as a whole; "he is a hero in the eyes of the public"


    Basic Principle

• Words in natural languages are polysemous.
• However, when synonymous words are put together, a unique meaning often emerges.
• Use is made of Relational Semantics.
• Componential Semantics, where each word is a bundle of semantic features (as in the Schankian Conceptual Dependency system or Lexical Componential Semantics), is to be examined as a viable alternative.

    Componential Semantics

• Consider cat and tiger. Decide on componential attributes: Furry, Carnivorous, Heavy, Domesticable.
• For cat: (Y, Y, N, Y)
• For tiger: (Y, Y, Y, N)

Complete and correct attributes are difficult to design.

    Semantic relations in wordnet

1. Synonymy
2. Hypernymy / Hyponymy
3. Antonymy
4. Meronymy / Holonymy
5. Gradation
6. Entailment
7. Troponymy

1, 3 and 5 are lexical (word to word); the rest are semantic (synset to synset).

    Synset: the foundation(house)

1. house -- (a dwelling that serves as living quarters for one or more families; "he has a house on Cape Cod"; "she felt she had to get out of the house")
2. house -- (an official assembly having legislative powers; "the legislature has two houses")
3. house -- (a building in which something is sheltered or located; "they had a large carriage house")
4. family, household, house, home, menage -- (a social unit living together; "he moved his family to Virginia"; "It was a good Christian household"; "I waited until the whole house was asleep"; "the teacher asked how many people made up his home")
5. theater, theatre, house -- (a building where theatrical performances or motion-picture shows can be presented; "the house was full")
6. firm, house, business firm -- (members of a business organization that owns or operates one or more establishments; "he worked for a brokerage house")
7. house -- (aristocratic family line; "the House of York")
8. house -- (the members of a religious community living together)
9. house -- (the audience gathered together in a theatre or cinema; "the house applauded"; "he counted the house")
10. house -- (play in which children take the roles of father or mother or children and pretend to interact like adults; "the children were playing house")
11. sign of the zodiac, star sign, sign, mansion, house, planetary house -- ((astrology) one of 12 equal areas into which the zodiac is divided)
12. house -- (the management of a gambling house or casino; "the house gets a percentage of every bet")


    Creation of Synsets

Three principles:
• Minimality
• Coverage
• Replaceability

    Synset creation (continued)

    HomeJohn’s home was decorated with lights on the occasion of Christmas.Having worked for many years abroad, John Returned home.

    HouseJohn’s house was decorated with lights on the occasion of Christmas.Mercury is situated in the eighth house of John’s horoscope.

    Synsets (continued)

{house} is ambiguous. {house, home} has the sense of a social unit living together. Is this the minimal unit? {family, house, home} will make the unit completely unambiguous.

For coverage: {family, household, house, home}, ordered according to frequency.

Replaceability of the most frequent words is a requirement.

    Synset creation

From first principles:
– Pick all the senses from good standard dictionaries.
– Obtain synonyms for each sense.
– Needs hard and long hours of work.


    Synset creation (continued)

From the wordnet of another language in the same family:
– Pick the synset and obtain the sense from the gloss.
– Get the words of the target language.
– Often the same words can be used, especially for tatsama (तत्सम) words.
– Translation, insertion and deletion.

Synset + Gloss + Example

Crucially needed for concept explication, wordnet building using another wordnet, and wordnet linking.

English synset: {earthquake, quake, temblor, seism} -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)

Hindi synset: भूकंप, भूचाल, भूडोल, जलजला, भूकम्प, भू-कंप, भू-कम्प, ज़लज़ला, भूमिकंप, भूमिकम्प - प्राकृतिक कारण से पृथ्वी के भीतरी भाग में कुछ उथल-पुथल होने से ऊपरी भाग के सहसा हिलने की क्रिया "२००१ में गुज़रात में आये भूकंप में काफ़ी लोग मारे गये थे"

(shaking of the surface of the earth; many were killed in the earthquake in Gujarat)

Marathi synset: धरणीकंप, भूकंप - पृथ्वीच्या पोटात क्षोभ होऊन पृष्ठभाग हालण्याची क्रिया "२००१ साली गुजरातमध्ये झालेल्या धरणीकंपात अनेक लोक मृत्युमुखी पडले"

    Semantic Relations

• Hypernymy and Hyponymy
  – Relation between word senses (synsets)
  – X is a hyponym of Y if X is a kind of Y
  – Hyponymy is transitive and asymmetrical
  – Hypernymy is the inverse of Hyponymy
  (lion → animal → animate entity → entity; see the sketch below)
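A quick way to see this transitive chain, as a minimal sketch assuming NLTK with the WordNet corpus downloaded (nltk.download('wordnet')):

```python
# Print one root-to-sense hypernym path for 'lion', root first.
from nltk.corpus import wordnet as wn

lion = wn.synset('lion.n.01')
for path in lion.hypernym_paths()[:1]:
    print(' -> '.join(s.name() for s in path))
```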

    Semantic Relations (continued)

• Meronymy and Holonymy
  – Part-whole relation: branch is a part of tree
  – X is a meronym of Y if X is a part of Y
  – Holonymy is the inverse relation of Meronymy
  {kitchen} is a part of {house}


    Lexical Relation

• Antonymy
  – Oppositeness in meaning
  – Relation between word forms
  – Often determined by phonetics, word length etc.
  ({rise, ascend} vs. {fall, descend})

WordNet Sub-Graph

[Figure: synset {house, home}, gloss "a place that serves as the living quarters of one or more families"; hypernym {dwelling, abode}; hyponyms {hermitage}, {cottage}; meronyms {bedroom}, {kitchen}, {guestroom}, {veranda}, {backyard}, {study}.]

    Troponym and Entailment

• Entailment: {snoring - sleeping}
• Troponymy: {limp, strut - walk}, {whisper - talk}

    Entailment

Snoring entails sleeping. Buying entails paying.

• Proper temporal inclusion: the inclusion can go either way. Sleeping temporally includes snoring; buying temporally includes paying.

• Co-extensiveness (troponymy): limping is a manner of walking.


Opposition among verbs

• {rise, ascend} vs. {fall, descend}
  – tie - untie (do - undo)
  – walk - run (slow - fast)
  – teach - learn (same activity, different perspective)
  – rise - fall (motion upward vs. downward)

• Opposition and entailment: hit or miss (both entail aim; backward presupposition); succeed or fail (both entail try).

• The causal relationship: show - see; give - have.

• Causation and entailment: giving entails having; feeding entails eating.

    Kinds of Antonymy

Size         Small - Big
Quality      Good - Bad
State        Warm - Cool
Personality  Dr. Jekyll - Mr. Hyde
Direction    East - West
Action       Buy - Sell
Amount       Little - A lot
Place        Far - Near
Time         Day - Night
Gender       Boy - Girl


Kinds of Meronymy

Component - object   Head - Body
Stuff - object       Wood - Table
Member - collection  Tree - Forest
Feature - activity   Speech - Conference
Place - area         Palo Alto - California
Phase - state        Youth - Life
Resource - process   Pen - Writing
Actor - act          Physician - Treatment

    Gradation

State        Childhood, Youth, Old age
Temperature  Hot, Warm, Cold
Action       Sleep, Doze, Wake

    Metonymy

    • Associated with Metaphors which are epitomes of semantics

    • Oxford Advanced Learners Dictionary definition: “The use of a word or phrase to mean something different from the literal meaning”

    • Does it mean Careless Usage?!

    Insight from Sanskritic Tradition

    • Power of a word– Abhidha, Lakshana, Vyanjana

• Meaning of "The hall":
  – The hall is packed (abhidha)
  – The hall burst into laughter (lakshana)
  – The hall is full (unsaid: and so we cannot enter) (vyanjana)


    Metaphors in Indian Tradition

• upamana and upameya
  – upamana: the object compared with (the standard)
  – upameya: the object being compared
  – Puru was like a lion in the battle with Alexander
    (Puru: upameya; lion: upamana)

    Upamana, rupak, atishayokti

    • upamana: Explicit comparison– Puru was like a lion in the battle with Alexander

    • rupak: Implicit comparison– Puru was a lion in the battle with Alexander

    • Atishayokti (exaggeration): upamana and upameya dropped– Puru’s army fled. But the lion fought on.

Modern study (1956 onwards, Richards et al.)

• Three constituents of metaphor:
  – Vehicle (the item used metaphorically)
  – Tenor (the metaphorical meaning of the vehicle)
  – Ground (the basis for the metaphorical extension)

• "The foot of the mountain"
  – Vehicle: "foot"
  – Tenor: "lower portion"
  – Ground: the spatial parallel between the relationship of the foot to the human body and of the lower portion of the mountain to the rest of the mountain

    Interaction of semantic fields(Haas)

• Core vs. peripheral semantic fields
• Interaction of two words in a metonymic relation brings in new semantic fields with selective inclusion of features
• Leg of a table:
  – does not stretch or move
  – does stand and support


    Lakoff’s (1987) contribution

    • Source Domain• Target Domain• Mapping Relations

Mapping Relations: ontological correspondences

• Anger is heat of fluid in a container

  Heat                      → Anger
  (i) Container             → Body
  (ii) Agitation of fluid   → Agitation of mind
  (iii) Limit of resistance → Limit of ability to suppress
  (iv) Explosion            → Loss of control

    Image Schemas

• Categories: container and contained
• Quantity
  – More is up, less is down: outputs rose dramatically; accident rates were lower
  – Linear scales and paths: Ram is by far the best performer
• Time
  – Stationary event: we are coming to exam time
  – Stationary observer: weeks rush by
• Causation: desperation drove her to extreme steps

    Patterns of Metonymy

• Container for contained
  – The kettle boiled (water)
• Possessor for possessed/attribute
  – Where are you parked? (car)
• Represented entity for representative
  – The government will announce new targets
• Whole for part
  – I am going to fill up the car with petrol


    Patterns of Metonymy (contd)

• Part for whole
  – I noticed several new faces in the class
• Place for institution
  – Lalbaug witnessed the largest Ganapati

Question: can you have part-part metonymy?

    Purpose of Metonymy

• More idiomatic/natural way of expression
  – More natural to say "the kettle is boiling" as opposed to "the water in the kettle is boiling"
• Economy
  – Room 23 is answering (but not *is asleep)
• Ease of access to referent
  – He is in the phone book (but not *on the back of my hand)
• Highlighting of associated relation
  – The car in front decided to turn right (but not *to smoke a cigarette)

    Feature sharing not necessary

• In a restaurant:
  – "Jalebi ko abhi dudh chahiye" (the jalebi now wants some milk: no feature sharing)
  – "The elephant now wants some coffee" (feature sharing)

    Proverbs

• A proverb describes a specific event or state of affairs which is applicable metaphorically to a range of events or states of affairs, provided they have the same or sufficiently similar image-schematic structure.


    IndoWordNet

    Linked Indian Language Wordnets

[Diagram: INDOWORDNET, with the Hindi Wordnet at the hub, linked to the Marathi, Sanskrit, Bengali, Punjabi, Konkani, Urdu, Gujarati, Oriya and Kashmiri wordnets, the Dravidian language and North East language wordnets, and the English Wordnet.]

Size of Indian language wordnets (June 2012) 1/2

Assamese   14958  Guwahati University, Guwahati, Assam
Bengali    23765  Indian Statistical Institute, Kolkata, West Bengal
Bodo       15785  Guwahati University, Guwahati, Assam
Gujarati   26580  Dharmsinh Desai University, Nadiad, Gujarat
Kannada     4408  Mysore University, Mysore, Karnataka
Kashmiri   23982  Kashmir University, Srinagar, Jammu and Kashmir
Konkani    25065  Goa University, Panaji, Goa
Malayalam   8557  Amrita University, Coimbatore, Tamil Nadu
Manipuri   16351  Manipur University, Imphal, Manipur
Marathi    24954  IIT Bombay, Mumbai, Maharashtra

Size of Indian language wordnets (June 2012) 2/2

Nepali     11713  Assam University, Silchar, Assam
Oriya      31454  Hyderabad Central University, Hyderabad, Andhra Pradesh
Punjabi    22332  Thapar University and Punjabi University, Patiala, Punjab
Sanskrit   18980  IIT Bombay, Mumbai
Tamil       8607  Tamil University, Thanjavur, Tamil Nadu
Telugu     14246  Dravidian University, Kuppam, Andhra Pradesh
Urdu       23071  Jawaharlal Nehru University, New Delhi


    Categories of Synsets (1/2)

• Universal: synsets which have an indigenous lexeme in all the languages (e.g. Sun, Earth).

• Pan-Indian: synsets which have an indigenous lexeme in all the Indian languages but no English equivalent (e.g. paapad).

• In-family: synsets which have an indigenous lexeme in a particular language family (e.g. the term for bhatija in Dravidian languages).

    Categories of Synsets (2/2)

• Language-specific: synsets which are unique to a language (e.g. Bihu in Assamese).

• Rare: synsets which express technical terms (e.g. n-gram).

• Synthesized: synsets created in the language due to the influence of another language (e.g. pizza).

    Need for categorization

• To bring systematicity to the way the wordnet synsets are linked:
  Universal → Pan Indian → Language Family → Language → Synthesised → Rare

• All members have finished the Universal and Pan Indian synsets.

    Categorization methodology

• 34378 Hindi synsets were sent to all IndoWordnet groups in the tool, with two options for categorization: Yes / No.

• Universal synsets: the synsets categorized Yes that also have equivalent English words or synsets.

• Pan-Indian synsets: the synsets categorized Yes that do not have equivalent English words or synsets.


    Expansion approach: linking is a subtle and difficult process

• To link or not to link
• While linking:
  – face lexical and semantic chasms
  – syntactic divergences in the example sentences
    • change of POS
    • copula drop (Hindi → Bangla)

Case of Kashmiri

Linking kinship relations and fine-grained concepts:

[Hierarchy: Relative → Uncle → {mama (maternal uncle), chacha (paternal uncle)}]

पानी - direct - आब
पानी - hypernym - त्रेश

    Important decision

• TWO kinds of linkages:
  – direct
  – hypernymy

Case of Kashmiri:
  पानी - direct - आब
  पानी - hypernym - त्रेश

How to express a concept not present in the language?


    Transliteration: often employed

• Synset ID: 39; POS: adjective; Synonyms: सनाथ (sanaatha)

• Gloss: जिसका कोई पालन-पोषण या देखभाल करने वाला हो (opposite of orphan)

• Example statement: "सनाथ बालकों को अनाथ बालकों की मदद करनी चाहिए (children who are looked after should help the orphans) / साधक प्रभु का हो जाने पर अनाथ नहीं रहता, सनाथ हो जाता है"

    • Transliterated and adopted by Bangla and Gujarati

Short phrase: often employed

[Examples shown in Bangla and in Urdu (meaning "inauspicious")]

Linking synsets across languages: influence on Hindi Wordnet

Hindi wordnet has to add new synsets to accommodate language-specific concepts, e.g., in Gujarati:

ભૈરવજપ (bhairav jap)

ID :: 103040

CAT :: NOUN

CONCEPT :: मोक्ष के लिए जप करते हुए पर्वत पर से अपने आप को गिराना (taking God's name and throwing oneself from atop a mountain to attain liberation)

EXAMPLE :: गिरनार के शिखर पर से यात्रिक भैरवजप करते थे ऐसा माना जाता है। (it is thought that pilgrims used to do bhairav jap atop Girnar mountain)

    SYNSET-HINDI :: भैरवजप

    Overview of WSD techniques


Bird's eye view

WSD approaches:
• Knowledge Based
• Machine Learning: Supervised, Unsupervised, Semi-supervised
• Hybrid

    OVERLAP BASED APPROACHES

    • Require a Machine Readable Dictionary (MRD).

• Find the overlap between the features of different senses of an ambiguous word (sense bag) and the features of the words in its context (context bag).

    • These features could be sense definitions, example sentences, hypernyms etc.

    • The features could also be given weights.

    • The sense which has the maximum overlap is selected as the contextually appropriate sense.


    LESK’S ALGORITHM

    From Wordnet

    • The noun ash has 3 senses (first 2 from tagged texts)

    • 1. (2) ash -- (the residue that remains when something is burned)

    • 2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved ornamental or timber trees of the genus Fraxinus)

    • 3. ash -- (strong elastic wood of any of various ash trees; used for furniture and tool handles and sporting goods such as baseball bats)

    • The verb ash has 1 sense (no senses from tagged texts)

• 1. ash -- (convert into ashes)

    Sense Bag: contains the words in the definition of a candidate sense of the ambiguous word.

    Context Bag: contains the words in the definition of each sense of each context word.

    E.g. “On burning coal we get ash.”
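A minimal sketch of this overlap computation, assuming NLTK's WordNet as the machine-readable dictionary; the sense bag and context bag are built exactly as defined above:

```python
from nltk.corpus import wordnet as wn

def context_bag(context_words):
    # Words in the definition of each sense of each context word.
    bag = set()
    for w in context_words:
        for sense in wn.synsets(w):
            bag |= set(sense.definition().lower().split())
    return bag

def lesk(ambiguous_word, context_words):
    # Pick the sense whose definition (sense bag) overlaps the context bag most.
    bag = context_bag(context_words)
    return max(wn.synsets(ambiguous_word),
               key=lambda s: len(set(s.definition().lower().split()) & bag),
               default=None)

# "On burning coal we get ash."
print(lesk('ash', ['burning', 'coal']))
```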

    CRITIQUE

    • Proper nouns in the context of an ambiguous word can act as strong disambiguators.

    E.g. “Sachin Tendulkar” will be a strong indicator of the category “sports”.

    Sachin Tendulkar plays cricket.

    • Proper nouns are not present in the thesaurus. Hence this approach fails to capture the strong clues provided by proper nouns.

    • Accuracy– 50% when tested on 10 highly polysemous English words.



    Extended Lesk’s algorithm

    – Original algorithm is sensitive towards exact words in the definition.

    – Extension includes glosses of semantically related senses from WordNet (e.g. hypernyms, hyponyms, etc.).

    – The scoring function becomes:

where
  – gloss(S) is the gloss of sense S from the lexical resource,
  – context(W) is the gloss of each sense of each context word,
  – rel(S) gives the senses related to S in WordNet under some relations.

score_ext(S) = Σ_{s' ∈ rel(S) or s' = S} | context(W) ∩ gloss(s') |

WordNet Sub-Graph

[Figure: synset {house, home}, gloss "a place that serves as the living quarters of one or more families"; hypernym {dwelling, abode}; hyponyms {hermitage}, {cottage}; meronyms {bedroom}, {kitchen}, {guestroom}, {veranda}, {backyard}, {study}.]

    Example: Extended Lesk

    • “On combustion of coal we get ash”

    From Wordnet

• The noun ash has 3 senses (first 2 from tagged texts)

  1. (2) ash -- (the residue that remains when something is burned)

  2. (1) ash, ash tree -- (any of various deciduous pinnate-leaved ornamental or timber trees of the genus Fraxinus)

  3. ash -- (strong elastic wood of any of various ash trees; used for furniture and tool handles and sporting goods such as baseball bats)

• The verb ash has 1 sense (no senses from tagged texts)

  1. ash -- (convert into ashes)

    Example: Extended Lesk (cntd)

    • “On combustion of coal we get ash”

    From Wordnet (through hyponymy)

• ash -- (the residue that remains when something is burned)

  => fly ash -- (fine solid particles of ash that are carried into the air when fuel is combusted)

  => bone ash -- (ash left when bones burn; high in calcium phosphate; used as fertilizer and in bone china)
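A sketch of the extension under the same NLTK assumptions, with hypernyms and hyponyms standing in for rel(S):

```python
from nltk.corpus import wordnet as wn

def extended_gloss(sense):
    # Pool the gloss of the sense with the glosses of its related senses.
    related = [sense] + sense.hypernyms() + sense.hyponyms()
    words = set()
    for s in related:
        words |= set(s.definition().lower().split())
    return words

def extended_lesk(word, context_words):
    context = set(w.lower() for w in context_words)
    return max(wn.synsets(word),
               key=lambda s: len(extended_gloss(s) & context),
               default=None)

# "On combustion of coal we get ash"
print(extended_lesk('ash', ['combustion', 'coal']))
```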


    Critique of Extended Lesk

• Larger region of matching in WordNet
  – Increased chance of matching
  BUT
  – Increased chance of topic drift

WALKER'S ALGORITHM

• A thesaurus-based approach.
• Step 1: For each sense of the target word, find the thesaurus category to which that sense belongs.
• Step 2: Calculate the score for each sense using the context words. A context word adds 1 to the score of a sense if the thesaurus category of the word matches that of the sense.

  E.g. "The money in this bank fetches an interest of 8% per annum"
  Target word: bank
  Clue words from the context: money, interest, annum, fetch

            Sense 1: Finance   Sense 2: Location
  Money     +1                 0
  Interest  +1                 0
  Fetch     0                  0
  Annum     +1                 0
  Total     3                  0

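A toy sketch of this two-step scoring; the thesaurus and category names below are hypothetical stand-ins for a real Roget-style thesaurus:

```python
# Thesaurus categories of context words (hypothetical data).
THESAURUS = {
    'money':    {'finance'},
    'interest': {'finance', 'curiosity'},
    'fetch':    set(),
    'annum':    {'finance'},
}
# Step 1: thesaurus category of each sense of the target word.
SENSE_CATEGORY = {'bank/finance': 'finance', 'bank/location': 'location'}

def walker(senses, context_words):
    # Step 2: a context word adds 1 to every sense whose category it shares.
    scores = {s: 0 for s in senses}
    for s in senses:
        for w in context_words:
            if SENSE_CATEGORY[s] in THESAURUS.get(w, set()):
                scores[s] += 1
    return max(scores, key=scores.get)

# "The money in this bank fetches an interest of 8% per annum"
print(walker(['bank/finance', 'bank/location'],
             ['money', 'interest', 'fetch', 'annum']))  # bank/finance, score 3
```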

WSD USING CONCEPTUAL DENSITY (Agirre and Rigau, 1996)

• Select a sense based on the relatedness of that word sense to the context.

• Relatedness is measured in terms of conceptual distance (i.e. how close the concept represented by the word and the concepts represented by its context words are).

• This approach uses a structured hierarchical semantic net (WordNet) to find the conceptual distance.

• The smaller the conceptual distance, the higher the conceptual density (i.e. if all words in the context are strong indicators of a particular concept, then that concept will have a higher density).


CONCEPTUAL DENSITY FORMULA

Wish list:
• The conceptual distance between two words should be proportional to the length of the path between the two words in the hierarchical tree (WordNet).
• The conceptual distance between two words should be proportional to the depth of the concepts in the hierarchy.

  CD(c, m) = ( Σ_{i=0}^{m-1} nhyp^(i^0.20) ) / ( Σ_{i=0}^{h-1} nhyp^i )

where c = concept, nhyp = mean number of hyponyms, h = height of the sub-hierarchy, m = number of senses of the word and senses of context words contained in the sub-hierarchy, CD = conceptual density, and 0.20 is the smoothing factor.
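A direct transcription of the formula above, as a sketch rather than the authors' reference implementation:

```python
def conceptual_density(nhyp, h, m):
    """CD for a sub-hierarchy with mean branching nhyp, height h, m senses."""
    num = sum(nhyp ** (i ** 0.20) for i in range(m))  # smoothed numerator
    den = sum(nhyp ** i for i in range(h))            # size of the sub-hierarchy
    return num / den if den else 0.0
```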

[Figure: a sub-tree rooted at "entity" with "finance" and "location" sub-hierarchies; senses bank-1 (finance) and bank-2 (location), plus "money"; d marks the depth and h the height of the concept "location" sub-tree.]


    CONCEPTUAL DENSITY (cntd)


• The dots in the figure represent the senses of the word to be disambiguated or the senses of the words in context.

• The CD formula yields the highest density for the sub-hierarchy containing more senses.

• The sense of W contained in the sub-hierarchy with the highest CD is chosen.

    CONCEPTUAL DENSITY (EXAMPLE)

    The jury(2) praised the administration(3) and operation (8) of Atlanta Police Department(1)

    Step 1: Make a lattice of the nouns in the context, their senses and hypernyms.

    Step 2: Compute the conceptual density of resultant concepts (sub-hierarchies).

    Step 3: The concept with the highest CD is selected.

    Step 4: Select the senses below the selected concept as the correct sense for the respective words.

[Lattice: administrative_unit → division → operation; committee → jury; department → government department → local department → police department; body → administration. CD = 0.256 for the winning sub-hierarchy vs. CD = 0.062.]

    CRITIQUE

    • Resolves lexical ambiguity of nouns by finding a combination of senses that

    maximizes the total Conceptual Density among senses.

    • The Good

    – Does not require a tagged corpus.

    • The Bad

    – Fails to capture the strong clues provided by proper nouns in the context.

    • Accuracy

    – 54% on Brown corpus.


WSD USING RANDOM WALK ALGORITHM (PageRank) (Sinha and Mihalcea, 2007)

[Graph: the words "Bell ring church Sunday", each with candidate senses S1-S3 as vertices; weighted edges between senses of different words (weights such as 0.46, 0.49, 0.92, 0.97, 0.35, 0.56, 0.42, 0.63, 0.58, 0.67) from definition-based similarity.]

Step 1: Add a vertex for each possible sense of each word in the text.
Step 2: Add weighted edges using definition-based semantic similarity (Lesk's method).
Step 3: Apply a graph-based ranking algorithm to find the score of each vertex (i.e. of each word sense).
Step 4: Select the vertex (sense) which has the highest score.
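A sketch of steps 1-4, assuming NLTK's WordNet plus the networkx library, with plain Lesk gloss overlap as the edge weight:

```python
import networkx as nx
from nltk.corpus import wordnet as wn

def gloss_overlap(s1, s2):
    g1 = set(s1.definition().lower().split())
    g2 = set(s2.definition().lower().split())
    return len(g1 & g2)

def graph_wsd(words):
    G = nx.Graph()
    senses = {w: wn.synsets(w) for w in words}
    for w, ss in senses.items():                # step 1: one vertex per sense
        G.add_nodes_from((w, s) for s in ss)
    for i, w1 in enumerate(words):              # step 2: weighted edges between
        for w2 in words[i + 1:]:                # senses of different words
            for s1 in senses[w1]:
                for s2 in senses[w2]:
                    sim = gloss_overlap(s1, s2)
                    if sim:
                        G.add_edge((w1, s1), (w2, s2), weight=sim)
    rank = nx.pagerank(G, weight='weight')      # step 3: random-walk scores
    return {w: max(senses[w],                   # step 4: best-scoring sense
                   key=lambda s: rank.get((w, s), 0.0))
            for w in words if senses[w]}

print(graph_wsd(['bell', 'ring', 'church', 'sunday']))
```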



    A look at Page Rank (from Wikipedia)

    Developed at Stanford University by Larry Page (hence the name Page-Rank) and Sergey Brin as part of a research project about a new kind of search engine

    The first paper about the project, describing PageRank and the initial prototype of the Google search engine, was published in 1998

    Shortly after, Page and Brin founded Google Inc., the company behind the Google search engine

    While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools

    A look at Page Rank (cntd)

    PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.

    Assume a small universe of four web pages: A, B, C and D.

    The initial approximation of PageRank would be evenly divided between these four documents. Hence, each document would begin with an estimated PageRank of 0.25.

If pages B, C and D each link only to A, they would each confer 0.25 PageRank to A. All PageRank in this simplistic system would thus gather at A, because all links point to A:

PR(A) = PR(B) + PR(C) + PR(D) = 0.75

A look at Page Rank (cntd)

Suppose that page B has a link to page C as well as to page A, while page D has links to all three pages.

The value of the link-votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D's PageRank is counted for A's PageRank (approximately 0.083).

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3

In general,

PR(u) = Σ_{v ∈ B(u)} PR(v)/L(v), where B(u) is the set of pages that link to u, and L(v) is the number of outbound links from v.

A look at Page Rank (damping factor)

The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d.

PR(u) = (1-d)/N + d · Σ_{v ∈ B(u)} PR(v)/L(v)

where N is the size of the document collection.
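A minimal power-iteration sketch of the damped formula; `links` is a hypothetical adjacency map from each page to the pages it links out to (dangling pages are simply ignored):

```python
def pagerank(links, d=0.85, iterations=50):
    N = len(links)
    pr = {u: 1.0 / N for u in links}
    for _ in range(iterations):
        new = {}
        for u in links:
            # B(u): pages v linking to u; each contributes PR(v)/L(v).
            incoming = sum(pr[v] / len(links[v])
                           for v in links if u in links[v])
            new[u] = (1 - d) / N + d * incoming
        pr = new
    return pr

# Four-page example from above: B, C and D each link only to A.
print(pagerank({'A': [], 'B': ['A'], 'C': ['A'], 'D': ['A']}))
```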


    For WSD: Page Rank

• Given a graph G = (V, E):
  – In(Vi) = predecessors of Vi
  – Out(Vi) = successors of Vi

• In a weighted graph, the walker randomly selects an outgoing edge, with higher probability of selecting edges with higher weight.


    Other Link Based Algorithms

    • HITS algorithm invented by Jon Kleinberg (used by Teoma and now Ask.com)

    • IBM CLEVER project• TrustRank algorithm.

    CRITIQUE

• Relies on random walks on graphs encoding label dependencies.
• The Good
  – Does not require any tagged data (a WordNet is sufficient).
  – The weights on the edges capture the definition-based semantic similarities.
  – Takes into account global data recursively drawn from the entire graph.
• The Bad
  – Poor accuracy
• Accuracy
  – 54% on the SEMCOR corpus, which has a baseline accuracy of 37%.


    KB Approaches – Comparisons

Algorithm                           Accuracy
WSD using Selectional Restrictions  44% on Brown Corpus
Lesk's algorithm                    50-60% on short samples of "Pride and Prejudice" and some "news stories"
Extended Lesk's algorithm           32% on lexical samples from Senseval 2 (wider coverage)
WSD using conceptual density        54% on Brown corpus
WSD using Random Walk Algorithms    54% on SEMCOR corpus (baseline accuracy 37%)
Walker's algorithm                  50% when tested on 10 highly polysemous English words


    KB Approaches –Conclusions

• Drawbacks of WSD using Selectional Restrictions
  – Needs an exhaustive knowledge base.

• Drawbacks of overlap-based approaches
  – Dictionary definitions are generally very small.
  – Dictionary entries rarely take into account the distributional constraints of different word senses (e.g. selectional preferences, kinds of prepositions, etc.; "cigarette" and "ash" never co-occur in a dictionary).
  – Suffer from the problem of sparse match.
  – Proper nouns are not present in an MRD. Hence these approaches fail to capture the strong clues provided by proper nouns.

    SUPERVISED APPROACHES

    NAÏVE BAYES

• The algorithm finds the winner sense using ŝ = argmax_{s ∈ senses} Pr(s | V_w)

• V_w is a feature vector consisting of:
  – POS of w
  – semantic & syntactic features of w
  – collocation vector (set of words around it): typically the next word (+1), next-to-next word (+2), -2, -1 and their POSs
  – co-occurrence vector (number of times w occurs in the bag of words around it)

• Applying Bayes rule and the naive independence assumption:

  ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1..n} Pr(V_w^i | s)

    BAYES RULE AND INDEPENDENCE ASSUMPTION

ŝ = argmax_{s ∈ senses} Pr(s | V_w), where V_w is the feature vector.

• Apply Bayes rule:

  Pr(s | V_w) = Pr(s) · Pr(V_w | s) / Pr(V_w)

• Pr(V_w | s) can be decomposed by the chain rule and approximated with the independence assumption:

  Pr(V_w | s) = Pr(V_w^1 | s) · Pr(V_w^2 | s, V_w^1) · … · Pr(V_w^n | s, V_w^1, …, V_w^{n-1})
              ≈ Π_{i=1..n} Pr(V_w^i | s)

Thus,

  ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1..n} Pr(V_w^i | s)


ESTIMATING PARAMETERS

• The parameters in probabilistic WSD are Pr(s) and Pr(V_w^i | s).
• Senses are marked with respect to the sense repository (WordNet):

  Pr(s) = count(s, w) / count(w)

  Pr(V_w^i | s) = Pr(V_w^i, s) / Pr(s) = c(V_w^i, s, w) / c(s, w)
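A sketch of this estimation and the argmax scorer for one ambiguous word; add-one smoothing is an assumption here, since the slides do not specify a smoothing scheme:

```python
from collections import Counter, defaultdict
import math

def train(tagged):
    """tagged: list of (sense, feature_list) pairs for one word w."""
    sense_count = Counter(s for s, _ in tagged)          # count(s, w)
    feat_count = defaultdict(Counter)                    # c(V_w^i, s, w)
    for s, feats in tagged:
        feat_count[s].update(feats)
    return sense_count, feat_count

def classify(sense_count, feat_count, feats, vocab_size):
    total = sum(sense_count.values())
    best, best_lp = None, float('-inf')
    for s, n in sense_count.items():
        lp = math.log(n / total)                         # log Pr(s)
        denom = sum(feat_count[s].values()) + vocab_size
        for f in feats:                                  # + Σ log Pr(V_w^i | s)
            lp += math.log((feat_count[s][f] + 1) / denom)
        if lp > best_lp:
            best, best_lp = s, lp
    return best
```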

    DECISION LIST ALGORITHM

• Based on the 'one sense per collocation' property: nearby words provide strong and consistent clues to the sense of a target word.
• Collect a large set of collocations for the ambiguous word.
• Calculate word-sense probability distributions for all such collocations.
• Calculate the log-likelihood ratio (below).
• Higher log-likelihood = more predictive evidence.
• Collocations are ordered in a decision list, with the most predictive collocations ranked highest.

log( Pr(Sense-A | Collocation_i) / Pr(Sense-B | Collocation_i) )

This assumes there are only two senses for the word; of course, it can easily be extended to k senses.
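A sketch of building and applying such a list for the two-sense case; add-one smoothing is an assumption to keep the ratio finite for unseen pairs:

```python
from collections import Counter
import math

def build_decision_list(examples):
    """examples: (collocation, sense) pairs with senses 'A' or 'B'."""
    a, b = Counter(), Counter()
    for colloc, sense in examples:
        (a if sense == 'A' else b)[colloc] += 1
    # Rank collocations by |log( Pr(A|c) / Pr(B|c) )|, most predictive first.
    rules = sorted(((abs(math.log((a[c] + 1) / (b[c] + 1))),
                     c, 'A' if a[c] > b[c] else 'B')
                    for c in set(a) | set(b)), reverse=True)
    return rules

def classify(rules, sentence_collocations):
    for _, colloc, sense in rules:     # highest-ranking match decides
        if colloc in sentence_collocations:
            return sense
    return None
```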

[Figure: training data and the resultant decision list]

    DECISION LIST ALGORITHM (CONTD.)

Classification of a test sentence is based on the highest-ranking collocation found in the test sentence.
E.g. "…plucking flowers affects plant growth…"

    CRITIQUE

    • Harnesses powerful, empirically-observed properties of language.

• The Good
  – Does not require a large tagged corpus; simple implementation.
  – A simple semi-supervised algorithm which builds on an existing supervised algorithm.
  – Easy understandability of the resulting decision list.
  – Is able to capture the clues provided by proper nouns from the corpus.

• The Bad
  – The classifier is word-specific.
  – A new classifier needs to be trained for every word that you want to disambiguate.

• Accuracy
  – Average accuracy of 96% when tested on a set of 12 highly polysemous words.


    Exemplar Based WSD (k-nn)

    • An exemplar based classifier is constructed for each word to be disambiguated.

• Step 1: From each sense-marked sentence containing the ambiguous word, a training example is constructed using:
  – POS of w as well as POS of neighbouring words
  – local collocations
  – co-occurrence vector
  – morphological features
  – subject-verb syntactic dependencies

• Step 2: Given a test sentence containing the ambiguous word, a test example is similarly constructed.

• Step 3: The test example is then compared to all training examples and the k closest training examples are selected.

• Step 4: The sense most prevalent amongst these k examples is selected as the correct sense (sketched below).
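A sketch of steps 3 and 4, assuming each example's features are collected into a set and similarity is plain feature overlap:

```python
from collections import Counter

def knn_sense(test_feats, training, k=5):
    """training: list of (feature_set, sense) pairs from steps 1-2."""
    neighbours = sorted(training,
                        key=lambda ex: len(ex[0] & test_feats),
                        reverse=True)[:k]        # k closest training examples
    votes = Counter(sense for _, sense in neighbours)
    return votes.most_common(1)[0][0]            # most prevalent sense
```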

    WSD Using SVMs

• SVM is a binary classifier which finds the hyperplane with the largest margin that separates training examples into 2 classes.

• As SVMs are binary classifiers, a separate classifier is built for each sense of the word.

• Training phase: using a tagged corpus, for every sense of the word an SVM is trained using the following features:
  – POS of w as well as POS of neighbouring words
  – local collocations
  – co-occurrence vector
  – features based on syntactic relations (e.g. headword, POS of headword, voice of headword, etc.)

• Testing phase: given a test sentence, a test example is constructed using the above features and fed as input to each binary classifier.

• The correct sense is selected based on the labels returned by the classifiers.
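A sketch of the one-classifier-per-sense setup, assuming scikit-learn's LinearSVC as the binary SVM (each sense needs both positive and negative training examples):

```python
from sklearn.svm import LinearSVC

def train_per_sense(X, senses):
    """X: feature vectors; senses: gold sense label per row of X."""
    classifiers = {}
    for sense in set(senses):
        y = [1 if s == sense else 0 for s in senses]   # one-vs-rest labels
        classifiers[sense] = LinearSVC().fit(X, y)
    return classifiers

def classify(classifiers, x):
    # Pick the sense whose binary classifier is most confident on x.
    return max(classifiers,
               key=lambda s: classifiers[s].decision_function([x])[0])
```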

    WSD Using Perceptron Trained HMM

    • WSD is treated as a sequence labeling task.

    • The class space is reduced by using WordNet’s super senses instead of actual senses.

• A discriminative HMM is trained using the following features:
  – POS of w as well as POS of neighbouring words
  – local collocations
  – shape of the word and neighbouring words
    E.g. for s = "Merrill Lynch & Co", shape(s) = Xx*Xx*&Xx

• Lends itself well to NER, as labels like "person", "location", "time" etc. are included in the super sense tag set.

    Supervised Approaches –Comparisons

Approach                 Avg. Precision  Avg. Recall     Corpus                                                  Avg. Baseline Accuracy
Naïve Bayes              64.13%          Not reported    Senseval3 - All Words Task                              60.90%
Decision Lists           96%             Not applicable  Tested on a set of 12 highly polysemous English words   63.9%
Exemplar Based (k-NN)    68.6%           Not reported    WSJ6 containing 191 content words                       63.7%
SVM                      72.4%           72.4%           Senseval 3 - Lexical sample task (57 words)             55.2%
Perceptron trained HMM   67.60%          73.74%          Senseval3 - All Words Task                              60.90%


    Supervised Approaches –Conclusions

• General comments
  – Use corpus evidence instead of relying on dictionary-defined senses.
  – Can capture important clues provided by proper nouns, because proper nouns do appear in a corpus.

• Naïve Bayes
  – Suffers from data sparseness.
  – Since the scores are a product of probabilities, some weak features might pull down the overall score for a sense.
  – A large number of parameters need to be trained.

• Decision Lists
  – A word-specific classifier; a separate classifier needs to be trained for each word.
  – Uses the single most predictive feature, which eliminates the drawback of Naïve Bayes.

    Multilingual resource constrained WSD

Long line of work…

• Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, It Takes Two to Tango: A Bilingual Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization, 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.

• Mitesh Khapra, Salil Joshi, Arindam Chatterjee and Pushpak Bhattacharyya, Together We Can: Bilingual Bootstrapping for WSD, Annual Meeting of the Association for Computational Linguistics (ACL 2011), Oregon, USA, June 2011.

• Mitesh Khapra, Saurabh Sohoney, Anup Kulkarni and Pushpak Bhattacharyya, Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD, Computational Linguistics Conference (COLING 2010), Beijing, China, August 2010.

• Mitesh Khapra, Anup Kulkarni, Saurabh Sohoney and Pushpak Bhattacharyya, All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision, Conference of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.

• Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Domain-Specific Word Sense Disambiguation Combining Corpus Based and Wordnet Based Parameters, 5th International Conference on Global Wordnet (GWC 2010), Mumbai, January 2010.

• Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Projecting Parameters for Multilingual Word Sense Disambiguation, Empirical Methods in Natural Language Processing (EMNLP 2009), Singapore, August 2009.

• Mitesh Khapra, Pushpak Bhattacharyya, Shashank Chauhan, Soumya Nair and Aditya Sharma, Domain Specific Iterative Word Sense Disambiguation in a Multilingual Setting, International Conference on NLP (ICON 2008), Pune, India, December 2008.

    Algorithm for WSD


    Iterative WSD

Motivated by the energy expression in the Hopfield network (scoring function):

  Neuron → Synset
  Self-activation → Corpus sense distribution
  Weight of connection between two neurons → Function of corpus co-occurrence and Wordnet distance measures between synsets

    Iterative WSD

    Algorithm 1: performIterativeWSD(sentence)

1. Tag all monosemous words in the sentence.
2. Iteratively disambiguate the remaining words in the sentence in increasing order of their degree of polysemy.
3. At each stage, select for a word the sense which maximizes the score given by the scoring function above.
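A skeleton of that loop; score() is a placeholder for the scoring function sketched above (the exact equation is not reproduced in this transcript):

```python
def perform_iterative_wsd(words, senses_of, score):
    """senses_of(w): candidate senses; score(w, s, tagged): scoring function."""
    # Step 1: tag all monosemous words.
    tagged = {w: senses_of(w)[0] for w in words if len(senses_of(w)) == 1}
    # Step 2: remaining words in increasing order of polysemy.
    for w in sorted((w for w in words if w not in tagged),
                    key=lambda w: len(senses_of(w))):
        # Step 3: pick the sense that maximizes the score.
        tagged[w] = max(senses_of(w), key=lambda s: score(w, s, tagged))
    return tagged
```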

    Data


[Table: per-corpus statistics with columns #polysemous words (tokens), #monosemous words, #polysemous unique words (types), token-to-type ratio, average degree of WordNet polysemy, average degree of corpus polysemy; values not recoverable from the transcript.]

    Performance of different algorithms: monolingual WSD


WSD is costly! (1)

WordNets
• Princeton Wordnet: ~80000 synsets; 30 man years
• Eurowordnet: 12 man years on the average for 12 languages
• Hindi wordnet: 24 man years
  – http://www.cfilt.iitb.ac.in/wordnet/webhwn/
• Indowordnet: getting created; 15 languages; 4 people on the average; close to 15000 synsets done in 1 year
• The scale of effort is really huge
• Tricky too: when it comes to expanding from one wordnet to another

Machine Learning based WSD is costly! (2)

Sense annotated corpora for machine learning
• SemCor: ~200000 sense-marked words
• SemEval/Senseval competitions: to generate sense-marked corpora
• Sense-marked corpora created at IIT Bombay
  – http://www.cfilt.iitb.ac.in/wsd/annotated_corpus
  – English: Tourism (~170000), Health (~150000)
  – Hindi: Tourism (~170000), Health (~80000)
  – Marathi: Tourism (~120000), Health (~50000)
  – 12 man years for each combination

Cost-accuracy trade-off

This is the dream: spread from one combination to the others.


    Language Adaptation scenarios

Related Work (not mentioning references, because they are too many)

• Knowledge based approaches
• Supervised approaches
• Unsupervised approaches
• Semi-supervised approaches
• Hybrid approaches

No single existing solution to WSD completely meets our requirements of multilinguality, high domain accuracy and good performance in the face of limited annotation.

    Scenario 3: EM based unsupervised Approach




    ESTIMATING SENSE DISTRIBUTIONS

If a sense-tagged Marathi corpus were available, we could have estimated the sense distributions directly. But such a corpus is not available.

    Framework: Figure 1 and Figure 2

    E-M steps

    Points to note…

• Symmetric formulation
• The E and M steps are identical except for the change in language
• Either can be treated as the E-step, making the other the M-step
• A back-and-forth traversal over translation correspondences in the two languages
• Does not require a parallel corpus; only an in-domain corpus is needed



In general…

Experimental Setup
• Languages: Hindi, Marathi
• Domains: Tourism and Health (largest domain-specific sense-tagged corpus)

    Algorithms Being Compared

• EM (our approach)
• Personalized PageRank (Agirre and Soroa, 2009)
• State-of-the-art bilingual approach (using Mutual Information) (Kaji and Morimoto, 2002)
• Random baseline
• Wordnet first-sense baseline (supervised baseline)

    Results

• Performs better than other state-of-the-art knowledge-based and unsupervised approaches.

• Does not beat the Wordnet first-sense baseline, which is a supervised baseline.


    Error Analysis – Non-Progressiveness estimation

• Some words have the same translation in the target language across senses: saagar (Hindi) → samudra (Marathi), "large water body" as well as "limitless".
• Such words thus form a closed loop of translations.
• In such cases the algorithm does not progress and gets stuck with the initial values.
• The same holds for some language-specific words for which corresponding synsets were not available in the other language.
• Such words accounted for 17-19% of the total words in the test corpus.

Results - eliminating words with the problem of non-progressive estimation

• The results are now closer to the Wordnet first-sense baseline.
• For 2 out of the 4 language-domain pairs the results are slightly better than WFS, remarkable for an unsupervised approach.

    Demo

    • IIT Bombay’s system:www.cfilt.iitb.ac.in/UNL_enco

    • Textual Entailment:

    http://10.14.26.15:8084/TextualEntailmentInterface/index.jsp

Conclusions (1/2)

• NLP is all about processing ambiguity, with WSD as a fundamental task.

• Resource constraints and multilinguality bring additional challenges.

    • Wordnet: Great unifier of India (similar to Adi Shankaracharya, Bollywood films…)

    • Getting linked with English WN; would like to link with Eurowordnet

    • Application in MT, Search, Language teaching, e-commerce


    Future work

• Closer study needed for familially close languages
• Usage of language-specific properties, in particular morphology
• The projection idea can be used in other NLP problems like POS tagging and parsing

    URLs

    • For resourceswww.cfilt.iitb.ac.in

    • For publicationswww.cse.iitb.ac.in/~pb

    Thank you

    Questions and comments?