TTIC 31190: Natural Language Processing — ttic.uchicago.edu/~kgimpel/teaching/31190/lectures/4.pdf
TRANSCRIPT
TTIC 31190: Natural Language Processing
Kevin Gimpel, Winter 2016
Lecture 4: Lexical Semantics
1
Roadmap
• classification
• words
• lexical semantics
• language modeling
• sequence labeling
• syntax and syntactic parsing
• neural network methods in NLP
• semantic compositionality
• semantic parsing
• unsupervised learning
• machine translation and other applications
2
Why is NLP hard?
• ambiguity and variability of linguistic expression:
– ambiguity: one form can mean many things
– variability: many forms can mean the same thing
3
Feature Engineering for Text Classification
• Two features:
• Let’s say we set
• On sentences containing “great” in the Stanford Sentiment Treebank training data, this would get us an accuracy of 69%
• But “great” only appears in 83/6911 examples
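As a toy illustration of this kind of single-feature classifier, the sketch below predicts "positive" for every sentence containing "great" and measures accuracy on a tiny invented dataset (not the actual Stanford Sentiment Treebank):

```python
# Toy single-feature sentiment classifier: predict "positive" iff the
# sentence contains the word "great". The four examples are invented.
examples = [
    ("the acting was great and the plot gripping", "positive"),
    ("a great film from start to finish", "positive"),
    ("not so great , despite the hype", "negative"),
    ("great effects cannot rescue a dull script", "negative"),
]

def predict(sentence):
    # the single binary feature: does the sentence contain "great"?
    return "positive" if "great" in sentence.split() else "negative"

correct = sum(predict(s) == label for s, label in examples)
print(f"accuracy: {correct}/{len(examples)}")  # → accuracy: 2/4
```

On this made-up sample the feature is right half the time; the slide's 69% on real SST sentences containing "great" reflects the same idea at scale.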
4
variability: many other words can indicate positive sentiment
ambiguity: “great” can mean different things in different contexts
• most of what we talked about on Tuesday and what we will talk about today:
• one form, multiple meanings → split form
• multiple forms, one meaning → merge forms
5
ambiguity
variability
Ambiguity
• one form, multiple meanings → split form
– tokenization (adding spaces):
• didn’t → did n’t
• “Yes?” → “ Yes ? ”
– today: word sense disambiguation:
• power plant → power plant1
• flowering plant → flowering plant2
6
Variability
• multiple forms, one meaning → merge forms
– tokenization (removing spaces):
• New York → NewYork
– lemmatization:
• walked → walk
• walking → walk
– stemming:
• automation → automat
• automates → automat
7
are there any NLP tasks where doing this might lose valuable information?
Variability
• multiple forms, one meaning → merge forms
– tokenization (removing spaces):
• New York → NewYork
– lemmatization:
• walked → walk
• walking → walk
– stemming:
• automation → automat
• automates → automat
– today/next week: word representations
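A crude sketch of these merge-forms normalizations: a tiny lemma dictionary with a naive suffix-stripping fallback. The rules and word list are illustrative only, not the Porter stemmer or a real lemmatizer.

```python
# Illustrative lemma lookup plus naive suffix stripping ("merge forms").
LEMMAS = {"walked": "walk", "walking": "walk", "sung": "sing"}

def normalize(word):
    word = word.lower()
    if word in LEMMAS:                                 # lemmatization via lookup
        return LEMMAS[word]
    for suffix in ("ion", "es", "ing", "ed", "s"):     # crude stemming rules
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([normalize(w) for w in ["walked", "walking", "automation", "automates"]])
# → ['walk', 'walk', 'automat', 'automat']
```

Note how stemming collapses automation and automates to the non-word automat, while lemmatization maps inflected forms to a dictionary headword.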
8
9
t-SNE visualization from Turian et al. (2010)
Vector Representations of Words
10
Class-Based n-gram Models of Natural Language
Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, Jennifer C. Lai
IBM T. J. Watson Research Center
We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
1. Introduction
In a number of natural language processing tasks, we face the problem of recovering a string of English words after it has been garbled by passage through a noisy channel. To tackle this problem successfully, we must be able to estimate the probability with which any particular string of English words will be presented as input to the noisy channel. In this paper, we discuss a method for making such estimates. We also discuss the related topic of assigning words to classes according to statistical behavior in a large body of text.
In the next section, we review the concept of a language model and give a definition of n-gram models. In Section 3, we look at the subset of n-gram models in which the words are divided into classes. We show that for n = 2 the maximum likelihood assignment of words to classes is equivalent to the assignment for which the average mutual information of adjacent classes is greatest. Finding an optimal assignment of words to classes is computationally hard, but we describe two algorithms for finding a suboptimal assignment. In Section 4, we apply mutual information to two other forms of word clustering. First, we use it to find pairs of words that function together as a single lexical entity. Then, by examining the probability that two words will appear within a reasonable distance of one another, we use it to find classes that have some loose semantic coherence.
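The average mutual information of adjacent classes mentioned above can be computed directly from class bigram counts; the sketch below uses a made-up class sequence in place of a real corpus:

```python
# Average mutual information I(C1; C2) of adjacent classes, estimated from
# bigram counts over a toy class sequence (each word already mapped to a class).
from collections import Counter
from math import log2

classes = ["DAY", "VERB", "DAY", "VERB", "NOUN", "VERB", "DAY", "NOUN"]

bigrams = Counter(zip(classes, classes[1:]))
left = Counter(c1 for c1, _ in bigrams.elements())    # marginal of first class
right = Counter(c2 for _, c2 in bigrams.elements())   # marginal of second class
total = sum(bigrams.values())

# I(C1; C2) = sum_{c1,c2} p(c1,c2) * log2( p(c1,c2) / (p(c1) * p(c2)) )
ami = sum(
    (n / total) * log2((n / total) / ((left[c1] / total) * (right[c2] / total)))
    for (c1, c2), n in bigrams.items()
)
print(f"average mutual information: {ami:.3f} bits")
```

The greedy clustering algorithm the paper describes repeatedly merges the pair of classes whose merge reduces this quantity the least.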
In describing our work, we draw freely on terminology and notation from the mathematical theory of communication. The reader who is unfamiliar with this field or who has allowed his or her facility with some of its concepts to fall into disrepair may profit from a brief perusal of Feller (1950) and Gallagher (1968). In the first of these, the reader should focus on conditional probabilities and on Markov chains; in the second, on entropy and mutual information.
* IBM T. J. Watson Research Center, Yorktown Heights, New York 10598.
© 1992 Association for Computational Linguistics
Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
people guys folks fellows CEOs chaps doubters commies unfortunates blokes
down backwards ashore sideways southward northward overboard aloft downwards adrift
water gas coal liquid acid sand carbon steam shale iron
great big vast sudden mere sheer gigantic lifelong scant colossal
man woman boy girl lawyer doctor guy farmer teacher citizen
American Indian European Japanese German African Catholic Israeli Italian Arab
pressure temperature permeability density porosity stress velocity viscosity gravity tension
mother wife father son husband brother daughter sister boss uncle
machine device controller processor CPU printer spindle subsystem compiler plotter
John George James Bob Robert Paul William Jim David Mike
anyone someone anybody somebody
feet miles pounds degrees inches barrels tons acres meters bytes
director chief professor commissioner commander treasurer founder superintendent dean custodian
liberal conservative parliamentary royal progressive Tory provisional separatist federalist PQ
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
that tha theat
head body hands eyes voice arm seat eye hair mouth

Table 2: Classes from a 260,741-word vocabulary.
we include no more than the ten most frequent words of any class (the other two months would appear with the class of months if we extended this limit to twelve). The degree to which the classes capture both syntactic and semantic aspects of English is quite surprising given that they were constructed from nothing more than counts of bigrams. The class {that tha theat} is interesting because although tha and theat are not English words, the computer has discovered that in our data each of them is most often a mistyped that.
Table 4 shows the number of class 1-, 2-, and 3-grams occurring in the text with various frequencies. We can expect from these data that maximum likelihood estimates will assign a probability of 0 to about 3.8 percent of the class 3-grams and to about .02 percent of the class 2-grams in a new sample of English text. This is a substantial improvement over the corresponding numbers for a 3-gram language model, which are 14.7 percent for word 3-grams and 2.2 percent for word 2-grams, but we have achieved this at the expense of precision in the model. With a class model, we distinguish between two different words of the same class only according to their relative frequencies in the text as a whole. Looking at the classes in Tables 2 and 3, we feel that
Computational Linguistics, 1992
Word Clusters
Roadmap
• lexical semantics
– word sense
– word sense disambiguation
– word representations
11
Word Sense Ambiguity
• many words have multiple meanings
12
Word Sense Ambiguity
13
credit: A. Zwicky
Terminology: lemma and wordform
• lemma
– words with same lemma have same stem, part of speech, rough semantics
• wordform
– inflected word as it appears in text

wordform | lemma
banks | bank
sung | sing
duermes | dormir
J&M/SLP3
Lemmas have senses
• one lemma bank can have many meanings:
…a bank1 can hold the investments in a custodial account…
…as agriculture burgeons on the east bank2 the river will shrink even more
• sense (or word sense)
– a discrete representation of an aspect of a word’s meaning
• the lemma bank here has two senses
sense 1:
sense 2:
J&M/SLP3
• two ways to categorize the patterns of multiple meanings of words:
– homonymy: the multiple meanings are unrelated (coincidental?)
– polysemy: the multiple meanings are related
16
Homonymy
homonyms: words that share a form but have unrelated, distinct meanings:
– bank1: financial institution, bank2: sloping land
– bat1: club for hitting a ball, bat2: nocturnal flying mammal

homographs: same spelling, different meanings
bank/bank, bat/bat

homophones: same pronunciation, different meanings
write/right, piece/peace
J&M/SLP3
Homonymy causes problems for NLP
• information retrieval
– query: bat care
• machine translation
– bat: murciélago (animal) or bate (for baseball)
• text-to-speech
– bass (stringed instrument) vs. bass (fish)
J&M/SLP3
Polysemy
1: The bank was constructed in 1875 out of local red brick.
2: I withdrew the money from the bank.
• are these the same sense?
– sense 2: “a financial institution”
– sense 1: “the building belonging to a financial institution”
• a polysemous word has related meanings
– most non-rare words have multiple meanings
J&M/SLP3
• lots of types of polysemy are systematic
– school, university, hospital
– all can mean the institution or the building
• a systematic relationship:
– building ↔ organization
• other such kinds of systematic polysemy:
Author (Jane Austen wrote Emma) ↔ Works of Author (I love Jane Austen)
Tree (Plums have beautiful blossoms) ↔ Fruit (I ate a preserved plum)

Metonymy or Systematic Polysemy: a systematic relationship between senses
J&M/SLP3
How do we know when a word has more than one sense?
• “zeugma” test: two senses of serve?
– Which flights serve breakfast?
– Does Lufthansa serve Philadelphia?
– ?Does Lufthansa serve breakfast and Philadelphia?
• since this conjunction sounds weird, we say that these are two different senses of serve
J&M/SLP3
Synonyms
• words with same meaning in some or all contexts:
– filbert / hazelnut
– couch / sofa
– big / large
– automobile / car
– vomit / throw up
– water / H2O
• two lexemes are synonyms if they can be substituted for each other in all situations
J&M/SLP3
Synonyms
• few (or no) examples of perfect synonymy
– even if many aspects of meaning are identical
– still may not preserve the acceptability based on notions of politeness, slang, register, genre, etc.
• examples:
– water / H2O
– big / large
– brave / courageous
J&M/SLP3
Synonymy is a relation between senses rather than words
• consider the words big and large
• are they synonyms?
– How big is that plane?
– Would I be flying on a large or small plane?
• how about here:
– Miss Nelson became a kind of big sister to Benjamin.
– ?Miss Nelson became a kind of large sister to Benjamin.
• why?
– big has a sense that means being older or grown up
– large lacks this sense
J&M/SLP3
Antonyms
• senses that are opposites with respect to one feature of meaning
• otherwise, they are very similar!
dark/light, short/long, fast/slow, rise/fall, hot/cold, up/down, in/out
• more formally, antonyms can
– define a binary opposition or be at opposite ends of a scale
• long/short, fast/slow
– be reversives:
• rise/fall, up/down
J&M/SLP3
Hyponymy and Hypernymy
• one sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
– car is a hyponym of vehicle
– mango is a hyponym of fruit
• conversely: hypernym (“hyper is super”)
– vehicle is a hypernym of car
– fruit is a hypernym of mango
J&M/SLP3
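These relations can be sketched as a toy is-a graph with a function that follows hypernym links transitively; the graph below is hand-built for illustration, not WordNet itself:

```python
# Toy is-a (hyponym -> hypernym) graph; each entry maps a sense to its
# immediate hypernym. Hand-built example, not the WordNet hierarchy.
ISA = {
    "car": "vehicle",
    "vehicle": "artifact",
    "mango": "fruit",
    "fruit": "food",
}

def hypernyms(word):
    """Follow is-a links upward, collecting all transitive hypernyms."""
    chain = []
    while word in ISA:
        word = ISA[word]
        chain.append(word)
    return chain

print(hypernyms("car"))    # → ['vehicle', 'artifact']
print(hypernyms("mango"))  # → ['fruit', 'food']
```

The chain returned for each word is exactly the kind of hypernym chain shown later for the senses of bass.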
WordNet 3.0
• hierarchically organized lexical database
• on-line thesaurus + aspects of a dictionary
– some languages available or under development: Arabic, Finnish, German, Portuguese…

Category | Unique Strings
Noun | 117,798
Verb | 11,529
Adjective | 22,479
Adverb | 4,481
J&M/SLP3
Senses of bass in WordNet
J&M/SLP3
How is “sense” defined in WordNet?
• synset (synonym set): set of near-synonyms; instantiates a sense or concept, with a gloss
• example: chump as a noun with gloss: “a person who is gullible and easy to take advantage of”
• this sense of chump is shared by 9 words: chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2
• each of these senses has this same gloss
– (not every sense; sense 2 of gull is the aquatic bird)
J&M/SLP3
• one form, multiple meanings → split form
– the three senses of fool belong to different synsets
• multiple forms, one meaning → merge forms
– each synset contains senses of several different words
30
ambiguity
variability
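A minimal sketch of how a synset inventory supports both operations at once; the synset IDs and memberships below are simplified, loosely modeled on the chump example:

```python
# Toy synset inventory: words map to several synsets (ambiguity -> split),
# and each synset groups several words (variability -> merge).
SYNSETS = {
    "chump.n.01": {"chump", "fool", "gull", "patsy", "sucker"},  # gullible person
    "fool.n.02": {"fool", "jester"},                             # professional clown
    "gull.n.02": {"gull", "seagull"},                            # aquatic bird
}

def senses_of(word):
    """Ambiguity: all synsets containing this word (split form)."""
    return sorted(s for s, members in SYNSETS.items() if word in members)

def synonyms_in(synset_id):
    """Variability: all words sharing this sense (merge forms)."""
    return sorted(SYNSETS[synset_id])

print(senses_of("fool"))          # → ['chump.n.01', 'fool.n.02']
print(synonyms_in("chump.n.01"))  # → ['chump', 'fool', 'gull', 'patsy', 'sucker']
```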
WordNet Hypernym Hierarchy for bass
J&M/SLP3
Supersenses: top-level hypernyms in hierarchy
32
(counts from Schneider & Smith’s STREUSLE corpus)
Noun supersenses (count, most frequent item):
GROUP 1469 place; PERSON 1202 people; ARTIFACT 971 car; COGNITION 771 way; FOOD 766 food; ACT 700 service; LOCATION 638 area; TIME 530 day; EVENT 431 experience; COMMUNIC.* 417 review; POSSESSION 339 price; ATTRIBUTE 205 quality; QUANTITY 102 amount; ANIMAL 88 dog; BODY 87 hair; STATE 56 pain; NATURAL OBJ. 54 flower; RELATION 35 portion; SUBSTANCE 34 oil; FEELING 34 discomfort; PROCESS 28 process; MOTIVE 25 reason; PHENOMENON 23 result; SHAPE 6 square; PLANT 5 tree; OTHER 2 stuff; all 26 NSSTs 9018

Verb supersenses (count, most frequent item):
STATIVE 2922 is; COGNITION 1093 know; COMMUNIC.* 974 recommend; SOCIAL 944 use; MOTION 602 go; POSSESSION 309 pay; CHANGE 274 fix; EMOTION 249 love; PERCEPTION 143 see; CONSUMPTION 93 have; BODY 82 get…done; CREATION 64 cook; CONTACT 46 put; COMPETITION 11 win; WEATHER 0 —; all 15 VSSTs 7806

N/A (see §3.2): `a 1191 have; ` 821 anyone; `j 54 fried

*COMMUNIC. is short for COMMUNICATION

Table 1: Summary of noun and verb supersense categories. Each entry shows the label along with the count and most frequent lexical item in the STREUSLE corpus.
enrich the MWE annotations of the CMWE corpus1 (Schneider et al., 2014b), are publicly released under the name STREUSLE.2 This includes new guidelines for verb supersense annotation. Our open-source tagger, implemented in Python, is available from that page as well.
2 Background: Supersense Tags
WordNet’s supersense categories are the top-level hypernyms in the taxonomy (sometimes known as semantic fields) which are designed to be broad enough to encompass all nouns and verbs (Miller, 1990; Fellbaum, 1990).3

1 http://www.ark.cs.cmu.edu/LexSem/
2 Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions
3 WordNet synset entries were originally partitioned into lexicographer files for these coarse categories, which became known as “supersenses.” The lexname function in WordNet/
The 26 noun and 15 verb supersense categories are listed with examples in table 1. Some of the names overlap between the noun and verb inventories, but they are to be considered separate categories; hereafter, we will distinguish the noun and verb categories with prefixes, e.g. N:COGNITION vs. V:COGNITION.

Though WordNet synsets are associated with lexical entries, the supersense categories are unlexicalized. The N:PERSON category, for instance, contains synsets for both principal and student. A different sense of principal falls under N:POSSESSION.

As far as we are aware, the supersenses were originally intended only as a method of organizing the WordNet structure. But Ciaramita and Johnson (2003) pioneered the coarse word sense disambiguation task of supersense tagging, noting that the supersense categories provided a natural broadening of the traditional named entity categories to encompass all nouns. Ciaramita and Altun (2006) later expanded the task to include all verbs, and applied a supervised sequence modeling framework adapted from NER. Evaluation was against manually sense-tagged data that had been automatically converted to the coarser supersenses. Similar taggers have since been built for Italian (Picca et al., 2008) and Chinese (Qiu et al., 2011), both of which have their own WordNets mapped to English WordNet.

Although many of the annotated expressions in existing supersense datasets contain multiple words, the relationship between MWEs and supersenses has not received much attention. (Piao et al. (2003, 2005) did investigate MWEs in the context of a lexical tagger with a finer-grained taxonomy of semantic classes.) Consider these examples from online reviews:
(1) IT IS NOT A HIGH END STEAK HOUSE
(2) The white pages allowed me to get in touch with parents of my high school friends so that I could track people down one by one

HIGH END functions as a unit to mean ‘sophisticated, expensive’. (It is not in WordNet, though

(footnote 3, continued) NLTK (Bird et al., 2009) returns a synset’s lexicographer file. A subtle difference is that a special file called noun.Tops contains each noun supersense’s root synset (e.g., group.n.01 for N:GROUP) as well as a few miscellaneous synsets, such as living_thing.n.01, that are too abstract to fall under any single supersense. Following Ciaramita and Altun (2006), we treat the latter cases under an N:OTHER supersense category and merge the former under their respective supersense.
J&M/SLP3
Meronymy
• part-whole relation
– wheel is a meronym of car
– car is a holonym of wheel
33
J&M/SLP3
WordNet Noun Relations
Relation | Also Called | Definition | Example
Hypernym | Superordinate | From concepts to superordinates | breakfast1 → meal1
Hyponym | Subordinate | From concepts to subtypes | meal1 → lunch1
Instance Hypernym | Instance | From instances to their concepts | Austen1 → author1
Instance Hyponym | Has-Instance | From concepts to concept instances | composer1 → Bach1
Member Meronym | Has-Member | From groups to their members | faculty2 → professor1
Member Holonym | Member-Of | From members to their groups | copilot1 → crew1
Part Meronym | Has-Part | From wholes to parts | table2 → leg3
Part Holonym | Part-Of | From parts to wholes | course7 → meal1
Substance Meronym | | From substances to their subparts | water1 → oxygen1
Substance Holonym | | From parts of substances to wholes | gin1 → martini1
Antonym | | Semantic opposition between lemmas | leader1 ↔ follower1
Derivationally Related Form | | Lemmas w/ same morphological root | destruction1 ↔ destroy1

Figure 16.2: Noun relations in WordNet.
Relation | Definition | Example
Hypernym | From events to superordinate events | fly9 → travel5
Troponym | From events to subordinate event (often via specific manner) | walk1 → stroll1
Entails | From verbs (events) to the verbs (events) they entail | snore1 → sleep1
Antonym | Semantic opposition between lemmas | increase1 ↔ decrease1
Derivationally Related Form | Lemmas with same morphological root | destroy1 ↔ destruction1

Figure 16.3: Verb relations in WordNet.
These relations correspond to the notion of immediate hyponymy discussed on page 5. Each synset is related to its immediately more general and more specific synsets through direct hypernym and hyponym relations. These relations can be followed to produce longer chains of more general or more specific synsets. Figure 16.4 shows hypernym chains for bass3 and bass7.

In this depiction of hyponymy, successively more general synsets are shown on successive indented lines. The first chain starts from the concept of a human bass singer. Its immediate superordinate is a synset corresponding to the generic concept of a singer. Following this chain leads eventually to concepts such as entertainer and person. The second chain, which starts from musical instrument, has a completely different path leading eventually to such concepts as musical instrument, device, and physical object. Both paths do eventually join at the very abstract synset whole, unit, and then proceed together to entity, which is the top (root) of the noun hierarchy (in WordNet this root is generally called the unique beginner).
16.4 Word Sense Disambiguation: Overview
Our discussion of compositional semantic analyzers in Chapter 15 pretty much ignored the issue of lexical ambiguity. It should be clear by now that this is an unreasonable approach. Without some means of selecting correct senses for the words in an input, the enormous amount of homonymy and polysemy in the lexicon would quickly overwhelm any approach in an avalanche of competing interpretations.
J&M/SLP3
WordNet: Viewed as a graph
(excerpt from Word Sense Disambiguation: A Survey)
Fig. 3. An excerpt of the WordNet semantic network.
We note that each word sense univocally identifies a single synset. For instance, given car^1_n the corresponding synset {car^1_n, auto^1_n, automobile^1_n, machine^4_n, motorcar^1_n} is univocally determined. In Figure 3 we report an excerpt of the WordNet semantic network containing the car^1_n synset. For each synset, WordNet provides the following information:

—A gloss, that is, a textual definition of the synset possibly with a set of usage examples (e.g., the gloss of car^1_n is “a 4-wheeled motor vehicle; usually propelled by an internal combustion engine; ‘he needs a car to get to work’ ”).7

—Lexical and semantic relations, which connect pairs of word senses and synsets, respectively: while semantic relations apply to synsets in their entirety (i.e., to all members of a synset), lexical relations connect word senses included in the respective synsets. Among the latter we have the following:
—Antonymy: X is an antonym of Y if it expresses the opposite concept (e.g., good^1_a is the antonym of bad^1_a). Antonymy holds for all parts of speech.
—Pertainymy: X is an adjective which can be defined as “of or pertaining to” a noun (or, rarely, another adjective) Y (e.g., dental^1_a pertains to tooth^1_n).
—Nominalization: a noun X nominalizes a verb Y (e.g., service^2_n nominalizes the verb serve^4_v).
Among the semantic relations we have the following:
—Hypernymy (also called kind-of or is-a): Y is a hypernym of X if every X is a (kind of) Y (motor vehicle^1_n is a hypernym of car^1_n). Hypernymy holds between pairs of nominal or verbal synsets.
7 Recently, Princeton University released the Princeton WordNet Gloss Corpus, a corpus of manually and automatically sense-annotated glosses from WordNet 3.0, available from the WordNet Web site.
ACM Computing Surveys, Vol. 41, No. 2, Article 10, Publication date: February 2009.
35
J&M/SLP3
Roadmap
• lexical semantics
– word sense
– word sense disambiguation
– word representations
36
Word Sense Disambiguation
37
credit: A. Zwicky
Word Sense Disambiguation (WSD)
• given:
– a word in context
– a fixed inventory of potential word senses
• decide which sense of the word this is
• why? machine translation, question answering, sentiment analysis, text-to-speech
• what set of senses?
– English-to-Spanish machine translation: set of Spanish translations
– text-to-speech: homographs like bass and bow
– in general: the senses in a thesaurus like WordNet
J&M/SLP3
Two Variants of WSD Task
• lexical sample task
– small pre-selected set of target words (line, plant, bass)
– inventory of senses for each word
– supervised learning: train a classifier for each word
• all-words task
– every word in an entire text
– a lexicon with senses for each word
– data sparseness: can’t train word-specific classifiers
J&M/SLP3
Classification Framework

learning: choose _
modeling: define score function
inference: solve _
40
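One common way to fill in this framework is a linear score over features of (input, label) with argmax inference; the feature function and weights below are illustrative placeholders, not the lecture's actual definitions:

```python
# Modeling: a linear score over (input, label) features.
# Inference: argmax over candidate labels.
def features(x, y):
    # one indicator feature per (context word, label) pair
    return {(w, y): 1.0 for w in x.split()}

def score(weights, x, y):
    return sum(weights.get(f, 0.0) * v for f, v in features(x, y).items())

def predict(weights, x, labels):
    # inference: choose the highest-scoring label
    return max(labels, key=lambda y: score(weights, x, y))

# hand-set weights standing in for learned parameters
weights = {("fish", "bass-fish"): 2.0, ("guitar", "bass-music"): 2.0}
print(predict(weights, "smoked fish and bass", ["bass-fish", "bass-music"]))
# → bass-fish
```

Learning would then choose the weights, e.g. by optimizing a training objective over labeled examples.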
Classification for Word Sense Disambiguation of bass
41
• data:
– what is the space of possible inputs and outputs?
– what do (x, y) pairs look like?
– x = the word bass along with its context
– y = word sense of bass (from a list of possible senses)

8 senses of bass in WordNet
J&M/SLP3
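A minimal lexical-sample sketch along these lines: x is a context, y is one of two senses, and the classifier scores each sense by context-word overlap with tiny hand-labeled (and entirely invented) training data:

```python
# Toy WSD for "bass": score each sense by how many (non-stopword) context
# words it shares with that sense's training contexts. Data is invented.
from collections import Counter

STOP = {"a", "the", "in", "on", "with", "my", "up", "from", "and"}

train = [
    ("caught a striped bass in the lake", "bass-fish"),
    ("grilled bass fillet with lemon", "bass-fish"),
    ("bass player joined a jazz band", "bass-music"),
    ("turned up bass on my amplifier", "bass-music"),
]

sense_words = {}
for context, sense in train:
    words = [w for w in context.split() if w not in STOP]
    sense_words.setdefault(sense, Counter()).update(words)

def disambiguate(context):
    words = [w for w in context.split() if w not in STOP]
    # pick the sense whose training contexts overlap most with this one
    return max(sense_words, key=lambda s: sum(sense_words[s][w] for w in words))

print(disambiguate("played bass in a band"))      # → bass-music
print(disambiguate("smoked bass from the lake"))  # → bass-fish
```

Real lexical-sample systems replace the overlap score with a trained classifier over richer context features.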
Inventory of Sense Tags for bass

WordNet Sense | Spanish Translation | Roget Category | Target Word in Context
bass4 | lubina | FISH/INSECT | …fish as Pacific salmon and striped bass and…
bass4 | lubina | FISH/INSECT | …produce filets of smoked bass or sturgeon…
bass7 | bajo | MUSIC | …exciting jazz bass player since Ray Brown…
bass7 | bajo | MUSIC | …play bass because he doesn’t have to solo…

Figure 16.5: Possible definitions for the inventory of sense tags for bass.
the set of senses are small, supervised machine learning approaches are often usedto handle lexical sample tasks. For each word, a number of corpus instances (con-text sentences) can be selected and hand-labeled with the correct sense of the targetword in each. Classifier systems can then be trained with these labeled examples.Unlabeled target words in context can then be labeled using such a trained classifier.Early work in word sense disambiguation focused solely on lexical sample tasksof this sort, building word-specific algorithms for disambiguating single words likeline, interest, or plant.
In contrast, in the all-words task, systems are given entire texts and a lexicon with an inventory of senses for each entry and are required to disambiguate every content word in the text. The all-words task is similar to part-of-speech tagging, except with a much larger set of tags since each lemma has its own set. A consequence of this larger set of tags is a serious data sparseness problem; it is unlikely that adequate training data for every word in the test set will be available. Moreover, given the number of polysemous words in reasonably sized lexicons, approaches based on training one classifier per term are unlikely to be practical.
In the following sections we explore the application of various machine learning paradigms to word sense disambiguation.
16.5 Supervised Word Sense Disambiguation
If we have data that has been hand-labeled with correct word senses, we can use a supervised learning approach to the problem of sense disambiguation: extracting features from the text and training a classifier to assign the correct sense given these features. The output of training is thus a classifier system capable of assigning sense labels to unlabeled words in context.
For lexical sample tasks, there are various labeled corpora for individual words; these corpora consist of context sentences labeled with the correct sense for the target word. These include the line-hard-serve corpus containing 4,000 sense-tagged examples of line as a noun, hard as an adjective, and serve as a verb (Leacock et al., 1993), and the interest corpus with 2,369 sense-tagged examples of interest as a noun (Bruce and Wiebe, 1994). The SENSEVAL project has also produced a number of such sense-labeled lexical sample corpora (SENSEVAL-1 with 34 words from the HECTOR lexicon and corpus (Kilgarriff and Rosenzweig 2000, Atkins 1993), SENSEVAL-2 and -3 with 73 and 57 target words, respectively (Palmer et al. 2001, Kilgarriff 2001)).
For training all-word disambiguation tasks we use a semantic concordance, a corpus in which each open-class word in each sentence is labeled with its word sense from a specific dictionary or thesaurus. One commonly used corpus is SemCor, a subset of the Brown Corpus consisting of over 234,000 words that were manually tagged with WordNet senses.
J&M/SLP3
WSD Evaluation and Baselines
• best evaluation: extrinsic ("task-based")
– embed WSD in a task and see if it helps!
• intrinsic evaluation often done for convenience
• strong baseline: most frequent sense
J&M/SLP3
Most Frequent Sense
• WordNet senses are ordered by frequency
• most frequent is first
• sense frequencies come from SemCor corpus
J&M/SLP3
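The most-frequent-sense baseline is a one-liner once sense counts are available; the tiny count table below is invented for illustration (in practice the frequencies come from SemCor).

```python
from collections import Counter

# hypothetical sense-tagged training counts (SemCor-style; numbers made up)
sense_counts = {
    "bass": Counter({"bass7_music": 9, "bass4_fish": 3}),
    "plant": Counter({"plant_factory": 12, "plant_flora": 8}),
}

def most_frequent_sense(word):
    # ignore the context entirely: always return the sense
    # seen most often for this word in the training corpus
    return sense_counts[word].most_common(1)[0][0]

print(most_frequent_sense("bass"))
```

Despite ignoring context, this baseline is hard to beat for words whose sense distribution is highly skewed, which is why it is the standard point of comparison for WSD systems.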
Performance Ceiling
• human inter-annotator agreement
– compare annotations of two humans on same data, given same tagging guidelines
• human agreement on all-words corpora with WordNet-style senses: 75%-80%
J&M/SLP3
Training Data for WSD
• semantic concordance: corpus in which each open-class word is labeled with a sense from a specific dictionary/thesaurus
– SemCor: 234,000 words from Brown Corpus, manually tagged with WordNet senses
– SENSEVAL-3 competition corpora: 2081 tagged word tokens
J&M/SLP3
Features for WSD? Intuition from Warren Weaver (1955):
"If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words… But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word… The practical question is: 'What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?'"
J&M/SLP3
Features for WSD
• collocational
– features for words at specific positions near target word
• bag-of-words
– features for words that occur anywhere in a window of the target word (regardless of position)
J&M/SLP3
Example
• using a window of +/- 3 from the target:
An electric guitar and bass player stand off to one side not really part of the scene
J&M/SLP3
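For the example sentence above, the two feature types can be extracted as follows; the feature-name format is an illustrative choice, not a standard.

```python
sentence = ("An electric guitar and bass player stand off to one side "
            "not really part of the scene").split()
i = sentence.index("bass")   # position of the target word
window = 3

# collocational features: word identity at each specific offset from the target
collocational = {f"w[{j}]={sentence[i + j]}"
                 for j in range(-window, window + 1)
                 if j != 0 and 0 <= i + j < len(sentence)}

# bag-of-words features: which words occur in the window, position ignored
bag_of_words = {w for j, w in enumerate(sentence)
                if j != i and abs(j - i) <= window}

print(collocational)  # contains position-specific features like "w[-1]=and"
print(bag_of_words)   # contains just the words, e.g. "guitar" and "player"
```

Note the asymmetry: "w[1]=player" and "player" are different features; the collocational version fires only when player immediately follows the target, while the bag-of-words version fires anywhere in the window.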
Semi-Supervised Learning
problem: supervised learning requires large hand-built resources
what if you don't have much training data?
solution: bootstrapping
generalize from a very small hand-labeled seed set
J&M/SLP3
Bootstrapping
• "one sense per collocation" heuristic:
– a word reoccurring in collocation with the same word will almost surely have the same sense
• For bass:
– play occurs with the music sense of bass
– fish occurs with the fish sense of bass
J&M/SLP3
Sentences extracted using fish and play
[Figure 16.9 image: scatter of unlabeled (?) and labeled (A/B) instances, with collocate labels LIFE, MANUFACTURING, EQUIPMENT, EMPLOYEE, ANIMAL, MICROSCOPIC; sets V0, V1 and Λ0, Λ1; panels (a) and (b)]
Figure 16.9 The Yarowsky algorithm disambiguating "plant" at two stages; "?" indicates an unlabeled observation, A and B are observations labeled as SENSE-A or SENSE-B. The initial stage (a) shows only seed sentences Λ0 labeled by collocates ("life" and "manufacturing"). An intermediate stage is shown in (b) where more collocates have been discovered ("equipment", "microscopic", etc.) and more instances in V0 have been moved into Λ1, leaving a smaller unlabeled set V1. Figure adapted from Yarowsky (1995).
We need more good teachers – right now, there are only a half a dozen who can play the free bass with ease.
An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
The researchers said the worms spend part of their life cycle in such fish as Pacific salmon and striped bass and Pacific rockfish or snapper.
And it all started when fishermen decided the striped bass in Lake Mead were too skinny.
Figure 16.10 Samples of bass sentences extracted from the WSJ by using the simple correlates play and fish.
strongly associated with the target senses tend not to occur with the other sense. Yarowsky defines his seed set by choosing a single collocation for each sense.
For example, to generate seed sentences for the fish and musical senses of bass, we might come up with fish as a reasonable indicator of bass1 and play as a reasonable indicator of bass2. Figure 16.10 shows a partial result of such a search for the strings "fish" and "play" in a corpus of bass examples drawn from the WSJ.
The original Yarowsky algorithm also makes use of a second heuristic, called one sense per discourse, based on the work of Gale et al. (1992b), who noticed that a particular word appearing multiple times in a text or discourse often appeared with the same sense. This heuristic seems to hold better for coarse-grained senses and particularly for cases of homonymy rather than polysemy (Krovetz, 1998).
Nonetheless, it is still useful in a number of sense disambiguation situations. In fact, the one sense per discourse heuristic is an important one throughout language processing, as it seems that many disambiguation tasks may be improved by a bias toward resolving an ambiguity the same way inside a discourse segment.
J&M/SLP3
Bootstrapping
• "one sense per collocation" heuristic:
– a word reoccurring in collocation with the same word will almost surely have the same sense
• "one sense per discourse" heuristic:
– sense of a word is highly consistent within a document (Yarowsky, 1995)
– especially topic-specific words
J&M/SLP3
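The seed-and-grow idea behind these heuristics can be sketched in a toy loop. This is a deliberately simplified, made-up labeling rule (label a sentence once its context contains collocates of exactly one sense); the real Yarowsky algorithm instead retrains a decision-list classifier each round and keeps only high-confidence rules.

```python
# seed collocates, one per sense, mirroring the bass example
collocates = {"fish": "SENSE-A", "play": "SENSE-B"}
stopwords = {"the", "a", "in", "is", "he", "will", "and", "bass"}  # never become collocates

unlabeled = [
    "a freshwater bass can be smoked".split(),        # no seed collocate yet
    "the jazz bass and guitar".split(),               # no seed collocate yet
    "the striped bass is a freshwater fish".split(),  # seed: fish
    "he will play bass in the jazz band".split(),     # seed: play
]

labeled = {}
changed = True
while changed:
    changed = False
    for sent in unlabeled:
        key = " ".join(sent)
        if key in labeled:
            continue
        # label a sentence once its words match collocates of exactly one sense
        senses = {collocates[w] for w in sent if w in collocates}
        if len(senses) == 1:
            labeled[key] = sense = senses.pop()
            changed = True
            # grow the collocate set from the newly labeled sentence
            # (real algorithm: keep only high-confidence decision-list rules)
            for w in sent:
                if w not in stopwords:
                    collocates.setdefault(w, sense)

print(labeled)
```

On the first pass only the two seed-bearing sentences get labeled; they contribute "freshwater" and "jazz" as new collocates, which label the remaining two sentences on the second pass, illustrating how the labeled set Λ grows while the unlabeled set V shrinks.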
Stages in Yarowsky bootstrapping algorithm for plant
[Figure 16.9 repeated on the slide: the two-stage Yarowsky scatter plot for plant, panels (a) and (b)]
J&M/SLP3
Summary
• word sense disambiguation: choosing correct sense in context
• applications: MT, QA, etc.
• main intuition:
– lots of information in a word's context
– simple algorithms based on word counts can be surprisingly good
57
Roadmap
• lexical semantics
– word sense
– word sense disambiguation
– word representations
58
• one form, multiple meanings → split form
– the three senses of fool belong to different synsets
• multiple forms, one meaning → merge forms
– each synset contains senses of several different words
59
ambiguity
variability
• WordNet splits words into synsets; each contains senses of several words:
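This many-to-many mapping between words and synsets can be sketched with a toy table (an abridged, hand-copied fragment of WordNet's entries for fool; real WordNet stores glosses, relations, and far more members per synset):

```python
# toy synset table: synset id -> set of member words (abridged from WordNet)
synsets = {
    "fool.n.01": {"fool", "sap", "muggins"},           # person lacking good judgment
    "fool.n.02": {"fool", "chump", "gull"},            # gullible person
    "jester.n.01": {"fool", "jester", "motley_fool"},  # professional clown
}

def senses_of(word):
    # ambiguity: one form can belong to multiple synsets (split the form)
    return [sid for sid, members in synsets.items() if word in members]

print(senses_of("fool"))        # the three senses of fool, in different synsets
print(synsets["jester.n.01"])   # variability: several forms share one meaning
```

The lookup in one direction exposes ambiguity (fool appears in three synsets), while the table in the other direction merges variability (jester, fool, and motley_fool share one synset).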
• are we finished? have we solved the problem of representing word meaning?
• issues:
– WordNet has limited coverage and only exists for a small set of languages
– WSD requires training data, whether supervised or seeds for semi-supervised
• better approach: jointly learn representations for all words in an unsupervised way
category    unique strings
noun        117,798
verb        11,529
adjective   22,479
adverb      4,481
62
t-SNE visualization from Turian et al. (2010)
Word Embeddings (Bengio et al., 2003)