TTIC 31190: Natural Language Processing — ttic.uchicago.edu/~kgimpel/teaching/31190/lectures/4.pdf
TRANSCRIPT
TTIC 31190: Natural Language Processing
Kevin Gimpel, Winter 2016
Lecture 4: Lexical Semantics
1
Roadmap
• classification
• words
• lexical semantics
• language modeling
• sequence labeling
• syntax and syntactic parsing
• neural network methods in NLP
• semantic compositionality
• semantic parsing
• unsupervised learning
• machine translation and other applications
2
Why is NLP hard?
• ambiguity and variability of linguistic expression:
– ambiguity: one form can mean many things
– variability: many forms can mean the same thing
3
Feature Engineering for Text Classification
• Two features:
• Let’s say we set
• On sentences containing “great” in the Stanford Sentiment Treebank training data, this would get us an accuracy of 69%
• But “great” only appears in 83/6911 examples
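As a toy illustration of this kind of single-feature classifier, the sketch below predicts "positive" for every sentence containing "great" and measures accuracy on a tiny invented dataset (not the actual Stanford Sentiment Treebank):

```python
# Toy single-feature sentiment classifier: predict "positive" iff the
# sentence contains the word "great". The four examples are invented.
examples = [
    ("the acting was great and the plot gripping", "positive"),
    ("a great film from start to finish", "positive"),
    ("not so great , despite the hype", "negative"),
    ("great effects cannot rescue a dull script", "negative"),
]

def predict(sentence):
    # the single binary feature: does the sentence contain "great"?
    return "positive" if "great" in sentence.split() else "negative"

correct = sum(predict(s) == label for s, label in examples)
print(f"accuracy: {correct}/{len(examples)}")  # → accuracy: 2/4
```

On this made-up sample the feature is right half the time; the slide's 69% on real SST sentences containing "great" reflects the same idea at scale.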
4
variability: many other words can indicate positive sentiment
ambiguity: “great” can mean different things in different contexts
• most of what we talked about on Tuesday and what we will talk about today:
• one form, multiple meanings → split form
• multiple forms, one meaning → merge forms
5
ambiguity
variability
Ambiguity
• one form, multiple meanings → split form
– tokenization (adding spaces):
• didn’t → did n’t
• “Yes?” → “ Yes ? ”
– today: word sense disambiguation:
• power plant → power plant1
• flowering plant → flowering plant2
6
Variability
• multiple forms, one meaning → merge forms
– tokenization (removing spaces):
• New York → NewYork
– lemmatization:
• walked → walk
• walking → walk
– stemming:
• automation → automat
• automates → automat
7
are there any NLP tasks where doing this might lose valuable information?
Variability
• multiple forms, one meaning → merge forms
– tokenization (removing spaces):
• New York → NewYork
– lemmatization:
• walked → walk
• walking → walk
– stemming:
• automation → automat
• automates → automat
– today/next week: word representations
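A crude sketch of these merge-forms normalizations: a tiny lemma dictionary with a naive suffix-stripping fallback. The rules and word list are illustrative only, not the Porter stemmer or a real lemmatizer.

```python
# Illustrative lemma lookup plus naive suffix stripping ("merge forms").
LEMMAS = {"walked": "walk", "walking": "walk", "sung": "sing"}

def normalize(word):
    word = word.lower()
    if word in LEMMAS:                                 # lemmatization via lookup
        return LEMMAS[word]
    for suffix in ("ion", "es", "ing", "ed", "s"):     # crude stemming rules
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([normalize(w) for w in ["walked", "walking", "automation", "automates"]])
# → ['walk', 'walk', 'automat', 'automat']
```

Note how stemming collapses automation and automates to the non-word automat, while lemmatization maps inflected forms to a dictionary headword.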
8
9
t-SNE visualization from Turian et al. (2010)
Vector Representations of Words
10
Class-Based n-gram Models of Natural Language
Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, Jennifer C. Lai
IBM T. J. Watson Research Center
We address the problem of predicting a word from previous words in a sample of text. In particular, we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
1. Introduction
In a number of natural language processing tasks, we face the problem of recovering a string of English words after it has been garbled by passage through a noisy channel. To tackle this problem successfully, we must be able to estimate the probability with which any particular string of English words will be presented as input to the noisy channel. In this paper, we discuss a method for making such estimates. We also discuss the related topic of assigning words to classes according to statistical behavior in a large body of text.
In the next section, we review the concept of a language model and give a definition of n-gram models. In Section 3, we look at the subset of n-gram models in which the words are divided into classes. We show that for n = 2 the maximum likelihood assignment of words to classes is equivalent to the assignment for which the average mutual information of adjacent classes is greatest. Finding an optimal assignment of words to classes is computationally hard, but we describe two algorithms for finding a suboptimal assignment. In Section 4, we apply mutual information to two other forms of word clustering. First, we use it to find pairs of words that function together as a single lexical entity. Then, by examining the probability that two words will appear within a reasonable distance of one another, we use it to find classes that have some loose semantic coherence.
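The average mutual information of adjacent classes mentioned above can be computed directly from class bigram counts; the sketch below uses a made-up class sequence in place of a real corpus:

```python
# Average mutual information I(C1; C2) of adjacent classes, estimated from
# bigram counts over a toy class sequence (each word already mapped to a class).
from collections import Counter
from math import log2

classes = ["DAY", "VERB", "DAY", "VERB", "NOUN", "VERB", "DAY", "NOUN"]

bigrams = Counter(zip(classes, classes[1:]))
left = Counter(c1 for c1, _ in bigrams.elements())    # marginal of first class
right = Counter(c2 for _, c2 in bigrams.elements())   # marginal of second class
total = sum(bigrams.values())

# I(C1; C2) = sum_{c1,c2} p(c1,c2) * log2( p(c1,c2) / (p(c1) * p(c2)) )
ami = sum(
    (n / total) * log2((n / total) / ((left[c1] / total) * (right[c2] / total)))
    for (c1, c2), n in bigrams.items()
)
print(f"average mutual information: {ami:.3f} bits")
```

The greedy clustering algorithm the paper describes repeatedly merges the pair of classes whose merge reduces this quantity the least.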
In describing our work, we draw freely on terminology and notation from the mathematical theory of communication. The reader who is unfamiliar with this field or who has allowed his or her facility with some of its concepts to fall into disrepair may profit from a brief perusal of Feller (1950) and Gallagher (1968). In the first of these, the reader should focus on conditional probabilities and on Markov chains; in the second, on entropy and mutual information.
* IBM T. J. Watson Research Center, Yorktown Heights, New York 10598.
© 1992 Association for Computational Linguistics
Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
people guys folks fellows CEOs chaps doubters commies unfortunates blokes
down backwards ashore sideways southward northward overboard aloft downwards adrift
water gas coal liquid acid sand carbon steam shale iron
great big vast sudden mere sheer gigantic lifelong scant colossal
man woman boy girl lawyer doctor guy farmer teacher citizen
American Indian European Japanese German African Catholic Israeli Italian Arab
pressure temperature permeability density porosity stress velocity viscosity gravity tension
mother wife father son husband brother daughter sister boss uncle
machine device controller processor CPU printer spindle subsystem compiler plotter
John George James Bob Robert Paul William Jim David Mike
anyone someone anybody somebody
feet miles pounds degrees inches barrels tons acres meters bytes
director chief professor commissioner commander treasurer founder superintendent dean custodian
liberal conservative parliamentary royal progressive Tory provisional separatist federalist PQ
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
that tha theat
head body hands eyes voice arm seat eye hair mouth

Table 2: Classes from a 260,741-word vocabulary.
we include no more than the ten most frequent words of any class (the other two months would appear with the class of months if we extended this limit to twelve). The degree to which the classes capture both syntactic and semantic aspects of English is quite surprising given that they were constructed from nothing more than counts of bigrams. The class {that tha theat} is interesting because although tha and theat are not English words, the computer has discovered that in our data each of them is most often a mistyped that.
Table 4 shows the number of class 1-, 2-, and 3-grams occurring in the text with various frequencies. We can expect from these data that maximum likelihood estimates will assign a probability of 0 to about 3.8 percent of the class 3-grams and to about .02 percent of the class 2-grams in a new sample of English text. This is a substantial improvement over the corresponding numbers for a 3-gram language model, which are 14.7 percent for word 3-grams and 2.2 percent for word 2-grams, but we have achieved this at the expense of precision in the model. With a class model, we distinguish between two different words of the same class only according to their relative frequencies in the text as a whole. Looking at the classes in Tables 2 and 3, we feel that
Computational Linguistics, 1992
Word Clusters
Roadmap
• lexical semantics
– word sense
– word sense disambiguation
– word representations
11
Word Sense Ambiguity
• many words have multiple meanings
12
Word Sense Ambiguity
13
credit: A. Zwicky
Terminology: lemma and wordform
• lemma
– words with same lemma have same stem, part of speech, rough semantics
• wordform
– inflected word as it appears in text

wordform | lemma
banks | bank
sung | sing
duermes | dormir
J&M/SLP3
Lemmas have senses
• one lemma bank can have many meanings:
…a bank1 can hold the investments in a custodial account…
…as agriculture burgeons on the east bank2 the river will shrink even more
• sense (or word sense)
– a discrete representation of an aspect of a word’s meaning
• the lemma bank here has two senses
sense 1:
sense 2:
J&M/SLP3
• two ways to categorize the patterns of multiple meanings of words:
– homonymy: the multiple meanings are unrelated (coincidental?)
– polysemy: the multiple meanings are related
16
Homonymy
homonyms: words that share a form but have unrelated, distinct meanings:
– bank1: financial institution, bank2: sloping land
– bat1: club for hitting a ball, bat2: nocturnal flying mammal

homographs: same spelling, different meanings
bank/bank, bat/bat

homophones: same pronunciation, different meanings
write/right, piece/peace
J&M/SLP3
Homonymy causes problems for NLP
• information retrieval
– query: bat care
• machine translation
– bat: murciélago (animal) or bate (for baseball)
• text-to-speech
– bass (stringed instrument) vs. bass (fish)
J&M/SLP3
Polysemy
1: The bank was constructed in 1875 out of local red brick.
2: I withdrew the money from the bank.
• are these the same sense?
– sense 2: “a financial institution”
– sense 1: “the building belonging to a financial institution”
• a polysemous word has related meanings
– most non-rare words have multiple meanings
J&M/SLP3
• lots of types of polysemy are systematic
– school, university, hospital
– all can mean the institution or the building
• a systematic relationship:
– building ↔ organization
• other such kinds of systematic polysemy:
Author (Jane Austen wrote Emma) ↔ Works of Author (I love Jane Austen)
Tree (Plums have beautiful blossoms) ↔ Fruit (I ate a preserved plum)

Metonymy or Systematic Polysemy: a systematic relationship between senses
J&M/SLP3
How do we know when a word has more than one sense?
• “zeugma” test: two senses of serve?
– Which flights serve breakfast?
– Does Lufthansa serve Philadelphia?
– ?Does Lufthansa serve breakfast and Philadelphia?
• since this conjunction sounds weird, we say that these are two different senses of serve
J&M/SLP3
Synonyms
• words with same meaning in some or all contexts:
– filbert / hazelnut
– couch / sofa
– big / large
– automobile / car
– vomit / throw up
– water / H2O
• two lexemes are synonyms if they can be substituted for each other in all situations
J&M/SLP3
Synonyms
• few (or no) examples of perfect synonymy
– even if many aspects of meaning are identical
– still may not preserve the acceptability based on notions of politeness, slang, register, genre, etc.
• examples:
– water / H2O
– big / large
– brave / courageous
J&M/SLP3
Synonymy is a relation between senses rather than words
• consider the words big and large
• are they synonyms?
– How big is that plane?
– Would I be flying on a large or small plane?
• how about here:
– Miss Nelson became a kind of big sister to Benjamin.
– ?Miss Nelson became a kind of large sister to Benjamin.
• why?
– big has a sense that means being older or grown up
– large lacks this sense
J&M/SLP3
Antonyms
• senses that are opposites with respect to one feature of meaning
• otherwise, they are very similar!
dark/light, short/long, fast/slow, rise/fall, hot/cold, up/down, in/out
• more formally, antonyms can
– define a binary opposition or be at opposite ends of a scale
• long/short, fast/slow
– be reversives:
• rise/fall, up/down
J&M/SLP3
Hyponymy and Hypernymy
• one sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
– car is a hyponym of vehicle
– mango is a hyponym of fruit
• conversely: hypernym (“hyper is super”)
– vehicle is a hypernym of car
– fruit is a hypernym of mango
J&M/SLP3
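These relations can be sketched as a toy is-a graph with a function that follows hypernym links transitively; the graph below is hand-built for illustration, not WordNet itself:

```python
# Toy is-a (hyponym -> hypernym) graph; each entry maps a sense to its
# immediate hypernym. Hand-built example, not the WordNet hierarchy.
ISA = {
    "car": "vehicle",
    "vehicle": "artifact",
    "mango": "fruit",
    "fruit": "food",
}

def hypernyms(word):
    """Follow is-a links upward, collecting all transitive hypernyms."""
    chain = []
    while word in ISA:
        word = ISA[word]
        chain.append(word)
    return chain

print(hypernyms("car"))    # → ['vehicle', 'artifact']
print(hypernyms("mango"))  # → ['fruit', 'food']
```

The chain returned for each word is exactly the kind of hypernym chain shown later for the senses of bass.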
WordNet 3.0
• hierarchically organized lexical database
• on-line thesaurus + aspects of a dictionary
– some languages available or under development: Arabic, Finnish, German, Portuguese…

Category | Unique Strings
Noun | 117,798
Verb | 11,529
Adjective | 22,479
Adverb | 4,481
J&M/SLP3
Senses of bass in WordNet
J&M/SLP3
How is “sense” defined in WordNet?
• synset (synonym set): set of near-synonyms; instantiates a sense or concept, with a gloss
• example: chump as a noun with gloss: “a person who is gullible and easy to take advantage of”
• this sense of chump is shared by 9 words: chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2
• each of these senses has this same gloss
– (not every sense; sense 2 of gull is the aquatic bird)
J&M/SLP3
• one form, multiple meanings → split form
– the three senses of fool belong to different synsets
• multiple forms, one meaning → merge forms
– each synset contains senses of several different words
30
ambiguity
variability
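A minimal sketch of how a synset inventory supports both operations at once; the synset IDs and memberships below are simplified, loosely modeled on the chump example:

```python
# Toy synset inventory: words map to several synsets (ambiguity -> split),
# and each synset groups several words (variability -> merge).
SYNSETS = {
    "chump.n.01": {"chump", "fool", "gull", "patsy", "sucker"},  # gullible person
    "fool.n.02": {"fool", "jester"},                             # professional clown
    "gull.n.02": {"gull", "seagull"},                            # aquatic bird
}

def senses_of(word):
    """Ambiguity: all synsets containing this word (split form)."""
    return sorted(s for s, members in SYNSETS.items() if word in members)

def synonyms_in(synset_id):
    """Variability: all words sharing this sense (merge forms)."""
    return sorted(SYNSETS[synset_id])

print(senses_of("fool"))          # → ['chump.n.01', 'fool.n.02']
print(synonyms_in("chump.n.01"))  # → ['chump', 'fool', 'gull', 'patsy', 'sucker']
```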
WordNet Hypernym Hierarchy for bass
J&M/SLP3
Supersenses: top-level hypernyms in hierarchy
32
(counts from Schneider & Smith’s STREUSLE corpus)
Noun supersenses (count, most frequent item):
GROUP 1469 place; PERSON 1202 people; ARTIFACT 971 car; COGNITION 771 way; FOOD 766 food; ACT 700 service; LOCATION 638 area; TIME 530 day; EVENT 431 experience; COMMUNIC.* 417 review; POSSESSION 339 price; ATTRIBUTE 205 quality; QUANTITY 102 amount; ANIMAL 88 dog; BODY 87 hair; STATE 56 pain; NATURAL OBJ. 54 flower; RELATION 35 portion; SUBSTANCE 34 oil; FEELING 34 discomfort; PROCESS 28 process; MOTIVE 25 reason; PHENOMENON 23 result; SHAPE 6 square; PLANT 5 tree; OTHER 2 stuff; all 26 NSSTs 9018

Verb supersenses (count, most frequent item):
STATIVE 2922 is; COGNITION 1093 know; COMMUNIC.* 974 recommend; SOCIAL 944 use; MOTION 602 go; POSSESSION 309 pay; CHANGE 274 fix; EMOTION 249 love; PERCEPTION 143 see; CONSUMPTION 93 have; BODY 82 get…done; CREATION 64 cook; CONTACT 46 put; COMPETITION 11 win; WEATHER 0 —; all 15 VSSTs 7806

N/A (see §3.2): `a 1191 have; ` 821 anyone; `j 54 fried

*COMMUNIC. is short for COMMUNICATION

Table 1: Summary of noun and verb supersense categories. Each entry shows the label along with the count and most frequent lexical item in the STREUSLE corpus.
enrich the MWE annotations of the CMWE corpus1 (Schneider et al., 2014b), are publicly released under the name STREUSLE.2 This includes new guidelines for verb supersense annotation. Our open-source tagger, implemented in Python, is available from that page as well.
2 Background: Supersense Tags
WordNet’s supersense categories are the top-level hypernyms in the taxonomy (sometimes known as semantic fields) which are designed to be broad enough to encompass all nouns and verbs (Miller, 1990; Fellbaum, 1990).3

1 http://www.ark.cs.cmu.edu/LexSem/
2 Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions
3 WordNet synset entries were originally partitioned into lexicographer files for these coarse categories, which became known as “supersenses.” The lexname function in WordNet/
The 26 noun and 15 verb supersense categories are listed with examples in table 1. Some of the names overlap between the noun and verb inventories, but they are to be considered separate categories; hereafter, we will distinguish the noun and verb categories with prefixes, e.g. N:COGNITION vs. V:COGNITION.

Though WordNet synsets are associated with lexical entries, the supersense categories are unlexicalized. The N:PERSON category, for instance, contains synsets for both principal and student. A different sense of principal falls under N:POSSESSION.

As far as we are aware, the supersenses were originally intended only as a method of organizing the WordNet structure. But Ciaramita and Johnson (2003) pioneered the coarse word sense disambiguation task of supersense tagging, noting that the supersense categories provided a natural broadening of the traditional named entity categories to encompass all nouns. Ciaramita and Altun (2006) later expanded the task to include all verbs, and applied a supervised sequence modeling framework adapted from NER. Evaluation was against manually sense-tagged data that had been automatically converted to the coarser supersenses. Similar taggers have since been built for Italian (Picca et al., 2008) and Chinese (Qiu et al., 2011), both of which have their own WordNets mapped to English WordNet.

Although many of the annotated expressions in existing supersense datasets contain multiple words, the relationship between MWEs and supersenses has not received much attention. (Piao et al. (2003, 2005) did investigate MWEs in the context of a lexical tagger with a finer-grained taxonomy of semantic classes.) Consider these examples from online reviews:
(1) IT IS NOT A HIGH END STEAK HOUSE
(2) The white pages allowed me to get in touch with parents of my high school friends so that I could track people down one by one

HIGH END functions as a unit to mean ‘sophisticated, expensive’. (It is not in WordNet, though

(footnote 3, continued) NLTK (Bird et al., 2009) returns a synset’s lexicographer file. A subtle difference is that a special file called noun.Tops contains each noun supersense’s root synset (e.g., group.n.01 for N:GROUP) as well as a few miscellaneous synsets, such as living_thing.n.01, that are too abstract to fall under any single supersense. Following Ciaramita and Altun (2006), we treat the latter cases under an N:OTHER supersense category and merge the former under their respective supersense.
J&M/SLP3
Meronymy
• part-whole relation
– wheel is a meronym of car
– car is a holonym of wheel
33
J&M/SLP3
WordNet Noun Relations
Relation | Also Called | Definition | Example
Hypernym | Superordinate | From concepts to superordinates | breakfast1 → meal1
Hyponym | Subordinate | From concepts to subtypes | meal1 → lunch1
Instance Hypernym | Instance | From instances to their concepts | Austen1 → author1
Instance Hyponym | Has-Instance | From concepts to concept instances | composer1 → Bach1
Member Meronym | Has-Member | From groups to their members | faculty2 → professor1
Member Holonym | Member-Of | From members to their groups | copilot1 → crew1
Part Meronym | Has-Part | From wholes to parts | table2 → leg3
Part Holonym | Part-Of | From parts to wholes | course7 → meal1
Substance Meronym | | From substances to their subparts | water1 → oxygen1
Substance Holonym | | From parts of substances to wholes | gin1 → martini1
Antonym | | Semantic opposition between lemmas | leader1 ↔ follower1
Derivationally Related Form | | Lemmas w/ same morphological root | destruction1 ↔ destroy1

Figure 16.2: Noun relations in WordNet.
Relation | Definition | Example
Hypernym | From events to superordinate events | fly9 → travel5
Troponym | From events to subordinate event (often via specific manner) | walk1 → stroll1
Entails | From verbs (events) to the verbs (events) they entail | snore1 → sleep1
Antonym | Semantic opposition between lemmas | increase1 ↔ decrease1
Derivationally Related Form | Lemmas with same morphological root | destroy1 ↔ destruction1

Figure 16.3: Verb relations in WordNet.
These relations correspond to the notion of immediate hyponymy discussed on page 5. Each synset is related to its immediately more general and more specific synsets through direct hypernym and hyponym relations. These relations can be followed to produce longer chains of more general or more specific synsets. Figure 16.4 shows hypernym chains for bass3 and bass7.

In this depiction of hyponymy, successively more general synsets are shown on successive indented lines. The first chain starts from the concept of a human bass singer. Its immediate superordinate is a synset corresponding to the generic concept of a singer. Following this chain leads eventually to concepts such as entertainer and person. The second chain, which starts from musical instrument, has a completely different path leading eventually to such concepts as musical instrument, device, and physical object. Both paths do eventually join at the very abstract synset whole, unit, and then proceed together to entity, which is the top (root) of the noun hierarchy (in WordNet this root is generally called the unique beginner).
16.4 Word Sense Disambiguation: Overview
Our discussion of compositional semantic analyzers in Chapter 15 pretty much ignored the issue of lexical ambiguity. It should be clear by now that this is an unreasonable approach. Without some means of selecting correct senses for the words in an input, the enormous amount of homonymy and polysemy in the lexicon would quickly overwhelm any approach in an avalanche of competing interpretations.
J&M/SLP3
WordNet: Viewed as a graph
(excerpt from Word Sense Disambiguation: A Survey)
Fig. 3. An excerpt of the WordNet semantic network.
We note that each word sense univocally identifies a single synset. For instance, given car^1_n the corresponding synset {car^1_n, auto^1_n, automobile^1_n, machine^4_n, motorcar^1_n} is univocally determined. In Figure 3 we report an excerpt of the WordNet semantic network containing the car^1_n synset. For each synset, WordNet provides the following information:

—A gloss, that is, a textual definition of the synset possibly with a set of usage examples (e.g., the gloss of car^1_n is “a 4-wheeled motor vehicle; usually propelled by an internal combustion engine; ‘he needs a car to get to work’ ”).7

—Lexical and semantic relations, which connect pairs of word senses and synsets, respectively: while semantic relations apply to synsets in their entirety (i.e., to all members of a synset), lexical relations connect word senses included in the respective synsets. Among the latter we have the following:
—Antonymy: X is an antonym of Y if it expresses the opposite concept (e.g., good^1_a is the antonym of bad^1_a). Antonymy holds for all parts of speech.
—Pertainymy: X is an adjective which can be defined as “of or pertaining to” a noun (or, rarely, another adjective) Y (e.g., dental^1_a pertains to tooth^1_n).
—Nominalization: a noun X nominalizes a verb Y (e.g., service^2_n nominalizes the verb serve^4_v).
Among the semantic relations we have the following:
—Hypernymy (also called kind-of or is-a): Y is a hypernym of X if every X is a (kind of) Y (motor vehicle^1_n is a hypernym of car^1_n). Hypernymy holds between pairs of nominal or verbal synsets.
7 Recently, Princeton University released the Princeton WordNet Gloss Corpus, a corpus of manually and automatically sense-annotated glosses from WordNet 3.0, available from the WordNet Web site.
ACM Computing Surveys, Vol. 41, No. 2, Article 10, Publication date: February 2009.
35
J&M/SLP3
Roadmap
• lexical semantics
– word sense
– word sense disambiguation
– word representations
36
Word Sense Disambiguation
37
credit: A. Zwicky
Word Sense Disambiguation (WSD)
• given:
– a word in context
– a fixed inventory of potential word senses
• decide which sense of the word this is
• why? machine translation, question answering, sentiment analysis, text-to-speech
• what set of senses?
– English-to-Spanish machine translation: set of Spanish translations
– text-to-speech: homographs like bass and bow
– in general: the senses in a thesaurus like WordNet
J&M/SLP3
Two Variants of WSD Task
• lexical sample task
– small pre-selected set of target words (line, plant, bass)
– inventory of senses for each word
– supervised learning: train a classifier for each word
• all-words task
– every word in an entire text
– a lexicon with senses for each word
– data sparseness: can’t train word-specific classifiers
J&M/SLP3
Classification Framework

learning: choose _
modeling: define score function
inference: solve _
40
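One common way to fill in this framework is a linear score over features of (input, label) with argmax inference; the feature function and weights below are illustrative placeholders, not the lecture's actual definitions:

```python
# Modeling: a linear score over (input, label) features.
# Inference: argmax over candidate labels.
def features(x, y):
    # one indicator feature per (context word, label) pair
    return {(w, y): 1.0 for w in x.split()}

def score(weights, x, y):
    return sum(weights.get(f, 0.0) * v for f, v in features(x, y).items())

def predict(weights, x, labels):
    # inference: choose the highest-scoring label
    return max(labels, key=lambda y: score(weights, x, y))

# hand-set weights standing in for learned parameters
weights = {("fish", "bass-fish"): 2.0, ("guitar", "bass-music"): 2.0}
print(predict(weights, "smoked fish and bass", ["bass-fish", "bass-music"]))
# → bass-fish
```

Learning would then choose the weights, e.g. by optimizing a training objective over labeled examples.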
Classification for Word Sense Disambiguation of bass
41
• data:
– what is the space of possible inputs and outputs?
– what do (x, y) pairs look like?
– x = the word bass along with its context
– y = word sense of bass (from a list of possible senses)

8 senses of bass in WordNet
J&M/SLP3
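A minimal lexical-sample sketch along these lines: x is a context, y is one of two senses, and the classifier scores each sense by context-word overlap with tiny hand-labeled (and entirely invented) training data:

```python
# Toy WSD for "bass": score each sense by how many (non-stopword) context
# words it shares with that sense's training contexts. Data is invented.
from collections import Counter

STOP = {"a", "the", "in", "on", "with", "my", "up", "from", "and"}

train = [
    ("caught a striped bass in the lake", "bass-fish"),
    ("grilled bass fillet with lemon", "bass-fish"),
    ("bass player joined a jazz band", "bass-music"),
    ("turned up bass on my amplifier", "bass-music"),
]

sense_words = {}
for context, sense in train:
    words = [w for w in context.split() if w not in STOP]
    sense_words.setdefault(sense, Counter()).update(words)

def disambiguate(context):
    words = [w for w in context.split() if w not in STOP]
    # pick the sense whose training contexts overlap most with this one
    return max(sense_words, key=lambda s: sum(sense_words[s][w] for w in words))

print(disambiguate("played bass in a band"))      # → bass-music
print(disambiguate("smoked bass from the lake"))  # → bass-fish
```

Real lexical-sample systems replace the overlap score with a trained classifier over richer context features.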
Inventory of Sense Tags for bass

WordNet Sense | Spanish Translation | Roget Category | Target Word in Context
bass4 | lubina | FISH/INSECT | …fish as Pacific salmon and striped bass and…
bass4 | lubina | FISH/INSECT | …produce filets of smoked bass or sturgeon…
bass7 | bajo | MUSIC | …exciting jazz bass player since Ray Brown…
bass7 | bajo | MUSIC | …play bass because he doesn’t have to solo…

Figure 16.5: Possible definitions for the inventory of sense tags for bass.
the set of senses are small, supervised machine learning approaches are often usedto handle lexical sample tasks. For each word, a number of corpus instances (con-text sentences) can be selected and hand-labeled with the correct sense of the targetword in each. Classifier systems can then be trained with these labeled examples.Unlabeled target words in context can then be labeled using such a trained classifier.Early work in word sense disambiguation focused solely on lexical sample tasksof this sort, building word-specific algorithms for disambiguating single words likeline, interest, or plant.
In contrast, in the all-words task, systems are given entire texts and a lexicon with an inventory of senses for each entry and are required to disambiguate every content word in the text. The all-words task is similar to part-of-speech tagging, except with a much larger set of tags since each lemma has its own set. A consequence of this larger set of tags is a serious data sparseness problem; it is unlikely that adequate training data for every word in the test set will be available. Moreover, given the number of polysemous words in reasonably sized lexicons, approaches based on training one classifier per term are unlikely to be practical.
In the following sections we explore the application of various machine learning paradigms to word sense disambiguation.
16.5 Supervised Word Sense Disambiguation
If we have data that has been hand-labeled with correct word senses, we can use a supervised learning approach to the problem of sense disambiguation: extracting features from the text and training a classifier to assign the correct sense given these features. The output of training is thus a classifier system capable of assigning sense labels to unlabeled words in context.
For lexical sample tasks, there are various labeled corpora for individual words; these corpora consist of context sentences labeled with the correct sense for the target word. These include the line-hard-serve corpus containing 4,000 sense-tagged examples of line as a noun, hard as an adjective, and serve as a verb (Leacock et al., 1993), and the interest corpus with 2,369 sense-tagged examples of interest as a noun (Bruce and Wiebe, 1994). The SENSEVAL project has also produced a number of such sense-labeled lexical sample corpora (SENSEVAL-1 with 34 words from the HECTOR lexicon and corpus (Kilgarriff and Rosenzweig 2000, Atkins 1993), SENSEVAL-2 and -3 with 73 and 57 target words, respectively (Palmer et al. 2001, Kilgarriff 2001)).
For training all-word disambiguation tasks we use a semantic concordance, a corpus in which each open-class word in each sentence is labeled with its word sense from a specific dictionary or thesaurus. One commonly used corpus is SemCor, a subset of the Brown Corpus consisting of over 234,000 words that were manually tagged with WordNet senses.
J&M/SLP3
WSD Evaluation and Baselines
• best evaluation: extrinsic ("task-based")
– embed WSD in a task and see if it helps!
• intrinsic evaluation often done for convenience
• strong baseline: most frequent sense
J&M/SLP3
Most Frequent Sense
• WordNet senses are ordered by frequency
• most frequent is first
• sense frequencies come from SemCor corpus
J&M/SLP3
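The most-frequent-sense baseline is a one-liner once sense counts are available; the tiny count table below is invented for illustration (in practice the frequencies come from SemCor).

```python
from collections import Counter

# hypothetical sense-tagged training counts (SemCor-style; numbers made up)
sense_counts = {
    "bass": Counter({"bass7_music": 9, "bass4_fish": 3}),
    "plant": Counter({"plant_factory": 12, "plant_flora": 8}),
}

def most_frequent_sense(word):
    # ignore the context entirely: always return the sense
    # seen most often for this word in the training corpus
    return sense_counts[word].most_common(1)[0][0]

print(most_frequent_sense("bass"))
```

Despite ignoring context, this baseline is hard to beat for words whose sense distribution is highly skewed, which is why it is the standard point of comparison for WSD systems.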
Performance Ceiling
• human inter-annotator agreement
– compare annotations of two humans on same data, given same tagging guidelines
• human agreement on all-words corpora with WordNet-style senses: 75%-80%
J&M/SLP3
Training Data for WSD
• semantic concordance: corpus in which each open-class word is labeled with a sense from a specific dictionary/thesaurus
– SemCor: 234,000 words from Brown Corpus, manually tagged with WordNet senses
– SENSEVAL-3 competition corpora: 2081 tagged word tokens
J&M/SLP3
Features for WSD? Intuition from Warren Weaver (1955):
"If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words… But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word… The practical question is: 'What minimum value of N will, at least in a tolerable fraction of cases, lead to the correct choice of meaning for the central word?'"
J&M/SLP3
Features for WSD
• collocational
– features for words at specific positions near target word
• bag-of-words
– features for words that occur anywhere in a window of the target word (regardless of position)
J&M/SLP3
Example
• using a window of +/- 3 from the target:
An electric guitar and bass player stand off to one side not really part of the scene
J&M/SLP3
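For the example sentence above, the two feature types can be extracted as follows; the feature-name format is an illustrative choice, not a standard.

```python
sentence = ("An electric guitar and bass player stand off to one side "
            "not really part of the scene").split()
i = sentence.index("bass")   # position of the target word
window = 3

# collocational features: word identity at each specific offset from the target
collocational = {f"w[{j}]={sentence[i + j]}"
                 for j in range(-window, window + 1)
                 if j != 0 and 0 <= i + j < len(sentence)}

# bag-of-words features: which words occur in the window, position ignored
bag_of_words = {w for j, w in enumerate(sentence)
                if j != i and abs(j - i) <= window}

print(collocational)  # contains position-specific features like "w[-1]=and"
print(bag_of_words)   # contains just the words, e.g. "guitar" and "player"
```

Note the asymmetry: "w[1]=player" and "player" are different features; the collocational version fires only when player immediately follows the target, while the bag-of-words version fires anywhere in the window.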
Semi-Supervised Learning
problem: supervised learning requires large hand-built resources
what if you don't have much training data?
solution: bootstrapping
generalize from a very small hand-labeled seed set
J&M/SLP3
Bootstrapping
• "one sense per collocation" heuristic:
– a word reoccurring in collocation with the same word will almost surely have the same sense
• For bass:
– play occurs with the music sense of bass
– fish occurs with the fish sense of bass
J&M/SLP3
Sentences extracted using fish and play
[Figure 16.9 image: scatter of unlabeled (?) and labeled (A/B) instances, with collocate labels LIFE, MANUFACTURING, EQUIPMENT, EMPLOYEE, ANIMAL, MICROSCOPIC; sets V0, V1 and Λ0, Λ1; panels (a) and (b)]
Figure 16.9 The Yarowsky algorithm disambiguating "plant" at two stages; "?" indicates an unlabeled observation, A and B are observations labeled as SENSE-A or SENSE-B. The initial stage (a) shows only seed sentences Λ0 labeled by collocates ("life" and "manufacturing"). An intermediate stage is shown in (b) where more collocates have been discovered ("equipment", "microscopic", etc.) and more instances in V0 have been moved into Λ1, leaving a smaller unlabeled set V1. Figure adapted from Yarowsky (1995).
We need more good teachers – right now, there are only a half a dozen who can play the free bass with ease.
An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
The researchers said the worms spend part of their life cycle in such fish as Pacific salmon and striped bass and Pacific rockfish or snapper.
And it all started when fishermen decided the striped bass in Lake Mead were too skinny.
Figure 16.10 Samples of bass sentences extracted from the WSJ by using the simple correlates play and fish.
strongly associated with the target senses tend not to occur with the other sense. Yarowsky defines his seed set by choosing a single collocation for each sense.
For example, to generate seed sentences for the fish and musical senses of bass, we might come up with fish as a reasonable indicator of bass1 and play as a reasonable indicator of bass2. Figure 16.10 shows a partial result of such a search for the strings "fish" and "play" in a corpus of bass examples drawn from the WSJ.
The original Yarowsky algorithm also makes use of a second heuristic, called one sense per discourse, based on the work of Gale et al. (1992b), who noticed that a particular word appearing multiple times in a text or discourse often appeared with the same sense. This heuristic seems to hold better for coarse-grained senses and particularly for cases of homonymy rather than polysemy (Krovetz, 1998).
Nonetheless, it is still useful in a number of sense disambiguation situations. In fact, the one sense per discourse heuristic is an important one throughout language processing, as it seems that many disambiguation tasks may be improved by a bias toward resolving an ambiguity the same way inside a discourse segment.
J&M/SLP3
Bootstrapping
• "one sense per collocation" heuristic:
– a word reoccurring in collocation with the same word will almost surely have the same sense
• "one sense per discourse" heuristic:
– sense of a word is highly consistent within a document (Yarowsky, 1995)
– especially topic-specific words
J&M/SLP3
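The seed-and-grow idea behind these heuristics can be sketched in a toy loop. This is a deliberately simplified, made-up labeling rule (label a sentence once its context contains collocates of exactly one sense); the real Yarowsky algorithm instead retrains a decision-list classifier each round and keeps only high-confidence rules.

```python
# seed collocates, one per sense, mirroring the bass example
collocates = {"fish": "SENSE-A", "play": "SENSE-B"}
stopwords = {"the", "a", "in", "is", "he", "will", "and", "bass"}  # never become collocates

unlabeled = [
    "a freshwater bass can be smoked".split(),        # no seed collocate yet
    "the jazz bass and guitar".split(),               # no seed collocate yet
    "the striped bass is a freshwater fish".split(),  # seed: fish
    "he will play bass in the jazz band".split(),     # seed: play
]

labeled = {}
changed = True
while changed:
    changed = False
    for sent in unlabeled:
        key = " ".join(sent)
        if key in labeled:
            continue
        # label a sentence once its words match collocates of exactly one sense
        senses = {collocates[w] for w in sent if w in collocates}
        if len(senses) == 1:
            labeled[key] = sense = senses.pop()
            changed = True
            # grow the collocate set from the newly labeled sentence
            # (real algorithm: keep only high-confidence decision-list rules)
            for w in sent:
                if w not in stopwords:
                    collocates.setdefault(w, sense)

print(labeled)
```

On the first pass only the two seed-bearing sentences get labeled; they contribute "freshwater" and "jazz" as new collocates, which label the remaining two sentences on the second pass, illustrating how the labeled set Λ grows while the unlabeled set V shrinks.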
Stages in Yarowsky bootstrapping algorithm for plant
[Figure 16.9 repeated on the slide: the two-stage Yarowsky scatter plot for plant, panels (a) and (b)]
J&M/SLP3
Summary
• word sense disambiguation: choosing correct sense in context
• applications: MT, QA, etc.
• main intuition:
– lots of information in a word's context
– simple algorithms based on word counts can be surprisingly good
57
Roadmap
• lexical semantics
– word sense
– word sense disambiguation
– word representations
58
• one form, multiple meanings → split form
– the three senses of fool belong to different synsets
• multiple forms, one meaning → merge forms
– each synset contains senses of several different words
59
ambiguity
variability
• WordNet splits words into synsets; each contains senses of several words:
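This many-to-many mapping between words and synsets can be sketched with a toy table (an abridged, hand-copied fragment of WordNet's entries for fool; real WordNet stores glosses, relations, and far more members per synset):

```python
# toy synset table: synset id -> set of member words (abridged from WordNet)
synsets = {
    "fool.n.01": {"fool", "sap", "muggins"},           # person lacking good judgment
    "fool.n.02": {"fool", "chump", "gull"},            # gullible person
    "jester.n.01": {"fool", "jester", "motley_fool"},  # professional clown
}

def senses_of(word):
    # ambiguity: one form can belong to multiple synsets (split the form)
    return [sid for sid, members in synsets.items() if word in members]

print(senses_of("fool"))        # the three senses of fool, in different synsets
print(synsets["jester.n.01"])   # variability: several forms share one meaning
```

The lookup in one direction exposes ambiguity (fool appears in three synsets), while the table in the other direction merges variability (jester, fool, and motley_fool share one synset).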
• are we finished? have we solved the problem of representing word meaning?
• issues:
– WordNet has limited coverage and only exists for a small set of languages
– WSD requires training data, whether supervised or seeds for semi-supervised
• better approach: jointly learn representations for all words in an unsupervised way
category    unique strings
noun        117,798
verb        11,529
adjective   22,479
adverb      4,481
62
t-SNE visualization from Turian et al. (2010)
Word Embeddings (Bengio et al., 2003)