
The Rumble In The Disambiguation Jungle
Towards the comparison of a traditional word sense disambiguation system with a novel paraphrasing system

Kelly H Smith

Institutionen för lingvistik

Examensarbete 180 hp

Kandidatprogram i datorlingvistik (180 hp)

Vårterminen 2011

Handledare: Katrin Erk och Robert Östling

The Rumble In The Disambiguation Jungle
Towards the comparison of a traditional word sense disambiguation system with a novel paraphrasing system

Abstract
Word sense disambiguation (WSD) is the process of computationally identifying and labeling polysemous words in context with their correct meaning, known as a sense. WSD is riddled with various obstacles that must be overcome in order to reach its full potential. One of these problems is the representation of word meaning. Traditional WSD algorithms make the assumption that a word in a given context has only one meaning and can therefore return only one discrete sense.

On the other hand, a novel approach is that a given word can have multiple senses. Studies on graded word sense assignment [7] as well as in cognitive science [9, 17] support this theory. It has therefore been adopted in a novel paraphrasing system which performs word sense disambiguation by returning a probability distribution over potential paraphrases (in this case synonyms) of a given word. However, it is unknown how well this type of algorithm fares against the traditional one. The current study thus examines if and how it is possible to make a comparison of the two. A method of comparison is evaluated and subsequently rejected. Reasons for this as well as suggestions for a fair and accurate comparison are presented.

Sammanfattning
Orddisambiguering (Word Sense Disambiguation, WSD) innebär att automatiskt identifiera den betydelse av ett mångtydigt ord som avses i en viss kontext. Det finns många underliggande problem med WSD och dessa måste bemästras för att det ska kunna uppnå sin fulla potential. Ett av dessa problem är hur ordbetydelser representeras. Traditionella WSD-algoritmer antar att ett ord i en viss kontext har endast en betydelse, och kan därför endast ange en diskret mening. Enligt ett annat synsätt kan ord i en viss kontext ha mer än en betydelse. Studier om graderade betydelser [7] samt inom kognitionsvetenskap [9, 17] stödjer denna teori. Den har därför använts som grund i ett nytt WSD-system som anger betydelsen av ett ord i en viss kontext som en sannolikhetsfördelning över potentiella parafraser (i detta fall partiella synonymer) på ett givet ord. Det är dock oklart hur denna algoritm presterar jämfört med traditionella system baserade på diskreta ordbetydelser. Denna studie undersöker därför hur och om dessa två typer av system kan jämföras. En metod för jämförelse utvärderades och förkastades. Anledningarna till detta samt förslag till en mer korrekt och rättvis jämförelse presenteras.

Nyckelord/Keywords
Word Sense Disambiguation, WSD, paraphrasing systems, semantics, word senses, representing word meaning

Contents
1 Introduction
2 Purpose
3 Background
  3.1 What is word sense disambiguation and why should we use it?
  3.2 Resources used in WSD
    3.2.1 What is WordNet?
    3.2.2 What is SemCor?
  3.3 Representing word meaning
    3.3.1 WSD systems representing word meaning with discrete senses
    3.3.2 All words paraphrasing system (MEPS)
  3.4 Supervised versus unsupervised WSD algorithms
4 Method
  4.1 Compiling a gold standard
    4.1.1 Amazon Mechanical Turk
    4.1.2 The task and the data used
    4.1.3 The final gold standard
    4.1.4 Reliability of AMT
    4.1.5 Concerns with workers' results and how they were handled
    4.1.6 Processing AMT results
  4.2 MEPS input
  4.3 The chosen method of comparison
  4.4 Using a hypothetical, best possible, traditional WSD system
5 Results
  5.1 AMT compiled gold standard
  5.2 Precision and Recall of the SemCor human annotated sentences and most frequent sense baseline with respect to the AMT gold standard
    5.2.1 Precision
    5.2.2 Recall
  5.3 MEPS precision and recall with respect to the AMT gold standard
  5.4 Weighted accuracy
6 Discussion
  6.1 The skewed gold standard and its effects
  6.2 Variations on the number of workers needed to deem a paraphrase acceptable
  6.3 Reliability and validity
  6.4 Future research
7 Conclusion
Appendices


1 Introduction
Word sense disambiguation (WSD) is the process of computationally identifying and labeling polysemous words in context with their correct meaning, known as a sense. WSD is riddled with various obstacles that must be overcome in order to reach its full potential. One of these problems is the representation of word meaning.

Traditional WSD systems view word meaning as being clear cut. In other words, a word in a particular context has one and only one sense. The systems take text as input, and, using a sense inventory, all possible senses for a given word are determined and finally the most appropriate is chosen.

A commonly used sense inventory for English WSD is WordNet ([14], [8]); a machine readable dictionary and thesaurus which lists a word's senses starting with their most common usage. WSD systems using WordNet will return a WordNet sense for each disambiguated word. For example, if such a system is used to disambiguate the word argument in the sentence "Changes in the basic wage rate are cost raising and they constitute an argument for raising prices," the sense assigned by the WSD system for the disambiguated word, if correct, will be the first sense (of seven total) listed for 'argument': a fact or assertion offered as evidence that something is true.

A novel approach to the representation of word meaning is that it is possible for words in a particular context to have more than one applicable sense. Studies have shown that annotators, if given the possibility, will at times prefer to assign multiple senses to a given word as well as using a graded scale of sense applicability [7]. Thus, perhaps the computational annotation of word senses should also have this ability? This approach has been adopted in a novel paraphrasing system developed by Taesun Moon and Katrin Erk. Rather than assigning one sense to a word in a particular context, it returns a probability distribution over a list of possible paraphrases, which in this case are (partial) synonyms of the word.

In order to determine whether a traditional WSD system or a paraphrasing system produces the best results, a comparison of the two must be made. However, what method of comparison will be the most fair and accurate?

The goal of this study thus is to investigate if and how it is possible to compare WSD systems using discrete word senses with a system that assigns probability distributions over paraphrases.

The method chosen to compare the two approaches was to transform the output of a traditional system, i.e. discrete WordNet senses, to that of the paraphrasing system. First, using Amazon Mechanical Turk, a gold standard was compiled in order to compare the two systems. Paraphrase ratings were collected for a set of lemmas (the dictionary form of a word; for example, ask is the lemma of asked and asking) which were to be disambiguated. These paraphrase ratings were yes or no judgements as to the interchangeability of certain words in a given context. However, due to the fact that many WordNet senses do not have an appropriate paraphrase, the gold standard became insufficient to make a proper comparison. The reason for this is explained in the results and conclusion sections.

NOTE: Due to the constraints of the Stockholm University thesis template, the humanities style of referencing has not been used.

2 Purpose
The purpose of this study is threefold. The main objective is to investigate if and how it is possible to compare two different types of word sense disambiguation (WSD) systems. The first returns a probability distribution over possible paraphrases for each disambiguated word. The second is a hypothetical WSD system which would use a traditional representation of word meaning, i.e. it returns one, discrete sense per disambiguated word. To perform the comparison, Amazon Mechanical Turk (AMT) is used to compile a gold standard. Thus the second objective is to evaluate the quality of the AMT results, both generally but also specifically in the context of the first objective. The third objective is to evaluate the WSD system which returns a probability distribution over possible paraphrases.

3 Background
3.1 What is word sense disambiguation and why should we use it?
Word sense disambiguation is the computational classification task of identifying and labeling polysemous words in context with the correct meaning, known as a sense.

Ambiguity is commonplace in everyday language and we as humans hardly take a second to reflect upon it. Knowledge of the world and an ability to discern the correct meaning of an ambiguous word in a certain context make word sense disambiguation very easy for us. For example, you and I know that given the context "I will do some bass fishing this weekend," we are obviously talking about the fish sense of the word bass, but how would a computer know this?

Traditional WSD ultimately encompasses two steps: first, for all relevant words in a text, determine all possible senses that can be applied to them, and second, assign the correct sense to the words given the context they appear in [10].

Why should a computer even need to be able to disambiguate words? WSD is an enabling technology, meaning that by itself it is not such a useful tool, but if accurate it can be used to improve results in other technologies [10]. It is an area of natural language processing (NLP) where substantial improvements can and should be made. Scholars claim that WSD could improve many different applications within the field of NLP such as machine translation [3, 4], information retrieval (IR) (for example [20], where use of WSD improved IR results by 7-14%), as well as IR related fields such as question answering and document classification (see for example [23]).

With state-of-the-art WSD systems achieving precision and recall of 69% [1], there is certainly room for improvement. WSD is ultimately not so enabling if it is incorrect in its disambiguation slightly more than one fourth of the time. It could thus be argued that perhaps traditional methods of modeling word senses are flawed and that a different approach should be examined. This is discussed further in section 3.3, as the aspect of re-examining traditional approaches to the representation of word meaning in WSD is central to the current study.

Two variants of the evaluation of WSD exist: the lexical-sample task and all-words disambiguation. In a lexical sample task, a predetermined sample of words is to be disambiguated. On the other hand, the all-words approach is more realistic, but also more difficult [18]. In the current study, the paraphrasing system is an all-words system. The hypothetical, traditional WSD system is also capable of an all-words task.

3.2 Resources used in WSD
For a system to perform word sense disambiguation at least two things are necessary: a sense inventory and a corpus [18].

One or more corpora (a large structured set of text) are necessary to have data to be disambiguated. If the WSD system uses a supervised method it will need what is known as training data, which is annotated with appropriate sense information. The test data is the portion of the corpus which is actually disambiguated by the WSD system.

A sense inventory is also needed, which is a digital resource similar to a dictionary and contains the different meanings of a word. This inventory is used for the first step of WSD: the determination of all possible senses for every word relevant to the text being disambiguated. It can also be used for the second step, which involves the actual assignment of senses to words.


3.2.1 What is WordNet?
One of the most commonly used sense inventories for English WSD is WordNet 1 [8, 14], and it is also used in the current study. WordNet describes words by their synsets, the WordNet term for sets of synonyms, which represent a single lexical concept. These synonyms are assumed to be interchangeable, meaning that given a certain context, each synonym can be used in place of one another without changing the meaning of the sentence. In the present study, these will hereafter be referred to as paraphrases. Information about semantic relationships such as antonymy, meronymy, hyponymy etc. is also included. WordNet has been released under a BSD style license, meaning it can be downloaded and used freely.

In the current study, WordNet version 3.0 has been used. This version contains 155,287 unique words and 117,659 synsets [2].

An example of the WordNet entry for argument, with seven synsets:

Argument, noun
• argument, statement (a fact or assertion offered as evidence that something is true) "it was a strong argument that his hypothesis was true"
• controversy, contention, contestation, disputation, disceptation, tilt, argument, arguing (a contentious speech act; a dispute where there is strong disagreement) "they were involved in a violent argument"
• argument, argumentation, debate (a discussion in which reasons are advanced for and against some proposition or proposal) "the argument over foreign aid goes on and on"
• argument, literary argument (a summary of the subject or plot of a literary work or play or movie) "the editor added the argument to the poem"
• argument, parameter ((computer science) a reference or value that is passed to a function, procedure, subroutine, command, or program)
• argument (a variable in a logical or mathematical expression whose value determines the dependent variable; if f(x)=y, x is the independent variable)
• argumentation, logical argument, argument, line of reasoning, line (a course of reasoning aimed at demonstrating a truth or falsehood; the methodical process of logical reasoning) "I can't follow your line of reasoning"

A frequent complaint directed at WordNet is that the senses are too fine-grained. An attempt to overcome this has been made with the use of coarse-grained senses, in which clustering algorithms are used to group sense distinctions. In a study by Palmer, Dang, and Fellbaum [19], this led to inter-annotator agreements of 90%. Improvements in automatic WSD increased approximately 10%; however, a coarse-grained inventory is not always suitable for all types of WSD [21].

3.2.2 What is SemCor?
SemCor is a semantically tagged corpus which was created with the intent of aiding research in automatic sense tagging, i.e. WSD. The corpus is a subset of the Brown corpus, and using a tagging interface designed to facilitate manual annotation of senses, the text has been tagged with senses from WordNet. The authors of SemCor refer to it as a semantic concordance [15].

An example of an entry in the SemCor corpus is "Changes in the basic wage rate are cost raising and they constitute an argument for raising prices." For this sentence, the lemma argument has been tagged with sense one from WordNet (see above), a fact or assertion offered as evidence that something is true. Besides a WordNet sense tag, words are also part of speech tagged as well as tagged with information about a general hyponym.

An example of a SemCor sentence in HTML format used as input to a traditional WSD system (Sinha and Mihalcea's unsupervised graph-based WSD system specifically [21]) follows for the previously presented example sentence, where wnsn=n is the WordNet sense tag that has been assigned to the preceding lemma. In actuality SemCor uses an SGML markup scheme.

1 Online WordNet search at http://wordnetweb.princeton.edu/perl/webwn


<p pnum=18>
<s snum=63>
<wf cmd=done id=s63.t1 pos=NN lemma=change wnsn=2 lexsn=1:24:00::>Changes</wf>
<wf cmd=ignore pos=IN>in</wf>
<wf cmd=ignore pos=DT>the</wf>
<wf cmd=done id=s63.t2 pos=JJ lemma=basic wnsn=1 lexsn=3:00:00::>basic</wf>
<wf cmd=done id=s63.t3 pos=NN lemma=wage wnsn=1 lexsn=1:21:00::>wage</wf>
<wf cmd=done id=s63.t4 pos=NN lemma=rate wnsn=1 lexsn=1:21:00::>rate</wf>
<wf cmd=done id=s63.t5 pos=VB lemma=be wnsn=1 lexsn=2:42:03::>are</wf>
<wf cmd=done id=s63.t6 pos=NN lemma=cost wnsn=1 lexsn=1:21:00::>cost</wf>
<wf cmd=done id=s63.t7 pos=JJ lemma=raising wnsn=1 lexsn=5:00:00:increasing:00>raising</wf>
<punc>,</punc>
<wf cmd=ignore pos=CC>and</wf>
<wf cmd=ignore pos=PRP>they</wf>
<wf cmd=done id=s63.t8 pos=VB lemma=constitute wnsn=1 lexsn=2:42:00::>constitute</wf>
<wf cmd=ignore pos=DT>an</wf>
<wf cmd=done id=s63.t9 pos=NN lemma=argument wnsn=1 lexsn=1:10:02::>argument</wf>
<wf cmd=ignore pos=IN>for</wf>
<wf cmd=done id=s63.t10 pos=VB lemma=raise wnsn=1 lexsn=2:30:01::>raising</wf>
<wf cmd=done id=s63.t11 pos=NN lemma=price wnsn=2 lexsn=1:07:00::>prices</wf>
<punc>.</punc>
</s>
</p>

3.3 Representing word meaning
WSD is one of the oldest areas of NLP research and has been one of the most difficult NLP problems to solve. It is known as an AI-complete or AI-hard task, meaning that solving the WSD task is as difficult as solving the most difficult of AI tasks: the synthesis of a human-level intelligence. Where NLP is concerned, this means building a system that can understand and speak a natural language as well as a human [10].

Why is WSD ultimately such a difficult task? One of the most prominent difficulties associated with WSD is the representation of word meaning. Traditional WSD systems are modeled after the notion that a word in a particular context has one and only one sense. However, studies have shown that given the option, annotators will make use of a graded scale as to a sense's applicability, and at times will assign multiple senses to a given word [7].

In contrast to the notion of clear-cut senses, studies show that categories in the human mind instead seem to overlap [9, 17]. Since word meanings are mental concepts [17], perhaps the WSD system should be modeled on the theory that words can have overlapping senses. This is the approach taken in the paraphrasing system presented in the current study, hereafter known as MEPS (Moon and Erk paraphrasing system).

3.3.1 WSD systems representing word meaning with discrete senses
Traditional WSD systems are based on the assumption that word meaning is clear cut. In other words, a word in a particular context has one and only one sense. These systems take text as input, and, using a sense inventory, all possible senses for a given word are determined [18].

If using WordNet as the sense inventory, these systems will return a WordNet sense for each disambiguated word in the given text. For example, suppose we wish to disambiguate the word "argument", which has seven WordNet senses in total, in the sentence "Changes in the basic wage rate are cost raising and they constitute an argument for raising prices." The output from the WSD system for the disambiguated word, if correct, will be sense one: a fact or assertion offered as evidence that something is true. An example of the WordNet entry for "argument" can be found in section 3.2.1. An example of this type of system is Ravi Sinha and Rada Mihalcea's unsupervised, graph-based WSD system [21].

3.3.2 All words paraphrasing system (MEPS)
On the other hand, the system designed by Taesun Moon and Katrin Erk [16] uses graded models of word meaning in context to characterize the meaning of individual usages (a word occurrence in a particular context) without reference to dictionary senses. Rather than referring to dictionary senses, the system is provided with possible paraphrases for the set of words to be disambiguated. Compared to other paraphrasing systems, it is currently state-of-the-art [16].

The task is a novel one and can be viewed as an all-words paraphrasing model. Meaning is represented with a probability distribution over potential paraphrases. This distribution is inferred using undirected graphical models which incorporate evidence such as dependency parses, document context, and collocational information. Dependency parses are transformed into undirected graphical models, and more explicitly as factor graphs. Factor graphs are bipartite graphs over two types of nodes, an observed node which represents the surface form of a word, and a hidden node representing its usage meaning. The paraphrase distribution of a word is thus the distribution inferred over the hidden node given evidence from its observed context [16].

For each dependency edge in the graph, a selectional factor is inserted which models the influence of selectional preferences on mutually contextualizing two adjacent words. For estimating selectional factor parameters, C&C parses [5] of the British National Corpus (BNC, consisting of 100 million words) and UKWAC (2 billion words) are used.

The graph transformation used in the current study is referred to by the authors of [16] as an adjacency transformation. In this transformation the paraphrase distribution of a word is inferred from the neighboring words. The original dependency graph is converted into a set of disconnected graphs with one paraphrase distribution in each component. Thus in this transformation collocational information is taken into account [16].
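To make the adjacency transformation more concrete, the following is a minimal Python sketch (an illustration, not the authors' implementation), assuming a dependency parse is available as a list of (head, dependent) index pairs; the example sentence indices and edge list are hypothetical.

from collections import defaultdict

def adjacency_components(edges, target_indices):
    # Build an undirected neighbour map from the dependency edges.
    neighbours = defaultdict(set)
    for head, dep in edges:
        neighbours[head].add(dep)
        neighbours[dep].add(head)
    # One disconnected component per target word: the target plus its direct
    # neighbours, mirroring the adjacency transformation described above.
    return {t: {t} | neighbours[t] for t in target_indices}

# Hypothetical parse of "Changes constitute an argument for raising prices"
# (word indices 0-5); the component around "argument" (index 3) keeps only
# its direct neighbours, from which its paraphrase distribution is inferred.
edges = [(1, 0), (1, 3), (3, 2), (3, 4)]
print(adjacency_components(edges, target_indices=[3]))   # {3: {1, 2, 3, 4}}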

As previously mentioned, document context is also incorporated. In order to learn topic model parameters, 26,000 documents from UKWAC are combined with the full contents of the LEXSUB dataset (the dataset introduced for the 2007 SemEval Lexical Substitution task, as cited in [16]). Using these documents, topic parameters are learned using MALLET 2 (A Machine Learning for Language Toolkit, as cited in [16]).

All of these steps are performed by the model; however, various forms of input are necessary, as well as preprocessing of the corpus. This is presented in section 4.2 on page 10.

3.4 Supervised versus unsupervised WSD algorithms
Corpus-based methods of WSD include supervised and unsupervised methods. Supervised approaches to WSD use machine learning techniques to train a classifier from labeled data. In other words, a data set which is tagged with the appropriate sense information is used to "teach" the system. Supervised methods utilize machine learning techniques such as decision lists, decision trees, Naive Bayes, neural networks, exemplar based learning, and support vector machines, as well as ensemble methods which combine various previously named approaches [18]. Generally supervised methods have proven the most successful at WSD. For a lexical-sample task, accuracy lies between 70-80% [13], and around 69% for an all-words disambiguation task [1]. Accuracy is the percentage of words that are tagged identically with the test set [11].

Unfortunately, WSD is subject to what is known as the knowledge acquisition bottleneck, which means that annotated data is most often hard to obtain as it is both time consuming and expensive to produce. Therefore, when successful, unsupervised methods can be preferable. Unsupervised WSD is also known as word sense induction, and unlike supervised methods, does not use any annotated data to train on. Instead, it is assumed that similar senses occur in similar contexts, and word senses are induced by clustering word occurrences using some measure of similarity of context. The main approaches include context clustering, word clustering, and co-occurrence graphs [18]. Current state-of-the-art performance for word sense induction on a lexical-sample task obtains an F-score of 63% [12]. F-score is a measure of a system's accuracy considering both precision and recall.

2 http://mallet.cs.umass.edu

MEPS is an unsupervised system in that it has a list of paraphrases, but no supervised training data. It uses a graph-based model [16].

4 Method
4.1 Compiling a gold standard
4.1.1 Amazon Mechanical Turk
In order to compare the two systems a gold standard was needed and was obtained through the use of Amazon Mechanical Turk (AMT) 3. AMT is a service provided by Amazon Web Services 4 in which workers can complete small, online tasks in return for payment. Generally the tasks, known as HITs (human intelligence tasks), are quite simple and pay around $0.05 (ca 0.3 Swedish crowns).

4.1.2 The task and the data used
The data used in the HITs is centered around eight lemmas: add, argument, ask, different, important, interest, paper, and win. For every lemma, 25 sentences were randomly sampled from the sense tagged corpus SemCor [15], totaling 200 sentences for the evaluation. These were taken from a previous study by Erk, 2009 [6], which studied graded word sense assignment. Nine separate workers on AMT provided annotations for each sentence.

The task presented to workers on AMT was to decide whether, for a given sentence, a number of paraphrases could be substituted for a boldfaced word within said sentence while still retaining the same sentential meaning. Simply put, was word X a suitable paraphrase for word Y given the context?

The paraphrases used consisted of all synonyms within the given lemma's WordNet synsets (WordNet version 3.0). Multi-word expressions were excluded due to MEPS constraints. A complete list of paraphrases used is presented in table 1 on the following page.

AMT was chosen as a suitable way to obtain paraphrase ratings for two reasons: first, due to the fact that all-words WSD datasets cannot be used to evaluate MEPS as they are labeled with a single best sense for each word [16], and second, because of the high cost of obtaining expert annotations.

Three different types of HITs were created based on the number of paraphrases each lemma had. One version contained only one sentence with one of the three lemmas add and argument (with 11 paraphrases each) and ask (with 10 paraphrases). Two fake paraphrases were also presented (more information about the fake paraphrases is included in section 4.1.5). The second version of the HITs included two sentences, one in which the lemma interest was used (with 6 possible paraphrases), and one in which the lemma important appeared (with 3 possible paraphrases). Among the paraphrases for each of the two lemmas a fake paraphrase was also included. The third version of the HITs included three sentences which presented the lemmas different, paper, and win respectively. These three lemmas each had four paraphrases as well as one fake paraphrase.

The workers were paid $0.02 for the task consisting of one sentence, and $0.03 for the other two tasks.

The layout of the HITs began with the instructions to the worker. This was followed by the sentence in which the lemma was presented in bold font. The worker was asked to give a yes or no rating for each paraphrase succeeding the sentence as to whether the paraphrase could be substituted for the boldfaced word while still maintaining the same sentential meaning. The ratings were independent of each other, meaning that a worker could answer that more than one paraphrase was applicable.

3 https://www.mturk.com/
4 http://aws.amazon.com/


Table 1: Lemmas and corresponding paraphrases used in study

Lemma POS Paraphrases

Add v append, supply, lend, impart, bestow, contribute, bring, total, tot, sum, summate

Argument n statement, controversy, contention, contestation, disputation, disceptation, tilt, argumentation, debate, parameter, line

Ask v inquire, enquire, require, expect, necessitate, postulate, need, take, involve, demand

Different adj unlike, dissimilar, distinct, separate

Important adj significant, crucial, authoritative

Interest n involvement, sake, interestingness, stake, pastime, pursuit

Paper n composition, report, theme, newspaper

Win v acquire, gain, advance, succeed

As mentioned previously, three different versions of the HITs were used. For the two and three sentence HITs, two and three sentences respectively were presented to the worker. In a pilot study a comment box was also available for workers to give advice as to how the layout could be improved. An example of the HITs used is shown in figure 1 on the next page.

4.1.3 The final gold standard
The final gold standard compiled from the AMT data consisted of, for each sentence, the paraphrase ratings assigned by the workers. These ratings were the number of times a paraphrase had been given a yes vote, i.e. the worker considered it a suitable paraphrase for the boldfaced lemma given the context. A sample of the gold standard (for 13 sentences containing the lemma different) is presented in table 2 on page 9. A complete summary, sans corresponding sentences, is located in the Appendix.

4.1.4 Reliability of AMT
A study which evaluated non-expert annotations for various natural language tasks found that when ten annotators were asked to disambiguate the word president between three senses, accuracy rates plateaued at 99.4% [22]. However, this task was quite simple, with the best system performance at SemEval 2007 achieving an accuracy of 98% for the same task [22]. The present study however is more challenging. Rather than choosing the most appropriate sense for a given word, annotators are asked to choose between a number of suitable paraphrases, some of which pertain to the same sense. An evaluation of the agreement between these annotators is presented in section 5.1 on page 11.

4.1.5 Concerns with workers' results and how they were handled
One downside to AMT is the anonymity of the workers. Only limited specifications can be made as to who works on a given task. In the present study the requirements were simply that the worker must have an approval rating of at least 95% (i.e. at least 95% of their previous work on AMT has been approved) as well as being located in the United States, to help ensure that the worker speaks English.

Aside from the aspect of worker anonymity, another issue with the use of AMT is the possibility of fraud. Workers can potentially answer by randomly clicking without bothering to read the instructions and actually perform the task. Due to the fact that in this study a large number of HITs was uploaded to AMT, fraudulent workers could also write a script to answer the HITs automatically. Precautions were taken to filter out potential spammers by adding fake paraphrases to the list of paraphrases the workers were asked to rate. All HITs contained at least two of these fake paraphrases, which were usually antonyms of the lemma being rated. For example, for the lemma win, the word forfeit was included among the list of paraphrases taken from WordNet, the reasoning being that in no circumstance whatsoever could the word forfeit be substituted for the word win while maintaining the same meaning in the sentence.


Figure 1: An example of the paraphrasing task presented to workers on Amazon Mechanical Turk


Table 2: AMT results for 13 sentences containing the lemma different. Rating is the sum of yes answers.

Are you experimenting with different selling slants in developing new customers?
distinct 4, unlike 2, separate 2, dissimilar 7

Rather large differences were still found between reaction cells from different manifold fillings.
distinct 6, unlike 3, separate 6, dissimilar 5

They spread over an area no larger than Oregon; yet they include peoples as different from one another as Oregonians are from Patagonians.
distinct 9, unlike 8, separate 3, dissimilar 9

The truth is, though, that men react differently to different treatment.
distinct 2, unlike 2, separate 4, dissimilar 4

When they got home Harold was grateful for the stillness in the apartment, and thought how, under different circumstances, they might have stayed on here, in these old-fashioned, high-ceilinged rooms that reminded him of the Irelands' apartment in the East Eighties.
distinct 2, unlike 2, separate 5, dissimilar 7

The final sample was not significantly different from a normal distribution in regard to reading achievement or intelligence test scores.
distinct 6, unlike 5, separate 0, dissimilar 9

Some of them are obvious, such as the fact that we associate recorded and live music with our responses and behavior in different types of environments and social settings.
distinct 6, unlike 1, separate 4, dissimilar 6

There are many grammatical misconstructions other than dangling modifiers and anatomicals which permit two different interpretations.
distinct 6, unlike 2, separate 7, dissimilar 7

We will denote the values of f ( t ) on different components by **f.
distinct 4, unlike 3, separate 7, dissimilar 3

It is presumed that this negative head was associated with some geometric factor of the assembly, since different readings were obtained with the same fluid and the only apparent difference was the assembly and disassembly of the apparatus.
distinct 4, unlike 3, separate 3, dissimilar 9

I want, therefore, to discuss a second and quite different fruit of science, the connection between scientific understanding and fear.
distinct 7, unlike 3, separate 5, dissimilar 8

But contest definition-that dramatic muscular separation of every muscle group that seems as though it must have been carved by a sculptor's chisel-is something quite different.
distinct 6, unlike 1, separate 5, dissimilar 5

Missionary outreach by friendly contact looks somewhat different when one reflects on what is known about friendly contact in metropolitan neighborhoods; the majority of such contacts are with people of similar social and economic position; association by level of achievement is the dominant principle of informal relations.
distinct 3, unlike 3, separate 2, dissimilar 6


AMT gives requesters the opportunity to limit work to one unique worker per HIT. This option was utilized by uploading all HITs at once to ensure that the same worker could not annotate the same sentence more than once.

4.1.6 Processing AMT results
When all HITs were completed they needed to be approved or rejected. This process was to a great extent automated. A script written for the study was run on the returned results file, which automatically rejected any HIT in which a fake paraphrase had been answered as an acceptable paraphrase. The script also rejected a HIT in which answers were missing. AMT provides the requester with the work time in seconds for every HIT submitted. Using this information, a control was also performed by checking results from any HIT that had been completed in a much lower than average work time, the assumption being that a human who was performing the HIT appropriately would have a work time close to the average. The results from the rejected HITs were then deleted and resubmitted to AMT for a new worker to complete, while accepted results were subsequently added to a final results file.
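A minimal Python sketch of the kind of filter described above follows, assuming the AMT results were exported as a CSV file with AMT's usual Answer.* columns and WorkTimeInSeconds field; the file name, the fake-paraphrase column names, and the work-time threshold are illustrative assumptions, not taken from the actual script.

import csv

FAKE_COLUMNS = {"Answer.forfeit"}      # hypothetical control-question columns
MIN_TIME_FRACTION = 0.3                # hypothetical "much lower than average" cut-off

def triage_hits(rows):
    times = [float(r["WorkTimeInSeconds"]) for r in rows]
    avg_time = sum(times) / len(times)
    accepted, rejected, to_check = [], [], []
    for row in rows:
        answers = {k: v for k, v in row.items() if k.startswith("Answer.")}
        missing = any(v.strip() == "" for v in answers.values())
        fake_accepted = any(row.get(col, "").lower() == "yes" for col in FAKE_COLUMNS)
        if missing or fake_accepted:
            rejected.append(row)       # auto-reject; re-uploaded for a new worker
        elif float(row["WorkTimeInSeconds"]) < MIN_TIME_FRACTION * avg_time:
            to_check.append(row)       # suspiciously fast: checked manually
        else:
            accepted.append(row)
    return accepted, rejected, to_check

with open("amt_results.csv", newline="") as f:   # hypothetical file name
    accepted, rejected, to_check = triage_hits(list(csv.DictReader(f)))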

In the end approximately 15% of submitted work was rejected due to incomplete work or incorrectly answered control questions, i.e. fake paraphrases had been answered as appropriate paraphrases. Note that all of these HITs were uploaded to AMT again, and reliable results were subsequently obtained.

4.2 MEPS input
MEPS uses various forms of input in order to run. First the sentences used in the study were tokenized using the Penn Tree Bank Tokenizer 5. They were then lemmatized, POS tagged, and dependency parsed using the C&C parser [5].

The system was also provided with an input file indicating the target word's (the word to be disambiguated) index in the sentence, and an input file listing all possible paraphrases for a given lemma. The paraphrases used consisted of all synonyms within the given lemma's WordNet senses and were the same paraphrases presented to workers on AMT (see table 1 on page 7).

4.3 The chosen method of comparison
While the paraphrasing system achieves good results, it is performing a different task than a WSD system which assigns discrete senses. As mentioned previously, a traditional system, if using a sense inventory such as WordNet, outputs a number representing a WordNet synset for every word to be disambiguated. MEPS however returns a list of possible paraphrases for a given word along with a probability distribution over the paraphrases. In order to stay true to the theory that a word in a certain context can have more than one applicable sense, the decision was made to transform the output of the traditional system to that of the paraphrasing system. One of the objectives of the study was to use AMT to evaluate MEPS, therefore the gold standard consisted of yes/no ratings of paraphrases rather than of senses. This was another reason behind the decision to transform traditional output to a probability distribution rather than the other way around.

In order to transform the traditional system's discrete output into a probability distribution, all synonyms within the assigned sense are given equal probability. This is illustrated in a hypothetical example in table 3 on the next page. Here, for the verb win, MEPS has assigned a probability of 0.8 to acquire, and 0.2 to advance. On the other hand, sense 2 has been assigned in the SemCor example, resulting in a probability of 0.5 for acquire and gain respectively. The boldfaced word acquire has been chosen as the most suitable paraphrase by AMT workers, therefore the system with the best results in this hypothetical example is MEPS.

Due to the fact that various senses included no paraphrases, contained multi-word expressions (which could not be handled by MEPS), or both, it was assumed that the traditional system had assigned no WordNet sense in this type of scenario, and thus no probability distribution was assigned. This was done because these senses had not been presented to workers on AMT and thus were not represented in the final gold standard, and therefore it would not be fair to use them in the comparison. Return to the hypothetical example in table 3: there, sense one contains no paraphrase.

5 http://www.cis.upenn.edu/~treebank/tokenizer.sed

Table 3: A hypothetical example of a MEPS probability distribution, as well as an assigned word sense from a traditional system represented by a probability distribution. The boldfaced word acquire represents the paraphrase chosen by AMT workers and used in the gold standard.

Win, verb

             Sense 1            Sense 2              Sense 3
Paraphrase   (no paraphrase)    acquire     gain     advance
MEPS                            0.8                  0.2
Traditional                     0.5         0.5
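The transformation described in section 4.3 can be sketched as follows (an illustrative Python snippet, not the actual evaluation code); the data mirrors the hypothetical example in table 3.

def sense_to_distribution(synonyms):
    # Keep only single-word paraphrases, since MEPS cannot handle multi-word ones.
    usable = [s for s in synonyms if " " not in s]
    if not usable:
        return {}            # no paraphrase available: no distribution is assigned
    share = 1.0 / len(usable)
    return {s: share for s in usable}

# Hypothetical example from table 3: the traditional system assigned sense 2
# of "win" (synonyms acquire and gain), MEPS returned its own distribution,
# and AMT workers chose acquire as the suitable paraphrase.
traditional = sense_to_distribution(["acquire", "gain"])   # {'acquire': 0.5, 'gain': 0.5}
meps = {"acquire": 0.8, "advance": 0.2}
gold = {"acquire"}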

4.4 Using a hypothetical, best possible, traditional WSD system
Preliminary results revealed flaws in the proposed method of comparison, leaving the gold standard inadequate. Due to the fact that many of the synsets for the lemmas used in the study did not contain a paraphrase or contained a multi-word expression (which cannot be handled by MEPS), these senses were not presented to workers on AMT and were not included in the final gold standard. They are subsequently not used by MEPS either, since it is supplied with an input file of possible paraphrases, the paraphrases being the same as those on AMT. This resulted in what can be seen as holes in the gold standard. So while MEPS can only return a probability distribution over paraphrases which have been used on AMT, a traditional WSD system can return a sense tag referring to a synset not containing a paraphrase. Since this sense is not represented in the gold standard, the system is penalized even if the correct sense has been assigned. This gives MEPS an unfair advantage. To attempt to even out the playing field as well as to investigate how flawed the gold standard actually was, the decision was made to use a hypothetical WSD system which would represent the perfect system. This was done by using the SemCor sense tags for the sentences used in the study, and in a traditional study would be considered the upper bound. This is hereafter referred to as the hypothetical system.

An upper bound or ceiling can be set by determining how well humans can perform the given task and represents optimal performance [11]. A baseline can also be set and would represent the simplest approach to WSD, for example assigning a random or the most common sense [11].

5 Results
5.1 AMT compiled gold standard
The complete gold standard, as previously mentioned, was found to contain various flaws that made the proposed method of comparison of MEPS with a traditional WSD system difficult. The main problem is that not all WordNet synsets contain paraphrases. The second problem is that most often the most frequent sense itself did not contain paraphrases. Take, for example, the lemma paper. The most frequent WordNet sense of paper is "a material made of cellulose pulp derived mainly from wood or rags or certain grasses" 6. However, a suitable paraphrase does not exist for this sense, and thus the workers on AMT are only provided with the paraphrases composition, report, theme and newspaper. The absence of suitable paraphrases for certain senses resulted in these senses not being represented in the gold standard. The number of senses subsequently excluded from the gold standard is shown in table 4, as well as how often a lemma's most common sense was unusable in the compilation of the gold standard. On average, for the eight lemmas, only 52% of senses are presented to workers on AMT.

6 http://wordnetweb.princeton.edu/perl/webwn

For three sentences, AMT workers unanimously voted against all paraphrase candidates. In a preliminary study, the decision was made that support from a simple majority of workers (i.e. 5 or more) was necessary for a paraphrase to be accepted. This resulted in 34 contexts with an empty set of acceptable paraphrases. However, this number depends strongly on the vote threshold: if 4 votes are enough to accept a paraphrase, only 22 contexts have an empty set of acceptable paraphrases, but if 6 votes are necessary to accept a paraphrase, 99 contexts have no acceptable paraphrases. Precision and recall for all three variations with respect to the SemCor annotations are presented in table 6 on page 14.
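The effect of the vote threshold can be computed directly from the gold standard counts; below is a small illustrative Python sketch assuming the ratings are stored as one dictionary of yes-vote counts per sentence (the sample counts are taken from table 2).

def empty_contexts(ratings_per_sentence, threshold):
    # A context is "empty" if no candidate paraphrase reached the vote threshold.
    return sum(1 for ratings in ratings_per_sentence
               if all(votes < threshold for votes in ratings.values()))

ratings_per_sentence = [
    {"distinct": 4, "unlike": 2, "separate": 2, "dissimilar": 7},
    {"distinct": 2, "unlike": 2, "separate": 4, "dissimilar": 4},
]
for threshold in (4, 5, 6):
    print(threshold, empty_contexts(ratings_per_sentence, threshold))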

Table 4: For every lemma used, how many senses are represented in the gold standard and MEPS system

Lemma | POS | Total # of senses | # of senses used on AMT | First sense containing a paraphrase
Add | v | 6 | 3 | 2
Ask | v | 7 | 3 | 1
Win | v | 4 | 3 | 2
Argument | n | 7 | 4 | 1
Interest | n | 7 | 5 | 1
Paper | n | 7 | 3 | 2
Different | adj | 5 | 1 | 4
Important | adj | 5 | 3 | 2

It was found that annotator agreement on the gold standard was at 84%, meaning each decision to include or not include a paraphrase was, on average, supported by 84% of the (nine) AMT workers assigned to it. This was calculated by summing the size of the majority (between 5/9 and 1) for each paraphrase rated by the AMT workers, and dividing by the total number of paraphrases rated. This figure means that, on average, every decision had between 1 and 2 dissenters among the 9 AMT judges assigned to it.
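The agreement figure can be reproduced from the vote counts as in the following illustrative sketch, assuming nine workers per paraphrase; the sample ratings are one sentence from table 2.

def annotator_agreement(ratings_per_sentence, n_workers=9):
    # For every rated paraphrase, the size of the majority (yes or no votes,
    # whichever is larger) as a fraction of the nine workers, averaged overall.
    majority_sizes = [max(votes, n_workers - votes) / n_workers
                      for ratings in ratings_per_sentence
                      for votes in ratings.values()]
    return sum(majority_sizes) / len(majority_sizes)

sample = [{"distinct": 4, "unlike": 2, "separate": 2, "dissimilar": 7}]
print(round(annotator_agreement(sample), 2))   # 0.72 for this single sample sentence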

5.2 Precision and Recall of the SemCor human annotated sentences and most frequent sense baseline with respect to the AMT gold standard
As previously mentioned, various flaws were found with the gold standard that rendered a comparison of the two systems unfair. The decision was made to use the SemCor assigned sense tags for the comparison, representing a best possible, traditional WSD system. A baseline of the most frequent sense was also used. Since WordNet senses are generally ordered from the most frequent to least frequent sense (using frequency statistics gathered from SemCor), this is a common WSD evaluation baseline [11].
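For illustration, a most frequent sense baseline can be reproduced with NLTK's WordNet interface, since WordNet orders a lemma's senses roughly by SemCor frequency; this sketch is not necessarily the tooling used in the study.

# Requires the WordNet data: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def most_frequent_sense_paraphrases(lemma, pos):
    synsets = wn.synsets(lemma, pos=pos)
    if not synsets:
        return []
    # The first synset is (approximately) the most frequent sense; its other
    # lemmas are the baseline paraphrases (underscores mark multi-word entries).
    return [l.name() for l in synsets[0].lemmas() if l.name() != lemma]

print(most_frequent_sense_paraphrases("paper", wn.NOUN))      # [] -- top sense has no synonyms
print(most_frequent_sense_paraphrases("argument", wn.NOUN))   # sense 1 includes "statement"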

If a traditional WSD system returned a sense tag whose synset did not contain paraphrases, contained a multi-word expression, or both, this sense was not represented in the gold standard. If this was the case for a given sentence used in the study, it was instead assumed that the system (in this case the hypothetical, best possible system) had returned no sense tag. This was done due to the fact that AMT workers had not been presented with these alternatives. Since the human annotated sense tags from SemCor are being used rather than an actual WSD system, it will never be the case that a sense tag has not been assigned from the beginning.

This situation of an empty sense set was handled in two ways when calculating precision and recall.


Table 5: Precision and recall of AMT data compared to manually annotated SemCor sentences as well as baseline (most frequent WordNet sense). Precision & Recall 1: for each test instance, precision is calculated only if a sense was assigned in the SemCor data, and recall only if a paraphrase was assigned on AMT. Precision & Recall 2: for each test instance, if no sense was assigned in the SemCor data, precision = 0; if no paraphrases were assigned on AMT, recall = 0.

                                                   | Precision 1 | Precision 2 | Recall 1 | Recall 2
AMT data & hypothetical system (annotated SemCor)  | 0.53        | 0.25        | 0.32     | 0.27
AMT data & most frequent sense (baseline)          | 0.40        | 0.15        | 0.18     | 0.15

5.2.1 Precision
Precision, P, is calculated for each disambiguated word, D, by taking the intersection of the set of synonyms in the sense assigned by the annotators of SemCor, S, and the set of suitable paraphrases chosen by a majority of the AMT workers, T, and dividing by the size of S.

P = \frac{|S \cap T|}{|S|} \qquad (1)

The first method of calculation assumes that if S is empty, precision is left undefined for that test instance and is excluded from the average precision. Macro-averaged precision 7 for the hypothetical, best possible system was 53.26% and 40.00% for the baseline. This is referred to as Precision 1 in table 5.

A second way of calculating precision (referred to as Precision 2 in table 5) is to define the precision as zero for a given test instance when S is empty. This results in a lower precision, with the hypothetical system at 24.50% and the baseline at 15.00%.

Table 5 shows the two variations of precision, as well as recall, calculated for the hypothetical system and the baseline when using the AMT results as a gold standard.

Variations on the number of workers' votes deemed necessary for a paraphrase to be accepted resulted in different levels of precision and are presented in table 6.

5.2.2 Recall
Recall, R, is calculated for each D by taking the intersection of S and T, and dividing by the size of T.

R = \frac{|S \cap T|}{|T|} \qquad (2)

In the first method of recall calculation (referred to as Recall 1 in table 5), if T is empty, the recall of this test instance is left undefined and excluded from the overall recall. Macro-averaged recall 8 for the hypothetical system was 32.19% and 18.09% for the baseline.

As for the second case of calculating recall (referred to as Recall 2 in table 5), it is defined as zero when T is empty. This results in a lower recall, with the hypothetical system at 25.72% and the baseline at 15.01%.
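The following sketch (illustrative Python, not the study's evaluation script) implements equations (1) and (2) together with the two ways of handling empty sets described above; the example instances are hypothetical.

def precision(S, T):
    # Equation (1); None marks the undefined case (S empty).
    return len(S & T) / len(S) if S else None

def recall(S, T):
    # Equation (2); None marks the undefined case (T empty).
    return len(S & T) / len(T) if T else None

def macro_average(values, undefined_as_zero=False):
    if undefined_as_zero:
        values = [v if v is not None else 0.0 for v in values]   # variant 2
    else:
        values = [v for v in values if v is not None]            # variant 1
    return sum(values) / len(values) if values else 0.0

# Hypothetical instances as (S, T): S = synonyms of the SemCor-assigned sense,
# T = paraphrases accepted by a majority of AMT workers.
instances = [({"acquire", "gain"}, {"acquire"}), (set(), {"inquire"})]
precisions = [precision(S, T) for S, T in instances]
print(macro_average(precisions))                          # Precision 1 style: 0.5
print(macro_average(precisions, undefined_as_zero=True))  # Precision 2 style: 0.25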

Variations on the number of workers' votes deemed necessary for a paraphrase to be accepted resulted in different levels of recall and are presented in table 6.

7 The macro-averaged precision is obtained by computing precision for each sentence, and then averaging over the precision values.
8 The macro-averaged recall is obtained by computing recall for each sentence, and then averaging over the recall values.


Table 6: Precision and recall of the hypothetical system with respect to the gold standard using different cut-offs for worker votes. Precision 1 & 2 and Recall 1 & 2 are the same as in table 5.

# of workers Precision 1 Precision 2 Recall 1 Recall 2

4 or more 0.66 0.30 0.29 0.26

5 or more 0.53 0.25 0.32 0.27

6 or more 0.26 0.12 0.35 0.18

Table 7: MEPS precision and recall, overall and for the different parts of speech used.

All Verb Noun Adjective

Precision 68.26 60.80 73.02 91.4

Recall 80.4 90.98 82.41 51.51

5.3 MEPS precision and recall with respect to the AMT gold standard
Overall precision and recall for MEPS when evaluated on the AMT gold standard were 68.26% and 80.4% respectively. In terms of precision, adjectives fared best at 91.4%; however, recall was low at 51.51%. On the other hand, verbs achieved high recall at 90.98%, but low precision at 60.80%. Precision and recall for nouns were 73.02% and 82.41% respectively. These results are presented in table 7. An additional method of evaluation, weighted accuracy, is presented in section 5.4.

However, due to the fact that MEPS is presented with all possible paraphrases for a given word, and these paraphrases were the same as those in the gold standard, which was flawed, these figures can be misleading. This is reflected on further in the discussion section.

5.4 Weighted accuracy
An additional method of evaluation called weighted accuracy (wAcc) is proposed in [16], in which both the weights for the gold paraphrases and the weights for the model output are compared. A = ⟨a1, ..., an⟩ is the probability distribution over the n paraphrases according to the gold standard, and B = ⟨b1, ..., bn⟩ is the corresponding probability distribution according to the model being evaluated. It is defined as follows:

\mathrm{wAcc}(A,B) = \sum_{i=1}^{n} \min(a_i, b_i) \qquad (3)

If an unweighted set of paraphrases is given, the probability mass is divided uniformly over these paraphrases. In this manner the weighted accuracy measure can be used to compare a graded sense distribution to a set of unweighted paraphrases, such as those obtained from AMT.
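A small illustrative sketch of equation (3) follows, including the uniform spreading of probability mass over an unweighted paraphrase set; the example distributions are hypothetical.

def uniform_distribution(paraphrases):
    # Turn an unweighted paraphrase set (e.g. the AMT gold standard) into a
    # uniform probability distribution, as described above.
    return {p: 1.0 / len(paraphrases) for p in paraphrases} if paraphrases else {}

def weighted_accuracy(gold, model):
    # Equation (3): sum of the minimum of gold and model weight per paraphrase.
    return sum(min(gold.get(p, 0.0), model.get(p, 0.0))
               for p in set(gold) | set(model))

gold = uniform_distribution({"acquire", "gain"})    # gold paraphrases from AMT
model = {"acquire": 0.8, "advance": 0.2}            # e.g. a MEPS-style output
print(weighted_accuracy(gold, model))               # min(0.5, 0.8) = 0.5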

Two variations of calculating wAcc are used. In the first, if either set of paraphrases is empty the weighted accuracy is left undefined and the item is not included in the overall weighted accuracy. This results in a wAcc of 51.2%.

In the second, the weighted accuracy is defined as zero if either set of paraphrases is empty. This results in a lower wAcc of 23.9%.

Weighted accuracy for MEPS on the AMT data was 34.43%.


Table 8: Weighted accuracy (wAcc) compared to the upper bound (manually annotated SemCor sentences) and baseline (most frequent WordNet sense). If no paraphrase is chosen for either set of paraphrases, wAcc 1: the case is excluded from the final wAcc average; wAcc 2: a zero is assigned for the given case.

                                            | wAcc 1 | wAcc 2
AMT data & annotated SemCor (upper bound)   | 0.51   | 0.24
AMT data & most frequent sense (baseline)   | 0.39   | 0.13

6 Discussion
6.1 The skewed gold standard and its effects
A traditional WSD system generally assigns disambiguated words a WordNet sense, and it was therefore important to use paraphrases from WordNet when compiling the gold standard using AMT. The theory that a word in a certain context can have more than one appropriate sense was at the basis of the study, and one of the aims of the study was to use AMT to obtain paraphrase ratings as well as to perform an evaluation of MEPS. Therefore the decision was made to make the comparison at the paraphrase level rather than the sense level. However, due to the fact that, for the eight lemmas used in the study combined, only 52% of the senses were actually represented on AMT, the gold standard subsequently became skewed. This will be referred to as insufficient sense coverage.

As previously mentioned, MEPS uses input consisting of all of the possible paraphrases for the lemmas which are to be disambiguated. While there is nothing wrong with this, since traditional WSD systems are generally provided with access to WordNet to perform disambiguation, it did lead to a problem when trying to make a comparison between the two types of systems. This is due to the fact that MEPS is only provided with the paraphrases rated by AMT workers and subsequently used in the gold standard. However, due to the insufficient sense coverage of the gold standard, a traditional system can return a correct sense not represented by the gold standard, lowering its accuracy rates. Ultimately, MEPS seems to benefit from the skewed gold standard while the other type of system is essentially punished.

It is also important to note that while AMT workers unanimously voted against all paraphrase candidates for only three sentences, had workers been informed that it was also their option to choose no paraphrases at all, this number could have been much higher. This is noticeable with a slight variance in the voter threshold.

In order to evaluate the discrepancies in the gold standard, rather than using actual output from a traditional WSD system, SemCor sense tags were used instead. Various methods of calculating precision and recall were performed.

The best level of recall, at 35%, which itself is very low, was only achievable for Recall 1 (see table 6 on the preceding page) with a high voter cut-off (at least 6 votes) needed to deem a paraphrase suitable. However, in this case 99 sentences were essentially discarded from the evaluation since agreement needed to be very high. On the other hand, with Recall 2 for the same voter cut-off, recall was set to zero for each of the 99 sentences which had previously been discarded, ultimately bringing recall to the lowest level of 18%.

At best, precision was 66%, using a cut-off of four or more workers needed to deem a paraphrase acceptable with Precision 1 (see table 6 on the previous page). As previously stated, state-of-the-art WSD systems have performed at accuracies of 69% (the best performing system at SENSEVAL-2 on an English all words task), so while a precision of 66% can seem reasonable, it must be clearly stated again that this is the result of using the human annotated sense tags from the SemCor corpus, and NOT actual output from a traditional WSD system. While slight discrepancies could have arisen due to AMT worker disagreement or even very occasional errors in the annotation of SemCor, it is highly unlikely that these would have such a drastic effect on the precision. It is important to note that in the Senseval-3 English lexical sample task [13], it was found that the most frequent WordNet sense baseline was 55.2%. Of the many different ways to calculate precision that are presented in tables 5 and 6, only one beats this most frequent sense baseline, at 66%. This clearly indicates that the gold standard is unacceptable when used in a comparison of MEPS with any traditional WSD system.

Given the very low precision and recall obtained when comparing human-annotated sense assignments to the gold standard, it can safely be concluded that the current method of evaluation is insufficient: any traditional system is at a great disadvantage when compared to MEPS. However, the chosen method of comparison ultimately uncovered a previously unnoticed problem which MEPS cannot handle. MEPS can only return a probability distribution over potential paraphrases for a word in a given context, even when no appropriate paraphrase exists for that instance. So, while MEPS may be state-of-the-art when it comes to assigning appropriate paraphrases as a way of performing disambiguation, an alternative approach is needed to handle situations of this type.

6.2 Variations on the number of workers needed to deem a paraphrase acceptable

Originally it was assumed that a simple majority vote on a given paraphrase for a given lemma would suffice for deeming the paraphrase appropriate. However, varying this threshold produced quite large differences in precision and slight variations in recall (see table 6 on page 14). A difference of 40% in precision can be observed when using four or more versus six or more workers as the voter threshold. However, this also produced a large gap in the number of sentences that ended up without any suitable paraphrase: with a cut-off of four or more workers, only 22 sentences had an empty set of acceptable paraphrases, whereas with a cut-off of six or more, 99 sentences had an empty set.

This cut-off could be varied further if the study were to be attempted again. Rather than using binary distinctions, it could be more useful to assign weights to the paraphrases based on the number of AMT workers who chose them as suitable, as in the sketch below.
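A minimal sketch of such a weighting scheme follows; it is not part of the study, and the vote counts, names and the overlap measure are illustrative assumptions only. Each paraphrase's weight is its share of the yes-votes, and a system's paraphrase distribution is scored by the probability mass it shares with the weighted gold standard (graded measures such as those used in [7] could be substituted).

    from typing import Dict

    def vote_weights(votes: Dict[str, int]) -> Dict[str, float]:
        # Turn raw AMT yes-vote counts into weights that sum to one.
        total = sum(votes.values())
        return {p: n / total for p, n in votes.items()} if total else {}

    def weighted_overlap(system: Dict[str, float], gold: Dict[str, float]) -> float:
        # Shared probability mass between a system distribution and the weighted
        # gold standard; 1.0 means the two distributions agree completely.
        return sum(min(system.get(p, 0.0), w) for p, w in gold.items())

    # Hypothetical sentence with yes-vote counts for three candidate paraphrases.
    gold = vote_weights({"crucial": 5, "significant": 7, "authoritative": 2})
    system = {"significant": 0.6, "crucial": 0.3, "vital": 0.1}
    print(round(weighted_overlap(system, gold), 2))  # 0.8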

6.3 Reliability and validity

The insufficient sense coverage affects the validity of the AMT results and leads to the gold standard's discrepancies. Reliability, however, is rather high: annotator agreement was 84%, compared to the 64% agreement that random voting would have produced. Furthermore, control questions were used on AMT to ensure, first, that humans were providing the annotations and, second, that their work would be reliable.
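These two figures can also be combined into a single chance-corrected agreement score. The study itself does not report such a coefficient; the sketch below only illustrates how the observed agreement (84%) and the agreement expected under random voting (64%) relate in a kappa-style calculation.

    def chance_corrected_agreement(observed: float, expected: float) -> float:
        # Kappa-style coefficient: the improvement over chance agreement,
        # relative to the maximum possible improvement over chance.
        return (observed - expected) / (1.0 - expected)

    print(round(chance_corrected_agreement(0.84, 0.64), 2))  # 0.56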

6.4 Future research

The options for resolving, or at least mitigating, the discrepancies in the evaluated method of comparison seem limited. This is mainly because a great many WordNet senses do not include paraphrases. However, if a set of lemmas could be found in WordNet for which all senses could be completely represented on AMT, the comparison should be successful.

It is important to note that the creators of MEPS intend to extend their model to handle multi-word paraphrases. The fact remains, though, that for a great many words the most common senses simply are not paraphrasable. While the creators of MEPS argue against assigning discrete senses to words in context, as of now, using only paraphrases is also an insufficient way of performing disambiguation. Perhaps a hybrid of the two can prove more successful.


7 Conclusion

The main objective of this study was to investigate if and how a comparison could be made between two different kinds of WSD systems: one using discrete word senses (referred to as a traditional system throughout the paper), and one using probability distributions over paraphrases (referred to as MEPS). In order to do this, a gold standard was compiled from data collected through Amazon Mechanical Turk. The second objective of the study was thus to evaluate the AMT data both in and of itself and in the context of the study's first objective, the comparison of the two types of systems. A third objective was to evaluate MEPS itself using the AMT data.

While the main objective of the study, comparing the two types of WSD systems using an AMT-compiled gold standard, in the end proved unsuccessful, a greater understanding of what it would take to make a proper comparison was achieved.

The second objective was to evaluate the AMT data. Annotator agreement on the gold standard was 84%, showing the results to be reliable. However, as a gold standard for the comparison of the two systems the data proved quite inadequate, resulting in low validity. This was due to what has been referred to as insufficient sense coverage.

The last goal of the study was to perform an evaluation of MEPS using the AMT data. While MEPS is state-of-the-art as a paraphrasing system, it cannot handle situations in which a proper paraphrase does not exist for a given word; this discovery was made in the course of the study.

Appendices

Mechanical Turk Results

The Mechanical Turk results are presented below. The results for each sentence presented to the workers are shown and are organized as follows: first the WordNet sense tag assigned to the given lemma in the SemCor corpus is shown, then the lemma itself; within brackets are the paraphrases chosen by workers, and the corresponding number is the number of workers who felt the paraphrase was an appropriate substitute for the original lemma. Only paraphrases with a yes-rating of one or more are presented. While the sense tag is not relevant to the gold standard itself, it is included here as a way of organizing the data.
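As a purely illustrative aid, and not part of the original study, an entry in the listing below can be read back into a structured record with a few lines of Python; the regular expressions simply mirror the line format described above, and the function name is hypothetical.

    import re

    # One appendix line, e.g.: 1 important [(authoritative, 1), (crucial, 5), (significant, 7)]
    ENTRY = re.compile(r"^(\d+)\s+(\S+)\s+\[(.*)\]$")
    PAIR = re.compile(r"\((.+?),\s*(\d+)\)")

    def parse_entry(line: str):
        # Split a line into (SemCor sense tag, lemma, {paraphrase: yes-votes}).
        match = ENTRY.match(line.strip())
        if match is None:
            raise ValueError("unrecognized line: " + line)
        sense, lemma, body = match.groups()
        return int(sense), lemma, {p: int(n) for p, n in PAIR.findall(body)}

    print(parse_entry("1 important [(authoritative, 1), (crucial, 5), (significant, 7)]"))
    # (1, 'important', {'authoritative': 1, 'crucial': 5, 'significant': 7})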

1 important [(authoritative, 1), (crucial, 5), (significant, 7)]
1 important [(crucial, 3), (significant, 9)]
1 important [(authoritative, 2), (crucial, 5), (significant, 7)]
1 important [(authoritative, 2), (crucial, 5), (significant, 8)]
1 important [(crucial, 4), (significant, 8)]
1 important [(authoritative, 3), (crucial, 8), (significant, 9)]
1 important [(crucial, 9), (significant, 8)]
1 important [(crucial, 8), (significant, 7)]
1 important [(authoritative, 2), (crucial, 5), (significant, 8)]
1 important [(crucial, 7), (significant, 9)]
1 important [(authoritative, 2), (crucial, 6), (significant, 8)]
1 important [(authoritative, 3), (crucial, 6), (significant, 6)]
1 important [(authoritative, 1), (crucial, 6), (significant, 9)]
1 important [(authoritative, 1), (crucial, 8), (significant, 7)]
1 important [(crucial, 9), (significant, 5)]
1 important [(crucial, 7), (significant, 8)]
1 important [(authoritative, 5), (crucial, 2), (significant, 8)]
1 important [(crucial, 5), (significant, 9)]
1 important [(authoritative, 2), (crucial, 5), (significant, 8)]
1 important [(authoritative, 2), (crucial, 7), (significant, 9)]


1 important [(crucial, 8), (significant, 8)]
1 important [(authoritative, 1), (crucial, 8), (significant, 8)]
1 important [(authoritative, 2), (crucial, 5), (significant, 9)]
1 important [(authoritative, 1), (crucial, 5), (significant, 9)]
1 important [(authoritative, 2), (crucial, 5), (significant, 9)]
1 win [(acquire, 5), (gain, 7), (succeed, 1)]
1 win [(advance, 1), (acquire, 1), (succeed, 8)]
1 win [(acquire, 8), (gain, 7), (succeed, 1)]
1 win [(advance, 1), (acquire, 2), (gain, 5), (succeed, 2)]
1 win [(advance, 1), (acquire, 1), (gain, 1), (succeed, 2)]
1 win [(advance, 2), (acquire, 1), (gain, 1), (succeed, 4)]
1 win [(advance, 1), (gain, 1), (succeed, 5)]
1 win [(advance, 7), (acquire, 1), (gain, 2), (succeed, 9)]
1 win [(advance, 2), (acquire, 2), (gain, 3), (succeed, 4)]
1 win [(advance, 1), (gain, 1), (succeed, 1)]
1 win [(advance, 3), (acquire, 1), (gain, 2), (succeed, 7)]
1 win [(advance, 1), (acquire, 2), (gain, 4), (succeed, 4)]
1 win [(advance, 1), (gain, 1), (succeed, 9)]
1 win [(succeed, 3)]
1 win [(gain, 1), (succeed, 1)]
1 win [(succeed, 5)]
1 win [(acquire, 4), (gain, 6)]
1 win [(advance, 1), (acquire, 4), (gain, 6), (succeed, 1)]
1 win [(acquire, 5), (gain, 4)]
2 win [(acquire, 8), (gain, 8)]
2 win [(advance, 1), (acquire, 7), (gain, 6), (succeed, 1)]
2 win [(advance, 1), (acquire, 8), (gain, 7), (succeed, 1)]
2 win [(acquire, 7), (gain, 7)]
2 win [(advance, 1), (acquire, 5), (gain, 7)]
2 win [(advance, 1), (acquire, 6), (gain, 9)]
1 ask [(inquire, 6), (enquire, 4), (demand, 3), (postulate, 4)]
1 ask [(inquire, 6), (enquire, 6), (postulate, 1)]
1 ask [(inquire, 8), (enquire, 7), (expect, 1), (demand, 3), (postulate, 2)]
1 ask [(inquire, 6), (enquire, 4), (demand, 2), (postulate, 1)]
1 ask [(inquire, 7), (enquire, 7)]
1 ask [(inquire, 5), (enquire, 2), (postulate, 1)]
1 ask [(inquire, 9), (require, 1), (enquire, 4), (expect, 1), (demand, 1), (postulate, 4), (take, 1)]
1 ask [(inquire, 9), (enquire, 4), (postulate, 1)]
1 ask [(inquire, 7), (enquire, 4)]
1 ask [(inquire, 9), (enquire, 4), (demand, 3), (postulate, 3)]
1 ask [(inquire, 9), (enquire, 3), (demand, 1), (need, 1), (postulate, 2)]
1 ask [(inquire, 9), (enquire, 3), (demand, 2), (postulate, 7)]
1 ask [(inquire, 7), (enquire, 5), (demand, 3), (postulate, 1)]
1 ask [(inquire, 8), (enquire, 4)]
1 ask [(inquire, 7), (require, 1), (necessitate, 1), (enquire, 6), (demand, 5), (postulate, 6)]
2 ask [(inquire, 3), (enquire, 3), (expect, 3)]
2 ask [(inquire, 3), (require, 6), (necessitate, 1), (involve, 1), (enquire, 2), (demand, 5), (need, 5), (expect, 4)]
2 ask [(inquire, 3), (require, 2), (necessitate, 1), (enquire, 2), (expect, 1), (demand, 2), (need, 2), (postulate, 1)]
2 ask [(inquire, 6), (enquire, 4), (demand, 2), (need, 1)]


2 ask [(inquire, 1), (require, 1), (demand, 1), (postulate, 1), (expect, 2)]
2 ask [(inquire, 5), (require, 3), (enquire, 1), (expect, 1), (demand, 3), (need, 2), (postulate, 1)]
2 ask [(inquire, 6), (require, 3), (necessitate, 1), (enquire, 4), (expect, 2), (demand, 4), (need, 2), (postulate, 2)]
2 ask [(inquire, 4), (require, 4), (necessitate, 1), (enquire, 2), (demand, 2), (need, 4), (expect, 1)]
2 ask [(inquire, 5), (enquire, 3), (postulate, 3)]
2 ask [(require, 9), (necessitate, 4), (demand, 8), (need, 7), (expect, 8)]
2 argument [(argumentation, 4), (contestation, 6), (contention, 7), (controversy, 7), (disputation, 5), (debate, 9), (disceptation, 2)]
2 argument [(argumentation, 4), (contestation, 7), (contention, 8), (controversy, 6), (disputation, 7), (debate, 8), (disceptation, 4)]
2 argument [(argumentation, 5), (contestation, 4), (contention, 6), (tilt, 1), (controversy, 6), (disputation, 4), (debate, 5), (disceptation, 3)]
2 argument [(argumentation, 3), (contestation, 4), (contention, 5), (controversy, 4), (disputation, 4), (debate, 6), (disceptation, 1)]
2 argument [(argumentation, 4), (contestation, 4), (contention, 2), (controversy, 2), (disputation, 5), (debate, 9), (disceptation, 1)]
2 argument [(argumentation, 2), (contestation, 4), (contention, 2), (controversy, 2), (disputation, 4), (debate, 9), (disceptation, 1)]
2 argument [(argumentation, 3), (contestation, 2), (contention, 3), (controversy, 3), (statement, 1), (disputation, 2), (parameter, 1), (debate, 5), (disceptation, 1)]
2 argument [(argumentation, 3), (contestation, 2), (contention, 3), (tilt, 1), (controversy, 3), (disputation, 5), (debate, 7), (disceptation, 1)]
2 argument [(argumentation, 7), (contestation, 6), (contention, 6), (controversy, 5), (disputation, 5), (debate, 4), (disceptation, 2)]
3 argument [(argumentation, 4), (contestation, 4), (contention, 2), (controversy, 2), (disputation, 4), (debate, 8), (disceptation, 2)]
3 argument [(argumentation, 5), (contestation, 3), (contention, 5), (tilt, 1), (controversy, 2), (statement, 3), (disputation, 3), (line, 2), (parameter, 2), (debate, 6), (disceptation, 2)]
4 argument [(argumentation, 5), (contestation, 5), (contention, 4), (controversy, 8), (disputation, 7), (debate, 9), (disceptation, 3)]
1 paper [(report, 5), (newspaper, 8), (composition, 3)]
1 paper [(report, 5), (theme, 1), (composition, 1)]
1 paper [(newspaper, 5), (report, 6), (composition, 4)]
1 paper [(report, 1), (newspaper, 1), (composition, 1)]
1 paper [(newspaper, 2), (composition, 1)]
1 paper [(report, 1), (newspaper, 2), (composition, 1)]
1 paper [(composition, 2)]
1 paper [(composition, 1)]
2 paper [(newspaper, 1), (report, 9), (composition, 3)]
2 paper [(report, 7), (theme, 1), (composition, 6)]
2 paper [(newspaper, 3), (report, 9), (theme, 1), (composition, 3)]
2 paper [(newspaper, 1), (report, 9), (theme, 1), (composition, 6)]
2 paper [(report, 8), (theme, 1), (composition, 6)]
2 paper [(report, 9), (newspaper, 2), (composition, 6)]
2 paper [(report, 9), (theme, 1), (composition, 5)]
2 paper [(newspaper, 1), (report, 8), (composition, 7)]
3 paper [(report, 1), (newspaper, 9)]
3 paper [(newspaper, 9), (report, 3), (composition, 1)]
3 paper [(report, 1), (newspaper, 8), (composition, 2)]
3 paper [(newspaper, 6), (report, 2), (composition, 2)]


6 paper [(newspaper, 9)]
6 paper [(newspaper, 6), (report, 5), (theme, 1)]
1 add [(supply, 3), (lend, 2), (contribute, 5), (bring, 3), (bestow, 2), (impart, 1), (total, 1), (append, 4)]
1 add [(supply, 1), (contribute, 4), (tot, 1), (bring, 1), (bestow, 2), (impart, 2), (append, 6), (summate, 1)]
1 add [(supply, 4), (contribute, 6), (bring, 1), (bestow, 2), (impart, 2)]
1 add [(supply, 4), (contribute, 5), (bring, 3), (append, 2)]
1 add [(supply, 5), (lend, 1), (sum, 1), (contribute, 6), (tot, 1), (bring, 1), (bestow, 3), (impart, 2), (total, 1), (append, 3), (summate, 1)]
1 add [(supply, 3), (sum, 1), (contribute, 2), (bestow, 4), (impart, 2), (total, 1), (append, 2), (summate, 2)]
1 add [(supply, 6), (lend, 2), (contribute, 5), (bring, 1), (bestow, 4), (impart, 4), (append, 3)]
1 add [(supply, 2), (contribute, 5), (bring, 2), (bestow, 3), (tot, 2), (impart, 4), (append, 3), (summate, 1)]
1 add [(supply, 1), (sum, 1), (contribute, 6), (bring, 1), (bestow, 3), (lend, 4), (impart, 3), (append, 2), (summate, 1)]
1 add [(supply, 9), (contribute, 8), (bestow, 6), (lend, 1), (impart, 2), (append, 3)]
1 add [(supply, 8), (sum, 1), (contribute, 7), (bring, 1), (bestow, 3), (impart, 5), (total, 1), (append, 5), (summate, 1)]
1 add [(supply, 4), (lend, 4), (sum, 1), (contribute, 7), (bring, 3), (bestow, 2), (impart, 3), (total, 1), (append, 6), (summate, 1)]
1 add [(supply, 5), (contribute, 4), (bring, 1), (bestow, 2), (lend, 1), (impart, 2), (append, 2)]
2 add [(lend, 1), (sum, 1), (contribute, 3), (tot, 2), (impart, 6), (append, 3), (summate, 1)]
2 add [(supply, 3), (contribute, 2), (impart, 4), (append, 5), (summate, 1)]
2 add [(supply, 5), (contribute, 3), (bestow, 2), (impart, 5), (append, 4), (summate, 1)]
2 add [(supply, 3), (lend, 1), (sum, 1), (contribute, 4), (bring, 2), (bestow, 1), (impart, 4), (append, 1), (summate, 2)]
2 add [(supply, 5), (lend, 4), (sum, 3), (contribute, 6), (bestow, 1), (impart, 3), (append, 4), (summate, 4)]
2 add [(supply, 5), (sum, 1), (contribute, 6), (bring, 1), (bestow, 4), (lend, 1), (impart, 9), (total, 1), (append, 6), (summate, 2)]
2 add [(supply, 3), (sum, 1), (contribute, 4), (bestow, 1), (impart, 6), (append, 3), (summate, 2)]
2 add [(supply, 4), (lend, 1), (contribute, 5), (bestow, 4), (impart, 7), (append, 3), (summate, 1)]
2 add [(supply, 1), (contribute, 2), (bestow, 1), (impart, 3), (append, 3)]
2 add [(supply, 2), (contribute, 4), (bestow, 1), (impart, 3), (total, 1), (append, 4)]
3 add [(supply, 7), (contribute, 7), (bring, 7), (bestow, 5), (tot, 1), (lend, 5), (impart, 8), (append, 3)]
3 add [(supply, 4), (lend, 6), (contribute, 8), (bring, 3), (bestow, 2), (impart, 4), (append, 1)]
1 different [(distinct, 3), (dissimilar, 5), (unlike, 2), (separate, 5)]
1 different [(distinct, 4), (dissimilar, 7), (unlike, 2), (separate, 2)]
1 different [(distinct, 4), (dissimilar, 5), (unlike, 3), (separate, 4)]
1 different [(distinct, 6), (dissimilar, 5), (unlike, 3), (separate, 6)]
1 different [(distinct, 4), (dissimilar, 7), (unlike, 4), (separate, 5)]
1 different [(distinct, 7), (dissimilar, 8), (unlike, 4), (separate, 7)]
1 different [(distinct, 3), (dissimilar, 7), (unlike, 3), (separate, 2)]
1 different [(distinct, 6), (dissimilar, 8), (unlike, 6), (separate, 7)]
1 different [(distinct, 4), (dissimilar, 6), (unlike, 2), (separate, 6)]
1 different [(distinct, 3), (dissimilar, 6), (unlike, 3), (separate, 8)]
1 different [(distinct, 6), (dissimilar, 6), (unlike, 1), (separate, 4)]
1 different [(distinct, 3), (dissimilar, 6), (unlike, 1), (separate, 2)]
1 different [(distinct, 6), (dissimilar, 3), (unlike, 3), (separate, 7)]
1 different [(distinct, 6), (dissimilar, 5), (unlike, 1), (separate, 5)]


1 different [(distinct, 6), (dissimilar, 7), (unlike, 2), (separate, 7)]
1 different [(distinct, 6), (dissimilar, 9), (unlike, 5)]
1 different [(distinct, 9), (dissimilar, 9), (unlike, 8), (separate, 3)]
1 different [(distinct, 4), (dissimilar, 9), (unlike, 3), (separate, 5)]
1 different [(distinct, 4), (dissimilar, 9), (unlike, 3), (separate, 3)]
2 different [(distinct, 4), (dissimilar, 3), (unlike, 3), (separate, 7)]
2 different [(distinct, 2), (dissimilar, 7), (unlike, 2), (separate, 4)]
2 different [(distinct, 7), (dissimilar, 8), (unlike, 3), (separate, 5)]
2 different [(distinct, 6), (dissimilar, 6), (unlike, 2), (separate, 2)]
2 different [(distinct, 3), (dissimilar, 6), (unlike, 3), (separate, 2)]
2 different [(distinct, 2), (dissimilar, 4), (unlike, 2), (separate, 4)]
1 interest [(stake, 3), (interestingness, 2), (involvement, 8), (pursuit, 2)]
1 interest [(sake, 1), (stake, 4), (interestingness, 1), (involvement, 5), (pursuit, 2)]
1 interest [(interestingness, 3), (involvement, 1), (pursuit, 2), (pasttime, 5)]
1 interest [(sake, 1), (stake, 5), (involvement, 5), (pasttime, 1)]
1 interest [(interestingness, 2), (pursuit, 1)]
1 interest [(stake, 1), (interestingness, 1), (involvement, 4), (pursuit, 4), (pasttime, 2)]
1 interest [(stake, 3), (involvement, 7), (pursuit, 3), (pasttime, 2)]
1 interest [(stake, 2), (interestingness, 3), (involvement, 5), (pursuit, 2)]
1 interest [(sake, 1), (interestingness, 4), (involvement, 5), (pursuit, 1), (pasttime, 1)]
1 interest [(sake, 1), (stake, 1), (interestingness, 2), (involvement, 7), (pursuit, 5)]
1 interest [(stake, 1), (interestingness, 4), (involvement, 4), (pursuit, 1), (pasttime, 2)]
1 interest [(involvement, 5), (pursuit, 4), (pasttime, 4)]
2 interest [(interestingness, 6), (involvement, 1), (pursuit, 1)]
2 interest [(sake, 3), (stake, 1), (involvement, 2), (pursuit, 2), (pasttime, 1)]
2 interest [(stake, 1), (interestingness, 3), (involvement, 5), (pursuit, 4), (pasttime, 2)]
2 interest [(stake, 2), (interestingness, 3), (involvement, 5), (pursuit, 1), (pasttime, 3)]
3 interest [(sake, 4), (stake, 4), (interestingness, 1), (involvement, 1), (pursuit, 2)]
3 interest [(stake, 1), (involvement, 3), (pursuit, 5)]
3 interest [(sake, 1), (stake, 3), (interestingness, 2), (involvement, 2), (pursuit, 1)]
4 interest [(sake, 1), (stake, 2), (interestingness, 1), (involvement, 1), (pasttime, 1)]
4 interest [(stake, 1), (interestingness, 1), (involvement, 1)]
4 interest [(stake, 2), (interestingness, 1), (pursuit, 1)]
5 interest [(sake, 2), (stake, 2), (interestingness, 2), (involvement, 4), (pursuit, 6), (pasttime, 5)]
6 interest [(sake, 1), (stake, 6), (interestingness, 1), (involvement, 5), (pursuit, 2)]
7 interest [(stake, 2), (involvement, 5), (pursuit, 5)]


References

[1] Senseval-2 official results. http://86.188.143.199/senseval2/Results/all_graphs.htm, September 2001. Viewed on August 7, 2011.

[2] WordNet statistics. http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html, August 2011. Viewed on August 23, 2011.

[3] Marine Carpuat and Dekai Wu. Improving statistical machine translation using word sense disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 61–72, Prague, 2007. Association for Computational Linguistics.

[4] Yee Seng Chan, Hwee Tou Ng, and David Chiang. Word sense disambiguation improves statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 33–40, Czech Republic, 2007. Association for Computational Linguistics.

[5] Stephen Clark and James R. Curran. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33:493–552, 2007.

[6] Katrin Erk and Diana McCarthy. Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 440–449, Singapore, 2009. Association for Computational Linguistics and Asian Federation of Natural Language Processing.

[7] Katrin Erk, Diana McCarthy, and Nicholas Gaylord. Investigations on word senses and word usages. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 10–18, Singapore, 2009. Association for Computational Linguistics and Asian Federation of Natural Language Processing.

[8] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.

[9] James A. Hampton. Typicality, graded membership, and vagueness. Cognitive Science, 31:355–383, 2007.

[10] Nancy Ide and Jean Véronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24:1–40, 1998.

[11] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson Prentice Hall, New Jersey, 2009.

[12] Suresh Manandhar, Ioannis P. Klapaftis, Dmitry Dligach, and Sameer S. Pradhan. SemEval-2010 task 14: Word sense induction and disambiguation. In Proceedings of the Fifth International Workshop on Semantic Evaluation, pages 63–68, Uppsala, Sweden, 2010. Association for Computational Linguistics.

[13] Rada Mihalcea, Timothy Chklovski, and Adam Kilgarriff. The Senseval-3 English lexical sample task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 25–28, Barcelona, Spain, 2004. Association for Computational Linguistics.

[14] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38:39–41, 1995.

[15] George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. A semantic concordance. In Proceedings of the Workshop on Human Language Technology, pages 303–308, Princeton, NJ, 1993. Association for Computational Linguistics.

[16] Taesun Moon and Katrin Erk. An inference-based model of word meaning in context as a paraphrase distribution. Submitted for publication, 2011.

[17] Gregory Murphy. The Big Book of Concepts. MIT Press, Cumberland, RI, 2002.

[18] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41:1–36, 2009.

[19] Martha Palmer, Hoa Trang Dang, and Christiane Fellbaum. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13:137–163, 2006.

[20] Hinrich Schütze and Jan Pedersen. Information retrieval based on word senses. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas, NV, 1995. Information Science Research Institute, University of Nevada, Las Vegas.

[21] Ravi Sinha and Rada Mihalcea. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In Proceedings of the IEEE International Conference on Semantic Computing (ICSC 2007), pages 363–369, Irvine, CA, 2007. IEEE Computer Society.

[22] Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254–263, Honolulu, HI, 2008. Association for Computational Linguistics.

[23] Piek Vossen, David Farwell, German Rigau, Iñaki Alegria, Eneko Agirre, and Manuel Fuentes. Meaningful results for information retrieval in the MEANING project. In Proceedings of the Third Global WordNet Conference, pages 22–26, Jeju Island, Korea, 2006. Association for Computational Linguistics.
