
Bar Ilan University

The Department of

Computer Science

Direct Word Sense Matching for Lexical Substitution

by

Efrat Hershkovitz-Marmorshtein

Submitted in partial fulfillment of the requirements for the Master's Degree

in the Department of Computer Science, Bar Ilan University.

Ramat Gan, Israel October 2006, Tishrei 5767


This work was carried out under the supervision of Dr. Ido Dagan,

Department of Computer Science,

Bar-Ilan University.


Acknowledgements

This thesis was completed with the help of a number of people, and I wish to express my heartfelt thanks to them.

I am grateful to Dr. Ido Dagan at Bar-Ilan University for his supervision of the thesis

and for his tutoring on how to observe, analyze and scientifically formalize. It has been a

great pleasure to work with him on this subject and I have learned a lot.

I would like to thank Oren Glickman at Bar-Ilan University for his guidance at the beginning of this work, and for his advice, both professional and technical, throughout.

I would also like to thank our Italian colleagues, Alfio Gliozzo and Carlo Strapparava

from ITC-Irst with whom I enjoyed a fruitful collaboration. I wish to thank them for their

assistance in implementing some of the methods, and for their helpful suggestions based

on their professional experience.

I would like to thank Hami Margliyot for her help in professional matters and in

wording the paper.

I would like to thank our NLP group for their mutual support and helpful comments.

Finally, I would like to thank my family, my husband Yair and our daughter Hadar Tova, for their understanding, support and encouragement.


Table of Contents

Table of Contents
List of tables and figures
Abstract
1. Introduction
2. Background
   2.1 Word Senses and Lexical Ambiguity
      2.1.1 WordNet – a word sense database
      2.1.2 Lexical ambiguity and Senseval
   2.2 WSD and Lexical Expansion
   2.3 Textual Entailment
   2.4 Classification Algorithms
      2.4.1 Binary classification using SVM (Support Vector Machine)
         2.4.1.1 One-class SVM
      2.4.2 The kNN (k-nearest neighbor) classification algorithm
3. Problem Setting and Dataset
4. Investigated Methods
   4.1 Feature set and classifier
   4.2 Supervised Methods
      4.2.1 Indirect approach
      4.2.2 Direct approach
   4.3 Unsupervised Methods
      4.3.1 Indirect approach
      4.3.2 Direct approaches
         4.3.2.1 Direct approach: one-class SVM
         4.3.2.2 Direct approach: KNN-based ranking
5. Evaluation
   5.1 Evaluation measures
      5.1.1 Classification measure
      5.1.2 Ranking measure
   5.2 Classification measure results
      5.2.1 Baselines
      5.2.2 Supervised Methods
      5.2.3 Unsupervised methods
   5.3 Ranking measure results
6. Conclusion and future work
7. References


List of tables and figures

Tables List

Table 1 - Source and target pairs
Table 2 - Positive and negative examples for the source-target synonym pair 'record-disc'
Table 3 - Example instances for the source-target synonym pair 'level-degree', where two senses of the source word 'degree' are considered positive
Table 4 - A noisy training example and an appropriate training example for the source word 'level' and the target word 'degree'
Table 5A - Classification results on the sense matching task - supervised methods
Table 5B - Classification results on the sense matching task - unsupervised methods
Table 6 - Mean Average Precision

Figures List

Figure 1 - Pseudo-code for our kNN classifier algorithm
Figure 2 - Direct supervised results varying J
Figure 3 - One-class evaluation varying ν
Figure 4 - Precision, Recall and F1 of kNN with cosine metric and k=10, for various thresholds
Figure 5 - Macro-averaged recall-precision curves
Figure 6 - Results of kNN with different values of k
Figure 7 - Results of kNN with different similarity metrics


Abstract

This thesis investigates, conceptually and empirically, the novel sense matching task of recognizing whether the senses of two synonymous words match in context. Sense matching enables substituting a word by its synonym, an operation called lexical substitution.

It is a commonly used operation for increasing recall in information seeking applications,

like Information Retrieval (IR) and Question Answering (QA). For example, there are

contexts in which the given source word ‘design’ (which might be part of a search query)

may be substituted by the target word ‘plan’; however, one should recognize that ‘plan’ has a different sense than ‘design’ in sentences such as “They discussed plans for a new bond issue”, while in the sentence "The construction plans were drawn" the sense of the target word 'plan' does match the meaning of the source word 'design'.

This thesis addresses the task of verifying that the senses of two given words do indeed match in a given context; in other words, recognizing texts in which the specified source word may be substituted with a synonymous target word. Improved recognition of sense matching would improve the eventual precision of applications, which typically decreases somewhat when lexical substitution is applied.

To perform lexical substitution, a source of synonymous words is required. One of the

most common sources, which was also used in our work, is WordNet (Fellbaum, 1998). Given a synonymous word, the binary classification task of sense matching may be addressed by various methods, which are categorized by two basic characteristics. The first is whether the sense matching is direct or indirect. In the indirect approach, the senses of the


source word and the target word are explicitly identified relative to predefined lists of

word senses, a process called Word Sense Disambiguation (WSD), and then compared. In

the direct approach it is determined whether the senses match without explicitly identifying the sense identity. Arguably, the indirect approach solves a harder intermediate problem than is eventually required and relies on a set of explicitly stipulated senses, while in the direct approach there is no explicit reference to predefined senses.

The second discrimination between methods is whether they are supervised or

unsupervised. Supervised methods require manually labeled learning data, while

unsupervised methods do not require any manual intervention.

In this thesis we investigate sense matching methods of all the above types. We

experimented with a supervised indirect method, which makes use of the standard multi-

class WSD setting to identify the sense of the target word, and classifies positively for the sense matching task if the selected sense matches one of the senses of the source word.

The supervised direct method we examined is trained on binary annotated learning data

of matching and non-matching target words, which correspond to the multiple senses of

the source word. The unsupervised indirect method we implemented matches example

words in the given context of the target word with the sense definitions of the source

word, obtained from a common resource dictionary.

The most powerful approach we investigated is the unsupervised direct one, which

avoids the intermediate step of explicit word sense disambiguation, and thus circumvents

the problematic reliance on a set of explicitly stipulated senses, and does not require

manual labeling as well. The underlying assumption of this approach is that if the sense

of the substituting target word matches the original source word, then its context should


be valid (typical) for the source word. The classification scheme we suggested in this

thesis learns a model of unlabeled occurrences of the source word, referring to all of them

as positive examples, and tests whether this model matches the context of the given

occurrence of the target word. We applied two different methods for this approach, one

based on the one-class SVM algorithm, which tries to delimit a region containing most of the training examples, and classifies a substituting target word as matching if it falls within this region.

The other method is based on a kNN approach, which calculates similarity between the

substituting target word and the occurrences of the source word, and ranks the level of

matching between the two words according to the level of similarity between the target

word context and the k most similar occurrences of the source word.

We used two different measures to evaluate the results of the methods described

above, one for evaluating classification accuracy and the other for evaluating ranking

quality. The ranking measure could be applied only for the kNN method and the

supervised direct method we implemented, since only those gave a score for each

substituting word, which enabled ranking. Classification could be applied to all methods by setting a threshold for positive classification on the ranking methods' scores, converting their results to a binary classification.

Positive empirical results are presented for all methods, substantially improving the

baselines. We focused on the direct unsupervised approach that does not require any

manual intervention and does not rely on any form of external information. As described

above we applied two different methods for this approach, the kNN method and the one-

class method, where the former obtained better results. These results are accompanied by some stimulating analysis for future research.


1. Introduction

In many language processing settings it is necessary to recognize that a given word or term

may be substituted by a synonymous one. In a typical information seeking scenario, an

information need is specified by some given source words. When looking for texts that

match the specified need the original source words might be substituted with

synonymous target words. For example, given the source word ‘weapon’ a system may

substitute it with the target synonym ‘arm’ when searching for relevant texts about

weapons.

This scenario, which is generally referred to here as lexical substitution, is a common

technique for increasing recall in Natural Language Processing (NLP) applications. In

Information Retrieval (IR) and Question Answering (QA), it is typically termed

query/question expansion (Moldovan and Mihalcea, 2000; Negri, 2004). Lexical

Substitution is also commonly applied to identify synonyms in text summarization, for

paraphrasing in text generation, or is integrated into the features of supervised tasks such

as Text Categorization and Information Extraction. Naturally, lexical substitution is a

very common first step in textual entailment recognition, which models semantic

inference between a pair of texts in a generalized application independent setting (Dagan

et al., 2005).

To perform lexical substitution NLP applications typically utilize a knowledge source

of synonymous word pairs. The most commonly used resource for lexical substitution is

the manually constructed WordNet (Fellbaum, 1998). Another option is to use statistical


word similarities, such as in the database constructed by Dekang Lin (e.g. (Lin, 1998)).1

We generically refer to such resources as substitution lexicons2.

When using a substitution lexicon it is assumed that there are some contexts in which

the given synonymous words share the same meaning. Yet, due to polysemy, it is necessary

to verify that the senses of the two words do indeed match in a given context. For

example, there are contexts in which the source word ‘weapon’ may be substituted by the

target word ‘arm’; however one should recognize that ‘arm’ has a different sense than

‘weapon’ in sentences such as “repetitive movements could cause injuries to hands,

wrists and arms.”

Since the sense matching involves sense disambiguation of both words, either

explicitly or implicitly, a mismatch between the source and target words may be caused

by wrong sense disambiguation of either of them. To illustrate these two cases of

mismatch, let us first consider the pair of source word weapon and target word arm,

when arm appears in the following context: “Look, could you grab hold of this box

before my arms drop off?". In this sentence, the word arm appears in another sense than

weapon - not the desired sense. The second type of mismatch happens when the original

source word is substituted by a word that is not synonymous to the sense of the original

word in the given context. For example, assume that the source word 'paper' appears in a

given query "photocopying paper". In this case it would be wrong to substitute it with the

target word 'newspaper', which is synonymous to a different sense of 'paper'. The focus of

our research is to solve the mismatch of the first type, while the same method could be

1 Available from http://armena.cs.ualberta.ca/lindek/downloads
2 While focusing on synonymy in this thesis, lexical substitution may be based on additional lexical semantic relations such as hyponymy.


applied to solve the second type, when switching roles between the source and target

words.

A commonly proposed approach to address sense matching in lexical substitution is

applying Word Sense Disambiguation (WSD) to identify the senses of the source and

target words. In this approach, substitution is applied only if the words have the same

sense (or synset, in WordNet terminology). In settings in which the source is given as a

single term without context, sense disambiguation is performed only to the target word;

substitution is then applied only if the target word’s sense matches at least one of the

possible senses of the source word.

One might observe that such application of WSD addresses the task at hand in a

somewhat indirect manner. In fact, lexical substitution only requires knowing that the

source and target senses do match, but it does not require that the matching senses be explicitly identified. Explicitly selecting the right sense in context, followed by verifying the desired matching, might solve a harder intermediate problem than required. Instead, we can define the sense matching problem directly as a binary classification task for a pair of synonymous source and target words. This task requires deciding whether the senses of the two words do or do not match in a given context (but it does not require explicitly identifying the matching senses).

A highly related task was proposed in (McCarthy, 2002). McCarthy's proposal was to

ask systems to suggest possible "semantically similar replacements" of a target word in

context, where alternative replacements should be grouped together. While this task is

somewhat more complicated as an evaluation setting than our binary recognition task, it

was motivated by similar observations and applied goals. From another perspective, sense


matching may be viewed as a lexical sub-case of the general textual entailment

recognition setting, where we need to recognize whether the meaning of the target word

"entails" the meaning of the source word in a given context.

This thesis3 provides a first investigation of the novel sense matching problem. To

allow comparison with the classical WSD setting we derived an evaluation dataset for the

new problem from the Senseval-3 English lexical sample dataset (Mihalcea and

Edmonds, 2004). We then evaluated alternative supervised and unsupervised methods

that perform sense matching either indirectly or directly (i.e. with or without the

intermediate sense identification step). Our findings suggest that in the supervised setting

the results of the direct and indirect approaches are comparable. However, addressing

directly the binary classification task has practical advantages and can yield high

precision values as desired in precision-oriented applications such as IR and QA.

More importantly, direct sense matching sets the ground for implicit unsupervised

approaches that may utilize practically unlimited volumes of unlabeled training data.

Furthermore, such approaches circumvent the Sisyphean need for specifying explicitly a

set of stipulated senses. We present initial implementations of such approaches based on a

one-class classifier and a KNN-style ranking method. These methods are trained on

unlabeled occurrences of the source word and are applied to classify and rank test

occurrences of the target word. The presented results outperform the unsupervised

baselines and put forth a whole new direction for future research.

3 Major parts of this research were published in (Dagan et al., 2006), which was based on

the current thesis work.


2. Background

2.1 Word Senses and Lexical Ambiguity

2.1.1 WordNet – a word sense database

To obtain an application-oriented view of prominent lexical relations, one must refer to the WordNet ontology (Fellbaum, 1998), the most influential computational lexical resource.

WordNet is a lexical database which is available online, and provides a large

repository of English lexical items. It was developed by a group of lexicographers led by

Miller, Fellbaum and others at Princeton University and has been constantly updated and

improved during the last fifteen years. Inspired by current psycholinguistic theories of

human lexical memory, it consists of English nouns, verbs, adjectives and adverbs

organized into synonym sets – synsets, each representing one underlying sense.

The synset includes a set of synonyms and their definition. The specific meaning of

one word for one type of POS (part of speech) is called a sense. Each sense of a word

appears in a different synset. Synsets are thus equivalent to senses: structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it

represents. For example, the words 'night', 'nighttime' and 'dark' constitute a single synset

that has the following gloss: “the time after sunset and before sunrise while it is dark

outside." Synsets are connected to one another through explicit semantic relations. Some

of these relations (hypernym and hyponym for nouns, hypernym and troponym for verbs)

constitute is-a-kind-of (hyperonymy) and is-a-part-of (meronymy for nouns) hierarchies.

Diverse WordNet relations have been used in various NLP tasks as a source of

candidate lexical substitutes for expansion. Expansion consists of altering a given text


(usually a query) by adding terms of similar meaning. For example, many question

answering systems perform expansion in the retrieval phase using query related words

based on WordNet’s lexical relations, such as synonymy or hyponymy (e.g. (Harabagiu et

al., 2000), (Hovy et al., 2001)). Automatic indexing has been improved by adding the

synsets of query words and their hypernyms to the query (Mihalcea and Moldovan,

2000). Scott and Matwin (1998) exploited WordNet hypernyms to increase the accuracy

of Text Classification. Chaves (1998) enhanced a document summarization task through

merging WordNet hyponymy chains, while Flank (1998) introduced a layered approach

to term similarity computation for information retrieval, which assigns the highest weights to synonymy relations, ranks hyponymy next, and gives meronymy relations the lowest scores in the final similarity weights. Notably, each of the above

works addressed the problem within the narrow setting of a specific application, while

none has induced a clear generic definition of the types of ontological relations that

contribute to semantic substitution.

2.1.2 Lexical ambiguity and Senseval

Word Sense Disambiguation (WSD) is the problem of deciding which sense a word has

in any given context. It has been very difficult to formalize the process of disambiguation, which humans perform so effortlessly. For virtually all applications of

language technology, word sense ambiguity is a potential source of error. One example is

Machine Translation (MT). If the English word 'drug' translates into French as either

'drogue' (narcotic) or 'médicament' (medication), then an English-French MT system

needs to disambiguate every use of 'drug' in order to make the correct translation.


Similarly, information retrieval systems may erroneously retrieve documents about an

illegal narcotic when the item of interest is a medication; analogously, information

extraction systems may make wrong assertions; and text-to-speech applications may confuse violin bows with a ship's bows.

Senseval (http://www.senseval.org/) is the international organization devoted to the

evaluation of Word Sense Disambiguation systems. Its mission is to organize and run

evaluation and related activities to test the strengths and weaknesses of WSD systems

with respect to different words, different aspects of language, and different languages. In

actual applications, WSD is often fully integrated into the system and often cannot be

separated. But in order to study and evaluate WSD, Senseval has, to date, concentrated on

standalone, generic systems for WSD.

2.2 WSD and Lexical Expansion

Despite some initial skepticism about the usefulness of WSD in practical tasks

(Voorhees, 1993; Sanderson, 1994), there is some evidence that WSD can improve

performance in typical NLP tasks such as IR and QA. For example, Schütze and

Pederson (1995) give clear indication of the potential for WSD to improve the precision

of an IR system. They tested the use of WSD on a standard IR test collection (TREC-1B),

improving precision by more than 4%.

The use of WSD has produced successful experiments for query expansion techniques.

In particular, some attempts exploited WordNet to enrich queries with semantically-

related terms. For instance, Voorhees (1994) manually expanded 50 queries over the

TREC-1 collection using synonymy and other WordNet relations. She found that the


expansion was useful with short and incomplete queries, leaving the task of proper

automatic expansion as an open problem.

Gonzalo et al. (1998) demonstrate an improvement in performance over an IR test collection using the sense data contained in SemCor, compared to a purely term-based model. In practice, they experimented with searching SemCor using disambiguated and expanded queries.

Their work shows that a WSD system, even if not performing perfectly, combined with

synonymy enrichment increases retrieval performance.

Moldovan and Mihalcea (2000) introduce the idea of using WordNet to extend Web

searches based on semantic similarity. Their results showed that WSD-based query

expansion actually improves retrieval performance in a Web scenario. Recently Negri

(2004) proposed a sense-based relevance feedback scheme for query enrichment in a QA

scenario (TREC-2003 and AQUAINT), demonstrating improvement in retrieval

performance.

While all these works clearly show the potential usefulness of WSD in practical tasks, they do not necessarily justify the efforts for refining fine-grained sense

repositories and for building large sense-tagged corpora. We suggest that the sense

matching task, as presented in the introduction, may relieve major drawbacks of applying

WSD in practical scenarios.

It is worth mentioning a related approach of word sense discrimination (Pedersen and

Bruce, 1997; Schütze, 1998). Word sense discrimination intends to divide the usages of a

word into different meanings without regard to any particular existing sense inventory.

Typically approached with unsupervised techniques, sense discrimination divides the

occurrences of a word into a number of classes by determining for any two occurrences


whether they belong to the same sense or not. Consequently, sense discrimination does

not determine the actual "meaning" (i.e. sense "label") but rather identifies which

occurrences of the same word have an equivalent meaning. Overall, word sense discrimination can be viewed as an indirect approach which assigns unsupervised senses.

In our preliminary work we assessed the importance of identifying expansion mismatches in applied settings and evaluated the causes of such mismatches. Using several pairs of source words, we checked whether substituting them with target synonyms actually results in sense mismatches in randomly retrieved sentences. We discovered that the main cause of inappropriately retrieved sentences is indeed word sense mismatch, which caused 77% of the retrieval mismatches. For example, consider the

original word pair 'cut job', where the source word 'job' is substituted by the target word

'position'. A successful substitution is found in the sentence: "40% of the positions at the

company were cut." The following sentence is an example of a sense mismatch: "The

company’s market position suffered a cut after a bad quarter". In this sentence the word

position has a different sense than job.

2.3 Textual Entailment

Textual entailment (TE) has been proposed recently as a generic framework for modeling

semantic variability in many Natural Language Processing applications, such as Question

Answering (QA), Information Extraction (IE) and Information Retrieval (IR). Textual

entailment is defined as a relationship between a coherent text T and a language

expression, which is considered as a hypothesis, H. Then, T entails H (H is a consequent

of T), denoted by T=>H, if the meaning of H, as interpreted in the context of T, can be


inferred from the meaning of T. For example, “Shirley inherited the house” => “Shirley owned the house”.

Identifying entailment between texts is a complex task. Many researchers have addressed sub-tasks of TE. For example, Geffet and Dagan (2004) explored the correspondence between the distributional characterizations of pairs of words (which may rarely co-occur, as is usually the case for synonyms) and the kind of tight semantic relationship that might hold between them, in particular entailment at the lexical level. They proposed a feature weighting function (RFF) that yields more accurate

distributional similarity lists, which better approximate the lexical entailment relation.

This method still applies a standard measure for distributional vector similarity (over

vectors with the improved feature weights), and thus produces many loose similarities

that do not correspond to entailment.

In a later paper, they explore more deeply the relationship between distributional

characterization of words and lexical entailment, proposing two new hypotheses as a

refinement of the distributional similarity hypothesis. The main idea is that if one word

entails the other then we would expect that virtually all the characteristic context features

of the entailing word will actually occur also with the entailed word. To illustrate this

idea let us consider an entailing pair: company => firm, and the following set of

characteristic features of “company” – {“(company)’s profit”, “chairman of the

(company)”}. Then these features are expected to appear with “firm” as well in some

large corpus - “firm’s profit” and “chairman of the firm”. Other researchers have explored

other aspects of textual entailment: Glickman, Dagan, and Kopel (2004) propose a

general generative probabilistic setting for textual entailment. They focus on the sub-task


of recognizing whether the lexical concepts present in a given hypothesis are entailed

from a given text.

Glickman, Bar-Haim, Spector (2005) suggest an analysis of sub-components and tasks

within textual entailment, proposing two levels: Lexical and Lexical-Syntactic. At the

lexical level, they match (possibly multi-word) terms of one text (T) and a second text

(hypothesis H), ignoring function words. At the lexical-syntactic level, they match the

syntactic dependency relations within H and T.

The sense matching problem we tried to deal with is actually a binary classification

task: to decide whether the occurrence of the target word in the given sentence entails the

source word (i.e., at least one of the meanings of the source word). An example we have

already mentioned is the sentence 'Repetitive movements could cause injuries to hands,

wrists and arms.', where the word arm substitutes the word weapon, but with the wrong

sense. In this case, the target word arm does not entail the source word weapon. On the

other hand, in the sentence 'This house was badly mauled by careless soldiers searching

for arms' the target word arm entails the source word weapon. In our work we suggest a

novel approach of using an implicit WSD method that identifies such lexical entailment

in context.

2.4 Classification Algorithms

As we have mentioned in the introduction, we can define the sense matching

problem directly as a binary classification task for a pair of

synonymous source and target words. For this task we used two

algorithms: SVM (Support Vector Machine) and a kNN (k-nearest neighbors)-


based method. We used two existing implementations of the SVM algorithm, LibSVM and SVMLight, and implemented the kNN algorithm ourselves.

2.4.1 Binary classification using SVM (Support Vector Machine)

The SVM algorithm refers to each source-target pair as a point (or vector) in a multi-

dimensional space, where each dimension is any desired feature of the pair. Given a

collection of such training points in the feature space, each tagged positive or negative,

we would like to separate the positive and the negative ones as neatly as possible, by the

simplest plane. The task of determining whether an untagged point is negative or positive,

is called classification, and a group of identically tagged points is called a class.

In some cases, the two classes of positive and negative points can be separated by a

multi-dimensional 'plane’, which is called a hyperplane. A method that uses such a

hyperplane is therefore called linear classification. If such a hyperplane exists, we would

like to choose the one that separates the data points with maximum distance between it

and the closest data points from both classes. This distance is called the margin. We

desire this property because it makes the separation between the two classes greater: if we then add another data point to the points we already have, we can classify the new point more accurately. Such a hyperplane is known as the maximum-

margin hyperplane, or the optimal hyperplane. The vectors (data points) that are closest

to this hyperplane are called the support vectors.

Let us now give a more detailed view of the SVM algorithm. We consider a set of

training data points of the form $\{(\mathbf{x}_1, c_1), \ldots, (\mathbf{x}_n, c_n)\}$, where $c_i$ is either 1 or −1. This constant denotes the class to which the point belongs, positive or negative. Each $\mathbf{x}_i$ is an n-dimensional vector of scaled [0,1] or [-1,1] values. The scaling is important to guard against features with larger variance, which might otherwise dominate the classification. These points constitute the training data, denoting the correct classification which the SVM should eventually reproduce. The classification is defined by means of the dividing hyperplane, which takes the form

(1) $\mathbf{w} \cdot \mathbf{x} - b = 0$

where w and b are the parameters of the optimal hyperplane that we need to learn.

As we are interested in the maximum margin, the dominant data are the support vectors and the hyperplanes closest to these support vectors in either class. These hyperplanes are parallel to the optimal separating hyperplane. It can be shown that they can be described by the equations

(2) $\mathbf{w} \cdot \mathbf{x} - b = 1$

(3) $\mathbf{w} \cdot \mathbf{x} - b = -1$

We would like these hyperplanes to maximize the distance from the dividing hyperplane and to have no data points between them. Geometrically, the distance between the hyperplanes is $2/\|\mathbf{w}\|$, so $\|\mathbf{w}\|$ should be minimized in order to maximize the margin. To exclude data points, it should be ensured that for all $i$,

(4) $\mathbf{w} \cdot \mathbf{x}_i - b \ge 1$ or $\mathbf{w} \cdot \mathbf{x}_i - b \le -1$

This can be rewritten as:

(5) $c_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \quad 1 \le i \le n$

The problem now is to minimize $\|\mathbf{w}\|$ subject to the constraint (5). This is a quadratic

programming (QP) optimization problem which is solved by the SVM training


algorithm. After the SVM has been trained, it can be used to classify unseen 'test' data. This is achieved using the following decision rule:

(6) $c = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} - b)$

where $c$ is the class of the new data point $\mathbf{x}$. Writing the classification rule in its dual form reveals that classification is a function only of the support vectors, i.e., the training data points that lie on the margin.

In cases where there is no hyperplane that can split the positive and negative training examples, the Soft Margin method is applied. This method chooses a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest split examples. The method introduces slack variables $\xi_i \ge 0$, and equation (5) now transforms to

(7) $c_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i, \quad 1 \le i \le n$

and the optimization problem becomes

(8) $\min \|\mathbf{w}\| + C \sum_i \xi_i$ such that $c_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i, \quad 1 \le i \le n$

The constraint in (7), along with the objective of minimizing $\|\mathbf{w}\|$, can be handled using Lagrange multipliers or by setting up a dual optimization problem to eliminate the slack variables.
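As an illustration of this binary classification scheme, here is a minimal sketch using scikit-learn's SVC with a linear kernel; the library choice and the toy feature vectors are assumptions made for illustration only, since the thesis used the SVMLight and LibSVM implementations directly.

```python
from sklearn.svm import SVC

# Toy feature vectors for occurrences in context, each labeled +1 (senses match)
# or -1 (senses do not match); C controls the soft-margin trade-off of equation (8).
X_train = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.7], [0.0, 0.8, 0.9]]
y_train = [1, 1, -1, -1]

clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)

# The decision rule is the sign of the signed distance to the separating hyperplane.
print(clf.predict([[0.7, 0.3, 0.2]]))            # predicted class (+1 or -1)
print(clf.decision_function([[0.7, 0.3, 0.2]]))  # signed distance to the hyperplane
```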


2.4.1.1 One-class SVM

In certain cases the training data contain points of one class only, and the above separation is no longer possible. In such cases the aim is to estimate the smallest hypersphere enclosing most of the positive training data points. New test instances are then classified positively if they lie inside the sphere, while outliers are regarded as negatives.

We used the SVMlight4 classifier (developed by T. Joachims from the University of Dortmund) for the case where the data contain points of both classes, and LibSVM5, with its one-class option, for the one-class data case. LibSVM also enables us to control the ratio between the width of the enclosed region of training points and the number of misclassified training examples, by setting the parameter $\nu \in (0, 1)$. Smaller values of $\nu$ will produce larger positive regions, yielding increased recall but lower precision.
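For illustration, a minimal sketch of the one-class setting follows, using scikit-learn's OneClassSVM (which wraps LibSVM) rather than the exact setup of the thesis; the feature vectors and the chosen nu value are assumptions.

```python
from sklearn.svm import OneClassSVM

# Unlabeled occurrences of the source word, represented as feature vectors,
# are all treated as positive training examples.
source_occurrences = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.9, 0.2]]

# nu in (0, 1): smaller values yield a larger positive region
# (higher recall, lower precision).
ocsvm = OneClassSVM(nu=0.2, kernel='rbf', gamma='scale')
ocsvm.fit(source_occurrences)

# A target-word occurrence is classified as sense-matching (+1) if it falls
# inside the learned region, and as non-matching (-1) otherwise.
print(ocsvm.predict([[0.85, 0.2], [0.1, 0.9]]))
```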

2.4.2 The kNN (k-nearest neighbor) classification algorithm

The k-nearest neighbor algorithm6 is an intuitive method that classifies unlabeled examples based on their similarity to examples in a given training set. For a given unlabeled example, the algorithm finds the k closest labeled examples in the training data, and classifies the unlabeled example according to the most frequent class within the set of the k closest examples. The special

4 http://svmlight.joachims.org/
5 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
6 Major parts of this paragraph are based on the Wikipedia article http://en.wikipedia.org/wiki/Nearest_neighbor_(pattern_recognition)


case where the class is predicted to be the class of the closest training sample (i.e. when k

= 1) is called the nearest neighbor algorithm.

The training examples are mapped into a multidimensional feature space. The space is

partitioned into regions by class labels of the training samples. A point in the space is

assigned to the class c, if it is the most frequent class label among the k nearest training

samples. The training phase of the algorithm consists only of storing the feature vectors

and class labels of the training samples. In the actual classification phase, the same

features as before are computed for the new test example (whose class is not known). The

distances from the new vector to all stored vectors are computed by a selected distance metric or some vector similarity measure. The k closest samples are then selected, and

the new point is predicted to belong to the most frequent class within this set. The

performance of the kNN algorithm is influenced by two main factors: (1) the similarity

measure used to locate the k nearest neighbors; and (2) the number of k neighbors used to

classify the new sample.
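To make the procedure concrete, here is a minimal Python sketch of such a kNN classifier over sparse feature vectors; the cosine similarity choice and the function names are illustrative assumptions rather than the exact implementation used in this work.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse feature vectors (dicts of feature -> weight).
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(test_vec, train_set, k=10, sim=cosine):
    # train_set: list of (feature_dict, label) pairs; label is +1 or -1.
    # Rank training examples by similarity to the test vector and keep the k nearest.
    neighbors = sorted(train_set, key=lambda ex: sim(test_vec, ex[0]), reverse=True)[:k]
    # Predict the most frequent label among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```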

The best choice of k depends upon the data. Generally, larger values of k reduce the

effect of noise on the classification, but make boundaries between classes less distinct. A

good k can be selected by parameter optimization using, for example, cross-validation.

The accuracy of the kNN algorithm can be severely degraded by the presence of noisy or

irrelevant features, or if the feature scales are not consistent with their relevance. Much

research effort has been placed into selecting or scaling features to improve classification.

A particularly popular approach is the use of evolutionary algorithms to optimize feature

scaling (Dixon, Corne and Oates, 2003). Another popular approach is to scale


features by the mutual information of the training data with the training classes (Yang and Pedersen, 1997; Li et al., 2001).

The algorithm is easy to implement, but it is computationally intensive, especially

when the size of the training set grows. Many optimizations have been proposed over the

years; these generally seek to reduce the number of distances actually computed. Some

optimizations involve partitioning the feature space, and only computing distances within

specific nearby volumes.

To measure the distance between two vectors some vector metric or any measure of

similarity is required. We used three of the most popular similarity measures. The

weighted Jaccard measure (Grefenstette, 1994) compares the number of common features with the number of unique features for a pair of examples. When generalizing this scheme to non-binary values, each feature is represented by a real value in the range of 0–1. This generalization, known as Weighted Jaccard, replaces intersection with the minimum weight, and union with the maximum weight. Set cardinality is generalized to summing over the union of the features of the two examples w and v:

$sim_{WJ}(w, v) = \frac{\sum_{f \in F(w) \cup F(v)} \min(weight(w, f), weight(v, f))}{\sum_{f \in F(w) \cup F(v)} \max(weight(w, f), weight(v, f))}$

where F(w) and F(v) are the features of the two examples. The advantage of this measure

is that it also takes into account the feature weights rather than just the number of the

common features.


The standard Cosine measure, which was successfully employed in IR (Salton and

McGill, 1983), and also for learning similar words (Ruge, 1992; Caraballo, 1999; Gauch

et al., 1999; Pantel and Ravichandran, 2004), is the second alternative to examine:

$sim_{\cos}(w, v) = \frac{\sum_{f} weight(w, f) \cdot weight(v, f)}{\sqrt{\sum_{f} weight(w, f)^2} \cdot \sqrt{\sum_{f} weight(v, f)^2}}$

Calculating the cosine of the angle between the two vectors considers the difference in

direction of two vectors in feature space as opposed to their geometric distance. Thus, it

overcomes the problem of distance metrics that discriminate too strongly between vectors

with significantly different lengths.

The third measure we used is a recent state-of-the-art variant of the weighted Jaccard

measure (Weeds and Weir, 2004), which was developed by Lin (1998) and is grounded

on principles of information theory. It computes the ratio between what is shared by the

features of both vectors and the sum over the features of each vector:

$sim_{Lin}(w, v) = \frac{\sum_{f \in F(w) \cap F(v)} \left[ weight(w, f) + weight(v, f) \right]}{\sum_{f \in F(w)} weight(w, f) + \sum_{f \in F(v)} weight(v, f)}$

where F(w) and F(v) are the features of the two examples and the weight function is defined as

the Mutual Information (MI). There are three underlying intuitions to this measure: (1)

the more commonality the two objects share, the more similar they are; (2) the more

differences they have, the less similar they are; (3) the maximum similarity between

objects A and B should only be reached when they are identical, no matter how much

commonality they share.
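As a concrete illustration, here is a small Python sketch of the weighted Jaccard and Lin measures over sparse feature vectors; the dictionary representation and function names are assumptions for illustration, and in the Lin case the weights would be mutual information scores.

```python
def weighted_jaccard(u, v):
    # u, v: dicts mapping feature -> non-negative weight in [0, 1].
    feats = set(u) | set(v)
    num = sum(min(u.get(f, 0.0), v.get(f, 0.0)) for f in feats)
    den = sum(max(u.get(f, 0.0), v.get(f, 0.0)) for f in feats)
    return num / den if den else 0.0

def lin_similarity(u, v):
    # u, v: dicts mapping feature -> weight (e.g., mutual information scores).
    shared = set(u) & set(v)
    num = sum(u[f] + v[f] for f in shared)
    den = sum(u.values()) + sum(v.values())
    return num / den if den else 0.0
```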


3. Problem Setting and Dataset

To investigate the direct sense matching problem it is necessary to obtain an appropriate

dataset of examples for this binary classification task, along with gold standard

annotation. While there is no such standard (application independent) dataset available, it

is possible to derive it automatically from existing WSD evaluation datasets, as described

below. This methodology also allows comparing direct approaches for sense matching

with classical indirect approaches, which apply an intermediate step of identifying first

the most likely WordNet sense.

We chose to work with single words in order to address the abstract form of the direct sense matching problem. (We did not want to work with more than a single word at a time, in order to avoid problematic word dependencies,

etc.). Our dataset was derived from the Senseval-3 English lexical sample dataset

(Mihalcea and Edmonds, 2004), and included all 25 nouns, adjectives and adverbs in this

sample. Verbs were excluded since their sense annotation in Senseval-3 is not based on

WordNet senses but rather on a different dictionary (the available approximate mapping

to WordNet synsets was not sufficiently reliable). The Senseval dataset includes a set of example occurrences in context for each word, split into training and test sets, where each

example is manually annotated with the corresponding WordNet synset.

For the sense matching setting we need examples of pairs of source-target

synonymous words, where at least one of these words should occur in a given context.

Following an applicative motivation, we mimic a typical IR setting in which a single

source word query is expanded (substituted) by a synonymous target word. Then, it is


needed to identify contexts in which the target word appears in a sense that matches the

source word. Accordingly, we considered each of the 25 words in the Senseval sample as

a target word for the sense matching task. Next, we had to pick for each target word a

corresponding synonym to play the role of the source word. This was done by creating a

list of all WordNet synonyms of the target word, under all its possible senses, and picking

randomly one of the synonyms as the source word. For example, the word ‘disc’ is one of

the words in the Senseval lexical sample. For this target word the synonym ‘record’ was

picked, which matches ‘disc’ in its musical sense.
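For illustration, the collection of candidate source synonyms can be sketched with NLTK's WordNet interface; the use of NLTK and the function name below are assumptions, since the thesis worked with WordNet directly, and in the thesis the candidate list was manually filtered before the random choice.

```python
import random
from nltk.corpus import wordnet as wn

def candidate_source_words(target, pos=wn.NOUN):
    # Collect all WordNet synonyms of the target word across all its senses.
    synonyms = set()
    for synset in wn.synsets(target, pos=pos):
        for lemma in synset.lemma_names():
            if lemma.lower() != target.lower():
                synonyms.add(lemma)
    return sorted(synonyms)

# Example: list synonym candidates for 'disc' and pick one at random
# (in the thesis, a lexicographer first removed obscure synonyms).
candidates = candidate_source_words('disc')
source = random.choice(candidates) if candidates else None
```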

While creating source-target synonym pairs it was evident that many WordNet

synonyms corresponded to very infrequent senses or word usages, such as the WordNet

synonyms germ and source. Such source synonyms are useless for evaluating sense

matching with the target word since the senses of the two words would rarely match in

perceivable contexts. In fact, considering our motivation for lexical substitution, it is

usually desired to exclude such obscure synonym pairs from substitution lexicons in

practical applications, since they would mostly introduce noise to the system. To avoid

this problem the list of WordNet synonyms for each target word was filtered by a

lexicographer, who manually excluded obscure synonyms that seemed worthless in

practice. The lexicographer was also instructed to exclude pairs where the target word

had a more general meaning than the source word.

Using those manually filtered results, the source synonym for each target word was then

picked randomly from the filtered list. Table 1 shows the 25 source-target pairs created

for our experiments.


Source word      Target word     WordNet sense id

statement        argument        argument%1:10:02
level            degree          degree%1:07:00::, degree%1:26:01::
raging           hot             hot%3:00:00:violent:00
opinion          judgment        judgment%1:10:00::
execution        performance     performance%1:04:00::
subdivision      arm             arm%1:14:00::
deviation        difference      difference%1:11:00::
ikon             image           image%1:06:00::
arrangement      organization    organization%1:09:00::
design           plan            plan%1:09:01::
atm              atmosphere      atmosphere%1:23:00::
dissimilar       different       different%3:00:02::
crucial          important       important%3:00:02::
newspaper        paper           paper%1:06:00::, paper%1:10:03::, paper%1:14:00::
protection       shelter         shelter%1:26:00::
hearing          audience        audience%1:26:00::
trouble          difficulty      difficulty%1:04:00::
sake             interest        interest%1:07:01::
company          party           party%1:14:02::
variety          sort            sort%1:09:00::
camber           bank            bank%1:17:02::
record           disc            disc%1:06:01::
bare             simple          simple%3:00:02:plain:01
substantial      solid           solid%3:00:00:sound:01, solid%3:00:00:wholesome:00
root             source          source%1:15:00::

Table 1: Source and target pairs


In future work it may be possible to apply automatic methods for filtering infrequent

sense correspondences in the dataset, by adopting algorithms such as in (McCarthy et al.

2004).

Having source-target synonym pairs, a classification instance for the sense matching

task is created from each example occurrence of the target word in the Senseval dataset.

A classification instance is thus defined by a pair of source and target words and a given

occurrence of the target word in context. The instance should be classified as positive if

the sense of the target word in the given context matches one of the possible senses of the

source word, and as negative otherwise. Table 2 illustrates positive and negative example

instances for the source-target synonym pair ‘record-disc’, where only occurrences of

‘disc’ in the musical sense are considered positive.

Sentence: "This is anyway a stunning disc, thanks to the playing of the Moscow Virtuosi with Spivakov."
Annotation: positive

Sentence: "He said computer networks would not be affected and copies of information should made on floppy discs."
Annotation: negative

Sentence: "Before the dead solider was placed in the ditch his personal possessions were removed, leaving one disc on the body for identification purposes."
Annotation: negative

Table 2: Positive and negative examples for the source-target synonym pair 'record-disc'

The gold standard annotation for the binary sense matching task can be derived

automatically from the Senseval annotations and the corresponding WordNet synsets. An

example occurrence of the target word is considered positive if the annotated synset for

that example also includes the source word, and negative otherwise. Notice that different


positive examples might correspond to different senses of the source word. This happens

when the source and target share several senses, and hence they appear together in several

synsets (see Table 3). Finally, since in Senseval an example may be annotated with more

than one sense, it was considered positive if any of the annotated synsets for the target

word included the source word.
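The derivation can be summarized by a short sketch; the use of NLTK's WordNet interface and the function name below are illustrative assumptions rather than the exact tooling used for the thesis.

```python
from nltk.corpus import wordnet as wn

def is_positive(source, annotated_sense_keys):
    # An example occurrence of the target word is positive if any of its
    # annotated synsets also contains the source word as a lemma.
    for key in annotated_sense_keys:
        synset = wn.lemma_from_key(key).synset()
        lemmas = {l.lower().replace('_', ' ') for l in synset.lemma_names()}
        if source.lower() in lemmas:
            return True
    return False

# Example (sense key taken from Table 1): an occurrence of 'disc' annotated
# with its musical sense should come out positive for the source word 'record':
#   is_positive('record', ['disc%1:06:01::'])  ->  True
```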

Using this procedure we derived gold standard annotations for all the examples in the

Senseval-3 training section for our 25 target words. For the test set we took up to 40 test

examples for each target word (some words had fewer test examples), yielding 913 test

examples in total, out of which 239 were positive. This test set was used to evaluate the

sense matching methods described in the next section.

Sentence: "It can be a very useful means of making a charitable gift towards the end of the tax year when your taxable income for the year can be estimated with some degree of precision"
WordNet sense: a position on a scale of intensity or amount or quality
Annotation: positive

Sentence: "The length of time spent stretching depends on the sport you are training for and the degree of flexibility you wish to attain"
WordNet sense: a specific identifiable position in a continuum or series or especially in a process
Annotation: positive

Table 3: Example instances for the source-target synonym pair 'level-degree', where two senses of the source word 'degree' are considered positive.


4. Investigated Methods

As explained in the introduction, the sense matching task may be addressed by two

general approaches. The traditional indirect approach would first disambiguate the target

word relative to a predefined set of senses, using standard WSD methods. Then, it would

check whether the selected sense matches the source word. In terms of WordNet synsets,

it would check whether the selected synset for the target word includes the source word

as well. On the other hand, a direct approach would address the binary sense matching

task directly, without explicitly selecting a concrete sense for the target word. In this

research we focus on investigating several direct methods for sense matching and

compare their performance relative to traditional indirect methods, under both supervised

and unsupervised settings.

Two different goals may be set for sense matching methods. The first goal is

classification, where the system needs to decide for each test example whether it is

positive or negative (i.e., whether the target word sense matches the source or not). The

second goal is ranking, where the system only needs to rank all test examples of a given

target word according to their likelihood of being positive, as measured by some

confidence score. From the perspective of the applied lexical substitution task, employing the sense matching module as a classifier makes it possible to filter out inappropriate contexts of the target word. On the other hand, scored ranking corresponds

to situations in which a hard classification decision is not expected from the sense

matching module, either because the final system output is a ranked list (as in IR and QA)

or because the sense matching score is being integrated with the scores of additional


system modules. As described below, we investigate alternative methods for both the

classification and ranking goals.

4.1 Feature set and classifier

As a vehicle for investigating different classification approaches we implemented a "vanilla" state-of-the-art architecture for WSD. Following common practice in feature extraction (e.g. (Yarowsky, 1994)), and using the mxpost7 part of speech tagger and WordNet's lemmatization, the following feature set was used: bag of word lemmas for the context words in the preceding, current and following sentence; unigrams of lemmas and parts of speech in a window of +/- three words, where each position provides a distinct feature [w-3, w-2, w-1, w+1, w+2, w+3]; and bigrams of lemmas in the same window [w-3w-2, w-2w-1, w-1w+1, w+1w+2, w+2w+3].
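A simplified sketch of this feature extraction is given below; it assumes the context has already been tokenized, lemmatized and POS-tagged (the thesis used the mxpost tagger and WordNet lemmatization), and the helper name extract_features is ours:

def extract_features(context_lemmas, window_lemmas, window_pos):
    # context_lemmas: lemmas of the preceding, current and following sentences.
    # window_lemmas / window_pos: the six positions w-3..w+3 around the target word.
    features = {}
    for lemma in context_lemmas:                          # bag-of-lemmas features
        features['bow=' + lemma] = 1
    positions = ['w-3', 'w-2', 'w-1', 'w+1', 'w+2', 'w+3']
    for name, lemma, tag in zip(positions, window_lemmas, window_pos):
        features[name + '_lemma=' + lemma] = 1            # positional lemma unigrams
        features[name + '_pos=' + tag] = 1                # positional POS unigrams
    for i in range(len(window_lemmas) - 1):               # lemma bigrams within the window
        features['bigram_' + positions[i] + positions[i + 1] + '=' +
                 window_lemmas[i] + '_' + window_lemmas[i + 1]] = 1
    return features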

The SVMLight (Joachims, 1999) classifier was used in the supervised settings with its

default parameters. To obtain a multi-class classifier we used a standard one-vs-all

approach of training a binary SVM for each possible sense and then selected the highest

scoring sense for a test example.

To verify that our implementation provides a reasonable replication of state-of-the-art WSD, we applied it to the standard Senseval-3 Lexical Sample WSD task. The obtained accuracy8 was 66.7%, which compares reasonably with the mid-range of systems in the Senseval-3 benchmark (Mihalcea and Edmonds, 2004). This figure is just a few percent lower than that of the (quite complicated) best Senseval-3 system, which achieved about 73% accuracy, and it is much higher than the standard Senseval baselines. We thus regard our

7 ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz
8 The standard classification accuracy measure equals precision and recall as defined in the Senseval terminology when the system classifies all examples, with no abstentions.


classifier as a fair vehicle for comparing the alternative approaches for sense matching on

equal grounds.

4.2 Supervised Methods

4.2.1 Indirect approach

The indirect approach for sense matching follows the traditional scheme of performing

WSD for lexical substitution. First, the WSD classifier described above was trained for

the target words of our dataset, using the Senseval-3 sense annotated training data for

these words. This was accomplished by training a binary SVM for each possible sense.

Each binary classifier was trained to identify one sense of the target - the training

examples were classified as positive when the sense of the target word in the given

context matched the specific classifier's sense and the rest were classified as negative.

Then, each classifier was applied to each test example of the target words, selecting the

most likely sense for each example by picking the sense of the binary classifier that

scored highest for the test example. Finally, an example was classified as positive if its

selected sense included the source word in its synset. Otherwise, the example was

classified as negative.
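The decision step of this indirect scheme can be sketched roughly as follows, assuming one trained per-sense binary classifier exposing a decision_function (as in scikit-learn) and the synset lemmas of each sense; the names below are illustrative, not the thesis code:

def indirect_sense_match(x, sense_classifiers, sense_lemmas, source_word):
    # x: the vectorized feature representation of one test example.
    # sense_classifiers: {sense_id: binary classifier trained for that sense}.
    # sense_lemmas: {sense_id: set of lemmas in that sense's WordNet synset}.
    best_sense = max(sense_classifiers,
                     key=lambda s: sense_classifiers[s].decision_function(x)[0])
    return source_word in sense_lemmas[best_sense]  # positive iff the chosen synset contains the source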

4.2.2 Direct approach

As explained above, the direct approach addresses the binary sense matching task

directly, without selecting explicitly a sense for the target word. In the supervised setting

it is easy to obtain such a binary classifier using the annotation scheme described in

Section 3. Under this scheme an example was annotated as positive (for the binary sense

matching task) if the source word is included in the Senseval gold standard synset of the


target word. We trained the classifier using the set of Senseval-3 training examples for

each target word, considering their derived binary annotations. Finally, the trained

classifier was applied to the test examples of the target words, yielding directly a binary

positive-negative classification. We note that the direct binary setting is suitable for

producing rankings as well, using the obtained SVM scores to rank all examples of each

target word. In addition, because this method is direct and applies a single classifier for a target word, it allows for shorter running time during the training and test stages. In the indirect method, the training stage must train a binary classifier for each sense. Consequently, in the testing stage, each test example must be checked by all of the binary classifiers, and the running time increases with the number of senses of each target word. Some words have many senses; the word "hot", for example, has twenty-one different senses.
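A rough sketch of this direct setting is shown below, using scikit-learn's LinearSVC and DictVectorizer as stand-ins for the SVMlight classifier actually used in the thesis:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def train_direct_matcher(train_feature_dicts, binary_labels):
    # One binary classifier per source-target pair, trained on the derived labels.
    vec = DictVectorizer()
    X = vec.fit_transform(train_feature_dicts)
    return vec, LinearSVC().fit(X, binary_labels)

def classify_and_rank(vec, clf, test_feature_dicts):
    X = vec.transform(test_feature_dicts)
    scores = clf.decision_function(X)            # usable both as a classifier (sign of the score)...
    return scores > 0, scores.argsort()[::-1]    # ...and for ranking the test examples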

4.3 Unsupervised Methods

It is well known that obtaining annotated training examples for WSD tasks is very

expensive, and is often considered infeasible in unrestricted domains. Therefore, many

researchers investigated unsupervised methods, which do not require annotated examples.

Unsupervised approaches have usually been investigated within Senseval using the “All

Words” dataset, which does not include training examples. In this thesis we preferred

using the same test set which was used for the supervised setting (created from the

Senseval-3 “Lexical Sample” dataset, as described above), in order to enable comparison

between the two settings. Naturally, in the unsupervised setting the sense labels in the

training set were not utilized.


4.3.1 Indirect approach

State-of-the-art unsupervised WSD systems are quite complex and not easy to replicate. Thus, we implemented the unsupervised version of the Lesk algorithm (Lesk,

1986) as a reference system, since it is considered a standard simple baseline for

unsupervised approaches. The Lesk algorithm is one of the first algorithms developed for

semantic disambiguation of all-words in unrestricted text. In its original unsupervised

version, the only resource required by the algorithm is a machine readable dictionary with

one definition for each possible word sense. The algorithm looks for words in the sense

definitions that overlap with context words in the given sentence, and chooses the sense

that yields maximal word overlap. This algorithm is based on the intuition that words that

co-occur in a sentence are being used to refer to the same topic, and that topically related senses of words are defined in a dictionary using the same words. We used an implementation of this algorithm created by our Italian colleague, Carlo Strapparava, from ITC-Irst, that uses WordNet sense definitions with a context length of ±10 words before and after the target word.
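A simplified version of this criterion (not the ITC-Irst implementation) can be sketched as follows:

from nltk.corpus import wordnet as wn

def simplified_lesk(target_word, tokens, target_index, window=10):
    # Context: up to 10 tokens before and after the target occurrence.
    context = set(tokens[max(0, target_index - window):target_index] +
                  tokens[target_index + 1:target_index + 1 + window])
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(target_word):
        overlap = len(context & set(synset.definition().lower().split()))
        if overlap > best_overlap:          # ties keep WordNet's first (most frequent) sense
            best_sense, best_overlap = synset, overlap
    return best_sense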

4.3.2 Direct approaches

It has been well recognized that it is very difficult, and methodologically problematic, to

determine the “right” set of pre-defined senses for WSD. Hence, the direct sense

matching approach may be particularly attractive just because it does not assume any

reference to a pre-defined set of senses. However, existing unsupervised algorithms for

the classical WSD task do rely on pre-defined sense repositories, and sometimes also on

dictionary definitions for these senses (similar to the Lesk algorithm). For this reason,


standard unsupervised WSD techniques cannot be applied for direct sense matching, in

which the only external information assumed is a substitution lexicon.

The assumption underlying our proposed methods is that if a target word occurrence

has a sense which matches the source word then the context of that occurrence should be

valid for the source word as well. Unlabeled occurrences of the source word can then be

used to learn a model of its typical valid contexts. Next, we can match this model against

test examples of the target word and evaluate whether the given target contexts are valid

for the source word or not, providing a decision criterion for sense matching.

We notice that in this proposed approach only positive examples are given, in the form of

unlabeled occurrences of the source word. Learning from positive examples only (also

called one class learning) is known to be much more difficult than standard supervised

learning for the same task. Yet, this setting arises in many practical situations and is often

the only unsupervised solution available.

4.3.2.1 Direct approach: one-class SVM

Our first unsupervised method utilizes the One-Class SVM learning algorithm (Schölkopf et al., 2001), and was implemented using the LIBSVM package9 by our Italian colleague, Alfio Gliozzo, from ITC-Irst. The training examples consist of a given

sample of unlabeled occurrences of the source word represented by the same feature set

of Subsection 4.1. We used training examples taken from the BNC10 (British National

Corpus). This created compatibility between the training data and the test data, because the BNC is one of the sources of Senseval, from which our test data was drawn. This

9 Freely available from http://www.csie.ntu.edu.tw/~cjlin/libsvm
10 The BNC (British National Corpus) is a 100 million word collection of samples of written and spoken language from a wide range of sources. http://www.natcorp.ox.ac.uk


compatibility added to the chances of successful learning, since the training data and the

test data had more topics in common.

Roughly speaking, a one-class SVM estimates the smallest hypersphere enclosing

most of the training data. New test instances are then classified positively if they lie

inside the sphere, while outliers are regarded as negatives. The ratio between the width of

the enclosed region and the number of misclassified training examples can be varied by

setting the parameter ν ∈ (0, 1). Smaller values of ν produce larger positive regions,

yielding increased recall. We note that we could utilize the LIBSVM one-class package

only for classification but not for ranking, since it provides just a binary classification

decision rather than a classification score.
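A minimal sketch of this setting, using scikit-learn's OneClassSVM rather than the LIBSVM package used in the thesis, could look as follows:

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import OneClassSVM

def train_one_class(source_feature_dicts, nu=0.5):
    # Training uses only unlabeled occurrences of the source word.
    vec = DictVectorizer()
    X = vec.fit_transform(source_feature_dicts)
    model = OneClassSVM(kernel='linear', nu=nu).fit(X)   # smaller nu -> larger positive region
    return vec, model

def sense_match(vec, model, target_feature_dicts):
    X = vec.transform(target_feature_dicts)
    return model.predict(X) == 1   # +1 = inside the learned region (positive), -1 = outlier (negative)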

Experiments with the one-class SVM (see Section 5) revealed two problems. First,

there is no obvious way to tune the optimal value of the parameter ν in an unsupervised setting, in which no labeled examples are given. Furthermore, different ν values were found optimal, in retrospect, for different words. Such optimization of classification

performance is an inherent problem for the unsupervised one-class setting, unlike the

standard supervised setting in which both positive and negative examples are utilized to

optimize models uniformly during training. Second, when the source word is ambiguous

then only one (or few) of its senses can be substituted with the target word. However, our

one-class algorithm was trained on all examples of the source word, which include

examples of irrelevant senses of the source word, yielding noisy training sets.

For an example, see Table 4.


Sentence | Sense | Appropriate / noisy

What level is the office on? | floor | noisy

A high level of care is required | degree, amount | acceptable

Table 4: A noisy training example and an appropriate training example for the source word 'level' and the target word 'degree'.

4.3.2.2 Direct approach: kNN-based ranking

Consequently, we also developed an unsupervised ranking method which is based on the

k Nearest Neighbors (kNN) principle. To avoid the first problem of optimizing

classification performance we decided to focus at this stage on the ranking goal. That is,

we aim to score all test examples of a target word such that the positive ones will tend to

appear at the top of the ranked list, but without identifying an optimal classification

boundary. (This method was also evaluated by the classification measure, so that we would be able to compare the two unsupervised direct methods: one-class and kNN.)

The second problem of source word ambiguity is addressed by the choice of a kNN

approach. In this approach the score of a test example is determined only by the most

relevant subset of source word occurrences, which are likely to correspond to the relevant

sense of the source word. More concretely, we store in memory all training examples of

the source word, represented by our standard feature set. The score of a test example of

the target word is computed as the average similarity between the test example and the k

most similar training examples. Finally, all test examples of the target word are ranked by

these scores.


The rationale behind this method is that if the sense of the target test example matches

the source then there are likely to be k occurrences of the corresponding source sense that

are relatively similar to this target example. On the other hand, if the target example has a

sense that does not match the source word then it is likely to have lower similarity with

the source examples.

The disadvantage of this algorithm is that it uses all the training data at test time, which makes it memory- and time-expensive, since similarities must be calculated between the test example and all training examples. We tried to improve the algorithm in

these two aspects by building an index for the training data of every source word. The

pseudo code of the improved algorithm appears in the following figure, where the

numbers in brackets below refer to the code lines. The index is implemented as a hash table where the key is a feature number and the value is the list of indices of the sentences that contain that feature (1). After building the index, the similarities are calculated (2-5). The index removes the need to calculate similarities against the entire training set. Instead, we calculate similarities only for the sentences that share at least one feature with the test sentence: the algorithm loops over the features of the test sentence (3) and, for every feature, calculates the similarity between the test sentence and the sentences that were hashed in that entry (4-5). This way we save both time and cache memory, since we do not need to load all the training data into the cache.


Figure 1: Pseudo code for our kNN scoring algorithm

1  build index I for the training data set
2  for each example X in the test data do
3      for each feature xi in example X
4          for each training example Dj in index entry I[xi]
5              calculate sim(X, Dj);
6      find the K largest scores of sim(X, Dj);
7      calculate sim_avg over the K nearest neighbors;
8      return sim_avg
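A runnable Python rendering of this pseudo code, under the assumption that each example is represented as a set of binary feature identifiers and that cosine similarity is used, might look as follows:

import heapq
from collections import defaultdict
from math import sqrt

def build_index(training_examples):                 # line 1: feature -> training sentence ids
    index = defaultdict(list)
    for j, example in enumerate(training_examples):
        for feature in example:
            index[feature].append(j)
    return index

def knn_score(X, training_examples, index, K=10):
    if not X:
        return 0.0
    candidates = {j for f in X for j in index.get(f, [])}     # lines 3-4: only sentences sharing a feature
    sims = [len(X & training_examples[j]) /
            (sqrt(len(X)) * sqrt(len(training_examples[j])))  # line 5: cosine similarity for binary features
            for j in candidates]
    top = heapq.nlargest(K, sims)                             # line 6: K largest scores
    return sum(top) / len(top) if top else 0.0                # lines 7-8: average over the nearest neighbors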


5. Evaluation

5.1 Evaluation measures

As we described in section 4, two different goals may be set for sense matching methods:

classification and ranking. To get realistic and comprehensive evaluation of the methods,

we used two evaluation measures, one for each goal.

5.1.1 Classification measure

For binary sense matching, and the corresponding lexical substitution setting, the

standard WSD metrics (Mihalcea and Edmonds, 2004) are less suitable because we are

interested in the binary decision of whether the target word matches the sense of a given

source word.

For this reason we decided to adopt an Information Retrieval evaluation schema,

where Precision, Recall and F1 are estimated as follows:
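Taking the positive (sense-matching) class as the target class, with TP, FP and FN denoting true positives, false positives and false negatives respectively, the standard definitions are:

P = TP / (TP + FP),    R = TP / (TP + FN),    F1 = 2 * P * R / (P + R)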

In the following section we report micro-averaged results for these measures on our test

set.

5.1.2 Ranking measure

This measure is very popular in Information Retrieval (IR) and Question Answering

(QA) systems, for which the lexical expansion setting is targeted. It quantifies the

system's ability to rank examples for a given source word, preferring a ranking which

ranks correct examples before negative ones. A perfect ranking would place all the


positive examples before all the negative examples. Average precision is a common

evaluation measure for system rankings, and is computed as the average of the system's

precision values, at all points in the ranked list where recall increases (Voorhees and

Harman 1999). In our case, the points where recall increases correspond to positive test

examples. More formally, it can be written as follows:

AveP = (1/R) * Σ_{i=1..n} E(i) * correct(i)/i

where n is the number of examples in the test set, R is the total number of positive examples in the test set, E(i) is 1 if the i-th example is positive and 0 otherwise, correct(i) is the number of positive examples among the top i ranked examples, and i ranges over the examples, ordered by their ranking from the highest down.

This average precision calculation outputs a value in the 0-1 range, where 1 corresponds

to perfect ranking. This value corresponds to the area under the non-interpolated recall-

precision curve for the target word. Mean Average Precision (MAP) is defined as the

mean of the average precision values for all test words.
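A small sketch of these two measures, assuming for each target word a list of gold labels ordered by the system's ranking (True = positive), is given below; the function names are ours:

def average_precision(ranked_labels):
    # ranked_labels: gold labels ordered from the highest-scored example down.
    R = sum(ranked_labels)
    if R == 0:
        return 0.0
    correct, total = 0, 0.0
    for i, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:                 # recall increases only at positive examples
            correct += 1
            total += correct / i
    return total / R

def mean_average_precision(rankings_per_word):
    return sum(average_precision(r) for r in rankings_per_word) / len(rankings_per_word)

# e.g. average_precision([True, False, True, False]) == (1/1 + 2/3) / 2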

5.2 Classification measure results

5.2.1 Baselines

Following the Senseval methodology, we evaluated two different baselines for

unsupervised and supervised methods. The random baseline, used for the unsupervised

algorithms, was obtained by choosing either the positive or the negative class at random, resulting in P = 0.262, R = 0.5, F1 = 0.344. The Most Frequent baseline was used for the supervised algorithms and is obtained by assigning the positive class when the percentage of positive examples in the training set is above 50%, resulting in P = 0.65, R = 0.41, F1 = 0.51.


Method | Approach | P | R | F1
Most Frequent Baseline | - | 0.65 | 0.41 | 0.51
Multiclass SVM | Indirect | 0.59 | 0.63 | 0.61
Binary SVM (J = 0.5) | Direct | 0.80 | 0.26 | 0.39
Binary SVM (J = 1) | Direct | 0.76 | 0.46 | 0.57
Binary SVM (J = 2) | Direct | 0.68 | 0.53 | 0.60
Binary SVM (J = 3) | Direct | 0.69 | 0.55 | 0.61

Table 5A: Supervised methods

Method | Approach | P | R | F1
Random Baseline | - | 0.26 | 0.50 | 0.34
Lesk | Indirect | 0.24 | 0.19 | 0.21
One-Class (ν = 0.3) | Direct | 0.26 | 0.72 | 0.39
One-Class (ν = 0.5) | Direct | 0.29 | 0.56 | 0.38
One-Class (ν = 0.7) | Direct | 0.28 | 0.36 | 0.32
One-Class (ν = 0.9) | Direct | 0.23 | 0.10 | 0.14

Table 5B: Unsupervised methods

Table 5: Classification results on the sense matching task

5.2.2 Supervised Methods

Both the indirect and the direct supervised methods presented in Subsection 4.2 have

been tested and compared to the most frequent baseline.

Indirect. For the indirect methodology we trained the supervised WSD system for each

target word on the sense-tagged training sample. As described in Subsection 4.2, we implemented a simple SVM-based WSD system and applied it to the sense-matching task. Results are reported in Table 5A. The indirect strategy surpasses the most frequent baseline F1 score, but the achieved precision is still below it. We note that


in this multi-class setting it is less straightforward to trade off recall for precision, as all

senses compete with each other.

Direct. In the direct supervised setting, sense matching is performed by training a binary

classifier, as described in Subsection 4.2.

The advantage of adopting a binary classification strategy is that the precision/recall

tradeoff can be tuned in a meaningful way. In SVM learning, such tuning is achieved by

varying the parameter J, which allows us to modify the cost function of the SVM learning algorithm. If J = 1 (default), the weight for the positive examples is equal to the weight

for the negatives. When J > 1, negative examples are penalized (increasing recall), while,

whenever 0 < J < 1, positive examples are penalized (increasing precision). Results

obtained by varying this parameter are reported in Figure 2.

Figure 2: Direct supervised results varying J
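As a rough analogue of this mechanism (the thesis used SVMlight's own cost-factor option), the same tradeoff can be sketched with scikit-learn by re-weighting the positive class; the parameter name J below is illustrative:

from sklearn.svm import LinearSVC

def direct_matcher_with_tradeoff(X_train, y_train, J=1.0):
    # y_train is assumed to hold binary labels 0/1.
    # J > 1 weights errors on positive training examples more heavily (raising recall);
    # 0 < J < 1 weights them less (raising precision), mirroring the behavior described above.
    return LinearSVC(class_weight={1: J, 0: 1.0}).fit(X_train, y_train)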


Adopting the standard parameter settings (i.e. J = 1, see Table 5A), the F1 of the system

is slightly lower than for the indirect approach, while it reaches the indirect figures when

J increases. More importantly, reducing J allows us to boost precision towards 100%.

This feature is of great interest for lexical substitution, particularly in precision oriented

applications like IR and QA, for filtering irrelevant candidate answers or documents.

5.2.3 Unsupervised methods

Indirect. To evaluate the indirect unsupervised settings we implemented the Lesk

algorithm, described in Subsection 4.3.1, and evaluated it on the sense matching task. The

obtained figures, reported in Table 5B, are clearly below the baseline, suggesting that

simple unsupervised indirect strategies cannot be used for this task. In fact, the error of

the first step, due to low WSD accuracy of the unsupervised technique, is propagated in

the second step, producing poor sense matching.

Unfortunately, state-of-the-art unsupervised systems are actually not much better than

Lesk on the all-words task (Mihalcea and Edmonds, 2004), discouraging the use of unsupervised indirect methods for the sense matching task.

Direct. Conceptually, the most appealing solution for the sense matching task is the one-

class approach proposed for the direct method (Section 4.3.2). As stated in section 4, in

order to perform our experiments, we trained a different one-class SVM for each source

word, using a sample of its unlabeled occurrences in the BNC as the training set. To

avoid huge training sets and to speed up the learning process, we fixed the maximum

number of training examples to 10000 occurrences per word, collecting on average about


6500 occurrences per word. For each target word in the test sample, we applied the

classifier of the corresponding source word. Results for different values of ν are reported

in Figure 3 and summarized in Table 5B.

Figure 3: One-class evaluation varying ν

While the results are somewhat above the baseline, just small improvements in

precision are reported, and recall is higher than the baseline for ν < 0.6. Such small

improvements may suggest that we are following a relevant direction, even though they

may not be useful yet for an applied sense-matching setting.

Further analysis of the classification results for each word revealed that optimal F1

values are obtained by adopting different values of ν for different words. In the optimal

(in retrospect) parameter settings for each word, performance for the test set is noticeably

boosted, achieving P = 0.40, R = 0.85 and F1 = 0.54. Finding a principled unsupervised

way to automatically tune the parameter ν is thus a promising direction for future work.


Investigating further the results per word, we found that the correlation coefficient

between the optimal ν values and the degree of polysemy of the corresponding source

words is 0.35. More interestingly, we noticed a negative correlation (r = -0.30) between

the achieved F1 and the degree of polysemy of the word, suggesting that polysemous

source words provide poor training models for sense matching. This can be explained by

observing that polysemous source words can be substituted with the target words only for

a strict subset of their senses. On the other hand, our one class algorithm was trained on

all the examples of the source word, which include irrelevant examples that yield noisy

training sets. A possible solution may be obtained using clustering-based word sense

discrimination methods (Pedersen and Bruce, 1997; Schütze, 1998), in order to train

different one-class models from different sense clusters. Overall, the analysis suggests

that it may be possible to obtain in the future better binary classifiers based on unlabeled

examples of the source word.

As the unsupervised-direct approach is the most appealing approach for sense

matching, and we have presented two algorithms which implement that approach, we

would like to compare their results. For this purpose we evaluated the classification

measure of the kNN algorithm as well (although this algorithm will be examined mostly

by the ranking measure). Since the kNN algorithm ranks the test sentences, we need to set

a threshold to separate the negative and positive results, in order to compute the

classification results. Figure 4 shows the Precision, Recall and F1 values for various

values of threshold, with the cosine similarity metric and k=10.


One can see quite clearly that the kNN yields somewhat better results than the one-class SVM algorithm: the optimal F1 achieved by the kNN is 0.42, with a threshold of 0.1, compared to the optimal F1 of the one-class SVM, which is 0.39.

[Figure omitted: plot of Precision, Recall and F1 (y-axis: score) against the classification threshold (x-axis)]

Figure 4: Precision, Recall and F1 of kNN with cosine metric and k=10, for various thresholds

5.3 Ranking measure results

Table 6 summarizes the MAP (Mean Average Precision) results for the supervised direct approach and for the kNN-based unsupervised approach, along with a baseline of randomized ranking averaged over 10 runs. The results indicate that the ranking produced by the kNN method (k = 10) outperforms random ranking, while still being substantially lower than supervised performance.

Method | MAP
Random | 0.36
kNN (Cosine, k = 10) | 0.40
Binary SVM (J = 2) | 0.60

Table 6: Mean Average Precision

Figure 5 provides a closer look at the ranking behavior, plotting the macro averaged

recall-precision curves for each method.

Figure 5: Macro-averaged recall-precision curves

The figure indicates that the KNN-based ranking is better than randomized ranking up to

the 80% recall point. In particular, in the important high-precision range, of up to about

25% recall, the kNN-based method is better than random by 8-18%. That is, kNN does succeed in giving the highest ranks to positive examples substantially better than random. To the best of our knowledge, this is the first time in which such a positive result is

obtained by a method that does not consider any externally-provided information at all,


be it in the form of labeled examples, a sense repository or sense definitions. We

hypothesize that this result can be further improved through better assessment of the

similarity between the target test example and the source training data.

When implementing the kNN algorithm, we tried three similarity measures (Cosine, Jaccard and Lin) and different values of k (10, 50 and 100). There was no significant difference between the results of these attempts, but we still find it valuable to show them in Figures 6 and 7: Figure 6 shows the results of kNN with the Cosine metric for different k values, and Figure 7 shows the results of kNN with different metrics with k = 100.

[Figure omitted: precision vs. recall curves for the Cosine metric with k = 10, 50 and 100]

Figure 6: Results of kNN with different values of k


[Figure omitted: precision vs. recall curves for the Cosine, Jaccard and Lin similarity metrics with k = 100]

Figure 7: Results of kNN with different similarity metrics


6. Conclusion and future work

This thesis defined and investigated the novel sense matching task, which captures

directly the polysemy problem in lexical substitution. We proposed direct approaches for

the task, suggesting the advantages of controlling the precision-recall tradeoff, while avoiding the need for an explicitly defined sense repository. Furthermore, we proposed

novel types of completely unsupervised learning schemes.

To obtain realistic and comprehensive evaluation of the methods, we used two

evaluation measures, classification and ranking, which correspond to two different

goals for the sense matching task. In both measures the methods yielded better results

than the baselines. In particular, positive results for both measures were obtained by the

kNN method, which does not require any form of external information. We speculate that

with these encouraging results there is a great potential for such approaches, to be

explored in future research.

We recall here that the algorithms we suggested were aimed at handling

one case of source-target mismatch – the first case that was mentioned in the

introduction, where the target word had the wrong sense in a given context. The same

algorithms could be used when switching roles between source and target, to handle the

second case of mismatch, where the target word was selected according to a wrong sense

of the source word.

We focused on the direct unsupervised approach as our goal. Possible future improvements may be achieved, for example, by adding weights to the features, or by creating negative examples in the training data, using occurrences of the target words as negative examples while occurrences of the source words make up the positive ones. This idea needs further


research, since it induces much noise that should be handled. Additionally, ideas for other

methods have come up during the research, such as the automatic clustering of word instances by contexts, e.g. what Schütze (1998) termed sense discrimination: two words would be considered to be used in the same sense if they fall within the same cluster. We hope that the ideas we initiated in this research will lead to further

research in this area, and to valuable progress in the task of lexical substitution.


7. References

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144-152, Pittsburgh, PA, 1992..

Caraballo, Sharon A. 1999. Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text. In 37th Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference, pages 120-126.

Chaves R. P. 2001. WordNet and Automated Text Summarization. In Proceedings of the 6th Natural Language Processing Pacific Rim Symposium (NLPRS-01). Tokyo, Japan.

Christopher J. C. Burges. "A Tutorial on Support Vector Machines for Pattern Recognition". Data Mining and Knowledge Discovery 2:121 - 167, 1998

Ido Dagan,. 2000. Contextual Word Similarity, in Rob Dale, Hermann Moisl and Harold Somers (Eds.), Handbook of Natural Language Processing, Marcel Dekker Inc, 2000, Chapter 19, pp. 459-476

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The pascal recognising textual entailment challenge. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

Ido Dagan, Oren Glickman, Alfio Gliozzo, Efrat Marmorshtein and Carlo Strapparava. 2006. Direct Word Sense Matching for Lexical Substitution, COLING-ACL

Ido Dagan, Shaul Marcus and Shaul Markovitch. Contextual word similarity and estimation from sparse data, Computer, Speech and Language, 1995, Vol. 9, pp. 123-152

Belur V. Dasarathy, 1991. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques,

Phillip William Dixon, David Corne and Martin J. Oates. 2003. Replacing Generality with Coverage for Improved Learning Classifier Systems. HIS 2003, pp. 185-193.

C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarran. 1998. Indexing with wordnet synsets can improve text retrieval. In ACL, Montreal, Canada.

Flank S. A layered approach to NLP-based Information Retrieval. 1998. In Proceedings of the ACL / COLING Conference. Montreal, Canada.


Gasperin, Caroline and Renata Vieira. 2004. Using Word Similarity Lists for Resolving Indirect Anaphora. In Proc. of ACL-04 Workshop on Reference Resolution. Barcelona, Spain, July, 2004

Gauch, Susan, J. Wang, S. Mahesh Rachakonda. 1999. A Corpus Analysis Approach for Automatic Query Expansion and its Extension to Multiple Databases. ACM Transactions on Information Systems, volume 17(3), pp. 250-250, 1999.

Grefenstette, Gregory. 1994. Exploration in Automatic Thesaurus Discovery. Kluwer Academic Publishers.

Harabagiu, Sanda M., Dan I. Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan C. Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2000. Falcon: Boosting knowledge for answer engines. In Text REtrieval Conference.

Hovy, Eduard H., Ulf Hermjakob, and Chin-Yew Lin. 2001. The use of external knowledge of factoid QA. In Text Retrieval Conference.

T. Joachims. 1999. Making large-scale SVM learning practical. In B. Sch¨olkopf, C. Burges, and A. Smola, editors, Advances in kernel methods: support vector learning, chapter 11, pages 169 – 184. MIT Press.

Lee, Lillian. 1997. Similarity-Based Approaches to Natural Language Processing. Ph.D. thesis, Harvard University, Cambridge, MA.

M. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the ACM-SIGDOC Conference, Toronto, Canada.

Li, L. et al., 2001. Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17, 1131–1142.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 17th international conference on Computational linguistics, pages 768–774, Morristown, NJ, USA. Association for Computational Linguistics.

Lin, Dekang. 1998a. Automatic Retrieval and Clustering of Similar Words. In Proc. of COLING–ACL98, Montreal, Canada, August, 1998.

Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2004. Automatic identification of infrequent word senses. In Proceedings of COLING, pages 1220-1226.

Diana McCarthy, 2002, Lexical substitution as a task for wsd evaluation. In Proceedings of the ACL-02 workshop on Word sense disambiguation, pages 109-115, Morristown, NJ, USA, Association for Computational Linguistics.


R. Mihalcea and P. Edmonds, editors. 2004. Proceedings of SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July.

Mihalcea R. and D. Moldovan. 2000. Semantic Indexing using WordNet Senses. In Proceedings of ACL Workshop on IR and NLP.

D. Moldovan and R. Mihalcea. 2000. Using wordnet and lexical operators to improve internet searches. IEEE Internet Computing, 4(1):34–43, January.

M. Negri. 2004. Sense-based blind relevance feedback for question answering. In SIGIR-2004 Workshop on Information Retrieval For Question Answering (IR4QA), Sheffield, UK, July.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically Labeling Semantic Classes. In Proceedings of Human Language Technology / North American chapter of the Association for Computational Linguistics (HLT/NAACL-04). pp. 321-328. Boston, MA.

T. Pedersen and R. Bruce. 1997. Distinguishing word senses in untagged text. In EMNLP, Providence, August.

M. Sanderson. 1994. Word sense disambiguation and information retrieval. In SIGIR, Dublin, Ireland, June.

Ruge, Gerda. 1992. Experiments on linguistically-based term associations. Information Processing & Management, 28(3), pp. 317–332.

Salton, G. and McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw Hill.

Scott, S. and S. Matwin. 1998. Text classification using WordNet hypernyms. In Proceedings of the COLING / ACL Workshop on Usage of WordNet in Natural Language Processing Systems. Montreal, Canada.

B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural Computation, 13:1443-1471.

Shakhnarovich, Darrell, and Indyk. 2005. Nearest-Neighbor Methods in Learning and Vision. The MIT Press.

H. Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97-124.

H. Schütze and J. Pedersen. 1995. Information retrieval based on word senses. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas.


E. Voorhees and D. Harman, editors. 1999. Proceedings of the Seventh Text REtrieval Conference (TREC-7), Gaithersburg, MD, USA, July. NIST Special Publication.

E. Voorhees. 1993. Using WordNet to disambiguate word sense for text retrieval. In SIGIR, Pittsburgh, PA.

E. Voorhees. 1994. Query expansion using lexical semantic relations. In Proceedings of the 17th ACM SIGIR Conference, Dublin, Ireland, June.

Weeds, Julie, D. Weir, and D. McCarthy. 2004. Characterizing Measures of Lexical Distributional Similarity. In Proc. of Coling 2004. Switzerland, July, 2004.

Y.Yang, J.O. Pederson, A comparative study on feature selection in text categorization, International Conference on Machine Learning (ICML), 1997.

D. Yarowsky. 1994. Decision lists for lexical ambiguity resolution: Application to accent restoration in spanish and french. In ACL, pages 88–95, Las Cruces, New Mexico.
