
Ontologies and Query expansion

Agissilaos Andreou

Master of Science

School of Informatics

University of Edinburgh

2005

Abstract

This master's thesis explores the use of ontologies in information retrieval, and in query expansion in particular. Ontologies are typically large, hand-coded repositories of concepts and the relations between them, so using them in information retrieval seems a reasonable goal. We feel that the use of ontologies for query expansion in particular has been overlooked in the contemporary literature, as the main related papers date from before 2000. In this thesis we attempt to present a query expansion method using ontologies that outperforms non-ontological query expansion methods. Note, however, that the presented approach is not purely ontological but rather a hybrid approach, as it uses non-ontological steps. We also propose a method for purely probabilistic query expansion that outperforms all methods tested. Finally, we explore word sense disambiguation based on ontologies, as that is a prerequisite step for ontological query expansion. The ontology used was WordNet. The results of our experiments were based on data from standard TREC conferences and showed that an ontological approach can yield improvements over non-ontological methods.


Acknowledgements

To my Greek professors, who taught me how to think, and my British professors, who taught me how to actually work.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is

my own except where explicitly stated otherwise in the text, and that this work has not

been submitted for any other degree or professional qualification except as specified.

(Agissilaos Andreou)


Table of Contents

1 Introduction

2 Background
  2.1 Ontologies
  2.2 Query Expansion
    2.2.1 Probabilistic Query Expansion
    2.2.2 Ontological Query Expansion
  2.3 Ideal query
  2.4 Semantic similarity measures for ontologies
  2.5 Ontology based Word Sense Disambiguation

3 Methodology
  3.1 Probabilistic query expansion
  3.2 Ontology based Word Sense Disambiguation
  3.3 Re-ranking of expansion terms based on ontologies
    3.3.1 Boosting based on relation to query concepts
    3.3.2 Boosting based on importance measure drawn from hierarchies
    3.3.3 Boosting based on network importance measure

4 Implementation
  4.1 Modules used
  4.2 Interactive version
  4.3 Batch processing version
  4.4 Various visualisation tools

5 Evaluation and Results
  5.1 TREC tracks
  5.2 Ontology based word sense disambiguation
  5.3 Query expansion
    5.3.1 Probabilistic Query expansion
    5.3.2 Ontological Query expansion
    5.3.3 Hybrid Query expansion

6 Discussion and Conclusions
  6.1 Ideal query
  6.2 Probabilistic methods
  6.3 Pure ontological query expansion
  6.4 Hybrid query expansion
    6.4.1 Boosting based on relation to query concepts
    6.4.2 Boosting based on network importance measure
    6.4.3 The effect of the probabilistic method
  6.5 A note on our adapted version of Pagerank

7 Summary

Bibliography

Chapter 1

Introduction

In recent years the growth of the World Wide Web, both in content and in users, and the vast improvement in search engine technology have radically changed the way knowledge and information are collected and shared. Gathering information has never been so easy and open to such a wide audience as it is today. However, there is still a significant number of cases where the results obtained through a search engine contain a high proportion of irrelevant documents. Ordinary web users in many cases simply do not know how to create effective queries, and even more experienced users usually cannot create good queries when moving to an unknown domain.

An alternative approach to keyword-based information retrieval (IR) for the web is the so-called Semantic Web (SW). The Semantic Web uses ontologies as a structured representation of knowledge to improve information retrieval and to assist both humans and machines in finding information in web pages. However, despite the significant effort in recent years to make the Semantic Web a reality, several issues are preventing its growth. One of the main drawbacks of the Semantic Web as an IR system is that it requires the semantic annotation of every document it can use. This process has proven to be a significant bottleneck in the deployment of the SW, and although several semi-automatic methods have been proposed, it is thought likely to hold back the growth of the SW in the near future.

In our research we focus on using ontologies with the standard web, and more specifically on using ontologies for query expansion. Query expansion is the process of augmenting the user's query with additional terms in order to improve results. For example, for the query "mad cow disease", the terms "Creutzfeldt Jakob" might be automatically added so that pages that contain these additional terms along with the original terms get a higher ranking. Although this thesis is focused on query expansion, we also describe related uses of ontologies, namely using ontologies for evaluating semantic similarity and for word sense disambiguation.

There is a significant and successful non-ontological literature on query expansion, which we attempt to take into account. Probabilistic methods have proven to be the predominant approach to query expansion in the most important IR conferences, namely SIGIR (http://www.acm.org/sigir) and TREC (http://trec.nist.gov).

The main motivation for query expansion is, needless to say, to improve results by including terms that lead to retrieving more relevant documents. There is an issue, however, as to what constitutes a good expansion term. Terms that are similar and relevant to query terms are usually considered good terms for expansion. However, note that this is not always the case, as we describe later in this thesis. A proposed probabilistic method that uses a criterion other than relatedness to query terms performs better than methods attempting to detect relations to query terms. Moreover, by analysing ideal queries we found that optimal terms tend to form semantic clusters; sometimes, however, these clusters are not related to the query terms. More precisely, they are related to the query but only under the particular context of the specific query, and this relation cannot be captured by general semantic similarity and relatedness measures. For example, in the query "mad cow disease" the terms "Britain, British, European, France" form a semantic cluster and are very good for expansion. Nevertheless, they are not semantically close to "mad cow disease" and their relation to the query is difficult to capture. Is it that "mad cow disease is a disease, diseases break out in specific locations, and these are the locations", or is it that "mad cow disease is about cows, but which cows? British and European ones"?

There are two main strategies for finding expansion terms: the first is to add related terms based on some automatic relatedness measure, and the second is based on relevance feedback. Relevance feedback involves identifying which documents are relevant and then selecting the terms that lead to a query that best distinguishes relevant from irrelevant documents. Because relevance feedback requires the user to select which documents are relevant, it is quite common to use pseudo-relevance feedback. Pseudo-relevance feedback does not involve the user and assumes that all top-n documents retrieved by an initial query are relevant. There is a variety of methods, based on what relevance metric is used and how the ideal query is extracted from the pseudo-relevance feedback data. We review some of these methods in the background chapter. In the rest of the dissertation we focus on a novel hybrid method that uses both pseudo-relevance feedback and relatedness drawn from the ontology.

Query expansion has some inherent dangers. The main ones are related to a phenomenon named query drift, that is, moving the query in a direction away from the user's intention. This happens frequently when the query is ambiguous. For example, the query "windows" might be about actual windows in houses or about the Microsoft Windows operating system. A system might choose an interpretation different from the user's intention and augment the query with terms related to the wrong interpretation. This kind of query drift is quite common in ontological methods and stresses the importance of disambiguating the query terms and the query in general. Indeed, most ontological methods include a disambiguation preprocessing step. In this thesis we describe some methods for the disambiguation of query terms using ontologies.

A specific kind of query drift is called outweighting and is well described by Mahler (2003). Outweighting refers to the phenomenon where the augmentation terms are strongly related to individual query terms but not to the overall query. For example, the query "dogs training" might be augmented with terms such as "Poodle, Retriever, Setter, jogging, weights" instead of "obedience, sit, heel, leash, reward". Ontological query expansion methods are prone to this kind of error, but the phenomenon can be observed in statistical methods too.

An issue specific to ontological methods is that specific types of relations in the ontology direct the query in specific directions. For example, most ontologies include is-a and part-of relations. Thus, if a query about "car accidents" is expanded via is-a relations, the terms "vehicle event" could be added, since "car is-a vehicle" and "accident is-a event", but this would lead the query in a direction that could include train accidents or car breakdowns. Similarly, expanding based on part-of relations makes the query focus on the structure of the things discussed in the query.


Despite these problems, query expansion actually works and significantly improves the average performance of IR systems. This finding is well documented in the literature and is apparent from the results of this research. However, because of these dangers, query expansion degrades performance in unpredictable ways on some queries, and many IR systems do not deploy query expansion at all, or use very cautious expansion approaches.

In the rest of the thesis we will test the following hypotheses:

H1: A hybrid query expansion method that re-ranks query expansion terms suggested by a probabilistic method based on relatedness drawn from the ontology outperforms the original probabilistic method.

H2: Terms with a near uniform distribution (high entropy) of term frequencies in the top documents returned by an initial query are good expansion terms.

H3: The correct senses of query terms will be more important in a network which has as nodes the terms extracted from a probabilistic method and as edges the semantic similarity between those terms.

Our main focus is testing H1. The remaining hypotheses are used in our exploration of H1, and their examination in this thesis is neither thorough nor complete. The inclusion of H3 is justified on the grounds that the disambiguation approach greatly affects the performance of ontological methods. H3 is contrasted in this thesis with other disambiguation approaches that use the same semantic similarity measures but only mutually disambiguate query terms and do not take into account information extracted from the actual documents returned by the query. Using networks, and importance measures on them, is an attempt to successfully incorporate this additional information into the disambiguation process.

For the evaluation we used data from the TREC-2003 HARD track, using Lucene as an information retrieval engine (http://lucene.apache.org/).

The rest of the thesis is organised as follows. In the background chapter we review some of the dominant query expansion methods and cover some issues used in the rest of the thesis. In the methodology chapter we describe our proposed methodologies for query expansion and disambiguation. In the implementation chapter we give the details of how we implemented the methodologies. In the evaluation and results chapter we describe how we evaluated the methodologies and present the results. In the discussion and conclusions chapter we comment on the results. Finally, we include a summary of this thesis as a last chapter.

Chapter 2

Background

2.1 Ontologies

Ontologies provide a structured way of describing knowledge. According to Gruber (1993), an ontology is an "explicit specification of a conceptualisation". Philosophically speaking, ontology is the "metaphysical study of the nature of being and existence" (WordNet). Practically speaking, ontologies can be seen as special kinds of graphs describing the entities that exist in a domain, their properties and the relations between them. The basic building blocks of ontologies are concepts and relationships.

Concepts (or classes, or categories, or types) can be thought of as sets and appear as nodes in the ontology graph. Concepts in ontologies usually have a textual description defining them, although some ontologies include a formal definition in some kind of logic as well. In almost every ontology, concepts are described by one or more terms. Note that each concept might have more than one term describing it and that a term need not match only one concept. For example, to describe the concept of bicycle the terms "bicycle" and "bike" can be used. However, the term "bike" might also refer to the concept of motorcycle. Usually, ontologies include a single, unambiguous term for each concept. This might be more appropriate for specifying and sharing knowledge; however, it is not usually good for detecting concepts in text, because in real text the same concepts are usually referred to with many different terms. Furnas et al. (1987) describe an experiment showing that people use the same term to describe the same concept less than 20% of the time. Mapping a term found in text to a unique ontology concept is one of our main goals in this thesis.

Relationships are usually of a specific type and connect two or more concepts. Most ontologies include is-a (or subclass, or hyper/hyponymic) relationships between concepts, e.g. "car is-a vehicle". Many ontologies include a part-of (or holo/meronymic) relationship, e.g. "Earth is-part-of the Solar-System". Ontologies usually include other types of relationships as well, but we will focus on these two because they can be found in almost any ontology and can be used to create hierarchies, which we will use in our approach. Note that there are some issues in creating hierarchies from part-of relations, regarding the transitivity implied by hierarchies. For example, "my foot is part-of me" and "I am part-of a committee", thus we are led to the rather strange conclusion that "my foot is part-of a committee". This phenomenon is caused when different types of part-of relationships are mixed, as is well described in (Winston et al., 1987).

Throughout this thesis we will use WordNet (http://wordnet.princeton.edu/) as our ontology. However, we specifically avoided the use of any ontology-specific features, so that our approach can easily be applied to other ontologies. Concepts in WordNet are called synsets, that is, synonym sets. Usually, in the context of WordNet, concepts are referred to as senses. The terms describing each concept are the synonyms contained in the synset. For example, the synset of bicycle is "bicycle, bike", thus both terms represent the same concept. The definition of a concept in WordNet is called a gloss. The main relations in WordNet are is-a relations and part-of relations. Other relations exist as well, such as domain, pertains-to, similar, see-also, etc.; however, they are quite sparse and not worth dealing with independently. Parents in an is-a relationship, such as "vehicle" in "car is-a vehicle", are called hypernyms and children are called hyponyms, i.e. vehicle is a hypernym of car and car a hyponym of vehicle. Parents in a part-of relationship, such as "car" in "car has-an engine", are called holonyms and the children meronyms. WordNet roughly distinguishes between different types of the part-of relation and is thus suitable for creating hierarchies. The types of part-of relations used in WordNet are:

• member sense (hmem/mmem): a professor is-a-member-of staff

• substance sense (hsub/msub): tears are-made-of water

• all other senses (hprt/mprt): China is-part-of Asia, an amusement park has rides, etc.

2.2 Query Expansion

There are two main approaches to query expansion covered in the literature. The dominant one is probabilistic query expansion. Probabilistic query expansion is usually based on calculating co-occurrences of terms in documents and selecting the terms that are most related to the query terms. Ontological methods suggest an alternative approach, which uses semantic relations drawn from the ontology to select terms. In this section we compare the probabilistic and ontological methods and then present some methods from which we have drawn ideas.

2.2.1 Probabilistic Query Expansion

An excellent review of early probabilistic methods can be found in the introduction section of (Xu and Croft, 2000), in the related work section of (Hang et al., 2002) and in section 2 of (Carpineto et al., 2001). Here we provide a summary of those reviews and introduce some more recent methods. Most probabilistic methods can be categorised as global or local. Global techniques extract their co-occurrence statistics from the whole document collection and can afford to be resource-intensive, as the calculations can be performed offline. Local techniques extract their statistics from the top-n documents returned by an initial query and might use some corpus-wide statistics such as the inverse document frequency, but they must be fast because they delay the response of the system. All calculations for local methods are done online, just after the user supplies the query and before the results are presented to the user.

One of the first successful global analysis techniques was term clustering (Jones, 1971). Term clustering is based on the association hypothesis, namely that terms related in some corpus tend to co-occur in the documents of that corpus. Using this hypothesis, terms were clustered based on their co-occurrences, and expansion terms were selected from the clusters which contained the query terms. Other well-known global techniques include Latent Semantic Indexing (Deerwester et al., 1990) and Phrasefinder (Jing and Croft, 1994). These techniques use different methods to build a similarity matrix of terms and select the terms that are most related to the query terms in that matrix.

Local analysis can be traced at least back to (Attar and Fraenkel, 1977), which used an approach similar to term clustering to select expansion terms, except that, since it was a local method, the clusters were created from terms in the top-n results of an initial query. Local techniques are based on the hypothesis that the top-n documents are relevant to the query. This assumption is called pseudo-relevance feedback and has proven to be a simple but effective assumption to make. However, it can cause significant variance in performance, depending on whether the documents retrieved by the initial query were actually relevant.

Most local analysis methods use the notion of Rocchio's (Rocchio, 1971) ideal query as a starting point. This method is discussed in more detail later in this chapter and can be described as a method for finding the query that has maximum similarity to relevant documents and minimum similarity to irrelevant documents.

Several methods have been proposed which differ in how they select the terms from the top-n documents and in their attempts to minimise the effect of irrelevant documents returned by the initial query (Mitra et al., 1998; Lu et al., 1997; Buckley et al., 1998). However, the most successful local analysis method of this kind is Local Context Analysis (Xu and Croft, 2000), which we will present in more detail, as it is one of the most successful query expansion methods and we are going to evaluate it in this thesis.

2.2.1.1 Local Context Analysis

Local context analysis (LCA) (Xu and Croft, 2000) is a local technique but uses a method for selecting terms which is more similar to those found in global techniques (specifically Phrasefinder). More specifically, expansion terms are selected not based on their frequencies in the top-ranked documents but rather on their co-occurrences with the query terms. Alternatively, this can be seen as a method that implicitly clusters the top-ranked documents and selects terms that appear in the most relevant cluster. The relevance of a cluster is measured by the term frequency of the query terms in that cluster. Consider a single-term query: if one term appears with term frequency tf1 in document d1 and another term appears with the same term frequency tf1 in document d2, but the query term frequency is higher in d1 than in d2, then the first term will get a higher score, as it appears in a presumably more relevant document "cluster". In this way, local context analysis overcomes the problem of irrelevant initial documents to some extent and produces better results.

In the term scoring formula this is expressed as the replacement of the standard tf*idf (term frequency) measure with a measure of the degree of co-occurrence. A simplified version of the LCA formula is the following:

\[
\prod_{w_i \in Q} \frac{\sum_{d \in S} tf(term, d) \cdot tf(w_i, d)}{N} \cdot idf(term) \tag{2.1}
\]

where d is a document, S is the set of top-n documents returned by an initial query, N is the number of documents in S (a normalising factor with no effect on the ranking), w_i is the i-th term of the query, and Q is the query. Comparing this formula to standard tf*idf term selection, as expressed in the formula:

\[
\frac{\sum_{d \in S} tf(term, d)}{N} \cdot idf(term) \tag{2.2}
\]

shows that LCA weights the term frequencies by the frequency of the query terms, so that terms that appear with higher frequencies in documents where the query term frequencies are high get a better score. Moreover, there is an attempt to favour terms that co-occur with all query terms at the same time.
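To make equation 2.1 concrete, the following minimal Java sketch computes the simplified LCA score for a candidate term. The data structures (per-document term frequency maps and an idf table) and all names are our own illustration, not the original LCA implementation:

import java.util.List;
import java.util.Map;

// A sketch of the simplified LCA score of equation 2.1: for each query
// term, sum the candidate term's frequency weighted by the query term's
// frequency over the top documents, multiply across query terms, and
// scale by the candidate's idf.
public class LcaScore {
    public static double score(String term, List<String> queryTerms,
                               List<Map<String, Integer>> topDocs,
                               Map<String, Double> idf) {
        int n = topDocs.size();
        double product = 1.0;
        for (String w : queryTerms) {
            double coOccurrence = 0.0;
            for (Map<String, Integer> doc : topDocs) {
                // weight the candidate's frequency by the query term's frequency
                coOccurrence += doc.getOrDefault(term, 0) * doc.getOrDefault(w, 0);
            }
            product *= coOccurrence / n;  // a zero co-occurrence zeroes the score
        }
        return product * idf.getOrDefault(term, 0.0);
    }
}

The product over query terms is what favours terms that co-occur with all query terms at once: failing to co-occur with even one query term drives the score to zero.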

There is an inherent problem with methods based on term frequency, and it is more apparent with LCA because of the weighting process: these methods are biased towards terms contained in small documents. Usually a term will occur once or twice in a document, so the actual value of the term frequency depends on the size of the document. Small documents will have high term frequencies for all the terms contained in them, so terms contained in small documents get an unreasonably high score. In (Xu and Croft, 2000) this is addressed by using fixed-length passages instead of documents in the scoring function.

LCA also has some parameters that affect performance, namely:

• the number of top documents (passages) used,

• the number of terms selected, and

• the weighting scheme of the selected terms.

In the original paper the top 100 documents were used, the 70 top-scoring expansion terms were included for expansion, and there was a weighting scheme where each expansion term had a different weight based on its rank according to the score; lower-ranked expansion terms get lower weights.

In general, LCA is probably one of the most successful and well-established query expansion methods. Results in experiments show an improvement in average precision of more than 20%.

2.2.1.2 Other probabilistic methods

Another very successful query expansion method, which uses information from query logs, is described in (Hang et al., 2002). Using query logs is very attractive because they can be used to train the parameters of any model. Nevertheless, access to query logs is required, and thus such approaches cannot be used from the initial deployment of a system, but could rather be used to adjust its performance as the system is being used.

An alternative approach, which attempts to use information-theoretic measures for query expansion, is described in (Carpineto et al., 2001). The main hypothesis of this method is that the difference between the distribution of terms in a set of relevant documents and the distribution of the same terms in the overall document collection reveals the semantic relatedness of those terms to the query. More specifically, the frequency of appropriate terms is expected to be higher in relevant documents than in the whole collection. Kullback-Leibler divergence is used to measure the difference between the distributions, and the overall results reported are very good and comparable to LCA.

Finally, the last probabilistic approach we will present is that of Holistic Query Expansion (Mahler, 2003). This method was actually developed to answer relation questions. Unlike the other methods, which build a similarity matrix based on co-occurrences in documents, this method uses the notion of an explicit similarity measure to build a graph of terms and selects terms from the resulting graph. Several similarity measures were tested, as were several methods for selecting terms from the graph. The results reported for this method are not comparable to the other methods, as it was tested on a different corpus. However, it is important to note that this approach uses an explicit similarity measure and graphs for selecting terms, and that makes it perhaps the closest to ontological methods.

2.2.2 Ontological Query Expansion

Probabilistic methods are attractive because they are effective and because the relations are easily generated from the document collection. However, there is a significant number of manually edited large repositories of relations between concepts stored in ontologies, and using those data for query expansion is covered in the literature. Most approaches use large lexical ontologies (usually WordNet or Cyc [http://www.cyc.com]) because they are not domain-specific and because their relations are not sparse.

Using ontologies for query expansion dates back at least to (Voorhees, 1994). In her paper, Ellen Voorhees outlines a method for using ontologies for query expansion that is adopted by most of the subsequent research:

• First, the query terms are disambiguated so that they map to unique ontology concepts.

• Then, terms related in the ontology to the disambiguated concepts are added to the query.

Usually in the literature this is followed by an analysis of the effect of specific ontological relations on the results.

2.2.2.1 Disambiguation

As we have already noted, concepts in ontologies need not be described by a single term; usually each concept is described by several synonyms. In some cases the converse is also true: a single term (word) might be used to describe more than one concept. In such an event the system must disambiguate the term so that it matches a unique ontology concept. (Voorhees, 1994) manually disambiguated the concepts, as the main goal was to prove that an ontological expansion method would be helpful in the first place. Automatic disambiguation methods are suggested by more recent papers such as (Navigli and Velardi, 2003). The importance of disambiguation for ontological query expansion methods, and for information retrieval effectiveness in general, is discussed in great detail in the literature. (Sanderson, 1994) and (Gonzalo et al., 1998) used different evaluation approaches, but both agreed that in order to achieve any improvement an error rate in WSD of less than 10% is required. However, this result was questioned by more recent research, (Stokoe et al., 2003) and (Navigli and Velardi, 2003), on the grounds that better strategies for disambiguation and better expansion methods can be used. We will return to this issue in the discussion chapter.

2.2.2.2 Term Selection

After disambiguating the terms, most methods go on to select terms that are related to the disambiguated concepts by direct relations in the ontology. Usually specific kinds of relations are tested ("synonyms", "synonyms and hyponyms", "synonyms, hyponyms and hypernyms", "meronyms", etc.) along with a method that mixes the various relations.

Note, however, that we came across no attempt to actually verify that a relation holds when multiple options exist. In some cases the same term maps to concepts that have more than one relation to query terms. For example, in WordNet "human" is both a sibling of "animal" under "organism" and a hyponym of "animal". The ontological methods we encountered do not distinguish such cases: if one sense of a term is a hyponym of a query concept, the term is used for expansion just as unambiguous concepts are. Moreover, we came across no attempt to combine relations; all approaches perform a per-relation analysis.

In general, the conclusion drawn by most ontological expansion research is well stated by Voorhees (1994):

The most useful relations for query expansion are idiosyncratic to the particular query in the context of the particular document collection.

Query expansion terms selected by all of these methods yield a smaller improvement than the one achieved by the previously mentioned probabilistic methods. Navigli and Velardi (2003) propose a method of expanding with terms appearing in the definitions of the disambiguated concepts and report an improvement comparable to that of the probabilistic methods. However, many artifact relations find their way into the query. For example, in a query about "uniforms in public schools", where the definition of public school is "a free school supported by taxes and controlled by a school board", the word "tax" finds its way into the query, producing irrelevant results. This finding was also confirmed by our experiments, which showed that although this kind of expansion improves average performance, it is very unstable.

2.3 Ideal query

Since the goal of query expansion is to improve the query, it is useful to know what the ideal query would look like. Moreover, it helps to set an upper bound on the expected performance. To determine the ideal query we use Rocchio's query expansion (Rocchio, 1971) with the actual relevance judgements given by the TREC conference.

Rocchio's query expansion is a method for detecting the ideal query: the query that has maximal similarity with the relevant documents and minimal similarity with the irrelevant ones. Assuming a vector space retrieval model, this query Q is given by the following formula:

\[
Q = \frac{1}{|D_r|} \sum_{d_r \in D_r} d_r - \frac{1}{|D_i|} \sum_{d_i \in D_i} d_i \tag{2.3}
\]

D_r is the set of relevant documents and D_i the set of irrelevant documents. In other words, Rocchio's query expansion finds the average term frequency in relevant documents and the average term frequency in irrelevant documents, subtracts the latter from the former, and thus calculates a per-term weight. In this way, terms that appear with high frequencies in relevant documents and low frequencies in irrelevant documents get a higher weight.

This could end up producing a very large weighted query, so the query can be pruned to the top-n best terms.
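As an illustration of equation 2.3, here is a minimal Java sketch that computes the ideal query vector from term frequency maps. The representation (documents as term-to-frequency maps) and all names are our own simplification:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch of Rocchio's ideal query (equation 2.3): the average term
// weights of the relevant documents minus the average term weights of
// the irrelevant ones.
public class RocchioIdealQuery {
    public static Map<String, Double> idealQuery(List<Map<String, Integer>> relevant,
                                                 List<Map<String, Integer>> irrelevant) {
        Map<String, Double> q = new HashMap<>();
        for (Map<String, Integer> d : relevant) {
            for (Map.Entry<String, Integer> e : d.entrySet()) {
                q.merge(e.getKey(), e.getValue() / (double) relevant.size(), Double::sum);
            }
        }
        for (Map<String, Integer> d : irrelevant) {
            for (Map.Entry<String, Integer> e : d.entrySet()) {
                q.merge(e.getKey(), -e.getValue() / (double) irrelevant.size(), Double::sum);
            }
        }
        return q;  // prune to the top-n weighted terms before issuing the query
    }
}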

Note, however, that this method for extracting good terms has an inherent flaw, as it overfits the documents and does not use actual similarity measures or any background information. If a rare term or a spelling error just happens to appear in the relevant documents, then it would be a good expansion term according to this method, but it will presumably not generalise well to a new document collection.

An alternative way we followed to detect the ideal query was the following: for each probabilistic method we extracted the top-n suggested terms and randomly re-ranked those terms to form a large number of queries. We then issued these queries on our system and selected the best-performing query as the "ideal" query.

This was done in order to have more than one independent way of defining the ideal query and thus presumably minimise the bias of our analysis.

Another important use of the ideal queries created by this method is that they give some insight into what is feasible by re-ranking only the top-n terms. This was useful for setting an upper bound on our term re-ranking process. Moreover, by considering not only the best but all the randomly generated queries and taking their average, we set a baseline for our term re-ranking process.

2.4 Semantic similarity measures for ontologies

In our approach we use semantic similarity measures to disambiguate terms, so we dedicate this section to them. A similarity measure can be seen as a symmetric function that takes two concepts as arguments and returns a similarity score.

There are two kinds of semantic similarity measures: one detects similarity based on the distribution of concepts in documents, and the other evaluates similarity from ontologies. A good review of distributional semantic similarity measures can be found in (Manning and Schütze, 1999). Here we focus on semantic similarity measures created for ontologies, and especially for WordNet, as that is the ontology we are going to use. An excellent starting point for this kind of method is the WordNet::Similarity Perl library. Note that although this package uses WordNet, the same algorithms can be used with any ontology.

There are two main categories of ontological methods. The first type of method, which we will call "structural", attempts to extract a similarity measure from the structure of the ontology when seen as a graph; in other words, it calculates the similarity score based on the properties of the paths that connect the concepts in the ontology. The second type of method, which we will call "gloss-based", calculates semantic similarity based on the overlap of the definitions of the terms; in other words, these methods rely on the hypothesis that similar terms will have similar definitions. Needless to say, the most effective methods combine these two approaches.

A good review of structural methods can be found in (Maki et al., 2004). The basic notion in structural methods is that of a connecting path. A connecting path is a path in the ontology that connects the two concepts whose similarity we wish to evaluate. The path consists of a series of relation-edges and concept-nodes. Various structural similarity measures exist, differing in how they calculate the similarity score; several alternatives exist: the number of paths, the length of paths, the kinds of relationships existing in the path, the kinds of nodes in the paths, etc.

The basic notion in "gloss-based" methods is the notion of overlap. The definitions of terms are checked for common words or phrases, and the semantic similarity score is determined by the number of common words or phrases. Various definitional similarity measures exist that differ mostly in how they weight the overlapping terms and how much they favour phrases.

Next we present a simple structural approach, a simple definitional approach and two hybrid approaches.

Probably the simplest structural approach uses only taxonomic (is-a) relations and calculates similarity based on the length of the path. For example, assume that "car is-a vehicle" and "bus is-a vehicle". In this case a path of length 1 (1 intermediate node) exists, "car-vehicle-bus", so the score according to this method would be 1/1 = 1. By contrast, the score for "cat" and "mouse" would be 1/4, because "cat is-a feline", "feline is-a carnivore", "carnivore is-a placental mammal", "rodent is-a placental mammal", and "mouse is-a rodent" (4 intermediate nodes).
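A minimal sketch of this measure, assuming the is-a hierarchy is given as an undirected adjacency map between concept names (a simplification of WordNet's structure; all names are ours). A breadth-first search finds the shortest connecting path, and the score is the reciprocal of the number of intermediate nodes:

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch of the simple path-length similarity: 1 / (number of
// intermediate nodes on the shortest is-a path between two concepts).
public class PathSimilarity {
    public static double similarity(String a, String b,
                                    Map<String, List<String>> neighbours) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(a, 0);
        queue.add(a);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (node.equals(b)) {
                int intermediates = dist.get(node) - 1;  // nodes strictly between a and b
                return 1.0 / Math.max(intermediates, 1);
            }
            for (String next : neighbours.getOrDefault(node, Collections.emptyList())) {
                if (dist.putIfAbsent(next, dist.get(node) + 1) == null) {
                    queue.add(next);
                }
            }
        }
        return 0.0;  // no connecting path
    }
}

For "car" and "bus" above, the path car-vehicle-bus gives 1 intermediate node and a score of 1; for "cat" and "mouse", 4 intermediate nodes give 1/4.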

Probably the simplest "gloss-based" approach is described in (Lesk, 1986). This method calculates the similarity of two concepts simply by counting the number of common words between the definitions of the concepts and assigns that count as the score.

The papers describing these approaches report reasonable results; however, they are outperformed by methods using a hybrid approach.

Figure 2.1: Simple taxonomic relationships in WordNet

One approach which is basically structural is described in (Navigli and Velardi, 2003). In their paper they describe augmenting WordNet with an explicit "gloss" relation. This relation is created for each non-stop-word occurring in the definition of a concept. For example, since the gloss of "car" is "four wheel motor vehicle, usually propelled by an internal combustion engine", explicit relations of type "gloss" are added, starting from "car" and ending on "wheel", "vehicle", etc. Note, however, that the target terms need to be disambiguated before such a relation can be added. After augmenting the ontology, this method scores terms based on standard connecting-path measures (actually, the number of connecting paths is used). Although the results reported are very promising, the exact process of the gloss disambiguation step is rather unclear in the paper, so we could not reproduce the results.

An alternative method, which does not need to disambiguate the terms of glosses, is based on (Lesk, 1986), has been successfully tested on standard SENSEVAL conference data, and is described in (Banerjee and Pedersen, 2003). This method slightly changes the overlap score to strongly favour phrases and collocations. Instead of simply counting the number of overlapping words, if an overlapping phrase (more than one contiguous word) is detected, the similarity score is much higher. More precisely, when an overlap between the definitions is detected, the Lesk score adds 1 to the overlap score, but the extended overlap score adds n², where n is the length of the overlap. No attempt is made, however, to use a language model.

The most important difference of this method compared to Lesk is that it takes into account the overlap of related concepts as well. More specifically, the formula calculating the relatedness of two concepts is:

relatedness(A, B) = score(gloss(A), gloss(B)) + score(hype(A), hype(B)) + score(hypo(A), hypo(B)) + score(gloss(A), hype(B)) + score(hype(A), gloss(B))

where A and B are the concepts whose relatedness is being measured, score(X, Y) is a function that returns the overlap score of two strings, gloss(X) returns the definition of X, hype(X) returns a concatenated string of the definitions of all hypernyms of X, and hypo(X) does the same for hyponyms.

Note that other combinations of relations might be used, but the experiments described in the paper concluded that this combination produced the best results.
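A simplified sketch of the phrase-favouring overlap score at the heart of this measure. The actual algorithm of Banerjee and Pedersen removes matched phrases to avoid counting them twice; the greedy matching below only illustrates the n² idea, and the names are ours:

// A sketch of the extended-Lesk overlap score: each overlapping word
// sequence of length n contributes n^2, so shared phrases dominate
// isolated shared words.
public class ExtendedLeskOverlap {
    public static int overlapScore(String gloss1, String gloss2) {
        String[] a = gloss1.toLowerCase().split("\\s+");
        String[] b = gloss2.toLowerCase().split("\\s+");
        int score = 0;
        int i = 0;
        while (i < a.length) {
            int best = 0;
            for (int j = 0; j < b.length; j++) {
                int len = 0;
                while (i + len < a.length && j + len < b.length
                        && a[i + len].equals(b[j + len])) {
                    len++;
                }
                best = Math.max(best, len);
            }
            if (best > 0) {
                score += best * best;  // n^2 bonus for a phrase of length n
                i += best;             // skip past the matched phrase in gloss1
            } else {
                i++;
            }
        }
        return score;
    }
}

relatedness(A, B) is then the sum of this score over the gloss/hypernym/hyponym string pairs listed in the formula above.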

2.5 Ontology based Word Sense Disambiguation

Word sense disambiguation (WSD) refers to the process of selecting the correct sense of a word from a set of possible senses, or, in terms of ontologies, of mapping a term to the correct unique concept. Several state-of-the-art algorithms can be found in the SENSEVAL conference (http://www.senseval.org).

One category of WSD algorithms uses semantic similarity measures such as the ones described in the previous section. Indeed, an algorithm for WSD was the motivation for calculating semantic similarity in (Banerjee and Pedersen, 2003).

In (Banerjee and Pedersen, 2003), a window around the target word is selected, and for each word in that window a set of candidate senses is identified. The algorithm is outlined as follows:

1. For each CANDIDATE_SENSE of the target word, set SENSE_SCORE[CANDIDATE_SENSE] = 0
2. For each CANDIDATE_SENSE of the target word
   2.1 For each CONTEXT_WORD in the window
      2.1.1 For each CONTEXT_WORD_SENSE of CONTEXT_WORD
         2.1.1.1 SENSE_SCORE[CANDIDATE_SENSE] += score(CANDIDATE_SENSE, CONTEXT_WORD_SENSE)
3. Select the sense with the maximum SENSE_SCORE

Note that each sense decision is taken independently. The complexity of this algorithm is O(n·m²), where n is the number of words considered and m is the maximum number of senses per word. Although the complexity is polynomial in both terms, this algorithm is very slow when a large context window is used (on our system we managed to process only a few queries per day with 100 context words).
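A direct Java rendering of the pseudocode above, with the similarity measure abstracted as a function argument; the types and names are our own sketch, not the original implementation:

import java.util.List;
import java.util.function.BiFunction;

// A sketch of the window-based WSD loop: every candidate sense of the
// target word is scored against every sense of every context word, and
// the sense with the maximum accumulated score wins.
public class WindowWsd {
    public static <S> S disambiguate(List<S> candidateSenses,
                                     List<List<S>> contextWordSenses,
                                     BiFunction<S, S, Double> relatedness) {
        S best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (S candidate : candidateSenses) {
            double score = 0.0;
            for (List<S> senses : contextWordSenses) {
                for (S contextSense : senses) {
                    score += relatedness.apply(candidate, contextSense);
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}

The relatedness argument could be any of the measures of section 2.4, e.g. the extended Lesk overlap sketched earlier; the O(n·m²) cost is visible as the three nested loops.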

An alternative approach to WSD using ontologies (WordNet) is described in (Mihalcea et al., 2004). In their approach they treat the ontology as a graph (network) and use Pagerank (Page et al., 1998) to disambiguate senses in that network. The Pagerank algorithm was originally designed to perform link analysis on web pages and detect the most important pages. The basic idea behind Pagerank is that if there is a link from page A to page B, then the author of A is implicitly conferring some importance on page B. More specifically, A confers some of its own importance on page B; thus if A is important then B will also become important, but if A is not so important then B will only slightly benefit from the link from A. Importance is therefore defined recursively, and the algorithm runs for several iterations until convergence. Initially all pages have the same importance, but after each iteration importance is concentrated in specific pages. An alternative way to view Pagerank is that it roughly expresses the probability of a random web surfer being at a specific page.

Pagerank has proven to be very successful when applied to web pages, but whether this analogy can be applied to the concepts of an ontology for disambiguation purposes is examined in the referenced paper. To successfully apply Pagerank to WordNet, some relationships were pruned and a few more were added. Moreover, the outputs of Pagerank are then mixed with the outputs of the Lesk algorithm described earlier. Although the results reported are very good, it seems that this approach is tailored to WordNet, and extending it to other ontologies might require pruning/adding relations.

In the methodology chapter we describe a very similar approach. However, to avoid tailoring the ontology to the needs of disambiguation, we do not use the original relations as the edges of the graph. We rather create a fully connected graph where the edges are weighted based on a similarity measure, as in (Banerjee and Pedersen, 2003). This way we incorporate the extended Lesk measure before running Pagerank, rather than doing sophisticated rank merging after Pagerank. Moreover, this method is less sensitive to the density of relations in the original ontology. Our approach requires, though, an adaptation of Pagerank to handle weighted edges (links) and prior probabilities. We based that adaptation on (Haveliwala, 2002) and actually used the JUNG Java library (http://jung.sourceforge.net) in our implementation.

Chapter 3

Methodology

3.1 Probabilistic query expansion

A prerequisite step for the methods described later is the use of a probabilistic method to extract representative terms from documents. We used Local Context Analysis but found that it had two arguably unwanted properties. First, LCA does not take into account the number of documents in which a term appears; thus, if there is good evidence of correlation with the query terms in only a few documents, LCA will include that term. Secondly, as we show in the evaluation and results chapter, the quality of the terms suggested by LCA drops significantly as we consider lower-ranked concepts. This is successfully dealt with within LCA by gradually lowering the weight of lower-ranked terms. However, this weighting scheme seems to be appropriate when the terms are used for expansion, whereas perhaps a different weighting scheme would be more appropriate when the terms are used for disambiguation. In this thesis, we decided not to incorporate any weighting scheme in the subsequent steps. Perhaps this is a flaw of our approach; however, selecting an appropriate weighting scheme for each use of the terms is a difficult task, and we felt that it is part of an optimisation and fine-tuning process, while in this thesis we explored whether the approaches are useful in the first place.

To compensate for these arguably unwanted properties of LCA and to create a diverse set of probabilistic methods for the subsequent steps, we explored several other methods. More specifically, we propose a method that has entirely different properties from LCA. The hypothesis behind the proposed methods is the following:

H2: "Terms with a near uniform distribution (high entropy) of term frequencies in the top documents returned by an initial query are good expansion terms."

The motivation behind the proposed hypothesis is that terms that appear with a uniform distribution across the documents will be good expansion terms, or at least their inclusion will not seriously hurt the query. Using this method, terms that consistently appear with low frequencies are also considered for expansion, although they are usually overlooked by standard tf*idf measures. Moreover, because entropy is used, the terms are less likely to create query drift; they give no specific direction to the query but rather simply make the subject of the documents retrieved by the initial query more dominant.

To test this hypothesis and obtain some diverse probabilistic methods for the next steps, we developed three methods, which we call "ENT", "TST" and "MIX".

ENT sorts the terms according to the entropy of their frequency distribution. More specifically, the frequency distribution vector [tf_i] (where tf_i is the term frequency of the term in document i) is normalised, so that each entry represents an estimate of P(document|term). P(document|term) expresses the probability of getting this document as the first document if we searched by that term within the documents returned by the initial query. This method attempts to find terms that are less discriminative and thus have the same (= 1/number of documents) probability for most documents.
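A minimal sketch of the ENT score; the input array of per-document term frequencies and all names are our own illustration:

// A sketch of the ENT score: the entropy of the normalised term
// frequency distribution over the top documents. Terms spread uniformly
// across the documents get the highest score.
public class EntScore {
    public static double score(int[] tfs) {  // tfs[i] = frequency of the term in document i
        double total = 0.0;
        for (int tf : tfs) total += tf;
        if (total == 0) return 0.0;
        double entropy = 0.0;
        for (int tf : tfs) {
            if (tf == 0) continue;
            double p = tf / total;   // estimate of P(document | term)
            entropy -= p * Math.log(p);
        }
        return entropy;  // maximal (log n) when the distribution is uniform
    }
}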

As we show in the evaluation and results chapter, this method produced much better results than (unweighted) LCA in almost any setup. However, the terms selected are not always semantically related to the queries. The term "said" appeared in the top places for almost any query when searching the AQUAINT corpus (mostly newspaper articles), and the terms "home", "site", "search", "contact", etc. appeared for almost any query when searching the web. This effect is magnified as more documents are used. We could treat these words as stop words and thus hand-code their exclusion from expansion; however, we felt that they reveal an inherent property of the method, so we experimented with two other methods as well. Nevertheless, note that it could be argued that there is no need to actually exclude those terms; although they are not actually related to the query, they do not harm the query.

TST multiplies the score of ENT by the standard tf*idf measure, leading to the equation:

\[
TST_{score}(term, docs) = ENT_{score}(term, docs) \cdot tf(term, docs) \cdot idf(term) \tag{3.1}
\]

MIX combines the scores of LCA and ENT. After conducting some experiments we settled on a value of 0.01 as the mixing factor, so the final formula for MIX is:

\[
MIX_{score}(term, docs) = LCA_{score}(term, docs)^{1-0.01} \cdot ENT_{score}(term, docs)^{0.01} \tag{3.2}
\]

We should also mention that, after the experiments described in the evaluation and results chapter (figure 5.2), we settled on using the top 80 documents returned by the initial query and expanding with 20 terms.
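Equations 3.1 and 3.2 reduce to one-liners once the component scores are available. A sketch, assuming the LCA, ENT and tf*idf inputs come from computations such as those sketched earlier:

// Sketches of the TST and MIX combinations (equations 3.1 and 3.2).
public class CombinedScores {
    public static double tst(double entScore, double tf, double idf) {
        return entScore * tf * idf;                                 // equation 3.1
    }

    public static double mix(double lcaScore, double entScore) {
        double m = 0.01;                                            // mixing factor from our experiments
        return Math.pow(lcaScore, 1 - m) * Math.pow(entScore, m);   // equation 3.2
    }
}

The geometric (power) mixing in MIX, rather than a linear combination, keeps the two scores on comparable footing regardless of their scales.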

3.2 Ontology based Word Sense Disambiguation

To disambiguate the query terms we adopted the following hypothesis:

H3: "The correct senses of query terms will be more important in a network which has as nodes the terms extracted from a probabilistic method and as edges the semantic similarity between those terms."

The network we used contained as nodes all the possible concepts of the top-n terms returned by a probabilistic method.

Note that we did not actually add all the possible senses of each top-n term, but conducted some pruning. Each concept in the ontology is assigned a list of parts of speech compatible with its terms; we add a possible sense for a term only if the term appeared in the documents with a compatible part of speech. For example, the term "are" maps to both the verb "are" and the noun "are" (the area measure). If the term "are" was not encountered as a noun in the initial documents, then the concept of the noun "are" is not included in the network. In a small experiment we found that this pruning did not affect the results in any significant way; nevertheless, it improved the time-efficiency of the calculations.

The graph is fully connected, since we add edges for each pair of concepts. The edges were weighted according to the extended Lesk semantic similarity measure (Banerjee and Pedersen, 2003).

The actual value of n top terms used was determined from the experiments described in the evaluation and results chapter. Note, however, that we conducted no experiments with more than 100 terms, because evaluating the similarity of each possible pair took a significant amount of time. About 2 similarity evaluations per second were possible on our system; considering 100 terms for each query usually corresponded to about 120 concepts, and thus 7140 similarity calculations per query, which translated to 3570 seconds, that is, almost an hour per query. Needless to say, this is not an acceptable amount of time for any online system, but these calculations can be performed offline. In our implementation we used caching, and that significantly improved performance.

To measure the importance of the nodes we used an adapted version of Pagerank. The adapted version works with weighted edges and priors. The edges were weighted as described before with the extended Lesk measure. We did not assign uniform priors, and we experimented with several values of the prior weight (the beta parameter, in JUNG library terms). The beta parameter expresses the percentage of the final importance that is determined by the prior distribution: beta = 100% means that the final importance measure is determined entirely by the prior importance, while beta = 0% means that the prior importance has no effect on the final importance measure.
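Rather than reproduce the JUNG API, the following self-contained sketch shows the adapted computation we rely on: importance flows along normalised edge weights, and a beta fraction of the mass is redirected to the prior distribution on every iteration. This is our own simplified illustration, not JUNG's code, and it assumes the priors sum to 1 and the graph is fully connected (as ours is), so there are no dangling nodes:

// A sketch of Pagerank adapted for weighted edges and priors: each
// iteration distributes a node's importance proportionally to its edge
// weights, while a beta fraction of the mass is pinned to the priors.
public class WeightedPageRankWithPriors {
    public static double[] run(double[][] weights, double[] priors,
                               double beta, int iterations) {
        int n = weights.length;
        double[] rank = priors.clone();
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double out = 0.0;
                for (int j = 0; j < n; j++) out += weights[i][j];
                if (out == 0) continue;  // cannot happen in a fully connected graph
                for (int j = 0; j < n; j++) {
                    // node i passes (1 - beta) of its importance along normalised edges
                    next[j] += (1 - beta) * rank[i] * weights[i][j] / out;
                }
            }
            for (int j = 0; j < n; j++) next[j] += beta * priors[j];
            rank = next;
        }
        return rank;
    }
}

Setting beta near 1 pins the result to the priors (only the query terms matter); setting beta to 0 ignores the priors entirely, which motivates the experiments described next.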

If we assigned a very high beta, near 100%, then only the query terms would be taken into account: only mutual disambiguation of the query terms would occur, and the rest of the terms would not contribute at all. If we assigned a uniform prior (or beta = 0%), then the main importance weight might shift away from the terms we want to disambiguate. Note that a uniform prior is preferred over a low beta, because the latter might prevent the algorithm from converging. Although we would expect the main share of the importance to be focused on the query terms regardless of the beta factor, this proved not to be the case, especially when few of the top-n terms are used. For example, in the query "animal protection", when Pagerank is used with a low beta, the most important concept turns out to be "city". That was because several cities were mentioned that had a strong semantic relation with "city" and no other strong links, so "city" gathered most of their weight. We describe the effect of the beta factor and of the number of terms in the evaluation and results chapter.

An important issue we came across was dealing with words that do not appear in the ontology. We could have chosen not to include those terms in the network; however, in some cases some of the query terms were absent from the ontology. Thus we decided to actually put these terms in the network, and to measure similarity we used the cosine similarity of the term frequencies in the documents. More specifically, for each term we created a vector [tf_1, tf_2, ..., tf_n], where tf_i is the term frequency of the term in document i. The documents we used were the same top-ranked documents used for the probabilistic methods. Then we calculated the cosine of each pair of vectors and assigned that as the semantic similarity score (a cosine of 1 means identical distributions and thus maximum similarity). We had no scale issues because Pagerank normalises the weights per concept.
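A minimal sketch of this fallback similarity over the per-document term frequency vectors (names ours):

// A sketch of the fallback similarity for out-of-ontology terms: the
// cosine of the two terms' frequency vectors over the top documents.
public class CosineSimilarity {
    public static double cosine(double[] tfA, double[] tfB) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < tfA.length; i++) {
            dot += tfA[i] * tfB[i];
            normA += tfA[i] * tfA[i];
            normB += tfB[i] * tfB[i];
        }
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));  // 1 = identical distributions
    }
}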

Another issue we came across was that the terms describing some concepts are actually phrases consisting of more than one word. This was not frequent, so we used a simple approach to tackle it: we simply checked whether the top-10 most similar terms according to the cosine similarity measure formed a multi-word term describing an existing concept in the ontology. We did not formally evaluate this approach; however, a short inspection suggested that it has high recall but lower precision. It detected all the terms we could manually spot, and some we did not detect manually, but it also included some irrelevant concepts (this was expected, as the actual locations of the terms in the documents were not taken into account). These irrelevant collocations, though, had no significant effect on the performance of the system.

3.3 Re-ranking of expansion terms based on ontologies

The final step of our suggested method is to re-rank the terms suggested by a probabilistic method based on information drawn from the ontologies, following this hypothesis:

H1: "A hybrid query expansion method that re-ranks query expansion terms suggested by a probabilistic method based on relatedness drawn from the ontology outperforms the original probabilistic method."

We implemented the re-ranking by calculating a boosting score from the ontology and mixing that score according to the formula:

\[
score(term) = probabilisticScore(term)^{1-mix} \cdot boost(term)^{mix} \tag{3.3}
\]

We did not consider all terms for re-ranking, but rather used only the top-n terms proposed by the probabilistic method. We describe the results for the various values of n in the evaluation and results chapter.
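A sketch of equation 3.3 in code, with the boosting factor abstracted as a map produced by one of the three methods described next. The default of 1.0 for missing terms is only a placeholder here; the actual treatment of terms with no ontology concept is discussed below:

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A sketch of the re-ranking step (equation 3.3): each term's
// probabilistic score is geometrically mixed with its ontology-derived
// boosting factor, and the terms are re-sorted by the mixed score.
public class TermReRanker {
    public static List<String> rerank(Map<String, Double> probScore,
                                      Map<String, Double> boost, double mix) {
        Comparator<String> byMixedScore = Comparator.comparingDouble(
                (String t) -> Math.pow(probScore.get(t), 1 - mix)
                        * Math.pow(boost.getOrDefault(t, 1.0), mix));
        return probScore.keySet().stream()
                .sorted(byMixedScore.reversed())
                .collect(Collectors.toList());
    }
}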

To derive the boosting factor we used three methods:

• the first method derives the boosting factor of a concept according to the relation

of that concept to query concepts in the various ontology hierarchies. Specific

types of relations namely children, parents, siblings get different boosting factors

and these boosting factors were trained from a set of training queries.

• the second method derives the boosting factor of a concept based on the importance of its position in the various hierarchies. In a nutshell, if there is an indication of a concentration of concepts under a specific node in a hierarchy in the context of the specific query, compared to the same concentration in the context of the whole document collection, then all children of that node are boosted; otherwise they are penalised.

• the final method does not use hierarchies but is based on creating a network similar to the one proposed for disambiguation. This is a fully connected network where concepts are nodes and edges are weighted according to semantic similarity drawn from ontologies. Pagerank is run to determine the boosting score of each concept; important concepts get a higher boosting score.

In all methods we have to deal with two specific issues. The first issue is about terms that do not map to an ontology concept, and the second is about how to handle terms when more than one term maps to a single ontology concept.

The first issue was quite common when the probabilistic method used was LCA because the latter tends to select rare terms. Rare person and location names are commonly encountered in the results of LCA. However, concepts not covered in the ontology are quite common for all methods. We tested several methods for deriving a boosting score for unknown concepts. The obvious method is to select a boosting factor of 1 for such concepts. However, this led to a bias towards these terms because the boosting factors are usually less than 1. Thus a less biased method is to assign unknown terms the average boosting factor assigned to terms that appear in the ontology. An alternative method is to assign those terms the minimum boosting factor of known concepts. This causes a deliberate bias towards ontology concepts, which might be desired in some cases. In the first approach we deployed training of this score, that is, all ontology concepts and non-ontology concepts get a prior score simply based on whether the term maps to an ontology concept or not; we describe how we trained this score in the next section. In the rest of the methods we used the intermediate approach of assigning unknown concepts a boosting factor of $\frac{boosting_{average} + boosting_{minimum}}{2}$.

The second issue was less common but proved to significantly affect the results. By using ontologies we can derive a boosting factor for a concept; however, at some point we must use that score to derive a boosting factor for actual terms. The issue here is that more than one term might map to the same concept and a term might map to more than one concept. When a term maps to more than one concept we assign it the sum of the boosting factors in the first approach and the maximum boosting factor in the other approaches. More sophisticated methods could be used to favour terms that express more than one concept or penalise terms that actually express penalised concepts; however, in this thesis we followed the simple approaches of summing and taking the maximum.

When a concept is expressed by more than one term (synonyms), we followed the simple approach of assigning the concept's boosting factor to the first term (according to the probabilistic method) and the minimum boosting factor to the rest of the terms. This might not be the desired method in some cases, because expanding with synonyms is one of the most commonly used expansion methodologies. However, note that there is a distinction between expanding with synonyms of the query terms and expanding with synonyms of the expansion terms. Although the former method makes the query terms more dominant in the query, the latter method simply makes some expansion terms more dominant. Actually, in our setting, where only 20 unweighted expansion terms are used, the expansion concepts might become even more dominant than the actual query terms in the final query. Thus, by assigning the concept's boosting factor only to the first term and the minimum factor to the rest of the terms, we get more diverse expansion terms.
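A minimal sketch of this concept-to-term mapping, on invented concepts and boosts (the max-over-concepts rule is used here; the first approach would sum instead):

```python
CONCEPT_BOOST = {"car#n#1": 1.4, "car#n#2": 0.9}     # boosts per concept
TERM_CONCEPTS = {"car": ["car#n#1", "car#n#2"], "auto": ["car#n#1"]}
RANKED_TERMS = ["car", "auto"]   # order given by the probabilistic method

min_boost = min(CONCEPT_BOOST.values())
claimed = set()                  # concepts already assigned to an earlier term
term_boost = {}
for term in RANKED_TERMS:
    fresh = [CONCEPT_BOOST[c] for c in TERM_CONCEPTS[term] if c not in claimed]
    # a term mapping to several concepts takes the maximum; later synonyms of
    # an already-claimed concept fall back to the minimum boosting factor
    term_boost[term] = max(fresh) if fresh else min_boost
    claimed.update(TERM_CONCEPTS[term])

print(term_boost)  # {'car': 1.4, 'auto': 0.9}
```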

3.3.1 Boosting based on relation to query concepts

For this method we used the relations of concepts to query concepts in hierarchies drawn from ontologies. We used only relations that form hierarchies because they allow the definition of specific types of relations.

The types of relations to query terms we used were:

• synonyms: terms that match to the same concepts as query terms.

• parents: parents of query terms in the hierarchy

• children: terms that match to direct children of query concepts.

• children-subtree: terms that match to concepts in the sub-trees under children.

• siblings: terms that match to concepts that share the same parent with query

concepts.

• siblings-subtree: terms that match to concepts in the sub-trees under siblings.

The hierarchies in which we search for these relations are is-a and part-of hierarchies, although other relations forming hierarchies might be used as well. Note that we did not use the general part-of, i.e. "meronym", relation in WordNet as it mixes the types of part-of. Thus for WordNet we used the following hierarchies:

• is-a (hype/hypo): taxonomic hierarchy

• part-of (member sense) hmem/mmem: professor is-a-member-of staff

• part-of (substance sense) hsub/msub: tears are-made-of water

• part-of (all other senses) hprt/mprt: China is-part-of Asia, an amusement park has rides, etc.


Note that these hierarchies are not trees because ontologies allow multiple inheritance. Multiple parents are actually encountered frequently in WordNet.

For each hierarchy and each relation we estimated the appropriate boosting factor using Rocchio's ideal query. That is, we traversed the index of all documents and for each query and each term we used equation 2.3 to calculate the weighting factor for that particular term in the ideal query. Note that this weighting factor might be negative if the term is encountered more frequently in irrelevant documents. The relevance judgements used were the actual relevance judgements used to evaluate the queries. To estimate the boosting factor for a specific relation in a specific hierarchy we use the average weight of all terms having that specific relation to query terms. Note that we used manually disambiguated query terms to make sure that the relation to the correct query concept is considered. The trained parameters can be summarised in the following table:

                      parents     children   children-subtree   siblings    siblings-subtree
is-a                  0.024772    0.031324   0.045891           0.010591    0.019223
part-of (member)     -0.013287    0.711712   0.018197           0.005665
part-of (substance)
part-of (rest)        0.089481    0.473094   0.001387           0.064584   -0.001692

OTHER                 synonyms    Mapping to ontology concept   Not in ontology
                      0.571086    -0.002474                     -0.002785

Note that the missing cells were not considered significant because of low counts. Also note that, along with the average weight according to specific relations, we evaluated the average weight of terms that mapped to ontology concepts and terms that did not map to ontology concepts. In this method we used those parameters to determine the boosting factor of concepts not covered in the ontology.

To derive the boosting factor for each concept we use the equation:

$boost_1(term) = \sum_{c \in concepts(term)} \; \sum_{q \in Q} \; \sum_{h \in H} \; \sum_{r \in relations(h,c,q)} score(h, r)$    (3.4)

where $concepts(term)$ returns all the concepts mapping to term $term$, $Q$ is the set of disambiguated query concepts, $H$ is the set of hierarchies used, and $relations(h,c,q)$ returns all relations of concept $c$ to query concept $q$ in hierarchy $h$, that is parents if $c$ is a parent of $q$ in $h$, children if $c$ is a child of $q$ in $h$, etc. Finally, $score(h, r)$ returns the value of the specific cell in the presented table, or zero in the event of an absent value.

$boost_2(term)$ adds to $boost_1$ the appropriate values of the last row of the presented table, that is, the score for synonyms if the term maps to a query concept, and the score for either mapping or non-mapping terms depending on whether the term actually matches any ontology concept.

To ensure that the boosting factor is greater than zero we find the term with the minimum boosting factor and subtract its boosting factor from $boost_2(term)$. Thus, supposing that the minimum boosting factor is $minBoost_2$, the final equation is:

$boost(term) = boost_2(term) - minBoost_2$    (3.5)
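A sketch of this boosting computation follows; the `concepts` and `relations` lookups are hypothetical stand-ins for the WordNet hierarchies, and the score table is reduced to two cells.

```python
SCORE_TABLE = {("is-a", "children"): 0.031324, ("is-a", "parents"): 0.024772}
SYNONYM, IN_ONT, NOT_IN_ONT = 0.571086, -0.002474, -0.002785

def concepts(term):      # hypothetical term -> concepts map (WordNet in the thesis)
    return {"lion": ["lion#n#1"], "animal": ["animal#n#1"]}.get(term, [])

def relations(h, c, q):  # hypothetical relation lookup in hierarchy h
    return ["children"] if (h, c, q) == ("is-a", "lion#n#1", "animal#n#1") else []

QUERY = ["animal#n#1"]   # disambiguated query concepts
HIERARCHIES = ["is-a"]

def boost2(term):
    score = sum(SCORE_TABLE.get((h, r), 0.0)           # equation 3.4
                for c in concepts(term) for q in QUERY
                for h in HIERARCHIES for r in relations(h, c, q))
    if any(c in QUERY for c in concepts(term)):        # synonym of a query concept
        score += SYNONYM
    score += IN_ONT if concepts(term) else NOT_IN_ONT  # ontology coverage prior
    return score

raw = {t: boost2(t) for t in ["lion", "animal", "xyzzy"]}
min_b = min(raw.values())
boost = {t: s - min_b for t, s in raw.items()}         # equation 3.5
print(boost)
```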

3.3.2 Boosting based on importance measure drawn from hierarchies

We used two methods to derive boosting factors from hierarchies and combined them into a final boosting score for each concept. The first method focuses on detecting important paths and the second focuses on discovering important nodes according to the density of the subtree under a specific node. Both methods use hierarchies drawn from the ontology; the same hierarchies as described in the previous section were used.

Note that, as described in the evaluation and results chapter, this method produced very poor results. That was because of disambiguation errors (we did not disambiguate the terms before adding them to the hierarchy). Nevertheless, the justification for including this method in this thesis is, apart from commenting on its failure as a query expansion method, its success in detecting how wrong senses of words can combine in unpredictable ways. This property makes it suitable for building an error analysis tool, as we explain in the implementation chapter.

In the hierarchies used we define two notions which we name "path" and "density". For each concept in the hierarchy we define as "path" any path consisting of the concept and all its parents up to the root of the hierarchy. "Density" is defined in relation to a list of concepts L: the "density" of a concept is the number of times the concept or one of its children appears in the list, divided by the length of the list.

Using the notion of density we can weight each node in a hierarchy. Both methods rely on comparing two weighted versions of the hierarchies. The first version, which we will refer to with the subscript "prior", is weighted according to counts gathered from the whole document collection (the density is defined in relation to a concept list L that is created by concatenating the concepts that appear in all documents). The second, which we will refer to with the subscript "query", is weighted according to the concepts returned by a probabilistic expansion method.

The two methods for deriving importance from the hierarchies express the following:

• if a higher than expected concentration of concepts is detected under concept X in the context of the query, then all children of X are boosted; otherwise all children of X are penalised. For example, if in the context of the query we encounter many specific animals, then the concept of animal is probably important and all its children (animals) are boosted.

• if in the context of the query we encounter a concept and also encounter its parents, then the concept is boosted; otherwise it is penalised. The amount of boosting or penalising depends on the probability of encountering the parent in the whole document collection. If the parent of the encountered concept is frequent in the whole collection, then absence of that concept in the context of the query leads to greater penalisation. For example, if in the context of the query we encounter many animals but do not encounter the term "animal", then according to this criterion all animals are penalised (although according to the previous criterion they were boosted). If, on the other hand, we encounter a single specific animal and the term "animal", then that specific animal is boosted.

A concept might appear in more than one location in the hierarchies, and each place is defined by a unique path. Thus there are several ways to derive a unique boosting factor for a concept by combining the boosting factors of its paths. We simply selected the maximum boosting factor.


3.3.2.1 Density Measure

The main goal of this measure is to detect higher than expected concentrations in specific areas of the hierarchies. For each node in the ontology we use the notion of the probability of a child given the parent, $P(child|parent)$. This probability determines a density distribution over the children of the specific parent. The expected density of a child is:

$P_{prior}(child|parent) = \frac{density_{prior}(child)}{\sum_{c \in children(parent)} density_{prior}(c)}$    (3.6)

where $children(X)$ is a function returning the set of children of concept X in the hierarchy and $density_{prior}(X)$ returns the number of times X or one of its children is encountered in the whole document collection. The distribution in the context of the query is defined by:

$P_{query}(child|parent) = \frac{density_{query}(child)}{\sum_{c \in children(parent)} density_{query}(c)}$    (3.7)

Note that for estimating $P_{prior}(child|parent)$ we did not consider the probability of encountering a child given that we have encountered the parent in the same document, but used the probability of encountering a child in any document. Estimating the distribution only from documents containing the parent would presumably be more appropriate; however, in our rather small document collection this led to very sparse counts.

To estimate the boosting factor we consider the difference in distributions and define the boosting factor of a child as:

$boost_{density}(child|parent) = 1 + P_{query}(child|parent) - P_{prior}(child|parent)$    (3.8)

And to estimate the boosting factor of a path we simply take the average of the boosting factors of each concept in the path:

$boost_{density}([root, c_1, c_2, \ldots, concept]) = avg([boost(c_1|root), \ldots, boost(c_i|c_{i-1}), \ldots, boost(concept|c_{n-1})])$    (3.9)

where $[\ldots]$ defines a list, $[root, c_1, c_2, \ldots, concept]$ defines the path considered, $avg$ returns the average of a list, and $boost(X|Y)$ is as defined in the previous equation.


3.3.2.2 Path importance measure

The main goal of this measure is to derive a boosting factor for a path that estimates the significance of that path. If the parent is encountered in the context of the query, the boosting factor should be greater than one; otherwise it should be less than one. The amount of boosting should be smaller if the parent concept is common, and the amount of penalising should be greater if the parent concept is common.

To derive this measure we use the notion of

$P_{prior}(concept) = density(concept) - \sum_{child \in children(concept)} density(child)$    (3.10)

which expresses the probability of encountering the specific concept (and not any of its children). If we encounter a specific concept in the context of the query, then that concept gets a boosting factor of $boost_{path}(concept) = 2 - P_{prior}(concept)$. If we do not encounter a concept in the context of the query, then that concept gets a boosting factor of $boost_{path}(concept) = 1 - P_{prior}(concept)$.

To estimate the boosting factor of a path we simply multiply the boosting factors

of the concepts in that path.

3.3.2.3 Combined hierarchy measure

To get the boosting factor of a child in the hierarchy we simply multiply the path measure by the density measure.
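The following is a minimal sketch, under illustrative densities and an invented two-node tree, of how equations 3.6-3.10 combine into a boosting factor for a single path.

```python
CHILDREN = {"animal": ["lion", "antelope"]}
DENSITY_PRIOR = {"animal": 0.30, "lion": 0.10, "antelope": 0.05}
DENSITY_QUERY = {"animal": 0.20, "lion": 0.15, "antelope": 0.02}
SEEN_IN_QUERY = {"animal", "lion"}  # concepts encountered in the query context

def p(child, parent, density):      # equations 3.6 / 3.7
    total = sum(density.get(c, 0.0) for c in CHILDREN[parent])
    return density.get(child, 0.0) / total if total else 0.0

def boost_density(child, parent):   # equation 3.8
    return 1 + p(child, parent, DENSITY_QUERY) - p(child, parent, DENSITY_PRIOR)

def p_prior(concept):               # equation 3.10
    return DENSITY_PRIOR.get(concept, 0.0) - sum(
        DENSITY_PRIOR.get(c, 0.0) for c in CHILDREN.get(concept, []))

def boost_path_step(concept):       # 2 - P if seen in the query context, else 1 - P
    return (2 if concept in SEEN_IN_QUERY else 1) - p_prior(concept)

def combined_boost(path):
    """path = [root, ..., concept]: density average (eq. 3.9) times path product."""
    steps = list(zip(path, path[1:]))
    density = sum(boost_density(c, par) for par, c in steps) / len(steps)
    path_score = 1.0
    for concept in path:
        path_score *= boost_path_step(concept)
    return density * path_score

print(combined_boost(["animal", "lion"]))
```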

3.3.3 Boosting based on network importance measure

Alternatively, to derive the boosting score from the ontology we used the importance as measured by the Pagerank algorithm described in the disambiguation section. For selecting terms we actually run Pagerank twice.

In the first run terms are disambiguated and we get a $P(j,i)$ for each possible sense j of term i. Recall that in Pagerank $P(j,i)$ roughly expresses the probability of a web surfer staying at a specific node-page (j,i), and that this probability is used to measure the importance of a sense j in the network. In our context of semantic similarity based networks this expresses the probability of stopping at a specific sense while randomly moving through senses.

In the previous section we were concerned with disambiguation, thus we selected the most important sense of each term, that is $sense_i = \arg\max_j P(j,i)$. In this section we do not need to make a crisp discrimination between senses because, as we show later in the evaluation, usually more than one sense is equally appropriate (an inter-annotator agreement of only about 58%). Thus we use an amount proportional to $P(j,i)$ as our prior probability for a second run of Pagerank.

We did not use $P(j,i)$ directly but rather weighted it appropriately so that each term gets equal prior weight in the final network. Intuitively this expresses that if a query term is less important than another query term in the network created by the terms suggested by the probabilistic method, then that query term is probably overlooked by the suggested terms. In other words, that query term is probably going to be outweighed, so we should focus on it more. For example, in a query about "history of skateboarding" most terms in the graph were about skateboarding and very few of them about history, so expanding with the suggested terms causes the results to move towards skateboarding shops, contests, etc. Thus we attempt to balance the importance of terms by assigning different priors for the second run of Pagerank (more for "history" and less for "skateboarding" in the previous example). Mathematically this is expressed by assigning prior probabilities according to the equation:

$PriorP_2(j,i) = P_1(j,i) \cdot \frac{\sum_j \sum_k P_1(j,k)}{\sum_j P_1(j,i) \cdot numOfQueryTerms}$    (3.11)

The second run of Pagerank calculates the actual boosting score for each term.
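A small numerical sketch of equation 3.11 follows; the first-run scores are invented, but the output shows that each query term ends up with an equal share of the total prior mass.

```python
P1 = {  # first-run Pagerank scores P1[(sense j, term i)]; invented numbers
    ("history#1", "history"): 0.02, ("history#2", "history"): 0.01,
    ("skate#1", "skateboarding"): 0.40, ("skate#2", "skateboarding"): 0.20,
}
terms = {t for (_, t) in P1}
total_mass = sum(P1.values())

prior_p2 = {}
for (sense, term), score in P1.items():
    term_mass = sum(v for (s, t), v in P1.items() if t == term)
    prior_p2[(sense, term)] = score * total_mass / (term_mass * len(terms))

for term in sorted(terms):  # each term now holds an equal share of the prior mass
    print(term, sum(v for (s, t), v in prior_p2.items() if t == term))
```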

Note that for each run of Pagerank we can tune how significant the prior probabilities are through the parameter beta. In the evaluation we describe the results for various values of beta. We will call the beta parameter of the first run of Pagerank disambBeta and the same parameter for the second run rankBeta.

Chapter 4

Implementation

In this chapter we describe which modules were used, how we implemented the methodologies described in the previous chapter and, finally, the details of some analysis tools we developed. Note that we implemented the methodologies in such a way that two modes of using our system are possible:

• Interactive version: This version provides a web page displaying the actual documents returned by an initial query and the expansion terms proposed by the various methodologies. This version was developed to examine the feasibility of actual deployment of the proposed methods in terms of time efficiency and, most importantly, to create a framework for exploratory evaluation using any queries.

• Batch processing version: This version automates the procedure of evaluating

the various expansion methodologies. The input of this version is a topics file

describing the queries and a file containing relevance judgements. The output of

this version consists of files describing the performance of the system.

4.1 Modules used

As a search engine we mainly used the open source, Java-based Lucene (http://lucene.apache.org) search engine, but we designed the system in a modular fashion so that any other system might be used as well; we actually used the Google API for exploratory research and confirmation of our results.

Our ontology was WordNet (http://wordnet.princeton.edu) version 1.7.1. To derive similarity measures we used the WordNet::Similarity (http://wn-similarity.sourceforge.net/) Perl package. Other ontologies can be used with this package, but note that to do so WordNet::QueryData (a simple package reading WordNet files) needs to be reimplemented for the new ontology; the rest of the algorithms will then work with the new ontology.

For network importance algorithms we used the JUNG Java library (http://jung.sourceforge.net). From this library we also used some graph visualisation tools for analysing and debugging our methodologies.

The glue connecting those modules together and most of our code is written in

Python.

Finally, as a part-of-speech tagger we used TnT (http://www.coli.uni-saarland.de/~thorsten/tnt/) trained on Wall Street Journal data.

4.2 Interactive version

The system architecture of the interactive version is sketched in figure 4.1. In this figure the main modules are presented along with a process description specifying the order in which the modules are used. The usage scenario is the following:

• The user enters a web page similar to those of standard web search engines and

issues a query

• (1) search.py conducts an actual search using the appropriate API and the appro-

priate collection

• (2) the resulting documents of the initial query are immediately presented to the

user

• (3) At the same time the documents are passed to indexer.py. Indexer.py translates html pages to text using the Python sgml parser. Note that in some cases of ill-formed html documents this process might fail, so indexer.py passes the html document through the W3C utility 'tidy', which converts the html file to a syntactically correct xhtml document. Moreover, indexer.py tokenizes the text extracted from the html documents and passes the tokens through a part-of-speech tagger, namely TnT.

Figure 4.1: System architecture

• (4) The tokenized documents are passed to lca.py which, although named 'lca.py', implements all probabilistic query expansion methods described in this thesis. The output of lca.py is a list of the terms encountered in the documents along with their scores according to the various probabilistic expansion methods.

• (5) The list of terms is passed to ontology.py, which implements the various ontology based re-ranking methods described in this thesis, and

• (6) the suggested expanded queries are presented to the user for inspection; the user might click on the query proposed by a specific method to see its results.

4.3 Batch processing version

To process large numbers of queries (TREC collections) we include a batch operating mode, where several instances of the batch can run on different machines to speed up evaluation. The process consists of the following stages (implemented by different Python modules which must be run in the specified order):

• search.py: reads topics (queries) from files in TREC file format and creates .search files containing the top-100 documents returned by those queries, sorted by their ranking, using the predefined search engine (either Google on the web or Lucene on the AQUAINT corpus).

• index.py: reads the .search files and for each distinct document creates a .index file containing the terms appearing in the document, their frequencies, their part of speech and their locations in the document.

• expand1.py: for each query it reads the top-n documents and creates a METHOD.NUM_OF_DOCS.expand file containing the terms and their scores according to the probabilistic method used (currently LCA, MIX, TST and ENT), sorted by their scores. NUM_OF_DOCS refers to the number of top-n documents used.

• preparesimilarities.py: reads all .expand files (excluding .hybrid.expand), gets all concepts referring to those terms and calculates the semantic similarity of each pair of terms. The files are stored in NUMBER.similarity, where NUMBER is an arbitrary number, and contain a line for each pair of concepts plus its similarity. Note that this step takes an enormous amount of time to compute and that the semantic similarity files are not specific to the queries, so they are stored in a different directory and should not be deleted.

• disambiguate.py: does two things. Firstly, for each query it creates a small .query file containing one line per query term; each line includes the term along with all its possible concepts. Secondly, for each METHOD.NUM_OF_DOCS.expand file it creates a METHOD.NUM_OF_DOCS.NUM_OF_TERMS.rank file which contains all concepts ordered by their importance according to Pagerank. NUM_OF_TERMS refers to how many of the terms in METHOD.NUM_OF_DOCS.expand were actually used.

• expand2.py: for each METHOD.NUM_OF_DOCS.NUM_OF_TERMS.rank file it creates a METHOD.NUM_OF_DOCS.NUM_OF_TERMS.hybrid.expand file containing the concepts along with their scores as re-ranked from the ontology. Needless to say, information from the .expand file is used as well in this step.

• search2.py: creates a .METHOD.search file for each .expand file. This file con-

tains the top-100 documents returned by the expanded queries sorted by their

ranking.

• eval.py: reads document relevance information (qrels) in standard TREC format and creates a tab-delimited text file containing the results of each expansion method. Note that it might be the case that some documents are not included in the qrels file. In such a case the documents are considered irrelevant and a .missing file is created containing one line for each missing relevance judgement.


This whole process is automated by batch.py which, as stated before, can run on several machines at the same time.
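A hedged sketch of how such a batch driver might chain the stages; the module names come from the list above, but the invocation style is an assumption.

```python
import subprocess

STAGES = [
    "search.py",               # topics -> .search files (top-100 documents)
    "index.py",                # .search -> .index files (terms, tf, POS, positions)
    "expand1.py",              # .index -> METHOD.NUM_OF_DOCS.expand files
    "preparesimilarities.py",  # .expand -> NUMBER.similarity files (cached)
    "disambiguate.py",         # .expand -> .query and .rank files (Pagerank)
    "expand2.py",              # .rank -> .hybrid.expand files (re-ranked terms)
    "search2.py",              # expanded queries -> .METHOD.search files
    "eval.py",                 # qrels -> tab-delimited results (+ .missing files)
]

for stage in STAGES:
    subprocess.run(["python", stage], check=True)  # abort the batch on failure
```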

4.4 Various visualisation tools

In the context of this thesis we needed to perform some case-based analysis of how the various disambiguation and query expansion methods performed. To conduct this per-query analysis we needed some visualisation tools. We developed two tools:

• Network visualisation tool: This tool is based on modules provided by the JUNG Java library. Nodes are visualised as circles and edges as lines connecting these circles. An important issue for network visualisation is the layout of nodes. We left the exact layout algorithm as a parameter of our system; any of the JUNG layout algorithms can be used. We found no satisfactory way to visualise the weights of edges. However, visualising the importance of nodes was already implemented in JUNG: the diameter of the circle describing a node is proportional to the importance of the node.

• Hierarchy visualisation tool: This tool was developed from scratch and its output is an HTML page visualising a tree similar to the trees displayed by most browsers for XML documents. However, the standard XSL stylesheet used by browsers restricted the parameters of the visualisation, thus we implemented a custom XSL stylesheet. This stylesheet takes an XML file and visualises it as an HTML page, adding the following useful visualisation properties: not all attributes of the XML nodes are displayed; the children of each node are sorted according to the value of a specific attribute; and the color of the text describing a node is a shade of gray proportional to the actual value of an attribute. Thus, to visualise hierarchies we created an XML file for each hierarchy. This XML contained the name of the concept, the terms mapped to that concept and the value of the boosting factor as calculated by the methodology described in the "Boosting based on importance measure drawn from hierarchies" section of the methodology chapter. Using our stylesheet, children were sorted according to the boosting score and the font color visualising the concept was proportional to that boosting factor. Thus concepts with a high boosting factor were visualised in an almost black font color, while concepts with a low boosting factor were visualised in an almost white color.

The network visualisation tool proved not to be very useful: once more than 20 nodes are included in the network it is very difficult to understand what is going on. The hierarchy visualisation tool, on the other hand, proved to be surprisingly effective and useful.

Chapter 5

Evaluation and Results

Our evaluation is mainly focused on evaluating query expansion methodologies. However, because the performance of ontological query expansion methods depends on word sense disambiguation performance, we also evaluated the performance of WSD algorithms.

To evaluate the query expansion results we used the standard measure of relevance in the top-n retrieved documents. That is, we counted how many of the top-n pages retrieved by the query were actually relevant. The baseline system was querying without query expansion, and the tested systems were the various query expansion methodologies described in this thesis. As the value of n we used 20; that is, we evaluate the relevance in the top-20 documents. As an upper bound for expansion methods we use Rocchio's methodology to derive the ideal query from the set of relevant and irrelevant documents as specified by the TREC query relevance data (the same data we used for evaluating the score).

Along with this standard measure we used the non-standard but rather informative measure of counting the number of queries for which there was a degrade in performance after the expansion. That is, we calculated relevance at top-20 documents for the unexpanded and the expanded query and considered the percentage of queries that had a lower score when expanded. Needless to say, less is better for this score.
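A minimal sketch of the two measures; the data structures (rankings and qrels per query) are assumptions.

```python
def precision_at_n(ranked_docs, relevant_docs, n=20):
    """Fraction of the top-n retrieved documents that are relevant."""
    return sum(1 for d in ranked_docs[:n] if d in relevant_docs) / n

def degrade_percentage(baseline_runs, expanded_runs, relevant, n=20):
    """Share of queries whose expanded query scores below the baseline."""
    degraded = sum(
        precision_at_n(expanded_runs[q], relevant[q], n)
        < precision_at_n(baseline_runs[q], relevant[q], n)
        for q in baseline_runs)
    return 100.0 * degraded / len(baseline_runs)

# Tiny usage example with invented documents and judgements:
runs_base = {"q1": ["d1", "d2"]}
runs_exp = {"q1": ["d3", "d2"]}
rel = {"q1": {"d1", "d2"}}
print(precision_at_n(runs_base["q1"], rel["q1"]),
      degrade_percentage(runs_base, runs_exp, rel))
```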

To evaluate WSD performance, we simply calculated the number of query terms that were correctly disambiguated. Note that we excluded unambiguous terms; we calculated the performance of disambiguation algorithms by only considering ambiguous terms. This explains to some extent the low scores compared to those reported in the literature, and our decision is justified as follows. The frequency of unambiguous terms is an important measure of its own, and should perhaps be used independently. Including both ambiguous and unambiguous terms presumably gives a better picture of the number of terms that should be expected to be correctly disambiguated. Excluding unambiguous terms presumably gives a more objective picture of actual WSD performance.

Our baseline for WSD is random assignment of senses to words, and the methods tested are the one described in (Banerjee and Pedersen, 2003) and our proposed method. To find the correct senses we asked 3 users to choose the appropriate sense of each query term in the context of the query. The upper bound for WSD is defined by the inter-annotator agreement, that is, the percentage of terms where all three users selected the same sense.

Both WSD and query expansion proved to be very sensitive to the parameters with which we run each method: choosing a slightly different value for a parameter could give very different results. Because of this we preferred to display the results in three-dimensional tables, where the x and y axes are parameters of the method and the z axis is the performance. To visualise the z axis we used different shades of gray to color the appropriate cell of the table. In all tables darker means better.

5.1 TREC tracks

For our experiments we used the queries and data used in the TREC conference. Every year several information retrieval tracks run in the TREC conference. Initially, the track of ad-hoc information retrieval on large collections or snapshots of the web was run. Ad-hoc retrieval corresponds to general search over those large collections. However, after 2001 the ad-hoc track was replaced by more specialised tracks. The terabyte track focuses on the extension of IR techniques to huge collections. The HARD track focuses on the extraction of passages and on targeted interaction with the user. The web track focused on finding homepages (not the actual documents containing the information but rather one-jump links to that information) and was also discontinued in 2004. Finally, the robust track focuses on difficult queries.

From these options we chose to focus on the HARD track data. This decision was affected by corpus availability issues but proved to be a good choice because the queries contained are diverse and quite difficult without being as difficult as those of the robust track. Moreover, the data collection of this track is neither huge (as in the terabyte track) nor structured (as that of the web track, which is specialised for using the structure of sites to find homepages). The data collection consists of newspaper and magazine articles of the year 1999; from this data we used only the AQUAINT corpus. The whole collection contains about 300,000 documents and can be considered relatively small. As for the queries, we must mention that the queries supplied for the HARD track, which we used as a baseline, are far better than those supplied by ordinary users. Query expansion is well known to work better when the queries are not so good. Thus, our results can be characterised as rather pessimistic, and that explains to some extent why the actual results we got for the various expansion methods show far less improvement than those reported by the papers we derived the approaches from.

5.2 Ontology based word sense disambiguation

To evaluate WSD performance, we counted the number of query terms that were correctly disambiguated. We only considered ambiguous query terms for this evaluation. Our baseline is random assignment of senses to words, and the methods tested are the method of (Banerjee and Pedersen, 2003) and our proposed method. The results are summarised in figure 5.1.

To identify the correct senses of the terms we manually disambiguated the query terms: we asked 3 users to provide the correct sense and selected the final sense by majority vote. This way we also define an upper bound for our methods, which is the inter-annotator agreement.
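A small sketch of the gold-standard construction and the agreement upper bound, on invented annotations:

```python
from collections import Counter

annotations = {  # term -> sense chosen by each of the 3 annotators (invented)
    "bank": ["bank#1", "bank#1", "bank#2"],
    "interest": ["interest#3", "interest#3", "interest#3"],
}

# Majority vote defines the gold sense per term.
gold = {t: Counter(s).most_common(1)[0][0] for t, s in annotations.items()}
# Inter-annotator agreement: fraction of terms where all three users agree.
agreement = sum(len(set(s)) == 1 for s in annotations.values()) / len(annotations)
print(gold, f"inter-annotator agreement: {agreement:.0%}")
```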

Figure 5.1: Disambiguation performance

From figure 5.1 we can see that when no context terms are used and only mutual disambiguation takes place, our proposed method is about 1% better than (Banerjee and Pedersen, 2003), although the same similarity measures are used. This improvement is to be attributed to not making independent decisions for each sense. Actually, as the disambBeta parameter decreases and the effect of prior importance lessens, the performance gets better. The performance when some context terms are used depends on the probabilistic method that actually proposed the terms. Note that all methods show a significant decrease in performance when only 10 context terms are used and the disambBeta parameter is low, so the prior importance is less significant and the importance weight is free to move through the terms. Performance is restored for the various methods after adding 30-40 terms and reaches its peak (for the measured values) at 50 terms.

Each method seems to have a distinct preference for the value of the disambBeta parameter. When context words are extracted from "ENT", disambBeta = 0.5 seems to produce better results. When context words are extracted from "LCA" or "TST", disambBeta = 0.25 seems to be more appropriate, while "MIX" performs better for disambBeta = 0.75.

5.3 Query expansion

For query expansion performance we used two measures: average precision in the top-20 documents and the percentage of queries experiencing a degrade in performance after expansion.

5.3.1 Probabilistic Query expansion

In this section we present the results of the experiments related to probabilistic expansion methods. We present the effect of the number of documents returned by the initial query and the effect of the number of query terms used in combined tables in figures 5.2 and 5.3. In both figures darker means better.

As baseline performance we use the unexpanded query, which corresponds to an average of 42.3% precision at top-20 documents. As an upper bound of performance we use Rocchio's (Rocchio, 1971) ideal query derived from the documents as classified by the TREC relevance judgements; that is, we used equation 2.3 to derive the query that best distinguishes the relevant from the irrelevant documents. Using this method we get a performance of 77.1%.


Figure 5.2: Average precision at top-20 documents


Figure 5.3: Percentage of queries experiencing degrade in performance


From figure 5.2 we can see that LCA has an average performance of 40.9%-47.3%. The performance of LCA improves significantly as more documents are considered. Using LCA with fewer than 50 documents can produce a degrade in performance (the baseline average was 42.3%), but above 70 documents LCA has an average of about 44%, giving an approximate 2% improvement over unexpanded queries. The same conclusions can be drawn from the "percentage of queries experiencing degrade in performance" metric. The percentage of queries experiencing a degrade after using LCA for expansion ranges from 22.0% to 36.0%. However, when used with over 70 documents it rarely causes a degrade in more than 28.0% of the queries.

ENT has an average performance of 42.4%-49.7%. The performance of ENT reaches its peak when about 30 documents are used. Using more documents can decrease performance, but never below the baseline. The number of terms used in the query is also important for this method; the best results occur when 40 or fewer terms are used. When used with fewer than 40 terms, ENT always gives above 47%, that is, a 5% improvement over the unexpanded query. Using the "percentage of queries experiencing degrade in performance" metric we can see that ENT can cause a degrade in performance in 18.0% to 42.0% of the queries. However, percentages over 35% occur only when very few documents (fewer than 30) and many terms (more than 50) are used. When used with more than 10 documents and fewer than 40 terms, it rarely causes a degrade in more than 24% of the queries.

The results of MIX and TST are more balanced and similar to each other. The average performance ranges from about 42% to about 49.6%, and when used properly they usually have a performance of more than 46%, which corresponds to a 4% improvement. The effect of the number of terms is less apparent for TST, but MIX performs better with fewer documents. The "percentage of queries experiencing degrade in performance" metric reveals an important difference between these methods. Similar to ENT, MIX can cause a decrease in performance in a large percentage of the queries (more than 30%) when used with more than 30 terms. TST, on the other hand, improves on this metric with more documents.

In total, we could say that when properly used in our document collection all methods yield an average of more than 4% improvement over unexpanded queries and cause a degrade in performance in about 18%-25% of the queries. Note, however, that these results are rather pessimistic as we used the HARD queries over a rather small collection. In a web setup, where users enter average queries and the document collection is much larger, significantly better results should be expected.

For all subsequent steps we used a value of 80 for documents from the initial query

and a value of 20 for the number of query terms.

5.3.2 Ontological Query expansion

We did not conduct any experiments for pure ontological query expansion. Expanding based on specific types of relations is well covered in the literature (Voorhees, 1994) and (Navigli and Velardi, 2003). A summary of the results of (Voorhees, 1994) and (Navigli and Velardi, 2003) is that the most effective expansion method is expansion with synonyms plus descendants. Expanding with hypernyms causes a small improvement, if any. Expanding with any related concepts (not taking into account the type of relation) causes a small improvement, if any, as well. (Navigli and Velardi, 2003) also report a remarkable improvement when expanding with words in glosses.

Although we did not expand based on specific relations, we used a different measure to estimate the quality of expanding with specific kinds of relations and found results consistent with the ones reported in the literature. As mentioned in the methodology chapter, we train the boosting factor for one of the proposed hybrid methods. The results of the training procedure are summarised in the methodology chapter and are repeated here for convenience:

                      parents     children   children-subtree   siblings    siblings-subtree
is-a                  0.024772    0.031324   0.045891           0.010591    0.019223
part-of (member)     -0.013287    0.711712   0.018197           0.005665
part-of (substance)
part-of (rest)        0.089481    0.473094   0.001387           0.064584   -0.001692

OTHER                 synonyms    Mapping to ontology concept   Not in ontology
                      0.571086    -0.002474                     -0.002785

Each cell in the table expresses the average weight of the specific kind of terms in the ideal query. That is, we used Rocchio's ideal query methodology to derive the weight of each term in the ideal query. Next, for each hierarchy and each kind of relation, we found all the concepts related to the query concepts in the specific hierarchy with the specific relation. In this table we report the average weight of those terms.

Although it is difficult to predict the exact performance of expanding with the specific terms from the reported weight, this weight is a very useful measure for comparing the usefulness of the relations in the hierarchies.

5.3.3 Hybrid Query expansion

In this section we present the performance of our proposed approach. The parameters

of our approach are:

• the probabilistic method used to extract the initial terms namely “LCA”, “ENT”,

“MIX” or “TST”

• the number of initial terms considered for re-ranking, which we will refer to as "numOfTerms"

• the mixing factor which we will refer to as “mix” (lowercase). Mixing factor

of 0 means pure probabilistic score. Mixing factor of 1 means pure ontological

re-ranking score.

• the method used to derive the boosting factor, namely "Boosting based on relation to query concepts", "Boosting based on importance measure drawn from hierarchies" and "Boosting based on network importance measure".

Since the goal of using ontologies is to re-rank the terms suggested by a probabilistic method, we used the performance of the original probabilistic method as a baseline. Moreover, we provide a baseline of random re-ranking of terms; this baseline is useful to illustrate the quality of the terms proposed by the probabilistic method. As an upper bound of performance we use the score of the best performing query after running 100 randomly re-ranked queries. This upper bound is described in figures 5.4 and 5.5.


Figure 5.4: Average precision at top-20 documents of top randomly re-ranked query

Figure 5.5: Percentage of queries experiencing degrade in performance using the top randomly re-ranked queries


Note that performance on both measures should improve as a greater numOfTerms is used: more terms give more options, and the best query available when re-ranking $numOfTerms_1$ terms is still available when re-ranking $numOfTerms_2$ terms if $numOfTerms_1 < numOfTerms_2$. However, this is not captured in our results. It seems that our decision to use 100 random re-rankings proved to be a very low number for discovering the optimal query. Nevertheless, it illustrates an important issue: when more terms are used, finding the optimal query becomes harder as the re-ranking algorithm has to choose from a larger set of re-ranking options.

Another interesting finding is that even when the best queries discovered are chosen, there is still a significant percentage of queries experiencing a degrade in performance. This illustrates that for some queries it is better not to expand the query at all; regardless of the expansion method used, performance will degrade.

5.3.3.1 Boosting based on relation to query concepts

In this section we present the results of ontological re-ranking when the boosting score is derived from the relation of the candidate term to query concepts in the ontology. Figure 5.6 shows the average performance after re-ranking based on this criterion for the various values of the parameter mix. Recall that mix = 0 means no re-ranking and mix = 1 corresponds to full ontological re-ranking (the score assigned to the term by the original probabilistic method is not taken into account). In figure 5.7 the percentage of queries experiencing degrade in performance is shown. In both figures random re-ranking is included as a baseline and darker means better.

From figure 5.6 we can see that although full re-ranking causes improvement in only a small number of parameter settings, when the re-ranking score is mixed with the original score from the probabilistic method the results are much better. A 2% to 3% improvement over the performance of the original probabilistic method can be observed, and this improvement translates to almost doubling the improvement over the unexpanded query. The same picture of significant improvement can be drawn from the percentage of queries experiencing a degrade in performance in figure 5.7. Using this re-ranking method we get a decrease of about 2% in the percentage of queries that experience a degrade in performance after expansion.


Figure 5.6: Average precision at top-20 documents when boosting based on relation to query concepts


Figure 5.7: Percentage of queries experiencing degrade in performance when boosting based on relation to query concepts


5.3.3.2 Boosting based on importance measure drawn from hierarchies

This measure performed surprisingly badly; the results were worse than random re-ranking. Nevertheless, areas of high concentration were successfully detected in the hierarchies. However, because the terms were not disambiguated, strange combinations of senses were captured and boosted by this method. Perhaps one of the most successful cases for this method was the inclusion of the term "reading" in the query "Alexandria's library"; however, this happened for the wrong reasons. "Reading" was boosted because a high concentration of concepts was discovered under the concept "city": "Alexandria", "Cairo" and the British town of "Reading" were detected.

To correctly evaluate this method as a query expansion method, perhaps manually disambiguated terms should be used. Note, however, that manually disambiguating all top-150 terms suggested by each probabilistic method for each query is a considerable task on its own. Perhaps the solution would be to query a sense-tagged document collection.

Nevertheless, because of this extreme sensitivity to WSD errors we do not discuss

this method as a method for query expansion any further.

5.3.3.3 Boosting based on network importance measure

In this section we present the results of semantic similarity based re-ranking, where semantic similarity is calculated by the extended Lesk measure (Banerjee and Pedersen, 2003) and the boosting score is derived from the importance of the concepts as measured by Pagerank with priors. In all results reported in this section we used a value of 0.5 for the parameter disambBeta. Figure 5.8 illustrates the average performance without mixing the score with the original probabilistic method for the various values of rankBeta. For all subsequent steps we used rankBeta = 0.99.

Figure 5.9 illustrates the average performance after re-ranking based on this criterion for the various values of the parameter mix. In figure 5.10 the percentage of queries experiencing degrade in performance is displayed. In both figures random re-ranking is included as a baseline and darker means better.

Figure 5.8: Average precision at top-20 documents when boosting based on network importance (mix = 1)

Figure 5.9: Average precision at top-20 documents when boosting based on network importance (rankBeta = 0.99)

Figure 5.10: Percentage of queries experiencing degrade in performance when boosting based on network importance

From figure 5.9 we can observe that the performance of this re-ranking method strongly depends on the probabilistic method used. For LCA, re-ranking causes a significant decrease in performance. For all other methods performance increases as more terms are considered. Unfortunately, performance does not seem to have reached its peak in the specific figures; probably when considering more terms the improvement would be more apparent. Nevertheless, performance is always better than random re-ranking. When MIX, TST or ENT is used and more than 80 terms are considered, there is an increase in performance over the original probabilistic method which ranges from 0 to 1%.

Chapter 6

Discussion and Conclusions

In this chapter we comment on the results and present the conclusions that can be drawn from our research regarding the properties of the ideal query, query expansion in general and ontological query expansion in particular.

6.1 Ideal query

As described earlier in this thesis, we explored the notion of the ideal query using two methods. The first was using Rocchio's equation and the second was randomly re-ranking terms proposed by probabilistic query expansion methods and selecting the best performing query. We manually clustered the top-15 terms of the ideal query based on semantic similarity. A surprising finding was that in a significant number of cases the original query terms were not included in the top-15 terms of the ideal query. This illustrates the query-document word mismatch described in the literature, even in well-formed queries. For example, in the query "red cross activities" the term "activities" does not appear in the list, but "aid" and "relief" do appear. This was a consistent finding: when query terms do not appear in the top query terms, a significant number of terms semantically similar to the missing term do appear. Another consistent finding was that the top terms of the ideal query tend to form semantically related clusters.

These two findings were confirmed by the inspection of the ideal queries proposed by the best randomly re-ranked method for deriving the ideal query.

A surprising finding exposed by the latter method of deriving the ideal query was that remarkable improvement can be achieved merely by re-ranking the terms suggested by the probabilistic methods. We used 20-term queries, and by using the top-20 terms suggested by probabilistic methods we got results of no more than 49% (a 5% improvement over the unexpanded query). The actual performance depended on the probabilistic method used. Nevertheless, regardless of the probabilistic method used, we can get a 60% performance (a 16% improvement) merely by re-ranking the top-30 terms. When re-ranking the top-150 terms, performance reaches the upper bound of Rocchio's method. In other words, probabilistic methods have high recall but low precision in locating good expansion terms: they detect very good expansion terms but also include some worse ones, and mixing them causes far less improvement than possible.

The first two findings were the main motivation of our attempt to use semantic similarity measures, so that such semantically related clusters are detected and boosted. The third finding justifies our decision to deploy ontologies (an independent method) to re-rank the terms suggested by the probabilistic methods.

6.2 Probabilistic methods

The probabilistic methods for query expansion tested were the well established Local Context Analysis and a proposed entropy based method. Local Context Analysis performed much worse in our setting than expected, causing an improvement of only about 3%, which is significantly lower than the more than 20% claimed by the original paper describing the method. The proposed entropy method performed better than LCA in almost any setting. However, although the performance of LCA improves as more documents are considered, we tested LCA with fewer documents than needed for it to reach its peak. The entropy based method reaches its peak more quickly, when only a few documents are considered. This property might be desirable in real applications where resource efficiency is an important issue. However, further exploration of this method under different settings would be required before such deployment. In the context of this thesis, ENT and its descendants (MIX and TST) were used simply to create a diverse set of methods for subsequent steps, and the variance in performance of the methods discussed later illustrates that although the performance of the methods is quite close, the suggested terms are diverse.

6.3 Pure ontological query expansion

For ontological query expansion in particular, the conclusion of (Voorhees, 1994) was verified by this research:

The most useful relations for query expansion are idiosyncratic to the particular query in the context of the particular document collection.

The contribution of this research for pure ontological query expansion is perhaps

an attempt to explain this idiosyncratic effect based on the analysis of the particular

queries in our test set.

Advocates of ontologies prize them because they make explicit ontological commitments. The simplest ontological commitment is perhaps the selection of the term to describe a concept, and more complex ontological commitments are related to how to define a concept and place it in taxonomies. Some ontologies might partition the concept human into male and female while others might choose to partition the same concept into child and adult. These are some decisions expressing ontological commitments, but note that they express a specific choice from a set of options. Explicit ontological commitments simplify knowledge sharing, promote consistency and allow automatic usage of knowledge. Nevertheless, these merits come with a cost.

Ontology mismatch and the need for sophisticated mapping between ontologies are two issues due to different ontologies making different ontological commitments, and they are discussed in great detail in the ontological literature. In the context of our research the important issue is that there are many alternative options for a specific ontological commitment, and a single ontology usually makes an explicit commitment to a single approach. People deploy what seems to be an endless variety, and what proves to be a surprisingly effective set, of heuristics to select the appropriate commitments in the specific context. Automatic methods, on the other hand, use single ontologies that either make single decisions for each ontological commitment (such as the commitment to a single taxonomy when using WordNet) or use ontologies that allow multiple options for each ontological commitment (such as the commitment to multiple terms describing a single concept in WordNet). Presumably using multiple options at the same time is more appropriate, but note that this adds the complexity of selecting the appropriate commitment in the context of the query, which is usually a very difficult problem. The performance of WSD algorithms is a good indication of this difficulty.

Thus we feel that the idiosyncratic effect is not related to the type of relation used, but rather to whether the ontological commitments implied by the query and the documents actually match the ontological commitments made by the specific ontology used.

It is always good to expand using hyponyms, but the question is in which taxonomy. Consider the query "animal protection". In the context of this query, protection of antelopes, lions, birds, fish etc. is relevant. But what about humans? The concept human was consistently and significantly boosted in all ontological approaches tested for this query, and that was presumably correct, but the reasons behind this boosting were wrong.

The query seems to imply an exclusion of humans from animals. It seems that in the context of the query, human is a sibling of animal, perhaps under the common parent concept "organism, being", and thus protection of humans is not related to the query. However, one could argue that, under different circumstances, human would be implied to be a kind of animal by the query, and thus protection of humans would be relevant. As far as the query and the ontology are concerned this is not an irrational assumption; it is our prior knowledge and the documents that make this assumption inappropriate. The concept human is widely referred to in both the actually relevant documents and the documents returned by the initial query. Nevertheless, humans are not mentioned as kinds of animals but rather as agents of protection; the documents are about human activities for animal protection rather than protection of humans. The documents imply a taxonomy where the dominant location of humans is under agent and not animal. Thus the concept human is and should be boosted, but because of the relation of human to "protection" and not because of its relation to "animal".


The pure ontological approaches to expansion, as described in the literature, make no attempt to detect whether a relation between terms is actually valid in the context of the specific query. Even if that were attempted (although it is not easy to see how), the

best possible outcome would be to correctly filter the ontological commitments made

in the ontology to the current context and use only those that are appropriate. Thus

for such a method to work the appropriate commitment should exist in the ontology.

Finally, even if all the commitments are there, the inter-annotator agreement for WSD

is an indication of the precision to be expected for such a method.

This is not to say that pure ontological expansion is not worth further exploring.

It is rather to explain the idiosyncratic effect, stress the difficulty of the problem and

justify the more cautious usage of ontologies in this thesis.

6.4 Hybrid query expansion

There is a significant difference between using a relation and looking for an indication of a relation.

Using a relation requires understanding the exact properties of that relation. For ex-

ample, expanding with terms that map to hyponyms of query terms requires correctly disambiguating the query terms and probably deciding whether the proposed term is actually a hyponym in the context of the query and the documents used. That is, to use a relation we must decide what is related, how it is related, and whether that relation seems to

hold in the specific context. The failure of our proposed “Boosting based on impor-

tance measure drawn from hierarchies” was mostly due to not even attempting to an-

swer these questions. All possible concepts that mapped to the detected terms were added, and all possible positions in the hierarchies were considered for these concepts. Areas of higher density were discovered, but they usually contained wrongly disambiguated or misplaced concepts.

Looking for an indication of a relation is the approach followed by the more successful methods described in this thesis. In “Boosting based on relation to query con-

cepts” we attempt to favour terms that seem to be related to query terms. Note

that instead of starting from ontology relations, this approach starts from the suggested


terms and only re-ranks them if there is some indication of relation. In “Boosting based

on network importance measure” we attempt to derive from the ontology a presumably

more abstract similarity measure. Thus instead of using the specific relations directly, we use those relations to derive an estimate of similarity and use that similarity as an indication

of relation.

6.4.1 Boosting based on relation to query concepts

In figure 5.6 we present the performance when the score of each term is fully determined by the probabilistic method (mix = 0.00), when it is fully determined by the ontological method (mix = 1.00), and several cases where the score is mixed. The exact value of mix when 0.00 < mix < 1.00 is not of great importance, as it depends on how the two original scores are scaled. Nevertheless, it is clear that mixing the scores almost always gives better performance than determining the score fully by either of the original methods.
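As an illustration, the kind of score interpolation we have in mind can be sketched as follows. This is a minimal sketch, assuming simple linear interpolation of two scores already normalised to comparable ranges; the function name, toy terms, and toy scores are ours, not the actual implementation:

    def mixed_score(prob_score, onto_score, mix=0.5):
        # mix = 0.00 -> purely probabilistic, mix = 1.00 -> purely ontological.
        return (1.0 - mix) * prob_score + mix * onto_score

    # Toy candidates: term -> (probabilistic score, ontological boosting score).
    candidates = {"wildlife": (0.9, 0.2), "habitat": (0.6, 0.7), "said": (0.8, 0.0)}
    ranked = sorted(candidates,
                    key=lambda t: mixed_score(*candidates[t], mix=0.4),
                    reverse=True)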

Parameters of the ontological method (how much to boost each specific kind of relation) should be considered near-optimal, as they were trained on the actual queries.

Nevertheless, the actual construction of the query should not be considered optimal. Using only this boosting score, we simply expand using a specific order of relations, since the trained parameter values form an ordering of relations: synonyms are used first; if they are not enough to complete the 20-term query, then children in part-of hierarchies are used; if those are not enough either, the rest of the relations are considered. Perhaps this ordering is not optimal, and that would justify the low performance when only the ontological boosting score is used.
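The tiered construction just described can be sketched as follows; the relation-lookup functions are hypothetical stand-ins for whatever the ontology interface provides, and the tier order reflects the trained ordering (synonyms, then part-of children, then the remaining relations):

    def expand_by_relation_order(query_terms, relation_lookups, max_terms=20):
        # Exhaust one relation type for all query terms before falling
        # back to the next relation type in the trained ordering.
        # relation_lookups might be, e.g.,
        # [synonyms_of, part_of_children_of, other_relations_of].
        expansion = []
        for lookup in relation_lookups:
            for term in query_terms:
                for candidate in lookup(term):
                    if candidate not in expansion and candidate not in query_terms:
                        expansion.append(candidate)
                        if len(expansion) >= max_terms:
                            return expansion
        return expansion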

When the score is mixed with the original probabilistic score a significant improve-

ment is observed. However, depending on the probabilistic method used, the best performance is reached at different numbers of considered terms. For “ENT” and “TST” performance improves regardless of the number of terms used and reaches its peak at about 100 terms. “MIX” and “LCA”, on the other hand, show more unstable performance.


6.4.2 Boosting based on network importance measure

An alternative approach, using ontologies without relying on specific relations, was explored in this thesis. This approach can be seen as an attempt to use the hierarchies in the ontology in a more cautious way and to break through hierarchy boundaries via the association of keywords and key phrases with each ontology concept.

where each concept in the ontology is associated with a set of tags (key words and

phrases). In such an ontology we could find concepts related to query terms based on

whether concepts share the same tags. Few ontologies use tags, and augmenting every concept of an existing ontology with tags would be a difficult and time-consuming task.

Most ontologies include a definition for each concept; thus the definition can be used to approximate the assignment of keywords and key phrases to each concept. Banerjee and Pedersen (2003) calculate semantic similarity based on definition overlap and location in taxonomies, thus combining these two approaches.
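As a concrete illustration of the definition-overlap idea, here is a minimal sketch of a simplified Lesk-style overlap (not the full extended gloss overlap of Banerjee and Pedersen, which also follows taxonomy links) using NLTK's WordNet interface; the tiny stopword list is an arbitrary choice of ours:

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    STOPWORDS = {"a", "an", "the", "of", "or", "and", "in", "to", "that", "is"}

    def gloss_overlap(synset_a, synset_b):
        # Count content words shared by the two synsets' definitions.
        words_a = {w for w in synset_a.definition().lower().split() if w not in STOPWORDS}
        words_b = {w for w in synset_b.definition().lower().split() if w not in STOPWORDS}
        return len(words_a & words_b)

    # Example: compare the first senses of "animal" and "human".
    print(gloss_overlap(wn.synsets("animal")[0], wn.synsets("human")[0]))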

However, not using specific relations requires a new paradigm for selecting terms.

The simplest approach would be to select the concepts most similar to query concepts.

However, motivated by the absence of query terms in some ideal queries, we explored a method that also considers concepts not immediately related to query terms if there is a high concentration of such concepts in a specific semantic area. This was implemented

by building a network where nodes are concepts and edges are weighted according to

semantic similarity and, finally, using Pagerank with priors to detect important nodes

in that network. The control parameter determining the degree to which independent areas were considered was the rankBeta parameter. A very high rankBeta assigns significant prior importance to query concepts. A very low rankBeta minimises the prior importance of query concepts and allows importance weight to move freely through the network.
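A minimal sketch of this construction, using networkx's PageRank with a personalization vector; the similarity function is assumed given (e.g. a cached gloss-overlap measure), and since in networkx the teleport probability to the prior is 1 - alpha, a high rankBeta corresponds to a low alpha:

    import networkx as nx
    from itertools import combinations

    def rank_concepts(concepts, query_concepts, similarity, rank_beta=0.8):
        # Nodes are concepts; edges are weighted by semantic similarity.
        graph = nx.Graph()
        graph.add_nodes_from(concepts)
        for a, b in combinations(concepts, 2):
            w = similarity(a, b)
            if w > 0:
                graph.add_edge(a, b, weight=w)
        # Prior importance sits entirely on the query concepts.
        prior = {c: (1.0 if c in query_concepts else 0.0) for c in graph}
        # High rank_beta = strong pull back to the prior = low alpha.
        return nx.pagerank(graph, alpha=1.0 - rank_beta,
                           personalization=prior, weight="weight")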

As shown in the results section, when low prior importance is used the performance is rather unstable. Some excellent results were accomplished, but a small change in parameter values caused a dramatic change in performance. This instability is to be expected given a fundamental fallacy in our initial motivation. Terms in the ideal query tend to form semantically related clusters, but that does not mean that the clusters formed by terms extracted by a probabilistic query expansion method will be the same clusters as


those of the ideal query. In other words, when importance is allowed to move freely

the clusters might or might not be the correct ones. Actually the only indication avail-

able in this model that they are related and correct is their relation to query terms. To express the need for such an indication we need a high rankBeta.

When used with a very high rankBeta, the results are generally much better and more

consistent. Actually there seems to be a consistent improvement when more terms

from the initial probabilistic method are considered. Unfortunately, due to time and

resource limitations we were not able to push this approach to its limits. The main time-

consuming process was that of calculating semantic similarity using the Banerjee and Pedersen (2003) similarity measure. To tackle this issue we used caching; however, a significant amount of time and space was required for this cache. In real-world deployment of this method this would not be a problem, as the cache only needs to be built once and can be built off-line. What might be a problem with

this approach would be the on-line calculations needed, namely running Pagerank on

the network of the top-n concepts. Further exploration of this method regarding its

time-efficiency is required before actually deploying it.
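One way to realise such an off-line cache is a persistent key-value store over unordered concept pairs; a minimal sketch, where similarity stands in for the expensive semantic similarity computation and the cache file name is arbitrary:

    import shelve

    def cached_similarity(a, b, similarity, cache_path="similarity.cache"):
        # Unordered pair key, so sim(a, b) and sim(b, a) share one entry.
        key = "|".join(sorted((a, b)))
        with shelve.open(cache_path) as cache:
            if key not in cache:
                cache[key] = similarity(a, b)  # expensive call, at most once per pair
            return cache[key]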

Note that selecting concepts with a very high rankBeta is not equivalent to simply selecting the concepts most related to query concepts. Using Pagerank adds two important properties: firstly, concepts related to all query concepts are favoured, and secondly, inter-concept relations are considered. If two concepts are equally related to the query concepts but one of them has stronger relations to many non-query concepts, then that concept will be preferred.

In general this approach produced a significant and consistent improvement over the original probabilistic methods when many terms are used, and a desirable property is

its resistance to disambiguation errors. Nevertheless, the initial probabilistic method

used to extract terms has a great impact on performance.

Another issue deserving further exploration for this method is whether the inclusion

of more terms improves performance because there is a greater variety of terms to

select from or because the Pagerank algorithm performs better with more terms.


6.4.3 The effect of the probabilistic method

Both methods for ontology-based boosting proved to be very sensitive to the probabilistic method used to extract the initial terms. For both methods ENT produces the best and most stable results. LCA, on the other hand, produces unstable improvement, if any. For some parameter settings we get an improvement while for others we

do not. It is difficult to settle on some parameters that could be generally used when

LCA is used.

To explain this effect we provided the baseline of random reranking of terms. Us-

ing this measure we can see that LCA performs much worse than all other methods

when its top-n terms are randomly re-ranked. This illustrates that the quality of LCA-suggested terms falls as we move towards lower-ranked terms. Presumably this indicates a good ranking method and justifies the use of a weighting scheme within LCA. Nevertheless, in the context of our method this means that the top-n terms are al-

ready well ranked and thus getting a gain from reranking those terms is more difficult.

Another issue with LCA is that it detects rare terms strongly related to query terms.

Because the terms are rare, they are quite often not included in the ontology, and that minimises the information that can be used for reranking.

Finally, compared to ENT, the terms suggested by LCA are actually more semantically related to the query. That narrows the space of possible improvement from reranking based on semantic relatedness. As mentioned before, many terms suggested by ENT are not actually relevant to the query (“home”, “site”, “said”, “reports”, etc.). The ontology-based method can penalise these terms for ENT and thus produce a better query. When LCA is used, more sophisticated reasoning is needed to actually improve through re-ranking; a rough “looks related” measure would fail on terms suggested by LCA because all terms already “look related” or we do not know whether they are related (rare names and places).

6.5 A note on our adapted version of Pagerank

Throughout this thesis we used an adapted version of Pagerank that works with weighted edges and priors. We found it to be a surprisingly expressive


model. To name a few examples, it was very easy to express the following (a plausible form of the update rule is sketched after the list):

• do a mutual disambiguation of query terms unless the rest of the terms strongly

suggest otherwise.

• do not use a single correct sense for each query term but rather use all possible senses at the same time, weighted by their probability according to a disambiguation method.

• detect if there is a bias towards specific query terms and attempt to compensate

for this bias.

• select concepts for expansion that are more related to query concepts but also

consider independent important clusters.
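A plausible form of the adapted update rule behind these behaviours is personalised PageRank over weighted edges; this is our reconstruction, not a formula taken from the implementation, with $\pi$ the prior vector, $w$ the edge weights, $N(v)$ the neighbours of $v$, and $\beta$ the prior mixing weight playing the role of rankBeta:

    p^{(t+1)}(v) \;=\; \beta\,\pi(v) \;+\; (1-\beta)\sum_{u \in N(v)} \frac{w(u,v)}{\sum_{x \in N(u)} w(u,x)}\, p^{(t)}(u)

Setting $\beta$ high pins importance to the prior (the query concepts); setting it low lets importance diffuse along the similarity-weighted edges, which matches the rankBeta behaviour described above.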

The ease with which many complex features and parameters can be incorporated into Pagerank makes it very attractive. Nevertheless, we found it to have two undesired properties

in our setting:

• it is difficult to understand why something goes wrong when the model does not

behave as expected.

• there is no indication of whether all the features and parameters we added in the

model are combined optimally.

Perhaps the first property would not be so important if we had some indication that the

parameters were optimally combined. In the absence of such indication, in numerous

cases we were tempted to slightly change some weight so that a specific case would work as expected, and usually that led to a decrease in average performance.

We feel that this is a significant disadvantage of Pagerank, at least for our multi-

featured model. If we had more time we would attempt to take the features which, through the use of Pagerank, proved to be useful, and use them to train and test other models. Note,

however, that expressing some features outside Pagerank is a difficult task.

Chapter 7

Summary

In this thesis, we explored several query expansion methods. We reviewed some of

the most important probabilistic methods and proposed a simple but effective entropy-based probabilistic method. We also explored the use of ontologies for query expansion and word sense disambiguation. Finally, we explored the notion of the ideal query and attempted to discover the properties of that query.

The results of our experiments showed that the hypotheses posed in the introduction of this thesis hold, but not to the extent that we expected. H2 was about an entropy-based probabilistic method and proved to hold significantly. H3 was about disambiguating query terms using network importance algorithms and additional terms extracted from a probabilistic expansion method. H3 proved to hold to some extent; however, the additional information used by this method does not seem to be optimally exploited.

Our main hypothesis H1 was that a hybrid query expansion method that uses on-

tologies to re-rank terms suggested by a probabilistic method would outperform the

original probabilistic method. That is, ontology-based reranking would add some additional gain to query expansion performance.

H1 proved to hold, but not under all settings. We implemented three reranking methods: one focusing on the relation to query terms, one focusing on the concentration of concepts in the hierarchy, and one using semantic similarity and network importance algorithms. The first method trained the weights for each relation from

the actual queries. The second method compared concentrations in the context of the



query to the same concentrations in the context of the whole document collection. The third method did not use the ontology relations directly; it only used them to derive a semantic similarity score and used only that measure in subsequent steps.

Of these methods only the first and third caused improvement. This improvement is significant (up to doubling the gain of the expansion process compared to using only the probabilistic method). However, this depends on the probabilistic method; for some methods we do not get a stable improvement. Our proposed entropy-based probabilistic method performed very well on its own, and re-ranking its terms added a stable and significant additional gain. When LCA was used as the probabilistic method, the gain from reranking was not significant and was very unstable.

Bibliography

Attar, R. and Fraenkel, A. (1977). Local feedback in full-text retrieval systems. Journal of the ACM, 24(3):397–417.

Banerjee, S. and Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), pages 805–810.

Buckley, C., Mitra, M., Walz, J., and Cardie, C. (1998). Using clustering and superconcepts within SMART. In Voorhees, E., editor, Proceedings of the 6th Text Retrieval Conference (TREC-6), pages 107–124.

Carpineto, C., Mori, R. D., Romano, G., and Bigi, B. (2001). An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1–27.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Furnas, G., Landauer, T., Gomez, L., and Dumais, S. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971.

Gonzalo, J., Verdejo, F., Chugur, I., and Cigarran, J. (1998). Indexing with WordNet synsets can improve text retrieval. In Proceedings of the COLING/ACL ’98 Workshop on Usage of WordNet for NLP.

Gruber, T. R. (1993). Towards principles for the design of ontologies used for knowledge sharing. In Guarino, N. and Poli, R., editors, Formal Ontology in Conceptual Analysis and Knowledge Representation, Deventer, The Netherlands. Kluwer Academic Publishers.

Cui, H., Wen, J.-R., Nie, J.-Y., and Ma, W.-Y. (2002). Probabilistic query expansion using query logs. In Proceedings of the Eleventh International Conference on World Wide Web (2002), pages 325–332. ACM Press.

Haveliwala, T. H. (2002). Topic-sensitive PageRank. In Proceedings of the Eleventh International World Wide Web Conference.

Jing, Y. and Croft, W. (1994). An association thesaurus for information retrieval. In Proceedings of Intelligent Multimedia Information Retrieval Systems (RIAO 94), New York, NY, pages 146–160.

Jones, K. S. (1971). Automatic keyword classification for information retrieval. Butterworths, London, UK.

Kruschwitz, U. and Al-Bakour, H. (2004). Users want more sophisticated search assistants: results of a task-based evaluation. Journal of the American Society for Information Science and Technology (JASIST).

Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26.

Lu, A., Ayoub, M., and Dong, J. (1997). Ad hoc experiments using Eureka. In Proceedings of the 5th Text Retrieval Conference, pages 229–240.

Mahler, D. (2003). Holistic query expansion using graphical models. In New Directions in Question Answering, chapter 24.

Maki, W., McKinley, L., and Thompson, A. (2004). Semantic distance norms computed from an electronic dictionary (WordNet). Behavior Research Methods, Instruments, & Computers, 36:421–431.

Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing, pp. 294–307. The MIT Press.


Mihalcea, R., Tarau, P., and Figa, E. (2004). PageRank on semantic networks, with application to word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004).

Mitra, M., Singhal, A., and Buckley, C. (1998). Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 98), Melbourne, Australia, Aug. 24–28.

Navigli, R. and Velardi, P. (2003). An analysis of ontology-based query expansion strategies. In Workshop on Adaptive Text Extraction and Mining (ATEM 2003), at the 14th European Conference on Machine Learning (ECML 2003).

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Stanford Digital Libraries Working Paper.

Rocchio, J. J. (1971). Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313–323.

Sanderson, M. (1994). Word sense disambiguation and information retrieval. In 17th International Conference on Research and Development in Information Retrieval.

Stokoe, C., Oakes, M., and Tait, J. (2003). Word sense disambiguation in information retrieval revisited. In Proceedings of the 26th ACM SIGIR Conference, pages 159–166.

Voorhees, E. (1994). Query expansion using lexical-semantic relations. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, pages 61–69.

Winston, M., Chaffin, R., and Herrmann, D. (1987). A taxonomy of part-whole relations. Cognitive Science, 11:417–444.

Xu, J. and Croft, W. (2000). Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems (TOIS), 18(1):79–112.