Page 1: Seeking Optimal Search Strategy and Result Representation ... › ~feit › papers › BoWclass07TR.pdf · Seeking Optimal Search Strategy and Result Representation in BoW Maayan

Seeking Optimal Search Strategy and Result Representation in BoW

Maayan Zhitomirsky-Geffet∗, Eitan Frachtenberg, Yair Wiseman⋄, and Dror G. Feitelson∗

∗School of Computer Science and Engineering, Hebrew University

⋄Department of Computer Science, Bar-Ilan University

Abstract

One of the biggest concerns of modern information retrieval systems is reducing the user effort required for manual traversal and filtering of long matching document lists. In this paper we propose an alternative approach for compact and concise representation of search results, which we implemented in the BoW on-line bibliographical repository. The BoW repository is based on a hierarchical concept index to which entries are linked. The key idea is that searching in the hierarchical repository should take advantage of the repository structure and return matching topics from the hierarchy, rather than just a long list of entries. Likewise, when new entries are inserted, a search for relevant topics to which they should be linked is required. Therefore, a similar hierarchical scheme for query-topic matching can be applied to both tasks. However, our experiments show that the different query types used for these tasks are best treated by different topic ranking functions. For example, keyword search, which is typically based on short (1-3 word) queries, requires a weight-based (rather than Boolean) ranking approach. The underlying rationale of weight-based ranking is that for a truly relevant topic all (or almost all) the query terms should appear in its vector representation, with approximately even, high weights. Applying this reasoning to the topic ranking method is shown to significantly increase the precision and the F1 (by over 30%) for short keyword queries compared to the baseline Boolean ranking metric.

Keywords: Multi-level hierarchical search and indexing, Boolean search, Weight-based search, Bibliographical repository, Topic ranking, Result representation.

1 Introduction

An obvious and natural approach to organize a large repository of data is to use a hierarchical structure, which typically reflects the logical structure of the data. At the same time, prevalent search facilities usually ignore the underlying hierarchical structure when presenting search results. Instead, they rank the retrieved items according to some relevance or importance metric, and present the user with a linear list of results, which is typically quite long. Because of the vast amounts of information on almost all topics, one cannot systematically go over the whole set of results, and


therefore must rely on the ordering of the results by the search engine. Hence, one of the biggest challenges for modern information retrieval systems is handling the tradeoff between generating an accurate and concise list of matching search results on the one hand, and making this list complete on the other.

In this work we propose an alternative solution to improve search result representation. We suggest that given a hierarchical structure, it is desirable for search procedures to point to relevant locations within this hierarchy, rather than providing a flat and disconnected listing of individual results. For example, in the context of searching pictures, a query of “baby” may return pointers to a couple of albums predominantly filled with baby pictures, rather than a mixed list of individual pictures from these and other albums. This approach provides the user with a wider context of related documents, within which the best data to answer the query can be found.

Searching within a hierarchy has two independent uses. One is for retrieval of information as suggested above. The other is for insertion of new data — essentially on-line indexing, where new items are added to the repository and need to be linked to the most relevant locations in the hierarchy. Relevant locations can be found by simply using the new item to define a query and then utilizing the same technique for item insertion as for search. In either case, the most relevant locations in the hierarchy can be indicated by graphical cues that make them stand out from the general structure. For example, we use increased font size as illustrated below in Figure 2.

The core of such a hierarchical search method is the ranking function that determines the selection of the most relevant topics at every level of the index. We experimented with a variety of ranking functions, including a classic Boolean ranking approach along with more complex weight-based methods (see Section 4). The main idea of the proposed weight-based techniques is that the most relevant topics for a query are expected to contain many of the query terms at the top ranks of their keyword vectors. The top ranks are determined by the highest weights of keywords for a topic. In our experiments the Boolean ranking was used as a baseline, and the weight-based functions were constructed such that each of them reflected and tested the impact of some additional weighting factor on search performance.

Thus, another important contribution of this paper is a comparative study of various topic ranking strategies, and their impact on different types of queries. For example, we found that Boolean search is more effective for very long queries and for queries on authors, while weight-based search approaches perform substantially better (by over 30%) for short (2-3 word) keyword queries. Our ideas were implemented and evaluated within the BoW on-line hierarchical bibliographic repository, and a significant increase in performance was obtained by the best ranking metrics for different tasks and query types.

2 The BoW Bibliographical Repository

To illustrate and evaluate our developments we used an on-line bibliographical repository, called BoW, dedicated to the somewhat limited domain of parallel systems. BoW stands for “Bibliography on the Web”. The goal of the BoW project [9, 10] is to create a user-friendly working environment for the construction, use, and maintenance of an on-line bibliographical repository. The key idea is that this be a communal effort shared by all the users. Thus, every user can benefit from the


input and experience of other users, and can also make contributions. In fact, the system tabulates user activity, so merely searching through the repository and exporting selected items already contributes to their ranking in terms of user interest.

BoW is unique in organizing its data in a hierarchical concept index, and returning concepts from this index, rather than a list of individual items, in response to queries. In addition to the above concept index, BoW automatically constructs a corresponding index of keywords which represent the semantic vocabulary of each topic. These keywords are selected according to an iterative unsupervised learning algorithm, developed in our previous work [11]. This algorithm was optimized by utilizing some specific features and the structure of the BoW repository, and as such was shown to be well suited to the given retrieval tasks [11]. A prototype implementation is available at http://www.bow.cs.huji.ac.il.

2.1 The BoW System Architecture

The entries in the BoW repository are surrogates for scientific publications: journal papers, conference papers, and books. Each entry contains the publication’s authors, title, publication details (journal or conference, volume, pages, date), and possibly a brief user annotation. Full text is not stored as part of the repository, but external links are supported. The search and indexing procedures described below only use the stored data, namely authors, title, and annotations. This provides enough data to work with while reducing the amount of data that needs to be handled [14, ?].

The heart of the BoW repository is a deep (multi-level) hierarchical index spanning the whole domain. The nodes in the hierarchy are called concept pages. Pages near the top of the hierarchy represent broad concepts, while those near the bottom represent narrower concepts. The depth of the hierarchy should be sufficient so that the bottommost pages only contain a handful of tightly related entries (as opposed to Web directories such as Yahoo! and CORA [17], which are shallow relative to the number of documents they contain). Our prototype repository on parallel systems contains about 3500 entries, and the hierarchy has a typical depth of 4 or 5 (Figure 1).

A subtree containing all the concept pages reachable from a certain (high level) concept page is referred to as a topic. Entries can be linked to multiple concept pages, if they pertain to multiple concepts. Likewise, they can be linked at different levels of the hierarchy, depending on their breadth and generality. The hierarchy is constructed manually by the site editor based on a thorough knowledge of the topic domain. The vocabulary used in the index and annotations is uncontrolled by the system, so users query the system using natural language [3].

Essentially the same structure is used to display search results, as shown in Figures 2 and 3. The righthand frame is used to display a list of matching entries, while the lefthand frame is used to indicate which topics are the most relevant for the query. Once identified and ranked, the relevant topics are displayed by opening the hierarchy until they are exposed, and emphasizing them by using a larger font (the larger the font, the higher the relevance of the topic to the query). In the case of author queries, the selected topics can be taken as a summary of the research areas in which the query author is active (Figure 2).

Each entry type in BoW has a customized form that allows the relevant data to be entered. Submitting this form has the side effect of performing a search based on the submitted data, in


[Figure 1 is a tree diagram of part of the concept hierarchy, with top-level nodes such as “Machines and Projects”, “Architectures”, and “Vector and Array Processing”, and sub-topics such as “Shared Memory Implementation”, “Memory Latency and Caching”, “Cache Coherence”, “Distributed Shared Memory (DSM)”, “Interconnection Networks”, and “Systolic Arrays”.]
Figure 1: The BoW concept hierarchy showing some of the structure of two top-level concepts.


Figure 2: Display of the results of an author search. The large panel on the left shows the concept index. The opened and emphasized topics identify the query author’s research areas. The righthand panel provides a list of documents co-authored by the query author. Clicking on an entry shows its details in the bottom panel.


Figure 3: Display of results of a keyword search. In the concept index, the most relevant topics are opened and emphasized with a larger font. Clicking on one of them shows the entries it contains in the righthand panel, with those matching the query emphasized.

order to identify concept pages to which the entry may be linked. However, the actual linking is left to the discretion of the user. This is done by displaying the topics in the search results with checkbuttons next to them; selecting a topic by marking its checkbutton indicates that a link should be created from this topic to the new entry.


2.2 The Automatic Keyword Index Construction

In parallel to the hierarchy of concept pages, a hierarchical index of characterizing keyword vectors for each topic is constructed. This index has the same structure as the hierarchy of concept pages, and is in fact based on its contents. Each node in the index is a vector of keywords which represent the vocabulary of the corresponding topic in the hierarchy. Since each topic encompasses all the concept pages and entries in a sub-tree of the hierarchy, all these sub-topics and entries should be taken into account when constructing its keyword vector. The keywords are selected automatically as the most relatively significant words for this topic, which also differentiate it from its sibling topics [11]. The group of sibling topics, located at the same level and having the same parent in the hierarchy, is called a competitive set, since they compete for keywords with each other. While processing the page contents we look for all the five-grams of letters inside a word, shifting right letter by letter from the beginning to the end of the page. For example, “algorithm” will be turned into “algor”, “lgori”, “gorit”, “orith”, and “rithm”. From now on the terms “five-gram” and “word” will be used interchangeably, except in Section 4 where we need to consider the words that appear in the original query.
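The five-gram extraction can be sketched as follows. This is a minimal illustration; the function name and the handling of words shorter than five letters are our own assumptions, not taken from the BoW code.

```python
def five_grams(word):
    """Slide a five-letter window over the word, one letter at a time.
    Words of five letters or fewer are kept whole (an assumption; the
    text does not specify this case)."""
    if len(word) <= 5:
        return [word]
    return [word[i:i + 5] for i in range(len(word) - 4)]

five_grams("algorithm")  # ['algor', 'lgori', 'gorit', 'orith', 'rithm']
```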

Our data is semi-structured, with special fields for authors, journals, titles, topic headings, and sub-topics. Such fields can be easily extracted from a concept page and given a special (extra) weight or a special treatment. For example, since many queries are topical it is important to ensure that topics with query words in the title are assigned substantially higher weights than those including query words only in the content. Thus, for a given word w and a topic t, the word’s weight in the topic’s vocabulary, Voc_t(w), is calculated as follows:

Voc_t(w) = termfreq(w, t) + intitle(w, t) · [A + B · subtopics(t) + entries(t)]

intitle is a Boolean predicate that evaluates to 1 if word w is in topic t’s title. This then adds the terms in the square brackets to the weight, including a constant A and two additional terms that reflect the topic’s size. In our experiments the constant A was set to 100 and B to 5. Since the author field carries some important information, authors should be identified and handled as keywords during the search. We also constructed and employed an acronym thesaurus, since our data contains many names of projects, systems, and tools which are often referred to by acronyms. The idea of this sort of index is to construct a purely content-reflecting language, while dropping out all the meaningless words. One may wonder why not use the full text vocabulary of the topics for indexing purposes. However, previous work has shown that a significant increase in accuracy on the one hand and a real decrease in computational cost on the other can be achieved by reducing the size of the vectors [15]. The initial construction of the keyword index and its updates are executed off-line, repeated at regular intervals.
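As a concrete reading of the formula, with the values A = 100 and B = 5 used in our experiments (the function and argument names below are ours):

```python
def voc_weight(term_freq, in_title, subtopics, entries, A=100, B=5):
    """Weight Voc_t(w) of a word w in topic t: its term frequency, plus a
    title bonus that grows with the topic's size (number of sub-topics
    and entries) when w appears in t's title."""
    return term_freq + in_title * (A + B * subtopics + entries)

# A word occurring 4 times and appearing in the title of a topic
# with 3 sub-topics and 20 entries:
voc_weight(term_freq=4, in_title=1, subtopics=3, entries=20)  # 139
# The same word without the title match:
voc_weight(term_freq=4, in_title=0, subtopics=3, entries=20)  # 4
```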

In the on-line phase, for each user action (search or new entry insertion) a query is created and handled by the matching procedure. In essence, the query vocabulary vector is matched against the keyword index vectors in a vector-based manner [11]. The search then proceeds recursively from the root topics, choosing the most suitable sub-topic(s) at each point. This approach provides better accuracy than the traditional flat query-document matching schemes over a structured document corpus [15, 18]. The main advantage of the hierarchical method is that at every stage the set of


sub-topics to be investigated next is pruned, and the decision to be made by the classification process is simplified and more focused.
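The recursive descent can be sketched as follows. This is our own simplified rendering: the hypothetical `Topic` class, the plain overlap score, and the fixed threshold stand in for the actual BoW implementation, which uses the ranking functions of Section 4.

```python
class Topic:
    """A node of the concept hierarchy (illustrative data structure)."""
    def __init__(self, name, keywords, children=()):
        self.name = name
        self.keywords = set(keywords)
        self.children = list(children)

def overlap_score(topic, query):
    # Stand-in ranking function: count of shared keywords.
    return len(topic.keywords & query)

def search(topic, query, score, threshold, results):
    """Descend recursively; sub-trees of topics scoring below the
    threshold are pruned and never examined."""
    s = score(topic, query)
    if s < threshold:
        return
    results.append((topic.name, s))
    for child in topic.children:
        search(child, query, score, threshold, results)

leaf = Topic("Cache Coherence", {"cache", "coher"})
root = Topic("Memory", {"memor", "cache"},
             [leaf, Topic("Paging", {"pagin"})])
out = []
search(root, {"cache", "coher"}, overlap_score, 1, out)
# out → [('Memory', 1), ('Cache Coherence', 2)]; 'Paging' was pruned
```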

3 Related Work

Despite the widespread popularity of hierarchies as a structural organization, there has been relatively little other work regarding search within a hierarchy, e.g. the works by Koller and Sahami [15], McCallum et al. [18], and Paynter et al. [20]. Another related area of research is document classification into pre-defined class hierarchies. Many researchers have applied supervised machine learning techniques, such as Bayesian and SVM-based approaches, to this task [15, 4, 23, 22, 6, 21, 12, 24, 5, 8].

However, most of the above classifiers have a common restrictive precondition: for each category a training set consisting of a significant number of labeled documents is required. Moreover, the number of labeled examples required to train a supervised learning algorithm is related to the size of the taxonomy. This problem becomes critical for larger taxonomies such as web directories comprising hundreds or thousands of nodes. Hence, some unsupervised methods that employ clustering into a web directory to bootstrap the supervised classification and reduce the amount of labeled data have been proposed, e.g. [1, 8, 13].

An additional important disadvantage of the above searching and classification approaches is that they return the end result by selecting the single most suitable class. This is overly restrictive, as a query may in fact match more than one topic. It may also cause unrecoverable errors in case of classification failure at some higher level of the hierarchy. In contrast, our system returns a set of nodes from the hierarchy. Also, in most of the existing repositories documents are assigned only to leaves, each sub-category may only belong to a single parent category, and each document belongs to one leaf. In contrast, BoW entries may be linked both to leaves and to internal concept pages, and each bibliographic entry and concept page may be associated with several parent topics in the hierarchy. Rousu et al. [22] and Cesa-Bianchi et al. [6] also considered the issue of “partial paths” classification and allowed for an entry to be classified at the internal nodes.

Finally, BoW, as described in Section 2, is distinct from the above mentioned tools in being a deep hierarchical repository of bibliographic entries. Thus, in addition to other constraints, our searching algorithm has to cope with the limited amount of data provided for each entry, while most of the previous work was designed to classify full-text documents.

4 The Topic Ranking Functions

As mentioned in Section 1, the proposed procedure for finding the best matching topics for a given query was designed to handle both searching and insertion of new entries in a similar manner. However, the optimal topic ranking function to be used by this matching procedure might vary according to the type of the provided query. For example, queries for the insertion task include the whole content of the new entry. This can typically include a dozen words or even more (which are further split into multiple corresponding five-grams). But queries for search are much shorter.


It has been observed that a typical web query contains only one to three words [2]. Some ranking functions may achieve higher recall while others are more precision-oriented. Another parameter to be considered is the content of the query, e.g. queries including author names or other proper nouns might require a different treatment than queries which consist of common noun keywords. Hence, in this section we propose a number of ranking functions that are designed to handle various query types, as further shown in Section 5.

Boolean Ranking: The basic approach is to calculate the topic score by counting the overlap of the query words QVoc with the topic’s keyword vector TKeys_topic:

score_topic = | QVoc ∩ TKeys_topic |
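In code, the Boolean score is a plain set intersection over five-grams (a sketch; the sample five-grams below are illustrative):

```python
def boolean_score(query_voc, topic_keys):
    """Boolean ranking: count how many query five-grams occur
    among the topic's keywords."""
    return len(set(query_voc) & set(topic_keys))

boolean_score({"optic", "ptica", "tical", "netwo", "etwor", "twork"},
              {"netwo", "etwor", "twork", "route"})  # 3
```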

The major drawback of this approach is that the score of a topic only reflects how many of the query words appear among the topic’s keywords. However, this information might not be sufficiently discriminative when handling short queries, which consequently leads to overly noisy results. In truth, keywords are not all equal in the degree to which they represent a topic: for example, a keyword that appears multiple times both in entries and in the topic title should carry much more weight than a keyword that appears only once in a single entry. Therefore, we propose four new variations of the topic ranking method based on frequencies of appearance (weights) of the query words in the topic. Note that, in particular, this will emphasize topics with keywords that appear in the topic title, because of the artificially inflated counts of words that appear in the title, as described in Section 2.2.

Version I - SumWeight: The first and simplest approach is to sum the five-gram weights Voc_topic(f) in the topic’s weighted vocabulary vector, rather than incrementing the score by one point for each matched five-gram. This leads to the following formula for the score:

score_topic = Σ_{f ∈ QVoc} Voc_topic(f)

where five-grams that do not appear in the topic vocabulary are given zero weight.
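A sketch of Version I, assuming the topic vocabulary is held as a five-gram-to-weight mapping (our representation, not necessarily BoW’s):

```python
def sumweight_score(query_voc, topic_voc):
    """Version I: sum the topic's weights of the query five-grams;
    five-grams absent from the topic vocabulary contribute zero."""
    return sum(topic_voc.get(f, 0) for f in query_voc)

sumweight_score({"netwo", "etwor", "optic"},
                {"netwo": 120, "etwor": 118})  # 120 + 118 + 0 = 238
```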

Observation of the experimental results reveals that many of the retrieved topics gained a high overall score only thanks to five-gram(s) representing one word of the query that had a very high weight in that topic, while the other query words have a very low or no score for the topic. Usually, in such cases the topic is not very relevant to the query and so should not be returned in the result. For example, a query on “optical network” may retrieve an irrelevant topic “Point-to-point networks”, since the word “network” appears in the title and therefore gains a high weight. Our expectation for a truly relevant topic is that it should include most if not all of the query words as keywords, preferably all with high weights.

The above considerations lead to the following three normalizations of the above topic weighting and ranking metric:


Version II - Norm1: The first normalization follows the rationale that all the query terms should be roughly evenly weighted in the topic vector. The problem is that both the query vocabulary QVoc and the topic vocabulary Voc_topic are expressed in five-grams. We therefore select the highest-weight five-gram to represent each word in the query. Denoting five-grams derived from query word w by g ∈ w, we define

weight_topic(w) = max_{g ∈ w} Voc_topic(g)

Using this, the weighted score for a topic will be

score_topic = [ Σ_{f ∈ QVoc} Voc_topic(f) ] · min_{w ∈ Q} {weight_topic(w)} / max_{w ∈ Q} {weight_topic(w)}

where Q is the original query (in words, not five-grams). In particular, if any query word is totally missing from the topic, the topic’s score will be 0.
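Version II can be sketched as follows (the five-gram splitting and the dictionary representation are our assumptions). Note how a missing query word zeroes the score, as in the “optical network” example above:

```python
def five_grams(word):
    # Words of five letters or fewer are kept whole (our assumption).
    return [word[i:i + 5] for i in range(max(1, len(word) - 4))]

def norm1_score(query_words, topic_voc):
    """Version II (Norm1): the Version I sum, scaled by the ratio of the
    minimum to the maximum of the best per-word five-gram weights."""
    qvoc = {g for w in query_words for g in five_grams(w)}
    base = sum(topic_voc.get(g, 0) for g in qvoc)
    per_word = [max(topic_voc.get(g, 0) for g in five_grams(w))
                for w in query_words]
    if max(per_word) == 0:
        return 0.0
    return base * min(per_word) / max(per_word)

topic = {"netwo": 90, "etwor": 88, "twork": 85}
norm1_score(["network"], topic)             # 263.0
norm1_score(["optical", "network"], topic)  # 0.0 - "optical" is missing
```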

Version III - Norm2: The second normalization does not exclude topics missing a query word. Instead, it adds a factor that ranks topics according to the number of query five-grams they contain. But in contrast with the simple binary criterion employed before, here the relative number of five-grams present is squared, to make this factor more discriminative and sensitive to every missing term:

score_topic = [ Σ_{f ∈ QVoc} Voc_topic(f) ] · [ | QVoc ∩ TKeys_topic | / | QVoc | ]²
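Version III in the same sketch style (dictionary representation assumed); a topic holding only two of four query five-grams retains a quarter of its weighted sum:

```python
def norm2_score(query_voc, topic_voc):
    """Version III (Norm2): the weighted sum times the squared fraction
    of query five-grams present among the topic's keywords."""
    base = sum(topic_voc.get(g, 0) for g in query_voc)
    present = sum(1 for g in query_voc if g in topic_voc)
    return base * (present / len(query_voc)) ** 2

norm2_score({"netwo", "etwor", "twork", "optic"},
            {"netwo": 90, "etwor": 88})  # (90 + 88) * (2/4)**2 = 44.5
```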

Version IV - Norm1&2: The third variation simply combines both of the above normalization factors into one formula:

score_topic = [ Σ_{f ∈ QVoc} Voc_topic(f) ] · ( min_{w ∈ Q} {weight_topic(w)} / max_{w ∈ Q} {weight_topic(w)} ) · [ | QVoc ∩ TKeys_topic | / | QVoc | ]²

Note that these weighting schemes mainly affect queries with two or more words. For one-word queries all the above formulas largely reduce to the initial form of summing the term weights (Version I).

5 Evaluation and Results

In our previous work [11] we applied the Boolean ranking function to the task of inserting new entries, using a 7-fold cross validation over a corpus of about 3500 bibliographic entries. The best average hit ratio for the top-level entries (with relatively large vocabularies) was 94.7% (±5.81), and 91.8% (±7.11) and 89.2% (±8.05) for the next two levels (having smaller vocabulary sizes),


One-word queries: backfilling, deadlock, Ethernet, grid, kernel, middleware, robustness, paging, workload, router, testing, scalability, protocol

Long queries: adaptive scheduling, cluster computing, parallel computing history, performance optimization, client-server, fortran compiler

Acronyms: LAN, LRU, DSM, MPI, SCSI, RP3

Queries with typos: gang sceduling, kernel treads, load balansing, memory letency, flow kontrol, usr interface

Authors: Bal E. Henri, Yang Yuanyuan, Reed Daniel, Van Steen Maarten, Patt Yale N., Bertossi Alan (A.), Mellor-Crummey John M.

Table 1: Examples of various query types used by the judges.

respectively. Thus, our experimental results corroborated those of McCallum et al. [18] that topics with larger vocabulary sizes generally perform better.

Manually checking the entries that were misclassified revealed that in many cases they were ambiguous, and had very short annotations that only included quite general terms. Hence, the second important finding is that the search and insertion accuracy is also influenced by the size of the query vocabulary. All the erroneous decisions were made for queries that were very small (a one-sentence annotation). Our expectation is that in this case the weight-based ranking may boost the performance.

5.1 Manual Evaluation of the Search Results

5.1.1 Experimental Setting

To further assess the above expectation we estimated and compared the performance of the proposed ranking metrics in a manual evaluation experiment. Finding suitable human assessors for the system evaluation was quite difficult, partly due to the narrow professional domain of the BoW material. Eventually, we managed to find two highly qualified judges, both experts in the field of Parallel Systems, who independently created and tested two sets of over 200 queries.

Each judge’s set comprised about 100 author-name queries and 100 keyword queries on the various subjects covered by the BoW repository. Approximately 50% of the keyword queries consisted of two or three words; the rest were one-word queries. There were also a few queries with typos (5%–8%, which seems like a reasonable relative number of typos for a typical user) in each set, and 10 acronyms (as is also quite typical for an average user). The acronyms were automatically interpreted by the system through the pre-computed thesaurus and converted to their full wording. Table 1 exemplifies the various query types used in the experiment.

The judges were guided to evaluate the query results for each of the four proposed ranking metrics and the baseline Boolean approach from Section 4. Two types of grading criteria were required for each query result:


Score   R*                                   P*
3       mostly relevant results              very few non-relevant results
2       sufficiently many relevant results   some non-relevant results
1       few relevant results                 many irrelevant results

Table 2: Scores used by the judges to evaluate query responses.

1. R* – corresponding to the subjective level of recall achieved for the query, i.e. how many relevant topics were retrieved. This is interpreted as being relative to what may be expected, based on an understanding of the domain and some knowledge of the concept hierarchy structure.

2. P* – corresponding to the subjective precision of the response, i.e. how many irrelevant results (“noise”) were also retrieved.

Scores were given numerically on a scale of 1 to 3 as specified in Table 2. These evaluation criteria require less user effort and allow for a more flexible estimation of the method’s performance than assigning a binary score of “relevant” / “non-relevant” for each individual result.

It is important to emphasize that the judges were not aware of the differences between the evaluated methods and had no knowledge of which one of them was the baseline and which were the new ones. The final product of the evaluation experiment for each judge was a table of approximately 200 queries across five search methods, with two grades for every query under each method.

5.1.2 Result Analysis and Discussion

In order to analyze the obtained results we calculated the average grades for each criterion, as graded by each judge, over different sets of queries. The full results are displayed in Table 3. For each criterion and query type, the top-graded method is indicated by boldface. Note that in some cases the baseline is better than the new methods, but only for criterion R* (subjective recall). We also calculated the average of the two criteria as a simple way to combine them.

Since the query sets were distinct for each judge, we could only measure their agreement by the average values of their grades, rather than by direct per-query grade correlation. On the other hand, we did measure the linear correlation between the grades given to queries of the same set under different pairs of algorithms, as shown in Table 4. As the various ranking methods may behave differently for queries of certain types (presented in Table 1), we also computed the corresponding figures for each query type separately.
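The per-pair correlation described here is the standard Pearson coefficient computed over the two methods' grade vectors for the same ordered query set. A self-contained sketch (the grade lists below are invented for illustration, not taken from the experiment):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length grade lists."""
    assert len(xs) == len(ys)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical R* grades one judge assigned to the same six queries
# under three ranking methods:
grades_by_method = {
    "Norm1":    [3, 2, 3, 1, 2, 3],
    "Norm2":    [3, 2, 3, 1, 2, 2],
    "Baseline": [3, 1, 2, 3, 1, 2],
}
r = pearson(grades_by_method["Norm1"], grades_by_method["Norm2"])
```

A value near 1 means the two methods graded the queries almost identically, which is how Table 4 identifies methods with similar behavior.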

As expected, the correlation coefficients between different versions' results show an almost complete correlation, for both judges, between the Norm1, Norm2, and Norm1&2 methods for one-word queries. This is because the normalization factors only have an influence for longer queries. Hence, we did not ask the judges to test one-word queries with the SumWeight version. Another special category is the acronyms. Similarly to the one-word queries, they achieve almost identical average grades for all the new methods, since they always appear in the topic vector either in their full form (including all the words) or in the acronym form.



Query type      Algorithm   Judge I               Judge II
                version     CR1    CR2    AVE     CR1    CR2    AVE

One-word        Norm1       2.277  2.246  2.261   2.466  2.666  2.566
queries         Norm2       2.265  2.253  2.259   2.466  2.666  2.566
                Norm1&2     2.289  2.246  2.267   2.466  2.666  2.566
                Baseline    2.050  1.929  1.990   2.533  1.733  2.133

Long keyword    SumWeight   1.925  2.195  2.060   2.033  2.383  2.208
queries         Norm1       1.850  2.480  2.165   1.983  2.483  2.233
(with typos)    Norm2       1.885  2.290  2.087   2.050  2.466  2.258
                Norm1&2     1.795  2.395  2.095   1.900  2.483  2.191
                Baseline    2.195  1.495  1.845   2.200  1.350  1.775

Long queries    SumWeight   1.955  2.205  2.080   1.900  2.360  2.130
excluding       Norm1       1.866  2.522  2.194   1.940  2.500  2.220
typos           Norm2       1.894  2.316  2.105   1.920  2.460  2.190
                Norm1&2     1.816  2.427  2.122   1.840  2.500  2.170
                Baseline    2.255  1.533  1.894   2.140  1.400  1.770

Acronyms        SumWeight   2.045  2.727  2.386   2.300  2.100  2.200
                Norm1       2.136  2.613  2.375   2.400  2.900  2.650
                Norm2       2.136  2.613  2.375   2.400  2.900  2.650
                Norm1&2     2.159  2.613  2.386   2.400  2.900  2.650
                Baseline    1.795  2.181  1.988   2.300  2.100  2.200

Long queries    SumWeight   1.946  2.290  2.118   2.071  2.342  2.207
with            Norm1       1.901  2.504  2.202   2.042  2.542  2.292
acronyms        Norm2       1.930  2.348  2.139   2.100  2.528  2.314
                Norm1&2     1.860  2.434  2.147   1.971  2.542  2.257
                Baseline    2.122  1.618  1.870   2.214  1.457  1.835

All keyword     SumWeight   1.925  2.195  2.060   2.049  2.393  2.221
queries         Norm1       2.094  2.372  2.233   2.170  2.580  2.375
                Norm2       2.102  2.300  2.201   2.210  2.570  2.390
                Norm1&2     2.080  2.338  2.209   2.120  2.580  2.350
                Baseline    2.086  1.778  1.932   2.310  1.540  1.925

Author          SumWeight   2.073  2.706  2.389   1.734  2.846  2.290
queries         Norm1       1.921  2.421  2.171   1.704  2.857  2.280
                Norm2       2.453  1.918  2.186   2.530  1.836  2.183
                Norm1&2     2.456  1.926  2.191   2.510  1.836  2.173
                Baseline    2.320  2.418  2.369   2.234  2.234  2.234

Table 3: Experimental results for various query types and algorithm versions, for the two judges. The best results among the weighting versions are marked by bold, and when the baseline results are higher than those of the new versions they are denoted by bold italics. Average (“AVE”) is the arithmetic average of the values of the two criteria.



Query type   Methods compared       Judge I           Judge II
                                    CR1     CR2      CR1     CR2

Authors      Baseline : Norm1       0.157   0.103    0.487   0.305
             Norm1 : Norm2          0.370   0.494    0.363   0.204
             Norm2 : Norm1&2        0.991   0.995    0.952   1.000
             Norm1&2 : Norm1        0.351   0.500    0.386   0.204
             Baseline : Norm2       0.163   0.157    0.597   0.638
             Baseline : Norm1&2     0.139   0.171    0.550   0.638
             Baseline : SumWeight   0.199   0.215    0.479   0.274
             Norm1 : SumWeight      0.685   0.284    0.903   0.969
             Norm2 : SumWeight      0.397   0.338    0.362   0.165
             Norm1&2 : SumWeight    0.389   0.327    0.296   0.165

One-word     Baseline : Norm1       0.550  -0.003    0.698   0.213
keyword      Norm1 : Norm2          0.985   0.991    1.000   1.000
queries      Norm2 : Norm1&2        0.951   0.976    1.000   1.000
             Norm1&2 : Norm1        0.943   0.970    1.000   1.000
             Baseline : Norm2       0.559  -0.036    0.698   0.213
             Baseline : Norm1&2     0.499   0.014    0.698   0.213

All          Baseline : Norm1       0.382   0.027    0.449   0.179
keyword      Norm1 : Norm2          0.824   0.843    0.639   0.475
queries      Norm2 : Norm1&2        0.831   0.838    0.718   0.597
             Norm1&2 : Norm1        0.912   0.937    0.930   0.872
             Baseline : Norm2       0.312   0.095    0.369   0.243
             Baseline : Norm1&2     0.373   0.086    0.394   0.157
             Baseline : SumWeight   0.155   0.342    0.289   0.001
             Norm1 : SumWeight      0.391   0.518    0.448   0.339
             Norm2 : SumWeight      0.606   0.754    0.867   0.647
             Norm1&2 : SumWeight    0.386   0.571    0.548   0.442

Table 4: The linear correlation (Pearson coefficients) between the judgment values of various methods, for each of the two judges. The methods with the highest correlation values for both judges are marked with bold; this indicates that the methods exhibit similar behavior.

For author queries there is a very high correlation between Norm2 and Norm1&2 (i.e., Norm2 is the dominant factor in Norm1&2), and for keywords between Norm1 and Norm1&2, for both judges. This could be explained by the fact that for authors the number of query words that appear in the topic is the more crucial factor. Thus, to recognize topics relevant to an author with high accuracy, the system should require that both the first name and the surname appear in the topic; otherwise the partially matching name may refer to a different person. As for keyword queries, in many cases all the query words occur in the inspected topic, so the relative word frequencies play the role of the most discriminative factor. In addition, we notice that none of the new methods correlates significantly with the baseline.



Query type    Judge I                                 Judge II
              Method   R%    P%    F1     IMP%       Method   R%    P%    F1     IMP%

One-word      Norm*    64.5  62.2  0.633  28.4       Norm*    73.3  83.2  0.779  57.4
queries       base     52.5  46.5  0.493             base     76.6  36.6  0.495

Long          Norm1    42.4  73.9  0.539  54.4       Norm2    52.5  73.3  0.612  125.8
queries       base     59.7  24.7  0.349             base     60.0  17.5  0.271

Long w/o      Norm1    43.3  76.0  0.552  47.2       Norm1    46.9  75.0  0.577  95.6
typos         base     62.7  26.7  0.375             base     57.0  19.9  0.295

Long w/       Norm1    45.0  75.1  0.563  41.1       Norm2    55.0  76.3  0.639  93.1
acronyms      base     56.1  30.9  0.399             base     60.7  22.8  0.331

All keyword   Norm1    54.7  68.5  0.608  34.2       Norm2    60.4  78.4  0.682  78.5
              base     54.3  38.8  0.453             base     65.5  27.0  0.382

Authors       SumW     53.7  85.3  0.659  -3.7       SumW     36.7  92.2  0.525  -14.8
queries       base     66.0  70.9  0.684             base     61.6  61.6  0.616

Table 5: Recall and precision (in %) and F1 for the best performing version for each case from Table 3 vs. the baseline. F1 is calculated by the standard IR formula as the harmonic mean of recall and precision. The improvement over the baseline F1 is presented in column “IMP%”. Note that for one-word queries all the new methods produced very similar results, which allows us to use any of them as the best method (denoted by Norm*).

While for one-word and acronym queries the new methods yield significantly higher average grades for both judgment criteria over the baseline, for longer queries the weight-based scores achieve a somewhat lower R* but a much higher P* compared to the baseline ranking. Thus, on average they significantly outperform the baseline. Both judges consistently evaluate the Norm1 metric as the one with the highest P* grades for all query types except authors.

We also observe that typos have almost no influence on the results for any ranking algorithm, including the baseline, since the favorable behavior of the system for typos is due to the use of n-grams (rather than whole words) and does not depend on any other parameter of the topic weighting strategy. Adding the acronym queries to the pool of 2–3-word queries leads to quite similar results as well.
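The robustness to typos can be illustrated with character n-grams. The sketch below uses a Dice coefficient over padded trigrams; this is an illustrative instance of the general n-gram idea, not necessarily the exact measure used in BoW:

```python
def ngrams(word, n=3):
    """Set of character n-grams of a word, padded at both ends."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words (0..1)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# A misspelled query term still scores high against the intended index term,
# because most of its trigrams survive the typo:
dice("schedulling", "scheduling")
```

Since matching is done on trigram overlap rather than exact word identity, a single inserted or substituted character only removes a few trigrams, leaving the similarity score high regardless of which topic weighting strategy is applied afterwards.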

For authors, on the other hand, the top average grades were produced by the baseline ranking, while both judges agree that the SumWeight strategy is the best of the new methods, and its grades are comparable with the baseline performance. This good behavior of the Boolean ranking could be explained by the different nature of the author queries, which, as opposed to regular keyword queries, are rather precisely specified and less ambiguous, since there are few authors with identical first names and surnames. In addition, once an author name appears in some entry of the topic it is automatically treated as a keyword by our indexing procedure, and thus cannot be missed or filtered out by the competition as may happen to entry content words. Therefore, even the Boolean ranking algorithm, which typically suffers from overly broad and noisy results, handles such focused queries quite well.



The overall improvement rates in terms of recall and precision are summarized in Table 5. The table presents the best algorithm performance for each query type and judge, and the improvement over the baseline. The metric used is F1, the harmonic mean of precision and recall. We use the subjective R* and P* as approximations for recall and precision. In order to compute them as percentages, all the grades were mapped onto a scale of [0..1] simply by subtracting 1 and then dividing by 2 (because the original scale was [1..3]).
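The grade mapping and F1 computation are straightforward to sketch. The average grades 2.29 and 2.244 below are illustrative round-offs chosen to reproduce, up to rounding, the one-word-query figures of Judge I in Table 5 (64.5%, 62.2%, F1 = 0.633):

```python
def grade_to_fraction(grade):
    """Map a judge's grade on the [1..3] scale onto [0..1]."""
    return (grade - 1) / 2

def f1(recall, precision):
    """Standard IR formula: harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# Average grades of about 2.29 (R*) and 2.244 (P*) map to
# 64.5% and 62.2%, giving an F1 of roughly 0.633
r = grade_to_fraction(2.29)    # 0.645
p = grade_to_fraction(2.244)   # 0.622
score = f1(r, p)
```

Note that because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why the baseline's high recall cannot compensate for its low precision in Table 5.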

Overall, the results of the new ranking methods show a substantial increase in precision (by up to 55 percentage points), reaching 68–78% precision for keywords (compared to 27–39% for the baseline) and 85–92% precision for authors (compared to 61–71% for the baseline), with relatively little loss in recall (up to 19 percentage points). As shown in the table, for 2–3-word queries the precision is three to four times higher with the new methods. This consequently leads to improved F1 values for keyword queries (according to both judges' average grades). For authors, the new methods' performance is slightly worse than the baseline.

Remarkably, our experiment shows quite compatible results for both judges in a variety of cases and aspects. Specifically,

1. The weight-based methods always improve precision over the Boolean baseline.

2. For all the keyword queries, Norm1 yields the highest precision while Norm2 achieves the highest recall.

3. For 2-3-word queries without typos Norm1 produces the best F1 score.

4. Norm1&2 and SumWeight typically exhibit weaker results than Norm1 and Norm2. The possible reasons are that the Norm1&2 metric is too restrictive, since it combines both normalization factors, leading to some decrease in recall, while the SumWeight version, which sometimes achieves quite good recall, is too permissive, since it uses no normalization constraints, which affects the precision.

5. For author queries the best weighting version is SumWeight, producing results comparable to the baseline.

6. The Boolean ranking (the baseline) usually produces higher recall scores than all the weight-based versions, as it generally retrieves larger resulting lists of topics. However, this significantly hurts precision, with the exception of very long queries, e.g. queries constructed from entries.

Finally, we conclude that the main contribution of the weight-based approaches is improved search precision, and the best metrics in this regard are Norm1 for keywords and SumWeight for authors. Thus, the system might either automatically employ Norm1 for keyword search and SumWeight (or Boolean ranking, which produced the best F1) for authors, or give the users the option to choose the most suitable method for them, as follows:

• In case of a precision-oriented search — Norm1 / SumWeight will be selected for keyword / author queries, respectively,



• If recall is more important but precision should be quite reasonable as well, Norm2 might be the best choice for keyword queries,

• When high recall is the user's only concern, the system will apply the Boolean ranking procedure.
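The selection policy above can be captured in a small dispatch function. The string names below are hypothetical identifiers for the ranking metrics discussed in this paper, and the function is a sketch of the policy, not part of the BoW implementation:

```python
def choose_ranking(query_type, goal="precision"):
    """Pick a ranking metric per the evaluation findings (sketch).

    query_type: "keyword" or "author"
    goal:       "precision", "balanced", or "recall"
    """
    if goal == "recall":
        return "Boolean"       # broadest result lists, highest recall
    if query_type == "author":
        return "SumWeight"     # precision-oriented choice for author queries
    if goal == "balanced":
        return "Norm2"         # good recall with reasonable precision
    return "Norm1"             # highest precision for keyword queries
```

A system could call this automatically based on the detected query type, or expose the `goal` parameter as a user preference.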

6 Conclusions and Future Work

Information retrieval is typically concerned with the retrieval of documents out of a corpus that are relevant to a given query. The response to the user can be presented at various levels, ranging from a document reference number through a document surrogate to the full text [16]. BoW is unique in organizing its data in a hierarchical concept index, and returning concepts from this index in response to queries. This approach provides the user with a wider context of related documents, within which the best data to answer the query can be found.

The system supports two main functionalities: insertion of new entries, and retrieval of existing ones. Interestingly, while a similar topic retrieval scheme was shown to be suitable for various application goals, we found that slightly different weighting metrics were best for different types of queries. Very specific and well-defined queries, like long entry-data-based queries and author queries, were found to work well with Boolean weights, i.e. by just counting how many query terms are matched. This approach also appeared to yield the highest recall. But for other keyword queries it was found to be better to combine the sum of the keyword weights in each topic with a factor that measures whether most or all of the keywords are indeed present and evenly highly weighted. Such a strategy achieves a much higher precision (an increase of 30–50% over the Boolean baseline) and F1 (an increase of 34–78%). This implies that in order to obtain the best results the search procedure should use different weighting schemes for different types of queries and retrieval tasks.
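The contrast between the two weighting philosophies can be sketched as follows. This is an illustrative reading of the idea (a coverage factor and an evenness factor multiplying the weight sum), not the paper's exact Norm1/Norm2 formulas:

```python
def boolean_score(query_terms, topic_weights):
    """Baseline: count how many query terms the topic vector contains."""
    return sum(1 for t in query_terms if t in topic_weights)

def weighted_score(query_terms, topic_weights):
    """Sketch of the weight-based idea: sum the matched term weights,
    then penalize topics that miss query terms or weight them unevenly.
    (Illustrative only -- not the actual Norm1/Norm2 formulas.)"""
    matched = [topic_weights[t] for t in query_terms if t in topic_weights]
    if not matched:
        return 0.0
    coverage = len(matched) / len(query_terms)   # are all terms present?
    evenness = min(matched) / max(matched)       # are weights evenly high?
    return sum(matched) * coverage * evenness

# A topic matching both query terms with high, even weights scores far
# above one matching only a single term:
topic = {"parallel": 0.9, "scheduling": 0.8}
full = weighted_score(["parallel", "scheduling"], topic)
partial = weighted_score(["parallel", "memory"], topic)
```

Under the Boolean score the two cases differ only by one count, whereas the weight-based score sharply demotes the partial match, which mirrors the precision gains reported above.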

Our algorithm was tested on a parallel systems bibliography with its specific structure, subject scope, and other characteristics. Future work may include testing the procedure on other data sets in different domains, to see how well it generalizes and what new issues are raised.

References

[1] Adami, Giordano, Paolo Avesani, and Diego Sona. (2005). Clustering Documents into a Web Directory for Bootstrapping a Supervised Classification. Data & Knowledge Engineering, Vol. 54, Issue 3, pp. 301–325. Adamson, G. and Boreham. (1974). The Use of an Association Measure Based on Character Structure to Identify Semantically Related Pairs of Words and Document Titles. Information Storage and Retrieval 10, pp. 253–260.

[2] Beitzel Steven M., Jensen Eric C., Chowdhury Abdur, Grossman David, and Frieder Ophir(2004). Hourly analysis of a very large topically categorized web query log. In 27th SIGIRConf. Research and Development in Information Retrieval, pp. 321–328.

[3] Blair, David C. (1990). Language and Representation in Information Retrieval. N.Y.: Elsevier.



[4] Cai, Lijuan and Thomas Hofmann. (2004). Hierarchical Document Categorization with Support Vector Machines. CIKM'04, Washington, DC, USA.

[5] Chakrabarti, Soumen, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan. (1998). Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies. The VLDB Journal 7, pp. 163–178.

[6] Cesa-Bianchi, Nicolo, Claudio Gentile, and Luca Zaniboni. (2006). Incremental Algorithms for Hierarchical Classification. Journal of Machine Learning Research 7, pp. 31–54.

[7] Doszkocs, T. E. (1982). From Research to Application: The CITE Natural Language Information Retrieval System. In Research and Development in Information Retrieval, G. Salton and H. J. Schneider (eds.), pp. 251–262. Berlin: Springer-Verlag.

[8] Dumais, Susan and Hao Chen. (2000). Hierarchical Classification of Web Content. In Proc. of the 23rd Int'l ACM Conf. on Research and Development in Information Retrieval (SIGIR), pp. 256–263, Athens, Greece.

[9] Feitelson, D. G. (1999). The BoW Project. Technical Report 99-30, Institute of Computer Science, Hebrew University.

[10] Feitelson, D. G. (2000). Cooperative Indexing, Classification and Evaluation in BoW. In 7th IFCIS Intl. Conf. Cooperative Information Systems, O. Etzion and P. Scheuermann (eds.), pp. 66–77, Springer-Verlag, LNCS vol. 1901.

[11] Geffet, M., and Feitelson, D. G. (2001). Hierarchical indexing and document matching inBoW. ACM/IEEE Joint Conf. Digital Libraries, pp. 259–267.

[12] Granitzer, Michael and Peter Auer. (2005). Experiments with Hierarchical Text Classification. In Proc. of Artificial Intelligence and Soft Computing, Benidorm, Spain.

[13] Jardine, N. and C. J. van Rijsbergen. (1971). The Use of Hierarchic Clustering in Information Retrieval. Information Storage and Retrieval, 7(5), pp. 217–240.

[14] Kerner, C. J., and T. F. Lindsley (1969). The value of abstracts in normal text searching. InThe information bazaar, Proc. 6th Annual National Colloquium on Information Retrieval,Philadelphia, pp. 437–440.

[15] Koller, D. and Sahami, M. (1997). Hierarchically Classifying Documents Using Very Few Words. In Proc. 14th International Conference on Machine Learning (ML-97), pp. 170–178, Nashville, Tennessee.

[16] Korfhage, R. R. (1997). Information Storage and Retrieval, N.Y.: John Wiley and Sons.

[17] McCallum, A., Nigam, K., Rennie, J., and Seymore, K. (2000). Automating the Construction of Internet Portals with Machine Learning. Information Retrieval 3(2), pp. 127–163.



[18] McCallum, A., Rosenfeld, R., Mitchell, T., and Ng, A. Y. (1998). Improving Text Classification by Shrinkage in a Hierarchy of Classes. In Proc. 15th International Conference on Machine Learning, pp. 359–367.

[19] Montejo-Raez, Arturo, L. Alfonso, and Ralf Steinberger. (2005). Text Categorization Using Bibliographic Records: Beyond Document Content. Procesamiento del Lenguaje Natural, no. 35 (ISSN: 1135).

[20] Paynter, G. W., Witten, I. H., Cunningham, S. J., and Buchanan, G. (1999). Scalable Browsing for Large Collections: A Case Study. In 5th ACM Conf. Digital Libraries, pp. 215–223.

[21] Rocchio, J. J. (1971). In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, US: Prentice-Hall, pp. 313–323.

[22] Rousu, Juho, Craig Saunders, Sandor Szedmak, and John Shawe-Taylor. (2005). Learning Hierarchical Multi-Category Text Classification Models. In Proc. of 22nd International Conference on Machine Learning, Bonn, Germany.

[23] Sun, Aixin, Ee-Peng Lim, Wee-Keong Ng, and Jaideep Srivastava. (2004). Blocking Reduction Strategies in Hierarchical Text Classification. IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 10.

[24] Tikk, Domonkos, Gyorgy Biro, and Jae Dong Yang. (2004). A Hierarchical Text Categorization Approach and Its Application to FRT Expansion. Australian Journal of Intelligent Information Processing Systems, 8(3), pp. 123–131.
