the effect of indexing exhaustivity on retrieval performance

Information Processing & Management, Vol. 27, No. 6, pp. 623-628, 1991. Printed in Great Britain.

0306-4573/91 $3.00 + .00 Copyright © 1991 Pergamon Press plc

THE EFFECT OF INDEXING EXHAUSTIVITY ON RETRIEVAL PERFORMANCE

ROBERT BURGIN
School of Library and Information Sciences, North Carolina Central University, Durham, NC 27707, U.S.A.

(Received 26 November 1990; accepted in final form 11 April 1991)

Abstract-The effect of variations in indexing exhaustivity on retrieval performance in a vector space retrieval system was investigated by using a term weight threshold to construct different document representations for a test collection. Retrieval results showed that retrieval performance, as measured by the mean optimal E measure for all queries at a term weight threshold, was highest at the most exhaustive representation, and decreased slightly as terms were eliminated and the indexing representation became less exhaustive. These findings, coupled with those of Shaw for a retrieval system based on single-link clustering, suggest that the vector space model is more robust against variations in indexing exhaustivity than is the single-link clustering model.

INTRODUCTION

Shaw (1990a, 1990c, 1986a) has observed that the performance of a retrieval system based on single-link clustering varies as a function of indexing exhaustivity. Shaw tested document representations produced by varying the level of indexing exhaustivity and found that retrieval performance improved as the exhaustivity of representations decreased from the most exhaustive representation to an optimal representation based on an intermediate level of indexing exhaustivity. In some cases, retrieval performance was also found to decrease as the exhaustivity of representations decreased from that optimal representation to less exhaustive representations. In all cases, the most successful retrieval results were found at a point where the document representation produced an optimal compromise between indexing exhaustivity and specificity.

These observations raise the question of whether a similar phenomenon could be found in a vector space retrieval environment. If a similar regularity were observed for the vector model, one might suggest that the observed relationship between retrieval performance and indexing exhaustivity holds for a wide range of retrieval methods and thereby reveals an important regularity of the information retrieval process. If a similar phenomenon were not observed, its absence might instead suggest important differences between the two models of retrieval.

RELATED WORK

The effects of variations in indexing exhaustivity on retrieval performance have been investigated by several researchers using a variety of approaches. Seely (1972) summarized a number of these studies, including the two Cranfield studies and early work with the SMART and MEDLARS systems, and emphasized that all revealed a trade-off between recall and precision, such that low levels of exhaustivity result in low recall and high precision, whereas high levels of exhaustivity result in high recall but low precision. Boyce and McLain (1989) recently demonstrated that this relationship also holds in a commercial retrieval environment.

Sparck Jones (1973) varied the indexing exhaustivity of both queries and documents to show that the two are not independent, and that the same level of retrieval performance may be obtained by different combinations of the two. The results of experiments with two document collections appeared to show that the effects of indexing exhaustivity in documents may be mitigated by varying the indexing exhaustivity in queries.


Harding and Willett (1980) examined the effects of indexing exhaustivity on clustering efficiency by showing that the number of interdocument comparisons varies with the mean number of index terms per document in such a way that the number of comparisons for an inverted file approach will be greater than with conventional approaches for exhaustive indexing.

Finally, El-Hamdouchi and Willett (1987) suggested that variations in indexing exhaustivity may affect clustering tendency and cluster search performance, and recommended this question as an area for future research.

METHODS

The present study was based on the collection examined in Shaw (1990b) and Shaw (1990c): a test collection of 1239 papers, published between 1974 and 1979 and indexed with the term cystic fibrosis in the National Library of Medicine’s Medline file, together with 100 queries and three sets of relevance evaluations from subject experts (Wood, Wood, & Shaw, 1990).

In the study reported here, a relevant document was defined as a document that had been scored as “highly relevant” by at least two sets of judges. Under this criterion, no relevant documents existed for Query 2, and so the query set for the test collection consisted of 99 queries and relevance judgments. The minimum number of relevant documents for a query was one; the maximum number was 93. The mean number of relevant documents per query was 10.66 documents. In all, there were 1055 relevant document-query pairs under the criterion outlined above.

The SMART information retrieval system (Buckley, 1985; Salton, 1971) was used to produce single-term indexing representations for the collection based on the titles and abstracts of the documents and based on the texts of the queries. Weights for the document vector collection were based on augmented normalized term frequencies and the standard inverse document frequency measure.

Here the augmented term frequency of term k in document i, $af_{ki}$, is equal to

$$af_{ki} = 0.5 + 0.5 \cdot (tf_{ki} / \mathit{max\_tf}_i), \tag{1}$$

where $tf_{ki}$ is the raw frequency of term k in document i and $\mathit{max\_tf}_i$ is the maximum raw term frequency in document i. Thus, all values for $af_{ki}$ are normalized within the range $0.5 < af_{ki} \le 1.0$. This augmented term frequency measure was used in conjunction with document representations (titles and abstracts) to minimize the effects of document length, of which raw term frequency is a function.

The weight of term k in document i, $w_{ki}$, combines this augmented term frequency with the inverse document frequency measure and is equal to

$$w_{ki} = af_{ki} \cdot \log_2 (d / d_k), \tag{2}$$

where $af_{ki}$ is the augmented normalized frequency of term k in document i, d is the total number of documents in the collection, and $d_k$ is the number of documents in which term k appears. A high weight in a document denotes a term that occurs relatively frequently in that document and relatively infrequently in the collection. Such a term, then, is one that is relatively specific to the associated document.
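The weighting scheme in equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the SMART implementation: the function names, the input format (lists of tokens), and the four-document toy collection are all hypothetical, and tokenization and stop word removal are assumed to have been done already.

```python
import math
from collections import Counter


def augmented_tf(term_counts):
    """Equation (1): af = 0.5 + 0.5 * tf / max_tf for each term of one document."""
    if not term_counts:
        return {}
    max_tf = max(term_counts.values())
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in term_counts.items()}


def weight_documents(docs):
    """Equation (2): w = af * log2(d / d_k) for each term of each document.

    docs is a list of token lists (a hypothetical input format).
    """
    d = len(docs)
    counts = [Counter(tokens) for tokens in docs]
    # d_k: number of documents in which term k appears
    doc_freq = Counter(term for c in counts for term in c)
    weighted = []
    for c in counts:
        af = augmented_tf(c)
        weighted.append({t: af[t] * math.log2(d / doc_freq[t]) for t in c})
    return weighted


# Toy collection, invented for illustration only
docs = [
    ["sweat", "sweat", "chloride"],
    ["lung", "infection"],
    ["enzyme", "enzyme", "lung", "sweat"],
    ["chloride", "infection", "lung"],
]
weights = weight_documents(docs)
for i, w in enumerate(weights):
    print(i, {t: round(v, 3) for t, v in sorted(w.items())})
```

In this toy collection, "enzyme" occurs twice in one document and nowhere else, so it receives the highest weight, whereas "lung", which appears in three of the four documents, receives a low one.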

The author modified one component of the SMART system in order to produce different representations of the documents based on term weight thresholds. As in Shaw (1990a, 1990b, 1990c, 1986a), these thresholds were used to control the exhaustivity of document representations by eliminating terms whose weights fell below the threshold. Thus, for any given term weight threshold, a term was selected to describe a document only if its weight for that document was greater than or equal to the term weight associated with that threshold. That is, for any term weight threshold $t_w$, a term k was used to describe document i only if $w_{ki} \ge t_w$.

Table 1 gives statistics for the collection as a function of term weight threshold. At the term weight threshold 0.0, no terms were eliminated from the document representations, since all terms had positive weights. At the term weight threshold 0.2, only those terms whose weights fell below 0.2 were eliminated from the document representations. Following Shaw (1990a, 1990b, 1990c, 1986a), the current study examined only those term weight thresholds from 0.0 through 2.8, the threshold that excluded all terms from at least one document.

Table 1. Statistics for the collection as a function of term weight threshold

Term weight   No. documents with   Minimum terms   Maximum terms   Mean terms
threshold     1 or more terms      in a document   in a document   in a document

0.0           1239                 4               143             49.5
0.2           1239                 4               141             47.7
0.4           1239                 4               140             47.2
0.6           1239                 3               140             46.9
0.8           1239                 3               137             45.5
1.0           1239                 3               135             43.3
1.2           1239                 3               129             40.2
1.4           1239                 3               119             36.0
1.6           1239                 3               100             32.0
1.8           1239                 2               86              27.9
2.0           1239                 2               72              23.8
2.2           1239                 2               67              19.9
2.4           1239                 1               59              16.6
2.6           1239                 1               54              14.0
2.8           1238                 0               51              11.7
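The thresholding rule (keep a term only if its weight is greater than or equal to the threshold) and the kind of per-threshold statistics reported for the collection can be sketched as follows. The helper names and the three-document toy collection are hypothetical; the actual study modified a component of SMART, which is not reproduced here.

```python
def apply_threshold(doc_weights, threshold):
    """Keep only the terms whose weight is >= threshold."""
    return {t: w for t, w in doc_weights.items() if w >= threshold}


def collection_stats(collection, threshold):
    """Summarize document lengths (in terms) after thresholding."""
    sizes = [len(apply_threshold(doc, threshold)) for doc in collection]
    return {
        "docs_with_terms": sum(1 for n in sizes if n > 0),
        "min_terms": min(sizes),
        "max_terms": max(sizes),
        "mean_terms": sum(sizes) / len(sizes),
    }


# Toy weighted collection, invented for illustration only
collection = [
    {"sweat": 1.0, "chloride": 0.75},
    {"lung": 0.31, "infection": 1.0},
    {"enzyme": 2.0, "lung": 0.31, "sweat": 0.75},
]

for threshold in (0.0, 0.8, 1.5):
    print(threshold, collection_stats(collection, threshold))
```

Raising the threshold shortens every representation, and at a high enough threshold some documents lose all of their terms, which is the stopping condition used in the study.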

Retrieval results were obtained by using the SMART information retrieval system to conduct a sequential match of query and document vectors. In addition to the initial search, a relevance feedback search was performed, following the method outlined by Salton and Buckley (1990). In this method, the evaluation of the initial and feedback retrieval runs is based on a “reduced” collection, which is formed by removing the top 15 items retrieved in the initial search. Information from these top 15 items is used to construct feedback queries by adding document term weights to query terms.
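A rough sketch of the "reduced collection" feedback procedure is given below. Several details are assumptions made only for illustration and are not taken from the study: similarity is computed as a plain inner product, only the relevant documents among the top-ranked items contribute to the feedback query, and the function names and toy data are hypothetical. Salton and Buckley (1990) describe several feedback formulas, and this sketch does not claim to match the one actually used.

```python
def inner_product(query, doc):
    """Similarity between a query vector and a document vector (assumed measure)."""
    return sum(w * doc.get(t, 0.0) for t, w in query.items())


def rank(query, collection):
    """Return document ids sorted by decreasing similarity to the query."""
    return sorted(collection, key=lambda doc_id: -inner_product(query, collection[doc_id]))


def feedback_run(query, collection, relevant_ids, top_n=15):
    """Return (initial ranking, feedback ranking), both over the reduced collection."""
    top = rank(query, collection)[:top_n]
    # Build the feedback query by adding document term weights to query terms.
    # Assumption: only relevant top-ranked documents contribute.
    new_query = dict(query)
    for doc_id in top:
        if doc_id in relevant_ids:
            for t, w in collection[doc_id].items():
                new_query[t] = new_query.get(t, 0.0) + w
    # The reduced collection omits the top-ranked items from the initial search.
    reduced = {d: v for d, v in collection.items() if d not in top}
    return rank(query, reduced), rank(new_query, reduced)


# Toy data, invented for illustration; top_n is 2 here instead of 15
collection = {
    "d1": {"sweat": 1.0, "chloride": 0.8},
    "d2": {"sweat": 0.9},
    "d4": {"gland": 1.5},
    "d3": {"chloride": 1.2, "gland": 0.7},
    "d5": {"lung": 1.0},
}
query = {"sweat": 1.0}
baseline, feedback = feedback_run(query, collection, {"d1", "d3"}, top_n=2)
print(baseline, feedback)
```

Evaluating both runs on the same reduced collection keeps the comparison fair, because the documents that supplied the feedback terms cannot inflate the score of the feedback run.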

Retrieval results obtained from the SMART system for each query included a list of relevant documents and their ranks in the retrieval process. For each relevant document, then, a recall and precision figure and a corresponding E measure were obtained. The E measure used in this study is given by

$$E = 1 - \frac{1}{\alpha (1/P) + (1 - \alpha)(1/R)}, \tag{3}$$

where P and R are the conventional precision and recall measures and $\alpha$ was fixed at 0.5, thus giving equal weight to precision and recall for the theoretical reasons outlined by Shaw (1986b). The magnitude of E varies from 0.0 to 1.0, and is inversely related to retrieval performance. When E = 0, all relevant documents are retrieved and no nonrelevant document is retrieved; when E = 1, no relevant document is retrieved (Van Rijsbergen, 1979).

For each term weight threshold, the lowest E measure for each query was then obtained, and the mean of these minimal E measures was then derived. This figure represents the mean optimal performance for the system at each term weight threshold. Figures for the term weight thresholds used in the present study are given in Table 2.
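The evaluation procedure can be sketched as follows, assuming the Van Rijsbergen form of the measure, E = 1 - 1 / (alpha/P + (1 - alpha)/R), with alpha = 0.5. The function names and the example ranks are hypothetical; the input is taken to be, for each query, the 1-based ranks at which its relevant documents were retrieved.

```python
def e_measure(precision, recall, alpha=0.5):
    """E = 1 - 1 / (alpha / P + (1 - alpha) / R); defined as 1.0 when P or R is zero."""
    if precision == 0 or recall == 0:
        return 1.0
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)


def optimal_e(relevant_ranks, alpha=0.5):
    """Lowest E over the cutoffs that fall at each relevant document's rank."""
    total = len(relevant_ranks)
    best = 1.0
    for found, rank in enumerate(sorted(relevant_ranks), start=1):
        precision = found / rank   # relevant retrieved / documents retrieved
        recall = found / total     # relevant retrieved / all relevant
        best = min(best, e_measure(precision, recall, alpha))
    return best


def mean_optimal_e(ranks_per_query, alpha=0.5):
    """Mean of the per-query minimal E values at one term weight threshold."""
    values = [optimal_e(ranks, alpha) for ranks in ranks_per_query]
    return sum(values) / len(values)


# Hypothetical ranks of relevant documents for two queries
ranks_per_query = [[1, 2], [2, 10]]
print(mean_optimal_e(ranks_per_query))
```

Only cutoffs at the ranks of relevant documents need to be examined: between two relevant documents, extending the cutoff lowers precision without raising recall, so E cannot improve there.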

RESULTS

As shown in Table 2, the regularity observed by Shaw (1990a, 1990c, 1986a) for a single-link clustering algorithm was not found for the vector space model. Retrieval performance, as measured by the mean optimal E measure for all queries at a term weight threshold, was highest at the most exhaustive representation, and decreased slightly as terms were eliminated and the indexing representation became less exhaustive. However, not until the term weight threshold 2.2, by which point 60% of the terms describing the collection had been eliminated, did retrieval performance differ by even a “noticeable” degree, defined by Sparck Jones (1971) as a difference of 5% or more.

Table 2. Retrieval effectiveness as a function of term weight threshold

Term weight   Mean retrieval
threshold     effectiveness

0.0           0.5494
0.2           0.5494
0.4           0.5497
0.6           0.5499
0.8           0.5507
1.0           0.5508
1.2           0.5541
1.4           0.5599
1.6           0.5701
1.8           0.5818
2.0           0.5894
2.2           0.6038
2.4           0.6173
2.6           0.6521
2.8           0.6630
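The arithmetic behind this observation can be checked directly from the published figures. The sketch below assumes that a "noticeable" difference means an absolute change of at least 0.05 in mean E relative to the most exhaustive representation; that reading is an interpretation that fits the reported figures, not something the text states explicitly. The mean term counts are those reported for the collection at each threshold.

```python
thresholds = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4,
              1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8]
# Mean optimal E per threshold (Table 2)
mean_e = [0.5494, 0.5494, 0.5497, 0.5499, 0.5507, 0.5508, 0.5541, 0.5599,
          0.5701, 0.5818, 0.5894, 0.6038, 0.6173, 0.6521, 0.6630]
# Mean number of terms per document per threshold (Table 1)
mean_terms = [49.5, 47.7, 47.2, 46.9, 45.5, 43.3, 40.2, 36.0,
              32.0, 27.9, 23.8, 19.9, 16.6, 14.0, 11.7]

NOTICEABLE = 0.05  # assumed: absolute difference in mean E


def first_noticeable(values, delta=NOTICEABLE):
    """Index of the first value that exceeds the baseline (index 0) by >= delta."""
    base = values[0]
    for i, v in enumerate(values):
        if v - base >= delta:
            return i
    return None


idx = first_noticeable(mean_e)
eliminated = 1 - mean_terms[idx] / mean_terms[0]
print(thresholds[idx], round(100 * eliminated, 1))
```

Under that assumption the first noticeable decline falls at threshold 2.2, where the mean document length has dropped from 49.5 to 19.9 terms, a reduction of roughly 60%.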

A similar pattern was observed for retrievals based on relevance feedback (see Table 3). The most exhaustive representations produced the best performance, and performance did not decline by even a “noticeable” degree until two thirds of the terms describing the collection had been eliminated (i.e., at the term weight threshold 2.4).
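The figures in Table 3 can be cross-checked in the same spirit. The sketch below assumes that the Improvement column is the relative reduction in E from the "reduced" collection run to the feedback run, (reduced - feedback) / reduced; this formula is inferred from the numbers rather than stated in the text, and the 0.05 cutoff for a "noticeable" change is likewise an assumed reading.

```python
# (threshold, "reduced" collection E, feedback E, reported improvement in %)
rows = [
    (0.0, 0.7929, 0.6931, 12.59), (0.2, 0.7929, 0.6931, 12.59),
    (0.4, 0.7925, 0.6926, 12.61), (0.6, 0.7914, 0.6931, 12.42),
    (0.8, 0.7815, 0.6877, 12.00), (1.0, 0.7909, 0.7009, 11.38),
    (1.2, 0.7828, 0.7069, 9.70),  (1.4, 0.8117, 0.7127, 12.20),
    (1.6, 0.8212, 0.7217, 12.12), (1.8, 0.8245, 0.7010, 14.98),
    (2.0, 0.8264, 0.7084, 14.28), (2.2, 0.8271, 0.7195, 13.01),
    (2.4, 0.9000, 0.7564, 15.96), (2.6, 0.9554, 0.7913, 17.18),
    (2.8, 0.9729, 0.8183, 15.89),
]


def improvement(reduced, feedback):
    """Relative reduction in E, in percent (assumed definition)."""
    return 100.0 * (reduced - feedback) / reduced


# Largest gap between the recomputed and the reported improvement
max_gap = max(abs(improvement(r, f) - rep) for _, r, f, rep in rows)

# First threshold at which feedback E is at least 0.05 above its baseline
baseline = rows[0][2]
first = next(t for t, _, f, _ in rows if f - baseline >= 0.05)
print(round(max_gap, 4), first)
```

The recomputed values agree with the reported column to within rounding, and the first decline of 0.05 or more in feedback performance falls at threshold 2.4, in line with the text.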

DISCUSSION

The effect of indexing exhaustivity on retrieval performance, reported by Shaw (1990a, 1990c, 1986a) for a retrieval system based on single-link clustering, was not observed for a vector space retrieval system, and would appear to be an artifact of single-link cluster-based retrieval.

This finding suggests that the vector space model is more robust against variations in indexing exhaustivity than is the single-link clustering model. This difference would appear to follow from the major difference between the two techniques: the single-link clustering model takes interdocument similarities into account prior to any consideration of the relationship between queries and documents. Although this feature of cluster-based strategies is sometimes considered an advantage over conventional, linear associative retrieval models (Griffiths, Luckhurst, & Willett, 1986; Jardine & Van Rijsbergen, 1971), the clustering structures generated from these interdocument similarities would appear to be sensitive to variations in the indexing exhaustivity, and to contribute thereby to suboptimal retrieval effectiveness in many situations. Shaw’s (1990a, 1990b) finding that the evidence of clustering structure was strongest for the most exhaustive document representations and weakest for intermediate values of term exhaustivity, coupled with his finding that retrieval effectiveness was worst for the most exhaustive document representations and optimal for an intermediate value of term exhaustivity, provides confirmation of the potentially adverse effects of these clustering structures on retrieval effectiveness.

Table 3. Feedback retrieval effectiveness as a function of term weight threshold

Term weight   “Reduced” collection   Feedback
threshold     retrieval              retrieval   Improvement

0.0           0.7929                 0.6931      12.59%
0.2           0.7929                 0.6931      12.59%
0.4           0.7925                 0.6926      12.61%
0.6           0.7914                 0.6931      12.42%
0.8           0.7815                 0.6877      12.00%
1.0           0.7909                 0.7009      11.38%
1.2           0.7828                 0.7069      9.70%
1.4           0.8117                 0.7127      12.20%
1.6           0.8212                 0.7217      12.12%
1.8           0.8245                 0.7010      14.98%
2.0           0.8264                 0.7084      14.28%
2.2           0.8271                 0.7195      13.01%
2.4           0.9000                 0.7564      15.96%
2.6           0.9554                 0.7913      17.18%
2.8           0.9729                 0.8183      15.89%

This investigation further indicates that, unless individuals who make use of single-link clustering algorithms take care to determine the level of indexing exhaustivity that produces the best performance in a specific experimental environment, they may not obtain optimal performance. Retrieval experiments typically use test collections from which only stop words have been eliminated and are therefore typically based on the most exhaustive document representations. Unfortunately, the most exhaustive representation will produce suboptimal results for single-link clustering models. Since the vector space model is more robust against variations in indexing exhaustivity, that method would appear to be preferable, unless one is willing to take the effects of variations in indexing exhaustivity into consideration.

These findings are, of course, limited to the retrieval performance of the single-link clustering method investigated by Shaw (1990a, 1990c, 1986a). Other clustering algorithms have been shown to be superior to single-link clustering in terms of retrieval performance (Willett, 1988), and these algorithms may also prove to be more robust against variations in indexing exhaustivity than single-link clustering. The robustness of alternative clustering methods requires further investigation.

Finally, although vector-based retrieval appears to be more robust against variations in indexing exhaustivity than is single-link cluster-based retrieval, the retrieval effectiveness results reported in Table 2 are nevertheless comparable to the retrieval effectiveness results obtained by Shaw and characterized as “low performance levels” (Shaw, 1990a, p. 347). Thus, the results reported here would appear to corroborate suggestions by Swanson (1988) and Shaw that subject representations do a less than adequate job of associating queries and relevant documents, regardless of whether the retrieval method is cluster-based or vector-based.

Acknowledgement-The author gratefully acknowledges Dr. W.M. Shaw, Jr., for providing the test collection used in this study, and for many helpful discussions.

REFERENCES

Boyce, B.R., & McLain, J.P. (1989). Entry point depth and online search using a controlled vocabulary. Journal of the American Society for Information Science, 40 (4), 273-276.

Buckley, C. (1985). Implementation of the SMART information retrieval system. Technical Report No. 85-686. Ithaca, NY: Cornell University.

El-Hamdouchi, A., & Willett, P. (1987). Techniques for the measurement of clustering tendency in document retrieval systems. Journal of Information Science, 13, 361-365.

Griffiths, A., Luckhurst, H.C., & Willett, P. (1986). Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science, 37 (1), 3-11.

Harding, A.F., & Willett, P. (1980). Indexing exhaustivity and the computation of similarity matrices. Journal of the American Society for Information Science, 31 (4), 298-300.

Jardine, N., & Van Rijsbergen, C. (1971). The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7, 217-240.

Salton, G. (Ed.) (1971). The SMART retrieval system-Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall.

Salton, G., & Buckley, C. (1990). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41 (4), 288-297.

Seely, B.J. (1972). Indexing depth and retrieval effectiveness. Drexel Library Quarterly, 8 (2), 201-208.

Shaw, W.M., Jr. (1990a). An investigation of document structures. Information Processing and Management, 26 (3), 339-348.

Shaw, W.M., Jr. (1990b). Subject indexing and citation indexing-Part I: Clustering structure in the cystic fibrosis document collection. Information Processing and Management, 26 (6), 693-703.

Shaw, W.M., Jr. (1990c). Subject indexing and citation indexing-Part II: An evaluation and comparison. Information Processing and Management, 26 (6), 705-718.

Shaw, W.M., Jr. (1986a). An investigation of document partitions. Information Processing and Management, 22 (1), 19-28.

Shaw, W.M., Jr. (1986b). On the foundation of evaluation. Journal of the American Society for Information Science, 37 (5), 346-348.

Sparck Jones, K. (1973). Does indexing exhaustivity matter? Journal of the American Society for Information Science, 24 (5), 313-316.

Sparck Jones, K. (1971). Progress in documentation: Automatic indexing. Journal of Documentation, 39 (4), 393-432.

Swanson, D.R. (1988). Historical note: Information retrieval and the future of an illusion. Journal of the American Society for Information Science, 39 (2), 92-98.

Van Rijsbergen, C.J. (1979). Information retrieval. London: Butterworths.

Willett, P. (1988). Recent trends in hierarchic document clustering: A critical review. Information Processing and Management, 24 (5), 577-597.

Wood, J.B., Wood, R.E., & Shaw, W.M., Jr. (1990). The cystic fibrosis database. Technical Report No. 8902. Chapel Hill, NC: University of North Carolina at Chapel Hill, School of Information and Library Science.
