
“Is This Document Relevant? ... Probably”: A Survey of Probabilistic Models in Information Retrieval

FABIO CRESTANI, MOUNIA LALMAS, CORNELIS J. VAN RIJSBERGEN, AND IAIN CAMPBELL

University of Glasgow

This article surveys probabilistic approaches to modeling information retrieval. The basic concepts of probabilistic approaches to information retrieval are outlined and the principles and assumptions upon which the approaches are based are presented. The various models proposed in the development of IR are described, classified, and compared using a common formalism. New approaches that constitute the basis of future research are described.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—retrieval models; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—indexing methods

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Information retrieval, probabilistic indexing, probabilistic modeling, probabilistic retrieval, uncertain inference modeling

1. HISTORY OF PROBABILISTIC MODELING IN IR

In information retrieval (IR), probabilistic modeling is the use of a model that ranks documents in decreasing order of their evaluated probability of relevance to a user's information needs. Past and present research has made much use of formal theories of probability and of statistics in order to evaluate, or at least estimate, those probabilities of relevance. These attempts are to be distinguished from looser ones such as the “vector space model” [Salton 1968], in which documents are ranked according to a measure of similarity to the query. A measure of similarity cannot be directly interpreted as a probability. In addition, similarity-based models generally lack the theoretical soundness of probabilistic models.

The first attempts to develop a probabilistic theory of retrieval were made over 30 years ago [Maron and Kuhns 1960; Miller 1971], and since then there has been a steady development of the approach. There are already several operational IR systems based upon probabilistic or semiprobabilistic models.

One major obstacle in probabilistic or semiprobabilistic IR models is finding methods for estimating the probabilities used to evaluate the probability of relevance that are both theoretically sound and computationally efficient. The problem of estimating these probabilities is difficult to tackle unless some simplifying assumptions are made.

Authors' address: Computing Science Department, University of Glasgow, Glasgow G12 8QQ, Scotland, UK; email: {fabio,mounia,keith,iain}@dcs.gla.ac.uk. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. © 1999 ACM 0360-0300/99/1200–0528 $5.00


In the early stages of the study of probabilistic modeling in IR, assumptions related to event independence were employed in order to facilitate the computations. The first models to be based upon such assumptions were the “binary independence indexing model” (Section 3.3) and the “binary independence retrieval model” (Section 3.2). Recent findings by Cooper [1995] have shown that these assumptions are not completely necessary and were, in fact, not actually made (Section 5).

The earliest techniques that took dependencies into account gave results that were worse than those given by techniques based upon the simplifying assumptions. Moreover, complex techniques that captured dependencies could only be used at a computational price regarded as too high with respect to the value of the results [van Rijsbergen 1977]. One particular research direction aimed at removing the simplifying assumptions has been studied extensively, and much work is being done.1

Another direction has involved the application of statistical techniques used in pattern recognition and regression analysis. These investigations, of which the Darmstadt indexing approach (DIA) is a major example [Fuhr 1989; Fuhr and Buckley 1991] (see Section 3.4), do not make use of independence assumptions. They are “model-free” in the sense that the only probabilistic assumptions involved are those implicit in the statistical regression theory itself. The major drawback of such approaches is the degree to which heuristics are necessary to optimize the description and retrieval functions.

A theoretical improvement in the DIA was achieved by using logistic regression instead of standard regression. Standard regression is, strictly speaking, inappropriate for estimating probabilities of relevance if relevance is considered as a dichotomous event; that is, a document is either relevant to a query or not. Logistic regression has been specifically developed to deal with dichotomous (or n-dichotomous) dependent variables. Probabilistic models that make use of logistic regression have been developed by Fuhr and Pfeifer [Fuhr and Buckley 1991] and by Cooper et al. [1992] (see Sections 3.4 and 3.7).

One area of recent research investigates the use of an explicit network representation of dependencies. The networks are processed by means of Bayesian inference or belief theory, using evidential reasoning techniques such as those described by Pearl [1988]. This approach is an extension of the earliest probabilistic models, taking into account the conditional dependencies present in a real environment. Moreover, the use of such networks generalizes existing probabilistic models and allows the integration of several sources of evidence within a single framework. Attempts to use Bayesian (or causal) networks are reported in Turtle [1990], Turtle and Croft [1991], and Savoy [1992].

A new stream of research, initiated by van Rijsbergen [1986] and continued by him and others,2 aims at developing a model based upon a nonclassical logic, specifically a conditional logic where the semantics are expressed using probability theory. The evaluation can be performed by means of a possible-world analysis [van Rijsbergen 1989, 1992; Sembok and van Rijsbergen 1993; Crestani and van Rijsbergen 1995], thus establishing an intensional logic; by using modal logic [Nie 1988, 1989, 1992; Amati and Kerpedjiev 1992]; by using situation theory [Lalmas 1992]; or by integrating logic with natural language processing [Chiaramella and Chevallet 1992]. The area is in its infancy; no working prototype based on the proposed models has been developed so far,

1 Please see Fung et al. [1990], Turtle and Croft [1990], Savoy [1992], and van Rijsbergen [1992].

2 Please see Amati and van Rijsbergen [1995], Bruza [1993], Bruza and van der Weide [1992], Huibers [1996], Sebastiani [1994], and Crestani and van Rijsbergen [1995].


and the operational validity of these ideas has still to be confirmed.

2. BACKGROUND

Here we review some general aspects that are important for a full understanding of the proposed probabilistic models. We then provide a framework within which the various models can be placed for comparison. We assume some familiarity with the principles of probability theory on the part of the reader.

2.1 Event Space

In general, probabilistic models have as their event space the set $\mathcal{Q} \times \mathcal{D}$, where $\mathcal{Q}$ represents the set of all possible queries and $\mathcal{D}$ the set of all documents in the collection. The differences among the various models lie in their use of different representations and descriptions of queries and documents.

In most models, queries and documents are represented by descriptors, often automatically extracted or manually assigned terms. These descriptors are represented as binary-valued vectors in which each element corresponds to a term. More complex models make use of real-valued vectors or take into account relationships among terms or among documents.

A query is an expression of an information need. Here we regard a query as a unique event; that is, if two users submit the same query, or if the same query is submitted by the same user on two different occasions, the two queries are regarded as different. A query is submitted to the system, which then aims to find information relevant to the information need expressed in the query. In this article we consider relevance as a subjective user judgment on a document related to a unique expression of an information need.3

A document is any object carrying information: a piece of text, an image, a sound, or a video. However, almost all current IR systems deal only with text, a limitation resulting from the difficulty of finding suitable representations for nontextual objects. We thus consider here only text-based IR systems.

Some assumptions common to all retrieval models are:

—Users' understanding of their information needs changes during a search session, is subject to continuous refinement, and is expressed by different queries.

—Retrieval is based only upon representations of queries and documents, not upon the queries and documents themselves.

—The representation of IR objects is “uncertain.” For example, the extraction of index terms from a document or a query to represent the document or query information content is a highly uncertain process. As a consequence, the retrieval process becomes uncertain.

It is particularly this last assumption that led to the study of probabilistic retrieval models. Probability theory [Good 1950] is, however, only one way of dealing with uncertainty.4 Earlier models were largely based on classical probability theory, but recently new approaches to dealing with uncertainty have been applied to IR. Sections 3 and 4 present traditional and new approaches to probabilistic retrieval.

2.2 A Conceptual Model

The importance of conceptual modeling is widely recognized in fields such as database management systems and information systems.

3 The relevance relationship between a query and a document relies on a user-perceived satisfaction of his or her information needs. Such a perception of satisfaction is subjective: different users can give different relevance judgments to a given query-document pair. Moreover, this relevance relationship depends on time, so that the same user could give a different relevance judgment on the same query-document pair on two different occasions.

4 Other approaches are based, for example, on fuzzy logic [Zadeh 1987] and Dempster–Shafer's theory of evidence [Shafer 1976].


Here we use the conceptual model proposed by Fuhr [1992b], which has the advantage of being both simple and general enough to be a conceptual basis for all the probabilistic models presented in this survey, although some of them predate it.

The model is shown in Figure 1. The basic objects of an IR system are a finite set of documents $\mathcal{D}$ (e.g., books, articles, images) and a finite set of queries $\mathcal{Q}$ (e.g., information needs). We consider a set of queries, and not a single query alone, because a single user may have varying information needs. If we consider a finite set $\mathcal{R}$ of possible relevance judgments (for example, in the binary case $\mathcal{R} = \{R, \bar{R}\}$; i.e., a document can either be relevant to a query or not), then the IR system's task is to map every query-document pair to an element of $\mathcal{R}$. Unfortunately, IR systems do not deal directly with queries and documents but with representations of them (e.g., a text for a document or a Boolean expression for a query). It is mainly the kind of representation technique used that differentiates one IR indexing model from another.

We denote by $\alpha_Q$ the mapping between a set of queries $\mathcal{Q}$ and their representations $Q$. For example, a user in search of information about wine may express his or her query as follows: “I am looking for articles dealing with wine.” Similarly, we denote by $\alpha_D$ the mapping between a set of documents $\mathcal{D}$ and their representations $D$. For example, in a library, a book is represented by its author, title, a summary, the fact that it is a book (and not an article), and some keywords.

These two mappings can be very different from each other. Obviously, the better the representation of queries and documents, the better the performance of the IR system.

To make the conceptual model general enough to deal with the most complex IR models, a further mapping is introduced between representations and descriptions. For instance, a description of the preceding query could be the two terms “article” and “wine.” The sets of representations $Q$ and $D$ are mapped to the sets of descriptions $Q'$ and $D'$ by means of two mapping functions $\beta_Q$ and $\beta_D$. Moreover, the need for such an additional mapping arises for learning models (see, e.g., Section 3.4) that must aggregate features to allow large enough samples for estimation. It is worth noticing, however, that most models work directly with the original document and query representations.

It is common for IR systems to be able to manage only a poor description of the representation of the objects (e.g., a set of stems instead of a text). However, when representation and description happen to be the same, it is sufficient to consider either $\beta_Q$ or $\beta_D$ as an identity mapping.

Descriptions are taken as the independent variables of the retrieval function $r: Q' \times D' \to \mathbb{R}$, which maps a query-document pair onto a set of retrieval status values (RSV) $r(q'_k, d'_j)$ [Bookstein and Cooper 1976]. The task of a ranked-retrieval IR system in response to a query $q_k$ is to calculate this value and rank each and every document $d_j$ in the collection upon it.

In probabilistic IR the task of the system is different.

Figure 1. The underlying conceptual model.


If we assume binary relevance judgments (i.e., $\mathcal{R}$ contains only the two possible judgments $R$ and $\bar{R}$), then according to the probability ranking principle (Section 2.4) the task of an IR system is to rank the documents according to their estimated probability of being relevant, $P(R|q_k, d_j)$. This probability is estimated by $P(R|q'_k, d'_j)$.

2.3 On “Relevance” and “Probability of Relevance”

The concept of relevance is arguably the fundamental concept of IR. In the preceding model we purposely avoided giving a formal definition of relevance because the notion of relevance has never been defined precisely in IR. Although there have been many attempts towards a definition [Saracevic 1970; Cooper 1971; Mizzaro 1996], there has never been agreement on a unique precise definition. A treatment of the concept of relevance is outside the scope of this article, and we do not attempt to formulate a new definition or even accept some particular already existing one. What is important for the purpose of our survey is to understand that relevance is a relationship that may or may not hold between a document and a user of the IR system who is searching for some information: if the user wants the document in question, then we say that the relationship holds. With reference to the preceding model, relevance ($\mathcal{R}$) is a relationship between a document ($d_j$) and a user's information need ($q_k$). If the user wants the document $d_j$ in relation to his or her information need $q_k$, then $d_j$ is relevant ($R$).

Most readers will find the concept of probability of relevance quite unusual. The necessity of introducing such a probability arises from the fact that relevance is a function of a large number of variables concerning the document, the user, and the information need. It is virtually impossible to make strict predictions as to whether the relationship of relevance will hold between a given document and a given user's information need. The problem must be approached probabilistically. The preceding model explains what evidence is available to an IR system to estimate the probability of relevance $P(R|q_k, d_j)$. A precise definition of probability of relevance depends on a precise definition of the concept of relevance, and given a precise definition of relevance it is possible to define such a probability rigorously. Just as we did not define relevance, we do not attempt to define the probability of relevance, since every model presented here uses a somewhat different definition. We refer the reader to the treatment given by Robertson et al. [1982], where different interpretations of the probability of relevance are given and a unified view is proposed.

2.4 The Probability Ranking Principle

A common characteristic of all the probabilistic models developed in IR is their adherence to the theoretical justification embodied in the probability ranking principle (PRP) [Robertson 1977]. The PRP asserts that optimal retrieval performance can be achieved when documents are ranked according to their probabilities of being judged relevant to a query. These probabilities should be estimated as accurately as possible on the basis of whatever data have been made available for this purpose.

The principle speaks of “optimal retrieval” as distinct from “perfect retrieval.” Optimal retrieval can be defined precisely for probabilistic IR because it can be proved theoretically with respect to representations (or descriptions) of documents and information needs. Perfect retrieval relates to the objects of the IR systems themselves (i.e., documents and information needs).

The formal definition of the PRP is as follows. Let $C$ denote the cost of retrieving a relevant document and $\bar{C}$ the cost of retrieving an irrelevant document. The decision rule that is the basis of the PRP states that a document $d_m$ should be retrieved in response to a query $q_k$ above any document $d_i$ in the collection if



$$C \cdot P(R|q_k, d_m) + \bar{C} \cdot (1 - P(R|q_k, d_m)) \leq C \cdot P(R|q_k, d_i) + \bar{C} \cdot (1 - P(R|q_k, d_i)).$$

The decision rule can be extended to deal with multivalued relevance scales (e.g., very relevant, possibly relevant, etc. [Cooper 1971]). In addition, by means of a continuous cost function, it is possible to write a decision rule for approaches where the relevance scale is assumed to be continuous [Bordogna and Pasi 1993].
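To make the ranking behavior concrete, here is a minimal Python sketch (not from the original article; the probability estimates and costs are invented) that ranks documents by their estimated probability of relevance and shows the expected cost of retrieving each one. Ranking by decreasing probability also orders documents by increasing expected retrieval cost whenever retrieving a relevant document is cheaper than retrieving an irrelevant one.

```python
# Illustrative PRP sketch: rank documents by estimated P(R|q,d) and show
# the expected cost of retrieving each. All values are hypothetical.

def expected_retrieval_cost(p_rel, cost_rel, cost_irrel):
    """Expected cost of retrieving a document with relevance probability p_rel."""
    return cost_rel * p_rel + cost_irrel * (1.0 - p_rel)

estimates = {"d1": 0.82, "d2": 0.35, "d3": 0.67, "d4": 0.10}  # assumed P(R|q,d)
C, C_bar = 0.0, 1.0  # assumed costs of retrieving a relevant / irrelevant document

for doc in sorted(estimates, key=estimates.get, reverse=True):
    p = estimates[doc]
    print(doc, p, expected_retrieval_cost(p, C, C_bar))
```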

The application of the PRP in probabilistic models involves assumptions:

—Dependencies between documents are generally ignored. Documents are considered in isolation, so that the relevance of one document to a query is considered independent of that of other documents in the collection (nevertheless, see Section 5).

—It is assumed that the probabilities (e.g., $P(R|q_k, d_i)$) in the decision function can be estimated in the best possible way, that is, accurately enough to approximate the user's real relevance judgment and therefore to order the documents accordingly.

Although these assumptions limit the applicability of the PRP, models based on it make possible the implementation of IR systems offering some of the best retrieval performances currently available [Robertson 1977]. There are, of course, a number of other retrieval strategies with high performance levels that are not consistent with the PRP. Examples of such strategies are the Boolean and the cluster models. We are not concerned here with these models since they are not probabilistic in nature.

2.5 The Remainder of This Article

In the remainder of this article we survey probabilistic IR models in two main categories: relevance models and inference models.

Relevance models, described in Section 3, are based on evidence about which documents are relevant to a given query. The problem of estimating the probability of relevance for every document in the collection is difficult because of the large number of variables involved in the representation of documents in comparison to the small amount of document relevance information available. The models differ primarily in the way they estimate this or related probabilities.

Inference models, presented in Section 4, apply concepts and techniques originating from areas such as logic and artificial intelligence. From a probabilistic perspective, the most noteworthy examples are those that consider IR as a process of uncertain inference. The concept of relevance is interpreted in a different way, so that it can be extended and defined with respect not only to a query formulation, but also to an information need.

The models of both categories are presented separately, but using a common formalism and, as much as possible, at the same level of detail.

We also note that we are not concerned here with issues related to evaluation. Evaluation is a very important part of IR research, and even a brief treatment of some of the issues involved in the area would require an entire paper. The interested reader should look at the extensive IR literature on this subject, in particular van Rijsbergen [1979] and Sparck Jones [1981].

3. PROBABILISTIC RELEVANCE MODELS

The main task of IR systems based upon relevance models is to evaluate the probability of a document being relevant. This is done by estimating the probability $P(R|q_k, d_i)$ for every document $d_i$ in the collection, which is a difficult problem that can be tackled only by means of simplifying assumptions.


Two kinds of approaches have been developed to deal with such assumptions: model-oriented and description-oriented.

Model-oriented approaches are based upon some probabilistic independence assumptions concerning the elements used in representing5 the documents or the queries. The probabilities of these individual representation elements are estimated and, by means of the independence assumptions, the probabilities of the document representations are estimated from them. The binary independence indexing and retrieval models (Sections 3.3 and 3.2) and the n-Poisson model (Section 3.8) are examples of this approach.

Description-oriented approaches are more heuristic in nature. Given the representation of queries and documents, a set of features for query-document pairs is defined (e.g., occurrence frequency information) that allows each query-document pair in the collection to be mapped onto these features. Then, by means of some training data containing query-document pairs together with their corresponding relevance judgments, the probability of relevance is estimated with respect to these features. The best example of the application of this approach is the Darmstadt indexing model (Section 3.4). However, a new model whose experimental results are not yet known has been proposed by Cooper et al. [1992]. These models exploit the mapping between representations and descriptions introduced in Section 2.2.

3.1 Probabilistic Modeling as a Decision Strategy

The use of probabilities in IR was advanced in 1960 by Maron and Kuhns [1960]. In 1976, Robertson and Sparck Jones went further by showing the powerful contribution of probability theory in modeling IR.

The probabilistic model was theoretically finalized by van Rijsbergen [1979, Chapter 6]. The focus of the model is on its analysis as a decision strategy based upon a loss or risk function.

Referring to the conceptual model described in Section 2.2, it is assumed that the representation and the description methods for queries and documents are the same. Queries and documents are described by sets of index terms. Let $T = \{t_1, \ldots, t_n\}$ denote the set of terms used in the collection of documents. We represent the query $q_k$ with terms belonging to $T$. Similarly, we represent a document $d_j$ as the set of terms occurring in it. With a binary representation, $d_j$ is represented as the binary vector $\vec{x} = (x_1, \ldots, x_n)$ with $x_i = 1$ if $t_i \in d_j$ and $x_i = 0$ otherwise. The query $q_k$ is represented in the same manner.

The basic assumption, common to most models described in Section 3, is that the distribution of terms within the document collection provides information about the relevance of a document to a given query, since it is assumed that terms are distributed differently in relevant and irrelevant documents. This is known as the cluster hypothesis (see van Rijsbergen [1979, pp. 45–47]). If the term distribution were the same within the sets of relevant and irrelevant documents, then it would not be possible to devise a discrimination criterion between them, in which case a different representation of the document information content would be necessary.

The term distribution provides information about the “probability of relevance” of a document to a query. If we assume binary relevance judgments, then the term distribution provides information about $P(R|q_k, d_j)$.

The quantity $P(R|q_k, \vec{x})$, with $\vec{x}$ as a binary document representation, cannot be estimated directly. Instead, Bayes' theorem is applied [Pearl 1988]:

$$P(R|q_k, \vec{x}) = \frac{P(R|q_k) \cdot P(\vec{x}|R, q_k)}{P(\vec{x}|q_k)}.$$

5 Depending on the complexity of the models, the probabilities to be estimated can be with respect to the representations or the descriptions. For clarity, we refer to the representations only, unless otherwise stated.


To simplify notation, we omit $q_k$ on the understanding that evaluations are with respect to a given query $q_k$. The previous relation becomes

$$P(R|\vec{x}) = \frac{P(R) \cdot P(\vec{x}|R)}{P(\vec{x})},$$

where $P(R)$ is the prior probability of relevance, $P(\vec{x}|R)$ is the probability of observing the description $\vec{x}$ conditioned upon relevance having been observed, and $P(\vec{x})$ is the probability that $\vec{x}$ is observed. The latter is determined as the joint probability distribution of the $n$ terms within the collection. The preceding formula evaluates the “posterior” probability of relevance conditioned upon the information provided in the vector $\vec{x}$.

The provision of a ranking of documents by the PRP can be extended to provide an “optimal threshold” value. This can be used to set a cutoff point in the ranking to distinguish between those documents that are worth retrieving and those that are not. This threshold is determined by means of a decision strategy, whose associated cost function $C_j(R, dec)$ for each document $d_j$ is described in Table I.

The decision strategy can be described simply as one that minimizes the average cost resulting from any decision. This strategy is equivalent to minimizing the following risk function:

$$\mathcal{R}(R, dec) = \sum_{d_j \in D} C_j(R, dec) \cdot P(d_j|R).$$

It can be shown (see van Rijsbergen [1979, pp. 115–117]) that the minimization of that function brings about an optimal partitioning of the document collection.

This is achieved by retrieving only those documents for which the following relation holds:

$$\frac{P(d_j|R)}{P(d_j|\bar{R})} > \lambda,$$

where

$$\lambda = \frac{\lambda_2 \cdot P(\bar{R})}{\lambda_1 \cdot P(R)}.$$

3.2 The Binary Independence Retrieval Model

From the previous section, it remains necessary to estimate the joint probabilities $P(d_j|R)$ and $P(d_j|\bar{R})$, that is, $P(\vec{x}|R)$ and $P(\vec{x}|\bar{R})$, if we consider the binary vector document representation $\vec{x}$.

In order to simplify the estimation process, the components of the vector $\vec{x}$ are assumed to be stochastically independent when conditioned upon $R$ or $\bar{R}$. That is, the joint probability distribution of the terms in the document $d_j$ is given by the product of the marginal probability distributions:

$$P(d_j|R) = P(\vec{x}|R) = \prod_{i=1}^{n} P(x_i|R),$$

$$P(d_j|\bar{R}) = P(\vec{x}|\bar{R}) = \prod_{i=1}^{n} P(x_i|\bar{R}).$$

This binary independence assumption is the basis of a model first proposed in Robertson and Sparck Jones [1976]: the binary independence retrieval (BIR) model. The assumption has always been recognized as unrealistic. Nevertheless, as pointed out by Cooper [1995, Section 5], the assumption that actually underlies the BIR model is not that of binary independence but the weaker assumption of linked dependence:

$$\frac{P(\vec{x}|R)}{P(\vec{x}|\bar{R})} = \prod_{i=1}^{n} \frac{P(x_i|R)}{P(x_i|\bar{R})}.$$

Table I. Cost of Retrieving and Not Retrieving a Relevant and Irrelevant Document (the table body was lost in extraction; it is reconstructed here from the threshold formula above, with the correct decisions assigned zero cost)

                     relevant ($R$)    irrelevant ($\bar{R}$)
retrieve                  0                 $\lambda_2$
do not retrieve      $\lambda_1$                 0


This states that the ratio between the probabilities of $\vec{x}$ occurring in relevant and irrelevant documents is equal to the product of the corresponding ratios of the single terms.

Considering the decision strategy of the previous section, a logarithmic transformation can now be used to obtain a linear decision function:

$$g(d_j) = \log \frac{P(d_j|R)}{P(d_j|\bar{R})} > \log \lambda.$$

To simplify notation, we define $p_j = P(x_j = 1|R)$ and $q_j = P(x_j = 1|\bar{R})$, which represent the probability of the $j$th term appearing in a relevant and in an irrelevant document, respectively. Clearly, $1 - p_j = P(x_j = 0|R)$ and $1 - q_j = P(x_j = 0|\bar{R})$. This gives:

$$P(\vec{x}|R) = \prod_{j=1}^{n} p_j^{x_j} \cdot (1 - p_j)^{1 - x_j},$$

$$P(\vec{x}|\bar{R}) = \prod_{j=1}^{n} q_j^{x_j} \cdot (1 - q_j)^{1 - x_j}.$$

Substituting the preceding gives:

$$g(d_j) = \sum_{j=1}^{n} \left( x_j \cdot \log \frac{p_j}{q_j} + (1 - x_j) \cdot \log \frac{1 - p_j}{1 - q_j} \right) = \sum_{j=1}^{n} c_j x_j + C,$$

where

$$c_j = \log \frac{p_j \cdot (1 - q_j)}{q_j \cdot (1 - p_j)}, \qquad C = \sum_{j=1}^{n} \log \frac{1 - p_j}{1 - q_j}.$$

This formula gives the RSV of document $d_j$ for the query under consideration. Documents are ranked according to their RSV and presented to the user. The cutoff value $\lambda$ can be used to determine the point at which the display of the documents is stopped, although the RSV is generally used only to rank the entire collection of documents. In a real IR system, the ordering of documents on their estimated probability of relevance to a query matters more than the actual value of those probabilities. Therefore, since the value of $C$ is constant for a specific query, we need only consider the value of $c_j$. This value, or more often the value $\exp(c_j)$, is called the term relevance weight (TRW) and indicates the term's capability to discriminate relevant from irrelevant documents. As can be seen, in the BIR model term relevance weights contribute “independently” to the relevance of a document.
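The following minimal sketch shows BIR scoring under assumed values of $p_j$ and $q_j$: it computes the weights $c_j$ and ranks binary document vectors by their RSV. The constant $C$ is dropped, since it does not affect the ranking.

```python
import math

# BIR scoring sketch: g(d) = sum_j c_j * x_j (the constant C is omitted
# because it is the same for every document). All p_j, q_j are invented.

def term_weight(p_j, q_j):
    """Term relevance weight c_j = log[p_j (1 - q_j) / (q_j (1 - p_j))]."""
    return math.log((p_j * (1.0 - q_j)) / (q_j * (1.0 - p_j)))

p = [0.8, 0.5, 0.6]  # assumed P(x_j = 1 | R) for three query terms
q = [0.3, 0.2, 0.5]  # assumed P(x_j = 1 | R_bar)
c = [term_weight(pj, qj) for pj, qj in zip(p, q)]

docs = {"d1": [1, 1, 0], "d2": [0, 1, 1], "d3": [1, 0, 0]}  # binary vectors

rsv = {name: sum(cj * xj for cj, xj in zip(c, x)) for name, x in docs.items()}
for name in sorted(rsv, key=rsv.get, reverse=True):
    print(name, round(rsv[name], 3))
```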

To apply the BIR model, it is necessary to estimate the parameters $p_j$ and $q_j$ for each term used in the query. This is done in various ways, depending upon the amount of information available. The estimation can be retrospective or predictive. The first is used on test collections where the relevance assessments are known, and the second with normal collections, where parameters are estimated by means of relevance feedback from the user.

Another technique, proposed by Croft and Harper [1979], uses collection information to make estimates and does not use relevance information. Let us assume that the IR system has already retrieved some documents for the query $q_k$. The user is asked to give relevance assessments for those documents, from which the parameters of the BIR model are estimated. If we also assume we are working retrospectively, then we know the relevance value of all individual documents in the collection. Let a collection have $N$ documents, $R$ of which are relevant to the query. Let $n_j$ denote the number of documents in which the term $x_j$ appears, amongst which only $r_j$ are relevant to the query.


The parameters $p_j$ and $q_j$ can then be estimated:

$$\hat{p}_j = \frac{r_j}{R}, \qquad \hat{q}_j = \frac{n_j - r_j}{N - R}.$$

These give:

$$TRW_j = \frac{r_j / (R - r_j)}{(n_j - r_j) / (N - n_j - R + r_j)}.$$

This approach is possible only if we have relevance assessments for all documents in the collection, that is, where we know $R$ and $r_j$. According to Croft and Harper, given that the only information concerning the relevance of documents is that provided by a user through relevance feedback, predictive estimations should be used. Let $\tilde{R}$ denote the number of documents judged relevant by the user. Furthermore, let $\tilde{r}_j$ be the number of those documents in which the term $x_j$ occurs. We can then combine this with the estimation technique of Cox [1970] to get

$$\widetilde{TRW}_j = \frac{(\tilde{r}_j + 0.5) / (\tilde{R} - \tilde{r}_j + 0.5)}{(n_j - \tilde{r}_j + 0.5) / (N - n_j - \tilde{R} + \tilde{r}_j + 0.5)}.$$

Usually, the relevance information given by a user is limited and is not sufficiently representative of the entire collection. Consequently, the resulting estimates tend to lack precision. As a partial solution, one generally simplifies by assuming $p_j$ to be constant for all the terms in the indexing vocabulary. The value $p_j = 0.5$ is often used, which gives a TRW that can be evaluated easily:

$$\widetilde{TRW}_j = \frac{N - n_j}{n_j}.$$

For large $N$ (i.e., large collections of documents) this expression can be approximated by the “inverse document frequency” $IDF_j = \log(N / n_j)$. This is widely used in IR to indicate the intuitive discrimination power of a term in a document collection.

3.3 The Binary Independence Indexing Model

The binary independence indexing model (BII model) is a variant of the BIR model. Where the BIR model regards a single query with respect to the entire document collection, the BII model regards one document in relation to a number of queries. The indexing weight of a term is evaluated as an estimate of the probability of relevance of that document with respect to queries using that term. This idea was first proposed in Maron and Kuhns's [1960] indexing model.

In the BII, the focus is on the query representation, which we assume to be a binary vector $\vec{z}$. The dimension of the vector is given by the set of all terms $T$ that could be used in a query; $z_j = 1$ if the term represented by that element is present in the query, and $z_j = 0$ otherwise.6 In this model, the term weights are defined in terms of frequency information derived from queries; that is, an explicit document representation is not required. We assume only that there is a subset of terms that can be used to represent any document and will be given weights with respect to a particular document.

The BII model seeks an estimate of the probability $P(R|\vec{z}, d_j)$ that the document $d_j$ will be judged relevant to the query represented by $\vec{z}$. In the formalism of the previous section, we use $\vec{x}$ to denote the document representation. So far this model looks very similar to the BIR;

6 As a consequence, two different information needs (i.e., two queries) using the same set of terms produce the same ranking of documents.


the difference lies with the application of Bayes' theorem:

$$P(R|\vec{z}, \vec{x}) = \frac{P(R|\vec{x}) \cdot P(\vec{z}|R, \vec{x})}{P(\vec{z}|\vec{x})}.$$

$P(R|\vec{x})$ is the probability that the documents represented by $\vec{x}$ will be judged relevant to an arbitrary query. $P(\vec{z}|R, \vec{x})$ is the probability that the document will be relevant to a query with representation $\vec{z}$. As $\vec{z}$ and $\vec{x}$ are assumed to be independent, $P(\vec{z}|\vec{x})$ reduces to the probability that the query $\vec{z}$ will be submitted to the system, $P(\vec{z})$.

To proceed from here, some simplifying assumptions must be made.

(1) The conditional distribution of terms in all queries is independent. This is the classic “binary independence assumption” from which the model's name arises:

$$P(\vec{z}|R, \vec{x}) = \prod_{i=1}^{n} P(z_i|R, \vec{x}).$$

(2) The relevance of a document with representation $\vec{x}$ with respect to a query $\vec{z}$ depends only upon the terms used by the query (i.e., those with $z_i = 1$) and not upon other terms.

(3) With respect to a specific document, for each term not used in the document representation, we assume:

$$P(R|z_i, \vec{x}) = P(R|\vec{x}).$$

Now, applying the first assumption to $P(R|\vec{z}, \vec{x})$, we get:

$$P(R|\vec{z}, \vec{x}) = \frac{P(R|\vec{x})}{P(\vec{z}|\vec{x})} \cdot \prod_{i=1}^{n} P(z_i|R, \vec{x}).$$

By applying the second assumption and Bayes' theorem, we get the ranking formula:

$$P(R|\vec{z}, \vec{x}) = \frac{\prod_i P(z_i)}{P(\vec{z})} \cdot P(R|\vec{x}) \cdot \prod_{i=1}^{n} \frac{P(R|z_i, \vec{x})}{P(R|\vec{x})}$$

$$= \frac{\prod_i P(z_i)}{P(\vec{z})} \cdot P(R|\vec{x}) \cdot \prod_{z_i = 1} \frac{P(R|z_i = 1, \vec{x})}{P(R|\vec{x})} \cdot \prod_{z_i = 0} \frac{P(R|z_i = 0, \vec{x})}{P(R|\vec{x})}.$$

The value of the first fraction is a constant $c$ for a given query, so there is no need to estimate it for ranking purposes. In addition, by applying the third assumption, the third fraction becomes equal to 1 and we obtain:

$$P(R|\vec{z}, \vec{x}) = c \cdot P(R|\vec{x}) \cdot \prod_{t_i \in \vec{z} \cap \vec{x}} \frac{P(R|t_i, \vec{x})}{P(R|\vec{x})}.$$

There are a few problems with this model. The third assumption contrasts with experimental results reported by Turtle [1990], who demonstrates the advantage of assigning weights to query terms not occurring in a document. Moreover, the second assumption is called into question by Robertson and Sparck Jones [1976], who proved experimentally the superiority of a ranking approach in which the probability of relevance is based upon both the presence and the absence of query terms in documents. The results suggest that the BII model might obtain better results if it were, for example, used together with a thesaurus or a set of term–term relations. This would make possible the use of document terms not present in the query but related in some way to those that were.

Fuhr [1992b] pointed out that, in its present form, the BII model is hardly an appropriate model because, in general, not enough relevance information is available to estimate the probability $P(R|t_i, \vec{x})$ for specific term-document pairs. To overcome this problem in part, one can assume that a document consists of independent components to which the indexing weights relate. However, experimental evaluations of this strategy have shown only average retrieval results [Kwok 1990].



Robertson et al. [1982] proposed a model that provides a unification of the BII and BIR models. The proposed model, simply called Model 3 (as opposed to the BII model, called Model 1, and the BIR model, called Model 2), lets us combine the two retrieval strategies of the BII and the BIR models, thus providing a new definition of probability of relevance that unifies those of the BII and BIR models. In the BII model the probability of relevance of a document given a query is computed relative to evidence consisting of the properties of the queries for which that document was considered relevant, whereas in the BIR model it is computed relative to the evidence consisting of the properties of documents considered relevant by that same query. Model 3 allows us to use both forms of evidence. Unfortunately, a computationally tractable estimation theory fully faithful to Model 3 has not been proposed. The Model 3 idea was explored further by Fuhr [1989] and Wong and Yao [1989] (see Section 3.5).

3.4 The Darmstadt Indexing Model

The basic idea of the DIA is to use long-term learning of indexing weights from users' relevance judgments [Fuhr and Knorz 1984; Biebricher et al. 1988; Fuhr and Buckley 1991]. It can be seen as an attempt to develop index-term-specific estimates based upon the use of index terms in the learning sample.

The DIA attempts to estimate $P(R|x_i, q_k)$ from a sample of relevance judgments of query-document or term-document pairs. This approach, when used for indexing, associates a set of heuristically selected attributes with each term-document pair, rather than estimating the probability associated with an index term directly (examples are given in the following). The use of an attribute set reduces the amount of training data required and allows the learning to be collection-specific. However, the degree to which the resulting estimates are term-specific depends critically upon the particular attributes used.

The indexing performed by the DIA is divided into two steps: a description step and a decision step.

In the description step, relevance descriptions for term-document pairs $(x_i, \vec{x})$ are formed. These relevance descriptions $s(x_i, \vec{x})$7 comprise a set of attributes considered important for the task of assigning weights to terms with respect to documents. A relevance description $s(x_i, \vec{x})$ contains values of attributes of the term $x_i$, of the document (represented by $\vec{x}$), and of their relationships. This approach does not make any assumptions about the structure of the function $s$ or about the choice of attributes. Some possible attributes to be used by the description function are:

—frequency of occurrence of term $x_i$ in the document;

—inverse frequency of term $x_i$ in the collection;

—information about the location of the occurrence of term $x_i$ in the document; or

—parameters describing the document, for example, its length, the number of different terms occurring in it, and so on.

In the decision step, a probabilistic index weight based on the previous data is assigned. This means that we estimate $P(R|s(x_i, \vec{x}))$ and not $P(R|x_i, \vec{x})$. In the latter case, we would have regarded a single document $d_j$ (or $\vec{x}$) with respect to all queries containing $x_i$, as in the BII model. Here, we regard the set of all query-document pairs in which the same relevance description $s$ occurs. The interpretation of $P(R|s(x_i, \vec{x}))$ is therefore the probability of a document being judged relevant to an arbitrary query, given that a term common to both document and query has a relevance description $s(x_i, \vec{x})$.

The estimates of $P(R|s(x_i, \vec{x}))$ are derived from a learning sample of term-document pairs with attached relevance judgments derived from the query-document pairs.

7 These are similar to those used in pattern recognition.


If we call this new domain $L$, we have:

$$L \subset D \times Q \times \mathcal{R}, \quad \text{or} \quad L = \{(q_k, d_j, r_{kj})\}.$$

By forming relevance descriptions for the terms common to queries and documents for every query-document pair in $L$, we get a multiset of relevance descriptions with relevance judgments:

$$L_x = [(s(x_i, d_j), r_{kj}) \mid x_i \in q_k \cap d_j \wedge (q_k, d_j, r_{kj}) \in L].$$

With this set, it would be possible to estimate $P(R|s(x_i, \vec{x}))$ as the relative frequency of those elements of $L_x$ with the same relevance description. Nevertheless, the technique used in the DIA makes use of an indexing function, because it provides better estimates through additional plausible assumptions about the indexing function. In Fuhr and Buckley [1991], various linear indexing functions estimated by least-squares polynomials were used, and in Fuhr and Buckley [1993] a logistic indexing function estimated by maximum likelihood was attempted. Experiments were performed using both a controlled and a free term vocabulary.
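As an illustration of the decision step, the sketch below fits a logistic indexing function to a toy learning sample, assuming scikit-learn is available. The attribute set and all numbers are invented; each row plays the role of a relevance description $s(x_i, d)$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# DIA decision-step sketch. Each row is a relevance description with the
# assumed attributes: [within-document term frequency, inverse collection
# frequency, relative position of first occurrence, document length].
X = np.array([
    [3, 2.1, 0.05, 1200],
    [1, 0.4, 0.80, 300],
    [5, 3.0, 0.10, 2500],
    [1, 1.2, 0.50, 800],
    [4, 2.7, 0.02, 1500],
    [2, 0.9, 0.90, 400],
])
y = np.array([1, 0, 1, 0, 1, 0])  # relevance judgments r_kj from the learning sample

model = LogisticRegression(max_iter=1000).fit(X, y)

# P(R | s(x_i, d)) for a new term-document pair becomes its indexing weight.
new_description = np.array([[2, 1.8, 0.15, 1000]])
print(model.predict_proba(new_description)[0, 1])
```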

The experimental results on the standard test collections indicate that the DIA approach is often superior to other indexing methods. The more recent (but only partial) results obtained using the TREC collection [Fuhr and Buckley 1993] tend to support this conclusion.

3.5 The Retrieval with Probabilistic Indexing Model

The retrieval with probabilistic indexing (RPI) model described in Fuhr [1989] takes a different approach from other probabilistic models. This model assumes that we use not only a weighting of index terms with respect to the document but also a weighting of query terms with respect to the query. If we denote by $w_{mi}$ the weight of index term $x_i$ with respect to the document $\vec{x}_m$, and by $v_{ki}$ the weight of the query term $z_i = x_i$ with regard to the query $\vec{z}_k$, then we can evaluate the following scalar product and use it as a retrieval function:

$$r(\vec{x}_m, \vec{z}_k) = \sum_{\{x_m = z_k\}} w_{mi} \cdot v_{ki}.$$

Wong and Yao [1989] give a utility-theoretic interpretation of this formula for probabilistic indexing. Assuming we have a weighting of terms with respect to documents (similar to those, for example, of the BII or DIA models), the weight $v_{ki}$ can be regarded as the utility of the term $t_i$, and the retrieval function $r(d_m, q_k)$ as the expected utility of the document with respect to the query. Therefore, $r(d_m, q_k)$ does not estimate the probability of relevance, but has the same utility-theoretic justification as the PRP.

RPI was developed especially for combining probabilistic indexing weighting with query-term weighting based, for example, on relevance feedback. As a result, its main advantage is that it is suitable for application to different probabilistic indexing schemes.
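A sketch of the RPI retrieval function as a scalar product over the terms shared by document and query; the weights are invented and would in practice come from a probabilistic indexing scheme and from relevance feedback:

```python
# RPI sketch: r(x_m, z_k) = sum over shared terms of w_mi * v_ki.
# All weights are hypothetical.

def rpi_score(doc_weights, query_weights):
    common = doc_weights.keys() & query_weights.keys()
    return sum(doc_weights[t] * query_weights[t] for t in common)

doc_weights = {"wine": 0.7, "grape": 0.3, "vineyard": 0.5}  # e.g., indexing weights
query_weights = {"wine": 1.0, "vineyard": 0.4}              # e.g., from relevance feedback

print(rpi_score(doc_weights, query_weights))  # 0.7*1.0 + 0.5*0.4 = 0.9
```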

3.6 The Probabilistic Inference Model

Wong and Yao [1995] extend the work reported in Wong and Yao [1989] by adopting an epistemological view of probability, from which they propose a probabilistic inference model for IR. With the epistemic view of probability theory, the probabilities under consideration are defined on the basis of semantic relationships between documents and queries. The probabilities are interpreted as degrees of belief.

The general idea of the model starts with the definition of a concept space, which can be interpreted as the knowledge space in which documents, index terms, and user queries are represented as propositions. For example, the proposition $d$ is the knowledge contained in the document; the proposition $q$ is the information need requested; and the proposition $d \cap q$ is the portion of knowledge common to $d$ and $q$.


An epistemic probability function $P$ is defined on the concept space. For example, $P(d)$ is the degree to which the concept space is covered by the knowledge contained in the document, and $P(d \cap q)$ is the degree to which the concept space is covered by the knowledge common to the document and the query.

From these probabilities, different measures can be constructed to evaluate the relevance of documents to queries; the measures offer different interpretations of relevance and thus lead to different approaches to modeling IR. We discuss two of them. The first is:

$$C(d \to q) = P(q|d) = \frac{P(d \cap q)}{P(d)}.$$

$C(d \to q)$ can be considered as a measure of the precision of the document with respect to the query, and is defined as the probability that a retrieved document is relevant. A precision-oriented interpretation of relevance should be used when the user is interested in locating a specific piece of information. A second measure is:

$$C(q \to d) = P(d|q) = \frac{P(q \cap d)}{P(q)}.$$

$C(q \to d)$ is considered a recall index of the document with respect to the query, and is defined as the probability that a relevant document is retrieved. A recall-oriented measure should be used when the user is interested in finding as many papers as possible on the subject.
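A small worked example of the two measures, under assumed epistemic probabilities on the concept space:

```python
# Worked example of the two measures; all probabilities are invented.

p_d = 0.20        # assumed P(d): coverage of the concept space by the document
p_q = 0.05        # assumed P(q): coverage by the query
p_d_and_q = 0.04  # assumed P(d ∩ q): coverage by their common knowledge

precision_like = p_d_and_q / p_d  # C(d -> q) = P(q|d) = 0.2
recall_like = p_d_and_q / p_q     # C(q -> d) = P(d|q) = 0.8
print(precision_like, recall_like)
```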

Depending on the relationships between concepts, different formulations of $C(d \to q)$ and $C(q \to d)$ are obtained. For example, suppose that the concept space is $t_1 \cup \ldots \cup t_n$, where the basic concepts are (pairwise) disjoint; that is, $t_i \cap t_j = \emptyset$ for $i \neq j$. It can be proven that

$$C(d \to q) = \frac{\sum_t P(d \cap q \mid t) P(t)}{P(d)},$$

$$C(q \to d) = \frac{\sum_t P(d \cap q \mid t) P(t)}{P(q)}.$$

Wong and Yao [1995] aim to provide a probabilistic evaluation of uncertain implications, which have been advanced as a way to measure the relevance of documents to queries (see Section 4.1). Although measuring uncertain implications by a probability function is more restrictive than, for example, using the possible-world analysis, the model proposed by Wong and Yao is both expressive and sound. For example, they show that the Boolean, fuzzy set, vector space, and probabilistic models are special cases of their model.

3.7 The Staged Logistic Regression Model

The staged logistic regression model (SLR), proposed in Cooper et al. [1992], is an attempt to overcome some problems in using standard regression methods to estimate probabilities of relevance in IR. Cooper criticizes Fuhr's approaches, especially the DIA, which requires strong simplifying assumptions. He thinks (a longer explanation of his point of view appears in Section 5) that these assumptions inevitably distort the final estimate of the probability of relevance. He advocates a “model-free” approach to estimation. In addition, a more serious problem lies in the use of standard polynomial regression methods. Standard regression theory is based on the assumption that the sample values taken for the dependent variable are from a continuum of possible magnitudes. In IR, the dependent variable is usually dichotomous: a document is either relevant or irrelevant. So standard regression is clearly inappropriate in such cases.

A more appropriate tool, according to Cooper, is logistic regression, a statistical method specifically developed for use with dichotomous (or discrete) dependent variables. Related techniques have been used with some success by other researchers;


for example, Fuhr employed it in Fuhr and Pfeifer [1991] and more recently in Fuhr and Buckley [1993].

The method proposed by Cooper is based on the guiding notion of treating composite clues on at least two levels: an intraclue level, at which a predictive statistic is estimated separately for each composite clue,8 and an interclue level, in which these separate statistics are combined to obtain an estimate of the probability of relevance for a query-document pair. As this proceeds in stages, the method is called staged logistic regression. A two-stage SLR would be as follows.

(1) A statistical simplifying assumption is used to break down the complex joint probabilistic distribution of the composite clues. This assumption is called linked dependence. For example, assuming that we have only two clues, a positive real number $K$ exists such that the following conditions hold true:

$$P(a, b|R) = K \cdot P(a|R) \cdot P(b|R),$$

$$P(a, b|\bar{R}) = K \cdot P(a|\bar{R}) \cdot P(b|\bar{R}).$$

It follows that

$$\frac{P(a, b|R)}{P(a, b|\bar{R})} = \frac{P(a|R)}{P(a|\bar{R})} \cdot \frac{P(b|R)}{P(b|\bar{R})}.$$

Generalizing this result to the case of $n$ clues and taking the “log odds,” we obtain:

$$\log O(R|a_1, \ldots, a_n) = \log O(R) + \sum_{i=1}^{n} \left( \log O(R|a_i) - \log O(R) \right).$$

This is used at retrieval time to evaluate the log odds of relevance for each document in the collection with respect to the query.

(2) A logistic regression analysis on a learning sample is used to estimate the terms on the right-hand side of the previous equation. Unfortunately, the required learning sample is often only available within the environment of test collections, although it could be possible to use the results of previous good queries for this purpose.

The estimation of $\log O(R)$ is quite straightforward using simple proportions. A more complex matter is the estimation of $\log O(R|a_i)$ when there are too few query-document pairs in the learning set with the clue $a_i$ to yield estimates of $P(R|a_i)$ and $P(\bar{R}|a_i)$. To go beyond simple averaging, Cooper uses multiple logistic regression analysis. If we assume that the clue $a_i$ is a composite clue whose elementary attributes are $h_1, \ldots, h_m$, then we can estimate $\log O(R|a_i)$:

$$\log O(R|a_i) = \log O(R|h_1, \ldots, h_m) = c_0 + c_1 h_1 + \cdots + c_m h_m.$$

To demonstrate how the logistic function comes into the model, the probability of relevance of a document can be expressed as

$$P(R|h_1, \ldots, h_m) = \frac{e^{c_0 + c_1 h_1 + \cdots + c_m h_m}}{1 + e^{c_0 + c_1 h_1 + \cdots + c_m h_m}}.$$

Taking the log odds of both sides conveniently reduces this formula to the previous one.

(3) A second logistic regression analysis, based on the same learning sample, is used to obtain another predictive rule to combine the composite clues and correct biases introduced by the simplifying assumption.

8 A simple clue could be, for example, the presence of an index term in a document. Clues must be machine-detectable.


The linked dependence assumption tends to inflate the estimates for documents near the top of the output ranking whenever the clues on which the estimates are based are strongly interdependent. To help correct this, a second-level logistic regression analysis is performed on the results of the first, of the following form:

$$\log O(R|a_1, \ldots, a_n) = d_0 + d_1 Z + d_2 n,$$

where $Z = \sum_{i=1}^{n} (\log O(R|a_i) - \log O(R))$ and $n$ is the number of composite clues. More elaborate correcting equations might also be considered.

When a query is submitted to the system and a document is compared against it, the technique in (2) is applied to evaluate the log odds necessary to obtain $Z$. That is then employed in (3) to adjust the estimate of the log odds of relevance for the document.
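The two stages can be sketched as follows. The per-clue log odds and the stage-two coefficients $d_0$, $d_1$, $d_2$ are invented here; in practice both stages would be fitted by logistic regression on the learning sample.

```python
import math

# SLR sketch with invented per-clue log odds and invented stage-two
# correction coefficients.

def log_odds(p):
    return math.log(p / (1.0 - p))

log_o_r = log_odds(0.01)  # LogO(R), e.g., from simple proportions
clue_log_odds = [log_odds(0.05), log_odds(0.10), log_odds(0.03)]  # LogO(R|a_i)

# Stage 1: linked-dependence combination.
Z = sum(lo - log_o_r for lo in clue_log_odds)
stage1 = log_o_r + Z

# Stage 2: correct for interdependence among the clues.
d0, d1, d2 = -0.2, 0.8, -0.05  # assumed correction coefficients
stage2 = d0 + d1 * Z + d2 * len(clue_log_odds)

# Corrected log odds converted back to a probability of relevance.
print(stage1, stage2, 1.0 / (1.0 + math.exp(-stage2)))
```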

This approach seems flexible enough to handle almost any type of probabilistic retrieval clue likely to be of interest, and is especially appropriate when the retrieval clues are grouped or composite. However, the effectiveness of the methodology remains to be determined empirically, and its performance compared with other retrieval methods. An experimental investigation is currently under way by Cooper, and the use of logistic regression has also been investigated by Fuhr, as reported in the proceedings of the TREC-1 Conference [Fuhr and Buckley 1993].

3.8 The N-Poisson Indexing Model

This probabilistic indexing model is an extension to n dimensions of the 2-Poisson model proposed by Bookstein and Swanson [1974]. In its two-dimensional form the model is based upon the following assumption. If the number of occurrences of a term within a document differs depending upon whether the document is relevant, and if the number of occurrences of that term can be modeled using a known distribution, then it is possible to decide whether a term should be assigned to a document by determining to which of the two distributions the term belongs. The 2-Poisson model resulted from a search for the statistical distribution of occurrence of potential index terms in a collection of documents.

We can extend the preceding idea to the n-dimensional case. We suppose there are n classes of documents in which the term $x_i$ appears with different frequencies according to the extent of coverage of the topic related to that specific term. The distribution of the term within each class is governed by a single Poisson distribution. Given a term $x_i$, a document class for that term $K_{ij}$, and the expectation $\lambda_{ij}$ of the number of occurrences of that term in that class, the probability that a document contains $l$ occurrences of $x_i$ (i.e., that $\mathit{tf}(x_i) = l$), given that it belongs to the class $K_{ij}$, is given by

\[
P(\mathit{tf}(x_i) = l \mid \vec{x} \in K_{ij}) = \frac{\lambda_{ij}^{l}}{l!}\, e^{-\lambda_{ij}}.
\]

Extending this result, the distribution of a certain term within the whole collection of documents is governed by a sum of Poisson distributions, one for each class of coverage. In other words, if we take a document in the collection at random whose probability of belonging to class $K_{ij}$ is $p_{ij}$, then the probability of having $l$ occurrences of term $x_i$ is:

\[
P(\mathit{tf}(x_i) = l) = \sum_{j=1}^{n} p_{ij}\, \frac{e^{-\lambda_{ij}} \lambda_{ij}^{l}}{l!}.
\]

This result can be used with a Bayesian inversion to evaluate $P(\vec{x} \in K_{ij} \mid \mathit{tf}(x_i) = l)$ for retrieval purposes. The parameters $\lambda_{ij}$ and $p_{ij}$ can be estimated without feedback information by applying statistical techniques to the document collection.
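As an illustration of this machinery, the following sketch evaluates the mixture probability and the Bayesian inversion for a single term with three document classes. The class priors $p_{ij}$ and Poisson parameters $\lambda_{ij}$ are invented for the example; in practice they would be estimated from the collection, for instance by expectation-maximization.

```python
# Sketch of the n-Poisson machinery: mixture probability of observing
# l occurrences of a term, and the posterior probability of each class.
import math

p = [0.7, 0.25, 0.05]        # p_ij: prior probability of each class K_ij
lam = [0.5, 3.0, 10.0]       # lambda_ij: expected occurrences per class

def poisson(l, lam_j):
    # Single Poisson component: lambda^l * e^(-lambda) / l!
    return (lam_j ** l) * math.exp(-lam_j) / math.factorial(l)

def mixture(l):
    # P(tf(x_i) = l) as a sum of Poisson components
    return sum(pj * poisson(l, lj) for pj, lj in zip(p, lam))

def posterior(l):
    # P(x in K_ij | tf(x_i) = l) by Bayes' rule
    total = mixture(l)
    return [pj * poisson(l, lj) / total for pj, lj in zip(p, lam)]

print(mixture(4))
print(posterior(4))   # class membership probabilities given 4 occurrences
```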

Experiments have shown that the performance of this model is not always consistent. Some experiments by Harter [1975] on a 2-Poisson model showed that a significant number of “good” index terms were 2-Poisson, but they did not provide conclusive evidence of the validity of the n-Poisson model. These results were corroborated by Robertson and Walker [1994], who demonstrated considerable performance improvements by using some effective approximations to the 2-Poisson model on the TREC collection. Other research investigated the possibility of using a 3-Poisson model, and lastly Margulis [1992; 1993] investigated the generalized n-Poisson model on several large full-text document collections. His findings were more encouraging than previous ones. He determined that over 70% of frequently occurring words were indeed distributed according to an n-Poisson distribution. Furthermore, he found that the distribution of most n-Poisson words had relatively few single Poisson components, usually two, three, or four. He suggests that his study provides strong evidence that the n-Poisson distribution could be used as a basis for accurate statistical modeling of large document collections. However, to date, the n-Poisson approach lacks work on retrieval strategies based upon these results.

4. UNCERTAIN INFERENCE MODELS

The models presented in this section are based on the idea that IR is a process of uncertain inference. Uncertain inference models are based on more complex forms of relevance than those used in relevance models, which are based mainly upon statistical estimates of the probability of relevance. With uncertain inference models, information not present in the query formulation may be included in the evaluation of the relevance of a document. Such information might be domain knowledge, knowledge about the user, user's relevance feedback, and the like. The estimation of the probabilities $P(R \mid q_k, d_i, K)$ involves the representation of the knowledge $K$.

Another characteristic of uncertain inference models is that they are not as strongly collection-dependent as relevance models. Parameters in relevance models are valid only for the current collection, whereas inference models can use knowledge about the user or the application domain that can be useful with many other collections.

This research area is promising in that it attempts to move away from the traditional approaches, and it may provide the breakthrough that appears necessary to overcome the limitations of current IR systems.

There are two main types of uncertain inference models: one based on nonclassical logic, to which probabilities are mapped (Section 4.1), and the other based on Bayesian inferences (Section 4.2).

4.1 A Nonclassical Logic for IR

In 1986, van Rijsbergen proposed a paradigm for probabilistic IR in which IR was regarded as a process of uncertain inference [van Rijsbergen 1986]. The paradigm is based on the assumptions that queries and documents can be regarded as logical formulae and that, to answer a query, an IR system must prove the query from the documents. This means that a document is relevant to a query only if it implies the query, in other words, if the logical formula $d \to q$ can be proven to hold. The proof may use additional knowledge $K$; in that case, the logical formula is rewritten as $(d, K) \to q$.

The introduction of uncertainty comes from the consideration that a collection of documents cannot be considered a consistent and complete set of statements. In fact, documents in the collection could contradict each other in any particular logic, and not all the necessary knowledge is available. It has been shown [van Rijsbergen 1986; Lalmas 1997] that classical logic, the most commonly used logic, is not adequate to represent queries and documents because of the intrinsic uncertainty present in IR.9 Therefore, van Rijsbergen [1986] proposes the logical uncertainty principle:

Given any two sentences x and y, a measure of the uncertainty of y → x related to a given data set is determined by the minimal extent to which we have to add information to the data set to establish the truth of y → x.

The principle says nothing about how “uncertainty” and “minimal” might be quantified. However, in his paper, van Rijsbergen suggested an information-theoretic approach. This idea has been followed by Nie et al. [1996] and Lalmas [van Rijsbergen and Lalmas 1996]. However, that work is somewhat beyond the scope of this article.

Nearer to this survey, van Rijsbergen [1989] later proposed estimating $P(d \to q)$ by imaging. Imaging formulates probabilities based on a “possible-worlds” semantics [Stalnaker 1981], according to which a document is represented by a possible world $w$, that is, a set of propositions with associated truth values. Let $t$ denote a logical truth function; then $t(w, p)$ denotes the truth of the proposition $p$ in the world $w$. Furthermore, let $s(w, p)$ denote the world most similar to $w$ where $p$ is true. Then, $y \to x$ is true at $w$ if and only if $x$ is true at $s(w, y)$.

Imaging uses this notion of most similar worlds to estimate $P(y \to x)$. Every possible world $w$ has a probability $P(w)$, and the sum over all possible worlds is 1. $P(y \to x)$ is computed as follows.

\[
P(y \to x) = \sum_{w} P(w)\, t(w, y \to x) = \sum_{w} P(w)\, t(s(w, y), y \to x) = \sum_{w} P(w)\, t(s(w, y), x).
\]

It remains undetermined how to evaluate the function $s$ on document representations, and furthermore, how to assign a probability $P$ to them. There have been a few attempts at using imaging in IR (e.g., Amati and Kerpedjiev [1992] and Sembok and van Rijsbergen [1993]), with rather disappointing results. A more recent attempt by Crestani and van Rijsbergen [1995a], taking the view that “an index term is a world,” obtains better results.
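A toy rendering of imaging may help fix ideas. In the sketch below, worlds are truth assignments to two propositions; the probability distribution over worlds and the similarity measure are invented for the example, and $s(w, y)$ is taken to be $w$ itself whenever $y$ already holds there.

```python
# Minimal sketch of imaging over a toy set of possible worlds.
worlds = {
    "w1": {"y": True,  "x": True},
    "w2": {"y": False, "x": True},
    "w3": {"y": True,  "x": False},
}
P = {"w1": 0.5, "w2": 0.3, "w3": 0.2}   # world probabilities, summing to 1

def similarity(w, v):
    # Toy similarity: number of propositions on which two worlds agree
    return sum(worlds[w][p] == worlds[v][p] for p in worlds[w])

def s(w, prop):
    # Most similar world to w in which prop is true (w itself if possible)
    candidates = [v for v in worlds if worlds[v][prop]]
    return max(candidates, key=lambda v: (v == w, similarity(w, v)))

def p_implies(y, x):
    # P(y -> x) by imaging: move each world's probability to its closest
    # y-world and read off the truth of x there
    return sum(P[w] for w in worlds if worlds[s(w, y)][x])

print(p_implies("y", "x"))   # 0.8 for this toy configuration
```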

The concept of relevance is not featured in the preceding framework. In van Rijsbergen [1992], the use of Jeffrey's conditionalization was proposed to evaluate the probability of relevance $P(R \mid q_k, d_i)$. This conditionalization, described as “Neo-Bayesianism” by Pearl [1990], allows conditioning to be based on evidence derived from the “passage of experience,” where the evidence can be nonpropositional in nature. A comprehensive treatise of Jeffrey's studies on probability kinematics (i.e., on how to revise a probability measure in the light of uncertain evidence or observation) can be found in Jeffrey [1965]. By means of the famous example in that book of inspecting the color of a piece of cloth by candlelight, van Rijsbergen introduced a form of conditioning that has many advantages over Bayesian conditioning. In particular, it makes possible conditioning on uncertain evidence and allows order-independent partial assertion of evidence. Such advantages, despite some strong assumptions, convinced van Rijsbergen that this particular form of conditionalization is more appropriate for IR than Bayesian conditionalization. However, despite the appeal of Jeffrey's conditionalization, the evaluation of the probability of relevance involves parameters whose estimation remains problematic.

In the same paper van Rijsbergen [1992] makes the connection between Jeffrey's conditionalization and the Dempster–Shafer theory of evidence [Dempster 1968; Shafer 1976]. This theory can be viewed as a generalization of the Bayesian method (e.g., it rejects the additivity rule), and has been used by some researchers to develop IR models [Schoken and Hummel 1993; de Silva and Milidiu 1993].

9 There are other reasons why classical logic is not adequate, but these are not relevant to this article (see Lalmas [1997]).

4.2 The Inference Network Model

When IR is regarded as a process of uncertain inference, the calculation of the probability of relevance, and the general notion of relevance itself, become more complex. Relevance becomes related to the inferential process by which we find and evaluate a relation between a document and a query.

A probabilistic formalism for describing inference relations with uncertainty is provided by Bayesian inference networks, which have been described extensively in Pearl [1988] and Neapolitan [1990]. Turtle and Croft [Turtle 1990; Turtle and Croft 1990, 1991] applied such networks to IR. Figure 2 depicts an example of such a network. Nodes represent IR entities such as documents, index terms, concepts, queries, and information needs. We can choose the number and kind of nodes we wish to use according to how complex we want the representation of the document collection or the information need to be. Arcs represent probabilistic dependencies between entities. They represent conditional probabilities, that is, the probability of an entity being true given the probabilities of its parents being true.

The inference network is usually made up of two component networks: a document network and a query network. The document network, which represents the document collection, is built once for a given collection and its structure does not change. A query network is built for each information need and can be modified and extended during each session by the user in an interactive and dynamic way. The query network is attached to the static document network in order to process a query.

Figure 2. An inference network for IR.

In a Bayesian inference network, the truth value of a node depends only upon the truth values of its parents. To evaluate the strength of an inference chain going from one document to the query, we set the document node $d_i$ to “true” and evaluate $P(q_k = \mathrm{true} \mid d_i = \mathrm{true})$. This gives us an estimate of $P(d_i \to q_k)$.

It is possible to implement various traditional IR models on this network by introducing nodes representing Boolean operators or by setting appropriate conditional probability evaluation functions within nodes.

One particular characteristic of this model that warrants exploration is that multiple document and query representations can be used within the context of a particular document collection (e.g., a Boolean expression or a vector). Moreover, given a single information need, it is possible to combine results from multiple queries and from multiple search strategies.

The strength of this model comes from the fact that most classical retrieval models can be expressed in terms of a Bayesian inference network by estimating in different ways the weights in the inference network [Turtle and Croft 1992a]. Nevertheless, the characteristics of the Bayesian inference process itself, given that nodes (evidence) can only be binary (either present or not), limit its use to where “certain evidence” [Neapolitan 1990] is available. The approach followed by van Rijsbergen (Section 4.1), which makes use of “uncertain evidence” through Jeffrey's conditionalization, therefore appears attractive.
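The following sketch evaluates a miniature inference chain in this spirit. The network shape, the link probabilities, and the noisy-OR combination at the query node are illustrative assumptions rather than the Turtle–Croft system itself, which supports several such combining functions.

```python
# Toy evaluation of an inference chain d -> terms -> q in a tiny network.
# Setting the document node to "true" and propagating belief upward gives
# P(q = true | d = true), an estimate of P(d -> q). All numbers invented.

# P(term node true | document node true), for two index terms
p_t_given_d = {"t1": 0.9, "t2": 0.4}

def p_query_given_terms(p_terms):
    # Noisy-OR combination of the parent terms, a common choice for
    # "or-like" query operators in inference networks
    p_false = 1.0
    for p_t in p_terms.values():
        p_false *= (1.0 - 0.8 * p_t)   # 0.8: link strength term -> query
    return 1.0 - p_false

# Instantiate the document as observed evidence (d = true) and propagate
p_q = p_query_given_terms(p_t_given_d)
print(p_q)   # estimate of P(q = true | d = true)
```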

5. EFFECTIVE RESULTS FROM FAULTY MODELS

Most of the probabilistic models presented in this article use simplifying assumptions to reduce the complexity involved in applying mathematical models to real situations. There are general risks inherent in the use of such assumptions. One such risk is that there may be inconsistencies between the assumptions laid down and the data to which they are applied.

Another is that there may be a misidentification of the underlying assumptions; that is, the stated assumptions may not be the real assumptions upon which the derived model or resulting experiments are actually based. This risk was noted by Cooper [1995]. He identified the three most commonly adopted simplifying assumptions, which are related to the statistical independence of documents, index terms, and information needs:

Absolute Independence:
\[
P(a, b) = P(a) \cdot P(b);
\]

Conditional Independence:
\[
P(a, b \mid R) = P(a \mid R) \cdot P(b \mid R), \qquad
P(a, b \mid \bar{R}) = P(a \mid \bar{R}) \cdot P(b \mid \bar{R}).
\]

These assumptions are interpreted differently when a and b are regarded as properties of documents or of users.

Cooper pointed out how the combined use of the Absolute Independence assumption and either of the Conditional Independence assumptions yields logical inconsistencies. The combined use of these assumptions leads to the conclusion that $P(a, b, R) > P(a, b)$, which is contrary to the elementary laws of probability theory. Nevertheless, in most cases where these inconsistencies appeared, the faulty model used as the basis for experimental work has proved, on the whole, to be successful. Examples of this are given in Robertson and Sparck Jones [1976] and Fuhr and Buckley [1991].
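To see where the clash arises, a short derivation (standard probability algebra; it is not spelled out in the survey) expands $P(a, b)$ under each assumption:

\begin{align*}
P(a, b) &= P(a, b \mid R)\,P(R) + P(a, b \mid \bar{R})\,P(\bar{R})\\
        &= P(a \mid R)\,P(b \mid R)\,P(R) + P(a \mid \bar{R})\,P(b \mid \bar{R})\,P(\bar{R})
        \quad \text{(Conditional Independence)},\\
P(a, b) &= P(a)\,P(b)\\
        &= \bigl[P(a \mid R)\,P(R) + P(a \mid \bar{R})\,P(\bar{R})\bigr]
           \bigl[P(b \mid R)\,P(R) + P(b \mid \bar{R})\,P(\bar{R})\bigr]
        \quad \text{(Absolute Independence)}.
\end{align*}

Equating the two expansions and simplifying leaves

\[
P(R)\,P(\bar{R})\,\bigl[P(a \mid R) - P(a \mid \bar{R})\bigr]\,\bigl[P(b \mid R) - P(b \mid \bar{R})\bigr] = 0,
\]

so the assumptions can hold together only in degenerate cases: when relevance is certain one way or the other, or when a clue is useless because it occurs with the same probability in relevant and nonrelevant documents.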

The conclusion drawn by Cooper is that the experiments performed were actually based on somewhat different assumptions, which were, in fact, consistent. In some cases where the Absolute Independence assumption was used together with a Conditional Independence assumption, it seems that the required probability rankings could have been achieved on the basis of the Conditional Independence assumption alone. This is true of the model proposed by Maron and Kuhns [1960]. In other cases, the Conditional Independence assumptions could be replaced by the single linked dependence assumption:

\[
\frac{P(a, b \mid R)}{P(a, b \mid \bar{R})} = \frac{P(a \mid R)}{P(a \mid \bar{R})} \cdot \frac{P(b \mid R)}{P(b \mid \bar{R})}.
\]

This considerably weaker assumption is consistent with the Absolute Independence assumption. This is true of the SLR model presented in Section 3.7 and of the BIR model (whose name seems to lose appropriateness in the light of these results) presented in Section 3.2.
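A quick numeric check (not from the survey; all numbers invented) confirms that linked dependence and Absolute Independence can coexist while both clues still discriminate between relevant and nonrelevant documents:

```python
# Construct joint conditionals that satisfy linked dependence exactly,
# then verify that Absolute Independence also holds for the same numbers.
P_R = 0.5
P_a_R, P_a_nR = 0.6, 0.4          # P(a|R), P(a|R-bar): a discriminates
P_b_R, P_b_nR = 0.6, 0.4          # P(b|R), P(b|R-bar): b discriminates

ratio = (P_a_R / P_a_nR) * (P_b_R / P_b_nR)          # linked dependence
P_ab = (P_a_R * P_R + P_a_nR * (1 - P_R)) * \
       (P_b_R * P_R + P_b_nR * (1 - P_R))            # = P(a) * P(b)
P_ab_nR = P_ab / (P_R * ratio + (1 - P_R))           # P(a,b|R-bar)
P_ab_R = ratio * P_ab_nR                             # P(a,b|R)

# Linked dependence holds by construction...
assert abs(P_ab_R / P_ab_nR - ratio) < 1e-12
# ...and so does Absolute Independence:
assert abs(P_R * P_ab_R + (1 - P_R) * P_ab_nR - P_ab) < 1e-12
print(P_ab_R, P_ab_nR)   # both are valid probabilities in [0, 1]
```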

6. FURTHER RESEARCH

In the late '90s, researchers have come to realize that there is a leap to be made towards a new generation of IR systems: towards systems able to cope with increasingly demanding users, whose requirements and expectations continue to outstrip the progress being made in computing, storage, and transport technology. Faster machines and better interconnectivity make possible access to enormous amounts of information. This information is increasing not only in amount, but also in complexity; for example, structured hypertexts consisting of multiple media are becoming the norm. Until recently, research in information retrieval has been confined to the academic world. Things are changing slowly. The success of the TREC initiative over the last four years (from Harman [1993] to Harman [1996]), particularly in terms of the interest shown by commercial organizations, demonstrates the wider desire to produce sophisticated IR systems. Web search engines, which have a high profile in the wider community, increasingly utilize probabilistic techniques. It can only be hoped that this increasing awareness and interest will stimulate new research.

The requirements of the next generation of IR systems include the following.

Multimedia Documents. The difficulty with multimedia document collections lies in the representation of the nontextual parts of documents such as sounds, images, and animations. Several approaches have been tried so far; they can be exemplified by the particular approach of attaching textual descriptions to nontextual parts, and the derivation of such descriptions by means of an inference process (e.g., Dunlop [1991]). Nevertheless, such techniques avoid the real issue of handling the media directly. This applies not only to probabilistic models, but to all IR models.

Interactive Retrieval. Current IR systems, even those providing forms of relevance feedback for the user, are still based upon the traditional iterative batch retrieval approach. Even relevance feedback acts upon a previous retrieval run to improve the quality of the following run [Harman 1992a,b]. We need real interactive systems, making possible a greater variety of interaction with the user than mere query formulation and relevance feedback [Croft 1987]. User profile information, analysis of browsing actions, or user modification of probabilistic weights, for example, could all be taken into consideration [Croft and Thompson 1987; Croft et al. 1988, 1989; Thompson 1989, 1990a,b]. The subjective, contextual, and dynamic nature of relevance is now being recognized and incorporated into probabilistic models [Campbell and van Rijsbergen 1996].

Integrated Text and Fact Retrieval. There has been a steady development of the kinds of information being collected and stored in databases, notably of text (unformatted data) and of “facts” (formatted, often numerical, data). Demand is growing for systems capable of dealing with all types of data in a consistent and unified manner [Fuhr 1992a, 1993; Croft et al. 1992; Harper and Walker 1992].

Imprecise Data. The use of probabilistic modeling in IR is important not only for representing the document information content, but also for representing and dealing with vagueness and imprecision in the query formulation, and with imprecision and errors in the textual documents themselves [Fuhr 1990; Turtle and Croft 1992b]. For example, the increasing use of scanners and OCR in transferring documents from paper to electronic form inevitably introduces imprecision (but see Smith and Stanfill [1988]).

7. CONCLUSIONS

The major concepts and a number of probabilistic IR models have been described. We are aware that new models are being developed even as we write; a survey is always somewhat dated. However, we believe we have covered the most important and most investigated probabilistic models of IR.

It is not easy to draw conclusions from a survey of 30 years of research. It is safe to conclude that good results have been achieved, but more research is required, since there is considerable room for improvement. Current-generation probabilistic IR systems work quite well when compared with the Boolean systems they are replacing. A novice user using natural language input with a current-generation probabilistic IR system gets, on average, better performance than an expert user with a Boolean system on the same collection. Moreover, theoretically, the probabilistic approach to IR seems inherently suitable for the representation and processing of the uncertain and imprecise information that is typical of IR. We believe that, with development, it will ultimately be capable of providing an integrated, holistic, and theoretically consistent framework for the effective retrieval of complex information.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their interesting and helpful comments.

REFERENCES

AMATI, G. AND KERPEDJIEV, S. 1992. An information retrieval logical model: Implementation and experiments. Tech. Rep. Rel 5B04892 (March), Fondazione Ugo Bordoni, Roma, Italy.

AMATI, G. AND VAN RIJSBERGEN, C. J. 1995. Probability, information and information retrieval. In Proceedings of the First International Workshop on Information Retrieval, Uncertainty and Logic (Glasgow, Sept.).

BIEBRICHER, P., FUHR, N., KNORZ, G., LUSTIG, G., AND SCHWANTNER, M. 1988. The automatic indexing system AIR/PHYS—from research to application. In Proceedings of ACM SIGIR (Grenoble, France), 333–342.

BOOKSTEIN, A. AND COOPER, W. S. 1976. A general mathematical model for information retrieval systems. Libr. Quart. 46, 2.

BOOKSTEIN, A. AND SWANSON, D. 1974. Probabilistic models for automatic indexing. J. Am. Soc. Inf. Sci. 25, 5, 312–318.

BORDOGNA, G. AND PASI, G. 1993. A fuzzy linguistic approach generalizing Boolean information retrieval: A model and its evaluation. J. Am. Soc. Inf. Sci. 44, 2, 70–82.

BRUZA, P. D. 1993. Stratified information disclosure: A synthesis between hypermedia and information retrieval. Ph.D. Thesis, Katholieke Universiteit Nijmegen, The Netherlands.

BRUZA, P. D. AND VAN DER WEIDE, T. P. 1992. Stratified hypermedia structures for information disclosure. Comput. J. 35, 3, 208–220.

CAMPBELL, I. AND VAN RIJSBERGEN, C. J. 1996. The ostensive model of developing information needs. In Proceedings of CoLIS 2 (Copenhagen, Oct.), 251–268.

CHIARAMELLA, Y. AND CHEVALLET, J. P. 1992. About retrieval models and logic. Comput. J. 35, 3, 233–242.

COOPER, W. S. 1971. A definition of relevance for information retrieval. Inf. Storage Retrieval 7, 19–37.

COOPER, W. S. 1995. Some inconsistencies and misnomers in probabilistic information retrieval. ACM Trans. Inf. Syst. 13, 1, 100–111.

COOPER, W. S., GEY, F. C., AND DABNEY, D. P. 1992. Probabilistic retrieval based on staged logistic regression. In Proceedings of ACM SIGIR (Copenhagen, June), 198–210.

COX, D. R. 1970. Analysis of Binary Data. Methuen, London.

CRESTANI, F. AND VAN RIJSBERGEN, C. J. 1995a. Information retrieval by logical imaging. J. Doc. 51, 1, 1–15.

CRESTANI, F. AND VAN RIJSBERGEN, C. J. 1995b. Probability kinematics in information retrieval. In Proceedings of ACM SIGIR (Seattle, WA, July), 291–299.

CROFT, W. B. 1987. Approaches to intelligent information retrieval. Inf. Process. Manage. 23, 4, 249–254.

CROFT, W. B. AND HARPER, D. J. 1979. Using probabilistic models of document retrieval without relevance information. J. Doc. 35, 285–295.

CROFT, W. B. AND THOMPSON, R. H. 1987. I3R: A new approach to the design of document retrieval systems. J. Am. Soc. Inf. Sci. 38, 6, 389–404.

CROFT, W. B., LUCIA, T. J., AND COHEN, P. R. 1988. Retrieving documents by plausible inference: A preliminary study. In Proceedings of ACM SIGIR (Grenoble, France, June).

CROFT, W. B., LUCIA, T. J., CRIGEAN, J., AND WILLET, P. 1989. Retrieving documents by plausible inference: An experimental study. Inf. Process. Manage. 25, 6, 599–614.

CROFT, W. B., SMITH, L. A., AND TURTLE, H. R. 1992. A loosely coupled integration of a text retrieval system and an object-oriented database system. In Proceedings of ACM SIGIR (Copenhagen, June), 223–232.

DE SILVA, W. T. AND MILIDIU, R. L. 1993. Belief function model for information retrieval. J. Am. Soc. Inf. Sci. 44, 1, 10–18.

DEMPSTER, A. P. 1968. A generalization of the Bayesian inference. J. Royal Stat. Soc. 30, 205–247.

DUNLOP, M. D. 1991. Multimedia information retrieval. Ph.D. Thesis, Department of Computing Science, University of Glasgow, Glasgow.

FUHR, N. 1989. Models for retrieval with probabilistic indexing. Inf. Process. Manage. 25, 1, 55–72.

FUHR, N. 1990. A probabilistic framework for vague queries and imprecise information in databases. In Proceedings of the International Conference on Very Large Databases (Los Altos, CA), Morgan-Kaufmann, San Mateo, CA, 696–707.

FUHR, N. 1992a. Integration of probabilistic fact and text retrieval. In Proceedings of ACM SIGIR (Copenhagen, June), 211–222.

FUHR, N. 1992b. Probabilistic models in information retrieval. Comput. J. 35, 3, 243–254.

FUHR, N. 1993. A probabilistic relational model for the integration of IR and databases. In Proceedings of ACM SIGIR (Pittsburgh, PA, June), 309–317.

FUHR, N. AND BUCKLEY, C. 1991. A probabilistic learning approach for document indexing. ACM Trans. Inf. Syst. 9, 3, 223–248.

FUHR, N. AND BUCKLEY, C. 1993. Optimizing document indexing and search term weighting based on probabilistic models. In The First Text Retrieval Conference (TREC-1), D. Harman, Ed., Special Publication 500-207, National Institute of Standards and Technology, Gaithersburg, MD, 89–100.

FUHR, N. AND KNORZ, G. 1984. Retrieval test evaluation of a rule based automatic indexing (AIR/PHYS). In Research and Development in Information Retrieval, C. J. van Rijsbergen, Ed., Cambridge University Press, Cambridge, UK, 391–408.

FUHR, N. AND PFEIFER, U. 1991. Combining model-oriented and description-oriented approaches for probabilistic indexing. In Proceedings of ACM SIGIR (Chicago, Oct.), 46–56.

FUNG, R. M., CRAWFORD, S. L., APPELBAUM, L. A., AND TONG, R. M. 1990. An architecture for probabilistic concept based information retrieval. In Proceedings of ACM SIGIR (Brussels, Belgium, Sept.), 455–467.

GOOD, I. J. 1950. Probability and the Weighing of Evidence. Charles Griffin, London.

HARMAN, D. 1992a. Relevance feedback and other query modification techniques. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds., Prentice Hall, Englewood Cliffs, NJ, Chapter 11.

HARMAN, D. 1992b. Relevance feedback revisited. In Proceedings of ACM SIGIR (Copenhagen, June), 1–10.

HARMAN, D. 1993. Overview of the first TREC conference. In Proceedings of ACM SIGIR (Pittsburgh, PA, June), 36–47.

HARMAN, D. 1996. Overview of the fifth text retrieval conference (TREC-5). In Proceedings of the TREC Conference (Gaithersburg, MD, Nov.).

HARPER, D. J. AND WALKER, A. D. M. 1992. ECLAIR: An extensible class library for information retrieval. Comput. J. 35, 3, 256–267.

HARTER, S. P. 1975. A probabilistic approach to automatic keyword indexing: Part 1. J. Am. Soc. Inf. Sci. 26, 4, 197–206.

HUIBERS, T. W. C. 1996. An axiomatic theory for information retrieval. Ph.D. Thesis, Utrecht University, The Netherlands.

JEFFREY, R. C. 1965. The Logic of Decision. McGraw-Hill, New York.

KWOK, K. L. 1990. Experiments with a component theory of probabilistic information retrieval based on single terms as document components. ACM Trans. Inf. Syst. 8, 4 (Oct.), 363–386.

LALMAS, M. 1992. A logic model of information retrieval based on situation theory. In Proceedings of the Fourteenth BCS Information Retrieval Colloquium (Lancaster, UK, Dec.).

LALMAS, M. 1997. Logical models in information retrieval: Introduction and overview. Inf. Process. Manage. 34, 1, 19–33.

MARGULIS, E. L. 1992. N-Poisson document modelling. In Proceedings of ACM SIGIR (Copenhagen, June), 177–189.

MARGULIS, E. L. 1993. Modelling documents with multiple Poisson distributions. Inf. Process. Manage. 29, 2, 215–227.


MARON, M. E. AND KUHNS, J. L. 1960. On relevance, probabilistic indexing and retrieval. J. ACM 7, 216–244.

MILLER, W. L. 1971. A probabilistic search strategy for MEDLARS. J. Doc. 27, 254–266.

MIZZARO, S. 1996. Relevance: The whole (hi)story. Tech. Rep. UDMI/12/96/RR (Dec.), Dipartimento di Matematica e Informatica, Universita' di Udine, Italy.

NEAPOLITAN, R. E. 1990. Probabilistic Reasoning in Expert Systems. Wiley, New York.

NIE, J. Y. 1988. An outline of a general model for information retrieval. In Proceedings of ACM SIGIR (Grenoble, France, June), 495–506.

NIE, J. Y. 1989. An information retrieval model based on modal logic. Inf. Process. Manage. 25, 5, 477–491.

NIE, J. Y. 1992. Towards a probabilistic modal logic for semantic-based information retrieval. In Proceedings of ACM SIGIR (Copenhagen, June), 140–151.

NIE, J. Y., LEPAGE, F., AND BRISEBOIS, M. 1996. Information retrieval as counterfactual. Comput. J. 38, 8, 643–657.

PEARL, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan-Kaufmann, San Mateo, CA.

PEARL, J. 1990. Jeffrey's rule, passage of experience and Neo-Bayesianism. In Knowledge Representation and Defeasible Reasoning, H. E. Kyburg, R. P. Loui, and G. N. Carlson, Eds., Kluwer Academic, Dordrecht, The Netherlands, 245–265.

ROBERTSON, S. E. 1977. The probability ranking principle in IR. J. Doc. 33, 4 (Dec.), 294–304.

ROBERTSON, S. E. AND SPARCK JONES, K. 1976. Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 27, 129–146.

ROBERTSON, S. E. AND WALKER, S. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of ACM SIGIR (Dublin, June), 232–241.

ROBERTSON, S. E., MARON, M. E., AND COOPER, W. S. 1982. Probability of relevance: A unification of two competing models for document retrieval. Inf. Technol. Res. Dev. 1, 1–21.

SALTON, G. 1968. Automatic Information Organization and Retrieval. McGraw-Hill, New York.

SAVOY, J. 1992. Bayesian inference networks and spreading activation in hypertext systems. Inf. Process. Manage. 28, 3, 389–406.

SCHOKEN, S. S. AND HUMMEL, R. A. 1993. On the use of the Dempster–Shafer model in information indexing and retrieval applications. Int. J. Man-Mach. Stud. 39, 1–37.

SEBASTIANI, F. 1994. A probabilistic terminological logic for modelling information retrieval. In Proceedings of ACM SIGIR (Dublin, June), 122–131.

SEMBOK, T. M. T. AND VAN RIJSBERGEN, C. J. 1993. Imaging: A relevance feedback retrieval with nearest neighbour clusters. In Proceedings of the BCS Colloquium in Information Retrieval (Glasgow, March), 91–107.

SERACEVIC, T. 1970. The concept of “relevance” in information science: A historical review. In Introduction to Information Science, T. Seracevic, Ed., R. R. Bowker, New York, Chapter 14.

SHAFER, G. 1976. A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ.

SMITH, S. AND STANFILL, C. 1988. An analysis of the effects of data corruption on text retrieval performance. Tech. Rep. (Dec.), Thinking Machines Corporation, Cambridge, MA.

SPARCK JONES, K. 1981. Information Retrieval Experiments. Butterworth, London.

STALNAKER, R. 1981. Probability and conditionals. In Ifs, W. L. Harper, R. Stalnaker, and G. Pearce, Eds., The University of Western Ontario Series in Philosophy of Science, D. Reidel, Dordrecht, Holland, 107–128.

THOMPSON, P. 1990a. A combination of expert opinion approach to probabilistic information retrieval. Part 1: The conceptual model. Inf. Process. Manage. 26, 3, 371–382.

THOMPSON, P. 1990b. A combination of expert opinion approach to probabilistic information retrieval. Part 2: Mathematical treatment of CEO model 3. Inf. Process. Manage. 26, 3, 383–394.

THOMPSON, R. H. 1989. The design and implementation of an intelligent interface for information retrieval. Tech. Rep., Computer and Information Science Department, University of Massachusetts, Amherst, MA.

TURTLE, H. R. 1990. Inference networks for document retrieval. Ph.D. Thesis, Computer and Information Science Department, University of Massachusetts, Amherst, MA.

TURTLE, H. R. AND CROFT, W. B. 1990. Inference networks for document retrieval. In Proceedings of ACM SIGIR (Brussels, Belgium, Sept.).

TURTLE, H. R. AND CROFT, W. B. 1991. Evaluation of an inference network-based retrieval model. ACM Trans. Inf. Syst. 9, 3 (July), 187–222.

TURTLE, H. R. AND CROFT, W. B. 1992a. A comparison of text retrieval models. Comput. J. 35, 3 (June), 279–290.

TURTLE, H. R. AND CROFT, W. B. 1992b. Uncertainty in information retrieval systems. Unpublished paper.

VAN RIJSBERGEN, C. J. 1977. A theoretical basis for the use of co-occurrence data in information retrieval. J. Doc. 33, 2 (June), 106–119.


VAN RIJSBERGEN, C. J. 1979. Information Retrieval (Second ed.). Butterworths, London.

VAN RIJSBERGEN, C. J. 1986. A non-classical logic for information retrieval. Comput. J. 29, 6, 481–485.

VAN RIJSBERGEN, C. J. 1989. Toward a new information logic. In Proceedings of ACM SIGIR (Cambridge, MA, June).

VAN RIJSBERGEN, C. J. 1992. Probabilistic retrieval revisited. Departmental Research Report 1992/R2 (Jan.), Computing Science Department, University of Glasgow, Glasgow.

VAN RIJSBERGEN, C. J. AND LALMAS, M. 1996. An information calculus for information retrieval. J. Am. Soc. Inf. Sci. 47, 5, 385–398.

WONG, S. K. M. AND YAO, Y. Y. 1989. A probability distribution model for information retrieval. Inf. Process. Manage. 25, 1, 39–53.

WONG, S. K. M. AND YAO, Y. Y. 1995. On modelling information retrieval with probabilistic inference. ACM Trans. Inf. Syst. 13, 1, 38–68.

ZADEH, L. A. 1987. Fuzzy Sets and Applications: Selected Papers. Wiley, New York.

Received December 1996; revised October 1997; accepted October 1997
