
Page 1:

Less is More: Probabilistic Model for Retrieving Fewer Relevant Documents

Harr Chen and David R. Karger, MIT CSAIL, SIGIR 2006

4/30/2007

Page 2:

Abstract

• Probability Ranking Principle (PRP)
  – Rank documents in decreasing order of probability of relevance.
• Propose a greedy algorithm that approximately optimizes the following objectives:
  – %no metric: the percentage of queries for which no relevant documents are retrieved.
  – The diversity of results.

Page 3:

Introduction

• Probability Ranking Principle
  – Rule of thumb: “optimal”.
• TREC robust track
  – %no metric
  – Question answering and finding a homepage.
• Diversity
  – For example, “Trojan horse”.
  – A PRP-based method may choose one “most likely” interpretation.
• Greedy algorithm
  – Fill each position in the ranking by assuming that all previous documents in the ranking are not relevant.

Page 4:

Introduction (Cont.)

• Other measures
  – Search length (SL)
  – Reciprocal rank (RR)
  – Instance recall: the number of different subtopics in a given result set.
• Retrieving for Diversity
  – Diversity automatically arises as a consequence of the objective function.

Page 5:

Related Work

• Algorithm
  – Zhai and Lafferty: a risk minimization framework.
  – Bookstein: a sequential learning retrieval system.
• Diversity
  – Zhai et al.: novelty and redundancy.
  – Clustering is an approach to quickly cover a diverse range of query interpretations.

Page 6:

Evaluation Metrics

• MSL (mean search length)
• MRR (mean reciprocal rank)
• %no
  – k-call at n: 1 if at least k of the top n docs returned by the system for the given query are deemed relevant; otherwise 0.
  – Mean 1-call: one minus the %no metric.
  – n-call at n: perfect precision.
• Instance recall at rank n (a short computational sketch of these metrics follows)
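These metrics are easy to compute from a ranked result list and binary relevance judgments. A minimal sketch in Python (the function names and the 0/1 judgment-list representation are mine, not from the slides):

    def k_call_at_n(judgments, k, n):
        # k-call at n: 1 if at least k of the top n results are relevant, else 0.
        # judgments is a list of 0/1 relevance labels in rank order.
        return 1 if sum(judgments[:n]) >= k else 0

    def search_length(judgments):
        # Search length: number of irrelevant documents seen before the first relevant one.
        for rank, rel in enumerate(judgments):
            if rel:
                return rank
        return len(judgments)  # no relevant document retrieved

    def reciprocal_rank(judgments):
        # Reciprocal rank: 1 / (rank of the first relevant document), or 0 if none.
        for rank, rel in enumerate(judgments, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    # Example: relevant documents at ranks 3 and 7 of a 10-result list.
    judgments = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    print(k_call_at_n(judgments, k=1, n=10))   # 1 (at least one relevant doc)
    print(k_call_at_n(judgments, k=10, n=10))  # 0 (not perfect precision)
    print(search_length(judgments))            # 2
    print(reciprocal_rank(judgments))          # 0.3333...

Averaging these per-query values over a topic set gives MSL, MRR, and mean k-call; %no is one minus mean 1-call.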

Page 7:

Bayesian Retrieval

• Standard Bayesian information retrieval
  – The documents in a corpus should be ranked by Pr[r|d].
  – By a monotonic transformation, this ranking is equivalent to ranking by a likelihood ratio (see the reconstruction below).
  – The focus is on the objective function, so a Naïve Bayes framework with multinomial models (θi) is used as the family of distributions.
  – Determine the parameters (training).
  – Dirichlet prior: a prior probability distribution over the parameters (θi).
  – Estimate the probability of a document under the parameters of the relevant distribution (i.e., Pr[d|r]).
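The equations on this slide did not survive the transcript. A reconstruction of the standard Bayesian ranking criterion under the Naïve Bayes multinomial/Dirichlet assumptions stated above (the notation is mine; r̄ denotes non-relevance): ranking by Pr[r|d] is, assuming document-independent priors Pr[r] and Pr[r̄], rank-equivalent under a monotonic transformation to ranking by a likelihood ratio, and each document likelihood integrates out the multinomial parameters under their Dirichlet prior.

    Pr[r|d] = Pr[d|r] Pr[r] / Pr[d]

    rank-equivalent to:  Pr[d|r] / Pr[d|r̄]

    with  Pr[d|r] = ∫ Pr[d|θ] p(θ | Dirichlet prior, relevant training data) dθ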

Page 8:

Objective Function

• Consider optimizing for the k-call at n metric.
  – k=1: the probability that at least one of the first n relevance variables is true.
  – For arbitrary k: the probability that at least k docs are relevant (formulas reconstructed below).
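The slide's formulas are missing from the transcript; a reconstruction in the notation used on later slides (r_i is the relevance variable of the document placed at position i, ¬r_i its negation; this is my reconstruction, not a verbatim copy):

    1-call at n objective:  Pr[r_0 or r_1 or ... or r_(n-1) | d_0, ..., d_(n-1)]
                            = 1 - Pr[¬r_0, ¬r_1, ..., ¬r_(n-1) | d_0, ..., d_(n-1)]

    k-call at n objective:  Pr[r_0 + r_1 + ... + r_(n-1) >= k | d_0, ..., d_(n-1)]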

Page 9:

Optimization Methods

• NP-hard problem
  – It is intractable to perfectly optimize the k-call at n objective function over a corpus of m docs, because the number of candidate n-document result sets grows combinatorially in m.
• Greedy algorithm (approximately optimizes it)
  – Successively select each result of the result set; a sketch follows this list.
  1. Select the first result by applying the conventional PRP.
  2. For the ith result, hold results 1 through i-1 fixed at their already selected values, and consider every remaining corpus document as a candidate for position i.
  3. Pick the document with the highest k-call score as the ith result.
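A minimal sketch of the greedy loop just described, written for the k=1 case (greedy_rank and score_given_irrelevant are hypothetical names; the scoring callable stands in for the paper's Bayesian estimate of Pr[r_i | ¬r_0, ..., ¬r_(i-1), d_0, ..., d_i] and is not spelled out on the slides):

    def greedy_rank(corpus, score_given_irrelevant, n):
        # Greedily build a ranking of n documents for the 1-call objective.
        # score_given_irrelevant(doc, selected) should estimate the probability that
        # doc is relevant given doc, the already selected documents, and the
        # assumption that all selected documents are irrelevant. With an empty
        # selection it reduces to Pr[r|d], so the first pick matches the
        # conventional PRP choice (step 1 above).
        selected = []
        remaining = list(corpus)
        for _ in range(n):
            # Condition on every previously chosen document being irrelevant,
            # then pick the remaining document most likely to be relevant.
            best = max(remaining, key=lambda d: score_given_irrelevant(d, selected))
            selected.append(best)
            remaining.remove(best)
        return selected

For k=n the same loop applies with a score that instead conditions on all previously selected documents being relevant.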

Page 10:

Applying the Greedy Approach

• k=1
  – First, choose the doc d0 maximizing Pr[r0|d0].
  – Then choose d1 maximizing the quantity below (reconstructed after this list).
  – Choose d2 by maximizing the analogous quantity; in general, select the optimal di that maximizes the corresponding conditional probability.
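The formula images are missing from the transcript; a reconstruction of the k=1 selection criteria implied by the description above (¬r_j means the document at position j is not relevant; my reconstruction, not a verbatim copy of the slide):

    d_1 maximizes:  Pr[r_1 | ¬r_0, d_0, d_1]
    d_2 maximizes:  Pr[r_2 | ¬r_0, ¬r_1, d_0, d_1, d_2]
    in general, d_i maximizes:  Pr[r_i | ¬r_0, ..., ¬r_(i-1), d_0, ..., d_i]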

Page 11:

Applying the Greedy Approach (Cont.)

• k=n (perfect precision)
  – Select the ith document according to the criterion below (reconstructed after this list).
• 1<k<n
  – The objective is to maximize the probability of having at least k relevant docs in the top n.
  – This paper focuses on the k=1 and k=n cases.
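The k=n selection criterion was also an image; a reconstruction consistent with the k=1 case, now conditioning on all previously selected documents being relevant:

    d_i maximizes:  Pr[r_i | r_0, ..., r_(i-1), d_0, ..., d_i]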

Page 12:

Optimizing for Other Metrics

• Optimizing 1-call
  – Choose greedily, conditioned on no previous document being relevant.
  – This is equivalent to minimizing expected search length and maximizing expected reciprocal rank.
  – It also optimizes the instance recall metric, which measures the number of distinct subtopics retrieved.
    • If a query has t subtopics, then instance recall is given by the formula below.
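The instance recall formula is missing from the transcript; the standard definition, which is presumably what the slide showed:

    instance recall at n = (number of distinct subtopics covered by the top n results) / t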

Page 13:

Google Examples

• Two ambiguous queries: “Trojan horse” and “virus”
  – Used the titles, summaries, and snippets of Google’s results to form a corpus of 1,000 docs for each query.

Page 14:

Experiments

• Methods
  – 1-greedy, 10-greedy, and conventional PRP
• Datasets
  – Ad hoc topics from TREC-1, TREC-2, and TREC-3, used to set the weight parameters of the model appropriately.
  – TREC 2004 robust track
  – TREC-6, 7, and 8 interactive tracks
  – TREC-4 and TREC-6 ad hoc tracks

Page 15:

Tuning the Weights

• Key weights
  – For the proposed model, the key weights are the strengths of the relevant-distribution and irrelevant-distribution priors relative to the strength of the docs.
• TRECs 1, 2, and 3
  – Consist of about 724,000 docs and 150 topics (topics 51-200).
  – Used for tuning the weights.

Page 16:

Robust Track Experiments

• TREC 2004 robust track
  – 249 topics in total, about 528,000 docs.
  – 50 topics were selected by TREC as being “difficult” queries.

Page 17:

Instance Retrieval Experiments

• TREC-6, 7, and 8 interactive tracks
  – Test diversity performance.
  – 20 topics in total, with between 7 and 56 aspects each, and about 210,000 docs.
  – Zhai et al.’s LM approach is better for aspect retrieval.

Page 18:

Multiple Annotator Experiments

• TREC-4 and TREC-6
  – Multiple independent annotators were asked to make relevance judgments for the same topics over the same corpus.
  – TREC-4 had three annotators; TREC-6 had two.

Page 19:

Query Analysis

• A specific topic: topic 100
  – The description is:

Page 20:

Conclusions and Future Work

• Conclusions
  – Identified that the PRP is not always optimal, and gave an approach to directly optimize other desired objectives.
  – The approach is algorithmically feasible.
• Future work
  – Other objective functions
  – More sophisticated techniques, such as local search algorithms
  – The likelihood of relevance of collections of docs
    • Two-Poisson model
    • Language model
