
Page 1: Information Retrieval:  Problem Formulation & Evaluation

Information Retrieval:

Problem Formulation & Evaluation

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Page 2: Information Retrieval:  Problem Formulation & Evaluation

Research Process

• Identification of a research question/topic

• Propose a possible solution/answer (formulate a hypothesis)

• Implement the solution

• Design experiments (measures, data, etc.)

• Test the solution/hypothesis

• Draw conclusions

• Repeat the cycle of question-answering or hypothesis-formulation-and-testing if necessary

Today’s lecture


Page 3: Information Retrieval:  Problem Formulation & Evaluation

Part 1: IR Problem Formulation

Page 4: Information Retrieval:  Problem Formulation & Evaluation

Basic Formulation of TR (traditional)

• Vocabulary V = {w1, w2, …, wN} of a language

• Query q = q1,…,qm, where qi ∈ V

• Document di = di1,…,dimi, where dij ∈ V

• Collection C = {d1, …, dk}

• Set of relevant documents R(q) ⊆ C

– Generally unknown and user-dependent

– Query is a “hint” on which doc is in R(q)

• Task = compute R’(q), an “approximate R(q)” (i.e., decide which documents to return to a user)
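To make the notation concrete, here is a minimal sketch in Python; the tiny vocabulary, documents, and the overlap-based R'(q) are illustrative assumptions, not anything defined on the slide.

# Illustrative encoding of the traditional TR formulation (toy data, not from the slides)
V = {"information", "retrieval", "evaluation", "ranking", "precision"}   # vocabulary V
q = ["information", "retrieval"]                   # query q = q1,...,qm, each qi in V
C = {
    "d1": ["ranking", "information", "retrieval"],  # document di = di1,...,dimi, dij in V
    "d2": ["precision", "evaluation"],
}

def r_prime(query, collection):
    """Toy approximation R'(q): documents sharing at least one term with the query.
    The true R(q) is unknown and user-dependent; the query is only a hint."""
    return {doc_id for doc_id, doc in collection.items() if set(query) & set(doc)}

print(r_prime(q, C))   # {'d1'}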


Page 5: Information Retrieval:  Problem Formulation & Evaluation

Computing R(q)

• Strategy 1: Document selection

– R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier

– System must decide if a doc is relevant or not (“absolute relevance”)

• Strategy 2: Document ranking

– R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) is a relevance measure function and θ is a cutoff

– System must decide if one doc is more likely to be relevant than another (“relative relevance”)
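A rough sketch of the two strategies, assuming a stand-in scoring function f(d,q) (simple query-term overlap here, purely for illustration):

def f(d, q):
    """Stand-in relevance measure f(d,q): number of distinct query terms in d."""
    return len(set(q) & set(d))

def document_selection(collection, q):
    """Strategy 1: a binary decision per document ("absolute relevance").
    The thresholded f plays the role of the {0,1} indicator."""
    return {doc_id for doc_id, d in collection.items() if f(d, q) > 0}

def document_ranking(collection, q, theta=0.0):
    """Strategy 2: rank by f(d,q) and keep documents above the cutoff theta
    ("relative relevance"); in practice the user decides where to stop reading."""
    scored = sorted(collection.items(), key=lambda item: f(item[1], q), reverse=True)
    return [doc_id for doc_id, d in scored if f(d, q) > theta]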


Page 6: Information Retrieval:  Problem Formulation & Evaluation

Document Selection vs. Ranking

[Figure: the same collection of + (relevant) and – (non-relevant) documents handled two ways. Doc selection with f(d,q) ∈ {0,1} splits the collection into an accepted set R’(q) and a rejected set, compared against the true R(q). Doc ranking with a real-valued f(d,q) produces a scored list, e.g. 0.98 d1 +, 0.95 d2 +, 0.83 d3 –, 0.80 d4 +, 0.76 d5 –, 0.56 d6 –, 0.34 d7 –, 0.21 d8 +, 0.21 d9 –, and the user sets the threshold that determines R’(q).]

Page 7: Information Retrieval:  Problem Formulation & Evaluation

Problems of Doc Selection/Boolean model [Cooper 88]

• The classifier is unlikely to be accurate

– “Over-constrained” query (terms are too specific): no relevant documents found

– “Under-constrained” query (terms are too general): over delivery

– It is hard to find the right position between these two extremes (hard for users to specify constraints)

• Even if it is accurate, all relevant documents are not equally relevant; prioritization is needed since a user can only examine one document at a time


Page 8: Information Retrieval:  Problem Formulation & Evaluation

Ranking is often preferred

• A user can stop browsing anywhere, so the boundary is controlled by the user

– High recall users would view more items

– High precision users would view only a few

• Theoretical justification: Probability Ranking Principle [Robertson 77]


Page 9: Information Retrieval:  Problem Formulation & Evaluation

Probability Ranking Principle [Robertson 77]

• Seek a more fundamental justification

– Why is ranking based on probability of relevance reasonable?

– Is there a better way of ranking documents?

– What is the optimal way of ranking documents?

• Theoretical justification for ranking (Probability Ranking Principle): returning a ranked list of documents in descending order of probability that a document is relevant to the query is the optimal strategy under the following two assumptions (do they hold?):

– The utility of a document (to a user) is independent of the utility of any other document

– A user would browse the results sequentially


Page 10: Information Retrieval:  Problem Formulation & Evaluation

Two Justifications of PRP

• Optimization of traditional retrieval effectiveness measures

– Given an expected level of recall, ranking based on PRP maximizes the precision

– Given a fixed rank cutoff, ranking based on PRP maximizes precision and recall

• Optimal decision making

– Regardless of the tradeoffs (e.g., favoring high precision vs. high recall), ranking based on PRP optimizes the expected utility of a binary (independent) retrieval decision (i.e., to retrieve or not to retrieve a document)

• Intuition: if a user sequentially examines one doc at a time, we’d like the user to see the very best ones first


Page 11: Information Retrieval:  Problem Formulation & Evaluation

According to the PRP, all we need is

“A relevance measure function f”

which satisfies

For all q, d1, d2, f(q,d1) > f(q,d2) iff p(Rel|q,d1) > p(Rel|q,d2)

Most existing research on IR models so far has fallen into this line of thinking…. (Limitations?)

Page 12: Information Retrieval:  Problem Formulation & Evaluation

Modeling Relevance: Roadmap for Retrieval Models

[Figure: a taxonomy of retrieval models, organized by how relevance is modeled]

• Relevance as similarity between representations (Rep(q), Rep(d)), with different representations and similarity functions:
– Vector space model (Salton et al., 75)
– Prob. distr. model (Wong & Yao, 89)

• Relevance as probability of relevance P(r=1|q,d), r ∈ {0,1}:
– Regression model (Fuhr 89)
– Generative models, via doc generation: classical prob. model (Robertson & Sparck Jones, 76)
– Generative models, via query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)

• Relevance as probabilistic inference, P(d→q) or P(q→d), with different inference systems:
– Prob. concept space model (Wong & Yao, 95)
– Inference network model (Turtle & Croft, 91)

• Also on the map: Div. from Randomness (Amati & Rijsbergen 02), Learn. to Rank (Joachims 02, Burges et al. 05), relevance constraints [Fang et al. 04]

Page 13: Information Retrieval:  Problem Formulation & Evaluation

Part 2: IR Evaluation

Page 14: Information Retrieval:  Problem Formulation & Evaluation

Evaluation: Two Different Reasons

• Reason 1: So that we can assess how useful an IR system/technology would be (for an application)

– Measures should reflect the utility to users in a real application

– Usually done through user studies (interactive IR evaluation)

• Reason 2: So that we can compare different systems and methods (to advance the state of the art)

– Measures only need to be correlated with the utility to actual users, thus don’t have to accurately reflect the exact utility to users

– Usually done through test collections (test set IR evaluation)


Page 15: Information Retrieval:  Problem Formulation & Evaluation

What to Measure?

• Effectiveness/Accuracy: how accurate are the search results?

– Measuring a system’s ability to rank relevant documents above non-relevant ones

• Efficiency: how quickly can a user get the results? How much computing resources are needed to answer a query?

– Measuring space and time overhead

• Usability: How useful is the system for real user tasks?

– Doing user studies


Page 16: Information Retrieval:  Problem Formulation & Evaluation

The Cranfield Evaluation Methodology

• A methodology for laboratory testing of system components, developed in the 1960s

• Idea: Build reusable test collections & define measures

– A sample collection of documents (simulate real document collection)

– A sample set of queries/topics (simulate user queries)

– Relevance judgments (ideally made by users who formulated the queries), which define the ideal ranked list

– Measures to quantify how well a system’s result matches the ideal ranked list

• A test collection can then be reused many times to compare different systems

• This methodology is general and applicable for evaluating any empirical task


Page 17: Information Retrieval:  Problem Formulation & Evaluation

Test Collection Evaluation

[Figure: test collection evaluation setup]

• Document collection: D1, D2, D3, …, D48, …

• Queries: Q1, Q2, Q3, …, Q50, …

• Relevance judgments: Q1 D1 +, Q1 D2 +, Q1 D3 –, Q1 D4 –, Q1 D5 +, …; Q2 D1 –, Q2 D2 +, Q2 D3 +, Q2 D4 –, …; Q50 D1 –, Q50 D2 –, Q50 D3 +, …

• Query = Q1: System A returns D2 +, D1 +, D4 –, D5 +  →  Precision = 3/4, Recall = 3/3

• Query = Q1: System B returns D1 +, D4 –, D3 –, D5 +  →  Precision = 2/4, Recall = 2/3
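The same computation in code, reusing the Q1 judgments and the two runs shown in the figure (a sketch; binary judgments only):

# Relevance judgments for Q1 from the figure: True = relevant (+), False = non-relevant (-)
qrels_q1 = {"D1": True, "D2": True, "D3": False, "D4": False, "D5": True}

run_a = ["D2", "D1", "D4", "D5"]   # System A's results for Q1
run_b = ["D1", "D4", "D3", "D5"]   # System B's results for Q1

def set_precision_recall(retrieved, qrels):
    """Precision and recall of a retrieved set against binary judgments."""
    relevant_retrieved = sum(1 for d in retrieved if qrels.get(d, False))
    total_relevant = sum(qrels.values())
    return relevant_retrieved / len(retrieved), relevant_retrieved / total_relevant

print(set_precision_recall(run_a, qrels_q1))   # (0.75, 1.0)     i.e., 3/4 and 3/3
print(set_precision_recall(run_b, qrels_q1))   # (0.5, 0.666...) i.e., 2/4 and 2/3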

Page 18: Information Retrieval:  Problem Formulation & Evaluation

Measures for evaluating a set of retrieved documents

Doc / Action     Retrieved                    Not retrieved
Relevant         a (relevant retrieved)       b (relevant rejected)
Not relevant     c (irrelevant retrieved)     d (irrelevant rejected)

Recall = a / (a + b)

Precision = a / (a + c)

Ideal results: Precision=Recall=1.0

In reality, high recall tends to be associated with low precision (why?)

Page 19: Information Retrieval:  Problem Formulation & Evaluation


How to measure a ranking?

• Compute the precision at every recall point

• Plot a precision-recall (PR) curve

[Figure: two precision-recall curves, with precision on the y-axis and recall on the x-axis. Which is better?]

Page 20: Information Retrieval:  Problem Formulation & Evaluation

Computing Precision-Recall Curve

Ranked list (with relevance): D1 +, D2 +, D3 –, D4 –, D5 +, D6 –, D7 –, D8 +, D9 –, D10 –

Total number of relevant documents in collection: 10

Precision   Recall
1/1         1/10
2/2         2/10
2/3         2/10
3/5         3/10
4/8         4/10
…           …
?           10/10
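A small sketch of computing these precision-recall points from a ranked list with binary judgments; the list and the total of 10 relevant documents match the example above.

def precision_recall_points(ranking, total_relevant):
    """ranking: booleans in ranked order (True = relevant).
    Returns (precision, recall) measured at each rank."""
    points, relevant_so_far = [], 0
    for rank, is_relevant in enumerate(ranking, start=1):
        relevant_so_far += is_relevant
        points.append((relevant_so_far / rank, relevant_so_far / total_relevant))
    return points

# D1 + D2 + D3 - D4 - D5 + D6 - D7 - D8 + D9 - D10 -
ranking = [True, True, False, False, True, False, False, True, False, False]
for precision, recall in precision_recall_points(ranking, total_relevant=10):
    print(f"precision = {precision:.2f}, recall = {recall:.2f}")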

Page 21: Information Retrieval:  Problem Formulation & Evaluation

How to summarize a ranking?

Ranked list (with relevance): D1 +, D2 +, D3 –, D4 –, D5 +, D6 –, D7 –, D8 +, D9 –, D10 –

Total number of relevant documents in collection: 10

Precision   Recall
1/1         1/10
2/2         2/10
2/3         2/10
3/5         3/10
4/8         4/10
…           …
0           10/10   (relevant documents that are never retrieved contribute precision 0)

Average Precision = (1/1 + 2/2 + 3/5 + 4/8 + 0 + 0 + 0 + 0 + 0 + 0) / 10 = ?

Page 22: Information Retrieval:  Problem Formulation & Evaluation

Summarize a Ranking: MAP

• Given that n docs are retrieved

– Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs

– E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.

– If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero

• Compute the average over all the relevant documents

– Average precision = (p(1)+…+p(k))/k

• This gives us an average precision, which captures both precision and recall and is sensitive to the rank of each relevant document

• Mean Average Precision (MAP)

– MAP = arithmetic mean of average precision over a set of topics

– gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)

– Which one should be used?
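A sketch of these definitions in code; the last line reproduces the running example (10 relevant documents in the collection, 4 of them retrieved), and the small epsilon in gMAP is just a common way to keep zero-AP topics from collapsing the geometric mean.

from math import prod   # Python 3.8+

def average_precision(ranking, total_relevant):
    """ranking: booleans in ranked order. Relevant docs never retrieved contribute 0."""
    precisions, relevant_so_far = [], 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            relevant_so_far += 1
            precisions.append(relevant_so_far / rank)   # p(1), ..., p(k)
    return sum(precisions) / total_relevant             # unretrieved relevant docs add 0

def map_score(ap_values):
    """MAP: arithmetic mean of average precision over a set of topics."""
    return sum(ap_values) / len(ap_values)

def gmap_score(ap_values, eps=1e-6):
    """gMAP: geometric mean of average precision over a set of topics."""
    return prod(max(ap, eps) for ap in ap_values) ** (1 / len(ap_values))

ranking = [True, True, False, False, True, False, False, True, False, False]
print(average_precision(ranking, total_relevant=10))   # (1/1 + 2/2 + 3/5 + 4/8)/10 = 0.31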

Page 23: Information Retrieval:  Problem Formulation & Evaluation

What if we have multi-level relevance judgments?

Relevance level: r = 1 (non-relevant), 2 (marginally relevant), 3 (very relevant)

Ranked list (with gain = relevance level): D1 3, D2 2, D3 1, D4 1, D5 3, D6 1, D7 1, D8 2, D9 1, D10 1

Cumulative Gain: 3, 3+2, 3+2+1, 3+2+1+1, …

Discounted Cumulative Gain: 3, 3+2/log 2, 3+2/log 2+1/log 3, …

DCG@10 = 3 + 2/log 2 + 1/log 3 + … + 1/log 10

IdealDCG@10 = 3 + 3/log 2 + 3/log 3 + … + 3/log 9 + 2/log 10
(Assume there are 9 documents rated “3” in total in the collection)

Normalized DCG = ?

Page 24: Information Retrieval:  Problem Formulation & Evaluation

Summarize a Ranking: NDCG

• What if relevance judgments are on a scale of [1, r] with r > 2?

• Cumulative Gain (CG) at rank n

– Let the ratings of the n documents be r1, r2, …rn (in ranked order)

– CG = r1+r2+…rn

• Discounted Cumulative Gain (DCG) at rank n

– DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)

– We may use any base for the logarithm, e.g., base = b

– For rank positions up to b, do not discount

• Normalized Discounted Cumulative Gain (NDCG) at rank n

– Normalize DCG at rank n by the DCG value at rank n of the ideal ranking

– The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc
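A sketch of DCG/NDCG with a base-2 logarithm (so positions 1 and 2 are effectively not discounted); the example reuses the gains from the earlier slide, and the collection counts beyond the nine documents rated 3 are made-up for illustration.

from math import log2

def dcg(gains):
    """DCG = r1 + r2/log2(2) + r3/log2(3) + ... + rn/log2(n)."""
    return sum(g if rank == 1 else g / log2(rank) for rank, g in enumerate(gains, start=1))

def ndcg(gains, collection_gains, k):
    """Normalize DCG at rank k by the DCG of the ideal ranking at rank k."""
    ideal = sorted(collection_gains, reverse=True)[:k]   # highest relevance levels first
    return dcg(gains[:k]) / dcg(ideal)

ranked_gains = [3, 2, 1, 1, 3, 1, 1, 2, 1, 1]      # gains from the earlier example
collection_gains = [3] * 9 + [2] * 5 + [1] * 20    # assumes 9 docs rated 3; the rest is illustrative
print(ndcg(ranked_gains, collection_gains, k=10))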

Page 25: Information Retrieval:  Problem Formulation & Evaluation

Other Measures

• Precision at k documents (e.g., prec@10doc):

– easier to interpret than MAP (why?)

– also called breakeven precision when k is the same as the number of relevant documents

• Mean Reciprocal Rank (MRR):

– Same as MAP when there’s only 1 relevant document

– Reciprocal Rank = 1/Rank-of-the-relevant-doc

• F-Measure (F1): harmonic mean of precision and recall

Fβ = (β² + 1) · P · R / (β² · P + R)

F1 = 2 · P · R / (P + R), i.e., 1/F1 = (1/2) · (1/P + 1/R)

P: precision, R: recall, β: a parameter (often set to 1)
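Sketches of these measures for binary relevance (P and R below are precision and recall however you computed them; the ranking is given as booleans in ranked order):

def precision_at_k(ranking, k):
    """Fraction of the top-k results that are relevant."""
    return sum(ranking[:k]) / k

def reciprocal_rank(ranking):
    """1 / rank of the first relevant document (0 if none retrieved); MRR averages this over queries."""
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            return 1 / rank
    return 0.0

def f_measure(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives the harmonic mean F1."""
    if p == 0 and r == 0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

ranking = [False, True, False, True]
print(precision_at_k(ranking, 4), reciprocal_rank(ranking), f_measure(0.5, 0.5))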

Page 26: Information Retrieval:  Problem Formulation & Evaluation

Challenges in creating early test collections

• Challenges in obtaining documents:

– Salton had students manually transcribe Time magazine articles

– Not a problem now!

• Challenges in distributing a collection

– TREC started when CD-ROMs became available

– Not a problem now!

• Challenge of scale – limited by qrels (relevance judgments)

– The idea of “pooling” (Sparck Jones & Rijsbergen 75)


Page 27: Information Retrieval:  Problem Formulation & Evaluation

Larger collections created in 1980s

• INSPEC: 12,684 docs, 77 queries, 1981. Title, authors, source, abstract and indexing information from Sep-Dec 1979 issues of Computer and Control Abstracts.

• CACM: 3,204 docs, 64 queries, 1983, 2.2 MB. Title, abstract, author, keywords and bibliographic information from articles of Communications of the ACM, 1958-1979.

• CISI: 1,460 docs, 112 queries, 1983, 2.2 MB. Author, title/abstract, and co-citation data for the 1460 most highly cited articles and manuscripts in information science, 1969-1977.

• LISA: 6,004 docs, 35 queries, 1983, 3.4 MB. Taken from the Library and Information Science Abstracts database.

Commercial systems at the time routinely supported searching over millions of documents, creating pressure on researchers to use larger collections for evaluation.

Page 28: Information Retrieval:  Problem Formulation & Evaluation

The Ideal Test Collection Report [Sparck Jones & Rijsbergen 75]

• Introduced the idea of pooling

– Have assessors judge only a pool of top-ranked documents returned by various retrieval systems

• Other recommendations (the vision was later implemented in TREC)

1. that an ideal test collection be set up to facilitate and promote research;
2. that the collection be of sufficient size to constitute an adequate test bed for experiments relevant to modern IR systems…
3. that the collection(s) be set up by a special purpose project carried out by an experienced worker, called the Builder;
4. that the collection(s) be maintained in a well-designed and documented machine form and distributed to users, by a Curator;
5. that the curating (sic) project be encouraged to, promote research via the ideal collection(s), and also via the common use of other collection(s) acquired from independent projects.”

Page 29: Information Retrieval:  Problem Formulation & Evaluation

TREC (Text REtrieval Conference)

• 1990: DARPA funded NIST to build a large test collection

• 1991: NIST proposed to distribute the data set through TREC (leader: Donna Harman)

• Nov. 1992: First TREC meeting

• Goals of TREC:

– create test collections for a set of retrieval tasks;

– promote as widely as possible research in those tasks;

– organize a conference for participating researchers to meet and disseminate their research work using TREC collections.


Page 30: Information Retrieval:  Problem Formulation & Evaluation

The “TREC Vision” (mass collaboration for creating a pool)


“Harman and her colleagues appear to be the first to realize that if the documents and topics of a collection were distributed for little or no cost, a large number of groups would be willing to load that data into their search systems and submit runs back to TREC to form a pool, all for no costs to TREC. TREC would use assessors to judge the pool. The effectiveness of each run would then be measured and reported back to the groups. Finally, TREC could hold a conference where an overall ranking of runs would be published and participating groups would meet to present work and interact. It was hoped that a slight competitive element would emerge between groups to produce the best possible runs for the pool.” (Sanderson 10)

Page 31: Information Retrieval:  Problem Formulation & Evaluation

The TREC Ad Hoc Retrieval Task & Pooling

• Simulate an information analyst (high recall)

• Multi-field topic description

• News documents + Government documents

• Relevance criteria: “a document is judged relevant if any piece of it is relevant (regardless of how small the piece is in relation to the rest of the document)”

• Each run submitted returns 1000 documents for evaluation with various measures

• Top 100 documents were taken to form a pool

• All the documents in the pool were judged

• The unjudged documents are often assumed to be non-relevant (problem?)
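A sketch of how a judging pool is formed from the submitted runs (pool depth of 100 as on this slide; the run contents are whatever each participating system returned):

def build_pool(runs, depth=100):
    """Judging pool for one topic: the union of the top-`depth` documents of every run."""
    pool = set()
    for ranked_doc_ids in runs:          # each run is a list of doc ids in ranked order
        pool.update(ranked_doc_ids[:depth])
    return pool

# Everything in the pool gets judged; documents outside the pool are typically
# assumed non-relevant, which is exactly the "problem?" raised in the last bullet.
print(build_pool([["d3", "d7", "d1"], ["d7", "d9", "d2"]], depth=2))   # {'d3', 'd7', 'd9'}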


Page 32: Information Retrieval:  Problem Formulation & Evaluation

An example TREC topic


<top>

<num> Number: 200

<title> Topic: Impact of foreign textile imports on U.S. textile industry

<desc> Description: Document must report on how the importation of foreign textiles or textile products has influenced or impacted on the U.S. textile industry.

<narr> Narrative: The impact can be positive or negative or qualitative. It may include the expansion or shrinkage of markets or manufacturing volume or an influence on the methods or strategies of the U.S. textile industry. “Textile industry” includes the production or purchase of raw materials; basic processing techniques such as dyeing, spinning, knitting, or weaving; the manufacture and marketing of finished goods; and also research in the textile field.

</top>
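A minimal sketch of extracting the fields from a topic in this SGML-like format; it assumes exactly the layout shown above and is not a general TREC topic parser.

import re

def parse_topic(text):
    """Pull num/title/desc/narr out of one <top>...</top> block formatted as above."""
    patterns = {
        "num":   r"<num>\s*Number:\s*(\d+)",
        "title": r"<title>\s*Topic:\s*(.+?)\s*(?=<desc>)",
        "desc":  r"<desc>\s*Description:\s*(.+?)\s*(?=<narr>)",
        "narr":  r"<narr>\s*Narrative:\s*(.+?)\s*(?=</top>)",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.DOTALL)
        if match:
            fields[name] = match.group(1).strip()
    return fields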

Page 33: Information Retrieval:  Problem Formulation & Evaluation

Typical TREC Evaluation Result

[Figure: sample evaluation output showing a precision-recall curve, Mean Avg. Precision (MAP), breakeven precision (precision when precision = recall), Precision@10docs (about 5.5 of the top 10 docs are relevant), and Recall = 3212/4728 (out of 4728 relevant docs, the system returned 3212).]

Worked example: ranked list D1 +, D2 +, D3 –, D4 –, D5 +, D6 –; total # of relevant docs = 4; the system returns 6 docs.

Average Prec = (1/1 + 2/2 + 3/5 + 0)/4. The denominator is 4, not 3 (why?)

Page 34: Information Retrieval:  Problem Formulation & Evaluation


What Query Averaging Hides

[Figure: precision-recall curves for individual queries (precision vs. recall, both on a 0 to 1 scale), showing the large per-query variation that the averaged curve hides.]

Slide from Doug Oard’s presentation, originally from Ellen Voorhees’ presentation

Page 35: Information Retrieval:  Problem Formulation & Evaluation

Statistical Significance Tests

• How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?

Experiment 1
Query     System A   System B
1         0.20       0.40
2         0.21       0.41
3         0.22       0.42
4         0.19       0.39
5         0.17       0.37
6         0.20       0.40
7         0.21       0.41
Average   0.20       0.40

Experiment 2
Query     System A   System B
1         0.02       0.76
2         0.39       0.07
3         0.16       0.37
4         0.58       0.21
5         0.04       0.02
6         0.09       0.91
7         0.12       0.46
Average   0.20       0.40

Slide from Doug Oard

Page 36: Information Retrieval:  Problem Formulation & Evaluation

Statistical Significance Testing

Query     System A   System B   Sign Test   Wilcoxon
1         0.02       0.76       +           +0.74
2         0.39       0.07       -           -0.32
3         0.16       0.37       +           +0.21
4         0.58       0.21       -           -0.37
5         0.04       0.02       -           -0.02
6         0.09       0.91       +           +0.82
7         0.12       0.46       -           -0.38
Average   0.20       0.40

Sign Test: p = 1.0    Wilcoxon: p = 0.9375

[Figure: distribution of score differences, with 0 marked and 95% of outcomes indicated.]

Try some out at: http://www.fon.hum.uva.nl/Service/CGI-Inline/HTML/Statistics.html

Slide from Doug Oard
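For reference, a sketch of running both tests in Python, assuming a reasonably recent scipy (binomtest requires scipy 1.7+); the sign test is done as a two-sided binomial test on the signs of the per-query differences, and the scores are the ones from the table above, so the printed p-values need not match the slide exactly.

from scipy.stats import binomtest, wilcoxon

# Per-query scores from Experiment 2 above
system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

diffs = [b - a for a, b in zip(system_a, system_b)]

# Sign test: ignore magnitudes, count how often B beats A among non-tied queries
positives = sum(d > 0 for d in diffs)
nonzero = sum(d != 0 for d in diffs)
sign_p = binomtest(positives, nonzero, p=0.5).pvalue

# Wilcoxon signed-rank test: also uses the magnitudes of the differences
wilcoxon_p = wilcoxon(system_a, system_b).pvalue

print(f"sign test p = {sign_p:.4f}, Wilcoxon p = {wilcoxon_p:.4f}")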

Page 37: Information Retrieval:  Problem Formulation & Evaluation

Live Labs: Involve Real Users in Evaluation

• Stuff I’ve Seen [Dumais et al. 03]

– Real systems deployed with hypothesis testing in mind (different interfaces + logging capability)

– Search logs can then be used to analyze hypotheses about user behavior

• The “A-B Test”

– Initial proposal by Cutting at a panel [Lest et al. 97]

– First research work published by Joachims [Joachims 03]

– Great potential, but only a few follow-up studies


Page 38: Information Retrieval:  Problem Formulation & Evaluation


What You Should Know

• Why is the retrieval problem often framed as a ranking problem?

• Two assumptions of PRP

• What is Cranfield evaluation methodology?

• How to compute the major evaluation measures (precision, recall, precision-recall curve, MAP, gMAP, nDCG, F1, MRR, breakeven precision)

• How does “pooling” work?

• Why is it necessary to do statistical significance tests?

Page 39: Information Retrieval:  Problem Formulation & Evaluation

Open Challenges in IR Evaluation

• Almost all issues are still open for research!

• What are the best measures for various search tasks (especially newer tasks such as subtopic retrieval)?

• What’s the best way of doing statistical significance tests?

• What’s the best way to adopt the pooling strategy in practice?

• How can we assess the quality of a test collection? Can we create representative test sets?

• New paradigms for evaluation? Open IR system for A-B test?
