Advanced Topics in Information Retrieval
9. Evaluation



Outline

9.1. Cranfield Paradigm & TREC

9.2. Non-Traditional Measures

9.3. Incomplete Judgments

9.4. Low-Cost Evaluation

9.5. Crowdsourcing

9.6. Online Evaluation


9.1. Cranfield Paradigm & TREC

๏ IR evaluation typically follows the Cranfield paradigm

๏ named after two studies conducted in the 1960s by Cyril Cleverdon, then a librarian at the College of Aeronautics, Cranfield, England

๏ Key Ideas:

๏ provide a document collection

๏ define a set of topics (queries) upfront

๏ obtain results for topics from different participating systems (runs)

๏ collect relevance assessments for topic-result pairs

๏ measure system effectiveness (e.g., using MAP)


TREC

๏ The Text REtrieval Conference (TREC) has been organized by the National Institute of Standards and Technology (NIST) since 1992

๏ from 1992–1999 focus on ad-hoc information retrieval (TREC 1–8) and document collections mostly consisting of news articles (Disks 1–5)

๏ topic development and relevance assessment conducted by retired information analysts from the National Security Agency (NSA)

๏ nowadays much broader scope including tracks on web retrieval, question answering, blogs, and temporal summarization


Evaluation Process

๏ TREC process to evaluate participating systems:

๏ (1) Release of document collection and topics

๏ (2) Participants submit runs, i.e., results obtained for the topics using a specific system configuration

๏ (3) Runs are pooled on a per-topic basis, i.e., the documents returned (within the top-k) by any run are merged

๏ (4) Relevance assessments are conducted; each (topic, document) pair is judged by one assessor

๏ (5) Runs are ranked according to their overall performance across all topics using an agreed-upon effectiveness measure


[Figure: evaluation pipeline: Document Collection & Topics → Runs → Pooling → Relevance Assessments → Run Ranking]
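To make step (3) concrete, here is a minimal sketch of per-topic pooling; the data layout (runs given as dicts mapping a topic to a ranked list of document IDs) is assumed for illustration and is not the official TREC tooling.

```python
def pool(runs, topic, k=100):
    """Merge the documents returned within the top-k by any run for the given topic."""
    pooled = set()
    for run in runs:
        pooled.update(run.get(topic, [])[:k])
    return pooled

# Example with two toy runs for topic "401"
run_a = {"401": ["d1", "d7", "d9", "d2"]}
run_b = {"401": ["d3", "d7", "d1", "d5"]}
print(sorted(pool([run_a, run_b], "401", k=3)))  # ['d1', 'd3', 'd7', 'd9']
```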


9.2. Non-Traditional Measures

๏ Traditional effectiveness measures (e.g., Precision, Recall, MAP) assume binary relevance assessments (relevant/irrelevant)

๏ Heterogeneous document collections like the Web and complex information needs demand graded relevance assessments

๏ User behavior exhibits a strong click bias in favor of top-ranked results and a tendency not to go beyond the first few relevant results

๏ Non-traditional effectiveness measures (e.g., RBP, nDCG, ERR) consider graded relevance assessments and/or are based on more complex models of user behavior


Position Models vs. Cascade Models

๏ Position models assume that the user inspects each rank with a fixed probability that is independent of other ranks

๏ Example: Precision@k corresponds to the user inspecting each rank 1…k with uniform probability 1/k

๏ Cascade models assume that the user inspects each rank with a probability that depends on the relevance of documents at higher ranks

๏ Example: α-nDCG assumes that the user inspects rank k with probability P[n ∉ d_1] × … × P[n ∉ d_{k-1}]


[Figure: a position model inspects rank i with probability P[d_i]; a cascade model inspects rank i with probability P[d_i | d_1, …, d_{i-1}]]


Rank-Biased Precision

๏ Moffat and Zobel [9] propose rank-biased precision (RBP) as an effectiveness measure based on a more realistic user model

๏ Persistence parameter p: the user moves on to inspect the next result with probability p and stops with probability (1 - p), with r_i ∈ {0,1} indicating the relevance of the result at rank i


\mathrm{RBP} = (1 - p) \cdot \sum_{i=1}^{d} r_i \cdot p^{\,i-1}
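As an illustration, a minimal sketch of RBP over a binary relevance vector; the function name and the toy data are made up for the example.

```python
def rbp(relevance, p=0.8):
    """Rank-biased precision for a list of 0/1 relevance labels (rank 1 first)."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevance))

print(rbp([1, 0, 1, 1, 0], p=0.8))  # persistence p = 0.8; approximately 0.4304
```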


Normalized Discounted Cumulative Gain

๏ Discounted Cumulative Gain (DCG) considers

๏ graded relevance judgments (e.g., 2: relevant, 1: marginal, 0: irrelevant)

๏ position bias (i.e., results close to the top are preferred)

๏ Considering the top-k results, with R(q, m) as the grade of the m-th result:

๏ Normalized DCG (nDCG) is obtained through normalization with the idealized DCG (iDCG) of a fictitious optimal top-k result


\mathrm{DCG}(q, k) = \sum_{m=1}^{k} \frac{2^{R(q,m)} - 1}{\log(1 + m)}

\mathrm{nDCG}(q, k) = \frac{\mathrm{DCG}(q, k)}{\mathrm{iDCG}(q, k)}
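A small sketch of DCG/nDCG over graded judgments following the formula above; log base 2 is one common choice for the discount, and the toy grades are invented for the example.

```python
import math

def dcg(grades, k):
    """DCG over graded relevance labels (rank 1 first); log base 2 assumed."""
    return sum((2 ** g - 1) / math.log2(1 + m)
               for m, g in enumerate(grades[:k], start=1))

def ndcg(grades, k):
    # In practice the ideal ranking is built from all judged documents for the topic;
    # here we simply sort the retrieved grades in decreasing order.
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

print(ndcg([2, 0, 1, 2, 0], k=5))
```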


Expected Reciprocal Rank

๏ Chapelle et al. [6] propose expected reciprocal rank (ERR) as the expected reciprocal time to find a relevant result, with R_i as the probability that the user sees a relevant result at rank i and decides to stop inspecting results

๏ R_i can be estimated from graded relevance assessments as

๏ ERR is equivalent to the reciprocal rank (RR) for binary estimates of R_i


\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \left( \prod_{i=1}^{r-1} (1 - R_i) \right) R_r

R_i = \frac{2^{g(i)} - 1}{2^{g_{\max}}}
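A minimal sketch of ERR from graded labels, directly following the two formulas above; the grade scale (0..2) in the example is illustrative.

```python
def err(grades, g_max=2):
    """Expected reciprocal rank for graded relevance labels (rank 1 first)."""
    score, p_reach = 0.0, 1.0                 # p_reach: probability of reaching rank r
    for r, g in enumerate(grades, start=1):
        stop = (2 ** g - 1) / 2 ** g_max      # R_r: probability of stopping at rank r
        score += p_reach * stop / r
        p_reach *= 1 - stop
    return score

print(err([2, 1, 0, 2]))
```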


9.3. Incomplete Judgments

๏ TREC and other initiatives typically make their document collections, topics, and relevance assessments available to foster further research

๏ Problem: when evaluating a new system which did not contribute to the pool of assessed results, one typically also retrieves results which have not been judged

๏ Naïve Solution: results without assessment are assumed irrelevant

๏ corresponds to applying a majority classifier (most results are irrelevant)

๏ induces a bias against new systems


Bpref

๏ Bpref assumes binary relevance assessments and evaluates a system based only on judged results, with R and N as the sets of relevant and irrelevant results

๏ Intuition: for every retrieved relevant result, compute a penalty reflecting how many irrelevant results were ranked higher


\mathrm{bpref} = \frac{1}{|R|} \sum_{d \in R} \left( 1 - \frac{\min(|\{d' \in N \text{ ranked higher than } d\}|, |R|)}{\min(|R|, |N|)} \right)
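A minimal sketch of bpref over a ranked list and binary judgments; unjudged documents are simply skipped, and the names and toy data are illustrative.

```python
def bpref(ranking, judgments):
    """ranking: doc IDs (rank 1 first); judgments: dict doc -> True (relevant) / False."""
    R = {d for d, rel in judgments.items() if rel}
    N = {d for d, rel in judgments.items() if not rel}
    if not R or not N:
        return 0.0
    score, n_above = 0.0, 0          # n_above: judged irrelevant docs seen so far
    for d in ranking:
        if d in N:
            n_above += 1
        elif d in R:
            score += 1 - min(n_above, len(R)) / min(len(R), len(N))
    return score / len(R)

judged = {"d1": True, "d2": False, "d3": True, "d5": False}
print(bpref(["d1", "d4", "d2", "d3"], judged))  # d4 is unjudged and ignored; 0.75
```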


Condensed Lists

๏ Sakai [10] proposes a more general approach to the problem of incomplete judgments, namely to condense result lists by removing all unjudged results

๏ can be used with any effectiveness measure (e.g., MAP, nDCG)

๏ Experiments on runs submitted to the Cross-Lingual Information Retrieval tracks of NTCIR 3 & 5 suggest that the condensed-list approach is at least as robust as bpref and its variants


[Figure: condensing the result list ⟨d1, d7, d9, d2⟩, where d9 is unjudged, yields ⟨d1, d7, d2⟩; legend: relevant / irrelevant / unknown]
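A one-line sketch of list condensation, which is measure-agnostic; the judgments dict is illustrative.

```python
def condense(ranking, judgments):
    """Drop unjudged documents, keeping the relative order of the judged ones."""
    return [d for d in ranking if d in judgments]

judged = {"d1": True, "d7": False, "d2": True}
print(condense(["d1", "d7", "d9", "d2"], judged))  # ['d1', 'd7', 'd2']
```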


Kendall’s τ

๏ Kendall’s τ coefficient measures the rank correlation between two permutations π_i and π_j of the same set of elements, with n as the number of elements

๏ Example: π_1 = ⟨a b c d⟩ and π_2 = ⟨d b a c⟩

๏ concordant pairs: (a,c) (b,c)

๏ discordant pairs: (a,b) (a,d) (b,d) (c,d)

๏ Kendall’s τ: (2 - 4) / 6 = -2/6 ≈ -0.33


\tau = \frac{(\#\,\text{concordant pairs}) - (\#\,\text{discordant pairs})}{\frac{1}{2} \cdot n \cdot (n - 1)}
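A brute-force sketch of Kendall's τ over two permutations of the same items (O(n²) pairs, which is fine for comparing system rankings); the example reproduces the slide's numbers.

```python
from itertools import combinations

def kendall_tau(perm1, perm2):
    """Rank correlation between two permutations of the same elements."""
    pos1 = {x: i for i, x in enumerate(perm1)}
    pos2 = {x: i for i, x in enumerate(perm2)}
    concordant = discordant = 0
    for a, b in combinations(perm1, 2):
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(perm1)
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau(["a", "b", "c", "d"], ["d", "b", "a", "c"]))  # -2/6 ≈ -0.33
```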


Experiments

๏ Sakai [10] compares the condensed-list approach, applied to several effectiveness measures, against bpref in terms of robustness

๏ Setup: remove a random fraction of the relevance assessments and compare the resulting system ranking, in terms of Kendall's τ, against the original system ranking with all relevance assessments


[Excerpt from Sakai [10], SIGIR 2007: Figure 2 plots the reduction rate (x axis) against Kendall's rank correlation (y axis). Table 3 reports discriminative power at α = 0.05 with full NTCIR relevance data; Q-measure, nDCG, AveP and their condensed-list variants (Q′, nDCG′, AveP′) score highest, bpref and rpref variants lowest. With heavily reduced relevance data, Q′, nDCG′, rpref_relative2, and AveP′ remain the most discriminative, and bpref offers no clear benefit. Conclusion: applying Q-measure, nDCG, or AveP to condensed lists handles incompleteness better than bpref, and graded relevance further boosts robustness.]



Label Prediction

๏ Büttcher et al. [3] examine the effect of incomplete judgments based on runs submitted to the TREC 2006 Terabyte track

๏ They also examine the amount of bias against new systems by removing judged results solely contributed by one system


[Figure 1 from Büttcher et al. [3]: building reduced qrels by random sampling (incomplete, unbiased judgments). (a) Average score of all runs under several measures (RankEff, bpref, AP, P@20, nDCG@20) as the qrels shrink from 100% to 0% of their original size. (b) Kendall's τ between the ranking based on the incomplete qrels and the ranking based on the original qrels.]

[Excerpt from Büttcher et al. [3], SIGIR 2007: With randomly reduced, unbiased qrels, the scores of AP, P@k, and nDCG@k drop and their rankings correlate poorly with the original one, while bpref and especially RankEff remain stable. With biased qrels, obtained by removing the unique contributions of one group (leave-one-out) or of all manual runs from the pool, AP tends to underestimate the performance of left-out runs (by up to 10 rank positions), whereas bpref overestimates it (by up to 14 positions); P@20(j), which simply ignores unjudged documents, is even less stable, while RankEff changes a left-out run's rank by only 0.857 positions on average. In the manual-runs experiment, Kendall's τ between the old and new rankings falls below 0.9 for all measures.]


Label Prediction

๏ Idea: predict missing labels using classification methods

๏ Classifier based on Kullback-Leibler divergence

๏ estimate a unigram language model θ_R from the relevant documents

๏ a document d with language model θ_d is considered relevant if the condition below holds, with the threshold ψ estimated such that exactly |R| documents in the training data satisfy it and are thus considered relevant


\mathrm{KL}(\theta_d \parallel \theta_R) < \psi
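A rough sketch of such a KL-divergence classifier under simple assumptions (add-one smoothing, bag-of-words documents); all names, the toy data, and the threshold value are illustrative and not the authors' exact implementation.

```python
from collections import Counter
import math

def unigram_lm(docs, vocab):
    """Maximum-likelihood unigram model with add-one smoothing."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def kl(p, q):
    """KL(p || q) over a shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

# Toy data: judged relevant documents define theta_R; the threshold psi would be
# tuned so that exactly |R| training documents fall below it.
relevant = [["trec", "evaluation", "pooling"], ["relevance", "evaluation"]]
candidate = ["evaluation", "pooling", "metrics"]
vocab = set(w for d in relevant + [candidate] for w in d)
theta_R = unigram_lm(relevant, vocab)
theta_d = unigram_lm([candidate], vocab)
psi = 0.5  # illustrative threshold
print("relevant" if kl(theta_d, theta_R) < psi else "irrelevant")
```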


Label Prediction

๏ Classifier based on a Support Vector Machine (SVM) with w ∈ R^n and b ∈ R as parameters and x as the document vector

๏ consider the 10^6 globally most frequent terms as features

๏ feature values determined using tf.idf weighting


\mathrm{sign}(w^T \cdot x + b)
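A hedged sketch of this setup using scikit-learn; the pipeline, the feature cap, and the toy data are illustrative and not the exact configuration of Büttcher et al. [3].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Judged documents and labels (1 = relevant, 0 = irrelevant) serve as training data.
train_docs = ["trec pooling and relevance assessment", "celebrity gossip news",
              "evaluation of retrieval effectiveness", "sports results yesterday"]
train_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(max_features=1_000_000)  # cap on most frequent terms
X = vectorizer.fit_transform(train_docs)
clf = LinearSVC()            # decision rule: sign(w^T x + b)
clf.fit(X, train_labels)

# Predict the label of an unjudged document.
print(clf.predict(vectorizer.transform(["pooling bias in retrieval evaluation"])))
```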


Label Prediction

๏ Prediction performance for varying amounts of training data

๏ Bias against new systems when predicting relevance of results solely contributed by one system


Table 6 (Büttcher et al. [3]): predicting document relevance for topics 801–850, per-topic precision/recall/F1 macro-averages.

Training data    Test data    KLD classifier (P / R / F1)    SVM classifier (P / R / F1)
5%               95%          0.718 / 0.195 / 0.238          0.777 / 0.162 / 0.174
10%              90%          0.549 / 0.252 / 0.293          0.760 / 0.212 / 0.243
20%              80%          0.455 / 0.291 / 0.327          0.742 / 0.246 / 0.307
40%              60%          0.403 / 0.329 / 0.356          0.754 / 0.354 / 0.420
60%              40%          0.403 / 0.353 / 0.370          0.792 / 0.386 / 0.455
80%              20%          0.413 / 0.338 / 0.355          0.812 / 0.413 / 0.474
Automatic-only   Rest         0.331 / 0.318 / 0.262          0.613 / 0.339 / 0.355
Manual-only      Rest         0.233 / 0.400 / 0.231          0.503 / 0.419 / 0.364

[Table 7 and accompanying text from Büttcher et al. [3]: when the SVM classifier is used to complete the qrels in the leave-one-out experiments, the rank of a run from the left-out group changes by less than one position on average for all measures; for P@20 the maximum rank change drops from 18 to 7 and the RMS error from 0.0243 to 0.0088. Predicting relevance also raises Kendall's τ in the automatic-only experiments above 0.9 for all measures (e.g., P@20: 0.8281 → 0.9512; bpref: 0.8676 → 0.9164). The SVM classifier is more accurate than the KLD classifier in almost all cases.]



9.4. Low-Cost Evaluation

๏ Collecting relevance assessments is laborious and expensive

๏ Assuming that we know returned results, have decided on an effectiveness measure (e.g., P@k), and are only interested in the relative order of (two) systems: Can we pick a minimal-size set of results to judge?

๏ Can we avoid collecting relevance assessments altogether?


Minimal Test Collections

๏ Carterette et al. [4] show how a minimal set of results to judge can be selected so as to determine the relative order of two systems

๏ Example: System 1 and System 2 compared under P@3

๏ determine the sign of ΔP@3(S1, S2)

๏ judging a document only provides additional information if it is within the top-k of exactly one of the two systems


[Example rankings: S1 = ⟨A B E D C⟩, S2 = ⟨C B D A E⟩]

\Delta P@k = \frac{1}{k} \sum_{i=1}^{n} x_i \cdot \mathbb{1}(\mathrm{rank}_1(i) \le k) - \frac{1}{k} \sum_{i=1}^{n} x_i \cdot \mathbb{1}(\mathrm{rank}_2(i) \le k)

= \frac{1}{k} \sum_{i=1}^{n} x_i \cdot \left[ \mathbb{1}(\mathrm{rank}_1(i) \le k) - \mathbb{1}(\mathrm{rank}_2(i) \le k) \right]


Minimal Test Collections

๏ iteratively judge documents satisfying the condition below

๏ determine the upper and lower bound of ΔP@k(S1, S2) after every judgment

๏ terminate collecting relevance assessments as soon as the upper bound is smaller than 0 or the lower bound is larger than 0


\mathbb{1}(\mathrm{rank}_1(i) \le k) - \mathbb{1}(\mathrm{rank}_2(i) \le k) \ne 0

\Delta P@3(S_1, S_2) \le 2/3 - 0/3 \quad (upper bound, if C is irrelevant)

\Delta P@3(S_1, S_2) \ge 2/3 - 1/3 \quad (lower bound, if C is relevant)

[Example rankings: S1 = ⟨A B E D C⟩, S2 = ⟨C B D A E⟩]
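A toy sketch of this incremental bounding for P@k; the document names follow the example above, while the judging order and labels are made up.

```python
def delta_p_at_k_bounds(top1, top2, judgments, k=3):
    """Upper/lower bounds on P@k(S1) - P@k(S2) given partial binary judgments."""
    lower = upper = 0.0
    for d in set(top1[:k]) | set(top2[:k]):
        weight = (d in top1[:k]) - (d in top2[:k])   # +1, 0, or -1
        if d in judgments:                           # judged: contributes its label
            lower += weight * judgments[d] / k
            upper += weight * judgments[d] / k
        elif weight > 0:                             # unjudged, only in S1's top-k
            upper += weight / k
        elif weight < 0:                             # unjudged, only in S2's top-k
            lower += weight / k
    return lower, upper

s1, s2 = ["A", "B", "E", "D", "C"], ["C", "B", "D", "A", "E"]
print(delta_p_at_k_bounds(s1, s2, {"A": 1, "E": 1, "D": 0}))
# approximately (0.33, 0.67): both bounds positive, so S1 wins under P@3 without judging C
```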


Automatic Assessments

๏ Efron [8] proposes to assess the relevance of results automatically

๏ Key Idea: the same information need can be expressed by many query articulations (aspects)

๏ Approach:

๏ determine for each topic t a set of aspects a1 … am

๏ retrieve the top-k results Rk(ai) with a baseline system for each ai

๏ consider all results in the union of the Rk(ai) relevant
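A compact sketch of building such judgment-free "aspect qrels"; the retrieve function is a stand-in for whatever baseline system (e.g., Okapi BM25) is available, and the toy index is invented.

```python
def aspect_qrels(aspects, retrieve, k=100):
    """Union of the top-k results of a baseline system over all query aspects."""
    relevant = set()
    for aspect in aspects:
        relevant.update(retrieve(aspect)[:k])   # retrieve() returns a ranked doc-ID list
    return relevant

# Toy stand-in for a baseline retrieval system.
index = {"airbus subsidies": ["d3", "d1", "d8"], "EU aircraft funding": ["d1", "d4"]}
pseudo_relevant = aspect_qrels(index.keys(), lambda q: index[q], k=2)
print(sorted(pseudo_relevant))  # ['d1', 'd3', 'd4']
```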


Automatic Assessments

๏ How to determine query articulations (aspects)?

๏ manually, by giving users the topic description, letting them search on Google, Yahoo, and Wikipedia, and recording their query terms

๏ automatically, by using automatic query expansion methods based on pseudo-relevance feedback

๏ Experiments on TREC-3, TREC-7, TREC-8 with

๏ two manual aspects (A1, A2) per topic (by the author and an assistant)

๏ two automatic aspects (A3, A4) derived from A1 and A2

๏ Okapi BM25 as the baseline retrieval model


Automatic Assessments

๏ Kendall's τ between the original system ranking under MAP and the system ranking determined with automatic assessments (aMAP)

๏ Performance of query aspects A1…A4 when used in isolation


[Excerpt from Efron [8]: Table 1 lists the datasets: TREC-3 (Disks 1–3, 1,078,169 documents, topics 151–200, 29 systems), TREC-7 (Disks 4–5 minus CR, 527,094 documents, topics 351–400, 86 systems), TREC-8 (Disks 4–5 minus CR, 527,094 documents, topics 401–450, 115 systems). Table 2 compares rank correlations with MAP for the reference-count (RC) baseline of Wu and Crestani [7] and for aMAP; all correlations are significant above the 99% level.]

Data      RC (tau / src)    aMAP (tau / src)    % improved (tau / src)
TREC-3    0.374 / 0.556     0.852 / 0.962       127.8 / 73.0
TREC-7    0.639 / 0.843     0.867 / 0.974       35.7 / 15.6
TREC-8    0.603 / 0.784     0.77 / 0.92         27.7 / 17.35


[Figure 1 from Efron [8]: aMAP scores for the TREC-3, TREC-7, and TREC-8 systems plotted in the order given by ranking systems on MAP computed from NIST's qrels; aMAP identifies poorly performing systems well, and the correlation weakens only for the top TREC-8 systems. Voorhees suggests that rankings with τ ≥ 0.8 against a benchmark are not noticeably different from it; aMAP surpasses this threshold while RC does not. Section 4.2 studies the aspect pool depth k: τ is maximized for k in the range of roughly 50–100 documents.]

Table 3. Tau correlation between MAP and aMap calculated from individual query aspect groups. Highest correlation with MAP shown in boldface.

Data      A1      A2      A3      A4      Union
TREC-3    0.773   0.857   0.778   0.827   0.852
TREC-7    0.78    0.796   0.772   0.801   0.867
TREC-8    0.747   0.77    0.72    0.709   0.77

Fig. 3. Sensitivity of aMAP to the quality of the seed system on TREC-8 data. The x-axis quantifies the amount of noise introduced to the seed system at each observation, measured in multiples of the standard deviation of the observed system scores.

To test whether high aMAP-MAP correlation requires a high-performing seed system, we intentionally degraded the quality of the seed system's rankings. To accomplish this, after retrieving documents for a query aspect we added Gaussian noise to the document scores and then re-ranked the return set, keeping only the top k = 100 documents as before. The added noise had zero mean and a standard deviation of xσ, where σ is the sample standard deviation of the scores obtained from the seed system and x is a number between zero and five.
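A minimal sketch of this corruption procedure, assuming the seed system returns (doc_id, score) pairs; the function name and data layout are illustrative.

```python
# Sketch: degrade a seed ranking by adding zero-mean Gaussian noise with
# standard deviation x * sigma to the scores, then re-rank and keep the top k.
import numpy as np

def corrupt_ranking(results, x, k=100, seed=0):
    """results: list of (doc_id, score) pairs from the seed system (assumed layout)."""
    rng = np.random.default_rng(seed)
    doc_ids = [doc_id for doc_id, _ in results]
    scores = np.array([score for _, score in results], dtype=float)
    sigma = scores.std(ddof=1)                       # sample std of the seed scores
    noisy = scores + rng.normal(0.0, x * sigma, size=len(scores))
    order = np.argsort(-noisy)                       # re-rank by corrupted score
    return [doc_ids[i] for i in order[:k]]
```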

Figure 3 shows the results of this experiment, applying the score corruption to the TREC-8 data. The x-axis in the figure gives x, the factor by which we multiplied the sample standard deviation when introducing noise into the system: larger x leads to more noise. The topmost points (solid line) are the tau correlations for aMAP with MAP obtained using aspect qrels based on the corrupted rankings. The lower data points show precision at 10 documents retrieved (prec@10) obtained by using each of our four query aspect groups as queries against the TREC-8 data (using NIST's qrels). We graph prec@10 instead of MAP to

9.5. Crowdsourcing

๏ Crowdsourcing platforms provide a cheap and readily available alternative to hiring skilled workers for relevance assessments

๏ Amazon Mechanical Turk (AMT) (mturk.com)

๏ CrowdFlower (crowdflower.com)

๏ oDesk (odesk.com)

๏ Human Intelligence Tasks (HITs) are small tasks that are easy for humans but difficult for machines (e.g., labeling an image)

๏ workers are paid a small amount (often $0.01–$0.05) per HIT

๏ workers from all over the globe with different demographics

Example HIT

Crowdsourcing Best Practices

๏ Alonso [1] describes best practices for crowdsourcing:

๏ clear instructions and description of task in simple language

๏ use highlighting (bold, italics) and show examples

๏ ask for justification of input (e.g., why do you think it is relevant?)

๏ provide “I don’t know” option

๏ assign the same task to multiple workers and use majority voting (see the sketch after this list)

๏ continuous quality monitoring and control of workforce

๏ before launch: use qualification test or approval rate threshold

๏ during execution: use honey pots (tasks with a known answer), ban workers who provide unsatisfactory input

๏ after execution: check assessor agreement (if applicable), filter out input that was provided too quickly
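A minimal sketch of these quality-control steps, assuming each judgment record carries hypothetical worker, item, label, and seconds fields, plus a dict of honey-pot items with known answers; none of these names come from a specific platform API.

```python
# Sketch: ban workers who fail honey pots or answer too fast, then majority-vote
# the remaining labels per item. Field names are illustrative.
from collections import Counter, defaultdict

def aggregate(judgments, honeypots, min_seconds=5):
    # workers who miss a known-answer task or answer suspiciously quickly are dropped
    banned = {j["worker"] for j in judgments
              if (j["item"] in honeypots and j["label"] != honeypots[j["item"]])
              or j["seconds"] < min_seconds}

    # majority vote over the remaining workers, one counter per item
    votes = defaultdict(Counter)
    for j in judgments:
        if j["worker"] not in banned and j["item"] not in honeypots:
            votes[j["item"]][j["label"]] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}
```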

Cohen’s Kappa

๏ Cohen’s kappa measures agreement between two assessors

๏ Intuition: how much does the actual agreement P[A] deviate from the expected agreement P[E]?

๏ Example: Assessors Ai, Categories Cj

๏ actual agreement: 20 / 35

๏ expected agreement: (10/35)·(8/35) + (10/35)·(11/35) + (15/35)·(16/35) ≈ 0.35

๏ Cohen’s kappa: ≈ 0.34 (recomputed in the sketch below)

κ = (P[A] - P[E]) / (1 - P[E])

             A2
         C1    C2    C3
     C1   5     2     3
A1   C2   2     5     3
     C3   1     4    10
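The example can be recomputed directly from the contingency table above (rows: assessor A1, columns: assessor A2); the following is one straightforward sketch with NumPy.

```python
# Sketch: Cohen's kappa from the 3x3 contingency table above.
import numpy as np

table = np.array([[5, 2, 3],
                  [2, 5, 3],
                  [1, 4, 10]], dtype=float)
n = table.sum()                                           # 35 judged items
p_a = np.trace(table) / n                                 # actual agreement: 20/35
p_e = (table.sum(axis=1) / n) @ (table.sum(axis=0) / n)   # expected agreement: ~0.35
kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 2))                                    # ~0.34
```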

Fleiss’ Kappa

๏ Fleiss’ kappa measures agreement between a fixed number of assessors

๏ Intuition: how much does the actual agreement P[A] deviate from the expected agreement P[E]?

๏ Definition: Assessors Ai, Subjects Sj, Categories Ck, and njk as the number of assessors who assigned Sj to Ck

๏ Probability pk that category Ck is assigned

κ = (P[A] - P[E]) / (1 - P[E])

pk = (1 / (|S| · |A|)) · Σ j=1..|S| njk

๏ Probability Pj that two assessors agree on the category for subject Sj

๏ Actual agreement as average agreement over all subjects

๏ Expected agreement between two assessors

Pj = (1 / (|A| · (|A| - 1))) · Σ k=1..|C| njk · (njk - 1)

P[A] = (1 / |S|) · Σ j=1..|S| Pj

P[E] = Σ k=1..|C| pk²
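A minimal sketch that turns these formulas into code; `n` is assumed to be the |S| × |C| matrix of counts njk described above, with every row summing to the fixed number of assessors.

```python
# Sketch: Fleiss' kappa following the formulas above.
import numpy as np

def fleiss_kappa(n):
    """n[j, k] = number of assessors who assigned subject j to category k
    (assumed layout; each row sums to the number of assessors)."""
    n = np.asarray(n, dtype=float)
    num_subjects, _ = n.shape
    num_assessors = n[0].sum()

    p_k = n.sum(axis=0) / (num_subjects * num_assessors)               # category probabilities
    p_j = (n * (n - 1)).sum(axis=1) / (num_assessors * (num_assessors - 1))
    p_a = p_j.mean()                                                    # actual agreement
    p_e = (p_k ** 2).sum()                                              # expected agreement
    return (p_a - p_e) / (1 - p_e)
```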

Crowdsourcing vs. TREC

๏ Alonso and Mizzaro [2] investigate whether crowdsourced relevance assessments can replace TREC assessors

๏ 10 topics from TREC-7 and TREC-8, 22 documents per topic

๏ 5 binary assessments per (topic, document) pair from AMT

๏ Fleiss’ kappa among AMT workers: 0.195 (slight)

๏ Fleiss’ kappa among AMT workers and TREC assessor: 0.229 (fair)

๏ Cohen’s kappa between majority vote among AMT workers and TREC assessor: 0.478 (moderate)

9.6. Online Evaluation

๏ Cranfield paradigm not suitable when evaluating online systems

๏ need for rapid testing of small innovations

๏ some innovations (e.g., result layout) do not affect ranking

๏ some innovations (e.g., personalization) hard to assess by others

๏ hard to represent user population in 50, 100, 500 queries

A/B Testing

๏ A/B testing exposes two large-enough user populations to products A and B and measures differences in behavior

๏ has its roots in marketing (e.g., picking the best box design for a cereal)

๏ deploy innovation on small fraction of users (e.g., 1%)

๏ define performance indicator (e.g., click-through on first result)

๏ compare performance against the rest of the users (the other 99%) and test for statistical significance (see the sketch below)
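One common way to run such a significance test is a two-proportion z-test on click-through rates, sketched below; the click and user counts are invented for illustration, and in practice the control sample would be drawn from the much larger 99% bucket.

```python
# Sketch: two-proportion z-test on click-through rates from an A/B experiment.
from math import sqrt
from scipy.stats import norm

clicks_a, users_a = 5_200, 100_000      # treatment bucket (hypothetical counts)
clicks_b, users_b = 4_950, 100_000      # control sample (hypothetical counts)

p_a, p_b = clicks_a / users_a, clicks_b / users_b
p_pool = (clicks_a + clicks_b) / (users_a + users_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
z = (p_a - p_b) / se
p_value = 2 * norm.sf(abs(z))           # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.4f}")
```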

Interleaving

๏ Idea: Given result rankings A = (a1…ak) and B = (b1…bk)

๏ construct an interleaved ranking I which mixes A and B

๏ show I to users and record number of clicks on individual results

๏ a click on a result scores a point for A, B, or both

๏ derive users’ preference for A or B based on total number of clicks

๏ Team-Draft Interleaving Algorithm (see the sketch after this list):

๏ flip a coin to decide whether A or B starts selecting results (players)

๏ A and B take turns and select yet-unselected results

๏ interleaved ranking I based on the order in which results are picked
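A minimal sketch of Team-Draft Interleaving and the click-based scoring described above; document ids and helper names are illustrative, and the tie on whose turn it is is broken by a coin flip.

```python
# Sketch: Team-Draft Interleaving of two ranked lists plus click attribution.
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Return the interleaved list and the sets of results credited to A and to B."""
    interleaved, team_a, team_b = [], set(), set()
    while len(interleaved) < k:
        # the ranking with fewer picks selects next; a coin flip breaks ties
        a_turn = (len(team_a) < len(team_b)
                  or (len(team_a) == len(team_b) and random.random() < 0.5))
        source, team = (ranking_a, team_a) if a_turn else (ranking_b, team_b)
        candidates = [d for d in source if d not in interleaved]
        if not candidates:              # this ranking is exhausted, let the other pick
            source, team = (ranking_b, team_b) if a_turn else (ranking_a, team_a)
            candidates = [d for d in source if d not in interleaved]
            if not candidates:
                break
        interleaved.append(candidates[0])   # highest-ranked yet-unselected result
        team.add(candidates[0])
    return interleaved, team_a, team_b

def click_preference(clicked, team_a, team_b):
    # each clicked result credits the ranking that contributed it
    score_a = sum(doc in team_a for doc in clicked)
    score_b = sum(doc in team_b for doc in clicked)
    return score_a, score_b
```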

Summary

๏ Cranfield paradigm for IR evaluation (provide documents, topics, and relevance assessments) goes back to the 1960s

๏ Non-traditional effectiveness measures handle graded relevance assessments and implement more realistic user models

๏ Incomplete judgments can be dealt with by using (modified) effectiveness measures or by predicting assessments

๏ Low-cost evaluation seeks to reduce the amount of relevance assessments that is required to determine system ranking

๏ Crowdsourcing as a possible alternative to skilled assessors; it requires redundancy and careful test design

๏ A/B testing and interleaving as forms of online evaluation

References

[1] O. Alonso: Implementing crowdsourcing-based relevance experimentation: an industrial perspective, Information Retrieval 16:101–120, 2013

[2] O. Alonso and S. Mizzaro: Using crowdsourcing for TREC relevance assessment, Information Processing & Management 48:1053–1066, 2012

[3] S. Büttcher, C. L. A. Clarke, P. C. K. Yeung: Reliable Information Retrieval Evaluation with Incomplete and Biased Judgments, SIGIR 2007

[4] B. Carterette, J. Allan, R. Sitaraman: Minimal Test Collections for Information Retrieval, SIGIR 2006

[5] B. Carterette and J. Allan: Semiautomatic Evaluation of Retrieval Systems Using Document Similarities, CIKM 2007

[6] O. Chapelle, D. Metzler, Y. Zhang, P. Grinspan: Expected Reciprocal Rank for Graded Relevance, CIKM 2009

[7] O. Chapelle, T. Joachims, F. Radlinski, Y. Yue: Large-Scale Validation and Analysis of Interleaved Search Evaluation, ACM TOIS 30(1), 2012

[8] M. Efron: Using Multiple Query Aspects to Build Test Collections without Human Relevance Judgments, ECIR 2009

[9] A. Moffat and J. Zobel: Rank-Biased Precision for Measurement of Retrieval Effectiveness, ACM TOIS 27(1), 2008

[10] T. Sakai: Alternatives to Bpref, SIGIR 2007
