Information Retrieval
Lecture 3: Introduction to Information Retrieval (Manning et al. 2007), Chapter 8
For the MSc Computer Science Programme
Dell Zhang, Birkbeck, University of London
Evaluating an IR System
Expressiveness: the ability of the query language to express complex information needs, e.g., Boolean operators, wildcards, phrases, proximity, etc.
Efficiency: How fast does it index? How large is the index? How fast does it search?
Effectiveness, the key measure: How effectively does it find relevant documents?
Is this search engine good? Which search engine is better?
Relevance
How do we quantify relevance? We need:
A benchmark set of docs (corpus).
A benchmark set of queries.
A binary assessment for each query-doc pair: either relevant or irrelevant.
Relevance
Relevance should be evaluated according to the information need (which is translated into a query).
[information need] I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
[query] wine red white heart attack effective
We judge whether the document addresses the information need, not merely whether it contains those words.
Benchmarks
Common test corpora:
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters: Reuters-21578, RCV1.
20 Newsgroups.
...
Relevance judgements are given by human experts.
TREC
The ad hoc tasks from the first 8 TRECs are standard IR tasks: 50 detailed information needs per year, with human evaluation of pooled results returned.
More recently, other related tracks: QA, Web, Genomics, etc.
TREC
A Query from TREC5
<top>
<num> Number: 225
<desc> Description:
What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?
</top>
Ranking-Ignorant Measures
The IR system returns a certain number of documents.
The retrieved documents are regarded as a set.
This can actually be viewed as classification: each doc is classified/predicted as either 'relevant' or 'irrelevant'.
Contingency Table
                Relevant    Not Relevant
Retrieved       tp          fp
Not Retrieved   fn          tn

(p = positive; n = negative; t = true; f = false)
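To make the counts concrete, here is a minimal Python sketch (the doc IDs and set contents are illustrative, not from the lecture) that derives the four cells of the table from a retrieved set and a relevant set:

```python
# Sketch: derive contingency-table counts from sets of doc IDs.
def contingency(retrieved, relevant, all_docs):
    tp = len(retrieved & relevant)             # relevant and retrieved
    fp = len(retrieved - relevant)             # retrieved but irrelevant
    fn = len(relevant - retrieved)             # relevant but missed
    tn = len(all_docs - retrieved - relevant)  # correctly left out
    return tp, fp, fn, tn

all_docs = {f"d{i}" for i in range(1, 11)}
print(contingency({"d1", "d2", "d3"}, {"d2", "d3", "d7"}, all_docs))
# -> (2, 1, 1, 6)
```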
Accuracy
Accuracy = (tp + tn) / (tp + fp + fn + tn): the fraction of correct classifications.
Not a very useful evaluation measure in IR. Why?
Accuracy puts equal weight on relevant and irrelevant documents, yet the number of relevant documents is usually very small compared to the total number of documents.
People doing information retrieval want to find something, and have a certain tolerance for junk.
Accuracy
Search for: ... 0 matching results found.
This Web search engine returns 0 matching results for all queries.
How much time do you need to build it? 1 minute!
What accuracy does it achieve? 99.9999%!
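A quick sketch of the arithmetic behind the joke (the collection size and the single relevant doc are made-up numbers): an engine that retrieves nothing is almost always "correct" when relevant docs are rare.

```python
# Why accuracy misleads in IR: the "0 results" engine on a query with
# 1 relevant doc in a collection of 1,000,000.
total_docs, n_relevant = 1_000_000, 1
tp, fp = 0, 0                        # nothing is retrieved
fn, tn = n_relevant, total_docs - n_relevant

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"accuracy = {accuracy:.6%}")  # 99.999900%, yet recall is 0
```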
Precision and Recall
Precision P = tp / (tp + fp): the fraction of retrieved docs that are relevant, i.e., Pr[relevant | retrieved].
Recall R = tp / (tp + fn): the fraction of relevant docs that are retrieved, i.e., Pr[retrieved | relevant].
Recall is a non-decreasing function of the number of docs retrieved: you can get perfect recall (but low precision) by retrieving all docs for all queries!
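A minimal sketch of both definitions, reusing the contingency counts from the example above (the guards against empty denominators are a defensive assumption, not part of the definitions):

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0  # Pr[relevant | retrieved]

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0  # Pr[retrieved | relevant]

tp, fp, fn = 2, 1, 1
print(precision(tp, fp), recall(tp, fn))  # 0.667 0.667 (rounded)
```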
Precision and Recall
Precision/Recall Tradeoff: in a good IR system, precision decreases as recall increases, and vice versa.
F measure
F: the weighted harmonic mean of P and R. A combined measure that assesses the precision/recall tradeoff. The harmonic mean is a conservative average.
F = 1 / (α/P + (1-α)/R) = (β² + 1)PR / (β²P + R), where β² = (1-α)/α
F measure
F1: the balanced F measure (with β = 1, i.e., α = 1/2). The most popular IR evaluation measure.
F1 = 1 / ((1/2)/P + (1/2)/R) = 2PR / (P + R)
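A sketch of the general F measure and its balanced special case (the function name is illustrative):

```python
# General F measure; beta > 1 favours recall, beta < 1 favours precision.
def f_measure(p, r, beta=1.0):
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.5, 0.5))  # 0.5: F1 equals P and R when they agree
print(f_measure(0.9, 0.1))  # 0.18: the harmonic mean punishes imbalance
```

Note how far 0.18 falls below the arithmetic mean of 0.5: this is exactly the conservatism of the harmonic mean mentioned above.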
Combined Measures: F1 and other averages
[Figure: combined measures (minimum, maximum, arithmetic mean, geometric mean, harmonic mean) plotted as a function of precision, with recall fixed at 70%.]
F measure – Exercise
[Figure: the IR result for query q over docs d1-d5, with the retrieved set marked and each doc labelled relevant or irrelevant.]
F1 = ?
Ranking-Aware Measures
The IR system ranks all docs in decreasing order of their relevance to the query.
Returning different numbers of the top-ranked docs leads to different recall levels (and accordingly different precision values), as the sketch below illustrates.
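A minimal sketch (the ranking and relevance judgements are made up) that sweeps the cut-off k down a ranked list and records the (recall, precision) point at each depth; these points trace out the precision-recall curve shown on the next slide:

```python
def pr_points(ranking, relevant):
    points, tp = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            tp += 1
        points.append((tp / len(relevant), tp / k))  # (recall, precision)
    return points

ranking = ["d3", "d1", "d7", "d5", "d2"]
relevant = {"d3", "d7", "d2"}
print(pr_points(ranking, relevant))
# [(0.33, 1.0), (0.33, 0.5), (0.67, 0.67), (0.67, 0.5), (1.0, 0.6)] (rounded)
```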
Precision-Recall Curve
[Figure: a precision-recall curve, with precision on the y-axis against recall on the x-axis, both from 0.0 to 1.0.]
Precision-Recall Curve
The interpolated precision at a recall level R is the highest precision found for any recall level at or above R. This removes the jiggles in the precision-recall curve.
11-Point Interpolated Average Precision: for each information need, the interpolated precision is measured at the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0.
The measured interpolated precisions are averaged (i.e., arithmetic mean) over the set of queries in the benchmark.
A composite precision-recall curve showing the 11 points can be graphed.
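A sketch of both ideas, assuming the (recall, precision) points produced by the pr_points sketch above:

```python
def interpolated_precision(points, level):
    # Highest precision at any recall at or above the given level.
    return max((p for r, p in points if r >= level), default=0.0)

def eleven_point_ap(points):
    levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return sum(interpolated_precision(points, lv) for lv in levels) / 11
```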
11-Point Interpolated Average Precision
[Figure: the 11-point interpolated average precision curve of a representative (good) TREC system; precision against recall, both from 0 to 1.]
Mean Average Precision (MAP)
For one information need, the average precision is the mean of the precision values obtained for the top k docs each time a relevant doc is retrieved. No fixed recall levels and no interpolation are used.
When a relevant doc is not retrieved at all, its precision value is taken to be 0.
The MAP value for a test collection is then the arithmetic mean of the average precision values of the individual information needs. This is macro-averaging: each query counts equally.
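A sketch of average precision for a single information need and of MAP as its macro-average (the data structures are assumptions: a ranking is a list of doc IDs, the judgements a set):

```python
def average_precision(ranking, relevant):
    tp, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            tp += 1
            precisions.append(tp / k)  # precision at each relevant hit
    # Unretrieved relevant docs contribute 0: divide by |relevant|.
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: one (ranking, relevant) pair per information need.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```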
Precision/Recall at k
Prec@k: precision over the top k retrieved docs.
Appropriate for Web search engines: most users scan only the first few (e.g., 10) hyperlinks that are presented.
Rec@k: recall over the top k retrieved docs.
Appropriate for archival retrieval systems: what fraction of the total number of relevant docs did a user find after scanning the first (say, 100) docs?
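Both cut-off measures in a few lines, a sketch under the same assumed data structures as before:

```python
def prec_at_k(ranking, relevant, k):
    return sum(d in relevant for d in ranking[:k]) / k

def rec_at_k(ranking, relevant, k):
    return sum(d in relevant for d in ranking[:k]) / len(relevant)

ranking = ["d3", "d1", "d7", "d5", "d2"]
print(prec_at_k(ranking, {"d3", "d7", "d2"}, 3))  # 2/3 of the top 3 are relevant
```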
R-Precision
Precision over the top Rel retrieved docs, where Rel is the size of the set of relevant documents (though perhaps incomplete).
A perfect IR system could score 1 on this metric for each query.
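R-precision is then just precision at a query-specific cut-off, as this sketch shows:

```python
def r_precision(ranking, relevant):
    rel = len(relevant)  # cut-off = number of relevant docs
    return sum(d in relevant for d in ranking[:rel]) / rel
```

Note that at this cut-off the number of retrieved docs equals the number of relevant docs, so precision equals recall here, which connects to the break-even point on the next slide.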
PRBEP
Given a precision-recall curve, the Precision/Recall Break-Even Point (PRBEP) is the value at which the precision equals the recall.
It follows from the definitions of precision and recall that equality is achieved exactly for contingency tables with tp + fp = tp + fn: P = R iff tp/(tp+fp) = tp/(tp+fn) iff fp = fn, i.e., the number of retrieved docs equals the number of relevant docs.
ROC Curve
An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1-specificity). true positive rate or sensitivity = recall = tp/(tp+fn) false positive rate = fp/(fp+tn) = 1 – specificity
specificty = tn/(fp+tn) The area under the ROC curve
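A sketch of the ROC points for a ranking, under the assumption that the ranking covers the whole collection:

```python
def roc_points(ranking, relevant):
    n_rel = len(relevant)
    n_irr = len(ranking) - n_rel  # assumes the ranking covers all docs
    tp = fp = 0
    points = [(0.0, 0.0)]
    for doc in ranking:
        if doc in relevant:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_irr, tp / n_rel))  # (fpr, tpr = recall)
    return points
```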
ROC Curve
[Figure: an example ROC curve.]
Variance in Performance
It is normally the case that the variance in performance of the same system across different queries is much greater than the variance in performance of different systems on the same query.
On a given test collection, an IR system may perform terribly on some information needs (e.g., average precision 0.1) but excellently on others (e.g., average precision 0.7).
There are easy information needs and hard ones!
Take Home Messages
Evaluation of effectiveness based on relevance.
Ranking-ignorant measures: accuracy; precision & recall; F measure (especially F1).
Ranking-aware measures: precision-recall curve; 11-point interpolated average precision; MAP; Prec/Rec@k; R-precision; PRBEP; ROC curve.