Chapter 3: Retrieval Evaluation
Modern Information Retrieval, Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Hsu Yi-Chen, NCU MIS 88423043
Outline
• Introduction
• Retrieval Performance Evaluation: recall and precision; alternative measures
• Reference Collections: the TREC collection; the CACM and ISI collections; the CF collection
• Trends and Research Issues
Introduction
• Types of evaluation: a functional analysis phase followed by an error analysis phase, and performance evaluation.
• Performance evaluation of an IR system: response time and space required.
• Retrieval performance evaluation: evaluates how precise the answer set is.
• Retrieval performance evaluation for an IR system: the goodness of a retrieval strategy S is the similarity between the set of documents retrieved by S and the set of relevant documents provided by specialists, quantified by an evaluation measure.
Retrieval Performance Evaluation (cont.)
• Evaluates IR systems driven mainly by batch queries.
• Within the collection, let |R| be the number of relevant documents, |A| the size of the answer set, and |Ra| the number of relevant documents in the answer set.
• Recall = |Ra| / |R|
• Precision = |Ra| / |A|
• The answer set is sorted by relevance.
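These two definitions translate directly into code. A minimal sketch in Python; the document IDs and set names below are hypothetical:

```python
def recall_precision(relevant, answer):
    """Recall and precision given the set R of relevant documents
    and the answer set A returned by a retrieval strategy S."""
    ra = len(relevant & answer)   # |Ra|: relevant documents in the answer set
    return ra / len(relevant), ra / len(answer)   # |Ra|/|R|, |Ra|/|A|

# Hypothetical example: 5 relevant documents, 10 retrieved, 4 in common.
relevant = {"d3", "d9", "d25", "d56", "d123"}
answer = {"d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"}
print(recall_precision(relevant, answer))  # (0.8, 0.4)
```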
Precision versus recall curve
Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Ranking for query q (relevant documents marked with *):
1. d123*  2. d84  3. d56*  4. d6  5. d8
6. d9*  7. d511  8. d129  9. d187  10. d25*
11. d38  12. d48  13. d250  14. d11  15. d3*
• Precision is 100% at 10% recall, 66% at 20% recall, and 50% at 30% recall.
• Curves are usually based on 11 standard recall levels: 0%, 10%, ..., 100%.
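The figures above can be traced mechanically: walk down the ranking and record a (recall, precision) point each time a relevant document is seen. A sketch using the example's data:

```python
Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]

seen = 0
for i, doc in enumerate(ranking, start=1):
    if doc in Rq:
        seen += 1
        # recall = fraction of Rq found so far; precision = fraction of top i
        print(f"recall {seen / len(Rq):.0%}  precision {seen / i:.0%}")
# -> 10%/100%, 20%/67%, 30%/50%, 40%/40%, 50%/33%
```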
Precision versus recall curve for a single query (Fig. 3.2).
Computing average performance over multiple queries:
P(r) = Σ_{i=1}^{Nq} P_i(r) / Nq
where P(r) is the average precision at recall level r, Nq is the number of queries used, and P_i(r) is the precision at recall level r for the i-th query.
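A sketch of the averaging step, assuming each query's curve has already been reduced to precision values at the 11 standard recall levels (the numbers below are hypothetical):

```python
# P_i(r) at recall levels 0%, 10%, ..., 100% for Nq = 2 queries.
curves = [
    [1.0, 1.0, 0.66, 0.5, 0.4, 0.33, 0.2, 0.2, 0.1, 0.1, 0.1],
    [1.0, 0.8, 0.6, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0, 0.0],
]
nq = len(curves)
# P(r) = sum over i of P_i(r) / Nq, at each standard recall level r
avg_curve = [sum(c[j] for c in curves) / nq for j in range(11)]
print(avg_curve)
```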
Interpolated precision
Rq = {d3, d56, d129}. Let rj, j ∈ {0, 1, 2, ..., 10}, be a reference to the j-th standard recall level. Then P(rj) = max_{rj ≤ r ≤ rj+1} P(r).
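A common implementation of this rule takes, at each standard level rj, the maximum precision observed at any recall r ≥ rj; that monotone form reproduces the book's worked figures. A sketch:

```python
def interpolate(points):
    """points: (recall, precision) pairs measured for one query.
    Returns P(rj) at the 11 standard recall levels, taking the
    maximum precision at any recall level r >= rj."""
    return [max((p for r, p in points if r >= j / 10), default=0.0)
            for j in range(11)]

# Rq = {d3, d56, d129}: in the ranking of the earlier example these
# appear at ranks 3, 8 and 15, giving the measured points below.
points = [(1/3, 1/3), (2/3, 2/8), (1.0, 3/15)]
print(interpolate(points))
# -> 0.33 at levels 0-30%, 0.25 at 40-60%, 0.2 at 70-100%
```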
Average precision versus recall figures for two distinct retrieval algorithms.
Single Value Summaries
• The average precision versus recall curve above compares retrieval algorithms over a set of example queries.
• But the performance on individual queries also matters, because: average precision may hide anomalous behavior in an algorithm; we may need to know how two algorithms compare on a particular query.
• Solution: compute a single precision value for each query, interpreted as a summary of the corresponding precision versus recall curve.
• Usually, the single value summary is taken to be the precision at some specific recall level.
Average Precision at Seen Relevant Documents
• Average the precision figures obtained after each new relevant document is observed. For Fig. 3.2: (1 + 0.66 + 0.5 + 0.4 + 0.3) / 5 = 0.57.
• This measure favors systems that find relevant documents quickly: the earlier a relevant document is ranked, the higher the precision value it contributes.
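A sketch of this measure, reusing the ranking and relevant set Rq of the earlier example:

```python
def avg_precision_seen_relevant(ranking, relevant):
    """Average of the precision values observed each time a new
    relevant document shows up in the ranking."""
    precisions, seen = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            seen += 1
            precisions.append(seen / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
# Relevant docs at ranks 1, 3, 6, 10, 15 -> (1 + 0.67 + 0.5 + 0.4 + 0.33) / 5
print(avg_precision_seen_relevant(ranking, Rq))  # ~0.58
```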
R-Precision
• Compute the precision at the R-th position in the ranking, i.e. the fraction of the top R documents that are relevant, where R is the total number of relevant documents for the current query (the size of Rq).
• Fig. 3.2: R = 10, value = 0.4. Fig. 3.3: R = 3, value = 0.33.
• Makes it easy to observe the performance of an algorithm on each individual query.
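R-precision in code, again on the example data; note that only the top R positions are inspected:

```python
def r_precision(ranking, relevant):
    """Fraction of the top R ranked documents that are relevant,
    where R = |Rq| is the number of relevant documents."""
    r = len(relevant)
    return sum(doc in relevant for doc in ranking[:r]) / r

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
print(r_precision(ranking, Rq))  # R = 10, 4 relevant in the top 10 -> 0.4
```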
Precision Histograms
• Use a bar chart of R-precision values to compare two algorithms query by query: RP_A/B(i) = RP_A(i) - RP_B(i), where RP_A(i) and RP_B(i) are the R-precision values of algorithms A and B for the i-th query.
• Allows comparing the retrieval performance history of two algorithms through visual inspection.
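A sketch of such a histogram with hypothetical R-precision values; positive bars favor algorithm A, negative bars favor B:

```python
# Hypothetical R-precision values for algorithms A and B over 5 queries.
rp_a = [0.40, 0.33, 0.80, 0.25, 0.60]
rp_b = [0.35, 0.50, 0.80, 0.10, 0.70]

for i, (a, b) in enumerate(zip(rp_a, rp_b), start=1):
    diff = a - b                      # RP_A/B(i) = RP_A(i) - RP_B(i)
    bar = ("+" if diff > 0 else "-") * round(abs(diff) * 20)
    print(f"query {i}: {diff:+.2f} {bar}")
```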
Summary Table Statistics
Put the single value summaries for all the queries in a table, for example: the number of queries; the total number of documents retrieved by all queries; the total number of relevant documents effectively retrieved when all queries are considered; the total number of relevant documents that could have been retrieved by all queries…
Applicability of Precision and Recall
• Estimating maximum recall requires knowledge of the relevance of all documents in the collection.
• Recall and precision are relative measures; they are more appropriate when used in combination.
• Measures which quantify the informativeness of the retrieval process might now be more appropriate.
• Recall and precision are easy to define when a linear ordering of the retrieved documents is enforced.
Alternative Measures
• The Harmonic Mean, which lies between 0 and 1:
F(j) = 2 / (1/r(j) + 1/P(j))
• The E Measure, which adds a user preference weight b:
E(j) = 1 - (1 + b²) / (b²/r(j) + 1/P(j))
• b = 1: E(j) = 1 - F(j); b > 1: the user is more interested in precision; b < 1: the user is more interested in recall.
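Both formulas in code; a small sketch, with the edge case r = 0 or P = 0 handled explicitly (an assumption, since the slides do not define it):

```python
def f_measure(r, p):
    """Harmonic mean of recall r(j) and precision P(j); lies in [0, 1]."""
    return 0.0 if r == 0 or p == 0 else 2 / (1 / r + 1 / p)

def e_measure(r, p, b=1.0):
    """E(j) = 1 - (1 + b^2) / (b^2 / r(j) + 1 / P(j))."""
    return 1.0 if r == 0 or p == 0 else 1 - (1 + b * b) / (b * b / r + 1 / p)

print(f_measure(0.5, 0.4))         # 0.444...
print(e_measure(0.5, 0.4))         # b = 1: equals 1 - F = 0.555...
print(e_measure(0.5, 0.4, b=2.0))  # b > 1: precision weighted more heavily
```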
User-Oriented Measures
• Assumption: relevance is user-dependent; different users have different sets of relevant documents.
• Let |U| be the number of relevant documents known to the user, |Rk| the number of retrieved relevant documents the user already knew, and |Ru| the number of retrieved relevant documents previously unknown to the user.
• Coverage = |Rk| / |U|
• Novelty = |Ru| / (|Ru| + |Rk|)
• The higher the coverage, the more of the documents the user expected to find the system has retrieved.
• The higher the novelty, the more relevant documents previously unknown to the user the system has revealed.
User-Oriented Measures (cont.)
• Relative recall: the ratio of the number of relevant documents found by the system to the number of relevant documents the user expected to find: (|Ru| + |Rk|) / |U|.
• Recall effort: the ratio of the number of relevant documents the user expected to find to the number of documents examined in an attempt to find those expected relevant documents: |U| / |Rk|.
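The four user-oriented quantities in one sketch; the counts are hypothetical, and recall effort follows the |U|/|Rk| form given above:

```python
def user_oriented(u, rk, ru):
    """u  = |U|:  relevant documents known to the user
    rk = |Rk|: retrieved relevant documents the user already knew
    ru = |Ru|: retrieved relevant documents previously unknown to the user"""
    coverage = rk / u
    novelty = ru / (ru + rk)
    relative_recall = (ru + rk) / u
    recall_effort = u / rk
    return coverage, novelty, relative_recall, recall_effort

# Hypothetical counts: the user knows 20 relevant documents; the system
# retrieves 10 of them plus 5 relevant documents new to the user.
print(user_oriented(u=20, rk=10, ru=5))  # (0.5, 0.333..., 0.75, 2.0)
```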
Reference Collections
Reference test collections used to evaluate IR systems:
• TIPSTER/TREC: large scale, intended for experimentation.
• CACM and ISI: of historical significance.
• Cystic Fibrosis: a small collection whose relevant documents were generated by panels of experts.
Criticisms of IR Systems
• They lack a solid formal framework as a basic foundation. No solution: whether a document is relevant to a query is inherently subjective!
• They lack robust and consistent testbeds and benchmarks. Early on, small experimental test collections were developed; after 1990, TREC was established, collecting tens of thousands of documents and providing them to research groups for IR system evaluation.
TREC (Text REtrieval Conference)
• Initiated under the National Institute of Standards and Technology (NIST). Goals: providing a large test collection, uniform scoring procedures, and a forum for comparing results.
• The 7th TREC conference was held in 1998. It provides: the document collection (test collections, example information requests or topics, and relevant documents) and the benchmark tasks.
The Document Collection
Documents are marked up in SGML:
<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new generation of phone service with broad…
</text>
</doc>
The Example Information Requests (Topics)
Information needs are described in natural language; a topic number identifies each topic:
<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
…
<nar> Narrative: A…
</top>
The Relevant Documents for Each Example Information Request
• The set of relevant documents for each topic is obtained from a pool of possibly relevant documents.
• Pool: the top K documents (ranked by relevance) retrieved by each of the participating IR systems; K is usually 100.
• Human assessors then judge the relevance of every document in the pool: the pooling method.
• The relevant documents are thus drawn from the union of several pools; documents not in the pool are considered non-relevant.
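A sketch of pool construction, assuming each participating run is simply a ranked list of document IDs:

```python
def build_pool(runs, k=100):
    """Union of the top-k documents from each participating run.
    Only pooled documents are judged by the assessors; anything
    outside the pool is treated as non-relevant."""
    pool = set()
    for run in runs:
        pool.update(run[:k])
    return pool

run_a = ["d1", "d7", "d3", "d9"]   # hypothetical ranked runs
run_b = ["d7", "d2", "d1", "d8"]
print(sorted(build_pool([run_a, run_b], k=3)))  # ['d1', 'd2', 'd3', 'd7']
```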
The (Benchmark) Tasks at the TREC Conferences
• Ad hoc task: receive new requests and execute them on a pre-specified document collection.
• Routing task: receive test information requests and two document collections; the first collection is for training and tuning the retrieval algorithm, the second for testing the tuned retrieval algorithm.
• Other tasks: Chinese; filtering; interactive; NLP (natural language processing); cross languages; high precision; spoken document retrieval; query task (TREC-7).
Evaluation Measures at the TREC Conferences
Summary table statistics; recall-precision averages; document level averages; average precision histograms.
The CACM Collection
• A small collection of computer science literature.
• The text of each document is structured into subfields: word stems from the title and abstract sections; categories; direct references between articles (a list of pairs of documents [da, db]); bibliographic coupling connections (a list of triples [d1, d2, ncited]); the number of co-citations for each pair of articles ([d1, d2, nciting]).
• A unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns.
The ISI Collection
The ISI test collection is composed of documents previously assembled at ISI (Institute for Scientific Information) by Small. Most of these documents were selected from Small's original cross-citation study, to support research on the similarity between terms and cross-citation patterns.
The Cystic Fibrosis Collection
• Documents about cystic fibrosis.
• Topics and relevance judgments were generated by experts with clinical or research experience in the disease.
• Relevance scores: 0 = non-relevant; 1 = marginally relevant; 2 = highly relevant.
Characteristics of the CF Collection
• All relevance scores were assigned by experts.
• A good number of information requests (relative to the collection size).
• The query vectors present overlap among themselves, so earlier queries can be used to improve retrieval effectiveness on later ones.
Trends and Research Issues
• Interactive user interfaces: it is generally believed that retrieval with feedback improves effectiveness. How should evaluation measures be defined in this setting?
• Research into evaluation measures other than precision and recall.