
Page 1: Assessing the Retrieval

Assessing the Retrieval

• Chapter 2 considered various ways of breaking text into indexable features

• Chapter 3 considered various ways of weighting combinations of those features to find the best match to a query

• So many alternatives are possible – which is the best combination?

• Users can give their personal points of view through relevance feedback.

• The system builder wants to construct a search engine that robustly finds the right documents for each query – an omniscient expert often determines which documents should have been retrieved.

Page 2: Assessing the Retrieval

Personal Assessment of Relevance

• What is relevance? As with other core concepts (information, intelligence), the lack of a fully satisfactory definition of relevance does not entirely stop progress.

• How thorough a search does the user wish to perform? A single football result may be enough, or the user may need any science that might cure a patient, or all previous relevant court cases. This variability can be observed across different users, and even in the same user at different times.

Page 3: Assessing the Retrieval

Prototypic retrievals

• Relevance feedback (Oddy) is a task of object recognition.

• The object to be recognised is an internally represented prototypic “ideal” document satisfying the user’s information need.

• For each retrieved document, the users have to judge how well it matches the prototype.

• We assume that users are capable of grading the quality of this match, e.g. on a five-point scale: not_relevant, no_response, possibly_relevant, relevant, critically_relevant.

• Extreme judgements are most useful for relevance feedback.

Page 4: Assessing the Retrieval

Relevance Feedback (RF) is Nonmetric

• While users find it easy to critique documents with + (relevant), # (neutral), - (not_relevant), they would find it harder to assign numeric quantities reflecting the exact degree of relevance.

• E.g. If the best document was rated 10, the second 6 and the third 2, could we be sure that the difference in relevance between the best and the second was exactly the same as the difference between the second and the third?

• Types of scales:

• Ratio (2 metres is twice as long as one metre)

• Interval (the gap between 30 degrees and 20 degrees is the same as the gap between 20 degrees and 10 degrees – but 20 degrees is not exactly twice as hot as 10 degrees)

• Ordinal, e.g. + > # > - in RF

• Nominal: separate unranked categories, e.g. noun, verb, adjective.

Page 5: Assessing the Retrieval
Page 6: Assessing the Retrieval

Extending the dialogue with RF

• In RF, users’ reactions to the documents just retrieved link successive assessments together, forming the FOA search dialogue.

• Do we assess retrieval performance according to which documents come out the first time around, or after a number of iterations of “berrypicking” (Bates, 1986)?

• It seems that in reality an assessment of one document’s relevance will depend greatly on the “basket” of other documents we have already seen – but IR evaluation uses the independence assumption.

Page 7: Assessing the Retrieval

Using RF for Query Refinement (1)

• We expect that there is some localised region in vector space where + relevant documents are most likely to occur.

• If these positively rated documents are in fact clustered, we can consider a hypothetical centroid (average) document d+, which is at the centre of all those documents the users have rated relevant.

• To calculate the centroid, consider e.g. a vocabulary of [apple, bean, carrot]

• Let doc_vector1 be [1, 0, 0]

• Let doc_vector2 be [2, 1, 0]

• Let doc_vector3 be [1, 1, 0]

• Then d+ = [1.33, 0.67, 0] (a sketch of this calculation follows this list)

• It is less reasonable to imagine that negatively labelled documents are similarly clustered.
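
A minimal Python sketch of the centroid calculation above. The [apple, bean, carrot] vocabulary and the three document vectors come from the slide; the function name and the use of plain lists are illustrative choices.

    # Component-wise average of the positively rated document vectors.
    def centroid(doc_vectors):
        n = len(doc_vectors)
        dims = len(doc_vectors[0])
        return [sum(vec[i] for vec in doc_vectors) / n for i in range(dims)]

    relevant_docs = [
        [1, 0, 0],  # doc_vector1 over [apple, bean, carrot]
        [2, 1, 0],  # doc_vector2
        [1, 1, 0],  # doc_vector3
    ]

    d_plus = centroid(relevant_docs)
    print([round(x, 2) for x in d_plus])  # [1.33, 0.67, 0.0]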

Page 8: Assessing the Retrieval
Page 9: Assessing the Retrieval

RF for Query Refinement (2)

• Most typically, RF is used to refine the user’s query – we “take a step toward” the centroid of the positively rated cluster d+

• The size of this step can vary, e.g. half the distance between the original query and d+

• E.g. d+ = [1.33, 0.67, 0]

• Original_query = [1, 1, 1]

• New_query = [1.17, 0.83, 0.5] (a sketch of this step follows this list)

• Negative RF involves “taking a step away from” the centroid of the negatively rated documents, but this works less well due to the cluster being less well defined.
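
A minimal sketch of this positive-feedback step. The half-step rule and the example vectors come from the slides; the step-size parameter alpha and the function name are illustrative.

    # Move the query part of the way toward the centroid of the positively
    # rated documents; alpha = 0.5 reproduces "half the distance" above.
    def refine_query(query, d_plus, alpha=0.5):
        return [q + alpha * (d - q) for q, d in zip(query, d_plus)]

    d_plus = [4 / 3, 2 / 3, 0.0]      # exact centroid, approximately [1.33, 0.67, 0]
    original_query = [1.0, 1.0, 1.0]

    new_query = refine_query(original_query, d_plus)
    print([round(x, 2) for x in new_query])  # [1.17, 0.83, 0.5], as on the slide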

Page 10: Assessing the Retrieval

RF: Research Ideas

• 1. Make changes to the documents rather than the query (see next slide). Adaptive document modifications made in response to RF are not expected to be of (immediate) use to the users who provide the feedback, but are made available to later searchers: useful documents are slowly moved into that part of the semantic space where users’ queries are concentrated (Salton & McGill, 1983).

• 2. Weight propagation (Yamout, two slides away).

Page 11: Assessing the Retrieval
Page 12: Assessing the Retrieval

Documents Propagating Positive and Negative Weights (Fadi Yamout, 2006)

Page 13: Assessing the Retrieval

Search Engine Performance

• We have discussed RF from the user’s point of view and how this information can be used to modify users’ retrievals.

• Another use of RF information is to evaluate which search engine is doing a better job.

• If one system can consistently, across a range of typical queries, more frequently retrieve documents that the users mark as relevant and fewer that they mark as irrelevant, then that system is doing a better job.

Page 14: Assessing the Retrieval

Underlying assumptions

• Real FOA vs. laboratory retrieval: we assume the lab setting is similar to real life, i.e. that “guinea pig” users will have reactions that mirror those of real users.

• Intersubject reliability = consistency between users. But users differ in education, time available, preferred writing styles, etc. See consensual relevance, Kappa statistic.

• The relevance of a document can be assessed independently of assessments of other documents – a questionable assumption.

• We are assessing the document proxy rather than the document itself.

Page 15: Assessing the Retrieval
Page 16: Assessing the Retrieval

Traditional Evaluation Methodologies

• When text corpora were small, it was possible to have a set of test queries compared exhaustively against every document in the corpus.

• E.g. the Cranfield collection: 1400 documents on aerodynamics, with 221 queries generated by some of the documents’ authors.

• The Text REtrieval Conference (TREC) is held annually for search engine evaluation. It uses much larger corpora, and avoids exhaustive assessment of all documents by means of the pooling method.

• The pooling method runs each search engine independently, then pools their results (the top k documents from each, where k = 100) to form a set of documents that are at least potentially relevant (see the sketch after this list).

• All unassessed documents are assumed to be irrelevant.
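
A minimal sketch of pooling, assuming each engine’s run is simply a ranked list of document IDs; the toy runs, the small k, and the function name are illustrative.

    # Pool the top-k documents from each engine's ranking; only pooled
    # documents are judged, and everything outside the pool is assumed
    # irrelevant.
    def build_pool(runs, k=100):
        pool = set()
        for ranking in runs:
            pool.update(ranking[:k])
        return pool

    runs = [
        ["d3", "d1", "d7", "d9"],   # hypothetical engine A
        ["d1", "d2", "d3", "d8"],   # hypothetical engine B
        ["d5", "d3", "d1", "d4"],   # hypothetical engine C
    ]

    print(sorted(build_pool(runs, k=3)))  # ['d1', 'd2', 'd3', 'd5', 'd7']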

Page 17: Assessing the Retrieval
Page 18: Assessing the Retrieval

Recall and Precision: A reminder

• Recall = | Ret ∩ Rel | / | Rel |

• Precision = | Ret ∩ Rel | / | Ret |
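
A minimal sketch of the two measures, treating Ret and Rel as Python sets; the example document IDs are illustrative.

    def recall(retrieved, relevant):
        return len(retrieved & relevant) / len(relevant)

    def precision(retrieved, relevant):
        return len(retrieved & relevant) / len(retrieved)

    retrieved = {"d1", "d2", "d3", "d4"}
    relevant = {"d2", "d4", "d7", "d9", "d10"}

    print(recall(retrieved, relevant))     # 2 / 5 = 0.4
    print(precision(retrieved, relevant))  # 2 / 4 = 0.5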

Page 19: Assessing the Retrieval
Page 20: Assessing the Retrieval
Page 21: Assessing the Retrieval
Page 22: Assessing the Retrieval

Notes on the recall/precision curves

• 4.10: top-ranked document relevant, second relevant, third non-relevant, etc.

• 4.11: the dotted line shows what would happen if the top-ranked document were non-relevant and the second and third relevant (i.e. swap the judgements for the top and third-ranked documents relative to 4.10).

• 4.12: the best retrieval envelope would be achieved if the top 5 ranked documents were all relevant and all lower-ranked documents non-relevant.

• 4.12: the worst retrieval envelope would be achieved if the top-ranked documents were all non-relevant and the 5 lowest-ranked documents were all relevant (a sketch of how such curves are computed follows this list).
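
A minimal sketch of how a recall/precision curve is traced from a single ranked retrieval. The judgement pattern follows the note on 4.10; the assumption that the collection contains 5 relevant documents in total is illustrative.

    # Walk down the ranking, recording (recall, precision) after each document.
    def recall_precision_points(judgements, total_relevant):
        points = []
        hits = 0
        for rank, is_relevant in enumerate(judgements, start=1):
            if is_relevant:
                hits += 1
            points.append((hits / total_relevant, hits / rank))
        return points

    # Relevant, relevant, non-relevant, ... as in the note on 4.10.
    judgements = [True, True, False, True, False]
    for r, p in recall_precision_points(judgements, total_relevant=5):
        print(f"recall={r:.2f}  precision={p:.2f}")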

Page 23: Assessing the Retrieval

Multiple retrievals across a set of queries

• 4.14 shows R/P curves for two queries.

• Even with these two queries, there is no guarantee that we will have R/P data points at any particular recall level.

• This necessitates interpolation of data points at desired recall levels, e.g. 0, 0.25, 0.5, 0.75 and 1.

• The 11-point average curve finds the average precision over a set of queries at recall levels of 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1 (a sketch of interpolation and P@10 follows this list).

• Another useful measure for web search engines is P@10, the precision over the top 10 ranked documents.
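
A minimal sketch of interpolation and P@10. It assumes the common interpolation rule that the precision at recall level r is the maximum precision observed at any recall >= r; the (recall, precision) points and the judgement list are illustrative.

    # Interpolated precision at each requested recall level.
    def interpolate(points, levels):
        interpolated = []
        for r in levels:
            candidates = [p for (rec, p) in points if rec >= r]
            interpolated.append(max(candidates) if candidates else 0.0)
        return interpolated

    # Precision over the top 10 ranked documents.
    def precision_at_10(judgements):
        return sum(judgements[:10]) / 10

    points = [(0.2, 1.0), (0.4, 1.0), (0.4, 0.67), (0.6, 0.75), (0.6, 0.6)]
    eleven_levels = [i / 10 for i in range(11)]
    print(interpolate(points, eleven_levels))
    print(precision_at_10([True, True, False, True, False,
                           False, True, False, False, False]))  # 4 / 10 = 0.4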

Page 24: Assessing the Retrieval
Page 25: Assessing the Retrieval
Page 26: Assessing the Retrieval

Combining Precision and Recall

• Jardine & Van Rijsbergen’s F-measure

• The “harmonic mean” of Precision and Recall

• F = (2 * P * R) / (P + R).
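
A minimal worked example, reusing the precision of 0.5 and recall of 0.4 from the earlier sketch (those values are illustrative).

    def f_measure(p, r):
        return (2 * p * r) / (p + r)

    print(round(f_measure(0.5, 0.4), 2))  # 2*0.5*0.4 / 0.9, approximately 0.44

Because it is a harmonic mean, F stays low unless both precision and recall are reasonably high.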