
Page 1: CS 430 / INFO 430 Information Retrieval, Lecture 9: Evaluation of Retrieval Effectiveness 2

Page 2: Course administration

Page 3: Precision-recall graph

[Figure: precision-recall graph, with precision on the vertical axis and recall on the horizontal axis, both from 0 to 1.0, showing curves for two systems, one red and one black]

The red system appears better than the black, but is the difference statistically significant?
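As an aside (not part of the original slide), the points on such a precision-recall graph come from measuring precision and recall at each rank of a result list. A minimal sketch, with an invented ranking and invented relevance judgments:

```python
# Minimal sketch: compute (recall, precision) points down a ranked result list.
# The ranking and the set of relevant documents are invented for illustration.

ranking = ["d3", "d7", "d1", "d9", "d4", "d8"]   # system output, best first
relevant = {"d3", "d4", "d5", "d9"}              # judged relevant documents

points = []
hits = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    precision = hits / rank               # relevant retrieved / retrieved
    recall = hits / len(relevant)         # relevant retrieved / all relevant
    points.append((recall, precision))

for r, p in points:
    print(f"recall={r:.2f}  precision={p:.2f}")
```

Plotting these (recall, precision) pairs for each system gives curves like the red and black ones above; whether the gap between the curves is meaningful is the statistical question addressed next.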

Page 4: Statistical tests

Suppose that a search is carried out on systems i and j. System i is superior to system j if, for all test cases,

recall(i) >= recall(j) and precision(i) >= precision(j)
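As a small illustration (not from the slides), this "superior on every test case" criterion can be checked directly; the per-case recall and precision figures below are invented.

```python
# Sketch: system i is judged superior to system j only if it is at least as
# good on both recall and precision for every test case. Values are invented.

recall_i    = [0.60, 0.45, 0.70]
precision_i = [0.50, 0.40, 0.55]
recall_j    = [0.55, 0.45, 0.65]
precision_j = [0.48, 0.38, 0.55]

superior = all(ri >= rj and pi >= pj
               for ri, rj, pi, pj in zip(recall_i, recall_j,
                                         precision_i, precision_j))
print("System i superior to system j:", superior)
```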

In practice, we have data from a limited number of test cases. What conclusions can we draw?

Page 5: Statistical tests

• The t-test is the standard statistical test for comparing two tables of numbers, but it depends on statistical assumptions of independence and normal distributions that do not apply to this data.

• The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples.

• The Wilcoxon signed rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but assumes independent samples.
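As a concrete illustration (not from the slides), here is a minimal sketch of applying the sign test and the Wilcoxon signed-rank test to paired per-topic scores from two systems. It assumes SciPy is available (binomtest requires SciPy 1.7 or later); the score values are invented.

```python
# Sketch: compare two systems on paired per-topic scores (e.g., average
# precision). All numbers are invented for illustration.
from scipy import stats

system_i = [0.42, 0.55, 0.31, 0.60, 0.47, 0.52, 0.38, 0.44]
system_j = [0.40, 0.50, 0.35, 0.58, 0.45, 0.49, 0.36, 0.41]

diffs = [a - b for a, b in zip(system_i, system_j)]

# Sign test: count how often system i beats system j, ignore ties, and test
# the count against Binomial(n, 0.5).
wins = sum(d > 0 for d in diffs)
n = sum(d != 0 for d in diffs)
sign_p = stats.binomtest(wins, n, 0.5).pvalue      # two-sided p-value

# Wilcoxon signed-rank test: uses the ranks of the differences.
wilcoxon_stat, wilcoxon_p = stats.wilcoxon(system_i, system_j)

print(f"sign test p = {sign_p:.3f}, Wilcoxon p = {wilcoxon_p:.3f}")
```

The sign test looks only at the direction of each difference, while the Wilcoxon test also uses their ranks, so it is usually the more sensitive of the two.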

Page 6: Text Retrieval Conferences (TREC)

• Led by Donna Harman (NIST) and Ellen Voorhees, with DARPA support, since 1992

• Corpus of several million textual documents, total of more than five gigabytes of data

• Researchers attempt a standard set of tasks, e.g.,

-> search the corpus for topics provided by surrogate users

-> match a stream of incoming documents against standard queries

• Participants include large commercial companies, small information retrieval vendors, and university research groups.  

Page 7: Characteristics of Evaluation Experiments

Corpus: Standard set of documents that can be used for repeated experiments.

Topic statements: Formal statement of user information need, not related to any query language or approach to searching.

Results set for each topic statement: Identify all relevant documents (or a well-defined procedure for estimating all relevant documents)

Publication of results: Description of testing methodology, metrics, and results.

Page 8: TREC Ad Hoc Track

1. NIST provides the text corpus on CD-ROM. Participant builds an index using its own technology.

2. NIST provides 50 natural language topic statements. Participant converts them to queries (automatically or manually).

3. Participant runs the search (possibly using relevance feedback and other iterations) and returns up to 1,000 hits to NIST (the run format is sketched below).

4. NIST uses pooled results to estimate the set of relevant documents.

5. NIST analyzes for recall and precision (all TREC participants use rank-based methods of searching).

6. NIST publishes methodology and results.
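As a concrete aside (not from the slides), results for step 3 are conventionally submitted in the standard TREC run format, one line per hit: topic number, the literal field Q0, document identifier, rank, score, and a run tag. The sketch below writes such a file; the topic numbers, document identifiers, scores, and run tag are invented.

```python
# Sketch: write a TREC-style run file with up to 1,000 hits per topic.
# Each line: <topic> Q0 <docno> <rank> <score> <run_tag>
# The results dictionary and run tag are invented for illustration.

results = {
    "401": [("FT911-3032", 14.2), ("LA010189-0018", 12.7)],
    "402": [("FBIS3-10082", 9.8)],
}

with open("myrun.txt", "w") as out:
    for topic, docs in results.items():
        for rank, (docno, score) in enumerate(docs[:1000], start=1):
            out.write(f"{topic} Q0 {docno} {rank} {score} hypotheticalRun\n")
```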

Page 9: Notes on the TREC Corpus

The TREC corpus consists mainly of general articles. The Cranfield data was in a specialized engineering domain.

The TREC data is raw data:

-> No stop words are removed; no stemming

-> Words are alphanumeric strings (see the sketch below)

-> No attempt made to correct spelling, sentence fragments, etc.
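To make this concrete, a minimal sketch (not from the slides) of treating words as plain alphanumeric strings, with no stemming and no stop word removal; the sample sentence is invented.

```python
# Sketch: tokenize raw text into alphanumeric strings only, as described above.
import re

text = "The U.S. Congress passed 2 bills in 1992."
tokens = re.findall(r"[A-Za-z0-9]+", text)
print(tokens)  # ['The', 'U', 'S', 'Congress', 'passed', '2', 'bills', 'in', '1992']
```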

Page 10: Relevance Assessment: TREC

Problem: Too many documents to inspect each one for relevance.

Solution: For each topic statement, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant

The human expert who set the query looks at every document in the pool and determines whether it is relevant.

Documents outside the pool are not examined.

In a TREC-8 example, with 71 participants: 7,100 documents in the pool; 1,736 unique documents (eliminating duplicates); 94 judged relevant.
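To make the pooling step concrete, a minimal sketch (not from the slides) that merges the top-ranked documents from several runs into a judging pool, following the top-100 convention described above; the run names and document identifiers are invented.

```python
# Sketch: build an assessment pool from the top-k documents of each run.
# Run names and document identifiers are invented; TREC used k = 100.

POOL_DEPTH = 100

runs = {
    "participantA": ["d12", "d03", "d44", "d07"],
    "participantB": ["d03", "d19", "d12", "d28"],
    "participantC": ["d51", "d03", "d44", "d60"],
}

pool = set()
for ranking in runs.values():
    pool.update(ranking[:POOL_DEPTH])

print(f"{len(pool)} unique documents for the assessor to judge")
```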

Page 11: Some other TREC tracks (not all tracks offered every year)

Cross-Language Track

Retrieve documents written in different languages using topics that are in one language.

Filtering Track

In a stream of incoming documents, retrieve those documents that match the user's interest as represented by a query. Adaptive filtering modifies the query based on relevance feedback.

Genome Track

Study the retrieval of genomic data: gene sequences and supporting documentation, e.g., research papers, lab reports, etc.

Page 12: Some Other TREC Tracks (continued)

HARD Track

High accuracy retrieval, leveraging additional information about the searcher and/or the search context.

Question Answering Track

Systems that answer questions, rather than return documents.

Video Track

Content-based retrieval of digital video.

Web Track

Search techniques and repeatable experiments on Web documents.

Page 13: A Cornell Footnote

The TREC analysis uses a program developed by Chris Buckley, who spent 17 years at Cornell before completing his Ph.D. in 1995.

Buckley has continued to maintain the SMART software and has been a participant at every TREC conference. SMART has been used as the basis against which other systems are compared.

During the early TREC conferences, tuning SMART against the TREC corpus led to steady improvements in retrieval effectiveness, but after about TREC-5 a plateau was reached.

TREC-8, in 1999, was the final year for the ad hoc experiment.

Page 14: Searching and Browsing: The Human in the Loop

[Diagram: the user in the loop searches the index and is returned hits; browses the repository and is returned objects]

Page 15: Evaluation: User criteria

System-centered and user-centered evaluation
-> Is user satisfied?
-> Is user successful?

System efficiency
-> What efforts are involved in carrying out the search?

Suggested criteria (none very satisfactory):
• recall and precision
• response time
• user effort
• form of presentation
• content coverage

Page 16: D-Lib Working Group on Metrics

DARPA-funded attempt to develop a TREC-like approach to digital libraries (1997) with a human in the loop.

"This Working Group is aimed at developing a consensus on an appropriate set of metrics to evaluate and compare the effectiveness of digital libraries and component technologies in a distributed environment. Initial emphasis will be on (a) information discovery with a human in the loop, and (b) retrieval in a heterogeneous world. "

Very little progress made.

See: http://www.dlib.org/metrics/public/index.html

Page 17: MIRA

Evaluation Frameworks for Interactive Multimedia Information Retrieval Applications

European study 1996-99

Chair: Keith van Rijsbergen, Glasgow University

Expertise:
Multimedia Information Retrieval
Information Retrieval
Human Computer Interaction
Case Based Reasoning
Natural Language Processing

Page 18: MIRA Starting Point

• Information Retrieval techniques are beginning to be used in complex goal and task oriented systems whose main objectives are not just the retrieval of information.

• New original research in Information Retrieval is being blocked or hampered by the lack of a broader framework for evaluation.

Page 19: Some MIRA Aims

• Bring the user back into the evaluation process.

• Understand the changing nature of Information Retrieval tasks and their evaluation.

• Evaluate traditional evaluation methodologies.

• Understand how interaction affects evaluation.

• Understand how new media affects evaluation.

• Make evaluation methods more practical for smaller groups.

Page 20: MIRA Approaches

• Developing methods and tools for evaluating interactive Information Retrieval.

• Studying real users and their overall goals.

• Design for a multimedia test collection.

• Bring together collaborative projects. (TREC was organized as a competition.)

• Pool tools and data.

Page 21: Market Evaluation

Systems that are successful in the marketplace must be satisfying some group of users.

                         Example               Documents                    Approach
Library catalogs         Library of Congress   catalog records              fielded data, Boolean search
Scientific information   Medline               index records + abstracts    thesaurus, ranked search
Web search               Google                web pages                    similarity + document rank

Page 22: Market Research Methods of Evaluation

• Expert opinion (e.g. consultant)

• Competitive analysis

• Focus groups

• Observing users (user protocols)

• Measurements: effectiveness in carrying out tasks; speed

• Usage logs

Page 23: Market Research Methods

[Table: rows are the methods (expert opinions, competitive analysis, focus groups, observing users, measurements, usage logs); columns are the development stages (Initial, Mock-up, Prototype, Production); the table marks which methods apply at each stage]

Page 24: Focus Group

A focus group is a group interview

• Interviewer

• Potential users: typically 5 to 12, with similar characteristics (e.g., same viewpoint)

• Structured set of questions; may show mock-ups; group discussions

• Repeated with contrasting user groups

Page 25: The Search Explorer Application: Reconstruct a User Session