
2003.11.18 SLIDE 1 IS 202 – FALL 2003

Lecture 20: Evaluation

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2003
http://www.sims.berkeley.edu/academics/courses/is202/f03/

SIMS 202: Information Organization and Retrieval

2003.11.18 SLIDE 2 IS 202 – FALL 2003

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2003.11.18 SLIDE 3 IS 202 – FALL 2003

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2003.11.18 SLIDE 4 IS 202 – FALL 2003

Probability Ranking Principle

• “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.”

Stephen E. Robertson, J. Documentation 1977

2003.11.18 SLIDE 5 IS 202 – FALL 2003

Model 1 – Maron and Kuhns

• Concerned with estimating probabilities of relevance at the point of indexing:
  – If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj?

2003.11.18 SLIDE 6 IS 202 – FALL 2003

Model 2

• Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.

Robertson, Maron & Cooper, 1982

2003.11.18 SLIDE 7 IS 202 – FALL 2003

Model 2 – Robertson & Sparck Jones

Given a term t and a query q:

                         Document Relevance
                         +              -
  Document     +         r              n-r              n
  Indexing     -         R-r            N-n-R+r          N-n
                         R              N-R              N

2003.11.18 SLIDE 8 IS 202 – FALL 2003

Robertson-Sparck Jones Weights

• Retrospective formulation

$$ w = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)} $$

2003.11.18 SLIDE 9 IS 202 – FALL 2003

Robertson-Sparck Jones Weights

• Predictive formulation

$$ w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)} $$
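A minimal Python sketch of the predictive weight above, using the contingency counts defined on slide 7; the function name and the example counts are illustrative, not data from the lecture:

```python
import math

def rsj_weight(r, n, R, N):
    """Predictive Robertson-Sparck Jones weight for one term, from the slide-7 counts:
      r = relevant documents indexed by the term
      n = documents indexed by the term
      R = relevant documents for the query
      N = documents in the collection
    The 0.5 corrections keep the weight defined when any cell is zero."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Illustrative counts: a term occurring in 8 of 10 known relevant documents,
# but in only 40 of 10,000 documents overall, gets a large positive weight.
print(round(rsj_weight(r=8, n=40, R=10, N=10_000), 2))
```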

2003.11.18 SLIDE 10 IS 202 – FALL 2003

Probabilistic Models: Some Unifying Notation

• D = All present and future documents

• Q = All present and future queries

• (Di,Qj) = A document query pair

• x = class of similar documents, x ⊆ D

• y = class of similar queries, y ⊆ Q

• Relevance (R) is a relation:

  R = {(Di, Qj) | Di ∈ D, Qj ∈ Q, document Di is judged relevant by the user submitting Qj}

2003.11.18 SLIDE 11 IS 202 – FALL 2003

Probabilistic Models

• Model 1 -- Probabilistic Indexing, P(R|y,Di)

• Model 2 -- Probabilistic Querying, P(R|Qj,x)

• Model 3 -- Merged Model, P(R| Qj, Di)

• Model 0 -- P(R|y,x)

• Probabilities are estimated based on prior usage or relevance estimation

2003.11.18 SLIDE 12 IS 202 – FALL 2003

Probabilistic Models

[Figure: document space D containing class x and document Di; query space Q containing class y and query Qj]

2003.11.18 SLIDE 13 IS 202 – FALL 2003

Logistic Regression

• Another approach to estimating probability of relevance

• Based on work by William Cooper, Fred Gey and Daniel Dabney

• Builds a regression model for relevance prediction based on a set of training data

• Uses less restrictive independence assumptions than Model 2
  – Linked Dependence
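The slides do not give the actual Cooper/Gey/Dabney model, so the following is only an illustrative sketch, assuming scikit-learn and an invented feature set for (query, document) pairs with binary relevance labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row per (query, document) pair with simple
# match features; labels are human relevance judgments (1 = relevant, 0 = not).
# These features are illustrative, not the ones used by Cooper, Gey & Dabney.
X_train = np.array([
    [3, 0.20, 120],   # matching-term count, fraction of query matched, doc length
    [0, 0.00,  80],
    [5, 0.60, 300],
    [1, 0.10,  45],
    [4, 0.50, 150],
    [0, 0.00, 500],
])
y_train = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)

# Estimated probability of relevance for a new (query, document) pair,
# which can then be used to rank documents for the query.
print(model.predict_proba([[2, 0.30, 100]])[0, 1])
```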

2003.11.18 SLIDE 14 IS 202 – FALL 2003

Logistic Regression

[Figure: estimated relevance (0–100) plotted against term frequency in document (0–60)]

2003.11.18 SLIDE 15 IS 202 – FALL 2003

Relevance Feedback

• Main Idea:
  – Modify existing query based on relevance judgements
    • Extract terms from relevant documents and add them to the query
    • And/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
      – Users/system select terms from an automatically-generated list

2003.11.18 SLIDE 16 IS 202 – FALL 2003

Rocchio Method

$$ Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i \;-\; \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i $$

where
  Q0 = the vector for the initial query
  Ri = the vector for the relevant document i
  Si = the vector for the non-relevant document i
  n1 = the number of relevant documents chosen
  n2 = the number of non-relevant documents chosen
  β and γ tune the importance of relevant and nonrelevant terms (in some studies, best to set β to 0.75 and γ to 0.25)

2003.11.18 SLIDE 17 IS 202 – FALL 2003

Rocchio/Vector Illustration

[Figure: vectors Q0, D1, D2, Q', and Q" plotted in a two-dimensional term space; x-axis "Retrieval", y-axis "Information", both from 0 to 1.0]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q' = ½*Q0 + ½*D1 = (0.45, 0.55)
Q" = ½*Q0 + ½*D2 = (0.80, 0.20)
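A small NumPy sketch of the Rocchio update from the previous slide; the function name and the explicit alpha weight on the original query are assumptions, added so the same routine also reproduces the illustration above (α = β = ½, no non-relevant documents):

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio query modification: weighted original query, plus the centroid
    of relevant vectors, minus the centroid of non-relevant vectors."""
    new_q = alpha * np.asarray(q0, dtype=float)
    if relevant:
        new_q += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        new_q -= gamma * np.mean(nonrelevant, axis=0)
    return new_q

# Reproduce the illustration: Q0 = (0.7, 0.3), D1 = (0.2, 0.8), D2 = (0.9, 0.1)
q0 = [0.7, 0.3]
print(rocchio(q0, [[0.2, 0.8]], [], alpha=0.5, beta=0.5))  # [0.45 0.55] = Q'
print(rocchio(q0, [[0.9, 0.1]], [], alpha=0.5, beta=0.5))  # [0.8  0.2 ] = Q"
```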

2003.11.18 SLIDE 18 IS 202 – FALL 2003

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2003.11.18 SLIDE 19 IS 202 – FALL 2003

IR Evaluation

• Why Evaluate?

• What to Evaluate?

• How to Evaluate?

2003.11.18 SLIDE 20 IS 202 – FALL 2003

Why Evaluate?

• Determine if the system is desirable

• Make comparative assessments
  – Is system X better than system Y?

• Others?

2003.11.18 SLIDE 21 IS 202 – FALL 2003

What to Evaluate?

• How much of the information need is satisfied

• How much was learned about a topic

• Incidental learning:
  – How much was learned about the collection
  – How much was learned about other topics

• How inviting the system is

2003.11.18 SLIDE 22 IS 202 – FALL 2003

Relevance (revisited)

• In what ways can a document be relevant to a query?
  – Answer precise question precisely
  – Partially answer question
  – Suggest a source for more information
  – Give background information
  – Remind the user of other knowledge
  – Others...

2003.11.18 SLIDE 23 IS 202 – FALL 2003

Relevance (revisited)

• How relevant is the document?
  – For this user, for this information need
    • Subjective, but
    • Measurable to some extent
      – How often do people agree a document is relevant to a query?

• How well does it answer the question?
  – Complete answer? Partial?
  – Background Information?
  – Hints for further exploration?

2003.11.18 SLIDE 24 IS 202 – FALL 2003

What to Evaluate?

• What can be measured that reflects users’ ability to use system? (Cleverdon 66)
  – Coverage of information
  – Form of presentation
  – Effort required/ease of use
  – Time and space efficiency
  – Recall
    • Proportion of relevant material actually retrieved
  – Precision
    • Proportion of retrieved material actually relevant

[The last two measures, Recall and Precision, are grouped under the label "Effectiveness"]

2003.11.18 SLIDE 25 IS 202 – FALL 2003

Relevant vs. Retrieved

[Figure: Venn diagram of "All Docs", with overlapping subsets "Relevant" and "Retrieved"]

2003.11.18 SLIDE 26 IS 202 – FALL 2003

Precision vs. Recall

$$ \text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|} \qquad\qquad \text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|} $$

[Figure: Venn diagram of "All Docs", with overlapping subsets "Relevant" and "Retrieved"]
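The two definitions translate directly into Python over sets of document identifiers; the example sets are invented:

```python
def precision_recall(retrieved, relevant):
    """Precision = |RelRetrieved| / |Retrieved|, Recall = |RelRetrieved| / |Rel in Collection|."""
    rel_retrieved = retrieved & relevant
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 10 documents retrieved, 4 of them among the 8 relevant ones.
retrieved = {f"d{i}" for i in range(1, 11)}                       # d1 ... d10
relevant = {"d1", "d2", "d3", "d4", "d20", "d21", "d22", "d23"}
print(precision_recall(retrieved, relevant))                      # (0.4, 0.5)
```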

2003.11.18 SLIDE 27 IS 202 – FALL 2003

Why Precision and Recall?

Get as much good stuff as possible while at the same time getting as little junk as possible

2003.11.18 SLIDE 28 IS 202 – FALL 2003

Retrieved vs. Relevant Documents

Very high precision, very low recall


2003.11.18 SLIDE 29 IS 202 – FALL 2003

Retrieved vs. Relevant Documents

Very low precision, very low recall (0 in fact)


2003.11.18 SLIDE 30 IS 202 – FALL 2003

Retrieved vs. Relevant Documents

High recall, but low precision


2003.11.18 SLIDE 31 IS 202 – FALL 2003

Retrieved vs. Relevant Documents

High precision, high recall (at last!)


2003.11.18 SLIDE 32 IS 202 – FALL 2003

Precision/Recall Curves

• There is a well-known tradeoff between Precision and Recall

• So we typically measure Precision at different (fixed) levels of Recall

• Note: this is an AVERAGE over MANY queries

[Figure: precision plotted against recall, with measured points marked "x"]
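A sketch of the procedure the bullets describe, under the common interpolation convention (report the best precision seen at or beyond each fixed recall level) and then averaging over queries; the ranked lists and relevance judgments are invented:

```python
def precision_at_recall_levels(ranked, relevant, levels):
    """Interpolated precision for one query: the highest precision observed
    at any rank where recall has reached the given level."""
    points, hits = [], 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))   # (recall, precision)
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

levels = [0.25, 0.5, 0.75, 1.0]
# Two hypothetical queries: a ranked result list and its relevant set.
runs = [
    (["d1", "d7", "d2", "d9", "d3", "d4"], {"d1", "d2", "d3", "d4"}),
    (["d5", "d6", "d8", "d10", "d11"], {"d6", "d8"}),
]
per_query = [precision_at_recall_levels(ranked, rel, levels) for ranked, rel in runs]
averaged = [sum(col) / len(col) for col in zip(*per_query)]
print(averaged)   # average precision at each fixed recall level over the two queries
```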

2003.11.18 SLIDE 33 IS 202 – FALL 2003

Precision/Recall Curves

• Difficult to determine which of these two hypothetical results is better:

[Figure: two hypothetical precision-recall curves, each traced by points marked "x"]

2003.11.18 SLIDE 34 IS 202 – FALL 2003

TREC (Manual Queries)

2003.11.18 SLIDE 35 IS 202 – FALL 2003

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2003.11.18 SLIDE 36 IS 202 – FALL 2003

Document Cutoff Levels

• Another way to evaluate:
  – Fix the number of documents retrieved at several levels:
    • Top 5
    • Top 10
    • Top 20
    • Top 50
    • Top 100
    • Top 500
  – Measure precision at each of these levels
  – (Possibly) Take average over levels

• This is a way to focus on how well the system ranks the first k documents
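A small helper for measuring precision at the cutoff levels listed above on a single ranked list; the ranked list and relevance judgments are invented:

```python
def precision_at_k(ranked, relevant, k):
    """Precision over the first k retrieved documents."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# Hypothetical ranked output for one query, with its set of relevant documents.
ranked = [f"d{i}" for i in range(1, 501)]
relevant = {"d1", "d3", "d4", "d9", "d17", "d42", "d99", "d250"}
for k in (5, 10, 20, 50, 100, 500):
    print(k, precision_at_k(ranked, relevant, k))
```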

2003.11.18 SLIDE 37 IS 202 – FALL 2003

Problems with Precision/Recall

• Can’t know true recall value
  – Except in small collections

• Precision/Recall are related
  – A combined measure sometimes more appropriate

• Assumes batch mode
  – Interactive IR is important and has different criteria for successful searches
  – We will touch on this in the UI section

• Assumes that a strict rank ordering matters

2003.11.18 SLIDE 38 IS 202 – FALL 2003

Relation to Contingency Table

                          Doc is Relevant     Doc is NOT relevant
  Doc is retrieved              a                     b
  Doc is NOT retrieved          c                     d

• Accuracy: (a+d) / (a+b+c+d)
• Precision: a / (a+b)
• Recall: ?
• Why don’t we use Accuracy for IR Evaluation? (Assuming a large collection)
  – Most docs aren’t relevant
  – Most docs aren’t retrieved
  – Inflates the accuracy value
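Filling in the slide's question (recall is a/(a+c)) and working the argument through with invented counts: in a large collection the d cell dominates, so accuracy looks excellent even when precision and recall are poor:

```python
# Contingency counts for one query over a hypothetical 1,000,000-document collection.
a = 20                          # retrieved and relevant
b = 80                          # retrieved but not relevant
c = 180                         # relevant but not retrieved
d = 1_000_000 - (a + b + c)     # neither retrieved nor relevant (dominates)

accuracy = (a + d) / (a + b + c + d)
precision = a / (a + b)
recall = a / (a + c)

print(f"accuracy  = {accuracy:.4f}")   # ~0.9997, inflated by the huge d cell
print(f"precision = {precision:.2f}")  # 0.20
print(f"recall    = {recall:.2f}")     # 0.10
```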

2003.11.18 SLIDE 39 IS 202 – FALL 2003

The E-Measure

Combine Precision and Recall into one number (van Rijsbergen 79)

P = precision
R = recall
β = measure of the relative importance of P or R

For example:
  β = 1 means the user is equally interested in precision and recall
  β = ∞ means the user doesn’t care about precision
  β = 0 means the user doesn’t care about recall

$$ E = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}, \qquad \alpha = \frac{1}{\beta^2 + 1} $$
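A direct encoding of the E formula above (function name and sample values are illustrative); the two extreme settings of β reproduce the limiting cases listed on the slide:

```python
def e_measure(p, r, beta=1.0):
    """van Rijsbergen's E: E = 1 - 1 / (alpha/P + (1 - alpha)/R), alpha = 1/(beta^2 + 1)."""
    alpha = 1.0 / (beta ** 2 + 1.0)
    return 1.0 - 1.0 / (alpha / p + (1.0 - alpha) / r)

p, r = 0.5, 0.25
print(e_measure(p, r, beta=1.0))     # equal interest in precision and recall
print(e_measure(p, r, beta=0.0))     # 0.5   -> equals 1 - P (recall ignored)
print(e_measure(p, r, beta=100.0))   # ~0.75 -> approaches 1 - R (precision ignored)
```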

2003.11.18 SLIDE 40 IS 202 – FALL 2003

F Measure (Harmonic Mean)

$$ F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}} $$

where r(j) is the recall for the j-th document and P(j) is the precision for the j-th document
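The harmonic-mean F translated directly; for β = 1 it equals 1 − E from the previous slide. Sample values are invented:

```python
def f_measure(r, p):
    """F = 2 / (1/r + 1/p): the harmonic mean of recall r and precision p
    (equivalently 2*p*r / (p + r)); F = 1 - E when beta = 1."""
    return 2.0 / (1.0 / r + 1.0 / p)

print(f_measure(r=0.25, p=0.5))         # 0.333...
print(2 * 0.5 * 0.25 / (0.5 + 0.25))    # same value via the product form
```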

2003.11.18 SLIDE 41 IS 202 – FALL 2003

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2003.11.18 SLIDE 42 IS 202 – FALL 2003

Test Collections

• Cranfield 2
  – 1400 Documents, 221 Queries
  – 200 Documents, 42 Queries
• INSPEC – 542 Documents, 97 Queries
• UKCIS – >10,000 Documents, multiple sets, 193 Queries
• ADI – 82 Documents, 35 Queries
• CACM – 3204 Documents, 50 Queries
• CISI – 1460 Documents, 35 Queries
• MEDLARS (Salton) – 273 Documents, 18 Queries

2003.11.18 SLIDE 43 IS 202 – FALL 2003

TREC

• Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology)
  – 1999 was the 8th year; 9th TREC in early November

• Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs
  – Newswire & full-text news (AP, WSJ, Ziff, FT)
  – Government documents (Federal Register, Congressional Record)
  – Radio transcripts (FBIS)
  – Web “subsets” (“Large Web” separate, with 18.5 million pages of Web data – 100 GBytes)
  – Patents

2003.11.18 SLIDE 44 IS 202 – FALL 2003

TREC (cont.)

• Queries + Relevance Judgments
  – Queries devised and judged by “Information Specialists”
  – Relevance judgments done only for those documents retrieved – not the entire collection!

• Competition
  – Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  – Results judged on precision and recall, going up to a recall level of 1000 documents

• Following slides are from TREC overviews by Ellen Voorhees of NIST

2003.11.18 SLIDES 45-49 IS 202 – FALL 2003

[Slides 45-49: figures from the TREC overview by Ellen Voorhees of NIST]

2003.11.18 SLIDE 50 IS 202 – FALL 2003

Sample TREC Query (Topic)

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description:A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)

<narr> Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

2003.11.18 SLIDES 51-55 IS 202 – FALL 2003

[Slides 51-55: figures from the TREC overview by Ellen Voorhees of NIST]

2003.11.18 SLIDE 56 IS 202 – FALL 2003

TREC

• Benefits:
  – Made research systems scale to large collections (at least pre-WWW “large”)
  – Allows for somewhat controlled comparisons

• Drawbacks:
  – Emphasis on high recall, which may be unrealistic for what many users want
  – Very long queries, also unrealistic
  – Comparisons still difficult to make, because systems are quite different on many dimensions
  – Focus on batch ranking rather than interaction
    • There is an interactive track, but not a lot is being learned, given the constraints of the TREC evaluation process

2003.11.18 SLIDE 57 IS 202 – FALL 2003

TREC is Changing

• Emphasis on specialized “tracks”
  – Interactive track
  – Natural Language Processing (NLP) track
  – Multilingual tracks (Chinese, Spanish)
  – Filtering track
  – High-Precision
  – High-Performance

• http://trec.nist.gov/

2003.11.18 SLIDE 58 IS 202 – FALL 2003

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2003.11.18 SLIDE 59 IS 202 – FALL 2003

Blair and Maron 1985

• A classic study of retrieval effectiveness
  – Earlier studies were on unrealistically small collections

• Studied an archive of documents for a legal suit
  – ~350,000 pages of text
  – 40 queries
  – Focus on high recall
  – Used IBM’s STAIRS full-text system

• Main Result:
  – The system retrieved less than 20% of the relevant documents for a particular information need
  – Lawyers thought they had 75%

• But many queries had very high precision

2003.11.18 SLIDE 60 IS 202 – FALL 2003

Blair and Maron (cont.)

• How they estimated recall
  – Generated partially random samples of unseen documents
  – Had users (unaware these were random) judge them for relevance

• Other results:
  – Two lawyers’ searches had similar performance
  – Lawyers’ recall was not much different from the paralegals’
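A back-of-the-envelope version of the sampling idea described above: extrapolate the relevance rate observed in a judged random sample of unseen documents to estimate the total number of relevant documents, and hence recall. All numbers are invented for illustration, not Blair and Maron's data:

```python
def estimated_recall(relevant_retrieved, sample_size, relevant_in_sample, unretrieved_total):
    """Estimate recall when the full relevant set is unknown: extrapolate the
    sample's relevance rate to all unretrieved documents."""
    est_relevant_unretrieved = (relevant_in_sample / sample_size) * unretrieved_total
    est_total_relevant = relevant_retrieved + est_relevant_unretrieved
    return relevant_retrieved / est_total_relevant

# Invented numbers: 40 relevant documents found among those retrieved; a random
# sample of 500 unretrieved documents contains 2 relevant ones; 50,000 documents
# were never retrieved at all.
print(round(estimated_recall(40, 500, 2, 50_000), 2))   # ~0.17
```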

2003.11.18 SLIDE 61 IS 202 – FALL 2003

Blair and Maron (cont.)

• Why recall was low
  – Users can’t foresee exact words and phrases that will indicate relevant documents
    • “accident” referred to by those responsible as “event,” “incident,” “situation,” “problem,” …
    • Differing technical terminology
    • Slang, misspellings
  – Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

2003.11.18 SLIDE 62 IS 202 – FALL 2003

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2003.11.18 SLIDE 63 IS 202 – FALL 2003

Carolyn Cracraft Questions

• 1. It would seem that some of the problems that contributed to poor recall, particularly spelling, might still be mitigated automatically. Given your experience with spell check, do you think it would be at all reasonable to run it on a document without human intervention? Sometimes it does make rather fanciful suggestions for replacements. Is there any spell check system out there that includes similarity calculations - i.e. where you could set a threshold and only automatically correct spelling if the word and the suggested replacement were, say, 80% similar? Should I have realized by this point in the course that computers are NEVER going to deal adequately with nuances of language?

2003.11.18 SLIDE 64 IS 202 – FALL 2003

Carolyn Cracraft Questions

• 2. Certainly, the system seems to cry out for some metadata or a controlled vocabulary. But given the scenario in which the database was created (i.e. paralegals indexing documents relevant to just one case), it seems like it would be unreasonable to rigorously develop a vocabulary for every case handled by a law firm. But would even an ad-hoc version help with recall? If the paralegals just quickly agreed on a set of keywords relevant to the case and assigned these words as appropriate to the documents entered, would that have enough of an effect on recall to justify the extra time spent?

2003.11.18 SLIDE 65 IS 202 – FALL 2003

Megan Finn Questions

• The experimenters mention that one problem with their method was that they didn't give clear instructions on how long to spend on each document. It also seems like seeing the queries relative to each other might influence how system users rank documents. (For example, a user might say that a document seems more relevant to query 1 than query 2, rather than judging them solely on their own merits.) What are some problems that you see with their experiment design?

2003.11.18 SLIDE 66 IS 202 – FALL 2003

Megan Finn Questions

• It seems like one of the most challenging parts of this experiment is getting enough people to sit down and use their tools for two hours. Evaluating IR systems by tracking human interaction with query results seems like it could be easier. How would you measure the effectiveness of an IR system through tracking human action (something like clickthroughs)? Would tracking human action be likely to give you the same results as RAVe Reviews?

2003.11.18 SLIDE 67 IS 202 – FALL 2003

Megan Finn Questions

• If the ultimate goal of IR is to provide the user with results that are relevant to them (not necessarily to everyone), is there a way to utilize the results of this experiment to return results that are more relevant to that user?

2003.11.18 SLIDE 68 IS 202 – FALL 2003

Margaret Spring Questions

• A Case For Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness (Koenemann & Belkin)
  The article distinguishes between opaque, transparent and penetrable feedback. Since even inexperienced users (in 1996) had notably better success using penetrable feedback, why isn’t this approach seen more often in online search/retrieval tools?

2003.11.18 SLIDE 69 IS 202 – FALL 2003

Margaret Spring Questions

• The researchers seemed dismayed that users became "lazy" in term generation & relied too heavily on term selection from feedback. Doesn’t the willingness and preference to focus a search through multiple interactions of feedback indicate further support for penetrable feedback? Is this indicative of an outdated perspective on user patience with/expectations of system processing abilities?

2003.11.18 SLIDE 70 IS 202 – FALL 2003

Margaret Spring Questions

• Analysis of individual tests seemed to indicate that if a user was guided to particular search topics via feedback, the user was far more likely to successfully identify the relevant documents. Does this search topic “favoritism” indicate an application weakness in not connecting the term more effectively to other topics, or does it indicate a discrepancy between user vocabulary and document vocabulary?

2003.11.18 SLIDE 71 IS 202 – FALL 2003

Jeff Towle Questions

• 1) GroupLens, Ringo and other collaborative filtering systems (such as Amazon's) are all applicable in domains that I would describe as very 'taste'-sensitive. Does collaborative filtering have further applications? Or is information retrieval in general a method of matching tastes?

• 2) The GroupLens authors propose filter-bots as a method of dealing with the sparsity problem. Would such bots provide meaningful results, or is sparsity simply another method of rating that would be lost with the use of bots?

2003.11.18 SLIDE 72 IS 202 – FALL 2003

Rebecca Shapley Questions

• Ringo was successful, and the second-to-last paragraph essentially predicts Amazon.com's current recommendation system. Can you think of other good applications of computerized social filtering? What characteristics of these applications make them most amenable to a social filtering system?

• We've considered and even experienced how values are incorporated into classification structures. What values are built into the Ringo example? More broadly, how are values built into social filtering systems? What implications does that have?

2003.11.18 SLIDE 73 IS 202 – FALL 2003

Rebecca Shapley Questions

• This paper shows that a "constrained Pearson" calculation of similarity of user profiles works best out of the three approaches they tried. The "constrained Pearson" considers similarity of user profiles both for artists they like AND for ones they don't like, and adjusts the equation to incorporate the specific numbers resulting from the 7-point scale. As you read the article, were there any ideas that occurred to you to try, to see if those would improve the performance of the system at recommending?

2003.11.18 SLIDE 74 IS 202 – FALL 2003

Rebecca Shapley Questions

• Who else might find the collected user-profiles from a social filtering system useful? Mapping the user profiles in n-dimensional hyperspace, they might cluster into groups roughly representing stereotypical consumer appetites, or they might be more spread out, more web-like. Would marketers look at this structure or the texture of their product in this web for useful info? Would users like to "travel" this web, seeing what people sort-of like them like? Would your user profile in something like Ringo or Amazon.com be an asset in a dating service? A liability under the Patriot Act?

2003.11.18 SLIDE 75 IS 202 – FALL 2003

Next Time

• Assignment 8

• Web Searching and Crawling

• Readings/Discussion
  – The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin, Sergey and Page, Lawrence); Jesse
  – Mercator: A Scalable, Extensible Web Crawler (Heydon, Allan and Najork, Marc); Yuri