
User Performance versus Precision Measures for Simple Search Tasks

(Don’t bother improving MAP)

Andrew Turpin

Falk Scholer

{aht,fscholer}@cs.rmit.edu.au

People in glass houses should not throw stones

http://www.hartley-botanic.co.uk/hartley_images/victorian_range/victorian_range_09.jpg

Scientists should not live in glass houses. Nor straw, nor wood…

http://www-math.uni-paderborn.de/~odenbach/pics/pigs/pig2.jpg

Scientists should do more than throw stones

www.worth1000.com/entries/161000/161483INPM_w.jpg

Overview

• How are IR systems compared?
  – Mean Average Precision: MAP

• Do metrics match user experience?
  • First grain (Turpin & Hersh SIGIR 2000)
  • Second pebble (Turpin & Hersh SIGIR 2001)
  • Third stone (Allan et al SIGIR 2005)
  • This golf ball (Turpin & Scholer SIGIR 2006)

[Worked example: two ranked result lists for the same query, with binary relevance judgments]

P@1: 0/1 = 0.00 (list 1) versus 0/1 = 0.00 (list 2)

P@5: 1/5 = 0.20 (list 1) versus 2/5 = 0.40 (list 2)

AP = sum of precision values at relevant documents / number of relevant docs in the list
  List 1: (0.25) / 1 = 0.25
  List 2: (0.67 + 0.40) / 2 = 0.54

AP = sum of precision values at relevant documents / number of relevant docs in all lists (here 3)
  List 1: (0.25) / 3 = 0.08
  List 2: (0.67 + 0.40) / 3 = 0.36
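These numbers can be reproduced in a few lines of code. Below is a minimal sketch, assuming binary relevance judgments given in rank order; the function names are illustrative, not from the talk.

```python
# Minimal sketch of P@k and AP for a single ranked list, assuming binary
# relevance judgments in rank order (1 = relevant, 0 = not relevant).
# Function names are illustrative, not from the talk.

def precision_at_k(rels, k):
    """P@k: fraction of the top k documents that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels, total_relevant=None):
    """AP: average of the precision values at the ranks of relevant documents.

    By default the divisor is the number of relevant documents in the list;
    pass total_relevant to divide by all relevant documents for the query
    instead (the second definition above).
    """
    precisions = [precision_at_k(rels, i + 1) for i, r in enumerate(rels) if r]
    divisor = total_relevant if total_relevant is not None else len(precisions)
    return sum(precisions) / divisor if divisor else 0.0

# A hypothetical list with one relevant document at rank 4, consistent with
# list 1 in the example above:
rels = [0, 0, 0, 1, 0, 0]
print(precision_at_k(rels, 1))       # 0.0
print(precision_at_k(rels, 5))       # 0.2
print(average_precision(rels))       # 0.25
print(average_precision(rels, 3))    # ~0.08, dividing by all 3 relevant docs
```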

Mean Average Precision (MAP)

• Previous example showed precision for one query
• Ideally need many queries (50 or more)
• Take the mean of the AP values over all queries: MAP
• Do a paired t-test, Wilcoxon, Tukey HSD, … (see the sketch below)
• Compares systems on the same collection and same queries
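A rough sketch of that comparison, assuming per-query AP scores are already available for both systems; the AP values below are invented purely for illustration.

```python
# Compare two systems by MAP over the same queries, with paired significance
# tests. The per-query AP values are made up for illustration only.
from statistics import mean
from scipy.stats import ttest_rel, wilcoxon

ap_a = [0.25, 0.31, 0.18, 0.40, 0.22, 0.35]   # AP per query, system A
ap_b = [0.36, 0.29, 0.25, 0.44, 0.30, 0.41]   # AP per query, system B

print("MAP A =", mean(ap_a), " MAP B =", mean(ap_b))

# Paired tests, since both systems answer the same queries on the same collection.
print(ttest_rel(ap_a, ap_b))
print(wilcoxon(ap_a, ap_b))
```

In practice you would want the 50 or more queries mentioned above before reading much into the p-values.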

Similarity Measure   Simple Terms   Simple Terms + Phrases   Percentage Improvement
Lnu.ltu              0.3616         0.3758                   3.9% (unknown)
BBA-AGJ-BCA          0.3497         0.3683                   5.1% (p = 0.006)
BDA-CI-BCA           0.3373         0.3586                   5.9% (p = 0.006)

Turpin & Moffat SIGIR 1999

Typical IR empirical systems paper

Fang et al SIGIR 2004

Monz et al SIGIR 2005

Shi et al SIGIR 2005

Jordan et al JCDL June 2006

Implicit assumption: more relevant documents high in the list is good

• Do users generally want more than one relevant document?

• Do users read lists top to bottom?
• Who determines relevance? Binary? Conditional or state-based?

• While MAP is tractable, does it reflect user experience?

• Is Yahoo! really better than Google, or vice-versa?

General Experiment

• Get a collection, set of queries, relevance judgments

• Compare System A and System B using MAP (Cranfield)

• Get users to do queries with System A or System B (balanced design…)

• Did the users do better with A or B?
• Did the users prefer A or B?

Experiment 2000

24 Users, 6 Queries

Engine A: MAP 0.275, IR (instance recall) 0.330
Engine B: MAP 0.324, IR (instance recall) 0.390

Experiment 2001

32 Users, 8 Queries

Engine A: MAP 0.270, QA (question answering) 66%
Engine B: MAP 0.354, QA (question answering) 60%

Experiment 2005

• James Allan et al, UMass, SIGIR 2005

• Passage retrieval and a recall task

• Used bpref, which “tracks MAP”

• Small benefit to users when bpref goes from 0.50 to 0.60 and from 0.90 to 0.95

• No benefit in the mid range 0.60 to 0.90
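For reference, bpref scores each relevant document by how many judged non-relevant documents are ranked above it. A rough sketch of one common formulation is below; trec_eval's normalisation differs slightly, so treat this as illustrative only.

```python
# Illustrative bpref for one query: judgments is the rank-ordered list of
# binary judgments (1 = relevant, 0 = judged non-relevant); R is the total
# number of judged-relevant documents for the query.
def bpref(judgments, R):
    nonrel_seen = 0
    score = 0.0
    for j in judgments:
        if j == 0:
            nonrel_seen += 1
        else:
            # Each relevant document loses credit for the judged non-relevant
            # documents ranked above it (capped at R).
            score += 1.0 - min(nonrel_seen, R) / R
    return score / R if R else 0.0

print(bpref([0, 1, 1, 0, 1], R=3))   # ~0.56
```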

                     Predicted   Actual
Instance recall      81%         15% (p = 0.27)
Question answering   58%         -6% (p = 0.41)

Experiments 2000, 2001, 2005

[Table: improvement in MAP versus the corresponding change in the user measure for each experiment; values shown include 20% / 20% (Exp 2005), 16% / 1%, and 50% / 0%]

Experiment 2006

32 Users, 50 Queries, result lists of 100 documents

System A: MAP 0.55
System B: MAP 0.65
System C: MAP 0.75
System D: MAP 0.85
System E: MAP 0.95

Our Sheep

[Figure: time (seconds) required to find the first relevant document, plotted against system MAP 0.55–0.95]

Failures

[Figure: percentage of queries with no relevant answer found (0–25%), plotted against MAP 55%–95%]

“Better” MAP definition

Conclusion

• MAP does allow us to compare IR systems, but the assumption that an increase in MAP translates into an increase in user performance or satisfaction is not true
  – Supported by 4 different experiments

• Don't automatically choose MAP as a metric
  – P@1 for Web-style tasks?

[Figure: time (seconds) to find the first relevant document, plotted against P@1 and P@10]

[Figure: rank positions (0–10% through 90–100% of the result list) of saved/viewed documents, and the number of relevant documents found]
