
Page 1: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

User Performance versus Precision Measures for Simple Search Tasks

(Don’t bother improving MAP)

Andrew Turpin

Falk Scholer

{aht,fscholer}@cs.rmit.edu.au

Page 2: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

People in glass houses should not throw stones

http://www.hartley-botanic.co.uk/hartley_images/victorian_range/victorian_range_09.jpg

Page 3: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Scientists should not live in glass houses. Nor straw, nor wood…

http://www-math.uni-paderborn.de/~odenbach/pics/pigs/pig2.jpg

Page 4: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Scientists should do more than throw stones

www.worth1000.com/entries/161000/161483INPM_w.jpg

Page 5: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Overview

• How are IR systems compared?
  – Mean Average Precision: MAP

• Do metrics match user experience?
  • First grain (Turpin & Hersh SIGIR 2000)
  • Second pebble (Turpin & Hersh SIGIR 2001)
  • Third stone (Allan et al SIGIR 2005)
  • This golf ball (Turpin & Scholer SIGIR 2006)

Page 6: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

[Worked example: two ranked result lists with binary relevance judgments (0/1) and the precision value computed at each rank.]

        List A          List B
P@1     0/1 = 0.00      0/1 = 0.00
P@5     1/5 = 0.20      2/5 = 0.40
AP      0.25            0.54      (average of the precision values at the relevant documents)

Page 7: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

AP = (sum of all precision values at relevant documents) / (number of relevant docs in the list)

     List A: (0.25) / 1 = 0.25          List B: (0.67 + 0.40) / 2 = 0.54

AP = (sum of all precision values at relevant documents) / (number of relevant docs in all lists)

     List A: (0.25) / 3 = 0.08          List B: (0.67 + 0.40) / 3 = 0.36
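A minimal Python sketch of these calculations. List A below assumes the single relevant document sits at rank 4, which is consistent with the values above (P@5 = 1/5, precision at the relevant rank = 0.25); the divisor is either the number of relevant documents in the list (1) or in all lists (3).

```python
def precision_at_k(rels, k):
    """Fraction of the top k documents that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels, total_relevant):
    """Sum of the precision values taken at each relevant rank,
    divided by the chosen number of relevant documents."""
    at_relevant = [precision_at_k(rels, i + 1) for i, rel in enumerate(rels) if rel]
    return sum(at_relevant) / total_relevant

# List A from the example: a single relevant document, assumed to be at rank 4.
list_a = [0, 0, 0, 1, 0, 0]

print(precision_at_k(list_a, 1))     # 0.0   -> P@1 = 0/1
print(precision_at_k(list_a, 5))     # 0.2   -> P@5 = 1/5
print(average_precision(list_a, 1))  # 0.25  -> normalised by relevant docs in the list
print(average_precision(list_a, 3))  # 0.083 -> normalised by 3 relevant docs in all lists
```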

Page 8: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Mean Average Precision (MAP)

• Previous example showed precision for one query

• Ideally need many queries (50 or more)
• Take the mean of the AP values over all queries: MAP
• Do a paired t-test, Wilcoxon, Tukey HSD, …
• Compares systems on the same collection and same queries
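As a concrete illustration, here is a minimal sketch of such a comparison in Python, using hypothetical per-query AP scores for two systems over the same queries (scipy supplies the paired t-test and Wilcoxon signed-rank test named above):

```python
from scipy import stats

def mean_average_precision(ap_scores):
    """MAP: the mean of the per-query AP values."""
    return sum(ap_scores) / len(ap_scores)

# Hypothetical per-query AP values for two systems over the same ten queries
# (a real comparison would use 50 or more queries, as noted above).
ap_a = [0.21, 0.35, 0.10, 0.44, 0.28, 0.19, 0.51, 0.08, 0.33, 0.26]
ap_b = [0.25, 0.38, 0.09, 0.50, 0.31, 0.22, 0.55, 0.12, 0.30, 0.29]

print(mean_average_precision(ap_a), mean_average_precision(ap_b))

# Paired tests, since both systems are run on the same queries.
print(stats.ttest_rel(ap_a, ap_b))
print(stats.wilcoxon(ap_a, ap_b))
```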

Page 9: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Similarity Measure   Simple Terms   Simple Terms + Phrases   Percentage Improvement
Lnu.ltu              0.3616         0.3758                   3.9%  (significance unknown)
BBA-AGJ-BCA          0.3497         0.3683                   5.1%  (p = 0.006)
BDA-CI-BCA           0.3373         0.3586                   5.9%  (p = 0.006)

Turpin & Moffat SIGIR 1999

Typical IR empirical systems paper

Page 10: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Fang et al SIGIR 2004

Monz et al SIGIR 2005

Shi et al SIGIR 2005

Jordan et al JCDL June 2006

Page 11: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Implicit assumption: more relevant documents high in the list is good

• Do users generally want more than one relevant document?

• Do users read lists top to bottom?
• Who determines relevance? Binary? Conditional or state-based?

• While MAP is tractable, does it reflect user experience?

• Is Yahoo! really better than Google, or vice-versa?

Page 12: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

General Experiment

• Get a collection, set of queries, relevance judgments

• Compare System A and System B using MAP (Cranfield)

• Get users to do queries with System A or System B (balanced design…)

• Did the users do better with A or B?
• Did the users prefer A or B?
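A hedged sketch of the counterbalancing step ("balanced design"), assuming a simple rotation in which the starting system alternates across users so that every query is attempted equally often on System A and System B; the user and query identifiers are illustrative only, not the design of any particular experiment below.

```python
from itertools import cycle

users = [f"user{u:02d}" for u in range(1, 5)]   # hypothetical participant IDs
queries = [f"q{q}" for q in range(1, 7)]        # hypothetical query IDs
systems = ["A", "B"]

# Alternate the starting system across users so that, over all users,
# each query is run the same number of times on System A and on System B.
assignment = {}
for u, user in enumerate(users):
    order = cycle(systems[u % 2:] + systems[:u % 2])
    assignment[user] = [(query, next(order)) for query in queries]

for user, tasks in assignment.items():
    print(user, tasks)
```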

Page 13: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Experiment 2000

24 users, 6 queries

        Engine A    Engine B
MAP     0.275       0.324
IR      0.330       0.390

(IR = instance recall)

Page 14: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Experiment 2001

32 users, 8 queries

        Engine A    Engine B
MAP     0.270       0.354
QA      66%         60%

(QA = question answering accuracy)

Page 15: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Experiment 2005

• James Allan et al, UMass, SIGIR 2005

• Passage retrieval and a recall task

• Used bpref, which “tracks MAP”

• Small benefit to users when bpref goes from 0.50 to 0.60 and from 0.90 to 0.95

• No benefit in the mid range 0.60 to 0.90
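bpref is not defined in the slides; the sketch below follows the commonly used Buckley–Voorhees formulation (only judged documents count, and each retrieved relevant document is penalised by the judged non-relevant documents ranked above it). The run and judgments here are hypothetical.

```python
def bpref(ranked_docs, judgments):
    """Binary preference: each retrieved relevant document is penalised by the
    number of judged non-relevant documents ranked above it; unjudged
    documents are ignored entirely."""
    R = sum(1 for rel in judgments.values() if rel)       # judged relevant
    N = sum(1 for rel in judgments.values() if not rel)   # judged non-relevant
    if R == 0:
        return 0.0
    score, nonrel_above = 0.0, 0
    for doc in ranked_docs:
        if doc not in judgments:
            continue                                      # unjudged: skip
        if judgments[doc]:
            if N == 0:
                score += 1.0
            else:
                score += 1.0 - min(nonrel_above, R) / min(R, N)
        else:
            nonrel_above += 1
    return score / R

# Hypothetical ranked run and binary judgments (True = relevant).
run = ["d1", "d3", "d4", "d7", "d9"]
judgments = {"d1": True, "d4": True, "d3": False, "d7": False, "d9": False}
print(bpref(run, judgments))   # 0.75: d1 has no non-relevant above it, d4 has one
```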

Page 16: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Experiments 2000, 2001, 2005

MAP improvement predicted versus actual user improvement:

                       Predicted    Actual
Instance recall        81%          15%  (p = 0.27)
Question answering     58%          -6%  (p = 0.41)
Exp 2005               20%          20%

[Chart: batch MAP improvement against actual user improvement; the remaining bars, for Exp 2001 and Exp 2000, show 16% / 1% and 50% / 0%.]

Page 17: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Experiment 2006

32 users, 50 queries, ranked lists of 100 documents

System   A      B      C      D      E
MAP      0.55   0.65   0.75   0.85   0.95
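The slides do not describe how lists with these exact MAP values were built; the sketch below shows one straightforward way to do it under assumed parameters (binary relevance, 10 relevant documents per 100-document list, a random hill-climb toward the target AP), not necessarily the authors' procedure.

```python
import random

def average_precision(rels):
    """AP over one ranked list of binary judgments (1 = relevant),
    normalised by the number of relevant documents in the list."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

def list_with_target_ap(target, n_docs=100, n_relevant=10,
                        tol=0.01, max_steps=100_000, seed=0):
    """Start from a perfect list (all relevant documents on top, AP = 1.0)
    and keep swapping a relevant with a non-relevant position whenever the
    swap moves the list's AP closer to the target."""
    rng = random.Random(seed)
    rels = [1] * n_relevant + [0] * (n_docs - n_relevant)
    err = abs(average_precision(rels) - target)
    for _ in range(max_steps):
        if err <= tol:
            break
        i, j = rng.randrange(n_docs), rng.randrange(n_docs)
        if rels[i] == rels[j]:
            continue
        rels[i], rels[j] = rels[j], rels[i]
        new_err = abs(average_precision(rels) - target)
        if new_err < err:
            err = new_err
        else:
            rels[i], rels[j] = rels[j], rels[i]   # undo swaps that do not help
    return rels

for target in (0.55, 0.65, 0.75, 0.85, 0.95):
    rels = list_with_target_ap(target)
    print(target, round(average_precision(rels), 3))
```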

Page 18: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Our Sheep

Page 19: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )
Page 20: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Time required to find first relevant document

[Plot: time in seconds (roughly 50 to 300) against lists with MAP 0.55, 0.65, 0.75, 0.85, 0.95]

Page 21: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Failures

[Plot: percentage of queries with no relevant answer found (0 to 25%) against lists with MAP 55%, 65%, 75%, 85%, 95%]

Page 22: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

“Better” MAP definition

Page 23: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Conclusion

• MAP does allow us to compare IR systems, but the assumption that an increase in MAP translates into an increase in user performance or satisfaction is not true
  – Supported by 4 different experiments

• Don’t automatically choose MAP as a metric
  – P@1 for Web style tasks?

Page 24: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

[Plot: time in seconds (roughly 50 to 300), grouped by P@1 = 0 versus P@1 = 1]

Page 25: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

[Chart: results binned into 0-10%, 10-20%, …, 90-100%]

Page 26: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Rank of saved/viewed docs

Page 27: User Performance versus Precision Measures for Simple Search Tasks ( Don’t bother improving MAP )

Number of relevant found