Validity and Reliability of Cranfield-like Evaluation in Information Retrieval

Julián Urbano

Glasgow, Scotland · September 2013

Picture by Tom Parnell

Talk outline

• Why we want to Evaluate…

• …and what we do with Cranfield

• Validity: users versus systems

• Reliability: estimating from samples

Why we want to Evaluate

The two questions

• How good is my system?

– What does good mean?

– What is good enough?

• Is system A better than system B?

– What does better mean?

– How much better?

• Efficiency? Effectiveness? Ease?

Measure user experience

• Time to complete task

• Idle time

• Success rate

• Failure rate

• Frustration

• Ease to learn

• Ease to use

…and a long etcetera

We want to know some distributions

• For an arbitrary user, need and document collection, what is the distribution of:

• They describe user experience, fully

[Figures: distribution of time to complete task; distribution of frustration (none, some, much)]

The big(ger) picture

• Different user-measures attempting to assess the same thing: user satisfaction

– How likely is it that an arbitrary user, with an arbitrary need (and with an arbitrary document collection) will be satisfied by the system?

• This is the ultimate goal: the good, the better

The big(ger) question

• User satisfaction…as Bernoulli trial

• Probability of satisfaction?

• Probability that k in n users are satisfied?

• Probability of >80% users satisfied?

[Figure: distribution of satisfaction (yes, no)]

What we do with Cranfield

Sources of variability

user-measure = f(documents, need, user, system)

• Try to estimate the user-measure distribution

– Sample documents, needs and users

– Problematic

• Representativeness

• Cost

• Ethics

– Hard to replicate and repeat results

Fix samples

• Get a (hopefully) good sample and fix it

– Document collection

– Topic set

– A step towards reproducibility

• Still have to sample users, but can’t fix them!

– Very large source of variability

– Hard to replicate and repeat experiments

– Complex, costly, ethical issues

– Example: ASTIA-Uniterm studies

Simulate users…and fix them

• Cleverdon’s idea: remove users, but include a static user component, fixed across experiments

– The judgments in the ground truth

• Remove all sources of variability, except systems


user-measure = f(system)

Test collections

user-measure = f(system)

• Test collections are tools to estimate distributions of user-measures

– Reproducibility becomes possible and easy

– Experiments are inexpensive (collections are not)

– Research becomes systematic

Wait a minute

• Are we estimating distributions about users or distributions about systems?

system-effectiveness = f(system, measure)

• We come up with different distributions of system-effectiveness, one per measure

• Each measure has its own assumptions

Assumption

• System-measures correspond to user-measures

• Users: time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use, satisfaction, …

• Systems: P, AP, RR, DCG, nDCG, ERR, GAP, Q, …

Assumption

• Well, at least we assume the correlation

– Are they correlated? How well?

• Test collections: estimators of user distributions

– What we want to measure: user satisfaction

– What we do measure: system effectiveness

Validity and Reliability

• Validity: are we measuring what we want to?

– External validity: Are topics, documents and assessors representative?

– Construct validity: Do system-measures correspond to user-measures?

– Conclusion validity: Is system A really better than system B?

• Reliability: how repeatable are the results?

– How large do collections have to be to ensure repeatability with a different sample?

Validity

Assumption

• Systems with better effectiveness are perceived by users as more useful, more satisfactory

• Tricky: different effectiveness measures and relevance scales give different results

– Which one is better to predict satisfaction?

• The goal is user satisfaction, not system effectiveness

Mapping

• Try to map system effectiveness onto user satisfaction, experimentally

• If P@10 = 0.2, how likely is it that the user will find the results satisfactory?

• What if DCG@20 = 0.467?

• What if ERR = 0.9?

User-oriented System-measures

• Effectiveness measures are generally not formulated to correlate with user-satisfaction

• If effectiveness is 0, we expect 0% probability of user satisfaction

• If effectiveness is 1, we expect 100% probability

• If effectiveness is 𝜆, we expect 100𝜆%

• But this is not what we have

Unbounded measures

$DCG@k = \sum_{i=1}^{k} \frac{gain(r_i)}{discount(i)}$

• Upper bound depends on cutoff, gain function and relevance scale

– Normalize effectiveness between 0 and 1

– What is the best we can do with 𝑘 documents?

$DCG@k = \frac{\sum_{i=1}^{k} gain(r_i) / discount(i)}{\sum_{i=1}^{k} gain(r_i^*) / discount(i)}$

Recall-oriented measures

$AP@k = \frac{1}{\mathcal{R}_1} \sum_{i=1}^{k} r_i \cdot P@i$

• $AP@k = 1$ only possible if $k \geq \mathcal{R}_1$

• Reformulate towards users

– What is the best we can do with 𝑘 documents, regardless of the judgments in the ground truth?

$AP@k = \frac{1}{k} \sum_{i=1}^{k} r_{A_i} \cdot P@i$
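The two normalizations of AP@k above can be compared with a minimal sketch. It assumes binary relevance, a ranking with at least k documents, and hypothetical function names (`precision_at`, `ap_at_k`, `user_ap_at_k`) that are not from the talk.

```python
def precision_at(rels, i):
    """P@i: fraction of relevant documents among the top i (binary rels)."""
    return sum(rels[:i]) / i


def ap_at_k(rels, k, n_relevant):
    """Recall-oriented AP@k: normalized by the number of relevant documents."""
    total = sum(rels[i - 1] * precision_at(rels, i) for i in range(1, k + 1))
    return total / n_relevant


def user_ap_at_k(rels, k):
    """User-oriented AP@k: normalized by the cutoff k instead."""
    total = sum(rels[i - 1] * precision_at(rels, i) for i in range(1, k + 1))
    return total / k


# Five relevant documents retrieved in the top 5, but 10 relevant ones exist.
perfect_top5 = [1, 1, 1, 1, 1]
print(ap_at_k(perfect_top5, k=5, n_relevant=10))  # capped below 1 by the ground truth
print(user_ap_at_k(perfect_top5, k=5))            # the best a user can see in 5 documents
```

With the recall-oriented form a perfect top 5 cannot reach 1 when more than five relevant documents exist; the user-oriented form only asks what is achievable within the k documents shown.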

Ideal ranking

$nDCG@k = \frac{\sum_{i=1}^{k} gain(r_i) / discount(i)}{\sum_{i=1}^{k} gain(ideal_i) / discount(i)}$

• If there is only one relevant, 𝑛𝐷𝐶𝐺@10 = 1 even if we retrieve nine nonrelevants

• Assume the ideal ranking has only excellent documents, with maximum relevance

$nDCG@k = \frac{\sum_{i=1}^{k} gain(r_i) / discount(i)}{\sum_{i=1}^{k} gain(r_i^*) / discount(i)}$

• This is basically user-oriented 𝐷𝐶𝐺@𝑘
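The one-relevant-document case can be reproduced with a short sketch. The linear gain, the log2(i + 1) discount and all function names are assumptions chosen for illustration; the talk does not fix them on this slide.

```python
import math


def linear_gain(r):
    """Assumed gain function: the relevance level itself."""
    return r


def log_discount(i):
    """Assumed discount function: log2(i + 1) for rank i (1-based)."""
    return math.log2(i + 1)


def dcg_at_k(rels, k, gain=linear_gain, discount=log_discount):
    """DCG@k over the relevance levels of a ranked list."""
    return sum(gain(r) / discount(i) for i, r in enumerate(rels[:k], start=1))


def ndcg_at_k(rels, judged, k):
    """Classic nDCG@k: the ideal ranking is built from the judged documents."""
    ideal = sorted(judged, reverse=True)
    return dcg_at_k(rels, k) / dcg_at_k(ideal, k)


def user_dcg_at_k(rels, k, max_level):
    """User-oriented variant: the ideal ranking holds k documents of maximum relevance."""
    return dcg_at_k(rels, k) / dcg_at_k([max_level] * k, k)


retrieved = [1] + [0] * 9   # one relevant document, then nine nonrelevant ones
judged = [1] + [0] * 50     # the ground truth contains a single relevant document
print(ndcg_at_k(retrieved, judged, k=10))           # 1.0
print(user_dcg_at_k(retrieved, k=10, max_level=1))  # well below 1
```

The classic normalization rewards the system for something the user cannot perceive (there was nothing else to find), while the user-oriented one reflects that nine of the ten results shown were useless.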

Audio Music Similarity

• Song as input to system, audio signal

• Retrieve songs musically similar to it, by content

• Resembles traditional Ad Hoc retrieval in Text IR

• (Most?) important task in Music IR

– Music recommendation

– Playlist generation

– Plagiarism detection

Measures

• All reformulated, user-oriented

– What is the best we can do under the user model?

• Binary

– P, AP, RR

• Graded

– CG, DCG, Q, RBP, ERR, GAP, ADR, EDCG

– Linear and exponential gains

Relevance scales

• Originally used

– Broad: 3 levels

– Fine: 101 levels

• Artificially made from the Fine scale

– Graded with 3, 4 and 5 levels, evenly spaced

– Binary, with thresholds at 20, 40, 60 and 80
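As a rough sketch of how the artificial scales could be derived from a Fine judgment in 0..100: the exact cut points used in the study are not given on the slide, so the "at or above threshold" rule and the equal-width binning below are assumptions, and the function names are hypothetical.

```python
def to_binary(fine_score, threshold):
    """Binary scale: relevant (1) if the Fine score reaches the threshold (assumed >=)."""
    return 1 if fine_score >= threshold else 0


def to_graded(fine_score, n_levels):
    """Graded scale: split the 101 Fine values into n_levels equal-width bins (assumed)."""
    return fine_score * n_levels // 101


for score in (0, 39, 40, 67, 68, 100):
    print(score, to_binary(score, 40), to_graded(score, 3), to_graded(score, 5))
```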

Measures and Scales

Measure | Original: Broad, Fine | Artificial Graded: 𝑛ℒ = 3, 𝑛ℒ = 4, 𝑛ℒ = 5 | Artificial Binary: ℓ𝑚𝑖𝑛 = 20, ℓ𝑚𝑖𝑛 = 40, ℓ𝑚𝑖𝑛 = 60, ℓ𝑚𝑖𝑛 = 80

𝑃@5 x x x x

𝐴𝑃@5 x x x x

𝑅𝑅@5 x x x x

𝐶𝐺𝑙@5 x x x x x 𝑃@5 𝑃@5 𝑃@5 𝑃@5

𝐶𝐺𝑒@5 x x x x 𝑃@5 𝑃@5 𝑃@5 𝑃@5

𝐷𝐶𝐺𝑙@5 x x x x x x x x x

𝐷𝐶𝐺𝑒@5 x x x x 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5 𝐷𝐶𝐺𝑙@5

𝐸𝐷𝐶𝐺𝑙@5 x x x x x x x x x

𝐸𝐷𝐶𝐺𝑒@5 x x x x 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5 𝐸𝐷𝐶𝐺𝑙@5

𝑄𝑙@5 x x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5

𝑄𝑒@5 x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5

𝑅𝐵𝑃𝑙@5 x x x x x x x x x

𝑅𝐵𝑃𝑒@5 x x x x 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5 𝑅𝐵𝑃𝑙@5

𝐸𝑅𝑅𝑙@5 x x x x x x x x x

𝐸𝑅𝑅𝑒@5 x x x x 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5 𝐸𝑅𝑅𝑙@5

𝐺𝐴𝑃@5 x x x x x 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5 𝐴𝑃@5

𝐴𝐷𝑅@5 x x x x x x x x

Experimental Design

user preference (agrees or disagrees with effectiveness)

non-preference (can’t decide)

What can we infer?

• Preference: difference noticed by user

– Positive: user agrees with evaluation

– Negative: user disagrees with evaluation

• Non-preference: difference not noticed by user

– Good: both systems are satisfactory

– Bad: both systems are not satisfactory

Data

• Queries, documents and judgments from MIREX

– MIREX: TREC-like evaluation forum in Music IR

• 4,115 unique and artificial examples

– Covering full range of effectiveness

• In 10 bins: [0, 0.1), [0.1, 0.2), …, [0.9, 1]

– At least 200 examples per measure/scale/bin

• 432 unique queries, 5,636 unique documents

Collecting User Preferences

• Crowdsourcing

– Quality control through trap examples

• Total: 547 unique subjects, 11,042 preferences

• Accepted: 175 subjects, 9,373 preferences

• After trap questions: 113 subjects

Single system: how good is it?

• 2,025 non-preferences (49%)

– 1,056 satisfactory

– 969 non-satisfactory

What do we expect?



Linear mapping


Large thresholds underestimate satisfaction

Ranking does not affect satisfaction?

Exponential gain underestimates satisfaction


• Measures that best adhere to the diagonal

– 𝐶𝐺𝑙@5, 𝐷𝐶𝐺𝑙@5 and 𝑅𝐵𝑃𝑙@5

– Not necessarily better: just easier to interpret

• About 20% bias at endpoints

– Room for improvement with personalization

• Less sensitive to subjectivity in relevance

– Minimize 𝑃(𝑆𝑎𝑡│0) and maximize 𝑃(𝑆𝑎𝑡│1)

– ℓ𝑚𝑖𝑛 = 40 and 𝐵𝑟𝑜𝑎𝑑 behave better

– 𝐶𝐺@5, 𝐷𝐶𝐺@5, 𝑅𝐵𝑃@5 and 𝐺𝐴𝑃@5

Two systems: which one is better?

• 2,090 preferences (51%)

– 1,019 for system A

– 1,071 for system B

What do we expect?



Users always notice the difference… …regardless of how large it is

Need quite large differences!


More relevance levels better to discriminate


Bad correlation?


• Users prefer the (supposedly) worse system

User Agrees with Evaluation

• Closer to ideal $P(Agg = 1 \mid \Delta\lambda) = 1$

– ℓ𝑚𝑖𝑛 = 80 better among binaries

– 𝐹𝑖𝑛𝑒 better for linear gain

– 𝑛ℒ = 5 better for exponential gain

– 𝐶𝐺@5, 𝐷𝐶𝐺@5, 𝑅𝐵𝑃@5 and 𝐺𝐴𝑃@5

User Disagrees with Evaluation

• Closer to ideal $P(Agg = -1 \mid \Delta\lambda) = 0$

– ℓ𝑚𝑖𝑛 = 40 better among binaries

– 𝐹𝑖𝑛𝑒 better for linear gain

– 𝐵𝑟𝑜𝑎𝑑 better with exponential gain

– 𝐶𝐺@5, 𝐺𝐴𝑃@5, 𝐷𝐶𝐺@5 and 𝑅𝐵𝑃@5

Summary

• Linear gain better than exponential gain

– Except, slightly, in terms of disagreements

• Measures oriented to a single document are not appropriate for a music recommendation setting

• Gain is independent of other documents

• 𝐵𝑟𝑜𝑎𝑑 better to predict satisfaction

• 𝐹𝑖𝑛𝑒 better to predict user agreement

• Binary scales worst overall

Summary

• We can map system effectiveness onto probability of user satisfaction

• ~20% of users disagree with effectiveness

– Practical upper (and lower) bound in evaluation

– Need to incorporate user profiles

• Somehow included in MSD Challenge

• Δ𝜆 ≈ 0.4 needed for users to agree

– Historically observed only 20% of times in MIREX

– Be careful with statistical significance!

Satisfaction over samples

User Satisfaction

• So far only for a query and a user (Bernoulli)

– $P(Sat \mid \lambda_q)$

• Easily for $n$ users (Binomial)

– $P(Sat_n = k \mid \lambda_q)$

• Example: $Q_l@5 = 0.61$

– $P(Sat) \approx 0.7$

– $P(Sat_{15} = 10) \approx 0.21$

• What about a sample of queries 𝒬?
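The binomial step in the example can be checked with a few lines. This is only a sketch: the function names are hypothetical, and the value 0.7 is the probability of satisfaction read off the mapping in the example.

```python
from math import comb


def prob_k_satisfied(p_sat, n, k):
    """P(Sat_n = k): exactly k of n independent users are satisfied."""
    return comb(n, k) * p_sat ** k * (1 - p_sat) ** (n - k)


def prob_at_least(p_sat, n, k_min):
    """P(Sat_n >= k_min): at least k_min of n users are satisfied."""
    return sum(prob_k_satisfied(p_sat, n, k) for k in range(k_min, n + 1))


print(round(prob_k_satisfied(0.7, 15, 10), 2))  # 0.21, as in the example
print(prob_at_least(0.7, 15, 13))               # more than 80% of 15 users satisfied
```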

User Satisfaction over a Sample

$E[P(Sat)] = \frac{1}{n_\mathcal{Q}} \sum_{q \in \mathcal{Q}} P(Sat \mid \lambda_q)$

• Example: satisfaction is underestimated

System Success

• If $P(Sat) \geq threshold$ the system is successful

• If we want the majority of users to be satisfied

– $P(Succ) = 1 - F_{P(Sat)}(0.5)$

• Intuition: improving bad queries is worthier than further improving good ones
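The intuition can be illustrated with a small sketch that uses the empirical distribution of per-query satisfaction probabilities. The per-query values are made up for illustration, the function names are hypothetical, and F is taken as the empirical CDF (so 1 − F(0.5) is the fraction of queries strictly above 0.5).

```python
def expected_satisfaction(p_sat_per_query):
    """E[P(Sat)]: mean probability of satisfaction over the query sample."""
    return sum(p_sat_per_query) / len(p_sat_per_query)


def prob_success(p_sat_per_query, threshold=0.5):
    """P(Succ) = 1 - F(threshold), with F the empirical CDF of P(Sat | query)."""
    above = sum(1 for p in p_sat_per_query if p > threshold)
    return above / len(p_sat_per_query)


# Hypothetical systems with the same average but different distributions.
system_a = [0.4, 0.4, 0.9, 0.9]  # very good on some queries, bad on others
system_b = [0.6, 0.6, 0.7, 0.7]  # bad queries improved, good ones slightly worse

for name, p_sat in (("A", system_a), ("B", system_b)):
    print(name, expected_satisfaction(p_sat), prob_success(p_sat))
```

Both systems have the same expected satisfaction, but only B satisfies the majority of users on every query, which is why the full distribution matters and not just its mean.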

System Success

• Example:

– $E[\Delta\lambda] = -0.0021$

– $E[\Delta P(Sat)] = 0.0011$

– $E[\Delta P(Succ)] = 0.07$

Summary

• Need to consider full distributions

– Always average or good on average?

• Modeling full distribution

– Normal for small query sets, Empirical for large

– Beta always better for 𝐹𝑖𝑛𝑒 scale

Summary

• Intuitive interpretations of effectiveness fail

– Contradictory results in terms of user satisfaction

Reliability

Samples

• Test collections are samples from larger, possibly infinite, populations

– Documents, queries and users

• $\Delta\lambda$ is just an estimate of the population mean $\mu_{\Delta\lambda}$

• How reliable is our conclusion?

Reliability vs Cost

• Building reliable collections is easy

• Just use more documents, queries and assessors

• But it is prohibitively expensive

• Best option is to increase query set size

– Largest source of variability

• How many queries?

– First we need to measure reliability

Data-based approach

1. Randomly split query set

2. Compute indicators of reliability based on these two query subsets

3. Extrapolate to larger query sets

…with some variations
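Steps 1 and 2 can be sketched as follows, using Kendall tau between the system rankings from the two query subsets as the reliability indicator. The tau here is the simple tau-a without tie correction, the function names and toy data are hypothetical, and the extrapolation of step 3 is left out.

```python
import random


def kendall_tau(x, y):
    """Kendall tau-a between two lists of scores (no tie correction)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def split_half_tau(scores, trials=200, seed=0):
    """Mean tau over random splits; scores[s][q] is system s on query q."""
    rng = random.Random(seed)
    n_queries = len(scores[0])
    half = n_queries // 2
    taus = []
    for _ in range(trials):
        queries = list(range(n_queries))
        rng.shuffle(queries)  # step 1: random split of the query set
        first, second = queries[:half], queries[half:2 * half]
        mean_first = [sum(row[q] for q in first) / half for row in scores]
        mean_second = [sum(row[q] for q in second) / half for row in scores]
        taus.append(kendall_tau(mean_first, mean_second))  # step 2: indicator
    return sum(taus) / len(taus)


# Toy data: 5 systems, 20 queries, noisy scores around a per-system mean.
rng = random.Random(1)
toy = [[0.3 + 0.05 * s + rng.gauss(0, 0.2) for _ in range(20)] for s in range(5)]
print(split_half_tau(toy))
```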

Data-based reliability indicators

• Compare results with two collections

– Kendall tau correlation

– AP correlation

– Absolute sensitivity

– Relative sensitivity

– Power ratio

– Minor conflict ratio

– Major conflict ratio

– RMSE

Generalizability Theory approach

• Address variability of scores, not just means

• G-study

– Estimate variance components from previous, representative data

– Usually previous test collections

• D-study

– Estimate reliability based on estimated variance components from G-study

G-study

$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{s:q}^2$

• Estimated with Analysis of Variance

• $\sigma_s^2$: system differences, our goal!

• $\sigma_q^2$: query difficulty

• $\sigma_{s:q}^2$: some systems better for some queries
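A minimal sketch of the G-study estimation, assuming a fully crossed design with one score per system and query, where the interaction is confounded with the residual. It uses the usual expected-mean-squares estimators from a two-way ANOVA and clips negative estimates to zero; the function name is hypothetical and this is not necessarily what GT4IReval does internally.

```python
def g_study(scores):
    """Estimate (var_s, var_q, var_sq) from scores[s][q], systems x queries."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(scores[s][q] for s in range(n_s)) / n_s for q in range(n_q)]

    # Mean squares for systems, queries and the residual (interaction + error).
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in qry_means) / (n_q - 1)
    ss_res = sum(
        (scores[s][q] - sys_means[s] - qry_means[q] + grand) ** 2
        for s in range(n_s)
        for q in range(n_q)
    )
    ms_res = ss_res / ((n_s - 1) * (n_q - 1))

    # Solve the expected mean squares for the variance components.
    var_sq = ms_res
    var_s = max((ms_s - ms_res) / n_q, 0.0)
    var_q = max((ms_q - ms_res) / n_s, 0.0)
    return var_s, var_q, var_sq


# Toy scores for 3 systems on 4 queries (made-up numbers).
toy = [
    [0.10, 0.35, 0.50, 0.25],
    [0.30, 0.45, 0.75, 0.55],
    [0.55, 0.70, 0.85, 0.65],
]
print(g_study(toy))
```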

D-study

• Relative stability: $E\rho^2 = \frac{\sigma_s^2}{\sigma_s^2 + \frac{\sigma_{s:q}^2}{n_q'}}$

• Absolute stability: $\Phi = \frac{\sigma_s^2}{\sigma_s^2 + \frac{\sigma_q^2 + \sigma_{s:q}^2}{n_q'}}$

• Easy to estimate how many queries we need to reach a certain stability level (1MQ track)

– ≈80 queries sufficient for stable rankings

– ≈130 queries for stable absolute scores
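The D-study computations follow directly from the two formulas. In the sketch below the variance components are made-up inputs and the function names are hypothetical; the required number of queries comes from solving each formula for $n_q'$.

```python
import math


def relative_stability(var_s, var_sq, n_q):
    """E rho^2: stability of system rankings with n_q queries."""
    return var_s / (var_s + var_sq / n_q)


def absolute_stability(var_s, var_q, var_sq, n_q):
    """Phi: stability of absolute scores with n_q queries."""
    return var_s / (var_s + (var_q + var_sq) / n_q)


def queries_needed(var_s, error_var, target):
    """Smallest n_q reaching the target coefficient.

    error_var is var_sq for E rho^2, or var_q + var_sq for Phi.
    """
    return math.ceil(target * error_var / ((1 - target) * var_s))


# Hypothetical variance components from a G-study.
var_s, var_q, var_sq = 0.002, 0.020, 0.010
print(relative_stability(var_s, var_sq, n_q=50))
print(absolute_stability(var_s, var_q, var_sq, n_q=50))
print(queries_needed(var_s, var_sq, target=0.95))          # for E rho^2 = 0.95
print(queries_needed(var_s, var_q + var_sq, target=0.95))  # for Phi = 0.95
```

Because $\Phi$ also carries the query-difficulty variance in its error term, it always needs at least as many queries as $E\rho^2$ for the same target.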

G-Theory approach

• How sensitive is the D-study to the initial data used in the G-study?

• How should we interpret G-Theory indicators in practice? What does $E\rho^2 = 0.95$ mean?

• From the above, review reliability of over 40 TREC test collections

Data

• 43 TREC collections

– From TREC 3 to TREC 2011

• 12 tasks across 10 tracks

– Ad hoc, Web, Novelty, Genomics, Robust, Terabyte, Enterprise, Million Query, Medical and Microblog

Sensitivity: experiment

• Vary number of queries in G-study

– From $n_q = 5$ to full set

– Use all runs available

• Run D-study

– Compute $E\rho^2$ and $\Phi$

– Compute $n_q'$ to reach 0.95 stability

• 200 random trials

Variability due to queries

We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.3$, depending on which 10 queries we use

Sensitivity: experiment

• Do the same, but vary number of systems

– From 𝑛𝑠 = 5 to full set

– Use all queries available


Variability due to systems

We may get $E\rho^2 = 0.9$ or $E\rho^2 = 0.5$, depending on which 20 systems we use

Results

• G-Theory is very sensitive to initial data

– Need about 50 queries and 50 systems for differences in $E\rho^2$ and $\Phi$ below 0.1

• Number of queries for $E\rho^2 = 0.95$ may change by orders of magnitude

– Microblog2011 (all 184 systems and 30 queries)

• Need 63 to 133 queries

– Medical2011 (all 34 queries and 40 systems)

• Need 109 to 566 queries

Compute confidence intervals

Account for variability in initial data


Required number of queries to reach the lower end of the interval


Summary in TREC

• $E\rho^2$: mean = 0.88, sd = 0.1

– 95% conf. intervals are 0.1 long

• $\Phi$: mean = 0.74, sd = 0.2

– 95% conf. intervals are 0.19 long

Interpretation: experiment

• Split query set in 2 subsets

– From $n_q = 10$ to half the full set

– Use all runs available

• Run D-study

– Compute $E\rho^2$ and $\Phi$ and map onto $\tau$, sensitivity, power, conflicts, etc.

– Over 28,000 datapoints

*All mappings in the paper

Example: $E\rho^2 \rightarrow \tau$

$E\rho^2 = 0.95 \rightarrow \tau \approx 0.85$

$\tau = 0.9 \rightarrow E\rho^2 \approx 0.97$



Million Query 2007

Million Query 2008



Future predictions

• This allows us to make more informed decisions within a collection

• What about a new collection?

– Fit a single model for each mapping with 90% and 95% prediction intervals

• Assess whether a larger collection is really worth the effort

[Figure annotations: current collection; target]



Example: $\Phi \rightarrow$ rel. sensitivity

Summary

• G-Theory is regarded as more appropriate, easier to use and more powerful for assessing reliability than the traditional data-based approaches

• But it is quite sensitive to initial data used to estimate variance components

– Data-based approaches are too!

• And it is almost impossible to interpret in practice

Summary

• Need about 50 queries and 50 systems to have robust estimates of reliability

– That is a whole collection already!

– Need to use confidence intervals

• Previous interpretation overestimated reliability

– $\tau = 0.9 \rightarrow E\rho^2 \approx 0.97$

– $E\rho^2 = 0.95 \rightarrow \tau \approx 0.85$

Reliability: review of TREC collections

Outline

• Estimate $E\rho^2$ and $\Phi$, with 95% confidence intervals, using the full query set

• Map onto $\tau$, sensitivity, power, conflicts, etc.

• Results within tasks offer a historical perspective on reliability since 1994

*All collections and mappings in the paper

Example: Ad hoc 3-8

• $E\rho^2 \in [0.86, 0.93] \rightarrow \tau \in [0.65, 0.81]$

• minor conflicts $\in [0.6, 8.2]$%

• major conflicts $\in [0.02, 1.38]$%

• Queries to get $E\rho^2 = 0.95$: [37, 233]

• Queries to get $\Phi = 0.95$: [116, 999]

• 50 queries were used

Example: Web ad hoc

• TREC-8 to TREC-2001: WT2g and WT10g

– $E\rho^2 \in [0.86, 0.93] \rightarrow \tau \in [0.65, 0.81]$

– Queries to get $E\rho^2 = 0.95$: [40, 220]

• TREC-2009 to TREC-2011: ClueWeb09

– $E\rho^2 \in [0.8, 0.83] \rightarrow \tau \in [0.53, 0.59]$

– Queries to get $E\rho^2 = 0.95$: [107, 438]

• 50 queries were used

Historical trend

• Decreasing within and across tracks?

• Systems getting better for specific problems?

• Increasing task-specificity in queries?

Historical reliability in TREC

• On average, $E\rho^2 = 0.88 \rightarrow \tau \approx 0.7$

• Some collections clearly unreliable

– Web Distillation 2003, Genomics 2005, Terabyte 2006, Enterprise 2008, Medical 2011 and Web Ad Hoc 2011

• 50 queries not enough for stable rankings, about 200 are needed in most cases

Implications

• Fixing a minimum number of queries across tracks is unrealistic

– Not even across editions of the same task

• Need to analyze on a case-by-case basis, while building the collections

– GT4IReval, R package online

Current and future work

Validity

• Similar studies in Text IR to map effectiveness onto user satisfaction

• Particularly interesting because there are several query types, and users behave differently

– Single measure to use in all cases?

– Use different measures and average them all?

• Further user studies to figure out what makes users say good and better

• How should test collections be extended to incorporate more user information?

Reliability

• Study assessor effect

• Study document collection effect

• Better models to map G-theory indicators onto understandable data-based indicators

• Methods to reliably measure reliability while building the collection

References

General

• Cleverdon, C. W. (1991). The Significance of the Cranfield Tests on Index Languages. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 3–12).

• Sanderson, M. (2010). Test Collection Based Evaluation of Information Retrieval Systems. Foundations and Trends in Information Retrieval, 4(4), 247–375.

• Robertson, S. (2008). On the History of Evaluation in IR. Journal of Information Science, 34(4), 439–456.

• Harman, D. K. (2011). Information Retrieval Evaluation. Synthesis Lectures on Information Concepts, Retrieval, and Services, 3(2), 1–119.

• Voorhees, E. M. (2002). The Philosophy of Information Retrieval Evaluation. In Workshop of the Cross-Language Evaluation Forum (pp. 355–370).

• Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited. Information Processing and Management, 28(4), 467–490.

• Gull, C. D. (1956). Seven Years of Work on the Organisation of Materials in a Special Library. American Documentation, 7(4), 320–329.

• Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems.

• Urbano, J. (2013). Evaluation in Audio Music Similarity. PhD dissertation, University Carlos III of Madrid.

• Trochim, W. M. K., & Donnelly, J. P. (2007). The Research Methods Knowledge Base (3rd ed.). Atomic Dog Publishing.

• Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifflin.

• Zobel, J., Webber, W., Sanderson, M., & Moffat, A. (2011). Principles for Robust Evaluation Infrastructure. In ACM CIKM Workshop on Data infrastructures for Supporting Information Retrieval Evaluation.

Validity

• Allan, J., Carterette, B., & Lewis, J. (2005). When Will Information Retrieval Be “Good Enough”? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 433–440).

• Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness Measures and User Satisfaction. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 773–774).

• Al-Maskari, A., Sanderson, M., Clough, P., & Airio, E. (2008). The Good and the Bad System: Does the Test Collection Predict User’s Effectiveness. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 59–66).

• Bailey, P., Craswell, N., Soboroff, I., Thomas, P., Vries, A. P. de, & Yilmaz, E. (2008). Relevance Assessment: Are Judges Exchangeable and Does it Matter? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 667–674).

• Bennett, P. N., Carterette, B., Chapelle, O., & Joachims, T. (2008). Beyond Binary Relevance: Preferences, Diversity and Set-Level Judgments. ACM SIGIR Forum, 42(2), 53–58.

• Carterette, B. (2011). System Effectiveness, User Models, and User Utility: A General Framework for Investigation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 903–912).

• Carterette, B., Bennett, P. N., Chickering, D. M., & Dumais, S. T. (2008). Here or There: Preference Judgments for Relevance. In European Conference on Information Retrieval (pp. 16–27).

• Carterette, B., & Soboroff, I. (2010). The Effect of Assessor Error on IR System Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 539–546).

• Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and User Evaluations Give the Same Results? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 17–24).

Validity

• Hersh, W., Turpin, A., Sacherek, L., Olson, D., Price, S., Chan, B., & Kraemer, D. (2000). Further Analysis of Whether Batch and User Evaluations Give the Same Results With a Question-Answering Task. In Text REtrieval Conference.

• Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar Songs. In International Society for Music Information Retrieval Conference (pp. 331–336).

• Huffman, S. B., & Hochster, M. (2007). How Well does Result Relevance Predict Session Satisfaction? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 567–573).

• Ingwersen, P., & Järvelin, K. (2005). The Turn: Integration of Information Seeking and Retrieval in Context. Springer.

• Järvelin, K. (2011). IR Research: Systems, Interaction, Evaluation and Theories. ACM SIGIR Forum, 45(2), 17–31.

• Mizzaro, S. (1997). Relevance: The Whole History. Journal of the American Society for Information Science, 48(9), 810–832.

• Sanderson, M., Paramita, M. L., Clough, P., & Kanoulas, E. (2010). Do User Preferences and Evaluation Measures Line Up? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 555–562).

• Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval Research. Journal of Intelligent Information Systems.

• Schedl, M., Stober, S., Gómez, E., Orio, N., & Liem, C. C. S. (2012). User-Aware Music Retrieval. In M. Müller, M. Goto, & M. Schedl (Eds.), Multimodal Music Processing (pp. 135–156). Dagstuhl Publishing.

• Scholer, F., & Turpin, A. (2008). Relevance Thresholds in System Evaluations. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 693–694).

Validity

• Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our Metrics, that They Are Unrealistic. In European Workshop on Human-Computer Interaction and Information Retrieval (pp. 11–12).

• Thom, J. A., & Scholer, F. (2007). A Comparison of Evaluation Measures Given How Users Perform on Search Tasks. In Australasian Document Computing Symposium (pp. 100–103).

• Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 225–231).

• Turpin, A., & Hersh, W. (2002). User Interface Effects in Past Batch Versus User Experiments. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 431–432).

• Turpin, A., & Scholer, F. (2006). User Performance Versus Precision Measures for Simple Search Tasks. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 11–18).

• Urbano, J., Downie, J. S., Mcfee, B., & Schedl, M. (2012). How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval. In International Society for Music Information Retrieval Conference (pp. 181–186).

Reliability

• Allan, J., Aslam, J. A., Carterette, B., Pavlu, V., & Kanoulas, E. (2008). Million Query Track 2008 Overview. In Text REtrieval Conference.

• Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million Query Track 2007 Overview. In Text REtrieval Conference.

• Armstrong, T. G., Moffat, A., Webber, W., & Zobel, J. (2009). Improvements that Don’t Add Up: Ad-Hoc Retrieval Results since 1998. In ACM International Conference on Information and Knowledge Management (pp. 601–610).

• Banks, D., Over, P., & Zhang, N.-F. (1999). Blind Men and Elephants: Six Approaches to TREC data. Information Retrieval, 1(1-2), 7–34.

• Bodoff, D. (2008). Test Theory for Evaluating Reliability of IR Test Collections. Information Processing and Management, 44(3), 1117–1145.

• Bodoff, D., & Li, P. (2007). Test Theory for Assessing IR Test Collections. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 367–374).

• Brennan, R. L. (2001). Generalizability Theory. Springer.

• Buckley, C., & Voorhees, E. M. (2000). Evaluating Evaluation Measure Stability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 33–34).

• Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009). Million Query Track 2009 Overview. In Text REtrieval Conference.

• Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2008). Evaluation Over Thousands of Queries. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 651–658).

• Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., & Allan, J. (2009). If I Had a Million Queries. In European Conference on Information Retrieval (pp. 288–300).

• Lin, W.-H., & Hauptmann, A. (2005). Revisiting the Effect of Topic Set Size on Retrieval Error. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 637–638).

Reliability

• Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 533–540).

• Robertson, S., & Kanoulas, E. (2012). On Per-Topic Variance in IR Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 891–900).

• Sakai, T. (2007). On the Reliability of Information Retrieval Metrics Based on Graded Relevance. Information Processing and Management, 43(2), 531–548.

• Sanderson, M., & Zobel, J. (2005). Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 162–169).

• Sanderson, M., Turpin, A., Zhang, Y., & Scholer, F. (2012). Differences in Effectiveness Across Sub-collections. In ACM International Conference on Information and Knowledge Management (pp. 1965–1969).

• Shavelson, R. J., & Webb, N. M. (1991). Generalizability Theory: A Primer. Sage Publications.

• Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In ACM International Conference on Information and Knowledge Management (pp. 623–632).

• Urbano, J., Marrero, M., & Martín, D. (2013). A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 925–928).

• Urbano, J., Marrero, M., & Martín, D. (2013). On the Measurement of Test Collection Reliability. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 393–402).

• Voorhees, E. M. (2000). Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Information Processing and Management, 36(5), 697–716.

• Voorhees, E. M. (2009). Topic Set Size Redux. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 806–807).

Reliability

• Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 316–323).

• Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. In ACM International Conference on Information and Knowledge Management (pp. 571–580).

• Yilmaz, E., Aslam, J. A., & Robertson, S. (2008). A New Rank Correlation Coefficient for Information Retrieval. In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 587–594).

• Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? In International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 307–314).
