Evaluation in Audio Music Similarity
DESCRIPTION
Audio Music Similarity is a task within Music Information Retrieval that deals with systems that retrieve songs musically similar to a query song according to their audio content. Evaluation experiments are the main scientific tool in Information Retrieval to determine which systems work better and advance the state of the art accordingly. It is therefore essential that the conclusions drawn from these experiments are both valid and reliable, and that we can reach them at a low cost. This dissertation studies these three aspects of evaluation experiments for the particular case of Audio Music Similarity, with the general goal of improving how these systems are evaluated. The traditional paradigm for Information Retrieval evaluation based on test collections is approached as a statistical estimator of certain probability distributions that characterize how users employ systems. In terms of validity, we study how well the measured system distributions correspond to the target user distributions, and how this correspondence affects the conclusions we draw from an experiment. In terms of reliability, we study the optimal characteristics of test collections and statistical procedures, and in terms of efficiency we study models and methods to greatly reduce the cost of running an evaluation experiment.
TRANSCRIPT
![Page 1: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/1.jpg)
Evaluation in Audio Music Similarity
PhD dissertation
by
Julián Urbano
Leganés, October 3rd 2013
Picture by Javier García
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
2
Outline
• Introduction
– Scope
– The Cranfield Paradigm
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
3
Information Retrieval
• Automatic representation, storage and search of unstructured information
– Traditionally textual information
– Lately multimedia too: images, video, music
• A user has an information need and uses an IR system that retrieves the relevant or significant information from a collection of documents
4
Information Retrieval Evaluation
• IR systems are based on models to estimate relevance, implementing different techniques
• How good is my system? What system is better?
• Answered with IR Evaluation experiments
– “if you can’t measure it, you can’t improve it”
– But we need to be able to trust our measurements
• Research on IR Evaluation
– Improve our methods to evaluate systems
– Critical for the correct development of the field
5
History of IR Evaluation research
[Timeline figure, 1960 to 2010, built up over several slides: Cranfield 2, MEDLARS, SMART and SIGIR; then TREC, CLEF, NTCIR and INEX; then ISMIR, MIREX, MusiCLEF and the MSD Challenge]
6
Audio Music Similarity
• Song as input to system, audio signal
• Retrieve songs musically similar to it, by content
• Resembles traditional Ad Hoc retrieval in Text IR
• (Most?) important task in Music IR
– Music recommendation
– Playlist generation
– Plagiarism detection
• Annual evaluation in MIREX
7
Outline
• Introduction
– Scope
– The Cranfield Paradigm
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
8
The two questions
• How good is my system?
– What does good mean?
– What is good enough?
• Is system A better than system B?
– What does better mean?
– How much better?
• Efficiency? Effectiveness? Ease?
10
Measure user experience
• We are interested in user-measures
– Time to complete task, idle time, success rate, failure rate, frustration, ease of learning, ease of use …
– Their distributions describe user experience, fully
• User satisfaction is the bigger picture
– How likely is it that an arbitrary user, with an arbitrary query (and with an arbitrary document collection) will be satisfied by the system?
• This is the ultimate goal: the good, the better
11
The Cranfield Paradigm
• Estimate user-measure distributions
– Sample documents, queries and users
– Monitor user experience and behavior
– Representativeness, cost, ethics, privacy …
• Fix samples to allow reproducibility
– But cannot fix users and their behavior
– Remove users, but include a static user component, fixed across experiments: ground truth judgments
– Still need to include the dynamics of the process: user models behind effectiveness measures and scales
12
Test collections
• Our goal is the users: user-measure = f(system)
• Cranfield measures systems: system-effectiveness = f(system, measure, scale)
• Estimators of the distributions of user-measures
– Only source of variability is the systems themselves
– Reproducibility becomes easy
– Experiments are inexpensive (collections are not)
– Research becomes systematic
13
Validity, Reliability and Efficiency
• Validity: are we measuring what we want to?
– How well are effectiveness and satisfaction correlated?
– How good is good and how much better is better?
• Reliability: how repeatable are the results?
– How large do samples have to be?
– What statistical methods should be used?
• Efficiency: how inexpensive is it to get valid and reliable results?
– Can we estimate results with fewer judgments?
14
Goal of this dissertation
Study and improve the validity, reliability and efficiency
of the methods used to evaluate AMS systems
Additionally, improve meta-evaluation methods
15
Outline
• Introduction
– Scope
– The Cranfield Paradigm
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
16
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
17
Assumption of Cranfield
• Systems with better effectiveness are perceived by users as more useful, more satisfactory
• But different effectiveness measures and relevance scales produce different distributions
– Which one is better to predict user satisfaction?
• Map system effectiveness onto user satisfaction, experimentally
– If P@10 = 0.2, how likely is it that an arbitrary user will find the results satisfactory?
– What if DCG@20 = 0.46?
18
Measures and scales
Measure Original Artificial Graded Artificial Binary
Broad Fine nℒ=3 nℒ=4 nℒ=5 ℓmin=20 ℓmin=40 ℓmin=60 ℓmin=80
P@5 X X X X
AP@5 X X X X
RR@5 X X X X
CGl@5 X X X X X P@5 P@5 P@5 P@5
CGe@5 X X X X P@5 P@5 P@5 P@5
DCGl@5 X X X X X X X X X
DCGe@5 X X X X DCGl@5 DCGl@5 DCGl@5 DCGl@5
EDCGl@5 X X X X X X X X X
EDCGe@5 X X X X EDCGl@5 EDCGl@5 EDCGl@5 EDCGl@5
Ql@5 X X X X X AP@5 AP@5 AP@5 AP@5
Qe@5 X X X X AP@5 AP@5 AP@5 AP@5
RBPl@5 X X X X X X X X X
RBPe@5 X X X X RBPl@5 RBPl@5 RBPl@5 RBPl@5
ERRl@5 X X X X X X X X X
ERRe@5 X X X X ERRl@5 ERRl@5 ERRl@5 ERRl@5
GAP@5 X X X X X AP@5 AP@5 AP@5 AP@5
ADR@5 X X X X X X X X
19
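As a rough illustration of how the measures and scales in this table relate, the sketch below computes P@5, CG@5 and DCG@5 with linear and exponential gain, and derives binary judgments from Fine scores with an ℓmin threshold. The gain functions (the level itself and 2^level − 1), the log2 discount, the lack of normalization and the sample judgments are assumptions made for illustration; the slides do not state the exact definitions used in the dissertation.

```python
import math


def precision_at_k(binary_rels, k=5):
    """P@k: fraction of the top-k documents judged relevant (0/1 labels)."""
    return sum(binary_rels[:k]) / k


def gain(level, kind="linear"):
    """Assumed gain functions: linear = level, exponential = 2**level - 1."""
    return level if kind == "linear" else 2 ** level - 1


def cg_at_k(levels, k=5, kind="linear"):
    """Cumulative gain over the top-k documents (unnormalized)."""
    return sum(gain(level, kind) for level in levels[:k])


def dcg_at_k(levels, k=5, kind="linear"):
    """Discounted cumulative gain with an assumed 1/log2(rank + 1) discount."""
    return sum(
        gain(level, kind) / math.log2(rank + 1)
        for rank, level in enumerate(levels[:k], start=1)
    )


def binarize(fine_scores, l_min):
    """Artificial binary scale: relevant if the Fine score reaches l_min (assumed rule)."""
    return [1 if score >= l_min else 0 for score in fine_scores]


# Hypothetical judgments for one query and one system (top 5 results).
broad = [2, 1, 0, 2, 1]      # Broad scale, assumed coded as 0, 1, 2
fine = [85, 40, 10, 70, 55]  # Fine scale, assumed 0 to 100

print("CGl@5 :", cg_at_k(broad, kind="linear"))
print("CGe@5 :", cg_at_k(broad, kind="exponential"))
print("DCGl@5:", round(dcg_at_k(broad, kind="linear"), 3))
print("P@5 with l_min=40:", precision_at_k(binarize(fine, 40)))
print("P@5 with l_min=60:", precision_at_k(binarize(fine, 60)))
```

With binary judgments the linear and exponential gains coincide (both map 0 to 0 and 1 to 1), which is consistent with the table pointing the binary columns of the graded measures back to a single measure.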
MIREX
Experimental design
20
What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation
• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: both systems are not satisfactory
21
Data
• Queries, documents and judgments from MIREX
• 4115 unique and artificial examples
• 432 unique queries, 5636 unique documents
• Answers collected via Crowdsourcing
– Quality control with trap questions
• 113 unique subjects
22
Single system: how good is it?
• For 2045 examples (49%) users could not decide which system was better
What do we expect?
23
Single system: how good is it?
• Large ℓmin thresholds underestimate satisfaction
24
Single system: how good is it?
• Users don’t pay attention to ranking?
25
Single system: how good is it?
• Exponential gain underestimates satisfaction
26
Single system: how good is it?
• Document utility independent of others
27
Two systems: which one is better?
• For 2090 examples (51%) users did prefer one system over the other one
What do we expect?
28
Two systems: which one is better?
• Large differences needed for users to notice them
29
Two systems: which one is better?
• More relevance levels are better to discriminate
30
Two systems: which one is better?
• Cascade and navigational user models are not appropriate
31
Two systems: which one is better?
• Users do prefer the (supposedly) worse system
32
Summary
• Effectiveness and satisfaction are clearly correlated
– But there is a bias of 20% because of user disagreement
– Room for improvement through personalization
• Magnitude of differences does matter
– Just looking at rankings is very naive
• Be careful with statistical significance
– Need Δλ≈0.4 for users to agree with effectiveness
– Historically, only 20% of the time in MIREX
• Differences among measures and scales
– Linear gain slightly better than exponential gain
– Informational and positional user models better than navigational and cascade
– The more relevance levels, the better
33
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
36
Evaluate in terms of user satisfaction
• So far, arbitrary users for a single query
– P(Sat | Ql@5 = 0.61) = 0.7
• Easily for n users and a single query
– P(Sat_{15} = 10 | Ql@5 = 0.61) = 0.21
• What about a sample of queries 𝒬?
– Map queries separately for the distribution of P(Sat)
– For easier mappings, P(Sat | λ) functions are interpolated with simple polynomials
38
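The figure for n users can be reproduced with a binomial model. The sketch below assumes that users are independent and that each is satisfied with the same probability P(Sat | Ql@5 = 0.61) = 0.7 taken from the slide, and reads "Sat_{15} = 10" as "exactly 10 of 15 users are satisfied". That reading is an assumption, although it does reproduce the 0.21 shown above. The function name is hypothetical.

```python
from math import comb


def prob_k_of_n_satisfied(k, n, p_sat):
    """Binomial probability that exactly k of n independent users are satisfied."""
    return comb(n, k) * p_sat ** k * (1 - p_sat) ** (n - k)


# P(Sat | Ql@5 = 0.61) = 0.7, as given on the slide.
p = prob_k_of_n_satisfied(k=10, n=15, p_sat=0.7)
print(round(p, 2))  # 0.21
```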
Expected probability of satisfaction
• Now we can compute point and interval estimates of the expected probability of satisfaction
• Intuition fails when interpreting effectiveness
39
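One possible way to obtain such estimates is sketched below: each query's effectiveness score is mapped to P(Sat) with a polynomial, the point estimate is the mean over queries, and the interval comes from a percentile bootstrap. The polynomial coefficients, the scores and the choice of a bootstrap interval are all assumptions for illustration; the slides do not say how the dissertation computes its intervals.

```python
import random
from statistics import mean


def p_sat(lam, coeffs=(0.1, 0.9)):
    """Hypothetical polynomial mapping from effectiveness to P(Sat), clipped to [0, 1].

    coeffs are made-up values ordered from the constant term upwards.
    """
    value = sum(c * lam ** i for i, c in enumerate(coeffs))
    return min(1.0, max(0.0, value))


def bootstrap_interval(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical per-query effectiveness scores for one system.
scores = [0.61, 0.35, 0.80, 0.42, 0.55, 0.90, 0.20, 0.66]
sats = [p_sat(s) for s in scores]

point = mean(sats)
low, high = bootstrap_interval(sats)
print(f"E[P(Sat)] is about {point:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```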
System success
• If P(Sat) ≥ threshold the system is successful
– Setting the threshold was rather arbitrary
– Now it is meaningful, in terms of user satisfaction
• Intuitively, we want the majority of users to find the system satisfactory
– P(Succ) = P(P(Sat) > 0.5) = 1 − F_{P(Sat)}(0.5)
• Improving queries for which we are bad is worth more than further improving those for which we are already good
40
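The success probability can be estimated directly from a sample of per-query P(Sat) values through the empirical cdf. The values below are hypothetical, and the helper names are introduced only for this sketch.

```python
def ecdf(sample):
    """Return the empirical cdf of sample as a function F(x) = P(X <= x)."""
    values = sorted(sample)
    n = len(values)

    def cdf(x):
        return sum(1 for v in values if v <= x) / n

    return cdf


def p_success(p_sat_per_query, threshold=0.5):
    """P(Succ) = 1 - F(threshold): share of queries with P(Sat) above the threshold."""
    return 1 - ecdf(p_sat_per_query)(threshold)


# Hypothetical P(Sat) values, one per query in the collection.
p_sats = [0.82, 0.45, 0.67, 0.30, 0.91, 0.55, 0.48, 0.73]
print(p_success(p_sats))  # 0.625
```

Under this definition, raising a query from P(Sat) = 0.45 to 0.55 increases P(Succ), while raising one from 0.82 to 0.92 does not, which is the point of the last bullet above.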
Distribution of P(Sat)
• Need to estimate the cumulative distribution function of user satisfaction: F_{P(Sat)}
• Not described by a typical distribution family
– ecdf converges, but what is a good sample size?
– Compare with Normal, Truncated Normal and Beta
• Compared on >2M random samples from MIREX collections, at different query set sizes
• Goodness of fit measured with the Cramér-von Mises ω² statistic
41
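For reference, a minimal sketch of the Cramér-von Mises computation against a fitted Normal distribution follows. It uses the usual computing formula T = n·ω² = 1/(12n) + Σ((2i − 1)/(2n) − F(x_(i)))², where x_(i) are the sorted observations. The sample values are hypothetical, and whether the dissertation reports ω² or n·ω² is not stated on the slide.

```python
from statistics import NormalDist, mean, stdev


def cramer_von_mises(sample, cdf):
    """Cramér-von Mises statistic T = n * omega^2 of sample against a candidate cdf."""
    values = sorted(sample)
    n = len(values)
    return 1 / (12 * n) + sum(
        ((2 * i - 1) / (2 * n) - cdf(x)) ** 2
        for i, x in enumerate(values, start=1)
    )


# Hypothetical P(Sat) values, one per query.
p_sats = [0.82, 0.45, 0.67, 0.30, 0.91, 0.55, 0.48, 0.73, 0.61, 0.39]

# Fit a Normal distribution by its sample mean and standard deviation.
fitted = NormalDist(mean(p_sats), stdev(p_sats))
t_normal = cramer_von_mises(p_sats, fitted.cdf)
print(f"n * omega^2 against the fitted Normal: {t_normal:.4f}")
```

Smaller values indicate a closer fit, so the same function can be called with the cdf of each candidate family to rank them.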
Estimated distribution of P(Sat)
• More than ≈25 queries in the collection
– ecdf approximates better
• Fewer than ≈25 queries in the collection
– Normal for graded scales, ecdf for binary scales
• Beta is always the best with the Fine scale
• The more levels in the relevance scale, the better
• Linear gain better than exponential gain
42
Intuition fails, again
• Intuitive conclusions based on effectiveness alone contradict those based on user satisfaction
– E[Δλ] = −0.002
– E[ΔP(Sat)] = 0.001
– E[ΔP(Succ)] = 0.07
43
Historically, in MIREX
• Systems are not as satisfactory as we thought
• But they are more successful
– Good (or bad) for some kinds of queries
44
Measures and scales
Measure Original Artificial Graded Artificial Binary
Broad Fine nℒ=4 nℒ=5 ℓmin=20 ℓmin=40
P@5 X X
AP@5 X X
CGl@5 X X X X P@5 P@5
CGe@5 X X X P@5 P@5
DCGl@5 X X X X X X
DCGe@5 X X X DCGl@5 DCGl@5
Ql@5 X X X X AP@5 AP@5
Qe@5 X X X AP@5 AP@5
RBPl@5 X X X X X X
RBPe@5 X X X RBPl@5 RBPl@5
GAP@5 X X X X AP@5 AP@5
45
Outline
• Introduction
• Validity
– System Effectiveness and User Satisfaction
– Modeling Distributions
• Reliability
• Efficiency
• Conclusions and Future Work
47
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
48
![Page 68: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/68.jpg)
Random error
• Test collections are just samples from larger, possibly infinite, populations
• If we conclude system A is better than B, how confident can we be?
– $\Delta\lambda_{\mathcal{Q}}$ is just an estimate of the population mean $\mu_{\Delta\lambda}$
• Usually employ some statistical significance test for differences in location
• If it is statistically significant, we have confidence that the true difference is at least that large
49
![Page 69: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/69.jpg)
Statistical hypothesis testing
• Set two mutually exclusive hypotheses
– $H_0: \mu_{\Delta\lambda} = 0$
– $H_1: \mu_{\Delta\lambda} \neq 0$
• Run test, obtain p-value $= P(\mu_{\Delta\lambda} \geq \Delta\lambda_{\mathcal{Q}} \mid H_0)$
– p ≤ α: statistically significant, high confidence
– p > α: statistically non-significant, low confidence
• Possible errors in the binary decision
– Type I: incorrectly reject H0
– Type II: incorrectly accept H0
50
![Page 70: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/70.jpg)
Statistical significance tests
• (Non-)parametric tests
– t-test, Wilcoxon test, Sign test
• Based on resampling
– Bootstrap test, permutation/randomization test
• They make certain assumptions about distributions and sampling methods
– Often violated in IR evaluation experiments
– Which test behaves better, in practice, knowing that assumptions are violated?
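As a rough illustration of a resampling-based test, the sketch below runs a paired randomization (sign-flip) test on per-query score differences between two systems. The function name `paired_permutation_test`, the Monte Carlo approximation with a fixed number of trials, and the toy scores in the usage note are assumptions made for illustration; they are not taken from the slides.

```python
import random

def paired_permutation_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided Monte Carlo randomization test on paired per-query scores."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        # Under H0 the system labels are exchangeable within each query,
        # so each difference keeps or flips its sign with equal probability.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point noise in the comparison.
        if abs(permuted / len(diffs)) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (trials + 1)
```

For example, `paired_permutation_test([0.9] * 12, [0.6] * 12)` gives a p-value well below 0.05, because only the two all-same-sign assignments are as extreme as the observed one.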
51
![Page 71: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/71.jpg)
Optimality criteria
• Power
– Achieve significance as often as possible (low Type II)
– Usually increases Type I error rates
• Safety
– Minimize Type I error rates
– Usually decreases power
• Exactness
– Maintain Type I error rate at α level
– Permutation test is theoretically exact
52
![Page 72: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/72.jpg)
Experimental design
• Randomly split query set in two
• Evaluate all systems with both subsets
– Simulating two different test collections
• Compare p-values with both subsets
– How well do statistical tests agree with themselves?
– At different α levels
• All systems and queries from MIREX 2007-2011
– >15M p-values
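The split-half design above can be sketched as follows, for one pair of systems. This is only a minimal illustration: the exact sign test stands in for any of the tests compared in the study, and the outcome labels returned by the hypothetical `split_half_outcome` function are placeholders, not the categories used in the dissertation.

```python
import math
import random

def sign_test_p(diffs):
    """Exact two-sided sign test p-value (ties are dropped)."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def split_half_outcome(diffs, alpha=0.05, seed=0):
    """Split per-query differences in two random halves and compare decisions."""
    rng = random.Random(seed)
    shuffled = diffs[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:]
    sig1 = sign_test_p(first) <= alpha
    sig2 = sign_test_p(second) <= alpha
    # Do both halves point to the same system as the better one?
    same_sign = (sum(first) > 0) == (sum(second) > 0)
    if sig1 and sig2:
        return "agreement" if same_sign else "conflict"
    if sig1 or sig2:
        return "one half only"
    return "both non-significant"
```

Repeating this over many random splits, system pairs and α levels gives the kind of agreement rates that the following slides summarize.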
53
![Page 73: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/73.jpg)
Power and success
• Bootstrap test is the most powerful
• Wilcoxon, bootstrap and permutation are the most successful, depending on α level
54
![Page 74: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/74.jpg)
Conflicts
• Wilcoxon and t-test are the safest at low α levels
• Wilcoxon is the most exact at low α levels, but bootstrap is the most exact at the usual levels
55
![Page 75: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/75.jpg)
Optimal measure and scale
• Power: CGl@5, GAP@5, DCGl@5 and RBPl@5
• Success: CGl@5, GAP@5, DCGl@5 and RBPl@5
• Conflicts: very similar across measures
• Power: Fine, Broad and binary
• Success: Fine, Broad and binary
• Conflicts: very similar across scales
56
![Page 76: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/76.jpg)
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
57
![Page 78: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/78.jpg)
Acceptable sample size
• Reliability is higher with larger sample sizes
– But it is also more expensive
– What is an acceptable test collection size?
• Answer with Generalizability Theory
– G-Study: estimate variance components
– D-Study: estimate reliability of different sample sizes and experimental designs
59
![Page 90: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/90.jpg)
G-study: variance components
• Fully crossed experimental design: s × q
$$\lambda_{q,A} = \lambda + \lambda_A + \lambda_q + \varepsilon_{qA}$$
$$\sigma^2 = \sigma_s^2 + \sigma_q^2 + \sigma_{sq}^2$$
• Estimated with Analysis of Variance
• If $\sigma_s^2$ is small or $\sigma_q^2$ is large, we need more queries
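A minimal sketch of the G-study step, assuming one score per system-query cell and the textbook expected-mean-squares estimators for a fully crossed two-way design. With a single observation per cell the system-query interaction cannot be separated from residual error, so both are reported together as `residual`. The function name and the clipping of negative estimates to zero are illustrative choices, not something the slides specify.

```python
def variance_components(scores):
    """scores[s][q]: effectiveness of system s on query q (fully crossed s x q)."""
    n_s, n_q = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_s * n_q)
    sys_means = [sum(row) / n_q for row in scores]
    qry_means = [sum(scores[s][q] for s in range(n_s)) / n_s for q in range(n_q)]
    # Mean squares of the two-way ANOVA without replication.
    ms_s = n_q * sum((m - grand) ** 2 for m in sys_means) / (n_s - 1)
    ms_q = n_s * sum((m - grand) ** 2 for m in qry_means) / (n_q - 1)
    ms_res = sum(
        (scores[s][q] - sys_means[s] - qry_means[q] + grand) ** 2
        for s in range(n_s) for q in range(n_q)
    ) / ((n_s - 1) * (n_q - 1))
    # Solve the expected mean squares for the variance components.
    return {
        "system": max(0.0, (ms_s - ms_res) / n_q),
        "query": max(0.0, (ms_q - ms_res) / n_s),
        "residual": ms_res,
    }
```

On purely additive toy data (each score is a system effect plus a query effect) the residual component is zero and the other two equal the sample variances of the respective effects.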
60
![Page 91: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/91.jpg)
D-study: variance ratios
• Stability of absolute scores
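$$\Phi(n_q) = \frac{\sigma_s^2}{\sigma_s^2 + \dfrac{\sigma_q^2 + \sigma_e^2}{n_q}}$$
• Stability of relative scores
$$E\rho^2(n_q) = \frac{\sigma_s^2}{\sigma_s^2 + \dfrac{\sigma_e^2}{n_q}}$$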
• We can easily estimate how many queries are needed to reach some level of stability (reliability)
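The two ratios on this slide can be inverted to get the number of queries needed for a target level of stability. The sketch below does that algebra directly; the function names and the variance values in the usage note are hypothetical and only serve to show the calculation.

```python
import math

def phi(var_s, var_q, var_e, n_q):
    """Stability of absolute scores with n_q queries."""
    return var_s / (var_s + (var_q + var_e) / n_q)

def e_rho2(var_s, var_e, n_q):
    """Stability of relative scores with n_q queries."""
    return var_s / (var_s + var_e / n_q)

def queries_for_e_rho2(var_s, var_e, target):
    # Solve target = var_s / (var_s + var_e / n_q) for n_q and round up.
    return math.ceil(target * var_e / (var_s * (1 - target)))

def queries_for_phi(var_s, var_q, var_e, target):
    # Same algebra, with the query variance added to the error term.
    return math.ceil(target * (var_q + var_e) / (var_s * (1 - target)))
```

With made-up components `var_s=1.0` and `var_e=2.5`, `queries_for_e_rho2(1.0, 2.5, 0.9)` returns 23: 22 queries fall just short of the 0.9 target and 23 exceed it. Since the absolute coefficient also pays for the query variance, it never exceeds the relative one for the same number of queries.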
61
![Page 93: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/93.jpg)
Effect of query set size
• Average absolute stability $\Phi = 0.97$
• ≈65 queries needed for $\Phi = 0.95$, ≈100 in worst cases
• Fine scale slightly better than Broad and binary scales
• RBPl@5 and nDCGl@5 are the most stable
62
![Page 94: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/94.jpg)
Effect of query set size
• Average relative stability $E\rho^2 = 0.98$
• ≈35 queries needed for $E\rho^2 = 0.95$, ≈60 in worst cases
• Fine scale better than Broad and binary scales
• CGl@5 and RBPl@5 are the most stable
63
![Page 95: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/95.jpg)
Effect of cutoff k
• What if we use a deeper cutoff, k=10?
– From 100 queries and k=5 to 50 queries and k=10
– Should still have stable scores
– Judging effort should decrease
– Rank-based measures should become more stable
• Tested in MIREX 2012
– Apparently in 2013 too
64
![Page 96: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/96.jpg)
Effect of cutoff k
• Judging effort reduced to 72% of the usual
• Generally stable
– From $\Phi = 0.81$ to $\Phi = 0.83$
– From $E\rho^2 = 0.93$ to $E\rho^2 = 0.95$
65
![Page 97: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/97.jpg)
Effect of cutoff k
• Reliability given a fixed budget for judging?
– k=10 allows us to use fewer queries, about 70%
– Slightly reduced relative stability
66
![Page 98: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/98.jpg)
Effect of assessor set size
• More assessors or simply more queries?
– Judging effort is multiplied
• Can be studied with MIREX 2006 data
– 3 different assessors per query
– Nested experimental design: s × h: q
67
![Page 99: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/99.jpg)
Effect of assessor set size
• Broad scale: $\sigma_s^2 \approx \sigma_{h:q}^2$
• Fine scale: $\sigma_s^2 \gg \sigma_{h:q}^2$
• Always better to spend resources on queries
68
![Page 100: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/100.jpg)
Summary
• MIREX collections generally larger than necessary
• For fixed budget
– More queries better than more assessors
– More queries slightly better than deeper cutoff
• Worth studying alternative user model?
• Employ G-Theory while building the collection
• Fine better than Broad, better than binary
• CGl@5 and DCGl@5 best for relative stability
• RBPl@5 and nDCGl@5 best for absolute stability
69
![Page 101: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/101.jpg)
Outline
• Introduction
• Validity
• Reliability
– Optimality of Statistical Significance Tests
– Test Collection Size
• Efficiency
• Conclusions and Future Work
70
![Page 102: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/102.jpg)
Outline
• Introduction
• Validity
• Reliability
• Efficiency
– Learning Relevance Distributions
– Low-cost Evaluation
• Conclusions and Future Work
71
![Page 103: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/103.jpg)
Probabilistic evaluation
• The MIREX setting is still expensive
– Need to judge all top k documents from all systems
– Takes days, even weeks sometimes
• Model relevance probabilistically
• Relevance judgments are random variables over the space of possible assignments of relevance
• Effectiveness measures are also probabilistic
72
![Page 104: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/104.jpg)
Probabilistic evaluation
• Accuracy increases as we make judgments
– $E[R_d] \leftarrow r_d$
• Reliability increases too (confidence)
– $\mathrm{Var}[R_d] \leftarrow 0$
• Iteratively estimate relevance and effectiveness
– If confidence is low, make judgments
– If confidence is high, stop
• Judge as few documents as possible
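To make the idea concrete, the sketch below treats each relevance judgment as a discrete random variable and propagates its mean and variance to a simple effectiveness score. Defining the score as the average gain over the retrieved documents and treating documents as independent are simplifying assumptions for this illustration; the function names are hypothetical.

```python
def mean_var(dist):
    """dist maps each relevance level to its probability."""
    mean = sum(level * p for level, p in dist.items())
    var = sum((level - mean) ** 2 * p for level, p in dist.items())
    return mean, var

def cg_moments(dists):
    """Mean and variance of the average gain over the retrieved documents,
    assuming the relevance of different documents is independent."""
    k = len(dists)
    stats = [mean_var(d) for d in dists]
    mean = sum(m for m, _ in stats) / k
    var = sum(v for _, v in stats) / k ** 2
    return mean, var
```

A judged document is represented by a degenerate distribution such as `{2: 1.0}`: its expectation becomes the judged level and its variance drops to zero, which is what shrinks the variance of the score as judgments accumulate.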
73
![Page 105: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/105.jpg)
Learning distributions of relevance
• Uniform distribution is very uninformative
• Historical distribution in MIREX has high variance
• Estimate from a set of features: $P(R_d = \ell \mid \theta_d)$
– For each document separately
– Ordinal Logistic Regression
• Three sets of features
– Output-based, can always be used
– Judgment-based, to exploit known judgments
– Audio-based, to exploit musical similarity
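For readers unfamiliar with Ordinal Logistic Regression, the sketch below shows how a fitted proportional-odds model turns a feature vector into a probability for each relevance level. The weights, thresholds and feature values are placeholders; the actual features and fitted coefficients of the dissertation's models are not given in the slides.

```python
import math

def ordinal_probs(features, weights, thresholds):
    """P(R = level | features) for levels 0..len(thresholds).

    thresholds must be increasing; one threshold separates each pair of
    adjacent relevance levels.
    """
    score = sum(w * x for w, x in zip(weights, features))
    # Cumulative probabilities P(R <= level), with the last one fixed at 1.
    cumulative = [1 / (1 + math.exp(-(t - score))) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs
```

With thresholds `[-1.0, 1.0]`, raising the linear score moves probability mass from the lowest level toward the highest one, while the probabilities always sum to one.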
74
![Page 106: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/106.jpg)
Learned models
• Mout : can be used even without judgments
– Similarity between systems’ outputs
– Genre and artist metadata
• Genre is highly correlated with similarity
– Decent fit, $R^2 \approx 0.35$
• Mjud : can be used when there are judgments
– Similarity between systems’ outputs
– Known relevance of same system and same artist
• Artist is extremely correlated with similarity
– Excellent fit, $R^2 \approx 0.91$
75
![Page 107: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/107.jpg)
Estimation errors
• Actual vs. predicted by Mout
– 0.36 with Broad and 0.34 with Fine
• Actual vs. predicted by Mjud
– 0.14 with Broad and 0.09 with Fine
• Among assessors in MIREX 2006
– 0.39 with Broad and 0.31 with Fine
• Negligible under the current MIREX setting
76
![Page 108: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/108.jpg)
Outline
• Introduction
• Validity
• Reliability
• Efficiency
– Learning Relevance Distributions
– Low-cost Evaluation
• Conclusions and Future Work
77
![Page 110: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/110.jpg)
Probabilistic effectiveness measures
• Effectiveness scores are also random variables
• Different approaches to compute estimates
– Deal with dependence of random variables
– Different definitions of confidence
• For measures based on ideal ranking (nDCGl@k and RBPl@k) we do not have a closed form
– Approximated with Delta method and Taylor series
79
![Page 111: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/111.jpg)
Ranking without judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
• Average confidence in the rankings is 94%
• Average accuracy of the ranking is 92%
80
![Page 112: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/112.jpg)
Ranking without judgments
• Can we trust individual estimates?
– Ideally, we want X% accuracy when X% confidence
– Confidence slightly overestimated in [0.9, 0.99)
81
DCGl@5

| Confidence | Broad: in bin | Broad: accuracy | Fine: in bin | Fine: accuracy |
|---|---|---|---|---|
| [0.5, 0.6) | 23 (6.5%) | 0.826 | 22 (6.2%) | 0.636 |
| [0.6, 0.7) | 14 (4%) | 0.786 | 16 (4.5%) | 0.812 |
| [0.7, 0.8) | 14 (4%) | 0.571 | 11 (3.1%) | 0.364 |
| [0.8, 0.9) | 22 (6.2%) | 0.864 | 21 (6%) | 0.762 |
| [0.9, 0.95) | 23 (6.5%) | 0.87 | 19 (5.4%) | 0.895 |
| [0.95, 0.99) | 24 (6.8%) | 0.917 | 27 (7.7%) | 0.926 |
| [0.99, 1) | 232 (65.9%) | 0.996 | 236 (67%) | 0.996 |
| E[Accuracy] | | 0.938 | | 0.921 |
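The E[Accuracy] row appears to be the mean of the per-bin accuracies weighted by the number of estimates in each bin. The short check below reproduces both values from the table under that reading; the variable and function names are only illustrative.

```python
# (number of estimates in bin, accuracy in bin), copied from the table.
broad = [(23, 0.826), (14, 0.786), (14, 0.571), (22, 0.864),
         (23, 0.87), (24, 0.917), (232, 0.996)]
fine = [(22, 0.636), (16, 0.812), (11, 0.364), (21, 0.762),
        (19, 0.895), (27, 0.926), (236, 0.996)]

def weighted_accuracy(bins):
    """Accuracy averaged over all estimates, not over bins."""
    total = sum(n for n, _ in bins)
    return sum(n * acc for n, acc in bins) / total

print(round(weighted_accuracy(broad), 3))  # 0.938
print(round(weighted_accuracy(fine), 3))   # 0.921
```

Both scales have 352 estimates in total, and roughly two thirds of them fall in the highest confidence bin, which is why the overall accuracy stays high despite the poorly calibrated [0.7, 0.8) bin.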
![Page 113: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/113.jpg)
Relative estimates with judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
3. While confidence is low (<95%)
   1. Select a document and judge it
   2. Update relevance estimates with Mjud when possible
   3. Update estimates of differences and rank systems
• What documents should we judge?
– Those that are the most informative
– Measure-dependent
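The loop above can be sketched for a single pair of systems as follows. Several simplifications are assumed here and are not from the slides: the difference between systems is a weighted sum of independent document relevances, confidence in its sign comes from a normal approximation, the next document to judge is simply the one contributing the most variance, and judging replaces the estimate with the true value instead of refitting a model. All names are hypothetical.

```python
from statistics import NormalDist

def confidence(diff_mean, diff_var):
    """Confidence in the sign of the difference under a normal approximation."""
    if diff_var == 0:
        return 1.0
    return NormalDist().cdf(abs(diff_mean) / diff_var ** 0.5)

def judge_until_confident(estimates, truth, threshold=0.95):
    """estimates: list of [mean, variance, weight] per document (updated in place).
    truth: actual relevance per document, revealed only when it is judged.
    Weights are positive for one system's documents and negative for the other's.
    Returns the number of judgments made and the final estimated difference.
    """
    judged = 0
    while True:
        mean = sum(w * m for m, _, w in estimates)
        var = sum(w * w * v for _, v, w in estimates)
        if confidence(mean, var) >= threshold:
            return judged, mean
        # Judge the document that contributes the most variance.
        i = max(range(len(estimates)),
                key=lambda j: estimates[j][2] ** 2 * estimates[j][1])
        estimates[i][0], estimates[i][1] = truth[i], 0.0
        judged += 1
```

In a toy run with two documents per system, all starting at mean 1 and variance 1 with weights ±0.5, and true relevances `[2, 2, 0, 0]`, the loop stops after three judgments with an estimated difference of 1.5.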
82
![Page 114: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/114.jpg)
Relative estimates with judgments
• Judging effort dramatically reduced
– 1.3% with CGl@5, 9.7% with RBPl@5
• Average accuracy still 92%, but improved individually
– 74% of estimates with >99% confidence, 99.9% accurate
– Expected accuracy improves slightly from 0.927 to 0.931
83
![Page 115: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/115.jpg)
Absolute estimates with judgments
1. Estimate relevance with Mout
2. Estimate absolute effectiveness scores
3. While confidence is low (expected error >±0.05)
   1. Select a document and judge it
   2. Update relevance estimates with Mjud when possible
   3. Update estimates of absolute effectiveness scores
• What documents should we judge?
– Those that reduce variance the most
– Measure-dependent
84
![Page 116: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/116.jpg)
Absolute estimates with judgments
• The stopping condition is overly confident
– Virtually no judgments are even needed (supposedly)
• But effectiveness is highly overestimated
– Especially with nDCGl@5 and RBPl@5
– Mjud, and especially Mout, tend to overestimate relevance
85
![Page 117: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/117.jpg)
Absolute estimates with judgments
• Practical fix: correct variance
• Estimates are better, but at the cost of judging
– Need between 15% and 35% of judgments
86
![Page 118: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/118.jpg)
Summary
• Estimate ranking of systems with no judgments
– 92% accuracy on average, trustworthy individually
– Statistically significant differences are always correct
• If we want more confidence, judge documents
– As few as 2% needed to reach 95% confidence
– 74% of estimates have >99% confidence and accuracy
• Estimate absolute scores, judging as necessary
– Around 25% needed to ensure error <0.05
87
![Page 119: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/119.jpg)
Outline
• Introduction
• Validity
• Reliability
• Efficiency
– Learning Relevance Distributions
– Low-cost Evaluation
• Conclusions and Future Work
88
![Page 120: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/120.jpg)
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
– Conclusions
– Future Work
89
![Page 121: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/121.jpg)
Validity
• Cranfield tells us about systems, not about users
• Provide empirical mapping from system effectiveness onto user satisfaction
• Room for personalization quantified at 20%
• Need large differences for users to notice them
• Consider full distributions, not just averages
• Conclusions based on effectiveness tend to contradict conclusions based on user satisfaction
90
![Page 122: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/122.jpg)
Reliability
• Different significance tests for different needs
– Bootstrap test is the most powerful
– Wilcoxon and t-test are the safest
– Wilcoxon and bootstrap test are the most exact
• Practical interpretation of p-values
• MIREX collections generally larger than needed
• Spend resources on queries, not on assessors
• User models with deeper cutoffs are feasible
• Employ G-Theory while building collections
91
![Page 123: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/123.jpg)
Efficiency
• Probabilistic evaluation reduces cost, dramatically
• Two models to estimate document relevance
• System rankings 92% accurate without judgments
• 2% of judgments to reach 95% confidence
• 25% of judgments to reduce error to 0.05
92
![Page 124: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/124.jpg)
Measures and scales
• Best measure and scale depends on situation
• But generally speaking
– CGl@5, DCGl@5 and RBPl@5
– Fine scale
– Model distributions as Beta
93
![Page 125: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/125.jpg)
Outline
• Introduction
• Validity
• Reliability
• Efficiency
• Conclusions and Future Work
– Conclusions
– Future Work
94
![Page 127: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/127.jpg)
Validity
• User studies to understand user behavior
• What information to include in test collections
• Other forms of relevance judgment to better capture document utility
• Explicitly define judging guidelines
• Similar mapping for Text IR
96
![Page 128: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/128.jpg)
Reliability
• Corrections for Multiple Comparisons
• Methods to reliably estimate reliability while building test collections
97
![Page 129: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/129.jpg)
Efficiency
• Better models to estimate document relevance
• Correct variance when having just a few relevance judgments available
• Estimate relevance beyond k=5
• Other stopping conditions and document weights
98
![Page 130: Evaluation in Audio Music Similarity](https://reader033.vdocuments.site/reader033/viewer/2022052413/559d376d1a28ab6f398b4671/html5/thumbnails/130.jpg)
Conduct similar studies
for the wealth of tasks in
Music Information Retrieval
99