On the Reliability and Intuitiveness of Aggregated Search Metrics

DESCRIPTION

Aggregating search results from a variety of diverse verticals such as news, images, videos and Wikipedia into a single interface is a popular web search presentation paradigm. Although several aggregated search (AS) metrics have been proposed to evaluate AS result pages, their properties remain poorly understood. In this paper, we compare the properties of existing AS metrics under the assumptions that (1) queries may have multiple preferred verticals; (2) the likelihood of each vertical preference is available; and (3) topical relevance assessments of the results returned from each vertical are available. We compare a wide range of AS metrics on two test collections. Our main criteria of comparison are (1) discriminative power, which represents the reliability of a metric in comparing the performance of systems, and (2) intuitiveness, which represents how well a metric captures the various key aspects to be measured (i.e. various aspects of a user's perception of AS result pages). Our study shows that the AS metrics that capture key AS components (e.g., vertical selection) have several advantages over other metrics. This work sheds new light on the further development and application of AS metrics.

TRANSCRIPT
On the Reliability and Intuitiveness of Aggregated Search Metrics
Ke Zhou1, Mounia Lalmas2, Tetsuya Sakai3, Ronan Cummins4, Joemon M. Jose1
1University of Glasgow 2Yahoo Labs London 3Waseda University
4University of Greenwich
CIKM 2013, San Francisco
Background: Aggregated Search
• Diverse search verticals (image, video, news, etc.) are available on the web.
• Aggregating (embedding) vertical results into "general web" results has become the de facto presentation in commercial web search engines.
[Figure: vertical search engines shown alongside a general web search results page.]
Background: Architecture of Aggregated Search
[Figure: a query is sent to the aggregated search system, which dispatches it to multiple verticals, e.g. Blog, Wiki (Encyclopedia), Image, Shopping, and the general web vertical.]
• (VS) Vertical Selection
• (IS) Item Selection
• (RP) Result Presentation
Motivation: Evaluating the Evaluation (Meta-evaluation)
• Aggregated Search (AS) metrics
  – model four AS compounding factors
  – differ in the way they model each factor and combine them
  – how well the metrics capture and combine those factors remains poorly understood
• Focus: we meta-evaluate AS metrics
  – Reliability: ability to detect "actual" performance differences
  – Intuitiveness: ability to capture any property deemed important (an AS component)
Overview
Motivation: Compounding Factors
• (VS) Vertical Selection
• (IS) Item Selection
• (RP) Result Presentation
• (VD) Vertical Diversity
Example preferences over three AS pages A, B, C:
• VS (A > B, C): image preference
• IS (C > A, B): more relevant items
• RP (B > A, C): relevant items at top
• VD (C > A, B): diverse information
Overview
Metrics
• Traditional IR (position-discounted vs. set-based)
  – homogeneous ranked list
• Adapted diversity-based IR (novelty vs. orientation vs. diversity)
  – treat each vertical as an intent
  – adapt the ranked list to block-based
  – normalize by the "ideal" AS page
• Aggregated search (position vs. user tolerance vs. cascade)
  – utility-effort aware framework
• Single AS component (key components: VS vs. IS vs. RP vs. VD)
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman's correlation with the "ideal" AS page
Standard parameter settings [Zhou et al. SIGIR'12]
K. Zhou, R. Cummins, M. Lalmas and J.M. Jose. Evaluating aggregated search pages. In SIGIR, 115-124, 2012.
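The four single-component metrics listed above are deliberately simple. As a hypothetical sketch of how they might be computed (the function names and data layouts are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical sketch of the four single-component metrics; names and
# data layouts are assumptions for illustration, not the paper's code.

def vertical_precision(selected, preferred):
    """VS: fraction of the selected verticals that the query prefers."""
    return len(set(selected) & set(preferred)) / len(selected) if selected else 0.0

def vertical_recall(selected, preferred):
    """VD: fraction of the preferred verticals shown on the page."""
    return len(set(selected) & set(preferred)) / len(preferred) if preferred else 0.0

def item_precision(items_by_vertical):
    """IS: mean precision of the items selected from each vertical.
    items_by_vertical maps a vertical name to 0/1 relevance labels."""
    precs = [sum(rels) / len(rels) for rels in items_by_vertical.values() if rels]
    return sum(precs) / len(precs) if precs else 0.0

def spearman(page, ideal):
    """RP: plain Spearman rank correlation (no tie correction) between
    block positions on the page and on the 'ideal' AS page."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry, n = ranks(page), ranks(ideal), len(page)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For instance, a page showing the image and web verticals for an image-preferring query gets VS = 0.5 but VD = 1.0, which is why precision and recall of verticals are measured separately.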
Overview
Experimental Setup
• Two aggregated search test collections
  – VertWeb'11 (classifying the ClueWeb09 collection)
  – FedWeb'13 (TREC) -> the one we report our experiments on
• Verticals
  – cover a variety of 11 verticals employed by three major commercial search engines (e.g. News, Image)
• Topics and assessments
  – reusing topics from the TREC Web and Million Query tracks -> 50 topics
  – vertical orientation assessments (type of information)
  – topical relevance assessments of items (traditional document relevance)
• Simulated AS systems
  – implement state-of-the-art AS components
  – vary the combination of component systems for the final AS system
  – 36 AS systems in total
Overview
Methodology: Discriminative Power (Reliability)
• Discriminative power
  – reflects a metric's robustness to variation across topics
  – measured by conducting a statistical significance test for different pairs of systems and counting the number of significantly different pairs
• Randomized Tukey's Honestly Significant Difference (HSD) test [Carterette TOIS'12]
  – uses the observed data and computational power to estimate the distributions
  – conservative in nature
  – main idea: if the largest mean difference between systems observed is not significant, then none of the other differences should be significant either
B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30-1, 2012.
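A minimal sketch of the randomized Tukey HSD procedure (the data layout, trial count, and function name are assumptions for illustration): per-topic scores are shuffled across systems, and every observed pairwise difference is compared against the distribution of the *largest* permuted difference.

```python
import random

def randomized_tukey_hsd(scores, trials=1000, seed=0):
    """scores: {system: [per-topic scores]}, lists aligned by topic.
    Returns an estimated ASL (p-value) for every system pair.
    Illustrative sketch, not the authors' implementation."""
    rng = random.Random(seed)
    systems = list(scores)
    n_topics = len(scores[systems[0]])
    mean = lambda v: sum(v) / len(v)
    # observed absolute mean differences for every system pair
    obs = {(a, b): abs(mean(scores[a]) - mean(scores[b]))
           for i, a in enumerate(systems) for b in systems[i + 1:]}
    exceed = {pair: 0 for pair in obs}
    for _ in range(trials):
        # shuffle each topic's scores across the systems
        perm = {s: [] for s in systems}
        for t in range(n_topics):
            col = [scores[s][t] for s in systems]
            rng.shuffle(col)
            for s, v in zip(systems, col):
                perm[s].append(v)
        # Tukey's idea: test every pair against the LARGEST permuted
        # difference, so if the largest observed difference is not
        # significant, no smaller difference can be either
        largest = max(abs(mean(perm[a]) - mean(perm[b])) for a, b in obs)
        for pair, d in obs.items():
            if largest >= d:
                exceed[pair] += 1
    return {pair: exceed[pair] / trials for pair in obs}
```

Comparing against the maximum permuted difference for all pairs is what makes the test conservative: it controls the family-wise error across the many system pairs at once.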
Results: Discriminative Power
[Figure: each curve is one metric; X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value, 0 to 0.10). ASL: Achieved Significance Level.]
• The most discriminative metrics are those closer to the origin in the figures.
• Traditional & single-component << adapted diversity & aggregated search metrics
Let "M1 << M2" denote "M2 outperforms M1 in terms of discriminative power."
Results: Discriminative Power (Single component & Traditional)
[Figure: X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value).]
• Single-component metrics perform comparatively well.
• The RP metric is the most discriminative single-component metric.
• The VS metric is the least discriminative single-component metric.
• nDCG performs better than P@10 and the single-component metrics.
VS << VD << (IS, P@10) << (nDCG, RP)
Results: Discriminative Power (Adapted diversity & Aggregated search)
[Figure: X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value).]
IA-nDCG << D#-nDCG << (ASRBP, α-nDCG) << ASDCG << ASERR
• AS metrics (utility-effort) are generally more discriminative than the adapted diversity metrics.
• ASERR (cascade model) outperforms ASDCG (position-based) and ASRBP (tolerance-based).
• IA-nDCG (orientation emphasized) and D#-nDCG (diversity emphasized) are the least discriminative metrics.
Overview
Methodology: Concordance Test (Intuitiveness)
• Highly discriminative metrics, while desirable, may not necessarily measure everything that we want measured.
• Understanding how each key component is captured by a metric
  – in the context of AS: VS, VD, IS, RP
(VS) Vertical Selection: select correct verticals
(VD) Vertical Diversity: promote multiple verticals' results
(RP) Result Presentation: embed verticals correctly
(IS) Item Selection: select relevant items
Methodology: Concordance Test [Sakai, WWW'12]
[Figure: where Metric 1 and Metric 2 disagree, a simple gold-standard metric arbitrates; e.g. 60% vs. 40% concordance.]
• Concordance test
  – computes relative concordance scores for a given pair of metrics and a gold-standard metric
  – the gold-standard metric should represent a basic property that we want the candidate metrics to satisfy
  – four simple gold-standard metrics: VS, VD, IS, RP
    • simple and therefore agnostic to metric differences (e.g. different position-based discounting)
T. Sakai. Evaluation with informational and navigational intents. In WWW, 499-508, 2012.
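As a sketch of the idea (the function name and score layout are assumptions), relative concordance can be computed over the system pairs on which the two candidate metrics disagree, with the gold-standard metric as arbiter:

```python
def relative_concordance(m1, m2, gold):
    """m1, m2, gold: {system: score}. Over every system pair on which the
    two candidate metrics disagree about the ordering, count how often
    each candidate agrees with the gold-standard metric.
    Illustrative sketch, not the paper's implementation."""
    systems = sorted(m1)
    agree1 = agree2 = disagreed = 0
    for i, a in enumerate(systems):
        for b in systems[i + 1:]:
            d1, d2, dg = m1[a] - m1[b], m2[a] - m2[b], gold[a] - gold[b]
            if d1 * d2 < 0:  # the candidates disagree on this pair
                disagreed += 1
                agree1 += d1 * dg > 0  # m1 orders the pair like the gold metric
                agree2 += d2 * dg > 0
    if disagreed == 0:
        return 1.0, 1.0  # the candidates never disagree
    return agree1 / disagreed, agree2 / disagreed
```

Restricting the count to disagreed pairs is what makes the scores *relative*: it isolates exactly where the two candidate metrics would lead an evaluator to different conclusions.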
Results: Concordance Test (capturing each individual key AS component)
• Concordance with VS: IA-nDCG > ASRBP > ASDCG > D#-nDCG > (ASERR, α-nDCG)
  – The intent-aware (IA) metric (orientation emphasized) and the AS metrics (utility-effort) perform best.
• Concordance with VD: D#-nDCG > IA-nDCG > (ASDCG, ASRBP, ASERR) > α-nDCG
  – The D# (diversity emphasized) and IA (orientation emphasized) frameworks work best.
Let "M1 > M2" denote "M1 statistically significantly outperforms M2 in terms of concordance with a given gold-standard metric."
Results: Concordance Test (capturing each individual key AS component, continued)
• Concordance with IS: (ASRBP, D#-nDCG) > ASDCG > IA-nDCG > ASERR > α-nDCG
  – ASRBP (tolerance-based AS metric) and the D# (diversity emphasized) metrics perform best.
• Concordance with RP: α-nDCG > ASERR > ASDCG > ASRBP > D#-nDCG > IA-nDCG
  – α-nDCG (novelty emphasized) and ASERR (cascade AS metric) work best.
• However, α-nDCG (novelty emphasized) and ASERR (cascade AS metric) consistently perform worst with respect to VS, VD and IS.
Results: Concordance Test (capturing multiple key AS components)
• Concordance with VS and IS: ASRBP > D#-nDCG > (ASDCG, IA-nDCG) > ASERR > α-nDCG
• Concordance with VS, VD and IS: D#-nDCG > (ASRBP, IA-nDCG) > ASDCG > ASERR > α-nDCG
• Concordance with all (VS, VD, IS and RP): ASRBP > D#-nDCG > (ASDCG, IA-nDCG) > ASERR > α-nDCG
• ASRBP (tolerance-based AS metric) and D#-nDCG (diversity emphasized) perform best when combining all components.
• Metrics that capture key components of AS (e.g. VS) have advantages over those that do not (e.g. α-nDCG).
Conclusions: Final takeaways
• In terms of discriminative power,
  – RP is the most discriminative of the four AS components for evaluation.
  – AS and novelty-emphasized metrics are superior to diversity- and orientation-emphasized metrics.
• In terms of intuitiveness,
  – the tolerance-based AS metric and the diversity-emphasized metric are the most intuitive metrics for emphasizing all AS components.
• Overall, the tolerance-based AS metric (ASRBP) is the most discriminative and intuitive metric.
• We propose a comprehensive approach for evaluating the intuitiveness of metrics that takes the special aspects of aggregated search into account.
Future Work
• Compare with meta-evaluation results from human subjects to test the reliability of our approach and results.
• Propose a more principled evaluation framework to incorporate and combine the key AS factors (VS, VD, IS, RP).
• You are welcome to participate in the TREC FedWeb 2014 task (a continuation of FedWeb 2013: https://sites.google.com/site/trecfedweb/)!