On the Reliability and Intuitiveness of Aggregated Search Metrics

DESCRIPTION

Aggregating search results from a variety of diverse verticals such as news, images, videos and Wikipedia into a single interface is a popular web search presentation paradigm. Although several aggregated search (AS) metrics have been proposed to evaluate AS result pages, their properties remain poorly understood. In this paper, we compare the properties of existing AS metrics under the assumptions that (1) queries may have multiple preferred verticals; (2) the likelihood of each vertical preference is available; and (3) topical relevance assessments of the results returned from each vertical are available. We compare a wide range of AS metrics on two test collections. Our main criteria of comparison are (1) discriminative power, which represents the reliability of a metric in comparing the performance of systems, and (2) intuitiveness, which represents how well a metric captures the various key aspects to be measured (i.e. various aspects of a user's perception of AS result pages). Our study shows that the AS metrics that capture key AS components (e.g., vertical selection) have several advantages over other metrics. This work sheds new light on the further development and application of AS metrics.

TRANSCRIPT
On the Reliability and Intuitiveness of Aggregated Search Metrics
Ke Zhou1, Mounia Lalmas2, Tetsuya Sakai3, Ronan Cummins4, Joemon M. Jose1
1University of Glasgow 2Yahoo Labs London 3Waseda University
4University of Greenwich
CIKM 2013, San Francisco
Background: Aggregated Search
• Diverse search verticals (image, video, news, etc.) are available on the web.
• Aggregating (embedding) vertical results into "general web" results has become the de facto presentation in commercial web search engines.
[Figure: vertical search engines shown alongside a general web search results page.]
Background: Architecture of Aggregated Search
[Figure: a query is sent to the aggregated search system, which dispatches it to multiple verticals, e.g. Blog, Wiki (Encyclopedia), Image, Shopping, and the general web vertical.]
• (VS) Vertical Selection
• (IS) Item Selection
• (RP) Result Presentation
Motivation: Evaluating the Evaluation (Meta-evaluation)
• Aggregated Search (AS) metrics
  – model four AS compounding factors
  – differ in the way they model each factor and combine them
  – how well the metrics capture and combine those factors remains poorly understood
• Focus: we meta-evaluate AS metrics
  – Reliability: ability to detect "actual" performance differences
  – Intuitiveness: ability to capture any property deemed important (an AS component)
Overview
Motivation: Compounding Factors
• (VS) Vertical Selection
• (IS) Item Selection
• (RP) Result Presentation
• (VD) Vertical Diversity
Example preferences over three AS pages A, B, C:
• VS (A > B, C): image preference
• IS (C > A, B): more relevant items
• RP (B > A, C): relevant items at top
• VD (C > A, B): diverse information
Overview
Metrics
• Traditional IR (position-discounted vs. set-based)
  – homogeneous ranked list
• Adapted diversity-based IR (novelty vs. orientation vs. diversity)
  – treat each vertical as an intent
  – adapt the ranked list to block-based
  – normalize by the "ideal" AS page
• Aggregated search (position vs. user tolerance vs. cascade)
  – utility-effort aware framework
• Single AS component (key components: VS vs. IS vs. RP vs. VD)
  – VS: vertical precision
  – VD: vertical (intent) recall
  – IS: mean precision of vertical items
  – RP: Spearman's correlation with the "ideal" AS page
Standard parameter settings [Zhou et al. SIGIR'12]
K. Zhou, R. Cummins, M. Lalmas and J.M. Jose. Evaluating aggregated search pages. In SIGIR, 115-124, 2012.
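The four single-component metrics listed above are deliberately simple. As a hypothetical sketch of how they might be computed (the function names and data layouts are illustrative assumptions, not the authors' implementation):

```python
# Hypothetical sketch of the four single-component metrics; names and
# data layouts are assumptions for illustration, not the paper's code.

def vertical_precision(selected, preferred):
    """VS: fraction of the selected verticals that the query prefers."""
    return len(set(selected) & set(preferred)) / len(selected) if selected else 0.0

def vertical_recall(selected, preferred):
    """VD: fraction of the preferred verticals shown on the page."""
    return len(set(selected) & set(preferred)) / len(preferred) if preferred else 0.0

def item_precision(items_by_vertical):
    """IS: mean precision of the items selected from each vertical.
    items_by_vertical maps a vertical name to 0/1 relevance labels."""
    precs = [sum(rels) / len(rels) for rels in items_by_vertical.values() if rels]
    return sum(precs) / len(precs) if precs else 0.0

def spearman(page, ideal):
    """RP: plain Spearman rank correlation (no tie correction) between
    block positions on the page and on the 'ideal' AS page."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry, n = ranks(page), ranks(ideal), len(page)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

For instance, a page showing the image and web verticals for an image-preferring query gets VS = 0.5 but VD = 1.0, which is why precision and recall of verticals are measured separately.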
Overview
Experimental Setup
• Two aggregated search test collections
  – VertWeb'11 (classifying the ClueWeb09 collection)
  – FedWeb'13 (TREC) -> the one we report our experiments on
• Verticals
  – cover a variety of 11 verticals employed by three major commercial search engines (e.g. News, Image)
• Topics and assessments
  – reusing topics from the TREC Web and Million Query tracks -> 50 topics
  – vertical orientation assessments (type of information)
  – topical relevance assessments of items (traditional document relevance)
• Simulated AS systems
  – implement state-of-the-art AS components
  – vary the combination of component systems for the final AS system
  – 36 AS systems in total
Overview
Methodology: Discriminative Power (Reliability)
• Discriminative power
  – reflects a metric's robustness to variation across topics
  – measured by conducting a statistical significance test for different pairs of systems and counting the number of significantly different pairs
• Randomized Tukey's Honestly Significant Difference (HSD) test [Carterette TOIS'12]
  – uses the observed data and computational power to estimate the distributions
  – conservative in nature
  – main idea: if the largest mean difference between systems observed is not significant, then none of the other differences should be significant either
B. Carterette. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. TOIS, 30-1, 2012.
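A minimal sketch of the randomized Tukey HSD procedure (the data layout, trial count, and function name are assumptions for illustration): per-topic scores are shuffled across systems, and every observed pairwise difference is compared against the distribution of the *largest* permuted difference.

```python
import random

def randomized_tukey_hsd(scores, trials=1000, seed=0):
    """scores: {system: [per-topic scores]}, lists aligned by topic.
    Returns an estimated ASL (p-value) for every system pair.
    Illustrative sketch, not the authors' implementation."""
    rng = random.Random(seed)
    systems = list(scores)
    n_topics = len(scores[systems[0]])
    mean = lambda v: sum(v) / len(v)
    # observed absolute mean differences for every system pair
    obs = {(a, b): abs(mean(scores[a]) - mean(scores[b]))
           for i, a in enumerate(systems) for b in systems[i + 1:]}
    exceed = {pair: 0 for pair in obs}
    for _ in range(trials):
        # shuffle each topic's scores across the systems
        perm = {s: [] for s in systems}
        for t in range(n_topics):
            col = [scores[s][t] for s in systems]
            rng.shuffle(col)
            for s, v in zip(systems, col):
                perm[s].append(v)
        # Tukey's idea: test every pair against the LARGEST permuted
        # difference, so if the largest observed difference is not
        # significant, no smaller difference can be either
        largest = max(abs(mean(perm[a]) - mean(perm[b])) for a, b in obs)
        for pair, d in obs.items():
            if largest >= d:
                exceed[pair] += 1
    return {pair: exceed[pair] / trials for pair in obs}
```

Comparing against the maximum permuted difference for all pairs is what makes the test conservative: it controls the family-wise error across the many system pairs at once.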
Results: Discriminative Power
[Figure: each curve is one metric; X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value, 0 to 0.10). ASL: Achieved Significance Level.]
• The most discriminative metrics are those closer to the origin in the figures.
• Traditional & single-component << adapted diversity & aggregated search metrics
Let "M1 << M2" denote "M2 outperforms M1 in terms of discriminative power."
Results: Discriminative Power (Single component & Traditional)
[Figure: X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value).]
• Single-component metrics perform comparatively well.
• The RP metric is the most discriminative single-component metric.
• The VS metric is the least discriminative single-component metric.
• nDCG performs better than P@10 and the single-component metrics.
VS << VD << (IS, P@10) << (nDCG, RP)
Results: Discriminative Power (Adapted diversity & Aggregated search)
[Figure: X-axis: run pairs sorted by ASL; Y-axis: ASL (p-value).]
IA-nDCG << D#-nDCG << (ASRBP, α-nDCG) << ASDCG << ASERR
• AS metrics (utility-effort) are generally more discriminative than the adapted diversity metrics.
• ASERR (cascade model) outperforms ASDCG (position-based) and ASRBP (tolerance-based).
• IA-nDCG (orientation emphasized) and D#-nDCG (diversity emphasized) are the least discriminative metrics.
Overview
Methodology: Concordance Test (Intuitiveness)
• Highly discriminative metrics, while desirable, may not necessarily measure everything that we want measured.
• Understanding how each key component is captured by a metric
  – in the context of AS: VS, VD, IS, RP
(VS) Vertical Selection: select correct verticals
(VD) Vertical Diversity: promote multiple verticals' results
(RP) Result Presentation: embed verticals correctly
(IS) Item Selection: select relevant items
Methodology: Concordance Test [Sakai, WWW'12]
[Figure: where Metric 1 and Metric 2 disagree, a simple gold-standard metric arbitrates; e.g. 60% vs. 40% concordance.]
• Concordance test
  – computes relative concordance scores for a given pair of metrics and a gold-standard metric
  – the gold-standard metric should represent a basic property that we want the candidate metrics to satisfy
  – four simple gold-standard metrics: VS, VD, IS, RP
    • simple and therefore agnostic to metric differences (e.g. different position-based discounting)
T. Sakai. Evaluation with informational and navigational intents. In WWW, 499-508, 2012.
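As a sketch of the idea (the function name and score layout are assumptions), relative concordance can be computed over the system pairs on which the two candidate metrics disagree, with the gold-standard metric as arbiter:

```python
def relative_concordance(m1, m2, gold):
    """m1, m2, gold: {system: score}. Over every system pair on which the
    two candidate metrics disagree about the ordering, count how often
    each candidate agrees with the gold-standard metric.
    Illustrative sketch, not the paper's implementation."""
    systems = sorted(m1)
    agree1 = agree2 = disagreed = 0
    for i, a in enumerate(systems):
        for b in systems[i + 1:]:
            d1, d2, dg = m1[a] - m1[b], m2[a] - m2[b], gold[a] - gold[b]
            if d1 * d2 < 0:  # the candidates disagree on this pair
                disagreed += 1
                agree1 += d1 * dg > 0  # m1 orders the pair like the gold metric
                agree2 += d2 * dg > 0
    if disagreed == 0:
        return 1.0, 1.0  # the candidates never disagree
    return agree1 / disagreed, agree2 / disagreed
```

Restricting the count to disagreed pairs is what makes the scores *relative*: it isolates exactly where the two candidate metrics would lead an evaluator to different conclusions.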
Results: Concordance Test (capturing each individual key AS component)
• Concordance with VS: IA-nDCG > ASRBP > ASDCG > D#-nDCG > (ASERR, α-nDCG)
  – The intent-aware (IA) metric (orientation emphasized) and the AS metrics (utility-effort) perform best.
• Concordance with VD: D#-nDCG > IA-nDCG > (ASDCG, ASRBP, ASERR) > α-nDCG
  – The D# (diversity emphasized) and IA (orientation emphasized) frameworks work best.
Let "M1 > M2" denote "M1 statistically significantly outperforms M2 in terms of concordance with a given gold-standard metric."
Results: Concordance Test (capturing each individual key AS component, continued)
• Concordance with IS: (ASRBP, D#-nDCG) > ASDCG > IA-nDCG > ASERR > α-nDCG
  – ASRBP (tolerance-based AS metric) and the D# (diversity emphasized) metrics perform best.
• Concordance with RP: α-nDCG > ASERR > ASDCG > ASRBP > D#-nDCG > IA-nDCG
  – α-nDCG (novelty emphasized) and ASERR (cascade AS metric) work best.
• However, α-nDCG (novelty emphasized) and ASERR (cascade AS metric) consistently perform worst with respect to VS, VD and IS.
Results: Concordance Test (capturing multiple key AS components)
• Concordance with VS and IS: ASRBP > D#-nDCG > (ASDCG, IA-nDCG) > ASERR > α-nDCG
• Concordance with VS, VD and IS: D#-nDCG > (ASRBP, IA-nDCG) > ASDCG > ASERR > α-nDCG
• Concordance with all (VS, VD, IS and RP): ASRBP > D#-nDCG > (ASDCG, IA-nDCG) > ASERR > α-nDCG
• ASRBP (tolerance-based AS metric) and D#-nDCG (diversity emphasized) perform best when combining all components.
• Metrics that capture key components of AS (e.g. VS) have advantages over those that do not (e.g. α-nDCG).
Conclusions: Final takeaways
• In terms of discriminative power,
  – RP is the most discriminative of the four AS components for evaluation.
  – AS and novelty-emphasized metrics are superior to diversity- and orientation-emphasized metrics.
• In terms of intuitiveness,
  – the tolerance-based AS metric and the diversity-emphasized metric are the most intuitive metrics for emphasizing all AS components.
• Overall, the tolerance-based AS metric (ASRBP) is the most discriminative and intuitive metric.
• We propose a comprehensive approach for evaluating the intuitiveness of metrics that takes the special aspects of aggregated search into account.
Future Work
• Compare with meta-evaluation results from human subjects to test the reliability of our approach and results.
• Propose a more principled evaluation framework to incorporate and combine the key AS factors (VS, VD, IS, RP).
• You are welcome to participate in the TREC FedWeb 2014 task (a continuation of FedWeb 2013: https://sites.google.com/site/trecfedweb/)!