Copulas for Information Retrieval (SIGIR'13)

Copulas for Information Retrieval Carsten Eickhoff, Arjen P. de Vries, Kevyn Collins-Thompson


DESCRIPTION

In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which the constituent scores were estimated, or they use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce copulas, a powerful statistical framework for modeling complex multi-dimensional dependencies, to information retrieval tasks. We provide a formal background to copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusing results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that, compared to their non-copula counterparts, these lead to significant performance improvements on several tasks, as evaluated on large-scale standard corpora. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario. This work, carried out together with Arjen P. de Vries and Kevyn Collins-Thompson, has been accepted for full oral presentation at the 36th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR) in Dublin, Ireland. The full version of this paper is available at: http://dl.acm.org/citation.cfm?id=2484066

TRANSCRIPT

Page 1: Copulas for Information Retrieval (SIGIR'13)

Copulas for Information Retrieval

Carsten Eickhoff, Arjen P. de Vries, Kevyn Collins-Thompson

Page 2: Copulas for Information Retrieval (SIGIR'13)

Copulas – What is it all about?

• Assume two sufficiently different commodities

• Rare elemental metals

• Pork bellies

• No apparent correlations

[Figure: chart of the two series "Rare Earths" and "Pork Bellies"; vertical axis from 0 to 6]

Page 3: Copulas for Information Retrieval (SIGIR'13)

Copulas – What is it all about?

• Two seemingly independent variables

• Yet, for rare extreme cases, there are co-movements

• “Tail dependencies”

• Copulas decouple observations and dependencies

• IR models are good at estimating marginals

• Copulas are good at combining them

Page 4: Copulas for Information Retrieval (SIGIR'13)

Overview

1. Non-linear Dependency Structures in IR

2. Copulas – Intuition & Background

3. Multivariate Relevance Estimation

4. When to use them?

5. Score Fusion

6. Conclusion & Future Directions

Page 5: Copulas for Information Retrieval (SIGIR'13)

1 Non-Linear Dependency Structures in IR

Page 6: Copulas for Information Retrieval (SIGIR'13)

Multivariate Relevance Modelling

• IR systems index and retrieve a growing variety of document types

• Many structured, or at least “complex”

• Single-criterion relevance frameworks do not perform well

• Multi-criteria models tend to be either:

a) Naïve (e.g., independence assumption), or

b) Hard for humans to interpret qualitatively (e.g., learning to rank, L2R)

Page 7: Copulas for Information Retrieval (SIGIR'13)

Non-Linear Dependencies

• Non-linear dependency structures are still a challenge

• TREC 2010 Faceted Blog Distillation Task, Topic 1171, “mysql”

• Relevance criteria:

• Topicality

• Subjectivity

Page 11: Copulas for Information Retrieval (SIGIR'13)

Non-Linear Dependencies

• Pearson’s ρ = 0.18

• So, there is no real dependency

• …right?

• In the lower third of the scale, we note ρ = 0.37

• And in the upper third, it turns to ρ = -0.4
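The effect described on this slide, a weak overall correlation hiding stronger and opposite correlations in parts of the score range, can be reproduced with a small sketch. The data below is hypothetical (not the TREC topic 1171 scores), and splitting into thirds by rank on the first variable is an assumption about what "third of the scale" means.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def pearson_by_thirds(xs, ys):
    """Sort pairs by x, cut them into three equal parts, correlate each part."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    cuts = [0, n // 3, 2 * n // 3, n]
    result = []
    for lo, hi in zip(cuts, cuts[1:]):
        px, py = zip(*pairs[lo:hi])
        result.append(pearson(px, py))
    return result

# Hypothetical scores: y rises with x at first and falls towards the end.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 5, 4, 3, 2, 1]

overall = pearson(x, y)
lower, middle, upper = pearson_by_thirds(x, y)
```

On this toy data the overall coefficient is close to zero, while the lower and upper thirds are perfectly positively and negatively correlated, which is the kind of structure a single global ρ cannot express.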

Page 12: Copulas for Information Retrieval (SIGIR'13)

2 Copulas – Intuition & Background

Page 13: Copulas for Information Retrieval (SIGIR'13)

Copulas (from copulare, to join)

• Copulas model complex non-linear dependencies between variables that simple correlations can't capture

• Decouple marginal distributions from dependency structure

• Approximate joint multivariate distributions

• Applied previously in portfolio and risk management, meteorology, river flooding predictions, …

Page 14: Copulas for Information Retrieval (SIGIR'13)

Formal Basics

• Given a k-dimensional random variable

• Map to unit cube

• Describe joint cdf with copula

• Isolation of a component

• Copula’s zero
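The five bullets above correspond to the standard textbook definition of a copula. A sketch in common notation follows; the symbols X, F_i, U and C are assumptions and not necessarily the ones used on the slide, and continuous marginals are assumed.

```latex
\begin{align*}
\text{Random vector:}\quad & X = (x_1, \dots, x_k), \qquad x_i \sim F_i \\
\text{Map to unit cube:}\quad & U = (u_1, \dots, u_k) = \bigl(F_1(x_1), \dots, F_k(x_k)\bigr) \in [0,1]^k \\
\text{Joint cdf via copula:}\quad & F(x_1, \dots, x_k) = C\bigl(F_1(x_1), \dots, F_k(x_k)\bigr) \\
\text{Isolation of a component:}\quad & C(1, \dots, 1, u_i, 1, \dots, 1) = u_i \\
\text{Copula's zero:}\quad & C(u_1, \dots, u_k) = 0 \quad \text{if any } u_i = 0
\end{align*}
```

The third line is Sklar's theorem: the marginals F_i carry the per-dimension behaviour, and C alone carries the dependency structure.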

Page 15: Copulas for Information Retrieval (SIGIR'13)

Closing the circle

• Recall the example TREC topic 1171

• Linear combination: AP = 0.14, below collection average (0.25)

• Fit Clayton copula to model joint relevance distribution

• AP rises to 0.22
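For illustration, the bivariate Clayton copula has the closed form C(u, v) = (u^-θ + v^-θ - 1)^(-1/θ) for θ > 0. The sketch below evaluates it and ranks a few hypothetical documents by the resulting joint score. The document ids, the marginal values and θ = 2.0 are made up, and ranking directly by C(u, v) is an assumption about how the joint estimate is used.

```python
def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula C(u, v) for theta > 0 and u, v in [0, 1]."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if u == 0.0 or v == 0.0:
        return 0.0  # a copula is zero as soon as one argument is zero
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Hypothetical documents: (id, topicality cdf value, subjectivity cdf value).
docs = [("d1", 0.9, 0.2), ("d2", 0.6, 0.7), ("d3", 0.3, 0.95)]

# Rank by the joint score; theta = 2.0 is an arbitrary illustrative value.
ranked = sorted(docs, key=lambda d: clayton_cdf(d[1], d[2], theta=2.0), reverse=True)
ranked_ids = [d[0] for d in ranked]
```

Because the Clayton family puts its dependence in the lower tail, a document that scores very low on either criterion is pulled down strongly, so the balanced document d2 ends up ahead of the two lopsided ones.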

Page 16: Copulas for Information Retrieval (SIGIR'13)

3 Multivariate Relevance Estimation

Page 17: Copulas for Information Retrieval (SIGIR'13)

Joint Relevance Estimation

• Estimate marginal distributions from data

• Estimate copula fitting parameters to maximize posterior probability of observing data

• Use copula to represent joint probability of relevance
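The three steps above can be sketched as follows. This is a minimal illustration under several assumptions: the marginals are estimated with rank-based empirical cdfs, the copula is a bivariate Clayton copula, and plain maximum likelihood over a coarse grid stands in for the posterior maximization named on the slide. All function names and the sample scores are hypothetical.

```python
import math

def empirical_cdf(values):
    """Pseudo-observations rank / (n + 1), strictly inside (0, 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

def clayton_log_density(u, v, theta):
    """Log of the bivariate Clayton copula density for theta > 0."""
    s = u ** -theta + v ** -theta - 1.0
    return (math.log1p(theta)
            - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))

def fit_clayton_theta(us, vs, grid=None):
    """Pick the theta on a grid that maximizes the copula log-likelihood."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # theta from 0.05 to 10
    def loglik(theta):
        return sum(clayton_log_density(u, v, theta) for u, v in zip(us, vs))
    return max(grid, key=loglik)

# Hypothetical per-document scores for two relevance dimensions.
topicality = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
subjectivity = [0.2, 0.3, 0.5, 0.6, 0.9, 0.8]

theta = fit_clayton_theta(empirical_cdf(topicality), empirical_cdf(subjectivity))
```

A grid search keeps the sketch dependency-free; a numerical optimizer would be the natural replacement on real data.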

Page 18: Copulas for Information Retrieval (SIGIR'13)

Joint Relevance Estimation

• We study three different scenarios:

• Opinionated blog posts

• Personalized bookmarks

• Child-friendly websites

• Use original training portion of the corpora where available

• A 90/10 split otherwise

Page 19: Copulas for Information Retrieval (SIGIR'13)

Results I – Opinionated Blog Posts

• TREC Blogs08 dataset

• 1.3 M documents

• Relevance dimensions: Topicality & Subjectivity

• Significantly higher performance than the linear combination model

Page 20: Copulas for Information Retrieval (SIGIR'13)

Results II – Personalized Bookmarks

• Dataset by Vallet & Castells

• 339k documents

• Relevance Dimensions: Topicality & Personal relevance

• Significant performance gains in some metrics

Page 21: Copulas for Information Retrieval (SIGIR'13)

Results III – Child-friendly Websites

• Dataset from the PuppyIR project (http://puppyir.eu)

• 22k documents

• Relevance Dimensions: Topicality & Child-suitability

• Worse-than-baseline performance

Page 22: Copulas for Information Retrieval (SIGIR'13)

4 Copulas – When to use them?

Page 23: Copulas for Information Retrieval (SIGIR'13)

When to use them?

• Previously: Strongly varying performance for different settings

• Is there a way of predicting the merit?

• Recall: copulas model tail dependencies between dimensions

Page 24: Copulas for Information Retrieval (SIGIR'13)

Types of Tail Dependencies

Page 25: Copulas for Information Retrieval (SIGIR'13)

Measuring Tail Dependencies

• Following Frees and Valdez (1998), the indices IL and IU measure the strength of lower and upper tail dependencies

• Anderson-Darling test of goodness-of-fit between copula and observed data

| Domain | Frees tail index | Anderson-Darling | Actual retrieval performance |
|---|---|---|---|
| Opinionated Blogs | IL = 0.07 | 0.67 | Copulas > linear |
| Personalized Bookmarks | IU = 0.49 | 0.47 | Copulas = linear |
| Child-friendly Websites | IL = IU = 0 | 0.046 | Copulas < linear |
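To give a feel for what a tail-dependence measure captures, the sketch below computes a simple empirical estimate: the share of observations that are extreme in the second dimension among those that are extreme in the first. This is a generic estimator chosen for illustration, not the Frees and Valdez index reported in the table; the function names and the threshold q are assumptions.

```python
def lower_tail_dependence(us, vs, q=0.1):
    """Empirical P(V <= q | U <= q) for values already mapped to [0, 1]."""
    in_tail = [v for u, v in zip(us, vs) if u <= q]
    if not in_tail:
        return 0.0
    return sum(1 for v in in_tail if v <= q) / len(in_tail)

def upper_tail_dependence(us, vs, q=0.1):
    """Empirical P(V >= 1 - q | U >= 1 - q) for values in [0, 1]."""
    t = 1.0 - q
    in_tail = [v for u, v in zip(us, vs) if u >= t]
    if not in_tail:
        return 0.0
    return sum(1 for v in in_tail if v >= t) / len(in_tail)

# Hypothetical cdf-transformed scores for two relevance dimensions.
u = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
same = list(u)                 # moves together with u in both tails
opposite = list(reversed(u))   # moves against u in both tails

together_low = lower_tail_dependence(u, same, q=0.25)
together_high = upper_tail_dependence(u, same, q=0.25)
against_low = lower_tail_dependence(u, opposite, q=0.25)
```

Values near 1 indicate that extreme scores in one dimension tend to coincide with extreme scores in the other; values near 0 indicate no such co-movement in that tail.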

Page 26: Copulas for Information Retrieval (SIGIR'13)

5 Copulas for Score Fusion

Page 27: Copulas for Information Retrieval (SIGIR'13)

Score Fusion

• A different angle on relevance estimation

• Combine individual retrieval system scores instead of modelling relevance from content criteria

• In this setting, submissions to historic TRECs serve as criteria

• We randomly draw k individual runs and combine them using copulas

Page 28: Copulas for Information Retrieval (SIGIR'13)

Fusion Methods

• Established:

• Copula-based:
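The established methods named on the following slides, CombSUM and CombMNZ, can be sketched as below. CombSUM adds up a document's scores across runs; CombMNZ multiplies that sum by the number of runs that returned the document. The run contents are hypothetical, scores are assumed to be already normalized to a common range, and the copula-based variants are not reproduced here.

```python
from collections import defaultdict

def comb_sum(runs):
    """CombSUM: sum of a document's scores over all runs that returned it."""
    fused = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            fused[doc_id] += score
    return dict(fused)

def comb_mnz(runs):
    """CombMNZ: CombSUM times the number of runs with a non-zero score."""
    sums = comb_sum(runs)
    hits = defaultdict(int)
    for run in runs:
        for doc_id, score in run.items():
            if score != 0:
                hits[doc_id] += 1
    return {doc_id: total * hits[doc_id] for doc_id, total in sums.items()}

# Hypothetical normalized scores from two retrieval runs.
runs = [
    {"d1": 0.5, "d2": 0.2},
    {"d1": 0.3, "d3": 0.4},
]

sum_scores = comb_sum(runs)
mnz_scores = comb_mnz(runs)
mnz_ranking = sorted(mnz_scores, key=mnz_scores.get, reverse=True)
```

CombMNZ rewards documents that several runs agree on, which is why d1 pulls further ahead of d3 than under CombSUM.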

Page 29: Copulas for Information Retrieval (SIGIR'13)

Results – TREC 4

• Results are averaged across 200 randomizations per setting of k

• Relative improvements over the best, worst and median fused run in terms of percentages of MAP

• Small but consistent improvements over non-copula fusion baselines

Page 30: Copulas for Information Retrieval (SIGIR'13)

Robustness - CombSUM

• Fusion approaches are often sensitive to weak contributions

• We control the number of weak submissions added to the fusion

• Copulas’ explicit modeling of dependency structure is more robust

Page 31: Copulas for Information Retrieval (SIGIR'13)

Robustness - CombMNZ

• Fusion approaches are often sensitive to weak contributions

• We control the number of weak submissions added to the fusion

• Copulas’ explicit modeling of dependency structure is more robust

Page 32: Copulas for Information Retrieval (SIGIR'13)

6 Conclusion and Future Directions

Page 33: Copulas for Information Retrieval (SIGIR'13)

Conclusion

• Copulas decouple observations and dependencies

• IR models are good at estimating marginals

• Copulas are good at combining them

• We use them for multivariate relevance estimation

• Strongly scenario-dependent performance

• Tail indices & goodness-of-fit tests as estimators of expected performance

• Copulas for score fusion

• Robust to outliers

Page 34: Copulas for Information Retrieval (SIGIR'13)

The Road Ahead

• Currently, we use single copulas for relevance modelling

• Copula mixtures and composite Archimedean copulas for higher accuracy

• Here, we use pre-existing copula families and fit them to data

• Instead, can we formalize copulas from scratch to include domain knowledge?

• So far, we explored two-dimensional relevance spaces

• What happens as we move into higher-order systems?

Page 35: Copulas for Information Retrieval (SIGIR'13)

Thank You!