Copulas for Information Retrieval (SIGIR'13)

Copulas for Information Retrieval

Carsten Eickhoff, Arjen P. de Vries, Kevyn Collins-Thompson

Copulas – What is it all about?

• Assume two sufficiently different commodities

• Rare elemental metals

• Pork bellies

• No apparent correlations

[Figure: Rare Earths vs. Pork Bellies]

Copulas – What is it all about?

• Two seemingly independent variables

• Yet, for rare extreme cases, there are co-movements

• “Tail dependencies”

• Copulas decouple observations and dependencies

• IR models are good at estimating marginals

• Copulas are good at combining them

Overview

1. Non-linear Dependency Structures in IR

2. Copulas – Intuition & Background

3. Multivariate Relevance Estimation

4. When to use them?

5. Score Fusion

6. Conclusion & Future Directions

1 Non-Linear Dependency Structures in IR

Multivariate Relevance Modelling

• IR systems index and retrieve a growing variety of document types

• Many structured, or at least “complex”

• Single-criteria relevance frameworks do not perform well

• Multi-criteria models tend to be either:

a) Naïve (e.g., independence assumption), or

b) Hard for humans to interpret qualitatively (e.g., learning to rank, L2R)

Non-Linear Dependencies

• Non-linear dependency structures are still a challenge

• TREC 2010 Faceted Blog Distillation Task, Topic 1171, “mysql”

• Relevance criteria:

• Topicality

• Subjectivity


• Pearson’s ρ = 0.18

• So, there is no real dependency

• …right?




• In the lower third of the scale, we note ρ = 0.37

• And in the upper third, it turns to ρ = -0.4
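
The effect described on this slide can be checked with a few lines of code. The sketch below is only an illustration: it assumes that "thirds of the scale" means splitting the observations by the first criterion's score, and the helper names `pearson` and `correlation_by_thirds` are hypothetical, not taken from the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def correlation_by_thirds(xs, ys):
    """Correlation overall and within the lower/upper third of xs (assumed split)."""
    pairs = sorted(zip(xs, ys))          # order observations by the first criterion
    third = len(pairs) // 3
    lower, upper = pairs[:third], pairs[-third:]
    return {
        "overall": pearson(xs, ys),
        "lower": pearson(*zip(*lower)),  # unzip the pairs back into two sequences
        "upper": pearson(*zip(*upper)),
    }

# Synthetic toy data (not the TREC scores): rising at the low end, falling at the high end.
toy_x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 1, 2, 2, 2, 2, 2, 1, 0]
result = correlation_by_thirds(toy_x, toy_y)
```

On such data the overall coefficient can be close to zero even though both ends of the scale are strongly (and oppositely) correlated, which is the pattern a single global ρ hides.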

2 Copulas – Intuition & Background

Copulas (from copulare, to join)

• Copulas model complex non-linear dependencies between variables that simple correlations can't capture

• Decouple marginal distributions from dependency structure

• Approximate joint multivariate distributions

• Applied previously in portfolio and risk management, meteorology, river flooding predictions, …

Formal Basics

• Given a k-dimensional rv

• Map to unit cube

• Describe joint cdf with copula

• Isolation of a component

• Copula’s zero
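
The formulas that accompanied these bullets did not survive in this text. As a hedged reconstruction, the standard textbook definitions matching each bullet can be written as follows; the symbols X, F_i, U and C are assumed notation and may differ from the original slide.

```latex
\begin{align*}
X &= (X_1, \dots, X_k)
  && \text{a $k$-dimensional random vector} \\
U &= (U_1, \dots, U_k) = \bigl(F_1(X_1), \dots, F_k(X_k)\bigr) \in [0,1]^k
  && \text{map to the unit cube via the marginal cdfs } F_i \\
C(u_1, \dots, u_k) &= P(U_1 \le u_1, \dots, U_k \le u_k)
  && \text{joint cdf described by the copula} \\
C(1, \dots, 1, u_i, 1, \dots, 1) &= u_i
  && \text{isolation of a component} \\
C(u_1, \dots, u_k) &= 0 \quad \text{if any } u_i = 0
  && \text{the copula's zero}
\end{align*}
```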

Closing the circle

• Recall the example TREC topic 1171

• Linear combination: AP = 0.14, below collection average (0.25)

• Fit Clayton copula to model joint relevance distribution

• AP rises to 0.22

3 Multivariate Relevance Estimation

Joint Relevance Estimation

• Estimate marginal distributions from data

• Estimate copula fitting parameters to maximize posterior probability of observing data

• Use copula to represent joint probability of relevance
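
The three steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: marginals are estimated with a plain empirical cdf, the copula is a bivariate Clayton copula, and the parameter is chosen by maximum likelihood over a fixed grid as a simple stand-in for the estimation procedure named on the slide. All function names are hypothetical.

```python
import math

def pseudo_observations(values):
    """Step 1: map raw scores into (0, 1) via the empirical cdf, rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula C(u, v), defined here for theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def clayton_log_density(u, v, theta):
    """Log of the bivariate Clayton copula density for theta > 0."""
    s = u ** -theta + v ** -theta - 1.0
    return (math.log1p(theta)
            - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))

def fit_clayton(us, vs, grid=None):
    """Step 2: choose theta by grid-search maximum likelihood (assumed estimator)."""
    grid = grid or [0.05 * i for i in range(1, 201)]
    return max(grid, key=lambda t: sum(clayton_log_density(u, v, t)
                                       for u, v in zip(us, vs)))

# Synthetic, perfectly co-moving toy scores (not data from the talk).
topicality = [0.2, 0.9, 0.4, 0.7, 0.5]
subjectivity = [1.0, 8.0, 3.0, 6.0, 4.0]
us = pseudo_observations(topicality)
vs = pseudo_observations(subjectivity)
theta = fit_clayton(us, vs)

# Step 3: use the fitted copula as a joint score for each document.
joint = [clayton_cdf(u, v, theta) for u, v in zip(us, vs)]
```

A grid search keeps the sketch dependency-free; a numerical optimizer would be the more usual choice for real data.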

Joint Relevance Estimation

• We study three different scenarios:

• Opinionated blog posts

• Personalized bookmarks

• Child-friendly websites

• Use original training portion of the corpora where available

• A 90/10 split otherwise

Results I – Opinionated Blog Posts

• TREC Blogs08 dataset

• 1.3 M documents

• Relevance dimensions: Topicality & Subjectivity

• Significantly higher performance than linear combination model

Results II – Personalized Bookmarks

• Dataset by Vallet & Castells

• 339k documents

• Relevance Dimensions: Topicality & Personal relevance

• Significant performance gains in some metrics

Results III – Child-friendly Websites

• Dataset from the PuppyIR project (http://puppyir.eu)

• 22k documents

• Relevance Dimensions: Topicality & Child-suitability

• Worse-than-baseline performance

4 Copulas – When to use them?

When to use them?

• Previously: Strongly varying performance for different settings

• Is there a way of predicting the merit?

• Recall: copulas model tail dependencies between dimensions

Types of Tail Dependencies

Measuring Tail Dependencies

• Following Frees and Valdez (1998), the tail indices IL and IU measure the strength of lower and upper tail dependencies

• Anderson-Darling test of goodness-of-fit between copula and observed data

Domain                    Frees tail index   Anderson-Darling   Actual retrieval performance
Opinionated Blogs         IL = 0.07          0.67               Copulas > linear
Personalized Bookmarks    IU = 0.49          0.47               Copulas = linear
Child-friendly Websites   IL = IU = 0        0.046              Copulas < linear
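
To get a feel for what a tail-dependence measurement looks at, the sketch below computes a simple empirical tail co-exceedance rate on rank-transformed scores. This is explicitly not the Frees and Valdez tail index reported in the table; it is a swapped-in, commonly used diagnostic, and the names `ranks01` and `tail_coexceedance` as well as the threshold `q` are assumptions made for illustration.

```python
def ranks01(values):
    """Rank-transform scores into (0, 1): rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

def tail_coexceedance(xs, ys, q=0.25):
    """Return (lower, upper): the share of observations in the lower (upper)
    q-tail of xs that also fall in the lower (upper) q-tail of ys."""
    u, v = ranks01(xs), ranks01(ys)
    low = [b for a, b in zip(u, v) if a <= q]
    high = [b for a, b in zip(u, v) if a >= 1 - q]
    lower = sum(b <= q for b in low) / len(low) if low else 0.0
    upper = sum(b >= 1 - q for b in high) / len(high) if high else 0.0
    return lower, upper

# Synthetic toy scores: identical orderings versus reversed orderings.
scores = list(range(1, 10))
same = tail_coexceedance(scores, scores)
reversed_scores = tail_coexceedance(scores, scores[::-1])
```

Values near 1 in either position suggest co-movement in that tail; values near 0 suggest none, which is the situation where a copula has little extra structure to exploit.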

5 Copulas for Score Fusion

Score Fusion

• A different angle on relevance estimation

• Combine individual retrieval system scores instead of modelling relevance from content criteria

• In this setting, submissions to historic TRECs serve as criteria

• We randomly draw k individual runs and combine them using copulas

Fusion Methods

• Established:

• Copula-based:
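
The method definitions on this slide were lost in extraction. Since the later slides name CombSUM and CombMNZ as baselines, the sketch below shows their usual textbook form only; the copula-based variants are not reconstructed here. It assumes each run is a `{doc_id: score}` dictionary, uses min-max normalization per run, and counts a run as a "hit" for a document whenever it retrieved that document. These choices and all names are illustrative assumptions.

```python
def min_max_normalize(run):
    """Rescale one run's scores ({doc_id: score}) to the range [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    span = hi - lo
    # A run with identical scores everywhere is mapped to 1.0 for all documents.
    return {doc: (score - lo) / span if span else 1.0 for doc, score in run.items()}

def comb_sum(runs):
    """CombSUM: sum of the normalized scores a document receives across runs."""
    fused = {}
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return fused

def comb_mnz(runs):
    """CombMNZ: CombSUM multiplied by the number of runs that retrieved the document."""
    sums = comb_sum(runs)
    hits = {doc: sum(doc in run for run in runs) for doc in sums}
    return {doc: sums[doc] * hits[doc] for doc in sums}

# Two synthetic toy runs (not TREC submissions).
toy_runs = [{"a": 3.0, "b": 1.0}, {"a": 10.0, "c": 0.0}]
fused_sum = comb_sum(toy_runs)
fused_mnz = comb_mnz(toy_runs)
ranking = sorted(fused_mnz, key=fused_mnz.get, reverse=True)
```

Both baselines treat the runs as independent contributions, which is the assumption a copula-based fusion relaxes by modeling the dependency between run scores explicitly.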

Results – TREC 4

• Results are averaged across 200 randomizations per setting of k

• Relative improvements over the best, worst and median fused run in terms of percentages of MAP

• Small but consistent improvements over non-copula fusion baselines

Robustness - CombSUM

• Fusion approaches are often sensitive to weak contributions

• We control the number of weak submissions added to the fusion

• Copulas’ explicit modeling of dependency structure is more robust

Robustness - CombMNZ

• Fusion approaches are often sensitive to weak contributions

• We control the number of weak submissions added to the fusion

• Copulas’ explicit modeling of dependency structure is more robust

6 Conclusion and Future Directions

Conclusion

• Copulas decouple observations and dependencies

• IR models are good at estimating marginals

• Copulas are good at combining them

• We use them for multivariate relevance estimation

• Strongly scenario-dependent performance

• Tail indices & goodness-of-fit tests as estimators of expected performance

• Copulas for score fusion

• Robust to outliers

The Road Ahead

• Currently, we use single copulas for relevance modelling

• Copula mixtures and composite Archimedean copulas for higher accuracy

• Here, we use pre-existing copula families and fit them to data

• Instead, can we formalize copulas from scratch to include domain knowledge?

• So far, we explored two-dimensional relevance spaces

• What happens as we move into higher-order systems?

Thank You!
