
A Pólya Urn Document Language Model for Information Retrieval

Ronan Cummins

Cambridge, 2014

Outline

Introduction

Vector Space Models

Language Modelling

Pólya Urn Model

Some Results

Analysis

Conclusions

Quote

“An author writes not only by processes of association – i.e. sampling earlier segments of the word sequence – but also by process of imitation – i.e. sampling segments of word sequences from other works he has written, from works of other authors, and, of course, from sequences he has heard.”

— Herbert A. Simon (1955)

Motivation

- Bag of words
- Term-dependencies
  - Improves retrieval effectiveness (+)
  - Leads to more complex models (-)
  - ClueWeb09 (1 billion documents)
- Can we create a retrieval model that includes dependencies but without any additional cost?


Two Kinds of Term Dependency

Traditional dependencies

- Captain Beefheart
- Che Guevara

Word Burstiness

- A different kind of dependency
- "Cycling on the footpath is dangerous. A footpath is ..."
- Synonyms: {footpath, pavement, sidewalk}
- Preference for the word already used


Word Burstiness

- Initial choice of a word to describe a ‘concept’ affects subsequent usage
- The tendency of an otherwise rare word to occur multiple times in a document (Church, 1995; Madsen, 2005)
- A form of preferential attachment (e.g. ‘the rich get richer’)
- A generative language model that includes preferential attachment better explains Zipfian (power-law) characteristics (Simon, 1955; Mitzenmacher, 2004)
- Two-stage language models (Goldwater et al., 2011)

Outline

Introduction

Vector Space Models

Language Modelling

Pólya Urn Model

Some Results

Analysis

Conclusions

VSM

Figure: vector space example

Tradition

- Place documents and queries in a multidimensional term space
- Use measures of closeness in the space as measures of similarity
- Conceptually useful
- But?
  - What weights to use?
  - What matching function to use?
  - Experiments tell us that the cosine matching function is very poor
  - Linear tf and idf has very poor performance
  - What did we gain from the VSM other than an inner-product matching function?


Outline

Introduction

Vector Space Models

Language Modelling

Pólya Urn Model

Some Results

Analysis

Conclusions

Language Modelling for Retrieval

- First approaches appeared in 1998 (Ponte and Croft, 1998; Hiemstra, 1998)
- Relevance-based approaches (Lavrenko, 2001)
- Studies of smoothing (Zhai and Lafferty, 2001)
- Dirichlet compound multinomial relevance language model (Xu and Akella, 2008)
- Positional language models (Lv and Zhai, 2009)
- The state-of-the-art unigram model uses a Dirichlet prior on the background multinomial updated with a document (Zhai and Lafferty, 2004)

Language Modelling for Retrieval

I First approaches appeared in 1998 (Ponte and Croft, 1998;Hiemstra, 1998)

I Relevance-based approaches (Lavrenko, 2001)I Studies of smoothing (Zhai and Lafferty, 2001)I Dirichlet compound multinomial relevance language model

(Xu and Akella, 2008)I Positional language models (Lv and Zhai, 2009)I State-of-the-art unigram model uses a Dirichlet prior on the

background multinomial updated with a document (Zhaiand Lafferty, 2004)

Query-Likelihood Model

- Rank documents d in order of the likelihood of their model Md generating the query string q
- General ranking principle for a probabilistic language model

$$p(q \mid M_d = \theta_{dm}) = \prod_{t \in q} p(t \mid \theta_{dm})^{c(t,q)} \qquad (1)$$

$$\log p(q \mid M_d = \theta_{dm}) = \sum_{t \in q} \log p(t \mid \theta_{dm}) \cdot c(t,q) \qquad (2)$$
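A minimal sketch (not from the slides) of the unsmoothed query-likelihood score in eqs (1)-(2); the function name and the toy term lists are illustrative only.

```python
import math
from collections import Counter

def log_query_likelihood(query_terms, doc_terms):
    """Unsmoothed log query likelihood (eqs 1-2):
    sum over query terms of c(t, q) * log p(t | theta_dm),
    with p(t | theta_dm) = c(t, d) / |d|."""
    doc_counts = Counter(doc_terms)
    doc_len = sum(doc_counts.values())
    score = 0.0
    for t, c_tq in Counter(query_terms).items():
        p = doc_counts[t] / doc_len
        if p == 0.0:
            return float("-inf")  # one unseen query term zeroes the whole product
        score += c_tq * math.log(p)
    return score

print(log_query_likelihood(["polya", "urn"], ["polya", "urn", "model", "urn"]))
print(log_query_likelihood(["polya", "urn", "retrieval"], ["polya", "urn", "model"]))  # -inf
```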


Query Likelihood


Zero probabilities are especially problematic for longer queries

Smoothing I

- Avoids over-fitting

$$p(t \mid \hat\theta_{dm}) = (1 - \pi) \cdot p(t \mid \hat\theta_d) + \pi \cdot p(t \mid \hat\theta_c) \qquad (3)$$

- Dirichlet prior smoothing

$$\pi_{dir} = \frac{\mu}{\mu + |d|} \qquad (4)$$
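A small sketch, assuming a toy background distribution, of Dirichlet-prior smoothing as in eqs (3)-(4); the parameter value and out-of-vocabulary floor are illustrative choices, not from the slides.

```python
import math
from collections import Counter

def dirichlet_smoothed_score(query_terms, doc_terms, background_probs, mu=2000.0):
    """Query likelihood with Dirichlet-prior smoothing (eqs 3-4):
    p(t | theta_dm) = (1 - pi) * c(t, d)/|d| + pi * p(t | theta_c),
    where pi = mu / (mu + |d|)."""
    doc_counts = Counter(doc_terms)
    doc_len = sum(doc_counts.values())
    pi = mu / (mu + doc_len)
    score = 0.0
    for t, c_tq in Counter(query_terms).items():
        p_doc = doc_counts[t] / doc_len
        p_bg = background_probs.get(t, 1e-9)  # tiny floor for out-of-vocabulary terms
        score += c_tq * math.log((1.0 - pi) * p_doc + pi * p_bg)
    return score

bg = {"polya": 0.001, "urn": 0.0005, "model": 0.01, "retrieval": 0.002}
print(dirichlet_smoothed_score(["polya", "urn", "retrieval"],
                               ["polya", "urn", "model", "urn"], bg))
```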

Smoothing II

Figure: Document 1 (sample), Document 2 (sample), and the background model, scored against an example query.

Rank document 2 higher than document 1

Overview

- We can derive a retrieval function (and principled term-weights) using language models, unlike the VSM
- It can be viewed as a form of unsupervised machine learning
- The multinomial model is efficient to estimate, and with Dirichlet prior smoothing it is the state-of-the-art in terms of retrieval effectiveness
- It forms the basis of many applications
- It does not model term-dependencies
- The model using a Dirichlet prior has a free parameter (i.e. µ)

Outline

Introduction

Vector Space Models

Language Modelling

Pólya Urn Model

Some Results

Analysis

Conclusions

Multivariate Pólya Urn

Figure: model (urn) and document (sample)

Sampling with reinforcement: the rich get richer

Multivariate Pólya Distribution

- The multivariate Pólya distribution (Dirichlet-compound-multinomial or DCM)
- Instead of the multinomial in the original query-likelihood model we can use the DCM

$$p(d \mid \alpha) = \int_{\theta} p(d \mid \theta)\, p(\theta \mid \alpha)\, d\theta \qquad (5)$$

- Parameter vector α can be interpreted as the initial number of balls of each colour in the urn

Parameterisation

$$\alpha_d = m_d \cdot \theta_d = (m_d \cdot p(t_1 \mid \theta_d),\, m_d \cdot p(t_2 \mid \theta_d),\, \ldots,\, m_d \cdot p(t_v \mid \theta_d)) \qquad (6)$$

- θ_d can be seen as the expectation
- m_d can be seen as the scale (variance)
- Low m_d implies high burstiness
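An illustrative simulation sketch (not part of the original slides) of drawing from a multivariate Pólya urn initialised with α = m·θ: each drawn colour is returned with one extra ball, so low m yields bursty samples while high m approaches ordinary multinomial sampling. The synonym vocabulary and all numbers are made up.

```python
import random
from collections import Counter

def sample_polya_urn(theta, m, n_draws, seed=0):
    """Draw n_draws balls from a multivariate Polya urn initialised with
    alpha = m * theta balls; each drawn colour is reinforced by one extra ball."""
    rng = random.Random(seed)
    colours = list(theta)
    weights = [m * theta[c] for c in colours]  # initial (possibly fractional) mass
    sample = Counter()
    for _ in range(n_draws):
        c = rng.choices(colours, weights=weights)[0]
        sample[c] += 1
        weights[colours.index(c)] += 1.0  # the rich get richer
    return sample

theta = {"footpath": 0.34, "pavement": 0.33, "sidewalk": 0.33}
print(sample_polya_urn(theta, m=1.0, n_draws=20))     # low m: bursty, one synonym dominates
print(sample_polya_urn(theta, m=1000.0, n_draws=20))  # high m: close to multinomial sampling
```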

Some properties

- Subsequent balls drawn from the urn are identically distributed but dependent
- Each sample (document) can be modelled using a multinomial
- Each time you restart the process to draw a sample, you draw from a different multinomial
- The process is exchangeable
- Generates power-law characteristics of term-frequencies
- Estimating a DCM from multiple samples (i.e. multiple documents) is computationally expensive (i.e. no closed-form solutions)

The SPUD Language Model

- Use the Pólya urn as a model for document generation
- Documents are known to exhibit burstiness
- Estimate the document and background models as before, but with different model assumptions
- Retain the multinomial as the model for query generation

Background Model I

Background Model II

- The background model is the most likely single model to have generated all documents
- Given all documents, find the DCM parameters
- Elkan (2006) has shown that close approximations to the model parameters are proportional to the number of samples in which an observation appears (EDCM)
- Essentially, documents exhibit quite a lot of word burstiness

Background Model II

- This is a useful result, as close estimates of the background parameters will be proportional to:

$$p(t \mid \hat\theta'_c) = \frac{df_t}{\sum_{t'} df_{t'}} = \frac{df_t}{\sum_{j}^{n} |\vec d_j|} \qquad (7)$$

- With only m_c remaining to be estimated using Newton's method

$$\hat\alpha_c = (m_c \cdot p(t_1 \mid \hat\theta'_c),\, m_c \cdot p(t_2 \mid \hat\theta'_c),\, \ldots,\, m_c \cdot p(t_v \mid \hat\theta'_c)) \qquad (8)$$
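A hedged sketch of the background expectation in eq (7), computed from document frequencies; it does not cover the Newton's-method estimate of m_c, and the function name and toy corpus are illustrative.

```python
from collections import Counter

def background_model(documents):
    """Approximate EDCM background expectation (eq 7):
    p(t | theta_c') = df_t / sum_t' df_t', where df_t is the number of
    documents containing t (equivalently, sum_j |d_j| distinct-term slots)."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))      # each document contributes at most 1 per term
    total = sum(df.values())      # = sum over documents of their distinct-term counts
    return {t: df_t / total for t, df_t in df.items()}

docs = [["polya", "urn", "urn", "model"],
        ["urn", "retrieval", "model"],
        ["language", "model", "model"]]
print(background_model(docs))
```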

Document Model I

Document Model II

- With only one sample we cannot estimate the parameters of a DCM
- We can estimate the expectation of the DCM, but what is m_d?
- Thought experiment: what is the minimum initial mass of the urn (i.e. number of balls) that could have generated d?
- We set m_d to the number of unique terms in the document (its lower bound).

$$\hat\alpha_d = (|\vec d| \cdot p(t_1 \mid \hat\theta_d),\, |\vec d| \cdot p(t_2 \mid \hat\theta_d),\, \ldots,\, |\vec d| \cdot p(t_v \mid \hat\theta_d)) \qquad (9)$$


Remaining Parameters

- Linearly combine the two models using one parameter ω
- We can experimentally tune ω

$$\mathrm{SPUD} = \omega \cdot \alpha_c + (1 - \omega) \cdot \alpha_d \qquad (10)$$

$$\hat\alpha_c = (m_c \cdot p(t_1 \mid \hat\theta'_c),\, \ldots,\, m_c \cdot p(t_v \mid \hat\theta'_c)) \qquad (11)$$

$$\hat\alpha_d = (|\vec d| \cdot p(t_1 \mid \hat\theta_d),\, \ldots,\, |\vec d| \cdot p(t_v \mid \hat\theta_d)) \qquad (12)$$
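A minimal sketch of eqs (9)-(12): build the document urn from its distinct terms (m_d = |d⃗|) and combine it linearly with the background urn using ω. The function names, the example m_c, and the toy data are assumptions for illustration; in the model m_c would come from Newton's method.

```python
from collections import Counter

def document_alpha(doc_terms):
    """Eq 9: alpha_d[t] = |d_vec| * p(t | theta_d); with m_d set to the number
    of unique terms, this is c(t, d) * |d_vec| / |d|."""
    counts = Counter(doc_terms)
    doc_len = sum(counts.values())
    n_distinct = len(counts)
    return {t: n_distinct * c / doc_len for t, c in counts.items()}

def spud_alpha(alpha_d, background_probs, m_c, omega=0.8):
    """Eq 10: linear combination of background urn (eq 11) and document urn (eq 12)."""
    vocab = set(alpha_d) | set(background_probs)
    return {t: omega * m_c * background_probs.get(t, 0.0)
               + (1.0 - omega) * alpha_d.get(t, 0.0)
            for t in vocab}

alpha_d = document_alpha(["polya", "urn", "urn", "model"])
bg = {"polya": 0.2, "urn": 0.1, "model": 0.4, "retrieval": 0.3}
print(spud_alpha(alpha_d, bg, m_c=500.0))  # m_c would be estimated by Newton's method
```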


Outline

Introduction

Vector Space Models

Language Modelling

Pólya Urn Model

Some Results

Analysis

Conclusions

Questions

- How effective is the new model in terms of retrieval?
- How effective is Newton's method at automatically determining the free parameter in the background model?
- Why?

Effectiveness MAP

- Optimally tuning the one free parameter in each function
- All increases are statistically significant (for SPUD v MQL)

Figure: MAP on Newswire and Web datasets (title-only queries) for DCM-L-T, MQL, and SPUD on news, wt2g, wt10g, gov2, mq-07, and mq-08.

Effectiveness NDCG@20

- Optimally tuning the one free parameter in each function
- All increases are statistically significant

Figure: NDCG@20 on Newswire and Web datasets (title-only queries) for DCM-L-T, MQL, and SPUD on news, wt2g, wt10g, gov2, mq-07, and mq-08.

Newton’s Method and Tuning

- Mixing parameter is robust at ω = 0.8

Figure: MAP as a function of ω on the ohsu, robust-04, and trec-8 collections (title queries).

Newton’s Method

- Mixing parameter is set to ω = 0.8
- Tuned m_c vs m_c estimated using Newton's method

Figure: MAP on Newswire and Web datasets (title-only queries) for SPUD (tuned) and SPUD (ω = 0.8) on news, wt2g, wt10g, gov2, mq-07, and mq-08.

Outline

Introduction

Vector Space Models

Language Modelling

Pólya Urn Model

Some Results

Analysis

Conclusions

Scope Hypothesis

- One of two hypotheses proposed that aim to explain the interaction between document length and topicality
- Documents vary in length due to some documents covering more topics (Robertson & Walker, 1994)
- Relevance is likely affected by this aspect of document length

Verbosity Hypothesis

- Documents vary in length due to verbosity (Robertson & Walker, 1994)
- Some documents are just more ‘wordy’
- This aspect of document length is independent of topic, and therefore, relevance
- In reality documents may vary in length due to a combination of these two hypotheses
- No formal means of capturing whether a retrieval function adheres to this intuition has been proposed (as far as I know)

Axiomatic Analysis

LNC2* Constraint
Given a ranking function s(q, d) that scores a document d with respect to a query q, if d' is created by concatenating d with itself k times, then s(q, d') = s(q, d).

In other words, you cannot change the ranking of a document by concatenating it with itself k times (where k > 1).


Comparison

Multinomial

$$\mathrm{MULT}_{dir}(q,d) = \sum_{t \in q} \log\left(\frac{|d|}{|d|+\mu} \cdot \frac{c(t,d)}{|d|} + \frac{\mu}{|d|+\mu} \cdot \frac{cf_t}{|C|}\right) \cdot c(t,q) \qquad (13)$$

Violation I

Figure: Document 1 (sample), Document 2 (sample), and the background model, scored against an example query.

Document 2 is ranked higher than document 1 (X)

Violation II

Figure: Document 1 (sample), Document 2 (sample), and the background model, scored against an example query.

Document 1 is ranked higher than document 2 (X)

Comparison

Multinomial

$$\mathrm{MULT}_{dir}(q,d) = \sum_{t \in q} \log\left(\frac{|d|}{|d|+\mu} \cdot \frac{c(t,d)}{|d|} + \frac{\mu}{|d|+\mu} \cdot \frac{cf_t}{|C|}\right) \cdot c(t,q) \qquad (14)$$

SPUD

$$\mathrm{SPUD}_{dir}(q,d) = \sum_{t \in q} \log\left(\frac{|\vec d|}{|\vec d|+\mu'} \cdot \frac{c(t,d)}{|d|} + \frac{\mu'}{|\vec d|+\mu'} \cdot \frac{df_t}{\sum_{t'} df_{t'}}\right) \cdot c(t,q) \qquad (15)$$
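A hedged Python sketch of eqs (14)-(15) (not the author's implementation), followed by a numerical check of the LNC2* behaviour: under SPUD, c(t,d)/|d| and |d⃗| are unchanged when a document is concatenated with itself, so its score is invariant, whereas the multinomial score is not. The collection statistics and smoothing parameters below are made-up illustrative values.

```python
import math
from collections import Counter

def mult_dir(query, doc, coll_cf, coll_size, mu=2000.0):
    """Eq 14: multinomial query likelihood with Dirichlet smoothing."""
    counts = Counter(doc)
    dl = sum(counts.values())
    score = 0.0
    for t, c_tq in Counter(query).items():
        p = (dl / (dl + mu)) * (counts[t] / dl) \
            + (mu / (dl + mu)) * (coll_cf.get(t, 0) / coll_size)
        score += c_tq * math.log(p)
    return score

def spud_dir(query, doc, df, df_total, mu_prime=1000.0):
    """Eq 15: SPUD query likelihood; |d_vec| is the number of distinct terms in d."""
    counts = Counter(doc)
    dl = sum(counts.values())
    dv = len(counts)
    score = 0.0
    for t, c_tq in Counter(query).items():
        p = (dv / (dv + mu_prime)) * (counts[t] / dl) \
            + (mu_prime / (dv + mu_prime)) * (df.get(t, 0) / df_total)
        score += c_tq * math.log(p)
    return score

# LNC2* check: concatenating the document with itself leaves the SPUD score unchanged.
doc = ["polya", "urn", "urn", "model"]
query = ["polya", "urn"]
cf, n_tokens = {"polya": 50, "urn": 30, "model": 400}, 100000
df, df_total = {"polya": 40, "urn": 25, "model": 300}, 50000
print(spud_dir(query, doc, df, df_total), spud_dir(query, doc * 3, df, df_total))  # equal
print(mult_dir(query, doc, cf, n_tokens), mult_dir(query, doc * 3, cf, n_tokens))  # differ
```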


What about scope?

- What happens as non-query (off-topic) terms are added to a document?
- The part of the SPUD model that deals with scope only penalises documents as distinct terms are added

Figure: term weight against document length for MQLdir and SPUDdir.


Probability of Relevance/Retrieval

Figure: Probability of retrieval/relevance for the MQLdir and SPUDdir methods on the trec-9/01 collection, for short queries (left) and medium-length queries (right).

A more sensitive idf

Traditional idf

$$\log\left(\frac{n}{df_t}\right) \qquad (16)$$

The actual weight applied to a term occurring in a document can be re-written as follows:

Variable idf

$$\log\left(1 + \delta \cdot \frac{n}{df_t}\right) \qquad (17)$$

where $\delta = c(t,d) \cdot |\vec d|_{avg} \cdot |\vec d| \,/\, (\mu' \cdot |d|)$ contains term-frequency and document-length normalisation
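A small numeric sketch of eqs (16)-(17); all counts and the µ' value are invented for illustration only.

```python
import math

def traditional_idf(n_docs, df_t):
    """Eq 16: log(n / df_t), fixed per term."""
    return math.log(n_docs / df_t)

def variable_idf(n_docs, df_t, c_td, doc_len, distinct_terms,
                 avg_distinct_terms, mu_prime=1000.0):
    """Eq 17: log(1 + delta * n / df_t), with
    delta = c(t,d) * |d_vec|_avg * |d_vec| / (mu' * |d|)."""
    delta = c_td * avg_distinct_terms * distinct_terms / (mu_prime * doc_len)
    return math.log(1.0 + delta * n_docs / df_t)

# Unlike the fixed traditional idf, the variable idf also reflects term frequency
# and document-length normalisation for the particular document.
print(traditional_idf(n_docs=1_000_000, df_t=1000))
print(variable_idf(n_docs=1_000_000, df_t=1000, c_td=3, doc_len=200,
                   distinct_terms=120, avg_distinct_terms=150))
```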


Outline

Introduction

Vector Space Models

Language Modelling

Pólya Urn Model

Some Results

Analysis

Conclusions

Conclusions

- The simple bag-of-words approach has not yet reached its limit
- More accurately modelling the language generation process leads to more accurate unsupervised models of retrieval
- The Pólya urn model leads to more effective retrieval without any additional cost
- Automatic setting of the parameters in the background model (unlike the multinomial model)
- An analysis shows that the new model adheres to a new test for the verbosity hypothesis

Acknowledgements and Questions

- Jiaul Paik (University of Maryland) and Yuanhua Lv (Microsoft Research)
- Questions and comments welcome


References

Church, K. W. and Gale, W. A. (1995). Poisson mixtures. Natural Language Engineering, 1, 163–190.

Goldwater, S., Griffiths, T. L., and Johnson, M. (2011). Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12, 2335–2382.

Hiemstra, D. (1998). A linguistically motivated probabilistic model of information retrieval. In Research and Advanced Technology for Digital Libraries, Second European Conference, ECDL '98, pages 569–584.

Lavrenko, V. and Croft, W. B. (2001). Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 120–127, New York, NY, USA. ACM.

Lv, Y. and Zhai, C. (2009). Positional language models for information retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 299–306, New York, NY, USA. ACM.

Madsen, R. E., Kauchak, D., and Elkan, C. (2005). Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, pages 545–552.

Mitzenmacher, M. (2003). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1, 226–251.

Ponte, J. M. and Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 275–281, New York, NY, USA. ACM.

Robertson, S. E. and Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, pages 232–241, New York, NY, USA. Springer-Verlag New York, Inc.

Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42(3–4), 425–440.

Xu, Z. and Akella, R. (2008). A new probabilistic retrieval model based on the Dirichlet compound multinomial distribution. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 427–434, New York, NY, USA. ACM.

Zhai, C. and Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 334–342, New York, NY, USA. ACM.

Zhai, C. and Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22, 179–214.