A Pólya Urn Document Language Model for Information...
TRANSCRIPT
Outline
Introduction
Vector Space Models
Language Modelling
Pólya Urn Model
Some Results
Analysis
Conclusions
Quote
“An author writes not only by processes of association – i.e. sampling earlier segments of the word sequence – but also by processes of imitation – i.e. sampling segments of word sequences from other works he has written, from works of other authors, and, of course, from sequences he has heard.”
— Herbert A. Simon (1955)
Motivation
- Bag of words
- Term-dependencies
  - Improves retrieval effectiveness (+)
  - Leads to more complex models (−)
  - ClueWeb09 (1 billion documents)
- Can we create a retrieval model that includes dependencies but without any additional cost?
Two Kinds of Term Dependency
Examples: Traditional dependencies
- Captain Beefheart
- Che Guevara

Examples: Word burstiness
- A different kind of dependency
- "Cycling on the footpath is dangerous. A footpath is ..."
- Synonyms: {footpath, pavement, sidewalk}
- Preference for the word already used
Word Burstiness
- Initial choice of a word to describe a ‘concept’ affects subsequent usage
- The tendency of an otherwise rare word to occur multiple times in a document (Church and Gale, 1995; Madsen et al., 2005)
- A form of preferential attachment (e.g. ‘the rich get richer’)
- A generative language model that includes preferential attachment better explains Zipfian (power-law) characteristics (Simon, 1955; Mitzenmacher, 2004)
- Two-stage language models (Goldwater et al., 2011)
Tradition
- Place documents and queries in a multidimensional term space
- Use measures of closeness in the space as measures of similarity
- Conceptually useful
- But:
  - What weights to use?
  - What matching function to use?
  - Experiments tell us that the cosine matching function performs very poorly
  - Linear tf and idf weighting has very poor performance
  - What did we gain from the VSM other than an inner-product matching function?
Language Modelling for Retrieval
- First approaches appeared in 1998 (Ponte and Croft, 1998; Hiemstra, 1998)
- Relevance-based approaches (Lavrenko and Croft, 2001)
- Studies of smoothing (Zhai and Lafferty, 2001)
- Dirichlet compound multinomial relevance language model (Xu and Akella, 2008)
- Positional language models (Lv and Zhai, 2009)
- The state-of-the-art unigram model uses a Dirichlet prior on the background multinomial, updated with a document (Zhai and Lafferty, 2004)
Query-Likelihood Model
- Rank documents d in order of the likelihood of their model M_d generating the query string q
- General ranking principle for a probabilistic language model

p(q \mid M_d = \theta_{dm}) = \prod_{t \in q} p(t \mid \theta_{dm})^{c(t,q)} \quad (1)

\log p(q \mid M_d = \theta_{dm}) = \sum_{t \in q} c(t,q) \cdot \log p(t \mid \theta_{dm}) \quad (2)
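Equations (1)–(2) can be sketched in a few lines of Python. This is an illustrative sketch (not the authors' code), assuming documents and queries are pre-tokenised lists of terms; it returns -inf when a query term is missing from the document, which is exactly the problem smoothing addresses.

```python
import math
from collections import Counter

def log_query_likelihood(query_terms, doc_terms):
    """Log p(q | theta_dm) under an unsmoothed multinomial document model
    (Equations 1-2): each query term t contributes c(t,q) * log p(t | theta_dm),
    with p(t | theta_dm) estimated as c(t,d) / |d|."""
    c_d = Counter(doc_terms)
    d_len = len(doc_terms)
    score = 0.0
    for t, c_tq in Counter(query_terms).items():
        p = c_d[t] / d_len
        if p == 0.0:
            # An unseen query term zeroes the whole likelihood.
            return float("-inf")
        score += c_tq * math.log(p)
    return score
```

For the document ["a", "b", "a", "c"] and query ["a", "b"], the likelihood is 0.5 · 0.25 = 0.125, so the log score is log(0.125).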
Smoothing I
- Avoids over-fitting

p(t \mid \hat\theta_{dm}) = (1 - \pi) \cdot p(t \mid \hat\theta_d) + \pi \cdot p(t \mid \hat\theta_c) \quad (3)

- Dirichlet prior smoothing

\pi_{dir} = \frac{\mu}{\mu + |d|} \quad (4)
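A minimal sketch of Dirichlet prior smoothing as in Equations (3)–(4), assuming the caller supplies collection term counts; the mixing weight π grows as the document gets shorter, so short documents lean more on the background model.

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, coll_counts, coll_len, mu=2000.0):
    """Dirichlet-smoothed log query likelihood.

    Implements Equations (3)-(4): pi = mu / (mu + |d|), and each term
    probability is (1 - pi) * p(t|theta_d) + pi * p(t|theta_c)."""
    c_d = Counter(doc_terms)
    d_len = len(doc_terms)
    pi = mu / (mu + d_len)
    score = 0.0
    for t, c_tq in Counter(query_terms).items():
        p_doc = c_d[t] / d_len            # maximum-likelihood document estimate
        p_bg = coll_counts[t] / coll_len  # background (collection) estimate
        score += c_tq * math.log((1 - pi) * p_doc + pi * p_bg)
    return score
```

For example, with doc = ["a", "b", "a"], background counts {a: 4, b: 3, c: 3} over a collection of 10 tokens, and μ = 1: π = 0.25, and the smoothed probability of "a" is 0.75 · (2/3) + 0.25 · 0.4 = 0.6.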
Smoothing II
[Diagram: document 1 (sample) and document 2 (sample), each mixed with the background model, scored against a query; document 2 should be ranked higher than document 1.]
Overview
- We can derive a retrieval function (and principled term-weights) using language models, unlike the VSM
- It can be viewed as a form of unsupervised machine learning
- The multinomial model is efficient to estimate, and with Dirichlet prior smoothing it is the state-of-the-art in terms of retrieval effectiveness
- It forms the basis of many applications
- It does not model term-dependencies
- The model using a Dirichlet prior has a free parameter (i.e. μ)
Multivariate Pólya Distribution
- The multivariate Pólya distribution (Dirichlet-compound-multinomial, or DCM)
- Instead of the multinomial in the original query-likelihood model, we can use the DCM

p(d \mid \alpha) = \int_\theta p(d \mid \theta)\, p(\theta \mid \alpha)\, d\theta \quad (5)

- The parameter vector α can be interpreted as the initial number of balls of each colour in the urn
Parameterisation
\alpha_d = m_d \cdot \theta_d = (m_d \cdot p(t_1 \mid \theta_d),\; m_d \cdot p(t_2 \mid \theta_d),\; \ldots,\; m_d \cdot p(t_v \mid \theta_d)) \quad (6)

- θ_d can be seen as the expectation
- m_d can be seen as the scale (variance)
- Low m_d implies high burstiness
Some properties
- Subsequent balls drawn from the urn are identically distributed but dependent
- Each sample (document) can be modelled using a multinomial
- Each time you restart the process to draw a sample, you draw from a different multinomial
- The process is exchangeable
- Generates power-law characteristics of term-frequencies
- Estimating a DCM from multiple samples (i.e. multiple documents) is computationally expensive (i.e. no closed-form solutions)
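The urn process itself is easy to simulate. The sketch below (an illustration, not the authors' code) draws balls from a Pólya urn with initial colour counts α, returning each drawn ball plus one extra copy of its colour, so early choices are reinforced and runs of the same colour (bursts) appear.

```python
import random

def polya_urn_draws(alpha, n_draws, rng):
    """Simulate n_draws draws from a Polya urn.

    alpha: initial (pseudo-)counts of balls per colour.
    Each drawn colour is put back with one additional ball of that
    colour, giving the 'rich get richer' reinforcement dynamic."""
    counts = list(alpha)
    draws = []
    for _ in range(n_draws):
        total = sum(counts)
        r = rng.uniform(0, total)
        acc = 0.0
        for colour, c in enumerate(counts):
            acc += c
            if r <= acc:
                draws.append(colour)
                counts[colour] += 1  # reinforcement step
                break
    return draws
```

With a small initial mass (e.g. alpha = [1, 1, 1]) the reinforcement is strong and one colour tends to dominate a sample, which is the burstiness behaviour described above; a large initial mass makes the draws behave almost like an ordinary multinomial.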
The SPUD Language Model
- Use the Pólya urn as a model for document generation
- Documents are known to exhibit burstiness
- Estimate the document and background models as before, but with different model assumptions
- Retain the multinomial as the model for query generation
Background Model I
- The background model is the most likely single model to have generated all documents
- Given all documents, find the DCM parameters
- Elkan (2006) has shown that close approximations to the model parameters are proportional to the number of samples in which an observation appears (EDCM)
- Essentially, documents exhibit quite a lot of word burstiness
Background Model II
- This is a useful result, as close estimates of the background parameters will be proportional to:

p(t \mid \hat\theta'_c) = \frac{df_t}{\sum_{t'} df_{t'}} = \frac{df_t}{\sum_{j=1}^{n} |\vec{d}_j|} \quad (7)

- With only m_c remaining to be estimated using Newton's method

\hat\alpha_c = (m_c \cdot p(t_1 \mid \hat\theta'_c),\; m_c \cdot p(t_2 \mid \hat\theta'_c),\; \ldots,\; m_c \cdot p(t_v \mid \hat\theta'_c)) \quad (8)
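Equation (7) says the background expectation depends only on document frequencies, not raw term counts. A minimal sketch (assuming documents are lists of tokens; not the authors' implementation):

```python
from collections import Counter

def background_expectation(docs):
    """EDCM-style background estimate of Equation (7).

    p(t | theta'_c) is proportional to df_t, the number of documents
    that t appears in; the normaliser sum_t' df_t' equals the sum over
    documents of their number of unique terms, |d_vec_j|."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    total = sum(df.values())
    return {t: df[t] / total for t in df}
```

Note how a very bursty term gets no extra background mass: in the toy corpus [["a", "a", "b"], ["a", "c"]], the repeated "a" still contributes df = 2, giving p(a) = 2/4 = 0.5.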
Document Model II
- With only one sample we cannot estimate the parameters of a DCM
- We can estimate the expectation of the DCM, but what is m_d?
- Thought experiment: what is the minimum initial mass of the urn (i.e. number of balls) that could have generated d?
- We set m_d to the number of unique terms in the document (its lower bound)

\hat\alpha_d = (|\vec{d}| \cdot p(t_1 \mid \hat\theta_d),\; |\vec{d}| \cdot p(t_2 \mid \hat\theta_d),\; \ldots,\; |\vec{d}| \cdot p(t_v \mid \hat\theta_d)) \quad (9)
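The document-side estimate of Equation (9) can be sketched directly (again an illustrative sketch, not the authors' code): the urn's initial mass m_d is set to |d⃗|, the number of unique terms, and the expectation is the document's maximum-likelihood multinomial.

```python
from collections import Counter

def document_alpha(doc_terms):
    """Estimated document DCM parameters (Equation 9).

    m_d = |d_vec| (unique terms, the minimum initial urn mass that
    could have generated d), and the expectation theta_d is the
    ML estimate c(t,d) / |d|."""
    c_d = Counter(doc_terms)
    d_len = len(doc_terms)
    m_d = len(c_d)  # |d_vec|
    return {t: m_d * (c / d_len) for t, c in c_d.items()}
```

For doc = ["a", "a", "b", "c"], m_d = 3 and α̂_d = {a: 1.5, b: 0.75, c: 0.75}; the components always sum to m_d, so a document with many repeated terms (a bursty document) gets a small total mass relative to its length.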
Remaining Parameters
- Linearly combine the two models using one parameter ω
- We can experimentally tune ω

\alpha_{SPUD} = \omega \cdot \hat\alpha_c + (1 - \omega) \cdot \hat\alpha_d \quad (10)

\hat\alpha_c = (m_c \cdot p(t_1 \mid \hat\theta'_c),\; \ldots,\; m_c \cdot p(t_v \mid \hat\theta'_c)) \quad (11)

\hat\alpha_d = (|\vec{d}| \cdot p(t_1 \mid \hat\theta_d),\; \ldots,\; |\vec{d}| \cdot p(t_v \mid \hat\theta_d)) \quad (12)
Questions
- How effective is the new model in terms of retrieval?
- How effective is Newton's method at automatically determining the free parameter in the background model?
- Why?
Effectiveness MAP
- Optimally tuning the one free parameter in each function
- All increases are statistically significant (for SPUD vs. MQL)
Figure: MAP on Newswire and Web datasets (title-only queries), comparing DCM-L-T, MQL, and SPUD on news, wt2g, wt10g, gov2, mq-07, and mq-08.
Effectiveness NDCG@20
- Optimally tuning the one free parameter in each function
- All increases are statistically significant
Figure: NDCG@20 on Newswire and Web datasets (title-only queries), comparing DCM-L-T, MQL, and SPUD on news, wt2g, wt10g, gov2, mq-07, and mq-08.
Newton’s Method and Tuning
- Mixing parameter is robust at ω = 0.8
Figure: MAP as a function of ω (0.1–0.9) on the ohsu, robust-04, and trec-8 collections (title queries).
Newton’s Method
- Mixing parameter is set to ω = 0.8
- Tuned m_c vs. m_c estimated using Newton's method
Figure: MAP on Newswire and Web datasets (title-only queries), comparing SPUD (tuned) with SPUD (ω = 0.8) on news, wt2g, wt10g, gov2, mq-07, and mq-08.
Scope Hypothesis
- One of two hypotheses proposed to explain the interaction between document length and topicality
- Documents vary in length because some documents cover more topics (Robertson and Walker, 1994)
- Relevance is likely affected by this aspect of document length
Verbosity Hypothesis
- Documents vary in length due to verbosity (Robertson and Walker, 1994)
- Some documents are just more ‘wordy’
- This aspect of document length is independent of topic, and therefore of relevance
- In reality, documents may vary in length due to a combination of these two hypotheses
- No formal means of capturing whether a retrieval function adheres to this intuition has been proposed (as far as I know)
Axiomatic Analysis
LNC2* Constraint: given a ranking function s(q, d) that scores a document d with respect to a query q, if d′ is created by concatenating d with itself k times, then s(q, d′) = s(q, d).

In other words, you cannot change the ranking of a document by concatenating it with itself k times (where k > 1).
Comparison
Multinomial

MULT_{dir}(q,d) = \sum_{t \in q} \log\left( \frac{|d|}{|d| + \mu} \cdot \frac{c(t,d)}{|d|} + \frac{\mu}{|d| + \mu} \cdot \frac{cf_t}{|C|} \right) \cdot c(t,q) \quad (13)
Violation I
[Diagram: document 1 (sample) and document 2 (sample), each mixed with the background model, scored against a query; document 2 is ranked higher than document 1, violating the constraint (✗).]
Comparison
Multinomial

MULT_{dir}(q,d) = \sum_{t \in q} \log\left( \frac{|d|}{|d| + \mu} \cdot \frac{c(t,d)}{|d|} + \frac{\mu}{|d| + \mu} \cdot \frac{cf_t}{|C|} \right) \cdot c(t,q) \quad (14)

SPUD

SPUD_{dir}(q,d) = \sum_{t \in q} \log\left( \frac{|\vec{d}|}{|\vec{d}| + \mu'} \cdot \frac{c(t,d)}{|d|} + \frac{\mu'}{|\vec{d}| + \mu'} \cdot \frac{df_t}{\sum_{t'} df_{t'}} \right) \cdot c(t,q) \quad (15)
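Both scoring functions in Equations (14)–(15) are short enough to sketch side by side, which also makes the LNC2* behaviour checkable: concatenating a document with itself leaves c(t,d)/|d| and |d⃗| unchanged, so the SPUD score is invariant, while the multinomial's mixing weight |d|/(|d|+μ) changes. This is an illustrative sketch (not the authors' code); the μ values are arbitrary.

```python
import math
from collections import Counter

def mult_dir(query, doc, cf, coll_len, mu=100.0):
    """Dirichlet-smoothed multinomial score, Equation (14)."""
    c_d, d_len = Counter(doc), len(doc)
    s = 0.0
    for t, c_tq in Counter(query).items():
        p = (d_len / (d_len + mu)) * (c_d[t] / d_len) \
            + (mu / (d_len + mu)) * (cf[t] / coll_len)
        s += c_tq * math.log(p)
    return s

def spud_dir(query, doc, df, df_total, mu=100.0):
    """SPUD score, Equation (15): |d| in the mixing weights is replaced
    by |d_vec| (unique terms), and cf_t/|C| by df_t / sum_t' df_t'."""
    c_d, d_len = Counter(doc), len(doc)
    d_vec = len(c_d)  # |d_vec|
    s = 0.0
    for t, c_tq in Counter(query).items():
        p = (d_vec / (d_vec + mu)) * (c_d[t] / d_len) \
            + (mu / (d_vec + mu)) * (df[t] / df_total)
        s += c_tq * math.log(p)
    return s
```

Scoring a document and its threefold self-concatenation with the same query gives identical SPUD scores (satisfying LNC2*) but different multinomial scores.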
What about scope?
- What happens as non-query (off-topic) terms are added to a document?
- The part of the SPUD model that deals with scope only penalises documents as distinct terms are added
Figure: term weight as a function of document length (100–500) for MQL_dir and SPUD_dir.
Probability of Relevance/Retrieval
Figure: probability of retrieval/relevance against document length for the MQL_dir and SPUD_dir methods on the trec-9/01 collection, for short queries (left) and medium-length queries (right).
A more sensitive idf
Traditional idf

\log\left(\frac{n}{df_t}\right) \quad (16)

The actual weight applied to a term occurring in a document can be re-written as follows:

Variable idf

\log\left(1 + \delta \cdot \frac{n}{df_t}\right) \quad (17)

where \delta = c(t,d) \cdot |\vec{d}|_{avg} \cdot |\vec{d}| / (\mu' \cdot |d|) contains the term-frequency and document-length normalisation.
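Equations (16)–(17) are straightforward to sketch. In this illustrative sketch (not the authors' code; the argument values in the usage note are made up), the variable idf weight grows with the within-document term frequency c(t,d), so a bursty term earns more weight than traditional idf would give it.

```python
import math

def traditional_idf(n_docs, df_t):
    """Equation (16): log(n / df_t)."""
    return math.log(n_docs / df_t)

def variable_idf(n_docs, df_t, c_td, d_vec, d_vec_avg, d_len, mu=100.0):
    """Equation (17): log(1 + delta * n / df_t), where
    delta = c(t,d) * |d_vec|_avg * |d_vec| / (mu' * |d|)
    folds term-frequency and document-length normalisation
    into the idf component."""
    delta = c_td * d_vec_avg * d_vec / (mu * d_len)
    return math.log(1 + delta * n_docs / df_t)
```

Holding the document fixed, increasing c_td strictly increases the variable idf weight, which is the "more sensitive" behaviour the slide describes.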
Conclusions
- The simple bag-of-words approach has not yet reached its limit
- More accurately modelling the language generation process leads to more accurate unsupervised models of retrieval
- The Pólya urn model leads to more effective retrieval without any additional cost
- Automatic setting of the parameters in the background model (unlike the multinomial model)
- An analysis shows that the new model adheres to a new test for the verbosity hypothesis
Acknowledgements and Questions
- Jiaul Paik (University of Maryland) and Yuanhua Lv (Microsoft Research)
- Questions and comments welcome
Church, K. W. and Gale, W. A. (1995). Poisson mixtures. Natural Language Engineering, 1, 163–190.

Goldwater, S., Griffiths, T. L., and Johnson, M. (2011). Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12, 2335–2382.

Hiemstra, D. (1998). A linguistically motivated probabilistic model of information retrieval. In Research and Advanced Technology for Digital Libraries, Second European Conference, ECDL ’98, pages 569–584.

Lavrenko, V. and Croft, W. B. (2001). Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, pages 120–127, New York, NY, USA. ACM.

Lv, Y. and Zhai, C. (2009). Positional language models for information retrieval. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, pages 299–306, New York, NY, USA. ACM.

Madsen, R. E., Kauchak, D., and Elkan, C. (2005). Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, pages 545–552.

Mitzenmacher, M. (2004). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1, 226–251.

Ponte, J. M. and Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, pages 275–281, New York, NY, USA. ACM.

Robertson, S. E. and Walker, S. (1994). Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’94, pages 232–241, New York, NY, USA. Springer-Verlag New York, Inc.

Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42(3–4), 425–440.

Xu, Z. and Akella, R. (2008). A new probabilistic retrieval model based on the Dirichlet compound multinomial distribution. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 427–434, New York, NY, USA. ACM.

Zhai, C. and Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, pages 334–342, New York, NY, USA. ACM.

Zhai, C. and Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22, 179–214.