Effective Query Formulation with Multiple Information Sources
Michael Bendersky (1), Donald Metzler (2), W. Bruce Croft (1)
(1) University of Massachusetts, (2) Information Sciences Institute, USC
WSDM 2012 Best Paper Runner Up
Presented by Tom, March 14th, 2012
Michael Bendersky
Donald Metzler (graduated in 2007; Yahoo! Research; USC)
W. Bruce Croft (supervisor of both)
• A Markov Random Field Model for Term Dependencies, SIGIR 2005
• Learning Concept Importance Using a Weighted Dependence Model, WSDM 2010
• Parameterized Concept Weighting in Verbose Queries, SIGIR 2011, Honorable Mention Award
• Effective Query Formulation with Multiple Information Sources, WSDM 2012, Best Paper Runner Up
Inheritance: each paper builds on the one before it.
Outline
• Query Formulation Process
• Concept-Based Ranking
  – Concept Matching
  – Concept Weighting
• Sequential Dependence [SIGIR 2005]
• Weighted Sequential Dependence [WSDM 2010]
• Parameterized Query Expansion [SIGIR 2011]
• Multiple Source Formulation [WSDM 2012]
• Experiments
• Discussion
Query Formulation Process
• Query Refinement
  – Alter the query on the morphological level
  – Tokenization
    • 香港中文大学 (CUHK) || 何善衡 (Ho Sin-Hang) || 大楼 (Building)
  – Spelling corrections
    • E.g. Hong Kng -> Hong Kong
  – Stemming
Query Formulation Process
• Structured Query Formulation
  – Concept Identification
    • What are the atomic matching units in the query?
  – Concept Weighting
    • How important are the different concepts for conveying query intent?
  – Query Expansion
    • What additional concepts should be associated with the query?
Structured Query Formulation
Query: ER TV Show (ER is an American medical drama television series)
Concept Identification and Concept Weighting (weight, concept):
  0.297 er
  0.168 tv
  0.192 show
  0.051 er tv
  0.012 tv show
Query Expansion (weight, expansion term):
  0.085 season
  0.065 episode
  0.051 dr
  0.043 drama
  0.036 series
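One way to make the weighted concepts of the "ER TV Show" example concrete is to render them as a single weighted query string. The sketch below is only an illustration: it assumes an Indri-style syntax (the #weight and #1 operators), which the slides do not mention, and it treats every multi-word concept as an exact phrase. The function name to_weighted_query is hypothetical.

```python
from typing import List, Tuple


def to_weighted_query(concepts: List[Tuple[float, str]]) -> str:
    """Render (weight, concept) pairs as one Indri-style #weight query.

    Assumption: multi-word concepts are matched as exact phrases (#1).
    """
    parts = []
    for weight, concept in concepts:
        words = concept.split()
        if not words:
            continue  # skip empty concepts
        if len(words) == 1:
            expr = words[0]
        else:
            expr = "#1(" + " ".join(words) + ")"
        parts.append(f"{weight:.3f} {expr}")
    return "#weight( " + " ".join(parts) + " )"


# Weights copied from the slide; expansion terms are appended as plain terms.
query = to_weighted_query([
    (0.297, "er"), (0.168, "tv"), (0.192, "show"),
    (0.051, "er tv"), (0.012, "tv show"),
    (0.085, "season"), (0.065, "episode"),
])
print(query)
```

Explicit concepts and expansion terms end up in the same flat list here, which mirrors the slide's point that all three formulation steps feed one structured query.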
Concept Matching
• Assign a score to the matches of concept k in document D
• Monotonic function: the value increases with the number of times concept k matches document D
• Language model
  – tf is the frequency of the concept, C is the collection, D is a document, and µ is a smoothing parameter
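The formula on this slide did not survive in the transcript. Given the legend (tf, C, D, µ), a reasonable reading is the Dirichlet-smoothed language-model score, log((tf_{k,D} + µ · tf_{k,C} / |C|) / (|D| + µ)). The sketch below assumes that form; the function name and the default µ = 2500 are illustrative choices, not values taken from the slides.

```python
import math


def dirichlet_score(tf_doc: int, doc_len: int,
                    tf_coll: int, coll_len: int,
                    mu: float = 2500.0) -> float:
    """Dirichlet-smoothed log score of a concept in a document.

    tf_doc   : matches of the concept in document D
    doc_len  : length of D
    tf_coll  : matches of the concept in the collection C
    coll_len : length of C
    mu       : smoothing parameter (assumed default)
    """
    if tf_coll <= 0 or coll_len <= 0:
        raise ValueError("concept must occur in a non-empty collection")
    p_coll = tf_coll / coll_len  # background probability of the concept
    return math.log((tf_doc + mu * p_coll) / (doc_len + mu))


# The score grows with the number of matches, as the slide requires.
print(dirichlet_score(1, 100, 50, 1_000_000))
print(dirichlet_score(3, 100, 50, 1_000_000))
```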
Markov Random Field
• Markov Random Field
  – An undirected graphical model that defines a joint probability distribution over a set of random variables
  – Nodes represent random variables, and edges represent dependence semantics
  – Information Retrieval
    • Document random variable D, query term random variables Q
Sequential Dependence Model
• The sequential dependence model places edges between adjacent query terms
Figure: Markov random field model for three query terms under the sequential dependence assumption
Sequential Dependence Model
• Query Term Concept: an individual query word
• Phrase Concept: adjacent query word pairs matched as exact phrases in the document
• Proximity Concept: adjacent query word pairs whose individual words both occur, in any order, within a fixed-length window in the document
• All matches of the same type are treated as being equally important
• Concept weights are set to 0.8, 0.1, and 0.1, respectively
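The three concept types and their fixed weights can be sketched as a small scoring function. This is a simplified illustration, not the authors' implementation: it uses log(1 + count) as a stand-in for the language-model matching score, counts proximity matches as pairs of positions closer than the window, and assumes a window of 8 tokens. All function names are hypothetical.

```python
import math
from typing import List

# Weights for term, phrase, and proximity concepts, as given on the slide.
WEIGHTS = {"term": 0.8, "phrase": 0.1, "proximity": 0.1}


def count_phrase(doc: List[str], w1: str, w2: str) -> int:
    """Number of positions where w1 is immediately followed by w2."""
    return sum(1 for i in range(len(doc) - 1)
               if doc[i] == w1 and doc[i + 1] == w2)


def count_window(doc: List[str], w1: str, w2: str, window: int) -> int:
    """Number of (w1, w2) position pairs, in any order, closer than window."""
    pos1 = [i for i, tok in enumerate(doc) if tok == w1]
    pos2 = [i for i, tok in enumerate(doc) if tok == w2]
    return sum(1 for i in pos1 for j in pos2
               if i != j and abs(i - j) < window)


def sd_score(query: List[str], doc: List[str], window: int = 8) -> float:
    """Sequential-dependence style score with a simplified matching function."""
    score = 0.0
    for word in query:  # query term concepts
        score += WEIGHTS["term"] * math.log1p(doc.count(word))
    for w1, w2 in zip(query, query[1:]):  # adjacent query word pairs
        score += WEIGHTS["phrase"] * math.log1p(count_phrase(doc, w1, w2))
        score += WEIGHTS["proximity"] * math.log1p(
            count_window(doc, w1, w2, window))
    return score


q = "er tv show".split()
print(sd_score(q, "er tv show".split()))   # terms, phrases, and proximity match
print(sd_score(q, "show tv er".split()))   # terms and proximity only
```

A document containing the query words in their original order scores higher than one containing them in reverse, because only the former earns the phrase contribution.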
Weighted Sequential Dependence
• SD treats matches of the same type equally
• We would like to weight different terms and bigrams differently, a priori, based on query-level evidence
• Assume the concept weight parameter λ takes on a parameterized form
Weighted Sequential Dependence
Annotations on the weight formula:
• Features defined over unigrams
• Features defined over bigrams
• w are free parameters that must be estimated
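The weight formula itself is missing from the transcript. A minimal sketch, assuming the parameterized form is a linear combination of concept features, λ(k) = Σ_j w_j · g_j(k), with one parameter vector for unigrams and one for bigrams. The feature values and parameters below are made up for illustration.

```python
from typing import Dict, Hashable, Sequence, Tuple


def concept_weight(features: Sequence[float], w: Sequence[float]) -> float:
    """Weight of one concept: dot product of its features and the parameters w."""
    if len(features) != len(w):
        raise ValueError("features and parameters must have the same length")
    return sum(wj * gj for wj, gj in zip(w, features))


def wsd_weights(unigram_features: Dict[str, Sequence[float]],
                bigram_features: Dict[Tuple[str, str], Sequence[float]],
                w_unigram: Sequence[float],
                w_bigram: Sequence[float]) -> Dict[Hashable, float]:
    """Per-concept weights from features defined over unigrams and bigrams."""
    weights: Dict[Hashable, float] = {}
    for term, feats in unigram_features.items():
        weights[term] = concept_weight(feats, w_unigram)
    for pair, feats in bigram_features.items():
        weights[pair] = concept_weight(feats, w_bigram)
    return weights


# Hypothetical features: [constant, some frequency-based statistic].
weights = wsd_weights(
    {"er": [1.0, 2.0], "tv": [1.0, 0.5]},
    {("er", "tv"): [1.0, 4.0]},
    w_unigram=[0.5, 0.25],
    w_bigram=[0.5, 0.25],
)
print(weights)
```

Unlike SD, two unigrams (or two bigrams) now receive different weights whenever their feature vectors differ.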
Weighted Sequential Dependence
• Concept Importance Features
  – Endogenous: collection dependent
  – Exogenous: collection independent, estimated from external data sources
Weighted Sequential Dependence
• Parameter Estimation
  – Coordinate-level ascent
  – Iteratively optimizes a multivariate objective function by performing a series of one-dimensional line searches
  – Repeatedly cycles through each parameter
  – The process is repeated until the gain in the target metric falls below a certain threshold
  – Metzler and Croft 2007
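The coordinate ascent procedure described above can be sketched as follows. This is a generic illustration under stated assumptions: the one-dimensional line search is reduced to trying a fixed grid of step sizes, and the objective is a toy function rather than a retrieval metric. The step grid, tolerance, and function names are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def coordinate_ascent(objective: Callable[[List[float]], float],
                      init: Sequence[float],
                      steps: Sequence[float] = (-0.5, -0.1, -0.01,
                                                0.01, 0.1, 0.5),
                      tol: float = 1e-6,
                      max_cycles: int = 100) -> Tuple[List[float], float]:
    """Maximize objective by optimizing one parameter at a time."""
    params = list(init)
    best = objective(params)
    for _ in range(max_cycles):
        start_of_cycle = best
        for i in range(len(params)):       # cycle through each parameter
            current = params[i]
            best_value = current
            for step in steps:             # crude one-dimensional search
                candidate = current + step
                params[i] = candidate
                value = objective(params)
                if value > best:
                    best, best_value = value, candidate
            params[i] = best_value         # keep the best setting found
        if best - start_of_cycle < tol:    # gain below threshold: stop
            break
    return params, best


# Toy objective with its maximum at (1, -2); a real run would evaluate
# a retrieval metric on training queries instead.
def toy(p: List[float]) -> float:
    return -((p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2)


print(coordinate_ascent(toy, [0.0, 0.0]))
```

Because only objective evaluations are needed, the same loop works for non-differentiable targets such as rank-based metrics.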
Parameterized Query Expansion
• WSD learns weights only for the explicit query concepts (concepts that appear in the query), not for latent concepts that are associated with the query through pseudo-relevance feedback
• PQE uses four types of concepts
  – Query term
  – Phrase concept
  – Proximity concept
  – Expansion concept
    • Top-K terms associated with the query through pseudo-relevance feedback
    • Using Latent Concept Expansion (Metzler and Croft 2007)
Parameterized Query Expansion
• Latent Concept Expansion
  – Use the explicit concepts to retrieve a set of documents R (pseudo-relevant documents)
  – Estimate the weight of each term in R as a candidate expansion concept
Annotations on the term weight formula:
• Document relevance
• Weight of the term in the pseudo-relevant set
• Dampening of the scores of common terms
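The term weight formula is not recoverable from the transcript, only its three annotated parts. The sketch below assumes one plausible combination: for each pseudo-relevant document, an exponentiated weighted sum of the document's relevance score, the term's matching score in that document, and a penalty based on the log of the term's collection probability, summed over R. The gamma weights, the input layout, and the function name are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def lce_scores(pseudo_relevant: List[Tuple[float, Dict[str, float]]],
               coll_prob: Dict[str, float],
               gamma: Tuple[float, float, float] = (1.0, 1.0, 1.0),
               top_k: int = 10) -> List[Tuple[str, float]]:
    """Score candidate expansion terms over a pseudo-relevant set R.

    pseudo_relevant : one (document score, {term: score of term in doc})
                      pair per document in R
    coll_prob       : collection probability of each candidate term
    """
    g1, g2, g3 = gamma
    totals: Dict[str, float] = {}
    for doc_score, term_scores in pseudo_relevant:
        for term, term_doc_score in term_scores.items():
            exponent = (g1 * doc_score                     # document relevance
                        + g2 * term_doc_score              # term weight in the document
                        - g3 * math.log(coll_prob[term]))  # dampen common terms
            totals[term] = totals.get(term, 0.0) + math.exp(exponent)
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Toy input: two terms that match equally well in one document.
ranked = lce_scores([(0.0, {"season": 0.0, "the": 0.0})],
                    {"season": 0.001, "the": 0.05})
print(ranked)
```

With equal in-document scores, the rarer term "season" outranks the common term "the", which is the effect the third annotation describes.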
Parameterized Query Expansion
• Two-stage optimization for estimating parameters
• a1-a5: 1st stage
• a6-a7: 2nd stage
Multiple Source Formulation
• LCE and PQE use a single source for expansion, which may lead to topic drift
Multiple Source Formulation
• Expansion Terms
  – Rank the documents in each source σ using a ranking function over the explicit concepts
  – The M terms with the highest LCE value for each source σ are added to the set of expansion terms
  – Assign a weight to each term in this set, using a weighted combination of the expansion scores
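The pooling and weighting steps above can be sketched as follows, assuming per-source expansion scores have already been computed. The source names, the source weights, and the choice to score a term as 0 in a source where it does not appear are assumptions made for illustration; the slides do not specify them.

```python
from typing import Dict


def combine_sources(source_scores: Dict[str, Dict[str, float]],
                    source_weights: Dict[str, float],
                    m: int) -> Dict[str, float]:
    """Pool the top-m terms of every source, then weight each pooled term."""
    pool = set()
    for scores in source_scores.values():
        top_terms = sorted(scores, key=scores.get, reverse=True)[:m]
        pool.update(top_terms)
    return {
        term: sum(source_weights[src] * scores.get(term, 0.0)
                  for src, scores in source_scores.items())
        for term in pool
    }


# Hypothetical sources and scores.
combined = combine_sources(
    {"source_a": {"season": 0.6, "episode": 0.3, "cast": 0.1},
     "source_b": {"episode": 0.5, "drama": 0.4, "season": 0.1}},
    {"source_a": 0.5, "source_b": 0.5},
    m=1,
)
print(combined)
```

A term that scores moderately in several sources can end up with a higher combined weight than one that is strong in a single source, which is one way multiple sources may counteract topic drift.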
Experiments
• Newswire & Web TREC collections
  – ROBUST04 (500K documents)
  – GOV2 (25M documents)
  – ClueWeb-B (50M documents)
• <title> & <desc> portions of TREC topics
• 3-fold cross-validation
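A 3-fold cross-validation over TREC topics can be sketched as below. How the topics were actually assigned to folds is not stated in the slides; the round-robin assignment and the function name here are assumptions.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(topic_ids: List[int],
                  k: int = 3) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) topic lists; every topic is tested exactly once."""
    folds = [topic_ids[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, test


# Hypothetical topic numbers.
for train, test in k_fold_splits(list(range(301, 307))):
    print("train:", train, "test:", test)
```

Parameters would be tuned on the training topics of each split and evaluated on the held-out topics, so that no topic is used for both.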
Experiments
Comparison with the query weighting methods on TREC collections
• Significance tests against each baseline are presented
Experiments
Comparison with the query expansion methods on TREC collections
Statistically indistinguishable from other methods
Experiments
• Other experiments in the WSDM 2012 paper
  – Varying the number of expansion terms
  – Robustness of the proposed methods
  – Result diversification performance
Discussion
• The problems solved in these papers are fundamentally important
• Written in a good style
  – General formulation -> specific algorithm
  – Cite related work throughout the paper, drawing on a wide range of sources
  – Motivate the proposed approach from time to time
• Experiments are on standard data sets and are quite thorough