
Effective Query Formulation with Multiple Information Sources

Michael Bendersky¹, Donald Metzler², W. Bruce Croft¹

¹University of Massachusetts
²Information Sciences Institute, USC

WSDM 2012 Best Paper Runner Up

Presented by Tom
March 14th, 2012


Michael Bendersky

Donald Metzler: graduated in 2007; Yahoo! Research; USC

W. Bruce Croft: supervisor of Michael Bendersky and Donald Metzler


A Markov Random Field Model for Term Dependencies, SIGIR, 2005

Learning Concept Importance Using a Weighted Dependence Model, WSDM, 2010

Parameterized Concept Weighting in Verbose Queries, SIGIR, 2011, Honorable Mention Award

Effective Query Formulation with Multiple Information Sources, WSDM 2012, Best Paper Runner Up

Each paper inherits from (builds on) the previous one.

Outline

• Query Formulation Process
• Concept-Based Ranking
  – Concept Matching
  – Concept Weighting
    • Sequential Dependence [SIGIR 2005]
    • Weighted Sequential Dependence [WSDM 2010]
    • Parameterized Query Expansion [SIGIR 2011]
    • Multiple Source Formulation [WSDM 2012]
• Experiments
• Discussion


Query Formulation Process



• Query Refinement
  – Alter the query on the morphological level
  – Tokenization
    • 香港中文大学 (CUHK) || 何善衡 (Ho Sin-Hang) || 大楼 (Building)
  – Spelling corrections
    • E.g. Hong Kng -> Hong Kong
  – Stemming



• Structured Query Formulation
  – Concept Identification
    • What are the atomic matching units in the query?
  – Concept Weighting
    • How important are the different concepts for conveying query intent?
  – Query Expansion
    • What additional concepts should be associated with the query?


Structured Query Formulation

Query: ER TV Show (ER is an American medical drama television series)

Concept Identification and Concept Weighting (explicit query concepts)

0.297 er
0.168 tv
0.192 show
0.051 er tv
0.012 tv show

Query Expansion (expansion terms)

0.085 season
0.065 episode
0.051 dr
0.043 drama
0.036 series


Concept-Based Ranking


[Figure: Query, Concepts, and Document; labels: Concept Weighting, Concept Matching]

Concept Matching

• Assign score to the matches of concept k in document D

• Monotonic function: value increases with the number of times concept k matches document D

• Language model
  – tf is a frequency, C is the collection, D is a document, and µ is a parameter
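The matching function itself did not survive on the slide. As a minimal sketch, assuming it is the usual Dirichlet-smoothed language model estimate, log((tf in D + µ · tf in C / |C|) / (|D| + µ)), the score could be computed as below. The function name, argument names, and the default µ = 2500 are illustrative assumptions, not values from the slides.

```python
import math

def dirichlet_score(tf_doc: int, doc_len: int, tf_coll: int, coll_len: int,
                    mu: float = 2500.0) -> float:
    """Log of a Dirichlet-smoothed estimate for one concept in one document.

    tf_doc   : matches of the concept in document D
    doc_len  : length of D (|D|)
    tf_coll  : matches of the concept in collection C
    coll_len : length of C (|C|)
    mu       : smoothing parameter (the default is an assumed, commonly used value)
    """
    if tf_coll <= 0 or coll_len <= 0:
        raise ValueError("the concept must occur in a non-empty collection")
    background = tf_coll / coll_len  # collection-level probability
    return math.log((tf_doc + mu * background) / (doc_len + mu))
```

With this form the score grows with the number of matches in the document, which is the monotonicity property listed above.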


Markov Random Field

• Markov Random Field
  – Undirected graphical model that defines a joint probability distribution over a set of random variables
  – Nodes represent random variables, and edges represent dependence semantics
  – Information Retrieval
    • Document random variable D, query term random variables Q


Sequential Dependence Model

• The sequential dependence model places edges between adjacent query terms

[Figure: Markov random field model for three query terms under the sequential dependence assumption]

Sequential Dependence Model


Query Term Concept: an individual query word

Phrase Concept: an adjacent query word pair matched as an exact phrase in the document

Proximity Concept: an adjacent query word pair whose two words occur, in any order, within a fixed-length window in the document

• All matches of the same type are treated as being equally important

• Concept weights are set to 0.8, 0.1, and 0.1 for term, phrase, and proximity concepts, respectively
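The three concept types and their fixed weights can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the matching function log(1 + count) is a stand-in for a smoothed language model score, the window length of 8 is an assumed value (the slide only says "fixed length"), and the unordered-window counting rule is a simplification.

```python
import math

def adjacent_pairs(query):
    """Adjacent query word pairs, e.g. [er, tv, show] -> (er, tv), (tv, show)."""
    return list(zip(query, query[1:]))

def count_exact_phrase(pair, doc):
    """Occurrences of the pair as an exact, ordered phrase."""
    a, b = pair
    return sum(1 for i in range(len(doc) - 1) if doc[i] == a and doc[i + 1] == b)

def count_unordered_window(pair, doc, window=8):
    """Positions where one word of the pair is followed by the other
    within `window` tokens, in either order (a simplified count)."""
    a, b = pair
    count = 0
    for i, token in enumerate(doc):
        following = doc[i + 1:i + window]
        if (token == a and b in following) or (token == b and a in following):
            count += 1
    return count

def sd_score(query, doc, weights=(0.8, 0.1, 0.1), window=8):
    """Weighted sum over term, phrase, and proximity concept matches."""
    w_term, w_phrase, w_prox = weights
    match = lambda c: math.log(1 + c)  # stand-in matching function
    pairs = adjacent_pairs(query)
    score = w_term * sum(match(doc.count(t)) for t in query)
    score += w_phrase * sum(match(count_exact_phrase(p, doc)) for p in pairs)
    score += w_prox * sum(match(count_unordered_window(p, doc, window)) for p in pairs)
    return score
```

For the query `["er", "tv", "show"]`, a document containing the words in query order scores higher than one containing them in reverse order, because only the former matches the phrase concepts.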

Weighted Sequential Dependence

• SD treats all matches of the same type equally
• Goal: weight different terms and bigrams differently, a priori, based on query-level evidence
• Assume the concept weight parameter λ takes a parameterized form


Features defined over unigrams

Features defined over bigrams

The weights w are free parameters that must be estimated
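The parameterized form is not legible on the slide. A minimal sketch, assuming each concept weight is a linear combination of importance features, λ(κ) = Σ_j w_j · g_j(κ), with separate feature sets for unigrams and bigrams, is shown below. The feature names used in the example are hypothetical.

```python
def concept_weight(features, w):
    """Linear combination of a concept's importance features.

    features : {feature_name: value} for one unigram or bigram
    w        : {feature_name: free parameter}, estimated from training queries
    """
    return sum(w[name] * value for name, value in features.items())

# Hypothetical unigram features and parameters, for illustration only.
unigram_features = {"bias": 1.0, "idf": 2.0}
unigram_params = {"bias": 0.5, "idf": 0.25}
weight = concept_weight(unigram_features, unigram_params)
```

Under this assumption, the three fixed SD weights are replaced by a per-concept weight that varies with each concept's features.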


• Concept Importance Features
  – Endogenous: collection dependent
  – Exogenous: collection independent, estimated from external data sources



• Parameter Estimation
  – Coordinate-level ascent
  – Iteratively optimizes a multivariate objective function by performing a series of one-dimensional line searches
  – Repeatedly cycles through each parameter
  – The process continues until the gain in the target metric falls below a certain threshold
  – Metzler and Croft 2007
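The procedure above can be sketched as follows. This is an illustrative version under stated assumptions: the line search is a small fixed grid of step sizes, and the objective is a toy function standing in for a retrieval metric evaluated on training queries. The names `coordinate_ascent`, `steps`, `tol`, and `toy` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

def coordinate_ascent(
    objective: Callable[[List[float]], float],
    start: Sequence[float],
    steps: Sequence[float] = (-0.5, -0.1, -0.05, 0.05, 0.1, 0.5),
    tol: float = 1e-4,
    max_cycles: int = 50,
) -> Tuple[List[float], float]:
    """Maximize `objective` by optimizing one parameter at a time."""
    params = list(start)
    best = objective(params)
    for _ in range(max_cycles):
        before_cycle = best
        for i in range(len(params)):
            original = params[i]
            best_setting = original
            # One-dimensional search over parameter i, all others held fixed.
            for delta in steps:
                params[i] = original + delta
                value = objective(params)
                if value > best:
                    best, best_setting = value, params[i]
            params[i] = best_setting
        # Stop once a full cycle no longer improves the objective enough.
        if best - before_cycle < tol:
            break
    return params, best

def toy(p: List[float]) -> float:
    """Toy objective with its maximum at (1, -2)."""
    return -(p[0] - 1.0) ** 2 - (p[1] + 2.0) ** 2

params, best = coordinate_ascent(toy, [0.0, 0.0])
```

Because the method only needs objective values, it can optimize a retrieval metric directly, even though such a metric is not differentiable in the parameters.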


Parameterized Query Expansion

• WSD learns weights only for the explicit query concepts (concepts that appear in the query), not for latent concepts that are associated with the query through pseudo-relevance feedback
• PQE uses four types of concepts
  – Query term
  – Phrase concept
  – Proximity concept
  – Expansion concept
    • Top-K terms associated with the query through pseudo-relevance feedback
    • Using Latent Concept Expansion (Metzler and Croft 2007)


Parameterized Query Expansion

• Latent Concept Expansion
  – Use explicit concepts to retrieve a set of documents R (pseudo-relevant documents)
  – Estimate the weight of each term in R as a candidate expansion concept

The expansion weight has three components: document relevance, the weight of the term in the pseudo-relevant set, and a factor that dampens the scores of common terms.
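The weighting formula itself is not legible on the slide. The sketch below assumes one plausible form: for each pseudo-relevant document, the three components are combined inside an exponent, and the results are summed over R. The functional form, the parameter names `g1` to `g3`, and the toy inputs are assumptions, not taken from the slide.

```python
import math
from collections import defaultdict

def expansion_scores(pseudo_relevant, coll_prob, g1=1.0, g2=1.0, g3=1.0):
    """Score candidate expansion terms over a pseudo-relevant set R.

    pseudo_relevant : list of (doc_score, {term: match_score}) pairs
    coll_prob       : {term: probability of the term in the collection}
    """
    scores = defaultdict(float)
    for doc_score, term_scores in pseudo_relevant:
        for term, match_score in term_scores.items():
            scores[term] += math.exp(
                g1 * doc_score                     # document relevance
                + g2 * match_score                 # weight of the term in the document
                - g3 * math.log(coll_prob[term])   # dampens common terms
            )
    return dict(scores)

def top_k(scores, k):
    """The k highest-scoring terms."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy log-scale inputs, for illustration only.
R = [(-1.0, {"season": -2.0, "the": -1.0}),
     (-1.5, {"season": -2.0, "episode": -2.5, "the": -1.0})]
P = {"season": 0.001, "episode": 0.001, "the": 0.05}
best_terms = top_k(expansion_scores(R, P), 2)
```

In this toy run the frequent word "the" matches both documents strongly but is ranked below "season" and "episode", because the third component penalizes terms that are common in the collection.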

Parameterized Query Expansion


• Two-stage optimization for estimating parameters
  – a1-a5: 1st stage
  – a6-a7: 2nd stage

Multiple Source Formulation

• LCE and PQE use a single source for expansion, which may lead to topic drift



• Expansion Terms
  – Rank the documents in each source σ with a ranking function over the explicit concepts
  – The M terms with the highest LCE value in each source σ are added to the set of candidate expansion terms
  – Assign a weight to each term in this set, using the weighted combination of its expansion scores
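These steps can be sketched as below, assuming the per-source expansion scores have already been computed. The source names, the source weights, and the choice to treat a missing score as 0 are illustrative assumptions.

```python
def combine_sources(per_source_scores, source_weights, m):
    """Pool the top-m terms of every source, then weight each pooled term
    by the weighted combination of its scores across all sources.

    per_source_scores : {source: {term: expansion score}}
    source_weights    : {source: weight of that source}
    """
    candidates = set()
    for scores in per_source_scores.values():
        candidates.update(sorted(scores, key=scores.get, reverse=True)[:m])
    return {
        term: sum(source_weights[source] * scores.get(term, 0.0)
                  for source, scores in per_source_scores.items())
        for term in candidates
    }

# Hypothetical sources and scores, for illustration only.
per_source = {
    "collection": {"season": 0.6, "episode": 0.3, "cast": 0.1},
    "wikipedia": {"drama": 0.5, "season": 0.4, "nbc": 0.1},
}
combined = combine_sources(per_source, {"collection": 0.5, "wikipedia": 0.5}, m=2)
```

A term supported by several sources, such as "season" here, accumulates weight from each of them, which is how combining sources can counteract the topic drift of a single source.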


[Figure labels: Multiple sources, Explicit Concept, Expansion Concept]



Experiments

• Newswire & Web TREC collections
  – ROBUST04 (500K documents)
  – GOV2 (25M documents)
  – ClueWeb-B (50M documents)
• <title> & <desc> portions of TREC topics
• 3-fold cross-validation


Experiments


Comparison with the query weighting methods on TREC collections

Significance test over each baseline is presented

Experiments


Comparison with the query expansion methods on TREC collections

Statistically indistinguishable from other methods

Experiments

• Other experiments in the WSDM 2012 paper
  – Varying the number of expansion terms
  – Robustness of proposed methods
  – Result diversification performance


Discussion

• The problems solved in these papers are fundamentally important

• Written in a good style
  – General formulation -> specific algorithm
  – Cite related work widely and extensively throughout the paper
  – Motivate the proposed approach from time to time

• Experiments on standard data sets, and quite thorough


Thanks!

Q & A
