
Effective Query Formulation with Multiple Information Sources

Michael Bendersky (1), Donald Metzler (2), W. Bruce Croft (1)

(1) University of Massachusetts
(2) Information Sciences Institute, USC

WSDM 2012 Best Paper Runner Up

Presented by Tom, March 14th, 2012



Michael Bendersky

Donald Metzler: graduated in 2007; Yahoo! Research; USC

W. Bruce Croft: supervisor of both

2

• A Markov Random Field Model for Term Dependencies, SIGIR 2005
• Learning Concept Importance Using a Weighted Dependence Model, WSDM 2010
• Parameterized Concept Weighting in Verbose Queries, SIGIR 2011, Honorable Mention Award
• Effective Query Formulation with Multiple Information Sources, WSDM 2012, Best Paper Runner Up

Each paper inherits from the one before it.

3

Outline

• Query Formulation Process
• Concept-Based Ranking
  – Concept Matching
  – Concept Weighting
    • Sequential Dependence [SIGIR 2005]
    • Weighted Sequential Dependence [WSDM 2010]
    • Parameterized Query Expansion [SIGIR 2011]
    • Multiple Source Formulation [WSDM 2012]
• Experiments
• Discussion

4


Query Formulation Process

6

Query Formulation Process

• Query Refinement
  – Alter the query on the morphological level
  – Tokenization
    • 香港中文大学 (CUHK) || 何善衡 (Ho Sin-Hang) || 大楼 (Building)
  – Spelling corrections
    • E.g. Hong Kng -> Hong Kong
  – Stemming

7

Query Formulation Process

• Structured Query Formulation
  – Concept Identification
    • What are the atomic matching units in the query?
  – Concept Weighting
    • How important are the different concepts for conveying query intent?
  – Query Expansion
    • What additional concepts should be associated with the query?

8

Structured Query Formulation

Query: ER TV Show (ER is an American medical drama television series)

Concept Identification and Concept Weighting
0.297 er
0.168 tv
0.192 show
0.051 er tv
0.012 tv show

Query Expansion Terms
0.085 season
0.065 episode
0.051 dr
0.043 drama
0.036 series

9

Query Formulation Process

10


Concept-Based Ranking

12

Figure: Query, Concepts, Document, with the steps Concept Weighting and Concept Matching

Concept Matching

• Assign score to the matches of concept k in document D

• Monotonic function: value increases with the number of times concept k matches document D

• Language model

13

In the language model, tf is a frequency, C is the collection, D is a document, and µ is a parameter
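The language-model formula itself did not survive in this transcript. The sketch below assumes a Dirichlet-smoothed estimate, which is a common choice that uses exactly the symbols named above (tf, C, D, µ); the function name and the default µ = 2500 are illustrative assumptions, not values from the slides.

```python
import math

def concept_match_score(tf_d, doc_len, tf_c, coll_len, mu=2500.0):
    """Log of a Dirichlet-smoothed concept probability (assumed form).

    tf_d:     occurrences of the concept in document D
    doc_len:  length of D in tokens
    tf_c:     occurrences of the concept in collection C
    coll_len: length of C in tokens
    mu:       smoothing parameter (2500 is an assumed default)
    """
    background = tf_c / coll_len  # collection-level probability
    return math.log((tf_d + mu * background) / (doc_len + mu))

# The score is monotonic: it grows with the number of matches in D.
scores = [concept_match_score(tf, 100, 50, 1_000_000) for tf in range(4)]
print(scores)
```

Smoothing with the collection keeps the score finite when the concept does not occur in the document at all.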


Markov Random Field

• Markov Random Field
  – Undirected graphical models that define a joint probability distribution over a set of random variables
  – Nodes represent random variables, and edges represent dependence semantics
  – Information Retrieval
    • Document random variable D, query term random variables Q

15

Sequential Dependence Model

• Sequential dependence model places edges between adjacent query terms

16

Markov random field model for three query terms under the sequential dependence assumption

Sequential Dependence Model

17

Query Term Concept: an individual query word

Phrase Concept: adjacent query word pairs matched as exact phrases in the document

Proximity Concept: adjacent query word pairs whose individual words both occur, in any order, within a fixed-length window in the document

• All matches of the same type are treated as being equally important

• Concept weights are set to 0.8, 0.1, and 0.1 respectively
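The three concept types and their fixed weights can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window size of 8, the simplified proximity count, and the `log(1 + count)` matching function are all assumptions made to keep the example self-contained.

```python
import math

# Fixed per-type weights quoted on the slide.
SD_WEIGHTS = {"term": 0.8, "phrase": 0.1, "proximity": 0.1}

def sd_concepts(query_terms):
    """Enumerate term, phrase, and proximity concepts for a query."""
    pairs = list(zip(query_terms, query_terms[1:]))  # adjacent word pairs only
    return ([("term", (t,)) for t in query_terms]
            + [("phrase", p) for p in pairs]
            + [("proximity", p) for p in pairs])

def count_matches(kind, words, doc_tokens, window=8):
    """Count matches of one concept in a tokenized document.

    window=8 is an assumed value; the slide only says "fixed length".
    """
    if kind == "term":
        return doc_tokens.count(words[0])
    if kind == "phrase":
        return sum(1 for i in range(len(doc_tokens) - 1)
                   if tuple(doc_tokens[i:i + 2]) == words)
    # Proximity (simplified): position pairs, in either order, that lie
    # fewer than `window` tokens apart.
    first = [i for i, t in enumerate(doc_tokens) if t == words[0]]
    second = [j for j, t in enumerate(doc_tokens) if t == words[1]]
    return sum(1 for i in first for j in second if i != j and abs(i - j) < window)

def sd_score(query_terms, doc_tokens):
    """Weighted sum over concepts; log(1 + count) is a stand-in matching function."""
    return sum(SD_WEIGHTS[kind] * math.log(1 + count_matches(kind, words, doc_tokens))
               for kind, words in sd_concepts(query_terms))

query = ["er", "tv", "show"]
doc = "er is a tv show and the show er airs on tv".split()
print(sd_score(query, doc))
```

Because the weight depends only on the concept type, every term shares 0.8 and every bigram shares 0.1, whatever the query says about their relative importance.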

Weighted Sequential Dependence

• SD treats matches of the same type equally
• We want to weight different terms and bigrams differently, a priori, based on query-level evidence
• Assume the concept weight parameter λ takes on a parameterized form

18

Weighted Sequential Dependence

19

Features defined over unigrams

Features defined over bigrams

The w are free parameters that must be estimated
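The equation these labels annotate is missing from the transcript. One reading consistent with them is that each concept weight λ is a linear combination of importance features, with one parameter vector for unigrams and another for bigrams. The sketch below assumes that form; the feature values and parameter values are invented for illustration.

```python
def concept_weight(features, w):
    """lambda(concept) = sum_j w_j * g_j(concept), an assumed linear form."""
    if len(features) != len(w):
        raise ValueError("features and parameters must have the same length")
    return sum(w_j * g_j for w_j, g_j in zip(w, features))

# Hypothetical feature vectors: [bias, rarity-style score].
unigram_features = {"er": [1.0, 2.0], "tv": [1.0, 0.5], "show": [1.0, 1.0]}
bigram_features = {("er", "tv"): [1.0, 0.3], ("tv", "show"): [1.0, 0.1]}

# Separate free parameters for unigrams and bigrams (values invented here).
w_unigram = [0.5, 0.1]
w_bigram = [0.05, 0.2]

unigram_weights = {c: concept_weight(g, w_unigram) for c, g in unigram_features.items()}
bigram_weights = {c: concept_weight(g, w_bigram) for c, g in bigram_features.items()}
print(unigram_weights)
print(bigram_weights)
```

Unlike the fixed 0.8 / 0.1 / 0.1 scheme, two concepts of the same type can now receive different weights, and only the short vectors w need to be learned.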

Weighted Sequential Dependence

• Concept Importance Features
  – Endogenous: collection dependent
  – Exogenous: collection independent, estimated from external data sources

20

Weighted Sequential Dependence

• Parameter Estimation
  – Coordinate-level ascent
  – Iteratively optimizes a multivariate objective function by performing a series of one-dimensional line searches
  – Cycles repeatedly through each parameter
  – The process is repeated until the gain in the target metric falls below a certain threshold
  – Metzler and Croft 2007
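The procedure described above can be sketched as a simple grid-based coordinate ascent. The function name, the step size, and the grid line search are assumptions for illustration; in the retrieval setting the objective would be a metric such as MAP computed over training queries, replaced here by a toy function.

```python
def coordinate_ascent(objective, params, step=0.1, n_steps=10, tol=1e-6, max_cycles=50):
    """Maximize `objective` by one-dimensional searches over each parameter in turn."""
    params = list(params)
    best = objective(params)
    for _ in range(max_cycles):
        start = best
        for i in range(len(params)):
            original = params[i]
            best_value = original
            # Line search along coordinate i on a fixed grid around the current value.
            for k in range(-n_steps, n_steps + 1):
                params[i] = original + k * step
                score = objective(params)
                if score > best:
                    best, best_value = score, params[i]
            params[i] = best_value  # keep the best value found for this coordinate
        # Stop once a full cycle improves the metric by less than the threshold.
        if best - start < tol:
            break
    return params, best

# Toy objective with its maximum at (1.0, -0.5).
def toy_metric(p):
    return -((p[0] - 1.0) ** 2 + (p[1] + 0.5) ** 2)

learned, value = coordinate_ascent(toy_metric, [0.0, 0.0])
print(learned, value)
```

The method needs only metric evaluations, not gradients, which suits rank-based metrics that are not differentiable in the parameters.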

21

Parameterized Query Expansion

• WSD learns weights only for the explicit query concepts (concepts that appear in the query), not for latent concepts that are associated with the query through pseudo-relevance feedback
• PQE uses four types of concepts
  – Query term
  – Phrase concept
  – Proximity concept
  – Expansion concept
    • Top-K terms associated with the query through pseudo-relevance feedback
    • Obtained using Latent Concept Expansion (Metzler and Croft 2007)

22

Parameterized Query Expansion

• Latent Concept Expansion
  – Use explicit concepts to retrieve a set of documents R (pseudo-relevant documents)
  – Estimate the weight of each term in R as a candidate expansion concept

23

Components of the expansion weight: document relevance; weight of the term in the pseudo-relevant set; a factor that dampens the scores of common terms
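The weighting equation is not preserved in the transcript, so the following sketch only assumes a form that combines the three components named above: the retrieval score of each pseudo-relevant document, the term's probability within that document, and the term's collection probability as a damping factor. The function name, the `gammas` parameters, and the fallback probability are hypothetical.

```python
import math
from collections import defaultdict

def lce_expansion_terms(ranked_docs, coll_prob, k=10, gammas=(1.0, 1.0, 1.0)):
    """Score candidate expansion terms from a pseudo-relevant set R.

    ranked_docs: list of (doc_score, tokens) pairs for the documents in R
    coll_prob:   dict mapping term -> probability in the collection
    gammas:      assumed weights for the three components
    """
    g_rel, g_doc, g_coll = gammas
    weights = defaultdict(float)
    for doc_score, tokens in ranked_docs:
        for term in set(tokens):
            p_doc = tokens.count(term) / len(tokens)
            p_coll = coll_prob.get(term, 1e-6)  # assumed floor for unseen terms
            weights[term] += math.exp(
                g_rel * doc_score            # document relevance
                + g_doc * math.log(p_doc)    # weight of the term in the document
                - g_coll * math.log(p_coll)  # dampens common terms
            )
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]

docs = [(-1.0, ["er", "drama", "the", "the"]), (-2.0, ["drama", "season", "the"])]
collection = {"the": 0.05, "er": 0.001, "drama": 0.0005, "season": 0.0008}
expansion = lce_expansion_terms(docs, collection)
print(expansion)
```

In this toy example the frequent word "the" ends up last despite occurring in both documents, because its high collection probability cancels its in-document frequency.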

Parameterized Query Expansion

24

• Two-stage optimization for estimating parameters
• a1-a5: 1st stage
• a6-a7: 2nd stage

Multiple Source Formulation

• LCE and PQE use a single source for expansion, which may lead to topic drift

25

Multiple Source Formulation

• Expansion Term
  – Rank documents in each source σ using a ranking function over the explicit concepts
  – The M terms with the highest LCE value for each source σ are added to the set of expansion terms
  – Assign a weight to each term in this set, using the weighted combination of expansion scores
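The pooling and combination steps above can be sketched as follows. The source names, scores, and source weights are invented for illustration, and the per-source scores are assumed to have been computed already (for example by an LCE-style scorer); the slides do not show how the source weights are obtained.

```python
def multi_source_expansion(source_scores, source_weights, m=2):
    """Pool the top-M terms of every source, then weight each pooled term.

    source_scores:  dict source -> dict term -> expansion score in that source
    source_weights: dict source -> weight of that source (assumed given)
    """
    candidates = set()
    for scores in source_scores.values():
        top = sorted(scores, key=scores.get, reverse=True)[:m]
        candidates.update(top)
    combined = {
        term: sum(source_weights[s] * scores.get(term, 0.0)
                  for s, scores in source_scores.items())
        for term in candidates
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sources and scores.
sources = {
    "collection": {"season": 0.4, "episode": 0.3, "hospital": 0.1},
    "wikipedia": {"drama": 0.5, "season": 0.2, "nbc": 0.1},
}
weights = {"collection": 0.6, "wikipedia": 0.4}
pooled = multi_source_expansion(sources, weights)
print(pooled)
```

A term supported by several sources, such as "season" here, accumulates weight from each of them, which is one way a multi-source formulation can counter the topic drift of a single source.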

26

Multiple Source Formulation

27

Figure labels: multiple sources, explicit concepts, expansion concepts



Experiments

• Newswire & Web TREC collections
  – ROBUST04 (500K documents)
  – GOV2 (25M documents)
  – ClueWeb-B (50M documents)
• <title> & <desc> portions of TREC topics
• 3-fold cross-validation

34

Experiments

35

Comparison with the query weighting methods on TREC collections

Significance test over each baseline is presented

Experiments

36

Comparison with the query expansion methods on TREC collections

Statistically indistinguishable from other methods

Experiments

• Other experiments in the WSDM 2012 paper
  – Varying the number of expansion terms
  – Robustness of proposed methods
  – Result diversification performance

37

Discussion

• The problems solved in these papers are fundamentally important
• Written in a good style
  – General formulation -> specific algorithm
  – Cite related work extensively throughout the paper
  – Motivate the proposed approach from time to time
• Experiments use standard data sets and are quite thorough

38

39

Thanks!

Q & A