
1

A Formal Study of Information Retrieval Heuristics

Hui Fang, Tao Tao and ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign, USA

2

Empirical Observations in IR

• Retrieval heuristics are necessary for good retrieval performance.
  – E.g. TF-IDF weighting, document length normalization

• Similar formulas may have different performances.

• Performance is sensitive to parameter setting.

3

Empirical Observations in IR (Cont.)

• Pivoted Normalization Method

$$f(d,q) = \sum_{w \in q \cap d} \frac{1 + \ln\big(1 + \ln c(w,d)\big)}{1 - s + s\frac{|d|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}$$

• Dirichlet Prior Method

$$f(d,q) = \sum_{w \in q \cap d} c(w,q) \cdot \ln\Big(1 + \frac{c(w,d)}{\mu\, p(w|C)}\Big) + |q| \cdot \ln\frac{\mu}{|d| + \mu}$$

• Okapi Method

$$f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1)\, c(w,d)}{k_1\big((1-b) + b\frac{|d|}{avdl}\big) + c(w,d)} \cdot \frac{(k_3 + 1)\, c(w,q)}{k_3 + c(w,q)}$$

Each formula combines the same three components: a term frequency transformation (e.g. 1 + ln(c(w,d))), an inverse document frequency factor, and document length normalization. Alternative TF transformations exist, and performance is sensitive to the parameter settings (s, μ, k1, b, k3).
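To make the components concrete, here is a minimal Python sketch of the pivoted normalization formula (the function name, toy statistics, and default s = 0.2 are illustrative choices, not from the slides):

```python
import math

def pivoted_score(query_counts, doc_counts, doc_len, avdl, N, df, s=0.2):
    """Pivoted normalization scoring: for each term matched in both
    query and document, a sublinear TF transform divided by the
    pivoted length norm, times query TF and the IDF ln((N+1)/df(w))."""
    score = 0.0
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf == 0:
            continue  # only terms occurring in both q and d contribute
        norm_tf = (1 + math.log(1 + math.log(tf))) / (1 - s + s * doc_len / avdl)
        score += norm_tf * qtf * math.log((N + 1) / df[w])
    return score

# Toy example: a 100-word document in a 100-document collection.
df = {"retrieval": 10, "heuristics": 2}
q = {"retrieval": 1, "heuristics": 1}
d = {"retrieval": 3, "heuristics": 1}
print(pivoted_score(q, d, doc_len=100, avdl=100.0, N=100, df=df))
```

The rarer term ("heuristics", df = 2) contributes a larger IDF weight than the more common one, and repeated occurrences contribute sublinearly via the doubly-logarithmic TF transform.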

4

Research Questions

• How can we formally characterize these necessary retrieval heuristics?

• Can we predict the empirical behavior of a method without experimentation?

5

Outline

• Formalized heuristic retrieval constraints
• Analytical evaluation of the current retrieval formulas
• Benefits of constraint analysis
  – Better understanding of parameter optimization
  – Explanation of performance difference
  – Improvement of existing retrieval formulas

6

Term Frequency Constraints (TFC1)

TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.

• TFC1

Let q be a query with only one term w.

d1: c(w,d1)    d2: c(w,d2)

If |d1| = |d2| and c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
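A constraint like TFC1 can be checked numerically. The sketch below is an illustrative harness (not from the paper) that treats a one-term query's score as a function of term frequency at fixed document length:

```python
import math

def satisfies_tfc1(score_fn, doc_len, tf_pairs):
    """TFC1 check: with |d1| = |d2| and c(w,d1) > c(w,d2),
    d1's score must be strictly higher. score_fn(tf, doc_len)."""
    return all(score_fn(t1, doc_len) > score_fn(t2, doc_len)
               for t1, t2 in tf_pairs if t1 > t2)

log_tf = lambda tf, L: 1 + math.log(tf) if tf > 0 else 0.0   # sublinear TF
binary = lambda tf, L: 1.0 if tf > 0 else 0.0                # presence only

pairs = [(2, 1), (5, 2), (10, 9)]
print(satisfies_tfc1(log_tf, 100, pairs))   # sublinear TF satisfies TFC1
print(satisfies_tfc1(binary, 100, pairs))   # binary TF violates it
```

A binary (presence-only) weighting fails TFC1 because extra occurrences add nothing, whereas any strictly increasing TF transform passes.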

7

Term Frequency Constraints (TFC2)

TF weighting heuristic II: Favor a document with more distinct query terms.

• TFC2

Let q be a query and w1, w2 be two query terms. Assume idf(w1) = idf(w2) and |d1| = |d2|.

q: w1 w2
d1: c(w1,d1), c(w2,d1)    d2: c(w1,d2)

If c(w1,d2) = c(w1,d1) + c(w2,d1)
and c(w2,d2) = 0, c(w1,d1) ≠ 0, c(w2,d1) ≠ 0,
then f(d1,q) > f(d2,q).
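TFC2 is a direct consequence of sublinear (concave) TF weighting, as this small numeric check illustrates (the counts are arbitrary illustrative values):

```python
import math

def log_tf(tf):
    # sublinear TF transform; its concavity is what makes TFC2 hold
    return 1 + math.log(tf) if tf > 0 else 0.0

# TFC2 instance: idf(w1) = idf(w2), |d1| = |d2|; d2 packs all
# occurrences into w1, i.e. c(w1,d2) = c(w1,d1) + c(w2,d1).
c_w1_d1, c_w2_d1 = 2, 3
f_d1 = log_tf(c_w1_d1) + log_tf(c_w2_d1)   # two distinct query terms
f_d2 = log_tf(c_w1_d1 + c_w2_d1)           # one term, same total count
print(f_d1 > f_d2)  # True: the doc with more distinct terms wins
```

Because the transform grows more slowly than linearly, splitting the same total count across two terms yields a strictly larger sum than concentrating it in one.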

8

Term Discrimination Constraint (TDC)

IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.

Query: SVM Tutorial. Assume idf(SVM) > idf(Tutorial).

Doc 1: … SVM SVM Tutorial …
Doc 2: … SVM Tutorial Tutorial …

f(Doc1) > f(Doc2)

9

Term Discrimination Constraint (Cont.)

• TDC

Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2| and idf(w1) ≥ idf(w2).

q: w1 w2
d1: c(w1,d1), c(w2,d1)    d2: c(w1,d2), c(w2,d2)

If c(w1,d1) ≥ c(w1,d2),
and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2),
and c(w,d1) = c(w,d2) for all other words w,
then f(d1,q) ≥ f(d2,q).

10

Length Normalization Constraints (LNCs)

Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).

• LNC1

Let q be a query.
If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1,
but for other words w, c(w,d2) = c(w,d1),
then f(d1,q) ≥ f(d2,q).

• LNC2

Let q be a query.
If k > 1, |d1| = k·|d2|, and c(w,d1) = k·c(w,d2) for all words w,
then f(d1,q) ≥ f(d2,q).

11

TF-LENGTH Constraint (TF-LNC)

TF-LN heuristic: Regularize the interaction of TF and document length.

• TF-LNC

Let q be a query with only one term w.

d1: c(w,d1)    d2: c(w,d2)

If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) − c(w,d2),
then f(d1,q) > f(d2,q).
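For pivoted normalization, TF-LNC can be checked on a concrete instance. The numbers below are illustrative; for a one-term query only the length-normalized TF component matters, since the query-TF and IDF factors are identical for both documents:

```python
import math

def pivoted_tf(tf, doc_len, avdl, s):
    # length-normalized TF component of the pivoted formula
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doc_len / avdl)

# TF-LNC instance: d1 is d2 plus extra copies of w, so
# |d1| = |d2| + c(w,d1) - c(w,d2).
avdl, s = 100.0, 0.2
tf2, len2 = 2, 100
tf1 = 5
len1 = len2 + (tf1 - tf2)
print(pivoted_tf(tf1, len1, avdl, s) > pivoted_tf(tf2, len2, avdl, s))  # True
```

The extra occurrences raise the TF numerator faster than the slightly longer document raises the length-norm denominator, so d1 scores higher as the constraint requires.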

12

Analytical Evaluation

Retrieval Formula   TFCs         TDC          LNC1         LNC2         TF-LNC
Pivoted Norm.       Yes          Conditional  Yes          Conditional  Conditional
Dirichlet Prior     Yes          Conditional  Yes          Conditional  Yes
Okapi (original)    Conditional  Conditional  Conditional  Conditional  Conditional
Okapi (modified)    Yes          Conditional  Yes          Yes          Yes

13

Term Discrimination Constraint (TDC)

IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.

Query: SVM Tutorial. Assume idf(SVM) > idf(Tutorial).

Doc 1: … SVM SVM SVM Tutorial Tutorial …
Doc 2: … Tutorial SVM SVM Tutorial Tutorial …

f(Doc1) > f(Doc2)

14

Benefits of Constraint Analysis

• Provide an approximate bound for the parameters
  – A constraint may be satisfied only if the parameter is within a particular interval.
• Compare different formulas analytically without experimentation
  – When a formula does not satisfy a constraint, it often indicates non-optimality of the formula.
• Suggest how to improve the current retrieval models
  – Violation of constraints may pinpoint where a formula needs to be improved.

15

Benefits 1: Bounding Parameters

• Pivoted Normalization Method: LNC2 implies s < 0.4.

[Figure: parameter sensitivity of s — average precision vs. s; the optimal s (for average precision) lies below the 0.4 bound.]
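The effect of the bound can be seen on a single LNC2 instance. In the sketch below (illustrative values; the exact s < 0.4 bound comes from the paper's analysis over all instances, not from this one case), a small s keeps LNC2 satisfied while a large s over-penalizes the concatenated document:

```python
import math

def pivoted_tf(tf, doc_len, avdl, s):
    # length-normalized TF component of pivoted normalization
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doc_len / avdl)

# LNC2 with k = 2: d1 is d2 concatenated with itself, so
# c(w,d1) = 2*c(w,d2) and |d1| = 2*|d2|.
avdl = 100.0
tf2, len2 = 1, 100
tf1, len1 = 2, 200

lnc2_holds = lambda s: pivoted_tf(tf1, len1, avdl, s) >= pivoted_tf(tf2, len2, avdl, s)
print(lnc2_holds(0.2))  # True: small s satisfies LNC2 on this instance
print(lnc2_holds(0.8))  # False: large s over-penalizes the long document
```

This is exactly the kind of reasoning that turns a constraint into a usable parameter interval before any retrieval experiment is run.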

16

Benefits 2: Analytical Comparison

• Okapi Method

$$f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1)\, c(w,d)}{k_1\big((1-b) + b\frac{|d|}{avdl}\big) + c(w,d)} \cdot \frac{(k_3 + 1)\, c(w,q)}{k_3 + c(w,q)}$$

The IDF factor becomes negative when df(w) is large, causing the original Okapi formula to violate many constraints.

[Figure: average precision vs. s (Pivoted) or b (Okapi), for keyword and verbose queries.]

17

Benefits 3: Improving Retrieval Formulas

Make Okapi satisfy more constraints; expected to help verbose queries.

• Modified Okapi Method: replace the original IDF factor with ln((N+1)/df(w)):

$$f(d,q) = \sum_{w \in q \cap d} \ln\frac{N + 1}{df(w)} \cdot \frac{(k_1 + 1)\, c(w,d)}{k_1\big((1-b) + b\frac{|d|}{avdl}\big) + c(w,d)} \cdot \frac{(k_3 + 1)\, c(w,q)}{k_3 + c(w,q)}$$

[Figure: average precision vs. s or b for Pivoted, Okapi, and Modified Okapi, on keyword and verbose queries.]
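The motivation for the fix can be verified directly: the original Okapi IDF goes negative once a term appears in more than half the documents, while the replacement stays positive (the collection sizes below are illustrative):

```python
import math

def okapi_idf(N, df):
    # original Okapi IDF factor: negative once df(w) > N/2
    return math.log((N - df + 0.5) / (df + 0.5))

def modified_idf(N, df):
    # replacement used in the modified Okapi: always positive for df <= N
    return math.log((N + 1) / df)

N = 1000
print(okapi_idf(N, 600))     # negative: a common term gets a negative weight
print(modified_idf(N, 600))  # positive under the modified IDF
print(okapi_idf(N, 10))      # rare term: both factors agree it is informative
```

A negative term weight means that matching a (common) query term can lower a document's score, which is what drives the constraint violations, especially on verbose queries full of common words.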

18

Conclusions and Future Work

• Conclusions
  – Retrieval heuristics can be captured through formally defined constraints.
  – It is possible to evaluate a retrieval formula analytically through constraint analysis.

• Future Work
  – Explore additional necessary heuristics
  – Apply these constraints to many other retrieval methods
  – Develop new retrieval formulas through constraint analysis

19

The End

Thank you!
