
Page 1: A Formal Study of Information Retrieval Heuristics


A Formal Study of Information Retrieval Heuristics

Hui Fang, Tao Tao and ChengXiang Zhai

Department of Computer Science, University of Illinois, Urbana-Champaign, USA

Page 2: A Formal Study of Information Retrieval Heuristics


Empirical Observations in IR

• Retrieval heuristics are necessary for good retrieval performance.

  – E.g., TF-IDF weighting, document length normalization

• Similar formulas may perform differently.

• Performance is sensitive to parameter setting.

Page 3: A Formal Study of Information Retrieval Heuristics

Empirical Observations in IR (Cont.)

• Pivoted Normalization Method

  f(d,q) = \sum_{w \in q \cap d} \frac{1 + \ln(1 + \ln(c(w,d)))}{1 - s + s \frac{|d|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}

• Dirichlet Prior Method

  f(d,q) = \sum_{w \in q \cap d} c(w,q) \cdot \ln\left(1 + \frac{c(w,d)}{\mu \cdot p(w|C)}\right) + |q| \cdot \ln\frac{\mu}{\mu + |d|}

• Okapi Method

  f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1) \cdot c(w,d)}{k_1\left((1-b) + b\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3 + 1) \cdot c(w,q)}{k_3 + c(w,q)}

Each formula combines a term frequency component (e.g., the alternative TF transformation 1 + ln(c(w,d)) in the pivoted method), an inverse document frequency component, and a document length normalization component; each is also sensitive to its parameter settings.
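A minimal Python sketch of the three formulas above, for concreteness; the function and argument names are ours, and the collection statistics (N, df, avdl, p(w|C)) are assumed to be supplied by the caller rather than computed here.

```python
import math

def pivoted_score(query, doc, doc_len, avdl, N, df, s=0.2):
    """Pivoted normalization; query and doc are term -> count dicts."""
    score = 0.0
    for w, qtf in query.items():
        tf = doc.get(w, 0)
        if tf == 0:
            continue
        norm = 1 - s + s * doc_len / avdl                 # document length normalization
        score += (1 + math.log(1 + math.log(tf))) / norm * qtf * math.log((N + 1) / df[w])
    return score

def dirichlet_score(query, doc, doc_len, p_wC, mu=2000):
    """Dirichlet prior smoothing; p_wC[w] is the collection language model p(w|C)."""
    score = sum(query.values()) * math.log(mu / (mu + doc_len))
    for w, qtf in query.items():
        tf = doc.get(w, 0)
        if tf > 0:
            score += qtf * math.log(1 + tf / (mu * p_wC[w]))
    return score

def okapi_score(query, doc, doc_len, avdl, N, df, k1=1.2, b=0.75, k3=1000):
    """Okapi; note the IDF factor can go negative for very common terms."""
    score = 0.0
    for w, qtf in query.items():
        tf = doc.get(w, 0)
        if tf == 0:
            continue
        idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))
        tf_part = (k1 + 1) * tf / (k1 * ((1 - b) + b * doc_len / avdl) + tf)
        qtf_part = (k3 + 1) * qtf / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score
```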

Page 4: A Formal Study of Information Retrieval Heuristics


Research Questions

• How can we formally characterize these necessary retrieval heuristics?

• Can we predict the empirical behavior of a method without experimentation?

Page 5: A Formal Study of Information Retrieval Heuristics

Outline

• Formalized heuristic retrieval constraints

• Analytical evaluation of the current retrieval formulas

• Benefits of constraint analysis

  – Better understanding of parameter optimization

  – Explanation of performance difference

  – Improvement of existing retrieval formulas

Page 6: A Formal Study of Information Retrieval Heuristics

Term Frequency Constraints (TFC1)

TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.

• TFC1

  Let q be a query with only one term w.

  If |d1| = |d2| and c(w,d1) > c(w,d2),

  then f(d1,q) > f(d2,q).
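A toy numeric check of TFC1 (ours, not the paper's code), using a simple pivoted-style scorer as a stand-in: two documents of equal length, where d1 has more occurrences of the single query term.

```python
import math

def score(tf, doc_len, avdl=100.0, N=1000, df=50, s=0.2):
    """Pivoted-style score of a single query term with frequency tf in a document."""
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doc_len / avdl) * math.log((N + 1) / df)

# Equal-length documents; d1 contains the query term 5 times, d2 only twice.
assert score(tf=5, doc_len=100) > score(tf=2, doc_len=100)   # TFC1 holds
```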

Page 7: A Formal Study of Information Retrieval Heuristics

Term Frequency Constraints (TFC2)

TF weighting heuristic II: Favor a document with more distinct query terms.

• TFC2

  Let q be a query and w1, w2 be two query terms. Assume idf(w1) = idf(w2).

  If |d1| = |d2|, c(w1,d2) = c(w1,d1) + c(w2,d1),

  c(w2,d2) = 0, c(w1,d1) ≠ 0 and c(w2,d1) ≠ 0,

  then f(d1,q) > f(d2,q).
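A toy numeric check of TFC2 (ours): with a concave TF transformation, spreading the same total number of occurrences over two distinct query terms (d1) beats piling them all on one term (d2). Equal IDF is assumed, so the IDF factor is omitted.

```python
import math

def tf_weight(tf):
    """Concave TF transformation used in the pivoted formula."""
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

d1 = {"w1": 3, "w2": 3}   # both query terms present
d2 = {"w1": 6, "w2": 0}   # same total count, but only one distinct query term

score = lambda doc: sum(tf_weight(tf) for tf in doc.values())
assert score(d1) > score(d2)   # TFC2 holds for this concave TF
```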

Page 8: A Formal Study of Information Retrieval Heuristics

Term Discrimination Constraint (TDC)

IDF weighting heuristic: Penalize words that are popular in the collection; give higher weights to discriminative terms.

Query: "SVM Tutorial".  Assume IDF(SVM) > IDF(Tutorial).

Doc 1: … SVM … SVM … Tutorial …       (relatively more occurrences of SVM)

Doc 2: … SVM … Tutorial … Tutorial …  (relatively more occurrences of Tutorial)

With the same length and the same total number of query-term occurrences, the more discriminative term should dominate: f(Doc1) > f(Doc2).

Page 9: A Formal Study of Information Retrieval Heuristics

Term Discrimination Constraint (Cont.)

• TDC

  Let q be a query and w1, w2 be two query terms.

  Assume |d1| = |d2|,

  c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2),

  and c(w,d1) = c(w,d2) for all other words w.

  If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2),

  then f(d1,q) ≥ f(d2,q).
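A toy numeric check of TDC (ours): with equal lengths and equal total query-term counts, shifting occurrences toward the higher-IDF term should not lower the score. The IDF values below are made up.

```python
import math

idf = {"SVM": math.log(1000 / 10), "Tutorial": math.log(1000 / 200)}   # idf(SVM) > idf(Tutorial)

def tf_weight(tf):
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

def score(doc):
    return sum(tf_weight(tf) * idf[w] for w, tf in doc.items())

d1 = {"SVM": 3, "Tutorial": 2}   # more of the discriminative term
d2 = {"SVM": 2, "Tutorial": 3}   # same total count, same length
assert score(d1) >= score(d2)    # TDC holds
```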

Page 10: A Formal Study of Information Retrieval Heuristics

Length Normalization Constraints (LNCs)

Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).

• LNC1

  Let q be a query. If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1,

  but for other words w, c(w,d2) = c(w,d1),

  then f(d1,q) ≥ f(d2,q).

• LNC2

  Let q be a query. For any k ≥ 1, if |d1| = k·|d2|

  and c(w,d1) = k·c(w,d2) for all words w,

  then f(d1,q) ≥ f(d2,q).
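A toy numeric check of LNC1 and LNC2 (ours), using a pivoted-style scorer with a small slope s; the IDF factor is dropped since both documents contain the same query terms.

```python
import math

def score(query, doc, avdl=50.0, s=0.1):
    doc_len = sum(doc.values())
    norm = 1 - s + s * doc_len / avdl
    return sum((1 + math.log(1 + math.log(doc[w]))) / norm
               for w in query if doc.get(w, 0) > 0)

q = ["information", "retrieval"]
d1 = {"information": 2, "retrieval": 1, "formal": 3}

d2 = dict(d1, formal=4)                   # d2 = d1 plus one extra non-query word
assert score(q, d1) >= score(q, d2)       # LNC1: the extra word must not raise the score

k = 3
d3 = {w: k * tf for w, tf in d1.items()}  # d3 = d1 concatenated with itself k times
assert score(q, d3) >= score(q, d1)       # LNC2: the long copy must not be over-penalized
```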

Page 11: A Formal Study of Information Retrieval Heuristics

TF-LENGTH Constraint (TF-LNC)

TF-LN heuristic: Regularize the interaction of TF and document length.

• TF-LNC

  Let q be a query with only one term w.

  If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2),

  then f(d1,q) > f(d2,q).
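A toy numeric check of TF-LNC (ours): appending extra copies of the query term to a document, so that TF and length grow by the same amount, should raise its score under a pivoted-style scorer.

```python
import math

def score(tf, doc_len, avdl=50.0, s=0.2):
    if tf == 0:
        return 0.0
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doc_len / avdl)

tf2, len2 = 2, 40
tf1, len1 = tf2 + 3, len2 + 3                # d1 = d2 plus three more copies of w
assert score(tf1, len1) > score(tf2, len2)   # TF-LNC holds here
```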

Page 12: A Formal Study of Information Retrieval Heuristics

Analytical Evaluation

Retrieval Formula    TFCs         TDC          LNC1         LNC2         TF-LNC
Pivoted Norm.        Yes          Conditional  Yes          Conditional  Conditional
Dirichlet Prior      Yes          Conditional  Yes          Conditional  Yes
Okapi (original)     Conditional  Conditional  Conditional  Conditional  Conditional
Okapi (modified)     Yes          Conditional  Yes          Yes          Yes

Page 13: A Formal Study of Information Retrieval Heuristics

Term Discrimination Constraint (TDC)

IDF weighting heuristic: Penalize words that are popular in the collection; give higher weights to discriminative terms.

Query: "SVM Tutorial".  Assume IDF(SVM) > IDF(Tutorial).

Doc 1: … SVM SVM SVM Tutorial Tutorial …

Doc 2: … Tutorial SVM SVM Tutorial Tutorial …

f(Doc1) > f(Doc2)

Page 14: A Formal Study of Information Retrieval Heuristics

Benefits of Constraint Analysis

• Provide an approximate bound for the parameters

  – A constraint may be satisfied only if the parameter is within a particular interval.

• Compare different formulas analytically without experimentation

  – When a formula does not satisfy a constraint, it often indicates non-optimality of the formula.

• Suggest how to improve the current retrieval models

  – Violation of constraints may pinpoint where a formula needs to be improved.

Page 15: A Formal Study of Information Retrieval Heuristics

Benefits 1: Bounding Parameters

• Pivoted Normalization Method: LNC2 is satisfied only if s < 0.4.

[Plot: average precision vs. the parameter s, showing parameter sensitivity; the optimal s (for average precision) falls within the bound s < 0.4.]
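A rough numeric illustration (ours) of how LNC2 bounds s in the pivoted normalization formula: concatenate a document with itself (k = 2) and see for which values of s the long copy still scores at least as high. The statistics below are made up, so the crossover here is not the paper's 0.4 bound, only an illustration of the mechanism.

```python
import math

def pivoted(tf, doc_len, s, avdl=100.0, N=100000, df=1000):
    return (1 + math.log(1 + math.log(tf))) / (1 - s + s * doc_len / avdl) * math.log((N + 1) / df)

tf, doc_len, k = 2, 100, 2
for s in [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]:
    short = pivoted(tf, doc_len, s)
    long = pivoted(k * tf, k * doc_len, s)     # the k-fold self-concatenation
    print(f"s = {s:.2f}: LNC2 {'holds' if long >= short else 'is violated'}")
```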

Page 16: A Formal Study of Information Retrieval Heuristics

Benefits 2: Analytical Comparison

• Okapi Method

  f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1 + 1) \cdot c(w,d)}{k_1\left((1-b) + b\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3 + 1) \cdot c(w,q)}{k_3 + c(w,q)}

  The IDF factor is negative when df(w) is large, which causes Okapi to violate many constraints.

[Plots: average precision vs. s (Pivoted) or b (Okapi), for a keyword query set and a verbose query set.]

Page 17: A Formal Study of Information Retrieval Heuristics

Benefits 3: Improving Retrieval Formulas

Make Okapi satisfy more constraints; expected to help verbose queries.

• Modified Okapi Method: replace the IDF factor ln((N - df(w) + 0.5)/(df(w) + 0.5)) with ln((N + 1)/df(w)):

  f(d,q) = \sum_{w \in q \cap d} \ln\frac{N + 1}{df(w)} \cdot \frac{(k_1 + 1) \cdot c(w,d)}{k_1\left((1-b) + b\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3 + 1) \cdot c(w,q)}{k_3 + c(w,q)}

[Plots: average precision vs. s or b for Pivoted, Okapi, and Modified Okapi, on a keyword query set and a verbose query set.]
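A quick comparison (ours) of the original and modified Okapi IDF factors: the modified version ln((N + 1)/df) stays positive for every df ≤ N, while the original goes negative once a term appears in more than half of the documents.

```python
import math

N = 1000

def original_idf(df):
    return math.log((N - df + 0.5) / (df + 0.5))

def modified_idf(df):
    return math.log((N + 1) / df)

for df in [10, 400, 600, 900]:
    print(f"df = {df:4d}   original = {original_idf(df):7.3f}   modified = {modified_idf(df):6.3f}")
# original_idf is negative for df = 600 and df = 900; modified_idf never is.
```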

Page 18: A Formal Study of Information Retrieval Heuristics

Conclusions and Future Work

• Conclusions

  – Retrieval heuristics can be captured through formally defined constraints.

  – It is possible to evaluate a retrieval formula analytically through constraint analysis.

• Future Work

  – Explore additional necessary heuristics

  – Apply these constraints to many other retrieval methods

  – Develop new retrieval formulas through constraint analysis

Page 19: A Formal Study of Information Retrieval Heuristics


The End

Thank you!