1
A Formal Study of Information Retrieval Heuristics
Hui Fang, Tao Tao and ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign, USA
2
Empirical Observations in IR
• Retrieval heuristics are necessary for good retrieval performance.
  – E.g., TF-IDF weighting, document length normalization
• Similar formulas may perform differently.
• Performance is sensitive to parameter settings.
3
Empirical Observations in IR (Cont.)

• Pivoted Normalization Method

$$f(d,q) = \sum_{w \in q \cap d} \frac{1 + \ln(1 + \ln(c(w,d)))}{(1-s) + s\frac{|d|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}$$

• Dirichlet Prior Method

$$f(d,q) = \sum_{w \in q \cap d} c(w,q) \cdot \ln\left(1 + \frac{c(w,d)}{\mu \cdot p(w|C)}\right) + |q| \cdot \ln\frac{\mu}{|d| + \mu}$$

• Okapi Method

$$f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1+1) \cdot c(w,d)}{k_1\left((1-b) + b\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3+1) \cdot c(w,q)}{k_3 + c(w,q)}$$

Each formula combines a term frequency component, an inverse document frequency component, and a document length normalization component. An alternative TF transformation, such as 1 + ln(c(w,d)), and different parameter settings lead to different empirical behavior.
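A minimal Python sketch of the three formulas above, assuming tokenized documents represented as lists of terms; the collection statistics (N, df, avdl, p(w|C)) are caller-supplied placeholders, and the parameter defaults (s, mu, k1, b, k3) are conventional illustrative values rather than settings from the paper.

```python
import math
from collections import Counter

def pivoted_score(query, doc, N, df, avdl, s=0.2):
    """Pivoted normalization: double-log TF, pivoted length norm, IDF."""
    q, d = Counter(query), Counter(doc)
    score = 0.0
    for w in q:
        if d[w] == 0:
            continue
        tf = 1 + math.log(1 + math.log(d[w]))
        norm = (1 - s) + s * len(doc) / avdl
        score += (tf / norm) * q[w] * math.log((N + 1) / df[w])
    return score

def dirichlet_score(query, doc, p_wC, mu=2000):
    """Query likelihood with Dirichlet prior smoothing; p_wC is the
    collection language model p(w|C)."""
    q, d = Counter(query), Counter(doc)
    score = sum(q[w] * math.log(1 + d[w] / (mu * p_wC[w]))
                for w in q if d[w] > 0)
    return score + len(query) * math.log(mu / (len(doc) + mu))

def okapi_score(query, doc, N, df, avdl, k1=1.2, b=0.75, k3=1000):
    """Okapi BM25 with the original (possibly negative) IDF factor."""
    q, d = Counter(query), Counter(doc)
    score = 0.0
    for w in q:
        if d[w] == 0:
            continue
        idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))
        tf = ((k1 + 1) * d[w]) / (k1 * ((1 - b) + b * len(doc) / avdl) + d[w])
        qtf = ((k3 + 1) * q[w]) / (k3 + q[w])
        score += idf * tf * qtf
    return score
```

The constraint checks sketched on the following slides reuse this pivoted scorer in miniature.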
4
Research Questions
• How can we formally characterize these necessary retrieval heuristics?
• Can we predict the empirical behavior of a method without experimentation?
5
Outline

• Formalized heuristic retrieval constraints
• Analytical evaluation of current retrieval formulas
• Benefits of constraint analysis
  – Better understanding of parameter optimization
  – Explanation of performance differences
  – Improvement of existing retrieval formulas
6
Term Frequency Constraints (TFC1)

TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.

• TFC1
Let q be a query with only one term w.
If $|d_1| = |d_2|$ and $c(w, d_1) > c(w, d_2)$,
then $f(d_1, q) > f(d_2, q)$.
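As a concrete (synthetic) check of TFC1, the sketch below scores two equal-length documents with the pivoted normalization formula from the earlier slide; the documents, counts, and collection statistics are illustrative placeholders.

```python
import math
from collections import Counter

def pivoted_score(query, doc, N, df, avdl, s=0.2):
    """Pivoted normalization score, as defined on the earlier slide."""
    q, d = Counter(query), Counter(doc)
    return sum((1 + math.log(1 + math.log(d[w]))) /
               ((1 - s) + s * len(doc) / avdl) *
               q[w] * math.log((N + 1) / df[w])
               for w in q if d[w])

# TFC1 setup: same length, but d1 has more occurrences of the query term.
q = ["retrieval"]
d1 = ["retrieval"] * 3 + ["filler"] * 7      # c(w, d1) = 3, |d1| = 10
d2 = ["retrieval"] * 1 + ["filler"] * 9      # c(w, d2) = 1, |d2| = 10
N, df, avdl = 1000, {"retrieval": 50}, 10
assert pivoted_score(q, d1, N, df, avdl) > pivoted_score(q, d2, N, df, avdl)
```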
7
Term Frequency Constraints (TFC2)

TF weighting heuristic II: Favor a document with more distinct query terms.

• TFC2
Let q be a query and w1, w2 be two query terms.
Assume $|d_1| = |d_2|$ and $idf(w_1) = idf(w_2)$.
If $c(w_1, d_2) = c(w_1, d_1) + c(w_2, d_1)$
and $c(w_2, d_2) = 0$, $c(w_1, d_1) \neq 0$, $c(w_2, d_1) \neq 0$,
then $f(d_1, q) > f(d_2, q)$.
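A similar synthetic check for TFC2: d1 covers both query terms once, while d2 packs the same total number of query-term occurrences into a single term. With equal document lengths and equal idf, TFC2 requires the broader d1 to score higher, which pivoted normalization satisfies here.

```python
import math
from collections import Counter

def pivoted_score(query, doc, N, df, avdl, s=0.2):
    q, d = Counter(query), Counter(doc)
    return sum((1 + math.log(1 + math.log(d[w]))) /
               ((1 - s) + s * len(doc) / avdl) *
               q[w] * math.log((N + 1) / df[w])
               for w in q if d[w])

# TFC2 setup: c(w1, d2) = c(w1, d1) + c(w2, d1), c(w2, d2) = 0,
# |d1| = |d2|, and idf(w1) = idf(w2) (equal df below).
q = ["svm", "kernel"]
d1 = ["svm", "kernel"] + ["filler"] * 8      # one occurrence of each term
d2 = ["svm", "svm"] + ["filler"] * 8         # two occurrences of w1 only
N, avdl = 1000, 10
df = {"svm": 50, "kernel": 50}
assert pivoted_score(q, d1, N, df, avdl) > pivoted_score(q, d2, N, df, avdl)
```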
8
Term Discrimination Constraint (TDC)

IDF weighting heuristic: Penalize words that are popular in the collection; give higher weights to discriminative terms.

Query: SVM Tutorial. Assume IDF(SVM) > IDF(Tutorial).

[Slide diagram: Doc 1 and Doc 2 have the same length and the same total number of occurrences of SVM and Tutorial, but Doc 1 contains the discriminative term SVM more often. TDC implies $f(Doc1) > f(Doc2)$.]
9
Term Discrimination Constraint (Cont.)

• TDC
Let q be a query and w1, w2 be two query terms.
Assume $|d_1| = |d_2|$ and $idf(w_1) \geq idf(w_2)$.
If $c(w_1, d_1) \geq c(w_1, d_2)$,
$c(w_1, d_1) + c(w_2, d_1) = c(w_1, d_2) + c(w_2, d_2)$,
and $c(w, d_1) = c(w, d_2)$ for all other words w,
then $f(d_1, q) \geq f(d_2, q)$.
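The SVM/Tutorial example can be checked numerically as well. Below, both documents have the same length and the same total number of query-term occurrences, but Doc 1 favors the rarer term; the df values are illustrative choices that make idf(SVM) > idf(Tutorial).

```python
import math
from collections import Counter

def pivoted_score(query, doc, N, df, avdl, s=0.2):
    q, d = Counter(query), Counter(doc)
    return sum((1 + math.log(1 + math.log(d[w]))) /
               ((1 - s) + s * len(doc) / avdl) *
               q[w] * math.log((N + 1) / df[w])
               for w in q if d[w])

# TDC setup: equal lengths, equal total query-term mass, but doc1
# shifts one occurrence toward the more discriminative term.
q = ["svm", "tutorial"]
doc1 = ["svm"] * 3 + ["tutorial"] * 2 + ["filler"] * 5
doc2 = ["svm"] * 2 + ["tutorial"] * 3 + ["filler"] * 5
N, avdl = 1000, 10
df = {"svm": 20, "tutorial": 200}            # idf(svm) > idf(tutorial)
assert pivoted_score(q, doc1, N, df, avdl) > pivoted_score(q, doc2, N, df, avdl)
```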
10
Length Normalization Constraints (LNCs)

Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).

• LNC1
Let q be a query.
If for some word $w' \notin q$, $c(w', d_2) = c(w', d_1) + 1$,
but for other words $w$, $c(w, d_2) = c(w, d_1)$,
then $f(d_1, q) \geq f(d_2, q)$.

• LNC2
Let q be a query.
$\forall k > 1$, if $|d_1| = k \cdot |d_2|$ and $c(w, d_1) = k \cdot c(w, d_2)$ for all words $w$,
then $f(d_1, q) \geq f(d_2, q)$.
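Both length constraints can be checked the same way against pivoted normalization, again with synthetic documents and statistics. Note that LNC2 passes here because s is small; as the analytical evaluation on a later slide shows, it holds only conditionally.

```python
import math
from collections import Counter

def pivoted_score(query, doc, N, df, avdl, s=0.2):
    q, d = Counter(query), Counter(doc)
    return sum((1 + math.log(1 + math.log(d[w]))) /
               ((1 - s) + s * len(doc) / avdl) *
               q[w] * math.log((N + 1) / df[w])
               for w in q if d[w])

q = ["retrieval"]
N, df, avdl = 1000, {"retrieval": 50}, 10

# LNC1: d2 is d1 plus one extra non-query word; the longer d2 must not win.
d1 = ["retrieval"] + ["filler"] * 9
d2 = d1 + ["noise"]
assert pivoted_score(q, d1, N, df, avdl) >= pivoted_score(q, d2, N, df, avdl)

# LNC2: d1 is d2 concatenated with itself (k = 2); with small s the
# doubled document is not over-penalized.
d2 = ["retrieval"] + ["filler"] * 9
d1 = d2 * 2
assert pivoted_score(q, d1, N, df, avdl, s=0.2) >= pivoted_score(q, d2, N, df, avdl, s=0.2)
```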
11
TF-LENGTH Constraint (TF-LNC)

TF-LN heuristic: Regularize the interaction of TF and document length.

• TF-LNC
Let q be a query with only one term w.
If $|d_1| = |d_2| + c(w, d_1) - c(w, d_2)$
and $c(w, d_1) > c(w, d_2)$,
then $f(d_1, q) > f(d_2, q)$.
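And the same style of synthetic check for TF-LNC: d1 extends d2 with two additional copies of the query term, so the length difference equals the count difference, and the extra matches must outweigh the length penalty.

```python
import math
from collections import Counter

def pivoted_score(query, doc, N, df, avdl, s=0.2):
    q, d = Counter(query), Counter(doc)
    return sum((1 + math.log(1 + math.log(d[w]))) /
               ((1 - s) + s * len(doc) / avdl) *
               q[w] * math.log((N + 1) / df[w])
               for w in q if d[w])

# TF-LNC setup: |d1| = |d2| + c(w, d1) - c(w, d2).
q = ["retrieval"]
d2 = ["retrieval"] + ["filler"] * 9          # c = 1, |d2| = 10
d1 = d2 + ["retrieval"] * 2                  # c = 3, |d1| = 12
N, df, avdl = 1000, {"retrieval": 50}, 10
assert pivoted_score(q, d1, N, df, avdl) > pivoted_score(q, d2, N, df, avdl)
```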
12
Analytical Evaluation

Retrieval Formula  | TFCs        | TDC         | LNC1        | LNC2        | TF-LNC
-------------------|-------------|-------------|-------------|-------------|------------
Pivoted Norm.      | Yes         | Conditional | Yes         | Conditional | Conditional
Dirichlet Prior    | Yes         | Conditional | Yes         | Conditional | Yes
Okapi (original)   | Conditional | Conditional | Conditional | Conditional | Conditional
Okapi (modified)   | Yes         | Conditional | Yes         | Yes         | Yes
13
Term Discrimination Constraint (TDC)

IDF weighting heuristic: Penalize words that are popular in the collection; give higher weights to discriminative terms.

Query: SVM Tutorial. Assume IDF(SVM) > IDF(Tutorial).

Doc 1: ... SVM ... SVM ... SVM ... Tutorial ... Tutorial ...
Doc 2: ... Tutorial ... SVM ... SVM ... Tutorial ... Tutorial ...

$f(Doc1) > f(Doc2)$
14
Benefits of Constraint Analysis

• Provide an approximate bound for the parameters
  – A constraint may be satisfied only if the parameter is within a particular interval.
• Compare different formulas analytically without experimentation
  – When a formula does not satisfy a constraint, it often indicates non-optimality of the formula.
• Suggest how to improve current retrieval models
  – Violations of constraints may pinpoint where a formula needs to be improved.
15
Benefits 1: Bounding Parameters

• Pivoted Normalization Method: LNC2 implies s < 0.4.

[Figure: parameter sensitivity of s, plotting average precision against s; the optimal s (for average precision) lies below 0.4, matching the bound derived from LNC2.]
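The LNC2 condition for pivoted normalization can be solved for s in closed form, which is where a bound of this kind comes from. The sketch below does so for a few synthetic (k, count, length) configurations; the exact threshold varies case by case, and the paper's analysis aggregates such conditions into the approximate bound s < 0.4.

```python
import math

def tf(c):
    """Pivoted double-log TF transformation, 1 + ln(1 + ln(c))."""
    return 1 + math.log(1 + math.log(c)) if c > 0 else 0.0

def lnc2_bound(k, c, dlen, avdl):
    """Largest s for which pivoted normalization satisfies LNC2 when d1 is
    k concatenated copies of d2 (c query-term occurrences, length dlen).
    Solves tf(k*c)/(1 - s + s*k*dlen/avdl) >= tf(c)/(1 - s + s*dlen/avdl)."""
    r = tf(k * c) / tf(c)                          # TF gain from duplication
    a = dlen / avdl
    denom = (k * a - 1) - r * (a - 1)
    return (r - 1) / denom if denom > 0 else 1.0   # 1.0 = no bound below 1

avdl = 10
for k, c, dlen in [(2, 1, 10), (3, 1, 10), (2, 2, 10), (2, 1, 5)]:
    print(f"k={k}, c={c}, |d2|={dlen}: LNC2 holds iff s <= {lnc2_bound(k, c, dlen, avdl):.2f}")
```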
16
Benefits 2: Analytical Comparison

• Okapi Method

$$f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1+1) \cdot c(w,d)}{k_1\left((1-b) + b\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3+1) \cdot c(w,q)}{k_3 + c(w,q)}$$

The IDF component is negative when df(w) is large, which violates many constraints.

[Figure: average precision as a function of s (Pivoted) or b (Okapi), shown for a keyword query and for a verbose query; Okapi is competitive on the keyword query but falls behind Pivoted on the verbose query.]
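A quick numeric look at the sign problem, with illustrative N and df values: the Okapi IDF factor flips negative once a term appears in more than half of the collection.

```python
import math

N = 1000
for df in (10, 400, 501, 900):
    idf = math.log((N - df + 0.5) / (df + 0.5))
    print(f"df={df:4d}  idf={idf:+.3f}")
# df = 501 and df = 900 yield negative weights, so matching a very common
# query term can lower a document's score and violate the constraints.
```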
17
Benefits 3: Improving Retrieval Formulas

• Modified Okapi Method
Make Okapi satisfy more constraints; expected to help verbose queries. Replace the original IDF component with the pivoted-style $\ln\frac{N+1}{df(w)}$:

$$f(d,q) = \sum_{w \in q \cap d} \ln\frac{N + 1}{df(w)} \cdot \frac{(k_1+1) \cdot c(w,d)}{k_1\left((1-b) + b\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_3+1) \cdot c(w,q)}{k_3 + c(w,q)}$$

[Figure: average precision vs. s or b for a keyword query and a verbose query, comparing Pivoted, Okapi, and Modified Okapi; the modified Okapi closes the gap to Pivoted on the verbose query.]
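Comparing the two IDF components side by side (same illustrative N and df values as before) shows the effect of the modification: the pivoted-style IDF stays positive for every df <= N.

```python
import math

N = 1000
for df in (10, 400, 501, 900):
    original = math.log((N - df + 0.5) / (df + 0.5))   # original Okapi IDF
    modified = math.log((N + 1) / df)                  # modified (pivoted-style) IDF
    print(f"df={df:4d}  original={original:+.3f}  modified={modified:+.3f}")
# Frequent query terms, common in verbose queries, no longer receive
# negative weights under the modified formula.
```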
18
Conclusions and Future Work

• Conclusions
  – Retrieval heuristics can be captured through formally defined constraints.
  – It is possible to evaluate a retrieval formula analytically through constraint analysis.
• Future Work
  – Explore additional necessary heuristics.
  – Apply these constraints to many other retrieval methods.
  – Develop new retrieval formulas through constraint analysis.
19
The End
Thank you!