Diversified Retrieval as Structured Prediction
Redundancy, Diversity, and
Interdependent Document Relevance (IDR ’09)
SIGIR 2009 Workshop
Yisong Yue, Cornell University
Joint work with Thorsten Joachims
Need for Diversity (in IR)
• Ambiguous Queries
  – Different information needs using the same query
  – E.g., "Jaguar"
  – At least one relevant result for each information need
• Learning Queries
  – User interested in "a specific detail or entire breadth of knowledge available" [Swaminathan et al., 2008]
– Results with high information diversity
Optimizing Diversity
• Interest in information retrieval
  – [Carbonell & Goldstein, 1998; Zhai et al., 2003; Zhang et al., 2005; Chen & Karger, 2006; Zhu et al., 2007; Swaminathan et al., 2008]
• Requires modeling inter-document dependencies
  – Impossible under standard independence assumptions
  – E.g., the probability ranking principle
• No consensus on how to measure diversity.
This Talk
• A method for representing and optimizing information coverage
• Discriminative training algorithm
  – Based on structural SVMs
• Appropriate forms of training data
  – Requires sufficient granularity (subtopic labels)
• Empirical evaluation
• Choose top 3 documents
• Individual Relevance: D3 D4 D1
• Pairwise Similarity (MMR): D3 D1 D2
• Best Solution: D3 D1 D5
How to Represent Information?
• Discrete feature space to represent information
  – Decomposed into "nuggets"
• For query q and its candidate documents:
  – All the words (title words, anchor text, etc.)
  – Cluster memberships (topic models / dimensionality reduction)
  – Taxonomy memberships (ODP)
• We will focus on words and title words.
Weighted Word Coverage
• More distinct words = more information
• Weight word importance
• Will work automatically w/o human labels
• Goal: select K documents which collectively cover as many distinct (weighted) words as possible
  – Budgeted max coverage problem (Khuller et al., 1997)
  – Greedy selection yields a (1 - 1/e) approximation bound (see the sketch below)
  – Need to find a good weighting function (the learning problem)
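A minimal sketch of this greedy selection step (illustrative names only, not the released implementation):

```python
# Greedy selection of K documents maximizing total weight of distinct covered words.
def greedy_select(doc_words, word_weight, k):
    """doc_words: dict doc_id -> set of words; word_weight: dict word -> weight."""
    selected, covered = [], set()
    candidates = set(doc_words)
    for _ in range(k):
        best_doc, best_gain = None, 0.0
        for d in candidates:
            # Marginal benefit: weight of words d adds beyond those already covered.
            gain = sum(word_weight.get(w, 0.0) for w in doc_words[d] - covered)
            if gain > best_gain:
                best_doc, best_gain = d, gain
        if best_doc is None:  # no remaining document adds new words
            break
        selected.append(best_doc)
        covered |= doc_words[best_doc]
        candidates.remove(best_doc)
    return selected
```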
Example

Document Word Counts (X = word appears in document):

        V1   V2   V3   V4   V5
  D1              X    X    X
  D2         X         X    X
  D3    X    X    X    X

Word Benefit:

  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Marginal Benefit (greedy selection):

          D1   D2   D3   Best
  Iter 1  12   11   10   D1
  Iter 2  --    2    3   D3
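Running the greedy sketch above on this toy data (assuming greedy_select from the earlier sketch) reproduces the two iterations shown:

```python
# Toy data from the example above (document-word incidence reconstructed
# from the marginal-benefit numbers).
doc_words = {
    "D1": {"V3", "V4", "V5"},        # 3 + 4 + 5 = 12
    "D2": {"V2", "V4", "V5"},        # 2 + 4 + 5 = 11
    "D3": {"V1", "V2", "V3", "V4"},  # 1 + 2 + 3 + 4 = 10
}
word_weight = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}

print(greedy_select(doc_words, word_weight, k=2))  # -> ['D1', 'D3']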
How to Weight Words?
• Not all words are created equal
  – E.g., "the"
• Conditional on the query
  – "computer" is normally fairly informative…
  – …but not for the query "ACM"
• Learn weights based on the candidate set (for a query)
Prior Work
• Essential Pages [Swaminathan et al., 2008]
  – Uses a fixed function of word benefit
  – Depends on word frequency in the candidate set
  – A local version of TF-IDF:
      • Frequent words get low weight (not important for diversity)
      • Rare words get low weight (not representative)
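The deck only gives this intuition; the weighting below is an illustrative stand-in, not the actual Essential Pages benefit function:

```python
# Illustrative stand-in, NOT the Essential Pages formula: a word's weight peaks
# for intermediate document frequency within the candidate set and goes to zero
# for words appearing in (almost) all or (almost) no candidate documents.
def word_weight_from_candidates(docs):
    """docs: list of sets of words (the candidate set for one query)."""
    n = len(docs)
    weights = {}
    for v in set().union(*docs):
        f = sum(v in d for d in docs) / n  # fraction of candidates containing v
        weights[v] = f * (1.0 - f)         # low for very frequent and very rare words
    return weights
```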
Linear Discriminant
• x = (x1,x2,…,xn) - candidate documents
• v – an individual word
• We will use thousands of such features
  φ(v, x) = [ 1[v appears in ≥ 10% of x],
              1[v appears in ≥ 20% of x],
              …,
              1[v appears in ≥ 10% of titles in x],
              … ]
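A sketch of such threshold features; the cutoffs and the title heuristic here are illustrative assumptions:

```python
# Per-word feature vector phi(v, x) built from document-frequency thresholds
# over the candidate set and its titles.
def word_features(v, docs, titles, cutoffs=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """docs/titles: lists of word sets, one per candidate document in x."""
    n = len(docs)
    doc_frac = sum(v in d for d in docs) / n
    title_frac = sum(v in t for t in titles) / n
    feats = [1.0 if doc_frac >= c else 0.0 for c in cutoffs]
    feats += [1.0 if title_frac >= c else 0.0 for c in cutoffs]
    return feats
```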
Linear Discriminant
• x = (x1,x2,…,xn) - candidate documents
• y – subset of x (the prediction)
• V(y) – union of words from documents in y
• Discriminant function:

    w^T Ψ(x, y) = Σ_{v ∈ V(y)} w^T φ(v, x)

• Benefit of covering word v is then w^T φ(v, x)
• Prediction:

    ŷ = argmax_y w^T Ψ(x, y)
Linear Discriminant
• Does NOT reward redundancy
  – Benefit of each word is only counted once
• Greedy selection has a (1 - 1/e) approximation bound
• Linear (joint feature space)
  – Suitable for SVM optimization
    w^T Ψ(x, y) = Σ_{v ∈ V(y)} w^T φ(v, x)
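A sketch of this discriminant, reusing word_features from the earlier sketch; the prediction step plugs the learned per-word benefit into the greedy coverage routine:

```python
def word_benefit(w, v, docs, titles):
    """Learned benefit of covering word v: w^T phi(v, x)."""
    return sum(wi * fi for wi, fi in zip(w, word_features(v, docs, titles)))

def discriminant(w, selected, docs, titles):
    """w^T Psi(x, y): each covered word counted once, so redundancy is not rewarded."""
    covered = set().union(*selected) if selected else set()
    return sum(word_benefit(w, v, docs, titles) for v in covered)

# Prediction y_hat = argmax_y w^T Psi(x, y) is approximated greedily, e.g. by
# calling greedy_select with word_weight[v] = word_benefit(w, v, docs, titles).
```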
More Sophisticated Discriminant
• Documents "cover" words to different degrees
  – A document with 5 copies of "Thorsten" might cover it better than another document with only 2 copies.
• Use multiple word sets, V1(y), V2(y), … , VL(y)
• Each Vi(y) contains only words satisfying certain importance criteria.
• Requires more sophisticated joint feature map.
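One illustrative way to build such importance levels; the criterion used here (term-frequency cutoffs) is an assumption, not the paper's exact definition:

```python
# Construct level-based word sets V_1(y), ..., V_L(y) for the selected documents.
def covered_word_sets(selected_doc_tfs, levels=(1, 2, 5, 10)):
    """selected_doc_tfs: list of {word: term frequency} dicts for the chosen docs y."""
    word_sets = {}
    for i, cutoff in enumerate(levels, start=1):
        # V_i(y): words some selected document contains at least `cutoff` times.
        word_sets[i] = {w for tf in selected_doc_tfs for w, c in tf.items() if c >= cutoff}
    return word_sets
```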
Conventional SVMs
• Input: x (high dimensional point)
• Target: y (either +1 or -1)
• Prediction: sign(wTx)
• Training:

    argmin_{w, ξ ≥ 0}   (1/2) w^T w + (C/N) Σ_i ξ_i

  subject to:

    ∀i:   y_i (w^T x_i) ≥ 1 − ξ_i

• The sum of slacks Σ_i ξ_i upper bounds the accuracy loss
Structural SVM Formulation
• Input: x (candidate set of documents)
• Target: y (subset of x of size K)
• Same objective function:

    argmin_{w, ξ ≥ 0}   (1/2) w^T w + (C/N) Σ_i ξ_i

• Constraints for each incorrect labeling y′:

    ∀y′:   w^T Ψ(x, y) ≥ w^T Ψ(x, y′) + Δ(y′) − ξ

  – Score of the best y must be at least as large as that of any incorrect y′ plus its loss
• Requires a new training algorithm [Tsochantaridis et al., 2005] (sketched below)
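A sketch of the separation oracle used during cutting-plane training: find the labeling y′ that most violates the current constraints by maximizing score plus loss, here via loss-augmented greedy selection (the helper arguments are illustrative):

```python
# Loss-augmented inference: greedily build the y' maximizing score(y') + loss(y').
# `score_gain` and `loss_gain` are assumed helpers returning marginal increases.
def most_violated(candidate_docs, score_gain, loss_gain, k):
    selected = []
    remaining = list(candidate_docs)
    for _ in range(k):
        best = max(remaining, key=lambda d: score_gain(d, selected) + loss_gain(d, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```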
Weighted Subtopic Loss
• Example:
  – x1 covers t1
– x2 covers t1,t2,t3
– x3 covers t1,t3
• Motivation
  – Higher penalty for not covering popular subtopics
  – Mitigates effects of label noise in tail subtopics
        # Docs   Loss
  t1    3        1/2
  t2    1        1/6
  t3    2        1/3
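A sketch consistent with the table above, where an uncovered subtopic costs its share of the document-level subtopic labels:

```python
def weighted_subtopic_loss(selected_docs, doc_subtopics):
    """selected_docs: iterable of doc ids; doc_subtopics: dict doc -> set of subtopics."""
    # Weight of a subtopic = (# documents labeled with it) / (total # labels).
    counts = {}
    for topics in doc_subtopics.values():
        for t in topics:
            counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    covered = set().union(*(doc_subtopics[d] for d in selected_docs)) if selected_docs else set()
    return sum(c / total for t, c in counts.items() if t not in covered)

doc_subtopics = {"x1": {"t1"}, "x2": {"t1", "t2", "t3"}, "x3": {"t1", "t3"}}
print(weighted_subtopic_loss(["x1"], doc_subtopics))  # misses t2 and t3 -> 1/6 + 1/3 = 0.5
```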
Diversity Training Data
• TREC 6-8 Interactive Track
  – Queries with explicitly labeled subtopics
  – E.g., "Use of robots in the world today"
      • Nanorobots
      • Space mission robots
      • Underwater robots
– Manual partitioning of the total information regarding a query
Experiments
• TREC 6-8 Interactive Track queries
• Documents labeled into subtopics
• 17 queries used
  – Considered only relevant docs
  – Decouples the relevance problem from the diversity problem
• 45 docs/query, 20 subtopics/query, 300 words/doc
• Trained using leave-one-out (LOO) cross validation
• TREC 6-8 Interactive Track
• Retrieving 5 documents

  Method            Loss
  Random            0.469
  Okapi             0.472
  Unweighted        0.471
  Essential Pages   0.434
  SVM-div           0.349
Can expect further benefit from having more training data.
Moving Forward
• Larger datasets
  – Evaluate relevance & diversity jointly
• Different types of training data
  – Our framework can define loss in different ways
  – Can we leverage clickthrough data?
• Different feature representations
  – Build on top of topic modeling approaches?
  – Can we incorporate hierarchical retrieval?
References & Code/Data
• "Predicting Diverse Subsets Using Structural SVMs"
  – [Yue & Joachims, ICML 2008]
• Source code and dataset available online
  – http://projects.yisongyue.com/svmdiv/
• Work supported by NSF IIS-0713483, a Microsoft Fellowship, and a Yahoo! KTC Grant.