
Robust Ranking Models via Risk-Sensitive Optimization

Lidan Wang
Dept. of Computer Science, University of Maryland, College Park, MD
[email protected]

Paul N. Bennett
Microsoft Research, One Microsoft Way, Redmond, USA
[email protected]

Kevyn Collins-Thompson
Microsoft Research, One Microsoft Way, Redmond, USA
[email protected]

ABSTRACT
Many techniques for improving search result quality have been proposed. Typically, these techniques increase average effectiveness by devising advanced ranking features and/or by developing sophisticated learning to rank algorithms. However, while these approaches typically improve average performance of search results relative to simple baselines, they often ignore the important issue of robustness. That is, although achieving an average gain overall, the new models often hurt performance on many queries. This limits their application in real-world retrieval scenarios. Given that a lack of robustness can negatively impact user satisfaction, we present a unified framework for jointly optimizing effectiveness and robustness. We propose an objective that captures the tradeoff between these two competing measures and demonstrate how we can jointly optimize for them in a principled learning framework. Experiments indicate that ranking models learned this way significantly decrease the worst ranking failures while maintaining strong average effectiveness on par with current state-of-the-art models.

Categories and Subject Descriptors
H.3.3 [Information Retrieval]: Retrieval Models

Keywords
Re-ranking, robust algorithms, machine learning

1. INTRODUCTION
Many approaches for learning highly effective ranking models for search have been proposed in recent years [19, 5, 25]. Most commonly, they apply well-developed machine learning algorithms to construct ranking models from training data by optimizing a given IR metric. While the learned ranking models can be highly effective, these approaches have mostly ignored the important issue of robustness: in addition to performing well on average, the model should, with high probability, not perform very poorly for any individual query. Often effectiveness and robustness are competing forces that counteract each other: models optimized for effectiveness alone may not meet the strict robustness requirement for individual queries.

This paper introduces learning robust ranking models for search via risk-sensitive optimization, by exploiting and optimizing the tradeoff between model effectiveness and robustness. At a basic level, this framework learns ranking models whose effectiveness and robustness can be explicitly controlled. While effectiveness and robustness can be thought of in terms of absolute performance, one can specialize these concepts to the case where we desire to perform well relative to a particular baseline, for example a personalized system vs. the non-personalized baseline, query expansion vs. no expansion, or a learned model vs. a retrieval formula. In this case, we can specialize the notions of effectiveness and robustness to say that we wish to have high average gain relative to the baseline (i.e., positive change in performance) while at the same time incurring low risk relative to the baseline (i.e., low probability of performing worse than the baseline for any particular query).

We develop a unified learning to rank framework for jointly optimizing both gain and risk by introducing a novel tradeoff metric that decomposes gain into reward (upside) and risk (downside) relative to a baseline. While this objective could be integrated into many learning models, we show how to extend and generalize a state-of-the-art algorithm, LambdaMart [6], to optimize the new objective. We empirically demonstrate that models learned this way outperform both a selective personalization approach and standard learning to rank models in terms of achieving an optimal balance between gain and risk. Moreover, our results also provide important insights into why our learned models are more robust, in terms of the learning convergence properties of the new model as compared to standard effectiveness-centric models.

2. RELATED WORK
Multi-objective learning to rank has received much attention in recent years [31, 13]. This is because, in many practical settings, the performance of a ranking model is evaluated by multiple measures of interest (e.g., NDCG, MAP, freshness, efficiency). Multi-objective learning to rank trains ranking models to simultaneously optimize multiple IR measures. Svore et al. [31] use a standard web relevance measure, NDCG, as the primary optimization metric and a relevance measure derived from click data as the secondary metric; the performance of the primary measure is maintained constant while the algorithm tries to improve the secondary measure. Dai et al. [13] develop a multi-objective learning to rank algorithm for freshness and relevance by extending a state-of-the-art divide-and-conquer ranking approach [3]. The key difference between these works and ours is that we treat both risk and reward, which are potentially competing measures, as first-class metrics during optimization, unlike the "tiered" approach [31]. More importantly, our risk measure is inherently different from previous multi-objective learning to rank metrics: instead of capturing just another facet of web search [31, 13], risk evaluates the fitness of the ranking model with respect to a baseline model under a specific metric. In other words, risk is orthogonal to the previous multi-objective metrics; it is defined on top of these metrics with respect to a baseline model.

Although other risk-aware algorithms have been proposed for ad-hoc retrieval [35, 39] and query expansion [26, 10], our problem differs from them in three important dimensions. First, we focus on designing new learning to rank algorithms to create robust state-of-the-art ranking models using boosted regression trees, rather than focusing on risk-minimization for language models or simple term-based models [35, 39, 34]. Second, unlike [10, 26], the source of risk in our setting comes from the fact that users and queries are diverse, so that a single model cannot serve all users or queries well. Thus we design a new risk-sensitive algorithm that effectively leverages query-dependent features to build models that are highly robust and effective for individual queries. Finally, these previous problems represent particular strategies for ranking, hence their solutions cannot be trivially extended to the more general problem of learning robust ranking models for large-scale retrieval.

Risk can be indirectly addressed by building query-dependent models that better adapt to individual queries. This line of work has resulted in recent papers that consider query-specific loss functions [4], where queries of different types (e.g., navigational, informational) are optimized with different loss functions. Geng et al. [18] proposed a k-Nearest Neighbor based method which trains a query-dependent ranking function for each query based on its nearest neighbors in the training set. Bian et al. [3] proposed a clustering-based divide-and-conquer approach for building query-dependent ranking models. However, it is important to note that query-dependent models are not perfect: just like any other models, query-dependent models incur risk (e.g., reduced performance for some queries and gains for others), and this fact has been largely ignored as well. In a sense, our proposed framework generalizes and complements this thread of work by learning risk-sensitive query-dependent models.

There has been a great deal of research devoted to developing effective retrieval models with high average retrieval quality. This has given rise to a steady stream of techniques for effective ranked retrieval. Examples include learning to rank [19, 5, 25], learning to re-rank [22], numerous term proximity models [27, 8, 32], and search personalization [1, 36, 11]. However, unlike our work, they ignore the important issue of model robustness: while many queries see performance improvements, other queries are often hurt by the new and complex models as compared to simple alternatives such as BM25 [29] and language models for information retrieval [28].

Our work also has connections to query difficulty and performance prediction for Web search [38], as well as selective personalization [33]. These techniques mitigate risk in re-ranking by using a query performance prediction method to selectively re-rank queries whose baseline effectiveness is expected to be low, and to avoid re-ranking queries where the baseline effectiveness is already high. We note that these techniques are robustness-centric, which may lead to overly reduced effectiveness due to unreliable decisions in applying the model. In Section 5 we compare our approach to the use of selective re-ranking using performance variables.

It seems reasonable to assume that users tend to remember the singular, spectacular failures of a search engine rather than the many successful searches that preceded them. However, we are not aware of many user studies that have looked at the effect of IR robustness (or the lack thereof) on users' perceptions of system quality. One exception is a recent user study of recommender systems [23] that showed how optimizing the accuracy of a system is not enough: among other interesting findings, they found that high-variance recommendation algorithms can create a bad user experience. Informally, significant errors can have a large negative impact on the perceived quality of the system, even if the system frequently does well.

Finally, the term robust ranking has been used previously in the literature, but with somewhat different meanings. For example, Li et al. [24] call ranking functions 'robust' when they account for volatility in ranker scores, and thus in document ordering, over time, by taking into account the distribution of ranking scores over results. Others have used 'robust ranking' to refer to ranking algorithms that are less vulnerable to spam or noise in the training data [2].

3. MOTIVATION
Before proceeding, it is important to identify sources of uncertainty that contribute to low robustness in ranking models. In most search settings, and especially Web search, both users and information needs are highly diverse; thus, it is unlikely that a common set of ranking features and the same model learned from these features can work optimally for all users and queries. For instance, it has been shown that queries with low click entropies typically will not benefit from search personalization [1, 36, 11], and queries with low query clarity will not benefit as much from query expansion techniques [16]. Ignoring these facts during model training leads to a highly risky model. Although previous work has studied whether and when a learned model should be applied to a given query based on the aforementioned query-specific meta-features [33], the average effectiveness that results from the robust application of the ranking model is generally ignored, which means robustness is typically improved but at an overly aggressive cost in reduced average effectiveness. We take a principled learning to rank approach that directly optimizes multiple requirements.

For instance, Figure 1 illustrates the improvements and degradations of a proposed model (tuned for average effectiveness) with respect to a baseline ranking across queries. The plot shows the high variance of the new results: the performance of a number of queries is improved (right side) at the expense of degraded results for other queries (left side). Directly optimizing for variance is a challenging task: variance is inherently a set-based measure defined on the distribution of result quality over queries, but typically, learning to rank algorithms are not directly applicable to optimizing set-based measures. We instead adopt a simple approach suggested by this figure.

Figure 1: Typical helped-hurt histogram of re-ranking gains and losses compared to a given baseline ranking. (X-axis: change in NDCG over the baseline ranking, in 10-point buckets from [-100,-90) to [90,100); y-axis: number of queries per bucket.)

Informally, when compared to an alternative or baseline ranking, the probability of decreasing performance is related to the mass on the left-hand side of the figure. In fact, the expectation of the left-hand side, which we call risk, is the amount by which we can expect to decrease retrieval quality on average if we pick a query at random. Likewise, the expectation of the right-hand side, which we call reward, is the amount by which we can expect to improve retrieval quality on average. By defining each average relative to the total number of queries, rather than dividing reward by the number of improved queries and risk by the number of hurt queries, we control for issues of coverage. Typical methods optimize overall performance, Reward − Risk + Baseline, but since the baseline is fixed from the optimization point of view, this is equivalent to optimizing the difference, Reward − Risk. In order to control the risk, or probability of harm, this suggests introducing a user-tunable weight on the risk term. This is precisely what we do. In the next section, we formalize this intuition before demonstrating how to optimize the objective in practice.

4. A LEARNING TO RANK APPROACH FOR CREATING ROBUST MODELS

4.1 Problem setting
In this section, we present the formal setup for learning robust ranking models for Web search. Given a baseline ranking M_b, we would like to construct a new ranking model M_m with high average effectiveness across queries, without degrading the performance of individual queries too much with respect to M_b. The exact instantiation of the baseline M_b depends on the specific application of our general framework. For personalization, M_b might be the initial non-personalized ranking (e.g., optimized for the average user). In a production environment, M_b might represent the results from an earlier release of the ranker. In learning to rank, M_b might be a simpler model, such as BM25, that is known to be effective on a particular subset of queries. Our framework is applicable to these and any other ranking-based applications where a risk/reward tradeoff may exist.

4.2 Measuring risk and reward
One way to measure risk is by measuring the variance of the new ranking model with respect to baseline results. We instead adopt a variant of this approach, by introducing a risk function F_{RISK}(Q_T, M_b) to capture characteristics of the variance with respect to model M_b over a training set Q_T. We are interested in the downside risk of the new model, defined as the average reduction in effectiveness due to using the new model M_m compared to the baseline model M_b:¹

F_{RISK}(Q_T, M_b) = \frac{1}{N} \sum_{Q \in Q_T} \max\,[0,\; M_b(Q) - M_m(Q)]   (1)

where N denotes the total number of queries in the training set Q_T, and M_b(Q) and M_m(Q) denote the effectiveness of the baseline model and the new model for a given query Q, respectively. We note that effectiveness can be measured by any commonly-used IR metric (e.g., precision@k, recall, average precision, or NDCG@k). Thus, the downside risk accounts for the negative aspect of retrieval performance across queries.

Similarly, we define reward in relative terms as the positive improvement in effectiveness over the baseline model M_b, averaged across all queries in a training set Q_T:

F_{REWARD}(Q_T, M_b) = \frac{1}{N} \sum_{Q \in Q_T} \max\,[0,\; M_m(Q) - M_b(Q)]   (2)

where the notations N, M_b, and M_m are defined as before. Thus, the reward function accounts for the positive aspects of retrieval performance across queries. For any arbitrary baseline M_b, the overall gain for training set Q_T can then be written:

F_{GAIN}(Q_T) = F_{REWARD}(Q_T, M_b) - F_{RISK}(Q_T, M_b).   (3)

Our goal is to automatically learn ranking models that better adapt to individual queries by optimally trading off between risk and reward. However, before we can learn such a well-balanced model, we must define a new metric that captures the tradeoff. Our objective function, which we call the Risk-Reward tradeoff T, is defined as a weighted linear combination of gain (which already incorporates risk and reward) and risk:

T_\alpha(Q_T, M_b) = F_{GAIN}(Q_T) - \alpha \cdot F_{RISK}(Q_T, M_b)   (4)
                   = F_{REWARD}(Q_T, M_b) - (1 + \alpha) \cdot F_{RISK}(Q_T, M_b)

where α ≥ 0 is a key risk-sensitivity parameter that controls the tradeoff between risk and reward. We chose this formulation so that the case α = 0 corresponds to the standard learning-to-rank objective of maximizing average F_{GAIN} alone. Note that T is not a single measure but a family of measures, parameterized by the tradeoff parameter α. By varying α we can capture a full spectrum of risk/reward tradeoffs, from α = 0 (the standard default ranking, with higher gain) to larger values of α that lead to lower-risk models.
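To make Eqs. 1-4 concrete, the following is a minimal sketch (ours, not the authors' implementation; the array names and toy numbers are illustrative) of how risk, reward, gain, and T_α can be computed from per-query effectiveness scores of the baseline and the new model.

```python
import numpy as np

def risk_reward_tradeoff(mb, mm, alpha=0.0):
    """mb, mm: per-query effectiveness (e.g. NDCG@10) of baseline and new model."""
    mb, mm = np.asarray(mb, float), np.asarray(mm, float)
    risk = np.maximum(0.0, mb - mm).mean()      # Eq. 1: average downside loss
    reward = np.maximum(0.0, mm - mb).mean()    # Eq. 2: average upside gain
    gain = reward - risk                        # Eq. 3
    t_alpha = gain - alpha * risk               # Eq. 4: reward - (1 + alpha) * risk
    return risk, reward, gain, t_alpha

# Example: three queries helped, one hurt relative to the baseline.
print(risk_reward_tradeoff([0.30, 0.50, 0.40, 0.60],
                           [0.45, 0.55, 0.35, 0.70], alpha=5.0))
```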

The most common approach to optimizing multiple objectives [30] is to linearly combine the objectives, as we have done with the risk and reward functions. While many other ways of combining metrics exist, such as multiplication, division, or geometric or harmonic means, optimizing an arbitrary objective can be problematic in many learning algorithms. In contrast, linearly combined objectives are often easily optimized. For our purposes, we will extend the objective of the LambdaMart algorithm [37]. In Sec. 4.3, we prove that our tradeoff measure T satisfies a desirable consistency property [6].

¹For simplicity of notation, in the remainder of the paper M_b will refer to both the baseline model and the effectiveness of the baseline model. A similar convention is used for M_m.

4.3 Constructing robust ranking models
In this section, we describe how to learn models that directly optimize the tradeoff metric T_α proposed in the previous section. The basic assumption is that queries are different, thus requiring potentially different ranking functions to be individually effective. Previous query-dependent models [4, 18, 3] ignore risk. We devise learning to rank algorithms that directly optimize the tradeoff metric in the context of the LambdaMart boosted regression tree framework [37].

4.3.1 Model and optimization algorithm
We choose to extend LambdaMart [37] for our ranking model. Boosted regression trees have been shown to be one of the best learning to rank models. In the recent Yahoo! Learning to Rank Challenge [9], boosting (and ensemble methods) have been a dominant technique used by winning entries. Boosted regression trees are a type of non-linear model that can capture the potentially complex interactions between various types of ranking features.

LambdaMart is derived from the tree-boosting optimization MART [17] and the listwise LambdaRank [7]. LambdaMart improves an effectiveness metric by fitting a sequence of boosted regression trees to an approximation of the derivative of a cost function, obtained by accumulating pairwise gradients weighted by the total change from the current ranking to a desired ranking.

We now present a brief overview of the LambdaMart objective function, as it serves as a foundation for deriving the learning algorithm for the tradeoff metric. The objective function of list-wise LambdaMart is based on the cross-entropy objective function used by RankNet [5]. Let r_i denote the relevance grade of a document D_i, so that R_{ij} = 1 if r_i > r_j, and R_{ij} = -1 when r_j > r_i. Let s_i denote the score assigned to D_i by the model. For documents D_i and D_j, the cross-entropy cost is expressed as follows:

C = \frac{1}{2}(1 - R_{ij})\,\beta(s_i - s_j) + \log\left(1 + e^{-\beta (s_i - s_j)}\right)   (5)

where β is a shape parameter for the sigmoid function. This cost function assigns zero penalty if two documents are correctly ranked and, asymptotically, a linear penalty if they are ranked incorrectly. The derivative of C with respect to the score difference between s_i and s_j is:

\lambda_{ij} = \beta\left(\frac{1}{2}(1 - R_{ij}) - \frac{1}{1 + e^{\beta (s_i - s_j)}}\right)   (6)

LambdaRank [7] and LambdaMart [37] modify this gradient, defined on a pairwise loss, to optimize list-wise IR measures. The new gradient \lambda^{new}_{ij} captures the desired change of document scores with respect to the IR metric after the documents have been sorted by their scores, in addition to capturing the original \lambda_{ij} value. The new gradient under the list-wise optimization is simply defined to be \lambda_{ij} multiplied by the absolute change in the IR measure M due to swapping documents i and j:

\lambda^{new}_{ij} = \lambda_{ij}\,|\Delta M_{ij}|   (7)

The gradient for each document D_i is obtained by summing \lambda_{ij} over all pairs that D_i participates in for query Q:

\lambda^{new}_{i} = \sum_{j \neq i} \lambda_{ij}\,|\Delta M_{ij}|   (8)
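As a rough illustration of Eqs. 6-8 (our own sketch, not the paper's code), the loop below accumulates the pairwise gradients, weighted by the metric change from swapping two documents, into one lambda per document. The helper delta_metric(i, j), returning |ΔM_ij| for the chosen IR measure, is an assumed input; the sign convention follows Eq. 6 literally, and whether these values are ascended or descended depends on the surrounding boosting setup.

```python
import math

def lambda_gradients(scores, relevances, delta_metric, beta=1.0):
    """scores, relevances: per-document lists for one query."""
    n = len(scores)
    lambdas = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j or relevances[i] <= relevances[j]:
                continue                      # only pairs where doc i is more relevant
            # Eq. 6 with R_ij = 1 (doc i more relevant than doc j)
            lam = -beta / (1.0 + math.exp(beta * (scores[i] - scores[j])))
            lam *= abs(delta_metric(i, j))    # Eqs. 7-8: weight by |delta M_ij|
            lambdas[i] += lam                 # contribution to the more relevant doc
            lambdas[j] -= lam                 # opposite contribution to the other doc
    return lambdas                            # per-document gradients, Eq. 8
```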

The gradients \lambda^{new}_{i} defined on each document for each query are modeled and optimized by LambdaMart regression trees. In principle, LambdaMart can be extended to optimize any standard IR metric by simply replacing \Delta M_{ij} in Eq. 8 with the corresponding change \Delta M^{*}_{ij} in the optimized metric M^{*}. However, it is generally considered desirable for M^{*} to satisfy the consistency property: any pairwise swap between correctly ranked documents D_i and D_j must lead to a decrease in M^{*}, i.e., \Delta M^{*}_{ij} < 0 if r_i > r_j. Generally, consistency is desired since LambdaMart's approximation to the overall gradient is derived from the pairwise gradients.

Unlike standard IR metrics, the tradeoff measure is defined relative to a baseline ranking, and it is a meta-measure: a combination of multiple measures. Despite its importance, little existing literature has provided insight into how to optimize meta-measures in a principled way. Next, we prove the consistency property of the tradeoff metric T and show how it can be optimized by boosted regression trees.

4.3.2 Consistency of the Risk-Reward Tradeoff T_α
While LambdaMart is a list-wise optimization scheme and can optimize a wide range of IR measures, including NDCG, MAP, and MRR, it is often desirable for an objective measure to satisfy the consistency property [6] stated above: when swapping the ranked positions of two documents, d_i and d_j, in a sorted document list where d_i is more relevant than d_j and ranks before d_j, the objective should decrease. While most IR measures easily satisfy this property, in general an arbitrary objective measure M will not always satisfy the consistency property. For example, suppose measures A and B are NDCG@5 and NDCG@10, respectively, and we define a meta-measure M = A − B. It is easy to see that M does not satisfy the consistency property: as a correct swap is made, the gain in NDCG@5 may be offset by the gain in NDCG@10, which can result in a negative change for the overall measure M, making M inconsistent. In general, it is not straightforward to see whether a meta-measure, defined as a combination of different measures, ensures the consistency property.

We now show that the tradeoff metric T, defined as a weighted linear combination of risk and reward in Eq. 4, satisfies the consistency property.

Theorem 1. The tradeoff metric T in Eq. 4 satisfies the consistency property.

Proof. Let M_m and M_b be the effectiveness of the LambdaMart model and the baseline model, respectively. After swapping documents d_i and d_j, denote the resulting change in the tradeoff by ∆T, the change in effectiveness by ∆M, the change in reward by ∆γ, and the change in risk by ∆σ (with ∆σ oriented so that a reduction in risk is positive, i.e., ∆T = ∆γ + (1 + α) · ∆σ). Let Rel(d) be the relevance of document d. To show T is consistent, we show that swapping documents d_i and d_j, where d_i is more relevant than d_j and ranks before d_j, results in a decrease of T, or equivalently ∆T < 0. We consider two scenarios as new trees are added to the current LambdaMart ensemble: 1) M_m ≤ M_b and 2) M_m > M_b. We show T is consistent in both cases.

Scenario A: M_m ≤ M_b

Case A1) Swap d_i and d_j, where Rel(d_i) > Rel(d_j) and i < j. In this case ∆γ = 0 and ∆σ = ∆M < 0. Thus ∆T = (1 + α) · ∆M < 0, as desired.

Case A2) Swap d_i and d_j, where Rel(d_i) > Rel(d_j) and i > j. Then there are two subcases:
  If M_b > M_m + ∆M, then ∆γ = 0, ∆σ = ∆M, and ∆T = (1 + α) · ∆M > 0.
  If M_b ≤ M_m + ∆M, then ∆γ = M_m + ∆M − M_b, ∆σ = M_b − M_m, and ∆T = α · (M_b − M_m) + ∆M.
In both subcases ∆T > 0, as desired.

Scenario B: M_m > M_b

Case B1) Swap d_i and d_j, where Rel(d_i) > Rel(d_j) and i < j. Then there are two subcases:
  If M_b > M_m − |∆M|, then ∆σ = −M_b + (M_m − |∆M|) < 0, ∆γ = M_b − M_m < 0, and ∆T = α · (M_m − M_b) − (1 + α) · |∆M|.
  If M_b ≤ M_m − |∆M|, then ∆σ = 0 and ∆T = ∆M.
In both subcases ∆T < 0, as desired.

Case B2) Swap d_i and d_j, where Rel(d_i) > Rel(d_j) and i > j. There is no risk and ∆M > 0, so ∆T > 0, as desired.

Thus, in all cases the tradeoff measure T satisfies the required consistency property.

We note that NDCG, MRR, and MAP (Mean Average Precision) have all been shown to satisfy consistency [14], and thus Theorem 1 applies generally to risk-reward tradeoff measures based on these widely-used IR measures, as well as any others that have been proven consistent.

4.3.3 Risk-sensitive LambdaMart optimization
The details of the proof of Theorem 1 above also tell us how to compute the ∆T's used in the LambdaMart gradients for optimizing T. We observe that by optimizing the tradeoff metric, LambdaMart puts more emphasis on the "risky" queries, by modifying the gradient of such queries so that incorrect results for risky queries hurt the tradeoff metric more, and the algorithm learns to optimally utilize query-dependent features in order to be more risk-averse. To illustrate this point, when the current model's effectiveness is less than the baseline's and a correct swap is made (i.e., Case A2), the optimizer assigns a larger positive change to the tradeoff if the swap erases the performance difference between the model and the baseline (i.e., ∆T = α · (M_b − M_m) + ∆M > (1 + α) · (M_b − M_m)); otherwise, it assigns a smaller positive tradeoff change to the query (∆T = (1 + α) · ∆M < (1 + α) · (M_b − M_m)). As another example, in the case where the current model has better effectiveness than the baseline and two documents are incorrectly swapped (i.e., Case B1), the algorithm assigns a non-zero risk to the query if the swap causes the model to degrade below the baseline effectiveness; otherwise, it assigns no penalty. Such risk-sensitive optimization is in sharp contrast to the standard LambdaMart algorithm, which treats all queries identically, without paying extra attention to high-risk queries. Thus, our risk-sensitive framework generalizes standard learning-to-rank algorithms like LambdaMart that only optimize for average effectiveness gain without accounting for risk.
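The case analysis above can be summarized in a single expression. The sketch below (our notation, not the authors' implementation) computes the per-query change ∆T for a candidate swap: Mm and Mb are the current model's and baseline's effectiveness before the swap, and dM is the change the swap would cause in the raw IR measure.

```python
def delta_tradeoff(Mm, Mb, dM, alpha):
    reward_before = max(0.0, Mm - Mb)
    risk_before = max(0.0, Mb - Mm)
    reward_after = max(0.0, (Mm + dM) - Mb)
    risk_after = max(0.0, Mb - (Mm + dM))
    # delta T = delta reward - (1 + alpha) * delta risk, i.e. Eq. 4 applied per query;
    # this reproduces the Theorem 1 cases, e.g. Case A2 gives
    # delta T = alpha * (Mb - Mm) + dM when the swap lifts the model above the baseline.
    return (reward_after - reward_before) - (1.0 + alpha) * (risk_after - risk_before)
```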

5. EXPERIMENTS
We now show the benefits of augmenting the standard average-gain maximization ranking objective with our risk-sensitive objective. We emphasize that our key learning goal is not to maximize NDCG gains alone, or to minimize variance alone, but to achieve the best possible joint tradeoff between change in risk and change in reward. The ability to control such tradeoffs can be important in operational decisions, and in giving a better picture of the achievable performance space of a ranking algorithm. We also analyze the effect of the risk-aversion parameter α on the convergence rate of the learning algorithm and on the characteristics of risk-averse rankings.

5.1 Datasets
We report results using two datasets, one public and one proprietary. These datasets were chosen to illustrate different applications of risk-sensitive ranking for different types of queries and Web applications. For clarity, we will use "baselines" to refer to the models used to relativize loss in our risk-reward optimization. We will call the model without any risk sensitivity (i.e., α = 0) the gain-only model.

5.1.1 MSLR learning-to-rank dataset
We use MSLR-WEB10K, a widely-used, publicly released dataset from Microsoft Research². MSLR is a sampling of 10,000 Web search queries and their results from a major commercial search engine that includes graded relevance judgments for supervised learning-to-rank experiments. This dataset is pre-partitioned into five folds, each with train, test, and validation subsets. We use and report results using these same folds in our experiments. We report NDCG@1 and NDCG@10 since both measures are of importance in Web search evaluation. Since our optimization is performed relative to a baseline, we have taken the ranking obtained from sorting by BM25 using the BM25.whole.document feature as the baseline. Thus, on this dataset, the goal is to make a more robust tradeoff relative to a well-motivated IR formula for ranking.

5.1.2 Location personalization dataset
We use a proprietary dataset for learning to personalize Web results by user location, with details and baseline results as described in [1]. That work introduced a methodology for automatically learning a location distribution to associate with a search result, with the goal of promoting results when the user's search location matches the search result's location distribution. The data consists of logs from a competitive top commercial search engine and labels based on a single binary last satisfied click by the user per query impression as an implicit relevance judgment. We follow the same methodology as [1] and report average change across test folds in MRR relative to the search engine's original ranking for the last satisfied click. Thus, on this dataset, the goal is to learn both how to personalize and when to personalize simultaneously.

We next define some key evaluation terms used in our experiments.

²research.microsoft.com/en-us/projects/mslr/


5.2 Evaluation measures
A retrieval quality measure for a ranked result list is simply a standard statistic that measures some aspect of relevance. The retrieval quality measures we use to report results on the Web datasets in our study are Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR).

Given a baseline ranking, the reward of using a retrieval algorithm over a set of queries is the total positive change in retrieval quality compared to the baseline ranking, divided by the total number of queries (Eq. 2). The risk of using a retrieval algorithm is the total negative change (loss) in retrieval quality compared to the baseline ranking, divided by the total number of queries (Eq. 1). Visually, risk corresponds to the total mass to the left of the vertical zero line in Figure 1, normalized by the total number of queries, while reward is the total mass on the right half, normalized by the total number of queries. We call the overall gain, or simply gain, of an algorithm the average combined gain over all queries, so that Gain = Reward − Risk. There is typically a risk-reward tradeoff for retrieval algorithms [10].

Taken together, the (risk, gain) tradeoff pair provides a more complete picture of a ranker's performance profile than simply reporting average NDCG gain. Often, algorithms that appear similar in NDCG gain can differ greatly in their risk-sensitivity, for example. This is evident in some of our results later in Sec. 5.6.

5.3 Query Performance Selective Strategies
Query-dependent features such as click entropy [15], query clarity [21], query length [12], and the maximum of BM25 scores [20] have been shown to be highly indicative of individual query performance and can be used to differentiate performance between different kinds of queries. An alternative to attempting to learn a model is simply to selectively apply a learned model by leveraging the insights from query performance prediction. We provide two such selective methods as comparison points. For the MSLR dataset, we use insights regarding the maximum BM25 score from [20] and use the max of the BM25.whole.document feature to predict when the baseline (the ranking obtained from the same feature) will perform well. We can then simply sweep out thresholds by using the gain-only learned model when the max BM25 score is low and the BM25 ranking otherwise. We then vary this threshold in analogy to varying the α parameter. We refer to this as maxBM25.

For the location dataset, we also present a selection strategy. For each search result in the location dataset, the entropy of the result's location distribution measures how location-specific a single search result is. When the minimum of this feature over all search results is low, it means that at least one search result has a very location-specific meaning, and therefore location personalization is more likely to succeed. We again vary the threshold, using the personalized gain-only model when the minimum is low and using the baseline original ranking of the search engine otherwise. We refer to this as minEntropy below.
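Both selective strategies follow the same pattern, sketched below under illustrative assumptions (names are ours, not from the paper): re-rank with the gain-only learned model only when the query-performance feature (max BM25 for MSLR, min location entropy for Location) is below a threshold, and keep the baseline ranking otherwise. Sweeping the threshold traces out a risk/gain curve analogous to varying α.

```python
import numpy as np

def selective_tradeoff_curve(feature, eff_model, eff_baseline, thresholds):
    """feature, eff_model, eff_baseline: per-query arrays; returns (threshold, risk, gain)."""
    eff_model = np.asarray(eff_model, float)
    eff_baseline = np.asarray(eff_baseline, float)
    feature = np.asarray(feature, float)
    curve = []
    for t in thresholds:
        use_model = feature < t                        # re-rank only "low-performing" queries
        eff = np.where(use_model, eff_model, eff_baseline)
        risk = np.maximum(0.0, eff_baseline - eff).mean()
        reward = np.maximum(0.0, eff - eff_baseline).mean()
        curve.append((t, risk, reward - risk))
    return curve
```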

5.4 Learning methodology
To ensure all methods have the same information available, both the gain-only models and the risk-sensitive models also have the features (max BM25 for MSLR and minimum entropy for Location) that are used by the selective methods. In the case of MSLR, in order to ensure strong gain-only model performance, we performed sweeps on LambdaMart parameters using the validation sets. On the MSLR data, for α = 0 (i.e., standard optimization), for each fold we used performance on the validation data to do a grid search for parameter settings. The parameter ranges we experimented with were: minimum documents in leaf m ∈ {500, 1000, 1500}; number of leaves l ∈ {10, 25, 50}; number of trees in the ensemble t ∈ {50, 100, 200, 400, 800}; and learning rate r ∈ {0.025, 0.05, 0.07}. For three folds the optimal values were m = 1000, l = 50, t = 800, and r = 0.075, and for the two remaining folds the values were identical except m = 500. For the models learned using α > 0, we used these same parameters to keep other variables constant. For consistency of comparison we used the same parameter settings for the Location dataset experiments as in [1].

5.5 Summary statistics
Tables 1 and 2 show summary retrieval quality and robustness statistics for ranking on the MSLR and Location datasets. The gain over baseline effectiveness is broken down into its constituent Risk and Reward components. (Recall that Gain = Reward − Risk.) Unless otherwise noted, other summary statistics are based on the top 10 results. The Wins/Losses row shows the counts of queries that contributed to the Reward vs. Risk, respectively.³ The 'Loss > 20%' statistic counts queries whose relative NDCG change over baseline NDCG was worse than -20%. Thus, smaller Loss numbers are better.
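The robustness statistics reported in Tables 1 and 2 can be computed directly from per-query effectiveness, as in this illustrative sketch (our code; variable names are assumptions):

```python
import numpy as np

def robustness_stats(eff_model, eff_baseline):
    m, b = np.asarray(eff_model, float), np.asarray(eff_baseline, float)
    diff = m - b
    wins = int((diff > 0).sum())                    # queries contributing to Reward
    losses = int((diff < 0).sum())                  # queries contributing to Risk
    rel_change = diff / np.maximum(b, 1e-9)         # relative change over baseline
    big_losses = int((rel_change < -0.20).sum())    # the 'Loss > 20%' statistic
    return wins, losses, big_losses
```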

Risk, reward, and wins/losses are relative to the baseline ranking. The NDCG@10 score for the BM25 baseline was 30.869. The ∆MRR score for the Location minEntropy baseline is zero, since we report relative gains on that dataset for proprietary reasons.

As expected, there is a clear trade-off between overall retrieval effectiveness and the amount of risk reduction for both collections. As the risk-aversion parameter α is increased, the Wins/Losses row shows a consistent reduction in losses, with a small decline in overall effectiveness (NDCG or ∆MRR). Note in particular that for the Location data, the chance of a query being hurt by more than a 20% loss in ∆MRR compared to the baseline vanishes almost to zero, while still retaining significant overall effectiveness gains.

5.6 Risk-reward tradeoff
Finding a re-ranking strategy that reduces variance compared to the baseline is easy: simply avoid re-ranking and always use the baseline ranking. In this case, the ∆MRR will be zero, with zero risk. However, this obviously loses all the benefits of re-ranking for queries that are actually helped by re-ranking. Our goal is to find optimal tradeoff strategies that dominate other possible strategies for trading risk for reward.

Figure 2 shows that our risk-sensitive objective in LambdaMart indeed dominates alternative effective strategies for selective re-ranking, as measured by the tradeoff curve of risk vs. gain for each collection. The tradeoff curve for the selective re-ranking strategy is produced as a function of the feature threshold (the minEntropy feature for Location queries, and the maxBM25 feature for MSLR queries).

³We ignore queries with zero NDCG@10 difference from the baseline, with gains over baseline being a 'Win' and losses being a 'Loss'.


MSLR (N = 10000), all features
Statistic        α = 0       α = 1       α = 5       α = 10
NDCG@1           46.166      45.835      44.615      43.658
NDCG@10          47.272      47.221      46.312      45.540
Risk              2.239       1.974       1.722       1.540
Reward           18.642      18.327      17.165      16.282
Wins/Losses   7512/1972   7630/1845   7613/1838   7645/1776
Loss > 20%          740         684         630         573

Table 1: Summary of generalization performance over test folds for the MSLR dataset.

Location (N = 541958), all features
Statistic             α = 0         α = 1        α = 5        α = 10
∆MRR                  1.451         1.144        0.784        0.555
Risk                  0.4725        0.193        0.080        0.051
Reward                1.923         1.337        0.864        0.606
Wins/Losses     43616/24499   32253/14952   23598/8041   19818/8090
Losses > 20%           2539           718          266           14

Table 2: Summary of generalization performance over test folds for the Location dataset.

As the feature threshold is lowered, more and more queries have the feature value above the threshold and are re-ranked according to the baseline ranking. The tradeoff curve for the risk-sensitive re-ranking strategy is a function of α. At α = 0, the risk-sensitive ranking coincides with the gain-only ranking. As α is increased, the learned model exhibits a stronger preference for the baseline, which acts like an increasingly 'soft' query selection for the baseline ranking. The dotted Break-Even line shows the break-even point where the tradeoff objective T = 0, and any result above the dashed Proportional Reduction line has the desirable property that it reduces risk proportionally more than it reduces reward. The "near-optimal hull" curve represents an envelope of achievable tradeoffs that results from incorporating an automatic stopping strategy for α, the details of which are given in Sec. 5.7.

For both collections it is notable that the selective threshold re-ranking method performs significantly better than the break-even point. Additionally, since it lies above the Proportional Reduction line, it reduces risk proportionally more than it reduces gain at almost all points of the curve. This is most evident at high values of the respective threshold features, when the 'best' queries amenable to that method are most likely to be selected. At much lower threshold values, it is clear that most queries do not benefit from re-ranking with the baseline feature, giving performance closer to break-even. Furthermore, while other selective personalization approaches could be taken, it is clear based on performance that the selective threshold method here provides an informative comparison point.

In turn, for the Location data, the risk-gain tradeoff achieved by risk-sensitive ranking completely dominates that of selective threshold re-ranking. For any given risk level, risk-sensitive ranking achieves significantly higher gain, and conversely, for any given level of gain, it achieves that gain at much lower risk than the selective method. The results also show that the near-optimal hull dominates the selective method for both collections.

5.7 Guidance for setting the α parameter
A natural question is how to set α, or where one should stop trying new α's. While α is a user-set parameter to control risk, the question of where to stop can be approximated. For any model, the risk can always be reduced by flipping a coin with probability p and applying the method given "heads" and the baseline given "tails". Varying p naturally reduces both the gain and the risk (e.g., p = 0.20 yields 20% of the original risk and 20% of the original reward). Since we can apply the same logic to the gain-only model, this leads to natural guidance in terms of the gain-to-risk ratio of the gain-only model (GOM): when α = Gain_GOM / Risk_GOM, the original gain-only model would yield a tradeoff objective T = 0. Furthermore, randomly reducing the gain-only model for greater values of α than this will continue to ensure the tradeoff objective remains non-negative. Thus, solutions with T > 0 in the segment α ∈ [0, Gain_GOM / Risk_GOM] are interesting non-trivial tradeoffs relative to the original gain-only model.

To find the size of this segment of interest, we used performance over the training data. For MSLR, this yielded a value of α = 27.9 ± 1.3, and for Location, α = 4.0 ± 0.04. Then, starting from the nearest computed α points to the segment's maximum (30 and 4, respectively), we simulate the random reduction from there toward the origin and call this the "Near-Optimal Hull" line on the curves. This gives a more realistic picture of achievable tradeoffs.
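A small sketch of this stopping rule (function and variable names are ours, not the paper's): the segment of non-trivial α values ends at Gain_GOM / Risk_GOM, and mixing the gain-only model with the baseline with probability p scales both its risk and its gain by p, tracing the near-optimal hull toward the origin.

```python
def alpha_segment_end(gain_gom, risk_gom):
    # at this alpha the gain-only model's tradeoff objective T drops to zero
    return gain_gom / risk_gom

def random_reduction_hull(risk_gom, reward_gom, probs):
    # (risk, gain) points from applying the gain-only model with probability p
    # and the baseline otherwise
    return [(p * risk_gom, p * (reward_gom - risk_gom)) for p in probs]
```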

5.8 Convergence rates
We discussed in Sec. 4.3.3 how adding our risk-sensitive objective to LambdaMart results in modifying the LambdaMart gradient in such a way that its magnitude is amplified as α increases, accelerating the learning rate for queries where baseline effectiveness is already high. Thus, we hypothesized that on average across all training queries, we would see faster convergence rates at high values of α compared to low values of α.

Figure 4 demonstrates that this phenomenon actually occurs on both collections, comparing the ranker convergence when α = 0 to setting α = 5. The x-axis indicates the number of iterations; the y-axis shows the overall NDCG@10 achieved at any given iteration (up to a maximum of 100 iterations), averaged across all validation folds. Although the algorithm ultimately converged to very similar NDCG@10 for both settings, the α = 5 runs converged faster for all folds on both collections. In the case of the Location data, the speed improvement was dramatic: learning with α = 5 took an average of 10.2 iterations to achieve 95% of the final convergence effectiveness, compared to 41.5 with α = 0. The improvement was smaller but still consistent across folds for the MSLR data: for example, learning with α = 5 reached an NDCG@10 gain of 0.42 in an average of 25.4 iterations, compared with 33.4 iterations for α = 0.

Figure 2: Risk/gain tradeoffs achieved by risk-sensitive LambdaMart learning vs. selective re-ranking for the Location dataset (left) and the MSLR dataset (right). (Axes: risk vs. gain, measured in MRR for Location and NDCG@10 for MSLR; curves: Selective, Risk-Sensitive Optimization, and Risk-Sensitive with Near-Optimal Hull; reference lines: Break-even and Proportional Reduction.)

Figure 3: Helped-hurt histograms showing how variance around the baseline (dotted line) shrinks as the risk-aversion parameter α is increased from α = 0 (lightest bars) to α = 10 (darkest bars), for the Location dataset (left) and the MSLR dataset (right). Each histogram bucket corresponds to a different 10% range of loss/gain in effectiveness compared to that collection's baseline, and contains four bars corresponding to the number of queries helped or hurt for the four risk-reward tradeoff settings α = 0, 1, 5, 10. For clearer scaling, the large spike of queries in the [0, 10) bucket has been omitted in both charts.


5.9 Model consistency across risk levels
We now examine how rankings change with respect to the baseline results as the risk-aversion parameter α is increased. To measure change between rankings, we use Kendall's tau distance, a standard distance between two ranked lists that corresponds to the (normalized) number of pairwise swaps required to move from one ranking to the other. A large Kendall distance indicates low similarity to the baseline ranking, and conversely, a small Kendall distance indicates a ranking that is very similar to the baseline ranking.

For illustration purposes, we binned Kendall distance values into 10 bins, from 0 to 9. Queries whose re-ranking is the same as or very similar to the baseline ranking are put in Bin 0, while queries with re-rankings that are very different from the baseline are put into Bin 9, or an intermediate bin according to the Kendall distance.
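For reference, a minimal sketch (our code; it assumes both rankings are permutations of the same document ids) of the normalized Kendall tau distance and the binning used here:

```python
from itertools import combinations

def kendall_distance(rank_a, rank_b):
    pos_a = {d: i for i, d in enumerate(rank_a)}
    pos_b = {d: i for i, d in enumerate(rank_b)}
    pairs = list(combinations(pos_a, 2))
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)      # normalized number of pairwise swaps, in [0, 1]

def kendall_bin(distance, n_bins=10):
    # bin 0 = same or very similar to the baseline, bin 9 = very different
    return min(int(distance * n_bins), n_bins - 1)
```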

Figure 5 shows, for each value of α, a distribution over Kendall bins using the location personalization dataset. It shows that as the permitted risk increases (alpha changes from high to low), more and more queries start to deviate significantly from the baseline. For example, the lower-risk model learned with α = 9 has 85% of its queries staying close to the baseline (Bin 0), while the higher-risk model learned with α = 1 has only 42% of its queries close to the baseline ranking, with the remaining queries dispersed over much more diverse rankings with much larger Kendall distances from the baseline. This behavior can be explained in terms of the risk/reward tradeoff. The higher-risk models have an incentive to deviate significantly from the baseline in exchange for more drastic improvements in effectiveness (which is valued more by the tradeoff metric under a low α value), while low-risk models (high α value) tend to stay close to the baseline, since risk is aggressively penalized by the tradeoff metric under large α and there is no incentive to deviate significantly from the baseline (even if that means a higher gain) due to the potentially large penalty incurred. This fact is also confirmed by the results shown in Table 2, where the higher-risk model trades off more robustness for effectiveness.

Figure 4: The risk-sensitive model (high α) consistently converges faster than the reward-only objective (α = 0), for both the MSLR dataset (top) and the Location dataset (bottom). (Axes: NDCG vs. training iterations; each iteration point is averaged over all validation folds.)


Moreover, in models that heavily penalize risk, the re-ranked results for a particular query may still deviate significantly from the baseline, but only when the reward (NDCG gain) acquired from the deviation is high enough to outweigh the increased risk. We can see this behavior in the risk-sensitive model α = 9. Although a majority of its queries are in Bin 0 (near the baseline), the overall NDCG gain is moderate and the overall risk is the lowest among all models. Also interesting is that as Kendall distance increases, the safe model tends to have the largest reward among the models at the same Kendall distance. For the safe model (α = 9), the reward in Kendall bin 2 is 0.00371, the highest among the models in that bin. Thus, one potential application of this type of cross-model analysis might be to use the consistency across different models to produce the best re-ranking model for each query.

Figure 5: Distributions of Kendall distances to the baseline ranking, for increasing values of the risk parameter α (Location dataset). Each bar gives, for one α model (ordered from high to low risk), the distribution over Kendall-distance bins from 0.0-0.1 up to 0.8-1.0. For high-risk models (α = 1), only 42% of queries have re-ranked results that remain very close to the baseline ranking (Bin 0), compared to low-risk (α = 10) models having 90% of queries with results very close to the baseline.

6. DISCUSSION AND FUTURE WORK
In this study we gave a definition of 'risk' in terms of the size of the downside tail of a distribution over a specific performance measure: NDCG. However, our framework is not limited to that choice and can address a much broader family of scenarios.

First, the risk and reward parts of the ranking objective may address other performance aspects of the results ranking, as long as the combined measure is consistent.

Second, we can look beyond strictly performance-related definitions of risk to include other properties of result sets. For example, the churn of a set of results for a given query refers to undesirable and unexpected volatility in a given query's results after re-ranking, even (or especially) when the rankings of the best relevant documents, and thus overall performance, might be unchanged or only minimally reduced. All things being equal, given two rankings with the same retrieval performance, we may prefer the one that gives less churn compared to a baseline ranking. Thus, the 'risk' of a ranking might be defined using a ranking dissimilarity measure such as Kendall's tau compared to the baseline, instead of the strictly performance-based difference in NDCG we used here.

Third, another assumption we made was that reducing downside variance is a desirable goal. In general, this is true. However, there may be some retrieval tasks where it is beneficial to have an objective that tries to increase upside variance, even if this leads to an increased chance of large errors. Typically this will be in cases where a greater diversity of retrieval hypotheses is desirable as part of a larger precision-oriented system. For example, a search engine, acting as one source among several that provide candidate documents to a question-answering application, may be more likely to find high-reward documents if it promotes more aggressively from deep in the original ranking.


Finally, we note that our use of an objective that minimizes negative outcomes with respect to a baseline is a special case of regret-based learning. In particular, we could generalize our risk-reward objective to minimize the average loss over the worst X% tail of negative outcomes, instead of over all negative outcomes; this could be viewed as a minimax-regret objective, with connections to the widely used Conditional Value-at-Risk objective from financial optimization. Beyond ranking, we believe this family of robust objectives deserves wider consideration in IR, with potential applicability to tasks such as query rewriting, federated resource selection, meta-search, and others, where risk-reward scenarios exist and finding the highest-quality operational tradeoffs is critical.
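As an illustration of that generalization, the sketch below computes a CVaR-style risk term: the average NDCG loss relative to the baseline, taken over only the worst fraction of queries (those hurt most) rather than over all hurt queries. The function name and the default tail fraction are hypothetical; this is a sketch of the idea, not the paper's method.

    import math

    def tail_risk(ndcg_model, ndcg_baseline, tail_fraction=0.1):
        # Per-query downside relative to the baseline, sorted worst-first.
        losses = sorted((max(0.0, ndcg_baseline[q] - ndcg_model[q]) for q in ndcg_model),
                        reverse=True)
        # Average over only the worst tail_fraction of queries (at least one).
        k = max(1, math.ceil(tail_fraction * len(losses)))
        return sum(losses[:k]) / k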

7. CONCLUSIONS

We presented a general approach for improving the robustness of learned rankers by performing risk-sensitive optimization of ranking models. Our optimization adds a novel ranking objective that minimizes downside risk relative to a given baseline, in addition to the standard objective that maximizes average effectiveness across queries. Our approach can be viewed as a way to control the robustness/effectiveness tradeoff in learning to rank; it thus generalizes standard learning-to-rank approaches as special cases that do not consider robustness. Our robustness metric is orthogonal to previous multi-objective learning-to-rank metrics that capture different facets of search, since robustness is defined using standard facet-based and IR performance measures with respect to a baseline model. We proved the theoretical consistency of the proposed tradeoff measure and designed a principled learning-to-rank approach, using boosted regression trees, to optimize it. Our approach can potentially be applied to a wide range of learning-to-rank applications by plugging in different baseline models. Furthermore, through an extensive empirical evaluation using both public and proprietary datasets, we demonstrated that our robust models achieve significant benefits in learning rate, robustness, and reliability over existing methods.
