
Int. J. Mach. Learn. & Cyber. (2019) 10:485–494 DOI 10.1007/s13042-017-0730-4

ORIGINAL ARTICLE

Learning to rank using multiple loss functions

Yuan Lin · Jiajin Wu · Bo Xu · Kan Xu · Hongfei Lin

Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116023, China
* Corresponding author: Hongfei Lin

Received: 22 November 2016 / Accepted: 21 September 2017 / Published online: 12 October 2017 © Springer-Verlag GmbH Germany 2017

Abstract Learning to rank has attracted much attention in the domains of information retrieval and machine learning. Prior studies on learning to rank mainly focused on three types of methods, namely pointwise, pairwise and listwise. Each of these paradigms focuses on a different aspect of the input instances sampled from the training dataset. This paper explores how to combine them to improve ranking performance. The basic idea is to incorporate the different loss functions and thereby enrich the objective loss function. We present a flexible framework for multiple loss function incorporation, and based on it three loss-weighting schemes are given. Moreover, in order to get good performance, we define several candidate loss functions and select among them experimentally. The performance of the three types of weighting schemes is compared on the LETOR 3.0 dataset, which demonstrates that, with a good weighting scheme, our method significantly outperforms the baselines that use a single loss function, and it is at least comparable to the state-of-the-art algorithms in most cases.

Keywords Learning to rank · Loss function · Gradient descent · Incorporation · Weighting scheme

1 Introduction

Ranking is a central issue in information retrieval and has attracted much attention recently. Learning to rank is a task that uses machine learning methods to learn a ranking function from a given dataset, such that the ranking function can sort new objects according to their relevance or preference. Prior studies on learning to rank generally focused on one specific input space constructed from the training dataset, and thus the methods can be categorized into three types: the pointwise approach, the pairwise approach and the listwise approach.

Without loss of generality, we discuss learning to rank in the document retrieval task in this paper. For learning to rank, there is a collection of queries, in which each query has some related documents in the training set, and each query-document pair contains at least three types of information, namely the ground truth relevance label, the query ID and some document features. The document features are extracted to measure the importance of the document or the relevance of the query-document pair. The former type of feature is usually called query-independent (or linkage-analysis-based), such as PageRank [25] and HITS [17], while the latter is called query-dependent (or content-based), such as BM25 [28], LMIR [38, 39] and so on. Thus learning to rank can be viewed as learning how to combine various existing relevance functions to achieve better performance, and many learning to rank methods have been proposed in recent years [29].

During the training process, the input space for a given query is constructed from the aforementioned dataset. For the pointwise approach, it is composed of each single document vector associated with the query; for the pairwise approach, it is composed of document preference pairs constructed according to the documents' ground truth relevance labels with respect to the same query; for the listwise approach, it is composed of the group of document vectors associated with the query.

In previous work, most learning to rank algorithms belong to one of the three types discussed above.


Many studies of loss functions have been carried out, for the reason that different loss functions consider different aspects of the information in the dataset. Regardless of the learning method, the main divergence among algorithms lies in the loss function.

As mentioned above, the different paradigms have different input spaces and thus focus on different aspects of the dataset information, and their loss functions are heterogeneous. In this paper, we combine these three aspects so that they complement each other, based on the incorporation of appropriate loss functions, in order to grasp all the information in the dataset and improve the performance of ranking models by making full use of it. In other words, the aim of our work is to find out how using multiple granularities of instances, i.e. a combination of loss functions, can lead to an improvement in retrieval performance.

The main contributions of our work are drawing attention to the comprehensive use of the different input instances, and proposing a framework for loss function incorporation together with three weighting schemes for the incorporation. Meanwhile, we propose two novel loss functions, one for pairwise input instances and one for listwise input instances.

The rest of this paper is organized as follows. In Sect. 2, related work is introduced. In Sect. 3, the preliminary background and notation needed for the paper are set out, and the problem of training separately on one type of input instance is analyzed. In Sect. 4, the candidate loss functions of our method and the optimization algorithm for solving them are given; moreover, the framework of loss function incorporation is proposed, and based on it three types of weighting schemes for incorporation are given. In Sect. 5, the experimental methodology is described and the results are given to support the competitiveness of our algorithms. Finally, Sect. 6 offers some concluding remarks.

2 Related work

The task of learning to rank has recently drawn a lot of interest in machine learning and information retrieval (IR). As distinguished in [19, 22], previous works fall into three paradigms: pointwise, pairwise, and listwise approaches.

In the pointwise approaches, each training instance is associated with a rating denoting its relevance with respect to a given query. The goal of learning is to find a model that maps instances to ratings close to their ground truth; ranking is thus reduced to regression or classification. A typical example of this type is Prank [6], which trains a perceptron model to directly maintain a totally ordered set via projections. The goal of Prank is to find a direction, defined by a parameter vector, such that after projecting the documents onto it one can easily use thresholds to distinguish the documents into different ordered categories. For more work on the pointwise approach, please refer to [5, 10, 23, 31].

The pairwise approaches take pairs of objects, so that the relative preferences of training instances can be considered in training, which is more like a ranking problem than the pointwise approaches. Typical examples include Ranking SVM [3, 12, 15], RankBoost [9] and RankNet [1]. The loss function of RankNet, the cross entropy loss, is defined on a pair of documents; a neural network [11] is used as the model and gradient descent as the optimization algorithm to learn the ranking function.

There are two sub-categories of listwise approaches [19]. Both use a list of ranked objects as training instances and learn to predict the list of objects. For the first sub-category, the loss function is defined based on an approximation or bound of widely used IR evaluation measures, and example algorithms include SoftRank [30], SVMmap [38], SVMndcg [4] and PermuRank [36]; for the second sub-category, the loss function measures the difference between the permutation given by the hypothesis and the ground truth permutation, and example methods include ListNet [2], ListMLE [35] and RankCosine [26]. It is worth mentioning that ListNet uses a model and optimization algorithm similar to those of RankNet, as we will do.

The idea of incorporating these three paradigms to design one learning to rank algorithm was attempted in [34], where the method mixes pointwise regression with the listwise SDCG and shifts the focus between paradigms as the training process proceeds. Later, Moon et al. [22] proposed IntervalRank, adding a pointwise regression term and pairwise tie constraints to their listwise isotonic regression. Other work related to learning to rank includes group-enhanced ranking [20], learning-to-rank-based query expansion [37] and recommendation [13], data analysis research on learning to rank [24], learning-to-rank-based image search [7, 8], and some state-of-the-art work on machine learning [32, 33, 41]. In this paper, we also start from the idea of incorporating these three paradigms, and especially the three types of loss functions, to define a learning to rank method. Our main contributions are proposing the incorporation framework, giving three weighting schemes based on it and experimentally proving their effectiveness. Moreover, in order to get good performance, we define several candidate loss functions and select among them experimentally.

3 Problems analysis

3.1 Learning to rank

In this section, we take document retrieval as an example to give a general description of learning to rank. We use a superscript to denote the id of a query and a subscript to denote the id of a document.

Generally, in training, a set of queries $Q = \{q^{(1)}, q^{(2)}, \ldots, q^{(m)}\}$ is given. Each query $q^{(i)}$ is associated with a list of documents $d^{(i)} = \{d^{(i)}_1, d^{(i)}_2, \ldots, d^{(i)}_{n^{(i)}}\}$, where $d^{(i)}_j$ denotes the $j$-th document and $n^{(i)}$ denotes the size of $d^{(i)}$. Moreover, each list of documents $d^{(i)}$ is associated with a list of ground truth judgments $y^{(i)} = \{y^{(i)}_1, y^{(i)}_2, \ldots, y^{(i)}_{n^{(i)}}\}$, where $y^{(i)}_j$ denotes the judgment on document $d^{(i)}_j$ with respect to query $q^{(i)}$. The judgment represents the relevance of $d^{(i)}_j$ to $q^{(i)}$, according to which $d^{(i)}_j$ is ranked. A feature vector $x^{(i)}_j = \phi(q^{(i)}, d^{(i)}_j)$ is created from each query-document pair $(q^{(i)}, d^{(i)}_j)$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n^{(i)}$.

Then, for a general pointwise approach, $D = \{d^{(1)}, d^{(2)}, \ldots, d^{(m)}\}$ denotes all the documents, and the training dataset is constructed from this set regardless of query boundaries; we denote it as $T_{pt} = \{x_j, y_j\}_{j=1}^{N_d}$, where $N_d$ is the size of $D$.

For the pairwise approach, the training dataset is created as follows: for any two documents $d^{(i)}_j$ and $d^{(i)}_k$ with $j \neq k$, a new instance called a preference pair is created, and $+1$ is assigned to the pair if the relevance degree of $d^{(i)}_j$ is larger than that of $d^{(i)}_k$, and $-1$ otherwise (alternatively, the difference of the relevance degrees of $d^{(i)}_j$ and $d^{(i)}_k$ can be used). Then each query $q^{(i)}$ is associated with a list of preference pairs $r^{(i)} = \{r^{(i)}_1, r^{(i)}_2, \ldots, r^{(i)}_{N^{(i)}}\}$, where $r^{(i)}_s$ denotes the $s$-th preference pair and $N^{(i)}$ denotes the size of $r^{(i)}$. $R = \{r^{(1)}, r^{(2)}, \ldots, r^{(m)}\}$ denotes the total preference pair set, and $T_{pr} = \{r_s, t_s\}_{s=1}^{N_r}$ is the training dataset, where $t_s$ denotes the classification label of $r_s$ and $N_r$ denotes the size of $R$.

For the listwise approach, the training dataset can be denoted as $T_{ls} = \{x^{(i)}, y^{(i)}\}_{i=1}^{m}$.

Then, given a ranking function $f$, each feature vector $x^{(i)}_j$ is given an output score $f(x^{(i)}_j)$. The objective of learning then turns out to be the minimization of the total loss with respect to the training data.

For the pointwise approach, the loss is given as follows:

$$L_{pt} = \sum\nolimits_{j=1}^{N_d} L_{pt}(y_j, f(x_j)) \tag{1}$$

Here $L_{pt}$ can be a loss function used in regression or classification.

The loss of the pairwise approach is a little different. Take RankNet for example, which is based on probability: for a preference pair $r^{(i)}_s$ constructed from $d^{(i)}_j$ and $d^{(i)}_k$, it is considered that there is a probability $\bar{P}_s$ that $d^{(i)}_j$ is ranked higher than $d^{(i)}_k$ according to the judgments of $d^{(i)}_j$ and $d^{(i)}_k$, and also a probability $P_s$ according to the output scores $f(x^{(i)}_j)$ and $f(x^{(i)}_k)$. The loss is then defined as the divergence of these two probabilities:

$$L_{pr} = \sum\nolimits_{s=1}^{N_r} L_{pr}(\bar{P}_s, P_s) \tag{2}$$

Here $L_{pr}(\bar{P}_s, P_s)$ can be any statistical divergence measure. The probabilities can be modeled as a map from the outputs; we give the details later.

For the listwise approach, the loss function is defined as follows:

$$L_{ls} = \sum\nolimits_{i=1}^{m} L_{ls}(y^{(i)}, z^{(i)}) \tag{3}$$

Here $z^{(i)} = (f(x^{(i)}_1), \ldots, f(x^{(i)}_{n^{(i)}}))$ is the list of output scores for the list of feature vectors $x^{(i)}$. Take ListNet for example, which is based on the Plackett-Luce model [21, 27]: $L_{ls}(y^{(i)}, z^{(i)})$ is a statistical divergence measure.

The ranking model $f$ is then selected through training by

$$f = \arg\min L \tag{4}$$

For notational brevity, here we denote by $L$ the loss in general.

In ranking, each new query-document pair is constructed in the same way as in the training dataset, and the pair is given a score by the ranking model $f$, according to which the document is ranked.
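To make the three input spaces concrete, the sketch below (our illustration, not code from the paper; the builder functions and the toy data are invented for the example) constructs the pointwise set $T_{pt}$, the pairwise preference set $T_{pr}$ and the listwise set $T_{ls}$ from query-grouped feature vectors and judgments.

```python
from itertools import combinations

# Toy query-grouped data: for each query, a list of feature vectors x_j
# and the corresponding ground truth judgments y_j (higher = more relevant).
queries = {
    "q1": ([[0.2, 0.7], [0.9, 0.1], [0.4, 0.4]], [2, 0, 1]),
    "q2": ([[0.5, 0.5], [0.1, 0.8]], [1, 0]),
}

def build_pointwise(queries):
    """T_pt: every (x_j, y_j) pair, ignoring query boundaries."""
    return [(x, y) for xs, ys in queries.values() for x, y in zip(xs, ys)]

def build_pairwise(queries):
    """T_pr: preference pairs within each query; the label t_s is +1 if the
    first document is more relevant, -1 otherwise (ties are skipped)."""
    pairs = []
    for xs, ys in queries.values():
        for j, k in combinations(range(len(xs)), 2):
            if ys[j] != ys[k]:
                pairs.append(((xs[j], xs[k]), 1 if ys[j] > ys[k] else -1))
    return pairs

def build_listwise(queries):
    """T_ls: one instance per query, the whole list (x^(i), y^(i))."""
    return [(xs, ys) for xs, ys in queries.values()]

print(len(build_pointwise(queries)))  # 5 document instances
print(len(build_pairwise(queries)))   # 4 preference pairs
print(len(build_listwise(queries)))   # 2 list instances
```

The quadratic growth of the pairwise set relative to the pointwise set is visible even in this toy example and is discussed further in Sect. 5.3.1.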

3.2 Problem analysis

The pointwise approaches try to directly estimate the relevance label, as conventional classification and regression do. This can easily be handled with existing classification or regression methods, but it obviously misses the relative relation between different documents with respect to the same query, which means it may ignore that a document with a mediocre rating may actually be desirable if all other documents carry even lower scores [22]. The pairwise approaches take the relative information into consideration by comparing pairs of documents; however, the loss function is defined on pairs and does not directly link to the criteria used in the evaluation of ranking. The listwise approaches treat the document group as a whole, and they hold that users care more about the topmost retrieved documents, so that the ranking model should give more attention to them.

There are advantages to focusing on one paradigm: a loss function taking care of one aspect of the instances sampled from the dataset can be created directly and simply. However, there are also problems with this. There has been significant discussion about the relative merits of these strategies, and the common wisdom is that listwise is better than pairwise, which in turn outperforms pointwise. It then seems that the pointwise and pairwise approaches are useless compared with the listwise one. However, it has been discussed in [22] that when listwise methods are used naively, some information is missed. And in [34], the pointwise regression method is used to obtain a good starting point for a nonconvex listwise method. Given these considerations, in this paper we study the method of using multiple types of instances, and especially multiple loss functions.

4 Methodology

For one type of instance, a robust and effective loss function is good for capturing the information that the input space contains. So we begin our method by discussing the candidate loss functions in our work.

4.1 Loss function

For the pointwise loss function we use the regression square loss; for pairwise we use the cross entropy and likelihood losses as candidates; and for listwise we choose cross entropy and J divergence. We adopt these loss functions, especially the pairwise and listwise ones, because they are widely used in learning to rank. It should also be noted that other loss functions can be integrated into our method, since the method is a general way of combining different loss functions for learning to rank.

The pointwise square loss is defined as follows:

$$L_{pt}(y_j, f(x_j)) = (y_j - f(x_j))^2 \tag{5}$$

Cross entropy for the pairwise case was first used in RankNet [1]; it has the following form:

$$L_{pr}(\bar{P}_s, P_s) = -\bar{P}_s \log P_s - (1 - \bar{P}_s) \log(1 - P_s) \tag{6}$$

Here the probability $P_s$ can be modeled using a logistic function:

$$P_s = \frac{e^{o_{jk}}}{1 + e^{o_{jk}}} \tag{7}$$

where $o_j = f(x_j)$ and $o_{jk} \equiv o_j - o_k = f(x_j) - f(x_k)$.

The likelihood loss is used in ListMLE as a listwise loss function for its nice properties of time complexity and soundness; we adapt it as a candidate pairwise loss function in our work as follows:

$$L_{pr}(\bar{P}_s, P_s) = -\log \varphi(\bar{P}_s \parallel P_s) \tag{8}$$

Cross entropy for the listwise case [2] has the following form:

$$L_{ls}(y^{(i)}, z^{(i)}) = -\sum\nolimits_{\forall g \in \mathcal{G}_k} P_{y^{(i)}}(g) \log P_{z^{(i)}}(g) \tag{9}$$

Here $\mathcal{G}_k$ denotes that the top-$k$ probability is used to model the permutation probability.

J divergence [14] is a symmetrized version of the KL divergence [16], which is similar to cross entropy. It is used as follows:

$$L_{ls}(y^{(i)}, z^{(i)}) = KL(y^{(i)}, z^{(i)}) + KL(z^{(i)}, y^{(i)}) \tag{10}$$

Here $KL(y^{(i)}, z^{(i)}) = \sum\nolimits_{\forall g \in \mathcal{G}_k} P_{y^{(i)}}(g) \log \frac{P_{y^{(i)}}(g)}{P_{z^{(i)}}(g)}$. The symmetrized version is better in the sense of a distance, for the J divergence from $y^{(i)}$ to $z^{(i)}$ is the same as that from $z^{(i)}$ to $y^{(i)}$. For more information, refer to [2, 14].

We select the loss functions from these candidates in Sects. 5.3.1 and 5.3.2 according to their experimental performance.
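For reference, the following sketch (ours) computes the candidate losses for one scored pair and one scored list. The top-1 Plackett-Luce probability stands in for the top-$k$ model, and the pairwise likelihood is written as a ListMLE-style likelihood of ranking $d_j$ above $d_k$, which is one plausible reading of Eq. (8); both simplifications are our assumptions.

```python
import math

def square_loss(y, f_x):
    # Eq. (5): pointwise regression loss.
    return (y - f_x) ** 2

def pairwise_cross_entropy(p_bar, o_j, o_k):
    # Eqs. (6)-(7): RankNet-style loss; P_s is the logistic map of o_j - o_k.
    p_s = 1.0 / (1.0 + math.exp(-(o_j - o_k)))
    return -p_bar * math.log(p_s) - (1.0 - p_bar) * math.log(1.0 - p_s)

def pairwise_likelihood(o_j, o_k):
    # Eq. (8), read as a pair-restricted ListMLE likelihood: the negative
    # log-probability that d_j is ranked above d_k.
    return -math.log(math.exp(o_j) / (math.exp(o_j) + math.exp(o_k)))

def top1_probs(scores):
    # Top-1 Plackett-Luce probability of each document being ranked first.
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_cross_entropy(y_list, z_list):
    # Eq. (9) with k = 1: cross entropy between the two top-1 distributions.
    p_y, p_z = top1_probs(y_list), top1_probs(z_list)
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def j_divergence(y_list, z_list):
    # Eq. (10): symmetrized KL divergence between the top-1 distributions.
    p_y, p_z = top1_probs(y_list), top1_probs(z_list)
    return kl(p_y, p_z) + kl(p_z, p_y)
```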

4.2 Gradient descent

All three types of loss functions above are convex and their derivatives can be computed efficiently, so we model the ranking function with a neural network and solve the optimization problem by gradient descent, an iterative procedure in which, at each step, the gradient of the objective function is calculated and the negative direction, scaled by the step size, is used to move to the next point.

We use the linear function $f(x_j) = \sum_{u=1}^{N_u} \omega_u \cdot x_{j,u}$ as our ranking function, where $x_{j,u}$ is the $u$-th feature of the document feature vector $x_j$, $\omega_u$ is the $u$-th weight of the ranking function, and $N_u$ is the total number of document features.

The gradient of the pointwise square loss is given by:

$$\Delta\omega_u = 2(y_j - f(x_j)) \cdot x_{j,u} \tag{11}$$

The gradient of the pairwise cross entropy $L_{pr}(\bar{P}_s, P_s)$ is given as follows:

$$\Delta\omega_u = -\frac{\bar{P}_s}{P_s} \cdot P'_{s,u} + \frac{1 - \bar{P}_s}{1 - P_s} \cdot P'_{s,u} \tag{12}$$

Here $P'_{s,u} = P_s \cdot x_{j,u} - P_s \cdot \frac{e^{o_j} \cdot x_{j,u} - e^{o_k} \cdot x_{k,u}}{(e^{o_j} - e^{o_k})^2}$.

The gradient of the pairwise likelihood loss can be calculated as follows:

$$\Delta\omega_u = -x_{j,u} + \frac{e^{o_j} \cdot x_{j,u} - e^{o_k} \cdot x_{k,u}}{(e^{o_j} - e^{o_k})^2} \tag{13}$$

The gradient of the listwise cross entropy is given by:

$$\Delta\omega_u = -\frac{\bar{P}_s}{P_s} \cdot P'_{s,u} \tag{14}$$

The gradient of the listwise J divergence loss is given by:

$$\Delta\omega_u = P'_{s,u} \cdot \left( \log P_s + 1 - \log \bar{P}_s - \frac{\bar{P}_s}{P_s} \right) \tag{15}$$


Then we use Algorithm 1 to optimize the loss functions mentioned above.

Algorithm 1 describes how to update the model $\omega$ by computing the gradient at every step. Here $t$ denotes the current step number. The learning rate $\eta \in (0, 1)$ is the step size for each step; we update it by multiplying it by a dropping rate if the sum of losses for the current epoch is bigger than that for the previous epoch. In practice, the convergence condition can also be replaced by a limit on the number of iterations.
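Algorithm 1 appears as a figure in the original publication and its pseudocode is not recoverable here; the sketch below reflects only the surrounding description: one gradient step per instance, a learning rate multiplied by the dropping rate whenever the epoch loss increases, and an iteration limit as the stopping condition. The function names and the instance interface are our own.

```python
def train(instances, loss_grad, n_features, eta=1e-5, drop=0.1, max_epochs=200):
    """Gradient descent for a linear ranking function f(x) = w . x.

    instances : training instances from any of the three input spaces
    loss_grad : callable returning (loss, gradient w.r.t. w) for one instance
    """
    w = [0.0] * n_features
    prev_epoch_loss = float("inf")
    for _ in range(max_epochs):          # a convergence test can replace this
        epoch_loss = 0.0
        for inst in instances:
            loss, grad = loss_grad(w, inst)
            epoch_loss += loss
            # move in the negative gradient direction, scaled by the step size
            w = [wu - eta * gu for wu, gu in zip(w, grad)]
        if epoch_loss > prev_epoch_loss:
            eta *= drop                  # drop the learning rate on regression
        prev_epoch_loss = epoch_loss
    return w
```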

4.3 Ranking by weighting importance of loss functions

We gave the candidate loss functions in Sect. 4.1; in this part we discuss how to combine them so as to make full use of the dataset information.

We first give the framework of loss function incorporation as follows:

$$L = I_3 \cdot W_3 \cdot (L_{pt}, L_{pr}, L_{ls}) \tag{16}$$

Here $I_3 = (I_{pt}, I_{pr}, I_{ls})$ is a three-dimensional indicator function, in which each $I(expr)$ takes the value 1 if $expr = \text{TRUE}$ and 0 otherwise; $W_3 = (W_{pt}, W_{pr}, W_{ls})$ is a three-dimensional weighting function, and $L$ is the total loss. By bringing in $I_3$, we can switch from solo (pointwise, pairwise or listwise) to double and triple incorporation; in this sense, the pointwise, pairwise and listwise approaches are special cases under this framework. We use three weighting schemes based on Eq. 16 to incorporate the three types of loss functions.

4.4 Regularization weighting

First we treat the three losses equally and incorporate them based on regularization weighting. Taking the pointwise loss function as an example, we regularize it by:

$$W_{pt} = \frac{L_{pt} - \min L_{pt}}{\max L_{pt} - \min L_{pt}} \tag{17}$$

Here $W_{pt}$ is the weight for the pointwise loss, $\max L_{pt}$ is the largest pointwise loss during the training process, $\min L_{pt}$ is the smallest, and $L_{pt}$ is the pointwise loss of the current epoch. In practice it is expensive to obtain $\max L_{pt}$ and $\min L_{pt}$; instead we can take the loss of the first epoch as $\max L_{pt}$ and 0 as $\min L_{pt}$. This is meaningful because training actually decreases the loss, and the best case is decreasing the loss to 0. Similarly, we can get $W_{pr}$ and $W_{ls}$.

In this way, the different types of loss functions with different bounds can be incorporated directly. Then, according to Eq. 16, the total loss is calculated, and the optimization can be done by Algorithm 1 with a small change: since $\omega$ is associated with all three loss functions, $\omega$ is updated by:

$$\omega \leftarrow \omega - \eta \times I_3 \cdot W_3 \cdot (\Delta\omega_{pt}, \Delta\omega_{pr}, \Delta\omega_{ls}) \tag{18}$$

Here $\Delta\omega_{pt}$ denotes the gradient of the pointwise loss, $\Delta\omega_{pr}$ that of the pairwise loss and $\Delta\omega_{ls}$ that of the listwise loss. When different $I_3$ are used, the loss in Eq. 16 differs, and we examine this difference experimentally.
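Under the same caveats, Eqs. (16)-(18) amount to the following combination step (a sketch of ours; `I3`, `W3` and the per-loss gradients are placeholders supplied by the caller):

```python
def regularization_weight(loss_now, loss_first):
    # Eq. (17) with the practical choice max L = first-epoch loss, min L = 0.
    return loss_now / loss_first if loss_first > 0 else 0.0

def combined_update(w, eta, I3, W3, grads):
    # Eq. (18): w <- w - eta * (I_pt*W_pt*grad_pt + I_pr*W_pr*grad_pr + ...)
    for i_t, w_t, grad in zip(I3, W3, grads):
        if i_t:
            w = [wu - eta * w_t * gu for wu, gu in zip(w, grad)]
    return w
```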

4.5 Iteration sensitive weighting

Wu et al. [34] proposed mixing SHF-SDCG with regression by a weighting scheme $\alpha_m$, because SHF-SDCG is not convex and the regression is added to obtain a good starting point. Our candidate loss functions are all convex, so we do not need a good starting point; however, we can still adopt $\alpha_m$ for the incorporation of loss functions. We call this iteration sensitive weighting, for $\alpha_m$ is defined as a function of the iteration number:

$$\alpha_m = \frac{1}{1 + \exp(\delta(m - M/2))} \tag{19}$$

Here $M$ is the total number of iterations, and $\delta$ is a free variable for adjusting the proportion, chosen in [34] such that $\alpha_1 = 0.999999$. We compare the effects of different $\delta$ and $M$ in our work.

Then, by using different $I_3$, the loss is defined differently. For example, if we use $I_3 = (1, 0, 1)$ and $W_3 = (\alpha_m, 0, 1 - \alpha_m)$, the loss is $L = L_{pt} \cdot \alpha_m + L_{ls} \cdot (1 - \alpha_m)$. This is a double incorporation, and similarly we can easily get the other double incorporations. If we use $I_3 = (1, 1, 1)$ and $W_3 = (\alpha_m, \alpha_m(1 - \alpha_m), (1 - \alpha_m)^2)$, the loss becomes $L = L_{pt} \cdot \alpha_m + (1 - \alpha_m)(L_{pr} \cdot \alpha_m + L_{ls} \cdot (1 - \alpha_m))$. This is a triple incorporation, and similarly we can get the other triple incorporations.
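A small sketch (ours) of the schedule in Eq. (19) and of the two incorporations just described; the default $M$ and $\delta$ follow the experimental setting reported in Sect. 5.3.4.

```python
import math

def alpha(m, M=200, delta=0.01):
    # Eq. (19): weight that decays from near 1 toward 0 as iteration m grows.
    return 1.0 / (1.0 + math.exp(delta * (m - M / 2)))

def double_loss(m, L_pt, L_ls):
    # I3 = (1, 0, 1), W3 = (a_m, 0, 1 - a_m): pointwise fades into listwise.
    a = alpha(m)
    return L_pt * a + L_ls * (1.0 - a)

def triple_loss(m, L_pt, L_pr, L_ls):
    # I3 = (1, 1, 1), W3 = (a_m, a_m * (1 - a_m), (1 - a_m) ** 2)
    a = alpha(m)
    return L_pt * a + (1.0 - a) * (L_pr * a + L_ls * (1.0 - a))
```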

4.6 Relay incorporation

An extreme case of iteration sensitive weighting is to focus, at each stage, on only one type of loss function, ignoring the others. For example, we may use the strategy of handling the pointwise loss at the beginning, continuing with the pairwise loss, and ending with the listwise loss. This is just like a relay, in which different athletes do their best at different stages to achieve the best overall performance. Compared with iteration sensitive weighting, which is asymptotic, this weighting is a kind of step function: when the currently focused loss has converged or the iteration limit is reached, the next stage begins.

The key question is then which strategy is best. Since the different types of loss are quite different, it is hard to analyze which one should be the main loss at each point of the training process. We try different strategies and compare their performance.
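Relay incorporation can be read as a step-function choice of $I_3$ over training stages; a minimal sketch (ours) of one such strategy, with purely illustrative stage boundaries:

```python
def relay_indicator(m, stage_ends=(60, 130, 200)):
    """Step-function I3 for a pointwise -> pairwise -> listwise relay.

    m          : current iteration number
    stage_ends : illustrative iteration limits for the three stages
    """
    if m < stage_ends[0]:
        return (1, 0, 0)   # stage 1: pointwise only
    if m < stage_ends[1]:
        return (0, 1, 0)   # stage 2: pairwise only
    return (0, 0, 1)       # stage 3: listwise only
```

In practice the switch can also be triggered by convergence of the currently focused loss rather than by a fixed iteration limit, as described above.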

5 Experiments and results

5.1 Datasets and setup

We use two datasets published in the LETOR 3.0 package [18] in our experiments, OHSUMED and TD2004, and we use three typical algorithms, Regression, FRank and ListNet, as our baselines.

OHSUMED is a collection of documents and queries on medicine, consisting of 348,566 documents and 106 queries. There are in total 16,140 query-document pairs upon which relevance judgments are made. The relevance judgments are definitely relevant, possibly relevant, or not relevant. The standard features used in document retrieval are extracted for each query-document pair; there are 45 features in total.

TD2004 is extracted from the topic distillation task of TREC 2004, whose goal is to find good websites about the query topic. There are 75 queries in this dataset. For each query, human assessors decide whether a web page is a relevant result, so two levels of relevance are used: relevant and not relevant. The documents in TD2004 are crawled from .gov websites, so features extracted by linkage analysis are used to represent the query-document pairs in addition to the content features used in OHSUMED. The total number of features is 64, and the total number of query-document pairs is 74,170.

We use fivefold cross validation: in each fold, 1/5 of the data is used for validation, 1/5 for testing and the remaining 3/5 for training. All models are trained on the training set, tuned on the validation set and tested on the testing set. The initial learning rate is set to 1e-5 and the dropping rate to 0.1. The ranking results are evaluated by averaging the performance over the five folds.

5.2 Evaluation measures

In order to evaluate the performance of the different weighting schemes, three evaluation measures are applied: Precision, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG). All of these measures are widely used for comparing information retrieval systems. For conciseness we give brief definitions below.

P@k: We use $r$ to denote the rank of a document after the ranking function has been applied. The Precision at position $k$ for the ranked document list of a query $q^{(i)}$ measures the quality of the top $k$ results of the ranking list:

$$P@k = \frac{1}{k} \sum\nolimits_{j=1}^{k} I(r^{(i)}_j) \tag{20}$$

MAP: The mean of the average Precision over test queries is defined as the mean over the Precision scores of all retrieved relevant documents:

$$MAP = \frac{\sum\nolimits_{j=1}^{n^{(i)}} P@j \cdot I(g_{r_j} = \max(g))}{\sum\nolimits_{j=1}^{n^{(i)}} I(g_{r_j} = \max(g))} \tag{21}$$

NDCG@k: The NDCG at position $k$ for the ranked document list of query $q^{(i)}$ is defined as a position- and rating-weighted score, normalized such that the maximum NDCG score is 1 for a perfect ranking:

$$NDCG@k = \frac{1}{Z_D} \sum\nolimits_{j=1}^{k} \frac{2^{g_{r_j}} - 1}{\log(1 + j)} \tag{22}$$

Here $g_{r_j}$ is the relevance grade of the document ranked at position $j$. Denoting by $\hat{r} := \operatorname{arg\,sort}[g]$ the optimally sorted version of the document collection, we can write $Z_D = \sum\nolimits_{j=1}^{k} \frac{2^{g_{\hat{r}_j}} - 1}{\log(1 + j)}$.
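For reference, a sketch (ours) of the three measures as defined above; `ranked_grades` is an illustrative input format (relevance grades in ranked order), and, following Eq. (21), a document counts as relevant when it carries the maximum grade.

```python
import math

def precision_at_k(ranked_grades, k, max_grade):
    # Eq. (20): fraction of the top k documents that are relevant.
    return sum(1 for g in ranked_grades[:k] if g == max_grade) / k

def average_precision(ranked_grades, max_grade):
    # Eq. (21): mean of P@j over the positions j of relevant documents.
    hits, total = 0, 0.0
    for j, g in enumerate(ranked_grades, start=1):
        if g == max_grade:
            hits += 1
            total += hits / j      # P@j at a relevant position
    return total / hits if hits else 0.0

def ndcg_at_k(ranked_grades, k):
    # Eq. (22): DCG of the ranking, normalized by the DCG of the ideal one.
    def dcg(grades):
        return sum((2 ** g - 1) / math.log(1 + j)
                   for j, g in enumerate(grades[:k], start=1))
    ideal = dcg(sorted(ranked_grades, reverse=True))
    return dcg(ranked_grades) / ideal if ideal > 0 else 0.0
```

MAP is then the mean of `average_precision` over the test queries.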

5.3 Results

We first experimentally select the appropriate loss functions from the candidates, and then run experiments with the three weighting schemes to find out how to incorporate them to achieve the best performance.

5.3.1 Appropriate pairwise loss selection

We experimentally select the appropriate pairwise loss function from the candidates. The performance of the likelihood loss for pairwise (LH) is compared with that of RankNet on OHSUMED in Table 1.

Table 1 Performance of the likelihood loss for pairwise (LH) on OHSUMED

Method   MAP     P@1     P@2     NDCG@1  NDCG@2
LH       0.4264  0.5819  0.5476  0.4736  0.4374
RankNet  0.4273  0.5728  0.5385  0.4645  0.4313

The results in Table 1 show that LH obtains the best Precision and NDCG at positions 1 and 2, and has MAP comparable with RankNet; the results at other positions are similar and are omitted here.

The input instances for pairwise approaches are far more numerous than for pointwise approaches, for the number of preference pairs is $O(n^2)$ in the extreme case. Take fold 1 of OHSUMED as an example: the number of pointwise instances is 9,177, while the number of pairwise instances is 367,663. From the definitions of the pairwise likelihood and cross entropy losses in Sect. 4.1, the likelihood loss is cheaper to compute than the cross entropy loss, so the likelihood loss is better in terms of the tradeoff between time consumption and performance. Given these nice properties, we choose the likelihood loss as our pairwise loss function.

5.3.2 Appropriate listwise loss selection

Similarly, we compare the performance of the J divergence measure for listwise (ListJ) with that of ListNet on OHSUMED; the results are shown in Table 2.

Table 2 Performance of the J divergence loss for listwise (ListJ) on OHSUMED

Method   MAP     P@1     P@2     NDCG@1  NDCG@2
ListJ    0.4469  0.6581  0.5762  0.5685  0.4843
ListNet  0.4457  0.6524  0.6093  0.5326  0.4810

It can be observed from Table 2 that ListJ performs best on most of the evaluation measures. This indicates that the J divergence measure is more suitable for the listwise loss than cross entropy.

5.3.3 Regularization weighting

The performance of the different incorporations under regularization weighting on OHSUMED is given in Table 3.

Table 3 Performance of regularization weighting on OHSUMED

Method     MAP     P@1     P@2     NDCG@1  NDCG@2
REG-PT+PR  0.4552  0.6495  0.5916  0.5345  0.4803
REG-PT+LS  0.4577  0.6886  0.6402  0.6057  0.5322
REG-PR+LS  0.4439  0.5732  0.5342  0.4837  0.4518
REG-TRI    0.4316  0.5341  0.5387  0.4135  0.4017

Here 'REG' denotes regularization weighting, PT the pointwise loss, PR the pairwise loss and LS the listwise loss; PT+PR represents the double incorporation of the pointwise and pairwise loss functions, TRI the triple incorporation of pointwise, pairwise and listwise, and the rest can be deduced by analogy. It can be observed from Table 3 that REG-PT+LS gets the best performance, followed by REG-PT+PR as second best and REG-PR+LS third, while REG-PT+PR+LS (REG-TRI) is last. This is surprising at first glance, since it differs from our original expectation: there seems to be no reason for the triple incorporation to end up with the poorest performance, because it combines all three types of loss functions and should grasp all the information the dataset contains, from which point of view it should perform best. One possible explanation is that REG-PT+PR+LS results in a larger number of parameters than REG-PT+LS, which makes it difficult to fit an effective model on OHSUMED because the dataset is relatively small. Therefore, as future work, we will attempt to further optimize the model based on REG-PT+PR+LS on other datasets.

We therefore did a further experiment assigning the losses different random weights, but it showed no sign of improvement. We also note that the two best incorporations are REG-PT+PR and REG-PT+LS, which suggests that the pointwise loss is a good supplement: the other two paradigms both consider the relative relation between objects, while only the pointwise one aims at the absolute relevance information. The incorporations of pointwise with either of the other two are heterogeneous and actually perform better, while the incorporation of pairwise and listwise is homogeneous: the pairwise and listwise loss functions in our method are both based on divergence measures between probability models, and the main difference between them lies in the number of objects in the permutation, i.e. two for pairwise, and up to the number of relevant documents for the query for listwise. The incorporation of pairwise and listwise may therefore confuse the training machine.

5.3.4 Iteration sensitive weighting

We perform experiments on iteration sensitive weighting to find the best incorporation method. $\alpha_m$ depends on the iteration number and the free variable $\delta$, and we test different settings to find the best combination. The results in Table 4 are achieved with a total of 200 iterations and $\delta = 0.01$.

Table 4 Performance of iteration sensitive weighting on OHSUMED

Method      MAP     P@1     P@2     NDCG@1  NDCG@2
ITER-PT+PR  0.4572  0.6686  0.6107  0.5406  0.4867
ITER-PT+LS  0.4450  0.7271  0.6684  0.5937  0.5416
ITER-PR+LS  0.4383  0.5651  0.5156  0.4753  0.4267
ITER-TRI    0.4316  0.5341  0.5387  0.4135  0.4017

Here 'ITER' denotes iteration sensitive weighting. It can be observed from Table 4 that a conclusion similar to that for regularization weighting can be drawn: ITER-PT+LS performs best, and the reason is similar to that for regularization weighting.

5.3.5 Relay incorporation

Inspired by the former two weighting schemes, we adopt the following two strategies: focusing on pointwise at the first stage and then shifting to pairwise; and focusing on pointwise first and following with listwise. The two strategies are compared in Table 5.

Table 5 Performance of relay incorporation on OHSUMED

Method      MAP     P@1     P@2     NDCG@1  NDCG@2
RELY-PT+PR  0.4497  0.7358  0.6638  0.6144  0.5523
RELY-PT+LS  0.4255  0.6310  0.6060  0.4978  0.4423

Here 'RELY' denotes relay incorporation. RELY-PT+PR is better than RELY-PT+LS. This may be because the number of pairwise instances is large, so a good starting point helps its training more than a brute-force search of the whole solution space.

5.3.6 Performance comparison

Here we take the best method from each of the three incorporation schemes and examine their effectiveness together. The results are shown in Figs. 1, 2 and Table 6. REG-W denotes the best regularization weighting method, ITER-W the best iteration sensitive weighting method and RELY-W the best relay incorporation method.


Fig. 1 Performance of different incorporation on OHSUMED

Fig. 2 Performance of different incorporation on TD2004

Table 6 Comparison of MAP on OHSUMED and TD2004

Method      OHSUMED  TD2004
Regression  0.4220   0.2078
FRank       0.4439   0.2388
ListNet     0.4457   0.2231
REG-W       0.4577   0.2287
ITER-W      0.4450   0.2369
RELY-W      0.4497   0.2215


It can be observed that ITER-W and RELY-W are a little better than REG-W on most of the evaluations. This is because REG-W treats the different loss functions equally, while the other two assign them different weights, and better weights enhance performance more. Besides, our three best weighting methods outperform the baselines on NDCG@k, P@k and MAP; the improvements in NDCG@k and P@k at positions 1-5 are especially significant. For example, the best method on OHSUMED outperforms the baseline ListNet by 13.73% on NDCG@1, and the best method on TD2004 outperforms ListNet by 33.33% on P@1, which is quite meaningful because in real-world search engines the user usually cares most about the topmost retrieved documents. The improvement is achieved because the combination of loss functions that measure different aspects of loss can capture more information from the dataset; in particular, the heterogeneous pointwise loss is a good complement for capturing the absolute relevance information.

5.3.7 Comparison to other similar algorithms

Since the algorithms proposed in [22, 34] also take all three paradigms into account, we compare our methods with them; the results on OHSUMED are given in Table 7. REG-SHF-SDCG is the best method of [34], and IntervalRank is the method of [22]. We conduct two-tailed paired Student's t tests to examine the significance of the improvements at the 95% confidence level (p < 0.05); a superscript + on a value denotes a significant improvement over the other algorithms.

Table 7 Comparison with similar algorithms on OHSUMED (+ denotes significant improvement, p < 0.05)

Algorithm     N@1      N@2     N@3      N@4      N@5      P@1      P@2     P@3      P@4      P@5      MAP
REG-SHF-SDCG  0.5517   0.5110  0.4802   0.4716   0.4634   0.6333   0.6152  0.5893   0.5716   0.5574   0.4506
IntervalRank  0.5628   0.5448  0.4900   0.4703   0.4609   0.6892   0.6522  0.5768   0.5556   0.5488   0.4466
REG-W         0.6057+  0.5322  0.5086   0.4942   0.4877+  0.6886   0.6402  0.6143+  0.5990   0.5859+  0.4488
ITER-W        0.5937+  0.5416  0.5110+  0.4931   0.4878+  0.7271+  0.6684  0.6305+  0.6016+  0.5862+  0.4450
RELY-W        0.6144+  0.5523  0.5174+  0.5021+  0.4963+  0.7358+  0.6638  0.6242+  0.6038+  0.5903+  0.4497

It can be observed from Table 7 that our methods outperform the other two methods on most of the evaluation measures.

6 Conclusions

In this paper, we presented a framework for loss function incorporation, and based on it we used three weighting schemes to incorporate different types of loss functions to improve ranking performance. The basic idea is to make full use of the dataset information by using multiple aspects of the input instances, and especially multiple loss functions, during the model training phase.

Meanwhile, we proposed using appropriate loss functions to better capture the information of the different types of input instances. We experimentally selected the likelihood loss as the pairwise loss function and the J divergence measure as the listwise loss function.

Then, we combined the appropriate loss functions within the incorporation framework and tested three incorporation weighting schemes. Our method was evaluated on the LETOR 3.0 dataset, and we found that, with a good weighting scheme, our methods using multiple loss functions significantly outperform the state-of-the-art methods and other similar methods.

We believe that the proposed framework is useful for designing improved ranking functions with extended weighting methods and other appropriate loss functions.

Acknowledgements This work is partially supported by grants from the Natural Science Foundation of China (Nos. 61402075, 61602078, 61572102, 61572098), the Natural Science Foundation of Liaoning Province, China (Nos. 201202031, 2014020003), the Ministry of Education Humanities and Social Science Project (No. 16YJCZH12), and the Fundamental Research Funds for the Central Universities.

References

1. Burges C, Shaked T, Renshaw E, Lazier A, Deeds M, Hamilton N, Hullender G (2005) Learning to rank using gradient descent. In: Proceedings of the ICML. ACM, pp 89–96

2. Cao Z, Qin T, Liu TY, Tsai MF, Li H (2007) Learning to rank: From pairwise approach to listwise approach. In: Proceedings of the ICML. ACM, pp 129–136

3. Cao YB, Xu J, Liu TY, Li H, Huang YL, Hon WH (2006) Adaptive ranking SVM to document retrieval. In: Proceedings of the SIGIR Conference. ACM, pp 186–193

4. Chakrabarti S, Khanna R, Sawant U, Bhattacharyya C (2008) Structured learning for non-smooth ranking losses. In: Proceedings of the SIGKDD. ACM, pp 88–96

5. Cossock D, Zhang T (2006) Subset ranking using regression. In: Proceedings of the COLT, pp 605–619

6. Crammer K, Singer Y (2002) PRanking with ranking. In: Proceedings of the NIPS 14, pp 641–647



7. Cui C, Ma J, Lian T et al (2015) Improving image annotation via ranking-oriented neighbor search and learning-based keyword propagation. J Assoc Inf Sci Technol 66(1):82–98

8. Cui C, Shen J, Chen Z et al (2017) Learning to rank images for complex queries in concept-based search. Neurocomputing

9. Freund Y, Iyer R, Schapire RE, Singer Y (2003) An efficient boosting algorithm for combining preferences. J Mach Learn Res 4:933–969

10. Fuhr N (1989) Optimum polynomial retrieval functions based on the probability ranking principle. ACM Trans Inf Syst 7:183–204

11. Haykin S (2008) Neural networks and learning machines, 3rd edn. Prentice Hall, Upper Saddle River

12. Herbrich R, Graepel T, Obermayer K (2000) Large margin rank boundaries for ordinal regression. In: Advances in large margin classifiers. MIT Press, Cambridge, pp 115–132

13. Ifada N, Nayak R (2016) How relevant is the irrelevant data: leveraging the tagging data for a learning-to-rank model. In: Web Search and Data Mining

14. Jeffreys H (1946) An invariant form for the prior probability in estimation problems. Proc R Soc Lond Ser A 186(1007):453–461

15. Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the SIGKDD. ACM, pp 133–142

16. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86

17. Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632

18. Liu TY, Xu J, Qin T, Xiong WY, Li H (2007) LETOR: benchmark collection for research on learning to rank for information retrieval. In: Proceedings of the Learning to Rank Workshop in conjunction with SIGIR. ACM SIGIR Forum 41(2):58–62

19. Liu TY (2009) Learning to rank for information retrieval. Found Trends Inf Retr 3(3):225–331

20. Lin Y, Lin H, Xu K et al (2015) Group-enhanced ranking. Neurocomputing: 99–105

21. Luce RD (1959) Individual choice behavior. Wiley, New York

22. Moon T, Smola A, Chang Y, Zheng ZH (2010) IntervalRank: isotonic regression with listwise and pairwise constraints. In: Proceedings of the WSDM, pp 151–159

23. Nallapati R (2004) Discriminative models for information retrieval. In: Proceedings of the SIGIR Conference. ACM, pp 64–71

24. Niu S, Lan Y, Guo J et al (2014) What makes data robust: a data analysis in learning to rank. In: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval

25. Page L, Brin S, Motwani R, Winograd T (1998) The pagerank citation ranking: Bringing order to the web, Technical Report, Stanford Digital Library Technologies Project

26. Qin T, Zhang XD, Tsai MF, Wang DS, Liu TY, Li H (2008) Query-level loss functions for information retrieval. Inf Process Manage 44:838–855

27. Plackett RL (1975) The analysis of permutations. Appl Stat 24:193–202

28. Robertson SE (1997) Overview of the Okapi projects. J Doc 53:3–7

29. Tax N, Bockting S, Hiemstra D et al (2015) A cross-benchmark comparison of 87 learning to rank methods. Inf Process Manage 51(6):757–772

30. Taylor M, Guiver J, Robertson S, Minka T (2008) SoftRank: optimising non-smooth rank metrics. In: Proceedings of the WSDM, pp 77–86

31. Tsai MF, Liu TY, Qin T, Chen HH, Ma WY (2007) FRank: a ranking method with fidelity loss. In: Proceedings of the SIGIR Conference, pp 383–390

32. Wang X, Xing H, Li Y et  al (2015) A study on relationship between generalization abilities and fuzziness of base classifiers in ensemble learning. IEEE Trans Fuzzy Syst 23(5):1638–1654

33. Wang XZ, Ashfaq RAR, Fu AM (2015) Fuzziness based sample categorization for classifier performance improvement. J Intell Fuzzy Syst 1–12

34. Wu M, Zha H, Zheng Z, Chang Y (2009) Smoothing DCG for learning to rank: a novel approach using smoothed hinge functions. In: Proceedings of the CIKM (Short Paper). ACM, pp 1923–1926

35. Xia F, Liu TY, Wang J, Zhang W, Li H (2008) Listwise approach to learning to rank: theory and algorithm. In: Proceedings of the ICML. ACM, pp 1192–1199

36. Xu J, Liu T-Y, Lu M, Li H, Ma W-Y (2008) Directly optimizing IR evaluation measures in learning to rank. In: Proceedings of the SIGIR Conference. ACM, pp 107–114

37. Xu B, Lin H, Lin Y et al (2015) Assessment of learning to rank methods for query expansion. Journal of the Association for Information Science and Technology

38. Yue Y, Finley T, Radlinski F, Joachims T (2007) A support vector method for optimizing average precision. In: Proceedings of the SIGIR Conference. ACM, pp 271–278

39. Zeng XJ, Zhang YK (2003) Machine learning. China Machine, pp 60–94

40. Zhai CX (2008) Statistical language models for information retrieval a critical review. Found Trends Inf Retr 2(3):137–213

41. Zhu H, Tsang ECC, Wang XZ et al (2016) Monotonic classifica-tion extreme learning machine. Neurocomputing 225(C):205–213