a general approximation framework for direct optimization of information retrieval measures


Page 1: A general approximation framework for direct optimization of information retrieval measures

A general approximation framework for direct optimization of information retrieval measures

Presenter: Shih-Hsiang Lin (林士翔)

Tao Qin, Tie-Yan Liu, Hang Li
Microsoft Research Asia, Beijing, China

Reference:
1. Joachims, T. (2002). Optimizing search engines using clickthrough data. In KDD ’02.
2. Freund, Y., et al. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
3. Burges, C., et al. (2005). Learning to rank using gradient descent. In ICML ’05.
4. Cao, Z., et al. (2007). Learning to rank: From pairwise approach to listwise approach. In ICML ’07.
5. Xu, J., & Li, H. (2007). AdaRank: A boosting algorithm for information retrieval. In SIGIR ’07.
6. He, Y., et al. (2008). Are algorithms directly optimizing IR measures really direct? Technical Report MSR-TR-2008-154, Microsoft Corporation.
7. Xia, F., et al. (2008). Listwise approach to learning to rank: Theory and algorithm. In ICML ’08.
8. Xu, J., et al. (2008). Directly optimizing evaluation measures in learning to rank. In SIGIR ’08.

Page 2: A general approximation framework for direct optimization of information retrieval measures

Recently, direct optimization of information retrieval (IR) measures has become a new trend in learning to rank
◦ IR measures are explicitly considered in the direct optimization approach
◦ Generally, these approaches can be grouped into two categories
 - introduce upper bounds of the IR measures
 - approximate the IR measures using smooth functions

Open problems
◦ The relationships between the surrogate functions and the corresponding IR measures have not been sufficiently studied
◦ Some of the proposed surrogate functions are not easy to optimize

INTRODUCTION

Page 3: A general approximation framework for direct optimization of information retrieval measures

The main contributions of this work include
◦ They set up a general framework for direct optimization; it is applicable to any position-based IR measure
◦ They take AP and NDCG as two examples to show how to optimize the position-based IR measures as surrogate functions within the framework
◦ They provide a theoretical justification for the direct optimization approach

INTRODUCTION

Page 4: A general approximation framework for direct optimization of information retrieval measures

Precision@k
◦ Evaluates the top k positions of a ranked list using two levels (relevant and irrelevant) of relevance judgment

Pre@k = (1/k) Σ_{j=1}^{k} r_j

where k denotes the truncation position, and r_j equals one if the document in the j-th position is relevant and zero otherwise

Average Precision (AP)

AP = (1/|D+|) Σ_j r_j · Pre@j

where |D+| denotes the number of relevant documents w.r.t. the query
◦ e.g., relevant docs ranked at 1, 5, 10: the precisions are 1/1, 2/5, 3/10, so AP = (1/1 + 2/5 + 3/10)/3 ≈ 0.57

MAP is defined as the mean of AP over a set of queries

REVIEW ON IR MEASURES (1/3)
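As a quick illustrative sketch (my own code, not from the paper), Pre@k and AP over binary relevance labels can be computed as:

```python
def precision_at_k(rels, k):
    """Pre@k = (1/k) * sum_{j=1}^{k} r_j for 0/1 relevance labels in rank order."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP = (1/|D+|) * sum of Pre@j over positions j where r_j = 1."""
    num_rel = sum(rels)
    return sum(precision_at_k(rels, j)
               for j, r in enumerate(rels, start=1) if r) / num_rel

# Slide example: relevant docs ranked at positions 1, 5, and 10
rels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
ap = average_precision(rels)  # (1/1 + 2/5 + 3/10) / 3
```

MAP would simply average `average_precision` over the queries of a collection.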

Page 5: A general approximation framework for direct optimization of information retrieval measures

Normalized Discounted Cumulative Gain (NDCG)
◦ It is designed for multiple levels of relevance judgments
◦ It uses graded relevance as a measure of the usefulness, or gain, from examining a document
◦ Discounted Cumulative Gain (DCG) is the total gain accumulated at a particular rank k

DCG@k = Σ_{j=1}^{k} (2^{r_j} − 1) / log2(1 + j)

◦ e.g., 10 ranked documents judged on a 0-3 relevance scale (first seven positions shown):

rank j:                 1      2      3      4      5      6      7
r_j:                    3      3      2      2      1      1      1
gain 2^{r_j} − 1:       7      7      3      3      1      1      1
discount 1/log2(1+j):   1   0.63    0.5   0.43   0.39   0.36   0.33
DCG@j:                  7  11.41  12.91   14.2  14.59  14.95  15.28

REVIEW ON IR MEASURES (2/3)
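The DCG formula can be checked against the slide's worked example with a few lines (a sketch, not the authors' code):

```python
import math

def dcg_at_k(rels, k):
    """DCG@k = sum_{j=1}^{k} (2^{r_j} - 1) / log2(1 + j) for graded relevance."""
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(rels[:k], start=1))

rels = [3, 3, 2, 2, 1, 1, 1]   # graded judgments from the slide's example
dcg = dcg_at_k(rels, 7)        # accumulates to roughly 15.28
```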

Page 6: A general approximation framework for direct optimization of information retrieval measures

◦ NDCG is defined as

NDCG@k = (1/N_k) Σ_{j=1}^{k} (2^{r_j} − 1) / log2(1 + j)

where N_k is a query-dependent constant chosen so that the maximum value of NDCG@k for that query is 1

REVIEW ON IR MEASURES (3/3)
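A minimal NDCG sketch, taking N_k as the DCG@k of the ideally sorted list, which is what makes the maximum value 1 (helper names are mine):

```python
import math

def dcg_at_k(rels, k):
    # DCG@k = sum_{j=1}^{k} (2^{r_j} - 1) / log2(1 + j)
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """NDCG@k = DCG@k / N_k, with N_k the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal

# A perfectly ordered list scores exactly 1.0; any demotion lowers the score
perfect = ndcg_at_k([3, 2, 1, 0], 4)
shuffled = ndcg_at_k([0, 3, 2, 1], 4)
```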

Page 7: A general approximation framework for direct optimization of information retrieval measures

The framework consists of four steps:
◦ Reformulating an IR measure from ‘indexed by positions’ to ‘indexed by documents’
◦ Approximating the position function with a logistic function of the ranking scores of documents
◦ Approximating the truncation function with a logistic function of the positions of documents
◦ Applying a global optimization technique to optimize the approximated measure (the surrogate function)

A GENERAL APPROXIMATION FRAMEWORK

Page 8: A general approximation framework for direct optimization of information retrieval measures

Most IR measures, for example Precision@k, AP, and NDCG, are position based
◦ The summations in the definitions of the IR measures are taken over positions
◦ The position of a document may change during the training process, which makes the optimization of the IR measures difficult

When indexed by documents, Precision@k can be re-written as

Pre@k = (1/k) Σ_{x∈X} r(x) · 1{π(x) ≤ k}

where X is the set of documents, r(x) equals one for a relevant document and zero otherwise, π(x) denotes the position of x in the ranked list π, and 1{·} is the truncation (indicator) function

STEP 1: Measure Reformulation (1/2)
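A small sketch (toy data, names invented) showing that the document-indexed form gives the same value as the position-indexed one:

```python
def pre_at_k_positional(rels, k):
    # original definition: average relevance over the top-k positions
    return sum(rels[:k]) / k

def pre_at_k_by_documents(docs, r, pi, k):
    # reformulated: Pre@k = (1/k) * sum_{x in X} r(x) * 1{pi(x) <= k}
    return sum(r[x] for x in docs if pi[x] <= k) / k

docs = ["a", "b", "c", "d"]
r = {"a": 1, "b": 0, "c": 1, "d": 0}        # relevance r(x)
pi = {"a": 2, "b": 4, "c": 1, "d": 3}       # position pi(x) in the ranked list
ranked = sorted(docs, key=lambda x: pi[x])  # the list implied by pi

k = 2
by_docs = pre_at_k_by_documents(docs, r, pi, k)
by_pos = pre_at_k_positional([r[x] for x in ranked], k)
```

The sum now runs over documents, whose identity is fixed during training, rather than over positions, which are not.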

Page 9: A general approximation framework for direct optimization of information retrieval measures

With documents as indexes, AP can be re-written as

AP = (1/|D+|) Σ_{y∈X} r(y) · Pre@π(y)

Combining the above two equations yields

AP = (1/|D+|) Σ_{y∈X} (r(y)/π(y)) · (1 + Σ_{x∈X, x≠y} r(x) · 1{π(x) < π(y)})

So far, these measures are still non-continuous and non-differentiable

STEP 1: Measure Reformulation (2/2)
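The combined document-indexed AP formula can be sanity-checked against the earlier ranked-at-1,-5,-10 example (a sketch, not the paper's code):

```python
def ap_by_documents(r, pi, num_rel):
    """AP = (1/|D+|) * sum_y r(y)/pi(y) * (1 + sum_{x != y} r(x) * 1{pi(x) < pi(y)})."""
    total = 0.0
    for y in r:
        if not r[y]:
            continue
        # count relevant documents ranked above y
        rels_above = sum(1 for x in r if x != y and r[x] and pi[x] < pi[y])
        total += (1 + rels_above) / pi[y]
    return total / num_rel

# 10 documents, doc i sitting at position i; relevant ones at 1, 5, 10
pi = {i: i for i in range(1, 11)}
r = {i: 1 if i in (1, 5, 10) else 0 for i in range(1, 11)}
ap = ap_by_documents(r, pi, 3)   # (1/1 + 2/5 + 3/10) / 3
```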

Page 10: A general approximation framework for direct optimization of information retrieval measures

The position function can be represented as a function of ranking scores

π(x) = 1 + Σ_{y∈X, y≠x} 1{s_{x,y} < 0}, where s_{x,y} = s_x − s_y

Due to the indicator function in it, the position function is still non-continuous and non-differentiable
◦ They propose approximating the indicator function 1{s_{x,y} < 0} with a logistic function

π̂(x) = 1 + Σ_{y∈X, y≠x} exp(−α·s_{x,y}) / (1 + exp(−α·s_{x,y}))

where α is a scaling constant and α > 0

STEP 2: Position Function Approximation (1/2)
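A sketch of the approximated position function (the stable-sigmoid helper and the α value are my own choices, not the paper's):

```python
import math

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def approx_positions(scores, alpha=100.0):
    """pi_hat(x) = 1 + sum_{y != x} exp(-a*s_xy) / (1 + exp(-a*s_xy)), s_xy = s_x - s_y."""
    n = len(scores)
    return [1.0 + sum(sigmoid(-alpha * (scores[x] - scores[y]))
                      for y in range(n) if y != x)
            for x in range(n)]

scores = [2.0, 0.5, 1.2]
pi_hat = approx_positions(scores)  # close to the true positions [1, 3, 2]
```

With a large α and well-separated scores, each logistic term is nearly 0 or 1, so π̂ essentially recovers the true integer positions.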

Page 11: A general approximation framework for direct optimization of information retrieval measures

Examples of position approximation (figure comparing true and approximated positions; not recoverable from the transcript)

◦ The approximation is very accurate in this case

STEP 2: Position Function Approximation (2/2)

Page 12: A general approximation framework for direct optimization of information retrieval measures

Some measures, such as Precision@k, AP, and NDCG@k, have truncation functions in their definitions; these measures need a further approximation of the truncation function

To approximate the truncation function 1{π(x) < π(y)}, a simple way is to use the logistic function once again

1{π(x) < π(y)} ≈ exp(β(π̂(y) − π̂(x))) / (1 + exp(β(π̂(y) − π̂(x))))

where β is a scaling constant and β > 0

Thus, we obtain the approximation of AP as follows

ÂP = (1/|D+|) Σ_{y∈X} (r(y)/π̂(y)) · (1 + Σ_{x∈X, x≠y} r(x) · exp(β(π̂(y) − π̂(x))) / (1 + exp(β(π̂(y) − π̂(x)))))

STEP 3: Truncation Function Approximation
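Putting steps 2 and 3 together, ApproxAP can be sketched as follows (the α and β values are illustrative defaults, not tuned as in the paper's experiments):

```python
import math

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def approx_ap(scores, rels, alpha=100.0, beta=10.0):
    """ApproxAP: positions and truncations both replaced by logistic functions."""
    n = len(scores)
    # Step 2: smooth positions pi_hat(x)
    pi_hat = [1.0 + sum(sigmoid(-alpha * (scores[x] - scores[y]))
                        for y in range(n) if y != x)
              for x in range(n)]
    # Step 3: smooth truncation 1{pi(x) < pi(y)} ~ sigmoid(beta * (pi_hat(y) - pi_hat(x)))
    num_rel = sum(rels)
    total = 0.0
    for y in range(n):
        if not rels[y]:
            continue
        inner = 1.0 + sum(rels[x] * sigmoid(beta * (pi_hat[y] - pi_hat[x]))
                          for x in range(n) if x != y)
        total += inner / pi_hat[y]
    return total / num_rel

# Scores already sorted descending; relevant docs end up at ranks 1, 5, 10
scores = [10.0 - i for i in range(10)]
rels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
surrogate = approx_ap(scores, rels)  # close to the exact AP of (1 + 2/5 + 3/10)/3
```

Unlike exact AP, this surrogate is differentiable in the scores, so gradient methods apply.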

Page 13: A general approximation framework for direct optimization of information retrieval measures

With the aforementioned approximation techniques, the surrogate objective functions become continuous and differentiable with respect to the parameters of the ranking model f(x; ·)

However, considering that the original IR measures contain many local optima, their approximations will also contain local optima
◦ One should therefore prefer global optimization methods, such as random restart and simulated annealing, in order to avoid being trapped in local optima

STEP 4: Surrogate Function Optimization (1/3)

Page 14: A general approximation framework for direct optimization of information retrieval measures

Gradient of ApproxAP
◦ The gradient of ÂP with respect to the model parameters is obtained by the chain rule (the detailed equations did not survive transcription)

STEP 4: Surrogate Function Optimization (2/3)

Page 15: A general approximation framework for direct optimization of information retrieval measures

STEP 4: Surrogate Function Optimization (3/3)

Page 16: A general approximation framework for direct optimization of information retrieval measures

In general, we would like to create a ranking model that maximizes the accuracy in terms of an IR measure on the training data,

max Σ_{i=1}^{m} E(π_i, y_i)

or, equivalently, minimizes the loss function defined as follows

min Σ_{i=1}^{m} (1 − E(π_i, y_i)) = min Σ_{i=1}^{m} (E(π_i*, y_i) − E(π_i, y_i))

where π_i is the permutation selected for query q_i, π_i* is the optimal permutation for q_i, and E(π_i, y_i) is the evaluation of π_i w.r.t. the ground truth y_i (the measure is assumed normalized so that E(π_i*, y_i) = 1)

Directly optimizing techniques try to minimize the above loss function

Comparisons with other directly optimizing techniques
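As a toy numeric check (the per-query E values are invented), the basic loss simply sums 1 − E over queries:

```python
def basic_loss(evals):
    """sum_i (1 - E(pi_i, y_i)), assuming the measure's maximum per query is 1."""
    return sum(1 - e for e in evals)

evals = [0.9, 0.75, 1.0]   # E(pi_i, y_i) for three hypothetical queries
loss = basic_loss(evals)   # 0.1 + 0.25 + 0.0
```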

Page 17: A general approximation framework for direct optimization of information retrieval measures

From the viewpoint of loss function optimization, these methods fall into three categories
◦ Minimize upper bounds of the basic loss function defined on the IR measures (AdaRank, SVMmap)
◦ Approximate the IR measures with functions that are easy to handle (this paper, SoftRank)
◦ Use specially designed technologies for optimizing the non-smooth IR measures

Comparisons with other directly optimizing techniques (cont.)

Page 18: A general approximation framework for direct optimization of information retrieval measures

Minimize upper bounds of the basic loss function
◦ Type one bound
 - the logistic function log(1 + exp(−x))
 - the exponential function exp(−x), an upper bound since e^(−x) ≥ 1 − x

(figure: curves of 1 − x, exp(−x), and log(1 + exp(−x)) plotted over [0, 1])

Comparisons with other directly optimizing techniques (cont.)

Page 19: A general approximation framework for direct optimization of information retrieval measures

◦ Type two bound
 - The loss function measures the loss incurred when the worst prediction is made
 - [[·]] is one if the condition is satisfied and zero otherwise

Comparisons with other directly optimizing techniques (cont.)

Page 20: A general approximation framework for direct optimization of information retrieval measures

Comparisons with other directly optimizing techniques (cont.)

Page 21: A general approximation framework for direct optimization of information retrieval measures

Datasets
◦ LETOR 3.0 datasets
 - a benchmark collection for research on learning to rank for information retrieval
 - TD2003, TD2004, and OHSUMED

Retrieval method
◦ A linear ranking model is used for ApproxAP and ApproxNDCG in the experiments

EXPERIMENTAL SETUP

Page 22: A general approximation framework for direct optimization of information retrieval measures

On the approximation of IR measures

Approximation error: (1/|Q|) Σ_{q∈Q} |AP(q) − ÂP(q)|

◦ The approximation accuracy is very high, and it becomes more accurate as α or β increases

EXPERIMENTAL RESULTS (1/3)
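This behavior can be illustrated on toy data (not the LETOR collections; the `approx_ap` sketch repeats the logistic construction from the framework slides):

```python
import math

def sigmoid(t):
    # numerically stable logistic function
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def approx_ap(scores, rels, alpha, beta):
    # ApproxAP with logistic positions (alpha) and logistic truncations (beta)
    n = len(scores)
    pi_hat = [1.0 + sum(sigmoid(-alpha * (scores[x] - scores[y]))
                        for y in range(n) if y != x)
              for x in range(n)]
    total = 0.0
    for y in range(n):
        if rels[y]:
            inner = 1.0 + sum(rels[x] * sigmoid(beta * (pi_hat[y] - pi_hat[x]))
                              for x in range(n) if x != y)
            total += inner / pi_hat[y]
    return total / sum(rels)

scores = [1.0, 0.8, 0.6, 0.4, 0.2]
rels = [1, 0, 1, 0, 0]
exact_ap = (1 / 1 + 2 / 3) / 2   # relevant docs at ranks 1 and 3
errors = [abs(approx_ap(scores, rels, a, a) - exact_ap)
          for a in (1.0, 10.0, 100.0)]
# the approximation error shrinks as the scaling constants grow
```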

Page 23: A general approximation framework for direct optimization of information retrieval measures

On the performance of ApproxAP
◦ Five-fold cross validation, as suggested in LETOR, for both the TD2003 and TD2004 datasets
 - α ∈ {50, 100, 150, 200, 250, 300}, β ∈ {1, 10, 20, 50, 100}
 - δ = 0.001, η = 0.01, K = 10
◦ The results clearly show the advantage of using the proposed method for direct optimization

EXPERIMENTAL RESULTS (2/3)

Page 24: A general approximation framework for direct optimization of information retrieval measures

◦ It can also be seen that AdaRank.MAP and SVMmap are not as good as Ranking SVM and ListNet
 - AdaRank.MAP and SVMmap optimize an upper bound of AP, and it is not clear whether the bound is tight
 - If the bound is very loose, optimizing the bound does not always lead to optimization of AP, so they may not perform well on some datasets

EXPERIMENTAL RESULTS (3/3)

Page 25: A general approximation framework for direct optimization of information retrieval measures

In this paper, they have set up a general framework to approximate position-based IR measures
◦ The key part of the framework is to approximate the positions of documents by logistic functions of their scores

There are several advantages to this framework
◦ The way of approximating position-based measures is simple yet general
◦ Many existing optimization techniques can be directly applied, and the optimization process itself is measure independent
◦ It is easy to analyze the accuracy of the approach, and high approximation accuracy can be achieved by setting appropriate parameters

CONCLUSIONS AND FUTURE WORK (1/2)

Page 26: A general approximation framework for direct optimization of information retrieval measures

There are still some issues that need further study
◦ The approximated measures are not convex, so there may be many local optima in training
◦ Conduct experiments to test the algorithms with other function classes

CONCLUSIONS AND FUTURE WORK (2/2)