Source: ibisml.org/ibis2009/pdf-invited/hangli1.pdf

Learning to Rank Methods

Hang Li

Microsoft Research Asia

IBIS 2009, Oct. 21, 2009

Fukuoka, Japan

1

Talk Outline

• What is Learning to Rank

• Learning to Rank Methods

– Ranking SVM

– IR SVM

– ListMLE

– AdaRank

• Learning to Rank Theory

• Learning to Rank Applications

• Future Directions of Learning to Rank Research

2

What is Learning to Rank?

3

Ranking Problem: Example = Document Retrieval

query: q

documents: D = {d_1, d_2, ..., d_N}

ranking of documents: sort the documents by a ranking function f(q, d)

  d_{i_1}: f(q, d_{i_1})
  d_{i_2}: f(q, d_{i_2})
  ...
  d_{i_N}: f(q, d_{i_N})

ranking based on relevance, importance, preference

4
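(A minimal Python sketch of this setup; the scoring function below is a toy stand-in, not one from the slides.)

```python
def rank_documents(query, documents, f):
    """Return documents sorted by descending score f(query, doc)."""
    return sorted(documents, key=lambda d: f(query, d), reverse=True)

# Toy scoring function: count how many query words occur in the document
def f(query, doc):
    return sum(doc.count(w) for w in query.split())

docs = ["learning to rank methods", "support vector machines", "rank aggregation"]
print(rank_documents("learning to rank", docs, f))
```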

Traditional Approach to Search Ranking

5

Probabilistic Model

query: q

documents: d_1, d_2, ..., d_N

ranking of documents: rank by the probability of relevance

  d_1 ~ P(r | q, d_1)
  d_2 ~ P(r | q, d_2)
  ...
  d_N ~ P(r | q, d_N)

where r denotes relevance, r ∈ R = {1, 0}, and P(r | q, d) is the probabilistic model

6

BM25

query: q

documents: d_1, d_2, ..., d_N

ranking function: BM25

  BM25(q, d) = \sum_{w \in q \cap d} \frac{(k_1 + 1) \, tf(w, d)}{k_1 \left( (1 - b) + b \cdot dl / avgdl \right) + tf(w, d)}

  where tf(w, d) is the frequency of w in d, dl is the length of d, avgdl is the average
  document length, and k_1, b are parameters

ranking of documents:

  d_1 ~ P(r | q, d_1)
  d_2 ~ P(r | q, d_2)
  ...
  d_N ~ P(r | q, d_N)

7
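(A minimal Python sketch of the term-frequency part of BM25 above; the IDF weight that is usually multiplied in per term is omitted, and the helper names are illustrative.)

```python
def bm25(query_terms, doc_terms, avgdl, k1=1.2, b=0.75):
    """Sum the BM25 term-frequency factor over query words that occur in the document."""
    dl = len(doc_terms)
    tf = {}
    for w in doc_terms:
        tf[w] = tf.get(w, 0) + 1
    score = 0.0
    for w in set(query_terms):
        if w in tf:
            score += (k1 + 1) * tf[w] / (k1 * ((1 - b) + b * dl / avgdl) + tf[w])
    return score

print(bm25(["learning", "rank"], ["learning", "to", "rank", "methods"], avgdl=10.0))
```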

Learning to Rank: New Approach to Search Ranking

8

Learning to Rank

Training data (queries with their associated documents):

  q_1: d_{1,1}, d_{1,2}, ..., d_{1,n_1}
  ...
  q_m: d_{m,1}, d_{m,2}, ..., d_{m,n_m}

    → Learning System

New query with its associated documents:

  q_{m+1}: d_{m+1,1}, d_{m+1,2}, ..., d_{m+1,n_{m+1}}

    → Ranking System

Document collection: D = {d_1, d_2, ..., d_N}

9

Learning to Rank

Training data:

  q_1: d_{1,1}, d_{1,2}, ..., d_{1,n_1}
  ...
  q_m: d_{m,1}, d_{m,2}, ..., d_{m,n_m}

    → Learning System → ranking model f(q, d)

New query q_{m+1} → Ranking System → ranking of its documents by f:

  d_{m+1,1}: f(q_{m+1}, d_{m+1,1})
  d_{m+1,2}: f(q_{m+1}, d_{m+1,2})
  ...
  d_{m+1,n_{m+1}}: f(q_{m+1}, d_{m+1,n_{m+1}})

Document collection: D = {d_1, d_2, ..., d_N}

10

Training Process

1. Data Labeling (rank)
2. Feature Extraction
3. Learning f(x)

Queries with associated documents:

  q_1: d_{1,1}, d_{1,2}, ..., d_{1,n_1}
  ...
  q_m: d_{m,1}, d_{m,2}, ..., d_{m,n_m}

After data labeling (each document receives a rank label):

  q_1: (d_{1,1}, y_{1,1}), (d_{1,2}, y_{1,2}), ..., (d_{1,n_1}, y_{1,n_1})
  ...
  q_m: (d_{m,1}, y_{m,1}), (d_{m,2}, y_{m,2}), ..., (d_{m,n_m}, y_{m,n_m})

After feature extraction (each query-document pair becomes a feature vector):

  (x_{1,1}, y_{1,1}), (x_{1,2}, y_{1,2}), ..., (x_{1,n_1}, y_{1,n_1})
  ...
  (x_{m,1}, y_{m,1}), (x_{m,2}, y_{m,2}), ..., (x_{m,n_m}, y_{m,n_m})

Learning produces the ranking model f(x)

11

Testing Process

1. Data Labeling (rank)
2. Feature Extraction
3. Ranking with f(x)
4. Evaluation

New query with associated documents:

  q_{m+1}: d_{m+1,1}, d_{m+1,2}, ..., d_{m+1,n_{m+1}}

After data labeling:

  q_{m+1}: (d_{m+1,1}, y_{m+1,1}), (d_{m+1,2}, y_{m+1,2}), ..., (d_{m+1,n_{m+1}}, y_{m+1,n_{m+1}})

After feature extraction:

  (x_{m+1,1}, y_{m+1,1}), (x_{m+1,2}, y_{m+1,2}), ..., (x_{m+1,n_{m+1}}, y_{m+1,n_{m+1}})

Ranking with f(x), evaluated against the labels:

  x_{m+1,1}: f(x_{m+1,1}), y_{m+1,1}
  x_{m+1,2}: f(x_{m+1,2}), y_{m+1,2}
  ...
  x_{m+1,n_{m+1}}: f(x_{m+1,n_{m+1}}), y_{m+1,n_{m+1}}

    → Evaluation Result

12

Notes

• Features are functions of query and document

• Query and associated documents form a group

• Groups are i.i.d. data

• Feature vectors within group are not i.i.d. data

• Ranking model is function of features

• Several data labeling methods (here labeling of rank as example)

13
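(A minimal Python sketch of how such training data can be organized, one group per query; the names and values are illustrative.)

```python
import numpy as np

# One group per query: (query_id, feature matrix X, rank labels y).
# Rows of X are the feature vectors x_{i,j}; y holds the labels y_{i,j}.
training_groups = [
    ("q1", np.array([[0.9, 0.3], [0.4, 0.1], [0.2, 0.0]]), np.array([2, 1, 0])),
    ("q2", np.array([[0.8, 0.5], [0.1, 0.2]]),             np.array([2, 0])),
]

for qid, X, y in training_groups:
    print(qid, X.shape, y)   # groups are the i.i.d. units; rows within a group are not
```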

Recent Trends on Learning to Rank

• Successfully applied to search

• Hot topic in Information Retrieval and Machine Learning

– Over 100 publications at SIGIR, ICML, NIPS, etc.

– 2 sessions at SIGIR every year

– 3 SIGIR workshops

– Special issue at Information Retrieval Journal

– LETOR benchmark dataset, over 1,000 downloads

http://research.microsoft.com/en-us/um/beijing/projects/letor/index.html

14

Issues in Learning to Rank

• Data Labeling

• Feature Extraction

• Evaluation Measure

• Learning Method (Model, Loss Function, Algorithm)

15

Data Labeling Problem

• E.g., relevance of documents w.r.t. query

16

Doc A

Doc B

Doc C

Query

Data Labeling Methods

• Labeling of Ranks

– Multiple levels (e.g., relevant, partially relevant, irrelevant)

– Widely used in IR

• Labeling of Ordered Pairs

– Ordered pairs between documents (e.g. A>B, B>C)

– Implicit relevance judgment: derived from click-through data

• Creation of List

– List (or permutation) of documents is given

– Ideal but difficult to implement

17

Feature Extraction

18

Query

  Doc A → feature vector: [ BM25, ..., PageRank, ... ]
  Doc B → feature vector: [ BM25, ..., PageRank, ... ]
  Doc C → feature vector: [ BM25, ..., PageRank, ... ]

BM25, etc.: query-document features
PageRank, etc.: document features

Example Features

• Relevance: BM25

• Relevance: proximity

• Relevance: query exactly occurs in document

• Importance: PageRank

19
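(A minimal Python sketch of feature extraction for one query-document pair; the individual feature functions are crude stand-ins for real ones such as BM25 and PageRank.)

```python
def extract_features(query, doc, pagerank):
    """Build a feature vector for a (query, document) pair."""
    q_terms, d_terms = query.split(), doc["text"].split()
    tf = sum(d_terms.count(w) for w in q_terms)        # crude relevance feature (BM25 stand-in)
    exact = 1.0 if query in doc["text"] else 0.0       # query exactly occurs in document
    return [tf, exact, pagerank[doc["url"]]]           # query-document features + document feature

doc = {"url": "http://example.com", "text": "learning to rank methods for search"}
print(extract_features("learning to rank", doc, pagerank={"http://example.com": 0.7}))
```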

Evaluation Measures

• Important to rank top results correctly

• Measures

– NDCG (Normalized Discounted Cumulative Gain)

– MAP (Mean Average Precision)

– MRR (Mean Reciprocal Rank)

– WTA (Winners Take All)

– Kendall’s Tau

20

NDCG

• Evaluating ranking using labeled ranks

• NDCG at position j:

  NDCG(j) = n_j \sum_{i=1}^{j} \frac{2^{r(i)} - 1}{\log(1 + i)}

  where r(i) is the rank label of the document at position i and n_j is a normalizing
  factor chosen so that a perfect ranking has NDCG(j) = 1

21

NDCG (cont’)

• Example: perfect ranking

  – (3, 3, 2, 2, 1, 1, 1) rank labels r = 3, 2, 1
  – (7, 7, 3, 3, 1, 1, 1) gain: 2^{r(j)} - 1
  – (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) position discount: 1 / \log(1 + j)
  – (7, 11.42, 12.92, ...) DCG: \sum_{i=1}^{j} (2^{r(i)} - 1) / \log(1 + i)
  – (1/7, 1/11.42, 1/12.92, ...) normalizing factor n_j
  – (1, 1, 1, 1, 1, 1, 1) NDCG for the perfect ranking

22

NDCG (cont’)

• Example: imperfect ranking

  – (2, 3, 2, 3, 1, 1, 1) rank labels
  – (3, 7, 3, 7, 1, 1, 1) gain
  – (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) position discount
  – (3, 7.42, 8.92, ...) DCG
  – (0.43, 0.65, 0.69, ...) NDCG

• An imperfect ranking decreases NDCG

23
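(A minimal Python sketch of NDCG at position j under the gain and base-2 log discount above; the example labels are the imperfect ranking from the slide.)

```python
import math

def ndcg_at(labels, j):
    """NDCG at position j for rank labels listed in ranked order."""
    def dcg(ls):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ls[:j]))
    ideal = dcg(sorted(labels, reverse=True))   # DCG of the perfect ranking
    return dcg(labels) / ideal if ideal > 0 else 0.0

imperfect = [2, 3, 2, 3, 1, 1, 1]
print(round(ndcg_at(imperfect, 1), 2))  # 0.43
print(round(ndcg_at(imperfect, 3), 2))  # 0.69
```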

Relations with Other Learning Tasks

• No need to predict category

vs Classification

• No need to predict value of f(q, d)

vs Regression

• Relative ranking order is more important

vs Ordinal regression

• Learning to rank can be approximated by classification, regression, ordinal regression

24

Ordinal Regression (Ordinal Classification)

• Categories are ordered

– 5, 4, 3, 2, 1

– e.g., rating restaurants

• Prediction

– Map to ordered categories

25

Learning to Rank Methods

26

Learning to Rank Methods

• Pointwise Approach

– Subset Ranking [Cossock and Zhang, 2006]: Regression

– SVM [Nallapati, 2004]: Binary Classification Using SVM

– McRank [Li et al 2007]: Multi-Class Classification Using Boosting Tree

– PRank [Crammer and Singer 2002]: Ordinal Regression Using Perceptron

– Large Margin [Shashua & Levin 2002]: Ordinal Regression Using SVM

27

Learning to Rank Methods

• Pairwise Approach

  – Ranking SVM: Pairwise Classification Using SVM
  – RankBoost [Freund et al 2003]: Pairwise Classification Using Boosting
  – RankNet [Burges et al 2005]: Pairwise Classification Using Neural Net
  – Frank [Tsai et al 2007]: Pairwise Classification Using Fidelity Loss and Neural Net
  – GBRank [Zheng et al 2007]: Pairwise Regression Using Boosting Tree
  – IR SVM [Cao et al 2006]: Cost-sensitive Pairwise Classification Using SVM
  – Multiple SVMs [Qin et al 2007]: Multiple SVMs

28

Learning to Rank Methods

• Listwise Approach

  – ListNet [Cao et al 2007]: Probabilistic Ranking Model
  – ListMLE [Xia et al 2008]: Probabilistic Ranking Model
  – AdaRank [Xu and Li 2007]: Direct Optimization of Evaluation Measure
  – SVM-MAP [Yue et al 2007]: Direct Optimization of Evaluation Measure
  – PermuRank [Xu et al 2008]: Direct Optimization of Evaluation Measure
  – SoftRank [Taylor et al 2008]: Approximation of Evaluation Measure
  – LambdaRank [Burges et al 2007]: Using Implicit Loss Function

29

Learning to Rank Methods

• Other Methods

– K-Nearest Neighbor Ranker [Geng et al 2008]

– Semi-Supervised Learning [Jin et al 2008]

30

Evaluation Results

• Pairwise approach and listwise approach perform better than pointwise approach

• Listwise approach performs better than pairwise approach in most cases

• Listwise approach – ListMLE, ListNet, AdaRank, PermuRank, SVM-MAP

• Pairwise approach – Ranking SVM, RankNet, RankBoost

• Pointwise approach – Linear Regression

31

Ranking SVM

32

Transforming Ranking to Pairwise Classification

• Input space: X

• Ranking function: f: X → R

• Ranking: x_i ≻ x_j  ⇔  f(x_i; w) > f(x_j; w)

• Linear ranking function: f(x; w) = ⟨w, x⟩

• Transforming to pairwise classification:

  x_i ≻ x_j  ⇔  ⟨w, x_i - x_j⟩ > 0

  (x_i - x_j, z),  z = +1 if x_i ≻ x_j,  z = -1 if x_j ≻ x_i

33

Ranking Problem

rank 1: x_1
rank 2: x_2
rank 3: x_3

34

Transformed Pairwise Classification Problem

Positive examples (+1): x_1 - x_3,  x_2 - x_3,  x_1 - x_2

Negative examples (-1): x_3 - x_1,  x_3 - x_2,  x_2 - x_1

Separating function: f(x; w)

35

Ranking SVM

• Pairwise classification on differences of feature vectors

• Corresponding positive and negative examples

• Negative examples are redundant and can be discarded

• Hyperplane passes through the origin

• Soft Margin and Kernel can be used

• Ranking SVM = pairwise classification SVM

36

Learning of Ranking SVM

Hinge-loss formulation:

  \min_w \sum_{i=1}^{l} \left[ 1 - z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2,
  \quad [s]_{+} = \max(0, s)

Equivalent quadratic programming formulation (soft-margin SVM, with \lambda = 1/(2C)):

  \min_{w, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i

  s.t. \ z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l

37
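(A minimal Python sketch of Ranking SVM as pairwise classification, using scikit-learn's LinearSVC as the underlying SVM; the toy data and parameters are illustrative, and fit_intercept=False keeps the hyperplane through the origin.)

```python
import numpy as np
from sklearn.svm import LinearSVC

def make_pairs(X, y):
    """Turn one query's feature vectors X and grades y into pairwise differences."""
    diffs, labels = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                      # x_i should rank above x_j
                diffs.append(X[i] - X[j]); labels.append(+1)
                diffs.append(X[j] - X[i]); labels.append(-1)
    return np.array(diffs), np.array(labels)

# Toy data: one query, 4 documents, 3 features, grades 2 > 1 > 0
X = np.array([[3.0, 1.0, 0.2], [2.0, 0.5, 0.1], [1.0, 0.2, 0.0], [0.5, 0.1, 0.0]])
y = np.array([2, 1, 1, 0])

pairs, signs = make_pairs(X, y)
svm = LinearSVC(fit_intercept=False, C=1.0)      # hyperplane through the origin
svm.fit(pairs, signs)

w = svm.coef_.ravel()
scores = X @ w                                   # rank documents by <w, x>
print(np.argsort(-scores))                       # document indices in predicted rank order
```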

IR SVM

38

Problems with Ranking SVM

• Not sufficient emphasis on correct ranking at the top
  r: relevant, p: partially relevant, i: irrelevant
  ranking 1: p r p i i i i
  ranking 2: r p i p i i i
  Ranking 2 should be better than ranking 1; Ranking SVM views them as the same

• Numbers of pairs vary according to queries
  q1: r p p i i i i
  q2: r r p p p i i i i i
  q1 pairs: 2*(r,p) + 4*(r,i) + 8*(p,i) = 14
  q2 pairs: 6*(r,p) + 10*(r,i) + 15*(p,i) = 31
  Ranking SVM is biased toward q2

39

IR SVM

• Solving the two problems of Ranking SVM

• Higher weight τ_{k(i)} on important rank pairs

• Normalization weight μ_{q(i)} on pairs within each query

• IR SVM = Ranking SVM using a modified hinge loss

40

Modified Hinge Loss function

  \min_w \sum_{i=1}^{l} \tau_{k(i)} \, \mu_{q(i)} \left[ 1 - z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2

[Figure: the modified hinge loss plotted against z f(x^{(1)} - x^{(2)}), with steeper slopes for more heavily weighted pair types]

41

Learning of IR SVM

Modified hinge loss:

  \min_w \sum_{i=1}^{l} \tau_{k(i)} \, \mu_{q(i)} \left[ 1 - z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_{+} + \lambda \|w\|^2

Equivalent quadratic programming formulation:

  \min_{w, \xi} \frac{1}{2} \|w\|^2 + \sum_{i=1}^{l} C_i \xi_i

  s.t. \ z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l,
  \qquad C_i = \tau_{k(i)} \, \mu_{q(i)} \, C

42
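(A minimal Python sketch of the cost-sensitive idea: the same pairwise construction as before, with τ_{k(i)} μ_{q(i)} passed as per-example weights to a linear SVM; the tau values and data are illustrative.)

```python
import numpy as np
from sklearn.svm import LinearSVC

def irsvm_pairs(groups, tau):
    """Build weighted pairwise examples from several query groups.

    groups: list of (X, y) per query; tau maps (higher_grade, lower_grade) to a
    pair-importance weight (illustrative stand-in for tau_{k(i)}).
    """
    diffs, signs, weights = [], [], []
    for X, y in groups:
        pairs = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]
        mu = 1.0 / len(pairs)                            # normalization weight per query
        for i, j in pairs:
            w_pair = tau[(int(y[i]), int(y[j]))] * mu
            diffs += [X[i] - X[j], X[j] - X[i]]
            signs += [+1, -1]
            weights += [w_pair, w_pair]
    return np.array(diffs), np.array(signs), np.array(weights)

# Toy data: two queries with grades 2 (r), 1 (p), 0 (i); higher tau on pairs involving r
groups = [
    (np.array([[2.0, 1.0], [1.0, 0.5], [0.2, 0.1]]), np.array([2, 1, 0])),
    (np.array([[1.5, 0.9], [0.4, 0.2], [0.3, 0.2]]), np.array([2, 0, 0])),
]
tau = {(2, 1): 1.0, (2, 0): 0.7, (1, 0): 0.3}

X_pairs, z, sw = irsvm_pairs(groups, tau)
svm = LinearSVC(fit_intercept=False, C=1.0)
svm.fit(X_pairs, z, sample_weight=sw)                    # cost-sensitive pairwise SVM
print(svm.coef_.ravel())
```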

ListMLE

43

Plackett-Luce Model (Permutation Probability)

• Probability of permutation π is defined as

  P(\pi \mid s) = \prod_{i=1}^{n} \frac{ s_{\pi(i)} }{ \sum_{j=i}^{n} s_{\pi(j)} }

  where s_{\pi(i)} is the score of the object ranked at position i

• Example:

  P(ABC \mid s) = \frac{s_A}{s_A + s_B + s_C} \cdot \frac{s_B}{s_B + s_C} \cdot \frac{s_C}{s_C}

  = P(A ranked No.1) · P(B ranked No.2 | A ranked No.1) · P(C ranked No.3 | A ranked No.1, B ranked No.2)

44

Properties of Plackett-Luce Model

• Objects: ABC

• Scores: s_A = 5, s_B = 3, s_C = 1

• Property 1: P(ABC) is largest, P(CBA) is smallest

• Property 2: swap B and C in ABC, then P(ABC) > P(ACB)

45
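(A minimal Python check of the two properties, using the scores above.)

```python
from itertools import permutations

def pl_prob(perm, s):
    """Plackett-Luce probability of a permutation given object scores s."""
    prob, remaining = 1.0, list(perm)
    for obj in perm:
        prob *= s[obj] / sum(s[o] for o in remaining)
        remaining.remove(obj)
    return prob

s = {"A": 5, "B": 3, "C": 1}
probs = {"".join(p): pl_prob(p, s) for p in permutations("ABC")}
print(max(probs, key=probs.get), min(probs, key=probs.get))  # ABC CBA  (Property 1)
print(probs["ABC"] > probs["ACB"])                           # True     (Property 2)
```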

Plackett-Luce Model (Top-k Probability)

• Computation of permutation probabilities is intractable

• Top-k probability

– Defining Top-k subgroup G(o1…ok) containing all permutations whose top-k objects are o1, …, ok

  P(G(o_1, \ldots, o_k)) = \prod_{i=1}^{k} \frac{ s_{o_i} }{ \sum_{j=i}^{n} s_{o_j} }

– Time complexity of computation: from n! to n!/(n-k)!

• Example:

  P(G(A)) = \frac{s_A}{s_A + s_B + s_C}

46

ListMLE

• Parameterized Plackett-Luce Model

• Maximum Likelihood Estimation

  s = \exp( f(x; w) )

  P(G(x_{\pi(1)}, \ldots, x_{\pi(k)})) = \prod_{i=1}^{k} \frac{ \exp( f(x_{\pi(i)}; w) ) }{ \sum_{j=i}^{n} \exp( f(x_{\pi(j)}; w) ) }

  L(w) = \sum_{q \in Q} \log \prod_{i=1}^{k} \frac{ \exp( f(x_{\pi(i)}; w) ) }{ \sum_{j=i}^{n} \exp( f(x_{\pi(j)}; w) ) }

  (the likelihood of the labeled rankings, maximized over w)

47
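(A minimal numpy sketch of the ListMLE objective for one labeled ranking, full-permutation case k = n, with a linear scoring function; the data and learning rate are illustrative.)

```python
import numpy as np

def listmle_loss_and_grad(w, X, order):
    """Negative log-likelihood of one labeled ranking under Plackett-Luce with
    scores exp(<w, x>), plus its gradient.

    X: feature matrix (n_docs, n_features); order: document indices from best to worst.
    """
    Xo = X[order]                          # feature vectors in labeled rank order
    s = Xo @ w                             # f(x; w) = <w, x>
    loss, grad = 0.0, np.zeros_like(w)
    for i in range(len(order)):
        tail = s[i:]                       # scores of documents ranked at i or below
        m = tail.max()                     # stabilize the log-sum-exp
        lse = m + np.log(np.exp(tail - m).sum())
        loss += lse - s[i]
        p = np.exp(tail - lse)             # Plackett-Luce probabilities at step i
        grad += Xo[i:].T @ p - Xo[i]
    return loss, grad

# Toy example: 3 documents, 2 features, labeled ranking doc2 > doc0 > doc1
X = np.array([[1.0, 0.5], [0.2, 0.1], [2.0, 1.5]])
w = np.zeros(2)
for _ in range(200):                       # plain gradient descent
    loss, g = listmle_loss_and_grad(w, X, order=[2, 0, 1])
    w -= 0.1 * g
print(w, X @ w)                            # learned weights and document scores
```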

AdaRank

48

Listwise Loss

49

Training data: for each query, the feature vectors and labels of its documents form one list

  q_1: (x_{1,1}, y_{1,1}), (x_{1,2}, y_{1,2}), ..., (x_{1,n_1}, y_{1,n_1})
  ...
  q_m: (x_{m,1}, y_{m,1}), (x_{m,2}, y_{m,2}), ..., (x_{m,n_m}, y_{m,n_m})

The loss function is defined on whole lists rather than on single documents or document pairs

AdaRank

• Optimizing exponential loss function

• Algorithm: AdaBoost-like algorithm for ranking

50

Loss Function of AdaRank

51

Any evaluation measure taking value between [-1, +1]

AdaRank Algorithm

52
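(The algorithm itself is given in [Xu and Li 2007]; below is a hedged Python sketch in the same AdaBoost-like spirit, with single features as weak rankers and the evaluation measure E supplied by the caller, e.g. the NDCG function above rescaled to [-1, +1]. The names are illustrative, not the paper's notation.)

```python
import numpy as np

def adarank(groups, eval_measure, n_rounds=50):
    """AdaBoost-like ranker: greedily combine weak rankers to optimize an evaluation measure.

    groups: list of (X, y) per query; eval_measure(scores, y) -> value in [-1, +1].
    Weak rankers are single features; the output is a weighted feature combination.
    """
    m, n_features = len(groups), groups[0][0].shape[1]
    P = np.full(m, 1.0 / m)                      # weight of each query
    alphas, picked = [], []

    def per_query(score_fn):
        return np.array([eval_measure(score_fn(X), y) for X, y in groups])

    for _ in range(n_rounds):
        # choose the weak ranker (feature) doing best on the currently weighted queries
        k = int(np.argmax([P @ per_query(lambda X, k=k: X[:, k]) for k in range(n_features)]))
        E = per_query(lambda X: X[:, k])
        alpha = 0.5 * np.log((P @ (1 + E) + 1e-12) / (P @ (1 - E) + 1e-12))
        picked.append(k); alphas.append(alpha)

        # put more weight on queries where the combined ranker still does poorly
        combined = per_query(lambda X: sum(a * X[:, j] for a, j in zip(alphas, picked)))
        P = np.exp(-combined); P /= P.sum()
    return picked, alphas
```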

Theoretical Results on AdaRank

• Training error will be continuously reduced during the learning phase.

53

Learning to Rank Theory

54

Learning to Rank Theory

• Pairwise Approach

– Generalization Analysis [Lan et al 2008]

• Listwise Approach

– Generalization Analysis [Lan et al 2009]

– Consistency Analysis [Xia et al 2008]

55

Learning to Rank Applications

56

Learning to Rank Applications

• Search [Burges et al 2005]

• Collaborative Filtering [Freund et al 2003]

• Key Phrase Extraction [Jiang et al 2009]

57

Collaborative Filtering

58

        Item1   Item2   Item3   ...
User1     5       4
User2     1       2       2
...       ?       ?       ?
UserM     4       3

Future Directions of Learning to Rank Research

59

New Issues to be Further Studied

• Learning from implicit data

– Automatically generate labeled data from implicit feedback

• Model (feature) learning

– Automatically learn features such as BM25

• Global ranking

– Using features of current document as well as relations with other documents

60

New Issues to be Further Studied (cont’)

• Query-dependent ranking

– Creating different ranking models for different queries (in search)

• New applications

– Machine Translation

61

Takeaway Message

• Learning to Rank = Machine Learning Task

• Different from classification, regression, ordinal regression

• Learning to Rank has been successfully applied to search

• Existing approaches: pointwise, pairwise, listwise

• Many open problems

62

Contact: [email protected]

63