
Page 1: Learning to Rank (source: wnzhang.net/teaching/ee448/slides/9-learning-to-rank.pdf)

Learning to Rank

Weinan Zhang
Shanghai Jiao Tong University
http://wnzhang.net

2019 EE448, Big Data Mining, Lecture 9
http://wnzhang.net/teaching/ee448/index.html

Page 2:

Content of This Course

• Another ML problem: ranking

• Learning to rank
  • Pointwise methods
  • Pairwise methods
  • Listwise methods

Page 3:

• Ranking problem
• Learning to rank
  • Pointwise methods
  • Pairwise methods
  • Listwise methods

Sincere thanks to Dr. Tie-Yan Liu

Page 4:

The Probability Ranking Principle

• https://nlp.stanford.edu/IR-book/html/htmledition/the-probability-ranking-principle-1.html

Page 5:

Regression and Classification

• Two major problems for supervised learning: regression and classification

• Supervised learning objective:

  \min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_\theta(x_i))

• Regression loss:

  L(y_i, f_\theta(x_i)) = \frac{1}{2} (y_i - f_\theta(x_i))^2

• Classification loss (cross entropy):

  L(y_i, f_\theta(x_i)) = -y_i \log f_\theta(x_i) - (1 - y_i) \log(1 - f_\theta(x_i))
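The two losses and the averaged training objective above can be written out directly; a minimal sketch in plain Python (the function names are our own, for illustration only):

```python
import math

def squared_loss(y, f_x):
    """Regression loss: L = 1/2 * (y - f(x))^2."""
    return 0.5 * (y - f_x) ** 2

def log_loss(y, f_x):
    """Binary classification cross-entropy loss for y in {0, 1}."""
    return -y * math.log(f_x) - (1 - y) * math.log(1 - f_x)

def empirical_risk(loss, ys, preds):
    """Supervised learning objective: average loss over N examples."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)
```

For example, `squared_loss(3.0, 1.0)` gives 2.0, and `empirical_risk` of a perfect regressor is 0.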

Page 6:

Learning to Rank Problem

• Input: a set of instances

  X = \{x_1, x_2, \ldots, x_n\}

• Output: a ranked list of these instances

  Y = \{x_{r_1}, x_{r_2}, \ldots, x_{r_n}\}

• Ground truth: a correct ranking of these instances

  Y = \{x_{y_1}, x_{y_2}, \ldots, x_{y_n}\}

Page 7:

A Typical Application

• Webpage ranking given a query

• Page ranking

Page 8:

Webpage Ranking

• Indexed document repository D = \{d_i\}
• A query q is issued to the ranking model, which returns a ranked list of documents

Example: query = "ML in China"

  d_{q_1} = https://www.crunchbase.com
  d_{q_2} = https://www.reddit.com
  ...
  d_{q_n} = https://www.quora.com

Page 9:

Model Perspective

• In most existing work, learning to rank is defined as having the following two properties:
  • Feature-based: each instance (e.g. a query-document pair) is represented with a list of features
  • Discriminative training: estimate the relevance y_i = f_\theta(x_i) of each query-document pair, then rank the documents by these estimates

Page 10:

Learning to Rank

• Input: features of the query and documents
  • Query, document, and combination features
• Output: the documents ranked by a scoring function y_i = f_\theta(x_i)
• Objective: relevance of the ranked list
  • Evaluation metrics: NDCG, MAP, MRR, ...
• Training data: the query-document features and relevance ratings

Page 11:

Training Data

• The query-document features and relevance ratings (Query = "ML in China")

  Rating | Document                 | Query Length | Doc PageRank | Doc Length | Title Rel. | Content Rel.
  -------+--------------------------+--------------+--------------+------------+------------+-------------
  3      | d1=http://crunchbase.com | 0.30         | 0.61         | 0.47       | 0.54       | 0.76
  5      | d2=http://reddit.com     | 0.30         | 0.81         | 0.76       | 0.91       | 0.81
  4      | d3=http://quora.com      | 0.30         | 0.86         | 0.56       | 0.96       | 0.69

• Query Length is a query feature; Doc PageRank and Doc Length are document features; Title Rel. and Content Rel. are query-document features

Page 12:

Learning to Rank Approaches

• Learn (not define) a scoring function to optimally rank the documents given a query

• Pointwise: predict the absolute relevance (e.g. RMSE)
• Pairwise: predict the relative order of a document pair (e.g. AUC)
• Listwise: predict the ranking of a document list (e.g. cross entropy)

Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer, 2011.
http://www.cda.cn/uploadfile/image/20151220/20151220115436_46293.pdf

Page 13:

Pointwise Approaches

• Predict the expert ratings, as a regression problem:

  y_i = f_\theta(x_i)

  \min_\theta \frac{1}{2N} \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2

Query = "ML in China"

  Rating | Document                 | Query Length | Doc PageRank | Doc Length | Title Rel. | Content Rel.
  -------+--------------------------+--------------+--------------+------------+------------+-------------
  3      | d1=http://crunchbase.com | 0.30         | 0.61         | 0.47       | 0.54       | 0.76
  5      | d2=http://reddit.com     | 0.30         | 0.81         | 0.76       | 0.91       | 0.81
  4      | d3=http://quora.com      | 0.30         | 0.86         | 0.56       | 0.96       | 0.69
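As a sketch of this pointwise recipe, the snippet below fits a linear scorer f_\theta(x) = \theta \cdot x by gradient descent on the squared loss and then ranks documents by score. The one-feature toy data and the function names are our own, not from the slides:

```python
def fit_pointwise(xs, ys, lr=0.1, epochs=500):
    """Fit f_theta(x) = theta . x by minimizing
    1/(2N) * sum_i (y_i - f_theta(x_i))^2 with gradient descent."""
    d, n = len(xs[0]), len(xs)
    theta = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            err = sum(t * xi for t, xi in zip(theta, x)) - y
            for k in range(d):
                grad[k] += err * x[k] / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def rank(xs, theta):
    """Return document indices sorted by descending predicted score."""
    scores = [sum(t * xi for t, xi in zip(theta, x)) for x in xs]
    return sorted(range(len(xs)), key=lambda i: -scores[i])

# Toy data: rating grows with the single feature, so theta converges to ~1
xs, ys = [[1.0], [3.0], [2.0]], [1.0, 3.0, 2.0]
theta = fit_pointwise(xs, ys)
```

Here `rank(xs, theta)` returns `[1, 2, 0]`: document 1 (rating 3) first, document 0 (rating 1) last.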

Page 14:

Point Accuracy != Ranking Accuracy

• The same squared error can lead to different rankings

[Figure: Doc 1 has ground-truth relevance 3 and Doc 2 has ground-truth relevance 4; the prediction pairs (2.4, 4.6) and (3.6, 3.4) have the same squared error, but the latter interchanges the ranking.]
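This point can be checked numerically. The predictions below are read off the figure (each is 0.6 away from its target), so treat the exact numbers as illustrative:

```python
def mse(preds, truth):
    """Mean squared error between predictions and ground-truth relevance."""
    return sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)

truth  = [3.0, 4.0]   # ground-truth relevance of Doc 1 and Doc 2
pred_a = [2.4, 4.6]   # each prediction off by 0.6; ordering preserved
pred_b = [3.6, 3.4]   # each prediction off by 0.6; ordering flipped

# Identical point-level error, different list-level quality
assert abs(mse(pred_a, truth) - mse(pred_b, truth)) < 1e-9
assert (pred_a[0] < pred_a[1]) and not (pred_b[0] < pred_b[1])
```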

Page 15:

Pairwise Approaches

• Do not care about the absolute relevance, only the relative preference within a document pair
• A binary classification problem

• For query q^{(i)}, transform the rated document list

  \left[ (d^{(i)}_1, 5),\; (d^{(i)}_2, 3),\; \ldots,\; (d^{(i)}_{n^{(i)}}, 2) \right]

into the set of preference pairs

  \left\{ (d^{(i)}_1, d^{(i)}_2),\; (d^{(i)}_1, d^{(i)}_{n^{(i)}}),\; \ldots,\; (d^{(i)}_2, d^{(i)}_{n^{(i)}}) \right\}

since 5 > 3, 5 > 2, ..., and 3 > 2.
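The transformation from a rated list to preference pairs can be sketched in a few lines; `to_pairs` below is a hypothetical helper of our own, not code from any ranking library:

```python
from itertools import combinations

def to_pairs(docs):
    """Turn a rated document list [(doc_id, rating), ...] into
    (winner, loser) preference pairs for pairwise training."""
    pairs = []
    for (d1, r1), (d2, r2) in combinations(docs, 2):
        if r1 > r2:
            pairs.append((d1, d2))
        elif r2 > r1:
            pairs.append((d2, d1))
        # equal ratings generate no pair
    return pairs

docs = [("d1", 5), ("d2", 3), ("d3", 2)]
# → [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]
```

Note that a list of n documents yields up to n(n-1)/2 pairs, which is why pairwise training sets grow quadratically in the list length.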

Page 16:

Binary Classification for Pairwise Ranking

• Given a query q and a pair of documents (d_i, d_j)

• Target probability:

  y_{i,j} = \begin{cases} 1 & \text{if } d_i \succ d_j \\ 0 & \text{otherwise} \end{cases}

• Modeled probability:

  P_{i,j} = P(d_i \succ d_j \mid q) = \frac{\exp(o_{i,j})}{1 + \exp(o_{i,j})}, \quad o_{i,j} \equiv f(x_i) - f(x_j)

  where x_i is the feature vector of (q, d_i)

• Cross entropy loss:

  L(q, d_i, d_j) = -y_{i,j} \log P_{i,j} - (1 - y_{i,j}) \log(1 - P_{i,j})
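Since exp(o)/(1 + exp(o)) is just the sigmoid of o_{i,j}, the modeled probability and the pairwise loss can be sketched directly (function names are our own):

```python
import math

def pair_prob(f_xi, f_xj):
    """P(d_i > d_j | q) = sigmoid(o_ij), with o_ij = f(x_i) - f(x_j)."""
    o_ij = f_xi - f_xj
    return 1.0 / (1.0 + math.exp(-o_ij))

def pair_loss(y_ij, f_xi, f_xj):
    """Cross entropy between the target y_ij in {0, 1}
    and the modeled probability P_ij."""
    p = pair_prob(f_xi, f_xj)
    return -y_ij * math.log(p) - (1 - y_ij) * math.log(1 - p)
```

Equal scores give `pair_prob` = 0.5 (and loss log 2); as f(x_i) grows past f(x_j), the probability approaches 1 and the loss for y_{i,j} = 1 approaches 0.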

Page 17:

RankNet

• The scoring function f_\theta(x_i) is implemented by a neural network

• Modeled probability:

  P_{i,j} = P(d_i \succ d_j \mid q) = \frac{\exp(o_{i,j})}{1 + \exp(o_{i,j})}, \quad o_{i,j} \equiv f(x_i) - f(x_j)

• Cross entropy loss:

  L(q, d_i, d_j) = -y_{i,j} \log P_{i,j} - (1 - y_{i,j}) \log(1 - P_{i,j})

• Gradient by the chain rule (backpropagation in the neural network):

  \frac{\partial L(q, d_i, d_j)}{\partial \theta}
    = \frac{\partial L(q, d_i, d_j)}{\partial P_{i,j}}
      \frac{\partial P_{i,j}}{\partial o_{i,j}}
      \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)

Burges, Christopher, et al. "Learning to rank using gradient descent." ICML 2005.
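For the sigmoid/cross-entropy pair, the first two chain-rule factors collapse to P_{i,j} - y_{i,j}. Assuming a linear scorer f_\theta(x) = \theta \cdot x as a stand-in for RankNet's neural network, one SGD step can be sketched as follows (names are our own):

```python
import math

def ranknet_grad(theta, x_i, x_j, y_ij):
    """Gradient of the pairwise cross-entropy loss for a linear scorer
    f_theta(x) = theta . x. By the chain rule,
    dL/dtheta = (P_ij - y_ij) * (x_i - x_j)."""
    o_ij = sum(t * (a - b) for t, a, b in zip(theta, x_i, x_j))
    p_ij = 1.0 / (1.0 + math.exp(-o_ij))
    return [(p_ij - y_ij) * (a - b) for a, b in zip(x_i, x_j)]

def sgd_step(theta, x_i, x_j, y_ij, lr=0.1):
    """One stochastic gradient descent update on a single pair."""
    g = ranknet_grad(theta, x_i, x_j, y_ij)
    return [t - lr * gk for t, gk in zip(theta, g)]
```

At theta = 0 the pair probability is 0.5, so for a preferred pair (y_{i,j} = 1) the gradient pushes theta toward scoring x_i above x_j.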

Page 18:

Shortcomings of Pairwise Approaches

• Each document pair is regarded with the same importance

[Figure: ranked documents with ratings; the same pair-level error yields different list-level error.]

Page 19:

Ranking Evaluation Metrics

• Precision@k for query q:

  P@k = \frac{\#\{\text{relevant documents in top } k \text{ results}\}}{k}

• For binary labels:

  y_i = \begin{cases} 1 & \text{if } d_i \text{ is relevant to } q \\ 0 & \text{otherwise} \end{cases}

• Average precision for query q, where i(k) is the document id at the k-th position:

  AP = \frac{\sum_k P@k \cdot y_{i(k)}}{\#\{\text{relevant documents}\}}

• Example: for the binary relevance list (1, 0, 1, 0, 1),

  AP = \frac{1}{3} \cdot \left( \frac{1}{1} + \frac{2}{3} + \frac{3}{5} \right)

• Mean average precision (MAP): the average of AP over all queries
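Both metrics are a few lines of code; `rels` below is the binary relevance of a ranked list, top position first (function names are our own):

```python
def precision_at_k(rels, k):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP: average of P@k over the positions k holding relevant documents."""
    hits, total = 0, 0.0
    for k, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / k
    return total / sum(rels)

# The slide's example list (1, 0, 1, 0, 1):
# AP = (1/1 + 2/3 + 3/5) / 3
```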

Page 20:

Ranking Evaluation Metrics

• Normalized discounted cumulative gain (NDCG@k) for query q:

  NDCG@k = Z_k \sum_{j=1}^{k} \frac{2^{y_{i(j)}} - 1}{\log(j + 1)}

  where 2^{y_{i(j)}} - 1 is the gain, 1/\log(j + 1) is the discount, and Z_k is the normalizer

• For graded labels, e.g. y_i \in \{0, 1, 2, 3, 4\}
• i(j) is the document id at the j-th position
• Z_k is set to normalize the DCG of the ground-truth ranking to 1
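A direct implementation of the formula above, assuming the common base-2 logarithm for the discount (the slide leaves the base unspecified); `labels` are the graded relevances in ranked order:

```python
import math

def dcg_at_k(labels, k):
    """DCG@k with gain 2^y - 1 and discount log2(j + 1)."""
    return sum((2 ** y - 1) / math.log2(j + 1)
               for j, y in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k):
    """Normalize by the DCG of the ideal (descending-label) ranking,
    so the ground-truth ordering scores exactly 1."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list such as `[3, 2, 1]` gets NDCG@3 = 1.0; any misordering of distinct labels scores strictly less.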

Page 21:

Shortcomings of Pairwise Approaches

• Same pair-level error but different list-level error

  NDCG@k = Z_k \sum_{j=1}^{k} \frac{2^{y_{i(j)}} - 1}{\log(j + 1)}

[Figure: two ranked lists with binary labels y = 0 and y = 1; the same pairwise swap changes NDCG differently depending on its position in the list.]

Page 22:

Listwise Approaches

• The training loss is built directly on the difference between the predicted list and the ground-truth list

• Straightforward target: directly optimize the ranking evaluation measures
• Drawback: complex models

Page 23:

ListNet

• Train the scoring function y_i = f_\theta(x_i)
• Rankings are generated based on the scores \{y_i\}_{i=1 \ldots n}
• Each possible k-length ranking list [j_1, j_2, \ldots, j_k] has a probability

  P_f([j_1, j_2, \ldots, j_k]) = \prod_{t=1}^{k} \frac{\exp(f(x_{j_t}))}{\sum_{l=t}^{n} \exp(f(x_{j_l}))}

• List-level loss: cross entropy between the predicted distribution and the ground-truth distribution

  L(y, f(x)) = -\sum_{g \in G_k} P_y(g) \log P_f(g)

• Complexity: there are many possible rankings

Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
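The top-k permutation probability and the list-level cross entropy can be sketched by brute force, which also makes the complexity concern concrete: the loss enumerates all k-length permutations. Function names are our own; `true_scores` stands in for scores derived from the ground-truth ratings:

```python
import math
from itertools import permutations

def perm_prob(scores, perm):
    """Plackett-Luce probability of the (partial) ranking `perm`
    under per-document scores f(x_j), as in the ListNet top-k model."""
    remaining = set(range(len(scores)))
    p = 1.0
    for j in perm:
        z = sum(math.exp(scores[l]) for l in remaining)
        p *= math.exp(scores[j]) / z
        remaining.remove(j)
    return p

def listnet_loss(true_scores, pred_scores, k=1):
    """Cross entropy between the top-k ranking distributions induced by
    the ground-truth scores and by the predicted scores."""
    n = len(true_scores)
    return -sum(perm_prob(true_scores, g) * math.log(perm_prob(pred_scores, g))
                for g in permutations(range(n), k))
```

With k = 1 this reduces to a softmax cross entropy over documents, which is the practical variant used in the ListNet paper; full-permutation k makes the sum factorially large.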

Page 24:

Distance between Ranked Lists

• A similar distance: KL divergence

Tie-Yan Liu @ Tutorial at WWW 2008

Page 25:

Pairwise vs. Listwise

• Pairwise approach shortcoming: the pair-level loss is far from the IR list-level evaluation measures
• Listwise approach shortcoming: it is hard to define a list-level loss under a low model complexity

• A good solution: LambdaRank, i.e. pairwise training with listwise information

Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.

Page 26:

LambdaRank

• Pairwise approach gradient, with o_{i,j} \equiv f(x_i) - f(x_j):

  \frac{\partial L(q, d_i, d_j)}{\partial \theta}
    = \underbrace{\frac{\partial L(q, d_i, d_j)}{\partial P_{i,j}}
      \frac{\partial P_{i,j}}{\partial o_{i,j}}}_{\lambda_{i,j}}
      \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)

  The underbraced factor \lambda_{i,j} comes from the pairwise ranking loss; the remaining factor comes from the scoring function itself.

• LambdaRank basic idea: replace \lambda_{i,j} with h(\lambda_{i,j}, g_q), where g_q is the current ranking list, so that listwise information enters the gradient:

  \frac{\partial L(q, d_i, d_j)}{\partial \theta}
    = h(\lambda_{i,j}, g_q)
      \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)

Page 27:

LambdaRank for Optimizing NDCG

• A choice of lambda for optimizing NDCG:

  h(\lambda_{i,j}, g_q) = \lambda_{i,j} \cdot |\Delta NDCG_{i,j}|

  where \Delta NDCG_{i,j} is the change in NDCG obtained by swapping documents d_i and d_j in the current ranking list g_q
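The swap-based |ΔNDCG| factor and the resulting lambda can be sketched as follows, reusing the RankNet-style pairwise gradient and assuming a base-2 log discount and a linear-in-scores setup (function names and conventions are our own):

```python
import math

def delta_ndcg(labels, i, j, k=None):
    """|Change in NDCG@k| if the documents at ranks i and j are swapped.
    `labels` are graded relevances in the current ranked order."""
    def dcg(ls):
        kk = k or len(ls)
        return sum((2 ** y - 1) / math.log2(r + 1)
                   for r, y in enumerate(ls[:kk], start=1))
    ideal = dcg(sorted(labels, reverse=True))
    if ideal == 0:
        return 0.0
    swapped = labels[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(labels) - dcg(swapped)) / ideal

def lambda_ij(s_i, s_j, labels, i, j):
    """LambdaRank weight: the RankNet pairwise gradient dL/do for
    y_ij = 1 (document i preferred), scaled by |Delta NDCG|."""
    rank_net = -1.0 / (1.0 + math.exp(s_i - s_j))
    return rank_net * delta_ndcg(labels, i, j)
```

Swapping two equally labeled documents changes nothing (lambda is zero), while a swap that would demote a highly relevant document near the top gets a large weight; that is how list-level position information enters an otherwise pairwise update.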

Page 28:

LambdaRank vs. RankNet

[Figure: comparison of LambdaRank and RankNet; linear nets]

Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.

Page 29:

LambdaRank vs. RankNet

Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.

Page 30:

Summary of Learning to Rank

• Pointwise, pairwise, and listwise approaches for learning to rank
• Pairwise approaches are still the most popular: a balance of ranking effectiveness and training efficiency
• LambdaRank is a pairwise approach with list-level information
  • Easy to implement, easy to improve and adjust