
Learning to Rank - Tutorial @ SEPLN 2012

Parth Gupta

NLE Lab, ELiRF, UPV

September 5, 2012


Outline

Introduction

Corpus

Evaluation Metrics

Approaches

Applications

Letor Datasets

Xapian Letor API

Summary

References


What is Ranking?

- Given a request and a collection of offerings, return the offerings in a particular order of relevance.

In the case of document retrieval

- Input: a query and a collection of documents

- Output: a ranked list of documents

Learning to Rank

- Machine Learning for Information Retrieval (IR)

- Specifically: learning the ranking model


Basic IR Concepts

Crawling

- Crawl the necessary part of the Web and prepare a static collection of documents

Indexing

- Preprocess documents to convert them into raw text (ASCII or UTF-8)

- Stop-word removal [Term Pipeline]

- Stemming [Term Pipeline]

- Store relevant information about terms and documents, such as term frequency (TF) [document and collection] and document length, in the direct and inverted indexes (a small indexing sketch follows)
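As a rough illustration of this indexing step, the sketch below builds a toy in-memory inverted index with stop-word removal, term-frequency counts and document lengths. It is a minimal sketch only: stemming is omitted and names such as InvertedIndex are made up, so it is not the pipeline of any particular engine.

#include <map>
#include <set>
#include <sstream>
#include <string>

// Toy in-memory inverted index: term -> (docid -> term frequency).
// Illustrative sketch only; stemming is omitted and all names are made up.
struct InvertedIndex {
    std::map<std::string, std::map<int, int>> postings; // term -> docid -> tf
    std::map<int, int> doc_length;                      // docid -> number of indexed terms
    std::set<std::string> stopwords{"a", "an", "the", "is", "in", "of"};

    void add_document(int docid, const std::string& text) {
        std::istringstream in(text);
        std::string term;
        while (in >> term) {
            if (stopwords.count(term)) continue;   // stop-word removal [term pipeline]
            ++postings[term][docid];               // term frequency within the document
            ++doc_length[docid];                   // document length
        }
    }

    int num_docs_containing(const std::string& term) const {
        auto it = postings.find(term);
        return it == postings.end() ? 0 : static_cast<int>(it->second.size());
    }
};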


Query Normalisation

- Pass the query through the same term pipeline

Ranking

- The simplest yet powerful model is TF-IDF

Score(Q, D) = \sum_{i=1}^{n} tf(q_i, D) \cdot idf(q_i)

- tf(q_i, D) = frequency of term q_i in D

- idf(q_i) = \log\left(\frac{N}{\#\text{ of docs containing } q_i}\right)
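A compact sketch of this scoring formula, independent of any real index structure; the idf here uses the natural logarithm, and other bases or smoothed variants are equally common, so absolute scores depend on that choice.

#include <cmath>
#include <map>
#include <string>
#include <vector>

// TF-IDF score of a query against one document:
// Score(Q, D) = sum_i tf(q_i, D) * idf(q_i), with idf(q) = log(N / df(q)).
// Inputs are plain maps so the sketch stays independent of any real index.
double tfidf_score(const std::vector<std::string>& query_terms,
                   const std::map<std::string, int>& doc_tf,   // tf(term, D)
                   const std::map<std::string, int>& df,       // # docs containing term
                   int num_docs) {
    double score = 0.0;
    for (const std::string& q : query_terms) {
        auto tf_it = doc_tf.find(q);
        auto df_it = df.find(q);
        if (tf_it == doc_tf.end() || df_it == df.end() || df_it->second == 0) continue;
        double idf = std::log(static_cast<double>(num_docs) / df_it->second);
        score += tf_it->second * idf;
    }
    return score;
}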


tf-idf Example

- Doc1 = Jaume I University is in Castellon

- Doc2 = Jaume I University is in Spain

- Doc3 = UPV is in Valencia

- Q = Jaume Spain

Ranking

- Score(Q, Doc1) = (1+0)*(0.64) = 0.64  →  Rank 2

- Score(Q, Doc2) = (1+1)*(0.64) = 1.28  →  Rank 1

- Score(Q, Doc3) = (0+0)*(0.64) = 0.0   →  Rank 3


Other Unsupervised Ranking Models

- BM25 - Probabilistic Model

- Language Model for IR


Scope of This Tutorial

- Learning to Rank Problem Definition

- Approaches to Learning to Rank

- Applications of LTR to NLP and IR

- Research Trends with LTR

- Hands-on with LTR


Conventional Unsupervised Ranking Model


Motivation

[Figure: two rank lists, RankList A and RankList B, with relevant and irrelevant documents marked]

Can We Learn?


Supervised Learning-to-Rank (LetoR) Model[Qin et al., 2010]


Practical Ranking

[Figure: practical ranking pipeline with a query, an un-supervised ranker, feature extraction, and a supervised (learning-to-rank) ranker, each producing a rank list]


Training Process (fig. taken from [Li, 2009])


Testing Process (fig. taken from [Li, 2009])


Letor 4.0 Dataset [Qin et al., 2010]

- Gov2 web page collection and two query sets from the Million Query (MQ) track of TREC 2007 and TREC 2008.

- Given in the form of query-document pairs (a small parsing sketch follows the sample lines).

2 qid:10032 1:0.056537 2:0.000000 ... 45:0.000000 46:0.076923 #docid = GX029-35-5894638
0 qid:10032 1:0.279152 2:0.000000 ... 45:0.250000 46:1.000000 #docid = GX030-77-6315042
1 qid:10032 1:0.593640 2:1.000000 ... 45:0.500000 46:0.000000 #docid = GX256-43-0740276
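The sample lines follow the SVMlight-style LETOR format: a relevance label, a qid:... token, feature_id:value pairs, and a trailing #docid comment. Below is a minimal parsing sketch; the type name LetorInstance is illustrative and not part of any official toolkit.

#include <map>
#include <sstream>
#include <string>

// One query-document pair from a LETOR-formatted line:
// "<label> qid:<id> <feat>:<value> ... #docid = <docid>". Names are illustrative.
struct LetorInstance {
    int label = 0;
    std::string qid;
    std::map<int, double> features;  // feature id -> value
    std::string docid;
};

LetorInstance parse_letor_line(const std::string& line) {
    LetorInstance inst;
    std::string data = line;
    std::size_t hash = line.find('#');
    if (hash != std::string::npos) {             // "#docid = GX..." comment part
        data = line.substr(0, hash);
        std::string comment = line.substr(hash + 1);
        std::size_t eq = comment.find('=');
        if (eq != std::string::npos) {
            inst.docid = comment.substr(eq + 1);
            inst.docid.erase(0, inst.docid.find_first_not_of(' '));  // trim leading spaces
        }
    }
    std::istringstream in(data);
    std::string tok;
    in >> inst.label;                            // relevance label (0, 1, 2, ...)
    while (in >> tok) {
        std::size_t colon = tok.find(':');
        if (colon == std::string::npos) continue;
        std::string key = tok.substr(0, colon);
        std::string val = tok.substr(colon + 1);
        if (key == "qid") inst.qid = val;
        else inst.features[std::stoi(key)] = std::stod(val);
    }
    return inst;
}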


Issues in Learning to Rank

- Relevance Labels - manual judgments vs. click-through data

- Feature Extraction - (Q, D, Q+D)

- Evaluation Measure - multi-level relevance labels


Why so popular?

- Easy to incorporate new features
  - age of the document
  - link information in the web graph
  - click information
  - geographical information of the user, etc.

- Easy to tune the parameters of the ranking model automatically


Evaluation Metrics

Mean Average Precision (MAP)

MAP = \frac{\sum_{k=1}^{n} P(k) \cdot rel(k)}{\text{number of relevant documents}}

(This is the average precision of a single query; MAP is its mean over all queries.)

Normalized Discounted Cumulative Gain (nDCG)

NDCG@n = Z_n \sum_{j=1}^{n} \frac{2^{c(j)} - 1}{\log(1 + j)}

where c(j) is the relevance label of the document at rank j and Z_n normalizes the score so that the ideal ranking obtains NDCG@n = 1 (a sketch that computes both metrics follows).
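A rough sketch of how both metrics can be computed for a single ranked list is given below. It assumes graded labels, treats a label greater than zero as relevant for average precision, and uses the natural-log discount of the formula above (base-2 discounts are also common); it is an illustration, not a reference implementation of the TREC evaluation tools.

#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// Average precision of one ranked list; rel[k] > 0 means the document at rank k+1 is
// relevant (all relevant documents are assumed to appear in the list).
// MAP is the mean of this value over all queries.
double average_precision(const std::vector<int>& rel) {
    double num_rel = 0.0, sum = 0.0;
    for (std::size_t k = 0; k < rel.size(); ++k) {
        if (rel[k] > 0) {
            num_rel += 1.0;
            sum += num_rel / (k + 1);        // P(k) * rel(k)
        }
    }
    return num_rel > 0 ? sum / num_rel : 0.0;
}

// NDCG@n with gain 2^c - 1 and discount log(1 + rank), as in the formula above.
double ndcg_at(const std::vector<int>& rel, std::size_t n) {
    auto dcg = [&](const std::vector<int>& r) {
        double d = 0.0;
        for (std::size_t j = 0; j < std::min(n, r.size()); ++j)
            d += (std::pow(2.0, r[j]) - 1.0) / std::log(2.0 + j);   // j is 0-based here
        return d;
    };
    std::vector<int> ideal = rel;
    std::sort(ideal.begin(), ideal.end(), std::greater<int>());     // best possible ordering
    double idcg = dcg(ideal);
    return idcg > 0 ? dcg(rel) / idcg : 0.0;                        // Z_n = 1 / ideal DCG
}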


Approaches

- Pointwise Approach: an individual document is used for learning

- Pairwise Approach: pairs of documents for a query are used for learning

- Listwise Approach: the whole ranked list for a query is used for learning


Pointwise Approach

[Figure: training instances grouped by queries Q1, Q2, ..., Qn]

- Use each individual entity for learning, even without query information


Pointwise Learning-to-Rank (LetoR) Model


Regression Approach

- Methods like regression or classification can be applied directly to solve the ranking problem.

- Consider relevance judgments as
  - real values - in the case of regression
  - non-ordered categories - in the case of classification


Regression Approach contd.

- Least Squares Retrieval Function [Fuhr, 1989]

- The loss function can be the least-squared error (LSE).

- The regression model is multiple linear regression.

- The basic idea is to map the feature vector to a real value.


Regression Approach contd.

- Discriminative models [Nallapati, 2004]

- Here the real-valued score can be read as the probability of relevance given the document and the query.

- The weight vector can be estimated using a gradient descent algorithm (see the sketch below).
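A minimal sketch of this idea under simplifying assumptions: a logistic model whose output can be read as a probability of relevance, trained by stochastic gradient descent on the log-loss. It illustrates the pointwise discriminative setup rather than reproducing the exact model of [Nallapati, 2004]; all names are made up.

#include <cmath>
#include <vector>

// Pointwise discriminative ranker: P(relevant | q, d) = sigmoid(w . x), where x is the
// feature vector of the query-document pair. Trained by gradient descent on log-loss.
struct PointwiseLogistic {
    std::vector<double> w;

    explicit PointwiseLogistic(std::size_t dim) : w(dim, 0.0) {}

    double prob(const std::vector<double>& x) const {
        double s = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
        return 1.0 / (1.0 + std::exp(-s));
    }

    // One gradient-descent step on a single example (y = 1 relevant, y = 0 not).
    void sgd_step(const std::vector<double>& x, int y, double eta) {
        double err = prob(x) - y;                 // derivative of log-loss w.r.t. the score
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= eta * err * x[i];
    }
};

// Documents are then ranked by descending prob(x).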


Classification Approach

- BayesNetRank [Gupta, 2011], McRank [Li et al., 2007]

- Solve the ranking problem using classification

- Classes: relevance levels [0, 1, 2]

- Classification model: Bayesian Networks (BayesNetRank) and multiple classification with gradient boosting (McRank)

- A ranking function is applied on top of the classification probabilities.

Constraint

- Bayesian Networks accept only discrete data -> an MDL-based discretization scheme is used [Fayyad and Irani, 1993]


Bayesian Networks: Structure Learning

- E = set of edges, i.e. the structure of the network

- So we have to estimate

P(E \mid X) = \frac{P(X \mid E) \, P(E)}{P(X)}

- where P(E) is the prior
- P(X) is constant
- P(X \mid E) is the likelihood term (Cooper-Herskovits likelihood)

P(X \mid E) = \prod_{j=1}^{d} \prod_{l=1}^{q_j} \prod_{i=1}^{k_j} \Theta_{jil}^{\,n(x_j^{(i)}, \pi_j^{(l)})}

[Figure: structure learning over feature nodes f1 ... f46 and the relevance node R]


Structure Learning contd.

P(X \mid E) = \prod_{j=1}^{d} \prod_{l=1}^{q_j} \prod_{i=1}^{k_j} \Theta_{jil}^{\,n(x_j^{(i)}, \pi_j^{(l)})}

- d is the number of variables (number of nodes)

- k_j is the number of possible states that the variable X_j can take

- q_j is the number of possible parent configurations for variable X_j, and

\Theta_{jil} = P(X_j = x_j^{(i)} \mid \Pi_j = \pi_j^{(l)})
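As a rough sketch of how this likelihood is evaluated for one fixed structure, the log of the product above is just a sum of counts times log-parameters. The code assumes counts[j][l][i] holds n(x_j^(i), pi_j^(l)), the number of training cases where X_j takes its i-th value while its parents are in their l-th configuration, and theta[j][l][i] holds Theta_{jil}; both container layouts are made up for the illustration.

#include <cmath>
#include <vector>

// log P(X | E) = sum_j sum_l sum_i n(x_j^(i), pi_j^(l)) * log Theta_{jil}.
// Evaluates the likelihood for one fixed structure; it does not search over structures.
double log_likelihood(const std::vector<std::vector<std::vector<int>>>& counts,
                      const std::vector<std::vector<std::vector<double>>>& theta) {
    double ll = 0.0;
    for (std::size_t j = 0; j < counts.size(); ++j)                // variables (nodes)
        for (std::size_t l = 0; l < counts[j].size(); ++l)         // parent configurations
            for (std::size_t i = 0; i < counts[j][l].size(); ++i)  // states of X_j
                if (counts[j][l][i] > 0)
                    ll += counts[j][l][i] * std::log(theta[j][l][i]);
    return ll;
}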


Structure Learning contd.

- The likelihood has to be computed for all possible graph structures, and the graph that maximizes it is chosen.

- NP-hard problem: the number of possible DAGs grows super-exponentially with the number of nodes.

- e.g. for N = 10, the number of possible DAGs ≈ 4.2 x 10^18

- So there are different strategies to select a subset of all DAGs for consideration

- e.g. K2, hill climbing (HC), etc.


Bayesian Networks: Parameter Learning

- Now we know the structure, and hence the parent set of our class node [Relevance in our case].

- We use Maximum Likelihood Estimation (MLE) to obtain the probabilities, where the likelihood function is the same Cooper-Herskovits likelihood described in the Structure Learning section.


Ranking Function

- The Bayesian Network gives us P(R=0), P(R=1) and P(R=2)

- But we want a single real-valued score for the document

- A very straightforward yet effective function is the 'Expected Relevance' [Li et al., 2007] (see the sketch below)

Score(X_i) = \sum_{r} r \cdot P(R = r \mid X_i)
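A short sketch of this scoring step: given the per-document class probabilities produced by the classifier, compute the expected relevance and sort by it. The function names are illustrative.

#include <algorithm>
#include <numeric>
#include <vector>

// Expected relevance: Score(x) = sum_r r * P(R = r | x), computed from the class
// probabilities produced by the classifier (here P(R=0), P(R=1), P(R=2)).
double expected_relevance(const std::vector<double>& class_probs) {
    double score = 0.0;
    for (std::size_t r = 0; r < class_probs.size(); ++r) score += r * class_probs[r];
    return score;
}

// Rank documents (identified by index) by descending expected relevance.
std::vector<std::size_t> rank_by_expected_relevance(
        const std::vector<std::vector<double>>& per_doc_probs) {
    std::vector<std::size_t> order(per_doc_probs.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return expected_relevance(per_doc_probs[a]) > expected_relevance(per_doc_probs[b]);
    });
    return order;
}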


Issues with Regression Approach

- Here relevance is considered as either non-ordered categories or real values, which is not true in reality.

- This approach does not take any advantage of pairwise preference or total order information in learning.


Ordinal Regression

- Ordinal regression can be a better choice than the regression approach.

- Because here the categories are ordered, while in classification the categories are non-ordered.


Ordinal Regression contd.

- PRanking [Crammer and Singer, 2001]
  - Here we project \vec{w}^{T} \cdot \vec{x} onto the ordered categories defined by thresholds b_1 < b_2 < \cdots < b_n.

- In the training phase, if the predicted category is not equal to the actual category, we modify the vector \vec{w} and the thresholds (a sketch of the update follows).

Concerns:

- It still considers absolute relevance

- It does not use pairwise preference or total order information in learning.


Algorithms

- Regression

- Classification - McRank [Li et al., 2007], BayesNetRank [Gupta, 2011]

- Ordinal Regression - PRanking [Crammer and Singer, 2001], Large margin [Shashua and Levin, 2002]


Pairwise Approach

[Figure: documents grouped by queries Q1, Q2, ..., Qn, from which ordered document pairs are formed per query]

- Create correct and incorrect ranked pairs for each query

- Use these pairs for learning


Pairwise Learning-to-Rank (LetoR) Model


Pairwise Approach

- Take pairs of documents

- Input: X_i, X_j \in \mathbb{R}^t

- Output: Y \in \{+1, -1\}, where '+1' denotes that X_i > X_j is the correct order and '-1' otherwise.

- Example: if the given data is (X1, 2), (X2, 1), (X3, 0), where 0, 1 and 2 are the relevance labels,

- the new training data will be (X1, X2, +1), (X2, X1, -1), (X1, X3, +1), (X3, X1, -1), (X2, X3, +1) and (X3, X2, -1) (see the pair-construction sketch below).
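A rough sketch of this pair construction for a single query: every pair of documents with different relevance labels yields one pair labelled +1 and its reverse labelled -1. Names are illustrative.

#include <cstddef>
#include <vector>

// A pairwise training example: (features of doc a, features of doc b, y),
// with y = +1 when a should be ranked above b and y = -1 otherwise.
struct PairExample {
    std::vector<double> a;
    std::vector<double> b;
    int y;
};

// Build all preference pairs for one query from (feature vector, relevance label) instances.
std::vector<PairExample> make_pairs(const std::vector<std::vector<double>>& x,
                                    const std::vector<int>& label) {
    std::vector<PairExample> pairs;
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            if (label[i] > label[j]) {                 // i is more relevant than j
                pairs.push_back({x[i], x[j], +1});     // correct order
                pairs.push_back({x[j], x[i], -1});     // reversed order
            }
    return pairs;
}

On the three documents of the example above, this produces exactly the six pairs listed.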


Pairwise Approach - RankSVM [Herbrich et al., 2000]

- The SVM model is as shown below:

\min_{w,\,\zeta} \; \frac{1}{2} \| w \|^2 + C \sum_i \zeta_i

subject to \; z_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \ge 1 - \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, 2, \ldots, l

where

z_i = +1 if x_i^{(1)} should be ranked above x_i^{(2)} (i.e. y_i^{(1)} \succ y_i^{(2)}), and z_i = -1 otherwise.
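The quadratic program above is normally handed to an SVM solver. As an illustrative approximation only, the same pairwise hinge objective (with L2 regularization, where lambda plays roughly the role of 1/C) can be minimized by subgradient descent; the names below are made up and this is not the actual Ranking SVM implementation.

#include <cstddef>
#include <vector>

// Hypothetical pair of feature vectors where `first` should be ranked above `second`.
struct PrefPair {
    std::vector<double> first;
    std::vector<double> second;
};

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// One epoch of subgradient descent on  (lambda/2)||w||^2 + sum_i max(0, 1 - <w, x1 - x2>).
// Same pairwise hinge objective as Ranking SVM, but solved with SGD instead of a QP solver,
// so treat it as an illustrative approximation only.
void ranksvm_sgd_epoch(std::vector<double>& w,
                       const std::vector<PrefPair>& pairs,
                       double lambda, double eta) {
    for (const PrefPair& p : pairs) {
        std::vector<double> diff(w.size());
        for (std::size_t i = 0; i < w.size(); ++i) diff[i] = p.first[i] - p.second[i];

        double margin = dot(w, diff);
        for (std::size_t i = 0; i < w.size(); ++i) {
            double grad = lambda * w[i];          // L2 regularization term
            if (margin < 1.0) grad -= diff[i];    // hinge-loss subgradient when margin is violated
            w[i] -= eta * grad;
        }
    }
}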


Pairwise Approach contd.

- There are some concerns with Ranking SVM:

- It is biased towards queries with many relevant documents over those with fewer relevant documents. [Cao et al., 2006]

- It gives the same weight to all pairs.

- Also, the pairwise approach does not handle total-order information.


Algorithms

- Support Vector Machines - RankSVM [Herbrich et al., 2000], IR-SVM [Cao et al., 2006]

- Neural Networks - RankNet [Burges et al., 2005]


Listwise Approach

[Figure: documents grouped by queries Q1, Q2, ..., Qn; the entire ranked list of each query forms one training instance]

- Use the whole ranked list for learning


Listwise Learning-to-Rank (LetoR) Model


ListNet [Cao et al., 2007]

- Listwise loss function

\sum_{i=1}^{N} L(y_i, z_i)

  - L is the cross-entropy loss
  - z_i is the rank list generated for query i
  - y_i is the ground-truth rank list for query i

- Permutation probability

P_s(\pi) = \prod_{j=1}^{n} \frac{s_{\pi(j)}}{\sum_{k=j}^{n} s_{\pi(k)}}

- Calculating this over all permutations of n objects is computationally intractable (a sketch for a single permutation follows).
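A small sketch that evaluates P_s(π) for one given permutation and one vector of positive scores; in ListNet the raw neural-network outputs are usually mapped through an increasing positive function such as exp before this computation.

#include <cstddef>
#include <vector>

// Permutation probability from the ListNet paper:
// P_s(pi) = prod_j  s_{pi(j)} / sum_{k >= j} s_{pi(k)}.
// `scores` holds positive scores s_1..s_n; `perm` lists object indices in ranked order.
double permutation_probability(const std::vector<double>& scores,
                               const std::vector<std::size_t>& perm) {
    double p = 1.0;
    for (std::size_t j = 0; j < perm.size(); ++j) {
        double denom = 0.0;
        for (std::size_t k = j; k < perm.size(); ++k)
            denom += scores[perm[k]];
        p *= scores[perm[j]] / denom;
    }
    return p;
}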


ListNet contd.

- Top-k permutation probability

P_s(\mathcal{G}_k(j_1, j_2, \ldots, j_k)) = \prod_{t=1}^{k} \frac{s_{j_t}}{\sum_{l=t}^{n} s_{j_l}}

- The number of probabilities to compute drops from O(n!) to O(n!/(n-k)!)

- The ranking model s(\cdot, \cdot) is a neural network

- Example: if s(A) = 5, s(B) = 4 and s(C) = 2, then P(ABC) is the maximum and P(CBA) the minimum permutation probability (the top-1 case is sketched below).
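In practice ListNet uses the top-1 case, where the probabilities reduce to a softmax over the model outputs and the listwise loss becomes a cross entropy between the label-induced and score-induced distributions. A minimal sketch, with exp as the mapping function:

#include <cmath>
#include <cstddef>
#include <vector>

// Top-1 probabilities used by ListNet in practice (k = 1): a softmax over the
// raw model outputs f_j, i.e. P(j is ranked first) = exp(f_j) / sum_l exp(f_l).
std::vector<double> top1_probabilities(const std::vector<double>& f) {
    std::vector<double> p(f.size());
    double denom = 0.0;
    for (double v : f) denom += std::exp(v);
    for (std::size_t j = 0; j < f.size(); ++j) p[j] = std::exp(f[j]) / denom;
    return p;
}

// Listwise cross-entropy loss between the ground-truth distribution (from labels y)
// and the model distribution (from scores z) for one query. Minimal sketch.
double listnet_top1_loss(const std::vector<double>& y, const std::vector<double>& z) {
    std::vector<double> py = top1_probabilities(y);
    std::vector<double> pz = top1_probabilities(z);
    double loss = 0.0;
    for (std::size_t j = 0; j < py.size(); ++j) loss -= py[j] * std::log(pz[j]);
    return loss;
}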


Algorithms

- Neural Networks - ListNet [Cao et al., 2007]


Applications

- Search

- Collaborative Filtering [Freund et al., 2003]

- Key-Phrase Extraction [Jiang et al., 2009]

- Question Answering (why-questions) [Verberne et al., 2011]


Collaborative Filtering

- Problem formulation
  - Input: users' ratings on some items
  - Output: users' ratings on other items
  - Assumption: users sharing the same ratings on input items tend to agree on new items

- Solution
  - Classification
  - Ordinal Regression
  - Learning-to-Rank


Key Phrase Extraction

- Problem formulation
  - Input: document
  - Output: keyphrases of the document
  - Two steps: phrase extraction and keyphrase identification

- Traditional approach
  - Classification: keyphrase vs. non-keyphrase


Key Phrase Extraction contd.

- Ranking of phrases

- Pairwise approach


why Question Answering

- Problem formulation
  - Input: a question and potential answers
  - Output: improved ranking of the answers (re-ranking)

- Traditional approach
  - TF-IDF scoring


Question Answering contd.

- The data is highly imbalanced

- Pairwise approaches seem to work well


Statistics of public LTR datasets

Dataset                 Queries   Doc.       Rel.   Feat.   Year
Letor 3.0 - Gov         575       568 k      2      64      2008
Letor 3.0 - Ohsumed     106       16 k       3      45      2008
Letor 4.0               2,476     85 k       3      46      2009
Yandex                  20,267    213 k      5      245     2009
Yahoo!                  36,251    883 k      5      700     2010
Microsoft               31,531    3,771 k    5      136     2010


Research Trends in Learning to Rank

- Feature Extraction - What can be useful?

- Feature Selection - Are all of them useful?

- Data Labeling - Do I have to judge all of them?

- Ranking Models - How to rank?

- Dimensionality Reduction - Can the feature vector be compacted?

- Suitability of Evaluation Metrics - Does one fit all?


Xapian

- Xapian (http://xapian.org/) is an open-source search engine project

- The Letor API was developed under Google's open-source initiative, well known as Google Summer of Code (GSoC)

- A detailed description of the Letor API, along with its code, is available on the wiki page: http://trac.xapian.org/wiki/GSoC2011/LTR

Motivation

- To build an end-to-end infrastructure for Learning to Rank, i.e. from a natural language query to a ranked list of documents


API

Xapian::Letor ltr;
...
ltr.set_database(db);
ltr.set_query(query);
...
// prepare training file
ltr.prepare_training_file("topics.txt", "inex2010-article.qrels", 100);
...
// learn the model with svm parameters
ltr.letor_learn_model(4, 0);

// get the new ranked list with Letor!
map<Xapian::docid, double> letor_mset = ltr.letor_score(mset);


Summary

- We learnt the basic framework of Learning-to-Rank

- We explored the approaches to it

- We saw applications of LTR to IR and NLP tasks

- Overview of the datasets and the Xapian Letor API


Thank You!

[email protected]


References I

Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. N. (2005). Learning to rank using gradient descent. In ICML, pages 89–96.

Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y., and Hon, H.-W. (2006). Adapting ranking SVM to document retrieval. In SIGIR, pages 186–193.

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. (2007). Learning to rank: from pairwise approach to listwise approach. In ICML, pages 129–136.

Crammer, K. and Singer, Y. (2001). Pranking with ranking. In NIPS, pages 641–647.

Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In IJCAI, pages 1022–1029.


References II

Freund, Y., Iyer, R. D., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969.

Fuhr, N. (1989). Optimal polynomial retrieval functions based on the probability ranking principle. ACM Trans. Inf. Syst., 7(3):183–204.

Gupta, P. (2011). Learning-to-Rank: Using Bayesian Networks. Master's thesis, DA-IICT, India.

Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Smola, A., Bartlett, P., Scholkopf, B., and Schuurmans, D., editors, Advances in Large Margin Classifiers, pages 115–132, Cambridge, MA. MIT Press.


References III

Jiang, X., Hu, Y., and Li, H. (2009). A ranking approach to keyphrase extraction. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 756–757, New York, NY, USA. ACM.

Li, H. (2009). Learning to rank. In Tutorial Abstracts of ACL-IJCNLP 2009, page 5, Suntec, Singapore. Association for Computational Linguistics.

Li, P., Burges, C. J. C., and Wu, Q. (2007). McRank: Learning to rank using multiple classification and gradient boosting. In NIPS.

Nallapati, R. (2004). Discriminative models for information retrieval. In SIGIR, pages 64–71.

Qin, T., Liu, T.-Y., Xu, J., and Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346–374.


References IV

Shashua, A. and Levin, A. (2002). Ranking with large margin principle: Two approaches. In NIPS, pages 937–944.

Verberne, S., Halteren, H., Theijssen, D., Raaijmakers, S., and Boves, L. (2011). Learning to rank for why-question answering. Inf. Retr., 14(2):107–132.
