
Machine Learning for Information Retrieval

Rong Jin, Michigan State University

Yi Zhang, University of California Santa Cruz

Outline

– Introduction to information retrieval, statistical inference and machine learning
– Supervised learning and its application to IR
– Semi-supervised learning and its application to IR
– Emerging research directions

Roadmap of Information Retrieval

Search

Filtering

Categorization

Summarization

Clustering

Data Analysis

Extraction

Mining

Visualization

Retrieval Applications

Mining/Learning Applications

Information Access

Knowledge Acquisition

Why Is Machine Learning Important?

Text Categorization

Open Directory Project: the largest human-edited directory of the Web
– Manual classification
– Over 4 million sites and 590K categories
– Need to automate the process

Document Clustering

Question Answering

Classify question; identify answers; match questions and answers

Image Retrieval

Image segmentation by data clustering

Image Retrieval by Key Points

Key features → visual words: data clustering

b1 b2 b3 b4

Image Retrieval by Text Query

– Automatically annotate images with textual words
– Retrieve images with textual queries
– Key technique: classification (each keyword is a different category)

Information Extraction

Title J2EE Developer

Length 4 month

Salary ….

Location

Reference

Web page: free-style text → Relational DB

Structure prediction by Hidden Markov Model and Markov Random Field

Citation/Link Analysis

Recommender Systems

User 1 ? 5 3 4 2

User 2 4 1 5 ? 5

User 3 5 ? 4 2 5

User 4 1 5 3 5 ?

Sparse data problem: a lot of missing values

Recommender System

User Class I 1 p(4)=1/4

p(5)=3/4

User Class II p(4)=1/4

p(5)=3/4

p(1)=1/2

p(2)=1/2

p(4)=1/2

p(5)=1/2

Movie Type I

Movie Type II

Movie Type III

Fill out sparse data by data clustering

One More Reason for ML

$ 1,000,000 award

Review of Basic Prob. Concepts

Probability Pr(A): “the fraction of possible worlds in which A is true”

Examples
– A = Your paper will be accepted by SIGIR 2008
– A = It rains in Singapore
– A = A document contains the word “IR”

A is true

Event space of all possible worlds. The area is 1.

Conditional Probability

SIGIR2008 = “a document contains the phrase SIGIR 2008”
SINGAPORE = “a document contains the word Singapore”

P(SINGAPORE) = 0.000001
P(SIGIR2008) = 0.00000001
P(SINGAPORE|SIGIR2008) = 1/2

“Singapore” is rare and “SIGIR 2008” is rarer, but if you have a document with SIGIR 2008, there’s a 50-50 chance you’ll find the word “Singapore” in it

Conditional Prob.

Definition: $\Pr(A \mid B) = \dfrac{\Pr(A, B)}{\Pr(B)}$

Chain rule: $\Pr(A, B) = \Pr(B)\,\Pr(A \mid B)$
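As a worked instance, the numbers from the SIGIR 2008 / Singapore example can be plugged into the chain rule. The second line, the reverse conditional, is an added illustration derived from the same three numbers and is not stated on the slides.

```latex
\begin{align*}
\Pr(\text{SINGAPORE}, \text{SIGIR2008})
  &= \Pr(\text{SIGIR2008})\,\Pr(\text{SINGAPORE} \mid \text{SIGIR2008})
   = 10^{-8} \times \tfrac{1}{2} = 5 \times 10^{-9} \\
\Pr(\text{SIGIR2008} \mid \text{SINGAPORE})
  &= \frac{\Pr(\text{SINGAPORE}, \text{SIGIR2008})}{\Pr(\text{SINGAPORE})}
   = \frac{5 \times 10^{-9}}{10^{-6}} = 0.005
\end{align*}
```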

Conditional Prob.

Independent variables: $\Pr(A \mid B) = \Pr(A)$, equivalently $\Pr(A, B) = \Pr(B)\,\Pr(A)$

Marginal probability: $\Pr(B) = \sum_{j=1}^{k} \Pr(B, A = a_j)$

Bayes’ Rule

$\Pr(H \mid E) \propto \Pr(H) \times \Pr(E \mid H)$

Posterior ∝ Prior × Likelihood

H: hypothesis, E: evidence

Inference: Pr(H|E)

Information: Pr(E|H)

Bayes’ Rule

R: It rains

W: The grass is wet

Information: Pr(W|R)

Inference: Pr(R|W)

Pr(W|R)    R      ¬R
W          0.7    0.4
¬W         0.3    0.6
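The inference step for the rain example can be sketched in a few lines. The likelihoods 0.7 and 0.4 come from the table; the prior Pr(R) = 0.2 is an assumption made up for illustration, since the slides do not give one, and `posterior_rain` is a hypothetical helper name.

```python
def posterior_rain(prior_rain, p_wet_given_rain=0.7, p_wet_given_dry=0.4):
    """Return Pr(R | W) using Bayes' rule with the likelihoods from the table."""
    joint_rain = prior_rain * p_wet_given_rain          # Pr(R) Pr(W | R)
    joint_dry = (1.0 - prior_rain) * p_wet_given_dry    # Pr(not R) Pr(W | not R)
    return joint_rain / (joint_rain + joint_dry)        # normalize by Pr(W)


# Assumed prior: it rains on 20% of days (not from the slides).
post = posterior_rain(prior_rain=0.2)
print(round(post, 4))
```

Observing wet grass raises the probability of rain above the assumed prior because Pr(W|R) is larger than Pr(W|¬R).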

Statistical Inference

Learning stage: a parametric model for Pr(E|H)
Inference stage: for a given observation E
– Compute Pr(H|E) for each hypothesis H
– Choose the hypothesis with the largest Pr(H|E)

Example: Language Model (LM) for IR

Documents: d1 … d1000

Query q: ‘Singapore SIGIR’

Estimating some statistics θ for each document

Estimating likelihood p(q|θ)

Hypothesis: H

Evidence: E

Posterior Pr(H|E), likelihood Pr(E|H), prior Pr(H)
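The query-likelihood idea can be sketched as follows. This is a minimal illustration, assuming a unigram model with additive (pseudo-count) smoothing; the two toy documents, the function name `query_log_likelihood`, and the smoothing choice are assumptions, not taken from the slides.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, vocab_size, pseudo=1.0):
    """log p(q | theta_d) under a unigram model with additive smoothing."""
    counts = Counter(doc)
    total = len(doc)
    score = 0.0
    for word in query:
        # Smoothed estimate of p(word | document); never zero.
        p = (counts[word] + pseudo) / (total + pseudo * vocab_size)
        score += math.log(p)
    return score


# Hypothetical two-document collection.
docs = {
    "d1": "sigir 2008 will be held in singapore".split(),
    "d2": "machine learning for text categorization".split(),
}
vocab = {w for d in docs.values() for w in d}
query = "singapore sigir".split()

# Rank documents by the likelihood they assign to the query.
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], len(vocab)),
    reverse=True,
)
print(ranking)
```

Smoothing matters here because an unsmoothed estimate would assign probability zero to any query word missing from a document.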

Probability Distributions

– Binomial distributions, Beta distribution
– Multinomial distributions, Dirichlet distribution (language models, smoothing LM)
– Gaussian distributions, Laplacian distribution (sparse solution, L1 regularizer)

Outline

– Introduction to information retrieval, statistical inference and machine learning
– Supervised learning and its application to IR
– Semi-supervised learning and its application to IR
– Emerging research directions

Supervised Learning: Basic Setting

Given training data: {(x1,y1), (x2,y2), …, (xN,yN)}
Learning: infer a function f(x) from the training data
Inference: predict future outcomes y = f(x) given x

Regression: continuous Y, e.g., $f(x) = ax - b$

Supervised Learning: Basic Setting

Given training data: {(x1,y1), (x2,y2), …, (xN,yN)}
Learning: infer a function f(x) from the training data
Inference: predict future outcomes y = f(x) given x

Classification: discrete Y

Inputs $x = (x_1, x_2)$ with labels $y = +1$ or $y = -1$

Decision boundary: $w^\top x - b = 0$

$f(x) = \mathrm{sign}(w^\top x - b)$

Examples

Text categorization
– Input x: word histogram
– Output y: document categories (e.g., 1 for “domestic economics”, 2 for “politics”, 3 for “sports”, and 4 for “others”)

Question answering: classify question types
– Input x: a parsing tree of a question
– Output y: question types (e.g., when, where, …)

K Nearest-Neighbor (KNN) Classifiers

– Compute distance to other training documents

– Identify the k nearest neighbors

– determine the class of the unknown point by the class labels of its closest neighbors

Unknown record

Based on Tan,Steinbach, Kumar

K Nearest-Neighbor (KNN) Classifiers

Compute distance between two points
– Euclidean distance, cosine distance, Kullback-Leibler distance, Bregman distance, …
– Bregman distance: generated by a convex function
– Learning distance function from data (distance learning)

Determine the class
– Majority vote, or weighted majority vote
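The three KNN steps (compute distances, take the k nearest neighbors, vote) can be sketched as below. This is a minimal version assuming Euclidean distance and an unweighted majority vote; the toy points and the name `knn_predict` are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (vector, label). Majority vote among the k nearest points."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most frequent one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training set with two classes.
train = [
    ((0.0, 0.0), -1), ((0.2, 0.1), -1), ((0.1, 0.3), -1),
    ((1.0, 1.0), +1), ((0.9, 1.2), +1), ((1.1, 0.8), +1),
]
print(knn_predict(train, (0.95, 0.9), k=3))
```

Swapping `math.dist` for another distance (cosine, a learned metric) changes only the sort key, which is why the choice of distance function is the main design decision in KNN.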

K Nearest-Neighbor (KNN) Classifiers

Decide K (# of nearest neighbors)
– Bias-variance tradeoff
– Cross validation (or leave-one-out)

Figure: decision boundaries for k = 1 and k = 4, shown on the training dataset and the validation dataset

K Nearest-Neighbor (KNN) Classifiers

Curse of dimensionality
– Many attributes are irrelevant
– High dimension → less informative distance

Figure: distribution of squared distance, generated by 1000 random data points in 1000 dimensions

KNN for Collaborative Filtering

Collaborative filtering
– Will user u like item b?

Assumption:
– Users who have similar tastes are likely to have similar preferences on items

Making filtering decisions for one user based on the feedback from other users that are similar to this user

KNN for Collaborative Filtering

User 1    1  5  3  4  3
User 2    4  1  5  2  5
User 3    2  ?  3  5  4

Similarity measure of user interests can be learned
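A neighbor-based prediction for User 3's missing rating can be sketched as follows. The ratings are the ones in the table; the use of cosine similarity over co-rated items and a similarity-weighted average is one common choice and is an assumption here, as are the function names.

```python
import math

ratings = {
    "User 1": [1, 5, 3, 4, 3],
    "User 2": [4, 1, 5, 2, 5],
    "User 3": [2, None, 3, 5, 4],  # None marks the rating to predict
}


def cosine(u, v):
    """Cosine similarity computed only over items both users rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v)


def predict(target, item, ratings):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for user, row in ratings.items():
        if user == target or row[item] is None:
            continue
        sim = cosine(ratings[target], row)
        num += sim * row[item]
        den += abs(sim)
    return num / den


pred = predict("User 3", 1, ratings)
print(round(pred, 2))
```

User 3's co-rated items resemble User 1's more than User 2's, so the prediction is pulled toward User 1's rating of 5 rather than User 2's rating of 1.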

Paradigm for Supervised Learning

Gathering training data
Determine the input features (i.e., what is x?)
– e.g., text categorization, bags of words
– Feature engineering is very important
Determine the functional form f(x)
– Linear or nonlinear
– What is the functional form for KNN?
Determine the learning algorithm
– Learn optimal parameters (optimization, cross validation)
– Probabilistic or non-probabilistic
Test on a test set

Bayesian Learning

Posterior ∝ Prior × Likelihood

MAP Learning: Maximum A Posteriori

Hypothesis space: $H = \{Y_1, Y_2, \ldots\}$

$Y^* = \arg\max_{Y \in H} \Pr(Y \mid X) = \arg\max_{Y \in H} \Pr(Y)\,\Pr(X \mid Y)$ (Bayes’ rule)

Bayesian Learning

MLE Learning: Maximum Likelihood Estimation

$\Pr(H \mid E) \propto \Pr(H) \times \Pr(E \mid H)$ (Bayes’ rule)

Bayesian Learning: Conjugate Prior

Posterior Pr(Y|X) is in the same form as prior Pr(Y)
– e.g., the Dirichlet distribution is the conjugate prior for the multinomial distribution (widely used in language models)

$Y^* = \arg\max_{Y \in H} \Pr(Y \mid X) = \arg\max_{Y \in H} \Pr(Y)\,\Pr(X \mid Y)$

Example: Text Categorization

Web page for Prof. or student?

What is Y? What is feature X?

How to estimate Pr(Y = Student) or Pr(Y = Prof.)? How to estimate Pr(w|Y)?

$Y^* = \arg\max_{Y \in H} \Pr(Y)\,\Pr(X \mid Y)$

Counting!
1. Counting = MLE
2. Counting + pseudo-counts = MAP

Naïve Bayes

$X = (x_1, x_2, \ldots, x_V)$: counts of the words $[w_1, w_2, \ldots, w_V]$

How to go from Pr(w|Y) to Pr(X|Y)?

$\Pr(X \mid Y) \approx [\Pr(w_1 \mid Y)]^{x_1} \cdots [\Pr(w_V \mid Y)]^{x_V}$

$f(X) = \log \dfrac{\Pr(X \mid Y = P)\,\Pr(Y = P)}{\Pr(X \mid Y = S)\,\Pr(Y = S)} = \log \dfrac{\Pr(Y = P)}{\Pr(Y = S)} + x_1 \log \dfrac{\Pr(w_1 \mid Y = P)}{\Pr(w_1 \mid Y = S)} + \cdots + x_V \log \dfrac{\Pr(w_V \mid Y = P)}{\Pr(w_V \mid Y = S)}$

– First term: threshold (constant)
– Each log ratio: weight for a word

Naïve Bayes: A Linear Classifier

$f(x) = \mathrm{sign}(w^\top x - b)$, separating $y = +1$ from $y = -1$
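The "counting" view of Naïve Bayes and its linear form can be sketched together. This is a minimal illustration, assuming add-one pseudo-counts (the MAP variant); the toy professor/student pages and the names `train_nb` and `score` are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, vocab, pseudo=1.0):
    """docs: list of (tokens, label), label 'P' (professor) or 'S' (student).

    Returns the constant term and the per-word weights of the linear score f(X).
    """
    word_counts = {"P": Counter(), "S": Counter()}
    doc_counts = Counter()
    for tokens, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)

    def p_word(word, label):
        # Counting + pseudo-counts: a smoothed estimate of Pr(w | Y).
        total = sum(word_counts[label].values())
        return (word_counts[label][word] + pseudo) / (total + pseudo * len(vocab))

    bias = math.log(doc_counts["P"] / doc_counts["S"])  # log Pr(Y=P) / Pr(Y=S)
    weights = {w: math.log(p_word(w, "P") / p_word(w, "S")) for w in vocab}
    return bias, weights


def score(tokens, bias, weights):
    """f(X): positive means 'professor', negative means 'student'."""
    return bias + sum(weights.get(w, 0.0) for w in tokens)


# Hypothetical training pages.
docs = [
    ("professor research grant".split(), "P"),
    ("professor teaching research".split(), "P"),
    ("student homework advisor".split(), "S"),
    ("student research advisor".split(), "S"),
]
vocab = {w for tokens, _ in docs for w in tokens}
bias, weights = train_nb(docs, vocab)
print(score("professor grant".split(), bias, weights) > 0)
```

Each occurrence of a word adds its log ratio once, so summing over tokens is the same as the $x_v \log(\cdot)$ terms with $x_v$ the word count.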

Logistic Regression

Directly model f(x) or Pr(Y|X)

Naïve Bayes:
$\log \dfrac{\Pr(X \mid Y = P)\,\Pr(Y = P)}{\Pr(X \mid Y = S)\,\Pr(Y = S)} = \log \dfrac{\Pr(Y = P)}{\Pr(Y = S)} + x_1 \log \dfrac{\Pr(w_1 \mid Y = P)}{\Pr(w_1 \mid Y = S)} + \cdots + x_V \log \dfrac{\Pr(w_V \mid Y = P)}{\Pr(w_V \mid Y = S)}$

Logistic regression:
$\log \dfrac{\Pr(X \mid Y = P)\,\Pr(Y = P)}{\Pr(X \mid Y = S)\,\Pr(Y = S)} = b + t_1 x_1 + \cdots + t_V x_V$

Logistic Regression (LR)

$t_1, \ldots, t_V$ are unknown weights that are learned from data by maximum likelihood estimation (MLE)

$\Pr(y = \pm 1 \mid X) = \dfrac{1}{1 + \exp[-y\,(t_1 x_1 + \cdots + t_V x_V + b)]}$

Logistic Regression (LR)

Learning parameters: $b, t_1, \ldots, t_V$

Maximum Likelihood Estimation (MLE)

$(\vec{t}^*, b^*) = \arg\max_{\vec{t}, b} \sum_{i=1}^{N} \log \Pr(y_i \mid X_i; \vec{t}, b)$

Overfitting: worse performance

Maximum Likelihood Estimation → Maximum A Posteriori: add a prior Pr(t) on the weights

Why only word weights?

Learning Logistic Regression

$\Pr(y = \pm 1 \mid X) = \dfrac{1}{1 + \exp[-y f(X)]}$

$(\vec{t}^*, b^*) = \arg\min_{\vec{t}, b} \sum_{i=1}^{N} -\log \Pr(y_i \mid X_i; \vec{t}, b)$

Loss function: mismatch between y and f(X)

Other loss functions
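Minimizing the logistic loss has no closed form, so it is solved numerically. The sketch below uses plain batch gradient descent with a small L2 penalty on the weights (which corresponds to a Gaussian prior, i.e. the MAP variant); the learning rate, penalty, toy data, and function names are assumptions chosen for illustration.

```python
import math


def train_logreg(X, y, lr=0.1, epochs=500, l2=0.01):
    """Minimize sum_i log(1 + exp(-y_i f(x_i))) + (l2 / 2) * ||t||^2, y_i in {-1, +1}."""
    dim = len(X[0])
    t, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_t, grad_b = [l2 * tj for tj in t], 0.0
        for xi, yi in zip(X, y):
            f = sum(tj * xj for tj, xj in zip(t, xi)) + b
            # Derivative of log(1 + exp(-y f)) with respect to f.
            g = -yi / (1.0 + math.exp(yi * f))
            for j in range(dim):
                grad_t[j] += g * xi[j]
            grad_b += g
        t = [tj - lr * gj for tj, gj in zip(t, grad_t)]
        b -= lr * grad_b
    return t, b


def prob_positive(x, t, b):
    """Pr(y = +1 | x) under the learned model."""
    f = sum(tj * xj for tj, xj in zip(t, x)) + b
    return 1.0 / (1.0 + math.exp(-f))


# Hypothetical word-count features: (count of word 1, count of word 2).
X = [(2, 0), (3, 1), (0, 2), (1, 3)]
y = [+1, +1, -1, -1]
t, b = train_logreg(X, y)
print(prob_positive((3, 0), t, b) > 0.5)
```

Setting `l2=0` recovers plain MLE; on separable data like this toy set the unregularized weights keep growing, which is one way to see the overfitting concern.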

Logistic Regression (LR)

Closely related to Maximum Entropy (ME)
– Logistic regression is the dual of maximum entropy

Advantages of LR
– Bayesian approach
– Convenient for incorporating prior knowledge
– Useful for semi-supervised learning, transfer learning, …

Comparison of Classifiers

From Li and Yang SIGIR03

                       Macro F1   Micro F1
KNN                    0.8557     0.5975
Naïve Bayes            0.8009     0.4737
Logistic Regression    0.8748     0.6084

Logistic Regression vs. Naïve Bayes

Logistic Regression
– Model Pr(Y|X)
– Model decision boundary
– NB is a special case of LR
– Require numerical solution
– Large number of training examples, slow convergence

Naïve Bayes
– Model Pr(X|Y) & Pr(Y)
– Model input patterns (X)
– Simple solution
– Small number of training examples, fast convergence

Discriminative Model vs. Generative Model

Discriminative model
– Model Pr(Y|X)
– Model decision boundary
– Broader model assumption

Generative model
– Model Pr(X|Y) & Pr(Y)
– Model input patterns (X)

Rule of Thumb

Discriminative model if
1. Enough training examples
2. Enough computational power
3. Classification accuracy is important

Generative model if
1. Lack of training examples
2. Lack of computational power
3. Training time is more important
4. A quick test

What about KNN?

Other Discriminative Classifiers

Decision tree
– Aggregation of decision rules via a tree
– Easy interpretation

Support vector machine
– A maximum margin classifier
– Typically among the best text classifiers

From Li and Yang SIGIR03

                          Macro F1   Micro F1
KNN                       0.8557     0.5975
Naïve Bayes               0.8009     0.4737
Logistic Regression       0.8748     0.6084
Support vector machine    0.8857     0.5975

Ensemble Learning

Generate multiple classifiers
Classification by (weighted) majority votes
Bagging & Boosting
– Train a classifier for a different sampling of training data: samples D1, D2, …, Dk give classifiers h1, h2, …, hk

Ensemble Learning

Bias-variance tradeoff
– Reduce variance (bagging) and bias (boosting)
– Error caused by variance vs. error caused by bias

Example: 50 decision trees, majority vote

Multi-Class Classification

More than 2 classes
Multi-labels assigned to each example

        c1   c2   …   cK
X1      0    1    …   0
X2      1    0    …   0
XN      1    0    …   1

Approaches
– One against all: one binary classifier per class, $f_1(X), \ldots, f_K(X)$
– ECOC coding: binary classifiers $f_1(X), \ldots, f_M(X)$, one per coding bit (M = # of coding bits), with a code word for each class, e.g.

        c1   c2   …   cK
f1      0    1    …   0
f2      1    0    …   1
…       …    …    …   …
fM      1    1    …   0

– Transfer learning

Beyond Vector Inputs

gene sequence classification

question type classification

Character Recognition

sequences trees graphs

Beyond Vector Inputs: Kernel

Kernel function k(x1, x2)
– Assess the similarity between two objects x1, x2
– Don’t have to represent objects by vectors

Vector representation by kernel function
– Given training examples $x_1, \ldots, x_N$
– Represent any example x by the vector $[k(x_1, x), k(x_2, x), \ldots, k(x_N, x)]$
– Related to representer theorem
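The kernel-based vector representation can be sketched for string inputs. The character-bigram kernel below is a simple stand-in chosen for illustration (a small spectrum-style kernel), not a kernel named in the slides; the sequences and function names are hypothetical.

```python
from collections import Counter


def bigram_kernel(s1, s2):
    """Similarity of two strings: inner product of their character-bigram counts."""
    c1 = Counter(s1[i:i + 2] for i in range(len(s1) - 1))
    c2 = Counter(s2[i:i + 2] for i in range(len(s2) - 1))
    return sum(c1[g] * c2[g] for g in c1)


def kernel_features(x, training_examples, kernel):
    """Represent x by the vector [k(x_1, x), ..., k(x_N, x)]."""
    return [kernel(xi, x) for xi in training_examples]


# Hypothetical gene-like sequences used as training examples.
training = ["ACGTAC", "TTGACA", "ACGACG"]
features = kernel_features("ACGT", training, bigram_kernel)
print(features)
```

Once every object is mapped to such a vector, any vector-based classifier (KNN, logistic regression, SVM) can be applied to sequences, trees, or graphs without designing explicit features.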

Beyond Vector Inputs

Sequences: string kernel
Trees: tree kernel
Graphs: graph kernel

Kernel for Nonlinear Classifiers

Words that are associated with kernels
– Reproducing Kernel Hilbert Space (RKHS): vector representation
– Mercer’s conditions: good kernels
– Representer theorem
– Kernel learning (e.g., multiple kernel learning)

Sequence Prediction

Part-of-speech tagging

[He] [reckons] [the] [current] [account] [deficit]

[PRP] [VBZ] [DT] [JJ] [NN] [NN]

But all the taggings are related:

$\Pr(\mathrm{NN} \mid \text{“account”}) \rightarrow \Pr(\mathrm{NN} \mid \text{“account”}, \text{tag for “current”})$

Hidden Markov Model (HMM), Conditional Random Field (CRF), and Maximum Margin Markov Network (M3N)

Outline

– Introduction to information retrieval, statistical inference and machine learning
– Supervised learning and its application to IR
– Semi-supervised learning and its application to IR
– Emerging research directions

Topics of Semi-supervised Learning

Introduction to semi-supervised learning
Basics of semi-supervised learning
Semi-supervised classification algorithms
– Label propagation
– Graph partitioning based approaches
– Transductive Support Vector Machine (TSVM)
– Co-training
Semi-supervised data clustering

Spectrum of Learning Problems

What is Semi-supervised Learning

Learning from a mixture of labeled and unlabeled examples

$f(x): X \rightarrow Y$

Labeled data: $L = \{(x_1, y_1), \ldots, (x_{n_l}, y_{n_l})\}$

Unlabeled data: $U = \{x_1, \ldots, x_{n_u}\}$

Total number of examples: $N = n_l + n_u$

Why Semi-supervised Learning?

Labeling is expensive and difficult
Labeling is unreliable
– Ex. segmentation applications
– Need for multiple experts
Unlabeled examples
– Easy to obtain in large numbers
– Ex. web pages, text documents, etc.

Semi-supervised Learning Problems

Classification
– Transductive: predict labels of unlabeled data
– Inductive: learn a classification function
Clustering (constrained clustering)
Ranking (semi-supervised ranking)
Almost every learning problem has a semi-supervised counterpart.

Why Unlabeled Data Could Be Helpful

Clustering assumption
– Unlabeled data help decide the decision boundary $f(X) = 0$
Manifold assumption
– Unlabeled data help decide the decision function $f(X)$

Clustering Assumption

Points with same label are connected through high density regions, thereby defining a cluster

Clusters are separated through low-density regions

Suggest a simple algorithm for semi-supervised learning?

Manifold Assumption

Manifold assumption
– Data lies on a low-dimensional manifold
– Classification function f(x) should “follow” the data manifold

Graph representation
– Vertex: training example (labeled and unlabeled)
– Edge: similar examples

Regularize the classification function f(x)
– $x_1$ and $x_2$ are connected $\rightarrow$ $|f(x_1) - f(x_2)|$ is small

Statistical View

Generative model for classification

$\Pr(X, Y \mid \theta, \eta) = \Pr(X \mid Y; \theta)\,\Pr(Y \mid \eta)$

Unlabeled data help estimate $\Pr(X \mid Y; \theta)$, i.e., θ

Clustering assumption

Statistical View

Discriminative model for classification

$\Pr(X, Y \mid \theta, \mu) = \Pr(X \mid \mu)\,\Pr(Y \mid X; \theta)$

Unlabeled data help regularize θ via a prior $\Pr(\theta \mid X)$

Manifold assumption

Label Propagation: Key Idea

A decision boundary based on the labeled examples is unable to take into account the layout of the data points

How to incorporate the data distribution into the prediction of class labels?
– Connect the data points that are close to each other
– Propagate the class labels over the connected graph

Different from the K Nearest Neighbor

Label Propagation: Representation

Adjacency matrix: $W \in \{0, 1\}^{N \times N}$, with $W_{i,j} = 1$ if $x_i$ and $x_j$ connect, and $0$ otherwise

Similarity matrix: $W \in \mathbb{R}_+^{N \times N}$, with $W_{i,j}$ the similarity between $x_i$ and $x_j$

Degree matrix: $D = \mathrm{diag}(d_1, \ldots, d_N)$, with $d_i = \sum_{j \neq i} W_{i,j}$

Label Propagation: Representation

Given $W \in \mathbb{R}_+^{N \times N}$

Label information
– $y_l = (y_1, y_2, \ldots, y_{n_l}) \in \{-1, +1\}^{n_l}$
– $y_u = (y_1, y_2, \ldots, y_{n_u}) \in \{-1, +1\}^{n_u}$
– $y = (y_l, y_u)$

Label Propagation

Initial class assignments: $\hat{y} \in \{-1, 0, +1\}^N$, with $\hat{y}_i = \pm 1$ if $x_i$ is labeled and $\hat{y}_i = 0$ if $x_i$ is unlabeled

Predicted class assignments
– First predict the confidence scores $\mathbf{f} = (f_1, \ldots, f_N) \in \mathbb{R}^N$
– Then predict the class assignments $y \in \{-1, +1\}^N$, with $y_i = +1$ if $f_i > 0$ and $y_i = -1$ if $f_i \le 0$

Label Propagation (II)

One round of propagation (weighted KNN)

$f_i = \hat{y}_i$ if $x_i$ is labeled, and $f_i = \alpha \sum_{j=1}^{N} W_{i,j}\,\hat{y}_j$ otherwise

$\mathbf{f}_1 = \hat{y} + \alpha W \hat{y}$

α: weight for each propagation

Two rounds of propagation

$\mathbf{f}_2 = \hat{y} + \alpha W \mathbf{f}_1 = \hat{y} + \alpha W \hat{y} + \alpha^2 W^2 \hat{y}$

How to generalize to any number of iterations?

$\mathbf{f}_k = \hat{y} + \sum_{i=1}^{k} \alpha^i W^i \hat{y}$

Results for infinite number of iterations (matrix inverse)

$\mathbf{f}_\infty = \hat{y} + \sum_{i=1}^{\infty} \alpha^i W^i \hat{y} = (I - \alpha W)^{-1}\,\hat{y}$

Normalized similarity matrix: $\bar{W} = D^{-1/2} W D^{-1/2}$
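The propagation rounds can be sketched without a matrix inverse by iterating f ← ŷ + αWf, which accumulates the series ŷ + αWŷ + α²W²ŷ + … and approaches (I − αW)⁻¹ŷ when α times the largest eigenvalue of W is below 1. The chain graph, α = 0.3, and the function name are assumptions for illustration.

```python
def propagate(W, y_hat, alpha, rounds):
    """Iterate f <- y_hat + alpha * W f, starting from f = y_hat."""
    n = len(y_hat)
    f = list(y_hat)
    for _ in range(rounds):
        f = [
            y_hat[i] + alpha * sum(W[i][j] * f[j] for j in range(n))
            for i in range(n)
        ]
    return f


# Hypothetical chain graph 0 - 1 - 2 - 3 - 4 with unit edge weights.
W = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
y_hat = [1, 0, 0, 0, -1]  # node 0 labeled +1, node 4 labeled -1, rest unlabeled

f = propagate(W, y_hat, alpha=0.3, rounds=200)
labels = [1 if fi > 0 else -1 for fi in f]
print(labels)
```

Unlike one round of weighted KNN, repeated rounds let a label reach nodes that are several hops away from any labeled example, with influence decaying by a factor α per hop.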

Local and Global Consistency [Zhou et al., NIPS 03]

Local consistency:

Like KNN

Global consistency:

Beyond KNN

Summary:

Construct a graph using pairwise similarities
Propagate class labels along the graph: $\mathbf{f} = (I - \alpha W)^{-1}\,\hat{y}$
Key parameters
– α: the decay of propagation
– W: similarity matrix
Computational complexity
– Matrix inverse: $O(n^3)$
– Cholesky decomposition
– Clustering

Questions

Cluster assumption, manifold assumption

Transductive: predict classes for unlabeled data

Inductive: learn classification function

Application: Text Classification [Zhou et al., NIPS 03]

20-newsgroups
– autos, motorcycles, baseball, and hockey under rec

Pre-processing
– stemming, remove stopwords & rare words, and skip header

#Docs: 3970, #words: 8014

Method shown: label propagation

Application: Image Retrieval [Wang et al., ACM MM 2004]

5,000 images
Relevance feedback for the top 20 ranked images
Classification problem
– Relevant or not?
– f(x): degree of relevance
Learning relevance function f(x)
– Supervised learning: SVM
– Label propagation

– Label propagation
– Graph partition based approaches
– Transductive Support Vector Machine (TSVM)
– Co-training

Graph Partitioning

Classification as graph partitioning
Search for a classification boundary
– Consistent with labeled examples
– Partition with small graph cut

Figure: two candidate partitions, one with graph cut = 1 and one with graph cut = 2

Min-cuts for semi-supervised learning [Blum and Chawla, ICML 2001]

Additional nodes: V+ (source), V- (sink)
Infinite weights connecting sinks and sources
High computational cost

Figure: source and sink nodes, graph cut = 1

Harmonic Function [Zhu et al., ICML 2003]

Weight matrix W
– $w_{i,j} \ge 0$: similarity between $x_i$ and $x_j$

Membership vector $\mathbf{f} = (f_1, \ldots, f_N)$
– $f_i = +1$ if $x_i \in A$, and $f_i = -1$ if $x_i \in B$

Harmonic Function (cont’d)

Graph cut between A and B

$C(\mathbf{f}) = \sum_{i=1}^{N} \sum_{j > i} w_{i,j}\,\dfrac{(f_i - f_j)^2}{4} = \dfrac{1}{4}\,\mathbf{f}^\top (D - W)\,\mathbf{f} = \dfrac{1}{4}\,\mathbf{f}^\top L\,\mathbf{f}$

Degree matrix: $D = \mathrm{diag}(d_1, \ldots, d_N)$
– Diagonal element: $d_i = \sum_{j \neq i} W_{i,j}$

Graph Laplacian $L = D - W$
– Pairwise relationships among data points
– Manifold geometry of data
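The identity between the graph cut and the Laplacian quadratic form can be checked numerically on a small graph. This is a sketch assuming a symmetric W with zero diagonal and the cut summed over unordered pairs; the 4-node graph and function names are made up for illustration.

```python
def graph_cut(W, f):
    """C(f): sum over unordered pairs of w_ij * (f_i - f_j)^2 / 4."""
    n = len(f)
    return sum(
        W[i][j] * (f[i] - f[j]) ** 2 / 4
        for i in range(n)
        for j in range(i + 1, n)
    )


def laplacian_form(W, f):
    """(1/4) f^T (D - W) f with d_i = sum_{j != i} w_ij."""
    n = len(f)
    d = [sum(W[i][j] for j in range(n) if j != i) for i in range(n)]
    quad = sum(d[i] * f[i] ** 2 for i in range(n)) - sum(
        W[i][j] * f[i] * f[j] for i in range(n) for j in range(n) if j != i
    )
    return quad / 4


# Hypothetical graph with edges (0,1), (1,2), (2,3), (0,2), all of weight 1.
W = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
f = [+1, +1, -1, -1]  # A = {0, 1}, B = {2, 3}
print(graph_cut(W, f), laplacian_form(W, f))
```

For ±1 memberships each term (f_i − f_j)²/4 is 1 exactly when the edge crosses the partition, so both functions return the total weight of cut edges.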

Harmonic Function

$\min_{\mathbf{f} \in \{-1, +1\}^N} C(\mathbf{f}) = \dfrac{1}{4}\,\mathbf{f}^\top L\,\mathbf{f}$ (consistency with graph structures)

s.t. $f_i = y_i,\ 1 \le i \le n_l$ (consistent with labeled data)

Challenge: discrete space → combinatorial optimization

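Harmonic Function

Relaxation: {-1, +1} → continuous real number

$\min_{\mathbf{f} \in \{-1, +1\}^N} C(\mathbf{f}) = \dfrac{1}{4}\,\mathbf{f}^\top L\,\mathbf{f}$ s.t. $f_i = y_i,\ 1 \le i \le n_l$

becomes

$\min_{\mathbf{f} \in \mathbb{R}^N} C(\mathbf{f}) = \dfrac{1}{4}\,\mathbf{f}^\top L\,\mathbf{f}$ s.t. $f_i = y_i,\ 1 \le i \le n_l$

Convert continuous f to binary ones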

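Harmonic Function

$\min_{\mathbf{f} \in \mathbb{R}^N} C(\mathbf{f}) = \dfrac{1}{4}\,\mathbf{f}^\top L\,\mathbf{f}$ s.t. $f_i = y_i,\ 1 \le i \le n_l$

$L = \begin{pmatrix} L_{l,l} & L_{l,u} \\ L_{u,l} & L_{u,u} \end{pmatrix}, \quad \mathbf{f} = (\mathbf{f}_l, \mathbf{f}_u)$

$\mathbf{f}_u = -L_{u,u}^{-1}\,L_{u,l}\,y_l$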
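The closed-form solution satisfies $L_{u,u}\mathbf{f}_u + L_{u,l}y_l = 0$, which says every unlabeled score equals the weighted average of its neighbors' scores. The sketch below uses that property and repeated averaging sweeps instead of an explicit inverse; the chain graph and function name are assumptions, and every unlabeled node is assumed to have at least one neighbor.

```python
def harmonic_solution(W, labels, sweeps=500):
    """labels: dict node -> +1 / -1 for labeled nodes. Returns scores f for all nodes."""
    n = len(W)
    f = [float(labels.get(i, 0.0)) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i in labels:
                continue  # hard constraint: f_i = y_i on labeled nodes
            degree = sum(W[i][j] for j in range(n) if j != i)
            # Harmonic property: f_i is the weighted average of its neighbors.
            f[i] = sum(W[i][j] * f[j] for j in range(n) if j != i) / degree
    return f


# Hypothetical chain graph 0 - 1 - 2 - 3 - 4 with unit edge weights.
W = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
f = harmonic_solution(W, labels={0: +1, 4: -1})
print([round(v, 3) for v in f])
```

On a chain the scores interpolate linearly between the two labeled endpoints; thresholding them at zero converts the continuous f back to binary labels.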

Harmonic Function

Local propagation

Global propagation

Sound familiar?

Spectral Graph Transducer [Joachims, 2003]

Harmonic function:

$\min_{\mathbf{f} \in \mathbb{R}^N} C(\mathbf{f}) = \dfrac{1}{4}\,\mathbf{f}^\top L\,\mathbf{f}$ s.t. $f_i = y_i,\ 1 \le i \le n_l$

Soften hard constraints:

$\min_{\mathbf{f} \in \mathbb{R}^N} C(\mathbf{f}) = \dfrac{1}{4}\,\mathbf{f}^\top L\,\mathbf{f} + \alpha \sum_{i=1}^{n_l} (f_i - y_i)^2$ s.t. $\sum_{i=1}^{N} f_i^2 = N$

Solved by constrained eigenvector problem

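Manifold Regularization [Belkin, 2006]

Spectral graph transducer:

$\min_{\mathbf{f} \in \mathbb{R}^N} \dfrac{1}{4}\,\mathbf{f}^\top L\,\mathbf{f} + \alpha \sum_{i=1}^{n_l} (f_i - y_i)^2$ s.t. $\sum_{i=1}^{N} f_i^2 = N$

– Squared error → loss function for misclassification: $l(f(x_i), y_i)$
– Norm constraint → regularize the norm of the classifier

Manifold regularization:

$\min_{\mathbf{f} \in \mathbb{R}^N} \mathbf{f}^\top L\,\mathbf{f} + \alpha \sum_{i=1}^{n_l} l(f(x_i), y_i) + \gamma\,\|f\|_{H_K}^2$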

Summary

Construct a graph using pairwise similarity
Key quantity: graph Laplacian
– Captures the geometry of the graph
Decision boundary is consistent with
– Graph structure
– Labeled examples
Parameters: α, γ, similarity

Questions

Application: Text Classification

20-newsgroups
– autos, motorcycles, baseball, and hockey under rec

Pre-processing
– stemming, remove stopwords & rare words, and skip header

#Docs: 3970, #words: 8014

Methods shown: label propagation, harmonic function

Application: Text Classification

PRBEP: precision recall break even point.

Application: Text Classification

Improvement in PRBEP by SGT

Transductive SVM

Support vector machine
– Classification margin
– Maximum classification margin

Decision boundary given a small number of labeled examples
– How to change the decision boundary given both labeled and unlabeled examples?
– Move the decision boundary to a region of low local density

Transductive SVM

Classification margin: $\omega(X, y, f)$
– f(x): classification function

Supervised learning

$f^* = \arg\max_{f \in H_K} \omega(X, y, f)$

Semi-supervised learning
– Optimize over both f(x) and $y_u$

$f^* = \arg\max_{f \in H_K,\ y_u \in \{-1, +1\}^{n_u}} \omega(X, y_l, y_u, f)$

Transductive SVM

Decision boundary given a small number of labeled examples
Move the decision boundary to a place with low local density
Classification results
How to formulate this?

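Transductive SVM: Formulation

Original SVM

$\{w^*, b^*\} = \arg\min_{w, b}\ w \cdot w$

s.t. $y_i (w \cdot x_i - b) \ge 1,\ i = 1, \ldots, n$ (labeled examples)

Transductive SVM

$\{w^*, b^*\} = \arg\min_{y_{n+1}, \ldots, y_{n+m}}\ \arg\min_{w, b}\ w \cdot w$

s.t. $y_i (w \cdot x_i - b) \ge 1,\ i = 1, \ldots, n$ (labeled examples)

$y_{n+j} (w \cdot x_{n+j} - b) \ge 1,\ j = 1, \ldots, m$ (unlabeled examples)

– Constraints for unlabeled data
– A binary variable for the label of each unlabeled example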

Computational Issue

$\{w^*, b^*\} = \arg\min_{y_{n+1}, \ldots, y_{n+m}}\ \arg\min_{w, b}\ w \cdot w$

s.t. $y_i (w \cdot x_i - b) \ge 1$ for labeled examples $i = 1, \ldots, n$ and unlabeled examples $i = n+1, \ldots, n+m$

No longer a convex optimization problem
Alternating optimization

Summary

Based on maximum margin principle
Classification margin is decided by
– Labeled examples
– Class labels assigned to unlabeled data
High computational cost
Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machine (S3VM), TSVM

Questions

Text Classification by TSVM

10 categories from the Reuters collection
3299 test documents
1000 informative words selected by MI criterion

Co-training [Blum & Mitchell, 1998]

Classify web pages into category for students and category for professors

Two views of web pages
– Content: “I am currently the second year Ph.D. student …”
– Hyperlinks: “My advisor is …”, “Students: …”

Co-training for Semi-Supervised Learning

It is easy to classify the type of

this web page based on its

content

It is easier to classify this web

page using hyperlinks

Co-training

Two representations for each web page

Content representation:
– (doctoral, student, computer, university, …)

Hyperlink representation:
– Inlinks: Prof. Cheng
– Outlinks: Prof. Cheng

Co-training

– Train a content-based classifier using labeled examples
– Label the unlabeled examples that are confidently classified
– Train a hyperlink-based classifier
– Label the unlabeled examples that are confidently classified
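The alternation between the two views can be sketched with a deliberately simple classifier. In this illustration each view is a single number and the classifier is a nearest-centroid rule whose score magnitude serves as confidence; the classifier, threshold, data, and function names are all assumptions and not the method of the cited paper.

```python
def centroid_classifier(examples):
    """examples: list of (value, label) with labels +1 / -1. Returns a scoring function."""
    pos = [v for v, y in examples if y == +1]
    neg = [v for v, y in examples if y == -1]
    c_pos, c_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    # Positive score: closer to the +1 centroid. Magnitude acts as confidence.
    return lambda v: abs(v - c_neg) - abs(v - c_pos)


def co_train(labeled, unlabeled, rounds=5, threshold=0.4):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):  # alternate between the two views
            score = centroid_classifier([(x[view], y) for x, y in labeled])
            confident = [x for x in unlabeled if abs(score(x[view])) >= threshold]
            for x in confident:
                # Examples labeled via one view become training data for both views.
                labeled.append((x, +1 if score(x[view]) > 0 else -1))
                unlabeled.remove(x)
    return labeled, unlabeled


# Hypothetical pages: (content feature, hyperlink feature).
labeled = [((1.0, 1.0), +1), ((0.0, 0.0), -1)]
unlabeled = [(0.9, 0.5), (0.1, 0.5), (0.5, 0.9), (0.5, 0.1)]
result, remaining = co_train(labeled, unlabeled)
print(dict(result))
```

The first two unlabeled pages are ambiguous in the hyperlink view but clear in the content view, and the last two are the reverse, so each view ends up supplying labels the other could not assign confidently.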

Co-training

Assume two views of objects
– Two sufficient representations
Key idea
– Augment training examples of one view by exploiting the classifier of the other view
Extension to multiple views
Problem: how to find equivalent views

A Few Words about Active Learning

Active learning
– Select the most informative examples
– In contrast to passive learning
Key question: which examples are informative?
– Uncertainty principle: the most informative example is the one that is most uncertain to classify
– Measure classification uncertainty

A Few Words about Active Learning

Query by committee (QBC)
– Construct an ensemble of classifiers
– Classification uncertainty: largest degree of disagreement
SVM based approach
– Classification uncertainty: distance to decision boundary
Simple but very effective approaches
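The distance-to-boundary form of the uncertainty principle can be sketched for a linear classifier. The weights, threshold, and pool below are made-up values, and `most_uncertain` is a hypothetical helper; the boundary follows the $w^\top x - b = 0$ convention used earlier.

```python
def most_uncertain(pool, w, b):
    """Pick the unlabeled example closest to the decision boundary w.x - b = 0."""
    norm = sum(wj * wj for wj in w) ** 0.5

    def distance(x):
        # Geometric distance from x to the hyperplane.
        return abs(sum(wj * xj for wj, xj in zip(w, x)) - b) / norm

    return min(pool, key=distance)


# Hypothetical pool of unlabeled examples and an already-trained linear classifier.
pool = [(2.0, 2.0), (0.6, 0.5), (-1.5, -1.0)]
query = most_uncertain(pool, w=(1.0, 1.0), b=1.0)
print(query)
```

The selected example is the one whose label the current classifier is least sure about, so asking an annotator for it is expected to move the boundary more than labeling a point far from it.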

Semi-supervised clustering algorithms

Semi-supervised Clustering

Clustering data into two clusters
Side information:
– Must links vs. cannot links

Semi-supervised Clustering

Also called constrained clustering
Two types of approaches
– Restricted data partitions
– Distance metric learning approaches

Restricted Data Partition

Require data partitions to be consistent with the given links
Links as hard constraints
– E.g., constrained K-Means (Wagstaff et al., 2001)
Links as soft constraints
– E.g., Metric Pairwise Constraints K-means (Basu et al., 2004)

Restricted Data Partition

Hard constraints
– Cluster memberships must obey the link constraints
– Figure: three candidate partitions with one must link and one cannot link; the first two are allowed (Yes), the third is not (No)

Restricted Data Partition

Soft constraints
– Penalize data clustering if it violates some links
– Figure: the same three partitions, with penalty = 0, penalty = 0, and penalty = 1
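Both kinds of constraint can be expressed through one count of violated links: a hard-constraint method accepts only partitions with zero violations, while a soft-constraint method adds the count (possibly weighted) to its clustering objective. The sketch below assumes unit penalty per violated link; the point names and function name are hypothetical.

```python
def constraint_penalty(assignment, must_links, cannot_links):
    """Count violated links for a clustering given as {point: cluster_id}."""
    violated_must = sum(1 for a, b in must_links if assignment[a] != assignment[b])
    violated_cannot = sum(1 for a, b in cannot_links if assignment[a] == assignment[b])
    return violated_must + violated_cannot


must_links = [("a", "b")]
cannot_links = [("a", "c")]

ok = {"a": 0, "b": 0, "c": 1, "d": 1}    # obeys both links
bad = {"a": 0, "b": 1, "c": 1, "d": 0}   # splits the must-linked pair

print(constraint_penalty(ok, must_links, cannot_links))
print(constraint_penalty(bad, must_links, cannot_links))
```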

Distance Metric Learning

Learning a distance metric from pairwise links
– Enlarge the distance for a cannot-link
– Shorten the distance for a must-link
Apply K-means with pairwise distance measured by the learned distance metric

Figure: must link and cannot link pairs, before and after being transformed by the learned distance metric

Example of Distance Metric Learning

Solid lines: must links

dotted lines: cannot links

2D data projection using Euclidean distance metric

2D data projection using learned distance metric

BoostCluster [Liu, Jin & Jain, 2007]

General framework for semi-supervised clustering
– Improves any given unsupervised clustering algorithm with pairwise constraints

Key challenges
– How to influence an arbitrary clustering algorithm by side information? Encode constraints into data representation
– How to take into account the performance of the underlying clustering algorithm? Iteratively improve the clustering performance

BoostCluster

Given: (a) pairwise constraints, (b) data examples, and (c) a clustering algorithm

1. Find the best data representation that encodes the unsatisfied pairwise constraints
2. Obtain the clustering results given the new data representation
3. Update the kernel with the clustering results
4. Run the procedure iteratively
5. Compute the final clustering result

Figure: pairwise constraints and kernel matrix → new data representation → clustering algorithm → clustering results → final results

Summary

Clustering data under given pairwise constraints
– Must links vs. cannot links
Two types of approaches
– Restricted data partitions (either soft or hard)
– Distance metric learning
Questions: how to acquire links/constraints?
– Manual assignments
– Derive from side information: hyperlinks, citations, user logs, etc.
– May be noisy and unreliable

Application: Document Clustering [Basu et al., 2004]

300 docs from topics (atheism, baseball, space) of 20-newsgroups

3251 unique words after removal of stopwords and rare words and stemming

Evaluation metric: Normalized Mutual Information (NMI)

KMeans-x-x: different variants of constrained clustering algorithms

Outline

– Introduction to information retrieval, statistical inference and machine learning
– Supervised learning and its application to text classification, adaptive filtering, collaborative filtering and ranking
– Semi-supervised learning and its application to text classification
– Emerging research directions

Efficient Learning

In IR, we have massive amounts of data
But most learning algorithms are relatively slow
– Difficult to handle millions of documents
How to improve scalability?
– Sampling: only use part of the data
– Stochastic optimization: update the model one example at a time (related to online learning)
More interesting: more examples may mean more efficient training (Srebro, ICML 2008)

Kernel Learning

Kernels play a central role in machine learning
Kernel functions can be learned from data
– Kernel alignment, multiple kernel learning, non-parametric learning, …
Kernel learning is suitable for IR
– Similarity measure is key to IR
– Kernel learning allows us to identify the optimal similarity measure automatically

Transfer Learning

Different document categories are correlated
We should be able to borrow information of one class for the training of another class
Key question: what to transfer between classes?
– Representation, model priors, similarity measure

Active Learning

IR applications
– Relevance feedback (text retrieval or image retrieval)
– Text classification
– Adaptive information filtering
– Collaborative filtering
– Query rewriting

Discriminative Language Models

Language models have been shown to be effective for information retrieval

But most language models are generative, thus missing the discriminative power

Key difficulty in discriminative language models: no outputs!
– Side information
– Mixture of generative and discriminative models

References
A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998
Tong Zhang and Frank J. Oles, Text Categorization Based on Regularized Linear Classification Methods, Journal of Information Retrieval, 2001
F. Li and Y. Yang. A loss function analysis for classification methods in text categorization, The Twentieth International Conference on Machine Learning (ICML'03)
Chengxiang Zhai and John Lafferty, A study of smoothing methods for language models applied to information retrieval, ACM Trans. Inf. System, 2004
A. Blum and T. Mitchell, Combining Labeled and Unlabeled Data with Co-training, COLT 1998
D. Blei and M. Jordan, Variational methods for the Dirichlet process, ICML 2004
T. Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis, Mach. Learn., 42(1-2), 2001
D. Blei, A. Ng and M. Jordan, Latent Dirichlet allocation, NIPS*2002
R. Jin, C. Ding, and F. Kang, A Probabilistic Approach for Optimizing Spectral Clustering, NIPS*2005
D. Zhou, B. Scholkopf, and T. Hofmann, Semi-supervised learning on directed graphs, NIPS*2005
X. Zhu, Z. Ghahramani, and J. D. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, ICML 2003
T. Joachims, Transductive Learning via Spectral Graph Partitioning, ICML 2003

Andrew McCallum and Kamal Nigam, Employing EM in Pool-Based Active Learning for Text Classification, Proceedings of the International Conference on Machine Learning, 1998
David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan, Active Learning with Statistical Models, Journal of Artificial Intelligence Research, 1996
S. Tong and E. Chang. Support vector machine active learning for image retrieval. In ACM Multimedia, 2001
Xuehua Shen and ChengXiang Zhai, Active feedback in ad hoc information retrieval, SIGIR '05
J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997
X.-J. Wang, W.-Y. Ma, G.-R. Xue, X. Li. Multi-Model Similarity Propagation and its Application for Web Image Retrieval, ACM Multimedia, 2004
M. Belkin, P. Niyogi, and V. Sindhwani, Manifold Regularization, Technical Report, Univ. of Chicago, 2006
K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In ICML '01, 2001
S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In SIGKDD '04, 2004

Xiaofei He, Benjamin Rey, Wei Vivian Zhang, Rosie Jones, Query Rewriting using Active Learning for Sponsored Search, SIGIR '07
Y. Zhang, W. Xu, and J. Callan. Exploration and exploitation in adaptive filtering based on Bayesian active learning. In Proceedings of 20th International Conf. on Machine Learning, 2003
Z. Xu and R. Akella. A Bayesian logistic regression model for active relevance feedback, SIGIR '08
G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. ICML 2000
M. Saar-Tsechansky and F. Provost. Active sampling for class probability estimation and ranking. Machine Learning, 2004
J. Rocchio. Relevance feedback in information retrieval. In The Smart System: experiments in automatic document processing. Prentice Hall, 1971
H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, 1992
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997
D. A. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 1994
Robert M. Bell and Yehuda Koren, Lessons from the Netflix Prize Challenge, KDD Exploration 2008
Tie-Yan Liu, Tutorial: Learning to rank
Soumen Chakrabarti, Learning to Rank in Vector Spaces and Social Networks, WWW 2007

Thank You

God, it is finally over!
