Similarity-based Classifiers: Problems and Solutions


Classifying based on similarities:

Van Gogh or Monet?

[Figure: example paintings labeled "Van Gogh" and "Monet", and a query painting to classify.]

The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \Omega$ (paintings) and $y_i \in \mathcal{G}$ (painter), $i = 1, \dots, n$.

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$.

Training similarities: $S = [\psi(x_i, x_j)]_{n \times n}$, $y = [y_1 \ \dots \ y_n]^T$.

Test similarities: $s = [\psi(x, x_1) \ \dots \ \psi(x, x_n)]^T$ and $\psi(x, x)$.

Problem: estimate the class label $y$ for a test sample $x$ given $S$, $y$, $s$, and $\psi(x, x)$.
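To make the setup concrete, a minimal sketch of these objects as numpy arrays (toy values borrowed from a later weighted k-NN example in the talk; the variable names are illustrative):

```python
import numpy as np

# S[i, j] = psi(x_i, x_j) over the n training samples, y = their class labels,
# s[i] = psi(x, x_i) for a test sample x, plus the self-similarity psi(x, x).
n = 4
S = np.array([[5., 1., 1., 1.],
              [1., 5., 4., 2.],
              [1., 4., 5., 2.],
              [1., 2., 2., 5.]])   # n x n training similarities
y = np.array([0, 1, 1, 0])         # labels in G (toy assignment)
s = np.array([2., 4., 3., 3.])     # test-to-training similarities
psi_xx = 5.0                       # psi(x, x)
```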

Examples of Similarity Functions

Computational biology
– Smith-Waterman algorithm (Smith & Waterman, 1981)
– FASTA algorithm (Lipman & Pearson, 1985)
– BLAST algorithm (Altschul et al., 1990)

Computer vision
– Tangent distance (Duda et al., 2001)
– Earth mover's distance (Rubner et al., 2000)
– Shape matching distance (Belongie et al., 2002)
– Pyramid match kernel (Grauman & Darrell, 2007)

Information retrieval
– Levenshtein distance (Levenshtein, 1966)
– Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)

Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x). Main families of approaches:
– Treat similarities as kernels (SVM)
– Treat similarities as features
– Weighted k-NN (design the weights)
– Generative models (SDA)
– plus MDS-style embeddings and supporting theory

Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Properties of an inner product $\langle x, z \rangle$: conjugate symmetric; linear, $\langle a x, z \rangle = a \langle x, z \rangle$; positive definite, $\langle x, x \rangle > 0$ unless $x = 0$. Example inner product: $\langle x, z \rangle = x^T z$. An inner product implies a norm: $\|x\| = \sqrt{\langle x, x \rangle}$.

Inner products are similarities. But are our notions of similarity always inner products? No!

Example: Amazon similarity

$\Omega$ = the space of all books; $\psi(A, B)$ = the percentage of customers who buy book A after viewing book B on Amazon. S is the $96 \times 96$ similarity matrix over 96 books.

[Figure: the 96 × 96 Amazon similarity matrix S.] Is it inner-product-like?

Asymmetric! For example, $\psi(\text{HTF}, \text{Bishop}) = 3$ but $\psi(\text{Bishop}, \text{HTF}) = 8$.

[Figure: eigenvalues of S plotted against eigenvalue rank.] Some eigenvalues are negative, so S is not PSD!

Well, let's just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, so that $S = U \Lambda U^T$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.

Clip: $S_{\mathrm{clip}} = U \, \mathrm{diag}(\max(\lambda_1, 0), \dots, \max(\lambda_n, 0)) \, U^T$. $S_{\mathrm{clip}}$ is the PSD matrix closest to $S$ in terms of the Frobenius norm.

Flip: $S_{\mathrm{flip}} = U \, \mathrm{diag}(|\lambda_1|, \dots, |\lambda_n|) \, U^T$ (similar effect: $S_{\mathrm{new}} = S^T S$).

Shift: $S_{\mathrm{shift}} = U (\Lambda + |\min(\lambda_{\min}(S), 0)| \, I) \, U^T$.

Flip, clip, or shift? Best bet is clip.
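A minimal sketch of these spectrum modifications in numpy (the function name and interface are illustrative, not from the talk):

```python
import numpy as np

def make_psd(S, method="clip"):
    """Symmetrize an indefinite similarity matrix and modify its spectrum."""
    S = 0.5 * (S + S.T)                            # symmetrize
    lam, U = np.linalg.eigh(S)                     # S = U diag(lam) U^T
    if method == "clip":
        lam = np.maximum(lam, 0.0)                 # zero out negative eigenvalues
    elif method == "flip":
        lam = np.abs(lam)                          # flip signs of negative eigenvalues
    elif method == "shift":
        lam = lam + abs(min(lam.min(), 0.0))       # shift the whole spectrum up
    else:
        raise ValueError(method)
    return (U * lam) @ U.T                         # U diag(lam) U^T
```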

Alternatively, learn the best kernel matrix for the SVM (Luss, NIPS 2007; Chen et al., ICML 2009):

$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$

Approaches to Similarity-based Classification: next, similarities as features.

Let the similarities to the training samples be features

– SVM (Graepel et al., 1998; Liao & Noble, 2003)
– Linear programming (LP) machine (Graepel et al., 1999)
– Linear discriminant analysis (LDA) (Pekalska et al., 2001)
– Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
– Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008)

Let $[\psi(x, x_1) \ \dots \ \psi(x, x_n)]^T \in \mathbb{R}^n$ be the feature vector for $x$; for example, fit a sparse linear model on these features, roughly $\min_\alpha \tfrac{1}{2}\|y - S\alpha\|_2^2$ with an $\ell_1$ penalty on $\alpha$.

Does this work asymptotically? Our results suggest you need to choose a slow-growing subset of the n similarity features.
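As an illustration, a minimal sketch of the simplest variant, a linear SVM on the similarity-feature vectors, assuming scikit-learn is available (this is not the exact configuration of the methods compared below):

```python
import numpy as np
from sklearn.svm import LinearSVC

def sim_as_feature_svm(S, y, s_test):
    """Row i of S is the similarity-feature vector of training sample i;
    s_test is the test sample's similarities to the n training samples."""
    clf = LinearSVC()
    clf.fit(S, y)
    return clf.predict(np.atleast_2d(s_test))
```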

Results on six datasets:

                             Amazon     Aural Sonar  Caltech    Face Rec   Mirex     Voting (VDM)
# classes                    47         2            101        139        10        2
# samples                    204        100          8677       945        3090      435

SVM (clip)                   81.24      13.00        33.49      4.18       57.83     4.89
SVM sim-as-feature (linear)  76.10      14.25        38.18      4.29       55.54     5.40
SVM sim-as-feature (RBF)     75.98      14.25        38.16      3.92       55.72     5.52
P-SVM                        70.12      14.25        34.23      4.05       63.81     5.34

The same comparison with a local baseline added, SVM-KNN (clip) (Zhang et al., 2006): 17.56, 13.75, 36.82, 4.23, 61.25, and 5.23 on the six datasets, respectively.

Approaches to Similarity-based Classification: next, weighted k-NN.

Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors (an algorithmic parallel of the exemplar model of human learning):

$$\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^k w_i \, \mathbf{1}\{y_i = g\}$$

For $w_i \geq 0$ and $\sum_i w_i = 1$, we also get a class posterior estimate:

$$\hat{P}(Y = g \mid X = x) = \sum_{i=1}^k w_i \, \mathbf{1}\{y_i = g\}$$

Good for asymmetric costs, good for interpretation, good for system integration.
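A minimal sketch of such a weighted vote, using normalized similarities (plain affinity weights) as one possible choice of weights:

```python
import numpy as np

def weighted_knn_vote(sims, labels):
    """sims: similarities of the test point to its k neighbors; labels: their classes."""
    w = np.maximum(np.asarray(sims, dtype=float), 0.0)
    w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
    labels = np.asarray(labels)
    posterior = {g: w[labels == g].sum() for g in np.unique(labels)}
    y_hat = max(posterior, key=posterior.get)   # arg max_g sum_i w_i 1{y_i = g}
    return y_hat, posterior
```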

Design Goals for the Weights (Chen et al., JMLR 2009)

Design Goal 1 (Affinity): wi should be an increasing function of ψ(x, xi).

Design Goal 2 (Diversity): wi should be a decreasing function of ψ(xi, xj).

Linear Interpolation Weights

Linear interpolation weights can meet these goals:

$$\sum_i w_i x_i = x, \quad \text{such that } w_i \geq 0, \ \sum_i w_i = 1$$

But the solution may be non-unique, or, when x lies outside the convex hull of the neighbors, there may be no solution at all.

LIME Weights

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$$\underset{w}{\text{minimize}} \ \ \Big\| \sum_{i=1}^k w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^k w_i \log w_i \quad \text{subject to} \ \sum_{i=1}^k w_i = 1, \ w_i \geq 0, \ i = 1, \dots, k.$$

The maximum-entropy regularizer pushes the weights toward being equal; it yields an exponential-form solution, is consistent (Friedlander & Gupta, IEEE IT 2005), and averages out noise.
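A minimal sketch of LIME weights solved with a generic constrained optimizer (the original work has its own solver; SLSQP is used here only for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def lime_weights(X, x, lam=1.0):
    """X: (k, d) neighbor feature vectors, x: (d,) test point, lam: entropy weight."""
    k = X.shape[0]

    def objective(w):
        resid = X.T @ w - x
        entropy_term = np.sum(w * np.log(np.clip(w, 1e-12, None)))  # sum_i w_i log w_i
        return resid @ resid + lam * entropy_term

    w0 = np.full(k, 1.0 / k)
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(1e-12, 1.0)] * k,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x
```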

Kernelize Linear Interpolation (Chen et al., JMLR 2009)

Start from the LIME objective, let $X = [x_1 \ \dots \ x_k]$, rewrite with matrices, and change the entropy regularizer to a ridge regularizer (which regularizes the variance of the weights):

$$\underset{w}{\text{minimize}} \ \ \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to} \ w \geq 0, \ \mathbf{1}^T w = 1.$$

The objective only needs inner products, so we can replace them with a kernel, or with similarities!

KRI Weights Satisfy the Design Goals

Kernel ridge interpolation (KRI) weights:

$$\underset{w}{\text{minimize}} \ \ \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to} \ w \geq 0, \ \mathbf{1}^T w = 1.$$

Affinity: $s = [\psi(x, x_1) \ \dots \ \psi(x, x_n)]^T$, so $w_i$ is high if $\psi(x, x_i)$ is high.

Diversity: $\tfrac{1}{2} w^T S w = \tfrac{1}{2} \sum_{i,j} \psi(x_i, x_j) w_i w_j$, which penalizes putting weight on mutually similar neighbors.

Make S PSD and the problem is a QP with box constraints; it can be solved with SMO.
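A minimal sketch of the KRI weights using a generic solver instead of SMO (for illustration only):

```python
import numpy as np
from scipy.optimize import minimize

def kri_weights(S, s, lam=1.0):
    """S: (k, k) neighbor-neighbor similarities (made PSD), s: (k,) test-to-neighbor sims."""
    k = len(s)

    def objective(w):
        return 0.5 * w @ S @ w - s @ w + 0.5 * lam * (w @ w)

    w0 = np.full(k, 1.0 / k)
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x
```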

KRR Weights

Remove the constraints on the KRI weights; one can show the result is equivalent to local ridge regression, giving the KRR weights:

$$w_{\mathrm{KRR}} = \arg\min_w \ \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \ = \ (S + \lambda I)^{-1} s.$$

Weighted k-NN: Example 1

$$S = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}$$

$$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\ \mathbf{1}^T w = 1} \ \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$$

[Figures: KRI weights (left) and KRR weights (right), $w_1, \dots, w_4$, as a function of $\lambda$.]
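A quick numerical check of Example 1 (a sketch; since $S = 5I$ here, the KRR weights reduce to $s / (5 + \lambda)$, i.e. proportional to the test similarities for every $\lambda$):

```python
import numpy as np

S = 5.0 * np.eye(4)
s = np.array([4.0, 3.0, 2.0, 1.0])
for lam in (0.01, 1.0, 100.0):
    w_krr = np.linalg.solve(S + lam * np.eye(4), s)   # (S + lam I)^{-1} s
    print(lam, w_krr)                                  # equals s / (5 + lam)
```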

Weighted k-NN: Example 2

Same $w_{\mathrm{KRI}}$ and $w_{\mathrm{KRR}}$ definitions as in Example 1, with

$$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}$$

[Figures: KRI weights (left) and KRR weights (right), $w_1, \dots, w_4$, as a function of $\lambda$.]

Weighted k-NN: Example 3

Same setup, with

$$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 2 \\ 4 \\ 3 \\ 3 \end{bmatrix}$$

[Figures: KRI weights (left) and KRR weights (right), $w_1, \dots, w_4$, as a function of $\lambda$.]

Full results, local vs. global methods:

                              Amazon-47  Aural Sonar  Caltech-101  Face Rec  Mirex   Voting
# samples                        204        100          8677        945     3090     435
# classes                         47          2           101        139       10       2

LOCAL
k-NN                           16.95      17.00         41.55       4.23    61.21    5.80
affinity k-NN                  15.00      15.00         39.20       4.23    61.15    5.86
KRI k-NN (clip)                17.68      14.00         30.13       4.15    61.20    5.29
KRR k-NN (pinv)                16.10      15.25         29.90       4.31    61.18    5.52
SVM-KNN (clip)                 17.56      13.75         36.82       4.23    61.25    5.23

GLOBAL
SVM sim-as-kernel (clip)       81.24      13.00         33.49       4.18    57.83    4.89
SVM sim-as-feature (linear)    76.10      14.25         38.18       4.29    55.54    5.40
SVM sim-as-feature (RBF)       75.98      14.25         38.16       3.92    55.72    5.52
P-SVM                          70.12      14.25         34.23       4.05    63.81    5.34


Approaches to Similarity-based Classification: finally, generative models (SDA).

Generative Classifiers

Model the probability of what you see given each class: linear discriminant analysis, quadratic discriminant analysis, Gaussian mixture models, ... Pro: produces class probabilities.

Our goal: model $P(T(s) \mid g)$, where $T(s)$ is a vector of class-descriptive statistics of $s$. We use $T(s) = [\psi(x, \mu_1), \psi(x, \mu_2), \dots, \psi(x, \mu_G)]$, where $\mu_h$ is a centroid for each class.

Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Model $P(T(s) \mid g)$:
– Assume the G similarities are class-conditionally independent.
– Estimate each $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given the empirical mean; the result is an exponential.
– Reduce model bias by applying the model locally (local SDA).
– Reduce estimation variance by regularizing over localities.

Regularized local SDA performance: competitive.
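A heavily simplified sketch of the SDA idea (the choice of centroid as the most central sample of each class, and the use of a plain exponential density matched to the empirical mean, are simplifying assumptions; similarities are assumed positive):

```python
import numpy as np

def fit_sda(S, y):
    """Class-conditionally independent exponential models for psi(x, mu_h)."""
    classes = np.unique(y)
    centroids = {}
    for g in classes:
        idx = np.where(y == g)[0]
        centroids[g] = idx[np.argmax(S[np.ix_(idx, idx)].sum(axis=0))]  # most central sample
    # means[g][h] = empirical mean of psi(x, mu_h) over training samples x of class g
    means = {g: {h: S[np.where(y == g)[0], centroids[h]].mean() for h in classes}
             for g in classes}
    priors = {g: float(np.mean(y == g)) for g in classes}
    return centroids, means, priors

def sda_classify(s_test, centroids, means, priors):
    def log_lik(g):
        ll = np.log(priors[g])
        for h, c in centroids.items():
            m = means[g][h]                    # mean-matched exponential for class g
            ll += -np.log(m) - s_test[c] / m   # log density at psi(x, mu_h)
        return ll
    return max(means, key=log_lik)
```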

Some Conclusions

– Performance depends heavily on the oddities of each dataset.
– Weighted k-NN with affinity-diversity weights works well.
– Preliminary: regularized local SDA works well.
– Probabilities are useful.
– Local models are useful: less approximating (it is hard to model the entire space or its underlying manifold), and always feasible.


Lots of Open Questions

– Making S PSD
– Fast k-NN search for similarities
– Similarity-based regression
– Relationship with learning on graphs
– Trying it out on real data
– Fusion with Euclidean features (see our FUSION 2009 papers)
– Open theoretical questions (Chen et al., JMLR 2009; Balcan et al., ML 2008)

Code/Data/Papers: idl.ee.washington.edu/similaritylearning
See also: "Similarity-based Classification" by Chen et al., JMLR 2009.

Training and Test Consistency

For a test sample x with $s = [\psi(x, x_1) \ \dots \ \psi(x, x_n)]^T$, shall we classify x as $\hat{y} = \mathrm{sgn}((c^\star)^T s + b^\star)$?

No! If a training sample were re-used as a test sample, its predicted class could change.

Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Amazon, Aural Sonar, and Protein datasets.]

[Figures: similarity matrices and eigenvalue spectra for the Voting, Yeast-5-7, and Yeast-5-12 datasets.]

SVM Review

Empirical risk minimization (ERM) with regularization:

$$\underset{f \in \mathcal{H}_K}{\text{minimize}} \ \ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2$$

Hinge loss: $L(f(x), y) = \max(1 - y f(x), 0)$.

SVM primal:

$$\underset{c, b, \xi}{\text{minimize}} \ \ \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c \quad \text{subject to} \ \mathrm{diag}(y)(K c + b \mathbf{1}) \geq \mathbf{1} - \xi, \ \xi \geq 0.$$

[Figure: hinge loss vs. 0-1 loss as a function of $y f(x)$.]
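A minimal sketch of the similarities-as-kernel SVM, reusing the earlier make_psd() clip and scikit-learn's precomputed-kernel interface (strictly, the test similarities should be transformed consistently with the training clip, which this sketch ignores):

```python
from sklearn.svm import SVC
# assumes make_psd() from the spectrum-modification sketch above

def svm_sim_as_kernel(S_train, y_train, S_test):
    """S_train: n x n training similarities; S_test: (m, n) test-to-training similarities."""
    K_train = make_psd(S_train, method="clip")
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf.predict(S_test)
```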

Learning the Kernel Matrix

Find the best K for classification, regularized toward S:

$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$

SVM that learns the full kernel matrix:

$$\underset{c, b, \xi, K}{\text{minimize}} \ \ \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to} \ \mathrm{diag}(y)(K c + b \mathbf{1}) \geq \mathbf{1} - \xi, \ \xi \geq 0, \ K \succeq 0.$$

Related Work

SVM dual:

$$\underset{\alpha}{\text{maximize}} \ \ \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha \quad \text{subject to} \ y^T \alpha = 0, \ 0 \leq \alpha \leq C \mathbf{1}.$$

Robust SVM (Luss & d'Aspremont, 2007):

$$\underset{\alpha}{\text{maximize}} \ \min_{K \succeq 0} \Big( \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2 \Big) \quad \text{subject to} \ y^T \alpha = 0, \ 0 \leq \alpha \leq C \mathbf{1}.$$

"This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K."

Related Work

Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T \alpha = 0, \ 0 \leq \alpha \leq C \mathbf{1}\}$ and rewrite the robust SVM as

$$\max_{\alpha \in \mathcal{A}} \ \min_{K \succeq 0} \ \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2.$$

Theorem (Sion, 1958). Let M and N be convex spaces, one of which is compact, and let $f(\mu, \nu)$ be a function on $M \times N$ that is quasiconcave in $\mu$, quasiconvex in $\nu$, upper semi-continuous in $\mu$ for each $\nu \in N$, and lower semi-continuous in $\nu$ for each $\mu \in M$. Then

$$\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).$$

By Sion's minimax theorem, the robust SVM is equivalent to

$$\min_{K \succeq 0} \ \max_{\alpha \in \mathcal{A}} \ \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2.$$

Compare with learning the kernel matrix:

$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$

[Figure: primal objective $f(x)$ and dual objective $g(\lambda)$ with zero duality gap.]

Learning the Kernel Matrix

It is not trivial to directly solve:

$$\underset{c, b, \xi, K}{\text{minimize}} \ \ \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to} \ \mathrm{diag}(y)(K c + b \mathbf{1}) \geq \mathbf{1} - \xi, \ \xi \geq 0, \ K \succeq 0.$$

Lemma (Generalized Schur Complement). Let $K \in \mathbb{R}^{n \times n}$, $z \in \mathbb{R}^n$, and $u \in \mathbb{R}$. Then

$$\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0$$

if and only if $K \succeq 0$, $z$ is in the range of $K$, and $u - z^T K^\dagger z \geq 0$.

Let $z = K c$, and notice that $c^T K c = z^T K^\dagger z$ since $K K^\dagger K = K$.

However, it can be expressed as a convex conic program:

$$\begin{aligned}
\underset{z, b, \xi, K, u, v}{\text{minimize}} \quad & \tfrac{1}{n} \mathbf{1}^T \xi + \eta u + \gamma v \\
\text{subject to} \quad & \mathrm{diag}(y)(z + b \mathbf{1}) \geq \mathbf{1} - \xi, \quad \xi \geq 0, \\
& \begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0, \quad \|K - S\|_F \leq v.
\end{aligned}$$

We can recover the optimal $c^\star$ by $c^\star = (K^\star)^\dagger z^\star$.
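A rough sketch of this conic program in cvxpy (illustrative only; eta and gamma mirror the slides' η and γ, y is assumed to be ±1 labels, and SCS is one SDP-capable solver shipped with cvxpy):

```python
import cvxpy as cp
import numpy as np

def learn_kernel_svm(S, y, eta=1.0, gamma=1.0):
    n = len(y)
    M = cp.Variable((n + 1, n + 1), PSD=True)     # encodes [[K, z], [z^T, u]] >> 0
    K, z, u = M[:n, :n], M[:n, n], M[n, n]
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    v = cp.Variable()
    constraints = [cp.multiply(y, z + b) >= 1 - xi,   # diag(y)(z + b1) >= 1 - xi
                   cp.norm(K - S, "fro") <= v]
    objective = cp.Minimize(cp.sum(xi) / n + eta * u + gamma * v)
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    c = np.linalg.pinv(K.value) @ z.value             # recover c* = (K*)^dagger z*
    return K.value, c, b.value
```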

Learning the Spectrum Modification

Concerns about learning the full kernel matrix:
– Though the problem is convex, the number of variables is O(n²).
– The flexibility of the model may lead to overfitting.
