Similarity-based Classifiers:
Problems and Solutions
Classifying based on similarities: Van Gogh or Monet?

[Figure: paintings to be attributed to Van Gogh or Monet.]
The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$ (paintings), $y_i \in G$ (painters), $i = 1, \ldots, n$
Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$
Training similarities: $S = [\psi(x_i, x_j)]_{n \times n}$, $y = [y_1 \ \ldots \ y_n]^T$
Test similarities: $s = [\psi(x, x_1) \ \ldots \ \psi(x, x_n)]^T$ and $\psi(x, x)$

Problem: Estimate the class label $y$ for the test sample $x$ given $S$, $y$, $s$, and $\psi(x, x)$.
Examples of Similarity Functions

Computational biology:
– Smith-Waterman algorithm (Smith & Waterman, 1981)
– FASTA algorithm (Lipman & Pearson, 1985)
– BLAST algorithm (Altschul et al., 1990)

Computer vision:
– Tangent distance (Duda et al., 2001)
– Earth mover's distance (Rubner et al., 2000)
– Shape matching distance (Belongie et al., 2002)
– Pyramid match kernel (Grauman & Darrell, 2007)

Information retrieval:
– Levenshtein distance (Levenshtein, 1966)
– Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)
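Several of the measures listed are easy to sketch concretely; for instance, the Levenshtein distance is a short dynamic program (this implementation and its name are ours, not from the slides):

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` returns 3. Note this is a dissimilarity; similarity-based methods often negate or rescale such distances.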
Approaches to Similarity-based Classification

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

[Diagram: families of approaches to the problem: similarities as kernels (MDS, SVM), similarities as features, weighted k-NN, generative models (SDA), and theory.]
Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Properties of an inner product $\langle x, z \rangle$:
– conjugate symmetric (real-valued and symmetric for real vector spaces)
– linear: $\langle ax, z \rangle = a \langle x, z \rangle$
– positive definite: $\langle x, x \rangle > 0$ unless $x = 0$

Example inner product: $\langle x, z \rangle = x^T z$.
An inner product implies a norm: $\|x\| = \sqrt{\langle x, x \rangle}$.

Inner products are similarities. But are our notions of similarity always inner products? No!
Example: Amazon similarity

$\Omega$ = space of all books; $\psi(A, B)$ = % of customers who buy book A after viewing book B on Amazon.

[Figure: the 96 × 96 similarity matrix $S$ over 96 books.]

Inner product-like?
Example: Amazon similarity

$\psi(\text{HTF}, \text{Bishop}) = 3$ but $\psi(\text{Bishop}, \text{HTF}) = 8$: the similarity is asymmetric!

[Figure: the 96 × 96 similarity matrix $S$, and a plot of its eigenvalues against eigenvalue rank.]
Example: Amazon similarity

[Figure: the eigenvalue spectrum of $S$ contains negative eigenvalues.]

$S$ is not PSD!
Well, let's just make S be a kernel matrix

First, symmetrize: $S \leftarrow \frac{1}{2}(S + S^T)$, so that $S = U \Lambda U^T$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.

Clip: $S_{\mathrm{clip}} = U\, \mathrm{diag}(\max(\lambda_1, 0), \ldots, \max(\lambda_n, 0))\, U^T$

$S_{\mathrm{clip}}$ is the PSD matrix closest to $S$ in terms of the Frobenius norm.

[Figure: $S$ and its projection $S_{\mathrm{clip}}$ onto the PSD cone.]
Flip: $S_{\mathrm{flip}} = U\, \mathrm{diag}(|\lambda_1|, \ldots, |\lambda_n|)\, U^T$
(similar effect: $S_{\mathrm{new}} = S^T S$)
Shift: $S_{\mathrm{shift}} = U (\Lambda + |\min(\lambda_{\min}(S), 0)|\, I)\, U^T$
Flip, clip, or shift? The best bet is clip.
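A minimal NumPy sketch of the symmetrize-then-modify-the-spectrum recipes above (the function name `make_psd` is ours; the three variants follow the clip, flip, and shift formulas on the preceding slides):

```python
import numpy as np

def make_psd(S, method="clip"):
    """Symmetrize S, eigendecompose, and modify the spectrum to get a PSD matrix."""
    S = 0.5 * (S + S.T)                 # symmetrize: S <- (S + S^T) / 2
    lam, U = np.linalg.eigh(S)          # S = U diag(lam) U^T
    if method == "clip":                # zero out negative eigenvalues
        lam = np.maximum(lam, 0.0)      # (closest PSD matrix in Frobenius norm)
    elif method == "flip":              # flip negative eigenvalues positive
        lam = np.abs(lam)
    elif method == "shift":             # shift the whole spectrum up
        lam = lam + abs(min(lam.min(), 0.0))
    else:
        raise ValueError(method)
    return (U * lam) @ U.T              # U diag(lam) U^T
```

All three return a symmetric PSD matrix; only clip is the Frobenius-norm projection onto the PSD cone.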
Alternatively, learn the best kernel matrix for the SVM (Luss & d'Aspremont, NIPS 2007; Chen et al., ICML 2009):

$\min_{K \succeq 0}\ \min_{f \in \mathcal{H}_K}\ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$
Let the similarities to the training samples be features

– SVM (Graepel et al., 1998; Liao & Noble, 2003)
– Linear programming (LP) machine (Graepel et al., 1999)
– Linear discriminant analysis (LDA) (Pekalska et al., 2001)
– Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
– Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008)

Let $[\psi(x, x_1) \ \ldots \ \psi(x, x_n)]^T \in \mathbb{R}^n$ be the feature vector for $x$, e.g.

$\min_{\alpha}\ \frac{1}{2} \|y - S\alpha\|_2^2 + \epsilon \|\alpha\|_1 + \gamma \|\alpha\|_1$

Does this work asymptotically? Our results suggest you need to choose a slow-growing subset of the $n$ training similarities as features.
Percent test error:

                             Amazon-47  Aural Sonar  Caltech-101  Face Rec   Mirex   Voting
# classes                        47          2           101         139       10       2
# samples                     n = 204    n = 100      n = 8677     n = 945  n = 3090  n = 435
SVM-kNN (clip)
  (Zhang et al., 2006)         17.56      13.75        36.82        4.23     61.25    5.23
SVM (clip)                     81.24      13.00        33.49        4.18     57.83    4.89
SVM sim-as-feature (linear)    76.10      14.25        38.18        4.29     55.54    5.40
SVM sim-as-feature (RBF)       75.98      14.25        38.16        3.92     55.72    5.52
P-SVM                          70.12      14.25        34.23        4.05     63.81    5.34
Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors:

$\hat{y} = \arg\max_{g \in G} \sum_{i=1}^k w_i\, I_{\{y_i = g\}}$

This is an algorithmic parallel of the exemplar model of human learning.

For $w_i \ge 0$ and $\sum_i w_i = 1$, we also get a class posterior estimate:

$\hat{P}(Y = g \mid X = x) = \sum_{i=1}^k w_i\, I_{\{y_i = g\}}$

Good for asymmetric costs, for interpretation, and for system integration.
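A sketch of the weighted vote (the function and variable names are ours); when the weights are nonnegative and sum to one, the per-class score doubles as the class posterior estimate described above:

```python
from collections import defaultdict

def weighted_vote(weights, labels):
    """Weighted k-NN vote: y_hat = argmax_g sum_i w_i * 1{y_i = g}.
    If w_i >= 0 and sum_i w_i = 1, scores[g] estimates P(Y = g | X = x)."""
    scores = defaultdict(float)
    for w, y in zip(weights, labels):
        scores[y] += w              # accumulate weight per class
    y_hat = max(scores, key=scores.get)
    return y_hat, dict(scores)
```

For example, with neighbor weights [0.5, 0.3, 0.2] and labels ["monet", "vangogh", "monet"], the vote returns "monet" with posterior estimate 0.7.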
Design Goals for the Weights (Chen et al., JMLR 2009)

Design Goal 1 (Affinity): $w_i$ should be an increasing function of $\psi(x, x_i)$.

Design Goal 2 (Diversity): $w_i$ should be a decreasing function of $\psi(x_i, x_j)$ for $j \neq i$.
Linear Interpolation Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \ge 0$, $\sum_i w_i = 1$.

But the solution can be non-unique (when $x$ lies inside the convex hull of its neighbors) or can fail to exist (when $x$ lies outside it).
LIME Weights

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

minimize over $w$:  $\left\| \sum_{i=1}^k w_i x_i - x \right\|_2^2 + \lambda \sum_{i=1}^k w_i \log w_i$
subject to:  $\sum_{i=1}^k w_i = 1$, $w_i \ge 0$, $i = 1, \ldots, k$.

The maximum-entropy term pushes the weights toward being equal. The maximum-entropy solution is exponential, is consistent (Friedlander & Gupta, IEEE IT 2005), and averages out noise.
Kernelize Linear Interpolation (Chen et al., JMLR 2009)

LIME weights:

minimize over $w$:  $\left\| \sum_{i=1}^k w_i x_i - x \right\|_2^2 + \lambda \sum_{i=1}^k w_i \log w_i$
subject to:  $\sum_{i=1}^k w_i = 1$, $w_i \ge 0$, $i = 1, \ldots, k$.

Let $X = [x_1, \ldots, x_k]$, rewrite with matrices, and change to a ridge regularizer (which regularizes the variance of the weights):

minimize over $w$:  $\frac{1}{2} w^T X^T X w - x^T X w + \frac{\lambda}{2} w^T w$
subject to:  $w \ge 0$, $\mathbf{1}^T w = 1$.

This formulation only needs inner products, so we can replace them with kernels or similarities!
KRI Weights Satisfy the Design Goals

Kernel ridge interpolation (KRI) weights:

minimize over $w$:  $\frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w$
subject to:  $w \ge 0$, $\mathbf{1}^T w = 1$.

Affinity: $s = [\psi(x, x_1) \ \ldots \ \psi(x, x_n)]^T$, so $w_i$ is high if $\psi(x, x_i)$ is high.

Diversity: $\frac{1}{2} w^T S w = \frac{1}{2} \sum_{i,j} \psi(x_i, x_j) w_i w_j$, which penalizes putting weight on pairs of similar neighbors.

Make $S$ PSD and the problem is a QP with box constraints, solvable with SMO.

Removing the constraints on the weights yields a closed form that can be shown equivalent to local ridge regression (KRR weights):

$w = (S + \lambda I)^{-1} s$
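Both weightings can be sketched in NumPy. KRR is the closed form above; for the KRI quadratic program we use projected gradient descent on the simplex as an illustrative solver (our choice; the slides note the QP can also be solved with SMO):

```python
import numpy as np

def krr_weights(S, s, lam):
    """Unconstrained (KRR) solution: w = (S + lam I)^{-1} s."""
    return np.linalg.solve(S + lam * np.eye(len(s)), s)

def _project_simplex(v):
    """Euclidean projection onto {w : w >= 0, 1^T w = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def kri_weights(S, s, lam, iters=2000):
    """Minimize 0.5 w^T S w - s^T w + (lam/2) w^T w over the simplex
    by projected gradient descent (gradient: (S + lam I) w - s)."""
    Q = S + lam * np.eye(len(s))
    step = 1.0 / np.linalg.norm(Q, 2)   # 1 / largest eigenvalue of Q
    w = np.full(len(s), 1.0 / len(s))
    for _ in range(iters):
        w = _project_simplex(w - step * (Q @ w - s))
    return w
```

With a diagonal $S = 5I$ and $s = [4, 3, 2, 1]^T$ (as in Example 1 below) and $\lambda = 1$, KRR gives $w = s/6$, while KRI clips the least similar neighbor's weight to zero.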
Weighted k-NN: Example 1

$S = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}$

$w_{\mathrm{KRI}} = \arg\min_{w \ge 0,\ \mathbf{1}^T w = 1}\ \frac{1}{2} w^T S w - s^T w + \frac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$

[Figure: the KRI weights and the KRR weights $w_1, \ldots, w_4$ as functions of $\lambda$.]
Weighted k-NN: Example 2

$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}$

[Figure: the KRI and KRR weights as functions of $\lambda$; $w_2 = w_3$ by symmetry.]
Weighted k-NN: Example 3

$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 2 \\ 4 \\ 3 \\ 3 \end{bmatrix}$

[Figure: the KRI and KRR weights as functions of $\lambda$; the unconstrained KRR weights can go negative.]
Percent test error:

                             Amazon-47  Aural Sonar  Caltech-101  Face Rec   Mirex   Voting
# samples                       204        100          8677         945     3090     435
# classes                        47          2           101         139       10       2
LOCAL
k-NN                           16.95      17.00        41.55        4.23     61.21    5.80
affinity k-NN                  15.00      15.00        39.20        4.23     61.15    5.86
KRI k-NN (clip)                17.68      14.00        30.13        4.15     61.20    5.29
KRR k-NN (pinv)                16.10      15.25        29.90        4.31     61.18    5.52
SVM-KNN (clip)                 17.56      13.75        36.82        4.23     61.25    5.23
GLOBAL
SVM sim-as-kernel (clip)       81.24      13.00        33.49        4.18     57.83    4.89
SVM sim-as-feature (linear)    76.10      14.25        38.18        4.29     55.54    5.40
SVM sim-as-feature (RBF)       75.98      14.25        38.16        3.92     55.72    5.52
P-SVM                          70.12      14.25        34.23        4.05     63.81    5.34
Generative Classifiers

Model the probability of what you see given each class:
– linear discriminant analysis
– quadratic discriminant analysis
– Gaussian mixture models
– ...

Pro: produces class probabilities.

Our goal: model $P(T(s) \mid g)$, where $T(s)$ collects class-descriptive statistics of $s$. We use $T(s) = [\psi(x, \mu_1), \psi(x, \mu_2), \ldots, \psi(x, \mu_G)]$, where $\mu_h$ is a centroid for each class $h$.
Similarity Discriminant Analysis (Cazzanti & Gupta, ICML 2007, 2008, 2009)

Model $P(T(s) \mid g)$:
– Assume the $G$ similarities are class-conditionally independent.
– Estimate $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given the empirical mean; the result is exponential.
– Reduce model bias by applying the model locally (local SDA).
– Reduce estimation variance by regularizing over localities.

Performance of regularized local SDA: competitive.
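The per-class maximum-entropy step can be sketched as follows, assuming the similarity statistic lives on $[0, \infty)$: the max-entropy density matching a fixed mean $m$ is then the exponential density $(1/m) e^{-t/m}$. (On a bounded similarity range the result is instead a truncated exponential; this sketch and its names are ours.)

```python
import math

def fit_maxent_exponential(sims):
    """Max-entropy density on [0, inf) matching the empirical mean m of the
    observed similarities for one class: p(t) = (1/m) exp(-t/m)."""
    m = sum(sims) / len(sims)  # empirical mean of the similarity statistic
    return lambda t: math.exp(-t / m) / m
```

With per-class densities in hand, classification proceeds by comparing the (class-conditionally independent) likelihoods of the test statistics.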
Some Conclusions

– Performance depends heavily on the oddities of each dataset.
– Weighted k-NN with affinity-diversity weights works well.
– Preliminary: regularized local SDA works well.
– Probabilities are useful.
– Local models are useful: they require less approximating (it is hard to model the entire space; is there an underlying manifold?) and are always feasible.
Lots of Open Questions

– Making $S$ PSD
– Fast k-NN search for similarities
– Similarity-based regression
– The relationship with learning on graphs
– Trying it out on real data
– Fusion with Euclidean features (see our FUSION 2009 papers)
– Open theoretical questions (Chen et al., JMLR 2009; Balcan et al., ML 2008)

Code/Data/Papers: idl.ee.washington.edu/similaritylearning
See "Similarity-based Classification" by Chen et al., JMLR 2009.
Training and Test Consistency

For a test sample $x$, given $s = [\psi(x, x_1) \ \ldots \ \psi(x, x_n)]^T$, shall we classify $x$ as $\hat{y} = \mathrm{sgn}((c^\star)^T s + b^\star)$?

No! If a training sample were used as a test sample, its class could change!
Data Sets

[Figure: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Amazon, Aural Sonar, and Protein data sets.]
Data Sets

[Figure: similarity matrices and eigenvalue spectra for the Voting, Yeast-5-7, and Yeast-5-12 data sets.]
SVM Review

Empirical risk minimization (ERM) with regularization:

$\min_{f \in \mathcal{H}_K}\ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2$

Hinge loss: $L(f(x), y) = \max(1 - y f(x), 0)$

SVM primal:

minimize over $c, b, \xi$:  $\frac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c$
subject to:  $\mathrm{diag}(y)(K c + b \mathbf{1}) \ge \mathbf{1} - \xi$, $\xi \ge 0$.

[Figure: the hinge loss and the 0-1 loss as functions of the margin $y f(x)$.]
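The hinge loss above in one line (the helper name is ours):

```python
def hinge_loss(fx, y):
    """L(f(x), y) = max(1 - y f(x), 0): zero once the margin y f(x) >= 1."""
    return max(1.0 - y * fx, 0.0)
```

Unlike the 0-1 loss, it is convex and penalizes correct predictions that fall inside the margin.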
Learning the Kernel Matrix

Find the best $K$ for classification, regularized toward $S$:

$\min_{K \succeq 0}\ \min_{f \in \mathcal{H}_K}\ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$

An SVM that learns the full kernel matrix:

minimize over $c, b, \xi, K$:  $\frac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F$
subject to:  $\mathrm{diag}(y)(K c + b \mathbf{1}) \ge \mathbf{1} - \xi$, $\xi \ge 0$, $K \succeq 0$.
Related Work

SVM dual:

maximize over $\alpha$:  $\mathbf{1}^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha$
subject to:  $y^T \alpha = 0$, $0 \le \alpha \le C \mathbf{1}$.

Robust SVM (Luss & d'Aspremont, 2007):

maximize over $\alpha$:  $\min_{K \succeq 0} \left( \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2 \right)$
subject to:  $y^T \alpha = 0$, $0 \le \alpha \le C \mathbf{1}$.

"This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K."
Related Work

Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T \alpha = 0,\ 0 \le \alpha \le C \mathbf{1}\}$ and rewrite the robust SVM as

$\max_{\alpha \in \mathcal{A}}\ \min_{K \succeq 0}\ \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2$

Theorem (Sion, 1958). Let $M$ and $N$ be convex spaces, one of which is compact, and $f(\mu, \nu)$ a function on $M \times N$ that is quasiconcave in $M$, quasiconvex in $N$, upper semi-continuous in $\mu$ for each $\nu \in N$, and lower semi-continuous in $\nu$ for each $\mu \in M$. Then

$\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).$

By Sion's minimax theorem, the robust SVM is therefore equivalent to

$\min_{K \succeq 0}\ \max_{\alpha \in \mathcal{A}}\ \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2$

Compare with

$\min_{K \succeq 0}\ \min_{f \in \mathcal{H}_K}\ \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$

[Figure: primal and dual objectives, $L(x, \lambda^\star)$ or $f(x)$ vs. $L(x^\star, \lambda)$ or $g(\lambda)$, meeting with zero duality gap.]
Learning the Kernel Matrix

It is not trivial to directly solve:

minimize over $c, b, \xi, K$:  $\frac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F$
subject to:  $\mathrm{diag}(y)(K c + b \mathbf{1}) \ge \mathbf{1} - \xi$, $\xi \ge 0$, $K \succeq 0$.

Lemma (Generalized Schur Complement). Let $K \in \mathbb{R}^{n \times n}$, $z \in \mathbb{R}^n$, and $u \in \mathbb{R}$. Then

$\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0$

if and only if $K \succeq 0$, $z$ is in the range of $K$, and $u - z^T K^\dagger z \ge 0$.

Let $z = K c$, and notice that $c^T K c = z^T K^\dagger z$ since $K K^\dagger K = K$. The problem can then be expressed as a convex conic program:

minimize over $z, b, \xi, K, u, v$:  $\frac{1}{n} \mathbf{1}^T \xi + \eta u + \gamma v$
subject to:  $\mathrm{diag}(y)(z + b \mathbf{1}) \ge \mathbf{1} - \xi$, $\xi \ge 0$, $\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0$, $\|K - S\|_F \le v$.

We can recover the optimal $c^\star$ by $c^\star = (K^\star)^\dagger z^\star$.
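A quick numeric sanity check of the Schur-complement substitution ($z = Kc$, with $u$ on the boundary $u = c^T K c$), using an arbitrary seeded example of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
K = A @ A.T                  # a PSD matrix
c = rng.standard_normal(4)
z = K @ c                    # z is in the range of K by construction
u = c @ K @ c                # boundary case: u = z^T K^+ z

# Bordered matrix [[K, z], [z^T, u]] from the lemma.
M = np.block([[K, z[:, None]],
              [z[None, :], np.array([[u]])]])
```

Per the lemma, `M` is (boundary) PSD, while decreasing `u` below $c^T K c$ must break positive semidefiniteness.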
Learning the Spectrum Modification

Concerns about learning the full kernel matrix:
– Though the problem is convex, the number of variables is $O(n^2)$.
– The flexibility of the model may lead to overfitting.