17.06.13 1
Machine Learning
Support Vector Machine (SVM)
Prof. Dr. Volker Sperschneider
AG Maschinelles Lernen und Natürlichsprachliche Systeme
Institut für Informatik Technische Fakultät
Albert-Ludwigs-Universität Freiburg
17.06.13 2
SVM
I. Large margin linear separability
II. Optimization theory
III. Maximum margin classifier at work
IV. Kernel functions and kernel trick
V. SVM learnability theory
VI. Extension to soft margin
17.06.13 3
SVM
I. Large margin linear separability
17.06.13 4
Architecture
Inputs and parameters:

$x_1,\dots,x_n \in \mathbb{R}, \quad w_1,\dots,w_n \in \mathbb{R}, \quad b \in \mathbb{R}$

Net input:

$net(x,w,b) = \sum_{i=1}^{n} w_i x_i + b$

Output:

$y = \begin{cases} +1 & \text{if } net(x,w,b) \ge 0 \\ -1 & \text{if } net(x,w,b) < 0 \end{cases}$
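A minimal sketch of this unit in Python (not from the slides; the weights and the input below are made up just to show the call):

```python
import numpy as np

def net(x, w, b):
    # net(x, w, b) = sum_i w_i x_i + b
    return w @ x + b

def classify(x, w, b):
    # y = +1 if net(x, w, b) >= 0, otherwise -1
    return 1 if net(x, w, b) >= 0 else -1

# made-up parameters and input, only to demonstrate the call
w, b = np.array([1.0, -2.0, 0.5]), 0.25
print(classify(np.array([0.3, 0.1, 1.0]), w, b))
```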
17.06.13 5
Training set
$T = (x_1, d_1), (x_2, d_2), \dots, (x_l, d_l)$

$x_1, \dots, x_l \in \mathbb{R}^n$

$d_1, \dots, d_l \in \{-1, +1\}$
Set of l labelled (classified) vectors
We assume that both positive and negative
training vectors are present.
17.06.13 6
Hyperplanes and Halfspaces
$H(w,b) = \{\, x \in \mathbb{R}^n \mid w^T x + b = 0 \,\}$

$H^+(w,b) = \{\, x \in \mathbb{R}^n \mid w^T x + b > 0 \,\}$

$H^-(w,b) = \{\, x \in \mathbb{R}^n \mid w^T x + b < 0 \,\}$
17.06.13 7
Linear Separability
[Figure: separating hyperplane $H(w,b)$ with halfspaces $H^+(w,b)$ and $H^-(w,b)$; the point $-\frac{b}{\|w\|^2}\,w$ lies on the hyperplane:]

$w^T\Big(-\frac{b}{\|w\|^2}\,w\Big) + b = -\frac{b}{\|w\|^2}\,w^T w + b = -b + b = 0$
17.06.13 8
Distance of an arbitrary vector $z$ to the hyperplane is $\frac{|w^T z + b|}{\|w\|}$.

Signed distance (> 0 for vectors in the positive halfspace, < 0 for vectors in the negative halfspace) of an arbitrary vector $z$ to the hyperplane is $\frac{w^T z + b}{\|w\|}$.
17.06.13 9
Take two arbitrary vectors $x, y$ on the hyperplane: $w^T x + b = 0 = w^T y + b$.

The difference vector is perpendicular to $w$:

$w^T(x - y) = w^T x - w^T y = -b - (-b) = 0$

Thus the distance of the origin to the hyperplane is $\frac{|b|}{\|w\|}$.
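The signed distance can be computed directly; a small sketch (the example numbers are invented):

```python
import numpy as np

def signed_distance(z, w, b):
    # (w^T z + b) / ||w||, positive in H+(w, b), negative in H-(w, b)
    return (w @ z + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0
print(signed_distance(np.array([2.0, 2.0]), w, b))   # (6 + 8 - 5) / 5 = 1.8
```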
17.06.13 10
[Figure: vector $z$ in the positive halfspace decomposed as $z = u + v$ with $u$ on $H(w,b)$ (foot point $-\frac{b}{\|w\|^2}\,w$ shown) and $v$ parallel to $w$.]

$\frac{w^T z + b}{\|w\|} = \frac{w^T(u + v) + b}{\|w\|} = \frac{w^T u + b + w^T v}{\|w\|} = \frac{w^T v}{\|w\|} = \frac{\|w\|\,\|v\|}{\|w\|} = \|v\|$
17.06.13 11
[Figure: vector $z$ in the negative halfspace decomposed as $z = u - v$ with $u$ on $H(w,b)$ and $v$ parallel to $w$.]

$\frac{w^T z + b}{\|w\|} = \frac{w^T(u - v) + b}{\|w\|} = \frac{w^T u + b - w^T v}{\|w\|} = -\frac{w^T v}{\|w\|} = -\frac{\|w\|\,\|v\|}{\|w\|} = -\|v\|$
17.06.13 12
Unfavourable separating lines
17.06.13 13
Favourable separating line
17.06.13 14
due to large margin
17.06.13 15
Maximum Margin Separation
Given training set $T = (x_1,d_1),(x_2,d_2),\dots,(x_l,d_l)$, find weight vector $w$ and threshold $b$ that maximize the margin of hyperplane $H(w,b)$ w.r.t. $T$:

$\mu(w,b) = \min_{k=1,\dots,l} \frac{|w^T x_k + b|}{\|w\|}$

$\mu(w,b) \to \max_{w,b}$

$(w^*, b^*) = \arg\max_{w,b}\, \mu(w,b) = \arg\max_{w,b}\, \min_{k=1,\dots,l} \frac{|w^T x_k + b|}{\|w\|}$
17.06.13 16
Normal form of maximum margin
The double occurrence of w in the definition of the margin, in numerator and denominator, can be avoided. Simply scale w, b with a suitable factor λ > 0. The scaled parameters define the same hyperplane and halfspaces as before. Use scaled w, b such that

$\min_{k=1,\dots,l} |w^T x_k + b| = 1$
17.06.13 17
Normal form of maximum margin
Constraints after scaling:

$d_k = +1 \;\Rightarrow\; w^T x_k + b \ge +1 \quad \forall k = 1,\dots,l$
$d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \quad \forall k = 1,\dots,l$

Training vectors $x_k$ with $|w^T x_k + b| = 1$ are called support vectors.
17.06.13 18
Positive support vectors are separated from negative support vectors by a corridor of width $\frac{2}{\|w\|}$.

Exercise: Prove this! Why is the statement not completely trivial?

The term above is to be maximized under the normalized constraints. Alternatively one can minimize $\frac{1}{2}\|w\|^2$ under the normalized constraints.
17.06.13 19
The constraints can be transformed into a uniform format:

$d_k = +1 \;\Rightarrow\; w^T x_k + b \ge +1 \quad \forall k = 1,\dots,l$
$d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \quad \forall k = 1,\dots,l$

$d_k = +1 \;\Rightarrow\; -(w^T x_k + b) \le -1 \quad \forall k = 1,\dots,l$
$d_k = -1 \;\Rightarrow\; w^T x_k + b \le -1 \quad \forall k = 1,\dots,l$

$d_k = +1 \;\Rightarrow\; -d_k(w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$
$d_k = -1 \;\Rightarrow\; -d_k(w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$

$-d_k(w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$
17.06.13 20
Normal form of maximum margin
$\frac{1}{2}\|w\|^2 \to \min$

under constraints

$-d_k(w^T x_k + b) + 1 \le 0 \quad \forall k = 1,\dots,l$

Parameter b has vanished from the function to be optimized. Does this cause a problem? Does it make the optimization senseless? Why can we not simply let the norm of w tend to infinity?
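The primal problem can be handed to a generic constrained optimizer. A sketch using scipy's SLSQP solver on a small invented data set (scipy encodes inequality constraints as fun(x) >= 0, so each constraint is written as $d_k(w^T x_k + b) - 1 \ge 0$):

```python
import numpy as np
from scipy.optimize import minimize

# invented, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
d = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):               # theta = (w_1, w_2, b)
    w = theta[:2]
    return 0.5 * w @ w              # (1/2) ||w||^2

cons = [{'type': 'ineq',            # d_k (w^T x_k + b) - 1 >= 0
         'fun': lambda th, k=k: d[k] * (X[k] @ th[:2] + th[2]) - 1.0}
        for k in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=cons, method='SLSQP')
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```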
17.06.13 21
SVM
II. Optimization theory
Optimization theory is only presented (without proofs) insofar as it is required for an understanding of support vector machines. For a more detailed presentation use Martin Riedmiller's slides:
Riedmiller_svm
17.06.13 22
Convexity makes life easier
A subset $\Omega \subseteq \mathbb{R}^n$ is convex if the following holds:

$\forall x \in \Omega \;\forall y \in \Omega \;\forall \lambda \in [0,1]: \; x + \lambda(y - x) \in \Omega$

[Figure: points $x$, $x + \lambda(y-x)$, $y$ on the connecting line segment.]
17.06.13 23
Convexity makes life easier
A function f is convex if the following holds:

$f(x + \lambda(y - x)) \le f(x) + \lambda\,(f(y) - f(x)) \quad \forall \lambda \in [0,1]$

[Figure: graph of $f$ with the chord between $(x, f(x))$ and $(y, f(y))$.]
17.06.13 24
Convexity makes life easier
Consider a convex function $f: \Omega \to \mathbb{R}$ on a convex domain.

A local minimum is a vector $x \in \Omega$ such that:

$\exists r > 0 \;\forall y \;\big(y \in \Omega \wedge \|x - y\| \le r \;\Rightarrow\; f(x) \le f(y)\big)$

A global minimum is a vector $x \in \Omega$ such that:

$\forall y \;\big(y \in \Omega \;\Rightarrow\; f(x) \le f(y)\big)$
17.06.13 25
Convexity makes life easier
Consider a convex function $f: \Omega \to \mathbb{R}$ on a convex domain.

Theorem: Every local minimum is a global minimum.

Proof:
• Let $x \in \Omega$ be a local minimum and $y \in \Omega$ arbitrary.
• Choose $\lambda$ with $0 < \lambda \le 1$ small enough such that $f(x) \le f(x + \lambda(y - x))$.
17.06.13 26
Convexity makes life easier
Using convexity we conclude:

$f(x) \le f(x + \lambda(y - x)) \le f(x) + \lambda\,(f(y) - f(x))$
$0 \le \lambda\,(f(y) - f(x))$
$0 \le f(y) - f(x)$
$f(x) \le f(y)$
17.06.13 27
Convexity makes life easier
Examples of convex functions:
• linear functions – trivial
• affine functions (= linear + constant) – trivial
• square function (1-dimensional) – proof follows
• sum of convex functions
• squared Euclidean norm (n-dimensional) – from the results above
• convex function scaled with a positive factor – easy
17.06.13 28
Convexity makes life easier
The square function is convex. Consider $x \ne y$ and $0 < \lambda < 1$:

$(x + \lambda(y - x))^2 \le x^2 + \lambda\,(y^2 - x^2)$
$\Leftrightarrow\; x^2 + 2\lambda x(y - x) + \lambda^2(y - x)^2 \le x^2 + \lambda\,(y - x)(y + x)$
$\Leftrightarrow\; 2\lambda x(y - x) + \lambda^2(y - x)^2 \le \lambda\,(y - x)(y + x)$
$\Leftrightarrow\; 2x(y - x) + \lambda(y - x)^2 \le (y - x)(y + x)$
$\Leftrightarrow\; \lambda(y - x)^2 \le (y - x)(y + x) - 2x(y - x) = (y - x)^2$
$\Leftrightarrow\; \lambda \le 1$
$\Leftrightarrow\;$ true
17.06.13 29
Minimization under equalities
Differentiable function to be minimized: $f: \Omega \to \mathbb{R}$

Equality constraints: $h_p(x) = 0 \quad \forall p = 1,\dots,l$

Lagrange function: $L(x, \alpha_1,\dots,\alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p h_p(x)$
17.06.13 30
Minimization under equalities
A necessary condition for a minimum x under the constraints is the existence of Lagrange multipliers $\alpha_1,\dots,\alpha_l$ with:

$\nabla_x L(x, \alpha_1,\dots,\alpha_l) = \nabla_x f(x) + \sum_{p=1}^{l} \alpha_p \nabla_x h_p(x) = 0$

$h_p(x) = 0 \quad \forall p = 1,\dots,l$

Under certain conditions this is also sufficient.
17.06.13 31
Minimization under equalities
In explicit terms:

$\frac{\partial f(x)}{\partial x_i} + \sum_{p=1}^{l} \alpha_p \frac{\partial h_p(x)}{\partial x_i} = 0 \quad \forall i$

$h_p(x) = 0 \quad \forall p = 1,\dots,l$
17.06.13 32
Example 1: max area rectangle
• Find the rectangle with side lengths x and y, fixed perimeter 2x + 2y = c, and maximum area xy.
• Function to be minimized: $f(x,y) = -xy$
• Equality constraint: $2x + 2y - c = 0$
17.06.13 33
Solution: a square

$-y + 2\alpha = 0, \qquad -x + 2\alpha = 0, \qquad 2x + 2y - c = 0$

$x = y = 2\alpha, \qquad 8\alpha = c$

$\alpha = \frac{c}{8}, \qquad x = y = \frac{c}{4}$
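The same stationarity conditions can be solved symbolically; a short sketch with sympy (assuming c > 0):

```python
import sympy as sp

x, y, a, c = sp.symbols('x y alpha c', positive=True)

# Lagrange function for f(x, y) = -x*y under the constraint 2x + 2y - c = 0
L = -x * y + a * (2 * x + 2 * y - c)

sol = sp.solve([sp.diff(L, x), sp.diff(L, y), 2 * x + 2 * y - c], [x, y, a], dict=True)
print(sol)   # expected: x = c/4, y = c/4, alpha = c/8
```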
17.06.13 34
Example 2: Entropy maximization
• Function to be maximized: $f(x_1,\dots,x_n) = -\sum_{k=1}^{n} x_k \log x_k$, with $f: [0,1]^n \to \mathbb{R}$
• Equality constraint: $\sum_{k=1}^{n} x_k - 1 = 0$
17.06.13 35
Solution
$L(x_1,\dots,x_n,\alpha) = -\sum_{k=1}^{n} x_k \log x_k + \alpha\Big(\sum_{k=1}^{n} x_k - 1\Big)$

$\frac{\partial L(x_1,\dots,x_n,\alpha)}{\partial x_1} = -\log x_1 - \log e + \alpha = 0$
$\quad\vdots$
$\frac{\partial L(x_1,\dots,x_n,\alpha)}{\partial x_n} = -\log x_n - \log e + \alpha = 0$

$\sum_{k=1}^{n} x_k = 1$
17.06.13 36
$\log x_1 + \log e = \alpha = \dots = \log x_n + \log e \;\Rightarrow\; x_1 = \dots = x_n$

$\sum_{k=1}^{n} x_k = 1 \;\Rightarrow\; x_1 = \dots = x_n = \frac{1}{n}$

Solution: the uniform probability distribution
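The same result can be checked numerically with a generic solver; a sketch (n = 4 is an arbitrary choice, and a tiny lower bound keeps log x defined):

```python
import numpy as np
from scipy.optimize import minimize

n = 4

def neg_entropy(x):
    return np.sum(x * np.log(x))                 # minimizing this maximizes the entropy

cons = [{'type': 'eq', 'fun': lambda x: np.sum(x) - 1.0}]   # sum_k x_k = 1
bnds = [(1e-9, 1.0)] * n                                     # x_k in (0, 1]

x0 = np.random.dirichlet(np.ones(n))             # random starting point on the simplex
res = minimize(neg_entropy, x0=x0, bounds=bnds, constraints=cons, method='SLSQP')
print(res.x)                                     # approximately 1/n in every component
```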
17.06.13 37
Example 3: Likelihood maximization
• A random process with k independent possible events is observed, with numbers of occurrences $n_1,\dots,n_k$ for the events.
• If the probabilities of the events were known to be $p_1,\dots,p_k$,
• then the likelihood of this probability model under the observations above is defined by the likelihood function and maximized under an equality constraint (both given on the next slide).
17.06.13 38
• Likelihood function to be maximized:

$f(p_1,\dots,p_k) = L(n_1,\dots,n_k, p_1,\dots,p_k) = \prod_{i=1}^{k} p_i^{n_i}$, with $f: [0,1]^k \to \mathbb{R}$

• Equality constraint:

$\sum_{i=1}^{k} p_i - 1 = 0$
17.06.13 39
Exercise: Show that the empirical relative frequencies give the most likely probability model:

$p_i = \frac{n_i}{n_1 + \dots + n_k} \quad \forall i = 1,\dots,k$

The calculations are a little bit more complicated than in the examples before.
17.06.13 40
Minimization under inequalities

Differentiable function to be minimized: $f: \Omega \to \mathbb{R}$

Inequality constraints: $g_p(x) \le 0 \quad \forall p = 1,\dots,l$

Lagrange function: $L(x, \alpha_1,\dots,\alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x)$
17.06.13 41
Minimization under inequalities
A necessary condition for a minimum x under inequality constraints is the existence of Lagrange multipliers $\alpha_1,\dots,\alpha_l$ which fulfill the following KKT constraints (Karush, Kuhn, Tucker):
17.06.13 42
Minimization under inequalities

Karush-Kuhn-Tucker constraints:

$\alpha_1,\dots,\alpha_l \ge 0$

$\nabla_x L(x, \alpha_1,\dots,\alpha_l) = \nabla_x f(x) + \sum_{p=1}^{l} \alpha_p \nabla_x g_p(x) = 0$

$g_p(x) \le 0 \quad \forall p = 1,\dots,l$

$\alpha_p\, g_p(x) = 0 \quad \forall p = 1,\dots,l$

Note that there are as many equations as variables.
17.06.13 43
Duality

Primal problem:

$f: \Omega \to \mathbb{R}, \quad \Omega \subseteq \mathbb{R}^n$

$f(x) \to \min_{x \in \Omega}$

subject to the requirements $g_p(x) \le 0 \quad \forall p = 1,\dots,l$
17.06.13 44
Duality

Lagrange function:

$L(x, \alpha_1,\dots,\alpha_l) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x), \qquad \alpha_1,\dots,\alpha_l \ge 0$

More compact:

$L(x, \alpha) = f(x) + \sum_{p=1}^{l} \alpha_p g_p(x), \qquad \alpha \ge 0$

The KKT conditions force us to solve equations under inequality constraints. This is often uncomfortable.
17.06.13 45
Duality

The dual problem ignores all inequality constraints and, for fixed α, minimizes the Lagrange function over x:

$Q(\alpha) = \inf_x L(x, \alpha)$

This defines a lower bound for the primal problem.

Lemma 1: For arbitrary $\alpha \ge 0$:

$Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$
17.06.13 46
Duality

Proof: Consider an arbitrary y with $g_p(y) \le 0 \;\forall p = 1,\dots,l$. Then:

$Q(\alpha) = \inf_x L(x, \alpha) \le L(y, \alpha) = f(y) + \sum_{p=1}^{l} \alpha_p g_p(y) \le f(y)$

Since this holds for all such y we conclude:

$Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$
17.06.13 47
Duality

Having ignored the requirements is compensated in a second step by taking the greatest lower bound, that is, by maximizing over all Lagrange multipliers:

$Q(\alpha) \to \max_{\alpha \ge 0}$

Lemma 2 (existence of the infima and of the max supposed):

$\max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$
17.06.13 48
Computing max inf L means to find a saddle-point of function L.
17.06.13 49
Duality

Lemma 3: Assume you found $\beta \ge 0$ and y with

$g_p(y) \le 0 \quad \forall p = 1,\dots,l$
$Q(\beta) = f(y)$

For short: „Dual value meets primal value.“ Then

$Q(\beta) = \max_{\alpha \ge 0} Q(\alpha) = \inf_{x \text{ with } g(x) \le 0} f(x) = f(y) = L(y, \beta)$

For short: „Optimal dual = optimal primal = solution of KKT has been obtained.“
17.06.13 50
Duality

Proof: $g_p(y) \le 0 \;\forall p = 1,\dots,l$ was used in Lemma 2 to conclude:

$\max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x)$

Using $Q(\beta) = f(y)$ we obtain

$f(y) = Q(\beta) \le \max_{\alpha \ge 0} Q(\alpha) \le \inf_{x \text{ with } g(x) \le 0} f(x) \le f(y)$
17.06.13 51
Duality

Proof continued: In the proof of Lemma 1 we showed:

$Q(\beta) \le L(y, \beta) \le f(y)$

Thus

$f(y) = Q(\beta) = L(y, \beta)$

So far, things were rather simple. The non-trivial part is (stated without proof):
17.06.13 52
Duality

Lemma 4: Under certain conditions (that are fulfilled for the margin maximization problem: quadratic function, linear constraints, compact domains) equality holds:

$\max_{\alpha \ge 0} Q(\alpha) = \inf_{x \text{ with } g(x) \le 0} f(x)$

In particular, this means that max and min both exist.
17.06.13 53
Duality

Thus we have the choice:
• Solve the primal problem
• Solve the dual problem
• Solve the KKT conditions
17.06.13 54
Minimization under equality and inequality constraints
Exercise:
Combine the formulas for the case of equality constraints and the case of inequality constraints into formulas for the combination of equality and inequality constraints.
17.06.13 55
Margin maximization: Primal form
$\frac{1}{2}\|w\|^2 \to \min$

$-d_p(w^T x_p + b) + 1 \le 0 \quad \forall p = 1,\dots,l$
17.06.13 56
Lagrange function
$L(w, b, \alpha_1,\dots,\alpha_l) = \frac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p\big(-d_p(w^T x_p + b) + 1\big)$
$\qquad = \frac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p\big(d_p(w^T x_p + b) - 1\big)$
17.06.13 57
Partial derivatives:

$\frac{\partial L(w, b, \alpha_1,\dots,\alpha_l)}{\partial w_i} = w_i - \sum_{p=1}^{l} \alpha_p d_p x_{pi}$

$\frac{\partial L(w, b, \alpha_1,\dots,\alpha_l)}{\partial b} = -\sum_{p=1}^{l} \alpha_p d_p$
17.06.13 58
Karush-Kuhn-Tucker conditions
(1) $\alpha_1,\dots,\alpha_l \ge 0$

(2) $w = \sum_{p=1}^{l} \alpha_p d_p x_p$

(3) $\sum_{p=1}^{l} \alpha_p d_p = 0$

(4) $-d_p(w^T x_p + b) + 1 \le 0 \quad \forall p = 1,\dots,l$

(5) $\alpha_p\big(d_p(w^T x_p + b) - 1\big) = 0 \quad \forall p = 1,\dots,l$
17.06.13 59
Margin maximization: Dual form
Consider the dual function Q and maximize:

$Q(\alpha_1,\dots,\alpha_l) = \inf_{w,b} L(w, b, \alpha_1,\dots,\alpha_l)$

$Q(\alpha_1,\dots,\alpha_l) \to \max, \qquad \alpha_1,\dots,\alpha_l \ge 0$
17.06.13 60
Lagrange function:

$L(w, b, \alpha_1,\dots,\alpha_l) = \frac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p\big(-d_p(w^T x_p + b) + 1\big)$
$\qquad = \frac{1}{2}\sum_{i=1}^{n} w_i^2 - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - b\sum_{p=1}^{l} \alpha_p d_p + \sum_{p=1}^{l} \alpha_p$
17.06.13 61
If $\sum_{p=1}^{l} \alpha_p d_p \ne 0$ then, due to the subterm $-b\sum_{p=1}^{l} \alpha_p d_p$ in the function L and the fact that minimization of L also runs over b, we conclude that

$\inf_{w,b} L(w, b, \alpha_1,\dots,\alpha_l) = -\infty$

Thus this case does not participate in the process of maximization of inf L in the definition of Q. So we may assume:

$\sum_{p=1}^{l} \alpha_p d_p = 0$
17.06.13 62
The Lagrange function reduces to:

$L(w, b, \alpha_1,\dots,\alpha_l) = \frac{1}{2}\|w\|^2 + \sum_{p=1}^{l} \alpha_p\big(-d_p(w^T x_p + b) + 1\big) = \frac{1}{2}\|w\|^2 - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$

For every fixed α try to explain why:

$\inf_w L(w, b, \alpha) \ne -\infty$
17.06.13 63
Existence of the infimum fixes w by setting the gradient of L w.r.t. w to zero:

$\nabla_w L(w, b, \alpha_1,\dots,\alpha_l) = w - \sum_{p=1}^{l} \alpha_p d_p x_p = 0$

Thus:

$w = \sum_{p=1}^{l} \alpha_p d_p x_p$

This allows further simplification of the function Q:
17.06.13 64
$Q(\alpha_1,\dots,\alpha_l) = \frac{1}{2}\, w^T w - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$
$= \frac{1}{2}\Big(\sum_{p=1}^{l} \alpha_p d_p x_p\Big)^T \Big(\sum_{q=1}^{l} \alpha_q d_q x_q\Big) - \sum_{p=1}^{l} \alpha_p d_p \Big(\sum_{q=1}^{l} \alpha_q d_q x_q\Big)^T x_p + \sum_{p=1}^{l} \alpha_p$
$= \frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p$
$= -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p$
17.06.13 65
Summary:

$Q(\alpha_1,\dots,\alpha_l) = -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \alpha_p$

$Q(\alpha_1,\dots,\alpha_l) \to \max$ under $\alpha_1,\dots,\alpha_l \ge 0$ (and, as derived above, $\sum_{p=1}^{l} \alpha_p d_p = 0$).
17.06.13 66
Solution of the primal problem obtained from this:

$w = \sum_{p=1}^{l} \alpha_p d_p x_p$

Bias b is determined as follows: select a non-zero Lagrange multiplier $\alpha_p$ – why must it exist? Now use

$\alpha_p\big(d_p(w^T x_p + b) - 1\big) = 0$

and solve for b.
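Putting the dual form and this recovery step together, a sketch in Python with scipy (the data are invented; the equality constraint $\sum_p \alpha_p d_p = 0$ derived earlier is imposed explicitly):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
d = np.array([1.0, 1.0, -1.0, -1.0])
l = len(X)

G = (d[:, None] * X) @ (d[:, None] * X).T        # G[p, q] = d_p d_q x_p^T x_q

def neg_Q(alpha):                                 # maximizing Q = minimizing -Q
    return 0.5 * alpha @ G @ alpha - alpha.sum()

cons = [{'type': 'eq', 'fun': lambda a: a @ d}]   # sum_p alpha_p d_p = 0
bnds = [(0.0, None)] * l                          # alpha_p >= 0

res = minimize(neg_Q, x0=np.zeros(l), bounds=bnds, constraints=cons, method='SLSQP')
alpha = res.x

w = (alpha * d) @ X                               # w = sum_p alpha_p d_p x_p
p = int(np.argmax(alpha))                         # a non-zero Lagrange multiplier
b = d[p] - w @ X[p]                               # from d_p (w^T x_p + b) = 1
print("alpha =", np.round(alpha, 4), "\nw =", w, " b =", b)
```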
17.06.13 67
SVM
III. Maximum margin classifier at work
17.06.13 68
Generalization
• Let a fresh vector z be presented to the net.
• The inner product with weight vector w is computed:

$z^T w = z^T \sum_{p=1}^{l} \alpha_p d_p x_p = \sum_{p=1}^{l} \alpha_p d_p\, (z^T x_p)$

• The inner product of z with support vector $x_p$ measures the similarity between these vectors: parallel vectors give a large inner product, orthogonal vectors give zero inner product.
17.06.13 69
Generalization
• The term $d_p$ gives the inner product the correct sign.
• The term $\alpha_p$ weights the terms above appropriately (hopefully) to compute the net input.
• The net input can alternatively be seen as computed by the following architecture, with a hidden linear neuron for each support vector whose weight vector is that support vector.
17.06.13 70
$z^T w = z^T \sum_{p=1}^{l} \alpha_p d_p x_p = \sum_{p=1}^{l} \alpha_p d_p\, (z^T x_p) = net(z, w)$

[Figure: two-layer architecture. The inputs $z_1,\dots,z_n$ feed s hidden linear neurons, one per support vector, with weight vectors $x_1 = (x_{11},\dots,x_{1n}), \dots, x_s = (x_{s1},\dots,x_{sn})$; the output neuron weights the hidden outputs with $\alpha_1 d_1, \dots, \alpha_s d_s$, computes the net input $net(z, w)$ here, and applies a step function.]
17.06.13 71
We obtain a quite natural and intuitive architecture:
• Among the training vectors the support vectors are determined; only these play a role; they are the most representative among the training vectors.
• Similarity with each support vector is computed. This is some sort of „case based reasoning“: support vectors = cases.
• Lagrange multipliers weight the similarities.
17.06.13 72
Main disadvantage – plan for a solution
• Linear separability is seldom the case.
• Embedding vectors of the low-dimensional input space into a higher-dimensional feature space may help: create additional features, whatever you think could be relevant for the problem.
• Keep the additional complexity limited.
17.06.13 73
SVM
IV. Kernel functions and kernel trick
17.06.13 74
Kernel functions
Use extra features that appear to be relevant for the problem solution, formally described as a function from the (lower-dimensional) input space to the (higher-dimensional) feature space:

$\Phi: \mathbb{R}^n \to \mathbb{R}^N$
17.06.13 75
Kernel functions
The originally given learning problem in input space

$(x_1, d_1), \dots, (x_l, d_l)$

reads in feature space equivalently as follows:

$(\Phi(x_1), d_1), \dots, (\Phi(x_l), d_l)$
17.06.13 76
Kernel functions
Dual optimization now reads in feature space as follows:

$Q^{\Phi}(\alpha_1,\dots,\alpha_l) = -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, \Phi(x_p)^T \Phi(x_q) + \sum_{p=1}^{l} \alpha_p$

$Q^{\Phi}(\alpha_1,\dots,\alpha_l) \to \max, \qquad \alpha_1,\dots,\alpha_l \ge 0$
17.06.13 77
Kernel functions
Classification of a fresh vector z from input space proceeds by computing the inner product of embedded vectors as follows and comparing it with the threshold:

$\sum_{p=1}^{l} \alpha_p d_p\, \big(\Phi(z)^T \Phi(x_p)\big)$
17.06.13 78
Kernel functions
For the process of learning (convex optimization) as well as for the process of classifying fresh vectors (generalization), the following operation is central and occurs in high numbers:

$K(x, y) = \Phi(x)^T \Phi(y)$

The function K is called a kernel function.
17.06.13 79
Kernel functions
Computation of lots of inner products in feature space may be expensive since the dimension N might be large compared to n. We should look for means to compute the inner products in input space:

$K(x, y) = \Phi(x)^T \Phi(y) = k(x^T y)$ with some function $k: \mathbb{R} \to \mathbb{R}$
17.06.13 80
Kernel functions
Example:

$\Phi: \mathbb{R}^2 \to \mathbb{R}^5, \qquad \Phi(x, y) = (x^2,\, y^2,\, \sqrt{2}\,xy,\, x,\, y)$

Inner product in feature space:

$\Phi(x, y)^T \Phi(x', y') = x^2 x'^2 + y^2 y'^2 + 2xx'yy' + xx' + yy'$
$\qquad = (xx' + yy')^2 + (xx' + yy') = k\big((x, y)^T(x', y')\big)$ with $k(r) = r^2 + r$
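A quick numerical check of this identity (the vectors are arbitrary):

```python
import numpy as np

def phi(v):
    # embedding from the example: (x^2, y^2, sqrt(2)*x*y, x, y)
    x, y = v
    return np.array([x**2, y**2, np.sqrt(2) * x * y, x, y])

def k(r):
    return r**2 + r                               # k(r) = r^2 + r

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(u) @ phi(v))                            # inner product in feature space
print(k(u @ v))                                   # same value computed in input space
```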
17.06.13 81
Classification:

[Figure: the two-layer architecture from above, now with hidden neurons computing the kernel values $K(z, x_1), \dots, K(z, x_l)$ for the support vectors; the output neuron weights them with $\alpha_k d_k$ and applies a step function.]
17.06.13 82
Interpretation:
• Hidden neurons measure the similarity between the input vector and the training vectors after embedding into feature space.
• The output neuron uses the class labels and Lagrange multipliers to weight the similarities and integrate them into a summarized net input.
• A particularly natural way to measure similarity of the input vector with some training vector is by Gaussian bell-shaped functions:

$K(z, x_p) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\|z - x_p\|^2}{2\sigma^2}}$
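A sketch of such a kernelized classifier in Python (the support vectors, multipliers and bias would come out of the training step; here they are passed in as arguments):

```python
import numpy as np

def gaussian_kernel(z, x_p, sigma=1.0):
    # bell-shaped similarity, with the 1/(sqrt(2*pi)*sigma) prefactor from the slide
    return np.exp(-np.linalg.norm(z - x_p)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def classify(z, support_vectors, d, alpha, b, kernel=gaussian_kernel):
    # net input = sum_p alpha_p d_p K(z, x_p) + b, then the step function
    net = sum(a_p * d_p * kernel(z, x_p)
              for a_p, d_p, x_p in zip(alpha, d, support_vectors)) + b
    return 1 if net >= 0 else -1
```

Replacing gaussian_kernel by any other kernel function changes the similarity measure but not the architecture.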
17.06.13 83
• It can be shown that this is indeed a kernel function (Mercer's theorem).
• Other widely used kernels are polynomial kernels of degree d:

$K(z, x) = (z^T x + 1)^d$

• and tanh-kernels that mimic the behaviour of MLPs (with scale factor β and threshold θ):

$K(z, x) = \tanh(\beta\, z^T x - \theta)$

The latter are kernels only for certain combinations of β and θ.
17.06.13 84
• There are lots of further useful kernel functions, some more general, some tailored to specific applications.
• There are lots of cooking recipes for building fresh kernels out of already constructed ones.
17.06.13 85
SVM
V. SVM learnability theory
17.06.13 86
Generalization ability of the maximum margin classifier
• Using lots of additional features (remember the embedding into a high-dimensional feature space used above) usually carries, as for MLPs, the danger of overfitting.
• Remember the estimates for the VC-dimension of MLPs, of order w·log(w) or w²n².
17.06.13 87
• SVMs do not suffer from this problem: the expected error depends only on the ratio of the maximum margin m of a training set to the radius R of the training vectors, but does not depend on the dimension of the feature space.
• A more concrete estimate for the expected error ε is (with m the maximum margin, l the size of the training set, R the radius of the training set, δ the confidence):

$\varepsilon(l, m, R, \delta) = \frac{2}{l}\Big(\frac{64 R^2}{m^2}\,\log\frac{e\,l\,m}{8 R^2}\,\log\frac{32\,l}{m^2} + \log\frac{4}{\delta}\Big)$
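A small helper for plugging numbers into the bound in the form displayed above (assuming exactly that form of the constants; the values passed in are arbitrary examples):

```python
import numpy as np

def svm_error_bound(l, m, R, delta):
    # epsilon(l, m, R, delta) in the form shown above
    return (2.0 / l) * ((64 * R**2 / m**2) * np.log(np.e * l * m / (8 * R**2))
                        * np.log(32 * l / m**2) + np.log(4 / delta))

print(svm_error_bound(l=10000, m=2.0, R=5.0, delta=0.01))
```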
17.06.13 88
A single formula

What confidence would you like to have?
Fix δ > 0, δ < 1 (for example δ = 1%)
What margin do you expect?
m would be nice (for example m = 2)
How many training data l are available?
17.06.13 89
Under the settings above, what error must be expected under a randomly drawn training set in the worst case?

$\Pr\Big\{\, T = \big((x_1, d_1), \dots, (x_l, d_l)\big) \;:\; \mu(w_T, b_T) \ge m \;\Rightarrow\; error_{gen}(w_T, b_T) - error_{emp,T}(w_T, b_T) \le \varepsilon \,\Big\} \ge 1 - \delta$

with $(w_T, b_T)$ the maximum margin solution for T, $R = \max_{p=1,\dots,l} \|x_p\|$, and

$\varepsilon = \frac{2}{l}\Big(\frac{64 R^2}{m^2}\,\log\frac{e\,l\,m}{8 R^2}\,\log\frac{32\,l}{m^2} + \log\frac{4}{\delta}\Big)$
17.06.13 90
Do you see any problem with this estimation? What if you draw a random training set of size l and margin 1 instead of the desired margin 2? Using formula again with m = 1 changes l to l‘. Again draw a random training set of size l‘. Does this game finally come to an end?
17.06.13 91
• The dependence on m/R instead of m is obvious: by scaling the training set with a positive factor one could increase the maximum margin m as much as desired without affecting the learning problem.
• The existence of some difficulties with a sound mathematical formalization of this result should be mentioned – the maximum margin cannot be fixed in advance, before randomly drawing a training set. The interested reader should take a look into the literature on structural risk minimization.
17.06.13 92
SVM
VI. Soft margin
17.06.13 93
• Allowing limited classification error (that is bad), a larger margin becomes possible (that is good).
• The amount of classification error is measured by slack variables $\xi_p$, one for each training vector – the error may vary from training vector to training vector.
• The inequality constraints now read as follows:
17.06.13 94
$d_p = +1 \;\Rightarrow\; w^T x_p + b \ge +1 - \xi_p \quad \forall p = 1,\dots,l$
$d_p = -1 \;\Rightarrow\; w^T x_p + b \le -1 + \xi_p \quad \forall p = 1,\dots,l$

$d_p(w^T x_p + b) \ge 1 - \xi_p \quad \forall p = 1,\dots,l$

$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$
17.06.13 95
larger margin
error measured by slack variables
17.06.13 96
• In margin maximization, use error values that are as small as possible.
• This is expressed by the following function with a constant C that is chosen by the user:

$f(w, \xi_1,\dots,\xi_l) = \frac{1}{2}\|w\|^2 + C\sum_{p=1}^{l} \xi_p^2, \qquad \xi_1,\dots,\xi_l \ge 0$

• C controls the balance between margin maximization and error tolerance.
17.06.13 97
Soft margin: Primal problem

$f(w, \xi_1,\dots,\xi_l) = \frac{1}{2}\|w\|^2 + C\sum_{p=1}^{l} \xi_p^2 \;\to\; \min_{w,b,\xi}$

under constraints

$\xi_1,\dots,\xi_l \ge 0$
$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$
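A sketch of this soft-margin primal with a generic solver (invented data; the last point is deliberately placed on the wrong side so that its slack variable becomes active):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [0.5, 0.5]])
d = np.array([1.0, 1.0, -1.0, -1.0, -1.0])       # last point lies on the "wrong" side
l, n = X.shape
C = 1.0                                           # user-chosen trade-off constant

def objective(theta):                             # theta = (w, b, xi_1, ..., xi_l)
    w, xi = theta[:n], theta[n + 1:]
    return 0.5 * w @ w + C * np.sum(xi**2)

cons = [{'type': 'ineq',                          # d_p (w^T x_p + b) - 1 + xi_p >= 0
         'fun': lambda th, p=p: d[p] * (X[p] @ th[:n] + th[n]) - 1.0 + th[n + 1 + p]}
        for p in range(l)]
bnds = [(None, None)] * (n + 1) + [(0.0, None)] * l   # only the xi_p are sign-restricted

res = minimize(objective, x0=np.zeros(n + 1 + l), bounds=bnds, constraints=cons, method='SLSQP')
w, b, xi = res.x[:n], res.x[n], res.x[n + 1:]
print("w =", w, "b =", b, "slacks =", np.round(xi, 3))
```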
17.06.13 98
Soft margin: Lagrange function
$L(w, b, \xi_1,\dots,\xi_l, \alpha_1,\dots,\alpha_l, \beta_1,\dots,\beta_l)$
$\quad = \frac{1}{2}\|w\|^2 + C\sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p\big(1 - \xi_p - d_p(w^T x_p + b)\big) - \sum_{p=1}^{l} \beta_p \xi_p$
17.06.13 99
Soft margin: KKT conditions
$-d_p(w^T x_p + b) + 1 - \xi_p \le 0 \quad \forall p = 1,\dots,l$
$-\xi_p \le 0 \quad \forall p = 1,\dots,l$
$\alpha_p\big(1 - \xi_p - d_p(w^T x_p + b)\big) = 0 \quad \forall p = 1,\dots,l$
$\beta_p\, \xi_p = 0 \quad \forall p = 1,\dots,l$
$\partial L / \partial w_i = w_i - \sum_{p=1}^{l} \alpha_p d_p x_{pi} = 0 \quad \forall i = 1,\dots,n$
$\partial L / \partial b = -\sum_{p=1}^{l} \alpha_p d_p = 0$
$\partial L / \partial \xi_i = 2C\xi_i - \alpha_i - \beta_i = 0 \quad \forall i = 1,\dots,l$
$\alpha_i \ge 0 \quad \forall i = 1,\dots,l$
$\beta_i \ge 0 \quad \forall i = 1,\dots,l$
17.06.13 100
Soft margin: Deriving the solution of the primal problem from KKT

$w = \sum_{p=1}^{l} \alpha_p d_p x_p$

$\xi_p = \frac{\alpha_p + \beta_p}{2C} \quad \forall p = 1,\dots,l$

Bias b is indirectly derived from an equality constraint for a non-zero $\alpha_i$.
17.06.13 101
Soft margin: Dual function
$Q(\alpha_1,\dots,\alpha_l, \beta_1,\dots,\beta_l) = \inf_{w,b,\xi} L(w, b, \xi_1,\dots,\xi_l, \alpha_1,\dots,\alpha_l, \beta_1,\dots,\beta_l)$

with

$w = \sum_{p=1}^{l} \alpha_p d_p x_p$
$\xi_p = \frac{\alpha_p + \beta_p}{2C} \quad \forall p = 1,\dots,l$
$\sum_{p=1}^{l} \alpha_p d_p = 0$ (w.l.o.g.)
17.06.13 102
Soft margin: Inserting values in dual function
$Q(\alpha_1,\dots,\alpha_l, \beta_1,\dots,\beta_l)$
$\quad = \frac{1}{2}\, w^T w + C\sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p\big(1 - \xi_p - d_p(w^T x_p + b)\big) - \sum_{p=1}^{l} \beta_p \xi_p$
17.06.13 103
Inserting values in dual function continued
$= \frac{1}{2}\, w^T w + C\sum_{p=1}^{l} \xi_p^2 + \sum_{p=1}^{l} \alpha_p - \sum_{p=1}^{l} (\alpha_p + \beta_p)\,\xi_p - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p - b\sum_{p=1}^{l} \alpha_p d_p$
17.06.13 104
Inserting values in dual function continued
$= \frac{1}{2}\, w^T w + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{2C} - \sum_{p=1}^{l} \alpha_p d_p\, w^T x_p + \sum_{p=1}^{l} \alpha_p$

(using $\xi_p = \frac{\alpha_p + \beta_p}{2C}$ and $\sum_{p=1}^{l} \alpha_p d_p = 0$)
17.06.13 105
Inserting values in dual function continued
$= \frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{2C} + \sum_{p=1}^{l} \alpha_p$

(inserting $w = \sum_{p=1}^{l} \alpha_p d_p x_p$)
17.06.13 106
Inserting values in dual function continued
$= -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q + \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{2C} + \sum_{p=1}^{l} \alpha_p$
17.06.13 107
Inserting values in dual function continued
$= -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$
17.06.13 108
Final form of dual function with constraints
$Q(\alpha_1,\dots,\alpha_l, \beta_1,\dots,\beta_l) = -\frac{1}{2}\sum_{p=1}^{l}\sum_{q=1}^{l} \alpha_p \alpha_q d_p d_q\, x_p^T x_q - \sum_{p=1}^{l} \frac{(\alpha_p + \beta_p)^2}{4C} + \sum_{p=1}^{l} \alpha_p$

$\alpha_1,\dots,\alpha_l, \beta_1,\dots,\beta_l \ge 0$

$\sum_{p=1}^{l} \alpha_p d_p = 0$
17.06.13 109
Wild mixture of kernel functions
linear: $K(x, y) = x^T y + c$

polynomial: $K(x, y) = (\alpha\, x^T y + c)^d$

Gauss: $K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$

exponential: $K(x, y) = e^{-\frac{\|x - y\|}{2\sigma^2}}$

Laplace: $K(x, y) = e^{-\frac{\|x - y\|}{\sigma}}$
17.06.13 110
Wild mixture of kernel functions
sigmoidal: $K(x, y) = \tanh(\alpha\, x^T y + c)$

Cauchy: $K(x, y) = \frac{1}{1 + \frac{\|x - y\|^2}{\sigma^2}}$

quadratic: $K(x, y) = \sqrt{\|x - y\|^2 + c^2}$

inverse quadratic: $K(x, y) = \frac{1}{\sqrt{\|x - y\|^2 + c^2}}$
17.06.13 111
Wild mixture of kernel functions
Wave: $K(x, y) = \frac{\theta}{\|x - y\|}\,\sin\frac{\|x - y\|}{\theta}$

Power: $K(x, y) = -\|x - y\|^d$

Log: $K(x, y) = -\log\big(\|x - y\|^d + 1\big)$

Spline: $K(x, y) = \prod_{i=1}^{n}\Big(1 + x_i y_i + x_i y_i \min(x_i, y_i) - \frac{x_i + y_i}{2}\min(x_i, y_i)^2 + \frac{1}{3}\min(x_i, y_i)^3\Big)$
17.06.13 112
Wild mixture of kernel functions
histogram intersection: $K(x, y) = \sum_{i=1}^{n} \min(x_i, y_i)$

generalized histogram intersection: $K(x, y) = \sum_{i=1}^{n} \min(x_i^{\alpha}, y_i^{\beta})$

T-Student: $K(x, y) = \frac{1}{1 + \|x - y\|^d}$
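A few of the kernels listed above written out as plain Python functions (parameter values are arbitrary defaults):

```python
import numpy as np

def linear(x, y, c=0.0):
    return x @ y + c

def polynomial(x, y, alpha=1.0, c=1.0, d=3):
    return (alpha * (x @ y) + c) ** d

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def laplace(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / sigma)

def histogram_intersection(x, y):
    return np.minimum(x, y).sum()

x, y = np.array([1.0, 2.0, 0.5]), np.array([0.5, 1.5, 1.0])
for K in (linear, polynomial, gaussian, laplace, histogram_intersection):
    print(K.__name__, K(x, y))
```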