LINEAR CLASSIFIERS

The Problem: Consider a two-class task with ω1, ω2 and a linear discriminant function
$$g(x) = w^T x + w_0 = w_1 x_1 + w_2 x_2 + \dots + w_l x_l + w_0 = 0$$
Assume $x_1$, $x_2$ are two points on the decision hyperplane:
$$w^T x_1 + w_0 = w^T x_2 + w_0 = 0 \;\Rightarrow\; w^T (x_1 - x_2) = 0, \quad \forall x_1, x_2 \text{ on the hyperplane}$$
Hence:
$$w \perp \text{the decision hyperplane } g(x) = w^T x + w_0 = 0$$
In the two-dimensional illustration, the distance of the hyperplane from the origin and the distance of a point $x$ from it are, respectively,
$$d = \frac{|w_0|}{\sqrt{w_1^2 + w_2^2}}, \qquad z = \frac{|g(x)|}{\sqrt{w_1^2 + w_2^2}}$$
The Perceptron Algorithm

Assume linearly separable classes, i.e., there exists $w^*$ such that
$$w^{*T} x > 0 \quad \forall x \in \omega_1, \qquad w^{*T} x < 0 \quad \forall x \in \omega_2$$
The case $w^{*T} x + w_0^*$ falls under the above formulation, since with the augmented vectors
$$w' = [w^{*T}, w_0^*]^T, \qquad x' = [x^T, 1]^T$$
one has $w^{*T} x + w_0^* = w'^T x'$.
Our goal: Compute a solution, i.e., a hyperplane $w$, so that
$$w^T x > 0 \quad \forall x \in \omega_1, \qquad w^T x < 0 \quad \forall x \in \omega_2$$
The steps:
– Define a cost function to be minimized.
– Choose an algorithm to minimize the cost function.
– The minimum corresponds to a solution.
The Cost Function

$$J(w) = \sum_{x \in Y} \delta_x\, w^T x, \qquad \delta_x = \begin{cases} -1 & \text{if } x \in Y \text{ and } x \in \omega_1 \\ +1 & \text{if } x \in Y \text{ and } x \in \omega_2 \end{cases}$$
where $Y$ is the subset of the training vectors wrongly classified by $w$. By the choice of $\delta_x$, $J(w) \ge 0$. When $Y = \varnothing$ (the empty set), a solution has been achieved and $J(w) = 0$.
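A minimal sketch of evaluating this cost, assuming NumPy, augmented training vectors (a trailing 1 appended to each $x$) and labels coded as $+1$ for $\omega_1$ and $-1$ for $\omega_2$; the function name and argument layout are illustrative:

```python
import numpy as np

def perceptron_cost(w, X, labels):
    """J(w) = sum_{x in Y} delta_x * w^T x over the misclassified set Y.

    X      : (N, l+1) array of augmented training vectors [x^T, 1].
    labels : (N,) array, +1 for omega_1 and -1 for omega_2.
    """
    scores = X @ w
    wrong = labels * scores <= 0      # the set Y of vectors misclassified by w
    delta = -labels[wrong]            # delta_x = -1 on omega_1, +1 on omega_2
    return np.sum(delta * scores[wrong]), wrong
```

Each term of the sum is non-negative, so the returned value is zero exactly when the misclassified set is empty.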
• $J(w)$ is piecewise linear (WHY?).

The Algorithm
• The philosophy of gradient descent is adopted.
$$w(\text{new}) = w(\text{old}) + \Delta w, \qquad \Delta w = -\rho \left.\frac{\partial J(w)}{\partial w}\right|_{w = w(\text{old})}$$
• Wherever the gradient is valid,
$$\frac{\partial J(w)}{\partial w} = \frac{\partial}{\partial w} \sum_{x \in Y} \delta_x\, w^T x = \sum_{x \in Y} \delta_x x$$
and the update becomes
$$w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x$$
• This is the celebrated Perceptron Algorithm.
An example of a single update step, for one misclassified vector $x \in \omega_1$:
$$w(t+1) = w(t) + \rho_t x$$
The perceptron algorithm converges to a solution in a finite number of iteration steps, provided the step-size sequence satisfies
$$\lim_{t\to\infty}\sum_{k=0}^{t}\rho_k = \infty, \qquad \lim_{t\to\infty}\sum_{k=0}^{t}\rho_k^2 < \infty, \qquad \text{e.g., } \rho_t = \frac{c}{t}$$
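A compact sketch of the resulting batch algorithm, assuming NumPy, augmented vectors and ±1 labels as before; the schedule $\rho_t = c/t$ is used as one choice that satisfies the conditions above:

```python
import numpy as np

def perceptron_train(X, labels, c=1.0, max_iter=10_000):
    """Batch perceptron: w(t+1) = w(t) - rho_t * sum_{x in Y} delta_x x."""
    w = np.zeros(X.shape[1])
    for t in range(1, max_iter + 1):
        scores = X @ w
        wrong = labels * scores <= 0              # the misclassified set Y
        if not wrong.any():
            break                                 # Y is empty: a separating w was found
        delta = -labels[wrong]                    # delta_x as defined above
        grad = (delta[:, None] * X[wrong]).sum(axis=0)
        w = w - (c / t) * grad                    # rho_t = c / t
    return w
```

For linearly separable data the loop terminates with a separating hyperplane; otherwise it stops after max_iter sweeps.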
A useful variant of the perceptron algorithm:
$$w(t+1) = w(t) + \rho\, x_{(t)} \quad \text{if } x_{(t)} \in \omega_1 \text{ and } w^T(t)\, x_{(t)} \le 0$$
$$w(t+1) = w(t) - \rho\, x_{(t)} \quad \text{if } x_{(t)} \in \omega_2 \text{ and } w^T(t)\, x_{(t)} \ge 0$$
$$w(t+1) = w(t) \quad \text{otherwise}$$
It is a reward and punishment type of algorithm.
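A sketch of this reward-and-punishment variant, assuming NumPy; the vectors are visited cyclically, one at a time, and ρ is kept fixed here for simplicity:

```python
import numpy as np

def perceptron_online(X, labels, rho=0.7, n_epochs=100):
    """Online perceptron: update only when the currently visited vector is misclassified."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        changed = False
        for x, y in zip(X, labels):
            if y == +1 and w @ x <= 0:    # omega_1 vector on the wrong side
                w = w + rho * x
                changed = True
            elif y == -1 and w @ x >= 0:  # omega_2 vector on the wrong side
                w = w - rho * x
                changed = True
        if not changed:                   # a full pass with no mistakes: done
            break
    return w
```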
The Perceptron

[Figure: the perceptron (neuron) network; the input weights are the synapses, or synaptic weights, $w_i$'s, and $w_0$ is the threshold.]

It is a learning machine that learns from the training vectors via the perceptron algorithm. The network is called a perceptron or neuron.
Example: At some stage $t$ the perceptron algorithm results in $w_1 = 1$, $w_2 = 1$, $w_0 = -0.5$. The corresponding hyperplane is $x_1 + x_2 - 0.5 = 0$. With $\rho = 0.7$ and the two currently misclassified (augmented) vectors $[-0.2, 0.75, 1]^T$ and $[0.4, 0.05, 1]^T$, the update gives
$$w(t+1) = \begin{bmatrix}1\\1\\-0.5\end{bmatrix} + 0.7\,(-1)\begin{bmatrix}-0.2\\0.75\\1\end{bmatrix} + 0.7\,(+1)\begin{bmatrix}0.4\\0.05\\1\end{bmatrix} = \begin{bmatrix}1.42\\0.51\\-0.5\end{bmatrix}$$
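The arithmetic of this update can be verified directly (NumPy assumed; variable names are illustrative):

```python
import numpy as np

w_t = np.array([1.0, 1.0, -0.5])
x_a = np.array([-0.2, 0.75, 1.0])   # misclassified vector from omega_2 (factor -1)
x_b = np.array([0.4, 0.05, 1.0])    # misclassified vector from omega_1 (factor +1)
rho = 0.7

w_next = w_t + rho * (-1) * x_a + rho * (+1) * x_b
print(w_next)                        # [ 1.42  0.51 -0.5 ]
```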
Least Squares Methods

If the classes are linearly separable, the perceptron output results in $\pm 1$.
If the classes are NOT linearly separable, we shall compute the weights $w_0, w_1, \dots, w_l$ so that the difference between
• the actual output of the classifier, $w^T x$, and
• the desired output, e.g., $y = +1$ if $x \in \omega_1$ and $y = -1$ if $x \in \omega_2$,
is SMALL.
SMALL, in the mean square error sense, means choosing $w$ so that the cost function
$$J(w) = E\!\left[(y - w^T x)^2\right]$$
is minimum, $y$ being the corresponding desired response:
$$\hat{w} = \arg\min_{w} J(w)$$
Minimizing $J(w)$ with respect to $w$ results in
$$\frac{\partial J(w)}{\partial w} = -2\,E\!\left[x\,(y - x^T w)\right] = 0 \;\Rightarrow\; E[x x^T]\,\hat{w} = E[x y] \;\Rightarrow\; \hat{w} = R_x^{-1} E[x y]$$
where $R_x \equiv E[x x^T]$ is the autocorrelation matrix
$$R_x = \begin{bmatrix} E[x_1 x_1] & E[x_1 x_2] & \dots & E[x_1 x_l] \\ \vdots & \vdots & & \vdots \\ E[x_l x_1] & E[x_l x_2] & \dots & E[x_l x_l] \end{bmatrix}$$
and $E[x y]$ is the crosscorrelation vector
$$E[x y] = \begin{bmatrix} E[x_1 y] \\ \vdots \\ E[x_l y] \end{bmatrix}$$
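In practice the expectations are replaced by sample averages over the training set; a minimal sketch (NumPy assumed, names illustrative):

```python
import numpy as np

def mse_weights(X, y):
    """w_hat = R_x^{-1} E[xy], with expectations replaced by sample means.

    X : (N, l) matrix whose rows are the training vectors, y : (N,) desired responses.
    """
    N = X.shape[0]
    R_x = (X.T @ X) / N            # sample estimate of the autocorrelation matrix E[x x^T]
    p = (X.T @ y) / N              # sample estimate of the crosscorrelation vector E[x y]
    return np.linalg.solve(R_x, p)
```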
Multi-class generalization
• The goal is to compute $M$ linear discriminant functions,
$$g_i(x) = w_i^T x, \qquad i = 1, 2, \dots, M$$
according to the MSE criterion.
• Adopt as desired responses $y_i$:
$$y_i = 1 \ \text{if } x \in \omega_i, \qquad y_i = 0 \ \text{otherwise}$$
• Let $y = [y_1, y_2, \dots, y_M]^T$
• and the matrix $W = [w_1, w_2, \dots, w_M]$.
• The goal is to compute $\hat{W}$:
$$\hat{W} = \arg\min_{W} E\!\left[\|y - W^T x\|^2\right] = \arg\min_{W} \sum_{i=1}^{M} E\!\left[(y_i - w_i^T x)^2\right]$$
• The above is equivalent to $M$ independent MSE minimization problems. That is: design each $w_i$ so that its desired output is 1 for $x \in \omega_i$ and 0 for any other class.
Remark: The MSE criterion belongs to a more general class of cost functions with the following important property:
• The value of $g_i(x)$ is an estimate, in the MSE sense, of the a-posteriori probability $P(\omega_i \mid x)$, provided that the desired responses used during training are $y_i = 1$ for $x \in \omega_i$ and 0 otherwise.
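A sketch of these M parallel least-squares fits with one-hot desired responses (NumPy assumed; function and argument names are illustrative):

```python
import numpy as np

def mse_multiclass(X, class_index, M):
    """Fit W = [w_1, ..., w_M] with desired response 1 for the own class, 0 otherwise.

    X           : (N, l) training vectors (append a constant 1 if a bias term is wanted).
    class_index : (N,) integer labels in {0, ..., M-1}.
    """
    N = X.shape[0]
    Y = np.zeros((N, M))
    Y[np.arange(N), class_index] = 1.0           # one column of desired responses per class
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)    # M MSE problems solved at once
    return W                                     # classify x by argmax_i g_i(x) = w_i^T x
```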
Mean square error regression: Let $y \in \mathbb{R}^M$, $x \in \mathbb{R}^l$ be jointly distributed random vectors with a joint pdf $p(y, x)$.
• The goal: given the value of $x$, estimate the value of $y$. In the pattern recognition framework, given $x$ one wants to estimate the respective label $y$.
• The MSE estimate $\hat{y}$ of $y$, given $x$, is defined as:
$$\hat{y} = \arg\min_{\tilde{y}} E\!\left[\|y - \tilde{y}\|^2\right]$$
• It turns out that:
$$\hat{y} = E[y \mid x] = \int y\, p(y \mid x)\, dy$$
The above is known as the regression of $y$ given $x$ and it is, in general, a non-linear function of $x$. If $p(y, x)$ is Gaussian, the MSE regressor is linear.
SMALL, in the sum of error squares sense, means that
$$J(w) = \sum_{i=1}^{N} (y_i - w^T x_i)^2$$
where $(y_i, x_i)$, $i = 1, 2, \dots, N$, are the training pairs, that is, the input $x_i$ and its corresponding class label $y_i$ ($\pm 1$). Setting the gradient to zero,
$$\frac{\partial J(w)}{\partial w} = -2\sum_{i=1}^{N} x_i\,(y_i - x_i^T w) = 0 \;\Rightarrow\; \left(\sum_{i=1}^{N} x_i x_i^T\right)\hat{w} = \sum_{i=1}^{N} x_i y_i$$
Pseudoinverse Matrix

Define
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \ (\text{an } N \times l \text{ matrix}), \qquad X^T = [x_1, x_2, \dots, x_N] \ (\text{an } l \times N \text{ matrix})$$
$$y = [y_1, \dots, y_N]^T: \text{ the corresponding desired responses}$$
Then
$$X^T X = \sum_{i=1}^{N} x_i x_i^T, \qquad X^T y = \sum_{i=1}^{N} x_i y_i$$
Thus
$$\left(\sum_{i=1}^{N} x_i x_i^T\right)\hat{w} = \sum_{i=1}^{N} x_i y_i \;\Rightarrow\; (X^T X)\,\hat{w} = X^T y \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y$$
$$X^{+} \equiv (X^T X)^{-1} X^T: \text{ the pseudoinverse of } X$$
Assume $N = l$ and $X$ square and invertible. Then
$$X^{+} = (X^T X)^{-1} X^T = X^{-1} (X^T)^{-1} X^T = X^{-1}$$
Assume $N > l$. Then, in general, there is no solution satisfying all equations simultaneously:
$$X w = y: \quad x_1^T w = y_1,\; x_2^T w = y_2,\; \dots,\; x_N^T w = y_N \qquad (N \text{ equations}, \; l \text{ unknowns})$$
The “solution” $\hat{w} = X^{+} y$ corresponds to the minimum sum of error squares.
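A minimal sketch of the two cases in code (NumPy assumed): the pseudoinverse solution, and, for $N > l$, the numerically preferable least-squares routine, which returns the same minimum sum-of-error-squares solution:

```python
import numpy as np

def ls_weights_pinv(X, y):
    """w_hat = X^+ y = (X^T X)^{-1} X^T y."""
    return np.linalg.pinv(X) @ y

def ls_weights_lstsq(X, y):
    """Same minimizer of sum_i (y_i - x_i^T w)^2, without forming (X^T X)^{-1}."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```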
Example: Consider the training points
$$\omega_1: \begin{bmatrix}0.4\\0.5\end{bmatrix}, \begin{bmatrix}0.6\\0.5\end{bmatrix}, \begin{bmatrix}0.1\\0.4\end{bmatrix}, \begin{bmatrix}0.2\\0.7\end{bmatrix}, \begin{bmatrix}0.3\\0.3\end{bmatrix} \qquad
\omega_2: \begin{bmatrix}0.4\\0.6\end{bmatrix}, \begin{bmatrix}0.6\\0.2\end{bmatrix}, \begin{bmatrix}0.7\\0.4\end{bmatrix}, \begin{bmatrix}0.8\\0.6\end{bmatrix}, \begin{bmatrix}0.7\\0.5\end{bmatrix}$$
With augmented vectors and desired responses $+1$ for $\omega_1$ and $-1$ for $\omega_2$:
$$X = \begin{bmatrix}
0.4 & 0.5 & 1 \\ 0.6 & 0.5 & 1 \\ 0.1 & 0.4 & 1 \\ 0.2 & 0.7 & 1 \\ 0.3 & 0.3 & 1 \\
0.4 & 0.6 & 1 \\ 0.6 & 0.2 & 1 \\ 0.7 & 0.4 & 1 \\ 0.8 & 0.6 & 1 \\ 0.7 & 0.5 & 1
\end{bmatrix}, \qquad
y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ -1 \\ -1 \\ -1 \\ -1 \\ -1 \end{bmatrix}$$
$$X^T X = \begin{bmatrix} 2.8 & 2.24 & 4.8 \\ 2.24 & 2.41 & 4.7 \\ 4.8 & 4.7 & 10 \end{bmatrix}, \qquad X^T y = \begin{bmatrix} -1.6 \\ 0.1 \\ 0.0 \end{bmatrix}$$
$$\hat{w} = (X^T X)^{-1} X^T y \approx \begin{bmatrix} -3.22 \\ 0.24 \\ 1.43 \end{bmatrix}$$
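The example can be reproduced with a few lines (NumPy assumed, using the data matrix and labels of the previous slide):

```python
import numpy as np

X = np.array([[0.4, 0.5, 1], [0.6, 0.5, 1], [0.1, 0.4, 1], [0.2, 0.7, 1], [0.3, 0.3, 1],
              [0.4, 0.6, 1], [0.6, 0.2, 1], [0.7, 0.4, 1], [0.8, 0.6, 1], [0.7, 0.5, 1]])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1], dtype=float)

print(X.T @ X)                                  # [[2.8, 2.24, 4.8], [2.24, 2.41, 4.7], [4.8, 4.7, 10]]
print(X.T @ y)                                  # [-1.6, 0.1, 0.0]
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                    # approximately [-3.22, 0.24, 1.43]
```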
The Bias-Variance Dilemma

A classifier $g(x)$ is a learning machine that tries to predict the class label $y$ of $x$. In practice, a finite data set $D = \{(y_i, x_i),\; i = 1, 2, \dots, N\}$ is used for its training. Let us write $g(x; D)$. Observe that:
• For some training sets $D$, the training may result in good estimates; for some others the result may be worse.
• The average performance of the classifier can be tested against the MSE-optimal value, in the mean squares sense, that is,
$$E_D\!\left[\left(g(x; D) - E[y \mid x]\right)^2\right]$$
where $E_D$ is the mean over all possible data sets $D$.
(Support Slide)
• The above is written as:
$$E_D\!\left[\left(g(x; D) - E[y \mid x]\right)^2\right] = \left(E_D[g(x; D)] - E[y \mid x]\right)^2 + E_D\!\left[\left(g(x; D) - E_D[g(x; D)]\right)^2\right]$$
• In the above, the first term is the contribution of the bias and the second term is the contribution of the variance.
• For a finite $D$, there is a trade-off between the two terms. Increasing the bias reduces the variance and vice versa. This is known as the bias-variance dilemma.
• Using a complex model results in low bias but high variance, as one changes from one training set to another. Using a simple model results in high bias but low variance.
(Support Slide)
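The decomposition can be illustrated numerically: draw many training sets D, fit g(·; D) on each, and average. A small sketch under assumed, illustrative settings (a sine target with Gaussian noise and a polynomial fit, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                     # plays the role of E[y|x]
x_test, degree, n_sets, N, noise = 0.5, 5, 500, 20, 0.3

preds = []
for _ in range(n_sets):                        # many different training sets D
    x = rng.uniform(0, np.pi, N)
    y = f(x) + noise * rng.standard_normal(N)
    coeffs = np.polyfit(x, y, degree)          # g(.; D): a degree-5 polynomial fit
    preds.append(np.polyval(coeffs, x_test))

preds = np.array(preds)
bias2 = (preds.mean() - f(x_test)) ** 2        # (E_D[g(x;D)] - E[y|x])^2
variance = preds.var()                         # E_D[(g(x;D) - E_D[g(x;D)])^2]
mse = np.mean((preds - f(x_test)) ** 2)        # E_D[(g(x;D) - E[y|x])^2]
print(bias2, variance, mse)                    # bias2 + variance equals mse
```

Raising the polynomial degree typically drives the bias term down and the variance term up, and lowering it does the opposite.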
LOGISTIC DISCRIMINATION
(Support Slide)

Let an $M$-class task, $\omega_1, \omega_2, \dots, \omega_M$. In logistic discrimination, the logarithms of the likelihood ratios are modeled via linear functions, i.e.,
$$\ln\frac{P(\omega_i \mid x)}{P(\omega_M \mid x)} = w_{i,0} + w_i^T x, \qquad i = 1, 2, \dots, M-1$$
Taking into account that
$$\sum_{i=1}^{M} P(\omega_i \mid x) = 1$$
it can easily be shown that the above is equivalent to modeling the posterior probabilities as:
$$P(\omega_i \mid x) = \frac{\exp\!\left(w_{i,0} + w_i^T x\right)}{1 + \sum_{j=1}^{M-1}\exp\!\left(w_{j,0} + w_j^T x\right)}, \quad i = 1, 2, \dots, M-1, \qquad
P(\omega_M \mid x) = \frac{1}{1 + \sum_{j=1}^{M-1}\exp\!\left(w_{j,0} + w_j^T x\right)}$$
For the two-class case it turns out that
$$P(\omega_1 \mid x) = \frac{\exp\!\left(w_0 + w^T x\right)}{1 + \exp\!\left(w_0 + w^T x\right)}, \qquad
P(\omega_2 \mid x) = \frac{1}{1 + \exp\!\left(w_0 + w^T x\right)}$$
(Support Slide)
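For the two-class case this is the familiar logistic sigmoid; a minimal sketch (NumPy assumed; the parameter values are illustrative):

```python
import numpy as np

def two_class_posteriors(w0, w, x):
    """P(omega_1|x) and P(omega_2|x) under the two-class logistic model."""
    t = np.exp(w0 + w @ x)
    return t / (1.0 + t), 1.0 / (1.0 + t)   # the two posteriors add to one

p1, p2 = two_class_posteriors(-0.5, np.array([1.0, 1.0]), np.array([0.4, 0.3]))
print(p1, p2)
```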
The unknown parameters $w_{i,0}$, $w_i$, $i = 1, 2, \dots, M-1$, are usually estimated by maximum likelihood arguments.
Logistic discrimination is a useful tool, since it allows for linear modeling and at the same time ensures that the posterior probability estimates add to one.
(Support Slide)
Support Vector Machines

The goal: Given two linearly separable classes, design the classifier
$$g(x) = w^T x + w_0 = 0$$
that leaves the maximum margin from both classes.
Margin: Each hyperplane is characterized by
• its direction in space, i.e., $w$,
• its position in space, i.e., $w_0$.
For EACH direction $w$, choose the hyperplane that leaves the SAME distance from the nearest points of each class. The margin is twice this distance.
The distance of a point $\hat{x}$ from the hyperplane is given by
$$z = \frac{|g(\hat{x})|}{\|w\|}$$
Scale $w, w_0$ so that at the nearest points of each class the discriminant function equals $\pm 1$: $g(x) = 1$ for $\omega_1$ and $g(x) = -1$ for $\omega_2$.
Thus the margin is given by
$$\frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}$$
Also, the following are valid:
$$w^T x + w_0 \ge 1, \quad \forall x \in \omega_1 \qquad\qquad w^T x + w_0 \le -1, \quad \forall x \in \omega_2$$
SVM (linear) classifier: $g(x) = w^T x + w_0$
Minimize
$$J(w) = \frac{1}{2}\|w\|^2$$
subject to
$$y_i\,(w^T x_i + w_0) \ge 1, \qquad i = 1, 2, \dots, N$$
where $y_i = +1$ for $x_i \in \omega_1$ and $y_i = -1$ for $x_i \in \omega_2$.
The above is justified since minimizing $\|w\|$ maximizes the margin $\dfrac{2}{\|w\|}$.
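A minimal sketch of fitting this maximum-margin classifier with scikit-learn (assumed available); a linear-kernel SVC with a very large C approximates the hard-margin problem above, and the toy data are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C: (almost) hard margin

w = clf.coef_[0]                              # the weight vector w
w0 = clf.intercept_[0]                        # the bias w_0
print(w, w0, 2.0 / np.linalg.norm(w))         # margin = 2 / ||w||
print(clf.support_vectors_)                   # the nearest vectors from each class
```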
The above is a quadratic optimization task, subject to a set of linear inequality constraints. The Karush-Kuhn-Tucker (KKT) conditions state that the minimizer satisfies:
(1) $\dfrac{\partial}{\partial w} L(w, w_0, \lambda) = 0$
(2) $\dfrac{\partial}{\partial w_0} L(w, w_0, \lambda) = 0$
(3) $\lambda_i \ge 0, \quad i = 1, 2, \dots, N$
(4) $\lambda_i \left[y_i\,(w^T x_i + w_0) - 1\right] = 0, \quad i = 1, 2, \dots, N$
where $L(w, w_0, \lambda)$ is the Lagrangian
$$L(w, w_0, \lambda) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \lambda_i \left[y_i\,(w^T x_i + w_0) - 1\right]$$
(Support Slide)
The solution: from the above, it turns out that
$$w = \sum_{i=1}^{N} \lambda_i y_i x_i, \qquad \sum_{i=1}^{N} \lambda_i y_i = 0$$
(Support Slide)
Remarks:
• The Lagrange multipliers can be either zero or positive. Thus,
$$w = \sum_{i=1}^{N_s} \lambda_i y_i x_i$$
where $N_s \le N$ is the number of vectors corresponding to positive Lagrange multipliers.
• From constraint (4) above, i.e., $\lambda_i \left[y_i\,(w^T x_i + w_0) - 1\right] = 0$, $i = 1, 2, \dots, N$, the vectors contributing to $w$ satisfy
$$w^T x_i + w_0 = \pm 1$$
(Support Slide)
– These vectors are known as SUPPORT VECTORS and are the closest vectors, from each class, to the classifier.
– Once $w$ is computed, $w_0$ is determined from conditions (4).
– The optimal hyperplane classifier of a support vector machine is UNIQUE.
– Although the solution is unique, the resulting Lagrange multipliers are not necessarily unique.
Dual Problem Formulation
• The SVM formulation is a convex programming problem, with
– a convex cost function
– a convex region of feasible solutions
• Thus, its solution can be obtained via its dual problem, i.e.,
– maximize $L(w, w_0, \lambda)$
– subject to
$$w = \sum_{i=1}^{N} \lambda_i y_i x_i, \qquad \sum_{i=1}^{N} \lambda_i y_i = 0, \qquad \lambda \ge 0$$
(Support Slide)
• Combine the above to obtain:
– maximize
$$\sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j\, y_i y_j\, x_i^T x_j$$
– subject to
$$\sum_{i=1}^{N} \lambda_i y_i = 0, \qquad \lambda \ge 0$$
(Support Slide)
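The dual can be solved with a general-purpose optimizer; a small sketch (SciPy assumed) that minimizes the negative of the objective above under the stated constraints and then recovers $w = \sum_i \lambda_i y_i x_i$; intended for separable toy data, names illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_fit(X, y):
    N = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T            # G_ij = y_i y_j x_i^T x_j

    cost = lambda lam: 0.5 * lam @ G @ lam - lam.sum()    # negative of the dual objective
    cons = {"type": "eq", "fun": lambda lam: lam @ y}     # sum_i lambda_i y_i = 0
    res = minimize(cost, np.full(N, 1.0 / N),
                   bounds=[(0.0, None)] * N, constraints=cons)

    lam = res.x
    w = ((lam * y)[:, None] * X).sum(axis=0)              # w = sum_i lambda_i y_i x_i
    sv = lam > 1e-6                                        # the support vectors
    w0 = np.mean(y[sv] - X[sv] @ w)                        # from condition (4)
    return w, w0, lam
```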
Remarks:
• Support vectors enter via inner products, $x_i^T x_j$.

Non-Separable Classes
In this case, there is no hyperplane such that
$$y\,(w^T x + w_0) > 1, \qquad \forall x$$
• Recall that the margin is defined as twice the distance between the following two hyperplanes:
$$w^T x + w_0 = 1 \qquad \text{and} \qquad w^T x + w_0 = -1$$
The training vectors belong to one of three possible categories:
1) Vectors outside the band which are correctly classified, i.e., $y_i\,(w^T x + w_0) > 1$
2) Vectors inside the band which are correctly classified, i.e., $0 \le y_i\,(w^T x + w_0) \le 1$
3) Vectors misclassified, i.e., $y_i\,(w^T x + w_0) < 0$
All three cases above can be represented as
$$y_i\,(w^T x + w_0) \ge 1 - \xi_i$$
with
1) $\xi_i = 0$
2) $0 < \xi_i \le 1$
3) $\xi_i > 1$
The variables $\xi_i$ are known as slack variables.
The goal of the optimization is now two-fold:
• maximize the margin
• minimize the number of patterns with $\xi_i > 0$.
One way to achieve this goal is via the cost
$$J(w, w_0, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} I(\xi_i), \qquad I(\xi_i) = \begin{cases} 1, & \xi_i > 0 \\ 0, & \xi_i = 0 \end{cases}$$
where $C$ is a constant.
• $I(\cdot)$ is not differentiable. In practice, we use the approximation
$$J(w, w_0, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i$$
• Following a procedure similar to the separable case, we obtain the conditions below.
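At the optimum the slacks take the value $\xi_i = \max\!\left(0,\, 1 - y_i(w^T x_i + w_0)\right)$, so the practical cost above is the familiar hinge-loss objective; a minimal sketch of evaluating it (NumPy assumed):

```python
import numpy as np

def soft_margin_cost(w, w0, X, y, C):
    """J(w, w0, xi) = 0.5*||w||^2 + C * sum_i xi_i, with the optimal slack values."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + w0))   # slack variables xi_i
    return 0.5 * (w @ w) + C * xi.sum()
```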
KKT conditions:
(1) $w = \sum_{i=1}^{N} \lambda_i y_i x_i$
(2) $\sum_{i=1}^{N} \lambda_i y_i = 0$
(3) $C - \mu_i - \lambda_i = 0, \quad i = 1, 2, \dots, N$
(4) $\lambda_i\left[y_i\,(w^T x_i + w_0) - 1 + \xi_i\right] = 0, \quad i = 1, 2, \dots, N$
(5) $\mu_i \xi_i = 0, \quad i = 1, 2, \dots, N$
(6) $\mu_i \ge 0, \; \lambda_i \ge 0, \quad i = 1, 2, \dots, N$
(Support Slide)
The associated dual problem:
Maximize
$$\sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j\, y_i y_j\, x_i^T x_j$$
subject to
$$\sum_{i=1}^{N} \lambda_i y_i = 0, \qquad 0 \le \lambda_i \le C, \quad i = 1, 2, \dots, N$$
Remarks: The only difference from the separable-class case is the presence of $C$ in the constraints.
(Support Slide)
Training the SVM

A major problem is the high computational cost. To this end, decomposition techniques are used. The rationale behind them consists of the following:
• Start with an arbitrary data subset (working set) that can fit in memory. Perform optimization on it via a general-purpose optimizer.
• The resulting support vectors remain in the working set, while the others are replaced by new vectors (outside the set) that violate the KKT conditions most severely.
• Repeat the procedure.
• The above procedure guarantees that the cost function decreases.
• Platt's SMO algorithm chooses a working set of only two samples, so that an analytic optimization solution can be obtained.
Multi-class generalization

Although theoretical generalizations exist, the most popular approach in practice is to treat the problem as M two-class problems (one against all).

Example: Observe the effect of different values of C in the case of non-separable classes.
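A sketch of such an experiment with scikit-learn (assumed available), training a linear SVM on synthetic overlapping classes for several values of C; smaller C tolerates more margin violations (wider margin, more support vectors), while larger C penalizes them more heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),      # class omega_1
               rng.normal(1.5, 1.0, (50, 2))])     # class omega_2, overlapping omega_1
y = np.hstack([np.ones(50), -np.ones(50)])

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7.2f}  margin={2 / np.linalg.norm(w):.3f}  "
          f"support vectors={len(clf.support_)}")
```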