TRANSCRIPT
-
Support Vector Machines:
Algorithms and Applications
Tong Wen
LBNL
in collaboration with
Alan Edelman
MIT
Jan 17, 2003
-
Overview
• Support Vector Machines (SVMs) solve classification problems by learning from examples.
• Contents:
1. Introduction to Support Vector Machines.
2. Fast SVM training algorithms.
3. Financial applications of SVMs.
• Notation:
x_i — the i-th training point
y_i — the label of the i-th training point
α — the vector of Lagrange multipliers
α_i — the i-th Lagrange multiplier
w — the vector giving the normal direction of a hyperplane
b — the bias term
d — the dimension of a training point
n — the number of training points
H — the Hessian matrix
X — the matrix whose columns are the training points
Z — the matrix that spans the current search space
Y — the orthogonal complement of Z
S_1 — the set of SVs that are correctly separated
S_2 — the set of SVs that are separated with errors
h(x) — a decision rule (function)
-
Introduction to Support Vector Machines
• An example.
• The primal SVM quadratic programming problem.
• The dual SVM quadratic programming problem.
• Properties of Support Vectors (SVs).
-
An Example
• The mechanism that generates each point x and its color y:
[Diagram: y is drawn with P(y = −1) = 0.5 and P(y = 1) = 0.5, then x is drawn from the class-conditional density P(x | y).]
• The main theme of classification is to determine a decision rule
h(x): x → y ∈ {−1, 1}.
For this example, x ∈ R^2.
• What is the optimal solution to our detection problem?
-
An Example
• By Bayes’ rule, the maximum-likelihood decision rule is
h_ML(x) = sign( P(x | y = 1) − P(x | y = −1) ),
which minimizes the risk
R(h) = P( h(x) ≠ y ).
[Figure: two Gaussian classes with means m− = (1, 1) and m+ = (3, 3) and σ = 1; the regions h(x) = −1 and h(x) = 1 are separated by a linear boundary with normal vector w.]
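For the two-Gaussian example above, the maximum-likelihood rule can be evaluated directly by comparing the class-conditional densities. A minimal sketch (function names are illustrative, not from the talk); with m− = (1, 1), m+ = (3, 3), and σ = 1, the rule reduces to the linear boundary x1 + x2 = 4:

```python
import math

def gaussian_density(x, m, sigma):
    """Isotropic Gaussian density N(m, sigma^2 I) evaluated at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return math.exp(-sq / (2.0 * sigma ** 2)) / ((2.0 * math.pi) ** (d / 2.0) * sigma ** d)

def h_ml(x, m_minus, m_plus, sigma):
    """Maximum-likelihood rule: pick the class whose density at x is larger."""
    return 1 if gaussian_density(x, m_plus, sigma) > gaussian_density(x, m_minus, sigma) else -1
```

With equal priors this is also the Bayes-optimal rule, so no classifier can have lower risk on data generated by this mechanism.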
-
Support Vector Machines
• The decision rule h_ML(x) depends on P(x, y).
• In general, statistical information such as P(x, y) is not available. A distribution-free approach is needed.
• The SVM decision rule is determined based on the training examples:
S = { (x_1, y_1), ..., (x_n, y_n) }.
• The training error:
R_emp(h) = (1/n) Σ_{i=1}^{n} I( h(x_i) ≠ y_i ).
• Overfitting vs. the training error.
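The empirical risk above is just the fraction of misclassified training pairs; a minimal sketch (the rule and training set below are made up for illustration):

```python
def training_error(h, S):
    """Empirical risk: the fraction of training pairs (x_i, y_i) with h(x_i) != y_i."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

# A fixed linear rule evaluated on a tiny hand-made training set.
sign = lambda t: 1 if t >= 0 else -1
h = lambda x: sign(x[0] + x[1] - 4)
S = [((1, 1), -1), ((3, 3), 1), ((0, 0), -1), ((4, 4), 1), ((3, 0), 1)]
err = training_error(h, S)   # only the point (3, 0) is misclassified
```

A rule with zero training error may still generalize poorly, which is the overfitting tension the slide points at.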
-
Support Vector Machines
• In the previous example,
h_ML(x) = sign( w · x + b ).
• Every decision rule geometrically corresponds to a separating boundary.
• If f(x) = 0 defines the separating boundary, then the corresponding decision rule has the general form
h(x) = sign( f(x) ).
• The SVM decision rule:
h_SVM(x) = sign( f_SVM(x) ),
f_SVM(x) = w · x + b = Σ_i α_i y_i ( x_i · x ) + b.
-
Support Vector Machines
• SVMs use two parallel hyperplanes instead of one to determine the separating boundary. These two (oriented) hyperplanes are computed directly from the training set S.
[Figure: two parallel hyperplanes with normal vector w and gap d = 2 / ||w||.]
-
The Primal SVM QP Problem: the Separable Case
• w and b are determined by the quadratic programming (QP) problem:
minimize_{w, b}  (1/2) ||w||^2
subject to
y_i ( w · x_i + b ) ≥ 1  for i = 1, ..., n.
• We can always separate the training points by mapping them to a higher-dimensional space H′ (d′ > d), such that in H′ the two sets { x_i : y_i = 1 } and { x_i : y_i = −1 } can be separated by hyperplanes.
• In practice, people may not want to fully separate the training set because of overfitting.
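The classic illustration of separation through a higher-dimensional map is XOR-style data: not linearly separable in the plane, but separable after one extra product coordinate. A minimal sketch (the map phi and the hyperplane (w, b) are illustrative choices, not from the talk):

```python
def phi(x):
    """A toy quadratic feature map: append the product coordinate x1*x2."""
    return (x[0], x[1], x[0] * x[1])

# XOR-style labels: not linearly separable in R^2, but separable after phi.
points = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
labels = [1, -1, -1, 1]
w, b = (0.0, 0.0, 1.0), 0.0            # a separating hyperplane in the mapped space
margins = [y * (sum(wi * zi for wi, zi in zip(w, phi(x))) + b)
           for x, y in zip(points, labels)]
```

Every margin equals 1, so the mapped points satisfy the primal constraints y_i ( w · φ(x_i) + b ) ≥ 1 exactly.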
-
The Primal SVM QP Problem: the Inseparable Case
• To accommodate separation errors on S, nonnegative slack variables (errors) ξ_i are introduced:
minimize_{w, b, ξ}  (1/2) ||w||^2 + C Σ_{i=1}^{n} ξ_i
subject to (for i = 1, ..., n)
y_i ( w · x_i + b ) ≥ 1 − ξ_i  and  ξ_i ≥ 0.
• If the training point x_i is correctly separated by the soft-margin hyperplanes, then ξ_i = 0.
• The coefficient C controls the trade-off between the size of the gap and how well S is separated.
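At the optimum each slack takes the value ξ_i = max(0, 1 − y_i ( w · x_i + b )), i.e. the hinge loss. A minimal sketch, using a made-up one-dimensional example:

```python
def slacks(w, b, S):
    """xi_i = max(0, 1 - y_i (w . x_i + b)): zero iff the point respects the soft margin."""
    return [max(0.0, 1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
            for x, y in S]

# For w = 1, b = -1 in one dimension, the margin hyperplanes sit at x = 0 and x = 2.
S = [((2.0,), 1), ((0.0,), -1), ((0.5,), 1)]
xi = slacks((1.0,), -1.0, S)   # only the third point violates its margin
```

The first two points sit exactly on their margin hyperplanes (ξ = 0); the third lands inside the gap and pays ξ = 1.5.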
-
The Dual SVM QP Problems
• In practice, the dual problems are computed instead.
Separable case:
minimize_α  (1/2) α^T H α − e^T α
subject to
y^T α = 0,
α ≥ 0.
Inseparable case:
minimize_α  (1/2) α^T H α − e^T α
subject to
y^T α = 0,
0 ≤ α ≤ C.
• The Hessian matrix H_ij = y_i y_j ( x_i · x_j ) is an n × n symmetric positive semi-definite matrix.
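The positive semi-definiteness claim follows because H is the Gram matrix of the vectors y_i x_i. A minimal sketch building H for a tiny made-up training set and checking a quadratic form:

```python
def hessian(X, y):
    """H_ij = y_i y_j (x_i . x_j) for training points X (as rows) and labels y."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    n = len(X)
    return [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]

X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [1, -1, 1]
H = hessian(X, y)
# H = (diag(y) X)(diag(y) X)^T is a Gram matrix, hence symmetric and PSD:
quad = lambda v: sum(v[i] * H[i][j] * v[j] for i in range(len(v)) for j in range(len(v)))
```

Since every v^T H v is a squared norm ||Σ v_i y_i x_i||^2, it can never be negative.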
-
The Positive Definite Kernel Functions
• Since only inner products are involved in the dual problems, a positive definite kernel function K(x, z) is used in place of x · z, where
f_SVM(x) = Σ_i α_i y_i K(x_i, x) + b  and  H_ij = y_i y_j K(x_i, x_j).
• Using kernel functions avoids the curse of dimensionality.
• Three positive definite kernels are used here:
linear:  K(x, z) = x · z
poly(p):  K(x, z) = ( x · z + 1 )^p
rbf(σ):  K(x, z) = exp( −||x − z||^2 / (2σ^2) )
• For simplicity, we use K in place of K(x, z) in our later discussions.
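The three kernels can be sketched directly; note the 2σ² normalization in the RBF kernel is one common convention (the slide's exact form did not survive extraction), and some packages use exp(−||x−z||²/σ) instead:

```python
import math

def linear(x, z):
    """linear: K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def poly(p):
    """poly(p): K(x, z) = (x . z + 1)^p"""
    return lambda x, z: (linear(x, z) + 1.0) ** p

def rbf(sigma):
    """rbf(sigma): K(x, z) = exp(-||x - z||^2 / (2 sigma^2)),
    assuming the 2*sigma^2 convention."""
    return lambda x, z: math.exp(-sum((a - b) ** 2 for a, b in zip(x, z))
                                 / (2.0 * sigma ** 2))
```

Any of these can be dropped into the dual in place of the plain inner product, since only K(x_i, x_j) values enter H.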
-
Support Vectors
• At optimality,
w = Σ_i α_i y_i x_i,
and if 0 < α_i < C (so that ξ_i = 0), then
y_i ( w · x_i + b ) = 1,  i.e.  b = y_i − w · x_i.
• The sets
S_1 = { x_i : 0 < α_i < C } (SVs that are correctly separated, ξ_i = 0) and
S_2 = { x_i : α_i = C } (SVs that are separated with errors, ξ_i > 0)
contain the support vectors (SVs).
• SVs are the training points that determine the two separating hyperplanes through linear equations.
• The solution to the dual problem is sparse.
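Given a dual solution α, both w and b are recovered exactly as above. A minimal sketch on a one-dimensional toy problem whose solution is known in closed form (the helper name is illustrative):

```python
def bias_from_free_svs(alpha, X, y, C):
    """Recover w = sum_i alpha_i y_i x_i and b = y_i - w . x_i, averaged over free SVs."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    d = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(d)]
    free = [i for i in range(len(X)) if 0.0 < alpha[i] < C]
    b = sum(y[i] - dot(w, X[i]) for i in free) / len(free)
    return w, b

# 1-D toy problem: x = 0 (y = -1) and x = 2 (y = +1); the optimum is w = 1, b = -1,
# with alpha = (0.5, 0.5) from w = sum alpha_i y_i x_i and y^T alpha = 0.
w, b = bias_from_free_svs([0.5, 0.5], [(0.0,), (2.0,)], [-1, 1], C=10.0)
```

Averaging b over the free SVs is numerically safer than using a single one, though here every free SV gives the same value.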
-
Solving the Dual SVM QP Problem Efficiently
• Solving the dual problem is computationally challenging.
• The decomposition strategy.
• A projected conjugate gradient algorithm.
• Speed considerations.
• Numerical experiments.
• Conclusion and future work.
-
The Decomposition Strategy
• The sparsity property.
• An observation from the primal problem point of view:
[Diagram: numbered training points near the margin; only a few of them determine the separating hyperplanes.]
• How to apply this strategy to solve the dual problem efficiently?
1. Constructing a subproblem.
2. Solving the subproblems.
3. The optimality conditions.
• A projected conjugate gradient algorithm.
-
Applying the Decomposition Strategy
• Projecting the dual QP problem onto a subspace.
– Let us restrict α to the affine space determined by α_k and Z. That is, α = α_k + Z s.
– The projected dual problem is
minimize_s  (1/2) s^T ( Z^T H Z ) s + s^T Z^T g_k
subject to
y^T Z s = 0,
0 ≤ α_k + Z s ≤ C,
where g_k is the gradient of the dual objective at α_k.
– Z is a general matrix so far. It can be specified directly or by its orthogonal complement Y, where Y^T Z = 0.
• How to determine Z or Y, and how to solve each subproblem?
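One standard way to keep a step s feasible with respect to the equality constraint y^T Z s = 0 is to project it orthogonally onto that constraint's null space. A minimal sketch of the projection alone (an illustration of the idea, not necessarily the exact mechanism used in the talk's algorithm):

```python
def project_onto_constraint(s, y):
    """Orthogonally project a step s onto the null space of y^T s = 0:
    s_proj = s - (y^T s / y^T y) y."""
    yts = sum(si * yi for si, yi in zip(s, y))
    yty = sum(yi * yi for yi in y)
    return [si - (yts / yty) * yi for si, yi in zip(s, y)]

s_proj = project_onto_constraint([1.0, 1.0, 1.0], [1.0, 1.0, -1.0])
```

After projection, y^T s_proj = 0 holds exactly, so moving along s_proj preserves the equality constraint.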
-
Constructing a Subproblem
• Let α = α_k + Z s.
• If Z is chosen as the columns of the identity matrix indexed by a working set W, then the subproblem only involves the variables α_i with i ∈ W:
minimize_s  (1/2) s^T H_WW s + s^T g_W
subject to
y_W^T s = 0,
0 ≤ α_W + s ≤ C,
where H_WW, g_W, y_W, and α_W are the sub-blocks of H, g_k, y, and α_k indexed by W.
• Alternatively, Z can be specified through its orthogonal complement Y with Y^T Z = 0.
• Z is an n × m matrix (m ≪ n).
-
Solving the Subproblems by the CG Method
• If the box constraint 0 ≤ α_k + Z s ≤ C never becomes active, we only have to solve
( Z^T H Z ) s = −Z^T g_k,
a symmetric linear system that the conjugate gradient (CG) method handles well.
• The CG iteration (with A = Z^T H Z and right-hand side b = −Z^T g_k):
r_0 = b − A s_0,  p_0 = r_0
γ_j = ( r_j^T r_j ) / ( p_j^T A p_j )
s_{j+1} = s_j + γ_j p_j
r_{j+1} = r_j − γ_j A p_j
β_j = ( r_{j+1}^T r_{j+1} ) / ( r_j^T r_j )
p_{j+1} = r_{j+1} + β_j p_j
• What if the box constraint 0 ≤ α_k + Z s ≤ C becomes active during the iteration?
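The CG iteration on this slide is the textbook method (not FMSvm's tuned version); it only needs A through matrix-vector products, which is why H never has to be formed explicitly. A minimal sketch:

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Standard CG for a symmetric positive definite system A x = b,
    with A supplied only through matrix-vector products."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for x = 0
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        gamma = rs / sum(pi * qi for pi, qi in zip(p, Ap))
        x = [xi + gamma * pi for xi, pi in zip(x, p)]
        r = [ri - gamma * qi for ri, qi in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
x = conjugate_gradient(lambda v: [sum(A[i][j] * v[j] for j in range(2))
                                  for i in range(2)], [1.0, 2.0])
```

For an n × n SPD system, CG converges in at most n steps in exact arithmetic, and much faster when the spectrum is clustered, which is what the adaptive choice of q later exploits.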
-
The Optimality Conditions
• An equivalent problem: the dual QP restricted to the constraints that are currently active.
• Define the vector λ of Lagrange multipliers for the active bound constraints, computed from the gradient g and the multiplier of the equality constraint y^T α = 0.
• α is optimal iff the projected gradient vanishes and λ ≥ 0.
• Since the active variables are fixed at their bounds, λ is easy to compute from g and y.
• Each time active constraints are to be relaxed, we start with the one that has the most negative λ_i.
-
A Projected Conjugate Gradient Algorithm
• A summary of our algorithm:
initialize α_0;
while the optimality conditions do not hold
  run at most q steps of the CG iteration on the current subproblem;
  update α and the gradient g;
  relax at most p active constraints with the most negative λ_i;
  update the working set; compute the columns of H corresponding to the relaxed constraints; update Z^T H Z and Z^T g;
end
• Memory requirement: X, g, and α (H is not precomputed!).
• Flops: solving a sequence of smaller problems is cheaper when the solution is sparse.
• Two parameters: p and q.
• How to initialize α_0?
-
Speed Considerations: Being Adaptive to the Memory Hierarchy
• A model of the modern computer memory hierarchy, with access latency (clock cycles) increasing at each level:
registers → L1 cache → L2 cache (and other caches) → near main memory → far main memory → distributed main memory (message passing) / disk.
• Linear algebra operations: BLAS levels 1, 2, and 3.
• ATLAS BLAS.
• Thanks to MATLAB’s inclusion of ATLAS BLAS.
-
Speed Considerations: Setting � and � Adaptively
• The spectrum of H.
– The size of Z^T H Z.
– The convergence of the CG method.
• q is used to control the number of steps of the CG iteration.
– There is no need to compute every subproblem exactly.
– The rank deficiency of H.
• p is used to control the number of active constraints relaxed each time.
– When p active constraints are relaxed, K(x_i, x_j) is computed to generate the corresponding columns of H.
– If p is set large, then there is more chance of some previously relaxed constraints becoming active again.
• p and q are set to n_e, the estimated number of leading eigenvalues of H.
– In our algorithm, n_e is set to the number of iterations the CG method needs to reduce the residual norm by a fixed factor for the initial subproblem.
-
Numerical Experiments
• The two training sets:

Training sets                          Digit17    Face
Size of the training set               13007      31022
The positive training points           6742       2901
The negative training points           6265       28121
The dimension of the training points   784        361
The format of the training points      sparse     dense
Separability                           easier     harder

• Comparing against SVMlight (C) and SvmFu (C++).
• The choices of K(x, z) and C:
Training sets Digit17-6000 and Digit17: l(1), l(0.1), p2(0.1), p2(0.001), r10(10), r10(1), r3(1), r3(0.1).
Training sets Face-6000, Face-13901, and Face-31022: l(4), l(1), l(0.1), p2(0.1), p2(0.001), r10(10), r3(10), r3(1).
-
Numerical Experiments
Using Default Settings vs. Using Estimated Optimal Settings
[Bar charts: CPU seconds of FMSvm, SVMlight, and SvmFu on Digit17-6000 (kernels l(1), l(0.1), p2(0.1), p2(0.001), r10(10), r10(1), r3(1), r3(0.1)) and Face-6000 (kernels l(4), l(1), l(0.1), p2(0.1), p2(0.001), r10(10), r3(10), r3(1)), under both default and estimated optimal settings; x-axis: kernel (C), y-axis: CPU seconds.]
-
Numerical Experiments
• Comparison of FMSvm with SVMlight and SvmFu with the estimated optimal parameter settings.
[Bar charts: CPU seconds of FMSvm, SVMlight, and SvmFu on Digit17-6000, Face-6000, Digit17-13007, Face-13901, and Face-31022, across the kernel (C) choices listed earlier; x-axis: kernel (C), y-axis: CPU seconds.]
-
Conclusion and Future Work
• Conclusion: FMSvm is
1. efficient;
2. easy to use;
3. highly portable.
• Future work:
1. Try FMSvm on more training problems and platforms.
2. Improve FMSvm’s performance and better understand its convergence behavior.
-
Applying SVMs to Financial Markets
• Identifying problems:
– A mathematical model is available, but only as an approximation.
– Decisions are made based on experience.
• Constructing training sets:
– Domain knowledge.
– Data are from different categories.
– Scaling the training data.
• Using SVMs to predict the movement of the Dow Jones Industrial Average Index (DJIA):
– The geometric Brownian motion solution.
– The SVM solution.
– The comparison between the two solutions.
• Conclusion and future work.
-
The Geometric Brownian Motion Model
• The weak form of the market-efficiency hypothesis.
• The equations:
dS = μS dt + σS dW.
By Itô’s lemma, we have
d( ln S ) = ( μ − σ²/2 ) dt + σ dW.
That is,
ln( S_t / S_0 ) ~ N( ( μ − σ²/2 ) t, σ² t ).
• We want to predict on Thursday (after the market is closed) whether or not the closing index of the DJIA on Friday is higher than its opening index on Monday.
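Since ln(S_t / S_0) is Gaussian under this model, the probability that the index ends higher has a closed form. A minimal sketch (the function name and the sample parameter values are illustrative):

```python
import math

def prob_up(mu, sigma, t):
    """P(S_t > S_0) under GBM: ln(S_t / S_0) ~ N((mu - sigma^2/2) t, sigma^2 t)."""
    mean = (mu - 0.5 * sigma ** 2) * t
    std = sigma * math.sqrt(t)
    # P(N(mean, std^2) > 0) = Phi(mean / std), written via the error function.
    return 0.5 * (1.0 + math.erf(mean / (std * math.sqrt(2.0))))

p = prob_up(mu=0.0005, sigma=0.01, t=5.0)   # e.g. five trading days
```

Note that the sign of the drift μ − σ²/2 alone decides whether this probability is above or below 1/2, which is what the GBM decision rule on the next slides uses.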
-
The Dow Jones Industrial Average Index
[Figure: the Dow Jones index from 02-Dec-1985 to 28-Dec-2001, plotted on a linear scale (roughly 2000 to 12000) and on a logarithmic scale (10^3 to 10^5).]
-
The Geometric Brownian Motion Solution
• The geometric Brownian motion model:
dS = μS dt + σS dW.
• Its discrete-time version:
ln S_{k+1} = ln S_k + ( μ − σ²/2 ) Δt + σ √Δt ε_k,  ε_k ~ N(0, 1).
• Stock price on Friday:
S_F = S_M exp( ( μ − σ²/2 ) t + σ √t ε ),
where S_M is Monday’s opening index and t spans Monday’s open to Friday’s close.
• We want to know P( S_F > S_M ). Since ln( S_F / S_M ) is Gaussian with mean ( μ − σ²/2 ) t,
h_GBM = 1 if μ − σ²/2 > 0, and h_GBM = −1 otherwise.
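A minimal sketch of h_GBM, under the assumption that the rule predicts the sign of the drift estimated from a trailing window of daily log-returns (the garbled slide is consistent with this reading, but the exact estimator used in the talk is not recoverable):

```python
def h_gbm(log_returns):
    """Predict +1 (index up) iff the drift estimated over the window is positive."""
    drift_hat = sum(log_returns) / len(log_returns)   # mean daily log-return
    return 1 if drift_hat > 0 else -1

up = h_gbm([0.001, -0.0002, 0.0008])   # positive average log-return -> predict +1
```

The window length l (25, 100, or 300 days in the next slide) is the only tuning knob of this predictor.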
-
The Geometric Brownian Motion Solution
• Estimating μ and σ: only μ matters.
[Figure: estimated daily returns (%) on Fridays, 07-Dec-1990 to 28-Dec-2001, for estimation-window lengths l = 25, l = 100, and l = 300.]
• The test set: December 7, 1990 to December 28, 2001.
• The performance: … errors out of … trials. The accuracy of h_GBM on this test set is …%.
-
The SVM Solution
• h_SVM(x) = sign( f(x) ), where “experience” is incorporated in w and b.
• Our first training set: July 5, 1954 to November 30, 1990.
Each training vector x_i collects the index values of one week, and y_i = 1 if Friday’s close is higher than Monday’s open, y_i = −1 otherwise.
• Scaling the raw training vectors: each vector is replaced by its values relative to Monday’s opening index.
• Results (there are … training vectors in total, and the test set contains … vectors):

poly(4), C:  0.2   0.5   0.8   1     1.2   1.4   1.6
Errors:      106   99    96    96    96    98    97

rbf(1), C:   20    40    60    80    100   120   140
Errors:      109   101   101   98    96    96    99
-
The SVM Solution
• Our intuition is that if there is any available information correlated to the movement of the DJIA, then adding it to the training set may improve the performance.
• We add the daily federal funds rate to the training set. We choose this rate simply because it is a daily rate.
• Again, we use the relative values with respect to the Monday rate in the training vectors, instead of the real rates.
• Results:

poly(3), C:  0.05   1     1.5   2.5   10    20    30    35
Errors:      103    102   96    96    94    93    89    89

• When C = …, it takes around … hours (on newton.mit.edu) to train the SVM, compared with the scale of seconds when using the old training set.
-
The Geometric Brownian Motion Model vs. SVMs
• The geometric Brownian motion model is only an approximation.
• The semi-strong form of the market-efficiency hypothesis.
[Figure: number of errors per year, 1991–2001, for h_GBM vs. h_SVM.]
-
Conclusions and Future Work
• Conclusions:
1. SVMs can learn well from time-series data.
2. There is potential to achieve better performance using SVMs.
• Future work:
1. We want to find good applications of SVMs (this needs domain knowledge).
2. How to construct a training set based on data from different categories is also an interesting problem.