
  • Support Vector Machines:

    Algorithms and Applications

    Tong Wen

    LBNL

    in collaboration with

    Alan Edelman

    MIT

    Jan 17, 2003

  • Overview

    - Support Vector Machines (SVMs) solve classification problems by learning from examples.

    - Contents:

    1. Introduction to Support Vector Machines.

    2. Fast SVM training algorithms.

    3. Financial applications of SVMs?

    - Notation:

        x_i : the i-th training point
        y_i : the label of the i-th training point
        α : the vector of Lagrange multipliers
        α_i : the i-th Lagrange multiplier
        w : the vector giving the normal direction of a hyperplane
        b : the bias term
        d : the dimension of a training point
        n : the number of training points
        H : the Hessian matrix
        X : the matrix whose columns are the training points
        Z : the matrix that spans the current search space
        Y : the orthogonal complement of Z
        S1 : the set of SVs that are correctly separated
        S2 : the set of SVs that are separated with errors
        h(x) : a decision rule (function)

  • Introduction to Support Vector Machines

    - An example.

    - The primal SVM quadratic programming problem.

    - The dual SVM quadratic programming problem.

    - Properties of Support Vectors (SVs).

  • An Example

    - The mechanism that generates each point x and its color y:

          P(y = -1) = 0.5,   P(y = 1) = 0.5

      [Figure: the two class-conditional densities P(x | y = 1) and P(x | y = -1).]
    - The main theme of classification is to determine a decision rule

          h : x -> y,   h : R^d -> {-1, +1}.

      For this example, x is in R^2.

    - What is the optimal solution to our detection problem?

  • An Example

    - By Bayes' rule, the maximum likelihood decision rule is

          h_ML(x) = argmax over y in {-1, +1} of P(x | y),

      which minimizes the risk

          R(h) = P( h(x) != y ).

      [Figure: the two Gaussian classes with means m- = (1,1) and m+ = (3,3), σ = 1; the linear boundary with normal direction w separates the regions h(x) = -1 and h(x) = 1.]
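      As a concrete check of this rule (not part of the original slides), here is a minimal Python sketch of the maximum-likelihood classifier for the toy example, assuming equal priors and isotropic Gaussians with the means and σ shown above; the function and test points are illustrative.

```python
import numpy as np

# Two isotropic Gaussian classes with equal priors (the toy example above).
m_neg = np.array([1.0, 1.0])   # mean of class y = -1
m_pos = np.array([3.0, 3.0])   # mean of class y = +1
sigma = 1.0

def h_ml(x):
    """Maximum-likelihood rule: pick the class whose density is larger at x.
    For equal isotropic Gaussians this is a linear rule with normal
    direction w = m_pos - m_neg."""
    w = (m_pos - m_neg) / sigma**2
    b = (m_neg @ m_neg - m_pos @ m_pos) / (2 * sigma**2)
    return 1 if w @ x + b > 0 else -1

print(h_ml(np.array([0.0, 0.5])))   # -> -1 (closer to m_neg)
print(h_ml(np.array([3.5, 2.5])))   # ->  1 (closer to m_pos)
```

      With equal isotropic covariances the rule is linear, and its normal direction w = m+ - m- is the direction drawn in the figure.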

  • Support Vector Machines

    - The decision rule h_ML(x) depends on P(x, y).

    - In general, statistical information such as P(x, y) is not available. A distribution-free approach is needed.

    - The SVM decision rule is determined based on the training examples:

          T = { (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) }.

    - The training error:

          err_T(h) = (1/n) · #{ i : h(x_i) != y_i }.

    - Overfitting vs. the training error.

  • Support Vector Machines

    - In the previous example,

          h_ML(x) = sign( w·x + b ).

    - Every decision rule geometrically corresponds to a separating boundary.

    - If f(x) = 0 defines the separating boundary, then the corresponding decision rule has the general form

          h(x) = sign( f(x) ).

    - The SVM decision rule:

          h_SVM(x) = sign( w·x + b )
                   = sign( Σ_i α_i y_i (x_i·x) + b ),   with w = Σ_i α_i y_i x_i.

  • Support Vector Machines

    - SVMs use two parallel hyperplanes instead of one to determine the separating boundary. These two (oriented) hyperplanes are computed directly from the training set T.

      [Figure: the two parallel margin hyperplanes with normal direction w, separated by a gap of width d = 2 / ||w||.]

  • The Primal SVM QP Problem: the Separable Case

    - w and b are determined by the quadratic programming (QP) problem:

          minimize over w, b:   (1/2) ||w||^2
          subject to   y_i ( w·x_i + b ) >= 1   for i = 1, ..., n.

    - We can always separate the training points by mapping them into a higher-dimensional space R^D (D > d), such that in R^D the two sets { x_i : y_i = 1 } and { x_i : y_i = -1 } can be separated by hyperplanes.

    - In practice, one may not want to fully separate the training set, because of overfitting.

  • The Primal SVM QP Problem: the Inseparable Case

    - To accommodate separation errors on T, nonnegative slack variables (errors) ξ_i are introduced:

          minimize over w, b, ξ:   (1/2) ||w||^2 + C Σ_i ξ_i
          subject to   y_i ( w·x_i + b ) >= 1 - ξ_i   and   ξ_i >= 0,   for i = 1, ..., n.

    - If the training point x_i is correctly separated by the soft-margin hyperplanes, then ξ_i = 0.

    - The coefficient C controls the trade-off between the size of the gap and how well T is separated.
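      To make the soft-margin objective concrete, the sketch below (not the training algorithm presented later in this talk) minimizes the equivalent unconstrained hinge-loss form (1/2)||w||^2 + C Σ_i max(0, 1 - y_i(w·x_i + b)) by plain subgradient descent; the learning rate, epoch count, and toy data are illustrative assumptions.

```python
import numpy as np

def train_soft_margin_primal(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    by plain (sub)gradient descent; this unconstrained hinge-loss form is
    equivalent to the slack-variable QP on the slide."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                        # points with xi_i > 0
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny usage example on toy data shaped like the two-Gaussian example above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_soft_margin_primal(X, y, C=1.0)
print("training error:", np.mean(np.sign(X @ w + b) != y))
```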

  • The Dual SVM QP Problems

    - In practice, the dual problems are computed instead.

      Separable case:

          maximize over α:   1'α - (1/2) α'Hα
          subject to   y'α = 0,   α >= 0.

      Inseparable case:

          maximize over α:   1'α - (1/2) α'Hα
          subject to   y'α = 0,   0 <= α <= C.

    - The Hessian matrix H, with H_ij = y_i y_j (x_i·x_j), is an n × n symmetric positive semi-definite matrix.
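      A minimal sketch of solving this dual directly as a generic QP, assuming NumPy and the cvxopt package are available; this brute-force route is exactly what the decomposition method described later avoids for large n.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(X, y, C=1.0):
    """Solve  min_a (1/2) a'Ha - 1'a  s.t.  y'a = 0,  0 <= a_i <= C,
    with H_ij = y_i y_j (x_i . x_j): the inseparable-case dual, written as a
    minimization so it matches cvxopt's qp(P, q, G, h, A, b) form."""
    n = X.shape[0]
    Yx = y[:, None] * X
    P = matrix(Yx @ Yx.T + 1e-8 * np.eye(n))        # H, plus a tiny ridge for robustness
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()               # the multipliers alpha
```

      Forming and factoring the dense n × n Hessian is impractical once n reaches the tens of thousands, which motivates the decomposition strategy below.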

  • The Positive Definite Kernel Functions

    - Since only inner products are involved in the dual problems, a positive definite kernel function K(x, y) is used instead of x·y, where

          K(x, y) = Φ(x)·Φ(y)   and   H_ij = y_i y_j K(x_i, x_j).

    - Using kernel functions avoids the curse of dimensionality.

    - Three positive definite kernels are used here:

          linear:   K(x, y) = x·y
          poly(p):  K(x, y) = ( x·y + 1 )^p
          rbf(σ):   K(x, y) = exp( -||x - y||^2 / (2σ^2) )

    - For simplicity, we use x in place of Φ(x) in the later discussion.
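      A sketch of the three kernels and the kernel Hessian in NumPy; the exact parameterizations of poly(p) and rbf(σ) in the talk are assumed here (common conventions are shown).

```python
import numpy as np

def linear_kernel(X, Y):
    return X @ Y.T

def poly_kernel(X, Y, p=2):
    return (X @ Y.T + 1.0) ** p

def rbf_kernel(X, Y, sigma=1.0):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, computed for all pairs at once
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * (X @ Y.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_hessian(K, y):
    """H_ij = y_i y_j K(x_i, x_j)."""
    return (y[:, None] * y[None, :]) * K
```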

  • Support Vectors

    - At optimality,

          w = Σ_i α_i y_i x_i,

      and if 0 < α_i < C (0 < α_i in the separable case), then

          y_i ( w·x_i + b ) = 1,   i.e.   b = y_i - w·x_i,

      so that h_SVM(x) = sign( w·x + b ).

    - S1 ∪ S2 contains the support vectors (SVs):

          S1 = { x_i : 0 < α_i < C,  y_i ( w·x_i + b ) = 1 }   (correctly separated),
          S2 = { x_i : α_i = C,  y_i ( w·x_i + b ) = 1 - ξ_i,  ξ_i > 0 }   (separated with errors).

    - SVs are the training points that determine the two separating hyperplanes through linear equations.

    - The solution to the dual problem is sparse.
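      A sketch of how w, b, and the support-vector indices can be read off a dual solution α in the linear-kernel case; the tolerance used to decide α_i > 0 or α_i < C is an illustrative choice.

```python
import numpy as np

def recover_svm_from_dual(alpha, X, y, C, tol=1e-6):
    """Given a dual solution alpha, recover w, b, and the support-vector
    indices.  b is averaged over the margin SVs (0 < alpha_i < C), for which
    y_i (w.x_i + b) = 1 holds at optimality."""
    w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
    sv = alpha > tol                          # all support vectors
    margin_sv = sv & (alpha < C - tol)        # SVs strictly inside (0, C)
    b = np.mean(y[margin_sv] - X[margin_sv] @ w)
    return w, b, np.flatnonzero(sv)

def h_svm(x, w, b):
    return np.sign(w @ x + b)
```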

  • Solving the Dual SVM QP Problem Efficiently

    - Solving the dual problem is computationally challenging.

    - The decomposition strategy.

    - A projected conjugate gradient algorithm.

    - Speed considerations.

    - Numerical experiments.

    - Conclusion and future work.

  • The Decomposition Strategy

    - The sparsity property.

    - An observation from the primal problem point of view:

      [Figure: an illustration of the decomposition strategy.]

    - How to apply this strategy to solve the dual problem efficiently?

      1. Constructing a subproblem.

      2. Solving the subproblems.

      3. The optimality conditions.

    - A projected conjugate gradient algorithm.

  • Applying the Decomposition Strategy

    - Projecting the dual QP problem onto a subspace.

      – Let us restrict α to the affine space determined by α_k and Z. That is, α = α_k + Z s.

      – The projected dual problem is

            minimize over s:   (1/2) s' (Z'HZ) s + (Z'g_k)' s
            subject to   y'Z s = 0,   0 <= α_k + Z s <= C,

        where g_k is the gradient of the dual objective at α_k.

      – Z is a general matrix so far. It can be specified directly or by its orthogonal complement Y, where Z'Y = 0.

    - How to determine Z or Y, and how to solve each subproblem?
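      A small illustration of forming the projected subproblem data, assuming the simplest choice of Z (identity columns indexed by a working set); the function name is hypothetical.

```python
import numpy as np

def project_dual(H, g_k, working_set):
    """Form the data of the projected subproblem for alpha = alpha_k + Z s,
    using the simplest choice of Z: the identity columns of a small working
    set (the talk allows a general Z)."""
    n = H.shape[0]
    Z = np.eye(n)[:, working_set]      # n x p selection matrix, p << n
    return Z, Z.T @ H @ Z, Z.T @ g_k   # reduced Hessian Z'HZ and gradient Z'g_k

# After solving the p-dimensional subproblem for s:  alpha_new = alpha_k + Z @ s
```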

  • Constructing a Subproblem

    - Choose Z so that the subproblem involves only a small working set B of the variables, the rest being held fixed at their current values:

          minimize over s:   (1/2) s' H_BB s + g_B' s
          subject to   y_B' s = 0,   0 <= α_B + s <= C,

      where H_BB = Z'HZ, g_B = Z'g_k, y_B = Z'y, etc.

    - Alternatively, Z can be specified through its orthogonal complement Y (the directions that are held fixed), which leads to a subproblem of the same form.

    - Z is an n × p matrix (p ≪ n).

  • Solving the Subproblems by the CG Method

    - If the constraint 0 <= α_k + Z s <= C never becomes active, the subproblem reduces to the linear system

          (Z'HZ) s = -Z'g_k   (H is symmetric),

      which can be solved by the conjugate gradient (CG) method.

      [Table: the standard CG iteration listed side by side with the simplified version used here.]

    - What if the constraint 0 <= α_k + Z s <= C becomes active during the iteration?
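      For reference, a sketch of the textbook CG iteration one would apply to (Z'HZ) s = -Z'g_k while no bound is active; this is the standard version, not the simplified variant from the slide.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, max_steps=None):
    """Textbook CG for A x = b with A symmetric positive (semi-)definite,
    e.g. A = Z'HZ and b = -Z'g_k for the reduced subproblem."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                  # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(max_steps or n):
        Ap = A @ p
        a = rs / (p @ Ap)          # step length
        x += a * p
        r -= a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # new conjugate direction
        rs = rs_new
    return x
```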

  • The Optimality Conditions

    - An equivalent problem: the dual QP with the currently active bound constraints treated as equalities.

    - Define the reduced gradient Z'g, and let λ, computed through Y (an n × q matrix with q = n - p), be the vector of Lagrange multipliers of the active bound constraints.

    - α is optimal iff

          Z'g = 0   and   λ >= 0.

    - Since [Z Y]'[Z Y] = I, it is easy to compute Z'g and λ.

    - Each time active constraints are to be relaxed, we start with the one that has the most negative λ_i.

  • A Projected Conjugate Gradient Algorithm

    - A summary of our algorithm:

          initialize α_0;
          while Z'g != 0 or some λ_i < 0
              run at most m steps of the CG iteration on the current subproblem;
              update α and the gradient:  α <- α + Z s,  g <- g + H (Z s);
              relax at most r active constraints with the most negative λ_i;
              update λ; compute the columns of H corresponding to the relaxed
              constraints; update Z'HZ and Z'g;
          end

    - Memory requirement (H is not precomputed!): only the columns of H needed by the current subproblem are formed and stored.

    - Flops: solving a sequence of smaller problems is cheaper when the solution is sparse.

    - Two parameters: m and r.

    - How to initialize α_0?
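      To make the overall structure concrete, here is a schematic decomposition loop in Python. It is not FMSvm: the working set is chosen by a simple maximal-violation rule and each subproblem goes to a generic QP solver (cvxopt is assumed) instead of the projected CG machinery above, so treat it only as a sketch of the strategy.

```python
import numpy as np
from cvxopt import matrix, solvers

def decomposition_svm(K, y, C=1.0, p=20, max_iter=200, tol=1e-3):
    """Schematic decomposition loop for  min (1/2)a'Ha - 1'a
    s.t. y'a = 0, 0 <= a <= C: repeatedly optimize a small working set B
    while the remaining multipliers stay fixed."""
    solvers.options["show_progress"] = False
    n = len(y)
    H = (y[:, None] * y[None, :]) * K
    a = np.zeros(n)                                   # feasible start
    for _ in range(max_iter):
        g = H @ a - 1.0                               # gradient of the dual objective
        up = ((y > 0) & (a < C - 1e-12)) | ((y < 0) & (a > 1e-12))
        low = ((y < 0) & (a < C - 1e-12)) | ((y > 0) & (a > 1e-12))
        if not up.any() or not low.any():
            break
        score = -y * g
        if score[up].max() - score[low].min() < tol:  # KKT conditions met
            break
        # working set: the most violating indices from each side
        B = np.unique(np.concatenate([
            np.flatnonzero(up)[np.argsort(-score[up])[:p // 2]],
            np.flatnonzero(low)[np.argsort(score[low])[:p // 2]]]))
        N = np.setdiff1d(np.arange(n), B)
        k = len(B)
        q = H[np.ix_(B, N)] @ a[N] - 1.0              # linear term over B
        sol = solvers.qp(matrix(H[np.ix_(B, B)] + 1e-10 * np.eye(k)), matrix(q),
                         matrix(np.vstack([-np.eye(k), np.eye(k)])),
                         matrix(np.hstack([np.zeros(k), C * np.ones(k)])),
                         matrix(y[B].astype(float).reshape(1, -1)),
                         matrix(float(-y[N] @ a[N])))
        a[B] = np.array(sol["x"]).ravel()
    return a
```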

  • Speed Considerations: Being Adaptive to the Memory Hierarchy

    - A model of the modern computer memory hierarchy:

      [Figure: registers, L1 cache, L2 cache and other caches, main memory, and disk / distributed memory (message passing); the access cost in clock cycles grows at each level: register, L1 cache, L2 cache, near main memory, far main memory, distributed main memory.]

    - Linear algebra operations: BLAS level 2 and 3.

    - ATLAS BLAS.

    - Thanks to MATLAB's inclusion of ATLAS BLAS.

  • Speed Considerations: Setting m and r Adaptively

    - The spectrum of H:

      – the size of Z'HZ;

      – the convergence of the CG method.

    - m is used to control the number of steps of the CG iterations.

      – There is no need to compute every subproblem exactly.

      – The rank deficiency of H.

    - r is used to control the number of active constraints relaxed at each time.

      – When r active constraints are relaxed, the corresponding kernel values K(x_i, x_j) are computed to generate the corresponding columns of H.

      – If r is set large, then there is more chance of some previously relaxed constraints becoming active again.

    - m and r are set to n_e, the estimated number of leading eigenvalues of H.

      – In our algorithm, n_e is set to the number of iterations the CG method needs to reduce the residual norm to a prescribed scale for the initial subproblem.

  • Numerical Experiments

    - The two training sets:

          Training set                            Digit17    Face
          Size of the training set                13007      31022
          The positive training points            6742       2901
          The negative training points            6265       28121
          The dimension of the training points    784        361
          The format of the training points       sparse     dense
          Separability                            easier     harder

    - Comparing against SVMlight (C) and SvmFu (C++).

    - The choices of K(x, y) and C:

          Training sets Digit17-6000 and Digit17-13007:
              l(1), l(0.1), p2(0.1), p2(0.001), r10(10), r10(1), r3(1), r3(0.1)

          Training sets Face-6000, Face-13901 and Face-31022:
              l(4), l(1), l(0.1), p2(0.1), p2(0.001), r10(10), r3(10), r3(1)

  • Numerical Experiments

      [Figure: CPU seconds used by FMSvm, SVMlight, and SvmFu on Digit17-6000 (top) and Face-6000 (bottom) for each kernel/C setting, under the default settings (left) and the estimated optimal settings (right).]

  • Numerical Experiments

    - Comparison of FMSvm with SVMlight and SvmFu under the estimated optimal parameter settings:

      [Figure: CPU seconds used by FMSvm, SVMlight, and SvmFu on Digit17-6000, Face-6000, Digit17-13007, Face-13901, and Face-31022 for each kernel/C setting.]

  • Conclusion and Future Work

    - Conclusion:

    1. efficient;

    2. easy to use;

    3. highly portable.

    - Future work:

    1. Try FMSvm on more training problems and platforms.

    2. Improve FMSvm’s performance and understand better its convergence behavior.

  • Applying SVMs to Financial Markets

    - Identifying problems:

      – A mathematical model is available, but only as an approximation.

      – Decisions are made based on experience.

    - Constructing training sets:

      – Domain knowledge.

      – Data come from different categories.

      – Scaling the training data.

    - Using SVMs to predict the movement of the Dow Jones Industrial Average index (DJIA):

      – The geometric Brownian motion solution.

      – The SVM solution.

      – The comparison between the two solutions.

    - Conclusion and future work.

  • The Geometric Brownian Motion Model

    - The weak form of the market-efficiency hypothesis.

    - The equations:

          dS = μ S dt + σ S dW.

      By Itô's lemma, we have

          d(ln S) = (μ - σ²/2) dt + σ dW.

      That is,

          ln( S_T / S_0 ) ~ N( (μ - σ²/2) T,  σ² T ).

    - We want to predict on Thursday (after the market is closed) whether or not the closing index of the DJIA on Friday is higher than its opening index on Monday.

  • The Dow Jones Industrial Average Index

      [Figure: the Dow Jones index from 02-Dec-1985 to 28-Dec-2001, on a linear scale (top) and on a logarithmic scale (bottom).]

  • The Geometric Brownian Motion Solution

    - The geometric Brownian motion model:

          dS = μ S dt + σ S dW.

    - Its discrete-time version:

          S_{t+Δt} = S_t · exp( (μ - σ²/2) Δt + σ √Δt · ε ),   ε ~ N(0, 1).

    - Stock price on Friday:

          S_Fri = S_Mon · exp( (μ - σ²/2) Δ + σ √Δ · ε ),

      where Δ is the Monday-open-to-Friday-close horizon. We want to know P( S_Fri > S_Mon ).

    - Since ln( S_Fri / S_Mon ) ~ N( (μ - σ²/2) Δ,  σ² Δ ),

          P( S_Fri > S_Mon ) > 1/2   iff   μ - σ²/2 > 0,

      so the prediction is h_GBM = 1 if the estimated drift of ln S is positive, and -1 otherwise.
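      A minimal sketch of the resulting predictor, assuming a rolling window of the last l daily closes is used to estimate the drift of ln S; the window length and the synthetic usage data are illustrative.

```python
import numpy as np

def h_gbm(past_closes, l=100):
    """Predict the sign of ln(S_Fri / S_Mon) under the GBM model: estimate the
    drift of ln S (i.e. mu - sigma^2/2) as the mean of the last l daily log
    returns and predict +1 iff that estimate is positive."""
    closes = np.asarray(past_closes, dtype=float)
    log_ret = np.diff(np.log(closes[-(l + 1):]))   # last l daily log returns
    return 1 if log_ret.mean() > 0 else -1

# Illustrative usage on a synthetic price path (not real DJIA data).
rng = np.random.default_rng(0)
closes = 100 * np.exp(np.cumsum(0.0004 + 0.01 * rng.standard_normal(400)))
print(h_gbm(closes, l=100))
```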

  • The Geometric Brownian Motion Solution

    - Estimating μ and σ: only μ̂ matters.

      [Figure: the estimated daily return (%) on Fridays from 07-Dec-1990 to 28-Dec-2001, for window lengths l = 25, 100, and 300.]

    - The test set: December 7, 1990 to December 28, 2001.

    - The performance: [...] errors out of [...] trials. The accuracy of h_GBM on this test set is [...].

  • The SVM Solution

    - h_SVM(x) = sign( w·x + b ), where "experience" is incorporated in w and b.

    - Our first training set: July 5, 1954 to November 30, 1990. Each raw training vector x̃_i collects the index values observed from Monday through Thursday of week i, and y_i records whether Friday's close is higher than Monday's open.

    - Scaling the raw training vectors: each component of x̃_i is divided by its Monday value to give x_i (a construction sketch follows the results below).

    - Results (there are [...] training vectors in total, and the test set contains [...] vectors):

          poly(4)   C        0.2    0.5    0.8    1      1.2    1.4    1.6
                    Errors   106    99     96     96     96     98     97

          rbf(1)    C        20     40     60     80     100    120    140
                    Errors   109    101    101    98     96     96     99
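      A sketch of how one such scaled training example could be built; the exact composition of the raw weekly vector used in the talk is assumed here, so the helper and its toy inputs are illustrative.

```python
import numpy as np

def make_training_example(week_values, fri_close):
    """Build one (x_i, y_i) pair from a week's raw index values.
    week_values holds the Monday-through-Thursday values (its exact
    composition is an assumption); it is rescaled by its first (Monday)
    entry, and the label records whether Friday's close beat that entry."""
    v = np.asarray(week_values, dtype=float)
    x = v / v[0]                              # relative to the Monday value
    y = 1 if fri_close > v[0] else -1
    return x, y

# Toy numbers, for illustration only.
x, y = make_training_example([100.0, 101.2, 100.4, 102.3], 103.1)
```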

  • The SVM Solution

    - Our intuition is that if there is any available information correlated with the movement of the DJIA, then adding it to the training set may improve the performance.

    - We add the daily federal funds rate to the training set. We choose this rate simply because it is a daily rate.

    - Again, we use the relative values with respect to the Monday rate in the training vectors, instead of the raw rates.

    - Results:

          poly(3)   C        0.05   1      1.5    2.5    10     20     30     35
                    Errors   103    102    96     96     94     93     89     89

    - When C = [...], it takes around [...] hours (on newton.mit.edu) to train the SVM, compared with the scale of seconds when using the old training set.

  • The Geometric Brownian Motion Model VS. SVMs

    - The geometric Brownian motion model is only an approximation.

    - The semi-strong form of the market-efficiency hypothesis.

      [Figure: the number of errors per year made by h_GBM vs. h_SVM, 1991 through 2001.]

  • Conclusions and Future Work

    - Conclusions:

    1. SVMs can learn well from time series data.

    2. There is potential to achieve better performance using SVMs.

    - Future work:

    1. We want to find good applications of SVMs (need domain knowledge).

    2. How to construct a training set based on data from different categories is also an interesting problem.
