
An Information-Based Complexity Analysis of Statistical Learning Algorithms

Mark A. Kon, Boston University

Dedicated to Henryk Wozniakowski on the occasion of his 60th birthday

Abstract

We apply information-based complexity analysis to support vector machine (SVM) algorithms, with the goal of a comprehensive continuous algorithmic analysis of such algorithms.

1. Introduction

This paper provides one example of the increasing number of current applications of the information-based complexity paradigm developed by Henryk Woźniakowski and his collaborators (Traub, Wozniakowski, and Wasilkowski, 1988). We here apply the theory to an important sub-area of computational learning theory, namely support vector machines (SVM; Vapnik, 1998, 2000). These form a class of useful continuous algorithms forming a current focus of machine learning. The algorithmic approach to continuous problems was pioneered by Traub, Wozniakowski and Wasilkowski, as well as Smale, Blum and Shub (1998) and others (see, e.g., Braverman and Cook, 2006). To the knowledge of the author, a continuous algorithmic analysis of SVM (or of other SLT algorithms) has not been done up to now.

In particular, our IBC-based approach naturally separates the SVM procedure into information and algorithmic components. The procedure we analyze is part of a more general one in which information and algorithms are graded by complexity (see Vapnik, 1998, 2000).


1.1 Terminology

We use standard terminology from IBC theory (see Traub et al., 1988, and Traub and Werschulz, 1998). Let $F$ and $G$ be linear spaces representing respectively a class of problems and their solutions. Let $S: F \to G$ be the solution operator taking a problem $f \in F$ to its solution $S(f) \in G$. Let $N: F \to Z$ be an information operator, with $N(f)$ representing information available to a computational algorithm about the problem $f$. An example of such a situation is a case in which $F$ represents a class of partial differential equations (PDE), with $N(f)$ representing full coefficient, boundary, and initial condition information regarding a PDE of interest $f$. The exact problem solution (i.e., the solution of the PDE) is then $S(f)$.

A standard goal in learning theory is the identification of the best input-output function $y = f(\mathbf{x})$ to explain a set of examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ drawn from an unknown underlying probability distribution $\rho$. The aim is to generalize $f$ to the prediction of new values $y$ given inputs $\mathbf{x}$, assuming the pairs $(\mathbf{x}, y)$ continue to be drawn from $\rho$. We will assume here that $\mathbf{x} \in X \subset \mathbb{R}^d$ and $y \in Y \subset \mathbb{R}$. The distribution $\rho$, defined on $X \times Y$, represents a statistical distribution of input-output values $(\mathbf{x}, y)$, combined with all sources of error. The identification of the optimal choice of predictor $f$ from the unknown $\rho$ represents a standard problem in information-based complexity, in which the problem element is the underlying relationship $\rho$, and the solution is the optimal predictor $f_\rho(\mathbf{x})$, with $H$ the class of relations (the hypothesis space) from which $f$ can be chosen.

We thus assume the unknown $\rho \in F$, the space of all probability measures on $X \times Y$ (the dual notation is useful in making a correspondence between IBC and statistical learning). A typical error criterion in statistical learning is the expected loss
$$R(f) = \int_{X \times Y} L(y, f(\mathbf{x}))\, d\rho(\mathbf{x}, y),$$
with $L(y, f(\mathbf{x}))$ a measure of distance between $y$ and the approximation $f(\mathbf{x})$.

Support vector machine algorithms: Support vector machines (SVM) are statistical learning algorithms for classification, in which $Y = \{-1, 1\}$ (which is done by restricting the support of $\rho$ to $X \times \{-1, 1\}$), and the hypothesis space $H$ consists of the set of affine maps from $X \subset \mathbb{R}^d$ to $\mathbb{R}$. We find the best estimate $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$ of $f_\rho(\mathbf{x})$ by minimizing the average distance between $\mathrm{sgn}(f(\mathbf{x}))$ (the sign of $f$) and $y$. The distance is measured locally by $L(y, f(\mathbf{x}))$, defined by
$$L(y, f(\mathbf{x})) = \begin{cases} 0 & \text{if } y = \mathrm{sgn}(f(\mathbf{x})), \\ 1 & \text{otherwise.} \end{cases}$$
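As an illustration of this misclassification-type distance (a sketch only; the affine classifier and the sample below are hypothetical and not part of the analysis above), the following Python fragment evaluates its empirical average for an affine $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$:

```python
import numpy as np

def zero_one_loss(y, fx):
    """Local distance between y and sgn(f(x)): 0 if they agree, 1 otherwise."""
    return (y != np.sign(fx)).astype(float)

def empirical_risk(w, b, X, y):
    """Average 0-1 loss of the affine classifier f(x) = w.x + b on a sample."""
    fx = X @ w + b
    return zero_one_loss(y, fx).mean()

# Hypothetical sample drawn from some rho on R^2 x {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))
print(empirical_risk(np.array([1.0, 0.0]), 0.0, X, y))
```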

Information: The information that we have about the unknown relationship $\rho$ is the set of examples $\mathbf{z} = \{z_i\}_{i=1}^n = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, with the $z_i$ chosen independent and identically distributed (iid) from $\rho$. Our goal is to find an algorithm $\phi$ in a class $\mathcal{A}$ which is able to use the examples to best approximate $f_\rho$, so that $\phi(\mathbf{z}) \approx f_\rho$.

Information optimization: We seek optimal information of (hopefully small) cardinality $n$, for which it is possible to find an algorithm in the given class for which the error is small in "most" cases. We assume that the information operator $N$ varies only in its cardinality $n$. In some cases it is useful to extend the class of allowed information operators to a larger class, and then to optimize by finding the information operator with smallest error. Such optimization can for example rely on dimension reduction, e.g., choosing $N(\rho) = \{(P\mathbf{x}_i, y_i)\}_{i=1}^n$, with $P$ a projection onto coordinates in $\mathbf{x}$ relevant to predicting $y$.

2. Notation and definitions

2.1 Information, algorithm, error

We review here the basic IBC paradigm. We generally assume $F$ to be a space of system states, with $f \in F$ an unknown system state, and we solve a problem with input $f$ and solution $S(f)$. Here $S: F \to G$ is a solution operator. We assume an information operator $N: F \to Z$ (with $Z$ the information space) chosen from a family $\mathcal{N}$. Let $\mathcal{A}$ be a family of allowed algorithms $\phi: Z \to G$; we seek $N \in \mathcal{N}$ and $\phi \in \mathcal{A}$ for which the approximation $\phi(N(f))$ of $S(f)$ is optimal. Informally, we seek $N$ and $\phi$ so that the diagram
$$\begin{array}{ccc} F & \stackrel{S}{\longrightarrow} & G \\[2pt] {\scriptstyle N}\searrow & & \nearrow{\scriptstyle \phi} \\[2pt] & Z & \end{array}$$
commutes maximally; see Traub, Woźniakowski and Wasilkowski (1988) and Traub and Werschulz (1998). We are given a class of primitive operations, each assumed to have unit complexity, whose cardinality measures complexity.

SVM learning: For SVM learning we assume more precisely an SLT model, with the unknown problem element a probability measure $\rho$ on the input-output space $X \times Y$, and with $F$ the space of all probability distributions on $X \times Y$. For simplicity we assume $X = \mathbb{R}^d$. Information about $\rho$ is provided as examples
$$\mathbf{z} = \{z_i\}_{i=1}^n = \{(\mathbf{x}_i, y_i)\}_{i=1}^n \subset X \times Y,$$
with the $z_i$ chosen iid from $\rho$.

We seek $f$ such that if $F(\mathbf{x}) = \mathrm{sgn}(f(\mathbf{x}))$, the graph $\{(\mathbf{x}, F(\mathbf{x}))\}$, with $\mathbf{x}$ distributed as the marginal $\rho_X$ of $\rho$ on $X$, best approximates the relationship between $\mathbf{x}$ and $y$. Two choices of local loss might be
$$L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2,$$
and (when $Y = \{-1, 1\}$)
$$L(y, f(\mathbf{x})) = (1 - y f(\mathbf{x}))_+ \equiv \max\left(1 - y f(\mathbf{x}),\, 0\right),$$
the hinge loss of the standard SVM; we will assume the latter. Define the risk by
$$R(f) = \int_{X \times Y} L(y, f(\mathbf{x}))\, d\rho(\mathbf{x}, y). \qquad (2.1)$$

We let $Y = \{-1, 1\}$ and $L$ be the hinge loss unless otherwise specified. Define the solution operator $S$ taking $\rho$ to the best approximation $f_\rho$ by
$$S(\rho) = f_\rho = \arg\min_f R(f)$$
(assuming the minimum exists and is unique), so that $f_\rho$ minimizes (and takes to 0) the deviation $R(f) - R(f_\rho)$ of the risk of $f$ from the optimal risk $R(f_\rho)$. We seek an algorithm $\phi$ which best approximates $S$. We select $\phi$ from a "small", more easily computable class $\mathcal{A}$ of algorithms whose ranges are in a subset $H \subset G$, with $H$ in this case consisting of the affine separator functions $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$.

Definition of error: We formally define the error between our approximation $\hat{f} = \phi(N(\rho))$ and the optimal $f_\rho$ by
$$e(\hat{f}) = R(\hat{f}) - R(f_\rho) \qquad (2.2)$$
(Vapnik, 1998, 2000). In many cases an error measure like (2.2), at least for a finite dimensional hypothesis space $H$, is equivalent to "standard" error measures (at least when one of the arguments is the optimal $f_\rho$). Indeed, assume the risk functional $R(f)$ for $f \in H$ is a twice differentiable function of $f$ (with $H$ assumed finite dimensional). For a unique minimum $f_\rho$, the Hessian matrix of $R$ at $f_\rho$ must be positive semidefinite. If in addition it is positive definite, it follows that for any metric defined by a norm on $H$, there are constants $c_1, c_2 > 0$ such that, near $f_\rho$,
$$c_1\,\|f - f_\rho\|^2 \le R(f) - R(f_\rho) \le c_2\,\|f - f_\rho\|^2. \qquad (2.3)$$
Then up to a constant the error in (2.2) can be replaced by the squared norm error $\|f - f_\rho\|^2$.

Further, for any finite or infinite dimensional choice of $H$, if we use an $L^2$-type measure of risk,
$$R(f) = \int_{X \times Y} (y - f(\mathbf{x}))^2\, d\rho(\mathbf{x}, y),$$
we have an equality in (2.3). Namely, since $f_\rho(\mathbf{x}) = E(y \mid \mathbf{x})$,
$$R(f) - R(f_\rho) = \int (y - f(\mathbf{x}))^2\, d\rho - \int (y - f_\rho(\mathbf{x}))^2\, d\rho$$
$$= E_{\mathbf{x}}\!\left[ E\!\left((y - f(\mathbf{x}))^2 - (y - f_\rho(\mathbf{x}))^2 \,\middle|\, \mathbf{x}\right) \right] = E_{\mathbf{x}}\!\left[ (f(\mathbf{x}) - f_\rho(\mathbf{x}))^2 \right] = \|f - f_\rho\|^2_{L^2(\rho_X)}, \qquad (2.4)$$
so the error reduces to an $L^2$ one. Above $E_{\mathbf{x}}$ is marginal expectation with respect to $\rho_X$.
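The identity (2.4) can be checked numerically; the sketch below uses a hypothetical choice of $\rho$ (with $y = f_\rho(\mathbf{x}) + $ noise, so that $f_\rho(\mathbf{x}) = E(y\mid\mathbf{x})$ is known) and compares $R(f) - R(f_\rho)$ with the squared $L^2(\rho_X)$ distance by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical rho: x ~ N(0,1), y = sin(x) + N(0, 0.25), so f_rho(x) = E(y|x) = sin(x).
x = rng.normal(size=n)
y = np.sin(x) + 0.5 * rng.normal(size=n)
f_rho = np.sin
f = lambda t: 0.8 * t            # an arbitrary competitor f

lhs = np.mean((y - f(x)) ** 2) - np.mean((y - f_rho(x)) ** 2)   # R(f) - R(f_rho)
rhs = np.mean((f(x) - f_rho(x)) ** 2)                            # ||f - f_rho||^2 in L^2(rho_X)
print(lhs, rhs)                   # the two agree up to Monte Carlo error
```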

2.2 Information and algorithmic error

Assume as above there is an optimal $f_\rho$, and we wish to approximate $f_\rho$ in $H$. More generally, assume a scale of increasing spaces $H_1 \subset H_2 \subset \cdots \subset G$, and that we seek $f_{H_k}$ as the best approximation of $f_\rho$ in $H_k$. Let $\mathcal{A}$ denote the set of all algorithms on the information space $Z$ which map into $H$, i.e., $\phi(\mathbf{z}) \in H$ for $\phi \in \mathcal{A}$. That is, the algorithm space $\mathcal{A}$ consists of all maps from $Z$ to $H$ (we always restrict functions to be measurable); thus $\mathcal{A}$ depends on $H$ only. We seek a $\phi \in \mathcal{A}$ so that $\phi(\mathbf{z})$ is an approximation of $f_H$ and thus of $f_\rho$.

Error definitions: There are two sources of error in the approximation of $f_\rho$ by $\phi(\mathbf{z})$. First, for $\phi \in \mathcal{A}$, there is an algorithmic (approximation) error determined entirely by the choice of approximation space $H$ (equivalently of $\mathcal{A}$), given by
$$e_{\mathrm{alg}} = \inf_{f\in H}\left[R(f) - R(f_\rho)\right] = R(f_H) - R(f_\rho).$$
Here
$$f_H = \arg\inf_{f\in H} R(f)$$
represents any choice which minimizes $R$ over $H$, assuming such a minimizer exists (we assume existence of a minimizer, but if it does not exist the definitions can be modified with approximate minima or minimizing sequences). Note that this and the other errors below are local errors, since they depend on the probability distribution $\rho$.

Second, there is the information (estimation) error,
$$e_{\mathrm{inf}}(n) = \inf_{\phi\in\mathcal{A}} E_{\mathbf{z}}\left[R(\phi(\mathbf{z})) - R(f_H)\right] = \inf_{\phi\in\mathcal{A}} E_{\mathbf{z}}\left[R(\phi(\mathbf{z}))\right] - \inf_{f\in H}R(f),$$
where $\mathbf{z} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ are chosen iid according to $\rho$, and the above expectation is over $\mathbf{z}$. Thus $e_{\mathrm{inf}}(n)$ is the error if we use the best possible algorithm mapping into $H$, using random information limited to cardinality $n$. The error is averaged over the choice of Monte Carlo information $\mathbf{z}$ under $\rho$. On the other hand, $e_{\mathrm{alg}}$ is the error with full information, if the algorithm is restricted to the class of all algorithms mapping into $H$. This error is characterized in SLT (Vapnik, 1998) as approximation error. See Kon and Plaskota (2000) for a study of relationships between these two types of errors in the context of neural network algorithms. Approximation error in this context has also been studied in Smale and Zhou (2003) and in Kon and Raphael (2006).

For convenience assume a minimizer
$$\phi^* = \arg\inf_{\phi\in\mathcal{A}} E_{\mathbf{z}}\left[R(\phi(\mathbf{z}))\right]$$
exists; otherwise we can again use minimizing sequences in the following definitions. Letting $\phi^*(\mathbf{z})$ be the algorithmic estimate of $f_\rho$, we define the full error
$$e(n) = E_{\mathbf{z}}\left[R(\phi^*(\mathbf{z}))\right] - R(f_\rho) = e_{\mathrm{inf}}(n) + e_{\mathrm{alg}}. \qquad (2.5)$$
Thus by using effectively squared errors in our definitions of $e_{\mathrm{inf}}$ and $e_{\mathrm{alg}}$ (as in (2.4)), with our error defined in terms of risks as above, we have an exact equality in (2.5), as opposed to bounds of the type in Kon and Plaskota (2000).

Complexity: We correspondingly separate the complexity of approximation into two parts, information complexity and algorithmic complexity, the inverses of the functions $e_{\mathrm{inf}}(n)$ (information error) and $e_{\mathrm{alg}}$ (algorithmic error). We will assume that the unit of complexity is normalized so that the cost of each additional information operation (the obtaining of a data point $z_i = (\mathbf{x}_i, y_i)$) is 1. We then define the information complexity by
$$\mathrm{comp}_{\mathrm{inf}}(\varepsilon) = \min\left\{ n : e_{\mathrm{inf}}(n) \le \varepsilon \right\}.$$


To define algorithmic complexity, we define for a $\phi \in \mathcal{A}$ (where $\mathcal{A}$ contains all measurable maps from $Z$ into $H$) the algorithmic error
$$e(\phi, n) = E_{\mathbf{z}}\left[R(\phi(\mathbf{z}))\right] - R(f_\rho).$$
We define $\mathrm{comp}(\phi, n)$ to be the complexity of the computation of $\phi(\mathbf{z})$ from given information $\mathbf{z}$ of cardinality $n$ (again in units where one information operation has cost 1).

Now define the algorithmic complexity as
$$\mathrm{comp}_{\mathrm{alg}}(\varepsilon) = \inf\left\{ \mathrm{comp}(\phi, n) : e(\phi, n) \le \varepsilon \right\}. \qquad (2.6)$$
The full $\varepsilon$-complexity of approximation in $\mathcal{A}$ is defined as
$$\mathrm{comp}(\varepsilon) = \inf\left\{ n + \mathrm{comp}(\phi, n) : e(\phi, n) \le \varepsilon \right\}. \qquad (2.7)$$
This defines the best information cardinality and algorithmic complexity level needed to obtain an approximation of the optimal $f_\rho$ within error $\varepsilon$. As mentioned above, we generally will want to choose $\phi$ from a graded family of algorithm classes whose complexity (which depends on the range of the family) will scale with the information cardinality $n$ in order to optimize (2.7). A natural question is how such a scaling should go, which is considered in the following discussion of SVM.
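To illustrate the optimization in (2.7) concretely (a sketch only: the error curves, the algorithm-class index k, and the cost model below are hypothetical and are not taken from the analysis above), one can search over the information cardinality n and the class index for the cheapest combination meeting a target error:

```python
import itertools
import numpy as np

def e_alg(k):          # hypothetical algorithmic (approximation) error of class index k
    return 1.0 / (k + 1)

def e_inf(n, k):       # hypothetical information (estimation) error with n samples
    return np.sqrt((k + 1) * np.log(n + 1) / n)

def comp_alg(n, k):    # hypothetical cost of running an algorithm of class k on n data points
    return (k + 1) * n

def full_complexity(eps, n_grid=range(10, 20001, 10), k_grid=range(0, 20)):
    best = None
    for n, k in itertools.product(n_grid, k_grid):
        if e_inf(n, k) + e_alg(k) <= eps:          # total error, as in (2.5)
            cost = n + comp_alg(n, k)              # information + algorithmic cost, as in (2.7)
            if best is None or cost < best[0]:
                best = (cost, n, k)
    return best

print(full_complexity(0.3))   # (minimal cost, n, k) meeting the target error
```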

3. An example in SVM

We now restrict to a more specific application, the statistical learning theory (SLT) formulation of support vector machines, involving SLT and its algorithmic IBC formulation.

Again let $F$ be the space of probability distributions $\rho$ on $X \times Y$, and $G$ a function space on $X$. As before, specify the map $S: F \to G$ which takes a probability distribution $\rho$ to its functional best approximation $f_\rho$, with the goal of estimating $y$ from $\mathbf{x}$. As above, we define $f_\rho(\mathbf{x})$ to minimize the risk
$$R(f) = \int_{X\times Y} L(y, f(\mathbf{x}))\, d\rho(\mathbf{x}, y).$$
We now assume $y$ is restricted to the values $\pm 1$ through concentration of the measure $\rho$ on $X \times \{-1, 1\}$, and that we are given the data set


$\mathbf{z} = \{z_i\}_{i=1}^n = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ (training data) consisting of random information derived from $\rho$, with $\rho$ unknown. Here $\mathbf{z}$ represents example classifications $y_i$ (positive or negative) of data $\mathbf{x}_i$. The risk-minimizing $f_\rho$ is intended to generalize the examples $\mathbf{z}$, and when $f_\rho(\mathbf{x}) > 0$, a new $\mathbf{x}$ is classified as positive, and otherwise negative. We now restrict the set of classification functions to $H$ (consisting of affine functions on $X$) as in SVM, to simplify estimation.

Empirical risk: Let
$$f_H = \arg\inf_{f\in H} R(f) = \arg\inf_{f\in H}\int_{X\times Y} L(y, f(\mathbf{x}))\, d\rho(\mathbf{x}, y)$$
be the closest element in $H$ to the optimal $f_\rho$. To formulate the choice of algorithm estimating $f_H$, we define the empirical probability distribution $\rho_{\mathbf{z}}$ to be our estimate of $\rho$ given data $\mathbf{z} = \{z_1, \ldots, z_n\}$. It is defined by
$$\rho_{\mathbf{z}} = \frac{1}{n}\sum_{i=1}^n \delta_{z_i},$$
where $z_i = (\mathbf{x}_i, y_i)$ and $\delta_z$ is the point mass at $z$. The empirical risk of any $f$ is the corresponding estimate of the true risk $R(f)$, i.e.,
$$R_{\mathbf{z}}(f) = \int_{X\times Y} L(y, f(\mathbf{x}))\, d\rho_{\mathbf{z}}(\mathbf{x}, y) = \frac{1}{n}\sum_{i=1}^n L(y_i, f(\mathbf{x}_i)). \qquad (3.1)$$
Let
$$f_{\mathbf{z}} = \arg\inf_{f\in H} R_{\mathbf{z}}(f)$$
be the empirical risk minimizer, again assuming that it exists, which is the case under weak hypotheses. The minimizer $f_{\mathbf{z}}$ defines the "$\mathbf{z}$-regression" separator, an approximation of the optimal separator $f_H$ (the true minimizer of $R$ in $H$). Define the algorithm $\phi \in \mathcal{A}$ (the set of all maps from $Z$ to $H$) by $\phi(\mathbf{z}) = f_{\mathbf{z}}$. Thus the hypothesis space $H$ consists of the affine functions $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$ on $X$, and the restriction to $H$ is done by limiting $\mathcal{A}$ to algorithms with range in $H$.
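A minimal sketch of the algorithm $\phi(\mathbf{z}) = f_{\mathbf{z}}$ just described, i.e., minimization of the empirical hinge risk (3.1) over affine $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$ (plain subgradient descent with no margin regularization; the step size, iteration count, and sample below are arbitrary illustrative choices, not prescriptions from the text):

```python
import numpy as np

def hinge(y, fx):
    return np.maximum(0.0, 1.0 - y * fx)

def fit_affine_hinge(X, y, steps=2000, lr=0.05):
    """Approximate f_z = argmin over affine f of the empirical hinge risk (3.1)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        fx = X @ w + b
        active = (y * fx < 1.0)                    # examples where the hinge is not flat
        # Subgradient of (1/n) sum_i (1 - y_i f(x_i))_+ with respect to (w, b).
        gw = -(y[active, None] * X[active]).sum(axis=0) / n
        gb = -y[active].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Hypothetical training sample z = {(x_i, y_i)}.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
w, b = fit_affine_hinge(X, y)
print("empirical hinge risk:", hinge(y, X @ w + b).mean())
```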

We will discuss in section 6 the consideration of a larger scale of SVM algorithm spaces forming a nested family (differing in their ranges $H_q$) which scale with the cardinality of information, though for now we fix $\mathcal{A}$ and $H$.


4. SVM: Convergence rates of empirical risks

4.1 Risk and VC dimension

We now consider some complexity bounds on SVM algorithms. We define the local error of the algorithm $\phi$ (which depends on $\rho$) to be
$$e(\phi, \rho, n) = E_{\mathbf{z}}\left[R(\phi(\mathbf{z}))\right] - R(f_\rho),$$
where $f_\rho$ is the true minimizer of risk, assumed to exist in the class $G$ of all functions on $X$. Recall $\phi(\mathbf{z}) = f_{\mathbf{z}}$, with
$$\mathbf{z} = \{z_1, \ldots, z_n\}, \qquad z_i = (\mathbf{x}_i, y_i),$$
iid and chosen according to $\rho$. To bound the information error as a function of the cardinality $n$, there are several

results in continuous complexity and SLT which are useful here. Letting
$$Q(z, f) = L(y, f(\mathbf{x})), \qquad z = (\mathbf{x}, y),$$
first we have an error bound based on results of Vapnik (1998, 2000). We will use, for a given loss function $Q(z, f)$, its VC dimension, defined for the family $\{Q(\cdot, f) : f \in H\}$. For any family of functions, this is defined by

Definition 4.1: A family $\mathcal{F}$ of real-valued functions on a space $W$ is said to separate a set of points $\{z_1, \ldots, z_k\} \subset W$ if for every subset $A \subset \{z_1, \ldots, z_k\}$ there exist a $g \in \mathcal{F}$ and a $c \in \mathbb{R}$ such that $g(z_i) > c$ if and only if $z_i \in A$.

Definition 4.2: The VC dimension of a family $\mathcal{F}$ of functions on the space $W$ is the cardinality of the largest set of points which is separated by $\mathcal{F}$. If this cardinality is unbounded then the VC dimension of $\mathcal{F}$ is infinite.
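For small point sets, Definition 4.1 can be checked directly by brute force. The sketch below (an illustration only, not part of the analysis) tests whether the affine functions on $\mathbb{R}^d$ separate a given set of points, by checking for each subset whether it can be strictly cut off by an affine function (a small linear feasibility problem):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def cut_off(A_pts, B_pts):
    """True if some affine g has g > 0 on A_pts and g < 0 on B_pts (Definition 4.1 with c = 0 absorbed)."""
    d = A_pts.shape[1] if len(A_pts) else B_pts.shape[1]
    rows, rhs = [], []
    for x in A_pts:                      # -(w.x + b) <= -1
        rows.append(np.append(-x, -1.0)); rhs.append(-1.0)
    for x in B_pts:                      #  (w.x + b) <= -1
        rows.append(np.append(x, 1.0)); rhs.append(-1.0)
    res = linprog(c=np.zeros(d + 1), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def separated_by_affine(points):
    """Check whether the affine functions separate (shatter) the given point set."""
    pts = np.asarray(points, dtype=float)
    for r in range(len(pts) + 1):
        for A in itertools.combinations(range(len(pts)), r):
            mask = np.zeros(len(pts), dtype=bool); mask[list(A)] = True
            if not cut_off(pts[mask], pts[~mask]):
                return False
    return True

# In R^2: three points in general position are separated, four generally are not (VC dimension 3).
print(separated_by_affine([[0, 0], [1, 0], [0, 1]]))          # True
print(separated_by_affine([[0, 0], [1, 0], [0, 1], [1, 1]]))  # False (the XOR labeling fails)
```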

4.2 Error estimates

We now discuss some asymptotic error estimates independent of the initial distribution $\rho$. As above, $H$ is the set of affine functions on $X = \mathbb{R}^d$.

Henceforth let $Q(z, f) = L(y, f(\mathbf{x}))$, and define


$$B = \sup_{f\in H,\ (\mathbf{x}, y)\in\operatorname{supp}\rho} L(y, f(\mathbf{x})) \qquad (4.1)$$
and
$$R_H = \inf_{f\in H} R(f). \qquad (4.2)$$
Define for any probability $0 < \eta < 1$
$$\mathcal{E}(n) = 4\,\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{8}{\eta}}{n},$$
where $h$ is the VC dimension of the set of functions $Q(z, f) = L(y, f(\mathbf{x}))$, $f \in H$, on $X \times Y$.

As above, we assume Monte Carlo information about the unknown relationship $\rho$ in the form $\mathbf{z} = \{z_i\}_{i=1}^n = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, with the $z_i$ iid from $\rho$. We assume a minimizer $f_{\mathbf{z}}$ of the empirical risk exists. Then we have (see Vapnik, 2000, §3.7) a result which only depends on the VC dimension $h$ of the loss class:

Theorem 4.1 (Vapnik): For any non-negative loss $L$, we have with probability at least $1 - 2\eta$, that for random information $\mathbf{z}$ of cardinality $n$,
$$R(f_{\mathbf{z}}) - \inf_{f\in H} R(f) \;\le\; B\sqrt{\frac{\ln(1/\eta)}{2n}} + \frac{B\,\mathcal{E}(n)}{2}\left(1 + \sqrt{1 + \frac{4}{\mathcal{E}(n)}}\right), \qquad (4.3)$$
where $f_{\mathbf{z}}$ is a minimizer in $H$ of the empirical risk $R_{\mathbf{z}}$.

This gives a $2\eta$-PAC (probably approximately correct, with probability greater than $1 - 2\eta$) bound on the SVM error. The formulation in Vapnik (2000) is stated equivalently in terms of the probability parameter $\eta$. As mentioned earlier, the error on the left side can generally be expressed in terms of a norm error on the finite dimensional hypothesis space $H$. Since $B$ depends on the unknown $\rho$, the above error is local in $\rho$. The dependence can be eliminated if we assume, for example, that the loss is bounded as a function of $\mathbf{x}$ and $y$ (Vapnik, 2000, §3.7). We note that the term $\mathcal{E}(n)$ above is uniform in the choice of $\rho$.
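For orientation, the short sketch below tabulates the right-hand side of (4.3) as a function of $n$, using the bound in the form written above (so the constants are only as reliable as that reconstruction; they follow Vapnik (2000, §3.7) up to normalization, and the choices of $h$, $\eta$, and $B$ are illustrative):

```python
import numpy as np

def vc_term(n, h, eta):
    """E(n) = 4 (h (ln(2n/h) + 1) + ln(8/eta)) / n, as used above."""
    return 4.0 * (h * (np.log(2.0 * n / h) + 1.0) + np.log(8.0 / eta)) / n

def bound_rhs(n, h, eta, B=1.0):
    E = vc_term(n, h, eta)
    return B * np.sqrt(np.log(1.0 / eta) / (2.0 * n)) + 0.5 * B * E * (1.0 + np.sqrt(1.0 + 4.0 / E))

for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, bound_rhs(n, h=11, eta=0.05))   # e.g. h = d + 2 with d = 9
```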

4.3 Complexity estimates

We now estimate the information complexity of the standard SVM algorithm by inverting (4.3). We define the ($2\eta$-PAC) $\varepsilon$-information complexity of finding the risk-minimizing $f_H$ by
$$\mathrm{comp}_{\mathrm{inf}}(\varepsilon, \eta) = \min\left\{ n : R(f_{\mathbf{z}}) - R(f_H) \le \varepsilon \text{ with probability at least } 1 - 2\eta \right\},$$
where
$$f_{\mathbf{z}} = \arg\inf_{f\in H} R_{\mathbf{z}}(f), \qquad f_H = \arg\inf_{f\in H} R(f),$$
with $f_{\mathbf{z}}$ formed from information $\mathbf{z}$ of cardinality $n$.

Letting
$$e_n \equiv R(f_{\mathbf{z}}) - R(f_H) = R(f_{\mathbf{z}}) - \inf_{f\in H} R(f),$$
we have from Theorem 4.1 (always with probability at least $1 - 2\eta$)
$$e_n \le B\sqrt{\frac{\ln(1/\eta)}{2n}} + \frac{B\,\mathcal{E}(n)}{2}\left(1 + \sqrt{1 + \frac{4}{\mathcal{E}(n)}}\right),$$
where $R_H$ is as in (4.2). Note that by our definitions we should in fact have $2\eta$ instead of $\eta$ inside the logarithms on the right side; this change is however absorbed in the lower order terms. Thus, since $\mathcal{E}(n) \to 0$ as $n \to \infty$,
$$e_n \le B\sqrt{\mathcal{E}(n)}\,(1 + o(1)) = 2B\left(\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{8}{\eta}}{n}\right)^{1/2}(1 + o(1)).$$
Defining
$$K \equiv \frac{4B^2 h}{\varepsilon^2}, \qquad c \equiv 1 + \ln\frac{2}{h} + \frac{1}{h}\ln\frac{8}{\eta},$$
the requirement $e_n \le \varepsilon$ therefore holds, to leading order, for $n$ sufficiently large that
$$\varepsilon \ge 2B\left(\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{8}{\eta}}{n}\right)^{1/2}.$$
To invert this for $n$, we invert the corresponding equality, replacing $n$ by $n_0$ defined by
$$\varepsilon = 2B\left(\frac{h\left(\ln\frac{2n_0}{h} + 1\right) + \ln\frac{8}{\eta}}{n_0}\right)^{1/2}, \qquad (4.4)$$
so after squaring and letting $K = 4B^2h/\varepsilon^2$,
$$n_0 = K\left(\ln\frac{2n_0}{h} + 1 + \frac{1}{h}\ln\frac{8}{\eta}\right). \qquad (4.5)$$
Collecting the constant terms into $c$ as above, we have
$$n_0 = K\left(\ln n_0 + c\right). \qquad (4.6)$$

We insert a solution of (4.6) of the form
$$n_0 = K\ln K + w, \qquad (4.7)$$
with the expectation that $w$ is of lower order than $K\ln K$ as $\varepsilon \to 0$ (equivalently $K \to \infty$). To validate this and estimate $w$, let $u = w/(K\ln K)$. Using (4.7) in (4.6),
$$K\ln K + w = K\ln\left(K\ln K + w\right) + Kc = K\ln K + K\ln\ln K + K\ln(1 + u) + Kc. \qquad (4.8)$$
Note that
$$K\ln(1 + u) = O(Ku) = O\!\left(\frac{w}{\ln K}\right) = o(w), \qquad (4.9)$$
since
$$u = \frac{w}{K\ln K} \longrightarrow 0 \qquad (4.10)$$
by (4.6); indeed, (4.6) gives $w = n_0 - K\ln K = K\ln\frac{n_0}{K} + Kc = O(K\ln\ln K)$. Here for any function $g$, $g = o(w)$ means by definition that $g/w \to 0$ as $\varepsilon \to 0$. Thus by (4.8) and (4.9), as $\varepsilon \to 0$,
$$w = K\ln\ln K + Kc + o(w), \qquad (4.11)$$
so
$$w = \left(K\ln\ln K + Kc\right)(1 + o(1)).$$
Thus by (4.7)
$$n_0 = K\left(\ln K + \ln\ln K + c\right)(1 + o(1)),$$
or
$$n_0 = \frac{4B^2 h}{\varepsilon^2}\left[\ln\frac{4B^2 h}{\varepsilon^2} + \ln\ln\frac{4B^2 h}{\varepsilon^2} + 1 + \ln\frac{2}{h} + \frac{1}{h}\ln\frac{8}{\eta}\right](1 + o(1)). \qquad (4.12)$$

We note that more information is needed in the asymptotic expansion to determine uniform dependence on $\eta$. The equality in (4.4) above is easily replaced again by the inequality preceding it (since all the functions involved are monotonic). Recalling that one information operation is a unit of information complexity, we have

Theorem 4.2: Given an allowed probability of error $2\eta$, the information complexity of the $2\eta$-PAC approximation (to within $\varepsilon$ of the minimal risk $R_H$ of (4.2)) for the support vector machine in $d$ dimensions is bounded by
$$\mathrm{comp}_{\mathrm{inf}}(\varepsilon, \eta) \le K\left[\ln K + \ln\ln K + 1 + \ln\frac{2}{d+2} + \frac{1}{d+2}\ln\frac{8}{\eta}\right](1 + o(1)), \qquad K = \frac{4B^2(d+2)}{\varepsilon^2}, \qquad (4.13)$$
as $\varepsilon \to 0$, where the error is measured relative to the minimal risk in the hypothesis class, and
$$B = \sup_{f\in H,\ (\mathbf{x},y)\in\operatorname{supp}\rho} L(y, f(\mathbf{x})),$$
assumed finite for the given $\rho$.

The theorem follows from (4.12) since the VC dimension of the space of affine functions on the feature space $X = \mathbb{R}^d$ is $d+1$ (note that up to this point the argument is valid for a general loss $Q$). Now letting $Q(z, f) = L(y, f(\mathbf{x})) = (1 - yf(\mathbf{x}))_+$, the VC dimension of this family of loss functions is determined by first noting that $(1 - yf(\mathbf{x}))_+$ is a monotonic function of $yf(\mathbf{x})$, and hence its VC dimension is bounded by that of the family $\{yf(\mathbf{x}) : f \in H\}$. The VC dimension of $yf(\mathbf{x}) = y(\mathbf{w}\cdot\mathbf{x} + b)$ can be bounded by noticing that (since $f$ is affine) it is an affine combination of the functions $yx_1, \ldots, yx_d$ and $y$, forming (upon mapping $(\mathbf{x}, y)$ into $(y\mathbf{x}, y)$ as a new coordinate system) affine functions in a coordinate system of $d+1$ dimensions. In a $(d+1)$-dimensional coordinate system the set of affine functions has VC dimension bounded by $d+2$. Thus we replace the VC dimension $h$ by $d+2$ on the right side of (4.13) (see Vapnik, 1998; Koiran and Sontag, 1997).
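As a sanity check on the asymptotics (again using the reconstructed form of the bound (4.3), so the constants are only indicative), the sketch below inverts the bound numerically by bisection and compares the resulting $n(\varepsilon)$ with the leading term $K\ln K$ of (4.13):

```python
import numpy as np

def bound_rhs(n, h, eta, B=1.0):
    E = 4.0 * (h * (np.log(2.0 * n / h) + 1.0) + np.log(8.0 / eta)) / n
    return B * np.sqrt(np.log(1.0 / eta) / (2.0 * n)) + 0.5 * B * E * (1.0 + np.sqrt(1.0 + 4.0 / E))

def n_of_eps(eps, h, eta, B=1.0):
    """Smallest n with bound_rhs(n) <= eps (doubling search followed by bisection)."""
    lo = hi = h + 1
    while bound_rhs(hi, h, eta, B) > eps:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if bound_rhs(mid, h, eta, B) <= eps:
            hi = mid
        else:
            lo = mid
    return hi

h, eta, B = 11, 0.05, 1.0          # h = d + 2 with d = 9, illustrative eta and B
for eps in [0.5, 0.2, 0.1, 0.05]:
    K = 4.0 * B**2 * h / eps**2
    print(eps, n_of_eps(eps, h, eta, B), int(K * np.log(K)))   # numerical inversion vs leading term
```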


This gives us a $2\eta$-probabilistic complexity estimate, giving a $2\eta$-PAC complexity to first order. That is, with information of the cardinality in (4.13) we can obtain an approximate solution to the SVM problem, with a (probably) small risk. Note as above that this is a local complexity (dependent on $\rho$ through $B$) if we do not assume the risk function is bounded.

4.4 SVM: Optimality of algorithms

We show here that the above SVM information complexity estimates are within a logarithmic term of being optimal. A simple heuristic argument would be as follows: for any nontrivial distribution $\rho$ and any essentially non-constant loss, standard random information results from Monte Carlo or IBC yield that, with probability bounded below by a positive constant, the error between actual and empirical risk is
$$\left| R(f) - R_{\mathbf{z}}(f) \right| \ge \frac{c}{\sqrt{n}} \qquad (4.14)$$
for some constant $c > 0$. Indeed, even with a loss taking only two values the above holds, and it easily follows that this holds for any loss which is essentially non-constant (i.e., is not equivalent to a constant function) on the support of $\rho$. An informal conclusion is that the above error bound for SVM in Theorem 4.1 is optimal to within a log term.

A precise result along these lines is (Vapnik and Chervonenkis, 1974):

Theorem 4.3 (Vapnik and Chervonenkis): If the function $Q(z, f) = L(y, f(\mathbf{x}))$ is essentially non-constant on the support of $\rho$, then for any such $\rho$, Theorem 4.1 fails to hold if the right hand side of (4.3) is replaced by any function which is $o(n^{-1/2})$.

Thus the error in Theorem 4.1 is within a factor of order $\sqrt{\ln n}$ of being optimal, and it follows easily that the $\varepsilon$-information complexity of SVM in Theorem 4.2 is within a logarithmic factor, $O\!\left(\ln\frac{1}{\varepsilon}\right)$, of being optimal.

5. SVM: Improvement of VC complexity bounds

We now show it is possible to improve the bounds in section 4 so as to eliminate the logarithmic term, if we restrict ourselves to a class of loss functions $L(y, f(\mathbf{x}))$ which are polynomial in the two arguments. Note this class is dense


in the set of all continuous loss functions $L(y, f(\mathbf{x}))$ for compactly supported densities $\rho$.

5.1 Preliminaries

Recalling $H$ is the class of affine functions on $X = \mathbb{R}^d$, fix $M > 0$ and let
$$H_M = \left\{ f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b : |\mathbf{w}| \le M,\ |b| \le M \right\}$$
be the compact space of all such affine $f$. With the empirical information encoded by the empirical measure $\rho_{\mathbf{z}}$, we first consider some probabilistic bounds. By Chebyshev's theorem, for an integrand $g$,
$$\left| \int g\, d\rho - \int g\, d\rho_{\mathbf{z}} \right| \le \frac{\sigma(g)}{\sqrt{\eta\, n}}$$
with probability at least $1 - \eta$ (where $\sigma^2(g)$ is the variance of $g$ under $\rho$), or equivalently
$$\left| R(f) - R_{\mathbf{z}}(f) \right| \le \frac{\sigma\!\left(L(y, f(\mathbf{x}))\right)}{\sqrt{\eta\, n}} \qquad (5.1)$$
with probability at least $1 - \eta$. This bound works for a single $f$, while our goal is to make this bound uniform over the class $H_M$.
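A quick Monte Carlo illustration of the $n^{-1/2}$ rate in (5.1), for a hypothetical $\rho$, a single fixed affine $f$, and the squared (polynomial) loss; the point is only the scaling, not the constants:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    """Hypothetical rho on R x {-1,+1}: x ~ N(0,1), y = sign(x) flipped with probability 0.1."""
    x = rng.normal(size=n)
    flip = rng.random(n) < 0.1
    y = np.where(flip, -np.sign(x), np.sign(x))
    return x, y

f = lambda x: 0.7 * x - 0.1                 # a single fixed affine f
loss = lambda x, y: (y - f(x)) ** 2         # squared (polynomial) loss

xt, yt = sample(2_000_000)                  # large sample as a stand-in for the true risk R(f)
R_true = loss(xt, yt).mean()

for n in [100, 400, 1600, 6400]:
    gaps = [abs(loss(*sample(n)).mean() - R_true) for _ in range(1000)]
    print(n, np.quantile(gaps, 0.95))       # roughly halves when n is quadrupled: O(n^{-1/2})
```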

If $L(y, f(\mathbf{x}))$ is a polynomial in $y$ and $f(\mathbf{x})$ (e.g., the squared error loss $L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$, giving the corresponding "polynomial risk" SVM), we claim the risk also carries the better bound (5.1) for the estimation of the risk by the empirical risk, uniformly over $H_M$. Thus assume
$$L(y, f(\mathbf{x})) = P(y, f(\mathbf{x}))$$
is a polynomial of order $p$. Then the difference between risk and empirical risk (the error) is
$$R(f) - R_{\mathbf{z}}(f) = \int P(y, f(\mathbf{x}))\, d(\rho - \rho_{\mathbf{z}}),$$
and since the difference is $O(n^{-1/2})$ for each monomial in the expansion of $P(y, \mathbf{w}\cdot\mathbf{x} + b)$, we will show that this also holds uniformly in $f$ for the sum, initially by requiring that $|\mathbf{w}|, |b| \le M$. We first require a simple fact:


Lemma 5.1: Given two functions $g_1$ and $g_2$ on a set $W$ which take on their minima,
$$g_1\!\left(\arg\inf_W g_2\right) - \inf_W g_1 \;\le\; 2\,\sup_W |g_1 - g_2|.$$

Proof: Assume this does not hold. Letting $s = \sup_W |g_1 - g_2|$, we would have
$$g_1\!\left(\arg\inf_W g_2\right) - \inf_W g_1 > 2s.$$
This would imply
$$\inf_W g_2 = g_2\!\left(\arg\inf_W g_2\right) \ge g_1\!\left(\arg\inf_W g_2\right) - s > \inf_W g_1 + s.$$
We also have
$$\inf_W g_2 \le g_2\!\left(\arg\inf_W g_1\right) \le g_1\!\left(\arg\inf_W g_1\right) + s = \inf_W g_1 + s,$$
which gives a contradiction.

Note also that for any finite sum of functions $g = \sum_i g_i$, defining $\sigma(g)$ to be the standard deviation of $g$ under $\rho$, we have by the triangle inequality in $L^2(\rho)$,
$$\sigma\!\left(\sum_i g_i\right) \le \sum_i \sigma(g_i), \qquad (5.2)$$
with $\sigma(g) = \left(\int g^2\, d\rho - \left(\int g\, d\rho\right)^2\right)^{1/2}$.

5.2 Polynomial risk functions

We now have:

Lemma 5.2: For any $\eta > 0$ and any polynomial loss $L(y, f(\mathbf{x})) = P(y, f(\mathbf{x}))$, we have with probability at least $1 - \eta$, simultaneously for all $f \in H_M$,
$$\left| R(f) - R_{\mathbf{z}}(f) \right| \le C_M\,\sigma_{\max}\sqrt{\frac{N}{\eta\, n}}, \qquad (5.3)$$
where $\sigma_{\max} = \max_{\alpha, j}\sigma\!\left(y^j\mathbf{x}^\alpha\right)$ is the largest standard deviation (under $\rho$) of a monomial appearing in the expansion of $L(y, \mathbf{w}\cdot\mathbf{x} + b)$, $C_M$ is a constant depending only on $M$, the degree $p$, and the coefficients of $P$, and $N$ is the number of non-zero terms in this expansion as a polynomial in $\mathbf{x}$ and $y$.

Proof: Writing $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$, and defining $\tilde{\mathbf{w}} = (\mathbf{w}, b)$, we have for $f \in H_M$
$$P(y, \mathbf{w}\cdot\mathbf{x} + b) = \sum_{\alpha, j} c_{\alpha j}(\tilde{\mathbf{w}})\, y^j\,\mathbf{x}^\alpha, \qquad (5.4)$$
where the above sum is over all multiindices $\alpha = (\alpha_1, \ldots, \alpha_d)$ with non-negative integer entries and $|\alpha| + j \le p$, and each coefficient $c_{\alpha j}(\tilde{\mathbf{w}})$ is itself a polynomial in the components of $\tilde{\mathbf{w}}$. The total number of distinct powers $y^j\mathbf{x}^\alpha$ actually appearing in the last sum is $N$; note that $\binom{d+1+p}{p}$ is the maximum number of distinct powers which can appear in (5.4), and that the term with $\alpha = 0$, $j = 0$ in (5.4) is always the constant 1.

Note that, by (5.2),
$$\sigma\!\left(\sum_{\alpha, j} c_{\alpha j}(\tilde{\mathbf{w}})\, y^j\mathbf{x}^\alpha\right) \le \sum_{\alpha, j} |c_{\alpha j}(\tilde{\mathbf{w}})|\,\sigma\!\left(y^j\mathbf{x}^\alpha\right). \qquad (5.5)$$

Let $|\tilde{\mathbf{w}}|$ be the vector defined by taking absolute values of the components of $\tilde{\mathbf{w}}$, i.e., $|\tilde{\mathbf{w}}| = (|w_1|, \ldots, |w_d|, |b|)$. Then by (5.1) and (5.5), with probability at least $1 - \eta$ (since we must use (5.1), with $\eta$ replaced by $\eta/N$, $N$ times below),
$$\left| R(f) - R_{\mathbf{z}}(f) \right| = \left| \int P(y, \mathbf{w}\cdot\mathbf{x} + b)\, d(\rho - \rho_{\mathbf{z}}) \right| \le \sum_{\alpha, j} |c_{\alpha j}(\tilde{\mathbf{w}})|\, \left| \int y^j\mathbf{x}^\alpha\, d(\rho - \rho_{\mathbf{z}}) \right| \qquad (5.6)$$
$$\le \sum_{\alpha, j} |c_{\alpha j}(\tilde{\mathbf{w}})|\, \sigma\!\left(y^j\mathbf{x}^\alpha\right)\sqrt{\frac{N}{\eta\, n}} \;\le\; C_M\,\sigma_{\max}\sqrt{\frac{N}{\eta\, n}}, \qquad (5.7)$$
with $C_M = \sum_{\alpha, j}\,\sup_{|\mathbf{w}|, |b| \le M}\left|c_{\alpha j}(\tilde{\mathbf{w}})\right|$, which is finite since each $c_{\alpha j}$ is bounded on $H_M$. The first inequality in (5.6) follows from the expansion (5.4). Note that if all coefficients and all components of $\tilde{\mathbf{w}}$ are non-negative, this inequality is an equality for each fixed term and thus also for the sum over terms. The general argument is then not difficult, given that the components of $\tilde{\mathbf{w}}$ which appear in both sums are taken in absolute value only.

We now need

Lemma 5.3: For any probability distribution $\rho$ on $X \times Y$, $X = \mathbb{R}^d$, and a polynomial loss $L(y, f(\mathbf{x}))$ which is positive definite, the risk $R(f)$ attains its minimum over $H$, with $H$ the class of affine functions. Further, there is no minimum at infinity, i.e., there is no minimizing sequence $f_k(\mathbf{x}) = \mathbf{w}_k\cdot\mathbf{x} + b_k$ with $|\mathbf{w}_k| \to \infty$ or $|b_k| \to \infty$ for which $R(f_k) \to \inf_{f\in H} R(f)$.

Proof: We will assume there is no hyperplane (i.e., proper affine subspace) of $X$ on which the marginal $\rho_X$ is supported. For if such a hyperplane exists we can without loss restrict to it, or to a smaller hyperplane in $X$ which contains no proper sub-hyperplane on which $\rho_X$ is supported, which we assume has been done.


To prove the Lemma, let $f_k(\mathbf{x}) = \mathbf{w}_k\cdot\mathbf{x} + b_k$ be a minimizing sequence for $R$. We claim it suffices to show that $\mathbf{w}_k$ and $b_k$ must remain bounded. Indeed this would automatically prove the last statement of the Lemma. In addition, by taking subsequences, this would imply that the $f_k$ converge pointwise to a fixed $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$, and it is then easy to check that $f$ is a minimizer of $R$, showing that $R$ attains its minimum.

To show $\mathbf{w}_k$ and $b_k$ above remain bounded, assume this is false, for a contradiction. By the minimizing sequence assumption we have
$$R(f_k) \to \inf_{f\in H} R(f) < \infty.$$
On the other hand, since we assume either $\{\mathbf{w}_k\}$ or $\{b_k\}$ is not bounded, $|\mathbf{w}_k|$ or $|b_k|$ has a subsequence which converges to $\infty$. Assume first that a subsequence of $|\mathbf{w}_k|$ converges to $\infty$. By taking subsequences, assume $|\mathbf{w}_k| \to \infty$. Then we claim $R(f_k) \to \infty$ (independently of the $b_k$). To show this, note that given $\epsilon > 0$ and $a > 0$, the set $\{\mathbf{x} : |f_k(\mathbf{x})| \le a\}$ has $\rho_X$-measure smaller than $\epsilon$ for sufficiently large $k$ (since the width of this set, $2a/|\mathbf{w}_k|$, becomes arbitrarily small, and $\rho_X$ is not supported on a proper hyperplane). Thus as $k \to \infty$, $|f_k(\mathbf{x})|$ becomes arbitrarily large outside sets of arbitrarily small measure, and hence $L(y, f_k(\mathbf{x}))$ (which is positive definite) becomes arbitrarily large on a set of measure at least $1 - \epsilon$. Then we would have
$$R(f_k) = \int L(y, f_k(\mathbf{x}))\, d\rho \to \infty,$$
which contradicts the assumption that $R(f_k)$ converges.

Thus we must conclude that $|\mathbf{w}_k|$ has no subsequence which converges to $\infty$, and so remains bounded. On the other hand, if a subsequence of $|b_k|$ converges to $\infty$, then (since by taking a further subsequence the $\mathbf{w}_k$ are bounded) $|\mathbf{w}_k\cdot\mathbf{x}|$ remains bounded on sets of large measure while $|b_k| \to \infty$, so that again $R(f_k) \to \infty$, which would imply $\{f_k\}$ is not a minimizing sequence. Thus $\{b_k\}$ must also be bounded, completing the proof.

Now note that by Theorem 4.1, the minimizer $f_{\mathbf{z}}$ of the empirical risk is close to the minimizer $f_H$ of the true risk, in that for any $\eta > 0$, with probability at least $1 - 2\eta$,
$$R(f_{\mathbf{z}}) - \inf_{f\in H} R(f) \le C\sqrt{\frac{\ln n}{n}} \qquad (5.8)$$
for $n$ large, with $C$ a constant depending on $\rho$, $d$, and $\eta$.


By the last statement of Lemma 5.3, there are $M, \delta > 0$ such that
$$R(f) \ge \inf_{g\in H} R(g) + \delta \quad \text{if } f \notin H_M.$$
Now choose an $M$ and $\delta$ as above. By (5.8), for $n$ sufficiently large we have (with probability at least $1 - 2\eta$) $f_{\mathbf{z}} \in H_M$, since eventually $R(f_{\mathbf{z}}) < \inf_{f\in H} R(f) + \delta$ with at least that probability. By Lemma 5.2, we then have, together with Lemma 5.1:

Theorem 5.4: Given a probability measure $\rho$ on $X \times Y$ and a positive definite polynomial loss $L(y, f(\mathbf{x}))$, a minimizer $f_{\mathbf{z}}$ of the empirical risk is an approximation to any true minimizer $f_H$ of the risk in $H$, in that for $n$ sufficiently large, with probability at least $1 - 2\eta$,
$$R(f_{\mathbf{z}}) - R(f_H) \le 2\sup_{f\in H_M}\left| R(f) - R_{\mathbf{z}}(f) \right| \le 2\,C_M\,\sigma_{\max}\sqrt{\frac{N}{\eta\, n}} = O\!\left(\frac{1}{\sqrt{\eta\, n}}\right),$$
where $M$ is any constant (which always exists) such that all minimizers of $R$ in $H$, and the empirical minimizers $f_{\mathbf{z}}$ for $n$ large, are in $H_M$.

Note the existence of a finite $M$ is guaranteed by the argument before the Theorem. We note that for a compactly supported $\rho$, we can approximate any continuous non-negative loss uniformly by positive definite polynomials in $y$ and $f(\mathbf{x})$ on the support of $\rho$, so we have

Theorem 5.5: For a compactly supported $\rho$ and any continuous non-negative loss function $L$, there exists a polynomial loss $L_1$ which is arbitrarily close to $L$ (in sup-norm on the support of $\rho$), such that the asymptotic error (in the sense of risk) of an SVM using the error criterion $L_1$ is of order $O(n^{-1/2})$, uniformly in $f \in H_M$.

Note that this result depends on $\rho$ and so is not uniform in $\rho$.

5.3 Uniform results in $\rho$

The above results arise from Lemma 5.2, which gives uniform bounds over $f \in H_M$. However, the bound in the Lemma is uniform in $\rho$ only for a class of $\rho$ for which, e.g., $E_\rho|\mathbf{x}|^{2p} \le C$ for some fixed $C$, where $p$ is the largest power of $\mathbf{x}$ appearing in the expansion of $L(y, \mathbf{w}\cdot\mathbf{x}+b)$. This includes any class of $\rho$ supported in a fixed compact region in $X \times Y$. In general, however, our bound is again local and not uniform in $\rho$, since polynomials are unbounded on $\mathbb{R}^d$.

again local and not uniform in , since polynomials are unbounded on .

In order to obtain uniform results in $\rho$, we extend the above observations by noting that for the set of losses $L$ which are polynomial on a compact set of argument values and constant outside it, the above results in fact become uniform in $\rho$:

Theorem 5.6: If $L(y, f(\mathbf{x}))$ is polynomial in $y$ and $f(\mathbf{x})$ in any fixed compact set of values $(y, f(\mathbf{x}))$, and has constant value outside this set, then for all probability distributions $\rho$, with probability at least $1 - \eta$,
$$\sup_{f\in H_M}\left| R(f) - R_{\mathbf{z}}(f) \right| \le \frac{C\sqrt{N+1}}{\sqrt{\eta\, n}},$$
where $C$ is a constant depending only on $M$, the compact set, and the coefficients and the constant value of $L$ (and so not on $\rho$), and $N$ is the number of non-zero terms (in $\mathbf{x}$ and $y$) in the polynomial.

Proof: We have
$$L = L_1 + L_2,$$
with
$$L_1(y, t) = \begin{cases} L(y, t) & \text{if } (y, t) \in K, \\ 0 & \text{otherwise,} \end{cases}$$
where $K$ is the compact set on which $L$ is polynomial, and $L_2 = L - L_1$ takes only the constant value of $L$ (outside $K$) and the value $0$ (on $K$). Therefore with probability at least $1 - \eta$ (since there are at most $N + 1$ distinct terms to which (5.1) is applied below, each with $\eta$ replaced by $\eta/(N+1)$),
$$\left| R(f) - R_{\mathbf{z}}(f) \right| \le \left| \int L_1\, d(\rho - \rho_{\mathbf{z}}) \right| + \left| \int L_2\, d(\rho - \rho_{\mathbf{z}}) \right|. \qquad (5.9)$$

Now we bound the first term using (5.5): on the compact set $K$ the monomials $y^j\mathbf{x}^\alpha$ appearing in the expansion of $L_1$ are bounded by a constant depending only on $K$, so their variances, and hence $\sigma(L_1)$ uniformly over $H_M$, are bounded independently of $\rho$. Meanwhile, for the second term, $L_2/c_0$ (with $c_0$ the constant value of $L$ outside $K$) takes only the values 0 and 1, so that its variance satisfies
$$\sigma^2(L_2/c_0) \le \frac14,$$
since the variance of a $\{0,1\}$-valued function is at most $1/4$. Applying (5.1) to each term then gives a bound which is independent of $\rho$.

Finally we have the uniform analog of Theorem 5.4:

Theorem 5.7: Given a loss function $L$ as in Theorem 5.6, a minimizer $f_{\mathbf{z}}$ of the empirical risk is an approximation to any true minimizer $f_H$ of the risk, uniformly in $\rho$. More precisely, for $n$ sufficiently large, with probability at least $1 - \eta$,
$$R(f_{\mathbf{z}}) - R(f_H) \le \frac{2C\sqrt{N+1}}{\sqrt{\eta\, n}}.$$


This follows from Theorem 5.6 together with Lemma 5.1. Note that since the bounds are uniform in $\rho$, an argument such as in Lemma 5.3 is unnecessary here.

6. Scaled families of algorithms

6.1 Uses of scaled families

Scaled families of algorithms can be useful because increased information typically can be used with increased algorithmic complexity. In some IBC applications increased algorithmic complexity is implicitly scaled with increased information complexity, as with spline algorithms, where more data points yield more spline knots. As a simple example of such scaling, note that given a very large number of data points, linear regression (with an approximation space consisting of affine functions) will generally under-utilize the data. One can enlarge the space of approximation algorithms to have a range made of approximating functions with more parameters, e.g., involving quadratics and cubics of the variables.

See Vapnik (1998, 2000) for an SLT analysis of such scalings of complexity, in which the VC dimension of the family of approximating functions (the range of the allowable algorithms) is scaled with the cardinality of information. For information of cardinality $n$, we can formalize such a scaling by choosing an algorithm $\phi_n: Z \to H_{q(n)}$ whose range has VC dimension $h(n)$, with the scaling chosen so that the error of approximation is minimized.

Defining $f_\rho$ to be the true regression function, i.e., a minimizer in the full space $G$ of the risk $R$, again define (assuming the arg infs exist)
$$f_H = \arg\inf_{f\in H} R(f) \qquad \text{(the closest element in } H \text{ to } f_\rho\text{)},$$
$$f_{\mathbf{z}} = \arg\inf_{f\in H} R_{\mathbf{z}}(f) \qquad \text{(the minimizer in } H \text{ of the empirical risk with } n \text{ data points)}.$$
Recalling the definitions in Section 2.2, we note informally that $e_{\mathrm{inf}}(n)$ (which can be bounded in terms of the VC dimension $h$ of $H$) decreases as $n \to \infty$, while $e_{\mathrm{alg}}$ goes to $0$ as $H$ increases toward $G$. If $h$ is too large relative to $n$, we


have overfitting: the estimation (information) error becomes large, as we are in the wrong space (with too many functions or parameters). In this case the goal is to decrease $h$ in order to lower $e_{\mathrm{inf}}$. In general we scale the VC dimension $h$ of $H$ (equivalently of all algorithms with range $H$) with the information complexity $n$. We want the number of free parameters, measured by $h$ (related to the algorithmic complexity), scaled to the data cardinality (information complexity) $n$. This approach is taken in Kon and Plaskota (2000), where the algorithms are a scaled family of neural networks (with algorithmic complexity defined as the number of neurons).

6.2 Scaling of algorithms: nonlinear SVM generalizations

Scaled families of algorithms: By Theorem 4.1 the information error of an SVM is asymptotically bounded (with probability at least $1 - 2\eta$) as
$$e_{\mathrm{inf}}(n) \le B\left(\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{8}{\eta}}{n}\right)^{1/2}(1 + o(1)),$$
with $h$ the VC dimension of the class of SVM decision functions $\{L(y, f(\mathbf{x})) : f \in H\}$ and $B$ as in (4.1). Making the above discussion more precise, we see that to guarantee $e_{\mathrm{inf}}$ vanishes as $n \to \infty$, we need a scaling of $h = h(n)$ with $n$ so that $h(n)\ln n / n$ decreases to $0$ (Vapnik 1998, 2000). On the other hand, $h(n)$ must increase with $n$ so that the algorithmic error $e_{\mathrm{alg}} \to 0$. Thus the algorithm class $\mathcal{A}_n$ is defined by
$$\mathcal{A}_n = \left\{ \phi: Z \to G \;\middle|\; \operatorname{Ran}\phi \subset H_{q(n)} \right\},$$
and $q(n)$ is chosen so that $\mathrm{VC}(H_{q(n)})$, the VC dimension of the range of $\phi$, scales with $n$.

An SVM example: To give an example of this, let $H_q$ be the polynomials of degree $q$ on $X$. The standard SVM algorithm (minimizing the loss function $L(y, f(\mathbf{x})) = (1 - yf(\mathbf{x}))_+$) will be denoted as
$$\phi_q(\mathbf{z}) = f_{\mathbf{z}, q} = \arg\inf_{f\in H_q} R_{\mathbf{z}}(f).$$
The space of loss functions $(1 - yf(\mathbf{x}))_+$ (as functions of $\mathbf{x}$ and $y$), with $f \in H_1$ affine, has VC dimension bounded by $d + 2$, as shown in section 4.3.


We now extend the standard space of SVM algorithms with range $H_1$ to a scaled set $\{\mathcal{A}_q\}$, with $\mathcal{A}_q$ the set of algorithms with range in $H_q$. Define such algorithms for $q > 1$ to be nonlinear SVM. Practically, such algorithms can be implemented by extending the data vector $\mathbf{x} = (x_1, \ldots, x_d)$ to
$$\tilde{\mathbf{x}} = \left(\mathbf{x}^\alpha\right)_{|\alpha| \le q}, \qquad (6.1)$$
consisting of all monomials
$$\mathbf{x}^\alpha = x_1^{\alpha_1}\cdots x_d^{\alpha_d}$$
of degree $q$ or less in the components $x_i$, and then using a standard SVM algorithm (with range in the affine functions of $\tilde{\mathbf{x}}$). Above $\alpha = (\alpha_1, \ldots, \alpha_d)$ with non-negative integer entries and $|\alpha| = \alpha_1 + \cdots + \alpha_d \le q$.

The dimension $D$ of $\tilde{\mathbf{x}}$ is the number of monomials in $d$ variables of order less than or equal to $q$, i.e., the cardinality of $\{\alpha : |\alpha| \le q\}$. This is the number of non-negative lattice points in $d$ dimensions satisfying $\alpha_1 + \cdots + \alpha_d \le q$, which is of the order of the volume of the corresponding simplex in the positive octant of $\mathbb{R}^d$. Thus
$$D = \binom{d+q}{q} \sim \frac{q^d}{d!}$$
grows polynomially in $q$, with $d$ fixed.

To scale $h$ with $n$, note that the VC dimension of $H_q$ is bounded by $D + 1$, since the VC dimension of a space of functions is bounded by the dimension of its non-constant part plus 1 (see Koiran and Sontag, 1997).
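A sketch of the feature extension (6.1): build all monomials of degree at most $q$ in the components of $\mathbf{x}$ and check that the resulting dimension is $D = \binom{d+q}{q}$ (the helper below is written from scratch for illustration; scikit-learn's PolynomialFeatures performs the same expansion):

```python
import itertools
from math import comb
import numpy as np

def monomial_exponents(d, q):
    """All multi-indices alpha with |alpha| <= q, including alpha = 0 (the constant monomial)."""
    return [a for a in itertools.product(range(q + 1), repeat=d) if sum(a) <= q]

def poly_features(X, q):
    """Map each x in R^d to the extended vector x~ of all monomials x^alpha with |alpha| <= q."""
    X = np.asarray(X, dtype=float)
    alphas = monomial_exponents(X.shape[1], q)
    return np.stack([np.prod(X ** np.array(a), axis=1) for a in alphas], axis=1)

d, q = 3, 2
X = np.random.default_rng(4).normal(size=(5, d))
Xt = poly_features(X, q)
print(Xt.shape[1], comb(d + q, q))   # both give D = C(d+q, q) = 10 for d = 3, q = 2
```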

Thus by Theorem 4.1, recalling (2.5), we have

Theorem 6.1: For the above scaled family of nonlinear (polynomial) SVM algorithms, the $2\eta$-probabilistic error (error holding with probability at least $1 - 2\eta$) satisfies


$$e = e_{\mathrm{inf}} + e_{\mathrm{alg}} \;\le\; B_q\left(\frac{h_q\left(\ln\frac{2n}{h_q} + 1\right) + \ln\frac{8}{\eta}}{n}\right)^{1/2}(1 + o(1)) \;+\; e_{\mathrm{alg}}, \qquad (6.2)$$
where $h_q$ is the VC dimension of the polynomial space $H_q$ and $B_q$ is as in (4.1), with $H$ replaced by $H_q$.

The algorithmic (approximation) error
$$e_{\mathrm{alg}} = \inf_{f\in H_q}\left[R(f) - R(f_\rho)\right] = \inf_{f\in H_q}\int \left[L(y, f(\mathbf{x})) - L(y, f_\rho(\mathbf{x}))\right] d\rho(\mathbf{x}, y) \qquad (6.3)$$
is an approximation theory measurement of the distance between the optimal $f_\rho$ in $G$ (the full function space) and its approximation by polynomials in $H_q$, using the error (6.3). This can be bounded analytically (see the next section) or through simulations.

6.3 Bounding the algorithmic error

We wish to scale $q$ (which determines the algorithmic error and hence the algorithmic complexity) with the information complexity $n$ by letting the range of our algorithms vary through the scale $\{H_q\}$, with $q = 1$ for the standard SVM. To have an error bound which decreases, we can choose the scaling so that $h_q$ grows more slowly than $n$ (e.g., so that $h_q\ln n / n \to 0$).

If we wish to scale $q$ with $n$ to minimize the right side of (6.2), there is a scaling prescription based on bounds on the full error $e$. As an example of this, if the a priori distribution $\rho$ is assumed supported in $\mathbf{x}$ on the ball $B_r = \{|\mathbf{x}| \le r\}$, and is assumed to admit a risk-minimizing function $f_\rho = \arg\inf_{f\in G} R(f)$ all of whose (multiple) directional derivatives $D_{\mathbf{v}}^k f_\rho$ (in all unit directions $\mathbf{v}$ in $\mathbb{R}^d$) are bounded by a constant $C_1$ for $k \le q+1$, then Taylor's theorem with remainder, applied in the direction $\mathbf{v} = \mathbf{x}/|\mathbf{x}|$ from $0$ to $\mathbf{x}$, gives for $|\mathbf{x}| \le r$:
$$f_\rho(\mathbf{x}) = \sum_{k=0}^{q}\frac{|\mathbf{x}|^k}{k!}\,D_{\mathbf{v}}^k f_\rho(0) \;+\; \frac{|\mathbf{x}|^{q+1}}{(q+1)!}\,D_{\mathbf{v}}^{q+1} f_\rho(\mathbf{x}^*), \qquad (6.4)$$
where $\mathbf{x}^*$ is on the line between $0$ and $\mathbf{x}$, and $D_{\mathbf{v}}$ acts only on the variable which is the argument of $f_\rho$, after which it is evaluated at $0$ in the first sum.

We can bound the error in (6.4) as
$$\left| f_\rho(\mathbf{x}) - \sum_{k=0}^{q}\frac{|\mathbf{x}|^k}{k!}\,D_{\mathbf{v}}^k f_\rho(0) \right| \le \frac{C_1\, r^{q+1}}{(q+1)!}.$$

Therefore (letting $f_q = \arg\inf_{f\in H_q} R(f)$):
$$e_{\mathrm{alg}}(q) = R(f_q) - R(f_\rho) \le \frac{C_1\, r^{q+1}}{(q+1)!},$$
since $f$ can be chosen as the order-$q$ Taylor polynomial approximation to $f_\rho$ in $B_r$, the hinge loss is Lipschitz (with constant $1$) in its second argument, and the minimum risk in $H_q$ can be bounded by the risk of this $f$. In this example a choice $q = q(n)$ can be made which minimizes the right side of (6.2) above. Note that such an optimization (in this case on the sum of our upper bounds) occurs when the terms $e_{\mathrm{alg}}(q)$ and
$$e_{\mathrm{inf}}(n, q) \le B_q\left(\frac{h_q\left(\ln\frac{2n}{h_q} + 1\right) + \ln\frac{8}{\eta}}{n}\right)^{1/2}$$
(which holds for $n$ sufficiently large) have rates of change with respect to $q$ which are of the same order, which gives a minimum in $q$.
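The balancing of the two upper bounds can be carried out numerically; the sketch below (with hypothetical constants $B$, $C_1$, $r$, $\eta$, and the estimation bound in the reconstructed form used above) picks, for each $n$, the degree $q$ minimizing the sum of the two bounds in (6.2):

```python
import numpy as np
from math import comb, factorial

def h_q(d, q):                      # VC-dimension bound for polynomials of degree q on R^d
    return comb(d + q, q) + 1

def e_inf_bound(n, d, q, eta=0.05, B=1.0):
    h = h_q(d, q)
    if n <= h:
        return np.inf
    return B * np.sqrt(4.0 * (h * (np.log(2.0 * n / h) + 1.0) + np.log(8.0 / eta)) / n)

def e_alg_bound(q, C1=1.0, r=1.0):
    # Taylor-remainder bound C1 r^(q+1) / (q+1)! from the discussion above.
    return C1 * r ** (q + 1) / factorial(q + 1)

def best_degree(n, d, q_max=10):
    totals = [e_inf_bound(n, d, q) + e_alg_bound(q) for q in range(1, q_max + 1)]
    return 1 + int(np.argmin(totals)), min(totals)

for n in [10**3, 10**5, 10**7]:
    print(n, best_degree(n, d=3))   # the optimal degree grows slowly with n
```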

As mentioned above, dimension reduction (e.g., a projection $\mathbf{x} \to P\mathbf{x}$) is useful for pruning the set of possible $f(\mathbf{x})$ as approximations to the unknown $f_\rho$. This is important when the dimension of the range spaces of the algorithms grows rapidly, e.g., where $H_q$ is the set of multivariate polynomials above. In the latter case the feature (information) vector is $\tilde{\mathbf{x}} = (\mathbf{x}^\alpha)_{|\alpha|\le q}$, where $\mathbf{x}^\alpha = x_1^{\alpha_1}\cdots x_d^{\alpha_d}$. Dimension reduction can be done either through elimination of less relevant variables $x_i$ or by pruning coordinates from the extended vector $\tilde{\mathbf{x}}$.

6.4 An example of scaling of algorithmic and information complexity

The Gaussian case: For SVM, recall that the use of scaled families of nonlinearized feature vectors and corresponding approximation algorithms is a nonlinear SVM (NLSVM). Whether there is an advantage to using an NLSVM reduces in a Gaussian situation to the question: given two multivariate Gaussian distributions
$$\rho_+(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\det(\Sigma_1)^{1/2}}\exp\!\left(-\tfrac12(\mathbf{x}-\boldsymbol{\mu}_1)^T\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right), \qquad (6.5)$$
$$\rho_-(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\det(\Sigma_2)^{1/2}}\exp\!\left(-\tfrac12(\mathbf{x}-\boldsymbol{\mu}_2)^T\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right) \qquad (6.6)$$
(the conditional distributions of the feature vectors $\mathbf{x}$ conditioned on $y = 1$ and $y = -1$, respectively), what is the shape of an optimal separator between the two? With this assumption that $\rho_\pm$ are Gaussian, our goal here is to form an SVM separator $f(\mathbf{x})$ for which the risk function
$$R(f) = \int L(y, f(\mathbf{x}))\, d\rho(\mathbf{x}, y)$$
(proportional to the expected number of classification errors) is minimized. In addition, a weighting of false positives versus false negatives may also be useful, in which case the new risk function is
$$R_w(f) = c_{FP}\,P(y=-1)\!\int_{\{f(\mathbf{x}) > 0\}}\rho_-(\mathbf{x})\, d\mathbf{x} \;+\; c_{FN}\,P(y=1)\!\int_{\{f(\mathbf{x}) < 0\}}\rho_+(\mathbf{x})\, d\mathbf{x}, \qquad (6.7)$$
with $c_{FP}, c_{FN} > 0$. We may be less concerned with false positives than false negatives, so $c_{FP} < c_{FN}$ is a good choice if the overall number of positive examples is larger than the number of negative examples.


We consider the complexity of the SVM with the risk function (6.7), focusing on the algorithmic (approximation) error. If our algorithm based on data $\mathbf{z} = \{(\mathbf{x}_i, y_i)\}$ restricts to the class $H_1$ of affine functions, and if the algorithmic error is large, a decision to extend to a nonlinear SVM instead makes sense. It is then of interest to find the approximation error for hypothesis spaces $H_q$ which include polynomials of order $q$, with $q$ increasing.

Finding the optimal solution: For the risk function (6.7) we can in fact identify the optimal choice of $f$ among all functions (not just affine ones), if we first show that the separation surface
$$S = \left\{ \mathbf{x} : c_{FN}\,P(y=1)\,\rho_+(\mathbf{x}) = c_{FP}\,P(y=-1)\,\rho_-(\mathbf{x}) \right\} \qquad (6.8)$$
(for the above choices (6.5), (6.6) of $\rho_\pm$) is optimal; this can be done using the calculus of variations. Indeed, if there is an infinitesimal variation in the optimal surface resulting in a change $dV$ of the volume separated by $S$, in the direction of the region $\{f > 0\}$ at location $\mathbf{x}$, then the first order increment in $R_w$ above is
$$dR_w = \left[c_{FP}\,P(y=-1)\,\rho_-(\mathbf{x}) - c_{FN}\,P(y=1)\,\rho_+(\mathbf{x})\right] dV,$$
since we are increasing the volume in which points are classified positive and decreasing that in which they are classified negative; we have $dR_w = 0$ since $S$ is stationary.

To higher than first order it follows easily that, since $c_{FP}P(y=-1)\rho_-$ is on the average larger than $c_{FN}P(y=1)\rho_+$ in the added volume (their difference increases in the direction away from $S$), the risk (6.7) has increased. Symmetry shows that when the variation is in the opposite direction, $R_w$ increases as well, proving the optimizing surface is (6.8). The above balance between the two weighted densities in the direction normal to $S$ follows more directly by noting that from (6.5)
$$\ln\rho_+(\mathbf{x}) = -\tfrac12(\mathbf{x}-\boldsymbol{\mu}_1)^T\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) + k_1, \qquad (6.9)$$
where $k_1$ is a constant. Thus the surface $S$ is given by equality of two quadratic polynomials of the form (6.9), namely by


$$-\tfrac12(\mathbf{x}-\boldsymbol{\mu}_1)^T\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) - \tfrac12\ln\det\Sigma_1 + \ln\!\left(c_{FN}\,P(y=1)\right)$$
$$= -\tfrac12(\mathbf{x}-\boldsymbol{\mu}_2)^T\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - \tfrac12\ln\det\Sigma_2 + \ln\!\left(c_{FP}\,P(y=-1)\right). \qquad (6.10)$$

In general (if $\Sigma_1 \ne \Sigma_2$) this surface is quadratic, so that the use of a quadratic SVM (in which $q = 2$) is appropriate. From this we expect that in general cases where the distributions of the positive and negative classes have different covariances $\Sigma_1 \ne \Sigma_2$, so that the quadratic terms in (6.10) do not cancel, there may be significant improvement using a quadratic SVM (giving a quadratic separation surface) over a linear one. This is illustrated in the example of the Wisconsin cancer study below.
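The coefficients of the quadratic surface (6.10) can be written down explicitly from the two Gaussians; a brief sketch (equal class priors and equal weights assumed for simplicity, which are illustrative choices rather than assumptions of the text):

```python
import numpy as np

def quadratic_separator(mu1, S1, mu2, S2):
    """
    Coefficients (A, b, c) of the surface  x^T A x + b . x + c = 0  obtained from (6.10)
    with equal priors and weights, i.e. log rho_+(x) = log rho_-(x).
    """
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    A = 0.5 * (S2i - S1i)
    b = S1i @ mu1 - S2i @ mu2
    c = (0.5 * (mu2 @ S2i @ mu2 - mu1 @ S1i @ mu1)
         + 0.5 * (np.log(np.linalg.det(S2)) - np.log(np.linalg.det(S1))))
    return A, b, c

mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, S2 = np.array([2.0, 0.0]), np.array([[2.0, 0.3], [0.3, 0.5]])
A, b, c = quadratic_separator(mu1, S1, mu2, S2)
print(A)   # A = 0 exactly when the covariances are equal, and the separator is then affine
```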

However, inequality of covariances for positive and negative examples is not always the case. An example involves data in computational biology for which the improvement from linear to quadratic SVM is marginal (Cvetkovski, et al., 2006), implying that in such cases positives and negatives have distributions (if approximately Gaussian) with about the same covariances.

In the case of the risk function (6.7), the surface is (from (6.10))
$$\tfrac12(\mathbf{x}-\boldsymbol{\mu}_2)^T\Sigma_2^{-1}(\mathbf{x}-\boldsymbol{\mu}_2) - \tfrac12(\mathbf{x}-\boldsymbol{\mu}_1)^T\Sigma_1^{-1}(\mathbf{x}-\boldsymbol{\mu}_1) = \ln\frac{(\det\Sigma_1)^{1/2}\,c_{FP}\,P(y=-1)}{(\det\Sigma_2)^{1/2}\,c_{FN}\,P(y=1)}.$$
The effect of the weighting coefficients is to shift the surface without changing its shape. The size of $\|\Sigma_1^{-1} - \Sigma_2^{-1}\|$ determines whether the quadratic SVM will improve the risk significantly over the linear one. This suggests a general criterion for the appropriateness of a quadratic SVM: given empirical covariances $\widehat\Sigma_1$ and $\widehat\Sigma_2$ of the two data sets (assumed sufficiently large), a quadratic SVM is unnecessary only when the norm $\|\widehat\Sigma_1^{-1} - \widehat\Sigma_2^{-1}\|$ is small.

6.5 Example: Application to biomedical informatics data

We apply here the above example of a scaled algorithm family to some data in biomedical informatics, the Wisconsin cancer database (Radwin, 1992). We begin with a standard SVM applied to 9 input variables (measured physical characteristics of a tumor), which predict the output variable $y$, which is cancer malignancy (+1) or non-malignancy (−1). The data, summarized in the table below, are taken from a random selection of 349 training examples and 349 test examples out of 699 total data. The first test via SVM (with data involving all 9 input variables) has an error rate of 13.75% on the test set. When a dimensional reduction is done and the three most useful variables are extracted, there is an SVM error rate of 32.39%. When the nonlinear SVM of degree 2 is applied to these input data, the total error rate goes down to 8.60%.

Machine                     FP    FN    TP    TN    ERR    %ERR
9-variable SVM              37    11   107   194     48   13.75
3-variable SVM              41    72    44   192    113   32.39
3-variable nonlinear SVM    29     1   117   202     30    8.60

Table: FP/TP = false/true positives; FN/TN = false/true negatives; ERR = total errors.

This significant improvement of the quadratic over the linear SVM implies that the covariance matrices $\Sigma_1$ and $\Sigma_2$ for the 3-variable data are significantly different between positive and negative examples. There are some current analogous methods (Holloway, et al., 2006, for the linear case) for identifying transcription initiation sites in the genome from examples, using several SVM feature spaces.
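A sketch of the kind of comparison reported in the table (a reimplementation for illustration only: the train/test split, scikit-learn's linear SVM, and scikit-learn's built-in 30-feature diagnostic copy of the Wisconsin data all differ from the 9-variable database and procedure used above, so the error rates will not reproduce the table):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.svm import LinearSVC

# Note: this is the 30-feature diagnostic Wisconsin set shipped with scikit-learn,
# not the 9-variable database (Radwin, 1992) cited in the text.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

linear = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=20000))
quadratic = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2),
                          LinearSVC(C=1.0, max_iter=20000))

for name, model in [("linear SVM", linear), ("degree-2 SVM", quadratic)]:
    model.fit(X_tr, y_tr)
    err = 1.0 - model.score(X_te, y_te)
    print(f"{name}: test error rate {err:.4f}")
```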

References

Braverman, M. and S. Cook (2006). Computing over the reals: Foundations for scientific computing. Notices AMS 53, 318-329.

Blum, L., F. Cucker, M. Shub, and S. Smale (1998). Complexity and Real Computation. Springer, New York.


Cvetkovski, A., C. DeLisi, D. Holloway, M. Kon, and P. Seal (2006). Dimensional reduction and optimization in TF-gene binding inferences. Technical report, Boston University.

Holloway, D., M. Kon and C. DeLisi (2006). Machine learning for regulatory analysis and transcription factor target prediction in yeast. Preprint, to appear, Systems and Synthetic Biology.

Radwin, M. (1992). Wisconsin breast cancer database, http://www.radwin.org/michael/projects/learning/about-breast-cancer-wisconsin.html

Kon, M. and L. Plaskota (2000). Information complexity of neural networks. Neural Networks 13, 365-376.

Kon, M. and L. Raphael (2006). Statistical learning theory and uniform approximation bounds in wavelet spaces. Preprint.

Koiran, P. and E. Sontag (1997). Vapnik-Chervonenkis dimension of recurrent neural networks. Proceedings of the Third European Conference on Computational Learning Theory, Jerusalem.

Traub, J., G. Wasilkowski, and H. Woźniakowski (1988). Information-Based Complexity. Academic Press, Boston.

Traub, J. and A. Werschulz (1998). Complexity and Information. Cambridge University Press, Cambridge.

Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.

Vapnik, V. (2000). The Nature of Statistical Learning Theory. Springer, New York.

Vapnik, V. and A. Chervonenkis (1974). Theory of Pattern Recognition. Nauka, Moscow.

Zhou, D. and S. Smale (2003). Estimating the approximation error in learning theory. Analysis and Applications 1, 1-25.