
    Introduction to Statistical Learning Theory

Olivier Bousquet¹, Stéphane Boucheron², and Gábor Lugosi³

¹ Max-Planck Institute for Biological Cybernetics
Spemannstr. 38, D-72076 Tübingen, Germany
olivier.bousquet@m4x.org
home page: http://www.kyb.mpg.de/~bousquet

² Université de Paris-Sud, Laboratoire d'Informatique
Bâtiment 490, F-91405 Orsay Cedex, France
stephane.boucheron@lri.fr
home page: http://www.lri.fr/~boucheron

³ Department of Economics, Pompeu Fabra University
Ramon Trias Fargas 25-27, Barcelona, Spain


1. Observe a phenomenon.
2. Construct a model of that phenomenon.
3. Make predictions using this model.

Of course, this definition is very general and could be taken more or less as the goal of Natural Sciences. The goal of Machine Learning is to actually automate this process, and the goal of Learning Theory is to formalize it.

In this tutorial we consider a special case of the above process, which is the supervised learning framework for pattern recognition. In this framework, the data consists of instance-label pairs, where the label is either +1 or -1. Given a set of such pairs, a learning algorithm constructs a function mapping instances to labels. This function should be such that it makes few mistakes when predicting the label of unseen instances.

Of course, given some training data, it is always possible to build a function that fits exactly the data. But, in the presence of noise, this may not be the best thing to do, as it would lead to a poor performance on unseen instances (this is usually referred to as overfitting). The general idea behind the design of learning algorithms is thus to look for regularities in the observed phenomenon (i.e. the training data) that can be generalized from the observed past to the future.

Although it is tempting to minimize the empirical risk R_n(g) (the fraction of training errors of g), it would be unreasonable to look for the function minimizing R_n(g) among all possible functions. Indeed, when the input space is infinite, one can always construct a function g_n which perfectly predicts the labels of the training data (i.e. g_n(X_i) = Y_i, and R_n(g_n) = 0), but behaves on the other points as the opposite of the target function t, i.e. g_n(X) = -Y, so that R(g_n) = 1. So one would have minimum empirical risk but maximum risk.

It is thus necessary to prevent this overfitting situation. There are essentially two ways to do this (which can be combined): the first one is to restrict the class of functions in which the minimization is performed, and the second is to modify the criterion to be minimized (e.g. adding a penalty for 'complicated' functions).

Empirical Risk Minimization. This algorithm is one of the most straightforward, yet it is usually efficient. The idea is to choose a model G of possible functions and to minimize the empirical risk in that model:

g_n = arg min_{g ∈ G} R_n(g).

Of course, this will work best when the target function belongs to G. However, it is rare to be able to make such an assumption, so one may want to enlarge the model as much as possible, while preventing overfitting.
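To make the ERM principle concrete, here is a minimal sketch, assuming a toy one-dimensional data set and a small finite model G of threshold classifiers; the class, the data and all names below are illustrative choices, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: X in R, labels Y in {-1, +1} generated by a noisy threshold at 0.
n = 200
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.where(X > 0.0, 1, -1)
noise = rng.random(n) < 0.1          # flip 10% of the labels
Y[noise] *= -1

# Finite model G: threshold classifiers g_{t,s}(x) = s * sign(x - t).
thresholds = np.linspace(-1.0, 1.0, 41)
signs = [-1, 1]

def empirical_risk(t, s):
    """R_n(g) = fraction of training points misclassified by g_{t,s}."""
    pred = s * np.sign(X - t)
    pred[pred == 0] = s              # break ties arbitrarily
    return np.mean(pred != Y)

# ERM: g_n = argmin_{g in G} R_n(g), here by exhaustive search over the finite model.
best = min(((t, s) for t in thresholds for s in signs),
           key=lambda p: empirical_risk(*p))
print("ERM choice: threshold=%.2f sign=%+d  empirical risk=%.3f"
      % (best[0], best[1], empirical_risk(*best)))

With a finite model the minimization is an exhaustive search; the statistical question addressed in the rest of the tutorial is how far the empirical risk of the selected g_n can be from its true risk.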

Structural Risk Minimization. The idea here is to choose an infinite sequence {G_d : d = 1, 2, ...} of models of increasing size and to minimize the empirical risk in each model with an added penalty for the size of the model:

g_n = arg min_{g ∈ G_d, d ∈ ℕ} R_n(g) + pen(d, n).

The penalty pen(d, n) gives preference to models where estimation error is small and measures the size or capacity of the model.

Regularization. Another approach, usually easier to implement, consists in choosing a large model G (possibly dense in the continuous functions, for example) and defining on G a regularizer, typically a norm ‖g‖. Then one has to minimize the regularized empirical risk:

g_n = arg min_{g ∈ G} R_n(g) + λ ‖g‖².

(Footnote to the overfitting example above: strictly speaking, such a construction is only possible if the probability distribution satisfies some mild conditions (e.g. has no atoms). Otherwise, it may not be possible to achieve R(g_n) = 1, but even in this case, provided the support of P contains infinitely many points, a similar phenomenon occurs.)


Compared to SRM, there is here a free parameter λ, called the regularization parameter, which allows to choose the right trade-off between fit and complexity. Tuning λ is usually a hard problem and most often one uses extra validation data for this task.

Most existing (and successful) methods can be thought of as regularization methods.
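Since the 0-1 empirical risk is hard to minimize directly, the sketch below illustrates regularized empirical risk minimization with a squared-loss surrogate and a closed-form ridge-style solution; the surrogate loss, the synthetic data and the parameter grid are assumptions made only for this illustration.

import numpy as np

rng = np.random.default_rng(1)

# Toy data: X in R^d, Y in {-1, +1}.
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = np.sign(X @ w_true + 0.3 * rng.normal(size=n))

def regularized_fit(lam):
    """argmin_w (1/n) * sum_i (Y_i - <w, X_i>)^2 + lam * ||w||^2 (closed form)."""
    A = X.T @ X / n + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ Y / n)

for lam in [1e-3, 1e-1, 10.0]:
    w = regularized_fit(lam)
    train_err = np.mean(np.sign(X @ w) != Y)
    print("lambda=%-6g  ||w||=%.2f  training 0-1 risk=%.3f"
          % (lam, np.linalg.norm(w), train_err))

Increasing λ shrinks ‖w‖ and typically increases the training error, which is exactly the fit/complexity trade-off controlled by the regularization parameter.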

Normalized Regularization. There are other possible approaches when the regularizer can, in some sense, be 'normalized', i.e. when it corresponds to some probability distribution over G.

Given a probability distribution π defined on G (usually called a prior), one can use as a regularizer -log π(g) (this is fine when G is countable; in the continuous case one has to consider the density associated to π, and we omit these details). Reciprocally, from a regularizer of the form ‖g‖², if there exists a measure µ on G such that ∫ e^{-λ‖g‖²} dµ(g) < ∞ for some λ > 0, then one can construct a prior corresponding to this regularizer. For example, if G is the set of hyperplanes in ℝ^d going through the origin, G can be identified with ℝ^d and, taking µ as the Lebesgue measure, it is possible to go from the Euclidean norm regularizer to a spherical Gaussian measure on ℝ^d as a prior (generalization to infinite dimensional Hilbert spaces can also be done, but it requires more care; one can for example establish a correspondence between the norm of a reproducing kernel Hilbert space and a Gaussian process prior whose covariance function is the kernel of this space).

This type of normalized regularizer, or prior, can be used to construct another probability distribution on G (usually called the posterior), as

π̂(g) = e^{-γ R_n(g)} π(g) / Z(γ),

where γ ≥ 0 is a free parameter and Z(γ) is a normalization factor.

There are several ways in which this can be used. If we take the function maximizing it, we recover regularization, as

arg max_g π̂(g) = arg min_g R_n(g) - (1/γ) log π(g),

where the regularizer is -(1/γ) log π(g); note that minimizing γ R_n(g) - log π(g) is equivalent to minimizing R_n(g) - (1/γ) log π(g).

Also, π̂ can be used to randomize the predictions. In that case, before computing the predicted label for an input x, one samples a function g according to π̂ and outputs g(x). This procedure is usually called Gibbs classification.

Another way in which the distribution π̂ constructed above can be used is by taking the expected prediction of the functions in G:

g_n(x) = sgn(E_π̂[g(x)]).


This is typically called Bayesian averaging.

At this point we have to insist again on the fact that the choice of the class G and of the associated regularizer or prior has to come from a priori knowledge about the task at hand, and there is no universally best choice.

2.2 Bounds

We have presented the framework of the theory and the type of algorithms that it studies; we now introduce the kind of results that it aims at. The overall goal is to characterize the risk that some algorithm may have in a given situation. More precisely, a learning algorithm takes as input the data (X_1, Y_1), ..., (X_n, Y_n) and produces a function g_n which depends on this data. We want to estimate the risk of g_n. However, R(g_n) is a random variable (since it depends on the data) and it cannot be computed from the data (since it also depends on the unknown P). Estimates of R(g_n) thus usually take the form of probabilistic bounds.

Notice that when the algorithm chooses its output from a model G, it is possible, by introducing the best function g* in G, with R(g*) = inf_{g∈G} R(g), to write

R(g_n) - R* = [R(g*) - R*] + [R(g_n) - R(g*)].

The first term on the right hand side is usually called the approximation error, and measures how well functions in G can approach the target (it would be zero if t ∈ G). The second term, called the estimation error, is a random quantity (it depends on the data) and measures how close g_n is to the best possible choice in G.

Estimating the approximation error is usually hard since it requires knowledge about the target. Classically, in Statistical Learning Theory it is preferable to avoid making specific assumptions about the target (such as its belonging to some model); the assumptions are rather on the value of R*, or on the noise function s.

It is also known that for any (consistent) algorithm, the rate of convergence to zero of the approximation error can be arbitrarily slow if one does not make assumptions about the regularity of the target, while the rate of convergence of the estimation error can be computed without any such assumption. We will thus focus on the estimation error. (For this convergence to mean anything, one has to consider algorithms which choose functions from a class which grows with the sample size. This is the case for example of Structural Risk Minimization or Regularization based algorithms.)

Another possible decomposition of the risk is the following:

R(g_n) = R_n(g_n) + [R(g_n) - R_n(g_n)].

In this case, one estimates the risk by its empirical counterpart, plus some quantity which approximates (or upper bounds) R(g_n) - R_n(g_n).

To summarize, we write the three types of results we may be interested in.


– Error bound: R(g_n) ≤ R_n(g_n) + B(n, G). This corresponds to the estimation of the risk from an empirical quantity.
– Error bound relative to the best in the class: R(g_n) ≤ R(g*) + B(n, G). This tells how "optimal" the algorithm is given the model it uses.
– Error bound relative to the Bayes risk: R(g_n) ≤ R* + B(n, G). This gives theoretical guarantees on the convergence to the Bayes risk.

3 Basic Bounds

In this section we show how to obtain simple error bounds (also called generalization bounds). The elementary material from probability theory that is needed here and in the later sections is summarized in Appendix A.

3.1 Relationship to Empirical Processes

Recall that we want to estimate the risk R(g_n) = P[g_n(X) ≠ Y] of the function g_n returned by the algorithm after seeing the data (X_1, Y_1), ..., (X_n, Y_n). This quantity cannot be observed (P is unknown) and is a random variable (since it depends on the data). Hence one way to make a statement about this quantity is to say how it relates to an estimate such as the empirical risk R_n(g_n). This relationship can take the form of upper and lower bounds for

P[R(g_n) - R_n(g_n) > ε].

For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class

F = {f : (x, y) ↦ 1_{g(x)≠y} : g ∈ G}.   (1)

Notice that G contains functions with range in {-1, 1} while F contains non-negative functions with range in {0, 1}. In the remainder of the tutorial we will go back and forth between F and G (as there is a bijection between them), sometimes stating the results in terms of functions in F and sometimes in terms of functions in G. It will be clear from the context which classes G and F we refer to, and F will always be derived from the last mentioned class G in the way of (1).

We use the shorthand notation Pf = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i). P_n is usually called the empirical measure associated to the training sample. With this notation, the quantity of interest (the difference between true and empirical risks) can be written as

Pf_n - P_n f_n.   (2)

An empirical process is a collection of random variables indexed by a class of functions, such that each random variable is distributed as a sum of i.i.d. random variables (values taken by the function at the data points):

{Pf - P_n f}_{f ∈ F}.


One of the most studied quantities associated to empirical processes is their supremum,

sup_{f∈F} (Pf - P_n f).

It is clear that if we know an upper bound on this quantity, it will be an upper bound on (2). This shows that the theory of empirical processes is a great source of tools and techniques for Statistical Learning Theory.

3.2 Hoeffding's Inequality

Let us rewrite again the quantity we are interested in as follows:

R(g) - R_n(g) = E[f(Z)] - (1/n) Σ_{i=1}^n f(Z_i).

It is easy to recognize here the difference between the expectation and the empirical average of the random variable f(Z). By the law of large numbers, we immediately obtain that

P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)] = 0 ] = 1.

This indicates that with enough samples, the empirical risk of a function is a good approximation to its true risk.

It turns out that there exists a quantitative version of the law of large numbers when the variables are bounded.

Theorem 1 (Hoeffding). Let Z_1, ..., Z_n be n i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have

P[ |(1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)]| > ε ] ≤ 2 exp( -2nε² / (b - a)² ).

Let us rewrite the above formula to better understand its consequences. Denote the right hand side by δ. Then

P[ |P_n f - Pf| > (b - a) √(log(2/δ) / (2n)) ] ≤ δ,

or (by inversion, see Appendix A), with probability at least 1 - δ,

|P_n f - Pf| ≤ (b - a) √(log(2/δ) / (2n)).


Applying this to f(Z) = 1_{g(X)≠Y}, we get that for any g, and any δ > 0, with probability at least 1 - δ,

R(g) ≤ R_n(g) + √(log(2/δ) / (2n)).   (3)

Notice that one has to consider a fixed function g, and the probability is with respect to the sampling of the data; if the function depends on the data this does not apply!
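A quick way to see Hoeffding's inequality at work is the simulation sketched below: it assumes Bernoulli losses f(Z_i) (so [a, b] = [0, 1]), draws many samples, and compares the observed (1 - δ)-quantile of |P_n f - Pf| with the bound √(log(2/δ)/(2n)); the parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(2)

p, n, delta, trials = 0.3, 500, 0.05, 20000

# f(Z_i) are i.i.d. Bernoulli(p) losses, so Pf = p and P_n f is their average.
samples = rng.random((trials, n)) < p
deviations = np.abs(samples.mean(axis=1) - p)

observed = np.quantile(deviations, 1 - delta)     # empirical (1-delta)-quantile of |P_n f - Pf|
hoeffding = np.sqrt(np.log(2 / delta) / (2 * n))  # bound for [a, b] = [0, 1]

print("observed (1-delta)-quantile of |P_n f - Pf| : %.4f" % observed)
print("Hoeffding bound sqrt(log(2/delta)/(2n))     : %.4f" % hoeffding)

The empirical quantile stays below the bound, and in fact well below it, since Hoeffding ignores the variance p(1 - p); this slack is revisited in Section 6.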

3.3 Limitations

Although the above result seems very nice (since it applies to any class of bounded functions), it is actually severely limited. Indeed, what it essentially says is that for each (fixed) function f ∈ F, there is a set S of samples for which Pf - P_n f ≤ √(log(2/δ)/(2n)) (and this set of samples has measure P[S] ≥ 1 - δ). However, these sets S may be different for different functions. In other words, for the observed sample, only some of the functions in F will satisfy this inequality.

Another way to explain the limitation of Hoeffding's inequality is the following. If we take for G the class of all {-1, 1}-valued (measurable) functions, then for any fixed sample, there exists a function f ∈ F such that

Pf - P_n f = 1.

To see this, take the function g which satisfies g(X_i) = Y_i on the data and g(X) = -Y everywhere else. This does not contradict Hoeffding's inequality, but shows that it does not yield what we need.

Figure 2 illustrates the above argumentation. The horizontal axis corresponds to the functions in the class, and the vertical axis to the risk.

[Figure 2: the true risk R(g) and the empirical risk R_n(g) plotted over the function class, with the minimizers g* and g_n marked.]
Fig. 2. Convergence of the empirical risk to the true risk over the class of functions.


3.4 Uniform Deviations

Before seeing the data we do not know which function the algorithm will choose, so we bound the deviation between empirical and true risk uniformly over the class:

R(f_n) - R_n(f_n) ≤ sup_{f∈F} (R(f) - R_n(f)).   (4)

In other words, if we can upper bound the supremum on the right, we are done. For this, we need a bound which holds simultaneously for all functions in a class.

Let us explain how one can construct such uniform bounds. Consider two functions f_1, f_2 and define

C_i = {(x_1, y_1), ..., (x_n, y_n) : Pf_i - P_n f_i > ε}.

This set contains all the 'bad' samples, i.e. those for which the bound fails. From Hoeffding's inequality, for each i,

P[C_i] ≤ δ.

We want to measure how many samples are 'bad' for i = 1 or i = 2. For this we use (see Appendix A)

P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2] ≤ 2δ.

More generally, if we have N functions in our class, we can write

P[C_1 ∪ ... ∪ C_N] ≤ Σ_{i=1}^N P[C_i].

As a result we obtain

P[ ∃f ∈ {f_1, ..., f_N} : Pf - P_n f > ε ] ≤ Σ_{i=1}^N P[Pf_i - P_n f_i > ε] ≤ N exp(-2nε²).


Hence, for G = {g_1, ..., g_N}, for all δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + √( (log N + log(1/δ)) / (2n) ).

This is an error bound. Indeed, if we know that our algorithm picks functions from G, we can apply this result to g_n itself.

Notice that the main difference with Hoeffding's inequality is the extra log N term on the right hand side. This is the term which accounts for the fact that we want N bounds to hold simultaneously. Another interpretation of this term is as the number of bits one would require to specify one function in G. It turns out that this kind of coding interpretation of generalization bounds is often possible and can be used to obtain error estimates [16].
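The finite-class bound is easy to evaluate and to stress-test numerically. The sketch below assumes a class of N fixed functions represented only by their (randomly chosen, then frozen) error probabilities Pf_j, and checks how often sup_j (Pf_j - P_n f_j) exceeds √((log N + log(1/δ))/(2n)); everything in it is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(3)

n, N, delta, trials = 200, 50, 0.05, 2000

# N fixed functions, each represented by its error probability Pf_j (drawn once, then fixed).
Pf = rng.uniform(0.1, 0.5, size=N)

bound = np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))

# For each trial, draw a sample and compute sup_j (Pf_j - P_n f_j).
failures = 0
for _ in range(trials):
    losses = rng.random((N, n)) < Pf[:, None]     # losses[j, i] = f_j(Z_i)
    sup_dev = np.max(Pf - losses.mean(axis=1))
    failures += sup_dev > bound

print("bound B(n, N, delta) = %.4f" % bound)
print("fraction of trials with sup_j (Pf_j - P_n f_j) > bound: %.4f (should be <= %.2f)"
      % (failures / trials, delta))

The failure frequency stays below δ, usually far below it, because the union bound treats the N functions as if their bad events never overlapped.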

3.5 Estimation Error

Using the same idea as before, and with no additional effort, we can also get a bound on the estimation error. We start from the inequality

R(g*) ≤ R_n(g*) + sup_{g∈G} (R(g) - R_n(g)),

which we combine with (4) and with the fact that, since g_n minimizes the empirical risk in G,

R_n(g*) - R_n(g_n) ≥ 0.

Thus we obtain

R(g_n) = R(g_n) - R(g*) + R(g*)
       ≤ R_n(g*) - R_n(g_n) + R(g_n) - R(g*) + R(g*)
       ≤ 2 sup_{g∈G} |R(g) - R_n(g)| + R(g*).

We obtain that with probability at least 1 - δ,

R(g_n) ≤ R(g*) + 2 √( (log N + log(2/δ)) / (2n) ).

We notice that in the right hand side both terms depend on the size of the class G. If this size increases, the first term will decrease, while the second will increase.

3.6 Summary and Perspective

At this point, we can summarize what we have exposed so far.

– Inference requires to put assumptions on the process generating the data (data sampled i.i.d. from an unknown P); generalization requires knowledge (e.g. restriction, structure, or prior).


– The error bounds are valid with respect to the repeated sampling of training sets.
– For a fixed function g, for most of the samples,

R(g) - R_n(g) ≈ 1/√n.

– For most of the samples, if |G| = N,

sup_{g∈G} (R(g) - R_n(g)) ≈ √(log N / n).

The extra variability comes from the fact that the chosen g_n changes with the data.

So the result we have obtained so far is that, with high probability, for a finite class of size N,

sup_{g∈G} (R(g) - R_n(g)) ≤ √( (log N + log(1/δ)) / (2n) ).

There are several things that can be improved:

– Hoeffding's inequality only uses the boundedness of the functions, not their variance.
– The union bound is as bad as if all the functions in the class were independent (i.e. as if f_1(Z) and f_2(Z) were independent).
– The supremum over G of R(g) - R_n(g) is not necessarily what the algorithm would choose, so that upper bounding R(g_n) - R_n(g_n) by the supremum might be loose.

4 Infinite Case: Vapnik-Chervonenkis Theory

In this section we show how to extend the previous results to the case where the class G is infinite. This requires, in the non-countable case, the introduction of tools from Vapnik-Chervonenkis Theory.

4.1 Refined Union Bound and Countable Case

We first start with a simple refinement of the union bound that allows to extend the previous results to the (countably) infinite case.

Recall that by Hoeffding's inequality, for each f ∈ F, for each δ > 0 (possibly depending on f, which we write δ(f)),

P[ Pf - P_n f > √( log(1/δ(f)) / (2n) ) ] ≤ δ(f).


Hence, if we have a countable set F, the union bound immediately yields

P[ ∃f ∈ F : Pf - P_n f > √( log(1/δ(f)) / (2n) ) ] ≤ Σ_{f∈F} δ(f).

Choosing δ(f) = δ p(f) with Σ_{f∈F} p(f) = 1, this makes the right-hand side equal to δ and we get the following result: with probability at least 1 - δ,

∀f ∈ F, Pf ≤ P_n f + √( (log(1/p(f)) + log(1/δ)) / (2n) ).

We notice that if F is finite (with size N), taking a uniform p gives the log N as before.

Using this approach, it is possible to put knowledge about the algorithm into p(f), but p should be chosen before seeing the data, so it is not possible to 'cheat' by setting all the weight to the function returned by the algorithm after seeing the data (which would give the smallest possible bound). But, in general, if p is well-chosen, the bound will have a small value. Hence, the bound can be improved if one knows ahead of time the functions that the algorithm is likely to pick (i.e. knowledge improves the bound).
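As a small sketch of this refined union bound, assume a countable class indexed by k = 1, 2, ... and a geometric prior p(f_k) = 2^{-k} (both are illustrative assumptions); the resulting bound √((log(1/p(f)) + log(1/δ))/(2n)) degrades only linearly in k.

import numpy as np

n, delta = 1000, 0.05

# Countable class indexed by k = 1, 2, 3, ...; geometric prior p(f_k) = 2**(-k) (sums to 1).
def bound(k):
    log_inv_p = k * np.log(2.0)                  # log(1/p(f_k))
    return np.sqrt((log_inv_p + np.log(1 / delta)) / (2 * n))

for k in [1, 5, 20, 100]:
    print("k=%-4d  p(f_k)=2^-%-4d  bound = %.4f" % (k, k, bound(k)))

If the algorithm tends to return functions with small index k, the bound stays close to the single-function Hoeffding bound; weight placed on functions the algorithm never returns is simply wasted.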

4.2 General Case

When the set G is uncountable, the previous approach does not directly work. The general idea is to look at the function class 'projected' on the sample. More precisely, given a sample z_1, ..., z_n, we consider

F_{z_1,...,z_n} = {(f(z_1), ..., f(z_n)) : f ∈ F}.

The size of this set is the number of possible ways in which the data (z_1, ..., z_n) can be classified. Since the functions f can only take two values, this set will always be finite, no matter how big F is.

Definition 1 (Growth function). The growth function is the maximum number of ways into which n points can be classified by the function class:

S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}|.

We have defined the growth function in terms of the loss class F, but we can do the same with the initial class G and notice that S_F(n) = S_G(n). It turns out that this growth function can be used as a measure of the 'size' of a class of functions, as demonstrated by the following result.

Theorem 2 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (log S_G(2n) + log(2/δ)) / n ).


Notice that, in the finite case where |G| = N, we have S_G(n) ≤ N, so that this bound is always better than the one we had before (except for the constants). But the problem now becomes one of computing S_G(n).

4.3 VC Dimension

Since g ∈ {-1, 1}, it is clear that S_G(n) ≤ 2^n. If S_G(n) = 2^n, there is a set of size n such that the class of functions can generate any classification on these points (we say that G shatters the set).

Definition 2 (VC dimension). The VC dimension of a class G is the largest n such that

S_G(n) = 2^n.

In other words, the VC dimension of a class G is the size of the largest set that it can shatter.

In order to illustrate this definition, we give some examples. The first one is the set of halfplanes in ℝ^d (see Figure 3). In this case, as depicted for the case d = 2, one can shatter a set of d + 1 points but no set of d + 2 points, which means that the VC dimension is d + 1.

Fig. 3. Computing the VC dimension of hyperplanes in dimension 2: a set of 3 points can be shattered, but no set of four points.

It is interesting to notice that the number of parameters needed to define halfspaces in ℝ^d is d, so that a natural question is whether the VC dimension is related to the number of parameters of the function class. The next example, depicted in Figure 4, is a family of functions with one parameter only,

{x ↦ sgn(sin(tx)) : t ∈ ℝ},

which actually has infinite VC dimension (this is an exercise left to the reader).
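Growth functions can be computed by brute force for very simple classes. The sketch below does this for one-dimensional threshold classifiers g_t(x) = sign(x - t), a class not discussed in the text but convenient because its projections can be enumerated exhaustively; it finds S(n) = n + 1, hence VC dimension 1.

import numpy as np

def growth_function_thresholds(n):
    """S_G(n) for G = {x -> sign(x - t)}, computed by enumerating labelings on n points."""
    x = np.sort(np.random.default_rng(4).uniform(0, 1, size=n))   # n distinct points
    # candidate thresholds: below all points, between consecutive points, above all points
    ts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    labelings = {tuple(np.where(x > t, 1, -1)) for t in ts}
    return len(labelings)

for n in range(1, 8):
    S = growth_function_thresholds(n)
    print("n=%d  S(n)=%d  2^n=%d  shattered: %s" % (n, S, 2 ** n, S == 2 ** n))

For richer classes such as halfplanes the same enumeration works in principle, with the dichotomy check replaced by a linear-separability test.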


Fig. 4. VC dimension of sinusoids.

It remains to show how the notion of VC dimension can bring a solution to the problem of computing the growth function. Indeed, at first glance, if we know that a class G has VC dimension h, it entails that S_G(n) = 2^n for all n ≤ h, and S_G(n) < 2^n otherwise. This seems of little use, but actually, an intriguing phenomenon occurs for n ≥ h, as depicted in Figure 5: the growth function stops being exponential and becomes polynomial in n.


This is the content of the following lemma.

Lemma 1 (Vapnik and Chervonenkis, Sauer, Shelah). Let G be a class of functions with finite VC dimension h. Then for all n ∈ ℕ,

S_G(n) ≤ Σ_{i=0}^h (n choose i),

and for all n ≥ h,

S_G(n) ≤ (en/h)^h.

Using this lemma along with Theorem 2, we immediately obtain that if G has VC dimension h, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (h log(2en/h) + log(2/δ)) / n ).

What is important to recall from this result is that the difference between the true and empirical risk is at most of order

√( h log n / n ).

An interpretation of VC dimension and growth functions is that they measure the effective size of the class, that is, the size of the projection of the class onto finite samples. In addition, this measure does not just 'count' the number of functions in the class but depends on the geometry of the class (rather, of its projections). Finally, the finiteness of the VC dimension ensures that the empirical risk will converge uniformly over the class to the true risk.

4.4 Symmetrization

We now indicate how to prove Theorem 2. The key ingredient in the proof is the so-called symmetrization lemma. The idea is to replace the true risk by an estimate computed on an independent set of data. This is of course a mathematical technique and does not mean one needs to have more data to be able to apply the result. The extra data set is usually called a 'virtual' or 'ghost' sample.

We will denote by Z'_1, ..., Z'_n an independent (ghost) sample and by P'_n the corresponding empirical measure.

Lemma 2 (Symmetrization). For any t > 0 such that nt² ≥ 2,

P[ sup_{f∈F} (P - P_n)f ≥ t ] ≤ 2 P[ sup_{f∈F} (P'_n - P_n)f ≥ t/2 ].

Proof. Let f_n be the function achieving the supremum (note that it depends on Z_1, ..., Z_n). One has (with the products below expressing the conjunction of two events)

1_{(P-P_n)f_n > t} 1_{(P-P'_n)f_n < t/2} = 1_{(P-P_n)f_n > t} 1_{(P'_n-P)f_n > -t/2} ≤ 1_{(P'_n-P_n)f_n > t/2}.

Taking expectations with respect to the second sample gives

1_{(P-P_n)f_n > t} P'[ (P - P'_n)f_n < t/2 ] ≤ P'[ (P'_n - P_n)f_n > t/2 ].


By Chebyshev's inequality (see Appendix A),

P'[ (P - P'_n)f_n ≥ t/2 ] ≤ 4 Var f_n / (nt²) ≤ 1/(nt²).

Indeed, a random variable with range in [0, 1] has variance less than 1/4. Hence

1_{(P-P_n)f_n > t} (1 - 1/(nt²)) ≤ P'[ (P'_n - P_n)f_n > t/2 ].

Taking expectation with respect to the first sample and using nt² ≥ 2 gives the result.

This lemma allows to replace the expectation Pf by an empirical average over the ghost sample. As a result, the right hand side only depends on the projection of the class F on the double sample,

F_{Z_1,...,Z_n,Z'_1,...,Z'_n},

which contains finitely many different vectors. One can thus use the simple union bound that was presented before in the finite case. The other ingredient needed to obtain Theorem 2 is again Hoeffding's inequality, in the following form:

P[ P_n f - P'_n f > t ] ≤ e^{-nt²/2}.

We now just have to put the pieces together:

P[ sup_{f∈F} (P - P_n)f ≥ t ]
≤ 2 P[ sup_{f∈F} (P'_n - P_n)f ≥ t/2 ]
= 2 P[ sup_{f ∈ F_{Z_1,...,Z_n,Z'_1,...,Z'_n}} (P'_n - P_n)f ≥ t/2 ]
≤ 2 S_F(2n) P[ (P'_n - P_n)f ≥ t/2 ]
≤ 2 S_F(2n) e^{-nt²/8}.

Using inversion finishes the proof of Theorem 2.

4.5 VC Entropy

One important aspect of the VC dimension is that it is distribution independent. Hence, it allows to get bounds that do not depend on the problem at hand: the same bound holds for any distribution. Although this may be seen as an advantage, it can also be a drawback since, as a result, the bound may be loose for most distributions.

We now show how to modify the proof above to get a distribution-dependent result. We use the following notation: N(F, z_1^n) = |F_{z_1,...,z_n}|.

Definition 3 (VC entropy). The (annealed) VC entropy is defined as

H_F(n) = log E[ N(F, Z_1^n) ].


Theorem 3. For any δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (H_F(2n) + log(2/δ)) / n ).

Proof. We again begin with the symmetrization lemma, so that we have to upper bound the quantity

I = P[ sup_{f ∈ F_{Z_1^n, Z'^n_1}} (P'_n - P_n)f ≥ t/2 ].

Let σ_1, ..., σ_n be n independent random variables such that P[σ_i = 1] = P[σ_i = -1] = 1/2 (they are called Rademacher variables). We notice that the quantities (P'_n - P_n)f and (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) have the same distribution, since changing one σ_i corresponds to exchanging Z_i and Z'_i. Hence we have

I = P[ sup_{f ∈ F_{Z_1^n, Z'^n_1}} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ],

and the union bound leads to

I ≤ E[ N(F, Z_1^n, Z'^n_1) max_f P_σ[ (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ] ].

Since σ_i (f(Z'_i) - f(Z_i)) ∈ [-1, 1], Hoeffding's inequality finally gives

I ≤ E[ N(F, Z, Z') ] e^{-nt²/8}.

The rest of the proof is as before.

5 Capacity Measures

We have seen so far three measures of capacity or size of classes of functions: the VC dimension and the growth function, both distribution independent, and the VC entropy, which depends on the distribution. Apart from the VC dimension, they are usually hard or impossible to compute. There are however other measures which not only may give sharper estimates, but also have properties that make their computation possible from the data only.


5.1 Covering Numbers

We start by endowing the function class G with the following (data-dependent) metric:

d_n(g, g') = (1/n) |{ i : g(X_i) ≠ g'(X_i), i = 1, ..., n }|.

This is the normalized Hamming distance of the 'projections' on the sample.

Given such a metric, we say that a set f_1, ..., f_N covers F at radius ε if

F ⊂ ∪_{i=1}^N B(f_i, ε).

We then define the covering numbers of F as follows.

Definition 4 (Covering number). The covering number of F at radius ε, with respect to d_n, denoted by N(F, ε, n), is the minimum size of a cover of radius ε.

Notice that it does not matter whether we apply this definition to the original class G or to the loss class F, since N(F, ε, n) = N(G, ε, n).

The covering numbers characterize the size of a function class as measured by the metric d_n. The rate of growth of the logarithm of N(G, ε, n), usually called the metric entropy, is related to the classical concept of vector dimension. Indeed, if G is a compact set in a d-dimensional Euclidean space, N(G, ε, n) ≈ ε^{-d}.

When the covering numbers are finite, it is possible to approximate the class G by a finite set of functions (which cover G). This again allows to use the finite union bound, provided we can relate the behavior of all functions in G to that of the functions in the cover. A typical result, which we provide without proof, is the following.

Theorem 4. For any t > 0,

P[ ∃g ∈ G : R(g) > R_n(g) + t ] ≤ 8 E[ N(G, t, n) ] e^{-nt²/128}.

Covering numbers can also be defined for classes of real-valued functions.
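Covering numbers of projected classes can be estimated directly from a sample. The sketch below assumes the same illustrative threshold class as earlier, projects it on a sample, and builds a greedy cover in the normalized Hamming distance d_n; the size of the greedy cover is an upper bound on N(F, ε, n).

import numpy as np

rng = np.random.default_rng(5)

# Sample of n points and the projection of a class on it: here thresholds on [0, 1].
n = 100
x = rng.uniform(0, 1, size=n)
thresholds = np.linspace(0, 1, 201)
projection = np.unique(np.stack([np.where(x > t, 1, 0) for t in thresholds]), axis=0)

def hamming(a, b):
    """Normalized Hamming distance d_n between two projected functions."""
    return np.mean(a != b)

def greedy_cover_size(points, eps):
    """Size of a greedy cover at radius eps (an upper bound on the covering number)."""
    remaining = list(points)
    centers = 0
    while remaining:
        c = remaining[0]
        remaining = [p for p in remaining if hamming(c, p) > eps]
        centers += 1
    return centers

for eps in [0.01, 0.05, 0.1, 0.25]:
    print("eps=%.2f  |projection|=%d  greedy cover size (>= N(F, eps, n)? no: upper bound): %d"
          % (eps, len(projection), greedy_cover_size(projection, eps)))

A greedy cover is generally not minimal, so its size only upper bounds the covering number, but its centers are pairwise more than ε apart, so it also gives a packing; in practice it shows how the metric entropy grows as ε shrinks.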

We now relate the covering numbers to the VC dimension. Notice that, because the functions in G can only take two values, for all ε > 0, N(G, ε, n) ≤ |G_{Z_1^n}| = N(G, Z_1^n). Hence the VC entropy corresponds to log covering numbers at minimal scale, which implies log N(G, ε, n) ≤ h log(en/h), but one can have a considerably better result.

Lemma 3 (Haussler). Let G be a class of VC dimension h. Then, for all ε > 0, all n, and any sample,

N(G, ε, n) ≤ C h (4e)^h ε^{-h},

for a universal constant C. The interest of this result is that the upper bound does not depend on the sample size n.

The covering number bound is a generalization of the VC entropy bound, where the scale is adapted to the error. It turns out that this result can be improved by considering all scales (see Section 5.2).

5.2 Rademacher Averages

Recall that we used in the proof of Theorem 3 Rademacher random variables, i.e. independent {-1, 1}-valued random variables with probability 1/2 of taking either value.


For a function f, we write R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i). We will denote by E_σ the expectation taken with respect to the Rademacher variables (i.e. conditionally to the data), while E will denote the expectation with respect to all the random variables (i.e. the data, the ghost sample and the Rademacher variables).

Definition 5 (Rademacher averages). For a class F of functions, the Rademacher average is defined as

R(F) = E[ sup_{f∈F} R_n f ],

and the conditional Rademacher average is defined as

R_n(F) = E_σ[ sup_{f∈F} R_n f ].

We now state the fundamental result involving Rademacher averages.

Theorem 5. For all δ > 0, with probability at least 1 - δ,

∀f ∈ F, Pf ≤ P_n f + 2 R(F) + √( log(1/δ) / (2n) ),

and also, with probability at least 1 - δ,

∀f ∈ F, Pf ≤ P_n f + 2 R_n(F) + √( 2 log(2/δ) / n ).

It is remarkable that one can obtain a bound (the second part of the theorem) which depends solely on the data.

The proof of the above result requires a powerful tool called a concentration inequality for empirical processes.

Actually, Hoeffding's inequality is a (simple) concentration inequality, in the sense that when n increases, the empirical average is concentrated around the expectation. It is possible to generalize this result to functions that depend on i.i.d. random variables, as shown in the theorem below.

Theorem 6 (McDiarmid [17]). Assume that for all i = 1, ..., n,

sup_{z_1,...,z_n, z'_i} | F(z_1, ..., z_i, ..., z_n) - F(z_1, ..., z'_i, ..., z_n) | ≤ c,

then for all ε > 0,

P[ |F - E[F]| > ε ] ≤ 2 exp( -2ε² / (nc²) ).

The meaning of this result is thus that, as soon as one has a function of n independent random variables which is such that its variation is bounded when one variable is modified, the function will satisfy a Hoeffding-like inequality.
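The bounded-differences assumption is easy to verify numerically for the simplest case, the empirical mean, which is also the kind of function McDiarmid's inequality is applied to in the proofs below; the check sketched here uses arbitrary data in [0, 1] and confirms the constant c = (b - a)/n.

import numpy as np

rng = np.random.default_rng(6)

n = 50
z = rng.uniform(0, 1, size=n)          # n i.i.d. variables with values in [a, b] = [0, 1]

def F(sample):
    """The function of the sample we apply McDiarmid to: its empirical mean."""
    return sample.mean()

# Replace each coordinate in turn by the worst-case value and record the change in F.
worst_change = 0.0
for i in range(n):
    for new_value in (0.0, 1.0):
        z_prime = z.copy()
        z_prime[i] = new_value
        worst_change = max(worst_change, abs(F(z) - F(z_prime)))

print("largest observed change of F when one Z_i is modified: %.4f" % worst_change)
print("bounded-differences constant c = (b - a)/n            : %.4f" % (1.0 / n))

The same kind of check, with the same constant 1/n, applies to sup_{f∈F}(Pf - P_n f), which is how McDiarmid's inequality enters the proof of Theorem 5.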


Proof of Theorem 5. The proof combines McDiarmid's inequality with a symmetrization argument, and finally relates the Rademacher average to the conditional one.

We first show that McDiarmid's inequality can be applied to sup_{f∈F} (Pf - P_n f). We denote temporarily by P^i_n the empirical measure obtained by modifying one element (e.g. Z_i is replaced by Z'_i) of the sample. It is easy to check that the following holds:

| sup_{f∈F} (Pf - P_n f) - sup_{f∈F} (Pf - P^i_n f) | ≤ sup_{f∈F} | P^i_n f - P_n f |.

Since f ∈ {0, 1}, we obtain

| P^i_n f - P_n f | = (1/n) | f(Z'_i) - f(Z_i) | ≤ 1/n,

and thus McDiarmid's inequality can be applied with c = 1/n. This concludes the first step of the proof.

We next prove the (first part of the) following symmetrization lemma.

Lemma 4. For any class F,

E[ sup_{f∈F} (Pf - P_n f) ] ≤ 2 E[ sup_{f∈F} R_n f ],

and

E[ sup_{f∈F} |Pf - P_n f| ] ≥ (1/2) E[ sup_{f∈F} R_n f ] - 1/(2√n).

Proof. We only prove the first part. We introduce a ghost sample and its corresponding measure P'_n. We successively use the fact that E[P'_n f] = Pf and that the supremum is a convex function (hence we can apply Jensen's inequality, see Appendix A):

E[ sup_{f∈F} (Pf - P_n f) ]
= E[ sup_{f∈F} E[ P'_n f - P_n f | Z_1, ..., Z_n ] ]
≤ E[ sup_{f∈F} (P'_n f - P_n f) ]
= E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ]
≤ E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(Z'_i) ] + E[ sup_{f∈F} (1/n) Σ_{i=1}^n (-σ_i) f(Z_i) ]
= 2 E[ sup_{f∈F} R_n f ],


where the third step uses the fact that f(Z'_i) - f(Z_i) and σ_i(f(Z'_i) - f(Z_i)) have the same distribution, and the last step uses the fact that σ_i f(Z_i) and -σ_i f(Z_i) have the same distribution.

The above already establishes the first part of Theorem 5. To obtain the second part, it remains to relate R(F) to R_n(F). It is easy to check that, as a function of the sample, R_n(F) satisfies McDiarmid's assumptions with c = 1/n. As a result, R(F) = E[R_n(F)] can be sharply estimated by R_n(F).

Loss Class and Initial Class. In order to make use of Theorem 5, we have to relate the Rademacher average of the loss class to that of the initial class. This can be done with the following derivation, where one uses the fact that σ_i and -σ_i Y_i have the same distribution:

R(F) = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i 1_{g(X_i)≠Y_i} ]
     = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i (1 - Y_i g(X_i))/2 ]
     = (1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n (-σ_i Y_i) g(X_i) ] = (1/2) R(G).

Notice that the same is valid for conditional Rademacher averages, so that we obtain that with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + R_n(G) + √( 2 log(2/δ) / n ).

Computing the Rademacher Averages. We now assess the difficulty of actually computing the Rademacher averages. We write the following:

(1/2) R_n(G) = E_σ[ sup_{g∈G} (1/(2n)) Σ_{i=1}^n σ_i g(X_i) ]
             = E_σ[ sup_{g∈G} (1/n) Σ_{i=1}^n (1 + σ_i g(X_i))/2 ] - 1/2
             = 1/2 - E_σ[ inf_{g∈G} (1/n) Σ_{i=1}^n 1_{g(X_i) ≠ σ_i} ]
             = 1/2 - E_σ[ inf_{g∈G} R_n(g, σ) ],

where R_n(g, σ) denotes the empirical risk of g when the labels are the σ_i.


This indicates that, given a sample and a choice of the random variables σ_1, ..., σ_n, computing R_n(G) is not harder than computing the empirical risk minimizer in G. Indeed, the procedure would be to generate the σ_i randomly and minimize the empirical error in G with respect to the labels σ_i.
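This recipe translates directly into code. The sketch below estimates the conditional Rademacher average of the small illustrative threshold class used earlier by Monte Carlo over σ, doing exactly what the text describes: draw random labels, 'fit the noise' by searching the class, and average; the class, data and sample sizes are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(7)

n = 100
X = rng.uniform(-1, 1, size=n)

# Small class G: threshold classifiers g_{t,s}(x) = s * sign(x - t), stored via their projections.
thresholds = np.linspace(-1, 1, 21)
G = [s * np.where(X > t, 1, -1) for t in thresholds for s in (-1, 1)]

def conditional_rademacher(num_draws=2000):
    """Monte Carlo estimate of R_n(G) = E_sigma sup_g (1/n) sum_i sigma_i g(X_i)."""
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1, 1], size=n)
        total += max(np.mean(sigma * g) for g in G)
    return total / num_draws

print("estimated R_n(G) = %.4f" % conditional_rademacher())
print("finite-class bound sqrt(2 log N / n) = %.4f" % np.sqrt(2 * np.log(len(G)) / n))

The estimate stays below √(2 log N / n), the finite-class bound recalled just below, and it would approach 1/2 if the class were rich enough to fit arbitrary labelings.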

An advantage of rewriting R_n(G) as above is that it gives an intuition of what it actually measures: it measures how much the class G can fit random noise. If the class G is very large, there will always be a function which can perfectly fit the σ_i, and then R_n(G) = 1/2, so that there is no hope of uniform convergence to zero of the difference between true and empirical risks.

For a finite set G with |G| = N, one can show that

R_n(G) ≤ √( 2 log N / n ),

where we again see the logarithmic factor log N. A consequence of this is that, by considering the projection on the sample of a class G with VC dimension h, and using Lemma 1, we have

R(G) ≤ √( 2 h log(en/h) / n ).

This result, along with Theorem 5, allows to recover the Vapnik-Chervonenkis bound with a concentration-based proof.

Although the benefit of using concentration may not be entirely clear at this point, let us just mention that one can actually improve the dependence on n of the above bound. This is based on the so-called chaining technique. The idea is to use covering numbers at all scales in order to capture the geometry of the class in a better way than the VC entropy does.

One has the following result, called Dudley's entropy bound:

R_n(F) ≤ (C/√n) ∫_0^∞ √( log N(F, t, n) ) dt.

As a consequence, along with Haussler's upper bound, we can get the following result:

R_n(F) ≤ C √( h/n ).

We can thus, with this approach, remove the unnecessary log n factor of the VC bound.

6 Advanced Topics

In this section we point out several ways in which the results presented so far can be improved. The main source of improvement actually comes, as mentioned earlier, from the fact that the Hoeffding and McDiarmid inequalities do not make use of the variance of the functions.


6.1 Binomial Tails

We recall that the functions we consider are binary valued. So, if we consider a fixed function f, the distribution of P_n f is actually a binomial law of parameters Pf and n (since we are summing n i.i.d. random variables f(Z_i) which can either be 0 or 1 and are equal to 1 with probability E[f(Z_i)] = Pf). Denoting p = Pf, we can have an exact expression for the deviations of P_n f from Pf:

P[ Pf - P_n f ≥ t ] = Σ_{k=0}^{⌊n(p-t)⌋} (n choose k) p^k (1-p)^{n-k}.

Since this expression is not easy to manipulate, we have used an upper bound provided by Hoeffding's inequality. However, there exist other (sharper) upper bounds. The following quantities are all upper bounds on P[Pf - P_n f ≥ t]:

( (p/(p-t))^{p-t} ((1-p)/(1-p+t))^{1-p+t} )^n          (exponential)
e^{-np((1-t/p) log(1-t/p) + t/p)}                       (Bennett)
e^{-nt²/(2p(1-p) + 2t/3)}                               (Bernstein)
e^{-2nt²}                                               (Hoeffding)

Examining the above bounds (and using inversion), we can say that, roughly speaking, the small deviations of Pf - P_n f have a Gaussian behavior of the form exp(-nt²/(2p(1-p))) (i.e. Gaussian with variance p(1-p)), while the large deviations have a Poisson behavior of the form exp(-3nt/2).

So the tails are heavier than Gaussian, and Hoeffding's inequality consists in upper bounding the tails with a Gaussian with maximum variance, hence the term exp(-2nt²).

Each function f ∈ F has a different variance Pf(1 - Pf) ≤ Pf. Moreover, for each f ∈ F, by Bernstein's inequality, with probability at least 1 - δ,

Pf ≤ P_n f + √( 2 Pf log(1/δ) / n ) + 2 log(1/δ) / (3n).

The Gaussian part (the second term on the right hand side) dominates (for Pf not too small, or n large enough), and it depends on Pf. We thus want to combine Bernstein's inequality with the union bound and the symmetrization.
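To see how much is gained over Hoeffding when the variance is small, the sketch below compares, for Bernoulli(p) losses, the exact binomial tail P[Pf - P_n f ≥ t] with the Bernstein and Hoeffding bounds quoted above; p, n and t are arbitrary illustrative values, and the exact tail is summed directly rather than taken from a statistics library.

import numpy as np
from math import comb, exp

def binomial_tail(p, n, t):
    """Exact P[Pf - P_n f >= t] = P[Binomial(n, p) <= n(p - t)]."""
    k_max = int(np.floor(n * (p - t)))
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, k_max + 1))

def hoeffding(p, n, t):
    return exp(-2 * n * t**2)

def bernstein(p, n, t):
    return exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))

p, n = 0.1, 1000
for t in [0.01, 0.02, 0.05]:
    print("t=%.2f  exact=%.2e  Bernstein=%.2e  Hoeffding=%.2e"
          % (t, binomial_tail(p, n, t), bernstein(p, n, t), hoeffding(p, n, t)))

For small p the Bernstein bound tracks the exact tail far more closely than Hoeffding, which is precisely the slack the normalized bounds of the next subsection exploit.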

6.2 Normalization

The idea is to consider the ratio

(Pf - P_n f) / √(Pf).

Here (since f ∈ {0, 1}), Var f ≤ P f² = P f.


The reason for considering this ratio is that after normalization the fluctuations are more 'uniform' in the class F. Hence the supremum in

sup_{f∈F} (Pf - P_n f) / √(Pf)

is not necessarily attained at functions with large variance, as was the case previously.

Moreover, we know that our goal is to find functions with small error Pf (hence small variance). The normalized supremum takes this into account.

We now state a result similar to Theorem 2 for the normalized supremum.

Theorem 7 (Vapnik-Chervonenkis [18]). For δ > 0, with probability at least 1 - δ,

∀f ∈ F, (Pf - P_n f) / √(Pf) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n ),

and also with probability at least 1 - δ,

∀f ∈ F, (P_n f - Pf) / √(P_n f) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n ).

Proof. We only give a sketch of the proof. The first step is a variation of the symmetrization lemma:

P[ sup_{f∈F} (Pf - P_n f)/√(Pf) ≥ t ] ≤ 2 P[ sup_{f∈F} (P'_n f - P_n f)/√((P_n f + P'_n f)/2) ≥ t ].

The second step consists in randomization (with Rademacher variables):

... = 2 E[ P_σ[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) / √((P_n f + P'_n f)/2) ≥ t ] ].

Finally, one uses a tail bound of Bernstein type.

Let us explore the consequences of this result. From the fact that for non-negative numbers A, B, C,

A ≤ B + C√A  ⟹  A ≤ B + C² + C√B,

we easily get, for example,

∀f ∈ F, Pf ≤ P_n f + 2 √( P_n f (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n.


In the ideal situation where there is no noise (i.e. Y = t(X) almost surely), and t ∈ G, denoting by g_n the empirical risk minimizer, we have R* = 0 and also R_n(g_n) = 0. In particular, when G is a class of VC dimension h, we obtain

R(g_n) = O( h log n / n ).

So, in a way, Theorem 7 allows to interpolate between the best case, where the rate of convergence is O(h log n / n), and the worst case, where the rate is O(√(h log n / n)) (it does not allow to remove the log n factor in this case).

It is also possible to derive from Theorem 7 relative error bounds for the minimizer of the empirical error. With probability at least 1 - δ,

R(g_n) ≤ R(g*) + 2 √( R(g*) (log S_G(2n) + log(4/δ)) / n ) + 4 (log S_G(2n) + log(4/δ)) / n.

We notice here that when R(g*) = 0 (i.e. t ∈ G and R* = 0), the rate is again of order 1/n, while as soon as R(g*) > 0, the rate is of order 1/√n. Therefore, it is not possible to obtain a rate with a power of n in between 1/2 and 1. The main reason is that the factor of the square root term, R(g*), is not the right quantity to use here since it does not vary with n. We will see later that one can have instead R(g_n) - R(g*) as a factor, which usually converges to zero as n increases. Unfortunately, Theorem 7 cannot be applied to functions of the type f - f* (which would be needed to have the mentioned factor), so we will need a refined approach.

6.3 Noise Conditions

The refinement we seek to obtain requires certain specific assumptions about the noise function s(x), the ideal case being when s(x) = 0 everywhere (which corresponds to R* = 0 and Y = t(X)). We now introduce quantities that measure how well-behaved the noise function is.

The situation is favorable when the regression function η(x) is not too close to 0, or at least not too often close to 0. Indeed, η(x) = 0 means that the noise is maximum at x (s(x) = 1/2) and that the label is completely undetermined (any prediction would yield an error with probability 1/2).

Definitions. There are two types of conditions.

Definition 6 (Massart's noise condition). For some c > 0, assume

|η(X)| > 1/c almost surely.


This condition implies that there is no region where the decision is completely random; the noise is bounded away from 1/2.

Definition 7 (Tsybakov's noise condition). Let α ∈ [0, 1]; assume that one of the following equivalent conditions is satisfied:

(i) ∃c > 0, ∀g ∈ {-1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) - R*)^α,
(ii) ∃c > 0, ∀A ⊂ X, ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α,
(iii) ∃B > 0, ∀t ≥ 0, P[ |η(X)| ≤ t ] ≤ B t^{α/(1-α)}.

Condition (iii) is probably the easiest to interpret: it means that η(x) is close to the critical value 0 with low probability.

We indicate how to prove that conditions (i), (ii) and (iii) are indeed equivalent.

(i) ⇒ (ii): It is easy to check that R(g) - R* = E[ |η(X)| 1_{g(X)η(X)≤0} ]; for each set A there is a function g such that 1_A = 1_{g(X)η(X)≤0}, and applying (i) to this g gives (ii).

(ii) ⇒ (iii): Let A = {x : |η(x)| ≤ t}. Then

P[ |η| ≤ t ] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α,

so that P[ |η| ≤ t ] ≤ c^{1/(1-α)} t^{α/(1-α)}.

(iii) ⇒ (i): We write

R(g) - R* = E[ |η(X)| 1_{g(X)η(X)≤0} ]
          ≥ t E[ 1_{gη≤0} 1_{|η|>t} ]
          = t ( E[1_{gη≤0}] - E[ 1_{gη≤0} 1_{|η|≤t} ] )
          ≥ t ( P[gη ≤ 0] - P[|η| ≤ t] )
          ≥ t ( P[gη ≤ 0] - B t^{α/(1-α)} ).

Taking t = ( (1-α) P[gη ≤ 0] / B )^{(1-α)/α} finally gives

P[ g(X)η(X) ≤ 0 ] ≤ B^{1-α} (1-α)^{-(1-α)} α^{-α} ( R(g) - R* )^α.

We notice that the parameter α has to be in [0, 1]. Indeed, one has the opposite inequality

R(g) - R* = E[ |η(X)| 1_{g(X)η(X)≤0} ] ≤ E[ 1_{g(X)η(X)≤0} ] = P[ g(X)η(X) ≤ 0 ],

which is incompatible with condition (i) if α > 1.

We also notice that when α = 0, Tsybakov's condition is void, and when α = 1, it is equivalent to Massart's condition.


6.4 Local Rademacher Averages

The local Rademacher average restricts the supremum appearing in the Rademacher average to functions with small variance, R(F, r) = E[ sup_{f∈F : Pf² ≤ r} R_n f ]; below, ⋆F denotes the star-hull {αf : f ∈ F, α ∈ [0, 1]} of the class. The reason for this definition is that, as we have seen before, the crucial ingredient to obtain better rates of convergence is to use the variance of the functions. Localizing the Rademacher average allows to focus on the part of the function class where the fast rate phenomenon occurs, that is, functions with small variance.

Next we introduce the concept of a sub-root function, a real-valued function with certain monotony properties.

Definition 9 (Sub-root function). A function ψ is sub-root if

(i) ψ is non-decreasing,
(ii) ψ is non-negative,
(iii) ψ(r)/√r is non-increasing.

An immediate consequence of this definition is the following result.

Lemma 5. A sub-root function is continuous and has a unique positive fixed point r*, i.e. a unique r* > 0 with ψ(r*) = r*.


It turns out that the local Rademacher average behaves like a sub-root function, and thus has a unique fixed point. This fixed point will turn out to be the key quantity in the relative error bounds.

Lemma 6. For any class of functions F,

r ↦ R_n(⋆F, r) is sub-root.

One legitimate question is whether taking the star-hull does not enlarge the class too much. One way to see what the effect is on the size of the class is to compare the metric entropy (log covering numbers) of F and of ⋆F. It is possible to see that the entropy increases only by a logarithmic factor, which is essentially negligible.

Result. We now state the main result involving local Rademacher averages and their fixed point.

Theorem 8. Let F be a class of bounded functions (e.g. f ∈ [-1, 1]) and let r* be the fixed point of R(⋆F, r). There exists a constant C > 0 such that with probability at least 1 - δ,

∀f ∈ F, Pf - P_n f ≤ C ( √(r* Var f) + (log(1/δ) + log log n)/n ).

If in addition the functions in F satisfy Var f ≤ c (Pf)^β, then one obtains that with probability at least 1 - δ,

∀f ∈ F, Pf ≤ C ( P_n f + (r*)^{1/(2-β)} + (log(1/δ) + log log n)/n ).

Proof. We only give the main steps of the proof.

1. The starting point is Talagrand's inequality for empirical processes, a generalization of McDiarmid's inequality of Bernstein type (i.e. which includes the variance). This inequality tells us that with high probability,

sup_{f∈F} (Pf - P_n f) ≤ E[ sup_{f∈F} (Pf - P_n f) ] + c √( sup_{f∈F} Var f / n ) + c'/n,

for some constants c, c'.

2. The second step consists in 'peeling' the class, that is, splitting the class into subclasses according to the variance of the functions:

F_k = { f : Var f ∈ [x^k, x^{k+1}) }, for some fixed x > 1.


3. We can then apply Talagrand's inequality to each of the subclasses separately, to get with high probability

sup_{f∈F_k} (Pf - P_n f) ≤ E[ sup_{f∈F_k} (Pf - P_n f) ] + c √( x Var f / n ) + c'/n.

4. Then the symmetrization lemma allows to introduce local Rademacher averages. We get that with high probability,

∀f ∈ F, Pf - P_n f ≤ 2 R(F, x Var f) + c √( x Var f / n ) + c'/n.

5. The sub-root property then enters: the local Rademacher average behaves like a square root function, since we can upper bound it by its value at the fixed point. With high probability,

Pf - P_n f ≤ 2 √( r* Var f ) + c √( x Var f / n ) + c'/n.

6. Finally, we use the relationship between variance and expectation,

Var f ≤ c (Pf)^β,

and solve the inequality in Pf to get the result.

We will not go into the details of how to apply the above result, but we give some remarks about its use.

An important example is the case where the class F is of finite VC dimension h. In that case, one has

R(F, r) ≤ C √( r h log n / n ),

so that r* ≤ C h log n / n. As a consequence, we obtain, under Tsybakov's condition, a rate of convergence of Pf_n to Pf* of O(1/n^{1/(2-α)}). It is important to note that in this case the rate of convergence of P_n f to Pf is O(1/√n). So we obtain a fast rate by looking at the relative error. These fast rates can be obtained provided t ∈ G (but it is not needed that R* = 0). This requirement can be removed if one uses structural risk minimization or regularization.

Another related result is that, as in the global case, one can obtain a bound with data-dependent (i.e. conditional) local Rademacher averages:

R_n(F, r) = E_σ[ sup_{f∈F : Pf² ≤ r} R_n f ].

The result is the same as before (with different constants) under the same conditions as in Theorem 8: with probability at least 1 - δ,

Pf ≤ C ( P_n f + (r_n*)^{1/(2-β)} + (log(1/δ) + log log n)/n ),


where r_n* is the fixed point of a sub-root upper bound of R_n(F, r).

Hence, we can get improved rates when the noise is well-behaved, and these rates interpolate between n^{-1/2} and n^{-1}. However, it is not in general possible to estimate the parameters (c and α) entering the noise conditions, but we will not discuss this issue further here. Another point is that although the capacity measure that we use seems 'local', it does depend on all the functions in the class, but each of them is implicitly appropriately rescaled. Indeed, in R(⋆F, r), each function f ∈ F with Pf² ≥ r is considered at scale r/Pf².

Bibliographical remarks. Hoeffding's inequality appears in [19]. For a proof of the contraction principle we refer to Ledoux and Talagrand [20].

The Vapnik-Chervonenkis-Sauer-Shelah lemma was proved independently by Sauer [21], Shelah [22], and Vapnik and Chervonenkis [18]. For related combinatorial results we refer to Alesker [23], Alon, Ben-David, Cesa-Bianchi, and Haussler [24], and Cesa-Bianchi and Haussler [25].


The use of Rademacher averages in classification was first promoted by Koltchinskii.


B No Free Lunch

We can now give a formal definition of consistency and state the core results about the impossibility of universally good algorithms.

Definition 11 (Consistency). An algorithm is consistent if for any probability measure P,

lim_{n→∞} R(g_n) = R* almost surely.

It is important to understand the reasons that make possible the existence of consistent algorithms. In the case where the input space X is countable, things are somehow easy, since even if there is no relationship at all between inputs and outputs, by repeatedly sampling data independently from P one will get to see an increasing number of different inputs, which will eventually converge to all the inputs. So, in the countable case, an algorithm which would simply 'learn by heart' (i.e. make a majority vote when the instance has been seen before, and produce an arbitrary prediction otherwise) would be consistent.

In the case where X is not countable (e.g. X = ℝ), things are more subtle. Indeed, in that case there is a seemingly innocent assumption that becomes crucial: to be able to define a probability measure P on X, one needs a σ-algebra on that space, which is typically the Borel σ-algebra. So the hidden assumption is that P is a Borel measure. This means that the topology of X plays a role here, and thus the target function t will be Borel measurable. In a sense this guarantees that it is possible to approximate t from its value (or approximate value) at a finite number of points. The algorithms that will achieve consistency are thus those which use the topology, in the sense of 'generalizing' the observed values to neighborhoods (e.g. local classifiers). In a way, the measurability of t is one of the crudest notions of smoothness of functions.

We now cite two important results. The first one tells that for a fixed sample size, one can construct arbitrarily bad problems for a given algorithm.

Theorem 9 (No Free Lunch, see e.g. [4]). For any algorithm, any n and any ε > 0, there exists a distribution P such that R* = 0 and

P[ R(g_n) ≥ 1/2 - ε ] = 1.

The second result is more subtle and indicates that, given an algorithm, one can construct a problem for which this algorithm will converge as slowly as one wishes.

Theorem 10 (No Free Lunch at All, see e.g. [4]). For any algorithm, and any sequence (a_n) that converges to 0, there exists a probability distribution P such that R* = 0 and

R(g_n) ≥ a_n.

In the above theorem, the 'bad' probability measure is constructed on a countable set (where the outputs are not related at all to the inputs, so that no generalization is possible), and is such that the rate at which one gets to see new inputs is as slow as the convergence of a_n.


Finally we mention other notions of consistency.

Definition 12 (VC consistency of ERM). The ERM algorithm is consistent if for any probability measure P,

R(g_n) → R(g*) in probability,

and

R_n(g_n) → R(g*) in probability.

Definition 13 (VC non-trivial consistency of ERM). The ERM algorithm is non-trivially consistent for the set G and the probability distribution P if for any c ∈ ℝ,

inf_{f ∈ F : Pf > c} P_n f → inf_{f ∈ F : Pf > c} Pf in probability.

References

1. Vapnik, V.: Statistical Learning Theory. John Wiley, New York (1998)
2. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (1999)
3. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International, Belmont, CA (1984)
4. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996)
5. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley, New York (1973)
6. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, New York (1972)
7. Kearns, M., Vazirani, U.: An Introduction to Computational Learning Theory. MIT Press, Cambridge, Massachusetts (1994)
8. Kulkarni, S., Lugosi, G., Venkatesh, S.: Learning pattern classification - a survey. IEEE Transactions on Information Theory 44 (1998) 2178-2206. Information Theory: 1948-1998, commemorative special issue
9. Lugosi, G.: Pattern classification and learning theory. In Györfi, L., ed.: Principles of Nonparametric Learning. Springer, Vienna (2002)
10. McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York (1992)
11. Mendelson, S.: A few notes on statistical learning theory. In Mendelson, S., Smola, A., eds.: Advanced Lectures in Machine Learning. LNCS 2600, Springer (2003) 1-40
12. Natarajan, B.: Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA (1991)
13. Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982)
14. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)


16. von Luxburg, U., Bousquet, O., Schölkopf, B.: A compression approach to support vector model selection. Journal of Machine Learning Research 5 (2004) 293-323
17. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics 1989. Cambridge University Press, Cambridge (1989) 148-188
18. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (1971) 264-280
19. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (1963) 13-30


39. Vapnik, V., Chervonenkis, A.: Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications 26 (1981) 821-832
40. Assouad, P.: Densité et dimension. Annales de l'Institut Fourier 33 (1983) 233-282
41. Cover, T.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers 14 (1965)
42. Dudley, R.: Balls in R^k do not cut all subsets of k + 2 points. Advances in Mathematics 31 (3) (1979) 306-308
43. Goldberg, P., Jerrum, M.: Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers. Machine Learning 18 (1995)
44. Dudley, R.: Some special Vapnik-Chervonenkis classes. Discrete Mathematics 33 (1981) 313-318