
    Introduction to Statistical Learning Theory

Olivier Bousquet¹, Stéphane Boucheron², and Gábor Lugosi³

¹ Max-Planck Institute for Biological Cybernetics
Spemannstr. 38, D-72076 Tübingen, Germany
olivier.bousquet@m4x.org
home page: http://www.kyb.mpg.de/~bousquet

² Université de Paris-Sud, Laboratoire d'Informatique
Bâtiment 490, F-91405 Orsay Cedex, France
stephane.boucheron@lri.fr
home page: http://www.lri.fr/~boucheron

³ Department of Economics, Pompeu Fabra University
Ramon Trias Fargas 25-27, Barcelona, Spain


1. Observe a phenomenon.
2. Construct a model of that phenomenon.
3. Make predictions using this model.

Of course, this definition is very general and could be taken more or less as the goal of Natural Sciences. The goal of Machine Learning is to actually automate this process, and the goal of Learning Theory is to formalize it.

In this tutorial we consider a special case of the above process, which is the supervised learning framework for pattern recognition. In this framework, the data consists of instance-label pairs, where the label is either +1 or -1. Given a set of such pairs, a learning algorithm constructs a function mapping instances to labels. This function should be such that it makes few mistakes when predicting the label of unseen instances.

Of course, given some training data, it is always possible to build a function that fits exactly the data. But, in the presence of noise, this may not be the best thing to do, as it would lead to a poor performance on unseen instances (this is usually referred to as overfitting). The general idea behind the design of learning algorithms is thus to look for regularities in the observed phenomenon (i.e. the training data) that can be generalized from the observed past to the future.

Although it is tempting to minimize the empirical risk R_n(g) (the fraction of training errors of g), it would be unreasonable to look for the function minimizing R_n(g) among all possible functions. Indeed, when the input space is infinite, one can always construct a function g_n which perfectly predicts the labels of the training data (i.e. g_n(X_i) = Y_i, and R_n(g_n) = 0), but behaves on the other points as the opposite of the target function t, i.e. g_n(X) = -Y, so that R(g_n) = 1. So one would have minimum empirical risk but maximum risk.

It is thus necessary to prevent this overfitting situation. There are essentially two ways to do this (which can be combined): the first one is to restrict the class of functions in which the minimization is performed, and the second is to modify the criterion to be minimized (e.g. adding a penalty for 'complicated' functions).

Empirical Risk Minimization. This algorithm is one of the most straightforward, yet it is usually efficient. The idea is to choose a model G of possible functions and to minimize the empirical risk in that model:

g_n = arg min_{g ∈ G} R_n(g).

Of course, this will work best when the target function belongs to G. However, it is rare to be able to make such an assumption, so one may want to enlarge the model as much as possible, while preventing overfitting.
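To make the ERM principle concrete, here is a minimal sketch, assuming a toy one-dimensional data set and a small finite model G of threshold classifiers; the class, the data and all names below are illustrative choices, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: X in R, labels Y in {-1, +1} generated by a noisy threshold at 0.
n = 200
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.where(X > 0.0, 1, -1)
noise = rng.random(n) < 0.1          # flip 10% of the labels
Y[noise] *= -1

# Finite model G: threshold classifiers g_{t,s}(x) = s * sign(x - t).
thresholds = np.linspace(-1.0, 1.0, 41)
signs = [-1, 1]

def empirical_risk(t, s):
    """R_n(g) = fraction of training points misclassified by g_{t,s}."""
    pred = s * np.sign(X - t)
    pred[pred == 0] = s              # break ties arbitrarily
    return np.mean(pred != Y)

# ERM: g_n = argmin_{g in G} R_n(g), here by exhaustive search over the finite model.
best = min(((t, s) for t in thresholds for s in signs),
           key=lambda p: empirical_risk(*p))
print("ERM choice: threshold=%.2f sign=%+d  empirical risk=%.3f"
      % (best[0], best[1], empirical_risk(*best)))

With a finite model the minimization is an exhaustive search; the statistical question addressed in the rest of the tutorial is how far the empirical risk of the selected g_n can be from its true risk.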

Structural Risk Minimization. The idea here is to choose an infinite sequence {G_d : d = 1, 2, ...} of models of increasing size and to minimize the empirical risk in each model with an added penalty for the size of the model:

g_n = arg min_{g ∈ G_d, d ∈ ℕ} R_n(g) + pen(d, n).

The penalty pen(d, n) gives preference to models where estimation error is small and measures the size or capacity of the model.

Regularization. Another approach, usually easier to implement, consists in choosing a large model G (possibly dense in the continuous functions, for example) and defining on G a regularizer, typically a norm ‖g‖. Then one has to minimize the regularized empirical risk:

g_n = arg min_{g ∈ G} R_n(g) + λ ‖g‖².

(Footnote to the overfitting example above: strictly speaking, such a construction is only possible if the probability distribution satisfies some mild conditions (e.g. has no atoms). Otherwise, it may not be possible to achieve R(g_n) = 1, but even in this case, provided the support of P contains infinitely many points, a similar phenomenon occurs.)


Compared to SRM, there is here a free parameter λ, called the regularization parameter, which allows to choose the right trade-off between fit and complexity. Tuning λ is usually a hard problem and most often one uses extra validation data for this task.

Most existing (and successful) methods can be thought of as regularization methods.
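Since the 0-1 empirical risk is hard to minimize directly, the sketch below illustrates regularized empirical risk minimization with a squared-loss surrogate and a closed-form ridge-style solution; the surrogate loss, the synthetic data and the parameter grid are assumptions made only for this illustration.

import numpy as np

rng = np.random.default_rng(1)

# Toy data: X in R^d, Y in {-1, +1}.
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = np.sign(X @ w_true + 0.3 * rng.normal(size=n))

def regularized_fit(lam):
    """argmin_w (1/n) * sum_i (Y_i - <w, X_i>)^2 + lam * ||w||^2 (closed form)."""
    A = X.T @ X / n + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ Y / n)

for lam in [1e-3, 1e-1, 10.0]:
    w = regularized_fit(lam)
    train_err = np.mean(np.sign(X @ w) != Y)
    print("lambda=%-6g  ||w||=%.2f  training 0-1 risk=%.3f"
          % (lam, np.linalg.norm(w), train_err))

Increasing λ shrinks ‖w‖ and typically increases the training error, which is exactly the fit/complexity trade-off controlled by the regularization parameter.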

Normalized Regularization. There are other possible approaches when the regularizer can, in some sense, be 'normalized', i.e. when it corresponds to some probability distribution over G.

Given a probability distribution π defined on G (usually called a prior), one can use as a regularizer -log π(g) (this is fine when G is countable; in the continuous case one has to consider the density associated to π, and we omit these details). Reciprocally, from a regularizer of the form ‖g‖², if there exists a measure µ on G such that ∫ e^{-λ‖g‖²} dµ(g) < ∞ for some λ > 0, then one can construct a prior corresponding to this regularizer. For example, if G is the set of hyperplanes in ℝ^d going through the origin, G can be identified with ℝ^d and, taking µ as the Lebesgue measure, it is possible to go from the Euclidean norm regularizer to a spherical Gaussian measure on ℝ^d as a prior (generalization to infinite dimensional Hilbert spaces can also be done, but it requires more care; one can for example establish a correspondence between the norm of a reproducing kernel Hilbert space and a Gaussian process prior whose covariance function is the kernel of this space).

This type of normalized regularizer, or prior, can be used to construct another probability distribution on G (usually called the posterior), as

π̂(g) = e^{-γ R_n(g)} π(g) / Z(γ),

where γ ≥ 0 is a free parameter and Z(γ) is a normalization factor.

There are several ways in which this can be used. If we take the function maximizing it, we recover regularization, as

arg max_g π̂(g) = arg min_g R_n(g) - (1/γ) log π(g),

where the regularizer is -(1/γ) log π(g); note that minimizing γ R_n(g) - log π(g) is equivalent to minimizing R_n(g) - (1/γ) log π(g).

Also, π̂ can be used to randomize the predictions. In that case, before computing the predicted label for an input x, one samples a function g according to π̂ and outputs g(x). This procedure is usually called Gibbs classification.

Another way in which the distribution π̂ constructed above can be used is by taking the expected prediction of the functions in G:

g_n(x) = sgn(E_π̂[g(x)]).


This is typically called Bayesian averaging.

At this point we have to insist again on the fact that the choice of the class G and of the associated regularizer or prior has to come from a priori knowledge about the task at hand, and there is no universally best choice.

2.2 Bounds

We have presented the framework of the theory and the type of algorithms that it studies; we now introduce the kind of results that it aims at. The overall goal is to characterize the risk that some algorithm may have in a given situation. More precisely, a learning algorithm takes as input the data (X_1, Y_1), ..., (X_n, Y_n) and produces a function g_n which depends on this data. We want to estimate the risk of g_n. However, R(g_n) is a random variable (since it depends on the data) and it cannot be computed from the data (since it also depends on the unknown P). Estimates of R(g_n) thus usually take the form of probabilistic bounds.

Notice that when the algorithm chooses its output from a model G, it is possible, by introducing the best function g* in G, with R(g*) = inf_{g∈G} R(g), to write

R(g_n) - R* = [R(g*) - R*] + [R(g_n) - R(g*)].

The first term on the right hand side is usually called the approximation error, and measures how well functions in G can approach the target (it would be zero if t ∈ G). The second term, called the estimation error, is a random quantity (it depends on the data) and measures how close g_n is to the best possible choice in G.

Estimating the approximation error is usually hard since it requires knowledge about the target. Classically, in Statistical Learning Theory it is preferable to avoid making specific assumptions about the target (such as its belonging to some model); the assumptions are rather on the value of R*, or on the noise function s.

It is also known that for any (consistent) algorithm, the rate of convergence to zero of the approximation error can be arbitrarily slow if one does not make assumptions about the regularity of the target, while the rate of convergence of the estimation error can be computed without any such assumption. We will thus focus on the estimation error. (For this convergence to mean anything, one has to consider algorithms which choose functions from a class which grows with the sample size. This is the case for example of Structural Risk Minimization or Regularization based algorithms.)

Another possible decomposition of the risk is the following:

R(g_n) = R_n(g_n) + [R(g_n) - R_n(g_n)].

In this case, one estimates the risk by its empirical counterpart, plus some quantity which approximates (or upper bounds) R(g_n) - R_n(g_n).

To summarize, we write the three types of results we may be interested in.


– Error bound: R(g_n) ≤ R_n(g_n) + B(n, G). This corresponds to the estimation of the risk from an empirical quantity.
– Error bound relative to the best in the class: R(g_n) ≤ R(g*) + B(n, G). This tells how "optimal" the algorithm is given the model it uses.
– Error bound relative to the Bayes risk: R(g_n) ≤ R* + B(n, G). This gives theoretical guarantees on the convergence to the Bayes risk.

3 Basic Bounds

In this section we show how to obtain simple error bounds (also called generalization bounds). The elementary material from probability theory that is needed here and in the later sections is summarized in Appendix A.

3.1 Relationship to Empirical Processes

Recall that we want to estimate the risk R(g_n) = P[g_n(X) ≠ Y] of the function g_n returned by the algorithm after seeing the data (X_1, Y_1), ..., (X_n, Y_n). This quantity cannot be observed (P is unknown) and is a random variable (since it depends on the data). Hence one way to make a statement about this quantity is to say how it relates to an estimate such as the empirical risk R_n(g_n). This relationship can take the form of upper and lower bounds for

P[R(g_n) - R_n(g_n) > ε].

For convenience, let Z_i = (X_i, Y_i) and Z = (X, Y). Given G, define the loss class

F = {f : (x, y) ↦ 1_{g(x)≠y} : g ∈ G}.   (1)

Notice that G contains functions with range in {-1, 1} while F contains non-negative functions with range in {0, 1}. In the remainder of the tutorial we will go back and forth between F and G (as there is a bijection between them), sometimes stating the results in terms of functions in F and sometimes in terms of functions in G. It will be clear from the context which classes G and F we refer to, and F will always be derived from the last mentioned class G in the way of (1).

We use the shorthand notation Pf = E[f(X, Y)] and P_n f = (1/n) Σ_{i=1}^n f(X_i, Y_i). P_n is usually called the empirical measure associated to the training sample. With this notation, the quantity of interest (the difference between true and empirical risks) can be written as

Pf_n - P_n f_n.   (2)

An empirical process is a collection of random variables indexed by a class of functions, such that each random variable is distributed as a sum of i.i.d. random variables (values taken by the function at the data points):

{Pf - P_n f}_{f ∈ F}.


One of the most studied quantities associated to empirical processes is their supremum,

sup_{f∈F} (Pf - P_n f).

It is clear that if we know an upper bound on this quantity, it will be an upper bound on (2). This shows that the theory of empirical processes is a great source of tools and techniques for Statistical Learning Theory.

3.2 Hoeffding's Inequality

Let us rewrite again the quantity we are interested in as follows:

R(g) - R_n(g) = E[f(Z)] - (1/n) Σ_{i=1}^n f(Z_i).

It is easy to recognize here the difference between the expectation and the empirical average of the random variable f(Z). By the law of large numbers, we immediately obtain that

P[ lim_{n→∞} (1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)] = 0 ] = 1.

This indicates that with enough samples, the empirical risk of a function is a good approximation to its true risk.

It turns out that there exists a quantitative version of the law of large numbers when the variables are bounded.

Theorem 1 (Hoeffding). Let Z_1, ..., Z_n be n i.i.d. random variables with f(Z) ∈ [a, b]. Then for all ε > 0, we have

P[ |(1/n) Σ_{i=1}^n f(Z_i) - E[f(Z)]| > ε ] ≤ 2 exp( -2nε² / (b - a)² ).

Let us rewrite the above formula to better understand its consequences. Denote the right hand side by δ. Then

P[ |P_n f - Pf| > (b - a) √(log(2/δ) / (2n)) ] ≤ δ,

or (by inversion, see Appendix A), with probability at least 1 - δ,

|P_n f - Pf| ≤ (b - a) √(log(2/δ) / (2n)).


Applying this to f(Z) = 1_{g(X)≠Y}, we get that for any g, and any δ > 0, with probability at least 1 - δ,

R(g) ≤ R_n(g) + √(log(2/δ) / (2n)).   (3)

Notice that one has to consider a fixed function g, and the probability is with respect to the sampling of the data; if the function depends on the data this does not apply!
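A quick way to see Hoeffding's inequality at work is the simulation sketched below: it assumes Bernoulli losses f(Z_i) (so [a, b] = [0, 1]), draws many samples, and compares the observed (1 - δ)-quantile of |P_n f - Pf| with the bound √(log(2/δ)/(2n)); the parameter values are arbitrary.

import numpy as np

rng = np.random.default_rng(2)

p, n, delta, trials = 0.3, 500, 0.05, 20000

# f(Z_i) are i.i.d. Bernoulli(p) losses, so Pf = p and P_n f is their average.
samples = rng.random((trials, n)) < p
deviations = np.abs(samples.mean(axis=1) - p)

observed = np.quantile(deviations, 1 - delta)     # empirical (1-delta)-quantile of |P_n f - Pf|
hoeffding = np.sqrt(np.log(2 / delta) / (2 * n))  # bound for [a, b] = [0, 1]

print("observed (1-delta)-quantile of |P_n f - Pf| : %.4f" % observed)
print("Hoeffding bound sqrt(log(2/delta)/(2n))     : %.4f" % hoeffding)

The empirical quantile stays below the bound, and in fact well below it, since Hoeffding ignores the variance p(1 - p); this slack is revisited in Section 6.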

3.3 Limitations

Although the above result seems very nice (since it applies to any class of bounded functions), it is actually severely limited. Indeed, what it essentially says is that for each (fixed) function f ∈ F, there is a set S of samples for which Pf - P_n f ≤ √(log(2/δ)/(2n)) (and this set of samples has measure P[S] ≥ 1 - δ). However, these sets S may be different for different functions. In other words, for the observed sample, only some of the functions in F will satisfy this inequality.

Another way to explain the limitation of Hoeffding's inequality is the following. If we take for G the class of all {-1, 1}-valued (measurable) functions, then for any fixed sample, there exists a function f ∈ F such that

Pf - P_n f = 1.

To see this, take the function g which satisfies g(X_i) = Y_i on the data and g(X) = -Y everywhere else. This does not contradict Hoeffding's inequality, but shows that it does not yield what we need.

Figure 2 illustrates the above argumentation. The horizontal axis corresponds to the functions in the class, and the vertical axis to the risk.

[Figure 2: the true risk R(g) and the empirical risk R_n(g) plotted over the function class, with the minimizers g* and g_n marked.]
Fig. 2. Convergence of the empirical risk to the true risk over the class of functions.


3.4 Uniform Deviations

Before seeing the data we do not know which function the algorithm will choose, so we bound the deviation between empirical and true risk uniformly over the class:

R(f_n) - R_n(f_n) ≤ sup_{f∈F} (R(f) - R_n(f)).   (4)

In other words, if we can upper bound the supremum on the right, we are done. For this, we need a bound which holds simultaneously for all functions in a class.

Let us explain how one can construct such uniform bounds. Consider two functions f_1, f_2 and define

C_i = {(x_1, y_1), ..., (x_n, y_n) : Pf_i - P_n f_i > ε}.

This set contains all the 'bad' samples, i.e. those for which the bound fails. From Hoeffding's inequality, for each i,

P[C_i] ≤ δ.

We want to measure how many samples are 'bad' for i = 1 or i = 2. For this we use (see Appendix A)

P[C_1 ∪ C_2] ≤ P[C_1] + P[C_2] ≤ 2δ.

More generally, if we have N functions in our class, we can write

P[C_1 ∪ ... ∪ C_N] ≤ Σ_{i=1}^N P[C_i].

As a result we obtain

P[ ∃f ∈ {f_1, ..., f_N} : Pf - P_n f > ε ] ≤ Σ_{i=1}^N P[Pf_i - P_n f_i > ε] ≤ N exp(-2nε²).


Hence, for G = {g_1, ..., g_N}, for all δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + √( (log N + log(1/δ)) / (2n) ).

This is an error bound. Indeed, if we know that our algorithm picks functions from G, we can apply this result to g_n itself.

Notice that the main difference with Hoeffding's inequality is the extra log N term on the right hand side. This is the term which accounts for the fact that we want N bounds to hold simultaneously. Another interpretation of this term is as the number of bits one would require to specify one function in G. It turns out that this kind of coding interpretation of generalization bounds is often possible and can be used to obtain error estimates [16].
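The finite-class bound is easy to evaluate and to stress-test numerically. The sketch below assumes a class of N fixed functions represented only by their (randomly chosen, then frozen) error probabilities Pf_j, and checks how often sup_j (Pf_j - P_n f_j) exceeds √((log N + log(1/δ))/(2n)); everything in it is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(3)

n, N, delta, trials = 200, 50, 0.05, 2000

# N fixed functions, each represented by its error probability Pf_j (drawn once, then fixed).
Pf = rng.uniform(0.1, 0.5, size=N)

bound = np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))

# For each trial, draw a sample and compute sup_j (Pf_j - P_n f_j).
failures = 0
for _ in range(trials):
    losses = rng.random((N, n)) < Pf[:, None]     # losses[j, i] = f_j(Z_i)
    sup_dev = np.max(Pf - losses.mean(axis=1))
    failures += sup_dev > bound

print("bound B(n, N, delta) = %.4f" % bound)
print("fraction of trials with sup_j (Pf_j - P_n f_j) > bound: %.4f (should be <= %.2f)"
      % (failures / trials, delta))

The failure frequency stays below δ, usually far below it, because the union bound treats the N functions as if their bad events never overlapped.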

3.5 Estimation Error

Using the same idea as before, and with no additional effort, we can also get a bound on the estimation error. We start from the inequality

R(g*) ≤ R_n(g*) + sup_{g∈G} (R(g) - R_n(g)),

which we combine with (4) and with the fact that, since g_n minimizes the empirical risk in G,

R_n(g*) - R_n(g_n) ≥ 0.

Thus we obtain

R(g_n) = R(g_n) - R(g*) + R(g*)
       ≤ R_n(g*) - R_n(g_n) + R(g_n) - R(g*) + R(g*)
       ≤ 2 sup_{g∈G} |R(g) - R_n(g)| + R(g*).

We obtain that with probability at least 1 - δ,

R(g_n) ≤ R(g*) + 2 √( (log N + log(2/δ)) / (2n) ).

We notice that in the right hand side both terms depend on the size of the class G. If this size increases, the first term will decrease, while the second will increase.

3.6 Summary and Perspective

At this point, we can summarize what we have exposed so far.

– Inference requires to put assumptions on the process generating the data (data sampled i.i.d. from an unknown P); generalization requires knowledge (e.g. restriction, structure, or prior).


– The error bounds are valid with respect to the repeated sampling of training sets.
– For a fixed function g, for most of the samples,

R(g) - R_n(g) ≈ 1/√n.

– For most of the samples, if |G| = N,

sup_{g∈G} (R(g) - R_n(g)) ≈ √(log N / n).

The extra variability comes from the fact that the chosen g_n changes with the data.

So the result we have obtained so far is that, with high probability, for a finite class of size N,

sup_{g∈G} (R(g) - R_n(g)) ≤ √( (log N + log(1/δ)) / (2n) ).

There are several things that can be improved:

– Hoeffding's inequality only uses the boundedness of the functions, not their variance.
– The union bound is as bad as if all the functions in the class were independent (i.e. as if f_1(Z) and f_2(Z) were independent).
– The supremum over G of R(g) - R_n(g) is not necessarily what the algorithm would choose, so that upper bounding R(g_n) - R_n(g_n) by the supremum might be loose.

4 Infinite Case: Vapnik-Chervonenkis Theory

In this section we show how to extend the previous results to the case where the class G is infinite. This requires, in the non-countable case, the introduction of tools from Vapnik-Chervonenkis Theory.

4.1 Refined Union Bound and Countable Case

We first start with a simple refinement of the union bound that allows to extend the previous results to the (countably) infinite case.

Recall that by Hoeffding's inequality, for each f ∈ F, for each δ > 0 (possibly depending on f, which we write δ(f)),

P[ Pf - P_n f > √( log(1/δ(f)) / (2n) ) ] ≤ δ(f).


Hence, if we have a countable set F, the union bound immediately yields

P[ ∃f ∈ F : Pf - P_n f > √( log(1/δ(f)) / (2n) ) ] ≤ Σ_{f∈F} δ(f).

Choosing δ(f) = δ p(f) with Σ_{f∈F} p(f) = 1, this makes the right-hand side equal to δ and we get the following result: with probability at least 1 - δ,

∀f ∈ F, Pf ≤ P_n f + √( (log(1/p(f)) + log(1/δ)) / (2n) ).

We notice that if F is finite (with size N), taking a uniform p gives the log N as before.

Using this approach, it is possible to put knowledge about the algorithm into p(f), but p should be chosen before seeing the data, so it is not possible to 'cheat' by setting all the weight to the function returned by the algorithm after seeing the data (which would give the smallest possible bound). But, in general, if p is well-chosen, the bound will have a small value. Hence, the bound can be improved if one knows ahead of time the functions that the algorithm is likely to pick (i.e. knowledge improves the bound).
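As a small sketch of this refined union bound, assume a countable class indexed by k = 1, 2, ... and a geometric prior p(f_k) = 2^{-k} (both are illustrative assumptions); the resulting bound √((log(1/p(f)) + log(1/δ))/(2n)) degrades only linearly in k.

import numpy as np

n, delta = 1000, 0.05

# Countable class indexed by k = 1, 2, 3, ...; geometric prior p(f_k) = 2**(-k) (sums to 1).
def bound(k):
    log_inv_p = k * np.log(2.0)                  # log(1/p(f_k))
    return np.sqrt((log_inv_p + np.log(1 / delta)) / (2 * n))

for k in [1, 5, 20, 100]:
    print("k=%-4d  p(f_k)=2^-%-4d  bound = %.4f" % (k, k, bound(k)))

If the algorithm tends to return functions with small index k, the bound stays close to the single-function Hoeffding bound; weight placed on functions the algorithm never returns is simply wasted.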

4.2 General Case

When the set G is uncountable, the previous approach does not directly work. The general idea is to look at the function class 'projected' on the sample. More precisely, given a sample z_1, ..., z_n, we consider

F_{z_1,...,z_n} = {(f(z_1), ..., f(z_n)) : f ∈ F}.

The size of this set is the number of possible ways in which the data (z_1, ..., z_n) can be classified. Since the functions f can only take two values, this set will always be finite, no matter how big F is.

Definition 1 (Growth function). The growth function is the maximum number of ways into which n points can be classified by the function class:

S_F(n) = sup_{(z_1,...,z_n)} |F_{z_1,...,z_n}|.

We have defined the growth function in terms of the loss class F, but we can do the same with the initial class G and notice that S_F(n) = S_G(n). It turns out that this growth function can be used as a measure of the 'size' of a class of functions, as demonstrated by the following result.

Theorem 2 (Vapnik-Chervonenkis). For any δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (log S_G(2n) + log(2/δ)) / n ).


Notice that, in the finite case where |G| = N, we have S_G(n) ≤ N, so that this bound is always better than the one we had before (except for the constants). But the problem now becomes one of computing S_G(n).

4.3 VC Dimension

Since g ∈ {-1, 1}, it is clear that S_G(n) ≤ 2^n. If S_G(n) = 2^n, there is a set of size n such that the class of functions can generate any classification on these points (we say that G shatters the set).

Definition 2 (VC dimension). The VC dimension of a class G is the largest n such that

S_G(n) = 2^n.

In other words, the VC dimension of a class G is the size of the largest set that it can shatter.

In order to illustrate this definition, we give some examples. The first one is the set of halfplanes in ℝ^d (see Figure 3). In this case, as depicted for the case d = 2, one can shatter a set of d + 1 points but no set of d + 2 points, which means that the VC dimension is d + 1.

Fig. 3. Computing the VC dimension of hyperplanes in dimension 2: a set of 3 points can be shattered, but no set of four points.

It is interesting to notice that the number of parameters needed to define halfspaces in ℝ^d is d, so that a natural question is whether the VC dimension is related to the number of parameters of the function class. The next example, depicted in Figure 4, is a family of functions with one parameter only,

{x ↦ sgn(sin(tx)) : t ∈ ℝ},

which actually has infinite VC dimension (this is an exercise left to the reader).
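Growth functions can be computed by brute force for very simple classes. The sketch below does this for one-dimensional threshold classifiers g_t(x) = sign(x - t), a class not discussed in the text but convenient because its projections can be enumerated exhaustively; it finds S(n) = n + 1, hence VC dimension 1.

import numpy as np

def growth_function_thresholds(n):
    """S_G(n) for G = {x -> sign(x - t)}, computed by enumerating labelings on n points."""
    x = np.sort(np.random.default_rng(4).uniform(0, 1, size=n))   # n distinct points
    # candidate thresholds: below all points, between consecutive points, above all points
    ts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    labelings = {tuple(np.where(x > t, 1, -1)) for t in ts}
    return len(labelings)

for n in range(1, 8):
    S = growth_function_thresholds(n)
    print("n=%d  S(n)=%d  2^n=%d  shattered: %s" % (n, S, 2 ** n, S == 2 ** n))

For richer classes such as halfplanes the same enumeration works in principle, with the dichotomy check replaced by a linear-separability test.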


Fig. 4. VC dimension of sinusoids.

It remains to show how the notion of VC dimension can bring a solution to the problem of computing the growth function. Indeed, at first glance, if we know that a class G has VC dimension h, it entails that S_G(n) = 2^n for all n ≤ h, and S_G(n) < 2^n otherwise. This seems of little use, but actually, an intriguing phenomenon occurs for n ≥ h, as depicted in Figure 5: the growth function stops being exponential and becomes polynomial in n.


This is the content of the following lemma.

Lemma 1 (Vapnik and Chervonenkis, Sauer, Shelah). Let G be a class of functions with finite VC dimension h. Then for all n ∈ ℕ,

S_G(n) ≤ Σ_{i=0}^h (n choose i),

and for all n ≥ h,

S_G(n) ≤ (en/h)^h.

Using this lemma along with Theorem 2, we immediately obtain that if G has VC dimension h, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (h log(2en/h) + log(2/δ)) / n ).

What is important to recall from this result is that the difference between the true and empirical risk is at most of order

√( h log n / n ).

An interpretation of VC dimension and growth functions is that they measure the effective size of the class, that is, the size of the projection of the class onto finite samples. In addition, this measure does not just 'count' the number of functions in the class but depends on the geometry of the class (rather, of its projections). Finally, the finiteness of the VC dimension ensures that the empirical risk will converge uniformly over the class to the true risk.

4.4 Symmetrization

We now indicate how to prove Theorem 2. The key ingredient in the proof is the so-called symmetrization lemma. The idea is to replace the true risk by an estimate computed on an independent set of data. This is of course a mathematical technique and does not mean one needs to have more data to be able to apply the result. The extra data set is usually called a 'virtual' or 'ghost' sample.

We will denote by Z'_1, ..., Z'_n an independent (ghost) sample and by P'_n the corresponding empirical measure.

Lemma 2 (Symmetrization). For any t > 0 such that nt² ≥ 2,

P[ sup_{f∈F} (P - P_n)f ≥ t ] ≤ 2 P[ sup_{f∈F} (P'_n - P_n)f ≥ t/2 ].

Proof. Let f_n be the function achieving the supremum (note that it depends on Z_1, ..., Z_n). One has (with the products below expressing the conjunction of two events)

1_{(P-P_n)f_n > t} 1_{(P-P'_n)f_n < t/2} = 1_{(P-P_n)f_n > t} 1_{(P'_n-P)f_n > -t/2} ≤ 1_{(P'_n-P_n)f_n > t/2}.

Taking expectations with respect to the second sample gives

1_{(P-P_n)f_n > t} P'[ (P - P'_n)f_n < t/2 ] ≤ P'[ (P'_n - P_n)f_n > t/2 ].


By Chebyshev's inequality (see Appendix A),

P'[ (P - P'_n)f_n ≥ t/2 ] ≤ 4 Var f_n / (nt²) ≤ 1/(nt²).

Indeed, a random variable with range in [0, 1] has variance less than 1/4. Hence

1_{(P-P_n)f_n > t} (1 - 1/(nt²)) ≤ P'[ (P'_n - P_n)f_n > t/2 ].

Taking expectation with respect to the first sample and using nt² ≥ 2 gives the result.

This lemma allows to replace the expectation Pf by an empirical average over the ghost sample. As a result, the right hand side only depends on the projection of the class F on the double sample,

F_{Z_1,...,Z_n,Z'_1,...,Z'_n},

which contains finitely many different vectors. One can thus use the simple union bound that was presented before in the finite case. The other ingredient needed to obtain Theorem 2 is again Hoeffding's inequality, in the following form:

P[ P_n f - P'_n f > t ] ≤ e^{-nt²/2}.

We now just have to put the pieces together:

P[ sup_{f∈F} (P - P_n)f ≥ t ]
≤ 2 P[ sup_{f∈F} (P'_n - P_n)f ≥ t/2 ]
= 2 P[ sup_{f ∈ F_{Z_1,...,Z_n,Z'_1,...,Z'_n}} (P'_n - P_n)f ≥ t/2 ]
≤ 2 S_F(2n) P[ (P'_n - P_n)f ≥ t/2 ]
≤ 2 S_F(2n) e^{-nt²/8}.

Using inversion finishes the proof of Theorem 2.

4.5 VC Entropy

One important aspect of the VC dimension is that it is distribution independent. Hence, it allows to get bounds that do not depend on the problem at hand: the same bound holds for any distribution. Although this may be seen as an advantage, it can also be a drawback since, as a result, the bound may be loose for most distributions.

We now show how to modify the proof above to get a distribution-dependent result. We use the following notation: N(F, z_1^n) = |F_{z_1,...,z_n}|.

Definition 3 (VC entropy). The (annealed) VC entropy is defined as

H_F(n) = log E[ N(F, Z_1^n) ].


Theorem 3. For any δ > 0, with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + 2 √( 2 (H_F(2n) + log(2/δ)) / n ).

Proof. We again begin with the symmetrization lemma, so that we have to upper bound the quantity

I = P[ sup_{f ∈ F_{Z_1^n, Z'^n_1}} (P'_n - P_n)f ≥ t/2 ].

Let σ_1, ..., σ_n be n independent random variables such that P[σ_i = 1] = P[σ_i = -1] = 1/2 (they are called Rademacher variables). We notice that the quantities (P'_n - P_n)f and (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) have the same distribution, since changing one σ_i corresponds to exchanging Z_i and Z'_i. Hence we have

I = P[ sup_{f ∈ F_{Z_1^n, Z'^n_1}} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ],

and the union bound leads to

I ≤ E[ N(F, Z_1^n, Z'^n_1) max_f P_σ[ (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ≥ t/2 ] ].

Since σ_i (f(Z'_i) - f(Z_i)) ∈ [-1, 1], Hoeffding's inequality finally gives

I ≤ E[ N(F, Z, Z') ] e^{-nt²/8}.

The rest of the proof is as before.

5 Capacity Measures

We have seen so far three measures of capacity or size of classes of functions: the VC dimension and the growth function, both distribution independent, and the VC entropy, which depends on the distribution. Apart from the VC dimension, they are usually hard or impossible to compute. There are however other measures which not only may give sharper estimates, but also have properties that make their computation possible from the data only.


5.1 Covering Numbers

We start by endowing the function class G with the following (data-dependent) metric:

d_n(g, g') = (1/n) |{ i : g(X_i) ≠ g'(X_i), i = 1, ..., n }|.

This is the normalized Hamming distance of the 'projections' on the sample.

Given such a metric, we say that a set f_1, ..., f_N covers F at radius ε if

F ⊂ ∪_{i=1}^N B(f_i, ε).

We then define the covering numbers of F as follows.

Definition 4 (Covering number). The covering number of F at radius ε, with respect to d_n, denoted by N(F, ε, n), is the minimum size of a cover of radius ε.

Notice that it does not matter whether we apply this definition to the original class G or to the loss class F, since N(F, ε, n) = N(G, ε, n).

The covering numbers characterize the size of a function class as measured by the metric d_n. The rate of growth of the logarithm of N(G, ε, n), usually called the metric entropy, is related to the classical concept of vector dimension. Indeed, if G is a compact set in a d-dimensional Euclidean space, N(G, ε, n) ≈ ε^{-d}.

When the covering numbers are finite, it is possible to approximate the class G by a finite set of functions (which cover G). This again allows to use the finite union bound, provided we can relate the behavior of all functions in G to that of the functions in the cover. A typical result, which we provide without proof, is the following.

Theorem 4. For any t > 0,

P[ ∃g ∈ G : R(g) > R_n(g) + t ] ≤ 8 E[ N(G, t, n) ] e^{-nt²/128}.

Covering numbers can also be defined for classes of real-valued functions.
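Covering numbers of projected classes can be estimated directly from a sample. The sketch below assumes the same illustrative threshold class as earlier, projects it on a sample, and builds a greedy cover in the normalized Hamming distance d_n; the size of the greedy cover is an upper bound on N(F, ε, n).

import numpy as np

rng = np.random.default_rng(5)

# Sample of n points and the projection of a class on it: here thresholds on [0, 1].
n = 100
x = rng.uniform(0, 1, size=n)
thresholds = np.linspace(0, 1, 201)
projection = np.unique(np.stack([np.where(x > t, 1, 0) for t in thresholds]), axis=0)

def hamming(a, b):
    """Normalized Hamming distance d_n between two projected functions."""
    return np.mean(a != b)

def greedy_cover_size(points, eps):
    """Size of a greedy cover at radius eps (an upper bound on the covering number)."""
    remaining = list(points)
    centers = 0
    while remaining:
        c = remaining[0]
        remaining = [p for p in remaining if hamming(c, p) > eps]
        centers += 1
    return centers

for eps in [0.01, 0.05, 0.1, 0.25]:
    print("eps=%.2f  |projection|=%d  greedy cover size (>= N(F, eps, n)? no: upper bound): %d"
          % (eps, len(projection), greedy_cover_size(projection, eps)))

A greedy cover is generally not minimal, so its size only upper bounds the covering number, but its centers are pairwise more than ε apart, so it also gives a packing; in practice it shows how the metric entropy grows as ε shrinks.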

We now relate the covering numbers to the VC dimension. Notice that, because the functions in G can only take two values, for all ε > 0, N(G, ε, n) ≤ |G_{Z_1^n}| = N(G, Z_1^n). Hence the VC entropy corresponds to log covering numbers at minimal scale, which implies log N(G, ε, n) ≤ h log(en/h), but one can have a considerably better result.

Lemma 3 (Haussler). Let G be a class of VC dimension h. Then, for all ε > 0, all n, and any sample,

N(G, ε, n) ≤ C h (4e)^h ε^{-h},

for a universal constant C. The interest of this result is that the upper bound does not depend on the sample size n.

The covering number bound is a generalization of the VC entropy bound, where the scale is adapted to the error. It turns out that this result can be improved by considering all scales (see Section 5.2).

5.2 Rademacher Averages

Recall that we used in the proof of Theorem 3 Rademacher random variables, i.e. independent {-1, 1}-valued random variables with probability 1/2 of taking either value.


For a function f, we write R_n f = (1/n) Σ_{i=1}^n σ_i f(Z_i). We will denote by E_σ the expectation taken with respect to the Rademacher variables (i.e. conditionally to the data), while E will denote the expectation with respect to all the random variables (i.e. the data, the ghost sample and the Rademacher variables).

Definition 5 (Rademacher averages). For a class F of functions, the Rademacher average is defined as

R(F) = E[ sup_{f∈F} R_n f ],

and the conditional Rademacher average is defined as

R_n(F) = E_σ[ sup_{f∈F} R_n f ].

We now state the fundamental result involving Rademacher averages.

Theorem 5. For all δ > 0, with probability at least 1 - δ,

∀f ∈ F, Pf ≤ P_n f + 2 R(F) + √( log(1/δ) / (2n) ),

and also, with probability at least 1 - δ,

∀f ∈ F, Pf ≤ P_n f + 2 R_n(F) + √( 2 log(2/δ) / n ).

It is remarkable that one can obtain a bound (the second part of the theorem) which depends solely on the data.

The proof of the above result requires a powerful tool called a concentration inequality for empirical processes.

Actually, Hoeffding's inequality is a (simple) concentration inequality, in the sense that when n increases, the empirical average is concentrated around the expectation. It is possible to generalize this result to functions that depend on i.i.d. random variables, as shown in the theorem below.

Theorem 6 (McDiarmid [17]). Assume that for all i = 1, ..., n,

sup_{z_1,...,z_n, z'_i} | F(z_1, ..., z_i, ..., z_n) - F(z_1, ..., z'_i, ..., z_n) | ≤ c,

then for all ε > 0,

P[ |F - E[F]| > ε ] ≤ 2 exp( -2ε² / (nc²) ).

The meaning of this result is thus that, as soon as one has a function of n independent random variables which is such that its variation is bounded when one variable is modified, the function will satisfy a Hoeffding-like inequality.
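The bounded-differences assumption is easy to verify numerically for the simplest case, the empirical mean, which is also the kind of function McDiarmid's inequality is applied to in the proofs below; the check sketched here uses arbitrary data in [0, 1] and confirms the constant c = (b - a)/n.

import numpy as np

rng = np.random.default_rng(6)

n = 50
z = rng.uniform(0, 1, size=n)          # n i.i.d. variables with values in [a, b] = [0, 1]

def F(sample):
    """The function of the sample we apply McDiarmid to: its empirical mean."""
    return sample.mean()

# Replace each coordinate in turn by the worst-case value and record the change in F.
worst_change = 0.0
for i in range(n):
    for new_value in (0.0, 1.0):
        z_prime = z.copy()
        z_prime[i] = new_value
        worst_change = max(worst_change, abs(F(z) - F(z_prime)))

print("largest observed change of F when one Z_i is modified: %.4f" % worst_change)
print("bounded-differences constant c = (b - a)/n            : %.4f" % (1.0 / n))

The same kind of check, with the same constant 1/n, applies to sup_{f∈F}(Pf - P_n f), which is how McDiarmid's inequality enters the proof of Theorem 5.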


Proof of Theorem 5. The proof combines McDiarmid's inequality with a symmetrization argument, and finally relates the Rademacher average to the conditional one.

We first show that McDiarmid's inequality can be applied to sup_{f∈F} (Pf - P_n f). We denote temporarily by P^i_n the empirical measure obtained by modifying one element (e.g. Z_i is replaced by Z'_i) of the sample. It is easy to check that the following holds:

| sup_{f∈F} (Pf - P_n f) - sup_{f∈F} (Pf - P^i_n f) | ≤ sup_{f∈F} | P^i_n f - P_n f |.

Since f ∈ {0, 1}, we obtain

| P^i_n f - P_n f | = (1/n) | f(Z'_i) - f(Z_i) | ≤ 1/n,

and thus McDiarmid's inequality can be applied with c = 1/n. This concludes the first step of the proof.

We next prove the (first part of the) following symmetrization lemma.

Lemma 4. For any class F,

E[ sup_{f∈F} (Pf - P_n f) ] ≤ 2 E[ sup_{f∈F} R_n f ],

and

E[ sup_{f∈F} |Pf - P_n f| ] ≥ (1/2) E[ sup_{f∈F} R_n f ] - 1/(2√n).

Proof. We only prove the first part. We introduce a ghost sample and its corresponding measure P'_n. We successively use the fact that E[P'_n f] = Pf and that the supremum is a convex function (hence we can apply Jensen's inequality, see Appendix A):

E[ sup_{f∈F} (Pf - P_n f) ]
= E[ sup_{f∈F} E[ P'_n f - P_n f | Z_1, ..., Z_n ] ]
≤ E[ sup_{f∈F} (P'_n f - P_n f) ]
= E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) ]
≤ E[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(Z'_i) ] + E[ sup_{f∈F} (1/n) Σ_{i=1}^n (-σ_i) f(Z_i) ]
= 2 E[ sup_{f∈F} R_n f ],


where the third step uses the fact that f(Z'_i) - f(Z_i) and σ_i(f(Z'_i) - f(Z_i)) have the same distribution, and the last step uses the fact that σ_i f(Z_i) and -σ_i f(Z_i) have the same distribution.

The above already establishes the first part of Theorem 5. To obtain the second part, it remains to relate R(F) to R_n(F). It is easy to check that, as a function of the sample, R_n(F) satisfies McDiarmid's assumptions with c = 1/n. As a result, R(F) = E[R_n(F)] can be sharply estimated by R_n(F).

Loss Class and Initial Class. In order to make use of Theorem 5, we have to relate the Rademacher average of the loss class to that of the initial class. This can be done with the following derivation, where one uses the fact that σ_i and -σ_i Y_i have the same distribution:

R(F) = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i 1_{g(X_i)≠Y_i} ]
     = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i (1 - Y_i g(X_i))/2 ]
     = (1/2) E[ sup_{g∈G} (1/n) Σ_{i=1}^n (-σ_i Y_i) g(X_i) ] = (1/2) R(G).

Notice that the same is valid for conditional Rademacher averages, so that we obtain that with probability at least 1 - δ,

∀g ∈ G, R(g) ≤ R_n(g) + R_n(G) + √( 2 log(2/δ) / n ).

Computing the Rademacher Averages. We now assess the difficulty of actually computing the Rademacher averages. We write the following:

(1/2) R_n(G) = E_σ[ sup_{g∈G} (1/(2n)) Σ_{i=1}^n σ_i g(X_i) ]
             = E_σ[ sup_{g∈G} (1/n) Σ_{i=1}^n (1 + σ_i g(X_i))/2 ] - 1/2
             = 1/2 - E_σ[ inf_{g∈G} (1/n) Σ_{i=1}^n 1_{g(X_i) ≠ σ_i} ]
             = 1/2 - E_σ[ inf_{g∈G} R_n(g, σ) ],

where R_n(g, σ) denotes the empirical risk of g when the labels are the σ_i.


This indicates that, given a sample and a choice of the random variables σ_1, ..., σ_n, computing R_n(G) is not harder than computing the empirical risk minimizer in G. Indeed, the procedure would be to generate the σ_i randomly and minimize the empirical error in G with respect to the labels σ_i.
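This recipe translates directly into code. The sketch below estimates the conditional Rademacher average of the small illustrative threshold class used earlier by Monte Carlo over σ, doing exactly what the text describes: draw random labels, 'fit the noise' by searching the class, and average; the class, data and sample sizes are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(7)

n = 100
X = rng.uniform(-1, 1, size=n)

# Small class G: threshold classifiers g_{t,s}(x) = s * sign(x - t), stored via their projections.
thresholds = np.linspace(-1, 1, 21)
G = [s * np.where(X > t, 1, -1) for t in thresholds for s in (-1, 1)]

def conditional_rademacher(num_draws=2000):
    """Monte Carlo estimate of R_n(G) = E_sigma sup_g (1/n) sum_i sigma_i g(X_i)."""
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1, 1], size=n)
        total += max(np.mean(sigma * g) for g in G)
    return total / num_draws

print("estimated R_n(G) = %.4f" % conditional_rademacher())
print("finite-class bound sqrt(2 log N / n) = %.4f" % np.sqrt(2 * np.log(len(G)) / n))

The estimate stays below √(2 log N / n), the finite-class bound recalled just below, and it would approach 1/2 if the class were rich enough to fit arbitrary labelings.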

An advantage of rewriting R_n(G) as above is that it gives an intuition of what it actually measures: it measures how much the class G can fit random noise. If the class G is very large, there will always be a function which can perfectly fit the σ_i, and then R_n(G) = 1/2, so that there is no hope of uniform convergence to zero of the difference between true and empirical risks.

For a finite set G with |G| = N, one can show that

R_n(G) ≤ √( 2 log N / n ),

where we again see the logarithmic factor log N. A consequence of this is that, by considering the projection on the sample of a class G with VC dimension h, and using Lemma 1, we have

R(G) ≤ √( 2 h log(en/h) / n ).

This result, along with Theorem 5, allows to recover the Vapnik-Chervonenkis bound with a concentration-based proof.

Although the benefit of using concentration may not be entirely clear at this point, let us just mention that one can actually improve the dependence on n of the above bound. This is based on the so-called chaining technique. The idea is to use covering numbers at all scales in order to capture the geometry of the class in a better way than the VC entropy does.

One has the following result, called Dudley's entropy bound:

R_n(F) ≤ (C/√n) ∫_0^∞ √( log N(F, t, n) ) dt.

As a consequence, along with Haussler's upper bound, we can get the following result:

R_n(F) ≤ C √( h/n ).

We can thus, with this approach, remove the unnecessary log n factor of the VC bound.

6 Advanced Topics

In this section we point out several ways in which the results presented so far can be improved. The main source of improvement actually comes, as mentioned earlier, from the fact that the Hoeffding and McDiarmid inequalities do not make use of the variance of the functions.


6.1 Binomial Tails

We recall that the functions we consider are binary valued. So, if we consider a fixed function f, the distribution of P_n f is actually a binomial law of parameters Pf and n (since we are summing n i.i.d. random variables f(Z_i) which can either be 0 or 1 and are equal to 1 with probability E[f(Z_i)] = Pf). Denoting p = Pf, we can have an exact expression for the deviations of P_n f from Pf:

P[ Pf - P_n f ≥ t ] = Σ_{k=0}^{⌊n(p-t)⌋} (n choose k) p^k (1-p)^{n-k}.

Since this expression is not easy to manipulate, we have used an upper bound provided by Hoeffding's inequality. However, there exist other (sharper) upper bounds. The following quantities are all upper bounds on P[Pf - P_n f ≥ t]:

( (p/(p-t))^{p-t} ((1-p)/(1-p+t))^{1-p+t} )^n          (exponential)
e^{-np((1-t/p) log(1-t/p) + t/p)}                       (Bennett)
e^{-nt²/(2p(1-p) + 2t/3)}                               (Bernstein)
e^{-2nt²}                                               (Hoeffding)

Examining the above bounds (and using inversion), we can say that, roughly speaking, the small deviations of Pf - P_n f have a Gaussian behavior of the form exp(-nt²/(2p(1-p))) (i.e. Gaussian with variance p(1-p)), while the large deviations have a Poisson behavior of the form exp(-3nt/2).

So the tails are heavier than Gaussian, and Hoeffding's inequality consists in upper bounding the tails with a Gaussian with maximum variance, hence the term exp(-2nt²).

Each function f ∈ F has a different variance Pf(1 - Pf) ≤ Pf. Moreover, for each f ∈ F, by Bernstein's inequality, with probability at least 1 - δ,

Pf ≤ P_n f + √( 2 Pf log(1/δ) / n ) + 2 log(1/δ) / (3n).

The Gaussian part (the second term on the right hand side) dominates (for Pf not too small, or n large enough), and it depends on Pf. We thus want to combine Bernstein's inequality with the union bound and the symmetrization.
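To see how much is gained over Hoeffding when the variance is small, the sketch below compares, for Bernoulli(p) losses, the exact binomial tail P[Pf - P_n f ≥ t] with the Bernstein and Hoeffding bounds quoted above; p, n and t are arbitrary illustrative values, and the exact tail is summed directly rather than taken from a statistics library.

import numpy as np
from math import comb, exp

def binomial_tail(p, n, t):
    """Exact P[Pf - P_n f >= t] = P[Binomial(n, p) <= n(p - t)]."""
    k_max = int(np.floor(n * (p - t)))
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, k_max + 1))

def hoeffding(p, n, t):
    return exp(-2 * n * t**2)

def bernstein(p, n, t):
    return exp(-n * t**2 / (2 * p * (1 - p) + 2 * t / 3))

p, n = 0.1, 1000
for t in [0.01, 0.02, 0.05]:
    print("t=%.2f  exact=%.2e  Bernstein=%.2e  Hoeffding=%.2e"
          % (t, binomial_tail(p, n, t), bernstein(p, n, t), hoeffding(p, n, t)))

For small p the Bernstein bound tracks the exact tail far more closely than Hoeffding, which is precisely the slack the normalized bounds of the next subsection exploit.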

6.2 Normalization

The idea is to consider the ratio

(Pf - P_n f) / √(Pf).

Here (since f ∈ {0, 1}), Var f ≤ P f² = P f.


The reason for considering this ratio is that after normalization the fluctuations are more 'uniform' in the class F. Hence the supremum in

sup_{f∈F} (Pf - P_n f) / √(Pf)

is not necessarily attained at functions with large variance, as was the case previously.

Moreover, we know that our goal is to find functions with small error Pf (hence small variance). The normalized supremum takes this into account.

We now state a result similar to Theorem 2 for the normalized supremum.

Theorem 7 (Vapnik-Chervonenkis [18]). For δ > 0, with probability at least 1 - δ,

∀f ∈ F, (Pf - P_n f) / √(Pf) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n ),

and also with probability at least 1 - δ,

∀f ∈ F, (P_n f - Pf) / √(P_n f) ≤ 2 √( (log S_F(2n) + log(4/δ)) / n ).

Proof. We only give a sketch of the proof. The first step is a variation of the symmetrization lemma:

P[ sup_{f∈F} (Pf - P_n f)/√(Pf) ≥ t ] ≤ 2 P[ sup_{f∈F} (P'_n f - P_n f)/√((P_n f + P'_n f)/2) ≥ t ].

The second step consists in randomization (with Rademacher variables):

... = 2 E[ P_σ[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i (f(Z'_i) - f(Z_i)) / √((P_n f + P'_n f)/2) ≥ t ] ].

Finally, one uses a tail bound of Bernstein type.

Let us explore the consequences of this result. From the fact that for non-negative numbers A, B, C,

A ≤ B + C√A  ⟹  A ≤ B + C² + C√B,

we easily get, for example,

∀f ∈ F, Pf ≤ P_n f + 2 √( P_n f (log S_F(2n) + log(4/δ)) / n ) + 4 (log S_F(2n) + log(4/δ)) / n.


In the ideal situation where there is no noise (i.e. Y = t(X) almost surely), and t ∈ G, denoting by g_n the empirical risk minimizer, we have R* = 0 and also R_n(g_n) = 0. In particular, when G is a class of VC dimension h, we obtain

R(g_n) = O( h log n / n ).

So, in a way, Theorem 7 allows to interpolate between the best case, where the rate of convergence is O(h log n / n), and the worst case, where the rate is O(√(h log n / n)) (it does not allow to remove the log n factor in this case).

It is also possible to derive from Theorem 7 relative error bounds for the minimizer of the empirical error. With probability at least 1 - δ,

R(g_n) ≤ R(g*) + 2 √( R(g*) (log S_G(2n) + log(4/δ)) / n ) + 4 (log S_G(2n) + log(4/δ)) / n.

We notice here that when R(g*) = 0 (i.e. t ∈ G and R* = 0), the rate is again of order 1/n, while as soon as R(g*) > 0, the rate is of order 1/√n. Therefore, it is not possible to obtain a rate with a power of n in between 1/2 and 1. The main reason is that the factor of the square root term, R(g*), is not the right quantity to use here since it does not vary with n. We will see later that one can have instead R(g_n) - R(g*) as a factor, which usually converges to zero as n increases. Unfortunately, Theorem 7 cannot be applied to functions of the type f - f* (which would be needed to have the mentioned factor), so we will need a refined approach.

6.3 Noise Conditions

The refinement we seek to obtain requires certain specific assumptions about the noise function s(x), the ideal case being when s(x) = 0 everywhere (which corresponds to R* = 0 and Y = t(X)). We now introduce quantities that measure how well-behaved the noise function is.

The situation is favorable when the regression function η(x) is not too close to 0, or at least not too often close to 0. Indeed, η(x) = 0 means that the noise is maximum at x (s(x) = 1/2) and that the label is completely undetermined (any prediction would yield an error with probability 1/2).

Definitions. There are two types of conditions.

Definition 6 (Massart's noise condition). For some c > 0, assume

|η(X)| > 1/c almost surely.


This condition implies that there is no region where the decision is completely random; the noise is bounded away from 1/2.

Definition 7 (Tsybakov's noise condition). Let α ∈ [0, 1]; assume that one of the following equivalent conditions is satisfied:

(i) ∃c > 0, ∀g ∈ {-1, 1}^X, P[g(X)η(X) ≤ 0] ≤ c (R(g) - R*)^α,
(ii) ∃c > 0, ∀A ⊂ X, ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α,
(iii) ∃B > 0, ∀t ≥ 0, P[ |η(X)| ≤ t ] ≤ B t^{α/(1-α)}.

Condition (iii) is probably the easiest to interpret: it means that η(x) is close to the critical value 0 with low probability.

We indicate how to prove that conditions (i), (ii) and (iii) are indeed equivalent.

(i) ⇒ (ii): It is easy to check that R(g) - R* = E[ |η(X)| 1_{g(X)η(X)≤0} ]; for each set A there is a function g such that 1_A = 1_{g(X)η(X)≤0}, and applying (i) to this g gives (ii).

(ii) ⇒ (iii): Let A = {x : |η(x)| ≤ t}. Then

P[ |η| ≤ t ] = ∫_A dP(x) ≤ c ( ∫_A |η(x)| dP(x) )^α ≤ c t^α ( ∫_A dP(x) )^α,

so that P[ |η| ≤ t ] ≤ c^{1/(1-α)} t^{α/(1-α)}.

(iii) ⇒ (i): We write

R(g) - R* = E[ |η(X)| 1_{g(X)η(X)≤0} ]
          ≥ t E[ 1_{gη≤0} 1_{|η|>t} ]
          = t ( E[1_{gη≤0}] - E[ 1_{gη≤0} 1_{|η|≤t} ] )
          ≥ t ( P[gη ≤ 0] - P[|η| ≤ t] )
          ≥ t ( P[gη ≤ 0] - B t^{α/(1-α)} ).

Taking t = ( (1-α) P[gη ≤ 0] / B )^{(1-α)/α} finally gives

P[ g(X)η(X) ≤ 0 ] ≤ B^{1-α} (1-α)^{-(1-α)} α^{-α} ( R(g) - R* )^α.

We notice that the parameter α has to be in [0, 1]. Indeed, one has the opposite inequality

R(g) - R* = E[ |η(X)| 1_{g(X)η(X)≤0} ] ≤ E[ 1_{g(X)η(X)≤0} ] = P[ g(X)η(X) ≤ 0 ],

which is incompatible with condition (i) if α > 1.

We also notice that when α = 0, Tsybakov's condition is void, and when α = 1, it is equivalent to Massart's condition.


6.4 Local Rademacher Averages

The local Rademacher average restricts the supremum appearing in the Rademacher average to functions with small variance, R(F, r) = E[ sup_{f∈F : Pf² ≤ r} R_n f ]; below, ⋆F denotes the star-hull {αf : f ∈ F, α ∈ [0, 1]} of the class. The reason for this definition is that, as we have seen before, the crucial ingredient to obtain better rates of convergence is to use the variance of the functions. Localizing the Rademacher average allows to focus on the part of the function class where the fast rate phenomenon occurs, that is, functions with small variance.

Next we introduce the concept of a sub-root function, a real-valued function with certain monotony properties.

Definition 9 (Sub-root function). A function ψ is sub-root if

(i) ψ is non-decreasing,
(ii) ψ is non-negative,
(iii) ψ(r)/√r is non-increasing.

An immediate consequence of this definition is the following result.

Lemma 5. A sub-root function is continuous and has a unique positive fixed point r*, i.e. a unique r* > 0 with ψ(r*) = r*.


It turns out that the local Rademacher average behaves like a sub-root function, and thus has a unique fixed point. This fixed point will turn out to be the key quantity in the relative error bounds.

Lemma 6. For any class of functions F,

r ↦ R_n(⋆F, r) is sub-root.

One legitimate question is whether taking the star-hull does not enlarge the class too much. One way to see what the effect is on the size of the class is to compare the metric entropy (log covering numbers) of F and of ⋆F. It is possible to see that the entropy increases only by a logarithmic factor, which is essentially negligible.

Result. We now state the main result involving local Rademacher averages and their fixed point.

Theorem 8. Let F be a class of bounded functions (e.g. f ∈ [-1, 1]) and let r* be the fixed point of R(⋆F, r). There exists a constant C > 0 such that with probability at least 1 - δ,

∀f ∈ F, Pf - P_n f ≤ C ( √(r* Var f) + (log(1/δ) + log log n)/n ).

If in addition the functions in F satisfy Var f ≤ c (Pf)^β, then one obtains that with probability at least 1 - δ,

∀f ∈ F, Pf ≤ C ( P_n f + (r*)^{1/(2-β)} + (log(1/δ) + log log n)/n ).

Proof. We only give the main steps of the proof.

1. The starting point is Talagrand's inequality for empirical processes, a generalization of McDiarmid's inequality of Bernstein type (i.e. which includes the variance). This inequality tells us that with high probability,

sup_{f∈F} (Pf - P_n f) ≤ E[ sup_{f∈F} (Pf - P_n f) ] + c √( sup_{f∈F} Var f / n ) + c'/n,

for some constants c, c'.

2. The second step consists in 'peeling' the class, that is, splitting the class into subclasses according to the variance of the functions:

F_k = { f : Var f ∈ [x^k, x^{k+1}) }, for some fixed x > 1.


3. We can then apply Talagrand's inequality to each of the subclasses separately, to get with high probability

sup_{f∈F_k} (Pf - P_n f) ≤ E[ sup_{f∈F_k} (Pf - P_n f) ] + c √( x Var f / n ) + c'/n.

4. Then the symmetrization lemma allows to introduce local Rademacher averages. We get that with high probability,

∀f ∈ F, Pf - P_n f ≤ 2 R(F, x Var f) + c √( x Var f / n ) + c'/n.

5. The sub-root property then enters: the local Rademacher average behaves like a square root function, since we can upper bound it by its value at the fixed point. With high probability,

Pf - P_n f ≤ 2 √( r* Var f ) + c √( x Var f / n ) + c'/n.

6. Finally, we use the relationship between variance and expectation,

Var f ≤ c (Pf)^β,

and solve the inequality in Pf to get the result.

We will not go into the details of how to apply the above result, but we give some remarks about its use.

An important example is the case where the class F is of finite VC dimension h. In that case, one has

R(F, r) ≤ C √( r h log n / n ),

so that r* ≤ C h log n / n. As a consequence, we obtain, under Tsybakov's condition, a rate of convergence of Pf_n to Pf* of O(1/n^{1/(2-α)}). It is important to note that in this case the rate of convergence of P_n f to Pf is O(1/√n). So we obtain a fast rate by looking at the relative error. These fast rates can be obtained provided t ∈ G (but it is not needed that R* = 0). This requirement can be removed if one uses structural risk minimization or regularization.

Another related result is that, as in the global case, one can obtain a bound with data-dependent (i.e. conditional) local Rademacher averages:

R_n(F, r) = E_σ[ sup_{f∈F : Pf² ≤ r} R_n f ].

The result is the same as before (with different constants) under the same conditions as in Theorem 8: with probability at least 1 - δ,

Pf ≤ C ( P_n f + (r_n*)^{1/(2-β)} + (log(1/δ) + log log n)/n ),


where r_n* is the fixed point of a sub-root upper bound of R_n(F, r).

Hence, we can get improved rates when the noise is well-behaved, and these rates interpolate between n^{-1/2} and n^{-1}. However, it is not in general possible to estimate the parameters (c and α) entering the noise conditions, but we will not discuss this issue further here. Another point is that although the capacity measure that we use seems 'local', it does depend on all the functions in the class, but each of them is implicitly appropriately rescaled. Indeed, in R(⋆F, r), each function f ∈ F with Pf² ≥ r is considered at scale r/Pf².

Bibliographical remarks. Hoeffding's inequality appears in [19]. For a proof of the contraction principle we refer to Ledoux and Talagrand [20].

The Vapnik-Chervonenkis-Sauer-Shelah lemma was proved independently by Sauer [21], Shelah [22], and Vapnik and Chervonenkis [18]. For related combinatorial results we refer to Alesker [23], Alon, Ben-David, Cesa-Bianchi, and Haussler [24], and Cesa-Bianchi and Haussler [25].


The use of Rademacher averages in classification was first promoted by Koltchinskii.


B No Free Lunch

We can now give a formal definition of consistency and state the core results about the impossibility of universally good algorithms.

Definition 11 (Consistency). An algorithm is consistent if for any probability measure P,

lim_{n→∞} R(g_n) = R* almost surely.

It is important to understand the reasons that make possible the existence of consistent algorithms. In the case where the input space X is countable, things are somehow easy, since even if there is no relationship at all between inputs and outputs, by repeatedly sampling data independently from P one will get to see an increasing number of different inputs, which will eventually converge to all the inputs. So, in the countable case, an algorithm which would simply 'learn by heart' (i.e. make a majority vote when the instance has been seen before, and produce an arbitrary prediction otherwise) would be consistent.

In the case where X is not countable (e.g. X = ℝ), things are more subtle. Indeed, in that case there is a seemingly innocent assumption that becomes crucial: to be able to define a probability measure P on X, one needs a σ-algebra on that space, which is typically the Borel σ-algebra. So the hidden assumption is that P is a Borel measure. This means that the topology of X plays a role here, and thus the target function t will be Borel measurable. In a sense this guarantees that it is possible to approximate t from its value (or approximate value) at a finite number of points. The algorithms that will achieve consistency are thus those which use the topology, in the sense of 'generalizing' the observed values to neighborhoods (e.g. local classifiers). In a way, the measurability of t is one of the crudest notions of smoothness of functions.

We now cite two important results. The first one tells that for a fixed sample size, one can construct arbitrarily bad problems for a given algorithm.

Theorem 9 (No Free Lunch, see e.g. [4]). For any algorithm, any n and any ε > 0, there exists a distribution P such that R* = 0 and

P[ R(g_n) ≥ 1/2 - ε ] = 1.

The second result is more subtle and indicates that, given an algorithm, one can construct a problem for which this algorithm will converge as slowly as one wishes.

Theorem 10 (No Free Lunch at All, see e.g. [4]). For any algorithm, and any sequence (a_n) that converges to 0, there exists a probability distribution P such that R* = 0 and

R(g_n) ≥ a_n.

In the above theorem, the 'bad' probability measure is constructed on a countable set (where the outputs are not related at all to the inputs, so that no generalization is possible), and is such that the rate at which one gets to see new inputs is as slow as the convergence of a_n.


Finally we mention other notions of consistency.

Definition 12 (VC consistency of ERM). The ERM algorithm is consistent if for any probability measure P,

R(g_n) → R(g*) in probability,

and

R_n(g_n) → R(g*) in probability.

Definition 13 (VC non-trivial consistency of ERM). The ERM algorithm is non-trivially consistent for the set G and the probability distribution P if for any c ∈ ℝ,

inf_{f ∈ F : Pf > c} P_n f → inf_{f ∈ F : Pf > c} Pf in probability.

References

1. Vapnik, V.: Statistical Learning Theory. John Wiley, New York (1998)
2. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (1999)
3. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International, Belmont, CA (1984)
4. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996)
5. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley, New York (1973)
6. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, New York (1972)
7. Kearns, M., Vazirani, U.: An Introduction to Computational Learning Theory. MIT Press, Cambridge, Massachusetts (1994)
8. Kulkarni, S., Lugosi, G., Venkatesh, S.: Learning pattern classification - a survey. IEEE Transactions on Information Theory 44 (1998) 2178-2206. Information Theory: 1948-1998, commemorative special issue
9. Lugosi, G.: Pattern classification and learning theory. In Györfi, L., ed.: Principles of Nonparametric Learning. Springer, Vienna (2002)
10. McLachlan, G.: Discriminant Analysis and Statistical Pattern Recognition. John Wiley, New York (1992)
11. Mendelson, S.: A few notes on statistical learning theory. In Mendelson, S., Smola, A., eds.: Advanced Lectures in Machine Learning. LNCS 2600, Springer (2003) 1-40
12. Natarajan, B.: Machine Learning: A Theoretical Approach. Morgan Kaufmann, San Mateo, CA (1991)
13. Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982)
14. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)


16. von Luxburg, U., Bousquet, O., Schölkopf, B.: A compression approach to support vector model selection. Journal of Machine Learning Research 5 (2004) 293-323
17. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics 1989. Cambridge University Press, Cambridge (1989) 148-188
18. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (1971) 264-280
19. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (1963) 13-30


39. Vapnik, V., Chervonenkis, A.: Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications 26 (1981) 821-832
40. Assouad, P.: Densité et dimension. Annales de l'Institut Fourier 33 (1983) 233-282
41. Cover, T.: Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers 14 (1965)
42. Dudley, R.: Balls in R^k do not cut all subsets of k + 2 points. Advances in Mathematics 31 (3) (1979) 306-308
43. Goldberg, P., Jerrum, M.: Bounding the Vapnik-Chervonenkis dimension of concept classes parametrized by real numbers. Machine Learning 18 (1995)
44. Dudley, R.: Some special Vapnik-Chervonenkis classes. Discrete Mathematics 33 (1981) 313-318