
  • Decision Trees Some exercises

    0.

  • Exemplifying

    how to compute information gains

    and how to work with decision stumps

    CMU, 2013 fall, W. Cohen E. Xing, sample questions, pr. 4

    1.

  • Timmy wants to know how to do well in the ML exam. He collects the old statistics and decides to use decision trees to build his model. He now has 9 data points and two features: "whether he stayed up late before the exam" (S) and "whether he attended all the classes" (A). We already know the statistics, as below:

    Set(all) = [5+, 4−]
    Set(S+) = [3+, 2−], Set(S−) = [2+, 2−]
    Set(A+) = [5+, 1−], Set(A−) = [0+, 3−]

    Suppose we are going to use the feature that gains the most information at the first split. Which feature should we choose? How much is the information gain?

    You may use the following approximations:

    N        3     5     7
    log2 N   1.58  2.32  2.81

    2.

  • [Decision stumps considered at the root, applied to [5+, 4−]:
    S: S+ → [3+, 2−], S− → [2+, 2−] (H = 1);
    A: A+ → [5+, 1−], A− → [0+, 3−] (H = 0).]

    H(all) := H[5+, 4−] = H(5/9) = H(4/9) := (5/9)·log2(9/5) + (4/9)·log2(9/4)
    = log2 9 − (5/9)·log2 5 − (4/9)·log2 4 = 2·log2 3 − (5/9)·log2 5 − 8/9 = 0.991076

    H(all | S) := (5/9)·H[3+, 2−] + (4/9)·H[2+, 2−] = (5/9)·0.970951 + (4/9)·1 = 0.983861

    H(all | A) := (6/9)·H[5+, 1−] + (3/9)·H[0+, 3−] = (6/9)·0.650022 + (3/9)·0 = 0.433348

    IG(all, S) := H(all) − H(all | S) = 0.007215

    IG(all, A) := H(all) − H(all | A) = 0.557728

    Therefore A should be chosen for the first split. Note that IG(all, S) < IG(all, A) ⇔ H(all | S) > H(all | A).
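    A minimal Python sketch (not part of the original slides; the helper names are my own) that reproduces the entropy and information-gain numbers above:

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a [pos+, neg-] set, in bits."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c > 0:
            p = c / total
            h -= p * log2(p)
    return h

def info_gain(parent, children):
    """parent and children are (pos, neg) pairs; the children partition the parent."""
    n = sum(parent)
    h_cond = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - h_cond

# Timmy's data: all = [5+,4-]; S splits it into [3+,2-]/[2+,2-], A into [5+,1-]/[0+,3-]
print(info_gain((5, 4), [(3, 2), (2, 2)]))  # IG(all, S) ~ 0.0072
print(info_gain((5, 4), [(5, 1), (0, 3)]))  # IG(all, A) ~ 0.5577
```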

    3.

  • Exemplifying the application of the ID3 algorithm

    on a toy mushrooms dataset

    CMU, 2002(?) spring, Andrew Moore, midterm example questions, pr. 2

    4.

  • You are stranded on a deserted island. Mushrooms of various types grow widely all over the island, but no other food is anywhere to be found. Some of the mushrooms have been determined to be poisonous and others not (determined by your former companions' trial and error). You are the only one remaining on the island. You have the following data to consider:

    Example NotHeavy Smelly Spotted Smooth Edible

    A 1 0 0 0 1

    B 1 0 1 0 1

    C 0 1 0 1 1

    D 0 0 0 1 0

    E 1 1 1 0 0

    F 1 0 1 1 0

    G 1 0 0 1 0

    H 0 1 0 0 0

    U 0 1 1 1 ?

    V 1 1 0 1 ?

    W 1 1 0 0 ?

    You know whether or not mushrooms A through H are poisonous, but you do not know about U through W.

    5.

  • For questions a–d, consider only mushrooms A through H.

    a. What is the entropy of Edible?

    b. Which attribute should you choose as the root of a decision tree?
    Hint: You can figure this out by looking at the data, without explicitly computing the information gain of all four attributes.

    c. What is the information gain of the attribute you chose in the previous question?

    d. Build an ID3 decision tree to classify mushrooms as poisonous or not.

    e. Classify mushrooms U, V and W using the decision tree as poisonous or not poisonous.

    f. If the mushrooms A through H that you know are not poisonous suddenly became scarce, should you consider trying U, V and W? Which one(s) and why? Or if none of them, then why not?

    6.

  • a.

    H_Edible = H[3+, 5−] := −(3/8)·log2(3/8) − (5/8)·log2(5/8) = (3/8)·log2(8/3) + (5/8)·log2(8/5)
    = (3/8)·3 − (3/8)·log2 3 + (5/8)·3 − (5/8)·log2 5 = 3 − (3/8)·log2 3 − (5/8)·log2 5 ≈ 0.9544

    7.

  • b.

    [Candidate decision stumps at the root (node 0), applied to [3+, 5−]:
    NotHeavy: 0 → [1+, 2−], 1 → [2+, 3−];
    Smelly:   0 → [2+, 3−], 1 → [1+, 2−];
    Spotted:  0 → [2+, 3−], 1 → [1+, 2−];
    Smooth:   0 → [2+, 2−] (Node 1), 1 → [1+, 3−] (Node 2).]

    8.

  • c.

    H_{0/Smooth} := (4/8)·H[2+, 2−] + (4/8)·H[1+, 3−] = (1/2)·1 + (1/2)·((1/4)·log2 4 + (3/4)·log2(4/3))
    = 1/2 + (1/2)·(2 − (3/4)·log2 3) = 3/2 − (3/8)·log2 3 ≈ 0.9056

    IG_{0/Smooth} := H_Edible − H_{0/Smooth} = 0.9544 − 0.9056 = 0.0488

    9.

  • d.

    H_{0/NotHeavy} := (3/8)·H[1+, 2−] + (5/8)·H[2+, 3−]
    = (3/8)·((1/3)·log2 3 + (2/3)·log2(3/2)) + (5/8)·((2/5)·log2(5/2) + (3/5)·log2(5/3))
    = (3/8)·(log2 3 − 2/3) + (5/8)·(log2 5 − (3/5)·log2 3 − 2/5)
    = (3/8)·log2 3 − 2/8 + (5/8)·log2 5 − (3/8)·log2 3 − 2/8 = (5/8)·log2 5 − 4/8 ≈ 0.9512

    ⇒ IG_{0/NotHeavy} := H_Edible − H_{0/NotHeavy} = 0.9544 − 0.9512 = 0.0032

    By similar computations, IG_{0/NotHeavy} = IG_{0/Smelly} = IG_{0/Spotted} = 0.0032 < IG_{0/Smooth} = 0.0488.

    10.

  • Important Remark

    Instead of actually computing these information gains, in order to determine the "best" attribute it would have been enough to compare the mean conditional entropies H_{0/Smooth} and H_{0/NotHeavy}:

    IG_{0/Smooth} > IG_{0/NotHeavy} ⇔ H_{0/Smooth} < H_{0/NotHeavy}

    ⇔ 3/2 − (3/8)·log2 3 < (5/8)·log2 5 − 1/2

    ⇔ 12 − 3·log2 3 < 5·log2 5 − 4

    ⇔ 16 < 5·log2 5 + 3·log2 3

    ⇔ 16 < 11.6096 + 4.7548 (true)

    11.

  • Node 1: Smooth = 0

    [Candidate decision stumps at Node 1, applied to [2+, 2−]:
    Smelly:   0 → [2+, 0−], 1 → [0+, 2−] (both leaves pure);
    NotHeavy: 0 → [0+, 1−], 1 → [2+, 1−];
    Spotted:  0 → [1+, 1−], 1 → [1+, 1−].
    Smelly is therefore selected at Node 1.]

    12.

  • Node 2: Smooth = 1

    [Candidate decision stumps at Node 2, applied to [1+, 3−]:
    NotHeavy: 0 → [1+, 1−] (Node 3), 1 → [0+, 2−];
    Smelly:   0 → [0+, 3−], 1 → [1+, 0−] (both leaves pure);
    Spotted:  0 → [1+, 2−], 1 → [0+, 1−].
    Smelly is therefore selected at Node 2.]

    13.

  • The resulting ID3 Tree

    [Root: Smooth, [3+, 5−].
    Smooth = 0 → [2+, 2−], test Smelly: 0 → Edible ([2+, 0−]), 1 → ¬Edible ([0+, 2−]);
    Smooth = 1 → [1+, 3−], test Smelly: 0 → ¬Edible ([0+, 3−]), 1 → Edible ([1+, 0−]).]

    IF (Smooth = 0 AND Smelly = 0) OR (Smooth = 1 AND Smelly = 1)
    THEN Edible; ELSE ¬Edible;

    Classification of test instances:

    U: Smooth = 1, Smelly = 1 ⇒ Edible = 1
    V: Smooth = 1, Smelly = 1 ⇒ Edible = 1
    W: Smooth = 0, Smelly = 1 ⇒ Edible = 0
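    A small sketch (my own helper, not from the slides) that encodes the learned rule and applies it to the unlabelled mushrooms U, V and W:

```python
def edible(smelly, smooth):
    # Edible iff (Smooth = 0 AND Smelly = 0) OR (Smooth = 1 AND Smelly = 1),
    # i.e. iff Smelly and Smooth agree.
    return int(smelly == smooth)

# test mushrooms, as (NotHeavy, Smelly, Spotted, Smooth)
tests = {"U": (0, 1, 1, 1), "V": (1, 1, 0, 1), "W": (1, 1, 0, 0)}
for name, (_, smelly, _, smooth) in tests.items():
    print(name, "-> Edible =", edible(smelly, smooth))   # U: 1, V: 1, W: 0
```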

    14.

  • Exemplifying the greedy character of

    the ID3 algorithm

    CMU, 2003 fall, T. Mitchell, A. Moore, midterm, pr. 9.a

    15.

  • Consider the binary input attributes A, B, C, the output attribute Y, and the following training examples:

    A B C | Y
    1 1 0 | 0
    1 0 1 | 1
    0 1 1 | 1
    0 0 1 | 0

    a. Determine the decision tree computed by the ID3 algorithm. Is this decision tree consistent with the training data?

    16.

  • Answer

    Node 0 (the root), [2+, 2−]. Candidate decision stumps:
    A: 0 → [1+, 1−], 1 → [1+, 1−];
    B: 0 → [1+, 1−], 1 → [1+, 1−];
    C: 0 → [0+, 1−] (leaf 0), 1 → [2+, 1−] (Node 1).

    One immediately sees that the first two decision stumps have IG = 0, while the third decision stump has IG > 0. Therefore, at node 0 (the root) we place the attribute C.

    17.

  • Node 1: we have to classify the instances with C = 1, so the choice is made between the attributes A and B.

    A: 0 → [1+, 1−] (Node 2), 1 → [1+, 0−];   B: 0 → [1+, 1−], 1 → [1+, 0−]   (both applied to [2+, 1−]).

    The two mean conditional entropies are equal:

    H_{1/A} = H_{1/B} = (2/3)·H[1+, 1−] + (1/3)·H[1+, 0−]

    Therefore we may choose either of the two attributes. For definiteness, we choose A.

    18.

  • Node 2: at this node the only attribute still available is B, so we place it here.

    The complete ID3 tree is shown nearby.

    By construction, the ID3 algorithm is consistent with the training data whenever these data are consistent (i.e., non-contradictory). In our case, one immediately checks that the training data are consistent.

    [The ID3 tree: root C, [2+, 2−]; C = 0 → leaf 0 ([0+, 1−]); C = 1 → [2+, 1−], test A; A = 1 → leaf 1 ([1+, 0−]); A = 0 → [1+, 1−], test B; B = 0 → leaf 0 ([0+, 1−]); B = 1 → leaf 1 ([1+, 0−]).]

    19.

  • b. Is there a decision tree of smaller depth (than that of the ID3 tree) consistent with the data above? If so, what (logical) concept does this tree represent?

    Answer:

    From the data one can see that the output attribute Y is in fact the logical function A XOR B.

    Representing this function as a decision tree, we obtain the nearby tree: root A, [2+, 2−]; A = 0 → [1+, 1−], test B (B = 0 → 0 [0+, 1−], B = 1 → 1 [1+, 0−]); A = 1 → [1+, 1−], test B (B = 0 → 1 [1+, 0−], B = 1 → 0 [0+, 1−]).

    This tree has one level fewer than the tree built by the ID3 algorithm.

    Therefore, the ID3 tree is not optimal with respect to the number of levels.

    20.

  • This is a consequence of the "greedy" character of the ID3 algorithm, due to the fact that at each iteration we choose the "best" attribute with respect to the information gain criterion.

    It is well known that greedy algorithms do not guarantee reaching the global optimum.

    21.

  • Exemplifying the application of the ID3 algorithm

    in the presence of both

    categorical and continuous attributes

    CMU, 2012 fall, Eric Xing, Aarti Singh, HW1, pr. 1.1

    22.

  • As of September 2012, 800 extrasolar planets have been identified in our galaxy. Super-secret surveying spaceships sent to all these planets have established whether they are habitable for humans or not, but sending a spaceship to each planet is expensive. In this problem, you will come up with decision trees to predict if a planet is habitable based only on features observable using telescopes.

    a. In the nearby table you are given the data from all 800 planets surveyed so far. The features observed by telescope are Size ("Big" or "Small") and Orbit ("Near" or "Far"). Each row indicates the values of the features and habitability, and how many times that set of values was observed. So, for example, there were 20 "Big" planets "Near" their star that were habitable.

    Size  Orbit Habitable Count
    Big   Near  Yes       20
    Big   Far   Yes       170
    Small Near  Yes       139
    Small Far   Yes       45
    Big   Near  No        130
    Big   Far   No        30
    Small Near  No        11
    Small Far   No        255

    Derive and draw the decision tree learned by ID3 on this data (use the maximum information gain criterion for splits, don't do any pruning). Make sure to clearly mark at each node what attribute you are splitting on, and which value corresponds to which branch. By each leaf node of the tree, write in the number of habitable and non-habitable planets in the training data that belong to that node.

    23.

  • Answer: Level 1

    Root: [374+, 426−], H(374/800) ≈ 0.9969.

    Size:  Big → [190+, 160−] (H(19/35)),  Small → [184+, 266−] (H(92/225))
    Orbit: Near → [159+, 141−] (H(47/100)), Far → [215+, 285−] (H(43/100))

    H(Habitable | Size) = (35/80)·H(19/35) + (45/80)·H(92/225) = (35/80)·0.9946 + (45/80)·0.9759 = 0.9841

    H(Habitable | Orbit) = (3/8)·H(47/100) + (5/8)·H(43/100) = (3/8)·0.9974 + (5/8)·0.9858 = 0.9901

    IG(Habitable; Size) = 0.0128
    IG(Habitable; Orbit) = 0.0067

    Therefore Size is chosen at the root.
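    A short sketch (my own helper names, not from the slides) that recomputes the two mean conditional entropies directly from the count table:

```python
from math import log2

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# counts from the table: (Size, Orbit) -> (habitable, not habitable)
counts = {("Big", "Near"): (20, 130), ("Big", "Far"): (170, 30),
          ("Small", "Near"): (139, 11), ("Small", "Far"): (45, 255)}

def cond_entropy(attr_index):
    n = sum(sum(v) for v in counts.values())
    h = 0.0
    for value in {k[attr_index] for k in counts}:
        pos = sum(v[0] for k, v in counts.items() if k[attr_index] == value)
        neg = sum(v[1] for k, v in counts.items() if k[attr_index] == value)
        h += (pos + neg) / n * H(pos / (pos + neg))
    return h

print(cond_entropy(0))  # H(Habitable | Size)  ~ 0.984
print(cond_entropy(1))  # H(Habitable | Orbit) ~ 0.990
```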

    24.

  • The final decision tree

    [Root: Size, [374+, 426−].
    Big → [190+, 160−], test Orbit: Near → [20+, 130−] (leaf −), Far → [170+, 30−] (leaf +);
    Small → [184+, 266−], test Orbit: Near → [139+, 11−] (leaf +), Far → [45+, 255−] (leaf −).]

    25.

  • b. For just 9 of the planets, a third feature, Temperature (in Kelvin degrees), has been measured, as shown in the nearby table. Redo all the steps from part a on this data using all three features. For the Temperature feature, in each iteration you must maximize over all possible binary thresholding splits (such as T ≤ 250 vs. T > 250, for example).

    Size  Orbit Temperature Habitable
    Big   Far   205         No
    Big   Near  205         No
    Big   Near  260         Yes
    Big   Near  380         Yes
    Small Far   205         No
    Small Far   260         Yes
    Small Near  260         Yes
    Small Near  380         No
    Small Near  380         No

    According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?

    Hint: You might need to use the following values of the entropy function for a Bernoulli variable of parameter p:
    H(1/3) = 0.9182, H(2/5) = 0.9709, H(92/225) = 0.9759, H(43/100) = 0.9858, H(16/35) = 0.9946, H(47/100) = 0.9974.

    26.

  • Answer

    Binary threshold splits for the continuous attribute Temperature: the observed values are 205, 260 and 380, which give the candidate thresholds 232.5 and 320.

    27.

  • Answer: Level 1

    H(Habitable | Size) = (4/9)·H[2+, 2−] + (5/9)·H[2+, 3−] = 4/9 + (5/9)·H(2/5) = 4/9 + (5/9)·0.9709 = 0.9838

    H(Habitable | Temp ≤ 232.5) = (3/9)·H[0+, 3−] + (6/9)·H[4+, 2−] = (2/3)·H(1/3) = (2/3)·0.9182 = 0.6121

    IG(Habitable; Size) = H(4/9) − 0.9838 ≈ 0.0072

    IG(Habitable; Temp ≤ 232.5) = H(4/9) − 0.6121 ≈ 0.3788

    Therefore the test Temp ≤ 232.5 is placed at the root; its Yes branch ([0+, 3−]) is a pure '−' leaf, and its No branch ([4+, 2−]) must be split further.

    28.

  • Answer: Level 2

    The node to be split contains [4+, 2−] (the instances with Temp > 232.5). Candidate tests:

    Size:       Big → [2+, 0−] (H = 0),   Small → [2+, 2−] (H = 1)
    Orbit:      Far → [1+, 0−] (H = 0),   Near → [3+, 2−] (H(2/5))
    Temp ≤ 320: Yes → [3+, 0−] (H = 0),   No → [1+, 2−] (H(1/3))

    Note: The plain lines in the original figure indicate that both the specific conditional entropies and their coefficients (weights) in the mean conditional entropies satisfy the indicated relationship. (For example, H(2/5) > H(1/3) and 5/6 > 3/6.) The dotted lines indicate that only the specific conditional entropies satisfy the indicated relationship. (For example, H[2+, 2−] = 1 > H(2/5) but 4/6 < 5/6.)

    29.

  • The final decision tree:

    [Root: Temp ≤ 232.5, applied to [4+, 5−]; the Yes branch is a '−' leaf ([0+, 3−]); the No branch ([4+, 2−]) is split further, its sub-branches ending in '+' and '−' leaves.]

  • Exemplifying

    the application of the ID3 algorithm

    on continuous attributes,

    and in the presence of noise.

    Decision surfaces; decision boundaries.

    The computation of the LOOCV error

    CMU, 2002 fall, Andrew Moore, midterm, pr. 3

    31.

  • Suppose we are learning a classifier with binary output values Y = 0 and Y = 1. There is one real-valued input X. The training data is given in the nearby table.

    Assume that we learn a decision tree on this data. Assume that when the decision tree splits on the real-valued attribute X, it puts the split threshold halfway between the attributes that surround the split. For example, using information gain as the splitting criterion, the decision tree would initially choose to split at X = 5, which is halfway between the X = 4 and X = 6 datapoints.

    X    Y
    1    0
    2    0
    3    0
    4    0
    6    1
    7    1
    8    1
    8.5  0
    9    1
    10   1

    Let Algorithm DT2 be the method of learning a decision tree with only two leaf nodes (i.e., only one split).

    Let Algorithm DT⋆ be the method of learning a decision tree fully, with no pruning.

    a. What will be the training set error for DT2 and DT⋆, respectively, on our data?

    b. What will be the leave-one-out cross-validation (LOOCV) error for DT2 and DT⋆, respectively, on our data?

    32.

  • • training data: the ten points above, placed on the X axis (0 to 10);

    • discretization / decision thresholds: 5, 8.25 and 8.75;

    • compact representation of the ID3 tree: [figure];

    • decision "surfaces": X < 5 → −, 5 < X < 8.25 → +, 8.25 < X < 8.75 → −, X > 8.75 → +.

  • ID3 tree: [figure]

  • ID3: IG computations [figure]

  • ID3, LOOCV: decision surfaces

    LOOCV error: 3/10

    Decision surfaces obtained when each instance is left out in turn (thresholds on the X axis, zones alternating − / + from left to right):

    X = 1, 2, 3:  5, 8.25, 8.75
    X = 4:        4.5, 8.25, 8.75
    X = 6:        5.5, 8.25, 8.75
    X = 7:        5, 8.25, 8.75
    X = 8:        5, 7.75, 8.75
    X = 8.5:      5
    X = 9:        5, 8.25, 9.25
    X = 10:       5, 8.25, 8.75

    (The left-out instances X = 8, 8.5 and 9 are misclassified, hence the LOOCV error of 3/10.)

    35.

  • DT2

    Decision "surfaces": X < 5 → −, X ≥ 5 → + (a single split at X = 5). [figure]

  • DT2, LOOCV: IG computations, Case 1: X = 1, 2, 3, 4 [figure]

  • DT2, LOOCV: IG computations (cont'd), Case 3: X = 8.5 [figure]

  • Applying ID3 on a dataset with two continuous attributes:

    decision zones

    Liviu Ciortuz, 2017

    39.

  • Consider the training dataset in the nearby figure. X1 and X2 are considered continuous attributes.

    Apply the ID3 algorithm on this dataset. Draw the resulting decision tree.

    Make a graphical representation of the decision areas and decision boundaries determined by ID3.

    [Figure: the training dataset, nine points labelled + or −, in the (X1, X2) plane, with X1 ∈ [0, 5] and X2 ∈ [0, 4].]

    40.

  • Solution

    Level 1:

    Candidate splits at the root ([4+, 5−]): X1 < 5/2, X1 < 9/2, X2 < 3/2, X2 < 5/2 and X2 < 7/2, producing the partitions [2+, 0−]/[2+, 5−], [2+, 4−]/[2+, 1−], [1+, 1−]/[3+, 4−], [3+, 2−]/[1+, 3−] and [4+, 2−]/[0+, 3−].

    The best split is X2 < 7/2, with children [4+, 2−] (Yes) and [0+, 3−] (No, a pure '−' leaf): H[Y | X2 < 7/2] = (2/3)·H(1/3), hence IG = 0.378. The next best candidates yield IG = 0.319 (mean conditional entropy (7/9)·H(2/7)) and IG = 0.091 (mean conditional entropy (5/9)·H(2/5) + (4/9)·H(1/4)).

    Therefore X2 < 7/2 is placed at the root.

    41.

  • Level 2:

    The node to be split is the Yes child of the root, [4+, 2−]. Candidate splits:

    X1 < 5/2: [2+, 0−]/[2+, 2−], H[Y | .] = 2/3, IG = 0.251
    X1 < 4:   H[Y | .] = 2/3, IG = 0.251
    X2 < 3/2: [1+, 1−]/[3+, 1−], H[Y | .] = 1/3 + (2/3)·H(1/4), IG = 0.04
    X2 < 5/2: [3+, 2−]/[1+, 0−], H[Y | .] = (5/6)·H(2/5), IG = 0.109

    Notes:

    1. Split thresholds for continuous attributes must be recomputed at each new iteration, because they may change. (For instance, here above, 4 replaces 4.5 as a threshold for X1.)

    2. In the current stage, i.e., for the current node in the ID3 tree, you may choose (as test) either X1 < 5/2 or X1 < 4.

    3. Here above we have an example of a reversed relationship between weighted and un-weighted specific entropies: H[2+, 2−] > H[3+, 2−] but (4/6)·H[2+, 2−] < (5/6)·H[3+, 2−].
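    A small sketch (my own helper names, not from the slides) of how candidate split thresholds for a continuous attribute are generated, and re-generated at each node: midpoints between consecutive attribute values whose observed labels differ.

```python
def candidate_thresholds(values, labels):
    pts = sorted(set(values))
    # label sets observed at each distinct attribute value
    groups = {v: {l for v2, l in zip(values, labels) if v2 == v} for v in pts}
    return [(a + b) / 2 for a, b in zip(pts, pts[1:]) if groups[a] != groups[b]]

# e.g. for the 1D projection of some node's instances:
print(candidate_thresholds([1, 2, 3, 4, 5], ["-", "-", "+", "+", "-"]))  # [2.5, 4.5]
```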

    42.

  • The final decision tree:

    [Root: X2 < 7/2, [4+, 5−]. No → '−' leaf ([0+, 3−]). Yes → [4+, 2−], test X1 < 5/2: Yes → '+' leaf ([2+, 0−]); No → [2+, 2−], one more test (X1 < 4 in the figure) separating a '+' leaf ([2+, 0−]) from a '−' leaf ([0+, 2−]).]

    Decision areas: [figure: the corresponding axis-parallel '+' and '−' regions in the (X1, X2) plane, delimited by the lines X2 = 7/2, X1 = 5/2 and X1 = 4.]

    43.

  • Exemplifying

    pre- and post-pruning of decision trees

    using a threshold for the Information Gain

    CMU, 2006 spring, Carlos Guestrin, midterm, pr. 4

    [adapted by Liviu Ciortuz]

    44.

  • Starting from the data in the following table, the ID3 algorithm builds the decision tree shown nearby.

    V W X | Y
    0 0 0 | 0
    0 1 0 | 1
    1 0 0 | 1
    1 1 0 | 0
    1 1 1 | 1

    [The ID3 tree: root X; X = 1 → leaf 1; X = 0 → test V; V = 0 → test W (W = 0 → 0, W = 1 → 1); V = 1 → test W (W = 0 → 1, W = 1 → 0).]

    a. One idea for pruning such a decision tree would be to start at the root, and prune splits for which the information gain (or some other criterion) is less than some small ε. This is called top-down pruning. What is the decision tree returned for ε = 0.0001? What is the training set error for this tree?

    45.

  • Answer

    We will first augment the given decision tree with information regarding the data partitions (i.e., the number of positive and, respectively, negative instances) which were assigned to each test node during the application of the ID3 algorithm:

    [root X: [3+; 2−]; X = 0 → V: [2+; 2−]; V = 0 → W: [1+; 1−] (W = 0 → 0 [0+; 1−], W = 1 → 1 [1+; 0−]); V = 1 → W: [1+; 1−] (W = 0 → 1 [1+; 0−], W = 1 → 0 [0+; 1−]); X = 1 → 1 [1+; 0−].]

    The information gain yielded by the attribute X in the root node is:

    H[3+; 2−] − (1/5)·0 − (4/5)·1 = 0.971 − 0.8 = 0.171 > ε.

    Therefore, this node will not be eliminated from the tree.

    The information gain for the attribute V (in the left-hand side child of the root node) is:

    H[2+; 2−] − (1/2)·1 − (1/2)·1 = 1 − 1 = 0 < ε.

    So, the whole left subtree will be cut off and replaced by a decision node, as shown nearby: [root X; X = 0 → 0; X = 1 → 1]. The training error produced by this tree is 2/5.

    46.

  • b. Another option would be to start at the leaves, and prune subtrees for which the information gain (or some other criterion) of a split is less than some small ε. In this method, no ancestors of children with high information gain will get pruned. This is called bottom-up pruning. What is the tree returned for ε = 0.0001? What is the training set error for this tree?

    Answer:

    The information gain of V is IG(Y; V) = 0. A step later, the information gain of W (for either one of the descendent nodes of V) is IG(Y; W) = 1. So bottom-up pruning won't delete any nodes, and the tree [given in the problem statement] remains unchanged.

    The training error is 0.

    47.

  • c. Discuss when you would want to choose bottom-up pruning over top-down pruning and vice versa.

    Answer:

    Top-down pruning is computationally cheaper. When building the tree we can determine when to stop (no need for real pruning). But, as we saw, top-down pruning prunes too much.

    On the other hand, bottom-up pruning is more expensive, since we have to first build a full tree (which can be exponentially large) and then apply pruning. The second problem with bottom-up pruning is that superfluous attributes may fool it (see CMU, 2009 fall, Carlos Guestrin, HW1, pr. 2.4). The third problem with it is that in the lower levels of the tree the number of examples in the subtree gets smaller, so information gain might be an inappropriate criterion for pruning; one would usually use a statistical test instead.

    48.

  • Exemplifying

    χ2-Based Pruning of Decision Trees

    CMU, 2010 fall, Ziv Bar-Joseph, HW2, pr. 2.1

    49.

  • In class, we learned a decision tree pruning algorithm that iteratively visited subtrees and used a validation dataset to decide whether to remove the subtree. However, sometimes it is desirable to prune the tree after training on all of the available data.

    One such approach is based on statistical hypothesis testing.

    After learning the tree, we visit each internal node and test whether the attribute split at that node is actually uncorrelated with the class labels.

    We hypothesize that the attribute is independent and then use Pearson's chi-square test to generate a test statistic that may provide evidence that we should reject this "null" hypothesis. If we fail to reject the hypothesis, we prune the subtree at that node.

    50.

  • a. At each internal node we can create a contingency table for the training examples that pass through that node on their paths to the leaves. The table will have the c class labels associated with the columns and the r values of the split attribute associated with the rows.

    Each entry O_{i,j} in the table is the number of times we observe a training sample with that attribute value and label, where i is the row index that corresponds to an attribute value and j is the column index that corresponds to a class label.

    In order to calculate the chi-square test statistic, we need a similar table of expected counts. The expected count is the number of observations we would expect if the class and attribute are independent.

    Derive a formula for each expected count E_{i,j} in the table.

    Hint: What is the probability that a training example that passes through the node has a particular label? Using this probability and the independence assumption, what can you say about how many examples with a specific attribute value are expected to also have the class label?

    51.

  • b. Given these two tables for the split, you can now calculate the chi-square test statistic

    χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{i,j} − E_{i,j})² / E_{i,j}

    with (r − 1)(c − 1) degrees of freedom.

    You can plug the test statistic and degrees of freedom into a software package (a) or an online calculator (b) to calculate a p-value. Typically, if p < 0.05 we reject the null hypothesis that the attribute and class are independent and say the split is statistically significant.

    The decision tree given on the next slide was built from the data in the nearby table. For each of the 3 internal nodes in the decision tree, show the p-value for the split and state whether it is statistically significant. How many internal nodes will the tree have if we prune splits with p ≥ 0.05?

    (a) Use 1-chi2cdf(x,df) in MATLAB or CHIDIST(x,df) in Excel.
    (b) https://en.m.wikipedia.org/wiki/Chi-square distribution (table of χ² value vs p-value).
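    A short sketch (assuming SciPy is available) of the p-value computation that the problem suggests doing in MATLAB or Excel:

```python
from scipy.stats import chi2

def p_value(chi2_stat, df):
    return chi2.sf(chi2_stat, df)   # survival function = 1 - CDF

print(p_value(6.0, 1))   # ~0.0143
print(p_value(3.0, 1))   # ~0.0833
```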

    52.

  • Input:

    X1 X2 X3 X4 | Class
    1  1  0  0  | 0
    1  0  1  0  | 1
    0  1  0  0  | 0
    1  0  1  1  | 1
    0  1  1  1  | 1
    0  0  1  0  | 0
    1  0  0  0  | 1
    0  1  0  1  | 1
    1  0  0  1  | 1
    1  1  0  1  | 1
    1  1  1  1  | 1
    0  0  0  0  | 0

    [The decision tree built from this data: root X4, [4−, 8+]; X4 = 1 → leaf 1 ([0−, 6+]); X4 = 0 → [4−, 2+], test X1; X1 = 0 → leaf 0 ([3−, 0+]); X1 = 1 → [1−, 2+], test X2; X2 = 0 → leaf 1 ([0−, 2+]); X2 = 1 → leaf 0 ([1−, 0+]).]

    53.

  • Idea

    While traversing the ID3 tree [usually in a bottom-up manner], remove the nodes for which there is not enough ("significant") statistical evidence that there is a dependence between the values of the input attribute tested in that node and the values of the output attribute (the labels), supported by the set of instances assigned to that node.

    54.

  • Contingency tables

    O_{X4}          Class = 0  Class = 1
    X4 = 0          4          2
    X4 = 1          0          6
    (N = 12) ⇒ P(X4 = 0) = 6/12 = 1/2, P(X4 = 1) = 1/2;
               P(Class = 0) = 4/12 = 1/3, P(Class = 1) = 2/3.

    O_{X1 | X4=0}   Class = 0  Class = 1
    X1 = 0          3          0
    X1 = 1          1          2
    (N = 6) ⇒ P(X1 = 0 | X4 = 0) = 3/6 = 1/2, P(X1 = 1 | X4 = 0) = 1/2;
              P(Class = 0 | X4 = 0) = 4/6 = 2/3, P(Class = 1 | X4 = 0) = 1/3.

    O_{X2 | X4=0, X1=1}   Class = 0  Class = 1
    X2 = 0                0          2
    X2 = 1                1          0
    (N = 3) ⇒ P(X2 = 0 | X4 = 0, X1 = 1) = 2/3, P(X2 = 1 | X4 = 0, X1 = 1) = 1/3;
              P(Class = 0 | X4 = 0, X1 = 1) = 1/3, P(Class = 1 | X4 = 0, X1 = 1) = 2/3.

    55.

  • The reasoning that leads to the computation of the expected number of observations

    Under the independence assumption, P(A = i, C = j) = P(A = i)·P(C = j), where

    P(A = i) = (Σ_{k=1}^{c} O_{i,k}) / N   and   P(C = j) = (Σ_{k=1}^{r} O_{k,j}) / N,

    so P(A = i, C = j) = (Σ_{k=1}^{c} O_{i,k})·(Σ_{k=1}^{r} O_{k,j}) / N².

    Finally, E[A = i, C = j] = N·P(A = i, C = j).

    56.

  • Expected number of observations

    E_{X4}          Class = 0  Class = 1
    X4 = 0          2          4
    X4 = 1          2          4

    E_{X1 | X4=0}   Class = 0  Class = 1
    X1 = 0          2          1
    X1 = 1          2          1

    E_{X2 | X4=0, X1=1}   Class = 0  Class = 1
    X2 = 0                2/3        4/3
    X2 = 1                1/3        2/3

    For example, E_{X4}(X4 = 0, Class = 0): N = 12, P(X4 = 0) = 1/2 and P(Class = 0) = 1/3, so
    N·P(X4 = 0, Class = 0) = N·P(X4 = 0)·P(Class = 0) = 12·(1/2)·(1/3) = 2.

    57.

  • χ2 Statistics

    χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_{i,j} − E_{i,j})² / E_{i,j}

    χ²_{X4} = (4 − 2)²/2 + (2 − 4)²/4 + (0 − 2)²/2 + (6 − 4)²/4 = 2 + 1 + 2 + 1 = 6

    χ²_{X1 | X4=0} = (3 − 2)²/2 + (0 − 1)²/1 + (1 − 2)²/2 + (2 − 1)²/1 = 1/2 + 1 + 1/2 + 1 = 3

    χ²_{X2 | X4=0, X1=1} = (0 − 2/3)²/(2/3) + (2 − 4/3)²/(4/3) + (1 − 1/3)²/(1/3) + (0 − 2/3)²/(2/3) = 2/3 + 1/3 + 4/3 + 2/3 = 3

    p-values (with 1 degree of freedom each): 0.0143, 0.0833 and 0.0833, respectively.

    The first of these p-values is smaller than the 0.05 significance threshold, therefore the root node (X4) cannot be pruned.
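    A small sketch (my own helper, not from the slides) that recomputes the chi-square statistics above directly from the observed contingency tables:

```python
def chi_square(observed):
    r, c = len(observed), len(observed[0])
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(observed[i][j] for i in range(r)) for j in range(c)]
    expected = [[row_tot[i] * col_tot[j] / n for j in range(c)] for i in range(r)]
    return sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(r) for j in range(c))

print(chi_square([[4, 2], [0, 6]]))   # X4:              6.0
print(chi_square([[3, 0], [1, 2]]))   # X1 | X4=0:       3.0
print(chi_square([[0, 2], [1, 0]]))   # X2 | X4=0, X1=1: 3.0
```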

    58.

  • [Figure: the p-value as a function of Pearson's cumulative test statistic χ², plotted for k = 1, 2, 3, 4, 6 and 9 degrees of freedom.]

    59.

  • Output (pruned tree) for the 95% confidence level

    [The tree is reduced to the root test X4: X4 = 0 → class 0, X4 = 1 → class 1.]

    60.

  • The AdaBoost algorithm:

    the convergence (under certain conditions, to 0) of the training error

    CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5

    CMU, 2009 fall, Carlos Guestrin, HW2, pr. 3.1

    CMU, 2009 fall, Eric Xing, HW3, pr. 4.2.2

    61.

  • Consider m training examples S = {(x1, y1), ..., (xm, ym)}, where x ∈ X and y ∈ {−1, 1}. Suppose we have a weak learning algorithm A which produces a hypothesis h : X → {−1, 1} given any distribution D of examples. AdaBoost is an iterative algorithm which works as follows:

    • Begin with a uniform distribution D1(i) = 1/m, i = 1, ..., m.

    • At each round t = 1, ..., T,

      – run the weak learning algorithm A on the distribution Dt and produce the hypothesis ht;

        Note: Since A is a weak learning algorithm, the produced hypothesis ht at round t is only slightly better than random guessing, say, by a margin γt: εt = err_{Dt}(ht) = Pr_{x∼Dt}[y ≠ ht(x)] = 1/2 − γt.

        [If at a certain iteration t < T the weak classifier A cannot produce a hypothesis better than random guessing (i.e., γt = 0), or it produces a hypothesis for which εt = 0, then the AdaBoost algorithm should be stopped.]

      – update D_{t+1}(i) = (1/Zt)·Dt(i)·e^{−αt yi ht(xi)} for i = 1, ..., m, where αt := (1/2)·ln((1 − εt)/εt), and Zt is the normalizer.

    • In the end, deliver H = sign(Σ_{t=1}^{T} αt ht) as the learned hypothesis, which will act as a weighted majority vote.

    62.

  • We will prove that the training error errS(H) of AdaBoost decreases at a very fast rate, and in certain cases it converges to 0.

    Important Remark

    The above formulation of the AdaBoost algorithm states no restriction on the hypothesis ht delivered by the weak classifier A at iteration t, except that εt < 1/2.

    However, in another formulation of the AdaBoost algorithm (in a more general setup; see for instance MIT, 2006 fall, Tommi Jaakkola, HW4, problem 3), it is requested / recommended that the hypothesis ht be chosen by (approximately) optimizing the weighted-training-error criterion over a whole class of hypotheses like, for instance, decision trees of depth 1 (decision stumps).

    In this problem we will not be concerned with such a request, but we will comply with it, for instance, in problem CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.6, when showing how AdaBoost works in practice.

    63.

  • a. Show that the [particular] choice of αt = (1/2)·ln((1 − εt)/εt) implies err_{D_{t+1}}(ht) = 1/2.

    b. Show that D_{T+1}(i) = (m·Π_{t=1}^{T} Zt)^{−1}·e^{−yi f(xi)}, where f(x) = Σ_{t=1}^{T} αt ht(x).

    c. Show that errS(H) ≤ Π_{t=1}^{T} Zt, where errS(H) := (1/m)·Σ_{i=1}^{m} 1{H(xi) ≠ yi} is the training error produced by AdaBoost.

    d. Obviously, we would like to minimize the test set error produced by AdaBoost, but it is hard to do so directly. We thus settle for greedily optimizing the upper bound on the training error found at part c. Observe that Z1, ..., Z_{t−1} are determined by the first t − 1 iterations, and we cannot change them at iteration t. A greedy step we can take to minimize the training set error bound on round t is to minimize Zt. Prove that the value of αt that minimizes Zt (among all possible values for αt) is indeed αt = (1/2)·ln((1 − εt)/εt) (see the previous slide).

    e. Show that Π_{t=1}^{T} Zt ≤ e^{−2 Σ_{t=1}^{T} γt²}.

    f. From parts c and e, we know the training error decreases at an exponential rate with respect to T. Assume that there is a number γ > 0 such that γ ≤ γt for t = 1, ..., T. (This is called empirical γ-weak learnability.) How many rounds are needed to achieve a training error ε > 0? Please express it in big-O notation, T = O(·).

    64.

  • Solution

    a. If we define the correct set C = {i : yi ht(xi) ≥ 0} and the mistake set M = {i : yi ht(xi) < 0}, we can write:

    err_{D_{t+1}}(ht) = Σ_{i=1}^{m} D_{t+1}(i)·1{yi ≠ ht(xi)} = Σ_{i∈M} (1/Zt)·Dt(i)·e^{αt} = (1/Zt)·e^{αt}·Σ_{i∈M} Dt(i) = (1/Zt)·εt·e^{αt}

    Zt = Σ_{i=1}^{m} Dt(i)·e^{−αt yi ht(xi)} = Σ_{i∈C} Dt(i)·e^{−αt} + Σ_{i∈M} Dt(i)·e^{αt} = (1 − εt)·e^{−αt} + εt·e^{αt}   (1)

    e^{αt} = e^{(1/2)·ln((1−εt)/εt)} = √((1 − εt)/εt)   and   e^{−αt} = 1/e^{αt} = √(εt/(1 − εt))

    ⇒ Zt = (1 − εt)·√(εt/(1 − εt)) + εt·√((1 − εt)/εt) = 2·√(εt·(1 − εt))   (2)

    ⇒ err_{D_{t+1}}(ht) = (1/Zt)·εt·e^{αt} = εt·√((1 − εt)/εt) / (2·√(εt·(1 − εt))) = 1/2

    Note that (1 − εt)/εt > 0 because εt ∈ (0, 1/2).

    65.

  • b. We will expand Dt(i) recursively:

    D_{T+1}(i) = (1/Z_T)·D_T(i)·e^{−α_T yi h_T(xi)}
    = D_{T−1}(i)·(1/Z_{T−1})·e^{−α_{T−1} yi h_{T−1}(xi)}·(1/Z_T)·e^{−α_T yi h_T(xi)}
    ...
    = D_1(i)·(1/Π_{t=1}^{T} Zt)·e^{−Σ_{t=1}^{T} αt yi ht(xi)} = (1/(m·Π_{t=1}^{T} Zt))·e^{−yi f(xi)}.

    c. We will make use of the fact that the exponential loss upper-bounds the 0-1 loss, i.e., 1{x ≤ 0} ≤ e^{−x} for all x ∈ R. Then, using the result of part b,

    errS(H) = (1/m)·Σ_{i=1}^{m} 1{yi f(xi) ≤ 0} ≤ (1/m)·Σ_{i=1}^{m} e^{−yi f(xi)} = (1/m)·Σ_{i=1}^{m} m·(Π_{t=1}^{T} Zt)·D_{T+1}(i) = (Π_{t=1}^{T} Zt)·Σ_{i=1}^{m} D_{T+1}(i) = Π_{t=1}^{T} Zt.

  • d. We will start from the equation

    Zt = εt·e^{αt} + (1 − εt)·e^{−αt},

    which has been proven at part a. Note that in the right-hand side, εt (the error produced by ht, the hypothesis returned by the weak classifier A at the current step) is a constant with respect to αt. Then we proceed as usual, setting the partial derivative w.r.t. αt to zero:

    ∂/∂αt (εt·e^{αt} + (1 − εt)·e^{−αt}) = 0 ⇔ εt·e^{αt} − (1 − εt)·e^{−αt} = 0
    ⇔ εt·(e^{αt})² = 1 − εt ⇔ e^{2αt} = (1 − εt)/εt ⇔ 2αt = ln((1 − εt)/εt) ⇔ αt = (1/2)·ln((1 − εt)/εt).

    Note that (1 − εt)/εt > 1 (and therefore αt > 0) because εt ∈ (0, 1/2).

    It can also be immediately shown that αt = (1/2)·ln((1 − εt)/εt) is indeed the value at which the expression εt·e^{αt} + (1 − εt)·e^{−αt}, and therefore Zt too, reaches its minimum:

    εt·e^{αt} − (1 − εt)·e^{−αt} > 0 ⇔ e^{2αt} − (1 − εt)/εt > 0 ⇔ αt > (1/2)·ln((1 − εt)/εt).

    67.

  • e. Making use of the relationship (2) proven at part a, and using the fact that 1 − x ≤ e^{−x} for all x ∈ R, we can write:

    Π_{t=1}^{T} Zt = Π_{t=1}^{T} 2·√(εt·(1 − εt)) = Π_{t=1}^{T} 2·√((1/2 − γt)·(1 − (1/2 − γt))) = Π_{t=1}^{T} √(1 − 4γt²)
    ≤ Π_{t=1}^{T} √(e^{−4γt²}) = Π_{t=1}^{T} e^{−2γt²} = e^{−2·Σ_{t=1}^{T} γt²}

    f. From the results obtained at parts c and e, we get:

    errS(H) ≤ e^{−2·Σ_{t=1}^{T} γt²} ≤ e^{−2Tγ²}

    Therefore,

    errS(H) < ε if −2Tγ² < ln ε ⇔ 2Tγ² > −ln ε ⇔ 2Tγ² > ln(1/ε) ⇔ T > (1/(2γ²))·ln(1/ε)

    Hence we need T = O((1/γ²)·ln(1/ε)).

    Note: It follows that errS(H) → 0 as T → ∞.

    68.

  • Exemplifying the application of the AdaBoost algorithm

    CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.6

    69.

  • Consider the training dataset in the nearby figure. Run T = 3 iterations of AdaBoost with decision stumps (axis-aligned separators) as the base learners. Illustrate the learned weak hypotheses ht in this figure and fill in the table given below.

    (For the pseudo-code of the AdaBoost algorithm, see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5. Please read the Important Remark that follows that pseudo-code!)

    [Figure: nine training instances x1, ..., x9, labelled + or −, in the (X1, X2) plane, with X1 ∈ [0, 5] and X2 ∈ [0, 4].]

    t  εt  αt  Dt(1)  Dt(2)  Dt(3)  Dt(4)  Dt(5)  Dt(6)  Dt(7)  Dt(8)  Dt(9)  errS(H)
    1
    2
    3

    Note: The goal of this exercise is to help you understand how AdaBoost works in practice. It is advisable that, after understanding this exercise, you implement a program / function that calculates the weighted training error produced by a given decision stump, w.r.t. a certain probabilistic distribution (D) defined on the training dataset. Later on you will extend this program to a full-fledged implementation of AdaBoost.

    70.

  • Solution

    Unlike the graphical representation that we used until now for decision stumps (as trees of depth 1), here we will work with the following analytical representation: for a continuous attribute X taking values x ∈ R and for any threshold s ∈ R, we can define two decision stumps:

    sign(x − s) = 1 if x ≥ s, and −1 if x < s;     sign(s − x) = −1 if x ≥ s, and 1 if x < s.

    For convenience, in the sequel we will denote the first decision stump by X ≥ s and the second by X < s.

    According to the Important Remark that follows the AdaBoost pseudo-code [see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5], at each iteration (t) the weak algorithm A selects the/a decision stump which, among all decision stumps, has the minimum weighted training error w.r.t. the current distribution (Dt) on the training data.
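    A small sketch (illustrative names, not from the slides) of the weighted training error of the stumps "X_j < s" and "X_j ≥ s" under a distribution D, as used to fill the tables on the next slides:

```python
def weighted_error_less(points, labels, D, j, s):
    # stump "X_j < s", i.e. sign(s - x_j): predicts +1 when x_j < s, -1 otherwise
    preds = [1 if x[j] < s else -1 for x in points]
    return sum(d for d, p, y in zip(D, preds, labels) if p != y)

def weighted_error_geq(points, labels, D, j, s):
    # uses the identity err_D(X_j >= s) = 1 - err_D(X_j < s) stated in the slides
    return 1.0 - weighted_error_less(points, labels, D, j, s)
```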

    71.

  • Notes

    When applying the ID3 algorithm, for each continuous attribute X, we used a threshold for each pair of examples (xi, yi), (xi+1, yi+1) with yi·yi+1 < 0, such that xi < xi+1 but there is no xj ∈ Val(X) for which xi < xj < xi+1.

    We will proceed similarly when applying AdaBoost with decision stumps and continuous attributes.

    [In the case of the ID3 algorithm, there is a theoretical result stating that there is no need to consider other thresholds for a continuous attribute X apart from those situated between pairs of successive values (xi < xi+1) having opposite labels (yi ≠ yi+1), because the Information Gain (IG) for the other thresholds (xi < xi+1, with yi = yi+1) is provably less than the maximal IG for X.

    LC: A similar result can be proven which allows us to simplify the application of the weak classifier (A) in the framework of the AdaBoost algorithm.]

    Moreover, we will also consider a threshold from outside the interval of values taken by the attribute X in the training dataset. [The decision stumps corresponding to this "outside" threshold can be associated with the decision trees of depth 0 that we met in other problems.]

    72.

  • Iteration t = 1:

    At this stage (i.e., the first iteration of AdaBoost) the thresholds for the two continuous variables (X1 and X2), corresponding to the two coordinates of the training instances (x1, ..., x9), are 1/2, 5/2 and 9/2 for X1, and 1/2, 3/2, 5/2 and 7/2 for X2.

    One can easily see that we can get rid of the "outside" threshold 1/2 for X2, because the decision stumps corresponding to this threshold act in the same way as the decision stumps associated with the "outside" threshold 1/2 for X1.

    The decision stumps corresponding to this iteration, together with their associated weighted training errors, are shown on the next slide. When filling those tables, we used the equalities err_{Dt}(X1 ≥ s) = 1 − err_{Dt}(X1 < s) and, similarly, err_{Dt}(X2 ≥ s) = 1 − err_{Dt}(X2 < s), for any threshold s and every iteration t = 1, 2, .... These equalities are easy to prove.

    73.

  • Weighted training errors of the candidate stumps at iteration t = 1:

    s                   1/2    5/2    9/2
    err_{D1}(X1 < s)    4/9    2/9    4/9 + 2/9 = 2/3
    err_{D1}(X1 ≥ s)    5/9    7/9    1/3

    s                   1/2    3/2               5/2               7/2
    err_{D1}(X2 < s)    4/9    1/9 + 3/9 = 4/9   2/9 + 1/9 = 1/3   2/9
    err_{D1}(X2 ≥ s)    5/9    5/9               2/3               7/9

    It can be seen that the minimal weighted training error (ε1 = 2/9) is obtained for the decision stumps X1 < 5/2 and X2 < 7/2. Therefore we can choose h1 = sign(7/2 − X2) as the best hypothesis at iteration t = 1; the corresponding separator is the line X2 = 7/2. The hypothesis h1 wrongly classifies the instances x4 and x5. Then

    γ1 = 1/2 − 2/9 = 5/18   and   α1 = (1/2)·ln((1 − ε1)/ε1) = ln √((1 − 2/9) : (2/9)) = ln √(7/2) ≈ 0.626

    74.

  • Now the algorithm must get a new distribution (D2) by altering the old one (D1), so that the next iteration concentrates more on the misclassified instances.

    D2(i) = (1/Z1)·D1(i)·(e^{−α1})^{yi h1(xi)}, with e^{−α1} = √(2/7):
    = (1/Z1)·(1/9)·√(2/7) for i ∈ {1, 2, 3, 6, 7, 8, 9};
    = (1/Z1)·(1/9)·√(7/2) for i ∈ {4, 5}.

    Remember that Z1 is a normalization factor for D2. So,

    Z1 = (1/9)·(7·√(2/7) + 2·√(7/2)) = 2√14 / 9 ≈ 0.8315

    Therefore,

    D2(i) = (9/(2√14))·(1/9)·√(2/7) = 1/14 for i ∉ {4, 5};
          = (9/(2√14))·(1/9)·√(7/2) = 1/4  for i ∈ {4, 5}.

    [Figure: the nine training instances in the (X1, X2) plane, the separator h1 (the line X2 = 7/2, with '+' below it and '−' above it), and the new weights: 1/4 on x4 and x5, 1/14 on the other instances.]

    75.

  • Note

    If, instead of sign(7/2 − X2), we had taken as hypothesis h1 the decision stump sign(5/2 − X1), the subsequent calculus would have been slightly different (although both decision stumps have the same, minimal, weighted training error, 2/9): x8 and x9 would have been allocated the weights 1/4, while x4 and x5 would have been allocated the weights 1/14.

    (Therefore, the output of AdaBoost may not be uniquely determined!)

    76.

  • Iteration t = 2:

    s                   1/2     5/2     9/2
    err_{D2}(X1 < s)    4/14    2/14    2/14 + 2/4 + 2/14 = 11/14
    err_{D2}(X1 ≥ s)    10/14   12/14   3/14

    s                   1/2     3/2                  5/2                 7/2
    err_{D2}(X2 < s)    4/14    1/4 + 3/14 = 13/28   2/4 + 1/14 = 8/14   2/4 = 1/2
    err_{D2}(X2 ≥ s)    10/14   15/28                6/14                1/2

    Note: According to the theoretical result presented at part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5, computing the weighted error rate of the decision stump [corresponding to the test] X2 < 7/2 is now superfluous, because this decision stump was chosen as the optimal hypothesis at the previous iteration. (Nevertheless, we have placed it in the table, for the sake of a thorough presentation.)

    77.

  • Now the best hypothesis is h2 = sign(5/2 − X1); the corresponding separator is the line X1 = 5/2.

    ε2 = P_{D2}({x8, x9}) = 2/14 = 1/7 ≈ 0.143  ⇒  γ2 = 1/2 − 1/7 = 5/14

    α2 = ln √((1 − ε2)/ε2) = ln √((1 − 1/7) : (1/7)) = ln √6 ≈ 0.896

    D3(i) = (1/Z2)·D2(i)·(e^{−α2})^{yi h2(xi)}, with e^{−α2} = 1/√6:
    = (1/Z2)·D2(i)·(1/√6) if h2(xi) = yi, and (1/Z2)·D2(i)·√6 otherwise
    = (1/Z2)·(1/14)·(1/√6) for i ∈ {1, 2, 3, 6, 7};
    = (1/Z2)·(1/4)·(1/√6)  for i ∈ {4, 5};
    = (1/Z2)·(1/14)·√6     for i ∈ {8, 9}.

    78.

  • Z2 = 5·(1/14)·(1/√6) + 2·(1/4)·(1/√6) + 2·(1/14)·√6 = 5/(14√6) + 1/(2√6) + √6/7 = (12 + 12)/(14√6) = 24/(14√6) = 2√6/7 ≈ 0.7

    D3(i) = (7/(2√6))·(1/14)·(1/√6) = 1/24 for i ∈ {1, 2, 3, 6, 7};
          = (7/(2√6))·(1/4)·(1/√6)  = 7/48 for i ∈ {4, 5};
          = (7/(2√6))·(1/14)·√6     = 1/4  for i ∈ {8, 9}.

    [Figure: the nine training instances with the separators h1 (X2 = 7/2) and h2 (X1 = 5/2), and the new weights D3: 1/24 on x1, x2, x3, x6, x7; 7/48 on x4, x5; 1/4 on x8, x9.]

  • Iteration t = 3:

    s1

    2

    5

    2

    9

    2

    errD3(X1 < s)2

    24+

    2

    4=

    7

    12

    2

    4

    2

    24+ 2 · 7

    48+ 2 · 1

    4=

    21

    24

    errD3(X1 ≥ s)5

    12

    2

    4

    3

    24=

    1

    8

    s1

    2

    3

    2

    5

    2

    7

    2

    errD3(X1 < s)7

    12

    7

    48+

    2

    24+

    1

    4=

    23

    482 · 7

    48+

    1

    24=

    1

    32 · 7

    48=

    7

    24

    errD3(X1 ≥ s)5

    12

    25

    48

    2

    3

    17

    24

    80.

  • The new best hypothesis is h3 = sign(X1 − 9/2); the corresponding separator is the line X1 = 9/2.

    ε3 = P_{D3}({x1, x2, x7}) = 2·(1/24) + 1/24 = 3/24 = 1/8

    γ3 = 1/2 − 1/8 = 3/8

    α3 = ln √((1 − ε3)/ε3) = ln √((1 − 1/8) : (1/8)) = ln √7 ≈ 0.973

    [Figure: the nine training instances with the three separators h1 (X2 = 7/2), h2 (X1 = 5/2) and h3 (X1 = 9/2), and the signs of the corresponding half-planes.]

    81.

  • Finally, after filling our results in the given table, we get:

    t  εt    αt        Dt(1)  Dt(2)  Dt(3)  Dt(4)  Dt(5)  Dt(6)  Dt(7)  Dt(8)  Dt(9)  errS(H)
    1  2/9   ln√(7/2)  1/9    1/9    1/9    1/9    1/9    1/9    1/9    1/9    1/9    2/9
    2  2/14  ln√6      1/14   1/14   1/14   1/4    1/4    1/14   1/14   1/14   1/14   2/9
    3  1/8   ln√7      1/24   1/24   1/24   7/48   7/48   1/24   1/24   1/4    1/4    0

    Note: The following table helps you understand how errS(H) was computed; remember that H(xi) := sign(Σ_{t=1}^{T} αt ht(xi)).

    t  αt     x1  x2  x3  x4  x5  x6  x7  x8  x9
    1  0.626  +1  +1  −1  +1  +1  −1  −1  +1  +1
    2  0.896  +1  +1  −1  −1  −1  −1  −1  −1  −1
    3  0.973  −1  −1  −1  −1  −1  −1  +1  +1  +1

    H(xi)     +1  +1  −1  −1  −1  −1  −1  +1  +1
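    A short sketch (my own check, not from the slides): combine the three weak hypotheses with their weights αt and recover the last row H(xi) of the table above.

```python
import numpy as np

alphas = [0.626, 0.896, 0.973]
# per-iteration predictions h_t(x_i) for x_1..x_9, copied from the table above
preds = np.array([[+1, +1, -1, +1, +1, -1, -1, +1, +1],
                  [+1, +1, -1, -1, -1, -1, -1, -1, -1],
                  [-1, -1, -1, -1, -1, -1, +1, +1, +1]])
H = np.sign(np.dot(alphas, preds))
print(H)   # [ 1.  1. -1. -1. -1. -1. -1.  1.  1.]
```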

    82.

  • Remark: One can immediately see that the [test] instance (1, 4) will be classified by the hypothesis H learned by AdaBoost as negative (since −α1 + α2 − α3 = −0.626 + 0.896 − 0.973 < 0). After doing other similar calculations, we can conclude that the decision zones and the decision boundaries produced by AdaBoost for the given training data will be as indicated in the nearby figure.

    [Figure: the decision zones produced by AdaBoost in the (X1, X2) plane, delimited by the separators h1 (X2 = 7/2), h2 (X1 = 5/2) and h3 (X1 = 9/2).]

    Remark: The execution of AdaBoost could continue (if we had initially taken T > 3), although we obtained errS(H) = 0 at iteration t = 3. By elaborating the details, we would see that for t = 4 we would obtain as optimal hypothesis X2 < 7/2 (which had already been selected at iteration t = 1). This hypothesis now produces the weighted training error ε4 = 1/6. Therefore α4 = ln √5, and this will be added to α1 = ln √(7/2) in the new output H. In this way, the confidence in the hypothesis X2 < 7/2 would be strengthened.

    So, we should keep in mind that AdaBoost can select a certain weak hypothesis several times (but never at consecutive iterations, cf. CMU, 2015 fall, E. Xing, Z. Bar-Joseph, HW4, pr. 2.1).

    83.

  • Seeing AdaBoost as an optimization algorithm,

    w.r.t. the [inverse] exponential loss function

    CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1

    CMU, 2008 fall, Eric Xing, midterm, pr. 5.1

    84.

  • At CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5, part d, we have shown that in AdaBoost we try to [indirectly] minimize the training error errS(H) by sequentially minimizing its upper bound Π_{t=1}^{T} Zt, i.e., at each iteration t (1 ≤ t ≤ T) we choose αt so as to minimize Zt (viewed as a function of αt).

    Here you will see that another way to explain AdaBoost is by sequentially minimizing the negative exponential loss:

    E := Σ_{i=1}^{m} exp(−yi f_T(xi)) = Σ_{i=1}^{m} exp(−yi Σ_{t=1}^{T} αt ht(xi)).   (3)

    That is to say, at the t-th iteration (1 ≤ t ≤ T) we want to choose, besides the appropriate classifier ht, the corresponding weight αt, so that the overall loss E (accumulated up to the t-th iteration) is minimized.

    Prove that this [new] strategy will lead to the same update rule for αt used in AdaBoost, i.e., αt = (1/2)·ln((1 − εt)/εt).

    Hint: You can use the fact that Dt(i) ∝ exp(−yi f_{t−1}(xi)), and it [LC: the proportionality factor] can be viewed as constant when we try to optimize E with respect to αt in the t-th iteration.

    85.

  • Solution

    At the t-th iteration, we have

    E = Σ_{i=1}^{m} exp(−yi ft(xi)) = Σ_{i=1}^{m} exp(−yi (Σ_{t'=1}^{t−1} α_{t'} h_{t'}(xi)) − yi αt ht(xi))
    = Σ_{i=1}^{m} exp(−yi f_{t−1}(xi))·exp(−yi αt ht(xi))
    ∝ Σ_{i=1}^{m} Dt(i)·exp(−yi αt ht(xi)) =: E′   (see CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, part b)

    Further on, we can rewrite E′ as

    E′ = Σ_{i=1}^{m} Dt(i)·exp(−yi αt ht(xi)) = Σ_{i∈C} Dt(i)·exp(−αt) + Σ_{i∈M} Dt(i)·exp(αt) = (1 − εt)·e^{−αt} + εt·e^{αt},   (4)

    where C is the set of examples which are correctly classified by ht, and M is the set of examples which are mis-classified by ht.

    86.

  • The relation (4) is identical with the expression (1) from part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 2.1-5 (see the solution). Therefore, E will reach its minimum for αt = (1/2)·ln((1 − εt)/εt).

    87.

  • AdaBoost algorithm: the notion of [voting ] margin ;

    some properties

    CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3

    88.

  • Despite the fact that model complexity increases with each iteration, AdaBoost does not usually overfit. The reason behind this is that the model becomes more "confident" as we increase the number of iterations. The "confidence" can be expressed mathematically as the [voting] margin. Recall that after the AdaBoost algorithm terminates with T iterations, the [output] classifier is

    H_T(x) = sign(Σ_{t=1}^{T} αt ht(x)).

    Similarly, we can define the intermediate weighted classifier after k iterations as:

    H_k(x) = sign(Σ_{t=1}^{k} αt ht(x)).

    As its output is either −1 or 1, it does not tell the confidence of its judgement. Here, without changing the decision rule, let

    H_k(x) = sign(Σ_{t=1}^{k} ᾱt ht(x)),

    where ᾱt = αt / Σ_{t'=1}^{k} α_{t'}, so that the weights on each weak classifier are normalized.

    89.

  • Define the margin after the k-th iteration as [the sum of] the [normalized] weights of the ht voting correctly minus [the sum of] the [normalized] weights of the ht voting incorrectly:

    Margin_k(x) = Σ_{t: ht(x)=y} ᾱt − Σ_{t: ht(x)≠y} ᾱt.

    a. Let f_k(x) := Σ_{t=1}^{k} ᾱt ht(x). Show that Margin_k(xi) = yi f_k(xi) for all training instances xi, with i = 1, ..., m.

    b. If Margin_k(xi) > Margin_k(xj), which of the samples xi and xj will receive a higher weight in iteration k + 1?

    Hint: Use the relation D_{k+1}(i) = (1/(m·Π_{t=1}^{k} Zt))·exp(−yi f_k(xi)), which was proven at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.2.

    90.

  • Solution

    a. We will prove the equality starting from its right-hand side:

    yi f_k(xi) = yi Σ_{t=1}^{k} ᾱt ht(xi) = Σ_{t=1}^{k} ᾱt yi ht(xi) = Σ_{t: ht(xi)=yi} ᾱt − Σ_{t: ht(xi)≠yi} ᾱt = Margin_k(xi).

    b. According to the relationship already proven at part a,

    Margin_k(xi) > Margin_k(xj) ⇔ yi f_k(xi) > yj f_k(xj) ⇔ −yi f_k(xi) < −yj f_k(xj) ⇔ exp(−yi f_k(xi)) < exp(−yj f_k(xj)).

    Based on the given Hint, it follows that D_{k+1}(i) < D_{k+1}(j).

    91.

  • Important Remark

    It can be shown mathematically that boosting tends to increase the margins of training examples (see the relation (3) at CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1), and that a large margin on training examples reduces the generalization error.

    Thus we can explain why, although the number of "parameters" of the model created by AdaBoost increases by 2 at every iteration (therefore complexity rises), it usually doesn't overfit.

    92.

  • AdaBoost: a sufficient condition for γ-weak learnability,

    based on the voting margin

    CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.1.4

    93.

  • At CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5 we encountered the notion of empirical γ-weak learnability. When this condition (γ ≤ γt for all t, where γt := 1/2 − εt, with εt being the weighted training error produced by the weak hypothesis ht) is met, it ensures that AdaBoost will drive down the training error quickly. However, this condition does not hold all the time.

    In this problem we will prove a sufficient condition for empirical weak learnability [to hold]. This condition refers to the notion of voting margin, which was presented in CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3.

    Namely, we will prove that if there is a constant θ > 0 such that the [voting] margins of all training instances are lower-bounded by θ at each iteration of the AdaBoost algorithm, then the property of empirical γ-weak learnability is "guaranteed", with γ = θ/2.

    94.

  • [Formalisation]

    Suppose we are given a training set S = {(x1, y1), ..., (xm, ym)}, such that for some weak hypotheses h1, ..., hk from the hypothesis space H, and some non-negative coefficients α1, ..., αk with Σ_{j=1}^{k} αj = 1, there exists θ > 0 such that

    yi (Σ_{j=1}^{k} αj hj(xi)) ≥ θ, ∀(xi, yi) ∈ S.

    Note: according to CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3,

    yi (Σ_{j=1}^{k} αj hj(xi)) = yi f_k(xi) = Margin_k(xi), where f_k(xi) := Σ_{j=1}^{k} αj hj(xi).

    Key idea: We will show that if the condition above is satisfied (for a given k), then for any distribution D over S, there exists a hypothesis hl ∈ {h1, ..., hk} with weighted training error at most 1/2 − θ/2 over the distribution D.

    It will follow that when the condition above is satisfied for any k, the training set S is empirically γ-weak learnable, with γ = θ/2.

    95.

  • a. Show that, if the condition stated above is met, there exists a weak hypothesis hl from {h1, ..., hk} such that E_{i∼D}[yi hl(xi)] ≥ θ.

    Hint: Taking expectation under the same distribution does not change the inequality conditions.

    b. Show that the inequality E_{i∼D}[yi hl(xi)] ≥ θ is equivalent to

    Pr_{i∼D}[yi ≠ hl(xi)] ≤ 1/2 − θ/2,

    meaning that the weighted training error of hl is at most 1/2 − θ/2, and therefore γt ≥ θ/2.

    96.

  • Solution

    a. Since yi (Σ_{j=1}^{k} αj hj(xi)) ≥ θ ⇔ yi f_k(xi) ≥ θ for i = 1, ..., m, it follows (according to the Hint) that

    E_{i∼D}[yi f_k(xi)] ≥ θ, where f_k(xi) := Σ_{j=1}^{k} αj hj(xi).   (5)

    On the other side, E_{i∼D}[yi hl(xi)] ≥ θ ⇔ Σ_{i=1}^{m} yi hl(xi)·D(i) ≥ θ (by definition).

    Suppose, on the contrary, that E_{i∼D}[yi hl(xi)] < θ, that is, Σ_{i=1}^{m} yi hl(xi)·D(i) < θ for l = 1, ..., k. Then Σ_{i=1}^{m} yi hl(xi)·D(i)·αl < θ·αl for l = 1, ..., k. By summing up these inequations for l = 1, ..., k, we get

    Σ_{l=1}^{k} Σ_{i=1}^{m} yi hl(xi)·D(i)·αl < Σ_{l=1}^{k} θ·αl ⇔ Σ_{i=1}^{m} yi D(i) (Σ_{l=1}^{k} hl(xi) αl) < θ Σ_{l=1}^{k} αl ⇔ Σ_{i=1}^{m} yi f_k(xi)·D(i) < θ,   (6)

    because Σ_{j=1}^{k} αj = 1 and f_k(xi) := Σ_{l=1}^{k} αl hl(xi).

    The inequation (6) can be written as E_{i∼D}[yi f_k(xi)] < θ. Obviously, it contradicts the relationship (5). Therefore, the previous supposition is false. In conclusion, there exists l ∈ {1, ..., k} such that E_{i∼D}[yi hl(xi)] ≥ θ.

    97.

  • Solution (cont'd)

    b. We already said that E_{i∼D}[yi hl(xi)] ≥ θ ⇔ Σ_{i=1}^{m} yi hl(xi)·D(i) ≥ θ. Since yi ∈ {−1, +1} and hl(xi) ∈ {−1, +1} for i = 1, ..., m and l = 1, ..., k, we have

    Σ_{i=1}^{m} yi hl(xi)·D(i) ≥ θ ⇔ Σ_{i: yi=hl(xi)} D(xi) − Σ_{i: yi≠hl(xi)} D(xi) ≥ θ ⇔ (1 − εl) − εl ≥ θ ⇔
    1 − 2εl ≥ θ ⇔ 2εl ≤ 1 − θ ⇔ εl ≤ 1/2 − θ/2 ⇔ Pr_{i∼D}[yi ≠ hl(xi)] ≤ 1/2 − θ/2 (by definition).

    98.

  • AdaBoost:

    Any set of consistently labelled instances from Ris empirically γ-weak learnable

    by using decision stumps

    Stanford, 2016 fall, Andrew Ng, John Duchi, HW2, pr. 6.abc

    99.

  • At CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5 we encountered the notion of empirical γ-weak learnability. When this condition (γ ≤ γt for all t, where γt := 1/2 − εt, with εt being the weighted training error produced by the weak hypothesis ht) is met, it ensures that AdaBoost will drive down the training error quickly.

    In this problem we will assume that our input attribute vectors x ∈ R, that is, they are one-dimensional, and we will show that [LC] when these vectors are consistently labelled, decision stumps based on thresholding provide a weak-learning guarantee (γ).

    100.

  • Decision stumps: analytical definitions / formalization

    Thresholding-based decision stumps can be seen as functions indexed by a threshold s and a sign +/−, such that

    φ_{s,+}(x) = 1 if x ≥ s, and −1 if x < s;     φ_{s,−}(x) = −1 if x ≥ s, and 1 if x < s.

    Therefore, φ_{s,+}(x) = −φ_{s,−}(x).

    Key idea for the proof

    We will show that, given a consistently labelled training set S = {(x1, y1), ..., (xm, ym)}, with xi ∈ R and yi ∈ {−1, +1} for i = 1, ..., m, there is some γ > 0 such that for any distribution p defined on this training set there is a threshold s ∈ R for which

    error_p(φ_{s,+}) ≤ 1/2 − γ   or   error_p(φ_{s,−}) ≤ 1/2 − γ,

    where error_p(φ_{s,+}) and error_p(φ_{s,−}) denote the weighted training errors of φ_{s,+} and, respectively, φ_{s,−}, computed according to the distribution p.

    101.
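A minimal sketch of these definitions in Python (the function names and the small dataset below are mine, not part of the original statement):

  import numpy as np

  def phi_plus(x, s):
      """Decision stump phi_{s,+}: predicts +1 iff x >= s."""
      return np.where(np.asarray(x) >= s, 1, -1)

  def phi_minus(x, s):
      """Decision stump phi_{s,-} = -phi_{s,+}."""
      return -phi_plus(x, s)

  def weighted_error(stump, s, x, y, p):
      """error_p(phi_s) = sum_i p_i * 1{y_i != phi_s(x_i)}."""
      return np.sum(p * (stump(x, s) != y))

  # made-up example; x listed in decreasing order, matching the convention on the next slide
  x = np.array([5.0, 3.0, 2.0, 1.0])
  y = np.array([1, 1, -1, -1])
  p = np.full(4, 0.25)
  print(weighted_error(phi_plus, 2.5, x, y, p))   # 0.0 -- this particular stump is perfect here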

• Convention: In our problem we will assume that the training instances $x_1, \ldots, x_m \in \mathbb{R}$ are distinct. Moreover, we will assume (without loss of generality, but this makes the proof notationally simpler) that

$x_1 > x_2 > \ldots > x_m$.

a. Show that, given $S$, for each threshold $s \in \mathbb{R}$ there is some $m_0(s) \in \{0, 1, \ldots, m\}$ such that

$\mathrm{error}_p(\phi_{s,+}) \stackrel{\mathrm{def.}}{=} \sum_{i=1}^{m} p_i \cdot 1_{\{y_i \ne \phi_{s,+}(x_i)\}} = \frac{1}{2} - \frac{1}{2} \underbrace{\left(\sum_{i=1}^{m_0(s)} y_i p_i - \sum_{i=m_0(s)+1}^{m} y_i p_i\right)}_{\mathrm{not.:}\ f(m_0(s))}$

and

$\mathrm{error}_p(\phi_{s,-}) \stackrel{\mathrm{def.}}{=} \sum_{i=1}^{m} p_i \cdot 1_{\{y_i \ne \phi_{s,-}(x_i)\}} = \frac{1}{2} - \frac{1}{2} \underbrace{\left(\sum_{i=m_0(s)+1}^{m} y_i p_i - \sum_{i=1}^{m_0(s)} y_i p_i\right)}_{\mathrm{not.:}\ -f(m_0(s))}$

Note: Treat sums over empty sets of indices as zero. Therefore $\sum_{i=1}^{0} a_i = 0$ for any $a_i$, and similarly $\sum_{i=m+1}^{m} a_i = 0$.

102.

• b. Prove that, given $S$, there is some $\gamma > 0$ (which may depend on the training set size $m$) such that for any set of probabilities $p$ on the training set (therefore $p_i \ge 0$ and $\sum_{i=1}^{m} p_i = 1$) we can find $m_0 \in \{0, \ldots, m\}$ such that

$|f(m_0)| \ge 2\gamma$, where $f(m_0) \stackrel{\mathrm{not.}}{=} \sum_{i=1}^{m_0} y_i p_i - \sum_{i=m_0+1}^{m} y_i p_i$.

Note: $\gamma$ should not depend on $p$.

Hint: Consider the difference $f(m_0) - f(m_0 - 1)$. What is your $\gamma$?

103.

• c. Based on your answers to parts a and b, what edge can thresholded decision stumps guarantee on any training set $\{x_i, y_i\}_{i=1}^{m}$ where the raw attributes $x_i \in \mathbb{R}$ are all distinct? Recall that the edge of a weak classifier $\phi: \mathbb{R} \to \{-1, 1\}$ is the constant $\gamma \in (0, 1/2)$ such that

$\mathrm{error}_p(\phi) \stackrel{\mathrm{def.}}{=} \sum_{i=1}^{m} p_i \cdot 1_{\{\phi(x_i) \ne y_i\}} \le \frac{1}{2} - \gamma$.

d. Can you give an upper bound on the number of thresholded decision stumps required to achieve zero error on a given training set?

104.

• Solution

a. We perform several algebraic steps. Let $\mathrm{sign}(t) = 1$ if $t \ge 0$, and $\mathrm{sign}(t) = -1$ otherwise. Then

$1_{\{\phi_{s,+}(x) \ne y\}} = 1_{\{\mathrm{sign}(x-s) \ne y\}} = 1_{\{y \cdot \mathrm{sign}(x-s) \le 0\}}$,

where the symbol $1_{\{\ \}}$ denotes the well-known indicator function. Thus we have

$\mathrm{error}_p(\phi_{s,+}) \stackrel{\mathrm{def.}}{=} \sum_{i=1}^{m} p_i \cdot 1_{\{y_i \ne \phi_{s,+}(x_i)\}} = \sum_{i=1}^{m} p_i \cdot 1_{\{y_i \cdot \mathrm{sign}(x_i - s) \le 0\}} = \sum_{i: x_i \ge s} p_i \cdot 1_{\{y_i = -1\}} + \sum_{i: x_i < s} p_i \cdot 1_{\{y_i = 1\}}.$

Letting $m_0(s)$ be the index such that $x_i \ge s$ for $i \le m_0(s)$ and $x_i < s$ for $i > m_0(s)$ (which we know must exist because $x_1 > x_2 > \ldots > x_m$), we have

$\mathrm{error}_p(\phi_{s,+}) \stackrel{\mathrm{def.}}{=} \sum_{i=1}^{m} p_i \cdot 1_{\{y_i \ne \phi_{s,+}(x_i)\}} = \sum_{i=1}^{m_0(s)} p_i \cdot 1_{\{y_i = -1\}} + \sum_{i=m_0(s)+1}^{m} p_i \cdot 1_{\{y_i = 1\}}.$

105.

• Now we make a key observation: we have

$1_{\{y = -1\}} = \frac{1-y}{2}$ and $1_{\{y = 1\}} = \frac{1+y}{2}$,

because $y \in \{-1, 1\}$.

Consequently,

$\mathrm{error}_p(\phi_{s,+}) \stackrel{\mathrm{def.}}{=} \sum_{i=1}^{m} p_i \cdot 1_{\{y_i \ne \phi_{s,+}(x_i)\}} = \sum_{i=1}^{m_0(s)} p_i \cdot 1_{\{y_i = -1\}} + \sum_{i=m_0(s)+1}^{m} p_i \cdot 1_{\{y_i = 1\}}$

$= \sum_{i=1}^{m_0(s)} p_i \cdot \frac{1 - y_i}{2} + \sum_{i=m_0(s)+1}^{m} p_i \cdot \frac{1 + y_i}{2} = \frac{1}{2} \sum_{i=1}^{m} p_i - \frac{1}{2} \sum_{i=1}^{m_0(s)} p_i y_i + \frac{1}{2} \sum_{i=m_0(s)+1}^{m} p_i y_i$

$= \frac{1}{2} - \frac{1}{2} \left( \sum_{i=1}^{m_0(s)} p_i y_i - \sum_{i=m_0(s)+1}^{m} p_i y_i \right).$

The last equality follows because $\sum_{i=1}^{m} p_i = 1$.

The case of $\phi_{s,-}$ is symmetric to this one, so we omit the argument.

106.

• Solution (cont'd)

b. For any $m_0 \in \{1, \ldots, m\}$ we have

$f(m_0) - f(m_0 - 1) = \sum_{i=1}^{m_0} y_i p_i - \sum_{i=m_0+1}^{m} y_i p_i - \sum_{i=1}^{m_0-1} y_i p_i + \sum_{i=m_0}^{m} y_i p_i = 2\, y_{m_0} p_{m_0}.$

Therefore $|f(m_0) - f(m_0 - 1)| = 2 |y_{m_0}|\, p_{m_0} = 2 p_{m_0}$ for all $m_0 \in \{1, \ldots, m\}$.

Because $\sum_{i=1}^{m} p_i = 1$, there must be at least one index $m_0'$ with $p_{m_0'} \ge \frac{1}{m}$.

Thus we have $|f(m_0') - f(m_0' - 1)| \ge \frac{2}{m}$, and so it must be the case that at least one of

$|f(m_0')| \ge \frac{1}{m}$  or  $|f(m_0' - 1)| \ge \frac{1}{m}$

holds. Depending on which of these two inequalities is true, we would then "return" $m_0'$ or $m_0' - 1$.

(Note: If $|f(m_0' - 1)| \ge \frac{1}{m}$ and $m_0' = 1$, then we have to consider an "outside" threshold, $s > x_1$.)

Finally, we have $\gamma = \frac{1}{2m}$.

107.
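The argument in parts a and b can be checked numerically. The sketch below (my own illustration, with made-up data and variable names) computes $f(m_0)$ for all $m_0 \in \{0, \ldots, m\}$, picks an $m_0$ with $|f(m_0)| \ge \frac{1}{m}$, and verifies that the corresponding stump, or its sign-flipped version, has weighted error at most $\frac{1}{2} - \frac{1}{2m}$.

  import numpy as np

  rng = np.random.default_rng(1)
  m = 8
  x = -np.sort(-rng.normal(size=m))        # distinct points with x_1 > x_2 > ... > x_m
  y = np.sign(x - np.median(x))            # some consistent labelling in {-1, +1}
  y[y == 0] = 1
  p = rng.random(m)
  p /= p.sum()                             # an arbitrary distribution on the training set

  # f(m0) = sum_{i<=m0} y_i p_i - sum_{i>m0} y_i p_i, for m0 = 0, 1, ..., m
  f = np.array([np.sum(y[:m0] * p[:m0]) - np.sum(y[m0:] * p[m0:]) for m0 in range(m + 1)])

  m0 = int(np.argmax(np.abs(f)))           # part b: some m0 with |f(m0)| >= 1/m must exist
  assert np.abs(f[m0]) >= 1 / m

  # a threshold s in (x_{m0+1}, x_{m0}] puts exactly the first m0 points on the ">= s" side;
  # when m0 = 0, use an "outside" threshold s > x_1
  s = x[m0 - 1] if m0 > 0 else x[0] + 1.0
  err_plus = np.sum(p * (np.where(x >= s, 1, -1) != y))       # error_p(phi_{s,+})
  assert min(err_plus, 1 - err_plus) <= 0.5 - 1 / (2 * m)     # error_p(phi_{s,-}) = 1 - error_p(phi_{s,+})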

• Solution (cont'd)

c. The inequality proven at part b, $|f(m_0)| \ge 2\gamma$, implies that either $f(m_0) \ge 2\gamma$ or $f(m_0) \le -2\gamma$. So,

either $f(m_0) \ge 2\gamma \Leftrightarrow -f(m_0) \le -2\gamma \Leftrightarrow \underbrace{\frac{1}{2} - \frac{1}{2} f(m_0)}_{\mathrm{error}_p(\phi_{s,+})} \le \frac{1}{2} - \frac{1}{2} \cdot 2\gamma = \frac{1}{2} - \gamma$

or $f(m_0) \le -2\gamma \Leftrightarrow \underbrace{\frac{1}{2} + \frac{1}{2} f(m_0)}_{\mathrm{error}_p(\phi_{s,-})} \le \frac{1}{2} - \frac{1}{2} \cdot 2\gamma = \frac{1}{2} - \gamma$.

Therefore thresholded decision stumps are guaranteed to have an edge of at least $\gamma = \frac{1}{2m}$ over random guessing.

108.

• Summing up

At each iteration $t$ executed by AdaBoost,

• a probability distribution $p$ (denoted $D_t$ in CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5) is in use;

• [at part b of the present exercise we proved that] there is at least one $m_0$ (better denoted $m_0(p)$) in $\{0, \ldots, m\}$ such that

$|f(m_0)| \ge \frac{1}{m} \stackrel{\mathrm{not.}}{=} 2\gamma$, where $f(m_0) \stackrel{\mathrm{def.}}{=} \sum_{i=1}^{m_0} y_i p_i - \sum_{i=m_0+1}^{m} y_i p_i$;

• [the proof made at part a of the present exercise implies that] for any $s \in (x_{m_0+1}, x_{m_0}]$,

$\mathrm{error}_p(\phi_{s,+}) \le \frac{1}{2} - \gamma$  or  $\mathrm{error}_p(\phi_{s,-}) \le \frac{1}{2} - \gamma$, where $\gamma \stackrel{\mathrm{not.}}{=} \frac{1}{2m}$.

As a consequence, AdaBoost can choose at each iteration a weak hypothesis $h_t$ for which $\gamma_t \ge \gamma = \frac{1}{2m}$.

109.

• Solution (cont'd)

d. Boosting takes $\frac{\ln m}{2\gamma^2}$ iterations to achieve zero [training] error, as shown at CMU, 2015, Z. Bar-Joseph, E. Xing, HW4, pr. 2.5, so with decision stumps ($\gamma = \frac{1}{2m}$) we will achieve zero [training] error in at most $2m^2 \ln m$ iterations of boosting.

Each iteration of boosting introduces a single new weak hypothesis, so at most $2m^2 \ln m$ thresholded decision stumps are necessary.

110.
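As a quick numeric illustration (my own example, not part of the original problem): for a training set of $m = 100$ distinct points, the guaranteed edge is $\gamma = \frac{1}{2m} = 0.005$, and the bound gives at most

$\frac{\ln m}{2\gamma^2} = 2m^2 \ln m = 2 \cdot 100^2 \cdot \ln 100 \approx 92\,104$

iterations, hence at most that many thresholded decision stumps.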

• A generalized version of the AdaBoost algorithm

MIT, 2003 fall, Tommi Jaakkola, HW4, pr. 2

111.

• Here we derive a boosting algorithm from a slightly more general perspective than the AdaBoost algorithm in CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, one that is applicable to a class of loss functions including the exponential one.

The goal is to generate discriminant functions of the form

$f_K(x) = \alpha_1 h(x; \theta_1) + \ldots + \alpha_K h(x; \theta_K)$,

where both $x$ and $w$ belong to $\mathbb{R}^d$, and you can assume that the weak classifiers $h(x; \theta)$ are decision stumps whose predictions are $\pm 1$; any other set of weak learners would do, without modification.

We successively add components to the overall discriminant function in a manner that separates, to the extent possible, the estimation of [the parameters of] the weak classifiers from the setting of the votes $\alpha$.

112.

• A useful definition

Let's start by defining a set of useful loss functions. The only restriction we place on the loss is that it should be a monotonically decreasing and differentiable function of its argument. The argument in our context is $y_i f_K(x_i)$, so that the more the discriminant function agrees with the $\pm 1$ label $y_i$, the smaller the loss.

The simple exponential loss we have already considered [at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5], i.e.,

$\mathrm{Loss}(y_i f_K(x_i)) = \exp(-y_i f_K(x_i))$,

certainly conforms to this notion. And so does the logistic loss

$\mathrm{Loss}(y_i f_K(x_i)) = \ln(1 + \exp(-y_i f_K(x_i)))$.

113.
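A minimal sketch of these two losses and their derivatives in Python (my own helper names; the derivatives are needed below when defining the weights):

  import numpy as np

  def exp_loss(z):
      return np.exp(-z)

  def d_exp_loss(z):                       # dL(z) = dLoss/dz = -exp(-z) < 0
      return -np.exp(-z)

  def logistic_loss(z):
      return np.log1p(np.exp(-z))          # ln(1 + exp(-z))

  def d_logistic_loss(z):                  # dL(z) = -exp(-z)/(1 + exp(-z)) = -1/(1 + exp(z)) < 0
      return -1.0 / (1.0 + np.exp(z))

  # both losses are decreasing in z = y * f(x), so -dL(z) >= 0 can serve as a weight
  z = np.linspace(-3, 3, 7)
  assert np.all(d_exp_loss(z) < 0) and np.all(d_logistic_loss(z) < 0)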

• Remark

Note that the logistic loss has a nice interpretation as a negative log-probability. Indeed, [recall that] for an additive logistic regression model,

$-\ln P(y = 1 \mid x, w) = -\ln \frac{1}{1 + \exp(-z)} = \ln(1 + \exp(-z))$,

where $z = w_1 \phi_1(x) + \ldots + w_K \phi_K(x)$ and we omit the bias term ($w_0$) for simplicity. By replacing the additive combination of basis functions ($\phi_i(x)$) with the combination of weak classifiers ($h(x; \theta_i)$), we have an additive logistic regression model where the weak classifiers serve as the basis functions. The difference is that both the basis functions (weak classifiers) and the coefficients multiplying them will be estimated. In the logistic regression model we typically envision a fixed set of basis functions.

114.

• Let us now try to derive the boosting algorithm in a manner that can accommodate any loss function of the type discussed above. To this end, suppose we have already included $k-1$ component classifiers,

$f_{k-1}(x) = \hat{\alpha}_1 h(x; \hat{\theta}_1) + \ldots + \hat{\alpha}_{k-1} h(x; \hat{\theta}_{k-1})$,   (7)

and we wish to add another one, $h(x; \theta)$. The estimation criterion for the overall discriminant function, including the new component with vote $\alpha$, is given by

$J(\alpha, \theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Loss}(y_i f_{k-1}(x_i) + y_i\, \alpha\, h(x_i; \theta))$.

Note that we explicate only how the objective depends on the choice of the last component and the corresponding vote, since the parameters of the $k-1$ previous components, along with their votes, have already been set and won't be modified further.

115.

• We will first try to find the new component, or parameters $\theta$, so as to maximize its potential in reducing the empirical loss, potential in the sense that we can subsequently adjust the vote to actually reduce the empirical loss. More precisely, we set $\theta$ so as to minimize the derivative

$\frac{\partial}{\partial\alpha} J(\alpha, \theta)\Big|_{\alpha=0} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial\alpha} \mathrm{Loss}(y_i f_{k-1}(x_i) + y_i\, \alpha\, h(x_i; \theta))\Big|_{\alpha=0} = \frac{1}{m} \sum_{i=1}^{m} dL(y_i f_{k-1}(x_i))\, y_i\, h(x_i; \theta)$,   (8)

where $dL(z) \stackrel{\mathrm{not.}}{=} \frac{\partial \mathrm{Loss}(z)}{\partial z}$.

Note that this derivative $\frac{\partial}{\partial\alpha} J(\alpha, \theta)\big|_{\alpha=0}$ precisely captures the rate at which we would start to reduce the empirical loss if we gradually increased the vote ($\alpha$) for the new component with parameters $\theta$. Minimizing this derivative (i.e., making the initial reduction of the loss as large as possible) seems like a sensible estimation criterion for the new component, or $\theta$. This plan permits us to first set $\theta$ and then subsequently optimize $\alpha$ to actually minimize the empirical loss.

116.
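Formula (8) can be checked against a finite difference. The sketch below is my own illustration with made-up data, specialized to the exponential loss, for which $dL(z) = -\exp(-z)$:

  import numpy as np

  rng = np.random.default_rng(2)
  m = 20
  y = rng.choice([-1, 1], size=m).astype(float)
  f_prev = rng.normal(size=m)                      # values f_{k-1}(x_i) of the current ensemble
  h = rng.choice([-1, 1], size=m).astype(float)    # predictions h(x_i; theta) of a candidate component

  def J(alpha):
      # empirical (exponential) loss after adding alpha * h(. ; theta)
      return np.mean(np.exp(-(y * f_prev + y * alpha * h)))

  # formula (8) with dL(z) = -exp(-z)
  analytic = np.mean(-np.exp(-y * f_prev) * y * h)
  numeric = (J(1e-6) - J(-1e-6)) / 2e-6            # central finite difference at alpha = 0
  assert np.isclose(analytic, numeric, rtol=1e-4, atol=1e-6)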

• Let's rewrite the algorithm slightly to make it look more like a boosting algorithm. First, let's define the following weights and normalized weights on the training examples:

$W_i^{(k-1)} = -dL(y_i f_{k-1}(x_i))$  and  $\tilde{W}_i^{(k-1)} = \frac{W_i^{(k-1)}}{\sum_{j=1}^{m} W_j^{(k-1)}}$,  for $i = 1, \ldots, m$.

These weights are guaranteed to be non-negative, since the loss function is a decreasing function of its argument (its derivative has to be negative or zero).

117.

• Now we can rewrite expression (8) as

$\frac{\partial}{\partial\alpha} J(\alpha, \theta)\Big|_{\alpha=0} = -\frac{1}{m} \sum_{i=1}^{m} W_i^{(k-1)} y_i h(x_i; \theta) = -\frac{1}{m} \Big(\sum_j W_j^{(k-1)}\Big) \cdot \sum_{i=1}^{m} \frac{W_i^{(k-1)}}{\sum_j W_j^{(k-1)}}\, y_i\, h(x_i; \theta) = -\frac{1}{m} \Big(\sum_j W_j^{(k-1)}\Big) \cdot \sum_{i=1}^{m} \tilde{W}_i^{(k-1)} y_i\, h(x_i; \theta).$

By ignoring the multiplicative constant $\frac{1}{m} \big(\sum_j W_j^{(k-1)}\big)$, which is constant at iteration $k$, we will estimate $\theta$ by minimizing

$-\sum_{i=1}^{m} \tilde{W}_i^{(k-1)} y_i\, h(x_i; \theta)$,

where the normalized weights $\tilde{W}_i^{(k-1)}$ sum to 1. This is the same as maximizing the weighted agreement with the labels, i.e., $\sum_{i=1}^{m} \tilde{W}_i^{(k-1)} y_i\, h(x_i; \theta)$.

118.

• We are now ready to cast the steps of the boosting algorithm in a form similar to the AdaBoost algorithm given at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5:

Step 1: Find any classifier $h(x; \hat{\theta}_k)$ that performs better than chance with respect to the weighted training error:

$\varepsilon_k = \frac{1}{2}\left(1 - \sum_{i=1}^{m} \tilde{W}_i^{(k-1)} y_i\, h(x_i; \hat{\theta}_k)\right)$.

Step 2: Set the vote $\alpha_k$ for the new component by minimizing the overall empirical loss:

$J(\alpha, \hat{\theta}_k) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Loss}(y_i f_{k-1}(x_i) + y_i\, \alpha\, h(x_i; \hat{\theta}_k))$  and  $\alpha_k = \arg\min_{\alpha \ge 0} J(\alpha, \hat{\theta}_k)$.

Step 3: Recompute the normalized weights for the next iteration according to

$\tilde{W}_i^{(k)} = -c \cdot dL(\underbrace{y_i f_{k-1}(x_i) + y_i\, \alpha_k\, h(x_i; \hat{\theta}_k)}_{y_i f_k(x_i)})$  for $i = 1, \ldots, m$,

where $c$ is chosen so that $\sum_{i=1}^{m} \tilde{W}_i^{(k)} = 1$.

119.

• Show that the three steps in the algorithm correspond exactly to AdaBoost when the loss function is the exponential loss, $\mathrm{Loss}(z) = \exp(-z)$.

More precisely, show that in this case the setting of $\alpha_k$ based on the new weak classifier and the weight update that yields $\tilde{W}_i^{(k)}$ would be identical to AdaBoost. (In CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.1-5, $\tilde{W}_i^{(k)}$ corresponds to $D_k(i)$.)

Solution

For the first part, see CMU, 2008 fall, Eric Xing, HW3, pr. 4.1.1.

For the second part, note that the weight assignment in Step 3 of the general algorithm (for stage $k$) is

$\tilde{W}_i^{(k)} = -c \cdot dL(y_i f_k(x_i)) = c \cdot \exp(-y_i f_k(x_i))$,

since $dL(z) = -\exp(-z)$ for the exponential loss. This is the same as in AdaBoost (see CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr. 2.2).

120.
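To make the equivalence concrete, here is a minimal sketch in Python (my own illustration, not part of the original solution): the three generic steps are instantiated with the exponential loss on one-dimensional data and thresholded stumps found by brute force; the closed-form $\alpha_k = \frac{1}{2}\ln\frac{1-\varepsilon_k}{\varepsilon_k}$ used in Step 2 is the exponential-loss minimizer, i.e., AdaBoost's vote.

  import numpy as np

  rng = np.random.default_rng(3)
  m, K = 30, 10
  x = np.sort(rng.normal(size=m))
  y = np.where(np.abs(x) < 0.7, 1, -1)                 # a labelling no single stump can fit

  def best_stump(x, y, w):
      """Step 1: pick the thresholded stump maximizing the weighted agreement sum_i w_i y_i h(x_i)."""
      thresholds = np.concatenate(([x.min() - 1.0], (x[:-1] + x[1:]) / 2.0))
      best = None
      for s in thresholds:
          for sign in (+1, -1):
              h = sign * np.where(x >= s, 1, -1)
              agree = np.sum(w * y * h)
              if best is None or agree > best[0]:
                  best = (agree, s, sign)
      return best[1], best[2]

  f = np.zeros(m)                                      # current ensemble values f_k(x_i)
  w = np.full(m, 1.0 / m)                              # normalized weights W~_i (initially 1/m, as in AdaBoost)
  for k in range(K):
      s, sign = best_stump(x, y, w)
      h = sign * np.where(x >= s, 1, -1)
      eps = 0.5 * (1.0 - np.sum(w * y * h))            # weighted training error of the chosen stump
      alpha = 0.5 * np.log((1 - eps) / eps)            # Step 2: argmin_alpha J(alpha) for the exponential loss
      f += alpha * h                                   # add the new component to the ensemble
      w = np.exp(-y * f)                               # Step 3: W_i = -dL(y_i f_k(x_i)) = exp(-y_i f_k(x_i))
      w /= w.sum()                                     #         normalize

  print("training error:", np.mean(np.sign(f) != y))   # typically drops to 0 within a few rounds

With the exponential loss, the Step 3 update equals, up to normalization, $\tilde{W}_i^{(k-1)} \exp(-\alpha_k y_i h(x_i; \hat{\theta}_k))$, which is exactly AdaBoost's rule $D_{k+1}(i) \propto D_k(i)\exp(-\alpha_k y_i h_k(x_i))$.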