Machine Learning
Lecture # 4: Multilayer Perceptron & Decision Trees
Source: biomisa.org/uploads/2014/06/Lect-4.pdf

TRANSCRIPT

Page 1:

1

Machine Learning

Lecture # 4: Multilayer Perceptron & Decision Trees

Page 2:

Artificial Neural Networks (ANN)

• Neural computing requires a number of neurons to be connected together into a neural network.

• A neural network consists of:
  – layers
  – links between layers

• The links are weighted.

• There are three kinds of layers:
  1. input layer
  2. hidden layer
  3. output layer

Page 3:

From Human Neurones to Artificial Neurones

Page 4:

A simple neuron

• At each neuron, every input has an associated weight which modifies the strength of each input.

• The neuron simply adds together all the inputs and calculates an output to be passed on.
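As an illustration (my sketch, not part of the slides), such a neuron is just a weighted sum followed by an activation function; the weights and inputs below are borrowed from the worked example later in the lecture.

import math

def sigmoid(x):
    # logistic activation: squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights):
    # weighted sum of the inputs, then the activation function
    s = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(s)

# weights of hidden unit H1 from the worked example later in this lecture
print(neuron_output([10, 30, 20], [0.2, -0.1, 0.4]))   # sigmoid(7) ≈ 0.999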

Page 5:

Activation function

Page 6:

Multi-Layer Perceptron (MLP)

Page 7:

Motivation

• Perceptrons are limited because they can only solve problems that are linearly separable.

• We would like to build more complicated learning machines to model our data.

• One way to do this is to build multiple layers of perceptrons.

Page 8:

Brief History

• 1985 Ackley, Hinton and Sejnowski propose the Boltzmann machine

– This was a multi-layer step perceptron

– More powerful than perceptron

– Successful application NETtalk

• 1986 Rumelhart, Hinton and Williams invent the Multi-Layer Perceptron (MLP) with backpropagation

– Dominant neural net architecture for 10 years

Page 9:

Multi layer networks

• So far we discussed networks with one layer.

• But these networks can be extended to combine several layers, increasing the set of functions that can be represented using a NN

Page 10:

MLP

Page 11:
Page 12:

Multilayer Neural Network

Page 13:

Sigmoid Response Functions

Page 14:

MLP

Page 15:

Simple example: AND

x1  x2  x1 AND x2
0   0   0
0   1   0
1   0   0
1   1   1

Page 16:

Example: OR function

x1  x2  x1 OR x2
0   0   0
0   1   1
1   0   1
1   1   1

[Figure: a single sigmoid unit with bias weight -10 and input weights 20, 20 computes OR]

Page 17:

Negation:

x   NOT x
0   1
1   0

[Figure: a single sigmoid unit with a large positive bias and input weight -20 computes NOT x]

Page 18:

Putting it together:

x1  x2
0   0
0   1
1   0
1   1

[Figure: a two-layer network combining the units above. Hidden unit 1 weights: (-30, 20, 20); hidden unit 2 weights: (10, -20, -20); output unit weights: (-10, 20, 20).]
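The weights in the figure can be checked directly. The sketch below (my reading of the figure, not code from the slides) evaluates each sigmoid unit on the four input pairs; with these weights the hidden units behave like x1 AND x2 and (NOT x1) AND (NOT x2), and the output unit ORs them, so the network computes XNOR(x1, x2).

import math

def unit(bias, w1, w2, x1, x2):
    # a single sigmoid unit with a bias weight and two input weights
    return 1.0 / (1.0 + math.exp(-(bias + w1 * x1 + w2 * x2)))

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = unit(-30, 20, 20, x1, x2)    # ≈ x1 AND x2
        a2 = unit(10, -20, -20, x1, x2)   # ≈ (NOT x1) AND (NOT x2)
        out = unit(-10, 20, 20, a1, a2)   # ≈ a1 OR a2, i.e. XNOR(x1, x2)
        print(x1, x2, round(out))         # prints 1, 0, 0, 1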

Page 19:

Example of multilayer Neural Network

Page 20:

• Suppose input values are 10, 30, 20

• The weighted sum coming into H1

SH1 = (0.2 * 10) + (-0.1 * 30) + (0.4 * 20)

= 2 -3 + 8 = 7.

• The σ function is applied to SH1:

σ(SH1) = 1/(1+e^-7) = 1/(1+0.000912) = 0.999

• Similarly, the weighted sum coming into H2:

SH2 = (0.7 * 10) + (-1.2 * 30) + (1.2 * 20)

= 7 - 36 + 24 = -5

• σ applied to SH2:

σ(SH2) = 1/(1+e^5) = 1/(1+148.4) = 0.0067

Page 21:

• Now the weighted sum to output unit O1 :

SO1 = (1.1 * 0.999) + (0.1*0.0067) = 1.0996

• The weighted sum to output unit O2:

SO2 = (3.1 * 0.999) + (1.17*0.0067) = 3.1047

• The output from the sigmoid unit O1:

σ(SO1) = 1/(1+e^-1.0996) = 1/(1+0.333) = 0.750

• The output from the network for O2:

σ(SO2) = 1/(1+e^-3.1047) = 1/(1+0.045) = 0.957

• The input triple (10, 30, 20) would be categorised with O2, because this has the larger output.
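A small sketch (mine, following the numbers on these slides) that propagates (10, 30, 20) through the network and reproduces the values above:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs = [10, 30, 20]
w_hidden = {"H1": [0.2, -0.1, 0.4], "H2": [0.7, -1.2, 1.2]}   # input -> hidden weights
w_output = {"O1": [1.1, 0.1], "O2": [3.1, 1.17]}              # hidden -> output weights

hidden = {}
for name, w in w_hidden.items():
    s = sum(wi * xi for wi, xi in zip(w, inputs))
    hidden[name] = sigmoid(s)
    print(name, s, round(hidden[name], 4))        # H1: 7 -> 0.999, H2: -5 -> 0.0067

for name, w in w_output.items():
    s = w[0] * hidden["H1"] + w[1] * hidden["H2"]
    print(name, round(s, 4), round(sigmoid(s), 3))   # O1: 1.0996 -> 0.750, O2: 3.1047 -> 0.957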

Page 22:

Training Parametric Model

Page 23:

Minimizing Error

Page 24:

Least Squares Gradient

Page 25:

Single Layer Perceptron

Page 26:

Single layer Perceptrons

Page 27:

Different Response Functions

Page 28:

Learning a Logistic Perceptron

Page 29:

Back Propagation

Page 30:

Back Propagation

Page 31:
Page 32:

A Worked Example:

• Propagated the values (10,30,20) through the network

• Suppose now that the target categorization for the example was the one associated with O1 (using a learning rate of η = 0.1)

• The target output for O1 was 1, and the target output for O2 was 0

• t1(E) = 1; t2(E) = 0; o1(E) = 0.750; o2(E) = 0.957

• Error values for the output units O1 and O2:
  – δO1 = o1(E)(1 - o1(E))(t1(E) - o1(E)) = 0.750(1-0.750)(1-0.750) = 0.0469
  – δO2 = o2(E)(1 - o2(E))(t2(E) - o2(E)) = 0.957(1-0.957)(0-0.957) = -0.0394

Input units        Hidden units                          Output units
Unit  Output       Unit  Weighted Sum Input   Output     Unit  Weighted Sum Input   Output
I1    10           H1    7                    0.999      O1    1.0996               0.750
I2    30           H2    -5                   0.0067     O2    3.1047               0.957
I3    20

Page 33:

• To propagate this information backwards to the hidden nodes H1 and H2:
  – Multiply the error term for O1 by the weight from H1 to O1, then add this to the multiplication of the error term for O2 and the weight between H1 and O2: (1.1*0.0469) + (3.1*-0.0394) = -0.0706
  – δH1 = -0.0706*(0.999 * (1-0.999)) = -0.0000705
  – Similarly for H2: (0.1*0.0469)+(1.17*-0.0394) = -0.0414
  – δH2 = -0.0414 * (0.067 * (1-0.067)) = -0.00259

A Worked Example:
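The delta computations can be checked with a few lines (my sketch, using the forward-pass values from the table above):

o = {"O1": 0.750, "O2": 0.957}
t = {"O1": 1.0, "O2": 0.0}
h = {"H1": 0.999, "H2": 0.0067}
w_out = {("H1", "O1"): 1.1, ("H1", "O2"): 3.1, ("H2", "O1"): 0.1, ("H2", "O2"): 1.17}

# output deltas: δO = o(1-o)(t-o)
delta_o = {k: o[k] * (1 - o[k]) * (t[k] - o[k]) for k in o}
print(delta_o)          # ≈ {'O1': 0.0469, 'O2': -0.0394}

# hidden deltas: δH = h(1-h) * Σ_k w(H, Ok) * δOk
delta_h = {}
for hu in h:
    back = sum(w_out[(hu, ou)] * delta_o[ou] for ou in o)
    delta_h[hu] = h[hu] * (1 - h[hu]) * back
print(delta_h)          # H1 ≈ -0.0000705; H2 ≈ -0.00028 here
                        # (the slide carries -0.00259, which uses 0.067 rather than 0.0067 for H2's output)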

Page 34:

Input unit Hidden unit η δH xi Δ = η*δH*xi Old weight New weight

I1 H1 0.1 -0.0000705 10 -0.0000705 0.2 0.1999295

I1 H2 0.1 -0.00259 10 -0.00259 0.7 0.69741

I2 H1 0.1 -0.0000705 30 -0.0002115 -0.1 -0.1002115

I2 H2 0.1 -0.00259 30 -0.00777 -1.2 -1.20777

I3 H1 0.1 -0.0000705 20 -0.000141 0.4 0.399859

I3 H2 0.1 -0.00259 20 -0.00518 1.2 1.1948

Hidden unit Output unit η δO hi(E) Δ = η*δO*hi(E) Old weight New weight

H1 O1 0.1 0.0469 0.999 0.00469 1.1 1.10469

H1 O2 0.1 -0.0394 0.999 -0.00394 3.1 3.0961

H2 O1 0.1 0.0469 0.0067 0.0000314 0.1 0.1000314

H2 O2 0.1 -0.0394 0.0067 -0.0000264 1.17 1.16998

A Worked Example:
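Each row of these tables applies the update rule Δw = η·δ·input and w_new = w_old + Δw; a quick check (mine) of one row from each table:

eta = 0.1

# input -> hidden link I1 -> H1: δH1 = -0.0000705, input x = 10, old weight 0.2
print(0.2 + eta * (-0.0000705) * 10)      # 0.1999295

# hidden -> output link H1 -> O2: δO2 = -0.0394, hidden output h = 0.999, old weight 3.1
print(3.1 + eta * (-0.0394) * 0.999)      # ≈ 3.0961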

Page 35:

When to Learn

Page 36:

Online Learning

Page 37:

Batch Learning

Page 38:

Early Stopping

Page 39:

Self Study Examples

Page 40:

XOR Example

Page 41:

Linear separation

Can AND, OR and NOT be represented?

• Is it possible to represent every boolean function by simply combining these?

• Every boolean function can be composed using AND, OR and NOT (or even only NAND).

Page 42:

Linear separation

• How can we learn the XOR function?

Page 43:

Linear separation

X1 X2 XOR

0 0 0

1 0 1

0 1 1

1 1 0

Page 44:

Linear separation

X1 X2 XOR

0 0 0

1 0 1

0 1 1

1 1 0

It is impossible to find the value of Wi to learn

XOR

Page 45:

Linear separation

X1  X2  X1*X2  XOR
0   0   0      0
1   0   0      1
0   1   0      1
1   1   1      0

So we learned W1, W2 and W3

Page 46:

Example: Back Propagation learning the XOR function

• Training samples (bipolar)

• Network: 2-2-1 with thresholds (fixed output 1)

in_1 in_2 d

P0 -1 -1 -1

P1 -1 1 1

P2 1 -1 1

P3 1 1 -1

• Initial weights W(0)

• Learning rate = 0.2

• Node function: hyperbolic tangent

w^(2,1)  = (-1, 1, 1)
w1^(1,0) = (-0.5, 0.5, -0.5)
w2^(1,0) = (-0.5, -0.5, 0.5)

Node function:
g(x) = (1 - e^-x) / (1 + e^-x);  g(x) → 1 as x → ∞,  g(x) → -1 as x → -∞
g(x) = 2s(x) - 1, where s(x) = 1/(1 + e^-x)
s'(x) = s(x)(1 - s(x))
g'(x) = 0.5(1 + g(x))(1 - g(x))

[Figure: the 2-2-1 network with inputs x1, x2 and a bias input, hidden outputs x1^(1) and x2^(1), output o; weight layers W^(1,0) (input to hidden) and W^(2,1) (hidden to output); training pattern p_j presented at the inputs]

Page 47:

Forward computing (present P0 = (1, -1, -1), target d = -1):

net1^(1) = w1^(1,0) · p0 = (-0.5, 0.5, -0.5) · (1, -1, -1) = -0.5
net2^(1) = w2^(1,0) · p0 = (-0.5, -0.5, 0.5) · (1, -1, -1) = -0.5
x1^(1) = g(net1^(1)) = 2/(1 + e^0.5) - 1 = -0.24492
x2^(1) = g(net2^(1)) = 2/(1 + e^0.5) - 1 = -0.24492
net^(2,1) = w^(2,1) · (1, x1^(1), x2^(1)) = (-1, 1, 1) · (1, -0.24492, -0.24492) = -1.48984
o = g(net^(2,1)) = -0.63211

Error back propagation:

l = d - o = -1 - (-0.63211) = -0.36789
δ  = l · (1 - o²) = -0.36789 · (1 - 0.63211²) = -0.2209
δ1 = w1^(2,1) · δ · (1 - (x1^(1))²) = 1 · (-0.2209) · (1 - 0.24492²) = -0.20765
δ2 = w2^(2,1) · δ · (1 - (x2^(1))²) = 1 · (-0.2209) · (1 - 0.24492²) = -0.20765

Page 48:

Weight update (η = 0.2):

Δw^(2,1)  = η · δ  · (1, x1^(1), x2^(1)) = 0.2 · (-0.2209) · (1, -0.2449, -0.2449) = (-0.0442, 0.0108, 0.0108)
Δw1^(1,0) = η · δ1 · p0 = 0.2 · (-0.2077) · (1, -1, -1) = (-0.0415, 0.0415, 0.0415)
Δw2^(1,0) = η · δ2 · p0 = 0.2 · (-0.2077) · (1, -1, -1) = (-0.0415, 0.0415, 0.0415)

w^(2,1)  = w^(2,1)  + Δw^(2,1)  = (-1, 1, 1) + (-0.0442, 0.0108, 0.0108) = (-1.0442, 1.0108, 1.0108)
w1^(1,0) = w1^(1,0) + Δw1^(1,0) = (-0.5, 0.5, -0.5) + (-0.0415, 0.0415, 0.0415) = (-0.5415, 0.5415, -0.4585)
w2^(1,0) = w2^(1,0) + Δw2^(1,0) = (-0.5, -0.5, 0.5) + (-0.0415, 0.0415, 0.0415) = (-0.5415, -0.4585, 0.5415)

Error l² for P0 reduced from 0.135345 to 0.102823
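As a check (my own sketch, not part of the slides), one backpropagation step on P0 with these initial weights reproduces the numbers above; the derivative factor (1 - g^2) follows the arithmetic used on the slides.

import math

def g(x):
    # bipolar node function used in the example
    return (1 - math.exp(-x)) / (1 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

eta = 0.2
p0, d = [1, -1, -1], -1                           # bias input first
w1, w2 = [-0.5, 0.5, -0.5], [-0.5, -0.5, 0.5]     # input -> hidden weights
wo = [-1, 1, 1]                                   # hidden -> output weights

# forward pass
x1, x2 = g(dot(w1, p0)), g(dot(w2, p0))           # both ≈ -0.24492
h = [1, x1, x2]
o = g(dot(wo, h))                                 # ≈ -0.63211

# backward pass (derivative factor 1 - g^2, as in the worked numbers)
delta_o = (d - o) * (1 - o * o)                   # ≈ -0.2209
delta_1 = wo[1] * delta_o * (1 - x1 * x1)         # ≈ -0.2077
delta_2 = wo[2] * delta_o * (1 - x2 * x2)         # ≈ -0.2077

# weight update
wo = [w + eta * delta_o * hi for w, hi in zip(wo, h)]     # ≈ (-1.0442, 1.0108, 1.0108)
w1 = [w + eta * delta_1 * pi for w, pi in zip(w1, p0)]    # ≈ (-0.5415, 0.5415, -0.4585)
w2 = [w + eta * delta_2 * pi for w, pi in zip(w2, p0)]    # ≈ (-0.5415, -0.4585, 0.5415)
print(wo, w1, w2)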

Page 49:

[Figure: MSE reduction, plotted every 10 epochs; the MSE falls from about 1.4 towards 0 over roughly 190 epochs]

Output: every 10 epochs

epoch    1      10     20     40     90     140    190    d
P0      -0.63  -0.05  -0.38  -0.77  -0.89  -0.92  -0.93   -1
P1      -0.63  -0.08   0.23   0.68   0.85   0.89   0.90    1
P2      -0.62  -0.16   0.15   0.68   0.85   0.89   0.90    1
P3      -0.38   0.03  -0.37  -0.77  -0.89  -0.92  -0.93   -1
MSE      1.44   1.12   0.52   0.074  0.019  0.010  0.007

Page 50:

After epoch 1 (weights after presenting each training pattern):

         w1^(1,0)                      w2^(1,0)                      w^(2,1)
init     (-0.5, 0.5, -0.5)             (-0.5, -0.5, 0.5)             (-1, 1, 1)
p0       (-0.5415, 0.5415, -0.4585)    (-0.5415, -0.4585, 0.5415)    (-1.0442, 1.0108, 1.0108)
p1       (-0.5732, 0.5732, -0.4266)    (-0.5732, -0.4268, 0.5732)    (-1.0787, 1.0213, 1.0213)
p2       (-0.3858, 0.7607, -0.6142)    (-0.4617, -0.3152, 0.4617)    (-0.8867, 1.0616, 0.8952)
p3       (-0.4591, 0.6874, -0.6875)    (-0.5228, -0.3763, 0.4005)    (-0.9567, 1.0699, 0.9061)

Weights at later epochs:

# Epoch  w1^(1,0)                      w2^(1,0)                      w^(2,1)
13       (-1.4018, 1.4177, -1.6290)    (-1.5219, -1.8368, 1.6367)    (0.6917, 1.1440, 1.1693)
40       (-2.2827, 2.5563, -2.5987)    (-2.3627, -2.6817, 2.6417)    (1.9870, 2.4841, 2.4580)
90       (-2.6416, 2.9562, -2.9679)    (-2.7002, -3.0275, 3.0159)    (2.7061, 3.1776, 3.1667)
190      (-2.8594, 3.18739, -3.1921)   (-2.9080, -3.2403, 3.2356)    (3.1995, 3.6531, 3.6468)

Page 51:

Decision Trees

Page 52:

Decision Tree Classifier

Ross Quinlan

[Figure: insects plotted by Antenna Length (vertical axis, 1-10) against Abdomen Length (horizontal axis, 1-10), separated by the decision tree below]

Abdomen Length > 7.1?
  yes → Katydid
  no  → Antenna Length > 6.0?
          yes → Katydid
          no  → Grasshopper

Page 53:

• An inductive learning task
  – Use particular facts to make more generalized conclusions

• A predictive model based on a branching series of Boolean tests
  – These smaller Boolean tests are less complex than a one-stage classifier

• Let’s look at a sample decision tree…

What is a Decision Tree?

Page 54:

Predicting Commute Time

Leave At?
  8 AM  → Long
  9 AM  → Accident?  (No → Medium, Yes → Long)
  10 AM → Stall?     (No → Short, Yes → Long)

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?

Page 55:

Inductive Learning

• In this decision tree, we made a series of Boolean decisions and followed the corresponding branch:
  – Did we leave at 10 AM?
  – Did a car stall on the road?
  – Is there an accident on the road?

• By answering each of these yes/no questions, we then came to a conclusion on how long our commute might take

Page 56:

Decision Trees as Rules

• We did represent this tree graphically

• We could have represented it as a set of rules; however, the rules may be much harder to read…

Page 57:

Decision Tree as a Rule Set

if hour == 8am
    commute time = long
else if hour == 9am
    if accident == yes
        commute time = long
    else
        commute time = medium
else if hour == 10am
    if stall == yes
        commute time = long
    else
        commute time = short

• Notice that all attributes do not have to be used in each path of the decision tree.

• As we will see, all attributes may not even appear in the tree.
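A runnable rendering (mine) of the rule set above as a Python function:

def commute_time(hour, accident, stall):
    # hour is "8am", "9am" or "10am"; accident and stall are booleans
    if hour == "8am":
        return "long"
    if hour == "9am":
        return "long" if accident else "medium"
    if hour == "10am":
        return "long" if stall else "short"

print(commute_time("10am", accident=False, stall=False))   # short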

Page 58:

Weather Example

Page 59:

Objective

• From a set of observations, the objective is to predict whether we will be able to play tennis, based on the past examples (inductive principle)
  – For this, we will automatically build a decision tree
  – The decision concerns the Play Tennis attribute. It splits the dataset into two classes: play = Yes and play = No

Page 60:

Attribute-values

• 14 instances described by 4 categorical (nominal) attributes.

• Each attribute is associated with a set of values.

• One attribute is selected to be the class, for which we make a decision.

Page 61:

Decision Tree

• An internal node is a test on an attribute

• A branch represents an outcome of the test, e.g. outlook = sunny

• A leaf node represents a class label or class label distribution

• At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible

• A new case is classified by following a matching path to a leaf node

Page 62:

Building a Decision Tree

Page 63:

Building a Decision Tree

• One approach is to generate all possible trees and pick the best one, but this is too expensive in general!

• There must be a better way:
  – explore top-down or bottom-up
  – to form a decision tree

• The main problem:
  – during construction, at each step, choose a good attribute on which to perform a test

Page 64:

Building a Decision Tree

Top-down Tree Construction
  Initially, all the training examples are at the root.
  Then, the examples are recursively partitioned, by choosing one attribute at a time.

Bottom-up Tree Pruning
  Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

Page 65:

When Should Building Stop?

• There are several possible stopping criteria:
  – All samples for a given node belong to the same class
  – If there are no remaining attributes for further partitioning, majority voting is employed
  – There are no samples left
  – Or there is nothing to gain in splitting

Page 66:

Decision Tree Algorithms

• The basic idea behind any decision tree algorithm is as follows (a rough sketch in code follows this list):
  – Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
  – Repeat this process recursively for each child
  – Stop when:
    • All the instances have the same target attribute value
    • There are no more attributes
    • There are no more instances
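A rough, self-contained sketch of that recursion (my own reading of these bullets, not the lecture's code); it selects the splitting attribute by information gain, which the following slides define. It can be run, for example, on the Sample Experience Table on the next slide.

from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def gain(examples, attr, target):
    # information gain = entropy before the split minus the weighted entropy after it
    before = entropy([ex[target] for ex in examples])
    after = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attr] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

def build_tree(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:              # all instances share the same target value
        return labels[0]
    if not attributes:                     # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    branches = {}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, rest, target)
    return (best, branches)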

Page 67:

Sample Experience Table

Example   Hour    Weather   Accident   Stall   Commute (target)

D1 8 AM Sunny No No Long

D2 8 AM Cloudy No Yes Long

D3 10 AM Sunny No No Short

D4 9 AM Rainy Yes No Long

D5 9 AM Sunny Yes Yes Long

D6 10 AM Sunny No No Short

D7 10 AM Cloudy No No Short

D8 9 AM Rainy No No Medium

D9 9 AM Sunny Yes No Long

D10 10 AM Cloudy Yes Yes Long

D11 10 AM Rainy No No Short

D12 8 AM Cloudy Yes No Long

D13 9 AM Sunny No No Medium

Page 68:

Predicting Commute Time

Leave At?
  8 AM  → Long
  9 AM  → Accident?  (No → Medium, Yes → Long)
  10 AM → Stall?     (No → Short, Yes → Long)

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?

Page 69:

Choosing Attributes

• The previous experience decision table showed 4 attributes: hour, weather, accident and stall

• But the decision tree only showed 3 attributes: hour, accident and stall

• Why is that?

Page 70:

Choosing Attributes

• Methods for selecting attributes (which will be described later) show that weather is not a discriminating attribute

Page 71:

Choosing Attributes

• The basic structure of creating a decision tree is the same for most decision tree algorithms

• The difference lies in how we select the attributes for the tree

• We will focus on the ID3 algorithm developed by Ross Quinlan in 1975

Page 72:

Identifying the Best Attributes

• Refer back to our original decision tree

Leave At?
  8 AM  → Long
  9 AM  → Accident?  (No → Medium, Yes → Long)
  10 AM → Stall?     (No → Short, Yes → Long)

How did we know to split on "leave at", and then on stall and accident, and not on weather?

Page 73:

Which is the splitting (best) attribute?

Page 74:

ID3 Heuristic

• To determine the best attribute, we look at the ID3 heuristic

• ID3 splits attributes based on their entropy.

• Entropy is a measure of disorder (lack of information)…

Page 75:

Entropy

• Entropy is minimized when all values of the target attribute are the same.
  – If we know that commute time will always be short, then entropy = 0

• Entropy is maximized when there is an equal chance of all values for the target attribute (i.e. the result is random)
  – If commute time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is maximized

Page 76:

Entropy

• Calculation of entropy (a small numeric check follows):
  – Entropy(S) = Σ(i = 1 to l) −(|Si|/|S|) · log2(|Si|/|S|)
    • S = set of examples
    • Si = subset of S with value vi under the target attribute
    • l = size of the range of the target attribute
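For instance (my sketch), using the commute-time examples from the earlier Sample Experience Table, which contain 7 Long, 4 Short and 2 Medium cases:

import math

def entropy(counts):
    # counts: number of examples for each value of the target attribute
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

print(entropy([7, 4, 2]))   # ≈ 1.42 bits for the 13 commute-time examples
print(entropy([5, 0]))      # 0: all examples share one value
print(entropy([3, 3, 3]))   # ≈ 1.585: maximal for three equally likely values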

Page 77:

The Entropy Function Relative to Boolean Classification

[Figure: entropy of a Boolean classification plotted against the proportion of positive examples; it is 0 at proportions 0.0 and 1.0 and peaks at 1.0 when the proportion is 0.5. Example taken from Tom Mitchell's Machine Learning.]

Page 78:

ID3

• ID3 splits on the attribute with the lowest expected entropy after the split

• Calculate the expected entropy for an attribute as the weighted sum of subset entropies:
  – ∑(i = 1 to k) (|Si|/|S|) · Entropy(Si), where S1 … Sk are the subsets produced by the k values of the attribute

• We can also measure information gain (which is largest when the expected entropy after the split is smallest) as follows:
  – Entropy(S) − ∑(i = 1 to k) (|Si|/|S|) · Entropy(Si)

Page 79:

ID3

• Given our commute time sample set, we can calculate the entropy of each attribute at the root node

Attribute Expected Entropy Information Gain

Hour 0.6511 0.768449

Weather 1.28884 0.130719

Accident 0.92307 0.496479

Stall 1.17071 0.248842
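The table above can be reproduced from the Sample Experience Table; a sketch (mine, not from the slides), with the 13 rows re-encoded as tuples:

import math

def entropy(labels):
    total = len(labels)
    return -sum(labels.count(v) / total * math.log2(labels.count(v) / total)
                for v in set(labels))

# (hour, weather, accident, stall, commute) rows D1..D13 from the experience table
data = [
    ("8am", "sunny", "no", "no", "long"),    ("8am", "cloudy", "no", "yes", "long"),
    ("10am", "sunny", "no", "no", "short"),  ("9am", "rainy", "yes", "no", "long"),
    ("9am", "sunny", "yes", "yes", "long"),  ("10am", "sunny", "no", "no", "short"),
    ("10am", "cloudy", "no", "no", "short"), ("9am", "rainy", "no", "no", "medium"),
    ("9am", "sunny", "yes", "no", "long"),   ("10am", "cloudy", "yes", "yes", "long"),
    ("10am", "rainy", "no", "no", "short"),  ("8am", "cloudy", "yes", "no", "long"),
    ("9am", "sunny", "no", "no", "medium"),
]
names = ["hour", "weather", "accident", "stall"]
base = entropy([row[-1] for row in data])                 # ≈ 1.42

for i, name in enumerate(names):
    expected = 0.0
    for value in set(r[i] for r in data):
        subset = [r[-1] for r in data if r[i] == value]
        expected += len(subset) / len(data) * entropy(subset)
    print(name, round(expected, 4), round(base - expected, 4))
# hour 0.6511 0.7684, weather 1.2888 0.1307, accident 0.9231 0.4965, stall 1.1707 0.2488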

Page 80:

Entropy: weather example

Page 81:

Which is the splitting (best) attribute?

Page 82:

Which is the splitting (best) attribute?

• At each node, available attributes are evaluated on the basis of separating the classes of the training examples

• A purity or impurity measure is used for this purpose

• Information Gain: increases with the average purity of the subsets that an attribute produces

• Splitting strategy: choose the attribute that results in the greatest information gain

• Typical goodness functions: information gain (ID3), information gain ratio (C4.5), Gini index (CART)

Page 83:

Which is the splitting (best) attribute?

Page 84:

Which is the splitting (best) attribute?

• Entropy is a measure of the disorder prevailing in a collection of objects. If all objects belong to the same class, there is no disorder.

• Quinlan proposed to select the attribute that minimizes the disorder of the resulting partition.

Page 85:

The attribute “outlook”

• “outlook” = “sunny”

• “outlook” = “overcast”

• “outlook” = “rainy”

• Expected information for attribute

Page 86:

Information Gain

Difference between the information before the split and the information after the split:
  The information before the split, info(D), is the entropy.
  The information after the split using attribute A is computed as the weighted sum of the entropies on each split, given n splits.

Page 87:

Information Gain

• Difference between the information before the split and the information after the split

• Information gain for the attributes from the weather data (the sketch below recomputes the first of these):
  – gain("outlook") = 0.247
  – gain("temperature") = 0.029
  – gain("humidity") = 0.152
  – gain("windy") = 0.048
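A quick check (mine) of the first value, assuming the standard 14-example play-tennis counts that these figures correspond to (9 yes / 5 no overall; outlook = sunny: 2 yes / 3 no, overcast: 4 / 0, rainy: 3 / 2):

import math

def entropy(pos, neg):
    out = 0.0
    for c in (pos, neg):
        if c:
            p = c / (pos + neg)
            out -= p * math.log2(p)
    return out

before = entropy(9, 5)                                                   # ≈ 0.940
splits = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}
after = sum((p + n) / 14 * entropy(p, n) for p, n in splits.values())    # ≈ 0.693
print(round(before - after, 3))                                          # gain("outlook") ≈ 0.247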

Page 88:
Page 89:

The Final Decision Tree

• Not all the leaves need to be pure

• Splitting stops when data can not be split any further

Page 90:

Rule extraction from Tree

if

Then PlayTennis = yes

Page 91:

ID3 in Gaming

• Black & White, developed by Lionhead Studios, and released in 2001 used ID3

• Used to predict a player’s reaction to a certain creature’s action

• In this model, a greater feedback value means the creature should attack

Page 92:

ID3 in Black & White

Example Attributes Target

Allegiance Defense Tribe Feedback

D1 Friendly Weak Celtic -1.0

D2 Enemy Weak Celtic 0.4

D3 Friendly Strong Norse -1.0

D4 Enemy Strong Norse -0.2

D5 Friendly Weak Greek -1.0

D6 Enemy Medium Greek 0.2

D7 Enemy Strong Greek -0.4

D8 Enemy Medium Aztec 0.0

D9 Friendly Weak Aztec -1.0

Page 93:

ID3 in Black & White

Allegiance?
  Friendly → -1.0
  Enemy    → Defense?
               Weak   → 0.4
               Medium → 0.1
               Strong → -0.3

Note that this decision tree does not even use the tribe attribute
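The leaf values are just the mean feedback of the matching training rows; a quick check (my sketch) against the table on the previous slide:

# (allegiance, defense, feedback) rows D1..D9
rows = [
    ("friendly", "weak", -1.0),  ("enemy", "weak", 0.4),     ("friendly", "strong", -1.0),
    ("enemy", "strong", -0.2),   ("friendly", "weak", -1.0), ("enemy", "medium", 0.2),
    ("enemy", "strong", -0.4),   ("enemy", "medium", 0.0),   ("friendly", "weak", -1.0),
]

def mean_feedback(allegiance, defense=None):
    vals = [f for a, d, f in rows if a == allegiance and (defense is None or d == defense)]
    return sum(vals) / len(vals)

print(mean_feedback("friendly"))            # -1.0
print(mean_feedback("enemy", "weak"))       # 0.4
print(mean_feedback("enemy", "medium"))     # ≈ 0.1 (average of 0.2 and 0.0)
print(mean_feedback("enemy", "strong"))     # ≈ -0.3 (average of -0.2 and -0.4)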

Page 94:

ID3 in Black & White

• Now suppose we don’t want the entire decision tree, but we just want the 2 highest feedback values

• We can create a Boolean expression, such as:
  ((Allegiance = Enemy) ^ (Defense = Weak)) v ((Allegiance = Enemy) ^ (Defense = Medium))

Page 95:

Deciding when a tree is complete

• Continue splitting nodes until some goodness-of-split criterion fails to be met:
  – when the quality of a particular split falls below the threshold, the tree is not grown further along that branch.
  – when all branches from the root reach terminal nodes, the tree is complete.

Page 96:

Deciding when a tree is complete

• Grow the tree too large and then prune nodes off.
  – after tree construction stops, create a sequence of subtrees from the original tree
    • choose one subtree for each possible number of leaves (the subtree chosen with p leaves has the best assessment value of all candidate subtrees with p leaves)
  – once the sequence of subtrees is established, select which subtree to use according to some criterion
    • best assessment value

Page 97:

Evaluation methodology

• Standard methodology:
  1. Collect a large set of examples (all with correct classifications)
  2. Randomly divide the collection into two disjoint sets: training and test
  3. Apply the learning algorithm to the training set, giving hypothesis H
  4. Measure the performance of H w.r.t. the test set
  Important: keep the training and test sets disjoint!

• To study the efficiency and robustness of an algorithm, repeat steps 2-4 for different training sets and sizes of training sets

• If you improve your algorithm, start again with step 1 to avoid evolving the algorithm to work well on just this collection

Page 98:

Another Version of the Weather Dataset

Page 99:

Decision Tree for the New Dataset

• Entropy for splitting using “ID Code” is zero, since each leaf node is “pure”

• Information Gain is thus maximal for ID code

Page 100:

Highly-Branching attributes

• Attributes with a large number of values are usually problematic
  – e.g. IDs, primary keys, or almost-primary-key attributes

• Subsets are likely to be pure if there is a large number of values

• Information Gain is biased towards choosing attributes with a large number of values

• This may result in overfitting (selection of an attribute that is non-optimal for prediction)

Page 101:

Solution: Information Gain Ratio

• Modification of the Information Gain that reduces the bias toward highly-branching attributes

• Information Gain Ratio should be

– Large when data is evenly spread

– Small when all data belong to one branch

• Information Gain Ratio takes number and size of branches into account when choosing an attribute

• It corrects the information gain by taking the intrinsic information of a split into account

Page 102:

Information Gain Ratio and Intrinsic Information

• Intrinsic information: the entropy of the distribution of instances into branches

• Information Gain Ratio normalizes the Information Gain by this intrinsic information

Page 103:

Computing the Information Gain Ratio

• The intrinsic information for ID code is

• Importance of attribute decreases as intrinsic information gets larger

• The Information gain ratio of “ID code”,
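Assuming the 14-instance weather data with one unique ID code per instance (so the split produces 14 singleton branches) and an information content of 0.940 bits before the split, the quantities on this slide work out as follows (my sketch):

import math

n = 14                                     # one branch per instance
intrinsic = -sum(1 / n * math.log2(1 / n) for _ in range(n))   # log2(14) ≈ 3.807
gain_id = 0.940                            # entropy before the split; after the split it is 0
print(round(intrinsic, 3))                 # 3.807
print(round(gain_id / intrinsic, 3))       # gain ratio of "ID code" ≈ 0.247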

Page 104:

Information Gain Ratio for Weather Data

Page 105:

Acknowledgements

Material in these slides has been taken from the following resources:

Introduction to Machine Learning, Alpaydin
Statistical Pattern Recognition: A Review – A.K. Jain et al., PAMI (22) 2000
Pattern Recognition and Analysis Course – A.K. Jain, MSU
"Pattern Classification" by Duda et al., John Wiley & Sons
http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture13.html
Some material adopted from Dr. Adam Prugel-Bennett, Dr. Andrew Ng and Dr. Amanullah's slides