
Page 1

Artificial Intelligence

MSc Programs in: Computer Engineering, Cybersecurity and Artificial Intelligence

Electronic Engineering

Academic year 2019/2020
Instructor: Giorgio Fumera

The ID3 Decision Tree Learning Algorithm

Page 2

      Y   X*  X1
M1    L   1   1
M2    L   1   1
M3    L   1   0
M4    L   1   0
M5    L   1   0
M6    S   0   0
M7    S   0   0
M8    S   0   0
M9    S   0   1
M10   S   0   1

A possible training set for a spam/legitimate email classifier:
● ten examples (M1-M10): five legitimate (L) and five spam (S) emails
● two boolean attributes, X* and X1, denoting the occurrence of two given words in an email

Page 3

(training set as on page 2)

[Tree: root X1 (5/5); X1=0 → 3/3, X1=1 → 2/2]

If X1 is chosen for the root node of the decision tree, the 5/5 examples in the training set are split according to its values into 3/3 and 2/2. Intuitively, this means that X1 has no discriminant capability: the corresponding word has the same probability of occurring both in legitimate and in spam emails...

Page 4

(training set as on page 2)

[Tree: root X1 (5/5); X1=0 → 3/3, X1=1 → 2/2, both branches still mixed]

... therefore, it is not possible to stop the construction of the DT by inserting two leaf nodes, since the resulting DT would misclassify some training examples. For the DT to be consistent with the above training set, it is necessary to add two new nodes.

Page 5

(training set as on page 2)

[Tree: root X* (5/5); X*=1 → 5/0, X*=0 → 0/5]

If X* is chosen for the root of the DT instead, training examples are split according to its values into 5/0 and 0/5...

Page 6

(training set as on page 2)

[Tree: root X* (5/5); X*=1 → 5/0, leaf L; X*=0 → 0/5, leaf S]

... this allows one to obtain the smallest possible DT, consistent with the training set, by inserting two leaves just below the root node. This means that X* is a perfectly discriminant attribute.

Page 7

      Y   X1  X2  X3  X4  X5
M1    L   1   0   1   0   1
M2    L   1   0   1   1   0
M3    L   0   0   1   1   1
M4    L   0   1   0   0   0
M5    L   0   1   1   1   0
M6    S   0   0   0   0   0
M7    S   0   1   0   1   1
M8    S   0   1   1   1   1
M9    S   1   1   0   1   1
M10   S   1   1   1   1   1

In practice, perfectly discriminant attributes are very rare. In the table above, the same emails are represented using five attributes (words), none of which is perfectly discriminant. In the following, a widely used learning algorithm for DTs, named ID3, is sketched.

Page 8

(training set as on page 7)

[Tree: root attribute to be chosen (5/5)]

The learning algorithm starts by constructing the root node of the DT. The corresponding attribute has to be chosen from all the available ones. ID3 makes this choice by looking for the most discriminant attribute, i.e., the one whose values split the training examples as much as possible according to their class. This favours the construction of a small and consistent DT.
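To make the procedure concrete, the following is a minimal Python sketch of ID3 as described in these slides, applied to the training set of page 7. The data layout and the names (TRAINING_SET, build_tree, etc.) are illustrative assumptions, not part of the original slides; the entropy-based measure of discriminant capability it uses is made precise on pages 22-29.

from collections import Counter
from math import log2

# The ten training emails of page 7, as (class label, attribute values) pairs.
TRAINING_SET = [
    ("L", {"X1": 1, "X2": 0, "X3": 1, "X4": 0, "X5": 1}),  # M1
    ("L", {"X1": 1, "X2": 0, "X3": 1, "X4": 1, "X5": 0}),  # M2
    ("L", {"X1": 0, "X2": 0, "X3": 1, "X4": 1, "X5": 1}),  # M3
    ("L", {"X1": 0, "X2": 1, "X3": 0, "X4": 0, "X5": 0}),  # M4
    ("L", {"X1": 0, "X2": 1, "X3": 1, "X4": 1, "X5": 0}),  # M5
    ("S", {"X1": 0, "X2": 0, "X3": 0, "X4": 0, "X5": 0}),  # M6
    ("S", {"X1": 0, "X2": 1, "X3": 0, "X4": 1, "X5": 1}),  # M7
    ("S", {"X1": 0, "X2": 1, "X3": 1, "X4": 1, "X5": 1}),  # M8
    ("S", {"X1": 1, "X2": 1, "X3": 0, "X4": 1, "X5": 1}),  # M9
    ("S", {"X1": 1, "X2": 1, "X3": 1, "X4": 1, "X5": 1}),  # M10
]

def entropy(labels):
    # Entropy H(Y) of the class distribution estimated from a list of labels.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(examples, attr):
    # H(Y|attr): expected class entropy after splitting on attr.
    n = len(examples)
    values = {x[attr] for _, x in examples}
    return sum(
        len(ys) / n * entropy(ys)
        for ys in ([y for y, x in examples if x[attr] == v] for v in values)
    )

def build_tree(examples, attributes):
    labels = [y for y, _ in examples]
    if len(set(labels)) == 1:    # pure node: insert a leaf
        return labels[0]
    if not attributes:           # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # most discriminant attribute = lowest conditional entropy
    best = min(attributes, key=lambda a: conditional_entropy(examples, a))
    children = {}
    for v in {x[best] for _, x in examples}:
        subset = [ex for ex in examples if ex[1][best] == v]
        children[v] = build_tree(subset, [a for a in attributes if a != best])
    return {"attr": best, "children": children}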

Page 9

(training set as on page 7)

[Tree: root X1 (5/5); X1=0 → 3/3, X1=1 → 2/2]

Let us consider each of the possible attributes.

We have already seen that X1 is not a good choice: it has no discriminant capability.

Page 10

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1, X2=1 → 2/4]

X2 has a better discriminant capability: for each of its values, most of the corresponding training examples belong to only one of the classes (legitimate when X2=0, spam when X2=1).

Page 11

(training set as on page 7)

[Tree: root X3 (5/5); X3=0 → 1/3, X3=1 → 4/2]

X3 has the same discriminant capability as X2, since it produces the same distribution of spam and legitimate training emails according to its outputs (the class proportions are switched with respect to X2, but this is not relevant to the discriminant capability).

Page 12

(training set as on page 7)

[Tree: root X4 (5/5); X4=0 → 2/1, X4=1 → 3/4]

Intuitively, X4 has a lower discriminant capability than X2 and X3 (but still better than X1), since it produces a more balanced distribution of spam and legitimate training emails according to its outputs.

Page 13

(training set as on page 7)

[Tree: root X5 (5/5); X5=0 → 3/1, X5=1 → 2/4]

Finally, X5 has the same discriminant capability as X2 and X3.

Page 14

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1, X2=1 → 2/4]

Among the three attributes exhibiting the highest discriminant capability, assume that X2 is chosen (e.g., randomly).

Page 15

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1 (sub-tree to be built), X2=1 → 2/4]

Now the ID3 algorithm proceeds recursively by building a sub-tree for each value of X2. Assuming that the value X2=0 is considered first: since there are both legitimate and spam emails for which X2=0, the root of the sub-tree must be a node and not a leaf. The attribute must be chosen among the ones not present in the same path from the root, i.e.: X1, X3, X4 and X5.

Page 16

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1 (attribute to be chosen), X2=1 → 2/4]

To find the most discriminant attribute among X1, X3, X4 and X5, only the training examples that reach the considered node must be taken into account (since the goal is to find a consistent tree), i.e., the four emails with X2=0 (M1, M2, M3 and M6: three legitimate and one spam).

Page 17

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1, node X3: X3=0 → 0/1, X3=1 → 3/0; X2=1 → 2/4]

It is easy to see that X3 is a perfectly discriminant attribute for the four training examples at hand, and that the other three attributes have a lower discriminant capability. Accordingly, X3 must be chosen for this node.
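This can be verified numerically with the sketch from page 8 (a check under that sketch's assumptions; the four examples reaching this node are M1, M2, M3 and M6):

subset = [ex for ex in TRAINING_SET if ex[1]["X2"] == 0]  # M1, M2, M3, M6
for a in ("X1", "X3", "X4", "X5"):
    print(a, conditional_entropy(subset, a))
# X1 0.5, X3 0.0, X4 0.5, X5 0.5: only X3 reduces the class entropy to zero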

Page 18

[Tree: root X2 (5/5); X2=0 → node X3: X3=0 → 0/1, leaf S; X3=1 → 3/0, leaf L; X2=1 → 2/4]

(training set as on page 7)

It is also easy to see that the two recursive calls to the ID3 procedure construct two leaves with the class labels shown above. Then recursion proceeds with the right child of the root node...

Page 19

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → node X3: X3=0 → 0/1, leaf S; X3=1 → 3/0, leaf L; X2=1 → 2/4 (sub-tree to be built)]

... and a sub-tree has to be built according to the six training examples corresponding to X2=1 (two legitimate and four spam emails).

Page 20

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → node X3: X3=0 → 0/1, leaf S; X3=1 → 3/0, leaf L; X2=1 → node X5: X5=0 → 2/0, X5=1 → 0/4]

Also in this case, a perfectly discriminant attribute exists, i.e., X5.

Page 21

(training set as on page 7)

[Final tree: root X2; X2=0 → node X3: X3=0 → leaf S, X3=1 → leaf L; X2=1 → node X5: X5=0 → leaf L, X5=1 → leaf S]

The last two recursive calls to ID3 produce the final decision tree shown above.
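Running the page-8 sketch on the full training set reproduces exactly this tree (again under that sketch's assumptions; ties among equally discriminant attributes are broken by attribute order rather than randomly):

tree = build_tree(TRAINING_SET, ["X1", "X2", "X3", "X4", "X5"])
print(tree)
# {'attr': 'X2', 'children': {0: {'attr': 'X3', 'children': {0: 'S', 1: 'L'}},
#                             1: {'attr': 'X5', 'children': {0: 'L', 1: 'S'}}}}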


Page 22

(training set as on page 7)

[Tree: root attribute to be chosen (5/5)]

So far the discriminant capability of an attribute has been evaluated only qualitatively. Among several possible quantitative measures, in the ID3 learning algorithm the entropy of the class distribution is chosen, estimated from the training examples that reach the considered node.

Page 23

(training set as on page 7)

[Tree: root attribute to be chosen (5/5)]

In this example, before choosing the attribute of the root node the whole training set must be considered, which is made up of 5 legitimate and 5 spam emails. Accordingly, the class distribution can be estimated as:

P(Y=L) = 5/10 = 0.5
P(Y=S) = 5/10 = 0.5

Page 24

(training set as on page 7)

[Tree: root attribute to be chosen (5/5)]

The corresponding entropy is defined as:

H(Y) = -P(Y=L) log2 P(Y=L) - P(Y=S) log2 P(Y=S)
     = -0.5 log2 0.5 - 0.5 log2 0.5
     = 1
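As a quick numeric check with the entropy function from the page-8 sketch:

print(entropy(["L"] * 5 + ["S"] * 5))  # 1.0: maximum uncertainty at the 5/5 root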

Page 25

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1, X2=1 → 2/4]

If X2 is chosen as the attribute of the root node, it produces the two class distributions shown above (one for each output value).

Page 26

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1, X2=1 → 2/4]

The entropy of the class distribution, after observing the value of X2, is defined as the conditional entropy:

H(Y|X2) = P(X2=0) H(Y|X2=0) + P(X2=1) H(Y|X2=1)

Page 27

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1, X2=1 → 2/4]

To compute H(Y|X2), the values P(X2=0) and P(X2=1) can be estimated as 4/10 and 6/10, respectively, whereas H(Y|X2=0) and H(Y|X2=1) can be computed from the distributions P(Y|X2=0) and P(Y|X2=1), respectively.

Page 28

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1, X2=1 → 2/4]

Accordingly, one obtains:

H(Y|X2) = P(X2=0) H(Y|X2=0) + P(X2=1) H(Y|X2=1)
        = 0.4 (-3/4 log2 3/4 - 1/4 log2 1/4) + 0.6 (-2/6 log2 2/6 - 4/6 log2 4/6)
        ≈ 0.875
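The same value can be checked with the page-8 sketch:

print(round(conditional_entropy(TRAINING_SET, "X2"), 3))  # 0.875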

Page 29

(training set as on page 7)

[Tree: root X2 (5/5); X2=0 → 3/1, X2=1 → 2/4]

To sum up, before observing the value of X2 the entropy of the class distribution is H(Y) = 1; after observing the value of X2 the entropy reduces to H(Y|X2) ≈ 0.875. This means that the attribute X2 has some discriminant capability. The discriminant capability of any attribute at any node of a DT can be evaluated similarly.
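Evaluating H(Y|Xi) for all five attributes at the root in this way reproduces the qualitative ranking of pages 9-13 (again using the page-8 sketch):

for a in ("X1", "X2", "X3", "X4", "X5"):
    print(a, round(conditional_entropy(TRAINING_SET, a), 3))
# X1 1.0, X2 0.875, X3 0.875, X4 0.965, X5 0.875:
# X2, X3 and X5 tie as the most discriminant attributes; X1 is useless.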

Page 30

Over-fitting in Decision Trees

[Tree trained on 500 legitimate and 150 spam emails: root X2 (500/150); X2=0 → 300/140, node X3: X3=0 → 0/140, leaf S; X3=1 → 300/0, leaf L; X2=1 → 200/10, node X4: X4=0 → 195/0, leaf L; X4=1 → 5/10, node X5: X5=0 → 5/0, leaf L; X5=1 → 0/10, leaf S]

Page 31

Over-fitting in Decision Trees

[Same tree, with the sub-tree under X4=1 (5/10 examples) not yet expanded: X4=0 → 195/0, leaf L; X4=1 → 5/10]

Page 32

Tree pruning

[Pruned tree: root X2 (500/150); X2=0 → node X3: X3=0 → 0/140, leaf S; X3=1 → 300/0, leaf L; X2=1 → node X4: X4=0 → 195/0, leaf L; X4=1 → 5/10, leaf S (the X5 sub-tree replaced by a majority-class leaf)]