
Page 1: Data Mining using Decision Trees Professor J. F. Baldwin

Data Mining using Decision Trees

Professor J. F. Baldwin

Page 2: Data Mining using Decision Trees Professor J. F. Baldwin

Decision Trees from a Data Base

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied
1        med        blue         brick       yes
2        small      red          wedge       no
3        small      red          sphere      yes
4        large      red          wedge       no
5        large      green        pillar      yes
6        large      red          pillar      no
7        large      green        sphere      yes

Choose target: Concept Satisfied. Use all attributes except Ex Num.
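For use in the sketches that follow, the seven examples could be encoded in Python as a list of dictionaries; the field and variable names are illustrative, not part of the slides:

```python
# Hypothetical encoding of the seven training examples from the slide.
EXAMPLES = [
    {"size": "med",   "colour": "blue",  "shape": "brick",  "satisfied": "yes"},
    {"size": "small", "colour": "red",   "shape": "wedge",  "satisfied": "no"},
    {"size": "small", "colour": "red",   "shape": "sphere", "satisfied": "yes"},
    {"size": "large", "colour": "red",   "shape": "wedge",  "satisfied": "no"},
    {"size": "large", "colour": "green", "shape": "pillar", "satisfied": "yes"},
    {"size": "large", "colour": "red",   "shape": "pillar", "satisfied": "no"},
    {"size": "large", "colour": "green", "shape": "sphere", "satisfied": "yes"},
]
ATTRIBUTES = ["size", "colour", "shape"]   # all attributes except Ex Num
TARGET = "satisfied"                       # the chosen target concept
```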

Page 3: Data Mining using Decision Trees Professor J. F. Baldwin

CLS - Concept Learning System - Hunt et al.

Tree structure: a parent node containing a mixture of +ve and -ve examples is split on an attribute V, with one child node for each of its values v1, v2, v3.

Page 4: Data Mining using Decision Trees Professor J. F. Baldwin

CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set

2. If all the examples in T are +ve, create a YES node and HALT

3. If all the examples in T are -ve, create a NO node and HALT

4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values of F. Create branches with F as parent and T1, ..., Tn as child nodes.

5. Apply the procedure recursively to each child node
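A minimal Python sketch of the CLS procedure, assuming the dictionary encoding of examples shown earlier; the helper names are illustrative, and the attribute to split on is simply taken in list order (CLS allows any choice at each node, so this will not necessarily reproduce the trees on the following slides):

```python
def cls(examples, attributes, target="satisfied"):
    """Recursive CLS: returns 'YES', 'NO', or a dict {attribute: {value: subtree}}."""
    labels = {e[target] for e in examples}
    if labels == {"yes"}:                       # step 2: all +ve -> YES node
        return "YES"
    if labels == {"no"}:                        # step 3: all -ve -> NO node
        return "NO"
    if not attributes:                          # inconsistent data: fall back to majority
        return "YES" if sum(e[target] == "yes" for e in examples) >= len(examples) / 2 else "NO"
    attr, rest = attributes[0], attributes[1:]  # step 4: pick an attribute F
    tree = {attr: {}}
    for v in {e[attr] for e in examples}:       # partition T into T1..Tn on F
        subset = [e for e in examples if e[attr] == v]
        tree[attr][v] = cls(subset, rest, target)   # step 5: recurse on each child
    return tree

print(cls(EXAMPLES, ATTRIBUTES))
```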

Page 5: Data Mining using Decision Trees Professor J. F. Baldwin

Data Base Example

Using attribute SIZE

{1, 2, 3, 4, 5, 6, 7}  split on SIZE:
    med   -> {1}            YES
    small -> {2, 3}         Expand
    large -> {4, 5, 6, 7}   Expand

Page 6: Data Mining using Decision Trees Professor J. F. Baldwin

Expanding

{1, 2, 3, 4, 5, 6, 7}  SIZE
    med   -> {1}  YES
    small -> {2, 3}  COLOUR
        red -> {2, 3}  SHAPE
            wedge  -> {2}  NO
            sphere -> {3}  YES
    large -> {4, 5, 6, 7}  SHAPE
        wedge  -> {4}  NO
        sphere -> {7}  YES
        pillar -> {5, 6}  COLOUR
            red   -> {6}  NO
            green -> {5}  YES

Page 7: Data Mining using Decision Trees Professor J. F. Baldwin

Rules from Tree

IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR SHAPE = sphere))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES

Page 8: Data Mining using Decision Trees Professor J. F. Baldwin

Disjunctive Normal Form - DNF

IF (SIZE = medium)
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied

ELSE CONCEPT = not satisfied
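The DNF rule translates directly into code; a sketch reusing the EXAMPLES encoding from the earlier sketch (note the table abbreviates medium to med):

```python
def satisfied(example):
    """DNF rule read off the decision tree (returns 'yes' or 'no')."""
    size, shape, colour = example["size"], example["shape"], example["colour"]
    if (size == "med"
            or (size == "small" and shape == "sphere")
            or (size == "large" and shape == "sphere")
            or (size == "large" and shape == "pillar" and colour == "green")):
        return "yes"
    return "no"

# Sanity check against the training table: every example is classified correctly.
assert all(satisfied(e) == e["satisfied"] for e in EXAMPLES)
```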

Page 9: Data Mining using Decision Trees Professor J. F. Baldwin

ID3 - Quinlan

ID3 = CLS + efficient ordering of attributes

Entropy is used to order the attributes.

Attributes are chosen in any order for the CLS algorithm. This can result in large decision trees if the ordering is not optimal. An optimal ordering would result in the smallest decision tree.

No method is known to determine the optimal ordering. We use a heuristic to provide an efficient ordering which will result in a near-optimal ordering.

Page 10: Data Mining using Decision Trees Professor J. F. Baldwin

Entropy

For a random variable V which can take values {v1, v2, …, vn} with Pr(vi) = pi for all i, the entropy of V is given by

S(V) = - Σ_{i=1..n} pi ln(pi)

Entropy for a fair die:
S = - Σ_{i=1..6} (1/6) ln(1/6) = ln(6) = 1.7917

Entropy for a fair die known to show an even score:
S = - Σ_{i=1..3} (1/3) ln(1/3) = ln(3) = 1.0986

Information gain = 1.7917 - 1.0986 = 0.6931, the difference between the entropies.
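The die example can be checked numerically; a small sketch using natural logarithms, as in the definition above:

```python
import math

def entropy(probs, log=math.log):
    """S(V) = -sum_i p_i log(p_i); natural log by default, skipping zero probabilities."""
    return -sum(p * log(p) for p in probs if p > 0)

fair_die = entropy([1/6] * 6)                     # ln(6) ~= 1.7918
even_die = entropy([1/3] * 3)                     # ln(3) ~= 1.0986
print(fair_die, even_die, fair_die - even_die)    # gain ~= 0.6931 = ln(2)
```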

Page 11: Data Mining using Decision Trees Professor J. F. Baldwin

Attribute Expansion

Expanding attribute Ai: the node holds a probability distribution Pr(A1, …, Ai, …, An, T) over the attributes and the target T, with the examples equally likely unless specified otherwise. Expansion creates one child per value ai1, …, aim of Ai. The child for Ai = ai1 holds the distribution over the other attributes,

Pr(A1, …, Ai-1, Ai+1, …, An, T | Ai = ai1)

obtained by passing down the probabilities corresponding to ai1 from above and re-normalising (equally likely again if the previous distribution was equally likely).

Page 12: Data Mining using Decision Trees Professor J. F. Baldwin

Expected Entropy for an Attribute

For attribute Ai and target T, expanding Ai gives one branch per value ai1, …, aim, each with its own target distribution Pr(T | Ai = aik): pass down the probabilities corresponding to each tk from above for aik and re-normalise. The marginals are

Pr(Ai, T) = Σ_{A1} … Σ_{Ai-1} Σ_{Ai+1} … Σ_{An} Pr(A1, …, An, T)

Pr(Ai) = Σ_T Pr(Ai, T)

The entropy at branch ai1 is

S(ai1) = - Σ_k Pr(tk | ai1) ln Pr(tk | ai1)

Expected entropy for Ai:

S(Ai) = Σ_k Pr(aik) S(aik)
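With all examples equally likely, S(Ai) reduces to counting; a minimal sketch reusing the entropy helper above and the EXAMPLES encoding from the first sketch (names are illustrative):

```python
import math
from collections import Counter

def expected_entropy(examples, attribute, target="satisfied", log=math.log):
    """S(Ai) = sum_k Pr(aik) * S(aik), with all examples equally likely."""
    n_total = len(examples)
    total = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        counts = Counter(e[target] for e in subset)
        probs = [c / len(subset) for c in counts.values()]       # Pr(tk | aik)
        total += (len(subset) / n_total) * entropy(probs, log)   # Pr(aik) * S(aik)
    return total
```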

Page 13: Data Mining using Decision Trees Professor J. F. Baldwin

How to Choose an Attribute, and Information Gain

Determine the expected entropy for each attribute, i.e. S(Ai) for all i.

Choose s such that MIN_j { S(Aj) } = S(As), and expand attribute As.

By choosing attribute As the information gain is S - S(As), where

S = - Σ_T Pr(T) ln Pr(T)     and     Pr(T) = Σ_{A1} … Σ_{An} Pr(A1, …, An, T)

Minimising the expected entropy is equivalent to maximising the information gain.

Page 14: Data Mining using Decision Trees Professor J. F. Baldwin

Previous Example

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied   Pr
1        med        blue         brick       yes                 1/7
2        small      red          wedge       no                  1/7
3        small      red          sphere      yes                 1/7
4        large      red          wedge       no                  1/7
5        large      green        pillar      yes                 1/7
6        large      red          pillar      no                  1/7
7        large      green        sphere      yes                 1/7

Pr(Concept Satisfied):   yes 4/7,   no 3/7

S = -(4/7) log2(4/7) - (3/7) log2(3/7) = 0.99     (using logs to base 2)

Page 15: Data Mining using Decision Trees Professor J. F. Baldwin

Entropy for attribute SIZE

Att Size   Concept Satisfied   Pr
med        yes                 1/7
small      no                  1/7
small      yes                 1/7
large      no                  2/7
large      yes                 2/7

Conditional distributions of Concept Satisfied:
small:  no 1/2, yes 1/2   ->  S(small) = 1
med:    yes 1             ->  S(med)   = 0
large:  no 1/2, yes 1/2   ->  S(large) = 1

Pr(small) = 2/7,   Pr(med) = 1/7,   Pr(large) = 4/7

S(Size) = (2/7)(1) + (1/7)(0) + (4/7)(1) = 6/7 = 0.86

Information gain for Size = 0.99 - 0.86 = 0.13
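These numbers, and those on the next slide, can be reproduced with the helpers sketched earlier (EXAMPLES, entropy, expected_entropy); note that this worked example uses logarithms to base 2:

```python
import math
from collections import Counter

target_counts = Counter(e["satisfied"] for e in EXAMPLES)
S = entropy([c / len(EXAMPLES) for c in target_counts.values()], math.log2)
print(round(S, 2))                                        # 0.99

for att in ["size", "colour", "shape"]:
    gain = S - expected_entropy(EXAMPLES, att, log=math.log2)
    print(att, round(gain, 2))                            # size 0.13, colour 0.52, shape 0.7
```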

Page 16: Data Mining using Decision Trees Professor J. F. Baldwin

First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.7     <- choose max

{1, 2, 3, 4, 5, 6, 7}  SHAPE
    wedge  -> {2, 4}   NO
    brick  -> {1}      YES
    sphere -> {3, 7}   YES
    pillar -> {5, 6}   Expand

Page 17: Data Mining using Decision Trees Professor J. F. Baldwin

Complete Decision Tree

{1, 2, 3, 4, 5, 6, 7}  SHAPE
    wedge  -> {2, 4}   NO
    brick  -> {1}      YES
    sphere -> {3, 7}   YES
    pillar -> {5, 6}   COLOUR
        red   -> {6}   NO
        green -> {5}   YES

Rule:
IF Shape is wedge
OR (Shape is pillar AND Colour is red)
THEN NO
ELSE YES

Page 18: Data Mining using Decision Trees Professor J. F. Baldwin

A new case

Att Size   Att Colour   Att Shape   Concept Satisfied
med        red          pillar      ?

Follow the tree: SHAPE = pillar, then COLOUR = red, so

? = NO
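A small sketch of the complete SHAPE-first tree as nested dictionaries, to check the classification of this new case (the representation is an assumption, not from the slides):

```python
# Final decision tree: each internal node is {attribute: {value: subtree}}, leaves are labels.
TREE = {"shape": {
    "wedge": "NO",
    "brick": "YES",
    "sphere": "YES",
    "pillar": {"colour": {"red": "NO", "green": "YES"}},
}}

def classify(tree, example):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

print(classify(TREE, {"size": "med", "colour": "red", "shape": "pillar"}))  # NO
```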

Page 19: Data Mining using Decision Trees Professor J. F. Baldwin

Post Pruning

Consider any node S with N examples in the node, n of which are cases of class C, where C is one of {YES, NO} and is the class with the most examples, i.e. the majority class.

Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and we reach this node?

E(S) = Pr(class of new case is a class ≠ C)

Page 20: Data Mining using Decision Trees Professor J. F. Baldwin

Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p. Let f(p) be a prior probability distribution for p on [0, 1]. We can update this prior using Bayes' updating with the information at node S.

The information at node S is: n cases of class C out of the N examples in S.

f(p | n C in S) = Pr(n C in S | p) f(p) / ∫[0,1] Pr(n C in S | p) f(p) dp

Page 21: Data Mining using Decision Trees Professor J. F. Baldwin

Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

f(p | n C in S) = p^n (1-p)^(N-n) / ∫[0,1] p^n (1-p)^(N-n) dp

E(S) = E(1 - p) = ∫[0,1] (1 - p) f(p | n C in S) dp

     = ∫[0,1] p^n (1-p)^(N-n+1) dp / ∫[0,1] p^n (1-p)^(N-n) dp  =  (N - n + 1) / (N + 2)

The integrals are evaluated using Beta functions:

∫[0,1] x^a (1-x)^b dx = a! b! / (a + b + 1)!

so the numerator is n! (N - n + 1)! / (N + 2)! and the denominator is n! (N - n)! / (N + 1)!, giving E(S) = (N - n + 1) / (N + 2).
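This leaf-error estimate is a one-liner; a small sketch, checked against values that appear in the worked example two slides on:

```python
def expected_leaf_error(N, n):
    """E(S) = (N - n + 1) / (N + 2): expected error of a leaf with N examples,
    n of which are in the majority class (binary case, uniform prior)."""
    return (N - n + 1) / (N + 2)

print(expected_leaf_error(5, 3))   # 0.4285...  (leaf [3, 2])
print(expected_leaf_error(2, 1))   # 0.5        (leaf [1, 1])
print(expected_leaf_error(1, 1))   # 0.333...   (leaf [1, 0])
```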

Page 22: Data Mining using Decision Trees Professor J. F. Baldwin

Post Pruning for Binary Case

For any node S which is not a leaf node, with child nodes S1, S2, ..., Sm reached with probabilities P1, P2, ..., Pm, we can calculate

BackUpError(S) = Σ_i Pi Error(Si)

Error(S) = MIN{ E(S), BackUpError(S) }

Pi = (number of examples in Si) / (number of examples in S)

For leaf nodes Si:  Error(Si) = E(Si)

Decision: Prune at S if BackUpError(S) ≥ E(S)

Page 23: Data Mining using Decision Trees Professor J. F. Baldwin

Example of Post Pruning

Before Pruning. [x, y] means x YES cases and y NO cases. Each internal node shows E(S) and BackUpError(S); each leaf shows its error E(Sk).

a [6, 4]: E = 0.417, BackUpError = 0.378
    b [4, 2]: E = 0.375, BackUpError = 0.413   -> PRUNE
        [3, 2]: 0.429
        [1, 0]: 0.333
    c [2, 2]: E = 0.5, BackUpError = 0.383
        d [1, 2]: E = 0.4, BackUpError = 0.444   -> PRUNE
            [0, 1]: 0.333
            [1, 1]: 0.5
        [1, 0]: 0.333

PRUNE means cut the sub-tree below this point.
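The full pruning pass on this example can be reproduced with a short sketch (reusing expected_leaf_error from the earlier sketch); the node names and the nested-dictionary layout are assumptions reconstructed from the counts shown above:

```python
def errors(node):
    """Return (number of examples, Error(S)); print the pruning decision for internal nodes."""
    yes, no, children = node["yes"], node["no"], node.get("children", [])
    N, n = yes + no, max(yes, no)
    e_static = expected_leaf_error(N, n)                       # E(S)
    if not children:
        return N, e_static
    child_stats = [errors(c) for c in children]
    backup = sum((Nc / N) * err for Nc, err in child_stats)    # BackUpError(S)
    print(node["name"], "E =", round(e_static, 3), "BackUpError =", round(backup, 3),
          "PRUNE" if backup >= e_static else "keep")
    return N, min(e_static, backup)                            # Error(S)

tree = {"name": "a", "yes": 6, "no": 4, "children": [
    {"name": "b", "yes": 4, "no": 2, "children": [
        {"name": "b1", "yes": 3, "no": 2},
        {"name": "b2", "yes": 1, "no": 0}]},
    {"name": "c", "yes": 2, "no": 2, "children": [
        {"name": "d", "yes": 1, "no": 2, "children": [
            {"name": "d1", "yes": 0, "no": 1},
            {"name": "d2", "yes": 1, "no": 1}]},
        {"name": "c2", "yes": 1, "no": 0}]},
]}
errors(tree)   # prunes at b and d, keeps a and c, matching the slide
```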

Page 24: Data Mining using Decision Trees Professor J. F. Baldwin

Result of Pruning

After Pruning

a [6, 4]
    [4, 2]      (b pruned to a leaf)
    c [2, 2]
        [1, 2]  (d pruned to a leaf)
        [1, 0]

Page 25: Data Mining using Decision Trees Professor J. F. Baldwin

Generalisation

For the case in which we have k classes, the generalisation for E(S) is

E(S) = (N - n + k - 1) / (N + k)

Otherwise, the pruning method is the same.
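A corresponding one-line extension of the earlier expected_leaf_error sketch (the function name is illustrative):

```python
def expected_leaf_error_k(N, n, k):
    """E(S) = (N - n + k - 1) / (N + k) for k classes; k = 2 recovers (N - n + 1) / (N + 2)."""
    return (N - n + k - 1) / (N + k)
```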

Page 26: Data Mining using Decision Trees Professor J. F. Baldwin

Testing

The Data Base is split into a Training Set and a Test Set.

Learn rules using the Training Set and prune.
Test the rules on the Training Set and record the % correct.
Test the rules on the Test Set and record the % correct.

The % accuracy on the test set should be close to that of the training set. This indicates good generalisation.

Over-fitting can occur if noisy data is used or if overly specific attributes are used. Pruning will overcome noise to some extent but not completely. Overly specific attributes must be dropped.
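This train-and-test procedure is what standard tooling automates; a minimal sketch with scikit-learn, comparing training and test accuracy (the library, dataset and parameters are assumptions, not part of the slides, and scikit-learn's tree is CART-based with cost-complexity pruning rather than ID3 with the post-pruning rule above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the database into a Training Set and a Test Set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn a pruned tree on the Training Set.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # % correct on the Training Set
test_acc = tree.score(X_test, y_test)      # % correct on the Test Set
print(train_acc, test_acc)                 # close values suggest good generalisation
```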