TRANSCRIPT
Data Mining using Decision Trees
Professor J. F. Baldwin
Decision Trees from Data Base

Ex Num   Size    Colour   Shape    Concept Satisfied
1        med     blue     brick    yes
2        small   red      wedge    no
3        small   red      sphere   yes
4        large   red      wedge    no
5        large   green    pillar   yes
6        large   red      pillar   no
7        large   green    sphere   yes

Choose target: Concept Satisfied. Use all attributes except Ex Num.
CLS - Concept Learning System - Hunt et al.
Tree Structure (figure): a parent node is split on an attribute V; the branches for values v1, v2, v3 lead to children nodes, each a node with a mixture of +ve and -ve examples.
CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set.
2. If all the examples in T are +ve, create a YES node and HALT.
3. If all the examples in T are -ve, create a NO node and HALT.
4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values of F. Create branches with F as parent and T1, ..., Tn as child nodes.
5. Apply the procedure recursively to each child node.
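A minimal sketch of this procedure in Python (the representation is assumed, not from the slides: each example is a dict of attribute values plus a 'Satisfied' target of 'yes'/'no'; like CLS, it takes the attributes in whatever order it is given):

```python
# Minimal CLS sketch (assumed representation: each example is a dict of
# attribute values plus a 'Satisfied' target of 'yes'/'no').
def cls(examples, attributes, target='Satisfied'):
    labels = {e[target] for e in examples}
    if labels == {'yes'}:
        return 'YES'                          # step 2: all +ve -> YES leaf
    if labels == {'no'}:
        return 'NO'                           # step 3: all -ve -> NO leaf
    if not attributes:                        # no attribute left: majority vote
        yes = sum(e[target] == 'yes' for e in examples)
        return 'YES' if yes >= len(examples) - yes else 'NO'
    attr, rest = attributes[0], attributes[1:]   # step 4: CLS picks any attribute
    tree = {attr: {}}
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        tree[attr][value] = cls(subset, rest, target)   # step 5: recurse
    return tree
```

Applied to the seven-example database above it builds a decision tree, though not necessarily the smallest one, since the attribute order is arbitrary.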
Data Base Example

Using attribute SIZE:

{1, 2, 3, 4, 5, 6, 7} -- SIZE
    med   -> {1}            YES
    small -> {2, 3}         Expand
    large -> {4, 5, 6, 7}   Expand
Expanding:

{1, 2, 3, 4, 5, 6, 7} -- SIZE
    med   -> {1}   YES
    small -> {2, 3} -- SHAPE
                 wedge  -> {2}   NO
                 sphere -> {3}   YES
    large -> {4, 5, 6, 7} -- SHAPE
                 wedge  -> {4}   NO
                 sphere -> {7}   YES
                 pillar -> {5, 6} -- COLOUR
                               red   -> {6}   NO
                               green -> {5}   YES
Rules from Tree

IF (SIZE = large AND (SHAPE = wedge OR (SHAPE = pillar AND COLOUR = red)))
   OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR SHAPE = sphere))
   OR (SIZE = small AND SHAPE = sphere)
   OR (SIZE = medium)
THEN YES

Disjunctive Normal Form - DNF

IF (SIZE = medium)
   OR (SIZE = small AND SHAPE = sphere)
   OR (SIZE = large AND SHAPE = sphere)
   OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied
ELSE CONCEPT = not satisfied
ID3 - Quinlan
ID3 = CLS + efficient ordering of attributes
Entropy is used to order the attributes.
Attributes are chosen in any order for the CLS algorithm. This can result in large decision trees if the ordering is not optimal. An optimal ordering would result in the smallest decision tree.
No method is known to determine the optimal ordering. We use a heuristic to provide an efficient ordering which will result in a near-optimal ordering.
Entropy
For a random variable V which can take values {v1, v2, ..., vn} with Pr(vi) = pi for all i, the entropy of V is given by

    S(V) = - Σ_{i=1}^{n} p_i ln(p_i)

Entropy for a fair dice:                        S = - Σ_{i=1}^{6} (1/6) ln(1/6) = ln(6) = 1.7917
Entropy for a fair dice given an even score:    S = - Σ_{i=1}^{3} (1/3) ln(1/3) = ln(3) = 1.0986

Information gain = 1.7917 - 1.0986 = 0.6931, the difference between the entropies.
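A short sketch of the entropy calculation reproducing the dice figures above (natural logarithm, as in the formula; the `entropy` helper is illustrative):

```python
import math

def entropy(probs):
    """S(V) = -sum_i p_i ln(p_i) over a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

fair_dice = entropy([1/6] * 6)        # ln(6) = 1.7917...
even_score = entropy([1/3] * 3)       # ln(3) = 1.0986...
print(fair_dice, even_score, fair_dice - even_score)   # gain = 0.6931... = ln(2)
```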
Attribute Expansion

Expand attribute Ai into its values ai1, ..., aim.

The probabilities Pr(A1, ..., Ai, ..., An, T) over the examples in the parent table are equally likely unless specified otherwise.

For the branch Ai = ai1 the child table holds the other attributes (all attributes except Ai) with probabilities Pr(A1, ..., Ai-1, Ai+1, ..., An, T | Ai = ai1): pass down the probabilities corresponding to ai1 from above and re-normalise - equally likely again if previously equally likely.
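A sketch of the pass-down-and-re-normalise step, under the equally likely assumption above (the `condition` helper and the (example, probability) representation are illustrative):

```python
def condition(rows, attr, value):
    """Pr(other attributes, T | attr = value): keep matching rows and re-normalise.
    rows is a list of (example_dict, probability) pairs."""
    matching = [(e, p) for e, p in rows if e[attr] == value]
    total = sum(p for _, p in matching)           # Pr(attr = value)
    return [(e, p / total) for e, p in matching]  # equally likely stays equally likely
```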
Expected Entropy for an Attribute

Consider attribute Ai, with values ai1, ..., aim, and target T. Summing the joint probability over all the other attributes gives

    Pr(Ai, T) = Σ_{A1} ... Σ_{Ai-1} Σ_{Ai+1} ... Σ_{An} Pr(A1, ..., An, T)

For each value aik, pass down the probabilities corresponding to the target values tk from above and re-normalise to obtain Pr(T | Ai = aik). The entropy at that child is

    S(aik) = - Σ_k Pr(tk | aik) ln Pr(tk | aik)

and with Pr(Ai) = Σ_T Pr(Ai, T), the expected entropy for Ai is

    S(Ai) = Σ_k Pr(aik) S(aik)
How to choose an attribute, and information gain
Determine the expected entropy for each attribute, i.e. S(Ai) for all i.

Choose s such that S(As) = MIN_j { S(Aj) }, and expand attribute As.

By choosing attribute As the information gain is S - S(As), where

    S = - Σ_T Pr(T) ln Pr(T)      with      Pr(T) = Σ_{A1} ... Σ_{An} Pr(A1, ..., An, T)

Minimising expected entropy is equivalent to maximising information gain.
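A sketch of the attribute-selection step for equally likely examples (logs to base 2, matching the worked example that follows; any base gives the same attribute choice; names are illustrative):

```python
import math
from collections import Counter

def expected_entropy(examples, attr, target='Satisfied'):
    """S(A) = sum_k Pr(a_k) * S(a_k), with S(a_k) the entropy of T given A = a_k."""
    n = len(examples)
    total = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        counts = Counter(e[target] for e in subset)
        s = -sum((c / len(subset)) * math.log(c / len(subset), 2)
                 for c in counts.values())
        total += (len(subset) / n) * s            # weight by Pr(A = value)
    return total

def best_attribute(examples, attributes):
    # minimising S(A) is equivalent to maximising the information gain S - S(A)
    return min(attributes, key=lambda a: expected_entropy(examples, a))
```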
Previous Example

Ex Num   Size    Colour   Shape    Concept Satisfied   Pr
1        med     blue     brick    yes                 1/7
2        small   red      wedge    no                  1/7
3        small   red      sphere   yes                 1/7
4        large   red      wedge    no                  1/7
5        large   green    pillar   yes                 1/7
6        large   red      pillar   no                  1/7
7        large   green    sphere   yes                 1/7

Concept Satisfied   Pr
yes                 4/7
no                  3/7

S = -(4/7)Log(4/7) - (3/7)Log(3/7) = 0.99   (logs to base 2)
Entropy for attribute Size

Size    Concept Satisfied   Pr
med     yes                 1/7
small   no                  1/7
small   yes                 1/7
large   no                  2/7
large   yes                 2/7

Conditional distributions of Concept Satisfied:

    med:    yes 1                   S(med)   = 0
    small:  no 1/2, yes 1/2         S(small) = 1
    large:  no 1/2, yes 1/2         S(large) = 1

Pr(med) = 1/7, Pr(small) = 2/7, Pr(large) = 4/7

S(Size) = (2/7)(1) + (1/7)(0) + (4/7)(1) = 6/7 = 0.86

Information Gain for Size = 0.99 - 0.86 = 0.13
First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.7     <- choose max

{1, 2, 3, 4, 5, 6, 7} -- SHAPE
    wedge  -> {2, 4}    NO
    brick  -> {1}       YES
    sphere -> {3, 7}    YES
    pillar -> {5, 6}    Expand
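The gains in the table above can be checked with a short sketch (log base 2, i.e. the 0.99 figure used earlier; names are illustrative):

```python
import math
from collections import Counter

# The seven-example database: (Size, Colour, Shape, Satisfied)
data = [
    ('med',   'blue',  'brick',  'yes'),
    ('small', 'red',   'wedge',  'no'),
    ('small', 'red',   'sphere', 'yes'),
    ('large', 'red',   'wedge',  'no'),
    ('large', 'green', 'pillar', 'yes'),
    ('large', 'red',   'pillar', 'no'),
    ('large', 'green', 'sphere', 'yes'),
]

def entropy(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

s = entropy([row[3] for row in data])                 # 0.985, shown as 0.99 above
for i, name in [(0, 'SIZE'), (1, 'COLOUR'), (2, 'SHAPE')]:
    s_attr = 0.0
    for value in {row[i] for row in data}:
        subset = [row[3] for row in data if row[i] == value]
        s_attr += len(subset) / len(data) * entropy(subset)
    print(name, round(s - s_attr, 2))                 # 0.13, 0.52, 0.7 -> expand SHAPE
```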
Complete Decision Tree

{1, 2, 3, 4, 5, 6, 7} -- SHAPE
    wedge  -> {2, 4}    NO
    brick  -> {1}       YES
    sphere -> {3, 7}    YES
    pillar -> {5, 6} -- COLOUR
                  red   -> {6}   NO
                  green -> {5}   YES
Rule:
IF Shape is wedge
OR (Shape is pillar AND Colour is red)
THEN NO
ELSE YES
A new case

Size    Colour   Shape    Concept Satisfied
med     red      pillar   ?

Follow the tree: SHAPE = pillar, then COLOUR = red, so ? = NO
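The complete decision tree can be read off directly as a small function; it does not use SIZE at all, and applying it to the new case gives NO (a sketch; the function name is illustrative):

```python
def satisfied(size, colour, shape):
    """Classification read off the complete decision tree above."""
    if shape == 'wedge':
        return 'NO'
    if shape == 'pillar':
        return 'YES' if colour == 'green' else 'NO'
    return 'YES'              # brick and sphere branches are pure YES

print(satisfied('med', 'red', 'pillar'))   # -> NO
```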
Post Pruning

Consider any node S with N examples in the node, n of which are cases of class C, where C is one of {YES, NO} and is the class with the most examples, i.e. the majority class.

Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and we reach this node?

E(S) = Pr(class of new case is a class ≠ C)
Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p. Let f(p) be a prior probability distribution for p on [0, 1]. We can update this prior using Bayes' updating with the information at node S, namely "n cases of C among the N in S":

    f(p | n C in S) = Pr(n C in S | p) f(p) / ∫_0^1 Pr(n C in S | p) f(p) dp
Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

    f(p | n C in S) = p^n (1-p)^(N-n) / ∫_0^1 p^n (1-p)^(N-n) dp

The expected error is E(S) = E[1 - p] under f(p | n C in S), so

    E(S) = ∫_0^1 p^n (1-p)^(N-n+1) dp / ∫_0^1 p^n (1-p)^(N-n) dp = (N - n + 1) / (N + 2)

using Beta Functions. The evaluation of the integrals uses

    ∫_0^1 x^a (1-x)^b dx = a! b! / (a + b + 1)!

which gives n! (N - n + 1)! / (N + 2)! for the numerator and n! (N - n)! / (N + 1)! for the denominator.
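A quick numerical check of the closed form (a sketch; the helper names are illustrative):

```python
from math import factorial

def beta_integral(a, b):
    """Integral of x^a (1-x)^b over [0, 1] = a! b! / (a + b + 1)!."""
    return factorial(a) * factorial(b) / factorial(a + b + 1)

def expected_error(N, n):
    """E(S) = (N - n + 1) / (N + 2) for a node with N examples, n in the majority class."""
    return (N - n + 1) / (N + 2)

# The ratio of Beta integrals reproduces the closed form:
N, n = 10, 6
assert abs(beta_integral(n, N - n + 1) / beta_integral(n, N - n)
           - expected_error(N, n)) < 1e-12
print(expected_error(10, 6))   # 5/12 = 0.4167, the root error in the example below
```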
Post Pruning for Binary Case

A node S has children S1, S2, ..., Sm, reached with probabilities P1, P2, ..., Pm, where

    Pi = (Num of examples in Si) / (Num of examples in S)

For any node S which is not a leaf node we can calculate the backed-up error

    BackUpError(S) = Σ_i Pi Error(Si)

    Error(S) = MIN{ E(S), BackUpError(S) }

For leaf nodes Si, Error(Si) = E(Si).

Decision: Prune at S (i.e. cut the sub-tree below S) if BackUpError(S) ≥ E(S).
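A sketch of the pruning pass, assuming a hypothetical Node with counts = [yes, no] and a children list (not from the slides):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    counts: list                       # [number of YES cases, number of NO cases]
    children: list = field(default_factory=list)

def prune(node):
    """Return Error(node), pruning in place wherever BackUpError(S) >= E(S)."""
    N = sum(node.counts)
    n = max(node.counts)               # majority class count
    e = (N - n + 1) / (N + 2)          # E(S): expected error if S becomes a leaf
    if not node.children:
        return e
    backup = sum(sum(c.counts) / N * prune(c) for c in node.children)
    if backup >= e:                    # pruning does not increase expected error
        node.children = []             # cut the sub-tree below S
        return e
    return backup                      # Error(S) = MIN{E(S), BackUpError(S)}
```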
Example of Post Pruning

Before pruning ([x, y] means x YES cases and y NO cases; leaf errors are E(Sk)):

a [6, 4]              E(a) = 0.417   BackUpError(a) = 0.378
    b [4, 2]          E(b) = 0.375   BackUpError(b) = 0.413   PRUNE
        leaf [3, 2]       Error = 0.429
        leaf [1, 0]       Error = 0.333
    c [2, 2]          E(c) = 0.5     BackUpError(c) = 0.383
        d [1, 2]          E(d) = 0.4    BackUpError(d) = 0.444   PRUNE
            leaf [0, 1]       Error = 0.333
            leaf [1, 1]       Error = 0.5
        leaf [1, 0]       Error = 0.333

PRUNE means cut the sub-tree below this point.
Result of Pruning

After pruning:

a [6, 4]
    leaf [4, 2]          (b pruned)
    c [2, 2]
        leaf [1, 2]      (d pruned)
        leaf [1, 0]
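The figures in the example can be reproduced directly from E(S) = (N - n + 1) / (N + 2), using the node layout read off the tree above (a sketch over those assumed node positions):

```python
def E(yes, no):
    """Leaf error for a [yes, no] node: (N - n + 1) / (N + 2), n the majority count."""
    N, n = yes + no, max(yes, no)
    return (N - n + 1) / (N + 2)

backup_b = 5/6 * E(3, 2) + 1/6 * E(1, 0)                    # 0.413 >= E(b) = 0.375 -> prune b
backup_d = 1/3 * E(0, 1) + 2/3 * E(1, 1)                    # 0.444 >= E(d) = 0.400 -> prune d
backup_c = 3/4 * min(E(1, 2), backup_d) + 1/4 * E(1, 0)     # 0.383 <  E(c) = 0.500 -> keep c
backup_a = 6/10 * min(E(4, 2), backup_b) + 4/10 * backup_c  # 0.378 <  E(a) = 0.417 -> keep a
print(round(backup_b, 3), round(backup_d, 3), round(backup_c, 3), round(backup_a, 3))
```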
Generalisation

For the case in which we have k classes the generalisation for E(S) is

    E(S) = (N - n + k - 1) / (N + k)

Otherwise, the pruning method is the same.
Testing

The database is split into a Training Set and a Test Set.

Learn rules using the Training Set and prune. Test the rules on this set and record the % correct. Then test the rules on the Test Set and record the % correct.

The % accuracy on the test set should be close to that of the training set. This indicates good generalisation.
Over-fitting can occur if noisy data is used or too specific attributes are used. Pruning will overcome noise to some extent but not completely. Too specific attributes must be dropped.