
Data Mining using Decision Trees

Professor J. F. Baldwin

Decision Trees from a Data Base

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied
1        med        blue         brick       yes
2        small      red          wedge       no
3        small      red          sphere      yes
4        large      red          wedge       no
5        large      green        pillar      yes
6        large      red          pillar      no
7        large      green        sphere      yes

Choose target: Concept Satisfied. Use all attributes except Ex Num.

CLS - Concept Learning System - Hunt et al.

Tree structure: a parent node tests an attribute V; its values v1, v2, v3 label branches to child nodes, each of which may still hold a mixture of +ve and -ve examples.

CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set.

2. If all the examples in T are +ve, create a YES node and HALT.

3. If all the examples in T are -ve, create a NO node and HALT.

4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values of F. Create branches with F as parent and T1, ..., Tn as child nodes.

5. Apply the procedure recursively to each child node.
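A minimal Python sketch of this recursion, assuming examples are given as (attribute-dictionary, label) pairs with labels 'yes'/'no'; as in CLS, the attribute order is left arbitrary. The names are illustrative, not from the original slides.

```python
def cls(examples, attributes):
    """examples: list of (attrs, label) pairs; attrs is a dict, label is 'yes' or 'no'."""
    labels = [label for _, label in examples]
    if all(l == 'yes' for l in labels):
        return 'YES'                                   # step 2: all +ve
    if all(l == 'no' for l in labels):
        return 'NO'                                    # step 3: all -ve
    if not attributes:                                 # mixed but nothing left to split on
        return 'YES' if labels.count('yes') >= labels.count('no') else 'NO'
    f, rest = attributes[0], attributes[1:]            # step 4: select any attribute F
    branches = {}
    for attrs, label in examples:                      # partition T on the values of F
        branches.setdefault(attrs[f], []).append((attrs, label))
    return {f: {v: cls(subset, rest) for v, subset in branches.items()}}   # step 5
```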

Data Base Example

Using attribute SIZE:

{1, 2, 3, 4, 5, 6, 7} split on SIZE
  med   -> {1}            YES
  small -> {2, 3}         Expand
  large -> {4, 5, 6, 7}   Expand

Expanding

{1, 2, 3, 4, 5, 6, 7} split on SIZE
  med   -> {1}   YES
  small -> {2, 3} split on SHAPE
             wedge  -> {2}   No
             sphere -> {3}   Yes
  large -> {4, 5, 6, 7} split on SHAPE
             wedge  -> {4}   No
             sphere -> {7}   Yes
             pillar -> {5, 6} split on COLOUR
                         red   -> {6}   No
                         green -> {5}   Yes

Rules from Tree

IF (SIZE = large AND (SHAPE = wedge OR (SHAPE = pillar AND COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR SHAPE = sphere))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES

Disjunctive Normal Form - DNF

IF (SIZE = medium)
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied
ELSE CONCEPT = not satisfied
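As a quick check (not part of the original slides), a few lines of Python confirm that this DNF reproduces the Satisfied column of the seven-example table:

```python
# The seven training examples: (size, colour, shape, satisfied).
DATA = [
    ('med',   'blue',  'brick',  'yes'),
    ('small', 'red',   'wedge',  'no'),
    ('small', 'red',   'sphere', 'yes'),
    ('large', 'red',   'wedge',  'no'),
    ('large', 'green', 'pillar', 'yes'),
    ('large', 'red',   'pillar', 'no'),
    ('large', 'green', 'sphere', 'yes'),
]

def dnf_satisfied(size, colour, shape):
    """The DNF rule read off the decision tree built with SIZE first."""
    return (size == 'med'
            or (size == 'small' and shape == 'sphere')
            or (size == 'large' and shape == 'sphere')
            or (size == 'large' and shape == 'pillar' and colour == 'green'))

for size, colour, shape, satisfied in DATA:
    assert dnf_satisfied(size, colour, shape) == (satisfied == 'yes')
```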

ID3 - Quinlan

ID3 = CLS + efficient ordering of attributes

Entropy is used to order the attributes.

In the CLS algorithm attributes can be chosen in any order. This can result in large decision trees if the ordering is not optimal; an optimal ordering would give the smallest decision tree.

No method is known to determine the optimal ordering, so we use a heuristic to provide an efficient ordering which results in a near-optimal tree.

Entropy

For a random variable V which can take values {v1, v2, ..., vn} with Pr(vi) = pi for all i, the entropy of V is given by

S(V) = − Σ_{i=1..n} pi ln(pi)

Entropy of a fair die:  S = − Σ_{i=1..6} (1/6) ln(1/6) = ln(6) = 1.7917

Entropy of a fair die known to show an even score:  S = − Σ_{i=1..3} (1/3) ln(1/3) = ln(3) = 1.0986

Information gain = 1.7917 − 1.0986 = 0.6931, the difference between the two entropies.
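A few lines of Python reproduce these numbers, using natural logs as in the formula above (the helper name is illustrative):

```python
import math

def entropy(probs):
    """S(V) = -sum p_i ln(p_i), ignoring zero-probability values."""
    return -sum(p * math.log(p) for p in probs if p > 0)

s_fair = entropy([1/6] * 6)              # ln(6) = 1.7917
s_even = entropy([1/3] * 3)              # ln(3) = 1.0986
print(s_fair, s_even, s_fair - s_even)   # information gain = 0.6931
```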

Attribute Expansion

To expand an attribute Ai with values ai1, ..., aim, start from the joint probability Pr(A1, ..., Ai, ..., An, T) over all attributes and the target T, taken as equally likely over the examples unless otherwise specified. For the branch Ai = ai1, pass down the probabilities corresponding to ai1 and re-normalise, giving Pr(A1, ..., Ai-1, Ai+1, ..., An, T | Ai = ai1) over the remaining attributes. If the original probabilities were equally likely, the re-normalised ones are equally likely again.

Expected Entropy for an Attribute

For attribute Ai with values ai1, ..., aim and target T, marginalise the joint distribution over all the other attributes:

Pr(Ai, T) = Σ_{A1} ... Σ_{Ai−1} Σ_{Ai+1} ... Σ_{An} Pr(A1, ..., An, T)

Pr(Ai) = Σ_T Pr(Ai, T)

For each value ai1, pass down the probabilities corresponding to the target values tk and re-normalise to obtain Pr(tk | ai1) = Pr(T | Ai = ai1). The entropy of that branch is

S(ai1) = − Σ_k Pr(tk | ai1) ln Pr(tk | ai1)

and the expected entropy for Ai is

S(Ai) = Σ_k Pr(aik) S(aik)

How to Choose an Attribute, and Information Gain

Determine the expected entropy for each attribute, i.e. S(Ai) for all i.

Choose s such that S(As) = MIN_j {S(Aj)} and expand attribute As.

By choosing attribute As the information gain is S − S(As), where

S = − Σ_T Pr(T) ln Pr(T),   with   Pr(T) = Σ_{A1} ... Σ_{An} Pr(A1, ..., An, T)

Minimising the expected entropy is equivalent to maximising the information gain.
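A hedged sketch of this selection step, assuming all examples are equally likely so every probability is a count divided by a total; logs are taken to base 2, matching the worked example that follows. Function names are illustrative.

```python
from collections import Counter, defaultdict
import math

def entropy_of_counts(counts, base=2):
    """Entropy of a distribution given as a value -> count table."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values() if c > 0)

def expected_entropy(examples, attr):
    """S(A_i) = sum_k Pr(a_ik) S(a_ik), with probabilities taken from example counts."""
    groups = defaultdict(Counter)
    for attrs, label in examples:
        groups[attrs[attr]][label] += 1
    n = len(examples)
    return sum((sum(g.values()) / n) * entropy_of_counts(g) for g in groups.values())

def best_attribute(examples, attributes):
    """Choose A_s with minimal expected entropy, i.e. maximal information gain."""
    return min(attributes, key=lambda a: expected_entropy(examples, a))
```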

Previous Example

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied   Pr
1        med        blue         brick       yes                 1/7
2        small      red          wedge       no                  1/7
3        small      red          sphere      yes                 1/7
4        large      red          wedge       no                  1/7
5        large      green        pillar      yes                 1/7
6        large      red          pillar      no                  1/7
7        large      green        sphere      yes                 1/7

Concept Satisfied   Pr
yes                 4/7
no                  3/7

S = −[(4/7) log2(4/7) + (3/7) log2(3/7)] = 0.99   (logs to base 2 here and below)

Entropy for Attribute Size

Att Size   Concept Satisfied   Pr
med        yes                 1/7
small      no                  1/7
small      yes                 1/7
large      no                  2/7
large      yes                 2/7

Conditional distributions of Concept Satisfied:

small:  no 1/2, yes 1/2   ->  S(small) = 1
med:    yes 1             ->  S(med)   = 0
large:  no 1/2, yes 1/2   ->  S(large) = 1

Pr(small) = 2/7,  Pr(med) = 1/7,  Pr(large) = 4/7

S(Size) = (2/7)·1 + (1/7)·0 + (4/7)·1 = 6/7 = 0.86

Information gain for Size = 0.99 − 0.86 = 0.13

First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.70   <- choose the maximum

{1, 2, 3, 4, 5, 6, 7} split on SHAPE
  wedge  -> {2, 4}   NO
  brick  -> {1}      YES
  sphere -> {3, 7}   YES
  pillar -> {5, 6}   Expand
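Applying the same calculation to the seven-example table reproduces the gain column above (logs to base 2). This standalone snippet mirrors the earlier sketch; names are illustrative.

```python
from collections import Counter
import math

def h(counts):
    """Entropy (base 2) of a value -> count table."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

EXAMPLES = [
    ({'size': 'med',   'colour': 'blue',  'shape': 'brick'},  'yes'),
    ({'size': 'small', 'colour': 'red',   'shape': 'wedge'},  'no'),
    ({'size': 'small', 'colour': 'red',   'shape': 'sphere'}, 'yes'),
    ({'size': 'large', 'colour': 'red',   'shape': 'wedge'},  'no'),
    ({'size': 'large', 'colour': 'green', 'shape': 'pillar'}, 'yes'),
    ({'size': 'large', 'colour': 'red',   'shape': 'pillar'}, 'no'),
    ({'size': 'large', 'colour': 'green', 'shape': 'sphere'}, 'yes'),
]

s = h(Counter(label for _, label in EXAMPLES))        # 0.985, shown as 0.99 on the slide
for attr in ('size', 'colour', 'shape'):
    groups = {}
    for attrs, label in EXAMPLES:
        groups.setdefault(attrs[attr], Counter())[label] += 1
    s_attr = sum(sum(g.values()) / len(EXAMPLES) * h(g) for g in groups.values())
    print(attr, round(s - s_attr, 2))                 # size 0.13, colour 0.52, shape 0.7
```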

Complete Decision Tree

{1, 2, 3, 4, 5, 6, 7} split on SHAPE
  wedge  -> {2, 4}   NO
  brick  -> {1}      YES
  sphere -> {3, 7}   YES
  pillar -> {5, 6} split on COLOUR
              red   -> {6}   NO
              green -> {5}   YES

Rule:
IF Shape is wedge
OR (Shape is pillar AND Colour is red)
THEN NO
ELSE YES

A New Case

Att Size   Att Colour   Att Shape   Concept Satisfied
med        red          pillar      ?

Follow the tree: SHAPE = pillar, then COLOUR = red, so ? = NO.
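A small sketch of how the learned tree classifies such a case; the nested dictionaries below transcribe the complete decision tree above, and the function name is illustrative.

```python
# The complete decision tree, written as nested dicts (a leaf is a class label).
TREE = {'shape': {
    'wedge':  'NO',
    'brick':  'YES',
    'sphere': 'YES',
    'pillar': {'colour': {'red': 'NO', 'green': 'YES'}},
}}

def classify(tree, case):
    """Walk the tree: test the node's attribute, follow the branch for the case's value."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[case[attribute]]
    return tree

print(classify(TREE, {'size': 'med', 'colour': 'red', 'shape': 'pillar'}))   # NO
```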

Post Pruning

Consider any node S containing N examples, of which n belong to class C, where C is the class with the most examples at S (the majority class), C being one of {YES, NO}.

Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and we reach this node?

E(S) = Pr(class of a new case ≠ C)

Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p, so let f(p) be a prior probability distribution for p on [0, 1]. We can update this prior using Bayes' updating with the information at node S, namely that n of the N examples at S are of class C:

f(p | n C in S) = Pr(n C in S | p) f(p) / ∫₀¹ Pr(n C in S | p) f(p) dp

Mathematics of Post Pruning

Assume f(p) is uniform over [0, 1]. Then Pr(n C in S | p) is proportional to p^n (1 − p)^(N − n), so

f(p | n C in S) = p^n (1 − p)^(N − n) / ∫₀¹ p^n (1 − p)^(N − n) dp

E(S) = E[1 − p] under f(p | n C in S)

     = ∫₀¹ p^n (1 − p)^(N − n + 1) dp / ∫₀¹ p^n (1 − p)^(N − n) dp

     = (N − n + 1) / (N + 2)

The integrals are evaluated using Beta functions:

∫₀¹ x^a (1 − x)^b dx = a! b! / (a + b + 1)!

so the numerator is n! (N − n + 1)! / (N + 2)! and the denominator is n! (N − n)! / (N + 1)!.
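A short sketch of this estimate, checking the closed form against a direct numerical evaluation of the ratio of integrals (the crude Riemann sum is only for illustration):

```python
def expected_error(N, n):
    """E(S) = (N - n + 1) / (N + 2): expected error of a leaf with n majority cases out of N."""
    return (N - n + 1) / (N + 2)

def expected_error_numeric(N, n, steps=10000):
    """Evaluate E[1 - p] under f(p | n C in S) by a simple Riemann sum."""
    num = den = 0.0
    for i in range(1, steps):
        p = i / steps
        w = p**n * (1 - p)**(N - n)
        num += w * (1 - p)
        den += w
    return num / den

print(expected_error(5, 3), expected_error_numeric(5, 3))   # both about 0.4286
```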

Post Pruning for the Binary Case

Node S has children S1, S2, ..., Sm, reached with probabilities P1, P2, ..., Pm, where

Pi = (number of examples in Si) / (number of examples in S)

For any node S which is not a leaf node we can calculate

BackUpError(S) = Σ_i Pi Error(Si)

Error(S) = MIN{ E(S), BackUpError(S) }

For leaf nodes Si, Error(Si) = E(Si).

Decision: prune at S if BackUpError(S) ≥ E(S), i.e. whenever backing up the errors of the subtree does not improve on making S a leaf.
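A hedged sketch of these pruning rules, assuming each node is a dictionary carrying its YES/NO counts and, for internal nodes, a list of children; names are illustrative.

```python
def expected_error(yes, no):
    """E(S) = (N - n + 1) / (N + 2), where n is the count of the majority class."""
    N, n = yes + no, max(yes, no)
    return (N - n + 1) / (N + 2)

def error(node):
    """Error(S): E(S) for a leaf, MIN{E(S), BackUpError(S)} otherwise.
    Nodes are dicts with 'yes'/'no' counts; leaves simply omit 'children'."""
    e = expected_error(node['yes'], node['no'])
    children = node.get('children', [])
    if not children:
        return e                                    # leaf: Error(S) = E(S)
    total = node['yes'] + node['no']
    backup = sum((c['yes'] + c['no']) / total * error(c) for c in children)
    node['E'], node['backup'] = e, backup
    node['prune'] = backup >= e                     # prune at S if BackUpError(S) >= E(S)
    return min(e, backup)
```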

Example of Post Pruning

Before pruning ([x, y] means x YES cases and y NO cases). For each internal node the pair E(S), BackUpError(S) is shown; each leaf shows Error(Sk) = E(Sk):

a [6, 4]   E = 0.417, BackUpError = 0.378
  b [4, 2]   E = 0.375, BackUpError = 0.413   PRUNE
    leaf [3, 2]   0.429
    leaf [1, 0]   0.333
  c [2, 2]   E = 0.5, BackUpError = 0.383
    d [1, 2]   E = 0.4, BackUpError = 0.444   PRUNE
      leaf [1, 1]   0.5
      leaf [0, 1]   0.333
    leaf [1, 0]   0.333

PRUNE means cut the sub-tree below this point.

Result of Pruning

After pruning:

a [6, 4]
  leaf [4, 2]      (b pruned to a leaf)
  c [2, 2]
    leaf [1, 2]    (d pruned to a leaf)
    leaf [1, 0]
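As a usage check, running the `error` helper from the sketch above on the before-pruning tree reproduces the slide's figures:

```python
# The "Before Pruning" tree: a[6,4] -> b[4,2], c[2,2];  b -> [3,2], [1,0];
# c -> d[1,2], [1,0];  d -> [1,1], [0,1].
d = {'yes': 1, 'no': 2, 'children': [{'yes': 1, 'no': 1}, {'yes': 0, 'no': 1}]}
b = {'yes': 4, 'no': 2, 'children': [{'yes': 3, 'no': 2}, {'yes': 1, 'no': 0}]}
c = {'yes': 2, 'no': 2, 'children': [d, {'yes': 1, 'no': 0}]}
a = {'yes': 6, 'no': 4, 'children': [b, c]}

error(a)   # fills in E, BackUpError and the prune decision at every internal node
for name, node in (('a', a), ('b', b), ('c', c), ('d', d)):
    print(name, round(node['E'], 3), round(node['backup'], 3),
          'PRUNE' if node['prune'] else '')
# a 0.417 0.378  /  b 0.375 0.413 PRUNE  /  c 0.5 0.383  /  d 0.4 0.444 PRUNE
```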

Generalisation

For the case in which we have k classes the generalisation for E(S) is

E(S) = (N − n + k − 1) / (N + k)

Otherwise the pruning method is the same.
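For completeness, the generalised estimate as a one-line helper (the name is illustrative; k = 2 recovers the binary case above):

```python
def expected_error_k(N, n, k):
    """Generalised leaf error for k classes: (N - n + k - 1) / (N + k)."""
    return (N - n + k - 1) / (N + k)
```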

Testing

Split the Data Base into a Training Set and a Test Set.

Learn the rules using the Training Set and prune. Test the rules on the Training Set and record the % correct, then test the rules on the Test Set and record the % correct.

The % accuracy on the Test Set should be close to that on the Training Set; this indicates good generalisation.

Over-fitting can occur if the data is noisy or if attributes that are too specific are used. Pruning will overcome noise to some extent, but not completely; attributes that are too specific must be dropped.
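A hedged sketch of this protocol, written against hypothetical `build_tree` and `classify` helpers in the spirit of the earlier sketches (they are passed in as parameters and are not defined here):

```python
import random

def evaluate(examples, build_tree, classify, train_fraction=0.7, seed=0):
    """Split examples into Training and Test Sets, learn (and prune) on the
    Training Set, then compare % correct on both sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    training, test = shuffled[:cut], shuffled[cut:]
    tree = build_tree(training)                  # learn rules (pruning assumed inside)

    def accuracy(data):
        return sum(classify(tree, attrs) == label for attrs, label in data) / len(data)

    return accuracy(training), accuracy(test)    # close values indicate good generalisation
```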
