
Page 1: Data Mining using Decision Trees Professor J. F. Baldwin

Data Mining using Decision Trees

Professor J. F. Baldwin

Page 2: Data Mining using Decision Trees Professor J. F. Baldwin

Decision Trees from a Data Base

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied
1        med        blue         brick       yes
2        small      red          wedge       no
3        small      red          sphere      yes
4        large      red          wedge       no
5        large      green        pillar      yes
6        large      red          pillar      no
7        large      green        sphere      yes

Choose target: Concept Satisfied. Use all attributes except Ex Num.
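For use in the sketches that follow, the seven examples could be encoded in Python as a list of dictionaries; the field and variable names are illustrative, not part of the slides:

```python
# Hypothetical encoding of the seven training examples from the slide.
EXAMPLES = [
    {"size": "med",   "colour": "blue",  "shape": "brick",  "satisfied": "yes"},
    {"size": "small", "colour": "red",   "shape": "wedge",  "satisfied": "no"},
    {"size": "small", "colour": "red",   "shape": "sphere", "satisfied": "yes"},
    {"size": "large", "colour": "red",   "shape": "wedge",  "satisfied": "no"},
    {"size": "large", "colour": "green", "shape": "pillar", "satisfied": "yes"},
    {"size": "large", "colour": "red",   "shape": "pillar", "satisfied": "no"},
    {"size": "large", "colour": "green", "shape": "sphere", "satisfied": "yes"},
]
ATTRIBUTES = ["size", "colour", "shape"]   # all attributes except Ex Num
TARGET = "satisfied"                       # the chosen target concept
```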

Page 3: Data Mining using Decision Trees Professor J. F. Baldwin

CLS - Concept Learning System - Hunt et al.

Tree structure: a parent node containing a mixture of +ve and -ve examples is split on an attribute V, with one child node for each of its values v1, v2, v3.

Page 4: Data Mining using Decision Trees Professor J. F. Baldwin

CLS ALGORITHM

1. Initialise the tree T by setting it to consist of one node containing all the examples, both +ve and -ve, in the training set

2. If all the examples in T are +ve, create a YES node and HALT

3. If all the examples in T are -ve, create a NO node and HALT

4. Otherwise, select an attribute F with values v1, ..., vn. Partition T into subsets T1, ..., Tn according to the values of F. Create branches with F as parent and T1, ..., Tn as child nodes.

5. Apply the procedure recursively to each child node
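A minimal Python sketch of the CLS procedure, assuming the dictionary encoding of examples shown earlier; the helper names are illustrative, and the attribute to split on is simply taken in list order (CLS allows any choice at each node, so this will not necessarily reproduce the trees on the following slides):

```python
def cls(examples, attributes, target="satisfied"):
    """Recursive CLS: returns 'YES', 'NO', or a dict {attribute: {value: subtree}}."""
    labels = {e[target] for e in examples}
    if labels == {"yes"}:                       # step 2: all +ve -> YES node
        return "YES"
    if labels == {"no"}:                        # step 3: all -ve -> NO node
        return "NO"
    if not attributes:                          # inconsistent data: fall back to majority
        return "YES" if sum(e[target] == "yes" for e in examples) >= len(examples) / 2 else "NO"
    attr, rest = attributes[0], attributes[1:]  # step 4: pick an attribute F
    tree = {attr: {}}
    for v in {e[attr] for e in examples}:       # partition T into T1..Tn on F
        subset = [e for e in examples if e[attr] == v]
        tree[attr][v] = cls(subset, rest, target)   # step 5: recurse on each child
    return tree

print(cls(EXAMPLES, ATTRIBUTES))
```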

Page 5: Data Mining using Decision Trees Professor J. F. Baldwin

Data Base Example

Using attribute SIZE

{1, 2, 3, 4, 5, 6, 7}  split on SIZE:
    med   -> {1}            YES
    small -> {2, 3}         Expand
    large -> {4, 5, 6, 7}   Expand

Page 6: Data Mining using Decision Trees Professor J. F. Baldwin

Expanding

{1, 2, 3, 4, 5, 6, 7}  SIZE
    med   -> {1}  YES
    small -> {2, 3}  COLOUR
        red -> {2, 3}  SHAPE
            wedge  -> {2}  NO
            sphere -> {3}  YES
    large -> {4, 5, 6, 7}  SHAPE
        wedge  -> {4}  NO
        sphere -> {7}  YES
        pillar -> {5, 6}  COLOUR
            red   -> {6}  NO
            green -> {5}  YES

Page 7: Data Mining using Decision Trees Professor J. F. Baldwin

Rules from Tree

IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR SHAPE = sphere))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES

Page 8: Data Mining using Decision Trees Professor J. F. Baldwin

Disjunctive Normal Form - DNF

IF (SIZE = medium)
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = sphere)
OR (SIZE = large AND SHAPE = pillar AND COLOUR = green)
THEN CONCEPT = satisfied

ELSE CONCEPT = not satisfied
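The DNF rule translates directly into code; a sketch reusing the EXAMPLES encoding from the earlier sketch (note the table abbreviates medium to med):

```python
def satisfied(example):
    """DNF rule read off the decision tree (returns 'yes' or 'no')."""
    size, shape, colour = example["size"], example["shape"], example["colour"]
    if (size == "med"
            or (size == "small" and shape == "sphere")
            or (size == "large" and shape == "sphere")
            or (size == "large" and shape == "pillar" and colour == "green")):
        return "yes"
    return "no"

# Sanity check against the training table: every example is classified correctly.
assert all(satisfied(e) == e["satisfied"] for e in EXAMPLES)
```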

Page 9: Data Mining using Decision Trees Professor J. F. Baldwin

ID3 - Quinlan

ID3 = CLS + efficient ordering of attributes

Entropy is used to order the attributes.

Attributes are chosen in any order for the CLS algorithm. This can result in large decision trees if the ordering is not optimal. An optimal ordering would result in the smallest decision tree.

No method is known to determine the optimal ordering. We use a heuristic to provide an efficient ordering which will result in a near-optimal ordering.

Page 10: Data Mining using Decision Trees Professor J. F. Baldwin

Entropy

For a random variable V which can take values {v1, v2, …, vn} with Pr(vi) = pi for all i, the entropy of V is given by

S(V) = - Σ_{i=1..n} pi ln(pi)

Entropy for a fair die:
S = - Σ_{i=1..6} (1/6) ln(1/6) = ln(6) = 1.7917

Entropy for a fair die known to show an even score:
S = - Σ_{i=1..3} (1/3) ln(1/3) = ln(3) = 1.0986

Information gain = 1.7917 - 1.0986 = 0.6931, the difference between the entropies.
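The die example can be checked numerically; a small sketch using natural logarithms, as in the definition above:

```python
import math

def entropy(probs, log=math.log):
    """S(V) = -sum_i p_i log(p_i); natural log by default, skipping zero probabilities."""
    return -sum(p * log(p) for p in probs if p > 0)

fair_die = entropy([1/6] * 6)                     # ln(6) ~= 1.7918
even_die = entropy([1/3] * 3)                     # ln(3) ~= 1.0986
print(fair_die, even_die, fair_die - even_die)    # gain ~= 0.6931 = ln(2)
```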

Page 11: Data Mining using Decision Trees Professor J. F. Baldwin

Attribute Expansion

Expanding attribute Ai: the node holds a probability distribution Pr(A1, …, Ai, …, An, T) over the attributes and the target T, with the examples equally likely unless specified otherwise. Expansion creates one child per value ai1, …, aim of Ai. The child for Ai = ai1 holds the distribution over the other attributes,

Pr(A1, …, Ai-1, Ai+1, …, An, T | Ai = ai1)

obtained by passing down the probabilities corresponding to ai1 from above and re-normalising (equally likely again if the previous distribution was equally likely).

Page 12: Data Mining using Decision Trees Professor J. F. Baldwin

Expected Entropy for an Attribute

For attribute Ai and target T, expanding Ai gives one branch per value ai1, …, aim, each with its own target distribution Pr(T | Ai = aik): pass down the probabilities corresponding to each tk from above for aik and re-normalise. The marginals are

Pr(Ai, T) = Σ_{A1} … Σ_{Ai-1} Σ_{Ai+1} … Σ_{An} Pr(A1, …, An, T)

Pr(Ai) = Σ_T Pr(Ai, T)

The entropy at branch ai1 is

S(ai1) = - Σ_k Pr(tk | ai1) ln Pr(tk | ai1)

Expected entropy for Ai:

S(Ai) = Σ_k Pr(aik) S(aik)
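With all examples equally likely, S(Ai) reduces to counting; a minimal sketch reusing the entropy helper above and the EXAMPLES encoding from the first sketch (names are illustrative):

```python
import math
from collections import Counter

def expected_entropy(examples, attribute, target="satisfied", log=math.log):
    """S(Ai) = sum_k Pr(aik) * S(aik), with all examples equally likely."""
    n_total = len(examples)
    total = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        counts = Counter(e[target] for e in subset)
        probs = [c / len(subset) for c in counts.values()]       # Pr(tk | aik)
        total += (len(subset) / n_total) * entropy(probs, log)   # Pr(aik) * S(aik)
    return total
```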

Page 13: Data Mining using Decision Trees Professor J. F. Baldwin

How to Choose an Attribute, and Information Gain

Determine the expected entropy for each attribute, i.e. S(Ai) for all i.

Choose s such that MIN_j { S(Aj) } = S(As), and expand attribute As.

By choosing attribute As the information gain is S - S(As), where

S = - Σ_T Pr(T) ln Pr(T)     and     Pr(T) = Σ_{A1} … Σ_{An} Pr(A1, …, An, T)

Minimising the expected entropy is equivalent to maximising the information gain.

Page 14: Data Mining using Decision Trees Professor J. F. Baldwin

Previous Example

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied   Pr
1        med        blue         brick       yes                 1/7
2        small      red          wedge       no                  1/7
3        small      red          sphere      yes                 1/7
4        large      red          wedge       no                  1/7
5        large      green        pillar      yes                 1/7
6        large      red          pillar      no                  1/7
7        large      green        sphere      yes                 1/7

Pr(Concept Satisfied):   yes 4/7,   no 3/7

S = -(4/7) log2(4/7) - (3/7) log2(3/7) = 0.99     (using logs to base 2)

Page 15: Data Mining using Decision Trees Professor J. F. Baldwin

Entropy for attribute SIZE

Att Size   Concept Satisfied   Pr
med        yes                 1/7
small      no                  1/7
small      yes                 1/7
large      no                  2/7
large      yes                 2/7

Conditional distributions of Concept Satisfied:
small:  no 1/2, yes 1/2   ->  S(small) = 1
med:    yes 1             ->  S(med)   = 0
large:  no 1/2, yes 1/2   ->  S(large) = 1

Pr(small) = 2/7,   Pr(med) = 1/7,   Pr(large) = 4/7

S(Size) = (2/7)(1) + (1/7)(0) + (4/7)(1) = 6/7 = 0.86

Information gain for Size = 0.99 - 0.86 = 0.13
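These numbers, and those on the next slide, can be reproduced with the helpers sketched earlier (EXAMPLES, entropy, expected_entropy); note that this worked example uses logarithms to base 2:

```python
import math
from collections import Counter

target_counts = Counter(e["satisfied"] for e in EXAMPLES)
S = entropy([c / len(EXAMPLES) for c in target_counts.values()], math.log2)
print(round(S, 2))                                        # 0.99

for att in ["size", "colour", "shape"]:
    gain = S - expected_entropy(EXAMPLES, att, log=math.log2)
    print(att, round(gain, 2))                            # size 0.13, colour 0.52, shape 0.7
```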

Page 16: Data Mining using Decision Trees Professor J. F. Baldwin

First Expansion

Attribute   Information Gain
SIZE        0.13
COLOUR      0.52
SHAPE       0.7     <- choose max

{1, 2, 3, 4, 5, 6, 7}  SHAPE
    wedge  -> {2, 4}   NO
    brick  -> {1}      YES
    sphere -> {3, 7}   YES
    pillar -> {5, 6}   Expand

Page 17: Data Mining using Decision Trees Professor J. F. Baldwin

Complete Decision Tree

{1, 2, 3, 4, 5, 6, 7}  SHAPE
    wedge  -> {2, 4}   NO
    brick  -> {1}      YES
    sphere -> {3, 7}   YES
    pillar -> {5, 6}   COLOUR
        red   -> {6}   NO
        green -> {5}   YES

Rule:
IF Shape is wedge
OR (Shape is pillar AND Colour is red)
THEN NO
ELSE YES

Page 18: Data Mining using Decision Trees Professor J. F. Baldwin

A new case

Att Size   Att Colour   Att Shape   Concept Satisfied
med        red          pillar      ?

Follow the tree: SHAPE = pillar, then COLOUR = red, so

? = NO
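A small sketch of the complete SHAPE-first tree as nested dictionaries, to check the classification of this new case (the representation is an assumption, not from the slides):

```python
# Final decision tree: each internal node is {attribute: {value: subtree}}, leaves are labels.
TREE = {"shape": {
    "wedge": "NO",
    "brick": "YES",
    "sphere": "YES",
    "pillar": {"colour": {"red": "NO", "green": "YES"}},
}}

def classify(tree, example):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

print(classify(TREE, {"size": "med", "colour": "red", "shape": "pillar"}))  # NO
```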

Page 19: Data Mining using Decision Trees Professor J. F. Baldwin

Post Pruning

Consider any node S with N examples in the node, n of which are cases of class C, where C is one of {YES, NO} and is the class with the most examples, i.e. the majority class.

Suppose we terminate this node and make it a leaf with classification C. What will be the expected error, E(S), if we use the tree for new cases and we reach this node?

E(S) = Pr(class of new case is a class ≠ C)

Page 20: Data Mining using Decision Trees Professor J. F. Baldwin

Bayes Updating for Post Pruning

Let p denote the probability of class C for a new case arriving at S. We do not know p. Let f(p) be a prior probability distribution for p on [0, 1]. We can update this prior using Bayes' updating with the information at node S.

The information at node S is: n cases of class C out of the N examples in S.

f(p | n C in S) = Pr(n C in S | p) f(p) / ∫[0,1] Pr(n C in S | p) f(p) dp

Page 21: Data Mining using Decision Trees Professor J. F. Baldwin

Mathematics of Post Pruning

Assume f(p) to be uniform over [0, 1]. Then

f(p | n C in S) = p^n (1-p)^(N-n) / ∫[0,1] p^n (1-p)^(N-n) dp

E(S) = E(1 - p) = ∫[0,1] (1 - p) f(p | n C in S) dp

     = ∫[0,1] p^n (1-p)^(N-n+1) dp / ∫[0,1] p^n (1-p)^(N-n) dp  =  (N - n + 1) / (N + 2)

The integrals are evaluated using Beta functions:

∫[0,1] x^a (1-x)^b dx = a! b! / (a + b + 1)!

so the numerator is n! (N - n + 1)! / (N + 2)! and the denominator is n! (N - n)! / (N + 1)!, giving E(S) = (N - n + 1) / (N + 2).
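This leaf-error estimate is a one-liner; a small sketch, checked against values that appear in the worked example two slides on:

```python
def expected_leaf_error(N, n):
    """E(S) = (N - n + 1) / (N + 2): expected error of a leaf with N examples,
    n of which are in the majority class (binary case, uniform prior)."""
    return (N - n + 1) / (N + 2)

print(expected_leaf_error(5, 3))   # 0.4285...  (leaf [3, 2])
print(expected_leaf_error(2, 1))   # 0.5        (leaf [1, 1])
print(expected_leaf_error(1, 1))   # 0.333...   (leaf [1, 0])
```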

Page 22: Data Mining using Decision Trees Professor J. F. Baldwin

Post Pruning for Binary Case

For any node S which is not a leaf node, with child nodes S1, S2, ..., Sm reached with probabilities P1, P2, ..., Pm, we can calculate

BackUpError(S) = Σ_i Pi Error(Si)

Error(S) = MIN{ E(S), BackUpError(S) }

Pi = (number of examples in Si) / (number of examples in S)

For leaf nodes Si:  Error(Si) = E(Si)

Decision: Prune at S if BackUpError(S) ≥ E(S)

Page 23: Data Mining using Decision Trees Professor J. F. Baldwin

Example of Post Pruning

Before Pruning. [x, y] means x YES cases and y NO cases. Each internal node shows E(S) and BackUpError(S); each leaf shows its error E(Sk).

a [6, 4]: E = 0.417, BackUpError = 0.378
    b [4, 2]: E = 0.375, BackUpError = 0.413   -> PRUNE
        [3, 2]: 0.429
        [1, 0]: 0.333
    c [2, 2]: E = 0.5, BackUpError = 0.383
        d [1, 2]: E = 0.4, BackUpError = 0.444   -> PRUNE
            [0, 1]: 0.333
            [1, 1]: 0.5
        [1, 0]: 0.333

PRUNE means cut the sub-tree below this point.
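The full pruning pass on this example can be reproduced with a short sketch (reusing expected_leaf_error from the earlier sketch); the node names and the nested-dictionary layout are assumptions reconstructed from the counts shown above:

```python
def errors(node):
    """Return (number of examples, Error(S)); print the pruning decision for internal nodes."""
    yes, no, children = node["yes"], node["no"], node.get("children", [])
    N, n = yes + no, max(yes, no)
    e_static = expected_leaf_error(N, n)                       # E(S)
    if not children:
        return N, e_static
    child_stats = [errors(c) for c in children]
    backup = sum((Nc / N) * err for Nc, err in child_stats)    # BackUpError(S)
    print(node["name"], "E =", round(e_static, 3), "BackUpError =", round(backup, 3),
          "PRUNE" if backup >= e_static else "keep")
    return N, min(e_static, backup)                            # Error(S)

tree = {"name": "a", "yes": 6, "no": 4, "children": [
    {"name": "b", "yes": 4, "no": 2, "children": [
        {"name": "b1", "yes": 3, "no": 2},
        {"name": "b2", "yes": 1, "no": 0}]},
    {"name": "c", "yes": 2, "no": 2, "children": [
        {"name": "d", "yes": 1, "no": 2, "children": [
            {"name": "d1", "yes": 0, "no": 1},
            {"name": "d2", "yes": 1, "no": 1}]},
        {"name": "c2", "yes": 1, "no": 0}]},
]}
errors(tree)   # prunes at b and d, keeps a and c, matching the slide
```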

Page 24: Data Mining using Decision Trees Professor J. F. Baldwin

Result of Pruning

After Pruning

a [6, 4]
    [4, 2]      (b pruned to a leaf)
    c [2, 2]
        [1, 2]  (d pruned to a leaf)
        [1, 0]

Page 25: Data Mining using Decision Trees Professor J. F. Baldwin

Generalisation

For the case in which we have k classes, the generalisation for E(S) is

E(S) = (N - n + k - 1) / (N + k)

Otherwise, the pruning method is the same.
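A corresponding one-line extension of the earlier expected_leaf_error sketch (the function name is illustrative):

```python
def expected_leaf_error_k(N, n, k):
    """E(S) = (N - n + k - 1) / (N + k) for k classes; k = 2 recovers (N - n + 1) / (N + 2)."""
    return (N - n + k - 1) / (N + k)
```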

Page 26: Data Mining using Decision Trees Professor J. F. Baldwin

Testing

The Data Base is split into a Training Set and a Test Set.

Learn rules using the Training Set and prune.
Test the rules on the Training Set and record the % correct.
Test the rules on the Test Set and record the % correct.

The % accuracy on the test set should be close to that of the training set. This indicates good generalisation.

Over-fitting can occur if noisy data is used or if overly specific attributes are used. Pruning will overcome noise to some extent but not completely. Overly specific attributes must be dropped.
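This train-and-test procedure is what standard tooling automates; a minimal sketch with scikit-learn, comparing training and test accuracy (the library, dataset and parameters are assumptions, not part of the slides, and scikit-learn's tree is CART-based with cost-complexity pruning rather than ID3 with the post-pruning rule above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the database into a Training Set and a Test Set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn a pruned tree on the Training Set.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # % correct on the Training Set
test_acc = tree.score(X_test, y_test)      # % correct on the Test Set
print(train_acc, test_acc)                 # close values suggest good generalisation
```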