comp527-08
TRANSCRIPT
Dr Robert Sanderson
Dept. of Computer Science
University of Liverpool
2008
COMP527: Data Mining
Classification: Trees
January 18, 2008
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Trees
Tree Learning Algorithm
Attribute Splitting Decisions
Random
'Purity Count'
Entropy (aka ID3)
Information Gain Ratio
Trees
Anything can be made better by storing it in a tree structure! (Not really!)
Instead of having lists or sets of rules, why not have a tree of rules? Then there's no problem with order, or repeating the same test over and over again in different conjunctive rules.
So each node in the tree is an attribute test, and the branches from that node are the different outcomes.
Instead of 'separate and conquer', Decision Trees are the more typical 'divide and conquer' approach. Once the tree is built, new instances can be tested by simply stepping through each test.
Example Data Again
Here's our example data again:
How to construct a tree from it, instead of rules?
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
Tree Learning Algorithm
Trivial Tree Learner:
create empty tree T
select attribute A
create branches in T for each value v of A
for each branch,
recurse with instances where A=v
add the resulting subtree as the branch's node
The most interesting part of this algorithm is line 2, the attribute selection. Let's start with a Random selection, then look at how it might be improved.
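The slides give only pseudocode; the following is a minimal Python sketch of this trivial learner (my own illustration, not code from the lecture: the data format, helper names, and the 'Play?' target are assumptions). Attribute selection here is random, matching the example that follows.

    # Illustrative sketch of the trivial tree learner above. Assumed data format:
    # a list of dicts such as {'Outlook': 'sunny', ..., 'Play?': 'no'}.
    import random

    def build_tree(instances, attributes, target='Play?'):
        classes = [inst[target] for inst in instances]
        if len(set(classes)) == 1:              # pure node: stop with that class
            return classes[0]
        if not attributes:                      # no tests left: predict the majority class
            return max(set(classes), key=classes.count)
        A = random.choice(attributes)           # 'line 2': random attribute selection
        tree = {A: {}}
        for v in set(inst[A] for inst in instances):
            subset = [inst for inst in instances if inst[A] == v]
            rest = [a for a in attributes if a != A]
            tree[A][v] = build_tree(subset, rest, target)   # recurse; add subtree as branch
        return tree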
Random method: Let's pick 'windy'
Need to split again, looking at only the 8 and 6 instances respectively.
For windy=false, we'll randomly select outlook:
sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes
As all instances of overcast and rainy are yes, they stop, sunny continues.
Windy
  false: 6 yes, 2 no
  true:  3 yes, 3 no
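A small helper in the same illustrative style (assumed names; 'weather' stands for the data table loaded as in the sketch above) tallies the class counts down each branch of a candidate split, which is what the figures above show for 'windy':

    # Tally the class distribution for each value of an attribute (illustrative).
    from collections import Counter

    def branch_counts(instances, attribute, target='Play?'):
        counts = {}
        for inst in instances:
            counts.setdefault(inst[attribute], Counter())[inst[target]] += 1
        return counts

    # branch_counts(weather, 'Windy')
    # -> {'false': Counter({'yes': 6, 'no': 2}), 'true': Counter({'yes': 3, 'no': 3})}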
Attribute Selection
As we may have thousands of attributes and/or values to test, we want to construct small decision trees. Think back to RIPPER's description length ... the smallest decision tree will have the smallest description length. So how can we reduce the number of nodes in the tree?
We want all paths through the tree to be as short as possible. Nodes with one class stop a path, so we want those to appear early in the tree, otherwise they'll occur in multiple branches.
Think back: the first rule we generated was outlook=overcast, because it was pure.
Attribute Selection: Purity
'Purity' count:
Select the attribute that has the most 'pure' nodes, randomising between equal counts.
Still mediocre: most data sets won't have pure nodes for several levels. We need a measure of the purity instead of the simple count.
Outlook
  sunny:    2 yes, 3 no
  overcast: 4 yes
  rainy:    3 yes, 2 no
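A hedged sketch of this heuristic (my own naming, reusing branch_counts from above): count how many branches of each attribute are pure, and take the attribute with the most, breaking ties at random as the slide notes.

    # Purity count heuristic (illustrative): prefer the attribute with the most pure branches.
    def purity_count(instances, attribute, target='Play?'):
        return sum(1 for c in branch_counts(instances, attribute, target).values()
                   if len(c) == 1)              # a branch is 'pure' if only one class appears

    # best = max(attributes, key=lambda a: purity_count(weather, a))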
Attribute Selection: Entropy
For each test:
Maximal purity: all values are the same.
Minimal purity: equal numbers of each value.
Find a scale between maximal and minimal, and then merge across all of the attribute tests.
One function that calculates this is the Entropy function:
entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
p1 ... pn are the number of instances of each class, expressed as a fraction of the total number of instances at that point in the tree. log is base 2.
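This translates directly into a short function (an illustrative sketch; taking raw class counts rather than pre-computed fractions is my choice, not something the slides specify):

    # Entropy of a node, given the count of instances in each class (illustrative).
    from math import log2

    def entropy(*counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    # entropy(9, 5) -> 0.940   (the whole data set: 9 yes, 5 no)
    # entropy(2, 3) -> 0.971   (outlook=sunny: 2 yes, 3 no)

Skipping zero counts in the sum also handles the 0 * log(0) case discussed on the next slide.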
Attribute Selection: Entropy
entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
This is to calculate one test. For outlook there are three tests:
sunny: info(2,3) = -2/5 log(2/5) - 3/5 log(3/5) = 0.5287 + 0.4421 = 0.971
overcast: info(4,0) = -(4/4 * log(4/4)) - (0 * log(0))
Oh-oh! log(0) is undefined. But note that we're multiplying it by 0, so whatever it is, the final result will be 0.
Attribute Selection: Entropy
sunny: info(2,3) = 0.971
overcast: info(4,0) = 0.0
rainy: info(3,2) = 0.971
But we have 14 instances to divide down those paths...
So the total for outlook is: (5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693
Now to calculate the gain, we work out the entropy for the top node and subtract the entropy for outlook:
info(9,5) = 0.940
gain(outlook) = 0.940 - 0.693 = 0.247
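Continuing the illustrative sketch (reusing the entropy and branch_counts helpers defined above, with my own names), the weighted total and the gain look like this:

    # Weighted entropy after a split, and the resulting information gain (illustrative).
    from collections import Counter

    def weighted_entropy(instances, attribute, target='Play?'):
        total = len(instances)
        return sum(sum(c.values()) / total * entropy(*c.values())
                   for c in branch_counts(instances, attribute, target).values())

    def gain(instances, attribute, target='Play?'):
        before = Counter(inst[target] for inst in instances)
        return entropy(*before.values()) - weighted_entropy(instances, attribute, target)

    # weighted_entropy(weather, 'Outlook') -> 0.693
    # gain(weather, 'Outlook')             -> 0.940 - 0.693 = 0.247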
Attribute Selection: Entropy
Now to calculate the gain for all of the attributes:
gain(outlook) = 0.247
gain(humidity) = 0.152
gain(windy) = 0.048
gain(temperature) = 0.029
And select the maximum ... which is outlook. This is (also!) called information gain. The total is the information, measured in 'bits'.
Equally we could select the minimum amount of information needed -- the minimum description length issue in RIPPER.
Let's do the next level, where outlook=sunny.
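Selecting the attribute is then just an argmax over the gains. A one-line sketch (illustrative, using the helpers above) that could replace the random choice in the earlier build_tree gives ID3-style growth:

    # Choose the split attribute by maximum information gain (illustrative, ID3-style).
    def best_attribute(instances, attributes, target='Play?'):
        return max(attributes, key=lambda a: gain(instances, a, target))

    # best_attribute(weather, ['Outlook', 'Temperature', 'Humidity', 'Windy']) -> 'Outlook'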
Attribute Selection: Entropy
Now to calculate the gain for all of the attributes within the outlook=sunny subset (the five instances below):
Temp: hot info(0,2), mild info(1,1), cool info(1,0)
Humidity: high info(0,3), normal info(2,0)
Windy: false info(1,2), true info(1,1)
We don't even need to do the math. Humidity is the obvious choice, as it predicts all 5 instances correctly. Thus the information will be 0, and the gain will be maximal.
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes
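Applying the same illustrative helpers to just these five outlook=sunny rows confirms the choice (values computed from the table above; 'weather' is the assumed list-of-dicts table from the earlier sketches):

    # Recursing into the outlook=sunny branch (illustrative).
    sunny = [inst for inst in weather if inst['Outlook'] == 'sunny']
    # gain(sunny, 'Humidity')    -> 0.971  (maximal: all the remaining entropy)
    # gain(sunny, 'Temperature') -> 0.571
    # gain(sunny, 'Windy')       -> 0.020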
Attribute Selection: Entropy
Now our tree looks like:
Outlook
  sunny -> Humidity
    normal: yes
    high:   no
  overcast: yes
  rainy: ?
This algorithm is called ID3, developed by Quinlan.
Entropy: Issues
Nasty side effect of Entropy: it prefers attributes with a large number of branches.
E.g., if there was an 'identifier' attribute with a unique value for each instance, this would uniquely determine the class, but be useless for classification. (Over-fitting!)
E.g.: info(0,1) info(0,1) info(1,0) ...
It doesn't need to be unique. If we assign 1 to the first two instances, 2 to the second two, and so forth, we still get a 'better' split.
Entropy: Issues
Half-identifier 'attribute':
info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
= 0 0 0.5 0.5 0 0 0.5
2/14 of the instances go down each route, so:
= 0*2/14 + 0*2/14 + 0.5*2/14 + 0.5*2/14 + ...
= 3 * (2/14 * 0.5)
= 3/14
= 0.214
Gain is:
0.940 - 0.214 = 0.726
Remember that the gain for Outlook was only 0.247!
Urgh. Once more we run into over-fitting.
Gain Ratio
Solution: use a gain ratio. Calculate the entropy, disregarding classes, for all of the daughter nodes:
e.g. info(2,2,2,2,2,2,2) for half-identifier
and info(5,4,5) for outlook
identifier = -1/14 * log(1/14) * 14 = 3.807
half-identifier = -1/7 * log(1/7) * 7 = 2.807
outlook = 1.577
Ratios:
identifier = 0.940 / 3.807 = 0.247
half-identifier = 0.726 / 2.807 = 0.259
outlook = 0.247 / 1.577 = 0.157
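As a hedged sketch (my own naming, reusing the helpers from the earlier sketches): the denominator is the entropy of the branch sizes ignoring classes, and the ratio divides the gain by it.

    # Gain ratio (illustrative): gain divided by the intrinsic information of the split.
    from collections import Counter

    def intrinsic_info(instances, attribute):
        sizes = Counter(inst[attribute] for inst in instances).values()
        return entropy(*sizes)                  # e.g. entropy(5, 4, 5) = 1.577 for outlook

    def gain_ratio(instances, attribute, target='Play?'):
        return gain(instances, attribute, target) / intrinsic_info(instances, attribute)

    # gain_ratio(weather, 'Outlook') -> 0.247 / 1.577 = 0.157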
Gain Ratio
Close to success: it picks half-identifier (only accurate in 4/7 branches) over identifier (accurate in all 14 branches)!
half-identifier = 0.259
identifier = 0.247
outlook = 0.157
humidity = 0.152
windy = 0.049
temperature = 0.019
Humidity is now also very close to outlook, whereas before they were well separated.
Gain Ratio
We can simply check for identifier-like attributes and ignore them. Actually, they should be removed from the data before the data mining begins.
However, the ratio can also over-compensate. It might pick an attribute just because its intrinsic information (the denominator) is low. Note how close humidity and outlook became... maybe that's not such a good thing?
Possible fix: first generate the information gain. Throw away any attributes with less than the average gain. Then compare the remaining attributes using the ratio.
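That fix can be sketched as follows (illustrative only, reusing the gain and gain_ratio helpers above): keep only attributes with at least average gain, then pick the best ratio among the survivors.

    # Possible fix from the slide (illustrative): filter by average gain, then use the ratio.
    def best_by_ratio(instances, attributes, target='Play?'):
        gains = {a: gain(instances, a, target) for a in attributes}
        average = sum(gains.values()) / len(gains)
        candidates = [a for a in attributes if gains[a] >= average]
        return max(candidates, key=lambda a: gain_ratio(instances, a, target))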
Alternative: Gini
An alternative method to Information Gain is called the Gini Index.
The total for node D is:
gini(D) = 1 - sum(p1^2, p2^2, ..., pn^2)
where p1 .. pn are the frequency ratios of class 1 .. n in D.
So the Gini Index for the entire set:
= 1 - ((9/14)^2 + (5/14)^2)
= 1 - (0.413 + 0.127)
= 0.459
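The index is a one-liner in the same illustrative style as the entropy helper above (taking class counts; the naming is mine):

    # Gini index of a node, given the count of instances in each class (illustrative).
    def gini(*counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    # gini(9, 5) -> 0.459   (the whole data set: 9 yes, 5 no)
    # gini(2, 3) -> 0.480   (outlook=sunny: 2 yes, 3 no)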
Gini
The gini value of a split of D into subsets is:
split(D) = N1/N gini(D1) + N2/N gini(D2) + ... + Nn/N gini(Dn)
where Ni is the size of subset Di, and N is the size of D.
E.g. Outlook splits into 5, 4, 5:
split = 5/14 gini(sunny) + 4/14 gini(overcast) + 5/14 gini(rainy)
sunny = 1 - sum((2/5)^2, (3/5)^2) = 1 - 0.52 = 0.48
overcast = 1 - sum((4/4)^2, (0/4)^2) = 0.0
rainy = sunny
split = (5/14 * 0.48) * 2
= 0.343
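A matching sketch for the split value (illustrative, reusing the gini and branch_counts helpers from the earlier sketches):

    # Weighted Gini value of splitting on an attribute (illustrative).
    def gini_split(instances, attribute, target='Play?'):
        total = len(instances)
        return sum(sum(c.values()) / total * gini(*c.values())
                   for c in branch_counts(instances, attribute, target).values())

    # gini_split(weather, 'Outlook') -> 5/14*0.48 + 4/14*0.0 + 5/14*0.48 = 0.343
    # The attribute with the smallest gini_split value is chosen, as the next slide notes.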
Gini
The attribute that generates the smallest gini split value is chosen to split the node on. (Left as an exercise for you to do!)
Gini is used in CART (Classification and Regression Trees), IBM's IntelligentMiner system, and SPRINT (Scalable PaRallelizable INduction of decision Trees). It comes from an Italian statistician who used it to measure income inequality.
Decision Tree Issues
The various problems that a good DT builder needs to address:
Ordering of Attribute Splits
As seen, we need to build the tree picking the best attribute to split on first.
Numeric/Missing Data
Dividing numeric data is more complicated. How?
Tree Structure
A balanced tree with the fewest levels is preferable.
Stopping Criteria
Like with rules, we need to stop adding nodes at some point. When?
Pruning
It may be beneficial to prune the tree once created? Or incrementally?
Further Reading
Introductory statistical text books
Witten, 3.2, 4.3
Dunham, 4.4
Han, 6.3
Berry and Browne, Chapter 4
Berry and Linoff, Chapter 6