comp527-08
TRANSCRIPT
Dr Robert Sanderson
Dept. of Computer Science
University of Liverpool
2008
COMP527: Data Mining
Classification: Trees
January 18, 2008
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Trees
Tree Learning Algorithm
Attribute Splitting Decisions
Random
'Purity Count'
Entropy (aka ID3)
Information Gain Ratio
Trees
Anything can be made better by storing it in a tree structure! (Not really!)
Instead of having lists or sets of rules, why not have a tree of rules? Then there's no problem with order, or repeating the same test over and over again in different conjunctive rules.
So each node in the tree is an attribute test, and the branches from that node are the different outcomes.
Instead of 'separate and conquer', Decision Trees are the more typical 'divide and conquer' approach. Once the tree is built, new instances can be tested by simply stepping through each test.
Example Data Again
Here's our example data again:
How to construct a tree from it, instead of rules?
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
Tree Learning Algorithm
Trivial Tree Learner:
create empty tree T
select attribute A
create branches in T for each value v of A
for each branch,
recurse with instances where A=v
add the resulting subtree as the branch's node
The most interesting part of this algorithm is line 2, the attribute selection. Let's start with a Random selection, then look at how it might be improved.
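The slides give only pseudocode; the following is a minimal Python sketch of this trivial learner (my own illustration, not code from the lecture: the data format, helper names, and the 'Play?' target are assumptions). Attribute selection here is random, matching the example that follows.

    # Illustrative sketch of the trivial tree learner above. Assumed data format:
    # a list of dicts such as {'Outlook': 'sunny', ..., 'Play?': 'no'}.
    import random

    def build_tree(instances, attributes, target='Play?'):
        classes = [inst[target] for inst in instances]
        if len(set(classes)) == 1:              # pure node: stop with that class
            return classes[0]
        if not attributes:                      # no tests left: predict the majority class
            return max(set(classes), key=classes.count)
        A = random.choice(attributes)           # 'line 2': random attribute selection
        tree = {A: {}}
        for v in set(inst[A] for inst in instances):
            subset = [inst for inst in instances if inst[A] == v]
            rest = [a for a in attributes if a != A]
            tree[A][v] = build_tree(subset, rest, target)   # recurse; add subtree as branch
        return tree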
Random method: Let's pick 'windy'
Need to split again, looking at only the 8 and 6 instances respectively.
For windy=false, we'll randomly select outlook:
sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes
As all instances of overcast and rainy are yes, they stop, sunny continues.
Windy
  false: 6 yes, 2 no
  true:  3 yes, 3 no
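A small helper in the same illustrative style (assumed names; 'weather' stands for the data table loaded as in the sketch above) tallies the class counts down each branch of a candidate split, which is what the figures above show for 'windy':

    # Tally the class distribution for each value of an attribute (illustrative).
    from collections import Counter

    def branch_counts(instances, attribute, target='Play?'):
        counts = {}
        for inst in instances:
            counts.setdefault(inst[attribute], Counter())[inst[target]] += 1
        return counts

    # branch_counts(weather, 'Windy')
    # -> {'false': Counter({'yes': 6, 'no': 2}), 'true': Counter({'yes': 3, 'no': 3})}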
Attribute Selection
As we may have thousands of attributes and/or values to test, we want to construct small decision trees. Think back to RIPPER's description length ... the smallest decision tree will have the smallest description length. So how can we reduce the number of nodes in the tree?
We want all paths through the tree to be as short as possible. Nodes with one class stop a path, so we want those to appear early in the tree, otherwise they'll occur in multiple branches.
Think back: the first rule we generated was outlook=overcast, because it was pure.
Attribute Selection: Purity
'Purity' count:
Select the attribute that has the most 'pure' nodes, randomising between equal counts.
Still mediocre: most data sets won't have pure nodes for several levels. We need a measure of the purity instead of the simple count.
Outlook
  sunny:    2 yes, 3 no
  overcast: 4 yes
  rainy:    3 yes, 2 no
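A hedged sketch of this heuristic (my own naming, reusing branch_counts from above): count how many branches of each attribute are pure, and take the attribute with the most, breaking ties at random as the slide notes.

    # Purity count heuristic (illustrative): prefer the attribute with the most pure branches.
    def purity_count(instances, attribute, target='Play?'):
        return sum(1 for c in branch_counts(instances, attribute, target).values()
                   if len(c) == 1)              # a branch is 'pure' if only one class appears

    # best = max(attributes, key=lambda a: purity_count(weather, a))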
Attribute Selection: Entropy
For each test:
Maximal purity: all values are the same.
Minimal purity: equal numbers of each value.
Find a scale between maximal and minimal, and then merge across all of the attribute tests.
One function that calculates this is the Entropy function:
entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
p1 ... pn are the number of instances of each class, expressed as a fraction of the total number of instances at that point in the tree. log is base 2.
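This translates directly into a short function (an illustrative sketch; taking raw class counts rather than pre-computed fractions is my choice, not something the slides specify):

    # Entropy of a node, given the count of instances in each class (illustrative).
    from math import log2

    def entropy(*counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    # entropy(9, 5) -> 0.940   (the whole data set: 9 yes, 5 no)
    # entropy(2, 3) -> 0.971   (outlook=sunny: 2 yes, 3 no)

Skipping zero counts in the sum also handles the 0 * log(0) case discussed on the next slide.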
Attribute Selection: Entropy
entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
This is to calculate one test. For outlook there are three tests:
sunny: info(2,3) = -2/5 log(2/5) - 3/5 log(3/5) = 0.5287 + 0.4421 = 0.971
overcast: info(4,0) = -(4/4 * log(4/4)) - (0 * log(0))
Oh-oh! log(0) is undefined. But note that we're multiplying it by 0, so whatever it is, the final result will be 0.
Attribute Selection: Entropy
sunny: info(2,3) = 0.971
overcast: info(4,0) = 0.0
rainy: info(3,2) = 0.971
But we have 14 instances to divide down those paths...
So the total for outlook is: (5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693
Now to calculate the gain, we work out the entropy for the top node and subtract the entropy for outlook:
info(9,5) = 0.940
gain(outlook) = 0.940 - 0.693 = 0.247
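Continuing the illustrative sketch (reusing the entropy and branch_counts helpers defined above, with my own names), the weighted total and the gain look like this:

    # Weighted entropy after a split, and the resulting information gain (illustrative).
    from collections import Counter

    def weighted_entropy(instances, attribute, target='Play?'):
        total = len(instances)
        return sum(sum(c.values()) / total * entropy(*c.values())
                   for c in branch_counts(instances, attribute, target).values())

    def gain(instances, attribute, target='Play?'):
        before = Counter(inst[target] for inst in instances)
        return entropy(*before.values()) - weighted_entropy(instances, attribute, target)

    # weighted_entropy(weather, 'Outlook') -> 0.693
    # gain(weather, 'Outlook')             -> 0.940 - 0.693 = 0.247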
Attribute Selection: Entropy
Now to calculate the gain for all of the attributes:
gain(outlook) = 0.247
gain(humidity) = 0.152
gain(windy) = 0.048
gain(temperature) = 0.029
And select the maximum ... which is outlook. This is (also!) called information gain. The total is the information, measured in 'bits'.
Equally we could select the minimum amount of information needed -- the minimum description length issue in RIPPER.
Let's do the next level, where outlook=sunny.
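Selecting the attribute is then just an argmax over the gains. A one-line sketch (illustrative, using the helpers above) that could replace the random choice in the earlier build_tree gives ID3-style growth:

    # Choose the split attribute by maximum information gain (illustrative, ID3-style).
    def best_attribute(instances, attributes, target='Play?'):
        return max(attributes, key=lambda a: gain(instances, a, target))

    # best_attribute(weather, ['Outlook', 'Temperature', 'Humidity', 'Windy']) -> 'Outlook'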
Attribute Selection: Entropy
Now to calculate the gain for all of the attributes within the outlook=sunny subset (the five instances below):
Temp: hot info(0,2), mild info(1,1), cool info(1,0)
Humidity: high info(0,3), normal info(2,0)
Windy: false info(1,2), true info(1,1)
We don't even need to do the math. Humidity is the obvious choice, as it predicts all 5 instances correctly. Thus the information will be 0, and the gain will be maximal.
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes
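Applying the same illustrative helpers to just these five outlook=sunny rows confirms the choice (values computed from the table above; 'weather' is the assumed list-of-dicts table from the earlier sketches):

    # Recursing into the outlook=sunny branch (illustrative).
    sunny = [inst for inst in weather if inst['Outlook'] == 'sunny']
    # gain(sunny, 'Humidity')    -> 0.971  (maximal: all the remaining entropy)
    # gain(sunny, 'Temperature') -> 0.571
    # gain(sunny, 'Windy')       -> 0.020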
Attribute Selection: Entropy
Now our tree looks like:
Outlook
  sunny -> Humidity
    normal: yes
    high:   no
  overcast: yes
  rainy: ?
This algorithm is called ID3, developed by Quinlan.
Entropy: Issues
Nasty side effect of Entropy: it prefers attributes with a large number of branches.
E.g., if there was an 'identifier' attribute with a unique value for each instance, this would uniquely determine the class, but be useless for classification. (Over-fitting!)
E.g.: info(0,1) info(0,1) info(1,0) ...
It doesn't need to be unique. If we assign 1 to the first two instances, 2 to the second two, and so forth, we still get a 'better' split.
Entropy: Issues
Half-identifier 'attribute':
info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
= 0 0 0.5 0.5 0 0 0.5
2/14 of the instances go down each route, so:
= 0*2/14 + 0*2/14 + 0.5*2/14 + 0.5*2/14 + ...
= 3 * (2/14 * 0.5)
= 3/14
= 0.214
Gain is:
0.940 - 0.214 = 0.726
Remember that the gain for Outlook was only 0.247!
Urgh. Once more we run into over-fitting.
Gain Ratio
Solution: use a gain ratio. Calculate the entropy, disregarding classes, for all of the daughter nodes:
e.g. info(2,2,2,2,2,2,2) for half-identifier
and info(5,4,5) for outlook
identifier = -1/14 * log(1/14) * 14 = 3.807
half-identifier = -1/7 * log(1/7) * 7 = 2.807
outlook = 1.577
Ratios:
identifier = 0.940 / 3.807 = 0.247
half-identifier = 0.726 / 2.807 = 0.259
outlook = 0.247 / 1.577 = 0.157
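As a hedged sketch (my own naming, reusing the helpers from the earlier sketches): the denominator is the entropy of the branch sizes ignoring classes, and the ratio divides the gain by it.

    # Gain ratio (illustrative): gain divided by the intrinsic information of the split.
    from collections import Counter

    def intrinsic_info(instances, attribute):
        sizes = Counter(inst[attribute] for inst in instances).values()
        return entropy(*sizes)                  # e.g. entropy(5, 4, 5) = 1.577 for outlook

    def gain_ratio(instances, attribute, target='Play?'):
        return gain(instances, attribute, target) / intrinsic_info(instances, attribute)

    # gain_ratio(weather, 'Outlook') -> 0.247 / 1.577 = 0.157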
Gain Ratio
Close to success: it picks half-identifier (only accurate in 4/7 branches) over identifier (accurate in all 14 branches)!
half-identifier = 0.259
identifier = 0.247
outlook = 0.157
humidity = 0.152
windy = 0.049
temperature = 0.019
Humidity is now also very close to outlook, whereas before they were well separated.
Gain Ratio
We can simply check for identifier-like attributes and ignore them. Actually, they should be removed from the data before the data mining begins.
However, the ratio can also over-compensate. It might pick an attribute just because its intrinsic information (the denominator) is low. Note how close humidity and outlook became... maybe that's not such a good thing?
Possible fix: first generate the information gain. Throw away any attributes with less than the average gain. Then compare the remaining attributes using the ratio.
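That fix can be sketched as follows (illustrative only, reusing the gain and gain_ratio helpers above): keep only attributes with at least average gain, then pick the best ratio among the survivors.

    # Possible fix from the slide (illustrative): filter by average gain, then use the ratio.
    def best_by_ratio(instances, attributes, target='Play?'):
        gains = {a: gain(instances, a, target) for a in attributes}
        average = sum(gains.values()) / len(gains)
        candidates = [a for a in attributes if gains[a] >= average]
        return max(candidates, key=lambda a: gain_ratio(instances, a, target))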
Alternative: Gini
An alternative method to Information Gain is called the Gini Index.
The total for node D is:
gini(D) = 1 - sum(p1^2, p2^2, ..., pn^2)
where p1 .. pn are the frequency ratios of class 1 .. n in D.
So the Gini Index for the entire set:
= 1 - ((9/14)^2 + (5/14)^2)
= 1 - (0.413 + 0.127)
= 0.459
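The index is a one-liner in the same illustrative style as the entropy helper above (taking class counts; the naming is mine):

    # Gini index of a node, given the count of instances in each class (illustrative).
    def gini(*counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    # gini(9, 5) -> 0.459   (the whole data set: 9 yes, 5 no)
    # gini(2, 3) -> 0.480   (outlook=sunny: 2 yes, 3 no)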
Gini
The gini value of a split of D into subsets is:
split(D) = N1/N gini(D1) + N2/N gini(D2) + ... + Nn/N gini(Dn)
where Ni is the size of subset Di, and N is the size of D.
E.g. Outlook splits into 5, 4, 5:
split = 5/14 gini(sunny) + 4/14 gini(overcast) + 5/14 gini(rainy)
sunny = 1 - sum((2/5)^2, (3/5)^2) = 1 - 0.52 = 0.48
overcast = 1 - sum((4/4)^2, (0/4)^2) = 0.0
rainy = sunny
split = (5/14 * 0.48) * 2
= 0.343
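A matching sketch for the split value (illustrative, reusing the gini and branch_counts helpers from the earlier sketches):

    # Weighted Gini value of splitting on an attribute (illustrative).
    def gini_split(instances, attribute, target='Play?'):
        total = len(instances)
        return sum(sum(c.values()) / total * gini(*c.values())
                   for c in branch_counts(instances, attribute, target).values())

    # gini_split(weather, 'Outlook') -> 5/14*0.48 + 4/14*0.0 + 5/14*0.48 = 0.343
    # The attribute with the smallest gini_split value is chosen, as the next slide notes.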
Gini
The attribute that generates the smallest gini split value is chosen to split the node on. (Left as an exercise for you to do!)
Gini is used in CART (Classification and Regression Trees), IBM's IntelligentMiner system, and SPRINT (Scalable PaRallelizable INduction of decision Trees). It comes from an Italian statistician who used it to measure income inequality.
Decision Tree Issues
The various problems that a good DT builder needs to address:
Ordering of Attribute Splits
As seen, we need to build the tree picking the best attribute to split on first.
Numeric/Missing Data
Dividing numeric data is more complicated. How?
Tree Structure
A balanced tree with the fewest levels is preferable.
Stopping Criteria
Like with rules, we need to stop adding nodes at some point. When?
Pruning
It may be beneficial to prune the tree once created? Or incrementally?
Further Reading
Introductory statistical text books
Witten, 3.2, 4.3
Dunham, 4.4
Han, 6.3
Berry and Browne, Chapter 4
Berry and Linoff, Chapter 6