Decision Trees
Prof. Carolina Ruiz, Dept. of Computer Science, WPI

DESCRIPTION
Which attribute splits the data more homogeneously? Goal: assign a unique number to each attribute that represents how well it "splits" the dataset according to the target attribute.

TRANSCRIPT
Decision Trees
Prof. Carolina Ruiz, Dept. of Computer Science
WPI
Constructing a decision tree
?
Which attribute to use as the root node? That is, which attribute to check first when making a prediction?
Pick the attribute that brings us closer to a decision. That is, the attribute that splits the data more homogeneously.
Which attribute splits the data more homogeneously?
• credit history (bad / unknown / good): [0,1,3] [2,1,2] [3,1,1]
• low / high: [3,3,2] [2,1,4]
• none / adequate: [3,2,6] [2,1,0]
• income (0-15 / 15-35 / >35): [0,0,4] [0,2,2] [5,1,0]
Each triple counts the instances in that branch whose target value is low, moderate, or high.
Goal: Assign a unique number to each attribute that represents how well it "splits" the dataset according to the target attribute.
For example …
What function f to use? f([0,1,3],[2,1,2],[3,1,1]) = number
Possible f functions:
• Gini Index: a measure of impurity
• Entropy: from information theory
• Misclassification error: the metric used by OneR
[0,1,3] [2,1,2] [3,1,1] (bad / unknown / good)
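The three candidate f functions can be sketched as follows. Each takes the list of class counts for one branch; the function names are mine, not from the lecture:

```python
import math

def gini(counts):
    # Gini index: 1 - sum(p_i^2); 0 for a pure branch
    m = sum(counts)
    return 1 - sum((c / m) ** 2 for c in counts)

def entropy(counts):
    # Information-theoretic entropy: -sum(p_i * log2(p_i)); 0*log2(0) = 0
    m = sum(counts)
    return -sum((c / m) * math.log2(c / m) for c in counts if c > 0)

def misclassification_error(counts):
    # Fraction of instances outside the majority class (used by OneR)
    m = sum(counts)
    return 1 - max(counts) / m

# Example: the "unknown" branch of credit history, [2,1,2]
print(round(gini([2, 1, 2]), 3))                     # 0.64
print(round(entropy([2, 1, 2]), 3))                  # 1.522
print(round(misclassification_error([2, 1, 2]), 3))  # 0.6
```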
Using entropy as the f metric
f([0,1,3],[2,1,2],[3,1,1])
= Entropy([0,1,3],[2,1,2],[3,1,1])
= (4/14)*Entropy([0,1,3]) + (5/14)*Entropy([2,1,2]) + (5/14)*Entropy([3,1,1])
= (4/14)*[ -0 - (1/4)log2(1/4) - (3/4)log2(3/4) ]
+ (5/14)*[ -(2/5)log2(2/5) - (1/5)log2(1/5) - (2/5)log2(2/5) ]
+ (5/14)*[ -(3/5)log2(3/5) - (1/5)log2(1/5) - (1/5)log2(1/5) ]
= 1.265
In general: Entropy([p,q,…,z]) = -(p/m)log2(p/m) - (q/m)log2(q/m) - … - (z/m)log2(z/m),
where m = p+q+…+z
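The weighted-average computation above can be checked numerically. A minimal sketch (the helper names are mine):

```python
import math

def entropy(counts):
    # Entropy([p, q, ..., z]) = -sum((c/m) * log2(c/m)), with m = p+q+...+z
    m = sum(counts)
    if m == 0:
        return 0.0
    return -sum((c / m) * math.log2(c / m) for c in counts if c > 0)

def split_entropy(branches):
    # Weighted average of branch entropies; weights = branch size / total size
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

# Credit-history split from the worked example
print(round(split_entropy([[0, 1, 3], [2, 1, 2], [3, 1, 1]]), 3))  # 1.265
```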
Which attribute splits the data more homogeneously?
• credit history (bad / unknown / good): [0,1,3] [2,1,2] [3,1,1] → entropy 1.265
• low / high: [3,3,2] [2,1,4] → entropy 1.467
• none / adequate: [3,2,6] [2,1,0] → entropy 1.324
• income (0-15 / 15-35 / >35): [0,0,4] [0,2,2] [5,1,0] → entropy 0.564
The attribute with the lowest entropy is chosen: income.
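Scoring all four candidate splits and taking the minimum confirms the choice. A sketch with count vectors taken from the slide (dictionary keys and helper names are mine; the income and credit-history values reproduce the slide's numbers, and income is the minimum either way):

```python
import math

def entropy(counts):
    m = sum(counts)
    if m == 0:
        return 0.0
    return -sum((c / m) * math.log2(c / m) for c in counts if c > 0)

def split_entropy(branches):
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

splits = {
    "credit-history": [[0, 1, 3], [2, 1, 2], [3, 1, 1]],
    "low/high":       [[3, 3, 2], [2, 1, 4]],
    "none/adequate":  [[3, 2, 6], [2, 1, 0]],
    "income":         [[0, 0, 4], [0, 2, 2], [5, 1, 0]],
}
scores = {name: split_entropy(b) for name, b in splits.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))  # income 0.564
```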
Constructing a decision tree
Root: income
• 0-15 → prediction: high
• 15-35 → ?
• >35 → ?
Splitting instances with income = 15-35:
• [0,0,1], [0,1,1], [0,1,0] → entropy: 0.5
• [0,1,0], [0,1,2] → entropy: 0.688
• [0,2,2], [0,0,0] → entropy: 1
The attribute with the lowest entropy (0.5) is chosen. Its pure branches already yield predictions: [0,0,1] → high, [0,1,0] → moderate.
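The same computation applied to the three candidate sub-splits reproduces the entropies above (helper names are mine; the middle value is ≈ 0.689, which the slide rounds to 0.688):

```python
import math

def entropy(counts):
    m = sum(counts)
    if m == 0:
        return 0.0  # empty branch contributes nothing
    return -sum((c / m) * math.log2(c / m) for c in counts if c > 0)

def split_entropy(branches):
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

subsplits = [
    [[0, 0, 1], [0, 1, 1], [0, 1, 0]],  # -> 0.5 (chosen: lowest)
    [[0, 1, 0], [0, 1, 2]],             # -> ~0.689
    [[0, 2, 2], [0, 0, 0]],             # -> 1.0
]
for s in subsplits:
    print(round(split_entropy(s), 3))
```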
Constructing a decision tree
Root: income
• 0-15 → prediction: high
• 15-35 → Credit-history (split further)
• >35 → …
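The full construction simply repeats this choice recursively at every unresolved branch until a node is pure (or no attributes remain). A minimal ID3-style sketch; the function names and the four-row toy dataset are illustrative inventions, not the lecture's data:

```python
import math
from collections import Counter

def entropy_of(labels):
    # Entropy of a list of target labels
    m = len(labels)
    return -sum((n / m) * math.log2(n / m) for n in Counter(labels).values())

def best_attribute(rows, attributes, target):
    # Pick the attribute whose split has the lowest weighted entropy
    def split_entropy(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:  # pure node, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    children = {}
    for row in rows:
        children.setdefault(row[attr], []).append(row)
    rest = [a for a in attributes if a != attr]
    # Internal node: (attribute, {value: subtree})
    return (attr, {v: build_tree(sub, rest, target) for v, sub in children.items()})

# Hypothetical four-row dataset in the spirit of the lecture's credit-risk example
rows = [
    {"income": "0-15", "credit": "bad",  "risk": "high"},
    {"income": "0-15", "credit": "good", "risk": "high"},
    {"income": ">35",  "credit": "bad",  "risk": "moderate"},
    {"income": ">35",  "credit": "good", "risk": "low"},
]
print(build_tree(rows, ["income", "credit"], "risk"))
```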