Decision trees, entropy, information gain, ID3
Relevant Readings: Sections 3.1 through 3.6 in Mitchell
CS495 - Machine Learning, Fall 2009
Decision trees
- Decision trees are commonly used in learning algorithms
- For simplicity, we will restrict to discrete-valued attributes and boolean target functions in this discussion
- In a decision tree, every internal node v considers some attribute a of the given test instance
- Each child of v corresponds to some value that a can take on; v just passes the instance to the appropriate child
- Every leaf node has a label “yes” or “no”, which is used to classify test instances that end up there
- Do example using lakes data
- If someone gave you a decision tree, could you give a boolean function in Disjunctive Normal Form (DNF) that represents it? (a small sketch follows below)
- If someone gave you a boolean function in DNF, could you give a decision tree that represents it?
- A decision tree can be used to represent any boolean function on discrete attributes
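To make the tree-to-DNF direction concrete, here is a small Python sketch. It is not from the slides: the tree shape and the attribute names (Frozen, Size, Stocked) are hypothetical stand-ins for lake-style attributes. Each root-to-“yes”-leaf path becomes one conjunction, and the OR of those conjunctions is a DNF formula equivalent to the tree.

```python
# Hypothetical tree: internal node = (attribute, {value: subtree, ...}), leaf = "yes"/"no".
tree = ("Frozen", {
    "yes": "no",
    "no": ("Size", {
        "small": "yes",
        "large": ("Stocked", {"yes": "yes", "no": "no"}),
    }),
})

def tree_to_dnf(node, path=()):
    """Collect one conjunction (a tuple of attribute=value tests) per root-to-'yes'-leaf path."""
    if node == "yes":
        return [path]
    if node == "no":
        return []
    attribute, children = node
    clauses = []
    for value, child in children.items():
        clauses.extend(tree_to_dnf(child, path + ((attribute, value),)))
    return clauses

for clause in tree_to_dnf(tree):
    print(" AND ".join(f"{a}={v}" for a, v in clause))
# Frozen=no AND Size=small
# Frozen=no AND Size=large AND Stocked=yes
# OR-ing these conjunctions gives a DNF formula equivalent to the tree.
```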
How do we build the tree?
- In the simplest variant, we will start with a root node and recursively add child nodes until we fit the training data perfectly
- The only question is which attribute to use first
- Q: We want to use the attribute that does the “best job” splitting up the training data, but how can this be measured?
- A: We will use entropy and, in particular, information gain
Entropy
- Entropy is a measure of disorder or impurity
- In our case, we will be interested in the entropy of the output values of a set of training instances
- Entropy can be understood as the average number of bits needed to encode an output value
- If the output values are split 50%-50%, then the set is disorderly (impure)
  - The entropy is 1 (an average of 1 bit of information to encode an output value)
- If the output values were all positive (or negative), the set is quite orderly (pure)
  - The entropy is 0 (an average of 0 bits of information to encode an output value)
- If the output values are split 25%-75%, then the entropy turns out to be about 0.811
- Let S be a set, let p be the fraction of positive training examples, and let q be the fraction of negative training examples
- Definition: Entropy(S) = −p log2(p) − q log2(q), with the convention that 0 log2(0) = 0 (a quick numerical check follows below)
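A minimal numerical check of this definition (a sketch, not part of the slides):

```python
import math

def entropy(p):
    """Binary entropy for a positive fraction p and negative fraction q = 1 - p.
    Zero-probability terms are dropped, matching the convention 0 * log2(0) = 0."""
    return sum(x * math.log2(1.0 / x) for x in (p, 1.0 - p) if x > 0)

print(entropy(0.5))   # 1.0      -- 50%-50% split
print(entropy(1.0))   # 0.0      -- all positive (or all negative)
print(entropy(0.25))  # 0.811... -- 25%-75% split
```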
Entropy
- Entropy can be generalized from boolean to discrete-valued target functions
- The idea is the same: entropy is the average number of bits needed to encode an output value
- Let S be a set of instances, and let p_i be the fraction of instances in S with output value i
- Definition: Entropy(S) = −∑_i p_i log2(p_i)
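The same check generalizes directly; the sketch below (again an illustration, not slide material) computes the entropy of an arbitrary list of output labels and replaces the binary-only version sketched earlier.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of output values, with p_i = count_i / n for each value i."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

print(entropy(["a", "a", "b", "c"]))  # 1.5 bits for a 2/4 - 1/4 - 1/4 split
```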
Information gain
- The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets
- Information gain is a measure of this change in entropy
- Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
  Gain(S, A) = Entropy(S) − ∑_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
  (recall that |S| denotes the size of set S)
- If you just count the number of bits of information everywhere, you end up with something equivalent:
  - Total gain in bits = |S| · Gain(S, A) = |S| · Entropy(S) − ∑_{v ∈ Values(A)} |Sv| · Entropy(Sv)
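A small Python sketch of this computation (an illustration, not the book's code; the lake-style attributes and labels below are made up, and `entropy` is the list-based version sketched above):

```python
def information_gain(examples, attribute, target="label"):
    """Gain(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) * Entropy(Sv),
    where each example is a dict and Sv collects the examples with A = v."""
    subsets = {}
    for ex in examples:
        subsets.setdefault(ex[attribute], []).append(ex[target])
    weighted = sum(len(sub) / len(examples) * entropy(sub) for sub in subsets.values())
    return entropy([ex[target] for ex in examples]) - weighted

# Hypothetical training set: does the lake freeze over?
data = [
    {"depth": "shallow", "latitude": "north", "label": "yes"},
    {"depth": "shallow", "latitude": "south", "label": "yes"},
    {"depth": "deep",    "latitude": "north", "label": "no"},
    {"depth": "deep",    "latitude": "south", "label": "no"},
]
print(information_gain(data, "depth"))     # 1.0 -- depth splits the labels perfectly
print(information_gain(data, "latitude"))  # 0.0 -- latitude tells us nothing here
```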
ID3
- The essentials:
  - Start with all training instances associated with the root node
  - Use info gain to choose which attribute to label each node with
  - No root-to-leaf path should contain the same discrete attribute twice
  - Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree
- The border cases:
  - If all positive or all negative training instances remain, label that node “yes” or “no” accordingly
  - If no attributes remain, label with a majority vote of training instances left at that node
  - If no instances remain, label with a majority vote of the parent’s training instances
- The pseudocode:
  - Table 3.1 on page 56 in Mitchell (a compact sketch of the recursion follows below)
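The sketch below is a compact rendering of that recursion in Python. It is an illustration under assumptions, not a transcription of Mitchell's Table 3.1: examples are dicts with a "label" key, `information_gain` is the helper sketched earlier, and branches are created only for attribute values that actually appear among the remaining examples.

```python
from collections import Counter

def majority(labels):
    """Most common output value among the given labels."""
    return Counter(labels).most_common(1)[0][0]

def id3(examples, attributes, parent_examples=None, target="label"):
    """Return a tree: either a leaf label, or (attribute, {value: subtree, ...})."""
    # Border case: no instances remain -> majority vote of the parent's instances.
    if not examples:
        return majority([ex[target] for ex in parent_examples])
    labels = [ex[target] for ex in examples]
    # Border case: all remaining instances agree -> label the node accordingly.
    if len(set(labels)) == 1:
        return labels[0]
    # Border case: no attributes remain -> majority vote of the instances left here.
    if not attributes:
        return majority(labels)
    # Essential step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]  # never reuse an attribute on a path
    children = {}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        children[value] = id3(subset, remaining, examples, target)
    return (best, children)

# With the hypothetical lake data from the previous sketch:
# id3(data, ["depth", "latitude"])  ->  ("depth", {"shallow": "yes", "deep": "no"})
```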
Notes on ID3
- Inductive bias
  - Shorter trees are preferred over longer trees (but size is not necessarily minimized)
  - Note that this is a form of Ockham’s Razor
  - High information gain attributes are placed nearer to the root of the tree
- Positives
  - Robust to errors in the training data
  - Training is reasonably fast
  - Classifying new examples is very fast
- Negatives
  - Difficult to extend to real-valued target functions
  - Need to adapt the algorithm to continuous attributes
  - The version of ID3 discussed above can overfit the data
Avoiding overfitting
- Some possibilities
  - Use a tuning set to decide when to stop growing the tree
  - Use a tuning set to prune the tree after it is completely grown
    - Keep removing nodes as long as doing so does not hurt performance on the tuning set (a pruning sketch follows below)
  - Use the Minimum Description Length principle
    - Track the number of bits used to describe the tree itself when calculating information gain
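The tuning-set pruning idea above is commonly implemented as reduced-error pruning. The sketch below is an illustration under stated assumptions rather than the algorithm from the readings: it presumes a simple `Node` class (not from the slides) with children keyed by attribute value and a majority training label cached at every node, and it greedily replaces internal nodes with majority-label leaves as long as tuning-set accuracy does not drop.

```python
class Node:
    """Minimal decision-tree node: a leaf has `label` set and no children;
    an internal node tests `attribute` and routes on its value."""
    def __init__(self, attribute=None, label=None, majority=None):
        self.attribute = attribute   # attribute tested here (None for a leaf)
        self.label = label           # classification if this is a leaf
        self.majority = majority     # majority training label under this node
        self.children = {}           # attribute value -> child Node

def classify(node, example):
    if node.label is not None:
        return node.label
    child = node.children.get(example[node.attribute])
    # unseen attribute value: fall back to the majority label at this node
    return classify(child, example) if child else node.majority

def accuracy(tree, tuning_set):
    return sum(classify(tree, ex) == ex["label"] for ex in tuning_set) / len(tuning_set)

def internal_nodes(node):
    if node.label is not None:
        return []
    return [node] + [n for c in node.children.values() for n in internal_nodes(c)]

def reduced_error_prune(tree, tuning_set):
    """Greedily turn internal nodes into majority-label leaves as long as
    tuning-set accuracy does not drop."""
    while True:
        best_node, best_acc = None, accuracy(tree, tuning_set)
        for node in internal_nodes(tree):
            saved = (node.attribute, node.children, node.label)
            # tentatively replace this subtree with a majority-label leaf
            node.attribute, node.children, node.label = None, {}, node.majority
            acc = accuracy(tree, tuning_set)
            if acc >= best_acc:
                best_node, best_acc = node, acc
            node.attribute, node.children, node.label = saved
        if best_node is None:
            return tree
        # commit the best pruning step and repeat
        best_node.attribute, best_node.children, best_node.label = None, {}, best_node.majority
```

The same tuning set can also drive the "stop growing early" option in the first bullet; in either case it must be held out from the data used to grow the tree.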