Decision trees, entropy, information gain, ID3
Relevant Readings: Sections 3.1 through 3.6 in Mitchell
CS495 - Machine Learning, Fall 2009
Decision trees
- Decision trees are commonly used in learning algorithms
- For simplicity, we will restrict to discrete-valued attributes and boolean target functions in this discussion
- In a decision tree, every internal node v considers some attribute a of the given test instance
- Each child of v corresponds to some value that a can take on; v just passes the instance to the appropriate child
- Every leaf node has a label “yes” or “no”, which is used to classify test instances that end up there
- Do example using lakes data
- If someone gave you a decision tree, could you give a boolean function in Disjunctive Normal Form (DNF) that represents it? (a small sketch follows below)
- If someone gave you a boolean function in DNF, could you give a decision tree that represents it?
- A decision tree can be used to represent any boolean function on discrete attributes
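To make the tree-to-DNF direction concrete, here is a small Python sketch. It is not from the slides: the tree shape and the attribute names (Frozen, Size, Stocked) are hypothetical stand-ins for lake-style attributes. Each root-to-“yes”-leaf path becomes one conjunction, and the OR of those conjunctions is a DNF formula equivalent to the tree.

```python
# Hypothetical tree: internal node = (attribute, {value: subtree, ...}), leaf = "yes"/"no".
tree = ("Frozen", {
    "yes": "no",
    "no": ("Size", {
        "small": "yes",
        "large": ("Stocked", {"yes": "yes", "no": "no"}),
    }),
})

def tree_to_dnf(node, path=()):
    """Collect one conjunction (a tuple of attribute=value tests) per root-to-'yes'-leaf path."""
    if node == "yes":
        return [path]
    if node == "no":
        return []
    attribute, children = node
    clauses = []
    for value, child in children.items():
        clauses.extend(tree_to_dnf(child, path + ((attribute, value),)))
    return clauses

for clause in tree_to_dnf(tree):
    print(" AND ".join(f"{a}={v}" for a, v in clause))
# Frozen=no AND Size=small
# Frozen=no AND Size=large AND Stocked=yes
# OR-ing these conjunctions gives a DNF formula equivalent to the tree.
```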
How do we build the tree?
- In the simplest variant, we will start with a root node and recursively add child nodes until we fit the training data perfectly
- The only question is which attribute to use first
- Q: We want to use the attribute that does the “best job” splitting up the training data, but how can this be measured?
- A: We will use entropy and, in particular, information gain
Entropy
- Entropy is a measure of disorder or impurity
- In our case, we will be interested in the entropy of the output values of a set of training instances
- Entropy can be understood as the average number of bits needed to encode an output value
- If the output values are split 50%-50%, then the set is disorderly (impure)
  - The entropy is 1 (an average of 1 bit of information to encode an output value)
- If the output values were all positive (or negative), the set is quite orderly (pure)
  - The entropy is 0 (an average of 0 bits of information to encode an output value)
- If the output values are split 25%-75%, then the entropy turns out to be about 0.811
- Let S be a set, let p be the fraction of positive training examples, and let q be the fraction of negative training examples
- Definition: Entropy(S) = −p log2(p) − q log2(q), with the convention that 0 log2(0) = 0 (a quick numerical check follows below)
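A minimal numerical check of this definition (a sketch, not part of the slides):

```python
import math

def entropy(p):
    """Binary entropy for a positive fraction p and negative fraction q = 1 - p.
    Zero-probability terms are dropped, matching the convention 0 * log2(0) = 0."""
    return sum(x * math.log2(1.0 / x) for x in (p, 1.0 - p) if x > 0)

print(entropy(0.5))   # 1.0      -- 50%-50% split
print(entropy(1.0))   # 0.0      -- all positive (or all negative)
print(entropy(0.25))  # 0.811... -- 25%-75% split
```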
Entropy
- Entropy can be generalized from boolean to discrete-valued target functions
- The idea is the same: entropy is the average number of bits needed to encode an output value
- Let S be a set of instances, and let p_i be the fraction of instances in S with output value i
- Definition: Entropy(S) = −∑_i p_i log2(p_i)
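The same check generalizes directly; the sketch below (again an illustration, not slide material) computes the entropy of an arbitrary list of output labels and replaces the binary-only version sketched earlier.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of output values, with p_i = count_i / n for each value i."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

print(entropy(["a", "a", "b", "c"]))  # 1.5 bits for a 2/4 - 1/4 - 1/4 split
```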
Information gain
- The entropy typically changes when we use a node in a decision tree to partition the training instances into smaller subsets
- Information gain is a measure of this change in entropy
- Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A; then
  Gain(S, A) = Entropy(S) − ∑_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
  (recall that |S| denotes the size of set S)
- If you just count the number of bits of information everywhere, you end up with something equivalent:
  - Total gain in bits = |S| · Gain(S, A) = |S| · Entropy(S) − ∑_{v ∈ Values(A)} |Sv| · Entropy(Sv)
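A small Python sketch of this computation (an illustration, not the book's code; the lake-style attributes and labels below are made up, and `entropy` is the list-based version sketched above):

```python
def information_gain(examples, attribute, target="label"):
    """Gain(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) * Entropy(Sv),
    where each example is a dict and Sv collects the examples with A = v."""
    subsets = {}
    for ex in examples:
        subsets.setdefault(ex[attribute], []).append(ex[target])
    weighted = sum(len(sub) / len(examples) * entropy(sub) for sub in subsets.values())
    return entropy([ex[target] for ex in examples]) - weighted

# Hypothetical training set: does the lake freeze over?
data = [
    {"depth": "shallow", "latitude": "north", "label": "yes"},
    {"depth": "shallow", "latitude": "south", "label": "yes"},
    {"depth": "deep",    "latitude": "north", "label": "no"},
    {"depth": "deep",    "latitude": "south", "label": "no"},
]
print(information_gain(data, "depth"))     # 1.0 -- depth splits the labels perfectly
print(information_gain(data, "latitude"))  # 0.0 -- latitude tells us nothing here
```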
ID3
- The essentials:
  - Start with all training instances associated with the root node
  - Use info gain to choose which attribute to label each node with
  - No root-to-leaf path should contain the same discrete attribute twice
  - Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree
- The border cases:
  - If all positive or all negative training instances remain, label that node “yes” or “no” accordingly
  - If no attributes remain, label with a majority vote of training instances left at that node
  - If no instances remain, label with a majority vote of the parent’s training instances
- The pseudocode:
  - Table 3.1 on page 56 in Mitchell (a compact sketch of the recursion follows below)
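The sketch below is a compact rendering of that recursion in Python. It is an illustration under assumptions, not a transcription of Mitchell's Table 3.1: examples are dicts with a "label" key, `information_gain` is the helper sketched earlier, and branches are created only for attribute values that actually appear among the remaining examples.

```python
from collections import Counter

def majority(labels):
    """Most common output value among the given labels."""
    return Counter(labels).most_common(1)[0][0]

def id3(examples, attributes, parent_examples=None, target="label"):
    """Return a tree: either a leaf label, or (attribute, {value: subtree, ...})."""
    # Border case: no instances remain -> majority vote of the parent's instances.
    if not examples:
        return majority([ex[target] for ex in parent_examples])
    labels = [ex[target] for ex in examples]
    # Border case: all remaining instances agree -> label the node accordingly.
    if len(set(labels)) == 1:
        return labels[0]
    # Border case: no attributes remain -> majority vote of the instances left here.
    if not attributes:
        return majority(labels)
    # Essential step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]  # never reuse an attribute on a path
    children = {}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        children[value] = id3(subset, remaining, examples, target)
    return (best, children)

# With the hypothetical lake data from the previous sketch:
# id3(data, ["depth", "latitude"])  ->  ("depth", {"shallow": "yes", "deep": "no"})
```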
Notes on ID3
- Inductive bias
  - Shorter trees are preferred over longer trees (but size is not necessarily minimized)
  - Note that this is a form of Ockham’s Razor
  - High information gain attributes are placed nearer to the root of the tree
- Positives
  - Robust to errors in the training data
  - Training is reasonably fast
  - Classifying new examples is very fast
- Negatives
  - Difficult to extend to real-valued target functions
  - Need to adapt the algorithm to continuous attributes
  - The version of ID3 discussed above can overfit the data
Avoiding overfitting
- Some possibilities
  - Use a tuning set to decide when to stop growing the tree
  - Use a tuning set to prune the tree after it is completely grown
    - Keep removing nodes as long as doing so does not hurt performance on the tuning set (a pruning sketch follows below)
  - Use the Minimum Description Length principle
    - Track the number of bits used to describe the tree itself when calculating information gain
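The tuning-set pruning idea above is commonly implemented as reduced-error pruning. The sketch below is an illustration under stated assumptions rather than the algorithm from the readings: it presumes a simple `Node` class (not from the slides) with children keyed by attribute value and a majority training label cached at every node, and it greedily replaces internal nodes with majority-label leaves as long as tuning-set accuracy does not drop.

```python
class Node:
    """Minimal decision-tree node: a leaf has `label` set and no children;
    an internal node tests `attribute` and routes on its value."""
    def __init__(self, attribute=None, label=None, majority=None):
        self.attribute = attribute   # attribute tested here (None for a leaf)
        self.label = label           # classification if this is a leaf
        self.majority = majority     # majority training label under this node
        self.children = {}           # attribute value -> child Node

def classify(node, example):
    if node.label is not None:
        return node.label
    child = node.children.get(example[node.attribute])
    # unseen attribute value: fall back to the majority label at this node
    return classify(child, example) if child else node.majority

def accuracy(tree, tuning_set):
    return sum(classify(tree, ex) == ex["label"] for ex in tuning_set) / len(tuning_set)

def internal_nodes(node):
    if node.label is not None:
        return []
    return [node] + [n for c in node.children.values() for n in internal_nodes(c)]

def reduced_error_prune(tree, tuning_set):
    """Greedily turn internal nodes into majority-label leaves as long as
    tuning-set accuracy does not drop."""
    while True:
        best_node, best_acc = None, accuracy(tree, tuning_set)
        for node in internal_nodes(tree):
            saved = (node.attribute, node.children, node.label)
            # tentatively replace this subtree with a majority-label leaf
            node.attribute, node.children, node.label = None, {}, node.majority
            acc = accuracy(tree, tuning_set)
            if acc >= best_acc:
                best_node, best_acc = node, acc
            node.attribute, node.children, node.label = saved
        if best_node is None:
            return tree
        # commit the best pruning step and repeat
        best_node.attribute, best_node.children, best_node.label = None, {}, best_node.majority
```

The same tuning set can also drive the "stop growing early" option in the first bullet; in either case it must be held out from the data used to grow the tree.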