From decision trees to random forests
Viet-Trung Tran
Decision tree learning
• Supervised learning
• From a set of measurements:
– learn a model
– to predict and understand a phenomenon
Example 1: wine taste preference
• From physicochemical properties (alcohol, acidity, sulphates, etc.)
• Learn a model
• To predict wine taste preference (a score from 0 to 10)
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine preferences by data mining from physicochemical properties, 2009
Observation
• A decision tree can be interpreted as a set of IF...THEN rules
• Can be applied to noisy data
• One of the most popular inductive learning methods
• Gives good results for real-life applications
Decision tree representation
• An inner node represents an attribute
• An edge represents a test on the attribute of the parent node
• A leaf represents one of the classes
• Construction of a decision tree:
– based on the training data
– top-down strategy
Example 2: Sport preference
Example 3: Weather & sport practicing
Classification
• The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node.
• A record enters the tree at the root node.
• At the root, a test is applied to determine which child node the record will encounter next.
• This process is repeated until the record arrives at a leaf node.
• All the records that end up at a given leaf of the tree are classified in the same way.
• There is a unique path from the root to each leaf.
• The path is a rule which is used to classify the records.
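The traversal described above can be sketched in a few lines, assuming a toy representation (all names here are illustrative, not from the slides): an inner node is a nested dict mapping an attribute to `{edge value: subtree}`, and a leaf is simply a class label.

```python
# A hypothetical fragment of the golf example as a nested-dict tree.
tree = {
    "outlook": {
        "sunny":    {"humidity": {"high": "no play", "normal": "play"}},
        "overcast": "play",
        "rain":     {"windy": {True: "no play", False: "play"}},
    }
}

def classify(tree, record):
    """Walk from the root to a leaf, applying one attribute test per node."""
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))            # splitting attribute at this node
        node = node[attribute][record[attribute]]
    return node                                 # a leaf: the class label

print(classify(tree, {"outlook": "rain", "windy": True}))   # no play
```

The unique root-to-leaf path taken by the record is exactly the rule that classifies it.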
• The data set has five attributes.
• There is a special attribute: the attribute class is the class label.
• The attributes temp (temperature) and humidity are numerical attributes.
• The other attributes are categorical, that is, they cannot be ordered.
• Based on the training data set, we want to find a set of rules to know what values of outlook, temperature, humidity and wind determine whether or not to play golf.
• RULE 1: If it is sunny and the humidity is not above 75%, then play.
• RULE 2: If it is sunny and the humidity is above 75%, then do not play.
• RULE 3: If it is overcast, then play.
• RULE 4: If it is rainy and not windy, then play.
• RULE 5: If it is rainy and windy, then do not play.
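The five rules translate directly into plain IF...THEN code; a minimal sketch (the function name and label strings are illustrative):

```python
def play_golf(outlook, humidity, windy):
    """Encode RULES 1-5 as literal IF...THEN logic."""
    if outlook == "sunny":
        return "play" if humidity <= 75 else "no play"   # rules 1 and 2
    if outlook == "overcast":
        return "play"                                    # rule 3
    if outlook == "rain":
        return "no play" if windy else "play"            # rules 4 and 5

print(play_golf("sunny", 70, False))   # play
```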
Splitting attribute
• At every node there is an attribute associated with the node, called the splitting attribute
• Top-down traversal
– In our example, outlook is the splitting attribute at the root.
– Since for the given record outlook = rain, we move to the rightmost child node of the root.
– At this node, the splitting attribute is windy, and we find that for the record we want to classify, windy = true.
– Hence, we move to the left child node and conclude that the class label is "no play".
Decision tree construction
• Identify the splitting attribute and splitting criterion at every level of the tree
• Algorithm – Iterative Dichotomizer (ID3)
Iterative Dichotomizer (ID3)
• Quinlan (1986)
• Each node corresponds to a splitting attribute
• Each edge is a possible value of that attribute
• At each node, the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root
• Entropy is used to measure how informative a node is
Splitting attribute selection
• The algorithm uses the criterion of information gain to determine the goodness of a split.
– The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split on all distinct values of that attribute.
• Example: 2 classes C1, C2; pick A1 or A2?
Entropy – General Case
• Impurity/inhomogeneity measurement
• Suppose X takes n values, V1, V2, ..., Vn, and P(X=V1)=p1, P(X=V2)=p2, ..., P(X=Vn)=pn
• What is the smallest number of bits, on average, per symbol, needed to transmit symbols drawn from the distribution of X? It is
H(X) = −p1 log2 p1 − p2 log2 p2 − ... − pn log2 pn = −∑i=1..n pi log2(pi)
• H(X) is the entropy of X
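The formula above can be checked with a short helper (the function name is illustrative; terms with pi = 0 are conventionally treated as contributing 0):

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i); terms with p_i == 0 contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0  (a fair coin needs 1 bit per symbol)
```

A certain outcome, e.g. `entropy([1.0])`, gives 0: a symbol that is always the same carries no information.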
Example: 2 classes
Information gain
• Gain(S, Wind)?
• Wind = {Weak, Strong}
• S = {9 Yes & 5 No}
• S_weak = {6 Yes & 2 No | Wind = Weak}
• S_strong = {3 Yes & 3 No | Wind = Strong}
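With these counts, the gain is Entropy(S) minus the size-weighted entropies of the two subsets; a quick computation (helper name is illustrative):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a two-class set given its positive/negative counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            h -= p * log2(p)
    return h

# S has 9 Yes / 5 No; Wind splits it into Weak (6/2) and Strong (3/3).
gain = entropy(9, 5) - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
print(round(gain, 3))   # 0.048
```

This matches the Gain(S, Wind) = 0.048 value used when choosing the root split.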
Example: Decision tree learning
• Choose the splitting attribute for the root among {Outlook, Temperature, Humidity, Wind}:
– Gain(S, Outlook) = 0.246
– Gain(S, Temperature) = 0.029
– Gain(S, Humidity) = 0.151
– Gain(S, Wind) = 0.048
• Gain(S_sunny, Temperature) = 0.57
• Gain(S_sunny, Humidity) = 0.97
• Gain(S_sunny, Windy) = 0.019
Over-fitting example
• Consider adding noisy training example #15:
– Sunny, hot, normal, strong, playTennis = No
• What effect on the earlier tree?
Over-fitting
Avoid over-fitting
• Stop growing when the data split is not statistically significant
• Grow the full tree, then post-prune
• How to select the best tree:
– measure performance over the training data
– measure performance over a separate validation dataset
– MDL: minimize size(tree) + size(misclassifications(tree))
Reduced-error pruning
• Split data into training and validation sets
• Do until further pruning is harmful:
– evaluate the impact on the validation set of pruning each possible node
– greedily remove the one that most improves validation set accuracy
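The loop above can be sketched end-to-end, assuming a toy representation where an inner node is a nested dict `{attribute: {edge value: subtree}}` and a leaf is a class label; the pruned node is replaced by the majority label of the training records reaching it. All names and the toy data are illustrative:

```python
import copy
from collections import Counter

def classify(tree, record):
    node = tree
    while isinstance(node, dict):
        attr = next(iter(node))
        node = node[attr][record[attr]]
    return node

def accuracy(tree, records):
    return sum(classify(tree, r) == r["label"] for r in records) / len(records)

def inner_paths(node, path=()):
    """Paths (sequences of edge values) to every inner node of the tree."""
    if not isinstance(node, dict):
        return []
    attr = next(iter(node))
    paths = [path]
    for value, child in node[attr].items():
        paths += inner_paths(child, path + (value,))
    return paths

def reaches(tree, path, record):
    """Does this record's root-to-leaf route pass through the node at `path`?"""
    node = tree
    for value in path:
        attr = next(iter(node))
        if record[attr] != value:
            return False
        node = node[attr][value]
    return True

def replaced(tree, path, leaf):
    """A deep copy of the tree with the node at `path` collapsed to a leaf."""
    if not path:
        return leaf
    tree = copy.deepcopy(tree)
    node = tree
    for value in path[:-1]:
        node = node[next(iter(node))][value]
    node[next(iter(node))][path[-1]] = leaf
    return tree

def reduced_error_prune(tree, training, validation):
    while True:
        base, best = accuracy(tree, validation), None
        for path in inner_paths(tree):
            majority = Counter(r["label"] for r in training
                               if reaches(tree, path, r)).most_common(1)[0][0]
            candidate = replaced(tree, path, majority)
            acc = accuracy(candidate, validation)
            if acc > base and (best is None or acc > best[1]):
                best = (candidate, acc)
        if best is None:            # further pruning is harmful: stop
            return tree
        tree = best[0]              # greedily keep the best pruned tree

# Toy data: one noisy "sunny & windy" row makes the tree grow a windy split.
train = [
    {"outlook": "sunny", "windy": True,  "label": "no"},    # noisy example
    {"outlook": "sunny", "windy": False, "label": "yes"},
    {"outlook": "sunny", "windy": False, "label": "yes"},
    {"outlook": "rain",  "windy": True,  "label": "no"},
]
validation = [
    {"outlook": "sunny", "windy": True,  "label": "yes"},
    {"outlook": "sunny", "windy": False, "label": "yes"},
    {"outlook": "rain",  "windy": False, "label": "no"},
]
tree = {"outlook": {"sunny": {"windy": {True: "no", False: "yes"}},
                    "rain": "no"}}
pruned = reduced_error_prune(tree, train, validation)
print(pruned)   # the noisy windy split under "sunny" collapses to the leaf "yes"
```

The sketch prunes only while validation accuracy strictly improves; production implementations often also accept ties, preferring the smaller tree.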
Rule post-pruning
• Convert the tree to an equivalent set of rules
• Prune each rule independently of the others
• Sort the final rules into the desired sequence for use
Issues in Decision Tree Learning
• How deep to grow?
• How to handle continuous attributes?
• How to choose an appropriate attribute selection measure?
• How to handle data with missing attribute values?
• How to handle attributes with different costs?
• How to improve computational efficiency?
• ID3 has been extended to handle most of these; the resulting system is C4.5 (http://cis-linux1.temple.edu/~ingargio/cis587/readings/id3-c45.html)
Decision tree – When?
References
• Data mining, Nhat-Quang Nguyen, HUST
• http://www.cs.cmu.edu/~awm/10701/slides/DTreesAndOverfitting-9-13-05.pdf
RANDOM FORESTS
Credits: Michal Malohlava @Oxdata
Motivation
• Training sample of points covering the area [0,3] x [0,3]
• Two possible colors of points
• The model should be able to predict a color of a new point
Decision tree
How to grow a decision tree
• Split the rows in a given node into two sets with respect to an impurity measure
– the smaller the impurity, the more skewed the distribution
– compare the impurity of the parent with the impurity of the children
When to stop growing the tree
• Build the full tree, or
• Apply a stopping criterion – a limit on:
– tree depth, or
– minimum number of points in a leaf
How to assign a leaf value?
• The leaf value is:
– if the leaf contains only one point, its color represents the leaf value
– else the majority color is picked, or the color distribution is stored
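The majority rule can be sketched as follows, assuming each point arriving at the leaf is a `(coordinates, color)` pair (names are illustrative):

```python
from collections import Counter

def leaf_value(points):
    """Majority color among the points that reach this leaf.
    A single point's color is the leaf value; a full color distribution
    could be stored instead of just the winner."""
    colors = [color for _, color in points]
    return Counter(colors).most_common(1)[0][0]

print(leaf_value([((0.5, 1.2), "red")]))                                  # red
print(leaf_value([((1, 1), "red"), ((2, 1), "blue"), ((2, 2), "blue")]))  # blue
```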
Decision tree
• The tree covers the whole area with rectangles, each predicting a point color
Decision tree scoring
• The model can predict a point color based on its coordinates.
Over-fitting
• The tree perfectly represents the training data (0% training error), but it has also learned the noise!
• And hence poorly predicts a new point!
Handle over-fitting
• Pre-pruning via a stopping criterion
• Post-pruning: decreases model complexity and helps with generalization
• Randomize tree building and combine trees together
Randomize #1- Bagging
• Each tree sees only a sample of the training data and captures only a part of the information.
• Build multiple weak trees which vote together to give the resulting prediction
– voting is based on majority vote, or weighted average
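The two ingredients – bootstrap sampling and majority voting – can be sketched as below; the stand-in "trees" are plain functions `record -> label`, purely for illustration:

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Sample len(rows) rows with replacement; each tree trains on one
    such sample and therefore sees only part of the training data."""
    return [rng.choice(rows) for _ in rows]

def bagged_predict(trees, record):
    """Combine the weak trees by majority vote (a weighted average could
    be used instead, e.g. for regression)."""
    votes = Counter(tree(record) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained weak trees.
trees = [lambda r: "red", lambda r: "blue", lambda r: "red"]
print(bagged_predict(trees, {"x": 1.0, "y": 2.0}))   # red
```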
Bagging - boundary
• Bagging averages many trees, and produces smoother decision boundaries.
Randomize #2 - Feature selection
Random forest
Random forest - properties
• A refinement of bagged trees; quite popular
• At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2(p), where p is the number of features
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" error rate.
• Random forest tries to improve on bagging by "de-correlating" the trees. Each tree has the same expectation, so averaging less correlated trees reduces the variance of the ensemble.
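The per-split feature subsampling can be sketched as follows (function name and feature list are illustrative):

```python
import math
import random

def split_candidates(features, rng, rule="sqrt"):
    """At each split, draw a random subset of m features to consider.
    m = sqrt(p) (common classification default) or log2(p),
    where p is the total number of features."""
    p = len(features)
    m = max(1, round(math.sqrt(p)) if rule == "sqrt" else int(math.log2(p)))
    return rng.sample(features, m)

features = ["outlook", "temp", "humidity", "wind",
            "alcohol", "acidity", "sulphates", "pH", "density"]   # p = 9
print(split_candidates(features, random.Random(42)))   # a random subset of 3 names
```

Because each split only sees a random subset of features, strong predictors do not dominate every tree, which is what de-correlates the ensemble.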
Advantages of Random Forest
• Independent trees, which can be built in parallel
• The model does not overfit easily
• Produces reasonable accuracy
• Brings more tools to analyze the data: variable importance, proximities, missing-value imputation
Out-of-bag points and validation
• Each tree is built over a sample of the training points.
• The remaining points are called "out-of-bag" (OOB). These points are used for validation, as a good approximation of the generalization error, almost identical to N-fold cross-validation.
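A small sketch of how the OOB set arises from bootstrap sampling (names are illustrative); in expectation about 1 − (1 − 1/n)^n ≈ 36.8% of the rows are out of bag for any one tree:

```python
import random

def bootstrap_with_oob(n, rng):
    """Draw a bootstrap sample of row indices; rows never drawn
    are the 'out-of-bag' set for this tree."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_with_oob(1000, random.Random(0))
print(len(oob) / 1000)   # fraction of OOB rows, about 1/e ≈ 0.37 in expectation
```

Each tree is then evaluated on its own OOB rows, and the aggregated OOB error serves as the built-in validation estimate.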