![Page 1: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/1.jpg)
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning
Rajeev Rastogi Kyuseok Shim
Presented by: Alon Keinan
![Page 2: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/2.jpg)
Presentation layout
• Introduction: Classification and Decision Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
![Page 3: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/3.jpg)
Introduction: Classification
• Classification in data mining:
– Training sample set
– Classifying future records
• Techniques: Bayesian classifiers, neural networks (NN), genetic algorithms, decision trees, …
![Page 4: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/4.jpg)
Introduction: Decision Trees
[Figure: decision tree example (label: “training”)]
![Page 6: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/6.jpg)
Decision Tree Building Algorithms
• 2 phases:
– The building phase
– The pruning phase
• The building phase constructs a “perfect” tree that classifies every training record correctly
• The pruning phase prevents “overfitting”
![Page 7: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/7.jpg)
Building Phase Algorithms
• Differ in the selection of the test criterion for partitioning
– CLS
– ID3 & C4.5
– CART, SLIQ & SPRINT
• Differ in their ability to handle large training sets
• All consider “guillotine-cut” only
![Page 8: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/8.jpg)
Pruning Phase Algorithms
• MDL – Minimum Description Length
• Cost-Complexity Pruning
![Page 10: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/10.jpg)
SPRINT
• Initialize root node
• Initialize queue Q to contain root node
• While Q is not empty do
– dequeue the first node N in Q
– if N is not pure
    • for each attribute evaluate splits
    • use least entropy split to split node N into N1 and N2
    • append N1 and N2 to Q
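The building loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SPRINT itself: it assumes numeric attributes only, binary splits of the form `value <= threshold`, and in-memory lists, and it leaves out the scalable data structures that SPRINT uses for large training sets. The names `entropy`, `best_split` and `build_tree` are hypothetical.

```python
import math
from collections import Counter, deque


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted entropy, attribute index, threshold) of the
    least-entropy binary split, or None if no split separates the rows."""
    n, best = len(rows), None
    for attr in range(len(rows[0])):
        # Dropping the largest value keeps both sides of every split non-empty.
        for threshold in sorted({r[attr] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[attr] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[attr] > threshold]
            cost = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if best is None or cost < best[0]:
                best = (cost, attr, threshold)
    return best


def build_tree(rows, labels):
    """Breadth-first construction: dequeue a node, split it if it is impure."""
    root = {"rows": rows, "labels": labels}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(set(node["labels"])) == 1:
            continue  # pure node: stays a leaf
        split = best_split(node["rows"], node["labels"])
        if split is None:
            continue  # identical records with mixed labels: cannot split
        _, attr, threshold = split
        node["attr"], node["threshold"] = attr, threshold
        children = []
        for keep_left in (True, False):
            idx = [i for i, r in enumerate(node["rows"])
                   if (r[attr] <= threshold) == keep_left]
            child = {"rows": [node["rows"][i] for i in idx],
                     "labels": [node["labels"][i] for i in idx]}
            children.append(child)
            queue.append(child)  # N1 and N2 are appended to Q
        node["left"], node["right"] = children
    return root


tree = build_tree([(1,), (2,), (3,), (4,)], ["a", "a", "b", "b"])
print(tree["attr"], tree["threshold"])
```

In the toy call at the end, the only attribute is split between 2 and 3, which separates the two classes, so both children are pure and the loop stops.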
![Page 11: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/11.jpg)
Entropy
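The formula on this slide survives only as an image. As a hedged reminder, the standard definitions are given here; the notation ($p_j$ for the relative frequency of class $j$ in $S$, and $n_1$, $n_2$ for the sizes of the two partitions of an $n$-record set) is an assumption and may differ from the slide's.

```latex
E(S) = -\sum_{j} p_j \log_2 p_j
\qquad
E(S_1, S_2) = \frac{n_1}{n}\, E(S_1) + \frac{n_2}{n}\, E(S_2)
```

The split with the smallest $E(S_1, S_2)$ is the "least entropy split" used in the building loop.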
![Page 12: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/12.jpg)
MDL
• The best tree is the one that can be encoded using the fewest bits
• Cost of encoding data records: C(S) (formula on the slide image)
• Cost of encoding tree:
– The structure of the tree
– The splits
– The classes in the leaves
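The cost formulas themselves did not survive extraction. The sketch below records them as recalled from the Rastogi and Shim paper, so treat every term as an assumption to check against the original: $n$ is the number of records in $S$, $n_i$ the number of records of class $i$, $k$ the number of classes, $a$ the number of attributes, and $v$ the number of distinct values of the splitting attribute.

```latex
% Cost of encoding the classes of the records in S (recalled, not verified)
C(S) = \sum_{i} n_i \log\frac{n}{n_i}
     + \frac{k-1}{2}\,\log\frac{n}{2}
     + \log\frac{\pi^{k/2}}{\Gamma(k/2)}

% Tree structure: 1 bit per node (leaf or internal)

% Cost of encoding a split
C_{split}(N) = \log a +
  \begin{cases}
    \log(v-1)   & \text{numeric attribute} \\
    \log(2^v-2) & \text{categorical attribute}
  \end{cases}
```

With these pieces, a leaf costs $C(S) + 1$ bits and an internal node costs $C_{split}(N) + 1$ bits plus the cost of its two subtrees.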
![Page 13: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/13.jpg)
Pruning algorithm
• computeCost&Prune(Node N)
– If N is a leaf return (C(S)+1)
– minCostLeft := computeCost&Prune(Nleft)
– minCostRight := computeCost&Prune(Nright)
– minCost := min{C(S)+1, Csplit(N)+1+minCostLeft+minCostRight}
– If minCost = C(S)+1
    • Prune child nodes Nleft and Nright
– return minCost
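A direct Python transcription of the procedure may help. It is a sketch that assumes the two costs C(S) and Csplit(N) are already computed and stored on each node; the `Node` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    leaf_cost: float                 # C(S): assumed precomputed for the node's records
    split_cost: float = 0.0          # Csplit(N): assumed precomputed for the node's split
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def compute_cost_and_prune(node: Node) -> float:
    """Return the minimum encoding cost of the subtree rooted at `node`,
    pruning children wherever a single leaf is at least as cheap."""
    if node.left is None:                       # N is a leaf
        return node.leaf_cost + 1
    min_cost_left = compute_cost_and_prune(node.left)
    min_cost_right = compute_cost_and_prune(node.right)
    as_leaf = node.leaf_cost + 1
    as_subtree = node.split_cost + 1 + min_cost_left + min_cost_right
    min_cost = min(as_leaf, as_subtree)
    if min_cost == as_leaf:                     # the split does not pay for itself
        node.left = node.right = None
    return min_cost


kept = Node(leaf_cost=10, split_cost=2, left=Node(1), right=Node(1))
pruned = Node(leaf_cost=3, split_cost=2, left=Node(1), right=Node(1))
print(compute_cost_and_prune(kept), kept.left is not None)
print(compute_cost_and_prune(pruned), pruned.left is None)
```

In the first tree the subtree costs 2 + 1 + 2 + 2 = 7 bits against 11 bits for a leaf, so the children are kept. In the second the leaf costs only 4 bits, so the children are pruned.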
![Page 15: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/15.jpg)
PUBLIC
• PUBLIC = PrUning and BuiLding Integrated in Classification
• Uses SPRINT for building
• Prunes periodically during building!
• Basically uses MDL for pruning
• Distinguishes three types of leaves:
– “not expandable”
– “pruned”
– “yet to be expanded”
• Produces exactly the same tree as building fully and then pruning
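The effect of the three leaf types on pruning a partially built tree can be sketched as follows. The assumption here is that a “yet to be expanded” leaf contributes only a lower bound on the cost of its eventual subtree (1 in the simplest variant), while the other two leaf types contribute their full leaf cost C(S)+1. The `Node` class, the `kind` field and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    leaf_cost: float                   # C(S), assumed precomputed
    split_cost: float = 0.0            # Csplit(N), assumed precomputed
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    kind: str = "yet to be expanded"   # leaf state; ignored for internal nodes


def public_cost_and_prune(node: Node, lower_bound: float = 1.0) -> float:
    """MDL pruning on a partially built tree. Leaves that may still be
    expanded are charged only `lower_bound`, so nothing is pruned that
    could still turn out to be worth keeping."""
    if node.left is None:
        if node.kind == "yet to be expanded":
            return lower_bound
        return node.leaf_cost + 1      # "pruned" or "not expandable"
    min_cost_left = public_cost_and_prune(node.left, lower_bound)
    min_cost_right = public_cost_and_prune(node.right, lower_bound)
    as_leaf = node.leaf_cost + 1
    as_subtree = node.split_cost + 1 + min_cost_left + min_cost_right
    min_cost = min(as_leaf, as_subtree)
    if min_cost == as_leaf:
        node.left = node.right = None
        node.kind = "pruned"           # never expanded again
    return min_cost


undecided = Node(20, 3, left=Node(8), right=Node(8))
hopeless = Node(2, 3, left=Node(8), right=Node(8))
print(public_cost_and_prune(undecided), undecided.left is not None)
print(public_cost_and_prune(hopeless), hopeless.kind)
```

For `undecided`, charging the children their full leaf cost would give 3 + 1 + 9 + 9 = 22 bits against 21 for a leaf and would prune too early; with the lower bound the subtree costs at most 6 bits and is kept. For `hopeless`, even the optimistic 6 bits exceed the leaf cost of 3, so the children can be discarded before any work is spent expanding them. A full implementation would also drop the pruned children from the building queue.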
![Page 16: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/16.jpg)
Lower Bound Computation
• PUBLIC(1) – Bound=1
• PUBLIC(S) – Incorporating split costs
• PUBLIC(V) – Incorporating split values
![Page 17: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/17.jpg)
PUBLIC(S)
• Calculates a lower bound for each number of splits s = 0, …, k−1
– For s = 0: C(S)+1
– For s > 0: (formula on the slide image)
• Takes the minimum of the bounds
• Computes the bounds by iterative addition
• Time complexity: O(k log k)
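The bound for s > 0 is missing from the transcript. The sketch below uses the form recalled from the paper, which should be checked against the original: a subtree with s splits costs at least 2s + 1 + s·log(a) + (n_{s+2} + … + n_k), where k is the number of classes at the node, a the number of attributes, and n_1 ≥ n_2 ≥ … ≥ n_k the class counts in decreasing order. The reasoning is that such a subtree has 2s + 1 nodes at 1 bit each, s splits at log(a) bits or more each, and only s + 1 leaves, so the records of the remaining classes cost at least 1 bit each. The function name and its parameters are hypothetical.

```python
import math


def public_s_lower_bound(class_counts, num_attributes, leaf_cost):
    """Lower bound on the cost of any subtree rooted at a node.

    class_counts   -- records per class at the node (n_1..n_k, any order)
    num_attributes -- a, the number of attributes
    leaf_cost      -- C(S) + 1, the s = 0 case, assumed precomputed
    """
    counts = sorted(class_counts, reverse=True)   # the O(k log k) step
    k = len(counts)
    per_split = 2.0 + math.log2(num_attributes)   # two extra nodes plus log(a)
    best = leaf_cost
    tail = float(sum(counts[1:]))                 # records outside the largest class
    for s in range(1, k):
        tail -= counts[s]                         # tail is now n_{s+2} + ... + n_k
        best = min(best, s * per_split + 1.0 + tail)
    return best


print(public_s_lower_bound([50, 30, 20], num_attributes=4, leaf_cost=100.0))
```

With class counts 50, 30, 20 and a = 4, one split gives 2 + 1 + 2 + 20 = 25 and two splits give 4 + 1 + 4 + 0 = 9, so the bound is 9. Each step reuses the previous tail sum, which is the "iterative addition" mentioned above.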
![Page 18: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/18.jpg)
PUBLIC(V)
• PUBLIC(S) estimates the cost of each split as log(a)
• PUBLIC(V) estimates the cost of each split as log(a) plus the cost of encoding the splitting value(s)
• Complexity: O(k·(log k + a))
![Page 19: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/19.jpg)
Lower Bound Computation Summary
| Algorithm | Lower bound | Complexity |
|---|---|---|
| PUBLIC(1) | Fixed: 1 | O(1) |
| PUBLIC(S) | Incorporating split costs | O(k log k) |
| PUBLIC(V) | Incorporating split value costs | O(k·(log k + a)) |
![Page 21: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/21.jpg)
Performance Comparisons
• Algorithms:
– SPRINT
– PUBLIC(1)
– PUBLIC(S)
– PUBLIC(V)
• Data sets:
– Real-life
– Synthetic
![Page 22: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/22.jpg)
Real-life Data Sets
[Figure: chart comparing SPRINT, PUBLIC(1), PUBLIC(S) and PUBLIC(V) on the real-life data sets; axis scale 0 to 5000]
![Page 23: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/23.jpg)
Synthetic Data Sets
[Figure: chart comparing SPRINT, PUBLIC(1), PUBLIC(S) and PUBLIC(V) on the synthetic data sets; axis scale 0 to 2000]
![Page 24: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/24.jpg)
Noise
![Page 25: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/25.jpg)
Other Parameters
• No. of Attributes
• No. of Classes
• Size of training set
![Page 27: PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning](https://reader035.vdocuments.site/reader035/viewer/2022062809/568158f4550346895dc62f8c/html5/thumbnails/27.jpg)
Conclusion
• The pruning is integrated into the building phase
• Lower bounds are computed on the cost of “yet to be expanded” leaves
• Improved performance
• Open:
– How often to invoke the pruning procedure?
– Extending the approach to other algorithms…
– Developing a tighter lower bound…