SLIQ
A Data Mining Paper Presentation on Classification
![Page 1: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/1.jpg)
SLIQ: A Fast Scalable Classifier for Data Mining
Manish Mehta, Rakesh Agrawal, Jorma Rissanen
Presentation by: Sara Alaee, Zahra Taheri
![Page 2: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/2.jpg)
SLIQ: A Fast Scalable Classifier for Data Mining
Presented in: 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March 25–29, 1996, Proceedings
927 citations
![Page 3: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/3.jpg)
Outline
• Introduction
• Motivation
• SLIQ Algorithm
  • Building tree
  • Pruning
  • Example
• Evaluation
• Conclusion
04/13/23 3
![Page 4: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/4.jpg)
Introduction
• Most classification algorithms are designed for memory-resident data, which limits their suitability for mining large training datasets
• Solution: build a scalable classifier, SLIQ
• SLIQ: Supervised Learning in Quest
![Page 5: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/5.jpg)
![Page 6: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/6.jpg)
Motivation
• Improve scalability of tree classifiers
• Previous proposals:
  • Sampling data at each node
  • Discretization of numerical attributes
  • Partitioning the input data and building a tree for each partition
• All of these methods sacrifice accuracy!
• SLIQ: improve learning time without loss in accuracy!
![Page 7: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/7.jpg)
Motivation (cont.)
Recall (ID3, C4.5, CART):
![Page 8: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/8.jpg)
Motivation (cont.)
Non-Scalable Decision Trees:
• Complexity in determining the best split for each attribute
  • Cost of evaluating splits for numerical attributes = cost of sorting values at each node
  • Cost of evaluating splits for categorical attributes = cost of searching for the best subset
• Pruning
  • Cross-validation: inapplicable for large datasets
  • Dividing the data into two parts, a training set and a test set: problems with sizes and distribution
![Page 9: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/9.jpg)
![Page 10: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/10.jpg)
SLIQ – Algorithm
Key features:
• Tree classifier, handling both numerical and categorical attributes
• Pre-sorts numerical attributes before the tree is built
• Breadth-first growing strategy
• Goodness test: Gini index
• Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)
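The Gini index named above as the goodness test can be sketched as follows. This is a minimal illustration, not code from the paper; the function names `gini` and `gini_split` are hypothetical. It uses the standard definition, gini(S) = 1 - sum of p_j squared over the classes j, and scores a binary split by the size-weighted average of the two sides.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Hypothetical labels: a perfectly separating split scores 0.
mixed = gini(["a", "a", "b", "b"])           # 0.5
pure = gini_split(["a", "a"], ["b", "b"])    # 0.0
```

A split that leaves both sides pure scores 0, and an evenly mixed two-class node scores 0.5, so the split with the lowest weighted value is preferred.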
![Page 11: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/11.jpg)
SLIQ – Algorithm (cont.)
Pre-sorting:
• Eliminates the need to sort the data at each node
• Create a sorted list for each numerical attribute
• Create a class list
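A rough sketch of the pre-sorting step, under stated assumptions: records are plain dictionaries, the record index serves as the identifier that links an attribute list entry to the class list, and every record starts at a single root leaf called "N1". The function name `presort`, the field names and the data are hypothetical.

```python
def presort(records, numeric_attrs, label_key="class"):
    """Build one sorted attribute list per numerical attribute and a class list.

    Each attribute list holds (value, rid) pairs sorted by value.
    The class list is indexed by rid and holds (label, leaf) pairs.
    """
    class_list = [(r[label_key], "N1") for r in records]
    attr_lists = {
        attr: sorted((r[attr], rid) for rid, r in enumerate(records))
        for attr in numeric_attrs
    }
    return attr_lists, class_list


# Hypothetical training data.
records = [
    {"age": 30, "salary": 65, "class": "G"},
    {"age": 23, "salary": 15, "class": "B"},
    {"age": 40, "salary": 75, "class": "G"},
]
attr_lists, class_list = presort(records, ["age", "salary"])
# attr_lists["age"] is [(23, 1), (30, 0), (40, 2)]
```

The sort happens once, before any node is split; afterwards the tree builder only reads the lists in order.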
![Page 12: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/12.jpg)
SLIQ – Algorithm (cont.)
Example:
![Page 13: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/13.jpg)
SLIQ – Algorithm (cont.)
Split evaluation:
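One way to picture split evaluation on a pre-sorted attribute list is a single scan that keeps, per leaf, a running class histogram of the records seen so far. The sketch below is an illustration under assumptions, not the paper's code: the names are hypothetical, tests have the form "value <= threshold", and the threshold is taken to be the attribute value itself.

```python
from collections import Counter, defaultdict


def gini_from_counts(counts):
    """Gini index computed from a class histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_numeric_splits(attr_list, class_list):
    """Scan a sorted attribute list once and score splits for every leaf.

    attr_list: [(value, rid), ...] sorted by value.
    class_list: list indexed by rid of (label, leaf) pairs.
    Returns {leaf: (weighted_gini, threshold)}.
    """
    total = defaultdict(Counter)
    for label, leaf in class_list:
        total[leaf][label] += 1

    below = defaultdict(Counter)  # per-leaf histogram of records scanned so far
    best = {}
    touched = set()               # leaves seen in the current run of equal values
    for i, (value, rid) in enumerate(attr_list):
        label, leaf = class_list[rid]
        below[leaf][label] += 1
        touched.add(leaf)
        end_of_run = i + 1 == len(attr_list) or attr_list[i + 1][0] != value
        if not end_of_run:
            continue
        for lf in touched:
            above = total[lf] - below[lf]
            n_below = sum(below[lf].values())
            n_above = sum(above.values())
            if n_above == 0:
                continue  # a split must leave records on both sides
            score = (n_below * gini_from_counts(below[lf])
                     + n_above * gini_from_counts(above)) / (n_below + n_above)
            if lf not in best or score < best[lf][0]:
                best[lf] = (score, value)
        touched.clear()
    return best


# Hypothetical data: ages sorted, all records still at the root leaf "N1".
age_list = [(23, 1), (30, 0), (40, 2)]
class_list = [("G", "N1"), ("B", "N1"), ("G", "N1")]
best = best_numeric_splits(age_list, class_list)
```

Because the histograms are keyed by leaf, one pass over the list scores candidate splits for all current leaves at once, which fits the breadth-first growth strategy.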
![Page 14: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/14.jpg)
SLIQ – Algorithm (cont.)
Example:
![Page 15: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/15.jpg)
SLIQ – Algorithm (cont.)
Update class list:
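A minimal sketch of the class list update, assuming each chosen split is a test "value <= threshold" on the attribute whose list is being scanned. The function name, the `splits` layout and the child names "N2" and "N3" are hypothetical.

```python
def update_class_list(attr_list, class_list, splits):
    """Move each record's leaf reference to the child chosen by the split test.

    attr_list: [(value, rid), ...] for the attribute used in the splits.
    class_list: list indexed by rid of (label, leaf) pairs, updated in place.
    splits: {leaf: (threshold, left_child, right_child)}.
    """
    for value, rid in attr_list:
        label, leaf = class_list[rid]
        if leaf in splits:
            threshold, left, right = splits[leaf]
            child = left if value <= threshold else right
            class_list[rid] = (label, child)
    return class_list


# Hypothetical data: the root "N1" is split on age <= 23.
age_list = [(23, 1), (30, 0), (40, 2)]
class_list = [("G", "N1"), ("B", "N1"), ("G", "N1")]
update_class_list(age_list, class_list, {"N1": (23, "N2", "N3")})
# class_list is now [("G", "N3"), ("B", "N2"), ("G", "N3")]
```

Only the class list changes; the attribute lists stay sorted and are never rewritten, so no re-sorting is needed at the next level. In this sketch the function would be called once per attribute that some leaf splits on.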
![Page 16: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/16.jpg)
SLIQ – Algorithm (cont.)
Example:
![Page 17: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/17.jpg)
SLIQ – Algorithm (cont.)
• When a node becomes pure, stop splitting it
• Condense the attribute lists by discarding examples that belong to the pure node
• For large-cardinality categorical attributes (determined by a threshold), the best split is computed greedily; otherwise all possible splits are evaluated
• SLIQ scales to large datasets with no loss in accuracy: the splits evaluated with or without pre-sorting are identical
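The greedy option for categorical attributes can be sketched as below. This is one plausible reading, labeled as an assumption: start from an empty subset, repeatedly add the single value that gives the best split "value in subset", and stop when no addition improves the weighted Gini index. All names and data are hypothetical.

```python
from collections import Counter


def gini_from_counts(counts):
    """Gini index computed from a class histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def greedy_subset_split(values, labels):
    """Greedily grow a subset S of categorical values for the test 'value in S'."""
    per_value = {}
    for v, y in zip(values, labels):
        per_value.setdefault(v, Counter())[y] += 1
    total = Counter(labels)
    n = len(labels)

    def score(subset):
        inside = Counter()
        for v in subset:
            inside += per_value[v]
        outside = total - inside
        n_in = sum(inside.values())
        return (n_in * gini_from_counts(inside)
                + (n - n_in) * gini_from_counts(outside)) / n

    chosen, best = frozenset(), float("inf")
    while True:
        # Candidate subsets add one more value but never swallow every value.
        candidates = [chosen | {v} for v in per_value if v not in chosen]
        candidates = [c for c in candidates if len(c) < len(per_value)]
        if not candidates:
            break
        candidate = min(candidates, key=score)
        candidate_score = score(candidate)
        if candidate_score >= best:
            break  # no improvement: stop growing the subset
        chosen, best = candidate, candidate_score
    return chosen, best


# Hypothetical data: value "a" alone separates class "x" from class "y".
subset, subset_score = greedy_subset_split(
    ["a", "b", "c", "a", "b", "c"],
    ["x", "y", "y", "x", "y", "y"],
)
```

The greedy search scores a number of subsets that grows quadratically with the number of distinct values, whereas evaluating every subset grows exponentially, which is why a cardinality threshold decides between the two.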
![Page 18: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/18.jpg)
SLIQ - Pruning
• Post-pruning algorithm based on the Minimum Description Length principle
• Find a model that minimizes: Cost(M,D) = Cost(D|M) + Cost(M)
  • Cost(M): cost of the model
  • Cost(D|M): cost of encoding the data D if model M is given
![Page 19: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/19.jpg)
SLIQ - Pruning
• Cost of the data: classification error
• Cost of the model:
  • Encoding the tree: number of bits
  • Encoding the splits:
    • numerical attribute: constant (empirically 1)
    • categorical attribute: depends on cardinality
• MDL pruning evaluates the code length at each node to decide on pruning
![Page 20: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/20.jpg)
SLIQ - Pruning
Pruning Algorithm:
C’(ti): cost of encoding the children’s examples using the parent’s statistics.
![Page 21: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/21.jpg)
SLIQ - Pruning
Three pruning strategies:
• Full: prune both children and convert the node into a leaf
• Partial: convert the node into a leaf, prune the left child, prune the right child, or leave the node intact
• Hybrid: apply the Full method first, then the Partial options (prune left, prune right, or leave intact)
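The per-node choice behind these strategies can be sketched as a comparison of code lengths. How the terms combine here is an assumption built from the cost definitions on the earlier slides (data cost as classification errors, model cost as tree bits plus split bits, and C’ for a pruned child's examples encoded with the parent's statistics). The function and parameter names are hypothetical.

```python
def prune_decision(l_node, l_test, errors_as_leaf,
                   c_left, c_right, c_left_parent, c_right_parent,
                   strategy="full"):
    """Pick the cheapest encoding for one internal node.

    l_node: bits to encode the node's structure.
    l_test: bits to encode the split test.
    errors_as_leaf: misclassified examples if the node becomes a leaf.
    c_left, c_right: code lengths of the (already processed) subtrees.
    c_left_parent, c_right_parent: cost of a child's examples when that child
        is pruned and encoded with the parent's statistics (C' on the slide).
    strategy: "full" compares leaf vs. both children; "partial" also allows
        keeping only one child.
    """
    options = {
        "leaf": l_node + errors_as_leaf,
        "both": l_node + l_test + c_left + c_right,
    }
    if strategy == "partial":
        options["left_only"] = l_node + l_test + c_left + c_right_parent
        options["right_only"] = l_node + l_test + c_right + c_left_parent
    choice = min(options, key=options.get)
    return choice, options[choice]


# Hypothetical costs for one node.
full_choice = prune_decision(1, 1, 3, 0, 0, 2, 2, strategy="full")
partial_choice = prune_decision(1, 1, 5, 1, 4, 3, 2, strategy="partial")
```

Under this reading, a hybrid pass would call the function with `strategy="full"` for every node bottom-up, then revisit the surviving internal nodes and consider only the `left_only`, `right_only` and `both` options.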
![Page 22: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/22.jpg)
![Page 23: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/23.jpg)
Evaluation
• Metrics:
  • Primary: classification accuracy
  • Secondary: classification time and size of the decision tree
• Setup:
  • Small benchmarks: datasets from the STATLOG classification benchmark
  • Synthetic databases: 9 attributes for each tuple, 2 classification functions
![Page 24: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/24.jpg)
Evaluation
STATLOG benchmark:
![Page 25: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/25.jpg)
Evaluation
Pruning strategy comparison:
Hybrid pruning is the preferred approach, and is used for the experiments in this paper.
![Page 26: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/26.jpg)
Evaluation
Small datasets:
• IND-Cart:
  • good accuracy
  • small trees
  • an order of magnitude slower than the others
• IND-C4:
  • accurate
  • fast
  • large decision trees
• SLIQ:
  • accurate
  • smaller trees than IND-C4
  • faster than IND-Cart
![Page 27: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/27.jpg)
Evaluation
Scalability:
![Page 28: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/28.jpg)
![Page 29: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/29.jpg)
Conclusion
• SLIQ proves to be a fast, low-cost, and scalable classifier that builds accurate trees
• Based on empirical tests, SLIQ achieves comparable accuracy while producing smaller decision trees than other algorithms
• Scalability??? Memory becomes a problem as the number of attributes or the number of classes increases
![Page 30: SLIQ](https://reader036.vdocuments.site/reader036/viewer/2022062419/558535d9d8b42a86388b522c/html5/thumbnails/30.jpg)
THANK YOU!