TRANSCRIPT
BIRCH: A New Data Clustering Algorithm and Its Applications
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by Qiang Jing on CS 331, Spring 2006
Paul Haake, Spring 2007
Problem Introduction
Data clustering
How do we divide n data points into k groups?
How do we minimize the difference within the groups?
How do we maximize the difference between different groups?
How do we avoid trying all possible solutions?
Very large data sets
Limited computational resources: memory, I/O
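One common way to formalize the "minimize within-group difference" question above (this is the objective that KMEANS, discussed later, tries to optimize; the notation is ours, not the slides'): partition the n points into clusters C_1, ..., C_k and minimize the within-cluster sum of squared distances to each cluster's centroid:

    \min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
    \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i

Trying all possible partitions is infeasible because the number of partitions grows exponentially with n, which is why practical algorithms settle for good local optima.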
Outline
Problem introduction
Previous work
Introduction to BIRCH
The algorithm
Experimental results
Conclusions & practical use
Previous Work
Two classes of clustering algorithms:
Probability-Based
Incremental, top-down sorting, probabilistic objective function (e.g., Category Utility, CU)
Examples: COBWEB (discrete) and CLASSIT (continuous)
Distance-Based
KMEANS, KMEDOIDS and CLARANS
Previous work: COBWEB
Probabilistic approach to make decisions
Probabilistic measure: Category Utility
Clusters are represented with probabilistic description
Incrementally generates a hierarchy
Cluster data points one at a time
Maximizes Category Utility at each decision point
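For reference, Category Utility as usually defined (Gluck & Corter; reconstructed here since the slide gives no formula) for a partition into clusters C_1, ..., C_k over nominal attributes A_i with values V_{ij} is:

    \mathrm{CU} = \frac{1}{k} \sum_{c=1}^{k} P(C_c)
    \left[ \sum_{i}\sum_{j} P(A_i = V_{ij} \mid C_c)^2 \;-\; \sum_{i}\sum_{j} P(A_i = V_{ij})^2 \right]

Evaluating these probabilities for every candidate placement of a new instance is what makes COBWEB's decisions expensive (next slide).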
Previous work: COBWEB limitations
Computing category utility is very expensive
Attributes are assumed to be statistically independent
Every instance translates into a terminal node in the hierarchy
Infeasible for large data sets
Large hierarchies tend to overfit data
Previous work: distance-based clustering
Partition Clustering
Starts with an initial clustering, then moves data points between different clusters to find better clusterings
Each cluster represented by a “centroid”
Hierarchical Clustering
Repeatedly merges closest pairs of objects and splits farthest pairs of objects
Previous work: KMEANS
Distance based approach
There must be a distance measurement between any two instances (data points)
Iteratively groups instances towards the nearest centroid to minimize distances
Converges on a local minimum
Sensitive to instance order
May have exponential run time (worst case)
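A minimal sketch of the iterative procedure described above (standard Lloyd's-style KMEANS; the function and variable names are ours and illustrative only):

    import numpy as np

    def kmeans(points, k, n_iters=100, seed=0):
        # Pick k distinct points as the initial centroids.
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
        assignment = np.full(len(points), -1)
        for _ in range(n_iters):
            # Assign each point to its nearest centroid (squared Euclidean distance).
            dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_assignment = dists.argmin(axis=1)
            if np.array_equal(new_assignment, assignment):
                break                              # converged to a local minimum
            assignment = new_assignment
            # Recompute each centroid as the mean of the points assigned to it.
            for j in range(k):
                members = points[assignment == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return centroids, assignment

    # Example: centroids, labels = kmeans(np.random.rand(500, 2), k=3)

The sensitivity to the initial centroid choice and to instance order mirrors the limitations listed on the next slide.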
Previous work: KMEANS limitations
All instances must be initially available
Instances must be stored in memory
Requires frequent scans of the data (non-incremental)
Global methods at the granularity of data points
All instances are considered individually
But, not all data are equally important for clustering
Possible improvement (foreshadowing!): close or dense data points could be considered collectively.
Previous work: KMEDOIDS, CLARANS
KMEDOIDS
Similar to KMEANS, except that the centroid of a cluster is represented by one centrally located object
CLARANS
= KMEDOIDS + randomized partial search strategy
Should outperform KMEDOIDS, theoretically
Introduction to BIRCH
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Only works with "metric" attributes
Must have Euclidean coordinates
Designed for very large data sets
Time and memory constraints are explicit
Treats dense regions of data points collectively
Thus, not all data points are equally important for clustering
Problem is transformed to clustering a set of “summaries” rather than a set of data points
Only one scan of data is necessary
Introduction to BIRCH
Incremental, distance-based approach
Decisions are made without scanning all data points, or all currently existing clusters
Does not need the whole data set in advance
Unique approach: Distance-based algorithms generally need all the data points to work
Make best use of available memory while minimizing I/O costs
Does not assume that the probability distributions of the attributes are independent
Background
Given a cluster of N instances {Xi}, we define:
Centroid: the mean of the member points (formulas below)
Radius: average distance from member points to centroid
Diameter: average pair-wise distance within a cluster
The Radius and Diameter are two alternative measures of cluster “tightness”
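Reconstructed in LaTeX (the slide's equations did not survive; these are the standard definitions from the BIRCH paper): for a cluster of N points \vec{X}_1, \dots, \vec{X}_N,

    \vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}
    \qquad
    R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{1/2}
    \qquad
    D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{1/2}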
Background: Cluster-to-Cluster Distance Measures
We define the centroid Euclidean distance and centroid Manhattan distance between any two clusters as:
Anyone know how these are different?
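Reconstructed: with cluster centroids \vec{X}_{0_1} and \vec{X}_{0_2}, the two measures are

    D0 = \left( (\vec{X}_{0_1} - \vec{X}_{0_2})^2 \right)^{1/2}
    \qquad
    D1 = \sum_{d=1}^{\dim} \left| X_{0_1}^{(d)} - X_{0_2}^{(d)} \right|

i.e., D0 is the straight-line (Euclidean) distance between the centroids, while D1 sums the absolute per-dimension differences (Manhattan distance).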
Background: Cluster-to-Cluster Distance Measures
We define the average inter-cluster distance (D2), the average intra-cluster distance (D3), and the variance increase distance (D4) between two clusters as follows (D3 is simply the diameter of the merged cluster; see the formulas below)
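Reconstructed from the paper, using the notation introduced on the next two slides (the first cluster contains points \vec{X}_i, i = 1..N1, the second \vec{X}_j, j = N1+1..N1+N2):

    D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{N_1 N_2} \right)^{1/2}

    D3 = \left( \frac{\sum_{l=1}^{N_1+N_2} \sum_{m=1}^{N_1+N_2} (\vec{X}_l - \vec{X}_m)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}

    D4 = \sum_{l=1}^{N_1+N_2} (\vec{X}_l - \vec{X}_{0_{12}})^2
         \;-\; \sum_{i=1}^{N_1} (\vec{X}_i - \vec{X}_{0_1})^2
         \;-\; \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_j - \vec{X}_{0_2})^2

where \vec{X}_{0_1}, \vec{X}_{0_2}, and \vec{X}_{0_{12}} are the centroids of the first, second, and merged clusters respectively.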
Background
Cluster {Xi}:
i = 1, 2, …, N1
Cluster {Xj}:
j = N1+1, N1+2, …, N1+N2
Background
Merged cluster {Xl} = {Xi} ∪ {Xj}:
l = 1, 2, …, N1, N1+1, N1+2, …, N1+N2
Background
Optional Data Preprocessing (Normalization)
Avoids bias caused by dimensions with a large spread
But a large spread may naturally describe the data!
Clustering Feature
“How much information should be kept for each subcluster?”
“How is the information about subclusters organized?”
“How efficiently is the organization maintained?”
Clustering Feature
A Clustering Feature (CF) summarizes a sub-cluster of N data points as the triple CF = (N, LS, SS), where LS is the linear sum and SS is the square sum of the points
Additivity theorem: two CFs can be merged by component-wise addition, CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
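A small sketch (names are ours) of a CF entry as a Python class, showing how the additivity theorem and the centroid/radius computations fall out of the triple (N, LS, SS):

    import numpy as np

    class CF:
        # Clustering Feature: CF = (N, LS, SS) for a sub-cluster of points.
        def __init__(self, n, ls, ss):
            self.n = n        # N:  number of points in the sub-cluster
            self.ls = ls      # LS: linear sum of the points (a vector)
            self.ss = ss      # SS: square sum of the points (a scalar)

        @classmethod
        def from_point(cls, x):
            x = np.asarray(x, dtype=float)
            return cls(1, x.copy(), float(x @ x))

        def merge(self, other):
            # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
            return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

        def centroid(self):
            return self.ls / self.n

        def radius(self):
            # R^2 = SS/N - ||LS/N||^2  (average squared distance to the centroid)
            c = self.centroid()
            return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

Merging two sub-clusters is therefore O(dimensions) and never requires revisiting the original data points.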
Properties of Clustering Feature
CF entry is more compact
Stores significantly less than all of the data points in the sub-cluster
A CF entry has sufficient information to calculate the centroid, the radius R, the diameter D, and the distance measures D0-D4
Additivity theorem allows us to merge sub-clusters incrementally & consistently
CF-Tree
Properties of CF-Tree
Each non-leaf node has at most B entries
Each leaf node has at most L CF entries, each of which satisfies the threshold T (a maximum diameter or radius)
P (page size in bytes) is the maximum size of a node
Compact: each leaf entry summarizes a subcluster, not a single data point!
CF-Tree Insertion
Recurse down from root, find the appropriate leaf
Follow the "closest" CF path, w.r.t. D0 / … / D4
Modify the leaf
If the closest leaf CF entry cannot absorb the new point without exceeding threshold T, make a new CF entry. If there is no room in the leaf node for the new entry, split the leaf node and propagate splits upward, recursively back to the root if necessary.
Traverse back & up
Update the CFs on the path back to the root (and in any split nodes) to reflect the new addition to the cluster
Merging refinement
Splits are caused by the fixed page size and may produce unnatural clusters; merge the closest entries in the affected node to compensate for this.
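A heavily simplified sketch of the absorb-or-create decision at the leaf level, reusing the CF class from the Clustering Feature section (it ignores the B/L node limits, node splits, and the merging refinement, and uses a radius-based threshold test, one of the options the paper allows):

    def try_insert(leaf_entries, x, threshold):
        # leaf_entries: a list of CF objects; x: a new data point.
        new_cf = CF.from_point(x)
        if leaf_entries:
            # Find the closest existing entry by centroid (D0) distance.
            closest = min(leaf_entries,
                          key=lambda cf: np.linalg.norm(cf.centroid() - new_cf.centroid()))
            merged = closest.merge(new_cf)
            if merged.radius() <= threshold:
                # Absorb: the merged entry still satisfies the threshold T.
                leaf_entries[leaf_entries.index(closest)] = merged
                return
        # Otherwise start a new CF entry (in the real tree this may force a split).
        leaf_entries.append(new_cf)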
CF-Tree Anomalies
Anomaly 1
Two subclusters that should be together are split across two nodes due to page size, or
Two subclusters that should not be in the same cluster are in the same node
CF-Tree Anomalies
Anomaly 2
Equal-valued data points inserted at different times may end up in different leaf entries
CF-Tree Rebuilding
If we run out of memory, increase threshold T
By increasing the threshold, CFs absorb more data, but are less granular: leaf entry clusters become larger
Rebuilding "pushes" CFs over
The larger T allows different CFs to group together
Reducibility theorem
Increasing T will result in a CF-tree as small as or smaller than the original
Rebuilding needs at most h (height of tree) extra pages of memory
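A rough sketch of the rebuilding idea on the same flat leaf-entry list as above (the real algorithm walks the old tree path by path and reuses its pages, which is how it stays within h extra pages): re-insert the old entries under the larger threshold so that nearby CFs collapse into single entries.

    def rebuild(old_entries, new_threshold):
        new_entries = []
        for cf in old_entries:
            absorbed = False
            for i, existing in enumerate(new_entries):
                candidate = existing.merge(cf)
                if candidate.radius() <= new_threshold:
                    new_entries[i] = candidate   # old entry absorbed into a larger one
                    absorbed = True
                    break
            if not absorbed:
                new_entries.append(cf)
        return new_entries                       # never larger than old_entries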
BIRCH Overview
The Algorithm: BIRCH
Phase 1: Load data into memory
Build an initial in-memory CF-tree with the data (one scan)
Subsequent phases are
fast (no more I/O needed, work on sub-clusters instead of individual data points)
accurate (outliers are removed)
less order sensitive (because the CF-Tree forms an initial ordering of the data)
The Algorithm: BIRCH
Phase 2: Condense data
Allows us to resize the data set so that Phase 3 runs on an optimally sized input
Rebuild the CF-tree with a larger T
Remove more outliers
Group together crowded subclusters
Condensing is optional
The Algorithm: BIRCH
Phase 3: Global clustering
Use an existing clustering algorithm (e.g., hierarchical clustering, KMEANS, CLARANS) on the CF entries
Helps fix problem where natural clusters span nodes (Anomaly 1)
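One way to picture Phase 3 (a sketch, not the paper's exact procedure): treat each leaf CF entry as a single point at its centroid, weighted by the number of data points it summarizes, and hand those weighted points to an off-the-shelf algorithm such as scikit-learn's KMeans:

    import numpy as np
    from sklearn.cluster import KMeans

    def global_clustering(leaf_entries, k):
        # leaf_entries: CF objects from the (possibly condensed) CF-tree.
        centroids = np.array([cf.centroid() for cf in leaf_entries])
        weights = np.array([cf.n for cf in leaf_entries], dtype=float)
        model = KMeans(n_clusters=k, n_init=10).fit(centroids, sample_weight=weights)
        return model.cluster_centers_, model.labels_

Because the input is only the CF entries rather than the raw data points, even a relatively expensive global algorithm is affordable here.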
The Algorithm: BIRCH
Phase 4: Cluster refining
Do additional passes over the dataset & reassign data points to the closest centroid from phase 3
Refining is optional
Fixes the problem with CF-trees where same-valued data points may be assigned to different leaf entries (anomaly 2)
Always converges to a minimum
Allows us to discard more outliers
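A sketch of the refinement pass (names are ours): re-scan the raw data, assign every point to the closest Phase 3 centroid, and recompute the centroids. Repeating this is essentially a KMEANS-style pass seeded with good centroids, which is why it converges to a minimum.

    import numpy as np

    def refine(points, centroids, n_passes=1):
        centroids = np.asarray(centroids, dtype=float)
        labels = np.zeros(len(points), dtype=int)
        for _ in range(n_passes):
            # Reassign every raw data point to its closest global centroid.
            dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Recompute centroids from the new assignment.
            for j in range(len(centroids)):
                members = points[labels == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return centroids, labels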
Memory Management
BIRCH's memory usage is determined by the data distribution, not the data size
In phases 1-3, BIRCH uses all available memory to generate clusters as granular as possible
Phase 4 can correct inaccuracies caused by insufficient memory (i.e., lack of granularity)
Time/space trade-off: if memory is low, spend more time in phase 4
Experimental Results
Input parameters:
Memory (M): 5% of data set
Distance equation: D2 (average intercluster distance)
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
Phase 3 algorithm: an agglomerative Hierarchical Clustering algorithm
One refinement pass in phase 4
Experimental Results
Create 3 synthetic data sets for testing
Also create an ordered copy for testing input order
KMEANS and CLARANS require the entire data set to be in memory
Initial scan is from disk, subsequent scans are in memory
Experimental Results
Intended clustering
Experimental Results
KMEANS clustering
DS   Time   D     # Scans      DS   Time   D     # Scans
1    43.9   2.09  289          1o   33.8   1.97  197
2    13.2   4.43  51           2o   12.7   4.20  29
3    32.9   3.66  187          3o   36.0   4.35  241
(1o, 2o, 3o are the ordered copies of the data sets)
Experimental Results
CLARANS clustering
DS   Time   D     # Scans      DS   Time   D     # Scans
1    932    2.10  3307         1o   794    2.11  2854
2    758    2.63  2661         2o   816    2.31  2933
3    835    3.39  2959         3o   924    3.28  3369
Experimental Results
BIRCH clustering
DS   Time   D     # Scans      DS   Time   D     # Scans
1    11.5   1.87  2            1o   13.6   1.87  2
2    10.7   1.99  2            2o   12.1   1.99  2
3    11.4   3.95  2            3o   12.2   3.99  2
Conclusions & Practical Use
Pixel classification in images
From top to bottom:
BIRCH classification
Visible wavelength band
Near-infrared band
Conclusions & Practical Use
Image compression using vector quantization
Generate codebook for frequently occurring patterns
BIRCH performs faster than CLARANS or LBG, while achieving better compression and nearly as good quality
Conclusions & Practical Use
BIRCH works with very large data sets
Explicitly bounded by computational resources
Runs with specified amount of memory
Superior to CLARANS and KMEANS in quality, speed, stability, and scalability
Exam Questions
What is the main limitation of BIRCH?
BIRCH only works with metric attributes (i.e., Euclidean coordinates)
Name the two algorithms in BIRCH clustering:
CF-Tree Insertion
CF-Tree Rebuilding
Exam Questions
What is the purpose of phase 4 in BIRCH?
Phase 4 guarantees convergence to a minimum, discards additional outliers, ensures that duplicate data points end up in the same cluster, and compensates for low memory availability (lack of granularity) during the previous phases