TRANSCRIPT
BIRCH: A New Data Clustering Algorithm and Its Applications
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by Qiang Jing on CS 331, Spring 2006
Paul Haake, Spring 2007
Problem Introduction
Data clustering
How do we divide n data points into k groups?
How do we minimize the difference within the groups?
How do we maximize the difference between different groups?
How do we avoid trying all possible solutions?
Very large data sets
Limited computational resources: memory, I/O
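One common way to formalize the "minimize within-group difference" question above (this is the objective that KMEANS, discussed later, tries to optimize; the notation is ours, not the slides'): partition the n points into clusters C_1, ..., C_k and minimize the within-cluster sum of squared distances to each cluster's centroid:

    \min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
    \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i

Trying all possible partitions is infeasible because the number of partitions grows exponentially with n, which is why practical algorithms settle for good local optima.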
Outline
Problem introduction
Previous work
Introduction to BIRCH
The algorithm
Experimental results
Conclusions & practical use
Previous Work
Two classes of clustering algorithms:
Probability-Based
Incremental, top-down sorting, probabilistic objective function (e.g., Category Utility, CU)
Examples: COBWEB (discrete) and CLASSIT (continuous)
Distance-Based
KMEANS, KMEDOIDS and CLARANS
Previous work: COBWEB
Probabilistic approach to make decisions
Probabilistic measure: Category Utility
Clusters are represented with probabilistic description
Incrementally generates a hierarchy
Cluster data points one at a time
Maximizes Category Utility at each decision point
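For reference, Category Utility as usually defined (Gluck & Corter; reconstructed here since the slide gives no formula) for a partition into clusters C_1, ..., C_k over nominal attributes A_i with values V_{ij} is:

    \mathrm{CU} = \frac{1}{k} \sum_{c=1}^{k} P(C_c)
    \left[ \sum_{i}\sum_{j} P(A_i = V_{ij} \mid C_c)^2 \;-\; \sum_{i}\sum_{j} P(A_i = V_{ij})^2 \right]

Evaluating these probabilities for every candidate placement of a new instance is what makes COBWEB's decisions expensive (next slide).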
Previous work: COBWEB limitations
Computing category utility is very expensive
Attributes are assumed to be statistically independent
Every instance translates into a terminal node in the hierarchy
Infeasible for large data sets
Large hierarchies tend to overfit data
Previous work: distance-based clustering
Partition Clustering
Starts with an initial clustering, then moves data points between different clusters to find better clusterings
Each cluster represented by a “centroid”
Hierarchical Clustering
Repeatedly merges closest pairs of objects and splits farthest pairs of objects
Previous work: KMEANS
Distance based approach
There must be a distance measurement between any two instances (data points)
Iteratively groups instances towards the nearest centroid to minimize distances
Converges on a local minimum
Sensitive to instance order
May have exponential run time (worst case)
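A minimal sketch of the iterative procedure described above (standard Lloyd's-style KMEANS; the function and variable names are ours and illustrative only):

    import numpy as np

    def kmeans(points, k, n_iters=100, seed=0):
        # Pick k distinct points as the initial centroids.
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
        assignment = np.full(len(points), -1)
        for _ in range(n_iters):
            # Assign each point to its nearest centroid (squared Euclidean distance).
            dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_assignment = dists.argmin(axis=1)
            if np.array_equal(new_assignment, assignment):
                break                              # converged to a local minimum
            assignment = new_assignment
            # Recompute each centroid as the mean of the points assigned to it.
            for j in range(k):
                members = points[assignment == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return centroids, assignment

    # Example: centroids, labels = kmeans(np.random.rand(500, 2), k=3)

The sensitivity to the initial centroid choice and to instance order mirrors the limitations listed on the next slide.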
Previous work: KMEANS limitations
All instances must be initially available
Instances must be stored in memory
Requires frequent scans of the data (non-incremental)
Global methods at the granularity of data points
All instances are considered individually
But, not all data are equally important for clustering
Possible improvement (foreshadowing!): close or dense data points could be considered collectively.
Previous work: KMEDOIDS, CLARANS
KMEDOIDS
Similar to KMEANS, except that the centroid of a cluster is represented by one centrally located object
CLARANS
= KMEDOIDS + randomized partial search strategy
Should outperform KMEDOIDS, theoretically
Introduction to BIRCH
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Only works with "metric" attributes
Must have Euclidean coordinates
Designed for very large data sets
Time and memory constraints are explicit
Treats dense regions of data points collectively
Thus, not all data points are equally important for clustering
Problem is transformed to clustering a set of “summaries” rather than a set of data points
Only one scan of data is necessary
Introduction to BIRCH
Incremental, distance-based approach
Decisions are made without scanning all data points, or all currently existing clusters
Does not need the whole data set in advance
Unique approach: Distance-based algorithms generally need all the data points to work
Make best use of available memory while minimizing I/O costs
Does not assume that the probability distributions of the attributes are independent
Background
Given a cluster of N instances {Xi}, we define:
Centroid: the mean of the member points (formulas below)
Radius: average distance from member points to centroid
Diameter: average pair-wise distance within a cluster
The Radius and Diameter are two alternative measures of cluster “tightness”
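Reconstructed in LaTeX (the slide's equations did not survive; these are the standard definitions from the BIRCH paper): for a cluster of N points \vec{X}_1, \dots, \vec{X}_N,

    \vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}
    \qquad
    R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{1/2}
    \qquad
    D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{1/2}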
Background: Cluster-to-Cluster Distance Measures
We define the centroid Euclidean distance and centroid Manhattan distance between any two clusters as:
Anyone know how these are different?
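Reconstructed: with cluster centroids \vec{X}_{0_1} and \vec{X}_{0_2}, the two measures are

    D0 = \left( (\vec{X}_{0_1} - \vec{X}_{0_2})^2 \right)^{1/2}
    \qquad
    D1 = \sum_{d=1}^{\dim} \left| X_{0_1}^{(d)} - X_{0_2}^{(d)} \right|

i.e., D0 is the straight-line (Euclidean) distance between the centroids, while D1 sums the absolute per-dimension differences (Manhattan distance).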
Background: Cluster-to-Cluster Distance Measures
We define the average inter-cluster distance (D2), the average intra-cluster distance (D3), and the variance increase distance (D4) between two clusters as follows (D3 is simply the diameter of the merged cluster; see the formulas below)
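Reconstructed from the paper, using the notation introduced on the next two slides (the first cluster contains points \vec{X}_i, i = 1..N1, the second \vec{X}_j, j = N1+1..N1+N2):

    D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2}{N_1 N_2} \right)^{1/2}

    D3 = \left( \frac{\sum_{l=1}^{N_1+N_2} \sum_{m=1}^{N_1+N_2} (\vec{X}_l - \vec{X}_m)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}

    D4 = \sum_{l=1}^{N_1+N_2} (\vec{X}_l - \vec{X}_{0_{12}})^2
         \;-\; \sum_{i=1}^{N_1} (\vec{X}_i - \vec{X}_{0_1})^2
         \;-\; \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_j - \vec{X}_{0_2})^2

where \vec{X}_{0_1}, \vec{X}_{0_2}, and \vec{X}_{0_{12}} are the centroids of the first, second, and merged clusters respectively.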
Background
Cluster {Xi}:
i = 1, 2, …, N1
Cluster {Xj}:
j = N1+1, N1+2, …, N1+N2
Background
Merged cluster {Xl} = {Xi} ∪ {Xj}:
l = 1, 2, …, N1, N1+1, N1+2, …, N1+N2
Background
Optional Data Preprocessing (Normalization)
Avoids bias caused by dimensions with a large spread
But a large spread may naturally describe the data!
Clustering Feature
“How much information should be kept for each subcluster?”
“How is the information about subclusters organized?”
“How efficiently is the organization maintained?”
Clustering Feature
A Clustering Feature (CF) summarizes a sub-cluster of N data points as the triple CF = (N, LS, SS), where LS is the linear sum and SS is the square sum of the points
Additivity theorem: two CFs can be merged by component-wise addition, CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
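A small sketch (names are ours) of a CF entry as a Python class, showing how the additivity theorem and the centroid/radius computations fall out of the triple (N, LS, SS):

    import numpy as np

    class CF:
        # Clustering Feature: CF = (N, LS, SS) for a sub-cluster of points.
        def __init__(self, n, ls, ss):
            self.n = n        # N:  number of points in the sub-cluster
            self.ls = ls      # LS: linear sum of the points (a vector)
            self.ss = ss      # SS: square sum of the points (a scalar)

        @classmethod
        def from_point(cls, x):
            x = np.asarray(x, dtype=float)
            return cls(1, x.copy(), float(x @ x))

        def merge(self, other):
            # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
            return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

        def centroid(self):
            return self.ls / self.n

        def radius(self):
            # R^2 = SS/N - ||LS/N||^2  (average squared distance to the centroid)
            c = self.centroid()
            return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

Merging two sub-clusters is therefore O(dimensions) and never requires revisiting the original data points.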
Properties of Clustering Feature
CF entry is more compact
Stores significantly less than all of the data points in the sub-cluster
A CF entry has sufficient information to calculate the centroid, the radius R, the diameter D, and the distance measures D0-D4
Additivity theorem allows us to merge sub-clusters incrementally & consistently
CF-Tree
Properties of CF-Tree
Each non-leaf node has at most B entries
Each leaf node has at most L CF entries, each of which satisfies the threshold T (a maximum diameter or radius)
P (page size in bytes) is the maximum size of a node
Compact: each leaf entry summarizes a subcluster, not a single data point!
CF-Tree Insertion
Recurse down from root, find the appropriate leaf
Follow the "closest" CF path, w.r.t. D0 / … / D4
Modify the leaf
If the closest leaf CF entry cannot absorb the new point without exceeding threshold T, make a new CF entry. If there is no room in the leaf node for the new entry, split the leaf node and propagate splits upward, recursively back to the root if necessary.
Traverse back & up
Update the CFs on the path back to the root (and in any split nodes) to reflect the new addition to the cluster
Merging refinement
Splits are caused by the fixed page size and may produce unnatural clusters; merge the closest entries in the affected node to compensate for this.
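A heavily simplified sketch of the absorb-or-create decision at the leaf level, reusing the CF class from the Clustering Feature section (it ignores the B/L node limits, node splits, and the merging refinement, and uses a radius-based threshold test, one of the options the paper allows):

    def try_insert(leaf_entries, x, threshold):
        # leaf_entries: a list of CF objects; x: a new data point.
        new_cf = CF.from_point(x)
        if leaf_entries:
            # Find the closest existing entry by centroid (D0) distance.
            closest = min(leaf_entries,
                          key=lambda cf: np.linalg.norm(cf.centroid() - new_cf.centroid()))
            merged = closest.merge(new_cf)
            if merged.radius() <= threshold:
                # Absorb: the merged entry still satisfies the threshold T.
                leaf_entries[leaf_entries.index(closest)] = merged
                return
        # Otherwise start a new CF entry (in the real tree this may force a split).
        leaf_entries.append(new_cf)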
CF-Tree Anomalies
Anomaly 1
Two subclusters that should be together are split across two nodes due to page size, or
Two subclusters that should not be in the same cluster are in the same node
CF-Tree Anomalies
Anomaly 2
Equal-valued data points inserted at different times may end up in different leaf entries
CF-Tree Rebuilding
If we run out of memory, increase threshold T
By increasing the threshold, CFs absorb more data, but are less granular: leaf entry clusters become larger
Rebuilding "pushes" CFs over
The larger T allows different CFs to group together
Reducibility theorem
Increasing T will result in a CF-tree as small as or smaller than the original
Rebuilding needs at most h (height of tree) extra pages of memory
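A rough sketch of the rebuilding idea on the same flat leaf-entry list as above (the real algorithm walks the old tree path by path and reuses its pages, which is how it stays within h extra pages): re-insert the old entries under the larger threshold so that nearby CFs collapse into single entries.

    def rebuild(old_entries, new_threshold):
        new_entries = []
        for cf in old_entries:
            absorbed = False
            for i, existing in enumerate(new_entries):
                candidate = existing.merge(cf)
                if candidate.radius() <= new_threshold:
                    new_entries[i] = candidate   # old entry absorbed into a larger one
                    absorbed = True
                    break
            if not absorbed:
                new_entries.append(cf)
        return new_entries                       # never larger than old_entries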
BIRCH Overview
The Algorithm: BIRCH
Phase 1: Load data into memory
Build an initial in-memory CF-tree with the data (one scan)
Subsequent phases are
fast (no more I/O needed, work on sub-clusters instead of individual data points)
accurate (outliers are removed)
less order sensitive (because the CF-Tree forms an initial ordering of the data)
The Algorithm: BIRCH
Phase 2: Condense data
Allows us to resize the data set so that Phase 3 runs on an optimally sized input
Rebuild the CF-tree with a larger T
Remove more outliers
Group together crowded subclusters
Condensing is optional
The Algorithm: BIRCH
Phase 3: Global clustering
Use an existing clustering algorithm (e.g., hierarchical clustering, KMEANS, CLARANS) on the CF entries
Helps fix problem where natural clusters span nodes (Anomaly 1)
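One way to picture Phase 3 (a sketch, not the paper's exact procedure): treat each leaf CF entry as a single point at its centroid, weighted by the number of data points it summarizes, and hand those weighted points to an off-the-shelf algorithm such as scikit-learn's KMeans:

    import numpy as np
    from sklearn.cluster import KMeans

    def global_clustering(leaf_entries, k):
        # leaf_entries: CF objects from the (possibly condensed) CF-tree.
        centroids = np.array([cf.centroid() for cf in leaf_entries])
        weights = np.array([cf.n for cf in leaf_entries], dtype=float)
        model = KMeans(n_clusters=k, n_init=10).fit(centroids, sample_weight=weights)
        return model.cluster_centers_, model.labels_

Because the input is only the CF entries rather than the raw data points, even a relatively expensive global algorithm is affordable here.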
The Algorithm: BIRCH
Phase 4: Cluster refining
Do additional passes over the dataset & reassign data points to the closest centroid from phase 3
Refining is optional
Fixes the problem with CF-trees where same-valued data points may be assigned to different leaf entries (anomaly 2)
Always converges to a minimum
Allows us to discard more outliers
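A sketch of the refinement pass (names are ours): re-scan the raw data, assign every point to the closest Phase 3 centroid, and recompute the centroids. Repeating this is essentially a KMEANS-style pass seeded with good centroids, which is why it converges to a minimum.

    import numpy as np

    def refine(points, centroids, n_passes=1):
        centroids = np.asarray(centroids, dtype=float)
        labels = np.zeros(len(points), dtype=int)
        for _ in range(n_passes):
            # Reassign every raw data point to its closest global centroid.
            dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Recompute centroids from the new assignment.
            for j in range(len(centroids)):
                members = points[labels == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return centroids, labels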
Memory Management
BIRCH's memory usage is determined by the data distribution, not the data size
In phases 1-3, BIRCH uses all available memory to generate clusters as granular as possible
Phase 4 can correct inaccuracies caused by insufficient memory (i.e., lack of granularity)
Time/space trade-off: if memory is low, spend more time in phase 4
Experimental Results
Input parameters:
Memory (M): 5% of data set
Distance equation: D2 (average intercluster distance)
Quality equation: weighted average diameter (D)
Initial threshold (T): 0.0
Page size (P): 1024 bytes
Phase 3 algorithm: an agglomerative Hierarchical Clustering algorithm
One refinement pass in phase 4
Experimental Results
Create 3 synthetic data sets for testing
Also create an ordered copy for testing input order
KMEANS and CLARANS require the entire data set to be in memory
Initial scan is from disk, subsequent scans are in memory
Experimental Results
Intended clustering
Experimental Results
KMEANS clustering
DS   Time   D     # Scans      DS   Time   D     # Scans
1    43.9   2.09  289          1o   33.8   1.97  197
2    13.2   4.43  51           2o   12.7   4.20  29
3    32.9   3.66  187          3o   36.0   4.35  241
(1o, 2o, 3o are the ordered copies of the data sets)
Experimental Results
CLARANS clustering
DS   Time   D     # Scans      DS   Time   D     # Scans
1    932    2.10  3307         1o   794    2.11  2854
2    758    2.63  2661         2o   816    2.31  2933
3    835    3.39  2959         3o   924    3.28  3369
Experimental Results
BIRCH clustering
DS   Time   D     # Scans      DS   Time   D     # Scans
1    11.5   1.87  2            1o   13.6   1.87  2
2    10.7   1.99  2            2o   12.1   1.99  2
3    11.4   3.95  2            3o   12.2   3.99  2
Conclusions & Practical Use
Pixel classification in images
From top to bottom:
BIRCH classification
Visible wavelength band
Near-infrared band
Conclusions & Practical Use
Image compression using vector quantization
Generate codebook for frequently occurring patterns
BIRCH performs faster than CLARANS or LBG, while achieving better compression and nearly as good quality
Conclusions & Practical Use
BIRCH works with very large data sets
Explicitly bounded by computational resources
Runs with specified amount of memory
Superior to CLARANS and KMEANS in quality, speed, stability, and scalability
Exam Questions
What is the main limitation of BIRCH?
BIRCH only works with metric attributes (i.e., Euclidean coordinates)
Name the two algorithms in BIRCH clustering:
CF-Tree Insertion
CF-Tree Rebuilding
Exam Questions
What is the purpose of phase 4 in BIRCH?
Phase 4 guarantees convergence to a minimum, discards additional outliers, ensures that duplicate data points end up in the same cluster, and compensates for low memory availability (lack of granularity) during the previous phases