
Page 1:

BIRCH: A New Data Clustering Algorithm and Its Applications

Tian Zhang, Raghu Ramakrishnan, Miron Livny

Presented by Qiang Jing, CS 331, Spring 2006

Paul Haake, Spring 2007

Page 2:

April 21, 2007

Problem Introduction

Data clustering

How do we divide n data points into k groups?

How do we minimize the difference within the groups?

How do we maximize the difference between different groups?

How do we avoid trying all possible solutions?

Very large data sets

Limited computational resources: memory, I/O
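To see why we must avoid trying all possible solutions, it helps to count them. A small illustrative sketch (not from the paper): the number of ways to partition n labeled points into k non-empty groups is the Stirling number of the second kind, computed below with its standard recurrence.

```python
def stirling2(n, k):
    """Number of ways to partition n labeled points into k non-empty groups,
    via the recurrence S(n, k) = k*S(n-1, k) + S(n-1, k-1)."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    row = [1] + [0] * k              # row = S(0, 0..k)
    for _ in range(n):
        new = [0] * (k + 1)
        for j in range(1, k + 1):
            new[j] = j * row[j] + row[j - 1]
        row = new                    # after i steps, row = S(i, 0..k)
    return row[k]

# Even a toy instance is far too large to enumerate:
print(stirling2(25, 4))              # roughly 5e13 candidate clusterings
```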

Page 3:

Outline

Problem introduction

Previous work

Introduction to BIRCH

The algorithm

Experimental results

Conclusions & practical use

Page 4:

Previous Work

Two classes of clustering algorithms:

Probability-Based

Incremental, top-down sorting, probabilistic objective function (e.g., category utility, CU)

Examples: COBWEB (discrete) and CLASSIT (continuous)

Distance-Based

KMEANS, KMEDOIDS and CLARANS

Page 5:

Previous work: COBWEB

Probabilistic approach to make decisions

Probabilistic measure: Category Utility

Clusters are represented with probabilistic description

Incrementally generates a hierarchy

Cluster data points one at a time

Maximizes Category Utility at each decision point

Page 6:

Previous work: COBWEB limitations

Computing category utility is very expensive

Attributes are assumed to be statistically independent

Every instance translates into a terminal node in the hierarchy

Infeasible for large data sets

Large hierarchies tend to overfit the data

Page 7:

Previous work: distance-based clustering

Partition Clustering

Starts with an initial clustering, then moves data points between different clusters to find better clusterings

Each cluster represented by a “centroid”

Hierarchical Clustering

Repeatedly merges the closest pairs of clusters (agglomerative) or splits the loosest ones (divisive)

Page 8:

Previous work: KMEANS

Distance based approach

There must be a distance measurement between any two instances (data points)

Iteratively groups instances towards the nearest centroid to minimize distances

Converges on a local minimum

Sensitive to instance order

May have exponential run time (worst case)
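The loop described on this slide can be sketched in a few lines of Python. This is a minimal illustrative k-means, not the paper's code; the point representation, names, and toy data are mine.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid, then
    recompute centroids; repeat. Converges to a local minimum."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated blobs:
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, clus = kmeans(points, 2)
```

Note that the sketch exhibits exactly the limitations the next slide lists: every instance is consulted on every iteration, and the whole data set must sit in memory.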

Page 9:

Previous work: KMEANS limitations

All instances must be initially available

Instances must be stored in memory

Requires frequent scans of the data (non-incremental)

Global methods at the granularity of data points

All instances are considered individually

But, not all data are equally important for clustering

Possible improvement (foreshadowing!): close or dense data points could be considered collectively.

Page 10:

Previous work: KMEDOIDS, CLARANS

KMEDOIDS

Similar to KMEANS, except that each cluster is represented by its most centrally located object (the medoid) rather than a centroid

CLARANS

= KMEDOIDS + randomized partial search strategy

Should outperform KMEDOIDS, theoretically

Page 11:

Introduction to BIRCH

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

Only works with "metric" attributes

Must have Euclidean coordinates

Designed for very large data sets

Time and memory constraints are explicit

Treats dense regions of data points collectively

Thus, not all data points are equally important for clustering

Problem is transformed to clustering a set of “summaries” rather than a set of data points

Only one scan of data is necessary

Page 12:

Introduction to BIRCH

Incremental, distance-based approach

Decisions are made without scanning all data points, or all currently existing clusters

Does not need the whole data set in advance

Unique approach: Distance-based algorithms generally need all the data points to work

Make best use of available memory while minimizing I/O costs

Does not assume that the probability distributions of attributes are independent

Page 13:

Background

Given a cluster of N instances {X_i}, we define:

Centroid: X0 = (Σ_i X_i) / N

Radius: R = sqrt( (Σ_i ||X_i − X0||²) / N ), the average distance from member points to the centroid

Diameter: D = sqrt( (Σ_{i≠j} ||X_i − X_j||²) / (N(N−1)) ), the average pair-wise distance within the cluster

The Radius and Diameter are two alternative measures of cluster “tightness”
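A small sketch of these definitions, assuming plain Python tuples as points (the helper names are mine):

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))

def radius(points):
    """R: root-mean-square distance from member points to the centroid."""
    c = centroid(points)
    n = len(points)
    return math.sqrt(sum(sum((a - b) ** 2 for a, b in zip(p, c))
                         for p in points) / n)

def diameter(points):
    """D: root-mean-square pairwise distance within the cluster."""
    n = len(points)
    s = sum(sum((a - b) ** 2 for a, b in zip(p, q))
            for p in points for q in points)
    return math.sqrt(s / (n * (n - 1)))
```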

Page 14:

Background: Cluster-to-Cluster Distance Measures

We define the centroid Euclidean distance (D0) and centroid Manhattan distance (D1) between any two clusters as:

D0 = ||X0₁ − X0₂|| = sqrt( Σ_d (X0₁(d) − X0₂(d))² )

D1 = Σ_d |X0₁(d) − X0₂(d)|

Anyone know how these are different? D0 is the straight-line distance between the two centroids; D1 sums the absolute coordinate differences, as if travelling along a grid.
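A sketch of D0 and D1 under the same tuple-of-coordinates point representation (restated here so the snippet is self-contained; names are mine):

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))

def d0(c1, c2):
    """Centroid Euclidean distance: straight-line distance between centroids."""
    a, b = centroid(c1), centroid(c2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def d1(c1, c2):
    """Centroid Manhattan distance: sum of per-coordinate absolute differences."""
    a, b = centroid(c1), centroid(c2)
    return sum(abs(x - y) for x, y in zip(a, b))
```

For centroids (0, 0) and (3, 4), D0 is 5 while D1 is 7: the Manhattan distance is never smaller than the Euclidean one.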

Page 15:

Background: Cluster-to-Cluster Distance Measures

We define the average inter-cluster (D2), average intra-cluster (D3), and variance increase (D4) distances as:

D2 = sqrt( (Σ_{i∈C1} Σ_{j∈C2} ||X_i − X_j||²) / (N1·N2) )

D3 = sqrt( (Σ_{l≠m} ||X_l − X_m||²) / ((N1+N2)(N1+N2−1)) ), taken over the merged cluster (i.e., the diameter of the merged cluster)

D4 = the increase in total squared deviation from the centroid caused by merging the two clusters
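A sketch of D2 and D3 for lists of point tuples (D4, the variance increase, is omitted for brevity; the function names are mine):

```python
import math

def d2(c1, c2):
    """Average inter-cluster distance (D2): RMS distance over all
    cross-cluster point pairs."""
    s = sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p in c1 for q in c2)
    return math.sqrt(s / (len(c1) * len(c2)))

def d3(c1, c2):
    """Average intra-cluster distance (D3): the diameter of the merged cluster."""
    merged = list(c1) + list(c2)
    n = len(merged)
    s = sum(sum((a - b) ** 2 for a, b in zip(p, q))
            for p in merged for q in merged)
    return math.sqrt(s / (n * (n - 1)))
```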

Page 16:

Background

Cluster {Xi}:

i = 1, 2, …, N1

Cluster {Xj}:

j = N1+1, N1+2, …, N1+N2

Page 17:

Background

Cluster Xl = {Xi} + {Xj}:

l = 1, 2, …, N1, N1+1, N1+2, …, N1+N2

Page 18:

Background

Optional Data Preprocessing (Normalization)

Avoids bias caused by dimensions with a large spread

But, large spread may naturally describe data!
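One common normalization is a per-dimension z-score. The slide leaves the preprocessing method open, so this is just an illustrative sketch:

```python
import statistics

def zscore(points):
    """Per-dimension z-score: subtract the mean and divide by the standard
    deviation, so no dimension dominates distances by sheer spread."""
    cols = list(zip(*points))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]   # guard zero spread
    return [tuple((x - m) / s for x, m, s in zip(p, means, sds))
            for p in points]

# The second dimension's spread (200 vs 2) no longer dominates:
print(zscore([(0, 100), (2, 300)]))   # [(-1.0, -1.0), (1.0, 1.0)]
```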

Page 19:

Clustering Feature

“How much information should be kept for each subcluster?”

“How is the information about subclusters organized?”

“How efficiently is the organization maintained?”

Page 20:

Clustering Feature

A Clustering Feature (CF) summarizes a sub-cluster of N data points:

CF = (N, LS, SS), where LS = Σ_i X_i is the linear sum and SS = Σ_i X_i² is the square sum

Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
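A sketch of a CF as a plain Python tuple, with the additivity theorem and the centroid and radius recovered from the CF alone (the tuple representation and function names are mine):

```python
import math

def cf(points):
    """Clustering Feature of a set of points: CF = (N, LS, SS)."""
    n = len(points)
    ls = tuple(sum(xs) for xs in zip(*points))       # linear sum, per dimension
    ss = sum(x * x for p in points for x in p)       # square sum (a scalar)
    return (n, ls, ss)

def cf_add(a, b):
    """Additivity theorem: the CF of two merged subclusters is the
    component-wise sum of their CFs."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            a[2] + b[2])

def cf_centroid(c):
    n, ls, _ = c
    return tuple(x / n for x in ls)

def cf_radius(c):
    """Radius from the CF alone: R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = c
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))
```

Because of additivity, a parent node's CF is just the sum of its children's CFs, which is what makes incremental maintenance cheap.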

Page 21:

Properties of Clustering Feature

CF entry is more compact

Stores significantly less than all of the data points in the sub-cluster

A CF entry has sufficient information to calculate the centroid, R, D, and the distances D0-D4

Additivity theorem allows us to merge sub-clusters incrementally & consistently

Page 22:

CF-Tree

Page 23:

Properties of CF-Tree

Each non-leaf node has at most B entries

Each leaf node has at most L CF entries which each satisfy threshold T, a maximum diameter or radius

P (page size in bytes) is the maximum size of a node

Compact: each leaf entry summarizes a subcluster, not a single data point!

Page 24:

CF-Tree Insertion

Recurse down from root, find the appropriate leaf

Follow the "closest" CF path, w.r.t. D0 / … / D4

Modify the leaf

If the closest leaf entry cannot absorb the new point without exceeding threshold T, make a new CF entry. If there is no room in the leaf for the new entry, split the leaf node, and propagate splits recursively toward the root if necessary.

Traverse back & up

Update CFs on the path or splitting nodes to reflect the new addition to the cluster

Merging refinement

Splits are caused by page size, which may result in unnatural clusters. Merge nodes to try to compensate for this.
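A hedged sketch of the absorption step at a single leaf, without node splitting or the tree walk (the names and simplified structure are mine; a real CF-tree splits a full leaf and propagates splits upward):

```python
import math

def point_cf(p):
    """CF of a single point."""
    return (1, tuple(p), sum(x * x for x in p))

def cf_add(a, b):
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_radius(c):
    n, ls, ss = c
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

def insert_point(leaf_entries, p, T):
    """Absorb p into the closest leaf CF if the merged radius stays <= T;
    otherwise start a new entry. (A real CF-tree would also split the leaf
    once it exceeds L entries and propagate splits toward the root.)"""
    pc = point_cf(p)
    if leaf_entries:
        def dist2(e):                  # squared distance to an entry's centroid
            n, ls, _ = e
            return sum((x - y / n) ** 2 for x, y in zip(p, ls))
        i = min(range(len(leaf_entries)), key=lambda j: dist2(leaf_entries[j]))
        merged = cf_add(leaf_entries[i], pc)
        if cf_radius(merged) <= T:     # threshold test: can the entry absorb p?
            leaf_entries[i] = merged
            return
    leaf_entries.append(pc)
```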

Page 25:

CF-Tree Anomalies

Anomaly 1

Two subclusters that should be together are split across two nodes due to page size, or

Two subclusters that should not be in the same cluster are in the same node

Page 26:

CF-Tree Anomalies

Anomaly 2

Equal-valued data points inserted at different times may end up in different leaf entries

Page 27:

CF-Tree Rebuilding

If we run out of memory, increase threshold T

With a larger threshold, each CF absorbs more data but is less granular: leaf entry clusters become larger

Rebuilding "pushes" CFs over: the existing leaf CFs are reinserted into a new tree, and the larger T allows nearby CFs to group together

Reducibility theorem

Rebuilding with a larger T results in a CF-tree as small as or smaller than the original

Rebuilding needs at most h (height of tree) extra pages of memory

Page 28:

BIRCH Overview

Page 29:

The Algorithm: BIRCH

Phase 1: Load data into memory

Build an initial in-memory CF-tree with the data (one scan)

Subsequent phases are

fast (no more I/O needed, work on sub-clusters instead of individual data points)

accurate (outliers are removed)

less order sensitive (because the CF-Tree forms an initial ordering of the data)

Page 30:

The Algorithm: BIRCH

Page 31:

The Algorithm: BIRCH

Phase 2: Condense data

Resizes the data summary so that Phase 3 runs on an optimally sized set of subclusters

Rebuild the CF-tree with a larger T

Remove more outliers

Group together crowded subclusters

Condensing is optional

Page 32:

The Algorithm: BIRCH

Phase 3: Global clustering

Use an existing clustering algorithm (e.g., agglomerative hierarchical clustering, KMEANS, CLARANS) on the CF entries

Helps fix problem where natural clusters span nodes (Anomaly 1)

Page 33:

The Algorithm: BIRCH

Phase 4: Cluster refining

Do additional passes over the dataset & reassign data points to the closest centroid from phase 3

Refining is optional

Fixes the problem with CF-trees where same-valued data points may be assigned to different leaf entries (anomaly 2)

Always converges to a minimum

Allows us to discard more outliers
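One Phase-4-style refinement pass can be sketched as a single nearest-centroid reassignment followed by a centroid recomputation (an illustrative simplification; names are mine):

```python
def refine(points, centroids):
    """One Phase-4-style pass: reassign every raw data point to its closest
    centroid from Phase 3, then recompute the centroids."""
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for p in points:
        i = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        clusters[i].append(p)
    return [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)]
```

Because assignment is by distance only, equal-valued points necessarily land in the same cluster, which is how this pass repairs Anomaly 2.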

Page 34:

Memory Management

BIRCH's memory usage is determined by the data distribution, not the data size

In phases 1-3, BIRCH uses all available memory to generate clusters as granular as possible

Phase 4 can correct inaccuracies caused by insufficient memory (i.e., lack of granularity)

Time/space trade-off: if memory is low, spend more time in phase 4

Page 35:

Experimental Results

Input parameters:

Memory (M): 5% of data set

Distance equation: D2 (average intercluster distance)

Quality equation: weighted average diameter (D)

Initial threshold (T): 0.0

Page size (P): 1024 bytes

Phase 3 algorithm: an agglomerative Hierarchical Clustering algorithm

One refinement pass in phase 4

Page 36:

Experimental Results

Create 3 synthetic data sets for testing

Also create an ordered copy for testing input order

KMEANS and CLARANS require entire data set to be in memory

Initial scan is from disk, subsequent scans are in memory

Page 37:

Experimental Results

Intended clustering

Page 38:

Experimental Results

KMEANS clustering

DS  Time  D     #Scan    DS  Time  D     #Scan
1   43.9  2.09  289      1o  33.8  1.97  197
2   13.2  4.43  51       2o  12.7  4.20  29
3   32.9  3.66  187      3o  36.0  4.35  241

(the "o" data sets are the ordered copies)

Page 39:

Experimental Results

CLARANS clustering

DS  Time  D     #Scan    DS  Time  D     #Scan
1   932   2.10  3307     1o  794   2.11  2854
2   758   2.63  2661     2o  816   2.31  2933
3   835   3.39  2959     3o  924   3.28  3369

Page 40:

Experimental Results

BIRCH clustering

DS  Time  D     #Scan    DS  Time  D     #Scan
1   11.5  1.87  2        1o  13.6  1.87  2
2   10.7  1.99  2        2o  12.1  1.99  2
3   11.4  3.95  2        3o  12.2  3.99  2

Page 41:

Conclusions & Practical Use

Pixel classification in images

From top to bottom:

BIRCH classification

Visible wavelength band

Near-infrared band

Page 42:

Conclusions & Practical Use

Image compression using vector quantization

Generate codebook for frequently occurring patterns

BIRCH performs faster than CLARANS or LBG, while getting better compression and nearly as good quality

Page 43:

Conclusions & Practical Use

BIRCH works with very large data sets

Explicitly bounded by computational resources

Runs with specified amount of memory

Superior to CLARANS and KMEANS

Quality, speed, stability and scalability

Page 44:

Exam Questions

What is the main limitation of BIRCH?

BIRCH only works with metric attributes (i.e., data with Euclidean coordinates)

Name the two algorithms in BIRCH clustering:

CF-Tree Insertion

CF-Tree Rebuilding

Page 45:

Exam Questions

What is the purpose of phase 4 in BIRCH?

Slides 33-34: guaranteed convergence to a minimum, discards more outliers, ensures duplicate data points end up in the same cluster, and compensates for low memory availability during previous phases