
Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Jeremy Tantrum, Department of Statistics, University of Washington

joint work with Alejandro Murua & Werner Stuetzle, Insightful Corporation and University of Washington

This work has been supported by NSA grant 62-1942

Motivating Example

• Consider clustering documents
• Topic Detection and Tracking corpus
  • 15,863 news stories for one year from Reuters and CNN
  • 25,000 unique words
  • Possibly many topics
• Large numbers of observations
• High dimensions
• Many groups

Goal of Clustering

[Figure: scatter plot of the example data]

• Detect that there are 5 or 6 groups
• Assign observations to groups

NonParametric Clustering

Premise:
• Observations are sampled from a density p(x)
• Groups correspond to modes of p(x)

[Figure: a density p(x) with a rug plot of the sampled observations]

NonParametric Clustering

Fitting: Estimate p(x) nonparametrically and find significant modes of the estimate

[Figure: nonparametric density estimates of p(x) with a rug plot of the observations]
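For intuition, here is a one-dimensional sketch of this fitting step (a SciPy kernel density estimate plus a grid search for local maxima); deciding which modes are statistically significant is a separate question that this sketch does not address:

    import numpy as np
    from scipy.stats import gaussian_kde

    def modes_1d(x, grid_size=512):
        # kernel density estimate evaluated on a grid; candidate modes are the
        # local maxima of the estimated density
        kde = gaussian_kde(x)
        grid = np.linspace(x.min(), x.max(), grid_size)
        dens = kde(grid)
        is_peak = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
        return grid[1:-1][is_peak]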

Model Based Clustering

Premise:
• Observations are sampled from a mixture density p(x) = Σg πg pg(x)
• Groups correspond to mixture components

Model Based Clustering

[Figure: a Gaussian mixture density with a rug plot of the observations]

Fitting: Estimate the mixture weights πg and the parameters of each component pg(x)

Model Based Clustering

Fitting a Mixture of Gaussians
• Use the EM algorithm to maximize the log likelihood (a sketch of the iteration follows)
  – Estimates the probabilities of each observation belonging to each group
  – Maximizes the likelihood given these probabilities
  – Requires a good starting point
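A minimal sketch of this EM iteration for a full-covariance Gaussian mixture (plain NumPy/SciPy, not the implementation used in the talk); the starting weights, means and covariances are assumed to come from the hierarchical step described next:

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gaussian_mixture(X, weights, means, covs, n_iter=100, tol=1e-6):
        # X: (n, d) data; weights/means/covs: starting parameters for k components
        n, d = X.shape
        k = len(weights)
        prev_ll = -np.inf
        for _ in range(n_iter):
            # E-step: probability that each observation belongs to each group
            dens = np.column_stack([
                w * multivariate_normal.pdf(X, mean=m, cov=S)
                for w, m, S in zip(weights, means, covs)])
            ll = np.log(dens.sum(axis=1)).sum()          # mixture log likelihood
            resp = dens / dens.sum(axis=1, keepdims=True)
            # M-step: maximize the likelihood given these probabilities
            ng = resp.sum(axis=0)
            weights = ng / n
            means = (resp.T @ X) / ng[:, None]
            covs = [((resp[:, g][:, None] * (X - means[g])).T @ (X - means[g])) / ng[g]
                    for g in range(k)]
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return weights, means, covs, ll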

Model Based Clustering

Hierarchical Clustering
• Provides a good starting point for the EM algorithm
• Start with every point being its own cluster
• Merge the two closest clusters (see the merge-cost sketch below)
  – Measured by the decrease in likelihood when those two clusters are merged
  – Uses the Classification Likelihood – not the Mixture Likelihood
• Algorithm is quadratic in the number of observations
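The merge criterion can be sketched as the decrease in classification log likelihood when two clusters are pooled, under a full-covariance Gaussian model for each cluster (a minimal version of my own; clusters too small for a non-singular covariance are not handled):

    import numpy as np

    def classification_loglik(X):
        # maximized Gaussian log likelihood of one cluster (MLE mean and covariance);
        # the classification likelihood of a partition sums this over clusters
        n, d = X.shape
        S = np.atleast_2d(np.cov(X, rowvar=False, bias=True))   # MLE covariance
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

    def merge_cost(XA, XB):
        # decrease in classification log likelihood if clusters A and B are merged;
        # agglomeration repeatedly merges the pair with the smallest cost
        return (classification_loglik(XA) + classification_loglik(XB)
                - classification_loglik(np.vstack([XA, XB])))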

Likelihood Distance

[Figure: two one-dimensional examples showing components p1(x), p2(x) and the merged fit p(x) – one merge gives a small decrease in likelihood, the other a big decrease in likelihood]

Bayesian Information Criterion

• Choose the number of clusters by maximizing the Bayesian Information Criterion, BIC = 2 log L − r log n (a sketch follows)
  – r is the number of parameters
  – n is the number of observations
• Log likelihood penalized for complexity
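A sketch of the BIC-based choice of the number of clusters, using the sign convention BIC = 2 log L − r log n that is standard in model-based clustering (assumed here, since the slide's formula is not reproduced) and a full-covariance parameter count:

    import numpy as np

    def n_params(k, d):
        # r for k full-covariance Gaussian components in d dimensions:
        # (k - 1) mixing weights + k*d means + k*d*(d+1)/2 covariance entries
        return (k - 1) + k * d + k * d * (d + 1) // 2

    def bic(loglik, k, d, n):
        # log likelihood penalized for complexity; larger is better
        return 2.0 * loglik - n_params(k, d) * np.log(n)

    # hypothetical usage, where fit_mixture(X, k) runs the hierarchical
    # initialization plus EM and returns the maximized log likelihood:
    # best_k = max(range(1, k_max + 1),
    #              key=lambda k: bic(fit_mixture(X, k), k, X.shape[1], len(X)))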

Fractionation

Original Data – size n
• Split the data into n/M fractions of size M
• Partition each fraction into αM clusters (α < 1)
• This yields αn clusters (meta-observations)
• If αn > M, repeat on the meta-observations

M is the largest number of observations for which a hierarchical O(M²) algorithm is computationally feasible

Invented by Cutting, Karger, Pedersen and Tukey for nonparametric clustering of large datasets.

Fractionation
• αn meta-observations after the first round
• α²n meta-observations after the second round
• αⁱn meta-observations after the ith round
• For the ith pass, we have αⁱ⁻¹n/M fractions taking O(M²) operations each
• Total number of operations is Σᵢ αⁱ⁻¹ (n/M) O(M²) = O(nM/(1−α))
• Total running time is linear in n!

• Use model based clustering
• Meta-observations contain all sufficient statistics – (ni, μi, Σi) – so merging can work on meta-observations without revisiting the raw data (see the pooling sketch below)
  – ni is the number of observations – size
  – μi is the mean – location
  – Σi is the covariance matrix – shape and volume
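A sketch (my notation, not code from the talk) of why these statistics suffice: two meta-observations can be pooled exactly, so hierarchical merging can operate on meta-observations instead of the raw observations they summarize:

    import numpy as np

    def pool_meta(nA, muA, SA, nB, muB, SB):
        # exact size, mean and MLE covariance of the union of two clusters,
        # computed from their meta-observations alone
        n = nA + nB
        mu = (nA * muA + nB * muB) / n
        dA, dB = muA - mu, muB - mu
        S = (nA * (SA + np.outer(dA, dA)) + nB * (SB + np.outer(dB, dB))) / n
        return n, mu, S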

Model Based Fractionation

An example: 400 observations in 4 groups

[Figure: observations in the first fraction]

[Figure: 10 meta-observations from the first fraction; 10 meta-observations from the second fraction]

[Figure: 10 meta-observations from the third fraction]

[Figure: 10 meta-observations from the fourth fraction]

[Figure: the 40 meta-observations]

[Figure: the final clusters chosen by BIC – success!]

Model Based Fractionation

The data – 400 observations in 25 groups

[Figure: observations in fraction 1; 10 meta-observations from the first fraction; 10 meta-observations from the second fraction]

[Figure: 10 meta-observations from the third fraction]

[Figure: 10 meta-observations from the fourth fraction]

[Figure: the 40 meta-observations]

[Figure: the clusters chosen by BIC – fractionation fails!]

Example 2

Refractionation

Problem:
• If the number of meta-observations generated from a fraction is less than the number of groups in that fraction, then two or more groups will be merged.
• Once observations from two groups are merged they can never be split again.

Solution:
• Apply fractionation repeatedly.
• Use meta-observations from the previous pass of fractionation to create “better” fractions (control flow sketched below).
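A simplified control-flow sketch of fractionation with refractionation; cluster_fraction (model-based hierarchical clustering of one fraction into a requested number of meta-observations) and regroup (building the “better” fractions for the next pass from the previous pass's clustering) are hypothetical helpers standing in for the steps described above, not the authors' implementation:

    def fractionation_pass(items, M, alpha, cluster_fraction):
        # split into fractions of size at most M, cluster each into alpha*M
        # meta-observations, repeat until at most M meta-observations remain
        while len(items) > M:
            next_items = []
            for start in range(0, len(items), M):
                fraction = items[start:start + M]
                next_items.extend(
                    cluster_fraction(fraction, max(1, int(alpha * len(fraction)))))
            items = next_items
        return items

    def refractionation(observations, M, alpha, n_passes, cluster_fraction, regroup):
        items = list(observations)
        for p in range(n_passes):
            metas = fractionation_pass(items, M, alpha, cluster_fraction)
            # observations whose meta-observations fall in the same cluster are
            # placed in the same fraction for the next pass
            if p + 1 < n_passes:
                items = regroup(observations, metas)
        return metas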

Example 2 Continued

[Figure: the 40 meta-observations; 4 new clusters; 4 new fractions; observations in the new fraction 1]

[Figure: clusters from the first fraction; clusters from the second fraction]

[Figure: clusters from the third fraction]

[Figure: clusters from the fourth fraction]

[Figure: the 40 meta-observations]

[Figure: clusters chosen by BIC]

Example 2 – Pass 2

The 40 meta-observations of pass 2 of fractionation

[Figure: 4 new clusters; 4 new fractions; observations in the new fraction 1; clusters from the first fraction; clusters from the second fraction]

[Figure: clusters from the third fraction]

[Figure: clusters from the fourth fraction]

[Figure: the 40 meta-observations]

[Figure: clusters chosen by BIC – refractionation succeeds]

Example 2 – Pass 3

Realistic Example
• 1100 documents from the TDT corpus, partitioned by people into 19 topics
  – Transformed into a 50-dimensional space using Latent Semantic Indexing

[Figure: projection of the data onto a plane – colors represent topics]

Realistic Example
• Want to create a dataset with more observations and more groups
• Idea: replace each group with a scaled and transformed version of the entire data set (sketched below)
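A sketch of that construction; the scale factor and the use of each group's mean as the offset are illustrative assumptions, since the slide does not give the exact transformation:

    import numpy as np

    def expand_dataset(X, labels, scale=0.2):
        # replace each original group by a scaled copy of the whole data set
        # shifted to that group's mean: 1100 observations in 19 groups become
        # 20,900 observations in 361 groups
        labels = np.asarray(labels)
        groups = np.unique(labels)
        k = len(groups)
        X_centered = X - X.mean(axis=0)
        new_X, new_labels = [], []
        for j, g in enumerate(groups):
            shift = X[labels == g].mean(axis=0)
            new_X.append(shift + scale * X_centered)
            new_labels.append(j * k + np.searchsorted(groups, labels))  # 361 group ids
        return np.vstack(new_X), np.concatenate(new_labels)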

Realistic Example
To measure the similarity of clusters to groups: the Fowlkes–Mallows index
• Geometric average of:

– Probability of 2 randomly chosen observations from the same cluster being in the same group

– Probability of 2 randomly chosen observations from the same group being in the same cluster

• Fowlkes–Mallows index near 1 means clusters are good estimates of the groups

• Clustering the 1100 documents gives a Fowlkes–Mallows index of 0.76 – our “gold standard”

Realistic Example
• 19×19 = 361 clusters, 19×1100 = 20,900 observations in 50 dimensions
• Fraction size ≈ 1000 with 100 meta-observations per fraction
• 4 passes of fractionation, choosing 361 clusters

Distribution of the number of groups per fraction (nf = number of fractions):

Pass   Min   Median   Max   nf
1      270   289      296   20
2       18    88      150   18
3       18    19       60   17
4       19    19       58   16

Realistic Example

Pass   Fowlkes–Mallows   Purity of the clusters
1      0.325             1729
2      0.554              908
3      0.616              671
4      0.613              651

Purity of the clusters = the sum of the number of groups represented in each cluster:
• 361 is perfect (a sketch of this measure follows)
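A sketch of that purity measure from cluster and group label vectors (my helper, not from the talk):

    import numpy as np

    def purity(cluster_labels, group_labels):
        # sum over clusters of the number of distinct groups represented in the
        # cluster; equals the number of clusters (here 361) when every cluster
        # contains only one group
        cluster_labels = np.asarray(cluster_labels)
        group_labels = np.asarray(group_labels)
        return sum(len(np.unique(group_labels[cluster_labels == c]))
                   for c in np.unique(cluster_labels))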



Refractionation:
• Purifies fractions
• Successfully deals with the case where the number of groups is greater than M, the number of meta-observations

Contributions

• Model Based Fractionation
  – Extended the fractionation idea to the parametric setting
    • Incorporates information about the size, shape and volume of clusters
    • Chooses the number of clusters
  – Still linear in n
• Model Based ReFractionation
  – Extended fractionation to handle a larger number of groups

Extensions

• Extend to 100,000s of observations – 1000s of groups
  – Currently the number of groups must be less than M
• Extend to a more flexible class of models
  – With small groups in high dimensions, we need a more constrained model (fewer parameters) than the full covariance model
  – Mixture of Factor Analyzers

Fowlkes-Mallows Index

Pr(2 documents in the same group | they are in the same cluster)
Pr(2 documents in the same cluster | they are in the same group)

The index is the geometric mean of these two probabilities, computed from the cluster-by-group contingency table nji (a sketch follows the table):

Clusters \ True groups   1     2     …   I     Total
1                        n11   n12   …   n1I   n1·
2                        n21   n22   …   n2I   n2·
…                        …     …     …   …     …
J                        nJ1   nJ2   …   nJI   nJ·
Total                    n·1   n·2   …   n·I   n
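A sketch of the pair-counting computation of the index from this table; it is equivalent to the geometric mean of the two conditional probabilities above, and sklearn.metrics.fowlkes_mallows_score returns the same quantity:

    import numpy as np

    def fowlkes_mallows(cluster_labels, group_labels):
        # contingency table n[j, i] = number of documents in cluster j and group i
        clusters, ci = np.unique(cluster_labels, return_inverse=True)
        groups, gi = np.unique(group_labels, return_inverse=True)
        n = np.zeros((len(clusters), len(groups)))
        np.add.at(n, (ci, gi), 1)
        same_both = (n * (n - 1)).sum()                    # pairs together in both
        same_cluster = (n.sum(1) * (n.sum(1) - 1)).sum()   # pairs sharing a cluster (nj·)
        same_group = (n.sum(0) * (n.sum(0) - 1)).sum()     # pairs sharing a group (n·i)
        return same_both / np.sqrt(same_cluster * same_group)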
