Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation

Advisor: Dr. Hsu
Graduate: You-Cheng Chen
Authors: Jeremy Tantrum, Alejandro Murua, Werner Stuetzle

Outline

Motivation
Objective
Introduction
Model-based Fractionation
Model-based ReFractionation
Example
Conclusions
Personal Opinion

Motivation

Propose an extended method that improves the performance of model-based clustering and apply it to large datasets.

Objective

Apply Fractionation and Refractionation to model-based clustering.

Introduction

Model-based clustering in a nutshell

Sample: $x_1, x_2, \ldots, x_n$

$p_g(x)$ is the density modeling group $g$

$\pi_g$ is the prior probability that a randomly chosen observation belongs to group $g$

$p(x) = \sum_{g=1}^{G} \pi_g \, p_g(x)$
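As a small illustration of the mixture density, the sketch below evaluates p(x) as the prior-weighted sum of group densities and assigns an observation to the group with the largest posterior probability. The univariate Gaussian group densities, the parameter values, and the function names are assumptions made for this example; the slides do not specify them.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(x, priors, means, sds):
    """p(x) = sum over groups g of prior_g * p_g(x)."""
    return sum(pi * normal_pdf(x, m, s) for pi, m, s in zip(priors, means, sds))

def posterior(x, priors, means, sds):
    """Posterior probability of each group given x (Bayes' rule)."""
    weighted = [pi * normal_pdf(x, m, s) for pi, m, s in zip(priors, means, sds)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical two-group mixture, for illustration only.
priors, means, sds = [0.3, 0.7], [0.0, 5.0], [1.0, 1.0]
post = posterior(0.5, priors, means, sds)
label = max(range(len(post)), key=post.__getitem__)  # most probable group
```

Here the observation 0.5 lies close to the first group's mean, so it is assigned to that group even though the second group has the larger prior.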

Introduction

Model-based clustering in a nutshell

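We can use the Approximate Weight of Evidence (AWE) to estimate the number of groups:

$\hat{G} = \arg\max_G \left( 2L(G) - 2 r_G \left( \tfrac{3}{2} + \log n \right) \right)$

where $L(G)$ is the log-likelihood of the model with $G$ groups, $r_G$ is its number of parameters, and $n$ is the sample size.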
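The selection rule can be sketched in a few lines. This example assumes the criterion has the form 2L(G) - 2 r_G (3/2 + log n), with L(G) a log-likelihood and r_G a parameter count; the log-likelihoods and parameter counts below are made-up numbers, and the function names are hypothetical.

```python
import math

def awe_score(loglik, n_params, n):
    """Score 2*L(G) - 2*r_G*(3/2 + log n); larger is better."""
    return 2.0 * loglik - 2.0 * n_params * (1.5 + math.log(n))

def choose_num_groups(logliks, n_params, n):
    """Return the candidate G with the largest score.

    logliks and n_params are dicts keyed by the candidate G.
    """
    return max(logliks, key=lambda g: awe_score(logliks[g], n_params[g], n))

# Hypothetical log-likelihoods and parameter counts for G = 1..4.
logliks = {1: -2500.0, 2: -2100.0, 3: -2080.0, 4: -2075.0}
n_params = {1: 2, 2: 5, 3: 8, 4: 11}
best_g = choose_num_groups(logliks, n_params, n=1000)
```

With these numbers the gain in log-likelihood beyond two groups is smaller than the added penalty, so two groups are selected.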

Introduction

Previous work on model-based clustering for large datasets

The Scalable EM (SEM) algorithm can be used to fit mixture models to large datasets, but it cannot estimate the number of groups.

The simplest and potentially fastest approach is to draw a sample of the data.

2. Fractionation

Original Fractionation algorithm

1 Split the data into fractions of size M.

2 Cluster each fraction into a fixed number αM of clusters, where α < 1. Summarize each cluster by its mean. We refer to these cluster means as meta-observations.

3 If the total number of meta-observations is greater than M, return to Step 1, treating the meta-observations as the data.

4 Cluster the meta-observations into G clusters.

5 Assign each individual observation to the cluster with the closest mean.
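The five steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the clustering routine is a simple centroid-merging agglomeration chosen for brevity, fractions are formed by splitting the data in the order given, and all function names and test data are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points, k):
    """Merge the two closest cluster means until k clusters remain; return the means."""
    clusters = [(tuple(p), 1) for p in points]  # (mean, size)
    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: dist2(clusters[ab[0]][0], clusters[ab[1]][0]))
        (m1, n1), (m2, n2) = clusters[i], clusters[j]
        merged = tuple((n1 * x + n2 * y) / (n1 + n2) for x, y in zip(m1, m2))
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append((merged, n1 + n2))
    return [mean for mean, _ in clusters]

def fractionate(data, M, alpha, G):
    """Return (cluster means, label of each observation)."""
    if M < 2 or not 0 < alpha < 1:
        raise ValueError("need M >= 2 and 0 < alpha < 1")
    current = [tuple(p) for p in data]
    # Clusters per fraction, capped below M so every pass shrinks the data.
    per_fraction = max(1, min(M - 1, round(alpha * M)))
    # Steps 1-3: split, summarize each fraction by cluster means, repeat.
    while len(current) > M:
        fractions = [current[s:s + M] for s in range(0, len(current), M)]
        current = [m for f in fractions for m in agglomerate(f, per_fraction)]
    # Step 4: cluster the meta-observations into G clusters.
    centers = agglomerate(current, G)
    # Step 5: assign each observation to the cluster with the closest mean.
    labels = [min(range(len(centers)), key=lambda c: dist2(p, centers[c])) for p in data]
    return centers, labels

# Hypothetical data: two well-separated 2-D groups of 60 points each.
rng = random.Random(1)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(60)]
data += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(60)]
centers, labels = fractionate(data, M=40, alpha=0.25, G=2)
```

With M = 40 and α = 0.25, the 120 observations form 3 fractions of 10 clusters each, so the 30 resulting meta-observations fit within a single fraction and Step 4 runs after one pass.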

2-1. Model-based Fractionation

Main differences:

In model-based Fractionation, we use all sufficient statistics (the mean, the covariance, and the number of observations) to represent each cluster.

We use AWE to determine the number of clusters in Step 4.
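One reason these three quantities are enough is that the statistics of a merged cluster can be computed from the statistics of its parts, without revisiting the individual observations. The sketch below shows this for the count, the mean, and the maximum-likelihood covariance (divided by n, an assumption of this example); the function names and data points are hypothetical.

```python
def stats(points):
    """Count, mean, and ML covariance (divided by n) of a list of points."""
    n = len(points)
    d = len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    cov = [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n for j in range(d)]
        for i in range(d)
    ]
    return n, mean, cov

def merge_stats(n1, mean1, cov1, n2, mean2, cov2):
    """Combine the sufficient statistics of two clusters into those of their union."""
    n = n1 + n2
    d = len(mean1)
    mean = [(n1 * mean1[i] + n2 * mean2[i]) / n for i in range(d)]
    diff = [mean1[i] - mean2[i] for i in range(d)]
    # Pooled scatter plus a between-cluster term, divided by the total count.
    cov = [
        [
            (n1 * cov1[i][j] + n2 * cov2[i][j] + (n1 * n2 / n) * diff[i] * diff[j]) / n
            for j in range(d)
        ]
        for i in range(d)
    ]
    return n, mean, cov

# Hypothetical clusters, for illustration only.
a = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
b = [(4.0, 4.0), (6.0, 5.0)]
merged = merge_stats(*stats(a), *stats(b))
direct = stats(a + b)  # same statistics computed from the raw points
```

The merged statistics agree with those computed directly from the pooled observations, up to floating-point rounding.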

3. Model-based ReFractionation

Step 4 of the Fractionation algorithm is replaced by Steps 4a and 4b:

4a Cluster the meta-observations into G clusters, where G is determined by the AWE criterion.

4b Define the fractions for the i-th pass.

3.1 Illustration

M = 100, fractions = 4, meta-observations = 40

3.1 Illustration

Step 4a: use AWE to find G = 25

Step 4b

3.1 Illustration

Second pass

3.1 Illustration

2nd pass, 3rd pass

3.2 Scope of (Re)Fractionation

Let
ng be the number of groups in the data,
nf be the number of fractions,
nc be the number of clusters generated from each fraction in Step 2.

If ng > nc, this will lead to impure clusters, that is, clusters that mix observations from more than one group.
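The relationship between these quantities can be checked with simple arithmetic. The sketch below uses the illustration's M = 100 and 4 fractions; the values n = 400 and nc = 10 are inferred from 4 fractions yielding 40 meta-observations, and the ng values and the function name are hypothetical.

```python
import math

def fractionation_pass(n, M, n_c, n_g):
    """Summarize one pass: fractions, meta-observations, and two checks."""
    n_f = math.ceil(n / M)       # number of fractions
    n_meta = n_f * n_c           # meta-observations produced in Step 2
    return {
        "n_fractions": n_f,
        "n_meta": n_meta,
        "another_pass_needed": n_meta > M,  # Step 3 condition
        "impure_clusters": n_g > n_c,       # more groups than clusters per fraction
    }

# Hypothetical group counts: one below and one above n_c.
few_groups = fractionation_pass(n=400, M=100, n_c=10, n_g=8)
many_groups = fractionation_pass(n=400, M=100, n_c=10, n_g=25)
```

With 8 groups each fraction has enough clusters to keep the groups apart, while with 25 groups at least some of the 10 clusters in a fraction must mix groups whenever that fraction contains more than 10 of them.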

4. Example

4.1 Measuring the agreement between groups and clusters

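Fowlkes-Mallows index $= \dfrac{\sum_{g,j} \binom{n_{gj}}{2}}{\sqrt{\sum_{g} \binom{n_{g\cdot}}{2} \, \sum_{j} \binom{n_{\cdot j}}{2}}}$

where $n_{gj}$ is the number of observations from group $g$ assigned to cluster $j$, and $n_{g\cdot}$ and $n_{\cdot j}$ are the corresponding group and cluster totals.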
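The index can be computed directly from two label vectors by counting pairs of observations. The sketch below assumes the standard pair-counting form of the Fowlkes-Mallows index (pairs placed together in both partitions, divided by the geometric mean of the pairs placed together in each); the function name and labels are hypothetical.

```python
from collections import Counter
from math import comb, sqrt

def fowlkes_mallows(groups, clusters):
    """Fowlkes-Mallows index between true groups and cluster labels."""
    # Pairs of observations that share both a group and a cluster.
    same_both = sum(comb(c, 2) for c in Counter(zip(groups, clusters)).values())
    # Pairs that share a group, and pairs that share a cluster.
    same_group = sum(comb(c, 2) for c in Counter(groups).values())
    same_cluster = sum(comb(c, 2) for c in Counter(clusters).values())
    if same_group == 0 or same_cluster == 0:
        return 0.0
    return same_both / sqrt(same_group * same_cluster)

# Hypothetical labels: one observation of group 0 lands in the wrong cluster.
groups = [0, 0, 0, 1, 1, 1]
clusters = ["a", "a", "b", "b", "b", "b"]
fm = fowlkes_mallows(groups, clusters)
```

The index equals 1 when the clusters reproduce the groups exactly, regardless of how the clusters are labeled, and decreases as the two partitions disagree.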

4.3 Example 1

Groups = 19, n = 22000, M = 1000, clusters = 100

4.3 Example 3

Groups = 361, n = 20900, M = 1045, clusters = 100

Conclusions

We can study the performance of the AWE criterion for estimating the number of groups in a mixture of factor analyzers model.

Personal Opinion

We can use the strengths of another clustering method to compensate for the weaknesses of our own.
