linked based clustering and its theoretical foundations paper written by margareta ackerman and shai...
TRANSCRIPT
Linked Based Clustering and Its Theoretical Foundations
Paper written by Margareta Ackerman and Shai Ben-David
Presented by Yan T. YangYan T. Yang
Outline
Hierarchical clustering Linkage based clustering Theoretical foundation of clustering Characterization of linkage based
clustering
Outline
Hierarchical clustering Linkage based clustering Theoretical foundation of clustering Characterization of linkage based
clustering
Hierarchical clustering
Agglomerative Divisive
Hierarchical clustering
A
D
E
B
C
A B C D E
Raw Data Dendrogram
F
F
Agglomerative clustering
Raw Data Dendrogram
A B C D E F
A
D
E
B
C
F
Agglomerative clustering
Raw Data Dendrogram
A B C D E F
A
D
E
B
C
F
Agglomerative clustering
A
D
E
B
C
Raw Data Dendrogram
F
A B C D E F
Agglomerative clustering
A
D
E
B
C
Raw Data Dendrogram
F
A B C D E F
Agglomerative clustering
Raw Data Dendrogram
A B C D E F
A
D
E
B
C
F
Agglomerative clustering
Raw Data Dendrogram
A B C D E F
A
D
E
B
C
F
General Complexity: O(n3)
Agglomerative clustering
Raw Data Dendrogram
General Complexity: O(n3)
Linkage Based: O(n2)
Linkage Based Clustering
General Complexity: O(n3)
Linkage Based: O(n2)
Whenever a pair of points share the same cluster,
they are connected via a tight chain of points.
A
D
E
B
C
F
Linkage Based Clustering
General Complexity: O(n3)
Linkage Based: O(n2)
Whenever a pair of points share the same cluster,
they are connected via a tight chain of points.
A
D
E
B
C
F
3. Repeat until only k clusters remain.
Steps:
1. Start: singletons
2. At each step, merge the closest pair of clusters
Linkage-based clustering
Linkage Based: O(n2)
Dissimilarity Measure Linkage Criteria
A measure of distance between pairof observations.
A measure of distance between setsof observations
as function of dissimilarity measure.
Linkage-based clustering
Linkage Based: O(n2)
Euclidean distance
Minkowski distance (generalized Euclidean)
Mahalahobis distance (scaled Euclidean)
City block distance ( |*| version of Euclidean)
Dissimilarity Measure Linkage Criteria
Linkage-based clustering
Linkage Based: O(n2)
Dissimilarity Measure Linkage Criteria
Between two sets A and B
},:),(min{ :Linkage Single BbAabad
},:),(max{ :Linkage Complete BbAabad
Aa Bb
bad ),(|B||A|
1 :Linkage Average
Linkage-based clustering
Linkage Based: O(n2)
Linkage Criteria
Between two sets A and B
},:),(min{ :Linkage Single BbAabad
},:),(max{ :Linkage Complete BbAabad
Aa Bb
bad ),(|B||A|
1 :Linkage Average
Minimum Spanning Tree
Successive Cliques
UPGMA tree
Linkage-based clustering
Linkage Based: O(n2)
Linkage Criteria
Between two sets A and B
},:),(min{ :Linkage Single BbAabad
},:),(max{ :Linkage Complete BbAabad
Aa Bb
bad ),(|B||A|
1 :Linkage Average
SLINK [Defays]
CLINK [Sibson]
UPGMA [Murtagh]
Linkage-based clustering},:),(min{ :Linkage Single BbAabad
Add to cluster as long as there is a link
Linkage-based clustering},:),(min{ :Linkage Single BbAabad
Add to cluster as long as there is a link
Linkage-based clustering},:),(min{ :Linkage Single BbAabad
Add to cluster as long as there is a link
Linkage-based clustering},:),(min{ :Linkage Single BbAabad
Add to cluster as long as there is a link
Linkage-based clustering},:),(min{ :Linkage Single BbAabad
Add to cluster as long as there is a link
Linkage-based clustering},:),(min{ :Linkage Single BbAabad
Add to cluster as long as there is a link
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Linkage-based clustering},:),(max{ :Linkage Complete BbAabad
Add to cluster when a clique is formed
Theoretical Foundation
Many different clustering techniques: which one to use ?
Identify properties that separate input-output behavior of different clustering paradigms: [Ackerman, Ben-David, Loker]
1) Be intuitive and meaningful
2) Differentiate algorithms
The properties should
Theoretical Foundation
[Kleinberg, 2002]: abstract properties OR axioms.
[Zadeh, Ben-David 2009]: properties to characterize single linkage.
[Ackerman, Ben-David, Loker] : properties to characterize all linkage.
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Definition
Refinement
Hierarchical
Clustering function
Hierarchical
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Definition
A clustering C is a refinement of clustering C’ if every cluster in C’ is a union of some clusters in C.
CC’
Refinement
Hierarchical
Clustering function
Hierarchical
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Definition
Refinement
Hierarchical
A clustering function F(X,d,k):
Clustering function
Hierarchical
1. Take input X, all data points2. Distance measure, d
3. Number of clusters desired, k.4. Output the clusters
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Definition
Refinement
Hierarchical
A clustering function is hierarchical if forall X and all d and every 1<= k <= k’ <= |X| F(X,d,k’) is a refinement of F(X,d,k)
Clustering function
A B C D E F
6
43
2
Hierarchical
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Definition
Refinement
Hierarchical
Clustering function)||,,(
:any and any for if local is F
Cc
CdcFC
F(X,d,k)C X, d, k
C
Hierarchical
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Definition
Refinement
Hierarchical
Clustering function . and , allfor then distances,
cluster -between increasingfor except ,equals If
kd, X(X,d’,k)F(X,d,k)=F
dd’
d d’
Hierarchical
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Definition
Refinement
Hierarchical
Clustering function
Locality and outer consistency
Not axioms
Spectral clustering with
Ratio region
Normalized cut
does not satisfy these two.
Hierarchical
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Definition
Refinement
Hierarchical
Clustering function ),( 11 dX ),( 22 dX ),( 33 dX
),( 321 dXXX Hierarchical
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Theorem:
The clustering function is linkage based
If and only if
All of the properties are satisfied.
Hierarchical
Theoretical Foundation
Theorem:
The clustering function is linkage based
If and only if
All of the properties are satisfied.
properties are satisfied.
Properties
Extended richness
Outer consistency
Locality
Hierarchical
linkage based function
Theoretical Foundation
Theorem:
The clustering function is linkage based
If and only if
All of the properties are satisfied.
linkage based function properties are satisfied.
Straightforward
Properties
Extended richness
Outer consistency
Locality
Hierarchical
Theoretical Foundation
Theorem:
The clustering function is linkage based
If and only if
All of the properties are satisfied.
properties are satisfied.
Properties
Extended richness
Outer consistency
Locality
Hierarchical
linkage based function
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Hierarchical
linkage based function
A linkage function is a function l: {(X1,X2,d) : d is a dissimilarity function over X1 U X2} → R+
that satisfies the following:},:),(min{ :Linkage Single BbAabad
},:),(max{ :Linkage Complete BbAabad
Aa Bb
bad ),(|B||A|
1 :Linkage Average
- Representation independent
- Monotonic
- Any pair of clusters can be made arbitrarily distant
Theoretical Foundation
Properties
Extended richness
Outer consistency
Locality
Hierarchical
linkage based function
Goal
Given a clustering function F that satisfies the properties, define a linkage function l so that the linkage-based clustering based on l coincides with F (for every X, d and k).
F
Linkage-based
Theoretical FoundationGiven a clustering function F that satisfies the properties, define a linkage function l so that the linkage-based clustering based on l coincides with F (for every X, d and k).
F
Linkage-based
Define an operator <F: (A,B,d1) <F (C,D,d2) if there exists d that extends d1 and d2 such that when we run F on (A U B U C U D), A and B are merged before C and D.
A
B
C
D
F(AUBUCUD,d,4) A
B
C
D
F(AUBUCUD,d,3)
Theoretical FoundationGiven a clustering function F that satisfies the properties, define a linkage function l so that the linkage-based clustering based on l coincides with F (for every X, d and k).
F
Linkage-based
Define an operator <F: (A,B,d1) <F (C,D,d2) if there exists d that extends d1 and d2 such that when we run F on (A U B U C U D), A and B are merged before C and D.
A
B
C
D
F(AUBUCUD,d,4) A
B
C
D
F(AUBUCUD,d,3)
Theoretical Foundation- Prove that <F can be extended to a partial ordering - Use the ordering to define l
Lemma (<F is cycle-free)
Given a function F that is hierarchical, local, outer-consistent and satisfies extended richness, there are no (A1,B1,d1), (A2,B2,d2),… (An,Bn,dn) so that
(A1,B1,d1) <F (A2,B2,d2) … <F (An,Bn,dn)and (A1,B1,d1) = (An,Bn,dn)
By the above Lemma, the transitive closure of <F is a partial ordering.
There exists an order preserving function l that maps pairs of data sets to R (since <F is defined over a countable set)
It can be shown that l satisfies the properties of a linkage function.