by: moses charikar, chandra chekuri, tomas feder, and rajeev motwani presented by: sarah hegab
DESCRIPTION
Incremental Clustering And Dynamic Information Retrieval. By: MOSES CHARIKAR, CHANDRA CHEKURI, TOMAS FEDER, AND RAJEEV MOTWANI Presented By: Sarah Hegab. Outline:. Motivation Main Problem Hierarchical Agglomerative Clustering A Model Incremental Clustering - PowerPoint PPT PresentationTRANSCRIPT
1
By:MOSES CHARIKAR, CHANDRA CHEKURI,
TOMAS FEDER, ANDRAJEEV MOTWANI
Presented By: Sarah Hegab
2
Outline:Outline:
• Motivation
• Main Problem
• Hierarchical Agglomerative Clustering
• A Model Incremental Clustering
• Different incremental algorithms
• Lower Bounds for incremental algorithms
• Dual Problem
3
I. Main ProblemI. Main Problem
The clustering problem is as follows: given n points in a metric space M, partition the points into k clusters so as to minimize the maximum cluster diameter.
4
1.Greedy Incremental Clustering
a) Center-Greedy
b) Diameter-greedy
5
a) Center-Greedy
The center-greedy algorithm associates a center for each cluster and merges the two clusters whose centers are closest. The center of the old cluster with the larger radius becomes the new center
Theorem: The center-greedy algorithm’s performance ratio has a lower bound of 2k - 1.
6
0
v1 v2v3 v4 v5
0 1
0 1 1 0
-1 0
-1 0 1 0
S0 S1S3 S2 S2
a) Center-Greedy cont.
Proof:
• 1-Tree Construction
K=2
7
a) Center-Greedy cont.
• 2-Tree Graph
• Set Ai (in our example Ai={{v1},{v2}, {v3},{v4}})
v1 v2v3 v4 v5
S0 S1S3 S2 S2
11-1-
1-1
8
Post-Order Traverse
9
a) Center-Greedy cont.
Claims:• For 1 <= i <= 2k - 1, Ai is the set of
clusters of center-greedy which contain more than one vertex after the k + i vertices v1, . . . , vk+i are given.
• There is a k-clustering of G of diameter 1. The clustering which achieves the above diameter is {S0 US1, . . . , S2k-2 US2k-1}.
10
K=4
11
Competitiveness of Center-Greedy
Theorem : The center-greedy algorithm has performance ratio of 2k-1 in any metric space.
12
b) Diameter-Greedy
The diameter-greedy algorithm always merges those two clusters which minimize the diameter of the resulting merged cluster.
Theorem : The diameter-greedy algorithm’s performance ratio (log(k)) is even on the line.
13
b) Diameter-Greedy cont.
• Proof:
1) Assumptions
Ui = Uj=1Fi{{pij, qij}, {rij, sij}},
Vi = Uj=1Fi{{qij}, {rij}},
Wi = Uj=1Fi{{pij}, {qij, rij}},
Xi = Uj=1Fi{{pij}, {qij, rij}, {sij}},
Yi = Uj=1Fi{{pij, qij, rij}, {sij}},
Zi = Uj=1Fi{{pij, qij, rij, sij}}.
14
b) Diameter-Greedy cont.
• Proof:
2) Invariant : When the last element of Kt is received, diameter-greedy’s k+1 clusters are
(Ui=1t-2 Zi) UYt-1U Xt (Ur
i=t+1 Vi).
Since there are k+1 clusters, two of the clusters have to be merged and the algorithm merges two clusters in Vt+1 to form a cluster of diameter (t+1). Without loss of generality, we may assume that the clusters merged are {q(t+1)1} and {r(t+1)1}.
15
Competitiveness of Diameter-Greedy
Theorem : For k = 2, the diameter-greedy algorithm has a performance ratio 3 in any metric space.
16
2.Doubling Algorithm
a) Deterministic
b) Randomized
c) Oblivious
d) Randomized Oblivious
17
a) Deterministic doubling algorithm
• The algorithm works in phases• At the start of phase i it has k+1 clusters• Uses and , s.t /(1-)<=• At start of phase i the following is assumed:
1. for each cluster Cj , the radius of Cj defined as maxp Cj d(cj, p) is at most αdi
2. for each pair of clusters Cj and Cl, the inter-center distance d(cj, cl) => di
3. di <= opt.
18
a) Deterministic doubling algorithm
• Each phase has two stages
1- Merging stage, in which the algorithm reduces the number of clusters by merging certain pairs
2-Update stage, in which the algorithm accepts new updates and tries to maintain at most k clusters without increasing the radius of the clusters or violating the invariants
A phase ends when number of clusters exceeds k
19
a) Deterministic doubling algorithm
• Definition: The t-threshold graph on a set of points P = {p1, p2, . . . , pn} is the graph G=(P,E) such that (pi, pj) in E if and only if d(pi, pj) <= t.
• Merging stage defines di+1= di and a graph G di+1–threshold for centers c1,. . . , ck+1 .
• New clusters C’1…C’m. If m=k+1 this ends the phase i
20
a) Deterministic doubling algorithm• Lemma The pairwise distance between
cluster centers after the merging stage of phase i is at least di+1.
• Lemma The radius of the clusters after the merging stage of phase i is at most di+1+αdi<=αdi+1
• Update continues while the number of clusters is at most k. It is restricted by the radius bound αdi+1. Then phase i ends.
21
a) Deterministic doubling algorithm
• Initialization : the algorithm waits until k+1 points have arrived then enters phase 1, with each point as a center containing just itself. And d1 set to the distance between the closest pair of points
22
a) Deterministic doubling algorithm
• Lemma The k + 1 clusters at the end of the ith phase satisfy the following conditions:
1. The radius of the clusters is at most αdi+1.2. The pairwise distance between the cluster centers is at least d i+1.
3. di+1 <= OPT, where OPT is the diameter of the optimal clustering for the current set of points.
Theorem: The doubling algorithm has performance ratio 8 in any metric space.
23
a) Deterministic doubling algorithm
Example to show the analysis is tight:
k=>3.
Input consists of k+3 points p1…pk+3
the points p1…pk+1 have distance 1, pk+2 ,pk+3
have distance 4 from the others, and 8 from each other.
24
b) Randomized doubling algorithm
• Choose a random variable r from [1/e,1] according to the probability density function 1/r
• The min pairwise distance of the first k+1 point is x. And d1=rx
• =e,=e/(e-1)
25
b) Randomized doubling algorithm
Theorem : The randomized doubling algorithm has expected performance ratio 2e in any metric space. The same bound is also achieved for the radius measure.
26
c) Oblivious clustering algorithm
• Does not need to know k
• Assume we have un upper bound on the max distance between point which is 1.
• Points are maintained in a tree
27
c) Oblivious clustering algorithm cont.
At distance greater than 1/2i
Within dist. 1/2i-1
from parent
Where i is the depth of the vertex , i=>0
Root at depth 0
28
Illustration Of Oblivious clustering algorithm:
29
c) Oblivious clustering algorithm cont.
• How do we obtain the k clusters from the tree?• If k is given, and i is the greatest depth containing
at most k points. • These are the k cluster centers. The sub-trees of
the vertices at depth i are the clusters.• As points are added, the number of vertices at
depth i increases; if it goes beyond k, then we change i to i - 1, collapsing certain clusters; otherwise, the new point is inserted in one of the existing clusters.
30
c) Oblivious clustering algorithm cont.
Theorem : The algorithm that outputs the k clusters obtained from the tree construction has performance ratio 8 for the diameter measure and the radius measure.
• Optimal diameter is ½i+1 < d <= ½I
• Then points at depth i are in different clusters, so there are at most k of them.
• j=>i be the greatest depth containing at most k points.
• Subtrees are at a distance of the root within ½j + ½j+1 + ½j+2 + · · ·<= ½j-1< 4d.
31
d) Randomized Oblivious
• Distance threshold for depth i is r/ei
• r chosen once at random from [1,e], according to the PDF 1/r
• The expected diameter is at most 2e.OPT diameter
32
Lower Bounds
Theorem1: For k = 2, there is a lower bound of 2 and 2 - ½k/2 on the performance ratio of deterministic and randomized algorithms, respectively, for incremental clustering on the line.
33
Lower Bounds cont.
Theorem2: There is a lower bound of 1+21/2 on the performance ratio of any deterministic incremental clustering algorithm for arbitrary metric spaces.
34
Lower Bounds cont.
35
Lower Bounds cont.
Theorem3: For any e>0 and k=>2, there is a lower bound of 2 - e on the performance ratio of any randomized incremental algorithm.
36
Lower Bounds cont.
Theorem4: For the radius measure, no deterministic incremental clustering algorithm has a performance ratio better than 3 and no randomized algorithm has a ratio better than 3 – e for any fixed e > 0.
37
II. Dual ProblemII. Dual Problem
For a sequence of points p1,p2,...,pnRd, cover each point with a unit ball in d as it arrives, so as to minimize the total number of balls used.
38
II. Dual ProblemII. Dual Problem
Rogers Theorem: Rd can be covered by any convex shape with covering density O(d log d).
Theorem: For the dual clustering problem in Rd, there is an incremental algorithm with performance ratio O(2dd log d).
Theorem: For the dual clustering problem in d, any incremental algorithm must have performance ratio ( (log d)/(log log log d) ).
39