[ieee sixth ieee international conference on data mining - workshops (icdmw'06) - hong kong,...

Dynamic Algorithm for Graph Clustering Using Minimum Cut Tree

Barna Saha, Indian Institute of Technology, Kanpur

[email protected]

Pabitra Mitra, Indian Institute of Technology, Kharagpur

[email protected]

Abstract

In this paper we introduce a dynamic algorithm for clustering undirected graphs whose edge properties change continuously. The algorithm efficiently maintains high-quality clusters in the presence of edge insertions and deletions (updates). It is motivated by the minimum-cut tree based partitioning algorithms of [3] and [4]. Each update takes $O(k^3)$ time, where $k$ is the maximum size of any cluster. This is the worst-case time complexity; in general the time taken is much less. To the best of our knowledge, this is the first clustering algorithm for evolving graphs that provides a strong theoretical quality guarantee on the clusters.

1 Introduction

Graph clustering partitions a graph into several disjoint subgraphs (clusters), such that each cluster is heavily connected and inter-cluster connectivity is low. The problem has intrinsic relevance in computer science, mathematics and many applied areas. The most popular bicriteria measure for “good clustering” in this respect is given by Kannan et al. [6]. A slight variant of this criterion is proposed in [3]. In practice, both measures perform well.

There exists a large variety of graph clustering algorithms, based on spectral clustering [6], rapid mixing of Markov chains [1], and multilevel graph partitioning schemes such as METIS [7] and the MLKM algorithm of [2]. MLKM is, so far, the fastest algorithm known for graph clustering. A new direction in graph clustering has been introduced by modeling the clustering problem as a minimum-cut, maximum-flow problem on the underlying graph [4, 3]. Flake et al. have used this method to produce clusters with a theoretical quality measure [3]; the method works remarkably well in practice [3] and has also been used as a learning algorithm [8]. But if the underlying graph is dynamic in nature, these algorithms need to be run afresh on the entire graph each time there is a change in the edge relationships. Many real-world networks can be modeled as graphs, and most of these graphs are dynamic and continuously evolving. Examples include the World Wide Web (WWW) graph, the citation graph, and graphs generated from ad hoc sensor networks and mobile networks. Hence the need for dynamic clustering algorithms that can handle changes in edge relationships effectively arises naturally.

1.1 Contribution

Our contributions are as follows:

• We develop a dynamic graph clustering algorithm that can handle insertion and deletion of edges while clustering is in progress, in $O(k^3)$ time per update. This can be reduced to $O(k^2)$ using a heuristic. Here $k$ is the maximum size of any cluster, which is much less than the total size of the network, and generally logarithmic in the total size. Vertex addition and deletion can also be handled efficiently. If links and nodes can only be added, the clustering can be performed in time $O(mk^3)$. If $k$ is logarithmic in $n$, then since most real-world networks have constant average degree [3], the time taken to perform clustering is $O(n (\log n)^3)$, in contrast to the $O(n^3)$ algorithm of [3].

• A detailed theoretical analysis of the clustering quality is provided.

• The effectiveness of the algorithm is established through experimentation on several benchmark graphs with up to 30 thousand nodes and 1 million edges. Comparison results with the algorithm of [3] and MLKM [2] clearly demonstrate the superiority of our dynamic algorithm.

1.2 Organization

The rest of the paper is organized as follows. Section 2 delineates the clustering algorithm based on the minimum-cut tree developed in [3]. Section 3 describes our proposed dynamic graph clustering algorithm and gives a detailed proof of its quality guarantee. Section 4 gives the experimental results.


Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), 0-7695-2702-7/06 $20.00 © 2006

2 Clustering Using Mincut Tree

Let $G = (V, E)$ denote a weighted undirected graph with $n = |V|$ nodes (vertices) and $m = |E|$ links (edges). Each edge $e = (u, v)$, $u, v \in V$, has an associated weight $w(u, v) > 0$. The adjacency matrix $A$ of $G$ is an $n \times n$ matrix in which $A(i, j) = w(i, j)$ if $(i, j) \in E$, and $A(i, j) = 0$ otherwise. The corresponding adjacency list structure maintains, for every $i \in V$, only those $j$ for which $w(i, j) > 0$. The $w(i, j)$ values are also retained.

Let $s$ and $t$ be two nodes in $G(V, E)$, designated as source and destination respectively. The minimum cut of $G$ with respect to $s$ and $t$ is a partition of $V$ into $S$ and $V - S$, such that $s \in S$, $t \in V - S$, and the total weight of the edges linking vertices in the two parts is minimum. The sum of the edge weights across $S$ and $V - S$ is called the cut-value and is denoted by $c(S, V - S)$. $S$ is called the community of $s$. The minimum-cut tree $T_G$ of $G$, defined in [5], is a tree on $V$ such that the minimum cut of $G$ with respect to $s$ and $t$ can be obtained by inspecting the path between $s$ and $t$ in $T_G$. Removal of the minimum-weight edge on the path yields the two parts, and the weight of that edge gives the cut-value.
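The notion of a minimum cut and of the community of $s$ can be illustrated with a small sketch. It is not part of the original paper: the function name `min_st_cut`, the dictionary-based graph representation and the choice of the Edmonds-Karp maximum-flow method are assumptions made only for illustration.

```python
from collections import deque


def min_st_cut(adj, s, t):
    """Return (cut_value, S): a minimum s-t cut and the side S containing s.

    adj maps each vertex to {neighbour: weight}, with every undirected
    edge stored in both directions (an illustrative representation).
    """
    # Residual capacities: an undirected edge of weight w can carry w either way.
    res = {u: dict(nbrs) for u, nbrs in adj.items()}
    value = 0
    while True:
        # Breadth-first search for a shortest augmenting path (Edmonds-Karp).
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in res[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path is left: the vertices reached from s form S.
            return value, set(parent)
        # Bottleneck capacity along the path found, then push that much flow.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        value += bottleneck


# A 4-cycle with two heavy and two light edges.
example = {
    "a": {"b": 3, "c": 1},
    "b": {"a": 3, "d": 1},
    "c": {"a": 1, "d": 3},
    "d": {"b": 1, "c": 3},
}
cut_value, community = min_st_cut(example, "a", "d")
print(cut_value, sorted(community))
```

In this example the two light edges are cut, so the community of `a` with respect to `d` is `{a, b}`.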

2.1 Clustering Algorithm

Flake et al. [3] define clustering based on the minimum-cut tree. An artificial sink $t$ is added to the graph and connected to all the vertices. Each edge $(t, v)$, $v \in V$, has associated weight $\alpha > 0$. The value of $\alpha$ is critical in determining the quality of the clusters. The minimum-cut tree is then computed on this new graph. The disjoint components obtained after removing $t$ from the minimum-cut tree are the required clusters. The algorithm is named the “Cut-Clustering algorithm”. Figure 1 gives the basic “Cut-Clustering algorithm”.

—————————————————————————
Cut-Clustering(G(V, E), α)
begin
    Let V′ := V ∪ {t}
    For all vertices v in G
        Connect t to v with an edge of weight α
    Let G′(V′, E′) be the new graph after connecting t to V
    Calculate the minimum-cut tree T′ of G′
    Remove t from T′
    Return all connected components as the clusters of G
end
—————————————————————————

Figure 1. Cut Clustering Algorithm of [3]
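The steps of Figure 1 can be sketched end to end on a toy graph. This is an illustrative sketch, not the authors' implementation: it builds the minimum-cut tree with Gusfield's variant of the Gomory-Hu construction and uses Edmonds-Karp for the flows, both of which are choices made here for brevity. All function names are hypothetical.

```python
from collections import deque


def min_st_cut(adj, s, t):
    """Edmonds-Karp: return (value, S) for a minimum s-t cut, with s in S."""
    res = {u: dict(nbrs) for u, nbrs in adj.items()}
    value = 0
    while True:
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in res[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, set(parent)
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        value += bottleneck


def min_cut_tree(adj):
    """Gusfield's algorithm: parent pointers and edge weights of a min-cut tree."""
    nodes = list(adj)
    root = nodes[0]
    parent = {v: root for v in nodes}
    weight = {v: 0 for v in nodes}
    for s in nodes[1:]:
        t = parent[s]
        f, side = min_st_cut(adj, s, t)
        weight[s] = f
        # Vertices on s's side that hung below t now hang below s.
        for v in nodes:
            if v != s and v in side and parent[v] == t:
                parent[v] = s
        # If t's own parent is on s's side, s takes t's place in the tree.
        if parent[t] in side:
            parent[s], parent[t] = parent[t], s
            weight[s], weight[t] = weight[t], f
    return parent, weight


def cut_clustering(adj, alpha, sink="t"):
    """Clusters of G: components of the min-cut tree of G' after removing the sink."""
    expanded = {sink: {u: alpha for u in adj}}   # sink listed first, so it is the root
    for u, nbrs in adj.items():
        expanded[u] = dict(nbrs)
        expanded[u][sink] = alpha
    parent, _ = min_cut_tree(expanded)
    tree = {u: set() for u in adj}               # tree edges that avoid the sink
    for v, p in parent.items():
        if v != p and sink not in (v, p):
            tree[v].add(p)
            tree[p].add(v)
    clusters, seen = [], set()
    for u in adj:
        if u in seen:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(tree[x] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters


# Two triangles with heavy edges (weight 5) joined by one light edge (weight 1).
edges = [(0, 1, 5), (0, 2, 5), (1, 2, 5), (3, 4, 5), (3, 5, 5), (4, 5, 5), (2, 3, 1)]
graph = {}
for a, b, w in edges:
    graph.setdefault(a, {})[b] = w
    graph.setdefault(b, {})[a] = w
clusters = cut_clustering(graph, alpha=1)
print(clusters)
```

With $\alpha = 1$ the light edge is cut and each triangle forms its own cluster.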

2.2 Clustering Quality

The quality of the clusters produced by the “Cut-Clustering” algorithm is measured using an expansion-like criterion. Let $(S, \bar{S})$ be a cut in $G$. The expansion of this cut is defined as

$$\Psi(S) = \frac{\sum_{i \in S,\, j \in \bar{S}} w(i, j)}{\min\{|S|, |\bar{S}|\}} = \frac{c(S, \bar{S})}{\min\{|S|, |\bar{S}|\}}.$$

The expansion of a (sub)graph is the minimum expansion over all the cuts of the (sub)graph. The higher the expansion of a cluster, the better its quality.
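As a concrete reading of the definition, the following sketch evaluates the expansion of a single cut and, by brute force, the expansion of a tiny graph. The helper names are hypothetical, and the exhaustive search is exponential, so it is meant only as an illustration on very small inputs.

```python
from itertools import combinations


def build_graph(edges):
    """Symmetric adjacency dictionary from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def cut_weight(adj, side):
    """c(S, S-bar): total weight of edges with exactly one endpoint in side."""
    return sum(w for i in side for j, w in adj[i].items() if j not in side)


def cut_expansion(adj, side):
    """Expansion of the cut (S, S-bar) as defined above."""
    return cut_weight(adj, side) / min(len(side), len(adj) - len(side))


def graph_expansion(adj):
    """Minimum expansion over all cuts (exhaustive search)."""
    nodes = list(adj)
    best = float("inf")
    # By symmetry it is enough to try sides holding at most half the vertices.
    for size in range(1, len(nodes) // 2 + 1):
        for side in combinations(nodes, size):
            best = min(best, cut_expansion(adj, set(side)))
    return best


# Two heavy triangles joined by one light edge.
adj = build_graph([(0, 1, 5), (0, 2, 5), (1, 2, 5),
                   (3, 4, 5), (3, 5, 5), (4, 5, 5), (2, 3, 1)])
print(cut_expansion(adj, {0, 1, 2}), graph_expansion(adj))
```

Here the cut separating the two triangles has weight 1 and three vertices on each side, so its expansion is 1/3, which is also the expansion of the whole graph.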

Flake et al. [3] show that if $S$ is a cluster produced by the “Cut-Clustering” algorithm, then the following conditions are satisfied:

1. $\frac{c(P, Q)}{\min\{|P|, |Q|\}} \ge \alpha$, for any $P, Q \subset S$ such that $P \cup Q = S$ and $P \cap Q = \emptyset$.

2. $\frac{c(S, V - S)}{|V - S|} \le \alpha$.

Therefore $\alpha$ serves as a lower bound on the intra-cluster expansion and an upper bound on the inter-cluster expansion. This clustering quality measure is not exactly the same as the bicriteria measure proposed by Kannan et al. [6], but it is closely related and has its own advantages.

3 Dynamic Graph Clustering with Min-Cut Tree

In this section we present our main contribution: a dynamic algorithm for graph clustering. The clustering technique is motivated by the “Cut-Clustering” algorithm described in the previous section.

Our proposed algorithm maintains two variables for every vertex, In Cluster Weight (ICW) and Out Cluster Weight (OCW). Let $C_1, C_2, \ldots, C_s$, where $s > 0$ is an integer, be the clusters of $G(V, E)$. Then ICW and OCW are defined as follows.

Definition 1 In Cluster Weight (ICW). The In Cluster Weight, or ICW, of a vertex $v \in V$ is defined as the total weight of the edges linking $v$ to all the vertices that belong to the same cluster as $v$. That is, if $v \in C_i$, $1 \le i \le s$, then $ICW(v) = \sum_{u \in C_i} w(v, u)$.

Definition 2 Out Cluster Weight (OCW). The Out Cluster Weight, or OCW, of a vertex $v \in V$ is defined as the total weight of the edges linking $v$ to all the vertices that do not belong to the same cluster as $v$. That is, if $v \in C_i$, $1 \le i \le s$, then $OCW(v) = \sum_{u \in C_j,\, j \ne i} w(v, u)$.
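The two definitions translate directly into a single pass over the adjacency lists. The sketch below is only illustrative; the function name `cluster_weights` and the `cluster_of` mapping (vertex to cluster label) are assumptions, not notation from the paper.

```python
def cluster_weights(adj, cluster_of):
    """Return (icw, ocw): per-vertex weight to the same and to other clusters."""
    icw = {v: 0 for v in adj}
    ocw = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        for u, w in nbrs.items():
            if cluster_of[u] == cluster_of[v]:
                icw[v] += w   # edge stays inside v's cluster
            else:
                ocw[v] += w   # edge leaves v's cluster
    return icw, ocw


# Two heavy triangles {0, 1, 2} and {3, 4, 5} joined by the light edge (2, 3).
adj = {
    0: {1: 5, 2: 5},
    1: {0: 5, 2: 5},
    2: {0: 5, 1: 5, 3: 1},
    3: {2: 1, 4: 5, 5: 5},
    4: {3: 5, 5: 5},
    5: {3: 5, 4: 5},
}
cluster_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
icw, ocw = cluster_weights(adj, cluster_of)
print(icw[2], ocw[2])
```

Note that $ICW(v) + OCW(v)$ is the total weighted degree of $v$, which is what the contraction step later relies on.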


The main feature of our algorithm is that it builds part of the minimum-cut tree only as and when necessary. The minimum-cut tree is computed only over a very small graph, created efficiently from the original graph. This has the flavor of the coarsening step of [2]. However, no remapping or refinement step is necessary: clusters of the original graph can be obtained directly from the coarsened graph. The algorithm is thus very fast. At any instant, the set of clusters of the graph seen so far can be obtained without any further processing. The cluster quality is identical to that of the offline “Cut-Clustering” algorithm (see Figure 1). The algorithm supports and maintains clusters over insertion and deletion of edges and vertices. Edge insertion and deletion involve edges whose end vertices have already been seen.

Let $C = \{C_1, C_2, \ldots, C_s\}$ be the clusters of the graph $G(V, E)$ seen so far. Let $A$ be the adjacency matrix of $G$. Below we describe our algorithm for each of these update operations.

1. Edge Insertion.

Intra-Cluster Edge Insertion.

—————————————————————————
Intra-Cluster Edge Addition((i, j), w(i, j))
begin
    Let i, j ∈ C_u
    Update A(i, j) += w(i, j)
    Update ICW(i) += w(i, j), ICW(j) += w(i, j)
    Return C
end
—————————————————————————

Figure 2. Intra-Cluster Edge Addition

When an edge is added whose end vertices both belong to the same cluster, the cluster becomes better connected. Therefore, in this case, we simply update the adjacency matrix and ICW. The clusters remain unchanged. Figure 2 gives the algorithm for intra-cluster edge addition. The time required is $O(1)$.

Inter-Cluster Edge Addition. Addition of an edge whose end points belong to different clusters increases connectivity across the clusters. As a result, the clustering quality suffers. If the quality measure given in Subsection 2.2 is no longer maintained, re-clustering becomes necessary. To understand the algorithm for inter-cluster edge addition, we first need to look at two processes: merging of clusters and contraction of clusters.
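The constant-time update of Figure 2 can be written as a few dictionary operations. This is a minimal sketch under the assumption that the graph is stored as a symmetric adjacency dictionary rather than a matrix; the function name is hypothetical.

```python
def add_intra_cluster_edge(adj, icw, cluster_of, i, j, w):
    """Add weight w to edge (i, j) when i and j are in the same cluster."""
    assert cluster_of[i] == cluster_of[j], "end vertices must share a cluster"
    # Update A(i, j) symmetrically; a missing entry counts as weight 0.
    adj[i][j] = adj[i].get(j, 0) + w
    adj[j][i] = adj[j].get(i, 0) + w
    # Only ICW of the two end vertices changes; the clusters stay as they are.
    icw[i] += w
    icw[j] += w


adj = {0: {1: 5, 2: 5}, 1: {0: 5, 2: 5}, 2: {0: 5, 1: 5}}
cluster_of = {0: "A", 1: "A", 2: "A"}
icw = {0: 10, 1: 10, 2: 10}
add_intra_cluster_edge(adj, icw, cluster_of, 0, 1, 2)
print(adj[0][1], icw)
```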

Merging of Clusters. Merging of two clusters $C_u$ and $C_v$ is described in Figure 3. Merging forms a single cluster containing the vertices of $C_u$ and $C_v$. ICW and OCW can easily be updated using the adjacency matrix of the graph $G$. The time complexity of merging is $\Theta(|C_u| + |C_v|) = \Theta(|C_u \cup C_v|)$.

—————————————————————————
MERGE(C_u, C_v)
begin
    D = C_u ∪ C_v
    For all u ∈ C_u
        Update ICW(u) += ∑_{v ∈ C_v} w(u, v)
        Update OCW(u) −= ∑_{v ∈ C_v} w(u, v)
    For all v ∈ C_v
        Update ICW(v) += ∑_{u ∈ C_u} w(v, u)
        Update OCW(v) −= ∑_{u ∈ C_u} w(v, u)
    Return D
end
—————————————————————————

Figure 3. Merging of Two Clusters
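The MERGE procedure can be sketched as follows. The representation (a dictionary from cluster label to vertex set, plus ICW and OCW dictionaries) and the function name are assumptions made for illustration; the sketch scans adjacency lists rather than matrix rows.

```python
def merge(adj, clusters, icw, ocw, cu, cv):
    """Merge cluster cv into cluster cu, updating ICW and OCW in place."""
    a, b = clusters[cu], clusters[cv]
    for side, other in ((a, b), (b, a)):
        for u in side:
            # Weight from u into the other cluster moves from OCW to ICW.
            moved = sum(w for v, w in adj[u].items() if v in other)
            icw[u] += moved
            ocw[u] -= moved
    clusters[cu] = a | b
    del clusters[cv]
    return clusters[cu]


# Two heavy triangles joined by the light edge (2, 3).
adj = {
    0: {1: 5, 2: 5},
    1: {0: 5, 2: 5},
    2: {0: 5, 1: 5, 3: 1},
    3: {2: 1, 4: 5, 5: 5},
    4: {3: 5, 5: 5},
    5: {3: 5, 4: 5},
}
clusters = {"A": {0, 1, 2}, "B": {3, 4, 5}}
icw = {v: 10 for v in adj}
ocw = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 0}
merged = merge(adj, clusters, icw, ocw, "A", "B")
print(sorted(merged), icw[2], ocw[2])
```

After the merge, the weight of the edge between the two former clusters is counted in ICW instead of OCW for both of its end vertices.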

—————————————————————————
CONTRACT(G(V, E), S)
Comment. S is a set of clusters
begin
    Let A′ denote the adjacency matrix of the contracted graph
    V′ = (V − S) ∪ {x}, n′ = |V′|
    Copy the entries of A involving both the vertices from V − S to A′
    A′(i, n′) = ICW(i) + OCW(i) − ∑_{j=1}^{n′−1} A′(i, j), for each i ∈ V − S
    Comment. E′ can be obtained from A′
    Return G′(V′, E′)
end
—————————————————————————

Figure 4. Creating Contracted Graph

Contraction of Clusters. Contraction of a vertex set $A \subset V$ in $G$ is performed by replacing $A$ with a single node $x$. Self loops, resulting from the edges connecting vertices within $A$, are removed. Parallel edges are replaced by a single edge whose weight equals the sum of the weights of the parallel edges. When contracting clusters, $A$ represents a single cluster or multiple clusters. The process of contracting clusters is described in Figure 4. The contracted graph can be obtained from the adjacency matrix of the original graph, along with ICW and OCW, in time $\Theta(|A|^2)$.
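The point of Figure 4 is that the weight from a retained vertex to the contracted node $x$ never requires visiting the contracted vertices: it is the vertex's total weight, $ICW + OCW$, minus its weight to the other retained vertices. A minimal sketch, with hypothetical names and with the retained vertex set `keep` passed in place of the contracted set:

```python
def contract(adj, icw, ocw, keep, x="x"):
    """Contract every vertex outside keep into the single node x."""
    # Copy the entries of the adjacency structure among retained vertices.
    g = {u: {v: w for v, w in adj[u].items() if v in keep} for u in keep}
    g[x] = {}
    for u in keep:
        # Total weight at u minus weight to retained vertices = weight to x.
        to_x = icw[u] + ocw[u] - sum(g[u].values())
        if to_x > 0:
            g[u][x] = to_x
            g[x][u] = to_x
    return g


# Clusters A = {0, 1, 2}, B = {3, 4, 5}, C = {6, 7}; C is contracted.
adj = {
    0: {1: 5, 2: 5, 7: 2},
    1: {0: 5, 2: 5},
    2: {0: 5, 1: 5, 3: 1},
    3: {2: 1, 4: 5, 5: 5},
    4: {3: 5, 5: 5},
    5: {3: 5, 4: 5, 6: 1},
    6: {5: 1, 7: 4},
    7: {0: 2, 6: 4},
}
icw = {0: 10, 1: 10, 2: 10, 3: 10, 4: 10, 5: 10, 6: 4, 7: 4}
ocw = {0: 2, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1, 7: 2}
small = contract(adj, icw, ocw, keep={0, 1, 2, 3, 4, 5})
print(small["x"])
```

Only vertices 0 and 5 have edges into the contracted cluster, so they are the only neighbours of `x` in the result.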

Now we are ready to describe our algorithm for inter-cluster edge addition (Figure 5). If the addition of the edge does not deteriorate the clustering quality (CASE 1), the same clusters are maintained. Otherwise, if CASE 2 is satisfied, the two clusters $C_u$ and $C_v$ containing the end vertices of the inserted edge are merged. Otherwise (CASE 3), we create a coarsened graph by contracting all the clusters except $C_u$ and $C_v$ into a single node $x$. The resulting graph has only $|C_u| + |C_v| + 1$ vertices and is significantly smaller than the original graph. As in the “Cut-Clustering” algorithm, we add an artificial sink $t$ and connect $t$ to all vertices of the coarsened graph. However, the edge linking $t$ to $x$ has weight $\alpha |V - C_u - C_v|$; all other edges with $t$ as one endpoint have weight $\alpha$. The minimum-cut tree is computed over this graph, and the connected components are obtained from this tree after removing $t$. The components containing vertices of $C_u$ and $C_v$, together with the clusters $C - \{C_u, C_v\}$, are returned as the clusters of the original graph. The entire process takes $\Theta((|C_u| + |C_v|)^3)$ time.

—————————————————————————
Inter-Cluster Edge Addition((i, j), w(i, j), α)
begin
    Let i ∈ C_u and j ∈ C_v
    If (∑_{u ∈ C_u} OCW(u) + w(i, j)) / |V − C_u| ≤ α and (∑_{v ∈ C_v} OCW(v) + w(i, j)) / |V − C_v| ≤ α (CASE 1)
    Then
        Update A(i, j) += w(i, j)
        Update OCW(i) += w(i, j), OCW(j) += w(i, j)
        Return C
    Else If 2c(C_u, C_v) / |V| ≥ α (CASE 2)
    Then
        D = MERGE(C_u, C_v)
        Return C + D − {C_u, C_v}
    Else (CASE 3)
        G′(V′, E′) = CONTRACT(G(V, E), V − C_u − C_v)
        Connect t to v, ∀v ∈ C_u ∪ C_v, with an edge of weight α
        Connect t to V′ − {C_u, C_v} with an edge of weight α|V − C_u − C_v|
        Let G″(V″, E″) be the resulting graph
        Calculate the MINIMUM-CUT tree T″ of G″(V″, E″)
        Remove t
        Let {D_1, D_2, ..., D_k}, k > 0, be the connected components of T″ after removing t that contain vertices of C_u and C_v
        C = {D_1, D_2, ..., D_k, C_1, C_2, ..., C_s} − {C_u, C_v}
        Return C
end
—————————————————————————

Figure 5. Inter-Cluster Edge Addition
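The three-way test at the top of Figure 5 can be sketched in isolation. The code below only decides which case applies; it does not perform the merge or the re-clustering. It assumes that $c(C_u, C_v)$ in the CASE 2 test already includes the weight of the newly inserted edge, which the figure leaves implicit, and all names are hypothetical.

```python
def classify_insertion(adj, clusters, ocw, cu, cv, w, alpha):
    """Return which case of inter-cluster edge addition applies."""
    n = sum(len(c) for c in clusters.values())
    a, b = clusters[cu], clusters[cv]
    # CASE 1: both clusters still satisfy the inter-cluster bound.
    out_a = sum(ocw[u] for u in a) + w
    out_b = sum(ocw[v] for v in b) + w
    if out_a / (n - len(a)) <= alpha and out_b / (n - len(b)) <= alpha:
        return "CASE 1"
    # CASE 2: the two clusters are connected strongly enough to be merged.
    # Assumption: c(Cu, Cv) counts the new edge weight w as well.
    c_ab = w + sum(x for u in a for v, x in adj[u].items() if v in b)
    if 2 * c_ab / n >= alpha:
        return "CASE 2"
    # CASE 3: contract the other clusters and re-cluster Cu and Cv.
    return "CASE 3"


edges = [(0, 1, 5), (0, 2, 5), (1, 2, 5), (3, 4, 5), (3, 5, 5), (4, 5, 5),
         (2, 3, 1), (0, 6, 4), (1, 7, 4)]
adj = {}
for p, q, wt in edges:
    adj.setdefault(p, {})[q] = wt
    adj.setdefault(q, {})[p] = wt
clusters = {"A": {0, 1, 2}, "B": {3, 4, 5}, "C": set(range(6, 12))}
cluster_of = {v: name for name, members in clusters.items() for v in members}
ocw = {v: sum(x for u, x in adj[v].items() if cluster_of[u] != cluster_of[v])
       for v in adj}

print(classify_insertion(adj, clusters, ocw, "A", "B", w=1, alpha=2))
print(classify_insertion(adj, clusters, ocw, "A", "B", w=6, alpha=1))
print(classify_insertion(adj, clusters, ocw, "A", "B", w=1, alpha=1))
```

The three calls insert an edge between clusters A and B under different weights and values of $\alpha$, and exercise CASE 1, CASE 2 and CASE 3 in turn.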

2. Edge Deletion. Intra-cluster and inter-cluster edge deletion are handled analogously to inter-cluster and intra-cluster edge insertion, respectively, and are omitted here for brevity.

3. Vertex Addition and Deletion. The addition of a new vertex, with edges incident on it, may be viewed as the creation of a new cluster containing the new vertex as an isolated node, followed by processing all the edges incident on it as in the case of “Edge Insertion”. Similarly, vertex deletion can be handled by first removing all the edges incident on that vertex through the process of edge deletion, and then deleting the isolated vertex.

The detailed description of the algorithm may be found in the full and extended version [9].

Time Complexity. Let $k = \max_{u=1}^{s} |C_u|$. The time complexity is dominated by the operations of inter-cluster edge insertion and intra-cluster edge deletion. Therefore the update-processing time is $O(k^3)$. When edges can only be added, or when the graph is stored on secondary disk a priori, the time required for clustering is $O(mk^3)$ (in practice much less than this). Generally the clusters are small. So if $k = O(\log n)$, and since most massive graphs that occur in real life have very low average degree [3], the time complexity becomes $O(n\,\mathrm{polylog}(n))$, compared to the $O(n^3)$ time requirement of the offline “Cut-Clustering” algorithm.
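The final bound follows by substitution. The short derivation below spells out the step, under the two assumptions stated in the text: constant average degree, so that $m = O(n)$, and logarithmic cluster size.

```latex
\begin{align*}
m &= O(n) && \text{(constant average degree)} \\
k &= O(\log n) && \text{(small clusters)} \\
O(m k^3) &= O\bigl(n (\log n)^3\bigr) = O\bigl(n \,\mathrm{polylog}(n)\bigr)
\end{align*}
```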

Proof of Clustering Quality. In this section, we show that the clusters obtained by our dynamic algorithm have the same quality guarantee (Subsection 2.2) as the offline clustering algorithm of [3].

Intra-Cluster Edge Addition. The analysis in this case is straightforward and may be found in [9].

Inter-Cluster Edge Addition.

CASE 1. Note that $\sum_{u \in C_u} OCW(u) = c(C_u, V - C_u)$. Therefore, if CASE 1 is satisfied, then by the quality guarantee of the existing clusters, both criteria of Subsection 2.2 are satisfied.

CASE 2. Lemma 1 is obtained from the properties of the minimum-cut tree. Lemma 2 establishes the quality guarantee of our algorithm under CASE 2, using Lemma 1.

Lemma 1 Let $T$ be the minimum-cut tree of $G$ after addition of the artificial sink $t$. If $(P, Q)$, with $P \ne \emptyset$, is any cut of a connected component $S$ of $T$ after removing $t$, then $c(x, Q) \le c(P, Q)$, where $x$ is obtained by contracting $t \cup X$ in $T$.

Proof. See Lemma 3.2 of [3].

Lemma 2 If $\frac{2c(C_u, C_v)}{|V|} \ge \alpha$, then merging $C_u$ and $C_v$ maintains the clustering quality.

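Proof. Let $D = C_u \cup C_v$. For all $i \ne u, v$, $c(C_i, V - C_i)$ remains unchanged. By direct calculation, $c(D, V - D) = \alpha |V - D| + \alpha |V| - 2c(C_u, C_v)$. If $\alpha \le \frac{2c(C_u, C_v)}{|V|}$, then $\frac{c(D, V - D)}{|V - D|} \le \alpha$.

Now, $\frac{c(C_u, C_v)}{\min\{|C_u|, |C_v|\}} \ge \frac{2c(C_u, C_v)}{|V|} \ge \alpha$. Let $P, Q \subset D$, with $P \cup Q = D$ and $P \cap Q = \emptyset$. Let $P = P_u \cup P_v$ and $Q = Q_u \cup Q_v$, where $P_u, Q_u \subseteq C_u$ and $P_v, Q_v \subseteq C_v$. We only need to consider the case $(P, Q) \ne (C_u, C_v)$. Therefore, if $P_u = \emptyset$ or $P_v = \emptyset$, then $Q_u \ne \emptyset$ and $Q_v \ne \emptyset$, and vice versa. Without loss of generality, let us assume $P_u \ne \emptyset$ and $P_v \ne \emptyset$. We get

$$\begin{aligned}
c(P, Q) &= c(P_u \cup P_v,\; Q_u \cup Q_v) \\
&\ge c(P_u, Q_u) + c(P_v, Q_v) \\
&\ge c(x, Q_u) + c(x, Q_v) && \text{(by Lemma 1)} \\
&\ge \alpha |Q_u| + \alpha |Q_v| && \text{(by construction)} \\
&\ge \alpha |Q| \ge \alpha \min\{|P|, |Q|\}.
\end{aligned}$$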

CASE 3. To prove the claim of our algorithm under CASE 3, we first state an important lemma, Lemma 3, from [3]. It follows directly from the way the min-cut tree is produced in [5].

Lemma 3 Let $T$ be the (unique) min-cut tree of an undirected graph $G$, and let $A$ be a subtree of $T$. Let $G'$ be the graph that results after contracting $A$ in $G$, and let $T'$ be the min-cut tree of $G'$. Let $T''$ be the tree that results after contracting $A$ in $T$. Then $T'$ and $T''$ are identical.

The following Lemma 4 is derived using Lemma 3.

Lemma 4 Let $C_u$ and $C_v$ be two connected components obtained after removing the artificial sink $t$ from the min-cut tree $T$ of $G$. If edges are inserted and deleted across and within the clusters $C_u$ and $C_v$, then all clusters other than $C_u$ and $C_v$ remain unaffected.

Proof. Let $G$ be the original graph and let $G'$ denote the graph after the edge insertions and deletions. Contract $C_u$ and $C_v$ in $G$ and $G'$, and call the contracted graphs $H$ and $H'$ respectively. Since all the edge insertions and deletions are within $C_u$ and $C_v$, $H = H'$. The min-cut trees of $H$ and $H'$ are therefore identical, and are the same as the tree obtained by contracting $C_u$ and $C_v$ in the min-cut tree $T$ of $G$ (by Lemma 3). Since contraction of $C_u$ and $C_v$ in $T$ does not affect the other communities, the proof follows.

Observe that adding $t$ to $G$, as in the basic “Cut-Clustering” algorithm, and then contracting $V - S$ has the same effect as first contracting $V - S$ into $x$ and then adding an edge $(t, x)$ of weight $\alpha |V - S|$ to the contracted graph. In the contracted graph, the contracted clusters, $x$, form a singleton cluster (this follows directly from Lemma 4). With these observations, the claim of the algorithm under CASE 3 follows from Lemma 4 and the correctness proof of the “Cut-Clustering” algorithm of [3].

The proofs that clustering quality is maintained over edge deletions and over vertex insertions and deletions are similar and are omitted for brevity.

Heuristic to Improve Time Complexity. A heuristic may be applied to improve the time complexity of each update to $O(k^2)$. It marks a special vertex of each cluster as “prime” and, while computing the partial minimum-cut tree, computes the first minimum cut with the prime vertex (of the cluster involved) as the source. After that, the heuristic of [3] is followed. The details may be found in [9].

4 Experimental Results

The results of a preliminary experimental study of our dynamic clustering algorithm demonstrate its superiority, in terms of cluster quality and computation time, over the Cut-Clustering algorithm (Section 2) [3] and the recent multilevel algorithm (MLKM) [2]. The detailed results of the experiments and the comparative analysis may be found in [9] and are not given here for lack of space.

References

[1] Ulrik Brandes, Marco Gaertler, and Dorothea Wagner. Experiments on graph clustering. In ESA’03, volume 2832, pages 568–579, 2003.

[2] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. A fast kernel-based multilevel algorithm for graph clustering. In KDD’05, pages 629–634, 2005.

[3] G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph clustering and minimum cut trees. Internet Mathematics, 1(3):355–378, 2004.

[4] Gary William Flake, Steve Lawrence, and C. Lee Giles. Efficient identification of web communities. In KDD’00, pages 150–160, 2000.

[5] R. E. Gomory and T. C. Hu. Multi-terminal network flows. J-SIAM, 9(4):551–570, December 1961.

[6] R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. In FOCS ’00, page 367, 2000.

[7] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. Technical Report 95-035, University of Minnesota, June 1995.

[8] B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL, Main Volume, pages 271–278, Barcelona, 2004.

[9] B. Saha and P. Mitra. Dynamic Algorithm for Graph Clustering, http://home.iitk.ac.in/~barna/dc.pdf, July 2006.

