Journal of Algorithms 36, 205–240 (2000)
doi:10.1006/jagm.2000.1090, available online at http://www.idealibrary.com

Parallel Algorithms for Hierarchical Clustering and Applications to Split Decomposition and Parity Graph Recognition

Elias Dahlhaus¹

Department of Computer Science and Department of Mathematics, University of Cologne, Pohligstrasse 1, D-50969 Cologne, Germany

E-mail: [email protected]

Received January 13, 1999

We present efficient parallel algorithms for two hierarchical clustering heuristics. We point out that these heuristics can also be applied to solving some algorithmic problems in graphs, including split decomposition. We show that efficient parallel split decomposition induces an efficient parallel parity graph recognition algorithm. This is a consequence of the result of S. Cicerone and D. Di Stefano [7] that parity graphs are exactly those graphs that can be split decomposed into cliques and bipartite graphs. © 2000 Academic Press

Key Words: parallel algorithms; graph algorithms; split decomposition; hierarchical clustering; single linkage.

1. INTRODUCTION

Hierarchical clustering plays an important role in many areas of applied science. The major application is the classification of objects as is done in psychology, the social sciences, and artificial intelligence. The reader who is interested in these applications should consult, e.g., [20].

There are different approaches to hierarchical clustering. One is the single linkage method (see for example [20]). We are given distances between the elements of a fixed set $V$ and consider two elements to be in the same $d$-cluster if one can reach the second element from the first by one or more jumps through elements of $V$, such that the distance of each jump is at most $d$.

1 Present address: Institute of Computer Graphics, Vienna University of Technology.

0196-6774/00 $35.00. Copyright © 2000 by Academic Press. All rights of reproduction in any form reserved.

In another approach, we are given a set system and want to turn it into a hierarchical clustering. That means that we would like to transform the given set system into another set system, so that any two sets are either disjoint or one is a subset of the other. We say that two sets overlap if they intersect but neither is a subset of the other. A natural approach is to determine the overlap components, i.e., the connected components of the graph $G_S$ that has the sets of the set system $S$ as vertices and the overlapping pairs of sets in $S$ as edges. The clusters are the unions of the sets that are in a certain overlap component.

It turns out that these two clustering procedures have a common application in graph algorithms. They both appear as subprocedures in an efficient parallel algorithm for split decomposition (preliminary version [18]). A split of a graph is a partition of its vertex set into two sets such that the edges between the two sets induce a complete bipartite graph. A first parallel split decomposition algorithm is due to Barten [4]. The processor number of this algorithm is $O(n^{4.8.1})$ and the time bound is $O(\log^2 n)$. We will show that split decomposition can be done in polylogarithmic time with a linear processor number (preliminary version [18]). We also will show that the algorithm can be turned into a linear time sequential algorithm (this result is also mentioned in the preliminary paper [18]). The fastest previously known split decomposition algorithm is due to Ma and Spinrad [28] and runs in $O(n^2)$ time. Finally, using a result of Cicerone and Di Stefano [7], it follows that parity graphs can be recognized in linear time, and in parallel with a linear processor bound in quadratic logarithmic time. One gets a simplification of an earlier parity graph recognition algorithm due to the author [16] that has the same time and processor bound. This algorithm is an improvement of the algorithm of Corneil and Przytycka [30].

In Section 3 we develop an efficient algorithm for single linkage clustering. This is an improvement of the algorithm of the author in [14]. In Section 4, we present an efficient parallel and linear time algorithm to determine the overlap components of a set system. This algorithm has also been sketched in [18]. In Section 5, we develop an efficient parallel and linear time algorithm for split decomposition. We also discuss the recognition of parity graphs in this section.

2. PRELIMINARIES

A hierarchical clustering is a collection $C$ of subsets $c$ of a fixed set $V$, such that for all $c$ and $d$ in $C$ either $c \cap d = \emptyset$ or $c \subset d$ or $d \subset c$. This means that if each singleton and $V$ are included in $C$, the subset relation defines on $C$ a tree-like partial ordering, i.e., for each $c \in C$ with $c \neq V$, there is a unique smallest $d \in C$ with $c \subset d$. This $d$ is also called the parent of $c$.

In general, a tree $T$ is a root-directed tree consisting of a set $V_T$ of nodes and a set $E_T$ of directed edges. The parent of a node $t$ is the unique $u \in V_T$ with $(u, t) \in E_T$. The children of $t$ are the nodes that have $t$ as their parent.

$y \in V_T$ is called an ancestor of $x \in V_T$ iff there is a directed path (possibly of length 0) from $y$ to $x$ in $T$. $x$ is also called a descendant of $y$ if $y$ is an ancestor of $x$. The set of descendants of $t$ in $T$, including $t$, is denoted by $T_t$. We identify $T_t$ and its induced subtree.

For $x, y \in V_T$ the least common ancestor of $x$ and $y$, denoted by $\mathrm{LCA}(x, y)$, is the common ancestor $z$ of $x$ and $y$ such that no child of $z$ is an ancestor of both $x$ and $y$.

A distance function is a binary, symmetric, nonnegative real-valued function $d$ on a domain $V$. Moreover, we assume that $d(x, x) = 0$ for all $x \in V$. A metric is a distance function satisfying the triangle inequality.

Here we always assume that $V$ is a finite domain. Moreover, we let $V$ be a set of the form $\{1, \ldots, n\}$. The distance function $d$ is implemented as an $n \times n$ matrix.

A distance function $d$ is called an ultrametric iff the following extended triangle inequality is valid:

$$d(i, j) \le \max\{d(i, k), d(j, k)\}.$$

A dendrogram is a root-directed tree $T$ together with a real-valued labeling $h$ of the vertices that is a height function, i.e., $h(v) < h(w)$ if $w$ is an ancestor of $v$ and $w \neq v$.

Note that a dendrogram always defines a hierarchical clustering. The cluster $c_t$ of a node $t$ of $T$ is just the set of descendants of $t$ in $T$ that are leaves of $T$. The hierarchical clustering defined by $T$ is the collection of all $c_t$ with $t \in T$.

PROPOSITION (see for example [22]). A distance function $d$ on $V$ is an ultrametric iff there is a dendrogram $(T, h)$ such that

1. $V$ is the set of leaves of $T$, and

2. for all $u, v \in V$ the distance $d(u, v)$ is the labeling $h(\mathrm{LCA}(u, v))$ of the least common ancestor $\mathrm{LCA}(u, v)$ of $u$ and $v$ with respect to $T$.
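The direction from a dendrogram to an ultrametric can be illustrated with a small sequential sketch. The dendrogram below (node names `a`, `b`, `c`, `p`, `r` and their heights) is hypothetical and not taken from the paper, and the leaves are assumed to have height 0 so that $d(x, x) = 0$. The code reads off $d(u, v) = h(\mathrm{LCA}(u, v))$ and checks the extended triangle inequality by brute force.

```python
def lca_height(parent, height, u, v):
    """Return h(LCA(u, v)) by walking from u and v towards the root."""
    ancestors = set()
    x = u
    while x is not None:          # collect u and all of its ancestors
        ancestors.add(x)
        x = parent.get(x)
    x = v
    while x not in ancestors:     # first ancestor of v that is also above u
        x = parent[x]
    return height[x]


def is_ultrametric(d, points):
    """Check d(i, j) <= max(d(i, k), d(j, k)) for all triples."""
    return all(d[i, j] <= max(d[i, k], d[j, k])
               for i in points for j in points for k in points)


# Hypothetical dendrogram: leaves a, b, c; inner node p; root r.
parent = {"a": "p", "b": "p", "p": "r", "c": "r"}
height = {"a": 0, "b": 0, "c": 0, "p": 1, "r": 3}
leaves = ["a", "b", "c"]
d = {(u, v): lca_height(parent, height, u, v) for u in leaves for v in leaves}
```

Here `d["a", "b"]` is the height of `p`, while both `a` and `b` are at the height of the root from `c`.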

A graph $G = (V, E)$ consists of a vertex set $V$ and an edge set $E$. Multiple edges and loops are not allowed. The edge joining $x$ and $y$ is denoted by $xy$.

We say that $x$ is a neighbor of $y$ iff $xy \in E$. The neighborhood of $x$ is the set $\{y : xy \in E\}$ consisting of all neighbors of $x$ and is denoted by $N(x)$. The neighborhood of a set of vertices $V'$ is the set $N(V') = \{y \mid \exists x \in V',\ yx \in E\}$ of all neighbors of some vertex in $V'$.

A subgraph of $(V, E)$ is a graph $(V', E')$ such that $V' \subset V$, $E' \subset E$. An induced subgraph is an edge-preserving subgraph, i.e., $(V', E')$ is an induced subgraph of $(V, E)$ iff $V' \subset V$ and $E' = \{xy \in E : x, y \in V'\}$. Connectedness is defined as usual.

For a distance function $d$ with domain $V$ and a real number $r$, let $G_r$ be the graph on $V$ whose edges are all unordered pairs $xy$ with $d(x, y) \le r$, and let $d'(x, y)$ be the minimum $r$ such that $x$ and $y$ are in the same connected component of $G_r$.

LEMMA 1 (see for example [20]). $d'$ is an ultrametric.

The proof is straightforward and can be found in [20].

DEFINITION 1. The clustering defined by the ultrametric $d'$ is called the single linkage clustering of $d$. $d'$ is also called the single linkage distance function for $d$.

LEMMA 2. The clusters of the single linkage clustering of $d$ are exactly the connected components of the graphs $G_r$ as defined above.

If a single linkage cluster $c$ is a connected component of $G_r$ then it is also called the single linkage cluster of coarseness $r$.

A minimum spanning tree (MST) is a tree $T$ with vertex set $V$, such that the sum of $d(xy)$ over all edges $xy$ of $T$ is a minimum.

LEMMA 3 (see for example [20]). Let $T$ be a minimum spanning tree for $d$. Then $d'(x, y)$ is the maximum $d(x', y')$ over all edges $x'y'$ on the unique path from $x$ to $y$ in $T$.
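Lemma 3 suggests a direct sequential way to compute the single linkage distance: build an MST and take, for every pair, the heaviest edge on the connecting tree path. The following is a minimal sketch under that reading, not the parallel algorithm of this paper. It uses Kruskal's algorithm with a simple union-find, the function names are illustrative, and the distance matrix is a made-up example.

```python
def minimum_spanning_tree(d):
    """Kruskal's algorithm on the complete graph with weights d[x][y]."""
    n = len(d)
    comp = list(range(n))

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]   # path halving
            x = comp[x]
        return x

    tree = []
    for w, x, y in sorted((d[x][y], x, y) for x in range(n) for y in range(x + 1, n)):
        rx, ry = find(x), find(y)
        if rx != ry:
            comp[rx] = ry
            tree.append((x, y, w))
    return tree


def single_linkage_distance(d):
    """d'(x, y) = maximum edge weight on the MST path from x to y (Lemma 3)."""
    n = len(d)
    adj = [[] for _ in range(n)]
    for x, y, w in minimum_spanning_tree(d):
        adj[x].append((y, w))
        adj[y].append((x, w))
    dp = [[0] * n for _ in range(n)]
    for s in range(n):
        stack = [(s, -1, 0)]          # (vertex, previous vertex, max weight so far)
        while stack:
            x, prev, best = stack.pop()
            dp[s][x] = best
            for y, w in adj[x]:
                if y != prev:
                    stack.append((y, x, max(best, w)))
    return dp


# Made-up symmetric distance matrix on V = {0, 1, 2, 3}.
d = [[0, 2, 9, 8],
     [2, 0, 3, 7],
     [9, 3, 0, 6],
     [8, 7, 6, 0]]
dp = single_linkage_distance(d)
```

In this example the MST consists of the edges 0-1, 1-2, and 2-3, so the single linkage distance between 0 and 2 is 3 even though $d(0, 2) = 9$.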

Two sets overlap if they are not disjoint and not comparable with respect to the subset relation. For a collection $C$ of subsets of $V$, an overlap component is a connected component of the graph $G_C$ with vertex set $C$, in which $c_1 c_2$ is an edge iff $c_1$ and $c_2$ overlap. Note that if $C_1$ and $C_2$ are different overlap components of $C$ then either all $c_1 \in C_1$ and $c_2 \in C_2$ are disjoint or there is a $c_2 \in C_2$ such that for all $c_1 \in C_1$, $c_1 \subset c_2$, or vice versa.

We say $C_1 \subset C_2$ if there is a $c_2 \in C_2$ such that for all $c_1 \in C_1$, $c_1 \subset c_2$. We denote the set of overlap components of $C$ by $\mathrm{Overlap}(C)$.
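As a point of reference, the overlap components can be computed directly from the definition by testing all pairs of sets. The sketch below does this with a union-find. It takes quadratically many pair tests and is meant only to make the definition concrete; the function names and the example collection are illustrative.

```python
def overlap(a, b):
    """Two sets overlap if they intersect and neither contains the other."""
    return bool(a & b) and not a <= b and not b <= a


def overlap_components(collection):
    """Connected components of the overlap graph, by testing all pairs."""
    sets = [frozenset(s) for s in collection]
    comp = list(range(len(sets)))

    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]   # path halving
            i = comp[i]
        return i

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if overlap(sets[i], sets[j]):
                comp[find(i)] = find(j)
    groups = {}
    for i, s in enumerate(sets):
        groups.setdefault(find(i), set()).add(s)
    return list(groups.values())


components = overlap_components([{1, 2}, {2, 3}, {3, 4}, {1, 2, 3, 4}, {5}])
```

Note that `{1, 2}` and `{3, 4}` are disjoint but end up in the same component because both overlap `{2, 3}`.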

One can show the following.

LEMMA 4. $(\mathrm{Overlap}(C), \subset)$ is a tree-like ordering. For each overlap component $C_1$ of $C$, there is a unique overlap component $C_2$ such that $C_1 \subset C_2$ and for no overlap component $C_3$, $C_1 \subset C_3 \subset C_2$, if and only if there is an overlap component $C_1'$ with $C_1 \subset C_1'$.

We will show the lemma in the section dealing with overlap components.

3. PARALLEL ALGORITHM FOR THE SINGLE LINKAGE METHOD

In this section, a preliminary version of which appeared in [14], we show the following.

THEOREM 1. If an MST for $d$ is known then the dendrogram for the single linkage distance $d'$ of $d$ can be determined in $O(\log n)$ time with $O(n)$ processors and $O(n)$ space on a CREW-PRAM.

The key for an algorithm to determine the dendrogram of the single linkage distance is the following result in [15].

LEMMA 5. Each ultrametric $d$ has a minimum spanning tree $T$ with parent function par, s.t. for each $x \in V$ that is not the root of $T$ and that is not a child of the root of $T$, $d(x, \mathrm{par}(x)) < d(\mathrm{par}(x), \mathrm{par}(\mathrm{par}(x)))$. If such a spanning tree for the ultrametric $d$ is known then the dendrogram can be determined in $O(\log n)$ time with $O(n)$ processors on an EREW-PRAM.

We call a spanning tree with the requirements as stated in Lemma 5 a canonical spanning tree.

The main job is to compute a canonical spanning tree of the single linkage distance efficiently. We assume that the MST $T$ of $d$ is directed to the root $v_0$ and $p(x)$ is the parent of $x$ in $T$ for each $x \in V \setminus \{v_0\}$. Let $\mathrm{par}(x)$ be the first ancestor $y$ of $x$ such that $d(x, p(x)) < d(y, p(y))$, i.e., $y = p^k(x)$ for some $k$, for all $l < k$, with $z = p^l(x)$, $d(z, p(z)) \le d(x, p(x))$, and $d(y, p(y)) > d(x, p(x))$. If such a $y$ does not exist then $\mathrm{par}(x) := v_0$. The construction of par is shown in Fig. 1. The straight lines are the edges of the MST with parent function $p$. The remaining lines are the edges $x\,\mathrm{par}(x)$.
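The definition of par can be read off sequentially by walking from each vertex towards the root until a heavier parent edge is found. The sketch below does exactly that. It may take quadratic time on a long path and is only an illustration of the definition, not the logarithmic-time procedure developed in this section. The dictionary names `p` and `d` and the example tree are assumptions made for the illustration.

```python
def canonical_parents(p, d, root):
    """
    p[x]: parent of x in the rooted MST (the root has no entry).
    d[x]: weight of the edge from x to p[x].
    Returns par, where par[x] is the first ancestor y of x whose parent edge
    is heavier than the parent edge of x, or the root if there is none.
    """
    par = {}
    for x in p:
        y = p[x]
        while y != root and d[y] <= d[x]:
            y = p[y]
        par[x] = y
    return par


# Hypothetical MST that is a path 4 -> 3 -> 2 -> 1 -> 0 with root 0.
p = {1: 0, 2: 1, 3: 2, 4: 3}
d = {1: 3, 2: 5, 3: 1, 4: 2}      # d[x] is the weight of the edge (x, p[x])
par = canonical_parents(p, d, root=0)
```

In the example, vertex 4 skips vertex 3 (whose parent edge has weight 1) and is attached to vertex 2 (whose parent edge has weight 5), while vertex 2 finds no heavier edge above it and is attached to the root.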

LEMMA 6. par is the parent function of a canonical MST $T'$ for the single linkage distance $d'$ of $d$.

Proof. First we show that $d'(x, \mathrm{par}(x)) = d(x, p(x))$. This follows from the fact that for all edges $yz$ on the unique path from $x$ to $\mathrm{par}(x)$, $d(yz) \le d(x, p(x))$, and $x\,p(x)$ is one of the edges of this path.

FIG. 1. From the MST to the canonical MST of the corresponding ultrametric.

Now suppose $x$ is not a child of $v_0$ with respect to par; i.e., $\mathrm{par}(x) \neq v_0$. Then

$$d'(x, \mathrm{par}(x)) = d(x, p(x)) < d(\mathrm{par}(x), p(\mathrm{par}(x))) = d'(\mathrm{par}(x), \mathrm{par}(\mathrm{par}(x))).$$

It remains to show the following.

PROPOSITION 1. The canonical MST $T'$ for the single linkage distance $d'$ of $d$ can be computed in $O(\log n)$ time with $O(n)$ processors by a CREW-PRAM if the MST $T$ for $d$ is known.

Proof. We compute the parent function par as in Lemma 6. Here we present an algorithm that needs only $O(n)$ space. The algorithm in [14] needs $O(n \log n)$ space.

The structure of the algorithm is as follows.

1. We decompose the edge set into edge-disjoint paths (lines), such that the unique path from each vertex to the root passes only logarithmically many lines. This can be done by tree contraction in logarithmic time with a linear processor number.

2. For each vertex $v \in V$, let $l(v)$ be the line that contains the edge $v\,p(v)$ of $T$. We determine the maximum distance $d(e)$ of an ancestor edge $e$ of $v$ on $l(v)$, denoted by $\hat{w}(v)$. This can be done in logarithmic time with a linear workload by list ranking [3].

3. We determine, for each vertex $v$, the first ancestor $a(v)$ such that $\hat{w}(a(v)) > d(v, p(v))$. This can be done, for each vertex $v$, in logarithmically many steps as follows. For each vertex $x$ we determine the root $r(x)$ of the line $l(x)$. Start with $a(v) := v$ and assign $a(v) := r(a(v))$, as long as $\hat{w}(a(v)) \le d(v, p(v))$. This requires logarithmically many steps, because one passes at most logarithmically many lines.

4. Note that $\mathrm{par}(v)$, the first ancestor $w$ of $v$ such that $d(v, p(v)) < d(w, p(w))$, is an ancestor of $a(v)$ in $l(a(v))$. We determine $\mathrm{par}(v)$ by a binary search strategy that will be described below.

3.1. Decomposition into Lines

We first make $T$ a tree that is almost binary; i.e., each vertex has at most two children. If a vertex $t$ has children $t_1, \ldots, t_k$ then we replace the star with vertices $t, t_1, \ldots, t_k$ and edges $t t_i$ by a binary tree $B_t$ with root $t$ and leaves $t_i$ and with the parent function $p_B$. The distances of edges in $B_t$ are determined as follows. The edge $t_i\,p_B(t_i)$ gets the distance $d(t_i\,p_B(t_i)) := d(t_i t)$. All of the other edges of $B_t$ get the distance zero. Note that the distances of the unique path from a vertex $s$ to a vertex $t$ in $T$ before making it almost binary and after making it almost binary deviate only by additional zero distances. Therefore the maximum distance of the unique path from $s$ to $t$ does not change.

We next determine the chains of $T$, i.e., the maximal paths of $T$ such that all inner nodes have exactly one child. This can be done by list ranking in logarithmic time with a linear workload [3]. Note that each edge of $T$ belongs to exactly one chain. In the case that both end vertices of the edge $e$ of $T$ are not of degree two, the chain containing $e$ consists only of this edge. For each chain $c$, let $b(c)$ be the vertex in $c$ that is farthest away from the root of $T$ (i.e., the leaf of $c$) and $f(c)$ be the vertex of $c$ that is closest to the root of $T$ (the root of $c$).

Now we proceed as in the tree contraction procedure as described in [2] (see also [1]).

1. We number the leaves from left to right.

2. For each odd numbered leaf $v$, we remove $v$ and the inner vertices of the chain $c(v)$ that contains $v$ and make $c(v)$ a line.

3. For each $c(v)$, we concatenate the two chains of the remaining tree $T$ that contain the root $f(c(v))$ of $c(v)$ to one chain.

4. We renumber the leaves of $T$ by dividing their numbers by two.

We repeat this procedure until the only remaining vertex of $T$ is the root.

By the same argument as in [2], the chains that are removed have pairwise no vertex in common, because only chains associated to odd numbered leaves are removed. For the same reason, there are no three chains that are concatenated to one chain at the same time. Therefore there is no writing conflict and also no reading conflict. The procedure is repeated $O(\log n)$ times, and since one application of the procedure requires only constant time, the whole procedure needs $O(\log n)$ time. The workload is, as in [2], $O(n)$.

Since only logarithmically many steps are necessary to eliminate the whole tree $T$, one can reach the root from each vertex of the original tree $T$ by passing logarithmically many lines. Therefore we get the following result.

LEMMA 7. One can split the edge set of $T$ into lines, such that the unique path from any vertex $t$ of $T$ to the root of $T$ passes $O(\log n)$ many lines, in $O(\log n)$ time with a workload of $O(n)$.

3.2. Finding $\mathrm{par}(v)$ in $l(a(v))$

Let $w(v) := d(v, p(v))$. $\mathrm{par}(v)$ is the first ancestor of $a(v)$, say $y$, such that $w(v) < w(y)$.

For each line $l$, let $S_l$ be a balanced binary tree with the edges of $l$ as leaves. The leaves appear in $S_l$ in the same order as on the line $l$. For any inner node $t$ of $S_l$, let $D(t)$ be the maximum distance of an edge of $l$ that is a descendant of $t$. Denote the interval of edges of $l$ that are descendants of $t$ by $I_t$. Note that $D(t)$ is the maximum distance of an edge in $I_t$. Our strategy is as follows. We first search for the next inclusion maximal interval $I_t$, say $I_{t(v)}$, that is right from $a(v)$ with $w(v) < D(t)$. Then by binary search in $I_{t(v)}$, we determine the next right ancestor edge $e := e(v)$ with $w(v) < d(e)$, and we get $\mathrm{par}(v)$.

Determine $t(v)$ ($I_{t(v)}$). Let $f := a(v)\,p(a(v))$ be the parent edge of $a(v)$. First we search for the first ancestor $s(v)$ of $f$ that is a right child. If $D(s(v)) > w(v)$ then $t(v) := s(v)$. Otherwise we search for the next ancestor $t$ of $s(v)$ that is a left child and update $s(v)$ as the right sibling of $t$. We repeat this step until $D(s(v)) > w(v)$. Afterward we set $t(v) := s(v)$.

Determine $e(v)$. Let $t_1$ be the left child and $t_2$ be the right child of $t(v)$. If the maximum distance of an edge in $I_{t_1}$, $D(t_1)$, is larger than $w(v)$ then update $t(v)$ by $t_1$, else update $t(v)$ by $t_2$. Repeat this step until $t(v)$ is an edge of $l$. Finally, $e(v) := t(v)$.

Note that $\mathrm{par}(v)$ is the child vertex of $e(v)$. It is easily seen that for each $v$ both steps can be done in logarithmic time. The space needed for each $v$ separately is constant. Moreover, the tree $S_l$ can be determined in logarithmic time with a linear workload. Therefore we find $\mathrm{par}(v)$ in logarithmic time with a linear processor number in linear space. This completes the proof of Proposition 1 and of the theorem.
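The search in $S_l$ is, in effect, a "first larger element to the right" query on a balanced tree that stores interval maxima. The following sketch shows one common way to implement such a query sequentially. It descends from the root and prunes subtrees whose maximum is too small, which is a top-down variant rather than the bottom-up walk described above. The class name `MaxTree` and the array layout are assumptions made for the illustration.

```python
class MaxTree:
    """Balanced binary tree over a list of edge weights; D[t] is the maximum below t."""

    def __init__(self, weights):
        self.n = len(weights)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        # Leaves live at positions size .. 2*size-1; padding leaves hold -infinity.
        self.D = [float("-inf")] * (2 * self.size)
        self.D[self.size:self.size + self.n] = weights
        for t in range(self.size - 1, 0, -1):
            self.D[t] = max(self.D[2 * t], self.D[2 * t + 1])

    def first_greater(self, start, w):
        """Smallest index i >= start with weights[i] > w, or None if there is none."""
        return self._search(1, 0, self.size, start, w)

    def _search(self, t, lo, hi, start, w):
        # Node t covers the index interval [lo, hi).
        if hi <= start or self.D[t] <= w:
            return None               # entirely left of start, or nothing large enough
        if hi - lo == 1:
            return lo                 # reached a leaf that qualifies
        mid = (lo + hi) // 2
        left = self._search(2 * t, lo, mid, start, w)
        if left is not None:
            return left
        return self._search(2 * t + 1, mid, hi, start, w)


# Edge weights along a hypothetical line, listed from the bottom of the line to its root.
tree = MaxTree([2, 1, 5, 3, 7, 4])
```

With the weights above, a query starting at position 0 with threshold 2 stops at position 2 (weight 5), and a query starting at position 3 with threshold 5 stops at position 4 (weight 7).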

4. A PARALLEL AND A LINEAR TIME ALGORITHM TO DETERMINE OVERLAP COMPONENTS

We assume that a collection $C$ of subsets of $V$ is given. The basic strategy is as follows. We determine, for each $c \in C$, a set $\mathrm{Max}(c) \in C$ that overlaps with $c$ and which is of maximum cardinality. $\mathrm{Max}(c)$ is only defined if its size is at least as large as the size of $c$. The following result is essential to get all overlap components in an appropriate time and processor bound.

LEMMA 8. If $|c| \le |d| \le |\mathrm{Max}(c)|$ and $c \cap d \neq \emptyset$ then $d$ overlaps with $c$ or with $\mathrm{Max}(c)$.

Proof. If $d$ does not overlap with $c$ then $c \subset d$. Since $c$ and $\mathrm{Max}(c)$ have a nonempty intersection, so do $d$ and $\mathrm{Max}(c)$. Therefore if $d$ and $\mathrm{Max}(c)$ do not overlap then $d \subseteq \mathrm{Max}(c)$. But then $c \subseteq \mathrm{Max}(c)$. This is a contradiction, because $c$ and $\mathrm{Max}(c)$ overlap.

4.1. Determining the Overlap Components if Max Is Known

Define $C_x$ to be the set of $c \in C$ such that $x \in c$. Moreover, we assume that $C$ is sorted in increasing order with respect to size and each $C_x$ is sorted in the same increasing order with respect to size, i.e., $C_x = \{c_1^x, \ldots, c_k^x\}$ and, for $i < j$, $|c_i^x| \le |c_j^x|$. We discuss later how to get such a sorting in linear time.

We use these sortings to construct a graph $G_C$ with vertex set $C$ with an edge number that is not larger than the sum of $|V|$ and the sum of the sizes of $c \in C$. This graph should have the same connected components as the overlap components of $C$.

The edge set $E_C$ of $G_C$ consists of the edges $c_i^x c_{i+1}^x$ with the property that there is a $j \le i$ with $|\mathrm{Max}(c_j^x)| \ge |c_{i+1}^x|$.

PROPOSITION 2. 1. $|E_C| \le \sum_{c \in C} |c|$.

2. The connected components of $G_C := (C, E_C)$ are exactly the overlap components of $C$.

Proof of Proposition. It is easily seen that the size of $E_C$ is bounded by the sum of the sizes of the sets $C_x$. But this is also the sum of the sizes of $c$, $c \in C$.

To prove the second part, we have to show the following.

LEMMA 9. 1. If $c_i^x c_{i+1}^x \in E_C$ then $c_i^x$ and $c_{i+1}^x$ are in the same overlap component.

2. If $c$ and $d \in C$ overlap then they are in the same connected component of $G_C$.

Proof of Lemma. The first part of the lemma is proved as follows. If $c_i^x c_{i+1}^x \in E_C$ then there is a $j \le i$ with $|\mathrm{Max}(c_j^x)| \ge |c_{i+1}^x|$. Note that $|c_j^x| \le |c_i^x| \le |c_{i+1}^x| \le |\mathrm{Max}(c_j^x)|$. Since $c_j^x$ and $\mathrm{Max}(c_j^x)$ overlap and, by Lemma 8, $c_i^x$ and $c_{i+1}^x$ overlap with $c_j^x$ or with $\mathrm{Max}(c_j^x)$, $c_i^x$ and $c_{i+1}^x$ are in the same overlap component.

To prove the second part of the lemma, we pick an $x \in c \cap d$. We assume w.l.o.g. that $c = c_i^x$, $d = c_j^x$, and $i < j$. Since $c$ and $d$ overlap, $|\mathrm{Max}(c)| = |\mathrm{Max}(c_i^x)| \ge |d| = |c_j^x|$. Therefore, for all $l$ with $i \le l < j$, there is an $l' \le l$ (that is, $i$) with $|\mathrm{Max}(c_{l'}^x)| \ge |c_{l+1}^x|$. That means that for all $l$ with $i \le l < j$, $c_l^x c_{l+1}^x \in E_C$. Therefore $c = c_i^x$ and $d = c_j^x$ are in the same connected component of $G_C$.

This completes the proof of Proposition 2.

It remains to check the complexity of computing $E_C$. Let $n := |V| + |C|$ and $m := \sum_{c \in C} |c|$. The size of the input $(V, C)$ is $n + m$.

PROPOSITION 3. $E_C$ can be computed by an EREW-PRAM in logarithmic time with a linear workload, and therefore in linear time, provided $\mathrm{Max}(c)$ for each $c \in C$ and the sortings of $C$ and of the sets $C_x$, $x \in V$, are known.

Proof of Proposition. For each $c_i^x$, we compute the maximum $|\mathrm{Max}(c_j^x)|$ with $j \le i$ and denote it by $\mathrm{MAX}(c_i^x)$. We get this in logarithmic time with a linear workload by parallel prefix computation (see for example [21]; compare also [27]). To check that $c_i^x c_{i+1}^x \in E_C$, one only has to check that $|c_{i+1}^x| \le \mathrm{MAX}(c_i^x)$.

COROLLARY 1. The overlap components of $C$ can be computed in linear time sequentially and in parallel by a CRCW-PRAM in logarithmic time with a linear processor number, provided Max and the sortings of $C$ and the sets $C_x$ are given.

Proof. This follows immediately from Proposition 3 and the fact that connected components can be computed in the time bounds as mentioned in the corollary (for the sequential case see any textbook on algorithms, e.g., [9], and for the parallel case see [31]).

4.2. The Tree Structure of Overlap Components

First we show the following lemma, which has been mentioned before.

LEMMA 10. $(\mathrm{Overlap}(C), \subset)$ is a tree-like ordering. For each overlap component $C_1$ of $C$, there is a unique overlap component $C_2$ such that $C_1 \subset C_2$ and for no overlap component $C_3$, $C_1 \subset C_3 \subset C_2$, if and only if there is an overlap component $C_1'$ with $C_1 \subset C_1'$.

Proof of Lemma. First observe that if $d$ overlaps with $\bigcup_{c \in C_1} c$ then it overlaps with some $c$ in the overlap component $C_1$. Therefore for any two overlap components, say $C_1$ and $C_2$, either $\bigcup_{c \in C_1} c$ and $\bigcup_{c \in C_2} c$ are disjoint or they are comparable with respect to the subset relation. Note that if $C_1 \subset C_2$ then $\bigcup_{c \in C_1} c \subset \bigcup_{c \in C_2} c$. Also, the converse is true, for the following reason: Suppose $\bigcup_{c \in C_1} c \subset \bigcup_{c \in C_2} c$. Since no $c \in C_2$ overlaps with $d_1 := \bigcup_{c \in C_1} c$, there are the possibilities that all $c \in C_2$ are disjoint with $d_1$ or there is a $c \in C_2$ that contains $d_1$ as a subset. The first possibility is impossible, because $d_1 = \bigcup_{c \in C_1} c \subset \bigcup_{c \in C_2} c$. Therefore one gets a unique "minimal" component $C_2$ that "contains" $C_1$ if there is a component $C_1'$ that contains $C_1$.

The component $C_2$ as mentioned in the last lemma is also called the parent component of $C_1$. The tree that corresponds to the tree-like ordering $\subset$ on the overlap components of $C$ is called the overlap tree of $C$ and is denoted by $T_C$.

PROPOSITION 4. The overlap tree of $C$ can be determined in constant time by a CRCW-PRAM with a linear number of processors, provided the overlap components of $C$ and the sortings of $C$ and the sets $C_x$ with respect to the sizes are known. That means the overlap tree of $C$ can be determined in linear time sequentially.

Proof of Proposition. Let $c_i^x$ be defined as above. We have to proceed as follows. If $c_i^x$ and $c_{i+1}^x$ are in different overlap components, say $C_1$ and $C_2$, then we set $\mathrm{parent}(C_1) := C_2$. Note that $c_{i+1}^x$ is the smallest $c \in C$ that contains $\bigcup_{c \in C_1} c$, and therefore must be in the "next larger" overlap component that contains $\bigcup_{c \in C_1} c$ as a subset. Therefore $C_2$ as determined above is really the parent component of $C_1$. There might be a writing conflict caused by different $c_i^x \in C_1$. But all these processors write the same value. The number of processors is linear, and the parallel time is constant.

4.3. Determining Max

First we describe an algorithm that works in $O(n^2)$ time. Afterwards we transform it into a linear time algorithm that can be parallelized.

First we sort $C$ in increasing order with respect to size into a sequence $(c_1, \ldots, c_k)$. This can be done in linear time by bucket sorting (see for example [9]). This can also be done in logarithmic time with a linear processor number by an EREW-PRAM [8].

Next we sort $V$ lexicographically with respect to $C_x$, where the largest $c \in C_x$ has the highest priority, i.e., if $x \notin c_i$, $y \in c_i$ and, for all $j > i$, either $x$ and $y$ are in $c_j$ or $x$ and $y$ are not in $c_j$, then $x < y$. This again can be done in linear time. It also can be done in logarithmic time with a linear processor number by a CRCW-PRAM. Note that one comparison needs one time unit by a CRCW-PRAM. Combining this with the algorithm of [8], one gets a logarithmic time bound by a CRCW-PRAM.

To determine $\mathrm{Max}(c)$, we use the following observation.

LEMMA 11. Let $c_i$ be the element of $C$ with the highest index $i$ that overlaps with $c \in C$. Then for all $x \in c \setminus c_i$ and all $y \in c \cap c_i$, $x < y$.

Proof. Note that for all $j > i$, $c \subset c_j$ or $c \cap c_j = \emptyset$. Therefore all $x \in c$ do not distinguish in the membership of any $c_j$, $j > i$. Therefore if $x \in c \setminus c_i$ and $y \in c \cap c_i$, $x < y$.

Therefore one can determine $\mathrm{Max}(c)$ as follows. We determine the smallest $x \in c$, called $\min(c)$, and the largest $x \in c$, called $\max(c)$. $\mathrm{Max}(c)$ is the $c_i$ with the highest index containing $\max(c)$ and not containing $\min(c)$. Finally, we have to check that $|\mathrm{Max}(c)| \ge |c|$. Obviously this can be done in quadratic time.

Note that $\min(c)$ and $\max(c)$ can be determined in linear time and also in parallel by an EREW-PRAM in logarithmic time with a linear workload. To determine $\mathrm{Max}(c)$, we determine the lowest index $i'$ such that $\min(c)$ and $\max(c)$ do not distinguish in the membership in $c_j$, $j \ge i'$.

We assume that $V$ is sorted to a sequence $(v_1, \ldots, v_l)$ with $v_i < v_j$ for $i < j$.
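The quadratic-time variant can be sketched as follows. The code sorts $C$ by size, orders the elements of $V$ by their membership vectors with the largest set first, and then scans the sets from the highest index downwards for one that contains $\max(c)$ but not $\min(c)$. The function name, the use of `None` for an undefined $\mathrm{Max}(c)$, and the example collection are assumptions made for the illustration; all sets are assumed to be nonempty and pairwise distinct.

```python
def max_partner(collection):
    """Map each set c to Max(c), or to None if Max(c) is undefined."""
    # c_1, ..., c_k in increasing order of size (stable for equal sizes).
    order = sorted(dict.fromkeys(map(frozenset, collection)), key=len)

    def key(x):
        # Membership vector of x, largest set first (highest priority).
        return tuple(x in c for c in reversed(order))

    result = {}
    for c in order:
        lo, hi = min(c, key=key), max(c, key=key)   # min(c) and max(c)
        best = None
        for cand in reversed(order):                # highest index first
            if hi in cand and lo not in cand:
                best = cand
                break
        # Max(c) is only defined if it is at least as large as c.
        result[c] = best if best is not None and len(best) >= len(c) else None
    return result


partner = max_partner([{1, 2}, {2, 3}, {1, 2, 3, 4}, {5, 6}, {5}])
```

In this example `{1, 2}` and `{2, 3}` are each other's partner, while the remaining sets have no overlapping set of at least their own size.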

Define the barrier of $i$ to be $(i, b(i))$ where the height $b(i)$ is defined to be

$$b(i) := \max\{\, j \mid v_i \text{ and } v_{i+1} \text{ distinguish the membership in } c_j \,\}.$$

The barriers can be determined in linear time sequentially and in constant time with a linear processor number by a CRCW-PRAM and therefore in logarithmic time with a linear processor number by an EREW-PRAM.

LEMMA 12. Suppose $i < j$. Then the maximum $i'$ such that $v_i$ and $v_j$ distinguish in the membership of $c_{i'}$ is the maximum $b(j')$ with $i \le j' < j$.

Proof. Let $q$ be the maximum $b(j')$ such that $i \le j' < j$. Note that all $v_{j'}$ with $i \le j' < j$ do not distinguish in the membership of $c_{i'}$, $i' > q$. Moreover, let $q'$ be the maximum index such that $v_i$ and $v_j$ distinguish in the membership of $c_{q'}$. Suppose $x$ distinguishes in the membership of $c_{q''}$, $q'' > q'$, with $v_i$ and therefore also with $v_j$. (We assume $q''$ is a maximum index.) Then either $x < v_i$ (if $v_i \in c_{q''}$ and $x \notin c_{q''}$) or $x > v_j$. That means the index of $x$ is not between $i$ and $j$. Therefore all $v_{j'}$, $i \le j' < j$, do not distinguish in the membership in $c_{q''}$, $q'' > q'$. This proves the lemma.
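The barrier heights and the statement of Lemma 12 can be checked on a small example. The sketch below computes the lexicographic order of $V$ and the heights $b(i)$ directly from their definitions, by scanning all sets for every adjacent pair, which is slower than the linear-time bound stated above. Set indices are 1-based as in the text, positions in `ranked` are 0-based, and a height of 0 stands for "no separating set". All names and the example collection are illustrative.

```python
def rank_elements(order):
    """Sort V lexicographically by membership, the largest set having highest priority."""
    universe = set().union(*order)
    return sorted(universe, key=lambda x: tuple(x in c for c in reversed(order)))


def separating_index(order, x, y):
    """Highest 1-based index j such that exactly one of x, y lies in c_j (0 if none)."""
    return max((j for j, c in enumerate(order, 1) if (x in c) != (y in c)), default=0)


def barrier_heights(order, ranked):
    """b[i] is the barrier height between ranked[i] and ranked[i + 1]."""
    return [separating_index(order, ranked[i], ranked[i + 1])
            for i in range(len(ranked) - 1)]


# c_1, ..., c_5 in increasing order of size.
order = sorted(map(frozenset, [{5}, {1, 2}, {2, 3}, {5, 6}, {1, 2, 3, 4}]), key=len)
ranked = rank_elements(order)
b = barrier_heights(order, ranked)
```

For any two positions `i < j`, the highest set index separating `ranked[i]` and `ranked[j]` should equal `max(b[i:j])`, which is the content of Lemma 12.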

Next we build up a tree structure with the elements of $V$ as leaves and the barriers $(i, b(i))$ as inner nodes.

LEMMA 13. There are no two $i < j$ with $b(i) = b(j)$ and, for all $i'$ with $i < i' < j$, $b(i') < b(i)$.

Proof. Let $q = b(i) = b(j)$. Then, since $v_i < v_{i+1}$ and $v_i$ and $v_{i+1}$ do not distinguish in the membership for $c_p$, $p > q$, $v_i \notin c_q$ and $v_{i+1} \in c_q$. For the same reason $v_j \notin c_q$ and $v_{j+1} \in c_q$. Since all barriers between $i$ and $j$ are of smaller height than $q$, $v_{i+1}$ and $v_j$ do not distinguish in the membership in $c_q$. This is a contradiction.

For each $v_i$, let $\mathrm{left}(v_i) := (i-1, b(i-1))$ and $\mathrm{right}(v_i) := (i, b(i))$. For a barrier $(i, b(i))$, let $\mathrm{left}(i, b(i))$ be the barrier $(j, b(j))$ of maximum index $j < i$ with $b(i) \le b(j)$ and $\mathrm{right}(i, b(i))$ be the $(j, b(j))$ of minimum index $j > i$ such that $b(j) \ge b(i)$. Note that, because of Lemma 13, the heights $b(j)$ of $\mathrm{left}(i, b(i))$ and of $\mathrm{right}(i, b(i))$ are always greater than the height $b(i)$ of $(i, b(i))$.

LEMMA 14 [34]. The functions left and right can be determined in linear time and in logarithmic time with a linear workload by a CREW-PRAM.

The tree $T$ with parent function Pa is defined as follows: $\mathrm{Pa}(t)$ is the barrier $\mathrm{left}(t)$ or $\mathrm{right}(t)$ of smaller height, and if $\mathrm{left}(t)$ or $\mathrm{right}(t)$ is not defined then $\mathrm{Pa}(t)$ is that of $\mathrm{left}(t)$ or $\mathrm{right}(t)$ that is defined. $\mathrm{left}(t)$ and $\mathrm{right}(t)$ have different heights because of Lemma 13.
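Sequentially, the functions left and right for the barriers can be obtained with two stack scans, and Pa then follows by comparing heights. The sketch below is one such sequential realization and is not claimed to be the procedure of [34]. Barriers are identified by their index $i$, the dictionary `b` maps indices to heights, `None` stands for "not defined", and the heights are assumed to satisfy the condition of Lemma 13. The example heights are made up.

```python
def nearest_not_lower(b):
    """For each barrier i: nearest j < i and nearest j > i with b[j] >= b[i]."""
    idx = sorted(b)
    left, right = {}, {}
    stack = []
    for i in idx:                          # scan from left to right
        while stack and b[stack[-1]] < b[i]:
            stack.pop()
        left[i] = stack[-1] if stack else None
        stack.append(i)
    stack = []
    for i in reversed(idx):                # scan from right to left
        while stack and b[stack[-1]] < b[i]:
            stack.pop()
        right[i] = stack[-1] if stack else None
        stack.append(i)
    return left, right


def barrier_tree_parent(b):
    """Pa(i): whichever of left(i), right(i) is defined and has the smaller height."""
    left, right = nearest_not_lower(b)
    pa = {}
    for i in b:
        candidates = [j for j in (left[i], right[i]) if j is not None]
        pa[i] = min(candidates, key=lambda j: b[j]) if candidates else None
    return pa


# Hypothetical barrier heights b(1), ..., b(5).
b = {1: 1, 2: 5, 3: 2, 4: 3, 5: 2}
left, right = nearest_not_lower(b)
pa = barrier_tree_parent(b)
```

In this example barrier 2 has the greatest height and becomes the root, barrier 4 is the parent of barriers 3 and 5, and barrier 1 hangs directly below the root.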

PROPOSITION 5. The barrier $(p, b(p))$ of highest height $b(p)$ with $i \le p < j$ is the least common ancestor of $v_i$ and $v_j$ in $T$.

Proof. For technical reasons, let $\mathrm{left}(t) := (0, \infty)$ if $\mathrm{left}(t)$ is not defined and $\mathrm{right}(t) := (k, \infty)$ if $\mathrm{right}(t)$ is not defined ($k$ is the largest index for the elements $v \in V$).

One gets a canonical ordering of $V \cup B$ ($B$ is the set of barriers) by the sequence

$$(0, \infty),\ v_1,\ (1, b(1)),\ v_2,\ (2, b(2)),\ \ldots,\ (k-1, b(k-1)),\ v_k,\ (k, \infty).$$

We say that $v_n$ is between $b(i)$ and $b(j)$ if it is between $(i, b(i))$ and $(j, b(j))$ in the canonical ordering, i.e., $i < n \le j$. Analogously, we say that a barrier $(n, b(n))$ is between $b(i)$ and $b(j)$ if $i < n < j$.


LEMMA 15. $(i, b(i))$ is an ancestor of $v_j$ if and only if $v_j$ is between $\mathrm{left}(i, b(i))$ and $\mathrm{right}(i, b(i))$.

Proof of Lemma. Suppose $v_j$ is between $\mathrm{left}(i, b(i))$ and $\mathrm{right}(i, b(i))$. Then w.l.o.g. we may assume that $v_j$ is between $\mathrm{left}(i, b(i))$ and $(i, b(i))$. Note that the height of $Pa(t)$ is always greater than the height of $t$. By induction one can easily show that if $t$ is an ancestor of $v_j$ and the height of $t$ is smaller than the height $b(i)$ then $t$ is between $\mathrm{left}(i, b(i))$ and $(i, b(i))$ (if the height of $Pa(t)$ is less than $b(i)$ then $Pa(t)$ is between $\mathrm{left}(i, b(i))$ and $t$ if $Pa(t) = \mathrm{left}(t)$, and between $t$ and $(i, b(i))$ otherwise). Consider now the ancestor $t$ of $v_j$ with the highest height smaller than $b(i)$. Then $\mathrm{left}(t) = \mathrm{left}(i, b(i))$ and $\mathrm{right}(t) = (i, b(i))$ (it cannot be a barrier of smaller height). Then clearly $Pa(t) = (i, b(i))$, and therefore $(i, b(i))$ is an ancestor of $v_j$.

Vice versa, let $(i, b(i))$ be an ancestor of $v_j$. Consider any ancestor $t$ of $v_j$. It is easily seen that if $Pa(t) = \mathrm{left}(t)$ then $\mathrm{right}(Pa(t)) = \mathrm{right}(t)$ and $\mathrm{left}(Pa(t))$ is clearly left from $\mathrm{left}(t)$. In any case, if $v_j$ is between $\mathrm{left}(t)$ and $\mathrm{right}(t)$ then $v_j$ is also between $\mathrm{left}(Pa(t))$ and $\mathrm{right}(Pa(t))$. This proves the other direction of the lemma.

The rest of the proof of the proposition works as follows. Consider the barrier $t = (q, b(q))$ of highest height between $v_i$ and $v_j$. Then $v_i$ is between $\mathrm{left}(t)$ and $t$, and $v_j$ is between $t$ and $\mathrm{right}(t)$, and therefore both are between $\mathrm{left}(t)$ and $\mathrm{right}(t)$; i.e., $t$ is an ancestor of $v_i$ and $v_j$. Suppose $t'$ is a barrier of smaller height than $t$. Then $\mathrm{left}(t')$ is right from $t$ or equals $t$, or $\mathrm{right}(t')$ is left from $t$ or equals $t$. If $t$ were not the least common ancestor of $v_i$ and $v_j$ then there would be a common ancestor $t'$ of smaller height than $t$. This is a contradiction. (Proposition)

This proposition reduces the problem of determining Max to the problem of determining the least common ancestor. We have to determine a linear number of least common ancestors of $(\min(c), \max(c))$ and we have a tree of linear size. This allows us to determine $\mathrm{Max}(c)$, for all $c$ simultaneously, by an EREW-PRAM in logarithmic time with a linear workload [31], and therefore sequentially in linear time.

PROPOSITION 6. Max can be determined by an EREW-PRAM in logarithmic time with a linear workload with respect to the size of the input $(V, C)$ and therefore sequentially in linear time.

The overall result of this section is therefore the following.

THEOREM 2. The overlap components can be determined in linear time sequentially and by a CRCW-PRAM in logarithmic time with a linear number of processors.


5. EFFICIENT PARALLEL SPLIT DECOMPOSITION ALGORITHM AND ITS TRANSFORMATION INTO A LINEAR TIME ALGORITHM

5.1. Formulation of the Problem

A split of the graph $G = (V, E)$ is a partition of $V$ into two subsets $V_1$ and $V_2$ with at least two elements, such that all vertices in $V_1$ that have neighbors in $V_2$ have the same neighbors in $V_2$.
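The definition can be checked directly on small graphs. The following brute-force sketch assumes a graph stored as a dict from each vertex to the set of its neighbors; the name `is_split` is chosen here for illustration and the routine is not part of the algorithm developed in this section.

```python
def is_split(adj, v1):
    """Return True if (v1, V \\ v1) is a split of the graph `adj`.

    `adj` maps every vertex to the set of its neighbors (undirected graph).
    """
    v1 = set(v1)
    v2 = set(adj) - v1
    if len(v1) < 2 or len(v2) < 2:
        return False
    # Collect the V2-neighborhoods of those V1-vertices that have any
    # neighbor in V2; a split requires them all to coincide.
    crossing = {frozenset(adj[x] & v2) for x in v1 if adj[x] & v2}
    return len(crossing) <= 1
```

For the path a-b-c-d the partition ({a, b}, {c, d}) passes the check, while no partition of a cycle on five vertices does.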

Split decomposition is the following recursive procedure.

• If $G$ has a split into subsets $V_1$ and $V_2$, we apply split decomposition to graphs $G_1$, $G_2$ that are defined as follows: The vertex sets of $G_1$ and $G_2$ are $V_1 \cup \{v_1\}$ and $V_2 \cup \{v_2\}$ respectively. The additional vertices $v_1$ and $v_2$ are called virtual vertices. The edge set of $G_1$ ($G_2$) consists of the edges of $G$ restricted to $V_1$ ($V_2$) and the edges $wv_1$ ($wv_2$), such that $w$ is in the neighborhood of $V_2$ ($V_1$) in $V_1$ ($V_2$).

• If $G$ does not have a split then $G$ is called prime.
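One step of this recursive procedure can be sketched as follows, assuming the caller already knows that $(V_1, V_2)$ is a split. The adjacency-dict encoding, the function name `decompose_along_split`, and the default labels `"m1"` and `"m2"` for the virtual vertices are assumptions of this illustration; the labels must not collide with existing vertex names.

```python
def decompose_along_split(adj, v1, m1="m1", m2="m2"):
    """Build G1 and G2 for a given split (v1, V \\ v1).

    Each side keeps its induced edges and gains one virtual vertex that is
    joined to exactly those vertices having neighbors on the other side.
    """
    v1 = set(v1)
    v2 = set(adj) - v1
    u1 = {x for x in v1 if adj[x] & v2}  # neighborhood of V2 in V1
    u2 = {y for y in v2 if adj[y] & v1}  # neighborhood of V1 in V2
    g1 = {x: (adj[x] & v1) | ({m1} if x in u1 else set()) for x in v1}
    g2 = {y: (adj[y] & v2) | ({m2} if y in u2 else set()) for y in v2}
    g1[m1] = u1
    g2[m2] = u2
    return g1, g2
```

Applying the function repeatedly to the resulting graphs until no split remains would mimic the recursion, but finding the splits efficiently is the actual subject of the remainder of this section.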

We call the final graphs created by the split decomposition of $G$ split components of $G$. Note that cliques and stars (one center vertex joined by an edge with each vertex of an independent set) are not uniquely split decomposable. But Cunningham proved the following [12].

THEOREM 3. Each connected graph has a unique split decomposition into prime graphs, stars, and cliques with a minimum number of split components.

Modules are splits of special type. By a module of $G = (V, E)$, we mean a subset $V'$ of $V$ such that with $y \in V \setminus V'$ and $u, v \in V'$, both $uy$ and $vy$ are in $E$ or none of $uy$ and $vy$ is in $E$.
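The module condition translates into a one-line test per outside vertex. This is a direct, quadratic-time sketch on an adjacency dict; the name `is_module` and the convention that the empty set is not reported as a module are choices made for this illustration.

```python
def is_module(adj, m):
    """Return True if the vertex set `m` is a module of the graph `adj`.

    Every vertex outside `m` must be adjacent to all of `m` or to none of it.
    """
    m = set(m)
    if not m:
        return False  # convention for this sketch only
    outside = set(adj) - m
    return all(len(adj[y] & m) in (0, len(m)) for y in outside)
```

In a star with center c and leaves x, y, z, any set of leaves is a module, whereas {a, b} is not a module of the path a-b-c-d because c sees b but not a.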

A graph $G = (V, E)$ with more than two vertices is called modularly prime if the only modules are $V$ and the subsets of $V$ that contain exactly one vertex.

A module $X \subseteq V$ is called a prime module if the graph that results from the identification of all vertices that are in the same maximal submodule of $X$ is modularly prime. A module $X$ is degenerated if $X$ can be partitioned into submodules $X_1, \ldots, X_k$, such that either for all $X_i$, $X_j$, all vertices of $X_i$ are adjacent with all vertices of $X_j$, or for all $X_i$, $X_j$, no vertex of $X_i$ is adjacent with any vertex of $X_j$. If all submodules $X_i$, $X_j$ are pairwise adjacent, $X$ is called positively degenerated; otherwise the degenerated module $X$ is called negatively degenerated. Note that two modules overlap, i.e., the intersection and both differences are not empty, only if they are both degenerated. We call a degenerated module overlap-free if it does not overlap with another module. For each overlap-free module $M \ne V$, there exists exactly one minimal overlap-free module $M'$ such that $M$ is a proper subset of $M'$. $M'$ is also called the parent module of $M$. Note that with $P(M)$ = parent module of $M$, a parent function of a root-directed tree $T_G$ with $V$ as its root is defined. We call $T_G$ also the modular tree of $G$.
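The overlap relation used above, and the overlap components built from it in the preceding section, can be illustrated by a naive quadratic sketch. It assumes that overlap components are the connected components of the overlap relation on a family of sets; the function names are hypothetical, and the code makes no attempt at the linear-time bound of Theorem 2.

```python
def overlaps(a, b):
    """Two sets overlap if the intersection and both differences are nonempty."""
    a, b = set(a), set(b)
    return bool(a & b) and bool(a - b) and bool(b - a)


def overlap_components(family):
    """Group the indices of `family` into components of the overlap relation."""
    sets = [set(s) for s in family]
    comp = list(range(len(sets)))  # union-find parent array

    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]  # path halving
            i = comp[i]
        return i

    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if overlaps(sets[i], sets[j]):
                comp[find(i)] = find(j)

    groups = {}
    for i in range(len(sets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Sets that are nested or disjoint never overlap, so a set containing all others forms a component of its own.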

5.2. The Structure of Splits

We assume that $G = (V, E)$ is connected. Let $T_1$ be a spanning tree of $G$ and let $(u_1, \ldots, u_n)$ be the postorder enumeration of $V$. For each $u_i \ne u_n$, let $\mathrm{Par}(u_i)$ be the neighbor $u_j$ of $u_i$ with the maximum index $j$. Let $T_2$ be the spanning tree of $G$ with parent function Par. Let $L_i := \{x \in V \mid \mathrm{Par}(x) = u_i\}$ and $L_{n+1} := \{u_n\}$. Call $L_i$ a layer of $G$.
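The construction of the layers can be sketched sequentially as follows. As an assumption of this illustration, $T_1$ is taken to be a depth-first search tree from a chosen root, with neighbors visited in sorted order so that the result is deterministic; the function name `layers` and the return format are likewise made up here.

```python
def layers(adj, root):
    """Return (index, par, layer) for a connected graph `adj`.

    index[v] is the postorder number of v (1-based), par[v] is the neighbor
    of v with maximum postorder number, and layer[i] is the nonempty set L_i.
    """
    order, seen = [], {root}
    stack = [(root, iter(sorted(adj[root])))]
    while stack:  # iterative DFS producing a postorder
        v, it = stack[-1]
        for w in it:
            if w not in seen:
                seen.add(w)
                stack.append((w, iter(sorted(adj[w]))))
                break
        else:
            order.append(v)
            stack.pop()

    index = {v: i + 1 for i, v in enumerate(order)}
    n = len(order)
    # Par(u_i): neighbor with the largest postorder index (root excluded)
    par = {v: max(adj[v], key=index.__getitem__) for v in order[:-1]}
    layer = {}
    for v, p in par.items():
        layer.setdefault(index[p], set()).add(v)
    layer[n + 1] = {order[-1]}  # L_{n+1} contains only u_n
    return index, par, layer
```

On the 4-cycle a-b-c-d-a rooted at a, the postorder is d, c, b, a, so both b and d have a as their highest-indexed neighbor and end up in the same layer.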

Let $(V_1, V_2)$ be a split of $G$ and $u_n \in V_2$. Let $U_1$ be the set of neighbors of $V_2$ in $V_1$.

LEMMA 16. All vertices of $U_1$ are in the same layer.

Proof. Let $U_2$ be the set of neighbors of $V_1$ in $V_2$. Note that the edges between $V_1$ and $V_2$ are exactly the pairs of vertices $v_1 v_2$ with $v_1 \in U_1$ and $v_2 \in U_2$. Pick the vertex $u_i \in U_2$ with the largest index $i$. No $u_j$ of larger index than $i$ is in $V_1 \cup U_2$, because $u_i$ has been chosen as an element of $U_2$ with maximum index and the unique path on $T_1$ from any $u_l \in V_1$ to $u_n$ must pass a vertex $u_{l'}$ in $U_2$ (which, being an ancestor of $u_l$, has a larger index $l'$ than $u_l$). This means that all $v \in U_1$ are in the neighborhood of $u_i$ but not in the neighborhood of some $u_j$, $j > i$. Therefore $U_1$ is a subset of the layer $L_i$.

Let $W_1 = V_1 \setminus U_1$. Suppose $U_1 \subseteq L_i$.

LEMMA 17. $U_1$ is a module of $G$ restricted to $V \setminus W_1$.

Proof. $W_1$ is the set of all vertices in $V_1$ that have no neighbors in $V_2$. Therefore if we delete the vertices of $W_1$ from $G$, only the vertices in $V_1$ that have neighbors in $V_2$ remain. This is the set $U_1$, and since all vertices in $V_1$ that have neighbors in $V_2$ have the same neighbors in $V_2$, $U_1$ is a module in $G$ restricted to $V \setminus W_1$.

COROLLARY 2. $U_1$ is a module in $G$ restricted to $\bigcup_{j \ge i} L_j$.

LEMMA 18. $W_1$ is a union of connected components of $G$ restricted to $\bigcup_{j < i} L_j$ if $U_1 \subseteq L_i$.

Proof. Note that all neighbors of $W_1$ that are not in $W_1$ are in $U_1$. That means also that all neighbors of $W_1$ are in $V_1$. As in the proof of Lemma 17, we use the fact that for all $u_j \in V_1$, $j < i$, and therefore all $v \in W_1$ are in an $L_j$, $j < i$. On the other hand, to come through a path from a vertex in $W_1$ to a vertex not in $W_1$ one must pass $U_1$ and therefore vertices in $L_i$. Therefore $W_1$ is a union of connected components of $\bigcup_{j < i} L_j$.


COROLLARY 3. Suppose $C$ is a connected component of $G$ restricted to $\bigcup_{j < i} L_j$ that is not in $W_1$. Then either $U_1 \cap N(C) = \emptyset$ or $U_1 \subseteq N(C)$. In the second case, all vertices in $U_1$ have the same neighbors in $C$, i.e., for each vertex $x \in C$, $U_1 \subseteq N(x)$ or $U_1 \cap N(x) = \emptyset$.

THEOREM 4. A partition $(V_1, V_2)$ of the vertex set $V$ of $G$ with $u_1, \ldots, u_n$ as defined before and $u_n \in V_2$ is a split if and only if for the maximum layer $L_i$ with $V_1 \cap L_i \ne \emptyset$,

1. $V_1 \cap L_i$ is a module in $G$ restricted to $\bigcup_{j \ge i} L_j$,

2. $W_1 := V_1 \cap \bigcup_{j < i} L_j$ is a union of connected components of $G$ restricted to $\bigcup_{j < i} L_j$,

3. the set of neighbors of $W_1$ that are not in $W_1$ is a subset of $U_1 := V_1 \cap L_i$, and

4. for all $x \in \bigcup_{j < i} L_j \setminus V_1$, $N(x) \cap V_1 \cap L_i = \emptyset$ or $V_1 \cap L_i \subseteq N(x)$.

Proof. The direction from left to right follows from previous considerations. The other direction can be seen as follows. First observe that all vertices in $V_1$ that have neighbors in $V_2 = V \setminus V_1$ are not in $W_1$ and therefore in $U_1$. Next we use the fact that $U_1$ is a module in $G$ restricted to $\bigcup_{j \ge i} L_j$. Therefore all vertices in $U_1$ have the same neighbors in $\bigcup_{j \ge i} L_j$ that are not in $V_1$ and therefore in $V_2$. Finally, by the last item, all vertices in $U_1$ have the same neighbors in $V_2$ that are in a layer $L_j$, $j < i$.

We say that $U_1$ represents a split if there is a split $(V_1, V_2)$, such that $v_n \in V_2$ and $U_1$ is the set of vertices in $V_1$ that have neighbors in $V_2$.

From the last theorem we get the following.

THEOREM 5. $U_1$ represents a split iff there is an $i$ with $U_1 \subset L_i$, such that

1. $U_1$ is a module in $\bigcup_{j \ge i} L_j$,

2. there is a union $W_1$ of connected components of $\bigcup_{j < i} L_j$, such that all neighbors of $W_1$ that are not in $W_1$ are in $U_1$ and the vertices of $\bigcup_{j < i} L_j \setminus W_1$ are adjacent either to all vertices of $U_1$ or to no vertex of $U_1$.

5.2.1. Stars

Stars come up if there is the following situation: $V$ is decomposed in $V_1, \ldots, V_k$. For each $V_i$, there is a subset $U_i$ such that with $i \ne 1$, all vertices in $U_i$ are adjacent to all vertices in $U_1$.

First assume that $v_n$ is in $V_1$. Then $U_2, \ldots, U_k$ all represent a split and the union of $U_2, \ldots, U_k$ represents a split. Therefore all $U_2, \ldots, U_k$ are in the same level $L_i$ and form a (not necessarily overlap-free) negatively degenerated module of $L_i$. We say that $\bigcup_{j=2}^{k} U_j$ represents a star of the first kind (see also Fig. 2).


FIG. 2. Star of the first kind.

Next assume that $v_n \in V_2$. Then $(\bigcup_{j=3}^{k} V_j, V_1 \cup V_2)$ and $(V_1 \cup \bigcup_{j \ge 3} V_j, V_2)$ are splits. That means $U_1$ represents a split, and all $U_j$ with $j \ge 3$ and $\bigcup_{j \ge 3} U_j$ represent splits. We assume that $U_1$ is in layer $L_i$. The $v_{i'}$ of largest index that is adjacent to some vertex in $U_j$, $j \ge 3$, is in $U_1$. Therefore all $U_j$, $j \ge 3$, are in one layer $L_{i'}$ with $i' < i$. The $U_j$, $j \ge 3$, form a negatively degenerated module in $L_{i'}$. Moreover, all neighbors of the $U_j$, $j \ge 3$, that are in an $L_{i''}$, $i'' \ge i'$, are in $U_1$. We say that $U_1$ and $\bigcup_{j=3}^{k} U_j$ together represent a star of the second kind (see also Fig. 3). $U_1$ is also called the center of the star of the second kind it represents. $U_3, \ldots, U_k$ are called the child representatives of $U_1$.

5.2.2. Cliques

Cliques come up if there is the following situation: $V$ is decomposed in $V_1, \ldots, V_k$. For each $V_i$, there is a subset $U_i$ such that with $i \ne j$, all vertices in $U_i$ are adjacent to all vertices in $U_j$. Without loss of generality, we assume that $v_n \in V_1$. Then $(\bigcup_{j=2}^{k} V_j, V_1)$ and each $(V_j, \bigcup_{j' \ne j} V_{j'})$ with $j \ne 1$ are splits, and therefore $\bigcup_{j=2}^{k} U_j$ and each $U_j$, $j \ne 1$, represent a split. All $U_j$, $j \ge 2$, are in the same layer $L_i$. $\bigcup_{j \ge 2} U_j$ forms a positively degenerated module of $L_i$ with $U_2, \ldots, U_k$ as submodules (not necessarily overlap-free modules). We say that $\bigcup_{j=2}^{k} U_j$ represents a clique.

5.3. An Outline of the Algorithm

The basic strategy is as follows.

1. We first compute the sets $U_1$ that represent a split.

2. We extract stars and cliques.

3. We determine the prime components.


FIG. 3. Star of the second kind.

5.3.1. The Computation of the Sets Representing a Split

First we compute, for each $i$, the set $CC_i$ of connected components of $\bigcup_{j < i} L_j$. A compact representation of the sets $CC_i$ will be discussed in the next subsection. Let $DD_i$ be the set of those $C \in CC_i$ such that all neighbors of $C$ in $\bigcup_{j \ge i} L_j$ are in $L_i$.

For each $L_i$, we compute the set $M_i$ of overlap-free modules of $L_i$. Let $S_i = M_i \cup \{N(C) \cap L_i \mid C \in CC_i\} \cup \{N(x) \mid x \in \bigcup_{j \ne i} L_j\}$. Note that $S_i$ is a multiset.

LEMMA 19. Let $(V_1, V_2)$ be a split decomposition of $G$, $u_n \in V_2$, and $V_2$ be connected. Let $U_1$ be the set of neighbors of $V_2$ in $V_1$ and $U_1 \subseteq L_i$. Then, for each $X \in S_i$, $X$ is a subset of $U_1$, $U_1$ is a subset of $X$, or $X$ and $U_1$ are disjoint. Moreover, if $X$ is a prime module then $X \subseteq U_1$ or $X \cap U_1 = \emptyset$ or there is a child module $X'$ of $X$ such that $U_1 \subseteq X'$, and if $X$ is the neighborhood of some $x \in L_j$, $j > i$, then either $U_1 \cap X = \emptyset$ or $U_1 \subseteq X$.

Proof. Suppose $X = N(x) \cap L_i$ and $x \in L_j$, $j < i$, and $x \in V_1$. Then $x \in V_1 \setminus U_1$ and $N(x) \subseteq U_1$. Otherwise $x$ is joined by an edge with all vertices in $U_1$ or with no vertex in $U_1$, and the vertices in $U_1$ are the only vertices of $V_1$ to which $x$ is joined by edges. Therefore $U_1 \subseteq X$ or $U_1 \cap X = \emptyset$. In any case, if $x \in L_j$, $j > i$, then $x \notin V_1$, and therefore either $U_1 \subseteq X$ or $X \cap U_1 = \emptyset$.

Note that $U_1$ is a module of $L_i$ and therefore does not overlap with any overlap-free module of $L_i$. Therefore $U_1$ is either a subset of a degenerated module that does not overlap with its child modules or it is a prime module. Therefore if $X$ is a prime module, $U_1 \cap X = \emptyset$ or $X \subseteq U_1$ or $U_1$ is a subset of a child module of $X$.

Finally, let $X$ be the neighborhood of a connected component of $\bigcup_{j < i} L_j$ in $L_i$. Then $X$ is the union of some $N(x) \cap L_i$, $x \in L_j$, $j < i$, and therefore $X \cap U_1 = \emptyset$ or $X$ and $U_1$ are comparable with respect to the subset relation.

We continue as follows.

• We compute the overlap components of $S_i$. Note that it is possible that $A, B \in S_i$ which represent the same set might be in different overlap components. This can happen if $\{A\}$ and $\{B\}$ are one-element overlap components. In that case, we consider the overlap components as equivalent but not equal.

• If $x \in C \in CC_i$ then the overlap components of $N(x) \cap L_i$ and $N(C) \cap L_i$ and all of the overlap components in between are combined into one component.

• If an overlap component $A$, i.e., the union of all sets of $A$, is contained in a prime module $X'$ but not contained in a child module of $X'$ then $A$ and the overlap component $A'$ containing $X'$ are unified into one component.

• If a component $A$ contains the neighborhood of some $x \in L_j$, $j > i$, then all components containing $A$ are combined into one component, and if the neighborhood of $x$ is not the only element of $A$ then we put $A$ also into this component. These components are marked as bad components.

• If, for $C \in CC_i$, $N(C) \setminus C$ contains vertices in some $L_j$, $j > i$, then the component $A$ containing $N(C) \cap L_i$ and all its ancestor components but not components equivalent to $A$ are unified into one component, and this component is marked as a bad component.

Components resulting from this procedure are called clusters. The subset relation on clusters is defined as the subset relation on overlap components.


For any cluster $A$, let $L_A$ be the set of connected components $C$ of $\bigcup_{j < i} L_j$ such that the cluster which contains $N(C) \cap L_i$ is a subcluster of $A$ or $A$ itself. Let $U_A$ be the union of all sets in $A$.

LEMMA 20. If $A$ is a cluster then $U_A$ represents a split if and only if $A$ is not a bad component.

Proof. First we show the following.

SUBLEMMA 1. If $A$ is a cluster that is not a bad component then $U_A$ is a module of $\bigcup_{j \ge i} L_j$.

Proof of Sublemma. First we have to show that $U_A$ is a module in $L_i$. By construction, each $U_A$ does not overlap with any overlap-free module of $L_i$. Next suppose $X$ is a prime module and $U_A \subset X$. It cannot be that $U_A$ is not a subset of a child module of $X$. Otherwise $X$ and the sets in $A$ would have been unified into one cluster. Therefore $U_A$ is a module in $L_i$.

To show the sublemma we have to show that for each $x \in L_j$, $j > i$, either all vertices in $U_A$ or no vertex in $U_A$ is a neighbor of $x$. Suppose there is a vertex $x \in L_j$, $j > i$, that is a neighbor of some but not of all the vertices of $U_A$. Then $N(x) \cap L_i$ overlaps with some set in $A$ or is a subset of all sets in $A$. Therefore $A$ would be a bad component.

To show the lemma, we first have to construct a split if $U_A$ satisfies the conditions as stated above. Let $V_1$ be the union of $U_A$ and all connected components $C$ of $\bigcup_{j < i} L_j$ with $N(C) \cap L_i \in A$ or $N(C) \cap L_i$ in a subcluster of $A$. Since $A$ is not a bad component, no such component has neighbors in some $L_j$, $j > i$, and $U_A$ is a module in $\bigcup_{j \ge i} L_j$. The only vertices in $V_1$ that have neighbors outside $V_1$ are therefore the vertices in $U_A$. Suppose $x \in V \setminus V_1$. If $x \in L_j$, $j \ge i$, then it is adjacent to all vertices of $U_A$ or to no vertex of $U_A$, since $U_A$ is a module in $\bigcup_{j \ge i} L_j$. If $x \in L_j$, $j < i$, then $N(x) \cap L_i$ is not in $A$ and not in a subcluster of $A$, and therefore there are only the alternatives that $U_A \subseteq N(x)$ or $U_A \cap N(x) = \emptyset$.

Vice versa, suppose $U_A$ represents a split, say $(V_1, V_2)$. Then $V_1 \setminus U_A$ consists of connected components $C$ of $\bigcup_{j < i} L_j$. If $C \subset V_1$ then all of its neighbors in $L_j$, $j \ge i$, are in $U_A$ and therefore in $L_i$. The neighborhoods of these components $C$ are therefore in $A$ or in a subcluster of $A$. Moreover, they do not make $A$ a bad component. Now consider any $x \in L_j$, $j > i$. Then $U_A \subseteq N(x)$ or $U_A \cap N(x) = \emptyset$, since $U_A$ is a module in $\bigcup_{j \ge i} L_j$. Therefore $U_A \subseteq N(x) \cap L_i$ and therefore $A$ is a subcluster of the cluster containing $N(x) \cap L_i$ or $A = \{N(x) \cap L_i\}$. $A$ is not made a bad component by $N(x)$.


We call a set $U_A$ that represents a split as in the last lemma a cluster split representative and $A$ a split cluster.

Not all sets $U_1$ representing a split are cluster split representatives. If $U_1$ is a degenerated module then it is not necessarily an overlap-free degenerated module, but only a subset of an overlap-free degenerated module of $L_i$, say $m$. Let $A'$ be the cluster containing $m$. Then all $U_A$ with $U_A \subset U_1$ are contained in the same sets of $A'$, and $A'$ is the parent cluster of each such $A$, such that $U_A$ is a maximal subset of $U_1$.

For each $A$ let $\mathrm{Buf}(A)$ be the sets of the parent cluster of $A$ that contain $U_A$, called the buffer of $A$.

LEMMA 21. $U_1$ represents a split if and only if $U_1$ is a cluster split representative or if $U_1$ is a union of cluster split representatives $U_A$, such that their split clusters all have the same buffer $\mathrm{Buf}(A)$, $U_A$ is not the neighborhood of some $x \in \bigcup_{j > i} L_j$ in $L_i$, and if $\mathrm{Buf}(A)$ contains a module then the smallest module in $\mathrm{Buf}(A)$ is not a prime module.

Proof. Suppose $U_1$ is a split representative and represents the split $(V_1, V_2)$. Then $U_1$ is a module (not necessarily overlap-free) of $\bigcup_{j \ge i} L_j$, for some $i$. Let $C_1$ be the set of connected components of $V_1 \setminus U_1$ and $M_1$ be the set of overlap-free modules that are subsets of $U_1$. If $U_1$ is a prime module then $U_1$ itself is a cluster split representative. We assume that $U_1$ is a subset of a degenerated module that does not overlap with its child modules. $U_1$ also does not overlap with any neighborhood $N(c)$ of a connected component of $\bigcup_{j < i} L_j$. Let $A_1, \ldots, A_k$ be the clusters that are contained in $U_1$ and that are maximal with respect to the subcluster relation. This includes also the case that there is only one $A_l$ and $U_{A_l} = U_1$. Then $U_1$ is a cluster split representative. Now assume that all $U_{A_l}$ are properly contained in $U_1$. Note that all $U_{A_l}$ are not bad components and are therefore cluster split representatives. All $U_{A_l}$ are in the neighborhood of the same $x$ not in $V_1$ and therefore have the same buffer. If the buffer of any $A_l$ contains a module then it contains also the smallest overlap-free module that contains $U_1$. This is a degenerated module. Note that each $A_l$ is not the neighborhood of some $x \in \bigcup_{j > i} L_j$ in $L_i$.

Vice versa, let $A_1, \ldots, A_k$ be cluster split representatives with the same buffer $\mathrm{Buf}(A_i)$. First assume $\mathrm{Buf}(A_i)$ does not contain a module. We have to show that the smallest overlap-free module $m$ that contains all $U_{A_i}$ is a degenerated module. If $m$ is a prime module then the cluster containing $\mathrm{Buf}(A_i)$ is unified with $m$ and therefore $\mathrm{Buf}(A_i)$ contains a module.

We assume that $U_{A_i}$ represents the split $(V_1^i, V_2^i)$ and $V_1^i$ is maximal under this condition. That means $V_1^i$ contains all connected components $c$ of $\bigcup_{j < i} L_j$ with $N(c) \cap L_i \subseteq U_{A_i}$. Since all $A_i$ have the same buffer, every $N(x) \cap L_i$, $x \notin L_i$, is either contained in some $U_{A_l}$, contains all $U_{A_l}$, or is disjoint with all $U_{A_l}$. If $x \in L_j$, $j > i$, then $N(x) \cap L_i$ cannot be a proper subset of any $U_{A_l}$, because each $A_l$ is not a bad component. By assumption, $N(x) \cap L_i$ cannot be equal to some $U_{A_l}$. Therefore such an $x$ is either adjacent to all vertices in the union of the $A_l$ or to none of these vertices. Since each $A_l$ is not a bad component, each $A_l$ and each subcluster of $A_l$ does not contain the neighborhood of some connected component $c$ of $\bigcup_{j < i} L_j$ that has neighbors in an $L_{j'}$, $j' > i$.

With $V_1 = V_1^1 \cup \cdots \cup V_1^k$, $(V_1, V \setminus V_1)$ is a split that is represented by $U_A$.

We can immediately derive the following.

LEMMA 22. If $U_1$ and $\bigcup_{j=2}^{k} U_j$ together represent a star of the second kind then there are $i_1 < i_2$ such that $U_1 = U_A$, for some cluster $A$ in $L_{i_2}$; all $U_j$, $j \ge 2$, are of the form $U_{A_j}$ where $A_j$ is a cluster in $L_{i_1}$; all $A_j$ have the same buffer $B$; the buffer of $B$ and any ancestor component do not contain an $N(C) \cap L_{i_1}$ or a prime module or a positively degenerated module; and $U_1$ is a cluster split representative.

LEMMA 23. If $A_2, \ldots, A_k$ are the clusters with buffer $B$, no buffer of any $A_j$ or ancestor cluster of $A_j$ contains a set $N(C) \cap L_{i_1}$ or a prime module or a positively degenerated module, and the neighborhood of any $U_{A_j}$, $j \ge 2$, in $\bigcup_{l \ge i_1} L_l$ is a cluster split representative $U_{A_1}$, then $U_{A_1}$ together with $\bigcup_{j=2}^{k} U_{A_j}$ represents a star of the second kind.

Proof. Note that $\bigcup_{j=2}^{k} U_{A_j}$ represents a split. Moreover, all $U_{A_j}$ have no neighbors in $L_{i_1} \setminus U_{A_j}$, because the buffer $B$ and the buffers of any ancestor cluster of the $A_j$ do not contain a prime or positively degenerated module. If $C$ is a connected component of $\bigcup_{j \le i_1} L_j$ then its neighborhood is either a subset of some $U_{A_l}$ or it is disjoint with all $U_{A_l}$, because the buffer $B$ and the buffer of any ancestor cluster of the $A_l$ do not contain a set $N(C) \cap L_{i_1}$. Since all $U_{A_l}$, $l = 2, \ldots, k$, have as neighbors in $\bigcup_{j \ge i_1} L_j$ exactly the vertices in $U_{A_1}$, we get a split representative of the second kind.

LEMMA 24. Suppose $U_1, \ldots, U_k$ represent a star of the first kind. Then all $U_i$ are cluster split representatives. Assume $U_i = U_{A_i}$, $i = 1, \ldots, k$. Then all the clusters $A_i$ have the same buffer $\mathrm{Buf}(A_i)$ and the smallest module that is in $\mathrm{Buf}(A_i)$ or in the buffer of an ancestor of $A_i$ is a negatively degenerated module.

Proof. Note first that the union of the $U_i$ is a negatively degenerated module. If $U_i$ is not a cluster split representative then also $U_i$ is a negatively degenerated module and it can be split into cluster split representatives $U_i^1, \ldots, U_i^q$. But then also $U_1, \ldots, U_{i-1}, U_i^1, \ldots, U_i^q, U_{i+1}, \ldots, U_k$ represent a star of the first kind. Since the union of the $U_i$ represents a split (it is not necessarily a cluster split representative), all $A_i$ have the same buffer. Since the union of the $U_i$ is a negatively degenerated module, the smallest overlap-free module containing the union of the $U_i$ is negatively degenerated. This is the smallest module in the buffer of any $A_i$ or in any ancestor cluster of $A_i$.

LEMMA 25. If clusters $A_1, \ldots, A_k$ have the same buffer $\mathrm{Buf}(A_i)$, the smallest module in the buffer of the $A_i$ or an ancestor cluster of $A_i$ is a negatively degenerated module, and $A_1, \ldots, A_k$ are not a part of the star of the second kind, then $U_{A_1}, \ldots, U_{A_k}$ represent a star of the first kind.

Proof. Since the clusters $A_1, \ldots, A_k$ have a buffer, they are not unified with the root cluster; therefore they are not bad components and therefore they represent splits. The union of the $U_{A_l}$ is a negatively degenerated module, because the smallest overlap-free module containing the $U_{A_l}$ is negatively degenerated. Note that the union of the $U_{A_l}$ represents a split, because all the $A_l$ have the same buffer.

LEMMA 26. Suppose $U_1, \ldots, U_k$ represent a clique. Then all $U_i$ are cluster split representatives, say $U_i = U_{A_i}$, and all $A_i$ have the same buffer $\mathrm{Buf}(A_i)$. Moreover, the smallest module that is in the buffer of $A_i$ or of an ancestor of $A_i$ is a positively degenerated module.

Proof. Whenever we have a clique component, we have a decomposition of the vertex set $V$ of $G$ in $V_1, \ldots, V_{k+1}$, $U_l \subseteq V_l$, $l = 1, \ldots, k+1$, and all vertices in different $U_l$ are pairwise joined by an edge. There are no other edges between different $V_l$. Without loss of generality, we assume that $v_n \in V_{k+1}$. Since $(V_1 \cup \cdots \cup V_k, V_{k+1})$ is a split, all $U_1, \ldots, U_k$ are in the same layer $L_i$. The union of the $U_l$, $l = 1, \ldots, k$, is a positively degenerated module and therefore the smallest overlap-free module containing the $U_l$ is positively degenerated. Therefore the smallest module in the buffer of any $A_l$ or of an ancestor cluster is a positively degenerated module.

Vice versa, clusters $A_l$ as mentioned in the last lemma represent cliques.

LEMMA 27. Suppose $A_1, \ldots, A_k$ have the same buffer, and that the smallest module in the buffer of the $A_l$ or in the buffer of an ancestor cluster of the $A_l$ is positively degenerated. Then $U_{A_1}, \ldots, U_{A_k}$ represent a clique.

Proof. Since the buffers of the $A_l$ are defined, the $A_l$ are not identified with the root cluster and the $A_l$ are not bad components. Since the smallest module containing the $U_{A_l}$ is positively degenerated, all vertices in different $U_{A_l}$ are pairwise joined by an edge. Each $A_l$ represents a split $(V_l, W_l)$. We assume that $V_l$ is maximal under the condition that $U_{A_l}$ represents a split $(V_l, W_l)$. That means each connected component $c$ of $\bigcup_{j < i} L_j$ is in $V_l$, or the neighborhood of $c$ in $L_i$ is disjoint with $U_l$, or the neighborhood of $c$ in $L_i$ contains all the $U_{A_l}$. The union of the $U_{A_l}$ represents a split $(\bigcup_{l=1}^{k} V_l, W)$. Let $U$ be the set of vertices in $W$ that have neighbors not in $W$. $U$ is just the set of neighbors of the union of the $U_{A_l}$ that are not in the sets $V_l$. All the vertices that are in different $U_{A_l}$, $W$ are pairwise joined by an edge. Therefore $U_{A_1}, \ldots, U_{A_k}$ represent a clique.

Algorithmically we proceed as follows.

1. We compute the clusters as mentioned above and select the good clusters, i.e., those clusters that are not bad.

2. We select the $U_1, U_2, \ldots, U_k$ representing a star of the second kind as follows. Let $A_2, \ldots, A_k$ be the clusters that have the same buffer. First we check that the buffers of $A_j$ and of any ancestor of $A_j$ contain only negatively degenerated modules and sets of the form $N(x) \cap L_i$, $x \in L_l$, $l > i$; i.e., they do not contain any module that is prime or positively degenerated and they do not contain any set $N(C) \cap L_l$, $l < i$. Then we check that the neighborhood of all $U_j$ in $\bigcup_{l \ge i} L_l$ is a cluster split representative.

3. We select the $U_1, \ldots, U_k$ representing stars of the first kind as follows. Again let $A_1, \ldots, A_k$ be the clusters that have the same buffer $\mathrm{Buf}(A_j)$. We assume that $A_1, \ldots, A_k$ do not pass all the checks of the previous step. To represent a star of the first kind, one only has to check that the smallest module in the buffer of $A_j$ or of an ancestor of $A_j$ is a negatively degenerated module.

4. To select $U_1, \ldots, U_k$ that represent a clique, one determines $A_1, \ldots, A_k$ with the same buffer, such that the smallest module that is in the buffer of $A_j$ or of an ancestor cluster of $A_j$ is a positively degenerated module.

5.3.2. Computation of the Connected Components of $\bigcup_{j < i} L_j$

We assign each edge $xy$ of $G$ a distance $d(x, y)$ that is the maximum $i$ with $x \in L_i$ or $y \in L_i$, i.e., if $x \in L_i$ and $y \in L_j$ then $d(x, y)$ is the maximum of $i$ and $j$.

LEMMA 28. The connected components of $\bigcup_{j < i} L_j$ are the single link clusters $C$ of $d$ that are of coarseness $i - 1$.


Proof. Note that an edge $xy$ joins two vertices that are in $\bigcup_{j < i} L_j$ if and only if $d(x, y) \le i - 1$. Therefore the single link clusters of coarseness $i - 1$ coincide with the connected components of $\bigcup_{j < i} L_j$.
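Lemma 28 suggests a simple sequential way to obtain all these components at once: insert the edges in order of increasing distance $d$ into a union-find structure and read off the clusters after each threshold. The sketch below is such a sequential illustration, not the CRCW-PRAM procedure; the function name, the `layer_of` dict (vertex to layer index), and the requirement that vertex labels be mutually comparable are assumptions made here.

```python
def single_link_components(adj, layer_of):
    """Return {t: set of frozensets}, where the sets for key t are the single
    link clusters of coarseness t, i.e. the connected components of the
    union of all layers L_j with j <= t."""
    parent = {v: v for v in adj}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # each undirected edge once, keyed by d(x, y) = max of the two layers
    edges = sorted((max(layer_of[x], layer_of[y]), x, y)
                   for x in adj for y in adj[x] if x < y)

    clusters, pos = {}, 0
    for t in sorted(set(layer_of.values())):
        while pos < len(edges) and edges[pos][0] <= t:
            _, x, y = edges[pos]
            parent[find(x)] = find(y)
            pos += 1
        groups = {}
        for v in adj:
            if layer_of[v] <= t:
                groups.setdefault(find(v), set()).add(v)
        clusters[t] = {frozenset(g) for g in groups.values()}
    return clusters
```

The connected components of $\bigcup_{j < i} L_j$ are then found under key $i - 1$ (or under the largest key below $i$ if layer $i - 1$ is empty). Regrouping all vertices at every threshold keeps the sketch short but is not linear overall.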

To get the connected components of $\bigcup_{j < i} L_j$, for all $i$, we perform a single link clustering on the distance function $d(x, y) = \max\{j : x \in L_j \text{ or } y \in L_j\}$. This can be done in logarithmic time with $O(n + m)$ processors on a CRCW-PRAM.

To determine the neighborhood of a connected component $C$ of $\bigcup_{j < i} L_j$, we proceed as follows:

For any edge $xy$ with $x \in L_j$, $y \in L_i$, and $j < i$, we determine the largest single linkage cluster $C = C_{(x,y)}$ that contains $x$ but not $y$ and put an edge $Cy \in E'$. Note that for all $xy \in E$ simultaneously, all $C_{(x,y)}$ can be determined in logarithmic time with a linear workload by determining the least common ancestor $D_{x,y}$ of $x$ and $y$ in the single linkage clustering. $C_{(x,y)}$ is just the child of $D_{x,y}$ that is an ancestor of $x$ (see [1]).

Note that the number of edges in $E'$ is bounded by the number of edges of $G$ and that $E'$ can be determined in $O(\log n)$ time with a linear workload.

LEMMA 29. For a connected component $C$ of $\bigcup_{j < i} L_j$, $N(C) \cap L_i = \{y \in L_i \mid Cy \in E'\}$.

Proof. $y \in N(C) \cap L_i$ is equivalent to the statement that there is an $x \in C$ with $xy \in E$. Since $C$ is a connected component of $\bigcup_{j < i} L_j$, it is a maximal single link cluster of coarseness $i - 1$ and therefore a maximal single link cluster containing $x$ but not $y$. Therefore $Cy \in E'$. Vice versa, suppose $Cy \in E'$ and $y \in L_i$. There is an $x \in C$ with $xy \in E$. Note that the coarseness of $C$ is less than $i$. Otherwise $y$ would belong to $C$. $C$ is a maximal cluster of coarseness $< i$ that contains $x$ and is therefore a connected component of $\bigcup_{j < i} L_j$.

COROLLARY 4. If there is a $y \in L_i$ with $Cy \in E'$ then $C$ is a connected component of $\bigcup_{j < i} L_j$.

To get the sets $N(C) \cap L_i$ one determines the edge set $E'$ with $Cy \in E'$ if there is an $x \in C$, such that $C$ is the maximum single linkage cluster containing $x$ and not $y$. For each $L_i$ that contains $y$ with $Cy \in E'$, $N(C) \cap L_i = \{y \in L_i \mid Cy \in E'\}$.

Next we check whether $C$ is bad in $L_i$, i.e., $C$ has neighbors in $L_i$ and in an $L_j$, $j > i$.

For each single linkage cluster $C$, we determine the maximum $i$ such that $C$ has a neighbor in $L_i$, say $p_C$.

Clearly, $C$ is bad in $L_i$ if there is a $y \in L_i$ with $Cy \in E'$ and $i < p_C$. $p_C$ can be determined simultaneously, for all $C$, in logarithmic time with a linear workload by tree contraction (see [1]).
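For a single component the badness test can be written down directly. The sketch below recomputes the neighbors of $C$ by scanning its vertices instead of using the edge set $E'$ and tree contraction, so it only illustrates the definition; `neighbor_layers`, `is_bad_in`, and the `layer_of` dict are names assumed for this example.

```python
def neighbor_layers(adj, layer_of, comp):
    """Group the neighbors of the vertex set `comp` that lie outside `comp`
    by layer index: the value for key i corresponds to N(C) ∩ L_i."""
    comp = set(comp)
    by_layer = {}
    for x in comp:
        for y in adj[x] - comp:
            by_layer.setdefault(layer_of[y], set()).add(y)
    return by_layer


def is_bad_in(adj, layer_of, comp, i):
    """C is bad in L_i if it has a neighbor in L_i and i < p_C, where p_C is
    the largest layer index containing a neighbor of C."""
    by_layer = neighbor_layers(adj, layer_of, comp)
    return i in by_layer and max(by_layer) > i
```

In a cycle on five vertices a, b, c, d, e with a depth-first layering rooted at a, the component {d} of the lowest layers has one neighbor in $L_4$ and one in $L_5$, so it is bad in $L_4$ but not in $L_5$.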


5.4. Determining Splits

The basic idea is to traverse along the split representatives. Let $U$ be a split representative of a split $(V_1, V_2)$ with $U \subseteq V_1$. Then we provide nodes $v_{in}^{U}$ and $v_{out}^{U}$. $v_{in}^{U}$ is joined by an edge with all vertices in $U$ and $v_{out}^{U}$ is joined by an edge with all neighbors of $U$ in $V_2$.

LEMMA 30. If $U$ is a split representative and not the center of a star of the second kind then there is a unique split $(V_1, V_2)$ with $v_n \in V_2$ and $U$ is the set of vertices in $V_1$ that have neighbors in $V_2$. The neighborhood of $U \subseteq L_i$ in $V_2$ consists of

1. all vertices in $\bigcup_{j \ge i} L_j$ that are neighbors of $U$ and not in $U$, and

2. all vertices $x \in \bigcup_{j < i} L_j$ such that $x$ is a neighbor of all vertices in $U$ and $U$ is the union of a proper subcluster or a buffer of a proper subcluster of the cluster that contains $N(x) \cap L_i$. We also call $x$ an outer neighbor of $U$.

Ž .Proof. Assume that U represents more than one split V , V and1 2Ž .W , W . Then also1 2

V , l W , V j WŽ .1 1 2 2

and

V j W , V l WŽ .1 1 2 2

and splits represented by U. Let W be the intersection of all V with the1 1Ž . Xproperty that U represents a split V , V , and let W be the union of all1 2 1Ž . XV such that U represents a split V , V . All vertices in W _W _U that1 1 2 1 1

have neighbors in U have the same neighbors in U and have no otherX Ž .neighbors in W . Therefore with X s W _W _U, X , V _ X is a split.1 1 1 1 1 1

Note that X is a union of connected components of D L if U : L ,1 j- i j i

say C , . . . , C . All vertices in any C that have neighbors outside C have1 p l lexactly the vertices in U as neighbors. Therefore U is the center of a starof the second kind. This is a contradiction.

Ž .Assume U is not the center of a star and V , V is the split represented1 2by U. Then the neighborhood U in V is clearly determined as in the2lemma.

The basic idea of a split decomposition algorithm is as follows: For each cluster split representative $U$ we introduce a vertex $v^U_{\mathrm{in}}$ that is joined by an edge with all vertices in $U$ and a vertex $v^U_{\mathrm{out}}$ that is joined by an edge with all neighbors of $U$ not in $V_1$, where $(V_1, V_2)$ is the split represented by $U$. If the split is unique then we can do this. If this is not the case then $U$ is a center of a star of the second kind.

If $U$ is the center of a star of the second kind and $U_1, \ldots, U_l$ are the child representatives of $U$ then we split $U$ formally into two representatives, $U_{\mathrm{down}}$ and $U_{\mathrm{up}}$. The vertices in $U_1, \ldots, U_l$ are considered as outer neighbors of $U_{\mathrm{down}}$ but not as outer neighbors of $U_{\mathrm{up}}$. $U_{\mathrm{down}}$ is considered as a child representative of $U_{\mathrm{up}}$.

We also have to parallelize this procedure.

1. We join $v^U_{\mathrm{in}}$ and all $v^{U'}_{\mathrm{out}}$ where $U'$ is a maximal subset split representative of $U$ or a vertex of the same layer $L_i$ that is in $U$ but not in a smaller split representative than $U$.

2. If $xy \in E$ and $x, y \in L_i$, let $U_x$ be the maximum split representative that contains $x$ but not $y$ and $U_y$ be the maximum split representative that contains $y$ but not $x$. We join $v^{U_x}_{\mathrm{out}}$ and $v^{U_y}_{\mathrm{out}}$ by an edge.

3. Suppose $x \in L_i$, $y \in L_j$, $i < j$, and $xy \in E$. Let $U_x$ be the maximum split representative containing $x$ and $U_y$ be the maximum split representative containing $y$ and having $x$ as an outer neighbor. We join $v^{U_x}_{\mathrm{out}}$ and $v^{U_y}_{\mathrm{out}}$ by an edge.

THEOREM 6. The decomposition procedure as described above determines the unique split decomposition into stars, cliques, and prime components.

Proof. We also could proceed as follows. We first eliminate the stars. Then we get the maximal stars. Then we eliminate the cliques. Then we get the maximal clique components. Finally, we only have to determine the prime components in the components we got by the decompositions before. They are unique in any case. Therefore we get the unique split decomposition.

5.5. Complexity Analysis

When we compute overlap components, we also always use overlap-free modules of $L_i$ as sets. To get a linear time bound and a linear processor bound in parallel, we have to show the following.

THEOREM 7. In any graph $G$ with $n$ vertices and $m$ edges, the number of pairs $(x, M)$, such that $x \in M$ and $M$ is an overlap-free module, is bounded by $n + m$.

Proof. Suppose $M$ is a prime module or a positively degenerated module. Then the number of vertices in $M$ is bounded by the number of edges that join two vertices in $M$ that are not in a common child module of $M$. They are exactly those edges $xy$ such that the smallest overlap-free module containing $x$ and $y$ is $M$. Therefore the number of pairs $(x, M)$, such that $x \in M$ and $M$ is an overlap-free module that is not a negatively degenerated module, is bounded by the number of edges.

Now suppose $M$ is a negatively degenerated module. We first assume that $M$ is not the whole graph. Then $M$ has a parent module $M'$ that is not a negatively degenerated module. Therefore there are edges that join all vertices in $M$ with a vertex in $M' \setminus M$. They are just those edges $xy$ with $x \in M$ for which the smallest overlap-free module containing $x$ and $y$ is $M'$. The number of vertices in $M$ is bounded by the number of edges $xy$, $x \in M$, $y \in M' \setminus M$. Every such edge $xy$ joins at most two child modules of $M'$ that are negatively degenerated. Therefore one can bound the number of pairs $(x, M)$, such that $x \in M$ and $M$ is an overlap-free negatively degenerated module $\neq V$, by the number of edges of $G = (V, E)$.

If $M = V$ then $|M|$ is bounded by $n$. Only one overlap-free module is $V$.

We also have to compute the modules of each $L_i$.

LEMMA 31 [17]. Modular decomposition can be done in $O(\log^2 n)$ time with $O(n + m)$ processors on a CRCW-PRAM.

Recall that, for a connected component $C$ of $\bigcup_{j<i} L_j$ and a $y \in L_i$, $Cy \in E'$ if and only if $y$ is in the neighborhood of $C$. We determined these connected components of $\bigcup_{j<i} L_j$ by the single linkage method which consists of the computation of a minimum spanning tree and the application of the single linkage algorithm. The overall complexity is in parallel $O(\log^2 n)$ time with a processor number of $O(n + m)$ on a CREW-PRAM [6]. To determine whether $Cy \in E'$, we compute, for each edge $xy$, the largest component $C$ that contains $x$ but not $y$. That means, in the cluster tree of the single linkage clustering we compute the least common ancestor $C'$ of $x$ and $y$ and the child $C$ of $C'$ that is an ancestor of $x$. This can be done, for all edges $xy$ simultaneously, in logarithmic time with a linear processor number [33] on a CREW-PRAM. Immediately we get the following result.

LEMMA 32. $E'$ and therefore the neighborhood of any connected component of $\bigcup_{j<i} L_j$ in $L_i$ can be computed in $O(\log^2 n)$ time with $O(n + m)$ processors on a CREW-PRAM. The size of $E'$ does not exceed the number $m$ of edges.
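The least-common-ancestor step that precedes Lemma 32 can be sketched sequentially as follows. The sketch uses naive pointer climbing in the cluster tree rather than the parallel technique of [33], so it does not attain the stated bounds. The names `largest_cluster_with_x_without_y` and `build_e_prime`, and the `parent` dictionary encoding of the cluster tree (leaves are vertices, the root maps to `None`), are assumptions made for illustration.

```python
def largest_cluster_with_x_without_y(parent, x, y):
    """Return the child of lca(x, y) that is an ancestor of x (possibly x itself).

    parent: dict mapping each node of the cluster tree to its parent;
            the root maps to None. x and y are distinct leaves of one tree.
    """
    # Collect y and all of its ancestors.
    ancestors_of_y = set()
    node = y
    while node is not None:
        ancestors_of_y.add(node)
        node = parent[node]
    # Climb from x until the next step would reach an ancestor of y.
    cluster = x
    while parent[cluster] not in ancestors_of_y:
        cluster = parent[cluster]
    return cluster


def build_e_prime(parent, directed_edges):
    """For each ordered pair (x, y), record (C, y) with C as defined above."""
    return {
        (largest_cluster_with_x_without_y(parent, x, y), y)
        for x, y in directed_edges
    }
```

Edges are passed as ordered pairs `(x, y)`, so the caller decides which endpoint plays the role of $y \in L_i$.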

As a consequence of the last lemma and the last theorem we get the following.

COROLLARY 5. The overlap components of modules, of neighborhoods of vertices in $\bigcup_{j \neq i} L_j$, and of neighborhoods of connected components of $\bigcup_{j<i} L_j$ can be determined by a CREW-PRAM in $O(\log^2 n)$ time with $O(n + m)$ processors.

To check whether a connected component $C$ of $\bigcup_{j<i} L_j$ is a bad component, i.e., has a neighbor in $L_k$, $k > i$, we only determine the maximum $k$, such that $C$ has a neighbor in $L_k$. This can be done as follows. For each vertex $x$, we determine the maximum $k_x$, such that $x$ has a neighbor in $L_{k_x}$. We determine by tree contraction the maximum $k_x$ with $x \in C$. This can be done in logarithmic time with a linear workload [1].

To collect overlap components to clusters we are given a set of pairs $(c_1, c_2)$ where $c_2$ is an ancestor overlap component of $c_1$, and we collect $c_1$, $c_2$, and all ancestors of $c_1$ that are descendants of $c_2$ to one component.

1. If $c_1$ is the overlap component containing $N(x) \cap L_i$ and $x \in \bigcup_{j<i} L_j$, then $c_2$ is the overlap component that contains $N(C) \cap L_i$ where $C$ is the connected component of $\bigcup_{j<i} L_j$ that contains $x$.

2. Let $c = N(x) \cap L_i$ and $x \in \bigcup_{j>i} L_j$. If the overlap component containing $c$ has more than one element then $c_1$ is the overlap component containing $c$, and $c_2$ is the root overlap component. If $c$ is the only element of its overlap component then $c_1$ is the parent overlap component of $c$ and $c_2$ is the root overlap component.

3. Let $c$ be a module, $c'$ be the parent module of $c$, and $c'$ be a prime module. Suppose $c$ and $c'$ are in different overlap components. If the overlap component containing $c$ has more than one element then $c_1$ is the overlap component containing $c$ and $c_2$ is the overlap component containing $c'$. If $c$ is the only element of an overlap component then $c_1$ is the parent overlap component of $c$ and $c_2$ is the overlap component containing $c'$.

4. If $C$ is a bad component then $c_1 := N(C) \cap L_i$ and $c_2$ is the root overlap component.

We can compute the set Pairs of these pairs $(c_1, c_2)$ in logarithmic time with a linear workload on an EREW-PRAM.

The collection of overlap components to clusters is done as follows. First we check whether the component $c$ and its parent component $c'$ have to be collected to one cluster as follows. We compute, for each overlap component $c$, the "size" $\mathrm{Si}(c)$, i.e., the number of descendants. Then, for each $c$, we compute the maximum size $M(c)$ of an overlap component $c_2$ with $c_1 = c$ and $(c_1, c_2) \in \mathrm{Pairs}$. Then, for each overlap component $c$ we determine the maximum $M(c')$, such that $c'$ is a descendant of $c$ or $c' = c$, say $M'(c)$. All these data can be computed in logarithmic time with a linear workload by tree contraction [1].

LEMMA 33. $c$ and its parent component belong to the same cluster if and only if $\mathrm{Si}(c) < M'(c)$.

Proof. Note that $c$ and its parent belong to the same cluster if and only if there is a $(c_1, c_2) \in \mathrm{Pairs}$, such that $c_1 = c$ or $c_1$ is a descendant of $c$ and $c_2$ is the parent of $c$ or an ancestor of the parent of $c$. This is equivalent to the statement that there is a $(c_1, c_2) \in \mathrm{Pairs}$, such that $c_1 = c$ or $c_1$ is a descendant of $c$ and $\mathrm{Si}(c_2) > \mathrm{Si}(c)$. This is again equivalent to the statement that $\mathrm{Si}(c) < M'(c)$.

Finally, we have to compute the clusters. For each overlap component $c$, let $\mathrm{anc}(c)$ be the next ancestor overlap component with the property that its parent component does not belong to the same cluster (it is possible that $\mathrm{anc}(c) = c$). anc can be determined with linear workload in logarithmic time by an EREW-PRAM using list ranking on the Euler cycle of the overlap components tree where each edge is replaced by a double edge (see for example [21]). We identify each overlap component $c$ with $\mathrm{anc}(c)$.
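The criterion of Lemma 33 and the computation of $\mathrm{anc}$ can be sketched sequentially. The sketch below replaces tree contraction and list ranking by plain tree traversals, so it only illustrates which quantities are computed, not the parallel bounds. The function name `cluster_representatives`, the `parent` dictionary (the root maps to `None`), and the convention that $\mathrm{Si}(c)$ counts proper descendants are assumptions of this example.

```python
def cluster_representatives(parent, pairs):
    """Return anc as a dict: component -> representative of its cluster.

    parent: dict mapping each overlap component to its parent (root -> None).
    pairs:  iterable of (c1, c2), where c2 is an ancestor of c1.
    """
    children = {c: [] for c in parent}
    root = None
    for c, p in parent.items():
        if p is None:
            root = c
        else:
            children[p].append(c)

    # Preorder: every component appears after its parent.
    order, stack = [], [root]
    while stack:
        c = stack.pop()
        order.append(c)
        stack.extend(children[c])

    # Si(c): number of proper descendants of c.
    size = {c: 0 for c in parent}
    for c in reversed(order):
        if parent[c] is not None:
            size[parent[c]] += size[c] + 1

    # M(c): largest Si(c2) over all pairs (c, c2); -1 if there is none.
    m = {c: -1 for c in parent}
    for c1, c2 in pairs:
        m[c1] = max(m[c1], size[c2])

    # M'(c): maximum of M over c and all descendants of c.
    m_sub = dict(m)
    for c in reversed(order):
        if parent[c] is not None:
            m_sub[parent[c]] = max(m_sub[parent[c]], m_sub[c])

    # Lemma 33: c belongs to the cluster of its parent iff Si(c) < M'(c).
    anc = {}
    for c in order:
        p = parent[c]
        anc[c] = anc[p] if p is not None and size[c] < m_sub[c] else c
    return anc
```

For a tree with root 0, children 1 and 2, children 3 and 4 below 1, and 5 below 3, the single pair `(5, 1)` merges 5, 3, and 1 into one cluster represented by 1, while 0, 2, and 4 remain singletons.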

To extract representatives of stars and of cliques, we have to compute the buffers of any cluster. Given a cluster $c$, we pick a vertex $v_c$ that appears in some set in $c$. Note that $\mathrm{Buf}(c)$ is the set of sets in the parent cluster $\mathrm{Par}(c)$ of $c$ that contain $v_c$. In any case, we know the set $S(v_c)$ of sets in $S_i$ that contain $v_c$. When we select $v_c$ we only have to select those sets in $S(v_c)$ that appear in the parent cluster of $c$. What we need is to find out the collection of clusters with the same buffer. This can be done by lexicographic sorting. Sequentially, this can be done in linear time. In parallel, this can be done by a CRCW-PRAM in logarithmic time with a linear processor number and therefore by a CREW-PRAM in $O(\log^2 n)$ time with a linear processor number.
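Collecting clusters with the same buffer by lexicographic sorting can be illustrated as follows. This is a comparison-based sketch and therefore not linear time; the function name `group_by_buffer` and the representation of a buffer as an iterable of set identifiers are assumptions for this example.

```python
from itertools import groupby


def group_by_buffer(buffers):
    """Group clusters that have identical buffers.

    buffers: dict mapping a cluster name to an iterable of set identifiers.
    Returns a list of groups; each group is a list of cluster names.
    """
    # Normalize each buffer to a sorted tuple so equal buffers compare equal,
    # then sort the (buffer, cluster) pairs lexicographically.
    keyed = sorted((tuple(sorted(buf)), cluster) for cluster, buf in buffers.items())
    return [
        [cluster for _, cluster in group]
        for _, group in groupby(keyed, key=lambda pair: pair[0])
    ]
```

After sorting, clusters with equal buffers are adjacent, so one linear scan (here `groupby`) separates the groups.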

Next we have to check that the clusters $c_1, \ldots, c_k$ with the same buffer form together a negatively degenerated module, i.e., the smallest ancestor buffer of all $c_i$ contains a negatively degenerated module. We only have to label each cluster $c$ by a 1 if $\mathrm{Buf}(c)$ contains a module. We determine the first ancestor of any $c$, called $\mathrm{ANC}(c)$, such that its buffer contains a module. Note that the modules in any buffer are ordered by inclusion. We select the smallest module $\mathrm{Modul}(\mathrm{ANC}(c))$ in $\mathrm{Buf}(\mathrm{ANC}(c))$. If this module is a negatively degenerated module for $c = c_i$ then the clusters $c_1, \ldots, c_k$ form a negatively degenerated module and are therefore candidates for a star representative.

$\mathrm{ANC}(c)$ can be found in linear time sequentially and in logarithmic time with a linear workload on an EREW-PRAM (use list ranking and Eulerian cycle techniques; see [21]). To find the smallest module in the buffer of $\mathrm{ANC}(c)$, one needs logarithmic time and a linear workload (standard minimum computation).

To check whether $c_1, \ldots, c_k$ represent the lower layer components of a star of the second kind, one has to check that all ancestor buffers do not contain certain sets in $S_i$. Here we label clusters with a 1 if they contain forbidden sets. Otherwise we label a cluster with a 0. One has to check that all ancestor clusters are labelled by a 0. One only has to determine, for each cluster, the sum of labels of its ancestors. This can be done by Eulerian cycle techniques in logarithmic time with a linear workload (see for example [21]). One also has to check that the neighborhood of each $c_i$ in a higher layer is a split representative. First one has to check that all these neighbors are of the same layer. Then one determines the smallest cluster containing these neighbors. If this cluster is a good cluster then we determine the number of underlying vertices and compare this number with the number of neighbors of each $c_i$ in this layer. This is a procedure that can be done in logarithmic time and linear workload on an EREW-PRAM.
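The sum of ancestor labels can be illustrated with a sequential version of the Euler tour idea: a label is added when the tour enters a cluster and subtracted when it leaves, so a prefix sum at the entry position covers exactly the clusters on the path to the root. The parallel algorithm would obtain the prefix sums by list ranking; this sketch uses an ordinary scan. The function name `ancestor_label_sums` and the `children` dictionary are assumptions of the example.

```python
from itertools import accumulate


def ancestor_label_sums(children, root, label):
    """For each node, sum the labels of its proper ancestors.

    children: dict mapping a node to a list of its children (may omit leaves).
    label:    dict mapping each node to 0 or 1.
    """
    tour = []    # signed labels in Euler tour order
    entry = {}   # position at which the tour enters each node
    stack = [(root, False)]
    while stack:
        node, leaving = stack.pop()
        if leaving:
            tour.append(-label[node])
            continue
        entry[node] = len(tour)
        tour.append(label[node])
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    prefix = list(accumulate(tour))
    # The prefix sum at the entry includes the node's own label; remove it.
    return {node: prefix[pos] - label[node] for node, pos in entry.items()}
```

A cluster passes the check described above exactly when its ancestor sum is 0.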

Finally, we have to do the split decomposition as described in the last subsection. This can be done in logarithmic time with linear workload on an EREW-PRAM. One only has to follow the algorithmic description and one gets this bound.

As an overall result, we get the following.

THEOREM 8. Split decomposition can be done by a CRCW-PRAM in $O(\log^2 n)$ time with $O(n + m)$ processors. All steps with the exception of determining the connected components of $\bigcup_{j<i} L_j$ and the neighborhoods of these components can be done sequentially in linear time. All steps but modular decomposition can be done by a CREW-PRAM in $O(\log^2 n)$ time with $O(n + m)$ processors.

5.6. Transformation into a Linear Time Algorithm

Since minimum spanning tree computation and the usual single linkage clustering need $O((n + m) \log n)$ workload, we have to circumvent these procedures in some way. On the other hand, we can compute a breadth-first search tree in linear time and therefore the set $K_i$ of vertices that have distance $k - i$ from a fixed vertex $v_n$. Here $k$ is the maximum distance of a vertex from $v_n$.

LEMMA 34. Let $(V_1, V_2)$ be a split with $v_n \in V_2$ and $U_1$ be the set of vertices in $V_1$ that have neighbors in $V_2$. Then $U_1$ is a subset of some $K_i$ and $V_1 \setminus U_1$ is a union of connected components of $\bigcup_{j<i} K_j$.

Proof. The proof is the same as in the hierarchy of $L_i$.

To get a split decomposition, we determine first, for each $i$, the neighborhoods $N(C) \cap K_i$ of connected components $C$ of $\bigcup_{j<i} K_j$. Then we do modular decomposition sequentially. The rest is done as in the parallel algorithm with the $L_i$ hierarchy. Note that the rest can be done in linear time.


LEMMA 35 [19, 29, 11]. Modular decomposition can be done in linear time.

To determine the neighborhoods $N(C) \cap K_i$, we determine for $i = 0, \ldots, k$ the set $CC_i$ of connected components of $K_i$, shrink each component $c \in CC_i$ to one vertex $v_c$, determine the neighborhood of $c$ in $K_{i+1}$, and put $v_c$ into $K_{i+1}$. It is easily seen that at level $i$ each connected component of $\bigcup_{j \le i} K_j$ is shrunk to one vertex. By construction, each edge is either in one level $K_i$ or joins two vertices of consecutive levels $K_i$, $K_{i+1}$. Therefore, for a connected component of $K_i$, say $c$, that is shrunk to one vertex $v_c$, the incident edges of $v_c$ are all the edges between $K_i$ and $K_{i+1}$. Therefore each edge is called once in the shrinking process and once in the process to compute connected components. Therefore the time to compute the neighborhoods $N(C) \cap K_i$ of connected components is linear.
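The layer-by-layer computation can be sketched as follows. As a simplification, the explicit shrinking step is replaced by a union-find structure, so the sketch runs in near-linear rather than linear time; it is meant only to show how the layers $K_i$ and the neighborhoods arise. The graph is assumed to be connected, and the names `bfs_layers` and `component_neighborhoods` as well as the dict-of-lists representation are assumptions of this example.

```python
from collections import defaultdict, deque


def bfs_layers(adj, root):
    """Return layers K_0, ..., K_k, where K_i holds the vertices at distance k - i."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    k = max(dist.values())
    layers = [[] for _ in range(k + 1)]
    for v, d in dist.items():
        layers[k - d].append(v)
    return layers


def component_neighborhoods(adj, root):
    """out[i] maps each component of K_0 ∪ ... ∪ K_i to its neighbors in K_{i+1}."""
    layers = bfs_layers(adj, root)
    level = {v: i for i, layer in enumerate(layers) for v in layer}
    rep = {v: v for v in level}
    members = {v: {v} for v in level}

    def find(v):
        while rep[v] != v:
            rep[v] = rep[rep[v]]  # path halving
            v = rep[v]
        return v

    def union(a, b):
        a, b = find(a), find(b)
        if a == b:
            return
        if len(members[a]) < len(members[b]):
            a, b = b, a
        rep[b] = a
        members[a] |= members.pop(b)

    out = []
    for i in range(len(layers) - 1):
        # Attach the vertices of K_i to the components built so far.
        for u in layers[i]:
            for w in adj[u]:
                if level[w] <= i:
                    union(u, w)
        # Only vertices of K_i can have neighbors in K_{i+1}.
        nbrs = defaultdict(set)
        for u in layers[i]:
            for w in adj[u]:
                if level[w] == i + 1:
                    nbrs[find(u)].add(w)
        out.append({frozenset(members[c]): frozenset(ws) for c, ws in nbrs.items()})
    return out
```

Note the index shift relative to the text: `out[i]` describes the components of $\bigcup_{j \le i} K_j$, i.e., of $\bigcup_{j < i+1} K_j$, together with their neighborhoods in $K_{i+1}$.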

As an overall result we get the following.

THEOREM 9. Split decomposition can be done in linear time.

5.7. Parity Graph Recognition

A parity graph is a graph with the property that for each vertex $x$ and each vertex $y$, all chordless paths from $x$ to $y$ have an odd length, or all chordless paths from $x$ to $y$ have an even length.
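The definition can be tested directly on very small graphs by enumerating chordless paths. The following brute-force sketch takes exponential time and serves only to make the definition concrete; it is unrelated to the recognition algorithms of this section. The function names and the dict-of-sets representation are assumptions of the example.

```python
def chordless_path_parities(adj, x, y):
    """Return the set of parities (0 = even, 1 = odd) of chordless x-y paths."""
    parities = set()

    def extend(path):
        last = path[-1]
        for w in adj[last]:
            if w in path:
                continue
            # w may be adjacent only to the last vertex, otherwise a chord appears.
            if any(w in adj[p] for p in path[:-1]):
                continue
            if w == y:
                parities.add(len(path) % 2)  # the finished path has len(path) edges
            else:
                extend(path + [w])

    extend([x])
    return parities


def is_parity_graph(adj):
    """Check the parity condition for every pair of distinct vertices."""
    vertices = list(adj)
    return all(
        len(chordless_path_parities(adj, x, y)) <= 1
        for i, x in enumerate(vertices)
        for y in vertices[i + 1:]
    )
```

For example, the cycle on five vertices fails the test, since two vertices at distance two are joined by chordless paths of lengths two and three.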

Parity graphs can be characterized as follows.

THEOREM 10 [7]. Parity graphs are exactly those graphs that can be split decomposed into cliques and bipartite graphs.

An immediate consequence is the following.

COROLLARY 6. Parity graphs can be recognized in linear time.

We can also improve the parallel time bound.

LEMMA 36. Let $L_i$ be defined as in the parallel split decomposition algorithm. In a parity graph, each layer $L_i$ is a cograph, i.e., a graph that has no induced path of length 3.

Proof. Suppose $L_i$ has an induced path $(u_1, u_2, u_3, u_4)$ of length three. Then $(u_1, v_i, u_4)$ is a path of length two and the other path is of length three. This is a contradiction to the assumption that we are given a parity graph.

Next we can also characterize cographs as follows.

LEMMA 37 (see for example [10]). A graph is a cograph if and only if all its modules are degenerated modules.


The modular tree of a cograph is identical to its cotree.

LEMMA 38 ([13]; compare also [25]). The cotree of a cograph can be computed in $O(\log^2 n)$ time with a linear processor number on a CREW-PRAM. It can be checked in the same bounds whether a graph is a cograph.
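A simple sequential way to see what a cotree is uses the well-known recursive characterization: a graph with at least two vertices is a cograph exactly when it or its complement is disconnected and all resulting parts are again cographs. The sketch below follows this recursion; it is neither the parallel algorithm of [13] or [25] nor a linear-time method. The node labels `"parallel"` and `"series"`, the function names, and the dict-of-sets representation are assumptions of this example.

```python
def induced_components(vertices, adj):
    """Connected components of the subgraph of `adj` induced by `vertices`."""
    vertices = set(vertices)
    seen, comps = set(), []
    for s in vertices:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            for w in adj[u] & vertices:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def cotree(vertices, adj):
    """Return a cotree of the induced subgraph, or None if it is not a cograph."""
    vertices = set(vertices)
    if len(vertices) == 1:
        return ("leaf", next(iter(vertices)))
    parts = induced_components(vertices, adj)
    label = "parallel"  # disjoint union of the parts
    if len(parts) == 1:
        # The graph is connected, so try the complement instead.
        co_adj = {v: vertices - adj[v] - {v} for v in vertices}
        parts = induced_components(vertices, co_adj)
        label = "series"  # complete join of the parts
        if len(parts) == 1:
            return None  # connected and co-connected: not a cograph
    subtrees = [cotree(part, adj) for part in parts]
    if any(tree is None for tree in subtrees):
        return None
    return (label, subtrees)


def is_cograph(adj):
    return cotree(set(adj), adj) is not None
```

The inner nodes of the returned tree correspond to the degenerated modules mentioned in Lemma 37.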

As a final result, we get the following.

THEOREM 11. Parity graphs can be recognized in $O(\log^2 n)$ time with a linear processor number on a CREW-PRAM.

Proof. We determine first the layers $L_i$. Then we check whether each $L_i$ is a cograph, and if this is the case then we compute, for each $L_i$, a cotree (which is also a modular tree). This part replaces the modular decomposition of each $L_i$. Then in the rest of the split decomposition procedure, we continue as in the general split decomposition algorithm. Finally, we check for each component whether it is a clique or bipartite. The only step in the general split decomposition procedure that could not be done in the time bound of $O(\log^2 n)$ on a CREW-PRAM is the general modular decomposition. But this has been replaced by cograph recognition and cotree computation. It can be checked in the same bounds as the computation of the connected components that a graph is bipartite. This proves the theorem.
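The final test on each component can be illustrated sequentially. The sketch checks the clique property by degree counting and bipartiteness by breadth-first 2-coloring; it does not reproduce the parallel bounds of the proof. The function names and the dict-of-sets representation of a simple graph are assumptions of this example.

```python
from collections import deque


def is_clique(adj):
    """Every vertex is adjacent to all other vertices."""
    n = len(adj)
    return all(len(adj[v] - {v}) == n - 1 for v in adj)


def is_bipartite(adj):
    """2-color each connected component by breadth-first search."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in color:
                    color[w] = 1 - color[u]
                    queue.append(w)
                elif color[w] == color[u]:
                    return False  # an edge inside one color class
    return True


def component_is_admissible(adj):
    """A component passes the parity test if it is a clique or bipartite."""
    return is_clique(adj) or is_bipartite(adj)
```

By Theorem 10, a graph would be accepted as a parity graph when every component of its split decomposition passes `component_is_admissible`.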

Remark 1. There is a previous paper dealing with a parallel algorithm to recognize parity graphs in the same bounds as in the theorem [16]. The algorithm did not use the results of [7].

6. CONCLUSIONS

Undirected split decomposition also became interesting in connection with recognizing circle graphs (see for example [15, 23, 30]). It is well known that it is sufficient to check the circle graph property for the prime components. It might be interesting to find linear time algorithms or almost linear workload parallel algorithms to check whether a certain prime graph is a circle graph.

REFERENCES

1. K. Abrahamson, N. Dadoun, D. Kirkpatrick, and T. Przyticka, A simple parallel tree contraction algorithm, J. Algorithms 10 (1988), 287–302.

2. M. Atallah, M. Goodrich, and S. R. Kosaraju, Parallel algorithms for evaluating sequences of set manipulation operations, J. ACM 41 (1994), 1049–1085.

3. R. Anderson and G. Miller, Deterministic parallel list ranking, Algorithmica 6 (1991), 859–868.

4. A. Barten, "Design of Very Fast Parallel Algorithms in the Combinatorial Optimization," Diploma thesis, RWTH Aachen, 1989 [in German].

5. A. Bouchet, Reducing prime graphs and recognizing circle graphs, Combinatorica 7 (1987), 243–254.

6. F. Chin, J. Lam, and I. Chen, Efficient parallel algorithms for some graph problems, Comm. ACM 25 (1982), 659–665.

7. S. Cicerone and D. Di Stefano, On the extension of bipartite graphs to parity graphs, Discrete Appl. Math. 95 (1999), 181–195.

8. R. Cole, Parallel merge sort, in "Proceedings, 27th IEEE-FOCS, 1986," pp. 511–516.

9. T. Cormen, C. Leiserson, and R. Rivest, "Introduction into Algorithms," MIT Press, Cambridge, MA, 1990.

10. D. Corneil, H. Lerchs, and L. Burlingham, Complement reducible graphs, Discrete Appl. Math. 3 (1981), 163–174.

11. A. Cournier and M. Habib, A new linear algorithm for modular decomposition, in "CAAP '94: 19th International Colloquium," Lecture Notes in Computer Science (Sophie Tison, Ed.), Vol. 787, pp. 68–82, Springer-Verlag, New York/Berlin, 1994.

12. W. Cunningham, Decomposition of directed graphs, SIAM J. Algebraic and Discrete Methods 3 (1982), 214–228.

13. E. Dahlhaus, Efficient parallel algorithms to recognize cographs and distance hereditary graphs, Discrete Appl. Math. 57 (1995), 29–44.

14. E. Dahlhaus, Fast parallel algorithm for the single link heuristics of hierarchical clustering, in "Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, 1992," pp. 184–186.

15. E. Dahlhaus, Fast parallel recognition of ultrametrics and tree metrics, SIAM J. Discrete Math. 6 (1993), 523–532.

16. E. Dahlhaus, An efficient parallel recognition algorithm of parity graphs, in "ICCI 93" (O. Abou-Rabia et al., Eds.), pp. 82–86.

17. E. Dahlhaus, Efficient parallel modular decomposition, extended abstract, in "Graph-Theoretic Concepts in Computer Science, 21th International Workshop WG '95" (Nagl et al., Eds.), Lecture Notes in Computer Science, Vol. 1017, pp. 290–302, Springer-Verlag, New York/Berlin, 1995.

18. E. Dahlhaus, Efficient parallel and linear time split decomposition, in "14th FST–TCS" (P. Thiagarajan, Ed.), Lecture Notes in Computer Science, Vol. 880, pp. 171–180, Springer-Verlag, New York/Berlin, 1994.

19. E. Dahlhaus, J. Gustedt, and R. McConnell, Efficient and practical modular decomposition, in "Eighth Annual ACM–SIAM Symposium on Discrete Algorithms, 1997," pp. 26–35.

20. R. Dubes and A. Jain, "Algorithms for Clustering Data," Prentice–Hall, Englewood Cliffs, New Jersey, 1988.

21. A. Gibbons and W. Rytter, "Efficient Parallel Algorithms," Cambridge Univ. Press, Cambridge, UK, 1989.

22. M. Golumbic, "Algorithmic Graph Theory and Perfect Graphs," Academic Press, New York, 1980.

23. C. Gabor, K. Supowit, and W. Hsu, Recognizing circle graphs in polynomial time, J. ACM 36 (1989), 435–473.

24. P. Hammer and F. Maffray, Completely separable graphs, Discrete Appl. Math. 27 (1990), 85–99.

25. X. He, Parallel algorithm for cograph recognition with applications, J. Algorithms 15 (1993), 284–313.

26. P. Klein, Efficient parallel algorithms for chordal graphs, in "29th IEEE–FOCS, 1988," pp. 150–161.

27. R. Ladner and M. Fischer, Parallel prefix computation, J. ACM 27 (1980), 831–838.

28. T. Ma and J. Spinrad, An $O(n^2)$-algorithm for undirected split decomposition, J. Algorithms 16 (1994), 145–160.

29. R. McConnell and J. Spinrad, Linear-time modular decomposition and efficient transitive orientation of comparability graphs, in "Fifth Annual ACM–SIAM Symposium of Discrete Algorithms, 1994," pp. 536–545.

30. T. Przyticka and D. Corneil, Parallel algorithms for parity graphs, J. Algorithms 12 (1991), 96–109.

31. Y. Shiloach and U. Vishkin, An $O(\log n)$ parallel connectivity algorithm, J. Algorithms 3 (1982), 57–67.

32. J. Spinrad, Recognition of circle graphs, J. Algorithms 16 (1994), 264–282.

33. R. Tarjan and U. Vishkin, Finding biconnected components in logarithmic parallel time, SIAM J. Computing 14 (1984), 862–874.

34. H. Wagener, Triangulating a monotone polygon in parallel, in "Computational Geometry and Its Applications," Lecture Notes in Computer Science, Vol. 333, pp. 136–142, Springer-Verlag, New York/Berlin, 1988.