
Simple and Scalable Constrained Clustering: A Generalized Spectral Method¹

Mihai Cucuringu, U. of California LA
Ioannis Koutis, U. of Puerto Rico RP
Sanjay Chawla,² Qatar CRI
Gary Miller, Carnegie Mellon U.
Richard Peng, Georgia Tech

Abstract

We present a simple spectral approach to the well-studied constrained clustering problem. It captures constrained clustering as a generalized eigenvalue problem in which both matrices are graph Laplacians. The algorithm works in nearly-linear time and provides concrete guarantees for the quality of the clusters, at least for the case of 2-way partitioning. In practice this translates to a very fast implementation that consistently outperforms existing spectral approaches both in speed and quality.

1 Introduction

Clustering with constraints is a problem of central importance in machine learning and data mining. It captures the case when information about an application task comes in the form of both data and domain knowledge. We study the standard problem where domain knowledge is specified as a set of soft must-link (ML) and cannot-link (CL) constraints [Basu et al., 2008].

The extensive literature reports a plethora of methods, including spectral algorithms that explore various modifications and extensions of the basic spectral algorithm by [Shi and Malik, 2000] and its variant by [Ng et al., 2001].

The distinctive feature of our algorithm is that it constitutes a natural generalization, rather than an extension, of the basic spectral method. The generalization is based on a critical look at how existing methods handle constraints, in section 3. The solution is derived from a geometric embedding obtained via a spectral relaxation of an optimization problem, exactly in the spirit of [Ng et al., 2001, Shi and Malik, 2000]. This is depicted in the workflow in Figure 1. Data and ML constraints are represented by a Laplacian matrix L, and CL constraints by another Laplacian matrix H. The embedding is realized by computing a few eigenvectors of the generalized eigenvalue problem $Lx = \lambda Hx$. The generalization of [Ng et al., 2001, Shi and Malik, 2000] lies essentially in H being a Laplacian matrix rather than the diagonal D of L. In fact, as we will discuss later, D itself is equivalent to a specific Laplacian matrix; thus our method encompasses the basic spectral method as a special case of constrained clustering.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 41. Copyright 2016 by the authors.

[Figure 1: A schematic overview of our approach. The data and the ML constraints form the Laplacian L, the CL constraints form the Laplacian H; a discrete optimization problem is relaxed spectrally, the generalized eigenproblem Lx = λHx yields an embedding, and the embedding is clustered.]

Our approach is characterized by its conceptual simplicity, which enables a straightforward mathematical derivation of the algorithm, possibly the simplest among all competing spectral methods. Reducing the problem to a relatively simple generalized eigensystem enables us to draw directly on recent significant progress in the theoretical understanding of the standard spectral clustering method, offering its first practical realization [Lee et al., 2012]. In addition, the algorithm comes with two features that are not simultaneously shared by any of the prior methods: (i) it is provably fast by design, as it leverages fast linear system solvers for Laplacian systems [Koutis et al., 2012]; (ii) it provides a concrete theoretical guarantee for the quality of 2-way constrained partitioning, with respect to the underlying discrete optimization problem, via a generalized Cheeger inequality (section 5).

¹ M. Cucuringu is supported by AFOSR MURI grant FA9550-10-1-0569. I. Koutis is supported by NSF CAREER award CCF-1149048. G. Miller and R. Peng were supported by CCF-1018463 and CCF-1065106, while R. Peng was at CMU. A significant part of this work was carried out while the authors were visiting the Simons Institute for the Theory of Computing at UC Berkeley in Fall 2014.

² Qatar Computing Research Institute, HBKU. On leave from University of Sydney.

In practice, our method is at least 10x faster than competing methods on large data sets. It solves data sets with millions of points in less than 2 minutes, on very modest hardware. Furthermore, the quality of the computed segmentations is often dramatically better.

2 Problem definition

The constrained clustering problem is specified by three weighted graphs:

1. The data graph $G_D$, which contains a given number of k clusters that we seek to find. Formally, the graph is a triple $G_D = (V, E_D, w_D)$, with the edge weights $w_D$ being positive real numbers indicating the level of 'affinity' of their endpoints.

2. The knowledge graphs $G_{ML}$ and $G_{CL}$. The two graphs are formally triples $G_{ML} = (V, E_{ML}, w_{ML})$ and $G_{CL} = (V, E_{CL}, w_{CL})$. Each edge in $G_{ML}$ indicates that its two endpoints should be in the same cluster, and each edge in $G_{CL}$ indicates that its two endpoints should be in different clusters. The weight of an edge indicates the level of belief placed in the corresponding constraint.

We emphasize that prior knowledge does not have to be exact or even self-consistent, and thus the constraints should not be viewed as 'hard' ones. However, to conform with prior literature, we will use the existing terminology of 'must link' (ML) and 'cannot link' (CL) constraints, to which $G_{ML}$ and $G_{CL}$ owe their notation respectively.

In the constrained clustering problem the general goal is to find k disjoint clusters in the data graph. Intuitively, the clusters should result from cutting a small number of edges in the data graph, while simultaneously respecting as much as possible the constraints in the knowledge graphs.

3 Re-thinking constraints

Many approaches have been pursued within the constrained spectral clustering framework. They are quite distinct but do share a common point of view: constraints are viewed as entities structurally extraneous to the basic spectral formulation, necessitating its modification or extension with additional mathematical features. However, a key fact is overlooked:

Standard clustering is a special case of constrained clustering with implicit soft ML and CL constraints.

To see why, let us briefly recall the optimization problem in the standard method (Ncut):

$$\phi = \min_{S\subseteq V} \frac{\mathrm{cut}_{G_D}(S,\bar S)}{\mathrm{vol}(S)\,\mathrm{vol}(\bar S)/\mathrm{vol}(V)}.$$

Here $\mathrm{vol}(S)$ denotes the total weight incident to the vertex set $S$, and $\mathrm{cut}_G(S,\bar S)$ denotes the total weight crossing from $S$ to $\bar S$ in $G$.

The data graph $G_D$ is actually an implicit encoding of soft ML constraints. Indeed, pairwise affinities between nodes can be viewed as 'soft declarations' that such nodes should be connected rather than disconnected in a clustering. Let now $d_i$ denote the total incident weight of vertex $i$ in $G_D$. Consider the demand graph $K$ of implicit soft CL constraints, defined by the adjacency $K_{ij} = d_i d_j/\mathrm{vol}(V)$. It is easy to verify that $\mathrm{vol}(S)\,\mathrm{vol}(\bar S)/\mathrm{vol}(V) = \mathrm{cut}_K(S,\bar S)$. We have

$$\min_{S\subseteq V} \frac{\mathrm{cut}_{G_D}(S,\bar S)}{\mathrm{vol}(S)\,\mathrm{vol}(\bar S)/\mathrm{vol}(V)} = \min_{S\subseteq V} \frac{\mathrm{cut}_{G_D}(S,\bar S)}{\mathrm{cut}_K(S,\bar S)}.$$

In other words, the Ncut objective can be viewed as:

$$\min_{S\subseteq V} \frac{\text{weight of cut (violated) implicit ML constraints}}{\text{weight of cut (satisfied) implicit CL constraints}}. \qquad (1)$$
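To make the identity concrete, here is a small numerical sketch (our own illustration, not the paper's code; the random test graph and variable names are ours) that builds the demand graph K for a random weighted graph and checks that vol(S)vol(S̄)/vol(V) = cut_K(S, S̄) for an arbitrary subset S:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n))
A = np.triu(A, 1)
A = A + A.T                          # symmetric weighted adjacency of G_D
d = A.sum(axis=1)                    # total incident weight of each vertex
vol_V = d.sum()
K = np.outer(d, d) / vol_V           # demand graph: K_ij = d_i d_j / vol(V)

S = np.zeros(n, dtype=bool)
S[:3] = True                         # an arbitrary subset S
cut_K = K[S][:, ~S].sum()            # cut_K(S, S-bar)
balance = d[S].sum() * d[~S].sum() / vol_V
assert np.isclose(cut_K, balance)    # the two quantities coincide
```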

With this realization, it becomes evident that incorporating the knowledge graphs ($G_{ML}$, $G_{CL}$) is mainly a degree-of-belief issue, between implicit and explicit constraints. Yet all existing methods insist on handling the explicit constraints separately. For example, [Rangapuram and Hein, 2012] modify the Ncut optimization function by adding in the numerator the number of violated explicit constraints (independently of them being ML or CL), times a parameter γ. In another example, [Wang et al., 2014] solve the spectral relaxation of Ncut, but under the constraint that the number of satisfied ML constraints minus the number of violated CL constraints is lower bounded by a parameter α. Despite the separate handling of the explicit constraints, degree-of-belief decisions (reflected by the parameters α and γ) are not avoided. The actual handling also appears to be somewhat arbitrary. For instance, most methods take the constraints unweighted, as usually provided by a user, and handle them uniformly; but it is unclear why one constraint in a densely connected part of the graph should be treated equally to another constraint in a less well-connected part. Moreover, most prior methods enforce the use of the balance implicit constraints in K, without questioning their role, which may actually be adversarial in some cases. In general, the mechanisms for including the explicit constraints are oblivious of the input, or even of the underlying algebra.

Our approach. We choose to temporarily drop the distinction of the constraints into explicit and implicit. We instead assume that we are given one set of ML constraints and one set of CL constraints, in the form of weighted graphs G and H. We then design a generalized spectral clustering method that retains the k-way version of the objective shown in equation 1. We apply this generalized method to our original problem, after a merging step of the explicit and implicit CL/ML constraints into one set of CL/ML constraints.

The merging step can be left entirely up to the user, who may be able to exploit problem-specific information and provide their choice of weights for G and H. Of course, we expect that in most cases explicit CL and ML constraints will be provided in the form of simple unweighted graphs $G_{ML}$ and $G_{CL}$. For this case we provide in section 4.7 a simple method that resolves the degree-of-belief issue and constructs G and H automatically. The method is heuristic, but not oblivious to the data graph, as the weights adjust to it.

4 Algorithm and its derivation

4.1 Graph Laplacians

Let $G = (V, E, w)$ be a graph with positive weights. The Laplacian $L_G$ of $G$ is defined by $L_G(i,j) = -w_{ij}$ and $L_G(i,i) = \sum_{j \neq i} w_{ij}$. The graph Laplacian satisfies the following basic identity for all vectors $x$:

$$x^T L_G x = \sum_{(i,j)\in E} w_{ij}(x_i - x_j)^2. \qquad (2)$$

Given a cluster $C \subseteq V$ we define a cluster indicator vector by $x_C(i) = 1$ if $i \in C$ and $x_C(i) = 0$ otherwise. We have:

$$x_C^T L_G x_C = \mathrm{cut}_G(C, \bar C) \qquad (3)$$

where $\mathrm{cut}_G(C, \bar C)$ denotes the total weight crossing from $C$ to $\bar C$ in $G$.
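As a quick sanity check of identities (2)-(3), here is a minimal sketch (our own helper names; dense matrices for clarity only) that forms a Laplacian from a symmetric adjacency matrix and verifies that the quadratic form on an indicator vector equals the cut weight:

```python
import numpy as np

def laplacian(A):
    # L(i,i) = sum_{j != i} w_ij and L(i,j) = -w_ij
    return np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(1)
A = rng.random((6, 6))
A = np.triu(A, 1); A = A + A.T            # symmetric weighted adjacency

x = np.zeros(6); x[[0, 2, 5]] = 1.0       # indicator of the cluster C = {0, 2, 5}
cut = A[x == 1][:, x == 0].sum()          # total weight crossing from C to its complement
assert np.isclose(x @ laplacian(A) @ x, cut)
```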

4.2 The optimization problem

As we discussed in section 3, we assume that the input consists of two weighted graphs: the must-link constraints G and the cannot-link constraints H.

Our objective is to partition the node set $V$ into $k$ disjoint clusters $C_i$. We define an individual measure of badness for each cluster $C_i$:

$$\phi_i(G,H) = \frac{\mathrm{cut}_G(C_i, \bar C_i)}{\mathrm{cut}_H(C_i, \bar C_i)}. \qquad (4)$$

The numerator is equal to the total weight of the violated ML constraints, because cutting one such constraint violates it. The denominator is equal to the total weight of the satisfied CL constraints, because cutting one such constraint satisfies it. Thus the minimization of the individual badness is a sensible objective.

We would like then to find clusters $C_1, \ldots, C_k$ that minimize the maximum badness, i.e. solve the following problem:

$$\Phi_k = \min \max_i \phi_i. \qquad (5)$$

Using equation 3, the above can be captured in terms of Laplacians: letting $x_{C_i}$ denote the indicator vector for cluster $i$, we have

$$\phi_i(G,H) = \frac{x_{C_i}^T L_G\, x_{C_i}}{x_{C_i}^T L_H\, x_{C_i}}.$$

Therefore, solving the minimization problem posed in equation 5 amounts to finding $k$ vectors in $\{0,1\}^n$ with disjoint support.
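For illustration, here is a sketch of the discrete objective itself (our own code and naming, not part of the paper's pipeline): given a k-way labeling and the two constraint graphs, it evaluates $\Phi_k$ through the Laplacian form of the badness; it assumes every cluster cuts at least one CL constraint, so the denominators are nonzero.

```python
import numpy as np

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

def max_badness(labels, A_G, A_H):
    """Phi_k = max_i phi_i for a given labeling, via equation (4) in Laplacian form."""
    L_G, L_H = laplacian(A_G), laplacian(A_H)
    phis = []
    for c in np.unique(labels):
        x = (labels == c).astype(float)     # indicator vector of cluster c
        phis.append((x @ L_G @ x) / (x @ L_H @ x))
    return max(phis)
```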

Notice that the optimization problem may not be well-defined in the event that there are very few CL constraints in H. This can be detected easily and the user can be notified. The merging phase also automatically takes care of this case. Thus we assume that the problem is well-defined.

4.3 Spectral Relaxation

To relax the problem we instead look for $k$ vectors $y_1, \ldots, y_k \in \mathbb{R}^n$ such that for all $i \neq j$ we have $y_i^T L_H y_j = 0$. These $L_H$-orthogonality constraints can be viewed as a relaxation of the disjointness requirement. Of course, their particular form is motivated by the fact that they directly give rise to a generalized eigenvalue problem. Concretely, the $k$ vectors $y_i$ that minimize the maximum among the $k$ Rayleigh quotients $(y_i^T L_G y_i)/(y_i^T L_H y_i)$ are precisely the generalized eigenvectors corresponding to the $k$ smallest eigenvalues of the problem $L_G x = \lambda L_H x$.¹ This fact is well understood and follows from a generalization of the min-max characterization of the eigenvalues for symmetric matrices; details can be found for instance in [Stewart and Sun, 1990].

Notice that H does not have to be connected. Since we are looking for a minimum, the optimization function avoids vectors that are in the null space of $L_H$. That means that no restriction needs to be placed on $x$ for the eigenvalue problem to be well defined, other than that it cannot be the constant vector (which is in the null space of both $L_G$ and $L_H$), assuming without loss of generality that G is connected.

¹ When H is the demand graph K discussed in section 2, the problem is identical to the standard problem $L_G x = \lambda D x$, where D is the diagonal of $L_G$. This is because $L_K = D - dd^T/(d^T \mathbf{1})$, and the eigenvectors of $L_G x = \lambda D x$ are d-orthogonal, where d is the vector of degrees in G.

4.4 The embedding

Let X be the $n \times k$ matrix of the first k generalized eigenvectors of $L_G x = \lambda L_H x$. The embedding is shown in Figure 2.

We discuss the intuition behind the embedding. Without step 4, and with $L_H$ replaced by the diagonal D, the embedding is exactly the one recently proposed and analyzed in [Lee et al., 2012]. It is a combination of the embeddings considered in [Shi and Malik, 2000, Ng et al., 2001, Verma and Meila, 2003], but the first known to produce clusters with approximation guarantees. The generalized eigenvalue problem $Lx = \lambda Dx$ can be viewed as a simple eigenvalue problem over a space endowed with the D-inner product: $\langle x, y\rangle_D = x^T D y$. Step 5 normalizes the eigenvectors to a unit D-norm, i.e. $x^T D x = 1$. Given this normalization, it is shown in [Lee et al., 2012] that the rows of U at step 7 (vectors in k-dimensional space) are expected to concentrate in k different directions. This justifies steps 8-10 that normalize these row vectors onto the k-dimensional sphere, in order to concentrate them in a spatial sense. Then a geometric partitioning algorithm can be applied.

Input: X, L_H, d
Output: embedding U ∈ R^{n×k}, l ∈ R^{n×1}

1: u ← 1_n
2: for i = 1 : k do
3:   x = X_{:,i}
4:   x = x − (x^T d / u^T d) u
5:   x = x / sqrt(x^T L_H x)
6:   U_{:,i} = x
7: end for
8: for j = 1 : n do
9:   l_j = ||U_{j,:}||_2
10:  U_{j,:} = U_{j,:} / l_j
11: end for

Figure 2: Embedding Computation (based on [Lee et al., 2012]).
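A compact sketch of the procedure in Figure 2 (our own dense-matrix translation; the actual implementation works with sparse matrices) may help make the steps concrete:

```python
import numpy as np

def embed(X, L_H, d):
    """Embedding of Figure 2, following [Lee et al., 2012]: X holds the first k
    generalized eigenvectors of L_G x = lambda L_H x, L_H is the CL Laplacian
    and d the degree vector of G. Returns the row-normalized embedding U and
    the row norms l used by the optional Cheeger-sweep refinement."""
    n, k = X.shape
    u = np.ones(n)
    U = np.zeros((n, k))
    for i in range(k):
        x = X[:, i].copy()
        x -= (x @ d) / (u @ d) * u          # step 4: shift x to be d-orthogonal
        x /= np.sqrt(x @ (L_H @ x))         # step 5: normalize to unit L_H-norm
        U[:, i] = x
    l = np.linalg.norm(U, axis=1)           # step 9: row norms
    return U / l[:, None], l                # step 10: project rows onto the sphere
```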

From a technical point of view, working with $L_H$ instead of D makes almost no difference. $L_H$ is a positive semidefinite matrix. It can be rank-deficient, but the eigenvectors avoid the null space of $L_H$, by definition. Thus the geometric intuition about U remains the same if we syntactically replace D by $L_H$. However, there is a subtlety: $L_G$ and $L_H$ share the constant vector in their null spaces. This means that if x is an eigenvector, then for all c the vector $x + c\mathbf{1}_n$ is also an eigenvector with the same eigenvalue. Among all such possible eigenvectors we pick one representative: in step 4 we pick c such that $x + c\mathbf{1}_n$ is orthogonal to d. The intuition for this is derived from the proof of the Cheeger inequality claimed in section 5; this choice is what makes possible the analysis of a theoretical guarantee for a 2-way cut.

4.5 Computing Eigenvectors

It is understood that spectral algorithms based on eigenvector embeddings do not require the exact eigenvectors, but only approximations of them, in the sense that the quotients $x^T L_G x / x^T L_H x$ are close to their exact values, i.e. close to the eigenvalues [Chung, 1997, Lee et al., 2012]. The computation of such approximate generalized eigenvectors for $L_G x = \lambda L_H x$ is the most time-consuming part of the entire process. The asymptotically fastest known algorithm for the problem runs in $O(km \log^2 m)$ time. It combines a fast Laplacian linear system solver [Koutis et al., 2011a] and a standard power method [Golub and Loan, 1996]. In practice we use the combinatorial multigrid solver [Koutis et al., 2011b], which empirically runs in $O(m)$ time. The solver provides an approximate inverse for $L_G$, which in turn is used with the preconditioned eigenvalue solver LOBPCG [Knyazev, 2001].
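We do not reproduce the combinatorial multigrid solver here, but the following sketch shows the shape of the eigencomputation with SciPy's LOBPCG; a simple Jacobi (diagonal) preconditioner stands in for the Laplacian solver, so convergence on hard instances will be slower than what we report, and L_H is assumed to be well behaved on the subspace orthogonal to the all-ones vector (the merging step of section 4.7 ensures this).

```python
import numpy as np
from scipy.sparse.linalg import lobpcg, LinearOperator

def smallest_generalized_eigs(L_G, L_H, k, seed=0):
    """Approximate the k smallest eigenpairs of L_G x = lambda L_H x (sketch only)."""
    n = L_G.shape[0]
    rng = np.random.default_rng(seed)
    X0 = rng.standard_normal((n, k))                        # random initial block
    diag = np.asarray(L_G.diagonal()).ravel()
    M = LinearOperator((n, n), matvec=lambda v: v / diag)   # Jacobi stand-in for the CMG preconditioner
    ones = np.ones((n, 1))                                  # deflate the shared null space (constant vector)
    vals, vecs = lobpcg(L_G, X0, B=L_H, M=M, Y=ones,
                        largest=False, tol=1e-6, maxiter=500)
    return vals, vecs
```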

4.6 Partitioning

For the special case when k = 2, we can compute the second eigenvector, sort it, and then select the sparsest cut among the n−1 possible cuts into $\{v_1, \ldots, v_i\}$ and $\{v_{i+1}, \ldots, v_n\}$, for $i \in [1, n-1]$, where $v_j$ is the vertex that corresponds to coordinate j after the sorting. This 'Cheeger sweep' method is associated with the proof of the Cheeger inequality [Chung, 1997], and is also used in the proof of the inequality we claim in section 5.
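A naive sketch of this sweep (our own code; it recomputes each cut from scratch for clarity, whereas an incremental update makes the sweep linear in the number of edges):

```python
import numpy as np

def cheeger_sweep(x, A_G, A_H):
    """2-way sweep over the sorted entries of x: return the threshold cut S
    minimizing cut_G(S, S-bar) / cut_H(S, S-bar) over the n-1 candidates."""
    order = np.argsort(x)
    n = len(x)
    S = np.zeros(n, dtype=bool)
    best_ratio, best_i = np.inf, 0
    for i in range(n - 1):
        S[order[i]] = True                   # grow S one vertex at a time
        cut_G = A_G[S][:, ~S].sum()
        cut_H = A_H[S][:, ~S].sum()
        if cut_H > 0 and cut_G / cut_H < best_ratio:
            best_ratio, best_i = cut_G / cut_H, i
    S_best = np.zeros(n, dtype=bool)
    S_best[order[:best_i + 1]] = True
    return S_best, best_ratio
```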

In the general case, given the embedding matrix U, the clustering algorithm invokes kmeans(U) (with a random start), which returns a k-partitioning. The partitioning can optionally be refined into a k-clustering by performing a Cheeger sweep among the nodes of each component, independently for each component: the nodes are sorted according to the values of the corresponding coordinates in the vector l returned by the embedding algorithm given in Figure 2. We will not use this refinement option in our experiments.


4.7 Merging Constraints

As we discussed in section 2, it is frequently the case that a user provides unweighted constraints $G_{ML}$ and $G_{CL}$. Merging these unweighted constraints with the data into one pair of graphs G and H is an interesting problem.

Here we propose a simple heuristic. We construct two weighted graphs $G_{ML}$ and $G_{CL}$ as follows: if edge $(i,j)$ is a constraint, we take its weight in the corresponding graph to be $d_i d_j/(d_{\min} d_{\max})$, where $d_i$ denotes the total incident weight of vertex $i$, and $d_{\min}, d_{\max}$ the minimum and maximum among the $d_i$'s. We then let $G = G_D + G_{ML}$ and $H = K/n + G_{CL}$, where K is the demand graph and n is the size of the data graph, whose edges are normalized to have minimum weight. We include this small copy of K in H in order to render the problem well-defined in all cases of user input.
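A dense-matrix sketch of this construction (our own code and naming; the rescaling of K's edges to the minimum data-graph weight is omitted for brevity):

```python
import numpy as np

def merge_constraints(GD, GML, GCL):
    """Merge the data graph GD with 0/1 constraint graphs GML, GCL into G and H."""
    d = GD.sum(axis=1)                        # total incident weight of each vertex
    W = np.outer(d, d) / (d.min() * d.max())  # constraint weight d_i d_j / (d_min d_max)
    n = GD.shape[0]
    K = np.outer(d, d) / d.sum()              # demand graph
    np.fill_diagonal(K, 0.0)                  # drop self-loops (they do not affect cuts)
    G = GD + W * GML                          # merged ML graph
    H = K / n + W * GCL                       # merged CL graph; the small copy of K keeps the problem well-defined
    return G, H
```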

The intuition behind this choice of weights is better understood in the context of a sparse unweighted graph. A constraint on two high-degree vertices is more significant relative to a constraint on two lower-degree vertices, as it has the potential to drastically change the clustering, if enforced. In addition, assuming that noisy/inaccurate constraints are uniformly random, there is a lower probability that a high-degree constraint is inaccurate, simply because its two endpoints are relatively rare, due to their high degree. From an algebraic point of view, it also makes sense to put a higher weight on this edge, in order for it to be comparable in size with the neighborhoods of i and j and have an effect on the value of the objective function. Notice also that when no constraints are available the method reverts to standard spectral clustering.

5 A generalized Cheeger inequality

The success of the standard spectral clustering method is often attributed to the existence of non-trivial approximation guarantees, which in the 2-way case are given by the Cheeger inequality and the associated method [Chung, 1997]. Here we present a generalization of the Cheeger inequality. We believe that it provides supporting mathematical evidence for the advantages of expressing the constrained clustering problem as a generalized eigenvalue problem with Laplacians.

Theorem 1. Let G and H be any two weighted graphs and d be the vector containing the degrees of the vertices in G. Let also K be the demand graph, and let $\phi(G,H) = \min_{S\subset V} \mathrm{cut}_G(S,\bar S)/\mathrm{cut}_H(S,\bar S)$. For any vector x such that $x^T d = 0$, we have

$$\frac{x^T L_G x}{x^T L_H x} \;\geq\; \phi(G,K)\cdot\phi(G,H)/4.$$

A cut meeting the guarantee of the inequality can be obtained via a Cheeger sweep on x.

The proof is given separately in the supplementary file.

6 Related Work

The literature on constrained clustering is quite extensive, as the problem has been pursued under various guises by different communities. Here we present a short and unavoidably partial review.

A number of methods incorporate the constraints by modifying only the data matrix of the standard method. In certain cases some or all of the CL constraints are dropped in order to prevent the matrix from turning negative [Kamvar et al., 2003, Lu and Carreira-Perpinan, 2008]. The formulation of [Rangapuram and Hein, 2012] incorporates all constraints into the data matrix, essentially by adding a signed Laplacian, which is a generalization of the Laplacian for graphs with negative weights; notably, their algorithm does not solve a spectral relaxation of the problem but attempts to solve the (hard) optimization problem exactly, via a continuous optimization approach.

A different approach is proposed in [Li et al., 2009], where constraints are used to improve the embedding obtained through the standard problem, before applying the partitioning step. In principle this embedding-processing step is orthogonal to methods that compute some embedding (including ours), and it can be used as a post-processing step to potentially improve them.

A number of other works use the ML and CL constraints to superimpose algebraic constraints onto the spectral relaxation of the standard problem. These additional algebraic constraints usually yield much harder constrained optimization problems [Eriksson et al., 2011, Kawale and Boley, 2013, Xu et al., 2009, Wang et al., 2014].

Besides our work, there exist a number of other approaches that reduce constrained clustering to generalized eigenvalue problems $Ax = \lambda Bx$. These methods can be implemented to run fast, as long as (i) linear systems in A can be solved efficiently, and (ii) A and B are positive semi-definite. Specifically, [Yu and Shi, 2001, Yu and Shi, 2004] use a generalized eigenvalue problem in which B is diagonal, but A is not generally amenable to existing efficient linear system solvers. In [Wang et al., 2014] matrix A is set to be the normalized Laplacian of the data graph (implicitly attempting to impose the standard balance constraints), and B has both positive and negative off-diagonal entries representing ML and CL constraints respectively. In the general case B is not positive semi-definite, forcing the computation of full eigenvalue decompositions. However, the method can be modified to use a (positive semi-definite) signed Laplacian as the matrix B, as partially observed in [Wang et al., 2012]. This modification has a fast implementation. The formulation in [Rangapuram and Hein, 2012] also leads to a fast implementation of its spectral relaxation.

7 Experiments

In this section, we sample some of our experimental results. We compare our algorithm FAST-GE against two other methods, CSP [Wang et al., 2014] and COSC [Rangapuram and Hein, 2012].

CSP reduces constrained clustering to a generalized eigenvalue problem. However, the problem is indefinite and the method requires the computation of a full eigenvalue decomposition.

COSC is an iterative algorithm that attempts to solve exactly an NP-hard discrete optimization problem that captures 2-way constrained clustering; k-way partitions are computed via recursive calls to the 2-way partitioner. The method actually comes in two variants: an exact version which is very slow on all but very small problems, and an approximate 'fast' version which has no convergence guarantees. The size of the data in our experiments forces us to use the fast version, COSf.

We focus on these two methods because of their readily available implementations, but mostly because the corresponding papers provide sufficient evidence that they outperform other competing methods. We also selected them because they can both be modified or extended into methods with fast implementations, thus allowing for a comparison with our proposed method on very large problems.

7.1 Some negative findings.

COSC has a natural spectral relaxation into a generalized eigenvalue problem $Ax = \lambda Bx$ where A is a signed Laplacian and B is diagonal. CSP can also be modified by replacing the indefinite matrix Q of its generalized eigenvalue problem with a signed Laplacian that counts the number of satisfied constraints. In this way both methods become scalable. We performed a number of experiments based on these observations. The results were disappointing, especially when k > 2. The output quality was comparable to, or worse than, that obtained by COSf and CSP in the reported experiments. We attribute this to the less clean mathematical properties of the signed Laplacian.

We also experimented with the automated merging phase of FAST-GE. Specifically, we tried giving more significance to the standard implicit balance constraints by increasing the coefficient of the demand graph K in graph H. The output deteriorates (often significantly) for the more challenging problems we tried. This supports our decision not to enforce the use of balance constraints in our generalized formulation, unlike all prior methods.

7.2 Synthetic Data Sets.

We begin with a number of small synthetic experiments. The purpose is to test the output quality, especially in the presence of noise.

We generically apply the following construction: we choose uniformly at random a set of nodes for which we assume cluster-membership information is provided. The cluster-membership information gives unweighted ML and CL constraints in the obvious way. We also add random noise to the data.

More concretely, we say that a graph G is generated from the ensemble NoisyKnn(n, k_g, l_g) with parameters n, k_g and l_g if G of size n is the union $G = H_1 \cup H_2$ of two (not necessarily disjoint) graphs $H_1$ and $H_2$ on the same set of n vertices, where $H_1$ is a k-nearest-neighbor (knn) graph with each node connected to its $k_g$ nearest neighbors, and $H_2$ is an Erdős–Rényi graph where each edge appears independently with probability $l_g/n$. One may interpret the parameter $l_g$ as the noise level in the data, since the larger $l_g$ is, the more random edges are wired across the different clusters, thus rendering the problem more difficult to solve. In other words, the planted clusters are harder to detect when there is a large amount of noise in the data, obscuring the separation of the clusters.
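A sketch of how such an instance can be generated (our own code; it assumes a 2-d point cloud such as the four-moons data and uses scikit-learn's kneighbors_graph for the knn part):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def noisy_knn(points, kg, lg, seed=0):
    """Union of a kg-nearest-neighbor graph on `points` and Erdos-Renyi noise with edge probability lg/n."""
    rng = np.random.default_rng(seed)
    n = len(points)
    H1 = kneighbors_graph(points, kg, mode='connectivity').toarray()
    H1 = np.maximum(H1, H1.T)                         # symmetrize the knn graph
    H2 = (rng.random((n, n)) < lg / n).astype(float)  # random noise edges
    H2 = np.triu(H2, 1); H2 = H2 + H2.T
    return np.maximum(H1, H2)                         # G = H1 union H2
```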

Since in these synthetic data sets the ground truth partition is available, we measure the accuracy of the methods by the popular Rand Index [Rand, 1971]. The Rand Index indicates how well the resulting partition matches the ground truth partition; a value closer to 1 indicates an almost perfect recovery, while a value closer to 0 indicates an almost random assignment of the nodes into clusters.

Four Moons. Our first synthetic example is the 'Four-Moons' data set, where the underlying graph G is generated from the ensemble NoisyKnn(n = 1500, k_g = 30, l_g = 15). The plots in Figure 4 show the accuracy and running times of all three methods on this example, while Figure 3 shows a random instance of the clustering returned by each of the methods, with 75 constraints. The accuracy of FAST-GE and COSf is very similar, with FAST-GE being somewhat better with more constraints, as shown in Figure 4. However, FAST-GE is already at least 4x faster than COSf at this size.

Figure 3: Segmentation for a random instance of the Four-Moons data set with 75 labels produced by CSP (left), COSf (middle) and FAST-GE (right).

Figure 4: Accuracy (Rand Index) and running times for the Four-Moons data set, where the underlying graph is given by the model NoisyKnn(n = 1500, k_g = 30, l_g = 15), for varying numbers of constraints. Time is in logarithmic scale. The bars indicate the variance in the output over random trials using the same number of constraints.

PACM. Our second synthetic example is the somewhat more irregular PACM graph, formed by a cloud of n = 426 points in the shape of the letters {P, A, C, M}, whose topology renders the segmentation particularly challenging. The details of this data set are given in the supplementary file. Here we only present a visualization of the obtained segmentations.

Figure 5: Segmentation for a random instance of the PACM data set with 125 labels produced by CSP (left), COSf (middle) and FAST-GE (right).

7.3 Image Data

In terms of real data, we consider two very different applications. Our first application is to segmentation of real images, where the underlying grid graph is given by the affinity matrix of the image, computed using the RBF kernel based on the greyscale values.

We construct the constraints by assigning cluster-membership information to a very small number of the pixels, which are shown colored in the pictures below. The cluster-membership information is then turned into pairwise constraints as follows: if t nodes are contained within an ML constraint set, we form a complete graph with $\binom{t}{2}$ edges. Similarly, we derive a CL constraint for all pairs of nodes which belong to two different clusters. Our output is obtained by running k-means 20 times and selecting the best segmentation according to the k-means objective value.
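A small sketch of this construction (our own code; `labeled` is a hypothetical dict mapping a pixel index to its cluster id, and unlabeled pixels simply do not appear in it):

```python
import numpy as np
from itertools import combinations

def constraints_from_labels(n, labeled):
    """Turn cluster-membership labels on a few pixels into 0/1 ML and CL constraint graphs."""
    GML, GCL = np.zeros((n, n)), np.zeros((n, n))
    for i, j in combinations(labeled, 2):
        if labeled[i] == labeled[j]:
            GML[i, j] = GML[j, i] = 1.0   # same cluster: must-link (complete graph within each label set)
        else:
            GCL[i, j] = GCL[j, i] = 1.0   # different clusters: cannot-link
    return GML, GCL
```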

Patras. Figure 6 shows the 5-way segmentation of an image with approximately 44K pixels, which our method is able to detect in under 3 seconds. The size of this problem is prohibitive for CSP. The COSf algorithm runs in 40 seconds and, while it does better on the lower part of the image, it erroneously merges two of the clusters (the red and the blue one) into a single region.

Figure 6: Patras. Top-left: output of FAST-GE, in 2.8 seconds. Top-right: output of COSf, in 40.2 seconds.

Santorini. In Figure 7 we test our proposed method on the Santorini image, with approximately 250K pixels. Our approach successfully recovers a 4-way partitioning, with few errors, in just 15 seconds. Computing clusterings in data of this size is infeasible for CSP. The output of the COSf method, which runs in over 260 seconds, is meaningless.

Soccer. In Figure 8 we consider one last Soccer image, with 1.1 million pixels. We compute a 5-way partitioning using the FAST-GE method in just 94 seconds. Note that while the k-means clustering obscures some of the details in the image, the individual eigenvectors are able to capture finer details, such as the soccer ball, as shown in the two bottom plots of Figure 8. The output of the COSf method is obtained in 25 minutes and is again meaningless.


Figure 7: Santorini. Left: output of FAST-GE, in 15.2 seconds. Right: output of COSf, in 263.6 seconds. Bottom: heatmaps for the first two eigenvectors computed by FAST-GE.

Figure 8: Soccer. Top-right: output of FAST-GE, in under 94 seconds. Bottom-right: output of COSf, in 25 minutes. Bottom-left: heatmaps of eigenvectors.

7.4 Friendship Networks

Our final data sets represent Facebook networks at American colleges. The work in [Traud et al., 2012] studies the structure of Facebook networks at one hundred American colleges and universities at a single point in time (2005). Following a suggestion from [Traud et al., 2011], we consider the dormitory affiliation as the ground truth clustering, and aim to recover this underlying structure from the available friendship graph and any available constraints. We add constraints to the clustering problem by sampling nodes in the graph uniformly at random; the resulting pairwise constraints are generated depending on whether the two nodes belong to the same cluster or not. In order to be able to compare to the computationally expensive CSP method, we consider two small-sized schools, Simmons College (n = 850, d = 36, k = 10) and Haverford College (n = 1025, d = 72, k = 15), where d denotes the average degree in the graph and k the number of clusters. For both examples, FAST-GE yields more accurate results than both CSP and COSf, and does so at a much smaller computational cost.

Figure 9: Facebook networks. Top: accuracy (Rand Index) and running times for Simmons College (n = 850, d = 36, k = 10). Bottom: accuracy and running times for Haverford College (n = 1025, d = 72, k = 15). Time is in logarithmic scale.

8 Final Remarks

We presented a spectral method that reduces constrained clustering to a generalized eigenvalue problem in which both matrices are Laplacians. This offers two advantages that are not simultaneously shared by any of the previous methods: an efficient implementation and an approximation guarantee for the 2-way partitioning problem in the form of a generalized Cheeger inequality. In practice this translates to a method that is at least 10x faster than some state-of-the-art algorithms, while producing output of superior quality. Its speed makes our method amenable to some type of iteration, e.g. as in [Tolliver and Miller, 2006], or to interactive user feedback, which would further improve its output.

We view the Cheeger inequality presented in section 5 as indicative of the rich mathematical properties of generalized Laplacian eigenvalue problems. We expect that tighter versions are to be discovered, along the lines of [Kwok et al., 2013]. Finding k-way generalizations of the Cheeger inequality, as in [Lee et al., 2012], poses an interesting open problem.

Code for our method can be found at: http://tiny.cc/spclu


References

[Basu et al., 2008] Basu, S., Davidson, I., and Wagstaff, K. (2008). Constrained Clustering: Advances in Algorithms, Theory and Applications. CRC Press.

[Chung, 1997] Chung, F. (1997). Spectral Graph Theory, volume 92 of Regional Conference Series in Mathematics. American Mathematical Society.

[Eriksson et al., 2011] Eriksson, A. P., Olsson, C., and Kahl, F. (2011). Normalized cuts revisited: A reformulation for segmentation with linear grouping constraints. Journal of Mathematical Imaging and Vision, 39(1):45–61.

[Golub and Loan, 1996] Golub, G. and Loan, C. V. (1996). Matrix Computations. The Johns Hopkins University Press, Baltimore, 3rd edition.

[Kamvar et al., 2003] Kamvar, S. D., Klein, D., and Manning, C. D. (2003). Spectral learning. In IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003, pages 561–566.

[Kawale and Boley, 2013] Kawale, J. and Boley, D. (2013). Constrained spectral clustering using l1 regularization. In SDM, pages 103–111.

[Knyazev, 2001] Knyazev, A. V. (2001). Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput., 23:517–541.

[Koutis et al., 2011a] Koutis, I., Miller, G. L., and Peng, R. (2011a). A nearly m log n solver for SDD linear systems. In FOCS '11: Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science.

[Koutis et al., 2012] Koutis, I., Miller, G. L., and Peng, R. (2012). A fast solver for a class of linear systems. Commun. ACM, 55(10):99–107.

[Koutis et al., 2011b] Koutis, I., Miller, G. L., and Tolliver, D. (2011b). Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing. Computer Vision and Image Understanding, 115(12):1638–1646.

[Kwok et al., 2013] Kwok, T. C., Lau, L. C., Lee, Y. T., Gharan, S. O., and Trevisan, L. (2013). Improved Cheeger's Inequality: Analysis of Spectral Partitioning Algorithms through Higher Order Spectral Gap. CoRR, abs/1301.5584.

[Lee et al., 2012] Lee, J. R., Oveis Gharan, S., and Trevisan, L. (2012). Multi-way spectral partitioning and higher-order Cheeger inequalities. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, STOC '12, pages 1117–1130, New York, NY, USA. ACM.

[Li et al., 2009] Li, Z., Liu, J., and Tang, X. (2009). Constrained clustering via spectral regularization. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 421–428.

[Lu and Carreira-Perpinan, 2008] Lu, Z. and Carreira-Perpinan, M. A. (2008). Constrained spectral clustering through affinity propagation. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA. IEEE Computer Society.

[Ng et al., 2001] Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14 (NIPS 2001), December 3-8, 2001, Vancouver, British Columbia, Canada, pages 849–856.

[Rand, 1971] Rand, W. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.

[Rangapuram and Hein, 2012] Rangapuram, S. S. and Hein, M. (2012). Constrained 1-spectral clustering. In AISTATS, pages 1143–1151.

[Shi and Malik, 2000] Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905.

[Stewart and Sun, 1990] Stewart, G. and Sun, J.-G. (1990). Matrix Perturbation Theory. Academic Press, Boston.

[Tolliver and Miller, 2006] Tolliver, D. and Miller, G. L. (2006). Graph partitioning by spectral rounding: Applications in image segmentation and clustering. In CVPR (1), pages 1053–1060.

[Traud et al., 2011] Traud, A. L., Kelsic, E. D., Mucha, P. J., and Porter, M. A. (2011). Comparing community structure to characteristics in online collegiate social networks. SIAM Review, 53(3):526–543.

[Traud et al., 2012] Traud, A. L., Mucha, P. J., and Porter, M. A. (2012). Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications.

[Verma and Meila, 2003] Verma, D. and Meila, M. (2003). A comparison of spectral clustering algorithms. Technical report.

[Wang et al., 2012] Wang, X., Qian, B., and Davidson, I. (2012). Improving document clustering using automated machine translation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 645–653, New York, NY, USA. ACM.

[Wang et al., 2014] Wang, X., Qian, B., and Davidson, I. (2014). On constrained spectral clustering and its applications. Data Min. Knowl. Discov., 28(1):1–30.

[Xu et al., 2009] Xu, L., Li, W., and Schuurmans, D. (2009). Fast normalized cut with linear constraints. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 2866–2873.

[Yu and Shi, 2001] Yu, S. X. and Shi, J. (2001). Understanding popout through repulsion. In 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 8-14 December 2001, Kauai, HI, USA, pages 752–757. IEEE Computer Society.

[Yu and Shi, 2004] Yu, S. X. and Shi, J. (2004). Segmentation given partial grouping constraints. IEEE Trans. Pattern Anal. Mach. Intell., 26(2):173–183.
