
AN IMPROVED COST FUNCTION FOR HIERARCHICAL CLUSTER TREES∗

Journal of Computational Geometry (jocg.org), JoCG 11(1), 283–331, 2020

Dingkang Wang† and Yusu Wang‡

Abstract. Hierarchical clustering has been a popular method in various data analysis applications. It partitions a data set into a hierarchical collection of clusters, and can provide a global view of (cluster) structure behind data across different granularity levels. A hierarchical clustering (HC) of a data set can be naturally represented by a tree, called a HC-tree, where leaves correspond to input data and subtrees rooted at internal nodes correspond to clusters. Many hierarchical clustering algorithms used in practice are developed in a procedural manner. In [14], Dasgupta proposed to study the hierarchical clustering problem from an optimization point of view, and introduced an intuitive cost function for similarity-based hierarchical clustering with nice properties as well as natural approximation algorithms. There have since been several follow-up works, providing for example new hardness analyses and better approximation algorithms for computing the optimal HC-tree.

We observe that while Dasgupta’s cost function is effective at differentiating a good HC-tree from a bad one for a fixed graph, the value of this cost function does not reflect how well an input similarity graph is consistent with a hierarchical structure. In this paper, we present an improved cost function, developed based on Dasgupta’s cost function, to address this issue. The optimal tree under the improved cost function remains the same as the one under Dasgupta’s cost function. However, the value of our cost function is more meaningful. For example, the cost of an optimal HC-tree of a graph G equals 1 if and only if G has a “perfect HC-structure”, in the sense that there exists a HC-tree that is consistent with all triplet relations in G; the optimal cost is larger than 1 otherwise. The new way of formulating the cost function also leads to a polynomial-time algorithm to compute the optimal cluster tree when the input graph has a perfect HC-structure, and an approximation algorithm when the input graph has an “almost” perfect HC-structure. Finally, we provide a further understanding of the improved cost function by studying its behavior for random graphs sampled from an edge probability matrix.

∗We would like to thank Sanjoy Dasgupta for very insightful discussions at the early stage of this work, which led to the observation of the base-cost for a graph. We also thank anonymous reviewers for their helpful comments. The work was initiated during the semester-long program of “Foundations of Machine Learning” at the Simons Institute for the Theory of Computing in spring 2017. This work is partially supported by the National Science Foundation under grants CCF-1740761, DMS-1547357, and RI-1815697.
†The Ohio State University, [email protected]
‡The Ohio State University, [email protected]


1 Introduction

Clustering has been one of the most important and popular data analysis methods in the modern data era, with numerous clustering algorithms proposed in the literature [1]. Theoretical studies on clustering have so far been focused mostly on flat clustering algorithms, e.g., [5, 6, 15, 16, 21], which aim to partition the input data set into k (often pre-specified) groups, called clusters. However, there are many scenarios where it is more desirable to perform hierarchical clustering, which recursively partitions data into a hierarchical collection of clusters. A hierarchical clustering (HC) of a data set can be naturally represented by a tree, called a HC-tree, where leaves correspond to input data and subtrees rooted at internal nodes correspond to clusters. Hierarchical clustering can provide a more thorough view of the cluster structure behind input data across all levels of granularity simultaneously, and is sometimes better at revealing the complex structure behind modern data. It has been broadly used in data mining, machine learning, and bioinformatics applications; e.g., the study of phylogenetics.

Most hierarchical clustering algorithms used in practice are developed in a procedural manner: for example, the family of agglomerative methods builds a HC-tree bottom-up by starting with all data points in individual clusters, and then repeatedly merging them to form bigger clusters at coarser levels. Prominent merging strategies include the single-linkage, average-linkage, and complete-linkage heuristics. The family of divisive methods instead partitions the data in a top-down manner, starting with a single cluster, and then recursively dividing it into smaller clusters using strategies based on spectral cuts, k-means, k-center, and so on. Many of these algorithms work well in different practical situations; for example, the average-linkage algorithm is known as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm [23], commonly used in evolutionary biology for phylogenetic inference. However, it is in general not clear what the output HC-tree aims to optimize, and what one should expect to obtain. This lack of an optimization understanding of the HC-tree also makes it hard to decide which hierarchical clustering algorithm one should use given a specific type of input data.

The optimization formulation of the HC-tree was recently tackled by Dasgupta in [14]. Specifically, given a similarity graph G (a weighted graph with edge weights corresponding to the similarity between nodes), Dasgupta proposed an intuitive cost function for any HC-tree, and defined an optimal HC-tree for G to be one that minimizes this cost. Dasgupta showed that an optimal tree under this objective function has many nice properties and is indeed desirable. Furthermore, while it is NP-hard to find an optimal tree, he showed that a simple heuristic using an α_n-approximation of the sparsest graph cut leads to an algorithm computing a HC-tree whose cost is an O(α_n · log n)-approximation of the optimal cost. Given that the best approximation factor α_n for the sparsest cut is O(√log n) [4], this gives an O(log^{3/2} n)-approximation algorithm for the optimal HC-tree as defined by Dasgupta’s cost function. The approximation factor has since been improved in several subsequent works [8, 13, 22], and it has been shown independently in [8, 13] that one can obtain an O(α_n)-approximation.


Our work. For a fixed graph G, the value of Dasgupta’s cost function can be used to differentiate “better” HC-trees (with smaller cost) from “worse” ones (with larger cost), and the HC-tree with the smallest cost is optimal. However, we observe that this cost function, in its current form, does not indicate whether an input graph has a strong hierarchical structure, or whether one graph has “more” hierarchical structure than another graph. For example, consider a star graph G_1 and a path graph G_2, both with n nodes, n−1 edges, and unit edge weights. It turns out that the cost of the optimal HC-tree for the star G_1 is Θ(n^2), while that for the path graph G_2 is Θ(n log n). However, the star, with all nodes connected to a central node, is intuitively more cluster-like than the path with its n−1 sequential edges. (In fact, we will show later that a star, as well as the so-called linked star, where two stars have their central vertices linked by an edge (see Figure 7 (a) in the Appendix), both have what we call a perfect HC-structure.) Furthermore, for a dense unit-weight graph with Θ(n^2) edges, one can show that the optimal cost is always Θ(n^3), whether or not the graph exhibits any cluster structure at all. Hence, in general, it is not meaningful to compare the optimal HC-tree costs across different graphs.
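To make the star bound concrete, here is a short back-of-the-envelope calculation (ours, using the cost function recalled in Section 2.1) for the caterpillar HC-tree on the star G_1 that merges the center c with one leaf v_i at a time; since every edge of the star is incident to c, a matching Ω(n^2) lower bound holds for every tree:

```latex
% Dasgupta cost of the caterpillar tree T_cat on the star G_1 = K_{1,n-1}
% (unit weights): edge (c, v_i) is split at the cluster {c, v_1, ..., v_i}
% of size i + 1.
\mathrm{cost}_{G_1}(T_{\mathrm{cat}})
  = \sum_{i=1}^{n-1} \bigl|\mathrm{leaves}(T_{\mathrm{cat}}[\mathrm{LCA}(c, v_i)])\bigr|
  = \sum_{i=1}^{n-1} (i + 1)
  = \frac{n(n+1)}{2} - 1
  = \Theta(n^2).
```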

We propose a modification of Dasgupta’s cost function to address this issue, and study its properties and algorithms. In particular, by reformulating Dasgupta’s cost function, we observe that for a fixed graph there exists a base-cost, which reflects the minimum cost one can hope to achieve. Based on this observation, we develop an improved cost ρ_G(T) to evaluate how well a HC-tree T represents an input graph G. An optimal HC-tree for G is the one minimizing ρ_G(T). The improved cost function has several interesting properties:

(i) For any graph G, a tree T minimizes ρ_G(T) if and only if it minimizes Dasgupta’s cost function; thus the optimal tree under our cost function remains the same as the optimal tree under Dasgupta’s cost function. Furthermore, hardness results and the existing approximation algorithm developed in [11] still apply to our cost function. In fact, the improved cost function can be approximated within the same asymptotic approximation factor as Dasgupta’s cost function for the case where there are k “must link before” constraints that have to be satisfied, via a modification of the recursive sparsest cut algorithm of [11] (details in Appendix B.2).

(ii) For any positively weighted graph G with n vertices, the optimal cost ρ*_G := min_T ρ_G(T) is bounded with ρ*_G ∈ [1, n−2] (while the optimal cost under Dasgupta’s cost function is bounded by O(nW), where W is the total weight of all edges). The optimal cost ρ*_G intuitively indicates how much HC-structure the graph G has. In particular, ρ*_G = 1 if and only if there exists a HC-tree that is consistent with all triplet relations in G (see Section 2 for more precise definitions), in which case we say that this graph has a perfect hierarchical-clustering (HC) structure.

(iii) For a similarity graph with n vertices and m edges, the new formulation enables us to develop an O(n^2 m)-time algorithm to test whether an input graph G has a perfect HC-structure (i.e., ρ*_G = 1) or not, as well as to compute an optimal tree if ρ*_G = 1. If an input graph G is what we call a δ-perturbation of a graph G* with a perfect HC-structure, then in O(nm) time we can compute a HC-tree T whose cost is a (δ^2 + 1)-approximation of the optimal one (δ = 1 means the graph has a perfect HC-structure).


(iv) In Section 5, we study the behavior of our cost function for a random graph G generated from an edge probability matrix P. Under mild conditions on P, we show that the optimal cost ρ*_G concentrates on a certain value, which, interestingly, is different from the optimal ρ* cost for the expectation-graph (i.e., the graph whose edge weights equal the entries in P). For random graphs sampled from the Erdős-Rényi model and the planted bisection model, the optimal cost ρ*_G will decrease if we strengthen in-cluster connections. In other words, the optimal cost reflects how much HC-structure a random graph has.

In general, we believe that the new formulation and the results from our investigation help to reveal insights on the hierarchical structure behind graph data. We remark that [13] proposed a concept of a ground-truth input (graph), which, informally, is a graph consistent with an ultrametric under some monotone transformation of weights. Our concept of graphs with perfect HC-structure is more general than their ground-truth graph (see Theorem 3), and, for example, is much more meaningful for unweighted graphs (Proposition 1).

More on related work. The present study is inspired by the work of Dasgupta [14], as well as the subsequent work in [8, 13, 19, 22], especially [13]. As mentioned earlier, after [14], there have been several independent follow-up works improving the approximation factor to O(log n) [22] and then to O(α_n), via a more refined analysis of the sparse-cut based algorithm of Dasgupta [8, 13], or via SDP relaxation [8]. It was also shown in [8, 22] that it is hard to approximate the optimal cost (under Dasgupta’s cost function) within any constant factor, assuming the Small Set Expansion (SSE) hypothesis originally introduced in [20]. However, for special families of graphs, such as random graphs generated from the so-called hierarchical stochastic block model (HSBM), constant-factor approximation algorithms have been proposed [12, 13]. We also note that in [11], Chatziafratis et al. considered optimizing Dasgupta’s cost function under certain structural constraints, such as “must link before” constraints. Given k constraints, they proposed an O(kα_n)-approximation algorithm based on recursive sparsest (balanced) cut approaches.

In [19], a dual version of Dasgupta’s cost function, called the revenue function, was studied. The authors showed that the standard average-linkage algorithm already achieves a 1/3-approximation for the maximization problem of this revenue function. They also proposed a divisive algorithm that achieves the same approximation factor based on a local search strategy. In [10], Charikar et al. showed that the approximation factor can be improved slightly for Euclidean data under the similarity measure induced by the Gaussian kernel. In particular, for data from R^d, they proposed an efficient Projected Random Cut algorithm that achieves a constant improvement over 1/3 in terms of the approximation factor. Recently, Ahmadian et al. [2] further improved the approximation factor to 0.4246 for general cases by using Max-Uncut Bisection as a subroutine. They also proved that it is APX-hard to maximize the revenue function under the SSE hypothesis.

An analog of Dasgupta’s cost function for dissimilarity graphs (i.e., graphs where edge weights represent dissimilarity) was studied in [13]. The formulation also leads to a maximization problem. The authors showed that the average-linkage algorithm is already a 1/2-approximation algorithm for this maximization problem as well, and also proposed a better 2/3-approximation algorithm using dense cuts. The approximation factor of the average-linkage algorithm for this dissimilarity version of Dasgupta’s cost function was later proven to be 2/3 via a tighter analysis by [9]. In fact, in [9], for both formulations above, Charikar et al. further developed new algorithms that achieve strictly better approximation factors than the best prior approaches.

Finally, we remark that the investigation in [13] considered so-called admissible cost functions: this concept is developed based on the idea of Dasgupta’s cost function, but is slightly more general.

Table of notations. Several different cost functions are introduced in this paper, and various notations are defined in the proofs of the main theorems. For reference, we include a table of frequently used notations in Appendix D.

2 An Improved Cost Function for HC-trees

In this section, we first describe the cost function proposed by Dasgupta [14] in Section 2.1, and then introduce our cost function in Section 2.2.

2.1 Problem setup and Dasgupta’s cost function

Our input is a set of n data points V = {v_1, ..., v_n} together with their pairwise similarities, represented as an n × n weight matrix W with w_ij = W[i][j] representing the similarity between points v_i and v_j. We assume that W is symmetric and each entry is non-negative. Alternatively, we can assume that the input is a weighted (undirected) graph G = (V, E, w), with the weight of an edge (v_i, v_j) ∈ E being w_ij. The two views are equivalent: if the input graph G is not a complete graph, then in the weight-matrix view we simply set w_ij = 0 if (v_i, v_j) ∉ E. We use the two views interchangeably in this paper. Finally, we use “unweighted graph” G = (V, E) to refer to a graph where each edge e ∈ E has unit weight 1 (and each non-adjacent pair has weight 0).

Given a set of data points V = {v_1, ..., v_n}, a hierarchical clustering tree (HC-tree) is a rooted tree T = (V_T, E_T) whose leaf set equals V. We also say that T is a HC-tree spanning (its leaf set) V. Given any tree node u ∈ V_T, T[u] denotes the subtree rooted at u, and leaves(T[u]) denotes the set of leaves contained in the subtree T[u]. To simplify presentation, we sometimes use indices of vertices to refer to vertices; e.g., (i, j) ∈ E means (v_i, v_j) ∈ E. Given any two points v_i, v_j ∈ V, LCA_T(i, j) denotes the lowest common ancestor of leaves v_i and v_j in T; the subscript T is often omitted when the choice of T is clear. The following cost function to evaluate a HC-tree T w.r.t. a similarity graph G was introduced in [14]:

\[ \mathrm{cost}_G(T) \;=\; \sum_{(i,j)\in E} w_{ij}\,\bigl|\mathrm{leaves}(T[\mathrm{LCA}(i,j)])\bigr|. \]

An optimal HC-tree T* is defined as one that minimizes cost_G(T). Intuitively, to minimize the cost, pairs of nodes with high similarity should be merged (into a single cluster) earlier. It was shown in [14] that there is always an optimal tree that is binary. The optimal tree under this cost function has several nice properties, e.g., behaving as expected for graphs such as disconnected graphs, cliques, and planted partitions. In particular, if G is an unweighted clique, then all trees have the same cost, and thus all are optimal; this is intuitive, as no preference should be given to any specific tree shape in this case.
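For concreteness, the following is a minimal sketch (ours, not from the paper) of this cost function in Python, with a binary HC-tree encoded as a nested tuple, weights given as a dictionary keyed by ascending vertex pairs, and lca_cluster_sizes a hypothetical helper name we introduce here:

```python
def lca_cluster_sizes(tree):
    """Map each unordered leaf pair {i, j} to |leaves(T[LCA(i, j)])|.

    Assumes a binary tree given as a nested 2-tuple, e.g. ((0, 1), 2)."""
    sizes = {}

    def walk(t):
        if not isinstance(t, tuple):
            return [t]
        left, right = walk(t[0]), walk(t[1])
        for i in left:               # pairs split here have their LCA here
            for j in right:
                sizes[frozenset((i, j))] = len(left) + len(right)
        return left + right

    walk(tree)
    return sizes

def dasgupta_cost(weights, tree):
    """cost_G(T) = sum over edges (i,j) of w_ij * |leaves(T[LCA(i,j)])|."""
    sizes = lca_cluster_sizes(tree)
    return sum(wij * sizes[frozenset(e)] for e, wij in weights.items())

# Unit-weight triangle on {0, 1, 2}: every binary tree costs 2 + 3 + 3 = 8.
w = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
print(dasgupta_cost(w, ((0, 1), 2)))   # 8.0
```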

2.2 The improved cost function

Figure 1: {i, j}|k (left) and i|j|k (right).

To introduce our improved cost function, it is more convenient to take the matrix view, where the weight w_ij is defined for all pairs of nodes v_i and v_j (as mentioned earlier, if the input is a weighted graph G = (V, E, w), then we set w_ij = 0 for (v_i, v_j) ∉ E). Recall that LCA(i, j) denotes the lowest common ancestor of v_i and v_j; similarly, let LCA(i, j, k) denote the lowest common ancestor of the three leaves v_i, v_j and v_k. A triplet {i, j, k} refers to three distinct indices i ≠ j ≠ k ∈ [1, n]. We say that relation {i, j}|k holds in T if the lowest common ancestor LCA(i, j) of v_i and v_j is a proper descendant of LCA(i, j, k). Intuitively, the subtrees containing leaves v_i and v_j merge first, before they merge with a subtree containing v_k. We say that relation i|j|k holds in T if they are merged at the same time; that is, LCA(i, j) = LCA(j, k) = LCA(i, j, k). See Figure 1 for an illustration.

Definition 1. Given any triplet {i, j, k} of [1, n], the cost of this triplet (induced by T) is

\[
\mathrm{triC}_{T,G}(i,j,k) \;=\;
\begin{cases}
w_{ik} + w_{jk} & \text{if relation } \{i,j\}|k \text{ holds}\\
w_{ij} + w_{jk} & \text{if relation } \{i,k\}|j \text{ holds}\\
w_{ij} + w_{ik} & \text{if relation } \{j,k\}|i \text{ holds}\\
w_{ij} + w_{jk} + w_{ik} & \text{if relation } i|j|k \text{ holds}
\end{cases}
\]

We omit G from the above notation when its choice is clear. The total-cost of tree T w.r.t. G is

\[ \mathrm{totalC}_G(T) \;=\; \sum_{1\le i<j<k\le n} \mathrm{triC}_T(i,j,k). \]

The rather simple proof of the following claim is in Appendix A.1.

Claim 1. \( \mathrm{totalC}_G(T) = \sum_{(i,j)\in E} w_{ij}\bigl(|\mathrm{leaves}(T[\mathrm{LCA}(i,j)])| - 2\bigr) = \mathrm{cost}_G(T) - 2\sum_{(i,j)\in E} w_{ij}. \)

The total-cost of any HC-tree T differs from Dasgupta’s cost cost_G(T) by a quantity depending only on the input graph G. Hence, for a fixed graph G, it maintains the relative order of the costs of any two trees, implying that the optimal tree under totalC_G or cost_G remains the same. While the difference from Dasgupta’s cost function seems minor, it is easier to see using this formulation that for a fixed graph G, there is a least possible cost that any HC-tree will incur, which we call the base-cost. It is important to note that the following base-cost depends only on the input graph G.

Definition 2. Given an n-node graph G with associated similarity matrix W, for any distinct triplet {i, j, k} ⊂ [1, n], define its min-triplet cost to be

\[ \mathrm{minTriC}_G(i,j,k) = \min\{\,w_{ij} + w_{ik},\; w_{ij} + w_{jk},\; w_{ik} + w_{jk}\,\}. \]

The base-cost of a similarity graph G is

\[ \mathrm{baseC}(G) = \sum_{1\le i<j<k\le n} \mathrm{minTriC}_G(i,j,k). \]
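A direct sketch (ours) of the base-cost, observing that minTriC is simply the sum of the two smallest of the three pairwise weights:

```python
from itertools import combinations

def base_cost(W, vertices):
    """baseC(G): for each triplet, add the two smallest pairwise weights
    (equivalently, drop the largest of the three)."""
    total = 0.0
    for i, j, k in combinations(sorted(vertices), 3):
        ws = sorted(W.get(p, 0.0) for p in [(i, j), (i, k), (j, k)])
        total += ws[0] + ws[1]
    return total

# Ratio-cost (Definition 3 below) of the tree from the earlier example:
# rho = totalC / baseC = 2.0 / 2.0 = 1.0, so the unit triangle is "perfect".
print(total_cost(w, ((0, 1), 2), [0, 1, 2]) / base_cost(w, [0, 1, 2]))
```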

To differentiate it from Dasgupta’s cost function, we call our improved cost function the ratio-cost.

Definition 3 (Ratio-cost function). Given a similarity graph G and a HC-tree T, the ratio-cost of T w.r.t. G is defined as

\[ \rho_G(T) = \frac{\mathrm{totalC}_G(T)}{\mathrm{baseC}(G)}. \]

The optimal tree for G is a tree T* such that ρ_G(T*) = min_T ρ_G(T); its ratio-cost ρ_G(T*) is called the optimal ratio-cost, ρ*_G = ρ_G(T*).

Observation 1. (i) For any HC-tree T, totalC_G(T) ≥ baseC(G), implying that ρ_G(T) ≥ 1.
(ii) A tree minimizing ρ_G also minimizes totalC_G (and thus cost_G), as baseC(G) is a constant for a fixed graph.
(iii) There is always an optimal tree for ρ_G that is binary.

Observations (i) and (ii) above follow directly from the definitions of these costs and Claim 1. (iii) holds as it holds for Dasgupta’s cost function cost_G ([14], Section 2.3). Hence, in the remainder of the paper, we consider only binary trees when talking about optimal trees.

Note that it is possible that baseC(G) = 0 for a non-empty graph G (e.g., consider a graph whose edges form a matching of the nodes; that is, each node is incident to at most one edge). We follow the convention that 0/0 = 1 while x/0 = +∞ for any positive number x > 0. We note that this choice is natural: in particular, in Appendix A.2 we show that if baseC(G) = 0, then there must exist a tree T such that totalC_G(T) = 0 = baseC(G). Hence ρ*_G = 1 is still finite in this case.

Intuition behind the costs. Consider any triplet {i, j, k}, and assume w.l.o.g. that w_ij is the largest among the three pairwise similarities. If there exists a “perfect” HC-tree T, then it should first merge v_i and v_j, as they are most similar, before merging them with v_k. That is, the relation for this triplet in the HC-tree should be {i, j}|k; and we say that this relation {i, j}|k (and the tree T) is consistent with (the similarities of) this triplet. The cost of the triplet triC_T(i, j, k) is designed to reward this “perfect” relation: triC_T(i, j, k) is minimized, in which case triC_T(i, j, k) = minTriC(i, j, k), only when {i, j}|k holds in T. In other words, minTriC(i, j, k) is the smallest cost possible for this triplet, and a HC-tree T can achieve this cost only when the relation of this triplet in T is consistent with their similarities.

If there is a HC-tree T such that for all triplets, their relations in T are consistent with their similarities, then totalC_G(T) = baseC(G), implying that T must be optimal (by Observation 1 (i)) and ρ*_G = 1. Similarly, if ρ*_G = 1 for a graph G, then the optimal tree T* has to be consistent with all triplets from V. Intuitively, such a graph G has a perfect HC-structure, in the sense that there is a HC-tree in which the desired merging orders for all triplets are preserved.

Definition 4 (Graph with a perfect HC-structure). A similarity graph G has a perfect HC-structure if ρ*_G = 1. Equivalently, there exists a HC-tree T such that totalC_G(T) = baseC(G).

Examples. The base-cost is independent of the tree T and can be computed easily for a graph. If we find a tree T with totalC_G(T) = baseC(G), then T must be optimal and G has a perfect HC-structure. With this in mind, it is now easy to see that for a clique G (with unit weights), given any HC-tree T and any triplet {i, j, k}, triC_T(i, j, k) = 2 = minTriC(i, j, k). Hence totalC_G(T) = baseC(G) = \( 2\binom{n}{3} \) for any HC-tree T, and a clique has a perfect HC-structure. A complete graph whose edge weights equal the entries of the edge probabilities of a planted partition also has a perfect HC-structure. It is also easy to check that the n-node star graph G_1 (or the linked stars) has a perfect HC-structure as well, while for the n-node path G_2, ρ*_{G_2} = Θ(log n). Intuitively, a path does not possess much hierarchical structure. See Appendix A.3 for details. We also refer the reader to more results and discussions on Erdős-Rényi and planted bipartition (also sometimes called planted bisection) random graphs in Section 5. In particular, as we describe in the discussion after Corollary 2, the ratio-cost function exhibits an interesting, yet natural, behavior as the in-cluster and between-cluster probabilities for a planted bisection model change.
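These small examples can also be checked by brute force; below is a sketch (ours, reusing total_cost and base_cost from the earlier sketches) that enumerates all rooted binary HC-trees on a tiny vertex set by leaf insertion. It is exponential in n and only meant for illustration:

```python
def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` somewhere in `tree`."""
    yield (tree, leaf)                    # join at the current root
    if isinstance(tree, tuple):
        a, b = tree
        for t in insert_leaf(a, leaf):
            yield (t, b)
        for t in insert_leaf(b, leaf):
            yield (a, t)

def all_trees(leaves):
    """All rooted binary HC-trees on `leaves` (there are (2n-3)!! of them)."""
    if len(leaves) == 1:
        yield leaves[0]
        return
    for sub in all_trees(leaves[1:]):
        yield from insert_leaf(sub, leaves[0])

def opt_ratio(W, vertices):
    best = min(total_cost(W, t, vertices) for t in all_trees(vertices))
    return best / base_cost(W, vertices)

n = 6
star = {(0, i): 1.0 for i in range(1, n)}       # center is vertex 0
path = {(i, i + 1): 1.0 for i in range(n - 1)}
print(opt_ratio(star, list(range(n))))   # 1.0: the star is perfect
print(opt_ratio(path, list(range(n))))   # > 1; grows like Theta(log n)
```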

Remark: Note that in general, the value of Dasgupta’s cost cost_G(T) is affected by the (edge) density of a graph. One might think that we could normalize cost_G(T) by the number of edges in the graph (or by the total edge weight). However, the examples of the star and the path show that this strategy by itself is not sufficient to reveal the cluster structure behind an input graph, as those two graphs have an equal number of edges. Rather, one needs to look at triplet relations instead of total edge weights.

3 Some Properties of Ratio-cost

For an unweighted graph, the optimal cost under Dasgupta’s cost function is bounded by O(n^3). For a weighted graph, the optimal cost under Dasgupta’s cost function could be Θ(nW), where W is the sum of all edge weights. In contrast, for our ratio-cost function, it turns out that the optimal cost is always bounded by n for both unweighted and weighted graphs (independent of the weights). For an unweighted graph, we can in fact obtain an upper bound that is asymptotically the same as the one in Theorem 1 (ii) below using a much simpler argument. However, the stated upper bound (n^2 − 2n)/(2m − n) is tight in the sense that it equals 1 for a clique, matching the fact that ρ*_G = 1 for a clique.

Theorem 1. (i) Given a similarity graph G = (V, E, w) with w symmetric and having non-negative entries, we have that ρ*_G ∈ [1, n − 2], where n = |V| ≥ 3.
(ii) For a connected unweighted graph G = (V, E) with n = |V| ≥ 3 and m = |E|, we have that ρ*_G ∈ [1, (n^2 − 2n)/(2m − n)].

Proof of Theorem 1: We prove claim (i) of Theorem 1 by induction. Consider the base case, where we have a graph G with n = 3 and weights w_12, w_13 and w_23. Assume w.l.o.g. that w_12 is the largest among the three weights. Then baseC(G) = w_13 + w_23. Now consider the tree T that merges v_1 and v_2 first, and then merges them with v_3; for this tree, totalC_G(T) = w_13 + w_23 = baseC(G). Hence ρ*_G = 1 no matter what edge weights G has. Thus claim (i) holds for the base case.

Now assume the claim holds for every graph whose number of nodes n satisfies n ≤ s, for some s ≥ 3. We aim to prove it for a graph G with n = s + 1 nodes. Arrange the indices so that the weight w_{s,s+1} is the largest among all weights. Consider the following HC-tree T on the nodes of G: first, merge v_s and v_{s+1}. Next, construct an optimal tree T′ for the subgraph G[V′] induced by the nodes V′ = V \ {v_s, v_{s+1}} = {v_1, ..., v_{s−1}}. Finally, merge T′ with the subtree containing v_s and v_{s+1} to obtain T. See Figure 2.

Figure 2: Merge T′ with v_s and v_{s+1} to obtain tree T.

First assume that s − 1 ≥ 3 (we will discuss the case s − 1 = 2 later). Applying the induction hypothesis to G[V′] with s − 1 nodes, we have:

\[ \frac{\mathrm{totalC}_{G[V']}(T')}{\mathrm{baseC}(G[V'])} = \rho^*_{G[V']} \le (s-1) - 2 \;\Rightarrow\; \mathrm{totalC}_{G[V']}(T') \le (s-3)\cdot \mathrm{baseC}(G[V']). \tag{1} \]


Furthermore, for the tree T constructed above, note that

\[ \mathrm{totalC}_G(T) = \mathrm{totalC}_{G[V']}(T') + \sum_{i\in[1,s-1]} (w_{is} + w_{i,s+1}) + \sum_{1\le i<j<s} \bigl[(w_{is} + w_{i,s+1}) + (w_{js} + w_{j,s+1})\bigr]. \tag{2} \]

For each i ∈ [1, s−1] (i.e., a node in T′), the weight w_is is counted s − 1 times in the last two terms of the RHS of Eqn. (2) (where RHS stands for right-hand side); similarly for w_{i,s+1}. Combining this with Eqn. (1), it then follows that

\[ \mathrm{totalC}_G(T) = \mathrm{totalC}_{G[V']}(T') + (s-1)\cdot \sum_{i\in[1,s-1]} (w_{is} + w_{i,s+1}) \le (s-3)\cdot \mathrm{baseC}(G[V']) + (s-1)\cdot \sum_{i\in[1,s-1]} (w_{is} + w_{i,s+1}). \tag{3} \]

On the other hand, we have that:

\[
\begin{aligned}
\mathrm{baseC}(G) &= \mathrm{baseC}(G[V']) + \sum_{i\in[1,s-1]} (w_{is} + w_{i,s+1}) + \sum_{1\le i<j<s} \bigl[\mathrm{minTriC}(i,j,s) + \mathrm{minTriC}(i,j,s+1)\bigr] \\
&\ge \mathrm{baseC}(G[V']) + \sum_{i\in[1,s-1]} (w_{is} + w_{i,s+1}). 
\end{aligned} \tag{4}
\]

Combining Eqn. (3) and Eqn. (4), we then have that, for s − 1 ≥ 3,

\[ \rho^*_G \le \frac{\mathrm{totalC}_G(T)}{\mathrm{baseC}(G)} \le s - 1 = n - 2. \tag{5} \]

For the case s − 1 = 2, the first term in both Eqn. (3) and Eqn. (4) vanishes. That is,

\[ \mathrm{totalC}_G(T) = (s-1)\cdot \sum_{i\in[1,s-1]} (w_{is} + w_{i,s+1}) \quad\text{while}\quad \mathrm{baseC}(G) \ge \sum_{i\in[1,s-1]} (w_{is} + w_{i,s+1}). \]

Hence the bound in Eqn. (5) still holds. It then follows by induction that the bound ρ*_G ≤ n − 2 holds for any similarity graph G with n ≥ 3 nodes, which proves claim (i) of Theorem 1.

Proving claim (ii) of Theorem 1. We now prove claim (ii), where G is an unweighted connected graph. First, for a node v_i, let d_i denote its degree in the graph G. For any distinct triplet {i, j, k} ⊂ [1, n], its induced subgraph in G can have 0, 1, 2, or 3 edges. We call the triplet a wedge if its induced subgraph has 2 edges, and a triangle if it has 3. It is easy to see that only wedges and triangles contribute to the base-cost, where a wedge contributes cost 1 and a triangle contributes cost 2. Hence we have that:

\[
\mathrm{baseC}(G) = \#\mathrm{wedges} + 2\cdot\#\mathrm{triangles} \ge \frac{2}{3} \sum_{i\in[1,n]} \binom{d_i}{2} = \frac{2}{3}\Bigl[\sum_i \frac{d_i^2}{2} - m\Bigr] \ge \frac{2}{3}\Bigl[\frac{2m^2}{n} - m\Bigr] = \frac{2m(2m-n)}{3n}. \tag{6}
\]


To obtain the first inequality in the above derivation, we use the fact that for each node v_i, \( \binom{d_i}{2} \) counts the total number of wedges having i as the apex, as well as the total number of triangles containing i. However, a triangle is counted three times in the summation (while a wedge is counted exactly once), so \( \sum_i \binom{d_i}{2} = \#\mathrm{wedges} + 3\cdot\#\mathrm{triangles} \). Since a triangle incurs a cost of 2, and #wedges + 2·#triangles ≥ (2/3)(#wedges + 3·#triangles), the first inequality follows. The second inequality in Eqn. (6) follows from the Cauchy-Schwarz inequality and the fact that Σ_i d_i = 2m.

On the other hand, by Claim 1, a trivial bound for totalC_G(T*), where T* is an optimal HC-tree for G, is:

\[ \mathrm{totalC}_G(T^*) = \sum_{(i,j)\in E} \bigl(|\mathrm{leaves}(T^*[\mathrm{LCA}(i,j)])| - 2\bigr) \le m\cdot(n-2). \]

Combining this with Eqn. (6), we can already obtain a bound that is asymptotically the same as the one in claim (ii) of Theorem 1. Below we show that the bound on the optimal total-cost can be improved to totalC_G(T*) ≤ 2m(n − 2)/3.

Claim 2. Given an unweighted connected graph G = (V, E) with n ≥ 3 vertices and m edges, totalC_G(T*) ≤ 2m(n − 2)/3.

Proof of Claim 2: Specifically, for any fixed tree shape T, let L = [1, n] denote the set of leaf nodes in T. The n! permutations of the nodes in V give rise to n! one-to-one correspondences from V to L. Let σ : V → L denote one such correspondence, and T_σ its corresponding tree of shape T with leaf set L = σ(V). Let Σ be the set of all possible correspondences.

\[
\begin{aligned}
\mathrm{totalC}^*_G &\le \frac{1}{n!} \sum_{\sigma\in\Sigma} \mathrm{totalC}_G(T_\sigma) = \frac{1}{n!} \sum_{\sigma\in\Sigma} \sum_{(i,j)\in E} \bigl(|\mathrm{leaves}(T[\mathrm{LCA}(\sigma(i), \sigma(j))])| - 2\bigr) \\
&= \frac{1}{n!} \sum_{\sigma\in\Sigma} \sum_{l_1<l_2\in L} \bigl(|\mathrm{leaves}(T[\mathrm{LCA}(l_1, l_2)])| - 2\bigr)\,\mathbb{1}_{(\sigma^{-1}(l_1),\, \sigma^{-1}(l_2))\in E} \\
&= \frac{1}{n!} \sum_{l_1<l_2\in L} \bigl(|\mathrm{leaves}(T[\mathrm{LCA}(l_1, l_2)])| - 2\bigr) \cdot \sum_{\sigma\in\Sigma} \mathbb{1}_{(\sigma^{-1}(l_1),\, \sigma^{-1}(l_2))\in E}. 
\end{aligned} \tag{7}
\]

Here \( \mathbb{1}_{(\sigma^{-1}(l_1), \sigma^{-1}(l_2))\in E} \) is the indicator function: it equals 1 if there is an edge in E between σ^{−1}(l_1) and σ^{−1}(l_2), and 0 otherwise. Now, for a fixed pair l_1 and l_2, consider the term \( \sum_{\sigma\in\Sigma} \mathbb{1}_{(\sigma^{-1}(l_1), \sigma^{-1}(l_2))\in E} \) from Eqn. (7). The indicator function has value 1 only when σ^{−1} maps l_1 and l_2 to the two endpoints of some edge in E. There are m edges in E, giving rise to 2m possible choices of preimages of (l_1, l_2) under σ. For each such choice, there is no constraint on the preimages of the remaining n − 2 leaf nodes, leaving (n − 2)! possibilities for σ once σ^{−1}(l_1) and σ^{−1}(l_2) are fixed. Therefore, for fixed l_1 and l_2, we have:

\[ \sum_{\sigma\in\Sigma} \mathbb{1}_{(\sigma^{-1}(l_1),\, \sigma^{-1}(l_2))\in E} = 2m\,(n-2)!. \]

It then follows that:

\[ \mathrm{totalC}^*_G \le \frac{1}{n!} \sum_{l_1<l_2\in L} \bigl(|\mathrm{leaves}(T[\mathrm{LCA}(l_1, l_2)])| - 2\bigr) \cdot 2m(n-2)! = \frac{2m}{n(n-1)} \sum_{l_1<l_2\in L} \bigl(|\mathrm{leaves}(T[\mathrm{LCA}(l_1, l_2)])| - 2\bigr). \tag{8} \]


Now observe that \( \sum_{l_1<l_2\in L} (|\mathrm{leaves}(T[\mathrm{LCA}(l_1, l_2)])| - 2) \) is simply totalC_{G_c}(T), the total-cost of the tree T w.r.t. the n-node complete graph G_c with unit edge weights. It then follows that

\[ \sum_{l_1<l_2\in L} \bigl(|\mathrm{leaves}(T[\mathrm{LCA}(l_1, l_2)])| - 2\bigr) = 2\binom{n}{3}. \]

Plugging this into Eqn. (8), we thus have that

\[ \mathrm{totalC}^*_G = \mathrm{totalC}_G(T^*) \le \frac{2m}{n(n-1)} \cdot 2\binom{n}{3} = \frac{2m(n-2)}{3}. \]

Combining the bound in Claim 2 for totalC*_G with the bound baseC(G) ≥ 2m(2m − n)/(3n), we thus obtain that ρ*_G ≤ (n^2 − 2n)/(2m − n), which proves claim (ii) of Theorem 1.

Next, we show that the bound in Theorem 1 (i) is asymptotically tight. To prove this, we will use a certain family of edge expanders.

Definition 5. Given α > 0 and an integer k, a graph G = (V, E) is an (α, k)-edge expander if for every S ⊆ V with |S| ≤ k, the set of crossing edges from S to V \ S, denoted E(S, V \ S), has cardinality at least α|S|.

An undirected graph is d-regular if all nodes have degree d.

Fact 1. [4] For any natural number d ≥ 4 and sufficiently large n, there exist d-regular graphs on n vertices which are also (d/10, n/2)-edge expanders.

Theorem 2. A d-regular (d/10, n/2)-edge expander graph G satisfies ρ*_G = Θ(n) for bounded d.

Proof. The existence of a d-regular (d/10, n/2)-edge expander is guaranteed by Fact 1. Given a cut (S, V \ S), its sparsity is φ(S) = |E(S, V \ S)| / (|S| · |V \ S|). Consider a sparsest cut (S*, V \ S*) for G, i.e., one with minimum sparsity. Assume w.l.o.g. that |S*| ≤ n/2. Its sparsity satisfies:

\[ \phi^* = \frac{|E(S^*, V\setminus S^*)|}{|S^*|\cdot|V\setminus S^*|} \ge \frac{\frac{d}{10}\,|S^*|}{|S^*|\cdot|V\setminus S^*|} = \frac{d}{10\,|V\setminus S^*|} \ge \frac{d}{10n}. \tag{9} \]

The first inequality above holds as G is a (d/10, n/2)-edge expander. On the other hand, consider an arbitrary bisection cut (S, V \ S) with |S| = n/2. Its sparsity satisfies:

\[ \phi(S) = \frac{|E(S, V\setminus S)|}{|S|\cdot|V\setminus S|} = \frac{|E(S, V\setminus S)|}{\frac{n}{2}\cdot\frac{n}{2}} \le \frac{d\cdot\frac{n}{2}}{n^2/4} = \frac{2d}{n}. \tag{10} \]

Combining Eqn. (9) and Eqn. (10), we have that a bisection cut (S, V \ S) is a 20-approximation of the sparsest cut. On the other hand, it is shown in [13] that a recursive β-approximation sparsest-cut algorithm leads to a HC-tree T whose cost cost_G(T) is a c·β-approximation of cost_G(T*), where T* is the optimal tree. Now consider the HC-tree T resulting from first bisecting the input graph via (S, V \ S), and then using sparsest cuts to further partition each of S and V \ S recursively to build a subtree for it. By [13], cost_G(T) ≤ c · 20 · cost_G(T*) for some constant c. Furthermore, cost_G(T) is at least the cost induced by the edges split at the top level; that is,

\[
\mathrm{cost}_G(T) = \sum_{(i,j)\in E} |\mathrm{leaves}(T[\mathrm{LCA}(i,j)])| \ge \sum_{(i,j)\in E(S, V\setminus S)} |\mathrm{leaves}(T[\mathrm{LCA}(i,j)])| = \sum_{(i,j)\in E(S, V\setminus S)} n = n\cdot|E(S, V\setminus S)| \ge n\cdot\frac{nd}{20} = \frac{dn^2}{20}, \tag{11}
\]

where the last inequality holds as G is a (d/10, n/2)-edge expander. Hence

\[ \mathrm{cost}_G(T^*) \ge \frac{1}{20c}\,\mathrm{cost}_G(T) \ge \frac{dn^2}{400c}, \]

with c being a constant. By Claim 1, we have that

\[ \mathrm{totalC}_G(T^*) = \mathrm{cost}_G(T^*) - 2|E| \ge \frac{dn^2}{400c} - dn = \Omega(dn^2). \tag{12} \]

Now call a triplet a wedge (resp. a triangle) if there are exactly two (resp. exactly three) edges among its three nodes. Only wedges and triangles contribute to baseC(G). Thus:

\[ \mathrm{baseC}(G) = \#\mathrm{wedges} + 2\cdot\#\mathrm{triangles} \le \sum_{i\in[1,n]} \binom{d}{2} = O(d^2 n) \;\Rightarrow\; \rho^*_G = \frac{\mathrm{totalC}_G(T^*)}{\mathrm{baseC}(G)} = \Omega\Bigl(\frac{dn^2}{d^2 n}\Bigr) = \Omega\Bigl(\frac{n}{d}\Bigr). \]

Combining the above with Theorem 1, the theorem then follows.

Relation to the “ground-truth input” graphs of [13]. Cohen-Addad et al. introduced what they call “ground-truth input” graphs to describe inputs that admit a “natural” ground-truth cluster tree [13]. A brief review of this concept is given in Appendix A.4. We show that a ground-truth input graph always has a perfect HC-structure. However, the converse is not necessarily true, and the family of graphs with perfect HC-structures is broader while still being natural. The proof of the following theorem can be found in Appendix A.4.

Theorem 3. Given a similarity graph G = (V, E, w), if G is a ground-truth input graph of [13], then G must have a perfect HC-structure. However, the converse is not true.

Intuitively, a graph G has a perfect HC-structure if there exists a tree T such that for all triplets, the most similar pair (with the largest similarity) is merged first in T. Such a triplet-order constraint is much weaker than the requirement of the ground-truth graph of [13] (which intuitively is generated by an ultrametric). Also, common agglomerative hierarchical clustering algorithms such as single-linkage and average-linkage will not necessarily output the optimal HC-tree, and thus fail to recover the optimal HC-structure for graphs with perfect HC-structures. An example is given in Figure 3. In fact, the following proposition shows that the concept of ground-truth graph is rather stringent for graphs with unit edge weights (i.e., unweighted graphs). In particular, a connected unweighted graph is a ground-truth graph of [13] if and only if it is the complete graph. In contrast, unweighted graphs with perfect HC-structures represent a much broader family of graphs. The proof of this proposition is in Appendix A.4.

Figure 3: An unweighted graph G with a perfect HC-structure; an optimal HC-tree T is shown on the right. This graph is not a ground-truth input graph of [13].

Proposition 1. Let G = (V, E) be an unweighted graph (i.e., w(u, v) = 1 if (u, v) ∈ E and 0 otherwise). G is a ground-truth graph if and only if each connected component of G is a clique.

4 Algorithms

4.1 Hardness and approximation algorithms

By Claim 1, totalC_G(T) equals cost_G(T) minus a quantity that depends only on G. Thus the hardness results for optimizing cost_G(T) also hold for optimizing ρ_G(T) = totalC_G(T)/baseC(G). Theorem 4 below follows mostly from results of [8] and [14]. The proof is in Appendix B.1.

Theorem 4. (1) It is NP-hard to compute ρ*_G, even when G is an unweighted graph (i.e., edges have unit weights). (2) Furthermore, under the SSE hypothesis, it is NP-hard to approximate ρ*_G within any constant factor.

We remark that while the hardness results for optimizing cost_G(T) translate into hardness results for ρ*_G, it is not immediately evident that an approximation algorithm translates too, as totalC_G(T) differs from cost_G(T) by a positive quantity W = 2·Σ_{(i,j)∈E} w_ij. This quantity could dominate the value of cost_G(T); hence an approximation of cost_G(T) does not translate to an approximation of totalC_G(T) (and thus of ρ_G(T)) within the same approximation factor. Nevertheless, it turns out that one can modify the approximation algorithms of [8] and of [11] so that they lead to an approximation algorithm for the optimal ratio-cost ρ*_G within the same asymptotic approximation factor. We summarize the result below; the details can be found in Appendix B.2.


Theorem 5. Given a similarity graph G = (V, E) with n nodes, there is a polynomial-time algorithm to approximate ρ*_G within a factor of O(α_n) = O(√log n), where α_n stands for the best approximation factor for the sparsest cut problem on a graph of size n.

We note that, in fact, in Appendix B.2 we prove a stronger result: we give an approximation algorithm for the optimal ratio-cost of a HC-tree that satisfies some pre-defined structural constraints (in the form of “must link before” triplets), obtained by making minor modifications to the algorithm of [11] designed to approximate the optimal Dasgupta cost of a HC-tree satisfying these structural constraints. The above theorem is then a by-product of the case where there is no constraint. See Appendix B.2 for details.

4.2 Algorithms for graphs with perfect HC-structures

While in general it remains open how to approximate ρ*_G (as well as cost_G(T*)) to a factor better than √log n, we now show that we can check whether a graph has a perfect HC-structure, and compute an optimal tree if it does, in polynomial time. In the next subsection, we will also provide a polynomial-time approximation algorithm for graphs with near-perfect HC-structures (to be defined later).

Remark: One may wonder whether a simple agglomerative (bottom-up) hierarchical clustering algorithm, such as the single-linkage, average-linkage, or complete-linkage clustering algorithm, could recover the perfect HC-structure. For the ground-truth input graphs introduced in [13], it is known that they can be recovered via average-linkage clustering algorithms. However, as the example in Figure 3 shows, this is in general not true for recovering graphs with perfect HC-structures: depending on the strategy, a bottom-up approach may very well merge nodes 1 and 4 first, leading to a non-optimal HC-tree.

Intuitively, it is necessary to use a top-down approach to recover the perfect HC-structure. The high-level framework of our recursive algorithm BuildPerfectTree(G) is given below; it outputs a binary HC-tree T that spans (i.e., has as its leaf set) a subset of vertices from V = V(G). Roughly speaking, procedure validBipartition(G) splits V into two sets V_A and V_B such that this split does not violate any merging-order constraint potentially incurred by any triplet of vertices in V. If such a “valid bi-partition” (to be defined formally later) of V indeed exists, then our algorithm tries to construct an optimal HC-tree for the subgraphs G_A and G_B of G spanned by the vertex sets V_A and V_B by calling the same algorithm recursively. We will show later that the output tree T spans all vertices of V(G) if and only if G has a perfect HC-structure (in which case T will also be an optimal tree). The key step is procedure validBipartition(G), which we describe after introducing some notation.

To produce a valid bi-partition, we first produce a valid partition that may contain more than two subsets, and then merge multiple subsets into two supersets if possible. The triplet types, which we define shortly, play an important role in both the partitioning and the merging steps: they provide a more refined classification of the triplets, and each type is processed differently so that the final bi-partition does not violate any constraint induced by these triplet types. We now introduce the definitions of triplet types and of a valid partition. First, we say that (A, B) is a partial bi-partition of V if A ∩ B = ∅ and A ∪ B ⊆ V; and (A, B) is a bi-partition of V (or G) if A ∩ B = ∅, A, B ≠ ∅, and A ∪ B = V.


Algorithm 1: BuildPerfectTree
  input : A graph G = (V, E, w).
  output: A binary HC-tree T.

  (A, B) = validBipartition(G);
  if A = ∅ or B = ∅ then
      return ∅;
  else
      T_A = BuildPerfectTree(G_A);
      T_B = BuildPerfectTree(G_B);
      build tree T with T_A and T_B as the two subtrees of its root;
      return T;
  end

Let V = {x_1, ..., x_n}.

Definition 6 (Triplet types). A triplet {x_i, x_j, x_k} with edge weights w_ij, w_ik and w_jk is
type-1: if the largest weight, say w_ij, is strictly larger than the other two; i.e., w_ij > w_ik, w_jk;
type-2: if exactly two weights, say w_ij and w_ik, are the largest; i.e., w_ij = w_ik > w_jk;
type-3: otherwise, which means all three weights are equal; i.e., w_ij = w_ik = w_jk.

Definition 7 (Valid partition). A partition S = (S_1, ..., S_t), t > 1, of V (i.e., ∪S_i = V, S_i ≠ ∅, and S_i ∩ S_j = ∅) is valid w.r.t. G if (i) for any type-1 triplet {x_i, x_j, x_k} with w_ij > max{w_ik, w_jk}, either all three vertices belong to the same set of S, or x_i and x_j are in one set of S while x_k is in another set; and (ii) for any type-2 triplet {x_i, x_j, x_k} with w_ij = w_ik > w_jk, it cannot be that x_j and x_k are in the same set of S while x_i is in another one.

If this partition is a bi-partition, then it is also called a valid bi-partition.

In what follows, we also refer to each set S_i of the partition S as a cluster.
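As an illustration of Definitions 6 and 7, here is a small sketch (ours) that checks whether a candidate partition is valid; part maps each vertex to a cluster id, and wt(i, j) is a hypothetical weight lookup returning 0 for non-edges:

```python
from itertools import combinations

def is_valid_partition(wt, part, vertices):
    """Check Definition 7 for a partition given as a vertex -> cluster map."""
    for i, j, k in combinations(vertices, 3):
        ws = {(i, j): wt(i, j), (i, k): wt(i, k), (j, k): wt(j, k)}
        top = max(ws.values())
        winners = [p for p, v in ws.items() if v == top]
        cl = {v: part[v] for v in (i, j, k)}
        if len(winners) == 1:            # type-1: a unique largest weight
            (a, b), = winners
            c = ({i, j, k} - {a, b}).pop()
            together = len(set(cl.values())) == 1
            split_off = cl[a] == cl[b] != cl[c]
            if not (together or split_off):
                return False
        elif len(winners) == 2:          # type-2: pivot shared by both pairs
            s1, s2 = set(winners[0]), set(winners[1])
            pivot = (s1 & s2).pop()
            u, v = (s1 | s2) - {pivot}
            if cl[u] == cl[v] != cl[pivot]:
                return False
        # type-3 (all three weights equal) imposes no constraint
    return True

# Type-2 triplet with w_01 = w_02 > w_12: x_1, x_2 must not be split off from x_0.
wt = lambda i, j: {frozenset((0, 1)): 3.0, frozenset((0, 2)): 3.0,
                   frozenset((1, 2)): 1.0}.get(frozenset((i, j)), 0.0)
print(is_valid_partition(wt, {0: 'A', 1: 'B', 2: 'B'}, [0, 1, 2]))  # False
print(is_valid_partition(wt, {0: 'A', 1: 'A', 2: 'B'}, [0, 1, 2]))  # True
```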

The procedure validBipartition(G) computes a valid bi-partition if one exists. Otherwise, it returns “fail” (more precisely, it returns (A = ∅, B = ∅)). It turns out that computing a valid partition (Step-1) follows from the existing literature on the so-called rooted triplet consistency problem. The main technical challenge is to develop (Step-2).

Algorithm 2: validBipartition(G = (V, E))
  (Step-1): Compute a certain valid partition C = {C_1, ..., C_r} of V, if possible.
  (Step-2): Compute a valid bi-partition from this valid partition C.

(Step 1): compute a valid partition. It turns out that the partition procedure within the algorithm BUILD of [3] computes a specific partition C = {C_1, ..., C_r} with nice properties. The following result can be derived from Lemma 1, Theorem 2, and the proof of Theorem 4 of [3].


Proposition 2 ([3]). (1) We can check whether there is a valid partition for G = (V, E) in O(nm + n^2 log n) time, where n = |V| and m = |E|. Furthermore, if one exists, then within the same time bound we can compute a valid partition C = {C_1, ..., C_r} which is minimal in the following sense: given any other valid partition C′ = {C′_1, ..., C′_t}, C is a refinement of C′ (i.e., all points from the same cluster C_i ∈ C are contained within some C′_j ∈ C′).

(2) Let (A, B) be any valid bi-partition for G, and C the partition as computed above. It follows from (1) that for any x ∈ C_i, if x ∈ A (resp. x ∈ B), then C_i ⊆ A (resp. C_i ⊆ B).

We note that the time complexity of algorithm BUILD of [3] is min{O(n′ + N log^2 n′), O(N + n′^2 log n′)}, where n′ is the number of nodes and N is the total number of constraints that the final tree needs to satisfy. In our case, by Definition 7, only triplets of type-1 or type-2 generate constraints, and each triplet generates only O(1) constraints. Hence the total number of constraints is bounded by the total number of type-1 and type-2 triplets. A triplet of these two types contains at least one edge of E, and thus there are at most N = O(mn) constraints, with n = |V| and m = |E|. This gives rise to the claimed time complexity in the above proposition.

(Step 2): compute a valid bi-partition from C if possible. Suppose (Step 1) succeeds, and let C = {C_1, ..., C_r}, r > 1, be the minimal valid partition computed. It is not clear whether a valid bi-partition exists (and how to find it) even when a valid partition exists. Our algorithm computes a valid bi-partition depending on whether a claw configuration exists or not.

Definition 8 (A claw). Four points x_i | {x_j, x_k, x_ℓ} form a claw w.r.t. C if (i) each point is from a different cluster of C; and (ii) w_ij = w_ik = w_iℓ > max{w_jk, w_jℓ, w_kℓ}. We call x_i the pivot of this claw. See Figure 4 (a).

Figure 4: (a) A claw x_i | {x_j, x_k, x_ℓ}. (b) Clique C formed by light (dashed) edges.

(Case-1): Suppose there is a claw w.r.t. C. Fix any claw, and assume w.l.o.g. that it is y_r | {y_1, y_2, y_3} with y_i ∈ C_i for i = 1, 2, 3, or r. We choose an arbitrary but fixed representative vertex y_i ∈ C_i for each cluster C_i with i ∈ [4, r−1]. Set G′ = (V′, E′, w) to be the complete graph over the node set V′ = {y_1, y_2, ..., y_r} with edge weights w(y_i, y_j) (so w(y_i, y_j) = 0 if (y_i, y_j) ∉ E is not an edge of the input graph G = (V, E)). Set w̄ = w(y_1, y_r) = w(y_2, y_r) = w(y_3, y_r). We say that an edge e ∈ E′ is light if its weight is strictly less than w̄; otherwise, it is heavy. It is easy to see that, by the definition of a claw, the edges (y_1, y_2), (y_1, y_3), and (y_2, y_3) are all light. Now consider the subgraph G′′ of G′ spanned by only the light edges, and w.l.o.g. let C = {y_1, ..., y_s} be the maximum clique in G′′ containing y_1, y_2 and y_3. See Figure 4 (b) for this clique, where solid edges are heavy while dashed ones are light. (It turns out that this maximum clique can be computed efficiently, as we show in Appendix B.5.2.)

We then set up s potential bi-partitions Π_i = (C_i, ∪_{ℓ∈[1,r], ℓ≠i} C_ℓ), one for each i ∈ [1, s]. We check the validity of each such Π_i. If none of them is valid, then procedure validBipartition(G) returns “fail”. Otherwise, it returns a valid one. The correctness of this step is guaranteed by Lemma 1, whose proof can be found in Appendix B.3.

Lemma 1. Suppose there is a claw w.r.t. C, and let Π_i, i ∈ [1, s], be as described above. There exists a valid bi-partition for G if and only if one of the Π_i, i ∈ [1, s], is valid.

(Case-2): Suppose there is no claw w.r.t. C. The case when there is no claw is slightly morecomplicated. We present the lemma below with a constructive proof in Appendix B.4.

Lemma 2. If there is no claw w.r.t. C, then we can check whether there is a valid bi-partition in G = (V , E) (and compute one if it exists) in O(nm) time, where n = |V | andm = |E|.

This finishes the description of procedure validBipartition(G). Putting everything together, we conclude with the following theorem, whose proof is in Appendix B.5. Note that it is easy to obtain a time complexity of O(n^5); however, we show in Appendix B.5 how to modify our algorithm, as well as provide a more refined analysis, to improve the time to O(n^2 m) = O(n^4).

Theorem 6. Given a similarity graph G = (V, E) with n vertices and m edges, algorithm BuildPerfectTree(G) can be implemented to run in O(n^2 m + n^3 log n) time. It returns a tree spanning all vertices of V if and only if G has a perfect HC-structure, in which case this spanning tree is an optimal HC-tree.

Hence we can check whether G has a perfect HC-structure, and compute an optimal HC-tree if it does, in O(n^2 m + n^3 log n) time.

4.3 Approximation algorithm for graphs with almost perfect HC-structures

In practice, a graph with a perfect HC-structure could be corrupted by noise. We introduce a concept of graphs with an almost perfect HC-structure, and present a polynomial-time algorithm to approximate the optimal cost.

Definition 9 (δ-perfect HC-structure). A graph G = (V, E, w) has a δ-perfect HC-structure, δ ≥ 1, if there exist weights w* : E → R such that (i) the graph G* = (V, E, w*) has a perfect HC-structure; and (ii) for any e = (u, v) ∈ E, we have (1/δ)·w(e) ≤ w*(e) ≤ δ·w(e). In this case, we also say that G = (V, E, w) is a δ-perturbation of the graph G* = (V, E, w*).

Note that a graph with 1-perfect HC-structure is simply a graph with a perfect HC-structure.


The following theorem gives rise to an O(n^3)-time algorithm to approximate an optimal HC-tree for graphs with a δ-perfect HC-structure. Note that when δ = 1 (i.e., the graph G has a perfect HC-structure), this theorem also gives a 2-approximation algorithm for the optimal ratio-cost in O(nm + n^2 log n) = O(n^3) time. In contrast, an exact algorithm to compute the optimal tree in this case takes O(n^4) time, as shown in Theorem 6.

Theorem 7. Suppose G = (V, E, w) is a δ-perturbation of a graph G* = (V, E, w*) with a perfect HC-structure. Then (i) ρ*_G ≤ δ^2; and (ii) we can compute a HC-tree T such that ρ_G(T) ≤ (1 + δ^2)·ρ*_G (i.e., we can (1 + δ^2)-approximate ρ*_G) in O(nm + n^2 log n) time, where n = |V| and m = |E|.

Proof of Theorem 7. The remainder of this subsection is devoted to proving Theorem 7. To prove claim (i), note that (1/δ)·w(u, v) ≤ w*(u, v) ≤ δ·w(u, v) for any u, v ∈ V. It then follows that baseC(G) ≥ (1/δ)·baseC(G*); furthermore, for any HC-tree T, totalC_G(T) ≤ δ·totalC_{G*}(T). Let T* be the optimal HC-tree for the graph G* with a perfect HC-structure; that is, ρ_{G*}(T*) = 1. We then have

\[ \rho^*_G \le \frac{\mathrm{totalC}_G(T^*)}{\mathrm{baseC}(G)} \le \frac{\delta\,\mathrm{totalC}_{G^*}(T^*)}{\frac{1}{\delta}\,\mathrm{baseC}(G^*)} = \delta^2\cdot\rho_{G^*}(T^*) = \delta^2, \]

proving claim (i) of Theorem 7.

We now prove claim (ii). To this end, given the weighted graph G = (V, E, w) with V = {v_1, ..., v_n}, we first set up a collection R of constraints of the form {i, j}|k as follows:

Given any triplet {i, j, k} ⊆ V with weights w_ij ≥ w_jk ≥ w_ik, we say that edge (i, j) has approximately-largest weight among {i, j, k} if w_ij > δ^2·w_jk (and thus w_ij > δ^2·w_ik as well). If (i, j) has approximately-largest weight, then we add the constraint {i, j}|k to R.

We aim to compute a binary tree T such that T is consistent with all constraints in R; that is, for each {i, j}|k ∈ R, LCA(i, j) is a proper descendant of LCA(i, j, k) (i.e., leaves v_i and v_j are merged first, before being merged with v_k).
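The construction of R is straightforward to express in code; below is a sketch (ours), again with wt(i, j) a hypothetical weight lookup returning 0 for non-edges. Note that for δ ≥ 1, at most one edge of a triplet can be approximately-largest:

```python
from itertools import combinations

def build_constraints(wt, vertices, delta):
    """R = { {a,b}|c : w_ab > delta^2 * max(w_ac, w_bc) } over all triplets."""
    R = []
    for i, j, k in combinations(vertices, 3):
        for (a, b), c in [((i, j), k), ((i, k), j), ((j, k), i)]:
            if wt(a, b) > delta ** 2 * max(wt(a, c), wt(b, c)):
                R.append(((a, b), c))   # read as the relation {a, b} | c
                break                   # at most one edge can qualify
    return R
```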

Figure 5: A node u with k children is converted into a collection of binary tree nodes.

Claim 3. We can compute a binary HC-tree T consistent with R, if one exists, in O(nm + n^2 log n) time.


Proof. Our problem turns out to be almost the same as the so-called rooted triplet consistency (RTC) problem, which has been studied widely in the phylogenetics literature; e.g., [7, 18]. In particular, it is shown in [18] that a tree T′ consistent with a collection of N input constraints on n nodes can be computed in min{O(n + N log^2 n), O(N + n^2 log n)} time (if it exists). In our case, the number of constraints is N = |R| = O(mn): this is because (i) each constraint {i, j}|k contains an edge (i, j) that has strictly positive weight (as it has approximately-largest weight) and thus must be an edge of E; and (ii) for each (i, j), we have O(n) choices for the last node k of the constraint. Hence we can compute a tree T′ consistent with all constraints in R in O(nm + n^2 log n) time.

The only issue left is that this output tree T′ consistent with R may not be binary. However, consider any node u ∈ T′ with more than 2 children. We claim that we can locally resolve it arbitrarily into a collection of binary tree nodes; an example is given in Figure 5, and a small sketch of this local resolution follows below. It is easy to see that this process does not violate any constraint in R. (We remark that this is due to the fact that R is generated from only type-1 triplets. If there were constraints coming from type-2 triplets, then this would no longer be true. This is why in algorithm BuildPerfectTree we have to spend much effort to obtain a valid bi-partition in order to derive a binary HC-tree.)
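The local resolution is straightforward; the following small sketch (our own, with trees encoded as nested lists) resolves every high-degree node into a chain of binary nodes, as in Figure 5:

def binarize(tree):
    # A tree is either a leaf label or a list of child subtrees. A node with
    # more than two children is resolved arbitrarily into a "caterpillar" of
    # binary nodes; this preserves every constraint of R, since constraints
    # from type-1 triplets only fix which pair of leaves merges lowest.
    if not isinstance(tree, list):
        return tree
    kids = [binarize(c) for c in tree]
    while len(kids) > 2:
        kids = kids[:-2] + [[kids[-2], kids[-1]]]
    return kids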

Next, we show that if G has a δ-perfect HC-structure, then such a tree T consistent with R indeed exists.

Lemma 3. Suppose G = (V,E,w) is a δ-perturbation of a graph G* = (V,E,w*) with a perfect HC-structure. Then there exists a binary HC-tree T consistent with all constraints in the set R as constructed above.

Proof. First, consider an optimal HC-tree T* for G*. Now given any constraint {i, j|k} ∈ R, note that the edge (i, j) must be approximately-largest among {i, j, k}. As G is a δ-perturbation of G*, we have

\[ w^*_{ij} \ge \frac{1}{\delta}\, w_{ij} > \frac{1}{\delta}\cdot \delta^2 \max\{w_{ik}, w_{jk}\} \ge \delta \max\Big\{\frac{1}{\delta}\, w^*_{ik},\; \frac{1}{\delta}\, w^*_{jk}\Big\} = \max\{w^*_{ik}, w^*_{jk}\}, \]

that is, w*_ij > max{w*_ik, w*_jk}. This is a type-1 triplet, and thus an optimal tree T* for G* has to be consistent with the constraint {i, j|k}. Hence T* is consistent with all constraints in R, establishing the lemma.

Lemma 4. Let T be any binary HC-tree that is consistent with the constraints in R as constructed above. Then ρ_G(T) ≤ (1 + δ²)·ρ*_G.

Proof. Consider an optimal (binary) HC-tree T* for the graph G. Recall Definition 1,

\[ \mathrm{totalC}_G(T^*) = \sum_{\{i,j,k\} \subseteq V} \mathrm{triC}_{T^*,G}(i,j,k). \]

For any triplet {i, j, k} ⊆ V, suppose its order in the optimal HC-tree T* is {i, j|k} (i.e., i and j are first merged, and then merged with k in T*). If the order in T is the


same, then triC_{T,G}(i, j, k) = triC_{T*,G}(i, j, k). Now assume that the order in T is different from the order in T*, say, w.l.o.g., that the relation {i, k|j} holds in T. Note that triC_{T*,G}(i, j, k) = w_ik + w_jk. There are two possibilities:

(1) {i, k|j} is a constraint in R, meaning that (i, k) is approximately-largest among {i, j, k}. Hence

\[ \mathrm{triC}_{T,G}(i,j,k) = w_{ij} + w_{jk} \le w_{ik} + w_{jk} \le \mathrm{triC}_{T^*,G}(i,j,k). \]

(2) {i, k|j} is not a constraint in R. Note that as T is consistent with {i, k|j}, this means that neither {j, k|i} nor {i, j|k} can belong to R. So none of the three edges (i, k), (j, k), and (i, j) can be approximately-largest among {i, j, k}. In particular, that (i, j) is not approximately-largest among {i, j, k} means that w_ij ≤ δ²·max{w_ik, w_jk}. If w_jk ≤ w_ik, then

\[ \mathrm{triC}_{T,G}(i,j,k) = w_{ij} + w_{jk} \le \delta^2 w_{ik} + w_{jk} \le \delta^2 (w_{ik} + w_{jk}) = \delta^2 \cdot \mathrm{triC}_{T^*,G}(i,j,k). \]

Otherwise w_jk > w_ik, in which case we have:

\[ \mathrm{triC}_{T,G}(i,j,k) = w_{ij} + w_{jk} \le \delta^2 w_{jk} + w_{jk} = (1 + \delta^2)\, w_{jk} \le (1 + \delta^2)(w_{ik} + w_{jk}) = (1 + \delta^2) \cdot \mathrm{triC}_{T^*,G}(i,j,k). \]

As δ ≥ 1, in all cases we have triC_{T,G}(i, j, k) ≤ (1 + δ²)·triC_{T*,G}(i, j, k) for any {i, j, k} ⊆ V.

It then follows that

\[ \mathrm{totalC}_G(T) \le (1 + \delta^2)\cdot \mathrm{totalC}_G(T^*) \;\Rightarrow\; \rho_G(T) \le (1 + \delta^2)\,\rho_G(T^*) = (1 + \delta^2)\,\rho^*_G. \]

Combining Claim 3, Lemma 3 and Lemma 4, claim (ii) of Theorem 7 then follows.

5 Ratio-cost Function for Random Graphs

In this section, we study the behavior of ρ*_G for a random graph G sampled from an edge probability matrix P. Specifically, we first show that under very mild assumptions, the optimal solution on the expectation-graph Ḡ is a (1 + o(1))-approximation of the optimal cost for G with high probability. We note that this result is general and applicable to any family of random graphs satisfying the assumptions. We also show that for Erdős-Rényi random graphs and planted bisection graphs, the optimal ratio-cost ρ*_G will decrease as we raise the probability of in-cluster connections, and increase as we raise the probability of cross-cluster connections; both behaviors are consistent with what one expects from a good cost function. We give a remark on graphs sampled from the so-called Hierarchical Stochastic Block Model (HSBM) at the end of this section.

We now introduce some notation so as to study graphs sampled from edge probability matrices.


Definition 10. Given an n × n symmetric matrix P with each entry P_ij = P[i][j] ∈ [0, 1], G = (V = {v1, . . . , vn}, E) is a random graph sampled from P if each edge (vi, vj) is present with probability P_ij. Each edge in G has unit weight.

The expectation-graph Ḡ = (V, E, w) of P refers to the weighted graph where the edge (vi, vj) has weight w_ij = P_ij.
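For concreteness, here is a minimal sampling sketch (our own; we assume, as is standard for such models, that edges are drawn independently):

import random

def sample_graph(P):
    # Sample an unweighted graph from an n x n edge probability matrix P
    # (Definition 10): edge (v_i, v_j) is included with probability P[i][j].
    n = len(P)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if random.random() < P[i][j]]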

In all statements, "with high probability (w.h.p.)" means "with probability larger than 1 − n^{−ε} for some constant ε > 0". The main result of this section is as follows; the proof is in Appendix C.1.

Theorem 8. Given an n × n edge probability matrix P, assume that each entry satisfies P_ij = ω(√(log n / n)) for any i, j ∈ [1, n]. Given a random graph G = (V, E) sampled from P, let T* denote an optimal HC-tree for G (w.r.t. ratio-cost), and T̄* an optimal HC-tree for the expectation-graph Ḡ. Then we have that w.h.p.,

\[ \rho^*_G = \rho_G(T^*) = (1 + o(1))\,\frac{\mathrm{totalC}_{\bar G}(\bar T^*)}{\mathbb{E}[\mathrm{baseC}(G)]}, \]

where E[baseC(G)] is the expected base-cost for the random graph G.

Hence the optimal cost of a randomly generated graph concentrates around a fixed quantity. However, this quantity totalC_{Ḡ}(T̄*)/E[baseC(G)] is different from the optimal ratio-cost of the expectation-graph Ḡ, which would be

\[ \rho^*_{\bar G} = \frac{\mathrm{totalC}_{\bar G}(\bar T^*)}{\mathrm{baseC}(\bar G)}. \]

In particular, E[baseC(G)] can be very different from baseC(Ḡ).

We emphasize that the above theorem is very general, applicable to any random graph sampled from an edge probability matrix under very mild conditions. We now give two specific examples below. An Erdős-Rényi random graph G = (n, p) is generated by the edge probability matrix P with P_ij = p for all i, j ∈ [1, n]. The probability matrix P for the planted bisection model G = (n, p, q), p > q, is such that (1) P_ij = p for i, j ∈ [1, n/2] or i, j ∈ [n/2 + 1, n]; and (2) P_ij = q otherwise. See Appendices C.2 and C.3 for the proofs of these results.

Corollary 1. Given an Erdős-Rényi random graph G = (n, p) with p = ω(√(log n / n)), w.h.p., the optimal HC-tree has ratio-cost

\[ \rho^*_G = (1 + o(1))\,\frac{2}{3p - p^2}. \]

Corollary 2. (1) Given a planted bisection model G = (n, p, q) with p > q = ω(√(log n / n)), w.h.p., the optimal HC-tree has ratio-cost

\[ \rho^*_G = (1 + o(1))\,\frac{2p + 6q}{3(p+q)^2 - p^3 - 3pq^2}. \]

(2) Note that ρ*_G decreases as p increases. When q < p/3, ρ*_G increases as q increases; otherwise, it decreases as q increases.
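As a quick sanity check (our own verification, not a claim from the paper), setting q = p in Corollary 2, which turns the planted bisection model into the Erdős-Rényi graph G = (n, p), recovers Corollary 1:

\[ \frac{2p + 6q}{3(p+q)^2 - p^3 - 3pq^2}\,\bigg|_{q=p} = \frac{8p}{12p^2 - 4p^3} = \frac{8p}{4p^2(3 - p)} = \frac{2}{3p - p^2}. \]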

Note that while the cost of an optimal tree w.r.t. Dasgupta's cost (and also our total-cost) always concentrates around the cost for the expectation-graph, as we mentioned


above, the value of the ratio-cost depends on the sampled random graph itself. For Erdős-Rényi random graphs, a larger p value indicates tighter connections among nodes, making the graph more clique-like, and thus ρ*_G decreases until ρ*_G = 1 when p = 1. In contrast, the expectation-graph Ḡ_p is a complete graph where all edges have weight p; thus it always has ρ*_{Ḡ_p} = 1 no matter what the value of p is.

For the planted bisection model, increasing the value of p also strengthens in-cluster connections and thus ρ*_G decreases. Interestingly, increasing the value of q when it is small hurts the clustering structure, because it adds more cross-cluster connections, thereby blurring the two clusters formed by vertices with indices from [1, n/2] and from [n/2 + 1, n], respectively. Indeed, ρ*_G increases when q increases for small q. However, when q is already relatively large (close to p), increasing it further makes the entire graph closer to a clique, and ρ*_G decreases when q increases for large q. Note that such a refined behavior (w.r.t. the probability q) does not hold for the original cost function by Dasgupta.

Remark on graphs sampled from HSBM. In [13], the authors studied approximation algorithms with respect to Dasgupta's cost function on the so-called hierarchical stochastic block model (HSBM). Roughly speaking, the HSBM is a hierarchical extension of the stochastic block model, where the edge probability matrix corresponds to clusters forming a hierarchical clustering tree of bounded size. They proposed a (1 + o(1))-approximation algorithm based on singular value decomposition. By combining their result and our Claim 5 (developed in the proof of Theorem 8 in Appendix C.1), one can show that the same algorithm is also a (1 + o(1))-approximation for our improved cost function ρ*_G.

6 Concluding Remarks

In this paper, we proposed a modification of Dasgupta's cost function for measuring the quality of hierarchical clustering trees (HC-trees) for a given similarity graph [14]. We argue that our modified version, called the ratio-cost function, is more effective at indicating how much hierarchical clustering structure an input graph has. We proved that the existing hardness results and approximation algorithms (e.g., the recursive sparsest cut algorithm in [11]) also apply to our improved cost function. We defined the perfect HC-structure based on our improved cost function and further proposed a polynomial-time algorithm to test whether an input graph has such a structure or an almost perfect HC-structure. Finally, we studied the behavior of our ratio-cost function on random graph models.

There are some open directions for future work. Currently, the algorithm we proposed for testing perfect HC-structure has time complexity O(n²m + n³ log n). A natural question is whether this time complexity can be significantly improved. Another related question is whether we can further relax the concept of almost perfect HC-structure, and develop an efficient fixed-parameter tractable algorithm to compute (or approximate) an optimal HC-tree based on that. Finally, we have focused on the minimization problem in this paper. An interesting question is whether a similar modification works for the Moseley-Wang function studied in [19], and if it does, whether the approximation algorithms still apply to the modified cost function.


References

[1] C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. Chapman & Hall/CRC, 1st edition, 2013.

[2] S. Ahmadian, V. Chatziafratis, A. Epasto, E. Lee, M. Mahdian, K. Makarychev, and G. Yaroslavtsev. Bisect and conquer: Hierarchical clustering via max-uncut bisection. ArXiv, abs/1912.06983, 2019.

[3] A. V. Aho, Y. Sagiv, T. G. Szymanski, and J. D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405–421, 1981.

[4] S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press, New York, NY, USA, 1st edition, 2009.

[5] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[6] M. Balcan and Y. Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102–155, 2016.

[7] J. Byrka, S. Guillemot, and J. Jansson. New results on optimizing rooted triplets consistency. Discrete Applied Mathematics, 158(11):1136–1147, 2010.

[8] M. Charikar and V. Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, pages 841–854, Philadelphia, PA, USA, 2017. Society for Industrial and Applied Mathematics.

[9] M. Charikar, V. Chatziafratis, and R. Niazadeh. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '19, pages 2291–2304, 2019.

[10] M. Charikar, V. Chatziafratis, R. Niazadeh, and G. Yaroslavtsev. Hierarchical clustering for Euclidean data. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2721–2730. PMLR, 16–18 Apr 2019.

[11] V. Chatziafratis, R. Niazadeh, and M. Charikar. Hierarchical clustering with structural constraints. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 774–783, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[12] V. Cohen-Addad, V. Kanade, and F. Mallmann-Trenn. Hierarchical clustering beyond the worst-case. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6201–6209. Curran Associates, Inc., 2017.


[13] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, pages 378–397, 2018.

[14] S. Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC '16, pages 118–127, New York, NY, USA, 2016. ACM.

[15] W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, STOC '03, pages 50–58, New York, NY, USA, 2003. ACM.

[16] D. Ghoshdastidar and A. Dukkipati. Consistency of spectral hypergraph partitioning under planted partition model. Ann. Statist., 45(1):289–315, 02 2017.

[17] S. Janson. Large deviations for sums of partly dependent random variables. Random Structures Algorithms, 24:234–248, 2004.

[18] J. Jansson, J. H. K. Ng, K. Sadakane, and W.-K. Sung. Rooted maximum agreement supertrees. Algorithmica, 43(4):293–307, Dec. 2005.

[19] B. Moseley and J. Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3097–3106. Curran Associates, Inc., 2017.

[20] P. Raghavendra and D. Steurer. Graph expansion and the unique games conjecture. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 755–764, New York, NY, USA, 2010. ACM.

[21] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist., 39(4):1878–1915, 08 2011.

[22] A. Roy and S. Pokutta. Hierarchical clustering via spreading metrics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2316–2324. Curran Associates, Inc., 2016.

[23] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Co., 1973.

[24] D. Wang and Y. Wang. An improved cost function for hierarchical cluster trees. CoRR, abs/1812.02715, 2018.


A Missing Details from Section 2

A.1 Proof for Claim 1

We prove the first equality. In particular, fix any pair of leaves vi, vj, and count how many times w_ij is added on each side. On the left-hand side, each time there is a node k such that the relation {i, k|j}, {j, k|i} or {i|j|k} holds, w_ij is added once. There are exactly |leaves(T[LCA(i, j)])| − 2 such k's, which is the same as the number of times w_ij is added on the right-hand side. Hence the first two terms are equal.

The second equality is straightforward after we plug in the definition of costG(T ).

A.2 Proving that baseC(G) = 0 ⇒ totalC_G(T*) = 0

Let G denote an input graph with baseC(G) = 0. This implies that for each triplet {i, j, k}, there can be at most one edge among these three vertices; otherwise, baseC(G) cannot be 0.

Hence the degree of each node in G is at most 1: if a node had two neighbors, the node together with these two neighbors would form a triplet spanned by two edges. In other words, the collection of edges in G forms a matching of its nodes. Let a1, a2, . . . and b1, b2, . . . denote these matched nodes, where (ai, bi) ∈ E and thus these two nodes are paired together; let c1, c2, . . . , cr denote the remaining isolated nodes. It is easy to verify that the tree T shown in Figure 6 has totalC_G(T) = 0.


Figure 6: Tree T with totalCG(T ) = 0.
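The construction of this tree is mechanical; a small sketch (our own encoding, with trees as nested tuples) is:

def zero_cost_tree(matched_pairs, isolated):
    # Pair each matched (a_i, b_i) at the bottom, then attach all pieces
    # along a spine as in Figure 6. Every triplet then has triC = 0: a
    # triplet contains at most one edge, and if it contains (a_i, b_i),
    # that pair is merged first, so only the two zero-weight pairs pay.
    pieces = [(a, b) for a, b in matched_pairs] + list(isolated)
    tree = pieces[0]
    for p in pieces[1:]:
        tree = (tree, p)
    return tree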

A.3 Examples of ratio-costs of some graphs

Consider the two examples in Figure 7 (a). The linked-stars graph G1 consists of two stars centered at v1 and v_{n/2+1} respectively, connected via the edge (v1, v_{n/2+1}). For the linked-stars G1,


assuming that n is even and by symmetry, we have that:

\[
\begin{aligned}
\mathrm{baseC}(G_1) &= 2 \sum_{1 \le i < j < k \le n/2} \mathrm{minTriC}(i,j,k) + 2 \sum_{1 \le i < j \le n/2,\; k \in [n/2+1,\, n]} \mathrm{minTriC}(i,j,k) \\
&= 2 \sum_{2 \le j < k \le n/2} \mathrm{minTriC}(1,j,k) + 2 \sum_{j \in [2,\, n/2]} \mathrm{minTriC}(1,j,n/2+1) \\
&= 2 \binom{n/2 - 1}{2} + 2\,(n/2 - 1) = \frac{n^2}{4} - \frac{n}{2}.
\end{aligned}
\]

On the other hand, consider the natural tree T as shown in Figure 7 (b). It is easy to verify that totalC_{G1}(T) equals baseC(G1), as triC_T = minTriC for every triplet. Hence ρ*_{G1} = 1, and the linked-stars graph has a perfect HC-structure.

For the path graph G2, it is easy to verify that baseC(G2) = n − 2. However, as shown in [14], the optimal tree T* is the complete balanced binary tree and cost(T*) = n log₂ n + O(n), meaning that totalC_{G2}(T*) = cost(T*) − 2|E| = n log₂ n ± O(n) (via Claim 1). It then follows that ρ*_{G2} = (1 + o(1)) log n. Intuitively, a path does not possess much hierarchical structure.
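These base-cost computations are easy to cross-check by brute force. The sketch below (our own; it assumes that minTriC(i, j, k) is, as before, the smallest achievable triplet cost, i.e., the sum of the two smallest pairwise weights) reproduces baseC(G1) = n²/4 − n/2 for the linked-stars:

from itertools import combinations

def base_cost(n, weight):
    # Sum over all triplets of minTriC, taken as the sum of the
    # two smallest of the three pairwise weights.
    total = 0
    for i, j, k in combinations(range(n), 3):
        ws = sorted([weight(i, j), weight(j, k), weight(i, k)])
        total += ws[0] + ws[1]
    return total

def linked_stars_weight(n):
    # Star centers v_0 and v_{n/2}, joined by one edge; unit weights.
    center1, center2 = 0, n // 2
    def w(u, v):
        if {u, v} == {center1, center2}: return 1
        if center1 in (u, v) and max(u, v) < n // 2: return 1
        if center2 in (u, v) and min(u, v) >= n // 2: return 1
        return 0
    return w

n = 10
print(base_cost(n, linked_stars_weight(n)))  # 20
print(n * n // 4 - n // 2)                   # 20, matching n^2/4 - n/2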

Figure 7: Comparison between a linked-stars graph and a path. (a) Linked-stars G1 and the path graph G2. (b) One optimal tree T* for the linked-stars G1.

Finally, it is easy to check that the graph whose similarity (weight) matrix looks like the edge probability matrix for a planted partition, as shown in Figure 8, also has a perfect HC-structure. In particular, as long as p ≥ q, the tree in Figure 8 is optimal, and the cost of this tree equals the base-cost

\[ \mathrm{baseC}(G) = 4 \cdot p \cdot \binom{n/2}{3} + 2 \cdot (p + q) \cdot n \cdot \binom{n/2}{2}. \]

More discussions on random graphs are given in Section 5.

A.4 Relation to ground-truth input graph of [13]

We will briefly review the concept of a ground-truth input graph of [13] below, and then present the missing proofs of our results in the main text.

A metric space (X, d) is an ultrametric if we have d(x, y) ≤ max{d(x, z), d(y, z)} for any x, y, z ∈ X. An ultrametric is a stronger version of a metric, and has been used widely in modeling certain hierarchical tree structures (more precisely, the so-called dendrograms), such as phylogenetic trees. Intuitively, the authors of [13] consider a graph to be a ground-truth input if it is "consistent" with an ultrametric in the following sense:



Figure 8: If the weight matrix (on the left) is the same as the probability matrix for a planted partition (where p > q), then the graph has a perfect HC-structure. The optimal tree is shown on the right, which first arbitrarily merges all nodes in C1 and C2 individually via any binary tree.

Definition 11 (Similarity Graphs Generated from Ultrametrics [13]). A weighted graph G = (V, E, w) is a similarity graph generated from an ultrametric if there exists an ultrametric (V, d) on the node set V and a non-increasing function f : R⁺ → R⁺, such that for every x, y ∈ V, x ≠ y, we have w(e) = f(d(x, y)) for e = (x, y) (note that if (x, y) ∉ E, then w(e) = 0).

To see the connection of an ultrametric with a tree structure, they also provide an alternative way of viewing ground-truth inputs using generating trees.

Definition 12 (Generating Tree [13]). Let G = (V, E, w) be a weighted graph (with w(e) = 0 if e ∉ E). Let T be a rooted binary tree with |V| leaves and |V| − 1 internal nodes, where N and L denote the set of internal nodes and the set of leaves of T, respectively. Let σ : L → V denote a bijection between the leaves of T and the nodes of G. We say that T is a generating tree for G if there exists a function W : N → R⁺ such that for N1, N2 ∈ N, if N1 appears on the path from N2 to the root, then W(N1) ≤ W(N2); moreover, for every x, y ∈ V, w((x, y)) = W(LCA(σ⁻¹(x), σ⁻¹(y))).
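To make Definition 12 concrete, the following sketch (our own; binary trees as nested tuples, W as a map from internal nodes to their levels) recovers the edge weights that a generating tree induces:

def weights_from_generating_tree(tree, W):
    # For a binary tree encoded as nested tuples, the LCA of every pair of
    # leaves taken from the two sides of an internal node t is t itself,
    # so the induced weight of that pair is W[t].
    w = {}
    def leaves(t):
        return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])
    def walk(t):
        if not isinstance(t, tuple):
            return
        left, right = t
        for x in leaves(left):
            for y in leaves(right):
                w[(x, y)] = W[t]
        walk(left)
        walk(right)
    walk(tree)
    return w

For instance, for the tree (('a','b'),'c') with W[('a','b')] = 3 and W[(('a','b'),'c')] = 1 (non-increasing towards the root, as required), the recovered weights are w(a,b) = 3 and w(a,c) = w(b,c) = 1.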

Definition 13 (Ground-truth Input [13]). A graph G is a ground-truth input if it is a similarity graph generated from an ultrametric; equivalently, if there exists a tree T that is generating for G.

Proof of Theorem 3. If G is a ground-truth input, let T be a generating tree for G whose leaf set is exactly V. We now show that ρ_G(T) = 1, meaning that T must be an optimal HC-tree for G with ρ*_G = ρ_G(T).

Indeed, consider any triplet {vi, vj, vk} ⊆ V, and assume w.l.o.g. that N2 = LCA(vi, vj) is a descendant of N1 = LCA(vj, vk) = LCA(vi, vj, vk). This means that N1 is on the path from N2 to the root of T, and by the definition of a generating tree, it follows that w(vi, vj) ≥ w(vi, vk) = w(vj, vk), and nodes vi and vj are merged first in T. Therefore, for this triplet, triC_T(i, j, k) = minTriC(i, j, k). Since this holds for all triplets, we have totalC_G(T) = baseC(G), meaning that ρ_G(T) = 1.

On the other hand, consider the graph (with unit edge weights) in Figure 3, where it is easy to verify that the tree shown on the right satisfies ρ_G(T) = 1 and thus is optimal. However, this graph is not a ground-truth input as defined in [13]. In particular,


in Proposition 1, which we prove shortly below, we show that a unit-weight graph is a ground-truth input if and only if each connected component of it is a clique. This completes the proof of Theorem 3.

Proof of Proposition 1. Assume the tree T is a generating tree for our input unweighted graph G = (V, E), which is a ground-truth input. Let C and T_C denote a connected component of G, and the subtree of T induced by the leaf set C, respectively. It is easy to check that T_C is then a generating tree for the component C. Consider any connected triplet {vi, vj, vk} ⊆ C, meaning that there are at least two edges from E spanning these three nodes in C. Assume w.l.o.g. that w_ij = w_ik = 1. We now enumerate all possible relations between vi, vj and vk in T_C; see Figure 9. Note that by the definition of a generating tree, for each possible configuration in Figure 9, it is necessary that w_jk = 1, meaning that there must be an edge between vj and vk. In other words, if there are two edges (vj, vi) and (vi, vk) in C, then the third edge (vj, vk) must also be present in C and thus in G.


Figure 9: Three possible relations of vi, vj and vk in the tree T_C. In all cases, w_jk = 1.

Now for any two nodes va, vb from the same component C, there must be a path, say va = v_{a_0}, v_{a_1}, · · · , v_{a_k} = vb, connecting them. Using the observation above repeatedly, it is easy to see that there must be an edge between any two nodes v_{a_i} and v_{a_j} (including between va and vb); this can be made precise by an induction on the value of k. Since this holds for any two nodes va, vb in the component C, C must be a clique. Applying this argument to each component, Proposition 1 then follows.
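Proposition 1 also yields a simple test for unit-weight graphs; a sketch (our own, using union-find) checks that every connected component is a clique:

from collections import defaultdict

def is_ground_truth_unit_graph(n, edges):
    # A unit-weight graph is a ground-truth input iff each connected
    # component is a clique; a component on s vertices is a clique iff
    # it contains exactly s*(s-1)/2 edges (simple graph assumed).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    size, inner = defaultdict(int), defaultdict(int)
    for v in range(n):
        size[find(v)] += 1
    for u, v in edges:
        inner[find(u)] += 1
    return all(inner[c] == s * (s - 1) // 2 for c, s in size.items())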

B Missing Details from Section 4

B.1 Proof of Theorem 4

Proof of Part (2). We will first describe the proof of part (2) of Theorem 4, as it is easier. Note that if ρ_G(T) is a c-approximation of ρ*_G = ρ_G(T*) (i.e., ρ_G(T) ≤ c·ρ*_G), then totalC_G(T) is a c-approximation of totalC_G(T*). Since cost_G(T′) = totalC_G(T′) + A, where A = 2·Σ_{i,j∈V} w(i, j), for any HC-tree T′, it follows that

\[ \mathrm{cost}_G(T) \le c \cdot \mathrm{totalC}_G(T^*) + A \le c\,(\mathrm{totalC}_G(T^*) + A) = c \cdot \mathrm{cost}_G(T^*). \]

Hence a c-approximation for ρ*_G gives rise to a c-approximation for cost_G(T*) as well. It then follows that any hardness result for approximating cost_G(T*) translates into an approximation hardness result for ρ*_G as well. Claim (2) of the theorem follows directly from


the work of [8], which showed that it is SSE-hard to approximate cost_G(T*) within any constant factor.

Proof of Part (1). In the remainder of this subsection, we prove claim (1) of Theorem 4, which mostly follows from the NP-hardness proof of (Theorem 10, [14]). In particular, Dasgupta shows that it is NP-hard to minimize cost_G(T) by converting it to a maximization problem. The proof is based on a reduction from a variant of the Not-All-Equal SAT problem, called the NaeSat* problem.

Specifically, the NaeSat* problem is defined as follows:

NaeSat*: The input is a boolean formula φ = C1 ∧ C2 ∧ · · · ∧ Cm consisting of m clauses and n variables x1, . . . , xn in conjunctive normal form. Each clause Ci has either three or two literals. Each variable xi is required to appear exactly three times, once in a 3-clause and twice in 2-clauses with opposite polarities. The NaeSat* problem is to decide whether such an input formula φ is satisfiable.

It is shown in [14] that this problem is NP-complete. Given an instance of the NaeSat* problem with an input formula as described above, Dasgupta reduces it to an instance of maximizing the cost cost_G(T) for a certain weighted graph (with potentially different edge weights), where each literal is represented by a single node in the graph G. In what follows, we will show that a simple modification of Dasgupta's reduction shows that maximizing cost_G(T) remains NP-hard even when the input graph has unit edge weights; hence it is NP-hard to compute cost_G(T*) (and thus ρ*_G) even when G is an unweighted graph. The main difference is that instead of using one single node to represent a literal (as in the argument of [14]), we will now use r + 1 nodes for each literal (itself and r copies, where r will be specified as a quantity depending on m later). We explain this modified reduction in more detail below.

Graph construction. We are given an instance φ of the NaeSat* problem, as defined earlier, containing n variables x1, . . . , xn and m clauses C1, . . . , Cm. First, we assume that all redundant clauses (e.g., x ∨ x̄, or a pair x ∨ y and x̄ ∨ ȳ, etc.) are already removed, in the same way as in [14]. The redundant clauses need to be removed to guarantee that the edges constructed below are distinct.

Given φ, we now build a graph G = (V, E) as follows. Specifically, V contains 2n(r + 1) nodes, where we create 2(r + 1) nodes for each variable xi: r + 1 nodes vi, vi1, · · · , vir for the literal xi, and r + 1 nodes v̄i, v̄i1, · · · , v̄ir for the literal x̄i, respectively. We call the vij's copies of vi (and similarly the v̄ij's are copies of v̄i). The edge set E of G falls into three categories; the edges in the first two categories are exactly the same as in Dasgupta's construction and do not connect any copy. The third category is new.

1. For each 3-clause (assume there are m1 3-clauses in total), we add six edges to G: three edges joining all 3 literals in this clause, and three edges joining their negations. See an example in Figure 10. There are 6m1 edges in this category.



Figure 10: Edges in category 1 for the clause x1 ∨ x̄2 ∨ x3.

2. For each 2-clause (assume there are m2 2-clauses in total), we add two edges to G: one joining the two literals in this clause, and one joining their negations. See an example in Figure 11. There are 2m2 edges in this category.


Figure 11: Edges in category 2 for the clause x1 ∨ x̄2.

3. Finally, for every pair of literals (xi, x̄i), we add the complete bipartite graph over the vertex sets {vi, vi1, . . . , vir} and {v̄i, v̄i1, . . . , v̄ir} (see Figure 12). Note that only edges in this category connect copies.


Figure 12: Edges in category 3.

By construction, there will be 6m1 + 2m2 + n(r + 1)² unit-weight edges in G. Let M = 2n(r + 1)(5m1 + 2m2 + n(r + 1)²). In what follows we prove that φ is satisfiable if and only if the largest cost of any HC-tree for G is at least M.
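For reference, the construction can be written down directly. In the sketch below (our own encoding: a literal is a pair (i, ±1), and node (i, s, c) is copy c of it, with c = 0 the literal itself), the three edge categories are generated as described:

from itertools import combinations

def build_reduction_graph(n, r, three_clauses, two_clauses):
    # Clauses are tuples of literals (i, s) with s in {+1, -1}.
    neg = lambda lit: (lit[0], -lit[1])
    edges = set()
    for clause in three_clauses:                  # category 1: 6 edges each
        for a, b in combinations(clause, 2):
            edges.add(((*a, 0), (*b, 0)))
            edges.add(((*neg(a), 0), (*neg(b), 0)))
    for a, b in two_clauses:                      # category 2: 2 edges each
        edges.add(((*a, 0), (*b, 0)))
        edges.add(((*neg(a), 0), (*neg(b), 0)))
    for i in range(n):                            # category 3: (r+1)^2 each
        side_pos = [(i, +1, c) for c in range(r + 1)]
        side_neg = [(i, -1, c) for c in range(r + 1)]
        for u in side_pos:
            for v in side_neg:
                edges.add((u, v))
    return edges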

φ is satisfiable ⇒ cost*_G ≥ M. Suppose the given formula φ is satisfiable, and let V⁺, V⁻ denote the nodes corresponding to positive and negative literals under the valid assignment, respectively (a copy has the same polarity as its corresponding literal). Similarly to [14], consider a tree T* which first splits V into V⁺ and V⁻. Since V⁺ and V⁻ form a satisfying assignment, all edges, except exactly one edge per triangle from category 1, will be crossing edges (which we say are cut) at the top split of T*. The cost induced by the top split is thus |V| · |E(V⁺, V⁻)| = 2n(r + 1)(4m1 + 2m2 + n(r + 1)²). At the second level, we can construct T* so that all remaining 2m1 edges are cut, as these edges are node-disjoint. Therefore, the Dasgupta cost of the tree T* is

\[
\begin{aligned}
\mathrm{cost}_G(T^*) &= 2n(r + 1) \cdot (4m_1 + 2m_2 + n(r + 1)^2) + n(r + 1) \cdot (2m_1) \\
&= 2n(r + 1)(5m_1 + 2m_2 + n(r + 1)^2) = M.
\end{aligned}
\]


Hence if φ is satisfiable, then there exists a tree T ∗ with cost M .

cost*_G ≥ M ⇒ φ is satisfiable. Conversely, suppose there is a tree T such that cost_G(T) ≥ M for this graph G. We will show that the corresponding NaeSat* instance φ is satisfiable. Assume that the top split of the tree T separates its node set V into two parts of sizes n(r + 1) + k and n(r + 1) − k, where k is an integer and 0 ≤ k < n(r + 1). This split can cut at most 4m1 edges from category 1 (in particular, for each triangle in category 1, either 0 or 2 of its edges can be crossing w.r.t. any split), at most 2m2 edges from category 2, and at most (n(r + 1) − k) · (r + 1) edges from category 3. The total number N1 of crossing edges at this level thus satisfies N1 ≤ 4m1 + 2m2 + (n(r + 1) − k) · (r + 1), and the number N2 of remaining edges satisfies N2 ≥ k(r + 1) + 2m1. For each of the remaining edges, the largest cost it can incur is n(r + 1) + k. It then follows that we can upper-bound cost_G(T) as follows:

\[
\begin{aligned}
\mathrm{cost}_G(T) &\le N_1 \cdot 2n(r+1) + N_2 \cdot (n(r+1) + k) \\
&\le \big(4m_1 + 2m_2 + (n(r+1) - k)(r+1)\big) \cdot 2n(r+1) + \big(k(r+1) + 2m_1\big)\,(n(r+1) + k) \\
&= M - 2kn(r+1)^2 + 2km_1 + (n(r+1) + k)\,k(r+1).
\end{aligned}
\]

The second inequality follows since n(r + 1) + k ≤ 2n(r + 1): shifting edges from the remaining set to the crossing set can only increase the bound, so the smaller N2 is, the larger the bound becomes. For k ≠ 0, if cost_G(T) ≥ M, it must hold that

\[ 2km_1 + k(r+1)\,(k + n(r+1)) \ge 2kn(r+1)^2 \;\Rightarrow\; 2m_1 + k(r+1) \ge n(r+1)^2. \]

Now set r = 2m1. Since k ≤ n(r + 1) − 1, we have 2m1 + k(r + 1) ≤ 2m1 + n(r + 1)² − (r + 1) = n(r + 1)² − 1, so the above inequality cannot hold. Hence the only possibility is that k = 0, which means that T first partitions the node set V into two equal-sized subsets V⁺ and V⁻ at the top level. In fact, it is easy to verify that the first split necessarily leaves at most one edge per triangle of category 1 uncut, and cuts all the other edges; otherwise the cost falls below M again. A split (V⁺, V⁻) satisfying these conditions leads to a satisfying assignment for φ: indeed, since all edges from category 3 are crossing edges, (i) every copy is within the same subset as its corresponding literal, and (ii) no vi and v̄i can be in the same subset.

Putting everything together: since φ is satisfiable if and only if the largest cost of any HC-tree for G is at least M, we have reduced the instance φ to an instance of checking whether the maximum cost for a graph of polynomial size is smaller than M or not. It then follows that it is NP-hard to maximize cost, and thus to minimize the ratio-cost ρ_G, on unweighted graphs.

B.2 Approximation for ρ∗G with structural constraints

In this section, we will first consider a more general problem of minimizing the ratio-cost with k feasible structural constraints, similar to the problem proposed and studied by Chatziafratis et al. in [11]. When k = 0, this is the same problem as minimizing the ratio-cost ρ*_G of an input graph G. We will show that the approximation algorithm developed in [11]


for Dasgupta's cost function can be modified to approximate our ratio-cost function with structural constraints as well, within the same approximation factor. In particular, the approximation factor is O(kαn), where αn is the best approximation factor for the sparsest cut problem; currently αn = O(√(log n)). As a by-product, this leads to an O(√(log n))-approximation algorithm for the ratio-cost ρ*_G when there is no structural constraint (more precisely, by setting k = 1 with a dummy constraint in our approximation algorithm). This in turn proves Theorem 5. We provide more details below.

Optimizing cost_G(T) or ρ_G(T) with structural constraints. Given a graph G = (V, E), we say that a HC-tree T satisfies a structural constraint R = ab|c, with a, b, c ∈ V, if the node LCA(a, b) is a descendant of LCA(a, b, c) (i.e., nodes a, b merge before they merge with node c). Given a graph G and k constraints R = {R1, . . . , Rk}, we say that a HC-tree is valid w.r.t. R if it satisfies all constraints in R.

Given a graph G and k feasible constraints R = {R1, . . . , Rk}, the k-constraint optimal cost for G is the smallest possible cost_G(T) of any valid HC-tree. Similarly, the k-constraint optimal ratio-cost is the smallest possible ratio-cost ρ_G(T) of any valid HC-tree w.r.t. the k constraints in R.

In [11], Chatziafratis et al. proposed a polynomial-time algorithm to approximate the optimal k-constraint cost within a factor of O(kαn), where αn = O(√(log n)) is the best approximation factor for the sparsest cut problem. Their algorithm is called Constrained Recursive Sparsest Cut (CRSC). Roughly speaking, the CRSC algorithm proceeds as follows: at any moment, suppose we are processing a subgraph G′ spanned by a vertex set V′ ⊆ V. For each active structural constraint ab|c, which is a constraint with a, b, c ∈ V′, the nodes a, b are first merged into a supernode before performing the approximate sparsest cut on the resulting graph; parallel edges are merged into a single edge whose weight is the sum of the weights of the edges before merging. A supernode can be unpacked once its constraint is no longer active. After performing the approximate sparsest cut V′ = (V′1, V′2), the same procedure is applied recursively to the subgraphs spanned by V′1 and V′2.

It turns out that the CRSC algorithm of [11], with minor modifications, is an O(kαn)-approximation algorithm for computing the k-constraint ratio-cost. Our modified algorithm also applies to the case with no constraints (by considering a dummy constraint and k = 1), which leads to an O(αn)-approximation. This leads to Theorem 9.

We also remark that in the previous version of this paper (Appendix B.2, [24]), we proved that the semidefinite programming algorithm studied in [8] can achieve an O(√(log n))-approximation with respect to ρ_G when there is no constraint. This result is omitted from the present paper, as it can be obtained as a special case of the constrained version.

Theorem 9. Given a graph G = (V, E) and k feasible constraints R = {R1, . . . , Rk}, there is a polynomial-time algorithm (Algorithm 3) that approximates the k-constraint optimal ratio-cost within a factor of O(kαn) = O(k√(log n)), where αn is the best approximation factor for the sparsest-cut problem.

The remainder of this section is devoted to the proof of this theorem. We note that the


algorithm and analysis mostly follow those of [11], with only minor modifications. So below we focus on illustrating the key modifications.

First, our modified version of the CRSC algorithm of [11] is given in Algorithm 3. Roughly speaking, we apply the CRSC algorithm while the subgraph considered during the recursion has size larger than 24k; otherwise we apply the procedure CRSC-MIN, which is the version of CRSC that uses a minimum cut in place of the approximate sparsest cut during the recursion. Specifically, given a subgraph G′ = (V′, E′), we compute a cut (V′1, V′2) of G′ such that the sum of the weights of all edges crossing this cut is minimized, and then recurse over the subgraphs induced on V′1 and V′2, respectively. Using CRSC-MIN when the subgraph has size smaller than 24k ((Step-2) in Algorithm 3) is the only modification to the original CRSC algorithm of [11].

Algorithm 3: BuildTreeWithConstraints
input : Graph G = (V, E) and k feasible triplet constraints.
output: A hierarchical clustering tree T on V.
(Step-1): Apply the constrained recursive sparsest cut algorithm (CRSC) to G; the recursion stops once the input set has size smaller than 24k. Let Ts denote the partial hierarchical clustering tree produced in this step.
(Step-2): For the bottom-level clusters A1, A2, · · · , Am in Ts, apply CRSC-MIN to each Ai, and let Ti denote the output for Ai. Let T be the tree produced by replacing each Ai in Ts with Ti.
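The control flow of Algorithm 3 is a plain recursion. The sketch below (our own; sparsest_cut and min_cut are assumed black boxes that split a vertex set while respecting the active constraints, e.g., by contracting constrained pairs into supernodes as in CRSC) shows the size-24k switch:

def build_tree_with_constraints(vertices, constraints, k, sparsest_cut, min_cut):
    # (Step-1) uses approximate sparsest cuts; once a cluster has fewer than
    # 24k vertices, the recursion is in (Step-2) and uses minimum cuts
    # instead (CRSC-MIN). Trees are returned as nested tuples.
    if len(vertices) == 1:
        return vertices[0]
    cut = min_cut if len(vertices) < 24 * k else sparsest_cut
    B1, B2 = cut(vertices, constraints)
    return (build_tree_with_constraints(B1, constraints, k, sparsest_cut, min_cut),
            build_tree_with_constraints(B2, constraints, k, sparsest_cut, min_cut))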

Proof of Theorem 9. We show that the tree T output by Algorithm 3 has a ratio-cost that O(kαn)-approximates that of an optimal valid tree T* of G w.r.t. the input constraints.

We say that A1, . . . , Ar form a partition of a set A if all the Ai's are disjoint and A = A1 ∪ A2 ∪ · · · ∪ Ar. One can view a HC-tree T on n nodes [1, . . . , n] as a collection of partitions at levels t = n, . . . , 0. The partition at level t consists of maximal clusters of size at most t, and is represented by a set of variables x^t_ij, i, j ∈ V: specifically, x^t_ij = 1 if i and j are in different clusters of the partition at level t, and x^t_ij = 0 otherwise. The problem of minimizing cost_G can thus be modeled as a 0-1 optimization problem with objective function

\[ \min \sum_{t=0}^{n} \sum_{(i,j) \in E} x^t_{ij}\, w_{ij}, \qquad (13) \]

where E is the set of edges of the graph G, and w_ij is the edge weight between i and j.

Recall that the difference between cost_G and totalC_G is the constant term 2·Σ_{(i,j)∈E} w_ij (Claim 1). Hence minimizing the total-cost totalC_G can be written as:

\[ \min \sum_{t=2}^{n} \sum_{(i,j) \in E} x^t_{ij}\, w_{ij}. \qquad (14) \]


Indeed, since x^t_ij = 1 for t = 0, 1, by doing this we deduct 2·Σ_{(i,j)∈E} w_ij from the objective function in Eqn (13), giving rise to totalC_G. Recall that T* is the optimal valid HC-tree whose total-cost minimizes totalC_G (and thus, equivalently, also cost_G), and let OPT denote its total-cost. (In [11], the authors use OPT = totalC_G(T*) for the optimal cost without structural constraints; here we use this term for the optimal cost over trees satisfying the structural constraints. The lemma proved in their paper still holds, because the inequality holds for any hierarchical clustering tree.) Set OPT(t) = Σ_{(i,j)∈E} x^t_ij·w_ij; intuitively, this is the total weight of all crossing edges of the partition at level t. It is shown in [11] that

\[ \mathrm{cost}_G(T^*) = \sum_{t=0}^{n} \mathrm{OPT}(t). \qquad (15) \]

It then follows that

\[ \mathrm{totalC}_G(T^*) = \sum_{t=2}^{n} \mathrm{OPT}(t). \qquad (16) \]

Finally, OPT_A(t) denotes the cost at level t restricted to a subset A, i.e., OPT_A(t) = Σ_{(i,j)∈E, i,j∈A} x^t_ij·w_ij.

The function OPT(t) is non-increasing in t, because the set of crossing edges at level t is a subset of those at level t′ for t′ < t. This fact will be used multiple times in the proof. Finally, given a set A, recall that a constraint ab|c ∈ R is active for A if all of the nodes a, b, c ∈ A. For the sparsest cut splitting a set A into two smaller sets B1 and B2 with |B1| ≤ |A|/2, Chatziafratis et al. [11] proved the following result:

\[ |A| \cdot w(B_1, B_2) \le 4\alpha_n |B_1| \cdot \mathrm{OPT}_A(\lfloor |A|/6k_A \rfloor), \qquad (17) \]

where k_A (≤ k) is the number of active constraints for the set A, and

\[ w(B_1, B_2) := \sum_{(i,j) \in E,\; i \in B_1,\; j \in B_2} w_{ij} \]

stands for the total weight of the edges crossing the cut (B1, B2). When there is no active constraint, k_A can simply be set to 1. Since our objective function totalC_G does not account for the cost of clusters of size 2, applying the above lemma to small clusters causes some difficulties; this is the reason why we use minimum cuts for small clusters in Algorithm 3.

In what follows, we aim to prove the following two inequalities for the splits performed by BuildTreeWithConstraints. Specifically, setting totalC* := totalC_G(T*), we claim:

\[ \text{Inequality-(1):}\qquad \sum_{A:\,|A| \ge 24k} (|A| - 2)\cdot w(B_1, B_2) \;\le\; O(k\alpha_n) \cdot \mathrm{totalC}^* \]

\[ \text{Inequality-(2):}\qquad \sum_{A:\,|A| < 24k} (|A| - 2)\cdot w(B_1, B_2) \;\le\; O(k) \cdot \mathrm{totalC}^* \]

Here, the summations are over all clusters A in the tree T output by Algorithm 3. Note that these two inequalities together suffice to show that the output tree T has a total-cost totalC_G (and thus a ratio-cost ρ_G, as the base-cost is independent of the HC-tree) that


O(kαn)-approximates totalC_G(T*) (and thus O(kαn)-approximates ρ_G(T*) as well) on the graph G w.r.t. the k feasible constraints. Indeed, the sum of the left-hand sides of the two inequalities gives rise to totalC_G(T), because totalC_G can be interpreted as a sum of splitting costs, where the cost of the split A = (B1, B2) is (|A| − 2)·w(B1, B2). Hence to prove Theorem 9, what remains is to prove Inequality-(1) and Inequality-(2). The proof of Inequality-(1) follows almost the same argument as the one in [11] (as this part uses the same CRSC algorithm); nevertheless, we include the derivation below for completeness. The main new part is the proof of Inequality-(2).

Proof of Inequality-(1). Using Eqn (17), and w.l.o.g. assuming |B1| ≤ |A|/2 for the cut (B1, B2) of A, we have the following:

\[
\begin{aligned}
\sum_{A:\,|A| \ge 24k} (|A| - 2)\cdot w(B_1, B_2)
&\le \sum_{A:\,|A| \ge 24k} |A| \cdot w(B_1, B_2) && \text{(line 1)} \\
&\le \sum_{A:\,|A| \ge 24k} 4\alpha_n |B_1| \cdot \mathrm{OPT}_A(\lfloor |A|/6k_A \rfloor) && \text{(line 2)} \\
&\le O(\alpha_n) \cdot \sum_{A:\,|A| \ge 24k}\; \sum_{t=|A|-|B_1|+1}^{|A|} \mathrm{OPT}_A(\lfloor t/6k_A \rfloor) && \text{(line 3)} \\
&\le O(\alpha_n) \cdot \sum_{t=12k}^{n} \mathrm{OPT}(\lfloor t/6k \rfloor) && \text{(line 4)} \\
&\le O(\alpha_n) \cdot 6k \sum_{t=2}^{n} \mathrm{OPT}(t) = O(k\alpha_n) \cdot \mathrm{totalC}^*. && \text{(line 5)}
\end{aligned}
\]

To get line 2 we use Eqn (17). Line 2 to line 3 follows from the fact that the function OPT_A(t) is non-increasing, so each of the |B1| summands in line 3 is at least OPT_A(⌊|A|/6k_A⌋). To derive line 4 from line 3, first consider which terms contribute to line 3 at a fixed level t. For a fixed t, a cluster A that contributes to line 3 must satisfy |A| ≥ t > |B2| > |B1|, where A = (B1, B2) is the split of A in the tree T; in other words, A is a minimal cluster of size no smaller than t. All such clusters (contributing to line 3 for a fixed t) thus form a disjoint partition of V, because of this minimality property. Furthermore, observe that OPT_A(⌊t/6k_A⌋) ≤ OPT_A(⌊t/6k⌋), as the function OPT_A is non-increasing. For each fixed t, summing OPT_A(⌊t/6k_A⌋) over all A contributing to it is thus at most OPT(⌊t/6k⌋). Finally, for any A considered in line 3, the smallest possible t is |A| − |B1| + 1 ≥ 24k − 12k + 1 > 12k; it then follows that line 3 is bounded from above by summing OPT(⌊t/6k⌋) over all t ≥ 12k (i.e., line 4). Line 4 to line 5 holds because Σ_{t=a}^{a+6k} OPT(⌊t/6k⌋) ≤ 6k·OPT(⌊a/6k⌋) for any a, as OPT(t) is non-increasing. Thus, by breaking the range [12k, n] into intervals of length 6k, we have that

\[ \sum_{t=12k}^{n} \mathrm{OPT}(\lfloor t/6k \rfloor) \;\le\; 6k \cdot \sum_{t=2}^{n/6k} \mathrm{OPT}(t) \;\le\; 6k \cdot \sum_{t=2}^{n} \mathrm{OPT}(t). \]

The final equality in line 5 follows from Eqn (16).

Proof of Inequality-(2). For each maximal cluster A of size smaller than 24k (i.e., whose parent has size at least 24k), there will be at most 24k − 1 minimum cuts applied


to it until all clusters become singletons. Consider a starting cluster A (i.e., a maximal cluster of size at most 24k), and set totalC*_A = Σ_{t=2}^{n} OPT_A(t) to be the portion of the total-cost of the optimal valid tree T* restricted to the elements of the cluster A. In (Step-2) of Algorithm 3, we find a minimum cut (B1, B2) of A that satisfies all active constraints. The sum of the weights of the crossing edges w.r.t. this minimum cut (B1, B2), denoted by w(B1, B2), is thus no larger than the weight of any other valid cut, including the cut (B*1, B*2) induced by the optimal valid tree T* restricted to the elements of the cluster A. It then follows that

\[ (|A| - 2) \cdot w(B_1, B_2) \le (|A| - 2) \cdot w(B^*_1, B^*_2) \le \mathrm{totalC}^*_A. \]

Similarly, for any cluster A′ ⊂ A created during the recursive cutting of A, (|A′| − 2)·w(B′1, B′2) is at most totalC*_A as well. Since at most 24k − 1 cuts are ever applied, generating at most 24k subclusters A′ of A during the recursive algorithm, we have, for a fixed A,

\[ C_A := \sum_{A' \subseteq A} (|A'| - 2) \cdot w(B'_1, B'_2) \le 24k \cdot \mathrm{totalC}^*_A, \]

where the summation ranges over A as well as all subclusters A′ of A created during the recursive cutting of A in (Step-2) of Algorithm 3. Finally, note that the maximal clusters A1, . . . , Ar of size smaller than 24k form a partition of the n nodes of V. Hence we have:

\[ \sum_{A:\,|A| < 24k} (|A| - 2)\cdot w(B_1, B_2) \;\le\; \sum_{i \in [1, r]} C_{A_i} \;\le\; 24k \cdot \sum_{i \in [1, r]} \mathrm{totalC}^*_{A_i} \;\le\; 24k \cdot \mathrm{totalC}^* = O(k) \cdot \mathrm{totalC}^*. \]

This proves Inequality-(2).

Putting everything together, the output tree T has a ratio-cost that O(kαn)-approximates that of the optimal valid HC-tree. Theorem 9 then follows.

B.3 Proof of Lemma 1

First, we state the following simple observation (which holds because the set of clusters C = {C1, . . . , Cr} forms a valid partition); we will use it repeatedly later.

Observation 2. Given a valid partition C of G, if a triplet {xi, xj, xk} of G is of type-1, then it is not possible that all three nodes are contained in three different clusters of C.

Now note that one direction of Lemma 1 is trivial: if some Πi with i ∈ [1, s] is valid, then clearly a valid bi-partition of G exists, as by construction Πi consists of two clusters. In the remainder we prove the other direction; that is, if there exists some valid bi-partition of G, then some Πi has to be valid.

First, we claim that all edges (yr, yi) for i ∈ [4, s] must be heavy (recall that {y1, . . . , ys} is a clique formed by only light edges). This is because otherwise the triplet {y1, yi, yr} would be of type-1, which is not possible by Observation 2.


Now suppose there exists a valid bi-partition (A, B) of V. Assume that two nodes, say yi, yj with i ≠ j ∈ [1, s], are in A. (Note that if a node yi ∈ Ci is in A, then by Proposition 2 all nodes in Ci are in A.) The triplet {yi, yj, yr} cannot be type-3, as the associated edge weights are not all equal. It also cannot be type-1, by Observation 2. Hence {yi, yj, yr} is type-2. It is then necessary that yr ∈ A as well, as w(yj, yr) = w(yi, yr) > w(yi, yj) and the valid bi-partition (A, B) is consistent with this triplet. Furthermore, in this case, B can contain at most one vertex from the clique C = {y1, . . . , ys}, as otherwise yr would also have to be in B by the same argument, contradicting that it must already be in A.

Now consider any vertex yt ∈ V′ outside the clique C, with t ≠ r. Since yt ∉ C, there exists a vertex ya ∈ C such that the edge (yt, ya) is heavy, as otherwise yt would have been included in the clique C. We claim that in this case all edges (yt, yi) with i ∈ [1, s] must be heavy. Indeed, suppose the edge (yt, yb) were light for some b ∈ [1, s], b ≠ a. Then the triplet {yt, ya, yb} would be of type-1, which is not possible by Observation 2.

As yt forms heavy edges with all points in the clique C, applying our previous argument (where we showed yr ∈ A) to yt, we can prove that yt ∈ A as well. In other words, B ∩ V′ (recall that V′ = {y1, . . . , yr} contains one point from each cluster in C) is either empty, or contains exactly one point yk, in which case it is also necessary that yk ∈ C; that is, k ∈ [1, s].

Finally, by Proposition 2, if a point from a cluster Ci is in A (or B), then all points in that cluster are necessarily in A (resp. B). This means that B ∩ V′ cannot be empty (otherwise B would be empty), and contains exactly one point yk from the clique C (which implies that B = Ck). In this case, the valid partition (A, B) is exactly Πk = {Ck, ∪_{j≠k, j∈[1,r]} Cj}. This proves Lemma 1.

B.4 Proof of Lemma 2

Suppose there is a valid bi-partition (A, B); by Proposition 2 we just need to assign each cluster Ci in C to either A or B. More specifically, let z1, . . . , zr be a set of boolean variables with zi = 1 if Ci ⊆ A, and zi = 0 if Ci ⊆ B. By Definition 7, our goal now is to find truth assignments for the zi's so that the resulting bi-partition (A, B) is consistent with each triplet of G of type-1 or type-2. Below we go through each triplet of type-1 or type-2, and identify the constraints on the zi's that this triplet may incur.

Consider a triplet {xi, xj, xk}. If it is of type-1 with w_ij > max{w_ik, w_jk}, then, as the partition C is valid, either all three points belong to the same cluster of C, or xi and xj are from the same cluster while xk is from a different one. In both cases, it is easy to verify that no matter how the Ci's are assigned in the bi-partition (A, B), the resulting bi-partition is consistent with this triplet. In other words, this triplet does not incur any constraint on the assignment of the zi's.

Now assume the triplet is of type-2. There are three possibilities. (i) If all three nodes are from the same cluster, then the triplet incurs no constraint on the assignment of clusters to the sets A and B. (ii) Suppose two nodes are from the same cluster, and the third node is from a different one. Either this remains the case in the final bi-partition (A, B), or all three belong to one subset of (A, B). In either case, the final bi-partition (A, B) will still be


Figure 13: Three possible patterns formed by four points, where {xi, xj, xk} is of type-2. (a) Claw 1: w_ij = w_ik = w_it > w_jk, w_jt, w_kt. (b) Claw 2: w_it = w_jt = w_kt > w_ij = w_ik > w_jk. (c) Two type-2 triplets: w_ij = w_ik > w_jk, and w_jt = w_kt > w_jk.

consistent with this triplet.

(iii) The only remaining case is when all three nodes of this type-2 triplet are from three different clusters of C. W.l.o.g., assume this type-2 triplet is {xi, xj, xk}, with the nodes from the clusters Ci, Cj and Ck, respectively.

Consider an arbitrary point xt ∈ Ct with t ∈ [1, r] \ {i, j, k}. W.l.o.g., assume that the type-2 triplet {xi, xj, xk} is such that w = w_ij = w_ik > w_jk. By Observation 2, for any three points out of {xi, xj, xk, xt}, at least two of the edges among them have equal weights, which are no smaller than the weight of the third edge. Based on this, we can show that these 4 points either form a claw (see Figure 13 (a) and (b)), or {xt, xj, xk} also forms a type-2 triplet with w_tj = w_tk > w_jk (Figure 13 (c)). The former case is not possible, as we have assumed that there is no claw w.r.t. C. Hence xj and xk form a type-2 triplet with every node xt, for t ∈ [1, r] \ {i, j, k}, with w_jk being the smallest edge weight in this triplet. If xj and xk are in the same subset of a valid bi-partition (A, B), then for the bi-partition to be consistent with the triplet {xt, xj, xk}, xt (thus Ct) also needs to belong to this subset. In other words, if zj = zk, then z1 = z2 = · · · = zr and all clusters belong to the same subset of the bi-partition, which is not possible. It then follows that the assignment must satisfy zj ≠ zk. On the other hand, if Cj and Ck are assigned to different subsets of the bi-partition, then no matter how the other clusters are assigned, all these triplets {xt, xj, xk} will be consistent.

Finally, note that there are O(mn) type-2 triplets: each type-2 triplet has to contain one edge from E, and thus there are at most m × (n − 1) of them, and they can be enumerated in O(mn) time. Now going through all O(nm) type-2 triplets whose three points come from three distinct clusters, we obtain a set of at most O(min{r², nm}) = O(nm) constraints of the form zj ≠ zk that all need to hold. Given a set of such constraints, we can use a modified breadth-first-search procedure to identify, in time linear in the number of constraints, whether an assignment of the boolean variables z1, . . . , zr satisfying all constraints exists, and compute one if it exists. The assignment, if it exists, then gives rise to the desired valid bi-partition. If it does not exist, algorithm validBipartition(G) returns "fail".
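The assignment step is a standard 2-coloring of the constraint graph; here is a minimal sketch (our own) of such a BFS procedure:

from collections import deque, defaultdict

def two_color(r, diff_constraints):
    # Assign z_1..z_r in {0, 1} so that z_j != z_k for every constraint
    # (j, k), via BFS on the constraint graph; returns None if impossible.
    adj = defaultdict(list)
    for j, k in diff_constraints:
        adj[j].append(k)
        adj[k].append(j)
    z = {}
    for start in range(1, r + 1):
        if start in z:
            continue
        z[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in z:
                    z[v] = 1 - z[u]
                    queue.append(v)
                elif z[v] == z[u]:
                    return None   # the constraints are unsatisfiable
    return z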


B.5 Proof for Theorem 6

We simply run algorithm BuildPerfectTree(G) on the input graph G = (V, E), and let T be its output HC-tree. We claim that G has a perfect HC-structure if and only if T spans all vertices of V, in which case T is also an optimal HC-tree (i.e., ρ_G(T) = 1). We first argue the correctness of our algorithm, which is relatively simple. The main technical contribution lies in proving that we can implement the algorithm within the stated time complexity.

B.5.1 Correctness of the algorithm

First, the subgraph G′(V′) of G induced by a subset V′ ⊆ V of vertices is the subgraph of G spanned by all edges of G between vertices in V′; we also call G′(V′) an induced subgraph. Recall that by the definitions of totalC_G and baseC(G), a binary tree T satisfies ρ_G(T) = 1 if and only if T is consistent with the similarities of every triplet in V (see the discussion above Definition 4). If the algorithm returns a HC-tree T spanning all vertices of V, then we claim that this returned binary tree T is consistent with all triplets of G. We prove this claim inductively by considering each subtree Tv rooted at v ∈ T, as well as the corresponding subgraph Gv, in a bottom-up manner, and show that

if the subtree Tv (returned by our algorithm for the graph Gv) spans all vertices in Gv, then Gv has a perfect HC-structure, and Tv is an optimal HC-tree for Gv.

For the base case, note that at each leaf node v this claim holds trivially. Now consider an internal node v ∈ T, where the subtree Tv is obtained by BuildPerfectTree(Gv), and Gv is the subgraph of G induced by all leaves of Tv. Let TA and TB be the two child subtrees of Tv, with GA and GB their corresponding subgraphs. By the induction hypothesis, GA (resp. GB) has a perfect HC-structure with TA (resp. TB) being an optimal HC-tree for it. To check that Tv is also optimal for Gv, we just need to verify that Tv is consistent with any triplet {vi, vj, vk} whose nodes are not all from the same child subtree. Indeed, this follows easily from the fact that the leaves of TA and TB form a valid bi-partition of Gv (as procedure validBipartition(G) succeeds).

For the opposite direction, we need to show that if G has a perfect HC-structure, then BuildPerfectTree(G) succeeds. This follows from the simple observation below, combined with Lemmas 1 and 2.

Observation 3. If a graph G = (V, E) has a perfect HC-structure, then (i) any of its induced subgraphs also has a perfect HC-structure; and (ii) there exists a valid bi-partition of G.

Proof. Let T* be an optimal binary HC-tree for G; then T* is consistent with every triplet of G. Let G_{V′} be the induced subgraph of G spanned by the vertices V′ ⊂ V. Now remove from T* all leaves corresponding to the vertices V \ V′, and remove any degree-2 nodes in the resulting tree. This gives rise to an induced binary HC-tree T*_{V′}. Obviously, this tree is still consistent with all triplets formed by vertices of V′; hence it is an optimal HC-tree for G_{V′} with ρ_{G_{V′}}(T*_{V′}) = 1. This proves (i). For (ii), note that the two sets of leaves from the two subtrees of the root of T* form a valid bi-partition of G.


B.5.2 Time complexity

The recursive algorithm BuildPerfectTree(G) has depth at most O(n). At each level (depth), we call the algorithm on a collection of pairwise disjoint subgraphs. Hence at each level, the total size (as measured by the number of vertices) of all subproblems is O(n). We now derive the time complexity of one subproblem BuildPerfectTree(G) for a subgraph G = (V, E) with n vertices and m edges.

First, our algorithm builds an initial partition C = {C_1, . . . , C_r} of G = (V, E). By Proposition 2, this (Step-1) takes O(nm + n² log n) time.

Next, in (Step-2) we compute a bi-partition (A, B) from the partition C. If there is no claw w.r.t. C, then Lemma 2 states that a valid bi-partition (if it exists) can be computed in O(nm) time. What remains is to bound the time complexity for the case when claws exist.

Naively, checking for claws, as well as checking for a valid bi-partition once a claw is given, takes O(n⁴) time. However, there is much structure behind crossing triplets, which are triplets with all three vertices from three different clusters in the valid partition C = {C_1, . . . , C_r}. We will leverage that structure to compute a claw and check for a valid bi-partition in O(n(n + m)) time. (Recall that any claw would work for our algorithm to compute a valid bi-partition if one exists.)

A more efficient algorithm for identifying a claw. We perform the following in algorithm validBipartition(G) to check whether there is a claw, and to compute one if any exists. Recall that the valid partition C = {C_1, . . . , C_r} is already computed. We say that an edge is a crossing edge if its two endpoints are from different clusters in C. A triplet is crossing if its three nodes are from three different clusters of C.

Let us now fix a crossing edge (x_j, x_k) ∈ E, say, with the two nodes x_j and x_k from clusters C_j and C_k, respectively. At the beginning, all clusters C_t ∈ C have label '0'. We then scan all nodes x ∈ C_t with t ∉ {j, k} and do the following:

Algorithm 4: Labeling(x_j, x_k)

Label all clusters C_t ∈ C by '0';
for each node x_t ∈ C_t, t ∉ {j, k} do
    if the triplet {x_j, x_k, x_t} is of type-2 with w(x_j, x_k) = w(x_j, x_t) > w(x_k, x_t) then
        Set the label of cluster C_t to be '1';
    end
end

Note that once a cluster receives a '1' label, its label does not change any more. For a fixed crossing edge (x_j, x_k), labeling all clusters thus takes O(n) time, as there are at most n − 2 crossing triplets containing (x_j, x_k).
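A Python sketch of the Labeling procedure follows; the interface (a cluster_of map and a symmetric weight lookup w, with w = 0 for non-edges) is our own assumption. It also records one witness node per '1'-labeled cluster, which is convenient for extracting a claw afterwards:

def labeling(nodes, cluster_of, w, xj, xk):
    """Labeling(xj, xk): return a dict mapping each cluster that receives
    label '1' to one witness node xt. A cluster Ct gets label '1' if some
    xt in Ct forms a type-2 triplet {xj, xk, xt} whose two maximum weights
    are realized by (xj, xk) and (xj, xt)."""
    cj, ck = cluster_of[xj], cluster_of[xk]
    labeled = {}                          # cluster index -> witness node
    for xt in nodes:
        ct = cluster_of[xt]
        if ct == cj or ct == ck:
            continue
        if w[xj][xk] == w[xj][xt] > w[xk][xt]:
            labeled.setdefault(ct, xt)    # a '1' label never changes back
    return labeled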

Given a claw x_a | {x_b, x_c, x_d}, we call the edges (x_a, x_b), (x_a, x_c) and (x_a, x_d) with maximum weight the legs of this claw, while the three remaining edges are the base-edges of this claw.

Lemma 5. Process all crossing triplets containing (x_j, x_k) as described above. There is a claw pivoted at x_j and containing (x_j, x_k) as a leg if and only if at least two clusters have label '1'.

Proof. First, suppose there is a claw x_j | {x_k, x_a, x_b} for some x_a ∈ C_a and x_b ∈ C_b. Then it is easy to see that there are at least two crossing triplets ({x_j, x_k, x_a} and {x_j, x_k, x_b}) satisfying the condition in the Labeling procedure, causing the clusters C_a and C_b to be marked with label '1'. Hence one direction of the lemma holds. To prove the opposite direction, suppose we have clusters C_t and C_s with label '1'. Assume that these labels are caused by crossing triplets {x_j, x_k, x_s} and {x_j, x_k, x_t} with x_s ∈ C_s and x_t ∈ C_t. Then obviously, w := w(x_j, x_k) = w(x_j, x_s) = w(x_j, x_t) > max{w(x_k, x_s), w(x_k, x_t)}. To show that the four nodes x_j, x_k, x_s, x_t span a claw x_j | {x_k, x_s, x_t}, we just need to argue that w(x_s, x_t) < w. Indeed, this is true as otherwise the triplet {x_k, x_s, x_t} would be of type-1, which is not possible by Observation 2. Hence if C_t and C_s are both marked with label '1', we have a claw x_j | {x_k, x_s, x_t} pivoted at x_j and containing (x_j, x_k), proving the opposite direction of the lemma. Putting everything together, the lemma then follows.

To identify a claw, we consider all edges (x_j, x_k) in E, and there are m of them (recall that m = |E|). For each edge (x_j, x_k), we call the above procedure twice: Labeling(x_j, x_k) and Labeling(x_k, x_j). After each Labeling call, we check whether there exist two clusters with label '1'. If so, we compute a claw as specified in the proof of Lemma 5 and terminate the entire algorithm. This process takes O(n + r) = O(n) time for (x_j, x_k), as there are at most n − 2 crossing triplets containing (x_j, x_k) and r ≤ n. By Lemma 5, calling Labeling(x_j, x_k) and Labeling(x_k, x_j) ensures that if there exists any claw containing (x_j, x_k) as a leg (where the pivot can be x_j or x_k), our algorithm will be able to compute one.

Finally, note that by definition, any leg of a claw has to be an edge in E (as its weight is strictly larger than those of the base-edges and thus cannot be 0). Hence by enumerating all m edges in E, we will compute a claw if any exists (note that our algorithm does not enumerate all claws). The total time needed for identifying a claw is thus O(m) · O(n) = O(nm).
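Putting the pieces together, the claw search can be sketched as follows, reusing the labeling function above (again, the interface is our own assumption):

def find_claw(nodes, edges, cluster_of, w):
    """Enumerate edges; by Lemma 5, if two clusters are labeled '1' after
    Labeling(xj, xk), a claw pivoted at xj with leg (xj, xk) exists.
    Returns (pivot, [leaf1, leaf2, leaf3]), or None if no claw exists."""
    for (x, y) in edges:
        if cluster_of[x] == cluster_of[y]:
            continue                        # a leg must be a crossing edge
        for xj, xk in ((x, y), (y, x)):     # the pivot may be either endpoint
            witnesses = labeling(nodes, cluster_of, w, xj, xk)
            if len(witnesses) >= 2:
                xs, xt = list(witnesses.values())[:2]
                return xj, [xk, xs, xt]     # the claw xj | {xk, xs, xt}
    return None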

Lemma 6. Given a graph G = (V, E) with n vertices and m edges, and a valid partition C = {C_1, . . . , C_r} for G, it takes O(nm) time to detect whether a claw w.r.t. C exists, and to compute one if it exists.

Checking for a valid bi-partition with a given claw. Finally, suppose we have now identified a claw y_r | {y_1, y_2, y_3}. Recall that in our algorithm for (Case-1) as described in Section 4.2, we construct a graph G′ spanned on a set of representative points V′ = {y_1, . . . , y_r}, one from each cluster C_i. We next compute the subgraph G′′ consisting only of light edges, as well as the maximum clique C = {y_1, . . . , y_s} containing y_1, y_2, y_3 in G′′. (In particular, let w denote the weight of the "legs" of the claw y_r | {y_1, y_2, y_3}, i.e., w = w(y_r, y_1) = w(y_r, y_2) = w(y_r, y_3). Recall that an edge is light if its weight is strictly smaller than w, and heavy otherwise.)

Computing a maximum clique is expensive in general; however, in our case it turns out that:

Claim 4. The component of the light-edge graph G′′ containing y_1, y_2, y_3 is necessarily a clique.

Indeed, in the proof of Lemma 1, we have already shown that for any y_t ∈ V′ outside the maximum clique C, all edges (y_t, y_i) for i ∈ [1, s] are necessarily heavy, which implies the above claim.

Computing the graphs G′ and G′′ and the maximum clique thus takes O(r²) = O(n²) time.

What remains is to show that, given the clique C = {y_1, . . . , y_s}, we can check whether there is a valid bi-partition Π_i, for some i ∈ [1, s], in O(nm) time.

To this end, we maintain an array L of size s = O(r) = O(n), where each entry L[i] is initialized to 0. Recall that by Observation 2, no crossing triplet can be of type-1. We now scan through all crossing triplets of type-2 only. Assume w.l.o.g. that we have a type-2 crossing triplet {x_i, x_j, x_k} with points from C_i, C_j and C_k, respectively, and suppose that w_ij = w_ik > w_jk. Then the only bi-partition from the set {Π_1, . . . , Π_s} that is not consistent with this triplet is Π_i, as x_j and x_k will be merged first in the resulting binary HC-tree, before merging with x_i. In this case, we simply set L[i] = 1. We ignore all type-3 crossing triplets, as any bi-partition is consistent with them.

There are O(nm) crossing triplets of type-2, as each of them has to contain an edge with positive weight (thus an edge in E). After we process all crossing triplets of type-2 in O(nm) time, we linearly scan the array L. We do not need to process non-crossing triplets, as it is already shown at the beginning of the proof of Lemma 2 (Section B.4) that non-crossing triplets remain consistent with any bi-partition arising from merging clusters in C. Hence in the end, if there exists an entry L[t] = 0 for some t ∈ [1, s], the bi-partition Π_t must be consistent with all triplets, and thus is valid. Otherwise, there is no valid bi-partition. The entire process takes O(nm) time.
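This filtering step admits a direct implementation. The sketch below is our own illustration; it assumes the type-2 crossing triplets have been enumerated and oriented so that the first vertex x_i is the one carrying the two maximum-weight edges, and that cluster_of maps a node to its cluster's index (with indices 1, . . . , s for the clusters represented in the clique):

def select_valid_bipartition(s, type2_triplets, cluster_of):
    """Rule out candidates Pi_1..Pi_s using type-2 crossing triplets: each
    oriented triplet (xi, xj, xk) with w(xi,xj) = w(xi,xk) > w(xj,xk) rules
    out exactly the candidate Pi_i. Returns the index t of a surviving
    candidate Pi_t, or None if no valid bi-partition exists."""
    L = [0] * (s + 1)                 # L[i] = 1 means Pi_i is ruled out
    for (xi, xj, xk) in type2_triplets:
        i = cluster_of[xi]
        if 1 <= i <= s:               # clusters outside the clique have no candidate
            L[i] = 1                  # Pi_i is the unique inconsistent candidate
    for t in range(1, s + 1):
        if L[t] == 0:
            return t                  # Pi_t is consistent with all triplets
    return None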

Combining this with Lemma 6, we conclude:

Lemma 7. There is an algorithm that identifies a claw, if one exists, in O(nm) time. If a claw exists, then we can check whether a valid bi-partition exists, and compute one if it does, in O(n(n + m)) time.

Putting everything together, procedure validBipartition(G) takes O(n(n + m)) time, where n and m are the number of vertices and the number of edges in G, respectively.

Putting everything together, the time for calling BuildPerfectTree(G) is O(nm + n² log n). Finally, this means that each depth level during our recursive algorithm takes O(Σ_i (n_i m_i + n_i² log n_i)) = O(nm + n² log n) time, where the n_i's (resp. m_i's) are the numbers of vertices (resp. edges) of the subgraphs at this level, with Σ_i n_i = n (resp. Σ_i m_i ≤ m). Since there are at most O(n) levels, the total time complexity of algorithm BuildPerfectTree(G) is O(n²m + n³ log n). This finishes the proof of Theorem 6.


C Missing Details from Section 5

C.1 Proof of Theorem 8

Since our total-cost is closely related to Dasgupta's cost function (recall Claim 1), the part of the proof for Theorem 8 on bounding the total-cost is almost the same as the argument used in [13] to show the concentration of Dasgupta's cost function on a random graph. The part on bounding the base-cost is different.

Recall that an optimal HC-tree w.r.t. the ratio-cost function is also optimal w.r.t. the total-cost function. We first claim the following.

Proposition 3. Given an n × n probability matrix P, assume $P_{ij} = \omega\big(\sqrt{\log n / n}\big)$ for i, j ∈ [1, n]. Given a random graph G = (V, E) sampled from P, let $T^*$ be an optimal HC-tree for G w.r.t. ratio-cost, and let $\bar{T}^*$ be an optimal tree for the expectation-graph $\bar{G}$. Then we have that w.h.p.,

$|totalC_G(T^*) - totalC_{\bar{G}}(\bar{T}^*)| = o\big(totalC_{\bar{G}}(\bar{T}^*)\big).$

Proof. Note that there are $2^{cn\log n}$ possible HC-trees spanning n leaves, for some constant c > 0.

Claim 5. For an arbitrary but fixed HC-tree T, there exists some constant C > c such that:

$\Pr\big[\, |totalC_G(T) - totalC_{\bar{G}}(T)| \ge o(totalC_{\bar{G}}(T)) \,\big] \le \exp(-Cn\log n).$

Proof. The proof is almost the same as the proof of Theorem 5.6 of [13] (the arXiv version), using a generalized version of Hoeffding's inequality. Specifically, set $Y_{ij} = \mathbf{1}_{(i,j)\in E}$ to be the indicator variable of whether (i, j) is an edge of the graph G.

Set $Z_{ij} = \big(|leaves(T[LCA(i,j)])| - 2\big) \cdot Y_{ij}$; then $totalC_G(T) = \sum_{i<j} Z_{ij}$.

Let κ(n) denote the total-cost of a clique of size n, which is $2\binom{n}{3}$. Let $P_{\min}$ denote the smallest entry in P. It is easy to verify that:

1. $\mathrm{E}[totalC_G(T)] = \sum_{i<j} \mathrm{E}[Z_{ij}] = \sum_{i<j} \big(|leaves(T[LCA(i,j)])| - 2\big)\cdot P_{ij} = totalC_{\bar{G}}(T) \ge \kappa(n)\cdot P_{\min}$.

2. $a_{ij} = 0 \le Z_{ij} \le |leaves(T[LCA(i,j)])| - 2 = b_{ij}$ for any $i < j \in [1,n]$, and $\sum_{i<j} (b_{ij} - a_{ij})^2 = \sum_{i<j} \big(|leaves(T[LCA(i,j)])| - 2\big)^2 \le \binom{n}{2}\cdot n^2$.

3. Set $\varepsilon = \frac{\sqrt{C\log n / n}}{P_{\min}} = o(1)$; then $\varepsilon \cdot totalC_{\bar{G}}(T) = o(totalC_{\bar{G}}(T))$.

It follows from (1) above that $\mathrm{E}[totalC_G(T)] = totalC_{\bar{G}}(T)$. Plugging all these terms into Hoeffding's inequality, we get

$\Pr\big[\,|totalC_G(T) - totalC_{\bar{G}}(T)| \ge \varepsilon\cdot totalC_{\bar{G}}(T)\,\big] \le \exp\left(-\frac{2\varepsilon^2 \kappa^2(n) P_{\min}^2}{\binom{n}{2}\cdot n^2}\right) = \exp(-Cn\log n),$


for a sufficiently large constant C > c, which finishes the proof of Claim 5.
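To see why the last equality gives an exponent of order $n\log n$, one can plug in the choice of $\varepsilon$ and use $\kappa(n) = 2\binom{n}{3} = \Theta(n^3)$; the following expansion is our own quick check, not part of the original argument:

\[
\frac{2\varepsilon^2 \kappa^2(n) P_{\min}^2}{\binom{n}{2}\, n^2}
= \frac{2C\log n}{n\,P_{\min}^2}\cdot \frac{\kappa^2(n)\,P_{\min}^2}{\binom{n}{2}\, n^2}
= \frac{2C\log n}{n}\cdot \Theta\!\left(\frac{n^6}{n^4}\right)
= \Theta(C\, n\log n).
\]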

As a corollary of the above claim, for C sufficiently large, by the union bound we have that w.h.p., $|totalC_G(T) - totalC_{\bar{G}}(T)| = o(totalC_{\bar{G}}(T))$ holds for all $2^{cn\log n}$ HC-trees. It then follows that w.h.p.:

$totalC_G(T^*) \ge (1 - o(1))\, totalC_{\bar{G}}(T^*) \ge (1 - o(1))\, totalC_{\bar{G}}(\bar{T}^*),$

where the second inequality holds because $\bar{T}^*$ is an optimal tree for the expectation graph $\bar{G}$ w.r.t. ratio-cost, and thus also w.r.t. the total-cost. Similarly,

$totalC_G(T^*) \le totalC_G(\bar{T}^*) \le (1 + o(1))\, totalC_{\bar{G}}(\bar{T}^*),$

which completes the proof of Proposition 3.

To bound the denominator baseC(G), we cannot use Hoeffding's inequality directly, because for a triplet, whether it forms a triangle or a wedge depends on other triplets. We instead use the following result of Janson [17].

Let $X = \sum_{\alpha\in A} Y_\alpha$ be a random variable, with A being some index set. Let Γ be the dependency graph for $\{Y_\alpha\}$; that is, each node in Γ represents a random variable $Y_\alpha$, and the two nodes corresponding to $Y_\alpha$ and $Y_\beta$ are connected by an edge if and only if $Y_\alpha$ and $Y_\beta$ are dependent. Let Δ(Γ) denote the maximum degree among nodes in Γ, and set $\Delta_1(\Gamma) = \Delta(\Gamma) + 1$.

Theorem 10 (Theorem 2.3 of [17]). Suppose that $X = \sum_{\alpha\in A} Y_\alpha$, where $Y_\alpha$ is a random variable with α ranging over some index set A. Suppose $-c \le Y_\alpha - \mathrm{E}[Y_\alpha] \le b$ for some b, c > 0 and all α ∈ A, and let $S = \sum_{\alpha\in A} \mathrm{Var}\, Y_\alpha$. Then for t ≥ 0,

$\Pr(X \ge \mathrm{E}[X] + t) \le \exp\left(-\frac{8t^2}{25\Delta_1(\Gamma)(S + bt/3)}\right),$

and

$\Pr(X \le \mathrm{E}[X] - t) \le \exp\left(-\frac{8t^2}{25\Delta_1(\Gamma)(S + ct/3)}\right).$

Proposition 4. Given an n × n probability matrix P, assume $P_{ij} = \omega(\log n / n)$ for all pairs (i, j). For a graph G sampled from P, the following holds w.h.p.:

$|baseC(G) - \mathrm{E}[baseC(G)]| = o\big(\mathrm{E}[baseC(G)]\big).$

Remark: (1) It is important to note that $\mathrm{E}[baseC(G)]$ may be different from $baseC(\bar{G})$. (2) Note that $\mathrm{E}[baseC(G)]$ can be calculated easily via linearity of expectation once P is given.

Proof. We first write the random variable baseC(G) as

$baseC(G) = \sum_{1\le i<j<k\le n} Y_{i,j,k}$, where $Y_{i,j,k} := minTriC(i,j,k)$.


$Y_{i,j,k}$ is a simple random variable. For simplicity, set α = (i, j, k). Then (1) $Y_\alpha = 2$ with probability $P_\triangle = P_{ij}\cdot P_{jk}\cdot P_{ki}$; (2) $Y_\alpha = 1$ with probability $P_\wedge = P_{ij}\cdot P_{jk}\cdot(1-P_{ki}) + P_{jk}\cdot P_{ki}\cdot(1-P_{ij}) + P_{ki}\cdot P_{ij}\cdot(1-P_{jk})$; and (3) otherwise (with probability $1 - P_\triangle - P_\wedge$), $Y_\alpha = 0$. We can easily compute the expectation and variance of $Y_\alpha$:

$\mathrm{E}[Y_\alpha] = 2P_\triangle + P_\wedge,$
$\mathrm{Var}\, Y_\alpha = 4P_\triangle + P_\wedge - (\mathrm{E}[Y_\alpha])^2 < 2\cdot\mathrm{E}[Y_\alpha].$
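As noted in the remark above, these per-triplet expectations give $\mathrm{E}[baseC(G)]$ by linearity of expectation. A brute-force O(n³) Python sketch of our own (P is any symmetric matrix of edge probabilities):

from itertools import combinations

def expected_base_cost(P):
    """E[baseC(G)] for a random graph sampled from the symmetric probability
    matrix P, via linearity: each triplet contributes 2*P_triangle + P_wedge."""
    n = len(P)
    total = 0.0
    for i, j, k in combinations(range(n), 3):
        p_tri = P[i][j] * P[j][k] * P[k][i]            # all three edges present
        p_wedge = (P[i][j] * P[j][k] * (1 - P[k][i])    # exactly two edges present
                   + P[j][k] * P[k][i] * (1 - P[i][j])
                   + P[k][i] * P[i][j] * (1 - P[j][k]))
        total += 2 * p_tri + p_wedge                    # = E[minTriC(i, j, k)]
    return total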

Also, for any triplet (i, j, k), $-2 \le Y_{i,j,k} - \mathrm{E}[Y_{i,j,k}] \le 2$, and the sum of the variances satisfies:

$S = \sum_{1\le i<j<k\le n} \mathrm{Var}\, Y_{i,j,k} < \sum_{1\le i<j<k\le n} 2\cdot\mathrm{E}[Y_{i,j,k}] = 2\cdot\mathrm{E}[baseC(G)].$

With the assumption that $P_{\min} = \omega(\log n / n)$, for any triplet (i, j, k) we have $2P_\triangle + P_\wedge > P_{ij}\cdot P_{jk} = \omega(\log^2 n / n^2)$, and

$\mathrm{E}[baseC(G)] = \sum_{1\le i<j<k\le n} \mathrm{E}[Y_{i,j,k}] = \omega(n\log^2 n).$ (18)

Two random variables $Y_\alpha$ and $Y_\beta$, with α and β being two different triplets, are dependent only when they share exactly two nodes, i.e., |α ∩ β| = 2. Hence each triplet has exactly 3(n − 3) = 3n − 9 dependent triplets (choose one of its three pairs and one of the remaining n − 3 vertices). Let Γ denote the dependency graph; then Δ(Γ) = 3n − 9 and $\Delta_1(\Gamma) = 3n - 8$.

Set $\varepsilon = c\sqrt{\frac{n\log^2 n}{\mathrm{E}[baseC(G)]}}$ for some large constant c; note that ε = o(1) due to Eqn (18).

Using Theorem 10, we have:

$\Pr\big(baseC(G) \ge (1+\varepsilon)\,\mathrm{E}[baseC(G)]\big) \le \exp\left(-\frac{8\varepsilon^2(\mathrm{E}[baseC(G)])^2}{25(3n-8)\big(2\,\mathrm{E}[baseC(G)] + 2\varepsilon\,\mathrm{E}[baseC(G)]/3\big)}\right) \le \exp\left(-C\,\frac{\varepsilon^2\cdot\mathrm{E}[baseC(G)]}{n}\right) \le n^{-C'\log n}$

for some constant C′ > 0. The other direction can be established in a symmetric manner. Proposition 4 then follows.

We now finish the proof of Theorem 8. Combining Propositions 3 and 4, we have:

$\rho^*_G = \rho_G(T^*) = \frac{totalC_G(T^*)}{baseC(G)} \le \frac{(1+o(1))\, totalC_{\bar{G}}(\bar{T}^*)}{(1-o(1))\,\mathrm{E}[baseC(G)]} \le (1+o(1))\,\frac{totalC_{\bar{G}}(\bar{T}^*)}{\mathrm{E}[baseC(G)]}.$

The lower bound can be established similarly. Theorem 8 then follows.

C.2 Proof of Corollary 1

Proof. Following the result of Theorem 8, we only need to calculate $totalC_{\bar{G}}(\bar{T}^*)$ and $\mathrm{E}[baseC(G)]$, where $\bar{T}^*$ is the optimal tree for the expectation graph $\bar{G}$ w.r.t. the total-cost function (and thus also w.r.t. the ratio-cost function). Since the expectation graph $\bar{G}$ is a complete graph where all edge weights are p, we have that

$totalC_{\bar{G}}(\bar{T}^*) = p \cdot 2\binom{n}{3}.$


On the other hand, let $Y_{i,j,k} = minTriC(i,j,k)$; then $\mathrm{E}[Y_{i,j,k}] = 2p^3 + 3p^2(1-p)$. Hence

$\mathrm{E}[baseC(G)] = \binom{n}{3}\, 2p^3 + \binom{n}{3}\, 3p^2(1-p).$

It then follows that

$\rho^*_G = (1+o(1))\,\frac{2p\binom{n}{3}}{\binom{n}{3}2p^3 + \binom{n}{3}3p^2(1-p)} = (1+o(1))\,\frac{2p}{2p^3 + 3p^2(1-p)} = (1+o(1))\,\frac{2}{3p - p^2}.$
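As a quick sanity check, the binomial factors cancel exactly, so the ratio of the two expected quantities equals 2/(3p − p²) for every n; the (1 + o(1)) factor only reflects the concentration of the random quantities. A small Python illustration of our own:

from math import comb

def erdos_renyi_ratio(n, p):
    """Ratio totalC(T*) / E[baseC(G)] for the expectation graph of G(n, p);
    the binom(n, 3) factors cancel, leaving exactly 2 / (3p - p^2)."""
    total = p * 2 * comb(n, 3)
    base = comb(n, 3) * (2 * p**3 + 3 * p**2 * (1 - p))
    return total / base

# erdos_renyi_ratio(100, 0.5)  ->  1.6 == 2 / (3*0.5 - 0.5**2)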

C.3 Proof of Corollary 2

Proof. Let $\bar{T}^*$ be the optimal tree for the expectation graph $\bar{G}$ w.r.t. the total-cost function (and thus also w.r.t. the ratio-cost function). Let $V_1 = \{v_1, \ldots, v_{n/2}\}$ and $V_2 = \{v_{n/2+1}, \ldots, v_n\}$. Call $V_1$ and $V_2$ clusters. The expectation-graph $\bar{G}$ is the complete graph where the edge weight between two nodes from the same cluster is p, while that for a crossing edge is q. Hence it is easy to show that the optimal tree $\bar{T}^*$ first splits V into $V_1$ and $V_2$ at the top-most level, and then is an arbitrary HC-tree within each cluster. Hence

$totalC_{\bar{G}}(\bar{T}^*) = 2p \cdot 2\binom{n/2}{3} + 4q \cdot \binom{n/2}{2}\cdot \frac{n}{2}.$

Indeed, to see that the tree $\bar{T}^*$ constructed above is the optimal HC-tree for the expectation graph $\bar{G}$, note that $totalC_{\bar{G}}(\bar{T}^*) = baseC(\bar{G})$.

On the other hand,

$\mathrm{E}[baseC(G)] = 2\binom{n/2}{3}\big[2p^3 + 3p^2(1-p)\big] + 2\binom{n/2}{2}\cdot\frac{n}{2}\cdot\big[2pq^2 + q^2(1-p) + 2pq(1-q)\big].$

It then follows that

$\rho^*_G = (1+o(1))\,\frac{2p + 6q}{3p^2 + 3q^2 + 6pq - p^3 - 3pq^2},$

which is part (1) of the corollary.
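The limiting fraction can likewise be checked numerically against the exact finite-n expressions; the helper below is our own illustration, not from the paper:

from math import comb

def planted_partition_ratio(n, p, q):
    """Exact ratio totalC(T*) / E[baseC(G)] for the two-block expectation
    graph above (n even); it tends to
    (2p + 6q) / (3p^2 + 3q^2 + 6pq - p^3 - 3pq^2) as n grows."""
    h = n // 2
    total = 2 * p * 2 * comb(h, 3) + 4 * q * comb(h, 2) * h
    base = (2 * comb(h, 3) * (2 * p**3 + 3 * p**2 * (1 - p))
            + 2 * comb(h, 2) * h * (2 * p * q**2 + q**2 * (1 - p) + 2 * p * q * (1 - q)))
    return total / base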

Now for part (2): let E denote the fraction in the above bound. We take partial derivatives to study how changes in p or q affect the value of E. For simplicity, let f and g denote the numerator and denominator of the fraction E, i.e., f = 2p + 6q and $g = 3p^2 + 3q^2 + 6pq - p^3 - 3pq^2$. First, for p,

$\frac{\partial E}{\partial p} = \frac{1}{g^2}\big[2g - f\cdot(6p + 6q - 3p^2 - 3q^2)\big]$
$= \frac{1}{g^2}\big(6p^2 + 6q^2 + 12pq - 2p^3 - 6pq^2 - 12p^2 - 12pq + 6p^3 + 6pq^2 - 36pq - 36q^2 + 18p^2q + 18q^3\big)$
$= \frac{1}{g^2}\big(-6p^2 - 30q^2 - 36pq + 4p^3 + 18q^3 + 18p^2q\big)$
$< 0.$

The last step follows from the facts that g > 0 and 0 < p, q < 1, and thus $p^3 < p^2$, $q^3 < q^2$, and $p^2 q < pq$. Hence as p increases, the bound on $\rho^*_G$ decreases.

Now we consider the partial derivative w.r.t. q.

$\frac{\partial E}{\partial q} = \frac{1}{g^2}\big[6g - f\cdot(6p + 6q - 6pq)\big]$
$= \frac{1}{g^2}\big(18p^2 + 18q^2 + 36pq - 6p^3 - 18pq^2 - 12pq - 12p^2 + 12p^2q - 36pq - 36q^2 + 36pq^2\big)$
$= \frac{1}{g^2}\big(6p^2 - 18q^2 - 12pq - 6p^3 + 12p^2q + 18pq^2\big).$

Consider p as a fixed constant, and calculate the zero points of the parabola (in q) $-18(1-p)q^2 - 12p(1-p)q + 6p^2(1-p)$. It is not hard to see that its zero points are −p and p/3, which completes the proof.
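Since the algebra above is easy to get wrong, a symbolic check is cheap; the following sympy sketch (our own, assuming sympy is available) re-derives the sign pattern and the zeros q = −p and q = p/3:

import sympy as sp

# Symbolic check of part (2): E = f/g with f, g as defined above.
p, q = sp.symbols('p q')
f = 2*p + 6*q
g = 3*p**2 + 3*q**2 + 6*p*q - p**3 - 3*p*q**2

# dE/dp has the sign of 2g - f*dg/dp (since g**2 > 0):
num_p = sp.expand(2*g - f*sp.diff(g, p))
print(num_p)               # 4p^3 + 18p^2 q - 6p^2 - 36pq + 18q^3 - 30q^2, negative for 0 < p, q < 1

# dE/dq has the sign of 6g - f*dg/dq:
num_q = sp.expand(6*g - f*sp.diff(g, q))
print(sp.factor(num_q))    # -6*(p - 1)*(p - 3*q)*(p + q)
print(sp.solve(num_q, q))  # roots in q: -p and p/3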


D Table for Notations

Abbreviation: Meaning (first appearance)

α_n: The best approximation factor for the sparsest cut problem. Currently, α_n = O(√log n) (Section 1).

cost_G(T): Dasgupta's cost function (Section 2.1).

LCA: Lowest common ancestor of tree nodes (Section 2.1).

triC_{T,G}: The cost of a triplet induced by T (Definition 1).

minTriC_G: The min-triplet cost, the minimum cost that can be achieved for a triplet (Definition 2).

totalC_G(T): The total-cost of tree T on input graph G, which is a shifted version of cost_G(T). It is equal to the sum of triC_{T,G} over all triplets (Definition 1).

baseC(G): The base-cost of similarity graph G, defined as the sum of minTriC_G over all triplets in G (Definition 2).

ρ_G(T): Our improved cost function (ratio-cost), defined as totalC_G(T)/baseC(G) (Definition 2).

NaeSat*: A variant of the Not-All-Equal SAT problem studied in [14] (Appendix B.1).

OPT: The total-cost of an optimal HC-tree that satisfies the given structural constraints (Appendix B.2).

OPT_A(t): The optimal HC-tree can be considered as a collection of partitions at different levels; the cost restricted to subset A at level t is denoted OPT_A(t) (Appendix B.2).

Table 1: The (abbreviated) notations, their meanings, and the sections where they are defined.