
Lim et al. BMC Bioinformatics 2019, 20(Suppl 13):381. https://doi.org/10.1186/s12859-019-2856-8

RESEARCH Open Access

PS-MCL: parallel shotgun coarsened Markov clustering of protein interaction networks

Yongsub Lim1, Injae Yu2, Dongmin Seo3, U Kang4 and Lee Sael4*

From The 8th Annual Translational Bioinformatics Conference, Seoul, South Korea. 31 October - 2 November 2018

Abstract

Background: How can we obtain fast and high-quality clusters in genome-scale bio-networks? Graph clustering is a powerful tool applied on bio-networks to solve various biological problems such as protein complex detection, disease module detection, and gene function prediction. Especially, MCL (Markov Clustering) has been spotlighted due to its superior performance on bio-networks. MCL, however, is skewed towards finding a large number of very small clusters (size 1-3) and fails to detect many larger clusters (size 10+). To resolve this fragmentation problem, MLR-MCL (Multi-level Regularized MCL) has been developed. MLR-MCL still suffers from the fragmentation and, in cases, unrealistically large clusters are generated.

Results: In this paper, we propose PS-MCL (Parallel Shotgun Coarsened MCL), a parallel graph clustering method outperforming MLR-MCL in terms of running time and cluster quality. PS-MCL adopts an efficient coarsening scheme, called SC (Shotgun Coarsening), to improve graph coarsening in MLR-MCL. SC allows merging multiple nodes at a time, which leads to improvement in quality, time, and space usage. Also, PS-MCL parallelizes the main operations used in MLR-MCL, which include matrix multiplication.

Conclusions: Experiments show that PS-MCL dramatically alleviates the fragmentation problem and outperforms MLR-MCL in quality and running time. We also show that the running time of PS-MCL is effectively reduced with parallelization.

Keywords: Graph clustering, Markov clustering, Parallel clustering, Coarsening, Non-overlapping clusters, Protein complex finding

Background

Graph clustering is one of the most fundamental problems in graph mining and arises in various fields including bio-network analysis [1, 2]. Graph clustering is extensively studied and applied in protein complex finding [3-5], disease module finding [6], and gene function prediction [7].

In general, the main task of the graph clustering problem is to divide the graph into cohesive clusters that have low interdependency: i.e., few inter-cluster edges and many intra-cluster edges. Additional domain-specific constraints can be added to the graph clustering to improve the clustering quality; however, we only focus on improving topology-based clustering, as constraints can be added easily afterward.

*Correspondence: saellee@snu.ac.kr. 4Department of Computer Science and Engineering, Seoul National University, Seoul, Korea. Full list of author information is available at the end of the article.

Among a number of clustering algorithms, MCL (Markov Clustering) [8] has received the greatest attention in bio-network analysis. Various studies have shown its superiority to other methods [3, 9-11]. However, MCL tends to result in too small clusters, which is called the fragmentation problem. Considering that many bio-network analysis related problems require cluster sizes in the range of 5-20 [12-14], fragmentation needs to be avoided.

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

To solve the problem in MCL, R-MCL (Regularized-MCL) has been developed [15], but it often generates clusters that are too large, e.g., one cluster containing most of the nodes. Satuluri et al. [9] generalize R-MCL to obtain clusters whose sizes are similar to the ones observed in real bio-networks by introducing a balancing factor. In that work, a large number of nodes belong to clusters of size 10-20 with an appropriate balancing factor; however, tiny clusters of size 1-3 also greatly increase compared with the original R-MCL. To improve the scalability of R-MCL, MLR-MCL (Multi-level R-MCL) has been developed [15]. MLR-MCL first coarsens a graph and then runs R-MCL with refinement. But its coarsening scheme, HEM (Heavy Edge Matching), is known to be inefficient for real world graphs such as protein interaction networks, which have a heavy-tailed degree distribution [16-18].

In this paper, we propose PS-MCL (Parallel Shotgun Coarsened MCL), a parallel graph clustering method for bio-networks with an efficient graph coarsening scheme and parallelization. First, we propose the SC (Shotgun Coarsening) scheme for MLR-MCL. SC allows grouping multiple nodes at a time [19]. Compared with HEM used in MLR-MCL, which is similar to a greedy algorithm for the traditional matching problem, SC coarsens a graph to have more cohesive super nodes. Moreover, a coarsened graph of a manageable size is obtained more quickly by SC than by HEM. Second, we carefully parallelize the main operations in R-MCL, which is a subroutine of MLR-MCL: i.e., the Regularize, Inflate, and Prune operations are parallelized. The latter two are column-wise operations by definition, and we parallelize them by assigning each column to a core. The former, Regularize, is a matrix multiplication. We divide matrix-matrix multiplication into a number of matrix-vector multiplications and parallelize them by distributing the vectors to multiple cores and sharing the matrix. Through experiments, we show that PS-MCL not only resolves the fragmentation problem but also outperforms MLR-MCL in quality and running time. Moreover, we show that PS-MCL gets effectively faster as more processing cores are used. PS-MCL produces clustering with the best quality, and its speed is comparable to MCL, which is a baseline method, and much faster than MLR-MCL, which is our main competitor (Table 1).

Our contributions are summarized as follows:

• Coarsening: We propose the Shotgun Coarsening (SC) scheme for MLR-MCL. SC allows merging multiple nodes to a super node at a time. Compared with the existing Heavy Edge Matching (HEM) coarsening method, SC improves both the quality and efficiency of coarsening.

• Parallelization: We carefully parallelize the proposed algorithm using multiple cores by rearranging the operations to be calculable in a column-wise manner and assigning each column-wise computation to one core.

Table 1 Comparison of our proposed PS-MCL to MLR-MCL and MCL in clustering quality and speed

                  PS-MCL (Proposed)   MLR-MCL [15]   MCL [8]
  Fragmentation   Least               Less           Most
  Ncut            Small               Moderate       Large
  Speed           Fast                Slow           Fastest

Note that PS-MCL produces clustering with the best quality, and its speed is comparable to MCL, which is a baseline method, and much faster than MLR-MCL, which is our main competitor.

• Performance: Through experiments, we show that PS-MCL prefers clusters of sizes in the range of 10 to 20 and results in less fragmentation compared to MLR-MCL. We also show that PS-MCL is effectively parallelizable. As a consequence, PS-MCL outperforms MLR-MCL in both quality and speed (Table 1).

In the rest of the paper, we explain preliminaries including MCL based algorithms, describe our proposed method PS-MCL in detail, show experimental results on various protein interaction networks, and make a conclusion.

Preliminaries

In this section, we explain existing MCL based algorithms: the original MCL, R-MCL, and MLR-MCL. Table 2 lists the symbols used in this paper.

Table 2 Table of symbols

  Symbol          Description
  G = (V, E)      Graph
  G′ = (V′, E′)   Coarsened graph
  n               Number of nodes
  m               Number of edges
  u, v, x, y      Node
  S               Super node
  N (N′)          Set of neighbors in G (G′)
  W (W′)          Edge weight map in G (G′)
  A               Adjacency matrix
  M, Mi           Flow matrix
  MG              Initial flow matrix
  r               Regularization factor
  b               Balancing factor
  C               Clustering


Markov clustering (MCL)

MCL is a flow-based graph clustering algorithm. Let G = (V, E) be a graph with n = |V| and m = |E|, and let A be the adjacency matrix of G where self-loops for all nodes are added. The (i, j)-th element Mij of the initial flow matrix M is defined as follows:

  Mij = Aij / Σ_{k=1..n} Akj

Intuitively, Mij can be understood as the transition probability, or the amount of flow, from j to i. MCL iteratively updates M until convergence, and each iteration consists of the following three steps:

• Expand: M ← M × M
• Inflate: Mij ← (Mij)^r / Σ_{k=1..n} (Mkj)^r, where r > 1
• Prune: elements whose values are below a certain threshold are set to 0; every column is normalized to sum to 1

When MCL converges, each column of M has at least one nonzero element. All nodes whose corresponding columns have a nonzero element in the same row are assigned to the same cluster. If a node has multiple nonzero elements in its column, a row is arbitrarily chosen. Although MCL is simple and intuitive, it lacks scalability due to the matrix multiplication in the expanding step, and it outputs a large number of too small clusters, e.g., 1416 clusters from a network with 4741 nodes (the fragmentation problem) [15].
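The three steps above can be sketched with dense matrices (a minimal illustration under our own naming, not the sparse implementation the paper describes later; the example graph and the pruning threshold are arbitrary):

```python
import numpy as np

def mcl_iteration(M, r=2.0, threshold=1e-4):
    """One MCL iteration on a column-stochastic flow matrix M."""
    M = M @ M                      # Expand: M <- M x M
    M = M ** r                     # Inflate: raise each entry to power r (r > 1)...
    M = M / M.sum(axis=0)          # ...and renormalize every column
    M[M < threshold] = 0.0         # Prune: zero out tiny entries
    return M / M.sum(axis=0)       # keep every column summing to 1

# Initial flow matrix from an adjacency matrix with self-loops added:
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
M = A / A.sum(axis=0)              # M_ij = A_ij / sum_k A_kj
M = mcl_iteration(M)
```

Iterating `mcl_iteration` until `M` stops changing yields the converged flow matrix from which clusters are read off.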

Regularized-MCL (R-MCL)

One reason for the fragmentation of clusters in MCL is that the adjacency structure of a given graph is used only at the beginning, which leads to diverging columns for neighboring node pairs. To resolve this fragmentation problem, R-MCL [15] regularizes the flow matrix instead of expanding it. The flow of a node is updated by minimizing the weighted sum of KL divergences between the target node and its neighbors. This minimization problem has a closed form solution, and consequently the regularizing step of R-MCL is derived as follows:

• Regularize: M = M × MG, where MG is the initial flow matrix defined as MG = AD^{-1}, A is the adjacency matrix of G with the added self-loops and the weight transformation [15], and D is the diagonal matrix from A (i.e., Dii = Σ_{k=1..n} Aik)

R-MCL finds a smaller number of clusters than MCL does. A problem of R-MCL is that it finds clusters whose sizes are spread over a wide range, while clusters in bio-networks usually are in the size range of 5-20 [12, 13]. To resolve the problem, [9] generalizes the regularization step as follows:

• mass(i) = Σ_{j=1..n} Mij
• MR = column_normalize(diag(mass)^{-b} × MG), where b is a balancing parameter
• Regularize: M = M × MR

The balancing parameter b controls the degree of balance in the cluster sizes: higher b encourages more balanced clustering. The intuition of this generalization is to penalize flows to a node currently having a large number of incoming flows. Note that b = 0 is equal to the original R-MCL.
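A dense sketch of this balanced regularization step, assuming mass is applied as a diagonal penalty as described above (function and variable names are ours, and the real implementation is sparse):

```python
import numpy as np

def balanced_regularize(M, M_G, b=1.0):
    """One balanced R-MCL regularization step (dense sketch).
    mass(i) = sum_j M_ij; nodes already attracting much incoming flow
    are penalized by mass(i)**(-b) before column normalization."""
    mass = M.sum(axis=1)                  # mass(i): total incoming flow of node i
    M_R = np.diag(mass ** (-b)) @ M_G     # penalize rows of M_G with large mass
    M_R = M_R / M_R.sum(axis=0)           # column_normalize
    return M @ M_R                        # Regularize: M <- M x M_R
```

With b = 0 the penalty disappears and the step reduces to the plain R-MCL update M × MG, matching the note above.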

Multi-level R-MCL (MLR-MCL)

MLR-MCL uses graph coarsening to further improve the quality and the running time of R-MCL [15]. Graph coarsening means merging related nodes into a super node. MLR-MCL first generates a sequence of coarsened graphs (G0, G1, ..., Gℓ), where G0 is the original graph and Gℓ is the most coarsened (smallest) graph. For i = ℓ down to 1, R-MCL is run on Gi only for a few iterations, and the computed flows on Gi are projected to Gi−1. After reaching the original graph G0, R-MCL is run until convergence. Algorithm 1 shows the overall procedure of MLR-MCL. Although the description is for b = 0, b > 0 can also be used by changing MG to MR as defined in the previous section.

The original R-MCL and MLR-MCL use HEM (Heavy Edge Matching), which, for a given node, picks an unmatched neighbor connected by the heaviest edge to coarsen the graph [15]. In HEM, the node v to which a node u is merged is determined as follows:

  v = arg max_{v′ ∈ N_unmatched(u)} W(u, v′)

where N_unmatched(u) is the set of unmatched neighbors of u, and W(u, v′) is the weight between u and v′. Note that HEM allows a node to be matched with at most one other node. MLR-MCL assigns all flows of a super node to one of its children for the flow projection; it is shown that the clustering result is invariant to the choice of the child to which all flows are assigned. For more details, refer to [15]. Note that MLR-MCL greatly reduces the overall computation of R-MCL, since the flow update is done on the coarsened graph, which is smaller than the original graph.
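For later contrast with SC, HEM's greedy matching can be sketched as follows (the graph representation, with neighbor lists and an edge weight map keyed by unordered pairs, is our own choice):

```python
def hem_matching(nodes, neighbors, weight):
    """Heavy Edge Matching (sketch): visit nodes in order; each unmatched
    node is paired with its unmatched neighbor on the heaviest edge, and
    every node joins at most one pair."""
    matched, pairs = set(), []
    for u in nodes:
        if u in matched:
            continue
        candidates = [v for v in neighbors[u] if v not in matched]
        if not candidates:
            continue  # all neighbors already matched: u stays single
        v = max(candidates, key=lambda x: weight[frozenset((u, x))])
        pairs.append((u, v))
        matched.update((u, v))
    return pairs
```

On a path 1-2-3 where edge (1,2) is heavier than (2,3), node 1 claims node 2, leaving node 3 unmatched; this is exactly the "at most one partner" restriction that SC removes.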

Limitation of HEM: HEM of MLR-MCL has two main limitations. First, the strategy of HEM that merges at most two single nodes can lead to undesirable coarsening where super nodes are not cohesive enough (see "Cohesive super node" section for details). Second, HEM is known to be unsuitable for real-world graphs [19] due to the skewed degree distribution of such graphs, which prevents the graph size from being effectively reduced (see "Quickly reduced graph" section for details). These shortcomings of HEM make MLR-MCL inefficient for real-world


graphs. To overcome this, in the next section we propose PS-MCL, which allows multiple nodes to be merged at a time.

Algorithm 1 MLR-MCL
Input: Graph G, coarsening depth ℓ, and regularization factor r
Output: Clustering C
 1: G0 = G
 2: Obtain G0, ..., Gℓ, a sequence of coarsened graphs
 3: Let Mℓ = MGℓ be the initial flow matrix of Gℓ
 4: foreach i = ℓ down to 1 do
 5:   foreach j = 1 to 4 (or 5) do
 6:     Mi = Mi × MGi  (Regularize)
 7:     Mi = Inflate(Mi, r)
 8:     Mi = Prune(Mi)
 9:   end
10:   Mi−1 = FlowProjection(Mi)
11:   Let MGi−1 be the initial flow matrix of Gi−1
12: end
13: Update M0 by applying lines 5-9, but until convergence, not for a few iterations
14: Determine clustering C from M0

Implementation

In this section, we describe our proposed method PS-MCL (Parallel Shotgun Coarsened MCL), which improves MLR-MCL in two respects: 1. increasing the efficiency of the graph coarsening, and 2. parallelizing the operations of R-MCL.

Shotgun coarsening (SC)

As described previously, HEM is ineffective on real world graphs. To overcome the limitation of HEM, we propose to use a graph coarsening scheme which allows merging multiple nodes at a time. We call this scheme Shotgun Coarsening (SC) because it aggregates satellite nodes to the center one. Algorithm 2 describes the proposed SC coarsening method, where N(u) denotes the set of neighbors of u in G = (V, E), and connected_components(V, F) outputs a set of connected components, each of which is a set of nodes of the graph (V, F).

Our SC algorithm consists of three steps: 1) identify a set F of edges whose end nodes will be merged (lines 1-6); 2) determine a set V′ of super nodes of a coarsened graph and the weights associated with them (lines 7-12); and 3) determine a set E′ of edges between super nodes and their weights (lines 13-20). Let G = (V, E) be an input graph to be coarsened, Z: V → ℕ be a node weight map for G, and W: E → ℕ be an edge weight map for G. In the first step, we visit every node of G in an arbitrary order (line 2), and

Algorithm 2 Shotgun Coarsening
Input: Graph G = (V, E), a node weight map Z: V → ℕ for G, an edge weight map W: E → ℕ for G
Output: Coarsened graph G′ = (V′, E′), a node weight map Z′: V′ → ℕ for G′, an edge weight map W′: E′ → ℕ for G′
 1: F ← ∅; E′ ← ∅
 2: foreach u ∈ V do
 3:   N1(u) ← {v ∈ N(u) : W(u, v) = max_{x∈N(u)} W(u, x)}
 4:   N2(u) ← {v ∈ N1(u) : Z(v) = min_{x∈N1(u)} Z(x)}
 5:   F ← F ∪ {(u, v)} for arbitrarily picked v ∈ N2(u)
 6: end
 7: V′ ← connected_components(V, F)
 8: foreach S ∈ V′ do
 9:   Z′(S) ← Σ_{u∈S} Z(u)
10:   E′ ← E′ ∪ {(S, S)}
11:   W′(S, S) ← Σ_{e∈E∩(S×S)} W(e)
12: end
13: // determine non-self edges between super nodes
14: foreach (S, T) ∈ V′ × V′ such that S ≠ T do
15:   H = {e ∈ E : e ∈ S × T or e ∈ T × S}
16:   if |H| > 0 then
17:     E′ ← E′ ∪ {(S, T)}
18:     W′(S, T) ← Σ_{e∈H} W(e)
19:   end
20: end

for each node u ∈ V visited, we find the best match node v. Precisely, the algorithm finds the neighboring node of u with the highest edge weight to u (line 3), i.e.,

  v = arg max_{x∈N(u)} W(u, x)

There may be multiple neighbors with the same highest weight. Let N1(u) be the set of those neighbors; then, in this case, the one with the smallest node weight is chosen among them (line 4), i.e.,

  v = arg min_{x∈N1(u)} Z(x)

This strategy of preferring a smaller node weight at the same edge weight prevents the emergence of an over-coarsened graph containing an excessively massive super node. Note that if every node in an initial graph has weight 1, the weight of a super node in a coarsened graph is equal to the number of nodes merged to create that super node. If there are multiple neighbors with the same highest edge weight and the smallest node weight, any v is arbitrarily chosen among the ties. Edge (u, v) is added to F (line 5).


The second step determines super nodes and the weights associated with them. Note that for (u, v) ∈ F, u and v should belong to the same super node by definition. By mathematical induction, two nodes belong to the same super node if and only if they are reachable along edges in F. As a result, we can identify the set V′ of super nodes by computing the connected components of the graph (V, F) (line 7).

After finding V′, we determine the weights of super nodes in V′ and their self-loops as follows. For each super node S ∈ V′, its node weight Z′(S) is defined by the sum of weights of nodes in V that belong to S (line 9). The self-edge (S, S) is added to E′ (line 10), and its weight W′(S, S) is defined by the sum of weights of edges in E whose end nodes belong to S (line 11).

The last step determines non-self edges between nodes in V′ and their edge weights as follows. For each unordered pair (S, T) ∈ V′ × V′, find the set H of edges in E for which one end node is in S and the other is in T (line 15). If H ≠ ∅ (line 16), (S, T) is added to E′ (line 17), and W′(S, T) is defined by the sum of weights of edges in H (line 18). Otherwise, there is no edge between S and T.
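The three steps of SC can be sketched end to end; this is a simplified reimplementation under a data layout of our own (neighbor lists, edge weights keyed by unordered pairs, union-find for the connected components):

```python
from collections import defaultdict

def shotgun_coarsen(V, N, W, Z):
    """Shotgun Coarsening (sketch). V: node list; N[u]: neighbors of u;
    W[frozenset((u, v))]: edge weight; Z[u]: node weight. Returns the
    super nodes (as member lists) with coarsened node and edge weights."""
    # Step 1: every node picks one merge partner: heaviest incident edge,
    # ties broken by the smallest node weight (first such neighbor).
    F = []
    for u in V:
        if not N[u]:
            continue
        w_max = max(W[frozenset((u, x))] for x in N[u])
        N1 = [x for x in N[u] if W[frozenset((u, x))] == w_max]
        z_min = min(Z[x] for x in N1)
        v = next(x for x in N1 if Z[x] == z_min)
        F.append((u, v))
    # Step 2: super nodes are the connected components of (V, F).
    parent = {u: u for u in V}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    for u, v in F:
        parent[find(u)] = find(v)
    members = defaultdict(list)
    for u in V:
        members[find(u)].append(u)
    super_of = {u: s for s, ms in members.items() for u in ms}
    Z2 = {s: sum(Z[u] for u in ms) for s, ms in members.items()}
    # Step 3: sum original edge weights between (and inside) super nodes;
    # a weight keyed by a single super node is its self-loop weight.
    W2 = defaultdict(int)
    for e, w in W.items():
        u, v = tuple(e)
        W2[frozenset((super_of[u], super_of[v]))] += w
    return dict(members), Z2, dict(W2)
```

Because every node nominates a partner and components of the nomination edges are merged, whole star-like groups collapse in a single pass, which is exactly what distinguishes SC from HEM's one-partner matching.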

Skip rate: In practice, a graph can be reduced too quickly by SC if it has super-hub nodes. To coarsen the graph to a reasonable size, we propose to randomly skip merging while iterating over nodes in SC: i.e., with probability 0 ≤ p < 1, lines 3-5 are not executed. We call p a skip rate and use p = 0.5 in this paper.

Cohesive super node

The goal of coarsening is to merge tightly connected nodes into one super node. In this respect, HEM may prevent a super node from being cohesive. Figure 1a shows the ideal coarsening for a given graph. Let us assume that for the first merging, the leftmost two nodes are merged as shown in Fig. 1b, and the next node to be merged is u. If we use HEM, v is chosen since it is the only candidate, leading to Fig. 1c. Note that although u has more edges to the green super node than to v, it is merged with v. Obviously, the result is undesirable for good coarsening. In contrast, SC (Fig. 1d) chooses the green super node for u, since the weight to the green node is larger than that to v. As a result, SC generates more cohesive super nodes than those by HEM, leading to a high quality coarsened graph.

Quickly reduced graph

Ideally, at each step of the coarsening, the number of nodes should halve, but that does not happen for real world graphs due to their highly skewed degree distributions [19]. In other words, a large number of coarsening steps are needed to obtain a coarsened graph of a manageable size, leading to large memory spaces for storing

Fig. 1 Effectiveness of our proposed SC method compared with HEM. a Ideal coarsening for the graph. b Coarsening in progress: for the first merging, the leftmost two nodes are chosen. c For the second node to be merged, u is chosen. Since v is the only candidate for merging in HEM, u and v are merged to a super node. d In SC, the green super node is also a candidate for u. Since the weight of u to the green node is larger than that to v, u is merged to the green super node.

graphs themselves and node merging information. This problem arises mainly due to star-like structures, as depicted in Fig. 2a. The red and yellow nodes are eventually merged with the blue and the green groups, respectively, but this needs 5 more coarsening steps because only two nodes can be merged at a time. Note that for each additional coarsening step, we need space to store one graph and a mapping from nodes to super nodes; if the graph size is not effectively reduced, the amount of required space greatly increases with the coarsening depth. This inefficiency can be resolved by SC, as shown in Fig. 2b. In contrast to HEM, which requires 5 more coarsening steps, only one step is enough in SC.

Fig. 2 Efficiency of our proposed coarsening method SC compared with HEM. a Result of one coarsening step by HEM. The red and yellow nodes are eventually merged to the blue and green nodes, respectively, but it takes 5 more coarsening steps. b Result of one coarsening step by SC. The result is the same as (a) after 5 more coarsening steps.


Parallelization

We also improve MLR-MCL via multi-core parallelization of its three main operations: Regularize, Inflate, and Prune. For Regularize, we parallelize the computation by assigning columns of the resulting matrix to cores. In other words, for M3 = M1 × M2, we divide the computation as follows:

  M3(:, i) = M1 × M2(:, i)

where Mk(:, i) denotes the i-th column of Mk. Computing the i-th column of M3 is independent of computing the other columns j ≠ i of M3, and thus we distribute the columns of M2 to multiple cores while keeping M1 in shared memory. Inflate and Prune are themselves column-wise operations; thus, the computation on each column is assigned to a core.
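The column-wise split can be illustrated with a thread pool (a dense-matrix illustration of the decomposition only; the paper's implementation operates on sparse CSC matrices):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_matmul(M1, M2, n_threads=4):
    """Compute M3 = M1 x M2 by assigning columns to threads:
    each thread evaluates M3(:, i) = M1 x M2(:, i), with M1 shared
    read-only across all threads."""
    M3 = np.empty((M1.shape[0], M2.shape[1]))
    def column(i):
        M3[:, i] = M1 @ M2[:, i]   # independent of every other column
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(column, range(M2.shape[1])))
    return M3
```

Each column write touches a disjoint slice of M3, so no locking is needed; this is the property the paper exploits for Regularize, Inflate, and Prune alike.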

For efficiency in memory usage, we use the CSC (Compressed Sparse Column) format [20] to represent a matrix, which requires much less memory when storing a sparse matrix compared to a two-dimensional array format. In essence, the CSC format stores only the nonzero values of a matrix. Note that this strategy is especially efficient for real world graphs, which are very sparse in general, e.g., |E| = O(|V|). Figure 3 shows the CSC format for an example matrix. To access the nonzero elements of the j-th column (1-based indexing), we 1) obtain a = colPtr[j] and b = colPtr[j + 1] − 1, and 2) for a ≤ i ≤ b, get val[i] = A(rowInd[i], j). For example, to access the first column, we first obtain a = 1 and b = 2. By checking val[i] and rowInd[i] for 1 ≤ i ≤ 2, we identify the two nonzero values 10 and 9 at the first and the fourth rows, respectively, in the first column: i.e., A(1, 1) = 10 and A(4, 1) = 9, since val[1] = 10 with rowInd[1] = 1 and val[2] = 9 with rowInd[2] = 4.
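The access scheme just described, in code (Python lists are 0-based, so the paper's 1-based indices are shifted when reading the arrays; the concrete values are the ones from the text's example):

```python
def csc_column(val, rowInd, colPtr, j):
    """Nonzeros of column j (1-based) of a CSC matrix, as (row, value)
    pairs: a = colPtr[j], b = colPtr[j+1] - 1, then scan val/rowInd."""
    a = colPtr[j - 1]              # colPtr[j], shifted to 0-based storage
    b = colPtr[j] - 1              # colPtr[j+1] - 1
    return [(rowInd[i - 1], val[i - 1]) for i in range(a, b + 1)]

# The example from the text: A(1,1) = 10 and A(4,1) = 9 in column 1.
first_col = csc_column(val=[10, 9], rowInd=[1, 4], colPtr=[1, 3], j=1)
# -> [(1, 10), (4, 9)]
```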

Fig. 3 CSC format for sparse matrix representation. The top figure is a given matrix A, and the bottom one is the corresponding CSC representation of A. As the matrix gets larger and sparser, CSC becomes much more efficient, since the space complexity of CSC is proportional to the number of nonzero elements, not to the matrix size.

Algorithm 3 One Iteration of Parallelized R-MCL
Input: Current flow matrix M, initial flow matrix MG, and the number T of threads
Output: Next flow matrix N
 1: Ω ← ∅
 2: Parallel for j ← 1 to n, each of which is assigned to the (j mod T)-th thread, do
 3:   c = ∅
 4:   foreach (i, x) ∈ nonzeros(MG, j) do
 5:     foreach (k, y) ∈ nonzeros(M, i) do
 6:       if (k, z) ∈ c then c ← c ∪ {(k, z + xy)} \ {(k, z)}
 7:       else c ← c ∪ {(k, xy)}
 8:     end
 9:   end
10:   c ← Prune(Inflate(c))
11:   Ω ← Ω ∪ {(j, c)}
12: end
13: colPtr ← 1_{n+1}
14: foreach (j, c) ∈ Ω do colPtr(j + 1) ← |c|
15: foreach i ← 2 to n + 1 do colPtr(i) ← colPtr(i) + colPtr(i − 1)
16: val ← 0_{colPtr(n+1)−1}; rowInd ← 0_{colPtr(n+1)−1}
17: Parallel for (j, c) ∈ Ω, each of which is assigned to the (j mod T)-th thread, do
18:   ℓ = colPtr(j)
19:   foreach (k, z) ∈ c in increasing order of k do
20:     val(ℓ) ← z
21:     rowInd(ℓ) ← k
22:     ℓ ← ℓ + 1
23:   end
24: end
25: N ← CSC(val, rowInd, colPtr)

Algorithm 3 shows our implementation of one iteration of parallelized R-MCL with the CSC format. In the algorithm, nonzeros(M, j) is the set of pairs (i, x) indicating nonzeros in the j-th column of the matrix M, i.e., M(i, j) = x; 0n and 1n denote n-dimensional vectors all of whose elements are 0 and 1, respectively; Ω accumulates the computed columns. Lines 4 to 9 correspond to Regularize. Each thread, running in a dedicated core, performs c = M × MG(:, j) for each column j assigned to it, and as a result (j, c) is added to Ω after applying Inflate and Prune to c. Note that although we do not describe Inflate and Prune in line 10, their implementation is trivial for each column c.
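The per-column accumulation in lines 4 to 9 amounts to a sparse column product; a dictionary-based sketch (the data layout, columns as lists of (row, value) pairs, is our own simplification):

```python
def regularize_column(M_cols, MG_col):
    """c = M x MG(:, j) in sparse form. M_cols[i]: (row, value) nonzeros
    of column i of M; MG_col: (row, value) nonzeros of column j of MG.
    Returns (k, z) pairs in increasing order of k."""
    c = {}
    for i, x in MG_col:            # (i, x) in nonzeros(MG, j)
        for k, y in M_cols[i]:     # (k, y) in nonzeros(M, i)
            c[k] = c.get(k, 0.0) + x * y
    return sorted(c.items())
```

The dictionary plays the role of the set c in the algorithm: an existing entry (k, z) is replaced by (k, z + xy), and a missing one is created as (k, xy).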

Lines 13 to 16 correspond to constructing colPtr and allocating space for val and rowInd for the resulting matrix N, which is done sequentially. Afterwards, lines 17 to 24 correspond to filling val and rowInd using the computed columns and colPtr, in parallel over the columns. Note that the positions of val and rowInd to be updated for each column are specified by colPtr, and they do not overlap.

Results

We present experimental results to answer the following questions:

Q1. How does PS-MCL improve the distribution of cluster sizes compared with MLR-MCL?
Q2. What is the performance of PS-MCL compared with MLR-MCL in quality and running time?
Q3. How much speedup do we obtain by parallelizing PS-MCL?
Q4. How accurate are clusters found by PS-MCL compared to the ground-truth?

Table 3 lists the datasets used in our experiments. We use various bio-networks for evaluating the clustering quality and the running time; the largest dataset, DBLP, is used for scalability experiments.

Experimental settings

Machine: All experiments are conducted on a workstation with two Intel(R) Xeon(R) E5-2630 v4 2.20GHz CPUs and 250GB of memory.

Evaluation criteria: To evaluate the quality of a clustering C for Q1 and Q2, we use the average NCut [15], defined as follows:

  AverageNCut(C) = (1/|C|) Σ_{c∈C} NCut(c)

where

  NCut(c) = Σ_{u∈c, v∉c} A(u, v) / Σ_{u∈c} degree(u)
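The average NCut as defined above, sketched for an edge-list graph (the representation, edge weights with both orderings stored plus a precomputed weighted degree map, is our own choice):

```python
def average_ncut(clusters, edges, degree):
    """Average NCut of a clustering. clusters: list of node sets;
    edges: {(u, v): weight} with both orderings stored; degree[u]:
    weighted degree of u."""
    def ncut(c):
        cut = sum(w for (u, v), w in edges.items() if u in c and v not in c)
        vol = sum(degree[u] for u in c)
        return cut / vol
    return sum(ncut(c) for c in clusters) / len(clusters)
```

A low average NCut means few inter-cluster edges relative to each cluster's total degree, matching the informal goal stated in the Background section.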

For answering Q4, we focus on the protein complex finding problem and use the accuracy measure defined by Hernandez et al. [4] as follows. Let G be the set of ground truth clusters (protein complexes); then the degree of overlap Tgc for every g ∈ G and c ∈ C is defined as

  Tgc = |g ∩ c|

The accuracy ACC is the geometric mean of the Sensitivity SST and the Positive Predictive Value PPV:

  SST = Σ_{g∈G} max_{c∈C} Tgc / Σ_{g∈G} |g|

  PPV = Σ_{c∈C} max_{g∈G} Tgc / Σ_{c∈C} |c|

  ACC = (SST × PPV)^{1/2}
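These three quantities are straightforward to compute from cluster memberships; a sketch with clusters given as sets of nodes (function name is ours):

```python
def clustering_accuracy(ground_truth, clusters):
    """ACC = sqrt(SST * PPV), with T_gc = |g ∩ c| for each ground-truth
    complex g and computed cluster c (both given as sets of nodes)."""
    sst = sum(max(len(g & c) for c in clusters) for g in ground_truth) \
        / sum(len(g) for g in ground_truth)
    ppv = sum(max(len(g & c) for g in ground_truth) for c in clusters) \
        / sum(len(c) for c in clusters)
    return (sst * ppv) ** 0.5
```

A clustering identical to the ground truth scores 1.0; merging everything into one cluster keeps SST at 1 but drives PPV, and hence ACC, down.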

Parameter: We use a coarsening depth of 3 for PS-MCL and MLR-MCL, with which the improvement in quality and speed is large while the number of resulting clusters remains reasonable.

Performance of SC

In this section, we answer Q1 and Q2. Figure 4 shows a comparison of PS-MCL with MLR-MCL and MCL. The horizontal axis denotes the cluster size, and the vertical axis denotes the number of nodes belonging to clusters of a specific size.

Before going into details, we briefly summarize the results of MCL, which are invariant within each column of Fig. 4 because of the lack of the balancing concept in MCL. As discussed in [15], for all cases MCL suffers from the fragmentation problem: a large portion of nodes belongs to very tiny clusters of size 1-3. We provide the results of MCL as a baseline in the figure, and the following analysis focuses on comparing MLR-MCL and PS-MCL.

For all of our datasets, we observe the same patterns described in the following Observations 1-3, though we present the four representative results in Fig. 4.

Table 3 Description of the datasets used in our experiments

  Name                Nodes    Edges     Density (10^-4)   Description
  BioPlex [23]        10961    56553     9.42              PPI network of Homo sapiens in the BioPlex 2.0 Network
  DIP-droso [21]      22595    69148     2.71              PPI network of Drosophila melanogaster
  Drosophila [24]     6600     19820     9.1               PPI network of Drosophila melanogaster
  MINT [25]           3872     56937     75.97             Combined PPI network of 325 different organisms
  Yeast-1 [26]        2353     36790     132.95            Genome-scale epistasis map of Schizosaccharomyces pombe
  Yeast-2 [27]        2428     4606      15.63             PPI network of Saccharomyces cerevisiae
  Yeast-3 [28]        3886     57782     76.55             PPI network of Saccharomyces cerevisiae
  Yeast-4 [29]        2223     7049      28.54             PPI network of Saccharomyces cerevisiae
  DIP-yeast [21]      4929     17786     14.64             PPI network of Saccharomyces cerevisiae
  BioGRID-homo [22]   20837    288224    13.28             PPI network of Homo sapiens
  DBLP                317080   1049866   0.21              Coauthorship network

Lim et al. BMC Bioinformatics 2019, 20(Suppl 13):381. Page 8 of 12

Fig. 4 The number of nodes belonging to clusters of specific sizes. With balancing factor b = 0, i.e., without considering balance, PS-MCL and MLR-MCL do not find clusters properly: in general they output too large clusters, and even group all the nodes into one cluster for MINT. Using a balancing factor b > 0, both result in clusters whose sizes are concentrated around a small value; that value is larger in PS-MCL than in MLR-MCL. Observe that MLR-MCL makes a significantly large number of tiny clusters, including singletons, which are meaningless in clustering. In contrast, our proposed PS-MCL greatly reduces that number to less than 5% of the number by MLR-MCL. For all cases MCL suffers from the cluster fragmentation problem. For the other datasets in Table 3, we observe the same patterns described here.

Observation 1 (Too massive cluster without balancing). Without balancing, i.e., b = 0, a large number of nodes are assigned to one cluster; often the entire graph becomes one cluster. A balancing factor of 1.0 to 1.5 resulted in a reasonable cluster size distribution for most of the networks.

The first row of Fig. 4 corresponds to the result without balancing, i.e., b = 0. In this case both PS-MCL and MLR-MCL group too many nodes into one cluster; especially on MINT, both output only one cluster containing all nodes in the graph. Figure 5 shows the ratio of the largest cluster size over the number of nodes for the bio-networks listed in Table 3. Note that the largest cluster sizes for BioPlex, Drosophila, MINT, Yeast-1, Yeast-2, and Yeast-3 are nearly the same as the total number of nodes; those for DIP-droso and Yeast-4 are relatively smaller, but still occupy a large proportion of the entire size.

Observation 2 (PS-MCL preferring larger clusters than MLR-MCL). With b > 0, the cluster size with the maximum number of total nodes in PS-MCL is larger than that in MLR-MCL.


Fig. 5 Ratio of the largest cluster size over the number of nodes. Without balancing, i.e., b = 0, PS-MCL and MLR-MCL group too many nodes into one cluster.

The second, third, and fourth rows of Fig. 4 show the results of varying the balancing factor b ∈ {1, 1.5, 2}, respectively. In contrast to the case of b = 0, the cluster sizes of PS-MCL and MLR-MCL are concentrated on certain sizes. The mode of cluster size in PS-MCL is larger than that in MLR-MCL: the modes in PS-MCL are 10–20 for DIP-droso, BioPlex, and Drosophila, and 20–50 for MINT, while those in MLR-MCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.

Observation 3 (PS-MCL with less fragmentation than MLR-MCL). With b > 0, PS-MCL results in a significantly smaller number of fragmented clusters, whose sizes are 1–3, compared with MLR-MCL.

PS-MCL achieves concentrated cluster sizes while also avoiding fragmented clusters; MLR-MCL and MCL still suffer from the fragmentation. The number of nodes belonging to very small clusters in PS-MCL is much smaller than that in MLR-MCL. For instance, the number of nodes belonging to clusters of size 1–3 in PS-MCL is less than 5% of that in MLR-MCL for the DIP network with b = 1.5.

Observation 4 (PS-MCL better than MLR-MCL in time and NCut). PS-MCL results in a faster running time with a smaller NCut than MLR-MCL does.

Figure 6 shows the plot of running time versus the average NCut. Compared with MLR-MCL, PS-MCL runs faster (running time down to 21%) and outputs clustering with a smaller average NCut (down to 87%) on average. For some cases MCL is faster than PS-MCL, but for all cases its average NCut is worse than that by PS-MCL.

Performance of parallelization

In this section we answer Q3. We use b = 1.5 for PS-MCL and MLR-MCL. Figure 7 shows the performance evaluation results for PS-MCL on the bio-networks in Table 3 with

Fig. 6 Time vs. average NCut for all datasets with b = 1.5 for PS-MCL, MLR-MCL, and MCL. We use 4 cores for PS-MCL. Compared with MLR-MCL, PS-MCL reduced the running time down to 21% and the average NCut down to 87% on average over all datasets. Although MCL is faster than PS-MCL for some networks, its average NCut is higher than that by PS-MCL for all networks.

increasing numbers of cores. For all cases, PS-MCL gets faster as the number of cores increases.

To test the scalability more effectively, we use DBLP, the largest of our datasets, though it is not a bio-network. Figure 8a shows the speed-up of PS-MCL while increasing the number of cores, compared with MLR-MCL and MCL. We use points, not lines, for MLR-MCL and MCL since they are single-core algorithms. PS-MCL outperforms MLR-MCL regardless of the number of cores and becomes effectively faster as the number of cores increases: precisely, the running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core.

Figure 8b shows the running time of PS-MCL while increasing data sizes, compared with MLR-MCL and MCL. Here we use 4 cores for PS-MCL. To obtain various sizes of graphs, we first take principal submatrices

Fig. 7 Running time of PS-MCL for the bio-networks in Table 3 while increasing the number of cores. Note that the running time is effectively reduced as the number of cores increases.



Fig. 8 Running time of PS-MCL, MLR-MCL, and MCL. (a) Running time on DBLP while varying the number of cores. The running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core; furthermore, the single-core performance of PS-MCL outperforms MLR-MCL. (b) Running time on DBLP while varying data sizes. The running times of all methods are linear in the number of edges in the graphs, and PS-MCL outperforms MLR-MCL for all cases.

from the adjacency matrix of DBLP with sizes 20%, 40%, 60%, 80%, and 100% of the total, and use their giant connected components. As shown in the figure, the running times of all the methods are linear in the graph sizes, and PS-MCL outperforms MLR-MCL at all scales. Note that although MCL is slightly faster than PS-MCL, MCL has the fragmentation problem and a worse NCut, while PS-MCL has no fragmentation problem and a better NCut, as shown in Figs. 4 and 6.
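The data-scaling setup above can be sketched as follows, assuming graphs are stored as adjacency dicts of neighbor sets; `principal_subgraph` and `giant_component` are illustrative helpers, not the paper's code.

```python
from collections import deque

def principal_subgraph(adj, fraction):
    """Keep only edges among the first `fraction` of nodes (a principal submatrix)."""
    nodes = sorted(adj)[: max(1, int(len(adj) * fraction))]
    keep = set(nodes)
    return {u: {v for v in adj[u] if v in keep} for u in nodes}

def giant_component(adj):
    """Return the subgraph induced by the largest connected component (BFS)."""
    seen, best = set(), set()
    for s in adj:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {u: adj[u] & best for u in best}
```

Chaining the two functions yields the progressively larger benchmark graphs used in the scaling experiment.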

Protein complex identification

In this section we use the two bio-networks DIP-yeast [21] and BioGRID-homo [22], described in Table 3, to answer Q4 on the protein complex finding problem. The ground-truth protein complex information is extracted from CYC2008 2.0 [13] for DIP-yeast and from CORUM [14] for BioGRID-homo. These complexes are used as reference clusters for measuring the accuracy.

Figure 9 shows the performance of PS-MCL while varying the skip rate, in comparison with MLR-MCL and MCL. (Note: the skip rate is not applicable to MLR-MCL and MCL, leading to one accuracy value each; for clear performance comparison, we represent that value by a horizontal dashed line along the x-axis.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; namely, the smaller p, the more aggressive the coarsening. In the figure, PS-MCL performs best with moderate values of p: 0.6 and 0.7 for DIP-yeast and BioGRID-homo, respectively. For both networks, PS-MCL consistently outperforms MLR-MCL with 0.5 ≤ p ≤ 0.7; the accuracy of PS-MCL is higher than that of MLR-MCL by up to 36.6% for DIP-yeast and 82.4% for BioGRID-homo. This result makes sense because SC with too large p hardly reduces a graph in


Fig. 9 Accuracy of PS-MCL with varying skip rates, MLR-MCL, and MCL. The performance of PS-MCL varies depending on p, but is better than that of MLR-MCL in the range 0.5 ≤ p ≤ 0.7. Compared to MCL, PS-MCL is slightly less accurate, but we observe that this is due to many dimer structures; excluding dimers, PS-MCL greatly outperforms MCL (panel c). (a) DIP-yeast. (b) BioGRID-homo. (c) BioGRID-homo excluding dimers.

size, while too small p leads to too large clusters due to aggressive coarsening.

PS-MCL achieves up to 33.2% higher accuracy than MCL for DIP-yeast, but only 98.7% of the MCL accuracy for BioGRID-homo. The latter is due to the many dimer structures present in the CORUM database; excluding dimers from the database, PS-MCL greatly outperforms MCL, as shown in Fig. 9c. Although PS-MCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig. 4) and performs poorly in the internal evaluation by average NCut (Fig. 6), which assesses the potential of finding well-formed but undiscovered clusters.

Conclusion

In this paper we propose PS-MCL, a parallel graph clustering method which gives superior performance on bio-networks. PS-MCL includes two enhancements over previous methods. First, PS-MCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLR-MCL on real-world networks. SC allows merging multiple nodes at a time, reducing the graph size more quickly and making super nodes more cohesive than HEM used in MLR-MCL. Second, PS-MCL gives a multi-core parallel clustering algorithm to increase scalability. Extensive


experiments show that PS-MCL results in clusters that generally have larger sizes than those by MLR-MCL and also greatly alleviates the fragmentation problem. Moreover, PS-MCL finds clusters whose quality is better than those by MLR-MCL in both internal (average NCut) and external (reference clusters) criteria. Also, as more cores are used, PS-MCL gets faster, and it outperforms MLR-MCL in speed even with a single core.

PS-MCL's capability to quickly find mid-size clusters in large-scale bio-networks has a wide range of applicability in systems biology. Although we have only shown that PS-MCL effectively finds mid-size protein complexes on two protein-protein interaction networks compared to existing topology-based clustering algorithms, we believe that it can be effectively applied to function prediction, disease module detection, and other systems biology analyses.

Availability and requirements

Project name: PS-MCL
Project home page: https://github.com/leesael/PS-MCL
Operating system(s): Platform independent (tested on Ubuntu)
Programming language: Java
Other requirements: Java 1.8 or higher
License: BSD
Any restrictions to use by non-academics: licence needed

Abbreviations

HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsening MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements

Not applicable.

Funding

Publication of this article has been funded by the National Research Foundation of Korea grant funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials

The datasets analysed during the current study are available in the GitHub repository https://github.com/leesael/PS-MCL.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13.

Authors' contributions

YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All the authors read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details

1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published 24 July 2019

References

1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.
2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.
3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.
4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.
5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.
6. Tadaka S, Kinoshita K. NCMine: core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.
7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Yun Xie ZD, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:16199.
8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.
9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.
10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.
11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.
12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.
13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.
14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2010;38(suppl_1):497–501.
15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.
16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.
17. Lim Y, Kang U, Faloutsos C. SlashBurn: graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.
18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.
19. Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society; 2006. p. 124.
20. Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989;15(1):1–14.
21. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.
22. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.
23. Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017;545(7655):505–9.
24. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–36.
25. Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35(suppl 1):572–4.
26. Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012;46(5):691–704.
27. Chen J, Hsu W, Lee ML, Ng S-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006;22(16):1998–2004.
28. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
29. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res. 2003;31(9):2443–50.



be avoided. To solve this problem in MCL, R-MCL (Regularized-MCL) has been developed [15], but it often generates clusters that are too large, e.g., one cluster containing most of the nodes. Satuluri et al. [9] generalize R-MCL to obtain clusters whose sizes are similar to the ones observed in real bio-networks by introducing a balancing factor. In their work, a large number of nodes belong to clusters of size 10–20 with an appropriate balancing factor; however, tiny clusters of size 1–3 also greatly increase compared with the original R-MCL. To improve the scalability of R-MCL, MLR-MCL (Multi-level R-MCL) has been developed [15]. MLR-MCL first coarsens a graph and then runs R-MCL with refinement, but its coarsening scheme, HEM (Heavy Edge Matching), is known to be inefficient for real-world graphs such as protein interaction networks, which have a heavy-tailed degree distribution [16–18].

In this paper we propose PS-MCL (Parallel Shotgun Coarsened MCL), a parallel graph clustering method for bio-networks with an efficient graph coarsening scheme and parallelization. First, we propose the SC (Shotgun Coarsening) scheme for MLR-MCL. SC allows grouping multiple nodes at a time [19]. Compared with HEM used in MLR-MCL, which is similar to a greedy algorithm for the traditional matching problem, SC coarsens a graph to have more cohesive super nodes; moreover, a coarsened graph of a manageable size is obtained more quickly by SC than by HEM. Second, we carefully parallelize the main operations in R-MCL, which is a subroutine of MLR-MCL: i.e., the Regularize, Inflate, and Prune operations are parallelized. The latter two are column-wise operations by definition, and we parallelize them by assigning each column to a core. The former, Regularize, is a matrix multiplication: we divide the matrix-matrix multiplication into a number of matrix-vector multiplications and parallelize them by distributing the vectors to multiple cores while sharing the matrix. Through experiments we show that PS-MCL not only resolves the fragmentation problem but also outperforms MLR-MCL in quality and running time. Moreover, we show that PS-MCL gets effectively faster as more processing cores are used. PS-MCL produces clustering with the best quality, and its speed is comparable to MCL, which is a baseline method, and much faster than MLR-MCL, which is our main competitor (Table 1).

Our contributions are summarized as follows

• Coarsening: We propose the Shotgun Coarsening (SC) scheme for MLR-MCL. SC allows merging multiple nodes into a super node at a time. Compared with the existing Heavy Edge Matching (HEM) coarsening method, SC improves both the quality and the efficiency of coarsening.

• Parallelization: We carefully parallelize the proposed algorithm using multiple cores by rearranging the

Table 1 Comparison of our proposed PS-MCL to MLR-MCL and MCL in clustering quality and speed

               PS-MCL (proposed)  MLR-MCL [15]  MCL [8]
Fragmentation  Least              Less          Most
NCut           Small              Moderate      Large
Speed          Fast               Slow          Fastest

Note that PS-MCL produces clustering with the best quality, and its speed is comparable to MCL, which is a baseline method, and much faster than MLR-MCL, which is our main competitor.

operations to be calculable in a column-wise manner and assigning each column-wise computation to one core.

• Performance: Through experiments we show that PS-MCL prefers clusters of sizes in the range of 10 to 20 and results in less fragmentation compared to MLR-MCL. We also show that PS-MCL is effectively parallelizable. As a consequence, PS-MCL outperforms MLR-MCL in both quality and speed (Table 1).
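The column-wise parallelization described above can be sketched with dense numpy arrays and threads. This is only an illustration of the idea (the paper's implementation is in Java and operates on sparse matrices): the Regularize step is split into per-column matrix-vector products so that each column can be handled by a different core.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def regularize_parallel(M, MG, workers=4):
    """Compute M @ MG one column at a time, one task per column."""
    out = np.empty_like(M)
    def one_col(j):
        out[:, j] = M @ MG[:, j]     # matrix-vector product for column j
    with ThreadPoolExecutor(workers) as ex:
        list(ex.map(one_col, range(M.shape[1])))
    return out
```

Inflate and Prune are column-wise by definition, so they can be dispatched to cores the same way.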

In the rest of the paper, we explain preliminaries including MCL-based algorithms, describe our proposed method PS-MCL in detail, show experimental results on various protein interaction networks, and conclude.

Preliminaries

In this section we explain the existing MCL-based algorithms: the original MCL, R-MCL, and MLR-MCL. Table 2 lists the symbols used in this paper.

Table 2 Table of symbols

Symbol          Description
G = (V, E)      Graph
G′ = (V′, E′)   Coarsened graph
n               Number of nodes
m               Number of edges
u, v, x, y      Node
S               Super node
N (N′)          Set of neighbors in G (G′)
W (W′)          Edge weight map in G (G′)
A               Adjacency matrix
M, M_i          Flow matrix
M_G             Initial flow matrix
r               Regularization factor
b               Balancing factor
C               Clustering


Markov clustering (MCL)

MCL is a flow-based graph clustering algorithm. Let G = (V, E) be a graph with n = |V| and m = |E|, and let A be the adjacency matrix of G where self-loops for all nodes are added. The (i, j)-th element M_ij of the initial flow matrix M is defined as follows:

M_ij = A_ij / Σ_{k=1}^{n} A_kj

Intuitively, M_ij can be understood as the transition probability, or the amount of flow, from j to i. MCL iteratively updates M until convergence, and each iteration consists of the following three steps:

• Expand: M ← M × M
• Inflate: M_ij ← (M_ij)^r / Σ_{k=1}^{n} (M_kj)^r, where r > 1
• Prune: elements whose values are below a certain threshold are set to 0, and every column is normalized to sum to 1

When MCL converges, each column of M has at least one nonzero element. All nodes whose corresponding columns have a nonzero element in the same row are assigned to the same cluster; if a node has multiple nonzero elements in its column, a row is arbitrarily chosen. Although MCL is simple and intuitive, it lacks scalability due to the matrix multiplication in the expanding step, and it outputs a large number of too small clusters, e.g., 1,416 clusters from a network with 4,741 nodes (the fragmentation problem) [15].
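The iteration above can be sketched in dense numpy. This toy version illustrates the Expand/Inflate/Prune loop and the cluster read-off; it is not the tuned MCL implementation.

```python
import numpy as np

def mcl(A, r=2.0, threshold=1e-5, iters=100):
    """Toy MCL on an adjacency matrix A (self-loops added here)."""
    A = A + np.eye(len(A))
    M = A / A.sum(axis=0)            # column-stochastic initial flow matrix
    for _ in range(iters):
        M = M @ M                    # Expand
        M = M ** r                   # Inflate: element-wise power ...
        M /= M.sum(axis=0)           # ... followed by column normalization
        M[M < threshold] = 0.0       # Prune small entries
        M /= M.sum(axis=0)
    # nodes whose columns share an attractor row are clustered together
    return M.argmax(axis=0)
```

On a graph made of two disjoint triangles, the returned labels separate the two triangles.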

Regularized-MCL (R-MCL)

One reason for the fragmentation of clusters in MCL is that the adjacency structure of a given graph is used only at the beginning, which leads to diverging columns for neighboring node pairs. To resolve this fragmentation problem, R-MCL [15] regularizes the flow matrix instead of expanding it. The flow of a node is updated by minimizing the weighted sum of KL divergences between the target node and its neighbors. This minimization problem has a closed form solution, and consequently the regularizing step of R-MCL is derived as follows:

• Regularize: M = M × M_G, where M_G is the initial flow matrix defined as M_G = A D^{-1}, A is the adjacency matrix of G with the added self-loops and the weight transformation [15], and D is the diagonal matrix from A (i.e., D_ii = Σ_{k=1}^{n} A_ik)

R-MCL finds a smaller number of clusters than MCL does. A problem of R-MCL is that it finds clusters whose sizes are spread over a wide range, while clusters in bio-networks usually are in the size range of 5–20 [12, 13]. To resolve the problem, [9] generalizes the regularization step as follows:

• mass(i) = Σ_{j=1}^{n} M_ij
• M_R = column_normal( diag(mass)^{-b} × M_G ), where b is a balancing parameter
• Regularize: M = M × M_R

The balancing parameter b controls the degree of balance in the cluster sizes: a higher b encourages more balanced clustering. The intuition of this generalization is to penalize flows to a node currently having a large number of incoming flows. Note that b = 0 is equal to the original R-MCL.
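A minimal dense-numpy sketch of the balanced Regularize step, under the reading that diag(mass)^{-b} rescales the rows of M_G before column normalization (down-weighting flow toward high-mass nodes); b = 0 recovers the plain R-MCL update.

```python
import numpy as np

def regularize(M, MG, b=0.0):
    """One balanced Regularize step: M <- M x MR."""
    mass = M.sum(axis=1)                  # mass(i): total incoming flow of node i
    MR = np.diag(mass ** -b) @ MG         # penalize rows of high-mass nodes
    MR /= MR.sum(axis=0)                  # column-normalize
    return M @ MR
```

Since MR stays column-stochastic, a column-stochastic M remains column-stochastic after the update for any b.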

Multi-level R-MCL (MLR-MCL)

MLR-MCL uses graph coarsening to further improve the quality and the running time of R-MCL [15]. Graph coarsening merges related nodes into super nodes. MLR-MCL first generates a sequence of coarsened graphs (G_0, G_1, ..., G_ℓ), where G_0 is the original graph and G_ℓ is the most coarsened (smallest) graph. For i = ℓ down to 1, R-MCL is run on G_i for only a few iterations, and the computed flows on G_i are projected to G_{i-1}. After reaching the original graph G_0, R-MCL is run until convergence. Algorithm 1 shows the overall procedure of MLR-MCL. Although the description is for b = 0, b > 0 can also be used by changing M_G to M_R as defined in the previous section.

The original R-MCL and MLR-MCL use HEM (Heavy Edge Matching), which picks an unmatched neighbor connected to the heaviest edge for a given node, to coarsen the graph [15]. In HEM, the node v to which a node u is merged is determined as follows:

v = arg max_{v′ ∈ N_unmatched(u)} W(u, v′)

where N_unmatched(u) is the set of unmatched neighbors of u and W(u, v′) is the weight between u and v′. Note that HEM allows a node to be matched with at most one other node. MLR-MCL assigns all flows of a super node to one of its children for the flow projection; it is shown that the clustering result is invariant to the choice of the child to which all flows are assigned. For more details, refer to [15]. Note that MLR-MCL greatly reduces the overall computation of R-MCL, since the flow update is done on the coarsened graph, which is smaller than the original graph.

Limitation of HEM. HEM of MLR-MCL has two main limitations. First, the strategy of HEM of merging at most two single nodes can lead to undesirable coarsening where super nodes are not cohesive enough (see the "Cohesive super node" section for details). Second, HEM is known to be unsuitable for real-world graphs [19] due to their skewed degree distributions, which prevent the graph size from being effectively reduced (see the "Quickly reduced graph" section for details). These shortcomings of HEM make MLR-MCL inefficient for real-world


graphs. To overcome this, in the next section we propose PS-MCL, which allows multiple nodes to be merged at a time.

Algorithm 1 MLR-MCL
Input: Graph G, coarsening depth ℓ, and regularization factor r
Output: Clustering C
1:  G_0 = G
2:  Obtain G_0, ..., G_ℓ, a sequence of coarsened graphs
3:  Let M_ℓ = M_{G_ℓ} be the initial flow matrix of G_ℓ
4:  foreach i = ℓ down to 1 do
5:      foreach j = 1 to 4 (or 5) do
6:          M_i = M_i × M_{G_i}          // Regularize
7:          M_i = Inflate(M_i, r)
8:          M_i = Prune(M_i)
9:      end
10:     M_{i-1} = FlowProjection(M_i)
11:     Let M_{G_{i-1}} be the initial flow matrix of G_{i-1}
12: end
13: Update M_0 by applying lines 5–9, but until convergence, not for a few iterations
14: Determine clustering C from M_0

Implementation

In this section we describe our proposed method PS-MCL (Parallel Shotgun Coarsened MCL), which improves MLR-MCL from two perspectives: 1) increasing the efficiency of graph coarsening, and 2) parallelizing the operations of R-MCL.

Shotgun coarsening (SC)

As described previously, HEM is ineffective on real-world graphs. To overcome the limitation of HEM, we propose to use a graph coarsening which allows merging multiple nodes at a time. We call this scheme Shotgun Coarsening (SC) because it aggregates satellite nodes to the center one. Algorithm 2 describes the proposed SC coarsening method, where N(u) denotes the set of neighbors of u in G = (V, E), and connected_components(V, F) outputs a set of connected components, each of which is a set of nodes of the graph (V, F).

Our SC algorithm consists of three steps: 1) identify a set F of edges whose end nodes will be merged (lines 1–6); 2) determine a set V′ of super nodes of a coarsened graph and the weights associated with them (lines 7–12); and 3) determine a set E′ of edges between super nodes and their weights (lines 13–20). Let G = (V, E) be an input graph to be coarsened, Z: V → ℕ be a node weight map for G, and W: E → ℕ be an edge weight map for G. In the first step, we visit every node of G in an arbitrary order (line 2), and

Algorithm 2 Shotgun Coarsening
Input: Graph G = (V, E),
       a node weight map Z: V → ℕ for G,
       an edge weight map W: E → ℕ for G
Output: Coarsened graph G′ = (V′, E′),
        a node weight map Z′: V′ → ℕ for G′,
        an edge weight map W′: E′ → ℕ for G′
1:  F ← ∅; E′ ← ∅
2:  foreach u ∈ V do
3:      N1(u) ← {v ∈ N(u) : W(u, v) = max_{x∈N(u)} W(u, x)}
4:      N2(u) ← {v ∈ N1(u) : Z(v) = min_{x∈N1(u)} Z(x)}
5:      F ← F ∪ {(u, v)} for an arbitrarily picked v ∈ N2(u)
6:  end
7:  V′ ← connected_components(V, F)
8:  foreach S ∈ V′ do
9:      Z′(S) ← Σ_{u∈S} Z(u)
10:     E′ ← E′ ∪ {(S, S)}
11:     W′(S, S) ← Σ_{e∈E∩(S×S)} W(e)
12: end
13: foreach (S, T) ∈ V′ × V′ such that S ≠ T do
14:     H ← {e ∈ E : e ∈ S × T or e ∈ T × S}
15:     if |H| > 0 then
16:         E′ ← E′ ∪ {(S, T)}
17:         W′(S, T) ← Σ_{e∈H} W(e)
18:     end
19: end

for each node u ∈ V visited, we find the best match node v. Precisely, the algorithm finds the neighboring node of u with the highest edge weight to u (line 3), i.e.,

v = arg max_{x∈N(u)} W(u, x)

There may be multiple neighbors with the same highest weight. Let N1(u) be the set of those neighbors; then the one with the smallest node weight is chosen among them (line 4), i.e.,

v = arg min_{x∈N1(u)} Z(x)

This strategy of preferring a smaller node weight at the same edge weight prevents the emergence of an over-coarsened graph containing an excessively massive super node. Note that if every node in the initial graph has weight 1, the weight of a super node in a coarsened graph is equal to the number of nodes merged to create that super node. If there are multiple neighbors with the same highest edge weight and the smallest node weight, v is arbitrarily chosen among the ties. The edge (u, v) is then added to F (line 5).


The second step determines super nodes and the weights associated with them. Note that for (u, v) ∈ F, u and v should belong to the same super node by definition. By induction, two nodes belong to the same super node if and only if they are reachable along edges in F. As a result, we can identify the set V′ of super nodes by computing the connected components of the graph (V, F) (line 7).

After finding V′, we determine the weights of the super nodes in V′ and their self-loops as follows. For each super node S ∈ V′, its node weight Z′(S) is defined by the sum of the weights of the nodes in V that belong to S (line 9). The self-edge (S, S) is added to E′ (line 10), and its weight W′(S, S) is defined by the sum of the weights of the edges in E whose end nodes both belong to S (line 11).

The last step determines non-self edges between nodes in V′ and their edge weights as follows. For each unordered pair (S, T) ∈ V′ × V′, find the set H of edges in E with one end node in S and the other in T (line 15). If H ≠ ∅ (line 16), then (S, T) is added to E′ (line 17), and W′(S, T) is defined by the sum of the weights of the edges in H (line 18). Otherwise, there is no edge between S and T.

Skip Rate. In practice, a graph can be reduced too quickly by SC if it has super-hub nodes. To coarsen the graph to a reasonable size, we propose to randomly skip merging while iterating over the nodes in SC; i.e., with probability 0 ≤ p < 1, lines 3–5 are not executed. We call p a skip rate and use p = 0.5 in this paper.
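The matching rule above (lines 3–5 plus the skip rate) can be sketched in Python. The dictionary-based graph representation and the function name `sc_match` are ours for illustration; this is not the authors' Java implementation.

```python
import random

def sc_match(nodes, weight, node_weight, skip_rate=0.5, seed=None):
    """One matching pass: for each visited node u, pick the neighbor v
    with the highest edge weight W(u, v), breaking ties by the smallest
    node weight Z(v); with probability skip_rate, u is skipped entirely.
    Returns the set F of matching edges (u, v).

    weight[u] maps each neighbor x of u to W(u, x); node_weight[x] is Z(x).
    """
    rng = random.Random(seed)
    F = set()
    for u in nodes:
        if rng.random() < skip_rate:   # skip rate: do not merge u this pass
            continue
        neighbors = weight[u]
        if not neighbors:
            continue
        # line 3: restrict to neighbors with the highest edge weight to u
        best_w = max(neighbors.values())
        ties = [x for x, w in neighbors.items() if w == best_w]
        # line 4: among ties, prefer the smallest node weight
        v = min(ties, key=lambda x: node_weight[x])
        F.add((u, v))                  # line 5: record the matching edge
    return F
```

With `skip_rate=0.0` the pass is deterministic up to tie-breaking, which makes the preference for light super nodes easy to verify on a toy graph.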

Cohesive super node
The goal of coarsening is to merge tightly connected nodes into one super node. In this respect, HEM may prevent a super node from being cohesive. Figure 1a shows the ideal coarsening for a given graph. Assume that for the first merging the leftmost two nodes are merged, as shown in Fig. 1b, and the next node to be merged is u. If we use HEM, v is chosen since it is the only candidate, leading to Fig. 1c. Note that although u has more edges to the green super node than to v, it is merged with v; obviously, the result is undesirable for good coarsening. In contrast, SC (Fig. 1d) chooses the green super node for u, since the weight to the green node is larger than that to v. As a result, SC generates more cohesive super nodes than HEM does, leading to a high quality coarsened graph.

Quickly reduced graph
Ideally, at each step of the coarsening the number of nodes should halve, but that does not happen for real world graphs due to their highly skewed degree distributions [19]. In other words, a large number of coarsening steps is needed to obtain a coarsened graph of a manageable size, leading to large memory spaces for storing


Fig. 1 Effectiveness of our proposed SC method compared with HEM. a Ideal coarsening for the graph. b Coarsening in progress: for the first merging, the leftmost two nodes are chosen. c For the second merging, u is chosen; since v is the only candidate for merging in HEM, u and v are merged into a super node. d In SC, the green super node is also a candidate for u; since the weight of u to the green node is larger than that to v, u is merged into the green super node

graphs themselves and node merging information. This problem arises mainly due to star-like structures, as depicted in Fig. 2a. The red and yellow nodes are eventually merged with the blue and green groups, respectively, but this takes 5 more coarsening steps because only two nodes can be merged at a time. Note that each additional coarsening step needs space to store one graph and the mapping from nodes to super nodes; if the graph size is not effectively reduced, the amount of required space greatly increases with the coarsening depth. This inefficiency is resolved by SC, as shown in Fig. 2b: in contrast to HEM, which requires 5 more coarsening steps, only one step is enough in SC.


Fig. 2 Efficiency of our proposed coarsening method SC compared with HEM. a Result of one coarsening step by HEM. The red and yellow nodes are eventually merged into the blue and green nodes, respectively, but it takes 5 more coarsening steps. b Result of one coarsening step by SC. The result is the same as (a) after 5 more coarsening steps


Parallelization
We also improve MLR-MCL via multi-core parallelization of its three main operations: Regularize, Inflate, and Prune. For Regularize, we parallelize the computation by assigning columns of the resulting matrix to cores. In other words, for M3 = M1 × M2, we divide the computation as follows:

    M3(:, i) = M1 × M2(:, i),

where Mk(:, i) denotes the i-th column of Mk. Computing the i-th column of M3 is independent of computing the other columns j ≠ i of M3, and thus we distribute the columns of M2 to multiple cores while keeping M1 in shared memory. Inflate and Prune are themselves column-wise operations; thus the computation on each column is assigned to a core.

For efficiency in memory usage, we use the CSC (Compressed Sparse Column) format [20] to represent a matrix, which requires much less memory for storing a sparse matrix than a two-dimensional array format. In essence, the CSC format stores only the nonzero values of a matrix. Note that this strategy is especially efficient for real world graphs, which are very sparse in general, e.g., |E| = O(|V|). Figure 3 shows the CSC format for an example matrix. To access the nonzero elements of the j-th column (1-based indexing), we 1) obtain a = colPtr[j] and b = colPtr[j + 1] − 1, and 2) for a ≤ i ≤ b, get val[i] = A(rowInd[i], j). For example, to access the first column, we first obtain a = 1 and b = 2. By checking val[i] and rowInd[i] for 1 ≤ i ≤ 2, we identify the two nonzero values 10 and 9 at the first and fourth rows, respectively, of the first column, i.e., A(1, 1) = 10 and A(4, 1) = 9, since val[1] = 10 with rowInd[1] = 1 and val[2] = 9 with rowInd[2] = 4.
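The access procedure can be illustrated in Python. The arrays below are a toy instance of our own, arranged so that the first column reproduces the A(1, 1) = 10 and A(4, 1) = 9 example; index 0 is left unused to keep the text's 1-based indexing.

```python
def csc_column(colPtr, rowInd, val, j):
    """Return {row: value} for the nonzeros of column j (1-based),
    following the access rule in the text:
    a = colPtr[j], b = colPtr[j + 1] - 1, then scan val[a..b]."""
    a = colPtr[j]          # position of the first nonzero of column j
    b = colPtr[j + 1] - 1  # position of the last nonzero of column j
    return {rowInd[i]: val[i] for i in range(a, b + 1)}

# Toy 4-column CSC arrays (our own example); slot 0 is a placeholder
# so that positions match the paper's 1-based description.
colPtr = [None, 1, 3, 4, 5]
rowInd = [None, 1, 4, 2, 3]
val    = [None, 10, 9, 7, 5]
```

Here `csc_column(colPtr, rowInd, val, 1)` yields the two nonzeros of the first column described in the text, without ever materializing the dense matrix.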

Algorithm 3 shows our implementation of one iteration of parallelized R-MCL with the CSC format. In the algorithm, nonzeros(M, j) is the set of pairs (i, x) indicating

Fig. 3 CSC format for sparse matrix representation. The top figure is a given matrix A and the bottom one is the corresponding CSC representation of A. As the matrix gets larger and sparser, CSC gets much more efficient, since the space complexity of CSC is proportional to the number of nonzero elements, not to the matrix size

Algorithm 3: One Iteration of Parallelized R-MCL
Input: current flow matrix M, initial flow matrix MG, and the number T of threads
Output: next flow matrix N

1     Λ ← ∅
2     Parallel for j ← 1 to n, each assigned to the (j mod T)-th thread, do
3        c = ∅
4        foreach (i, x) ∈ nonzeros(MG, j) do
5           foreach (k, y) ∈ nonzeros(M, i) do
6              if (k, z) ∈ c then c ← c ∪ {(k, z + xy)} \ {(k, z)}
7              else c ← c ∪ {(k, xy)}
8           end
9        end
10       c ← Prune(Inflate(c))
11       Λ ← Λ ∪ {(j, c)}
12    end
13    colPtr ← 1_{n+1}
14    foreach (j, c) ∈ Λ do colPtr(j + 1) ← |c|
15    foreach i ← 2 to n + 1 do colPtr(i) ← colPtr(i) + colPtr(i − 1)
16    val ← rowInd ← 0_{colPtr(n+1)−1}
17    Parallel for (j, c) ∈ Λ, each assigned to the (j mod T)-th thread, do
18       ℓ = colPtr(j)
19       foreach (k, z) ∈ c in increasing order of k do
20          val(ℓ) ← z
21          rowInd(ℓ) ← k
22          ℓ ← ℓ + 1
23       end
24    end
25    N ← CSC(val, rowInd, colPtr)

nonzeros in the j-th column of the matrix M, i.e., M(i, j) = x; 0_n and 1_n denote n-dimensional vectors all of whose elements are 0 and 1, respectively. Lines 4 to 9 correspond to Regularize: each thread, running in a dedicated core, performs c = M × MG(:, j) for each column j assigned to it, and as a result (j, c) is added to Λ after applying Inflate and Prune to c. Note that although we do not spell out Inflate and Prune in line 10, their implementation is trivial for each column c.

Lines 13 to 16 correspond to constructing colPtr and allocating space for val and rowInd of the resulting matrix N, which is done sequentially. Afterwards, lines 17 to 24 correspond to filling val and rowInd using Λ and colPtr, in parallel over the columns. Note that


the positions of val and rowInd to be updated for each column are specified in colPtr, and they do not overlap.
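The core of lines 4–9 is a sparse column product, c = M × MG(:, j). A minimal single-threaded sketch, with each column stored as a Python dict of (row, value) pairs (our own representation for illustration, not the paper's CSC-based Java code):

```python
def regularize_column(M, MG_col):
    """Compute one column c of M x MG: for each nonzero (i, x) in the
    regularizer column, accumulate x * (column i of M) into c.

    M is a dict mapping column index i to {row k: value y};
    MG_col is one column of MG as {row i: value x}."""
    c = {}
    for i, x in MG_col.items():            # (i, x) in nonzeros(MG, j)
        for k, y in M.get(i, {}).items():  # (k, y) in nonzeros(M, i)
            c[k] = c.get(k, 0.0) + x * y   # lines 6-7: accumulate x*y
    return c
```

Because each output column depends only on M and one column of MG, the calls for different j are independent, which is exactly what makes the column-per-thread assignment in Algorithm 3 safe.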

Results
We present experimental results to answer the following questions:

Q1. How does PS-MCL improve the distribution of cluster sizes compared with MLR-MCL?

Q2. What is the performance of PS-MCL compared with MLR-MCL in quality and running time?

Q3. How much speedup do we obtain by parallelizing PS-MCL?

Q4. How accurate are the clusters found by PS-MCL compared to the ground truth?

Table 3 lists the datasets used in our experiments. We use various bio-networks for evaluating clustering quality and running time; the largest dataset, DBLP, is used for the scalability experiments.

Experimental settings
Machine. All experiments are conducted on a workstation with two Intel(R) Xeon(R) E5-2630 v4 2.20GHz CPUs and 250GB of memory.

Evaluation criteria. To evaluate the quality of a clustering C for Q1 and Q2, we use the average NCut [15], defined as follows:

    AverageNCut(C) = (1 / |C|) Σ_{c ∈ C} NCut(c),

where

    NCut(c) = Σ_{u ∈ c, v ∉ c} A(u, v) / Σ_{u ∈ c} degree(u).
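Under this definition, the average NCut can be computed directly; a small self-contained sketch, with the weighted adjacency stored as a dict of dicts (our own representation for illustration):

```python
def ncut(A, cluster):
    """NCut(c) = (weight of edges leaving c) / (total degree of c)."""
    cut = sum(w for u in cluster for v, w in A[u].items() if v not in cluster)
    vol = sum(w for u in cluster for w in A[u].values())
    return cut / vol

def average_ncut(A, clustering):
    """Mean NCut over all clusters of a clustering."""
    return sum(ncut(A, c) for c in clustering) / len(clustering)
```

For a path 1–2–3–4 split into {1, 2} and {3, 4}, each cluster cuts one unit-weight edge against a total degree of 3, so the average NCut is 1/3.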

For answering Q4, we focus on the protein complex finding problem and use the accuracy measure defined by Hernandez et al. [4] as follows. Let G be the set of ground truth clusters (protein complexes); then the degree of overlap Tgc for every g ∈ G and c ∈ C is defined as

    Tgc = |g ∩ c|.

The accuracy ACC is the geometric mean of the Sensitivity SST and the Positive Predictive Value PPV:

    SST = Σ_{g ∈ G} max_{c ∈ C} Tgc / Σ_{g ∈ G} |g|,

    PPV = Σ_{c ∈ C} max_{g ∈ G} Tgc / Σ_{c ∈ C} |c|,

    ACC = (SST × PPV)^{1/2}.
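These formulas translate directly into code; a sketch with complexes and clusters as Python sets (illustrative only):

```python
def accuracy(ground_truth, clusters):
    """ACC = sqrt(SST * PPV), with overlap T_gc = |g intersect c|."""
    overlap = lambda g, c: len(g & c)
    # Sensitivity: best overlap per ground-truth complex, over total complex size
    sst = (sum(max(overlap(g, c) for c in clusters) for g in ground_truth)
           / sum(len(g) for g in ground_truth))
    # PPV: best overlap per predicted cluster, over total cluster size
    ppv = (sum(max(overlap(g, c) for g in ground_truth) for c in clusters)
           / sum(len(c) for c in clusters))
    return (sst * ppv) ** 0.5
```

For example, with complexes {1, 2, 3} and {4, 5} against predicted clusters {1, 2} and {3, 4, 5}, both SST and PPV equal 4/5, giving ACC = 0.8.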

Parameter. We use a coarsening depth of 3 for PS-MCL and MLR-MCL, with which the improvement in quality and speed is large while the number of resulting clusters remains reasonable.

Performance of SC
In this section, we answer Q1 and Q2. Figure 4 shows the comparison of PS-MCL with MLR-MCL and MCL. The horizontal axis denotes the cluster size, and the vertical axis denotes the number of nodes belonging to clusters of that size.

Before going into details, we briefly summarize the results of MCL, which are invariant within each column of Fig. 4 because of the lack of the balancing concept in MCL. As discussed in [15], for all cases MCL suffers from the fragmentation problem: a large portion of nodes belongs to very tiny clusters of size 1–3. We provide the results of MCL as a baseline in the figure, and the following analysis focuses on comparing MLR-MCL and PS-MCL.

For all of our datasets, we observe the same patterns described in the following Observations 1–3, though we present the four representative results in Fig. 4.

Table 3 Description of the datasets used in our experiments

Name              Nodes    Edges      Density (10^-4)  Description
BioPlex [23]      10961    56553      9.42    PPI network of Homo sapiens in the BioPlex 2.0 network
DIP-droso [21]    22595    69148      2.71    PPI network of Drosophila melanogaster
Drosophila [24]   6600     19820      9.1     PPI network of Drosophila melanogaster
MINT [25]         3872     56937      75.97   Combined PPI network of 325 different organisms
Yeast-1 [26]      2353     36790      132.95  Genome-scale epistasis map of Schizosaccharomyces pombe
Yeast-2 [27]      2428     4606       15.63   PPI network of Saccharomyces cerevisiae
Yeast-3 [28]      3886     57782      76.55   PPI network of Saccharomyces cerevisiae
Yeast-4 [29]      2223     7049       28.54   PPI network of Saccharomyces cerevisiae
DIP-yeast [21]    4929     17786      14.64   PPI network of Saccharomyces cerevisiae
BioGRID-homo [22] 20837    288224     13.28   PPI network of Homo sapiens
DBLP              317080   1049866    0.21    Coauthorship network


Fig. 4 The number of nodes belonging to clusters of specific sizes. With balancing factor b = 0, i.e., without considering balance, PS-MCL and MLR-MCL do not find clusters properly: in general they output too large clusters, and even group all nodes into one cluster for MINT. Using a balancing factor b > 0, both result in clusters whose sizes are concentrated around a small value; that value is larger in PS-MCL than in MLR-MCL. Observe that MLR-MCL makes a significantly large number of tiny clusters, including singletons, which are meaningless in clustering. In contrast, our proposed PS-MCL greatly reduces that number, to less than 5% of the number by MLR-MCL. For all cases, MCL suffers from the cluster fragmentation problem. For the other datasets in Table 3, we observe the same patterns described here

Observation 1 (Too massive cluster without balancing) Without balancing, i.e., b = 0, a large number of nodes are assigned to one cluster; often the entire graph becomes one cluster. A balancing factor of 1.0 to 1.5 resulted in a reasonable cluster size distribution for most of the networks.

The first row of Fig. 4 corresponds to the results without balancing, i.e., b = 0. In this case, both PS-MCL and MLR-MCL group too many nodes into one cluster; especially on MINT, both output only one cluster containing all nodes in the graph. Figure 5 shows the ratio of the largest cluster size over the number of nodes for the bio-networks listed in Table 3. Note that the largest cluster sizes for BioPlex, Drosophila, MINT, Yeast-1, Yeast-2, and Yeast-3 are nearly the same as the total number of nodes; those for DIP-droso and Yeast-4 are relatively smaller but still occupy a large proportion of the entire size.

Observation 2 (PS-MCL preferring larger clusters than MLR-MCL) With b > 0, the cluster size with the maximum number of total nodes in PS-MCL is larger than that in MLR-MCL.


Fig. 5 Ratio of the largest cluster size over the number of nodes. Without balancing, i.e., b = 0, PS-MCL and MLR-MCL group too many nodes into one cluster

The second, third, and fourth rows of Fig. 4 show the results of varying the balancing factor b ∈ {1, 1.5, 2}, respectively. In contrast to the case of b = 0, the cluster sizes of PS-MCL and MLR-MCL are concentrated on certain sizes, and the mode of the cluster size in PS-MCL is larger than that in MLR-MCL: the modes in PS-MCL are 10–20 for DIP-droso, BioPlex, and Drosophila, and 20–50 for MINT, while those in MLR-MCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.

Observation 3 (PS-MCL with less fragmentation than MLR-MCL) With b > 0, PS-MCL results in a significantly smaller number of fragmented clusters, whose sizes are 1–3, compared with MLR-MCL.

PS-MCL achieves concentrated cluster sizes and also avoids fragmented clusters, while MLR-MCL and MCL still suffer from the fragmentation. The number of nodes belonging to very small clusters in PS-MCL is much smaller than that in MLR-MCL; for instance, the number of nodes belonging to clusters of size 1–3 in PS-MCL is less than 5% of that in MLR-MCL for DIP with b = 1.5.

Observation 4 (PS-MCL better than MLR-MCL in time and NCut) PS-MCL results in a faster running time with a smaller NCut than MLR-MCL does.

Figure 6 shows the plot of running time versus the average NCut. On average, PS-MCL runs faster than MLR-MCL, with running time down to 21% of MLR-MCL's, and outputs a clustering with a smaller average NCut, down to 87% of MLR-MCL's. For some cases MCL is faster than PS-MCL, but for all cases its average NCut is worse than that of PS-MCL.

Performance of parallelization
In this section, we answer Q3. We use b = 1.5 for PS-MCL and MLR-MCL. Figure 7 shows the performance evaluation results for PS-MCL on the bio-networks in Table 3 with

Fig. 6 Time vs. average NCut for all datasets, with b = 1.5 for PS-MCL, MLR-MCL, and MCL. We use 4 cores for PS-MCL. On average over all datasets, PS-MCL's running time was down to 21% of MLR-MCL's and its average NCut down to 87%. Although MCL is faster than PS-MCL for some networks, its average NCut is higher than that of PS-MCL for all networks

increasing cores. For all cases, PS-MCL gets faster as the number of cores increases.

To test the scalability more effectively, we use DBLP, the largest of our datasets, though it is not a bio-network. Figure 8a shows the speedup of PS-MCL while increasing the number of cores, compared with MLR-MCL and MCL. We use points, not lines, for MLR-MCL and MCL since they are single-core algorithms. PS-MCL outperforms MLR-MCL regardless of the number of cores, and becomes effectively faster as the number of cores increases: precisely, the running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core.

Figure 8b shows the running time of PS-MCL while increasing data sizes, compared with MLR-MCL and MCL. Here we use 4 cores for PS-MCL. To obtain graphs of various sizes, we first take principal submatrices

Fig. 7 Running time of PS-MCL for the bio-networks in Table 3 while increasing the number of cores. Note that the running time is effectively reduced as the number of cores increases



Fig. 8 Running time of PS-MCL, MLR-MCL, and MCL. a Running time on DBLP while varying the number of cores. The running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core; furthermore, the single-core performance of PS-MCL outperforms MLR-MCL. b Running time on DBLP while varying data sizes. The running times of all methods are linear in the number of edges in the graphs, and PS-MCL outperforms MLR-MCL for all cases

from the adjacency matrix of DBLP with sizes 20%, 40%, 60%, 80%, and 100% of the total, and use their giant connected components. As shown in the figure, the running times of all the methods are linear in the graph sizes, and PS-MCL outperforms MLR-MCL at all scales. Note that although MCL is slightly faster than PS-MCL, MCL has the fragmentation problem and worse NCut, while PS-MCL has no fragmentation problem and better NCut, as shown in Figs. 4 and 6.

Protein complex identification
In this section, we use the two bio-networks DIP-yeast [21] and BioGRID-homo [22] described in Table 3 to answer Q4 on the protein complex finding problem. The ground-truth protein complex information is extracted from CYC2008 2.0 [13] for DIP-yeast and CORUM [14] for BioGRID-homo. The complexes are used as reference clusters for measuring the accuracy.

Figure 9 shows the performance of PS-MCL while varying skip rates, in comparison with MLR-MCL and MCL. (Note: the skip rate is not applicable to MLR-MCL and MCL, leading to one accuracy value each; for clear performance comparison, we represent that value by the horizontal dashed line along the x-axis.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; namely, the smaller p, the more aggressive the coarsening. In the figure, PS-MCL performs best with moderate values of p: 0.6 and 0.7 for DIP-yeast and BioGRID-homo, respectively. For both networks, PS-MCL consistently outperforms MLR-MCL with 0.5 ≤ p ≤ 0.7; the accuracy of PS-MCL is higher than that of MLR-MCL by up to 36.6% for DIP-yeast and 82.4% for BioGRID-homo. This result makes sense because SC with too large p hardly reduces the graph in size, while too small p leads to too large clusters due to aggressive coarsening.

Fig. 9 Accuracy of PS-MCL with varying skip rates, MLR-MCL, and MCL. The performance of PS-MCL varies depending on p, but is better than that of MLR-MCL in the range 0.5 ≤ p ≤ 0.7. Compared to MCL, PS-MCL is slightly less accurate, but we observe that this is due to many dimer structures; excluding dimers, PS-MCL greatly outperforms MCL (c). a DIP-yeast. b BioGRID-homo. c BioGRID-homo excluding dimers

PS-MCL achieves up to 33.2% higher accuracy than MCL for DIP-yeast, and 98.7% of the MCL accuracy for BioGRID-homo. The latter is due to many dimer structures present in the CORUM database: excluding dimers from the database, PS-MCL greatly outperforms MCL, as shown in Fig. 9c. Although PS-MCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig. 4) and performs poorly in the internal evaluation by average NCut (Fig. 6), which assesses the potential of finding well-formed but undiscovered clusters.

Conclusion
In this paper we propose PS-MCL, a parallel graph clustering method which gives superior performance on bio-networks. PS-MCL includes two enhancements over previous methods. First, PS-MCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLR-MCL on real-world networks; SC allows merging multiple nodes at a time, reducing the graph size more quickly and making super nodes much more cohesive than HEM used in MLR-MCL. Second, PS-MCL gives a multi-core parallel algorithm for clustering to increase scalability. Extensive


experiments show that PS-MCL results in clusters that generally have larger sizes than those by MLR-MCL, and also greatly alleviates the fragmentation problem. Moreover, PS-MCL finds clusters whose quality is better than those by MLR-MCL in both internal (average NCut) and external (reference clusters) criteria. Also, as more cores are used, PS-MCL gets faster, and it outperforms MLR-MCL in speed even with a single core.

PS-MCL's capability to quickly find mid-size clusters in large scale bio-networks has a wide range of applicability in systems biology. Although we have only shown that PS-MCL effectively finds mid-size protein complexes on two protein-protein interaction networks compared to existing topology-based clustering algorithms, we believe that it can be effectively applied to function prediction, disease module detection, and other systems biology analyses.

Availability and requirements
Project name: PS-MCL
Project home page: https://github.com/leesael/PS-MCL
Operating system(s): Platform independent (tested on Ubuntu)
Programming language: Java
Other requirements: Java 1.8 or higher
License: BSD
Any restrictions to use by non-academics: licence needed

Abbreviations
HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsening MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements
Not applicable.

Funding
Publication of this article has been funded by the National Research Foundation of Korea grant funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials
The datasets analysed during the current study are available in the GitHub repository https://github.com/leesael/PS-MCL.

About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13.

Authors' contributions
YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All the authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published 24 July 2019

References
1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.
2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: Modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.
3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.
4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.
5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.
6. Tadaka S, Kinoshita K. NCMine: Core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.
7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Xie Y, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:16199.
8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.
9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.
10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.
11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.
12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.
13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.
14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes—2009. Nucleic Acids Res. 2009;38(suppl_1):497–501.
15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: Applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.
16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.
17. Lim Y, Kang U, Faloutsos C. SlashBurn: Graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.
18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.
19. Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society; 2006. p. 124.
20. Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989;15(1):1–14.
21. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.
22. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.
23. Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017;545(7655):505–9.
24. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–36.
25. Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35(suppl 1):572–4.
26. Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012;46(5):691–704.
27. Chen J, Hsu W, Lee ML, Ng S-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006;22(16):1998–2004.
28. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
29. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res. 2003;31(9):2443–50.



Markov clustering (MCL)
MCL is a flow-based graph clustering algorithm. Let G = (V, E) be a graph with n = |V| and m = |E|, and let A be the adjacency matrix of G where self-loops for all nodes are added. The (i, j)-th element Mij of the initial flow matrix M is defined as follows:

    Mij = Aij / Σ_{k=1}^{n} Akj.

Intuitively, Mij can be understood as the transition probability, or the amount of flow, from j to i. MCL iteratively updates M until convergence, and each iteration consists of the following three steps:

• Expand: M ← M × M
• Inflate: Mij ← (Mij)^r / Σ_{k=1}^{n} (Mkj)^r, where r > 1
• Prune: elements whose values are below a certain threshold are set to 0, and every column is normalized to sum to 1

When MCL converges, each column of M has at least one nonzero element. All nodes whose corresponding columns have a nonzero element in the same row are assigned to the same cluster; if a node has multiple nonzero elements in its column, a row is arbitrarily chosen. Although MCL is simple and intuitive, it lacks scalability due to the matrix multiplication in the expanding step, and it outputs a large number of too small clusters, e.g., 1416 clusters from a network with 4741 nodes (the fragmentation problem) [15].
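The three steps can be sketched for a small dense matrix in plain Python (a toy illustration of the update rule; the actual implementations operate on sparse matrices):

```python
def normalize_columns(M):
    """Scale every column of the square matrix (list of rows) to sum to 1."""
    n = len(M)
    for j in range(n):
        s = sum(M[i][j] for i in range(n))
        for i in range(n):
            M[i][j] /= s
    return M

def mcl_iteration(M, r=2.0, threshold=1e-4):
    """One MCL iteration on a column-stochastic matrix M:
    Expand (M x M), Inflate (entrywise power r, renormalize columns),
    Prune (zero out tiny entries, renormalize)."""
    n = len(M)
    # Expand: M <- M x M
    E = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    # Inflate: raise entries to the power r, then renormalize each column
    I = [[E[i][j] ** r for j in range(n)] for i in range(n)]
    normalize_columns(I)
    # Prune: drop entries below the threshold, then renormalize
    P = [[x if x >= threshold else 0.0 for x in row] for row in I]
    return normalize_columns(P)
```

On a graph of two disconnected 2-cliques (with self-loops), the initial flow matrix is already a fixed point, so the iteration leaves the two blocks intact, which is the clustering MCL would report.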

Regularized-MCL (R-MCL)
One reason for the fragmentation of clusters in MCL is that the adjacency structure of a given graph is used only at the beginning, which leads to diverging columns for neighboring node pairs. To resolve this fragmentation problem, R-MCL [15] regularizes the flow matrix instead of expanding it. The flow of a node is updated by minimizing the weighted sum of KL divergences between the target node and its neighbors. This minimization problem has a closed form solution, and consequently the regularizing step of R-MCL is derived as follows:

• Regularize: M = M × MG, where MG is the initial flow matrix defined as MG = AD^{-1}, A is the adjacency matrix of G with the added self-loops and the weight transformation [15], and D is the diagonal matrix of A (i.e., Dii = Σ_{k=1}^{n} Aik).

R-MCL finds a smaller number of clusters than MCL does. A problem of R-MCL is that it finds clusters whose sizes are spread over a wide range, while clusters in bio-networks usually are in the size range of 5–20 [12, 13]. To resolve the problem, [9] generalizes the regularization step as follows:

• mass(i) = Σ_{j=1}^{n} Mij
• MR = column_normalize(diag(mass)^{-b} × MG), where b is a balancing parameter
• Regularize: M = M × MR

The balancing parameter b controls the degree of balance in the cluster sizes; a higher b encourages more balanced clustering. The intuition behind this generalization is to penalize flows into a node that currently has a large amount of incoming flow. Note that b = 0 is equal to the original R-MCL.
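Reading the balanced regularizer as scaling row i of MG by mass(i)^{-b} and then renormalizing columns, a sketch is as follows; this reading and the function name are our own interpretation of the step, not the authors' code:

```python
def balanced_regularizer(M, MG, b):
    """Build MR: scale row i of MG by mass(i)**(-b), where
    mass(i) = sum_j M[i][j] is the total incoming flow of node i,
    then renormalize every column to sum to 1."""
    n = len(M)
    mass = [sum(M[i][j] for j in range(n)) for i in range(n)]
    MR = [[(mass[i] ** -b) * MG[i][j] for j in range(n)] for i in range(n)]
    for j in range(n):
        s = sum(MR[i][j] for i in range(n))
        for i in range(n):
            MR[i][j] /= s
    return MR
```

With b = 0 the scaling is the identity, recovering the original R-MCL regularizer; with b > 0, rows with large mass are damped, which discourages further flow into already-heavy nodes.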

Multi-level R-MCL (MLR-MCL)MLR-MCL uses graph coarsening to further improve thequality and the running time of R-MCL [15] Graph coars-ening means to merge related nodes to a super nodeMLR-MCL first generates a sequence of coarsened graphs(G0 G1 G) where G0 is the original graph and G isthe most coarsened (smallest) graph For i = down to 1R-MCL is run on Gi only for a few iterations and the com-puted flows on Gi are projected to Giminus1 After reachingthe original graph G0 R-MCL is run until convergenceAlgorithm 1 shows the overall procedure of MLR-MCLAlthough the description is for b = 0 b gt 0 can be alsoused by changing MG to MR as defined in the previoussection

The original R-MCL and MLR-MCL use HEM (Heavy Edge Matching), which, for a given node, picks the unmatched neighbor connected by the heaviest edge, to coarsen the graph [15]. In HEM, the node v to which a node u is merged is determined as follows:

v = arg max_{v′ ∈ N_unmatched(u)} W(u, v′),

where N_unmatched(u) is the set of unmatched neighbors of u, and W(u, v′) is the weight of the edge between u and v′. Note that HEM allows a node to be matched with at most one other node. For the flow projection, MLR-MCL assigns all flows of a super node to one of its children; it is shown that the clustering result is invariant to the choice of the child to which all flows are assigned. For more details, refer to [15]. Note that MLR-MCL greatly reduces the overall computation of R-MCL, since the flow update is done on the coarsened graph, which is smaller than the original graph.
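One HEM pass can be sketched as follows. This is an illustrative Python sketch (names are ours, not from the MLR-MCL code): each unmatched node is matched with its still-unmatched neighbor across the heaviest edge.

```python
# Minimal sketch of one Heavy Edge Matching (HEM) pass.
# adj[u] = {v: edge_weight}. Returns a list of matched pairs (u, v).

def heavy_edge_matching(nodes, adj):
    matched = set()
    pairs = []
    for u in nodes:          # visit nodes in the given (arbitrary) order
        if u in matched:
            continue
        # candidates: unmatched neighbors of u
        cands = {v: w for v, w in adj[u].items() if v not in matched}
        if not cands:
            continue         # u stays unmatched this round
        v = max(cands, key=cands.get)  # heaviest incident edge
        matched.update((u, v))
        pairs.append((u, v))
    return pairs
```

On the path 0–1 (weight 1), 1–2 (weight 5), visiting order matters: starting from node 0 forces the light match (0, 1) and leaves node 2 unmatched, while starting from node 1 yields the heavy match (1, 2) — each node is matched with at most one other node, which is exactly the limitation SC removes.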

Limitation of HEM. HEM in MLR-MCL has two main limitations. First, the strategy of HEM of merging at most two single nodes can lead to undesirable coarsening where super nodes are not cohesive enough (see the "Cohesive super node" section for details). Second, HEM is known to be unsuitable for real-world graphs [19] due to their skewed degree distributions, which prevent the graph size from being effectively reduced (see the "Quickly reduced graph" section for details). These shortcomings of HEM make MLR-MCL inefficient for real-world graphs. To overcome this, in the next section we propose PS-MCL, which allows multiple nodes to be merged at a time.

Algorithm 1: MLR-MCL
Input: graph G, coarsening depth ℓ, and regularization factor r
Output: clustering C
1: G_0 = G
2: Obtain G_0, …, G_ℓ, a sequence of coarsened graphs
3: Let M_ℓ = M_{G_ℓ} be the initial flow matrix of G_ℓ
4: foreach i = ℓ down to 1 do
5:   foreach j = 1 to 4 (or 5) do
6:     M_i = M_i × M_{G_i}  // Regularize
7:     M_i = Inflate(M_i, r)
8:     M_i = Prune(M_i)
9:   end
10:  M_{i−1} = FlowProjection(M_i)
11:  Let M_{G_{i−1}} be the initial flow matrix of G_{i−1}
12: end
13: Update M_0 by applying lines 5–9, but until convergence, not for a few iterations
14: Determine clustering C from M_0

Implementation
In this section, we describe our proposed method PS-MCL (Parallel Shotgun Coarsened MCL), which improves MLR-MCL from two perspectives: (1) increasing the efficiency of the graph coarsening, and (2) parallelizing the operations of R-MCL.

Shotgun coarsening (SC)
As described previously, HEM is ineffective on real-world graphs. To overcome this limitation of HEM, we propose a graph coarsening scheme that allows merging multiple nodes at a time. We call this scheme Shotgun Coarsening (SC) because it aggregates satellite nodes into a center one. Algorithm 2 describes the proposed SC coarsening method, where N(u) denotes the set of neighbors of u in G = (V, E), and connected_components(V, F) outputs the set of connected components of the graph (V, F), each of which is a set of nodes.

Our SC algorithm consists of three steps: (1) identify a set F of edges whose end nodes will be merged (lines 1–6); (2) determine the set V′ of super nodes of the coarsened graph and their associated weights (lines 7–12); and (3) determine the set E′ of edges between super nodes and their weights (lines 13–20). Let G = (V, E) be the input graph to be coarsened, Z: V → ℕ a node weight map for G, and W: E → ℕ an edge weight map for G. In the first step, we visit every node of G in an arbitrary order (line 2), and

Algorithm 2: Shotgun Coarsening
Input: graph G = (V, E), a node weight map Z: V → ℕ for G, and an edge weight map W: E → ℕ for G
Output: coarsened graph G′ = (V′, E′), a node weight map Z′: V′ → ℕ for G′, and an edge weight map W′: E′ → ℕ for G′
1: F ← ∅; E′ ← ∅
2: foreach u ∈ V do
3:   N1(u) ← {v ∈ N(u) : W(u, v) = max_{x∈N(u)} W(u, x)}
4:   N2(u) ← {v ∈ N1(u) : Z(v) = min_{x∈N1(u)} Z(x)}
5:   F ← F ∪ {(u, v)} for an arbitrarily picked v ∈ N2(u)
6: end
7: V′ ← connected_components(V, F)
8: foreach S ∈ V′ do
9:   Z′(S) ← Σ_{u∈S} Z(u)
10:  E′ ← E′ ∪ {(S, S)}
11:  W′(S, S) ← Σ_{e∈E∩(S×S)} W(e)
12: end
13: // determine the non-self edges between super nodes
14: foreach (S, T) ∈ V′ × V′ such that S ≠ T do
15:   H ← {e ∈ E : e ∈ S × T or e ∈ T × S}
16:   if |H| > 0 then
17:     E′ ← E′ ∪ {(S, T)}
18:     W′(S, T) ← Σ_{e∈H} W(e)
19:   end
20: end

for each node u ∈ V visited, we find the best match node v. Precisely, the algorithm finds the neighboring node of u with the highest edge weight to u (line 3), i.e.,

v = arg max_{x∈N(u)} W(u, x).

There may be multiple neighbors with the same highest weight; let N1(u) be the set of those neighbors. In this case, the one with the smallest node weight among them is chosen (line 4), i.e.,

v = arg min_{x∈N1(u)} Z(x).

This strategy of preferring a smaller node weight at the same edge weight prevents the emergence of an over-coarsened graph containing an excessively massive super node. Note that if every node in the initial graph has weight 1, the weight of a super node in a coarsened graph is equal to the number of nodes merged to create that super node. If there are multiple neighbors with both the same highest edge weight and the same smallest node weight, v is chosen arbitrarily among the ties. The edge (u, v) is then added to F (line 5).


The second step determines the super nodes and their associated weights. Note that for (u, v) ∈ F, u and v should belong to the same super node by definition. By induction, two nodes belong to the same super node if and only if they are reachable along edges in F. As a result, we can identify the set V′ of super nodes by computing the connected components of the graph (V, F) (line 7).

After finding V′, we determine the weights of the super nodes in V′ and of their self-loops as follows. For each super node S ∈ V′, its node weight Z′(S) is defined as the sum of the weights of the nodes in V that belong to S (line 9). The self-edge (S, S) is added to E′ (line 10), and its weight W′(S, S) is defined as the sum of the weights of the edges in E whose both end nodes belong to S (line 11).

The last step determines the non-self edges between nodes in V′ and their weights as follows. For each unordered pair (S, T) ∈ V′ × V′ with S ≠ T, find the set H of edges in E with one end node in S and the other in T (line 15). If H ≠ ∅ (line 16), (S, T) is added to E′ (line 17), and W′(S, T) is defined as the sum of the weights of the edges in H (line 18). Otherwise, there is no edge between S and T.
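The three steps above can be sketched compactly. The following is an illustrative Python sketch (the paper's code is Java; `shotgun_coarsen` and its internals are our names, and we assume a simple undirected graph with no self-loops, edges keyed by node pairs). Super nodes are represented as frozensets of original nodes.

```python
# Hedged sketch of Shotgun Coarsening (Algorithm 2).
# W: dict mapping frozenset({u, v}) -> edge weight; Z: dict node -> node weight.

from collections import defaultdict

def shotgun_coarsen(nodes, W, Z):
    adj = defaultdict(dict)
    for e, w in W.items():
        u, v = tuple(e)
        adj[u][v] = w
        adj[v][u] = w
    # Step 1: every node picks a heaviest-edge neighbor, breaking ties by
    # the smallest node weight (this avoids excessively massive super nodes).
    F = []
    for u in nodes:
        if not adj[u]:
            continue
        wmax = max(adj[u].values())
        n1 = [v for v, w in adj[u].items() if w == wmax]
        zmin = min(Z[v] for v in n1)
        F.append((u, next(v for v in n1 if Z[v] == zmin)))
    # Step 2: super nodes = connected components of (V, F), via union-find.
    parent = {u: u for u in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in F:
        parent[find(u)] = find(v)
    comp = defaultdict(set)
    for u in nodes:
        comp[find(u)].add(u)
    super_of = {u: frozenset(c) for c in comp.values() for u in c}
    Z2 = {S: sum(Z[u] for u in S) for S in set(super_of.values())}
    # Step 3: sum edge weights between (and inside) super nodes.
    W2 = defaultdict(float)
    for e, w in W.items():
        u, v = tuple(e)
        W2[frozenset({super_of[u], super_of[v]})] += w
    return Z2, dict(W2)
```

On a star of one center and three unit-weight leaves, every leaf picks the center and the whole star collapses into a single super node of weight 4 in one step — exactly the multi-node merge that HEM cannot do.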

Skip rate. In practice, a graph can be reduced too quickly by SC if it has super-hub nodes. To coarsen the graph to a reasonable size, we propose to randomly skip merging while iterating over the nodes in SC; i.e., with probability 0 ≤ p < 1, lines 3–5 are not executed. We call p the skip rate and use p = 0.5 in this paper.

Cohesive super node
The goal of coarsening is to merge tightly connected nodes into one super node. In this respect, HEM may prevent a super node from being cohesive. Figure 1a shows the ideal coarsening for a given graph. Assume that for the first merging the leftmost two nodes are merged, as shown in Fig. 1b, and that the next node to be merged is u. If we use HEM, v is chosen, since it is the only candidate, leading to Fig. 1c. Note that although u has more edges to the green super node than to v, it must be merged with v. Obviously, this result is undesirable for good coarsening. In contrast, SC (Fig. 1d) chooses the green super node for u, since the weight to the green node is larger than that to v. As a result, SC generates more cohesive super nodes than HEM does, leading to a higher quality coarsened graph.

Quickly reduced graph
Ideally, at each step of the coarsening, the number of nodes should halve, but this does not happen for real-world graphs due to their highly skewed degree distributions [19]. In other words, a large number of coarsening steps are needed to obtain a coarsened graph of a manageable size, leading to large memory spaces for storing

Fig. 1 Effectiveness of our proposed SC method compared with HEM. a Ideal coarsening for the graph. b Coarsening in progress: for the first merging, the leftmost two nodes are chosen. c For the second merging, u is chosen; since v is the only candidate for merging in HEM, u and v are merged into a super node. d In SC, the green super node is also a candidate for u; since the weight of u to the green node is larger than that to v, u is merged into the green super node.

graphs themselves and the node merging information. This problem arises mainly due to star-like structures, as depicted in Fig. 2a: the red and yellow nodes are eventually merged with the blue and green groups, respectively, but this requires 5 more coarsening steps, because only two nodes can be merged at a time. Note that each additional coarsening step requires space to store one graph and the mapping from nodes to super nodes; if the graph size is not effectively reduced, the amount of required space grows greatly with the coarsening depth. This inefficiency is resolved by SC, as shown in Fig. 2b: in contrast to HEM, which requires 5 more coarsening steps, only one step is enough in SC.

Fig. 2 Efficiency of our proposed coarsening method SC compared with HEM. a Result of one coarsening step by HEM: the red and yellow nodes are eventually merged into the blue and green nodes, respectively, but this takes 5 more coarsening steps. b Result of one coarsening step by SC: the result is the same as (a) after 5 more coarsening steps.


Parallelization
We also improve MLR-MCL via multi-core parallelization of its three main operations: Regularize, Inflate, and Prune. For Regularize, we parallelize the computation by assigning columns of the resulting matrix to cores. In other words, for M3 = M1 × M2, we divide the computation as follows:

M3(:, i) = M1 × M2(:, i),

where M_k(:, i) denotes the i-th column of M_k. Computing the i-th column of M3 is independent of computing the other columns j ≠ i of M3, and thus we distribute the columns of M2 to multiple cores while keeping M1 in shared memory. Inflate and Prune are themselves column-wise operations; thus, the computation on each column is assigned to a core.

For memory efficiency, we use the CSC (Compressed Sparse Column) format [20] to represent a matrix, which requires much less memory for storing a sparse matrix than a two-dimensional array format. In essence, the CSC format stores only the nonzero values of a matrix. Note that this strategy is especially efficient for real-world graphs, which are very sparse in general, e.g., |E| = O(|V|). Figure 3 shows the CSC format for an example matrix. To access the nonzero elements of the j-th column (1-based indexing), we (1) obtain a = colPtr[j] and b = colPtr[j + 1] − 1, and (2) for a ≤ i ≤ b, get val[i] = A(rowInd[i], j). For example, to access the first column, we first obtain a = 1 and b = 2. By checking val[i] and rowInd[i] for 1 ≤ i ≤ 2, we identify the two nonzero values 10 and 9 at the first and the fourth rows, respectively, of the first column, i.e., A(1, 1) = 10 and A(4, 1) = 9, since val[1] = 10 with rowInd[1] = 1 and val[2] = 9 with rowInd[2] = 4.

Algorithm 3 shows our implementation of one iteration of parallelized R-MCL with the CSC format. In the algorithm, nonzeros(M, j) is the set of pairs (i, x) indicating

Fig. 3 CSC format for sparse matrix representation. The top figure is a given matrix A, and the bottom one is the corresponding CSC representation of A. As the matrix gets larger and sparser, CSC becomes much more efficient, since the space complexity of CSC is proportional to the number of nonzero elements, not to the matrix size.
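The CSC access pattern just described can be sketched as follows, using 0-based indexing instead of the text's 1-based indexing. The first column reproduces the worked example (10 at row 1 and 9 at row 4, in 1-based terms); the second column is our own hypothetical addition to make the example a complete matrix.

```python
# Sketch of CSC column access: col_ptr[j] .. col_ptr[j+1]-1 index the slice
# of val/row_ind that holds the nonzeros of column j.

def csc_column(col_ptr, row_ind, val, j):
    """Return the nonzeros of column j as a {row: value} dict."""
    a, b = col_ptr[j], col_ptr[j + 1]
    return {row_ind[i]: val[i] for i in range(a, b)}

# 4x2 example: column 0 has 10 at row 0 and 9 at row 3 (0-based);
# column 1 (hypothetical) has 3 at row 1 and 7 at row 3.
col_ptr = [0, 2, 4]
row_ind = [0, 3, 1, 3]
val = [10, 9, 3, 7]
```

Here `csc_column(col_ptr, row_ind, val, 0)` yields `{0: 10, 3: 9}`, mirroring the 1-based lookup in the text; only the four nonzeros are stored, regardless of the matrix dimensions.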

Algorithm 3: One Iteration of Parallelized R-MCL
Input: current flow matrix M, initial flow matrix M_G, and the number T of threads
Output: next flow matrix N
1: Λ ← ∅
2: parallel foreach j ← 1 to n, each assigned to the (j mod T)-th thread, do
3:   c ← ∅
4:   foreach (i, x) ∈ nonzeros(M_G, j) do
5:     foreach (k, y) ∈ nonzeros(M, i) do
6:       if (k, z) ∈ c then c ← c ∪ {(k, z + xy)} \ {(k, z)}
7:       else c ← c ∪ {(k, xy)}
8:     end
9:   end
10:  c ← Prune(Inflate(c))
11:  Λ ← Λ ∪ {(j, c)}
12: end
13: colPtr ← 1_{n+1}
14: foreach (j, c) ∈ Λ do colPtr(j + 1) ← |c|
15: foreach i ← 2 to n + 1 do colPtr(i) ← colPtr(i) + colPtr(i − 1)
16: val ← rowInd ← 0_{colPtr(n+1)−1}
17: parallel foreach (j, c) ∈ Λ, each assigned to the (j mod T)-th thread, do
18:   t ← colPtr(j)
19:   foreach (k, z) ∈ c in increasing order of k do
20:     val(t) ← z
21:     rowInd(t) ← k
22:     t ← t + 1
23:   end
24: end
25: N ← CSC(val, rowInd, colPtr)

the nonzeros in the j-th column of the matrix M, i.e., M(i, j) = x ≠ 0; 0_n and 1_n denote the n-dimensional vectors all of whose elements are 0 and 1, respectively. Lines 4 to 9 correspond to Regularize: each thread, running in a dedicated core, computes c = M × M_G(:, j) for each column j assigned to it, and as a result (j, c) is added to the collected set after applying Inflate and Prune to c. Note that although we do not describe Inflate and Prune in line 10, their implementation is trivial for each column c.

Lines 13 to 16 correspond to constructing colPtr and allocating space for val and rowInd of the resulting matrix N, which is done sequentially. Afterwards, lines 17 to 24 fill val and rowInd using the collected columns and colPtr, in parallel over the columns. Note that the positions of val and rowInd to be updated for each column are specified by colPtr, and they do not overlap.
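The column-parallel flow update at the heart of Algorithm 3 can be sketched as follows. This is an illustrative Python sketch only: the names are ours, CPython threads do not give the real numeric speedups of the paper's Java implementation, `inflate`/`prune` are simplified placeholders (entrywise power then renormalize; drop tiny entries), and the CSC re-assembly of lines 13–24 is omitted for brevity.

```python
# Sketch of the parallel Regularize/Inflate/Prune step: each worker computes
# c = M @ M_G[:, j] for its columns independently (columns do not interact).

from concurrent.futures import ThreadPoolExecutor

def one_iteration(M, M_G, inflate, prune, threads=4):
    """M, M_G: dicts col -> {row: value}. Returns the next flow matrix."""
    def column(j):
        c = {}
        for i, x in M_G[j].items():            # nonzeros of M_G[:, j]
            for k, y in M.get(i, {}).items():  # nonzeros of M[:, i]
                c[k] = c.get(k, 0.0) + x * y
        return j, prune(inflate(c))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(column, M_G.keys()))

def make_inflate(r=2.0):
    # placeholder Inflate: raise entries to power r, renormalize the column
    def inflate(c):
        s = sum(v ** r for v in c.values())
        return {k: (v ** r) / s for k, v in c.items()} if s else c
    return inflate

def make_prune(eps=1e-4):
    # placeholder Prune: drop entries below eps, renormalize the survivors
    def prune(c):
        kept = {k: v for k, v in c.items() if v >= eps}
        s = sum(kept.values())
        return {k: v / s for k, v in kept.items()} if s else c
    return prune
```

Because each output column depends only on M and one column of M_G, no locking is needed; this is the same independence that lets Algorithm 3 write val and rowInd without overlap.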

Results
We present experimental results to answer the following questions.

Q1. How does PS-MCL improve the distribution of cluster sizes compared with MLR-MCL?

Q2. What is the performance of PS-MCL compared with MLR-MCL in quality and running time?

Q3. How much speedup do we obtain by parallelizing PS-MCL?

Q4. How accurate are the clusters found by PS-MCL compared to the ground truth?

Table 3 lists the datasets used in our experiments. We use various bio-networks for evaluating the clustering quality and the running time; the largest dataset, DBLP, is used for scalability experiments.

Experimental settings
Machine. All experiments are conducted on a workstation with two Intel(R) Xeon(R) E5-2630 v4 2.20GHz CPUs and 250GB of memory.

Evaluation criteria. To evaluate the quality of a clustering C for Q1 and Q2, we use the average NCut [15], defined as follows:

AverageNCut(C) = (1 / |C|) Σ_{c∈C} NCut(c),

where

NCut(c) = (Σ_{u∈c, v∉c} A(u, v)) / (Σ_{u∈c} degree(u)).
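The average NCut criterion defined above can be computed directly. A minimal Python sketch (our names, not the authors' code), with the graph as an adjacency dict and clusters as node sets:

```python
# Hedged sketch of average NCut: for each cluster, the weight of edges
# leaving the cluster divided by the total (weighted) degree of its nodes.

def average_ncut(adj, clusters):
    """adj[u] = {v: weight}; clusters: list of node sets."""
    def ncut(c):
        cut = sum(w for u in c for v, w in adj[u].items() if v not in c)
        vol = sum(w for u in c for w in adj[u].values())  # sum of degrees
        return cut / vol if vol else 0.0
    return sum(ncut(c) for c in clusters) / len(clusters)
```

For two unit-weight triangles joined by a single bridge edge, each triangle has cut 1 and volume 7, so the average NCut is 1/7; a lower value means more cohesive, better-separated clusters.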

For Q4, we focus on the protein complex finding problem and use the accuracy measure defined by Hernandez et al. [4] as follows. Let G be the set of ground-truth clusters (protein complexes); then the degree of overlap T_gc for every g ∈ G and c ∈ C is defined as

T_gc = |g ∩ c|.

The accuracy ACC is the geometric mean of the Sensitivity SST and the Positive Predictive Value PPV:

SST = (Σ_{g∈G} max_{c∈C} T_gc) / (Σ_{g∈G} |g|),

PPV = (Σ_{c∈C} max_{g∈G} T_gc) / (Σ_{c∈C} |c|),

ACC = (SST × PPV)^{1/2}.
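The accuracy measure above translates directly into code. A minimal Python sketch (our names), with the ground-truth complexes and the clusters as lists of node sets:

```python
# Hedged sketch of the accuracy measure: SST, PPV, and their geometric mean.

def accuracy(ground_truth, clusters):
    def overlap(g, c):
        return len(g & c)  # T_gc = |g intersect c|
    sst = (sum(max(overlap(g, c) for c in clusters) for g in ground_truth)
           / sum(len(g) for g in ground_truth))
    ppv = (sum(max(overlap(g, c) for g in ground_truth) for c in clusters)
           / sum(len(c) for c in clusters))
    return (sst * ppv) ** 0.5
```

When the clustering reproduces the ground truth exactly, SST = PPV = 1 and ACC = 1; merging everything into one big cluster keeps SST at 1 but drives PPV (and hence ACC) down, which is why the measure penalizes both fragmentation and over-merging.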

Parameter. We use a coarsening depth of 3 for PS-MCL and MLR-MCL, with which the improvement in quality and speed is large while the number of resulting clusters remains reasonable.

Performance of SC
In this section, we answer Q1 and Q2. Figure 4 compares PS-MCL with MLR-MCL and MCL; the horizontal axis denotes the cluster size, and the vertical axis denotes the number of nodes belonging to clusters of that size.

Before going into details, we briefly summarize the results of MCL, which are invariant within each column of Fig. 4 because MCL lacks the balancing concept. As discussed in [15], in all cases MCL suffers from the fragmentation problem: a large portion of nodes belongs to very tiny clusters of size 1–3. We provide the results of MCL as a baseline in the figure, and the following analysis focuses on comparing MLR-MCL and PS-MCL.

For all of our datasets, we observe the same patterns, described in Observations 1–3 below, though we present only four representative results in Fig. 4.

Table 3 Description of the datasets used in our experiments

Name               Nodes    Edges      Density (10⁻⁴)  Description
BioPlex [23]       10,961   56,553     9.42            PPI network of Homo sapiens from the BioPlex 2.0 network
DIP-droso [21]     22,595   69,148     2.71            PPI network of Drosophila melanogaster
Drosophila [24]    6,600    19,820     9.1             PPI network of Drosophila melanogaster
MINT [25]          3,872    56,937     75.97           Combined PPI network of 325 different organisms
Yeast-1 [26]       2,353    36,790     132.95          Genome-scale epistasis map of Schizosaccharomyces pombe
Yeast-2 [27]       2,428    4,606      15.63           PPI network of Saccharomyces cerevisiae
Yeast-3 [28]       3,886    57,782     76.55           PPI network of Saccharomyces cerevisiae
Yeast-4 [29]       2,223    7,049      28.54           PPI network of Saccharomyces cerevisiae
DIP-yeast [21]     4,929    17,786     14.64           PPI network of Saccharomyces cerevisiae
BioGRID-homo [22]  20,837   288,224    13.28           PPI network of Homo sapiens
DBLP               317,080  1,049,866  0.21            Coauthorship network


Fig. 4 The number of nodes belonging to clusters of specific sizes. With balancing factor b = 0, i.e., without considering balance, PS-MCL and MLR-MCL do not find clusters properly: in general they output too-large clusters, and even group all the nodes into one cluster for MINT. Using a balancing factor b > 0, both result in clusters whose sizes are concentrated around a small value; that value is larger in PS-MCL than in MLR-MCL. Observe that MLR-MCL makes a significantly large number of tiny clusters, including singletons, which are meaningless in clustering. In contrast, our proposed PS-MCL greatly reduces that number, to less than 5% of the number by MLR-MCL. In all cases, MCL suffers from the cluster fragmentation problem. For the other datasets in Table 3, we observe the same patterns described here.

Observation 1 (Too massive a cluster without balancing) Without balancing, i.e., b = 0, a large number of nodes are assigned to one cluster; often the entire graph becomes one cluster. A balancing factor of 1.0 to 1.5 resulted in reasonable cluster size distributions for most of the networks.

The first row of Fig. 4 corresponds to the result without balancing, i.e., b = 0. In this case, both PS-MCL and MLR-MCL group too many nodes into one cluster; on MINT in particular, both output only one cluster containing all the nodes in the graph. Figure 5 shows the ratio of the largest cluster size to the number of nodes for the bio-networks listed in Table 3. Note that the largest cluster sizes for BioPlex, Drosophila, MINT, Yeast-1, Yeast-2, and Yeast-3 are nearly the same as the total number of nodes; those for DIP-droso and Yeast-4 are relatively smaller but still occupy a large proportion of the entire size.

Observation 2 (PS-MCL prefers larger clusters than MLR-MCL) With b > 0, the cluster size holding the maximum number of total nodes in PS-MCL is larger than that in MLR-MCL.


Fig. 5 Ratio of the largest cluster size to the number of nodes. Without balancing, i.e., b = 0, PS-MCL and MLR-MCL group too many nodes into one cluster.

The second, third, and fourth rows of Fig. 4 show the results of varying the balancing factor b ∈ {1, 1.5, 2}. In contrast to the case of b = 0, the cluster sizes of PS-MCL and MLR-MCL are concentrated around certain sizes, and the mode of the cluster size in PS-MCL is larger than that in MLR-MCL: the modes in PS-MCL are 10–20 for DIP-droso, BioPlex, and Drosophila, and 20–50 for MINT, while those in MLR-MCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.

Observation 3 (PS-MCL shows less fragmentation than MLR-MCL) With b > 0, PS-MCL results in a significantly smaller number of fragmented clusters, whose sizes are 1–3, compared with MLR-MCL.

PS-MCL achieves concentrated cluster sizes while also avoiding fragmented clusters; MLR-MCL and MCL still suffer from fragmentation. The number of nodes belonging to very small clusters in PS-MCL is much smaller than that in MLR-MCL: for instance, the number of nodes belonging to clusters of size 1–3 in PS-MCL is less than 5% of that in MLR-MCL for DIP with b = 1.5.

Observation 4 (PS-MCL beats MLR-MCL in time and NCut) PS-MCL results in a faster running time with a smaller NCut than MLR-MCL does.

Figure 6 plots the running time versus the average NCut. On average over all datasets, PS-MCL reduces the running time to as little as 21% and the average NCut to as little as 87% of those of MLR-MCL. For some cases MCL is faster than PS-MCL, but in all cases its average NCut is worse than that of PS-MCL.

Performance of parallelization
In this section, we answer Q3. We use b = 1.5 for PS-MCL and MLR-MCL. Figure 7 shows the performance evaluation results for PS-MCL on the bio-networks in Table 3 with

Fig. 6 Time vs. average NCut for all datasets, with b = 1.5 for PS-MCL, MLR-MCL, and MCL. We use 4 cores for PS-MCL. Compared with MLR-MCL, PS-MCL reduced the running time to as little as 21% and the average NCut to as little as 87% over all datasets. Although MCL is faster than PS-MCL for some networks, its average NCut is higher than that of PS-MCL for all networks.

increasing numbers of cores. In all cases, PS-MCL gets faster as the number of cores increases.

To test the scalability more thoroughly, we use DBLP, the largest of our datasets, though it is not a bio-network. Figure 8a shows the speedup of PS-MCL with an increasing number of cores, compared with MLR-MCL and MCL. We use points, not lines, for MLR-MCL and MCL, since they are single-core algorithms. PS-MCL outperforms MLR-MCL regardless of the number of cores and becomes effectively faster as the number of cores increases: precisely, the running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with a single core.

Figure 8b shows the running time of PS-MCL with increasing data sizes, compared with MLR-MCL and MCL. Here we use 4 cores for PS-MCL. To obtain graphs of various sizes, we first take principal submatrices

Fig. 7 Running time of PS-MCL for the bio-networks in Table 3 with an increasing number of cores. Note that the running time is effectively reduced as the number of cores increases.


Fig. 8 Running time of PS-MCL, MLR-MCL, and MCL. a Running time on DBLP while varying the number of cores: the running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with a single core; furthermore, the single-core performance of PS-MCL outperforms MLR-MCL. b Running time on DBLP while varying data sizes: the running times of all methods are linear in the number of edges in the graphs, and PS-MCL outperforms MLR-MCL in all cases.

from the adjacency matrix of DBLP with sizes 20%, 40%, 60%, 80%, and 100% of the total, and use their giant connected components. As shown in the figure, the running times of all the methods are linear in the graph size, and PS-MCL outperforms MLR-MCL at all scales. Note that although MCL is slightly faster than PS-MCL, MCL has the fragmentation problem and a worse NCut, while PS-MCL has no fragmentation problem and a better NCut, as shown in Figs. 4 and 6.

Protein complex identification
In this section, we use the two bio-networks DIP-yeast [21] and BioGRID-homo [22] described in Table 3 to answer Q4 on the protein complex finding problem. The ground-truth protein complexes are extracted from CYC2008 2.0 [13] for DIP-yeast and CORUM [14] for BioGRID-homo; the complexes are used as reference clusters for measuring the accuracy.

Figure 9 shows the performance of PS-MCL with varying skip rates, in comparison with MLR-MCL and MCL. (Note: the skip rate is not applicable to MLR-MCL and MCL, leading to one accuracy value each; for clear performance comparison, we represent that value by the horizontal dashed line along the x-axis.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; namely, the smaller p, the more aggressive the coarsening. In the figure, PS-MCL performs best with moderate values of p: 0.6 and 0.7 for DIP-yeast and BioGRID-homo, respectively. For both networks, PS-MCL consistently outperforms MLR-MCL with 0.5 ≤ p ≤ 0.7; the accuracy of PS-MCL is higher than that of MLR-MCL by up to 36.6% for DIP-yeast and 82.4% for BioGRID-homo. This result makes sense, because SC with too large a p hardly reduces the graph in

Fig. 9 Accuracy of PS-MCL with varying skip rates, compared with MLR-MCL and MCL. The performance of PS-MCL varies depending on p, but is better than that of MLR-MCL in the range 0.5 ≤ p ≤ 0.7. Compared to MCL, PS-MCL is slightly less accurate, but we observe that this is due to many dimer structures; excluding dimers, PS-MCL greatly outperforms MCL (c). a DIP-yeast. b BioGRID-homo. c BioGRID-homo without dimers.

size, while too small a p leads to too-large clusters due to aggressive coarsening.

PS-MCL achieves up to 33.2% higher accuracy than MCL for DIP-yeast, and 98.7% of the MCL accuracy for BioGRID-homo. The latter is due to the many dimer structures present in the CORUM database: excluding dimers from the database, PS-MCL greatly outperforms MCL, as shown in Fig. 9c. Although PS-MCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig. 4) and performs poorly in the internal evaluation by average NCut (Fig. 6), which assesses the potential for finding well-formed but undiscovered clusters.

Conclusion
In this paper, we propose PS-MCL, a parallel graph clustering method that gives superior performance on bio-networks. PS-MCL includes two enhancements over previous methods. First, PS-MCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLR-MCL on real-world networks: SC allows merging multiple nodes at a time, reducing the graph size more quickly and making super nodes more cohesive than HEM used in MLR-MCL. Second, PS-MCL provides a multi-core parallel clustering algorithm to increase scalability. Extensive experiments show that PS-MCL results in clusters that generally have larger sizes than those of MLR-MCL and that it greatly alleviates the fragmentation problem. Moreover, PS-MCL finds clusters of better quality than those of MLR-MCL under both internal (average NCut) and external (reference clusters) criteria. Also, PS-MCL gets faster as more cores are used, and it outperforms MLR-MCL in speed even with a single core.

PS-MCL's capability to quickly find mid-size clusters in genome-scale bio-networks has a wide range of applicability in systems biology. Although we have only shown that PS-MCL effectively finds mid-size protein complexes on two protein-protein interaction networks compared with existing topology-based clustering algorithms, we believe that it can be effectively applied to function prediction, disease module detection, and other systems biology analyses.

Availability and requirements
Project name: PS-MCL
Project home page: https://github.com/leesael/PS-MCL
Operating system(s): Platform independent (tested on Ubuntu)
Programming language: Java
Other requirements: Java 1.8 or higher
License: BSD
Any restrictions to use by non-academics: licence needed

Abbreviations
HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsened MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements
Not applicable.

Funding
Publication of this article has been funded by the National Research Foundation of Korea grant funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials
The datasets analysed during the current study are available in the GitHub repository: https://github.com/leesael/PS-MCL.

About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13.

Authors' contributions
YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All the authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published: 24 July 2019

References
1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.
2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.
3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.
4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.
5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.
6. Tadaka S, Kinoshita K. NCMine: core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.
7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Xie Y, Du Z, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:16199.
8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.
9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.
10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.
11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.
12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.
13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.
14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes—2009. Nucleic Acids Res. 2010;38(suppl_1):497–501.
15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.
16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.
17. Lim Y, Kang U, Faloutsos C. SlashBurn: graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.
18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.
19. Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society; 2006. p. 124.
20. Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989;15(1):1–14.
21. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.
22. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.
23. Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017;545(7655):505–9.
24. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–36.
25. Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35(suppl 1):572–4.
26. Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012;46(5):691–704.
27. Chen J, Hsu W, Lee ML, Ng S-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006;22(16):1998–2004.
28. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
29. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res. 2003;31(9):2443–50.


Lim et al. BMC Bioinformatics 2019, 20(Suppl 13):381. Page 4 of 12

graphs. To overcome this, in the next section we propose PS-MCL, which allows multiple nodes to be merged at a time.

Algorithm 1: MLR-MCL
Input: graph G, coarsening depth ℓ, and regularization factor r
Output: clustering C

1:  G_0 = G
2:  Obtain G_0, ..., G_ℓ, a sequence of coarsened graphs
3:  Let M_ℓ = M_{G_ℓ} be the initial flow matrix of G_ℓ
4:  foreach i = ℓ down to 1 do
5:      foreach j = 1 to 4 (or 5) do
6:          M_i = M_i × M_{G_i}    // Regularize
7:          M_i = Inflate(M_i, r)
8:          M_i = Prune(M_i)
9:      end
10:     M_{i-1} = FlowProjection(M_i)
11:     Let M_{G_{i-1}} be the initial flow matrix of G_{i-1}
12: end
13: Update M_0 by applying lines 5–9, but until convergence, not for a few iterations
14: Determine clustering C from M_0

Implementation

In this section, we describe our proposed method PS-MCL (Parallel Shotgun Coarsened MCL), which improves MLR-MCL in two respects: 1) increasing the efficiency of the graph coarsening, and 2) parallelizing the operations of R-MCL.

Shotgun coarsening (SC)

As described previously, HEM is ineffective on real world graphs. To overcome the limitation of HEM, we propose to use a graph coarsening scheme which allows merging multiple nodes at a time. We call this scheme Shotgun Coarsening (SC) because it aggregates satellite nodes into the center one. Algorithm 2 describes the proposed SC coarsening method, where N(u) denotes the set of neighbors of u in G = (V, E), and connected_components(V, F) outputs a set of connected components, each of which is a set of nodes of the graph (V, F).

Our SC algorithm consists of three steps: 1) identify a set F of edges whose end nodes will be merged (lines 1–6); 2) determine a set V′ of super nodes of the coarsened graph and their associated weights (lines 7–12); and 3) determine a set E′ of edges between super nodes and their weights (lines 13–20). Let G = (V, E) be an input graph to be coarsened, Z: V → ℕ a node weight map for G, and W: E → ℕ an edge weight map for G. In the first step, we visit every node of G in an arbitrary order (line 2) and,

Algorithm 2: Shotgun Coarsening
Input: graph G = (V, E); a node weight map Z: V → ℕ for G; an edge weight map W: E → ℕ for G
Output: coarsened graph G′ = (V′, E′); a node weight map Z′: V′ → ℕ for G′; an edge weight map W′: E′ → ℕ for G′

1:  F ← ∅; E′ ← ∅
2:  foreach u ∈ V do
3:      N1(u) ← {v ∈ N(u) : W(u, v) = max_{x ∈ N(u)} W(u, x)}
4:      N2(u) ← {v ∈ N1(u) : Z(v) = min_{x ∈ N1(u)} Z(x)}
5:      F ← F ∪ {(u, v)} for an arbitrarily picked v ∈ N2(u)
6:  end
7:  V′ ← connected_components(V, F)
8:  foreach S ∈ V′ do
9:      Z′(S) ← Σ_{u ∈ S} Z(u)
10:     E′ ← E′ ∪ {(S, S)}
11:     W′(S, S) ← Σ_{e ∈ E ∩ (S×S)} W(e)
12: end
13: // determine edges between distinct super nodes
14: foreach (S, T) ∈ V′ × V′ such that S ≠ T do
15:     H ← {e ∈ E : e ∈ S × T or e ∈ T × S}
16:     if |H| > 0 then
17:         E′ ← E′ ∪ {(S, T)}
18:         W′(S, T) ← Σ_{e ∈ H} W(e)
19:     end
20: end

for each node u ∈ V visited, we find the best match node v. Precisely, the algorithm finds the neighboring node of u with the highest edge weight to u (line 3), i.e.,

    v = arg max_{x ∈ N(u)} W(u, x).

There may be multiple neighbors with the same highest weight. Let N1(u) be the set of those neighbors; in this case, the one with the smallest node weight is chosen among them (line 4), i.e.,

    v = arg min_{x ∈ N1(u)} Z(x).

This strategy of preferring a smaller node weight at the same edge weight prevents the emergence of an over-coarsened graph containing an excessively massive super node. Note that if every node in the initial graph has weight 1, the weight of a super node in a coarsened graph is equal to the number of nodes merged to create that super node. If there are multiple neighbors with both the same highest edge weight and the same smallest node weight, v is chosen arbitrarily among the ties. Edge (u, v) is then added to F (line 5).
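As an illustration, the best-match rule of lines 3–5 can be sketched as follows. The array-based representation and the class and method names are our own, not part of the PS-MCL implementation.

```java
// Sketch of SC's best-match rule (Algorithm 2, lines 3-5): among the
// neighbors of u, prefer the highest edge weight W(u, v); break ties by
// the smallest node weight Z(v); remaining ties are arbitrary (first seen).
public class BestMatch {
    // nbr[i] is a neighbor id of u, w[i] = W(u, nbr[i]), z[i] = Z(nbr[i])
    static int bestMatch(int[] nbr, int[] w, int[] z) {
        int best = -1;
        int bestW = Integer.MIN_VALUE; // highest edge weight seen so far
        int bestZ = Integer.MAX_VALUE; // smallest node weight at that edge weight
        for (int i = 0; i < nbr.length; i++) {
            if (w[i] > bestW || (w[i] == bestW && z[i] < bestZ)) {
                best = nbr[i];
                bestW = w[i];
                bestZ = z[i];
            }
        }
        return best;
    }
}
```

With this rule, a heavier node that only ties on edge weight loses to a lighter one, which is exactly the over-coarsening safeguard described above.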

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 5 of 12

The second step determines super nodes and their associated weights. Note that for (u, v) ∈ F, u and v should belong to the same super node by definition. By mathematical induction, two nodes belong to the same super node if and only if they are reachable along edges in F. As a result, we can identify the set V′ of super nodes by computing the connected components of the graph (V, F) (line 7).

After finding V′, we determine the weights of the super nodes in V′ and their self-loops as follows. For each super node S ∈ V′, its node weight Z′(S) is defined as the sum of the weights of the nodes in V that belong to S (line 9). The self-edge (S, S) is added to E′ (line 10), and its weight W′(S, S) is defined as the sum of the weights of the edges in E whose end nodes both belong to S (line 11).

The last step determines the non-self edges between nodes in V′ and their edge weights as follows. For each unordered pair (S, T) ∈ V′ × V′, find the set H of edges in E with one end node in S and the other in T (line 15). If H ≠ ∅ (line 16), (S, T) is added to E′ (line 17) and W′(S, T) is defined as the sum of the weights of the edges in H (line 18). Otherwise, there is no edge between S and T.
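The component computation of line 7 can be sketched with a standard union-find structure; this is a minimal illustration under our own naming, not the PS-MCL code.

```java
// Sketch: super nodes as connected components of (V, F) (Algorithm 2,
// line 7). Nodes are 0..n-1; f holds the matched edges F as {u, v} pairs.
public class SuperNodes {
    static int[] parent;

    static int find(int x) {
        // path-compressing find of the component representative
        return parent[x] == x ? x : (parent[x] = find(parent[x]));
    }

    static void union(int a, int b) {
        parent[find(a)] = find(b);
    }

    // returns, for each node, the id of its super node (its representative)
    static int[] superNodeOf(int n, int[][] f) {
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int[] e : f) union(e[0], e[1]);
        int[] rep = new int[n];
        for (int i = 0; i < n; i++) rep[i] = find(i);
        return rep;
    }
}
```

Nodes sharing a representative are then merged into one super node, and the node and edge weight maps are aggregated per representative.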

Skip Rate. In practice, a graph can be reduced too quickly by SC if it has super-hub nodes. To coarsen the graph to a reasonable size, we propose to randomly skip merging while iterating over nodes in SC; i.e., with probability 0 ≤ p < 1, lines 3–5 are not executed. We call p the skip rate and use p = 0.5 in this paper.

Cohesive super node

The goal of coarsening is to merge tightly connected nodes into one super node. In this respect, HEM may prevent a super node from being cohesive. Figure 1a shows the ideal coarsening for a given graph. Assume that for the first merging, the leftmost two nodes are merged as shown in Fig 1b, and the next node to be merged is u. If we use HEM, v is chosen since it is the only candidate, leading to Fig 1c. Note that although u has more edges to the green super node than to v, it would be merged with v; obviously, the result is undesirable for good coarsening. In contrast, SC (Fig 1d) chooses the green super node for u, since the weight to the green node is larger than that to v. As a result, SC generates more cohesive super nodes than HEM does, leading to a high quality coarsened graph.

Quickly reduced graph

Ideally, at each step of the coarsening, the number of nodes should halve, but that does not happen for real world graphs due to their highly skewed degree distributions [19]. In other words, a large number of coarsening steps are needed to obtain a coarsened graph of a manageable size, leading to large memory spaces for storing


Fig 1. Effectiveness of our proposed SC method compared with HEM. a Ideal coarsening for the graph. b Coarsening in progress: for the first merging, the leftmost two nodes are chosen. c For the second merge, u is chosen; since v is the only candidate for merging in HEM, u and v are merged into a super node. d In SC, the green super node is also a candidate for u; since the weight of u to the green node is larger than that to v, u is merged into the green super node.

graphs themselves and node merging information. This problem arises mainly due to star-like structures, as depicted in Fig 2a. The red and yellow nodes are eventually merged with the blue and the green groups, respectively, but it needs 5 more coarsening steps because only two nodes can be merged at a time. Note that for each additional coarsening step, we need space to store one graph and the mapping from nodes to super nodes; if the graph size is not effectively reduced, the amount of required space greatly increases with the coarsening depth. This inefficiency is resolved by SC, as shown in Fig 2b: in contrast to HEM, which requires 5 more coarsening steps, only one step is enough in SC.


Fig 2. Efficiency of our proposed coarsening method SC compared with HEM. a Result of one coarsening step by HEM. The red and yellow nodes are eventually merged to the blue and green nodes, respectively, but it takes 5 more coarsening steps. b Result of one coarsening step by SC. The result is the same as (a) after 5 more coarsening steps.


Parallelization

We also improve MLR-MCL via multi-core parallelization of its three main operations: Regularize, Inflate, and Prune. For Regularize, we parallelize the computation by assigning columns of the resulting matrix to cores. In other words, for M3 = M1 × M2, we divide the computation as follows:

    M3(:, i) = M1 × M2(:, i),

where M_k(:, i) denotes the i-th column of M_k. Computing the i-th column of M3 is independent of computing the other columns j ≠ i of M3, and thus we distribute the columns of M2 to multiple cores while keeping M1 in shared memory. Inflate and Prune are themselves columnwise operations; thus, the computation on each column is assigned to a core.
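A minimal sketch of this column-wise split, using dense arrays for brevity; the actual implementation operates on sparse CSC matrices, and the names here are ours.

```java
import java.util.stream.IntStream;

// Sketch: M3 = M1 x M2 computed column by column, distributing columns of
// M2 across cores while M1 stays shared. Column j of M3 depends only on M1
// and column j of M2, so no synchronization between columns is needed.
public class ColumnParallelMultiply {
    static double[][] multiply(double[][] m1, double[][] m2) {
        int n = m1.length, p = m2[0].length;
        double[][] m3 = new double[n][p];
        IntStream.range(0, p).parallel().forEach(j -> {
            for (int i = 0; i < n; i++) {
                double s = 0.0;
                for (int k = 0; k < m2.length; k++) s += m1[i][k] * m2[k][j];
                m3[i][j] = s; // each thread writes a disjoint set of columns
            }
        });
        return m3;
    }
}
```

Because each thread owns a disjoint set of output columns, no locks are needed; the shared M1 is only read.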

For efficiency in memory usage, we use the CSC (Compressed Sparse Column) format [20] to represent a matrix, which requires much less memory for storing a sparse matrix than a two-dimensional array format. In essence, the CSC format stores only the nonzero values of a matrix. Note that this strategy is especially efficient for real world graphs, which are very sparse in general, e.g., |E| = O(|V|). Figure 3 shows the CSC format for an example matrix. To access the nonzero elements of the j-th column (1-based indexing), we 1) obtain a = colPtr[j] and b = colPtr[j + 1] − 1, and 2) for a ≤ i ≤ b, get val[i] = A(rowInd[i], j). For example, to access the first column, we first obtain a = 1 and b = 2. By checking val[i] and rowInd[i] for 1 ≤ i ≤ 2, we identify the two nonzero values 10 and 9 at the first and the fourth rows, respectively, of the first column; i.e., A(1, 1) = 10 and A(4, 1) = 9, since val[1] = 10 with rowInd[1] = 1 and val[2] = 9 with rowInd[2] = 4.
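The access pattern above can be sketched as follows, keeping the text's 1-based convention (index 0 of each array is unused); the class and method names are illustrative only.

```java
// Sketch: reading entry A(i, j) from a matrix stored in CSC form, where
// colPtr[j] .. colPtr[j+1]-1 index the nonzeros of column j (1-based).
public class CscDemo {
    static double get(double[] val, int[] rowInd, int[] colPtr, int i, int j) {
        int a = colPtr[j], b = colPtr[j + 1] - 1;
        for (int k = a; k <= b; k++)
            if (rowInd[k] == i) return val[k];
        return 0.0; // entries not stored are zero
    }
}
```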

Algorithm 3 shows our implementation of one iteration of parallelized R-MCL with the CSC format. In the algorithm, nonzeros(M, j) is the set of pairs (i, x) indicating

Fig 3. CSC format for sparse matrix representation. The top figure is a given matrix A, and the bottom one is the corresponding CSC representation of A. As the matrix gets larger and sparser, CSC becomes much more efficient, since the space complexity of CSC is proportional to the number of nonzero elements, not to the matrix size.

Algorithm 3: One Iteration of Parallelized R-MCL
Input: current flow matrix M, initial flow matrix M_G, and the number T of threads
Output: next flow matrix N

1:  L ← ∅
2:  parallel for j ← 1 to n, each of which is assigned to the (j mod T)-th thread, do
3:      c ← ∅
4:      foreach (i, x) ∈ nonzeros(M_G, j) do
5:          foreach (k, y) ∈ nonzeros(M, i) do
6:              if (k, z) ∈ c then c ← c ∪ {(k, z + xy)} \ {(k, z)}
7:              else c ← c ∪ {(k, xy)}
8:          end
9:      end
10:     c ← Prune(Inflate(c))
11:     L ← L ∪ {(j, c)}
12: end
13: colPtr ← 1_{n+1}
14: foreach (j, c) ∈ L do colPtr(j + 1) ← |c|
15: foreach i ← 2 to n + 1 do colPtr(i) ← colPtr(i) + colPtr(i − 1)
16: val ← 0_{colPtr(n+1)−1}; rowInd ← 0_{colPtr(n+1)−1}
17: parallel for (j, c) ∈ L, each of which is assigned to the (j mod T)-th thread, do
18:     ℓ ← colPtr(j)
19:     foreach (k, z) ∈ c in increasing order of k do
20:         val(ℓ) ← z
21:         rowInd(ℓ) ← k
22:         ℓ ← ℓ + 1
23:     end
24: end
25: N ← CSC(val, rowInd, colPtr)

nonzeros in the j-th column of the matrix M, i.e., M(i, j) = x ≠ 0; 0_n and 1_n denote the n-dimensional vectors all of whose elements are 0 and 1, respectively. Lines 4 to 9 correspond to Regularize: each thread, running in a dedicated core, performs c = M × M_G(:, j) for each column j assigned to it, and as a result (j, c) is added to the set of column results after applying Inflate and Prune to c. Note that although we do not detail Inflate and Prune in line 10, their implementation is trivial for each column c.
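For concreteness, columnwise Inflate and Prune might look as follows on a dense column; the inflation exponent r and the pruning threshold are illustrative parameters, and the actual PS-MCL implementation works on sparse columns.

```java
// Sketch: Inflate raises each entry of a column to the power r and
// renormalizes the column to sum 1; Prune drops entries below a threshold
// and renormalizes. Both touch only one column, hence the easy parallelism.
public class InflatePrune {
    static double[] inflate(double[] col, double r) {
        double[] out = new double[col.length];
        double s = 0.0;
        for (int i = 0; i < col.length; i++) { out[i] = Math.pow(col[i], r); s += out[i]; }
        for (int i = 0; i < col.length; i++) out[i] /= s;
        return out;
    }

    static double[] prune(double[] col, double threshold) {
        double[] out = new double[col.length];
        double s = 0.0;
        for (int i = 0; i < col.length; i++) { out[i] = col[i] < threshold ? 0.0 : col[i]; s += out[i]; }
        for (int i = 0; i < col.length; i++) out[i] /= s;
        return out;
    }
}
```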

Lines 13 to 16 correspond to constructing colPtr and allocating space for val and rowInd of the resulting matrix N, which is done sequentially. Afterwards, lines 17 to 24 correspond to filling val and rowInd using the computed columns and colPtr, in parallel over the columns. Note that


the positions of val and rowInd to be updated for each column are specified by colPtr, and they do not overlap.

Results

We present experimental results to answer the following questions:

Q1. How does PS-MCL improve the distribution of cluster sizes compared with MLR-MCL?

Q2. What is the performance of PS-MCL compared with MLR-MCL in quality and running time?

Q3. How much speedup do we obtain by parallelizing PS-MCL?

Q4. How accurate are the clusters found by PS-MCL compared to the ground truth?

Table 3 lists the datasets used in our experiments. We use various bio-networks for evaluating the clustering quality and the running time; the largest dataset, DBLP, is used for scalability experiments.

Experimental settings

Machine. All experiments are conducted on a workstation with two Intel(R) Xeon(R) E5-2630 v4 2.20GHz CPUs and 250GB of memory.

Evaluation criteria. To evaluate the quality of a clustering C for Q1 and Q2, we use the average NCut [15], defined as follows:

    AverageNCut(C) = (1 / |C|) Σ_{c ∈ C} NCut(c),

where

    NCut(c) = (Σ_{u ∈ c, v ∉ c} A(u, v)) / (Σ_{u ∈ c} degree(u)).
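For illustration, the NCut of a single cluster on an unweighted graph can be computed as below; the dense adjacency representation and names are our own simplification.

```java
// Sketch: NCut(c) = (weight of edges leaving the cluster) / (total degree
// of the cluster), for an unweighted graph given as a 0/1 adjacency matrix.
public class NCutDemo {
    static double ncut(int[][] a, boolean[] inCluster) {
        double cut = 0.0, deg = 0.0;
        for (int u = 0; u < a.length; u++) {
            if (!inCluster[u]) continue;
            for (int v = 0; v < a.length; v++) {
                deg += a[u][v];                    // contributes to degree(u)
                if (!inCluster[v]) cut += a[u][v]; // inter-cluster edge
            }
        }
        return cut / deg;
    }
}
```

On a path 0-1-2-3 with cluster {0, 1}, one edge leaves the cluster and the cluster's total degree is 3, giving NCut = 1/3.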

For answering Q4, we focus on the protein complex finding problem and use the accuracy measure defined by Hernandez et al. [4], as follows. Let G be the set of ground truth clusters (protein complexes); then the degree of overlap T_gc for every g ∈ G and c ∈ C is defined as

    T_gc = |g ∩ c|.

The accuracy ACC is the geometric mean of the Sensitivity SST and the Positive Predictive Value PPV:

    SST = (Σ_{g ∈ G} max_{c ∈ C} T_gc) / (Σ_{g ∈ G} |g|),

    PPV = (Σ_{c ∈ C} max_{g ∈ G} T_gc) / (Σ_{c ∈ C} |c|),

    ACC = (SST × PPV)^{1/2}.
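A small sketch of this accuracy computation from precomputed overlaps T_gc; the array-based interface is our own simplification.

```java
// Sketch: ACC = sqrt(SST * PPV), where SST sums each ground-truth
// complex's best overlap over the total complex size, and PPV sums each
// found cluster's best overlap over the total cluster size. t[g][c] = |g ∩ c|.
public class AccuracyDemo {
    static double acc(int[][] t, int[] gSize, int[] cSize) {
        double sstNum = 0.0, gTot = 0.0, ppvNum = 0.0, cTot = 0.0;
        for (int g = 0; g < t.length; g++) {
            int best = 0;
            for (int c = 0; c < t[g].length; c++) best = Math.max(best, t[g][c]);
            sstNum += best;
            gTot += gSize[g];
        }
        for (int c = 0; c < t[0].length; c++) {
            int best = 0;
            for (int g = 0; g < t.length; g++) best = Math.max(best, t[g][c]);
            ppvNum += best;
            cTot += cSize[c];
        }
        return Math.sqrt((sstNum / gTot) * (ppvNum / cTot));
    }
}
```

A perfect clustering, where every complex is exactly recovered, yields SST = PPV = ACC = 1.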

Parameter. We use a coarsening depth of 3 for PS-MCL and MLR-MCL, with which the improvement in quality and speed is large while the number of resulting clusters remains reasonable.

Performance of SC

In this section, we answer Q1 and Q2. Figure 4 shows the comparison of PS-MCL with MLR-MCL and MCL. The horizontal axis denotes the cluster size, and the vertical axis denotes the number of nodes belonging to clusters of that size.

Before going into details, we briefly summarize the results of MCL, which are invariant within each column of Fig 4 because MCL lacks the balancing concept. As discussed in [15], for all cases MCL suffers from the fragmentation problem: a large portion of nodes belongs to very tiny clusters of size 1–3. We provide the results of MCL as a baseline in the figure, and the following analysis focuses on comparing MLR-MCL and PS-MCL.

For all of our datasets, we observe the same patterns, described in the following Observations 1–3, though we present four representative results in Fig 4.

Table 3. Datasets used in our experiments

Name                 Nodes     Edges      Density (10^-4)   Description
BioPlex [23]         10,961    56,553     9.42              PPI network of Homo sapiens in the BioPlex 2.0 network
DIP-droso [21]       22,595    69,148     2.71              PPI network of Drosophila melanogaster
Drosophila [24]      6,600     19,820     9.1               PPI network of Drosophila melanogaster
MINT [25]            3,872     56,937     75.97             Combined PPI network of 325 different organisms
Yeast-1 [26]         2,353     36,790     132.95            Genome-scale epistasis map of Schizosaccharomyces pombe
Yeast-2 [27]         2,428     4,606      15.63             PPI network of Saccharomyces cerevisiae
Yeast-3 [28]         3,886     57,782     76.55             PPI network of Saccharomyces cerevisiae
Yeast-4 [29]         2,223     7,049      28.54             PPI network of Saccharomyces cerevisiae
DIP-yeast [21]       4,929     17,786     14.64             PPI network of Saccharomyces cerevisiae
BioGRID-homo [22]    20,837    288,224    13.28             PPI network of Homo sapiens
DBLP                 317,080   1,049,866  0.21              Coauthorship network


Fig 4. The number of nodes belonging to clusters of specific sizes. With balancing factor b = 0, i.e., without considering balance, PS-MCL and MLR-MCL do not find clusters properly; in general, they output too-large clusters, and even group all the nodes into one cluster for MINT. Using a balancing factor b > 0, both result in clusters whose sizes are concentrated around a small value; that value is larger in PS-MCL than in MLR-MCL. Observe that MLR-MCL makes a significantly large number of tiny clusters, including singletons, which are meaningless in clustering; in contrast, our proposed PS-MCL greatly reduces that number, to less than 5% of the number by MLR-MCL. For all cases, MCL suffers from the cluster fragmentation problem. For the other datasets in Table 3, we observe the same patterns described here.

Observation 1 (Too massive clusters without balancing). Without balancing, i.e., b = 0, a large number of nodes are assigned to one cluster; often, the entire graph becomes one cluster. A balancing factor of 1.0 to 1.5 resulted in a reasonable cluster size distribution for most of the networks.

The first row of Fig 4 corresponds to the result without balancing, i.e., b = 0. In this case, both PS-MCL and MLR-MCL group too many nodes into one cluster; especially on MINT, both output only one cluster containing all nodes in the graph. Figure 5 shows the ratio of the largest cluster size to the number of nodes for the bio-networks listed in Table 3. Note that the largest cluster sizes for BioPlex, Drosophila, MINT, Yeast-1, Yeast-2, and Yeast-3 are nearly the same as the total number of nodes; those for DIP-droso and Yeast-4 are relatively smaller but still occupy a large proportion of the entire size.

Observation 2 (PS-MCL prefers larger clusters than MLR-MCL). With b > 0, the cluster size with the maximum number of total nodes in PS-MCL is larger than that in MLR-MCL.

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 9 of 12

Fig 5. Ratio of the largest cluster size to the number of nodes. Without balancing, i.e., b = 0, PS-MCL and MLR-MCL group too many nodes into one cluster.

The second, third, and fourth rows of Fig 4 show the results of varying the balancing factor b ∈ {1, 1.5, 2}, respectively. In contrast to the case of b = 0, the cluster sizes of PS-MCL and MLR-MCL are concentrated around certain sizes, and the mode of the cluster size in PS-MCL is larger than that in MLR-MCL. The modes in PS-MCL are 10–20 for DIP-droso, BioPlex, and Drosophila, and 20–50 for MINT; those in MLR-MCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.

Observation 3 (PS-MCL shows less fragmentation than MLR-MCL). With b > 0, PS-MCL results in a significantly smaller number of fragmented clusters, whose sizes are 1–3, compared with MLR-MCL.

PS-MCL achieves concentrated cluster sizes as well as avoids fragmented clusters, while MLR-MCL and MCL still suffer from fragmentation. The number of nodes belonging to very small clusters in PS-MCL is much smaller than that in MLR-MCL; for instance, the number of nodes belonging to clusters of size 1–3 in PS-MCL is less than 5% of that in MLR-MCL for DIP with b = 1.5.

Observation 4 (PS-MCL better than MLR-MCL in time and NCut). PS-MCL results in a faster running time with a smaller NCut than MLR-MCL does.

Figure 6 shows the plot of running time versus average NCut. On average, PS-MCL runs faster, with running time down to 21% of MLR-MCL's, and outputs clusterings with a smaller average NCut, down to 87% of MLR-MCL's. For some cases MCL is faster than PS-MCL, but in all cases its average NCut is worse than that of PS-MCL.

Performance of parallelization

In this section, we answer Q3. We use b = 1.5 for PS-MCL and MLR-MCL. Figure 7 shows the performance evaluation results for PS-MCL on the bio-networks in Table 3 with

Fig 6. Time vs. average NCut for all datasets, with b = 1.5 for PS-MCL, MLR-MCL, and MCL. We use 4 cores for PS-MCL. Compared with MLR-MCL, PS-MCL's running time was down to 21% and its average NCut down to 87%, on average over all datasets. Although MCL is faster than PS-MCL for some networks, its average NCut is higher than that of PS-MCL for all networks.

an increasing number of cores. For all cases, PS-MCL gets faster as the number of cores increases.

To test the scalability more thoroughly, we use DBLP, the largest of our datasets, though it is not a bio-network. Figure 8a shows the speedup of PS-MCL with an increasing number of cores, compared with MLR-MCL and MCL. We use points, not lines, for MLR-MCL and MCL since they are single-core algorithms. PS-MCL outperforms MLR-MCL regardless of the number of cores, and becomes effectively faster as the number of cores increases. Precisely, the running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core.

Figure 8b shows the running time of PS-MCL with increasing data sizes, compared with MLR-MCL and MCL. Here we use 4 cores for PS-MCL. To obtain various sizes of graphs, we first take principal submatrices

Fig 7. Running time of PS-MCL for the bio-networks in Table 3 with an increasing number of cores. Note that the running time is effectively reduced as the number of cores increases.



Fig 8. Running time of PS-MCL, MLR-MCL, and MCL. a Running time on DBLP while varying the number of cores. The running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core; furthermore, the single-core performance of PS-MCL outperforms MLR-MCL. b Running time on DBLP while varying data sizes. The running times of all methods are linear in the number of edges in the graphs, and PS-MCL outperforms MLR-MCL in all cases.

from the adjacency matrix of DBLP with sizes 20%, 40%, 60%, 80%, and 100% of the total, and use their giant connected components. As shown in the figure, the running times of all the methods are linear in the graph sizes, and PS-MCL outperforms MLR-MCL at all scales. Note that although MCL is slightly faster than PS-MCL, MCL has the fragmentation problem and worse NCut, while PS-MCL has no fragmentation problem and better NCut, as shown in Figs 4 and 6.

Protein complex identification

In this section, we use the two bio-networks DIP-yeast [21] and BioGRID-homo [22], described in Table 3, to answer Q4 on the protein complex finding problem. The ground-truth protein complex information is extracted from CYC2008 2.0 [13] for DIP-yeast and CORUM [14] for BioGRID-homo. The complexes are used as reference clusters for measuring the accuracy.

Figure 9 shows the performance of PS-MCL while varying skip rates, in comparison with MLR-MCL and MCL. (Note: the skip rate is not applicable to MLR-MCL and MCL, leading to one accuracy value each; for clear performance comparison, we represent that value by the horizontal dashed line along the x-axis.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; namely, the smaller p, the more aggressive the coarsening. In the figure, PS-MCL performs best with moderate values of p: 0.6 and 0.7 for DIP-yeast and BioGRID-homo, respectively. For both networks, PS-MCL consistently outperforms MLR-MCL with 0.5 ≤ p ≤ 0.7; the accuracy of PS-MCL is higher than that of MLR-MCL by up to 3.66% for DIP-yeast and 8.24% for BioGRID-homo. This result makes sense, because SC with too large a p hardly reduces the graph in


Fig 9. Accuracy of PS-MCL with varying skip rates, MLR-MCL, and MCL. The performance of PS-MCL varies depending on p, but is better than that of MLR-MCL in the range 0.5 ≤ p ≤ 0.7. Compared to MCL, PS-MCL is slightly less accurate, but we observe that this is due to many dimer structures; excluding dimers, PS-MCL greatly outperforms MCL (c). a DIP-yeast. b BioGRID-homo. c BioGRID-homo excluding dimers.

size, while too small a p leads to overly large clusters due to aggressive coarsening.

PS-MCL achieves up to 3.32% higher accuracy than MCL for DIP-yeast, and 98.7% of the MCL accuracy for BioGRID-homo; the latter is due to the many dimer structures present in the CORUM database. Excluding dimers from the database, PS-MCL greatly outperforms MCL, as shown in Fig 9c. Although PS-MCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig 4) and performs poorly in the internal evaluation by average NCut (Fig 6), which assesses the potential for finding well-formed but undiscovered clusters.

Conclusion

In this paper, we propose PS-MCL, a parallel graph clustering method which gives superior performance on bio-networks. PS-MCL includes two enhancements over previous methods. First, PS-MCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLR-MCL on real-world networks; SC allows merging multiple nodes at a time, reducing the graph size more quickly and making super nodes more cohesive than HEM used in MLR-MCL. Second, PS-MCL provides a multi-core parallel algorithm for clustering to increase scalability. Extensive

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 11 of 12

experiments show that PS-MCL results in clusters that generally have larger sizes than those by MLR-MCL, and also greatly alleviates the fragmentation problem. Moreover, PS-MCL finds clusters whose quality is better than those by MLR-MCL under both internal (average NCut) and external (reference clusters) criteria. Also, as more cores are used, PS-MCL gets faster, and it outperforms MLR-MCL in speed even with a single core.

PS-MCL's capability to quickly find mid-size clusters in large-scale bio-networks has a wide range of applicability in systems biology. Although we have only shown that PS-MCL effectively finds mid-size protein complexes on two protein-protein interaction networks compared to existing topology-based clustering algorithms, we believe that it can be effectively applied to function prediction, disease module detection, and other systems biology analyses.

Availability and requirements
Project name: PS-MCL.
Project home page: https://github.com/leesael/PS-MCL.
Operating system(s): Platform independent (tested on Ubuntu).
Programming language: Java.
Other requirements: Java 1.8 or higher.
License: BSD.
Any restrictions to use by non-academics: licence needed.

Abbreviations
HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsened MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements
Not applicable.

Funding
Publication of this article has been funded by National Research Foundation of Korea grants funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials
The datasets analysed during the current study are available in the GitHub repository https://github.com/leesael/PS-MCL.

About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13.

Authors' contributions
YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published 24 July 2019

References
1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.

2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.

3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.

4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.

5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.

6. Tadaka S, Kinoshita K. NCMine: core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.

7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Xie ZD, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:16199.

8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.

9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.

10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.

11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.

12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.

13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.

14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2009;38(suppl_1):497–501.

15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.

16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.

17. Lim Y, Kang U, Faloutsos C. SlashBurn: graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.

18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 12 of 12

19 Abou-Rjeili A Karypis G Multilevel algorithms for partitioning power-lawgraphs In Proceedings of the 20th International Conference on Paralleland Distributed Processing Washington DC IEEE Computer Society2006 p 124

20 Duff IS Grimes RG Lewis JG Sparse matrix test problems ACM TransMath Softw 198915(1)1ndash14

21 Salwinski L Miller CS Smith AJ Pettit FK Bowie JU Eisenberg D Thedatabase of interacting proteins 2004 update Nucleic Acids Res200432(suppl 1)449ndash51

22 Wang J Vasaikar S Shi Z Greer M Zhang B Webgestalt 2017 a morecomprehensive powerful flexible and interactive gene set enrichmentanalysis toolkit Nucleic Acids Res 201745(W1)W130ndash7

23 Huttlin EL Bruckner RJ Paulo JA Cannon JR Ting L Baltier K Colby GGebreab F Gygi MP Parzen H et al Architecture of the humaninteractome defines protein communities and disease networks Nature2017545(7655)505ndash9

24 Giot L Bader JS Brouwer C Chaudhuri A Kuang B Li Y Hao Y Ooi CGodwin B Vitols E et al A protein interaction map of drosophilamelanogaster Science 2003302(5651)1727ndash36

25 Chatr-Aryamontri A Ceol A Palazzi LM Nardelli G Schneider MVCastagnoli L Cesareni G Mint the molecular interaction databaseNucleic Acids Res 200735(suppl 1)572ndash4

26 Ryan CJ Roguev A Patrick K Xu J Jahari H Tong Z Beltrao P Shales MQu H Collins SR et al Hierarchical modularity and the evolution ofgenetic interactomes across species Mol Cell 201246(5)691ndash704

27 Chen J Hsu W Lee ML Ng S-K Increasing confidence of proteininteractomes using network topological metrics Bioinformatics200622(16)1998ndash2004

28 Costanzo M Baryshnikova A Bellay J Kim Y Spear ED Sevier CS DingH Koh JL Toufighi K Mostafavi S et al The genetic landscape of a cellScience 2010327(5964)425ndash31

29 Bu D Zhao Y Cai L Xue H Zhu X Lu H Zhang J Sun S Ling L ZhangN et al Topological structure analysis of the proteinndashprotein interactionnetwork in budding yeast Nucleic Acids Res 200331(9)2443ndash50

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
  • Background
  • Preliminaries
    • Markov clustering (MCL)
    • Regularized-MCL (R-MCL)
    • Multi-level R-MCL (MLR-MCL)
  • Implementation
    • Shotgun coarsening (SC)
      • Skip Rate
      • Cohesive super node
      • Quickly reduced graph
    • Parallelization
  • Results
    • Experimental settings
    • Performance of SC
    • Performance of parallelization
    • Protein complex identification
  • Conclusion
  • Availability and requirements
  • Abbreviations
  • Acknowledgements
  • Funding
  • Availability of data and materials
  • About this supplement
  • Authors' contributions
  • Ethics approval and consent to participate
  • Consent for publication
  • Competing interests
  • Publisher's Note
  • Author details
  • References

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 5 of 12

The second step determines super nodes and their associated weights. Note that for (u, v) ∈ F, u and v should belong to the same super node by definition. By mathematical induction, two nodes belong to the same super node if and only if they are reachable along edges in F. As a result, we can identify the set V′ of super nodes by computing the connected components of the graph (V, F) (line 7).

After finding V′, we determine the weights of the super nodes in V′ and of their self-loops as follows. For each super node S ∈ V′, its node weight Z′(S) is defined as the sum of the weights of the nodes in V that belong to S (line 9). The self-edge (S, S) is added to E′ (line 10), and its weight W′(S, S) is defined as the sum of the weights of the edges in E whose end nodes both belong to S (line 11).

The last step determines the non-self edges between nodes in V′ and their edge weights as follows. For each unordered pair (S, T) ∈ V′ × V′, find the set H of edges in E with one end node in S and the other in T (line 15). If H ≠ ∅ (line 16), (S, T) is added to E′ (line 17) and W′(S, T) is defined as the sum of the weights of the edges in H (line 18). Otherwise, there is no edge between S and T.

Skip Rate In practice, a graph can be reduced too quickly by SC if it has super-hub nodes. To coarsen the graph to a reasonable size, we propose to randomly skip merging while iterating over nodes in SC; i.e., with probability 0 ≤ p < 1, lines 3–5 are not executed. We call p the skip rate and use p = 0.5 in this paper.
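As a concrete illustration, one SC pass with a skip rate can be sketched as follows. This is a minimal Python sketch, not the authors' Java implementation; the function names and the union-find bookkeeping are our own simplifications of the merge-edge/connected-components formulation above.

```python
import random

def shotgun_coarsen(nodes, edges, weight, skip_rate=0.5, seed=None):
    """One SC pass: visit each node and, unless skipped (with probability
    skip_rate), merge it into the neighboring super node with the largest
    total edge weight. Merging into super nodes lets many nodes collapse
    together in one pass, unlike pairwise HEM."""
    rng = random.Random(seed)
    parent = {v: v for v in nodes}            # union-find: node -> parent

    def find(v):                              # root = current super node
        while parent[v] != v:
            parent[v] = parent[parent[v]]     # path halving
            v = parent[v]
        return v

    adj = {v: {} for v in nodes}
    for u, v in edges:
        adj[u][v] = adj[v][u] = weight[(u, v)]

    for v in nodes:
        if rng.random() < skip_rate:          # skip rate p: leave v alone
            continue
        acc, best, best_w = {}, None, 0.0     # total weight to each super node
        for u, w in adj[v].items():
            s = find(u)
            if s == find(v):
                continue
            acc[s] = acc.get(s, 0.0) + w
            if acc[s] > best_w:
                best, best_w = s, acc[s]
        if best is not None:
            parent[find(v)] = best            # merge v's group into the winner

    return {v: find(v) for v in nodes}

def coarsen_graph(super_of, edges, weight, node_weight):
    """Build super-node weights Z' and edge weights W' (self-loops included),
    mirroring the weight sums of the coarsening step."""
    Z, W = {}, {}
    for v, s in super_of.items():
        Z[s] = Z.get(s, 0) + node_weight[v]
    for u, v in edges:
        s, t = super_of[u], super_of[v]
        key = (s, s) if s == t else (min(s, t), max(s, t))
        W[key] = W.get(key, 0.0) + weight[(u, v)]
    return Z, W
```

On a star graph, a single such pass (with skip rate 0) merges every leaf into the hub's super node at once, which is the behavior contrasted with HEM below.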

Cohesive super node
The goal of coarsening is to merge tightly connected nodes into one super node. In this respect, HEM may prevent a super node from being cohesive. Figure 1a shows the ideal coarsening for a given graph. Assume that, for the first merging, the leftmost two nodes are merged as shown in Fig. 1b, and the next node to be merged is u. If we use HEM, v is chosen since it is the only candidate, leading to Fig. 1c. Note that although u has more edges to the green super node than to v, it must be merged with v; obviously, this result is undesirable for good coarsening. In contrast, SC (Fig. 1d) chooses the green super node for u, since the weight to the green node is larger than that to v. As a result, SC generates more cohesive super nodes than HEM, leading to a high-quality coarsened graph.

Fig. 1 Effectiveness of our proposed SC method compared with HEM. a Ideal coarsening for the graph. b Coarsening in progress: for the first merging, the leftmost two nodes are chosen. c For the second node to be merged, u is chosen; since v is the only candidate for merging in HEM, u and v are merged into a super node. d In SC, the green super node is also a candidate for u; since the weight of u to the green node is larger than that to v, u is merged into the green super node

Quickly reduced graph
Ideally, at each step of the coarsening, the number of nodes should halve, but that does not happen for real-world graphs due to their highly skewed degree distributions [19]. In other words, a large number of coarsening steps are needed to obtain a coarsened graph of a manageable size, requiring large memory space for storing the graphs themselves and the node merging information. This problem arises mainly from star-like structures, as depicted in Fig. 2a. The red and yellow nodes are eventually merged with the blue and green groups, respectively, but this takes 5 more coarsening steps because only two nodes can be merged at a time. Note that each additional coarsening step needs space to store one graph and a mapping from nodes to super nodes; if the graph size is not effectively reduced, the required space grows greatly with the coarsening depth. This inefficiency is resolved by SC, as shown in Fig. 2b: whereas 5 more coarsening steps are required with HEM, only one step is enough in SC.

Fig. 2 Efficiency of our proposed coarsening method SC compared with HEM. a Result of one coarsening step by HEM: the red and yellow nodes are eventually merged to the blue and green nodes, respectively, but it takes 5 more coarsening steps. b Result of one coarsening step by SC: the result is the same as (a) after 5 more coarsening steps


Parallelization
We also improve MLR-MCL via multi-core parallelization of its three main operations: Regularize, Inflate, and Prune. For Regularize, we parallelize the computation by assigning columns of the resulting matrix to cores. In other words, for M3 = M1 × M2, we divide the computation as follows:

M3(:, i) = M1 × M2(:, i)

where Mk(:, i) denotes the i-th column of Mk. Computing the i-th column of M3 is independent of computing the other columns j ≠ i of M3, and thus we distribute the columns of M2 to multiple cores while keeping M1 in shared memory. Inflate and Prune are themselves columnwise operations, so the computation on each column is simply assigned to a core.
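The column-wise decomposition can be illustrated with a small sketch. This is an illustrative Python version (the paper's implementation is in Java); CPython threads do not give true multi-core speedup for pure-Python arithmetic because of the GIL, so this only demonstrates the decomposition, with M1 shared read-only across workers.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_columnwise(M1, M2, threads=4):
    """Compute M3 = M1 x M2 column by column: M3(:, j) = M1 x M2(:, j).
    Columns of the result are independent, so each is handed to a worker
    while M1 stays in shared memory."""
    n, m, p = len(M1), len(M2), len(M2[0])

    def column(j):                 # one unit of work: the j-th result column
        return [sum(M1[i][k] * M2[k][j] for k in range(m)) for i in range(n)]

    with ThreadPoolExecutor(max_workers=threads) as pool:
        cols = list(pool.map(column, range(p)))   # cols[j] == M3(:, j)

    # reassemble the row-major result from the per-column answers
    return [[cols[j][i] for j in range(p)] for i in range(n)]
```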

For memory efficiency, we use the CSC (Compressed Sparse Column) format [20] to represent a matrix, which requires much less memory for a sparse matrix than a two-dimensional array format. In essence, the CSC format stores only the nonzero values of a matrix. Note that this strategy is especially efficient for real-world graphs, which are very sparse in general, e.g., |E| = O(|V|). Figure 3 shows the CSC format for an example matrix. To access the nonzero elements of the j-th column (1-based indexing), we 1) obtain a = colPtr[j] and b = colPtr[j + 1] − 1, and 2) for a ≤ i ≤ b, get val[i] = A(rowInd[i], j). For example, to access the first column, we first obtain a = 1 and b = 2. By checking val[i] and rowInd[i] for 1 ≤ i ≤ 2, we identify the two nonzero values 10 and 9 at the first and fourth rows, respectively, of the first column, i.e., A(1, 1) = 10 and A(4, 1) = 9, since val[1] = 10 with rowInd[1] = 1 and val[2] = 9 with rowInd[2] = 4.
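The column access rule can be sketched as follows. The arrays below reuse the two nonzeros of the first column described above (10 at row 1, 9 at row 4) and add a made-up second column, since the full example matrix of Fig. 3 is not reproduced here.

```python
def csc_column(colPtr, rowInd, val, j):
    """Return {row: value} for the j-th column of a CSC matrix, using the
    paper's 1-based convention; Python lists are 0-based, hence the -1s."""
    a = colPtr[j - 1]                  # a = colPtr[j]
    b = colPtr[j] - 1                  # b = colPtr[j+1] - 1
    return {rowInd[i - 1]: val[i - 1] for i in range(a, b + 1)}

# First column as in the text: values 10 and 9 at rows 1 and 4; the second
# column (value 7 at row 2) is invented to complete the example.
colPtr, rowInd, val = [1, 3, 4], [1, 4, 2], [10, 9, 7]
```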

Fig. 3 CSC format for sparse matrix representation. The top figure is a given matrix A, and the bottom one is the corresponding CSC representation of A. As the matrix gets larger and sparser, CSC becomes much more efficient, since the space complexity of CSC is proportional to the number of nonzero elements, not to the matrix size

Algorithm 3 shows our implementation of one iteration of parallelized R-MCL with the CSC format. In the algorithm, nonzeros(M, j) is the set of pairs (i, x) indicating the nonzeros in the j-th column of the matrix M, i.e., M(i, j) = x; 0_n and 1_n denote n-dimensional vectors whose elements are all 0 and all 1, respectively. Lines 4 to 9 correspond to Regularize: each thread, running in a dedicated core, computes c = M × MG(:, j) for each column j assigned to it, and the pair (j, c) is added to the set Ψ of computed columns after Inflate and Prune are applied to c. Although we do not spell out Inflate and Prune in line 10, their implementation on a single column c is straightforward.

Algorithm 3 One Iteration of Parallelized R-MCL
Input: current flow matrix M, initial flow matrix MG, and the number T of threads
Output: next flow matrix N
1:  Ψ ← ∅
2:  parallel for j ← 1 to n, each assigned to the (j mod T)-th thread, do
3:      c ← ∅
4:      foreach (i, x) ∈ nonzeros(MG, j) do
5:          foreach (k, y) ∈ nonzeros(M, i) do
6:              if (k, z) ∈ c then c ← c ∪ {(k, z + xy)} \ {(k, z)}
7:              else c ← c ∪ {(k, xy)}
8:          end
9:      end
10:     c ← Prune(Inflate(c))
11:     Ψ ← Ψ ∪ {(j, c)}
12: end
13: colPtr ← 1_{n+1}
14: foreach (j, c) ∈ Ψ do colPtr(j + 1) ← |c|
15: foreach i ← 2 to n + 1 do colPtr(i) ← colPtr(i) + colPtr(i − 1)
16: val ← rowInd ← 0_{colPtr(n+1) − 1}
17: parallel for (j, c) ∈ Ψ, each assigned to the (j mod T)-th thread, do
18:     ℓ ← colPtr(j)
19:     foreach (k, z) ∈ c in increasing order of k do
20:         val(ℓ) ← z
21:         rowInd(ℓ) ← k
22:         ℓ ← ℓ + 1
23:     end
24: end
25: N ← CSC(val, rowInd, colPtr)

Lines 13 to 16 correspond to constructing colPtr and allocating space for val and rowInd of the resulting matrix N, which is done sequentially. Afterwards, lines 17 to 24 correspond to filling val and rowInd using the computed columns and colPtr, in parallel over the columns. Note that


the positions of val and rowInd to be updated for each column are specified by colPtr, and they do not overlap.
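For completeness, the per-column Inflate and Prune that Algorithm 3 leaves implicit might look as follows in a Python sketch; the inflation exponent r and the pruning threshold below are illustrative values, not the paper's settings.

```python
def inflate(col, r=2.0):
    """Inflate one sparse column {row: value}: raise each entry to the
    power r, then renormalize so the column sums to 1."""
    powed = {k: v ** r for k, v in col.items()}
    total = sum(powed.values())
    return {k: v / total for k, v in powed.items()}

def prune(col, threshold=1e-4):
    """Drop entries below threshold and renormalize the survivors,
    keeping the column a probability distribution."""
    kept = {k: v for k, v in col.items() if v >= threshold}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}
```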

Results
We present experimental results to answer the following questions:

Q1 How does PS-MCL improve the distribution of cluster sizes compared with MLR-MCL?

Q2 What is the performance of PS-MCL compared with MLR-MCL in quality and running time?

Q3 How much speedup do we obtain by parallelizing PS-MCL?

Q4 How accurate are the clusters found by PS-MCL compared to the ground truth?

Table 3 lists the datasets used in our experiments. We use various bio-networks for evaluating clustering quality and running time; the largest dataset, DBLP, is used for scalability experiments.

Experimental settings
Machine All experiments are conducted on a workstation with two Intel(R) Xeon(R) E5-2630 v4 2.20GHz CPUs and 250GB memory.

Evaluation criteria To evaluate the quality of a clustering C for Q1 and Q2, we use the average NCut [15], defined as follows:

AverageNCut(C) = (1 / |C|) Σ_{c∈C} NCut(c)

where

NCut(c) = ( Σ_{u∈c, v∉c} A(u, v) ) / ( Σ_{u∈c} degree(u) )
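A direct Python rendering of the average NCut, assuming an adjacency-dict graph representation (our own choice of data structure, not from the paper):

```python
def average_ncut(clusters, adj):
    """Average NCut of a clustering. adj maps each node to {neighbor: weight};
    for a cluster c, NCut(c) = (weight of edges leaving c) / (total degree of c)."""
    def ncut(c):
        cut = sum(w for u in c for v, w in adj[u].items() if v not in c)
        deg = sum(w for u in c for w in adj[u].values())
        return cut / deg
    return sum(ncut(c) for c in clusters) / len(clusters)
```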

To answer Q4, we focus on the protein complex finding problem and use the accuracy measure defined by Hernandez et al. [4] as follows. Let G be the set of ground-truth clusters (protein complexes); then the degree of overlap T_{gc} for every g ∈ G and c ∈ C is defined as

T_{gc} = |g ∩ c|

The accuracy ACC is the geometric mean of the Sensitivity SST and the Positive Predictive Value PPV:

SST = ( Σ_{g∈G} max_{c∈C} T_{gc} ) / ( Σ_{g∈G} |g| )

PPV = ( Σ_{c∈C} max_{g∈G} T_{gc} ) / ( Σ_{c∈C} |c| )

ACC = (SST × PPV)^{1/2}
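These definitions translate directly into code; the following Python sketch (our own, with clusters represented as node sets) computes SST, PPV, and ACC:

```python
def accuracy(ground_truth, clusters):
    """ACC = sqrt(SST * PPV) with overlap T_gc = |g ∩ c|, following the
    definitions above; ground_truth and clusters are lists of node sets."""
    T = [[len(g & c) for c in clusters] for g in ground_truth]   # T[g][c]
    sst = sum(max(row) for row in T) / sum(len(g) for g in ground_truth)
    ppv = sum(max(T[i][j] for i in range(len(ground_truth)))
              for j in range(len(clusters))) / sum(len(c) for c in clusters)
    return (sst * ppv) ** 0.5
```

A perfect clustering gives ACC = 1; splitting a complex lowers SST, and padding clusters with extra proteins lowers PPV.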

Parameter We use a coarsening depth of 3 for PS-MCL and MLR-MCL, with which the improvement in quality and speed is large while the number of resulting clusters remains reasonable.

Performance of SC
In this section, we answer Q1 and Q2. Figure 4 compares PS-MCL with MLR-MCL and MCL. The horizontal axis denotes the cluster size, and the vertical axis denotes the number of nodes belonging to clusters of that size.

Before going into details, we briefly summarize the results of MCL, which are invariant within each column of Fig. 4 because MCL lacks the concept of balancing. As discussed in [15], for all cases MCL suffers from the fragmentation problem: a large portion of nodes belongs to very tiny clusters of size 1–3. We provide the results of MCL as a baseline in the figure; the following analysis focuses on comparing MLR-MCL and PS-MCL.

For all of our datasets, we observe the same patterns described in the following Observations 1–3, though we present only four representative results in Fig. 4.

Table 3 Description of the datasets used in our experiments

Name               Nodes    Edges      Density (10⁻⁴)  Description
BioPlex [23]       10961    56553      9.42            PPI network of Homo sapiens in the BioPlex 2.0 network
DIP-droso [21]     22595    69148      2.71            PPI network of Drosophila melanogaster
Drosophila [24]    6600     19820      9.1             PPI network of Drosophila melanogaster
MINT [25]          3872     56937      75.97           Combined PPI network of 325 different organisms
Yeast-1 [26]       2353     36790      132.95          Genome-scale epistasis map of Schizosaccharomyces pombe
Yeast-2 [27]       2428     4606       15.63           PPI network of Saccharomyces cerevisiae
Yeast-3 [28]       3886     57782      76.55           PPI network of Saccharomyces cerevisiae
Yeast-4 [29]       2223     7049       28.54           PPI network of Saccharomyces cerevisiae
DIP-yeast [21]     4929     17786      14.64           PPI network of Saccharomyces cerevisiae
BioGRID-homo [22]  20837    288224     13.28           PPI network of Homo sapiens
DBLP               317080   1049866    0.21            Coauthorship network


Fig. 4 The number of nodes belonging to clusters of specific sizes. With balancing factor b = 0, i.e., without considering balance, PS-MCL and MLR-MCL do not find clusters properly: in general they output too-large clusters, and even group all the nodes into one cluster for MINT. Using a balancing factor b > 0, both produce clusters whose sizes are concentrated around a small value; that value is larger in PS-MCL than in MLR-MCL. Observe that MLR-MCL makes a significantly large number of tiny clusters, including singletons, which are meaningless in clustering; in contrast, our proposed PS-MCL greatly reduces that number, to less than 5% of the number by MLR-MCL. For all cases, MCL suffers from the cluster fragmentation problem. For the other datasets in Table 3, we observe the same patterns described here

Observation 1 (Too massive a cluster without balancing) Without balancing, i.e., b = 0, a large number of nodes are assigned to one cluster; often the entire graph becomes one cluster. A balancing factor of 1.0 to 1.5 resulted in a reasonable cluster size distribution for most of the networks.

The first row of Fig. 4 corresponds to the result without balancing, i.e., b = 0. In this case, both PS-MCL and MLR-MCL group too many nodes into one cluster; especially on MINT, both output only one cluster containing all nodes in the graph. Figure 5 shows the ratio of the largest cluster size to the number of nodes for the bio-networks listed in Table 3. Note that the largest cluster sizes for BioPlex, Drosophila, MINT, Yeast-1, Yeast-2, and Yeast-3 are nearly the same as the total numbers of nodes; those for DIP-droso and Yeast-4 are relatively smaller but still occupy a large proportion of the entire size.

Observation 2 (PS-MCL preferring larger clusters than MLR-MCL) With b > 0, the cluster size with the maximum number of total nodes in PS-MCL is larger than that in MLR-MCL.


Fig. 5 Ratio of the largest cluster size to the number of nodes. Without balancing, i.e., b = 0, PS-MCL and MLR-MCL group too many nodes into one cluster

The second, third, and fourth rows of Fig. 4 show the results of varying the balancing factor b ∈ {1, 1.5, 2}, respectively. In contrast to the case of b = 0, the cluster sizes of PS-MCL and MLR-MCL are concentrated on certain sizes, and the mode of the cluster size in PS-MCL is larger than that in MLR-MCL: the modes in PS-MCL are 10–20 for DIP-droso, BioPlex, and Drosophila, and 20–50 for MINT, while those in MLR-MCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.

Observation 3 (PS-MCL with less fragmentation than MLR-MCL) With b > 0, PS-MCL results in a significantly smaller number of fragmented clusters, whose sizes are 1–3, compared with MLR-MCL.

PS-MCL achieves concentrated cluster sizes and also avoids fragmented clusters, while MLR-MCL and MCL still suffer from fragmentation. The number of nodes belonging to very small clusters in PS-MCL is much smaller than that in MLR-MCL; for instance, the number of nodes belonging to clusters of size 1–3 in PS-MCL is less than 5% of that in MLR-MCL for DIP with b = 1.5.

Observation 4 (PS-MCL better than MLR-MCL in time and NCut) PS-MCL results in a faster running time and a smaller NCut than MLR-MCL.

Figure 6 plots running time versus average NCut. On average over all datasets, PS-MCL reduces running time down to 21% and average NCut down to 87% of those of MLR-MCL. For some cases MCL is faster than PS-MCL, but in all cases its average NCut is worse than that of PS-MCL.

Performance of parallelization
In this section, we answer Q3. We use b = 1.5 for PS-MCL and MLR-MCL. Figure 7 shows the performance evaluation results for PS-MCL on the bio-networks in Table 3 with increasing numbers of cores: for all cases, PS-MCL gets faster as the number of cores increases.

Fig. 6 Time vs. average NCut for all datasets, with b = 1.5 for PS-MCL, MLR-MCL, and MCL; we use 4 cores for PS-MCL. On average over all datasets, compared with MLR-MCL, PS-MCL reduced running time down to 21% and average NCut down to 87%. Although MCL is faster than PS-MCL for some networks, its average NCut is higher than that of PS-MCL for all networks

To test scalability more thoroughly, we use DBLP, the largest of our datasets, though it is not a bio-network. Figure 8a shows the speedup of PS-MCL with an increasing number of cores, compared with MLR-MCL and MCL; we use points, not lines, for MLR-MCL and MCL since they are single-core algorithms. PS-MCL outperforms MLR-MCL regardless of the number of cores and speeds up effectively as the number of cores increases: precisely, the running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with a single core.

Figure 8b shows the running time of PS-MCL with increasing data sizes, compared with MLR-MCL and MCL; here we use 4 cores for PS-MCL. To obtain graphs of various sizes, we take principal submatrices of the adjacency matrix of DBLP with 20%, 40%, 60%, 80%, and 100% of the total size and use their giant connected components. As shown in the figure, the running times of all methods are linear in the number of edges, and PS-MCL outperforms MLR-MCL at all scales. Note that although MCL is slightly faster than PS-MCL, MCL has the fragmentation problem and a worse NCut, while PS-MCL has no fragmentation problem and a better NCut, as shown in Figs. 4 and 6.

Fig. 7 Running time of PS-MCL for the bio-networks in Table 3 with an increasing number of cores. Note that the running time is effectively reduced as the number of cores increases

Fig. 8 Running time of PS-MCL, MLR-MCL, and MCL. a Running time on DBLP with a varying number of cores: the running time of PS-MCL is reduced to 81% and 55% with 2 and 4 cores, respectively, compared with a single core; furthermore, the single-core performance of PS-MCL outperforms MLR-MCL. b Running time on DBLP with varying data sizes: the running times of all methods are linear in the number of edges, and PS-MCL outperforms MLR-MCL in all cases

Protein complex identification
In this section, we use the two bio-networks DIP-yeast [21] and BioGRID-homo [22] described in Table 3 to answer Q4 on the protein complex finding problem. The ground-truth protein complexes are extracted from CYC2008 2.0 [13] for DIP-yeast and CORUM [14] for BioGRID-homo; the complexes are used as reference clusters for measuring accuracy.

Figure 9 shows the performance of PS-MCL with varying skip rates, in comparison with MLR-MCL and MCL. (Note: the skip rate is not applicable to MLR-MCL and MCL, each of which yields one accuracy value; for clear comparison, we represent that value by the horizontal dashed line along the x-axis.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; namely, the smaller p, the more aggressive the coarsening. In the figure, PS-MCL performs best with moderate values of p: 0.6 and 0.7 for DIP-yeast and BioGRID-homo, respectively. For both networks, PS-MCL consistently outperforms MLR-MCL with 0.5 ≤ p ≤ 0.7; the accuracy of PS-MCL is higher than that of MLR-MCL by up to 36.6% for DIP-yeast and 82.4% for BioGRID-homo. This result makes sense because SC with too large a p hardly reduces the graph in size, while too small a p leads to too-large clusters due to overly aggressive coarsening.

Fig. 9 Accuracy of PS-MCL with varying skip rates, compared to MLR-MCL and MCL. The performance of PS-MCL varies depending on p but is better than that of MLR-MCL in the range 0.5 ≤ p ≤ 0.7. Compared to MCL, PS-MCL is slightly less accurate, but we observe that this is due to many dimer structures; excluding dimers, PS-MCL greatly outperforms MCL (c). a DIP-yeast. b BioGRID-homo. c BioGRID-homo without dimers

PS-MCL achieves up to 33.2% higher accuracy than MCL for DIP-yeast, and 98.7% of the MCL accuracy for BioGRID-homo. The latter is due to the many dimer structures present in the CORUM database: excluding dimers from the database, PS-MCL greatly outperforms MCL, as shown in Fig. 9c. Although PS-MCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig. 4) and performs poorly in the internal evaluation by average NCut (Fig. 6), which assesses the potential for finding well-formed but undiscovered clusters.

Conclusion
In this paper, we propose PS-MCL, a parallel graph clustering method that gives superior performance on bio-networks. PS-MCL includes two enhancements over previous methods. First, PS-MCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLR-MCL on real-world networks: SC allows merging multiple nodes at a time, reducing the graph size more quickly and making super nodes more cohesive than HEM used in MLR-MCL. Second, PS-MCL provides a multi-core parallel algorithm for clustering to increase scalability. Extensive


experiments show that PS-MCL results in clusters that generally have larger sizes than those of MLR-MCL and also greatly alleviates the fragmentation problem. Moreover, PS-MCL finds clusters whose quality is better than that of MLR-MCL by both internal (average NCut) and external (reference clusters) criteria. Also, as more cores are used, PS-MCL gets faster, and it outperforms MLR-MCL in speed even with a single core.

PS-MCL's capability to quickly find mid-size clusters in large-scale bio-networks has a wide range of applicability in systems biology. Although we have only shown that PS-MCL effectively finds mid-size protein complexes on two protein-protein interaction networks compared to existing topology-based clustering algorithms, we believe that it can be effectively applied to function prediction, disease module detection, and other systems biology analyses.

Availability and requirements
Project name: PS-MCL
Project home page: https://github.com/leesael/PS-MCL
Operating system(s): Platform independent (tested on Ubuntu)
Programming language: Java
Other requirements: Java 1.8 or higher
License: BSD
Any restrictions to use by non-academics: licence needed

Abbreviations
HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsened MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements
Not applicable.

Funding
Publication of this article has been funded by National Research Foundation of Korea grants funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials
The datasets analysed during the current study are available in the GitHub repository https://github.com/leesael/PS-MCL

About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13

Authors' contributions
YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published: 24 July 2019

References
1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.
2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: Modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.
3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.
4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.
5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.
6. Tadaka S, Kinoshita K. NCMine: Core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.
7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Xie Y, Dai Z, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:16199.
8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.
9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.
10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.
11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.
12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.
13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.
14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2009;38(suppl_1):497–501.
15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: Applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.
16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.
17. Lim Y, Kang U, Faloutsos C. SlashBurn: Graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.
18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.
19. Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society; 2006. p. 124.
20. Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989;15(1):1–14.
21. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.
22. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.
23. Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017;545(7655):505–9.
24. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–36.
25. Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35(suppl 1):572–4.
26. Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012;46(5):691–704.
27. Chen J, Hsu W, Lee ML, Ng S-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006;22(16):1998–2004.
28. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
29. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res. 2003;31(9):2443–50.


Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 6 of 12

Parallelization
We also improve MLR-MCL via multi-core parallelization for its three main operations: Regularize, Inflate, and Prune. For Regularize, we parallelize the computation by assigning columns of the resulting matrix to cores. In other words, for M3 = M1 × M2, we divide the computation as follows:

M3(:, i) = M1 × M2(:, i)

where Mk(:, i) denotes the i-th column of Mk. Computing the i-th column of M3 is independent of computing the other columns j ≠ i of M3, and thus we distribute the columns of M2 to multiple cores while keeping M1 in shared memory. Inflate and Prune are themselves column-wise operations; thus the computation on each column is assigned to a core.
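This column-wise decomposition can be sketched as follows. The sketch is illustrative Python, not the paper's Java implementation, and the function names are our own; note that in CPython, threads demonstrate the decomposition, while real speedup would require processes or a runtime without a global interpreter lock.

```python
# Column-wise decomposition of M3 = M1 x M2: each column M3(:, j) =
# M1 x M2(:, j) is computed independently, so columns of M2 can be
# distributed across threads while M1 stays in shared memory.
from concurrent.futures import ThreadPoolExecutor

def matvec(M1, col):
    # Dense matrix-vector product; M1 is a list of rows.
    return [sum(row[k] * col[k] for k in range(len(col))) for row in M1]

def parallel_matmul(M1, M2, threads=4):
    n_cols = len(M2[0])
    # Extract the columns of M2; each one is an independent unit of work.
    cols = [[M2[r][j] for r in range(len(M2))] for j in range(n_cols)]
    with ThreadPoolExecutor(max_workers=threads) as ex:
        out_cols = list(ex.map(lambda c: matvec(M1, c), cols))
    # Reassemble the row-major result from the independently computed columns.
    return [[out_cols[j][r] for j in range(n_cols)] for r in range(len(M1))]

M1 = [[1, 0], [0, 2]]
M2 = [[3, 4], [5, 6]]
print(parallel_matmul(M1, M2))  # [[3, 4], [10, 12]]
```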

For efficiency in memory usage, we use the CSC (Compressed Sparse Column) format [20] to represent a matrix, which requires much less memory than a two-dimensional array format when storing a sparse matrix. In essence, the CSC format only stores the nonzero values of a matrix. Note that this strategy is especially efficient for real-world graphs, which are very sparse in general, e.g., |E| = O(|V|). Figure 3 shows the CSC format for an example matrix. To access the nonzero elements of the j-th column (1-based indexing), we 1) obtain a = colPtr[j] and b = colPtr[j + 1] − 1, and 2) for a ≤ i ≤ b, get val[i] = A(rowInd[i], j). For example, to access the first column, we first obtain a = 1 and b = 2. By checking val[i] and rowInd[i] for 1 ≤ i ≤ 2, we identify the two nonzero values, 10 and 9, at the first and the fourth rows of the first column, respectively, i.e., A(1, 1) = 10 and A(4, 1) = 9, since val[1] = 10 with rowInd[1] = 1 and val[2] = 9 with rowInd[2] = 4.
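The 1-based column access described above can be sketched as follows (illustrative Python, not the paper's Java code; the arrays are padded with a dummy entry at index 0 so the indices match the paper's 1-based convention):

```python
# CSC column access, mirroring the paper's description: for column j,
# a = colPtr[j], b = colPtr[j + 1] - 1, and val[i] = A(rowInd[i], j)
# for a <= i <= b.
def csc_column(val, rowInd, colPtr, j):
    """Return {row: value} for the nonzeros of column j (1-based)."""
    a = colPtr[j]
    b = colPtr[j + 1] - 1
    return {rowInd[i]: val[i] for i in range(a, b + 1)}

# Index 0 is a dummy so positions 1..nnz match the text.
# First column of the example matrix: A(1, 1) = 10, A(4, 1) = 9.
val    = [None, 10, 9]
rowInd = [None, 1, 4]
colPtr = [None, 1, 3]   # column 1 occupies positions 1..2

print(csc_column(val, rowInd, colPtr, 1))  # {1: 10, 4: 9}
```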

Algorithm 3 shows our implementation for one iteration of parallelized R-MCL with the CSC format. In the algorithm, nonzeros(M, j) is a set of pairs (i, x) indicating

Fig 3 CSC format for sparse matrix representation. The top figure is a given matrix A, and the bottom one is the corresponding CSC representation of A. As the matrix gets larger and sparser, CSC gets much more efficient, since the space complexity of CSC is proportional to the number of nonzero elements, not to the matrix size

Algorithm 3: One Iteration of Parallelized R-MCL
Input: current flow matrix M, initial flow matrix MG, and the number T of threads
Output: next flow matrix N

1:  S ← ∅
2:  parallel for j ← 1 to n, each assigned to the (j mod T)-th thread, do
3:    c ← ∅
4:    foreach (i, x) ∈ nonzeros(MG, j) do
5:      foreach (k, y) ∈ nonzeros(M, i) do
6:        if (k, z) ∈ c then c ← (c \ {(k, z)}) ∪ {(k, z + xy)}
7:        else c ← c ∪ {(k, xy)}
8:      end
9:    end
10:   c ← Prune(Inflate(c))
11:   S ← S ∪ {(j, c)}
12: end
13: colPtr ← 1_{n+1}
14: foreach (j, c) ∈ S do colPtr(j + 1) ← |c|
15: foreach i ← 2 to n + 1 do colPtr(i) ← colPtr(i) + colPtr(i − 1)
16: val ← rowInd ← 0_{colPtr(n+1) − 1}
17: parallel for (j, c) ∈ S, each assigned to the (j mod T)-th thread, do
18:   p ← colPtr(j)
19:   foreach (k, z) ∈ c in increasing order of k do
20:     val(p) ← z
21:     rowInd(p) ← k
22:     p ← p + 1
23:   end
24: end
25: N ← CSC(val, rowInd, colPtr)

nonzeros in the j-th column of the matrix M, i.e., M(i, j) = x. 0_n and 1_n denote n-dimensional vectors all of whose elements are 0 and 1, respectively. Lines 4 to 9 correspond to Regularize: each thread, running in a dedicated core, performs c = M × MG(:, j) for each column j assigned to it, and as a result the pair (j, c) is added to the result set after applying Inflate and Prune to c. Note that although we do not describe Inflate and Prune in Line 10, their implementation is trivial for each column c.
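The per-column work of Algorithm 3 (Regularize in lines 4–9, then Inflate and Prune in line 10) can be sketched as follows. This is an illustrative Python sketch on dict-based sparse columns, not the paper's Java implementation; the inflation exponent r = 2 and the fixed prune threshold eps are common defaults we assume here, not necessarily the paper's settings.

```python
# Per-column step of an R-MCL iteration on dict-based sparse data:
# M maps column index i -> {row: value}; MG_col is one column of MG.
def regularize_column(M, MG_col):
    # c = M x MG(:, j): accumulate x * y over nonzeros, as in lines 4-9.
    c = {}
    for i, x in MG_col.items():
        for k, y in M[i].items():
            c[k] = c.get(k, 0.0) + x * y
    return c

def inflate(c, r=2.0):
    # Raise entries to the r-th power and renormalize the column.
    powered = {k: v ** r for k, v in c.items()}
    s = sum(powered.values())
    return {k: v / s for k, v in powered.items()}

def prune(c, eps=1e-4):
    # Drop tiny entries, then renormalize so the column still sums to 1.
    kept = {k: v for k, v in c.items() if v >= eps}
    s = sum(kept.values())
    return {k: v / s for k, v in kept.items()}

M = {0: {0: 0.5, 1: 0.5}, 1: {1: 1.0}}
print(prune(inflate(regularize_column(M, {0: 1.0}))))  # {0: 0.5, 1: 0.5}
```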

Lines 13 to 16 correspond to constructing colPtr and allocating space for val and rowInd for the resulting matrix N, which is done sequentially. Afterwards, Lines 17 to 24 correspond to filling val and rowInd using the result set and colPtr, in parallel over the columns. Note that


the positions of val and rowInd to be updated for each column are specified in colPtr, and they do not overlap.

Results
We present experimental results to answer the following questions:

Q1. How does PS-MCL improve the distribution of cluster sizes compared with MLR-MCL?

Q2. What is the performance of PS-MCL compared with MLR-MCL in quality and running time?

Q3. How much speedup do we obtain by parallelizing PS-MCL?

Q4. How accurate are clusters found by PS-MCL compared to the ground truth?

Table 3 lists the datasets used in our experiments. We use various bio-networks for evaluating the clustering quality and the running time; the largest dataset, DBLP, is used for scalability experiments.

Experimental settings
Machine. All experiments are conducted on a workstation with two Intel(R) Xeon(R) E5-2630 v4 CPUs (2.20GHz) and 250GB of memory.

Evaluation criteria. To evaluate the quality of a clustering C for Q1 and Q2, we use the average NCut [15], defined as follows:

AverageNCut(C) = (1 / |C|) Σ_{c ∈ C} NCut(c)

where

NCut(c) = Σ_{u ∈ c, v ∉ c} A(u, v) / Σ_{u ∈ c} degree(u)
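The average NCut criterion can be sketched on a toy weighted graph as follows (illustrative Python; `ncut` and `average_ncut` are hypothetical helper names, not the paper's evaluation code):

```python
# Average NCut of a clustering, for a graph given as an adjacency dict
# A[u][v] = edge weight. NCut(c) = (weight of edges leaving c) / (total
# degree of c), averaged over all clusters.
def ncut(A, c):
    cut = sum(w for u in c for v, w in A[u].items() if v not in c)
    vol = sum(w for u in c for w in A[u].values())
    return cut / vol

def average_ncut(A, clustering):
    return sum(ncut(A, c) for c in clustering) / len(clustering)

# Toy graph: a triangle {0, 1, 2} attached to node 3 by a single edge.
A = {0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 1}, 3: {2: 1}}
print(average_ncut(A, [{0, 1, 2}, {3}]))  # (1/7 + 1) / 2 = 4/7
```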

For answering Q4, we focus on the protein complex finding problem and use the accuracy measure defined by Hernandez et al. [4] as follows. Let G be the set of ground truth clusters (protein complexes); then the degree of overlap T_gc for every g ∈ G and c ∈ C is defined as

T_gc = |g ∩ c|.

The accuracy ACC is the geometric mean of the Sensitivity (SST) and the Positive Predictive Value (PPV):

SST = Σ_{g ∈ G} max_{c ∈ C} T_gc / Σ_{g ∈ G} |g|

PPV = Σ_{c ∈ C} max_{g ∈ G} T_gc / Σ_{c ∈ C} |c|

ACC = (SST × PPV)^{1/2}
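The accuracy measure of Hernandez et al. [4] can be sketched as follows (illustrative Python on toy clusters; the function name is our own, not the paper's evaluation code):

```python
# ACC = geometric mean of Sensitivity (SST) and Positive Predictive
# Value (PPV), with overlap T_gc = |g intersect c|.
def accuracy(ground_truth, clusters):
    overlap = lambda g, c: len(set(g) & set(c))
    sst = (sum(max(overlap(g, c) for c in clusters) for g in ground_truth)
           / sum(len(g) for g in ground_truth))
    ppv = (sum(max(overlap(g, c) for g in ground_truth) for c in clusters)
           / sum(len(c) for c in clusters))
    return (sst * ppv) ** 0.5

G = [{1, 2, 3}, {4, 5}]   # toy ground-truth complexes
C = [{1, 2}, {3, 4, 5}]   # toy predicted clusters
print(accuracy(G, C))     # SST = PPV = 4/5, so ACC = 0.8
```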

Parameter. We use a coarsening depth of 3 for PS-MCL and MLR-MCL, with which the improvement in quality and speed is large while the number of resulting clusters remains reasonable.

Performance of SC
In this section, we answer Q1 and Q2. Figure 4 shows a comparison of PS-MCL with MLR-MCL and MCL. The horizontal axis denotes the cluster size, and the vertical axis denotes the number of nodes belonging to clusters of that size.

Before going into details, we briefly summarize the results of MCL, which are invariant within each column of Fig. 4 because MCL lacks the balancing concept. As discussed in [15], for all cases MCL suffers from the fragmentation problem: a large portion of nodes belongs to very tiny clusters of size 1–3. We provide the results of MCL as a baseline in the figure, and the following analysis focuses on comparing MLR-MCL and PS-MCL.

For all of our datasets, we observe the same patterns described in the following Observations 1–3, though we present the four representative results in Fig. 4.

Table 3 Description of the datasets used in our experiments

Name | Nodes | Edges | Density (10^−4) | Description
BioPlex [23] | 10961 | 56553 | 9.42 | PPI network of Homo sapiens in the BioPlex 2.0 network
DIP-droso [21] | 22595 | 69148 | 2.71 | PPI network of Drosophila melanogaster
Drosophila [24] | 6600 | 19820 | 9.1 | PPI network of Drosophila melanogaster
MINT [25] | 3872 | 56937 | 75.97 | Combined PPI network of 325 different organisms
Yeast-1 [26] | 2353 | 36790 | 132.95 | Genome-scale epistasis map of Schizosaccharomyces pombe
Yeast-2 [27] | 2428 | 4606 | 15.63 | PPI network of Saccharomyces cerevisiae
Yeast-3 [28] | 3886 | 57782 | 76.55 | PPI network of Saccharomyces cerevisiae
Yeast-4 [29] | 2223 | 7049 | 28.54 | PPI network of Saccharomyces cerevisiae
DIP-yeast [21] | 4929 | 17786 | 14.64 | PPI network of Saccharomyces cerevisiae
BioGRID-homo [22] | 20837 | 288224 | 13.28 | PPI network of Homo sapiens
DBLP | 317080 | 1049866 | 0.21 | Coauthorship network


Fig 4 The number of nodes belonging to clusters of specific sizes. With balancing factor b = 0, i.e., without considering balance, PS-MCL and MLR-MCL do not find clusters properly: in general they output too large clusters, and even group all the nodes into one cluster for MINT. Using a balancing factor b > 0, both result in clusters whose sizes are concentrated around a small value; that value is larger in PS-MCL than in MLR-MCL. Observe that MLR-MCL makes a significantly large number of tiny clusters, including singletons, which are meaningless in clustering. In contrast, our proposed PS-MCL greatly reduces that number, to less than 5% of the number by MLR-MCL. For all cases, MCL suffers from the cluster fragmentation problem. For the other datasets in Table 3, we observe the same patterns described here

Observation 1 (Too massive a cluster without balancing) Without balancing, i.e., b = 0, a large number of nodes are assigned to one cluster; often the entire graph becomes one cluster. A balancing factor of 1.0 to 1.5 resulted in a reasonable cluster size distribution for most of the networks.

The first row of Fig. 4 corresponds to the result without balancing, i.e., b = 0. In this case, both PS-MCL and MLR-MCL group too many nodes into one cluster; especially on MINT, both output only one cluster containing all nodes in the graph. Figure 5 shows the ratio of the largest cluster size to the number of nodes for the bio-networks listed in Table 3. Note that the largest cluster sizes for BioPlex, Drosophila, MINT, Yeast-1, Yeast-2, and Yeast-3 are nearly the same as the total number of nodes; those for DIP-droso and Yeast-4 are relatively smaller but still occupy a large proportion of the entire size.

Observation 2 (PS-MCL prefers larger clusters than MLR-MCL) With b > 0, the cluster size with the maximum number of total nodes in PS-MCL is larger than that in MLR-MCL.


Fig 5 Ratio of the largest cluster size to the number of nodes. Without balancing, i.e., b = 0, PS-MCL and MLR-MCL group too many nodes into one cluster

The second, third, and fourth rows of Fig. 4 show the results of varying the balancing factor b ∈ {1, 1.5, 2}, respectively. In contrast to the case of b = 0, the cluster sizes of PS-MCL and MLR-MCL are concentrated on certain sizes. The mode of cluster size in PS-MCL is larger than that in MLR-MCL: the modes in PS-MCL are 10–20 for DIP-droso, BioPlex, and Drosophila, and 20–50 for MINT, while those in MLR-MCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.

Observation 3 (PS-MCL with less fragmentation than MLR-MCL) With b > 0, PS-MCL results in a significantly smaller number of fragmented clusters, whose sizes are 1–3, compared with MLR-MCL.

PS-MCL achieves concentrated cluster sizes as well as avoiding fragmented clusters; MLR-MCL and MCL still suffer from the fragmentation. The number of nodes belonging to very small clusters in PS-MCL is much smaller than that in MLR-MCL: for instance, the number of nodes belonging to clusters of size 1–3 in PS-MCL is less than 5% of that in MLR-MCL for DIP with b = 1.5.

Observation 4 (PS-MCL better than MLR-MCL in time and NCut) PS-MCL results in a faster running time with a smaller NCut than MLR-MCL does.

Figure 6 shows the plot of running time versus the average NCut. On average, PS-MCL runs faster, with running time down to 21% of that of MLR-MCL, and outputs clusterings with a smaller average NCut, down to 87% of that of MLR-MCL. For some cases MCL is faster than PS-MCL, but for all cases its average NCut is worse than that of PS-MCL.

Performance of parallelization
In this section, we answer Q3. We use b = 1.5 for PS-MCL and MLR-MCL. Figure 7 shows the performance evaluation results for PS-MCL on the bio-networks in Table 3 with increasing cores. For all cases, PS-MCL gets faster as the number of cores increases.

Fig 6 Time vs. average NCut for all datasets with b = 1.5 for PS-MCL, MLR-MCL, and MCL. We use 4 cores for PS-MCL. On average over all datasets, PS-MCL's running time was down to 21% of MLR-MCL's, and its average NCut down to 87%. Although MCL is faster than PS-MCL for some networks, its average NCut is higher than that of PS-MCL for all networks

To test the scalability more effectively, we use DBLP, the largest of our datasets, though it is not a bio-network. Figure 8a shows the speedup of PS-MCL while increasing the number of cores, compared with MLR-MCL and MCL. We use points, not lines, for MLR-MCL and MCL since they are single-core algorithms. PS-MCL outperforms MLR-MCL regardless of the number of cores, and becomes effectively faster as the number of cores increases: precisely, the running time of PS-MCL is improved down to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core.

Figure 8b shows the running time of PS-MCL while increasing data sizes, compared with MLR-MCL and MCL. Here we use 4 cores for PS-MCL. To obtain various sizes of graphs, we first take principal submatrices from the adjacency matrix of DBLP with sizes 20%, 40%, 60%, 80%, and 100% of the total, and use their giant connected components. As shown in the figure, the running times of all the methods are linear in the number of edges in the graphs, and PS-MCL outperforms MLR-MCL at all scales. Note that although MCL is slightly faster than PS-MCL, MCL has the fragmentation problem and worse NCut, while PS-MCL has no fragmentation problem and better NCut, as shown in Figs. 4 and 6.

Fig 7 Running time of PS-MCL for bio-networks in Table 3 while increasing the number of cores. Note that the running time is effectively reduced as the number of cores increases

Fig 8 Running time of PS-MCL, MLR-MCL, and MCL. a Running time on DBLP while varying the number of cores. The running time of PS-MCL is improved down to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core. Furthermore, the single-core performance of PS-MCL outperforms MLR-MCL. b Running time on DBLP while varying data sizes. Running times of all methods are linear in the number of edges in the graphs, and PS-MCL outperforms MLR-MCL for all cases
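The data-scaling setup (a principal submatrix, then its giant connected component) can be sketched as follows. This is an illustrative Python sketch that assumes nodes are labeled 0..n−1; the function names are our own, not the paper's code.

```python
# Take the subgraph induced by the first `fraction` of node ids (the
# principal submatrix of the adjacency matrix), then keep only the
# giant (largest) connected component.
from collections import deque

def principal_subgraph(adj, fraction):
    n = int(len(adj) * fraction)
    keep = set(range(n))  # nodes 0..n-1 induce the principal submatrix
    return {u: {v for v in adj[u] if v in keep} for u in keep}

def giant_component(adj):
    seen, best = set(), set()
    for s in adj:
        if s in seen:
            continue
        comp, q = {s}, deque([s])  # BFS from each unseen node
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {u: adj[u] & best for u in best}

# Toy graph: a path 0-1-2-3 plus a separate pair 4-5.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
print(sorted(giant_component(principal_subgraph(adj, 0.5))))  # [0, 1, 2]
```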

Protein complex identification
In this section, we use the two bio-networks DIP-yeast [21] and BioGRID-homo [22] described in Table 3 to answer Q4 on the protein complex finding problem. The ground-truth protein complex information is extracted from CYC2008 2.0 [13] for DIP-yeast and CORUM [14] for BioGRID-homo. The complexes are used as reference clusters for measuring the accuracy.

Figure 9 shows the performance of PS-MCL while varying skip rates, in comparison with MLR-MCL and MCL. (Note: the skip rate is not applicable to MLR-MCL and MCL, leading to one accuracy value each; for clear performance comparison, we represent that value by the horizontal dashed line along the x-axis.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; namely, the smaller p, the more aggressive the coarsening. In the figure, PS-MCL performs best with moderate values of p: 0.6 and 0.7 for DIP-yeast and BioGRID-homo, respectively. For both networks, PS-MCL consistently outperforms MLR-MCL with 0.5 ≤ p ≤ 0.7; the accuracy of PS-MCL is higher than that of MLR-MCL by up to 36.6% for DIP-yeast and 82.4% for BioGRID-homo. This result makes sense because SC with too large a p hardly reduces a graph in size, while too small a p leads to too large clusters due to aggressive coarsening.

Fig 9 Accuracy of PS-MCL with varying skip rates, MLR-MCL, and MCL. The performance of PS-MCL varies depending on p, but is better than that of MLR-MCL in the range 0.5 ≤ p ≤ 0.7. Compared with MCL, PS-MCL is slightly less accurate, but we observe that this is due to many dimer structures; excluding dimers, PS-MCL greatly outperforms MCL (c). a DIP-yeast b BioGRID-homo c BioGRID-homo excluding dimers

PS-MCL achieves up to 33.2% higher accuracy than MCL for DIP-yeast, and 98.7% of the MCL accuracy for BioGRID-homo. This is due to the many dimer structures present in the CORUM database: when dimers are excluded from the database, PS-MCL greatly outperforms MCL, as shown in Fig. 9c. Although PS-MCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig. 4) and performs poorly in the internal evaluation by average NCut (Fig. 6), which assesses the potential of finding well-formed but undiscovered clusters.

Conclusion
In this paper, we propose PS-MCL, a parallel graph clustering method which gives superior performance on bio-networks. PS-MCL includes two enhancements compared to previous methods. First, PS-MCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLR-MCL in real-world networks. SC allows merging multiple nodes at a time, leading to reducing the graph size more quickly and making super nodes more cohesive than HEM used in MLR-MCL. Second, PS-MCL gives a multi-core parallel algorithm for clustering to increase scalability. Extensive


experiments show that PS-MCL results in clusters that generally have larger sizes than those by MLR-MCL, and also greatly alleviates the fragmentation problem. Moreover, PS-MCL finds clusters whose quality is better than those by MLR-MCL in both internal (average NCut) and external (reference clusters) criteria. Also, as more cores are used, PS-MCL gets faster, and it outperforms MLR-MCL in speed even with a single core.

PS-MCL's capability to quickly find mid-size clusters in large-scale bio-networks has a wide range of applicability in systems biology. Although we have only shown that PS-MCL effectively finds mid-size protein complexes on two protein-protein interaction networks compared to existing topology-based clustering algorithms, we believe that it can be effectively applied to function prediction, disease module detection, and other systems biology analyses.

Availability and requirements
Project name: PS-MCL
Project home page: https://github.com/leesael/PS-MCL
Operating system(s): Platform independent (tested on Ubuntu)
Programming language: Java
Other requirements: Java 1.8 or higher
License: BSD
Any restrictions to use by non-academics: licence needed

Abbreviations
HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsened MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements
Not applicable.

Funding
Publication of this article has been funded by a National Research Foundation of Korea grant funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials
The datasets analysed during the current study are available in the GitHub repository https://github.com/leesael/PS-MCL

About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13

Authors' contributions
YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All the authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published 24 July 2019

References
1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.

2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.

3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.

4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.

5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.

6. Tadaka S, Kinoshita K. NCMine: core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.

7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Yun, Xie ZD, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:716199.

8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.

9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.

10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.

11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.

12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.

13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2008;37(3):825–31.

14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2009;38(suppl 1):497–501.

15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.

16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.

17. Lim Y, Kang U, Faloutsos C. SlashBurn: graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.

18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.


19. Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society; 2006. p. 124.

20. Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989;15(1):1–14.

21. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.

22. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.

23. Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017;545(7655):505–9.

24. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–36.

25. Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35(suppl 1):572–4.

26. Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012;46(5):691–704.

27. Chen J, Hsu W, Lee ML, Ng S-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006;22(16):1998–2004.

28. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.

29. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res. 2003;31(9):2443–50.

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Preliminaries
        • Markov clustering (MCL)
        • Regularized-MCL (R-MCL)
        • Multi-level R-MCL (MLR-MCL)
          • Implementation
            • Shotgun coarsening (SC)
              • Skip Rate
              • Cohesive super node
              • Quickly reduced graph
                • Parallelization
                  • Results
                    • Experimental settings
                    • Performance of SC
                    • Performance of parallelization
                    • Protein complex identification
                      • Conclusion
                      • Availability and requirements
                      • Abbreviations
                      • Acknowledgements
                      • Funding
                      • Availability of data and materials
                      • About this supplement
                      • Authors contributions
                      • Ethics approval and consent to participate
                      • Consent for publication
                      • Competing interests
                      • Publishers Note
                      • Author details
                      • References

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 7 of 12

the positions of val and rowInd to be updated for eachcolumn are specified in colPtr and they do not overlap

ResultsWe present experimental results to answer the followingquestions

Q1 How does PS-MCL improve the distribution ofcluster sizes compared with MLR-MCL

Q2 What is the performance of PS-MCL compared withMLR-MCL in quality and running time

Q3 How much speedup do we obtain by parallelizingPS-MCL

Q4 How accurate are clusters found by PS-MCLcompared to the ground-truth

Table 3 lists the used datasets in our experiments Weuse various bio-networks for evaluating the clusteringquality and the running time the largest dataset DBLP isused for scalability experiments

Experimental settingsMachine All experiments are conducted on a work-station with double CPU Intel(R) Xeon(R) CPU E5-2630v4 220GHz and 250GB memory

Evaluation criteria To evaluate the quality of cluster-ing C for Q1 and Q2 we use the average NCut [15] definedas follows

AverageNCut(C) = 1|C|

sum

cisinCNCut(c)

where

NCut(c) =sum

uisincvisinc A(u v)sum

uisinc degree(u)

For answering Q4 we focus on protein complex findingproblem and use the accuracy measure defined by Her-nandez et al [4] as follows Let G be the set of ground truth

clusters (protein complexes) then the degree of overlapTgc for every g isin G and c isin C is defined as

Tgc = |g cap c|The accuracy ACC is the geometric mean of SensitivitySST and Positive Predictive Value PPV

SST =sum

gisinG maxcisinCTgcsum

gisin|G| |g|

PPV =sum

cisinC maxgisinGTgcsum

cisinC |c|ACC =(SST times PPV )

12

Parameter We use the coarsening depth of 3 for PS-MCL and MLR-MCL with which the improvement inquality and speed is large while the number of resultingclusters remains reasonable

Performance of SCIn this section we answer Q1 and Q2 Figure 4 showscomparison of PS-MCL with MLR-MCL and MCL Thehorizontal axis denotes the cluster size and the verticalaxis denotes the number of nodes belonging to a specificcluster size

Before going into details we briefly summarize theresults of MCL which are invariant within each columnof Fig 4 because of the lack of the balancing conceptin MCL As discussed in [15] for all cases MCL suffersfrom the fragmentation problem that a large portion ofnodes belongs to very tiny clusters of size 1ndash3 We pro-vide results of MCL as a baseline in the figure and thefollowing analysis focuses on comparing the MLR-MCLand PS-MCL

For all of our datasets we observe the same patternsdescribed in the following Observations 1ndash3 though wepresent the four representative results in the Fig 4

Table 3 Dataset description used in our experiments

Name Nodes Edges Density (10minus4) Description

BioPlex [23] 10961 56553 942 PPI networks of Homo sapiens in the BioPlex 20 Network

DIP-droso [21] 22595 69148 271 PPI network of Drosophila melanogaster

Drosophila [24] 6600 19820 91 PPI network of Drosophila melanogaster

MINT [25] 3872 56937 7597 combined PPI network of 325 different organisms

Yeast-1 [26] 2353 36790 13295 Genome-scale epistasis map of Schizosaccharomyces pombe

Yeast-2 [27] 2428 4606 1563 PPI network of Saccharomyces cerevisiae

Yeast-3 [28] 3886 57782 7655 PPI network of Saccharomyces cerevisiae

Yeast-4 [29] 2223 7049 2854 PPI network of Saccharomyces cerevisiae

DIP-yeast [21] 4929 17786 1464 PPI network of Saccharomyces cerevisiae

BioGRID-homo [22] 20837 288224 1328 PPI network of Homo sapiens

DBLP 317080 1049866 021 Coauthorship network

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 8 of 12

Fig 4 The number of nodes belonging to clusters of specific sizes With balancing factor b = 0 ie without considering balance PS-MCL andMLR-MCL do not find clusters properly In general they output too large clusters and even group all the nodes into one cluster for MINT Usingbalancing factor b gt 0 both result in clusters whose sizes are concentrated around a small value That value is larger in PS-MCL than in MLR-MCLObserve that MLR-MCL makes a significantly large number of tiny clusters including singletons which are meaningless in clustering In contrast ourproposed PS-MCL greatly reduces that number less than 5 compared with the number by MLR-MCL For all cases MCL suffers from the clusterfragmentation problem For the other datasets in 3 we observe the same patterns described here

Observation 1 (Too massive cluster without balancing)Without balancing ie b = 0 a large number of nodes areassigned to one cluster Often the entire graph becomes onecluster Balancing factor of 10 to 15 resulted in reasonablecluster size distribution for most of the networks

The first row of Fig 4 corresponds to the result with-out balancing ie b = 0 In this case both PS-MCL andMLR-MCL group too many nodes into one cluster Espe-cially on MINT both output only one cluster containingall nodes in the graph Figure 5 shows the ratio of the

largest cluster size over the number of nodes for the bio-networks listed in 3 Note that the largest cluster sizes forBioPlex Drosophila MINT Yeast-1 Yeast-2 and Yeast-3are nearly the same as the total number of nodes thosefor DIP-droso and Yeast-4 are relatively smaller but stilloccupy a large proportion in the entire size

Observation 2 (PS-MCL preferring larger clusters thanMLR-MCL) With b gt 0 the cluster size with the maxi-mum number of total nodes in PS-MCL is larger than thatin MLR-MCL


Fig. 5 Ratio of the largest cluster size over the number of nodes. Without balancing, i.e., b = 0, PS-MCL and MLR-MCL group too many nodes into one cluster

The second, third, and fourth rows of Fig. 4 show the results of varying the balancing factor b ∈ {1, 1.5, 2}, respectively. In contrast to the case of b = 0, the cluster sizes of PS-MCL and MLR-MCL are concentrated on certain sizes, and the mode of cluster size in PS-MCL is larger than that in MLR-MCL: the modes in PS-MCL are 10–20 for DIP-droso, BioPlex, and Drosophila, and 20–50 for MINT, while those in MLR-MCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.

Observation 3 (PS-MCL with less fragmentation than MLR-MCL) With b > 0, PS-MCL results in a significantly smaller number of fragmented clusters, whose sizes are 1–3, compared with MLR-MCL.

PS-MCL achieves concentrated cluster sizes and avoids fragmented clusters, whereas MLR-MCL and MCL still suffer from fragmentation. The number of nodes belonging to very small clusters in PS-MCL is much smaller than that in MLR-MCL; for instance, the number of nodes belonging to clusters of size 1–3 in PS-MCL is less than 5% of that in MLR-MCL for DIP with b = 1.5.
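The fragmentation measure used here, the share of nodes living in clusters of size 1–3, can be computed the same way as the size distribution. A hedged sketch with illustrative names:

```python
from collections import Counter

def fragmented_node_fraction(labels, max_size=3):
    """Fraction of nodes assigned to clusters of size 1..max_size."""
    sizes = Counter(labels)                 # cluster id -> number of members
    tiny = sum(s for s in sizes.values() if s <= max_size)
    return tiny / len(labels)

# One size-10 cluster, one pair, and one singleton: 13 nodes in total.
labels = [0] * 10 + [1] * 2 + [2]
print(fragmented_node_fraction(labels))     # 3 of 13 nodes live in tiny clusters
```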

Observation 4 (PS-MCL better than MLR-MCL in time and NCut) PS-MCL results in a faster running time with a smaller NCut than MLR-MCL does.

Figure 6 shows the plot of running time versus average NCut. On average over all datasets, PS-MCL reduces running time down to 21% and average NCut down to 87% of those of MLR-MCL. For some cases MCL is faster than PS-MCL, but for all cases its average NCut is worse than that of PS-MCL.
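For reference, the normalized cut of a cluster is its boundary weight divided by its volume, and the reported score averages this over clusters. A minimal sketch under that standard definition (unweighted edges; clusters consisting only of isolated nodes are ignored here, which is a simplifying assumption):

```python
from collections import defaultdict

def average_ncut(edges, labels):
    """Average normalized cut: mean over clusters of cut(C) / vol(C),
    where vol(C) is the sum of degrees of nodes in C."""
    vol = defaultdict(int)    # cluster -> sum of member degrees
    cut = defaultdict(int)    # cluster -> number of edges leaving the cluster
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        vol[cu] += 1
        vol[cv] += 1
        if cu != cv:
            cut[cu] += 1
            cut[cv] += 1
    return sum(cut[c] / vol[c] for c in vol) / len(vol)

# Two triangles joined by one bridge edge: each cluster has cut 1, volume 7.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
print(average_ncut(edges, labels))  # -> 0.14285714285714285 (= 1/7)
```

A lower average NCut means clusters have proportionally fewer boundary edges, i.e., better-separated clusters.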

Performance of parallelization

In this section we answer Q3. We use b = 1.5 for PS-MCL and MLR-MCL. Figure 7 shows the performance evaluation results for PS-MCL on the bio-networks in Table 3 with

Fig. 6 Time vs. average NCut for all datasets, with b = 1.5 for PS-MCL, MLR-MCL, and MCL. We use 4 cores for PS-MCL. Compared with MLR-MCL, PS-MCL reduced running time down to 21% and average NCut down to 87% on average over all datasets. Although MCL is faster than PS-MCL for some networks, its average NCut is higher than that of PS-MCL for all networks

increasing numbers of cores. For all cases, PS-MCL gets faster as the number of cores increases.

To test scalability more effectively, we use DBLP, the largest dataset in our collection, though it is not a bio-network. Figure 8a shows the speedup of PS-MCL while increasing the number of cores, compared with MLR-MCL and MCL. We use points, not lines, for MLR-MCL and MCL, since they are single-core algorithms. PS-MCL outperforms MLR-MCL regardless of the number of cores and becomes effectively faster as the number of cores increases; precisely, the running time of PS-MCL is reduced down to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core.
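The main operation PS-MCL parallelizes is matrix multiplication. The exact decomposition is not spelled out in this excerpt; one natural scheme, sketched below in a pure-Python toy (not the Java implementation), splits the output columns into disjoint blocks so each worker writes its own slice of the result without any locking:

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_cols(A, B, lo, hi):
    """Columns lo..hi-1 of A @ B, for dense lists of lists."""
    k = len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(lo, hi)]
            for i in range(len(A))]

def parallel_matmul(A, B, n_workers=4):
    """Split output columns into disjoint blocks, one per worker.
    Blocks never overlap, so no synchronization is needed on writes."""
    m = len(B[0])
    bounds = [(w * m) // n_workers for w in range(n_workers + 1)]
    out = [[0.0] * m for _ in range(len(A))]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {pool.submit(matmul_cols, A, B, bounds[w], bounds[w + 1]): w
                   for w in range(n_workers)}
        for fut, w in futures.items():
            lo, hi = bounds[w], bounds[w + 1]
            block = fut.result()
            for i in range(len(A)):
                out[i][lo:hi] = block[i]
    return out

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(parallel_matmul(A, B, n_workers=2))  # -> [[19, 22], [43, 50]]
```

The same column-blocking idea applies directly to the sparse stochastic matrices used by MCL-family algorithms, where each column of the product can be expanded independently.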

Figure 8b shows the running time of PS-MCL while increasing data sizes, compared with MLR-MCL and MCL. Here we use 4 cores for PS-MCL. To obtain various sizes of graphs, we first take principal submatrices

Fig. 7 Running time of PS-MCL for the bio-networks in Table 3 while increasing the number of cores. Note that the running time is effectively reduced as the number of cores increases



Fig. 8 Running time of PS-MCL, MLR-MCL, and MCL. a Running time on DBLP while varying the number of cores. The running time of PS-MCL is reduced down to 81% and 55% with 2 and 4 cores, respectively, compared with that with a single core; furthermore, the single-core performance of PS-MCL outperforms MLR-MCL. b Running time on DBLP while varying data sizes. The running times of all methods are linear in the number of edges in the graphs, and PS-MCL outperforms MLR-MCL for all cases

from the adjacency matrix of DBLP with sizes 20%, 40%, 60%, 80%, and 100% of the total, and use their giant connected components. As shown in the figure, the running times of all the methods are linear in the graph sizes, and PS-MCL outperforms MLR-MCL at all scales. Note that although MCL is slightly faster than PS-MCL, MCL has the fragmentation problem and worse NCut, while PS-MCL has no fragmentation problem and better NCut, as shown in Figs. 4 and 6.
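The subsampling procedure described above (take a principal submatrix over the first fraction of node ids, then keep the giant connected component of the induced subgraph) can be sketched as follows. Node ids are assumed to be 0..n-1; that layout, and the function name, are illustrative assumptions, not taken from the paper:

```python
from collections import deque

def principal_subgraph_gcc(edges, n, fraction):
    """Keep only nodes with id < n * fraction (a principal submatrix of
    the adjacency matrix), then return the edges of the giant connected
    component of the induced subgraph."""
    cutoff = int(n * fraction)
    adj = {u: [] for u in range(cutoff)}
    for u, v in edges:
        if u < cutoff and v < cutoff:
            adj[u].append(v)
            adj[v].append(u)
    # BFS over connected components; keep the largest node set.
    seen, best = set(), set()
    for s in adj:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        seen.add(s)
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    q.append(v)
        if len(comp) > len(best):
            best = comp
    return [(u, v) for u, v in edges if u in best and v in best]

# 10 nodes; keeping the first 50% leaves two components, {0,1,2} and {3,4}.
edges = [(0, 1), (1, 2), (3, 4), (5, 6), (0, 9)]
print(principal_subgraph_gcc(edges, 10, 0.5))  # -> [(0, 1), (1, 2)]
```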

Protein complex identification

In this section we use the two bio-networks DIP-yeast [21] and BioGRID-homo [22] described in Table 3 to answer Q4 on the protein complex finding problem. The ground-truth protein complex information is extracted from CYC2008 2.0 [13] for DIP-yeast and CORUM [14] for BioGRID-homo. The complexes are used as reference clusters for measuring accuracy.
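This excerpt does not define the accuracy measure. A common choice when scoring predicted complexes against a reference catalogue is the geometric accuracy of Brohee and van Helden [10], the geometric mean of clustering-wise sensitivity (Sn) and positive predictive value (PPV); the sketch below is offered under that assumption, not as the paper's exact metric:

```python
from math import sqrt

def geometric_accuracy(complexes, clusters):
    """Geometric mean of clustering-wise Sn and PPV.

    `complexes` are reference protein sets; `clusters` are predicted sets.
    Assumes both collections are non-empty and overlap at least once.
    """
    # T[i][j]: number of proteins shared by complex i and cluster j.
    T = [[len(set(c) & set(k)) for k in clusters] for c in complexes]
    # Sn: how well each complex is covered by its best-matching cluster.
    sn = sum(max(row) for row in T) / sum(len(c) for c in complexes)
    # PPV: how specific each cluster is to its best-matching complex.
    col_sums = [sum(T[i][j] for i in range(len(complexes)))
                for j in range(len(clusters))]
    col_maxs = [max(T[i][j] for i in range(len(complexes)))
                for j in range(len(clusters))]
    ppv = sum(col_maxs) / sum(col_sums)
    return sqrt(sn * ppv)

# A perfect recovery of two reference complexes scores 1.0.
print(geometric_accuracy([['a', 'b', 'c'], ['d', 'e']],
                         [['a', 'b', 'c'], ['d', 'e']]))  # -> 1.0
```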

Figure 9 shows the performance of PS-MCL while varying skip rates, in comparison with MLR-MCL and MCL. (Note: the skip rate is not applicable to MLR-MCL and MCL, leading to one accuracy value each; for clear performance comparison, we represent that value by the horizontal dashed line along the x-axis.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; namely, the smaller p, the more aggressive the coarsening. In the figure, PS-MCL performs best with moderate values of p: 0.6 and 0.7 for DIP-yeast and BioGRID-homo, respectively. For both networks, PS-MCL consistently outperforms MLR-MCL with 0.5 ≤ p ≤ 0.7; the accuracy of PS-MCL is higher than that of MLR-MCL by up to 36.6% for DIP-yeast and 82.4% for BioGRID-homo. This result makes sense because SC with too large p hardly reduces the graph in



Fig. 9 Accuracy of PS-MCL with varying skip rates, MLR-MCL, and MCL. The performance of PS-MCL varies depending on p, but is better than that of MLR-MCL in the range 0.5 ≤ p ≤ 0.7. Compared to MCL, PS-MCL is slightly less accurate, but we observe that this is due to many dimer structures; excluding dimers, PS-MCL greatly outperforms MCL (c). a DIP-yeast. b BioGRID-homo. c BioGRID-homo excluding dimers

size, while too small p leads to too large clusters due to aggressive coarsening.
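Concretely, one SC pass with skip rate p can be sketched as below. This is a deliberately simplified illustration of the merging rule (each visited node absorbs all of its not-yet-assigned neighbors at once); the actual SC in PS-MCL additionally maintains edge weights and the balancing factor b:

```python
import random

def shotgun_coarsen_pass(adj, p, seed=0):
    """One simplified shotgun-coarsening pass.

    Visit nodes in random order; each node is skipped with probability p,
    otherwise it seeds a super node absorbing every neighbor that has not
    yet been assigned, so multiple nodes merge in a single step.
    """
    rng = random.Random(seed)
    label = {}                       # node -> super node id
    order = list(adj)
    rng.shuffle(order)
    for u in order:
        if u in label or rng.random() < p:
            continue
        label[u] = u                 # u starts a new super node
        for v in adj[u]:
            if v not in label:
                label[v] = u         # merged together with u
    for u in adj:                    # skipped nodes remain singletons
        label.setdefault(u, u)
    return label

adj = {0: [1, 2], 1: [0], 2: [0], 3: [4], 4: [3]}
print(shotgun_coarsen_pass(adj, p=0.0))  # small p: aggressive merging
```

With p = 1 every node is skipped and the graph is unchanged; with p = 0 each visited node greedily merges with its whole unassigned neighborhood, which matches the intuition in the text that smaller p coarsens more aggressively.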

PS-MCL achieves up to 33.2% higher accuracy than MCL for DIP-yeast, and 98.7% of the MCL accuracy for BioGRID-homo. The latter is due to the many dimer structures present in the CORUM database; excluding dimers from the database, PS-MCL greatly outperforms MCL, as shown in Fig. 9c. Although PS-MCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig. 4) and performs poorly in internal evaluation by average NCut (Fig. 6), which assesses the potential for finding well-formed but undiscovered clusters.

Conclusion

In this paper we propose PS-MCL, a parallel graph clustering method that gives superior performance on bio-networks. PS-MCL includes two enhancements over previous methods. First, PS-MCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLR-MCL on real-world networks; SC allows merging multiple nodes at a time, reducing the graph size more quickly and making super nodes more cohesive than the HEM used in MLR-MCL. Second, PS-MCL gives a multi-core parallel algorithm for clustering to increase scalability. Extensive


experiments show that PS-MCL results in clusters that generally have larger sizes than those by MLR-MCL and also greatly alleviates the fragmentation problem. Moreover, PS-MCL finds clusters whose quality is better than those by MLR-MCL in both internal (average NCut) and external (reference clusters) criteria. Also, as more cores are used, PS-MCL gets faster, and it outperforms MLR-MCL in speed even with a single core.

PS-MCL's capability to quickly find mid-size clusters in large-scale bio-networks has a wide range of applicability in systems biology. Although we have only shown that PS-MCL effectively finds mid-size protein complexes on two protein-protein interaction networks compared to existing topology-based clustering algorithms, we believe that it can be effectively applied to function prediction, disease module detection, and other systems biology analyses.

Availability and requirements

Project name: PS-MCL
Project home page: https://github.com/leesael/PS-MCL
Operating system(s): Platform independent (tested on Ubuntu)
Programming language: Java
Other requirements: Java 1.8 or higher
License: BSD
Any restrictions to use by non-academics: licence needed

Abbreviations

HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsened MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements

Not applicable.

Funding

Publication of this article has been funded by National Research Foundation of Korea grants funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials

The datasets analysed during the current study are available in the GitHub repository https://github.com/leesael/PS-MCL.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13.

Authors' contributions

YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All the authors read and approved the final manuscript.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details

1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published: 24 July 2019

References
1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.
2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.
3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.
4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.
5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.
6. Tadaka S, Kinoshita K. NCMine: core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.
7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Xie ZD, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:16199.
8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.
9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.
10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.
11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.
12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.
13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.
14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes—2009. Nucleic Acids Res. 2009;38(suppl_1):497–501.
15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.
16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.
17. Lim Y, Kang U, Faloutsos C. SlashBurn: graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.
18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.
19. Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society; 2006. p. 124.
20. Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989;15(1):1–14.
21. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.
22. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.
23. Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017;545(7655):505–9.
24. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–36.
25. Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35(suppl 1):572–4.
26. Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012;46(5):691–704.
27. Chen J, Hsu W, Lee ML, Ng S-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006;22(16):1998–2004.
28. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
29. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res. 2003;31(9):2443–50.


Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 8 of 12

Fig 4 The number of nodes belonging to clusters of specific sizes With balancing factor b = 0 ie without considering balance PS-MCL andMLR-MCL do not find clusters properly In general they output too large clusters and even group all the nodes into one cluster for MINT Usingbalancing factor b gt 0 both result in clusters whose sizes are concentrated around a small value That value is larger in PS-MCL than in MLR-MCLObserve that MLR-MCL makes a significantly large number of tiny clusters including singletons which are meaningless in clustering In contrast ourproposed PS-MCL greatly reduces that number less than 5 compared with the number by MLR-MCL For all cases MCL suffers from the clusterfragmentation problem For the other datasets in 3 we observe the same patterns described here

Observation 1 (Too massive cluster without balancing)Without balancing ie b = 0 a large number of nodes areassigned to one cluster Often the entire graph becomes onecluster Balancing factor of 10 to 15 resulted in reasonablecluster size distribution for most of the networks

The first row of Fig 4 corresponds to the result with-out balancing ie b = 0 In this case both PS-MCL andMLR-MCL group too many nodes into one cluster Espe-cially on MINT both output only one cluster containingall nodes in the graph Figure 5 shows the ratio of the

largest cluster size over the number of nodes for the bio-networks listed in 3 Note that the largest cluster sizes forBioPlex Drosophila MINT Yeast-1 Yeast-2 and Yeast-3are nearly the same as the total number of nodes thosefor DIP-droso and Yeast-4 are relatively smaller but stilloccupy a large proportion in the entire size

Observation 2 (PS-MCL preferring larger clusters thanMLR-MCL) With b gt 0 the cluster size with the maxi-mum number of total nodes in PS-MCL is larger than thatin MLR-MCL

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 9 of 12

Fig 5 Ratio of the largest cluster size over the number of nodesWithout balancing ie b = 0 PS-MCL and MLR-MCL group too manynodes to one cluster

The second third and fourth rows of Fig 4 show theresults of varying the balancing factor b isin 1 15 2respectively In contrast to the case of b = 0 the clustersizes of PS-MCL and MLR-MCL are concentrated on cer-tain sizes The mode of cluster size in PS-MCL is largerthan that in MLR-MCL The modes in PS-MCL are 10ndash20 for DIP-droso BioPlex and Drosophila and 20ndash50 forMINT those in MLR-MCL are 5ndash10 for all This obser-vation is useful in practice when we want to cluster at acertain scale

Observation 3 (PS-MCL with less fragmentation thanMLR-MCL) With b gt 0 PS-MCL results in a significantlysmaller number of fragmented clusters whose sizes are 1ndash3compared with MLR-MCL

PS-MCL achieves concentrated cluster sizes as wellas avoids the fragmented clusters MLR-MCL and MCLstill suffer from the fragmentation The number of nodesbelonging to very small clusters in PS-MCL is muchsmaller than that in MLR-MCL For instance the numberof nodes belonging to clusters of size 1ndash3 in PS-MCL isless than 5 of that in MLR-MCL for the DIP with b = 15

Observation 4 (PS-MCL better than MLR-MCL in timeand NCut) PS-MCL results in a faster running time with asmaller NCut than MLR-MCL does

Figure 6 shows the plot of running time versus the aver-age NCut PS-MCL runs faster down to 21 and outputsclustering with a smaller average NCut down to 87 thanMLR-MCL does on average For some cases MCL is fasterthan PS-MCL but for all cases its average NCut is worsethan that by PS-MCL

Performance of parallelizationIn this section we answer Q3 We use b = 15 for PS-MCLand MLR-MCL Figure 7 shows the performance evalu-ation results for PS-MCL on the bio-networks in 3 with

Fig 6 Time vs average NCut for all datasets with b = 15 for PS-MCLMLR-MCL and MCL We use 4 cores for PS-MCL PS-MCL was 21faster in running time and had 87 better NCut on average over alldatasets compared with MLR-MCL Although MCL is faster thanPS-MCL for some networks its average NCut is higher than that byPS-MCL for all networks

increasing cores For all cases PS-MCL gets faster as thenumber of cores increases

To test the scalability more effectively we use DBLPthe largest in our datasets though it is not a bio-networkFigure 8a shows the speed up of PS-MCL while increas-ing the number of cores compared with MLR-MCLand MCL We use points not lines for MLR-MCL andMCL since they are single-core algorithms PS-MCL out-performs MLR-MCL regardless of the number of coresand becomes faster effectively as the number of coresincreases Precisely the running time of PS-MCL isimproved down to 81 and 55 with 2 and 4 coresrespectively compared that with a single core

Figure 8b shows the running time of PS-MCL whileincreasing data sizes compared with MLR-MCL andMCL Here we use 4 cores for PS-MCL To obtain var-ious sizes of graphs we first take principal submatrices

Fig 7 Running time of PS-MCL for bio-networks in 3 while increasingthe number of cores Note that the running time is effectivelyreduced as the number of cores increases

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 10 of 12

a b

Fig 8 Running time of PS-MCL MLR-MCL and MCL a Running timeon DBLP while varying the number of cores Running time of PS-MCLis improved down to 81 and 55 with 2 and 4 cores respectivelycompared that with a single core Furthermore the single coreperformance of PS-MCL outperforms MLR-MCL b Running time onDBLP while varying data sizes Running times of all methods are linearon the number of edges in the graphs and PS-MCL outperformsMLR-MCL for all cases

from the adjacency matrix of DBLP with sizes20 40 60 80 100 of the total and use the giantconnected components of them As shown in the figurethe running times of all the methods are linear on thegraph sizes and PS-MCL outperforms MLR-MCL forall scales Note that although MCL is slightly faster thanPS-MCL MCL has fragmentation problem and worseNCut while PS-MCL has no fragmentation problem andbetter NCut as shown in Figs 4 and 6

Protein complex identificationIn this section we use the two bio-networks ie DIP-yeast [21] and BioGRID-homo [22] described in 3to answer Q4 on protein complex finding problemThe ground-truth protein complexes information areextracted from CYC2008 20 [13] for DIP-yeast andCORUM [14] for BioGRID-homo The complexes areused as reference clusters for measuring the accuracy

Figure 9 shows the performance of PS-MCL while vary-ing skip rates in comparison with MLR-MCL and MCL(Note The skip rate is not applicable to MLR-MCLand MCL leading to one accuracy value For clearperformance comparison we represent that value by thehorizontal dash line along the x-axis) Remind that theskip rate p determines the chance that each node isskipped and thus not merged with others Namely thesmaller p the more aggressive coarsening In the figurePS-MCL performs the best with moderate values of pmdash06and 07 for DIP-yeast and BioGRID-homo respectivelyFor both networks PS-MCL consistently outperformsMLR-MCL with 05 le p le 07 the accuracy of PS-MCL ishigher than that of MLR-MCL up to 366 for DIP-yeastand 824 for BioGRID-homo This result makes sensebecause SC with too large p hardly reduces a graph in

a b

c

Fig 9 Accuracy of PS-MCL with varying skip rates MLR-MCL and MCLThe performance of PS-MCL varies depending on p but is better thanthat of MLR-MCL in the range of 05 le p le 07 Comparing to MCLPS-MCL is slightly less accurate But we observe that it is due to manydimer structures and excluding dimers PS-MCL greatly outperformsMCL (c) a DIP-yeast b Biogrid-homo c Biogrid-homo-dimers

size while too small p leads to too large clusters due toaggressive coarsening

PS-MCL achieves up to 332 higher accuracy thanMCL for DIP-yeast and 987 of the MCL accuracy forBioGRID-homo This is due to many dimer structurespresent in the CORUM database Exclusion of dimersin the database PS-MCL greatly outperforms MCL asshown in Fig 9c Although PS-MCL is not effective infinding dimers note that MCL suffers from the frag-mentation problem (Fig 4) and performs poorly in inter-nal evaluation by Average NCut (Fig 6) which assessesthe potentials of finding well-formed but undiscoveredclusters

ConclusionIn this paper we propose PS-MCL a parallel graphclustering method which gives superior performancein bio-networks PS-MCL includes two enhancementscompared to previous methods First PS-MCL incor-porates a newly proposed coarsening scheme we call SCto resolve the inefficiency of MLR-MCL in real-worldnetworks SC allows merging multiple nodes at a timeleading to reducing the graph size more quickly andmaking super nodes much cohesive than HEM used inMLR-MCL Second PS-MCL gives a multi-core parallelalgorithm for clustering to increase scalability Extensive

Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 11 of 12

experiments show that PS-MCL results in clusters thatgenerally have larger sizes than those by MLR-MCL andalso greatly alleviate the fragmentation problem More-over PS-MCL finds clusters whose quality is better thanthose by MLR-MCL in both internal (average NCut) andexternal (reference clusters) criteria Also as more coresare used PS-MCL gets faster and outperforms MLR-MCLin speed even with a single core

The PS-MCLlsquos capability to quickly find mid-size clus-ters in large scale bio-networks has wide range of appli-cability on systems biology Although we have only shownthat PS-MCL effectively find mid-size protein complexeson two protein-protein interaction network compared toexisting topology-based clustering algorithms we believethat it can be effectively applied on function predictiondisease modules detection and other systems biologyanalysis

Availability and requirementsProject name PS-MCLProject home page httpsgithubcomleesaelPS-MCLOperating system(s) Platform independent (tested onUbuntu)Programming language JavaOther requirements Java 18 or higherLicense BSD Any restrictions to use by non-academicslicence needed

AbbreviationsHEM Heavy edge matching MCL Markov clustering MLR-MCL Multi-levelR-MCL PS-MCL Parallel shotgun coarsening MCL R-MCL Regularized-MCL SCShotgun coarsening

AcknowledgementsNot applicable

FundingPublication of this article has been funded by National Research Foundation ofKorea grant funded by the Korea government (NRF-2018R1A5A1060031NRF-2018R1A1A3A0407953) and by Korea Institute of Science and TechnologyInformation (K-18-L03-C02)

Availability of data and materialsThe datasets analysed during the current study are available in the Githubrepository httpsgithubcomleesaelPS-MCL

About this supplementThis article has been published as part of BMC Bioinformatics Volume 20Supplement 13 2019 Selected articles from the 8th Translational BioinformaticsConference Bioinformatics The full contents of the supplement are availableonline at httpsbmcbioinformaticsbiomedcentralcomarticlessupplementsvolume-20-supplement-13

Authorsrsquo contributionsYL took the lead in writing the manuscript IY developed the software andconducted all the experiments IY UK and LS designed the experiments andYL and IY summarized the results YL IY and UK designed the method DS UKand LS provided critical feedback LS provided the expertise and resources toanalyze the results and supervise the project All the authors read andapproved the final manuscript

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

Author details1Data RampD Center SK Telecom Gyeonggi Korea 2School of ComputingKAIST Daejeon Korea 3KISTI Daejeon Korea 4Department of ComputerScience and Engineering Seoul National University Seoul Korea

Published 24 July 2019

References1 Thomas J Seo D Sael L Review on graph clustering and subgraph

similarity based analysis of neurological disorders Int J Mol Sci 201617(6)862

2 Lei X Wu F-X Tian J Zhao J ABC and IFC Modules detection methodfor PPI network BioMed Res Int 201420141ndash11

3 Xu B Wang Y Wang Z Zhou J Zhou S Guan J An effective approachto detecting both small and large complexes from protein-proteininteraction networks BMC Bioinformatics 201718(Supple 12)419

4 Hernandez C Mella C Navarro G Olivera-Nappa A Araya J Proteincomplex prediction via dense subgraphs and false positive analysis PLoSONE 201712(9)0183460

5 Bernardes JS Vieira FR Costa LM Zaverucha G Evaluation andimprovements of clustering algorithms for detecting remotehomologous protein families BMC Bioinformatics 201516(1)34

6 Tadaka S Kinoshita K NCMine Core-peripheral based functional moduledetection using near-clique mining Bioinformatics 201632(22)3454ndash60

7 Li Z Liu Z Zhong W Huang M Wu N Yun Xie ZD Zou X Large-scaleidentification of human protein function using topological features ofinteraction network Sci Rep 20166 716199

8 Van Dongen S Graph clustering by flow simulation PhD thesis Universityof Utrecht 2000

9 Satuluri V Parthasarathy S Ucar D Markov clustering of proteininteraction networks with improved balance and scalability In ACMConference on Bioinformatics Computational Biology and HealthInformatics New York ACM 2010 p 247ndash56

10 Brohee S van Helden J Evaluation of clustering algorithms forprotein-protein interaction networks BMC Bioinformatics 20067(1)488

11 Vlasblom J Wodak SJ Markov clustering versus affinity propagation forthe partitioning of protein interaction graphs BMC Bioinformatics20091099

12 Beyer A Wilhelm T Dynamic simulation of protein complex formation ona genomic scale Bioinformatics 200521(8)1610ndash6



Lim et al. BMC Bioinformatics 2019, 20(Suppl 13):381. Page 9 of 12

Fig. 5 Ratio of the largest cluster size to the number of nodes. Without balancing, i.e., b = 0, PS-MCL and MLR-MCL group too many nodes into one cluster

The second, third, and fourth rows of Fig. 4 show the results of varying the balancing factor b ∈ {1, 1.5, 2}, respectively. In contrast to the case of b = 0, the cluster sizes of PS-MCL and MLR-MCL are concentrated on certain sizes. The mode of the cluster size in PS-MCL is larger than that in MLR-MCL: the modes in PS-MCL are 10–20 for DIP-droso, BioPlex, and Drosophila, and 20–50 for MINT, while those in MLR-MCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.

Observation 3 (PS-MCL with less fragmentation than MLR-MCL) With b > 0, PS-MCL results in a significantly smaller number of fragmented clusters, whose sizes are 1–3, compared with MLR-MCL.

PS-MCL achieves concentrated cluster sizes while also avoiding fragmented clusters; MLR-MCL and MCL still suffer from fragmentation. The number of nodes belonging to very small clusters in PS-MCL is much smaller than that in MLR-MCL. For instance, the number of nodes belonging to clusters of size 1–3 in PS-MCL is less than 5% of that in MLR-MCL for DIP with b = 1.5.
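The fragmentation statistic above can be computed directly from a flat cluster assignment. A minimal sketch (the label encoding and function name are illustrative, not part of the PS-MCL code):

```python
from collections import Counter

def fragmentation_fraction(labels, max_size=3):
    """Fraction of nodes that fall in clusters of size 1..max_size.

    labels[i] is the cluster label of node i.
    """
    sizes = Counter(labels)  # cluster label -> number of member nodes
    tiny = sum(n for n in sizes.values() if n <= max_size)
    return tiny / len(labels)

# Toy assignment: cluster 0 has 5 nodes; clusters 1 and 2 are fragments.
labels = [0, 0, 0, 0, 0, 1, 2, 2]
print(fragmentation_fraction(labels))  # 3 of 8 nodes are in size 1-3 clusters -> 0.375
```

Comparing this fraction between two clusterings of the same network gives the kind of node-level fragmentation comparison reported here.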

Observation 4 (PS-MCL better than MLR-MCL in time and NCut) PS-MCL results in a faster running time with a smaller NCut than MLR-MCL does.

Figure 6 shows the plot of running time versus the average NCut. PS-MCL runs faster (down to 21% of MLR-MCL's running time) and outputs a clustering with a smaller average NCut (down to 87% of MLR-MCL's) on average. For some cases MCL is faster than PS-MCL, but in all cases its average NCut is worse than that of PS-MCL.
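The internal criterion used throughout is the normalized cut: for a cluster C, NCut(C) is the number of edges leaving C divided by the total degree (volume) of C, averaged over all clusters. A small self-contained sketch of this metric, assuming a plain adjacency-set representation (this is the standard definition, not the paper's own code):

```python
def average_ncut(adj, clusters):
    """Average normalized cut of a clustering.

    adj: dict node -> set of neighbors (undirected, unweighted graph).
    clusters: list of sets of nodes partitioning the graph.
    NCut(C) = cut(C) / vol(C), where vol(C) sums the degrees of C's nodes.
    """
    def ncut(c):
        cut = sum(1 for u in c for v in adj[u] if v not in c)
        vol = sum(len(adj[u]) for u in c)
        return cut / vol if vol else 0.0
    return sum(ncut(c) for c in clusters) / len(clusters)

# Two triangles joined by a single bridge edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(average_ncut(adj, [{0, 1, 2}, {3, 4, 5}]))  # 1/7 for each cluster -> ~0.1429
```

A lower average NCut means fewer inter-cluster edges relative to cluster volume, i.e., more cohesive clusters.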

Fig. 6 Time vs. average NCut for all datasets with b = 1.5 for PS-MCL, MLR-MCL, and MCL. We use 4 cores for PS-MCL. Compared with MLR-MCL, PS-MCL's running time was down to 21% and its average NCut down to 87%, on average over all datasets. Although MCL is faster than PS-MCL for some networks, its average NCut is higher than that of PS-MCL for all networks

Performance of parallelization
In this section we answer Q3. We use b = 1.5 for PS-MCL and MLR-MCL. Figure 7 shows the performance evaluation results for PS-MCL on the bio-networks in Table 3 with increasing numbers of cores. In all cases, PS-MCL gets faster as the number of cores increases.

To test the scalability more thoroughly, we use DBLP, the largest of our datasets, though it is not a bio-network. Figure 8a shows the speedup of PS-MCL while increasing the number of cores, compared with MLR-MCL and MCL. We use points, not lines, for MLR-MCL and MCL since they are single-core algorithms. PS-MCL outperforms MLR-MCL regardless of the number of cores, and becomes effectively faster as the number of cores increases. Precisely, the running time of PS-MCL is improved down to 81% and 55% with 2 and 4 cores, respectively, compared with the single-core case.

Fig. 7 Running time of PS-MCL for the bio-networks in Table 3 while increasing the number of cores. Note that the running time is effectively reduced as the number of cores increases

Fig. 8 Running time of PS-MCL, MLR-MCL, and MCL. a Running time on DBLP while varying the number of cores. The running time of PS-MCL is improved down to 81% and 55% with 2 and 4 cores, respectively, compared with the single-core case. Furthermore, the single-core performance of PS-MCL outperforms MLR-MCL. b Running time on DBLP while varying data sizes. The running times of all methods are linear in the number of edges in the graphs, and PS-MCL outperforms MLR-MCL in all cases

Figure 8b shows the running time of PS-MCL while increasing data sizes, compared with MLR-MCL and MCL. Here we use 4 cores for PS-MCL. To obtain various sizes of graphs, we first take principal submatrices from the adjacency matrix of DBLP with sizes 20%, 40%, 60%, 80%, and 100% of the total, and use their giant connected components. As shown in the figure, the running times of all the methods are linear in the graph sizes, and PS-MCL outperforms MLR-MCL at all scales. Note that although MCL is slightly faster than PS-MCL, MCL has the fragmentation problem and worse NCut, while PS-MCL has no fragmentation problem and better NCut, as shown in Figs. 4 and 6.
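The dataset-scaling procedure above (principal submatrix, then giant connected component) can be sketched as follows. The paper does not publish this preprocessing code; this is a plain NumPy illustration of the idea, with hypothetical names:

```python
import numpy as np

def scaled_subgraph(A, frac):
    """Take the leading principal submatrix covering `frac` of the nodes,
    then keep only its giant connected component (GCC)."""
    n = A.shape[0]
    k = max(1, int(n * frac))
    S = A[np.ix_(range(k), range(k))]  # first k rows and columns

    # Find connected components of the submatrix by DFS.
    seen, comps = set(), []
    for s in range(k):
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(np.nonzero(S[u])[0].tolist())
        seen |= comp
        comps.append(sorted(comp))

    gcc = max(comps, key=len)  # giant connected component
    return S[np.ix_(gcc, gcc)]

# Toy graph: path 0-1-2 plus a separate edge 3-4; frac=1.0 keeps the GCC {0,1,2}.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]])
print(scaled_subgraph(A, 1.0).shape)  # (3, 3)
```

Running this for frac in {0.2, 0.4, 0.6, 0.8, 1.0} yields the family of nested test graphs used in the scalability experiment.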

Protein complex identification
In this section we use the two bio-networks DIP-yeast [21] and BioGRID-homo [22] described in Table 3 to answer Q4 on the protein complex finding problem. The ground-truth protein complexes are extracted from CYC2008 2.0 [13] for DIP-yeast and CORUM [14] for BioGRID-homo. The complexes are used as reference clusters for measuring accuracy.

Figure 9 shows the performance of PS-MCL while varying skip rates, in comparison with MLR-MCL and MCL. (Note: the skip rate is not applicable to MLR-MCL and MCL, leading to a single accuracy value; for clear performance comparison, we represent that value by the horizontal dashed line along the x-axis.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; namely, the smaller p, the more aggressive the coarsening. In the figure, PS-MCL performs best with moderate values of p: 0.6 and 0.7 for DIP-yeast and BioGRID-homo, respectively. For both networks, PS-MCL consistently outperforms MLR-MCL with 0.5 ≤ p ≤ 0.7; the accuracy of PS-MCL is higher than that of MLR-MCL by up to 36.6% for DIP-yeast and 82.4% for BioGRID-homo. This result makes sense because SC with too large p hardly reduces the graph in size, while too small p leads to overly large clusters due to aggressive coarsening.

Fig. 9 Accuracy of PS-MCL with varying skip rates, MLR-MCL, and MCL. The performance of PS-MCL varies depending on p but is better than that of MLR-MCL in the range 0.5 ≤ p ≤ 0.7. Compared to MCL, PS-MCL is slightly less accurate, but we observe that this is due to many dimer structures; excluding dimers, PS-MCL greatly outperforms MCL (c). a DIP-yeast. b BioGRID-homo. c BioGRID-homo-dimers
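The role of the skip rate p can be illustrated with a deliberately simplified coarsening pass. This is a toy sketch, not the actual SC implementation in PS-MCL (which also uses edge weights and further merging rules); it only shows why small p means aggressive coarsening and large p leaves the graph nearly unchanged:

```python
import random

def shotgun_coarsen_pass(adj, p, seed=42):
    """One simplified shotgun-coarsening pass with skip rate p.

    adj: dict node -> set of neighbors. Each node is skipped with
    probability p; an unskipped, not-yet-merged node is merged with all
    of its not-yet-merged neighbors into one super node (multiple nodes
    merged at a time, unlike pairwise heavy edge matching).
    Returns a dict node -> super node id.
    """
    rng = random.Random(seed)
    super_of = {}
    next_id = 0
    for u in adj:
        if u in super_of or rng.random() < p:
            continue  # already merged, or skipped with probability p
        group = [u] + [v for v in adj[u] if v not in super_of]
        for w in group:
            super_of[w] = next_id
        next_id += 1
    # Nodes never merged become singleton super nodes.
    for u in adj:
        if u not in super_of:
            super_of[u] = next_id
            next_id += 1
    return super_of

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
# p = 0 merges everything reachable in one hop; p = 1 skips every node.
print(len(set(shotgun_coarsen_pass(triangle, 0.0).values())))  # 1 super node
print(len(set(shotgun_coarsen_pass(triangle, 1.0).values())))  # 3 super nodes
```

With p = 0 the triangle collapses into a single super node in one pass, while p = 1 leaves every node untouched, matching the trade-off described above.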

PS-MCL achieves up to 33.2% higher accuracy than MCL for DIP-yeast, and 98.7% of the MCL accuracy for BioGRID-homo. This is due to the many dimer structures present in the CORUM database. Excluding dimers from the database, PS-MCL greatly outperforms MCL, as shown in Fig. 9c. Although PS-MCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig. 4) and performs poorly in the internal evaluation by average NCut (Fig. 6), which assesses the potential of finding well-formed but undiscovered clusters.

Conclusion
In this paper we propose PS-MCL, a parallel graph clustering method which gives superior performance on bio-networks. PS-MCL includes two enhancements compared to previous methods. First, PS-MCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLR-MCL in real-world networks. SC allows merging multiple nodes at a time, reducing the graph size more quickly and making super nodes more cohesive than HEM used in MLR-MCL. Second, PS-MCL provides a multi-core parallel algorithm for clustering to increase scalability. Extensive experiments show that PS-MCL results in clusters that generally have larger sizes than those by MLR-MCL and also greatly alleviates the fragmentation problem. Moreover, PS-MCL finds clusters whose quality is better than those by MLR-MCL in both internal (average NCut) and external (reference clusters) criteria. Also, as more cores are used, PS-MCL gets faster; it outperforms MLR-MCL in speed even with a single core.

PS-MCL's capability to quickly find mid-size clusters in large-scale bio-networks has a wide range of applicability in systems biology. Although we have only shown that PS-MCL effectively finds mid-size protein complexes on two protein-protein interaction networks compared to existing topology-based clustering algorithms, we believe that it can be effectively applied to function prediction, disease module detection, and other systems biology analyses.

Availability and requirements
Project name: PS-MCL
Project home page: https://github.com/leesael/PS-MCL
Operating system(s): Platform independent (tested on Ubuntu)
Programming language: Java
Other requirements: Java 1.8 or higher
License: BSD
Any restrictions to use by non-academics: licence needed

Abbreviations
HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsened MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements
Not applicable.

Funding
Publication of this article has been funded by National Research Foundation of Korea grants funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials
The datasets analysed during the current study are available in the GitHub repository https://github.com/leesael/PS-MCL.

About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13.

Authors' contributions
YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All the authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published: 24 July 2019

References
1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.
2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: Modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.
3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.
4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.
5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.
6. Tadaka S, Kinoshita K. NCMine: Core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.
7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Yun X, Xie ZD, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:16199.
8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.
9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.
10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.
11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.
12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.
13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.
14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes—2009. Nucleic Acids Res. 2010;38(suppl_1):497–501.
15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: Applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.
16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.
17. Lim Y, Kang U, Faloutsos C. SlashBurn: Graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.
18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.
19. Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society; 2006. p. 124.
20. Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989;15(1):1–14.
21. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.
22. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.
23. Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017;545(7655):505–9.
24. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–36.
25. Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35(suppl 1):572–4.
26. Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012;46(5):691–704.
27. Chen J, Hsu W, Lee ML, Ng S-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006;22(16):1998–2004.
28. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
29. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res. 2003;31(9):2443–50.





Lim et al BMC Bioinformatics 2019 20(Suppl 13)381 Page 11 of 12

experiments show that PS-MCL results in clusters thatgenerally have larger sizes than those by MLR-MCL andalso greatly alleviate the fragmentation problem More-over PS-MCL finds clusters whose quality is better thanthose by MLR-MCL in both internal (average NCut) andexternal (reference clusters) criteria Also as more coresare used PS-MCL gets faster and outperforms MLR-MCLin speed even with a single core

The PS-MCLlsquos capability to quickly find mid-size clus-ters in large scale bio-networks has wide range of appli-cability on systems biology Although we have only shownthat PS-MCL effectively find mid-size protein complexeson two protein-protein interaction network compared toexisting topology-based clustering algorithms we believethat it can be effectively applied on function predictiondisease modules detection and other systems biologyanalysis

Availability and requirements
Project name: PS-MCL
Project home page: https://github.com/leesael/PS-MCL
Operating system(s): Platform independent (tested on Ubuntu)
Programming language: Java
Other requirements: Java 1.8 or higher
License: BSD
Any restrictions to use by non-academics: licence needed

Abbreviations
HEM: Heavy edge matching; MCL: Markov clustering; MLR-MCL: Multi-level R-MCL; PS-MCL: Parallel shotgun coarsened MCL; R-MCL: Regularized MCL; SC: Shotgun coarsening

Acknowledgements
Not applicable.

Funding
Publication of this article has been funded by National Research Foundation of Korea grants funded by the Korea government (NRF-2018R1A5A1060031, NRF-2018R1A1A3A0407953) and by Korea Institute of Science and Technology Information (K-18-L03-C02).

Availability of data and materials
The datasets analysed during the current study are available in the GitHub repository https://github.com/leesael/PS-MCL.

About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-13.

Authors' contributions
YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervise the project. All the authors read and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details
1 Data R&D Center, SK Telecom, Gyeonggi, Korea. 2 School of Computing, KAIST, Daejeon, Korea. 3 KISTI, Daejeon, Korea. 4 Department of Computer Science and Engineering, Seoul National University, Seoul, Korea.

Published: 24 July 2019

References
1. Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016;17(6):862.
2. Lei X, Wu F-X, Tian J, Zhao J. ABC and IFC: Modules detection method for PPI network. BioMed Res Int. 2014;2014:1–11.
3. Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017;18(Suppl 12):419.
4. Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017;12(9):e0183460.
5. Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.
6. Tadaka S, Kinoshita K. NCMine: Core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016;32(22):3454–60.
7. Li Z, Liu Z, Zhong W, Huang M, Wu N, Yun Xie ZD, Zou X. Large-scale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6:16199.
8. Van Dongen S. Graph clustering by flow simulation. PhD thesis, University of Utrecht; 2000.
9. Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology and Health Informatics. New York: ACM; 2010. p. 247–56.
10. Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006;7(1):488.
11. Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.
12. Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005;21(8):1610–6.
13. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009;37(3):825–31.
14. Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes H-W. CORUM: the comprehensive resource of mammalian protein complexes - 2009. Nucleic Acids Res. 2009;38(suppl_1):497–501.
15. Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: Applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2009. p. 737–46.
16. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM; 1999. p. 251–62.
17. Lim Y, Kang U, Faloutsos C. SlashBurn: Graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014;26(12):3077–89.
18. Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017;20(3):491–514.


19. Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society; 2006. p. 124.
20. Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989;15(1):1–14.
21. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004;32(suppl 1):449–51.
22. Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017;45(W1):W130–7.
23. Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017;545(7655):505–9.
24. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–36.
25. Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the Molecular INTeraction database. Nucleic Acids Res. 2007;35(suppl 1):572–4.
26. Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012;46(5):691–704.
27. Chen J, Hsu W, Lee ML, Ng S-K. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006;22(16):1998–2004.
28. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
29. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res. 2003;31(9):2443–50.

