
A Supernodal All-Pairs Shortest Path Algorithm

Piyush Sao, Ramakrishnan Kannan∗
Oak Ridge National Laboratory, Oak Ridge, TN
{saopk,kannanr}@ornl.gov

Prasun Gera, Richard Vuduc
Georgia Institute of Technology, Atlanta, GA
{prasun.gera,richie}@gatech.edu

ABSTRACT

We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (Apsp) problem. Floyd-Warshall is an attractive choice for Apsp on high-performing systems due to its structural similarity to solving dense linear systems and matrix multiplication. However, if sparsity of the input graph is not properly exploited, Floyd-Warshall will perform unnecessary asymptotic work and thus may not be a suitable choice for many input graphs. To overcome this limitation, the key idea in our approach is to use the known algebraic relationship between Floyd-Warshall and Gaussian elimination, and import several algorithmic techniques from sparse Cholesky factorization, namely, fill-in reducing ordering, symbolic analysis, supernodal traversal, and elimination tree parallelism. When combined, these techniques reduce computation, improve locality and enhance parallelism. We implement these ideas in an efficient shared-memory parallel prototype that is orders of magnitude faster than an efficient multithreaded baseline Floyd-Warshall that does not exploit sparsity. Our experiments suggest that the Floyd-Warshall algorithm can compete with Dijkstra's algorithm (the algorithmic core of Johnson's algorithm) for several classes of sparse graphs.

KEYWORDS

graph algorithm, sparse matrix computations, shared-memory parallelism, communication-avoiding algorithms

∗This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the United States government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for government purposes only.

PPoPP '20, February 22–26, 2020, San Diego, CA, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-6818-6/20/02 . . . $15.00
https://doi.org/10.1145/3332466.3374533

ACM Reference Format:

Piyush Sao, Ramakrishnan Kannan, Prasun Gera, and Richard Vuduc. 2020. A Supernodal All-Pairs Shortest Path Algorithm. In 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '20), February 22–26, 2020, San Diego, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3332466.3374533

1 INTRODUCTION

In this paper, we improve the performance of the classic Floyd-Warshall algorithm (Floyd-Warshall) for the all-pairs shortest path (Apsp) problem on shared-memory parallel machines for sparse graphs. Floyd-Warshall is appropriate when the graph is dense or nearly so, in which case one can achieve good parallel scalability and high performance by exploiting the algebraic connection between Floyd-Warshall and the Gaussian elimination process for solving linear systems [3, 6, 12]. Through that lens, Floyd-Warshall reduces to matrix-multiplication-like (level-3 BLAS-like) operations, thereby enabling fast computation of Apsp on, for instance, GPUs [5] or distributed-memory platforms [38]. But if the graph is sparse, then the implementation of Floyd-Warshall must change. Our key insight, similar to matrix multiplication, is that the full body of algorithmic techniques from sparse direct solvers can be applied [9] to Floyd-Warshall on sparse graphs.

Formally, we consider Apsp on a weighted undirected graph G = (V, E) with n = |V| vertices and m = |E| edges. The weights may be negative, but we preclude cycles whose sum of weights is negative. If, furthermore, the graph is dense, so that m = O(n²), then one may use the Floyd-Warshall algorithm. Its overall algorithmic structure consists of three nested loops (Algorithm 1), each iterating over all vertices, so its sequential complexity is O(n³) operations. Throughout, it updates a matrix Dist[i,j] that stores the length of the current shortest path between any two vertices v_i and v_j; this distance is initialized to ∞ if no path has yet been discovered. Floyd-Warshall maintains, at each iteration k, the invariant that Dist[i,j] is minimum over paths with at most k vertices as intermediaries. Hence, as shown in Fig. 1, Floyd-Warshall discovers more paths between vertices and the number of infinite Dist[i,j] entries decreases.

For sparse graphs, m = O(n) and we prefer methods with better asymptotic scaling. One option is Johnson's algorithm, which scales like O(n² log n + nm) [21].


Figure 1: In the Floyd-Warshall algorithm, the distance matrix, which is initially sparse, quickly becomes dense if the vertex ordering is not optimal. (a) An example graph with six vertices. (b) The initial distance matrix and the matrix after two Floyd-Warshall iterations.

However, Johnson's algorithm cannot effectively use features of modern computer architectures, such as long SIMD vector units or caches, and therefore it underutilizes modern high-performance computing systems. It is natural to ask whether we can exploit sparsity in Floyd-Warshall to reduce its asymptotic complexity while retaining its parallel scalability.

Our approach to doing so derives from the known technique of vertex reordering [26, 33]. During the execution of Floyd-Warshall, infinite values of Dist[i,j] play the role of "zero entries" in sparse numerical linear algebra, and certain operations on infinite values may be avoided. By choosing the optimal vertex order for the outer loop of Floyd-Warshall, we can defer replacement of infinite values for more iterations of the algorithm. The so-called fill-in reducing orderings used in, for instance, sparse Cholesky factorization to keep the factor matrix sparse are also optimal in the case of Floyd-Warshall. If a graph has a minimal vertex separator of size S that partitions the graph into two components of roughly equal size, then the use of a fill-in reducing ordering in Floyd-Warshall incurs only O(n²S + S³) operations. This can be asymptotically lower than O(n³). For instance, in a planar graph like a road network, the separator size is S = O(√n); therefore, Floyd-Warshall would incur O(n^2.5) operations. Even in the case of graphs with S = O(n), using the fill-in reducing ordering, a constant-factor reduction in the number of operations can be substantial. Therefore, using an optimal reordering is necessary for good performance of a sparse Floyd-Warshall.

The use of fill-reducing orderings poses several challenges. First, one must design a careful data structure that can accommodate new entries in Dist[i,j]. Second, the data structure should also support blocked operations, to effectively use the memory hierarchies of modern architectures [17]. These issues motivate our supernodal Floyd-Warshall, or SuperFw, inspired by supernodal sparse Cholesky factorization [9]. In the supernodal approach, we group nodes having a similar adjacency structure into supernodes, and operate on supernodes instead of individual vertices, thereby leading to a blocked algorithm. To accommodate new entries in Dist[i,j], we use symbolic analysis, which can efficiently extract the fill-in structure of the Dist matrix. We perform symbolic analysis and extract the supernodal structure to set up a supernodal block-sparse matrix, which allows blocked operations while exploiting sparsity.

Table 1: List of symbols used

Symbol type   Symbol       Description
              x ⊕ y        min(x, y)
              x ⊗ y        x + y
              P × Q        Cartesian product of non-empty sets P, Q
              n            Dimension of the matrix A
              nb           Dimension of every block of the Dist matrix
Graphs        G(V, E)      Input graph G with V vertices and E edges
              Dist         Distance matrix
Supernode     E            Elimination tree of A
              S            Top-level separator of E
              C1, C2       Children etrees of E
              A(a)         Set of ancestors of a supernode a
              D(a)         Set of descendants of the supernode a

Finally, the operation and task dependencies in SuperFw can be represented by an elimination tree. Its structure indicates which operations may execute concurrently and, therefore, guides parallelism.

Applying these ideas from sparse Cholesky factorization results in an efficient shared-memory parallel SuperFw implementation. We show that in practice it can be orders of magnitude faster than an efficient implementation of Floyd-Warshall that does not exploit sparsity (Section 5). Moreover, despite performing asymptotically more operations, SuperFw's better match to modern hardware can make it comparable to or even faster than Johnson's algorithm for many sparse graphs, as well as more scalable. Finally, given the rich literature on optimizing sparse Cholesky for high-performance systems, we believe many of these techniques are also applicable to graph path problems. We discuss potential scenarios in which graph path analysis would benefit from optimization techniques for linear systems, and vice versa.

2 BACKGROUND

Many path problems in graph analysis can be described succinctly in a semiring algebra. We review this formalism and the resulting classical and blocked Floyd-Warshall algorithms for Apsp below.

Notation and terminology. Let G = {V, E, W} be an undirected weighted graph with a vertex set V containing n = |V| vertices or nodes, an edge set E with m = |E| edges, and weights W, defined below. Denote the i-th vertex by v_i and an edge between v_i and v_j by e_{i,j}. The weights are represented by W, a sparse symmetric matrix whose entry w_{i,j} denotes the distance between vertices v_i and v_j if e_{i,j} ∈ E; otherwise, w_{i,j} = ∞.

During the computation of Apsp, Floyd-Warshall maintains and updates a 2-D array of distances, Dist. Each entry Dist[i,j] holds the current shortest distance between v_i and v_j discovered so far, with its value at termination of the algorithm being the shortest such distance.


We will assume for simplicity that the graph G consists of a single connected component, in which case Dist eventually becomes fully dense; however, our implementation works when there are multiple connected components.

2.1 Classical Floyd-Warshall algorithm

Algorithm 1 Floyd-Warshall algorithm for Apsp
1: function FloydWarshall(G = (V, E)):
2:   Let n ← dim(V)
3:   Let Dist[i,j] = w_{i,j} if (i,j) ∈ E, and ∞ otherwise
4:   for k = {1, 2, ..., n} do:
5:     for i = {1, 2, ..., n} do:
6:       for j = {1, 2, ..., n} do:
7:         Dist[i,j] = min{Dist[i,j], Dist[i,k] + Dist[k,j]}
8:   Return Dist

Floyd-Warshall uses a dynamic programming approach to computing Apsp, as shown in Algorithm 1. It initializes Dist with the input weights W. Then, in the k-th iteration, it checks for all pairs of vertices v_i and v_j whether there is a shorter path between them via the intermediate vertex v_k. If so, Floyd-Warshall updates Dist[i,j]. Therefore, Dist[i,j] after k steps, which we denote by Dist_k[i,j], may be defined recursively as

Dist_k[i,j] ← min{ Dist_{k-1}[i,j], Dist_{k-1}[i,k] + Dist_{k-1}[k,j] }.

This computation may be done in place, with Dist_k[i,j] overwriting Dist_{k-1}[i,j]. It can also be shown that after k iterations, Dist_k[i,j] is the shortest of all paths between v_i and v_j that use only vertices from the set {v_1, v_2, ..., v_k} as intermediaries.¹ Therefore, at the end of the n-th iteration, Dist_n[i,j] is the length of the shortest path between v_i and v_j.

¹This fact holds only if there are no cycles of negative weight sum. In the presence of negative cycles, the minimum path length between any two vertices on the cycle is -∞.
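For concreteness, the following minimal Python sketch implements Algorithm 1 on a dense weight matrix. It is for exposition only (the implementation evaluated in Section 5 is the OpenMP prototype); the list-of-lists weight matrix with math.inf for absent edges is simply an illustrative encoding we assume here.

import math

def floyd_warshall(W):
    """Classic Floyd-Warshall (Algorithm 1) on a dense weight matrix.

    W[i][j] is the edge weight w_{i,j}, or math.inf if (i,j) is not an edge;
    W[i][i] is assumed to be 0. Returns the matrix of shortest-path lengths.
    """
    n = len(W)
    Dist = [row[:] for row in W]              # Dist_0 = W
    for k in range(n):                        # allow v_k as an intermediary
        for i in range(n):
            for j in range(n):
                # Dist_k[i,j] = min(Dist_{k-1}[i,j], Dist_{k-1}[i,k] + Dist_{k-1}[k,j]),
                # computed in place as described above.
                if Dist[i][k] + Dist[k][j] < Dist[i][j]:
                    Dist[i][j] = Dist[i][k] + Dist[k][j]
    return Dist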

2.2 Min-Plus Matrix Multiplication

Apsp may be understood algebraically as computing the matrix closure of the weight matrix, W, defined over the tropical semiring [12]. In more basic terms, let ⊕ and ⊗ denote the two binary scalar operators

x ⊕ y := min(x, y)
x ⊗ y := x + y,

where x and y are real values or ∞. Next, consider two matrices A ∈ R^{m×k} and B ∈ R^{k×n}. The Min-Plus product C of A and B is

C_ij ← ⊕_k A_ik ⊗ B_kj = min_k (A_ik + B_kj).

This product is the analogue of matrix-matrix multiplication over the reals. To see its connection to graph path analysis, consider the example of the complete tripartite graph in Fig. 2a.

Figure 2: (a) Min-Plus product C = A ⊗ B: the shortest path between source vertices and destination vertices that goes through bridge vertices (see Section 2.2). (b) Substeps of Apsp: the diagonal subgraph, the panel (block row and column) update, and the outer-product update of the trailing subgraph.

This graph has three disjoint subsets of m + n + p vertices: m source vertices, {s_1, s_2, ..., s_m}; n destination vertices, {d_1, d_2, ..., d_n}; and p bridge vertices, {b_1, b_2, ..., b_p}. Every source and destination connects to every bridge, but no vertices within each subset are adjacent. Let A_ik denote the weight of the edge (s_i, b_k) and let B_kj be the weight of (b_k, d_j). Then A_ik ⊗ B_kj = A_ik + B_kj is the length of the path from s_i to d_j via b_k. Thus, the shortest path between s_i and d_j via any vertex b_k is the minimum of A_ik ⊗ B_kj over all k. This interpretation of the Min-Plus product helps to understand the following blocked version of Floyd-Warshall (Algorithm 2).
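As an illustration of the Min-Plus product, the short NumPy sketch below computes C_ij = min_k (A_ik + B_kj) by broadcasting. This formulation is our own choice for clarity; it is not the SemiringGemm kernel used in the experiments of Section 5.

import numpy as np

def min_plus(A, B):
    """Min-Plus product: C[i, j] = min over k of (A[i, k] + B[k, j]).

    A is m-by-p and B is p-by-n; np.inf encodes a missing edge.
    """
    m, p = A.shape
    q, n = B.shape
    assert p == q
    # A[:, :, None] has shape (m, p, 1) and B[None, :, :] has shape (1, p, n);
    # their sum is the (m, p, n) tensor of lengths of paths s_i -> b_k -> d_j.
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

# Example: shortest source-to-destination distances through two bridge vertices.
A = np.array([[1.0, np.inf],
              [2.0, 5.0]])       # weights of edges (s_i, b_k)
B = np.array([[4.0, 1.0],
              [np.inf, 2.0]])    # weights of edges (b_k, d_j)
C = min_plus(A, B)               # C[i, j] = shortest s_i -> d_j path via any b_k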

2.3 Blocked Floyd-Warshall algorithm

Suppose we divide Dist into nb × nb blocks, each of size b × b (i.e., nb = n/b). Let A_ij denote the (i,j) block of A, where 1 ≤ i, j ≤ nb. Then a blocked version of Floyd-Warshall, called BlockedFw in Algorithm 2, can carry out the same Apsp computation as Floyd-Warshall in the following three steps, as illustrated in Fig. 2b:

• Diagonal Update: Perform the classic Floyd-Warshall algorithm on a diagonal block, A(k,k).
• Panel Update: Update the k-th block row and column. For any block A(k,j), j ≠ k, in the block row, the update is a Min-Plus multiply with A(k,k) from the left, and for any block A(i,k), i ≠ k, in the k-th block column it is a Min-Plus multiply with A(k,k) from the right, i.e.,

A(k,j) ← A(k,j) ⊕ A(k,k) ⊗ A(k,j),   j ≠ k
A(i,k) ← A(i,k) ⊕ A(i,k) ⊗ A(k,k),   i ≠ k

Here, ⊕ denotes element-wise application of the corresponding scalar operator, and ⊗ denotes the Min-Plus product.
• Min-Plus Outer Product: Perform the outer product of the k-th block row and block column, and update all the remaining blocks of matrix A:

A(i,j) ← A(i,j) ⊕ A(i,k) ⊗ A(k,j),   i, j ≠ k.


Figure 3: An illustration of BlockedFw on a 3 × 3 block-partitioned matrix.

This step is analogous to a Schur-complement update in LU or Cholesky factorization.

Algorithm 2 A blocked version of the Floyd-Warshall algorithm for Apsp
1: function BlockedFloydWarshall(A):
2:   for k = {1, 2, ..., nb} do:
       Diagonal Update
3:     A(k,k) ← Floyd-Warshall(A(k,k))
       Panel Update
4:     A(k,:) ← A(k,:) ⊕ A(k,k) ⊗ A(k,:)
5:     A(:,k) ← A(:,k) ⊕ A(:,k) ⊗ A(k,k)
       Min-Plus Outer Product
6:     for i = {1, 2, ..., nb}, i ≠ k do:
7:       for j = {1, 2, ..., nb}, j ≠ k do:
8:         A(i,j) ← A(i,j) ⊕ A(i,k) ⊗ A(k,j)
9:   Return A
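A compact, illustrative Python version of Algorithm 2 follows. It reuses the floyd_warshall and min_plus sketches from above, assumes a dense NumPy matrix whose dimension n is a multiple of the block size b, and is meant only to show the three substeps, not to be a tuned implementation.

import numpy as np   # floyd_warshall and min_plus are defined in the earlier sketches

def blocked_fw(A, b):
    """Blocked Floyd-Warshall (Algorithm 2). A is n-by-n with np.inf for non-edges."""
    n = A.shape[0]
    assert n % b == 0
    nb = n // b
    blk = lambda i: slice(i * b, (i + 1) * b)
    for k in range(nb):
        K = blk(k)
        # Diagonal update: classic Floyd-Warshall on the diagonal block A(k,k).
        A[K, K] = np.array(floyd_warshall(A[K, K].tolist()))
        # Panel update: k-th block row and block column.
        A[K, :] = np.minimum(A[K, :], min_plus(A[K, K], A[K, :]))
        A[:, K] = np.minimum(A[:, K], min_plus(A[:, K], A[K, K]))
        # Min-Plus outer product on every remaining block.
        for i in range(nb):
            for j in range(nb):
                if i != k and j != k:
                    I, J = blk(i), blk(j)
                    A[I, J] = np.minimum(A[I, J], min_plus(A[I, K], A[K, J]))
    return A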

3 SUPERNODAL FLOYD-WARSHALL

When the Dist matrix is sparse, we can use ideas from sparse direct solvers to transform the BlockedFw algorithm into one that can maintain and exploit that sparsity. There are three critical concepts: (i) reordering, which helps control the dynamically evolving sparsity structure of Dist, thereby controlling the amount of asymptotic work incurred by BlockedFw; (ii) supernodes, which help manage and organize this structure, thereby exploiting locality; and (iii) the elimination tree, which helps express the operational and data dependencies, thereby exposing parallelism.

3.1 Effect of ordering on BlockedFw

When running BlockedFw, the sparsity of Dist in any given iteration determines how that sparsity will change in subsequent iterations. We illustrate these dependencies in Fig. 3. There, we show a 3×3 block-partitioned sparse matrix with a particular sparsity pattern in which the A21 and A12 blocks are "empty," meaning that all the entries in them are infinity.

The first iteration of BlockedFw, i.e., k = 1, performs a DiagUpdate on the A11 block followed by a PanelUpdate on the blocks A13 and A31. Since A12 and A21 are empty, they remain empty after the PanelUpdate step. In the OuterUpdate step, the blocks A22, A23, and A32 remain unchanged, since any updates to them depend on A12 or A21, which are empty. We only perform an update on the block A33, as follows:

A33 ← A33 ⊕ A31 ⊗ A13.

Figure 4: Nested Dissection (ND) on a 5×5 grid graph. (a) A 25×25 sparse matrix. (b) Non-zero structure of the permuted matrix. (c) Elimination tree. (d) Supernodal matrix. Under an ND ordering, we find a graph separator (highlighted in yellow in Fig. 4a) and label the nodes in the separator last. Fig. 4b shows the adjacency matrix of the graph permuted in ND ordering. The elimination tree (Fig. 4c) captures the dependency in the elimination of different nodes. Fig. 4d shows the final block-sparse matrix obtained after the steps described in Sections 3.2 and 3.3.

Similarly, when k = 2, we perform a DiagUpdate on the block A22 and a PanelUpdate on A23 and A32, and we only perform an update on the block A33:

A33 ← A33 ⊕ A32 ⊗ A23.

In the third iteration, we perform a DiagUpdate on the block A33, a PanelUpdate from the left on A31 and A32, and from the right on A23 and A13. Since none of the third-row and third-column blocks are empty, we update all the remaining blocks in the OuterUpdate step. The blocks A12 and A21, which had been empty so far, finally become full blocks:

A12 ← A13 ⊗ A32
A21 ← A23 ⊗ A31.

This example reveals two essential aspects of BlockedFw.

(1) For this particular example, the block sparsity is maintained until the last iteration. That is, only when k = 3 does execution of the OuterUpdate step require an update on the complete matrix, in which empty blocks (of all ∞ values) become finite. This change from infinite to finite values is the graph-path analog of nonzero fill-in in sparse linear algebra.


(2) The OuterUpdate step in the first iteration does not involve updating any block of the second block row or column, i.e., A22, A23, and A32, and vice versa. Therefore, the DiagUpdate and PanelUpdate of iterations k = 1 and k = 2 do not have any data dependencies, and may therefore proceed in parallel. However, both iterations update the A33 block. Nevertheless, since Min-Plus matrix addition is associative, the updates from the two iterations can be done in any order. Thus, the final A33 after the two updates is given by

A33 ← A33 ⊕ A32 ⊗ A23 ⊕ A31 ⊗ A13.

This dependency of the operations in iteration k = 3 on those from k = 1 and k = 2 can be described by a tree, as shown in Fig. 3 (last).

3.2 Nested-Dissection Ordering

From the theory and practice of sparse matrix reordering, it is well understood that "arrow" patterns work well in reducing fill-in during Gaussian elimination. Indeed, we can reorder the adjacency matrix of the graph G to obtain a block-arrow structure through a process known as nested dissection (ND) [14], an ordering that graph-partitioning tools like Metis or Scotch can be used to obtain [22, 31].

The ND process may be summarized as follows. Our initial goal is to compute a vertex separator S ⊂ V that partitions the vertices of the graph G into three disjoint sets, V = C1 ∪ S ∪ C2, so that the following holds:

(1) there are no edges between any vertex in C1 and any vertex in C2;
(2) |C1| and |C2| are roughly equal; and
(3) S is as small as possible.

Using this partition, we can reorder the adjacency matrix A so that the vertices within each set C1 and C2 have consecutive indices, and vertices in S have higher indices than any vertex in C1 and C2. For example, in Fig. 4, we show a grid graph G, a separator, its adjacency matrix, and the 3 × 3 block-arrow matrix obtained by reordering the matrix as described above. This process may be performed recursively within C1 and C2 to obtain a more fine-grained ordering for the entire matrix.
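A minimal, runnable sketch of this recursion is shown below under stated assumptions: the graph is given as an adjacency dictionary, and simple_separator is a naive heuristic we introduce only for illustration. Our implementation instead obtains separators from Metis (or, equivalently, Scotch), which produce far smaller separators than this stand-in.

from collections import deque

def simple_separator(adj, vertices):
    """Naive stand-in for a real partitioner (Metis/Scotch): order the vertices
    by BFS, cut the ordering in half, and move the boundary of the first half
    into the separator S, so that no edge joins C1 and C2."""
    order, seen = [], set()
    for s in sorted(vertices):                 # BFS over every component
        if s in seen:
            continue
        seen.add(s)
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v in vertices and v not in seen:
                    seen.add(v)
                    queue.append(v)
    half = set(order[: len(order) // 2])
    S = {u for u in half if any(v in vertices and v not in half for v in adj[u])}
    return half - S, S, set(vertices) - half   # (C1, S, C2)

def nested_dissection_order(adj, vertices=None):
    """Recursive nested-dissection ordering: number C1, then C2, then S last,
    which yields the block-arrow structure of Fig. 4b at every level."""
    if vertices is None:
        vertices = set(adj)
    if len(vertices) <= 4:                     # small base case: stop recursing
        return sorted(vertices)
    C1, S, C2 = simple_separator(adj, vertices)
    if not C1 or not C2:                       # could not split further
        return sorted(vertices)
    return (nested_dissection_order(adj, C1)
            + nested_dissection_order(adj, C2)
            + sorted(S))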

3.3 Supernodal structure extraction

The goal of this step is to obtain a blocked sparse matrix, which we shall call the supernodal matrix (shown in Fig. 4d), from the matrix permuted using the ND ordering (shown in Fig. 4b), and to calculate the so-called elimination tree (shown in Fig. 4c) that guides the scheduling and parallelism of our algorithm. We borrow the so-called symbolic analysis used in sparse direct solvers to do so.

Symbolic analysis: Symbolic analysis is the calculation of the exact fill-in structure in sparse Cholesky factorization. It enables preallocation of memory for any fill-in.

Elimination tree: Recall the 3 × 3 block-sparse matrix shown in Fig. 3. Elimination can proceed in either {1, 2, 3} or {2, 1, 3} order while maintaining low fill. Such an elimination ordering can be described using a tree (Fig. 3), called the elimination tree or eTree. The eTree is also calculated during the symbolic analysis step. ND ordering leads to a multilevel eTree, as shown in Fig. 4c.

Supernodes: A supernode is a collection of vertices or nodes that have a similar fill-in structure. When that occurs, these vertices may be grouped together to obtain a block-sparse matrix. The supernodal partition is obtained by performing vertex contraction on the eTree, yielding a supernodal eTree. In the subsequent discussion, eTree refers to a supernodal eTree.

Ancestor and descendant supernodes: Further, we define the ancestors of a supernode as the set of supernodes consisting of its parent, its parent's parent, and so on. Similarly, if supernode a is an ancestor of another supernode b, then we say b is a descendant of a. In the eTree representation, the ancestors of a node occupy a higher spot than that node, and its descendants appear below it. We denote the ancestors and descendants of a supernode v by A(v) and D(v), respectively.
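For concreteness, A(v) and D(v) can be read off a supernodal eTree stored as a parent array. The sketch below is illustrative only; it assumes the eTree is given as parent[v] with the root's parent set to None, and the example tree shape is an assumption, not taken from Fig. 4c.

def ancestors(parent, v):
    """A(v): the parent of v, its parent's parent, and so on up to the root."""
    result = []
    u = parent[v]
    while u is not None:
        result.append(u)
        u = parent[u]
    return result

def descendants(parent, v):
    """D(v): all supernodes u that have v as an ancestor."""
    return [u for u in range(len(parent)) if v in ancestors(parent, u)]

# Example eTree with 7 supernodes (an assumed shape):
# leaves 0,1 -> 2;  leaves 3,4 -> 5;  2,5 -> root 6.
parent = [2, 2, 6, 5, 5, 6, None]
assert ancestors(parent, 0) == [2, 6]
assert descendants(parent, 6) == [0, 1, 2, 3, 4, 5]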

3.4 The SuperFw algorithm

Algorithm 3 The SuperFw algorithm
1: ns := Number of supernodes
2: function SuperFw(G = (V, E)):
3:   for k = {1, 2, ..., ns} do:
       Diagonal Update
4:     A(k,k) ← Floyd-Warshall(A(k,k))
       Panel Update
5:     for i ∈ A(k) ∪ D(k) do:
6:       A(i,k) ← A(i,k) ⊕ A(i,k) ⊗ A(k,k)
7:       A(k,i) ← A(k,i) ⊕ A(k,k) ⊗ A(k,i)
       Min-Plus Outer Product
8:     for (i,j) ∈ {A(k) ∪ D(k)} × {A(k) ∪ D(k)} do:
9:       A(i,j) ← A(i,j) ⊕ A(i,k) ⊗ A(k,j)

Algorithm 3 describes the sequential SuperFw algorithm. At a high level, it performs Floyd-Warshall iterations on the supernodal matrix. However, this algorithm exhibits some subtle behaviors.

Consider the elimination of a supernode v. Its elimination only requires updating blocks corresponding to A(v), D(v), and v itself. The DiagUpdate and PanelUpdate are performed in place. The regions updated in the OuterUpdate step can be divided into four subsets:

(1) D(v) × D(v) (top-left region, relative to v),
(2) D(v) × A(v) (top-right region),
(3) A(v) × D(v) (bottom-left region), and
(4) A(v) × A(v) (bottom-right region).


Figure 5: Parallel OuterUpdate of two cousin supernodes. (a) Parallel elimination of supernodes 2 and 5. (b) The eTree view. The regions updated in the OuterUpdate step of the elimination of the second and fifth supernodes are highlighted in blue and green, respectively.

These different regions associated with the elimination of a node appear in Fig. 5a. In conventional Cholesky factorization, we only need to update A(v) × A(v), also called the trailing matrix. This operation involves only sparse blocks, and it updates the supernodal block-sparse matrix. The OuterUpdate on D(v) × D(v), D(v) × A(v), and A(v) × D(v) directly updates the distance matrix, which is dense. At the end of the computation, the supernodal matrix contains the semiring equivalent of the Cholesky factors, and the dense distance matrix contains the final pairwise lengths of all shortest paths.
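One elimination step of Algorithm 3 can be outlined as follows. This is a structural sketch under stated assumptions: the block matrix A is a Python dict keyed by (row, column) supernode indices holding dense NumPy blocks (every block touched below is assumed to exist, e.g., preallocated from the fill-in structure computed by the symbolic analysis), and min_plus, floyd_warshall, ancestors, and descendants are the helpers from the earlier sketches. It is not the data structure used in our prototype.

import numpy as np

def eliminate_supernode(A, k, parent):
    """One iteration of Algorithm 3: eliminate supernode k.

    Only blocks whose row and column indices lie in A(k) ∪ D(k) ∪ {k} are
    touched; blocks outside that set stay empty (all infinity).
    """
    reach = set(ancestors(parent, k)) | set(descendants(parent, k))
    # Diagonal update (in place) on the k-th diagonal block.
    A[k, k] = np.array(floyd_warshall(A[k, k].tolist()))
    # Panel update restricted to the reachable block row and column.
    for i in reach:
        A[i, k] = np.minimum(A[i, k], min_plus(A[i, k], A[k, k]))
        A[k, i] = np.minimum(A[k, i], min_plus(A[k, k], A[k, i]))
    # Min-Plus outer product over {A(k) ∪ D(k)} × {A(k) ∪ D(k)}.
    for i in reach:
        for j in reach:
            A[i, j] = np.minimum(A[i, j], min_plus(A[i, k], A[k, j]))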

3.5 Parallel SuperFw algorithm

Recall from the baseline BlockedFw algorithm that, in the k-th OuterUpdate step, all the updates on all the A_ij blocks could be performed in parallel. Relative to that available parallelism, the enhancement in SuperFw comes from exploiting the eTree.

eTree-guided scheduling and parallelism. We say a supernode a is a cousin of a supernode b if D(a) ∩ D(b) = ∅. For instance, any two leaf nodes in the eTree are cousins. The DiagUpdate and PanelUpdate of any two cousin nodes can be done in parallel, as they operate on distinct regions of the matrix. Next, consider the dependencies in the OuterUpdate step of two cousins. Recall the four sets of blocks updated in the OuterUpdate: D(v) × D(v), D(v) × A(v), A(v) × D(v), and A(v) × A(v). Any block A(i,j) in the first three subsets has at least one of i and j in D(v). If two supernodes v and u are cousins, then by definition D(v) ∩ D(u) = ∅. Therefore, the first three subsets of the OuterUpdate of two cousin supernodes are disjoint and can be updated concurrently. But A(v) × A(v) and A(u) × A(u) can have some common blocks. Therefore, those blocks are updated sequentially.

To expose maximum parallelism, we perform a bottom-up topological sort of the eTree, which partitions the eTree into levels, as shown in Fig. 5b. Since all the supernodes in a given level are cousins of one another, their elimination can be done in parallel. We refer to such eTree-guided scheduling as eTree parallelism.
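A minimal sketch of this scheduling follows: supernodes are grouped into levels by a bottom-up pass over the parent array, and each level is dispatched concurrently (here with a Python thread pool purely for illustration; our prototype uses OpenMP). It reuses eliminate_supernode from the previous sketch and, for brevity, omits the serialization that overlapping A(v) × A(v) updates of cousins require.

from concurrent.futures import ThreadPoolExecutor

def etree_levels(parent):
    """Bottom-up level partition of the supernodal eTree.

    Leaves get level 0 and a parent sits one level above its highest child, so
    all supernodes within a level are cousins and processing the levels in
    order respects every ancestor dependency."""
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for v, p in enumerate(parent):
        if p is None:
            roots.append(v)
        else:
            children[p].append(v)
    level = [0] * n

    def assign(v):
        level[v] = 1 + max((assign(c) for c in children[v]), default=-1)
        return level[v]

    for r in roots:
        assign(r)
    levels = [[] for _ in range(max(level) + 1)]
    for v in range(n):
        levels[level[v]].append(v)
    return levels

def parallel_superfw(A, parent, nthreads=8):
    """Level-by-level elimination: cousins within a level run concurrently.

    NOTE: a real implementation must also serialize the A(v) x A(v) updates
    that two cousins may share (Section 3.5), e.g., with per-block locks."""
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        for level in etree_levels(parent):
            # list(...) acts as a barrier: wait for the whole level to finish.
            list(pool.map(lambda k: eliminate_supernode(A, k, parent), level))
    return A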

4 ASYMPTOTIC ANALYSIS

Table 2: Asymptotic work (W), depth (D) and concurrency (C)

Algorithm           W(n)†               D(n)†              C(n)‡
BlockedFw           O(n³)               O(n)               O(n²)
SuperFw             O(n²|S|)            O(|S| log² n)      O(n²/log² n)
Dijkstra            O(n² log n + nm)    O(n log n + m)     O(n)
PathDoubling [40]   O(n³)               O(log n)           O(n³/log n)

n: #vertices, m: #edges, |S|: size of the top-level separator.
† lower is better. ‡ higher is better.

We use the work-depth model [4] to quantify the asymptotic sequential work and the available parallelism of different algorithms. The work is the total number of operations performed, and the depth is the length of the longest sequential chain of data dependencies. Formally, if T(n,p) denotes the execution time of a parallel algorithm on p processors for a problem of size n, then the work W(n) and depth D(n) are defined as follows:

W(n) = T(n, p = 1)   (1)
D(n) = lim_{p→∞} T(n, p)   (2)

Using W(n) and D(n), the average available parallelism, or concurrency C(n), is defined as

C(n) = W(n) / D(n).

The concurrency C(n) indicates the average number of processors an algorithm can fruitfully utilize.

We assume a parallel random access memory (PRAM) model [1] of parallel execution that supports concurrent read, exclusive write (CREW). In this model, all processors can access a memory location simultaneously, and only one processor can write to a location at a time. In terms of depth costs, performing x_i ← α y_i + β has a depth of O(1), and the reduction operation y ← Σ_{i=1:n} x_i has a depth of O(log n).

4.1 Asymptotic Work

If |S| is the size of the top-level separator of a graph G with n vertices, then the cost of running SuperFw is O(n²|S|). The asymptotic cost reduction from O(n³) to O(n²|S|) comes from the ND reordering. The asymptotic cost of ND reordering and other equivalent reordering schemes can be found elsewhere


as well [26]. Given the algebraic equivalence of SuperFw with sparse Gaussian elimination (Cholesky for the symmetric, or undirected, case), its costs are the same; here, we sketch the derivation of that cost and its implications for SuperFw.

Let V = {C1 ∪ S ∪ C2} be a nested-dissection vertex partitioning of the graph G = (V, E). Let |S|, |C1|, and |C2| denote the number of vertices in each set. In the following discussion, we assume, first, that the separator is small, i.e., |S| ≪ n, and that the partition is balanced, i.e., |C1| = |C2|. Therefore,

|C1| = |C2| ≈ n/2.

Further, we assume that the size of the separator as a function of the number of vertices in the graph, denoted S(n), is monotonic, i.e.,

n1 > n2 ⇒ S(n1) > S(n2).

Consider the elimination of the separator S, which is the last iteration of SuperFw. It involves three steps: DiagUpdate, PanelUpdate, and OuterUpdate. The cost of DiagUpdate is equal to the cost of running Floyd-Warshall on a graph with |S| vertices, or O(|S|³). The PanelUpdate step involves Min-Plus matrix multiplication of the two separator panels of size |S| × (n - |S|) with an |S| × |S| matrix; thus, its cost is |S|² × (n - |S|). The OuterUpdate step involves computing the outer product of the two panels, of sizes (n - |S|) × |S| and |S| × (n - |S|); thus, its cost is (n - |S|)² × |S|. Since |S| ≪ n, the outer-product cost dominates, and so the total cost of eliminating the top-level separator is

W⁰(n) ≈ n² S(n).

Here, the superscript zero denotes the level of the separator from the top, i.e., the root has level zero and the leaves have level h - 1, where h is the height of the separator tree.

Now consider the elimination of the two first-level separators in the eTree. Again, OuterUpdate dominates the cost of elimination. Recalling our assumption that the partitions are approximately balanced, the OuterUpdate involves computing outer products whose dimensions are (n/2 + |S(n)|/2) × |S(n/2)| and |S(n/2)| × (n/2 + |S(n)|/2), so that the cost of eliminating the two first-level separators is

W¹(n) ≈ 2 (n/2)² S(n/2) = (n²/2) S(n/2).

Similarly, the total cost of eliminating all the i-th-level separators is given by

Wⁱ(n) ≈ 2ⁱ (n/2ⁱ)² S(n/2ⁱ) = (n²/2ⁱ) S(n/2ⁱ).

There are approximately log n levels of separators. Therefore, summing over all levels yields the final cost,

W(n) ≈ Σ_{i=0}^{log n} (n²/2ⁱ) S(n/2ⁱ) = n² S(n) Σ_{i=0}^{log n} S(n/2ⁱ) / (2ⁱ S(n)).

Since S(n) is monotonic, S(n/2ⁱ)/S(n) ≤ 1, so the coefficient of n² S(n) in W(n) is at most Σ_{i=0}^{log n} 1/2ⁱ ≤ 2. Hence, the total work of SuperFw on a graph with n vertices is given by

W(n) = O(n² |S|).   (3)

4.2 Asymptotic Depth

The depth of the baseline Floyd-Warshall algorithm is O(n), since each of the n vertices of the graph is eliminated sequentially. Within the elimination of a single vertex, DiagUpdate, PanelUpdate, and OuterUpdate each have depth O(1) using O(n²) processors. If SuperFw does not use the eTree parallelism and instead eliminates the supernodes sequentially, then it has the same depth O(n) as the baseline Floyd-Warshall.

To calculate the depth of SuperFw with eTree parallelism, we consider the elimination of the top-level separator first. The elimination of the top-level separator uses the blocked Floyd-Warshall algorithm; thus its depth is S(n). In the first level, we eliminate two separators, each of size ≈ S(n/2), in parallel.

For the elimination of the two separators in the first level, we can perform the DiagUpdate, PanelUpdate, and OuterUpdate involving the regions A × D, D × A, and D × D in parallel. However, when performing the OuterUpdate involving A × A, we may update the same block from the OuterUpdate step of either separator. In general, in the elimination of the separators of the i-th level, multiple processes might try to update the same block in A × A. Notice that the update is a reduction operation; hence, if p processes try to update the same block, we can perform the update using a tree reduction with a depth of O(log p). Since any block in A × A is updated by at most O(2ⁱ) processors at level i, the depth of updating any block in the OuterUpdate of the A × A region is at most log(2ⁱ) = i. Therefore, the depth of performing the elimination at the i-th level of the separator tree is i·S(n/2ⁱ). So the total depth of SuperFw is given by

D(n) = Σ_{i=0}^{log n} i·S(n/2ⁱ) ≤ Σ_{i=0}^{log n} log n · S(n) = S(n) log² n.   (4)

Therefore, the depth of SuperFw with eTree parallelism is O(S(n) log² n), or just O(|S| log² n). Thus, we can express the available parallelism, or concurrency, as

C(n) = W(n)/D(n) = O(n²/log² n).   (5)
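As a worked instantiation of Eqs. (3)-(5) (a derived example, not an additional result), consider a planar graph, whose separators satisfy S(n) = O(√n) [27]:

W(n) = O(n²|S|) = O(n^2.5)
D(n) = O(|S| log² n) = O(√n log² n)
C(n) = W(n)/D(n) = O(n²/log² n)

This matches the O(n^2.5) operation count quoted for road networks in Section 1 and the concurrency listed in Table 2.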

4.3 Discussion

In Table 2, we summarize the work, depth, and concurrency of SuperFw and BlockedFw, along with two notable general Apsp algorithms.


Table 3: Test sparse matrices used in experiments

Name              Source              n        nnz/n    n/|S|
USpowerGrid       Power network       4.9e3    2.66     6.2e2
OPF_6000          Power network       2.9e4    9.1      1.4e3
nd6k              3D                  1.8e4    383      5.8
oilpan            Structural          7.3e4    29.1     1.7e2
finan512          Optimization        7.5e4    7.9      1.5e3
net4-1            Optimization        8.8e4    28       2.9e3
c-42              Optimization        1.0e4    10.58    1.5e2
c-69              Optimization        6.7e4    9.24     2.0e2
lpl1              Optimization        3.2e4    10       4.8e2
onera_dual        Structural          8.5e4    4.9      1.5e2
email-Enron       SNAP                3.7e4    9.9      52
delaunay_n14      DIMACS10            1.6e4    5.99     1.7e2
delaunay_n16      DIMACS10            6.5e4    5.99     1.7e2
fe_sphere         DIMACS10            1.6e4    5.99     8.5e1
luxembourg_osm    DIMACS10            1.1e5    2.1      6.7e3
fe_tooth          DIMACS10            7.8e4    11.6     88
wing              DIMACS10            6.2e4    3.9      1.0e2
t60k              DIMACS10            6.0e4    3.0      1.1e3
G67               Random              1e4      4        5.0e1
EB_8192_256       Barabasi-Albert     8.1e3    256      2.5e0
EB_16384_64       Barabasi-Albert     1.63e4   64       2.6e0
rgg2d_14          Random Geometric    1.63e4   128.17   1.6e1
rgg3d_14          Random Geometric    1.63e4   910      2.57
hypercube_14      Hypercube graph     1.6e4    28       5.0e0

The first, Dijkstra, is work-optimal for sparse graphs but offers only O(n) concurrency; the second, PathDoubling, is a known theoretical variant of the Floyd-Warshall algorithm with the best known parallel complexity. Per Eq. (3), SuperFw lowers the work complexity of Floyd-Warshall by a factor of O(n/S(n)), while also reducing the asymptotic depth by O(n/(S(n) log² n)). Hence, SuperFw improves the asymptotic work complexity with little loss of available parallelism. In contrast, Dijkstra's algorithm has a lower asymptotic work complexity, but it has a concurrency of only O(n). Further, Dijkstra uses a priority-queue data structure, which may not effectively utilize modern architectural features such as on-chip cache memory and the large SIMD units found on modern processors.

Per Eq. (3), a small vertex separator implies an asymptotic reduction in the cost of SuperFw relative to the naïve (dense) cost of O(n³). The class of graphs with small separators largely falls into the category of geometric graphs, as well as additional classes of graphs derived from geometric graphs. Informally, a geometric graph is one that can be embedded into a d ≪ n-dimensional grid [10]. For a d-dimensional grid graph, there exists a separator of size O(n^{1-1/d}). The best-known example is a planar graph, which has a separator of size O(√n) [27]. Additionally, there are graphs with O(n) separators, also known as expander graphs [20]. Many random graphs tend to become expanders as the number of edges increases.

5 RESULTS

We have implemented a shared-memory multicore version of the SuperFw algorithm using OpenMP. The aim of our evaluation is to quantify the impact of each algorithmic technique from sparse Cholesky on Apsp, though these techniques are not specific to shared memory; Section 6 briefly discusses candidate implementations for other programming models and frameworks, such as distributed memory. We present the performance of SuperFw for both the sequential and multithreaded implementations on the different datasets listed in Table 3.

5.1 Experimental Setup

In this section, we present the details of the test bed, the baselines, and the test graphs that we use for the experiments.

5.1.1 Test Bed. We conducted our experiments on a shared-memory system with 32 cores: a dual-socket machine with 16-core Intel E5-2698 v3 "Haswell" processors. Each socket has a 40 MB shared L3 cache. The system has a total of 128 GB of DDR4-2133 memory, arranged in four 16 GB DIMMs per socket. Each core supports two hyperthreads, for 64 threads in total.

5.1.2 Competing Apsp implementations. We compare the SuperFw implementation, which uses the three optimizations of (a) ND ordering, (b) supernodal structure, and (c) elimination-tree parallelism, with the following baselines.

• BlockedFw: an efficient multithreaded implementation of Algorithm 2 using OpenMP. This implementation does not exploit the sparsity of the graph and performs O(n³) operations.
• SuperBfs: This algorithm does not use the optimal ND ordering. However, it does perform the symbolic factorization and sets up the supernodal data structure. We perform a BFS from vertex 0 and use the order in which vertices were discovered as the ordering, to ensure that the initial ordering of the matrix has some structure, so the supernodal approach might still exploit the sparsity. This variant performs O(n³) operations, but the constants may be much smaller than for BlockedFw.
• Dijkstra: This algorithm performs a single-source shortest path computation from every vertex. It has the lowest asymptotic complexity of all the methods considered herein, and it is the core of Johnson's algorithm when there are no negative edge weights. (That is, Johnson's algorithm uses Dijkstra as a subroutine and would be more expensive than Dijkstra alone for graphs with only positive edge weights.)
• BoostDijkstra: an Apsp implementation using the off-the-shelf implementation of Dijkstra's algorithm from the popular Boost Graph Library (BGL) [37].²

²BGL also provides implementations of both Floyd-Warshall and Johnson's algorithm, but their performance is not competitive with BoostDijkstra, and thus they are not considered.


Figure 6: Performance of different multithreaded Apsp algorithms for (a) small and (b) large graphs. Each bar shows the normalized execution time for a given algorithm and graph. Time is normalized to the baseline BlockedFw and Dijkstra algorithms for small and large graphs, respectively. The text label over each bar denotes the speedup over the reference algorithm.

Figure 7: Shared-memory scaling of different Apsp implementations for large graphs ((a) finan512, (b) net4-1, (c) email-Enron, (d) wing) on an Intel "Haswell" dual-socket system with 32 physical cores.

Figure 8: Impact of eTree parallelism (Section 3.5) on the scaling of the SuperFw algorithm on 32 cores (see Section 5.2.3).

• ∆-Step: We use the parallel ∆-stepping variant of Dijkstra's algorithm [30] for computing the single-source shortest paths in Johnson's algorithm, taking the parallel ∆-stepping implementation from the Galois graph library [32]. ∆-stepping requires tuning the ∆ parameter for each graph. Our ∆-Step-based Apsp is autotuned, i.e., it tries different values of ∆ on the first few SSSP calls and picks the best ∆ for the rest of the execution.

The core of all three Floyd-Warshall-based algorithms (BlockedFw, SuperBfs, and SuperFw) uses the same double-precision semiring matrix-multiplication kernel, SemiringGemm.


The SemiringGemm kernel achieves 10.2 Gflop/s per core, which is 28% of the theoretical peak of 36.8 Gflop/s (18.4 Gflop/s without FMA instructions) per core. BlockedFw achieves a maximum of 244 Gflop/s on 32 cores. The operand sizes can vary significantly in SuperFw, so the per-core flop rate varies between graphs, ranging from 7.6 to 9.4 Gflop/s.

5.1.3 Test Graphs. We used graphs from three different categories: (a) real-world graphs, (b) graphs from the DIMACS competition, and (c) random graph generators with different parameters. The details of the set of matrices used in the experiments, such as their size and density, are listed in Table 3. The graph luxembourg_osm, which has 114k vertices and requires 105 GB of memory to store the distance matrix, is the largest graph we could successfully try. Dijkstra's algorithm and ∆-Step work on graphs with positive edge weights, so we modify the adjacency matrices of the real-world and synthetic graphs to have only positive entries.

5.1.4 Pre-processing overhead. The worst-case pre-processing cost is 18% of the multithreaded SuperFw execution time, even though pre-processing (i.e., graph partitioning via Metis) is not multithreaded. In the sequential or single-node case, the pre-processing step is not a bottleneck even for sparse Cholesky (which is an O(S³) operation, whereas SuperFw performs O(n²S) operations); thus, many efforts are directed toward improving the performance of the numerical factorization step. In the subsequent performance and scalability analysis, the reported times do not include the pre-processing costs.

5.2 Observations

5.2.1 Small Graphs. For small graphs, we compare the performance of SuperFw against BlockedFw, SuperBfs, and Dijkstra, as shown in Fig. 6a. SuperFw is faster than BlockedFw by up to 123×, which represents the impact of all the optimizations combined. SuperBfs is faster than BlockedFw by up to 3.9×. We highlight the following three key observations.

• Impact of ND ordering: For the test matrices delaunay_n14, OPF_6000, and fe_sphere, ND ordering yielded a significant benefit because of their smaller separators. There are many other real-world cases with small separators that can similarly benefit from SuperFw.
• Impact of supernodal structure: The asymptotic cost of SuperBfs is still O(n³), but it does exploit sparsity to some extent, unlike BlockedFw. As shown in Fig. 6a, this offers an advantage of 1-3.9× on many real-world and synthetic graphs. We also observe that in the case of hypercube_14, which has an n/log n separator, reordering cannot reduce the asymptotic cost. Yet, by using the supernodal data structure, we get a 4.1× speedup over BlockedFw.
• Adversarial cases: Finally, there are synthetic graphs, such as the extended Barabasi-Albert graphs, where neither ND ordering nor the supernodal structure offers any improvement. More generally, there are graph classes, like expander graphs, that are sparse yet well connected. Such graphs do not have good separators, and we would not expect SuperFw to provide any advantage over BlockedFw. A number of random graph generators, such as Erdos-Renyi and Watts-Strogatz, produce graphs with similar properties at large numbers of vertices.

We expect that the performance gap between BlockedFw and SuperFw will increase with increasing problem size, due to the asymptotic difference in time complexity, whereas the performance gap between BlockedFw and SuperBfs will remain similar for larger graphs.

5.2.2 Large Graphs. For large graphs, we compare the performance of SuperFw with Dijkstra, BoostDijkstra, and ∆-Step, and leave out the O(n³) algorithms, as shown in Fig. 6b. SuperFw is 0.2-52× faster than Dijkstra. We expect that for even larger graphs Dijkstra will outperform SuperFw; nevertheless, SuperFw is competitive with Dijkstra for planar graphs with on the order of 100k vertices, e.g., luxembourg_osm. BoostDijkstra and Dijkstra are algorithmically very similar, yet BoostDijkstra is often slower than our implementation of Dijkstra. This difference mainly stems from BoostDijkstra's adjacency-list data structure for storing graphs versus the compressed-sparse-row storage used by our Dijkstra. ∆-Step is neither work-optimal nor scalable, and is thus not competitive with either Dijkstra or SuperFw.

5.2.3 Scalability.

• Impact of eTree parallelism: In Fig. 8, we show the relative performance of two implementations of SuperFw, with and without eTree parallelism, on 32 cores over the sequential performance. eTree parallelism can improve the scalability of SuperFw by 2×. The impact of eTree parallelism is most visible for small graphs, e.g., USpowerGrid, where SuperFw performs very little per-iteration work. For larger graphs, eTree parallelism has little impact on performance, as they already have enough parallelism in each iteration. Hence, eTree parallelism is essential for strong scaling.
• Scalability of different Apsp implementations: The scalability of Dijkstra, BoostDijkstra, ∆-Step, and SuperFw is compared in Figs. 7a to 7d. Our SuperFw implementation scales linearly up to 32 threads (the number of physical cores), achieving a parallel efficiency of 74%. Dijkstra and BoostDijkstra can effectively use hyperthreading to hide the latencies of the priority-queue data structure, so they scale to 64 threads. The ∆-Step method only parallelizes each SSSP call; it thus requires significantly more inter-thread synchronization and scales poorly compared to the other three implementations.


6 RELATED WORK

This work builds upon principles from several different areas, including semiring algebra, graph algorithms, and sparse direct solvers.

A number of other problems are equivalent to Apsp, e.g., metricity, minimum-weight triangle, second shortest path, etc. [43, 44]. While Apsp is the semiring equivalent of matrix inversion, no truly subcubic algorithm (i.e., an equivalent of Strassen's algorithm) for Apsp is known. The best known complexity of Apsp for the dense case is O(n³/2^{Ω(√log n)}) [43], and O(mn log n) for sparse graphs [7]; for the parallel case, the best known depth is O(log n) [40].

The equivalence between finding shortest paths and solving a system of linear equations goes back to the work of Carré [3, 6]. He gave many interpretations of linear algebra operations, including LU factorization and the Sherman-Morrison-Woodbury formula for graph updates. Modern treatments of this subject can be found elsewhere [16, 28].

The method of nested dissection (ND) was discovered by George [13] for solving linear systems of equations arising from finite element meshes. Its generalization by Lipton et al. [26] and the planar separator theorem [27, 39] have had a large impact on a number of graph and sparse linear algebra algorithms. In particular, several algorithms for path problems on planar graphs are based on ND and the planar separator theorem [8, 11, 42].

Lastly, sparse direct solvers have been studied in great detail in the context of parallel computing. Depending on the scheduling, there are several variants, namely the left-looking, right-looking, multifrontal, and Crout variants. The effects of different scheduling strategies on performance can be found in [19, 34]. The proposed SuperFw closely resembles the right-looking variant. Similarly, a number of works have focused on improving scalability on accelerators such as GPUs [15, 23-25, 29, 36, 41, 46] and on distributed memory [2, 18, 35, 45]. Most distributed algorithms rely on some form of eTree parallelism for reducing communication and for data distribution [18, 35].

7 CONCLUSION

This paper is the first practical demonstration of how to exploit the algebraic structure of the all-pairs shortest path problem to create a parallel Apsp algorithm for sparse graphs, applying the machinery we know from sparse direct solvers for linear systems. Although the observation about the similarity of linear solvers and shortest paths is known, ours is the first attempt to create a practical implementation and conduct an empirical analysis. We believe that the algebraic interpretation of Apsp is important to explore, and our study is just the first of many that we or others could carry out. In particular, consider linear systems. In that setting, there is a rich "hierarchy" of methods that trade off generality and robustness for speed and asymptotic optimality, with "dense LU" at one end and "multigrid" at the other. Sparse Cholesky/LU sits in the middle of that spectrum. For Apsp, we do not yet fully understand what the analogous hierarchy might look like. This study is just one of a series of future studies that could try to fill in the other analogous methods.

8 ACKNOWLEDGEMENTS

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. This research is partially supported by Oak Ridge National Laboratory Director's Research and Development Funding. Additional support comes from the Defense Advanced Research Projects Agency (DARPA) contract FA8750-18-2-0108, under the DARPA MTO Software Defined Hardware program, and the National Science Foundation (NSF) under Award 1710371. The publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

REFERENCES

[1] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71(1):3–28, 1990.
[2] C. C. Ashcraft. A taxonomy of distributed dense LU factorization methods. Engineering Computing and Analysis Technical Report ECA-TR-161, Boeing Computer Services, 1991.
[3] Roland C. Backhouse and Bernard A. Carré. Regular algebra applied to path-finding problems. IMA Journal of Applied Mathematics, 15(2):161–186, 1975.
[4] Guy E. Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3):85–97, 1996.
[5] Aydın Buluç, John R. Gilbert, and Ceren Budak. Solving path problems on the GPU. Parallel Computing, 36(5-6):241–253, 2010.
[6] Bernard A. Carré. An algebra for network routing problems. IMA Journal of Applied Mathematics, 7(3):273–294, 1971.
[7] Timothy M. Chan. All-pairs shortest paths for unweighted undirected graphs in o(mn) time. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 514–523. Society for Industrial and Applied Mathematics, 2006.
[8] Hristo Djidjev, Guillaume Chapuis, Rumen Andonov, Sunil Thulasidasan, and Dominique Lavenier. All-pairs shortest path algorithms for planar graphs for GPU-accelerated clusters. Journal of Parallel and Distributed Computing, 85:91–103, 2015.
[9] Iain S. Duff, A. M. Erisman, and John K. Reid. Direct Methods for Sparse Matrices. Oxford University Press, 2nd edition, 2017.
[10] Paul Erdös, Frank Harary, and William T. Tutte. On the dimension of a graph. Mathematika, 12(2):118–122, 1965.
[11] Greg N. Frederickson. Fast algorithms for shortest paths in planar graphs, with applications. SIAM Journal on Computing, 16(6):1004–1022, 1987.
[12] Jeremy T. Fineman and Eric Robinson. Fundamental graph algorithms. In Jeremy Kepner and John Gilbert, editors, Graph Algorithms in the Language of Linear Algebra, chapter 5, pages 45–58. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2011.
[13] A. George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis, 10(2):345–363, 1973.
[14] Alan George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis, 10(2), 1973.
[15] T. George, V. Saxena, A. Gupta, A. Singh, and A. Choudhury. Multifrontal factorization of sparse SPD matrices on GPUs. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2011), Anchorage, Alaska, May 16–20, 2011.
[16] Michel Gondran and Michel Minoux. Graphs, Dioids and Semirings: New Models and Algorithms, volume 41. Springer Science & Business Media, 2008.
[17] Kazushige Goto and Robert A. van de Geijn. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS), 34(3):12, 2008.
[18] Anshul Gupta, George Karypis, and Vipin Kumar. Highly scalable parallel algorithms for sparse matrix factorization. IEEE Transactions on Parallel and Distributed Systems, 8(5):502–520, 1997.
[19] Michael T. Heath, Esmond Ng, and Barry W. Peyton. Parallel algorithms for sparse linear systems. SIAM Review, 33(3):420–460, 1991.
[20] Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.
[21] Donald B. Johnson. Efficient algorithms for shortest paths in sparse networks. Journal of the ACM (JACM), 24(1):1–13, 1977.
[22] George Karypis and Vipin Kumar. A fast and high-quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing (SISC), 20(1), 1998.
[23] Kyungjoo Kim and Victor Eijkhout. Scheduling a parallel sparse direct solver to multiple GPUs. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages 1401–1408. IEEE, 2013.
[24] G. Krawezik and G. Poole. Accelerating the ANSYS direct sparse solver with GPUs. In Proceedings of the Symposium on Application Accelerators in High Performance Computing (SAAHPC), Urbana-Champaign, IL, NCSA, 2009.
[25] Xavier Lacoste, Mathieu Faverge, George Bosilca, Pierre Ramet, and Samuel Thibault. Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes. In Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, pages 29–38. IEEE, 2014.
[26] Richard J. Lipton, Donald J. Rose, and Robert Endre Tarjan. Generalized nested dissection. SIAM Journal on Numerical Analysis, 16(2):346–358, 1979.
[27] Richard J. Lipton and Robert Endre Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics, 36(2):177–189, 1979.
[28] Grigory L. Litvinov. Idempotent/tropical analysis, the Hamilton-Jacobi and Bellman equations. In Hamilton-Jacobi Equations: Approximations, Numerical Analysis and Applications, pages 251–301. Springer, 2013.
[29] R. Lucas, G. Wagenbreth, D. Davis, and R. Grimes. Multifrontal computations on GPUs and their multi-core hosts. In VECPAR'10: Proceedings of the 9th International Meeting on High Performance Computing for Computational Science, Berkeley, CA, 2010.
[30] Ulrich Meyer and Peter Sanders. δ-stepping: A parallel single source shortest path algorithm. In European Symposium on Algorithms, pages 393–404. Springer, 1998.
[31] François Pellegrini and Jean Roman. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In Proceedings of High-Performance Computing and Networking (HPCN-Europe), volume LNCS 1067, 1996.
[32] Keshav Pingali. High-speed graph analytics with the Galois system. In Proceedings of the First Workshop on Parallel Programming for Analytics Applications, pages 41–42. ACM, 2014.
[33] Leon R. Planken, Mathijs M. de Weerdt, and Roman P. J. van der Krogt. Computing all-pairs shortest paths by leveraging low treewidth. Journal of Artificial Intelligence Research, 43:353–388, 2012.
[34] Edward Rothberg. Exploiting the memory hierarchy in sequential and parallel sparse Cholesky factorization. Technical report, Stanford University, Department of Computer Science, 1992.
[35] Piyush Sao, Xiaoye S. Li, and Richard Vuduc. A communication-avoiding 3D LU factorization algorithm for sparse matrices. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, May 2018.
[36] Piyush Sao, Richard Vuduc, and Xiaoye Sherry Li. A distributed CPU-GPU sparse direct solver. In Euro-Par 2014 Parallel Processing, pages 487–498. Springer International Publishing, 2014.
[37] Jeremy Siek, Andrew Lumsdaine, and Lie-Quan Lee. The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley, 2002.
[38] Edgar Solomonik, Aydın Buluç, and James Demmel. Minimizing communication in all-pairs shortest paths. In Proceedings of the 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Boston, MA, USA, May 2013.
[39] Daniel A. Spielman and Shang-Hua Teng. Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of the 37th Conference on Foundations of Computer Science, pages 96–105. IEEE, 1996.
[40] Alexandre Tiskin. All-pairs shortest paths computation in the BSP model. In International Colloquium on Automata, Languages, and Programming, pages 178–189. Springer, 2001.
[41] R. Vuduc, A. Chandramowlishwaran, J. Choi, M. Guney, and A. Shringarpure. On the limits of GPU acceleration. In Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism (HotPar'10), Berkeley, CA, 2010.
[42] Oren Weimann and Raphael Yuster. Computing the girth of a planar graph in O(n log n) time. SIAM Journal on Discrete Mathematics, 24(2):609–616, 2010.
[43] R. Ryan Williams. Faster all-pairs shortest paths via circuit complexity. SIAM Journal on Computing, 47(5):1965–1985, 2018.
[44] Virginia Vassilevska Williams and Ryan Williams. Subcubic equivalences between path, matrix and triangle problems. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 645–654. IEEE, 2010.
[45] Ichitaro Yamazaki and Xiaoye S. Li. New scheduling strategies and hybrid programming for a parallel right-looking sparse LU factorization algorithm on multicore cluster systems. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 619–630. IEEE, 2012.
[46] C. D. Yu, W. Wang, and D. Pierce. A CPU-GPU hybrid approach for the unsymmetric multifrontal method. Parallel Computing, 37:759–770, 2011.
