
NUMA-Aware Optimization of Sparse Matrix-Vector Multiplication on ARMv8-based Many-Core Architectures

Xiaosong Yu1, Huihui Ma1, Zhengyu Qu1, Jianbin Fang2, and Weifeng Liu1

1 Super Scientific Software Laboratory, Department of Computer Science and Technology, China University of Petroleum-Beijing

2 Institute for Computer Systems, College of Computer, National University of Defense Technology

[email protected], [email protected],

[email protected], [email protected], [email protected]

Abstract. As a fundamental operation, sparse matrix-vector multiplication (SpMV) plays a key role in solving a number of scientific and engineering problems. This paper presents a NUMA-Aware optimization technique for the SpMV operation on the Phytium 2000+ ARMv8-based 64-core processor. We first provide a performance evaluation of the NUMA architecture of the Phytium 2000+ processor, then reorder the input sparse matrix with hypergraph partitioning for better cache locality, and redesign the SpMV algorithm with NUMA tools. The experimental results show that our approach utilizes the memory bandwidth in a much more efficient way, and improves the performance of SpMV by an average speedup of 1.76x on Phytium 2000+.

Keywords: Sparse matrix-vector multiplication, NUMA architecture, Hypergraph partitioning, Phytium 2000+.

1 Introduction

The sparse matrix-vector multiplication (SpMV) operation multiplies a sparse matrix A with a dense vector x and gives a resulting dense vector y. It is one of the level 2 sparse basic linear algebra subprograms (BLAS) [13], and is one of the most frequently called kernels in the field of scientific and engineering computations. Its performance normally has a great impact on sparse iterative solvers such as the conjugate gradient (CG) method and its variants [17].

To represent the sparse matrix, many storage formats and their SpMV algorithms have been proposed to save memory and execution time. Since SpMV has a very low ratio of floating-point calculations to memory accesses, and its access patterns can be very irregular, it is a typical memory-bound and latency-bound algorithm. Currently, many SpMV optimization efforts have achieved performance improvements to various

degrees, but they generally lack consideration of the NUMA (non-uniform memory access) characteristics of a wide range of modern processors, such as ARM CPUs.

To obtain scale-out benefits on modern multi-core and many-core processors, a NUMA architecture is often an inevitable choice. Most modern x86 processors (e.g., the AMD EPYC series) and ARM processors (e.g., Phytium 2000+) utilize a NUMA architecture for building a processor with tens of cores. To further increase the number of cores in a single node, multiple (typically two, four or eight) such processor modules are integrated onto a single motherboard and connected through high-speed buses. But such a scalable design often brings stronger NUMA effects, i.e., noticeably lower bandwidth and larger latency when cross-NUMA accesses occur.

To improve the SpMV performance on modern processors, we in this work develop a NUMA-Aware SpMV approach. We first reorder the input sparse matrix with hypergraph partitioning tools, then allocate a row block of A and the corresponding part of x for each NUMA node, and pin threads onto hardware cores of the NUMA nodes for running the parallel SpMV operation. Because the reordering technique gathers the non-zeros of A onto diagonal blocks and naturally brings affinity between the blocks and the vector x, the data locality of accessing x can be significantly improved.

We benchmark 15 sparse matrices from the SuiteSparse Matrix Collection [3] on a 64-core ARMv8-based Phytium 2000+ processor. We set the number of hypergraph partitions to 2, 4, 8, 16, 32 and 64, and the number of threads to 8, 16, 32 and 64, then measure the performance of their combinations. The experimental results show that, compared to the classical OpenMP SpMV implementation, our NUMA-Aware approach greatly improves the SpMV performance, by 1.76x on average (up to 2.88x).

2 Background

2.1 Parallel Sparse Matrix-Vector Multiplication

Sparse matrices can be represented with various storage formats, and SpMV with different storage formats often has noticeable performance differences [1]. The most widely-used format is the compressed sparse row (CSR) format, which contains three arrays for row pointers, column indices and values. The SpMV algorithm using the CSR format can be parallelized by assigning a group of rows to a thread. Algorithm 1 shows the pseudocode of an OpenMP parallel SpMV method with the CSR format.

2.2 NUMA Architecture of the Phytium 2000+ Processor

Figure 1 gives a high-level view of the Phytium 2000+ processor. It uses the Mars II architecture [16], and features 64 high-performance ARMv8-compatible

Algorithm 1 An OpenMP implementation of parallel SpMV.

1: #pragma omp parallel for
2: for i = 0 → A.row_nums do
3:     y[i] = 0
4:     for j = A.rowptr[i] → A.rowptr[i + 1] do
5:         y[i] = y[i] + A.val[j] ∗ x[A.col[j]]
6:     end for
7: end for
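For concreteness, the following C sketch shows a minimal CSR container and the kernel corresponding to Algorithm 1. The struct and field names (csr_matrix, row_nums, rowptr, col, val) are illustrative and mirror the notation of Algorithm 1 rather than reproducing the authors' code; the static schedule is also an assumption, since Algorithm 1 does not specify one.

#include <omp.h>

/* Minimal CSR container; field names follow Algorithm 1. */
typedef struct {
    int     row_nums;  /* number of rows                */
    int    *rowptr;    /* row_nums + 1 row pointers     */
    int    *col;       /* column index of each non-zero */
    double *val;       /* value of each non-zero        */
} csr_matrix;

/* y = A * x, parallelized over rows as in Algorithm 1. */
void spmv_csr_omp(const csr_matrix *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->row_nums; i++) {
        double sum = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; j++)
            sum += A->val[j] * x[A->col[j]];
        y[i] = sum;
    }
}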

Fig. 1. A high-level view of the Phytium 2000+ architecture. Processor cores are grouped into panels (left), where each panel contains eight ARMv8-based Xiaomi cores (right).

        node0  node1  node2  node3  node4  node5  node6  node7
node0   16.83  15.88  13.88  14.89  15.14  14.58  12.20  13.15
node1   15.53  16.78  14.55  15.28  14.93  15.27  13.49  14.34
node2   14.29  14.89  16.90  15.63  12.80  13.35  15.51  14.06
node3   14.92  15.56  15.98  17.39  13.37  14.55  14.50  15.29
node4   15.40  14.01  12.59  12.64  17.78  15.76  13.94  14.62
node5   13.91  14.92  13.29  13.75  16.13  17.30  14.75  15.42
node6   11.93  12.57  14.77  14.48  13.75  14.43  16.71  16.22
node7   13.04  13.65  14.94  15.44  14.49  15.50  15.55  16.89

Fig. 2. STREAM triad bandwidth test on Phytium 2000+, for all pairs of NUMA nodes (values in GB/s). The diagonal gives the bandwidth within a single node.

Xiaomi cores running at 2.2 GHz. The entire chip offers a peak performance of 563.2 GFlops for double-precision operations, with a maximum power consumption of 96 Watts. The 64 hardware cores are organized into 8 panels, where each panel connects to a memory control unit.

The panel architecture of Phytium 2000+ is shown in the right part of Figure 1. Each panel has eight Xiaomi cores, and each core has a private 32KB L1 cache for data and a private 32KB L1 cache for instructions. Every four cores form a core group and share a 2MB L2 cache. The L2 cache of Phytium 2000+ uses an inclusive policy, i.e., the cachelines in L1 are also present in L2.

Each panel contains two Directory Control Units (DCUs) and one routing cell. The DCUs on each panel act as directory nodes of the entire on-chip network. With these function modules, Mars II implements a hierarchical on-chip network, with a local interconnect on each panel and a global interconnect for the entire chip. The former couples cores and L2 cache slices into a local cluster, achieving good data locality. The latter is implemented by a configurable cell-network that connects the panels for better scalability. Phytium 2000+ uses a home-grown Hawk cache coherency protocol to implement distributed directory-based global cache coherency across panels.

We run a customized Linux OS with Linux kernel v4.4 on the Phytium 2000+ system, and use the gcc v8.2.0 compiler with the OpenMP and POSIX threading models.

We measure the NUMA-aware bandwidth of Phytium 2000+ by running a Pthread version of the STREAM bandwidth code [15] bound to the eight NUMA nodes of the processor. Figure 2 plots the triad bandwidth for all pairs of NUMA nodes. It can be seen that the bandwidth obtained inside the same NUMA node is the highest, while the bandwidth between different nodes is noticeably lower. On Phytium 2000+, the maximum bandwidth within a single node is 17.78 GB/s, while the bandwidth of cross-node accesses can drop to only 11.93 GB/s. These benchmark results further motivate us to design a NUMA-Aware SpMV approach that better utilizes the NUMA architecture.
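As an illustration of how such per-node-pair measurements can be taken, the sketch below combines libnuma thread and memory placement (numa_run_on_node, numa_alloc_onnode, numa_free) with a STREAM-style triad loop to produce one cell of the matrix in Figure 2. It is a single-threaded simplification under an assumed array size, not the multi-threaded Pthread STREAM code actually used for the figure.

#include <numa.h>
#include <omp.h>
#include <stdio.h>

#define N (64L * 1024 * 1024)   /* 64M doubles per array (assumption) */

/* Triad bandwidth with the thread running on 'cpu_node' and the
 * arrays allocated on 'mem_node': one cell of the Figure 2 matrix. */
static double triad_bandwidth(int cpu_node, int mem_node)
{
    numa_run_on_node(cpu_node);                       /* pin execution */
    double *a = numa_alloc_onnode(N * sizeof(double), mem_node);
    double *b = numa_alloc_onnode(N * sizeof(double), mem_node);
    double *c = numa_alloc_onnode(N * sizeof(double), mem_node);

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    for (long i = 0; i < N; i++)                      /* STREAM triad  */
        a[i] = b[i] + 3.0 * c[i];
    double t1 = omp_get_wtime();

    numa_free(a, N * sizeof(double));
    numa_free(b, N * sizeof(double));
    numa_free(c, N * sizeof(double));
    /* three arrays of N doubles are streamed by the triad loop */
    return 3.0 * N * sizeof(double) / (t1 - t0) / 1e9;
}

int main(void)
{
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    int nodes = numa_max_node() + 1;
    for (int i = 0; i < nodes; i++)
        for (int j = 0; j < nodes; j++)
            printf("cpu node %d, mem node %d: %.2f GB/s\n",
                   i, j, triad_bandwidth(i, j));
    return 0;
}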

2.3 Hypergraph Partitioning

A hypergraph can be seen as a generalized form of a graph. A hypergraph is often denoted by H = (V,E), where V is a vertex set and E is a hyperedge set. A hyperedge can link more than two vertices, which is different from an edge in a graph [18]. A hypergraph can also correspond to a sparse matrix through the row-net model (columns and rows are vertices and hyperedges, respectively).

The hypergraph partitioning problem divides a hypergraph into a given number of partitions, such that each partition includes roughly the same number of vertices while the connections between different partitions are minimized. Thus, when hypergraph partitioning is used for distributed SpMV, both load balancing (because the sizes of the partitions are almost the same) and fewer remote memory accesses (because connections between partitions are reduced) can be obtained for better performance [4,17,22]. In this work, we use PaToH v3.0 [19,23] as the hypergraph partitioning tool for preparing the sparse matrices for our NUMA-Aware SpMV implementation.
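Under the row-net model, a matrix stored in CSR already encodes its hypergraph: each column is a vertex and each row is a net (hyperedge) containing the columns that have a non-zero in that row. The C sketch below makes this mapping explicit; the xpins/pins naming is illustrative and is not necessarily PaToH's actual input interface.

/* Row-net hypergraph of a CSR matrix: columns are vertices, rows are nets.
 * Net i contains the vertices col[rowptr[i]] .. col[rowptr[i+1]-1]. */
typedef struct {
    int        num_vertices;  /* number of columns of A          */
    int        num_nets;      /* number of rows of A             */
    const int *xpins;         /* net start offsets (rowptr of A) */
    const int *pins;          /* concatenated vertex lists (col) */
} rownet_hypergraph;

rownet_hypergraph csr_to_rownet(int nrows, int ncols,
                                const int *rowptr, const int *col)
{
    rownet_hypergraph h = { ncols, nrows, rowptr, col };
    return h;
}

A k-way partitioner such as PaToH then assigns each vertex (column) to one of k parts; reordering rows and columns by part yields the block structure shown in Figure 3(c).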

Fig. 3. (a) An original sparse matrix A of size 16x16; (b) a row-net representation of the hypergraph H of matrix A, and a four-way partitioning of H; (c) the matrix A reordered according to the hypergraph partitioning. As can be seen, the non-zero entries in the reordered matrix are gathered onto the diagonal blocks, meaning that the number of connections between the four partitions is reduced.

3 NUMA-Aware SpMV Algorithm

The conventional way of parallelizing SpMV is to assign distinct rows to different threads. But the irregularity of accessing x through the indirect indices of the non-zeros of A can significantly degrade the overall performance. To address this issue, considering the memory access characteristics of the NUMA architecture, storing a group of rows of A and the part of the vector x most needed by those rows on the same NUMA node should in general bring a performance improvement. In this way, cross-node memory accesses (i.e., accessing the elements of x not stored on the local node) can be largely avoided.

To this end, we first partition the hypergraph form of the sparse matrix. Figure 3 plots an example showing the difference between a matrix before and after reordering according to hypergraph partitioning. It can be seen that some of the non-zero elements of the matrix move to the diagonal blocks: the number of non-zeros in the off-diagonal blocks is 48 in Figure 3(a), but is reduced to 38 in Figure 3(c).
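The off-diagonal counts quoted above (48 before reordering versus 38 after) can be recomputed from a block assignment; the sketch below, with an assumed part array mapping each row/column index of the reordered matrix to its block, counts the non-zeros falling outside the diagonal blocks.

/* Count non-zeros outside the diagonal blocks: a non-zero (i, j) is
 * off-diagonal when the block of row i differs from the block of
 * column j. 'part' maps every row/column index to a block id. */
long count_offdiag_nonzeros(int nrows, const int *rowptr,
                            const int *col, const int *part)
{
    long off = 0;
    for (int i = 0; i < nrows; i++)
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            if (part[i] != part[col[j]])
                off++;
    return off;
}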

After the reordering, the matrix is divided into sub-matrices, i.e., row blocks, and the vector is also divided into sub-vectors. For example, the matrix in Figure 3 now includes four sub-matrices of four rows each. Then we use the memory allocation API in libnuma-2.0 [2] to allocate memory for the sub-matrices and sub-vectors on each NUMA node. Figure 4 demonstrates this procedure, and Algorithm 2 lists the pseudocode. As can be seen, the local memory of each NUMA node contains one sub-matrix of four rows and one sub-vector of four elements. In Algorithm 2, start and end (line 3) represent the beginning and the end of the row range of each thread, and Xpos and remainder (line 6) are used to locate the elements of x needed for the local calculation.
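A minimal sketch of this per-node placement, assuming the libnuma allocation call numa_alloc_onnode from [2]; the node_block struct and its field names are illustrative, not the paper's data structures.

#include <numa.h>

/* One row block of the reordered A plus the sub-vectors owned by a node. */
typedef struct {
    int     first_row, last_row;  /* row range [first_row, last_row)   */
    int    *subrowptr;            /* local row pointers                */
    int    *subcol;               /* column indices (still global)     */
    double *subval;               /* non-zero values                   */
    double *subx;                 /* locally owned part of x           */
    double *suby;                 /* locally owned part of y           */
} node_block;

/* Allocate all arrays of one block directly on NUMA node 'node'. */
void alloc_block_on_node(node_block *b, int node,
                         int nrows, int nnz, int xlen)
{
    b->subrowptr = numa_alloc_onnode((nrows + 1) * sizeof(int), node);
    b->subcol    = numa_alloc_onnode(nnz   * sizeof(int),    node);
    b->subval    = numa_alloc_onnode(nnz   * sizeof(double), node);
    b->subx      = numa_alloc_onnode(xlen  * sizeof(double), node);
    b->suby      = numa_alloc_onnode(nrows * sizeof(double), node);
}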

Fig. 4. The computational process of the NUMA-Aware SpMV algorithm on a 16-core four-node NUMA system. The four row blocks of the sparse matrix in Figure 3(c) are allocated onto the four nodes, and the four sub-vectors are also evenly distributed. When computing SpMV, the cross-node memory accesses for remote parts of x are visualized by lines with arrows.

Once the matrix and vector are allocated, our algorithm creates Pthread threads and binds them to the corresponding NUMA nodes (lines 10-12 in Algorithm 2). For the example in Figure 4, we issue and bind four threads to each NUMA node and let each of them compute a row of the sub-matrix allocated on the same NUMA node (lines 1-9). Since the memory on a node only stores a part of the full vector, the threads access both the local sub-vector and remote ones. In the example in Figure 4, the threads on NUMA node 0 also access the sub-vectors stored on nodes 1 and 4.

Algorithm 2 A NUMA-Aware Pthread implementation of parallel SpMV.

1: function SpMV
2:     numa_run_on_node(numanode)
3:     for i = start → end do
4:         suby[i] = 0
5:         for j = A.subrowptr[i] → A.subrowptr[i + 1] do
6:             suby[i] = suby[i] + A.subval[j] ∗ subx[Xpos][remainder]
7:         end for
8:     end for
9: end function

10: for i = 0 → thread_nums in parallel do
11:     pthread_create(&thread[i], NULL, SpMV, (void∗)parameter)
12: end for
13: for i = 0 → thread_nums in parallel do
14:     pthread_join(thread[i], NULL)
15: end for
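The C sketch below fills in the launch-and-bind pattern of Algorithm 2 (lines 10-15) together with a simplified worker (lines 1-9). The spmv_arg struct is illustrative, and the Xpos/remainder lookup is resolved here by assuming an even split of x across the nodes; the paper's actual implementation may compute these differently.

#include <numa.h>
#include <pthread.h>
#include <stdlib.h>

/* Arguments for one worker thread; illustrative names. */
typedef struct {
    int       numanode;       /* node this thread is bound to            */
    int       start, end;     /* rows of the local block for this thread */
    int      *subrowptr;      /* CSR arrays of the local row block       */
    int      *subcol;
    double   *subval;
    double  **subx;           /* sub-vectors of x, one pointer per node  */
    double   *suby;           /* local part of the result vector         */
    int       rows_per_node;  /* assumed even split of x across nodes    */
} spmv_arg;

/* Worker: the body of Algorithm 2, lines 1-9. */
static void *spmv_worker(void *p)
{
    spmv_arg *a = (spmv_arg *)p;
    numa_run_on_node(a->numanode);          /* bind execution to the node */
    for (int i = a->start; i < a->end; i++) {
        double sum = 0.0;
        for (int j = a->subrowptr[i]; j < a->subrowptr[i + 1]; j++) {
            int xpos      = a->subcol[j] / a->rows_per_node;  /* owning node  */
            int remainder = a->subcol[j] % a->rows_per_node;  /* local offset */
            sum += a->subval[j] * a->subx[xpos][remainder];
        }
        a->suby[i] = sum;
    }
    return NULL;
}

/* Launch and join, mirroring lines 10-15 of Algorithm 2. */
void numa_spmv(spmv_arg *args, int thread_nums)
{
    pthread_t *threads = malloc(thread_nums * sizeof(pthread_t));
    for (int i = 0; i < thread_nums; i++)
        pthread_create(&threads[i], NULL, spmv_worker, &args[i]);
    for (int i = 0; i < thread_nums; i++)
        pthread_join(threads[i], NULL);
    free(threads);
}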

4 Performance and Analysis

4.1 Experimental Setup and Dataset

In this work, we benchmark 15 sparse matrices from the SuiteSparse Matrix Collection [3] (formerly known as the University of Florida Sparse Matrix Collection). The 15 sparse matrices include seven regular and eight irregular ones. The classification is mainly based on the distribution of the non-zero elements: the non-zero elements of regular matrices are mostly located on diagonals, while those of irregular matrices are distributed in a fairly random way. Table 1 lists the 15 original sparse matrices and their reordered forms generated by the PaToH library. Each matrix is divided into 2, 4, 8, 16, 32 and 64 partitions, and reordered according to the partitions. In terms of the sparsity structures of these matrices, as the number of partitions increases, the non-zero elements move towards the diagonal blocks.

We measure the performance of OpenMP SpMV and NUMA-Aware SpMV on Phytium 2000+. The total number of threads is set to 8, 16, 32 and 64, and the threads are created and pinned to NUMA nodes in an interleaved way. We run each SpMV 100 times and report the average execution time. For both algorithms, we run the original matrices and their reordered forms according to hypergraph partitioning.
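For reference, the GFlop/s values reported in Figure 5 can be derived with the usual SpMV convention of two floating-point operations (one multiply, one add) per non-zero; the helper below is a minimal sketch of that measurement, averaging over the 100 runs mentioned above. The function-pointer interface is purely illustrative.

#include <omp.h>

/* Average GFlop/s of an SpMV kernel over 'runs' executions,
 * counting 2 * nnz floating-point operations per run. */
double spmv_gflops(void (*spmv_run)(void), long nnz, int runs)
{
    double t0 = omp_get_wtime();
    for (int r = 0; r < runs; r++)
        spmv_run();                              /* kernel under test */
    double avg_time = (omp_get_wtime() - t0) / runs;
    return 2.0 * nnz / avg_time / 1e9;
}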

4.2 NUMA-Aware SpMV Performance

Figure 5 shows the performance comparison of OpenMP SpMV and NUMA-Aware SpMV on the 15 test matrices. Compared to OpenMP SpMV, our NUMA-Aware SpMV obtains an average speedup of 1.76x, and the best speedup is 2.88x (for matrix M6). We see that both regular and irregular matrices obtain significant speedups: the average speedup of the irregular matrices is 1.91x, and that of the regular matrices is 1.59x.

The experimental data show that hypergraph partitioning greatly improves our NUMA-Aware SpMV, but has little impact on OpenMP SpMV. Moreover, for the same matrix, different numbers of partitions bring noticeably different performance. It can also be seen that, after partitioning, the more non-zero elements the diagonal blocks have, the better the performance.

Specifically, as shown in Figure 6, the number of remote memory accesses decreases continuously as the number of blocks increases. Before partitioning matrix M6, SpMV needs a total of 16,242,632 cross-node accesses to the vector x. When the matrix is divided into 64 partitions, that number drops to 1,100,551, i.e., the number of cross-node accesses decreases by 93.22%. As can be seen from Table 1, the non-zero elements of the partitioned matrix move towards the diagonal. As for the matrix circuit5M, the best performance is achieved with 16 partitions, where the number of cross-node accesses drops by 65% compared to the original form.

Table 1. Sparsity structures of the original and reordered sparse matrices. The top seven are regular matrices and the bottom eight are irregular ones. (The columns of the original table show sparsity plots of each matrix in its original form and after reordering with 2, 4, 8, 16, 32 and 64 partitions; only the matrix names and sizes are reproduced here.)

Matrix name       #rows   #non-zeros
Transport          1.6M      23.4M
af_shell6          0.5M      17.5M
bone010            0.9M      47.8M
x104               0.1M       8.7M
ML_Laplace         0.3M      27.5M
pre2               0.6M       5.8M
Long_Coup_dt0      1.4M      84.4M
circuit5M          5.5M      59.5M
NLR                4.1M      24.9M
cage15             5.1M      99.1M
dielFilterV3real   1.1M      89.3M
germany_osm       11.5M      24.7M
M6                 3.5M      21.0M
packing-500        2.1M      34.9M
road_central      14.0M      33.8M

Fig. 5. Performance of OpenMP SpMV (left heatmap in each subfigure) and NUMA-Aware SpMV with thread binding (right heatmap, labeled Pthread-bind) on Phytium 2000+. In each subfigure, the x-axis is the number of threads (8, 16, 32, 64) and the y-axis is the number of partitions (original, 2, 4, 8, 16, 32, 64). The heatmap values are in single-precision GFlop/s. Subfigures: (a) Transport, (b) af_shell6, (c) bone010, (d) x104, (e) ML_Laplace, (f) pre2, (g) Long_Coup_dt0, (h) circuit5M, (i) NLR, (j) cage15, (k) dielFilterV3real, (l) germany_osm, (m) M6, (n) packing-500, (o) road_central.

Fig. 6. Comparison of communication volume before and after partitioning under different numbers of blocks, for matrices M6 (left) and circuit5M (right). The x-axis is the number of partitioned blocks (2, 4, 8, 16, 32, 64) and the y-axis is the communication volume; each plot compares the original matrix with the partitioned matrix.

Fig. 7. The ratio of the hypergraph partitioning time to a single NUMA-Aware SpMV time, shown per matrix for all 15 matrices. The x-axis represents the number of blocks (2, 4, 8, 16, 32, 64) from the hypergraph partitioning.

4.3 Preprocessing Overhead (Hypergraph Partitioning Runtime)

The preprocessing overhead (i.e., the execution time of the hypergraph partitioning) is another important metric for parallel SpMV. Figure 7 reports the ratio of the running time of the hypergraph partitioning to that of a single NUMA-Aware SpMV. Across the 15 matrices, due to the different distributions of non-zero elements, the best observed performance of NUMA-Aware SpMV is mostly obtained with 16 or 32 partitions. As the number of partitions increases, the partitioning time, and thus the ratio, in general increases as well. Specifically, the matrix pre2 has the maximum ratio of 5749 when the number of hypergraph partitions is 64, while the minimum ratio of 257 is obtained for the matrix af_shell6.

5 Related Work

SpMV has been widely studied in the area of parallel algorithms. A number of data structures and algorithms, such as BCSR [7], CSX [12] and CSR5 [14], have been proposed for accelerating SpMV on a variety of parallel platforms. Williams et al. [20], Goumas et al. [6], Filippone et al. [5], and Zhang et al. [21] evaluated the performance of parallel SpMV on shared-memory processors.

Hypergraph partitioning has received much attention for accelerating SpMV on distributed platforms. In the past couple of decades, researchers have proposed a number of partitioning approaches and models for various graph structures and algorithm scenarios [4,17,22,24,25]. Several software packages containing these algorithms have been developed and are widely used. For example, Karypis et al. developed MeTiS [9,11], ParMeTiS [10] and hMeTiS [8], and Çatalyürek et al. developed PaToH [19,23], which is used in this work.

6 Conclusions

We have presented a NUMA-Aware SpMV approach and benchmarked 15 representative sparse matrices on a Phytium 2000+ processor. The experimental results show that our approach can significantly outperform the classical OpenMP SpMV approach, and that the number of generated hypergraph partitions has a dramatic impact on the SpMV performance.

Acknowledgments

We would like to thank the reviewers for their invaluable comments. Weifeng Liu is the corresponding author of this paper. This research was supported by the Science Challenge Project under Grant No. TZZT2016002, the National Natural Science Foundation of China under Grant Nos. 61972415 and 61972408, and the Science Foundation of China University of Petroleum, Beijing under Grant Nos. 2462019YJRC004 and 2462020XKJS03.

References

1. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D., Plishker, W.L., Shalf, J., Williams, S.: The landscape of parallel computing research: A view from Berkeley. Technical Report, UC Berkeley (2006)

2. Bligh, M.J., Dobson, M.: Linux on NUMA systems. Ottawa Linux Symposium (2004)

3. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software 38(1), 1–25 (2011)

4. Devine, K.D., Boman, E.G., Heaphy, R.T., Bisseling, R.H., Çatalyürek, Ü.V.: Parallel hypergraph partitioning for scientific computing. In: International Parallel & Distributed Processing Symposium (2006)

5. Filippone, S., Cardellini, V., Barbieri, D., Fanfarillo, A.: Sparse matrix-vector multiplication on GPGPUs. ACM Trans. Math. Softw. 43(4) (2017)

6. Goumas, G., Kourtis, K., Anastopoulos, N., Karakasis, V., Koziris, N.: Performance evaluation of the sparse matrix-vector multiplication on modern architectures. Journal of Supercomputing (2009)

7. Im, E.J., Yelick, K., Vuduc, R.: Sparsity: Optimization framework for sparse matrix kernels. Int. J. High Perform. Comput. Appl. 18(1), 135–158 (2004)

8. Karypis, G., Aggarwal, R., Kumar, V., Shekhar, S.: Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 7(1), 69–79 (1999)

9. Karypis, G., Kumar, V.: Analysis of multilevel graph partitioning. In: Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing. pp. 29–29 (1995)

10. Karypis, G., Kumar, V.: Parallel multilevel k-way partitioning scheme for irregular graphs. In: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing. pp. 35–es. Supercomputing '96 (1996)

11. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20(1), 359–392 (1998)

12. Kourtis, K., Karakasis, V., Goumas, G., Koziris, N.: CSX: An extended compression format for SpMV on shared memory systems. In: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming. pp. 247–256. PPoPP '11 (2011)

13. Liu, W.: Parallel and Scalable Sparse Basic Linear Algebra Subprograms. Ph.D. thesis, University of Copenhagen (2015)

14. Liu, W., Vinter, B.: CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In: Proceedings of the 29th ACM International Conference on Supercomputing. pp. 339–350. ICS '15 (2015)

15. McCalpin, J.D.: STREAM: Sustainable memory bandwidth in high performance computers. Tech. rep., University of Virginia, Charlottesville, Virginia (1991-2007), a continually updated technical report

16. Phytium: Mars II - Microarchitectures. https://en.wikichip.org/wiki/phytium/microarchitectures/mars_ii

17. Uçar, B., Aykanat, C.: Partitioning sparse matrices for parallel preconditioned iterative methods. SIAM Journal on Scientific Computing (2007)

18. Uçar, B., Aykanat, C.: Revisiting hypergraph models for sparse matrix partitioning. SIAM Review 49(4), 595–603 (2007)

19. Uçar, B., Çatalyürek, Ü.V., Aykanat, C.: A matrix partitioning interface to PaToH in MATLAB. Parallel Computing 36(5), 254–272 (2010)

20. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing 35(3), 178–194 (2009)

21. Zhang, F., Liu, W., Feng, N., Zhai, J., Du, X.: Performance evaluation and analysis of sparse matrix and graph kernels on heterogeneous processors. CCF Transactions on High Performance Computing pp. 131–143 (2019)

22. Çatalyürek, Ü.V., Aykanat, C.: Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems 10(7), 673–693 (1999)

23. Çatalyürek, Ü.V., Aykanat, C.: PaToH (Partitioning Tool for Hypergraphs). Encyclopedia of Parallel Computing pp. 1479–1487 (2011)

24. Çatalyürek, Ü.V., Aykanat, C., Uçar, B.: On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM Journal on Scientific Computing 32(2), 656–683 (2010)

25. Çatalyürek, Ü.V., Boman, E.G., Devine, K.D., Bozdağ, D., Heaphy, R.T., Riesen, L.A.: A repartitioning hypergraph model for dynamic load balancing. Journal of Parallel and Distributed Computing (2009)