

Optimization of GPU-based Sparse Matrix

Multiplication for Large Sparse Networks

Jeongmyung Lee, Seokwon Kang, Yongseung Yu, Yong-Yeon Jo, Sang-Wook Kim, Yongjun Park

Department of Computer Science

Hanyang University, Seoul, Korea

{jeongmyung, kswon0202, dydtmd1991, jyy0430, wook, yongjunpark}@hanyang.ac.kr

Abstract—Sparse matrix multiplication (spGEMM) is widely used to analyze sparse network data and extract important information based on matrix representation. As it contains a high degree of data parallelism, many efficient implementations using data-parallel programming platforms such as CUDA and OpenCL have been introduced on graphic processing units (GPUs). Several well-known spGEMM techniques, such as cuSPARSE and CUSP, often do not utilize the GPU resources fully, owing to the load imbalance between threads in the expansion process and high memory contention in the merge process. Furthermore, even though several outer-product-based spGEMM techniques have been proposed to solve the load balancing problem in expansion, they still do not utilize the GPU resources fully, because severe computation load variations exist among the multiple thread blocks.

To solve these challenges, this paper proposes a new optimization pass called Block Reorganizer, which balances the total computations of each computing unit on target GPUs, based on the outer-product-based expansion process, and reduces the memory pressure during the merge process. For expansion, it first identifies the actual computation amount for each block, and then performs two thread block transformation processes based on their characteristics: 1) B-Splitting to transform a heavy-computation block into multiple small blocks and 2) B-Gathering to aggregate multiple small-computation blocks into a larger block. While merging, it improves the overall performance by performing B-Limiting to limit the number of blocks on each computing unit. Experimental results show that it improves the total performance of kernel execution by 1.43x, on average, when compared to the row-product-based spGEMM, for NVIDIA Titan Xp GPUs on real-world datasets.

Index Terms—Sparse matrix multiplication; sparse network; GPU; linear algebra

I. INTRODUCTION

Matrix multiplication is one of the core kernels in various

data-mining applications, such as social network services

(SNSs) and graph analytics, and is used to extract key informa-

tion. Based on the rapid growth of the size of sparse networks,

the extraction of valuable information required for various

operations, such as ranking [1], similarity computation [2],

[3], and recommendation [4], [5], has become a critical

challenge. Weighted graphs are typically used to model such

network data and are represented in matrix forms, where each

element contains an edge weight between two nodes. Matrix

multiplication based on the adjacent matrix format is widely

used to extract useful information from original data.

Because matrix multiplication is a data-parallel operation,

graphic processing units (GPUs) are considered to be the most

appropriate accelerators for their speed-up by providing high

computational throughput using single-instruction, multiple-

thread (SIMT) programming models, such as CUDA [6]

and OpenCL [7]. A GPU generally consists of a set of

Streaming Multiprocessors (SMs). OpenCL/CUDA programs

are executed on GPUs by allocating Thread Blocks (TBs) or

Cooperative Thread Arrays (CTAs) 1, which are groups of

threads, to each SM in parallel.

The main challenge is developing an efficient matrix multi-

plication technique considering the data-specific characteristics

of sparsity and power-law degree distribution [8]. Typical

sparse networks contain a much smaller number of edges with

non-zero values, compared to the number of all possible edges

between nodes, and therefore, most of the elements in a sparse

matrix have a value of zero. To reduce memory waste caused

by sparsity, matrices are typically represented in the sparse

format [9]. Sparse networks also commonly have power-law

distributions [8], where a very small number of hub nodes

have extremely large numbers of connections and most other

nodes have very small numbers of connections. Based on

the power-law, the distribution of non-zero elements is often

highly skewed, and the resulting matrices for sparse networks

generally contain a few rows with large numbers of non-zero

elements while a large number of rows have a few non-zero

elements.

There have been several previous studies on implement-

ing efficient sparse matrix multiplication (spGEMM) for

two sparse matrices on GPUs, including cuSPARSE [10]

and CUSP [11]. These techniques generally consist of row-

product-based intermediate data expansion and parallel data

merge processes. Despite their promising performance, GPU

resources are still not fully utilized. First, the row-product-

based expansion process often leads to poor load balancing

among threads due to the irregular distributions of target sparse

networks. Second, excessive memory accesses during the parallel merge process frequently degrade performance below expectations because of significant memory contention. Although several improved

row-product-based techniques, such as bhSPARSE [12], have

recently been introduced, experimental results have shown that

they still suffer from the poor thread-level load-balancing problem

of the row-product-based scheme and the high performance

overhead during the merge process while performing multiplication on highly irregular matrices.

1 In this work, we use the terms thread block and CTA interchangeably.

To overcome these limitations, several new spGEMM ap-

proaches have been introduced by adopting the outer-product

(column-row product) scheme [13], [14]. Outer-product-based

expansion is expected to produce higher performance than

row-product-based expansion, because the computational loads

of all threads in a TB are identical. However, the outer-

product is not yet an ideal solution. First, the outer-product

algorithm creates another load imbalance problem among SMs

because of the high block-level workload variance. In the

outer-product scheme, each TB is formulated by a column and

a row of input matrices. Therefore, the resulting TBs consist

of several computation-heavy TBs (overloaded blocks) from

several columns and rows with huge numbers of non-zero

elements, and a massive number of computation-light TBs

(underloaded blocks) with large numbers of zero elements. As

a result, the SMs that execute overloaded blocks can become

a performance bottleneck, while all other SMs are idle.

Second, the outer-product scheme is mainly effective for

expansion, and the merge performance remains the same or

might even become worse, because it produces intermediate

results in a matrix form during expansion, whereas the row-

product produces the intermediate results in a single row

form [15]. Therefore, full matrix-wise accumulation may be

slower than row-wise accumulation owing to the additional

column address indexing.

To address the limitations, we propose a novel outer-

product-based spGEMM optimization pass referred to as the

Block Reorganizer. It first identifies the computation amount

of each block and categorizes the blocks as overloaded blocks,

normal blocks, and underloaded blocks, based on their compu-

tational loads. It then performs two different optimizations in

the expansion process: Block Splitting for overloaded blocks

and Block Gathering for underloaded blocks. Block Splitting is

the process of dividing an overloaded block into multiple small

blocks for better load balancing. For underloaded blocks, the

Block Reorganizer performs the Block Gathering process by

creating a combined block from multiple underloaded blocks

to increase intra-SM computation unit utilization and improve

latency hiding efficiency via fast context-switching support.

After executing all operations to produce intermediate results

during the expansion process, Block Limiting is applied to

improve performance further during the merge process. Block

Limiting is the process where each merging block is forced

to execute solely on the allocated SM in order to minimize

resource contention.

This paper provides the following three contributions:

• An in-depth analysis of the inefficient resource utilization

of outer-product operations on GPUs including expansion

and merge processes on real-world datasets.

• The design of a novel optimization framework for ef-

ficient sparse matrix multiplication based on the outer-

product scheme. To achieve this objective, we offer three

key techniques:

1) Block Splitting: it divides original blocks into several small blocks for better load balancing.

2) Block Gathering: it merges several underloaded blocks into a combined block for better SM resource utilization and latency hiding effectiveness.

3) Block Limiting: it prevents the blocks from executing with other blocks on an SM for minimizing resource contention.

• An extensive evaluation of the effectiveness of the Block Reorganizer framework using synthetic and real-world datasets on multiple target GPUs.

Fig. 1: (a) A GPU architecture overview and (b) an effect of shared memory requirement per thread block on thread block allocation.

II. BACKGROUND

A. GPU Architectures and SIMT Programming Model

GPUs are accelerators that provide high throughput by

maximizing data parallelism using SIMT programming models such as CUDA [6] and OpenCL [7], which enable

multiple independent threads to execute the same instructions

concurrently. In such programming languages, a thread is the

basic unit of execution, and several threads are grouped into

TBs or CTAs. A TB is the main scheduling unit for execution

on GPUs, and the threads within a TB are affected by

barrier operations for synchronization. For NVIDIA GPUs in

particular, a number of threads (typically 32) are also grouped

into another scheduling unit, called a warp. In NVIDIA GPUs,

the threads in a warp are executed in lock-step similar to SIMD

accelerators [16].

To support such operations efficiently, recent GPUs have

been equipped with multiple SMs to execute the kernel in-

structions of allocated TBs in an SIMD manner. Each SM

contains multiple computing cores, a large register file, an L1

cache, and a shared memory, as shown in Figure 1 (a). To

hide memory access latency, GPUs also allow fast context

switching between warps. Thus, GPUs attempt to allocate the

maximum allowable number of threads to an SM within the

resource limit.

The number of threads allocated to an SM is limited by

resource usage (e.g., shared memory and register files). For

example, the shared memory requirement for each TB can

change the total number of allowable TBs on an SM, as shown

in Figure 1 (b). Although the number of threads in a TB is determined statically, not all threads are always executed identically, owing to branch divergence. In this paper, we refer to a thread with real computations as an effective thread, and a thread without real computations as a non-effective thread.

Fig. 2: (a) Example input matrices, (b) sparse matrix formats (CSR/CSC), (c) row-product, and (d) outer-product.

B. Sparse Matrix Multiplication

1) Sparse matrix format: The dense-format-based repre-

sentation of sparse matrices with few non-zero elements

incurs high memory space inefficiency owing to massive

storage requirements for zero elements. Thus, compressed

sparse row/column (CSR/CSC) formats without zero values are

generally used for sparse matrix representation [9] 2. As shown

in Figure 2 (b), the CSR format consists of three arrays. The

val array stores the value of non-zero elements, the idx array

stores column indices and the ptr array stores row pointers,

which indicate the first element locations within the rows. The

CSC format has the same structure, but stores elements in

column-major order, while CSR is based on row-major order.

The CSC format stores row indices in the idx array and column

pointers in the ptr array. The size of the ptr array is N + 1, where N is the number of rows/columns (CSR/CSC) of a target matrix, and the sizes of the val and idx arrays are nnz (the number of non-zero elements).
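For concreteness, the following minimal host-side sketch (C++/CUDA host code) shows how a small 4 × 4 sparse matrix would be stored in these three CSR arrays; the matrix values, field names, and struct layout are illustrative assumptions rather than the paper's implementation.

#include <cstdio>

// Minimal CSR container (host side); field names are illustrative.
struct CSR {
    int    n;    // number of rows
    int   *ptr;  // row pointers, length n + 1
    int   *idx;  // column indices of non-zeros, length nnz
    float *val;  // non-zero values, length nnz
};

int main() {
    // Example 4x4 matrix (only non-zeros shown):
    // row 0: (0,0)=1, (0,2)=2;  row 1: (1,1)=3;  row 3: (3,0)=4, (3,3)=5
    int   ptr[] = {0, 2, 3, 3, 5};            // n + 1 = 5 entries
    int   idx[] = {0, 2, 1, 0, 3};
    float val[] = {1.f, 2.f, 3.f, 4.f, 5.f};
    CSR a{4, ptr, idx, val};

    // Enumerate the non-zeros of row 3 through the ptr array.
    for (int k = a.ptr[3]; k < a.ptr[4]; ++k)
        printf("A[3][%d] = %.0f\n", a.idx[k], a.val[k]);
    return 0;
}

The CSC representation of the same matrix uses identical arrays but with column pointers in ptr and row indices in idx.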

2) Matrix multiplication algorithms: Matrix multiplication

(C = AB) is an operation to produce an output matrix C (N × M) from two input matrices A (N × K) and B (K × M) 3. For sparse matrix multiplication, a basic

method based on the dot product is not well matched because

it requires index matching, which is not appropriate for sparse

matrix format.

$c_{i*} = \sum_{j=0}^{N-1} a_{ij} \times b_{j*}$   (1)

For sparse matrix multiplications, several libraries, such as

cuSPARSE [10] or CUSP [11], are implemented based on row-

product schemes, where input matrices are CSR-formatted.

As shown in Equation (1), the output row c_{i*} is obtained by performing a row-row product calculation between the ith row a_{i*} of A and all corresponding rows in B. In GPUs, a TB

generally performs all the row-products for a row in A, and all

the corresponding rows in B. Therefore, the row-product based

method has a high probability of creating poor load balancing

2 General CSR/CSC formats do not require a sorted order of column/row indices within a row/column, and this work produces the final result in unordered CSR format [9].

3 Note that we used C = A^2 multiplication for the base evaluation, where the numbers of rows and columns are the same for the input matrix A.

Algorithm 1 Outer-product based spGEMM pseudocode

for i := 0 to N − 1 do
  for a_idx := a.ptr[i] to a.ptr[i+1] do
    row ← a.idx[a_idx]
    offset ← c.ptr_e[row]
    c.ptr_e[row] ← c.ptr_e[row] + b.ptr[i+1] − b.ptr[i]
    for b_idx := b.ptr[i] to b.ptr[i+1] in parallel do
      c_idx ← c.ptr[row] + offset + (b_idx − b.ptr[i])
      c.val[c_idx] ← a.val[a_idx] ∗ b.val[b_idx]
      c.idx[c_idx] ← b.idx[b_idx]
    end for
  end for
end for

between the threads within a block when there is a high

variance in the number of non-zero elements between the rows

in B, as illustrated in Figure 2 (c). In such a case, only some

threads perform numerous computations, while most threads

are idle or finish early after a small number of computations.

$C_i = a_{*i} \times b_{i*}$   (2)

As shown in Equation (2), unlike the row-product, the

outer-product-based scheme produces a partial matrix C_i by calculating a column-by-row product between the ith column a_{*i} of A and the ith row b_{i*} of B. As shown in Figure 2

(d), all elements in the input column are multiplied by the

same row, and therefore, every thread has the same number

of computations. Therefore, the outer-product based method

does not create the load balancing problem within a TB when

executing on GPUs. Based on the data access pattern, the CSC

format is used for matrix A and the CSR format is used for

matrix B.

Both row-product and outer-product methods can generate

multiple elements with the same index over partial results.

Therefore, such methods require a specific merge-phase to

accumulate the elements with the same index into a single ele-

ment after the generation of intermediate results in expansion-

phase. In this work, we denote the intermediate result matrix,

which allows multiple elements with the same indices between

expansion-phase and merge-phase, as C. Algorithm 1 presents

the pseudocode of the outer-product-based expansion-phase

algorithm for generating C_i for every index i. In the algorithm, c.ptr_e is an array that stores the nnz filled so far for each row and is updated using an atomic function to manage parallel execution.
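The pseudocode maps naturally onto one thread block per (a_{*i}, b_{i*}) pair. The following CUDA kernel is a minimal sketch of such a mapping under the CSC(A)/CSR(B) layout described above; the identifiers (e.g., c_ptr_e) and the block-level atomic reservation of output slots are assumptions for illustration and do not reproduce the paper's actual kernel.

// One thread block per column/row pair i (outer-product expansion sketch).
__global__ void outer_product_expand(const int* a_ptr, const int* a_idx, const float* a_val,
                                     const int* b_ptr, const int* b_idx, const float* b_val,
                                     const int* c_ptr, int* c_ptr_e,
                                     int* c_idx, float* c_val)
{
    int i       = blockIdx.x;               // pair index: column of A, row of B
    int b_begin = b_ptr[i];
    int b_len   = b_ptr[i + 1] - b_begin;   // effective threads = nnz(b_{i*})

    // Elements of column a_{*i} are handled one after another; the row b_{i*}
    // is multiplied in parallel by the threads of the block.
    for (int a_k = a_ptr[i]; a_k < a_ptr[i + 1]; ++a_k) {
        int   row = a_idx[a_k];
        float av  = a_val[a_k];

        __shared__ int offset;
        if (threadIdx.x == 0)               // reserve b_len output slots in this row
            offset = atomicAdd(&c_ptr_e[row], b_len);
        __syncthreads();

        for (int t = threadIdx.x; t < b_len; t += blockDim.x) {
            int out    = c_ptr[row] + offset + t;
            c_val[out] = av * b_val[b_begin + t];
            c_idx[out] = b_idx[b_begin + t];
        }
        __syncthreads();                    // offset is reused in the next iteration
    }
}

Every thread of the block performs the same number of multiplications, which is exactly the thread-level load balance that the outer-product scheme provides.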

III. MOTIVATION

A. Limitations of Current Approaches

Since the distribution patterns of sparse matrices are diverse,

a main challenge for the spGEMM performance improvement

on GPUs is to achieve high resource utilization. As shown

in Figure 2, thread-level load balancing in a TB can be

achieved by adopting the outer-product scheme, whereas the

row-product method suffers from intra-SM load imbalance.

However, there are three major remaining problems leading to

poor resource utilization.


Fig. 3: (a) Execution time variance of outer-product-based

spGEMM between SMs (Titan XP), (b) thread block distribu-

tion at different number of effective threads, and (c) execution

time distribution at expansion and merge processes.

1) Overloaded block: As discussed in the previous section,

sparse matrices often have a power-law degree distribution,

where some rows and columns related to the hub-nodes

contain massive numbers of non-zero elements, whereas oth-

ers have only a few non-zero elements. Therefore, several

overloaded blocks used to perform multiplications of the

columns and rows related to the hub nodes incur a substantial

amount of computations, while other blocks (underloaded

blocks) perform very few computations. When overloaded

blocks are scheduled to a few SMs and underloaded blocks are

scheduled to the rest of the SMs, the SMs with the underloaded

blocks should remain idle after completing their tasks until

all computations of the overloaded blocks on other SMs are

completed.

Figure 3 (a) presents the variation in the SM-level exe-

cution time of expansion-phase when running outer-product

spGEMM operations in multiple sparse network datasets on

an NVIDIA Titan Xp architecture, containing 30 SMs. In

Figure 3 (a), the execution times for all SMs in the GPU are

presented in descending order for each dataset, and five sparse

matrices on the left have relatively regular distributions, but

the five sparse matrices on the right have skewed distributions.

In this figure, one can see that irregularity leads to high

execution time variation between SMs. When the overloaded

block is scheduled to an SM, the block occupies the SM for

a long period and other small blocks are scheduled to the

remaining available SMs. Workload redistribution from long-

running SMs to idle SMs is therefore the key challenge for

performance improvement on skewed matrices. For example,

SM utilization for the “loc-Gowalla” and “as-Caida” sets is

less than 20% owing to small numbers of long-running SMs.

2) Underloaded block: Another issue is that most

rows/columns in sparse matrices have zero non-zero elements or fewer non-zero elements than the warp size, except for those

rows/columns related to hub nodes. Underloaded blocks for

multiplication of those columns and rows contain small num-

bers of effective threads with small computations, and they

lead to substantial performance degradation on GPUs.

While the five left-hand matrices in Figure 3 (a) exhibit a

fair load balancing of SMs, another inefficiency is generated

by underloaded blocks. In Figure 3 (b), most of the thread

blocks have fewer than 32 effective threads for many matrices.

For this situation, two main reasons exist for the significant

performance degradation in each SM. First, multiple comput-

ing cores within an SM are idle when executing underloaded

blocks with less than 32 threads, because 32 threads are

executed in a lock-step manner, as described in Section II-A.

Second, a memory latency hiding technique with fast context

switching cannot be utilized, because no eligible warps for

context switching exist when a warp stalls for several cycles

owing to the occurrence of a memory access. Therefore, gener-

ating larger blocks by aggregating several underloaded blocks

is highly recommended for further performance enhancement.

3) Overhead on merge: In this work, the merge process

was implemented in a manner similar to the widely used

Gustavson’s dense accumulator algorithm [19], which uses

a temporary array with a length equal to the dimension of

the target matrix. Using the dense accumulator algorithm

gives the advantage of aggregating elements without sorting overhead. To implement the algorithm on GPUs, we used

atomic functions to manage parallel execution. In Figure 3

(c), high merge latency exists when the merge process is

performed for rows with large nnz, because the block requires

a massive number of memory transactions, which can lead to

performance degradation due to significant memory resource

contention. Several recent studies [17], [18] have also reported

that allocating the maximum amount of blocks on GPUs does

not always guarantee the best performance because resource

contention may decrease overall performance when excessive

threads are allocated. Therefore, the over-allocation of merging

blocks on an SM should be avoided.
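As a rough illustration of this merge phase, the sketch below accumulates the intermediate elements of one row into a dense accumulator with atomic additions and then compacts the result back into unordered sparse form; the kernel signature, the per-row dense array, and all identifiers are assumptions rather than the paper's implementation, and a real implementation would only keep accumulators for the rows currently being merged.

// One thread block merges one output row (dense-accumulator sketch).
__global__ void merge_row(const int* c_ptr, const int* c_ptr_e,
                          const int* c_idx, const float* c_val,
                          float* dense,          // per-row accumulator, N zero-initialized floats
                          int N,
                          int* out_nnz, int* out_idx, float* out_val)
{
    int    row = blockIdx.x;
    float* acc = dense + (size_t)row * N;

    // 1) Add every intermediate element of this row into the accumulator
    //    at its column index; duplicates collapse automatically.
    for (int k = c_ptr[row] + threadIdx.x; k < c_ptr[row] + c_ptr_e[row]; k += blockDim.x)
        atomicAdd(&acc[c_idx[k]], c_val[k]);
    __syncthreads();

    // 2) Compact the surviving non-zeros back into sparse (unordered CSR) form.
    for (int col = threadIdx.x; col < N; col += blockDim.x) {
        float v = acc[col];
        if (v != 0.0f) {
            int slot = atomicAdd(&out_nnz[row], 1);
            out_idx[c_ptr[row] + slot] = col;
            out_val[c_ptr[row] + slot] = v;
        }
    }
}

The memory traffic of this merge scales with the row length and with the number of blocks resident on an SM, which is why over-allocating merging blocks can hurt performance.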

B. Beyond Conventional Approaches

Several insights have been derived from comparisons be-

tween several spGEMM algorithms and the analysis of con-

flicts between GPU characteristics and sparse network char-

acteristics. First, an outer-product scheme is a better expan-

sion technique than a row-product scheme owing to superior

thread-level load balancing within a block, but the block-level

load imbalance problem must be solved by considering both

overloaded and underloaded blocks. Second, the performance

of the merge process must be improved as well, by reducing

resource contention by adjusting the block allocation to each

SM.

Based on these insights, we propose several intuitive high-

level solutions for improved spGEMM performance. We first

perform preprocessing to classify column-row product blocks

into three different categories, based on their computational

loads: overloaded, normal, and underloaded blocks. Over-

loaded blocks are then split into multiple small blocks to

be distributed into different SMs. For underloaded blocks,

we improve performance by gathering multiple underloaded blocks into a single combined block, to maximize the number of effective threads. We also improve merge performance by limiting the number of allocated merging blocks on SMs.

Fig. 4: An overview of the Block Reorganizer.

IV. BLOCK REORGANIZER

A. Overview

The Block Reorganizer is an optimization method for accel-

erating sparse matrix multiplication by applying an improved

block-level load balancing mechanism that is adaptive to

sparse network characteristics. The Block Reorganizer is based

on the outer-product scheme, and applies several novel load

balancing techniques, based on an in-depth understanding of

GPU architectures. Figure 4 presents a conceptual view of

the Block Reorganizer that is proposed to improve resource

utilization during both expansion and merge processes.

As shown in Figure 4, the Block Reorganizer first precalcu-

lates the workload sizes of all blocks to perform column-by-

row product. The blocks are then classified into three groups

of overloaded, normal, and underloaded blocks based on the

sizes of their workloads. We will refer to a set of overloaded

column/row pairs having numerous non-zero elements, as a

Dominator. A Low performer is a set of underloaded col-

umn/row pairs that requires only a few computations due to

their insufficient number of effective threads.

Following categorization, dominator pairs are split into

multiple smaller column/row pairs (block splitting). Multiple

underloaded blocks are gathered to generate larger blocks

(block gathering). The newly created combined blocks can

be efficiently executed on GPUs by maximizing thread level

parallelism through both high utilization of in-SM computing

cores and better latency hiding using fast context switching

between warps. After all elements are generated and stored

in the intermediate matrix C, elements with the same indices

are merged to produce the final matrix C. To achieve better

throughput by avoiding excessive memory contention, we

adjust the number of thread blocks allocated to an SM.

B. Precalculation & Workload Categorization

The Block Reorganizer first calculates nnz(C) to allocate the upper-bound memory space for C. There are two different ways to compute the memory space, as shown in Figure 4, and we employ both methods for later optimizations. The row-wise nnz is used to relocate the outer-product's elements with the same row index closer together for a faster merge process. We also calculate

the block-wise nnz for workload classification.

Because of the irregular distributions of sparse networks,

the outer-product of the dominator pair produces a massive

number of non-zero elements compared to the other remaining

pairs. As a single column/row pair operation is assigned to a

single block, the execution time for overloaded blocks can be

much greater than the total execution time for all remaining

blocks. This often leads to poor load balancing between SMs,

and is one of the main causes of performance degradation

in skewed matrices. For low performer pairs, the underuti-

lization of in-SM computing units is another reason for poor

performance. Therefore, different optimization techniques are

required for each column/row pair category.

Based on block-wise nnz estimation, all dominator pairs

are identified from the input matrices (A, B). Because of

the sparse data characteristics, the number of dominator pairs

is typically small, and the threshold ratio for identifying

dominator pairs should be selected carefully. In this study,

blocks that produce more than the threshold number of ele-

ments (threshold = nnz(C)/(#blocks × α)) are classified

as dominators. The criteria for classification can be changed

by adjusting the value of α based on the target sparse network

characteristics. Highly skewed networks can have lower αvalues, but social networks with several medium-size hub-

nodes should have high α values to avoid selecting too many

dominator pairs. The dominators are copied into new tempo-

rary matrices (A′, B′), while blocks with fewer than 32 (the warp size) effective threads are classified as underloaded blocks.
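A host-side sketch of this classification step is shown below; the function and variable names, and the default value of α, are illustrative assumptions.

#include <cstdint>
#include <vector>

// Bin the column/row pairs by their estimated block-wise workload.
struct Bins { std::vector<int> dominator, normal, low_performer; };

Bins classify_blocks(const std::vector<int64_t>& nnz_per_block,     // nnz(a_{*i}) * nnz(b_{i*})
                     const std::vector<int>&     effective_threads, // nnz(b_{i*})
                     double alpha = 8.0)                            // assumed alpha value
{
    int64_t total_nnz = 0;
    for (int64_t v : nnz_per_block) total_nnz += v;

    // threshold = nnz(C) / (#blocks * alpha), as defined above.
    double threshold = (double)total_nnz / (nnz_per_block.size() * alpha);

    Bins bins;
    for (size_t i = 0; i < nnz_per_block.size(); ++i) {
        if ((double)nnz_per_block[i] > threshold)
            bins.dominator.push_back((int)i);        // overloaded block
        else if (effective_threads[i] < 32)          // fewer threads than a warp
            bins.low_performer.push_back((int)i);    // underloaded block
        else
            bins.normal.push_back((int)i);
    }
    return bins;
}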

C. Expansion Optimization

1) Block Splitting: We propose the Block-splitting tech-

nique for better block-level workload balance. Block-splitting

is applied to overloaded blocks that are generated by domi-

nator vectors, in order to distribute heavy workloads evenly

across multiple SMs. As expressed in Equation (2), the outer-

product operations for each pair are independent of each

other, without the possibility of data reuse. Therefore, each pair can be separated and modified without affecting the results of

other blocks. The dominator column vector, which is copied

into temporary matrices A′, is divided into multiple smaller columns by modifying the column pointer values.

Fig. 5: B-Splitting: an overloaded block is split into multiple small blocks.

This then creates a mapper array for storing the mapping between

divided vector pairs. The multiple divided blocks execute their

own products by referencing the mapper array, and therefore,

the overloaded workload can be reallocated to multiple SMs

to achieve fair load balancing. Figure 5 illustrates a detailed

example of the block-splitting process and highlights its effec-

tiveness. First, the dominator vector a∗0 and b0∗ (originally

from input matrices A and B) are copied into matrices A′

and B′. During the splitting process, several elements from

each column vector are shifted to the next vector sequentially.

This operation can be accomplished by simply expanding

the pointer index of the sparse format matrix, as shown in

Figure 5. A mapper array is constructed to track all of the

divided vector pairs to produce the same results as the original

vector pairs. As a result, the overloaded block requiring 25

computations is split into three smaller blocks.
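The following host-side sketch illustrates how such a split can be expressed by rewriting the column pointers of A′ and filling the mapper array; the identifiers and the even-chunk policy are assumptions for illustration only.

#include <algorithm>
#include <vector>

// Split one dominator column of A' (non-zeros in [col_begin, col_end) of A'.val)
// into 'factor' sub-columns by expanding the column pointers; the mapper array
// records that every sub-column still pairs with the same row b_{i*}.
struct SplitResult {
    std::vector<int> ptr;     // new column pointers (factor + 1 entries)
    std::vector<int> mapper;  // mapper[j] = original pair index of sub-column j
};

SplitResult split_dominator(int col_begin, int col_end, int pair_index, int factor)
{
    SplitResult r;
    int nnz   = col_end - col_begin;
    int chunk = (nnz + factor - 1) / factor;   // elements assigned per sub-column

    r.ptr.push_back(col_begin);
    for (int j = 0; j < factor; ++j) {
        int end = std::min(col_begin + (j + 1) * chunk, col_end);
        r.ptr.push_back(end);                  // elements shifted to the next sub-column
        r.mapper.push_back(pair_index);        // all sub-columns map to the same row of B
    }
    return r;
}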

Block splitting not only improves SM-level load balanc-

ing, but also provides improved cache performance. Because

global memory access requires hundreds of cycles, spatial

and temporal data localities should be fully utilized. Block-

splitting forces multiple SMs to share identical vectors, thereby

increasing the probability of re-referencing data from SMs

and preventing the data from being evicted due to memory

space shortage. As a result, additional performance gains are

achieved.

Determining the splitting factors for dominators is impor-

tant, because performance improvement depends heavily on

these factors. Due to irregularity of sparse matrices, it is

difficult to identify the optimal factor that can be applied

to all datasets. Even within dominator groups, the nnz of

vectors varies, and the splitting factor for each vector should be

selected carefully. From a GPU architectural view, overloaded

blocks should be divided into a number of smaller blocks

that is greater than the total number of SMs. The number

of effective threads within each block should be larger than

the warp size to guarantee full utilization of in-SM cores.

Based on these two insights, we decided to choose the splitting

factor (2^n) heuristically. Column vectors, where the number of

elements is equal to the number of computations per thread,

are split into several smaller vectors in a greedy manner. On

the other hand, row vectors, where the number of elements

corresponds to the number of threads, are not split to guarantee

Fig. 6: B-Gathering: several underloaded blocks are combined

into a large block through block-compaction.

a sufficient number of effective threads in each block.

2) Block Gathering: Because of the irregularity of sparse

matrices, executing kernels with a fixed thread block size is

inefficient, and therefore, executing blocks with an appropriate

thread block size is required to avoid thread waste. However,

as shown in Figure 3 (b), underloaded blocks, which are

generated by low performer groups, contain fewer effective

threads than the minimum block size (32). In the proposed

method, nnz(b_{i*}) indicates the number of effective threads

within a block. As shown in Figure 3 (b), for some networks,

most row vectors have less than 32 non-zero elements. This

means that several computing units in an SM are idle when

executing such blocks because the threads in a warp are

executed in a lock-step manner, as discussed in Section II-A.

Thus, thread-level parallelism cannot be fully utilized through

concurrent executions.

Having an insufficient number of effective threads in a block

also significantly decreases performance, as latency hiding

using fast context switching cannot be applied. When the

current active warp cannot issue the next instructions for any

reason, the warp scheduler chooses and schedules another

warp among the eligible warps to hide latency. However,

latency hiding based on fast warp-level context switching

cannot be applied, as underloaded blocks contain only a small

number of warps with effective threads (typically only one).

To solve the problem, we propose Block Gathering, which

is intuitive and can be applied easily. In Block Gathering,

original underloaded blocks are first transformed into micro-

blocks, which generate exactly the same results as the original

underloaded blocks, although they have fewer threads

than the original blocks (block-compaction). Multiple micro-

blocks are then combined into a large combined block with

multiple partitions, which has the same number of threads as

the original underloaded blocks.

For block-gathering, it is relatively easy to determine the

optimal value of the gathering factor. In general, the number

of threads in a block is set to a power of two.

Fig. 7: B-Limiting: extra shared memory is allocated to alleviate resource contention while merging long rows.

When the

number of threads of an underloaded block is in the range of

2^{n-1} to 2^n, the gathering factor is set to 32/2^n. For example, if a thread block contains 2 effective threads, the gathering factor is 16 to fill the 32-thread block completely.
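A minimal sketch of this rule is shown below; the function name is hypothetical.

// Gathering factor for an underloaded block with t effective threads (t <= 32):
// pad t up to the next power of two 2^n and pack 32 / 2^n micro-blocks together.
static int gathering_factor(int effective_threads)
{
    int pow2 = 1;
    while (pow2 < effective_threads) pow2 <<= 1;   // smallest power of two >= t
    return 32 / pow2;                              // e.g., t = 2  ->  factor 16
}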

To illustrate this concept, we present a simple merging

scenario in Figure 6. Here, the size of the thread block is

set to 16 for simplicity, and “before gathering” represents

the original underloaded blocks. The original block indices

are binned based on the corresponding numbers of effective

threads. The blocks contained in bin 1 are compressed into

a single block with gathering factor 4, and blocks in bin 2

are gathered with factor 2. However, blocks in bin 3 are not

gathered to avoid serialization.

D. Merge Optimization: Block Limiting

After generating all non-zero elements in the intermediate result matrix C, elements with the same indices are merged

intensive and has a small computational overhead, meaning

it is sensitive to memory throughput. Similar to the input

matrices, the result matrix often has a power-law distribution.

Therefore, during the merging process, some thread blocks

can generate too many memory requests and incur substantial

performance degradation by reducing the L2 cache throughput,

which is shared by multiple SMs [17], [18].

Based on this insight, we propose the B-Limiting technique,

which reduces resource contention by limiting the number of

blocks allocated to an SM. Figure 7 illustrates the B-Limiting

process. The allowable number of blocks is determined by the

resource requirements of each block. Therefore, we allocate

extra shared memory to the merge kernel functions in order

to reduce the number of blocks in an SM [20].

Because allocating the maximum number of blocks in an

SM generally yields the best GPU performance, the block

limiting technique should be applied carefully only when

it is expected to be better than the traditional allocation

scheme. Block-limiting is therefore currently applied only to the large rows c_{i*} whose nnz exceeds a given threshold (threshold = nnz(C)/(#blocks × β)), where β is currently set to 10, which shows a fair performance gain.
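The launch-time sketch below shows how such extra, otherwise unused, dynamic shared memory could be requested to cap the number of co-resident merge blocks; the kernel body and launch configuration are placeholders rather than the paper's code, while the 6144-byte step and the factor of 4 follow the values reported in Section VI.

#include <cstddef>

// Dummy merge kernel: the extra dynamic shared memory is never touched,
// it only occupies SM resources so that fewer blocks run concurrently.
__global__ void merge_kernel(float* data)
{
    extern __shared__ char limiter[];
    (void)limiter;
    data[blockIdx.x * blockDim.x + threadIdx.x] += 0.0f;  // placeholder work
}

void launch_limited_merge(float* d_data, int num_blocks, int limiting_factor = 4)
{
    size_t extra_smem = (size_t)limiting_factor * 6144;   // extra bytes per block
    merge_kernel<<<num_blocks, 256, extra_smem>>>(d_data);
}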

E. Putting It All Together

In this section, an example workflow is presented for

YouTube data, for better understanding of the mechanism for

combining these three techniques into the Block Reorganizer.

Block Reorganizer first estimates the block-wise nnz and row-

wise nnz. Workload categorization is then performed based

on the information. If the block-wise load of an (a_{*i}, b_{i*}) pair

exceeds the threshold, the pair is classified as the Dominator.

If the row-wise nnz exceeds a certain threshold, the corre-

sponding rows are determined to cause resource contention

during merging. For YouTube, 713 pairs are classified as

the dominator, and 362736 pairs are classified as the low

performer. 12657 rows are also selected to use B-Limiting

during merging. The overloaded blocks from the dominator

group are then split into smaller blocks using a splitting

factor. As a result, the B-Splitting technique shows 10.4%

performance gain with improved SM utilization from 16% to

99%.

In contrast, low performer vector pairs are binned in four

groups. Depending on their thread ranges, underloaded blocks

are gathered and compressed into a single, same-sized block.

This B-Gathering technique shows 6.7% performance gain.

After generating all non-zero elements, B-Limiting is applied

to reduce memory contention in the merging process. Extra-

shared memory is allocated to perform merge process for

long rows in order to limit the number of allocated blocks

in SM. As a result, the B-Limiting technique shows 16.8%

performance gain with a 32% L2 cache throughput improvement.

Finally, the combination of the three techniques improves the total performance by 41.5% for the YouTube data.

V. EXPERIMENTAL ENVIRONMENT

Implementation The Block Reorganizer is implemented

as an executable binary, which was originally written in

the CUDA [6] programming language and compiled using

NVCC 8.0. Block Reorganizer first reads the input matrices

and precalculates block-wise workloads. It then applies three

optimization techniques called B-Splitting, B-Gathering, and

B-Limiting. All preprocesses are performed on the target

GPUs except for B-Splitting, which is performed on host

CPUs. When all preprocesses are completed, the sparse matrix

multiplication kernel is executed.

System Configuration In our experiments, we evaluated

the Block Reorganizer mainly on a real machine with an

Intel Xeon E5-2060 (2.10 GHz) CPU with 64 GB of main

memory and an NVIDIA TITAN Xp GPU [21] with 12 GB

of global memory as shown in Table I. We also tested the Block

Reorganizer on additional systems to determine its scalability:

a Xeon E5 and NVIDIA Tesla V100 system (DGX Station),

and a Xeon Gold and NVIDIA RTX 2080 Ti system (Table I).

Performance Measurement Our spGEMM algorithm gen-

erates output data in an unordered CSR format similar to the

Gustavson merge algorithm [19]. Therefore, we present our

performance results in two different ways for fairness. We

first compare Block Reorganizer performance to a baseline

spGEMM, which uses a row-product based expansion and

a Gustavson merge process, and four widely used spGEMM

libraries (cuSPARSE, CUSP, and bhSPARSE for GPUs, and

MKL for CPUs) [13], in order to measure the performance

difference to other open libraries. We then perform detailed

analysis of each Block Reorganizer technique and compare the results to the performance of the baseline spGEMM.

TABLE I: Target system configurations

                          System 1              System 2 [22]         System 3
CPU                       Xeon E5-2640v4 [23]   Xeon E5-2698v4 [23]   Xeon Gold 5115 [24]
Number of Cores/Threads   10 / 20               20 / 40               10 / 20
Max CPU Clock             3.40 GHz              3.60 GHz              3.40 GHz
Memory                    64 GB                 256 GB                128 GB
GPU                       Titan Xp [21]         Tesla V100 [25]       RTX 2080 Ti [26]
Number of SMs             30                    80                    68
Max GPU Clock             1582 MHz              1380 MHz              1545 MHz
CUDA Capability           6.1 (Pascal)          7.0 (Volta)           7.5 (Turing)
OS                        Ubuntu 16.04          Ubuntu 18.04          Ubuntu 16.04
Baseline                  NVIDIA cuSPARSE v2, CUSP 0.4.0, bhSPARSE, MKL

TABLE II: Real-world datasets from the Florida SuiteSparse collection [27] and the Stanford large network dataset collection [28]

Name             Dimension   nnz(A)   nnz(C)    |  Name              Dimension   nnz(A)   nnz(C)
filter3D         106k        2.7M     20.1M     |  ship               140k        3.7M     23.0M
harbor           46k         2.3M     7.5M      |  protein            36k         2.1M     18.7M
sphere           81k         2.9M     25.3M     |  2cube_sphere       99k         854k     8.6M
accelerator      118k        1.3M     17.8M     |  cage12             127k        1.9M     14.5M
hood             215k        5.2M     32.7M     |  m133-b3            196k        782k     3.0M
majorbasis       156k        1.7M     7.9M      |  mario002           381k        1.1M     6.2M
mono_500Hz       165k        4.8M     39.5M     |  offshore           254k        2.1M     22.2M
patents_main     235k        548k     2.2M      |  poisson3Da         13k         344k     2.8M
QCD              48k         1.8M     10.4M     |  scircuit           167k        0.9M     5.0M
power197k        193k        3.3M     38.0M     |  youtube            1.1M        2.8M     148M
as-caida         26k         104k     25.6M     |  sx-mathoverflow    87k         495k     17.7M
loc-gowalla      192k        1.8M     456M      |  emailEnron         36k         359k     29.1M
slashDot         76k         884k     75.2M     |  epinions           74k         497k     19.6M
web-Notredame    318k        1.4M     16.0M     |  stanford           275k        2.2M     19.8M

All experimental results include the overhead but exclude the data transfer time between the host and the device. This is because

spGEMM is an application kernel with results that will be used

in a GPU. The overhead includes the precalculation, workload

classification and preprocessing for block-splitting.

For the Block Reorganizer and the baseline spGEMM, basic memory-related optimizations considering shared memory utilization, cache blocking, and memory coalescing are applied to maximize performance.

Dataset A total of 28 real-world datasets from the Stanford large network dataset collection [28] and the Florida matrix suite [27] were used for computing C = A^2. Table II lists detailed information for the tested real-world datasets. We chose specific datasets by considering the distribution and size of each matrix; datasets from the Stanford large network dataset collection generally exhibit irregular distributions, whereas the datasets from the Florida matrix suite generally exhibit regular distributions. We also used synthetic datasets generated using R-MAT [29], [30] to evaluate both C = A^2 and C = AB.

VI. EVALUATIONS

In this section, we show the effectiveness of Block Reor-

ganizer, along with the following techniques used within it:

block-splitting, block-gathering, and block-limiting. Section

VI-A shows the performance improvement and analyses on

real-world datasets. Section VI-B presents an examination

of the effectiveness of the techniques across multiple GPU

architectures, and Section VI-C and VI-D present analysis of

the performance impact on various dataset characteristics using

synthetic datasets.

A. Evaluation on Real-World Datasets

Figures 8 and 9 show the normalized and absolute perfor-

mance of Block Reorganizer compared to four widely used

spGEMM libraries, and our two baselines based on row- and

outer-products. The X-axes represent the datasets, and the Y-axes represent the relative performance based on the row-

product baseline (Figure 8) and the absolute performance

in GFLOPs (Figure 9). Based on the figures, the Block

Reorganizer achieves a performance gain of 1.43x over the

row-product baseline, while the outer-product baseline and the

libraries show only 0.95x, 0.29x, 0.22x, 0.55x, and 0.48x

speedups, respectively. Block Reorganizer also shows high

coverage, as it exhibits the best performance on most datasets.

Block-splitting and block-limiting are generally effective for

irregular data that require numerous calculations and memory

accesses per block. However, block-gathering can be applied to

most matrices due to the high sparsity of the matrices, regardless of regularity. Figure 10 shows the performance improvement

for the three techniques over the outer-product baseline. Block-

gathering, which is applied to all sparse matrices, shows the

highest coverage for matrices. However, for some matrices

with high skewness (mostly in Stanford datasets), block-

gathering on the underloaded blocks cannot improve perfor-

mance significantly because the execution time is dominated

by the overloaded blocks or the merging process. For these

datasets, block-splitting and block-limiting are very effective.

Consequently, block-limiting, block-splitting, block-gathering,

and Block Reorganizer show average performance gains of

1.05x, 1.05x, 1.28x, and 1.51x, respectively.

1) Better load balancing with block-splitting: To evaluate

the effect of block-splitting on load balancing, we define a

new metric, load balancing index (LBI), as shown in Equation

(3). LBI indicates the average execution time of all SMs

normalized to the SM with the longest execution time.

$LBI = \frac{1}{N}\sum_{i=1}^{N} \frac{cycles(SM_i)}{\max\ cycles(SM)}$   (3)

where N is the number of SMs in the GPU.
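A small host-side sketch of this metric, assuming per-SM cycle counts obtained from profiling, is given below; names are illustrative.

#include <algorithm>
#include <vector>

// Load balancing index (Equation (3)): the average per-SM execution time
// normalized to the slowest SM. A value of 1.0 means perfectly balanced SMs.
double load_balancing_index(const std::vector<double>& sm_cycles)
{
    double longest = *std::max_element(sm_cycles.begin(), sm_cycles.end());
    double sum = 0.0;
    for (double c : sm_cycles) sum += c / longest;
    return sum / sm_cycles.size();
}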

Fig. 8: Speedup of spGEMM operations for row/outer-product baselines, multiple libraries (cuSPARSE, CUSP, bhSPARSE, MKL), and Block Reorganizer on real-world datasets. All data are normalized to the row-product-based spGEMM performance.

Fig. 9: Absolute performance of spGEMM operations for row/outer-product baselines, multiple libraries (cuSPARSE, CUSP, bhSPARSE, MKL), and Block Reorganizer on real-world datasets.

Fig. 10: Relative performance of B-Splitting, B-Gathering, B-Limiting, and Block Reorganizer.

Fig. 11: Load balancing effectiveness when applying B-Splitting.

Figure 11 shows the LBI values and execution times of dominators for 10 Stanford datasets with increasing splitting factors. As long-running overloaded blocks are the main performance bottleneck for the datasets, the execution time

of dominator blocks is only measured to show the effect of

block-splitting. The X-axis indicates splitting factors from 1 to

64, and the Y-axis represents the LBI values and relative per-

formance gains normalized to the performance with a splitting

factor of 1. When the splitting factor increases, corresponding

LBI and performance increments are observed. The LBI values

converge to more than 90% when splitting factors almost equal

the number of SMs in the target GPU. This implies that even as hardware scales up to more SMs, block-splitting remains an effective technique to improve performance.

By applying block-splitting, LBI increases from 0.17 to 0.96,

and dominator performance is improved by 8.68x on average.

2) Better cache performance with block-splitting: Some

matrices such as “loc-gowalla,” “sx-mathoverflow,” and “slash-

Dot” are observed to improve even when the splitting factor becomes larger than the number of existing SMs and there is no

significant LBI improvement. This performance gain is mainly

due to better cache utilization, and block-splitting improves

the L2-cache throughput, mainly by splitting the overloaded

blocks. Memory transactions are originally concentrated in

few overloaded blocks, and the transactions are distributed to

multiple divided blocks using block-splitting. Thus, L2 cache

utilization can be significantly improved by distributing the

divided blocks to share the same memory spaces.

Fig. 12: L2 cache throughput improvements using B-Splitting.

Figure 12 shows the improvement in L2 cache throughput

when splitting overloaded blocks using the NVIDIA nvprof

profiler [31]. The X-axis represents datasets and the Y-axis

shows L2 cache throughput. For all datasets, block-splitting

shows a substantial L2 cache improvement of 8.9x on average.

This explains the further performance gain when the splitting factor is larger than the number of SMs.

3) Better latency hiding efficiency with block-gathering:

Fig. 13: Changes in sync stall when applying B-Gathering.

To prove the effectiveness of block-gathering, we profiled the kernel to observe the changes in the ratio of effective threads

using nvprof. The sync stall percentage is used as a metric

to demonstrate the ratio of effective threads, as numerous

synchronization stalls exist when many non-effective threads

await the complete computation of several effective threads.

Figure 13 shows the percentage of stall due to thread syn-

chronization. The X-axis represents the datasets and the Y-

axis represents the percentage of sync stalls. As shown in

Figure 13, the percentage of sync stalls decreases significantly when

the block-gathering technique is applied.

As discussed, underloaded blocks cannot efficiently hide

latency due to the insufficient number of effective threads.

Therefore, most non-effective threads wait for effective threads

to execute their instructions. By applying block-gathering

to underloaded blocks to increase the number of effective

threads in a block, most stalls on synchronization disappear

leaving only memory stalls. Consequently, block-gathering

significantly increases the performance for underloaded blocks.

Fig. 14: L2 cache throughput improvements using B-Limiting.

4) Less resource contention with block-limiting: Limiting

the number of blocks for an SM is effective for memory-

intensive kernels as it alleviates the resource contention. Thus,

it is expected to increase the performance of merging kernels

having many elements. Figure 14 shows the effect of block

limiting on L2 cache throughput. The X-axis represents the

10 Stanford datasets on which block-limiting is applied, and

the Y-axis represents the percentages of L2 cache throughput

with different limiting factors. The limiting factor indicates

the additionally allocated shared memory size to adjust the

number of blocks in a single SM. For the experiment, the size

of allocated memory increases by 6144 bytes. As shown in

the figure, the L2 cache throughput improves as the limiting

factor increase initially at a certain point, and it decreases after

the point. The reason for the performance degradation is that

the performance loss due to less warp occupancy increases

compared to the gain from reducing cache contention. As the

distribution of matrices varies highly, it is difficult to find an

optimal point for each matrix. In this study, limiting factor is

set to a constant value of 4× 6144 to show fair performance

gain. Consequently, L2 cache read and write throughputs

increase by 1.49x and 1.52x on average, respectively.
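The mechanism itself can be sketched in a few lines of CUDA: requesting extra dynamic shared memory at launch time lowers the number of blocks the hardware keeps resident on one SM. The kernel name and launch wrapper below are hypothetical; only the 6144-byte step mirrors the experiment above.

    // Hedged sketch: a dummy dynamic shared-memory request caps the number of
    // resident blocks per SM, which is the effect B-Limiting relies on.
    #include <cuda_runtime.h>

    __global__ void merge_kernel()
    {
        extern __shared__ char limiter[];  // never used; only occupies SM resources
        (void)limiter;
        // ... merging of intermediate products would happen here ...
    }

    void launch_merge(dim3 grid, dim3 block, int limitingFactor /* e.g., 4 */)
    {
        size_t extraSmem = (size_t)limitingFactor * 6144;  // grown in 6144-byte steps
        merge_kernel<<<grid, block, extraSmem>>>();
    }

Because the dummy allocation consumes shared memory without being touched, it trades warp occupancy for lower cache and memory contention, which is exactly the trade-off visible in Figure 14.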

[Figure: performance normalized to the row-product baseline on TITAN Xp, Tesla V100, and RTX 2080 Ti for row-product, outer-product, cuSPARSE, CUSP, bhSPARSE, MKL, and Block Reorganizer; Block Reorganizer reaches 1.43x, 1.66x, and 1.40x, respectively.]

Fig. 15: Performance scalability on various GPUs.

B. Performance Scalability on Different Architectures

To verify the scalability of Block Reorganizer across GPU architectures, we measured its performance on three devices of different generations: TITAN Xp, Tesla V100, and RTX 2080 Ti, as listed in Table I. Figure 15 shows the normalized performance after applying the Block Reorganizer technique on the target GPUs. The X-axis represents the devices, and the Y-axis represents the performance of each technique normalized to the row-product baseline. As shown in the figure, Block Reorganizer achieves the best performance on all target GPU architectures, while the outer-product baseline performs at a level similar to the row-product baseline. This is because the main problems of sparsity and skewness exist on all the GPU architectures, and the three techniques proposed by Block Reorganizer solve them successfully. As a result, speedups of 1.43x, 1.66x, and 1.40x over the row-product baseline were achieved on TITAN Xp, Tesla V100, and RTX 2080 Ti, respectively.

TABLE III: Synthetic datasets

             Data  Dimension (N)  # elements  Parameters
  C = A^2
  S          s1    250000         62500       (0.45, 0.15, 0.15, 0.25)
             s2    500000         250000      (0.45, 0.15, 0.15, 0.25)
             s3    750000         562500      (0.45, 0.15, 0.15, 0.25)
             s4    1000000        1000000     (0.45, 0.15, 0.15, 0.25)
  P          p1    1M             1M          (0.25, 0.25, 0.25, 0.25)
             p2    1M             1M          (0.45, 0.15, 0.15, 0.25)
             p3    1M             1M          (0.55, 0.15, 0.15, 0.15)
             p4    1M             1M          (0.57, 0.19, 0.19, 0.05)
  SP         sp1   1M             4M          (0.25, 0.25, 0.25, 0.25)
             sp2   1M             3M          (0.25, 0.25, 0.25, 0.25)
             sp3   1M             2M          (0.25, 0.25, 0.25, 0.25)
             sp4   1M             1M          (0.25, 0.25, 0.25, 0.25)
  C = AB
             15A   32768          440747      scale=15, edge-factor=16
             15B   32768          440024      scale=15, edge-factor=16
             16A   65536          908672      scale=16, edge-factor=16
             16B   65536          909957      scale=16, edge-factor=16
             17A   131072         1864289     scale=17, edge-factor=16
             17B   131072         1868244     scale=17, edge-factor=16
             18A   262144         3806124     scale=18, edge-factor=16
             18B   262144         3801872     scale=18, edge-factor=16

C. Evaluation on Synthetic Datasets (C = A^2)

In the previous sections, we discussed the effectiveness of Block Reorganizer on real-world datasets compared to the existing libraries and our customized baseline. To show the general applicability of Block Reorganizer, we also evaluated it on synthetic datasets with contrasting characteristics, as shown in Table III. In these synthetic datasets, we varied the following factors: the number of nodes (S: scalability), the skewness (P: power-law), and the sparsity (SP).
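The quadruples in Table III read like the quadrant probabilities (a, b, c, d) of an R-MAT-style generator [29]; under that assumption, the host-side sketch below shows how one edge would be drawn by recursively picking quadrants with those probabilities (for a matrix whose dimension is rounded up to a power of two), with skewness controlled by how far the quadruple is from uniform. This is only an illustration of what the parameters likely mean, not the authors' exact generator.

    // Hedged sketch: R-MAT-style edge sampling, assuming the Table III quadruples
    // are the quadrant probabilities (a, b, c, d) of the R-MAT model [29].
    #include <cstdint>
    #include <random>
    #include <utility>

    std::pair<std::uint64_t, std::uint64_t>
    rmat_edge(int scale, double a, double b, double c, std::mt19937_64 &rng)
    {
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        std::uint64_t row = 0, col = 0;
        for (int level = 0; level < scale; ++level) {
            double r = uni(rng);
            row <<= 1; col <<= 1;                     // descend one recursion level
            if (r < a)              { /* top-left quadrant: neither bit set */ }
            else if (r < a + b)     { col |= 1; }     // top-right
            else if (r < a + b + c) { row |= 1; }     // bottom-left
            else                    { row |= 1; col |= 1; }  // bottom-right (prob. d)
        }
        // A skewed quadruple such as (0.57, 0.19, 0.19, 0.05) concentrates edges
        // in a few rows, producing a power-law-like (P-style) matrix.
        return {row, col};
    }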

[Figure: normalized performance of row-product, outer-product, cuSPARSE, CUSP, bhSPARSE, MKL, and Block Reorganizer; panel (a) covers datasets s1-s4 (scalability), p1-p4 (skewness), and sp1-sp4 (sparsity); panel (b) covers the C = AB pairs for scales 15 to 18.]

Fig. 16: (a) Speedup of spGEMM libraries and Block Reorganizer normalized to the row-product baseline on synthetic datasets for C = A^2 operations, and (b) speedup on C = AB operations.

1) Scalability (dataset S): The first four matrices (s1-s4) in Figure 16 (a) show the performance changes as the matrix size increases. When the matrix is very small, cuSPARSE shows the best performance. However, as the matrices become larger, its performance drops significantly and eventually becomes the lowest among the compared methods. In contrast, Block Reorganizer performs poorly on small matrices, because the matrix multiplication time is too short and the overall runtime is dominated by the preprocessing overheads; as the matrices become larger, however, it shows the best performance of all methods.

2) Skewness (dataset P): The next four matrices (p1-p4) in Figure 16 (a) show the performance changes as the matrix skewness increases. The X-axis represents the matrices used for the evaluation, and the Y-axis represents the performance normalized to the baseline. As the skewness level increases, cuSPARSE and bhSPARSE exhibit performance degradation similar to that observed on the real datasets. In contrast, Block Reorganizer shows substantial performance gains in all cases owing to its wide coverage. Notably, block-splitting and block-limiting improve performance mainly for highly skewed data by resolving the load imbalance and high resource contention problems.

3) Sparsity (dataset SP): The last four matrices (sp1-sp4) in Figure 16 (a) show the performance changes as the matrix density decreases. bhSPARSE outperforms the other spGEMM implementations for relatively dense matrices. However, as the matrices become sparser, Block Reorganizer outperforms all other methods, mainly through block-gathering.

D. Evaluation on Synthetic Datasets (C = AB)

To demonstrate the generality of our approach, we also evaluated the performance of Block Reorganizer for C = AB cases, in addition to C = A^2. As shown in Table III, the last four sets of input matrix pairs (A, B) are synthetically generated with two parameters, scale and edge-factor. The dimension of each matrix is set to 2^scale, and the number of non-zero entries is set to edge-factor × 2^scale. The performance is evaluated by increasing the scale parameter from 15 to 18 while the edge-factor parameter is fixed to 16, following Graphulo [32].

Figure 16 (b) shows the normalized performance of Block Reorganizer for the C = AB cases. The X-axis represents the four spGEMM operations on the matrix pairs, and the Y-axis represents the performance normalized to the row-product baseline. As shown in the figure, Block Reorganizer achieves fair speedups across all input matrix pairs. C = AB operations do not generate output matrices as dense as those from C = A^2 operations [32]; therefore, block-gathering is an effective optimization, because most thread blocks are categorized as underloaded, with only a few overloaded blocks. Consequently, Block Reorganizer achieves an average performance gain of 1.09x over the baseline, which is the best among the compared techniques. The gain also scales as the input size increases.

VII. RELATED WORK

There have been many previous studies on spGEMM. NVIDIA and Intel provide libraries that support fast spGEMM [10], [11], [33]. Furthermore, several optimized techniques have also been proposed [13], [34]–[42].

In more detail, regularization [35], input categorization [36], and resource optimization [37] techniques have been proposed for spGEMM on GPUs. From the perspective of load balancing, lbGEMM [13] significantly improved performance by introducing an outer-product scheme that solves the thread-level load balancing problem. AC-spGEMM [39] also improved overall performance substantially by applying thread-level load balancing to row-product-based spGEMM. Akbudak [40] improved merging performance by increasing matrix locality, orchestrating partitioned and permuted workloads to reduce communication overhead between processors. Kernert [41] and Patwary [42] improved cache locality using adaptive tiling of the target matrices.

However, as discussed in Section III, these techniques are not well suited to matrix multiplication for SNS analysis, because they do not consider the power-law degree distribution [10], [11], [33], SM-level load balancing, or in-SM resource utilization [13], [35]–[37]. Our outer-product-based approach also shows stable performance gains across various target matrices by natively resolving the thread-level load imbalance problem, without introducing complex per-row load balancing techniques, which often require additional control overhead to maintain per-row linked-list structures [39].

We propose three novel techniques for better load balancing and resource utilization. Several related studies have also been proposed [17], [18], [43], [44]. Thread Tailor [43] adjusted the number of threads by combining multiple CPU threads into a merged thread based on profiling results. Lee [18] and Kayiran [17] showed that allocating the maximum number of TBs on GPUs does not always guarantee the best performance, and suggested hardware-level approaches for finding and allocating the optimal number of TBs. Ho [44] introduced thread pairing, which merges two threads into one to vectorize operations on GPUs. These approaches are partially related to ours.

VIII. CONCLUSION

This work proposed a novel optimization pass called Block Reorganizer for outer-product-based spGEMM, comprising three block-level optimization techniques: B-Splitting, B-Gathering, and B-Limiting. Block Reorganizer first identifies overloaded and underloaded thread blocks and then applies a different technique to each. It solves the SM-level load imbalance problem by splitting overloaded blocks into multiple small blocks using B-Splitting. For underloaded blocks, it increases in-SM computing-unit utilization by gathering multiple underloaded blocks into a single block using B-Gathering. It also limits the number of thread blocks allocated to an SM using B-Limiting when overloaded rows exist in the merging process. Based on these three techniques, it achieves an average speedup of 1.43x in execution time over the baseline on a total of 28 real-world datasets on a target server-class GPU.

IX. ACKNOWLEDGMENTS

Thanks to Myung-Hwan Jang and Hyuck-Moo Gwon for

all their help and feedback. We also thank the anonymous

reviewers who provided good suggestions for improving the

quality of this work. This work was supported by Samsung

Research Funding & Incubation Center of Samsung Electron-

ics under Project Number SRFC-IT1901-03. Yongjun Park is

the corresponding author.

REFERENCES

[1] D.-H. Bae et al., “Constructing seminal paper genealogy,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011, pp. 2101–2104.
[2] G. He et al., “Parallel simrank computation on large graphs with iterative aggregation,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 543–552.
[3] Y. Cai et al., “Efficient algorithm for computing link-based similarity in real world networks,” in 2009 Ninth IEEE International Conference on Data Mining. IEEE, 2009, pp. 734–739.
[4] Y. Dong et al., “Link prediction and recommendation across heterogeneous social networks,” in 2012 IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 181–190.
[5] Y. Koren et al., “Matrix factorization techniques for recommender systems,” Computer, no. 8, pp. 30–37, 2009.
[6] J. Nickolls et al., “NVIDIA CUDA software and GPU parallel computing architecture,” in Microprocessor Forum, May 2007.
[7] KHRONOS Group, “OpenCL - the open standard for parallel programming of heterogeneous systems,” 2010, http://www.khronos.org.
[8] J. Leskovec et al., “Graph evolution: Densification and shrinking diameters,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 2, 2007.
[9] C. W. Keßler and C. Smith, “The SPARAMAT approach to automatic comprehension of sparse matrix computations,” in Proceedings of the Seventh International Workshop on Program Comprehension. IEEE Computer Society, 1999, pp. 200–207.
[10] “NVIDIA cuSPARSE library,” http://developer.nvidia.com/cusparse.
[11] S. Dalton et al., “CUSP: Generic parallel algorithms for sparse matrix and graph computations,” 2014, version 0.5.0. [Online]. Available: http://cusplibrary.github.io/
[12] W. Liu and B. Vinter, “An efficient GPU general sparse matrix-matrix multiplication for irregular data,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, May 2014, pp. 370–381.
[13] Y.-Y. Jo et al., “Efficient sparse matrix multiplication on GPU for large social network analysis,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 1261–1270.
[14] S. Pal et al., “OuterSPACE: An outer product based sparse matrix multiplication accelerator,” Feb. 2018, pp. 724–736.
[15] J. J. Elliott and C. M. Siefert, “Low thread-count Gustavson: A multithreaded algorithm for sparse matrix-matrix multiplication using perfect hashing,” in 2018 IEEE/ACM 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), Nov. 2018, pp. 57–64.
[16] S. K. Raman et al., “Implementing streaming SIMD extensions on the Pentium III processor,” IEEE Micro, vol. 20, no. 4, pp. 47–57, 2000.
[17] O. Kayiran et al., “Neither more nor less: Optimizing thread-level parallelism for GPGPUs,” in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 157–166. [Online]. Available: http://dl.acm.org/citation.cfm?id=2523721.2523745
[18] M. Lee et al., “Improving GPGPU resource utilization through alternative thread block scheduling,” in 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2014, pp. 260–271.
[19] F. G. Gustavson, “Two fast algorithms for sparse matrices: Multiplication and permuted transposition,” ACM Transactions on Mathematical Software (TOMS), vol. 4, no. 3, pp. 250–269, 1978.
[20] Y. Yu et al., “A compiler-based approach for GPGPU performance calibration using TLP modulation (WIP paper).”
[21] NVIDIA, “NVIDIA Titan Xp graphics cards,” 2017, https://www.nvidia.com/en-us/titan/titan-xp/.
[22] NVIDIA, “NVIDIA DGX Station,” 2017, https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/nvidia-dgx-station-datasheet.pdf.
[23] Intel, “Intel Xeon E5-2600 model specification,” 2016, https://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-brief.html.
[24] Intel, “Intel Gold 5115 model specification,” 2017, https://ark.intel.com/content/www/kr/ko/ark/products/120484/intel-xeon-gold-5115-processor-13-75m-cache-2-40-ghz.html.
[25] NVIDIA, “NVIDIA Tesla V100,” 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[26] NVIDIA, “NVIDIA RTX 2080 Ti graphics cards,” 2018, https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/.
[27] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663
[28] “Stanford large network dataset collection,” http://snap.stanford.edu/data.
[29] D. Chakrabarti et al., “R-MAT: A recursive model for graph mining,” in Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 2004, pp. 442–446.
[30] D. Zheng et al., “FlashGraph: Processing billion-node graphs on an array of commodity SSDs,” in 13th USENIX Conference on File and Storage Technologies (FAST 15), 2015, pp. 45–58.
[31] Profiler User’s Guide, NVIDIA, 2018, http://docs.nvidia.com/cuda/pdf/CUDA_Profiler_Users_Guide.pdf.
[32] D. Hutchison et al., “Graphulo implementation of server-side sparse matrix multiply in the Accumulo database,” in 2015 IEEE High Performance Extreme Computing Conference (HPEC), Sep. 2015, pp. 1–7.
[33] Intel, “Intel Math Kernel Library,” 2003, https://software.intel.com/en-us/mkl.
[34] B. Xie et al., “CVR: Efficient vectorization of SpMV on x86 processors,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization. ACM, 2018, pp. 149–162.
[35] J. Zhang and L. Gruenwald, “Regularizing irregularity: Bitmap-based and portable sparse matrix multiplication for graph data on GPUs,” in Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). ACM, 2018, p. 4.
[36] C. Hong et al., “Efficient sparse-matrix multi-vector product on GPUs,” in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. ACM, 2018, pp. 66–79.
[37] J. Liu et al., “Register-based implementation of the sparse general matrix-matrix multiplication on GPUs,” in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’18. New York, NY, USA: ACM, 2018, pp. 407–408. [Online]. Available: http://doi.acm.org/10.1145/3178487.3178529
[38] F. Gremse et al., “GPU-accelerated sparse matrix-matrix multiplication by iterative row merging,” SIAM Journal on Scientific Computing, vol. 37, pp. C54–C71, Jan. 2015.
[39] M. Winter et al., “Adaptive sparse matrix-matrix multiplication on the GPU,” in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, 2019, pp. 68–81.
[40] K. Akbudak and C. Aykanat, “Simultaneous input and output matrix partitioning for outer-product–parallel sparse matrix-matrix multiplication,” SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C568–C590, 2014.
[41] D. Kernert et al., “Topology-aware optimization of big sparse matrices and matrix multiplications on main-memory systems,” in 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016, pp. 823–834.
[42] M. M. A. Patwary et al., “Parallel efficient sparse matrix-matrix multiplication on multicore platforms,” in International Conference on High Performance Computing. Springer, 2015, pp. 48–57.
[43] J. Lee et al., “Thread Tailor: Dynamically weaving threads together for efficient, adaptive parallel applications,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010, pp. 270–279.
[44] N.-M. Ho and W.-F. Wong, “Exploiting half precision arithmetic in NVIDIA GPUs,” in 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2017, pp. 1–7.
