PERFORMANCE OPTIMIZATION FOR SPARSE MATRIX FACTORIZATION ALGORITHMS ON HYBRID MULTICORE ARCHITECTURES

By

MENG TANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2020


© 2020 Meng Tang


ACKNOWLEDGMENTS

My sincerest gratitude goes to my advisor, Dr. Sanjay Ranka, for his continuous support. Dr. Ranka's guidance and motivation were invaluable to my Ph.D. study. The resources and funding he provided were critical to my research work.

I thank Dr. Mohamed Gadou and Dr. Tania Banerjee for their help with our projects and with the writing of our papers. They were great collaborators and friends who have aided me in solving many problems.

I thank Dr. Timothy Davis and Dr. Steven Rennich for generously providing essential

equipment. Without their servers, it would have been very difficult to conduct my research.

I thank Dr. Alper Ungor, Dr. Jih-Kwon Peir, and Dr. William Hager for their insightful

comments on my thesis. As members of my supervisory committee, they have provided

valuable ideas and comments during my progression to the degree.

I thank my family for their unwavering support in my pursuit of knowledge. They have been, and will always be, my source of strength in even the darkest hours.


TABLE OF CONTENTS

page

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

CHAPTER

1 INTRODUCTORY REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 BACKGROUND AND RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . 12

2.1 Fill-reducing Permutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2 Elimination Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 SPARSE CHOLESKY FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2 Up-looking Cholesky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3 Left-looking Cholesky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.4 Right-looking Cholesky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.5 The Supernodal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.6 The Multifrontal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.7 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.8 The Subtree Method and the Multilevel Subtree Method . . . . . . . . . . . . 29

3.9 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.10 The Batched Sparse Cholesky Factorization . . . . . . . . . . . . . . . . . . . 37

3.10.1 The Merge-and-Factorize Approach . . . . . . . . . . . . . . . . . . . 37

3.10.2 The Normal Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.11 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.11.1 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.11.2 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.11.3 Multilevel Subtree Method . . . . . . . . . . . . . . . . . . . . . . . . 43

3.11.4 Batched Sparse Cholesky Factorization . . . . . . . . . . . . . . . . . 46

3.12 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4 SPARSE QR FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2 Gram-Schmidt Orthogonalization . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3 Givens Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.4 Blocked Givens Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.5 Householder Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


4.6 The Multifrontal Sparse QR Factorization . . . . . . . . . . . . . . . . . . . . 56

4.7 The Arithmetic CUDA Kernels . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.8 Pipelining CUDA Kernels and Device-to-Host Transfers . . . . . . . . . . . . . 63

4.9 Pipelining GPU Workload and CPU Workload . . . . . . . . . . . . . . . . . 65

4.10 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5 SPARSE LU FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1 Left-looking LU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2 Right-looking LU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.3 The Supernodal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.4 The Multifrontal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.5 Implementation of a Supernodal Sparse LU Algorithm . . . . . . . . . . . . . 79

5.5.1 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.5.1.1 GPU information . . . . . . . . . . . . . . . . . . . . . . . . 80

5.5.1.2 Matrix information . . . . . . . . . . . . . . . . . . . . . . . 81

5.5.2 The Supernodal Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 83

5.5.3 Multithreading and Batched Factorization . . . . . . . . . . . . . . . . 86

5.5.4 Utilizing Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6 SUMMARY AND CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . 94

APPENDIX: PUBLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


LIST OF TABLES

Table page

3-1 Test matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4-1 Test matrices used for QR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4-2 QR experiment results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5-1 Test matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5-2 Factorization time (s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92


LIST OF FIGURES

Figure page

3-1 Workflow of single level (a+c) / multilevel (a+b) subtree algorithm . . . . . . . . . 32

3-2 Using pipeline in the factorization of a subtree . . . . . . . . . . . . . . . . . . . . 35

3-3 Factorization time for sparse Cholesky with / without multithreading . . . . . . . . 40

3-4 Average power consumption for sparse Cholesky with / without multithreading . . . 40

3-5 Energy consumption for sparse Cholesky with / without multithreading . . . . . . . 41

3-6 Cholesky factorization performance for sparse matrices . . . . . . . . . . . . . . . . 42

3-7 Power Consumption of sparse Cholesky factorization . . . . . . . . . . . . . . . . . 42

3-8 Energy Consumption of sparse Cholesky factorization . . . . . . . . . . . . . . . . . 43

3-9 Performance comparison between single-level subtree algorithm and multilevel subtree algorithm on a single GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3-10 Structure of matrix Geo 1438 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3-11 Performance comparison between single-level subtree algorithm and multilevel subtree algorithm on a single GPU and two GPUs . . . . . . . . . . . . . . . . . . . . 45

3-12 Batched Cholesky factorization performance versus sequential matrix factorization on GPUs (2 GPUs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3-13 Batched Cholesky factorization performance versus sequential matrix factorization on GPUs (4 GPUs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4-1 The elimination tree of a sparse matrix . . . . . . . . . . . . . . . . . . . . . . . . 57

4-2 A possible scheduling of fronts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4-3 Stages in the workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4-4 Stages in the elimination tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4-5 Reducing PCIe communications with stages . . . . . . . . . . . . . . . . . . . . . . 58

4-6 The factorization and the assembly operations . . . . . . . . . . . . . . . . . . . . 59

4-7 Factorization of a front . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4-8 Factorization of a front . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4-9 The VT tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4-10 U and V in the VT tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62


4-11 Q, U and V in the VT tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4-12 Pipelining the CUDA kernel runs and device-to-host transfers . . . . . . . . . . . . 63

4-13 Pinned host memory used as a buffer . . . . . . . . . . . . . . . . . . . . . . . . . 63

4-14 Pipelining the factorization of stages and the buffer flushing . . . . . . . . . . . . . 63

4-15 A secondary pinned host memory buffer to avoid data conflict . . . . . . . . . . . . 65

4-16 Comparison between sparse QR algorithm with / without stage-level pipeline . . . . 66

4-17 Performance comparison between algorithm before and after optimizations . . . . . 67

4-18 Relationship between flop count and improvement in performance . . . . . . . . . . 68

4-19 Energy consumed by the GPU in factorization (large matrices) . . . . . . . . . . . . 68

4-20 Reduction in energy consumption after the optimization . . . . . . . . . . . . . . . 69

4-21 Average power of the GPU in factorization . . . . . . . . . . . . . . . . . . . . . . 69

4-22 Reduction in average power after the optimization . . . . . . . . . . . . . . . . . . 70

5-1 Supernode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5-2 Supernode stored in column major form . . . . . . . . . . . . . . . . . . . . . . . . 83

5-3 Elimination tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5-4 Serial factorization of supernode . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5-5 Parallel factorization of supernode . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5-6 Computing contribution blocks using 4 CUDA streams . . . . . . . . . . . . . . . . 86

5-7 LU factorization time (natural log transformed) . . . . . . . . . . . . . . . . . . . . 93

5-8 LU factorization time using one or multiple GPUs . . . . . . . . . . . . . . . . . . 93


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PERFORMANCE OPTIMIZATION FOR SPARSE MATRIX FACTORIZATION ALGORITHMS ON HYBRID MULTICORE ARCHITECTURES

By

Meng Tang

May 2020

Chair: Sanjay Ranka
Major: Computer Engineering

The use of sparse direct methods in computational science is ubiquitous. Direct methods

can be used to find solutions to many numerical linear algebra problems, including sparse linear

systems, sparse linear least squares, and eigenvalue problems; consequently they form the

backbone of a broad spectrum of large scale applications. The use of sparse direct methods is

extensive, with many of the relevant science and engineering application areas being pushed to

run at ever higher scales.

In this work we delve into the implementations of sparse direct methods including the

sparse Cholesky, QR, and LU factorizations. We study a number of state-of-the-art libraries for sparse matrix factorization, and improve their performance by applying various

optimizations.

For the sparse Cholesky factorization we have implemented multithreading, pipelining, the

multilevel subtree method, and batched factorization.

For the sparse QR factorization we implemented pipelining and improved the arithmetic

CUDA kernels.

For the sparse LU factorization, we implemented a supernodal sparse LU solver that can

utilize multiple GPUs, and supports multithreading, pipelining, and batched factorization.


CHAPTER 1
INTRODUCTORY REMARKS

Matrix factorizations such as the Cholesky factorization, the QR factorization, and the

LU factorization are key to many applications involving linear algebra, and are useful in solving problems such as linear equation systems, linear least-squares problems, eigenvalue problems, non-linear optimization, Monte Carlo simulation, etc. This forms the foundation

of a wide variety of large scale applications including feature extraction, data compression,

computer graphics, recommender systems, artificial intelligence, etc.

A sparse matrix is a matrix most of whose elements are zero. The factorization of a sparse matrix differs from dense matrix factorization in that arithmetic operations on the sparse matrix's zero elements can mostly be avoided to reduce the total workload, thus greatly reducing the time, memory, and energy cost of the factorization. In order to exploit the sparsity of the matrix, a symbolic analysis phase is needed before the actual numerical factorization, to compute the nonzero pattern of the factors.

The symbolic analysis also computes the elimination tree. The elimination tree is a key

data structure in sparse matrix factorization algorithms [36]. It provides structural information

of the sparse matrix, and directs the workflow of the factorization. The elimination tree also

describes the dependency between the sparse matrix’s columns/rows, therefore it plays a vital

role in parallelizing the sparse matrix factorization algorithms.

Usually a fill-reducing permutation needs to be performed before the numerical factorization, to reduce fill-in. Fill-in consists of new nonzero entries in the factors whose corresponding positions in the matrix being factorized are initially zero [15]. Finding the optimal fill-reducing permutation has been proven to be NP-hard [59], but there exist heuristic algorithms that find near-optimal permutations.

The numerical factorization phase is where most of the floating-point operations happen. These operations are largely fixed, because they are determined by the output of the analysis phase, but there are various techniques to accelerate the factorization phase. These


techniques usually focus on exploiting the natural parallelism available within the factorization.

It is a common practice to accelerate the factorization with high-throughput highly-parallel

co-processors such as GPUs. Additionally, the sparsity of the matrix enables significant

performance gain through parallel programming on hybrid multicore systems.

Prior to the introduction of specific sparse matrix factorization algorithms, we will provide

background information and related work in Chapter 2. Symbolic analysis will also be

presented in this chapter. In Chapter 2, we will introduce the fill-reducing algorithms, and the

construction of the elimination tree. We will also briefly describe the extra analysis steps for

the supernodal sparse matrix factorization algorithm [5] [61] and the subtree method [48].

In Chapter 3, we provide the techniques for accelerating the sparse Cholesky factorization.

The Cholesky factorization algorithms for dense matrices are the foundation of sparse Cholesky,

therefore they are introduced at the beginning of this chapter. Later in Chapter 3, we describe

the basic sparse Cholesky algorithm [5] that our work is based upon. The remaining of the

chapter will cover techniques that enhance the performance of the sparse Cholesky algorithm,

including multithreading [54], pipelining [55], the subtree method [48], and the multilevel

subtree method [55].

In Chapter 4, we consider the sparse QR factorization. In this chapter we will first

introduce algorithms for both dense and sparse QR factorization, including Gram-Schmidt

orthogonalization, Givens rotation, and Householder reflection. We will then describe the

multifrontal sparse QR factorization algorithm [61] implemented by Yeralan et al. Then, we

provide our optimization techniques for the above algorithm, including the optimization of

arithmetic CUDA kernels, and pipelining.

In Chapter 5, we present our work on the sparse LU factorization. We will describe the

implementation details of our supernodal sparse LU solver.

We summarize our work in Chapter 6.


CHAPTER 2
BACKGROUND AND RELATED WORK

In typical sparse matrix factorization algorithms, the factorization splits into two phases:

the symbolic analysis phase and the numerical factorization phase [15]. The symbolic analysis

phase depends only on the nonzero pattern of the matrix being factorized, and will not look

into the actual values of the matrix’s entries.

The symbolic analysis is the key in exploiting the sparsity of the matrix, because the time

and memory efficiency of the numerical factorization phase greatly depends on the analysis.

The symbolic analysis is asymptotically faster than the numerical factorization, and makes the

numerical factorization phase more efficient in terms of time and memory [15].

The symbolic analysis is closely related to graph theory. For a symmetric n × n matrix A = {aij}, the nonzero pattern of A can be represented by an undirected graph G, where

$$G = (V, E), \qquad V = \{v_1, \cdots, v_n\}, \qquad E = \{(v_i, v_j) : a_{ij} \neq 0\}$$

The symbolic analysis for unsymmetric matrices is closely related to the symbolic analysis for symmetric matrices.

2.1 Fill-reducing Permutation

Usually the first step of the symbolic analysis phase is the fill-reducing permutation. The

optimum fill-reducing permutation for an m × n matrix A is to find an m × m permutation

matrix P and an n × n permutation matrix Q such that the factorization of PAQ generates

minimum fill-in, where fill-in are nonzero entries of the factors whose same-position entries in

PAQ are zero.

Fill-in imposes redundant workload on the factorization algorithm, and results in an increase in time, memory, and energy consumption; therefore a good fill-reducing permutation


is required for an ideal performance of the sparse matrix factorization. However, finding the

optimum fill-reducing permutation is NP-hard.

The minimum fill-in problem is equivalent to the minimum triangulated supergraph problem. The goal of the minimum triangulated supergraph problem is to find a minimum

number of additional edges that turns the graph chordal (triangulated). A graph is chordal if

for every cycle with length no less than 4, there exists an edge in the graph connecting non-

adjacent vertices of the cycle. Yannakakis proved that the minimum triangulated supergraph

problem for bipartite graphs is NP-complete [59].

Many heuristics are available for the minimum fill-in problem, including bandwidth reduction [7], minimum degree ordering [23], minimal triangulation [42], nested dissection [19], etc.

• The term "minimal" is different from "minimum" in this case. Let the graph G = (V, E), and let C be the set C = {Ec : Gc = (V, E ∪ Ec) is chordal}. Here "minimal" means minimal with respect to inclusion (Ec ∈ C but no proper subset of Ec is in C); the inclusion relation on C is a partial order, and a minimal triangulation corresponds to a minimal element of C with respect to this relation. A "minimum" triangulation means that Ec ∈ C and Ec has the smallest cardinality.

Let A = {aij} be a symmetric matrix; the bandwidth of A is defined as

$$\max_{a_{ij} \neq 0} |i - j|$$

The bandwidth of A's graph G = (V, E), where V = {v1, · · · , vn} and E = {(vi, vj) : aij ≠ 0}, is defined as

$$\min_{\pi} \max_{(v_i, v_j) \in E} |\pi(i) - \pi(j)|$$

where π is an ordering of {v1, · · · , vn}.


The minimum bandwidth of A is equal to the bandwidth of G, and the bandwidth

minimization problem is equivalent to finding an ordering of vertices such that the minimum

bandwidth is obtained.

Papadimitriou proved that the bandwidth minimization problem is NP-complete [43].

Cuthill and McKee provided an efficient heuristic algorithm for the bandwidth minimization

problem [7]. Their algorithm uses a greedy node numbering scheme, starting from a vertex

with the minimum degree, and doing a breadth-first search which prioritizes vertices with

smaller degrees. George proposed the ”reverse Cuthill-McKee” method which reverses the

ordering of vertices. Liu et al. compared the Cuthill-McKee algorithm to the reverse
Cuthill-McKee algorithm [38]. Their experiments showed that these two orderings are equivalent
for band elimination methods, but when envelope elimination techniques are used, the reverse
Cuthill-McKee ordering is always better than or as good as the original Cuthill-McKee ordering.
They also explored the conditions under which the reverse Cuthill-McKee ordering is always

strictly better than the original one.

The minimum degree ordering algorithm is a heuristic for the minimum fill-in problem

using a local greedy strategy. It greedily selects the sparsest pivot row and column during

the course of a right-looking sparse Cholesky factorization [15]. It is a symmetric analog of

an algorithm proposed by Markowitz for reordering equations arising in linear programming

applications [39]. Tinney and Walker were the first to propose the symmetric version of

Markowitz’s algorithm [56]. Rose developed a graph theoretic model for this algorithm, and

renamed it as the minimum degree algorithm [50]. George and McIntyre provided an efficient

implementation for the minimum degree algorithm [24].

Nested dissection is a heuristic for the minimum fill-in problem, using the divide and

conquer paradigm. The nested dissection algorithm selects a vertex separator, which is a

group of vertices that divide the graph into two roughly equal-sized subgraphs. After these

vertices are removed from the graph, the two subgraphs can be ordered via nested dissection or

minimum degree [15]. Nested dissection was discovered by George [19]; it was initially intended


to find an ordering of an n × n mesh, to reduce the factorization time complexity from O(n⁴)
to O(n³) and reduce the space complexity from O(n³) to O(n² log₂ n). Lipton et al. proposed

the generalized nested dissection which applies to any system of equations defined on a planar

or almost-planar graph [33]. Gilbert created a variant of the generalized nested dissection

algorithm. Instead of separating the graph into 2 subgraphs, Gilbert’s algorithm divides the

graph into r subgraphs where r ≥ 2.

Multiple libraries for graph partitioning are available, such as METIS [29], PARTY [46], SCOTCH [44], ParMETIS [30], PT-SCOTCH [6], Mongoose [31], etc.

2.2 Elimination Tree

After the fill-reducing permutation, the algorithm computes the elimination tree [36].

The elimination tree is an output of the symbolic analysis. It is a tree structure that provides

structural information of the sparse matrix. The elimination tree can also be viewed as

a directed acyclic graph which shows the dependency between matrix columns/rows and

describes the topological pattern of the factorization workflow.

For the Cholesky factorization of an n × n sparse matrix A, the elimination tree is a tree

with n nodes. If we number the matrix's columns and rows from 0 to n − 1, then for any integers i and j with 0 ≤ j < i < n, the elimination tree's node j is a descendant of node i if aij ≠ 0. A path from j to i indicates a data dependency between column j and column i: column j of the factor must be computed before column i. Let L denote the factor, with A = LLᵀ; then node j is a descendant of node i if lij ≠ 0.

For QR and LU factorizations of a sparse matrix A, the column elimination tree is used.

The column elimination tree is the elimination tree of AᵀA [15].

The structure of an elimination tree was implicitly used long before its importance was

recognized [36]. Schreiber formally defined the elimination tree structure [53] (the term

”elimination tree” was not used but the definition of the tree structure referred to is the same

as the elimination tree). In [34], Liu used the term ”elimination tree” to refer to the tree

structure defined by Schreiber.


The elimination tree can be constructed incrementally, from leaves to root, following these

rules:

• If aij ≠ 0, then lij ≠ 0

• If i > j > k, lik ≠ 0, and ljk ≠ 0, then lij ≠ 0

The time complexity of the elimination tree’s construction is nearly O(|A|) [15], where |A|

is the number of nonzero entries in A.
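As an illustration of these rules, the parent array of the elimination tree can be built with the classic ancestor (path-compression) scheme. The sketch below is illustrative C, not the dissertation's code; it assumes the pattern of A is supplied in compressed-sparse-column arrays Ap/Ai (hypothetical names) holding, for each column j, the row indices i < j of its nonzeros.

/* Minimal sketch (assumed CSC input, illustrative names): build the
 * elimination tree parent[] of a symmetric n x n matrix.  For column j,
 * Ai[Ap[j] .. Ap[j+1]-1] lists the row indices i < j with a_ij != 0.
 * Uses the classic ancestor/path-compression technique. */
#include <stdlib.h>

void build_etree(int n, const int *Ap, const int *Ai, int *parent)
{
    int *ancestor = malloc(n * sizeof(int));
    for (int j = 0; j < n; j++) {
        parent[j] = -1;                     /* j has no parent yet          */
        ancestor[j] = -1;
        for (int p = Ap[j]; p < Ap[j + 1]; p++) {
            int i = Ai[p];                  /* a_ij != 0 with i < j         */
            while (i != -1 && i < j) {
                int next = ancestor[i];
                ancestor[i] = j;            /* compress the path toward j   */
                if (next == -1)
                    parent[i] = j;          /* i's root gains parent j      */
                i = next;
            }
        }
    }
    free(ancestor);
}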

Supernodal matrix factorization algorithms are blocked matrix factorization algorithms

that attempt to factorize multiple columns at a time. The columns are put into groups, named

column groups or column panels. A column group is composed of sequential columns with

identical, or similar nonzero patterns. The nonzero pattern of a column group is the union of

the nonzero patterns of all columns in the column group. The supernodal elimination tree is a

variant of the elimination tree where each node (named a "supernode") represents not a column

but a column group. The column groups can be computed by iterating through the columns

of A. If the next column has a nonzero pattern similar to the current column group, then it is

merged into the column group, otherwise a new column group is created. Each column group

corresponds to a supernode of the supernodal elimination tree. Let Lp and Lq be two column

groups, and p > q. Let li ∈ Lp and lj ∈ Lq. Since column groups are composed of sequential

columns, we have i > j. Then Lp is an ancestor of Lq in the supernodal elimination tree if

and only if there exist li and lj such that li ∈ Lp, lj ∈ Lq, and li is an ancestor of lj in the

elimination tree.
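For illustration, the column-grouping loop just described might look like the following sketch (a hypothetical helper, not CHOLMOD's actual routine); it assumes parent[] and the per-column nonzero counts colcount[] come from the symbolic analysis, and approximates pattern similarity by comparing counts with a relaxation slack.

/* Minimal sketch: group columns into (relaxed) supernodes.
 * parent[j]   : parent of column j in the elimination tree
 * colcount[j] : number of nonzeros in column j of L
 * super[]     : output; super[s] = first column of supernode s
 * Returns the number of supernodes; super[nsuper] = n is a sentinel. */
int find_supernodes(int n, const int *parent, const int *colcount,
                    int relax, int *super)
{
    int nsuper = 0;
    super[nsuper++] = 0;                       /* first group starts at column 0 */
    for (int j = 1; j < n; j++) {
        /* column j joins the current group if it is the parent of column j-1
         * and their nonzero patterns are (nearly) nested, approximated here
         * by comparing nonzero counts with slack 'relax' */
        int merge = (parent[j - 1] == j) &&
                    (colcount[j] <= colcount[j - 1] - 1 + relax);
        if (!merge)
            super[nsuper++] = j;               /* start a new column group */
    }
    super[nsuper] = n;                         /* sentinel: end of last group */
    return nsuper;
}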

In the rest of this dissertation, we will refer to the supernodal elimination tree simply as the "elimination tree", disregarding the difference.


CHAPTER 3
SPARSE CHOLESKY FACTORIZATION

The Cholesky factorization provides solutions to many problems in scientific computing,

such as linear equation systems, matrix inversion, and eigenvalue problems.

In this chapter, we provide optimizing techniques for the sparse Cholesky factoriza-

tion. The base algorithm is CHOLMOD [5] from SuiteSparse [8]. We show that significant

improvement in performance can be achieved by applying our optimizations to CHOLMOD.

Multithreading is a common practice to increase a program's concurrency by exploiting the

problem’s internal parallelism. The sparsity of sparse matrices grants the possibility to divide

the sparse matrix factorization problem into multiple sub-problems and have different threads

handle them when there is no data dependency. On a hybrid multicore system, multithreading

can significantly accelerate the sparse matrix factorization algorithm. We implement the

multithreading [54] using OpenMP and CUDA's stream feature.

Pipelining is a technique to implement parallelism by attempting to keep different parts

of the system busy. It is a widely used technique in modern CPUs. In matrix factorization

on hybrid multicore systems, pipelining can parallelize the workload on different components

of the system, including CPUs, GPUs, and the DMA engine. In a GPU-accelerated sparse

matrix factorization algorithm, the pipelining technique can effectively ”hide” the data transfer

(between main memory and GPU memory) overhead behind the on-GPU floating-point

operations, reducing the overall time consumption.

The subtree method [48] is a batching technique where multiple CUDA kernels are

launched in a batch. It aims to reduce the total kernel launch overhead of numerous small tasks

by significantly reducing the total number of CUDA kernel calls. The original subtree method

only applies to the lower-level supernodes of the elimination tree. We provide a variant of

the subtree method [55], which applies also to high-level supernodes of the elimination tree if

possible.


3.1 Introduction

The Cholesky factorization is a decomposition of a symmetric positive definite matrix into

the product of a lower triangular matrix and its transpose. Given a symmetric positive definite

matrix A, the Cholesky factorization is to find a matrix L such that A = LLT , where L is a

lower triangular real matrix with positive diagonal entries, and LT is the transpose of L.

One application of the Cholesky factorization is solving systems of linear equations. Let A

be a symmetric positive definite matrix, the equation Ax = b can be solved with the help of

Cholesky factorization. If A = LLT where L is a lower triangular matrix, then LLTx = b. This

equation can be solved by solving Ly = b and LTx = y.

The Cholesky factorization of a dense matrix and a sparse matrix differ in that by taking

into account the non-zero patterns of the sparse matrix, a large portion of computations

involving zero elements can be avoided, and data dependencies between matrix elements can

be loosened allowing parallel factorization of different parts of the sparse matrix.

Assume the matrix A takes the form

$$A = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{n1} \\ a_{21} & a_{22} & \cdots & a_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$

and the matrix L takes the form

$$L = \begin{pmatrix} l_{11} & & & \\ l_{21} & l_{22} & & \\ \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix}$$

then

$$\begin{pmatrix} a_{11} & a_{21} & \cdots & a_{n1} \\ a_{21} & a_{22} & \cdots & a_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} l_{11} & & & \\ l_{21} & l_{22} & & \\ \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix} \begin{pmatrix} l_{11} & l_{21} & \cdots & l_{n1} \\ & l_{22} & \cdots & l_{n2} \\ & & \ddots & \vdots \\ & & & l_{nn} \end{pmatrix}$$

First we have the equation a11 = l11 · l11, therefore l11 = √a11.

With l11 known, the rest of the matrix L can be solved iteratively. There exist multiple

orders in which the entire L can be iteratively solved, and each order defines an algorithm,

namely the up-looking, the right-looking and the left-looking Cholesky. We will introduce these

algorithms in their respective subsections.

3.2 Up-looking Cholesky

The up-looking Cholesky (also named the row-Cholesky) is an algorithm which iteratively

solves L one row at a time, from top to bottom.

Let

$$L = \begin{pmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix}$$

where L11 is the top lower triangular sub-matrix already factorized. Then

$$\begin{pmatrix} A_{11} & a_{21}^T & A_{31}^T \\ a_{21} & a_{22} & a_{32}^T \\ A_{31} & a_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix} \begin{pmatrix} L_{11}^T & l_{21}^T & L_{31}^T \\ & l_{22} & l_{32}^T \\ & & L_{33}^T \end{pmatrix}$$

At the beginning of the up-looking Cholesky algorithm, L11 is empty, and the first iteration is done by computing √a11.


Since l21 L11^T = a21 and l21 l21^T + l22² = a22, we have l21 = a21 L11^{-T} and l22 = √(a22 − l21 l21^T).
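For a dense matrix this row-by-row recurrence can be written directly; the following is a minimal C sketch for illustration (row-major storage, assumed names), not an optimized or library routine.

/* Minimal sketch: dense up-looking (row) Cholesky.  Row i of L is obtained
 * by a forward substitution against the rows already factorized, followed
 * by a square root for the diagonal entry.  A and L are n x n, row-major;
 * only the lower triangles are read/written. */
#include <math.h>

int up_looking_cholesky(int n, const double *A, double *L)
{
    for (int i = 0; i < n; i++) {
        /* off-diagonal part of row i: solve against rows 0..i-1 of L */
        for (int j = 0; j < i; j++) {
            double s = A[i * n + j];
            for (int k = 0; k < j; k++)
                s -= L[i * n + k] * L[j * n + k];
            L[i * n + j] = s / L[j * n + j];
        }
        /* diagonal: l_ii = sqrt(a_ii - sum_k l_ik^2) */
        double d = A[i * n + i];
        for (int k = 0; k < i; k++)
            d -= L[i * n + k] * L[i * n + k];
        if (d <= 0.0) return -1;               /* not positive definite */
        L[i * n + i] = sqrt(d);
    }
    return 0;
}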

The up-looking Cholesky algorithm was first introduced by Rose, Whitten, Sherman and

Tarjan [49]. Compared with band methods [58] and envelope methods [28], the up-looking

algorithm is able to more effectively exploit the matrix's sparsity. Rose et al. describe the

up-looking Cholesky algorithm as a ”general sparse method”. Since it only stores and operates

on the actual nonzeros, it can be substantially more efficient than band methods and envelope

methods.

Liu gives an implementation of up-looking Cholesky [37] that exploits all possible zeros,

by employing a generalized form of the envelope method. The envelope method only exploits

zeros outside envelopes, treating zeros inside envelopes as nonzeros logically. Liu’s algorithm

divides the computation into a sequence of full-envelope (i.e., envelopes with no zeros)

triangular solves, essentially avoiding operations on zeros.

The up-looking Cholesky algorithm does not appear very frequently in the literature, but

it is still widely used. According to Davis’s research [9], the up-looking algorithm may be the

most efficient for very sparse matrices. MATLAB uses an up-looking Cholesky algorithm if the

matrix is very sparse, and uses a left-looking supernodal algorithm otherwise.

3.3 Left-looking Cholesky

The left-looking Cholesky (also named the column-Cholesky) is an algorithm which

iteratively solves L one column at a time, from left to right.

Let

$$L = \begin{pmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix}$$

where

$$\begin{pmatrix} L_{11} \\ l_{21} \\ L_{31} \end{pmatrix}$$

represents the columns already computed.


The left-looking Cholesky algorithm starts with L11, l21, L31 empty, and in each step,

computes the next column of L.

Since

$$\begin{pmatrix} A_{11} & a_{21}^T & A_{31}^T \\ a_{21} & a_{22} & a_{32}^T \\ A_{31} & a_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix} \begin{pmatrix} L_{11}^T & l_{21}^T & L_{31}^T \\ & l_{22} & l_{32}^T \\ & & L_{33}^T \end{pmatrix}$$

we have:

l22 = √(a22 − l21 l21^T)

l32 = (a32 − L31 l21^T) / l22
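For a dense matrix these two formulas translate into a simple column-by-column loop. The sketch below is illustrative C (column-major storage, assumed names), not the supernodal kernel used in CHOLMOD.

/* Minimal sketch: dense left-looking (column) Cholesky.  For each column j,
 * apply the updates from all columns to its left, then take the square root
 * of the diagonal and scale the subdiagonal.  Column-major, n x n; only the
 * lower triangles of A and L are used. */
#include <math.h>

int left_looking_cholesky(int n, const double *A, double *L)
{
    for (int j = 0; j < n; j++) {
        /* start from column j of A (lower part) */
        for (int i = j; i < n; i++)
            L[i + j * n] = A[i + j * n];
        /* update with every column k < j:  L(j:n,j) -= L(j:n,k) * L(j,k) */
        for (int k = 0; k < j; k++) {
            double ljk = L[j + k * n];
            for (int i = j; i < n; i++)
                L[i + j * n] -= L[i + k * n] * ljk;
        }
        /* l_jj = sqrt(.), then l_ij /= l_jj */
        if (L[j + j * n] <= 0.0) return -1;    /* not positive definite */
        double ljj = sqrt(L[j + j * n]);
        L[j + j * n] = ljj;
        for (int i = j + 1; i < n; i++)
            L[i + j * n] /= ljj;
    }
    return 0;
}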

The left-looking Cholesky has been more widely used than the up-looking Cholesky, and

forms the basis of the left-looking supernodal method [5]. The LAPACK function DPOTRF is

an implementation of the dense left-looking Cholesky algorithm. Depending on the dimension of the matrix, DPOTRF may switch to a blocked version.

In the left-looking Cholesky factorization of sparse matrices, multiple columns can be

computed independently, provided that there is no dependency between these columns. This

makes the left-looking Cholesky very useful in parallel Cholesky factorization algorithms.

The parallelism of the left-looking Cholesky is guided by the elimination tree, which is

an output of the symbolic analysis. The nodes of the elimination tree each corresponds to

a column in the matrix. There is data dependency between two columns if and only if one

column’s corresponding node is a descendant of the other, otherwise those columns may

benefit from parallel techniques such as multithreading. The parallelism depicted by the

elimination tree is also referred to as ”tree parallelism”. The tree parallelism can be utilized

when multiple threads or multiple processors are present.

3.4 Right-looking Cholesky

The right-looking Cholesky (also named the submatrix-Cholesky) is also an iterative

method that factorizes the matrix one column at a time.


Let

$$L = \begin{pmatrix} l_{11} & \\ l_{21} & L_{22} \end{pmatrix}$$

Since

$$\begin{pmatrix} a_{11} & a_{21}^T \\ a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} l_{11} & \\ l_{21} & L_{22} \end{pmatrix} \begin{pmatrix} l_{11} & l_{21}^T \\ & L_{22}^T \end{pmatrix}$$

we have:

l11 = √a11

l21 = a21 / l11

A′22 = A22 − l21 l21^T

L22 = r_cholesky(A′22) (the function r_cholesky stands for a right-looking Cholesky factorization of A′22).
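A dense, in-place version of this recurrence is sketched below for illustration (column-major, lower triangle only; assumed names, not a library routine). Note how the trailing submatrix is updated immediately after each column is finished, which is the defining trait of the right-looking scheme.

/* Minimal sketch: dense right-looking Cholesky, in place on the lower
 * triangle of the column-major n x n array A (A is overwritten by L). */
#include <math.h>

int right_looking_cholesky(int n, double *A)
{
    for (int k = 0; k < n; k++) {
        if (A[k + k * n] <= 0.0) return -1;       /* not positive definite */
        double lkk = sqrt(A[k + k * n]);          /* l_kk = sqrt(a_kk)     */
        A[k + k * n] = lkk;
        for (int i = k + 1; i < n; i++)           /* l_ik = a_ik / l_kk    */
            A[i + k * n] /= lkk;
        /* immediately update the trailing submatrix: A22 -= l21 * l21^T   */
        for (int j = k + 1; j < n; j++)
            for (int i = j; i < n; i++)
                A[i + j * n] -= A[i + k * n] * A[j + k * n];
    }
    return 0;
}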

The workflow of the right-looking Cholesky is very similar to that of the left-looking

Cholesky, except that the left-looking Cholesky uses a ”lazy” updating scheme. In a left-

looking Cholesky algorithm, the columns are not updated until right before their factorization,

while in a right-looking Cholesky algorithm, after the factorization of a column, all columns to

the right are immediately updated.

Unlike the up-looking Cholesky and left-looking Cholesky, there exist multiple target

columns for the update operations of the right-looking Cholesky [15]. This is not an issue

if the entire factorization is performed in shared main memory, however, it can cause extra

workload if A22 is stored on different devices (for example, if GPUs are used to perform column

updates).

The right-looking Cholesky forms the foundation of the multifrontal method.

3.5 The Supernodal Method

In this section we will first focus on the left-looking supernodal Cholesky.

Instead of factorizing the matrix one column at a time, the supernodal Cholesky method

combines columns into column panels, and runs in a blockwise manner.


In practice, matrices to be factorized often have columns with identical or similar nonzero

patterns. The supernodal method may exploit this to perform a blockwise factorization without

suffering a significant increase in fill. The supernodal method saves both space and time. By

combining columns with similar nonzero patterns, the supernodal method stores less duplicate

symbolic information of the matrix. By factorizing multiple columns at a time, the supernodal

method benefits from a better utilization of the memory hierarchy [15].

Let

$$L = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix}$$

where

$$\begin{pmatrix} L_{11} \\ L_{21} \\ L_{31} \end{pmatrix}$$

represents the columns already computed.

The left-looking supernodal Cholesky algorithm is similar to the left-looking Cholesky except that multiple columns are combined and processed together. The algorithm starts with L11, L21, L31 empty, and in each step, factorizes the next column panel.

Since

$$\begin{pmatrix} A_{11} & A_{21}^T & A_{31}^T \\ A_{21} & A_{22} & A_{32}^T \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T & L_{31}^T \\ & L_{22}^T & L_{32}^T \\ & & L_{33}^T \end{pmatrix}$$

we have:

$$C = - \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} L_{21}^T$$

(C is named the "contribution block"),

$$\begin{pmatrix} A'_{22} \\ A'_{32} \end{pmatrix} = \begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix} + C$$

(this step is named "assemble"),

L22 = cholesky(A′22) (the function cholesky stands for a dense Cholesky factorization of A′22),

and L32 = A′32 L22^{-T} (this step is named "triangular solve").


The computation of the contribution block and the assemble step, combined, are also called an "update"; i.e., we say that the column panel [A22; A32] is updated by [L21; L31] [9].
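On the CPU, one such update-and-factorize step maps directly onto BLAS/LAPACK calls. The following sketch is an illustration under simplifying assumptions (dense, contiguous column-major panels whose row patterns coincide, so the contribution block can be subtracted in place rather than scattered); it is not CHOLMOD's actual data layout or code.

/* Minimal sketch of one left-looking supernodal step using CBLAS/LAPACKE.
 * D   : nd x k  descendant panel [L21; L31]   (column-major, lda = nd)
 * S   : nd x n2 target panel     [A22; A32] -> [L22; L32] in place
 * n2  : number of pivot columns in the target supernode (rows of L22)
 * The contribution block is applied directly; the real code scatters it
 * through relative row indices computed in the symbolic analysis. */
#include <cblas.h>
#include <lapacke.h>

void supernodal_step(int nd, int n2, int k, const double *D, double *S)
{
    /* upper (symmetric) part of the update: S(0:n2,0:n2) -= L21 * L21^T */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                n2, k, -1.0, D, nd, 1.0, S, nd);
    /* lower part of the update: S(n2:nd,0:n2) -= L31 * L21^T */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                nd - n2, n2, k, -1.0, D + n2, nd, D, nd, 1.0, S + n2, nd);
    /* factorize the diagonal block: L22 = cholesky(A'22) */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n2, S, nd);
    /* triangular solve: L32 = A'32 * L22^{-T} */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                nd - n2, n2, 1.0, S, nd, S + n2, nd);
}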

We expect supernodes (i.e. column panels) to be composed of columns with identical

nonzero patterns, but columns of similar nonzero patterns can also be merged. In this case, the

supernode is called a ”relaxed” supernode, and the nonzero pattern of the supernode is a union

of its columns' nonzero patterns.

Supernodes and relaxed supernodes are determined during the symbolic analysis phase,

after the non-supernodal elimination tree has been determined. The determination of relaxed

supernodes does not require nonzero patterns of each column. Since the nonzero pattern of

a parent node is always a superset of a child node, only nonzero counts are needed to check

if two columns have similar nonzero patterns, provided that one column is an ancestor of the

other in the elimination tree (however, information from the non-supernodal elimination tree is

needed to make sure of this).

The determination of supernodes and relaxed supernodes yields a supernodal elimination

tree, which describes the data dependency between supernodes, and depicts the tree parallelism

of the matrix. The supernodal elimination tree will be used to guide the parallel factorization

of multiple supernodes.

Apart from exploiting tree parallelism, the supernodal method may also increase the

performance by calling functions from efficient linear algebra libraries (BLAS and LAPACK).

Chen et al. developed the CHOLMOD [5] package, which performs Cholesky factorizations

with either the up-looking method (when the matrix is very small or very sparse) or the

left-looking supernodal method (when the matrix is large and not very sparse).

Rennich et al. further enhanced the performance of CHOLMOD by using GPUs in the

factorization [48]. The GPU enhanced version achieves a significant speedup compared to

the CPU-only version. They also introduced the subtree method. The subtree method stores

an entire subtree (instead of a single supernode) of the elimination tree in the GPU memory.


This largely eliminates PCIe transmissions during the factorization of the subtrees, and allows

batched CUDA kernel launches, which significantly reduces the overall kernel launch delay.

Supernodal methods may also be applied to the right-looking Cholesky.

Let

$$L = \begin{pmatrix} L_{11} & \\ L_{21} & L_{22} \end{pmatrix}$$

Since

$$\begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ & L_{22}^T \end{pmatrix}$$

we have:

L11 = cholesky(A11) (the function cholesky stands for a dense Cholesky factorization of A11),

L21 = A21 L11^{-T},

A′22 = A22 − L21 L21^T,

L22 = r_cholesky(A′22) (the function r_cholesky stands for a right-looking supernodal Cholesky factorization of A′22).

Similar to the right-looking non-supernodal Cholesky, in a right-looking supernodal Cholesky algorithm there may be multiple target supernodes for an update operation, which can cause extra data copies and copy-backs if the updates are not performed on a shared-memory device (e.g. on a GPU). This issue can be partly addressed by storing a subtree of the elimination tree on the same device [20].

3.6 The Multifrontal Method

The multifrontal Cholesky is similar to right-looking supernodal Cholesky.

The first step of the multifrontal Cholesky is the symbolic analysis, during which the

supernodal elimination tree is constructed. The supernodal elimination tree is also named the

”assembly tree”.

25

Page 26: PERFORMANCE OPTIMIZATION FOR SPARSE MATRIX …

The nodes of the assembly tree each correspond to a supernode, which in this context is also called a "frontal matrix". The assembly tree describes the dependency between frontal matrices.

The difference between the multifrontal Cholesky and the right-looking supernodal Cholesky is that with the multifrontal method, the contribution block of a child frontal matrix updates only its parent frontal matrix; in return, each node of the assembly tree must be large enough to hold the full contribution blocks from its children, and pass the unused entries on to its parent. In contrast, the supernodes of a supernodal Cholesky algorithm do not relay contribution blocks, because each contribution block is assembled directly into the target ancestor supernode.

The multifrontal Cholesky factorization is composed of multiple partial factorizations of frontal matrices. At the beginning of the algorithm, only the leaves of the assembly tree can be factorized. A frontal matrix cannot be factorized until all of its descendants have been factorized and their contribution blocks assembled.

Let Ap be a frontal matrix, and suppose the contribution blocks from the children of Ap have already been assembled:

$$A_p = \begin{pmatrix} A_{p11} & A_{p21}^T \\ A_{p21} & C_p \end{pmatrix}$$

The factorization of Ap is as follows:

Lp11 = cholesky(Ap11) (the function cholesky stands for a dense Cholesky factorization of Ap11),

Lp21 = Ap21 Lp11^{-T},

C′p = Cp − Lp21 Lp21^T.

Let Aq be the parent frontal matrix of Ap; then the contribution block C′p needs to be assembled into Aq.
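The sketch below illustrates this partial factorization and the subsequent extend-add into the parent front, again using dense BLAS/LAPACK kernels (illustrative names such as relmap are assumptions; the relative row indices would come from the symbolic analysis).

/* Minimal sketch: multifrontal partial factorization of one front.
 * F      : nf x nf dense front, column-major; the first np rows/columns are
 *          the pivot block Ap11, the trailing (nf-np) block becomes C'p.
 * relmap : position of each contribution row/column inside the parent front
 *          Fq (nq x nq), as determined by the symbolic analysis. */
#include <cblas.h>
#include <lapacke.h>

void factorize_front(int nf, int np, double *F, const int *relmap,
                     double *Fq, int nq)
{
    int nc = nf - np;                       /* size of the contribution block */
    /* Lp11 = cholesky(Ap11) */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', np, F, nf);
    /* Lp21 = Ap21 * Lp11^{-T} */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                nc, np, 1.0, F, nf, F + np, nf);
    /* C'p = Cp - Lp21 * Lp21^T  (lower triangle only) */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                nc, np, -1.0, F + np, nf, 1.0, F + np + np * nf, nf);
    /* extend-add: scatter C'p into the parent front via relative indices */
    for (int j = 0; j < nc; j++)
        for (int i = j; i < nc; i++)
            Fq[relmap[i] + relmap[j] * nq] += F[(np + i) + (np + j) * nf];
}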


3.7 Multithreading

The CHOLMOD module of SuiteSparse v4.5.3 was a single-threaded supernodal sparse Cholesky implementation which was only able to utilize one GPU and one CPU. We implemented the

multithreading feature on top of it [54].

Sparse matrices usually contain mutually data-independent supernodes, which makes available what is called "tree parallelism", so named because this kind of parallelism is represented by the structure of the elimination tree.

The multithreading technique allows the algorithm to exploit the tree parallelism and

factorize those supernodes in parallel. On a machine with multiple GPUs, the multithreaded

algorithm will associate each GPU with a thread and let the threads handle the supernodes

simultaneously.

Even if only one GPU is available, the algorithm can try to exploit the tree parallelism

by dividing the GPU memory into multiple regions, letting each region hold a supernode, and

having independent CUDA streams factorize the supernodes in parallel.

In our algorithm, we leverage the tree parallelism with OpenMP. We use an elimination-

tree-based scheduling policy to parallelize the factorizations of supernodes. Upon entering

the numerical factorization phase, a number of threads are created, and each thread will

repeatedly try to fetch the next available supernode for factorization until no more supernodes

are available. A supernode is ”available” for factorization if and only if all other supernodes it

depends on have been factorized and their contribution blocks assembled, i.e. all of its children

in the elimination tree have been factorized, their contribution blocks computed and assembled

into this supernode.

We maintain a queue W, which contains all supernodes available for factorization. Initially, W should contain all the leaves of the elimination tree. In our algorithm, the elimination

tree is represented by an n-element array P = {p0, · · · , pn−1}, where n is the number of

supernodes and pk is the parent of k. Since W = {0, · · · , n − 1} − P , W can be initialized


by iterating through P and taking out elements of P from {0, · · · , n − 1}. In Algorithm 1, we

refer to P as Parent.

We also use another n-element array Nchild which initially contains each supernode’s

number of children. This array can be computed by first initializing all its elements to 0, and

then iterating through P and incrementing Nchild[pk] for each k ∈ {0, · · · , n− 1}.
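For illustration, the initialization of W and Nchild described above might look like the following sketch (here Parent uses -1 for roots, and W is a plain array used as a queue; these are assumptions, not the actual data structures).

/* Minimal sketch: build the initial work queue W (all leaves) and the
 * child counters Nchild from the Parent array of the elimination tree.
 * Parent[k] = parent supernode of k, or -1 if k is a root.  Returns |W|. */
int init_schedule(int n, const int *Parent, int *Nchild, int *W)
{
    for (int k = 0; k < n; k++) Nchild[k] = 0;
    for (int k = 0; k < n; k++)
        if (Parent[k] >= 0) Nchild[Parent[k]]++;   /* count children  */
    int nw = 0;
    for (int k = 0; k < n; k++)
        if (Nchild[k] == 0) W[nw++] = k;           /* leaves are ready */
    return nw;
}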

Each thread will run Algorithm 1 (W , Parent, and Nchild are shared among all threads,

finished, s, and p are local).

Algorithm 1 Parallel Factorization Scheduling

1:  finished := FALSE
2:  while finished = FALSE do
3:    enter critical section
4:    if W is empty then
5:      finished := TRUE
6:    else
7:      s := W.pop()
8:    end if
9:    exit critical section
10:   if finished = FALSE then
11:     factorize supernode s
12:     if s has no parent then
13:       finished := TRUE
14:     else
15:       p := Parent[s]
16:       enter critical section
17:       Nchild[p] := Nchild[p] - 1
18:       if Nchild[p] = 0 then
19:         W.push(p)
20:       end if
21:       exit critical section
22:     end if
23:   end if
24: end while

Algorithm 1 ensures that:

• All leaf supernodes are factorized.

• All leaf supernodes are factorized exactly once, because they will not be pushed into W

in Algorithm 1.


• If the children of a supernode p are factorized exactly once in Algorithm 1, then p will

be pushed into W when, and only when, the last of its children is factorized, and p will

eventually be factorized, exactly once.

Therefore Algorithm 1 factorizes all supernodes in the elimination tree exactly once, and

the dependencies between supernodes are satisfied.

We implement multithreading on the GPU-accelerated CHOLMOD. When multiple GPUs

are available, we associate each thread with a GPU, so that the threads will run independently

without resource conflict.

If only one GPU is available, or the number of GPUs is smaller than our intended number

of threads, we may split the GPU memory into multiple regions, and let each thread use a

specific GPU memory region. In this case, we will make use of CUDA’s streams. A CUDA

stream is a sequence of operations that execute in issue-order on the GPU. CUDA operations

from the same stream run sequentially, while CUDA operations from different streams may

run concurrently and interleaved. We assign a CUDA stream to each thread, and issue CUDA

operations in each thread to their designated CUDA stream.
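A minimal sketch of this thread/GPU/stream association is shown below (assumed setup code, not the dissertation's implementation): each OpenMP thread selects a device and creates its own CUDA stream, to which all of its subsequent CUDA work is issued.

/* Minimal sketch: associate each OpenMP thread with a GPU (or, if GPUs are
 * scarce, with a dedicated CUDA stream on a shared GPU). */
#include <cuda_runtime.h>
#include <omp.h>

void setup_threads(int nthreads)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus < 1) ngpus = 1;

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid % ngpus);          /* one GPU per thread if possible */

        cudaStream_t stream;
        cudaStreamCreate(&stream);           /* per-thread stream for kernels,
                                                copies, and library calls */
        /* ... the scheduling loop (Algorithm 1) issues its CUDA work here ... */

        cudaStreamDestroy(stream);
    }
}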

We observed up to a 3.5× increase in performance in our experiments after applying the multithreading optimization.

3.8 The Subtree Method and the Multilevel Subtree Method

The subtree method was implemented by Rennich et al. [48] in CHOLMOD of SuiteSparse v4.6.0 beta.

The size of the supernodes of a sparse matrix may greatly affect the efficiency of the sparse Cholesky factorization. Since each CUDA kernel launch incurs an overhead, a traditional supernodal sparse Cholesky algorithm applied to a sparse matrix containing numerous small supernodes may require a large number of CUDA kernel launches, each with non-negligible overhead, resulting in a significant aggregate kernel-launch overhead.

The subtree method addresses this issue by launching CUDA kernels in batches. With

the subtree method, the algorithm sends entire subtrees instead of individual supernodes to


the GPU memory, so that multiple supernodes can exist in the GPU memory at the same

time. The algorithm then collects tasks that are small enough (below a configurable threshold)

from those on-GPU supernodes, and launches their corresponding CUDA kernels in batches.

The batched launch of CUDA kernels significantly reduces overall kernel-launching overhead.

Rennich’s experiments showed that the application of the subtree method nearly eliminates the

CUDA kernel launching overhead [48].

The subtree method also reduces data transmission between the main memory and the

GPU memory. In the baseline factorization algorithm, a descendant supernode copied to

the GPU memory is overwritten after it updates an ancestor (after its contribution block is

computed and assembled to its current ancestor being factorized), therefore each supernode

must be copied to the GPU memory every time it needs to update an ancestor. But with the

subtree algorithm, since multiple supernodes can co-exist in the GPU memory, a supernode

can update multiple ancestors as long as these ancestors are in the GPU memory with it at

the same time. This reduces the host-to-device data transfer cost, and increases the overall

performance.

There tend to be data dependencies between supernodes in the same subtree, so the algorithm cannot factorize all the supernodes in the subtree in a single batch. Instead,

the subtree method divides the subtree into multiple levels where each level contains mutually

data-independent supernodes, and factorizes the levels in such an order that no level depends

on the levels after it. The algorithm picks supernodes from the current level, and launches

the CUDA kernels for their factorization in a batch.

Algorithm 2 describes the implementation of the subtree factorization.

In the case when the entire matrix can not be stored in the GPU memory, there will be

remaining supernodes at higher levels of the elimination tree that are not processed in subtrees.

These supernodes form a tree that we call a ”root tree”. Rennich’s algorithm will fall back to

the baseline supernodal factorization algorithm when processing the root tree.


Algorithm 2 Subtree algorithm

1:  procedure factorize subtree
2:    for all supernodes in subtree do
3:      copy supernode data to GPU memory
4:    end for
5:    device synchronization
6:    for all levels in subtree do
7:      for all supernodes in level do
8:        contribution blocks (upper, batched SYRK)
9:        contribution blocks (lower, batched GEMM)
10:     end for
11:     device synchronization
12:     for all supernodes in level do
13:       assemble contribution blocks (batched)
14:     end for
15:     device synchronization
16:     for all supernodes in level do
17:       factorize supernode (batched POTRF)
18:     end for
19:     device synchronization
20:     for all supernodes in level do
21:       triangular solve supernode (batched TRSM)
22:     end for
23:     device synchronization
24:   end for
25:   for all supernodes in subtree do
26:     copy supernode data back to main memory
27:   end for
28: end procedure
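As an illustration of the batched launches in the level loop of Algorithm 2, the lower contribution blocks of all supernodes in a level could be computed with a single batched GEMM call. The sketch below assumes, for simplicity, that the blocks in the batch share the same dimensions and that the pointer arrays already reside in GPU memory, whereas in practice the supernodes in a level generally differ in size.

/* Minimal sketch: one batched GEMM covering every lower contribution-block
 * update of the current level.  dA/dB/dC are device arrays of device
 * pointers; m, n, k are the (assumed uniform) block dimensions. */
#include <cublas_v2.h>

void level_contrib_blocks(cublasHandle_t handle, int m, int n, int k,
                          const double *const *dA, const double *const *dB,
                          double *const *dC, int batch)
{
    const double alpha = -1.0, beta = 1.0;   /* C := C - A * B^T */
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                       m, n, k, &alpha,
                       dA, m,       /* descendant rows        (m x k each) */
                       dB, n,       /* descendant pivot rows  (n x k each) */
                       &beta,
                       dC, m,       /* contribution blocks    (m x n each) */
                       batch);
}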

Fig. 3-1 depicts the workflow of the subtree algorithm. Note that after factorizing the

subtree, the program may choose between the baseline algorithm and the multilevel subtree

algorithm (which we will describe later).

Rennich’s experiments showed that the subtree method brings a performance increase of

up to 1.9 times [48].

On top of the subtree method we implemented the multilevel subtree method [55] which

applies the subtree method to higher levels of the elimination tree. The multilevel subtree

algorithm tries to create even more subtrees, by applying the same technique as the subtree


Figure 3-1. Workflow of single level (a+c) / multilevel (a+b) subtree algorithm

algorithm for the entire tree. This will take the form of a while loop which repeats until either

the matrix is fully factorized or no more subtrees can be constructed. The ”leaf” subtrees

will be processed in the same way as Rennich’s subtree algorithm, while ”non-leaf” subtrees

need to be updated by their descendant subtrees before their factorization, to satisfy the data

dependency.

The supernodes on higher levels of the elimination tree are usually large, therefore the

reduction in CUDA kernel launching cost is not significant, but the algorithm can still benefit

from the reduction in the cost of host-to-device transfers. These supernodes also generally

have large numbers of descendants. Our experiments show that usually after the first loop of

the subtree algorithm, most remaining unfactorized supernodes will have a set of descendants

whose total size exceeds the capacity of our GPU memory. Most of these descendants

are supernodes that are already factorized in previous subtree factorizations, but have not

yet updated all of their respective ancestor supernodes. To address this issue, we exclude

supernodes that are already factorized in previous loops from newly constructed subtrees.


Those supernodes will be stored in the main memory, and copied to the GPU memory only

when they are needed in an update operation.

For these inter-subtree updates, we fall back to the original supernodal method but order

the updates in such a way that each descendant supernode can update multiple ancestors

before it is overwritten by the next descendant, to minimize the host-to-device data transfer

cost. To hide the host-to-device transfers incurred during the inter-subtree updates behind the

on-GPU update operations, two blocks of memory are allocated on the GPU, so that while

the descendant supernode stored in one block of memory is used in the updates, the next

descendant can be copied into the other at the same time.

We maintain a linked list for each supernode s, which contains the supernode's descendants that have been factorized but have not yet been used to update s. At the start of the factorization of a "non-leaf" subtree, we iterate through these lists, and issue update tasks

for the supernodes in these lists. For each supernode d in these lists, we find its ancestors in

the current subtree, and issue an update task for each of these ancestors. d is then copied to

the GPU memory, and is used to update the ancestors mentioned above. Then we find d’s

lowest-level ancestor that is also an ancestor of the current subtree’s root (or one of the roots

if the ”subtree” is actually a forest), and (if it exists) put d in its list (if this ancestor needs to

be updated by d).

After the inter-subtree updates, the ”non-leaf” subtree should have become a ”leaf”

subtree, and be ready for processing with the subtree method for the first-level subtrees.

Algorithm 3 describes the implementation of the multilevel subtree factorization algorithm

on higher levels of the elimination tree.

Fig. 3-1 also depicts the workflow of the second level of the multilevel subtree algorithm.

3.9 Pipelining

The pipelining technique improves the efficiency of the sparse matrix factorization by

keeping different components of the machine busy simultaneously. The factorization of a sparse


Algorithm 3 Multilevel subtree algorithm

1:  procedure factorize subtree
2:    for all supernodes in subtree do
3:      copy supernode data to GPU memory
4:    end for
5:    device synchronization
6:    for all supernodes in subtree do
7:      ancestor := supernode
8:      while ancestor in subtree do
9:        ancestor := ancestor.parent
10:     end while
11:     copy descendant to GPU memory
12:     use descendant to update nodes in subtree
13:     if ancestor != nil then
14:       put descendant in ancestor.update list
15:     end if
16:   end for
17:   device synchronization
18:   for all levels in subtree do
19:     for all supernodes in level do
20:       contribution blocks (upper, batched SYRK)
21:       contribution blocks (lower, batched GEMM)
22:     end for
23:     device synchronization
24:     for all supernodes in level do
25:       assemble contribution blocks (batched)
26:     end for
27:     device synchronization
28:     for all supernodes in level do
29:       factorize supernode (batched POTRF)
30:     end for
31:     device synchronization
32:     for all supernodes in level do
33:       triangular solve supernode (batched TRSM)
34:     end for
35:     device synchronization
36:   end for
37:   for all supernodes in subtree do
38:     copy supernode data back to main memory
39:   end for
40: end procedure


matrix involves the CPUs, the GPUs, and the DMA engine, and these components can work in

parallel.

There are two layers of possible pipelines. One is a pipeline within the subtrees, which

overlaps the factorization and the copyback of different levels of the subtree. The other is

a pipeline that overlaps the factorization of the subtrees with the flushing of the pinned host

memory buffers.

In the subtree algorithm, after a subtree is copied from the main memory to the GPU

memory, the algorithm processes the subtree level by level. Each level must be factorized and

then copied back to the main memory. Since the factorization of a level is performed by the

GPU cores while the copyback is done by the DMA engine, it is possible to use a pipeline. We

implemented the pipeline to let the copyback of a level and the factorization of the next level

run simultaneously. It was done by using different CUDA streams for the factorization and the

copyback operations, and adding synchronization barriers to ensure data consistency.
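As a concrete illustration, the following is a minimal sketch in C with the CUDA runtime API; the helper names launch_level_kernels and copyback_level_async and the level loop are illustrative placeholders, not the actual CHOLMOD code. It shows how two streams and one event can overlap the factorization of a level with the copyback of the previous level.

#include <cuda_runtime.h>

extern int  num_levels;
extern void launch_level_kernels(int level, cudaStream_t s);   /* batched SYRK/GEMM/POTRF/TRSM */
extern void copyback_level_async(int level, cudaStream_t s);   /* device-to-host, pinned buffer */

void factorize_subtree_pipelined_sketch(cudaStream_t fact_stream, cudaStream_t copy_stream)
{
    cudaEvent_t done;
    cudaEventCreate(&done);
    for (int level = 0; level < num_levels; level++) {
        launch_level_kernels(level, fact_stream);      /* factorize this level on the GPU   */
        cudaEventRecord(done, fact_stream);            /* mark: this level's data is final  */
        cudaStreamWaitEvent(copy_stream, done, 0);     /* copyback waits for those kernels  */
        copyback_level_async(level, copy_stream);      /* overlaps with the next level      */
    }
    cudaStreamSynchronize(copy_stream);                /* wait for the last copyback        */
    cudaEventDestroy(done);
}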

Fig. 3-2 illustrates the performance gain from pipelining. The pipeline lets the CUDA kernels and the copyback run in parallel, hiding the device-to-host transfers behind the on-GPU operations and reducing the total factorization time.

Figure 3-2. Using pipeline in the factorization of a subtree

A subtree that has been copied back is not placed directly into its destination memory region. Instead,

a piece of pinned host memory is used as a buffer. The pinned host memory is necessary for

asynchronous PCIe transfers, but the allocation of pinned host memory is very time-consuming.


The buffer must be flushed before it is overwritten by the next subtree. Since the flushing

of the buffer is done by the CPU while the factorization and the copyback of the subtree are

done by the GPU and the DMA engine, we implemented another pipeline so that the buffer

flushing of a subtree overlaps with the factorization and the copyback (through PCIe) of the

next subtree.

The modified subtree algorithm with pipelining is described in Algorithm 4.

Algorithm 4 Pipelining in subtree algorithm

1: procedure factorize_subtree_pipelined
2:   for all supernodes in subtree do
3:     copy supernode data to GPU memory
4:   end for
5:   device synchronization
6:   for all levels in subtree do
7:     for all supernodes in level do
8:       contribution blocks (upper, batched SYRK)
9:       contribution blocks (lower, batched GEMM)
10:    end for
11:    event synchronization
12:    copy the previous level from buffer to destination
13:    device synchronization
14:    for all supernodes in level do
15:      assemble contribution blocks (batched)
16:    end for
17:    device synchronization
18:    for all supernodes in level do
19:      factorize supernode (batched POTRF)
20:    end for
21:    device synchronization
22:    for all supernodes in level do
23:      triangular solve supernode (batched TRSM)
24:    end for
25:    device synchronization
26:    for all supernodes in level do
27:      copy supernode data back to buffer
28:    end for
29:    record event
30:  end for
31:  copy the last level from buffer to destination
32: end procedure


3.10 The Batched Sparse Cholesky Factorization

A batched matrix factorization mechanism can improve the average efficiency of the matrix factorization by eliminating the need for repeated resource allocations and deallocations and by increasing the program's concurrency.

In this section we will introduce two ways to implement the batched Cholesky factoriza-

tion.

3.10.1 The Merge-and-Factorize Approach

A trick can be implemented on top of the non-batched version of the sparse Cholesky

factorization algorithm to allow batched Cholesky factorization of multiple matrices. The

Cholesky factorization is the decomposition of a symmetric positive-definite matrix A into the product of a lower triangular matrix L and its transpose (A = L L^T). Given a list of symmetric positive-definite matrices A_1, A_2, · · · , A_k, doing their Cholesky factorization is equivalent to

factorizing the block diagonal matrix

A = diag(A_1, A_2, · · · , A_k).

A is also a symmetric positive-definite matrix. Let A = L L^T (L lower triangular); it is easy to see that

L = diag(L_1, L_2, · · · , L_k),

where A_i = L_i L_i^T for every i from 1 to k.

This implementation of the batched Cholesky factorization is very straightforward: we

arrange the matrices to be factorized diagonally to form a larger matrix, and factorize the new

matrix instead.
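The following minimal sketch (plain C, using an illustrative CSC struct and field names rather than CHOLMOD's actual data types) shows how k sparse matrices could be stacked along the diagonal before a single non-batched factorization call:

#include <stdlib.h>

typedef struct {
    int     n;        /* dimension                 */
    long    nnz;      /* number of stored entries  */
    long   *colptr;   /* column pointers, size n+1 */
    int    *rowind;   /* row indices, size nnz     */
    double *val;      /* numerical values, size nnz*/
} csc_t;

static csc_t block_diag(const csc_t *A, int k)
{
    csc_t B = {0};
    for (int i = 0; i < k; i++) { B.n += A[i].n; B.nnz += A[i].nnz; }
    B.colptr = malloc((B.n + 1) * sizeof *B.colptr);
    B.rowind = malloc(B.nnz * sizeof *B.rowind);
    B.val    = malloc(B.nnz * sizeof *B.val);

    int row_off = 0; long nz_off = 0; int col = 0;
    for (int i = 0; i < k; i++) {
        for (int j = 0; j < A[i].n; j++, col++) {
            B.colptr[col] = nz_off + A[i].colptr[j];
            for (long p = A[i].colptr[j]; p < A[i].colptr[j + 1]; p++) {
                B.rowind[nz_off + p] = A[i].rowind[p] + row_off;  /* shift row indices */
                B.val[nz_off + p]    = A[i].val[p];
            }
        }
        row_off += A[i].n;    /* next matrix starts one block further down the diagonal */
        nz_off  += A[i].nnz;
    }
    B.colptr[B.n] = B.nnz;
    return B;
}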

The elimination tree of A will be a forest composed of the elimination trees of A_1, A_2, · · · , A_k. Since A_1, A_2, · · · , A_k have no dependency on each other, it is possible to factorize them in

parallel. More precisely, since the supernodes and subtrees from different matrices have no data


dependency, if multiple GPUs (or multiple threads) are present, the batched factorization al-

lows greater flexibility in the scheduling of the workflow than serial non-batched factorizations.

The batched sparse Cholesky factorization also works well with the subtree algorithm.

The merging of two sparse matrices can result in subtrees containing supernodes from different

matrices, allowing larger subtrees to be constructed.

Our previous experiments have shown up to a 140% improvement in performance with this

merge-and-factorize type of batched factorization scheme.

3.10.2 The Normal Approach

Unfortunately the previous trick works only for Cholesky and not for QR, and its memory

consumption is high. To expand the availability of batching to the sparse QR factorization,

and reduce the memory consumption, we need to reimplement the batched sparse matrix

factorization algorithm. This other implementation of the batched factorization runs the

symbolic analysis for each of the input matrices and constructs their respective elimination

trees, instead of merging those matrices.

We implement this generalized batched factorization scheme by exploiting OpenMP. A number of threads (equal to the maximum number of sparse matrices being factorized) are created. Each thread will try to fetch the next available sparse matrix in the list, and factorize

it, until all sparse matrices in the list have been factorized.

To maximize the utilization of GPUs while avoiding conflicts, each thread will try to make

use of all available GPUs. Each thread will also have sub-threads to have the factorizations

of its supernodes run in parallel. After selecting an available supernode and before actually

factorizing it, the sub-thread will iterate through the list of GPUs and try to reserve one. If all GPUs are busy, the sub-thread will wait until one becomes available.
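A minimal sketch of this scheme is given below (C with OpenMP; the helpers next_supernode and factorize_supernode and the per-GPU lock array are hypothetical placeholders, and the sub-thread layer is collapsed into a single loop for brevity):

#include <omp.h>

extern int  num_matrices, num_gpus;
extern omp_lock_t gpu_lock[];                       /* one lock per GPU                    */
extern int  next_supernode(int m);                  /* hypothetical: next ready supernode, or -1 */
extern void factorize_supernode(int m, int s, int g);

void batched_factorize(void)
{
    int next = 0;                                   /* index of the next matrix in the list */
    #pragma omp parallel num_threads(num_matrices)
    for (;;) {
        int m;
        #pragma omp critical (fetch)                /* fetch the next available matrix      */
        m = (next < num_matrices) ? next++ : -1;
        if (m < 0) break;                           /* all matrices have been factorized    */

        for (int s; (s = next_supernode(m)) >= 0; ) {
            int g = -1;                             /* reserve any free GPU                 */
            while (g < 0)
                for (int i = 0; i < num_gpus && g < 0; i++)
                    if (omp_test_lock(&gpu_lock[i])) g = i;
            factorize_supernode(m, s, g);
            omp_unset_lock(&gpu_lock[g]);           /* release the GPU                      */
        }
    }
}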

Our experiments show a performance increase of up to 37.53% with the application of this

type of batched factorization.


3.11 Experiment Results

The testcase matrices we used are from the SuiteSparse Matrix Collection [13], and are

listed in Table 3-1.

Table 3-1. Test matrices
matrix       problem type                   dimension   nonzeros
Emilia_923   structural                       923,136   40,373,538
Fault_639    structural                       638,802   27,245,944
Flan_1565    structural                     1,564,794   114,165,372
Geo_1438     structural                     1,437,960   60,236,322
Hook_1498    structural                     1,498,023   59,374,451
Serena       structural                     1,391,349   64,131,971
StocF-1465   computational fluid dynamics   1,465,137   21,005,389
audikw_1     structural                       943,695   77,651,847
bone010      model reduction                  986,703   47,851,783
nd24k        2D/3D problem                     72,000   28,715,634

Due to the availability of equipment, the experiments were carried out on different platforms.

3.11.1 Multithreading

The experiments for the multithreading optimization were performed on a platform with a

dual socket Intel(R) Xeon(R) CPU E5-2695 v2 at 2.4 GHz, eight NVIDIA Tesla K40m GPUs

each with 2880 CUDA cores, and 12 GB of physical memory.

The experiments were performed with different numbers of threads: 1, 2, 4, 8, and 12.

Each thread was assigned a CUDA stream, so the number of CUDA streams was equal to the

number of threads.

Fig. 3-3, 3-4, and 3-5 show, respectively, the time, power, and energy cost for the different thread-count configurations. Our experiments show an increase in performance with

multithreading until a threshold is reached, and beyond that the performance starts to decline

when more threads are used.

In our experiments, the optimal performance (minimal time or minimal energy) was

reached mostly when 4 or 8 threads were used. The factorization time was reduced by up to

73.06%, and the energy consumption was reduced by up to 73.49%. However, there were no significant changes in the GPU's power consumption.


Figure 3-3. Factorization time for sparse Cholesky with / without multithreading (y-axis: factorization time; x-axis: testing matrices; series: CHOLMOD CPU, CHOLMOD GPU, and GPU multiple streams with 2, 4, 8, and 12 streams)

Figure 3-4. Average power consumption for sparse Cholesky with / without multithreading (series: CHOLMOD CPU, CHOLMOD GPU (1 stream), and GPU multiple streams with 2, 4, 8, and 12 streams)


Figure 3-5. Energy consumption for sparse Cholesky with / without multithreading (y-axis: energy consumption; x-axis: testing matrices; series: CHOLMOD CPU, CHOLMOD GPU, and GPU multiple streams with 2, 4, 8, and 12 streams)

3.11.2 Pipelining

We applied pipelines to Rennich’s subtree method, before the multilevel subtree was

implemented. Our experiments were performed on a platform with a dual socket Intel(R)

Xeon(R) CPU E5-2695 v2 at 2.4 GHz, four NVIDIA Tesla P100 GPUs each with 3840 CUDA

cores, and 16 GB of physical memory.

We compare the performance of GPU-accelerated supernodal sparse Cholesky algorithm

with different optimizations, including

• The baseline supernodal algorithm (SuiteSparse 4.5.3)

• Supernodal algorithm with multithreading

• The subtree method (SuiteSparse 4.6.0-beta)

• The subtree method with multithreading and pipelines

Fig. 3-6, 3-7, 3-8 show the performance comparison in terms of Gflops, power, and

energy.

We see that multithreading increases the efficiency of the algorithm by up to 2.5 times.

Though there are exceptions, the performance usually increases when more threads are used.


Figure 3-6. Cholesky factorization performance for sparse matrices

Figure 3-7. Power Consumption of sparse Cholesky factorization


Figure 3-8. Energy Consumption of sparse Cholesky factorization

The subtree method increases the efficiency of the algorithm by up to 3 times. Using pipelines

on top of the subtree method increases the performance by an additional 10% to 25%. We

expected to gain the best performance when all three techniques are combined, but in reality

multithreading is not very effective with the subtree technique present. The number of threads

quickly hits the threshold when the subtree algorithm is used. The experiments show that the

best performance is reached when we use the pipelined subtree algorithm, and the number of

GPU threads is 1 or 2.

The experiments show that the power increases as the performance of the algorithm improves. This is expected, because higher performance usually indicates more intense computation on the GPU. Despite the higher power consumption, the total energy consumed actually decreases

when the performance is better. This is because the decrease in factorization time offsets the

increase in power. It is also observed that pipelining does not have a significant impact on the

total energy consumption.

3.11.3 Multilevel Subtree Method

The multilevel subtree method was implemented on top of the pipelined subtree method.


Our experiments for the multilevel subtree method were performed on a platform with a

dual socket AMD Opteron Processor 6168 CPU and two NVIDIA Tesla K20c GPUs.

Fig. 3-9 shows the performance when only one GPU is used. It can be seen that the

multilevel subtree algorithm achieves up to 2.43 times the performance of the baseline single-

level subtree algorithm, and up to 1.42 times the performance of the pipeline-enhanced subtree

algorithm. For 9 of the 10 testcase matrices, using the multilevel subtree algorithm along with

pipelining led to improvements in performance compared to the single-level subtree algorithm.

The average speedup is 1.59 times over the single-level subtree algorithm.

Figure 3-9. Performance comparison between single-level subtree algorithm and multilevel subtree algorithm on a single GPU

The multilevel subtree algorithm does not bring a performance increase for all testcase

matrices. Geo_1438 is an example in which the application of the multilevel subtree method

results in a performance loss. The subtree method is only effective when there are multiple

supernodes in the subtree, so if there is only one supernode in a subtree, there will be no

performance gain, and the extra overhead introduced in the subtree method may slow the

entire algorithm down.

Fig. 3-10 depicts the structure of the matrix Geo_1438, where each triangle stands for

either a subtree or the root tree. The number in each triangle shows the number of supernodes

in the subtree. Since the subtrees in the third and the fourth levels each contain only one supernode, they are not able to provide any performance boost. In this case the only possible performance gain over the single-level subtree algorithm lies in the second level. However, the

experiment result shows that it is not enough to offset the loss in the third and the fourth

level.


Figure 3-10. Structure of matrix Geo_1438

Fig. 3-11 adds the performance data for the case when both GPUs are used. When two GPUs are used, the performance of the multilevel subtree algorithm is worse than that of the single-level subtree algorithm. The cause of this is not yet confirmed.

Figure 3-11. Performance comparison between single-level subtree algorithm and multilevel subtree algorithm on a single GPU and two GPUs


3.11.4 Batched Sparse Cholesky Factorization

We implemented batched sparse Cholesky factorization on top of the pipelined multilevel

subtree method.

The experiments were performed on a platform with a dual socket Intel(R) Xeon(R) CPU

E5-2695 v2 at 2.4 GHz and eight NVIDIA Tesla K40m GPUs. We compare the performance of

sequential factorization and the performance of batched factorization when factorizing multiple

matrices.

Fig. 3-12 and Fig. 3-13 show the experimental results for batched sparse Cholesky

factorization (data for batched factorization of 4x matrix Serena are missing). With 2 GPUs

and 2 matrices to factorize, the batched factorization is able to improve the performance by

up to 53.9%. With 4 GPUs and 4 matrices to factorize, the batched factorization is able to

improve the performance by up to 125.0%.

Figure 3-12. Batched Cholesky factorization performance versus sequential matrix factorization on GPUs (2 GPUs)

3.12 Conclusions

In this chapter, we present several optimizations for CHOLMOD, a supernodal sparse

Cholesky factorization algorithm. Our optimizations include multithreading, pipelining, the multilevel subtree method, and batched factorization. Each of these optimizations provides a


Figure 3-13. Batched Cholesky factorization performance versus sequential matrix factorization on GPUs (4 GPUs)

performance increase, and when used in conjunction with the subtree method [48], they can

offer a significant increase in the efficiency of the factorization.


CHAPTER 4
SPARSE QR FACTORIZATION

The QR factorization can be utilized to solve problems in scientific computing. It is the

method of choice for sparse least squares problems, underdetermined systems, and for solving

sparse linear systems when A is very ill-conditioned [15].

4.1 Introduction

The QR factorization is the decomposition of a matrix A into the product A = QR, where

Q is orthogonal and R is upper triangular.

An orthogonal matrix is a matrix whose rows (and columns) form an orthonormal basis. An orthogonal matrix Q has the property Q Q^T = I, and therefore Q^{-1} = Q^T. This property makes

the QR factorization useful for solving linear equation system Ax = b. If A = QR where Q

is orthogonal and R is upper triangular, then the equation system can be solved by computing

y = QT b and then solving Rx = y.

The QR factorization is often used to solve the linear least squares problem. Let A be a given m × n matrix and b a given vector; the linear least squares problem is to determine a vector x such that ∥b − Ax∥ has minimum value [25].

Suppose A is an m× n matrix with rank r, and A has QR factorization A = QR, then we

have rank(R) = r. Let vector c = QT b, then ∥b− Ax∥ = ∥Qc−QRx∥ = ∥c−Rx∥.

When m ≤ n, the minimum value of ∥c − Rx∥ is 0. The linear least squares problem has a unique solution if and only if r = n.

In the subsequent text we will assume that m ≥ n and r = n.

If m ≥ n and r = n, let c = [ c_1 ; c_2 ], where c_1 contains the first n entries of c, and let R = [ R_1 ; O ], where R_1 is the first n rows of R and the entries of O are all zero. Then

∥c − Rx∥ = ∥ [ c_1 − R_1 x ; c_2 ] ∥,

which attains its minimum value when x = R_1^{-1} c_1.
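For illustration, the final triangular solve R_1 x = c_1 is a simple back substitution; a minimal dense sketch in C (column-major storage with leading dimension ldr; illustrative, not a library routine) is:

static void back_substitute(int n, const double *R, int ldr,
                            const double *c1, double *x)
{
    for (int j = n - 1; j >= 0; j--) {
        double s = c1[j];
        for (int k = j + 1; k < n; k++)
            s -= R[j + k * ldr] * x[k];   /* subtract the known terms R(j,k)*x(k) */
        x[j] = s / R[j + j * ldr];        /* divide by the diagonal entry R(j,j)  */
    }
}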

When solving the linear least squares problem, there is no need to store Q [10]. Since the

orthogonal Q represents matrix row operations, and the calculation of R is basically applying a


series of matrix row operations on A, the same row operations can be applied to b during the

QR factorization, and Q can be discarded thereafter.

Algorithms for the QR factorization are typically based on Gram-Schmidt orthogonaliza-

tion [4] [60], Householder reflection [51], or Givens rotation [17] [21].

The Gram-Schmidt orthogonalization is a method for computing QR factorizations that

iteratively eliminates subdiagonal elements of A, one column at a time. While effective for

dense QR factorizations, the Gram-Schmidt orthogonalization method has the disadvantage

of generating numerous fill-in entries, making it undesirable for sparse problems. The Gram-

Schmidt orthogonalization method also requires storing the matrix Q explicitly.

QR factorization algorithms based on the Householder reflection also eliminate the subdiagonal elements of A one column at a time, but the Householder reflection method is able to represent Q in a much sparser form than the Gram-Schmidt method. When using

Householder reflection based QR factorization to solve Ax = b, Q can be discarded to save

space by applying the transformations to b as Q is computed.

The Givens rotation approach is believed to outperform both the Gram-Schmidt orthog-

onalization and the Householder reflection when the matrix A is very sparse, due to its ability

to more effectively exploit the matrix's sparsity [27]. However, this advantage disappears when

the matrix is dense enough. For a full matrix, the Givens rotation based QR algorithm requires

50% more floating-point operations than its Householder reflection counterpart [9].

4.2 Gram-Schmidt Orthogonalization

The Gram-Schmidt orthogonalization, also called the Gram-Schmidt process, is a procedure that takes a set of nonorthogonal, linearly independent vectors and constructs an orthogonal basis from them; when the basis vectors are normalized, the result is an orthonormal basis.

Let A be an n × n full-rank matrix, A = [ a_1 a_2 · · · a_n ], where the a_j are n × 1 column vectors.

u_1 = a_1,  e_1 = u_1 / ∥u_1∥
u_2 = a_2 − (a_2 · e_1) e_1,  e_2 = u_2 / ∥u_2∥
· · ·
u_j = a_j − Σ_{k=1}^{j−1} (a_j · e_k) e_k,  e_j = u_j / ∥u_j∥

The QR factorization of A can be derived from the Gram-Schmidt orthogonalization:

Q = [ e_1 e_2 · · · e_n ],

and R is the upper triangular matrix with entries R_{kj} = a_j · e_k for k ≤ j (and zeros below the diagonal). The columns of Q form an orthonormal basis, and for each column j,

(QR)[1:n][j] = Σ_{k=1}^{j} (a_j · e_k) e_k = a_j,

so QR = A.
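A minimal dense sketch of this process in C is shown below; it uses the modified Gram-Schmidt ordering of the updates, column-major storage, and no re-orthogonalization, and is purely illustrative:

#include <math.h>

static void gram_schmidt_qr(int n, double *A /* overwritten with Q */, double *R)
{
    for (int j = 0; j < n; j++) {
        double *aj = A + j * n;                        /* column j (becomes e_j)        */
        for (int k = 0; k < j; k++) {
            const double *ek = A + k * n;
            double dot = 0.0;                          /* R(k,j) = a_j . e_k            */
            for (int i = 0; i < n; i++) dot += aj[i] * ek[i];
            R[k + j * n] = dot;
            for (int i = 0; i < n; i++) aj[i] -= dot * ek[i];   /* u_j update           */
        }
        double nrm = 0.0;                              /* R(j,j) = ||u_j||              */
        for (int i = 0; i < n; i++) nrm += aj[i] * aj[i];
        nrm = sqrt(nrm);
        R[j + j * n] = nrm;
        for (int i = 0; i < n; i++) aj[i] /= nrm;      /* normalize: e_j = u_j / ||u_j|| */
    }
}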

Unlike the Givens rotation (Sect. 4.3) based or Householder reflection (Sect. 4.5) based

QR, the Gram-Schmidt orthogonalization based QR does require storing the orthogonal matrix

Q, because Q is explicitly constructed during the factorization.

One disadvantage of the Gram-Schmidt orthogonalization based QR is that it tends

to generate large amounts of fill-in entries, which makes it poorly suited to sparse QR factorizations.

4.3 Givens Rotation

The Givens rotation is a 2 × 2 orthogonal matrix

G = [ c s ; −s c ]   (c^2 + s^2 = 1)

that can be applied to a 2 × n matrix to zero out a selected entry [9]:

[ c s ; −s c ] [ 0 · · · 0 a_{i,j} · · · ; 0 · · · 0 a_{i′,j} · · · ] = [ 0 · · · 0 a′_{i,j} · · · ; 0 · · · 0 a′_{i′,j} · · · ]

To perform the Givens rotation based QR factorization, we let a′_{i′,j} = 0; then

−s a_{i,j} + c a_{i′,j} = 0,

c = a_{i,j} / sqrt(a_{i,j}^2 + a_{i′,j}^2),   s = a_{i′,j} / sqrt(a_{i,j}^2 + a_{i′,j}^2).
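A minimal C sketch of one such rotation, computing (c, s) and applying them to two rows of a dense row-major matrix, is shown below (illustrative only; the zero-norm guard is the only addition to the formulas above):

#include <math.h>

static void givens_zero(double *A, int n, int i1, int i2, int j)
{
    double a = A[i1 * n + j], b = A[i2 * n + j];
    double r = sqrt(a * a + b * b);
    double c = (r != 0.0) ? a / r : 1.0;       /* c = a_{i,j} / r                      */
    double s = (r != 0.0) ? b / r : 0.0;       /* s = a_{i',j} / r                     */
    for (int k = j; k < n; k++) {              /* columns left of j are already zero   */
        double x = A[i1 * n + k], y = A[i2 * n + k];
        A[i1 * n + k] =  c * x + s * y;        /* new row i1                           */
        A[i2 * n + k] = -s * x + c * y;        /* new row i2 (entry in column j -> 0)  */
    }
}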

The advantage of the Givens rotation based QR factorization (over Householder reflec-

tion) is that it can be implemented without square root operations. However, for a full matrix,

the Givens rotation based QR factorization requires 50% more floating-point operations than

its Householder reflection based counterpart [9].

It is believed that the Givens rotation based QR factorization provides better performance

than the Gram-Schmidt orthogonalization based QR and the Householder reflection based QR

when the matrix is very sparse. The Givens rotation based QR also usually generates fewer

intermediate fill-in entries than the latter two.

George et al. devised the first Givens rotation based QR factorization [15] [21]. Unlike

traditional QR factorizations based on Gram-Schmidt orthogonalization or Householder

reflection, which require access to the entire matrix during the factorization process, their

QR factorization algorithm can process the matrix row by row, which potentially reduces the

memory requirement. However, the proposed method requires more time to solve the linear least squares problem than the method based on the Cholesky factorization.

Gentleman et al. proposed to implement the Givens rotation based matrix triangulation

(including orthogonal triangulation, i.e., QR factorization) in the form of a triangular systolic

array [18]. A systolic array is a homogeneous network of tightly coupled data processing units.

To perform the QR factorization of an m × n matrix (m ≥ n), an n × n upper triangular

systolic array is needed. The triangular systolic array is composed of boundary (diagonal) cells


and internal (non-diagonal) cells. The internal cells mainly perform multiplications and additions, whereas the boundary cells mainly perform divisions and reciprocals.

McWhirter introduced an improved version [40] of Kung and Gentleman’s QR factoriza-

tion algorithm that computes the least squares residual more simply and directly, without having

to solve the corresponding triangular linear system. Their modified version is also more stable

and robust due to this property.

4.4 Blocked Givens Rotation

The blocked Givens rotation is an orthogonal matrix of the form G = [ C_1 S_1 ; S_2 C_2 ], where C_1 is p × p and C_2 is q × q.

Halleck proposed another way to represent G:

G = [ A  A B^T ; −C B  C ],

where A is p × p and C is q × q.

G G^T = [ A  A B^T ; −C B  C ] [ A^T  −B^T C^T ; B A^T  C^T ] = [ A A^T + A B^T B A^T  0 ; 0  C C^T + C B B^T C^T ]

G is orthogonal if and only if

A A^T + A B^T B A^T = I_p   and   C C^T + C B B^T C^T = I_q.

Assume A and C have full rank; then

I + B^T B = A^{-1} (A^T)^{-1} = A^{-1} (A^{-1})^T.


A^{-1} can be computed from the Cholesky factorization of I + B^T B. Since A^{-1} is lower triangular, A can then be easily obtained by solving A^{-1} A = I_p.

Similarly, C can be obtained from the Cholesky factorization of I + B B^T (which yields C^{-1}) and then solving C^{-1} C = I_q.

Let V be a block vector V = [ X ; Y ], where X has p rows and Y has q rows.

The blocked Givens rotation zeros out Y when

[ X′ ; O ] = G V = [ A  A B^T ; −C B  C ] [ X ; Y ] = [ A X + A B^T Y ; −C B X + C Y ],

that is, when

C (Y − B X) = O.

A solution to this equation may be obtained if we let B X = Y.

4.5 Householder Reflection

A Householder reflection is an orthogonal matrix of the form H = I − β v v^T, where β is a scalar and v is a column vector [9]. We assume that H is m × m and v is m × 1.

H H^T = I − 2β v v^T + β^2 ∥v∥^2 v v^T = I + β (β ∥v∥^2 − 2) v v^T

H is orthogonal if and only if H H^T = I, i.e., v = 0, β = 0, or β ∥v∥^2 = 2.

For an m × 1 column vector x, it is possible to find an m × m Householder reflection such that H x = [ ∥x∥ ; O ].

Proof:

When ∥x∥ = 0, the solution is trivial.

When ∥x∥ > 0,

H x = x − β (Σ_{i=1}^{m} v_i x_i) v,

so we require

β (Σ_{i=1}^{m} v_i x_i) v = x − [ ∥x∥ ; O ].

Let β ∥v∥^2 = 2; then

β |Σ_{i=1}^{m} v_i x_i| · sqrt(2/β) = ∥ x − [ ∥x∥ ; O ] ∥.

β can be solved from this equation, and v can consequently be solved. H is determined from β and v.

The QR factorization may be accomplished by applying a series of Householder transformations.

Let A_1 = A = [ a_1 · · · ], where a_1 is the first column of A_1. Find a Householder reflection H_1 such that H_1 a_1 = [ ∥a_1∥ ; O ]. Then

H_1 A_1 = [ r_{11}  R_{12} ; O  A_2 ],

where r_{11} is a scalar, R_{12} is 1 × (n − 1), and A_2 is (m − 1) × (n − 1).

A similar transformation can be applied to A_2, and iteratively to all A_j with j < n. The Householder reflections H_j take the form

H_j = I_{m+1−j} − β_j v_j v_j^T,

where I_{m+1−j} is the (m + 1 − j) × (m + 1 − j) identity matrix, β_j is a scalar, and v_j is an (m + 1 − j) × 1 column vector.

It is easy to see that the resulting matrix after the (n − 1)th iteration is upper triangular. Denote it by R.

Let U_j = [ I_{j−1}  O ; O  H_j ], where I_j is the j × j identity matrix, and let U = ∏_{j=m+1−n}^{m} U_{m+1−j}; then R = U A.

Let Q = U^T = ∏_{j=1}^{n} U_j; then Q is an orthogonal matrix and A = Q R.
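A minimal dense C sketch of this Householder QR is given below; following the observation above that Q need not be stored, each reflection is applied to the right-hand side b on the fly. It is illustrative only, with no blocking, pivoting, or sparsity handling, and assumes column-major storage with m >= n.

#include <math.h>
#include <stdlib.h>

static void householder_qr(int m, int n, double *A, double *b)
{
    double *v = malloc(m * sizeof *v);
    for (int j = 0; j < n; j++) {
        double *x = &A[j * m + j];
        int len = m - j;

        /* v_j = x - alpha e_1, with alpha = -sign(x_1) * ||x|| to avoid cancellation */
        double nrm = 0.0;
        for (int i = 0; i < len; i++) nrm += x[i] * x[i];
        nrm = sqrt(nrm);
        double alpha = (x[0] > 0.0) ? -nrm : nrm;
        v[0] = x[0] - alpha;
        for (int i = 1; i < len; i++) v[i] = x[i];

        double vtv = 0.0;
        for (int i = 0; i < len; i++) vtv += v[i] * v[i];
        double beta = (vtv == 0.0) ? 0.0 : 2.0 / vtv;     /* so that beta ||v||^2 = 2  */

        /* apply H_j = I - beta v v^T to the trailing columns of A and to b            */
        for (int k = j; k < n; k++) {
            double *a = &A[k * m + j], w = 0.0;
            for (int i = 0; i < len; i++) w += v[i] * a[i];
            for (int i = 0; i < len; i++) a[i] -= beta * w * v[i];
        }
        double w = 0.0;
        for (int i = 0; i < len; i++) w += v[i] * b[j + i];
        for (int i = 0; i < len; i++) b[j + i] -= beta * w * v[i];
    }
    free(v);
}

Afterwards the upper triangle of A holds R, and solving R(1:n,1:n) x = b(1:n) by back substitution yields the least squares solution.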

One advantage of the Householder reflection (over the Givens rotation) is that it is easier

to store the Householder reflection matrix [15], because the Householder reflection is generated

from a scalar β and a column vector v, so it is not necessary to store the entire Householder

reflection matrix.

Based on the Householder reflection are the left-looking QR and the right-looking QR.

The left-looking QR and the right-looking QR differ in the ordering of their workflow.

The right-looking QR applies each Householder reflection to the entire matrix. In a

right-looking QR algorithm, the matrix A will go through the following transformations:

A → U_1 A → U_2 U_1 A → · · · → (∏_{j=m+1−n}^{m} U_{m+1−j}) A = R

The right-looking QR is the foundation of the multifrontal method for QR factorization

[9].

The left-looking QR, however, only updates one column of the matrix during each iteration.

It is clear that R[1:m][j] = U A[1:m][j], and

R[1:m][j] = [ R[1:j][j] ; O ] = [ U[1:j][1:m] A[1:m][j] ; O ].

For the leading block of U,

U[1:j][1:m] = ((∏_{k=m+1−n}^{m−j} U_{m+1−k}) (∏_{k=m+1−j}^{m} U_{m+1−k}))[1:j][1:m]
            = ((∏_{k=m+1−n}^{m−j} [ I_{m−k}  O ; O  H_{m+1−k} ]) (∏_{k=m+1−j}^{m} U_{m+1−k}))[1:j][1:m]
            = (∏_{k=m+1−j}^{m} U_{m+1−k})[1:j][1:m]

The left-looking QR algorithm uses a lazy updating scheme and delays the application of the first (j − 1) Householder transformations to the jth column until the jth iteration, just

like the left-looking Cholesky.

Blocked forms of the Householder reflection based QR factorization also exist. Like the

blocked Cholesky, the blocked QR decomposition factorizes the matrix in terms of column

panels. The Householder reflections for the blocked QR factorization take the form

H = I − V B V^T,

where I is the n× n identity matrix, V is an n× k matrix (k is the width of the column panel),

and B is a k × k square upper triangular matrix.

Though it is generally believed that the Givens rotation method outperforms the House-

holder reflection method for QR factorizations of very sparse matrices, George et al. developed

a way of applying Householder reflections [22] which is competitive or superior to Givens rota-

tions even for sparse matrices. This algorithm is an extension of Liu’s row merging scheme for

sparse Givens transformations [35].

Davis developed a multithreaded sparse multifrontal QR factorization algorithm [10] that

is based on the Householder reflection.

Yeralan et al. extended the work of Davis and presented an enhanced version [61] of the

above algorithm that achieves higher performance with GPUs enabled. They also introduced

the bucket scheduler algorithm, which exploits parallelism between different rows during dense

QR factorizations.

4.6 The Multifrontal Sparse QR Factorization

This sub-chapter introduces SuiteSparse’s SPQR module [10] [61] from SuiteSparse [8],

which our work in this chapter is based on.


Sparse matrix factorizations differ from their dense counterparts in that the sparse

factorization algorithm may exploit the zeros in the matrix to reduce the total number of

floating-point operations. The factorization of a sparse matrix consists of the (symbolic)

analysis phase and the (numerical) factorization phase. During the analysis phase, the nonzero

pattern of the matrix is explored, and the symbolic structure of the matrix is represented by

the elimination tree [36]. The elimination tree is a key data structure of the sparse matrix

factorization. It not only provides the structural information of the matrix but also directs the

workflow of the factorization.

Figure 4-1. The elimination tree of a sparse matrix

Figure 4-2. A possible scheduling of fronts


Figure 4-3. Stages in the workflow

Figure 4-4. Stages in the elimination tree

Fig. 4-1 depicts a matrix’s symbolic pattern (x stands for nonzeros and fill-in) and its

elimination tree. In the multifrontal sparse QR factorization, the sparse matrix is divided into

multiple fronts. Each front is a dense matrix that will be factorized by dense QR factorization

algorithm. The nodes of the elimination tree each represent a front, and the data dependency

between the fronts is represented by the edges.

SPQR schedules the factorizations of the fronts according to the elimination tree so that

a front is not factorized until all of its children have been factorized. The factorization of each

front generates a contribution block, and prior to a front’s factorization, all contribution blocks

from its children must be assembled into the front.

Figure 4-5. Reducing PCIe communications with stages


Fig. 4-2 shows a possible scheduling of the fronts for the matrix in Fig. 4-1. Each front

must go through the assembly and the factorization. Fig. 4-6 shows the child fronts after their

factorizations and the assembly of the children’s contribution block into the parent.

Figure 4-6. The factorization and the assembly operations

In order to reduce the cost of PCIe communications, SPQR groups the fronts into stages.

A stage is a set of adjacent fronts in the scheduled workflow, as shown in Fig. 4-3. PCIe

transfer cost can be reduced when a stage contains fronts with data dependency (Fig. 4-4)

because, in this case, there will be no need to use the main memory as a temporary buffer to

hold the contribution blocks (Fig. 4-5).

4.7 The Arithmetic CUDA Kernels

SPQR essentially divides the sparse QR factorization into a number of assembly oper-

ations and dense QR factorizations. The dense QR factorizations are further reduced into

numerous even smaller dense QR factorizations and apply operations and are performed by

custom CUDA kernels.


Figure 4-7. Factorization of a front

Fig. 4-7 shows the QR factorization of a front, step by step. The tiles are of size 32× 32,

and due to the limit of the GPU’s shared memory, up to 96 rows are processed in each

factorization or application.

Consider a factorization and an application task that process an m × n submatrix of the

front. For simplicity, we assume m to be 96, 64, or 32 (edge cases are handled inside the

CUDA kernels), and n > 32 (otherwise, there is no application).

Figure 4-8. Factorization of a front

Let A be the leftmost 96 × 32 submatrix and B be the rest. The factorization and the application find the orthogonal matrix Q, the upper triangular matrix R, and S such that [A B] = Q [R S] (Fig. 4-8).

Using blocked Householder reflections, the factorize CUDA kernel finds the m × 32 lower trapezoidal matrix V containing the Householder vectors and the 32 × 32 lower triangular matrix T such that Q = I − V T V^T. Then, R = Q^T A = A − V T^T V^T A.

The apply operation is where the most floating-point operations happen, and it can

take around 80% of the total running time. The apply operation computes S = Q^T B = B − V T^T V^T B.


The apply CUDA kernel utilizes the shared memory of the GPU to speed up the com-

putation. The shared memory is a region of on-chip memory that is shared among all CUDA

threads of the same block. Access to the shared memory is much faster than global memory

access (shared memory latency is roughly 100 times lower than uncached global memory);

therefore, the shared memory is suitable for data that will be frequently reused.

On NVIDIA K40m GPUs, each block has 48KB shared memory. We consider a 32×32 tile

of matrix entries, with each entry stored in type double (8 bytes). Because (48 × 1024)/(8 ×

(32 × 32)) = 6 and considering that we need some extra space for padding (to prevent shared

memory bank conflicts), we can safely assume that only 5 tiles of data can be stored in the

shared memory at any time. SPQR allocates space in the shared memory for two matrices of

type double: the 97× 32 ”VT tile” and the 32× 64 ”C”.

Figure 4-9. The VT tile

Prior to the actual matrix multiplication, the matrices V and T are copied from the

GPU’s global memory to the VT tile. Because V is lower trapezoidal and T is lower triangular,

they can be stored in the form of V and T^T, as shown in Fig. 4-9.

The apply kernel divides the matrix B into multiple submatrices with at most 64 columns.

We denote them by B1, B2, · · · , Bk. Then, for each i from 1 to k, the apply kernel does the

following:

• Compute C_i = V^T B_i, and write C_i to C
• Compute Z_i = T^T C_i, and write Z_i to C
• Compute S_i = B_i − V Z_i, and write S_i to the global memory
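The three steps can be written as three matrix-matrix products; the host-side sketch below spells them out for one panel B_i (gemm() is a hypothetical BLAS-like helper, and a separate workspace Z is used instead of overwriting C, unlike the actual shared-memory kernel):

enum { TILE = 32 };

/* hypothetical helper: C := alpha * op(A) * op(B) + beta * C, column-major */
extern void gemm(char transa, char transb, int m, int n, int k,
                 double alpha, const double *A, int lda,
                 const double *B, int ldb, double beta, double *C, int ldc);

void apply_panel(int m, int nb, const double *V /* m x 32 */, const double *T /* 32 x 32 */,
                 double *B /* m x nb, overwritten with S */,
                 double *C, double *Z /* each 32 x nb workspace */)
{
    gemm('T', 'N', TILE, nb, m,    1.0, V, m,    B, m,    0.0, C, TILE); /* C = V^T B   */
    gemm('T', 'N', TILE, nb, TILE, 1.0, T, TILE, C, TILE, 0.0, Z, TILE); /* Z = T^T C   */
    gemm('N', 'N', m,    nb, TILE, -1.0, V, m,   Z, TILE, 1.0, B, m);    /* S = B - V Z */
}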

It is easy to see that the above is not the optimal method for multiplying the matrices.

We assume that a time t is required to compute the product of two 32 × 32 matrices.


When V is (p × 32)-by-32 and B is (p × 32)-by-(q × 32), the total running time of the above

method is (pq + q + pq)t = (2pq + q)t.

When p ≤ 2, we optimize the multiplications so that they are performed in the following

way:

• Compute U = V T, and write U to the VT tile
• For each i from 1 to k:
  – Compute Z_i = U^T B_i, and write Z_i to C
  – Compute S_i = B_i − V Z_i, and write S_i to the global memory

Figure 4-10. U and V in the VT tile

Note that U is also (p × 32)-by-32 lower trapezoidal; therefore, it can coexist with V in

the VT tile only if p ≤ 2. Fig. 4-10 shows the layout of U and V in the VT tile.

This method runs in pt + (pq + pq)t = (2pq + p)t time. When q is large enough, it

is faster than the original implementation. When p = 2 and q is very large, this method is

approximately 25% faster than the original implementation.

When p = 1, the multiplications can be further simplified into the following:

• Compute U = V T, and write U to the VT tile
• Compute Q = I − U V^T, and write Q to the VT tile
• For each i from 1 to k:
  – Compute S_i = Q^T B_i, and write S_i to the global memory

Figure 4-11. Q, U and V in the VT tile

When p = 1, the VT tile is spacious enough to hold Q, U , and V at the same time (Fig.

4-11).


This method runs in (q + 2)t time. When q is very large, it is approximately 200% faster

than the original implementation.

Despite the theoretical improvements mentioned above, this optimization only works

when the apply task has no more than 64 rows. Because the algorithm should try to process as many rows as possible in each task to maximize concurrency, the improvement is only possible in edge cases, and the majority of apply tasks do not benefit from this optimization.

4.8 Pipelining CUDA Kernels and Device-to-Host Transfers

Figure 4-12. Pipelining the CUDA kernel runs and device-to-host transfers

Figure 4-13. Pinned host memory used as a buffer

Figure 4-14. Pipelining the factorization of stages and the buffer flushing

SPQR groups fronts in stages in order to reduce PCIe communications. Each stage

consists of multiple fronts that are present in the GPU memory at the same time. Because we

expect data dependency between fronts in the same stage, the factorizations of the fronts in a

stage are usually not entirely parallel.


The algorithm runs a scheduler that loops until the entire stage has been factorized. In

each loop, the scheduler picks fronts that are ready for factorization, runs the appropriate

CUDA kernels (in batch), and then updates the fronts’ statuses.

In the original implementation of SPQR, the factorization of a stage can be summarized

as three phases:

• Copy the fronts’ data from the main memory to the GPU memory

• Run the CUDA kernels

• Copy the factorized fronts’ data and the contribution blocks from the GPU memory to

the main memory

The CUDA kernel execution is the most time-consuming of all three. Because the

contribution blocks tend to be much larger than the fronts themselves, the copyback also

takes a considerable amount of time. The first phase, host-to-device copy, is the least time-

consuming of all three.

We reduce the total factorization time of the stage by overlapping the CUDA kernel

executions and the copyback of data. The original implementation of SPQR defers the

copyback until the end of the last CUDA kernel. However, because the GPU and the PCIe bus

can work at the same time, the CUDA kernel executions and the copyback can run in parallel.

The pipeline was implemented by using an extra CUDA stream to allow asynchronous

execution of the copyback. At the end of the scheduler loop, the status of each active (being

factorized) front is queried. If the front’s factorization is complete, an asynchronous device-to-

host transfer is initiated on the designated CUDA stream. The transfer will run independently

of future scheduling and CUDA kernel calls. After all fronts in the stage have been processed,

the function cudaStreamSynchronize is called and blocks until all copybacks are finished.
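The sketch below illustrates this pattern (C with the CUDA runtime API; front_t and launch_ready_kernels are illustrative placeholders, not SPQR's actual data structures or scheduler):

#include <cuda_runtime.h>

typedef struct { int done; size_t bytes; void *dev, *pinned; } front_t;

extern void launch_ready_kernels(front_t *fronts, int nf, cudaStream_t kernel_stream);

void factorize_stage(front_t *fronts, int nf,
                     cudaStream_t kernel_stream, cudaStream_t copy_stream)
{
    int remaining = nf;
    while (remaining > 0) {
        /* scheduler picks ready fronts and launches the batched kernels; it is
         * assumed to set done = 1 on fronts whose factorization has completed   */
        launch_ready_kernels(fronts, nf, kernel_stream);

        for (int i = 0; i < nf; i++) {
            if (fronts[i].done == 1) {               /* factorization finished   */
                cudaMemcpyAsync(fronts[i].pinned, fronts[i].dev, fronts[i].bytes,
                                cudaMemcpyDeviceToHost, copy_stream);
                fronts[i].done = 2;                  /* copyback now in flight   */
                remaining--;
            }
        }
    }
    cudaStreamSynchronize(copy_stream);              /* wait for all copybacks   */
}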

Fig. 4-12 compares the processing of a stage with and without the pipeline. The pipeline

partially hides the PCIe transfers behind the on-GPU computations and saves total factoriza-

tion time.


4.9 Pipelining GPU Workload and CPU Workload

Another layer of pipelining is possible, not within the stages, but at the level of whole stages.

The copyback of factorized front data does not put the data in their final destination. For

the efficiency of the transfer, the data are first put in a pinned host memory buffer and must

be copied to their designated main memory region (pageable memory) later (Fig. 4-13).

Since the processing of the stages is mostly done by the GPU (CUDA kernel executions)

and the DMA engine (device-to-host transfers), it can be run in parallel with the flushing of

the pinned host memory buffer.

The pipeline was implemented using OpenMP. For a sparse QR factorization with k

stages, the algorithm creates a for loop ranging from 0 to k. Within each loop iteration, two

threads work in parallel. One thread factorizes a stage and copies the results back to the

pinned host memory buffer. At the same time, the other thread flushes the buffer that contains the results of the previously factorized stage.

Figure 4-15. A secondary pinned host memory buffer to avoid data conflict

Fig. 4-14 and Fig. 4-15 show the mechanism of the pipeline. To avoid conflicts, two

pinned host memory buffers are allocated, and the DMA engine and the CPU will switch

between them.
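A minimal sketch of this double-buffered, stage-level pipeline is shown below (C with OpenMP; the helper names are hypothetical): while the GPU and the DMA engine process stage i into buffer i mod 2, the CPU flushes buffer (i − 1) mod 2, which holds the results of stage i − 1.

#include <omp.h>

extern int  num_stages;
extern void factorize_stage_to_buffer(int stage, void *pinned_buf); /* GPU + DMA engine */
extern void flush_buffer_to_pageable(int stage, void *pinned_buf);  /* CPU              */
extern void *pinned_buf[2];

void factorize_all_stages(void)
{
    for (int i = 0; i <= num_stages; i++) {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            {   /* GPU side: factorize stage i (the extra last iteration is skipped) */
                if (i < num_stages)
                    factorize_stage_to_buffer(i, pinned_buf[i % 2]);
            }
            #pragma omp section
            {   /* CPU side: flush the results of the previous stage                 */
                if (i > 0)
                    flush_buffer_to_pageable(i - 1, pinned_buf[(i - 1) % 2]);
            }
        }   /* implicit barrier: both halves finish before the next iteration       */
    }
}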

Fig. 4-16 shows the comparison between algorithms with and without the stage-level

pipeline. The sparse matrix used in this case is Flan_1565 from the SuiteSparse Matrix

Collection [13]. The pipeline has effectively hidden the buffer flushing behind the processing of

stages, eliminating most of the GPU’s idle time between stages.


Figure 4-16. Comparison between sparse QR algorithm with / without stage-level pipeline

4.10 Experiment Results

The experiments were carried out on a platform with an Intel Xeon E5-2667v4 CPU and an

NVIDIA Titan V GPU. SPQR has no support for multiple GPUs, therefore only one GPU was

used. The test-case matrices (Table 4-1) were from the SuiteSparse Matrix Collection [13].

Table 4-1. Test matrices used for QR
matrix        problem type                            dimension   nonzeros
Flan_1565     structural                              1,564,794   114,165,372
Freescale1    circuit simulation                      3,428,755   17,052,626
H2O           theoretical/quantum chemistry problem      67,024   2,216,736
bundle_adj    computer vision                           513,351   20,207,907
circuit5M_dc  circuit simulation                      3,523,317   14,865,409
hood          structural                                220,542   9,895,422
nd24k         2D/3D problem                              72,000   28,715,634

Table 4-2 lists the performance of SPQR on each of the matrices, before and after the

optimization. The table also lists the total number of floating point operations involved in the

factorization.

It can be seen from Fig. 4-17 that the combination of two layers of pipelines and the

reimplemented apply kernels achieves up to 43.92% improvement in performance vs. the

original SPQR.

We also conducted experiments to measure the energy and power cost. From Fig. 4-19

and Fig. 4-20 we see that after our optimizations, the total energy cost of the QR factorization

is reduced by up to 35.95%.


Table 4-2. QR experiment results
matrix        flop count   Gflops before optimization   Gflops after optimization
Flan_1565     3.28E+14      920.94                      1283.26
Freescale1    2.11E+11       26.23                        36.90
H2O           4.22E+13      982.50                      1345.55
bundle_adj    3.12E+13      524.96                       622.73
circuit5M_dc  2.21E+11       27.90                        40.15
hood          6.43E+11      133.43                       114.75
nd24k         3.30E+13     1014.49                      1252.21

Fig. 4-21 and Fig. 4-22 show a comparison in average power before and after the

optimization. The power is reduced by up to 12.45%, but there is not always a notable

reduction in average power. In most of our experiments, the change in average power was less

than 4%. The total energy cost was reduced mainly due to the reduced factorization time.

Figure 4-17. Performance comparison between algorithm before and after optimizations

4.11 Conclusions

In this chapter, we present several optimizations for SPQR, a multifrontal sparse QR fac-

torization algorithm. We first optimize the edge cases of the apply CUDA kernels, reducing the


Figure 4-18. Relationship between flop count and improvement in performance

Figure 4-19. Energy consumed by the GPU in factorization (large matrices)


Figure 4-20. Reduction in energy consumption after the optimization

Figure 4-21. Average power of the GPU in factorization


Figure 4-22. Reduction in average power after the optimization

total amount of floating-point operations when the apply task is small. Then, we implement

a pipeline that overlaps the CUDA kernel executions and device-to-host transfers, hiding PCIe

transfers behind on-GPU computations and reducing the time for factorizing a stage. We also

implement another pipeline that parallelizes the factorization of stages and the flushing of the

pinned host memory buffers.

The experimental results showed good improvement in performance when the sparse

matrix is large and the total flop count is large enough, achieving up to a 23% increase in

performance.

Energy cost is also reduced for large matrices, up to 14.58%.

However, due to the extra cost introduced, such as the allocation of additional pinned host

memory, our optimizations do not work well if the matrix is too small. In our experiments,

the performance after optimization may decrease by up to 7.26% for small matrices, and the energy consumption may increase by up to 16.82%.


We may combine the advantages of our algorithm and the original SPQR by adding a

switch in our code and selecting whether to apply our changes, according to the size of the

matrix.


CHAPTER 5
SPARSE LU FACTORIZATION

The LU factorization is a decomposition of a square matrix into the product of a lower tri-

angular matrix and an upper triangular matrix. Given a square matrix A, the LU factorization

is to find a lower triangular matrix L and an upper triangular matrix U such that A = LU .

The LU factorization is usually used for square unsymmetric matrices.

Like the Cholesky factorization, the LU factorization may also be applied in solving

systems of linear equations. Let A be a square matrix; the equation Ax = b can be solved

with the help of the LU factorization. If A = LU where L is a lower triangular matrix and U

is an upper triangular matrix, then we have LUx = b. This equation can be solved by solving

Ly = b and Ux = y.

For the LU factorization, there exist the left-looking LU and the right-looking LU [9]. The

supernodal method and the multifrontal method are also available to the LU factorization.

5.1 Left-looking LU

The left-looking LU is an iterative method which solves L and U one column at a time.

Let

L = [ L_{11} 0 0 ; l_{21} l_{22} 0 ; L_{31} l_{32} L_{33} ]

and

U = [ U_{11} u_{12} U_{13} ; 0 u_{22} u_{23} ; 0 0 U_{33} ],

where the block column [ L_{11} ; l_{21} ; L_{31} ] and the block [ U_{11} ] represent the columns already computed.

Since

[ A_{11} a_{12} A_{13} ; a_{21} a_{22} a_{23} ; A_{31} a_{32} A_{33} ] = [ L_{11} 0 0 ; l_{21} l_{22} 0 ; L_{31} l_{32} L_{33} ] [ U_{11} u_{12} U_{13} ; 0 u_{22} u_{23} ; 0 0 U_{33} ],

we have:

u_{12} = L_{11}^{-1} a_{12}
u_{22} = (a_{22} − l_{21} u_{12}) / l_{22}
l_{32} = (a_{32} − L_{31} u_{12}) / u_{22}

If we let l_{22} = 1 (assume that L has unit diagonal), then:

u_{12} = L_{11}^{-1} a_{12}
u_{22} = a_{22} − l_{21} u_{12}
l_{32} = (a_{32} − L_{31} u_{12}) / u_{22}

5.2 Right-looking LU

The right-looking LU, like the right-looking Cholesky, is very similar to the left-looking LU, except for the ordering of the workflow.

Let

L = [ l_{11} 0 ; l_{21} L_{22} ]   and   U = [ u_{11} u_{12} ; 0 U_{22} ].

Since

[ a_{11} a_{12} ; a_{21} A_{22} ] = [ l_{11} 0 ; l_{21} L_{22} ] [ u_{11} u_{12} ; 0 U_{22} ],

we have:

l_{11} u_{11} = a_{11}
l_{21} u_{11} = a_{21}
l_{11} u_{12} = a_{12}
l_{21} u_{12} + L_{22} U_{22} = A_{22}

Note that a_{11}, l_{11}, u_{11} are scalars; therefore

L_{22} U_{22} = A_{22} − l_{21} u_{12} = A_{22} − a_{21} a_{12} / a_{11}.

L_{22} and U_{22} can be solved recursively with the (right-looking) LU factorization of A_{22} − a_{21} a_{12} / a_{11}.

If we let l_{11} = 1 (assume that L has unit diagonal), then

u_{11} = a_{11}
l_{21} = a_{21} / u_{11}
u_{12} = a_{12}

The right-looking LU factorization is essentially the Gaussian elimination of the matrix A.

The transformation from A to U consists of a series of row transformations. Denote those row transformations by M_1, M_2, · · · , M_n, with

M_j = [ I_{j−1} O ; O N_j ],

where I_{j−1} is the (j − 1) × (j − 1) identity matrix and N_j is an (n + 1 − j) × (n + 1 − j) lower triangular matrix.

For every j, M_j is an n × n lower triangular matrix. Let

M = ∏_{j=1}^{n} M_{n+1−j};

then M is an n × n lower triangular matrix and M A = U.

Let L = M^{-1}; then L is lower triangular and A = L U.
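A minimal dense C sketch of the right-looking LU described above is given below; it uses no pivoting (the need for pivoting is discussed later in this chapter), column-major storage, and stores L (unit diagonal, implicit) below the diagonal and U on and above it:

static void lu_right_looking(int n, double *A)
{
    for (int j = 0; j < n; j++) {
        double piv = A[j + j * n];                 /* u_{11} = a_{11}                 */
        for (int i = j + 1; i < n; i++)
            A[i + j * n] /= piv;                   /* l_{21} = a_{21} / u_{11}        */
        for (int k = j + 1; k < n; k++)            /* trailing rank-1 update:          */
            for (int i = j + 1; i < n; i++)        /*   A_{22} -= a_{21} a_{12} / a_{11} */
                A[i + k * n] -= A[i + j * n] * A[j + k * n];
    }
}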


The right-looking LU factorization also shares some similarities with the Householder

reflection based QR. The difference is that the matrices corresponding to the row transforma-

tions in the LU factorization are lower triangular matrices, while in the QR factorization they

are orthogonal matrices.

The right-looking LU forms the basis of the multifrontal LU algorithms.

5.3 The Supernodal Method

Demmel et al. implemented SuperLU [16], a supernodal LU factorization library.

SuperLU is based on sparse Gaussian elimination, and consists of three modules: SuperLU (the

sequential supernodal LU library, left-looking), SuperLU_MT (multithreaded supernodal LU on shared-memory parallel machines, left-looking), and SuperLU_DIST (distributed supernodal LU, right-looking). SuperLU_MT is very similar to SuperLU from the users' and the algorithm's

point of view [32].

Unlike the Cholesky factorization, the LU factorization does not guarantee any relation,

whether numerical or topological (in terms of nonzero entries), between L and U . Therefore

SuperLU utilizes the idea of unsymmetric supernodes [15]. The columns in L can be grouped,

like in the supernodal Cholesky factorization, according to L’s columns’ nonzero patterns, but

SuperLU only stores L in the supernodal format, while U is stored in the compressed column

form.

Schenk et al. implemented a supernodal LU algorithm that combines the workflow

ordering of both left-looking LU and right-looking LU (PARDISO[52]). PARDISO can be

viewed as numerically left-looking but symbolically right-looking: descendants of a supernode

are not assembled until right before the factorization of the supernode (numerically left-

looking), but as soon as a supernode is factorized, all of its ancestors are notified of its

completion (symbolically right-looking).

5.4 The Multifrontal Method

The multifrontal LU is based on the right-looking LU, and works well on distributed

memory platforms.


The workflow of the multifrontal LU depends on the assembly tree. The assembly tree

is essentially an elimination tree for multifrontal methods, and is the output of the symbolic

analysis phase.

Each node in the assembly tree corresponds to a stage and a frontal matrix, and a frontal

matrix cannot be factorized until all of its descendants have been factorized and their contribution

blocks assembled.

In this section we will first look at the first stage of the multifrontal LU:

Let

L = [ L_{11} 0 ; L_{21} L_{22} ]   and   U = [ U_{11} U_{12} ; 0 U_{22} ].

The dimensions of L_{11}, L_{21}, U_{11}, U_{12} are determined through amalgamation [11].

Since

[ A_{11} A_{12} ; A_{21} A_{22} ] = [ L_{11} 0 ; L_{21} L_{22} ] [ U_{11} U_{12} ; 0 U_{22} ],

we have:

L_{11} U_{11} = A_{11}
L_{21} U_{11} = A_{21}
L_{11} U_{12} = A_{12}
L_{21} U_{12} + L_{22} U_{22} = A_{22}

First we solve for L_{11} and U_{11} with a dense LU factorization; then L_{21} and U_{12} can be obtained by solving L_{21} = A_{21} U_{11}^{-1} and U_{12} = L_{11}^{-1} A_{12}.

C = − L_{21} U_{12}


C is the contribution block generated by the stage. If the factorization is done on a distributed-memory platform, it is very likely that A_{22} is stored separately, and C needs to be transferred to the ancestor nodes for the assembly operations in other stages.
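The sketch below illustrates one such stage on a dense frontal matrix (plain C, column-major, no pivoting; the partitioning into a k × k leading block and an r × r trailing block, and the function name, are illustrative rather than taken from a particular solver):

static void frontal_stage(int k, int r, double *F)
{
    int n = k + r;                               /* the frontal matrix F is n x n        */

    /* factor A11 = L11 U11 in place (right-looking, no pivoting)                        */
    for (int j = 0; j < k; j++) {
        for (int i = j + 1; i < k; i++) F[i + j * n] /= F[j + j * n];
        for (int c = j + 1; c < k; c++)
            for (int i = j + 1; i < k; i++)
                F[i + c * n] -= F[i + j * n] * F[j + c * n];
    }

    /* L21 = A21 U11^{-1}: solve L21 U11 = A21, one column of L21 at a time               */
    for (int j = 0; j < k; j++)
        for (int i = k; i < n; i++) {
            double s = F[i + j * n];
            for (int p = 0; p < j; p++) s -= F[i + p * n] * F[p + j * n];
            F[i + j * n] = s / F[j + j * n];
        }

    /* U12 = L11^{-1} A12: forward substitution (L11 has unit diagonal)                   */
    for (int j = k; j < n; j++)
        for (int i = 0; i < k; i++) {
            double s = F[i + j * n];
            for (int p = 0; p < i; p++) s -= F[i + p * n] * F[p + j * n];
            F[i + j * n] = s;
        }

    /* trailing block update: the contribution block C = -L21 U12 accumulated into A22    */
    for (int j = k; j < n; j++)
        for (int i = k; i < n; i++)
            for (int p = 0; p < k; p++)
                F[i + j * n] -= F[i + p * n] * F[p + j * n];
}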

The LU factorization may use pivoting [47]. Pivoting is not mandatory, but it can increase

the numerical stability of the factorization [57]. Numerical instability can be caused by division by zero or by small pivot entries. Reid gave examples showing how pivoting can solve these

problems.

For zero pivots:

Let A = [ 0 1 ; 1 1 ] = [ 1 0 ; l_{21} 1 ] [ u_{11} u_{12} ; 0 u_{22} ]; then we have u_{11} = 0 and l_{21} u_{11} = 1. This results in a divide-by-zero error.

If we let P = [ 0 1 ; 1 0 ] and instead solve P A = L U, then we have L = [ 1 0 ; 0 1 ] and U = [ 1 1 ; 0 1 ].

For small pivots:

Let A = [ 10^{−20} 1 ; 1 1 ]; then L = [ 1 0 ; 10^{20} 1 ] and U = [ 10^{−20} 1 ; 0 1 − 10^{20} ]. This time we do not have a divide-by-zero error, but the floating-point number (1 − 10^{20}) may be represented inaccurately and rounded to −10^{20}; therefore what we actually get is U = [ 10^{−20} 1 ; 0 −10^{20} ], and L U = [ 10^{−20} 1 ; 1 0 ] ≠ A.

If we let P = [ 0 1 ; 1 0 ] and solve P A = L U, then L = [ 1 0 ; 10^{−20} 1 ] and U = [ 1 1 ; 0 1 − 10^{−20} ]. The equation is satisfied even if the numbers are rounded.

There exist partial pivoting, complete pivoting, and rook pivoting.


Partial pivoting is performed by doing a row exchange in each elimination step so that the first entry in the current column is exchanged with the largest entry (in magnitude) in the column. Each exchange corresponds to a permutation matrix P_i, and the combined permutation matrix is P = P_{n−1} P_{n−2} . . . P_2 P_1. In an LU factorization with partial pivoting, instead of solving A = L U, we solve P A = L U.
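A minimal dense C sketch of right-looking LU with partial pivoting is shown below; it extends the earlier no-pivoting sketch with a row exchange per step, and the hypothetical array perm[] records the row permutation P:

#include <math.h>

static void lu_partial_pivoting(int n, double *A, int *perm)
{
    for (int i = 0; i < n; i++) perm[i] = i;
    for (int j = 0; j < n; j++) {
        int p = j;                                   /* find the largest entry (in       */
        for (int i = j + 1; i < n; i++)              /* magnitude) in column j           */
            if (fabs(A[i + j * n]) > fabs(A[p + j * n])) p = i;
        if (p != j) {                                /* exchange rows j and p of A       */
            for (int k = 0; k < n; k++) {
                double t = A[j + k * n];
                A[j + k * n] = A[p + k * n];
                A[p + k * n] = t;
            }
            int t = perm[j]; perm[j] = perm[p]; perm[p] = t;
        }
        for (int i = j + 1; i < n; i++)              /* multipliers                      */
            A[i + j * n] /= A[j + j * n];
        for (int k = j + 1; k < n; k++)              /* trailing update                  */
            for (int i = j + 1; i < n; i++)
                A[i + k * n] -= A[i + j * n] * A[j + k * n];
    }
}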

Complete pivoting is performed by doing both a row exchange and a column exchange in each elimination step so that the leading entry of the remaining matrix is exchanged with the largest entry in the remaining matrix. Complete pivoting can be represented by P A Q = L U, where P = P_{n−1} P_{n−2} . . . P_2 P_1 and Q = Q_1 Q_2 . . . Q_{n−2} Q_{n−1}.

The rook pivoting is similar to the complete pivoting except that instead of choosing the

largest entry, it chooses an entry that is the largest in its own row and column.

Whether or not to use pivoting will affect the behavior of the LU algorithm, both in the

symbolic analysis and in the numerical factorization.

When partial pivoting is used, the LU algorithm allows arbitrary row exchanges during the factorization. In this case, the LU factorization is more closely related to the QR factorization [15], as both of them eliminate entries below the diagonal. If the LU algorithm allows arbitrary row and column exchanges during the factorization, then a prior symbolic analysis is not possible

[15].

In cases where pivoting is not required, the nonzero pattern of L and U can be computed

as follows:

• If the matrix A has a symmetric nonzero pattern, then the nonzero pattern of L is that of the symbolic Cholesky factorization of A, and the nonzero pattern of U is identical to that of L^T.
• When the nonzero pattern of A is unsymmetric, we may let L have the same nonzero pattern as the symbolic Cholesky factorization of A + A^T, with the nonzero pattern of U identical to that of L^T. In this case some zero entries in A are considered logically nonzero.


Amestoy et al. implemented the parallel multifrontal solver MUMPS [3], which was part

of the project PARASOL [1]. MUMPS is a fully asynchronous algorithm with dynamic data

structures and distributed dynamic scheduling of tasks.

Davis et al. implemented the multifrontal sparse LU factorizer UMFPACK [12], with

the goal of achieving high performance by using the level 3 BLAS. Instead of an assembly

tree, UMFPACK guides the workflow with an assembly DAG [26]. UMFPACK features a

dynamic analyze-factorize phase. Since the structure of the assembly DAG is not known prior to factorization, the assembly DAG is constructed dynamically during the analyze-factorize phase. The assembly DAG is then used in a subsequent factorize-only phase.

5.5 Implementation of a Supernodal Sparse LU Algorithm

We implemented a highly efficient supernodal sparse LU algorithm that has the following

features:

1. The algorithm uses blocking to compute multiple columns in each iteration. This approach is more amenable to GPUs than non-blocked factorization algorithms. When compared to multifrontal algorithms, a supernodal algorithm uses less memory, due to the smaller sizes of the contribution blocks. Our implementation uses highly tuned CUDA libraries such as cuBLAS and cuSOLVER as building blocks.

2. The algorithm can use multiple GPUs and multiple CPU cores when available. If only one GPU is available, the algorithm divides the GPU memory into multiple regions and utilizes CUDA streams to function as if there were multiple GPUs, so that parallelism is still exploited.

3. The algorithm uses pipelining to overlap the PCIe communication and the on-GPU computation, reducing the overall factorization time.

4. The algorithm supports batched factorization, where multiple sparse matrices can be factorized simultaneously and the factorizations are interleaved.

We refer to this algorithm as "NLU", where "N" stands for "node", indicating supernodes. NLU outperforms GLU [45], UMFPACK, KLU [14, 41], and SuperLU by several orders of

magnitude when the sparse matrix is large enough, and can handle matrices whose size is

beyond the capability of the above sparse LU solvers. The current implementation of NLU does

not support pivoting, but will be updated to include it in future versions.


5.5.1 Data Representation

The algorithm requires the storage of key information, including platform-specific informa-

tion, GPU-specific information, and matrix-specific information.

The platform-specific information is stored globally in a data structure of type struct common_info_struct. It includes the number of GPUs, the size of available GPU memory, the

size of required pinned host memory (which is proportional to the size of the GPU memory),

the number of threads for batched factorization, etc. This data structure is initialized at the

beginning of the program, together with the allocation of the GPU memory and the pinned

host memory, and persists through all the factorizations, until it is freed at the end of the

program. The platform-specific information does not change after initialization.

struct common_info_struct
{
    int numGPU;             // number of GPUs
    size_t minDevMemSize;   // GPU memory size
    size_t minHostMemSize;  // pinned host memory size
    int matrixThreadNum;    // factorization batch size
    int numSparseMatrix;    // number of sparse matrices
    ...
};

5.5.1.1 GPU information

The information of each GPU, mainly the CUDA streams, the cuBLAS handles, and the pointers to the allocated GPU memory and the host memory, is stored in a data structure of type struct gpu_info_struct. The data structure contains a pointer to the allocated GPU memory, whose size is slightly smaller than the maximum available memory size of the GPU, and a pointer to a piece of pinned host memory (which is part of the main memory). The pinned host memory is mandatory for asynchronous data transfer between the GPU memory and the main memory, because the GPU is not able to access regular (i.e., pageable) main memory. The allocation of pinned host memory is very time-consuming; therefore, we set its size to roughly that of the allocated GPU memory and use it as a buffer.


struct gpu_info_struct also contains an OpenMP lock for multithreading support. A thread must reserve a GPU through the OpenMP lock before it actually starts working on a supernode.

struct gpu_info_struct
{
    omp_lock_t gpuLock;
    void *devMem;   // GPU memory (device memory)
    void *hostMem;  // pinned host memory buffer
    ...
};

In a system with K GPUs, an array of K struct gpu_info_struct objects is used

to store the information of all the GPUs. The array is initialized at the beginning of the

algorithm, and is freed after all factorizations are completed.

5.5.1.2 Matrix information

The data of the sparse matrices, including the entry values, the symbolic structure, the

factorization result, and the runtime status, are stored in data structures of type struct

matrix_info_struct. An array of T struct matrix_info_struct objects is used in our algorithm, and each object corresponds to a top-level thread that we call a "matrix thread". Each matrix thread repeatedly reads a sparse matrix's file path from the command-line input, reads the matrix from the file, and performs the factorization and the result validation, until the input is exhausted.

NLU reads sparse matrices in triplet form. The triplet form stores the sparse matrix with a

list of triplets in the form of {i, j, x}, where i is the row index, j is the column index, and x is

the entry value.

The sparse matrix is then transformed into the compressed column form. Let n be

the number of columns of the matrix, and nnz be the number of nonzero entries, then the

compressed column form stores the matrix with three arrays: Ap, Ai, and Ax, where Ap is of

size (n + 1), and Ai and Ax are both of size nnz. An entry {i, j, x} exists in the matrix if and

only if there is an integer p such that Ap[j] ≤ p < Ap[j + 1], Ai[p] = i, and Ax[p] = x. The


compressed column form is required for fill-reducing permutation algorithms such as AMD [2]

and METIS [29].
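The conversion from the triplet form to the compressed column form can be done in O(n + nnz) time. A minimal sketch is shown below (it is illustrative, not the actual NLU code; duplicate entries are not summed and the row indices within a column are left in order of appearance):

#include <stdlib.h>

/* Convert a triplet list (Ti, Tj, Tx) with nnz entries into compressed
 * column form (Ap, Ai, Ax) for an n-column matrix. */
void triplet_to_csc(int n, int nnz,
                    const int *Ti, const int *Tj, const double *Tx,
                    int *Ap, int *Ai, double *Ax)
{
    int *count = (int *)calloc(n, sizeof(int));
    for (int k = 0; k < nnz; k++) count[Tj[k]]++;       /* entries per column */
    Ap[0] = 0;
    for (int j = 0; j < n; j++) Ap[j + 1] = Ap[j] + count[j];
    int *next = (int *)malloc(n * sizeof(int));
    for (int j = 0; j < n; j++) next[j] = Ap[j];        /* insertion cursors  */
    for (int k = 0; k < nnz; k++) {
        int p = next[Tj[k]]++;                          /* slot in column Tj[k] */
        Ai[p] = Ti[k];
        Ax[p] = Tx[k];
    }
    free(count);
    free(next);
}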

The symbolic analysis creates the elimination tree. The elimination tree is stored with

the left-child-right-sibling representation. The symbolic patterns of the supernodes, including their dimensions and their column and row indices, are also computed along with the elimination tree. The mapping between the sparse matrix and the supernodes is also computed, and stored in the struct matrix_info_struct object.

During the numeric factorization, we need to perform dense matrix arithmetic on the supernodes. We store the supernodes in column-major form so that they can be passed directly to the dense linear algebra routines provided by BLAS and cuBLAS. An (m × n) dense matrix A can be stored in an array of size (d × n), where d ≥ m is the leading dimension and Ai,j is stored at index (j × d + i).
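A small illustration of this indexing convention (not part of the original text):

#include <stddef.h>

/* Column-major indexing: with leading dimension d (d >= m), entry A(i, j)
 * of an m-by-n matrix lives at flat index j*d + i. */
#define IDX(i, j, d) ((size_t)(j) * (size_t)(d) + (size_t)(i))

/* usage (illustrative): double aij = A[IDX(i, j, d)];  A[IDX(i, j, d)] += x; */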

Figure 5-1. Supernode

In NLU, a supernode of dimension (m × n) is composed of an L part of dimension

(m×n) and a U part of dimension ((m−n)×n). For a supernode shown in Fig. 5-1, we have

n = j1 − j0 and m = i1 − i0. We store the supernode in the column major form in an array of


((2m − n) × n) (Fig. 5-2). Ai0:j1−1,j1:i1−1 is transposed and attached below the rest of the supernode, turning the L-shape into a rectangular matrix that is easier to store and handle.
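The sketch below illustrates this packing under the layout just described; the accessor getA and the assumption that the values are gathered entry by entry are for illustration only and are not the actual NLU code.

#include <stddef.h>

/* Pack a supernode into a single column-major array S of size (2m-n) x n,
 * with leading dimension d = 2m - n.  Rows i0..i1-1, columns j0..j1-1 form
 * the (m x n) L part; the (n x (m-n)) block A(i0:j1-1, j1:i1-1) is
 * transposed and appended below it. */
void pack_supernode(int i0, int i1, int j0, int j1,
                    double (*getA)(int row, int col), double *S)
{
    int m = i1 - i0, n = j1 - j0, d = 2 * m - n;

    /* L part: S(r, c) = A(i0 + r, j0 + c) for 0 <= r < m */
    for (int c = 0; c < n; c++)
        for (int r = 0; r < m; r++)
            S[(size_t)c * d + r] = getA(i0 + r, j0 + c);

    /* transposed U part: S(m + r, c) = A(i0 + c, j1 + r) for 0 <= r < m - n */
    for (int c = 0; c < n; c++)
        for (int r = 0; r < m - n; r++)
            S[(size_t)c * d + m + r] = getA(i0 + c, j1 + r);
}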

Figure 5-2. Supernode stored in column major form

5.5.2 The Supernodal Algorithm

NLU is implemented with the left-looking LU. It differs from the blocked dense left-looking LU in that the order of the factorization is not simply from left to right, but is guided by the elimination tree. In the left-looking LU for a sparse matrix, Li2 and U2j correspond

to a supernode in the elimination tree. Without multithreading, a supernodal sparse LU

factorization algorithm repeatedly picks a leaf supernode from the elimination tree, factorizes

it, and removes the supernode from the elimination tree, until the elimination tree becomes

empty. A supernode is ready for factorization only when either it is initially a leaf, or all of its

descendants have been factorized.

For a sparse matrix with an elimination tree like the one in Fig. 5-3, one possible factorization ordering of the supernodes is ascending order of supernode index.
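A serial sketch of this elimination-tree-driven ordering is shown below; the helper names are illustrative assumptions, not the actual NLU code.

/* assumed helpers (declarations only, for illustration): */
extern int  etree_is_empty(void);
extern int  etree_pick_leaf(void);       /* any supernode with no children  */
extern void etree_remove_leaf(int s);    /* may turn its parent into a leaf */
extern void factorize_supernode(int s);  /* left-looking update + dense LU  */

void factorize_all_supernodes(void)
{
    while (!etree_is_empty()) {
        int s = etree_pick_leaf();       /* ready: it has no unfinished descendants */
        factorize_supernode(s);
        etree_remove_leaf(s);
    }
}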

During the factorization of a supernode s, for each descendant d of s, the contribution

block of d is computed. The contribution blocks from all descendants of s are assembled to

form the aggregate contribution block, and the aggregate contribution block is subtracted from

s. Then a dense LU factorization and two triangular solves are performed on s to compute the

factorization result of the supernode.
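The sketch below illustrates these dense operations with cuBLAS and cuSOLVER, assuming the (2m − n) × n column-major layout described above (diagonal block in rows 0..n−1, remaining L rows in n..m−1, transposed U rows in m..2m−n−1). Only one of the two contribution-block GEMMs is shown, the scatter-add assembly is omitted, the workspace is assumed to be pre-allocated via cusolverDnDgetrf_bufferSize, and the function names and signatures of the helpers are assumptions; this is not the actual NLU source.

#include <cublas_v2.h>
#include <cusolverDn.h>

/* C (mc x nc) -= Ld (mc x k) * Ud (k x nc); the mapped assembly into the
 * aggregate contribution block is not shown. */
static void contribution_block(cublasHandle_t blas,
                               const double *Ld, int ldd,
                               const double *Ud, int ldu,
                               double *C, int ldc, int mc, int nc, int k)
{
    const double minus_one = -1.0, one = 1.0;
    cublasDgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, mc, nc, k,
                &minus_one, Ld, ldd, Ud, ldu, &one, C, ldc);
}

/* Dense LU of the n x n diagonal block (no pivoting: devIpiv == NULL),
 * followed by the two triangular solves for the L rows and the
 * transposed-U rows of the supernode. */
static void factorize_supernode_dense(cublasHandle_t blas, cusolverDnHandle_t sol,
                                      double *S, int lds, int m, int n,
                                      double *work, int *devInfo)
{
    const double one = 1.0;
    cusolverDnDgetrf(sol, n, n, S, lds, work, NULL, devInfo);
    /* L21 = A21 * U11^{-1} */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                CUBLAS_DIAG_NON_UNIT, m - n, n, &one, S, lds, S + n, lds);
    /* U12^T = A12^T * L11^{-T} (the U part is stored transposed) */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T,
                CUBLAS_DIAG_UNIT, m - n, n, &one, S, lds, S + m, lds);
}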

Figure 5-3. Elimination tree

The GPUs are only used for sufficiently large supernodes. If the size of s is too small, NLU will perform the above operations entirely in the main memory (and not in the pinned host memory), using the CPU. In this case, Fig. 5-4 depicts the timing of the factorization of

a supernode.

Figure 5-4. Serial factorization of supernode

Among the steps within each iteration of the left-looking LU, the computing and the

assembly of the contribution blocks are the most time consuming, due to the large number of

descendants.

Consider a supernode s and one of its descendants, d. The algorithm performs the following:

1. A portion (determined by the matrix's symbolic structure) of d is copied from the pageable host memory to the pinned host memory.

2. The descendant is transferred to the GPU memory via the PCIe bus.

3. The contribution block is computed with two matrix multiplications, and assembled into the aggregate contribution block using a matrix addition with mapping.

Note that each of the above steps uses only a portion of the resources available:


• Step 1 uses the pinned host memory.

• Step 2 uses the pinned host memory, the GPU memory, and the PCIe bus.

• Step 3 uses the GPU memory and the GPU’s streaming multiprocessors.

In fact, we do not have to wait until the end of Step 3 to start computing the contribu-

tion block of the next descendant. The algorithm can start copying the next descendant into

the pinned host memory without worrying about overwriting, as long as the current descendant

has finished Step 2. We implement this by utilizing CUDA's stream and event features. Two CUDA streams are required, and each of them is associated with a CUDA event, a separate piece of GPU memory, and a separate piece of pinned host memory.

Let s0 and s1 be the two CUDA streams, and e0 and e1 be the two CUDA events corresponding to them, respectively. A for loop iterates over all the descendants of s. At the beginning

of each iteration, the algorithm queries e0 and e1. If either of them (ei) returns cudaSuccess,

the algorithm starts copying the next descendant to the pinned host memory, and subsequently

performs Steps 2 and 3 on si. Since Steps 2 and 3 are executed on a CUDA stream other than the default stream, they do not block the execution of the host code, which means the loop can continue before the on-GPU operations of Steps 2 and 3 have finished. ei is recorded at the end of Step 2, so that the next time ei is queried, it reflects whether Step 2 has already completed.

If both of the CUDA event queries fail, the algorithm can either wait until one of the

CUDA streams becomes available, or simply fall back to using the CPU. Generally speaking, it is not advisable to use the CPU when the task is large, or to use the GPU when the task is too small. The actual strategy the algorithm selects depends on the dimensions of the

descendant, and the threshold is configurable with parameters.
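The sketch below illustrates this two-stream pipeline. The buffer and helper names (pinned[], devbuf[], descendant_bytes, pack_descendant_into_pinned, compute_and_assemble_on_gpu, update_with_descendant_on_cpu) are assumptions for illustration, not the actual NLU code; the CUDA calls themselves are the standard runtime API.

#include <cuda_runtime.h>

/* assumed helpers and buffers (declarations only, for illustration): */
extern void  *pinned[2], *devbuf[2];
extern int    first_descendant(int s);
extern int    next_descendant(int s, int d);
extern size_t descendant_bytes(int d);
extern void   pack_descendant_into_pinned(int d, void *hostbuf);
extern void   compute_and_assemble_on_gpu(int d, void *devmem, cudaStream_t st);
extern void   update_with_descendant_on_cpu(int d);

void update_supernode_with_descendants(int s)
{
    cudaStream_t stream[2];
    cudaEvent_t  ev[2];
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&stream[i]);
        cudaEventCreate(&ev[i]);   /* querying an unrecorded event returns cudaSuccess */
    }

    for (int d = first_descendant(s); d != -1; d = next_descendant(s, d)) {
        int i = -1;
        if (cudaEventQuery(ev[0]) == cudaSuccess)      i = 0;  /* buffers of stream 0 free */
        else if (cudaEventQuery(ev[1]) == cudaSuccess) i = 1;  /* buffers of stream 1 free */

        if (i >= 0) {
            /* Step 1: stage the needed portion of d in pinned host buffer i */
            pack_descendant_into_pinned(d, pinned[i]);

            /* Step 2: asynchronous host-to-device copy on stream i */
            cudaMemcpyAsync(devbuf[i], pinned[i], descendant_bytes(d),
                            cudaMemcpyHostToDevice, stream[i]);
            cudaEventRecord(ev[i], stream[i]);  /* buffers reusable once this copy is done */

            /* Step 3: GEMMs + assembly on stream i; does not block the host loop */
            compute_and_assemble_on_gpu(d, devbuf[i], stream[i]);
        } else {
            /* both streams busy: wait, or update with this descendant on the CPU,
             * depending on its size (configurable threshold) */
            update_with_descendant_on_cpu(d);
        }
    }
    /* synchronization of both streams before the final assembly is omitted here */
}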

Fig. 5-5 shows an example of factorizing a supernode with 5 descendants using a GPU.

The CPU computes the contribution block using the data from the pageable host memory and assembles the contribution block into the pinned host memory, while the GPU does this in the GPU memory. Therefore, after all contribution blocks have been computed and assembled,


Figure 5-5. Parallel factorization of supernode

the aggregate contribution block located in the pinned host memory must be copied to the

GPU memory, and a matrix addition needs to be performed to sum up the two aggregate

contribution blocks.

Figure 5-6. Computing contribution blocks using 4 CUDA streams

Fig. 5-6 is part of the NVIDIA visual profiler output showing how the contribution blocks

are computed using multiple CUDA streams. We see that the host-to-device memory copies are overlapped with on-GPU floating-point operations, due to the pipelining.

5.5.3 Multithreading and Batched Factorization

Consider a server with a multicore CPU and K GPUs. We factorize N sparse matrices on

this server using multiple threads. The threads are nested, with T (T ≤ N) top-level threads (named "matrix threads", since they handle entire sparse matrices), and each matrix thread having K sub-threads (named "node threads", since they handle individual supernodes). The


actual number of active node threads in each matrix thread depends on the structure of the

matrix, and changes during run time, but it does not exceed the number of GPUs.

Different sparse matrices submitted to NLU are stored in different pieces of main memory.

The operating system makes sure that they do not overlap in the main memory. But the sparse

matrices must share the scarcer GPU-related resources, including the multiprocessors, the

GPU memory, and the pinned host memory. To avoid conflicts while maximizing each GPU’s

availability to each sparse matrix, we do not statically assign GPUs to node threads. Instead,

we put the GPUs in a resource pool, and let the node threads query and lock GPUs on the

fly. In this way, the batched factorization is implemented by simply running multiple sparse LU

factorizations in different matrix threads, and the maximum number of sparse matrices being

factorized concurrently can be adjusted with a simple change of a parameter.

The implementation of the inter-supernode multithreading relies heavily on the elimination

tree. Two supernodes can be factorized in parallel if and only if neither of them is an ancestor

or a descendant of the other. Our scheduling policy is implemented using a queue named

leafQueue.

Initially, leafQueue contains the indices of all the leaf nodes of the elimination tree. Each node thread contains a while loop that terminates if and only if leafQueue is empty. At the beginning of each iteration, the node thread pops a supernode s from leafQueue, inside a critical section. At the end of the iteration, the node thread enters another critical section and performs the following (a sketch of this loop is given after the list):

• Find p, the parent of s

• Remove s from the elimination tree

• Check whether p has become a leaf; if it has, push it into leafQueue
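The following sketch shows one node thread's scheduling loop; the integer-queue and elimination-tree helpers are assumed names for illustration, not the actual NLU code.

#include <omp.h>

#define EMPTY (-1)

/* assumed helpers (declarations only, for illustration): */
extern int  queue_empty(void);           /* leafQueue is empty?             */
extern int  queue_pop(void);             /* pop a leaf supernode index      */
extern void queue_push(int s);           /* push a supernode index          */
extern int  etree_parent(int s);         /* parent of s in elimination tree */
extern void etree_remove(int s);         /* detach s from the tree          */
extern int  etree_is_leaf(int s);        /* s has no remaining children?    */
extern void factorize_supernode(int s);  /* left-looking update + dense LU  */

void node_thread_loop(void)
{
    while (1) {
        int s = EMPTY;
        #pragma omp critical (leafQueue)
        {
            if (!queue_empty()) s = queue_pop();
        }
        if (s == EMPTY) break;            /* no available supernode: go idle */

        factorize_supernode(s);

        #pragma omp critical (leafQueue)
        {
            int p = etree_parent(s);
            etree_remove(s);
            if (p != EMPTY && etree_is_leaf(p))   /* p has no children left  */
                queue_push(p);
        }
    }
}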

For efficient data access, some metadata of the sparse matrices is stored in the pinned host memory and the device memory, and each piece of metadata is shared among

all supernodes of the same sparse matrix. The metadata must be transferred from the main

memory before the GPU can access it.


If the GPU starts working on a different sparse matrix, the metadata will be overwritten.

To prevent redundant data transfer, we reduce the overwriting of metadata by having the node

threads hold onto the GPUs for as long as they are active. A node thread is considered idle

if it runs out of available supernodes to factorize. At the beginning of each node thread, if

it fails its first pop of leafQueue, it goes idle immediately. If it succeeds, the node thread loops over the K GPUs and tries to lock one of them by testing their OpenMP locks. The locks are not released upon finishing a supernode; instead, when a supernode is finished, the node thread looks for the next supernode in the same sparse matrix to factorize, and only releases the lock if it fails to find one. Since the parallelism of a sparse matrix always monotonically decreases as we go further up the elimination tree, we do not need to worry about an idle node thread going active again. In this way, we make sure that a GPU is

released only if it is no longer needed by the last sparse matrix it was working on.
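A minimal sketch of this reservation policy is shown below; the gpuLock field matches the struct gpu_info_struct shown earlier, while the trimmed struct, the global K, and the two helper functions are assumptions for illustration.

#include <omp.h>

struct gpu_info_struct { omp_lock_t gpuLock; /* other fields omitted */ };
extern struct gpu_info_struct gpu_info[];
extern int K;   /* number of GPUs */

/* Try each GPU's lock without blocking; omp_test_lock returns nonzero on
 * success.  Returns the reserved GPU index, or -1 to fall back to the CPU. */
int reserve_gpu(void)
{
    for (int g = 0; g < K; g++)
        if (omp_test_lock(&gpu_info[g].gpuLock))
            return g;                  /* hold this GPU while work remains */
    return -1;
}

/* Release the lock only when the node thread runs out of supernodes. */
void release_gpu(int g)
{
    if (g >= 0)
        omp_unset_lock(&gpu_info[g].gpuLock);
}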

This strategy works well when all the GPUs are of the same computing power, but may

cause some loss of performance if the system is heterogeneous, because it might be better to

switch from a weaker GPU when a more powerful one is released by another node thread.

5.5.4 Utilizing Localization

This is an experimental feature. At the moment there is no notable performance gain

from this, but an optimization of the scheduling policy might make this useful in the future.

The data transfers from the main memory to the GPU memory are a significant time sink

in the GPU-accelerated sparse LU algorithm. Though we are able to hide some of these PCIe

communications by overlapping them with on-GPU floating-point operations, it is still better to

avoid making these transfers whenever possible.

Supernodes must be factorized before they can be used to update their ancestors. If the

factorization result left over in the GPU memory is not overwritten before it is used in an

update operation, we can skip the data copy from the main memory to the GPU memory.


We reduce the chance of overwriting by grouping supernodes into stages. A stage is a group of supernodes whose total size fits in the GPU memory. We set the starting offsets of the supernodes so that the

supernodes in the same stage do not overlap in the GPU memory.

The factorization result of supernode s is considered intact in a GPU’s memory if:

• s was factorized on that GPU.

• No supernodes from other stages were factorized on that GPU since the factorization of s.

• No supernodes from other matrices were factorized on that GPU since the factorization of s.

It is also possible that the factorization result does not exist in the GPU memory, but

is in the pinned host memory. If the GPU was used when computing and assembling the

contribution blocks, but the size of the supernode is small, the algorithm may choose to

perform the dense LU and the triangular solves in the pinned host memory, using the CPU.

We use several arrays and variables to track the location of supernodes’ factorization

result:

• GPUSerial[]:

The index of the GPU that the supernode was processed on. If GPUs are not used for

supernode s, then GPUSerial[s]=-1.

• NodeLocation[]:

Whether the supernode’s factorization result is in the GPU memory or in the pinned host

memory. NodeLocation[s] is meaningful if and only if GPUSerial[s] actually points

to a valid GPU.

• gpu_info->lastMatrix:

gpu_info is of type struct gpu_info_struct *, a pointer to the object that contains

GPU-specific information.

gpu_info->lastMatrix records the last sparse matrix that the GPU worked on. This is

to indicate whether the factorization result is potentially overwritten by another matrix.


• NodeStPass[]:

An array with monotonically increasing elements. If supernodes s0, s1, and s2 are

factorized in order on the same GPU, s0 and s2 are from the same stage while s1 is from

a different one, then s0 should be considered already overwritten at the time we factorize

s2. We maintain NodeStPass[] so that NodeStPass[s0] and NodeStPass[s2] have

different values, reflecting the overwriting between these two supernodes.

Before updating a supernode s with its descendants, we can see whether the factorization

results of the descendants are still in the GPU memory or the pinned host memory by checking

the above arrays and variables. If there is a hit, we can skip some data transfer, and reduce the

total running time.
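The check can be summarized by the sketch below, which follows the arrays and fields described above; the NODE_ON_GPU constant, the pointer type used for lastMatrix, and the function name are illustrative assumptions.

enum { NODE_ON_HOST = 0, NODE_ON_GPU = 1 };

/* Return nonzero if descendant d's factorization result is still intact in
 * the memory of GPU g, so the host-to-device copy of d can be skipped. */
int result_intact_on_gpu(int d, int g,
                         const int *GPUSerial, const int *NodeLocation,
                         const int *NodeStPass, int currentStagePass,
                         const void *gpuLastMatrix, const void *currentMatrix)
{
    if (GPUSerial[d] != g)                 return 0;  /* factorized elsewhere     */
    if (NodeLocation[d] != NODE_ON_GPU)    return 0;  /* result left on the host  */
    if (gpuLastMatrix != currentMatrix)    return 0;  /* another matrix ran since */
    if (NodeStPass[d] != currentStagePass) return 0;  /* another stage ran since  */
    return 1;
}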

At the moment, the hit rate we have achieved is still very low (less than 1%), but we

might be able to increase it by updating our multithread scheduling policy. We expect the hit

rate to be high when supernodes from the same stage are factorized on the same GPU.

5.6 Experimental Results

We carried out our experiments on a platform with two Intel Xeon E5-2695 v2 CPUs and

eight NVIDIA Tesla K40m GPUs. The test matrices are listed in Table 5-1. These matrices are

from the SuiteSparse matrix collection [13].

The experimental results are listed in Table 5-2 (OOM means "out of memory"). The factorization time comparison for some of the matrices is shown in Fig. 5-7. In the comparison, GLU and

NLU used one GPU and the CPU, while UMFPACK, KLU, and SuperLU used the CPU only.

We see that GLU is a strong competitor, especially when the matrix is small, but NLU works better on larger matrices. The maximum improvement in performance was achieved when factorizing the matrix "li": NLU's performance was 43.66x vs GLU, 209.88x vs UMFPACK, 755.48x vs KLU, and 20.63x vs SuperLU. NLU was able to achieve 200.35 times the performance of SuperLU when factorizing the matrix "epb1".


Table 5-1. Test matrices
matrix          problem type                            dimension    nonzeros
poli_large      Economic Problem                        15,575       33,033
epb1            Thermal Problem                         14,734       95,053
bayer01         Chemical Process Simulation Problem     57,735       275,094
ckt11752_dc_1   Circuit Simulation                      49,702       333,029
onetone1        Frequency Domain Circuit Simulation     36,057       335,552
ASIC_100ks      Circuit Simulation                      99,190       578,890
rajat25         Circuit Simulation                      87,190       606,489
rim             Computational Fluid Dynamics Problem    22,560       1,014,951
xenon1          Materials Problem                       48,600       1,181,120
matrix_9        Semiconductor Device                    103,430      1,205,518
li              Electromagnetics                        22,695       1,215,181
Raj1            Circuit Simulation                      263,743      1,300,261
rajat24         Circuit Simulation                      358,172      1,946,979
rma10           Computational Fluid Dynamics Problem    46,835       2,329,092
ASIC_680k       Circuit Simulation                      682,862      2,638,997
pre2            Frequency Domain Circuit Simulation     659,033      5,834,044
rajat30         Circuit Simulation                      643,994      6,175,244
marine1         Chemical Oceanography                   400,320      6,226,538
Freescale1      Circuit Simulation                      3,428,755    17,052,626
Transport       Structural Problem                      1,602,111    23,487,281
dgreen          Semiconductor Device                    1,200,611    26,606,169
ML_Laplace      Structural Problem                      377,002      27,582,698
ss              Semiconductor Process                   1,652,680    34,753,577
nv2             Semiconductor Device                    1,453,908    37,475,646
ML_Geer         Structural Problem                      1,504,002    110,686,677
stokes          Semiconductor Process                   11,449,533   349,321,980

Performance comparison when using different numbers of GPUs is shown in Fig. 5-8.

The performance gained when factorizing these matrices using multiple GPUs was 18.43% to

104.86% when using 2 GPUs, and 30.11% to 206.30% when using 4 GPUs.

5.7 Conclusions

In this chapter, we present NLU, an efficient supernodal sparse LU factorization algorithm

for hybrid multicore systems that supports multithreading and batched factorization.


Table 5-2. Factorization time (s)
matrix          GLU       UMFPACK   KLU       SuperLU   NLU
poli_large      1.01E-02  2.90E-03  1.00E-03  2.64E+00  4.79E-01
epb1            7.66E-02  8.75E-02  4.76E-02  1.16E+01  5.78E-02
bayer01         3.88E-01  1.36E-01  1.60E-01  3.99E+01  3.15E+00
ckt11752_dc_1   1.71E-01  2.93E-01  4.45E-02  2.79E+01  4.05E-01
onetone1        3.27E-01  3.07E-01  6.85E+00  2.17E+01  5.77E-01
ASIC_100ks      3.78E-01  5.91E-01  1.02E+00  3.16E+01  8.98E-01
rajat25         4.56E-01  4.54E+01  1.56E+00  6.61E+01  1.27E+00
rim             2.41E-01  6.06E-01  1.20E+01  1.33E+01  6.78E-02
xenon1          3.59E+00  1.57E+00  1.15E+01  2.80E+01  3.45E-01
matrix_9        3.79E+01  1.17E+01  1.85E+02  3.71E+01  1.87E+00
li              1.56E+01  7.50E+01  2.70E+02  7.73E+00  3.57E-01
Raj1            2.01E+00  1.12E+02  5.14E+01  1.21E+02  2.06E+00
rajat24         2.55E+00  OOM       1.46E+01  1.55E+02  4.92E+00
rma10           6.48E-01  9.79E-01  8.83E-01  9.51E+00  1.42E-01
ASIC_680k       3.85E+01  5.91E-01  6.91E-01  timeout   2.93E+01
pre2            3.98E+01  OOM       fail      4.12E+02  1.09E+01
rajat30         4.80E+01  OOM       1.23E+01  5.63E+02  1.19E+01
marine1         timeout   OOM       1.13E+03  1.12E+02  4.58E+00
Freescale1      4.38E+00  fail      7.34E+00  timeout   1.61E+01
Transport       timeout   fail      fail      7.08E+02  2.96E+01
dgreen          timeout   fail      fail      4.32E+02  2.43E+01
ML_Laplace      timeout   OOM       5.36E+02  7.06E+01  4.66E+00
ss              timeout   fail      fail      timeout   7.74E+01
nv2             timeout   fail      fail      5.67E+02  2.64E+01
ML_Geer         timeout   singular  fail      2.92E+02  2.21E+01
stokes          timeout   singular  fail      segfault  1.29E+03

In our experiments, we compared NLU to other sparse LU solvers, and saw that NLU is able to handle matrices significantly larger than those the other sparse LU solvers can handle, and NLU can achieve higher performance on some matrices: up to 43.66 times vs GLU, up to 209.88 times vs UMFPACK, up to 755.48 times vs KLU, and up to 200.35 times vs SuperLU.

NLU is able to accelerate the factorization by using multiple GPUs. The performance

gained was up to 104.86% when using 2 GPUs and up to 206.30% when using 4 GPUs.

One shortcoming of NLU is that it does not yet support pivoting, which may limit the

scenarios in which it can be used. We will update it to include pivoting in future versions.


Figure 5-7. LU factorization time (natural log transformed)

Figure 5-8. LU factorization time using one or multiple GPUs


CHAPTER 6
SUMMARY AND CONCLUSIONS

Our work is focused on sparse matrix factorization algorithms on hybrid multicore

architectures, including sparse Cholesky, sparse QR, and sparse LU.

In Chapter 3, we present optimization techniques for the sparse Cholesky algorithm,

CHOLMOD. Our optimizations for CHOLMOD include multithreading, pipelining, and

the multilevel subtree method. We also implemented the batched factorization feature for

CHOLMOD.

The optimizations for CHOLMOD, when put together, can increase the efficiency of the

factorization by tens of times.

In Chapter 4, we introduce our optimizations for the sparse QR algorithm, SPQR. We improved the arithmetic CUDA kernels to increase the performance of the apply operations. We also implemented two pipelines to reduce the data transfer overhead.

Our optimizations for SPQR were able to increase its performance by up to 43.92%.

In Chapter 5, we present our implementation of a sparse LU solver. Our sparse LU algo-

rithm is a supernodal algorithm that can utilize multiple GPUs, and supports multithreading,

pipelining, and batched factorization.

We compared our sparse LU algorithm to other LU solvers, and saw that when using one

GPU, it’s performance can be up to 43.66x vs GLU, 209.88x vs UMFPACK, 755.48x vs KLU,

and 200.35x vs SuperLU.

Using multiple GPUs can further increase our LU solver’s performance. The performance

gained when using multiple GPUs was 18.43% to 104.86% when using 2 GPUs, and 30.11% to

206.30% when using 4 GPUs.


APPENDIX
PUBLICATIONS

• Meng Tang, Mohamed Gadou, and Sanjay Ranka. 2017. A Multithreaded Algorithm for

Sparse Cholesky Factorization on Hybrid Multicore Architectures. Procedia Computer

Science 108 (2017), 616-625.

• Meng Tang, Mohamed Gadou, Steven Rennich, Timothy A Davis, and Sanjay Ranka.

Optimized Sparse Cholesky Factorization on Hybrid Multicore Architectures. Journal of

Computational Science (2018).

• Meng Tang, Mohamed Gadou, Steven Rennich, Timothy A Davis, and Sanjay Ranka.

A Multilevel Subtree Method for Single and Batched Sparse Cholesky Factorization.

Proceedings of the 47th International Conference on Parallel Processing (2018).


REFERENCES

[1] Amestoy, Patrick, Duff, Iain, L’Excellent, Jean Yves, and Plechac, Petr. “PARASOL An integrated programming environment for parallel sparse matrix solvers.” High-Performance Computing. Springer, 1999, 79–90.

[2] Amestoy, Patrick R, Davis, Timothy A, and Duff, Iain S. “Algorithm 837: AMD, an approximate minimum degree ordering algorithm.” ACM Transactions on Mathematical Software (TOMS) 30 (2004).3: 381–388.

[3] Amestoy, Patrick R, Duff, Iain S, L’Excellent, Jean-Yves, and Koster, Jacko. “MUMPS: a general purpose distributed memory sparse solver.” International Workshop on Applied Parallel Computing. Springer, 2000, 121–130.

[4] Bjorck, Ake. “Solving linear least squares problems by Gram-Schmidt orthogonalization.” BIT Numerical Mathematics 7 (1967).1: 1–21.

[5] Chen, Yanqing, Davis, Timothy A, Hager, William W, and Rajamanickam, Sivasankaran. “Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate.” ACM Transactions on Mathematical Software (TOMS) 35 (2008).3: 22.

[6] Chevalier, Cedric and Pellegrini, Francois. “PT-Scotch: A tool for efficient parallel graph ordering.” Parallel Computing 34 (2008).6-8: 318–331.

[7] Cuthill, Elizabeth and McKee, James. “Reducing the bandwidth of sparse symmetric matrices.” Proceedings of the 1969 24th national conference. ACM, 1969, 157–172.

[8] Davis, Tim, Hager, WW, and Duff, IS. “SuiteSparse.” (2014). URL http://faculty.cse.tamu.edu/davis/suitesparse.html

[9] Davis, Timothy A. Direct methods for sparse linear systems. SIAM, 2006.

[10] ———. “Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization.” ACM Transactions on Mathematical Software (TOMS) 38 (2011).1: 8.

[11] Davis, Timothy A and Duff, Iain S. “Unsymmetric-pattern multifrontal methods for parallel sparse LU factorization.” Technical Report, Comp. and Info. Sci. Dept., University of Florida (1991).

[12] ———. “An unsymmetric-pattern multifrontal method for sparse LU factorization.” SIAM Journal on Matrix Analysis and Applications 18 (1997).1: 140–158.

[13] Davis, Timothy A and Hu, Yifan. “The University of Florida sparse matrix collection.” ACM Transactions on Mathematical Software (TOMS) 38 (2011).1: 1.


[14] Davis, Timothy A and Palamadai Natarajan, Ekanathan. “Algorithm 907: KLU, a direct sparse solver for circuit simulation problems.” ACM Transactions on Mathematical Software (TOMS) 37 (2010).3: 36.

[15] Davis, Timothy A, Rajamanickam, Sivasankaran, and Sid-Lakhdar, Wissam M. “A survey of direct methods for sparse linear systems.” Acta Numerica 25 (2016): 383–566.

[16] Demmel, James W. “SuperLU users’ guide.” (1999).

[17] Gentleman, W Morven. “Least squares computations by Givens transformations without square roots.” IMA Journal of Applied Mathematics 12 (1973).3: 329–336.

[18] Gentleman, W Morven and Kung, HT. “Matrix triangularization by systolic arrays.” Real-Time Signal Processing IV. vol. 298. International Society for Optics and Photonics, 1982, 19–27.

[19] George, Alan. “Nested dissection of a regular finite element mesh.” SIAM Journal on Numerical Analysis 10 (1973).2: 345–363.

[20] George, Alan, Heath, Michael, Liu, Joseph, and Ng, Esmond. “Solution of sparse positive definite systems on a hypercube.” Journal of Computational and Applied Mathematics 27 (1989).1-2: 129–156.

[21] George, Alan and Heath, Michael T. “Solution of sparse linear least squares problems using Givens rotations.” Linear Algebra and its Applications 34 (1980): 69–83.

[22] George, Alan and Liu, Joseph WH. “Householder reflections versus Givens rotations in sparse orthogonal decomposition.” Linear Algebra and its Applications 88 (1987): 223–238.

[23] ———. “The evolution of the minimum degree ordering algorithm.” SIAM Review 31 (1989).1: 1–19.

[24] George, Alan and McIntyre, David R. “On the application of the minimum degree algorithm to finite element systems.” Mathematical Aspects of Finite Element Methods. Springer, 1977. 122–149.

[25] Golub, Gene. “Numerical methods for solving linear least squares problems.” Numerische Mathematik 7 (1965).3: 206–216.

[26] Hadfield, Steven Michael. On the LU factorization of sequences of identically structured sparse matrices within a distributed memory environment. Ph.D. thesis, Citeseer, 1994.

[27] Heath, Michael T. “Numerical methods for large sparse linear least squares problems.” SIAM Journal on Scientific and Statistical Computing 5 (1984).3: 497–513.


[28] Jennings, Alan. “A compact storage scheme for the solution of symmetric linear simultaneous equations.” The Computer Journal 9 (1966).3: 281–285.

[29] Karypis, George and Kumar, Vipin. “METIS: unstructured graph partitioning and sparse matrix ordering system, version 2.0.” (1995).

[30] Karypis, George, Schloegel, Kirk, and Kumar, Vipin. “Parmetis.” Parallel graph partitioning and sparse matrix ordering library. Version 2 (2003).

[31] Kolodziej, Scott, Yeralan, Nuri, Davis, Tim, and Hager, William W. “Mongoose User Guide, Version 2.0.3.” (2018).

[32] Li, Xiaoye S. “An overview of SuperLU: Algorithms, implementation, and user interface.” ACM Transactions on Mathematical Software (TOMS) 31 (2005).3: 302–325.

[33] Lipton, Richard J, Rose, Donald J, and Tarjan, Robert Endre. “Generalized nested dissection.” SIAM Journal on Numerical Analysis 16 (1979).2: 346–358.

[34] Liu, Joseph W. “A compact row storage scheme for Cholesky factors using elimination trees.” ACM Transactions on Mathematical Software (TOMS) 12 (1986).2: 127–148.

[35] Liu, Joseph WH. “On general row merging schemes for sparse Givens transformations.” SIAM Journal on Scientific and Statistical Computing 7 (1986).4: 1190–1211.

[36] ———. “The role of elimination trees in sparse factorization.” SIAM Journal on Matrix Analysis and Applications 11 (1990).1: 134–172.

[37] ———. “A generalized envelope method for sparse factorization by rows.” ACM Transactions on Mathematical Software (TOMS) 17 (1991).1: 112–129.

[38] Liu, Wai-Hung and Sherman, Andrew H. “Comparative analysis of the Cuthill-McKee and the reverse Cuthill-McKee ordering algorithms for sparse matrices.” SIAM Journal on Numerical Analysis 13 (1976).2: 198–213.

[39] Markowitz, Harry M. “The elimination form of the inverse and its application to linear programming.” Management Science 3 (1957).3: 255–269.

[40] McWhirter, JG. “Recursive least-squares minimization using a systolic array.” Real-Time Signal Processing VI. vol. 431. International Society for Optics and Photonics, 1983, 105–114.

[41] Natarajan, Ekanathan Palamadai. KLU-A high performance sparse linear solver for circuit simulation problems. Ph.D. thesis, University of Florida, 2005.

[42] Ohtsuki, Tatsuo, Cheung, Lap Kit, and Fujisawa, Toshio. “Minimal triangulation of a graph and optimal pivoting order in a sparse matrix.” Journal of Mathematical Analysis and Applications 54 (1976).3: 622–633.


[43] Papadimitriou, Ch H. “The NP-completeness of the bandwidth minimization problem.” Computing 16 (1976).3: 263–270.

[44] Pellegrini, Francois and Roman, Jean. “Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs.” International Conference on High-Performance Computing and Networking. Springer, 1996, 493–498.

[45] Peng, Shaoyi and Tan, Sheldon X-D. “GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation.” arXiv preprint arXiv:1908.00204 (2019).

[46] Preis, Robert and Diekmann, Ralf. The PARTY Partitioning-library: User Guide; Version 1.1. Univ.-GH, FB Mathematik/Informatik, 1996.

[47] Reid, Matthew W. “Pivoting for LU Factorization.” (2014).

[48] Rennich, Steven C, Stosic, Darko, and Davis, Timothy A. “Accelerating sparse Cholesky factorization on GPUs.” Parallel Computing 59 (2016): 140–150.

[49] Rose, DJ, Whitten, GG, Sherman, AH, and Tarjan, RE. “Algorithms and software for in-core factorization of sparse symmetric positive definite matrices.” Computers & Structures 11 (1980).6: 597–608.

[50] Rose, Donald J. “A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations.” Graph Theory and Computing. Elsevier, 1972. 183–217.

[51] Rotella, F and Zambettakis, I. “Block Householder transformation for parallel QR factorization.” Applied Mathematics Letters 12 (1999).4: 29–34.

[52] Schenk, Olaf, Gartner, Klaus, and Fichtner, Wolfgang. “Scalable parallel sparse factorization with left-right looking strategy on shared memory multiprocessors.” International Conference on High-Performance Computing and Networking. Springer, 1999, 221–230.

[53] Schreiber, Robert. “A new implementation of sparse Gaussian elimination.” ACM Transactions on Mathematical Software (TOMS) 8 (1982).3: 256–276.

[54] Tang, Meng, Gadou, Mohamed, and Ranka, Sanjay. “A Multithreaded Algorithm for Sparse Cholesky Factorization on Hybrid Multicore Architectures.” Procedia Computer Science 108 (2017): 616–625.

[55] Tang, Meng, Gadou, Mohamed, Rennich, Steven C, Davis, Timothy A, and Ranka, Sanjay. “A Multilevel Subtree Method for Single and Batched Sparse Cholesky Factorization.” Proceedings of the 47th International Conference on Parallel Processing. ACM, 2018, 50.

[56] Tinney, William F and Walker, John W. “Direct solutions of sparse network equations by optimally ordered triangular factorization.” Proceedings of the IEEE 55 (1967).11: 1801–1809.


[57] van de Geijn, Robert A. “Notes on LU Factorization.” (2014).

[58] Yang, Wei H. “A method for updating Cholesky factorization of a band matrix.” Computer Methods in Applied Mechanics and Engineering 12 (1977).3: 281–288.

[59] Yannakakis, Mihalis. “Computing the minimum fill-in is NP-complete.” SIAM Journal on Algebraic Discrete Methods 2 (1981).1: 77–79.

[60] Yanovsky, Igor. “QR Decomposition with Gram-Schmidt.” University of California, Los Angeles (2012).

[61] Yeralan, Sencer Nuri, Davis, Timothy A, Sid-Lakhdar, Wissam M, and Ranka, Sanjay. “Algorithm 980: Sparse QR Factorization on the GPU.” ACM Transactions on Mathematical Software (TOMS) 44 (2017).2: 17.


BIOGRAPHICAL SKETCH

Meng Tang received his Ph.D. from the University of Florida in 2020, his master's degree in computer engineering from the Chinese Academy of Sciences, and his bachelor's degree in computer science from Shanghai Jiaotong University. His area of research is high-performance computing.
