PERFORMANCE OPTIMIZATION FOR SPARSE MATRIX FACTORIZATION ALGORITHMS ON HYBRID MULTICORE ARCHITECTURES
By
MENG TANG
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2020
© 2020 Meng Tang
ACKNOWLEDGMENTS
My sincerest gratitude to my advisor, Dr. Sanjay Ranka, for his continuous support. Dr.
Ranka's guidance and motivation were invaluable to my Ph.D. study. The resources and
funding he provided were critical to my research work.
I thank Dr. Mohamed Gadou and Dr. Tania Banerjee for their help in our projects and
in paper writing. They were great collaborators and friends who have aided me in solving
many problems.
I thank Dr. Timothy Davis and Dr. Steven Rennich for generously providing essential
equipment. Without their servers, it would be very difficult to conduct my research.
I thank Dr. Alper Ungor, Dr. Jih-Kwon Peir, and Dr. William Hager for their insightful
comments on my thesis. As members of my supervisory committee, they have provided
valuable ideas and comments during my progression to the degree.
I thank my family for their unwavering support in my pursuit of knowledge. They have been,
and will always be, my source of strength in even the darkest hours.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
CHAPTER
1 INTRODUCTORY REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 BACKGROUND AND RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 Fill-reducing Permutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Elimination Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 SPARSE CHOLESKY FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Up-looking Cholesky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Left-looking Cholesky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Right-looking Cholesky . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.5 The Supernodal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.6 The Multifrontal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.7 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.8 The Subtree Method and the Multilevel Subtree Method . . . . . . . . . . . . 29
3.9 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.10 The Batched Sparse Cholesky Factorization . . . . . . . . . . . . . . . . . . . 37
3.10.1 The Merge-and-Factorize Approach . . . . . . . . . . . . . . . . . . . 37
3.10.2 The Normal Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.11 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.11.1 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.11.2 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.11.3 Multilevel Subtree Method . . . . . . . . . . . . . . . . . . . . . . . . 43
3.11.4 Batched Sparse Cholesky Factorization . . . . . . . . . . . . . . . . . 46
3.12 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 SPARSE QR FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Gram-Schmidt Orthogonalization . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Givens Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Blocked Givens Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5 Householder Reflection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.6 The Multifrontal Sparse QR Factorization . . . . . . . . . . . . . . . . . . . . 56
4.7 The Arithmetic CUDA Kernels . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.8 Pipelining CUDA Kernels and Device-to-Host Transfers . . . . . . . . . . . . . 63
4.9 Pipelining GPU Workload and CPU Workload . . . . . . . . . . . . . . . . . 65
4.10 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5 SPARSE LU FACTORIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1 Left-looking LU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Right-looking LU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 The Supernodal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 The Multifrontal Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5 Implementation of a Supernodal Sparse LU Algorithm . . . . . . . . . . . . . 79
5.5.1 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5.1.1 GPU information . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5.1.2 Matrix information . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.2 The Supernodal Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5.3 Multithreading and Batched Factorization . . . . . . . . . . . . . . . . 86
5.5.4 Utilizing Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 SUMMARY AND CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . 94
APPENDIX: PUBLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
LIST OF TABLES
Table page
3-1 Test matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4-1 Test matrices used for QR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4-2 QR experiment results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5-1 Test matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5-2 Factorization time (s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
LIST OF FIGURES
Figure page
3-1 Workflow of single level (a+c) / multilevel (a+b) subtree algorithm . . . . . . . . . 32
3-2 Using pipeline in the factorization of a subtree . . . . . . . . . . . . . . . . . . . . 35
3-3 Factorization time for sparse Cholesky with / without multithreading . . . . . . . . 40
3-4 Average power consumption for sparse Cholesky with / without multithreading . . . 40
3-5 Energy consumption for sparse Cholesky with / without multithreading . . . . . . . 41
3-6 Cholesky factorization performance for sparse matrices . . . . . . . . . . . . . . . . 42
3-7 Power consumption of sparse Cholesky factorization . . . . . . . . . . . . . . . . . 42
3-8 Energy consumption of sparse Cholesky factorization . . . . . . . . . . . . . . . . . 43
3-9 Performance comparison between single-level subtree algorithm and multilevel subtree algorithm on single GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3-10 Structure of matrix Geo 1438 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3-11 Performance comparison between single-level subtree algorithm and multilevel subtree algorithm on single GPU and two GPUs . . . . . . . . . . . . . . . . . . . . . 45
3-12 Batched Cholesky factorization performance versus sequential matrix factorization on GPUs (2 GPUs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-13 Batched Cholesky factorization performance versus sequential matrix factorization on GPUs (4 GPUs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4-1 The elimination tree of a sparse matrix . . . . . . . . . . . . . . . . . . . . . . . . 57
4-2 A possible scheduling of fronts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4-3 Stages in the workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-4 Stages in the elimination tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4-5 Reducing PCIe communications with stages . . . . . . . . . . . . . . . . . . . . . . 58
4-6 The factorization and the assembly operations . . . . . . . . . . . . . . . . . . . . 59
4-7 Factorization of a front . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4-8 Factorization of a front . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4-9 The VT tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4-10 U and V in the VT tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-11 Q, U and V in the VT tile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-12 Pipelining the CUDA kernel runs and device-to-host transfers . . . . . . . . . . . . 63
4-13 Pinned host memory used as a buffer . . . . . . . . . . . . . . . . . . . . . . . . . 63
4-14 Pipelining the factorization of stages and the buffer flushing . . . . . . . . . . . . . 63
4-15 A secondary pinned host memory buffer to avoid data conflict . . . . . . . . . . . . 65
4-16 Comparison between sparse QR algorithm with / without stage-level pipeline . . . . 66
4-17 Performance comparison between algorithm before and after optimizations . . . . . 67
4-18 Relationship between flop count and improvement in performance . . . . . . . . . . 68
4-19 Energy consumed by the GPU in factorization (large matrices) . . . . . . . . . . . . 68
4-20 Reduction in energy consumption after the optimization . . . . . . . . . . . . . . . 69
4-21 Average power of the GPU in factorization . . . . . . . . . . . . . . . . . . . . . . 69
4-22 Reduction in average power after the optimization . . . . . . . . . . . . . . . . . . 70
5-1 Supernode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5-2 Supernode stored in column major form . . . . . . . . . . . . . . . . . . . . . . . . 83
5-3 Elimination tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5-4 Serial factorization of supernode . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5-5 Parallel factorization of supernode . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5-6 Computing contribution blocks using 4 CUDA streams . . . . . . . . . . . . . . . . 86
5-7 LU factorization time (natural log transformed) . . . . . . . . . . . . . . . . . . . . 93
5-8 LU factorization time using one or multiple GPUs . . . . . . . . . . . . . . . . . . 93
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
PERFORMANCE OPTIMIZATION FOR SPARSE MATRIX FACTORIZATION ALGORITHMS ON HYBRID MULTICORE ARCHITECTURES
By
Meng Tang
May 2020
Chair: Sanjay Ranka
Major: Computer Engineering
The use of sparse direct methods in computational science is ubiquitous. Direct methods
can be used to find solutions to many numerical algebra applications, including sparse linear
systems, sparse linear least squares, and eigenvalue problems; consequently they form the
backbone of a broad spectrum of large scale applications. The use of sparse direct methods is
extensive, with many of the relevant science and engineering application areas being pushed to
run at ever higher scales.
In this work we delve into the implementations of sparse direct methods, including the
sparse Cholesky, QR, and LU factorizations. We study a number of state-of-the-art
libraries for sparse matrix factorization and improve their performance by applying various
optimizations.
For the sparse Cholesky factorization we have implemented multithreading, pipelining, the
multilevel subtree method, and batched factorization.
For the sparse QR factorization we implemented pipelining and improved the arithmetic
CUDA kernels.
For the sparse LU factorization, we implemented a supernodal sparse LU solver that can
utilize multiple GPUs, and supports multithreading, pipelining, and batched factorization.
CHAPTER 1
INTRODUCTORY REMARKS
Matrix factorizations such as the Cholesky factorization, the QR factorization, and the
LU factorization are the key to many applications involving linear algebra, and are useful in
solving linear problems such as systems of linear equations, linear least squares problems,
eigenvalue problems, non-linear optimization, Monte Carlo simulation, etc. These form the
foundation of a wide variety of large-scale applications including feature extraction, data
compression, computer graphics, recommender systems, artificial intelligence, etc.
A sparse matrix is a matrix in which most elements are zero. The factorization of a sparse
matrix differs from dense matrix factorization in that arithmetic operations on the sparse
matrix's zero elements can mostly be avoided to reduce the total workload, thus greatly
reducing the time, memory, and energy cost of the factorization. In order to exploit the
sparsity of the matrix, a symbolic analysis phase is needed before the actual numerical
factorization, to compute the nonzero pattern of the factors.
The symbolic analysis also computes the elimination tree. The elimination tree is a key
data structure in sparse matrix factorization algorithms [36]. It provides structural information
of the sparse matrix, and directs the workflow of the factorization. The elimination tree also
describes the dependency between the sparse matrix’s columns/rows, therefore it plays a vital
role in parallelizing the sparse matrix factorization algorithms.
Usually a fill-reducing permutation needs to be performed before the numerical factorization,
to reduce fill-in. Fill-in consists of new nonzero entries in the factors that are zero
in the corresponding positions of the matrix being factorized [15]. Finding the optimal fill-reducing
permutation has been proven to be NP-hard [59], but there exist heuristic algorithms
that find near-optimal permutations.
The numerical factorization phase is where most of the floating-point operations happen.
The floating-point operations are largely fixed because they are determined by the output of the
analysis phase, but there are various techniques to accelerate the factorization phase. These
techniques usually focus on exploiting the natural parallelism available within the factorization.
It is a common practice to accelerate the factorization with high-throughput highly-parallel
co-processors such as GPUs. Additionally, the sparsity of the matrix enables significant
performance gain through parallel programming on hybrid multicore systems.
Prior to the introduction of specific sparse matrix factorization algorithms, we will provide
background information and related work in Chapter 2. Symbolic analysis will also be
presented in that chapter. In Chapter 2, we will introduce the fill-reducing algorithms, and the
construction of the elimination tree. We will also briefly describe the extra analysis steps for
the supernodal sparse matrix factorization algorithm [5] [61] and the subtree method [48].
In Chapter 3, we provide the techniques for accelerating the sparse Cholesky factorization.
The Cholesky factorization algorithms for dense matrices are the foundation of sparse Cholesky,
therefore they are introduced at the beginning of this chapter. Later in Chapter 3, we describe
the basic sparse Cholesky algorithm [5] that our work is based upon. The remainder of the
chapter will cover techniques that enhance the performance of the sparse Cholesky algorithm,
including multithreading [54], pipelining [55], the subtree method [48], and the multilevel
subtree method [55].
In Chapter 4, we consider the sparse QR factorization. In this chapter we will first
introduce algorithms for both dense and sparse QR factorization, including Gram-Schmidt
orthogonalization, Givens rotation, and Householder reflection. We will then describe the
multifrontal sparse QR factorization algorithm [61] implemented by Yeralan et al. Then, we
provide our optimization techniques for the above algorithm, including the optimization of
arithmetic CUDA kernels, and pipelining.
In Chapter 5, we present our work on the sparse LU factorization. We will describe the
implementation details of our supernodal sparse LU solver.
We summarize our work in Chapter 6.
CHAPTER 2
BACKGROUND AND RELATED WORK
In typical sparse matrix factorization algorithms, the factorization splits into two phases:
the symbolic analysis phase and the numerical factorization phase [15]. The symbolic analysis
phase depends only on the nonzero pattern of the matrix being factorized, and will not look
into the actual values of the matrix’s entries.
The symbolic analysis is the key in exploiting the sparsity of the matrix, because the time
and memory efficiency of the numerical factorization phase greatly depends on the analysis.
The symbolic analysis is asymptotically faster than the numerical factorization, and makes the
numerical factorization phase more efficient in terms of time and memory [15].
The symbolic analysis is closely related to graph theory. For a symmetric n × n
matrix A = {a_ij}, the symbolic pattern of A can be represented by an undirected graph G,
where

\[
G = (V, E), \qquad V = \{v_1, \ldots, v_n\}, \qquad E = \{(v_i, v_j) : a_{ij} \neq 0\}
\]
The symbolic analysis for unsymmetric matrices is closely related to the symbolic analysis
for symmetric matrices.
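To make the graph representation concrete, the following is a minimal Python sketch (the function name `pattern_graph` and the 3 × 3 example matrix are illustrative, not from this work) that builds V and E from a symmetric matrix's nonzero pattern:

```python
import numpy as np

def pattern_graph(A):
    """Undirected graph G = (V, E) of a symmetric matrix's nonzero pattern:
    an edge (v_i, v_j) exists iff a_ij != 0 for i != j."""
    n = A.shape[0]
    V = list(range(n))
    # Store each undirected edge once as (min, max).
    E = {(j, i) for i in range(n) for j in range(i) if A[i, j] != 0}
    return V, E

# a_10 = 1 is the only off-diagonal nonzero, so E holds the single edge (0, 1).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 2.0]])
V, E = pattern_graph(A)
```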
2.1 Fill-reducing Permutation
Usually the first step of the symbolic analysis phase is the fill-reducing permutation. The
optimum fill-reducing permutation for an m × n matrix A is to find an m × m permutation
matrix P and an n × n permutation matrix Q such that the factorization of PAQ generates
minimum fill-in, where fill-in are nonzero entries of the factors whose same-position entries in
PAQ are zero.
The fill-in pose redundant workload to the factorization algorithm, and result in an
increase in time, memory and energy consumption, therefore a good fill-reducing permutation
is required for an ideal performance of the sparse matrix factorization. However, finding the
optimum fill-reducing permutation is NP-hard.
The minimum fill-in problem is equivalent to the minimum triangulated supergraph
problem. The goal of the minimum triangulated supergraph problem is to find a minimum
number of additional edges that make the graph chordal (triangulated). A graph is chordal if
for every cycle of length no less than 4, there exists an edge in the graph connecting non-adjacent
vertices of the cycle. Yannakakis proved that the minimum triangulated supergraph
problem for bipartite graphs is NP-complete [59].
Many heuristics are available for the minimum fill-in problem, including bandwidth
reduction [7], minimum degree ordering [23], minimal triangulation [42], and nested
dissection [19].
• The term "minimal" differs from "minimum" here. Let G = (V, E), and let
C = {E_c : G_c = (V, E ∪ E_c) is chordal}. The inclusion relation on C is a partial
order; a minimal triangulation corresponds to a minimal element of C with respect to
inclusion (E_c ∈ C but no proper subset of E_c is in C), whereas a minimum
triangulation is an E_c ∈ C of smallest cardinality.
Let A = {a_ij} be a symmetric matrix; the bandwidth of A is defined as

\[
\max_{a_{ij} \neq 0} |i - j|
\]

The bandwidth of A's graph G = (V, E), where V = \{v_1, \ldots, v_n\} and
E = \{(v_i, v_j) : a_{ij} \neq 0\}, is defined as

\[
\min_{\pi} \max_{(v_i, v_j) \in E} |\pi(i) - \pi(j)|
\]

where \pi ranges over orderings of \{v_1, \ldots, v_n\}.
The minimum bandwidth of A is equal to the bandwidth of G, and the bandwidth
minimization problem is equivalent to finding an ordering of the vertices that attains this
minimum.
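As a concrete check of the definition, a small NumPy sketch (the helper name `bandwidth` is ours) computes max |i − j| over the nonzeros; a tridiagonal matrix, for instance, has bandwidth 1:

```python
import numpy as np

def bandwidth(A):
    """max |i - j| over all nonzero entries a_ij of A."""
    rows, cols = np.nonzero(A)
    return int(np.max(np.abs(rows - cols))) if rows.size else 0

# Tridiagonal: nonzeros sit on the main diagonal and its two neighbors.
A = np.diag(np.ones(5)) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
```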
Papadimitriou proved that the bandwidth minimization problem is NP-complete [43].
Cuthill and McKee provided an efficient heuristic algorithm for the bandwidth minimization
problem [7]. Their algorithm uses a greedy node numbering scheme, starting from a vertex
with the minimum degree and performing a breadth-first search that prioritizes vertices with
smaller degrees. George proposed the "reverse Cuthill-McKee" method, which reverses the
ordering of the vertices. Liu et al. compared the Cuthill-McKee algorithm to the reverse
Cuthill-McKee algorithm [38]. Their experiments showed that the two orderings are equivalent
for band elimination methods, but when envelope elimination techniques are used, the reverse
Cuthill-McKee ordering is always better than or as good as the original Cuthill-McKee ordering.
They also explored the conditions under which the reverse Cuthill-McKee ordering is always
strictly better than the original one.
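The Cuthill-McKee scheme described above can be sketched in a few lines of Python (adjacency lists keyed 0..n−1; the function names are ours, and tie-breaking details vary between implementations). The reverse ordering is obtained simply by reversing the result:

```python
from collections import deque

def cuthill_mckee(adj):
    """Greedy Cuthill-McKee ordering: start a BFS from a minimum-degree
    vertex, visiting each vertex's neighbors in order of ascending degree."""
    n = len(adj)
    order, seen = [], [False] * n
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if seen[start]:
            continue                      # handle each connected component once
        seen[start] = True
        q = deque([start])
        while q:
            v = q.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if not seen[w]:
                    seen[w] = True
                    q.append(w)
    return order

def reverse_cuthill_mckee(adj):
    return cuthill_mckee(adj)[::-1]

# Path graph 0-1-2-3: endpoints have degree 1, so the BFS starts at vertex 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```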
The minimum degree ordering algorithm is a heuristic for the minimum fill-in problem
using a local greedy strategy. It greedily selects the sparsest pivot row and column during
the course of a right-looking sparse Cholesky factorization [15]. It is a symmetric analog of
an algorithm proposed by Markowitz for reordering equations arising in linear programming
applications [39]. Tinney and Walker were the first to propose the symmetric version of
Markowitz’s algorithm [56]. Rose developed a graph theoretic model for this algorithm, and
renamed it as the minimum degree algorithm [50]. George and McIntyre provided an efficient
implementation for the minimum degree algorithm [24].
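A toy version of the local greedy strategy can be written under the graph elimination model: repeatedly eliminate a vertex of smallest current degree, adding fill edges that turn its neighborhood into a clique. This sketch (names ours) ignores the refinements that make practical implementations such as [24] efficient:

```python
def minimum_degree_order(adj):
    """Greedy minimum-degree ordering: eliminate a vertex of smallest current
    degree; its neighbors become a clique (the fill edges)."""
    adj = {v: set(nb) for v, nb in adj.items()}   # mutable working copy
    order = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})                # fill-in: clique the neighbors
        order.append(v)
    return order

# Star graph: the degree-1 leaves are eliminated before the hub (vertex 0).
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
order = minimum_degree_order(star)
```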
Nested dissection is a heuristic for the minimum fill-in problem, using the divide and
conquer paradigm. The nested dissection algorithm selects a vertex separator, which is a
group of vertices that divide the graph into two roughly equal-sized subgraphs. After these
vertices are removed from the graph, the two subgraphs can be ordered via nested dissection or
minimum degree [15]. Nested dissection was discovered by George [19]; it was initially intended
to find an ordering of an n × n mesh, reducing the factorization time complexity from O(n^4)
to O(n^3) and the space complexity from O(n^3) to O(n^2 \log_2 n). Lipton et al. proposed
the generalized nested dissection which applies to any system of equations defined on a planar
or almost-planar graph [33]. Gilbert created a variant of the generalized nested dissection
algorithm. Instead of separating the graph into 2 subgraphs, Gilbert’s algorithm divides the
graph into r subgraphs where r ≥ 2.
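For intuition, here is a minimal sketch of nested dissection on a path graph, where a middle vertex is an ideal separator; separator vertices are numbered last so that they are eliminated after both halves (function name ours, path graph chosen purely for simplicity):

```python
def nested_dissection_path(lo, hi):
    """Nested dissection ordering for a path graph on vertices lo..hi-1:
    the middle vertex is the separator and is numbered last."""
    if hi - lo <= 0:
        return []
    mid = (lo + hi) // 2
    return (nested_dissection_path(lo, mid)          # left half first
            + nested_dissection_path(mid + 1, hi)    # then right half
            + [mid])                                 # separator last
```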
Multiple libraries for graph partitioning are available, such as METIS [29], PARTY [46],
SCOTCH [44], ParMETIS [30], PT-SCOTCH [6], Mongoose [31], etc.
2.2 Elimination Tree
After the fill-reducing permutation, the algorithm computes the elimination tree [36].
The elimination tree is an output of the symbolic analysis. It is a tree structure that provides
structural information of the sparse matrix. The elimination tree can also be viewed as
a directed acyclic graph which shows the dependency between matrix columns/rows and
describes the topological pattern of the factorization workflow.
For the Cholesky factorization of an n × n sparse matrix A, the elimination tree is a tree
with n nodes. If we number the matrix's columns and rows from 0 to n − 1, then for any
integers i and j with 0 ≤ j < i < n, the elimination tree's node j is a descendant of node i if
a_ij ≠ 0. A path from j to i indicates a data dependency between column j and column i:
column j of the factor must be computed before column i. Let L denote the factor,
A = LL^T; then node j is a descendant of node i if l_ij ≠ 0.
For QR and LU factorizations of a sparse matrix A, the column elimination tree is used.
The column elimination tree is the elimination tree of ATA [15].
The structure of an elimination tree was implicitly used long before its importance was
recognized [36]. Schreiber formally defined the elimination tree structure [53] (the term
”elimination tree” was not used but the definition of the tree structure referred to is the same
as the elimination tree). In [34], Liu used the term ”elimination tree” to refer to the tree
structure defined by Schreiber.
The elimination tree can be constructed incrementally, from leaves to root, following these
rules:
• If a_ij ≠ 0, then l_ij ≠ 0
• If i > j > k, l_ik ≠ 0, and l_jk ≠ 0, then l_ij ≠ 0
The time complexity of the elimination tree’s construction is nearly O(|A|) [15], where |A|
is the number of nonzero entries in A.
Supernodal matrix factorization algorithms are blocked matrix factorization algorithms
that attempt to factorize multiple columns at a time. The columns are put into groups, named
column groups or column panels. A column group is composed of sequential columns with
identical, or similar nonzero patterns. The nonzero pattern of a column group is the union of
the nonzero patterns of all columns in the column group. The supernodal elimination tree is a
variant of the elimination tree where each node (named a "supernode") represents not a column
but a column group. The column groups can be computed by iterating through the columns
of A. If the next column has a nonzero pattern similar to the current column group, then it is
merged into the column group, otherwise a new column group is created. Each column group
corresponds to a supernode of the supernodal elimination tree. Let L_p and L_q be two column
groups with p > q. Since column groups are composed of sequential columns, any l_i ∈ L_p
and l_j ∈ L_q satisfy i > j. Then L_p is an ancestor of L_q in the supernodal elimination tree
if and only if there exist l_i ∈ L_p and l_j ∈ L_q such that l_i is an ancestor of l_j in the
elimination tree.
In the rest of this dissertation, we will refer to the supernodal elimination tree simply as the
"elimination tree", disregarding the difference.
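The column-grouping pass can be sketched as follows, using the common identical-structure test: column j + 1 joins the current group when its pattern equals column j's pattern minus row j (patterns are sets of row indices; the example pattern is hypothetical, and practical codes also allow "similar" patterns via relaxed supernodes):

```python
def find_supernodes(col_pattern):
    """Partition columns into supernodes: column j extends the current
    supernode when its pattern equals column j-1's pattern minus row j-1,
    i.e. the two columns have identical structure below the diagonal block."""
    n = len(col_pattern)
    groups = [[0]]
    for j in range(1, n):
        if col_pattern[j] == col_pattern[j - 1] - {j - 1}:
            groups[-1].append(j)
        else:
            groups.append([j])
    return groups

# Columns 0-1 share structure, column 2 starts a new group that column 3 joins.
pattern = [{0, 1, 3}, {1, 3}, {2, 3}, {3}]
```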
CHAPTER 3
SPARSE CHOLESKY FACTORIZATION
The Cholesky factorization provides solutions to many problems in scientific computing,
such as linear equation systems, matrix inversion, and eigenvalue problems.
In this chapter, we provide optimization techniques for the sparse Cholesky factorization.
The base algorithm is CHOLMOD [5] from SuiteSparse [8]. We show that significant
improvement in performance can be achieved by applying our optimizations to CHOLMOD.
Multithreading is a common practice to increase a program's concurrency by exploiting the
problem's internal parallelism. The sparsity of sparse matrices makes it possible to divide
the sparse matrix factorization problem into multiple sub-problems and have different threads
handle them when there is no data dependency. On a hybrid multicore system, multithreading
can significantly accelerate the sparse matrix factorization algorithm. We implement
multithreading [54] using OpenMP and CUDA's stream feature.
Pipelining is a technique to implement parallelism by attempting to keep different parts
of the system busy. It is a widely used technique in modern CPUs. In matrix factorization
on hybrid multicore systems, pipelining can parallelize the workload on different components
of the system, including CPUs, GPUs, and the DMA engine. In a GPU-accelerated sparse
matrix factorization algorithm, the pipelining technique can effectively ”hide” the data transfer
(between main memory and GPU memory) overhead behind the on-GPU floating-point
operations, reducing the overall time consumption.
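As a toy analogue (plain Python threads standing in for CUDA streams and the DMA engine; all names are ours), a two-stage pipeline lets the "transfer" of chunk k + 1 proceed while chunk k is being "computed":

```python
import threading
import queue

def pipeline(chunks, transfer, compute):
    """Two-stage pipeline: the main thread runs stage 1 (transfer) while a
    worker thread runs stage 2 (compute), so the two stages overlap in time,
    hiding transfer latency behind computation."""
    q, results = queue.Queue(maxsize=1), []

    def worker():
        while True:
            item = q.get()
            if item is None:          # sentinel: no more work
                return
            results.append(compute(item))

    t = threading.Thread(target=worker)
    t.start()
    for c in chunks:
        q.put(transfer(c))            # stage 1 overlaps the worker's stage 2
    q.put(None)
    t.join()
    return results

out = pipeline([1, 2, 3], transfer=lambda c: c, compute=lambda c: c * c)
```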
The subtree method [48] is a batching technique where multiple CUDA kernels are
launched in a batch. It aims to reduce the total kernel launch overhead of numerous small tasks
by significantly reducing the total number of CUDA kernel calls. The original subtree method
only applies to the lower-level supernodes of the elimination tree. We provide a variant of
the subtree method [55], which applies also to high-level supernodes of the elimination tree if
possible.
3.1 Introduction
The Cholesky factorization is a decomposition of a symmetric positive definite matrix into
the product of a lower triangular matrix and its transpose. Given a symmetric positive definite
matrix A, the Cholesky factorization is to find a matrix L such that A = LLT , where L is a
lower triangular real matrix with positive diagonal entries, and LT is the transpose of L.
One application of the Cholesky factorization is solving systems of linear equations. Let A
be a symmetric positive definite matrix, the equation Ax = b can be solved with the help of
Cholesky factorization. If A = LLT where L is a lower triangular matrix, then LLTx = b. This
equation can be solved by solving Ly = b and LTx = y.
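For example, with NumPy (the 3 × 3 system below is illustrative): `np.linalg.cholesky` returns the lower factor L, and the two triangular solves recover x:

```python
import numpy as np

# Hypothetical symmetric positive definite system A x = b.
A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
b = np.array([2.0, 5.0, 4.0])

L = np.linalg.cholesky(A)      # A = L @ L.T, L lower triangular
y = np.linalg.solve(L, b)      # forward solve:  L y = b
x = np.linalg.solve(L.T, y)    # backward solve: L^T x = y
```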
The Cholesky factorization of a dense matrix and of a sparse matrix differ in that, by taking
into account the nonzero pattern of the sparse matrix, a large portion of the computations
involving zero elements can be avoided, and data dependencies between matrix elements can
be loosened, allowing parallel factorization of different parts of the sparse matrix.
Assume the matrix A takes the form

\[
A = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{n1} \\ a_{21} & a_{22} & \cdots & a_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}
\]

and the matrix L takes the form

\[
L = \begin{pmatrix} l_{11} & & & \\ l_{21} & l_{22} & & \\ \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix}
\]

then

\[
\begin{pmatrix} a_{11} & a_{21} & \cdots & a_{n1} \\ a_{21} & a_{22} & \cdots & a_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}
=
\begin{pmatrix} l_{11} & & & \\ l_{21} & l_{22} & & \\ \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix}
\begin{pmatrix} l_{11} & l_{21} & \cdots & l_{n1} \\ & l_{22} & \cdots & l_{n2} \\ & & \ddots & \vdots \\ & & & l_{nn} \end{pmatrix}
\]

First we have the equation a_{11} = l_{11} l_{11}, therefore l_{11} = \sqrt{a_{11}}.
With l11 known, the rest of the matrix L can be solved iteratively. There exist multiple
orders in which the entire L can be iteratively solved, and each order defines an algorithm,
namely the up-looking, the right-looking and the left-looking Cholesky. We will introduce these
algorithms in their respective subsections.
3.2 Up-looking Cholesky
The up-looking Cholesky (also named the row-Cholesky) is an algorithm which iteratively
solves L one row at a time, from top to bottom.
Let

\[
L = \begin{pmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix}
\]

where L_{11} is the top-left lower triangular sub-matrix already factorized. Writing A = LL^T
in the same block form,

\[
\begin{pmatrix} A_{11} & a_{21}^T & A_{31}^T \\ a_{21} & a_{22} & a_{32}^T \\ A_{31} & a_{32} & A_{33} \end{pmatrix}
=
\begin{pmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix}
\begin{pmatrix} L_{11}^T & l_{21}^T & L_{31}^T \\ & l_{22} & l_{32}^T \\ & & L_{33}^T \end{pmatrix}
\]
At the beginning of the up-looking Cholesky algorithm, L_{11} is empty, and the first
iteration computes l_{11} = \sqrt{a_{11}}.
Since l_{21} L_{11}^T = a_{21} and l_{21} l_{21}^T + l_{22}^2 = a_{22}, we have
l_{21} = a_{21} L_{11}^{-T} and l_{22} = \sqrt{a_{22} - l_{21} l_{21}^T}.
The up-looking Cholesky algorithm was first introduced by Rose, Whitten, Sherman, and
Tarjan [49]. Compared with band methods [58] and envelope methods [28], the up-looking
algorithm is able to exploit the matrix's sparsity more effectively. Rose et al. describe the
up-looking Cholesky algorithm as a "general sparse method". Since it only stores and operates
on the actual nonzeros, it can be substantially more efficient than band methods and envelope
methods.
Liu gives an implementation of up-looking Cholesky [37] that exploits all possible zeros
by employing a generalized form of the envelope method. The envelope method only exploits
zeros outside envelopes, logically treating zeros inside envelopes as nonzeros. Liu's algorithm
divides the computation into a sequence of full envelope (i.e., envelopes with no zeros)
triangular solves, essentially avoiding operations on zeros.
The up-looking Cholesky algorithm does not appear very frequently in the literature, but
it is still widely used. According to Davis’s research [9], the up-looking algorithm may be the
most efficient for very sparse matrices. MATLAB uses an up-looking Cholesky algorithm if the
matrix is very sparse, and uses a left-looking supernodal algorithm otherwise.
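A dense analogue of the row-by-row scheme, kept deliberately small (a sparse implementation would additionally restrict each row's triangular solve to the row's nonzero pattern; the function name is ours):

```python
import numpy as np

def up_looking_cholesky(A):
    """Dense analogue of up-looking (row) Cholesky: L is computed one row at
    a time, each row by a triangular solve against the rows above it."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(n):
        for j in range(i):
            # l_ij = (a_ij - sum_{k<j} l_ik l_jk) / l_jj
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
        L[i, i] = np.sqrt(A[i, i] - L[i, :i] @ L[i, :i])
    return L

A = np.array([[4.0, 2.0], [2.0, 5.0]])
L = up_looking_cholesky(A)
```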
3.3 Left-looking Cholesky
The left-looking Cholesky (also named the column-Cholesky) is an algorithm which
iteratively solves L one column at a time, from left to right.
Let
\[ L = \begin{pmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix} \]
where
\[ \begin{pmatrix} L_{11} \\ l_{21} \\ L_{31} \end{pmatrix} \]
represents the columns already computed.
The left-looking Cholesky algorithm starts with L11, l21, L31 empty, and in each step,
computes the next column of L.
Since
\[ \begin{pmatrix} A_{11} & a_{21}^T & A_{31}^T \\ a_{21} & a_{22} & a_{32}^T \\ A_{31} & a_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{pmatrix} \begin{pmatrix} L_{11}^T & l_{21}^T & L_{31}^T \\ & l_{22} & l_{32}^T \\ & & L_{33}^T \end{pmatrix} \]
we have:
l22 = sqrt(a22 - l21 l21^T)
l32 = (a32 - L31 l21^T) / l22
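The column recurrence above can be sketched densely (illustrative names and storage; a sparse code would apply updates only from columns k whose entry in row j is nonzero):

```python
import math

def left_looking_cholesky(A):
    """Dense sketch of the left-looking (column) Cholesky: each new
    column j is first updated by all columns to its left (the 'lazy'
    scheme), then scaled by the new diagonal entry."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col = [A[i][j] for i in range(j, n)]
        # Apply the update from every already-computed column k < j.
        for k in range(j):
            for i in range(j, n):
                col[i - j] -= L[i][k] * L[j][k]
        # l22 = sqrt(a22 - l21 l21^T); l32 = (a32 - L31 l21^T) / l22.
        L[j][j] = math.sqrt(col[0])
        for i in range(j + 1, n):
            L[i][j] = col[i - j] / L[j][j]
    return L
```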
The left-looking Cholesky has been more widely used than the up-looking Cholesky, and
forms the basis of the left-looking supernodal method [5]. The LAPACK function DPOTRF is
an implementation of the dense left-looking Cholesky algorithm; depending on the dimension
of the matrix, DPOTRF may switch to a blocked version.
In the left-looking Cholesky factorization of sparse matrices, multiple columns can be
computed independently, provided that there is no dependency between these columns. This
makes the left-looking Cholesky very useful in parallel Cholesky factorization algorithms.
The parallelism of the left-looking Cholesky is guided by the elimination tree, which is
an output of the symbolic analysis. The nodes of the elimination tree each correspond to
a column of the matrix. There is a data dependency between two columns if and only if one
column's corresponding node is a descendant of the other's; otherwise those columns may
benefit from parallel techniques such as multithreading. The parallelism depicted by the
elimination tree is also referred to as "tree parallelism". Tree parallelism can be utilized
when multiple threads or multiple processors are present.
3.4 Right-looking Cholesky
The right-looking Cholesky (also named the submatrix-Cholesky) is also an iterative
method that factorizes the matrix one column at a time.
Let
\[ L = \begin{pmatrix} l_{11} & \\ l_{21} & L_{22} \end{pmatrix} \]
Since
\[ \begin{pmatrix} a_{11} & a_{21}^T \\ a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} l_{11} & \\ l_{21} & L_{22} \end{pmatrix} \begin{pmatrix} l_{11} & l_{21}^T \\ & L_{22}^T \end{pmatrix} \]
we have:
l11 = sqrt(a11),
l21 = a21 / l11,
A'22 = A22 - l21 l21^T,
L22 = r_cholesky(A'22) (the function r_cholesky stands for a right-looking Cholesky
factorization of A'22).
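Written non-recursively, this recurrence can be sketched densely (illustrative; a sparse right-looking code would apply the rank-1 update only to columns whose pattern intersects l21):

```python
import math

def right_looking_cholesky(A):
    """Dense sketch of the right-looking (submatrix) Cholesky: factorize
    one column, then immediately update the trailing submatrix
    A'22 = A22 - l21 l21^T (the eager scheme)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    A = [row[:] for row in A]              # work on a copy
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j])       # l11 = sqrt(a11)
        for i in range(j + 1, n):
            L[i][j] = A[i][j] / L[j][j]    # l21 = a21 / l11
        # Update every column to the right now (lower triangle only).
        for i in range(j + 1, n):
            for k in range(j + 1, i + 1):
                A[i][k] -= L[i][j] * L[k][j]
    return L
```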
The workflow of the right-looking Cholesky is very similar to that of the left-looking
Cholesky, except that the left-looking Cholesky uses a "lazy" updating scheme. In a left-looking
Cholesky algorithm, the columns are not updated until right before their factorization,
while in a right-looking Cholesky algorithm, all columns to the right are updated immediately
after a column is factorized.
Unlike the up-looking and left-looking Cholesky, the update operations of the right-looking
Cholesky have multiple target columns [15]. This is not an issue if the entire factorization
is performed in shared main memory; however, it can cause extra workload if A22 is stored
on a different device (for example, if GPUs are used to perform the column updates).
The right-looking Cholesky forms the foundation of the multifrontal method.
3.5 The Supernodal Method
In this section we will first focus on the left-looking supernodal Cholesky.
Instead of factorizing the matrix one column at a time, the supernodal Cholesky method
combines columns into column panels, and runs in a blockwise manner.
In practice, matrices to be factorized often have columns with identical or similar nonzero
patterns. The supernodal method may exploit this to perform a blockwise factorization without
suffering a significant increase in fill. The supernodal method saves both space and time. By
combining columns with similar nonzero patterns, the supernodal method stores less duplicate
symbolic information of the matrix. By factorizing multiple columns at a time, the supernodal
method benefits from a better utilization of the memory hierarchy [15].
Let
\[ L = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \]
where
\[ \begin{pmatrix} L_{11} \\ L_{21} \\ L_{31} \end{pmatrix} \]
represents the columns already computed.
The left-looking supernodal Cholesky algorithm is similar to the left-looking Cholesky
except that multiple columns are combined and processed together. The algorithm starts with
L11, L21, L31 empty, and in each step factorizes the next column panel.
Since
\[ \begin{pmatrix} A_{11} & A_{21}^T & A_{31}^T \\ A_{21} & A_{22} & A_{32}^T \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T & L_{31}^T \\ & L_{22}^T & L_{32}^T \\ & & L_{33}^T \end{pmatrix} \]
we have:
\[ C = -\begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} L_{21}^T \]
(C is named the "contribution block"),
\[ \begin{pmatrix} A'_{22} \\ A'_{32} \end{pmatrix} = \begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix} + C \]
(this step is named "assemble"),
L22 = cholesky(A'22) (the function cholesky stands for a dense Cholesky factorization of
A'22),
and L32 = A'32 L22^{-T} (this step is named "triangular solve").
The computation of the contribution block and the assemble step combined is also called
the "update"; i.e., we say that the column panel \(\begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix}\) is updated by \(\begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix}\) [9].
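The assemble / factorize / triangular-solve sequence above can be sketched in a dense setting (a minimal illustration assuming fixed-width panels and list-of-lists storage; real supernodes group columns by shared nonzero pattern, and the dense helpers below stand in for the BLAS/LAPACK kernels):

```python
import math

def dense_cholesky(S):
    """Plain dense Cholesky (stands in for LAPACK POTRF)."""
    m = len(S)
    L = [[0.0] * m for _ in range(m)]
    for j in range(m):
        L[j][j] = math.sqrt(S[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, m):
            L[i][j] = (S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def supernodal_cholesky(A, b):
    """Left-looking Cholesky over column panels of width b."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j0 in range(0, n, b):
        j1 = min(j0 + b, n)
        # Assemble: S = A[j0:, j0:j1] + C with C = -L[j0:, :j0] L[j0:j1, :j0]^T.
        S = [[A[i][j] - sum(L[i][k] * L[j][k] for k in range(j0))
              for j in range(j0, j1)] for i in range(j0, n)]
        # L22 = cholesky(A'22): factorize the diagonal block.
        D = dense_cholesky([row[:j1 - j0] for row in S[:j1 - j0]])
        for i in range(j1 - j0):
            for j in range(i + 1):
                L[j0 + i][j0 + j] = D[i][j]
        # L32 = A'32 L22^{-T}: triangular solve for the rows below.
        for i in range(j1 - j0, n - j0):
            for j in range(j1 - j0):
                s = S[i][j] - sum(L[j0 + i][j0 + k] * D[j][k] for k in range(j))
                L[j0 + i][j0 + j] = s / D[j][j]
    return L
```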
We expect supernodes (i.e., column panels) to be composed of columns with identical
nonzero patterns, but columns with similar nonzero patterns can also be merged. In this case,
the supernode is called a "relaxed" supernode, and the nonzero pattern of the supernode is
the union of its columns' nonzero patterns.
Supernodes and relaxed supernodes are determined during the symbolic analysis phase,
after the non-supernodal elimination tree has been determined. The determination of relaxed
supernodes does not require the nonzero pattern of each column. Since the nonzero pattern of
a parent node is always a superset of that of a child node, only nonzero counts are needed to
check whether two columns have similar nonzero patterns, provided that one column is an
ancestor of the other in the elimination tree (information from the non-supernodal elimination
tree is needed to verify this).
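The count-based test can be sketched as follows (a hypothetical helper under simplifying assumptions: the columns are postordered so a mergeable parent is always the next column, and relaxation is a fixed count of tolerated zeros; production codes use more elaborate relaxation rules):

```python
def find_supernodes(parent, colcount, relax=0):
    """Greedy supernode detection from nonzero counts only. Column j is
    merged with its parent p = parent[j] when p is the next column, j is
    p's only child, and the counts are compatible: since pattern(j) is a
    subset of {j} union pattern(p), the quantity colcount[p]+1-colcount[j]
    is the number of explicit zeros the merge would treat as nonzeros
    (0 means identical patterns; up to `relax` of them gives a "relaxed"
    supernode). Returns the first column of each supernode."""
    n = len(parent)
    nchild = [0] * n
    for j in range(n):
        if parent[j] != -1:
            nchild[parent[j]] += 1
    heads = [0]
    for j in range(n - 1):
        p = parent[j]
        mergeable = (p == j + 1 and nchild[p] == 1
                     and colcount[p] + 1 - colcount[j] <= relax)
        if not mergeable:
            heads.append(j + 1)
    return heads
```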
The determination of supernodes and relaxed supernodes yields a supernodal elimination
tree, which describes the data dependencies between supernodes and depicts the tree parallelism
of the matrix. The supernodal elimination tree will be used to guide the parallel factorization
of multiple supernodes.
Apart from exploiting tree parallelism, the supernodal method may also increase the
performance by calling functions from efficient linear algebra libraries (BLAS and LAPACK).
Chen et al. developed the CHOLMOD [5] package, which performs Cholesky factorizations
with either the up-looking method (when the matrix is very small or very sparse) or the
left-looking supernodal method (when the matrix is large and not very sparse).
Rennich et al. further enhanced the performance of CHOLMOD by using GPUs in the
factorization [48]. The GPU enhanced version achieves a significant speedup compared to
the CPU-only version. They also introduced the subtree method. The subtree method stores
an entire subtree (instead of a single supernode) of the elimination tree in the GPU memory.
This largely eliminates PCIe transmissions during the factorization of the subtrees, and allows
batched CUDA kernel launches, which significantly reduces the overall kernel launch delay.
Supernodal methods may also be applied to the right-looking Cholesky.
Let
\[ L = \begin{pmatrix} L_{11} & \\ L_{21} & L_{22} \end{pmatrix} \]
Since
\[ \begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ & L_{22}^T \end{pmatrix} \]
we have:
L11 = cholesky(A11) (the function cholesky stands for a dense Cholesky factorization of
A11),
L21 = A21 L11^{-T},
A'22 = A22 - L21 L21^T,
L22 = r_cholesky(A'22) (the function r_cholesky stands for a right-looking supernodal
Cholesky factorization of A'22).
Similar to the right-looking non-supernodal Cholesky, a right-looking supernodal Cholesky
algorithm may have multiple target supernodes for an update operation, which can cause
extra data copies and copybacks if the updates are not performed on a shared-memory device
(e.g., on a GPU). This issue can be partly addressed by storing a subtree of the elimination tree
on the same device [20].
3.6 The Multifrontal Method
The multifrontal Cholesky is similar to right-looking supernodal Cholesky.
The first step of the multifrontal Cholesky is the symbolic analysis, during which the
supernodal elimination tree is constructed. The supernodal elimination tree is also named the
”assembly tree”.
The nodes of the assembly tree each correspond to a supernode, which in this case is also
named a "frontal matrix". The assembly tree describes the dependencies between frontal
matrices.
The difference between the multifrontal Cholesky and the right-looking supernodal
Cholesky is that with the multifrontal method, the contribution block of a child frontal
matrix updates only its parent frontal matrix; in return, each node of the assembly tree
must be large enough to hold the full contribution blocks from its children, and pass the
unused entries on to its parent. On the contrary, the supernodes of a supernodal Cholesky
algorithm do not relay contribution blocks, because contribution blocks are assembled
directly into the target ancestor supernode.
The multifrontal Cholesky factorization is composed of multiple partial factorizations of
frontal matrices. At the beginning of the algorithm, only the leaves of the assembly tree can
be factorized. A frontal matrix cannot be factorized until all of its descendants have been
factorized and their contribution blocks assembled.
Let Ap be a frontal matrix, and suppose the contribution blocks from the children of Ap
have already been assembled:
\[ A_p = \begin{pmatrix} A_{p11} & A_{p21}^T \\ A_{p21} & C_p \end{pmatrix} \]
The factorization of Ap is as follows:
Lp11 = cholesky(Ap11) (the function cholesky stands for a dense Cholesky factorization
of Ap11),
Lp21 = Ap21 Lp11^{-T},
C'p = Cp - Lp21 Lp21^T.
Let Aq be the parent frontal matrix of Ap; then the contribution block C'p needs to be
assembled into Aq.
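The partial factorization of a frontal matrix can be sketched densely (an illustrative function, assuming the front is stored as a dense list of lists with k pivot columns followed by the contribution-block rows):

```python
import math

def partial_factorize(F, k):
    """Partial dense Cholesky of a frontal matrix F: factorize the first
    k pivot columns (Lp11, Lp21) and return them with the Schur
    complement C'p = Cp - Lp21 Lp21^T, the contribution block to be
    assembled into the parent front."""
    m = len(F)
    L = [[0.0] * k for _ in range(m)]
    C = [row[k:] for row in [r[:] for r in F[k:]]]   # copy of Cp
    for j in range(k):
        L[j][j] = math.sqrt(F[j][j] - sum(L[j][t] ** 2 for t in range(j)))
        for i in range(j + 1, m):
            L[i][j] = (F[i][j] - sum(L[i][t] * L[j][t] for t in range(j))) / L[j][j]
    # C'p = Cp - Lp21 Lp21^T.
    for i in range(m - k):
        for j in range(m - k):
            C[i][j] -= sum(L[k + i][t] * L[k + j][t] for t in range(k))
    return L, C
```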
3.7 Multithreading
The CHOLMOD module of SuiteSparse v4.5.3 was a single-threaded supernodal sparse
Cholesky implementation which was only able to utilize one GPU and one CPU. We implemented the
multithreading feature on top of it [54].
Sparse matrices usually contain mutually data-independent supernodes, making available
what is called "tree parallelism", so named because this kind of parallelism is represented by
the structure of the elimination tree.
The multithreading technique allows the algorithm to exploit the tree parallelism and
factorize those supernodes in parallel. On a machine with multiple GPUs, the multithreaded
algorithm will associate each GPU with a thread and let the threads handle the supernodes
simultaneously.
Even if only one GPU is available, the algorithm can try to exploit the tree parallelism
by dividing the GPU memory into multiple regions, letting each region hold a supernode, and
having independent CUDA streams factorize the supernodes in parallel.
In our algorithm, we leverage the tree parallelism with OpenMP. We use an elimination-
tree-based scheduling policy to parallelize the factorization of supernodes. Upon entering
the numerical factorization phase, a number of threads are created, and each thread
repeatedly tries to fetch the next available supernode for factorization until no more supernodes
are available. A supernode is "available" for factorization if and only if all the supernodes it
depends on have been factorized and their contribution blocks assembled, i.e., all of its children
in the elimination tree have been factorized and their contribution blocks computed and assembled
into this supernode.
We maintain a queue W, which contains all supernodes available for factorization. Ini-
tially, W should contain all the leaves of the elimination tree. In our algorithm, the elimination
tree is represented by an n-element array P = {p0, · · · , pn−1}, where n is the number of
supernodes and pk is the parent of supernode k. Since W = {0, · · · , n − 1} − P , W can be initialized
by iterating through P and removing the elements of P from {0, · · · , n − 1}. In Algorithm 1, we
refer to P as Parent.
We also use another n-element array Nchild which initially contains each supernode’s
number of children. This array can be computed by first initializing all its elements to 0, and
then iterating through P and incrementing Nchild[pk] for each k ∈ {0, · · · , n− 1}.
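This initialization can be sketched as follows (illustrative names; the parent array uses -1 to mark a root):

```python
def init_schedule(parent):
    """Build the initial work queue W (all leaves) and the child counts
    Nchild from the elimination tree given as a parent array.
    W = {0..n-1} minus the set of parents, i.e. the nodes with no child."""
    n = len(parent)
    nchild = [0] * n
    for p in parent:
        if p != -1:
            nchild[p] += 1
    W = [s for s in range(n) if nchild[s] == 0]   # the leaves
    return W, nchild
```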
Each thread will run Algorithm 1 (W , Parent, and Nchild are shared among all threads,
finished, s, and p are local).
Algorithm 1 Parallel Factorization Scheduling
1: finished := FALSE
2: while finished = FALSE do
3:   enter critical section
4:   if W is empty then
5:     finished := TRUE
6:   else
7:     s := W.pop()
8:   end if
9:   exit critical section
10:  if finished = FALSE then
11:    factorize supernode s
12:    if s has no parent then
13:      finished := TRUE
14:    else
15:      p := Parent[s]
16:      enter critical section
17:      Nchild[p] := Nchild[p] - 1
18:      if Nchild[p] = 0 then
19:        W.push(p)
20:      end if
21:      exit critical section
22:    end if
23:  end if
24: end while
Algorithm 1 ensures that:
• All leaf supernodes are factorized.

• Each leaf supernode is factorized exactly once, because leaves are never pushed into W
in Algorithm 1.

• If the children of a supernode p are each factorized exactly once in Algorithm 1, then p will
be pushed into W when, and only when, the last of its children is factorized, and p will
eventually be factorized, exactly once.
Therefore Algorithm 1 factorizes all supernodes in the elimination tree exactly once, and
the dependencies between supernodes are satisfied.
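For illustration, Algorithm 1 can be realized with Python threads in place of OpenMP (a sketch, not the CHOLMOD implementation; `factorize` is a caller-supplied stand-in for the per-supernode numeric kernel):

```python
import threading

def parallel_factorize(parent, factorize, nthreads=4):
    """Threaded realization of Algorithm 1: W is the shared queue of
    ready supernodes, nchild the shared child counts, and one lock plays
    the role of the critical sections. A thread that finds W empty
    exits; threads still holding work keep pushing parents, so every
    supernode is factorized exactly once."""
    n = len(parent)
    nchild = [0] * n
    for p in parent:
        if p != -1:
            nchild[p] += 1
    W = [s for s in range(n) if nchild[s] == 0]   # the leaves
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                    # critical section
                if not W:
                    return
                s = W.pop()
            factorize(s)
            p = parent[s]
            if p == -1:
                continue                  # s was a root
            with lock:                    # critical section
                nchild[p] -= 1
                if nchild[p] == 0:        # last child done:
                    W.append(p)           # p becomes available

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```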
We implement multithreading on the GPU-accelerated CHOLMOD. When multiple GPUs
are available, we associate each thread with a GPU, so that the threads will run independently
without resource conflict.
If only one GPU is available, or the number of GPUs is smaller than our intended number
of threads, we may split the GPU memory into multiple regions, and let each thread use a
specific GPU memory region. In this case, we will make use of CUDA’s streams. A CUDA
stream is a sequence of operations that execute in issue-order on the GPU. CUDA operations
from the same stream run sequentially, while CUDA operations from different streams may
run concurrently and interleaved. We assign a CUDA stream to each thread, and issue CUDA
operations in each thread to their designated CUDA stream.
We observed up to a 3.5 times increase in performance in our experiments after applying the
multithreading optimization.
3.8 The Subtree Method and the Multilevel Subtree Method
The subtree method was implemented by Rennich et al. [48] in CHOLMOD of SuiteSparse
v4.6.0 beta.
The size of the supernodes of a sparse matrix may greatly affect the efficiency of the sparse
Cholesky factorization. Since each CUDA kernel launch incurs an overhead, if a sparse matrix
contains numerous small supernodes, a traditional supernodal sparse Cholesky algorithm may
require a large number of CUDA kernel launches, each with non-negligible overhead, which
can result in a significant total kernel-launch overhead.
The subtree method addresses this issue by launching CUDA kernels in batches. With
the subtree method, the algorithm sends entire subtrees instead of individual supernodes to
the GPU memory, so that multiple supernodes can exist in the GPU memory at the same
time. The algorithm then collects tasks that are small enough (below a configurable threshold)
from those on-GPU supernodes, and launches their corresponding CUDA kernels in batches.
The batched launch of CUDA kernels significantly reduces overall kernel-launching overhead.
Rennich’s experiments showed that the application of the subtree method nearly eliminates the
CUDA kernel launching overhead [48].
The subtree method also reduces data transmission between the main memory and the
GPU memory. In the baseline factorization algorithm, a descendant supernode copied to
the GPU memory is overwritten after it updates an ancestor (after its contribution block is
computed and assembled to its current ancestor being factorized), therefore each supernode
must be copied to the GPU memory every time it needs to update an ancestor. But with the
subtree algorithm, since multiple supernodes can co-exist in the GPU memory, a supernode
can update multiple ancestors as long as these ancestors are in the GPU memory with it at
the same time. This reduces the host-to-device data transfer cost, and increases the overall
performance.
There tend to be data dependencies between supernodes in the same subtree, therefore
the algorithm cannot factorize all the supernodes in the subtree in one single batch. Instead,
the subtree method divides the subtree into multiple levels, where each level contains mutually
data-independent supernodes, and factorizes the levels in such an order that no level depends
on a later level. The algorithm picks supernodes from the current level, and launches
the CUDA kernels for their factorization in a batch.
Algorithm 2 describes the implementation of the subtree factorization.
In the case when the entire matrix cannot be stored in the GPU memory, there will be
remaining supernodes at higher levels of the elimination tree that are not processed in subtrees.
These supernodes form a tree that we call a "root tree". Rennich's algorithm falls back to
the baseline supernodal factorization algorithm when processing the root tree.
Algorithm 2 Subtree algorithm
1: procedure factorize_subtree
2:   for all supernodes in subtree do
3:     copy supernode data to GPU memory
4:   end for
5:   device synchronization
6:   for all levels in subtree do
7:     for all supernodes in level do
8:       contribution blocks (upper, batched SYRK)
9:       contribution blocks (lower, batched GEMM)
10:    end for
11:    device synchronization
12:    for all supernodes in level do
13:      assemble contribution blocks (batched)
14:    end for
15:    device synchronization
16:    for all supernodes in level do
17:      factorize supernode (batched POTRF)
18:    end for
19:    device synchronization
20:    for all supernodes in level do
21:      triangular solve supernode (batched TRSM)
22:    end for
23:    device synchronization
24:  end for
25:  for all supernodes in subtree do
26:    copy supernode data back to main memory
27:  end for
28: end procedure
Fig. 3-1 depicts the workflow of the subtree algorithm. Note that after factorizing the
subtree, the program may choose between the baseline algorithm and the multilevel subtree
algorithm (which we will describe later).
Rennich’s experiments showed that the subtree method brings a performance increase of
up to 1.9 times [48].
On top of the subtree method we implemented the multilevel subtree method [55] which
applies the subtree method to higher levels of the elimination tree. The multilevel subtree
algorithm tries to create even more subtrees, by applying the same technique as the subtree
algorithm to the entire tree. This takes the form of a while loop which repeats until either
the matrix is fully factorized or no more subtrees can be constructed. The "leaf" subtrees
are processed in the same way as Rennich's subtree algorithm, while "non-leaf" subtrees
must be updated by their descendant subtrees before their factorization, to satisfy the data
dependency.

Figure 3-1. Workflow of single level (a+c) / multilevel (a+b) subtree algorithm
The supernodes on higher levels of the elimination tree are usually large, therefore the
reduction in CUDA kernel launch cost is not significant, but the algorithm can still benefit
from the reduction in the cost of host-to-device transfers. These supernodes also generally
have large numbers of descendants. Our experiments show that, usually after the first loop of
the subtree algorithm, most remaining unfactorized supernodes have a set of descendants
whose total size exceeds the capacity of our GPU memory. Most of these descendants
are supernodes that were already factorized in previous subtree factorizations, but have not
yet updated all of their respective ancestor supernodes. To address this issue, we exclude
supernodes that were already factorized in previous loops from newly constructed subtrees.
Those supernodes will be stored in the main memory, and copied to the GPU memory only
when they are needed in an update operation.
For these inter-subtree updates, we fall back to the original supernodal method, but order
the updates in such a way that each descendant supernode can update multiple ancestors
before it is overwritten by the next descendant, to minimize the host-to-device data transfer
cost. To hide the host-to-device transfers incurred during the inter-subtree updates behind the
on-GPU update operations, two blocks of memory are allocated on the GPU, so that while
the descendant supernode stored in one block of memory is used in the updates, the next
descendant can be copied into the other at the same time.
We maintain a linked list for each supernode s, which contains the supernode's de-
scendants that have been factorized but have not yet been used to update s. At the start of the
factorization of a "non-leaf" subtree, we iterate through these lists and issue update tasks
for the supernodes in them. For each supernode d in these lists, we find its ancestors in
the current subtree and issue an update task for each of these ancestors. d is then copied to
the GPU memory and is used to update the ancestors mentioned above. Then we find d's
lowest-level ancestor that is also an ancestor of the current subtree's root (or one of the roots
if the "subtree" is actually a forest), and (if it exists and needs to be updated by d) put d in
its list.
After the inter-subtree updates, the ”non-leaf” subtree should have become a ”leaf”
subtree, and be ready for processing with the subtree method for the first-level subtrees.
Algorithm 3 describes the implementation of the multilevel subtree factorization algorithm
on higher levels of the elimination tree.
Fig. 3-1 also depicts the workflow of the second level of the multilevel subtree algorithm.
3.9 Pipelining
The pipelining technique improves the efficiency of the sparse matrix factorization by
keeping different components of the machine busy simultaneously. The factorization of a sparse
Algorithm 3 Multilevel subtree algorithm
1: procedure factorize_subtree
2:   for all supernodes in subtree do
3:     copy supernode data to GPU memory
4:   end for
5:   device synchronization
6:   for all supernodes in subtree do
7:     ancestor := supernode
8:     while ancestor in subtree do
9:       ancestor := ancestor.parent
10:    end while
11:    copy descendant to GPU memory
12:    use descendant to update nodes in subtree
13:    if ancestor != nil then
14:      put descendant in ancestor.update_list
15:    end if
16:  end for
17:  device synchronization
18:  for all levels in subtree do
19:    for all supernodes in level do
20:      contribution blocks (upper, batched SYRK)
21:      contribution blocks (lower, batched GEMM)
22:    end for
23:    device synchronization
24:    for all supernodes in level do
25:      assemble contribution blocks (batched)
26:    end for
27:    device synchronization
28:    for all supernodes in level do
29:      factorize supernode (batched POTRF)
30:    end for
31:    device synchronization
32:    for all supernodes in level do
33:      triangular solve supernode (batched TRSM)
34:    end for
35:    device synchronization
36:  end for
37:  for all supernodes in subtree do
38:    copy supernode data back to main memory
39:  end for
40: end procedure
matrix involves the CPUs, the GPUs, and the DMA engine, and these components can work in
parallel.
There are two layers of possible pipelines. One is a pipeline within the subtrees, which
overlaps the factorization and the copyback of different levels of the subtree. The other is
a pipeline that overlaps the factorization of the subtrees with the flushing of the pinned host
memory buffers.
In the subtree algorithm, after a subtree is copied from the main memory to the GPU
memory, the algorithm processes the subtree level by level. Each level must be factorized and
then copied back to the main memory. Since the factorization of a level is performed by the
GPU cores while the copyback is done by the DMA engine, it is possible to use a pipeline. We
implemented the pipeline to let the copyback of a level and the factorization of the next level
run simultaneously. It was done by using different CUDA streams for the factorization and the
copyback operations, and adding synchronization barriers to ensure data consistency.
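The level pipeline can be illustrated in plain Python, with a single worker thread standing in for the copyback stream and the caller standing in for the compute stream (illustrative only; the real code uses CUDA streams and events, and `factorize_level` / `copy_back` are caller-supplied stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_levels(levels, factorize_level, copy_back):
    """Software pipeline over subtree levels: while level i+1 is being
    factorized, level i is copied back by the worker thread; waiting on
    the pending future plays the role of the synchronization barrier
    that guarantees data consistency."""
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = None
        for lv in levels:
            factorize_level(lv)           # compute current level
            if pending is not None:
                pending.result()          # wait for previous copyback
            pending = dma.submit(copy_back, lv)   # overlap with next level
        if pending is not None:
            pending.result()              # drain the last copyback
```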
Fig. 3-2 describes the performance gain from utilizing pipelines. The pipeline effectively
has the CUDA kernels and the copyback run in parallel, thus hiding the device-to-host transfers
behind on-GPU operations and reducing the total time consumption.
Figure 3-2. Using pipeline in the factorization of a subtree
A subtree that has been copied back is not directly put into its destination memory region.
Instead, a piece of pinned host memory is used as a buffer. Pinned host memory is necessary
for asynchronous PCIe transfers, but the allocation of pinned host memory is very time consuming.
The buffer must be flushed before it is overwritten by the next subtree. Since the flushing
of the buffer is done by the CPU while the factorization and the copyback of the subtree are
done by the GPU and the DMA engine, we implemented another pipeline so that the buffer
flushing of a subtree overlaps with the factorization and the copyback (through PCIe) of the
next subtree.
The modified subtree algorithm with pipelining is described in Algorithm 4.
Algorithm 4 Pipelining in subtree algorithm
1: procedure factorize_subtree_pipelined
2:   for all supernodes in subtree do
3:     copy supernode data to GPU memory
4:   end for
5:   device synchronization
6:   for all levels in subtree do
7:     for all supernodes in level do
8:       contribution blocks (upper, batched SYRK)
9:       contribution blocks (lower, batched GEMM)
10:    end for
11:    event synchronization
12:    copy the previous level from buffer to destination
13:    device synchronization
14:    for all supernodes in level do
15:      assemble contribution blocks (batched)
16:    end for
17:    device synchronization
18:    for all supernodes in level do
19:      factorize supernode (batched POTRF)
20:    end for
21:    device synchronization
22:    for all supernodes in level do
23:      triangular solve supernode (batched TRSM)
24:    end for
25:    device synchronization
26:    for all supernodes in level do
27:      copy supernode data back to buffer
28:    end for
29:    record event
30:  end for
31:  copy the last level from buffer to destination
32: end procedure
3.10 The Batched Sparse Cholesky Factorization
A batched matrix factorization mechanism may improve the average efficiency of
the matrix factorization by eliminating the need for repeated resource allocations and
deallocations, and by increasing the program's concurrency.
In this section we will introduce two ways to implement the batched Cholesky factoriza-
tion.
3.10.1 The Merge-and-Factorize Approach
A trick can be implemented on top of the non-batched version of the sparse Cholesky
factorization algorithm to allow batched Cholesky factorization of multiple matrices. The
Cholesky factorization is the decomposition of a symmetric positive-definite matrix A into the
product of a lower triangular matrix L and its transpose (A = LLT ). Given a list of symmetric
positive-definite matrices A1, A2, · · · , Ak, doing their Cholesky factorization is equivalent to
factorizing
\[ A = \begin{pmatrix} A_1 & & & \\ & A_2 & & \\ & & \ddots & \\ & & & A_k \end{pmatrix}. \]
A is also a symmetric positive-definite matrix. Letting A = LLT (L lower triangular), it is
easy to see that
\[ L = \begin{pmatrix} L_1 & & & \\ & L_2 & & \\ & & \ddots & \\ & & & L_k \end{pmatrix} \]
where Ai = Li Li^T for every i from 1 to k.
This implementation of the batched Cholesky factorization is very straightforward: we
arrange the matrices to be factorized diagonally to form a larger matrix, and factorize the new
matrix instead.
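A sketch of this merge-and-factorize trick (dense, pure-Python stand-ins for illustration): arranging the inputs along the diagonal and factorizing the merged matrix yields the individual factors as the diagonal blocks of L.

```python
import math

def block_diag(mats):
    """Arrange A1..Ak along the diagonal of one larger matrix."""
    n = sum(len(A) for A in mats)
    M = [[0.0] * n for _ in range(n)]
    off = 0
    for A in mats:
        m = len(A)
        for i in range(m):
            for j in range(m):
                M[off + i][off + j] = A[i][j]
        off += m
    return M

def cholesky(A):
    """Plain dense Cholesky, standing in for the sparse factorization."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L
```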
The elimination tree of A will be a forest composed of the elimination trees of A1, A2, · · · , Ak.
Since A1, A2, · · · , Ak have no dependencies on each other, it is possible to factorize them in
parallel. More precisely, since the supernodes and subtrees from different matrices have no data
dependencies, if multiple GPUs (or multiple threads) are present, the batched factorization
allows greater flexibility in the scheduling of the workflow than serial non-batched factorizations.
The batched sparse Cholesky factorization also works well with the subtree algorithm.
The merging of two sparse matrices can result in subtrees containing supernodes from different
matrices, allowing larger subtrees to be constructed.
Our previous experiments have shown up to a 140% improvement in performance with this
merge-and-factorize type of batched factorization scheme.
3.10.2 The Normal Approach
Unfortunately the previous trick works only for Cholesky and not for QR, and its memory
consumption is high. To expand the availability of batching to the sparse QR factorization,
and reduce the memory consumption, we need to reimplement the batched sparse matrix
factorization algorithm. This other implementation of the batched factorization runs the
symbolic analysis for each of the input matrices and construct their respective elimination
trees, instead of merging those matrices.
We implement this generalized batched factorization scheme by exploiting OpenMP. A
number of threads (equal to the maximum number of sparse matrices being factorized) are
created. Each thread tries to fetch the next available sparse matrix in the list and factorize
it, until all sparse matrices in the list have been factorized.
To maximize the utilization of GPUs while avoiding conflicts, each thread will try to make
use of all available GPUs. Each thread also spawns sub-threads so that the factorizations
of its supernodes run in parallel. After selecting an available supernode and before actually
factorizing it, the sub-thread will iterate through the list of GPUs and try to reserve a GPU. If
all GPUs are busy, the sub-thread will wait until one is available.
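The GPU reservation step can be sketched with a counting semaphore plus per-GPU locks (hypothetical helper names; the text does not specify the actual mechanism):

```python
import threading

def make_gpu_pool(ngpu):
    """Reservation sketch for the batched scheme: a sub-thread scans the
    list of GPUs and reserves a free one before factorizing its
    supernode; if all are busy, it blocks until one is released."""
    sem = threading.Semaphore(ngpu)
    locks = [threading.Lock() for _ in range(ngpu)]

    def acquire():
        sem.acquire()                     # wait until some GPU is free
        for g, lk in enumerate(locks):
            if lk.acquire(blocking=False):
                return g                  # reserved GPU g
        raise RuntimeError("unreachable: semaphore guarantees a free GPU")

    def release(g):
        locks[g].release()
        sem.release()

    return acquire, release
```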
Our experiments show a performance increase of up to 37.53% with the application of this
type of batched factorization.
3.11 Experiment Results
The testcase matrices we used are from the SuiteSparse Matrix Collection [13], and are
listed in Table 3-1.
Table 3-1. Test matrices
matrix       problem type                  dimension   nonzeros
Emilia_923   structural                    923,136     40,373,538
Fault_639    structural                    638,802     27,245,944
Flan_1565    structural                    1,564,794   114,165,372
Geo_1438     structural                    1,437,960   60,236,322
Hook_1498    structural                    1,498,023   59,374,451
Serena       structural                    1,391,349   64,131,971
StocF-1465   computational fluid dynamics  1,465,137   21,005,389
audikw_1     structural                    943,695     77,651,847
bone010      model reduction               986,703     47,851,783
nd24k        2D/3D problem                 72,000      28,715,634
Due to the availability of equipment, experiments were carried out on different platforms.
3.11.1 Multithreading
The experiments for the multithreading optimization were performed on a platform with a
dual-socket Intel(R) Xeon(R) CPU E5-2695 v2 at 2.4 GHz, eight NVIDIA Tesla K40m GPUs
each with 2880 CUDA cores, and 12 GB of physical memory.
The experiments were performed with different numbers of threads: 1, 2, 4, 8, 12 threads.
Each thread was assigned a CUDA stream, so the number of CUDA streams was equal to the
number of threads.
Figs. 3-3, 3-4, and 3-5 show, respectively, the time, power, and energy cost for different
numbers of threads. Our experiments show an increase in performance with multithreading
until a threshold is reached; beyond that, the performance starts to decline as more threads
are used.

In our experiments, the optimal performance (minimal time or minimal energy) was
reached mostly when 4 or 8 threads were used. The factorization time was reduced by up to
73.06%, and the energy consumption was reduced by up to 73.49%. However, there were no
significant changes in the GPU's power consumption.
Figure 3-3. Factorization time for sparse Cholesky with / without multithreading
Figure 3-4. Average power consumption for sparse Cholesky with / without multithreading
Figure 3-5. Energy consumption for sparse Cholesky with / without multithreading
3.11.2 Pipelining
We applied pipelines to Rennich’s subtree method, before the multilevel subtree was
implemented. Our experiments were performed on a platform with a dual socket Intel(R)
Xeon(R) CPU E5-2695 v2 at 2.4 GHz, four NVIDIA Tesla P100 GPUs each with 3840 CUDA
cores, and 16 GB of physical memory.
We compare the performance of the GPU-accelerated supernodal sparse Cholesky algorithm
with different optimizations, including:
• The baseline supernodal algorithm (SuiteSparse 4.5.3)
• Supernodal algorithm with multithreading
• The subtree method (SuiteSparse 4.6.0-beta)
• The subtree method with multithreading and pipelines
Fig. 3-6, 3-7, 3-8 show the performance comparison in terms of Gflops, power, and
energy.
We see that multithreading increases the efficiency of the algorithm by up to 2.5 times.
Though there are exceptions, the performance usually increases when more threads are used.
Figure 3-6. Cholesky factorization performance for sparse matrices
Figure 3-7. Power Consumption of sparse Cholesky factorization
Figure 3-8. Energy Consumption of sparse Cholesky factorization
The subtree method increases the efficiency of the algorithm by up to 3 times. Using pipelines
on top of the subtree method increases the performance by an additional 10% to 25%. We
expected to obtain the best performance when all three techniques are combined, but in
practice multithreading is not very effective when the subtree technique is present: the number
of threads quickly hits the threshold when the subtree algorithm is used. The experiments
show that the best performance is reached when we use the pipelined subtree algorithm with 1
or 2 GPU threads.
The experiments show that the power increases as the performance of the algorithm
improves. This is expected, because higher performance usually indicates more intense
computation on the GPU. Despite the higher power consumption, the total energy consumed
actually decreases when the performance is better, because the decrease in factorization time
offsets the increase in power. It is also observed that pipelining does not have a significant
impact on the total energy consumption.
3.11.3 Multilevel Subtree Method
The multilevel subtree method was implemented on top of the pipelined subtree method.
Our experiments for the multilevel subtree method were performed on a platform with a
dual-socket AMD Opteron 6168 CPU and two NVIDIA Tesla K20c GPUs.
Fig. 3-9 shows the performance when only one GPU is used. It can be seen that the
multilevel subtree algorithm achieves up to 2.43 times the performance of the baseline single-
level subtree algorithm, and up to 1.42 times the performance of the pipeline enhanced subtree
algorithm. For 9 of the 10 testcase matrices, using the multilevel subtree algorithm along with
pipeline led to enhancements in performance compared to the single-level subtree algorithm.
The average speedup is 1.59 times over the single-level subtree algorithm.
Figure 3-9. Performance comparison between the single-level subtree algorithm and the multilevel subtree algorithm on a single GPU
The multilevel subtree algorithm does not bring a performance increase for all testcase
matrices. Geo 1438 is an example in which the application of the multilevel subtree method
results in a performance loss. The subtree method is only effective when there are multiple
supernodes in the subtree, so if there is only one supernode in a subtree, there will be no
performance gain, and the extra overhead introduced in the subtree method may slow the
entire algorithm down.
Fig. 3-10 depicts the structure of the matrix Geo 1438, where each triangle stands for
either a subtree or the root tree. The number in each triangle shows the number of supernodes
in the subtree. Since the subtrees in the third and fourth levels each contain only one
supernode, they are not able to provide any performance boost. In this case, the only possible
performance gain over the single-level subtree algorithm lies in the second level. However, the
experimental results show that it is not enough to offset the loss in the third and fourth
levels.
Figure 3-10. Structure of matrix Geo 1438
Fig. 3-11 adds the performance data when both GPUs are used. With two GPUs, the
performance of the multilevel subtree algorithm is worse than that of the single-level subtree
algorithm. The cause of this has not yet been confirmed.
Figure 3-11. Performance comparison between the single-level subtree algorithm and the multilevel subtree algorithm on one GPU and two GPUs
3.11.4 Batched Sparse Cholesky Factorization
We implemented batched sparse Cholesky factorization on top of the pipelined multilevel
subtree method.
The experiments were performed on a platform with a dual socket Intel(R) Xeon(R) CPU
E5-2695 v2 at 2.4 GHz and eight NVIDIA Tesla K40m GPUs. We compare the performance of
sequential factorization and the performance of batched factorization when factorizing multiple
matrices.
Fig. 3-12 and Fig. 3-13 show the experimental results for batched sparse Cholesky
factorization (data for batched factorization of 4× matrix Serena are missing). With 2 GPUs
and 2 matrices to factorize, the batched factorization is able to improve the performance by
up to 53.9%. With 4 GPUs and 4 matrices to factorize, the batched factorization is able to
improve the performance by up to 125.0%.
Figure 3-12. Batched Cholesky factorization performance versus sequential matrix factorization on GPUs (2 GPUs)
3.12 Conclusions
In this chapter, we present several optimizations for CHOLMOD, a supernodal sparse
Cholesky factorization algorithm. Our optimizations include multithreading, pipelining, the
multilevel subtree method, and batched factorization. Each of these optimizations provides a
Figure 3-13. Batched Cholesky factorization performance versus sequential matrix factorization on GPUs (4 GPUs)
performance increase, and when used in conjunction with the subtree method [48], they offer a
significant increase in the efficiency of the factorization.
CHAPTER 4
SPARSE QR FACTORIZATION
The QR factorization can be utilized to solve problems in scientific computing. It is the
method of choice for sparse least squares problems, underdetermined systems, and for solving
sparse linear systems when A is very ill-conditioned [15].
4.1 Introduction
The QR factorization is the decomposition of a matrix A into the product A = QR, where
Q is orthogonal and R is upper triangular.
An orthogonal matrix is a matrix whose rows (or columns) form an orthonormal basis.
An orthogonal matrix Q has the property QQT = I; therefore, Q−1 = QT . This property makes
the QR factorization useful for solving the linear system Ax = b: if A = QR, where Q
is orthogonal and R is upper triangular, then the system can be solved by computing
y = QT b and then solving Rx = y.
The QR factorization is often used to solve the linear least squares problem. Let A be a
given m × n matrix and b a given vector; the linear least squares problem is to determine a
vector x such that ∥b − Ax∥ has minimum value [25].
Suppose A is an m× n matrix with rank r, and A has QR factorization A = QR, then we
have rank(R) = r. Let vector c = QT b, then ∥b− Ax∥ = ∥Qc−QRx∥ = ∥c−Rx∥.
When m ≤ n and r = m, the minimum value of ∥c − Rx∥ is 0. The linear least squares
problem has a unique solution if and only if r = n.
In the subsequent text we will assume that m ≥ n and r = n.
If m ≥ n and r = n, let c = [c1; c2], where c1 comprises the first n entries of c, and let
R = [R1; O], where R1 comprises the first n rows of R and the entries of O are all zero. Then

∥c − Rx∥ = ∥[c1 − R1x; c2]∥

which attains its minimum value ∥c2∥ when x = R1−1 c1.
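The least squares procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch, not part of the optimized codes discussed later; the test matrix is random and np.linalg.qr stands in for any QR routine.

```python
import numpy as np

# Solve min ||b - A x|| via QR, assuming m >= n and rank(A) = n.
rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, R = np.linalg.qr(A, mode="complete")  # Q is m x m, R is m x n
c = Q.T @ b
c1, R1 = c[:n], R[:n, :]                 # first n entries of c, first n rows of R
x = np.linalg.solve(R1, c1)              # x = R1^{-1} c1

# The residual norm equals ||c2||, and x agrees with a reference solver.
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
assert np.allclose(np.linalg.norm(b - A @ x), np.linalg.norm(c[n:]))
```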
When solving the linear least squares problem, there is no need to store Q [10]. Since the
orthogonal Q represents matrix row operations, and the calculation of R is essentially the
application of a series of matrix row operations to A, the same row operations can be applied
to b during the QR factorization, and Q can be discarded thereafter.
Algorithms for the QR factorization are typically based on Gram-Schmidt orthogonalization
[4] [60], Householder reflection [51], or Givens rotation [17] [21].
The Gram-Schmidt orthogonalization computes the QR factorization by orthogonalizing
the columns of A one at a time. While effective for dense QR factorizations, the Gram-Schmidt
method has the disadvantage of generating numerous fill-in entries, making it undesirable for
sparse problems. It also requires storing the matrix Q explicitly.
QR factorization algorithms based on the Householder reflection eliminate the subdiagonal
elements of A one column at a time. The Householder reflection method is able to represent
Q in a much sparser form than the Gram-Schmidt method, and when using Householder-
reflection-based QR factorization to solve Ax = b, Q can be discarded to save space by
applying the transformations to b as Q is computed.
The Givens rotation approach is believed to outperform both the Gram-Schmidt orthog-
onalization and the Householder reflection when the matrix A is very sparse, due to its ability
to more effectively exploit the matrix's sparsity [27]. However, this advantage disappears when
the matrix is dense enough. For a full matrix, the Givens rotation based QR algorithm requires
50% more floating-point operations than its Householder reflection counterpart [9].
4.2 Gram-Schmidt Orthogonalization
The Gram-Schmidt orthogonalization, also called the Gram-Schmidt process, takes a
set of nonorthogonal linearly independent vectors and constructs an orthogonal basis. When
the basis is normalized so that it is orthonormal, the process is called Gram-Schmidt
orthonormalization.
Let A be an n × n full-rank matrix, A = [a1 a2 · · · an], where the aj are n × 1 column
vectors. Define

u1 = a1,  e1 = u1 / ∥u1∥
u2 = a2 − (a2 · e1)e1,  e2 = u2 / ∥u2∥
· · ·
uj = aj − ∑_{k=1}^{j−1} (aj · ek)ek,  ej = uj / ∥uj∥

The QR factorization of A can be derived from the Gram-Schmidt orthogonalization:

Q = [e1 e2 · · · en]

R = [a1·e1  a2·e1  · · ·  an·e1;
            a2·e2  · · ·  an·e2;
                     ⋱       ⋮ ;
                          an·en]

that is, R is upper triangular with R[k][j] = aj · ek for k ≤ j.
The columns of Q form an orthonormal basis of the column space of A, and R is upper
triangular. For each column j,

(QR)[1 : n][j] = ∑_{k=1}^{j} (aj · ek)ek = aj

so QR = A.
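The recurrence above can be implemented directly in NumPy. The sketch below (the helper name gram_schmidt_qr is ours) is the classical, unmodified process for a small dense matrix; it is shown only to make the construction of Q and R concrete.

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR of a full-rank n x n matrix A."""
    n = A.shape[1]
    Q = np.zeros_like(A, dtype=float)
    R = np.zeros((n, n))
    for j in range(n):
        u = A[:, j].astype(float).copy()
        for k in range(j):
            R[k, j] = A[:, j] @ Q[:, k]   # R[k][j] = a_j . e_k
            u -= R[k, j] * Q[:, k]        # subtract the projection onto e_k
        R[j, j] = np.linalg.norm(u)       # equals a_j . e_j
        Q[:, j] = u / R[j, j]
    return Q, R

A = np.array([[2.0, 1.0], [1.0, 3.0]])
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q @ R, A)              # A = QR
assert np.allclose(Q.T @ Q, np.eye(2))    # columns of Q are orthonormal
```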
Unlike the Givens rotation (Sect. 4.3) based or Householder reflection (Sect. 4.5) based
QR, the Gram-Schmidt orthogonalization based QR does require storing the orthogonal matrix
Q, because Q is explicitly constructed during the factorization.
One disadvantage of the Gram-Schmidt orthogonalization based QR is that it tends
to generate large amounts of fill-in entries, which makes it not very ideal for sparse QR
factorizations.
4.3 Givens Rotation
The Givens rotation is a 2 × 2 orthogonal matrix

G = [c s; −s c],  c² + s² = 1

that can be applied to two rows of a matrix to zero out a selected entry [9]:

[c s; −s c] [0 · · · 0 ai,j · · · ; 0 · · · 0 ai′,j · · ·] = [0 · · · 0 a′i,j · · · ; 0 · · · 0 a′i′,j · · ·]

To perform the Givens-rotation-based QR factorization, we require a′i′,j = 0, i.e.,

−s ai,j + c ai′,j = 0

which is satisfied by

c = ai,j / √(ai,j² + ai′,j²),  s = ai′,j / √(ai,j² + ai′,j²)
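As a small illustration (not taken from any of the codes discussed in this work), the computation of c and s and the zeroing of the lower entry can be sketched as:

```python
import math
import numpy as np

def givens(a, b):
    """Return c, s such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = math.hypot(a, b)          # sqrt(a^2 + b^2), overflow-safe
    if r == 0.0:
        return 1.0, 0.0           # nothing to eliminate
    return a / r, b / r

a, b = 3.0, 4.0
c, s = givens(a, b)
G = np.array([[c, s], [-s, c]])
y = G @ np.array([a, b])
assert np.allclose(y, [5.0, 0.0])        # the entry b is zeroed; r = hypot(a, b)
assert np.allclose(G @ G.T, np.eye(2))   # G is orthogonal
```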
The advantage of the Givens rotation based QR factorization (over Householder reflec-
tion) is that it can be implemented without square root operations. However, for a full matrix,
the Givens rotation based QR factorization requires 50% more floating-point operations than
its Householder reflection based counterpart [9].
It is believed that the Givens rotation based QR factorization provides better performance
than the Gram-Schmidt orthogonalization based QR and the Householder reflection based QR
when the matrix is very sparse. The Givens rotation based QR also usually generates fewer
intermediate fill-in entries than the latter two.
George et al. devised the first Givens rotation based QR factorization [15] [21]. Unlike
traditional QR factorizations based on Gram-Schmidt orthogonalization or Householder
reflection, which require access to the entire matrix during the factorization process, their
QR factorization algorithm can process the matrix row by row, which potentially reduces the
memory requirement. However, the proposed method requires more time to solve the linear
least squares problem than the method based on the Cholesky factorization.
Gentleman and Kung proposed implementing the Givens-rotation-based matrix triangularization
(including orthogonal triangularization, i.e., QR factorization) in the form of a triangular
systolic array [18]. A systolic array is a homogeneous network of tightly coupled data
processing units. To perform the QR factorization of an m × n matrix (m ≥ n), an n × n
upper triangular systolic array is needed. The triangular systolic array comprises boundary
(diagonal) cells and internal (non-diagonal) cells. The internal cells mainly perform
multiplications and additions, whereas the boundary cells mainly perform divisions and
reciprocals.
McWhirter introduced an improved version [40] of Kung and Gentleman's QR factorization
algorithm that computes the least-squares residual more simply and directly, without having
to solve the corresponding triangular linear system. The modified version is also more stable
and robust due to this property.
4.4 Blocked Givens Rotation

The blocked Givens rotation is an orthogonal matrix of the form

G = [C1 S1; S2 C2]

where C1 is p × p and C2 is q × q.

Halleck proposed another way to represent G:

G = [A ABT; −CB C]

where A is p × p and C is q × q. Then

GGT = [A ABT; −CB C] [AT −BTCT; BAT CT] = [AAT + ABTBAT  O; O  CCT + CBBTCT]

so G is orthogonal if and only if

AAT + ABTBAT = A(Ip + BTB)AT = Ip

and

CCT + CBBTCT = C(Iq + BBT)CT = Iq.

Assume A and C have full rank. Then

Ip + BTB = A−1(AT)−1 = A−1(A−1)T

so A−1 is the lower triangular Cholesky factor of Ip + BTB, and A can be easily obtained by
solving A−1A = Ip. Similarly, C−1 is the Cholesky factor of Iq + BBT, and C is obtained by
solving C−1C = Iq.

Let V be a block vector, V = [X; Y], where X has p rows and Y has q rows. The blocked
Givens rotation zeros out Y when

[X′; O] = GV = [A ABT; −CB C] [X; Y] = [AX + ABTY; C(Y − BX)]

i.e., when C(Y − BX) = O. A solution to this equation may be obtained if we let BX = Y.
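The Cholesky construction of A and C can be verified numerically. The sketch below (our own illustration, assuming B is q × p so that the blocks of G are conformable) builds G from a random B and checks its orthogonality:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 2, 3
B = rng.standard_normal((q, p))          # B is q x p

# A^{-1} is the lower Cholesky factor of I_p + B^T B, so
# A (I_p + B^T B) A^T = I_p; likewise C^{-1} is the factor of I_q + B B^T.
A = np.linalg.inv(np.linalg.cholesky(np.eye(p) + B.T @ B))
C = np.linalg.inv(np.linalg.cholesky(np.eye(q) + B @ B.T))

G = np.block([[A, A @ B.T], [-C @ B, C]])
assert np.allclose(G @ G.T, np.eye(p + q))   # G is orthogonal
```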
4.5 Householder Reflection
A Householder reflection is an orthogonal matrix of the form H = I − βvvT, where β is a
scalar and v is a column vector [9]. We assume that H is m × m and v is m × 1.

HHT = I − 2βvvT + β²∥v∥²vvT = I + β(β∥v∥² − 2)vvT

H is orthogonal if and only if HHT = I, i.e., v = 0, β = 0, or β∥v∥² = 2.

For an m × 1 column vector x, it is possible to find an m × m Householder reflection such
that Hx = [∥x∥; O].

Proof: When ∥x∥ = 0, the solution is trivial. When ∥x∥ > 0,

Hx = x − β(∑_{i=1}^{m} vixi) v = x − β(v · x)v

so we require

β(v · x)v = x − [∥x∥; O]

Let β∥v∥² = 2, so that ∥v∥ = √(2/β). Taking norms of both sides gives

β|v · x| √(2/β) = ∥x − [∥x∥; O]∥

β can be solved from this equation, v can consequently be determined, and H is obtained from
β and v.
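A direct way to construct such a reflection is to take v = x − ∥x∥e1 with β = 2/∥v∥², which satisfies the orthogonality condition β∥v∥² = 2. The sketch below (our illustration; production codes choose the sign of the shift to avoid cancellation when x is nearly parallel to e1) verifies the property Hx = [∥x∥; O]:

```python
import numpy as np

def householder(x):
    """Return beta, v with (I - beta v v^T) x = [||x||, 0, ..., 0]^T."""
    norm_x = np.linalg.norm(x)
    v = x.astype(float).copy()
    v[0] -= norm_x              # v = x - ||x|| e1 (sign not adjusted here)
    vnorm2 = v @ v
    if vnorm2 == 0.0:           # x is already a multiple of e1
        return 0.0, v
    beta = 2.0 / vnorm2         # enforces beta ||v||^2 = 2, so H is orthogonal
    return beta, v

x = np.array([3.0, 4.0, 0.0])
beta, v = householder(x)
H = np.eye(3) - beta * np.outer(v, v)
assert np.allclose(H @ x, [5.0, 0.0, 0.0])   # Hx = [||x||, 0, 0]
assert np.allclose(H @ H.T, np.eye(3))       # H is orthogonal
```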
The QR factorization may be accomplished by applying a series of Householder
transformations.

Let A1 = A = [a1 · · ·], where a1 is the first column of A1. Find a Householder reflection
H1 such that H1a1 = [∥a1∥; O]. Then

H1A1 = [r11 R12; O A2]

where r11 is a scalar, R12 is 1 × (n − 1), and A2 is (m − 1) × (n − 1).

A similar transformation can be applied to A2, and iteratively to all Aj with j < n. The
Householder reflections Hj take the form

Hj = I_{m+1−j} − βj vj vjT

where I_{m+1−j} is the (m + 1 − j) × (m + 1 − j) identity matrix, βj is a scalar, and vj is an
(m + 1 − j) × 1 column vector.

It is easy to see that the matrix resulting after the (n − 1)th iteration is upper triangular;
denote it by R. Let

Uj = [I_{j−1} O; O Hj]

where Ij is the j × j identity matrix. Let U = ∏_{j=m+1−n}^{m} U_{m+1−j}; then R = UA.

Let Q = UT = ∏_{j=1}^{n} Uj; then Q is an orthogonal matrix and A = QR.
One advantage of the Householder reflection (over the Givens rotation) is that it is easier
to store the Householder reflection matrix [15], because the Householder reflection is generated
from a scalar β and a column vector v, so it is not necessary to store the entire Householder
reflection matrix.
Based on the Householder reflection are the left-looking QR and the right-looking QR,
which differ in the ordering of their workflow.

The right-looking QR applies each Householder reflection to the entire matrix. In a
right-looking QR algorithm, the matrix A goes through the following transformations:

A → U1A → U2U1A → · · · → (∏_{j=m+1−n}^{m} U_{m+1−j}) A = R

The right-looking QR is the foundation of the multifrontal method for QR factorization
[9].
The left-looking QR, however, only updates one column of the matrix during each iteration.
It is clear that R[1 : m][j] = (UA)[1 : m][j], so

R[1 : m][j] = [R[1 : j][j]; O] = [U[1 : j][1 : m] A[1 : m][j]; O]

Moreover,

U[1 : j][1 : m] = ((∏_{k=m+1−n}^{m−j} U_{m+1−k}) (∏_{k=m+1−j}^{m} U_{m+1−k}))[1 : j][1 : m]
= ((∏_{k=m+1−n}^{m−j} [I_{m−k} O; O H_{m+1−k}]) (∏_{k=m+1−j}^{m} U_{m+1−k}))[1 : j][1 : m]
= (∏_{k=m+1−j}^{m} U_{m+1−k})[1 : j][1 : m]

because each factor U_{m+1−k} with m + 1 − k > j leaves the first j rows unchanged. Hence
column j of R depends only on the first j Householder transformations.
The left-looking QR algorithm uses a lazy updating scheme: it delays the application of
the 1st through (j − 1)th Householder transformations to the jth column until the jth
iteration, just like the left-looking Cholesky.
Blocked forms of the Householder-reflection-based QR factorization also exist. Like the
blocked Cholesky, the blocked QR decomposition factorizes the matrix in terms of column
panels. The Householder reflections for blocked QR factorization take the form

H = I − V BV T

where I is the n × n identity matrix, V is an n × k matrix (k is the width of the column panel),
and B is a k × k square upper triangular matrix.
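The triangular factor B (often called T in LAPACK's compact WY representation) can be built column by column from the individual Householder vectors. The recurrence in the sketch below is the standard one, not taken from the text; it verifies that I − VBVT equals the product of the k individual reflections:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 3

# k Householder vectors stored as the columns of a lower-trapezoidal V,
# with unit diagonal (the usual compact storage convention).
V = np.tril(rng.standard_normal((n, k)))
for j in range(k):
    V[j, j] = 1.0
betas = 2.0 / (V * V).sum(axis=0)      # beta_j = 2 / ||v_j||^2

# Build the k x k upper triangular factor B column by column:
#   B[:j, j] = -beta_j * B[:j, :j] @ (V[:, :j]^T v_j),  B[j, j] = beta_j
B = np.zeros((k, k))
for j in range(k):
    B[:j, j] = -betas[j] * (B[:j, :j] @ (V[:, :j].T @ V[:, j]))
    B[j, j] = betas[j]

H_blocked = np.eye(n) - V @ B @ V.T
H_product = np.eye(n)
for j in range(k):
    H_product = H_product @ (np.eye(n) - betas[j] * np.outer(V[:, j], V[:, j]))
assert np.allclose(H_blocked, H_product)   # I - V B V^T = H_1 H_2 ... H_k
```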
Though it is generally believed that the Givens rotation method outperforms the House-
holder reflection method for QR factorizations of very sparse matrices, George et al. developed
a way of applying Householder reflections [22] which is competitive or superior to Givens rota-
tions even for sparse matrices. This algorithm is an extension of Liu’s row merging scheme for
sparse Givens transformations [35].
Davis developed a multithreaded sparse multifrontal QR factorization algorithm [10] that
is based on the Householder reflection.
Yeralan et al. extended the work of Davis and presented an enhanced version [61] of the
above algorithm that achieves higher performance with GPUs enabled. They also introduced
the bucket scheduler algorithm, which exploits parallelism between different rows during dense
QR factorizations.
4.6 The Multifrontal Sparse QR Factorization
This section introduces the SPQR module [10] [61] from SuiteSparse [8], on which our
work in this chapter is based.
Sparse matrix factorizations differ from their dense counterparts in that the sparse
factorization algorithm may exploit the zeros in the matrix to reduce the total number of
floating-point operations. The factorization of a sparse matrix consists of the (symbolic)
analysis phase and the (numerical) factorization phase. During the analysis phase, the nonzero
pattern of the matrix is explored, and the symbolic structure of the matrix is represented by
the elimination tree [36]. The elimination tree is a key data structure of the sparse matrix
factorization. It not only provides the structural information of the matrix but also directs the
workflow of the factorization.
Figure 4-1. The elimination tree of a sparse matrix
Figure 4-2. A possible scheduling of fronts
Figure 4-3. Stages in the workflow
Figure 4-4. Stages in the elimination tree
Fig. 4-1 depicts a matrix’s symbolic pattern (x stands for nonzeros and fill-in) and its
elimination tree. In the multifrontal sparse QR factorization, the sparse matrix is divided into
multiple fronts. Each front is a dense matrix that will be factorized by dense QR factorization
algorithm. The nodes of the elimination tree each represent a front, and the data dependency
between the fronts is represented by the edges.
SPQR schedules the factorizations of the fronts according to the elimination tree so that
a front is not factorized until all of its children have been factorized. The factorization of each
front generates a contribution block, and prior to a front’s factorization, all contribution blocks
from its children must be assembled into the front.
Figure 4-5. Reducing PCIe communications with stages
Fig. 4-2 shows a possible scheduling of the fronts for the matrix in Fig. 4-1. Each front
must go through the assembly and the factorization. Fig. 4-6 shows the child fronts after their
factorizations and the assembly of the children’s contribution block into the parent.
Figure 4-6. The factorization and the assembly operations
In order to reduce the cost of PCIe communications, SPQR groups the fronts into stages.
A stage is a set of adjacent fronts in the scheduled workflow, as shown in Fig. 4-3. PCIe
transfer cost can be reduced when a stage contains fronts with data dependency (Fig. 4-4)
because, in this case, there will be no need to use the main memory as a temporary buffer to
hold the contribution blocks (Fig. 4-5).
4.7 The Arithmetic CUDA Kernels
SPQR essentially divides the sparse QR factorization into a number of assembly oper-
ations and dense QR factorizations. The dense QR factorizations are further reduced into
numerous even smaller dense QR factorizations and apply operations and are performed by
custom CUDA kernels.
Figure 4-7. Factorization of a front
Fig. 4-7 shows the QR factorization of a front, step by step. The tiles are of size 32× 32,
and due to the limit of the GPU’s shared memory, up to 96 rows are processed in each
factorization or application.
Consider a factorization task and an application task that process an m × n submatrix of the
front. For simplicity, we assume m to be 96, 64, or 32 (edge cases are handled inside the
CUDA kernels) and n > 32 (otherwise, there is no application).
Figure 4-8. Factorization of a front
Let A be the leftmost 96 × 32 submatrix and B be the rest. The factorization and the
application find the orthogonal matrix Q, the upper triangular matrix R, and S such that

[A B] = Q [R S]

(Fig. 4-8).
Using blocked Householder reflection, the factorize CUDA kernel finds the m × 32 lower
trapezoidal matrix V containing the Householder vectors and the 32 × 32 lower triangular
matrix T such that Q = I − V TV T . Then, R = QTA = A− V T TV TA.
The apply operation is where the most floating-point operations happen, and it can
take around 80% of the total running time. The apply operation computes S = QTB =
B − V T TV TB.
The apply CUDA kernel utilizes the shared memory of the GPU to speed up the com-
putation. The shared memory is a region of on-chip memory that is shared among all CUDA
threads of the same block. Access to the shared memory is much faster than global memory
access (shared memory latency is roughly 100 times lower than uncached global memory);
therefore, the shared memory is suitable for data that will be frequently reused.
On NVIDIA K40m GPUs, each block has 48KB shared memory. We consider a 32×32 tile
of matrix entries, with each entry stored in type double (8 bytes). Because (48 × 1024)/(8 ×
(32 × 32)) = 6 and considering that we need some extra space for padding (to prevent shared
memory bank conflicts), we can safely assume that only 5 tiles of data can be stored in the
shared memory at any time. SPQR allocates space in the shared memory for two matrices of
type double: the 97 × 32 "VT tile" and the 32 × 64 "C".
Figure 4-9. The VT tile
Prior to the actual matrix multiplication, the matrices V and T are copied from the
GPU’s global memory to the VT tile. Because V is lower trapezoidal and T is lower triangular,
they can be stored in the form of V and T T as shown in Fig. 4-9.
The apply kernel divides the matrix B into multiple submatrices with at most 64 columns.
We denote them by B1, B2, · · · , Bk. Then, for each i from 1 to k, the apply kernel does the
following:
• Compute Ci = V TBi, and write Ci to C
• Compute Zi = T TCi, and write Zi to C
• Compute Si = Bi − V Zi, and write Si to the global memory
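The three steps above amount to computing S = QTB = B − VTTVTB without ever forming Q. The NumPy sketch below (our illustration, using the tile sizes from the text) checks the per-panel computation against the explicit Q:

```python
import numpy as np

rng = np.random.default_rng(3)
m, k, nb = 96, 32, 64                      # tile sizes used in the text
V = np.tril(rng.standard_normal((m, k)))   # lower-trapezoidal Householder block
T = np.tril(rng.standard_normal((k, k)))   # lower triangular factor
B = rng.standard_normal((m, nb))           # one 64-column panel B_i

# The three steps of the apply kernel:
C = V.T @ B            # C_i = V^T B_i
Z = T.T @ C            # Z_i = T^T C_i
S = B - V @ Z          # S_i = B_i - V Z_i

# S equals Q^T B, where Q = I - V T V^T.
Q = np.eye(m) - V @ T @ V.T
assert np.allclose(S, Q.T @ B)
```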
It is easy to see that the above is not the optimal method for multiplying the matrices.
Assume that a total time of t is required to compute the product of two 32 × 32 matrices.
When V is (p × 32)-by-32 and B is (p × 32)-by-(q × 32), the total running time of the above
method is (pq + q + pq)t = (2pq + q)t.
When p ≤ 2, we optimize the multiplications so that they are performed in the following
way:
• Compute U = V T , and write U to the VT tile
• for each i from 1 to k:
– Compute Zi = UTBi, and write Zi to C
– Compute Si = Bi − V Zi, and write Si to the global memory
Figure 4-10. U and V in the VT tile
Note that U is also (p × 32)-by-32 lower trapezoidal; therefore, it can coexist with V in
the VT tile only if p ≤ 2. Fig. 4-10 shows the layout of U and V in the VT tile.
This method runs in pt + (pq + pq)t = (2pq + p)t time. When q is large enough, it
is faster than the original implementation. When p = 2 and q is very large, this method is
approximately 25% faster than the original implementation.
When p = 1, the multiplications can be further simplified into the following:
• Compute U = V T , and write U to the VT tile
• Compute Q = I − UV T , and write Q to the VT tile
• for each i from 1 to k:
– Compute Si = QTBi, and write Si to the global memory
Figure 4-11. Q, U and V in the VT tile
When p = 1, the VT tile is spacious enough to hold Q, U , and V at the same time (Fig.
4-11).
This method runs in (q + 2)t time. When q is very large, it is approximately 200% faster
than the original implementation.
Despite the theoretical improvements mentioned above, this optimization only works
when the apply task has no more than 64 rows. Because the algorithm should try to process
as many rows as possible in each task to maximize concurrency, the improvement is only
possible in edge cases, and the majority of apply tasks do not benefit from this
optimization.
4.8 Pipelining CUDA Kernels and Device-to-Host Transfers
Figure 4-12. Pipelining the CUDA kernel runs and device-to-host transfers
Figure 4-13. Pinned host memory used as a buffer
Figure 4-14. Pipelining the factorization of stages and the buffer flushing
SPQR groups fronts in stages in order to reduce PCIe communications. Each stage
consists of multiple fronts that are present in the GPU memory at the same time. Because we
expect data dependency between fronts in the same stage, the factorizations of the fronts in a
stage are usually not entirely parallel.
The algorithm runs a scheduler that loops until the entire stage has been factorized. In
each loop, the scheduler picks fronts that are ready for factorization, runs the appropriate
CUDA kernels (in batch), and then updates the fronts’ statuses.
In the original implementation of SPQR, the factorization of a stage can be summarized
as three phases:
• Copy the fronts’ data from the main memory to the GPU memory
• Run the CUDA kernels
• Copy the factorized fronts’ data and the contribution blocks from the GPU memory to
the main memory
The CUDA kernel execution is the most time-consuming of all three. Because the
contribution blocks tend to be much larger than the fronts themselves, the copyback also
takes a considerable amount of time. The first phase, host-to-device copy, is the least time-
consuming of all three.
We reduce the total factorization time of the stage by overlapping the CUDA kernel
executions and the copyback of data. The original implementation of SPQR defers the
copyback until the end of the last CUDA kernel. However, because the GPU and the PCIe bus
can work at the same time, the CUDA kernel executions and the copyback can run in parallel.
The pipeline was implemented by using an extra CUDA stream to allow asynchronous
execution of the copyback. At the end of the scheduler loop, the status of each active (being
factorized) front is queried. If the front’s factorization is complete, an asynchronous device-to-
host transfer is initiated on the designated CUDA stream. The transfer will run independently
of future scheduling and CUDA kernel calls. After all fronts in the stage have been processed,
the function cudaStreamSynchronize is called and blocks until all copybacks are finished.
Fig. 4-12 compares the processing of a stage with and without the pipeline. The pipeline
partially hides the PCIe transfers behind the on-GPU computations and saves total factoriza-
tion time.
4.9 Pipelining GPU Workload and CPU Workload
Another layer of pipelining is possible, not within stages, but at the same "level" as them.
The copyback of factorized front data does not put the data in their final destination. For
the efficiency of the transfer, the data are first put in a pinned host memory buffer and must
be copied to their designated main memory region (pageable memory) later (Fig. 4-13).
Since the processing of the stages is mostly done by the GPU (CUDA kernel executions)
and the DMA engine (device-to-host transfers), it can be run in parallel with the flushing of
the pinned host memory buffer.
The pipeline was implemented using OpenMP. For a sparse QR factorization with k
stages, the algorithm loops over the stages from 0 to k. Within each iteration, two threads
work in parallel: one thread factorizes a stage and copies the results back to the pinned host
memory buffer, while the other thread flushes the buffer that contains the results of the
previously factorized stage.
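The double-buffered scheme can be simulated with two host threads in plain Python (the actual implementation uses OpenMP; factorize_stage and flush_buffer below are hypothetical stand-ins for the GPU-side and CPU-side work):

```python
from concurrent.futures import ThreadPoolExecutor

def factorize_stage(i):
    """Stand-in for factorizing stage i on the GPU and copying it back."""
    return [f"stage{i}-front{j}" for j in range(2)]

def flush_buffer(buf, dest):
    """Stand-in for copying pinned-buffer contents to pageable memory."""
    dest.extend(buf)

k = 4
buffers = [None, None]        # two pinned host buffers, used alternately
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    for i in range(k):
        fact = pool.submit(factorize_stage, i)        # "GPU" thread: stage i
        if buffers[(i - 1) % 2] is not None:
            # "CPU" thread: flush the previous stage's buffer concurrently.
            flush = pool.submit(flush_buffer, buffers[(i - 1) % 2], results)
            flush.result()
        buffers[i % 2] = fact.result()
    flush_buffer(buffers[(k - 1) % 2], results)       # drain the last stage

assert results == [f"stage{i}-front{j}" for i in range(4) for j in range(2)]
```

Alternating between the two buffers ensures the buffer being flushed is never the one being written by the concurrent factorization.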
Figure 4-15. A secondary pinned host memory buffer to avoid data conflict
Fig. 4-14 and Fig. 4-15 show the mechanism of the pipeline. To avoid conflicts, two
pinned host memory buffers are allocated, and the DMA engine and the CPU will switch
between them.
Fig. 4-16 shows the comparison between algorithms with and without the stage-level
pipeline. The sparse matrix used in this case is Flan 1565 from the SuiteSparse Matrix
Collection [13]. The pipeline has effectively hidden the buffer flushing behind the processing of
stages, eliminating most of the GPU’s idle time between stages.
Figure 4-16. Comparison between sparse QR algorithm with / without stage-level pipeline
4.10 Experiment Results
The experiments were carried out on a platform with an Intel Xeon E5-2667 v4 CPU and an
NVIDIA Titan V GPU. SPQR has no support for multiple GPUs; therefore, only one GPU was
used. The test-case matrices (Table 4-1) were from the SuiteSparse Matrix Collection [13].
Table 4-1. Test matrices used for QR
matrix         problem type                    dimension    nonzeros
Flan 1565      structural                      1,564,794    114,165,372
Freescale1     circuit simulation              3,428,755    17,052,626
H2O            theoretical/quantum chemistry   67,024       2,216,736
bundle adj     computer vision                 513,351      20,207,907
circuit5M dc   circuit simulation              3,523,317    14,865,409
hood           structural                      220,542      9,895,422
nd24k          2D/3D problem                   72,000       28,715,634
Table 4-2 lists the performance of SPQR on each of the matrices, before and after the
optimization. The table also lists the total number of floating point operations involved in the
factorization.
It can be seen from Fig. 4-17 that the combination of two layers of pipelines and the
reimplemented apply kernels achieves up to 43.92% improvement in performance vs. the
original SPQR.
We also conducted experiments to measure the energy and power cost. From Fig. 4-19
and Fig. 4-20 we see that after our optimizations, the total energy cost of the QR factorization
is reduced by up to 35.95%.
Table 4-2. QR experiment results
matrix         flop count   Gflops before optimization   Gflops after optimization
Flan 1565      3.28E+14     920.94                       1283.26
Freescale1     2.11E+11     26.23                        36.90
H2O            4.22E+13     982.50                       1345.55
bundle adj     3.12E+13     524.96                       622.73
circuit5M dc   2.21E+11     27.90                        40.15
hood           6.43E+11     133.43                       114.75
nd24k          3.30E+13     1014.49                      1252.21
Fig. 4-21 and Fig. 4-22 show a comparison in average power before and after the
optimization. The power is reduced by up to 12.45%, but there is not always a notable
reduction in average power. In most of our experiments, the change in average power was less
than 4%. The total energy cost was reduced mainly due to the reduced factorization time.
Figure 4-17. Performance comparison between algorithm before and after optimizations
Figure 4-18. Relationship between flop count and improvement in performance

Figure 4-19. Energy consumed by the GPU in factorization (large matrices)

Figure 4-20. Reduction in energy consumption after the optimization

Figure 4-21. Average power of the GPU in factorization

Figure 4-22. Reduction in average power after the optimization

4.11 Conclusions

In this chapter, we present several optimizations for SPQR, a multifrontal sparse QR
factorization algorithm. We first optimize the edge cases of the apply CUDA kernels, reducing
the total amount of floating-point operations when the apply task is small. Then, we implement
a pipeline that overlaps the CUDA kernel executions and the device-to-host transfers, hiding
PCIe transfers behind on-GPU computations and reducing the time for factorizing a stage. We
also implement another pipeline that parallelizes the factorization of stages and the flushing of
the pinned host memory buffers.
The experimental results showed good improvement in performance when the sparse
matrix is large and the total flop count is large enough, achieving up to a 23% increase in
performance.
Energy cost is also reduced for large matrices, up to 14.58%.
However, due to the extra cost introduced, such as the allocation of additional pinned host
memory, our optimizations do not work well when the matrix is too small. In our experiments,
performance after optimization decreased by up to 7.26% for small matrices, and energy
consumption increased by up to 16.82% for small matrices.
We may combine the advantages of our algorithm and the original SPQR by adding a
switch in our code that selects whether to apply our changes, according to the size of the
matrix.
CHAPTER 5
SPARSE LU FACTORIZATION
The LU factorization is a decomposition of a square matrix into the product of a lower
triangular matrix and an upper triangular matrix. Given a square matrix A, the LU factorization
finds a lower triangular matrix L and an upper triangular matrix U such that A = LU.
The LU factorization is usually used for square unsymmetric matrices.
Like the Cholesky factorization, the LU factorization may also be applied to solving
systems of linear equations. Let A be a square matrix; the equation Ax = b can then be solved
with the help of the LU factorization. If A = LU, where L is a lower triangular matrix and U
is an upper triangular matrix, then we have LUx = b. This equation can be solved by solving
Ly = b and then Ux = y.
For the LU factorization, there exist left-looking and right-looking variants [9]. The
supernodal method and the multifrontal method are also applicable to the LU factorization.
5.1 Left-looking LU
The left-looking LU is an iterative method which solves L and U one column at a time.
Let
\[
L = \begin{bmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{bmatrix}
\qquad\text{and}\qquad
U = \begin{bmatrix} U_{11} & u_{12} & U_{13} \\ & u_{22} & u_{23} \\ & & U_{33} \end{bmatrix}
\]
where the first block column of L and the block U_{11} represent the columns already computed.

Since
\[
\begin{bmatrix} A_{11} & a_{12} & A_{13} \\ a_{21} & a_{22} & a_{23} \\ A_{31} & a_{32} & A_{33} \end{bmatrix}
=
\begin{bmatrix} L_{11} & & \\ l_{21} & l_{22} & \\ L_{31} & l_{32} & L_{33} \end{bmatrix}
\begin{bmatrix} U_{11} & u_{12} & U_{13} \\ & u_{22} & u_{23} \\ & & U_{33} \end{bmatrix}
\]
we have:
\[
u_{12} = L_{11}^{-1} a_{12}, \qquad
u_{22} = (a_{22} - l_{21} u_{12}) / l_{22}, \qquad
l_{32} = (a_{32} - L_{31} u_{12}) / u_{22}
\]
If we let l_{22} = 1 (assume that L has unit diagonal), then:
\[
u_{12} = L_{11}^{-1} a_{12}, \qquad
u_{22} = a_{22} - l_{21} u_{12}, \qquad
l_{32} = (a_{32} - L_{31} u_{12}) / u_{22}
\]
5.2 Right-looking LU
The right-looking LU, like the right-looking Cholesky, is very similar to the left-looking LU,
except for the ordering of the workflow.
Let
\[
L = \begin{bmatrix} l_{11} & \\ l_{21} & L_{22} \end{bmatrix}
\qquad\text{and}\qquad
U = \begin{bmatrix} u_{11} & u_{12} \\ & U_{22} \end{bmatrix}
\]
Since
\[
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} l_{11} & \\ l_{21} & L_{22} \end{bmatrix}
\begin{bmatrix} u_{11} & u_{12} \\ & U_{22} \end{bmatrix}
\]
we have:
\[
l_{11} u_{11} = a_{11}, \qquad
l_{21} u_{11} = a_{21}, \qquad
l_{11} u_{12} = a_{12}, \qquad
l_{21} u_{12} + L_{22} U_{22} = A_{22}
\]
Note that a_{11}, l_{11}, u_{11} are scalars, therefore
\[
L_{22} U_{22} = A_{22} - l_{21} u_{12} = A_{22} - a_{21} a_{12} / a_{11}
\]
L_{22} and U_{22} can be obtained recursively with the (right-looking) LU factorization of
A_{22} - a_{21} a_{12} / a_{11}.

If we let l_{11} = 1 (assume that L has unit diagonal), then
\[
u_{11} = a_{11}, \qquad
l_{21} = a_{21} / u_{11}, \qquad
u_{12} = a_{12}
\]
The right-looking LU factorization is essentially the Gaussian Elimination of the matrix A.
The transformation from A to U consists of a series of row transformations. Denote those row
transformations by M_1, M_2, ..., M_n, where
\[
M_j = \begin{bmatrix} I_{j-1} & O \\ O & N_j \end{bmatrix}
\]
I_{j-1} is the (j-1) x (j-1) identity matrix, and N_j is an (n+1-j) x (n+1-j) lower
triangular matrix.

For every j, M_j is an n x n lower triangular matrix. Let
\[
M = \prod_{j=1}^{n} M_{n+1-j} = M_n M_{n-1} \cdots M_1
\]
Then M is an n x n lower triangular matrix, and MA = U.

Let L = M^{-1}; then L is lower triangular and A = LU.
The right-looking LU factorization also shares some similarities with the Householder-reflection-based
QR. The difference is that the matrices corresponding to the row transformations
in the LU factorization are lower triangular matrices, while in the QR factorization they
are orthogonal matrices.
The right-looking LU forms the basis of the multifrontal LU algorithms.
5.3 The Supernodal Method
Demmel et al. implemented SuperLU [16], a supernodal LU factorization library.
SuperLU is based on sparse Gaussian elimination, and consists of three modules: SuperLU (the
sequential supernodal LU library, left-looking), SuperLU_MT (multithreaded supernodal LU on
shared-memory parallel machines, left-looking), and SuperLU_DIST (distributed supernodal LU,
right-looking). SuperLU_MT is very similar to SuperLU from the user's and the algorithm's
point of view [32].
Unlike the Cholesky factorization, the LU factorization does not guarantee any relation,
whether numerical or topological (in terms of nonzero entries), between L and U . Therefore
SuperLU utilizes the idea of unsymmetric supernodes [15]. The columns in L can be grouped,
like in the supernodal Cholesky factorization, according to L’s columns’ nonzero patterns, but
SuperLU only stores L in the supernodal format, while U is stored in the compressed column
form.
Schenk et al. implemented a supernodal LU algorithm (PARDISO [52]) that combines the
workflow ordering of both the left-looking and the right-looking LU. PARDISO can be
viewed as numerically left-looking but symbolically right-looking: descendants of a supernode
are not assembled until right before the factorization of the supernode (numerically
left-looking), but as soon as a supernode is factorized, all of its ancestors are notified of its
completion (symbolically right-looking).
5.4 The Multifrontal Method
The multifrontal LU is based on the right-looking LU, and works well on distributed
memory platforms.
The workflow of the multifrontal LU depends on the assembly tree. The assembly tree
is essentially an elimination tree for multifrontal methods, and is the output of the symbolic
analysis phase.
Each node in the assembly tree corresponds to a stage and a frontal matrix, and a frontal
matrix cannot be factorized until all its descendants have been factorized and their contribution
blocks assembled.
In this section we first look at the first stage of the multifrontal LU. Let
\[
L = \begin{bmatrix} L_{11} & \\ L_{21} & L_{22} \end{bmatrix}
\qquad\text{and}\qquad
U = \begin{bmatrix} U_{11} & U_{12} \\ & U_{22} \end{bmatrix}
\]
The dimensions of L_{11}, L_{21}, U_{11}, U_{12} are determined through amalgamation [11].

Since
\[
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} L_{11} & \\ L_{21} & L_{22} \end{bmatrix}
\begin{bmatrix} U_{11} & U_{12} \\ & U_{22} \end{bmatrix}
\]
we have:
\[
L_{11} U_{11} = A_{11}, \qquad
L_{21} U_{11} = A_{21}, \qquad
L_{11} U_{12} = A_{12}, \qquad
L_{21} U_{12} + L_{22} U_{22} = A_{22}
\]
First we solve for L_{11} and U_{11} with a dense LU factorization; then L_{21} and U_{12} can be
obtained by solving L_{21} = A_{21} U_{11}^{-1} and U_{12} = L_{11}^{-1} A_{12}. The stage finally computes
\[
C = -L_{21} U_{12}
\]
C is the contribution block generated by the stage. If the factorization is done on a
distributed-memory platform, it is very likely that A22 is stored separately. C then needs to be
transferred to ancestor nodes for the assembly operations in other stages.
The LU factorization may use pivoting [47]. Pivoting is not mandatory, but it can increase
the numerical stability of the factorization [57]. Numerical instability can be caused by division
by zero or by small matrix entries. Reid gave examples showing how pivoting can solve these
problems.
For zero pivots: let
\[
A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}
= \begin{bmatrix} 1 & \\ l_{21} & 1 \end{bmatrix}
\begin{bmatrix} u_{11} & u_{12} \\ & u_{22} \end{bmatrix}
\]
Then we have u_{11} = 0 and l_{21} u_{11} = 1. This results in a divide-by-zero error.

If we let
\[
P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\]
and instead solve PA = LU, then we have
\[
L = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\qquad\text{and}\qquad
U = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
\]
For small pivots: let
\[
A = \begin{bmatrix} 10^{-20} & 1 \\ 1 & 1 \end{bmatrix},
\quad\text{then}\quad
L = \begin{bmatrix} 1 & 0 \\ 10^{20} & 1 \end{bmatrix}
\quad\text{and}\quad
U = \begin{bmatrix} 10^{-20} & 1 \\ 0 & 1 - 10^{20} \end{bmatrix}
\]
This time we don't have a divide-by-zero error, but the floating-point number (1 - 10^{20}) may
be represented inaccurately and rounded to -10^{20}, therefore what we actually get is
\[
U = \begin{bmatrix} 10^{-20} & 1 \\ 0 & -10^{20} \end{bmatrix},
\qquad
LU = \begin{bmatrix} 10^{-20} & 1 \\ 1 & 0 \end{bmatrix} \neq A
\]
If we let
\[
P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\]
and solve PA = LU, then
\[
L = \begin{bmatrix} 1 & 0 \\ 10^{-20} & 1 \end{bmatrix}
\qquad\text{and}\qquad
U = \begin{bmatrix} 1 & 1 \\ 0 & 1 - 10^{-20} \end{bmatrix}
\]
The equation is satisfied even if the numbers are rounded.
There exist partial pivoting, complete pivoting, and rook pivoting.
Partial pivoting is performed by doing a row exchange in each elimination step so that the
first entry in the current column is exchanged with the largest entry in the column. Each
exchange corresponds to a permutation matrix Pi, and the combined permutation matrix
is P = Pn−1Pn−2 . . . P2P1. In an LU factorization with partial pivoting, instead of solving
A = LU, we solve PA = LU.

Complete pivoting is performed by doing both a row exchange and a column exchange
in each elimination step so that the leading entry of the remaining matrix is exchanged with
the largest entry in the remaining matrix. Complete pivoting can be represented by
PAQ = LU, where P = Pn−1Pn−2 . . . P2P1 and Q = Q1Q2 . . . Qn−1.

Rook pivoting is similar to complete pivoting except that, instead of choosing the
largest entry, it chooses an entry that is the largest in both its own row and its own column.
Whether or not to use pivoting will affect the behavior of the LU algorithm, both in the
symbolic analysis and in the numerical factorization.
When partial pivoting is used, the LU algorithm allows arbitrary row exchanges during the
factorization. In this case, the LU factorization is more closely related to the QR factorization
[15], as both of them eliminate entries below the diagonal. If the LU algorithm allows arbitrary
row and column exchanges during the factorization, then a prior symbolic analysis is not possible
[15].
In cases where pivoting is not required, the nonzero pattern of L and U can be computed
as follows:
• If the matrix A has a symmetric nonzero pattern, then the nonzero pattern of L is that of
the symbolic Cholesky factorization of A, and the nonzero pattern of U is identical to that of LT .

• When the nonzero pattern of A is unsymmetric, we may let L have the same nonzero pattern
as the symbolic Cholesky factorization of A + AT , with the nonzero pattern of U again
identical to that of LT . In this case some zero entries in A are considered logically nonzero.
Amestoy et al. implemented the parallel multifrontal solver MUMPS [3], which was part
of the project PARASOL [1]. MUMPS is a fully asynchronous algorithm with dynamic data
structures and distributed dynamic scheduling of tasks.
Davis et al. implemented the multifrontal sparse LU factorizer UMFPACK [12], with
the goal of achieving high performance by using the level-3 BLAS. Instead of an assembly
tree, UMFPACK guides the workflow with an assembly DAG [26]. UMFPACK features a
dynamic analyze-factorize phase: since the structure of the assembly DAG is not known prior
to factorization, the assembly DAG is constructed dynamically during the analyze-factorize
phase. The assembly DAG is then used in a factorize-only phase.
5.5 Implementation of a Supernodal Sparse LU Algorithm
We implemented a highly efficient supernodal sparse LU algorithm that has the following
features:
1. The algorithm uses blocking to compute multiple columns in each iteration. This approach is more amenable to GPUs compared to non-blocked factorization algorithms. Compared to multifrontal algorithms, a supernodal algorithm uses less memory, due to the smaller sizes of contribution blocks. Our implementation uses highly tuned CUDA libraries such as cuBLAS and cuSOLVER as building blocks.
2. The algorithm can use multiple GPUs and multiple CPU cores when available. If only one GPU is available, the algorithm divides the GPU memory into multiple regions and utilizes CUDA streams to function as if there were multiple GPUs, so that parallelism is still exploited.
3. The algorithm uses pipelining to overlap the PCIe communication and the on-GPU computation, reducing overall factorization time.

4. The algorithm supports batched factorization, where multiple sparse matrices can be factorized simultaneously and the factorizations are interleaved.
We refer to this algorithm as "NLU", where "N" stands for "node", indicating supernodes.
NLU outperforms GLU [45], UMFPACK, KLU [14] [41], and SuperLU by several orders of
magnitude when the sparse matrix is large enough, and can handle matrices whose size is
beyond the capability of the above sparse LU solvers. The current implementation of NLU does
not support pivoting, but will be updated to include it in future versions.
5.5.1 Data Representation
The algorithm requires the storage of key information, including platform-specific
information, GPU-specific information, and matrix-specific information.
The platform-specific information is stored globally in a data structure of type struct
common_info_struct. It includes the number of GPUs, the size of available GPU memory, the
size of required pinned host memory (which is proportional to the size of the GPU memory),
the number of threads for batched factorization, etc. This data structure is initialized at the
beginning of the program, together with the allocation of the GPU memory and the pinned
host memory, and persists through all the factorizations until it is freed at the end of the
program. The platform-specific information does not change after initialization.
struct common_info_struct
{
    int numGPU;              // number of GPUs
    size_t minDevMemSize;    // GPU memory size
    size_t minHostMemSize;   // pinned host memory size
    int matrixThreadNum;     // factorization batch size
    int numSparseMatrix;     // number of sparse matrices
    ...
};
5.5.1.1 GPU information
The information for each GPU, mainly the CUDA streams, the cuBLAS handles, and the
pointers to the allocated GPU memory and host memory, is stored in a data structure of
type struct gpu_info_struct. The data structure contains a pointer to the allocated GPU
memory, whose size is slightly smaller than the maximum available memory size of the GPU,
and a pointer to a piece of pinned host memory (which is part of the main memory). The pinned
host memory is mandatory for asynchronous data transfer between the GPU memory and the
main memory, because the GPU is not able to access regular (i.e., pageable) main memory.
The allocation of pinned host memory is very time-consuming, therefore we set its size to
roughly the same as the allocated GPU memory, and use it as a buffer.
struct gpu_info_struct also contains an OpenMP lock for multithread support. A
thread must reserve a GPU through the OpenMP lock before it actually starts working on a
supernode.
struct gpu_info_struct
{
    omp_lock_t gpuLock;
    void *devMem;    // GPU memory ( device memory )
    void *hostMem;   // pinned host memory buffer
    ...
};
In a system with K GPUs, an array of K struct gpu_info_struct objects is used
to store the information of all the GPUs. The array is initialized at the beginning of the
algorithm, and is freed after all factorizations are completed.
5.5.1.2 Matrix information
The data of the sparse matrices, including the entry values, the symbolic structure, the
factorization result, and the runtime status, are stored in data structures of type struct
matrix_info_struct. An array of T struct matrix_info_struct objects is used in our
algorithm, and each object corresponds to a top-level thread that we call a "matrix thread". The
matrix threads repeatedly read a sparse matrix's file path from the command line input, read
the matrix from the file, and perform the factorization and the result validation, until the input
is exhausted.
NLU reads sparse matrices in triplet form. The triplet form stores the sparse matrix with a
list of triplets in the form of {i, j, x}, where i is the row index, j is the column index, and x is
the entry value.
The sparse matrix is then transformed into the compressed column form. Let n be
the number of columns of the matrix, and nnz be the number of nonzero entries; the
compressed column form then stores the matrix with three arrays: Ap, Ai, and Ax, where Ap is of
size (n + 1), and Ai and Ax are both of size nnz. An entry {i, j, x} exists in the matrix if and
only if there is an integer p such that Ap[j] ≤ p < Ap[j + 1], Ai[p] = i, and Ax[p] = x. The
compressed column form is required for fill-reducing permutation algorithms such as AMD [2]
and METIS [29].
The symbolic analysis creates the elimination tree. The elimination tree is stored with
the left-child-right-sibling representation. The symbolic patterns of the supernodes, including
their dimensions and their column and row indices, are also computed along with the elimination
tree. The mapping between the sparse matrix and the supernodes is also computed, and stored
in the struct matrix_info_struct object.
During the numeric factorization, we need to perform dense matrix arithmetic on the
supernodes. We store the supernodes in column-major form so that they fit the dense
linear algebra routines provided by BLAS and cuBLAS. An (m × n) dense matrix A can be
stored in an array of size (d × n), where d ≥ m is the leading dimension and Ai,j is stored at
index (j × d + i).
Figure 5-1. Supernode
In NLU, a supernode of dimension (m × n) is composed of an L part of dimension
(m × n) and a U part of dimension ((m − n) × n). For the supernode shown in Fig. 5-1, we have
n = j1 − j0 and m = i1 − i0. We store the supernode in column-major form in an array of size
((2m − n) × n) (Fig. 5-2). The block Ai0:j1−1,j1:i1−1 is transposed and attached to the remainder
of the supernode, turning the L-shape into a rectangular matrix that is easier to store and handle.
Figure 5-2. Supernode stored in column major form
5.5.2 The Supernodal Algorithm
NLU is implemented with the left-looking LU. It differs from the blocked dense left-looking
LU in that the order of the factorization is not simply from left to right, but is guided
by the elimination tree. In the left-looking LU for a sparse matrix, the blocks Li2 and U2j
correspond to a supernode in the elimination tree. Without multithreading, a supernodal sparse
LU factorization algorithm repeatedly picks a leaf supernode from the elimination tree, factorizes
it, and removes it from the elimination tree, until the elimination tree becomes
empty. A supernode is ready for factorization only when either it is initially a leaf, or all of its
descendants have been factorized.
For a sparse matrix with an elimination tree like the one in Fig. 5-3, one possible ordering
of the supernodes' factorization is the ascending order of the supernodes' indices.
During the factorization of a supernode s, for each descendant d of s, the contribution
block of d is computed. The contribution blocks from all descendants of s are assembled to
form the aggregate contribution block, and the aggregate contribution block is subtracted from
s. Then a dense LU factorization and two triangular solves are performed on s to compute the
factorization result of the supernode.
Figure 5-3. Elimination tree

The GPUs are only used for sufficiently large supernodes. If the size of s is too small,
NLU will perform the above operations entirely in the main memory (and not in the pinned
host memory), using the CPU. In this case, Fig. 5-4 depicts the timing of the factorization of
a supernode.
Figure 5-4. Serial factorization of supernode
Among the steps within each iteration of the left-looking LU, the computation and the
assembly of the contribution blocks are the most time-consuming, due to the large number of
descendants.

Consider a supernode s and one of its descendants, d. The algorithm performs the
following:
1. A portion (determined by the matrix's symbolic structure) of d is copied from the pageable host memory to the pinned host memory.

2. The descendant is transferred to the GPU memory via the PCIe bus.

3. The contribution block is computed with two matrix multiplications, and assembled into the aggregate contribution block using a matrix addition with mapping.
Note that each of the above steps uses only a portion of the available resources:

• Step 1 uses the pinned host memory.

• Step 2 uses the pinned host memory, the GPU memory, and the PCIe bus.

• Step 3 uses the GPU memory and the GPU's streaming multiprocessors.
In fact, we do not have to wait until the end of Step 3 to start computing the contribution
block of the next descendant. The algorithm can start copying the next descendant into
the pinned host memory without worrying about overwriting, as long as the current descendant
has finished Step 2. We implement this by utilizing CUDA's stream and event features. A total
of two CUDA streams are required, and each of them is associated with a CUDA event, a
separate piece of GPU memory, and a separate piece of pinned host memory.
Let s0 and s1 be the two CUDA streams, and e0 and e1 be the two CUDA events corresponding
to them, respectively. A for loop iterates over all the descendants of s. At the beginning
of each iteration, the algorithm queries e0 and e1. If either of them (ei) returns cudaSuccess,
the algorithm starts copying the next descendant to the pinned host memory, and subsequently
performs Steps 2 and 3 on si. Since Steps 2 and 3 are executed on a CUDA stream other than
the default stream, they do not block the execution of the host code, which means the loop
can continue before the on-GPU operations of Steps 2 and 3 are finished. ei is recorded at the
end of Step 2, to make sure that the next time ei is queried, it reflects whether Step 2 has
already completed.
If both of the CUDA event queries fail, the algorithm can either wait until one of the
CUDA streams becomes available, or simply fall back to using the CPU. Generally speaking, it
is not good to use the CPU when the task size is large, or the GPU when the task size
is too small. The actual strategy the algorithm selects depends on the dimensions of the
descendant, and the threshold is configurable with parameters.
Fig. 5-5 shows an example of factorizing a supernode with 5 descendants using a GPU.
The CPU computes contribution blocks using the data from the pageable host memory
and assembles them into the pinned host memory, while the GPU does the same in
the GPU memory. Therefore, after all contribution blocks have been computed and assembled,
the aggregate contribution block located in the pinned host memory must be copied to the
GPU memory, and a matrix addition needs to be performed to sum up the two aggregate
contribution blocks.

Figure 5-5. Parallel factorization of supernode
Figure 5-6. Computing contribution blocks using 4 CUDA streams
Fig. 5-6 is part of the NVIDIA Visual Profiler output showing how the contribution blocks
are computed using multiple CUDA streams. We see that the host-to-device memory copies
overlap with on-GPU floating-point operations, due to the pipelining.
5.5.3 Multithreading and Batched Factorization
Consider a server with a multicore CPU and K GPUs. We factorize N sparse matrices on
this server using multiple threads. The threads are nested, with T (T ≤ N) top-level threads
(named "matrix threads", since they handle entire sparse matrices), and each matrix thread
having K sub-threads (named "node threads", since they handle individual supernodes). The
actual number of active node threads in each matrix thread depends on the structure of the
matrix, and changes during run time, but it does not exceed the number of GPUs.
Different sparse matrices submitted to NLU are stored in different pieces of main memory;
the operating system ensures that they do not overlap in the main memory. But the sparse
matrices must share the scarcer GPU-related resources, including the multiprocessors, the
GPU memory, and the pinned host memory. To avoid conflicts while maximizing each GPU’s
availability to each sparse matrix, we do not statically assign GPUs to node threads. Instead,
we put the GPUs in a resource pool, and let the node threads query and lock GPUs on the
fly. In this way, the batched factorization is implemented by simply running multiple sparse LU
factorizations in different matrix threads, and the maximum number of sparse matrices being
factorized concurrently can be adjusted with a simple change of a parameter.
The implementation of the inter-supernode multithreading relies heavily on the elimination
tree. Two supernodes can be factorized in parallel if and only if neither of them is an ancestor
or a descendant of the other. Our scheduling policy is implemented using a queue named
leafQueue.
Initially, leafQueue contains the indices of all the leaf nodes of the elimination tree.
Each node thread contains a while loop that terminates when leafQueue is empty.
At the beginning of each loop iteration, the node thread pops a supernode s from leafQueue, in a
critical section. At the end of the iteration, the node thread enters another critical section, and
performs the following:
• Find p, the parent of s
• Remove s from the elimination tree
• Check whether p has become a leaf; if it has, push it into leafQueue
For efficiency of data access, some metadata of the sparse matrices is stored in the
pinned host memory and the device memory, and each piece of metadata is shared among
all supernodes of the same sparse matrix. The metadata must be transferred from the main
memory before the GPU can access it.
If the GPU starts working on a different sparse matrix, the metadata will be overwritten.
To prevent redundant data transfer, we reduce the overwriting of metadata by having the node
threads hold onto the GPUs for as long as they are active. A node thread is considered idle
if it runs out of available supernodes to factorize. At the beginning of each node thread, if
it fails its first pop of leafQueue, it goes idle immediately. If it succeeds, the node thread
should loop over the K GPUs and try to lock one of them by testing their OpenMP locks.
The locks are not released upon finishing a supernode; instead, when a supernode is
finished, the node thread looks for the next supernode in the same sparse matrix to factorize,
and only releases the lock if it fails to find one. Since the parallelism of a sparse matrix
monotonically decreases as we go further up the elimination tree, we do not need to
worry about an idle node thread going active again. In this way, we make sure that a GPU is
released only if it is no longer needed by the last sparse matrix it was working on.
This strategy works well when all the GPUs are of the same computing power, but may
cause some loss of performance if the system is heterogeneous, because it might be better to
switch from a weaker GPU when a more powerful one is released by another node thread.
5.5.4 Utilizing Localization
This is an experimental feature. At the moment there isn't a notable performance gain
from it, but an optimization of the scheduling policy might make it useful in the future.
The data transfers from the main memory to the GPU memory are a significant time sink
in the GPU-accelerated sparse LU algorithm. Though we are able to hide some of these PCIe
communications by overlapping them with on-GPU floating-point operations, it is still better to
avoid making these transfers whenever possible.
Supernodes must be factorized before they can be used to update their ancestors. If the
factorization result left over in the GPU memory is not overwritten before it is used in an
update operation, we can skip the data copy from the main memory to the GPU memory.
We reduce the chance of overwriting with stages. A stage is a group of supernodes whose
total size can fit in the GPU memory. We set the starting offset of supernodes so that the
supernodes in the same stage do not overlap in the GPU memory.
The factorization result of supernode s is considered intact in a GPU's memory if:

• s was factorized on that GPU.

• No supernodes from other stages were factorized on that GPU since the factorization of s.

• No supernodes from other matrices were factorized on that GPU since the factorization of s.
It is also possible that the factorization result does not exist in the GPU memory, but
is in the pinned host memory. If the GPU was used when computing and assembling the
contribution blocks, but the size of the supernode is small, the algorithm may choose to
perform the dense LU and the triangular solves in the pinned host memory, using the CPU.
We use several arrays and variables to track the location of supernodes’ factorization
result:
• GPUSerial[]:
The index of the GPU that the supernode was processed on. If GPUs are not used for
supernode s, then GPUSerial[s]=-1.
• NodeLocation[]:
Whether the supernode’s factorization result is in the GPU memory or in the pinned host
memory. NodeLocation[s] is meaningful if and only if GPUSerial[s] actually points
to a valid GPU.
• gpu info->lastMatrix:
gpu info is of type struct gpu info struct *, a pointer to the object that contains
GPU-specific information.
gpu info->lastMatrix records the last sparse matrix that the GPU worked on. This is
to indicate whether the factorization result is potentially overwritten by another matrix.
• NodeStPass[]:
An array with monotonically increasing elements. If supernodes s0, s1, and s2 are
factorized in order on the same GPU, s0 and s2 are from the same stage while s1 is from
a different one, then s0 should be considered already overwritten at the time we factorize
s2. We maintain NodeStPass[] so that NodeStPass[s0] and NodeStPass[s2] have
different values, reflecting the overwriting between these two supernodes.
Before updating a supernode s with its descendants, we can see whether the factorization
results of the descendants are still in the GPU memory or the pinned host memory by checking
the above arrays and variables. If there is a hit, we can skip some data transfer, and reduce the
total running time.
At the moment, the hit rate we have achieved is still very low (less than 1%), but we
might be able to increase it by updating our multithreaded scheduling policy. We expect the hit
rate to be high when supernodes from the same stage are factorized on the same GPU.
5.6 Experimental Results
We carried out our experiments on a platform with two Intel Xeon E5-2695 v2 CPUs and
eight NVIDIA Tesla K40m GPUs. The test matrices are listed in Table 5-1. These matrices are
from the SuiteSparse matrix collection [13].
The experiment results are listed in Table 5-2 (OOM means "out of memory"). The
factorization time comparison for some matrices is shown in Fig. 5-7. In the comparison, GLU
and NLU used one GPU and the CPU, while UMFPACK, KLU, and SuperLU used the CPU only.
We see that GLU is a strong competitor, especially when the matrix is small, but NLU works
better on larger matrices. The maximum improvement in performance we achieved was on
the matrix "li", where NLU's performance was 43.66x vs. GLU, 209.88x vs. UMFPACK,
755.48x vs. KLU, and 20.63x vs. SuperLU. NLU was also able to achieve 200.35 times
the performance of SuperLU when factorizing the matrix "epb1".
Table 5-1. Test matrices
matrix          problem type                           dimension    nonzeros
poli_large      Economic Problem                       15,575       33,033
epb1            Thermal Problem                        14,734       95,053
bayer01         Chemical Process Simulation Problem    57,735       275,094
ckt11752_dc_1   Circuit Simulation                     49,702       333,029
onetone1        Frequency Domain Circuit Simulation    36,057       335,552
ASIC_100ks      Circuit Simulation                     99,190       578,890
rajat25         Circuit Simulation                     87,190       606,489
rim             Computational Fluid Dynamics Problem   22,560       1,014,951
xenon1          Materials Problem                      48,600       1,181,120
matrix_9        Semiconductor Device                   103,430      1,205,518
li              Electromagnetics                       22,695       1,215,181
Raj1            Circuit Simulation                     263,743      1,300,261
rajat24         Circuit Simulation                     358,172      1,946,979
rma10           Computational Fluid Dynamics Problem   46,835       2,329,092
ASIC_680k       Circuit Simulation                     682,862      2,638,997
pre2            Frequency Domain Circuit Simulation    659,033      5,834,044
rajat30         Circuit Simulation                     643,994      6,175,244
marine1         Chemical Oceanography                  400,320      6,226,538
Freescale1      Circuit Simulation                     3,428,755    17,052,626
Transport       Structural Problem                     1,602,111    23,487,281
dgreen          Semiconductor Device                   1,200,611    26,606,169
ML_Laplace      Structural Problem                     377,002      27,582,698
ss              Semiconductor Process                  1,652,680    34,753,577
nv2             Semiconductor Device                   1,453,908    37,475,646
ML_Geer         Structural Problem                     1,504,002    110,686,677
stokes          Semiconductor Process                  11,449,533   349,321,980
A performance comparison across different numbers of GPUs is shown in Fig. 5-8. The performance gain when factorizing these matrices with multiple GPUs ranged from 18.43% to 104.86% with 2 GPUs, and from 30.11% to 206.30% with 4 GPUs.
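For reference, these "performance gained" figures are relative improvements over the single-GPU run, so a gain of g percent corresponds to a multiplicative speedup factor of 1 + g/100 (this reading of "performance gained" is an interpretation, not stated explicitly in the text):

```python
def gain_to_speedup(gain_percent):
    """Convert a relative performance gain (in percent) over the
    single-GPU baseline into a multiplicative speedup factor."""
    return 1.0 + gain_percent / 100.0

# The best observed gains quoted above:
two_gpu = gain_to_speedup(104.86)   # factor of about 2.05 with 2 GPUs
four_gpu = gain_to_speedup(206.30)  # factor of about 3.06 with 4 GPUs
```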
5.7 Conclusions
In this chapter, we present NLU, an efficient supernodal sparse LU factorization algorithm
for hybrid multicore systems that supports multithreading and batched factorization.
Table 5-2. Factorization time (s)
matrix         GLU       UMFPACK   KLU       SuperLU   NLU
poli_large     1.01E-02  2.90E-03  1.00E-03  2.64E+00  4.79E-01
epb1           7.66E-02  8.75E-02  4.76E-02  1.16E+01  5.78E-02
bayer01        3.88E-01  1.36E-01  1.60E-01  3.99E+01  3.15E+00
ckt11752_dc_1  1.71E-01  2.93E-01  4.45E-02  2.79E+01  4.05E-01
onetone1       3.27E-01  3.07E-01  6.85E+00  2.17E+01  5.77E-01
ASIC_100ks     3.78E-01  5.91E-01  1.02E+00  3.16E+01  8.98E-01
rajat25        4.56E-01  4.54E+01  1.56E+00  6.61E+01  1.27E+00
rim            2.41E-01  6.06E-01  1.20E+01  1.33E+01  6.78E-02
xenon1         3.59E+00  1.57E+00  1.15E+01  2.80E+01  3.45E-01
matrix_9       3.79E+01  1.17E+01  1.85E+02  3.71E+01  1.87E+00
li             1.56E+01  7.50E+01  2.70E+02  7.73E+00  3.57E-01
Raj1           2.01E+00  1.12E+02  5.14E+01  1.21E+02  2.06E+00
rajat24        2.55E+00  OOM       1.46E+01  1.55E+02  4.92E+00
rma10          6.48E-01  9.79E-01  8.83E-01  9.51E+00  1.42E-01
ASIC_680k      3.85E+01  5.91E-01  6.91E-01  timeout   2.93E+01
pre2           3.98E+01  OOM       fail      4.12E+02  1.09E+01
rajat30        4.80E+01  OOM       1.23E+01  5.63E+02  1.19E+01
marine1        timeout   OOM       1.13E+03  1.12E+02  4.58E+00
Freescale1     4.38E+00  fail      7.34E+00  timeout   1.61E+01
Transport      timeout   fail      fail      7.08E+02  2.96E+01
dgreen         timeout   fail      fail      4.32E+02  2.43E+01
ML_Laplace     timeout   OOM       5.36E+02  7.06E+01  4.66E+00
ss             timeout   fail      fail      timeout   7.74E+01
nv2            timeout   fail      fail      5.67E+02  2.64E+01
ML_Geer        timeout   singular  fail      2.92E+02  2.21E+01
stokes         timeout   singular  fail      segfault  1.29E+03
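As a sanity check, the headline speedups can be recomputed directly from the times in Table 5-2 (two representative rows shown; small discrepancies against the quoted figures come from the table's three-significant-digit rounding):

```python
# Factorization times (s) for "li" and "epb1", copied from Table 5-2.
times = {
    "li":   {"GLU": 1.56e+01, "UMFPACK": 7.50e+01, "KLU": 2.70e+02,
             "SuperLU": 7.73e+00, "NLU": 3.57e-01},
    "epb1": {"SuperLU": 1.16e+01, "NLU": 5.78e-02},
}

def speedup(matrix, solver):
    """NLU's speedup over `solver` on `matrix`, as a ratio of times."""
    return times[matrix][solver] / times[matrix]["NLU"]

print(round(speedup("li", "GLU"), 1))        # ~43.7, matching the quoted 43.66x
print(round(speedup("epb1", "SuperLU"), 1))  # ~200.7, close to the quoted 200.35x
```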
In our experiments, we compared NLU to other sparse LU solvers and found that NLU can handle matrices significantly larger than those the other solvers can process, and that it achieves higher performance on some matrices: up to 43.66 times vs GLU, up to 209.88 times vs UMFPACK, up to 755.48 times vs KLU, and up to 200.35 times vs SuperLU.
NLU is able to further accelerate the factorization by using multiple GPUs. The performance gain was up to 104.86% when using 2 GPUs and up to 206.30% when using 4 GPUs.
One shortcoming of NLU is that it does not yet support pivoting, which may limit the
scenarios in which it can be used. We will update it to include pivoting in future versions.
Figure 5-7. LU factorization time (natural log transformed)
Figure 5-8. LU factorization time using one or multiple GPUs
CHAPTER 6
SUMMARY AND CONCLUSIONS
Our work is focused on sparse matrix factorization algorithms on hybrid multicore
architectures, including sparse Cholesky, sparse QR, and sparse LU.
In Chapter 3, we present optimization techniques for the sparse Cholesky algorithm,
CHOLMOD. Our optimizations for CHOLMOD include multithreading, pipelining, and
the multilevel subtree method. We also implemented the batched factorization feature for
CHOLMOD.
The optimizations for CHOLMOD, when combined, can speed up the factorization by a factor of tens.
In Chapter 4, we introduce our optimizations for the sparse QR algorithm, SPQR. We improved the arithmetic CUDA kernels to increase the performance of the apply operations, and implemented two pipelines to reduce the data transfer overhead. Our optimizations for SPQR were able to increase its performance by up to 43.92%.
In Chapter 5, we present our implementation of a sparse LU solver. Our sparse LU algorithm is a supernodal algorithm that can utilize multiple GPUs, and supports multithreading, pipelining, and batched factorization.
We compared our sparse LU algorithm to other LU solvers and found that, when using one GPU, its performance can be up to 43.66x vs GLU, 209.88x vs UMFPACK, 755.48x vs KLU, and 200.35x vs SuperLU.
Using multiple GPUs can further increase our LU solver’s performance. The performance
gained when using multiple GPUs was 18.43% to 104.86% when using 2 GPUs, and 30.11% to
206.30% when using 4 GPUs.
APPENDIX
PUBLICATIONS
• Meng Tang, Mohamed Gadou, and Sanjay Ranka. 2017. A Multithreaded Algorithm for
Sparse Cholesky Factorization on Hybrid Multicore Architectures. Procedia Computer
Science 108 (2017), 616-625.
• Meng Tang, Mohamed Gadou, Steven Rennich, Timothy A Davis, and Sanjay Ranka.
Optimized Sparse Cholesky Factorization on Hybrid Multicore Architectures. Journal of
Computational Science (2018).
• Meng Tang, Mohamed Gadou, Steven Rennich, Timothy A Davis, and Sanjay Ranka.
A Multilevel Subtree Method for Single and Batched Sparse Cholesky Factorization.
Proceedings of the 47th International Conference on Parallel Processing (2018).
REFERENCES
[1] Amestoy, Patrick, Duff, Iain, L'Excellent, Jean-Yves, and Plechac, Petr. "PARASOL: An integrated programming environment for parallel sparse matrix solvers." High-Performance Computing. Springer, 1999, 79–90.

[2] Amestoy, Patrick R, Davis, Timothy A, and Duff, Iain S. "Algorithm 837: AMD, an approximate minimum degree ordering algorithm." ACM Transactions on Mathematical Software (TOMS) 30 (2004).3: 381–388.

[3] Amestoy, Patrick R, Duff, Iain S, L'Excellent, Jean-Yves, and Koster, Jacko. "MUMPS: a general purpose distributed memory sparse solver." International Workshop on Applied Parallel Computing. Springer, 2000, 121–130.

[4] Bjorck, Ake. "Solving linear least squares problems by Gram-Schmidt orthogonalization." BIT Numerical Mathematics 7 (1967).1: 1–21.

[5] Chen, Yanqing, Davis, Timothy A, Hager, William W, and Rajamanickam, Sivasankaran. "Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate." ACM Transactions on Mathematical Software (TOMS) 35 (2008).3: 22.

[6] Chevalier, Cedric and Pellegrini, Francois. "PT-Scotch: A tool for efficient parallel graph ordering." Parallel Computing 34 (2008).6-8: 318–331.

[7] Cuthill, Elizabeth and McKee, James. "Reducing the bandwidth of sparse symmetric matrices." Proceedings of the 1969 24th National Conference. ACM, 1969, 157–172.

[8] Davis, Tim, Hager, WW, and Duff, IS. "SuiteSparse." (2014). URL http://faculty.cse.tamu.edu/davis/suitesparse.html

[9] Davis, Timothy A. Direct Methods for Sparse Linear Systems. SIAM, 2006.

[10] ———. "Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization." ACM Transactions on Mathematical Software (TOMS) 38 (2011).1: 8.

[11] Davis, Timothy A and Duff, Iain S. "Unsymmetric-pattern multifrontal methods for parallel sparse LU factorization." Technical Report, Comp. and Info. Sci. Dept., University of Florida (1991).

[12] ———. "An unsymmetric-pattern multifrontal method for sparse LU factorization." SIAM Journal on Matrix Analysis and Applications 18 (1997).1: 140–158.

[13] Davis, Timothy A and Hu, Yifan. "The University of Florida sparse matrix collection." ACM Transactions on Mathematical Software (TOMS) 38 (2011).1: 1.
[14] Davis, Timothy A and Palamadai Natarajan, Ekanathan. "Algorithm 907: KLU, a direct sparse solver for circuit simulation problems." ACM Transactions on Mathematical Software (TOMS) 37 (2010).3: 36.

[15] Davis, Timothy A, Rajamanickam, Sivasankaran, and Sid-Lakhdar, Wissam M. "A survey of direct methods for sparse linear systems." Acta Numerica 25 (2016): 383–566.

[16] Demmel, James W. "SuperLU users' guide." (1999).

[17] Gentleman, W Morven. "Least squares computations by Givens transformations without square roots." IMA Journal of Applied Mathematics 12 (1973).3: 329–336.

[18] Gentleman, W Morven and Kung, HT. "Matrix triangularization by systolic arrays." Real-Time Signal Processing IV. Vol. 298. International Society for Optics and Photonics, 1982, 19–27.

[19] George, Alan. "Nested dissection of a regular finite element mesh." SIAM Journal on Numerical Analysis 10 (1973).2: 345–363.

[20] George, Alan, Heath, Michael, Liu, Joseph, and Ng, Esmond. "Solution of sparse positive definite systems on a hypercube." Journal of Computational and Applied Mathematics 27 (1989).1-2: 129–156.

[21] George, Alan and Heath, Michael T. "Solution of sparse linear least squares problems using Givens rotations." Linear Algebra and its Applications 34 (1980): 69–83.

[22] George, Alan and Liu, Joseph WH. "Householder reflections versus Givens rotations in sparse orthogonal decomposition." Linear Algebra and its Applications 88 (1987): 223–238.

[23] ———. "The evolution of the minimum degree ordering algorithm." SIAM Review 31 (1989).1: 1–19.

[24] George, Alan and McIntyre, David R. "On the application of the minimum degree algorithm to finite element systems." Mathematical Aspects of Finite Element Methods. Springer, 1977, 122–149.

[25] Golub, Gene. "Numerical methods for solving linear least squares problems." Numerische Mathematik 7 (1965).3: 206–216.

[26] Hadfield, Steven Michael. On the LU factorization of sequences of identically structured sparse matrices within a distributed memory environment. Ph.D. thesis, Citeseer, 1994.

[27] Heath, Michael T. "Numerical methods for large sparse linear least squares problems." SIAM Journal on Scientific and Statistical Computing 5 (1984).3: 497–513.
[28] Jennings, Alan. "A compact storage scheme for the solution of symmetric linear simultaneous equations." The Computer Journal 9 (1966).3: 281–285.

[29] Karypis, George and Kumar, Vipin. "METIS: unstructured graph partitioning and sparse matrix ordering system, version 2.0." (1995).

[30] Karypis, George, Schloegel, Kirk, and Kumar, Vipin. "ParMETIS." Parallel graph partitioning and sparse matrix ordering library, Version 2 (2003).

[31] Kolodziej, Scott, Yeralan, Nuri, Davis, Tim, and Hager, William W. "Mongoose User Guide, Version 2.0.3." (2018).

[32] Li, Xiaoye S. "An overview of SuperLU: Algorithms, implementation, and user interface." ACM Transactions on Mathematical Software (TOMS) 31 (2005).3: 302–325.

[33] Lipton, Richard J, Rose, Donald J, and Tarjan, Robert Endre. "Generalized nested dissection." SIAM Journal on Numerical Analysis 16 (1979).2: 346–358.

[34] Liu, Joseph W. "A compact row storage scheme for Cholesky factors using elimination trees." ACM Transactions on Mathematical Software (TOMS) 12 (1986).2: 127–148.

[35] Liu, Joseph WH. "On general row merging schemes for sparse Givens transformations." SIAM Journal on Scientific and Statistical Computing 7 (1986).4: 1190–1211.

[36] ———. "The role of elimination trees in sparse factorization." SIAM Journal on Matrix Analysis and Applications 11 (1990).1: 134–172.

[37] ———. "A generalized envelope method for sparse factorization by rows." ACM Transactions on Mathematical Software (TOMS) 17 (1991).1: 112–129.

[38] Liu, Wai-Hung and Sherman, Andrew H. "Comparative analysis of the Cuthill-McKee and the reverse Cuthill-McKee ordering algorithms for sparse matrices." SIAM Journal on Numerical Analysis 13 (1976).2: 198–213.

[39] Markowitz, Harry M. "The elimination form of the inverse and its application to linear programming." Management Science 3 (1957).3: 255–269.

[40] McWhirter, JG. "Recursive least-squares minimization using a systolic array." Real-Time Signal Processing VI. Vol. 431. International Society for Optics and Photonics, 1983, 105–114.

[41] Natarajan, Ekanathan Palamadai. KLU: A high performance sparse linear solver for circuit simulation problems. Ph.D. thesis, University of Florida, 2005.

[42] Ohtsuki, Tatsuo, Cheung, Lap Kit, and Fujisawa, Toshio. "Minimal triangulation of a graph and optimal pivoting order in a sparse matrix." Journal of Mathematical Analysis and Applications 54 (1976).3: 622–633.
[43] Papadimitriou, Ch H. "The NP-completeness of the bandwidth minimization problem." Computing 16 (1976).3: 263–270.

[44] Pellegrini, Francois and Roman, Jean. "Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs." International Conference on High-Performance Computing and Networking. Springer, 1996, 493–498.

[45] Peng, Shaoyi and Tan, Sheldon X-D. "GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation." arXiv preprint arXiv:1908.00204 (2019).

[46] Preis, Robert and Diekmann, Ralf. The PARTY Partitioning-Library: User Guide, Version 1.1. Univ.-GH, FB Mathematik/Informatik, 1996.

[47] Reid, Matthew W. "Pivoting for LU Factorization." (2014).

[48] Rennich, Steven C, Stosic, Darko, and Davis, Timothy A. "Accelerating sparse Cholesky factorization on GPUs." Parallel Computing 59 (2016): 140–150.

[49] Rose, DJ, Whitten, GG, Sherman, AH, and Tarjan, RE. "Algorithms and software for in-core factorization of sparse symmetric positive definite matrices." Computers & Structures 11 (1980).6: 597–608.

[50] Rose, Donald J. "A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations." Graph Theory and Computing. Elsevier, 1972, 183–217.

[51] Rotella, F and Zambettakis, I. "Block Householder transformation for parallel QR factorization." Applied Mathematics Letters 12 (1999).4: 29–34.

[52] Schenk, Olaf, Gartner, Klaus, and Fichtner, Wolfgang. "Scalable parallel sparse factorization with left-right looking strategy on shared memory multiprocessors." International Conference on High-Performance Computing and Networking. Springer, 1999, 221–230.

[53] Schreiber, Robert. "A new implementation of sparse Gaussian elimination." ACM Transactions on Mathematical Software (TOMS) 8 (1982).3: 256–276.

[54] Tang, Meng, Gadou, Mohamed, and Ranka, Sanjay. "A Multithreaded Algorithm for Sparse Cholesky Factorization on Hybrid Multicore Architectures." Procedia Computer Science 108 (2017): 616–625.

[55] Tang, Meng, Gadou, Mohamed, Rennich, Steven C, Davis, Timothy A, and Ranka, Sanjay. "A Multilevel Subtree Method for Single and Batched Sparse Cholesky Factorization." Proceedings of the 47th International Conference on Parallel Processing. ACM, 2018, 50.

[56] Tinney, William F and Walker, John W. "Direct solutions of sparse network equations by optimally ordered triangular factorization." Proceedings of the IEEE 55 (1967).11: 1801–1809.
[57] van de Geijn, Robert A. "Notes on LU Factorization." (2014).

[58] Yang, Wei H. "A method for updating Cholesky factorization of a band matrix." Computer Methods in Applied Mechanics and Engineering 12 (1977).3: 281–288.

[59] Yannakakis, Mihalis. "Computing the minimum fill-in is NP-complete." SIAM Journal on Algebraic Discrete Methods 2 (1981).1: 77–79.

[60] Yanovsky, Igor. "QR Decomposition with Gram-Schmidt." University of California, Los Angeles (2012).

[61] Yeralan, Sencer Nuri, Davis, Timothy A, Sid-Lakhdar, Wissam M, and Ranka, Sanjay. "Algorithm 980: Sparse QR Factorization on the GPU." ACM Transactions on Mathematical Software (TOMS) 44 (2017).2: 17.
BIOGRAPHICAL SKETCH
Meng Tang received his Ph.D. from the University of Florida in 2020, his master's degree in computer engineering from the Chinese Academy of Sciences, and his bachelor's degree in computer science from Shanghai Jiaotong University. His area of research is high performance computing.