
GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling

Xiaoming Chen, Student Member, IEEE, Ling Ren, Yu Wang, Member, IEEE, and

Huazhong Yang, Senior Member, IEEE

Abstract—The sparse matrix solver by LU factorization is a serious bottleneck in Simulation Program with Integrated Circuit Emphasis (SPICE)-based circuit simulators. The state-of-the-art Graphics Processing Units (GPU) have numerous cores sharing the same memory, provide attractive memory bandwidth and compute capability, and support massive thread-level parallelism, so GPUs can potentially accelerate the sparse solver in circuit simulators. In this paper, an efficient GPU-based sparse solver for circuit problems is proposed. We develop a hybrid parallel LU factorization approach combining task-level and data-level parallelism on GPUs. Work partitioning, number of active thread groups, and memory access patterns are optimized based on the GPU architecture. Experiments show that the proposed LU factorization approach on NVIDIA GTX580 attains an average speedup of 7.02× (geometric mean) compared with sequential PARDISO, and 1.55× compared with 16-threaded PARDISO. We also investigate bottlenecks of the proposed approach by a parametric performance model. The performance of the sparse LU factorization on GPUs is constrained by the global memory bandwidth, so the performance can be further improved by future GPUs with larger memory bandwidth.

Index Terms—Graphics processing unit, parallel sparse LU factorization, circuit simulation, performance model


1 INTRODUCTION

The Simulation Program with Integrated Circuit Emphasis (SPICE) [1] is the most widely used circuit simulation kernel for transistor-level simulation in IC design and verification. In recent years, the rapid development of very large scale integration (VLSI) presents great challenges to the performance of SPICE-based simulators. In modern VLSIs, circuit matrices after post-layout extraction can easily reach a dimension of several million. Traditional SPICE-based simulators may take days or even weeks to perform transient simulations. There are two bottlenecks in a SPICE-based simulator: the sparse matrix solver by LU factorization and model evaluation. The two steps are repeated for thousands of rounds in a transient simulation, as shown in Fig. 1.

Parallelization of model evaluation is straightforward, but the sparse solver is difficult to parallelize because of the strong data dependence during LU factorization, the irregular structure of circuit matrices, and the low computation-to-memory ratio in sparse LU factorization. To accelerate the sparse solver in SPICE-based simulators, some parallel software solvers, such as ShyLU [2] and NICSLU [3], [4], [5], have been developed. They have proved efficient for circuit matrices on multicore CPUs.

In the last decade, state-of-the-art Graphics Processing Units (GPU) have proved useful in many scientific computing fields. GPUs have hundreds of cores, large register files, and high memory bandwidth; collectively, these computing and memory resources provide massive thread-level parallelism. Thus, GPUs potentially provide a better solution than multicore CPUs to accelerate SPICE-based circuit simulators. There have been some studies on GPU-based model evaluation [6], [7], [8]. Although some direct sparse solvers based on multifrontal [9] or supernodal algorithms have been developed on GPUs [10], [11], [12], [13], [14], [15], there is currently no work on a GPU-based direct solver targeted at circuit matrices. Circuit matrices are quite different from matrices from other applications: they are highly asymmetric and sparse, so multifrontal and supernodal solvers have been shown to perform poorly on them [16]. This raises a natural question: can solvers targeted at circuit matrices be accelerated by GPUs, and how can a sparse solver be mapped from multicore architectures to manycore architectures? In this paper, we develop a parallel sparse solver on GPUs, and experiments show that it is efficient for circuit problems. This paper extends our conference paper [17] with more results and a performance model. We make the following contributions.

• A GPU-based sparse solver for circuit problems is developed, which is based on parallel LU numeric factorization without pivoting. Different from the existing GPU-based direct solvers, our GPU solver does not use dense kernels, which makes it suitable for circuit matrices. More parallelism is explored for the manycore architecture of GPUs.

• X. Chen, Y. Wang, and H. Yang are with the Department of Electronic Engineering, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China. E-mail: [email protected], [email protected], [email protected].

• L. Ren is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139. E-mail: [email protected].

Manuscript received 16 Sept. 2013; revised 15 Jan. 2014; accepted 26 Feb. 2014. Date of publication 17 Mar. 2014; date of current version 6 Feb. 2015. Recommended for acceptance by F. Mueller. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2014.2312199


Task-level parallelism is proposed in the CPU version of NICSLU [3], [4], [5] to perform an inter-vector parallel algorithm, which is not sufficient for the thousands of threads running concurrently on a GPU, so intra-vector parallelism is also exposed on GPUs. Therefore, the GPU implementation is a hybrid approach combining both task-level and data-level parallelism. The implementation is optimized based on the GPU architecture.

• We present a parametric performance model to analyze the bottlenecks of the proposed GPU-based LU factorization. We investigate the relation between the achieved GPU performance and the GPU parameters using the proposed model. Experimental results indicate that the performance of the proposed sparse LU factorization on GPUs is constrained by the global memory bandwidth of GPUs. The performance model is universal and can also be used for other GPU applications.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces some background. Section 4 presents the proposed GPU-based sparse LU factorization approach and optimization strategies in detail. We show the experimental results and analysis in Section 5. A parametric performance model is developed in Section 6. Finally, Section 7 concludes the paper.

2 RELATED WORK

There are many LU factorization-based sparse solvers developed on CPUs. SuperLU [18] incorporates supernodes in the Gilbert-Peierls (G-P) left-looking algorithm [19], and the Basic Linear Algebra Subroutines (BLAS) [20] library is used to compute dense subblocks in supernodes to enhance the compute capability. Another solver, PARDISO [21], also utilizes supernodes. However, it is hard to form large supernodes in highly sparse matrices such as circuit matrices. Hence KLU [16], which is specially targeted at circuit matrices, directly adopts the G-P left-looking algorithm without supernodes. KLU is generally faster than BLAS-based sequential solvers for circuit matrices [16]. SuperLU and PARDISO have parallel packages, but KLU does not. NICSLU is a parallel version of the G-P left-looking algorithm, and it works efficiently for circuit matrices on multicore machines [3], [4], [5].

GPU-based dense LU factorization has been studied in [25], [26], [27], [28], which all belong to the MAGMA project [29]. The performance of these approaches is very promising, up to 80 percent of the GPU’s theoretical peak performance [26]. NVIDIA has developed a Compute Unified Device Architecture (CUDA)-based BLAS library for accelerating dense linear algebra computing [30].

In recent years, researchers have developed sparse direct solvers on GPUs [10], [11], [12], [13], [14], [15]. PARDISO has been mapped onto NVIDIA GPUs using single-precision floating-point kernels [10]; it uses the CUBLAS kernels to compute dense subblocks on GPUs. The other approaches [11], [12], [13], [14], [15] are all based on the multifrontal algorithm [9], and they all use dense kernels to compute dense subblocks during sparse LU factorization. We summarize these GPU-based direct solvers in Table 1. CHOLMOD [31], a Cholesky factorization-based direct solver, can now also be accelerated by CUDA BLAS. These GPU solvers all involve offloading the time-consuming dense kernels to GPUs in order to improve the overall performance, so they are suitable for matrices which are denser than circuit matrices but may not be suitable for highly sparse matrices. To our knowledge, there is currently no published work on a GPU-based direct solver for circuit matrices that does not use BLAS.

3 BACKGROUND

This section briefly introduces background on the left-looking algorithm, the two-mode scheduling algorithm in NICSLU, the NVIDIA GPU architecture, and the CUDA runtime model.

3.1 Left-Looking Algorithm

The pseudocode of the sparse left-looking algorithm [19] (without pivoting) is shown in Algorithm 1. This algorithm factorizes a matrix by processing columns in sequence. When factorizing column i, some columns on the left side will be used to update column i (i.e., data dependence). The dependence is determined by the structure of U: the columns that column i depends on are {j : U(j, i) ≠ 0, j < i}.

Fig. 1. Flow of SPICE-based transient simulation.

TABLE 1. Summary of Existing GPU-Based Direct Sparse Solvers


The left-looking algorithm is well suited for cache-based and shared-memory machines because it writes less data to main memory, compared with the right-looking algorithm.
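
To make the column dependence concrete, the following is a minimal sequential sketch of the left-looking re-factorization described above, written as C-style CUDA host code. The CSC arrays (Ap/Ai/Ax, Lp/Li/Lx, Up/Ui/Ux), the work vector x, and the function name are illustrative choices of ours, not the paper's data structures; the sketch assumes that U(:, i) stores rows 1..i including the diagonal, that L(:, i) stores only the strictly lower triangular part (unit diagonal implicit), and that x is zero-initialized.

/* Minimal sequential sketch (not the paper's code) of the G-P left-looking
 * re-factorization without pivoting.  All matrices are in compressed sparse
 * column (CSC) form and the symbolic patterns of L and U are fixed in advance. */
void left_looking_refactor(int n,
        const int *Ap, const int *Ai, const double *Ax,   /* values of A         */
        const int *Lp, const int *Li, double *Lx,         /* pattern/values of L */
        const int *Up, const int *Ui, double *Ux,         /* pattern/values of U */
        double *x)                                        /* dense work vector   */
{
    for (int i = 0; i < n; i++) {
        /* scatter A(:, i) into the dense work vector x */
        for (int p = Ap[i]; p < Ap[i + 1]; p++) x[Ai[p]] = Ax[p];

        /* columns that column i depends on: { j : U(j, i) != 0, j < i } */
        for (int p = Up[i]; p < Up[i + 1]; p++) {
            int j = Ui[p];
            if (j >= i) continue;                   /* skip the diagonal entry */
            double xj = x[j];
            for (int q = Lp[j]; q < Lp[j + 1]; q++) /* x(j+1:n) -= x(j)*L(j+1:n, j) */
                x[Li[q]] -= xj * Lx[q];
        }

        /* gather: U(1:i, i) = x(1:i), L(i+1:n, i) = x(i+1:n) / x(i) */
        double diag = x[i];
        for (int p = Up[i]; p < Up[i + 1]; p++) { Ux[p] = x[Ui[p]]; x[Ui[p]] = 0.0; }
        for (int q = Lp[i]; q < Lp[i + 1]; q++) { Lx[q] = x[Li[q]] / diag; x[Li[q]] = 0.0; }
    }
}

The inner multiply-and-add loop is the operation that the GPU implementation in Section 4 parallelizes across the threads of a warp.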

3.2 Two-Mode Scheduling in NICSLU

Based on the dependence analysis in the previous section, the dependences between columns are described by a directed acyclic graph (DAG) [4], as shown in Fig. 2. The DAG is further partitioned into different levels. A two-mode scheduling algorithm combining cluster mode and pipeline mode is proposed in NICSLU [3], [4], [5] to perform parallel LU factorization. Cluster mode is used for levels that have many columns: these levels are processed in sequence, but the columns in one level are processed in parallel, and for each level, columns are evenly assigned to work threads. Pipeline mode is used for the remaining levels, which have fewer columns each; these levels, despite their dependences, are computed concurrently by a pipeline-like algorithm. The proposed GPU-based sparse LU factorization approach is based on the two-mode scheduling algorithm. The details of the two modes can be found in [4].
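
As an illustration of how such levels can be derived, the hedged sketch below computes each column's level as the length of the longest dependence chain ending at it, using only the fixed structure of U; the function and array names are ours, and this is not necessarily how NICSLU implements its levelization.

/* Hedged sketch: levelize the column dependence DAG.  Column i depends on
 * every column j with U(j, i) != 0 and j < i; its level is one more than the
 * maximum level of its dependencies.  Returns the number of levels. */
int levelize_columns(int n, const int *Up, const int *Ui,
                     int *level /* out: level of each column, length n */)
{
    int max_level = 0;
    for (int i = 0; i < n; i++) {
        int lv = 0;
        for (int p = Up[i]; p < Up[i + 1]; p++) {
            int j = Ui[p];
            if (j < i && level[j] + 1 > lv) lv = level[j] + 1;
        }
        level[i] = lv;
        if (lv > max_level) max_level = lv;
    }
    /* Columns sharing a level are mutually independent: the first (large)
     * levels are handled in cluster mode, the remaining ones in pipeline mode. */
    return max_level + 1;
}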

3.3 GPU Architecture and CUDA Runtime Model

A rough diagram of the NVIDIA GPU architecture [32] is shown in Fig. 3. A GPU consists of several streaming multiprocessors (SM), and each SM is composed of many streaming processors (SP), L1 cache, shared memory, and some special functional units. In the NVIDIA Fermi architecture, each SM has 32 SPs; in the latest Kepler architecture, each SM has 192 SPs. The off-chip memory is called global memory, which has much longer access latency than on-chip storage.

NVIDIA GPUs execute programs in a single-instruction-multiple-thread (SIMT) manner. SMs execute threads in groups of 32 concurrent threads called a warp. Several warps are grouped into a thread block. One thread block is executed on one SM, and one SM can hold several concurrent thread blocks. Data can be easily shared within a thread block through the on-chip shared memory.
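
Because the solver later assigns one column to one warp (Section 4.2), it is convenient to identify which warp, and which lane within the warp, a thread belongs to. The small device-side fragment below is a generic CUDA idiom for this, not code taken from the paper, and it assumes the block size is a multiple of 32.

/* Generic CUDA idiom (illustrative): identify the warp and lane of a thread. */
__device__ inline void warp_identity(int *warp_id, int *lane_id)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index      */
    *warp_id = tid >> 5;                               /* 32 threads form one warp */
    *lane_id = threadIdx.x & 31;                       /* position inside the warp */
}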

4 SPARSE LU FACTORIZATION ON GPUS

In this section, we present the GPU-based sparse LU factorization approach and the optimization strategies in detail. First, an overall framework is presented, and then optimization strategies are discussed.

4.1 Overall Framework

The three shaded rectangles in Fig. 1 together form the sparse solver. Fig. 4 shows the overall framework of the proposed GPU-based sparse solver. It is actually a CPU-GPU hybrid solver, in which the most time-consuming step, numeric LU factorization, is implemented on the GPU, while the host CPU performs the scheduling for the GPU and executes the other steps. The proposed solver focuses on GPU-based numeric factorization to accelerate circuit simulation.

The pre-processing step performs row/column reordering to reduce fill-ins. The HSL_MC64 algorithm [33], [34] is also applied in pre-processing to pre-scale the matrix and enhance the numeric stability. During the iterations of circuit simulation, the structure of the circuit matrix is fixed, so only the first factorization is performed with pivoting; subsequent factorizations use the fixed pivoting order and the fixed structure of the LU factors obtained by the first factorization [16]. In circuit simulation, although the values of the matrix elements change during iterations, they cannot change much, according to the physics of circuits, so this approach can ensure the numeric stability of subsequent factorizations. This flow was claimed to be effective in digital circuit simulation [35]. A subsequent factorization without pivoting is also called a re-factorization.

Fig. 2. Two-mode scheduling in NICSLU.

Fig. 3. Rough diagram of the NVIDIA GPU architecture.

Fig. 4. Framework of the sparse solver on GPU.


Pre-processing is executed just once and its time cost is generally much less than that of factorization, so its time cost is not of interest. The right-hand-solving step costs much less time than factorization (a test on NICSLU [3], [4], [5] shows that the right-hand-solving step costs on average less than 1 percent of the sequential factorization time), so LU re-factorization by the sparse left-looking algorithm is the most time-consuming step.

The GPU-based numeric re-factorization uses the parallelization strategies mentioned in Section 3.2. In cluster mode, each level launches a GPU kernel, while in pipeline mode, only one kernel is launched. Before each re-factorization executed on the GPU, the values of A are written to the GPU’s global memory, and when the GPU finishes one re-factorization, the values of the LU factors are copied from the GPU’s global memory to the host memory. The right-hand-solving step is executed on the host CPU.
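
A hedged host-side sketch of this per-iteration flow is shown below: copy the new values of A to the device, launch one kernel per cluster-mode level, launch a single pipeline-mode kernel, and copy the LU values back. The kernel names, signatures, and launch configuration are illustrative assumptions, not the paper's actual interfaces.

#include <cuda_runtime.h>

/* Illustrative kernel declarations (parameters beyond the level index omitted). */
__global__ void cluster_level_kernel(int level);
__global__ void pipeline_mode_kernel(int first_level);

/* Hedged sketch of one GPU re-factorization; not the paper's code. */
void gpu_refactorize(int nnzA, int nnzLU,
                     const double *Ax_host, double *LUx_host,
                     double *d_Ax, double *d_LUx,
                     int n_cluster_levels, int blocks, int threads_per_block)
{
    /* 1. send the new numerical values of A (the symbolic pattern is fixed) */
    cudaMemcpy(d_Ax, Ax_host, nnzA * sizeof(double), cudaMemcpyHostToDevice);

    /* 2. cluster mode: one kernel launch per level */
    for (int lv = 0; lv < n_cluster_levels; lv++)
        cluster_level_kernel<<<blocks, threads_per_block>>>(lv);

    /* 3. pipeline mode: a single kernel processes all remaining levels */
    pipeline_mode_kernel<<<blocks, threads_per_block>>>(n_cluster_levels);

    /* 4. copy the computed values of the LU factors back to the host */
    cudaMemcpy(LUx_host, d_LUx, nnzLU * sizeof(double), cudaMemcpyDeviceToHost);
}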

4.2 Exploring More Parallelism

As shown in Algorithm 1, the core operation in the sparse left-looking algorithm is the vector multiplication-and-add (MAD) (line 6). Cluster mode and pipeline mode are proposed in [4] to describe the parallelism between vector MAD operations. Take Fig. 5 as an example to explain this level of parallelism. Columns 1∼7 are finished, and columns 8∼10 are being processed. The MAD operations represented by solid arrows are executable. Parallelism exists among these operations, although the numeric updates to the same column must be executed in a strict order. This granularity of parallelism is called inter-vector parallelism, which can also be called task-level parallelism.

The inter-vector/task-level parallelism alone cannot take full advantage of a state-of-the-art GPU’s high memory bandwidth and compute capability. We therefore expose another intrinsic granularity of parallelism in sparse LU factorization: intra-vector parallelism, which is also called data-level parallelism. This level of parallelism means that one vector is computed by multiple threads simultaneously, which matches the SIMT nature of GPUs. More specifically, each of the vector operations (lines 3, 6, 8 and 9) shown in Algorithm 1 is computed by multiple concurrent threads. We now consider how to partition the workload to fully utilize the GPU resources. Several factors should be considered.

For convenience, the threads that process the same column are called a virtual group. All threads in a virtual group operate on elements of the same column and must synchronize. The size of a virtual group should be carefully decided. On the one hand, virtual groups should not be too large: if the size of a virtual group is larger than the number of nonzeros in a column, some threads will idle, so smaller virtual groups reduce idle threads. On the other hand, virtual groups should not be too small either, for two reasons. First, too small virtual groups result in too few threads in total (the number of virtual groups is limited by the storage space and cannot be very large), leading to too few concurrent columns, which cannot fully utilize the GPU’s compute capability. Second, GPUs schedule threads in a SIMT manner; if threads within a warp diverge, the different branches are executed serially. Different virtual groups process different columns and hence often diverge. Thus, too small virtual groups increase divergence within SIMT threads.

In cluster mode, columns are very sparse, so while ensuring enough threads in total, virtual groups should be as small as possible to minimize idle threads. In pipeline mode, columns usually contain enough nonzeros for a warp or several warps, so the size of virtual groups matters little in the sense of reducing idle threads. Taking all these factors into consideration, we use one warp as one virtual group in both modes. This strategy not only reduces divergence within SIMT threads, but also saves synchronization costs, since synchronization of threads within a warp is implicitly guaranteed by the SIMT nature of GPUs.
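
A hedged device-side sketch of this choice is given below: the 32 lanes of the warp that owns column i stride over the nonzeros of a finished column j and apply the MAD update x(j+1:n) −= x(j) × L(j+1:n, j) to the warp's uncompressed working vector in global memory. The names are illustrative, and L(:, j) is assumed to store only its strictly lower part, so every touched row is distinct.

/* Hedged sketch: one warp (one virtual group) updates column i with a finished
 * column j; x is the uncompressed working vector assigned to this warp. */
__device__ void warp_mad_update(int j, int lane_id,
                                const int *Lp, const int *Li, const double *Lx,
                                double *x)
{
    double xj = x[j];                          /* every lane reads the same x(j)  */
    for (int q = Lp[j] + lane_id; q < Lp[j + 1]; q += 32) {
        int row = Li[q];                       /* consecutive lanes touch          */
        x[row] -= xj * Lx[q];                  /* consecutive nonzeros: coalesced  */
    }
}

Accesses to Li and Lx are naturally coalesced here; the sorting step described in Appendix A.2 further improves the locality of the scattered accesses to x.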

The implementation details and optimization strategies, including synchronization for timing order and memory access pattern optimization, can be found in Appendix A, which is available on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2014.2312199.

5 EXPERIMENTAL RESULTS

5.1 Experiment Setup

All the experiments are performed on the three platforms shown in Table 2. The GPU codes are programmed using CUDA 5.0 [32] and compiled with -O3 optimization. Twenty-three circuit matrices from the University of Florida Sparse Matrix Collection [36] are used to evaluate the GPU solver.

Fig. 5. Parallelism in left-looking algorithm.

TABLE 2. Specifications of Computing Platforms


The performance of LU factorization on GPUs is compared with NICSLU [3], [4], [5] (optimized by hand-written SSE2 instructions and compiled with -O2 optimization), KLU [16] (compiled with -O3 optimization), and PARDISO [21] from the Intel Math Kernel Library (MKL). We show the main results here; some additional results are given in Appendix B, available online.

5.2 Results of LU Factorization on GPUs

5.2.1 Performance and Speedups

Table 3 shows the results of the proposed LU factorization approach on NVIDIA GTX580 and NVIDIA K20x, and the speedups compared with KLU and PARDISO, which are implemented on the host CPU. All the benchmarks are sorted by the average number of floating-point operations (flop) per nonzero, flop/NNZ(L+U−I). The listed time in Table 3 is only for numeric re-factorization, excluding pre-processing and right-hand solving. Some data have to be transferred between the host memory and the device memory in every re-factorization (see Fig. 4); the time for these transfers is included in the GPU re-factorization time.

Our GPU-based LU re-factorization outperforms KLU re-factorization for most of the benchmarks. The speedups tend to be higher for matrices with bigger flop/NNZ(L+U−I), which indicates that GPUs are more suitable for problems that are more computationally demanding. The average speedup is 24.24× achieved by GTX580 and 19.86× achieved by K20x. Since the speedups differ greatly for different matrices, the geometric means of the speedups are also listed in Table 3.

Compared with sequential PARDISO, the GPU-based LU re-factorization is also faster than PARDISO for most of the benchmarks. The speedups tend to be higher for matrices with smaller flop/NNZ(L+U−I); PARDISO utilizes BLAS to compute dense subblocks, so it performs better for matrices with bigger flop/NNZ(L+U−I). The average speedup of the GPU-based LU re-factorization is 18.77× achieved by GTX580 and 15.72× achieved by K20x. The geometric mean of the speedups is 7.02× achieved by GTX580.

Compared with 16-threaded PARDISO, the GPU-based LU re-factorization is faster than PARDISO for about half of the benchmarks and slower for the other half. The average speedup is 1.55× (geometric mean) achieved by GTX580 and 1.21× achieved by K20x. Fig. 6a shows the average speedup of the GPU-based LU re-factorization (implemented on GTX580) compared with PARDISO using different numbers of threads. On average, the GPU-based LU re-factorization approach outperforms multi-threaded PARDISO.

Fig. 6b shows the average speedup of the GPU-based LU re-factorization (implemented on GTX580) compared with NICSLU re-factorization. Since the speedups differ little for different matrices in these results, only arithmetic means are presented.

TABLE 3. Performance of LU Factorization on GPUs, and the Speedups over KLU (re-factorization) and PARDISO

Fig. 6. Average speedups. The GPU results are obtained on GTX580.


Fig. 6b indicates that the performance of the GPU-based LU re-factorization is similar to that of 10-threaded NICSLU. When the number of threads of NICSLU is larger than 10, NICSLU outperforms the GPU implementation. The average speedup of the GPU-based LU re-factorization compared with 16-threaded NICSLU is 0.78×.

Although the theoretical peak performance of K20x is about 6× larger than that of GTX580 (for double precision), the results in Table 3 indicate that the achieved performance on K20x and GTX580 is similar. The main reason is that the performance of sparse LU factorization on GPUs is bandwidth-bound. We will explain this point in Section 6. The effective global memory bandwidth of K20x (with ECC on) is very close to that of GTX580, so the higher compute capability of K20x cannot be exploited.

5.2.2 Bottleneck Analysis

The achieved average bandwidth over all benchmarks is 53.81, 51.05, and 92.75 GB/s, obtained on NVIDIA GTX580, NVIDIA K20x, and Intel E5-2690 × 2 (implemented by 16-threaded NICSLU), respectively. For CPU execution, some memory accesses are served by caches, so the bandwidth may greatly exceed the theoretical peak of the main memory. For different platforms, the factors that restrict the performance are different. For CPUs, large caches and high clock frequency can greatly improve the computational efficiency. However, for GTX580 and K20x, disabling the L1 cache results in less than a 5 percent performance loss. It is difficult to evaluate the effect of the GPU’s L2 cache, since there is no way to disable the L2 cache of current NVIDIA GPUs programmed using CUDA. Since the GPU’s L2 cache is much smaller than a modern CPU’s last level cache (LLC) and there are many more cores sharing the LLC on a GPU, it is expected that the LLC has less effect on GPUs than on CPUs. The main bottleneck of the proposed LU re-factorization approach on GPUs is the peak global memory bandwidth.

We have found that processing too many columns concurrently may decrease the performance on GPUs. Fig. 7 shows the giga floating-point operations per second (Gflop/s) of three matrices obtained on GTX580 and K20x, using different numbers of resident warps per SM. The best performance is attained with about 24 (for GTX580) / 40 (for K20x) resident warps per SM, rather than with the maximum number of resident warps per SM. Although the reported bandwidth in Table 3 does not reach the peak bandwidth of the GPU, Fig. 7 indicates that we have already fully utilized the bandwidth with 24 (for GTX580) / 40 (for K20x) resident warps per SM; in other words, the actual bandwidth is saturated at that point. When there are more warps, the performance is constrained and becomes lower. To explain this phenomenon in detail, we build a parametric performance model in Section 6.

6 PERFORMANCE EVALUATION OF LU FACTORIZATION ON GPUS

In this section, a parametric performance model is built to explain the phenomenon shown in Fig. 7 in detail. With this model, we can investigate the relation between the achieved GPU performance and GPU parameters such as the global memory bandwidth and the number of warps, which helps us better understand the bottlenecks of the solver and choose the optimal number of work threads on the given hardware platform. The proposed model is universal and can also be used for other GPU applications.

A CWP-MWP model (CWP and MWP are short for computation warp parallelism and memory warp parallelism) is proposed in [37]. In that model, a warp that is waiting for memory data is called a memory warp, and MWP represents the number of memory warps that can be handled concurrently on one SM. We have attempted to use the CWP-MWP model to evaluate the performance on NVIDIA GTX580, but it gives incorrect results, as shown in Fig. 8. The reason is that MWP in the CWP-MWP model is very small (4∼7 for different matrices), so when the number of resident warps per SM exceeds MWP, the execution time is greatly affected by the global memory latencies, leading to unreasonable predicted results.

In this paper, a better performance model is proposed, which improves the CWP-MWP model in the following ways.

• The computation of MWP is changed: MWP is hardware-specific and directly constrained by the peak bandwidth of the global memory.

• The concepts of MWP_req and warp_exp, which have clear physical meanings, are introduced to illustrate the different cases of warp scheduling.

• The effect of the L2 cache is considered.

• The CWP-MWP model performs static analysis on the compiled code to predict performance.

Fig. 7. Achieved GPU performance, using different numbers of resident warps per SM.

Fig. 8. Predicted LU factorization time by the CWP-MWP model [37], for ASIC_100ks, obtained on GTX580.


However, in the pipeline mode of sparse LU factorization (see Appendix A.1, available online), each warp waits for some other warps to finish, which is a run-time behavior, and the time cost of this waiting cannot be determined at compilation time. Our solution is based on a DAG analysis that avoids such dynamic behaviors.

Based on the performance model, the execution cycles of each task (i.e., node) in a DAG can be calculated, and then we calculate the execution cycles of the whole DAG by a well-known method which finds the critical path of the DAG (see Appendix C.2, available online, for details).

6.1 Parametric Performance Model

GPUs hide global memory latencies by warp switching, as shown in Fig. 9. When a warp issues a global memory request, the SIMT engine switches to another warp that is not waiting for data. Our model is based on this basic warp scheduling behavior of GPUs.

6.1.1 Parameters

The parameters used in the model are shown in Table 4. In the following, the most important parameters are introduced.

mem_cycle represents the average number of cycles of each memory period, as marked in Fig. 9. Considering the cache miss rate miss_rate, mem_cycle is expressed as a weighted combination of global_cycle and cache_cycle:

mem_cycle = global_cycle × miss_rate + cache_cycle × (1 − miss_rate).   (1)

miss_rate is measured on the target GPU using a micro-benchmark; see Appendix C.1, available online, for details.

MWP represents the maximum number of memory warps that can be served concurrently on one SM. It is hardware-specific and directly constrained by the peak bandwidth of the global memory:

MWP = (peak_bandwidth × mem_cycle) / (#SM × freq × load_bytes_per_warp),   (2)

where peak_bandwidth, #SM, and freq are physical parameters of the GPU, and load_bytes_per_warp is the average data size accessed from the global memory in each memory period. Since a MAD operation involves one int read, two double reads, and one double write, the equivalent data size accessed from the global memory, considering the cache effect, is expressed as:

load_bytes_per_warp = (size_of(int) + 3 × size_of(double)) / 4 × 32 × miss_rate.   (3)

MWP_req is the number of concurrent memory warps requested by the invoked warps on one SM. As can be seen from Fig. 9, if the GPU can support mem_cycle / comp_cycle concurrent memory warps on one SM, all the memory requests can be served without delay. So MWP_req is given by

MWP_req = MIN(mem_cycle / comp_cycle, #warp).   (4)

warp_exp is the expected number of concurrent warps on one SM such that the memory latencies can be completely hidden. Again taking Fig. 9 as an example, to hide the latency of one memory period, there should be at least mem_cycle / comp_cycle + 1 warps running concurrently, so

warp_exp = mem_cycle / comp_cycle + 1.   (5)
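
The following is a hedged C-style sketch that evaluates Equations (1)-(5) for a given set of hardware and kernel parameters. The model_params structure mirrors Table 4, but the struct, its field names, and the function are our own illustration, not code from the paper.

/* Hedged sketch: evaluate Equations (1)-(5) of the performance model. */
typedef struct {
    double peak_bandwidth;   /* global memory peak bandwidth, bytes per second */
    double freq;             /* SM clock frequency, Hz                         */
    int    num_sm;           /* #SM                                            */
    double global_cycle;     /* latency of a global memory access, cycles      */
    double cache_cycle;      /* latency of an L2 hit, cycles                   */
    double miss_rate;        /* measured L2 miss rate                          */
    double comp_cycle;       /* computing cycles between two memory accesses   */
    int    num_warp;         /* resident warps per SM (#warp)                  */
} model_params;

void evaluate_model(const model_params *p, double *mem_cycle,
                    double *mwp, double *mwp_req, double *warp_exp)
{
    /* (1) weighted memory period */
    *mem_cycle = p->global_cycle * p->miss_rate
               + p->cache_cycle * (1.0 - p->miss_rate);

    /* (3) average bytes requested from global memory per warp-level access */
    double load_bytes_per_warp =
        (sizeof(int) + 3.0 * sizeof(double)) / 4.0 * 32.0 * p->miss_rate;

    /* (2) memory warps the hardware can serve concurrently per SM */
    *mwp = p->peak_bandwidth * (*mem_cycle)
         / (p->num_sm * p->freq * load_bytes_per_warp);

    /* (4) memory warps requested by the resident warps */
    double req = *mem_cycle / p->comp_cycle;
    *mwp_req = (req < p->num_warp) ? req : p->num_warp;

    /* (5) warps needed to completely hide the memory latency */
    *warp_exp = *mem_cycle / p->comp_cycle + 1.0;
}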

6.1.2 Three Cases in Warp Scheduling

Based on the above parameters, there are three cases in the warp scheduling of GPUs, according to the number of resident warps per SM, as illustrated in Fig. 10.

Case 1 (Fig. 10a). If #warp ≥ warp_exp and MWP ≥ MWP_req, there are enough warps running to hide the global memory latencies, and the memory bandwidth is also sufficient to support all the concurrent memory warps. In the timeline, each memory period follows the corresponding computing period closely and no delay occurs. The total runtime is mainly determined by the computing periods. For a given warp that is repeated for #repeat iterations (#repeat = 3 in Fig. 10a), the total number of execution cycles is:

total_cycle_per_warp = (#repeat − 1) × comp_cycle × #warp + comp_cycle + mem_cycle.   (6)

Fig. 9. Illustration of warp scheduling and some parameters.

TABLE 4. Parameters Used in the Model, for GTX580



Case 2 (Fig. 10b). If MWP < MWP_req, the limited memory bandwidth cannot support all the memory warps concurrently. For example, in Fig. 10b, MWP = 3 and MWP_req = 4, so the requested memory warps cannot be handled concurrently and they are served by queueing. Each memory period cannot closely follow its corresponding computing period, leading to some delay in the subsequent computing periods and memory periods. Note that Fig. 10b is plotted from the beginning of execution; the first two iterations are a little different from the subsequent iterations. The number of execution cycles of a warp is mainly determined by the memory periods:

total_cycle_per_warp = #repeat × mem_cycle × #warp / MWP.   (7)

Case 3 (Fig. 10c). If #warp < warp_exp and MWP ≥ MWP_req, the warps are too few to hide the memory latency (while the memory bandwidth is sufficient to support all the concurrent memory warps). This case is simple and we have

total_cycle_per_warp = #repeat × (comp_cycle + mem_cycle).   (8)
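
Putting the three cases together, a hedged sketch of the per-warp cycle count follows; it simply selects among Equations (6)-(8) based on the parameters computed by the model, and the function name and argument list are our own.

/* Hedged sketch: total execution cycles of one warp, Equations (6)-(8). */
double total_cycle_per_warp(int n_repeat, int n_warp,
                            double comp_cycle, double mem_cycle,
                            double mwp, double mwp_req, double warp_exp)
{
    if (mwp < mwp_req)                          /* Case 2: bandwidth-bound, Eq. (7) */
        return n_repeat * mem_cycle * n_warp / mwp;

    if (n_warp >= warp_exp)                     /* Case 1: latency hidden, Eq. (6)  */
        return (n_repeat - 1) * comp_cycle * n_warp + comp_cycle + mem_cycle;

    return n_repeat * (comp_cycle + mem_cycle); /* Case 3: too few warps, Eq. (8)   */
}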

6.2 Discussion

We have validated the accuracy of the proposed model; please see Appendix C.3, available online, for details. Here we explain why using the maximum number of resident warps per SM is not the best choice on GPUs, taking GTX580 as an example.

We show the results of two matrices in Fig. 11. The proposed performance model indicates that MWP ≈ 23∼26 and warp_exp ≈ 40∼50 on GTX580, for different benchmarks. When #warp ≤ 24 (case 3), warps are too few to hide memory latencies, and the global memory bandwidth is sufficient to support all these warps, so the performance increases with more warps. When #warp ≥ 28 (case 2), warps are still too few to hide memory latencies, but on the other hand, there are so many warps that the global memory bandwidth cannot support all the memory warps concurrently, so the performance is limited by the memory bandwidth and decreases with more warps. Consequently, #warp = 24 is the best choice on NVIDIA GTX580.

In sparse LU factorization, the computation-to-memory ratio is very low (there are only 2∼5 computational instructions between two global memory instructions), leading to a low comp_cycle (< 10) and a high warp_exp (40∼50 for GTX580). Case 1 is the best case in GPU computing, but it does not occur in the proposed LU factorization approach on our GPUs, since warp_exp is larger than the maximum number of resident warps per SM.

It is necessary to explain why the bandwidth reported in Table 3 does not reach the peak bandwidth of the GPU, yet the performance is still constrained by the global memory bandwidth. The bandwidth reported in Table 3 only considers the necessary data sizes; we can regard it as the "effective bandwidth". However, GPUs work in a different way. NVIDIA GPUs access the global memory via aligned 32-, 64- or 128-byte memory transactions [32]. If a warp only needs a 1-byte datum, it still requests a 32-byte transaction; the unused 31 bytes are unnecessary but they still occupy the memory bandwidth. Consequently, the actual run-time memory bandwidth, which constrains the performance, is larger than the reported effective bandwidth. If the peak memory bandwidth were larger, there would still be great potential to further improve the performance, since currently we do not fully utilize the compute capability of GPUs.

7 CONCLUSIONS

This paper presents a GPU-based sparse direct solver intended for circuit simulation problems, which is based on parallel LU numeric factorization without pivoting. We have presented the GPU-based LU factorization approach in detail and analyzed its performance.

Fig. 10. Illustration of GPU execution.

Fig. 11. Predicted time and actual time (on GTX580), with different #warp.


A parametric performance model is built to investigate the bottlenecks of the proposed approach and to choose the optimal number of work threads on the given hardware platform.

The proposed LU factorization approach on NVIDIA GTX580 achieves on average a 7.02× speedup (geometric mean) compared with sequential PARDISO, and a 1.55× speedup compared with 16-threaded PARDISO. The performance of the proposed GPU solver is comparable to that of the existing GPU solvers listed in Table 1. The model-based analysis indicates that the performance of the GPU-based LU factorization is constrained by the global memory bandwidth. Consequently, if the global memory bandwidth becomes larger, the performance can be further improved.

One limitation of the current approach is that it can only be implemented on single-GPU platforms. For multiple-GPU platforms, blocked algorithms are required, and the data transfers between GPUs need to be carefully controlled. In addition, multiple GPUs can provide larger aggregate global memory bandwidth, so the performance is expected to be higher.

ACKNOWLEDGMENTS

This work was supported by the 973 project (2013CB329000), the National Science and Technology Major Project (2013ZX03003013-003), the National Natural Science Foundation of China (No. 61373026, No. 61261160501), and the Tsinghua University Initiative Scientific Research Program. This work was done when Ling Ren was at Tsinghua University. We would like to thank Prof. Haohuan Fu and Yingqiao Wang from the Center for Earth System Science, Tsinghua University, for lending us the NVIDIA Tesla K20x platform. We also thank the NVIDIA Corporation for donating GPUs.

REFERENCES

[1] L. W. Nagel, “SPICE 2: A computer program to simulate semiconductor circuits,” Ph.D. dissertation, Univ. California, Berkeley, CA, USA, 1975.

[2] S. Rajamanickam, E. Boman, and M. Heroux, “ShyLU: A hybrid-hybrid solver for multicore platforms,” in Proc. IEEE 26th Int. Parallel Distrib. Process. Symp., 2012, pp. 631–643.

[3] X. Chen, Y. Wang, and H. Yang, “An adaptive LU factorization algorithm for parallel circuit simulation,” in Proc. 17th Asia and South Pacific Design Autom. Conf., Jan. 30, 2012–Feb. 2, 2012, pp. 359–364.

[4] X. Chen, W. Wu, Y. Wang, H. Yu, and H. Yang, “An escheduler-based data dependence analysis and task scheduling for parallel circuit simulation,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 58, no. 10, pp. 702–706, Oct. 2011.

[5] X. Chen, Y. Wang, and H. Yang, “NICSLU: An adaptive sparse matrix solver for parallel circuit simulation,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 2, pp. 261–274, Feb. 2013.

[6] N. Kapre and A. DeHon, “Performance comparison of single-precision SPICE model-evaluation on FPGA, GPU, Cell, and multi-core processors,” in Proc. Int. Conf. Field Programmable Logic and Appl., Aug. 31, 2009–Sep. 2, 2009, pp. 65–72.

[7] K. Gulati, J. Croix, S. Khatri, and R. Shastry, “Fast circuit simulation on graphics processing units,” in Proc. Asia and South Pacific Design Autom. Conf., Jan. 2009, pp. 403–408.

[8] R. Poore, “GPU-accelerated time-domain circuit simulation,” in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2009, pp. 629–632.

[9] J. W. H. Liu, “The multifrontal method for sparse matrix solution: Theory and practice,” SIAM Rev., vol. 34, no. 1, pp. 82–109, 1992.

[10] M. Christen, O. Schenk, and H. Burkhart, “General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform,” in Proc. First Workshop General Purpose Process. Graph. Process. Units, 2007, pp. 1–8.

[11] G. P. Krawezik and G. Poole, “Accelerating the ANSYS direct sparse solver with GPUs,” in Proc. Symp. Appl. Accelerators High Perform. Comput., Jul. 2009, pp. 1–3.

[12] C. D. Yu, W. Wang, and D. Pierce, “A CPU-GPU hybrid approach for the unsymmetric multifrontal method,” Parallel Comput., vol. 37, no. 12, pp. 759–770, Dec. 2011.

[13] T. George, V. Saxena, A. Gupta, A. Singh, and A. Choudhury, “Multifrontal factorization of sparse SPD matrices on GPUs,” in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2011, pp. 372–383.

[14] R. F. Lucas, G. Wagenbreth, J. J. Tran, and D. M. Davis, “Multifrontal sparse matrix factorization on graphics processing units,” Los Angeles, CA, USA, Tech. Rep. ISI-TR-677, 2012.

[15] R. F. Lucas, G. Wagenbreth, D. M. Davis, and R. Grimes, “Multifrontal computations on GPUs and their multi-core hosts,” in Proc. 9th Int. Conf. High Perform. Comput. Comput. Sci., 2011, pp. 71–82.

[16] T. A. Davis and E. Palamadai Natarajan, “Algorithm 907: KLU, a direct sparse solver for circuit simulation problems,” ACM Trans. Math. Softw., vol. 37, pp. 36:1–36:17, Sep. 2010.

[17] L. Ren, X. Chen, Y. Wang, C. Zhang, and H. Yang, “Sparse LU factorization for parallel circuit simulation on GPU,” in Proc. 49th ACM/EDAC/IEEE Design Autom. Conf., Jun. 2012, pp. 1125–1130.

[18] J. W. Demmel, J. R. Gilbert, and X. S. Li, “An asynchronous parallel supernodal algorithm for sparse Gaussian elimination,” SIAM J. Matrix Anal. Appl., vol. 20, no. 4, pp. 915–952, 1999.

[19] J. R. Gilbert and T. Peierls, “Sparse partial pivoting in time proportional to arithmetic operations,” SIAM J. Sci. Statist. Comput., vol. 9, no. 5, pp. 862–874, 1988.

[20] J. J. Dongarra, J. D. Cruz, S. Hammerling, and I. S. Duff, “Algorithm 679: A set of level 3 basic linear algebra subprograms: Model implementation and test programs,” ACM Trans. Math. Softw., vol. 16, no. 1, pp. 18–28, Mar. 1990.

[21] O. Schenk and K. Gärtner, “Solving unsymmetric sparse systems of linear equations with PARDISO,” Future Generat. Comput. Syst., vol. 20, no. 3, pp. 475–487, Apr. 2004.

[22] G. Poole, Y. Liu, Y. cheng Liu, and J. Mandel, “Advancing analysis capabilities in ANSYS through solver technology,” in Proc. Electron. Trans. Numer. Anal., Jan. 2003, pp. 106–121.

[23] T. A. Davis, “Algorithm 832: UMFPACK v4.3—an unsymmetric-pattern multifrontal method,” ACM Trans. Math. Softw., vol. 30, no. 2, pp. 196–199, Jun. 2004.

[24] A. Gupta, M. Joshi, and V. Kumar, “WSMP: A high-performance shared- and distributed-memory parallel sparse linear solver,” IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, Tech. Rep. RC 22038 (98932), Apr. 2001.

[25] V. Volkov and J. Demmel, “Benchmarking GPUs to tune dense linear algebra,” in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2008, pp. 1–11.

[26] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, “Dense linear algebra solvers for multicore with GPU accelerators,” in Proc. IEEE Int. Symp. Parallel Distrib. Process., Workshops PhD Forum, Apr. 2010, pp. 1–8.

[27] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid GPU accelerated manycore systems,” Parallel Comput., vol. 36, pp. 232–240, Jun. 2010.

[28] J. Kurzak, P. Luszczek, M. Faverge, and J. Dongarra, “LU factorization with partial pivoting for a multicore system with accelerators,” IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 8, pp. 1613–1621, Aug. 2013.

[29] The Univ. Tennessee, Knoxville, TN, USA. The MAGMA Project. (2013). [Online]. Available: http://icl.cs.utk.edu/magma/index.html

[30] NVIDIA Corporation. (2012). CUDA BLAS. [Online]. Available: http://docs.nvidia.com/cuda/cublas/

[31] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, “Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate,” ACM Trans. Math. Softw., vol. 35, no. 3, pp. 22:1–22:14, Oct. 2008.

[32] NVIDIA Corporation. NVIDIA CUDA C Programming Guide. (2012). [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

[33] I. S. Duff and J. Koster, “The design and use of algorithms for permuting large entries to the diagonal of sparse matrices,” SIAM J. Matrix Anal. Appl., vol. 20, no. 4, pp. 889–901, Jul. 1999.

[34] I. S. Duff and J. Koster, “On algorithms for permuting large entries to the diagonal of a sparse matrix,” SIAM J. Matrix Anal. Appl., vol. 22, no. 4, pp. 973–996, Jul. 2000.


[35] R. Bahar, E. Frohm, C. Gaona, G. Hachtel, E. Macii, A. Pardo, and F. Somenzi, “Algebraic decision diagrams and their applications,” Formal Methods Syst. Design, vol. 10, no. 2/3, pp. 171–206, 1997.

[36] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011.

[37] S. Hong and H. Kim, “An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness,” in Proc. 36th Annu. Int. Symp. Comput. Archit., 2009, pp. 152–163.

Xiaoming Chen (S’12) received the BS degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2009, where he is currently working toward the PhD degree. His current research interests include power and reliability-aware circuit design methodologies, parallel circuit simulation, high-performance computing on GPUs, and GPU architecture. He was nominated for the Best Paper Award in ISLPED 2009 and ASPDAC 2012. He is a student member of the IEEE.

Ling Ren received the BS degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2012. He is currently working toward the PhD degree in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. His research interests include computer architecture, computer security, and parallel computing.

Yu Wang (S’05-M’07) received the BS degree and the PhD degree (with honor) from Tsinghua University, Beijing, China, in 2002 and 2007, respectively. He is currently an associate professor with the Department of Electronic Engineering, Tsinghua University. His current research interests include parallel circuit analysis, low-power and reliability-aware system design methodology, and application-specific hardware computing, especially for brain-related topics. He has authored and co-authored more than 90 papers in refereed journals and conference proceedings. He received the Best Paper Award in ISVLSI 2012. He was nominated five times for the Best Paper Award (ISLPED 2009, CODES 2009, twice in ASPDAC 2010, and ASPDAC 2012). He is the Technical Program Committee co-chair of ICFPT 2011, the Publicity co-chair of ISLPED 2011, and the Finance chair of ISLPED 2013. He is a member of the IEEE.

Huazhong Yang (M’97-SM’00) received the BS degree in microelectronics, and the MS and PhD degrees in electronic engineering from Tsinghua University, Beijing, China, in 1989, 1993, and 1998, respectively. In 1993, he joined the Department of Electronic Engineering, Tsinghua University, where he is currently a specially appointed professor of the Cheung Kong Scholars Program. He has authored and co-authored more than 200 technical papers and holds 70 granted patents. His current research interests include wireless sensor networks, data converters, parallel circuit simulation algorithms, nonvolatile processors, and energy-harvesting circuits. He is a senior member of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

CHEN ET AL.: GPU-ACCELERATED SPARSE LU FACTORIZATION FOR CIRCUIT SIMULATIONWITH PERFORMANCE MODELING 795

Page 11: 786 IEEE TRANSACTIONS ON PARALLEL AND ......it is hard to form large supernodes in highly sparse matrices such as circuit matrices. Hence KLU [16] which is specially targeted at circuit

1

GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling

(Supplementary Material)

Xiaoming Chen, Student Member, IEEE, Ling Ren, Yu Wang, Member, IEEE,

and Huazhong Yang, Senior Member, IEEE


APPENDIX A
IMPLEMENTATION DETAILS AND OPTIMIZATION STRATEGIES

A.1 Ensuring Timing Order on GPUs

Algorithm 1 shows the pipeline mode parallel left-looking algorithm. In this mode, appropriate timing order between columns must be guaranteed. Line 6 shows that a column must wait for its dependent columns to finish. If column k depends on column t, only after column t is finished can column k be updated by column t. We use Fig. 1 as an example to explain this. Columns 8∼10 are being processed simultaneously, and the other columns are finished. Column 9 can first be updated by columns 4, 6 and 7. But currently column 9 cannot be updated by column 8, since column 8 is not finished; column 9 must wait for column 8 to finish. The situation is similar for column 10.

We set a flag, which is marked as volatile, for each column to indicate whether the column is finished (false for unfinished and true for finished). Line 6 is executed by a while loop, and in the loop, the flag corresponding to the dependent column is checked. If the flag changes to true, the loop exits. Another work [1] uses a similar method to deal with the timing order in the case of one-way dependence.

Ensuring timing order on GPUs deserves special attention. The number of warps in a GPU kernel must be carefully controlled. This issue is related to the concept of resident warps (or active warps) [2] on GPUs. Unlike the context switch mechanism of CPUs, GPUs have large register files to hold lots of warps, and warps which can be switched for execution are resident in registers to ensure a zero-overhead context switch. Consequently, there is a maximum number of resident warps that the register file can handle. In addition to the register file, the shared memory usage can also restrict the maximum number of resident warps (our algorithm does not use any shared memory). If the number of assigned warps exceeds the maximum number of resident warps, some warps will not be resident at the beginning.

Algorithm 1 Pipeline mode algorithm.

1: for each warp in parallel do
2:   while there are still unfinished columns do
3:     Get a new unfinished column, say column i;
4:     Put A(:, i) into an intermediate vector x;
5:     for j = 1 : i−1 where U(j, i) ≠ 0 do
6:       Wait until column j is finished;
7:       x(j+1 : n) −= x(j) × L(j+1 : n, j);
8:     end for
9:     U(1 : i, i) = x(1 : i);
10:    L(i : n, i) = x(i : n) / x(i);
11:    Mark column i as finished;
12:   end while
13: end for
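
A hedged CUDA sketch of this pipeline-mode loop, using the volatile column-finished flags described above, is shown below. The data layout (CSC arrays, one uncompressed working vector per warp, round-robin column assignment) and all names are illustrative, warp-synchronous execution (pre-Volta GPUs such as the GTX580 and K20x) is assumed, and the flags of all cluster-mode columns are assumed to be set already; it is not the paper's actual kernel.

/* Hedged CUDA sketch of the pipeline-mode loop (cf. Algorithm 1); not the paper's code.
 * Assumptions: L(:, j) stores only the strictly lower part, U(:, i) contains the
 * diagonal, xs is zero-initialized, and each warp executes in lockstep. */
__global__ void pipeline_kernel(int n, int first_col,
        const int *Ap, const int *Ai, const double *Ax,
        const int *Lp, const int *Li, double *Lx,
        const int *Up, const int *Ui, double *Ux,
        double *xs,                 /* one uncompressed vector of length n per warp */
        volatile int *col_finished) /* 0 = unfinished, 1 = finished                  */
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
    int n_warps = (gridDim.x * blockDim.x) >> 5;
    int lane    = threadIdx.x & 31;
    double *x   = xs + (size_t)n * warp_id;

    for (int i = first_col + warp_id; i < n; i += n_warps) {
        for (int p = Ap[i] + lane; p < Ap[i + 1]; p += 32)     /* x = A(:, i)       */
            x[Ai[p]] = Ax[p];

        for (int p = Up[i]; p < Up[i + 1]; p++) {              /* dependent columns */
            int j = Ui[p];
            if (j >= i) continue;
            while (col_finished[j] == 0) { }                   /* wait (line 6)     */
            double xj = x[j];
            for (int q = Lp[j] + lane; q < Lp[j + 1]; q += 32)
                x[Li[q]] -= xj * Lx[q];                        /* warp-parallel MAD */
        }

        double diag = x[i];                                    /* pivot of column i */
        for (int p = Up[i] + lane; p < Up[i + 1]; p += 32) {   /* U(1:i, i) = x(1:i) */
            Ux[p] = x[Ui[p]]; x[Ui[p]] = 0.0;
        }
        for (int q = Lp[i] + lane; q < Lp[i + 1]; q += 32) {   /* L(i+1:n, i) = x/diag */
            Lx[q] = x[Li[q]] / diag; x[Li[q]] = 0.0;
        }
        __threadfence();                                       /* publish the results */
        if (lane == 0) col_finished[i] = 1;                    /* mark column i done  */
    }
}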

Fig. 1: Illustration of deadlock caused by non-resident warps (columns of the cluster-mode and pipeline-mode levels assigned to warps 1∼3; warp 3 is not resident in pipeline mode).

They have to wait for other resident warps to finish execution and then become resident.

In pipeline mode, we have to ensure that all the warps are resident from the beginning. If a column is computed by a non-resident warp, the columns depending on it have to wait for this column to finish. But in turn, the non-resident warp has no chance to become resident because


no resident warp can finish. This results in a deadlock.Fig. 1 is an illustration of a deadlock. Suppose we haveinvoked 3 warps on a GPU that supports only 2 residentwarps. There is no problem in cluster mode, since warps1 and 2 will eventually finish execution so warp 3 canstart. But in pipeline mode, columns 9 and 10 depend oncolumn 8, which is allocated to the non-resident warp 3,so the resident warps (warps 1 and 2) fall in deadlocks,waiting for column 8 forever. This in turn leaves nochance for warp 3 to become resident.

Therefore, the maximum number of columns that can be factorized concurrently in pipeline mode is exactly the maximum number of resident warps of the kernel. In sparse LU factorization, this number depends on the register usage, which can be obtained from the compiler or from profiling tools. The number of resident warps greatly influences the performance of the solver. We have also found that processing too many columns concurrently is undesirable. Our experiments in Section 5 of the main paper confirm this point, and a detailed explanation is given in Section 6 of the main paper.
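On later CUDA toolkits (6.5 and newer), the maximum number of resident warps can also be queried at run time through the occupancy API. The sketch below is our own illustration with a placeholder kernel signature; it computes an upper bound on the number of warps to launch so that every warp is resident from the beginning.

// Hedged sketch: query how many warps of a given kernel can be resident,
// so that a pipeline-mode launch never exceeds this number. The kernel
// signature is a placeholder, not the solver's real interface.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void pipeline_factor_kernel(const int *col_ptr, const int *row_idx,
                                       double *lu_val, volatile int *finished) { }

int max_resident_warps(int warps_per_block)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, pipeline_factor_kernel,
        warps_per_block * 32, 0 /* dynamic shared memory */);

    return blocks_per_sm * warps_per_block * prop.multiProcessorCount;
}

int main()
{
    printf("launch at most %d warps in pipeline mode\n", max_resident_warps(2));
    return 0;
}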

A.2 Optimization of Memory Access Patterns

Optimization of sparse LU factorization on GPUs is mainly about memory optimization. In this subsection, we discuss the data format for the intermediate vectors and the sorting process that yields more coalesced access patterns to the global memory.

A.2.1 Format of Intermediate Vectors

We have two alternative data formats for the intermediate vectors (x in Algorithm 1): compressed sparse column (CSC) vectors or uncompressed vectors. CSC vectors save space and can be placed in the shared memory, while uncompressed vectors have to reside in the global memory. Uncompressed vectors are preferred in this problem for two reasons. First, the CSC format is inconvenient for indexed accesses: we would have to use binary search, which is time-consuming even in the shared memory. Second, using too much shared memory reduces the maximum number of resident warps, which results in severe performance degradation:

resident warps per SM ≤ (size of shared memory per SM) / (size of a CSC vector)

Our tests show that using the uncompressed format is several times faster than using the compressed format.
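As an illustration (ours, not the solver's actual code), with an uncompressed working vector the column update in line 7 of Algorithm 1 becomes a directly indexed scatter, with the nonzeros of one column of L spread over the 32 lanes of a warp:

// Our own sketch: update the uncompressed vector x with column j of L.
__device__ void update_with_column(double *x,           // uncompressed vector of length n
                                   const int *Lrow,     // row indices of L(:, j) below the diagonal
                                   const double *Lval,  // corresponding values of L(:, j)
                                   int nnz_j,           // number of such nonzeros in column j
                                   double xj)           // x(j)
{
    for (int k = threadIdx.x & 31; k < nnz_j; k += 32)  // one warp handles the whole column
        x[Lrow[k]] -= xj * Lval[k];                     // x(row) -= x(j) * L(row, j)
}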

A.2.2 Improving Data Locality

Higher global memory bandwidth can be achieved on GPUs if memory access patterns are coalesced [2]. The nonzeros in L and U are out of order after the pre-processing and the first factorization performed on the host CPU, resulting in highly random memory access patterns. We sort the nonzeros in each column of L and U by their row indices to improve the data locality. As shown in Fig. 2, after sorting, neighboring nonzeros in each column are more likely to be processed by consecutive threads.

Fig. 2: More coalesced memory access patterns after sorting the nonzeros. (a) Unsorted nonzeros: the 32 threads of a warp access the global memory randomly. (b) Sorted nonzeros: more coalesced memory access.

Fig. 3: Performance increases after sorting the nonzeros (achieved memory bandwidth, in GB/s, for each benchmark matrix, unsorted vs. sorted).

In Fig. 3, we use the test benchmarks to show the effect of sorting. The achieved memory bandwidth increases significantly, by an average of 2.1×. It is worth mentioning that CPU-based sparse LU factorization also benefits from sorting the nonzeros, but the performance gain is tiny (less than 5% on average). Sorting the nonzeros is done by transposing the CSC storages of L and U twice. The time cost of sorting is, on average, about 20% of the time cost of one LU re-factorization on GPUs. In addition, sorting is performed only once, so its cost is negligible and is not a bottleneck in circuit simulation.
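One transpose pass of this double-transpose sort can be sketched as follows (our own host-side illustration; Ap/Ai/Ax are generic CSC arrays, not the solver's internal names). A single pass is essentially a counting sort by row index, so applying it twice returns the original matrix with every column's nonzeros sorted by row index.

// Sketch of one CSC transpose pass (counting sort by row index).
void csc_transpose(int n, int nnz,
                   const int *Ap, const int *Ai, const double *Ax,  // input matrix in CSC
                   int *Tp, int *Ti, double *Tx)                    // output: its transpose in CSC
{
    for (int i = 0; i <= n; ++i) Tp[i] = 0;
    for (int k = 0; k < nnz; ++k) Tp[Ai[k] + 1]++;     // count nonzeros in each row of A
    for (int i = 0; i < n; ++i) Tp[i + 1] += Tp[i];    // prefix sum: column pointers of the transpose
    for (int j = 0; j < n; ++j)                        // scatter column j of A
        for (int k = Ap[j]; k < Ap[j + 1]; ++k) {
            int dest = Tp[Ai[k]]++;                    // next free slot in column Ai[k] of the transpose
            Ti[dest] = j;
            Tx[dest] = Ax[k];
        }
    for (int i = n; i > 0; --i) Tp[i] = Tp[i - 1];     // undo the pointer advancing
    Tp[0] = 0;
}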

APPENDIX B: ADDITIONAL RESULTS

B.1 Relation Between the Achieved GPU Performance and Matrix Characteristics

We find that the achieved GPU performance strongly depends on the characteristics of the matrices, as shown in Fig. 4. The two subfigures are scatter plots in which each point represents a matrix; the results are obtained on GTX580. They show that the achieved GPU performance has an approximately logarithmic relation with the relative fill-in, NNZ(L+U−I)/NNZ(A), and with the average number of floating-point operations per nonzero, flop/NNZ(L+U−I). The fitted lines are also plotted in Fig. 4.


Fig. 4: Relation between the achieved GPU performance and matrix characteristics, obtained on NVIDIA GTX580. (a) Achieved Gflop/s vs. the average number of flop per nonzero, flop/NNZ(L+U−I). (b) Achieved Gflop/s vs. the relative fill-in, NNZ(L+U−I)/NNZ(A).

B.2 Comparison with GMRES in CUSP

The GPU solver is compared with a preconditioned generalized minimal residual (GMRES) [3] iterative solver from the CUSP library 0.3.1 [4]. The GMRES solver uses an approximate inverse preconditioner named nonsym_bridson_ainv, which is based on an incomplete factorization and the dropping strategy proposed in [5]. The results shown in Table 1 are obtained on K20x. They indicate that, for circuit matrices, the GMRES iterative solver is not competitive with our direct solver. We have also tried the other preconditioners provided by CUSP, but they do not improve the performance of GMRES. Iterative solvers need good preconditioners to ensure convergence, but circuit matrices are asymmetric and irregular and therefore hard to precondition. In addition, in circuit simulation the matrix entries change between iterations, so the preconditioner must be rebuilt in each iteration, which significantly increases the computational time of iterative solvers. As shown in Table 1, the preconditioner accounts for most of the computational time of the iterative solver. Even excluding the preconditioner, the time cost of the iterative solving step is still larger than that of the direct solver.

TABLE 1: Comparison with the preconditioned GMRES iterative solver in the CUSP library, obtained on K20x (times in seconds).

Matrix      |  Our approach: GPU factoriz. + CPU solve  |  GMRES in CUSP: GPU precond. + GPU solve
            |  transf.  factoriz.  solve    total       |  transf.  precond.    solve     total
add32       |  0.000    0.001      0.000    0.001       |  0.000    0.220       0.021     0.241
hcircuit    |  0.003    0.007      0.003    0.012       |  0.001    2.075       0.193     2.269
add20       |  0.000    0.000      0.000    0.001       |  0.000    0.377       0.189     0.566
bcircuit    |  0.005    0.003      0.003    0.010       |  0.001    1.042       36.930    37.972
circuit_4   |  0.002    0.029      0.002    0.033       |  0.001    0.386       40.024    40.411
scircuit    |  0.011    0.017      0.008    0.035       |  0.001    3.561       0.225     3.787
rajat03     |  0.001    0.003      0.000    0.005       |  0.000    0.091       0.097     0.188
coupled     |  0.002    0.017      0.001    0.020       |  0.000    1.973       0.167     2.140
rajat15     |  0.008    0.037      0.003    0.048       |  0.000    9.281       0.159     9.440
rajat18     |  0.005    0.072      0.003    0.080       |  0.001    42.358      0.293     42.652
raj1        |  0.033    0.268      0.019    0.320       |  0.002    1015.778    0.561     1016.341
transient   |  0.009    0.168      0.006    0.183       |  0.002    2340.344    0.487     2340.832
rajat24     |  0.018    0.336      0.013    0.367       |  0.003    2740.974    0.788     2741.764
asic_680k   |  0.029    0.985      0.024    1.038       |  fail
onetone2    |  0.006    0.049      0.002    0.057       |  0.000    0.657       0.157     0.814
asic_680ks  |  0.021    0.116      0.019    0.156       |  0.004    3.716       0.027     3.746
asic_320k   |  0.025    0.589      0.019    0.634       |  fail
asic_100k   |  0.019    0.342      0.009    0.369       |  0.001    1057.375    0.382     1057.758
asic_320ks  |  0.022    0.147      0.017    0.185       |  0.002    2.304       0.012     2.318
asic_100ks  |  0.017    0.141      0.008    0.165       |  0.001    0.766       0.079     0.847
onetone1    |  0.014    0.146      0.005    0.165       |  0.000    1.991       0.174     2.165
g2_circuit  |  0.085    0.805      0.035    0.925       |  0.001    1.288       16.053    17.343
twotone     |  0.052    0.777      0.018    0.846       |  0.001    16.977      0.218     17.197

B.3 Results of Right-Hand-Solving

Though the time cost of right-hand-solving is fairly small, it will affect the scalability of the solver if the LU factorization time is significantly reduced. The parallel strategy for right-hand-solving can be very similar to the parallel re-factorization algorithm proposed in [6]. NVIDIA researchers have also proposed a similar parallel triangular solving strategy [7], which is included in the cuSPARSE library [8] (cuSPARSE is an integrated library in CUDA). The comparison between cuSPARSE (version 5.0, executed on NVIDIA K20x) and CPU-based sequential right-hand-solving is shown in Fig. 5; the reported GPU time excludes the analysis time. The host CPU is on average 3.16× faster than the GPU for the right-hand-solving step. The right-hand-solving step involves far fewer floating-point operations than numeric factorization, leading to an even lower computation-to-memory ratio, especially for circuit matrices. Consequently, GPU-based right-hand-solving for circuit matrices is not expected to obtain high speedups. In addition, a recent study shows that the maximum speedup of parallel right-hand-solving on multicore CPUs is less than 2× regardless of the number of threads [9]. Our hybrid solver therefore uses sequential right-hand-solving executed on the CPU.

Fig. 5: Comparison of the right-hand-solving time (CPU solve + data transfer vs. GPU solve + data transfer) for each benchmark matrix. The CPU code is sequential; the GPU time excludes the analysis time.
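A minimal sketch of such a sequential CPU solve is given below. It is our own illustration, not the released code, and it assumes that the unit diagonal of L is not stored and that the diagonal of U is the last entry of each column; the solver's internal storage may differ.

// Sequential solve of L*U*x = b with L and U stored column-wise (CSC).
// Assumptions of this sketch: L's unit diagonal is not stored; U's diagonal
// is the last entry of each column (columns sorted by row index).
void solve_lu_csc(int n,
                  const int *Lp, const int *Li, const double *Lx,
                  const int *Up, const int *Ui, const double *Ux,
                  double *b)                        // in: right-hand side, out: solution x
{
    for (int j = 0; j < n; ++j)                     // forward substitution: L*y = b
        for (int p = Lp[j]; p < Lp[j + 1]; ++p)
            b[Li[p]] -= Lx[p] * b[j];
    for (int j = n - 1; j >= 0; --j) {              // backward substitution: U*x = y
        b[j] /= Ux[Up[j + 1] - 1];                  // divide by the diagonal entry U(j, j)
        for (int p = Up[j]; p < Up[j + 1] - 1; ++p)
            b[Ui[p]] -= Ux[p] * b[j];
    }
}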

APPENDIX C: SUPPLEMENTARY MATERIAL FOR THE PERFORMANCE MODEL

C.1 Measurement of the Cache Miss Rate

We have found that, for GTX580 and K20x, disabling the L1 cache results in less than 5% performance loss, so the effect of the L1 cache is fairly small. Consequently, we ignore the L1 cache in the model; in our model, "cache" refers to the L2 cache. The L2 cache miss rate is difficult to model analytically, since the cache behavior is complex and affected by many low-level factors. However, we find that the miss rate has a strong relation with the number of transactions of each memory period (mem_trans). If the memory access patterns are more coalesced, fewer memory transactions are generated and the cache hit ratio is higher. In addition, if there are more resident warps on one SM, the cache hit ratio decreases, since more warps share the same cache. A micro-benchmark designed for this study is used to measure the cache miss rate. Its code is shown here:

__global__ void CacheMissRateMeasurement(double *a, double *b, int stride)
{
    // N is the array length, a compile-time constant defined elsewhere.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double x = a[tid];
    // Each thread strides by the warp size, so consecutive threads of a warp
    // read consecutive elements of b, while the writes to a are scattered by
    // `stride`, emulating the multiply-and-add of sparse LU factorization.
    for (int i = tid; i < N; i += 32)
    {
        a[(i * stride) % N] -= x * b[i];
    }
}

Basically, this micro-benchmark emulates the core operation (multiply-and-add) of sparse LU factorization. The parameter stride is used to control the number of memory transactions, and the number of resident warps is controlled by the number of threads launched for the micro-benchmark. The measured results on GTX580, obtained with NVIDIA Visual Profiler, are shown in Fig. 6.
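A hypothetical driver for such a measurement, sweeping both the stride and the number of warps, might look as follows (our own sketch; it assumes it is compiled together with the kernel above and that N is defined as shown):

// Hypothetical driver for the micro-benchmark: `stride` controls how scattered
// the writes are (and hence mem_trans), while the number of launched warps
// controls the number of resident warps per SM.
#include <cuda_runtime.h>

#define N (1 << 22)    // array length assumed by this sketch

__global__ void CacheMissRateMeasurement(double *a, double *b, int stride);  // defined above

int main()
{
    double *d_a, *d_b;
    cudaMalloc(&d_a, N * sizeof(double));
    cudaMalloc(&d_b, N * sizeof(double));
    cudaMemset(d_a, 0, N * sizeof(double));
    cudaMemset(d_b, 0, N * sizeof(double));
    for (int stride = 1; stride <= 4; ++stride)       // sweep the scatter pattern
        for (int warps = 4; warps <= 32; warps += 4)  // sweep warps per block (roughly one block per SM)
            CacheMissRateMeasurement<<<16, warps * 32>>>(d_a, d_b, stride);
    cudaDeviceSynchronize();                          // profile each launch, e.g. with the Visual Profiler
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}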

C.2 Directed Acyclic Graph (DAG) Analysis

As mentioned in Section 3.2 of the main paper, a dependence graph (DAG) is used to describe the dependence between columns. In this section, the DAG is expanded to describe the detailed dependence in pipeline mode.

Fig. 6: Measured L2 cache miss rate (%) vs. the number of resident warps per SM, for mem_trans = 1, 2, 3, and 4, obtained on GTX580 using NVIDIA Visual Profiler.

Fig. 7: Illustration of DAG analysis. (a) Original dependence graph of node 9. (b) Expanded dependence graph of node 9: subnode 9.1 stores A(:, 9) into the uncompressed vector x; subnodes 9.2–9.5 update x using columns 4, 6, 7, and 8, respectively; subnode 9.6 stores x into L(:, 9) and U(:, 9).

The operations for factorizing each column are decomposed, so each original node is also decomposed into several subnodes. We use node 9 in Fig. 1 to illustrate how the DAG is expanded, as shown in Fig. 7. All the subnodes are labeled, and their corresponding operations are shown in Fig. 7b. All the subnodes belonging to one original node are executed by the same warp. Each solid arrow in Fig. 7b denotes a data dependence or a computation order. For example, subnodes 9.1 to 9.6 are executed in sequence by the same warp, and subnode 9.2 can start only after node 4 and subnode 9.1 are both finished.

The number of execution cycles of each subnode is calculated by the performance model. Taking Fig. 7b as an example again, once the start time of node 9 and the finish times of nodes 4, 6, 7, and 8 are known, the earliest finish time of node 9 (i.e., the earliest finish time of subnode 9.6) is easily calculated by standard critical path analysis. The expanded DAG avoids calculating the dynamic waiting time of the pipeline mode (shown in Algorithm 1).

To calculate the earliest finish time of the whole DAG computed by #warp × #SM warps, a task queue is maintained for each warp. All the nodes are visited in topological order. When a node (say node k) is visited, the warp that first finishes its last task (say warp t) is selected to compute node k, so node k is put into warp t's task queue and the finish time of node k can be determined. The earliest finish time of the whole DAG is simply the maximum finish time over all the nodes.
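The earliest finish time of the whole DAG can then be obtained by a simple list-scheduling pass, sketched below. This is our own pseudocode-level C, not the authors' implementation; it treats each node as a single task and takes a CSR-style predecessor list and the modeled cycle count of each node as inputs.

// List scheduling over the DAG with n_warps warps: nodes are visited in
// topological order; each node is assigned to the warp that becomes free
// first and starts only after all of its predecessors have finished.
double dag_finish_time(int n_nodes, const int *topo,          // nodes in topological order
                       const int *pred_ptr, const int *pred,  // CSR-style predecessor lists
                       const double *cycles,                  // modeled execution cycles per node
                       int n_warps, double *finish)           // out: finish time of each node
{
    double warp_free[1024] = {0.0};                           // assumes n_warps <= 1024
    double makespan = 0.0;
    for (int v = 0; v < n_nodes; ++v) {
        int k = topo[v];
        int best = 0;                                         // warp that finishes its last task first
        for (int w = 1; w < n_warps; ++w)
            if (warp_free[w] < warp_free[best]) best = w;
        double start = warp_free[best];
        for (int p = pred_ptr[k]; p < pred_ptr[k + 1]; ++p)   // wait for all dependent columns
            if (finish[pred[p]] > start) start = finish[pred[p]];
        finish[k] = start + cycles[k];                        // earliest finish time of node k
        warp_free[best] = finish[k];
        if (finish[k] > makespan) makespan = finish[k];
    }
    return makespan;                                          // earliest finish time of the whole DAG
}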

Fig. 8: Predicted time (s) vs. actual time (s) of LU factorization; each point represents a matrix; obtained on GTX580.

C.3 Validation of the Performance Model

The comparison between the predicted LU factorization time on GTX580 and the actual runtime is shown in Fig. 8. The proposed performance model gives reasonable predictions: the average relative error between the predicted runtime and the actual runtime is 21.8%.

REFERENCES

[1] S. Yan, G. Long, and Y. Zhang, "StreamScan: Fast scan algorithms for GPUs without global barrier synchronization," in Proc. 18th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming (PPoPP '13), 2013, pp. 229–238.
[2] NVIDIA Corporation, "NVIDIA CUDA C programming guide." [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[3] Y. Saad and M. H. Schultz, "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems," SIAM J. Sci. Stat. Comput., vol. 7, no. 3, pp. 856–869, Jul. 1986.
[4] "CUSP." [Online]. Available: https://code.google.com/p/cusp-library/
[5] C.-J. Lin and J. J. Moré, "Incomplete Cholesky factorizations with limited memory," SIAM J. Sci. Comput., vol. 21, no. 1, pp. 24–45, 1999.
[6] X. Chen, W. Wu, Y. Wang, H. Yu, and H. Yang, "An EScheduler-based data dependence analysis and task scheduling for parallel circuit simulation," IEEE Trans. Circuits Syst. II: Express Briefs, vol. 58, no. 10, pp. 702–706, Oct. 2011.
[7] M. Naumov, "Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU," NVIDIA Technical Report NVR-2011-001, Jun. 2011.
[8] NVIDIA Corporation, "cuSPARSE." [Online]. Available: https://developer.nvidia.com/cusparse
[9] X. Xiong and J. Wang, "Parallel forward and back substitution for efficient power grid simulation," in Proc. IEEE/ACM Int. Conf. on Computer-Aided Design (ICCAD), Nov. 2012, pp. 660–663.