
Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement

Tim Davis (Texas A&M University)
with Sanjay Ranka, Mohamed Gadou (University of Florida)

Nuri Yeralan (Microsoft)

NVIDIA GTC 2015, March 2015

Outline

Combinatorial scientific computing: math + CS + applications

Multifrontal methods for factorizing sparse matrices

fill-in creates cliques in the graph
cliques connected via the elimination tree
Sparse Cholesky: square cliques, assembled via addition
Sparse LU: rectangular, via addition
Sparse QR: rectangular, via concatenation

GPU kernel design for sparse QR

Bucket scheduler for single-GPU QR

Extended for multi-GPU QR

Performance results

Combinatorial Scientific Computing: math + CS + applications

Cliques in the graph: the multifrontal method

Cliques + elimination tree = sequence of frontal matrices

Dense factorization within a front; assemble data into the parent (see the sketch after this list)

regular + irregular computation
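A minimal MATLAB-level sketch of this idea, assuming a tiny two-front tree and using MATLAB's built-in qr in place of the custom kernels; sizes, the column split, and all values are hypothetical, and the QR-style assembly (concatenation) is used:

child = rand (5, 4) ;                  % dense frontal matrix of a child
[~, Rc] = qr (child, 0) ;              % dense factorization within the front
npiv = 2 ;                             % columns eliminated at this front
Rchild = Rc (1:npiv, :) ;              % rows of R finished at this front
Cblock = Rc (npiv+1:end, npiv+1:end) ; % contribution block for the parent
parent = [rand(3, 2) ; Cblock] ;       % parent front: its own rows of A plus the
                                       % child's contribution block (concatenation)
[~, Rp] = qr (parent, 0) ;             % dense factorization of the parent front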

UMFPACK: unsymmetric multifrontal method

Frontal matrices become rectangular

Assemble data into ancestors, not just parents

UMFPACK: unsymmetric multifrontal method

Key results / impact

high-performance via dense matrix kernels within each front

symbolic preordering and analysis, followed by revised local pivot search with approximate unsymmetric degree update

widely used

sparse LU and x=A\b in MATLAB
Mathematica
IBM circuit simulation application
finite-element solvers: NASTRAN, FEniCS, ...
Berkeley Design Automation: circuit simulation
CVXOPT
...

SuiteSparseQR: multifrontal sparse QR factorization

Key results / impact

rectangular fronts like UMFPACK, but simpler frontal matrix assembly(concatenation, not summation) (Duff, Puglisi)

rank approximation (Heath, extended to multifrontal case)

multicore parallelism

on multicore CPU (70 Gflop theoretical peak): up to 30 Gflops

sparse qr in MATLAB, and x=A\b
today's talk: GPU algorithm

novel "Bucket QR" scheduler and custom GPU kernels
up to 150 GFlops on one Kepler K20c, 286 GFlops on 4 Tesla C2070's
up to 28x speedup vs CPU algorithm (10x typical for large problems)

SuiteSparseQR

A column elimination tree and its supernodes

SuiteSparseQR

Frontal matrix assembly

SuiteSparseQR

concatenation, resulting in a staircase matrix (toy sketch below)
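A toy MATLAB sketch of this assembly step, assuming two children whose contribution blocks start at different columns; sorting the concatenated rows by their leftmost nonzero column exposes the staircase. Sizes and values are made up for illustration:

S  = rand (3, 6) ;                      % rows of A mapped to this front
C1 = [rand(2,5) zeros(2,1)] ;           % contribution block from child 1
C2 = [zeros(2,2) rand(2,4)] ;           % contribution block from child 2
F  = [S ; C1 ; C2] ;                    % assembly by row concatenation, not addition
lead = zeros (size (F,1), 1) ;          % leftmost nonzero column of each row
for i = 1:size (F,1)
    lead (i) = find (F (i,:), 1) ;
end
[~, p] = sort (lead) ;
F = F (p,:)                             % rows ordered into a staircase matrix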

SuiteSparseQR

Multifrontal factorization and assembly

Prior methods

one front at a time on the GPU
assembly on CPU
panel factorization on the CPU, applied on GPU

Our multifrontal QR

many fronts on the GPU (entire subtree)
assembly on GPU: data concatenation, not summation
entire dense QR of each front on the GPU

Consider a subtree of frontal matrices on the GPU

Expanded to show GPU kernel launches

Bucket QR factorization

GPU kernels for Bucket QR

Bucket QR requires two kernels on the GPU

QR factorization of a t-by-1 tile,

t = 1, 2, or 3
creates a block Householder
details on next slides

Apply a block Householder:

A = A − V(T′(V′A))
A is t-by-s, where s can be large
thread-block iterates 2 column blocks at a time
(details omitted; a MATLAB-level sketch of the formula follows)
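A MATLAB-level sketch of this block apply with hypothetical tile sizes; V and T are assumed to come from a previously factorized panel, and C stands for the block being updated (the real kernel tiles this over thread blocks and stages data through shared memory):

t = 96 ; s = 512 ; p = 32 ;           % hypothetical panel and block sizes
V = tril (rand (t, p)) ;              % block of p Householder vectors
T = triu (rand (p)) ;                 % upper triangular T of the block reflector
C = rand (t, s) ;                     % block of columns to update
C = C - V * (T' * (V' * C)) ;         % C = (I - V*T'*V') * C, i.e. apply Q' to C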

GPU kernel: Block Householder QR

Block Householder: QR = A, where Q = (I − VT′V′)′ and R is upper triangular

[m n] = size (A)
for k = 1:n
    % house returns a Householder reflection: (I - tau*v*v') * A(k:m,k) is zero below its first entry
    [tau, v] = house (A (k:m,k))
    A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n)))
    V (k:m,k) = v ; Tau (k) = tau
end
% build the upper triangular T of the block reflector I - V*T*V'
T = zeros (n)
for k = 1:n
    tau = Tau (k) ; v = V (k:m,k)
    z = - tau * v' * V (k:m,1:k-1)
    T (1:k-1,k) = T (1:k-1,1:k-1) * z'
    T (k,k) = tau
end
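The pseudocode above leaves house undefined. The sketch below fills it in with one standard Householder generator (an assumption; the slides do not define house) and checks that the factorization and the T recurrence reproduce A = QR with Q = (I − VT′V′)′ = I − VTV′. Sizes and data are hypothetical:

m = 96 ; n = 32 ;                     % hypothetical front size
A = rand (m, n) ; A0 = A ;
V = zeros (m, n) ; Tau = zeros (n, 1) ;
for k = 1:n
    [tau, v] = house (A (k:m,k)) ;
    A (k:m,k:n) = A (k:m,k:n) - v * (tau * (v' * A (k:m,k:n))) ;
    V (k:m,k) = v ; Tau (k) = tau ;
end
T = zeros (n) ;
for k = 1:n
    tau = Tau (k) ; v = V (k:m,k) ;
    z = - tau * v' * V (k:m,1:k-1) ;
    T (1:k-1,k) = T (1:k-1,1:k-1) * z' ;
    T (k,k) = tau ;
end
R = triu (A) ;                        % A has been reduced to R in place
Q = eye (m) - V * T * V' ;            % Q = (I - V*T'*V')'
disp (norm (A0 - Q * R))              % near machine precision: A = Q*R

function [tau, v] = house (x)
% one standard Householder generator (an assumed implementation):
% (I - tau*v*v') * x is zero below its first entry
s = norm (x) ;
v = x ;
if s == 0
    tau = 0 ;
else
    if x (1) < 0
        v (1) = x (1) - s ;
    else
        v (1) = x (1) + s ;
    end
    tau = 2 / (v' * v) ;
end
end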

GPU kernel: Block Householder QR

Householder update

A(k:m,k:n) = A(k:m,k:n) - ...
    v*(tau*(v'*A(k:m,k:n)))

Construct T

z = - tau * v' * V (k:m,1:k-1)
T (1:k-1,k) = T(1:k-1,1:k-1)*z'
T (k,k) = tau

GPU kernel: Block Householder QR

Towards a GPU kernel: overwrite tril(A,-1) with V, and fold in the construction of T.

[m n] = size (A)
T = zeros (n)
for k = 1:n
    [tau, v] = house (A (k:m,k))
    A (k:m,k:n) = A (k:m,k:n) - ...
        v * (tau * (v' * A (k:m,k:n)))
    V1 (k) = v (1)                    % first entry of v, kept separately
    A (k+1:m,k) = v (2:end)           % rest of v stored below the diagonal of A
    z = - tau * v' * A (k:m,1:k-1)    % A(k:m,1:k-1) now holds V(k:m,1:k-1)
    T (1:k-1,k) = T (1:k-1,1:k-1) * z'
    T (k,k) = tau
end
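For reference, a short sketch of how V and R can be recovered from this packed storage once the loop above finishes; it continues from the pseudocode above (A, V1, n as defined there), which is an assumption of this sketch: the strict lower triangle of A holds v(2:end) for each column, V1 holds each v(1), and the upper triangle of A is R.

R = triu (A) ;                        % upper triangle of A is R
V = tril (A, -1) ;                    % strictly lower triangle holds v(2:end)
for k = 1:n
    V (k,k) = V1 (k) ;                % put back the first entry of each v
end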

GPU kernel: Block Householder QR

The GPU kernel updates A and constructs T in parallel:

A (k:m,k:n) = A (k:m,k:n) - ...
    v * (tau * (v' * A (k:m,k:n)))
z = - tau * (v' * A (k:m,1:k-1))
T (1:k-1,k) = T (1:k-1,1:k-1) * z'
T (k,k) = tau

becomes

z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau

GPU kernel: Block Householder QR

z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau

GPU kernel: Block Householder QR

The GPU kernel's thread-level parallelism

z = -tau * v' * A (k:m,:)
A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)
T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'
T (k,k) = tau

A is 96-by-32, in registers during the factorization, and written to global memory at the end (as V and R)

each thread owns an 8-by-1 “bitty block” of A

v is 96-by-1, in shared memory

z is 32-by-1, in shared memory; requires a 12-by-32 shared space for the v'*A(k:m,:) reduction (simulated in the sketch below)
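A MATLAB-level simulation of that reduction (the real code is a CUDA kernel): each of the 12 row chunks of 8 stands in for one thread's 8-by-1 bitty block, and summing the 12-by-32 partial results reproduces z = -tau * v' * A(k:m,:). Sizes follow the slide with k = 1; data and tau are made up:

m = 96 ; n = 32 ; k = 1 ;             % with k = 1 the panel is all 96 rows
A = rand (m, n) ; v = rand (m, 1) ; tau = 0.5 ;
red = zeros (12, n) ;                 % the 12-by-32 shared reduction space
for chunk = 1:12                      % each "thread" owns an 8-by-1 bitty block
    rows = (chunk-1)*8 + (1:8) ;
    red (chunk,:) = v (rows)' * A (rows,:) ;   % 8-term partial dot products
end
z = -tau * sum (red, 1) ;             % "warp zero" sums the reduction space
disp (norm (z - (-tau * v' * A)))     % agrees with the direct computation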

GPU kernel: Block Householder QR

Putting it all together. At the kth step:

threads that own column k write that column of A to shared memory

thread zero computes Householder coefficients

z = -tau * v’ * A (k:m,:)

each thread computes an 8-by-1 dot product in parallel
writes its scalar result into the 12-by-32 reduction space
warp zero sums the reduction space to get z

A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)

only done by threads that own columns k:n

threads that own column k+1 compute the norm of that column of A, for the next Householder coefficient, saving the result in a 12-by-1 reduction space

T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)’

only done by threads 1:k-1

thread zero sums the reduction space to get the norm of column k+1

GPU kernel: Block Householder QR

z = -tau * v’ * A (k:m,:)

each thread computes an 8-by-1 dot product in parallel

writes its scalar result into the 12-by-32 reduction space

warp zero sums the reduction space to get z

GPU kernel: Block Householder QR

A (k:m,k:n) = A (k:m,k:n) + v * z (k:n)

only done by threads that own columns k:n

threads that own column k+1 compute the norm of that column of A, for the next Householder coefficient, saving the result in a 12-by-1 reduction space

GPU kernel: Block Householder QR

T (1:k-1,k) = T (1:k-1,1:k-1) * z (1:k-1)'

only done by threads 1:k-1

Single GPU performance results

Putting it all together ... performance results:

                                 Fermi         K20           K40
GPU kernels:
  apply block Householder        183 Gflops    260 Gflops    360 Gflops
  factorize 3 tiles              27 Gflops     20 Gflops
dense QR for large front         107 Gflops    120 Gflops
  ('bare metal' flops)           154 Gflops    172 Gflops
sparse QR on GPU                 80 Gflops     150 Gflops
peak speedup over CPU            11x           20x
typical speedup over CPU         5x            10x

GPU Kernel pitfalls

What we’ll do differently in our kernel design

Householder block-apply using too much shared memory

uberkernel approach

each thread block determines what to do from a task list (QR, apply, assemble); a toy task-list dispatch is sketched below
pros: single large kernel launch, lots of parallelism
con: all tasks use the same thread geometry
con: QR of a panel needs higher occupancy to hide the scalar √(d)

to do: the block-apply kernel needs to stage A = A − W(Y′A) by not keeping W and Y in shared memory

split the uberkernel so the QR panel kernel can have higher occupancy
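A loose, host-side MATLAB emulation of the uberkernel dispatch described above: each simulated thread block pulls one task from an ordered list and branches on its type. MATLAB's qr stands in for the custom factorize/apply kernels, and the task fields and sizes are hypothetical:

F = rand (96, 64) ;                   % one front being factorized
t1.type = 'factorize' ; t1.cols = 1:32 ;
t2.type = 'apply'     ; t2.cols = 33:64 ;
tasks = {t1, t2} ;                    % ordered so the factorize precedes its apply
for t = 1:numel (tasks)
    task = tasks {t} ;
    switch task.type
        case 'factorize'      % QR of a panel (the real kernel builds V and T)
            [Q1, R1] = qr (F (:, task.cols)) ;
            F (:, task.cols) = R1 ;
        case 'apply'          % apply that panel's Q' to the trailing columns
            F (:, task.cols) = Q1' * F (:, task.cols) ;
        case 'assemble'       % concatenate child rows into the parent front
            % (omitted in this toy example)
    end
end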

Single GPU performance on many matrices

Multiple GPUs: decomposing the tree
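The slides show the tree decomposition only as a figure. As an illustration of one plausible strategy (not necessarily the one used), here is a greedy sketch that assigns independent subtrees to the least-loaded GPU by estimated flop count, with hypothetical numbers:

subtree_flops = [9e9 7e9 4e9 3e9 2e9 1e9] ;   % estimated work per independent subtree
ngpu = 4 ;
load_per_gpu = zeros (1, ngpu) ;
owner = zeros (size (subtree_flops)) ;
[~, order] = sort (subtree_flops, 'descend') ;
for s = order                                  % largest subtrees first
    [~, g] = min (load_per_gpu) ;              % pick the least-loaded GPU
    owner (s) = g ;
    load_per_gpu (g) = load_per_gpu (g) + subtree_flops (s) ;
end
disp (owner) ; disp (load_per_gpu)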

Multiple GPUs: Performance Results

Results on NVIDIA Tesla C2070 GPUs

problem        CPU      1 GPU    2 GPU    4 GPU    speedup    speedup
               GFlop    GFlop    GFlop    GFlop    vs CPU     vs 1 GPU
1500:2D        6.1      16.0     27.1     38.4     6.3        2.4
2000:2D        6.9      21.0     37.8     56.7     8.2        2.7
3000:2D        7.8      25.8     44.8     73.7     9.4        2.9
lp_nug20       23.9     74.3     86.4     66.1     2.8        0.9
ch7-8-b3       25.3     104.0    111.3    173.7    6.9        1.7
ch7-8-b3:8     10.0     88.0     160.4    286.2    28.6       3.3

Multiple GPUs: for large fronts

Multiple GPUs: bucket scheduler on the large scale

Acknowledgements:

National Science Foundation

NVIDIA

Texas A&M University

Summary: Sparse QR on GPUs

Fronts live and die on the GPU, which reduces CPU-GPU traffic

Bucket scheduler: extends Communication-Avoiding QR method

Single GPU: 5x to 20x speedup over the CPU

Multi GPU prototype: speedup over 3x on 4 GPUs

Code: SuiteSparse.com and developer.nvidia.com/cholmod

SuiteSparse logo, and music to art via math: NotesArtStudio.com
