
CUDA Linear Algebra Library and Next Generation
Yukai Hung
a0934147@gmail.com
Department of Mathematics
National Taiwan University

Sparse Matrix-Vector Multiplication


The dense approach is wasteful:
- unclear how to map the work to parallel processors
- irregular element accesses to global memory


The formats span a spectrum from structured to unstructured:

DIA - diagonal format
ELL - ELLPACK format
CSR - compressed sparse row format
HYB - hybrid format
COO - coordinate format


Diagonal (DIA) format
- the diagonals should be mostly populated
- high parallelism: map one thread to one row
- good parallel efficiency and good memory behavior (global memory coalescing)
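To make the mapping concrete, here is a minimal sketch of a one-thread-per-row DIA kernel in the style of Bell and Garland; the offsets array and the column-major data layout (each diagonal padded to num_rows entries) are assumptions about how the diagonals are stored.

// one thread per row; each diagonal is padded to num_rows entries
// and stored column-major so that consecutive threads read
// consecutive addresses (coalesced global memory access)
__global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                         const int* offsets, const float* data,
                         const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int d = 0; d < num_diags; d++) {
            int col = row + offsets[d];
            if (col >= 0 && col < num_cols)
                dot += data[d * num_rows + row] * x[col];
        }
        y[row] = dot;
    }
}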


ELLPACK (ELL) format
- again assign one thread to compute one row
- but load imbalance hurts parallel efficiency
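A sketch of the corresponding one-thread-per-row ELL kernel, again assuming the column-major padded layout used by Bell and Garland; the padding of short rows up to max_cols_per_row is exactly where the wasted work and load imbalance come from.

__global__ void spmv_ell(int num_rows, int max_cols_per_row,
                         const int* indices, const float* data,
                         const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int n = 0; n < max_cols_per_row; n++) {
            int col   = indices[n * num_rows + row];
            float val = data[n * num_rows + row];
            // padded entries are stored as explicit zeros
            if (val != 0.0f)
                dot += val * x[col];
        }
        y[row] = dot;
    }
}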


Coordinate (COO) format
- insensitive to the sparsity pattern, but slower than ELLPACK
- assign one thread to each element and combine the results from all elements in a row to produce the output element (see the sketch below)
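As a simplified illustration only: a one-thread-per-nonzero COO kernel that combines per-row results with atomicAdd (float atomics require compute capability 2.0, i.e. Fermi-class hardware); the production kernel described by Bell and Garland instead uses segmented reduction to combine the contributions within a row.

__global__ void spmv_coo(int num_nonzeros,
                         const int* rows, const int* cols,
                         const float* vals,
                         const float* x, float* y)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < num_nonzeros)
        // combine the contributions of all elements in a row
        atomicAdd(&y[rows[n]], vals[n] * x[cols[n]]);
}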


Hybrid (HYB) format
- combines the regular ELLPACK format (for the typical number of nonzeros per row) with the flexible COO format (for the exceptional entries)


Property comparison

Format         Granularity      Coalescing
DIA            thread/row       full
ELL            thread/row       full
CSR (scalar)   thread/row       rare
CSR (vector)   warp/row         partial
COO            thread/nonzero   full
HYB            thread/row       full

fixed number of nonzeros and variable matrix size


Sparse matrices for parallel efficiency: ELLPACK format
- one thread per row is efficient for memory accesses

Sparse matrices for load imbalance: coordinate format
- one thread per element is insensitive to the matrix structure

Conclusion for all structures
- the hybrid structure gives the best performance on average
- irregularity is manageable if you regularize the common case

[Figures: SpMV performance comparison across the formats]

Linear Algebra Library


CUBLAS: CUDA Basic Linear Algebra Subroutines
- implements the basic linear algebra subroutines at the runtime level
- only available on a single device; not implemented for multiple devices
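A minimal sketch of the CUBLAS calling convention, using the v2 handle-based API; the wrapper function and the square matrix dimensions are hypothetical.

#include <cublas_v2.h>

// compute C = alpha*A*B + beta*C for n-by-n matrices that are
// already resident in device memory, stored column-major
void gemm_example(const float* dA, const float* dB, float* dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
}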

CUFFT: CUDA Fast Fourier Transform Library
- uses divide-and-conquer algorithms for the discrete transform
- supports real and complex data, in-place or out-of-place
- supports stream operations for simultaneous execution
- use complex-to-complex transforms in place of real-to-complex
- power-of-two problem sizes give the best performance
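A minimal sketch of the CUFFT plan/execute/destroy pattern; the wrapper function is hypothetical, and ddata is assumed to hold NX complex values in device memory.

#include <cufft.h>

// in-place forward complex-to-complex FFT of NX points
void fft_example(cufftComplex* ddata, int NX)
{
    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);            // batch of 1
    cufftExecC2C(plan, ddata, ddata, CUFFT_FORWARD); // in-place
    cufftDestroy(plan);
}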


CUDPP: CUDA Data Parallel Primitives Library
- a library of data-parallel algorithm primitives
- parallel prefix-sum, sorting, and data reduction
- stream compaction and random number generation
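A hedged sketch of a parallel prefix-sum with CUDPP's plan interface, following the CUDPP 2.x configuration style; the wrapper function is hypothetical and the exact API differs between CUDPP releases.

#include <cudpp.h>

// exclusive prefix-sum of n floats already in device memory
void scan_example(float* d_out, const float* d_in, size_t n)
{
    CUDPPHandle cudpp;
    cudppCreate(&cudpp);

    CUDPPConfiguration config;
    config.op        = CUDPP_ADD;
    config.datatype  = CUDPP_FLOAT;
    config.algorithm = CUDPP_SCAN;
    config.options   = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;

    CUDPPHandle plan;
    cudppPlan(cudpp, &plan, config, n, 1, 0);
    cudppScan(plan, d_out, d_in, n);

    cudppDestroyPlan(plan);
    cudppDestroy(cudpp);
}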

[Figure: CUDPP performance comparison with a multicore CPU]

CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for various languages
- linear system solvers
- least squares solvers
- orthogonal factorizations
- symmetric eigenproblems
- non-symmetric eigenproblems
- singular value decompositions
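A heavily hedged sketch of what a CULA call looks like, assuming its LAPACK-style sgesv interface; header and routine names vary between CULA versions, so treat this as an illustration of the calling style rather than exact API.

#include <cula.h>

// solve the n-by-n system A*x = b in single precision; A and b are
// host arrays in column-major order, and b is overwritten with x
void solve_example(culaFloat* A, culaFloat* b, int n)
{
    culaInt* ipiv = new culaInt[n];

    culaInitialize();                   // bind to the GPU
    culaSgesv(n, 1, A, n, ipiv, b, n);  // LU-based solve
    culaShutdown();

    delete[] ipiv;
}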

[Figures: CULA performance for double precision LU factorization, QR factorization, symmetric eigenvalue problem, and singular value decomposition]

MAGMA: Matrix Algebra on GPU and Multicore Architectures
- an open-source project to develop a dense linear algebra library, similar to LAPACK, but for heterogeneous/hybrid architectures with multicore CPUs and GPUs
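A hedged sketch of MAGMA's hybrid CPU+GPU interface, assuming the LAPACK-style magma_dgetrf routine; names follow later 1.x releases, and earlier versions differ.

#include <magma.h>

// double precision LU factorization of an m-by-n host matrix,
// computed cooperatively on the CPU and the GPU
void lu_example(double* A, magma_int_t m, magma_int_t n)
{
    magma_int_t* ipiv = new magma_int_t[m < n ? m : n];
    magma_int_t info;

    magma_init();
    magma_dgetrf(m, n, A, m, ipiv, &info);  // lda = m
    magma_finalize();

    delete[] ipiv;
}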

[Figures: MAGMA performance for double precision matrix-matrix multiplication, single precision QR factorization, solving Ax=b via LU factorization, and single precision Cholesky factorization]

Thrust
- Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL), providing a flexible high-level interface that greatly enhances productivity


#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(int argc, char** argv)
{
    // allocate memory space on the host
    thrust::host_vector<float> hvec(1024);

    // generate random numbers on the host
    thrust::generate(hvec.begin(), hvec.end(), rand);

    // allocate device memory and transfer the data
    thrust::device_vector<float> dvec = hvec;

    // manipulate device values from the host
    dvec[0] = (float)rand() / (float)(RAND_MAX - 1);
    dvec[1] = (float)rand() / (float)(RAND_MAX - 1);

    // sum all data on the device by parallel reduction
    float sum = thrust::reduce(dvec.begin(), dvec.end());

    // sort all data on the device by radix sort
    thrust::sort(dvec.begin(), dvec.end());

    // transfer the final data back to the host
    thrust::copy(dvec.begin(), dvec.end(), hvec.begin());

    return 0;
}


#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <list>

__global__ void kernel(int* data, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) data[idx] *= 2;   // placeholder computation
}

int main(int argc, char** argv)
{
    // create a list container on the host
    std::list<int> hlist;
    hlist.push_back(13);
    hlist.push_back(27);

    // copy host data from the list into a device vector
    thrust::device_vector<int> dvec(hlist.size());
    thrust::copy(hlist.begin(), hlist.end(), dvec.begin());

    // alternative method to construct the device vector directly
    // thrust::device_vector<int> dvec2(hlist.begin(), hlist.end());

    // obtain a raw pointer to the device memory
    int* dpointer = thrust::raw_pointer_cast(&dvec[0]);

    // launch the device kernel function
    int blocksize = 256;
    int blocknum  = ((int)dvec.size() + blocksize - 1) / blocksize;
    kernel<<<blocknum, blocksize>>>(dpointer, (int)dvec.size());

    // the device memory is deallocated automatically when dvec
    // goes out of scope; calling cudaFree on it would be an error
    return 0;
}


CUSP: Generic Parallel Algorithms for Sparse Matrix Computations
- CUSP provides a high-level, flexible interface for manipulating sparse matrices and solving sparse linear systems with iterative methods
- CUSP is implemented on top of the Thrust template interface

"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Nathan Bell and Michael Garland, in Supercomputing '09, 2009


Matrix formats
- CUSP natively supports several sparse matrix formats
- CUSP makes it easy to transfer sparse matrix data between host and device and to convert between sparse matrix formats

// allocate storage for a CSR matrix on the host
// with 5 rows, 8 columns, and 12 nonzero elements
cusp::csr_matrix<int,float,cusp::host_memory> A(5,8,12);

// allocate and transfer from host to device memory
cusp::csr_matrix<int,float,cusp::device_memory> B = A;

// convert the CSR matrix format to HYB matrix format
cusp::hyb_matrix<int,float,cusp::device_memory> C = A;


Algorithms and iterative solvers
- matrix-vector multiplication and transpose
- conjugate gradient and stabilized biconjugate gradient (BiCGSTAB)

// matrix-vector multiplication
cusp::multiply(A, x, y);

// sparse matrix transpose
cusp::transpose(A, At);

// conjugate gradient
cusp::krylov::cg(A, x, b);

// stabilized biconjugate gradient
cusp::krylov::bicgstab(A, x, b);


#include <cusp/hyb_matrix.h>
#include <cusp/array1d.h>
#include <cusp/monitor.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>
#include <cusp/precond/diagonal.h>

int main(int argc, char** argv)
{
    typedef float ValueType;
    typedef cusp::device_memory MemorySpace;

    // create an empty HYB sparse matrix structure
    cusp::hyb_matrix<int,ValueType,MemorySpace> A;

    // load a matrix stored in the matrix market format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    // allocate storage for solution x and right-hand side b
    cusp::array1d<ValueType,MemorySpace> x(A.num_rows, 0);
    cusp::array1d<ValueType,MemorySpace> b(A.num_rows, 1);

    // set the iteration and residual stopping criteria
    cusp::verbose_monitor<ValueType> monitor(b, 100, 1e-6);

    // set up the diagonal matrix preconditioner
    cusp::precond::diagonal<ValueType,MemorySpace> M(A);

    // solve the linear system with the conjugate gradient method
    cusp::krylov::cg(A, x, b, monitor, M);

    return 0;
}


OpenNL: Open Numerical Library
- efficient sparse matrix data structures
- sparse direct solver via SuperLU
- Jacobi and SSOR matrix preconditioners
- iterative builder for sparse least-squares problems
- iterative solvers: conjugate gradient, BiCGSTAB, GMRES


ViennaCL
- a basic linear algebra library for computations on GPUs, based on OpenCL
- supports the basic linear algebra subroutines
- generalized minimal residual method (GMRES)
- direct linear system solver with LU factorization
- sparse conjugate gradient and biconjugate gradient solvers
- incomplete LU preconditioner with threshold (ILUT)
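A hedged sketch of ViennaCL's iterative solver interface, with types and tags as in the 1.x releases; matrix and vector assembly is omitted.

#include <viennacl/compressed_matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/cg.hpp>

// solve A*x = b on the device with conjugate gradient,
// stopping at relative tolerance 1e-6 or 100 iterations
viennacl::vector<float> solve_example(
    const viennacl::compressed_matrix<float>& A,
    const viennacl::vector<float>& b)
{
    return viennacl::linalg::solve(A, b,
                                   viennacl::linalg::cg_tag(1e-6, 100));
}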

GATLAS: GPU Automatically Tuned Linear Algebra Subroutines
- automatically tunes the level-3 BLAS kernels, based on OpenCL

Next Generation Architecture

The next-generation GPU architecture is called Fermi.


Third-generation Streaming Multiprocessor
- 32 processors per SM
- double precision at 50% of single precision speed (8x faster than GT200)
- dual thread/warp scheduler
- 4 special function units
- 64 KB of RAM for shared memory and configurable L1 cache


Second-generation Parallel Thread Execution (PTX)
- IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- fused multiply-add (FMA) instruction for both single and double precision
- newly designed 32-bit integer ALU and extended-precision operations


Improved memory system
- first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- L1 cache for each multiprocessor: improves bandwidth and reduces latency
- unified L2 cache (768 KB): coherent data sharing across all cores
- ECC support
- GDDR5 memory interface, almost 2x faster than GDDR3


GigaThread hardware scheduler
- hierarchically manages thousands of simultaneously active threads
- 10x faster application context switching to support concurrent kernel execution

[Figure: concurrent kernel execution with faster context switching]

- dual DMA engines for simultaneous data transfers, fully overlapping data movement with CPU and GPU processing
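A hedged sketch of how the dual DMA engines are exercised: independent work issued into separate streams, assuming page-locked host memory so the copies can run asynchronously; the process kernel and buffer names are hypothetical.

#include <cuda_runtime.h>

__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder computation
}

// chunk 0 streams in and computes while chunk 1 streams out;
// the two copies can be serviced by the two DMA engines at once
void pipelined(float* h_in, float* h_out,
               float* d_buf0, float* d_buf1,
               int n, cudaStream_t s0, cudaStream_t s1)
{
    size_t bytes = n * sizeof(float);

    cudaMemcpyAsync(d_buf0, h_in, bytes, cudaMemcpyHostToDevice, s0);
    process<<<(n + 255) / 256, 256, 0, s0>>>(d_buf0, n);

    cudaMemcpyAsync(h_out, d_buf1, bytes, cudaMemcpyDeviceToHost, s1);
}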

Third-generation Streaming Multiprocessor
- fully pipelined integer arithmetic logic unit (ALU) and floating-point unit
- floating-point arithmetic improved from IEEE 754-1985 to IEEE 754-2008 to support the FMA instruction
- integer ALU improved from 24-bit to 32-bit precision


What is new in the floating-point operations?
- support for fused multiply-add (FMA) instructions in both single and double precision
- an ordinary multiply-add computes the product A x B, truncates the extra digits of the product, and then adds C to the truncated result
- a fused multiply-add retains all digits of the intermediate product and rounds only once, after the addition
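A small illustration assuming CUDA's __fmaf_rn intrinsic; note that nvcc may contract a*b+c into an FMA on its own, so the explicit intrinsic is what guarantees the single rounding. The kernel assumes the launch covers exactly the array length.

__global__ void fma_demo(const float* a, const float* b,
                         const float* c, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // multiply-then-add: the product a[i]*b[i] is rounded before
    // the addition (unless the compiler contracts it to an FMA)
    float mad = a[i] * b[i] + c[i];
    // fused multiply-add: full-precision product, one rounding
    float fma = __fmaf_rn(a[i], b[i], c[i]);
    out[i] = fma - mad;   // nonzero wherever the extra rounding mattered
}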


What is new in the floating-point operations?
- support for subnormal numbers in both single and double precision; these are the small numbers that lie between zero and the smallest normalized number of a given floating-point number system
- prior generations flushed subnormal operands and results to zero
- CPUs typically perform subnormal calculations in exception-handling software, taking thousands of cycles, but Fermi handles subnormal calculations in hardware with no additional performance penalty

Third-generation Streaming Multiprocessor
- 16 load/store units, allowing source and destination addresses to be calculated for 16 threads per cycle
- 32 single precision FMA units and 16 double precision FMA units

[Figure: double precision application performance]

Third-generation Streaming Multiprocessor
- two warp schedulers and two instruction dispatch units
- the dual warp scheduler allows two warps to be issued and executed concurrently across the 32 cores


64 KB of configurable shared memory and L1 cache
- 48 KB shared memory and 16 KB L1 cache, or
- 16 KB shared memory and 48 KB L1 cache

[Figure: radix sort performance using shared memory]
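The split is selected per kernel through the CUDA runtime API; a minimal sketch, where mykernel is a hypothetical kernel.

__global__ void mykernel(float* data);   // hypothetical kernel

void configure()
{
    // prefer 48 KB shared memory / 16 KB L1 for a kernel that
    // stages data in shared memory (e.g. the radix sort above)
    cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferShared);

    // or prefer 16 KB shared memory / 48 KB L1 for kernels whose
    // accesses benefit from a larger cache
    // cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferL1);
}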


Unified memory address space
- combines the three separate address spaces (thread-local, shared, and global) for load and store instructions
- this feature enables Fermi to support C++-specific programs: virtual functions, function pointers, new and delete on objects, try and catch
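A small hypothetical sketch of what the unified address space enables: a single generic-pointer function that works regardless of whether the pointer refers to shared or global memory, where pre-Fermi hardware needed distinct load instructions per space.

// one generic load function serves every memory space
__device__ float load_value(const float* p) { return *p; }

__global__ void demo(const float* gdata, float* out)
{
    __shared__ float sdata[256];
    int i = threadIdx.x;
    sdata[i] = gdata[i];
    __syncthreads();

    // the same function receives a shared pointer and a global
    // pointer; the hardware resolves the space at run time
    out[i] = load_value(&sdata[i]) + load_value(&gdata[i]);
}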


[Figure: previous-generation GPU block diagram (host, input assembler, vertex/pixel thread issue, work distribution, SP/L1/TF clusters, L2/FB partitions), highlighting the scheduler bottleneck]

[Figure: old bottleneck vs. new bottleneck after the scheduler redesign]
