
CUDA Linear Algebra Library and Next Generation
Yukai Hung
a0934147@gmail.com
Department of Mathematics
National Taiwan University

Sparse Matrix-Vector Multiplication


The dense approach is wasteful:
- unclear how to map the work to parallel processors
- irregular element accesses to global memory


The formats span a spectrum from structured to unstructured:

DIA - diagonal format
ELL - ELLPACK format
CSR - compressed sparse row format
HYB - hybrid format
COO - coordinate format


Diagonal (DIA) format
- the diagonals should be mostly populated
- high parallelism: map one thread to one row
- good parallel efficiency and good memory behavior (global memory coalescing)
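To make the mapping concrete, here is a minimal sketch of a one-thread-per-row DIA kernel in the style of Bell and Garland; the offsets array and the column-major data layout (each diagonal padded to num_rows entries) are assumptions about how the diagonals are stored.

// one thread per row; each diagonal is padded to num_rows entries
// and stored column-major so that consecutive threads read
// consecutive addresses (coalesced global memory access)
__global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                         const int* offsets, const float* data,
                         const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int d = 0; d < num_diags; d++) {
            int col = row + offsets[d];
            if (col >= 0 && col < num_cols)
                dot += data[d * num_rows + row] * x[col];
        }
        y[row] = dot;
    }
}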


ELLPACK (ELL) format
- again assign one thread to compute one row
- but load imbalance hurts parallel efficiency
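A sketch of the corresponding one-thread-per-row ELL kernel, again assuming the column-major padded layout used by Bell and Garland; the padding of short rows up to max_cols_per_row is exactly where the wasted work and load imbalance come from.

__global__ void spmv_ell(int num_rows, int max_cols_per_row,
                         const int* indices, const float* data,
                         const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int n = 0; n < max_cols_per_row; n++) {
            int col   = indices[n * num_rows + row];
            float val = data[n * num_rows + row];
            // padded entries are stored as explicit zeros
            if (val != 0.0f)
                dot += val * x[col];
        }
        y[row] = dot;
    }
}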


Coordinate (COO) format
- insensitive to the sparsity pattern, but slower than ELLPACK
- assign one thread to each element and combine the results from all elements in a row to produce the output element (see the sketch below)
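As a simplified illustration only: a one-thread-per-nonzero COO kernel that combines per-row results with atomicAdd (float atomics require compute capability 2.0, i.e. Fermi-class hardware); the production kernel described by Bell and Garland instead uses segmented reduction to combine the contributions within a row.

__global__ void spmv_coo(int num_nonzeros,
                         const int* rows, const int* cols,
                         const float* vals,
                         const float* x, float* y)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < num_nonzeros)
        // combine the contributions of all elements in a row
        atomicAdd(&y[rows[n]], vals[n] * x[cols[n]]);
}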


Hybrid (HYB) format
- combines the regular ELLPACK format (for the typical number of nonzeros per row) with the flexible COO format (for the exceptional entries)


Property comparison

Format         Granularity      Coalescing
DIA            thread/row       full
ELL            thread/row       full
CSR (scalar)   thread/row       rare
CSR (vector)   warp/row         partial
COO            thread/nonzero   full
HYB            thread/row       full

fixed number of nonzeros and variable matrix size


Sparse matrices for parallel efficiency: ELLPACK format
- one thread per row is efficient for memory accesses

Sparse matrices for load imbalance: coordinate format
- one thread per element is insensitive to the matrix structure

Conclusion for all structures
- the hybrid structure gives the best performance on average
- irregularity is manageable if you regularize the common case

[Figures: SpMV performance comparison across the formats]

Linear Algebra Library


CUBLAS: CUDA Basic Linear Algebra Subroutines
- implements the basic linear algebra subroutines at the runtime level
- only available on a single device; not implemented for multiple devices
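A minimal sketch of the CUBLAS calling convention, using the v2 handle-based API; the wrapper function and the square matrix dimensions are hypothetical.

#include <cublas_v2.h>

// compute C = alpha*A*B + beta*C for n-by-n matrices that are
// already resident in device memory, stored column-major
void gemm_example(const float* dA, const float* dB, float* dC, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
}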

CUFFT: CUDA Fast Fourier Transform Library
- uses divide-and-conquer algorithms for the discrete transform
- supports real and complex data, in-place or out-of-place
- supports stream operations for simultaneous execution
- use complex-to-complex transforms in place of real-to-complex
- power-of-two problem sizes give the best performance
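A minimal sketch of the CUFFT plan/execute/destroy pattern; the wrapper function is hypothetical, and ddata is assumed to hold NX complex values in device memory.

#include <cufft.h>

// in-place forward complex-to-complex FFT of NX points
void fft_example(cufftComplex* ddata, int NX)
{
    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);            // batch of 1
    cufftExecC2C(plan, ddata, ddata, CUFFT_FORWARD); // in-place
    cufftDestroy(plan);
}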


CUDPP: CUDA Data Parallel Primitives Library
- a library of data-parallel algorithm primitives
- parallel prefix-sum, sorting, and data reduction
- stream compaction and random number generation
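A hedged sketch of a parallel prefix-sum with CUDPP's plan interface, following the CUDPP 2.x configuration style; the wrapper function is hypothetical and the exact API differs between CUDPP releases.

#include <cudpp.h>

// exclusive prefix-sum of n floats already in device memory
void scan_example(float* d_out, const float* d_in, size_t n)
{
    CUDPPHandle cudpp;
    cudppCreate(&cudpp);

    CUDPPConfiguration config;
    config.op        = CUDPP_ADD;
    config.datatype  = CUDPP_FLOAT;
    config.algorithm = CUDPP_SCAN;
    config.options   = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;

    CUDPPHandle plan;
    cudppPlan(cudpp, &plan, config, n, 1, 0);
    cudppScan(plan, d_out, d_in, n);

    cudppDestroyPlan(plan);
    cudppDestroy(cudpp);
}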

[Figure: CUDPP performance comparison with a multicore CPU]

CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for various languages
- linear system solvers
- least squares solvers
- orthogonal factorizations
- symmetric eigenproblems
- non-symmetric eigenproblems
- singular value decompositions
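A heavily hedged sketch of what a CULA call looks like, assuming its LAPACK-style sgesv interface; header and routine names vary between CULA versions, so treat this as an illustration of the calling style rather than exact API.

#include <cula.h>

// solve the n-by-n system A*x = b in single precision; A and b are
// host arrays in column-major order, and b is overwritten with x
void solve_example(culaFloat* A, culaFloat* b, int n)
{
    culaInt* ipiv = new culaInt[n];

    culaInitialize();                   // bind to the GPU
    culaSgesv(n, 1, A, n, ipiv, b, n);  // LU-based solve
    culaShutdown();

    delete[] ipiv;
}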

[Figures: CULA performance for double precision LU factorization, QR factorization, symmetric eigenvalue problem, and singular value decomposition]

MAGMA: Matrix Algebra on GPU and Multicore Architectures
- an open-source project to develop a dense linear algebra library, similar to LAPACK, but for heterogeneous/hybrid architectures with multicore CPUs and GPUs
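A hedged sketch of MAGMA's hybrid CPU+GPU interface, assuming the LAPACK-style magma_dgetrf routine; names follow later 1.x releases, and earlier versions differ.

#include <magma.h>

// double precision LU factorization of an m-by-n host matrix,
// computed cooperatively on the CPU and the GPU
void lu_example(double* A, magma_int_t m, magma_int_t n)
{
    magma_int_t* ipiv = new magma_int_t[m < n ? m : n];
    magma_int_t info;

    magma_init();
    magma_dgetrf(m, n, A, m, ipiv, &info);  // lda = m
    magma_finalize();

    delete[] ipiv;
}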

[Figures: MAGMA performance for double precision matrix-matrix multiplication, single precision QR factorization, solving Ax=b via LU factorization, and single precision Cholesky factorization]

Thrust
- Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL), providing a flexible high-level interface that greatly enhances productivity


#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(int argc, char** argv)
{
    // allocate memory space on the host
    thrust::host_vector<float> hvec(1024);

    // generate random numbers on the host
    thrust::generate(hvec.begin(), hvec.end(), rand);

    // allocate device memory and transfer the data
    thrust::device_vector<float> dvec = hvec;

    // manipulate device values from the host
    dvec[0] = (float)rand() / (float)(RAND_MAX - 1);
    dvec[1] = (float)rand() / (float)(RAND_MAX - 1);

    // sum all data on the device by parallel reduction
    float sum = thrust::reduce(dvec.begin(), dvec.end());

    // sort all data on the device by radix sort
    thrust::sort(dvec.begin(), dvec.end());

    // transfer the final data back to the host
    thrust::copy(dvec.begin(), dvec.end(), hvec.begin());

    return 0;
}


#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <list>

__global__ void kernel(int* data, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) data[idx] *= 2;   // placeholder computation
}

int main(int argc, char** argv)
{
    // create a list container on the host
    std::list<int> hlist;
    hlist.push_back(13);
    hlist.push_back(27);

    // copy host data from the list into a device vector
    thrust::device_vector<int> dvec(hlist.size());
    thrust::copy(hlist.begin(), hlist.end(), dvec.begin());

    // alternative method to construct the device vector directly
    // thrust::device_vector<int> dvec2(hlist.begin(), hlist.end());

    // obtain a raw pointer to the device memory
    int* dpointer = thrust::raw_pointer_cast(&dvec[0]);

    // launch the device kernel function
    int blocksize = 256;
    int blocknum  = ((int)dvec.size() + blocksize - 1) / blocksize;
    kernel<<<blocknum, blocksize>>>(dpointer, (int)dvec.size());

    // the device memory is deallocated automatically when dvec
    // goes out of scope; calling cudaFree on it would be an error
    return 0;
}


CUSP: Generic Parallel Algorithms for Sparse Matrix Computations
- CUSP provides a high-level, flexible interface for manipulating sparse matrices and solving sparse linear systems with iterative methods
- CUSP is implemented on top of the Thrust template interface

"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Nathan Bell and Michael Garland, in Supercomputing '09, 2009


Matrix formats
- CUSP natively supports several sparse matrix formats
- CUSP makes it easy to transfer sparse matrix data between host and device and to convert between sparse matrix formats

// allocate storage for a CSR matrix on the host
// with 5 rows, 8 columns, and 12 nonzero elements
cusp::csr_matrix<int,float,cusp::host_memory> A(5,8,12);

// allocate and transfer from host to device memory
cusp::csr_matrix<int,float,cusp::device_memory> B = A;

// convert the CSR matrix format to HYB matrix format
cusp::hyb_matrix<int,float,cusp::device_memory> C = A;


Algorithms and iterative solvers
- matrix-vector multiplication and transpose
- conjugate gradient and stabilized biconjugate gradient (BiCGSTAB)

// matrix-vector multiplication
cusp::multiply(A, x, y);

// sparse matrix transpose
cusp::transpose(A, At);

// conjugate gradient
cusp::krylov::cg(A, x, b);

// stabilized biconjugate gradient
cusp::krylov::bicgstab(A, x, b);


#include <cusp/hyb_matrix.h>
#include <cusp/array1d.h>
#include <cusp/monitor.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>
#include <cusp/precond/diagonal.h>

int main(int argc, char** argv)
{
    typedef float ValueType;
    typedef cusp::device_memory MemorySpace;

    // create an empty HYB sparse matrix structure
    cusp::hyb_matrix<int,ValueType,MemorySpace> A;

    // load a matrix stored in the matrix market format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    // allocate storage for solution x and right-hand side b
    cusp::array1d<ValueType,MemorySpace> x(A.num_rows, 0);
    cusp::array1d<ValueType,MemorySpace> b(A.num_rows, 1);

    // set the iteration and residual stopping criteria
    cusp::verbose_monitor<ValueType> monitor(b, 100, 1e-6);

    // set up the diagonal matrix preconditioner
    cusp::precond::diagonal<ValueType,MemorySpace> M(A);

    // solve the linear system with the conjugate gradient method
    cusp::krylov::cg(A, x, b, monitor, M);

    return 0;
}


OpenNL: Open Numerical Library
- efficient sparse matrix data structures
- sparse direct solver via SuperLU
- Jacobi and SSOR matrix preconditioners
- iterative builder for sparse least-squares problems
- iterative solvers: conjugate gradient, BiCGSTAB, GMRES


ViennaCL
- a basic linear algebra library for computations on GPUs, based on OpenCL
- supports the basic linear algebra subroutines
- generalized minimal residual method (GMRES)
- direct linear system solver with LU factorization
- sparse conjugate gradient and biconjugate gradient solvers
- incomplete LU preconditioner with threshold (ILUT)
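A hedged sketch of ViennaCL's iterative solver interface, with types and tags as in the 1.x releases; matrix and vector assembly is omitted.

#include <viennacl/compressed_matrix.hpp>
#include <viennacl/vector.hpp>
#include <viennacl/linalg/cg.hpp>

// solve A*x = b on the device with conjugate gradient,
// stopping at relative tolerance 1e-6 or 100 iterations
viennacl::vector<float> solve_example(
    const viennacl::compressed_matrix<float>& A,
    const viennacl::vector<float>& b)
{
    return viennacl::linalg::solve(A, b,
                                   viennacl::linalg::cg_tag(1e-6, 100));
}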

GATLAS: GPU Automatically Tuned Linear Algebra Subroutines
- automatically tunes the level-3 BLAS kernels, based on OpenCL

Next Generation Architecture

The next-generation GPU architecture is called Fermi.


Third-generation Streaming Multiprocessor
- 32 processors per SM
- double precision at 50% of single precision speed (8x faster than GT200)
- dual thread/warp scheduler
- 4 special function units
- 64 KB of RAM for shared memory and configurable L1 cache


Second-generation Parallel Thread Execution (PTX)
- IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- fused multiply-add (FMA) instruction for both single and double precision
- newly designed 32-bit integer ALU and extended-precision operations


Improved memory system
- first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- L1 cache for each multiprocessor: improves bandwidth and reduces latency
- unified L2 cache (768 KB): coherent data sharing across all cores
- ECC support
- GDDR5 memory interface, almost 2x faster than GDDR3


GigaThread hardware scheduler
- hierarchically manages thousands of simultaneously active threads
- 10x faster application context switching to support concurrent kernel execution

[Figure: concurrent kernel execution with faster context switching]

- dual DMA engines for simultaneous data transfers, fully overlapping data movement with CPU and GPU processing
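A hedged sketch of how the dual DMA engines are exercised: independent work issued into separate streams, assuming page-locked host memory so the copies can run asynchronously; the process kernel and buffer names are hypothetical.

#include <cuda_runtime.h>

__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder computation
}

// chunk 0 streams in and computes while chunk 1 streams out;
// the two copies can be serviced by the two DMA engines at once
void pipelined(float* h_in, float* h_out,
               float* d_buf0, float* d_buf1,
               int n, cudaStream_t s0, cudaStream_t s1)
{
    size_t bytes = n * sizeof(float);

    cudaMemcpyAsync(d_buf0, h_in, bytes, cudaMemcpyHostToDevice, s0);
    process<<<(n + 255) / 256, 256, 0, s0>>>(d_buf0, n);

    cudaMemcpyAsync(h_out, d_buf1, bytes, cudaMemcpyDeviceToHost, s1);
}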

Third-generation Streaming Multiprocessor
- fully pipelined integer arithmetic logic unit (ALU) and floating-point unit
- floating-point arithmetic improved from IEEE 754-1985 to IEEE 754-2008 to support the FMA instruction
- integer ALU improved from 24-bit to 32-bit precision


What is new in the floating-point operations?
- support for fused multiply-add (FMA) instructions in both single and double precision
- an ordinary multiply-add computes the product A x B, truncates the extra digits of the product, and then adds C to the truncated result
- a fused multiply-add retains all digits of the intermediate product and rounds only once, after the addition
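A small illustration assuming CUDA's __fmaf_rn intrinsic; note that nvcc may contract a*b+c into an FMA on its own, so the explicit intrinsic is what guarantees the single rounding. The kernel assumes the launch covers exactly the array length.

__global__ void fma_demo(const float* a, const float* b,
                         const float* c, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // multiply-then-add: the product a[i]*b[i] is rounded before
    // the addition (unless the compiler contracts it to an FMA)
    float mad = a[i] * b[i] + c[i];
    // fused multiply-add: full-precision product, one rounding
    float fma = __fmaf_rn(a[i], b[i], c[i]);
    out[i] = fma - mad;   // nonzero wherever the extra rounding mattered
}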


What is new in the floating-point operations?
- support for subnormal numbers in both single and double precision; these are the small numbers that lie between zero and the smallest normalized number of a given floating-point number system
- prior generations flushed subnormal operands and results to zero
- CPUs typically perform subnormal calculations in exception-handling software, taking thousands of cycles, but Fermi handles subnormal calculations in hardware with no additional performance penalty

Third-generation Streaming Multiprocessor
- 16 load/store units, allowing source and destination addresses to be calculated for 16 threads per cycle
- 32 single precision FMA units and 16 double precision FMA units

[Figure: double precision application performance]

Third-generation Streaming Multiprocessor
- two warp schedulers and two instruction dispatch units
- the dual warp scheduler allows two warps to be issued and executed concurrently across the 32 cores


64 KB of configurable shared memory and L1 cache
- 48 KB shared memory and 16 KB L1 cache, or
- 16 KB shared memory and 48 KB L1 cache

[Figure: radix sort performance using shared memory]
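The split is selected per kernel through the CUDA runtime API; a minimal sketch, where mykernel is a hypothetical kernel.

__global__ void mykernel(float* data);   // hypothetical kernel

void configure()
{
    // prefer 48 KB shared memory / 16 KB L1 for a kernel that
    // stages data in shared memory (e.g. the radix sort above)
    cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferShared);

    // or prefer 16 KB shared memory / 48 KB L1 for kernels whose
    // accesses benefit from a larger cache
    // cudaFuncSetCacheConfig(mykernel, cudaFuncCachePreferL1);
}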


Unified memory address space
- combines the three separate address spaces (thread-local, shared, and global) for load and store instructions
- this feature enables Fermi to support C++-specific programs: virtual functions, function pointers, new and delete on objects, try and catch
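A small hypothetical sketch of what the unified address space enables: a single generic-pointer function that works regardless of whether the pointer refers to shared or global memory, where pre-Fermi hardware needed distinct load instructions per space.

// one generic load function serves every memory space
__device__ float load_value(const float* p) { return *p; }

__global__ void demo(const float* gdata, float* out)
{
    __shared__ float sdata[256];
    int i = threadIdx.x;
    sdata[i] = gdata[i];
    __syncthreads();

    // the same function receives a shared pointer and a global
    // pointer; the hardware resolves the space at run time
    out[i] = load_value(&sdata[i]) + load_value(&gdata[i]);
}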


[Figure: previous-generation GPU block diagram (host, input assembler, vertex/pixel thread issue, work distribution, SP/L1/TF clusters, L2/FB partitions), highlighting the scheduler bottleneck]

[Figure: old bottleneck vs. new bottleneck after the scheduler redesign]
