CUDA Linear Algebra Library and Next Generation
Yukai Hung [email protected]
Department of Mathematics, National Taiwan University

Page 1:

CUDA Linear Algebra Library and Next Generation

Yukai Hung
[email protected]
Department of Mathematics, National Taiwan University

Page 2:

Sparse Matrix-Vector Multiplication

Page 3:

Sparse Matrix-Vector Multiplication

The dense approach is wasteful for sparse matrices
- unclear how to map the work to parallel processors
- irregular element accesses to global memory

Page 4:

Sparse Matrix-Vector Multiplication

Common sparse matrix storage formats, from structured to unstructured:

DIA - diagonal format

ELL - ellpack format

CSR - compressed row format

HYB - hybrid format

COO - coordinate format

Page 5:

Sparse Matrix-Vector Multiplication

DIA - diagonal format
- the diagonals should be mostly populated
- high parallelism: map one thread to one row
- good parallel efficiency and good memory behavior (global memory coalescing)

Page 6:

Sparse Matrix-Vector Multiplication

ELL - ellpack format
- again assigns one thread to compute one row
- but load imbalance hurts parallel efficiency

Page 7:

Sparse Matrix-Vector Multiplication

COO - coordinate format
- insensitive to the sparsity pattern, but slower than ellpack
- assigns one thread per element and combines the partial results from all elements in a row to produce one output element

Page 8:

Sparse Matrix-Vector Multiplication

HYB - hybrid format
- combines the regular ellpack format (for the typical entries per row) with the flexible coo format (for the exceptional overflow entries)

Page 9:

Sparse Matrix-Vector Multiplication

Property comparison (fixed number of nonzeros, variable matrix size):

Format        Granularity      Coalescing
DIA           thread/row       full
ELL           thread/row       full
CSR (scalar)  thread/row       rare
CSR (vector)  warp/row         partial
COO           thread/nonzero   full
HYB           thread/row       full

Page 10:

Sparse Matrix-Vector Multiplication

For parallel efficiency: ellpack format
- one thread per row is efficient for memory access

For load imbalance: coordinate format
- one thread per element is insensitive to the matrix structure

Conclusion for all structures
- the hybrid structure gives the best performance on average
- irregularity is manageable if you regularize the common case

Page 11:

Sparse Matrix-Vector Multiplication

Performance comparison

Page 12:

Sparse Matrix-Vector Multiplication

Performance comparison

Page 13:

Sparse Matrix-Vector Multiplication

Performance comparison

Page 14:

Linear Algebra Library

Page 15:

Linear Algebra Library

CUBLAS: CUDA Basic Linear Algebra Subroutines
- implements the basic linear algebra subroutines at the runtime level
- only available for a single device; not implemented for multiple devices

CUFFT: CUDA Fast Fourier Transform Library
- uses divide-and-conquer algorithms for the discrete transform
- supports real and complex data, in-place or out-of-place
- supports stream operations for simultaneous execution
- use complex-to-complex transforms in place of real-to-complex
- power-of-two problem sizes give the best performance

Page 16:

Linear Algebra Library

CUDPP: CUDA Data Parallel Primitives Library
- a library of data-parallel algorithm primitives
- parallel prefix-sum (scan), sorting, and data reduction
- stream compaction and random number generation

Page 17:

Linear Algebra Library

CUDPP: CUDA Data Parallel Primitives Library
- comparison with a multicore CPU

Page 18:

Linear Algebra Library

CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for several languages

linear system solvers / least-squares solvers
orthogonal factorizations / symmetric eigenproblems
non-symmetric eigenproblems / singular value decompositions

Page 19:

Linear Algebra Library

CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for several languages

double precision LU-factorization / double precision QR-factorization

Page 20:

Linear Algebra Library

CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for several languages

double precision QR-factorization / double precision symmetric eigenvalue problem

Page 21:

Linear Algebra Library

CULA: GPU-Accelerated Linear Algebra Library
- implements LAPACK functions with interfaces for several languages

double precision symmetric eigenvalue problem / double precision singular value decomposition

Page 22:

Linear Algebra Library

MAGMA: Matrix Algebra on GPU and Multicore Architectures
- an open-source project to develop a dense linear algebra library, similar to LAPACK, but for heterogeneous/hybrid architectures combining manycore CPUs and GPUs

Page 23:

Linear Algebra Library

MAGMA: Matrix Algebra on GPU and Multicore Architectures

double precision matrix-matrix multiplication / single precision QR-factorization

Page 24:

Linear Algebra Library

MAGMA: Matrix Algebra on GPU and Multicore Architectures

single precision QR-factorization / solving Ax=b by LU-factorization

Page 25:

Linear Algebra Library

MAGMA: Matrix Algebra on GPU and Multicore Architectures

solving Ax=b by LU-factorization / single precision Cholesky-factorization

Page 26:

Linear Algebra Library

Thrust
- a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL), providing a flexible high-level interface that greatly enhances productivity

Page 27:

Linear Algebra Library

int main(int argc, char** argv)
{
    // allocate memory space on the host
    thrust::host_vector<float> hvec(1024);

    // generate random numbers on the host
    thrust::generate(hvec.begin(), hvec.end(), rand);

    // allocate and transfer data to the device
    thrust::device_vector<float> dvec = hvec;

    // manipulate device values from the host
    dvec[0] = (float)rand() / (float)(RAND_MAX - 1);
    dvec[1] = (float)rand() / (float)(RAND_MAX - 1);

    // sum all data on the device by parallel reduction
    float sum = thrust::reduce(dvec.begin(), dvec.end());

    // sort all data on the device by radix sort
    thrust::sort(dvec.begin(), dvec.end());

    // transfer the final data back to the host
    thrust::copy(dvec.begin(), dvec.end(), hvec.begin());
}

Page 28:

Linear Algebra Library

int main(int argc, char** argv)
{
    // create a list container on the host
    std::list<int> hlist;
    hlist.push_back(13);
    hlist.push_back(27);

    // copy host data from the list into a device vector
    thrust::device_vector<int> dvec(hlist.size());
    thrust::copy(hlist.begin(), hlist.end(), dvec.begin());

    // alternative method to convert from host to device
    // thrust::device_vector<int> dvec(hlist.begin(), hlist.end());

    // obtain a raw pointer from device memory
    int* dpointer = thrust::raw_pointer_cast(&dvec[0]);

    // launch the device kernel function
    kernel<<<blocknum, blocksize>>>(dpointer, dvec.size());

    // the device_vector deallocates its own storage when it
    // goes out of scope; do not cudaFree the raw pointer
}

Page 29:

Linear Algebra Library

CUSP: Generic Parallel Algorithms for Sparse Matrix Computations
- cusp provides a high-level, flexible interface for manipulating sparse matrices and solving sparse linear systems with iterative methods
- cusp is implemented on top of the thrust template interface

"Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Nathan Bell and Michael Garland, Supercomputing '09, 2009

Page 30:

Linear Algebra Library

CUSP: Generic Parallel Algorithms for Sparse Matrix Computations

Page 31:

Linear Algebra Library

Matrix formats
- cusp natively supports several sparse matrix formats
- cusp makes it easy to transfer sparse matrix data between host and device and to convert between sparse matrix formats

// allocate storage for a CSR matrix on the host
// with 5 rows, 8 columns, and 12 nonzero elements
cusp::csr_matrix<int,float,cusp::host_memory> A(5,8,12);

// allocate and transfer from host to device memory
cusp::csr_matrix<int,float,cusp::device_memory> B = A;

// convert the CSR matrix format to the HYB matrix format
cusp::hyb_matrix<int,float,cusp::device_memory> C = A;

Page 32:

Linear Algebra Library

Algorithms and iterative solvers
- matrix-vector multiplication and transpose
- conjugate gradient and biconjugate gradient stabilized methods

// matrix-vector multiplication
cusp::multiply(A,x,y);

// sparse matrix transpose
cusp::transpose(A,At);

// conjugate gradient
cusp::krylov::cg(A,x,b);

// biconjugate gradient stabilized
cusp::krylov::bicgstab(A,x,b);

Page 33:

Linear Algebra Library

int main(int argc, char** argv)
{
    typedef float ValueType;
    typedef cusp::device_memory MemorySpace;

    // create an empty HYB sparse matrix structure
    cusp::hyb_matrix<int,ValueType,MemorySpace> A;

    // load a matrix stored in the matrix market format
    cusp::io::read_matrix_market_file(A,"5pt_10x10.mtx");

    // allocate storage for the solution x and right-hand side b
    cusp::array1d<ValueType,MemorySpace> x(A.num_rows,0);
    cusp::array1d<ValueType,MemorySpace> b(A.num_rows,1);

    // set the iteration and residual stopping criteria
    cusp::verbose_monitor<ValueType> monitor(b,100,1e-6);

    // set up the matrix preconditioner
    cusp::precond::diagonal<ValueType,MemorySpace> M(A);

    // solve the linear system with the conjugate gradient method
    cusp::krylov::cg(A,x,b,monitor,M);

    return 0;
}

Page 34:

Linear Algebra Library

OpenNL: Open Numerical Library
- efficient sparse matrix data structures
- sparse direct linear solver (SuperLU)
- matrix preconditioners: Jacobi and SSOR
- iterative builder for sparse least-squares problems
- iterative solvers: conjugate gradient, BiCGSTAB, GMRES

Page 35:

Linear Algebra Library

ViennaCL
- a basic linear algebra library for computations on GPUs, based on OpenCL
- supports the basic linear algebra subroutines
- generalized minimal residual method (GMRES)
- direct linear system solver with LU-factorization
- sparse conjugate gradient and biconjugate gradient solvers
- incomplete LU preconditioner with threshold

GATLAS: GPU Automatically Tuned Linear Algebra Subroutines
- automatically tunes the level-3 BLAS kernels, based on OpenCL

Page 36:

Next Generation Architecture

Page 37:

Next Generation Architecture

The next-generation GPU architecture is called Fermi

Page 38:

Next Generation Architecture

The next-generation GPU architecture is called Fermi

Page 39:

Next Generation Architecture

Third-generation Streaming Multiprocessor
- 32 processors in each SM
- double precision at 50% of single precision throughput (8X faster than GT200)
- dual thread/warp scheduler
- 4 special function units
- 64 KB of on-chip RAM split between shared memory and configurable L1 cache

Page 40:

Next Generation Architecture

Second-generation Parallel Thread Execution (PTX)
- IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- fused multiply-add (FMA) instruction for both single and double precision
- newly designed 32-bit integer ALU and extended-precision operations

Page 41:

Next Generation Architecture

Improved memory system
- first GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- L1 cache for each multiprocessor: improves bandwidth and reduces latency
- unified L2 cache (768 KB): coherent data sharing across all cores
- ECC support
- GDDR5 memory interface, almost 2X faster than GDDR3

Page 42:

Next Generation Architecture

GigaThread hardware scheduler
- hierarchically manages thousands of simultaneously active threads
- 10X faster application context switching to support concurrent kernel execution

Page 43:

Next Generation Architecture

GigaThread hardware scheduler
- concurrent kernel execution + faster context switching

Page 44:

Next Generation Architecture

GigaThread hardware scheduler
- dual DMA engines for simultaneous data transfers, which can fully overlap with CPU and GPU processing time

Page 45:

Next Generation Architecture

Third-generation Streaming Multiprocessor
- fully pipelined integer arithmetic logic unit (ALU) and floating-point unit
- floating-point arithmetic improved from IEEE 754-1985 to IEEE 754-2008 to support the FMA instruction
- integer ALU improved from 24-bit to 32-bit precision

Page 46:

Next Generation Architecture

What is NEW in the floating-point operations?
- fused multiply-add instructions for both single and double precision

original multiply-add: computes the product A x B, truncates the extra digits of the intermediate result, then adds C and rounds again

fused multiply-add: retains all digits of the intermediate product A x B and performs a single rounding after the addition with C

Page 47:

Next Generation Architecture

What is NEW in the floating-point operations?
- subnormal numbers are supported for both single and double precision; these are the small numbers that lie between zero and the smallest normalized number of a given floating-point number system
- the prior generation flushed subnormal operands and results to zero
- CPUs typically perform subnormal calculations in exception-handling software, taking thousands of cycles; Fermi handles subnormal calculations in hardware with no additional performance penalty

Page 48:

Next Generation Architecture

Third-generation Streaming Multiprocessor
- 16 load/store units allow source and destination addresses to be calculated for 16 threads per cycle
- 32 single precision FMA units and 16 double precision FMA units

Page 49:

Next Generation Architecture

Third-generation Streaming Multiprocessor
- double precision application performance

Page 50:

Next Generation Architecture

Third-generation Streaming Multiprocessor
- two warp schedulers and instruction dispatch units

Page 51:

Next Generation Architecture

Third-generation Streaming Multiprocessor
- the dual warp scheduler allows two warps to be issued and executed concurrently across the 32 cores

Page 52:

Next Generation Architecture

Third-generation Streaming Multiprocessor
- two warp schedulers and instruction dispatch units
- 64 KB of configurable shared memory and L1 cache

Page 53:

Next Generation Architecture

64 KB of configurable shared memory and L1 cache
- 48 KB shared memory and 16 KB L1 cache, or
- 16 KB shared memory and 48 KB L1 cache

radix sort using shared memory

Page 54:

Next Generation Architecture

Unified memory address space
- combines the three separate address spaces (local, shared, and global) for loads and stores
- this feature enables Fermi to support C++-specific constructs: virtual functions, function pointers, new and delete for objects, try and catch

Page 55:

Next Generation Architecture

[summary table of architecture features]

Page 56:

Next Generation Architecture

[block diagram of the previous-generation GPU: host interface, input assembler, vertex and pixel thread issue, setup/raster/z-cull, work distribution across thread processor clusters (SP pairs with L1 cache and texture fetch units), and L2 cache / framebuffer partitions]

scheduler bottleneck

Page 57:

Next Generation Architecture

[diagram contrasting the old bottleneck with the new bottleneck]

Page 58:

References
- Mark Harris: http://www.markmark.net/
- Wei-Chao Chen: http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu: http://impact.crhc.illinois.edu/people/current/hwu.php