GPU computing and CUDA
Marko Mišić ([email protected]), Milo Tomašević ([email protected])
YUINFO 2012, Kopaonik, 29.02.2012.
Introduction to GPU computing (1)
- Graphics Processing Units (GPUs) have been used for non-graphics computation for several years
- This trend is called General-Purpose computation on GPUs (GPGPU)
- GPGPU applications can be found in:
  - Computational physics/chemistry/biology
  - Signal processing
  - Computational geometry
  - Database management
  - Computational finance
  - Computer vision
Introduction to GPU computing (2)
- The GPU is a highly parallel processor, good at data-parallel processing with many calculations per memory access
  - The same computation is executed on many data elements in parallel, with high arithmetic intensity
- The same computation means a lower requirement for sophisticated flow control
- High arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches
CPU vs. GPU trends (1)
- The CPU is optimized for fast execution of sequential tasks
  - Big caches hide memory latencies
  - Sophisticated flow control
- The GPU is specialized for compute-intensive, highly parallel computation
  - More transistors can be devoted to data processing rather than data caching and flow control
[Figure: simplified CPU vs. GPU die layout -- the CPU devotes large areas to control logic and cache next to its DRAM, while the GPU fills most of the die with ALUs.]
CPU vs. GPU trends (2)
- The GPU has evolved into a very flexible and powerful processor
  - Programmable using high-level languages
  - Computational power: ~1 TFLOPS vs. ~100 GFLOPS
  - Memory bandwidth: ~10x higher
- A GPU is found in almost every workstation
CPU vs. GPU trends (3)
[Figure: "CUDA Advantage" -- reported speedups over the CPU ranging from 10x to 197x across several domains: rigid body physics solver, matrix numerics (BLAS1: 60+ GB/s, BLAS3: 100+ GFLOPS), wave equation (FDTD: 1.2 Gcells/s, FFT: 52 GFLOPS, as defined by benchFFT), biological sequence match (SSEARCH: 5.2 Gcells/s), and finance (Black-Scholes: 4.7 GOptions/s).]
History of GPU programming
- The fast-growing video game industry exerts strong pressure that forces constant innovation
- GPUs evolved from fixed-function pipeline processors to more programmable, general-purpose processors
- Programmable shaders (2000)
  - Programmed through the OpenGL and DirectX APIs
  - Lots of limitations: memory access, ISA, floating-point support, etc.
- NVIDIA CUDA (2007)
- AMD/ATI (Brook+, FireStream, Close-To-Metal)
- Microsoft DirectCompute (DirectX 10 / DirectX 11)
- Open Computing Language, OpenCL (2009)
CUDA overview (1)
- Compute Unified Device Architecture (CUDA)
- A new hardware and software architecture for issuing and managing computations on the GPU
  - Introduced with the NVIDIA 8000 (G80) series GPUs
- General-purpose programming model
  - SIMD / SPMD
  - The user launches batches of threads on the GPU
  - The GPU can be seen as a dedicated super-threaded, massively data-parallel coprocessor
- Explicit and unrestricted memory management
CUDA overview (2)
- The GPU is viewed as a compute device that is a coprocessor to the CPU (host)
  - Executes the compute-intensive part of the application
  - Runs many threads in parallel
  - Has its own DRAM (device memory)
- Data-parallel portions of an application are expressed as device kernels which run on many threads
- GPU threads are extremely lightweight
  - Very little creation overhead
  - The GPU needs thousands of threads for full efficiency, while a multicore CPU needs only a few
CUDA overview (3)
- Dedicated software stack: runtime and driver
- C-language extension for easier programming
- Targeted API for advanced users
- Complete tool chain: compiler, debugger, profiler
- Libraries and 3rd-party support
  - GPU Computing SDK, cuFFT, cuBLAS...
  - FORTRAN, C++, Python, MATLAB, Thrust, GMAC...
[Figure: CUDA software stack -- the application sits on top of the CUDA libraries (FFT, BLAS) and the CUDA runtime, which use the CUDA driver to reach the GPU; the CPU orchestrates the stack.]
Programming model (1)
[Figure: execution timeline -- serial code runs on the host, then a parallel kernel is launched on the device, e.g. KernelA<<< nBlk, nTid >>>(args); more serial host code follows, then KernelB<<< nBlk, nTid >>>(args), and so on.]
- A CUDA application consists of two parts
  - Sequential parts are executed on the CPU (host)
  - Compute-intensive parts are executed on the GPU (device)
- The CPU is responsible for data management, memory transfers, and the GPU execution configuration
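To make this split concrete, here is a minimal sketch (not taken from the slides) of a hypothetical vector-addition kernel and the host code that drives it; all names and sizes are illustrative.

    // Hypothetical element-wise kernel (device code).
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Host code: the CPU allocates device memory, copies the inputs,
    // configures and launches the kernel, and copies the result back.
    void runVecAdd(const float *hA, const float *hB, float *hC, int n)
    {
        float *dA, *dB, *dC;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);

        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        int nTid = 256;                    // threads per block
        int nBlk = (n + nTid - 1) / nTid;  // blocks in the grid
        vecAdd<<<nBlk, nTid>>>(dA, dB, dC, n);

        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }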
Programming model (2)
- A kernel is executed as a grid of thread blocks
- A thread block is a batch of threads that can cooperate with each other by:
  - Efficiently sharing data through shared memory
  - Synchronizing their execution
- Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 of the device; each grid is a 2D array of blocks (Block (0,0) ... Block (2,1)), and each block is a 2D array of threads (Thread (0,0) ... Thread (4,2)).]
Programming model (3)
- Threads and blocks have IDs, so each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data (see the index sketch after the figure below)
  - Image processing
  - Solving PDEs on volumes
[Figure: a device grid of blocks (Block (0,0) ... Block (2,1)), with Block (1,1) expanded into its threads (Thread (0,0) ... Thread (4,2)).]
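As a sketch of how these IDs are used (the kernel below is illustrative, not from the slides), each thread of a hypothetical 2D image kernel combines its block and thread indices into the coordinates of the pixel it owns:

    // Hypothetical kernel over a 2D image: each thread derives the (x, y)
    // coordinates of its element from the block ID and the thread ID.
    __global__ void brighten(unsigned char *img, int width, int height, int delta)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row

        if (x < width && y < height) {
            int idx = y * width + x;                     // row-major index
            int v = img[idx] + delta;
            img[idx] = v > 255 ? 255 : (unsigned char)v;
        }
    }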
Memory model (1)
Each thread can:
- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
[Figure: CUDA memory hierarchy -- each thread has its own registers and local memory, each block has shared memory visible to its threads, and all blocks of a grid share global, constant, and texture memory; the host accesses the latter three.]
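The following illustrative fragment (not from the slides, and assumed to be launched with 256 threads per block) shows where typical declarations end up in this hierarchy; the variable names are made up for the example.

    // Illustrative placement of variables in the CUDA memory model.
    __constant__ float coeff[16];        // per-grid constant memory, read-only in kernels
                                         // (filled from the host with cudaMemcpyToSymbol)

    __global__ void spaces(float *out)   // 'out' points into per-grid global memory
    {
        float r = 0.0f;                  // scalar -> per-thread register
        float scratch[64];               // larger per-thread array -> typically local memory
        __shared__ float tile[256];      // per-block shared memory (assumes 256 threads/block)

        tile[threadIdx.x] = coeff[threadIdx.x % 16];
        __syncthreads();                 // make the shared tile visible to the whole block

        for (int i = 0; i < 64; ++i)
            scratch[i] = tile[(threadIdx.x + i) % 256];
        for (int i = 0; i < 64; ++i)
            r += scratch[i];

        out[blockIdx.x * blockDim.x + threadIdx.x] = r;
    }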
Memory model (2)
- The host can read/write global, constant, and texture memory
  - All stored in device DRAM
- Global memory accesses are slow
  - Around 200 cycles of latency
- The memory architecture is optimized for high bandwidth
  - Memory banks
  - Transactions
[Figure: device memory view -- registers and shared memory are on-chip within each block, while global memory (DRAM) is shared by all blocks and accessible to the host.]
Memory model (3)
- Shared memory is a fast on-chip memory
  - Allows threads in a block to share intermediate data
  - Access time of ~3-4 cycles
- Can be seen as a user-managed cache (scratchpad)
  - Threads are responsible for bringing data into and out of shared memory
- Small in size (up to 48 KB)
[Figure: each multiprocessor pairs its ALUs with an on-chip shared memory alongside its control and cache logic; data elements (d0-d3, d4-d7, ...) are staged from DRAM into the shared memories for processing.]
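A small illustrative kernel of this "user-managed cache" idea (not from the slides): a block-wide sum that stages its inputs in shared memory and synchronizes between steps, assuming the block size is a power of two and at most 256.

    // Illustrative block-wide sum: each block stages its part of the input in
    // fast on-chip shared memory and reduces it there, touching global memory
    // only once per element on the way in and once per block on the way out.
    __global__ void blockSum(const float *in, float *blockSums, int n)
    {
        __shared__ float buf[256];                 // one element per thread

        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        buf[tid] = (i < n) ? in[i] : 0.0f;         // global -> shared
        __syncthreads();

        // Tree reduction in shared memory (block size assumed to be a power of two).
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            blockSums[blockIdx.x] = buf[0];        // shared -> global
    }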
A common programming strategy
- Local and global memory reside in device memory (DRAM)
  - Much slower access than shared memory
- A common way of performing computation on the device is to block it up (tile) to take advantage of fast shared memory
- Partition the data set into subsets that fit into shared memory
- Handle each data subset with one thread block by:
  - Loading the subset from global memory to shared memory
  - Performing the computation on the subset from shared memory
    - Each thread can efficiently multi-pass over any data element
  - Copying results from shared memory to global memory
Matrix Multiplication Example (1)
- P = M * N, all of size WIDTH x WIDTH
- Without blocking:
  - One thread handles one element of P
  - M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH; the thread computing one element of P reads an entire row of M and an entire column of N.]
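The slides give no source code, but the unblocked kernel they describe might look like this sketch (names are illustrative):

    // Naive version (no blocking): one thread computes one element of P,
    // so every thread re-reads a full row of M and a full column of N from
    // global memory -- M and N are effectively loaded WIDTH times.
    __global__ void matMulNaive(const float *M, const float *N, float *P, int width)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }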
Matrix Multiplication Example (2)
- P = M * N, all of size WIDTH x WIDTH
- With blocking:
  - One thread block handles one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P
  - M and N are only loaded WIDTH / BLOCK_SIZE times from global memory
  - Great saving of memory bandwidth!
[Figure: M, N, and P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; the block computing Psub sweeps across one row of tiles of M and one column of tiles of N.]
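A possible sketch of the blocked kernel described above, assuming WIDTH is a multiple of BLOCK_SIZE and that the block dimensions are BLOCK_SIZE x BLOCK_SIZE (the code is illustrative, not taken from the slides):

    #define BLOCK_SIZE 16

    // Blocked (tiled) version: each thread block computes one BLOCK_SIZE x BLOCK_SIZE
    // sub-matrix Psub. The needed tiles of M and N are staged in shared memory, so each
    // element of M and N is read from global memory only WIDTH / BLOCK_SIZE times.
    __global__ void matMulTiled(const float *M, const float *N, float *P, int width)
    {
        __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / BLOCK_SIZE; ++t) {
            // Each thread loads one element of the current M tile and N tile.
            Ms[threadIdx.y][threadIdx.x] = M[row * width + t * BLOCK_SIZE + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_SIZE + threadIdx.y) * width + col];
            __syncthreads();                      // wait until the whole tile is loaded

            for (int k = 0; k < BLOCK_SIZE; ++k)  // multiply the two tiles from shared memory
                sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();                      // wait before overwriting the tiles
        }

        P[row * width + col] = sum;
    }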
CUDA API (1)
The CUDA API is an extension to the C programming language consisting of:
- Language extensions
  - To target portions of the code for execution on the device
- A runtime library split into:
  - A common component providing built-in vector types and a subset of the C runtime library in both host and device code
  - A host component to control and access one or more devices from the host
  - A device component providing device-specific functions
CUDA API (2)
- Function declaration qualifiers
  - __global__, __host__, __device__
- Variable qualifiers
  - __device__, __constant__, __shared__, etc.
- Built-in variables
  - gridDim, blockDim, blockIdx, threadIdx
- Mathematical functions
- Kernel calling convention (execution configuration)
  - myKernel<<< DimGrid, DimBlock >>>(arg1, ... );
  - The programmer explicitly specifies the block and grid organization (1D, 2D, or 3D)
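For example, a 2D execution configuration might be set up as in the fragment below (myKernel and its arguments are placeholders from the slide; width and height are assumed problem dimensions):

    // Hypothetical host-side fragment: the programmer explicitly chooses
    // the block and grid organization before launching the kernel.
    dim3 dimBlock(16, 16);                                // 256 threads per block, arranged in 2D
    dim3 dimGrid((width  + dimBlock.x - 1) / dimBlock.x,  // enough blocks to cover width x height data
                 (height + dimBlock.y - 1) / dimBlock.y);
    myKernel<<<dimGrid, dimBlock>>>(arg1, arg2);          // execution configuration in <<< >>>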
Hardware implementation (1)
- The device is a set of multiprocessors
- Each multiprocessor is a set of 32-bit processors with a SIMD architecture
- At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
  - Including branches (see the divergence sketch after the figure)
- Allows scalable execution of kernels
  - Adding more multiprocessors improves performance
[Figure: the device as a set of multiprocessors (Multiprocessor 1 ... N), each with one instruction unit driving M scalar processors.]
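A small illustrative kernel (not from the slides) showing why branches inside a warp matter: when threads of one warp take different paths, the multiprocessor executes both paths one after another with part of the warp masked off.

    // Threads of one warp execute in lockstep. Here, threads with even and odd
    // IDs within a warp take different branches, so the hardware runs both
    // branches serially with part of the warp disabled each time.
    __global__ void divergent(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (threadIdx.x % 2 == 0)      // diverges within every warp
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }

    // A branch that follows warp boundaries (e.g. on threadIdx.x / warpSize)
    // would not cause this serialization.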
Hardware implementation (2)
[Figure: GPU hardware block diagram -- the host feeds an input assembler and a thread execution manager that distribute work across the multiprocessors; each multiprocessor has its own parallel data cache and texture unit, and all of them reach global memory through load/store units.]
Hardware implementation (3)
- Each thread block of a grid is split into warps that get executed by one multiprocessor
  - A warp consists of threads with consecutive thread IDs
- Each thread block is executed by only one multiprocessor
  - Its shared memory space resides in the on-chip shared memory
  - Registers are allocated among the threads
  - A kernel that requires too many registers will fail to launch
- A multiprocessor can execute several blocks concurrently
  - Shared memory and registers are allocated among the threads of all concurrent blocks
  - Decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently
Memory architecture (1)
- In a parallel machine, many threads access memory
  - Memory is divided into banks, which is essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to the same bank result in a bank conflict
  - Conflicting accesses are serialized
- Shared memory is organized in a similar fashion
[Figure: shared memory banks, Bank 0 through Bank 15.]
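An illustrative fragment (not from the slides; it assumes blocks of at most 512 threads, and the exact bank count depends on the architecture) contrasting a conflict-free shared memory access pattern with one that causes bank conflicts:

    // Illustrative kernel: how shared memory access patterns map to banks.
    // Successive 32-bit words fall into successive banks.
    __global__ void bankDemo(float *out)
    {
        __shared__ float s[512];
        s[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();

        // Conflict-free: consecutive threads read consecutive words,
        // and therefore consecutive banks.
        float a = s[threadIdx.x];

        // Stride-2 access: threads of the same (half-)warp map pairwise onto
        // the same bank, so the hardware serializes them (a 2-way bank conflict).
        float b = s[(2 * threadIdx.x) % 512];

        out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
    }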
Memory architecture (2)
- When accessing global memory, accesses are combined into transactions
- Peak bandwidth is achieved when all threads in a half-warp access contiguous memory locations
  - "Memory coalescing"
  - In that case, there are no bank conflicts
- The programmer is responsible for optimizing algorithms to access data in the appropriate fashion
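Two illustrative kernels (not from the slides; array names are made up for the example) contrasting a coalesced global memory access pattern with a strided one:

    // Coalesced: consecutive threads of a (half-)warp read consecutive
    // addresses, so the hardware combines the reads into a few wide transactions.
    __global__ void coalescedCopy(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads read addresses 'stride' elements apart;
    // the reads cannot be coalesced and effective bandwidth drops sharply.
    __global__ void stridedCopy(const float *in, float *out, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }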
Performance considerations
- CUDA has a low learning curve
  - It is easy to write a correct program
- Performance can vary greatly depending on the resource constraints of the particular device architecture
  - Performance-conscious programmers still need to be aware of them to make good use of contemporary hardware
- It is essential to understand the hardware and memory architecture
  - Thread scheduling and execution
  - Suitable memory access patterns
  - Shared memory utilization
  - Resource limitations
Conclusion
- The highly multithreaded architecture of modern GPUs is very suitable for solving data-parallel problems
  - Vastly improves performance in certain domains
- It is expected that GPU architectures will evolve to further broaden the application domains
  - We are at the dawn of heterogeneous computing
- Software support is developing rapidly
  - Mature tool chain
  - Libraries
  - Available applications
  - OpenCL
References
- David Kirk, Wen-mei Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.
- Course ECE498AL, University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/al/
- Dan Connors, OpenCL and CUDA Programming for Multicore and GPU Architectures, ACACES 2011, Fiuggi, Italy, 2011.
- David Kirk, Wen-mei Hwu, Programming and Tuning Massively Parallel Systems, PUMPS 2011, Barcelona, Spain, 2011.
- NVIDIA CUDA C Programming Guide 4.0, 2011.
- Mišić, Đurđević, Tomašević, "Evolution and Trends in GPU Computing", MIPRO 2012, Opatija, Croatia, 2012. (to be published)
- NVIDIA Developer Zone, http://developer.nvidia.com/category/zone/cuda-zone
- http://en.wikipedia.org/wiki/GPGPU
- http://en.wikipedia.org/wiki/CUDA
- GPU training wiki, https://hpcforge.org/plugins/mediawiki/wiki/gpu-training/index.php/Main_Page
GPU computing and CUDA
Questions?
Marko Mišić ([email protected]), Milo Tomašević ([email protected])
YUINFO 2012, Kopaonik, 29.02.2012.