
GPU and CUDA

Available processing power (2014)
- Nvidia Tesla K20X: 3950 GFLOPS single precision, 1310 GFLOPS double precision
- Intel Core i7-3900: 187 GFLOPS

Available processing power (today)
- Nvidia Tesla P100 (Pascal): 10600 GFLOPS single precision, 5300 GFLOPS double precision
- Intel Xeon E7-8870 v3: ~1720 GFLOPS single precision, ~860 GFLOPS double precision
G80 (2006)
- 128 Streaming Processors (SPs), 367 GFLOPS, 768 MB DRAM
- 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
[G80 block diagram: SPs with load/store units and per-cluster parallel data caches]
G80 Characteristics
- 367 GFLOPS peak performance (25-50 times that of contemporary high-end microprocessors)
- 265 GFLOPS sustained
- Massively parallel: 128 cores, 90 W
- Massively threaded: sustains thousands of threads per application
- 30-100x speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publicly until I triple check those numbers."
- John Stone, VMD group, Physics UIUC
2010 - Fermi Architecture
- ~1.5 TFLOPS (SP) / ~800 GFLOPS (DP)
- 230 GB/s DRAM bandwidth
- 16 SMs x 32 SPs = 512 cores (and a true cache)

2012 - Kepler Architecture
- ~3.9 TFLOPS (SP) / ~1300 GFLOPS (DP)
- 250 GB/s DRAM bandwidth
- 15 SMs x (192 single-precision + 64 double-precision cores) = 2880 single-precision "cores"
- Dynamic parallelism
Nvidia Pascal GP100 GPU Architecture
- An array of 6 Graphics Processing Clusters (GPCs)
- Each GPC has 10 Streaming Multiprocessors (SMs)
- Each SM has 64 CUDA cores
- 3840 single-precision CUDA cores in total
- A total of 4096 KB of L2 cache
Future Apps Reflect a Concurrent World
- Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
- These "super-apps" represent and model a physical, concurrent world
- Various granularities of parallelism exist, but...
  - the programming model must not hinder parallel implementation
  - data delivery needs careful management
Stretching Traditional Architectures
- The game is to grow mainstream architectures "out" or domain-specific architectures "in"
- CUDA takes the latter approach
[Figure: traditional applications vs. current architecture coverage]
Before CUDA
- Dealing with the graphics API: working with the corner cases of the graphics API
- Addressing modes: limited texture size/dimension
- Shader capabilities: limited outputs
- Communication limited between pixels
[Figure: fragment-shader pipeline - input registers, fragment program, output registers, FB memory]
CUDA
- The user kicks off batches of threads on the GPU
- GPU = dedicated super-threaded, massively data-parallel co-processor
- Targeted software stack: compute-oriented drivers, language, and tools
- Driver for loading computation programs onto the GPU:
  - standalone driver, optimized for computation
  - interface designed for compute: a graphics-free API
  - data sharing with OpenGL buffer objects
  - guaranteed maximum download and readback speeds
  - explicit GPU memory management
CUDA basics
- The computing system consists of:
  - a HOST running serial or modestly parallel C code
  - one or more DEVICES running kernel C code, exploiting massive data parallelism
CUDA Devices and threads
- A compute device:
  - is a coprocessor to the CPU (host)
  - has its own DRAM (device memory)
  - runs many threads in parallel
  - is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels which run on many threads
- Differences between GPU and CPU threads:
  - GPU threads are extremely lightweight: very little creation overhead
  - the GPU needs thousands of threads for full efficiency; a multi-core CPU needs only a few
The thread hierarchy (bottom-up)
- A KERNEL is a C function that, when called, is executed N times in parallel by N different CUDA THREADS.
- Threads are organized in BLOCKS:
  - Threads in the same block share the same processor and its resources.
  - On current GPUs, a block may contain at most 1024 threads.
  - Threads within a block can cooperate by sharing data through shared memory.
  - Threads have a 1/2/3-dimensional identifier: threadIdx.x, threadIdx.y, threadIdx.z
- Blocks are organized in GRIDS:
  - The number of thread blocks in a grid is usually dictated by the size of the data being processed or by the number of processors in the system, which it can greatly exceed.
  - A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice.
  - Blocks have a 1/2-dimensional identifier: blockIdx.x, blockIdx.y
  - Blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores, giving automatic scalability (an indexing sketch using these identifiers follows this list).
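As an illustration of this indexing scheme (a sketch of my own, not from the slides; the kernel name scale and the matrix layout are assumptions), each thread combines blockIdx, blockDim and threadIdx to find the matrix element it owns:

__global__ void scale( float *m, int width, int height, float k ) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    if (col < width && row < height)                   // guard threads of partial blocks
        m[row * width + col] *= k;
}

// launched, for example, with 16x16 blocks covering the whole matrix:
//   dim3 dimBlock(16,16);
//   dim3 dimGrid((width+15)/16, (height+15)/16);
//   scale<<<dimGrid, dimBlock>>>(d_m, width, height, 2.0f);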
The thread hierarchy (top-down)
- A GRID is a piece of work that can be executed by the GPU: a 2D array of BLOCKS.
- A BLOCK is an independent sub-piece of work that can be executed in any order by a Streaming Multiprocessor (SM): a 3D array of threads. The maximum number of threads in a block depends on the hardware.
- A THREAD is the minimal unit of work. All the threads execute the same KERNEL function. Threads are grouped in WARPS of 32 for scheduling, i.e. warps are the minimal units of scheduling.
Fake “Hello World!” example

__global__ void kernel( void ) { }

- __global__ declares a kernel function to be run on the GPU
- kernel<<<2,3>>>(); runs 2 blocks of 3 threads each, all executing the function kernel
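A runnable variant of this example (my own sketch; it assumes a device of compute capability 2.0 or higher, where printf can be called from device code):

#include <cstdio>

__global__ void kernel( void ) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main( void ) {
    kernel<<<2,3>>>();           // 2 blocks of 3 threads each
    cudaDeviceSynchronize();     // kernel launches are asynchronous: wait for completion
    return 0;
}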
Fake “Hello World!” example

__global__ void kernel( void ) { }

int main( void ) {
    // set the grid and block sizes
    dim3 dimGrid(3,1);      // a 3x1 array of blocks
    dim3 dimBlock(2,2,2);   // a 2x2x2 array of threads
    kernel<<<dimGrid, dimBlock>>>();
    return 0;
}

- dim3 is the data type used to declare grid and block sizes
Memory
- Global memory is the on-board device memory:
  - data transfers occur between host memory and global memory
  - it is accessible by any thread
  - access is (relatively) costly
  - constant memory (64 KB) supports read-only access by the GPU with short latency
- Shared memory is shared by the threads in the same block: it provides fast access but is very limited in size
- Registers are private to threads
[Memory hierarchy figure: host memory; per-grid global memory; per-block shared memory; per-thread registers]
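A sketch of how variables are bound to these memory spaces (my own example; the names coeff, globalBuf, tile and the sizes are assumptions):

__constant__ float coeff[64];         // constant memory: read-only for kernels (64 KB total)
__device__ float globalBuf[128];      // global memory: visible to every thread

__global__ void useSpaces( float *out ) {
    __shared__ float tile[128];       // shared memory: one copy per block
    int i = threadIdx.x;              // automatic variable: lives in a register
    tile[i] = globalBuf[i] * coeff[i % 64];   // assumes blockDim.x <= 128
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = tile[i];
}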
Memory management functions
- cudaMalloc(): allocates memory in the device global memory. Parameters: address of a pointer to the allocated object, size of the allocated object.
- cudaFree(): frees memory from the device global memory. Parameter: pointer to the freed object.
- cudaMemcpy(): memory data transfer. Parameters: pointer to destination, pointer to source, number of bytes copied, type of transfer (host to host, host to device, device to host, or device to device). Asynchronous transfers are also possible.
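A minimal sketch of the round trip these functions enable (my own code; the buffer names and size are assumptions):

int n = 256, size = n * sizeof(float);
float *h_data = (float*)malloc(size);      // host buffer
float *d_data;
cudaMalloc((void**)&d_data, size);         // allocate device global memory
// ... fill h_data ...
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // host -> device
// ... launch kernels operating on d_data ...
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_data);                          // release device memory
free(h_data);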
Example: unmangling “Hello World!”

// the input string is stored "mangled" (each character shifted by its index);
// the kernel recovers the desired output by shifting each character back
__global__ void unmangle( char *str ) {
    // position in the block/grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // unmangle output
    str[idx] -= idx;
}

// host side: str is the host buffer holding the mangled string, size its length in bytes
char *d_str;
cudaMalloc((void**)&d_str, size);
cudaMemcpy(d_str, str, size, cudaMemcpyHostToDevice);
dim3 dimGrid(3);    // 3 blocks
dim3 dimBlock(4);   // 4 threads per block: 12 threads, one per character
unmangle<<<dimGrid, dimBlock>>>(d_str);
cudaMemcpy(str, d_str, size, cudaMemcpyDeviceToHost);
Function declaration
- __global__ defines a kernel function
- __device__ defines a function executed on the device: no recursion, no static variables inside the function, no variable number of arguments
- __device__ and __host__ can be used together (see the sketch after this slide)

                                  Executed on    Only callable from
  __device__ float DeviceFunc()   device         device
  __global__ void  KernelFunc()   device         host
  __host__   float HostFunc()     host           host

- Kernel calls are asynchronous
- gridDim.x/y, blockIdx.x/y, blockDim.x/y/z, threadIdx.x/y/z identify threads and blocks in a grid within a kernel function
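As a sketch (mine, not from the slides; square and squareAll are hypothetical names), a function qualified with both __host__ and __device__ is compiled for both sides, while the built-in variables are only meaningful inside device code:

__host__ __device__ float square( float x ) { return x * x; }   // callable from host and device

__global__ void squareAll( float *v, int n ) {
    // the built-in variables identify this thread within the grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;    // total number of threads in the grid
    for (; i < n; i += stride)              // grid-stride loop over the data
        v[i] = square(v[i]);
}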
- Sizes and limitations (compute capability 1.0): warp size = 32; threads per block = 512; warps per SM = 24; blocks per SM = 8; threads per SM = 768
Automatic (Transparent) Scalability
- Do we need to take care of the device computing power (number of SMs)?
- No: a grid contains a set of independent blocks that can be executed in any order, so the block scheduler can re-arrange blocks according to the number of available SMs.
[Figure: the same grid of blocks scheduled over time on devices with different numbers of SMs]
Vector add (1)

__global__ void blockthreadAdd( int *a, int *b, int *c ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
    c[i] = a[i] + b[i];
}
Vector add (2)

int size = N*sizeof(int);                 // memory size
// allocate memory on host and device
int *a = (int*)malloc(size), *b = (int*)malloc(size), *c = (int*)malloc(size);
int *dev_a, *dev_b, *dev_c;
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);
// initialize the input vectors
for (int i = 0; i < N; i++) { a[i] = rand() % 1000; b[i] = rand() % 1000; }
// copy the inputs to the device
cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
// launch N/8 blocks of 8 threads each
blockthreadAdd<<<N/8,8>>>(dev_a, dev_b, dev_c);
// copy result on host
cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
- Allocate memory on the device
- Invoke the kernel: the number of threads depends on the input size (a variant that handles sizes not divisible by the block size is sketched below)
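One detail worth noting (my addition, not in the original example): the launch above assumes N is a multiple of the block size. A common variant rounds the number of blocks up and guards the out-of-range threads:

__global__ void blockthreadAdd( int *a, int *b, int *c, int n ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard threads beyond the end of the vectors
        c[i] = a[i] + b[i];
}

// host side: round the number of blocks up so that every element is covered
int threads = 8;
int blocks = (N + threads - 1) / threads;
blockthreadAdd<<<blocks, threads>>>(dev_a, dev_b, dev_c, N);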
Compiling
- You must download and install the CUDA SDK
- Compile similarly to g++: nvcc helloworld.cu --gpu-architecture=sm_52
- Make sure to read the nvcc manual and choose the proper architecture
Dot product example
Objective: compute the dot product c = Σ a[i]*b[i], i.e. multiply the vectors a and b element by element and accumulate all the products into the single scalar c.
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    // position in the block/grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // compute the product and store it in block-shared memory
    temp[threadIdx.x] = a[i]*b[i];
    // wait for all threads in the block to compute their product
    __syncthreads();
    // make one thread in the block compute the block sum
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int j = 0; j < THREADS_PER_BLOCK; j++)
            sum += temp[j];
        // increase safely the global current sum
        atomicAdd(c, sum);
    }
}
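A host-side sketch for launching this kernel (my own code; THREADS_PER_BLOCK, N and the buffer names are assumptions):

#define THREADS_PER_BLOCK 256
#define N (2048*2048)

int size = N * sizeof(int);
int *a = (int*)malloc(size), *b = (int*)malloc(size), *c = (int*)malloc(sizeof(int));
int *dev_a, *dev_b, *dev_c;
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, sizeof(int));
// ... fill a and b ...
cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
cudaMemset(dev_c, 0, sizeof(int));        // the kernel accumulates into *dev_c
dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);
cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);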
Atomic operations
- An atomic function performs a read-modify-write operation on a 32-bit or 64-bit word residing in global or shared memory: atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicAnd(), atomicOr(), atomicXor()
- __syncthreads() is a barrier for the threads in the same block, possibly in different warps.
Memory (2)
- Registers hold automatic variables, private to each thread.
Advanced Thread and Memory Management

Fermi Dual Warp Scheduler (DWS)
- The SM schedules threads in groups of 32 parallel threads called warps.
- Each SM features two warp schedulers and two instruction dispatch units.
- The DWS selects two warps and issues one instruction from each warp to a group of sixteen cores.
- The Kepler architecture provides 4 warp schedulers per SM, each able to dispatch 2 independent instructions per cycle.
Thread divergence
- Threads of a warp execute in lockstep: if threads of the same warp take different branches, the warp executes each branch path serially, with the threads not on that path disabled, so divergent code within a warp costs performance.

Parallel sum
[Figure: tree-style parallel sum - at each step half as many "mini-warps" remain active, each adding pairs of partial sums, until a single value is left]
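A sketch of the parallel-sum pattern illustrated above (my own code; it assumes the block's data already resides in data[0..blockDim.x-1] and that blockDim.x is a power of two, at most 256):

__global__ void blockSum( int *data ) {
    __shared__ int temp[256];                 // assumes at most 256 threads per block
    temp[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    // at each step the surviving half of the threads (the "mini-warp") adds in
    // an element from the other half; the remaining threads idle
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            temp[threadIdx.x] += temp[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        data[0] = temp[0];                    // thread 0 writes the block total
}

Keeping the active threads contiguous, as above, limits thread divergence to the tail of the reduction, when fewer than 32 threads remain.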
Prefix sum
- A prefix sum (scan) computes, for each position, the sum of all elements up to that position; it is a common building block of parallel algorithms.
Global Memory
- Accesses to global memory from a warp are split into half-warp requests.
- Each half-warp request is coalesced into a small number of memory "transactions".
- The coalescing rules depend on the device compute capability.
Global Memory
- Compute capability 1.0 and 1.1:
  - The size of the words accessed by the threads must be 4, 8, or 16 bytes.
  - If this size is:
    - 4, all 16 words must lie in the same 64-byte segment;
    - 8, all 16 words must lie in the same 128-byte segment;
    - 16, the first 8 words must lie in the same 128-byte segment and the last 8 words in the following 128-byte segment.
  - Threads must access the words in sequence: the k-th thread in the half-warp must access the k-th word.
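As an illustration (my own example), the first kernel below satisfies the sequential-access rule, while the second, with its stride of two, cannot be coalesced on these devices:

// coalesced: the k-th thread of each half-warp reads the k-th consecutive 4-byte word
__global__ void copyCoalesced( float *out, const float *in ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// not coalesced on compute capability 1.0/1.1: thread k reads word 2k,
// breaking the "k-th thread accesses the k-th word" requirement
__global__ void copyStrided( float *out, const float *in ) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[2 * i];
}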
Global Memory
- Compute capability 1.2 and 1.3: threads can access any words in any order, including the same words, and a single memory transaction is issued for each segment addressed by the half-warp:
  - Find the memory segment that contains the address requested by the active thread with the lowest thread ID. The segment size depends on the size of the words accessed by the threads: 32 bytes for 1-byte words, 64 bytes for 2-byte words, 128 bytes for 4-, 8- and 16-byte words.
  - Find all other active threads whose requested address lies in the same segment.
  - Reduce the transaction size, if possible:
    - if the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes;
    - if the transaction size is 64 bytes and only the lower or upper half is used, reduce the transaction size to 32 bytes.
  - Carry out the transaction and mark the serviced threads as inactive.
  - Repeat until all threads in the half-warp are serviced.
Global Memory
- Compute capability 2.x:
  - Memory accesses are cached.
  - A cache line is 128 bytes and maps to a 128-byte aligned segment in device memory.
  - If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently: two memory requests, one for each half-warp, if the size is 8 bytes; four memory requests, one for each quarter-warp, if the size is 16 bytes.
  - Note that threads can access any words in any order, including the same words.
Shared Memory (2.0)
- 32 memory banks organized so that successive 32-bit words reside in different banks.
- Each bank has a bandwidth of 32 bits per 2 clock cycles.
- 32 adjacent words are accessed in parallel from 32 different memory banks.
- A bank conflict occurs if two threads access different words within the same bank.
- When multiple threads access the same word:
  - a broadcast occurs in case of a read;
  - only one thread performs the write (which one is undefined).
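A sketch (my own) of the two situations: with 32 banks and 4-byte words, stride-1 accesses from a warp hit 32 different banks, while a stride of 32 words sends every thread of the warp to the same bank:

__global__ void bankDemo( float *out ) {
    __shared__ float tile[32][32];
    int tx = threadIdx.x, ty = threadIdx.y;   // launched with a 32x32 block
    tile[ty][tx] = tx + ty;
    __syncthreads();
    // conflict-free: a warp (fixed ty, consecutive tx) reads consecutive words,
    // which fall into 32 different banks
    float row = tile[ty][tx];
    // 32-way bank conflict: the same warp reads words 32 floats apart,
    // which all map to the same bank
    float col = tile[tx][ty];
    out[ty * 32 + tx] = row + col;
}

Padding the array to tile[32][33] is the usual way to break this kind of conflict.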