TRANSCRIPT
Visualization in Distributed Systems (Vizualizarea in sisteme distribuite)
S.l. Dr. ing. Simona Caraiman
Master SDTW, Year II, 2011 - 2012
VSD - Lecture 10-11
GPU Programming (IV)
CUDA: Advanced Topics
Textures in CUDA
A texture is an object for reading data. Benefits:
- data is cached (optimized for 2D locality)
- filtering: linear / bilinear / trilinear, with dedicated hardware
- wrap modes for out-of-bounds addresses: clamp to edge / repeat
- addressable in 1D, 2D or 3D, using integer or normalized coordinates
Usage: CPU code binds data to a texture object; a kernel reads the data by calling a fetch function.
Textures in CUDA
Texture Addressing
Textures in CUDA
Two texture types:
- Bound to linear memory: a global memory address is bound to a texture; 1D only, integer addressing, no filtering, no addressing modes.
- Bound to CUDA arrays: a CUDA array is bound to a texture; 1D, 2D or 3D, float addressing (size-based or normalized), filtering, addressing modes (clamp, repeat).
CUDA Texturing Steps
Host (CPU) code:
- allocate/obtain memory (global linear memory, or a CUDA array)
- create a texture reference object
- bind the texture reference to the memory/array
- when done: unbind the texture reference, free resources
Device (kernel) code:
- fetch using the texture reference
- linear memory textures: tex1Dfetch()
- array textures: tex1D(), tex2D() or tex3D()
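A minimal sketch of these steps for a linear-memory texture, using the legacy texture reference API of this CUDA generation (the kernel, buffer names and launch configuration are illustrative):

    // file-scope texture reference
    texture<float, 1, cudaReadModeElementType> tex;

    __global__ void copy_kernel(float *out, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(tex, i);   // fetch through the texture cache
    }

    void host_side(float *d_in, float *d_out, int n)
    {
        cudaBindTexture(0, tex, d_in, n * sizeof(float));  // bind linear memory
        copy_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaUnbindTexture(tex);                            // unbind when done
    }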
Atomics
Problem: how do you do global communication?
- finish a grid and start a new one
- finish a kernel and start a new one
- all writes from all threads complete before a kernel finishes

    step1(...); // The system ensures that all
                // writes from step1 complete.
    step2(...);
Atomics
Global Communication
- Would need to decompose kernels into before and after parts
- Or, write to a predefined memory location
- Race condition! Updates can be lost:

    // vector[0] was equal to 0
    Thread 0:           Thread 1917:
    vector[0] += 5;     vector[0] += 1;
    ...                 ...
    a = vector[0];      a = vector[0];

What is the value of a in thread 0?
What is the value of a in thread 1917?
Atomics
Race conditions
- Thread 0 could have finished execution before thread 1917 started
- Or the other way around
- Or both are executing at the same time
Answer: not defined by the programming model; the result can be arbitrary.
CUDA provides atomic operations to deal with this problem.
Atomics
An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes.
The name atomic comes from the fact that it is uninterruptible: no dropped updates, but ordering is still arbitrary.
Different types of atomic instructions:
atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
More types on Fermi.
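As an illustration of the CAS primitive, the classic compare-and-swap loop below sketches how a read-modify-write that the hardware lacks (e.g., a float add on pre-Fermi GPUs) can be built from atomicCAS; the helper name is ours:

    __device__ float atomicAddFloat(float *addr, float val)
    {
        // Re-read, compute, and attempt to swap until no other
        // thread has modified *addr in between.
        int *addr_as_int = (int *)addr;
        int old = *addr_as_int, assumed;
        do {
            assumed = old;
            old = atomicCAS(addr_as_int, assumed,
                            __float_as_int(val + __int_as_float(assumed)));
        } while (assumed != old);
        return __int_as_float(old);
    }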
Atomics
Example: Histogram

    // Determine frequency of colors in a picture.
    // Colors have already been converted into ints.
    // Each thread looks at one pixel and increments
    // a counter atomically.
    __global__ void histogram(int *colors, int *buckets)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int c = colors[i];
        atomicAdd(&buckets[c], 1);
    }
Atomics
Example: Workqueue

    // For algorithms where the amount of work per item
    // is highly non-uniform, it often makes sense
    // to continuously grab work from a queue.
    __global__ void workq(int *work_q, unsigned int *q_counter,
                          int *output, unsigned int queue_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        // atomicInc operates on unsigned counters
        unsigned int q_index = atomicInc(q_counter, queue_max);
        int result = do_work(work_q[q_index]);
        output[i] = result;
    }
Atomics
Atomics are slower than normal loads/stores.
You can have the whole machine queuing on a single location in memory.
Atomics were unavailable on G80!
Atomics
Example: Global Min/Max (Naive)

    // If you require the maximum across all threads
    // in a grid, you could do it with a single
    // global maximum value, but it will be VERY slow.
    __global__ void global_max(int *values, int *gl_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int val = values[i];
        atomicMax(gl_max, val);
    }
Atomics
Example: Global Min/Max (Better)

    // Introduce intermediate maximum results, so that
    // most threads do not try to update the global max.
    __global__ void global_max(int *values, int *max,
                               int *regional_maxes,
                               int num_regions)
    {
        // i and val as before
        int region = i % num_regions;
        if (atomicMax(&regional_maxes[region], val) < val)
        {
            atomicMax(max, val);
        }
    }
Atomics
Global Min/Max
- A single value causes a serial bottleneck.
- Create a hierarchy of values for more parallelism.
- Performance will still be slow, so use judiciously.
Performance optimization
Overview
Memory Optimizations
Execution Configuration Optimizations
Examples
Performance optimization - Overview
Optimize algorithms for the GPU:
- maximize independent parallelism
- maximize arithmetic intensity (math/bandwidth)
- sometimes it's better to recompute than to cache: the GPU spends its transistors on ALUs, not memory
- do more computation on the GPU to avoid costly data transfers: even low-parallelism computations can sometimes be faster than transferring back and forth to the host
Performance optimization - Overview
Optimize memory access:
- coalesced vs. non-coalesced access to global/local device memory = an order of magnitude difference
- optimize for spatial locality in cached texture memory
- in shared memory, avoid high-degree bank conflicts
Performance optimization - Overview
Take advantage of shared memory:
- hundreds of times faster than global memory
- threads can cooperate via shared memory
- use one / a few threads to load / compute data shared by all threads
- use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing
Performance optimization - Overview
Use parallelism efficiently:
- partition your computation to keep the GPU multiprocessors equally busy: many threads, many thread blocks
- keep resource usage (registers, shared memory) low enough to support multiple active thread blocks per multiprocessor
Memory optimizations
The global, constant and texture spaces are regions of device memory.
Each multiprocessor has:
- a set of 32-bit registers per processor
- on-chip shared memory, where the shared memory space resides
- a read-only constant cache, to speed up access to the constant memory space
- a read-only texture cache, to speed up access to the texture memory space
Memory optimizations
Optimizing host-device data transfers
Coalescing global data accesses
Using shared memory effectively
Memory optimizations
Host-Device Data Transfers
- Host-device memory bandwidth is much lower than device-to-device memory bandwidth: 4 GB/s peak (PCIe x16 Gen 1) vs. 76 GB/s peak (Tesla C870).
- Minimize transfers: intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory.
- Group transfers: one large transfer is much better than many small ones (see the sketch below).
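A hedged sketch of the grouping advice (buffer names and sizes are illustrative): packing many small chunks into one staging buffer replaces many small copies with a single large one.

    #include <cstring>
    #include <cuda_runtime.h>

    // h_small[k] are n_arrays host chunks of chunk_bytes each.
    void grouped_upload(char **h_small, char *h_staging, char *d_staging,
                        int n_arrays, size_t chunk_bytes)
    {
        // Pack the chunks contiguously on the host ...
        for (int k = 0; k < n_arrays; ++k)
            memcpy(h_staging + k * chunk_bytes, h_small[k], chunk_bytes);
        // ... then issue one large transfer instead of n_arrays small ones.
        cudaMemcpy(d_staging, h_staging, n_arrays * chunk_bytes,
                   cudaMemcpyHostToDevice);
    }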
Memory optimizations
Global and shared memory
- Global memory is not cached on G8x GPUs: high latency, but launching more threads hides the latency; important to minimize accesses and to coalesce global memory accesses.
- Shared memory is on-chip: very high bandwidth, low latency, like a user-managed per-multiprocessor cache; try to minimize or avoid bank conflicts.
Memory optimizations
Texture and Constant Memory
- The texture partition is cached: it uses the texture cache (also used for graphics), optimized for 2D spatial locality; best performance when the threads of a warp read locations that are close together in 2D.
- Constant memory is cached: 4 cycles per address read within a single warp
  - total cost 4 cycles if all threads in a warp read the same address
  - total cost 64 cycles if all threads read different addresses
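A minimal sketch of declaring and filling constant memory (names are illustrative); every thread of a warp reading coeffs[0] hits the 4-cycle broadcast case above:

    __constant__ float coeffs[16];        // lives in cached constant memory

    __global__ void scale(float *data, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)
            data[i] *= coeffs[0];         // all threads read the same address
    }

    void setup(const float *h_coeffs)
    {
        // constant memory is written from the host via cudaMemcpyToSymbol
        cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
    }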
Memory optimizations
Global Memory Reads/Writes
- global memory is not cached on G8x
- highest-latency instructions: 400-600 clock cycles
- likely to be a performance bottleneck
- optimizations can greatly increase performance
Memory optimizations
Coalescing: a coordinated read by a half-warp (16 threads) from a contiguous region of global memory:
- 64 bytes: each thread reads a word (int, float, ...)
- 128 bytes: each thread reads a double-word (int2, float2, ...)
- 256 bytes: each thread reads a quad-word (int4, float4, ...)
Additional restrictions:
- the starting address for a region must be a multiple of the region size
- the k-th thread in a half-warp must access the k-th element in a block being read
- exception: not all threads need to participate (see the sketch below)
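A hedged sketch contrasting a pattern that satisfies these rules with one that does not (the kernels are illustrative):

    // Coalesced: thread k of each half-warp reads element k of a
    // contiguous, properly aligned block of floats.
    __global__ void read_coalesced(float *out, const float *in)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        out[i] = in[i];
    }

    // Uncoalesced: a strided pattern breaks the
    // "k-th thread reads the k-th element" rule.
    __global__ void read_strided(float *out, const float *in, int stride)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        out[i] = in[i * stride];
    }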
Memory optimizations
Coalesced Access: Reading floats
Memory optimizations
Uncoalesced Access: Reading floats
Memory optimizations
Coalescing: Timing results
Experiment: kernel reads a float, increments it, writes it back; 3M floats (12 MB); times averaged over 10K runs.
12K blocks x 256 threads:
- 356 µs: coalesced
- 357 µs: coalesced, some threads don't participate
- 3,494 µs: permuted/misaligned thread access
Memory optimizations
Shared Memory
- hundreds of times faster than global memory
- cache data to reduce global memory accesses
- threads can cooperate via shared memory
- use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing
Memory optimizations
Example: thread-local variables

    // Motivate per-thread variables with a
    // Ten Nearest Neighbors application.
    __global__ void ten_nn(float2 *result, float2 *ps, float2 *qs,
                           size_t num_qs)
    {
        // p goes in a register
        float2 p = ps[threadIdx.x];
        // the per-thread heap goes in off-chip (local) memory
        float2 heap[10];
        // read through num_qs points, maintaining
        // the nearest 10 qs to p in the heap
        ...
        // write out the contents of heap to result
        ...
    }
Memory optimizations
Example: shared variables

    // Motivate shared variables with an
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0)
        {
            // each thread loads two elements from global memory
            int x_i = input[i];
            int x_i_minus_one = input[i-1];
            result[i] = x_i - x_i_minus_one;
        }
    }
Memory optimizations
Example: shared variables

    // Same kernel as before:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0)
        {
            // what are the bandwidth requirements of this kernel?
            int x_i = input[i];              // two loads
            int x_i_minus_one = input[i-1];  // per thread
            result[i] = x_i - x_i_minus_one;
        }
    }
Memory optimizations
Example: shared variables

    // Same kernel as before:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0)
        {
            // How many times does this kernel load input[i]?
            int x_i = input[i];              // once by thread i
            int x_i_minus_one = input[i-1];  // again by thread i+1
            result[i] = x_i - x_i_minus_one;
        }
    }
Memory optimizations
Example: shared variables

    // Same kernel as before:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i > 0)
        {
            // Idea: eliminate redundancy by sharing data
            int x_i = input[i];
            int x_i_minus_one = input[i-1];
            result[i] = x_i - x_i_minus_one;
        }
    }
Memory optimizations
Example: shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        // shorthand for threadIdx.x
        int tx = threadIdx.x;
        // allocate a __shared__ array, one element per thread
        __shared__ int s_data[BLOCK_SIZE];
        // each thread reads one element to s_data
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];
        // avoid race condition: ensure all loads
        // complete before continuing
        __syncthreads();
        ...
    }
Memory optimizations
Example: shared variables

    // optimized version of adjacent difference (continued)
    __global__ void adj_diff(int *result, int *input)
    {
        ...
        if (tx > 0)
            result[i] = s_data[tx] - s_data[tx-1];
        else if (i > 0)
        {
            // handle the thread block boundary
            result[i] = s_data[tx] - input[i-1];
        }
    }
Memory optimizations
Example: shared variables

    // when the size of the array isn't known at compile time...
    __global__ void adj_diff(int *result, int *input)
    {
        // use extern to indicate that a __shared__ array will be
        // allocated dynamically at kernel launch time
        extern __shared__ int s_data[];
        ...
    }

    // pass the size of the per-block array, in bytes, as the third
    // argument to the triple chevrons
    adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);
Execution Configuration Optimizations
Occupancy
- Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy.
- Occupancy = the number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently.
- Occupancy is limited by resource usage: registers, shared memory (see the sketch below).
Execution Configuration Optimizations
Grid/Block Size Heuristics
- # of blocks > # of multiprocessors, so all multiprocessors have at least one block to execute.
- # of blocks / # of multiprocessors > 2: multiple blocks can run concurrently on a multiprocessor, and blocks that aren't waiting at a __syncthreads() keep the hardware busy; subject to resource availability (registers, shared memory).
- # of blocks > 100 to scale to future devices: blocks are executed in pipeline fashion, and 1,000 blocks per grid will scale across multiple generations (see the sketch below).
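A hedged host-side sketch of applying these heuristics by querying the device (the factor of 2 blocks per multiprocessor follows the rule above):

    #include <cuda_runtime.h>

    int pick_num_blocks(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // at least 2 blocks per multiprocessor to keep every SM busy
        return 2 * prop.multiProcessorCount;
    }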
Execution Configuration Optimizations
Register Dependency
- Read-after-write register dependency: an instruction's result can be read ~11 cycles later.
- To completely hide the latency: run at least 192 threads (6 warps) per multiprocessor, i.e. at least 25% occupancy.
- The threads don't have to belong to the same thread block.
Execution Configuration Optimizations
Register Pressure
- Hide latency by using more threads per SM.
- Limiting factors:
  - number of registers per kernel: 8192 per SM, partitioned among concurrent threads
  - amount of shared memory: 16 KB per SM, partitioned among concurrent thread blocks
- Compile with the --ptxas-options=-v flag.
- Use the --maxrregcount=N flag to NVCC (N = desired maximum registers per kernel); at some point spilling into local memory (LMEM) may occur, which reduces performance because LMEM is slow (see the example below).
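For example (the file name and register cap are illustrative):

    nvcc --ptxas-options=-v kernel.cu     (prints each kernel's register and shared memory usage)
    nvcc --maxrregcount=32 kernel.cu      (caps register usage at 32 per thread)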
Execution Configuration Optimizations
Determining resource usage
- Compile the kernel code with the -cubin flag to determine register usage.
- Open the .cubin file with a text editor and look for the code section.
Execution Configuration Optimizations
Optimizing threads per block
- Choose threads per block as a multiple of the warp size: avoid wasting computation on under-populated warps.
- More threads per block == better memory latency hiding.
- But more threads per block == fewer registers per thread: kernel invocations can fail if too many registers are used.
Heuristics:
- minimum: 64 threads per block, and only if there are multiple concurrent blocks
- 192 or 256 threads is usually a better choice, and usually still leaves enough registers to compile and invoke successfully
- this all depends on your computation, so experiment!
Execution Configuration Optimizations
Occupancy != Performance
Increasing occupancy does not necessarily increase performance.
BUT low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels: it all comes down to arithmetic intensity and available parallelism.
Execution Configuration Optimizations
Parameterize your application
- Parameterization helps adaptation to different GPUs.
- GPUs vary in many ways: # of multiprocessors, memory bandwidth, shared memory size, register file size, max. threads per block.
- You can even make apps self-tuning: an experiment mode discovers and saves the optimal configuration.
A Common Programming Strategy
Global memory resides in device memory (DRAM): much slower access than shared memory.
Tile data to take advantage of fast shared memory:
- generalize from the adjacent_difference example
- divide and conquer
A Common Programming Strategy
Partition data into subsets that fit into shared memory.
A Common Programming Strategy
Handle each data subset with one thread block.
A Common Programming Strategy
Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism.
A Common Programming Strategy
Perform the computation on the subset from shared memory.
A Common Programming Strategy
Copy the result from shared memory back to global memory.
A Common Programming Strategy
Carefully partition data according to access patterns:
- read-only → __constant__ memory (fast)
- R/W & shared within block → __shared__ memory (fast)
- R/W within each thread → registers (fast)
- indexed R/W within each thread → local memory (slow)
- R/W inputs/results → cudaMalloc'ed global memory (slow)
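A hedged sketch mapping each row of this list to a declaration (all names are illustrative):

    __constant__ float params[64];      // read-only -> constant memory

    __global__ void kernel(float *in, float *out)  // inputs/results -> global memory
    {
        __shared__ float tile[256];     // R/W, shared within the block
        float acc = 0.0f;               // per-thread scalar -> register
        float scratch[32];              // indexed per-thread array -> local memory
        ...
    }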
Communication Through Memory
Question:

    __global__ void race(void)
    {
        __shared__ int my_shared_variable;
        my_shared_variable = threadIdx.x;
        // what is the value of
        // my_shared_variable?
    }
Communication Through Memory
- This is a race condition.
- The result is undefined.
- The order in which threads access the variable is undefined without explicit coordination.
- Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics.
Communication Through Memory
Use __syncthreads to ensure data is ready for access:

    __global__ void share_data(int *input)
    {
        __shared__ int data[BLOCK_SIZE];
        data[threadIdx.x] = input[threadIdx.x];
        __syncthreads();
        // the state of the entire data array
        // is now well-defined for all threads
        // in this block
    }
Communication Through Memory
Use atomic operations to ensure exclusive access to a variable:

    // assume *result is initialized to 0
    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
        // after this kernel exits, the value of
        // *result will be the sum of the input
    }
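A hedged host-side sketch of the setup the first comment assumes (names and sizes are illustrative; the kernel indexes only threadIdx.x, so a single block of N threads is launched):

    int *d_input, *d_result;
    cudaMalloc(&d_input, N * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemset(d_result, 0, sizeof(int));  // *result starts at 0
    sum<<<1, N>>>(d_input, d_result);      // one block of N threads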
Resource Contention
Atomic operations aren't cheap! They imply serialized access to a variable.

    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
    }
    ...
    // how many threads will contend
    // for exclusive access to result?
    sum<<<B, N/B>>>(input, result);
Hierarchical Atomics
Divide & Conquer:
- per-thread atomicAdd to a __shared__ partial sum
- per-block atomicAdd to the total sum
Hierarchical Atomics
    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;
        // thread 0 is responsible for
        // initializing partial_sum
        if (threadIdx.x == 0)
            partial_sum = 0;
        __syncthreads();
        ...
    }
Hierarchical Atomics
    __global__ void sum(int *input, int *result)
    {
        ...
        // each thread updates the partial sum
        atomicAdd(&partial_sum, input[threadIdx.x]);
        __syncthreads();
        // thread 0 updates the total sum
        if (threadIdx.x == 0)
            atomicAdd(result, partial_sum);
    }
Advice
- Use barriers such as __syncthreads to wait until __shared__ data is ready.
- Prefer barriers to atomics when data access patterns are regular or predictable.
- Prefer atomics to barriers when data access patterns are sparse or unpredictable.
- Atomics to __shared__ variables are much faster than atomics to global variables.
- Don't synchronize or serialize unnecessarily.
Matrix Multiplication Example
- Generalize the adjacent_difference example.
- AB = A * B: each element AB[i][j] = dot(row(A,i), col(B,j)).
- Parallelization strategy: one thread per AB[i][j] element; 2D kernel.
First Implementation

    __global__ void mat_mul(float *a, float *b,
                            float *ab, int width)
    {
        // calculate the row & col index of the element
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        int col = blockIdx.x*blockDim.x + threadIdx.x;
        float result = 0;
        // do dot product between row of a and col of b
        for(int k = 0; k < width; ++k)
            result += a[row*width+k] * b[k*width+col];
        ab[row*width+col] = result;
    }
How will this perform?
- How many loads per term of the dot product? 2 (a & b) = 8 bytes.
- How many floating point operations? 2 (multiply & add).
- Global memory access to flop ratio (GMAC): 8 bytes / 2 ops = 4 B/op.
- Peak fp performance of a GeForce GTX 260? 805 GFLOPS.
- Lower bound on bandwidth required to reach peak fp performance: GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s.
- Actual memory bandwidth of a GeForce GTX 260? 112 GB/s.
- Upper bound on the performance of our implementation: actual BW / GMAC = 112 / 4 = 28 GFLOPS.
Idea: use __shared__ memory to reuse global data
- Each input element is read by `width` threads.
- Load each element into __shared__ memory and have several threads use the local version, to reduce the required memory bandwidth.
Tiled Multiply
- Partition the kernel loop into phases.
- Load a tile of both matrices into __shared__ memory each phase.
- In each phase, each thread computes a partial result.
Better Implementation

    __global__ void mat_mul(float *a, float *b,
                            float *ab, int width)
    {
        // shorthand
        int tx = threadIdx.x, ty = threadIdx.y;
        int bx = blockIdx.x, by = blockIdx.y;
        // allocate tiles in __shared__ memory
        __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
        __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];
        // calculate the row & col index
        int row = by*blockDim.y + ty;
        int col = bx*blockDim.x + tx;
        float result = 0;
Better Implementation (continued)

        // loop over the tiles of the input in phases
        for(int p = 0; p < width/TILE_WIDTH; ++p)
        {
            // collaboratively load tiles into __shared__
            s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
            s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
            __syncthreads();
            // dot product between row of s_a and col of s_b
            for(int k = 0; k < TILE_WIDTH; ++k)
                result += s_a[ty][k] * s_b[k][tx];
            __syncthreads();
        }
        ab[row*width+col] = result;
    }
Use of Barriers in mat_mul
Two barriers per phase:
- __syncthreads after all data is loaded into __shared__ memory
- __syncthreads after all data is read from __shared__ memory
Note that the second __syncthreads in phase p guards the load in phase p+1.
Use barriers to guard data:
- guard against using uninitialized data
- guard against overwriting live data
First Order Size Considerations
- Each thread block should have many threads: TILE_WIDTH = 16 gives 16*16 = 256 threads.
- There should be many thread blocks: 1024x1024 matrices give 64*64 = 4096 thread blocks.
- TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads: full occupancy.
- Each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops.
- Memory bandwidth is no longer the limiting factor (see the launch sketch below).
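A hedged sketch of the 2D launch these numbers assume (pointer names are illustrative; with width = 1024 and TILE_WIDTH = 16 this yields 4096 blocks of 256 threads):

    void launch_mat_mul(float *a, float *b, float *ab, int width)
    {
        dim3 block(TILE_WIDTH, TILE_WIDTH);                 // 16*16 = 256 threads
        dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);  // 64*64 = 4096 blocks
        mat_mul<<<grid, block>>>(a, b, ab, width);
    }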
Optimization Analysis
- Experiment performed on a GT200.
- This optimization was clearly worth the effort.
- Better performance is still possible in theory.

Implementation           Original       Improved
Global loads             2N^3           2N^2 * (N/TILE_WIDTH)
Throughput               10.7 GFLOPS    183.9 GFLOPS
SLOCs                    20             44
Relative improvement     1x             17.2x
Improvement/SLOC         1x             7.8x
TILE_SIZE Effects
Memory Resources as Limit to Parallelism
- Effective use of the different memory resources reduces the number of accesses to global memory.
- These resources are finite! The more memory locations each thread requires, the fewer threads an SM can accommodate.

Resource      Per GT200 SM    For full occupancy on GT200
Registers     16,384          ...
Final Thoughts
- Effective use of the CUDA memory hierarchy decreases bandwidth consumption and increases throughput.
- Use __shared__ memory to eliminate redundant loads from global memory.
- Use __syncthreads barriers to protect __shared__ data.
- Use atomics if access patterns are sparse or unpredictable.
- Optimization comes with a development cost.
- Memory resources ultimately limit parallelism.