    Visualization in Distributed Systems (Vizualizarea in sisteme distribuite)

    S.l. Dr. ing. Simona Caraiman

    Master SDTW, Year II, 2011 - 2012

    VSD - Lecture 10-11

    GPU Programming (IV)

    CUDA Advanced topics

    Textures in CUDA

    Texture is an object for reading data
    Benefits:
      data is cached (optimized for 2D locality)
      filtering: linear / bilinear / trilinear, dedicated hardware
      wrap modes (for out-of-bounds addresses): clamp to edge / repeat
      addressable in 1D, 2D or 3D, using integer or normalized coordinates

    Usage:
      CPU code binds data to a texture object
      kernel reads data by calling a fetch function

    Textures in CUDA

    Texture Addressing

    2011 - 2012Master SDTW an IIVSD Curs 10

  • 7/31/2019 Curs Vsd10 11

    5/78

    Textures in CUDA

    Two Texture Types
      Bound to linear memory
        global memory address is bound to a texture
        only 1D, integer addressing
        no filtering, no addressing modes
      Bound to CUDA arrays
        CUDA array is bound to a texture
        1D, 2D or 3D
        float addressing (size-based or normalized)
        filtering
        addressing modes (clamp, repeat)

    CUDA Texturing Steps

    Host (CPU) code:
      allocate/obtain memory (global linear, or CUDA array)
      create a texture reference object
      bind the texture reference to the memory/array
      when done: unbind the texture reference, free resources

    Device (kernel) code:
      fetch using the texture reference
        linear memory textures: tex1Dfetch()
        array textures: tex1D(), tex2D() or tex3D()
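    A minimal sketch of these steps (not from the slides) using the legacy texture reference API of that CUDA generation; texRef, copy_via_texture and the buffer names are illustrative:

    #include <cuda_runtime.h>

    // texture reference bound to 1D linear global memory (legacy API)
    texture<float, 1, cudaReadModeElementType> texRef;

    __global__ void copy_via_texture(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(texRef, i);   // fetch through the texture cache
    }

    void run(const float *h_in, float *h_out, int n)
    {
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        // bind the texture reference to the linear memory
        cudaBindTexture(0, texRef, d_in, n * sizeof(float));

        copy_via_texture<<<(n + 255) / 256, 256>>>(d_out, n);

        // unbind the texture reference and free resources when done
        cudaUnbindTexture(texRef);
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
    }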

    Atomics

    Problem: How do you do global communication?
      Finish a grid and start a new one
      Finish a kernel and start a new one
      All writes from all threads complete before a kernel finishes

        step1(...);
        // The system ensures that all
        // writes from step1 complete.
        step2(...);

    Atomics

    Global Communication

    Would need to decompose kernels into before and after parts
    Or, write to a predefined memory location
      Race condition! Updates can be lost

      threadId:0                    threadId:1917
      // vector[0] was equal to 0
      vector[0] += 5;               vector[0] += 1;
      ...                           ...
      a = vector[0];                a = vector[0];

    What is the value of a in thread 0?
    What is the value of a in thread 1917?

    Atomics

    Race conditions
      Thread 0 could have finished execution before 1917 started
      Or the other way around
      Or both are executing at the same time
    Answer: not defined by the programming model, can be arbitrary
    CUDA provides atomic operations to deal with this problem

    Atomics

    An atomic operation guarantees that only a single thread has access to a piece of memory while an operation completes
      The name atomic comes from the fact that it is uninterruptible
      No dropped data, but ordering is still arbitrary

    Different types of atomic instructions:
      atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
      More types on Fermi

    Atomics

    Example: Histogram

    // Determine frequency of colors in a picture
    // colors have already been converted into ints
    // Each thread looks at one pixel and increments
    // a counter atomically
    __global__ void histogram(int *colors, int *buckets)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int c = colors[i];
        atomicAdd(&buckets[c], 1);
    }
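    A possible host-side launch for this kernel (sizes and names are illustrative, not from the slides); note that the buckets must be zeroed before the kernel runs, and that this launch assumes num_pixels is a multiple of 256:

    void run_histogram(const int *h_colors, int num_pixels,
                       int num_buckets, int *h_buckets)
    {
        int *d_colors, *d_buckets;
        cudaMalloc(&d_colors,  num_pixels  * sizeof(int));
        cudaMalloc(&d_buckets, num_buckets * sizeof(int));
        cudaMemcpy(d_colors, h_colors, num_pixels * sizeof(int),
                   cudaMemcpyHostToDevice);
        cudaMemset(d_buckets, 0, num_buckets * sizeof(int));  // counters start at 0

        histogram<<<num_pixels / 256, 256>>>(d_colors, d_buckets);

        cudaMemcpy(h_buckets, d_buckets, num_buckets * sizeof(int),
                   cudaMemcpyDeviceToHost);
        cudaFree(d_colors);
        cudaFree(d_buckets);
    }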

    Atomics

    Example: Workqueue

    // For algorithms where the amount of work per item
    // is highly non-uniform, it often makes sense
    // to continuously grab work from a queue
    __global__
    void workq(int *work_q, int *q_counter,
               int *output, int queue_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int q_index = atomicInc(q_counter, queue_max);
        int result = do_work(work_q[q_index]);
        output[i] = result;
    }

    Atomics

    Atomics are slower than normal load/store
      You can have the whole machine queuing on a single location in memory
      Atomics unavailable on G80!

    Atomics

    Example: Global Min/Max (Naive)

    // If you require the maximum across all threads
    // in a grid, you could do it with a single
    // global maximum value, but it will be VERY slow
    __global__
    void global_max(int *values, int *gl_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int val = values[i];
        atomicMax(gl_max, val);
    }

    Atomics

    Example: Global Min/Max (Better)

    // introduce intermediate maximum results, so that
    // most threads do not try to update the global max
    __global__
    void global_max(int *values, int *max,
                    int *regional_maxes,
                    int num_regions)
    {
        // i and val as before
        int region = i % num_regions;
        if(atomicMax(&regional_maxes[region], val) < val)
        {
            atomicMax(max, val);
        }
    }

    Atomics

    Global Min/Max

    Single value causes serial bottleneck

    Create hierarchy of values for more parallelism
    Performance will still be slow, so use judiciously

    Performance optimization

    Overview

    Memory Optimizations

    Execution Configuration Optimizations

    Examples

    Performance optimization - Overview

    Optimize Algorithms for the GPU

    maximize independent parallelism

    maximize arithmetic intensity (math/bandwidth)
      sometimes it's better to recompute than to cache
      GPU spends its transistors on ALUs, not memory
    do more computation on the GPU to avoid costly data transfers
      even low-parallelism computations can sometimes be faster than transferring back and forth to the host

    Performance optimization - Overview

    Optimize Memory Access

    Coalesced vs. non-coalesced = order of magnitude (global/local device memory)
    Optimize for spatial locality in cached texture memory
    In shared memory, avoid high-degree bank conflicts

    Performance optimization - Overview

    Take advantage of shared memory

    hundreds of times faster than global memory

    threads can cooperate via shared memory

    use one / a few threads to load / compute data shared by all threads
    use it to avoid non-coalesced access
      stage loads and stores in shared memory to re-order non-coalesceable addressing

    Performance optimization - Overview

    Use parallelism efficiently

    partition your computation to keep the GPU multiprocessors equally busy
      many threads, many thread blocks
    keep resource usage low enough to support multiple active thread blocks per multiprocessor
      registers, shared memory

    Memory optimizations

    The global, constant and texture spaces are regions of device memory

    Each multiprocessor has:
      a set of 32-bit registers per processor
      on-chip shared memory, where the shared memory space resides
      a read-only constant cache, to speed up access to the constant memory space
      a read-only texture cache, to speed up access to the texture memory space

    Memory optimizations

    Optimizing host-device data transfers

    Coalescing global data accesses

    Using shared memory effectively

    Memory optimizations

    Host-Device Data Transfers

    device-to-host memory bandwidth is much lower than device-to-device memory bandwidth
      4 GB/s peak (PCIe x16 Gen 1) vs. 76 GB/s peak (Tesla C870)
    minimize transfers
      intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
    group transfers
      one large transfer is much better than many small ones
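    A small sketch of the "group transfers" advice (buffer names and layout are illustrative, not from the slides): three small host arrays are packed into one staging buffer so a single cudaMemcpy replaces three tiny ones.

    #include <cstdlib>
    #include <cstring>
    #include <cuda_runtime.h>

    // copy three n-element host arrays with ONE transfer instead of three
    void upload_grouped(const float *a, const float *b, const float *c,
                        int n, float **d_packed)
    {
        size_t bytes = n * sizeof(float);

        // pack on the host...
        float *h_packed = (float *)malloc(3 * bytes);
        memcpy(h_packed,         a, bytes);
        memcpy(h_packed + n,     b, bytes);
        memcpy(h_packed + 2 * n, c, bytes);

        // ...then issue a single large host-to-device copy
        cudaMalloc(d_packed, 3 * bytes);
        cudaMemcpy(*d_packed, h_packed, 3 * bytes, cudaMemcpyHostToDevice);

        free(h_packed);
    }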

    Memory optimizations

    Global and shared memory

    Global memory is not cached on G8x GPUs
      high latency, but launching more threads hides the latency
      important to minimize accesses
      coalesce global memory accesses
    Shared memory is on-chip, very high bandwidth, low latency
      like a user-managed per-multiprocessor cache
      try to minimize or avoid bank conflicts

    Memory optimizations

    Texture and Constant Memory

    Texture partition is cached
      uses the texture cache, also used for graphics
      optimized for 2D spatial locality
      best performance when threads of a warp read locations that are close together in 2D
    Constant memory is cached
      4 cycles per address read within a single warp
      total cost 4 cycles if all threads in a warp read the same address
      total cost 64 cycles if all threads read different addresses

    Memory optimizations

    Global Memory Reads/Writes

    global memory is not cached on G8x

    highest-latency instructions: 400-600 clock cycles
    likely to be a performance bottleneck
    optimizations can greatly increase performance

    Memory optimizations

    Coalescing
      a coordinated read by a half-warp (16 threads)
      a contiguous region of global memory:
        64 bytes - each thread reads a word: int, float, ...
        128 bytes - each thread reads a double-word: int2, float2, ...
        256 bytes - each thread reads a quad-word: int4, float4, ...
      additional restrictions:
        starting address for a region must be a multiple of the region size
        the kth thread in a half-warp must access the kth element in a block being read
        exception: not all threads must be participating
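    A minimal sketch (not from the slides) contrasting the two cases: in the first kernel, thread k of every half-warp reads the kth consecutive float, so the reads coalesce; in the second, a stride scatters each half-warp's accesses across different segments.

    // coalesced: consecutive threads read consecutive floats
    __global__ void read_coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] + 1.0f;
    }

    // uncoalesced: a stride scatters each half-warp's reads
    // across different 64-byte regions
    __global__ void read_strided(const float *in, float *out, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride] + 1.0f;
    }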

    Memory optimizations

    Coalesced Access: Reading floats

    Memory optimizations

    Uncoalesced Access: Reading floats

    Memory optimizations

    Coalescing: Timing results

    Experiment:
      kernel: read a float, increment, write back
      3M floats (12 MB)
      times averaged over 10k runs

    12K blocks x 256 threads:
      356 µs - coalesced
      357 µs - coalesced, some threads don't participate
      3,494 µs - permuted/misaligned thread access

    Memory optimizations

    Shared Memory

    about a hundred times faster than global memory
    cache data to reduce global memory accesses
    threads can cooperate via shared memory
    use it to avoid non-coalesced access
      stage loads and stores in shared memory to re-order non-coalesceable addressing

    Memory optimizations

    Example: thread-local variables

    // motivate per-thread variables with
    // Ten Nearest Neighbors application
    __global__ void ten_nn(float2 *result, float2 *ps, float2 *qs,
                           size_t num_qs)
    {
        // p goes in a register
        float2 p = ps[threadIdx.x];

        // per-thread heap goes in off-chip memory
        float2 heap[10];

        // read through num_qs points, maintaining
        // the nearest 10 qs to p in the heap
        ...

        // write out the contents of heap to result
        ...
    }

    Memory optimizations

    Example: shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if(i > 0)
        {
            // each thread loads two elements from global memory
            int x_i = input[i];
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

    Memory optimizations

    Example: shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if(i > 0)
        {
            // what are the bandwidth requirements of this kernel?
            int x_i = input[i];
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

    Two loads

    Memory optimizations

    Example: shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if(i > 0)
        {
            // How many times does this kernel load input[i]?
            int x_i = input[i];              // once by thread i
            int x_i_minus_one = input[i-1];  // again by thread i+1

            result[i] = x_i - x_i_minus_one;
        }
    }

    Memory optimizations

    Example: shared variables

    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

        if(i > 0)
        {
            // Idea: eliminate redundancy by sharing data
            int x_i = input[i];
            int x_i_minus_one = input[i-1];

            result[i] = x_i - x_i_minus_one;
        }
    }

    Memory optimizations

    Example: shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        // shorthand for threadIdx.x
        int tx = threadIdx.x;

        // allocate a __shared__ array, one element per thread
        __shared__ int s_data[BLOCK_SIZE];

        // each thread reads one element to s_data
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];

        // avoid race condition: ensure all loads
        // complete before continuing
        __syncthreads();

        ...
    }

    Memory optimizations

    Example: shared variables

    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        ...

        if(tx > 0)
            result[i] = s_data[tx] - s_data[tx-1];
        else if(i > 0)
        {
            // handle thread block boundary
            result[i] = s_data[tx] - input[i-1];
        }
    }

    Memory optimizations

    Example: shared variables

    // when the size of the array isn't known at compile time...
    __global__ void adj_diff(int *result, int *input)
    {
        // use extern to indicate a __shared__ array will be
        // allocated dynamically at kernel launch time
        extern __shared__ int s_data[];
        ...
    }

    // pass the size of the per-block array, in bytes, as the third
    // argument to the triple chevrons
    adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);

    Execution Configuration Optimizations

    Occupancy

    Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

    Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently

    Limited by resource usage:
      registers
      shared memory
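    As a worked example using the G80-class numbers that appear later in these slides (24 resident warps, i.e. 768 threads, per multiprocessor): a kernel whose resource usage allows only 6 resident warps (192 threads) per multiprocessor runs at an occupancy of 6 / 24 = 25%.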

    Execution Configuration Optimizations

    Grid/Block Size Heuristics

    # of blocks > # of multiprocessors
      so all multiprocessors have at least one block to execute
    # of blocks / # of multiprocessors > 2
      multiple blocks can run concurrently on a multiprocessor
      blocks that aren't waiting at a __syncthreads() keep the hardware busy
      subject to resource availability: registers, shared memory
    # of blocks > 100 to scale to future devices
      blocks executed in pipeline fashion
      1000 blocks per grid will scale across multiple generations

    Execution Configuration Optimizations

    Register Dependency

    Read-after-write register dependency
      Instruction's result can be read ~11 cycles later
      Scenario: (illustrated in the sketch below)
    To completely hide the latency:
      run at least 192 threads (6 warps) per multiprocessor
      at least 25% occupancy
      threads don't have to belong to the same thread block
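    The scenario on the original slide is not reproduced in this transcript; a minimal illustrative sketch of a read-after-write register dependency:

    __global__ void raw_dependency(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[i] * 2.0f;   // writes register x
        float y = x + 3.0f;       // reads x: stalls ~11 cycles unless
                                  // other warps are available to run
        out[i] = y;
    }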

    Execution Configuration Optimizations

    Register Pressure

    Hide latency by using more threads per SM
    Limiting factors:
      number of registers per kernel
        8192 per SM, partitioned among concurrent threads
      amount of shared memory
        16 KB per SM, partitioned among concurrent thread blocks
    Compile with the --ptxas-options=-v flag
    Use the --maxrregcount=N flag to NVCC
      N = desired maximum registers/kernel
      at some point spilling into LMEM may occur
        reduces performance - LMEM is slow
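    For example, both flags can be given on one nvcc command line (the source file name and the register cap of 32 are illustrative):

        nvcc --ptxas-options=-v --maxrregcount=32 -o mykernel mykernel.cu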

    Execution Configuration Optimizations

    Determining resource usage

    compile the kernel code with the -cubin flag to determine register usage
    open the .cubin file with a text editor and look for the "code" section

    Execution Configuration Optimizations

    Optimizing threads per block

    Choose threads per block as a multiple of the warp size (see the launch sketch below)
      avoid wasting computation on under-populated warps
    More threads per block == better memory latency hiding
    But more threads per block == fewer registers per thread
      kernel invocations can fail if too many registers are used
    Heuristics:
      minimum: 64 threads per block, only if multiple concurrent blocks
      192 or 256 threads a better choice
        usually still enough registers to compile and invoke successfully
      this all depends on your computation, so experiment!
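    A minimal launch-configuration sketch along these lines (my_kernel, n and the device pointers are placeholders, not from the slides):

    // block size: a multiple of the warp size (32); 256 threads = 8 warps
    const int block_size = 256;
    int num_blocks = (n + block_size - 1) / block_size;  // round up to cover n items
    my_kernel<<<num_blocks, block_size>>>(d_in, d_out, n);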

    Execution Configuration Optimizations

    Occupancy != Performance

    Increasing occupancy does not necessarily increase performance

    BUT

    Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
      it all comes down to arithmetic intensity and available parallelism

    Execution Configuration Optimizations

    Parameterize your application

    Parameterization helps adaptation to different GPUs

    GPUs vary in many ways:
      # of multiprocessors
      memory bandwidth
      shared memory size
      register file size
      max. threads per block

    You can even make apps self-tuning
      an experiment mode discovers and saves the optimal configuration
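    One way to parameterize at run time (a sketch, not from the slides) is to query the properties listed above through the CUDA runtime:

    #include <cstdio>
    #include <cuda_runtime.h>

    void print_device_limits(int dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("multiprocessors:       %d\n",  prop.multiProcessorCount);
        printf("shared mem per block:  %zu B\n", prop.sharedMemPerBlock);
        printf("registers per block:   %d\n",  prop.regsPerBlock);
        printf("max threads per block: %d\n",  prop.maxThreadsPerBlock);
    }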

    A Common Programming Strategy

    Global memory resides in device memory (DRAM)
      much slower access than shared memory
    Tile data to take advantage of fast shared memory:
      generalize from the adjacent_difference example
      divide and conquer

    A Common Programming Strategy

    Partition data into subsets that fit into shared memory

    A Common Programming Strategy

    Handle each data subset with one thread block

    A Common Programming Strategy

    Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism

    A Common Programming Strategy

    Perform the computation on the subset from shared memory

    A Common Programming Strategy

    Copy the result from shared memory back to global memory
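    A skeleton of the steps above (illustrative only; the per-element computation and the names are placeholders, and the grid is assumed to exactly cover the array):

    #define BLOCK_SIZE 256

    __global__ void tiled_kernel(const float *in, float *out)
    {
        __shared__ float tile[BLOCK_SIZE];

        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this block's subset

        tile[threadIdx.x] = in[i];       // load the subset into shared memory
        __syncthreads();                 // wait until the whole tile is loaded

        float r = 2.0f * tile[threadIdx.x];   // compute on the subset (example op)

        out[i] = r;                      // copy the result back to global memory
    }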

    A Common Programming Strategy

    Carefully partition data according to access patterns (sketched in the kernel below):
      read-only: __constant__ memory (fast)
      R/W & shared within block: __shared__ memory (fast)
      R/W within each thread: registers (fast)
      indexed R/W within each thread: local memory (slow)
      R/W inputs/results: cudaMalloc'ed global memory (slow)
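    A kernel sketch (illustrative, not from the slides) touching several of the spaces listed above:

    __constant__ float coeff[4];   // read-only constant memory (fast);
                                   // filled from the host with cudaMemcpyToSymbol

    __global__ void classify(const float *in, float *out)  // in/out: global memory (slow)
    {
        __shared__ float tile[256];              // R/W, shared within the block (fast)
        int tx = threadIdx.x;
        int i  = blockIdx.x * blockDim.x + tx;   // assumes the grid exactly covers the array

        tile[tx] = in[i];                        // stage the input in shared memory
        __syncthreads();

        float acc = 0.0f;                        // acc lives in a register (fast)
        for (int k = 0; k < 4; ++k)
            acc += coeff[k] * tile[tx];

        out[i] = acc;                            // write the result back to global memory
    }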

    Communication Through Memory

    Question:

    __global__ void race(void)
    {
        __shared__ int my_shared_variable;
        my_shared_variable = threadIdx.x;

        // what is the value of
        // my_shared_variable?
    }

    Communication Through Memory

    This is a race condition

    The result is undefined

    The order in which threads access the variable is undefined without explicit coordination
    Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics

    Communication Through Memory

    Use __syncthreads to ensure data is ready for access

    __global__ void share_data(int *input)
    {
        __shared__ int data[BLOCK_SIZE];
        data[threadIdx.x] = input[threadIdx.x];
        __syncthreads();

        // the state of the entire data array
        // is now well-defined for all threads
        // in this block
    }

    Communication Through Memory

    Use atomic operations to ensure exclusive access to a variable

    // assume *result is initialized to 0
    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);

        // after this kernel exits, the value of
        // *result will be the sum of the input
    }

    Resource Contention

    Atomic operations aren't cheap!
    They imply serialized access to a variable

    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
    }

    ...

    // how many threads will contend
    // for exclusive access to result?
    sum(input, result);

    Hierarchical Atomics

    Divide & Conquer
      per-thread atomicAdd to a __shared__ partial sum
      per-block atomicAdd to the total sum

    Hierarchical Atomics

    __global__ void sum(int *input, int *result)
    {
        __shared__ int partial_sum;

        // thread 0 is responsible for
        // initializing partial_sum
        if(threadIdx.x == 0)
            partial_sum = 0;
        __syncthreads();

        ...
    }

    Hierarchical Atomics

    __global__ void sum(int *input, int *result)
    {
        ...

        // each thread updates the partial sum
        atomicAdd(&partial_sum, input[threadIdx.x]);
        __syncthreads();

        // thread 0 updates the total sum
        if(threadIdx.x == 0)
            atomicAdd(result, partial_sum);
    }

    Advice

    Use barriers such as __syncthreads to wait until __shared__ data is ready
    Prefer barriers to atomics when data access patterns are regular or predictable
    Prefer atomics to barriers when data access patterns are sparse or unpredictable
    Atomics to __shared__ variables are much faster than atomics to global variables
    Don't synchronize or serialize unnecessarily

    Matrix Multiplication Example

    Generalize the adjacent_difference example

    AB = A * B
    Each element AB_ij = dot(row(A,i), col(B,j))

    Parallelization strategy:
      one thread per AB_ij
      2D kernel

    First Implementation

    __global__ void mat_mul(float *a, float *b,
                            float *ab, int width)
    {
        // calculate the row & col index of the element
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        int col = blockIdx.x*blockDim.x + threadIdx.x;

        float result = 0;

        // do dot product between row of a and col of b
        for(int k = 0; k < width; ++k)
            result += a[row*width+k] * b[k*width+col];

        ab[row*width+col] = result;
    }

    How will this perform?

    How many loads per term of the dot product?
      2 (a & b) = 8 bytes
    How many floating point operations?
      2 (multiply & addition)
    Global memory access to flop ratio (GMAC)
      8 bytes / 2 ops = 4 B/op
    What is the peak fp performance of the GeForce GTX 260?
      805 GFLOPS
    Lower bound on bandwidth required to reach peak fp performance
      GMAC * peak FLOPS = 4 * 805 = 3.2 TB/s
    What is the actual memory bandwidth of the GeForce GTX 260?
      112 GB/s
    Then what is an upper bound on the performance of our implementation?
      Actual BW / GMAC = 112 / 4 = 28 GFLOPS

    Idea: Use __shared__ memory to reuse global data

    Each input element is read by width threads

    Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth

    Tiled Multiply

    Partition the kernel loop into phases

    Load a tile of both matrices into __shared__ memory each phase

    Each phase, each thread computes a partial result

    Better Implementation

    __global__ void mat_mul(float *a, float *b,
                            float *ab, int width)
    {
        // shorthand
        int tx = threadIdx.x, ty = threadIdx.y;
        int bx = blockIdx.x,  by = blockIdx.y;

        // allocate tiles in __shared__ memory
        __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
        __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

        // calculate the row & col index
        int row = by*blockDim.y + ty;
        int col = bx*blockDim.x + tx;

        float result = 0;

    Better Implementation (continued)

        // loop over the tiles of the input in phases
        for(int p = 0; p < width/TILE_WIDTH; ++p)
        {
            // collaboratively load tiles into __shared__
            s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
            s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
            __syncthreads();

            // dot product between row of s_a and col of s_b
            for(int k = 0; k < TILE_WIDTH; ++k)
                result += s_a[ty][k] * s_b[k][tx];
            __syncthreads();
        }

        ab[row*width+col] = result;
    }

    Use of Barriers in mat_mul

    Two barriers per phase:
      __syncthreads after all data is loaded into __shared__ memory
      __syncthreads after all data is read from __shared__ memory
    Note that the second __syncthreads in phase p guards the load in phase p+1
    Use barriers to guard data:
      guard against using uninitialized data
      guard against bashing live data

    First Order Size Considerations

    Each thread block should have many threads
      TILE_WIDTH = 16 gives 16*16 = 256 threads
    There should be many thread blocks
      1024*1024 matrices give 64*64 = 4096 thread blocks
    TILE_WIDTH = 16 gives each SM 3 blocks, 768 threads
      full occupancy
    Each thread block performs 2 * 256 = 512 32-bit loads for 256 * (2 * 16) = 8,192 fp ops
      memory bandwidth is no longer the limiting factor

    Optimization Analysis

    Experiment performed on a GT200

    This optimization was clearly worth the effort

    Better performance still possible in theory

    Implementation           Original        Improved
    Global Loads             2N^3            2N^2 * (N/TILE_WIDTH)
    Throughput               10.7 GFLOPS     183.9 GFLOPS
    SLOCs                    20              44
    Relative Improvement     1x              17.2x
    Improvement/SLOC         1x              7.8x

    TILE_SIZE Effects

    Memory Resources as Limit to Parallelism

    Effective use of the different memory resources reduces the number of accesses to global memory

    These resources are finite!

    The more memory locations each thread requires, the fewer threads an SM can accommodate

    Resource      Per GT200 SM     Full Occupancy on GT200
    Registers     16384

    Final Thoughts

    Effective use of the CUDA memory hierarchy decreases bandwidth consumption and increases throughput
      use __shared__ memory to eliminate redundant loads from global memory
      use __syncthreads barriers to protect __shared__ data
      use atomics if access patterns are sparse or unpredictable

    Optimization comes with a development cost

    Memory resources ultimately limit parallelism