Programming with CUDA, WS 08/09, Lecture 10, Tue, 25 Nov, 2008


Page 1: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Programming with CUDA, WS 08/09

Lecture 10, Tue, 25 Nov, 2008

Page 2: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Previously

Optimizing Instruction Throughput
– Low-throughput instructions
  – Different versions of math functions
  – Type conversions are costly
  – Avoid warp divergence
  – Accessing global memory is expensive
  – Overlap memory ops with math ops

Page 3: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Previously

Optimizing Instruction Throughput
– Optimal use of memory bandwidth
  – Global memory: coalesce accesses
  – Local memory: coalesced automatically
  – Constant memory: cached, cost proportional to #addresses read
  – Texture memory: cached, optimized for 2D spatial locality
  – Shared memory: on chip, fast, but avoid bank conflicts

Page 4: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Today

Optimizing Instruction Throughput
– Optimal use of memory bandwidth
  – Shared memory: on chip, fast, but avoid bank conflicts
  – Registers

Optimizing #Threads per Block
Memory copies
Texture vs. Global vs. Constant
General optimizations

Page 5: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Shared Memory

Bank conflicts
– Shared memory divided into 32-bit modules called banks
– Allow simultaneous reads
– N-way bank conflict if N threads try to read from the same bank
  – Leads to serializing of reads
  – Not necessarily N serial reads
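
As an illustration (not from the slides), the following minimal kernel sketch contrasts a conflict-free access pattern with a 2-way bank conflict. It assumes 16 banks of 32-bit words, i.e. compute capability 1.x hardware where conflicts are resolved per half-warp, and a block size of at most 256 threads.

    // Minimal sketch: conflict-free vs. 2-way conflicted shared memory reads.
    // Assumes 16 banks of 32-bit words (compute capability 1.x, half-warp
    // granularity) and blockDim.x <= 256.
    __global__ void bankAccessDemo(float *out)
    {
        __shared__ float s[512];

        s[threadIdx.x]       = (float)threadIdx.x;   // fill both halves
        s[threadIdx.x + 256] = 0.0f;
        __syncthreads();

        // Conflict-free: thread i reads bank (i mod 16).
        float a = s[threadIdx.x];

        // 2-way conflict: a stride of two words maps threads i and i+8 of a
        // half-warp onto the same bank, so their reads are serialized.
        float b = s[2 * threadIdx.x];

        out[threadIdx.x] = a + b;
    }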

Page 6: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008
Page 7: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008
Page 8: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008
Page 9: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Shared Memory

Bank conflicts
– Broadcast mechanism
  – One word is chosen as a broadcast word
  – Automatically passed to other threads reading from that word
– Cannot control which word is picked as the broadcast word
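
A sketch (illustrative names, not from the slides) of a pattern that benefits from the broadcast mechanism: when all threads of a half-warp read the same 32-bit word, the hardware serves them from the broadcast word instead of serializing the accesses.

    // Sketch: all threads of the block read the same shared word ('pivot');
    // the broadcast mechanism serves these reads without a bank conflict.
    __global__ void broadcastDemo(const float *in, float *out, int n)
    {
        __shared__ float pivot;

        if (threadIdx.x == 0)
            pivot = in[blockIdx.x];          // one thread loads the word
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] - pivot;          // same word read by all threads
    }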

Page 10: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008
Page 11: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Registers

Generally 0 clock cycles
– Time to access registers is included in instruction time
– There could be delays

Page 12: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Registers

Delays may occur due to register memory bank conflicts
– Register memory banks are handled by the compiler and thread scheduler
  – They try to schedule instructions to avoid conflicts
  – This works best with a multiple of 64 threads per block
– The application has no other control

Page 13: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Registers

Delays may occur due to read-after-write dependencies
– May be hidden if each SM has at least 192 active threads

Page 14: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Optimizing #threads per block

2 or more blocks per SM
– A waiting block (thread sync, memory copy) can be overlapped with running blocks

Shared memory per block should be less than half the shared memory per SM

Page 15: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Optimizing #threads per block

A multiple of 32 threads per block fully populates warps
A multiple of 64 threads per block allows the compiler and thread scheduler to avoid register memory bank conflicts

Page 16: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Optimizing #threads per block

More threads per block = fewer registers per kernel
– Compiler option to report memory requirements of a kernel: --ptxas-options=-v
– #registers per device varies with compute capability
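
For example, a compile line of the following form makes ptxas report the per-kernel register and shared/constant memory usage (the file name is illustrative; the exact output wording varies by toolkit version):

    # print per-kernel register / shared / constant memory usage at compile time
    nvcc --ptxas-options=-v -c mykernel.cu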

Page 17: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Optimizing #threads per block

When optimizing, go for a multiple of 64 threads per block
– 192 or 256 recommended

Occupancy of an SM = (#active warps) / (max. active warps)
– Compiler tries to maximize occupancy
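
As a worked example (assuming a compute capability 1.x device that allows at most 768 active threads, i.e. 24 active warps, per SM), three resident blocks of 256 threads give:

    occupancy = (#active warps) / (max. active warps)
              = (3 * 256 / 32) / 24
              = 24 / 24
              = 1.0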

Page 18: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Optimizing Memory Copies

Host mem <=> Device mem
– Low bandwidth
– Higher bandwidth can be achieved using page-locked (pinned) memory
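
A minimal host-side sketch (buffer names are illustrative, N is assumed to be defined elsewhere) of using page-locked memory for the transfer:

    // Sketch: page-locked ("pinned") host memory gives higher transfer
    // bandwidth than pageable memory.
    float *h_buf, *d_buf;
    size_t bytes = N * sizeof(float);

    cudaMallocHost((void**)&h_buf, bytes);   // pinned host allocation
    cudaMalloc((void**)&d_buf, bytes);

    // ... fill h_buf on the host ...

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels ...
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);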

Page 19: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Optimizing Memory Copies

Minimize such transfers
– Move more code to the device, even if it does not fully utilize parallelism
– Create intermediate data structures in device memory
– Group several transfers into one
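
A sketch of the last point (all names are illustrative): pack several small host arrays into one staging buffer and issue a single copy instead of three.

    // Sketch: one large transfer instead of three small ones.
    memcpy(staging,           a, na * sizeof(float));
    memcpy(staging + na,      b, nb * sizeof(float));
    memcpy(staging + na + nb, c, nc * sizeof(float));
    cudaMemcpy(d_staging, staging, (na + nb + nc) * sizeof(float),
               cudaMemcpyHostToDevice);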

Page 20: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Texture fetches vs. reading Global/Constant mem

– Cached, optimized for spatial locality
– No coalescing constraints
– Address calculation latency is better hidden
– Data can be packed
– Optional conversion of integers to normalized floats in [0.0, 1.0] or [-1.0, 1.0]

Page 21: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Texture fetches vs. reading Global/Constant mem

For textures stored in CUDA arrays:
– Filtering
– Normalized texture coordinates
– Addressing modes
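
A sketch using the texture-reference API of this toolkit generation (names are illustrative): bind a 2D CUDA array to a texture reference and fetch through the cached texture path.

    // Sketch: 2D texture fetch through the texture cache.
    texture<float, 2, cudaReadModeElementType> texRef;

    __global__ void readViaTexture(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
    }

    // Host side (error checking omitted):
    // cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    // cudaArray *arr;
    // cudaMallocArray(&arr, &desc, width, height);
    // cudaMemcpyToArray(arr, 0, 0, h_data, bytes, cudaMemcpyHostToDevice);
    // cudaBindTextureToArray(texRef, arr);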

Page 22: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

General Guidelines

– Maximize parallelism
– Maximize memory bandwidth
– Maximize instruction throughput

Page 23: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Maximize Parallelism

Build on data parallelism
– Broken in case of thread dependency
– For threads in the same block:
  – __syncthreads()
  – Share data using shared memory
– For threads in different blocks:
  – Share data using global memory
  – Two kernel calls: first to write data, second to read data
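
A sketch of the two-kernel pattern (kernel and buffer names are illustrative, declarations only): since threads in different blocks cannot synchronize within a single launch, partial results are written to global memory by one kernel and read back by the next.

    // Sketch: inter-block communication via global memory across two launches.
    __global__ void writePartials(const float *in, float *partials, int n);
    __global__ void readPartials(const float *partials, float *out, int numPartials);

    // Launches on the same stream execute in order, so the second kernel
    // only starts after the first has finished writing 'partials'.
    writePartials<<<numBlocks, threadsPerBlock>>>(d_in, d_partials, n);
    readPartials<<<1, threadsPerBlock>>>(d_partials, d_out, numBlocks);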

Page 24: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Maximize Parallelism

– Build on data parallelism
– Choose kernel parameters accordingly
– Clever device use: streams
– Clever host use: async kernels
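
A sketch of the last two points (buffer, kernel, and size names are illustrative; h_in and h_out are assumed to be pinned): two streams let copies and kernel execution overlap, and kernel launches are asynchronous with respect to the host.

    // Sketch: overlap transfers and computation with two streams.
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&s[i]);

    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d_in  + i * half, h_in  + i * half,
                        half * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        work<<<blocks, threads, 0, s[i]>>>(d_in + i * half, d_out + i * half);
        cudaMemcpyAsync(h_out + i * half, d_out + i * half,
                        half * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaThreadSynchronize();   // wait for both streams (pre-CUDA-4.0 name)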

Page 25: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Maximize Memory Bandwidth

– Minimize host <=> device memory copies
– Minimize device <=> device memory data transfers
  – Use shared memory
– Might even be better to not copy at all
  – Just recompute on the device

Page 26: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Maximize Memory Bandwidth

– Organize data for optimal memory access patterns
  – Crucial for accesses to global memory

Page 27: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Maximize Instruction Throughput

– For non-crucial cases, use higher-throughput arithmetic instructions
  – Sacrifice accuracy for performance
  – Replace double with float operations
– Pay attention to warp divergence
  – Try to arrange diverging threads per warp, e.g. branch on (threadIdx.x / warp_size) > n
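
A sketch of the warp-aligned branching idea (the threshold n and the data layout are illustrative): because all 32 threads of a warp share the same warp index, every warp takes exactly one side of the branch and no warp actually diverges.

    // Sketch: the branch condition depends only on the warp index, so threads
    // within a warp never take different paths.
    __global__ void branchPerWarp(float *data, int n)
    {
        int warpId = threadIdx.x / warpSize;   // warpSize is 32

        if (warpId > n)
            data[threadIdx.x] *= 2.0f;         // whole warps take this path
        else
            data[threadIdx.x] += 1.0f;         // whole warps take this path
    }

    // By contrast, a condition like (threadIdx.x % 2) diverges inside every warp.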

Page 28: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

Final Projects

Time-line
– Thu, 20 Nov: Float write-ups on ideas of Jens & Waqar
– Tue, 25 Nov (today): Suggest groups and topics
– Thu, 27 Nov: Groups and topics assigned
– Tue, 2 Dec: Last chance to change groups/topics; groups and topics finalized

Page 29: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

All for today

Next time
– A full-fledged example project

Page 30: Programming with CUDA WS 08/09 Lecture 10 Tue, 25 Nov, 2008

On to exercises!