CUDA Programming on the Tegra X1 (And Debugging)
BY KRISTOFFER ROBIN STOKKE, PROX DYNAMICS
Keep This Under Your Pillow

- CUDA Programmer's Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz4LNR1Gymo
- PTX ISA Reference: http://docs.nvidia.com/cuda/parallel-thread-execution/#axzz4LNR1Gymo
- Tegra X1 Whitepaper: http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf
- Last but not least: CUDA-GDB (you will need it)
- NVPROF, if you care about performance: http://docs.nvidia.com/cuda/profiler-users-guide/#nvprof-overview
Tegra K1 vs. Tegra X1

|                      | Tegra K1                               | Tegra X1                                     |
|----------------------|----------------------------------------|----------------------------------------------|
| High-performance CPU | 4 x Cortex-A15, 2 MB L2 + 32 kB $I/$D  | 4 x Cortex-A57, 2 MB L2 + 48 kB $I, 32 kB $D |
| Low-power CPU        | 1 x Cortex-A15, 512 kB L2 + 32 kB $I/$D | 4 x Cortex-A53, 512 kB L2 + 32 kB $I/$D     |
| Architecture         | Kepler                                 | Maxwell                                      |
| Cores                | 192 cores (one SMX)                    | 256 cores (two SMMs, 128 cores / SMM)        |
| Memory bit width     | 64-bit                                 | 64-bit                                       |
| L2 cache             | 128 kB                                 | 256 kB                                       |
| L1 cache             | 64 kB (shared memory + L1 RO cache)    | 64 kB (dedicated shared memory)              |

From the X1 whitepaper: «L1 cache function has been moved to be shared with the texture cache»
GPU Compute Capability

- A simple way to express functionality.

| Compute Capability | Generation              |
|--------------------|-------------------------|
| 1.x                | Tesla                   |
| 2.x                | Fermi                   |
| 3.x                | Kepler                  |
| 4.x                | (skipped)               |
| 5.x                | Maxwell <- you are here |
| 6.x                | Pascal                  |
| 7.x                | Volta / Turing          |

- Important keywords: half-precision (16-bit) floating point, dynamic parallelism, funnel shift
- CUDA Toolkits: you use 9.0; transparent software functionality
Parallelism in GPUs

- Massive.
- SMM: Maxwell Streaming Multiprocessor
  - Four warp schedulers (WS)
  - 4 x 32 CUDA cores
  - 4 x 8 LD/ST units, ~16k 32-bit registers
  - 4 x 8 Special Function Units (SFUs)
  - 64 kB shared memory
- At every clock cycle, each WS selects an eligible warp (a group of 32 threads) and dispatches two instructions
- All threads in a warp should follow, more or less, the same execution path
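The "same execution path" point can be sketched with two hypothetical kernels (names and the even/odd split are made up for illustration): when lanes of one warp branch differently, the warp runs both paths serially with part of the warp masked off.

```cuda
#include <stdint.h>

// Hypothetical kernel showing warp divergence: even and odd lanes of
// every 32-thread warp take different branches, so each warp executes
// both branch bodies one after the other.
__global__ void divergent(int32_t *out)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx % 2 == 0)
        out[idx] = idx * 2;     // half the warp active here...
    else
        out[idx] = idx + 1;     // ...the other half active here
}

// Branching on a per-warp value instead keeps all 32 lanes together:
__global__ void uniform(int32_t *out)
{
    int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    int warp = idx / 32;        // identical for all lanes in a warp
    out[idx] = (warp % 2 == 0) ? idx * 2 : idx + 1;
}
```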
GPU Memories

- 64-bit RAM interface (behind the EMC, the external memory controller)
- GPU-global L2 cache
  - Acts as a read-only cache
  - Write-back policy?
- SMM-local L1 cache
  - Directly addressable through shared memory
  - Dedicated, pairwise access to L1 texture RO cache
- Registers

(Figure: memory hierarchy from EMC/RAM up through L2, L1, and registers; faster toward the registers)

- A flexible, multi-layered cache hierarchy
  - Improves memory bandwidth
  - WS selects ready (non-stalling) warps
  - Highly programmable
GPU Memories (Continued)

- The CUDA toolkit documentation introduces the following memory spaces and naming conventions:
- Global memory loads: loads from RAM, possibly through caches
- «Local memory»: register spills, code, and other data
  - Resides in RAM or «somewhere» in the cache hierarchy, hopefully in the right place
- «Shared memory»: RW L1 cache shared within a thread block
- «L1 RO cache»: caches global, read-only memory loads
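These spaces can be exercised in one short kernel; a minimal sketch (kernel name, block size, and the block-reversal task are made up) that stages data in shared memory and marks the global input as read-only so loads are eligible for the RO/texture path:

```cuda
#include <stdint.h>

#define BLOCK 256

// Sketch: stage one block's data in shared memory (the RW L1), then
// write it back reversed. const __restrict__ hints that src is
// read-only, making the loads eligible for the L1 RO cache.
__global__ void reverse_block(const uint32_t * __restrict__ src,
                              uint32_t *dst)
{
    __shared__ uint32_t tile[BLOCK];          // shared memory per block

    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    tile[threadIdx.x] = src[idx];             // global load into the tile
    __syncthreads();                          // whole block sees the tile

    // Each thread writes its mirror element within the block.
    dst[idx] = tile[blockDim.x - 1 - threadIdx.x];
}
```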
CUDA Programmer's Perspective

- Schedule blocks of threads (execution configuration syntax)
- WS schedules eligible 32-thread groups (warps) of blocks

From the CUDA Programmer's guide:

```cuda
__global__ void memcpy(uint32_t *src, uint32_t *dst)
{
    ...
}

int main(void)
{
    dim3 block_dim(1024, 1, 1);
    dim3 grid_dim(1024, 1, 1);
    uint32_t *src, *dst;
    //                               shared memory per block
    //                               |  CUDA stream (default 0)
    memcpy<<<grid_dim, block_dim, 0, 0>>>(src, dst);
    ...
}
```
CUDA Programmer's Perspective (Cont.)

```cuda
__global__ void memcpy(uint32_t *src, uint32_t *dst)
{
    int idx;
    idx = threadIdx.x + (blockIdx.x * blockDim.x);
    dst[idx] = src[idx];
    return;
}
```

- Special-purpose registers:
  - threadIdx.[x/y/z] -> thread index coords within the block
  - blockIdx.[x/y/z] -> block index coords within the grid
  - blockDim.[x/y/z] -> block dimension sizes
- In the example:
  - blockDim.x = 1024, blockIdx.x ∈ [0, 1023]
  - Index into contiguous memory
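The same built-ins extend to 2D launches; a hypothetical sketch for an image-like layout (kernel name and dimensions are assumptions, not from the slides):

```cuda
#include <stdint.h>

// Hypothetical 2D variant: one thread per pixel of a width x height image.
__global__ void copy2d(const uint8_t *src, uint8_t *dst,
                       int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < width && y < height)          // guard the ragged edges
        dst[y * width + x] = src[y * width + x];
}

// Launch: round the grid up so every pixel is covered.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// copy2d<<<grid, block>>>(src, dst, width, height);
```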
Synchronisation

- ECS: kernel_symbol_name<<< gridDim, blkDim, shared, stream >>>( __VA_ARGS__ )
- Kernel launches are always asynchronous
  - The executing thread immediately returns
- «Worst» sync: cudaDeviceSynchronize()
  - Blocks until all pending GPU activity is done
  - However, good for debugging / testing purposes
- You should use streams
  - Streams are created with cudaStreamCreate() (+ flags!)
  - Run kernel launches and asynchronous memory copies in streams
  - Sync on streams with cudaStreamSynchronize(stream)
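Putting the stream bullets together, a sketch of the usual pattern (helper name, kernel name, and sizes are assumptions; cudaMemcpyAsync only overlaps fully when the host buffer is pinned, e.g. via cudaHostAlloc):

```cuda
#include <stdint.h>
#include <cuda_runtime.h>

__global__ void memcpy_kernel(uint32_t *src, uint32_t *dst);

// Queue copy-in, kernel, and copy-out on one stream: they execute in
// order on the GPU but asynchronously with respect to the CPU thread.
void run(uint32_t *host_buf, uint32_t *dev_src, uint32_t *dev_dst, size_t n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(dev_src, host_buf, n * sizeof(uint32_t),
                    cudaMemcpyHostToDevice, stream);
    memcpy_kernel<<<n / 1024, 1024, 0, stream>>>(dev_src, dev_dst);
    cudaMemcpyAsync(host_buf, dev_dst, n * sizeof(uint32_t),
                    cudaMemcpyDeviceToHost, stream);

    // Block only on this stream's work, not the whole device.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```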
Other API-Specific Details

- Two APIs:
  - Driver API
  - Runtime API <- use this (http://docs.nvidia.com/cuda/cuda-runtime-api/index.html#axzz4LQaTHKIr)
- Other modules you should have a look at:
  - Device management
  - Error handling
  - Memory management, unified addressing
- CUDA samples: deviceQuery
- CUDA compiler: nvcc
  - Source files with CUDA code (*.cu) are compiled as .cpp files
  - nvcc extracts the CUDA code and passes the rest to the native C++ compiler
When Things Aren't Going Your Way

- cuda-gdb
  - Just like gdb
  - Main advantage: captures error conditions for you
  - But this doesn't mean you can get lazy
- Always check error codes and break on anything != cudaSuccess
  - Make a macro
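One possible shape of such a macro (the name CUDA_CHECK is made up): wrap every runtime call and abort with file/line context on anything other than cudaSuccess.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with a readable message and source location on any CUDA error.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&dev_ptr, bytes));
//   kernel<<<grid, block>>>(dev_ptr);
//   CUDA_CHECK(cudaGetLastError());   // catches kernel launch errors too
```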
GPU Performance Analysis

- CUPTI: GPU Hardware Performance Counters (HPCs)
- Usage: nvprof -e <event counters> -m <metrics> <binary> <arguments>
  - Summary modes, counter collection modes...
  - Tells you about resource usage: time, memory, floating-point performance, elapsed cycles
  - Takes time to profile: be patient, or use ./c63enc -f 5 (make sure to trigger ME & MC)
- Check HPC availability with nvprof --query-events --query-metrics
  - Notice there are well above 100 HPCs to choose from...
  - ...which ones matter?
  - I will tell you! :)
GPU Performance Analysis (Continued)

- Memory usage:
  - l1_global_load_hit, l1_local_{store/load}_hit, l1_shared_{store/load}_transactions, shared_efficiency
- Instructions:
  - inst_integer, inst_bit_convert, inst_control, inst_misc, inst_fp_{16/32/64}
- Causes of stalling:
  - Memory, instruction dependencies, sync...
- Other:
  - elapsed_cycles_sm
- These are for the TK1, but should be at least similar for the TX1
- Don't get confused by HPCs such as {gld/gst}_throughput