CUDA DEVELOPER TOOLS: OVERVIEW & EXCITING NEW FEATURES
TRANSCRIPT
Rafael Campana, Software Engineering Director, NVIDIA
AGENDA
• Developer Workflow & Developer Tools portfolio
• Overall updates
• Developer Tool product overviews & new features
DEVELOPER WORKFLOW
• Application Development
• IDE integration
• Debug Gfx API
• Debug CUDA
• System & CUDA Trace
• Graphics Profiling
• CUDA Kernel Profiling
• Gfx GPU crash dump
• CUDA GPU crash dump
DEVELOPER TOOLS PORTFOLIO
• IDE integration: Nsight Eclipse Edition, Nsight Visual Studio Edition
• Debug (Gfx API, CUDA): Nsight Eclipse Edition, cuda-gdb, Nsight Visual Studio Edition, Nsight Compute, cuda-memcheck & compute-sanitizer, Nsight Graphics
• System & CUDA Trace: Nsight Systems
• Profiling: Nsight Compute (CUDA kernel), Nsight Graphics (graphics)
• GPU crash dump (Gfx, CUDA): cuda-gdb/Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Aftermath
OVERALL UPDATES
Chips, platforms support across developer tools
• CUDA 11.0 support
• OS support updates
  • MacOSX host platform only
  • Removal of Windows 7 support
• Chips Update
  • A100 GPU Support
  • Arm SBSA support
NVIDIA® NSIGHT™ ECLIPSE EDITION
Plug-ins enabling CUDA development
• Plug-in to Eclipse
• Eclipse 4.8 to 4.11 support
• Edit, build, debug CUDA-C applications
• CUDA-aware source code editor: syntax highlighting, code completion and inline help
• Debugger: seamless and simultaneous debugging of CPU and GPU code
• NVCC build integration to cross-compile for various target platforms
• Docker support
Documentation: https://docs.nvidia.com/cuda/nsight-eclipse-plugins-guide
NSIGHT VISUAL STUDIO EDITION
VS Extension enabling CUDA development
• Visual Studio 2015, 2017 and 2019 support
• CPU + GPU CUDA Debugging
• Source-correlated assembly debugging (SASS / PTX / SASS+PTX)
• Data breakpoints for CUDA C/C++ code
• Expressions in Locals, Watch and Conditionals
• CUDA info view
• Crash dump
• Warp Watch view in NextGen debugger
• Improved OptiX debugging (-lineinfo)
More info & documentation: https://developer.nvidia.com/nsight-visual-studio-edition
NSIGHT TOOLS INTEGRATION INTO VISUAL STUDIO
Improved Workflow. Launches more powerful, standalone tools within Visual Studio
• Improved workflow for Nsight Systems, Nsight Graphics & Nsight Compute:
  • Settings passed to the standalone tool
  • Quick launch with key bindings
  • Complements Nsight Visual Studio Edition's Debugger
• Works with Visual C++, C#, Visual Basic .NET, F#, and Python projects
• On Visual Studio Marketplace: installation and automatic update notifications
• Supported in Visual Studio 2015, 2017, & 2019
• Tools appear under the Nsight menu (highlighted)
• Visual Studio project settings are transferred to the Nsight tool upon launch
More Info: https://developer.nvidia.com/nsight-tools-visual-studio-integration
CUDA-GDB
Overview of command line debugger
• Command line source and assembly (SASS) level debugger
• Nsight Eclipse Edition debugging backend
• Simultaneous CPU and GPU debugging
• Inspect and modify memory, register, variable state
• Control program execution
• Runtime GPU error detection
• Support for multiple GPUs, multiple contexts, multiple kernels, Thread focus
• Core dump support
Documentation: http://docs.nvidia.com/cuda/cuda-gdb
(cuda-gdb) info cuda threads breakpoint all
BlockIdx ThreadIdx Virtual PC Dev SM Wp Ln Filename Line
Kernel 0
(1,0,0) (0,0,0) 0x0000000000948e58 0 11 0 0 infoCommands.cu 12
(1,0,0) (1,0,0) 0x0000000000948e58 0 11 0 1 infoCommands.cu 12
(1,0,0) (2,0,0) 0x0000000000948e58 0 11 0 2 infoCommands.cu 12
(1,0,0) (3,0,0) 0x0000000000948e58 0 11 0 3 infoCommands.cu 12
(1,0,0) (4,0,0) 0x0000000000948e58 0 11 0 4 infoCommands.cu 12
(1,0,0) (5,0,0) 0x0000000000948e58 0 11 0 5 infoCommands.cu 12
(cuda-gdb) info cuda threads breakpoint 2 lane 1
BlockIdx ThreadIdx Virtual PC Dev SM Wp Ln Filename Line
Kernel 0
(1,0,0) (1,0,0) 0x0000000000948e58 0 11 0 1 infoCommands.cu 12
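The `info cuda threads` listing above prints one row per thread under the columns named in its header (Wp and Ln are the warp and lane indices). As a minimal sketch, assuming every row keeps exactly this whitespace-separated layout with one thread per row, the Python below turns such a listing into records for scripted inspection. `CudaThreadRow` and `parse_info_cuda_threads` are hypothetical helper names for illustration, not part of cuda-gdb.

```python
import re
from typing import List, NamedTuple, Tuple


class CudaThreadRow(NamedTuple):
    """One row of the listing, using the column names from its header."""
    block_idx: Tuple[int, int, int]
    thread_idx: Tuple[int, int, int]
    virtual_pc: int
    device: int
    sm: int
    warp: int
    lane: int
    filename: str
    line: int


# Assumed row layout: (bx,by,bz) (tx,ty,tz) 0xPC dev sm wp ln file line
_ROW = re.compile(
    r"^\*?\s*"
    r"\((\d+),(\d+),(\d+)\)\s+"
    r"\((\d+),(\d+),(\d+)\)\s+"
    r"(0x[0-9a-fA-F]+)\s+"
    r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\d+)\s*$"
)


def parse_info_cuda_threads(text: str) -> List[CudaThreadRow]:
    """Parse thread rows; skip prompts, the header, and 'Kernel N' lines."""
    rows = []
    for raw in text.splitlines():
        match = _ROW.match(raw.strip())
        if match is None:
            continue
        g = match.groups()
        rows.append(CudaThreadRow(
            block_idx=(int(g[0]), int(g[1]), int(g[2])),
            thread_idx=(int(g[3]), int(g[4]), int(g[5])),
            virtual_pc=int(g[6], 16),
            device=int(g[7]),
            sm=int(g[8]),
            warp=int(g[9]),
            lane=int(g[10]),
            filename=g[11],
            line=int(g[12]),
        ))
    return rows


# Two rows copied from the session shown on the slide.
SAMPLE = """\
BlockIdx ThreadIdx Virtual PC Dev SM Wp Ln Filename Line
Kernel 0
(1,0,0) (0,0,0) 0x0000000000948e58 0 11 0 0 infoCommands.cu 12
(1,0,0) (1,0,0) 0x0000000000948e58 0 11 0 1 infoCommands.cu 12
"""

rows = parse_info_cuda_threads(SAMPLE)
lanes_at_breakpoint = [row.lane for row in rows]
```

With records like these, a script can group stopped threads by SM, warp, or source line instead of reading the table by eye.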
CUDA-GDB
New Features
• Upgrade to GDB 8.2
• MacOSX support restored as a host platform
• Performance improvements
  • Module load time (~30% faster)
• Quality improvements
  • Improved handling of -lineinfo debug information (OptiX)
  • Improved debugging with parallel cuda-gdb sessions
  • Enabled CPU-side hardware watchpoints
CUDA-MEMCHECK
Functional correctness checking tool suite
Multiple tools:
• memcheck: reports out-of-bounds and misaligned memory access errors
• racecheck: identifies races on __shared__ memory
• initcheck: reports usage of uninitialized global memory
• synccheck: identifies invalid usage of __syncthreads() and __syncwarp()
Documentation: http://docs.nvidia.com/cuda/cuda-memcheck/index.html
COMPUTE-SANITIZER
New functional correctness checking tool suite
• Next-gen replacement tool for cuda-memcheck
• New command line interface (CLI) tool based on the Sanitizer API
• Performance gain for applications using libraries such as CUSOLVER, CUFFT or DL frameworks
• cuda-memcheck has performance issues on Windows that are inherent in its design; compute-sanitizer fixes them and brings performance on Windows on par with Linux
• Available in CUDA 11.0
• OS: Linux (x86_64, Power, Arm SBSA), Windows
• GPUs: Maxwell+
• cuda-memcheck is still supported in CUDA 11.0 (it does not support Arm SBSA)
Documentation: https://docs.nvidia.com/cuda/compute-sanitizer
COMPUTE-SANITIZER
API for functional correctness checking tool suite
• Provides finer control than cuda-memcheck through APIs to analyze memory patterns
• APIs are grouped into two categories:
  • Callback API: CUDA events such as memory allocations and kernel launches
  • Patching API: inserts patches for specific memory instructions
• New in CUDA 11.0:
  • SanitizerSetCallbackData API has been updated to take a function as input rather than a stream
  • Added support for Cooperative Groups
  • Memory Access callbacks for shared/local memory now report the address offset within the shared/local window
Documentation: https://docs.nvidia.com/cuda/compute-sanitizer
Samples: https://github.com/NVIDIA/compute-sanitizer-samples
NSIGHT SUITE OF PROFILERS
NSIGHT TOOLS WORKFLOW
• Start here: Nsight Systems (comprehensive system-level performance)
• Nsight Compute (detailed CUDA kernel performance): dive into top CUDA kernels by using metrics/counter collection, then re-check overall performance
• Nsight Graphics (detailed frame/render performance): dive into graphics frames, then re-check overall performance
NSIGHT SYSTEMS
System profiler
Key Features:
• System-wide application algorithm tuning
  • Multi-process tree support
• Locate optimization opportunities
  • Visualize millions of events on a very fast GUI timeline
  • Or gaps of unused CPU and GPU time
• Balance your workload across multiple CPUs and GPUs
  • CPU algorithms, utilization and thread state
  • GPU streams, kernels, memory transfers, etc.
• Command Line, Standalone, IDE Integration
OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+
Docs/product: https://developer.nvidia.com/nsight-systems
Timeline screenshot callouts:
• CPU utilization
• Processes & threads
• OS runtime APIs
• CUDA & cuBLAS APIs
• GPU CUDA kernels & memory transfers
• CPU IP & backtrace sample data
• NVTX annotations
• NVTX projected on GPU CUDA streams
Additionally:
• Trace: TensorRT, Direct3D 11, Direct3D 12, DXR, Vulkan, OpenGL, OpenACC, MPI, OpenMP, Ftrace, ETW, WDDM, GPU Context Switch
• Export: SQLite, HDF5, JSON
• Architectures: x86_64, Power, Arm SBSA, Tegra
NSIGHT SYSTEMS
Key Features
• NVTX projected onto GPU
• Event table
NSIGHT SYSTEMS
MPI & OpenACC Trace
NSIGHT SYSTEMS
OpenMP 5 API Trace
NSIGHT SYSTEMS
CUDA graph node correlation & NVTX projection
NSIGHT SYSTEMS
Improved UX without thread scheduler data permission
• Before: no utilization, no sorting
• After: clear activity/state, sort by utilization
NSIGHT SYSTEMS
Other Features
• CLI improvements
  • Simultaneous sessions & management
  • nsys launch forwards terminal and signals
  • nsys stats command
  • nsys export command
• CUDA context “All Streams” aggregation tree
• Windows WDDM GPU memory usage, paging, evictions
• Graphics trace enhancements
NSIGHT COMPUTE
Kernel Profiling Tool
Key Features:
• Interactive CUDA API debugging and kernel profiling
• Fast Data Collection
• Compare performance metrics across different runs
• Fully Customizable (Programmable UI/Guided Analysis)
• Command Line, Standalone, IDE Integration
OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, MacOSX (host only)
GPUs: Volta, Turing, A100 GPUs
Docs/product: https://developer.nvidia.com/nsight-compute
NSIGHT COMPUTE
Profile Report – Details Page
• Focused Sections
• All Data on Single Page
• Ordered from Top-Level to Low-Level
NSIGHT COMPUTE
Section Example
• Section Header: provides overview & context for other sections
• Section Body: provides additional details (tables & charts)
• Section Config: completely data driven; add/modify/change sections
NSIGHT COMPUTE
Unguided Analysis / Rules System
• Analysis Rules: recommendations from nvvp and more
• Rules Config: completely data driven; add/modify/change rules
NSIGHT COMPUTE
Diff’ing kernel runs
• Metric delta: current values and changes from baseline
• Baseline: from any previous profile report (different kernel, gpu, …)
• Chart difference: current values and baseline values
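The metric delta described above is simple arithmetic: the current value, its absolute change from the baseline, and the relative change. A minimal sketch under stated assumptions follows; the metric names and values are made up for illustration, and `metric_deltas` is a hypothetical helper, not an Nsight Compute API.

```python
from typing import Dict, Optional, Tuple


def metric_deltas(
    baseline: Dict[str, float],
    current: Dict[str, float],
) -> Dict[str, Tuple[float, float, Optional[float]]]:
    """Map each shared metric to (current value, change, percent change).

    Percent change is None when the baseline value is zero, because the
    relative change is undefined in that case.
    """
    deltas = {}
    for name, cur in current.items():
        if name not in baseline:
            continue  # nothing to compare against
        base = baseline[name]
        change = cur - base
        percent = (change / base) * 100.0 if base != 0 else None
        deltas[name] = (cur, change, percent)
    return deltas


# Hypothetical metric names and values, chosen only for illustration.
baseline_run = {"duration_us": 120.0, "achieved_occupancy_pct": 50.0}
current_run = {"duration_us": 90.0, "achieved_occupancy_pct": 62.5}

deltas = metric_deltas(baseline_run, current_run)
```

Here the duration drops by 30 µs (-25%) while the occupancy figure rises by 12.5 points (+25%), which is the kind of side-by-side reading the baseline view is meant to support.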
NSIGHT COMPUTE
New Features
• Support for Arm (SBSA) target architecture
• Support for PowerPC target architecture
• Profile activity command line is shown
NSIGHT COMPUTE
New Features
• Detailed breakdown of high-level SOL (Speed of Light) metrics
NSIGHT COMPUTE
New Features
• Improved help for Source page and Details page charts
NSIGHT COMPUTE
A100 GPU support
• Support for CUDA Asynchronous Copy
• Sparse Data compression
NSIGHT COMPUTE
New Roofline analysis
• Efficient way to evaluate kernel characteristics and quickly understand existing limiters or potential directions for further improvement
• Inputs: Arithmetic Intensity (FLOP/byte), Performance (FLOP/s)
• Ceilings: Peak Memory Bandwidth, Peak FP32/FP64 Performance
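The inputs and ceilings listed above combine through the standard roofline bound: attainable performance is the smaller of the peak compute rate and arithmetic intensity times peak memory bandwidth. The sketch below works through that arithmetic. The hardware peaks and kernel counts are illustrative placeholders, not the specifications of any real GPU, and the function names are hypothetical rather than anything Nsight Compute exposes.

```python
def arithmetic_intensity(flop_count: float, byte_count: float) -> float:
    """FLOP performed per byte moved to or from memory."""
    return flop_count / byte_count


def attainable_flops(intensity: float, peak_bandwidth: float,
                     peak_flops: float) -> float:
    """Roofline ceiling in FLOP/s for a kernel with the given intensity."""
    return min(peak_flops, intensity * peak_bandwidth)


def limiter(intensity: float, peak_bandwidth: float,
            peak_flops: float) -> str:
    """Which ceiling bounds the kernel: left or right of the ridge point."""
    ridge = peak_flops / peak_bandwidth  # FLOP/byte where the roofs meet
    return "memory-bound" if intensity < ridge else "compute-bound"


# Illustrative machine: 1 TB/s memory bandwidth, 10 TFLOP/s peak compute.
PEAK_BANDWIDTH = 1.0e12   # bytes/s (placeholder value)
PEAK_FLOPS = 1.0e13       # FLOP/s (placeholder value)

# Illustrative kernel: 2.5e11 FLOP over 1.25e11 bytes in 0.25 s.
ai = arithmetic_intensity(2.5e11, 1.25e11)       # FLOP/byte
achieved = 2.5e11 / 0.25                         # FLOP/s
ceiling = attainable_flops(ai, PEAK_BANDWIDTH, PEAK_FLOPS)
fraction_of_roof = achieved / ceiling
kind = limiter(ai, PEAK_BANDWIDTH, PEAK_FLOPS)
```

A kernel that sits well below its ceiling has headroom at its current intensity; a kernel on the sloped part of the roof gains more from reducing memory traffic than from faster arithmetic.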
NSIGHT COMPUTE
Workflow Improvements
• New Hot Spot Tables
• Result Filtering by NVTX
• Source Page can now be collapsed to generate aggregate metrics
• Recommendations: links to other sections & lines on Source page, access instanced metrics
NSIGHT COMPUTE
Other Improvements
• Shortened Executables and File Extensions
  • Old names continue to work
• New rule for reporting uncoalesced memory accesses
• Support report name placeholders %p, %q, %i, and %h
• Improved documentation
Versions                       UI            CLI               Report file extension
2019.5 (CUDA 10.2) and before  nv-nsight-cu  nv-nsight-cu-cli  .nsight-cuprof-report
2020.1 (CUDA 11.0) and newer   ncu-ui        ncu               .ncu-rep
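Since the old names continue to work, existing scripts need no change, but the table above is enough to translate them when convenient. The following is a small sketch, assuming a command is available as a list of tokens; `modernize_command` is a hypothetical helper, and the application path and report name in the example are invented.

```python
from typing import List

# Name mappings taken directly from the version table above.
LEGACY_TO_CURRENT_EXECUTABLE = {
    "nv-nsight-cu": "ncu-ui",
    "nv-nsight-cu-cli": "ncu",
}
LEGACY_REPORT_EXTENSION = ".nsight-cuprof-report"
CURRENT_REPORT_EXTENSION = ".ncu-rep"


def modernize_token(token: str) -> str:
    """Translate one legacy executable or report file name, if it is one."""
    if token in LEGACY_TO_CURRENT_EXECUTABLE:
        return LEGACY_TO_CURRENT_EXECUTABLE[token]
    if token.endswith(LEGACY_REPORT_EXTENSION):
        stem = token[: -len(LEGACY_REPORT_EXTENSION)]
        return stem + CURRENT_REPORT_EXTENSION
    return token


def modernize_command(argv: List[str]) -> List[str]:
    """Rewrite every legacy name in a tokenized command line."""
    return [modernize_token(token) for token in argv]


# Invented example tokens: a legacy CLI invocation and a legacy report name.
updated_command = modernize_command(["nv-nsight-cu-cli", "./my_app"])
updated_report = modernize_token("baseline.nsight-cuprof-report")
```

This only rewrites names in text; report files already on disk keep their original extension unless they are renamed separately.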
USEFUL LINKS
Web: https://developer.nvidia.com/tools-overview
How to contact us?
• Forums: https://forums.developer.nvidia.com/c/development-tools
• Email: [email protected]
Other digital GTC talks of interest:
• S21351: Scaling the Transformer Model Implementation in PyTorch Across Multiple Nodes
• S21547: Rebalancing the Load: Profile-Guided Optimization of the NAMD Molecular Dynamics Program for Modern GPUs using Nsight Systems
• S21771: Optimizing CUDA Kernels in HPC Simulation and Visualization Codes using Nsight Compute
• S21565: Roofline Performance Model for HPC and Deep-Learning Applications