Advancements in the NVIDIA GPU Ecosystem
Axel Koehler, Senior Solutions Architect HPC, NVIDIA
The Visual Computing Company
HPC Advisory Council Meeting, April 2014, Lugano
Outline
Tesla K40 and GPU Boost
Jetson TK1 Development Board for Embedded HPC
Pascal GPU
3D Memory
NVLINK
CUDA 6.0
Unified memory
Extended Library Interfaces
GPU Direct RDMA with OpenMPI
… and beyond
Tesla K40
FASTER: 1.4 TF | 2880 Cores | 288 GB/s
LARGER: 2x Memory Enables More Apps (6GB to 12GB; Fluid Dynamics, Seismic Analysis, Rendering)
SMARTER: Unlock Extra Performance Using Power Headroom
[Chart: AMBER Benchmark, ns/day for CPU, K20X and K40]
AMBER Benchmark: SPFP-Nucleosome. CPU: Dual E5-2687W @ 3.10GHz, 64GB System Memory, CentOS 6.2. GPU systems: Single Tesla K20X or Single Tesla K40.
Avg GPU Power in Watts for Real Applications on K20X
[Chart: average board power in Watts (axis 0 to 180) for AMBER, ANSYS, Black Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM, SPECFEM3D]
GPU Boost on Tesla K40
Convert Power Headroom to Higher Performance
Base Clock, 745 MHz: Workload #1, worst-case reference app, 235W
Boost Clock #1, 810 MHz: Workload #2, e.g. AMBER, 235W
Boost Clock #2, 875 MHz: Workload #3, e.g. ANSYS Fluent, 235W
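As a rough illustration of what the three clock levels mean, the sketch below computes the relative clock increase of each boost level over the 745 MHz base clock. The function name is hypothetical, and the result is only an upper bound on the speedup of a purely clock-bound kernel; memory-bound applications would see less.

```python
# Clock levels quoted on the GPU Boost slide for Tesla K40 (MHz).
BASE_MHZ = 745
BOOST_MHZ = {"Boost Clock #1": 810, "Boost Clock #2": 875}


def clock_uplift_percent(boost_mhz, base_mhz=BASE_MHZ):
    """Return the relative clock increase over the base clock, in percent."""
    return (boost_mhz / base_mhz - 1.0) * 100.0


if __name__ == "__main__":
    for name, mhz in BOOST_MHZ.items():
        print(f"{name}: {mhz} MHz, +{clock_uplift_percent(mhz):.1f}% over base")
```

With the slide's numbers this gives roughly +8.7% for 810 MHz and +17.4% for 875 MHz.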
Compute Workload Behavior with GPU Boost: Non-Tesla vs. Tesla K40
GPU clock: Non-Tesla uses automatic clock switching; Tesla K40 uses deterministic clocks (Base Clock, Boost Clock #1, Boost Clock #2)
Default: Non-Tesla Boost; Tesla K40 Base
Preset options: Non-Tesla lock to base clock; Tesla K40 3 levels (Base, Boost1 or Boost2)
Boost interface: Non-Tesla Control Panel; Tesla K40 NV-SMI, NVML
    nvidia-smi -q -d CLOCK,SUPPORTED_CLOCKS
    nvidia-smi -ac <MEM clock, Graphics clock>
Target duration for boost clocks: Non-Tesla ~50% of run-time; Tesla K40 100% of workload run time (must-have for HPC workload)
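The two nvidia-smi invocations on this slide can be assembled programmatically, for example in a job prologue script. The sketch below only builds the argument lists and does not run them. The helper names are hypothetical, and the 3004 MHz memory clock in the usage example is an assumption that does not appear on the slides; the values to use should come from the SUPPORTED_CLOCKS query on the target system.

```python
import shlex


def query_clocks_cmd():
    """Argument list for querying current and supported clocks."""
    return ["nvidia-smi", "-q", "-d", "CLOCK,SUPPORTED_CLOCKS"]


def set_application_clocks_cmd(mem_mhz, graphics_mhz):
    """Argument list for setting application clocks as <memory,graphics> in MHz."""
    return ["nvidia-smi", "-ac", f"{mem_mhz},{graphics_mhz}"]


if __name__ == "__main__":
    print(shlex.join(query_clocks_cmd()))
    # Assumed memory clock (3004 MHz); 875 MHz is Boost Clock #2 from the slides.
    print(shlex.join(set_application_clocks_cmd(3004, 875)))
```

Returning argument lists rather than a shell string lets the caller pass them directly to a process launcher without quoting issues.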
JETSON TK1: THE WORLD’S 1st EMBEDDED SUPERCOMPUTER
Development Platform for Embedded Computer Vision, Robotics, Medical, ....
• Tegra K1 SOC
• Kepler GPU with 192 Cores (Compute Capability 3.2)
• 4-Plus-1 quad-core ARM Cortex-A15 CPU
• 2 GB Memory, 16 GB eMMC memory
• IO options
• miniPCI-e slot, GigE, HDMI, SD/MMC connector, USB 3.0, SATA data port, ….
• CUDA Toolkit 6.0, OpenGL 4.4, OpenGL ES 3.0
• Runs 32-bit Ubuntu 13.04 Linux for Tegra (L4T)
• 326 GFLOPS, 5 Watts
https://developer.nvidia.com/jetson-tk1
Pascal GPU
Optimized for double precision FP
Very high bandwidth, large capacity 3D memory on package
NVLINK for high bandwidth CPU-GPU and GPU-GPU interconnect
Unified Memory (UM) HW support
New packaging allows much denser solutions (one-third the size of current PCIe boards)
Stacked Memory
3D chip on wafer integration
Multiple layers of DRAM components will be integrated vertically on the package along with the GPU
Compared to GDDR5 memory
4x Higher Bandwidth
3x Larger Capacity
4x More Energy Efficient per bit
NVLINK
CPU-GPU communication is limited by the low-bandwidth connection via PCI-e
NVLINK is a high-speed interconnect between CPU-GPU and GPU-GPU
Basic building block is an 8-lane, differential, dual simplex bidirectional link
Multiple links can be aggregated to increase the bandwidth of a connection
NVLink will provide between 80 and 200 GB/s of bandwidth
Cache coherency provided with NVLINK 2.0
Preserves the PCIe programming model
CPU-initiated transactions such as control and configuration go over a PCIe connection
GPU-initiated transactions use NVLink, allowing the GPU full-bandwidth access to the CPU’s memory system
NVLink is more than twice as energy efficient as a PCIe 3.0 connection
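To put the 80 to 200 GB/s range in perspective, the sketch below estimates how long it would take to move 12 GB (the Tesla K40 memory size quoted earlier) at each bandwidth. The PCIe figure is an assumption that is not on the slides: about 15.75 GB/s is the theoretical one-direction peak of a PCIe 3.0 x16 link, and real transfers achieve less. Protocol overhead and latency are ignored.

```python
# Bandwidths in GB/s. The NVLink range is quoted on the slide.
NVLINK_GBPS_RANGE = (80.0, 200.0)
# Assumption (not from the slides): theoretical PCIe 3.0 x16 peak per direction.
PCIE3_X16_GBPS = 15.75


def transfer_seconds(size_gb, bandwidth_gbps):
    """Idealized transfer time: payload size divided by sustained bandwidth."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gbps


if __name__ == "__main__":
    size_gb = 12.0  # Tesla K40 memory size from the slides
    print(f"PCIe 3.0 x16: {transfer_seconds(size_gb, PCIE3_X16_GBPS):.3f} s")
    for bw in NVLINK_GBPS_RANGE:
        print(f"NVLink @ {bw:.0f} GB/s: {transfer_seconds(size_gb, bw):.3f} s")
```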
Unified Memory
Dramatically Lower Developer Effort
Developer View Today: System Memory, GPU Memory
Developer View With Unified Memory: Unified Memory
Super Simplified Memory Management Code
CPU Code:
    void sortfile(FILE *fp, int N) {
        char *data;
        data = (char *)malloc(N);
        fread(data, 1, N, fp);
        qsort(data, N, 1, compare);
        use_data(data);
        free(data);
    }
CUDA 6 Code with Unified Memory:
    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);   // one allocation visible to CPU and GPU
        fread(data, 1, N, fp);
        qsort<<<...>>>(data,N,1,compare);
        cudaDeviceSynchronize();       // wait for the GPU before the CPU touches data
        use_data(data);
        cudaFree(data);
    }
Unified Memory Delivers
1. Simpler Programming & Memory Model
Single pointer to data, accessible anywhere
Tight language integration
Greatly simplifies code porting
2. Performance Through Data Locality
Migrate data to accessing processor
Guarantee global coherency
Still allows cudaMemcpyAsync() hand tuning
Unified Memory Roadmap
CUDA 6: Ease of Use
Single Pointer to Data
No Memcopy Required
Coherence @ launch & sync
Shared C/C++ Data Structures
Next: Optimizations
Prefetching
Migration Hints
Additional OS Support
Future GPUs
Finer Grain Migration
Not Limited to GPU Memory Size
Learn More: http://bit.ly/um-p4a
GPU Direct RDMA with OpenMPI
Starting with CUDA 6 OpenMPI also supports GPU Direct RDMA
Kepler class GPUs (K10, K20, K20X, K40)
Mellanox ConnectX-3, ConnectX-3 Pro, Connect-IB
CUDA 6.0 (EA, RC, Final), Open MPI 1.7.4 and Mellanox OFED 2.1 drivers.
GPU Direct RDMA enabling software http://www.mellanox.com/downloads/ofed/nvidia_peer_memory-1.0-0.tar.gz
GPU Direct RDMA with OpenMPI
OpenMPI compilation: configure --with-cuda
GPU Direct RDMA support is configured in if the CUDA 6.0 cuda.h header file is detected.
To check:
    > ompi_info --all | grep btl_openib_have_cuda_gdr
    MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)
    > ompi_info --all | grep btl_openib_have_driver_gdr
    MCA btl: informational "btl_openib_have_driver_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)
Enable GPU Direct RDMA usage (off by default):
    --mca btl_openib_want_cuda_gdr 1
Adjust the message size at which Open MPI switches to pipelined transfers through host memory (current default is 30,000 bytes):
    --mca btl_openib_cuda_rdma_limit 60000
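The interaction between the two MCA parameters can be sketched as a simple size threshold: small messages go over GPU Direct RDMA, larger ones are pipelined through host memory. The sketch below is a model of that decision, not Open MPI's actual implementation. The function names are hypothetical, and treating a message of exactly the limit as an RDMA transfer is an assumption.

```python
DEFAULT_RDMA_LIMIT = 30_000  # bytes, default quoted on the slide


def transfer_path(message_bytes, rdma_limit=DEFAULT_RDMA_LIMIT):
    """Model which path a GPU buffer of the given size would take.

    Assumption: sizes up to and including the limit use GPU Direct RDMA.
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    return "gpudirect_rdma" if message_bytes <= rdma_limit else "host_pipeline"


def mca_args(enable_gdr=True, rdma_limit=None):
    """Build the mpirun MCA arguments shown on the slide."""
    args = []
    if enable_gdr:
        args += ["--mca", "btl_openib_want_cuda_gdr", "1"]
    if rdma_limit is not None:
        args += ["--mca", "btl_openib_cuda_rdma_limit", str(rdma_limit)]
    return args


if __name__ == "__main__":
    for size in (4_096, 30_000, 45_000, 1_000_000):
        print(size, transfer_path(size), transfer_path(size, rdma_limit=60_000))
    print(" ".join(mca_args(rdma_limit=60_000)))
```

Under this model, raising the limit to 60000 moves a 45,000-byte message from the host pipeline to the RDMA path.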
GPU Direct RDMA with OpenMPI
Chipset implementation limits bandwidth at larger message sizes
Still use pipelining with host memory staging for large messages
(hybrid version utilizes asynchronous copies)
GPU Direct RDMA with OpenMPI
HOOMD-blue (git master 28Jan14), Lennard-Jones Liquid dataset (16K, 512K Particles)
Dual-Socket Intel E5-2680 v2 @ 2.80 GHz CPUs, 64GB memory, RHEL 6.2, MLNX_OFED 2.1-1.0.0, Mellanox FDR, 1 x Tesla K40 per node, Driver 331.20, Open MPI 1.7.4rc1, GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)
Dual-Socket Intel E5-2630 v2 @ 2.60 GHz CPUs, 64GB memory, Scientific Linux 6.4, MLNX_OFED 2.1-1.0.0, Mellanox FDR, 2 x Tesla K20 per node, Driver 331.20, Open MPI 1.7.4rc1, GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)
[Charts: two HOOMD-blue performance plots, annotated with gains of 20% and 102%; higher is better]
http://www.hpcadvisorycouncil.com/pdf/HOOMDblue_Analysis_and_Profiling.pdf
Extended (XT) Library Interfaces
Automatic Scaling to multiple GPUs per node
cuFFT 2D/3D & cuBLAS level 3
Operate directly on large datasets that reside in CPU memory
16K x 16K SGEMM on Tesla K10:
1 x K10: 2.2 TFLOPS
2 x K10: 4.2 TFLOPS
3 x K10: 6.0 TFLOPS
4 x K10: 7.9 TFLOPS
developer.nvidia.com/cublasxt
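The SGEMM figures above (2.2, 4.2, 6.0 and 7.9 TFLOPS on one to four K10 boards) can be turned into speedup and parallel efficiency relative to a single board. This is a minimal sketch with hypothetical names that uses only the numbers from the slide.

```python
# 16K x 16K SGEMM throughput from the slide, in TFLOPS, keyed by number of K10 boards.
SGEMM_TFLOPS = {1: 2.2, 2: 4.2, 3: 6.0, 4: 7.9}


def scaling(results):
    """Return {boards: (speedup, efficiency)} relative to the single-board result."""
    base = results[1]
    return {
        n: (tflops / base, tflops / (n * base))
        for n, tflops in sorted(results.items())
    }


if __name__ == "__main__":
    for n, (speedup, eff) in scaling(SGEMM_TFLOPS).items():
        print(f"{n} x K10: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

With these numbers the efficiency stays at about 95%, 91% and 90% for two, three and four boards.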
New Drop-in NVBLAS Library
Drop-in replacement for CPU-only BLAS
Automatically route BLAS3 calls to cuBLAS
Example: Drop-in Speedup for R
    > LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so R
    > A <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
    > B <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
    > system.time(C <- A %*% B)
       user  system elapsed
      0.348   0.142   0.289
Use in any app that uses standard BLAS3
Octave, Scilab, etc.
[Chart: Matrix-Matrix Multiplication in R, fp64 GFlops/s (axis 0 to 3000) vs. matrix dimension (0 to 35000); nvBLAS on 4x K20X GPUs compared with MKL on a 6-core Xeon E5-2667 CPU]
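The elapsed time in the R example (0.289 s for a 4096 x 4096 multiply) can be converted into an effective throughput using the conventional 2n³ operation count for a dense matrix product. The sketch below is a back-of-the-envelope estimate with a hypothetical function name. The result includes any host-to-device transfer time hidden in the elapsed figure, so it is an effective rate rather than a kernel rate.

```python
def gemm_gflops(n, elapsed_s):
    """Effective GFLOP/s for an n x n by n x n multiply, counting 2*n^3 operations."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return 2.0 * n**3 / elapsed_s / 1e9


if __name__ == "__main__":
    # n and elapsed time taken from the R session on the slide.
    print(f"{gemm_gflops(4096, 0.289):.0f} GFLOP/s")
```

For the slide's timing this works out to roughly 476 GFLOP/s.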
Remote Development with Nsight Eclipse Edition
Local IDE, remote application
Edit locally, build & run remotely
Automatic sync via ssh
Cross-compilation to ARM
Full debugging & profiling via remote connection
[Diagram: Edit on the local machine; sync; Build, Run, Debug, Profile on the remote machine]
Goals for the CUDA Platform
• Simplicity: Learn, adopt, & use parallelism with ease
• Productivity: Quickly achieve feature & performance goals
• Portability: Write code that can execute on all targets
• Performance: High absolute performance and scalability
Simpler Heterogeneous Applications
We want: homogeneous programs, heterogeneous execution
– Unified programming model includes parallelism in language
– Abstract heterogeneous execution via Runtime or Virtual Machine
[Diagram: Current: a hybrid program with a parallel part on the GPU and a serial part on the CPU. Ideal: a single program (parallel + serial) under a homogeneous programming model, executing on GPU and CPU]
Parallelism in Mainstream Languages
• Enable more programmers to write parallel software
• Give programmers the choice of language to use
• GPU support in key languages
C++ Parallel Algorithms Library Progress
• Complete set of parallel primitives:
for_each, sort, reduce, scan, etc.
• ISO C++ committee voted unanimously to accept as official tech. specification working draft
N3960 Technical Specification Working Draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype: https://github.com/n3554/n3554
    std::vector<int> vec = ...
    // previous standard sequential loop
    std::for_each(vec.begin(), vec.end(), f);
    // explicitly sequential loop
    std::for_each(std::seq, vec.begin(), vec.end(), f);
    // permitting parallel execution
    std::for_each(std::par, vec.begin(), vec.end(), f);
Numba Python Compiler
• Free and open source compiler for array-oriented Python
• NEW numba.cuda module integrates CUDA directly into Python
• http://numba.pydata.org/
    from numba import cuda

    @cuda.jit("void(float32[:], float32, float32[:], float32[:])")
    def saxpy(out, a, x, y):
        i = cuda.grid(1)
        if i < out.shape[0]:  # guard threads past the end of the array
            out[i] = a * x[i] + y[i]

    # Launch saxpy kernel
    saxpy[griddim, blockdim](out, a, x, y)
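The slide leaves griddim and blockdim undefined. One common way to choose them for a one-dimensional kernel is to fix the block size and round the number of blocks up so that every element gets a thread. The sketch below shows that calculation together with a pure-Python reference for checking kernel results. Both helper names and the block size of 256 are assumptions, not part of the original slide, and the code runs without a GPU.

```python
def launch_config(n, blockdim=256):
    """Return (griddim, blockdim) covering n elements with one thread each.

    The block size of 256 is an assumed, commonly used default.
    """
    if n <= 0 or blockdim <= 0:
        raise ValueError("n and blockdim must be positive")
    griddim = (n + blockdim - 1) // blockdim  # ceiling division
    return griddim, blockdim


def saxpy_reference(a, x, y):
    """CPU reference: out[i] = a * x[i] + y[i], for validating a GPU kernel."""
    return [a * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    print(launch_config(1000))
    print(saxpy_reference(2.0, [1.0, 2.0], [3.0, 4.0]))
```

Because the grid is rounded up, the last block usually contains threads whose index lies past the end of the array, which is why kernels of this kind typically check the index before writing.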
GPU-Accelerated Hadoop
Extract insights from customer data
Data Analytics using clustering algorithms
Developed using CUDA-accelerated IBM Java
Compile Java for GPUs
• Approach: apply a closure to a set of arrays
• foreach iterations parallelized over GPU threads
– Threads run closure execute() method
    // vector addition
    float[] X = {1.0f, 2.0f, 3.0f, 4.0f, … };
    float[] Y = {9.0f, 8.1f, 7.2f, 6.3f, … };
    float[] Z = {0.0f, 0.0f, 0.0f, 0.0f, … };
    jog.foreach(X, Y, Z, new jogContext(),
      new jogClosureRet<jogContext>() {
        public float execute(float x, float y) {
          return x + y;
        }
      }
    );
[Chart: Java Black-Scholes Options Pricing Speedup; speedup vs. sequential Java (axis 0 to 14) against millions of options (1 to 20)]
The Massively Parallel Programming Blog
Technical posts on GPUs, CUDA, OpenACC, Libraries, C/C++/Python and more
In-depth articles and regular series:
CUDACasts: instructive videos
CUDA Pro Tips: useful techniques
CUDA Spotlight Interviews
Join the conversation by subscribing to email or RSS updates today!
http://devblogs.nvidia.com/parallelforall
© 2014 NVIDIA Corporation. All rights reserved.
Axel Koehler