
Advancements in the NVIDIA GPU Ecosystem

Axel Koehler, Senior Solutions Architect HPC, NVIDIA

The Visual Computing Company

HPC Advisory Council Meeting, April 2014, Lugano

Outline

• Tesla K40 and GPU Boost
• Jetson TK1 Development Board for Embedded HPC
• Pascal GPU
  – 3D Memory
  – NVLINK
• CUDA 6.0
  – Unified memory
  – Extended Library Interfaces
  – GPU Direct RDMA with OpenMPI
• … and beyond

Tesla K40

FASTER: 1.4 TF | 2880 Cores | 288 GB/s

LARGER: 2x Memory Enables More Apps (12GB versus 6GB; e.g. Fluid Dynamics, Seismic Analysis, Rendering)

SMARTER: Unlock Extra Performance Using Power Headroom

[Chart: AMBER Benchmark, performance in ns/day (axis 0 to 5) for CPU, K20X, and K40]

AMBER Benchmark: SPFP-Nucleosome. CPU: Dual E5-2687W @ 3.10GHz, 64GB System Memory, CentOS 6.2. GPU systems: Single Tesla K20X or Single Tesla K40.

[Chart: Avg GPU Power in Watts for Real Applications on K20X. Y-axis: board power in watts (0 to 180). Applications: AMBER, ANSYS, Black Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM, SPECFEM3D]

GPU Boost on Tesla K40

Convert Power Headroom to Higher Performance

Workload #1 (worst-case reference app): Base Clock, 745 MHz, 235W
Workload #2 (e.g. AMBER): Boost Clock #1, 810 MHz, 235W
Workload #3 (e.g. ANSYS Fluent): Boost Clock #2, 875 MHz, 235W
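As a rough check of the figures on this slide, the sketch below computes the relative uplift of each boost clock over the 745 MHz base clock. It also estimates peak double-precision throughput at the base clock, assuming (this is not stated in the slides) one FP64 unit per three CUDA cores and two floating-point operations per unit per cycle through a fused multiply-add.

```python
# Clock figures from the GPU Boost slide (MHz).
BASE_MHZ = 745
BOOST_MHZ = {"Boost Clock #1": 810, "Boost Clock #2": 875}

CUDA_CORES = 2880          # from the Tesla K40 slide
CORES_PER_FP64_UNIT = 3    # assumption: one FP64 unit per three CUDA cores
FLOPS_PER_CYCLE = 2        # assumption: one fused multiply-add per cycle


def uplift_percent(boost_mhz, base_mhz=BASE_MHZ):
    """Relative clock increase over the base clock, in percent."""
    return (boost_mhz / base_mhz - 1.0) * 100.0


def peak_fp64_tflops(clock_mhz):
    """Theoretical FP64 peak in TFLOPS under the assumptions above."""
    fp64_units = CUDA_CORES / CORES_PER_FP64_UNIT
    return fp64_units * FLOPS_PER_CYCLE * clock_mhz * 1e6 / 1e12


for name, mhz in BOOST_MHZ.items():
    print(f"{name}: {mhz} MHz, +{uplift_percent(mhz):.1f}% over base")
print(f"Estimated FP64 peak at base clock: {peak_fp64_tflops(BASE_MHZ):.2f} TFLOPS")
```

Under these assumptions the base-clock estimate lands close to the 1.4 TF quoted for the K40, and the two boost levels correspond to clock increases of roughly 9% and 17%.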

Compute Workload Behavior with GPU Boost

GPU clock
  Non-Tesla: Automatic clock switching
  Tesla K40: Deterministic clocks (must-have for HPC workloads)
Default
  Non-Tesla: Boost
  Tesla K40: Base
Preset options
  Non-Tesla: Lock to base clock
  Tesla K40: 3 levels: Base, Boost1 or Boost2
Boost interface
  Non-Tesla: Control Panel
  Tesla K40: NV-SMI, NVML
    nvidia-smi -q -d CLOCK,SUPPORTED_CLOCKS
    nvidia-smi -ac <MEM clock, Graphics clock>
Target duration for boost clocks
  Non-Tesla: ~50% of run-time
  Tesla K40: 100% of workload run time
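The nvidia-smi commands above can be assembled from a script. The sketch below only builds the argument lists and does not execute them. The helper names are hypothetical, and the 3004 MHz memory clock is an assumed example value; the real value should be taken from the SUPPORTED_CLOCKS query on the actual board.

```python
import shlex


def query_clocks_cmd():
    """Argument list for querying current and supported clocks."""
    return ["nvidia-smi", "-q", "-d", "CLOCK,SUPPORTED_CLOCKS"]


def set_app_clocks_cmd(mem_mhz, graphics_mhz):
    """Argument list for setting clocks as <MEM clock, Graphics clock>."""
    if mem_mhz <= 0 or graphics_mhz <= 0:
        raise ValueError("clocks must be positive MHz values")
    return ["nvidia-smi", "-ac", f"{mem_mhz},{graphics_mhz}"]


# Assumed example: 3004 MHz memory clock with the 875 MHz boost level.
print(shlex.join(query_clocks_cmd()))
print(shlex.join(set_app_clocks_cmd(3004, 875)))
```

Keeping the commands as argument lists makes it straightforward to hand them to `subprocess.run` later without shell quoting issues.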

JETSON TK1: THE WORLD’S 1st EMBEDDED SUPERCOMPUTER

Development Platform for Embedded Computer Vision, Robotics, Medical, ....

• Tegra K1 SOC
  – Kepler GPU with 192 Cores (Compute Capability 3.2)
  – 4 Plus 1 Quad core ARM Cortex A15 CPU
• 2 GB Memory, 16 GB eMMC memory
• IO options
  – miniPCI-e slot, GigE, HDMI, SD/MMC connector, USB 3.0, SATA data port, ….
• CUDA Toolkit 6.0, OpenGL 4.4, OpenGL ES 3.0
• Runs 32-bit Ubuntu 13.04 Linux for Tegra (L4T)
• 326 GFLOPS, 5 Watts

https://developer.nvidia.com/jetson-tk1
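The headline figures can be related with a little arithmetic. The sketch below derives the performance per watt from the stated 326 GFLOPS and 5 watts, and the GPU clock that the 326 GFLOPS figure would imply if each of the 192 cores retires one fused multiply-add (two floating-point operations) per cycle. That per-cycle rate is an assumption, not a figure from the slides.

```python
CORES = 192
PEAK_GFLOPS = 326.0
POWER_WATTS = 5.0
FLOPS_PER_CORE_PER_CYCLE = 2  # assumption: one fused multiply-add per cycle


def gflops_per_watt(gflops=PEAK_GFLOPS, watts=POWER_WATTS):
    """Energy efficiency implied by the headline numbers."""
    return gflops / watts


def implied_clock_mhz(gflops=PEAK_GFLOPS, cores=CORES):
    """Clock rate (MHz) at which the cores would reach the stated peak."""
    return gflops * 1e9 / (cores * FLOPS_PER_CORE_PER_CYCLE) / 1e6


print(f"{gflops_per_watt():.1f} GFLOPS per watt")
print(f"Implied GPU clock: {implied_clock_mhz():.0f} MHz")
```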

Pascal GPU

Optimized for double precision FP

Very high bandwidth, large capacity 3D memory on package

NVLINK for high bandwidth CPU-GPU and GPU-GPU interconnect

Unified Memory (UM) HW support

New packaging allows much denser solutions (one-third the size of current PCIe boards)

Stacked Memory

3D chip on wafer integration

Multiple layers of DRAM components will be integrated vertically on the package along with the GPU

Compared to GDDR5 memory:
– 4x Higher Bandwidth
– 3x Larger Capacity
– 4x More Energy Efficient per bit

NVLINK

CPU-GPU communication is limited by the low-bandwidth PCIe connection

NVLINK is a high-speed interconnect between CPU and GPU and between GPUs

Basic building block is an 8-lane, differential, dual simplex bidirectional link

Multiple links can be aggregated to increase the bandwidth of a connection

NVLINK will provide between 80 and 200 GB/s of bandwidth

Cache coherency provided with NVLINK 2.0

Preserves the PCIe programming model

CPU-initiated transactions, such as control and configuration, go over a PCIe connection

GPU-initiated transactions use NVLINK, allowing the GPU full-bandwidth access to the CPU’s memory system

NVLINK is more than twice as energy efficient as a PCIe 3.0 connection
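To give a feel for these bandwidth figures, the sketch below estimates how long it would take to move a 12 GB working set (the Tesla K40 memory size) across each link at its peak rate. The PCIe 3.0 x16 figure of roughly 15.75 GB/s per direction is an assumed theoretical value (8 GT/s per lane, 128b/130b encoding, 16 lanes), not a number from the slides, and real transfers achieve less than peak on any link.

```python
# Peak bandwidths in GB/s. The NVLINK values are the range quoted above.
PCIE3_X16_GBPS = 8 * (128 / 130) / 8 * 16   # assumption: theoretical, per direction
NVLINK_LOW_GBPS = 80.0
NVLINK_HIGH_GBPS = 200.0


def transfer_seconds(size_gb, bandwidth_gbps):
    """Idealized transfer time: size divided by peak bandwidth."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gbps


SIZE_GB = 12.0
links = [
    ("PCIe 3.0 x16", PCIE3_X16_GBPS),
    ("NVLINK (80 GB/s)", NVLINK_LOW_GBPS),
    ("NVLINK (200 GB/s)", NVLINK_HIGH_GBPS),
]
for label, bandwidth in links:
    millis = transfer_seconds(SIZE_GB, bandwidth) * 1000
    print(f"{label}: {bandwidth:.1f} GB/s, {millis:.0f} ms")
```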


Unified Memory

Dramatically Lower Developer Effort

Developer View Today: separate System Memory and GPU Memory

Developer View With Unified Memory: a single Unified Memory

Super Simplified Memory Management Code

CPU Code

void sortfile(FILE *fp, int N) {
  char *data;
  data = (char *)malloc(N);

  fread(data, 1, N, fp);

  qsort(data, N, 1, compare);

  use_data(data);

  free(data);
}

CUDA 6 Code with Unified Memory

void sortfile(FILE *fp, int N) {
  char *data;
  cudaMallocManaged(&data, N);

  fread(data, 1, N, fp);

  qsort<<<...>>>(data,N,1,compare);
  cudaDeviceSynchronize();

  use_data(data);

  cudaFree(data);
}

Unified Memory Delivers

1. Simpler Programming & Memory Model
– Single pointer to data, accessible anywhere
– Tight language integration
– Greatly simplifies code porting

2. Performance Through Data Locality
– Migrate data to accessing processor
– Guarantee global coherency
– Still allows cudaMemcpyAsync() hand tuning
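The migrate-on-access idea can be illustrated with a toy model. The sketch below is not CUDA and does not reflect how the driver is implemented; the ManagedBuffer class and its methods are hypothetical names that only track which processor currently holds a buffer and count how often it would have to move.

```python
class ManagedBuffer:
    """Toy model of a single managed allocation (hypothetical, not a CUDA API)."""

    def __init__(self, nbytes, location="cpu"):
        self.nbytes = nbytes
        self.location = location      # processor that currently holds the data
        self.migrations = 0
        self.bytes_moved = 0

    def access(self, processor):
        """Record an access; migrate the data if it lives elsewhere."""
        if processor not in ("cpu", "gpu"):
            raise ValueError(f"unknown processor: {processor}")
        if processor != self.location:
            self.location = processor
            self.migrations += 1
            self.bytes_moved += self.nbytes


buf = ManagedBuffer(nbytes=1 << 20)
# Mirror the sortfile example: the CPU fills the buffer, the GPU sorts it
# (two accesses here stand in for repeated kernel work), then the CPU
# consumes the result.
for processor in ["cpu", "gpu", "gpu", "cpu"]:
    buf.access(processor)
print(buf.migrations, buf.bytes_moved)
```

Repeated accesses from the same processor cost nothing in this model, which is the data-locality argument in point 2.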

Unified Memory Roadmap

CUDA 6: Ease of Use
– Single Pointer to Data
– No Memcopy Required
– Coherence @ launch & sync
– Shared C/C++ Data Structures

Next: Optimizations
– Prefetching
– Migration Hints
– Additional OS Support

Future GPUs
– Finer Grain Migration
– Not Limited to GPU Memory Size

Learn More: http://bit.ly/um-p4a


Starting with CUDA 6, Open MPI also supports GPU Direct RDMA

Kepler class GPUs (K10, K20, K20X, K40)

Mellanox ConnectX-3, ConnectX-3 Pro, Connect-IB

CUDA 6.0 (EA, RC, Final), Open MPI 1.7.4 and Mellanox OFED 2.1 drivers.

GPU Direct RDMA enabling software http://www.mellanox.com/downloads/ofed/nvidia_peer_memory-1.0-0.tar.gz


Open MPI compilation: configure --with-cuda

Support is configured in if the CUDA 6.0 cuda.h header file is detected.

To check:

> ompi_info --all | grep btl_openib_have_cuda_gdr
MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

> ompi_info --all | grep btl_openib_have_driver_gdr
MCA btl: informational "btl_openib_have_driver_gdr" (current value: "true", data source: default, level: 4 tuner/basic, type: bool)

Enable GPU Direct RDMA usage (off by default):
--mca btl_openib_want_cuda_gdr 1

Adjust the message size at which transfers switch to pipelining through host memory (current default is 30,000 bytes):
--mca btl_openib_cuda_rdma_limit 60000


Chipset implementation limits bandwidth at larger message sizes

Still use pipelining with host memory staging for large messages

(hybrid version utilizes asynchronous copies)
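The size-based switch described above can be sketched as a simple decision function. This is a simplified illustration, not Open MPI's implementation: the function names are hypothetical, and whether a message exactly at the limit uses RDMA is an assumption. The default of 30,000 bytes and the MCA parameter names come from the slides.

```python
DEFAULT_RDMA_LIMIT = 30_000  # bytes, default quoted for btl_openib_cuda_rdma_limit


def choose_transfer_path(message_bytes, rdma_limit=DEFAULT_RDMA_LIMIT):
    """Return which path a GPU buffer of the given size would take."""
    if message_bytes < 0:
        raise ValueError("message size cannot be negative")
    # Assumption: messages up to and including the limit use GPU Direct RDMA.
    if message_bytes <= rdma_limit:
        return "gpudirect_rdma"
    return "host_staged_pipeline"


def mca_args(rdma_limit):
    """MCA arguments that enable GPU Direct RDMA and set the switch point."""
    return [
        "--mca", "btl_openib_want_cuda_gdr", "1",
        "--mca", "btl_openib_cuda_rdma_limit", str(rdma_limit),
    ]


for size in (4_096, 30_000, 45_000, 1_000_000):
    print(size, choose_transfer_path(size), choose_transfer_path(size, 60_000))
print(" ".join(mca_args(60_000)))
```

Raising the limit to 60,000 bytes moves mid-sized messages such as the 45,000-byte example onto the RDMA path, while large messages still go through host memory staging.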


HOOMD-blue (git master 28Jan14), Lennard-Jones Liquid dataset (16K, 512K Particles)

Dual-Socket Intel E5-2680 v2 @ 2.80 GHz CPUs, 64GB memory,

RHEL 6.2 , MLNX_OFED 2.1-1.0.0, Mellanox FDR

1 x Tesla K40 per node, Driver 331.20, Open MPI 1.7.4rc1,

GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)

Dual-Socket Intel E5-2630 v2 @ 2.60 GHz CPUs, 64GB memory,

Scientific Linux 6.4 , MLNX_OFED 2.1-1.0.0, Mellanox FDR

2 x Tesla K20 per node, Driver 331.20, Open MPI 1.7.4rc1,

GPUDirect RDMA (nvidia_peer_memory-1.0-0.tar.gz)

[Charts: HOOMD-blue performance with GPUDirect RDMA on the two clusters above, annotated with 20% and 102%; higher is better]

http://www.hpcadvisorycouncil.com/pdf/HOOMDblue_Analysis_and_Profiling.pdf

Extended (XT) Library Interfaces

Automatic Scaling to multiple GPUs per node

cuFFT 2D/3D & cuBLAS level 3

Operate directly on large datasets that reside in CPU memory

16K x 16K SGEMM on Tesla K10

1 x K10: 2.2 TFLOPS
2 x K10: 4.2 TFLOPS
3 x K10: 6.0 TFLOPS
4 x K10: 7.9 TFLOPS

developer.nvidia.com/cublasxt
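The scaling implied by these figures can be checked with a short calculation. The sketch below computes speedup and parallel efficiency relative to one K10, and the idealized run time of a single 16K x 16K SGEMM, using the standard 2N^3 operation count for a dense matrix multiply and assuming that 16K means 16,384.

```python
TFLOPS_BY_K10_COUNT = {1: 2.2, 2: 4.2, 3: 6.0, 4: 7.9}  # from the chart
N = 16 * 1024  # assumption: "16K" means 16,384


def scaling(results):
    """Yield (boards, speedup, efficiency) relative to the single-board result."""
    base = results[1]
    for boards, tflops in sorted(results.items()):
        speedup = tflops / base
        yield boards, speedup, speedup / boards


def sgemm_seconds(n, tflops):
    """Idealized time for one n x n matrix multiply at the given rate."""
    return 2 * n**3 / (tflops * 1e12)


for boards, speedup, efficiency in scaling(TFLOPS_BY_K10_COUNT):
    seconds = sgemm_seconds(N, TFLOPS_BY_K10_COUNT[boards])
    print(f"{boards} x K10: {speedup:.2f}x, {efficiency:.0%} efficiency, {seconds:.2f} s")
```

Efficiency stays at or near 90% up to four boards, which is what the "automatic scaling" claim amounts to for this problem size.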

New Drop-in NVBLAS Library

Drop-in replacement for CPU-only BLAS

Automatically route BLAS3 calls to cuBLAS

Example: Drop-in Speedup for R

> LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so R

> A <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
> B <- matrix(rnorm(4096*4096), nrow=4096, ncol=4096)
> system.time(C <- A %*% B)
   user  system elapsed
  0.348   0.142   0.289

Use in any app that uses standard BLAS3

Octave, Scilab, etc.
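The elapsed time above translates into a sustained rate as follows. The sketch uses the standard 2N^3 operation count for multiplying two N x N matrices; it is a back-of-the-envelope figure that also includes any data transfer time within the measured 0.289 seconds.

```python
def matmul_gflops(n, elapsed_seconds):
    """Sustained GFLOP/s for an n x n by n x n multiply (2 * n**3 operations)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return 2 * n**3 / elapsed_seconds / 1e9


N = 4096
ELAPSED = 0.289  # seconds, from the system.time output above
print(f"{matmul_gflops(N, ELAPSED):.0f} GFLOP/s")
```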

[Chart: Matrix-Matrix Multiplication in R. fp64 GFlops/s (0 to 3000) versus matrix dimension (0 to 35000) for nvBLAS with 4x K20X GPUs and MKL with a 6-core Xeon E5-2667 CPU]

Remote Development with Nsight Eclipse Edition

Local IDE, remote application

Edit locally, build & run remotely

Automatic sync via ssh

Cross-compilation to ARM

Full debugging & profiling via remote connection

[Diagram: Edit locally, sync, then Build, Run, Debug, and Profile remotely]

Goals for the CUDA Platform

• Simplicity: Learn, adopt, & use parallelism with ease

• Productivity: Quickly achieve feature & performance goals

• Portability: Write code that can execute on all targets

• Performance: High absolute performance and scalability

Simpler Heterogeneous Applications

We want: homogeneous programs, heterogeneous execution

– Unified programming model includes parallelism in language

– Abstract heterogeneous execution via Runtime or Virtual Machine

[Diagram: Current: Hybrid Program with separate parallel (GPU) and serial (CPU) parts. Ideal: Single Program (parallel + serial) with a Homogeneous Programming Model, executing on GPU and CPU]

Parallelism in Mainstream Languages

• Enable more programmers to write parallel software

• Give programmers the choice of language to use

• GPU support in key languages


C++ Parallel Algorithms Library Progress

• Complete set of parallel primitives:

for_each, sort, reduce, scan, etc.

• ISO C++ committee voted unanimously to accept as official tech. specification working draft

N3960 Technical Specification Working Draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf

Prototype: https://github.com/n3554/n3554

std::vector<int> vec = ...

// previous standard sequential loop
std::for_each(vec.begin(), vec.end(), f);

// explicitly sequential loop
std::for_each(std::seq, vec.begin(), vec.end(), f);

// permitting parallel execution
std::for_each(std::par, vec.begin(), vec.end(), f);
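For readers more familiar with Python, the same idea of selecting sequential or parallel execution through an explicit policy argument can be sketched with the standard library. This is only an analogy: the Policy names and the for_each helper are hypothetical and are not part of N3960 or of any C++ library.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum


class Policy(Enum):
    SEQ = "seq"   # run iterations one after another
    PAR = "par"   # permit iterations to run concurrently


def for_each(policy, items, func):
    """Apply func to every item; the policy decides how iterations may run."""
    if policy is Policy.SEQ:
        for item in items:
            func(item)
    elif policy is Policy.PAR:
        with ThreadPoolExecutor() as pool:
            # list() drains the iterator so exceptions from func propagate.
            list(pool.map(func, items))
    else:
        raise ValueError(f"unsupported policy: {policy}")


squares = {}


def record_square(x):
    squares[x] = x * x   # distinct keys, so concurrent writes do not collide


for_each(Policy.SEQ, range(4), record_square)
for_each(Policy.PAR, range(4, 8), record_square)
print(sorted(squares.items()))
```

As with the parallel policy in the C++ draft, the caller remains responsible for passing a function that is safe to run concurrently.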


Numba Python Compiler

• Free and open source compiler for array-oriented Python

• NEW numba.cuda module integrates CUDA directly into Python

• http://numba.pydata.org/

from numba import cuda

@cuda.jit("void(float32[:], float32, float32[:], float32[:])")
def saxpy(out, a, x, y):
    i = cuda.grid(1)
    out[i] = a * x[i] + y[i]

# Launch saxpy kernel
saxpy[griddim, blockdim](out, a, x, y)
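The launch line leaves griddim and blockdim undefined. A common choice, shown here as an assumption rather than something taken from the slides, is a fixed block size with enough blocks to cover the array. The sketch below computes that configuration and emulates the one-dimensional thread indexing in plain Python (no GPU or Numba required) to show why a bounds check is needed when the array length is not a multiple of the block size.

```python
def launch_config(n, blockdim=256):
    """Return (griddim, blockdim) with enough threads to cover n elements."""
    if n < 0 or blockdim <= 0:
        raise ValueError("n must be non-negative and blockdim positive")
    griddim = (n + blockdim - 1) // blockdim   # ceiling division
    return griddim, blockdim


def saxpy_emulated(out, a, x, y, griddim, blockdim):
    """Emulate the kernel: each (block, thread) pair handles one index."""
    for block in range(griddim):
        for thread in range(blockdim):
            i = block * blockdim + thread      # what cuda.grid(1) computes
            if i < len(out):                   # guard against surplus threads
                out[i] = a * x[i] + y[i]


n = 1000
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
griddim, blockdim = launch_config(n)
saxpy_emulated(out, 2.0, x, y, griddim, blockdim)
print(griddim, blockdim, griddim * blockdim - n, out[:3])
```

With 1000 elements and 256 threads per block, the launch covers 1024 indices, so 24 threads fall past the end of the array and must be masked out by the guard.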

GPU-Accelerated Hadoop

Extract insights from customer data

Data Analytics using clustering algorithms

Developed using CUDA-accelerated IBM Java

Compile Java for GPUs

• Approach: apply a closure to a set of arrays

• foreach iterations parallelized over GPU threads

– Threads run closure execute() method

// vector addition
float[] X = {1.0, 2.0, 3.0, 4.0, … };
float[] Y = {9.0, 8.1, 7.2, 6.3, … };
float[] Z = {0.0, 0.0, 0.0, 0.0, … };

jog.foreach(X, Y, Z, new jogContext(),
  new jogClosureRet<jogContext>() {
    public float execute(float x, float y) {
      return x + y;
    }
  }
);

[Chart: Java Black-Scholes Options Pricing Speedup. Speedup vs. sequential Java (0 to 14) plotted against millions of options (1 to 20)]
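For context, the computation behind this benchmark is the Black-Scholes closed-form price of a European option, evaluated independently for each option, which is why it parallelizes well. The sketch below prices a single call option in plain Python; the parameter values are illustrative assumptions and are unrelated to the benchmark inputs.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, volatility, years):
    """Price of a European call option under the Black-Scholes model."""
    if spot <= 0 or strike <= 0 or volatility <= 0 or years <= 0:
        raise ValueError("spot, strike, volatility and years must be positive")
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility**2) * years) / (
        volatility * sqrt_t
    )
    d2 = d1 - volatility * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


# Illustrative inputs (assumed): at-the-money, 5% rate, 20% volatility, 1 year.
price = black_scholes_call(
    spot=100.0, strike=100.0, rate=0.05, volatility=0.2, years=1.0
)
print(f"{price:.4f}")
```

Each option depends only on its own five inputs, so millions of options map naturally onto one GPU thread per option.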

The Massively Parallel Programming Blog

Technical posts on GPUs, CUDA, OpenACC, Libraries, C/C++/Python and more

In-depth articles and regular series:

CUDACasts: instructive videos

CUDA Pro Tips: useful techniques

CUDA Spotlight Interviews

Join the conversation by subscribing to email or RSS updates today!

http://devblogs.nvidia.com/parallelforall

NVIDIA, the NVIDIA logo, GeForce, Quadro, Tegra, Tesla, GeForce Experience, GRID, GTX, Kepler, ShadowPlay, GameStream, SHIELD, and The Way It’s Meant To Be Played are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

© 2014 NVIDIA Corporation. All rights reserved.

Axel Koehler
