HETEROGENEOUS HPC,
ARCHITECTURE OPTIMIZATION,
AND NVLINK
Steve Oberlin
CTO, Accelerated Computing
US to Build Two Flagship Supercomputers
Major Step Forward on the Path to Exascale
Partnership for Science
100-300 PFLOPS Peak Performance
10x in Scientific Applications
2017
SUMMIT SIERRA
Just 4 nodes in Summit would make the Top500 list of supercomputers today
Similar Power as Titan
5-10x Faster
1/5th the Size
150 PF = 3M Laptops
One Laptop for Every Resident in the State of Mississippi
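The laptop comparison is a single division. A minimal check in Python, assuming 1 PF = 10^15 FLOPS and taking the slide's round figures at face value; the function name is illustrative and not from the talk:

```python
def flops_per_laptop(system_pflops: float, laptops: float) -> float:
    """Per-laptop FLOPS implied by splitting a system's peak evenly across laptops."""
    if laptops <= 0:
        raise ValueError("laptops must be positive")
    return system_pflops * 1e15 / laptops


# Figures from the slide: 150 PF spread over 3 million laptops.
per_laptop = flops_per_laptop(150, 3_000_000)
print(f"{per_laptop / 1e9:.0f} GFLOPS per laptop")
```

Under those assumptions the slide implies roughly 50 GFLOPS per laptop.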
Optimizing Serial/Parallel Execution
Application Code
GPU: Parallel Work (Majority of Ops)
+
CPU: Serial Work (System and Sequential Ops)
IBM POWER CPU: Most Powerful Serial Processor
NVIDIA NVLink: Fastest CPU-GPU Interconnect, 80-200 GB/s
NVIDIA Volta GPU: Most Powerful Parallel Processor
NVLink-Enabled Heterogeneous Node: 5x Higher Energy Efficiency
NVLink: Logical Node Integration
5x PCIe bandwidth
Move data at CPU memory speed
3x lower energy/bit
[Diagram: a latency-optimized Power or ARM CPU with DDR Memory (DDR4, 50-75 GB/s) connected over NVLink (80 GB/s) to a throughput-optimized TESLA GPU with Stacked Memory (HBM, 1 Terabyte/s)]
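The bandwidth figures on this slide translate directly into transfer time. The sketch below assumes the PCIe baseline is 80 / 5 = 16 GB/s (implied by "5x PCIe bandwidth"), uses a hypothetical 12 GB payload, and ignores latency and protocol overhead, so it shows an idealized lower bound rather than a measurement:

```python
NVLINK_GB_PER_S = 80.0                   # from the slide
PCIE_GB_PER_S = NVLINK_GB_PER_S / 5.0    # assumption: implied by "5x PCIe bandwidth"


def transfer_seconds(gigabytes: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move a payload at a sustained bandwidth (no latency, no overhead)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return gigabytes / bandwidth_gb_per_s


payload_gb = 12.0  # hypothetical working set, chosen only for illustration
pcie_time = transfer_seconds(payload_gb, PCIE_GB_PER_S)
nvlink_time = transfer_seconds(payload_gb, NVLINK_GB_PER_S)
print(f"PCIe: {pcie_time:.2f} s, NVLink: {nvlink_time:.2f} s")
```

The same payload that would take about 0.75 s at the assumed PCIe rate takes about 0.15 s at 80 GB/s, which is close to the 50-75 GB/s the slide quotes for the CPU's own DDR4 memory.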
NVLink High-speed GPU Interconnect
[Diagram labels, 2014: KEPLER GPU, PCIe, X86 / ARM64 / POWER CPU]
[Diagram labels, 2016: PASCAL GPU, NVLink, NVLink, POWER CPU, PCIe, X86 / ARM64 / POWER CPU]
NVLink Unleashes Multi-GPU Performance
Over 2x Application Performance Speedup When Next-Gen GPUs Connect via NVLink Versus PCIe
GPUs Interconnected with NVLink: 5x Faster than PCIe Gen3 x16
[Diagram: two TESLA GPUs and a CPU attached through a PCIe Switch]
[Bar chart: Speedup vs PCIe based Server, axis 1.00x to 2.25x, for ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER, 3D FFT]
3D FFT, ANSYS: 2 GPU configuration. All other apps comparing 4 GPU configuration. AMBER Cellulose (256x128x128), FFT problem size (256^3)
Two Computing Models For Accelerators
Heterogeneous Computing Model: Complementary Processors Work Together
CPU: Optimized for Serial Tasks
GPU Accelerator: Optimized for Parallel Tasks
Many-Weak-Cores (MWC) Model: Single CPU Core for Both Serial & Parallel Work
Xeon Phi (And Others): Many Weak Serial Cores
Amdahl’s Law Analysis
[Ten bar charts, one slide per parallel fraction. Each plots Minutes Run Time, split into Serial and Parallel, for bars labeled Work 1x CPU, 2 x MWC (.25x CPU), and 1 GPU+1 CPU]
98% Parallel Work (axis 0-10 minutes)
90% Parallel Work (axis 0-10 minutes)
80% Parallel Work (axis 0-10 minutes)
70% Parallel Work (axis 0-14 minutes)
60% Parallel Work (axis 0-18 minutes)
50% Parallel Work (axis 0-25 minutes)
40% Parallel Work (axis 0-30 minutes)
30% Parallel Work (axis 0-30 minutes)
20% Parallel Work (axis 0-35 minutes)
10% Parallel Work (axis 0-40 minutes)
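The bar values from these charts did not survive extraction, but the shape of the argument can be sketched with Amdahl's law. The model below is an assumption, not the speaker's data: a 10-minute baseline on one CPU (consistent with the 0-10 axis at high parallel fractions), a serial speed of 0.25x CPU for the MWC case (from the slide label), and a hypothetical 10x parallel throughput for both accelerators.

```python
def run_time_minutes(baseline_minutes, parallel_fraction, serial_speed, parallel_speed):
    """Return (serial, parallel) run time in minutes under Amdahl's law.

    Speeds are relative to the baseline CPU (1.0 means the same speed).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if serial_speed <= 0 or parallel_speed <= 0:
        raise ValueError("speeds must be positive")
    serial = baseline_minutes * (1.0 - parallel_fraction) / serial_speed
    parallel = baseline_minutes * parallel_fraction / parallel_speed
    return serial, parallel


BASELINE_MINUTES = 10.0  # assumption: total run time on 1x CPU

# (serial_speed, parallel_speed) relative to 1x CPU.
CONFIGS = {
    "1x CPU": (1.0, 1.0),
    "2 x MWC (.25x CPU)": (0.25, 10.0),  # 0.25 from the slide; 10x is hypothetical
    "1 GPU+1 CPU": (1.0, 10.0),          # CPU runs serial work; 10x is hypothetical
}

for percent in (98, 90, 80, 70, 60, 50, 40, 30, 20, 10):
    for name, (serial_speed, parallel_speed) in CONFIGS.items():
        serial, parallel = run_time_minutes(
            BASELINE_MINUTES, percent / 100.0, serial_speed, parallel_speed
        )
        print(f"{percent:>2}% parallel | {name:<20} | "
              f"serial {serial:5.1f} + parallel {parallel:4.1f} = {serial + parallel:5.1f} min")
```

With these assumptions the weak-core configuration spends four times as long as the CPU on the serial portion, so its total run time grows quickly as the parallel fraction drops. At 10% parallel work its serial part alone is 36 minutes, which matches the 0-40 axis range on the final chart.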
TESLA K80
WORLD’S FASTEST ACCELERATOR FOR DATA ANALYTICS AND SCIENTIFIC COMPUTING
Dual-GPU Accelerator for Max Throughput: 2x Faster, 2.9 TF | 4992 Cores | 480 GB/s
Double the Memory, Designed for Big Data Apps: 24GB (K40: 12GB)
Maximum Performance: Dynamically Maximize Performance for Every Application
Oil & Gas, Data Analytics, HPC Viz
[Bar chart: Deep Learning: Caffe, speedup axis 0x to 25x, for CPU, Tesla K40, Tesla K80]
Caffe Benchmark: AlexNet training throughput based on 20 iterations, CPU: E5-2697v2 @ 2.70GHz. 64GB System Memory, CentOS 6.2
Performance Lead Continues to Grow
[Chart: Peak Double Precision FLOPS, GFLOPS (axis 0 to 3500), 2008 to 2014, NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
[Chart: Peak Memory Bandwidth, GB/s (axis 0 to 600), 2008 to 2014, NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
10x Faster than CPU on Applications
[Bar chart: K80 vs CPU, speedup axis 0x to 15x, grouped as Quantum Chemistry, Molecular Dynamics, Physics, Benchmarks]
CPU: 12 cores, E5-2697v2 @ 2.70GHz. 64GB System Memory, CentOS 6.2. GPU: Single Tesla K80, Boost enabled
Tesla Platform Enables Optimization
Scalable Nodes, ISA Choice
x86
NVLink
Tesla Platform Enables Optimization
Ecosystem: Industry Standard CPUs and Interconnects
Industry-Driven Solutions
CPUs: ARM64, POWER, x86
NVIDIA GPU
Interconnects: InfiniBand, Cray, Ethernet, Others
CORAL Scalable Heterogeneous Node
Approximately 3,400 nodes, each with:
IBM POWER9 CPUs and multiple NVIDIA Tesla® Volta GPUs
CPUs and GPUs integrated on-node with high speed NVLink
Large coherent memory: over 512 GB (HBM + DDR4)
All directly addressable from the CPUs and GPUs
An additional 800 GB of NVRAM, usable as a burst buffer or as extended memory
Over 40 TF peak performance/node(!)
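The node figures above can be cross-checked against the system-level numbers quoted earlier in the deck. This is a rough sketch that treats "approximately 3,400" and "over 40 TF" as exact values, so the results are lower-bound estimates rather than official specifications:

```python
NODES = 3400               # "approximately 3,400 nodes" from the slide
PEAK_TF_PER_NODE = 40.0    # "over 40 TF peak performance/node", taken as a floor

# Whole-system peak in petaflops (1 PF = 1000 TF).
system_peak_pf = NODES * PEAK_TF_PER_NODE / 1000.0

# Peak of a four-node slice, the size the earlier slide compares to the Top500 list.
four_node_peak_tf = 4 * PEAK_TF_PER_NODE

print(f"System peak: at least {system_peak_pf:.0f} PF")
print(f"Four nodes: at least {four_node_peak_tf:.0f} TF")
```

The product comes to about 136 PF, which sits inside the 100-300 PFLOPS range and close to the 150 PF figure given for Summit earlier in the talk.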
NVLink In Practice
Optimized Heterogeneous Node
CORAL Application Performance Projections
Tesla Accelerated Computing
YOUR PLATFORM FOR DISCOVERY