GPU Computing Update (pccluster.org sympo/nvidia-print.pdf)
TRANSCRIPT
GPU Computing Update — NVIDIA
Yukihiko Hirano, Enterprise Solution Products Division
Tesla Accelerated Computing Platform

Development:
- Programming Languages
- Libraries: cuBLAS, …
- Development Tools / Profile and Debug: CUDA Debugging API, …
- Compiler Solutions: LLVM, …

Data Center Infrastructure:
- GPU Accelerators / Interconnect: GPU Boost, GPU Direct, NVLink, …
- System Management: NVML, …
- Infrastructure Management / Communication / System Solutions / Software Solutions
Tesla: Platform with Open Ecosystem
Common Programming Models Across Multiple CPUs (x86, …)
- Libraries: AmgX, cuBLAS, …
- Programming Languages
- Compiler Directives
TESLA K80
The world's fastest accelerator for big data analytics and scientific computing
- Dual-GPU accelerator for max throughput: 2x faster — 2.9 TF, 4992 cores, 480 GB/s
- Double the memory, designed for big data apps: 24 GB (vs. 12 GB on K40) — Oil & Gas, Data Analytics, HPC Viz
- Maximum performance: GPU Boost dynamically maximizes performance for every application
[Chart: Deep Learning (Caffe) throughput, 0x-25x — CPU vs. Tesla K40 vs. Tesla K80]
Caffe benchmark: AlexNet training throughput based on 20 iterations. CPU: E5-2697 v2 @ 2.70 GHz, 64 GB system memory, CentOS 6.2.
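The headline K80 numbers above can be reproduced with back-of-envelope arithmetic. This is a sketch, not vendor data: the core count and bandwidth are from the slide, while the 875 MHz boost clock, the 1:3 double-precision ratio, and the 384-bit / 5 Gbps GDDR5 memory configuration are assumed GK210 specifications.

```python
# Back-of-envelope check of the Tesla K80 headline numbers.
# Core count and 480 GB/s are from the slide; clock, DP ratio,
# and memory bus figures are assumed GK210 specs.

sp_cores = 4992          # total CUDA cores across both dies (slide)
dp_ratio = 1 / 3         # DP units per SP core on GK210 (assumed)
boost_clock_hz = 875e6   # max GPU Boost clock (assumed)
flops_per_fma = 2        # one fused multiply-add = 2 FLOPs

peak_dp_flops = sp_cores * dp_ratio * flops_per_fma * boost_clock_hz
print(f"peak DP: {peak_dp_flops / 1e12:.2f} TFLOPS")   # ~2.91 TFLOPS

bus_bytes = 384 // 8     # 384-bit GDDR5 bus per die (assumed)
data_rate_hz = 5e9       # 5 Gbps effective GDDR5 rate (assumed)
n_dies = 2               # dual-GPU board (slide)

peak_bw = bus_bytes * data_rate_hz * n_dies
print(f"peak bandwidth: {peak_bw / 1e9:.0f} GB/s")     # 480 GB/s
```

Under these assumptions the arithmetic lands on the slide's 2.9 TF and 480 GB/s figures, which is why GPU Boost matters: the advertised peak is only reached at the top boost clock.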
Performance Lead Continues to Grow
[Chart: peak double-precision GFLOPS, 2008-2014 — NVIDIA GPUs (M1060, M2090, K20, K40, K80) vs. x86 CPUs (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
[Chart: peak memory bandwidth in GB/s, 2008-2014 — same NVIDIA GPUs vs. x86 CPUs]
10x Faster than CPU on Applications
[Chart: speedup 0x-15x, Tesla K80 vs. CPU — quantum chemistry, molecular dynamics, and physics benchmarks]
CPU: 12 cores, E5-2697 v2 @ 2.70 GHz, 64 GB system memory, CentOS 6.2. GPU: single Tesla K80, Boost enabled.
Tesla K80: Up to 2x Faster than K10 and K40
[Chart: K80 vs. K10, 1.0x-2.0x — seismic (RTM), machine learning]
[Chart: K80 vs. K40, 1.0x-2.0x — CFD, molecular dynamics, quantum chemistry, finance, physics]
Single K10 or K80; Boost enabled for K80. Actual boost level dependent on application profile. RTM ISO R4 2x, RTM TTI R8 3 pass.
High GPU Density Servers Now Mainstream
- Dell C4130: 4 K80s per node
- Cray CS-Storm: 8 K80s per node
- HP SL270: 8 K80s per node
- Quanta S2BV: 4 K80s per node
CSCS Piz Daint: The World's Largest In-situ Viz Enabled GPU Supercomputer
2,048 GPU nodes running galaxy formation and molecular dynamics simulation + visualization.
Watch live at the NVIDIA SC14 booth.
Visualize Data Instantly for Faster Science
TraditionalSlower Time to Discovery
CPU Supercomputer Viz Cluster
Simulation- 1 Week Viz- 1 Day
Multiple Iterations
Time to Discovery =
Months
Tesla PlatformFaster Time to Discovery
GPU-Accelerated Supercomputer
Visualize while
you simulate
Restart Simulation Instantly
Multiple Iterations
Time to Discovery = Weeks
11
US to Build Two Flagship Supercomputers
- IBM POWER9 CPU + NVIDIA Volta GPU
- NVLink high-speed interconnect
- >40 TFLOPS per node, >3,400 nodes
- Delivery in 2017
- SUMMIT: 150-300 PFLOPS peak performance
- SIERRA: >100 PFLOPS peak performance
A major step forward on the path to exascale.
Just 4 nodes of Summit would make today's Top500 list of supercomputers.
- 5-10x faster than Titan at similar power, 1/5th the size
- 150 PF = 3 million laptops — one laptop for every resident of the state of Mississippi
- IBM POWER CPU: most powerful serial processor
- NVIDIA Volta GPU: most powerful parallel processor
- NVIDIA NVLink: fastest CPU-GPU interconnect
- Accelerated computing: 5x higher energy efficiency
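The "150 PF = 3 million laptops" equivalence works out if one assumes roughly 50 GFLOPS of peak compute per 2014-era laptop; that per-laptop figure is an assumption, not something stated on the slide.

```python
# Sanity check of the "150 PF = 3 million laptops" claim.
# The per-laptop peak is an assumed figure (~50 GFLOPS).

summit_flops = 150e15   # 150 PFLOPS peak (slide)
laptop_flops = 50e9     # ~50 GFLOPS per laptop (assumed)

laptops = summit_flops / laptop_flops
print(f"{laptops / 1e6:.0f} million laptops")   # 3 million
```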
GPU Roadmap
[Chart: SGEMM performance per watt, normalized, 2008-2016]
- Tesla: CUDA
- Fermi: double-precision floating point
- Kepler: dynamic parallelism
- Maxwell: DirectX 12
- Pascal: unified memory, 3D memory, NVLink
PASCAL
- NVLink: 5x to 12x the bandwidth of PCIe 3.0
- 3D memory: 2x to 4x the memory bandwidth and capacity
- Module: one third the size of a PCIe card
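The "5x to 12x PCIe 3.0" figure above can be sanity-checked against published link rates. This is a sketch under assumptions: PCIe 3.0 x16 runs 16 lanes at 8 GT/s with 128b/130b encoding, and a first-generation NVLink configuration is assumed to provide 4 links at 20 GB/s each per direction (the higher end of the 5x-12x range would correspond to wider NVLink configurations).

```python
# Rough check of the "5x to 12x PCIe 3.0" claim on the Pascal slide.
# Link counts and per-link rates are assumptions, not from the slide.

pcie_lanes = 16
pcie_gt_per_s = 8e9            # 8 GT/s per lane (PCIe 3.0)
pcie_encoding = 128 / 130      # 128b/130b line-coding overhead
pcie_bw = pcie_lanes * pcie_gt_per_s * pcie_encoding / 8  # bytes/s, one direction

nvlink_links = 4               # assumed link count
nvlink_link_bw = 20e9          # bytes/s per link, one direction (assumed)
nvlink_bw = nvlink_links * nvlink_link_bw

print(f"PCIe 3.0 x16: {pcie_bw / 1e9:.2f} GB/s")   # ~15.75 GB/s
print(f"NVLink x4:    {nvlink_bw / 1e9:.0f} GB/s") # 80 GB/s
print(f"ratio: {nvlink_bw / pcie_bw:.1f}x")        # ~5.1x
```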
NVLink: High-speed GPU Interconnect
- 2014: Kepler GPUs connect to x86, ARM64, and POWER CPUs over PCIe
- 2016: Pascal GPUs interconnect with each other, and with POWER CPUs, over NVLink; x86 and ARM64 CPUs remain on PCIe
NVLink Unleashes Multi-GPU Performance
- GPUs interconnected with NVLink: 5x faster than PCIe Gen3 x16 (vs. Tesla GPUs attached to the CPU through a PCIe switch)
- Over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe
[Chart: speedup vs. PCIe-based server, 1.00x-2.25x — ANSYS Fluent, multi-GPU sort, LQCD QUDA, AMBER, 3D FFT]
3D FFT and ANSYS: 2-GPU configuration; all other apps: 4-GPU configuration. AMBER Cellulose (256x128x128), FFT problem size (256^3).
GPU Technology Theater
Ian Buck with Oak Ridge and Livermore
www.nvidia.com/sc14
US DoE Supercomputers Built on Tesla Platform
Tesla Accelerated Computing
YOUR PLATFORM FOR DISCOVERY