GPU Computing Update (GPUコンピューティング最新情報) - pccluster.org sympo/nvidia-print.pdf


GPU Computing Update (GPUコンピューティング最新情報)
NVIDIA (エヌビディア合同会社), Enterprise Solution Products Division
平野 幸彦

Tesla Accelerated Computing Platform
[Diagram: System Solutions / Software Solutions]
Development: Libraries (cuBLAS), Programming Languages, Compiler Solutions (LLVM), Profile and Debug (CUDA Debugging API), Development Tools
Data Center Infrastructure: GPU Accelerators (GPU Boost), Interconnect (GPU Direct, NVLink), System Management (NVML), Infrastructure Management, Communication
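The System Management layer named above (NVML) is the C library that tools such as nvidia-smi are built on. As a minimal sketch, assuming a working driver install and linking against the NVML library, querying each GPU's name, SM clock, and temperature might look like this (error handling trimmed):

```cuda
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t rc = nvmlInit();                    /* load and initialize NVML */
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);                      /* number of NVML-visible GPUs */

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        unsigned int smClock = 0, temp = 0;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClock);       /* current SM clock, MHz */
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp); /* core temperature, C */

        printf("GPU %u: %s, SM clock %u MHz, %u C\n", i, name, smClock, temp);
    }

    nvmlShutdown();
    return 0;
}
```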

Tesla: Platform with Open Ecosystem
Common Programming Models Across Multiple CPUs (x86)
Libraries (cuBLAS, AmgX), Programming Languages, Compiler Directives
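Of the three on-ramps listed above, GPU-accelerated libraries are the most drop-in: the host code stays ordinary C and only the math call changes. A minimal sketch of a single-precision matrix multiply through cuBLAS; the matrix size, fill values, and column-major layout here are illustrative assumptions, not anything from the slides:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 512;                               /* illustrative square-matrix size */
    const float alpha = 1.0f, beta = 0.0f;
    size_t bytes = (size_t)n * n * sizeof(float);

    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    /* C = alpha * A * B + beta * C, column-major, no transposes */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %f (expect %d)\n", hC[0], 2 * n);  /* 1 * 2 summed over n terms */
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```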

TESLA K80
The world's fastest accelerator for big data analytics and scientific computing
Dual-GPU accelerator for max throughput
2x faster: 2.9 TF | 4992 cores | 480 GB/s
Double the memory, designed for big data apps: 24 GB (Tesla K40: 12 GB) - Oil & Gas, Data Analytics, HPC Viz
Maximum performance: dynamically maximize performance for every application
[Chart: Deep Learning (Caffe) throughput, 0x-25x, CPU vs Tesla K40 vs Tesla K80]
Caffe benchmark: AlexNet training throughput based on 20 iterations; CPU: E5-2697v2 @ 2.70GHz, 64GB system memory, CentOS 6.2
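Because the K80 is a dual-GPU board, each card appears to CUDA as two separate devices, and an application distributes work across them explicitly. A minimal sketch of enumerating the devices with the CUDA runtime API (the pairing of device indices per board is the usual layout, not something the API guarantees):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);               /* each K80 board contributes two devices */

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, %d SMs, %.1f GB, ECC %s\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.ECCEnabled ? "on" : "off");

        cudaSetDevice(d);                      /* subsequent allocations/launches target this GPU */
        /* ... allocate and launch this device's share of the work here ... */
    }
    return 0;
}
```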

Performance Lead Continues to Grow
[Chart: Peak double precision FLOPS, 2008-2014, 0-3500 GFLOPS - NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]
[Chart: Peak memory bandwidth, 2008-2014, 0-600 GB/s - NVIDIA GPU (M1060, M2090, K20, K40, K80) vs x86 CPU (Westmere, Sandy Bridge, Ivy Bridge, Haswell)]

10x Faster than CPU on Applications
[Chart: Speedup vs CPU, 0x-15x, K80 vs CPU across Quantum Chemistry, Molecular Dynamics, and Physics benchmarks]
CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB system memory, CentOS 6.2; GPU: single Tesla K80, Boost enabled

Tesla K80: Up to 2x Faster than K10 and K40
[Chart: K80 vs K10 speedup, 1.0x-2.0x - Seismic, Machine Learning]
[Chart: K80 vs K40 speedup, 1.0x-2.0x - CFD, Molecular Dynamics, Quantum Chemistry, Finance, Physics]
Single K10 or K80, Boost enabled for K80; actual boost level dependent on application profile. RTM ISO R4 2x, RTM TTI R8 3 pass

High GPU Density Servers Now Mainstream
Dell C4130: 4 K80s per node
Cray CS-Storm: 8 K80s per node
HP SL270: 8 K80s per node
Quanta S2BV: 4 K80s per node

CSCS Piz Daint: The World's Largest In-situ Viz Enabled GPU Supercomputer
2048 GPU nodes running galaxy formation and molecular dynamics simulation + visualization
Watch live at the NVIDIA SC14 booth

Visualize Data Instantly for Faster Science
Traditional (slower time to discovery): CPU supercomputer plus separate viz cluster; simulation ~1 week, viz ~1 day, multiple iterations; time to discovery = months
Tesla Platform (faster time to discovery): GPU-accelerated supercomputer; visualize while you simulate, restart the simulation instantly, multiple iterations; time to discovery = weeks

US to Build Two Flagship Supercomputers: SUMMIT and SIERRA (2017)
IBM POWER9 CPU + NVIDIA Volta GPU, NVLink high-speed interconnect
>40 TFLOPS per node, >3,400 nodes
SUMMIT: 150-300 PFLOPS peak performance; SIERRA: >100 PFLOPS peak performance
Major step forward on the path to exascale

Just 4 nodes in Summit would make the Top500 list of supercomputers today
5-10x faster than Titan at similar power, 1/5th the size
150 PF = 3M laptops: one laptop for every resident of the State of Mississippi
IBM POWER CPU: most powerful serial processor
NVIDIA NVLink: fastest CPU-GPU interconnect
NVIDIA Volta GPU: most powerful parallel processor
Accelerated computing: 5x higher energy efficiency
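As a rough sanity check on the laptop comparison, assuming each laptop is counted at its CPU's floating-point peak: 150 PF / 3,000,000 machines is about 50 GFLOPS per machine, which is on the order of a mainstream notebook processor's peak of that era.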

GPU Roadmap (GPU ロードマップ)
[Chart: SGEMM / W, normalized (0-20), 2008-2016]
Tesla: CUDA
Fermi: double-precision floating point
Kepler: dynamic parallelism
Maxwell: DirectX 12
Pascal: unified memory, 3D memory, NVLink
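Of the Pascal-generation features on the roadmap, unified memory is the one that most changes how code is written: a single pointer is valid on both host and device, so explicit cudaMemcpy staging disappears. A minimal sketch using the CUDA runtime's managed allocator (already usable on Kepler-class GPUs, with Pascal adding hardware page faulting; the kernel and sizes below are illustrative):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                        /* same pointer the host filled in */
}

int main(void) {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));    /* one allocation, visible to CPU and GPU */

    for (int i = 0; i < n; ++i) x[i] = 1.0f;     /* host writes directly, no cudaMemcpy */

    scale<<<(n + 255) / 256, 256>>>(x, 3.0f, n);
    cudaDeviceSynchronize();                     /* wait before the CPU touches the data again */

    printf("x[0] = %f (expect 3.0)\n", x[0]);
    cudaFree(x);
    return 0;
}
```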

PASCAL
NVLink: 5x to 12x the bandwidth of PCIe 3.0
3D memory: 2x to 4x the memory bandwidth and capacity
Module: one third the size of a PCIe card

NVLink: High-speed GPU Interconnect
[Diagram: 2014 - Kepler GPU connected over PCIe to x86, ARM64, or POWER CPU; 2016 - Pascal GPUs connected GPU-to-GPU and to POWER CPU over NVLink, to x86 or ARM64 CPU over PCIe]

NVLink Unleashes Multi-GPU Performance
[Diagram: Tesla GPUs interconnected with NVLink, 5x faster than PCIe Gen3 x16; CPU attached via PCIe switch]
[Chart: Speedup vs PCIe-based server, 1.00x-2.25x - ANSYS Fluent, Multi-GPU Sort, LQCD (QUDA), AMBER, 3D FFT]
Over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe
3D FFT and ANSYS: 2-GPU configuration; all other apps compare 4-GPU configurations. AMBER Cellulose (256x128x128), FFT problem size (256^3)
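Whether the GPUs talk over NVLink or PCIe, the programming model for direct GPU-to-GPU traffic is the same CUDA peer-to-peer API; the interconnect only changes the bandwidth you observe. A minimal sketch, assuming at least two peer-capable devices and with error handling trimmed:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);     /* can device 0 map device 1's memory? */
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    if (!canAccess01 || !canAccess10) {
        printf("P2P not available between devices 0 and 1\n");
        return 0;
    }

    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);   /* flags must be 0 */
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    size_t bytes = 256u << 20;                       /* 256 MB illustrative transfer */
    float *buf0, *buf1;
    cudaSetDevice(0); cudaMalloc(&buf0, bytes);
    cudaSetDevice(1); cudaMalloc(&buf1, bytes);

    /* direct device-to-device copy; rides NVLink when present, otherwise PCIe */
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    printf("Copied %zu MB from device 0 to device 1\n", bytes >> 20);
    cudaFree(buf1);
    cudaSetDevice(0); cudaFree(buf0);
    return 0;
}
```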

GPU Technology Theater

Ian Buck with Oak Ridge and Livermore

www.nvidia.com/sc14


Tesla Accelerated Computing Platform (platform overview repeated as a recap before the closing slides)

US DoE Supercomputers Built on Tesla Platform
Tesla Accelerated Computing: YOUR PLATFORM FOR DISCOVERY