High Performance Computing with Accelerators
Volodymyr Kindratenko, Innovative Systems Laboratory @ NCSA
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Page 1: High Performance Computing with Accelerators

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Volodymyr Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)

Page 2: Presentation Outline

• Recent trends in application accelerators

• FPGAs, Cell, GPUs, …

• Accelerator clusters

• Accelerator clusters at NCSA

• Cluster management

• Programmability Issues

• Production codes running on Accelerator Clusters

• Cosmology, molecular dynamics, electronic structure, quantum chromodynamics

Page 3: HPC on Special-purpose Processors

• Field-Programmable Gate Arrays (FPGAs)

– Digital signal processing, embedded computing

• Graphics Processing Units (GPUs)

– Desktop graphics accelerators

• Physics Processing Units (PPUs)

– Desktop games accelerators

• Sony/Toshiba/IBM Cell Broadband Engine

– Game console and digital content delivery systems

• ClearSpeed accelerator

– Floating-point accelerator board for compute-intensive applications

Page 4: GPU Performance Trends: FLOPS

[Chart: GFlop/s vs. time (9/22/02 to 3/14/08). NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) climb toward 1 TFlop/s, far outpacing the Intel Xeon quad-core 3 GHz CPU curve. Courtesy of NVIDIA.]

Page 5: GPU Performance Trends: Memory Bandwidth

[Chart: GByte/s vs. time (9/22/02 to 3/14/08) for the same GPUs, 5800 through GTX 285, rising from roughly 20 GB/s to above 150 GB/s. Courtesy of NVIDIA.]

Page 6: NVIDIA Tesla T10 GPU Architecture

• T10 architecture
• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
• 1 TFLOPS SP
• 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
• 102 GB/s bandwidth

[Block diagram: ten TPCs (TPC 1 through TPC 10), each with a geometry controller, SMC, texture units, and texture L1; each TPC holds three SMs, and each SM contains an I cache, MT issue, C cache, 8 SPs, 2 SFUs, and shared memory. A thread execution manager, input assembler, and PCIe interface feed the TPCs; L2/ROP units connect through the 512-bit memory interconnect to eight DRAM partitions.]
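The peak figures on this slide follow from simple arithmetic. The sketch below assumes the commonly cited dual-issue MAD+MUL (3 SP FLOPs per SP per cycle) and an 800 MHz double-data-rate GDDR3 clock; both are assumptions for illustration, not taken from the slide itself.

```python
# Back-of-the-envelope peak numbers for the T10.

def sp_peak_gflops(sps=240, clock_ghz=1.3, flops_per_cycle=3):
    """Single precision: 240 SPs x 1.3 GHz x 3 FLOPs/cycle ~ 936 GFLOPS."""
    return sps * clock_ghz * flops_per_cycle

def mem_bandwidth_gbs(bus_bits=512, mem_clock_ghz=0.8, ddr_factor=2):
    """512-bit bus x 800 MHz x 2 (DDR) = 102.4 GB/s."""
    return (bus_bits / 8) * mem_clock_ghz * ddr_factor

print(sp_peak_gflops())     # 936.0, i.e. "about 1 TFLOPS SP"
print(mem_bandwidth_gbs())  # 102.4, matching the 102 GB/s on the slide
```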

Page 7: NVIDIA Tesla S1070 GPU Computing Server

• 4 T10 GPUs

[Diagram: four Tesla GPUs, each with 4 GB GDDR3 SDRAM, paired through two NVIDIA switches onto two PCIe x16 interfaces; the chassis also integrates the power supply, thermal management, and system monitoring.]

Page 8: GPU Clusters at NCSA

• Lincoln

• Production system available through the standard NCSA/TeraGrid HPC allocation

• AC

• Experimental system available to anyone interested in exploring GPU computing

Page 9: Intel 64 Tesla Linux Cluster Lincoln

• Dell PowerEdge 1955 server
• Intel 64 (Harpertown) 2.33 GHz, dual-socket quad-core
• 16 GB DDR2
• InfiniBand SDR
• S1070 1U GPU Computing Server
• 1.3 GHz Tesla T10 processors
• 4x4 GB GDDR3 SDRAM
• Cluster
• Servers: 192
• CPU cores: 1536
• Accelerator units: 96
• GPUs: 384

[Diagram: two Dell PowerEdge 1955 servers share one Tesla S1070 (two T10s per PCIe interface, each with its own DRAM); each server connects to the S1070 over PCIe x8 and to the cluster interconnect over SDR IB.]

Page 10: AMD Opteron Tesla Linux Cluster AC

• HP xw9400 workstation
• AMD Opteron 2216 2.4 GHz, dual-socket dual-core
• 8 GB DDR2
• InfiniBand QDR
• S1070 1U GPU Computing Server
• 1.3 GHz Tesla T10 processors
• 4x4 GB GDDR3 SDRAM
• Cluster
• Servers: 32
• CPU cores: 128
• Accelerator units: 32
• GPUs: 128

[Diagram: each HP xw9400 workstation connects over two PCIe x16 links to a Tesla S1070 (two T10s per PCIe interface, each with its own DRAM), over QDR IB to the interconnect, and also hosts a Nallatech H101 FPGA card on PCI-X.]
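The totals quoted for Lincoln and AC can be sanity-checked with a few lines of arithmetic; this sketch simply derives them from the per-host configurations above (dual-socket quad-core for Lincoln, dual-socket dual-core for AC) and the fact that each S1070 unit holds 4 T10 GPUs.

```python
# Derive cluster totals from per-host configuration.

def cluster_totals(servers, cores_per_server, s1070_units):
    return {
        "cpu_cores": servers * cores_per_server,
        "gpus": s1070_units * 4,  # 4 T10 GPUs per S1070 unit
    }

lincoln = cluster_totals(servers=192, cores_per_server=8, s1070_units=96)
ac = cluster_totals(servers=32, cores_per_server=4, s1070_units=32)
print(lincoln)  # {'cpu_cores': 1536, 'gpus': 384}
print(ac)       # {'cpu_cores': 128, 'gpus': 128}
```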

Page 11: AC Cluster: 128 TFLOPS (SP)

Page 12: NCSA GPU Cluster Management Software

• CUDA SDK Wrapper

• Enables true GPU resource allocation by the scheduler
• E.g., qsub -l nodes=1:ppn=1 will allocate 1 CPU core and 1 GPU, fencing off the node's other GPUs
• Virtualizes GPU devices
• Sets thread affinity to the CPU core “nearest” to the GPU device

• GPU Memory Scrubber

• Cleans memory between runs

• GPU Memory Test utility

• Tests memory for manufacturing defects

• Tests memory for soft errors

• Tests GPUs for entering an erroneous state
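Conceptually, the wrapper's device virtualization can be sketched as below. This is an illustrative Python model only; the class and method names are made up, and the actual NCSA wrapper interposes on the CUDA runtime in C rather than working like this.

```python
# Toy model of GPU virtualization: the scheduler grants a job a subset
# of physical GPUs, and the wrapper remaps them to a dense virtual
# numbering starting at 0 while fencing off the rest.

class GPUVirtualizer:
    def __init__(self, granted_physical_ids):
        # virtual id i -> i-th granted physical device
        self._map = dict(enumerate(sorted(granted_physical_ids)))

    def set_device(self, virtual_id):
        """Translate the job's device request to a physical id."""
        if virtual_id not in self._map:
            raise PermissionError(f"GPU {virtual_id} not allocated to this job")
        return self._map[virtual_id]

# A job granted physical GPUs 2 and 3 sees them as devices 0 and 1:
v = GPUVirtualizer([2, 3])
print(v.set_device(0))  # 2
print(v.set_device(1))  # 3
```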

Page 13: GPU Node Pre/Post Allocation Sequence

• Pre-Job (minimized for rapid device acquisition)

• Assemble detected device file unless it exists

• Sanity check results

• Checkout requested GPU devices from that file

• Initialize CUDA wrapper shared memory segment with unique key for user (allows user to ssh to node outside of job environment and have same GPU devices visible)

• Post-Job

• Use quick memtest run to verify healthy GPU state

• If bad state detected, mark node offline if other jobs present on node

• If no other jobs, reload kernel module to “heal” node (for CUDA 2.2 bug)

• Run memscrubber utility to clear GPU device memory

• Notify of any failure events with job details

• Terminate wrapper shared memory segment

• Check-in GPUs back to global file of detected devices
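The checkout/check-in bookkeeping in the sequence above can be sketched as follows. This is an in-memory toy under stated assumptions: the real prologue/epilogue works against the global file of detected devices with locking, runs a memtest, and scrubs memory, none of which this sketch performs.

```python
# Toy model of the pre/post-job GPU checkout sequence.

detected = {0: "free", 1: "free", 2: "free", 3: "free"}

def pre_job(n_gpus):
    """Check out n_gpus free devices for an arriving job."""
    granted = [d for d, state in detected.items() if state == "free"][:n_gpus]
    if len(granted) < n_gpus:
        raise RuntimeError("not enough free GPUs on this node")
    for d in granted:
        detected[d] = "busy"
    return granted

def post_job(granted, memtest_ok):
    """Check devices back in if healthy; otherwise mark them offline."""
    for d in granted:
        detected[d] = "free" if memtest_ok else "offline"

job = pre_job(2)        # prologue: acquire 2 GPUs
post_job(job, memtest_ok=True)  # epilogue: verify and release
print(detected)         # all four devices back to "free"
```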

Page 14: GPU Programming

• 3rd Party Programming tools

• CUDA C 2.2 SDK

• OpenCL 1.0 SDK

• PGI x64+GPU Fortran & C99 Compilers

• IACAT’s on-going work

• CUDA-auto

• CUDA-lite

• GMAC

• CUDA-tune

• MCUDA/OpenMP

Page 15:

[Diagram: a stack of programming tools targeting NVIDIA GPUs and IA multi-core / Larrabee: the NVIDIA SDK and MCUDA/OpenMP at the base, with CUDA-lite, GMAC, CUDA-tune, and CUDA-auto layered above.]

• 1st-generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers
• Parameterized CUDA programming using auto-tuning and optimization-space pruning
• Locality annotation programming to eliminate the need for explicit management of memory types and data transfers
• Implicitly parallel programming with data structure and function property annotations to enable auto-parallelization

Courtesy of Wen-mei Hwu, UIUC

Page 16: GMAC – designed to reduce the barrier to accelerator use

• Unified CPU/GPU address space
• The same address is valid on both CPU and GPU

• Customizable implicit data transfers:

• Transfer everything (safe mode)

• Transfer dirty data before kernel execution

• Transfer data as being produced (default)

• Multi-process / Multi-thread support

• CUDA compatible

Courtesy of Wen-mei Hwu, UIUC
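GMAC's "transfer dirty data before kernel execution" mode can be modeled in a few lines. This is illustrative only: GMAC tracks dirtiness at page granularity through memory-protection hardware, and the class and method names below are made up for the sketch.

```python
# Toy model of dirty-data tracking for implicit host-to-device copies.

class SharedBuffer:
    def __init__(self, n_pages):
        self.host = [0] * n_pages
        self.device = [0] * n_pages
        self.dirty = set()

    def host_write(self, page, value):
        self.host[page] = value
        self.dirty.add(page)      # record what the CPU touched

    def launch_kernel(self):
        """Flush only dirty pages to the device before the kernel runs."""
        copied = sorted(self.dirty)
        for p in copied:
            self.device[p] = self.host[p]
        self.dirty.clear()
        return copied             # pages actually transferred

buf = SharedBuffer(8)
buf.host_write(1, 42)
buf.host_write(5, 7)
print(buf.launch_kernel())  # [1, 5]: untouched pages are not copied
```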

Page 17: GPU Cluster Applications

• TPACF
• Cosmology code used to study how matter is distributed in the Universe
• NAMD
• Molecular dynamics code used to run large-scale MD simulations
• DSCF
• NSF CyberChem project; DSCF quantum chemistry code for energy calculations
• WRF
• Weather modeling code
• MILC
• Quantum chromodynamics code; work just started
• …

Page 18: TPACF on AC

[Charts: TPACF speedups over the reference CPU implementation for Quadro FX 5600, GeForce GTX 280 (SP), Cell B./E. (SP), GeForce GTX 280 (DP), PowerXCell 8i (DP), and H101 (DP); measured values range from 6.8× to 154.1× (others: 88.2×, 59.1×, 45.5×, 30.9×). QP GPU cluster scaling: execution time (sec) vs. number of MPI threads falls from 567.5 s to 23.7 s (567.5, 289.0, 147.5, 79.9, 46.3, 29.3, 23.7).]
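Only relative speedups can be recovered from the QP scaling numbers above (the MPI thread count for each point is not legible in the chart text), but those ratios are easy to compute:

```python
# Relative speedup of each QP cluster run vs. the slowest run.
times = [567.5, 289.0, 147.5, 79.9, 46.3, 29.3, 23.7]  # seconds
speedups = [round(times[0] / t, 1) for t in times]
print(speedups)  # [1.0, 2.0, 3.8, 7.1, 12.3, 19.4, 23.9]
```

The near-doubling between successive points suggests close-to-linear scaling over the measured range.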

Page 19: NAMD on Lincoln (8 cores and 2 GPUs per node, very early results)

[Chart: STMV s/step vs. CPU cores (4 to 128) for CPU-only runs at 8, 4, and 2 ppn and GPU runs at 4:1, 2:1, and 1:1 core:GPU ratios; GPU points marked at 2, 4, 8, and 16 GPUs. Annotations: 2 GPUs = 24 cores; 8 GPUs = 96 CPU cores; ~5.6, ~2.8.]

Courtesy of James Phillips, UIUC

Page 20: DSCF on Lincoln

Bovine pancreatic trypsin inhibitor (BPTI): 3-21G basis, 875 atoms, 4893 basis functions

MPI timings and scalability

Courtesy of Ivan Ufimtsev, Stanford

Page 21: GPU Clusters: Open Issues / Future Work

• GPU cluster configuration

• CPU/GPU ratio

• Host/device memory ratio

• Cluster interconnect requirements

• Cluster management

• Resource allocation

• Monitoring

• CUDA/OpenCL SDK for HPC

• Programming models for accelerator clusters

• CUDA/OpenCL+pthreads/MPI/OpenMP/Charm++/…