TRANSCRIPT
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
High Performance
Computing with
Accelerators
Volodymyr Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing
Applications and Technologies (IACAT)
Presentation Outline
• Recent trends in application accelerators
• FPGAs, Cell, GPUs, …
• Accelerator clusters
• Accelerator clusters at NCSA
• Cluster management
• Programmability Issues
• Production codes running on Accelerator Clusters
• Cosmology, molecular dynamics, electronic structure,
quantum chromodynamics
HPC on Special-purpose Processors
• Field-Programmable Gate Arrays (FPGAs)
– Digital signal processing, embedded computing
• Graphics Processing Units (GPUs)
– Desktop graphics accelerators
• Physics Processing Units (PPUs)
– Desktop game physics accelerators
• Sony/Toshiba/IBM Cell Broadband Engine
– Game console and digital content delivery systems
• ClearSpeed accelerator
– Floating-point accelerator board for compute-
intensive applications
GPU Performance Trends: FLOPS
[Chart: peak GFLOP/s vs. release date (9/22/02 to 3/14/08) for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) and Intel CPUs (Intel Xeon quad-core 3 GHz); the GPU curve approaches 1000 GFLOP/s while the CPU curve stays far below.]
Courtesy of NVIDIA
GPU Performance Trends: Memory Bandwidth
[Chart: peak memory bandwidth (GByte/s) vs. release date (9/22/02 to 3/14/08) for NVIDIA GPUs from the 5800 through the GTX 285; bandwidth grows from tens of GByte/s to roughly 160 GByte/s.]
Courtesy of NVIDIA
NVIDIA Tesla T10 GPU Architecture
• T10 architecture
• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
• 1 TFLOPS SP
• 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
• 102 GB/s bandwidth
[Block diagram: the T10 comprises 10 thread processing clusters (TPC 1 through TPC 10); each TPC contains a geometry controller, an SM controller (SMC), three streaming multiprocessors (each with 8 SPs, 2 SFUs, shared memory, a constant cache, MT issue logic, and an instruction cache), texture units, and a texture L1 cache. A PCIe interface, input assembler, and thread execution manager feed the TPCs; L2 caches and ROPs connect over a 512-bit memory interconnect to DRAM.]
NVIDIA Tesla S1070 GPU Computing Server
• 4 T10 GPUs
[Block diagram: four Tesla T10 GPUs, each with 4 GB of GDDR3 SDRAM, connect in pairs through two NVIDIA switches to two PCIe x16 host interfaces; the 1U chassis also contains the power supply, thermal management, and system monitoring.]
GPU Clusters at NCSA
• Lincoln
• Production system available through the standard NCSA/TeraGrid HPC allocation process
• AC
• Experimental system available to anybody interested in exploring GPU computing
Intel 64 Tesla Linux Cluster Lincoln
• Dell PowerEdge 1955 server
• Intel 64 (Harpertown) 2.33 GHz
dual socket quad core
• 16 GB DDR2
• InfiniBand SDR
• S1070 1U GPU Computing
Server
• 1.3 GHz Tesla T10 processors
• 4x4 GB GDDR3 SDRAM
• Cluster
• Servers: 192
• CPU cores: 1536
• Accelerator Units: 96
• GPUs: 384
[Diagram: two Dell PowerEdge 1955 servers share one Tesla S1070; each server attaches to two of its T10 GPUs over a PCIe x8 link and to the cluster fabric over SDR InfiniBand.]
AMD Opteron Tesla Linux Cluster AC
• HP xw9400 workstation
• 2216 AMD Opteron 2.4 GHz
dual socket dual core
• 8 GB DDR2
• InfiniBand QDR
• S1070 1U GPU Computing
Server
• 1.3 GHz Tesla T10 processors
• 4x4 GB GDDR3 SDRAM
• Cluster
• Servers: 32
• CPU cores: 128
• Accelerator Units: 32
• GPUs: 128
[Diagram: each HP xw9400 host connects to all four T10 GPUs of a Tesla S1070 over two PCIe x16 links, to the cluster fabric over QDR InfiniBand, and to a Nallatech H101 FPGA card over PCI-X.]
AC Cluster: 128 TFLOPS (SP)
NCSA GPU Cluster Management Software
• CUDA SDK Wrapper
• Enables true GPU resource allocation by the scheduler
• E.g., qsub -l nodes=1:ppn=1 will allocate 1 CPU core and 1 GPU, fencing off the other GPUs in the node
• Virtualizes GPU devices (see the sketch after this list)
• Sets thread affinity to the CPU core “nearest” the GPU device
• GPU Memory Scrubber
• Cleans memory between runs
• GPU Memory Test utility
• Tests memory for manufacturing defects
• Tests memory for soft errors
• Tests GPUs for entering an erroneous state
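One plausible way to implement the wrapper's device virtualization and affinity handling is an LD_PRELOAD interposer around the CUDA runtime. The sketch below assumes the scheduler exports the assigned physical device IDs in an environment variable; the variable name ASSIGNED_GPUS and all other details are illustrative, not the actual NCSA implementation.

/* Hedged sketch of an LD_PRELOAD interposer that remaps a job's virtual
 * device index to the physical GPU assigned by the scheduler.
 * Build: gcc -shared -fPIC -o libwrap.so wrap.c -ldl
 * Use:   LD_PRELOAD=./libwrap.so ./app                                   */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

cudaError_t cudaSetDevice(int dev)
{
    static cudaError_t (*real_set_device)(int) = NULL;
    if (!real_set_device)
        real_set_device = (cudaError_t (*)(int))dlsym(RTLD_NEXT, "cudaSetDevice");

    int physical = dev;
    const char *assigned = getenv("ASSIGNED_GPUS");   /* hypothetical, e.g. "2,3" */
    if (assigned) {
        /* pick the dev-th entry from the comma-separated list */
        char *list = strdup(assigned);
        char *tok = strtok(list, ",");
        for (int i = 0; tok && i < dev; i++)
            tok = strtok(NULL, ",");
        if (tok)
            physical = atoi(tok);
        free(list);
    }
    /* A full wrapper would also pin the calling thread to a CPU core on the
     * same NUMA node as 'physical' (e.g. with sched_setaffinity).          */
    return real_set_device(physical);
}

The real wrapper intercepts additional runtime entry points (device counts, properties, and so on); this fragment only shows the remapping idea.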
GPU Node Pre/Post Allocation Sequence
• Pre-Job (minimized for rapid device acquisition)
• Assemble detected device file unless it exists
• Sanity check results
• Checkout requested GPU devices from that file
• Initialize CUDA wrapper shared memory segment with a unique key for the user (allows the user to ssh to the node outside of the job environment and see the same GPU devices)
• Post-Job
• Use quick memtest run to verify healthy GPU state
• If bad state detected, mark node offline if other jobs present on node
• If no other jobs, reload kernel module to “heal” node (for CUDA 2.2 bug)
• Run memscrubber utility to clear GPU device memory (see the sketch below)
• Notify of any failure events with job details
• Terminate wrapper shared memory segment
• Check GPUs back in to the global file of detected devices
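A minimal sketch of what a memscrubber-style pass could look like (illustrative only, not the actual NCSA utility): allocate as much device memory as the runtime will give out, overwrite it, and free it, so data from the previous job cannot leak into the next one.

// Illustrative memory-scrub pass (a sketch, not the actual NCSA memscrubber).
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; dev++) {
        cudaSetDevice(dev);
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);

        void *chunks[64];
        int n = 0;
        size_t chunk = free_b;
        while (n < 64 && chunk > (1 << 20)) {
            if (cudaMalloc(&chunks[n], chunk) == cudaSuccess) {
                cudaMemset(chunks[n], 0, chunk);   // overwrite whatever was left behind
                n++;
            } else {
                chunk /= 2;                        // back off and retry with a smaller block
            }
        }
        for (int i = 0; i < n; i++)
            cudaFree(chunks[i]);
        printf("device %d: scrubbed %d allocation(s)\n", dev, n);
    }
    return 0;
}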
GPU Programming
• 3rd Party Programming tools
• CUDA C 2.2 SDK
• OpenCL 1.2 SDK
• PGI x64+GPU Fortran & C99 Compilers
• IACAT’s on-going work
• CUDA-auto
• CUDA-lite
• GMAC
• CUDA-tune
• MCUDA/OpenMP
[Diagram: a stack of programming tools targeting NVIDIA GPUs and IA multi-core & Larrabee: NVIDIA SDK, MCUDA/OpenMP, CUDA-lite, GMAC, CUDA-tune, CUDA-auto.]
• 1st generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers (illustrated in the sketch below)
• Parameterized CUDA programming using auto-tuning and optimization space pruning
• Locality annotation programming to eliminate the need for explicit management of memory types and data transfers
• Implicitly parallel programming with data structure and function property annotations to enable auto-parallelization
Courtesy of Wen-mei Hwu, UIUC
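To make the first-generation, fully explicit style concrete, here is a minimal CUDA sketch (an illustration, not taken from any of the codes discussed here) with a hand-chosen launch configuration and explicit host/device transfers:

// Minimal first-generation CUDA style: explicit thread organization and
// explicit host<->device transfers.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // hardwired 1-D mapping
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaMalloc(&d, bytes);                            // explicit device memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // explicit transfer in
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      // explicit launch configuration
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // explicit transfer out
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);
    free(h);
    return 0;
}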
GMAC – Designed to lower the barrier to accelerator use
• Unified CPU / GPU address space (see the sketch below):
• The same address is valid on the CPU and the GPU
• Customizable implicit data transfers:
• Transfer everything (safe mode)
• Transfer dirty data before kernel execution
• Transfer data as it is produced (default)
• Multi-process / Multi-thread support
• CUDA compatible
Courtesy of Wen-mei Hwu, UIUC
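For contrast with the explicit style above, a rough sketch of what GMAC-style code might look like. The header name, gmacMalloc, gmacFree, and gmacThreadSynchronize are taken from published GMAC descriptions but are assumptions here, not verified against any particular release:

// Hedged sketch of the GMAC unified-address-space model; API names are assumed.
#include <gmac.h>            // header name assumed
#include <stdio.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    float *x;
    gmacMalloc((void **)&x, n * sizeof(float));    // one allocation, one address

    for (int i = 0; i < n; i++) x[i] = 1.0f;       // host writes through the pointer
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);   // kernel uses the same pointer
    gmacThreadSynchronize();                       // wait for the kernel (assumed call)
    printf("x[0] = %f\n", x[0]);                   // host reads; transfer is implicit

    gmacFree(x);
    return 0;
}

The point of the sketch is the absence of cudaMemcpy calls and separate host/device pointers; the runtime decides what to transfer and when, per the modes listed above.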
GPU Cluster Applications
• TPACF
• Cosmology code used to study how matter is distributed in the Universe
• NAMD
• Molecular Dynamics code used to run large-scale MD simulations
• DSCF
• NSF CyberChem project; DSCF quantum chemistry code for energy
calculations
• WRF
• Weather modeling code
• MILC
• Quantum chromodynamics code, work just started
• …
TPACF on AC
[Bar chart: TPACF kernel speedup on Quadro FX 5600, GeForce GTX 280 (SP), Cell B./E. (SP), GeForce GTX 280 (DP), PowerXCell 8i (DP), and H101 (DP); reported speedups are 6.8×, 30.9×, 45.5×, 59.1×, 88.2×, and 154.1×.]
QP GPU cluster scaling
[Chart: TPACF execution time (sec) vs. number of MPI threads on the QP GPU cluster, log-log scale; execution time falls from 567.5 s to 23.7 s as the thread count grows.]
NAMD on Lincoln (8 cores and 2 GPUs per node, very early results)
[Chart: STMV benchmark time per step (s/step) vs. number of CPU cores (4 to 128) for CPU-only runs at 8, 4, and 2 processes per node and GPU-accelerated runs at 4:1, 2:1, and 1:1 core-to-GPU ratios; annotations read "2 GPUs = 24 cores", "8 GPUs = 96 CPU cores", "~5.6", "~2.8", with the 4, 8, and 16 GPU points marked.]
Courtesy of James Phillips, UIUC
DSCF on Lincoln
[Chart: MPI timings and scalability for bovine pancreatic trypsin inhibitor (BPTI), 3-21G basis set, 875 atoms, 4893 basis functions.]
Courtesy of Ivan Ufimtsev, Stanford
GPU Clusters Open Issues / Future Work
• GPU cluster configuration
• CPU/GPU ratio
• Host/device memory ratio
• Cluster interconnect requirements
• Cluster management
• Resource allocation
• Monitoring
• CUDA/OpenCL SDK for HPC
• Programming models for accelerator clusters
• CUDA/OpenCL+pthreads/MPI/OpenMP/Charm++/…
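As an illustration of the CUDA+MPI combination listed above, here is a minimal sketch of the common one-GPU-per-rank pattern; the rank-to-device mapping convention and the host-staged transfer are assumptions for illustration, not something prescribed by these slides.

// Minimal CUDA+MPI pattern: each MPI rank selects its own GPU and computes
// locally; results are combined with a reduction (no GPU-aware MPI assumed).
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void init(float *x, int n, int rank)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = (float)rank;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(rank % ndev);          // simple rank-to-local-GPU mapping

    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    init<<<(n + 255) / 256, 256>>>(d, n, rank);
    cudaDeviceSynchronize();

    float sample, sum = 0.0f;
    cudaMemcpy(&sample, d, sizeof(float), cudaMemcpyDeviceToHost);  // stage via host
    MPI_Reduce(&sample, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of rank ids: %g (%d ranks)\n", sum, nranks);

    cudaFree(d);
    MPI_Finalize();
    return 0;
}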