TRANSCRIPT
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
High Performance
Computing with
Accelerators
Volodymyr Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing
Applications and Technologies (IACAT)
Presentation Outline
• Recent trends in application accelerators
• FPGAs, Cell, GPUs, …
• Accelerator clusters
• Accelerator clusters at NCSA
• Cluster management
• Programmability Issues
• Production codes running on Accelerator Clusters
• Cosmology, molecular dynamics, electronic structure,
quantum chromodynamics
HPC on Special-purpose Processors
• Field-Programmable Gate Arrays (FPGAs)
– Digital signal processing, embedded computing
• Graphics Processing Units (GPUs)
– Desktop graphics accelerators
• Physics Processing Units (PPUs)
– Desktop game physics accelerators
• Sony/Toshiba/IBM Cell Broadband Engine
– Game console and digital content delivery systems
• ClearSpeed accelerator
– Floating-point accelerator board for compute-
intensive applications
GPU Performance Trends: FLOPS
[Chart: peak GFLOP/s vs. release date (9/22/02 to 3/14/08) for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) and Intel CPUs (Intel Xeon quad-core 3 GHz); the GPU curve approaches 1000 GFLOP/s while the CPU curve stays far below.]
Courtesy of NVIDIA
GPU Performance Trends: Memory Bandwidth
[Chart: peak memory bandwidth (GByte/s) vs. release date (9/22/02 to 3/14/08) for NVIDIA GPUs from the 5800 through the GTX 285; bandwidth grows from tens of GByte/s to roughly 160 GByte/s.]
Courtesy of NVIDIA
NVIDIA Tesla T10 GPU Architecture
• T10 architecture
• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
• 1 TFLOPS SP
• 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
• 102 GB/s bandwidth
[Block diagram: the T10 comprises 10 thread processing clusters (TPC 1 through TPC 10); each TPC contains a geometry controller, an SM controller (SMC), three streaming multiprocessors (each with 8 SPs, 2 SFUs, shared memory, a constant cache, MT issue logic, and an instruction cache), texture units, and a texture L1 cache. A PCIe interface, input assembler, and thread execution manager feed the TPCs; L2 caches and ROPs connect over a 512-bit memory interconnect to DRAM.]
NVIDIA Tesla S1070 GPU Computing Server
• 4 T10 GPUs
[Block diagram: four Tesla T10 GPUs, each with 4 GB of GDDR3 SDRAM, connect in pairs through two NVIDIA switches to two PCIe x16 host interfaces; the 1U chassis also contains the power supply, thermal management, and system monitoring.]
GPU Clusters at NCSA
• Lincoln
• Production system available through the standard NCSA/TeraGrid HPC allocation process
• AC
• Experimental system available to anybody interested in exploring GPU computing
Intel 64 Tesla Linux Cluster Lincoln
• Dell PowerEdge 1955 server
• Intel 64 (Harpertown) 2.33 GHz
dual socket quad core
• 16 GB DDR2
• InfiniBand SDR
• S1070 1U GPU Computing
Server
• 1.3 GHz Tesla T10 processors
• 4x4 GB GDDR3 SDRAM
• Cluster
• Servers: 192
• CPU cores: 1536
• Accelerator Units: 96
• GPUs: 384
[Diagram: two Dell PowerEdge 1955 servers share one Tesla S1070; each server attaches to two of its T10 GPUs over a PCIe x8 link and to the cluster fabric over SDR InfiniBand.]
AMD Opteron Tesla Linux Cluster AC
• HP xw9400 workstation
• 2216 AMD Opteron 2.4 GHz
dual socket dual core
• 8 GB DDR2
• InfiniBand QDR
• S1070 1U GPU Computing
Server
• 1.3 GHz Tesla T10 processors
• 4x4 GB GDDR3 SDRAM
• Cluster
• Servers: 32
• CPU cores: 128
• Accelerator Units: 32
• GPUs: 128
[Diagram: each HP xw9400 host connects to all four T10 GPUs of a Tesla S1070 over two PCIe x16 links, to the cluster fabric over QDR InfiniBand, and to a Nallatech H101 FPGA card over PCI-X.]
AC Cluster: 128 TFLOPS (SP)
NCSA GPU Cluster Management Software
• CUDA SDK Wrapper
• Enables true GPU resource allocation by the scheduler
• E.g., qsub -l nodes=1:ppn=1 will allocate 1 CPU core and 1 GPU, fencing off the other GPUs in the node
• Virtualizes GPU devices (see the sketch after this list)
• Sets thread affinity to the CPU core “nearest” the GPU device
• GPU Memory Scrubber
• Cleans memory between runs
• GPU Memory Test utility
• Tests memory for manufacturing defects
• Tests memory for soft errors
• Tests GPUs for entering an erroneous state
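One plausible way to implement the wrapper's device virtualization and affinity handling is an LD_PRELOAD interposer around the CUDA runtime. The sketch below assumes the scheduler exports the assigned physical device IDs in an environment variable; the variable name ASSIGNED_GPUS and all other details are illustrative, not the actual NCSA implementation.

/* Hedged sketch of an LD_PRELOAD interposer that remaps a job's virtual
 * device index to the physical GPU assigned by the scheduler.
 * Build: gcc -shared -fPIC -o libwrap.so wrap.c -ldl
 * Use:   LD_PRELOAD=./libwrap.so ./app                                   */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

cudaError_t cudaSetDevice(int dev)
{
    static cudaError_t (*real_set_device)(int) = NULL;
    if (!real_set_device)
        real_set_device = (cudaError_t (*)(int))dlsym(RTLD_NEXT, "cudaSetDevice");

    int physical = dev;
    const char *assigned = getenv("ASSIGNED_GPUS");   /* hypothetical, e.g. "2,3" */
    if (assigned) {
        /* pick the dev-th entry from the comma-separated list */
        char *list = strdup(assigned);
        char *tok = strtok(list, ",");
        for (int i = 0; tok && i < dev; i++)
            tok = strtok(NULL, ",");
        if (tok)
            physical = atoi(tok);
        free(list);
    }
    /* A full wrapper would also pin the calling thread to a CPU core on the
     * same NUMA node as 'physical' (e.g. with sched_setaffinity).          */
    return real_set_device(physical);
}

The real wrapper intercepts additional runtime entry points (device counts, properties, and so on); this fragment only shows the remapping idea.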
GPU Node Pre/Post Allocation Sequence
• Pre-Job (minimized for rapid device acquisition)
• Assemble detected device file unless it exists
• Sanity check results
• Checkout requested GPU devices from that file
• Initialize CUDA wrapper shared memory segment with a unique key for the user (allows the user to ssh to the node outside of the job environment and see the same GPU devices)
• Post-Job
• Use quick memtest run to verify healthy GPU state
• If bad state detected, mark node offline if other jobs present on node
• If no other jobs, reload kernel module to “heal” node (for CUDA 2.2 bug)
• Run memscrubber utility to clear GPU device memory (see the sketch below)
• Notify of any failure events with job details
• Terminate wrapper shared memory segment
• Check GPUs back in to the global file of detected devices
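A minimal sketch of what a memscrubber-style pass could look like (illustrative only, not the actual NCSA utility): allocate as much device memory as the runtime will give out, overwrite it, and free it, so data from the previous job cannot leak into the next one.

// Illustrative memory-scrub pass (a sketch, not the actual NCSA memscrubber).
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; dev++) {
        cudaSetDevice(dev);
        size_t free_b = 0, total_b = 0;
        cudaMemGetInfo(&free_b, &total_b);

        void *chunks[64];
        int n = 0;
        size_t chunk = free_b;
        while (n < 64 && chunk > (1 << 20)) {
            if (cudaMalloc(&chunks[n], chunk) == cudaSuccess) {
                cudaMemset(chunks[n], 0, chunk);   // overwrite whatever was left behind
                n++;
            } else {
                chunk /= 2;                        // back off and retry with a smaller block
            }
        }
        for (int i = 0; i < n; i++)
            cudaFree(chunks[i]);
        printf("device %d: scrubbed %d allocation(s)\n", dev, n);
    }
    return 0;
}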
GPU Programming
• 3rd Party Programming tools
• CUDA C 2.2 SDK
• OpenCL 1.2 SDK
• PGI x64+GPU Fortran & C99 Compilers
• IACAT’s on-going work
• CUDA-auto
• CUDA-lite
• GMAC
• CUDA-tune
• MCUDA/OpenMP
[Diagram: a stack of programming tools targeting NVIDIA GPUs and IA multi-core & Larrabee: NVIDIA SDK, MCUDA/OpenMP, CUDA-lite, GMAC, CUDA-tune, CUDA-auto.]
• 1st generation CUDA programming with explicit, hardwired thread organizations and explicit management of memory types and data transfers (illustrated in the sketch below)
• Parameterized CUDA programming using auto-tuning and optimization space pruning
• Locality annotation programming to eliminate the need for explicit management of memory types and data transfers
• Implicitly parallel programming with data structure and function property annotations to enable auto-parallelization
Courtesy of Wen-mei Hwu, UIUC
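To make the first-generation, fully explicit style concrete, here is a minimal CUDA sketch (an illustration, not taken from any of the codes discussed here) with a hand-chosen launch configuration and explicit host/device transfers:

// Minimal first-generation CUDA style: explicit thread organization and
// explicit host<->device transfers.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // hardwired 1-D mapping
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaMalloc(&d, bytes);                            // explicit device memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // explicit transfer in
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      // explicit launch configuration
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // explicit transfer out
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);
    free(h);
    return 0;
}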
GMAC – Designed to lower the barrier to accelerator use
• Unified CPU / GPU address space (see the sketch below):
• The same address is valid on the CPU and the GPU
• Customizable implicit data transfers:
• Transfer everything (safe mode)
• Transfer dirty data before kernel execution
• Transfer data as it is produced (default)
• Multi-process / Multi-thread support
• CUDA compatible
Courtesy of Wen-mei Hwu, UIUC
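For contrast with the explicit style above, a rough sketch of what GMAC-style code might look like. The header name, gmacMalloc, gmacFree, and gmacThreadSynchronize are taken from published GMAC descriptions but are assumptions here, not verified against any particular release:

// Hedged sketch of the GMAC unified-address-space model; API names are assumed.
#include <gmac.h>            // header name assumed
#include <stdio.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    float *x;
    gmacMalloc((void **)&x, n * sizeof(float));    // one allocation, one address

    for (int i = 0; i < n; i++) x[i] = 1.0f;       // host writes through the pointer
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);   // kernel uses the same pointer
    gmacThreadSynchronize();                       // wait for the kernel (assumed call)
    printf("x[0] = %f\n", x[0]);                   // host reads; transfer is implicit

    gmacFree(x);
    return 0;
}

The point of the sketch is the absence of cudaMemcpy calls and separate host/device pointers; the runtime decides what to transfer and when, per the modes listed above.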
GPU Cluster Applications
• TPACF
• Cosmology code used to study how matter is distributed in the Universe
• NAMD
• Molecular Dynamics code used to run large-scale MD simulations
• DSCF
• NSF CyberChem project; DSCF quantum chemistry code for energy
calculations
• WRF
• Weather modeling code
• MILC
• Quantum chromodynamics code, work just started
• …
TPACF on AC
[Bar chart: TPACF kernel speedup on Quadro FX 5600, GeForce GTX 280 (SP), Cell B./E. (SP), GeForce GTX 280 (DP), PowerXCell 8i (DP), and H101 (DP); reported speedups are 6.8×, 30.9×, 45.5×, 59.1×, 88.2×, and 154.1×.]
QP GPU cluster scaling
[Chart: TPACF execution time (sec) vs. number of MPI threads on the QP GPU cluster, log-log scale; execution time falls from 567.5 s to 23.7 s as the thread count grows.]
NAMD on Lincoln (8 cores and 2 GPUs per node, very early results)
[Chart: STMV benchmark time per step (s/step) vs. number of CPU cores (4 to 128) for CPU-only runs at 8, 4, and 2 processes per node and GPU-accelerated runs at 4:1, 2:1, and 1:1 core-to-GPU ratios; annotations read "2 GPUs = 24 cores", "8 GPUs = 96 CPU cores", "~5.6", "~2.8", with the 4, 8, and 16 GPU points marked.]
Courtesy of James Phillips, UIUC
DSCF on Lincoln
[Chart: MPI timings and scalability for bovine pancreatic trypsin inhibitor (BPTI), 3-21G basis set, 875 atoms, 4893 basis functions.]
Courtesy of Ivan Ufimtsev, Stanford
GPU Clusters Open Issues / Future Work
• GPU cluster configuration
• CPU/GPU ratio
• Host/device memory ratio
• Cluster interconnect requirements
• Cluster management
• Resource allocation
• Monitoring
• CUDA/OpenCL SDK for HPC
• Programming models for accelerator clusters
• CUDA/OpenCL+pthreads/MPI/OpenMP/Charm++/…
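As an illustration of the CUDA+MPI combination listed above, here is a minimal sketch of the common one-GPU-per-rank pattern; the rank-to-device mapping convention and the host-staged transfer are assumptions for illustration, not something prescribed by these slides.

// Minimal CUDA+MPI pattern: each MPI rank selects its own GPU and computes
// locally; results are combined with a reduction (no GPU-aware MPI assumed).
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void init(float *x, int n, int rank)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = (float)rank;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(rank % ndev);          // simple rank-to-local-GPU mapping

    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    init<<<(n + 255) / 256, 256>>>(d, n, rank);
    cudaDeviceSynchronize();

    float sample, sum = 0.0f;
    cudaMemcpy(&sample, d, sizeof(float), cudaMemcpyDeviceToHost);  // stage via host
    MPI_Reduce(&sample, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of rank ids: %g (%d ranks)\n", sum, nranks);

    cudaFree(d);
    MPI_Finalize();
    return 0;
}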