Parallel Computers Today
LANL / IBM Roadrunner > 1 PFLOPS
Two Nvidia 8800 GPUs > 1 TFLOPS
Intel 80-core chip > 1 TFLOPS

TFLOPS = 10^12 floating point ops/sec
PFLOPS = 10^15 floating point ops/sec (1,000,000,000,000,000/sec)
Columbia (10240-processor SGI Altix, 50 Teraflops, NASA Ames Research Center)
Beowulf (18-processor cluster, lab machine)
AMD Opteron quad-core die
The nVidia G80 GPU
• 128 streaming floating point processors @ 1.5 GHz
• 1.5 GB shared RAM with 86 GB/s bandwidth
• 500 Gflops on one chip (single precision)
The Computer Architecture Challenge
Most high-performance computer designs allocate resources to optimize Gaussian elimination on large, dense matrices.
Originally, because linear algebra is the middleware of scientific computing.
Nowadays, mostly for bragging rights.
P · A = L · U (LU factorization with partial pivoting)
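The P·A = L·U factorization that these benchmarks stress can be sketched in a few lines. This is a minimal, unblocked pure-Python version for illustration only; the function and variable names are mine, and real HPC codes (e.g. the HPL benchmark, LAPACK's `getrf`) use blocked, cache- and communication-optimized parallel kernels.

```python
def lu_factor(A):
    """Return (perm, L, U) with P*A = L*U, where perm encodes the
    row-permutation matrix P and A is a square list-of-lists matrix."""
    n = len(A)
    U = [row[:] for row in A]          # working copy, reduced to upper triangular
    L = [[0.0] * n for _ in range(n)]  # unit lower triangular multipliers
    perm = list(range(n))              # row permutation encoding P
    for k in range(n):
        # Partial pivoting: bring the largest-magnitude entry of column k
        # (on or below the diagonal) to the pivot position.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        U[k], U[p] = U[p], U[k]
        L[k], L[p] = L[p], L[k]
        perm[k], perm[p] = perm[p], perm[k]
        L[k][k] = 1.0
        # Eliminate column k below the diagonal, recording the multipliers in L.
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return perm, L, U
```

The triply nested elimination loop is the O(n^3) kernel whose dense floating-point work is exactly what Top 500 machines are tuned to sustain.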
• Top 500 List
• http://www.top500.org/list/2008/11/100
Generic Parallel Machine Architecture
• Key architecture question: Where is the interconnect, and how fast?
• Key algorithm question: Where is the data?
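The "where is the data?" question can be made concrete with a toy owner-computes sketch: each node works on the block of data it owns, and only small partial results cross the interconnect. The node count and block layout here are illustrative assumptions, not something the slides specify.

```python
def block_partition(data, num_nodes):
    """Split data into contiguous blocks, one per node (last block may be short)."""
    n = len(data)
    size = (n + num_nodes - 1) // num_nodes  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(num_nodes)]

def parallel_sum(data, num_nodes=4):
    """Sum `data` the way a distributed-memory machine would: local sums
    on each node's block, then one small reduction across the interconnect."""
    blocks = block_partition(data, num_nodes)
    partials = [sum(block) for block in blocks]  # local work, no communication
    return sum(partials)                         # one tiny message per node
```

The design point: the expensive traversal touches only node-local data, so the interconnect carries `num_nodes` scalars instead of the whole array.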
[Diagram: three nodes, each a storage hierarchy of processor + L1 cache, L2 cache, L3 cache, and memory, with potential interconnects joining the nodes at several levels.]
Multicore SMP Systems
[Diagram: Intel Clovertown — two sockets, each with two pairs of Core2 cores sharing a 4 MB L2; each socket's FSB runs at 10.6 GB/s to a chipset with 4x64b memory controllers, reaching fully buffered DRAM at 21.3 GB/s (read) and 10.6 GB/s (write).]

[Diagram: Sun Niagara2 — eight multithreaded UltraSparc cores, each with an FPU and 8K D$, connected through a crossbar switch to a 4 MB 16-way shared L2 at 179 GB/s (fill) and 90 GB/s (write-thru); 4x128b FBDIMM memory controllers reach fully buffered DRAM at 42.7 GB/s (read) and 21.3 GB/s (write).]
[Diagram: AMD Opteron — two sockets, each with two Opteron cores backed by 1 MB victim caches and a memory controller / HT link; each socket reaches its own DDR2 DRAM at 10.6 GB/s, with 4 GB/s (each direction) HyperTransport between the sockets.]
IBM Cell Blade

[Diagram: two Cell processors connected via BIF; each has a PPE with 512K L2 and eight SPEs, each with 256K local store and an MFC, on the EIB ring network (<20 GB/s each direction), with XDR DRAM at 25.6 GB/s per processor.]
More Detail on GPU Architecture
• Michael Perrone (IBM): Proper Care and Feeding of Multicore Beasts
• http://www.csm.ornl.gov/workshops/HPA/documents/1-arch/feeding_the_beast_perrone.pdf
Cray XMT (highly multithreaded shared memory)
[Diagram: programs running in parallel are decomposed into concurrent threads of computation — the loop iterations i = 1 … n of subproblems A and B, plus serial code — and mapped onto the 128 hardware streams per processor (some streams unused); ready streams feed the instruction ready pool and the pipeline of executing instructions.]
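The XMT's latency-hiding idea — keep many streams in flight so that while some wait on memory, others make progress — can be mimicked with ordinary software threads. This is only an analogy under stated assumptions: the sleep stands in for a long-latency memory reference, and the stream count and timings are illustrative, not XMT specifications.

```python
import threading
import time

def worker(i, results):
    # Simulate a long-latency memory reference; on the XMT the hardware
    # would switch to another ready stream instead of stalling here.
    time.sleep(0.05)
    results[i] = i * i

def run(num_streams):
    """Launch num_streams concurrent workers and time them all to completion."""
    results = [None] * num_streams
    threads = [threading.Thread(target=worker, args=(i, results))
               for i in range(num_streams)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start
```

Because all the "streams" wait concurrently, total time is close to one latency rather than the sum of all of them, which is the effect the 128 hardware streams buy.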