TRANSCRIPT
Sixth International LS-DYNA Users Conference
Teraflops and Beyond
Robert F. Lucas
High Performance Computing Research Department Head
National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory
April 11, 2000
What is NERSC?
• The US Department of Energy, Office of Science, supercomputing center for the past 25 years
• Currently located at Lawrence Berkeley National Laboratory in the hills above the University of California, Berkeley campus
Outline
• A Teraflop today
• Ten Teraflops tomorrow!
• A Petaflop someday?
• Conclusions
Path to a Tflop/s at NERSC
Cray C90 Installed in 1991
• Cray C90 installed in December 1991
• Stable high-end production platform for seven years, until 12/31/98
Getting Closer to a Tflop/s
Cray T3E-900 Installed in 1996
The 696-processor T3E-900 is one of the most powerful unclassified supercomputers in the U.S.
A 1997 GAO report judged NERSC to have the best MPP utilization (75% then; >90% today)
T3E Utilization
[Chart: "MPP Charging and Usage, FY98". 30-day moving averages of CPU hours (0 to 18,000 Max CPU Hours) for Mcurie, GC0, Pierre, Pierre Free, and Lost Time, plotted against 80%, 85%, and 90% of peak]
We’re Almost to a Tflop/s
IBM SP3 Installed in 1999
• New contract with IBM announced in April 1999
• Phase 1 system delivered in June 1999
• Phase 2 will exceed 3 Tflop/s in Q4 2000
Why IBM?
[Chart: Peak Performance, total Gflop/s by quarter (0 to 3,500) for NERSC-3 versus NERSC-2]
What About Applications?
Discipline                    Computational technology
magnetic fusion               particle in cell
computational chemistry       local density functional
material sciences             local density functional
climate research              partial diff. equations
computational biology         partial diff. equations
QCD                           Monte Carlo technique
accelerator design            simulation
particle detection            searching, pattern matching
combustion                    PDEs
applied mathematics           image processing
Good News: 1998 Gordon Bell Prize
First complete application to break the 1 Tflop/s barrier.
Collaborators from DOE’s Grand Challenge on Materials, Methods, Microstructure, and Magnetism.
1024-atom first-principles simulation of metallic magnetism in iron
How About DYNA?
• World record is only 6 Gflop/s, on a Fujitsu VPP 5000
  - VPP 5000 is a “cluster” of large vector nodes
  - Very high memory B/W per node
  - High B/W crossbar interconnect
• Lots of reasons why we’re not up to 1 Tflop/s yet:
  - Meshes are irregular and dynamically changing
  - Dot products and other mathematical bottlenecks to scalability
  - Note: the complexity of the LINPACK benchmark and the LSMS Tflop/s application is dominated by matrix-matrix multiply
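The dot-product bottleneck can be made concrete with a toy cost model (the flop rate and message latency below are hypothetical, chosen only to illustrate the shape of the curve): the local work shrinks as n/P, but the global reduction still costs O(log2 P) message latencies.

```python
import math

# Toy cost model for a parallel dot product (illustrative numbers,
# not measurements from any real machine).
def dot_time(n, p, flop_rate=1e8, latency=20e-6):
    local = (2.0 * n / p) / flop_rate              # 2 flops per element
    reduction = math.ceil(math.log2(p)) * latency  # tree-reduction depth
    return local + reduction

n = 1_000_000
for p in (1, 16, 256, 1024):
    print(f"P={p:5d}: {dot_time(n, p) * 1e3:.3f} ms")
```

Past a few hundred processors the latency term dominates and adding processors stops helping, which is exactly the scalability ceiling that irregular implicit codes hit long before dense LINPACK-style kernels do.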
12Sixth International LS-DYNA Users Conference
Let’s Look at Sparse Solvers
• Direct solver dominates the complexity of non-linear and dynamic analysis.
• Multifrontal algorithm is the method of choice.
  - Dense arithmetic kernels -> maximize speed
  - Easily goes out-of-core -> maximize size
• Why is the BCSLIB-Ext world record only 8 Gflop/s?
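The dense-kernel claim can be illustrated with a small NumPy sketch (illustrative only; real multifrontal codes such as BCSLIB-Ext also manage assembly trees and out-of-core fronts): eliminating the fully-summed variables of a front reduces to a dense factorization, a triangular solve, and a matrix-matrix Schur-complement update.

```python
import numpy as np

# Build a small SPD matrix and partition it into a "front" (block 1)
# and the variables it updates (block 2).
rng = np.random.default_rng(0)
n1, n2 = 3, 4
A = rng.standard_normal((n1 + n2, n1 + n2))
A = A @ A.T + (n1 + n2) * np.eye(n1 + n2)   # make it SPD

A11, A12 = A[:n1, :n1], A[:n1, n1:]
A22 = A[n1:, n1:]

# Eliminating the n1 fully-summed variables: a dense Cholesky,
# a triangular solve, and a GEMM-rich rank-n1 update. These dense
# kernels are what lets a multifrontal solver run near peak.
L11 = np.linalg.cholesky(A11)
W = np.linalg.solve(L11, A12)
S = A22 - W.T @ W                            # Schur complement

# Sanity check: the update agrees with the trailing block of the
# full Cholesky factorization of A.
L = np.linalg.cholesky(A)
assert np.allclose(S, L[n1:, n1:] @ L[n1:, n1:].T)
```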
Load Balance
• Can’t simultaneously load balance:
  - Finite element assembly
  - Factorization
  - Triangular solves
Graph Partitioning
• How do you cut this grid in half?
• Optimal graph partitioning is NP-complete
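Because optimal bisection is NP-complete, production partitioners (METIS, Chaco, and the like) rely on heuristics. A minimal sketch of one greedy refinement pass in the spirit of Kernighan-Lin follows; the graph and the starting partition are made up for illustration.

```python
# cut_size: number of edges crossing the partition.
def cut_size(edges, part):
    return sum(1 for u, v in edges if part[u] != part[v])

# One refinement pass: swap the cross-partition pair of vertices
# that most reduces the cut. (Real Kernighan-Lin maintains gain
# tables and makes sequences of tentative swaps; this is the
# bare idea only.)
def refine_once(edges, part):
    best_gain, best_pair = 0, None
    left = [v for v in part if part[v] == 0]
    right = [v for v in part if part[v] == 1]
    base = cut_size(edges, part)
    for u in left:
        for v in right:
            trial = dict(part)
            trial[u], trial[v] = 1, 0
            gain = base - cut_size(edges, trial)
            if gain > best_gain:
                best_gain, best_pair = gain, (u, v)
    if best_pair:
        u, v = best_pair
        part[u], part[v] = 1, 0
    return part

# A 2x3 grid graph with a deliberately bad initial split:
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
part = {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 1}
print(cut_size(edges, part))   # 4 before refinement
refine_once(edges, part)
print(cut_size(edges, part))   # 3 after one swap
```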
Out-of-Core?
• What if you go out-of-core? 10,000,000-equation NASTRAN problems today!
• What system today provides 1 GB/s of standard I/O?
• Triangular solves are completely I/O bound.
  - 2 flops/word for linear solve
  - 12 flops/word for block-shifted Lanczos
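The arithmetic behind that claim, assuming 8-byte double-precision words and the 1 GB/s figure above:

```python
# Back-of-envelope: why out-of-core triangular solves are I/O bound.
io_rate = 1e9                 # bytes/s of standard I/O (the 1 GB/s above)
word = 8                      # bytes per double-precision word
words_per_sec = io_rate / word

solve_flops_per_word = 2      # forward/back substitution
lanczos_flops_per_word = 12   # block-shifted Lanczos

print(words_per_sec * solve_flops_per_word / 1e6, "Mflop/s")    # 250.0 Mflop/s
print(words_per_sec * lanczos_flops_per_word / 1e9, "Gflop/s")  # 1.5 Gflop/s
```

Even a full 1 GB/s of I/O caps a 2 flops/word triangular solve at 250 Mflop/s, orders of magnitude short of a teraflop.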
Scalable Solvers?
• Research codes … one or more of the following usually apply:
  - Don’t exploit symmetry!
  - Don’t dynamically pivot!
  - Don’t go out-of-core!
  - Don’t map to the initial distribution of finite elements!
  - Don’t report performance of the triangular solver!
• Usually benchmarked on trivial, 3D grids!
More Bad News
• The HPC community left electrical engineers behind twenty years ago.
  - SPICE didn’t vectorize effectively
• Mechanical engineers are falling by the wayside now.
  - Not good for the HPC industry
  - NASTRAN was the “killer app” for Crays
• Who’s pushing the envelope?
  - Bigger problems?
  - Automatic optimization?
  - New algorithms?
Outline
• A Teraflop today
• Ten Teraflops tomorrow!
• A Petaflop someday?
• Conclusions
What Will a 10 Tflop/s System Look Like?
• The first ones are already on order
  - Lawrence Livermore National Laboratory in the US
  - Earth System Simulator in Japan
• Systems are large clusters
  - SMP nodes in the US
  - Vector nodes in Japan
• Programming model:
  - OpenMP and/or vectors to maximize node speed
  - MPI for global communication
A New Business Model?
Transition from vertical to horizontal companies

Vertical (old model): each vendor owns the whole stack
  - Companies: SGI, IBM, HP, Sun
  - Processors: MIPS, PowerPC, PA-RISC, SPARC
  - Systems: Origin, SP, SPP, HPC
  - Operating systems: Irix, AIX, Solaris
  - Applications software with MPI
  - Direct sales

Horizontal (new model): each layer is a separate market
  - Processors: Intel and others
  - Systems: SGI, IBM, Compaq, HP, Sun
  - Operating systems: Microsoft “NT” or Linux
  - Applications software with MPI
  - Mail order and retail sales
It’s Already Happening
• Forecast Systems Lab procurement
• Prime contractor is High Performance Technologies Inc. (HPTi)
• Subcontractor is Compaq
It’s Really Nothing New!
• Remember EDS?
• Integration is how Ross Perot got rich!
Buying From Integrators Worries Me
• Integrators don’t “own” the whole system end-to-end.
• Who takes responsibility for problems?
• Who fixes third party H/W or S/W bugs?
• Who adds new features not demanded by the transaction, database, or Web server markets?
  - Checkpointing
  - Low-latency communication?
How Do Clusters of SMPs Perform?
• Good news:
  - Cheap flop/s and memory
  - PCs in LSTC’s classroom are faster than LSTC’s 8-processor Origin 2000 on an explicit problem
• Bad news:
  - Global communication B/W is low
  - Communication latency is high
  - Net result will be difficulty with implicit solutions
Good News Example
[Chart: NAS Parallel Benchmark Results: BT. Performance in Mflop/s (0 to 1,400) versus number of processors (0 to 25) for a PC cluster and the T3E-900]
Bad News
[Chart: NAS Parallel Benchmark Results: CG. Performance in Mflop/s (0 to 700) versus number of processors (0 to 35) for a PC cluster and the T3E-900]
Downright Ugly
[Chart: NAS Parallel Benchmark Results: FT. Performance in Mflop/s (0 to 1,400) versus number of processors (0 to 35) for a PC cluster and the T3E-900]
Clusters Increase Power and Space Requirements
[Chart: peak Gflop/s (up to ~4,000), power in kW, and floor space in ft² for the 1992, 1996, and 2000 systems]
Eight Tflop/s Floorplan
We’re Not Quite There Yet
Near-term Prediction
• Peak performance and memory will go up
• So will power and floor space
• Most big systems will be procured from integrators
• The space of applications enjoying this will shrink
• The space of applications will certainly change
  - Materials!
  - Designer drugs?
• Implicit codes will probably not be well served at the very high end
Outline
• A Teraflop today
• Ten Teraflops tomorrow!
• A Petaflop someday?
• Conclusions
Processor in Memory Chips
• Processing in memory is an old idea
• Semiconductor technology now makes it very attractive

[Figure: Mitsubishi M32R]
Rediscovering Vectors
• Couple this with vectors for high B/W
[Figure: UC Berkeley IRAM floorplan. CPU and I/O, 8 vector pipes (+ 1 spare), and two banks of memory (512 Mbits / 64 MBytes each)]
CMOS Petaflop/s Solution
• 100,000 10-Gflop/s vector PIM chips
• 3D mesh interconnect
• O(10^7) operations must be sustained on every clock cycle to avoid an Amdahl bottleneck
• IBM Blue Gene is an example
• God help the programmers!
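The Amdahl bottleneck can be quantified with the standard formula, speedup = 1 / (s + (1 - s)/N); the degree of concurrency below is illustrative, not a spec of any proposed machine.

```python
# Amdahl's law: with millions of operations in flight per cycle,
# even a minute serial fraction s caps the achievable speedup.
def amdahl(s, n):
    return 1.0 / (s + (1.0 - s) / n)

n = 1_000_000                        # illustrative concurrency
for s in (0.0, 1e-6, 1e-4, 1e-2):
    print(f"serial fraction {s:g}: speedup {amdahl(s, n):,.0f}x")
```

A serial fraction of one part per million already halves the attainable speedup, and one part in ten thousand cuts it roughly a hundredfold.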
An Alternate Technology?
[Chart: projected clock rate (100 MHz to 1 THz) versus year (1995 to 2010). Optical-lithography CMOS scales from 0.20 um toward 0.05 um, approaching 10 GHz; HTS RSFQ scales from 3.5 um (10K JJ) through 1.5 um and 0.8 um (300K JJ) to 0.4 um (3M to 30M Josephson junctions), approaching 100 GHz and perhaps 1 THz]
• Single Flux Quantum (SFQ)
• Operates at 4 Kelvin
Hybrid Technology, Multithreaded Architecture
[Diagram: HTMT memory hierarchy, data-intensive and data-flow driven. Processing elements (HSPs) with CRAM/SRAM pairs connect through CNET to a smart shared L-3 SRAM-PIM (1 TB); smart main memory of DRAM-PIM (16 TB) is reached via the Data Vortex network; HRAM block 3/2 storage (1 PB); L-1 frame buffer (512 MB) and L-2 frame buffer (128 GB). Latencies: intra-execution pipeline 10–100 cycles; to CRAM 40–400 cycles; to SRAM 400–1,000 cycles; to DRAM 10,000–40,000 cycles; DRAM to HRAM 1x10^6–4x10^6 cycles]
HTMT Machine Room
HTMT Cross-Section
[Diagram: machine-room cross-section. Helium (4 K) and nitrogen (77 K) cooling stages; fiber/wire interconnects; WDM source with optical amplifiers and 980 nm pumps (20 cabinets); hard disk array (40 cabinets); tape silo array (400 silos); front-end computer, server, and console; cable tray assembly; generators; 220-volt power; structural dimensions on the order of 0.5 m to 3 m]
Outline
• A Teraflop today
• Ten Teraflops tomorrow!
• A Petaflop someday?
• Conclusions
It’s a Volatile Business
• For most of us:
  - PCs will become SMPs
  - Larger, more complicated analysis will be common
• At the very high end:
  - Clusters of SMPs will predominate for the next five years
  - Peak performance will keep going up
  - It’ll get harder and harder to realize
• Long-term future:
  - Architecture will change again
  - Programming model will change again
  - Application base will change again