Supercomputer Platforms and Its Applications
Dr. George Chiu
IBM T.J. Watson Research Center
Plasma Science – International Challenges
• Microturbulence & Transport
– What causes plasma transport?
• Macroscopic Stability
– What limits the pressure in plasmas?
• Wave-particle Interactions
– How do particles and plasma waves interact?
• Plasma-wall Interactions
– How can high-temperature plasma and material surfaces co-exist?
2007-2008 Deep Computing Roadmap Summary (1H07-2H08)

System p servers: P5 560Q+; PL4/ML16, PHV8; p6 Blade, p6 IH; PL4/ML32, PHV8; P6H, HV4
System p software: JS21 IB AIX solution (CSM 1.6/RSCT 2.4.7, GPFS 3.1, LoadLeveler 3.4.1, PESSL 3.3, PE 4.3.1); PERCS system design analysis; initial p6 support for SMPs & Ethernet; GPFS 3.2 (filesystem management); CSM 1.7; p6 IH/Blades IB solutions with AIX 5.3 and SLES 10; initial AIX 6.1 support for SMPs & Ethernet; p6 IH/Blades IB solutions with AIX 6.1, CSM 1.7.0.x, GPFS 3.3, LoadLeveler 3.5, PE 5.1, ESSL 4.4, PESSL 3.3
System x servers: x3455 DC, x3455 QC (Barcelona); x3550 QC; x3850 QC, x3755 QC; HS21/LS21/LS41 blades moving to Barcelona QC; x3550 Harpertown/Greencreek refresh; iDPX (Stoakley planar, then Thurley planar)
System x software: CSM RHEL 5 support, CSM 1.6/RSCT 2.4.7*; GPFS 3.2 support for System x/1350; RHEL 5 support; CSM 1.7 for System x/1350*; GPFS 3.3 and CSM 1.7.0.x support for System x/1350
Blue Gene: Blue Gene/L (EOL); BG/P first petaflop*
Blue Gene software: BlueGene/P support (GPFS 3.2, CSM 1.7, LoadLeveler 3.4.2, ESSL 4.3.1)
Cell BE: QS20; QS21 prototype; QS21; QS22; SDK 2.1, SDK 3.0, SDK 4.0, SDK 5.0
System Storage: DDN OEM agreement; DS4800; DCS9550; EXP100 attach; DS4800 follow-on; DCS9550+; DS4700 for HPC
Workstations: M50 R1; Z30 R1; Z40; M60; Z40 R1; M60 R1*; APro elimination impacts DCV

Other roadmap markers: 11S0, 11S2, LA, System Accept
Server & systems legend: specific & exclusive / specific but not exclusive / repurposed (neither specific nor exclusive)
* First petaflop dependent on BG client demand
Source: IBM Deep Computing Strategy, 7/18/07
IBM HPC roadmap
Power 5
Power 6
Power 7
Blue Gene/L
Blue Gene/P
Blue Gene/Q
Clusters and Blades
• The POWER series is IBM’s mainstream computing offering
– Market is about 60% commercial and 40% technical
– Product line value proposition:
• General-purpose computing engine
• Robustness, security & reliability fitting mission-critical requirements
• Standard programming model and interfaces
• Performance leadership with competitive performance/price value
• Robust integration with industry standards (hardware and software)
• Current status
– POWER6 announced
– POWER7 is underway
IBM HPC conceptual roadmap: POWER
Power 5
Power 6
Power 7
ASC Purple: 100 TF machine based on POWER5
~1,500 8-way POWER5 nodes, Federation (HPS) interconnect, ~12K CPUs
(~1500 × 2 multi-plane fat-tree topology, 2 × 2 GB/s links)
Communication libraries: < 5 µs latency, 1.8 GB/s unidirectional
GPFS: 122 GB/s; supports NIF
#6 in Top500 (Nov. 2007)
www.top500.org
Accelerated Strategic Computing Initiative
[Chart: ASCI platform roadmap, 1995-2005: Option Red (1+ TFlop / 0.5 TB), Option Blue (3+ TFlop / 1.5 TB), Option White (10+ TFlop / 5 TB), 30+ TFlop / 10 TB, Purple (100+ TFlop / 50 TB), Turquoise]
POWER Server Roadmap

2001 – POWER4 (180 nm): two 1.3 GHz cores, shared L2, distributed switch; chip multiprocessing, dynamic LPARs (16); autonomic computing enhancements
2002-03 – POWER4+ (130 nm): two 1.7 GHz cores, shared L2, distributed switch; reduced size, lower power, larger L2, more LPARs (32)
2004 – POWER5 (130 nm, two 1.9 GHz cores) and 2005-06 – POWER5+ (90 nm, two 2.3 GHz cores): shared L2, distributed switch; simultaneous multi-threading, sub-processor partitioning, dynamic firmware updates, enhanced scalability and parallelism, high throughput performance, enhanced memory subsystem
2007 – POWER6 (65 nm): 4.7 GHz cores, L2 caches, advanced system features & switch; ultra-high frequency, very large L2, robust error recovery, high single-thread and HPC performance, high throughput performance, more LPARs (1024), enhanced memory subsystem

**Planned to be offered by IBM. All statements about IBM’s future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.
MareNostrum at a Glance

Challenge:
• Deliver world-class deep-computing and e-Science services with an attractive cost/performance ratio
• Enable collaboration among leading scientific teams in the areas of biology, chemistry, medicine, earth sciences and physics

Innovation:
• Efficient integration of commercially available commodity components
• Modular and scalable open cluster architecture (computing, storage, networking, software, management, applications)
• Diskless capability improves node reliability, reducing installation and maintenance costs
• Record cluster density and power efficiency; leading price/performance and TCO in High Performance Computing

System:
• IBM e1350 capability Linux cluster platform comprising 42 IBM eServer p615 servers, 2,560 IBM eServer BladeCenter JS21 servers and IBM TotalStorage hardware
• 94 TF DP (64-bit), 186 TF SP (32-bit), 376 Tops (8-bit)
• 20 TB RAM, 370 TB disk
• Linux 2.6; #1 in Europe, #9 in TOP500
• ~120 m², ~750 kW
BlueGene/P
• Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
• Compute card: 1 chip + 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB is an option)
• Node card: 32 compute cards (32 chips, 4×4×2) and 0-1 I/O cards, 435 GF/s, 64 GB
• Rack: 32 node cards, cabled 8×8×16, 13.9 TF/s, 2 TB
• System: 72 racks (72×32×32), 1 PF/s, 144 TB
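As a quick arithmetic check, the per-level figures above multiply out to the rack and system totals:

    13.6 GF/s per chip   x 32 compute cards  =  435 GF/s per node card
    435 GF/s             x 32 node cards     ≈  13.9 TF/s per rack
    13.9 TF/s            x 72 racks          ≈  1 PF/s for the full system
    2 GB per compute card x 32 x 32 x 72     ≈  144 TB of memory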
HPC Challenge Benchmarks
Benchmark             | 64-rack BG/L (optimized) | Best posted, Cray XT3 (4600/5200 nodes)
HPL (TFlop/s)         | 259.213                  | 20.53
RandomAccess (GUP/s)  | 35.46                    | 0.687 (7.69 on Cray X1E)
FFT (GFlop/s)         | 2311.09                  | 905.57
STREAM Triad (GB/s)   | 160,064                  | 29,164.8
PTRANS (GB/s)         | 4,600                    | 1,800 (Red Storm)
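For orientation, the STREAM Triad entry measures sustained memory bandwidth with a kernel of the following shape (a minimal sketch in C, not the tuned benchmark source; the reported GB/s counts the three arrays streamed through memory on each pass):

    #include <stddef.h>

    /* STREAM Triad kernel: a[i] = b[i] + scalar * c[i].
       Bandwidth = 3 * sizeof(double) * n bytes moved per pass / elapsed seconds;
       n must be large enough that the arrays do not fit in cache. */
    void triad(double *a, const double *b, const double *c, double scalar, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }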
System Power Efficiency (GFlops/Watt)
[Chart: BG/L 0.23, BG/P 0.34, Red Storm 0.02, Thunderbird 0.02, Purple 0.02]
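Taking the chart values at face value, those efficiencies translate into roughly the following power draw for a sustained 100 TFlops:

    100,000 GF / 0.34 GF/W  ≈ 0.29 MW  (BG/P)
    100,000 GF / 0.23 GF/W  ≈ 0.43 MW  (BG/L)
    100,000 GF / 0.02 GF/W  ≈ 5 MW     (Red Storm, Thunderbird, Purple)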
Failures per Month at 100 TFlops (20 BG/L racks) – unparalleled reliability
[Chart: failures per month for IA64, x86, Power5 and Blue Gene clusters; reported values are 127, 1, 394 and 800 failures per month, with Blue Gene by far the lowest]
Results of survey conducted by Argonne National Lab on 10 clusters ranging from 1.2 to 365 TFlops (peak); excluding storage subsystem, management nodes, SAN network equipment, software outages
Classical MD – ddcMD: 2005 Gordon Bell Prize Winner

• Scalable, general-purpose code for performing classical molecular dynamics (MD) simulations using highly accurate MGPT potentials
• MGPT semi-empirical potentials, based on a rigorous expansion of many-body terms in the total energy, are needed to quantitatively investigate the dynamic behavior of d-shell and f-shell metals
• 524-million-atom simulations on 64K nodes achieved 101.5 TF/s sustained; superb strong and weak scaling for the full machine (“very impressive machine,” says the PI)
• Visualization of important scientific findings already achieved on BG/L: molten Ta (2,048,000 tantalum atoms) at 5000 K demonstrates solidification during isothermal compression to 250 GPa
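For readers outside the MD community, the core of a classical MD code is a short time-integration loop. The sketch below is a minimal velocity-Verlet step in plain C; the placeholder force routine stands in for the far more involved MGPT evaluation, and none of ddcMD's parallel domain decomposition is shown:

    #include <stddef.h>

    typedef struct { double r[3], v[3], f[3], m; } Particle;

    /* Placeholder force: harmonic restoring force toward the origin.
       ddcMD itself evaluates many-body MGPT potentials here. */
    static void compute_forces(Particle *p, int n)
    {
        const double k = 1.0;
        for (int i = 0; i < n; i++)
            for (int d = 0; d < 3; d++)
                p[i].f[d] = -k * p[i].r[d];
    }

    /* One velocity-Verlet time step for n particles. */
    void vv_step(Particle *p, int n, double dt)
    {
        for (int i = 0; i < n; i++)
            for (int d = 0; d < 3; d++) {
                p[i].v[d] += 0.5 * dt * p[i].f[d] / p[i].m;   /* half kick */
                p[i].r[d] += dt * p[i].v[d];                  /* drift */
            }
        compute_forces(p, n);                                 /* forces at new positions */
        for (int i = 0; i < n; i++)
            for (int d = 0; d < 3; d++)
                p[i].v[d] += 0.5 * dt * p[i].f[d] / p[i].m;   /* second half kick */
    }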
Qbox: First-Principles Molecular Dynamics
François Gygi (UC Davis); Erik Draeger, Martin Schulz, Bronis de Supinski (LLNL); Franz Franchetti (Carnegie Mellon); John Gunnels, Vernon Austel, Jim Sexton (IBM)

• Treats electrons quantum mechanically and nuclei classically
• Developed at LLNL; BG support provided by IBM
• Simulated 1,000 Mo atoms with 12,000 electrons
• Achieves 207.3 TFlops sustained (56.8% of peak)
• Qbox simulation of the transition from a molecular solid (top) to a quantum liquid (bottom) that is expected to occur in hydrogen under high pressure
Gordon Bell Special Achievement Award 2006, P. Vranas et al.: QCD speedup on BG/L to 70.5 sustained TFlops
[Chart: sustained TFlops (0-80) vs. number of CPU cores (0-140,000)]
Dirac operator: 19.3% of peak; CG inverter: 18.7% of peak
Compute Power of the Gyrokinetic Toroidal Code
[Chart: number of particles (in millions) moved one step in one second vs. number of cores (10 to 10,000,000), for Cray XT3/XT4, BG/L, BG/L optimal and BG/L at Livermore]
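The "compute power" metric on the y-axis is simply the particle count divided by the wall-clock time of one time step, expressed in millions:

    compute power (millions of particles) = N_particles / (10^6 x seconds per step)
    e.g. 2e9 particles advanced one step in 4 s  ->  500   (illustrative numbers, not taken from the chart)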
Compute Power of the Gyrokinetic Toroidal Code (projection)
BlueGene can reach 150 billion particles in 2008 and more than 1 trillion in 2011; POWER6 can reach 1 billion particles in 2008 and more than 0.3 trillion in 2011.
[Chart: number of particles (in millions) moved one step in one second vs. number of cores (10 to 10,000,000), for Cray XT3/XT4, BG/L, BG/L optimal, BG/L at Livermore, BG/P at 3.5 PF, IBM Power and P6 at 300 TF]
Rechenzentrum Garching at BG Watson: GENE
Strong scaling of GENE v11+ for a problem size of 300-500 GB, with measurement points at 1k, 2k, 4k, 8k and 16k processors, normalized to 1k processors. Quasi-linear scaling has been observed, with a parallel efficiency of 95% on 8k processors and 89% on 16k processors.
By Hermann Lederer*, Reinhard Tisma* and Frank Jenko+ (RZG* and IPP+), March 21-22, 2007
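Parallel efficiency here is the usual strong-scaling ratio against the 1k-processor baseline, where T_N is the wall-clock time on N processors:

    E(N) = (1024 x T_1k) / (N x T_N)
    E(16k) = 0.89  =>  the 16k-processor run is about 16 x 0.89 ≈ 14.2 times faster than the 1k run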
Current HPC Systems Characteristics
                              | IBM POWER | IBM BlueGene/P (#1 worldwide) | MareNostrum (#1 in Europe) | Intel (Clovertown)
GF/socket                     | 37.6      | 13.6   | 18.4  | 37.328
TF/rack                       | 1.8       | 13.9   | 3.1   | 4.48
GB/core                       | 4         | 0.5    | 2     | 1
GB/rack                       | 384       | 2048   | 672   | 480
Mem BW (byte/flop)            | 1.5       | 1      | -     | 0.34
Mem BW (TB/s)                 | 2.7       | 13.9   | -     | 1.5
P-P interconnect (byte/flop)  | 0.1       | 0.75   | 0.014 | 0.1
Kilowatt/rack                 | 35        | 35     | 25    | 25
Space per 100 TF              | 1500      | 170    | 1200  | 1200
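The byte/flop rows follow directly from the bandwidth and peak-rate rows, for example:

    IBM POWER:       2.7 TB/s  / 1.8 TF per rack   = 1.5 bytes/flop
    IBM BlueGene/P:  13.9 TB/s / 13.9 TF per rack  = 1.0 byte/flop
    Intel:           1.5 TB/s  / 4.48 TF per rack  ≈ 0.34 bytes/flop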
Summary
IBM is deeply involved in ITER applications through its collaborations:
- Princeton Plasma Physics Laboratory
- Max-Planck-Institut für Plasmaphysik / Rechenzentrum Garching
- Barcelona Supercomputing Center
- Oak Ridge National Laboratory
IBM is also involved in laser-plasma fusion through its collaborations:
- Lawrence Livermore National Laboratory
- Forschungszentrum Jülich
IBM offers multiple platforms to address ITER needs:
- POWER: high memory capacity per node, moderate interprocessor bandwidth, moderate scalability – capability and capacity machine
- Blue Gene: low power, low memory capacity per node, high interprocessor bandwidth, highest scalability – capability and capacity applications
- System x and white-box clusters: moderate memory capacity per node, low interprocessor bandwidth, limited to moderate scalability – mostly a capacity machine
Backup
What BG brings to Core Turbulence Transport
Benchmark case CYCLONE:
- GENE: < 1 day on 64 procs; a few hours on 1,024 procs BG/L
- GYSELA: ~2.5 days on 64 procs
- ORB5: < 1 day on 64 procs; a few hours on 1,024 procs BG/L
Similar ITER-size benchmark:
- GENE: ~½ day on 6K procs BG/L
- GYSELA: ~10 days on 1,024 procs
- ORB5: ~½ day on 16K procs BG/L; ~1 week on 256 procs PC cluster
Courtesy José Mª Cela, Director of Applications, BSC
The Gyrokinetic Toroidal Code GTC
Description:
- Particle-in-cell (PIC) code developed by Zhihong Lin (now at UC Irvine)
- Non-linear gyrokinetic simulation of microturbulence [Lee, 1983]
- Particle-electric field interaction treated self-consistently
- Uses magnetic field-line-following coordinates
- Guiding-center Hamiltonian [White and Chance, 1984]
- Non-spectral Poisson solver [Lin and Lee, 1995]
- Low numerical noise algorithm (δf method)
- Full-torus (global) simulation
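A particle-in-cell time step, stripped of the gyrokinetic and toroidal-geometry machinery, has the scatter/solve/gather/push structure sketched below. This is a minimal 1D electrostatic example in C; the field_solve callback stands in for GTC's non-spectral Poisson solver, and the Marker type is illustrative rather than GTC's actual data layout:

    #include <stddef.h>

    typedef struct { double x, v, w; } Marker;   /* position, velocity, particle weight */

    /* One PIC step on a periodic 1D grid of ng cells with spacing dx. */
    void pic_step(Marker *p, size_t np,
                  double *rho, double *efield, size_t ng, double dx, double dt,
                  void (*field_solve)(const double *rho, double *efield, size_t ng, double dx))
    {
        /* 1. Scatter: deposit particle weights onto the charge-density grid. */
        for (size_t g = 0; g < ng; g++)
            rho[g] = 0.0;
        for (size_t i = 0; i < np; i++) {
            size_t g = (size_t)(p[i].x / dx) % ng;
            rho[g] += p[i].w / dx;
        }

        /* 2. Field solve on the grid (a Poisson solve in GTC). */
        field_solve(rho, efield, ng, dx);

        /* 3. Gather and push: interpolate the field to each particle and advance it. */
        for (size_t i = 0; i < np; i++) {
            size_t g = (size_t)(p[i].x / dx) % ng;
            p[i].v += dt * efield[g];              /* unit charge-to-mass ratio for brevity */
            p[i].x += dt * p[i].v;
            while (p[i].x < 0.0)       p[i].x += ng * dx;   /* periodic wrap */
            while (p[i].x >= ng * dx)  p[i].x -= ng * dx;
        }
    }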
BlueGene Key Applications - Major Scientific Advances
1. Qbox (DFT), LLNL: 56.5%, 2006 Gordon Bell Award, 64 racks
   CPMD, IBM: 30%, highest scaling on 64 racks
2. ddcMD (classical MD), LLNL: 27.6%, 2005 Gordon Bell Award, 64 racks
   MDCASK, LLNL: highest scaling on 64 racks
   SPaSM, LANL: highest scaling on 64 racks
   LAMMPS, SNL: highest scaling on 16 racks
   Blue Matter, IBM: highest scaling on 16 racks
   Rosetta, UW: highest scaling on 20 racks
   AMBER: 8 racks
3. Quantum Chromodynamics, IBM: 30%, 2006 Gordon Bell Special Award, 64 racks
   QCD at KEK: 10 racks
4. sPPM (CFD), LLNL: 18%, highest scaling on 64 racks
   Miranda, LLNL: highest scaling on 64 racks
   Raptor, LLNL: highest scaling on 64 racks
   DNS: highest scaling on 16 racks
   PETSc FUN3D, ANL: 14.2%
   NEK5 (thermal hydraulics), ANL: 22%
5. ParaDis (dislocation dynamics), LLNL: highest scaling on 64 racks
6. GFMC (nuclear physics), ANL: 16%
7. WRF (weather), NCAR: 14%, highest scaling on 64 racks
   POP (oceanography): highest scaling on 16 racks
8. HOMME (climate), NCAR: 12%, highest scaling on 32 racks
9. GTC (plasma physics), PPPL: highest scaling on 16 racks
   ORB5, RZG: highest scaling on 8 racks
   GENE, RZG: 12.5%, highest scaling on 16 racks
10. FLASH (Supernova Ia): highest scaling on 32 racks
    Cactus (general relativity): highest scaling on 16 racks
11. AWM (earthquake): highest scaling on 20 racks
[Closing slide: Science – Theory, Experiment, Simulation]