Exascale: Parallelism gone wild!
Craig Stunkel, IBM Research
IPDPS TCPP meeting, April 2010
IBM Research
IPDPS, April 2010 Exascale: Parallelism gone wild! 2
Outline
Why are we talking about Exascale?
Why will it be fundamentally different?
How will we attack the challenges?
– In particular, we will examine:
  • Power
  • Memory
  • Programming models
  • Reliability/Resiliency
Examples of Applications that Need Exascale
• Whole Organ Simulation
• Low Emission Engine Design
• Tumor Modeling
• Smart Grid
• CO2 Sequestration
• Nuclear Energy
• Li/Air Batteries
• Life Sciences: Sequencing
[Figure: Li/Air battery schematic — Li anode, solvated Li+ ion (aqueous case), O2, air cathode]
Beyond Petascale, applications will be materially transformed*
Climate: Improve our understanding of the complex biogeochemical cycles that underpin global ecosystem functions and control the sustainability of life on Earth
Energy: Develop and optimize new pathways for renewable energy production…
Biology: Enhance our understanding of the roles and functions of microbial life on Earth and adapt these capabilities for human use…
Socioeconomics: Develop integrated modeling environments that couple the wealth of observational data and complex models to economic, energy, and resource models incorporating the human dynamic, enabling large-scale global change analysis
* “Modeling and Simulation at the Exascale for Energy and the Environment”, DoE Office of Science report, 2007.
Are we on track to Exascale machines?
Some IBM supercomputer sample points:
• 2008, Los Alamos National Lab: Roadrunner was the first peak Petaflops system
• 2011, U. of Illinois: Blue Waters will be around 10 Petaflops peak?
  – NSF “Track 1”, provides a sustained Petaflops system
• 2012, LLNL: Sequoia system, 20 Petaflops peak
So far the Top500 trend (10x every 3.6 years) is continuing.
What could possibly go wrong before Exaflops?
Microprocessor Clock Speed Trends
Managing power dissipation is limiting clock speed increases.
[Figure: clock-speed trend data against a 2004 frequency extrapolation]
Microprocessor Transistor Trend
Moore’s (original) Law is alive: transistor counts are still increasing exponentially.
Server Microprocessor Thread Growth
We are in a new era of massively multi-threaded computing.
Exascale requires much lower power/energy
Even for Petascale, energy costs have become a significant portion of TCO:
• The #1 Top500 system consumes 7 MW
  – 0.25 Gigaflops/Watt
For Exascale, 20-25 MW is the upper end of comfort
  – Anything more is a TCO problem for labs
  – And a potential facilities issue
Exascale requires much lower power/energy
For Exascale, 20-25 MW is the upper end of comfort.
For 1 Exaflops, this limits us to 25 pJ/flop
  – Equivalently, this requires ≥40 Gigaflops/Watt
Today’s best supercomputer efficiency: ~0.5-0.7 Gigaflops/Watt
Two orders of magnitude improvement required!
  – Far more aggressive than commercial roadmaps
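The pJ/flop and Gigaflops/Watt figures above are reciprocal views of the same energy budget, which makes the equivalence easy to check mechanically. A minimal sketch (the function names are mine, not from the talk):

```python
# Back-of-envelope check of the 25 pJ/flop <-> 40 GF/W equivalence
# stated on this slide. Power (W) = flops/s * energy/flop (J).

def gflops_per_watt(picojoules_per_flop):
    """Convert an energy-per-operation budget to an efficiency figure."""
    joules_per_flop = picojoules_per_flop * 1e-12
    return (1.0 / joules_per_flop) / 1e9   # flops per joule, in Gigaflops/W

def power_megawatts(flops, pj_per_flop):
    """Total power drawn by a machine sustaining `flops` operations/s."""
    return flops * pj_per_flop * 1e-12 / 1e6

print(gflops_per_watt(25))        # ~40 GF/W, matching the slide
print(power_megawatts(1e18, 25))  # ~25 MW for 1 Exaflops
```

By the same conversion, today’s ~0.5 GF/W corresponds to ~2000 pJ/flop, which is where the two-orders-of-magnitude gap comes from.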
A surprising advantage of low power
Lower-power processors permit more ops/rack!
  – Even though more processor chips are required
  – Less variation in heat flux permits more densely packed components
  – Result: more ops/ft²
Blue Gene/P: space-saving, power-efficient packaging
• Chip: 4 processors (supports 4-way SMP), 8 MB EDRAM, 13.6 GF/s
• Compute Card: 1 chip + 20 DRAMs, 2-4 GB DDR, 13.6 GF/s
• Node Card: 32 compute cards (32 chips, 4x4x2), 0-2 I/O cards, 64-128 GB, 435 GF/s
• Rack: 32 node cards (1024 chips, 4096 processors), cabled 8x8x16, 2-4 TB, 14 TF/s
• System: 1 to 72 or more racks, 144 TB+, 1 PF/s+
A perspective on Blue Gene/L
How do we increase power efficiency O(100)?
• Crank down voltage
• Smaller devices with each new silicon generation
• Run cooler
• Circuit innovation
• Closer integration (memory, I/O, optics)
But with general-purpose core architectures, we still can’t get there.
Core architecture trends that combat power
Trend #1: Multi-threaded multi-core processors
  – Maintain or reduce frequency while replicating cores
Trend #2: Wider SIMD units
Trend #3: Special (compute) cores
  – Power and density advantage for applicable workloads
  – But can’t handle all application requirements
Result: heterogeneous multi-core
Processor versus DRAM costs
Memory costs
Memory costs are already a significant portion of system costs.
Hypothetical 2018 system decision-making process:
  – How much memory can I afford?
  – OK, now throw in all the cores you can (for free)
Memory costs: back of the envelope
There is (some) limit on the maximum system cost
  – This will determine the total amount of DRAM
For an Exaflops system, one projection:
  – Try to maintain the historical 1 B/F of DRAM capacity
  – Assume: 8 Gb chips in 2018 @ $1 each
  – $1 billion for DRAM (a bit unlikely)
We must live with less DRAM per core unless and until DRAM alternatives become reality.
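The projection above is simple enough to check mechanically. A sketch using only the slide’s assumed numbers (1 B/F of capacity, 8 Gb ≈ 1 GB chips at $1 each); the variable names are mine:

```python
# Rough DRAM cost projection from this slide. All inputs are the
# slide's assumptions for 2018, not forecasts of my own.

EXAFLOPS = 1e18                  # target machine: 1 Exaflops
bytes_needed = EXAFLOPS * 1.0    # historical 1 byte of DRAM per flop/s
bytes_per_chip = 8e9 / 8         # 8 Gb chip ~= 1 GB (decimal, as on the slide)
chips = bytes_needed / bytes_per_chip
cost = chips * 1.0               # $1 per chip

print(f"{chips:.0e} chips, ${cost / 1e9:.0f} billion for DRAM alone")
```

A billion dollars of DRAM alone is why the slide concludes we must live with less DRAM per core.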
Getting to Exascale: parallelism gone wild!
1 Exaflops is 10^9 Gigaflops.
For 3 GHz operation (perhaps optimistic):
  – 167 Million FP units!
Implemented via a heterogeneous multi-threaded multi-core system.
Imagine cores with beefy SIMD units containing 8 FPUs. This still requires over 20 Million cores.
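The 167 million figure follows from the stated clock rate if each FP unit retires a fused multiply-add (2 flops) per cycle; that last assumption is mine, chosen because it reproduces the slide’s arithmetic:

```python
# Reproduce the slide's FPU and core counts. Assumption (mine, to match
# the slide's numbers): each FP unit completes one fused multiply-add,
# i.e. 2 flops, per clock cycle.

target_flops = 1e18       # 1 Exaflops
clock_hz = 3e9            # 3 GHz, "perhaps optimistic"
flops_per_fpu_cycle = 2   # fused multiply-add

fpus = target_flops / (clock_hz * flops_per_fpu_cycle)
cores = fpus / 8          # "beefy SIMD units containing 8 FPUs"

print(f"{fpus / 1e6:.0f} million FP units")   # ~167 million
print(f"{cores / 1e6:.1f} million cores")     # ~20.8 million
```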
Petascale
Exascale
Programming issues
Many cores per node
  – Hybrid programming models to exploit node shared memory?
    • E.g., OpenMP on node, MPI between nodes
  – New models?
    • E.g., transactional memory, thread-level speculation
  – Heterogeneous (including simpler) cores
    • Not all cores will be able to support MPI
At the system level:
  – Global addressing (PGAS and APGAS languages)?
Limited memory per core
  – Will often require new algorithms to scale
Programming models, Salishan conference, April 2009
Different approaches to exploit parallelism
[Diagram: approaches placed on a spectrum of programming intrusiveness, from “no change to customer code” (traditional and auto-parallelizing compilers on a single-thread program), through directives + compiler (annotated program, APGAS annotations for existing languages), to “rewrite program” (parallel language compilers, PGAS/APGAS parallel languages) — combined with hardware innovations: special cores/heterogeneity, speculative threads, multicore/SMP clusters]
Potential migration paths: clusters, with heterogeneity/accelerators
[Diagram: migration paths among programming models. Starting points: C/C++/Fortran/Java (Base), Base/OpenMP, Base and MPI, Base/OpenMP and MPI, Base/OpenMP+ and MPI. Accelerator routes: CUDA, Base/OpenCL, Base/OpenCL and MPI, libspe, ALF, RapidMind, Charm++, GEDAE/streaming models. Destination: PGAS/APGAS. Arrows labeled “scale”, “harness accelerators”, and “make portable, open”. Color key: green = open, widely available; blue = somewhere in between; red = proprietary.]
Reliability / Resiliency
From IESP: “The advantage of robustness on exascale platforms will eventually override concerns over computational efficiency”
With each new CMOS generation, susceptibility to faults and errors is increasing:
  – For 45 nm and beyond, soft errors in latches may become commonplace
• Need changes in latch design (but requires more power)
• Need more error-checking logic (oops, more power)
• Need means of locally saving recent state and rolling back inexpensively to recover on-the-fly
• Hard failures reduced by running cooler
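The save-recent-state-and-roll-back idea can be sketched in a few lines. Everything here (names, the fault model, the checkpoint interval) is illustrative, not from the talk:

```python
import copy
import random

def run_with_rollback(step, state, n_steps, checkpoint_every=10):
    """Advance `state` with `step`, snapshotting periodically; on a
    *detected* fault, roll back to the last snapshot and retry."""
    snapshot = copy.deepcopy(state)
    i = 0
    while i < n_steps:
        try:
            state = step(state, i)
        except RuntimeError:                    # a detected transient fault
            state = copy.deepcopy(snapshot)     # restore last good state
            i = (i // checkpoint_every) * checkpoint_every
            continue
        i += 1
        if i % checkpoint_every == 0:
            snapshot = copy.deepcopy(state)     # cheap local checkpoint

    return state

# Illustrative workload: increment a counter, with random transient faults.
random.seed(1)

def step(s, i):
    if random.random() < 0.05:
        raise RuntimeError("transient fault")
    return s + 1

print(run_with_rollback(step, 0, 100))  # 100: faults are absorbed by rollback
```

The point of the slide is that this recovery must be local and inexpensive; a real machine would snapshot to nearby memory or NVRAM rather than a global file system.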
Shift Toward Design-for-Resilience
Resilient design techniques, applied through all levels of the design, will be required to ensure functionality and fault tolerance; architecture-level solutions are indispensable to ensure yield.
• Device/Technology: controlling and modeling variability
• Circuit: innovative topologies (read/write assist, …); redundancy; circuit adaptation driven by sensors
• Micro-Architecture: heterogeneous core frequencies; defect-tolerant PE arrays; defect-tolerant function-optimized CPUs; on-line testing/verification
Reliability: silent (undetected) errors
How often are silent errors already occurring in high-end systems today?
  – With Exascale systems we can compute the wrong answer 1000x faster than with Petascale systems
Silent error rates are a far more serious concern for supercomputers than for typical systems
  – Exascale systems will need to be built from the ground up for error detection and recovery
    • Including the processor chips
Fault-tolerant applications can help.
Some other issues we didn’t cover
• Interconnection networks
• Operating systems
• Debugging and monitoring
• Performance tools
• Algorithms
• Storage and file systems
• Compiler optimizations
• Scheduling
Perspective on supercomputer trends
Vector systems gave way to killer micros.
Clusters of killer micros and SMPs have ruled for almost 20 years.
The ASCI program drove the innovation for these systems
  – Leveraging commodity micros with interconnect, …
However, commodity killer micros aren’t likely to be the answer for Exascale
  – Back to the drawing board, with investment required from the ground up
A “Jeff Foxworthy” take on Exascale
If your system energy efficiency is >100 pJ/flop
  – You might *not* have an Exascale system
If your algorithm doesn’t partition data well
  – You might *not* have an Exascale algorithm
If your application is difficult to perfectly load-balance
  – You might *not* have an Exascale application
If message-passing is the only means of providing parallelism for your application
  – You might *not* have an Exascale application
Concluding thoughts
Getting to Exascale/Exaflops performance within 10 years will be tremendously challenging
  – Power and cost constraints require significant innovation
  – Success is not a foregone conclusion
Processor architecture and technology
  – Low-voltage many-core, SIMD, heterogeneity, fault tolerance
Memory and storage technology
  – Closer integration, limited size, and Phase Change Memory
Programming models and tools
  – Must deal with parallelism gone wild!
  – Hybrid programming models, PGAS languages
An exciting time for parallel processing research!
Exascale