high-performance geometric multigrid (hpgmg) and ... · i compute meaningful quantities – needed...

28
High-performance Geometric Multigrid (HPGMG) and quantification of performance versatility This talk: https://jedbrown.org/files/20160217-CISLVersatility.pdf Jed Brown [email protected] (CU Boulder) CISL Seminar, NCAR, 2016-02-17

Upload: others

Post on 17-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

High-performance Geometric Multigrid (HPGMG)and quantification of performance versatility

This talk:https://jedbrown.org/files/20160217-CISLVersatility.pdf

Jed Brown [email protected] (CU Boulder)

CISL Seminar, NCAR, 2016-02-17

Page 2: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

What is performance?

Dimensions

I Model complexity

I AccuracyI Time

I per problem instanceI for the first instanceI compute time versus human time

I CostI incremental costI subsidized?

I Terms relevant to scientist/engineer

I Compute meaningful quantities – needed to make a decision orobtain a result of scientific value—not one iteration/time step

I No flop/s, number of elements/time steps

Page 3: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Work-precision diagram: de rigueur in ODE community

[Hairer and Wanner (1999)]

I Tests discretization, adaptivity, algebraic solvers, implementationI No reference to number of time steps, flop/s, etc.I Useful performance results inform decisions about tradeoffs.

Page 4: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Strong Scaling: efficiency-time tradeoff

I Good: shows absolute timeI Bad: log-log plot makes it difficult to discern efficiency

I Stunt 3: http://blogs.fau.de/hager/archives/5835I Bad: plot depends on problem size

Page 5: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Strong Scaling: efficiency-time tradeoff

I Good: absolute time, absolute efficiency (like DOF/s/cost)I Good: independent of problem size for perfect weak scalingI Bad: hard to see machine size (but less important)

Page 6: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Exascale Science & Engineering DemandsI Model fidelity: resolution, multi-scale, coupling

I Transient simulation is not weak scaling: ∆t ∼∆xI Analysis using a sequence of forward simulations

I Inversion, data assimilation, optimizationI Quantify uncertainty, risk-aware decisions

I Increasing relevance =⇒ external requirements on timeI Policy: 5 SYPD to inform IPCCI Weather, manufacturing, field studies, disaster response

I “weak scaling” [. . . ] will increasingly give way to “strong scaling”[The International Exascale Software Project Roadmap, 2011]

I ACME @ 25 km scaling saturates at < 10% of Titan (CPU) orMira

I Cannot decrease ∆x : SYPD would be too slow to calibrateI “results” would be meaningless for 50-100y predictions, a “stunt

run”I ACME v1 goal of 5 SYPD is pure strong scaling.

I Likely faster on Edison (2013) than any DOE machine –2020I Many non-climate applications in same position.

Page 7: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

HPL and the Top500 list

19931995

20002005

20102014

0.1

1

10

100

1000

10000

100000

1000000

10000000

100000000

1000000000

Sum

Top

#500

I High Performance LINPACKI Solve n×n dense linear system: O(N3/2) flops on N = n2 dataI Top500 list created in 1993 by Hans Meuer, Jack Dongarra, Erich

Strohmeier, Horst Simon

Page 8: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Role of HPL

I The major centers have their own benchmark suites (e.g.,CORAL)

I Nobody (vendors or centers) will say they built an HPL machine

I HPL ranking and peak flop/s are still used for press releasesI Machines need to be justified to politicians holding the money

I Politicians are vulnerable to propaganda and claims of inefficientspending

I It is naive to believe HPL has no influence on procurement or onscientists’ expectations

Page 9: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

1

10

2007 2008 2009 2010 2011 2012 2013 2014

FL

OP

pe

r B

yte

End of Year

Floating Point Operations per Byte, Double Precision

Radeon HD 3870

Radeon HD 4870

Radeon HD 5870

Radeon HD 6970

Radeon HD 6970

Radeon HD 7970 GHz Ed.

Radeon HD 8970

Tesla C1060

Tesla C1060

Xeon X5680

Xeon X5690

Tesla K20

Tesla K20X

CPUs, Intel

Xeon X5482

Xeon X5492

Xeon W5590

Tesla C2050

Tesla C2090Xeon E5-2690

Xeon E5-2697 v2

Xeon E5-2699 v3

GPUs, NVIDIA

GPUs, AMD

MIC, Intel

Xeon Phi X7120X

[c/o Karl Rupp]

Page 10: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Arithmetic intensity is not enough

I QR and LU factorization have same complexity.I Stable QR factorization involves more synchronization.I Synchronization is much more expensive on Xeon Phi.

Page 11: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

How much parallelism out of how much cache?

Processor v width threads F/inst latency L1D L1D/#par

Nehalem 2 1 2 5 32 KiB 1638 BSandy Bridge 4 2 2 5 32 KiB 819 BHaswell 4 2 4 5 32 KiB 410 BBG/P 2 1 2 6 32 KiB 1365 BBG/Q 4 4 2 6 32 KiB 682 BKNC 8 4 4 5 32 KiB 205 BTesla K20 32 * 2 10 64 KiB 102 B

I Most “fast” algorithms do about O(N) flops on N data

I xGEMM does O(N3/2) flops on N data

I Exploitable parallelism limited by cache and register load/store

I L2/L3 performance highly variable between architectures

Page 12: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Vectorization versus memory locality

I Each vector lane and pipelined instruction need their ownoperands

I Can we extract parallelism from smaller working set?I Sometimes, but more cross-lane and pipeline dependenciesI More complicated/creative code, harder for compiler

I Good implementations strike a brittle balance (e.g., Knepley,Rupp, Terrel; HPGMG-FE)

I Applications change discretization order, number of fields, etc.I CFD: 5-15 fieldsI Tracers in atmospheric physics: 100 speciesI Adaptive chemistry for combustion: 10-10000 speciesI Crystal growth for mesoscale materials: 10-10000 fields

I AoS or SoA?I Choices not robust to struct sizeI AoS good for prefetch and cache reuseI Can pack into SoA when necessary

Page 13: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

SPECint is increasing despite stagnant clock

100

101

102

103

104

105

106

107

1970 1980 1990 2000 2010 2020

Year

40 Years of Microprocessor Trend Data

Number ofLogical Cores

Frequency (Hz)

Single-ThreadPerformance

(SpecINT x 103)

Transistors(thousands)

Typical Power(Watts)

I Karl Rupp’s update to figure by Horowitz et al.

Page 14: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Algorithms keep pace with hardware (sometimes)

[c/o David Keyes]

I Opportunities now: uncertainty quantification, design

I Incentive to find optimal algorithms for more applications

Page 15: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

What does “representative” mean?

I Diverse applicationsI Explicit PDE solvers (seismic wave propagation, turbulence)I Implicit PDE solvers and multigrid methods (geodynamics,

structural mechanics, steady-state RANS)I Irregular graph algorithms (network analysis, genomics, game

trees)I Dense linear algebra and tensors (quantum chemistry)I Fast methods for N-body problems (molecular dynamics,

cosmology)I Cross-cutting: data assimilation, uncertainty quantification

I Diverse external requirementsI Real-time, policy, manufacturingI PrivacyI In-situ processing of experimental dataI Mobile/energy limitations

Page 16: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Necessary and sufficient

Goodhart’s LawWhen a measure becomes a target, it ceases to be a good measure.

I Features stressed by benchmark necessary for some apps

I Performance on benchmark sufficient for most apps

Page 17: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

HPGMG: a new benchmarking proposal

I https://hpgmg.org, [email protected] mailing list

I Mark Adams, Sam Williams (finite-volume), Jed (finite-element),John Shalf, Brian Van Straalen, Erich Strohmeier, Rich Vuduc

I Gathering momentum, SC14 BoF

I Implementations

Finite Volume memory bandwidth intensive, simple datadependencies, 2nd and 4th order

Finite Element compute- and cache-intensive, vectorizes,overlapping writes

I Full multigrid, well-defined, scale-free problem

I Matrix-free operators, Chebyshev smoothers

Page 18: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Full Multigrid (FMG): Prototypical Fast Algorithm

`coarse `coarse

`fineIhH IH

h , IHh Ih

H

I start with coarse grid

I truncation error within one cycle

I about five work units for many problems

I no “fat” left to trim – robust to gamingI distributed memory – restrict active process set using Z-order

I O(log2 N) parallel complexity stresses networkI scale-free specification

I no mathematical reward for decomposition granularityI don’t have to adjudicate “subdomain”

Page 19: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Multigrid design decisions

I Q2 finite elementsI Partition of work not partition of data – sharing/overlapping writesI Q2 is a middle-ground between lowest order and high orderI Matrix-free pays off, tensor-product element evaluation

I Linear elliptic equation with manufactured solutionI Mapped coordinates

I More memory streams, increase working set, longer critical pathI No reductions

I Coarse grid is strictly more difficult than reductionI Not needed because FMG is a direct method

I Chebyshev/Jacobi smoothers, V (3,1) cycleI Multiplicative smoothers hard to verify in parallelI Avoid intermediate scales (like Block Jacobi/Gauss-Seidel)

I Full Approximation Scheme

Page 20: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

HPGMG-FE on Edison, SuperMUC, Titan

Titan >200ms

varia

bilit

y

1.6B

155B

309B12.9B

Page 21: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

HPGMG-FE on Edison (Aries, E5-2695v2)

155B

69B

12.9B

4.8B1.6B

variability

Page 22: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

HPGMG-FV, 2015-11, 2nd order

Page 23: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

HPGMG-FV, 2015-11, 4th order

Page 24: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Messaging from threaded code

I Off-node messages need to be packed and unpacked

I Many MPI+threads apps pack in serial – bottleneckI Extra software synchronization required to pack in parallel

I Formally O(logT ) critical path, T threads/NIC contextI Typical OpenMP uses barrier – oversynchronizes

I MPI_THREAD_MULTIPLE – atomics and O(T ) critical path

I Choose serial or parallel packing based on T and messagesizes?

I Hardware NIC context/core now, maybe not in future

I What is lowest overhead approach to message coalescing?

Page 25: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

HPGMG-FV: flat MPI vs MPI+OpenMP (Aug 2014)

0.00  

0.05  

0.10  

0.15  

0.20  

0.25  

0.30  

0.35  

0.40  

1   10   100   1,000   10,000   100,000  

HPG

MG-­‐FV  Solve  Tim

e  (secon

ds)  

NUMA  Nodes  (2M  DOF/NUMA  Node)  

HPGMG-­‐FV  Solve  Time  Mira  

Edison  

Hopper  

Stampede(SNB)  

Peregrine  

K  

Edison(Flat  MPI)  

K  (Flat  MPI)  

Carver  

I c/o Sam Williams

Page 26: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

CAM-SE dynamics numbers

I 25 km resolution, 18 simulated seconds/RK stage

I Current performance at strong scaling limit

Edison 3 SYPDTitan 2 SYPDMira 0.9 SYPD

I Performance requirement: 5 SYPD (about 2000x faster than realtime)

I 10 ms budget per dynamics stageI Increasing spatial resolution decreases this budget

I ACME strong scaling saturates while too small for the CapabilityQueue on DOE LCFs

I Null hypothesis: Edison will run ACME faster than any DOEmachine through 2020

I Difficult to get large allocations

Page 27: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Tim Palmer’s call for 1km (Nature, 2014)

I Would require 104 more total work than ACME target resolutionI 5 SYPD at 1km is like 75 SYPD at 15km, assuming infinite

resource and perfect weak scalingI Two choices:

1. compromise simulation speed—this would come at a high price,impacting calibration, data assimilation, and analysis; or

2. ground-up redesign of algorithms and hardware to cut latency by afactor of 20 from that of present hardware

I DE Shaw’s Anton is an example of Option 2I Models need to be constantly developed and calibrated

I custom hardware stifles algorithm/model innovationI Exascale roadmaps don’t make a dent in 20x latency problem

Page 28: High-performance Geometric Multigrid (HPGMG) and ... · I Compute meaningful quantities – needed to make a decision or ... Nehalem 2 1 2 5 32 KiB 1638 B Sandy Bridge 4 2 2 5 32

Outlook

I Application scaling mode must be scientifically relevantI Algorithmic barriers exist

I Throughput architectures are not just “hard to program”

I Vectorization versus memory locality

I Over-decomposition adds overhead and lengthens critical pathI Versatile architectures are needed for model coupling and

advanced analysisI How to include dynamic range in ranking metric?I Why is NERSC installing DRAM in Cori?

I Abstractions must be durable to changing scientific needs

I “Energy efficiency” is not if algorithms give up nontrivial constantsI What is the cost of performance variability?

I Measure best performance, average, median, 10th percentile?

I HPGMG https://hpgmg.orgI The real world is messy!