charm++ - migratable objects + asynchronous methods...

47
Charm++ Migratable Objects + Asynchronous Methods + Adaptive Runtime = Performance + Productivity Laxmikant V. Kale * Anshu Arya Nikhil Jain Akhil Langer Jonathan Lifflander Harshitha Menon Xiang Ni Yanhua Sun Ehsan Totoni Ramprasad Venkataraman * Lukasz Wesolowski Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign * {kale, ramv}@illinois.edu SC12: November 13, 2012 Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 1 / 37

Upload: others

Post on 09-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Charm++Migratable Objects + Asynchronous Methods + Adaptive Runtime

=Performance + Productivity

Laxmikant V. Kale∗

Anshu Arya Nikhil Jain Akhil Langer Jonathan LifflanderHarshitha Menon Xiang Ni Yanhua Sun Ehsan Totoni

Ramprasad Venkataraman∗ Lukasz Wesolowski

Parallel Programming LaboratoryDepartment of Computer Science

University of Illinois at Urbana-Champaign

∗{kale, ramv}@illinois.edu

SC12: November 13, 2012Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 1 / 37

Page 2: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Benchmarks

Required

1D FFT

Random Access

Dense LU Factorization

Optional

Molecular Dynamics

Adaptive Mesh Refinement

Sparse Triangular Solver

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 2 / 37

Page 3: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Metrics: Performance and ProductivityOur Implementations in Charm++

Productivity Performance

Code C++ CI Benchmark Driver Total Machine Max Performance HighlightSubtotal Cores

1D FFT 54 29 83 102 185 IBM BG/P 64K 2.71 TFlop/sIBM BG/Q 16K 2.31 TFlop/s

Random Access 76 15 91 47 138 IBM BG/P 128K 43.10 GUPSIBM BG/Q 16K 15.00 GUPS

Dense LU 1001 316 1317 453 1770 Cray XT5 8K 55.1 TFlop/s (65.7% peak)

Molecular Dynamics 571 122 693 n/a 693 IBM BG/P 128K 24 ms/step (2.8M atoms)IBM BG/Q 16K 44 ms/step (2.8M atoms)

Triangular Solver 642 50 692 56 748 IBM BG/P 512 48x speedup on 64 coreswith helm2d03 matrix

AMR 1126 118 1244 n/a 1244 IBM BG/Q 32k 22 timesteps/sec, 2d mesh,max 15 levels refinement

C++ Regular C++ codeCI Parallel interface descriptions and control flow DAG

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 3 / 37

Page 4: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

CapabilitiesDemonstrated Productivity Benefits

Automatic load balancing

Automatic checkpoints

Tolerating process failures

Asynchronous, non-blocking collective communication

Interoperating with MPI

For more info

http://charm.cs.illinois.edu/

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 4 / 37

Page 5: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Capabilities: Automated Dynamic Load Balancing

Measurement based fine-grained load balancing

I Principle of persistence - recent past indicates near future.I Charm++ provides a suite of load-balancers.

How to use?I Periodic calls in application - AtSync().I Command line argument - +balancer Strategy.

MetaBalancer - When and how to load balance?

I Monitors the application continuously and predicts behavior.I Decides when to invoke which load balancer.I Command line argument - +MetaLB

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 5 / 37

Page 6: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Capabilities: Automated Dynamic Load Balancing

Measurement based fine-grained load balancing

I Principle of persistence - recent past indicates near future.I Charm++ provides a suite of load-balancers.

How to use?I Periodic calls in application - AtSync().I Command line argument - +balancer Strategy.

MetaBalancer - When and how to load balance?

I Monitors the application continuously and predicts behavior.I Decides when to invoke which load balancer.I Command line argument - +MetaLB

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 5 / 37

Page 7: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Capabilities: Automated Dynamic Load Balancing

Measurement based fine-grained load balancing

I Principle of persistence - recent past indicates near future.I Charm++ provides a suite of load-balancers.

How to use?I Periodic calls in application - AtSync().I Command line argument - +balancer Strategy.

MetaBalancer - When and how to load balance?

I Monitors the application continuously and predicts behavior.I Decides when to invoke which load balancer.I Command line argument - +MetaLB

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 5 / 37

Page 8: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Capabilities: Checkpointing Application State

Checkpointing to disk for split executionCkStartCheckpoint(callback)

I Designed for applications need to run for a long period, but cannot getall the allocation needed at one time.

Restart applications from checkpoint on any number of processors

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 6 / 37

Page 9: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Capabilities: Tolerating Process Failures

Double in-memory checkpointing for online recoveryCkStartMemCheckpoint(callback)

I To tolerate the more and more frequent failures in HPC system.

Injecting failure and automatically detection of failuresCkDieNow()

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 7 / 37

Page 10: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Capabilities: InteroperabilityInvoke Charm++ from MPI

Callable like other external MPI librariesUse MPI communicators to enable the following modes

MPI

Charm++Time...

P(1) P(2) P(N-1) P(N)

...

P(1) P(2) P(N-1) P(N)

...

P(1) P(2) P(N-1) P(N)

(a) Time Sharing (b) Space Sharing (c) Combined

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 8 / 37

Page 11: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Capabilities: InteroperabilityTrivial Changes to Existing Codes

Initialize and destroy Charm++ instances

Use interface functions to transfer control

//MPI Init and other basic initialization{ optional pure MPI code blocks }

//create a communicator for initializing Charm++MPI Comm split(MPI COMM WORLD, peid%2, peid, &newComm);CharmLibInit(newComm, argc, argv);

{ optional pure MPI code blocks }

//Charm++ library invocationif(myrank%2)

fft1d(inputData,outputData,data size);

//more pure MPI code blocks//more Charm++ library calls

CharmLibExit();//MPI cleanup and MPI Finalize

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 9 / 37

Page 12: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Capabilities: Asynchronous, Non-blocking CollectiveCommunication

Overlap collective communication with other work

Topological Routing and Aggregation Module (TRAM)I Transforms point-to-point communication into collectivesI Minimal topology-aware software routingI Aggregation of fine-grained communicationI Recombining at intermediate destinations

Intuitive expression of collectives through overloading constructs forpoint-to-point sends (e.g. broadcast)

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 10 / 37

Page 13: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

FFT: Parallel Coordination CodedoFFT()

for(phase = 0; phase < 3; ++phase) {atomic {

sendTranspose();

}for(count = 0; count < P; ++count)

when recvTranspose[phase] (fftMsg *msg) atomic {applyTranspose(msg);

}if (phase < 2) atomic {

fftw execute(plan);

if(phase == 0)

twiddle();

}}

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 11 / 37

Page 14: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

FFT: PerformanceIBM Blue Gene/P (Intrepid), 25% memory, ESSL /w fftw wrappers

256 512 1024 2048 4096 8192 16384 32768 6553610

1

102

103

104

Cores

GF

lop/

s

P2P All−to−allMesh All−to−allSerial FFT limit

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 12 / 37

Page 15: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

FFT: PerformanceIBM Blue Gene/P (Intrepid), 25% memory, ESSL /w fftw wrappers

256 512 1024 2048 4096 8192 16384 32768 6553610

1

102

103

104

Cores

GF

lop/

s

P2P All−to−allMesh All−to−allSerial FFT limit

Charm++ all-to-all using TRAM

Asynchronous, Non-blocking, Topology-aware, Combining, Streaming

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 12 / 37

Page 16: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Random Access

Productivity

Use point to point sends and let Charm++ optimize communication

Automatically detect and adapt to network topology of partition

Performance

Automatic communication optimization using TRAMI Aggregation of fine-grained communicationI Minimal topology-aware software routingI Recombining at intermediate destinations

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 13 / 37

Page 17: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Random Access: PerformanceIBM Blue Gene/P (Intrepid), BlueGene/Q (Vesta)

0.0625

0.25

1

4

16

64

128 512 2K 8K 32K 128K

GU

PS

Number of cores

43.10

Perfect ScalingBG/PBG/Q

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 14 / 37

Page 18: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

LU: Capabilities

Composable libraryI Modular program structureI Seamless execution structure (interleaved modules)

Block-centricI Algorithm from a block’s perspectiveI Agnostic of processor-level considerations

Separation of concernsI Domain specialist codes algorithmI Systems specialist codes tuning, resource mgmt etc

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 15 / 37

Page 19: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

LU: Capabilities

Composable libraryI Modular program structureI Seamless execution structure (interleaved modules)

Block-centricI Algorithm from a block’s perspectiveI Agnostic of processor-level considerations

Separation of concernsI Domain specialist codes algorithmI Systems specialist codes tuning, resource mgmt etc

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 15 / 37

Page 20: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

LU: Capabilities

Composable libraryI Modular program structureI Seamless execution structure (interleaved modules)

Block-centricI Algorithm from a block’s perspectiveI Agnostic of processor-level considerations

Separation of concernsI Domain specialist codes algorithmI Systems specialist codes tuning, resource mgmt etc

Lines of Code Module-specificCI C++ Total Commits

Factorization 517 419 936 472/572 83%Mem. Aware Sched. 9 492 501 86/125 69%Mapping 10 72 82 29/42 69%

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 15 / 37

Page 21: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

LU: Capabilities

Flexible data placementI Experiment with data layout

Memory-constrained adaptive lookahead

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 16 / 37

Page 22: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

LU: PerformanceWeak Scaling: (N such that matrix fills 75% memory)

0.1

1

10

100

128 1024 8192

TotalTFlop/s

Number of Cores

Theoretical peak on XT5Weak scaling on XT5

67%

67.4%

67.4%

67.1%

66.2%

65.7%

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 17 / 37

Page 23: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

LU: Performance... and strong scaling too! (N=96,000)

0.1

1

10

100

128 1024 8192

TotalTFlop/s

Number of Cores

Theoretical peak on XT5Weak scaling on XT5

Theoretical peak on BG/PStrong scaling on BG/P

60.3%

45%

40.8%31.6%

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 18 / 37

Page 24: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Optional BenchmarksWhy MD, AMR and Sparse Triangular Solver

Relevant scientific computing kernels

Challenge the parallelization paradigmI Load imbalancesI Dynamic communication structure

Express non-trivial parallel control flow

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 19 / 37

Page 25: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

LeanMDSLOC: 693

1 Mimics short-range force calculation in NAMD2 Resembles miniMD of Mantevo project (SLOC≈3000)3 Advanced features:

Metabalancer: automated dynamic load balancingFault tolerance: in-memory checkpointing based restartSplit Execution: checkpoint on x cores, restart on y cores

1 Away Decomposition 2 AwayX Decomposition

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 20 / 37

Page 26: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Code for FT and LB

if (stepCount % ldbPeriod == 0) {serial { AtSync(); }when ResumeFromSync() { }

}

if (stepCount % checkptFreq == 0) {serial {

//coordinate to start checkpointingcontribute(CkCallback(CkReductionTarget(Cell,startCheckpoint),thisProxy(0,0,0)));

}if (thisIndex.x == 0 && thisIndex.y == 0 && thisIndex.z == 0) {

when startCheckpoint() serial {CkCallback cb(CkReductionTarget(Cell,recvCheckPointDone), thisProxy);if (checkptStrategy == 0) CkStartCheckpoint(logs.c str(), cb);else CkStartMemCheckpoint(cb);

}}when recvCheckPointDone() { }

}

//kill one of processes to demonstrate fault toleranceif (stepCount == 30 && thisIndex.x == 1 && thisIndex.y == 1 && thisIndex.z == 0) serial {

if (CkHasCheckpoints()) {CkDieNow();

}}

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 21 / 37

Page 27: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Code for FT and LB

if (stepCount % ldbPeriod == 0) {serial { AtSync(); }when ResumeFromSync() { }

}

if (stepCount % checkptFreq == 0) {serial {

//coordinate to start checkpointingcontribute(CkCallback(CkReductionTarget(Cell,startCheckpoint),thisProxy(0,0,0)));

}if (thisIndex.x == 0 && thisIndex.y == 0 && thisIndex.z == 0) {

when startCheckpoint() serial {CkCallback cb(CkReductionTarget(Cell,recvCheckPointDone), thisProxy);if (checkptStrategy == 0) CkStartCheckpoint(logs.c str(), cb);else CkStartMemCheckpoint(cb);

}}when recvCheckPointDone() { }

}

//kill one of processes to demonstrate fault toleranceif (stepCount == 30 && thisIndex.x == 1 && thisIndex.y == 1 && thisIndex.z == 0) serial {

if (CkHasCheckpoints()) {CkDieNow();

}}

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 21 / 37

Page 28: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Code for FT and LB

if (stepCount % ldbPeriod == 0) {serial { AtSync(); }when ResumeFromSync() { }

}

if (stepCount % checkptFreq == 0) {serial {

//coordinate to start checkpointingcontribute(CkCallback(CkReductionTarget(Cell,startCheckpoint),thisProxy(0,0,0)));

}if (thisIndex.x == 0 && thisIndex.y == 0 && thisIndex.z == 0) {

when startCheckpoint() serial {CkCallback cb(CkReductionTarget(Cell,recvCheckPointDone), thisProxy);if (checkptStrategy == 0) CkStartCheckpoint(logs.c str(), cb);else CkStartMemCheckpoint(cb);

}}when recvCheckPointDone() { }

}

//kill one of processes to demonstrate fault toleranceif (stepCount == 30 && thisIndex.x == 1 && thisIndex.y == 1 && thisIndex.z == 0) serial {

if (CkHasCheckpoints()) {CkDieNow();

}}

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 21 / 37

Page 29: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

MD: Performance1 million atoms. IBM Blue Gene/P (Intrepid)

10

100

1000

10000

2k 4k 8k 16k 32k 64k 128k

Tim

e pe

r st

ep (

ms)

Number of cores

Performance on Intrepid (2.8 million atoms)

No LBHybrid LB

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 22 / 37

Page 30: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Fault Tolerance Support for MDMD: Checkpoint Time

0

10

20

30

40

50

60

2048 4096 8192 16384 32768

Tim

e(m

s)

Number of processes

LeanMD Checkpoint Time on BlueGene/Q

2.8 million1.6 million

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 23 / 37

Page 31: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Fault Tolerance Support for MDMD: Restart Time

0

20

40

60

80

100

120

140

160

2048 4096 8192 16384 32768

Tim

e(m

s)

Number of processes

LeanMD Restart Time on BlueGene/Q

2.8 million1.6 million

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 24 / 37

Page 32: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Meta-Balancer vs Periodic Load Balancing

32

64

128

256

512

1024

2048

64 128 256 512 1024

Elap

sed

time

(s)

LB Period

Elapsed time vs LB Period (BlueGene/P)

8k cores16k cores32k cores

64k cores128k cores

Cores No LB (s) Periodic LB (s) Meta-Balancer (s)

8k 666 504 41316k 336 260 27732k 171 131 13164k 122 104 100

128k 73 54 52

Frequent load balancingincreases total executiontime

Infrequent load balancingleads to load imbalanceand results in no gains

Meta-Balancer adaptivelyperforms load balancingto obtain best totalexecution time

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 25 / 37

Page 33: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Sparse Triangular Solver- Matrix Decomposition

Column Decomposition

Dense Parts Decomposition

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 26 / 37

Page 34: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Sparse Triangular Solver- Parallel Algorithm

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 27 / 37

Page 35: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Sparse Triangular Solver- Performance vs. SuperLU DIST

0.001

0.01

0.1

1

10

1 2 4 8 16 32 64 128 256 512

Solu

tion

Tim

e (s

)

Number of Cores

SuperLU_webbase-1MSuperLU_helm2d03SuperLU_largebasis

SuperLU_hoodslu_largebasis

slu_webbase-1Mslu_hood

slu_helm2d03

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 28 / 37

Page 36: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Sparse Triangular Solver- Productivity in Charm++

More complicated (higher performance) algorithm in 692 total SLOCsI vs. 897 SLOCs of SuperLU DIST triangular solver

Overdecomposition (with Round-Robin mapping) is essentialI Communication computation overlap, load balance

Creation of parallel units dynamicallyI Distributing dense regions

Message-driven nature and prioritiesI No need for something like MPI Iprobe

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 29 / 37

Page 37: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Adaptive Mesh Refinement

Sample simulation

Propagation of refinement decision messages

Stay Refine

d d + 1

Coarsen

d - 1

Coarsend - 1 d

Stay

d + 1

Sibling d

d + 2

d - 1 d d + 1

Refined + 2

Initial state

Received message

d

dRequired depth

Decision

Local error condition

Termination detection

*

Refine

Coarsen,Stay

*

Finite state machine for each block’s decision update

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 30 / 37

Page 38: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Adaptive Mesh Refinement

Sample simulation

Propagation of refinement decision messages

Stay Refine

d d + 1

Coarsen

d - 1

Coarsend - 1 d

Stay

d + 1

Sibling d

d + 2

d - 1 d d + 1

Refined + 2

Initial state

Received message

d

dRequired depth

Decision

Local error condition

Termination detection

*

Refine

Coarsen,Stay

*

Finite state machine for each block’s decision update

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 30 / 37

Page 39: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Adaptive Mesh Refinement

Charm++ Implementation

Blocks as virtual processors instead of each process containing many blocksI simplifies implementation

Blocks addressed with bit vector indicesI CharmRTS handles physical locations

Dynamic distributed load balancing

Algorithmic Improvements1

O(#blocks/P ) vs O(#blocks) memory per process

2 system quiesence states vs #level reductions for mesh restructuring

O(1) vs O(logP ) time neighbor lookup

1Langer, et.al. Scalable Algorithms for Distributed Memory Adaptive Mesh Refinement. In 24th International

Conference on Computer Architecture and High Performance Computing (SBAC-PAD) 2012, NY, USA.

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 31 / 37

Page 40: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Adaptive Mesh Refinement

0.1

1

10

100

256 512 1k 2k 4k 8k 16k 32k

Tim

este

ps /

sec

Number of Ranks

IBM BG/Q Min−depth 4IBM BG/Q Min−depth 5

Timesteps per second strong scaling on IBM BG/Q with a

max depth of 15.

0

5

10

15

20

25

30

35

16 32 64 128 256 512 1k 2k 4k 8k 16k 32k

Rem

eshi

ng L

aten

cy T

ime

(ms)

Number of Ranks

Depth Range 4−9Depth Range 4−10Depth Range 4−11

The non-overlapped delay of remeshing in milliseconds on

IBM BG/Q. The candlestick graphs the minimum and maxi-

mum values, the 5th and 95th percentile, and the median.

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 32 / 37

Page 41: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Charm++ at SC12

Temperature-aware load balancing Tue @ 2:00 pm

NAMD at 200K+ cores Thu @ 11:00 am

For more info

http://charm.cs.illinois.edu/

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 33 / 37

Page 42: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Charm++Programming Model

Object-based Express logic via indexed collections ofinteracting objects (both data and tasks)

Over-decomposed Expose more parallelism than availableprocessors

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 34 / 37

Page 43: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Charm++Programming Model

Runtime-Assisted scheduling, observation-based adaptivity,load balancing, composition, etc.

Message-Driven Trigger computation by invoking remoteentry methods

Non-blocking, Asynchronous Implicitly overlapped data transfer

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 35 / 37

Page 44: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Charm++Program Structure

Regular C++ codeI No special compilers

Small parallel interface description fileI Can contain control flow DAGI Parsed to generate more C++ code

Inherit from framework classes toI Communicate with remote objectsI Serialize objects for transmission

Exploit modern C++ program design techniques (OO, generics etc)

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 36 / 37

Page 45: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Charm++Program Structure

Regular C++ codeI No special compilers

Small parallel interface description fileI Can contain control flow DAGI Parsed to generate more C++ code

Inherit from framework classes toI Communicate with remote objectsI Serialize objects for transmission

Exploit modern C++ program design techniques (OO, generics etc)

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 36 / 37

Page 46: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

Charm++Program Structure

Regular C++ codeI No special compilers

Small parallel interface description fileI Can contain control flow DAGI Parsed to generate more C++ code

Inherit from framework classes toI Communicate with remote objectsI Serialize objects for transmission

Exploit modern C++ program design techniques (OO, generics etc)

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 36 / 37

Page 47: Charm++ - Migratable Objects + Asynchronous Methods ...charm.cs.uiuc.edu/users/wesolwsk/presentation.pdf · Benchmarks Required 1D FFT Random Access Dense LU Factorization Optional

TRAM: Message Routing and Aggregation

Kale et al. (PPL, Illinois) Charm++ SC12: November 13, 2012 37 / 37