Charm++ & MPI: Combining the Best of Both Worlds
IPDPS: May 27, 2015
Nikhil Jain, Abhinav Bhatele, Jae-Seung Yeom, Mark F. Adams, Francesco Miniati, Chao Mei, Laxmikant V. Kale
charm.cs.illinois.edu/newPapers/15-17/interop.pdf


TRANSCRIPT

Page 1: Charm++ & MPI: Combining the Best of Both Worlds

Charm++ & MPI: Combining the Best of Both Worlds
IPDPS: May 27, 2015
Nikhil Jain, Abhinav Bhatele, Jae-Seung Yeom, Mark F. Adams, Francesco Miniati, Chao Mei, Laxmikant V. Kale

Pages 2-4: Motivation: additional capabilities and code reuse

• Multi-physics modeling and coupled simulations require sophisticated techniques, but...

• Most applications are developed in a single parallel language
  • Limited features
  • No code reuse across languages

• Interoperation of languages in an application
  • MPI + X, where MPI is across nodes and X is within a node
  • MPI + Charm++: MPI and Charm++ everywhere!

Pages 5-7: Charm++: object-based message-driven parallel programming

‣ Fundamental design attributes
  ➡ Overdecomposition
  ➡ Asynchronous message-driven execution
  ➡ Migratability

‣ Based on C++ objects

‣ Driven by an adaptive runtime system

[Figure: user view and system view. The user programs collections of objects, e.g. A[0..2], B[0], B[3], and a 2D collection C[i,j]; in the system view, the runtime distributes these objects across Processors 1-4, each running a scheduler and a location manager.]
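
To make these attributes concrete, here is a minimal, hypothetical Charm++ sketch (not from the paper; it assumes a hello.ci interface file declaring mainchare Main and array [1D] Hello with these constructors): the program creates many more chares than processors, and each chare runs when its creation message arrives rather than inside a user-written main loop.

#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* msg) {
    delete msg;
    // Overdecomposition: 8 chares per PE; the runtime maps them to processors
    CProxy_Hello::ckNew(8 * CkNumPes());
    // Exit once all chares have been created and all messages drained
    CkStartQD(CkCallback(CkCallback::ckExit));
  }
};

class Hello : public CBase_Hello {
public:
  Hello() {
    // Message-driven execution: this body runs when the creation message arrives
    CkPrintf("chare %d scheduled on PE %d\n", thisIndex, CkMyPe());
  }
};

#include "hello.def.h"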

Pages 8-9: Features: comp-comm overlap, load balancing, introspection...

Over-decomposition, migratability, message-driven execution, and an introspective and adaptive runtime system together provide:
• Automatic overlap of communication and computation
• Dynamic load balancing (topology-aware, scalable)
• Fault tolerance
• Compositionality
• Emulation for performance prediction
• Temperature/power/energy optimizations
• Shrink and expand
• Scalable tools

Applications: NAMD, ChaNGa, OpenAtom, EpiSimdemics, ClothSim, BRAMS, and many more...

Page 10: Related Work

• Harper et al.: PVM in the Legion environment
• MetaChaos: HPF + Chaos + pC++
• Kale et al.: MPI, PVM, and Charm++ on Converse
• OpenMP + MPI
• Dinan et al.: MPI + UPC
• Zhao et al.: active messages in MPI

Page 11: Novelty: control flow, code reuse, and performance studies

• The control flow styles of MPI and Charm++ are different
  • MPI is user-driven, while Charm++ is system-driven

• Minimal (re)implementation of languages
  • Focus on reuse of existing code with minor changes!
  • In contrast to interoperation via reimplementing MPI on Converse, this scheme works with any MPI

• Demonstration via performance studies at scale

Page 12: Control flow management in MPI vs Charm++

[Figure: control flow in the two models, with network progress at the bottom. MPI: user code makes MPI calls, which drive the network. Charm++: the Charm++ RTS selects the user code that will execute next.]

Page 13: Flow management solution I: concurrent threads

Execute each module/language in its own home thread.

Pros:
• Easy to understand and implement

Cons:
• Thread scheduling overhead
• Sub-optimal scheduling
• Adaptive scheduling requires significant code changes

(A sketch of this scheme follows below.)
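
As a hypothetical illustration (not the paper's code), the following C++ sketch captures the essence of this scheme: each module lives in its own home thread, and every control transfer pays for a condition-variable wake-up and a context switch, which is where the scheduling overhead comes from.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
int turn = 0;  // 0: the "MPI" thread runs; 1: the "Charm++" thread runs

void module(int id, const char* name, int phases) {
  for (int p = 0; p < phases; p++) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [id] { return turn == id; });  // blocked until scheduled
    std::printf("%s executes phase %d\n", name, p);
    turn = 1 - id;   // hand control to the other module's home thread
    cv.notify_all();
  }
}

int main() {
  std::thread t0(module, 0, "MPI module", 3);
  std::thread t1(module, 1, "Charm++ module", 3);
  t0.join();
  t1.join();
  return 0;
}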

Page 14: Flow management solution II: user controlled transfer

Expose the Charm++ scheduler at a coarse granularity.

Pros:
• Eliminates the thread overheads
• Reuse of existing code is easy

Cons:
• Switching decisions by user (or is it a disadvantage?)
• Inter-module overlap is absent

[Figure: control transfers 1 through 5 alternate between MPI modules and Charm modules in a user-chosen order.]

Pages 15-19: Language APIs: additions to enable interoperation

๏ Initialize: set up to create a module/language instance
  ➡ MPI_Init/Comm_create, CharmLibInit

๏ Execute: make progress
  ➡ Implicit in MPI, StartCharmScheduler

๏ Transfer: stop execution
  ➡ Implicit in MPI, StopCharmScheduler/CkExit

๏ Clean up: destroy the instance
  ➡ MPI_Comm_free, CharmLibExit

Page 20: MPI code example: create language instances and execute

#include "mpi-interoperate.h"

int main(int argc, char **argv) {
  int myrank;
  MPI_Comm newComm;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_split(MPI_COMM_WORLD, myrank % 2, myrank, &newComm);
  if (myrank % 2) {
    // Create a Charm++ instance on a subset of processes
    CharmLibInit(newComm, argc, argv);
    StartCharm(16);   // call the Charm++ library
    CharmLibExit();   // destroy the Charm++ instance
  } else {
    // MPI work on the rest of the processes
  }
  MPI_Finalize();
}

Page 21: Charm++ code example: interface function

#include "mpi-interoperate.h"

// Invoked from MPI; marks the beginning of Charm++
void StartCharm(int elems) {
  if (CkMyPe() == 0) {
    workerProxy.StartWork(elems);
  }
  StartCharmScheduler();
}

// Charm++ entry method that deactivates the scheduler
void Worker::StartWork(int elems) {
  // Charm++ work on a subset of processes
  CkExit();
}

Pages 22-24: Resource sharing: time, space, and hybrid division

[Figure: three ways of dividing processors P(1)...P(N) between MPI and Charm++ over time. (a) Time division: all processors alternate between MPI and Charm++ phases. (b) Space division: one subset of processors runs MPI while the rest run Charm++. (c) Hybrid: a combination of both.]
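
For pattern (a), a hedged sketch of the driver loop, built from the API introduced above: it assumes the Charm++ instance may span the whole communicator, and DoMPIPhase is a hypothetical stand-in for the MPI module's work.

#include "mpi-interoperate.h"

void DoMPIPhase(MPI_Comm comm);  // hypothetical MPI module phase

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  // Time division: one Charm++ instance on all processes
  CharmLibInit(MPI_COMM_WORLD, argc, argv);
  for (int step = 0; step < 10; step++) {
    DoMPIPhase(MPI_COMM_WORLD);  // MPI module owns the processors...
    StartCharm(16);              // ...then Charm++ runs until CkExit()
  }
  CharmLibExit();
  MPI_Finalize();
  return 0;
}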

Page 25: Data Sharing and Rank Mapping

• Data sharing
  ➡ Shared-memory pointer-based
  ➡ Data repository

• Rank mapping (following Dinan et al. for MPI + UPC)
  ➡ One to one
  ➡ Many to one
  ➡ One to none

Page 26: Application Studies

Pages 27-28: CHARM: scaling bottleneck caused by global sorting

๏ CHARM is a cosmology code based on Chombo (MPI)
  ‣ Non-uniform particle distribution
  ‣ Load balancing and locality require global sorting every step

[Figure: baseline performance of CHARM on Cray XE6; time (s) on a log scale vs number of cores (8 to 4096), for Advance and Multiway-Merge Sort. The time spent in sorting increases while the time spent in computation stays constant: a scaling bottleneck!]

Pages 29-32: Eliminating the bottleneck via a high-performance sorting library

‣ What does efficient sorting need?
  ➡ Asynchrony and non-blocking communication
  ➡ Overlap of local sorting with communication (see the sketch after this list)

‣ Option 1: Implement a new MPI based code and optimize it!

‣ Option 2: Reuse an existing sorting library
  ➡ HistSort: a highly scalable sorting library in Charm++ (Solomonik et al.)
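
The overlap requirement in one generic MPI sketch (an illustration, not HistSort itself): post the non-blocking exchange first, sort the local block while messages are in flight, and only then wait. Buffer sizes are assumed to be known in advance.

#include <mpi.h>
#include <algorithm>
#include <vector>

// Exchange buffers with a partner rank while sorting the local block.
void exchange_and_sort(MPI_Comm comm, int partner,
                       const std::vector<int>& to_send,
                       std::vector<int>& local,
                       std::vector<int>& received) {
  MPI_Request reqs[2];
  MPI_Irecv(received.data(), (int)received.size(), MPI_INT,
            partner, 0, comm, &reqs[0]);
  MPI_Isend(to_send.data(), (int)to_send.size(), MPI_INT,
            partner, 0, comm, &reqs[1]);
  std::sort(local.begin(), local.end());  // local sort overlaps communication
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}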

Page 33: Using HistSort in CHARM: time sharing MPI and Charm++

Before:

/* CHARM code that prepares the input */
...
@195 lines of Multi-way Merge sort in MPI@
/* Computation code in CHARM */
...

After:

/* CHARM code that prepares the input */
...
// call to HistSort
HistSorting<key_type, std::pair<partType, char[MAX_PART_SZ]> >(
    loc_s_len, dataIn, &loc_r_len, &dataOut);
/* Computation code in CHARM */
...

Page 34: Interoperable HistSort library: minor changes lead to reuse

// Interface function for HistSort
template <class key, class value>
void HistSorting(int input_elems_, kv_pair<key, value>* dataIn_,
                 int* output_elems_, kv_pair<key, value>** dataOut_) {
  // Store parameters in global locations
  dataIn = (void*)dataIn_;
  dataOut = (void**)dataOut_;
  in_elems = input_elems_;
  out_elems = output_elems_;
  // Initiate a message to the main object
  if (CkMyPe() == 0) {
    static CProxy_Main<key, value> mainProxy =
        CProxy_Main<key, value>::ckNew(CkNumPes());
    mainProxy.DataReady();
  }
  StartCharmScheduler();
}

Page 35: Weak scaling: time spent in sorting increases slowly

[Figure: weak scaling on Cray XE6; time (s) on a log scale vs number of cores (8 to 4096), for Advance, Multiway-Merge Sort, and Charm++ HistSort.]

Page 36: Strong scaling: 48x speedup on 16k cores of Hopper

[Figure: strong scaling on Cray XE6 (Hopper); time (s) on a log scale vs number of cores (512 to 16384), for Multiway-Merge Sort and Charm++ HistSort.]

Pages 37-40: EpiSimdemics: IO leads to performance and productivity loss

• Agent-based simulator used to study the spread of contagious diseases over social networks; implemented in Charm++

• Requires reading many large input files: an hour-long startup!
  • Cause: sequential input

• Many large output files, written periodically
  • Writes to multiple files, aggregates later
  • The limited number of allowed open file descriptors prevents execution

Page 41: MPI IO with EpiSimdemics

• MPI IO: portable, often vendor-implemented

• Use of MPI collectives to aggregate IO meta-data

• IO module executed in a hybrid manner with the rest of the code
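
A minimal sketch of the pattern described above (an assumption about the shape of the code, not the EpiSimdemics IO module itself): ranks agree on file offsets with a collective, then write one shared file with collective MPI IO, so each process needs only a single file descriptor.

#include <mpi.h>

// Each rank writes its buffer into one shared file at a computed offset.
void write_shared_file(MPI_Comm comm, const char *path,
                       const char *buf, int len) {
  int rank;
  MPI_Comm_rank(comm, &rank);

  // Aggregate IO meta-data: exclusive prefix sum of the local lengths
  long long my_len = len, offset = 0;
  MPI_Exscan(&my_len, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
  if (rank == 0) offset = 0;  // MPI_Exscan leaves rank 0's result undefined

  MPI_File fh;
  MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                MPI_INFO_NULL, &fh);
  MPI_File_write_at_all(fh, (MPI_Offset)offset, buf, len,
                        MPI_CHAR, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);
}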

Page 42: Input performance: input time reduced to less than 10s

[Figure: time spent in input on Blue Gene/Q; input time (s) on a log scale vs number of cores (16k to 256k), for Schedule/Serial, Person/Serial, Schedule/MPI-IO, and Person/MPI-IO. Sequential reading of the Schedule file was not done at scale, to save CPU hours.]

Page 43: Output performance: write to a single file even at large core counts

[Figure: time spent in simulation + output on Blue Gene/Q; total execution time (s) vs number of cores (8k to 256k), with custom parallel IO and with MPI-IO. The custom IO failed at large core counts.]

Page 44: Summary of the application studies

Application              | Library  | Productivity                | Performance
CHARM                    | HistSort | 195 lines removed           | 48x speedup in sorting
EpiSimdemics             | MPI IO   | Writes to a single file     | 256x faster input
NAMD                     | FFTW     | 280 lines reduction         | Similar performance
Load balancing framework | ParMetis | Parallel graph partitioning | Faster applications

Page 45: Conclusion

• Interoperating Charm++ and MPI is easy

• Leads to several benefits

• Available in the production version of Charm++, with any MPI implementation:
  • http://charmplusplus.org
  • http://charm.cs.illinois.edu/manuals/html/charm++/25.html

Questions?