
Advanced Charm++ Tutorial

Charm Workshop Tutorial
Sameer Kumar
Orion Sky Lawlor
charm.cs.uiuc.edu
2004/10/19

How to Become a Charm++ Hacker
• Advanced Charm++
    • Advanced Messaging
    • Writing system libraries
        • Groups
        • Delegation
    • Communication framework
    • Advanced load-balancing
    • Checkpointing
    • Threads
    • SDAG


Advanced Messaging

Prioritized Execution
• If several messages are available, Charm++ processes the message with the highest priority
• Otherwise, it processes the oldest message (FIFO)
• Has no effect:
    • If only one message is available (common for network-bound applications!)
    • On outgoing messages
• Very useful for speculative work, ordering timesteps, etc.
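The scheduling rule on this slide can be sketched in plain C++. This is a minimal illustration, not the Charm++ scheduler: the names `Msg`, `LaterFirst`, and `SketchScheduler` are hypothetical, and the sketch assumes that a smaller integer means a more urgent message and that ties are served in arrival order.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical message record; not a Charm++ type.
struct Msg {
    int priority;        // smaller value = more urgent (assumption of this sketch)
    std::uint64_t seq;   // arrival order, used to break ties FIFO
    std::string name;
};

// Orders the heap so the most urgent, oldest message ends up on top.
struct LaterFirst {
    bool operator()(const Msg& a, const Msg& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;
        return a.seq > b.seq;
    }
};

class SketchScheduler {
public:
    void enqueue(int priority, const std::string& name) {
        queue_.push(Msg{priority, nextSeq_++, name});
    }
    bool empty() const { return queue_.empty(); }
    // Removes and returns the name of the next message to process.
    std::string dequeue() {
        Msg m = queue_.top();
        queue_.pop();
        return m.name;
    }
private:
    std::priority_queue<Msg, std::vector<Msg>, LaterFirst> queue_;
    std::uint64_t nextSeq_ = 0;
};
```

With only one message queued, `dequeue()` returns it regardless of its priority, which is the "has no effect" case above.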

Priority Classes
• The Charm++ scheduler has three queues: high, default, and low
• As signed integer priorities:
    -MAXINT .. -1       Highest priority
    0                   Default priority
    1 .. +MAXINT        Lowest priority
• As unsigned bitvector priorities:
    0x0000 .. 0x7FFF    Highest priority
    0x8000              Default priority
    0x8001 .. 0xFFFF    Lowest priority
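One way to see that the two tables describe the same ordering: flipping the sign bit of a signed priority yields an unsigned key that sorts the same way. The helper `bitvectorKey` is hypothetical, and whether the runtime converts priorities exactly like this is an assumption; the sketch uses 32-bit values rather than the 16-bit values shown on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: maps a signed priority to an unsigned key so that
// a smaller key means a higher priority, matching both tables above.
inline std::uint32_t bitvectorKey(std::int32_t prio) {
    // Conversion to unsigned is modulo 2^32; XOR then flips the sign bit.
    return static_cast<std::uint32_t>(prio) ^ 0x80000000u;
}
```

Under this mapping, priority 0 lands on the default key 0x80000000, negative priorities land below it, and positive priorities land above it.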

Prioritized Marshalled Messages
• Pass "CkEntryOptions" as the last parameter
• For signed integer priorities:
    CkEntryOptions opts;
    opts.setPriority(-1);
    fooProxy.bar(x, y, opts);
• For bitvector priorities:
    CkEntryOptions opts;
    unsigned int prio[2] = {0x7FFFFFFF, 0xFFFFFFFF};
    opts.setPriority(64, prio);
    fooProxy.bar(x, y, opts);

Prioritized Messages
• The number of priority bits is passed during message allocation:
    FooMsg *msg = new (size, nbits) FooMsg;
• Priorities are stored at the end of messages
• Signed integer priorities:
    *CkPriorityPtr(msg) = -1;
    CkSetQueueing(msg, CK_QUEUEING_IFIFO);
• Unsigned bitvector priorities:
    CkPriorityPtr(msg)[0] = 0x7fffffff;
    CkSetQueueing(msg, CK_QUEUEING_BFIFO);

Advanced Message Features
• Read-only messages
    • Entry method agrees not to modify or delete the message
    • Avoids message copy for broadcasts, saving time
• Expedited messages
    • Messages do not go through the Charm++ scheduler (faster)
• Immediate messages
    • Entries are executed in an interrupt or the communication thread
    • Very fast, but tough to get right

Read-Only, Expedited, Immediate
• All declared in the .ci file
    {
      .. ...
      entry [nokeep] void foo_readonly(Msg *);
      entry [expedited] void foo_exp(Msg *);
      entry [immediate] void foo_imm(Msg *);
      .. .. ..
    };
    // Immediate messages currently work only for NodeGroups


Groups

Object Groups
• A collection of objects (chares)
    • Also called branch office chares
    • Exactly one representative on each processor
    • Ideally suited for system libraries
• A single proxy for the group as a whole
• Similar to arrays:
    • Broadcasts, reductions, indexing
• But not completely like arrays:
    • Non-migratable; one per processor

Declarations
• .ci file
    group mygroup {
      entry mygroup();            // Constructor
      entry void foo(foomsg *);   // Entry method
    };
• C++ file
    class mygroup : public Group {
    public:
      mygroup() {}
      void foo(foomsg *m) { CkPrintf("Do Nothing"); }
    };

Creating and Calling Groups
• Creation
    p = CProxy_mygroup::ckNew();
• Remote invocation
    p.foo(msg);      // broadcast
    p[1].foo(msg);   // asynchronous invocation
• Direct local access
    mygroup *g = p.ckLocalBranch();
    g->foo(….);      // local invocation
• Danger: if you migrate, the group stays behind!


Delegation

Delegation
• Enables Charm++ proxy messages to be forwarded to a delegation manager group
• Delegation manager can trap calls to proxy sends and apply optimizations
• Delegation manager must inherit from CkDelegateMgr
• User program must call proxy.ckDelegate(mgrID);
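The delegation idea can be sketched outside Charm++ as a proxy that hands each send to a manager once one is attached. All names here (`SketchDelegateMgr`, `SketchProxy`, `CountingMgr`, `Network`) are hypothetical stand-ins, not the CkDelegateMgr interface; the manager in this sketch only counts and forwards, where a real one would apply an optimization.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records every "physical" send so the effect of delegation is visible.
struct Network {
    std::vector<std::string> wire;
};

// Hypothetical stand-in for a delegation manager base class.
struct SketchDelegateMgr {
    virtual ~SketchDelegateMgr() = default;
    virtual void arraySend(int destIndex, const std::string& payload) = 0;
};

// Hypothetical proxy: sends directly unless a manager has been attached.
class SketchProxy {
public:
    explicit SketchProxy(Network& net) : net_(net) {}
    void delegate(SketchDelegateMgr* mgr) { mgr_ = mgr; }
    void send(int destIndex, const std::string& payload) {
        if (mgr_) mgr_->arraySend(destIndex, payload);   // trapped by the manager
        else net_.wire.push_back(std::to_string(destIndex) + ":" + payload);
    }
private:
    Network& net_;
    SketchDelegateMgr* mgr_ = nullptr;
};

// Example manager: counts the sends it traps, then forwards them unchanged.
class CountingMgr : public SketchDelegateMgr {
public:
    explicit CountingMgr(Network& net) : net_(net) {}
    void arraySend(int destIndex, const std::string& payload) override {
        ++trapped_;
        net_.wire.push_back(std::to_string(destIndex) + ":" + payload);
    }
    int trapped() const { return trapped_; }
private:
    Network& net_;
    int trapped_ = 0;
};
```

The calling code does not change after `delegate()` is called; only the path taken by each send does.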

Delegation Interface
• .ci file
    group MyDelegateMgr {
      entry MyDelegateMgr();   // Constructor
    };
• .h file
    class MyDelegateMgr : public CkDelegateMgr {
    public:
      MyDelegateMgr();
      void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, CkArrayID a);
      void ArrayBroadcast(..);
      void ArraySectionSend(.., CkSectionID &s);
      ……………..
    };


Communication Optimization

Automatic Communication Optimizations
• The parallel-objects runtime system can observe, instrument, and measure communication patterns
• Communication libraries can optimize
    • By substituting the most suitable algorithm for each operation
    • Learning at runtime
• E.g., all-to-all communication
    • Performance depends on many runtime characteristics
    • Library switches between different algorithms
• Communication is from/to objects, not processors
    • Streaming messages optimization

Managing Collective Communication
• A communication operation in which all (or most) of the processors participate
    • For example: broadcast, barrier, all-reduce, all-to-all communication, etc.
    • Applications: NAMD multicast, NAMD PME, CPAIMD
• Issues
    • Performance impediment
    • Naïve implementations often do not scale
    • Synchronous implementations do not utilize the co-processor effectively

All to All Communication
• All processors send data to all other processors
• All-to-all personalized communication (AAPC)
    • MPI_Alltoall
• All-to-all multicast/broadcast (AAMC)
    • MPI_Allgather

Strategies for AAPC
• Short message optimizations
    • High software overhead (α)
    • Message combining
• Large messages
    • Network contention

Short Message Optimizations
• Direct all-to-all communication is α dominated
• Message combining for small messages
    • Reduce the total number of messages
    • Multistage algorithm to send messages along a virtual topology
    • Group of messages combined and sent to an intermediate processor, which then forwards them to their final destinations
    • AAPC strategy may send the same message multiple times

Virtual Topology: Mesh
• Organize processors in a 2D (virtual) mesh
• Phase 1: Processors send messages to row neighbors
    • A message from (x1,y1) to (x2,y2) goes via (x1,y2)
• Phase 2: Processors send messages to column neighbors
• 2*(√P - 1) messages instead of P - 1
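The two-phase routing rule and the message counts can be written down directly. This is a sketch under the assumption that P is a perfect square laid out as a side-by-side grid, with coordinates given as (row x, column y); the function names are hypothetical.

```cpp
#include <cassert>
#include <utility>

using Coord = std::pair<int, int>;   // (row x, column y) in the virtual mesh

// Phase 1 moves a message along the sender's row to the destination's column;
// phase 2 then moves it down that column to the destination.
inline Coord meshIntermediate(Coord src, Coord dst) {
    return Coord(src.first, dst.second);
}

// Messages sent per processor with direct all-to-all on P processors.
inline int directMessageCount(int P) { return P - 1; }

// Messages sent per processor on a side x side mesh (P = side * side):
// (side - 1) row neighbors in phase 1 plus (side - 1) column neighbors in phase 2.
inline int meshMessageCount(int side) { return 2 * (side - 1); }
```

For 64 processors arranged as an 8 x 8 mesh, each processor sends 14 combined messages instead of 63 individual ones, at the price of every payload travelling two hops.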

AAPC Times for Small Messages
[Figure: AAPC Performance. Time (ms, axis 0 to 100) versus number of processors (16 to 2048). Legend: Lemieux Native MPI, Mesh, Direct.]

Large Message Issues
• Network contention
    • Contention-free schedules
    • Topology-specific optimizations

Ring Strategy for Collective Multicast
• Performs all-to-all multicast by sending messages along a ring formed by the processors
• Congestion free on most topologies
[Figure: ring of processors 0, 1, 2, ..., i, i+1, ..., P-1]
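A small simulation shows why P - 1 steps around the ring are enough for every processor to collect every contribution. This is an illustrative sketch only: `ringAllToAllMulticast` is a hypothetical name, each processor's contribution is represented by its own rank, and all sends within a step are assumed to happen simultaneously.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Simulates all-to-all multicast on a ring of P processors.
// Returns, for each processor, the set of contributions it holds at the end.
std::vector<std::set<int>> ringAllToAllMulticast(int P) {
    std::vector<std::set<int>> have(P);
    std::vector<int> inFlight(P);           // item each processor forwards next
    for (int p = 0; p < P; ++p) {
        have[p].insert(p);                  // start with own contribution
        inFlight[p] = p;
    }
    for (int step = 0; step < P - 1; ++step) {
        std::vector<int> arriving(P);
        for (int p = 0; p < P; ++p)
            arriving[(p + 1) % P] = inFlight[p];   // send to right neighbor only
        for (int p = 0; p < P; ++p) {
            have[p].insert(arriving[p]);
            inFlight[p] = arriving[p];      // forward what was just received
        }
    }
    return have;
}
```

Each processor talks to one fixed neighbor in every step, which is why the pattern avoids contention on most topologies.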

Streaming Messages
• Programs often have streams of short messages
• Streaming library combines a bunch of messages and sends them off
• Stripping large Charm++ header
• Short array message packing
• Effective message performance of about 3 us
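Message combining for a stream of short messages can be sketched as a per-destination buffer that is sent as one packet once it fills. `SketchStreamer` and its batch-size policy are hypothetical; the policy the streaming library actually uses to decide when to send is not stated on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

class SketchStreamer {
public:
    // (destination, combined short messages)
    using Packet = std::pair<int, std::vector<std::string>>;

    explicit SketchStreamer(std::size_t batchSize) : batchSize_(batchSize) {}

    // Buffers one short message; sends a combined packet when the bucket is full.
    void insert(int dest, const std::string& msg) {
        std::vector<std::string>& bucket = pending_[dest];
        bucket.push_back(msg);
        if (bucket.size() >= batchSize_) flush(dest);
    }

    // Sends whatever is buffered for one destination as a single packet.
    void flush(int dest) {
        std::vector<std::string>& bucket = pending_[dest];
        if (bucket.empty()) return;
        sent_.push_back(Packet(dest, bucket));
        bucket.clear();
    }

    // Sends all partially filled buckets, e.g. at the end of an iteration.
    void flushAll() {
        for (auto& entry : pending_) flush(entry.first);
    }

    const std::vector<Packet>& sent() const { return sent_; }

private:
    std::size_t batchSize_;
    std::map<int, std::vector<std::string>> pending_;
    std::vector<Packet> sent_;
};
```

The per-message software overhead is paid once per packet rather than once per short message, which is where the saving comes from.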

Using the Communication Library
• Communication optimizations are embodied as strategies
    • EachToManyMulticastStrategy
    • RingMulticast
    • PipeBroadcast
    • Streaming
    • MeshStreaming

Bracketed vs. Non-bracketed
• Bracketed strategies
    • Require the user to give specific end points for each iteration of message sends
    • Endpoints declared by calling ComlibBegin() and ComlibEnd()
    • Examples: EachToManyMulticast
• Non-bracketed strategies
    • No such end points necessary
    • Examples: Streaming, PipeBroadcast

Accessing the Communication Library
• From mainchare::main
• Creating a strategy
    Strategy *strat = new EachToManyMulticastStrategy(USE_MESH);
    strat = new StreamingStrategy();
    strat->enableShortMessagePacking();
• Associating a proxy with a strategy
    ComlibAssociateProxy(strat, myproxy);
    • myproxy should be passed to all array elements

Sending Messages
    ComlibBegin(myproxy);   // Bracketed strategies
    for (.. ..) {
      .. ..
      myproxy.foo(msg);
      .. ..
    }
    ComlibEnd();            // Bracketed strategies

Handling Migration
• A migrating array element PUPs the comlib-associated proxy
    void FooArray::pup(PUP::er &p) {
      .. .. ..
      p | myProxy;
      .. .. ..
    }

Compiling
• You must include the compile-time option -module commlib


Advanced Load-balancers

Writing a Load-balancing Strategy

Advanced load balancing: Writing a new strategy
• Inherit from CentralLB and implement the work(…) function
    class foolb : public CentralLB {
    public:
      .. .. ..
      void work(CentralLB::LDStats *stats, int count);
      .. .. ..
    };

LB Database
    struct LDStats {
      ProcStats *procs;
      LDObjData *objData;
      LDCommData *commData;
      int *to_proc;
      //.. .. ..
    };

    // Dummy work function which assigns all objects to processor 0
    // Don't implement it!
    void foolb::work(CentralLB::LDStats *stats, int count) {
      for (int obj = 0; obj < nobjs; obj++)
        stats->to_proc[obj] = 0;
    }
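For contrast with the dummy work function, here is a sketch of a more useful assignment rule: a greedy strategy that places the heaviest objects first, each on the currently least loaded processor. It is plain C++ and does not use the CentralLB types; `greedyAssign` and its inputs (a vector of object loads and a processor count) are hypothetical simplifications of the LB database.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Returns toProc[obj] = processor chosen for each object.
// objLoad[obj] is the measured load of that object (assumed input).
std::vector<int> greedyAssign(const std::vector<double>& objLoad, int nprocs) {
    // Visit objects from heaviest to lightest; stable sort keeps index order on ties.
    std::vector<int> order(objLoad.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return objLoad[a] > objLoad[b]; });

    std::vector<double> procLoad(nprocs, 0.0);
    std::vector<int> toProc(objLoad.size(), 0);
    for (int obj : order) {
        // Least loaded processor so far; the lowest rank wins ties.
        int best = static_cast<int>(
            std::min_element(procLoad.begin(), procLoad.end()) - procLoad.begin());
        toProc[obj] = best;
        procLoad[best] += objLoad[obj];
    }
    return toProc;
}
```

In a real strategy the result would be written into the to_proc array shown above rather than returned.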

Compiling and Integration
• Edit and run Makefile_lb.sh
    • Creates Make.lb, which is included by the LDB Makefile
• Run make depends to correct dependencies
• Rebuild Charm++


Checkpoint Restart

Checkpoint/Restart
• Any long-running application must be able to save its state
• When you checkpoint an application, it uses the pup routine to store the state of all objects
• State information is saved in a directory of your choosing
• Restore also uses pup, so no additional application code is needed (pup is all you need)

Checkpointing Job
• In AMPI, use MPI_Checkpoint(<dir>);
    • Collective call; returns when checkpoint is complete
• In Charm++, use CkCheckpoint(<dir>,<resume>);
    • Called on one processor; calls resume when checkpoint is complete

Restart Job from Checkpoint
• The charmrun option ++restart <dir> is used to restart
    • Number of processors need not be the same
• You can also restart groups by marking them migratable and writing a PUP routine; they still will not load balance, though


Threads

Why use Threads?
• They provide one key feature: blocking
    • Suspend execution (e.g., at message receive)
    • Do something else
    • Resume later (e.g., after message arrives)
    • Example: MPI_Recv, MPI_Wait semantics
• Function call interface more convenient than message-passing
    • Regular call/return structure (no CkCallbacks)
    • Allows blocking in the middle of a deeply nested communication subroutine

Why not use Threads?
• Slower
    • Around 1 us context-switching overhead unavoidable
    • Creation/deletion perhaps 10 us
• More complexity, more bugs
    • Breaks a lot of machines! (but we have workarounds)
• Migration more difficult
    • State of thread is scattered through the stack, which is maintained by the compiler
    • By contrast, state of object is maintained by users
• Thread disadvantages form the motivation to use SDAG (later)

What are (Charm) Threads?
• One flow of control (instruction stream)
    • Machine registers & program counter
    • Execution stack
• Like pthreads (kernel threads)
• Only different:
    • Implemented at user level (in Converse)
    • Scheduled at user level; non-preemptive
    • Migratable between nodes

How do I use Threads?
• Many options:
    • AMPI
        • Always uses threads via TCharm library
    • Charm++
        • [threaded] entry methods run in a thread
        • [sync] methods
    • Converse
        • C routines CthCreate/CthSuspend/CthAwaken
        • Everything else is built on these
        • Implemented using
            • SYSV makecontext/setcontext
            • POSIX setjmp/alloca/longjmp

How do I use Threads (example)
• Blocking API routine: find array element
    int requestFoo(int src) {
      myObject *obj = ...;
      return obj->fooRequest(src);
    }
• Send request and suspend
    int myObject::fooRequest(int src) {
      proxy[dest].fooNetworkRequest(thisIndex);
      stashed_thread = CthSelf();
      CthSuspend();   // -- blocks until awaken call --
      return stashed_return;
    }
• Awaken thread when data arrives
    void myObject::fooNetworkResponse(int ret) {
      stashed_return = ret;
      CthAwaken(stashed_thread);
    }
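The suspend/awaken pattern can be tried without Charm++ using the SYSV context routines from `<ucontext.h>`. This is a sketch, not the Converse implementation: `suspend()` and `awaken()` are hypothetical stand-ins for CthSuspend and CthAwaken, there is exactly one thread and one scheduler, and the code assumes a platform such as Linux where `makecontext`/`swapcontext` are available.

```cpp
#include <cassert>
#include <string>
#include <ucontext.h>
#include <vector>

namespace sketch {

static ucontext_t schedulerCtx;   // where the "scheduler" is parked
static ucontext_t threadCtx;      // where the user-level thread is parked
static std::vector<std::string> trace;
static int stashedReturn = 0;

// Stand-in for CthSuspend: save the thread and switch back to the scheduler.
static void suspend() { swapcontext(&threadCtx, &schedulerCtx); }

// Stand-in for CthAwaken: save the scheduler and resume the thread at once.
static void awaken() { swapcontext(&schedulerCtx, &threadCtx); }

// Body of the user-level thread: send a request, block, use the answer.
static void threadBody() {
    trace.push_back("request sent");
    suspend();                                   // blocks until awaken()
    trace.push_back("got " + std::to_string(stashedReturn));
}

// Runs the whole exchange and returns the order in which things happened.
static std::vector<std::string> run() {
    static char stack[64 * 1024];                // stack for the thread
    trace.clear();
    getcontext(&threadCtx);
    threadCtx.uc_stack.ss_sp = stack;
    threadCtx.uc_stack.ss_size = sizeof stack;
    threadCtx.uc_link = &schedulerCtx;           // resume here when body returns
    makecontext(&threadCtx, threadBody, 0);

    swapcontext(&schedulerCtx, &threadCtx);      // start thread; back at suspend()
    trace.push_back("response arrives");
    stashedReturn = 42;
    awaken();                                    // back when threadBody finishes
    trace.push_back("thread finished");
    return trace;
}

}  // namespace sketch
```

The thread's local state lives on `stack` between `suspend()` and `awaken()`, which is the property that makes such threads hard to migrate.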

How do I use Threads (example)
• Control flow of the fooRequest/fooNetworkResponse example: send request, suspend, receive, awaken, return


The Horror of Thread Migration

Stack Data
• The stack is used by the compiler to track function calls and provide temporary storage
    • Local variables
    • Subroutine parameters
    • C "alloca" storage
• Most of the variables in a typical application are stack data
• Users have no control over how the stack is laid out

Migrate Stack Data
• Without compiler support, we cannot change the stack's address
    • Because we can't change the stack's interior pointers (return frame pointer, function arguments, etc.)
• Solution: "isomalloc" addresses
    • Reserve address space on every processor for every thread stack
    • Use mmap to scatter stacks in virtual memory efficiently
    • Idea comes from PM2
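The address reservation can be sketched as simple arithmetic: every thread gets its own slot in a virtual range that is reserved at the same place on every processor, so its stack keeps the same address wherever it runs. The struct `IsoLayout`, the helper names, and the numbers used with them are hypothetical; the real layout is chosen by the runtime.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the reserved range.
struct IsoLayout {
    std::uint64_t regionBase;   // start of the range, identical on every processor
    std::uint64_t slotBytes;    // address space reserved per thread stack
};

// Address of a thread's stack slot; depends only on the global thread id,
// never on the processor the thread currently runs on.
inline std::uint64_t stackBase(const IsoLayout& layout, std::uint64_t globalThreadId) {
    return layout.regionBase + globalThreadId * layout.slotBytes;
}

// Number of thread stacks that fit in the reserved range, machine-wide.
inline std::uint64_t maxThreads(std::uint64_t regionBytes, std::uint64_t slotBytes) {
    return regionBytes / slotBytes;
}
```

With an assumed 2 GiB of usable range and 1 MiB per stack, only 2048 threads fit across the whole machine, which illustrates why a small address space is limiting.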

Migrate Stack Data
[Figure: memory maps (0x00000000 to 0xFFFFFFFF) of Processor A and Processor B, each with code, globals, and heap. Processor A holds the stacks of threads 2, 3, and 4; Processor B holds the stack of thread 1. Label: Migrate Thread 3.]

Migrate Stack Data: Isomalloc
[Figure: the same memory maps after the migration. Processor A holds the stacks of threads 2 and 4; Processor B holds the stacks of threads 1 and 3.]

Migrate Stack Data
• Isomalloc is a completely automatic solution
    • No changes needed in application or compilers
    • Just like a software shared-memory system, but with proactive paging
• But has a few limitations
    • Depends on having large quantities of virtual address space (best on 64-bit)
        • 32-bit machines can only have a few gigs of isomalloc stacks across the whole machine
    • Depends on unportable mmap
        • Which addresses are safe? (We must guess!)
        • What about Windows? Or Blue Gene?

Aliasing Stack Data
[Figure sequence: memory maps (0x00000000 to 0xFFFFFFFF) of Processor A and Processor B, each with code, globals, and heap. The stacks of threads 2 and 3 are stored on Processor A. "Run Thread 2": thread 2's stack appears in a single region labelled "Execution Copy". "Run Thread 3": thread 3's stack appears in that same region. "Migrate Thread 3": thread 3's stored stack moves to Processor B, where it appears in the Execution Copy region when the thread runs.]

Aliasing Stack Data
• Does not depend on having large quantities of virtual address space
    • Works well on 32-bit machines
• Requires only one mmap'd region at a time
    • Works even on Blue Gene!
• Downsides:
    • Thread context switch requires munmap/mmap (3 us)
    • Can only have one thread running at a time (so no SMPs!)

Heap Data
• Heap data is any dynamically allocated data
    • C "malloc" and "free"
    • C++ "new" and "delete"
    • F90 "ALLOCATE" and "DEALLOCATE"
• Arrays and linked data structures are almost always heap data

Migrate Heap Data
• Automatic solution: isomalloc all heap data just like stacks!
    • "-memory isomalloc" link option
    • Overrides malloc/free
    • No new application code needed
    • Same limitations as isomalloc; page allocation granularity (huge!)
• Manual solution: application moves its heap data
    • Need to be able to size message buffer, pack data into message, and unpack on other side
    • "pup" abstraction does all three
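How one routine can size, pack, and unpack is easiest to see in a miniature. The class `MiniPupper` below is a hypothetical toy, not the real PUP::er interface: the same `pup` method is called three times with the pupper in a different mode each time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical miniature of the pup idea; not the Charm++ PUP::er API.
class MiniPupper {
public:
    enum Mode { Sizing, Packing, Unpacking };
    explicit MiniPupper(Mode mode, char* buf = nullptr) : mode_(mode), buf_(buf) {}

    // Sizing: only count bytes. Packing: copy into the buffer.
    // Unpacking: copy out of the buffer.
    void bytes(void* data, std::size_t n) {
        if (n == 0) return;
        if (mode_ == Packing) std::memcpy(buf_ + offset_, data, n);
        else if (mode_ == Unpacking) std::memcpy(data, buf_ + offset_, n);
        offset_ += n;
    }
    bool isUnpacking() const { return mode_ == Unpacking; }
    std::size_t size() const { return offset_; }

private:
    Mode mode_;
    char* buf_;
    std::size_t offset_ = 0;
};

// Example object with heap data.
struct Particles {
    std::vector<double> pos;
    void pup(MiniPupper& p) {
        std::size_t n = pos.size();
        p.bytes(&n, sizeof n);
        if (p.isUnpacking()) pos.resize(n);      // allocate on the receiving side
        p.bytes(pos.data(), n * sizeof(double));
    }
};

// Moves an object through a message buffer using its single pup routine.
Particles migrate(Particles& src) {
    MiniPupper sizer(MiniPupper::Sizing);
    src.pup(sizer);                              // 1. size the message buffer
    std::vector<char> message(sizer.size());
    MiniPupper packer(MiniPupper::Packing, message.data());
    src.pup(packer);                             // 2. pack data into the message
    Particles dst;
    MiniPupper unpacker(MiniPupper::Unpacking, message.data());
    dst.pup(unpacker);                           // 3. unpack on the other side
    return dst;
}
```

Writing the buffer to a file instead of sending it would give the checkpoint case; the object's `pup` routine stays the same.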


SDAG

Structured Dagger
• What is it?
    • A coordination language built on top of Charm++
• Motivation
    • Charm++'s asynchrony is efficient and reliable, but tough to program
        • Flags, buffering, out-of-order receives, etc.
    • Threads are easy to program, but less efficient and less reliable
        • Implementation complexity
        • Porting headaches
    • Want benefits of both!

Structured Dagger Constructs
• when <method list> {code}
    • Do not continue until method is called
        • Internally generates flags, checks, etc.
        • Does not use threads
• atomic {code}
    • Call ordinary sequential C++ code
• if/else/for/while
    • C-like control flow
• overlap {code1 code2 ...}
    • Execute code segments in parallel
• forall
    • "Parallel Do"
    • Like a parameterized overlap
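The "flags, checks, etc." that a `when` with two methods replaces can be written by hand as follows. This is a sketch of the kind of bookkeeping involved, not the code that Structured Dagger generates; the class `WhenBoth` and its method names are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hand-written equivalent of:  when left(int l), right(int r) atomic { body(l, r); }
// Arrivals are buffered in either order; the body runs once both are present.
class WhenBoth {
public:
    explicit WhenBoth(std::function<void(int, int)> body) : body_(std::move(body)) {}

    void left(int v)  { leftVal_ = v;  haveLeft_ = true;  tryFire(); }
    void right(int v) { rightVal_ = v; haveRight_ = true; tryFire(); }

private:
    void tryFire() {
        if (!(haveLeft_ && haveRight_)) return;  // keep waiting, no thread blocked
        haveLeft_ = haveRight_ = false;          // re-arm for the next round
        body_(leftVal_, rightVal_);
    }

    std::function<void(int, int)> body_;
    int leftVal_ = 0, rightVal_ = 0;
    bool haveLeft_ = false, haveRight_ = false;
};
```

Nothing blocks while waiting: each arrival is an ordinary method call that either records its value or triggers the body.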

Stencil Example Using Structured Dagger
    array[1D] myArray {
      …
      entry void GetMessages() {
        when rightmsgEntry(), leftmsgEntry() {
          atomic {
            CkPrintf("Got both left and right messages \n");
            doWork(right, left);
          }
        }
      };
      entry void rightmsgEntry();
      entry void leftmsgEntry();
      …
    };

Overlap for LeanMD Initialization
    array[1D] myArray {
      …
      entry void waitForInit(void) {
        overlap {
          when recvNumCellPairs(myMsg* pMsg) {
            atomic { setNumCellPairs(pMsg->intVal); delete pMsg; }
          }
          when recvNumCells(myMsg* cMsg) {
            atomic { setNumCells(cMsg->intVal); delete cMsg; }
          }
        }
      }
    };

For for LeanMD timeloop
    entry void doTimeloop(void) {
      for (timeStep_=1; timeStep_<=SimParam.NumSteps; timeStep_++) {
        atomic { sendAtomPos(); }
        overlap {
          for (forceCount_=0; forceCount_<numForceMsg_; forceCount_++) {
            when recvForces(ForcesMsg* msg) { atomic { procForces(msg); } }
          }
          for (pmeCount_=0; pmeCount_<nPME; pmeCount_++) {
            when recvPME(PMEGridMsg* m) { atomic { procPME(m); } }
          }
        }
        atomic { doIntegration(); }
        if (timeForMigrate()) { ... }
      }
    }


Conclusions

Conclusions
• AMPI and Charm++ provide a fully virtualized runtime system
    • Load balancing via migration
    • Communication optimizations
    • Checkpoint/restart
• Virtualization can significantly improve performance for real applications

Thank You!

Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/

Parallel Programming Lab at University of Illinois