
1

Advanced Charm++ Tutorial

Charm Workshop Tutorial
Sameer Kumar
Orion Sky Lawlor
charm.cs.uiuc.edu
2004/10/19

2

How to Become a Charm++ Hacker

Advanced Charm++:
• Advanced messaging
• Writing system libraries
  • Groups
  • Delegation
  • Communication framework
• Advanced load balancing
• Checkpointing
• Threads
• SDAG

3

Advanced Messaging

4

Prioritized Execution

If several messages are available, Charm++ will process the message with the highest priority; otherwise, the oldest message (FIFO).

Has no effect:
• If only one message is available (common for network-bound applications!)
• On outgoing messages

Very useful for speculative work, ordering timesteps, etc.

5

Priority Classes

The Charm++ scheduler has three queues: high, default, and low.

As signed integer priorities:
• -MAXINT to -1: highest priority
• 0: default priority
• +1 to +MAXINT: lowest priority

As unsigned bitvector priorities:
• 0x0000 to 0x7FFF: highest priority
• 0x8000: default priority
• 0x8001 to 0xFFFF: lowest priority

6

Prioritized Marshalled Messages

Pass "CkEntryOptions" as the last parameter.

For signed integer priorities:

  CkEntryOptions opts;
  opts.setPriority(-1);
  fooProxy.bar(x, y, opts);

For bitvector priorities:

  CkEntryOptions opts;
  unsigned int prio[2] = {0x7FFFFFFF, 0xFFFFFFFF};
  opts.setPriority(64, prio);
  fooProxy.bar(x, y, opts);

7

Prioritized Messages

The number of priority bits is passed during message allocation:

  FooMsg *msg = new (size, nbits) FooMsg;

Priorities are stored at the end of the message.

Signed integer priorities:

  *CkPriorityPtr(msg) = -1;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);

Unsigned bitvector priorities:

  CkPriorityPtr(msg)[0] = 0x7fffffff;
  CkSetQueueing(msg, CK_QUEUEING_BFIFO);
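Putting the pieces together, a minimal sketch of allocating, prioritizing, and sending one message (FooMsg, size, fooProxy, and the entry method bar are assumed to be declared elsewhere, following the slide's own usage of CkPriorityPtr):

  // Allocate room for one 32-bit signed integer priority
  FooMsg *msg = new (size, 8 * sizeof(int)) FooMsg;
  *CkPriorityPtr(msg) = -1;               // high priority
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);  // signed-integer queueing
  fooProxy.bar(msg);                      // send as usual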

8

Advanced Message Features

Read-only messages:
• The entry method agrees not to modify or delete the message
• Avoids a message copy on broadcasts, saving time

Expedited messages:
• Messages do not go through the Charm++ scheduler (faster)

Immediate messages:
• Entries are executed in an interrupt handler or on the communication thread
• Very fast, but tough to get right

9

Read-Only, Expedited, Immediate

All are declared in the .ci file:

  {
    ...
    entry [nokeep]    void foo_readonly(Msg *);
    entry [expedited] void foo_exp(Msg *);
    entry [immediate] void foo_imm(Msg *);
    ...
  };

  // Immediate messages currently work only for NodeGroups

10

Groups

11

Object Groups

A collection of objects (chares):
• Also called branch-office chares
• Exactly one representative on each processor
• Ideally suited for system libraries
• A single proxy for the group as a whole

Similar to arrays:
• Broadcasts, reductions, indexing

But not completely like arrays:
• Non-migratable; one per processor

12

Declarations

.ci file:

  group mygroup {
    entry mygroup();           // Constructor
    entry void foo(foomsg *);  // Entry method
  };

C++ file:

  class mygroup : public Group {
    mygroup() {}
    void foo(foomsg *m) { CkPrintf("Do Nothing"); }
  };

13

Creating and Calling Groups

Creation:

  p = CProxy_mygroup::ckNew();

Remote invocation:

  p.foo(msg);     // broadcast
  p[1].foo(msg);  // asynchronous invocation

Direct local access:

  mygroup *g = p.ckLocalBranch();
  g->foo(...);    // local invocation

Danger: if you migrate, the group stays behind!
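Because the group stays behind, a migratable array element should not cache the ckLocalBranch() pointer across calls. A minimal sketch of the safer pattern (MyArrayElem and groupProxy are illustrative names; groupProxy is a CProxy_mygroup stored in the element):

  void MyArrayElem::doWork(foomsg *m) {
    // Re-acquire the branch on whatever processor we are on *now*
    mygroup *g = groupProxy.ckLocalBranch();
    g->foo(m);  // safe local invocation
  }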

14

Delegation

15

Delegation

Enables Charm++ proxy messages to be forwarded to a delegation manager group.
• The delegation manager can trap calls made through the proxy and apply optimizations
• The delegation manager must inherit from CkDelegateMgr
• The user program must call proxy.ckDelegate(mgrID);

16

Delegation Interface

.ci file:

  group MyDelegateMgr {
    entry MyDelegateMgr();  // Constructor
  };

.h file:

  class MyDelegateMgr : public CkDelegateMgr {
    MyDelegateMgr();
    void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, CkArrayID a);
    void ArrayBroadcast(...);
    void ArraySectionSend(..., CkSectionID &s);
    ...
  };
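A hedged sketch of wiring this together from the application side (arrayProxy is an illustrative chare-array proxy; the exact argument of ckDelegate has varied across Charm++ versions, shown here as the manager's local branch standing in for the slide's mgrID):

  CProxy_MyDelegateMgr mgrProxy = CProxy_MyDelegateMgr::ckNew();
  arrayProxy.ckDelegate(mgrProxy.ckLocalBranch());  // trap subsequent sends
  arrayProxy.someEntry(msg);  // now routed through MyDelegateMgr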

17

Communication Optimization

18

Automatic Communication Optimizations

The parallel-objects runtime system can observe, instrument, and measure communication patterns.

Communication libraries can optimize:
• By substituting the most suitable algorithm for each operation
• By learning at runtime

Example: all-to-all communication
• Performance depends on many runtime characteristics
• The library switches between different algorithms

Communication is from/to objects, not processors:
• Streaming-messages optimization

19

Managing Collective Communication

A communication operation in which all (or most) processors participate: for example broadcast, barrier, all-reduce, all-to-all, etc.

Applications: NAMD multicast, NAMD PME, CPAIMD

Issues:
• Can be a performance impediment
• Naïve implementations often do not scale
• Synchronous implementations do not utilize the co-processor effectively

20

All-to-All Communication

All processors send data to all other processors.

• All-to-all personalized communication (AAPC): MPI_Alltoall
• All-to-all multicast/broadcast (AAMC): MPI_Allgather

21

Strategies for AAPC

Short messages:
• High software overhead (α)
• Addressed by message combining

Large messages:
• Network contention

22

Short-Message Optimizations

Direct all-to-all communication is α-dominated (per-message software overhead).

Message combining for small messages:
• Reduces the total number of messages
• A multistage algorithm sends messages along a virtual topology
• Groups of messages are combined and sent to an intermediate processor, which then forwards them to their final destinations
• An AAPC strategy may send the same message multiple times

23

Virtual Topology: Mesh

Organize the processors in a 2D (virtual) mesh, roughly √P × √P.

Phase 1: Each processor sends messages to its √P - 1 row neighbors. A message from (x1,y1) to (x2,y2) travels via (x1,y2).

Phase 2: Each processor sends the combined messages to its √P - 1 column neighbors.

Result: roughly 2√P messages per processor instead of P - 1. For example, with P = 1024 that is 62 messages instead of 1023.
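To make the savings concrete, a quick arithmetic sketch (plain C++, no Charm++ required; assumes P is a perfect square, as in the mesh above):

  #include <cstdio>
  #include <cmath>

  int main() {
    for (int P = 64; P <= 4096; P *= 4) {
      int perPhase = (int)std::sqrt((double)P) - 1;  // row or column neighbors
      std::printf("P=%4d: direct=%4d msgs, mesh=%3d msgs\n",
                  P, P - 1, 2 * perPhase);
    }
    return 0;
  }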

24

AAPC Times for Small Messages

[Figure: AAPC performance on Lemieux; x-axis: processors (16 to 2048), y-axis: time (ms, 0 to 100); series: Native MPI, Mesh, Direct.]

25

Large-Message Issues

• Network contention
• Contention-free schedules
• Topology-specific optimizations

26

Ring Strategy for Collective Multicast

Performs all-to-all multicast by sending messages along a ring formed by the processors; congestion-free on most topologies.

[Figure: processors 0, 1, 2, ..., i, i+1, ..., P-1 arranged in a ring.]

27

Streaming Messages

Programs often have streams of short messages. The streaming library combines a bunch of messages and sends them off together.

• Strips the large Charm++ header
• Packs short array messages
• Effective per-message performance of about 3 µs

28

Using the Communication Library

Communication optimizations are embodied as strategies:
• EachToManyMulticastStrategy
• RingMulticast
• PipeBroadcast
• Streaming
• MeshStreaming

29

Bracketed vs. Non-Bracketed

Bracketed strategies:
• Require the user to mark specific end points for each iteration of message sends
• End points are declared by calling ComlibBegin() and ComlibEnd()
• Example: EachToManyMulticast

Non-bracketed strategies:
• No such end points are necessary
• Examples: Streaming, PipeBroadcast

30

Accessing the Communication Library

From mainchare::main:

Creating a strategy:

  Strategy *strat = new EachToManyMulticastStrategy(USE_MESH);

  strat = new StreamingStrategy();
  strat->enableShortMessagePacking();

Associating a proxy with a strategy:

  ComlibAssociateProxy(strat, myproxy);

myproxy should then be passed to all array elements.

31

Sending Messages

  ComlibBegin(myproxy);  // Bracketed strategies
  for (...) {
    ...
    myproxy.foo(msg);
    ...
  }
  ComlibEnd();           // Bracketed strategies

32

Handling Migration

A migrating array element PUPs its comlib-associated proxy:

  FooArray::pup(PUP::er &p) {
    ...
    p | myProxy;
    ...
  }

33

Compiling

You must include the compile-time option -module commlib.

34

Advanced Load-balancers

Writing a Load-balancing Strategy

35

Advanced Load Balancing: Writing a New Strategy

Inherit from CentralLB and implement the work(...) function:

  class fooLB : public CentralLB {
  public:
    ...
    void work(CentralLB::LDStats *stats, int count);
    ...
  };

36

LB Database

  struct LDStats {
    ProcStats *procs;
    LDObjData *objData;
    LDCommData *commData;
    int *to_proc;
    // ... (also object and processor counts, e.g. n_objs)
  };

  // Dummy work function that assigns all objects to processor 0.
  // Don't implement it this way!
  void fooLB::work(CentralLB::LDStats *stats, int count) {
    for (int obj = 0; obj < stats->n_objs; obj++)
      stats->to_proc[obj] = 0;
  }
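For contrast, a minimal sketch of a strategy that at least spreads the objects out, assuming the same LDStats fields as above (a real strategy would also weight objects by their measured load in objData):

  // Trivial round-robin placement over the 'count' available processors
  void fooLB::work(CentralLB::LDStats *stats, int count) {
    for (int obj = 0; obj < stats->n_objs; obj++)
      stats->to_proc[obj] = obj % count;
  }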

37

Compiling and Integration

• Edit and run Makefile_lb.sh; this creates Make.lb, which is included by the LDB Makefile
• Run make depends to correct dependencies
• Rebuild Charm++

38

Checkpoint Restart

39

Checkpoint/Restart

Any long-running application must be able to save its state.
• When you checkpoint an application, it uses the pup routines to store the state of all objects
• State information is saved in a directory of your choosing
• Restore also uses pup, so no additional application code is needed (pup is all you need)

40

Checkpointing a Job

In AMPI, use MPI_Checkpoint(<dir>);
• A collective call; returns when the checkpoint is complete

In Charm++, use CkCheckpoint(<dir>, <resume>);
• Called on one processor; calls <resume> when the checkpoint is complete
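A hedged sketch of the Charm++ side (in current Charm++ the call is spelled CkStartCheckpoint and <resume> is a CkCallback; Main, mainProxy, and resumeFromCheckpoint are illustrative names):

  // Checkpoint into the "log" directory; the runtime invokes the
  // callback when the checkpoint completes, and again after a restart.
  CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
  CkStartCheckpoint("log", cb);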

41

Restarting a Job from a Checkpoint

• The charmrun option ++restart <dir> is used to restart
• The number of processors need not be the same
• You can also restart groups by marking them migratable and writing a PUP routine; they still will not load balance, though
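For example, assuming the checkpoint above was written to a directory named log (the program name and processor count are illustrative):

  ./charmrun +p8 ./pgm ++restart log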

42

Threads

43

Why Use Threads?

They provide one key feature: blocking.
• Suspend execution (e.g., at message receive)
• Do something else
• Resume later (e.g., after the message arrives)
• Example: MPI_Recv, MPI_Wait semantics

A function-call interface is more convenient than message passing:
• Regular call/return structure (no CkCallbacks)
• Allows blocking in the middle of a deeply nested communication subroutine
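For instance, the blocking receive that threads make possible (standard MPI calls, as run under AMPI):

  #include <mpi.h>

  // Under AMPI each MPI rank is a migratable user-level thread, so this
  // blocking receive suspends only the thread, not the processor.
  void waitForData(double *buf, int n) {
    MPI_Status status;
    MPI_Recv(buf, n, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    // While we were blocked, the runtime scheduled other work on this PE.
  }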

44

Why Not Use Threads?

Slower:
• Around 1 µs of context-switching overhead is unavoidable
• Creation/deletion costs perhaps 10 µs

More complexity, more bugs:
• Breaks a lot of machines! (but we have workarounds)

Migration is more difficult:
• The state of a thread is scattered through the stack, which is maintained by the compiler
• By contrast, the state of an object is maintained by the user

These thread disadvantages are the motivation for SDAG (later).

45

What Are (Charm) Threads?

One flow of control (instruction stream):
• Machine registers and program counter
• Execution stack

Like pthreads (kernel threads), only different:
• Implemented at user level (in Converse)
• Scheduled at user level; non-preemptive
• Migratable between nodes

46

How Do I Use Threads?

Many options:

AMPI:
• Always uses threads, via the TCharm library

Charm++:
• [threaded] entry methods run in a thread
• [sync] methods

Converse:
• C routines CthCreate/CthSuspend/CthAwaken
• Everything else is built on these
• Implemented using SYSV makecontext/setcontext or POSIX setjmp/alloca/longjmp
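For the Charm++ options, a small hedged sketch of a .ci file (Worker and ResultMsg are illustrative names; [sync] entry methods must return a message, which is why compute returns ResultMsg*):

  array [1D] Worker {
    entry Worker();
    entry [threaded] void run();             // body runs in its own user-level thread
    entry [sync] ResultMsg *compute(int n);  // caller's thread blocks until this returns
  };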

47

How Do I Use Threads? (Example)

Blocking API routine: find the array element:

  int requestFoo(int src) {
    myObject *obj = ...;
    return obj->fooRequest(src);
  }

Send the request and suspend:

  int myObject::fooRequest(int src) {
    proxy[dest].fooNetworkRequest(thisIndex);
    stashed_thread = CthSelf();
    CthSuspend();  // blocks until the awaken call
    return stashed_return;
  }

Awaken the thread when the data arrives:

  void myObject::fooNetworkResponse(int ret) {
    stashed_return = ret;
    CthAwaken(stashed_thread);
  }


49

The Horror of Thread Migration

50

Stack Data

The stack is used by the compiler to track function calls and provide temporary storage:
• Local variables
• Subroutine parameters
• C "alloca" storage

Most of the variables in a typical application are stack data. Users have no control over how the stack is laid out.

51

Migrating Stack Data

Without compiler support, we cannot change the stack's address, because we can't update the stack's interior pointers (return frame pointer, function arguments, etc.).

Solution: "isomalloc" addresses
• Reserve address space on every processor for every thread stack
• Use mmap to scatter stacks in virtual memory efficiently
• The idea comes from PM2
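A hedged sketch of the reservation idea (the address, size, and flags are illustrative; real isomalloc carves one reserved range into per-thread slots and must guess which addresses are safe, as noted two slides down):

  #include <sys/mman.h>

  // Reserve a 1 MB slot of virtual address space at a fixed address for
  // one thread's stack; identical reservations on every processor let a
  // migrated stack keep all of its interior pointers valid.
  void *slot = mmap((void *)0x70000000, 1 << 20,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);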

52

Migrating Stack Data

[Figure: Processor A's memory (0x00000000 to 0xFFFFFFFF) holds code, globals, heap, and the stacks of threads 1 through 4; thread 3's stack is to be migrated to processor B, whose address space holds only code, globals, and heap.]

53

Migrating Stack Data: Isomalloc

[Figure: after migration, thread 3's stack occupies the same reserved virtual addresses on processor B that it did on processor A.]

54

Migrating Stack Data

Isomalloc is a completely automatic solution:
• No changes needed in the application or compilers
• Just like a software shared-memory system, but with proactive paging

But it has a few limitations:
• Depends on having large quantities of virtual address space (best on 64-bit); 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine
• Depends on unportable mmap: Which addresses are safe? (We must guess!) What about Windows? Or Blue Gene?

55

Aliasing Stack Data

[Figure sequence, slides 55-61: thread stacks are stored in the heap on each processor. To run a thread, its stored stack is mapped into a single fixed stack region of the address space ("execution copy"); running a different thread remaps that region to the other thread's stored stack. Migrating thread 3 simply moves its stored stack from processor A's heap to processor B's heap, after which B runs it the same way.]

62

Aliasing Stack Data

• Does not depend on having large quantities of virtual address space; works well on 32-bit machines
• Requires only one mmap'd region at a time; works even on Blue Gene!

Downsides:
• A thread context switch requires munmap/mmap (about 3 µs)
• Only one thread can run at a time (so no SMPs!)

63

Heap Data

Heap data is any dynamically allocated data:
• C "malloc" and "free"
• C++ "new" and "delete"
• F90 "ALLOCATE" and "DEALLOCATE"

Arrays and linked data structures are almost always heap data.

64

Migrating Heap Data

Automatic solution: isomalloc all heap data, just like stacks!
• The "-memory isomalloc" link option overrides malloc/free
• No new application code needed
• Same limitations as isomalloc, plus page-granularity allocation (huge!)

Manual solution: the application moves its own heap data.
• You need to be able to size the message buffer, pack data into the message, and unpack it on the other side
• The "pup" abstraction does all three
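A minimal sketch of the manual approach via pup, assuming an object with one heap array (MyObj, n, and data are illustrative names):

  void MyObj::pup(PUP::er &p) {
    p | n;                                     // size first, so unpacking can allocate
    if (p.isUnpacking()) data = new double[n];
    PUParray(p, data, n);                      // sizes, packs, or unpacks the buffer
  }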

65

SDAG

66

Structured Dagger

What is it?
• A coordination language built on top of Charm++

Motivation:
• Charm++'s asynchrony is efficient and reliable, but tough to program: flags, buffering, out-of-order receives, etc.
• Threads are easy to program, but less efficient and less reliable: implementation complexity, porting headaches
• We want the benefits of both!

67

Structured Dagger Constructs

when <method list> {code}
• Do not continue until the method is called
• Internally generates flags, checks, etc.; does not use threads

atomic {code}
• Calls ordinary sequential C++ code

if/else/for/while
• C-like control flow

overlap {code1 code2 ...}
• Executes code segments in parallel

forall
• "Parallel do"; like a parameterized overlap

68

Stencil Example Using Structured Dagger

  array [1D] myArray {
    ...
    entry void GetMessages() {
      when rightmsgEntry(), leftmsgEntry() {
        atomic {
          CkPrintf("Got both left and right messages\n");
          doWork(right, left);
        }
      }
    };
    entry void rightmsgEntry();
    entry void leftmsgEntry();
    ...
  };

69

Overlap for LeanMD Initialization

  array [1D] myArray {
    ...
    entry void waitForInit(void) {
      overlap {
        when recvNumCellPairs(myMsg *pMsg) {
          atomic { setNumCellPairs(pMsg->intVal); delete pMsg; }
        }
        when recvNumCells(myMsg *cMsg) {
          atomic { setNumCells(cMsg->intVal); delete cMsg; }
        }
      }
    }
  };

70

For Loop for the LeanMD Timeloop

  entry void doTimeloop(void) {
    for (timeStep_ = 1; timeStep_ <= SimParam.NumSteps; timeStep_++) {
      atomic { sendAtomPos(); }
      overlap {
        for (forceCount_ = 0; forceCount_ < numForceMsg_; forceCount_++) {
          when recvForces(ForcesMsg *msg) { atomic { procForces(msg); } }
        }
        for (pmeCount_ = 0; pmeCount_ < nPME; pmeCount_++) {
          when recvPME(PMEGridMsg *m) { atomic { procPME(m); } }
        }
      }
      atomic { doIntegration(); }
      if (timeForMigrate()) { ... }
    }
  }

71

Conclusions

72

Conclusions

AMPI and Charm++ provide a fully virtualized runtime system:
• Load balancing via migration
• Communication optimizations
• Checkpoint/restart

Virtualization can significantly improve performance for real applications.

73

Thank You!

Free source, binaries, manuals, and more information at:
http://charm.cs.uiuc.edu/

Parallel Programming Lab at the University of Illinois
