1
Advanced Charm++ Tutorial
Charm Workshop Tutorial
Sameer Kumar
Orion Sky Lawlor
charm.cs.uiuc.edu
2004/10/19
2
How to Become a Charm++ Hacker
Advanced Charm++:
• Advanced messaging
• Writing system libraries (Groups, Delegation)
• Communication framework
• Advanced load balancing
• Checkpointing
• Threads
• SDAG
3
Advanced Messaging
4
Prioritized Execution
If several messages are available, Charm will process the message with the highest priority; otherwise, the oldest message (FIFO).
Has no effect:
• If only one message is available (common for network-bound applications!)
• On outgoing messages
Very useful for speculative work, ordering timesteps, etc.
5
Priority Classes
The Charm++ scheduler has three queues: high, default, and low.
As signed integer priorities:
• -MAXINT through -1: highest priority
• 0: default priority
• 1 through +MAXINT: lowest priority
As unsigned bitvector priorities:
• 0x0000 through 0x7FFF: highest priority
• 0x8000: default priority
• 0x8001 through 0xFFFF: lowest priority
6
Prioritized Marshalled Messages
Pass a CkEntryOptions object as the last parameter.
For signed integer priorities:
  CkEntryOptions opts;
  opts.setPriority(-1);
  fooProxy.bar(x, y, opts);
For bitvector priorities:
  CkEntryOptions opts;
  unsigned int prio[2] = {0x7FFFFFFF, 0xFFFFFFFF};
  opts.setPriority(64, prio);
  fooProxy.bar(x, y, opts);
7
Prioritized Messages
The number of priority bits is passed during message allocation:
  FooMsg *msg = new (size, nbits) FooMsg;
Priorities are stored at the end of the message.
Signed integer priorities:
  *CkPriorityPtr(msg) = -1;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);
Unsigned bitvector priorities:
  CkPriorityPtr(msg)[0] = 0x7fffffff;
  CkSetQueueing(msg, CK_QUEUEING_BFIFO);
8
Advanced Message Features
Read-only messages: the entry method agrees not to modify or delete the message. Avoids a message copy for broadcasts, saving time.
Expedited messages: the message does not go through the Charm++ scheduler (faster).
Immediate messages: entries are executed in an interrupt handler or on the communication thread. Very fast, but tough to get right.
9
Read-Only, Expedited, Immediate
All declared in the .ci file:
  entry [nokeep] void foo_readonly(Msg *);
  entry [expedited] void foo_exp(Msg *);
  entry [immediate] void foo_imm(Msg *);
  // Immediate messages currently only work for NodeGroups
10
Groups
11
Object Groups
A collection of objects (chares), also called branch-office chares.
Exactly one representative on each processor; ideally suited for system libraries.
A single proxy for the group as a whole.
Similar to arrays: broadcasts, reductions, indexing.
But not completely like arrays: non-migratable; one per processor.
12
Declarations
.ci file:
  group mygroup {
    entry mygroup();           // Constructor
    entry void foo(foomsg *);  // Entry method
  };
C++ file:
  class mygroup : public Group {
    mygroup() {}
    void foo(foomsg *m) { CkPrintf("Do Nothing"); }
  };
13
Creating and Calling Groups
Creation:
  p = CProxy_mygroup::ckNew();
Remote invocation:
  p.foo(msg);     // broadcast
  p[1].foo(msg);  // asynchronous invocation
Direct local access:
  mygroup *g = p.ckLocalBranch();
  g->foo(...);    // local invocation
Danger: if you migrate, the group stays behind!
14
Delegation
15
Delegation enables Charm++ proxy messages to be forwarded to a delegation manager group.
The delegation manager can trap calls to proxy sends and apply optimizations.
The delegation manager must inherit from CkDelegateMgr.
The user program must call proxy.ckDelegate(mgrID);
16
Delegation Interface
.ci file:
  group MyDelegateMgr {
    entry MyDelegateMgr(); // Constructor
  };
.h file:
  class MyDelegateMgr : public CkDelegateMgr {
    MyDelegateMgr();
    void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, CkArrayID a);
    void ArrayBroadcast(..);
    void ArraySectionSend(.., CkSectionID &s);
    .. .. ..
  }
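A minimal usage sketch (the proxy names are hypothetical, and the exact argument type of ckDelegate varies across Charm++ versions, so treat this as an assumption rather than the definitive API):
  // Create the manager group, then route an array proxy's sends through it.
  CProxy_MyDelegateMgr mgrProxy = CProxy_MyDelegateMgr::ckNew();
  myArrayProxy.ckDelegate(mgrProxy.ckLocalBranch()); // trap future sends
  myArrayProxy[0].foo(msg);  // now dispatched via MyDelegateMgr::ArraySend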
17
Communication Optimization
18
Automatic Communication Optimizations
The parallel-objects runtime system can observe, instrument, and measure communication patterns.
Communication libraries can optimize:
• By substituting the most suitable algorithm for each operation
• By learning at runtime
E.g., all-to-all communication:
• Performance depends on many runtime characteristics
• The library switches between different algorithms
Communication is from/to objects, not processors:
• Streaming messages optimization
19
Managing Collective Communication
Communication operations where all (or most) processors participate, e.g., broadcast, barrier, all-reduce, all-to-all.
Applications: NAMD multicast, NAMD PME, CPAIMD.
Issues:
• Performance impediment
• Naïve implementations often do not scale
• Synchronous implementations do not utilize the co-processor effectively
20
All to All Communication
All processors send data to all other processors.
• All-to-all personalized communication (AAPC): MPI_Alltoall
• All-to-all multicast/broadcast (AAMC): MPI_Allgather
21
Strategies for AAPC
• Short messages: high software overhead (α); addressed by message combining
• Large messages: network contention
22
Short Message Optimizations
Direct all-to-all communication is α-dominated.
Message combining for small messages:
• Reduces the total number of messages
• Multistage algorithm sends messages along a virtual topology
• Groups of messages are combined and sent to an intermediate processor, which then forwards them to their final destinations
• An AAPC strategy may send the same message multiple times
23
Virtual Topology: Mesh
Organize processors in a 2D (virtual) mesh.
Phase 1: processors send combined messages to their √P row neighbors.
A message from (x1,y1) to (x2,y2) goes via (x1,y2).
Phase 2: processors send combined messages to their √P column neighbors.
Roughly 2·(√P − 1) messages per processor instead of P − 1.
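As an illustration of the routing rule, here is a small sketch (assuming P is a perfect square and ranks are laid out row-major on a √P × √P mesh; the function name is hypothetical):
  // Intermediate hop for a message from src to dst on the virtual mesh:
  // it first travels to the processor in src's row and dst's column.
  int meshIntermediate(int src, int dst, int side /* side = sqrt(P) */) {
    int srcRow = src / side;        // x1
    int dstCol = dst % side;        // y2
    return srcRow * side + dstCol;  // rank of (x1, y2)
  }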
24
AAPC Times for Small Messages
[Figure: AAPC performance for small messages on Lemieux. Time (ms, 0 to 100) vs. number of processors (16 to 2048), comparing Native MPI, Mesh, and Direct strategies.]
25
Large Message Issues
• Network contention
• Contention-free schedules
• Topology-specific optimizations
26
Ring Strategy for Collective Multicast
Performs all-to-all multicast by sending messages along a ring formed by the processors.
Congestion-free on most topologies.
[Figure: ring of processors 0, 1, 2, ..., i, i+1, ..., P−1.]
27
Streaming Messages
Programs often have streams of short messages.
The streaming library combines a bunch of messages and sends them off together.
• Strips the large Charm++ header
• Packs short array messages
• Effective per-message performance of about 3 µs
28
Using the Communication Library
Communication optimizations are embodied as strategies:
• EachToManyMulticastStrategy
• RingMulticast
• PipeBroadcast
• Streaming
• MeshStreaming
29
Bracketed vs. Non-bracketed
Bracketed strategies:
• Require the user to give specific end points for each iteration of message sends
• End points are declared by calling ComlibBegin() and ComlibEnd()
• Example: EachToManyMulticast
Non-bracketed strategies:
• No such end points necessary
• Examples: Streaming, PipeBroadcast
30
Accessing the Communication Library
From mainchare::main:
Creating a strategy:
  Strategy *strat = new EachToManyMulticastStrategy(USE_MESH);
  strat = new StreamingStrategy();
  strat->enableShortMessagePacking();
Associating a proxy with a strategy:
  ComlibAssociateProxy(strat, myproxy);
• myproxy should be passed to all array elements
31
Sending Messages
  ComlibBegin(myproxy); // Bracketed strategies
  for (.. ..) {
    .. ..
    myproxy.foo(msg);
    .. ..
  }
  ComlibEnd(); // Bracketed strategies
32
Handling Migration
A migrating array element PUPs its comlib-associated proxy:
  FooArray::pup(PUP::er &p) {
    .. .. ..
    p | myProxy;
    .. .. ..
  }
33
Compiling
You must include the compile-time option -module commlib.
34
Advanced Load-balancers
Writing a Load-balancing Strategy
35
Advanced load balancing: Writing a new strategy
Inherit from CentralLB and implement the work(...) function:
  class fooLB : public CentralLB {
  public:
    .. .. ..
    void work(CentralLB::LDStats* stats, int count);
    .. .. ..
  };
36
LB Database
  struct LDStats {
    ProcStats *procs;
    LDObjData* objData;
    LDCommData* commData;
    int *to_proc;
    //.. .. ..
  }
  // Dummy work function which assigns all objects to processor 0.
  // Don't implement it this way!
  void fooLB::work(CentralLB::LDStats* stats, int count) {
    for (int obj = 0; obj < nobjs; obj++)
      stats->to_proc[obj] = 0;
  }
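For contrast, a minimal non-dummy work() (a sketch; it assumes the object count is exposed as n_objs, as in recent CentralLB::LDStats versions):
  void fooLB::work(CentralLB::LDStats* stats, int count) {
    // Round-robin objects across all `count` processors instead of
    // piling everything onto processor 0.
    for (int obj = 0; obj < stats->n_objs; obj++)
      stats->to_proc[obj] = obj % count;
  }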
37
Compiling and Integration
• Edit and run Makefile_lb.sh; it creates Make.lb, which is included by the LDB Makefile
• Run make depends to correct dependencies
• Rebuild charm++
38
Checkpoint Restart
39
Checkpoint/Restart
Any long-running application must be able to save its state.
When you checkpoint an application, it uses the pup routines to store the state of all objects.
State information is saved in a directory of your choosing.
Restore also uses pup, so no additional application code is needed (pup is all you need).
40
Checkpointing a Job
In AMPI, use MPI_Checkpoint(<dir>);
• Collective call; returns when the checkpoint is complete
In Charm++, use CkCheckpoint(<dir>, <resume>);
• Called on one processor; calls resume when the checkpoint is complete
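A minimal sketch of the Charm++ side (Main and mainProxy are assumed names; in current Charm++ the call is spelled CkStartCheckpoint):
  void Main::checkpoint() {
    // Resume from this callback once the checkpoint is on disk.
    CkCallback resume(CkIndex_Main::resumed(), mainProxy);
    CkStartCheckpoint("ckpt", resume);
  }
  void Main::resumed() { /* continue the run */ }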
41
Restart Job from Checkpoint
The charmrun option ++restart <dir> is used to restart.
The number of processors need not be the same.
You can also restart groups by marking them migratable and writing a PUP routine (see the sketch below); they still will not load balance, though.
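A sketch of such a group (hypothetical names; the [migratable] attribute goes in the .ci file, and the CBase_ superclass is generated by charmxi):
  // .ci:  group [migratable] MyGroup { entry MyGroup(); };
  class MyGroup : public CBase_MyGroup {
    int state;
   public:
    MyGroup() : state(0) {}
    MyGroup(CkMigrateMessage *m) {}      // migration constructor, needed for restart
    void pup(PUP::er &p) { p | state; }  // saves and restores the group's state
  };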
42
Threads
43
Why use Threads?
They provide one key feature: blocking.
• Suspend execution (e.g., at message receive)
• Do something else
• Resume later (e.g., after the message arrives)
Example: MPI_Recv, MPI_Wait semantics.
A function-call interface is more convenient than message passing:
• Regular call/return structure (no CkCallbacks)
• Allows blocking in the middle of a deeply nested communication subroutine
44
Why not use Threads?
Slower:
• Around 1 µs of context-switching overhead is unavoidable
• Creation/deletion cost is perhaps 10 µs
More complexity, more bugs.
Breaks a lot of machines! (but we have workarounds)
Migration is more difficult:
• The state of a thread is scattered through the stack, which is maintained by the compiler
• By contrast, the state of an object is maintained by the user
These thread disadvantages are the motivation for SDAG (later).
45
What are (Charm) Threads?
One flow of control (instruction stream):
• Machine registers & program counter
• Execution stack
Like pthreads (kernel threads), only different:
• Implemented at user level (in Converse)
• Scheduled at user level; non-preemptive
• Migratable between nodes
46
How do I use Threads?
Many options:
AMPI
• Always uses threads via the TCharm library
Charm++
• [threaded] entry methods run in a thread
• [sync] methods
Converse
• C routines CthCreate/CthSuspend/CthAwaken
• Everything else is built on these
• Implemented using SYSV makecontext/setcontext or POSIX setjmp/alloca/longjmp
47
How do I use Threads? (example)
Blocking API routine: find the array element
  int requestFoo(int src) {
    myObject *obj = ...;
    return obj->fooRequest(src);
  }
Send the request and suspend:
  int myObject::fooRequest(int src) {
    proxy[dest].fooNetworkRequest(thisIndex);
    stashed_thread = CthSelf();
    CthSuspend(); // blocks until the awaken call
    return stashed_return;
  }
Awaken the thread when data arrives:
  void myObject::fooNetworkResponse(int ret) {
    stashed_return = ret;
    CthAwaken(stashed_thread);
  }
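The example suspends and awakens an existing thread; creating one with the raw Converse API looks roughly like this (a sketch; the stack-size argument 0 requests the default):
  void workerFn(void *arg) { /* runs inside the new user-level thread */ }
  CthThread t = CthCreate((CthVoidFn)workerFn, arg, 0);
  CthAwaken(t);  // enqueue the thread with the Converse scheduler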
49
The Horror of Thread Migration
50
Stack Data
The stack is used by the compiler to track function calls and provide temporary storage:
• Local variables
• Subroutine parameters
• C "alloca" storage
Most of the variables in a typical application are stack data.
Users have no control over how the stack is laid out.
51
Migrate Stack Data
Without compiler support, we cannot change the stack's address, because we cannot fix up the stack's interior pointers (return frame pointer, function arguments, etc.).
Solution: "isomalloc" addresses
• Reserve address space on every processor for every thread stack
• Use mmap to scatter stacks in virtual memory efficiently
• The idea comes from PM2
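Roughly, the per-stack reservation looks like this (a sketch: the fixed address and size are hypothetical, and real isomalloc must negotiate a range that is free on every processor):
  #include <sys/mman.h>
  // Map a thread's stack at the same fixed address everywhere, so the
  // stack's interior pointers stay valid after migration.
  void *reserveStack(void *fixedAddr, size_t size) {
    return mmap(fixedAddr, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  }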
52
Migrate Stack Data
[Figure: Processor A's memory holds code, globals, and heap at low addresses and the stacks of threads 1 to 4 at high addresses; Processor B holds only code, globals, and heap. Arrow: migrate Thread 3 from A to B.]
53
Migrate Stack Data: Isomalloc
[Figure: after migration with isomalloc, Thread 3's stack occupies the same virtual address range on Processor B that it did on Processor A; threads 1, 2, and 4 remain on A.]
54
Migrate Stack Data
Isomalloc is a completely automatic solution:
• No changes needed in the application or compilers
• Just like a software shared-memory system, but with proactive paging
But it has a few limitations:
• Depends on having large quantities of virtual address space (best on 64-bit)
• 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine
• Depends on unportable mmap: which addresses are safe? (we must guess!) What about Windows? Or Blue Gene?
55
Aliasing Stack Data
[Figure: initial state. Processors A and B each hold code, globals, and heap; Thread 2 and Thread 3 stacks are stored on Processor A outside the execution region.]
56
Aliasing Stack Data: Run Thread 2
[Figure: to run Thread 2, its stack is copied into the single fixed execution region in Processor A's memory.]
57
Aliasing Stack Data
[Figure: when Thread 2 stops running, its stack is copied back out of the execution region.]
58
Aliasing Stack Data: Run Thread 3
[Figure: to run Thread 3, its stack is copied into the same fixed execution region on Processor A.]
59
Aliasing Stack Data
[Figure: Thread 3's stack, not currently executing, is migrated from Processor A to Processor B.]
60
Aliasing Stack Data
[Figure: after migration, Thread 3's stack resides on Processor B; Thread 2's stack remains on Processor A.]
61
Aliasing Stack Data
[Figure: Thread 3 now runs on Processor B by copying its stack into B's fixed execution region.]
62
Aliasing Stack Data
Does not depend on having large quantities of virtual address space:
• Works well on 32-bit machines
Requires only one mmap'd region at a time:
• Works even on Blue Gene!
Downsides:
• Thread context switch requires munmap/mmap (3 µs)
• Can only have one thread running at a time (so no SMPs!)
63
Heap Data
Heap data is any dynamically allocated data:
• C "malloc" and "free"
• C++ "new" and "delete"
• F90 "ALLOCATE" and "DEALLOCATE"
Arrays and linked data structures are almost always heap data.
64
Migrate Heap Data
Automatic solution: isomalloc all heap data, just like stacks!
• "-memory isomalloc" link option
• Overrides malloc/free; no new application code needed
• Same limitations as isomalloc, plus page-allocation granularity (huge!)
Manual solution: the application moves its own heap data.
• Needs to be able to size the message buffer, pack data into the message, and unpack on the other side
• The "pup" abstraction does all three
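For the manual route, a minimal pup sketch (hypothetical class and members) that sizes, packs, and unpacks a heap array with one routine:
  void MyObj::pup(PUP::er &p) {
    p | n;  // element count first, so unpacking knows how much to allocate
    if (p.isUnpacking()) data = new double[n];
    PUParray(p, data, n);  // sizes, packs, or unpacks depending on the PUP::er
  }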
65
SDAG
66
Structured Dagger
What is it?
• A coordination language built on top of Charm++
Motivation:
• Charm++'s asynchrony is efficient and reliable, but tough to program: flags, buffering, out-of-order receives, etc.
• Threads are easy to program, but less efficient and less reliable: implementation complexity, porting headaches
• We want the benefits of both!
67
Structured Dagger Constructs
when <method list> {code}
• Do not continue until the method is called
• Internally generates flags, checks, etc.; does not use threads
atomic {code}
• Calls ordinary sequential C++ code
if/else/for/while
• C-like control flow
overlap {code1 code2 ...}
• Executes code segments in parallel
forall
• "Parallel do": like a parameterized overlap
68
Stencil Example Using Structured Dagger
  array[1D] myArray {
    ...
    entry void GetMessages() {
      when rightmsgEntry(), leftmsgEntry() {
        atomic {
          CkPrintf("Got both left and right messages\n");
          doWork(right, left);
        }
      }
    };
    entry void rightmsgEntry();
    entry void leftmsgEntry();
    ...
  };
69
Overlap for LeanMD Initialization
  array[1D] myArray {
    ...
    entry void waitForInit(void) {
      overlap {
        when recvNumCellPairs(myMsg* pMsg) {
          atomic { setNumCellPairs(pMsg->intVal); delete pMsg; }
        }
        when recvNumCells(myMsg* cMsg) {
          atomic { setNumCells(cMsg->intVal); delete cMsg; }
        }
      }
    }
  };
70
The for construct in the LeanMD timeloop:
  entry void doTimeloop(void) {
    for (timeStep_ = 1; timeStep_ <= SimParam.NumSteps; timeStep_++) {
      atomic { sendAtomPos(); }
      overlap {
        for (forceCount_ = 0; forceCount_ < numForceMsg_; forceCount_++) {
          when recvForces(ForcesMsg* msg) { atomic { procForces(msg); } }
        }
        for (pmeCount_ = 0; pmeCount_ < nPME; pmeCount_++) {
          when recvPME(PMEGridMsg* m) { atomic { procPME(m); } }
        }
      }
      atomic { doIntegration(); }
      if (timeForMigrate()) { ... }
    }
  }
71
Conclusions
72
Conclusions
AMPI and Charm++ provide a fully virtualized runtime system:
• Load balancing via migration
• Communication optimizations
• Checkpoint/restart
Virtualization can significantly improve performance for real applications.
73
Thank You!
Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/
Parallel Programming Lab at the University of Illinois