LOTS: A Software DSM Supporting Large Object Space
Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau
Department of Computer Science, The University of Hong Kong
September 2004

Presentation Outline
• Why LOTS? (Objectives)
• DSM Background and Related Work
• Design of LOTS
• Performance Testing and Results
• Conclusion and Future Work

The Problem in Current DSM
• Lack of shared object (memory) space
  – Another major problem apart from performance
  – Fixed address mapping in virtual memory
  – Shared object space size < process space
    • TreadMarks: ~ min RAM size among all machines
    • JIAJIA V1.0: 128 MB
  – 32-bit machines: max 4 GB shared space
  – Unscalable: fixed regardless of # machines
  – Large problems (needing > 4 GB of shared memory) can't be run directly: the programmer needs to change the application code to reduce memory utilization.

Objectives of LOTS
• Using 64-bit machines is not a total solution!
• 32-bit machines are dominating the market (poor man's clusters :<)
• Problems keep increasing memory consumption
[Figure: two photographs, captioned "(Poor man's cluster)" and "(Rich man's cluster)"]

Objectives of LOTS
• Hence we introduce LOTS:
  – Large Shared Object Space > 4GB
  – Dynamic run-time memory mapping technique
  – Local disk as the backing store for temporarily unused objects
  – Shared space size now limited by disk space
  – Lazy disk read/write: reasonable performance

Some DSM Background
• Memory Consistency Issues:
  – Memory Consistency Models
    • Sequential Consistency (IVY) performs poorly
    • Relaxed models reduce redundant data traffic
    • Lazy Release Consistency (TreadMarks)
    • Scope Consistency (JIAJIA)
[Figure: processes P and Q each run a section Acq(L) ... Rel(L); P writes X=3 and Y=5, then Q reads X=? and Y=?]
In Scope Consistency, Q sees X to be 3, but Y may not be 5.
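The P/Q example can be imitated with a toy model. This is only a sketch under stated assumptions: the class ScopeModel and its methods are hypothetical (not JIAJIA or LOTS code), Y=5 is assumed to be written outside the Acq(L) ... Rel(L) section, and writes are recorded against the lock immediately rather than at release time.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of Scope Consistency (hypothetical, for illustration only).
// Each process has a private view; a write made while holding a lock is
// also recorded against that lock. Acquiring the lock pulls in only the
// writes recorded in that lock's scope.
class ScopeModel {
public:
    explicit ScopeModel(int nprocs) : procs_(nprocs) {}

    void acquire(int p, int lock) {
        procs_[p].heldLock = lock;
        for (const auto& kv : lockWrites_[lock])
            procs_[p].view[kv.first] = kv.second;  // apply scope updates
    }

    void release(int p) { procs_[p].heldLock = -1; }

    void write(int p, const std::string& var, int value) {
        procs_[p].view[var] = value;
        if (procs_[p].heldLock >= 0)
            lockWrites_[procs_[p].heldLock][var] = value;
    }

    // Variables never seen by process p read as 0 in this toy model.
    int read(int p, const std::string& var) const {
        const auto& view = procs_[p].view;
        auto it = view.find(var);
        return it == view.end() ? 0 : it->second;
    }

private:
    struct Proc {
        std::map<std::string, int> view;
        int heldLock = -1;  // -1 means no lock held
    };
    std::vector<Proc> procs_;
    std::map<int, std::map<std::string, int>> lockWrites_;
};
```

With P as process 0 and Q as process 1, P writes Y outside the lock and X inside it. After Q acquires the same lock, reading X yields 3 while Y is still stale in this model (the slide only says Y "may not be" 5).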

Some DSM Background
• More Memory Consistency Issues:
  – Coherence Protocols
    • Home-based (JIAJIA) vs Homeless (TreadMarks) vs Migrating-Home (JUMP)
    • Write-update vs Write-invalidate
    • Adaptive Protocol (DOSA, ADSM)
  – The coherence protocol has to match the memory model for higher efficiency
• No DSM deals with Large Object Space!

Related Work
• Large object space support:
  – Pointer swizzling
    • Artificial, invalid addresses are translated to machine-addressable form during access
    • Used in persistent store (QuickStore, Thor-1)
[Figure: Process Space. Compiler-generated addresses cause page fault at runtime and are translated to valid ones. Unused objects free their virtual addresses and are swapped out (i.e., swizzled out) to hard disk.]

Design of LOTS
• Dynamic Memory Mapping (DMM)
  – Uses C++ Operator Overloading as the interface
    • Overloads [], +, -, *, /, ++, --, >=, <=, !=, etc.
  – Purely runtime
[Figure: a Program executing A[5]=7; against the Virtual Memory Area, which contains the Heap Area (holding A->ctrl) and the Data (DMM) Area (holding Array A), with the Local Hard Disk and Remote Memory (over the Network) as sources. (1) Access invokes mapping mechanism. (2) Bring in object from disk / network. (3) Internal structure points to object data for access.]
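The three steps in the figure can be sketched with an overloaded operator[]. This is a minimal illustration, not the LOTS implementation: SharedArray, BackingStore, mapIn and mapOut are hypothetical names, and an in-memory map stands in for the local disk and remote memory.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the local disk / remote memory.
struct BackingStore {
    std::unordered_map<int, std::vector<int>> objects;  // object id -> data
};

// Hypothetical handle to a shared array (plays the role of A->ctrl).
class SharedArray {
public:
    SharedArray(int id, std::size_t n, BackingStore& store)
        : id_(id), store_(store) {
        store_.objects[id_] = std::vector<int>(n, 0);  // starts swapped out
    }

    // (1) Every element access goes through the overloaded operator.
    int& operator[](std::size_t i) {
        if (!mapped_) mapIn();  // (2) bring the object in on demand
        return data_.at(i);     // (3) handle points at the live data
    }

    // Swap the object out when it is temporarily unused.
    void mapOut() {
        if (!mapped_) return;
        store_.objects[id_] = std::move(data_);
        data_.clear();
        mapped_ = false;
    }

    bool mapped() const { return mapped_; }

private:
    void mapIn() {
        data_ = std::move(store_.objects[id_]);
        mapped_ = true;
    }

    int id_;
    BackingStore& store_;
    std::vector<int> data_;
    bool mapped_ = false;
};
```

Writing A[5] = 7 on an unmapped array triggers mapIn() first; after mapOut() the value is still there on the next access. The access check on every operator call is the kind of cost the deck later reports as overhead.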

LOTS Shared Objects Creation
• Through the LOTS memory allocator
  – Exists as a C++ class
• Memory allocation through alloc() function
• Put data into specific part of process space
• Object control info in heap area
[Figure: Process Space from 0x00000000 to 0xffffffff, with boundaries marked at 0x50000000, 0x70000000, 0x90000000, 0xb0000000 and 0xc0000000. Regions shown: TXT Segment, DAT Segment, Heap Area (holding A->Ctrl), DMM Area (holding Array A), Twin Area, DSM Control Area, C++ Stack, Kernel Reserved.]

LOTS Memory Allocator
• Bypasses Doug Lea's Memory Allocator used in original C/C++
• Uses mmap() to get physical memory, and maps the shared object data to the process space.
  – Free queues and used queues
  – Small & large objects allocated separately
[Figure: Heap Area, DMM Area (0x50000000 to 0x70000000, containing used blocks and free blocks), and Twin and Control Area. A free queue and a used queue are kept per block size: 8, 16, 24, 32, 40, 48, …, 1M, 2M, …, ½G.]
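A rough sketch of size-class free and used queues follows. Several things here are assumptions rather than facts from the slides: the names DmmAllocator and roundToClass, the 1024-byte boundary between small and large objects, and the doubling of large size classes. The sketch only hands out addresses within a range; it does not call mmap().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>

// Assumed boundary between "small" and "large" objects (not from the slides).
constexpr std::size_t kSmallLimit = 1024;

// Small requests round up to a multiple of 8 (8, 16, 24, ...);
// large requests round up to the next power of two (assumed).
std::size_t roundToClass(std::size_t bytes) {
    if (bytes == 0) bytes = 1;
    if (bytes <= kSmallLimit) return (bytes + 7) / 8 * 8;
    std::size_t c = kSmallLimit;
    while (c < bytes) c *= 2;
    return c;
}

// Hypothetical allocator over an address range such as the DMM area.
class DmmAllocator {
public:
    DmmAllocator(std::uintptr_t base, std::uintptr_t limit)
        : next_(base), limit_(limit) {}

    // Returns an address in [base, limit), or 0 if the area is exhausted.
    std::uintptr_t alloc(std::size_t bytes) {
        const std::size_t cls = roundToClass(bytes);
        std::uintptr_t addr = 0;
        auto& freeQueue = free_[cls];
        if (!freeQueue.empty()) {            // reuse a freed block first
            addr = freeQueue.front();
            freeQueue.pop_front();
        } else if (limit_ - next_ >= cls) {  // otherwise carve a new block
            addr = next_;
            next_ += cls;
        } else {
            return 0;
        }
        used_[addr] = cls;                   // track it as a used block
        return addr;
    }

    // Moves a block from the used set back to its free queue.
    void release(std::uintptr_t addr) {
        auto it = used_.find(addr);
        if (it == used_.end()) return;
        free_[it->second].push_back(addr);
        used_.erase(it);
    }

    std::size_t usedBlocks() const { return used_.size(); }

private:
    std::uintptr_t next_;
    std::uintptr_t limit_;
    std::map<std::size_t, std::deque<std::uintptr_t>> free_;  // class -> free blocks
    std::map<std::uintptr_t, std::size_t> used_;              // address -> class
};
```

Keeping one free queue per size class means a released block can be reused by the next request of the same class without searching or splitting.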

Shared Memory Behavior
• Goal: Reduce redundant data traffic
  – Memory Consistency Model: Scope
  – Memory Coherence Protocol: Mixed
    • Lock-synchronized objects: Homeless + write-update
    • Barrier-synchronized objects: Migrating-Home + write-invalidate
    • Principle: To eliminate as much all-to-all data communication as possible

Mixed Coherence Protocol
• An Example:
[Figure: timeline of processes P0, P1, P2 and P3 sharing objects X and Y, with one process marked "Home of X and Y". Lock sections Acq(L1) ... Rel(L1) carry the updates x1=1, y1=5, x1++ and y1++; lock sections Acq(L2) ... Rel(L2) carry x2=3 and x2++. At the Barrier, "Inv X, Y" is sent to three processes and one process is marked "New Home". Values shown after synchronization: x1=?, x2=4, and x1=2, x2=4. Legend: Updates Movement, Home Token Movement.]
When the processes arrive at the barrier, the process that holds the token of an object becomes the new home of that object, and the other processes send their updates to the home.
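The barrier rule can be sketched as follows. This is a simplified illustration only: SharedObject, BarrierResult and migrateHomesAtBarrier are hypothetical names, the token holder is taken as given (how the token moves between barriers is not modelled), and messages are merely counted.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical per-object bookkeeping for the migrating-home protocol.
struct SharedObject {
    int home;               // current home process
    int tokenHolder;        // process holding the object's home token
    std::set<int> writers;  // processes with updates since the last barrier
};

struct BarrierResult {
    int updateMessages = 0;     // updates sent to the new homes
    std::vector<int> newHomes;  // new home of each object, in input order
};

// At a barrier: the token holder becomes the new home of each object,
// and every other writer sends its updates to that home.
BarrierResult migrateHomesAtBarrier(std::vector<SharedObject>& objects) {
    BarrierResult result;
    for (auto& obj : objects) {
        obj.home = obj.tokenHolder;
        for (int p : obj.writers) {
            if (p != obj.home) ++result.updateMessages;
        }
        obj.writers.clear();  // updates are now merged at the home
        result.newHomes.push_back(obj.home);
    }
    return result;
}
```

When the token holder is itself the only writer, no update message is needed at all, which is the point of letting the home migrate.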

Making LOTS More Efficient
• Eliminating Diff Accumulation Problem
  – Lock and timestamp info in DSM control area
  – Calculate diff on request, no redundancy

Traditional Method: the diffs created at T=1 (len=6), T=2 (len=4), T=3 (len=4) and T=4 (len=3) accumulate. All of these updates need to be sent (17 units data + 8 units of control).

LOTS Method: each element keeps its value and its last updated time.

  Element            X1  X2  X3  X4  X5  X6  X7  X8
  Last Updated Time   2   3   4   0   4   1   4   3

Diff calculated on request:

  Time  Length  Elements
  1     1       X6
  2     1       X1
  3     2       X2 X8
  4     3       X3 X7 X5

Only send 7 units data + 8 units of control data.
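The on-request diff can be sketched with one timestamp per element. The names TimestampedObject, Traffic, onRequestTraffic and slideExample are hypothetical. The last-updated times produced by slideExample (2, 3, 4, 0, 4, 1, 4, 3 for X1 to X8) are those in the slide; the exact write history behind them is an assumption, chosen only to match the slide's diff lengths of 6, 4, 4 and 3.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical object that stamps every element with its last write time.
struct TimestampedObject {
    std::vector<int> value;
    std::vector<int> lastUpdated;  // 0 means never written

    explicit TimestampedObject(std::size_t n) : value(n, 0), lastUpdated(n, 0) {}

    void write(std::size_t i, int v, int now) {
        value[i] = v;
        lastUpdated[i] = now;
    }

    // Groups the elements written after `since` by their last write time.
    std::map<int, std::vector<std::size_t>> diffSince(int since) const {
        std::map<int, std::vector<std::size_t>> groups;
        for (std::size_t i = 0; i < value.size(); ++i) {
            if (lastUpdated[i] > since) groups[lastUpdated[i]].push_back(i);
        }
        return groups;
    }
};

struct Traffic {
    std::size_t data;     // element values sent
    std::size_t control;  // one (time, length) pair per group
};

Traffic onRequestTraffic(const TimestampedObject& obj, int since) {
    Traffic t{0, 0};
    for (const auto& group : obj.diffSince(since)) {
        t.data += group.second.size();
        t.control += 2;
    }
    return t;
}

// Replays an assumed write history (indices 0..7 stand for X1..X8) and
// reports how many element writes a traditional scheme would accumulate.
TimestampedObject slideExample(std::size_t& accumulatedWrites) {
    const std::vector<std::vector<std::size_t>> history = {
        {0, 1, 2, 4, 5, 6},  // T=1, len=6
        {0, 1, 2, 4},        // T=2, len=4
        {1, 7, 2, 6},        // T=3, len=4
        {2, 4, 6},           // T=4, len=3
    };
    TimestampedObject obj(8);
    accumulatedWrites = 0;
    for (std::size_t t = 0; t < history.size(); ++t) {
        for (std::size_t i : history[t]) {
            obj.write(i, static_cast<int>(t + 1), static_cast<int>(t + 1));
            ++accumulatedWrites;
        }
    }
    return obj;
}
```

A requester that has seen nothing (since = 0) gets each written element exactly once, however many times it was overwritten, so the data volume is bounded by the object size rather than by the number of writes.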

Other Components in LOTS
• C++ runtime library in Linux
• Minimal set of functions as interface
  – Retains as much C++ syntax as possible to improve programmability
• Synchronization: Locks and Barriers
  – Barriers: With/Without memory effect
• Communication: Sockets with UDP/IP
• SIGIO handler for incoming messages

Performance Testing
• Two Kinds of Testing
1 Without invoking large object space support
  • Compare performance with another DSM (JIAJIA V1.0, as both have a similar communication protocol)
  • Report no. of messages and bytes sent
  • Calculate large object space support overhead
  • 16 Pentium IV 2GHz machines with 100Mbps Fast Ethernet connection, 128MB mem, Linux Fedora
2 With large object space support
  • Use an application with large memory demand
  • Run on different platforms for analysis
  • Expect disk read/write overhead to dominate

Test 1: Timing Performance
[Figure: four execution-time plots. Legend: LOTS: LOTS enabled; LOTS-x: LOTS disabled. x-axis: problem size; y-axis: execution time in seconds. Panel annotations: LOTS<JiaJia, LOTS<JiaJia, LOTS>JiaJia, LOTS<JiaJia.]

Performance Results Summary
• LOTS beats JIAJIA V1.0 in most applications
  – Mixed protocol + "Diff accumulation elimination" reduce data traffic
• Large object space support and access checking incur a considerable overhead
  – about 5-15% of total execution time (application dependent)

Test 2: Large Object Space
• Using 4-node PC and server clusters
• Test program: simple matrix operations
• With 120GB (SCSI) hard disk in each machine, able to claim 117.77GB Shared Object Space
• Disk read and write time is closely related to the OS version.

  CPU                       OS        RAM (MB)  # Shared Objs (X)  Per Obj Size (MB)  Total Shared Obj Size (GB)  Exec Time (sec)
  P4-2GHz                   Fedora 1  512       33                 128                4.125                       142
  P4-2GHz                   Fedora 1  512       132                128                16.50                       373
  Xeon P3 500 MHz x4 (SMP)  Fedora 1  1024      132                128                16.50                       507
  Xeon P3 500 MHz x4 (SMP)  Fedora 1  1024      132                511                65.87                       2112
  Xeon P3 500 MHz x4 (SMP)  Fedora 1  1024      201                511                100.30                      3227
  Xeon P3 500 MHz x4 (SMP)  Fedora 1  1024      236                511                117.77                      3839
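As a quick check on the table, the total shared object size follows from the number of objects and the per-object size, assuming 1 GB = 1024 MB. The first and last rows work out as:

```latex
\[
\text{Total size (GB)} = \frac{X \times \text{per-object size (MB)}}{1024},
\qquad
\frac{33 \times 128}{1024} = 4.125,
\qquad
\frac{236 \times 511}{1024} \approx 117.77 .
\]
```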

Conclusions
• LOTS succeeds in:
  – Providing a shared object space larger than the local process space during runtime
  – Performing reasonably well by reducing data traffic through Scope Consistency, the mixed coherence protocol and the "diff accumulation elimination" technique
  – Offering a programming interface similar to C++

Future Work
• A Number of Optimizations:
  – Further increase shared object space: "the minimum hard disk space x number of processes / 2".
    • Recent progress: 64GB (4GB x 16) of shared objects can be allocated in 16 machines, each having a 9GB hard disk.
  – Reduce disk overhead
  – Reduce over-loading overhead (access check)
  – Load-aware migrating-home protocol: coherence protocol adapting to network traffic and processor loading (e.g., avoid too many "homes" in a single machine)

Questions?

Test 1: No. of Messages Sent
[Figure: messages sent by LOTS as a percentage (%) of those sent by JIAJIA, plotted against No. of procs (p = 2, 4, 8, 16) for FL (n=2048), LU (n=1024), ME (n=8192), RX (n=8192) and RB (n=2048); y-axis from 0 to 90.]
The percentage is obtained by dividing the number of messages sent in LOTS by that in JIAJIA for the same application.
Due to the mixed protocol, LOTS sends fewer messages through the network than JIAJIA.

Test 1: No. of Bytes Sent
[Figure: bytes sent by LOTS as a percentage (%) of those sent by JIAJIA, plotted against No. of procs (p = 2, 4, 8, 16) for FL (n=2048), LU (n=1024), ME (n=8192), RX (n=8192) and RB (n=2048); y-axis from 0 to 100.]
The percentage is obtained by dividing the number of bytes sent in LOTS by that in JIAJIA for the same application.

Test 2: Large Object Space
• Allocate shared objects with total size > 4GB, and another process accesses each of them once (array addition with p=4)

    // X, size, linec and dsmid are not defined in this excerpt.
    int main(int argc, char **argv)
    {
        int i, j, pp, local[4];

        // 2D int array
        Pointer<Pointer<int> > a;

        lots_init();                      // init LOTS

        // shared memory allocation
        a.alloc(X);
        for (i = 0; i < X; i++)
            a[i].alloc(size);
        nm_barrier();                     // barrier

        // array addition
        for (j = 0; j < linec; j++) {
            pp = (dsmid + j) % linec;
            for (i = pp; i < X; i += 4) {
                acq(i);
                a[i][0] += rand();
                rel(i);
            }
        }

        nm_barrier();
        return 0;
    }