lots: a software dsm supporting large object space benny wang-leung cheung, cho-li wang, and francis...

25
LOTS: A Software DSM LOTS: A Software DSM Supporting Large Object Supporting Large Object Space Space Benny Wang-Leung Cheung, Benny Wang-Leung Cheung, Cho-Li Wang Cho-Li Wang , and Francis , and Francis Chi-Moon Lau Chi-Moon Lau Department of Computer Department of Computer Science The University Science The University of Hong Kong of Hong Kong September, September, 2004 2004

Upload: leo-fisher

Post on 25-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

LOTS: A Software DSM LOTS: A Software DSM Supporting Large Object Supporting Large Object

SpaceSpaceBenny Wang-Leung Cheung, Benny Wang-Leung Cheung, Cho-Li WangCho-Li Wang, and Francis Chi-Moon , and Francis Chi-Moon

LauLauDepartment of Computer Department of Computer Science The University of Science The University of

Hong KongHong Kong

September, September, 20042004

Page 2: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

22

Presentation OutlinePresentation Outline

• Why LOTS? (Objectives)Why LOTS? (Objectives)• DSM Background and Related WorkDSM Background and Related Work• Design of LOTSDesign of LOTS• Performance Testing and ResultsPerformance Testing and Results• Conclusion and Future WorkConclusion and Future Work

Page 3: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

33

The Problem in Current The Problem in Current DSMDSM

• Lack of shared object (memory) spaceLack of shared object (memory) space– Another major problem apart from Another major problem apart from

performanceperformance– Fixed address mapping in virtual memoryFixed address mapping in virtual memory– Shared object space size < process spaceShared object space size < process space

• TreadMarks: ~ min RAM size among all TreadMarks: ~ min RAM size among all machinesmachines

• JIAJIA V1.0: 128 MBJIAJIA V1.0: 128 MB– 32-bit machines 32-bit machines max 4 GB shared space max 4 GB shared space– Unscalable: Fixed regardless of # machinesUnscalable: Fixed regardless of # machines– Large problems (with > 4GB shared memory Large problems (with > 4GB shared memory

need) can’t be run directly need) can’t be run directly The programmer The programmer needs to change the application code to needs to change the application code to reduce the memory utilization. reduce the memory utilization.

Page 4: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

44

Objectives of LOTSObjectives of LOTS

• Using 64-bit machines Using 64-bit machines is not a total solution!is not a total solution!

• 32-bit machines are 32-bit machines are dominating the dominating the market (poor man’s market (poor man’s clusters :<)clusters :<)

• Problems keep Problems keep increasing memory increasing memory consumptionconsumption

(Poor man’s cluster)(Poor man’s cluster)

(Rich man’s cluster)(Rich man’s cluster)

Page 5: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

55

Objectives of LOTSObjectives of LOTS

• Hence we introduce LOTS:Hence we introduce LOTS:– Large Shared Object Space > 4GBLarge Shared Object Space > 4GB– Dynamic run-time memory mapping Dynamic run-time memory mapping

techniquetechnique– Local disk as the backing store for Local disk as the backing store for

temporarily unused objectstemporarily unused objects– Shared space size now limited by disk spaceShared space size now limited by disk space– Lazy disk read/write Lazy disk read/write reasonable reasonable

performanceperformance

Page 6: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

66

Some DSM BackgroundSome DSM Background

• Memory Consistency Memory Consistency Issues:Issues:– Memory Consistency ModelsMemory Consistency Models

• Sequential Consistency (IVY) Sequential Consistency (IVY) performs poorlyperforms poorly

• Relaxed models reduce Relaxed models reduce redundant data trafficredundant data traffic

• Lazy Release Consistency Lazy Release Consistency (TreadMarks)(TreadMarks)

• Scope Consistency (JIAJIA)Scope Consistency (JIAJIA)Rel(L)Rel(L)

Acq(LAcq(L))X=?

Y=?

Rel(L)Rel(L)

Acq(LAcq(L))X=3

Y=5

PP QQ

In Scope, Q In Scope, Q sees X to sees X to be 3, but Y be 3, but Y may not be may not be 55

Page 7: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

77

Some DSM BackgroundSome DSM Background

• More Memory Consistency Issues:More Memory Consistency Issues:– Coherence ProtocolsCoherence Protocols

• Home-based (JIAJIA) vs Homeless Home-based (JIAJIA) vs Homeless (TreadMarks) vs Migrating-Home (JUMP)(TreadMarks) vs Migrating-Home (JUMP)

• Write-update vs Write-invalidateWrite-update vs Write-invalidate• Adaptive Protocol (DOSA, ADSM)Adaptive Protocol (DOSA, ADSM)

– Coherence Protocol has to match with Coherence Protocol has to match with memory model for higher efficiencymemory model for higher efficiency

• No DSM deals with Large Object Space!No DSM deals with Large Object Space!

Page 8: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

88

Related WorkRelated Work

• Large object space support:Large object space support:– Pointer swizzlingPointer swizzling

• Artificial, invalid addresses are translated to Artificial, invalid addresses are translated to machine-addressable form during accessmachine-addressable form during access

• Used in persistent store (QuickStore, Thor-1)Used in persistent store (QuickStore, Thor-1)

Process Space

Compiler-generated Compiler-generated addresses cause page addresses cause page fault at runtime and fault at runtime and are translated to valid are translated to valid onesones

Unused objects free Unused objects free their virtual addresses their virtual addresses and are swapped out and are swapped out (i.e., swizzled out) to (i.e., swizzled out) to hard diskhard disk

Page 9: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

99

Design of LOTSDesign of LOTS

• Dynamic Memory Mapping (DMM)Dynamic Memory Mapping (DMM)– Uses C++ Uses C++ Operator OverloadingOperator Overloading as the interface as the interface

• Overloads [], +, -, *, /, ++, --, >=, <=, !=, etc.Overloads [], +, -, *, /, ++, --, >=, <=, !=, etc.– Purely runtime; Purely runtime;

Virtual Memory Virtual Memory AreaArea

PrograProgramm

Local Hard Local Hard DiskDisk

Remote Remote MemoryMemory

NetworNetworkk

(2) (2) Bring Bring in object in object from from disk / disk / networknetwork

Array AArray A

Data (DMM) AreaData (DMM) Area

Heap AreaHeap Area

(3) (3) Internal structure Internal structure points to object data points to object data for accessfor access

A->ctrlA->ctrl(1) (1) Access invokes Access invokes mapping mapping mechanismmechanism

A[5]=7;A[5]=7;

Page 10: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1010

LOTS Shared Objects LOTS Shared Objects CreationCreation

• Through the LOTS Through the LOTS memory allocatormemory allocator– Exists as a C++ class Exists as a C++ class

• Memory allocation Memory allocation through alloc() through alloc() functionfunction

• Put data into specific Put data into specific part of process spacepart of process space

• Object control info in Object control info in heap areaheap area

00x00000000x00000000

00x50000000x50000000

00x70000000x70000000

00x90000000x90000000

00xb0000000xb000000000xc0000000xc0000000

00xffffffffxffffffff

Heap Heap AreaArea

DMM AreaDMM Area

Twin AreaTwin Area

DSM Control DSM Control AreaArea

Kernel Kernel ReservedReserved C++ C++

StackStack

TXT TXT SegmentSegment

DAT DAT SegmentSegmentA->CtrlA->Ctrl

Array AArray A

Process SpaceProcess Space

Page 11: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1111

LOTS Memory AllocatorLOTS Memory Allocator• Bypass Doug Lea’s Memory Allocator Bypass Doug Lea’s Memory Allocator

used in original C/C++used in original C/C++• Uses mmap() to get physical memory, Uses mmap() to get physical memory,

and map the shared object data to the and map the shared object data to the process space.process space.– Free queues and used queues Free queues and used queues – Small & large objects allocated separatelySmall & large objects allocated separatelyFreeFree

queuqueuee

0x50000000 0x50000000 DMM AreaDMM Area 0x70000000 0x70000000

Used Used blockblockFree Free blockblock

Heap Heap AreaArea

81624324048…1M2M…½G

Twin and Twin and Control Control

AreaArea

81624324048…1M2M…½GUsed Used queuqueu

ee

Page 12: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1212

Shared Memory BehaviorShared Memory Behavior

• Goal: Reduce redundant data trafficGoal: Reduce redundant data traffic– Memory Consistency Model: ScopeMemory Consistency Model: Scope– Memory Coherence Protocol: MixedMemory Coherence Protocol: Mixed

• Lock-synchronized objectsLock-synchronized objects : Homeless + : Homeless + write-updatewrite-update

• Barrier-synchronized objectsBarrier-synchronized objects : Migrating- : Migrating-Home + write-invalidateHome + write-invalidate

• Principle: To eliminate as much all-to-all data Principle: To eliminate as much all-to-all data communication as possiblecommunication as possible

Page 13: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1313

Mixed Coherence ProtocolMixed Coherence Protocol

• An Example:An Example:

Rel(L1)Rel(L1)

New HomeNew Home

Acq(L1)Acq(L1)x1=1x1=1

x1=1 x1=1 y1=5y1=5

Home of X Home of X and Yand Y

P0P0 P1P1 P2P2 P3P3

BarrierBarrier

y1=5y1=5

Rel(L1)Rel(L1)

Acq(L1)Acq(L1)x1++x1++

X Y

y1++y1++

Rel(L2)Rel(L2)

Acq(L2)Acq(L2)x2=3x2=3

Rel(L2)Rel(L2)

Acq(L2)Acq(L2)x2 = 3x2 = 3

Inv X, Inv X, YY

Inv X, Inv X, YY

Inv X, Inv X, YY

x2++x2++

x1=?x1=?

x2 = 4x2 = 4

x1 = 2, x2 = 4x1 = 2, x2 = 4

X Y

Updates Movement Updates Movement Home Token Home Token MovementMovement

XY

When the processes arrive at the barrier, the process that holds the token of the object will become the new home of that object, and other processes will send the updates to the home.

Page 14: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1414

Making LOTS More Making LOTS More EfficientEfficient

• Eliminating Diff Accumulation ProblemEliminating Diff Accumulation Problem– Lock and timestamp info in DSM control areaLock and timestamp info in DSM control area– Calculate diff on request, no redundancyCalculate diff on request, no redundancy

X1 X2 X7X3 X5 X6 X8X4

ValueValue Last Updated Last Updated

TimeTime

X1 X2 X6 X7

X1 X3 X5

X2 X8

X7X3 X5

X8

X5 X7

X3 X8

T=1 T=1 (len=6(len=6) T=2 ) T=2 (len=4(len=4) T=3 ) T=3 (len=4(len=4) T=4 ) T=4 (len=3(len=3))

Traditional Traditional MethodMethod

LOTS MethodLOTS Method

All updates above need to be All updates above need to be sent (17 units data + 8 units of sent (17 units data + 8 units of control)control)

Only send 7 units data + 8 Only send 7 units data + 8 units of control dataunits of control data

LengthLength

X4X3X2X1

2

X8X7X6X5

3 4 0 4 1 4 3

1 1 X6 2 1 X1

3 2 X2 X8 4 3 X3 X7X5

TimeTime

Page 15: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1515

Other Components in Other Components in LOTSLOTS

• C++ runtime library in LinuxC++ runtime library in Linux• Minimal set of functions as interfaceMinimal set of functions as interface

– Retains as much C++ syntax as possible to Retains as much C++ syntax as possible to improve programmabilityimprove programmability

• Synchronization: Locks and BarriersSynchronization: Locks and Barriers– Barriers: With/Without memory effectBarriers: With/Without memory effect

• Communication: Sockets with UDP/IPCommunication: Sockets with UDP/IP• SIGIO handler for incoming messagesSIGIO handler for incoming messages

Page 16: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1616

Performance TestingPerformance Testing

• Two Kinds of TestingTwo Kinds of Testing1 Without invoking large object space supportWithout invoking large object space support

• Compare performance with other DSM (JIAJIA V1.0, Compare performance with other DSM (JIAJIA V1.0, as both have similar communication protocol)as both have similar communication protocol)

• Report no. of messages and bytes sentReport no. of messages and bytes sent• Calculate large object space support overheadCalculate large object space support overhead• 16 Pentium IV 2GHz machines with 100Mbps Fast 16 Pentium IV 2GHz machines with 100Mbps Fast

Ethernet connection, 128MB mem, Linux FedoraEthernet connection, 128MB mem, Linux Fedora2 With large object space supportWith large object space support

• Use an application with large memory demandUse an application with large memory demand• Run on different platforms for analysisRun on different platforms for analysis• Expect disk read/write overhead dominatesExpect disk read/write overhead dominates

Page 17: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1717

Test 1: Timing Test 1: Timing PerformancePerformance

LOTS: LOTS enabled LOTS: LOTS enabled LOTS-x : LOTS disabledLOTS-x : LOTS disabledx-axis : problem size, x-axis : problem size, y-axis : execution time y-axis : execution time in secondsin seconds

LOTS<JiaJiaLOTS<JiaJia

LOTS<JiaJiaLOTS<JiaJia

LOTS>JiaJiaLOTS>JiaJia

LOTS<JiaJiaLOTS<JiaJia

Page 18: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1818

Performance Results Performance Results SummarySummary

• LOTS beat JIAJIA V1.0 in most LOTS beat JIAJIA V1.0 in most applicationsapplications– Mixed protocol + “Diff accumulation Mixed protocol + “Diff accumulation

elimination” reduce data trafficelimination” reduce data traffic

• Large object space support and Large object space support and access checking incur a considerable access checking incur a considerable overheadoverhead– about about 5-15%5-15% of total execution time of total execution time

(application dependent)(application dependent)

Page 19: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

1919

Test 2: Large Object SpaceTest 2: Large Object Space

• Using 4-node PC and server clustersUsing 4-node PC and server clusters

• Test program: simple matrix operationsTest program: simple matrix operations• With 120GB (SCSI) hard disk in each machine, able to With 120GB (SCSI) hard disk in each machine, able to

claim claim 117.77GB117.77GB Shared Object Space Shared Object Space• Disk read and write time is closely related to the OS Disk read and write time is closely related to the OS

version.version.

CPU CPU (MHz)(MHz)

OSOS RAM RAM (MB)(MB)

# # Shared Shared Objs (X)Objs (X)

Per Obj Per Obj Size (MB)Size (MB)

Total Shared Total Shared Obj Size (GB)Obj Size (GB)

Exec Time Exec Time (sec)(sec)

P4-2GHzP4-2GHz Fedora 1Fedora 1 512512 3333 128128 4.1254.125 142142132132 128128 16.5016.50 373373

Xeon P3 Xeon P3 500 MHz 500 MHz x4 (SMP)x4 (SMP)

Fedora 1Fedora 1 10241024 132132 128128 16.5016.50 507507132132 511511 65.8765.87 21122112201201 511511 100.30100.30 32273227236236 511511 117.77117.77 38393839

Page 20: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

2020

ConclusionsConclusions

• LOTS succeed in:LOTS succeed in:– Providing a large shared object space Providing a large shared object space

larger than the local process space larger than the local process space during runtimeduring runtime

– Performing reasonably well by reducing Performing reasonably well by reducing data traffic through data traffic through Scope ConsistencyScope Consistency, , mixed coherence protocolmixed coherence protocol and and “diff “diff accumulation elimination”accumulation elimination” technique technique

– Similar programming interface with C++Similar programming interface with C++

Page 21: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

2121

Future WorkFuture Work

• A Number of Optimizations:A Number of Optimizations:– Further increase shared object space Further increase shared object space

“ “the minimum hard disk space x number of the minimum hard disk space x number of processes / 2”.processes / 2”.

• Recent progress: 64GB (4GB x 16) of shared Recent progress: 64GB (4GB x 16) of shared objects can be allocated in 16 machines, each objects can be allocated in 16 machines, each having a 9GB hard disk.having a 9GB hard disk.

– Reduce disk overheadReduce disk overhead– Reduce over-loading overhead (access check)Reduce over-loading overhead (access check)– Load-aware migrating-home protocol: Load-aware migrating-home protocol:

coherence protocol adapting to network coherence protocol adapting to network traffic and processor loading (e.g., avoid too traffic and processor loading (e.g., avoid too many “homes” in a single machine) many “homes” in a single machine)

Page 22: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

Questions ?

Page 23: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

2323

Test 1: No. of Messages Test 1: No. of Messages SentSent

0

10

20

30

40

50

60

70

80

90

p=2 p=4 p=8 p=16

FL(n=2048)

LU(n=1024)

ME(n=8192)

RX(n=8192)

RB(n=2048)

%%No. of procs No. of procs (p)(p)

The percentage is obtained by dividing the number of The percentage is obtained by dividing the number of messages sent in LOTS over that in JIAJIA for the same messages sent in LOTS over that in JIAJIA for the same

application.application.Due to mixed Due to mixed protocol, LOTS send protocol, LOTS send fewer messages fewer messages through the network through the network than JIAJIAthan JIAJIA

Page 24: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

2424

Test 1: No. of Bytes SentTest 1: No. of Bytes Sent

0

10

20

30

40

50

60

70

80

90

100

p=2 p=4 p=8 p=16

FL(n=2048)

LU(n=1024)

ME(n=8192)

RX(n=8192)

RB(n=2048)

%%No. of procs No. of procs (p)(p)

The percentage is obtained by dividing the number of bytes The percentage is obtained by dividing the number of bytes sent in LOTS over that in JIAJIA for the same application.sent in LOTS over that in JIAJIA for the same application.

Page 25: LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University

2525

Test 2: Large Object SpaceTest 2: Large Object Space

• Allocate shared objects with total size > Allocate shared objects with total size > 4GB, and another process accesses 4GB, and another process accesses each of them once (array addition with each of them once (array addition with p=4)p=4)int main(int argc, char **argv){ int i, j, pp, local[4]; // 2D int array Pointer <Pointer <int> > a; lots_init(); // init LOTS // shared memory allocation a.alloc(X);

for (i=0; i<X; i++) a[i].alloc(size); nm_barrier(); // barrier

for (j = 0; j < linec; j++) { pp = (dsmid + j) % linec; for (i = pp; i < X; i += 4) { acq(i); a[i][0] += rand(); rel(i); } } // array addition

nm_barrier(); return 0;}