Download - CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear

CoreDet: A Compiler and Runtime System for

Deterministic Multithreaded Execution

Tom BerganOwen Anderson, Joe Devietti, Luis Ceze, Dan

GrossmanTo appear at ASPLOS 2010

2

CoreDetCoreDetGoal: Execution-level determinism ...•of arbitrary, unmodified C/C++ pthreads programs•without special hardware•without sacrificing scalability

Implementation:•a compiler pass in LLVM, plus•a custom runtime library

3

CoreDetCoreDetGoal: Execution-level determinism ...•of arbitrary, unmodified C/C++ pthreads programs•without special hardware•without sacrificing scalability

Implementation:•a compiler pass in LLVM, plus•a custom runtime library

4

CoreDetCoreDetAn implementation of DMP in software

• DMP-OwnershipA new protocol

• DMP-Buffering

5


• DMP-Ownership• straightforward to implement• poorer scalability, fewer overheads

• DMP-Buffering• better scalability, more overheads• no speculation• key insight: relaxed memory

consistency(specifically, Weak Ordering)

A new protocol

6


• DMP-Ownership• straightforward to implement• poorer scalability, fewer overheads

• DMP-Buffering• better scalability, more overheads• no speculation (easier to implement

than TM)• key insight: relaxed memory

consistency(specifically, Weak Ordering)

A new protocol

7

CoreDet: CoreDet: ImplementationImplementation

A compiler•instruments the code with calls to the runtime•static optimizations to remove instrumentation

A runtime library•scheduling threads•tracks interthread communication•deterministic wrappers for . . .

- pthreads- malloc

8

OutlineOutlineDMP-Ownership in Software

Why not DMP-TM in Software?

DMP-Buffering

A Few Implementation Details

Performance

9

DMP-SerialDMP-Serial

a := xa := x b := yb := y

x := a * 2x := a * 2 y := a + by := a + b

Thread 1 c := xc := x d := yd := y

x := a * 3x := a * 3 y := a - by := a - b

Thread 2

10


a := xa := x b := yb := y

x := a * 2x := a * 2 y := a + by := a + b

Thread 1

c := xc := x d := yd := y

x := a * 3x := a * 3 y := a - by := a - b

Thread 2

quan

tum

11


end of roundtime

T1

T3

T2

quantum

Execution is completely serialized

12

DMP-OwnershipDMP-OwnershipParallel Serial

Parallel mode: no communication (can write only to private data)Serial mode: arbitrary communication

end of roundtime

T1

T3

T2

MOTx owned-by

T1

y shared

z owned-by T2:: ::

13

DMP-Ownership in DMP-Ownership in SoftwareSoftware

T1

T3

T2

x owned-by T1

y shared

z owned-by T2:: ::

Quantum Formation•Count instructions: e.g. at the end of each basic block

Requirements:

14


T1

T3

T2

x owned-by T1

y shared

z owned-by T2:: ::

Quantum Formation•Count instructions: e.g. at the end of each basic block Ownership Tracking•Instrument every load/store•Manipulate an MOT (in memory) via a runtime library

Requirements:

15


We have implemented this in CoreDet.. . . but the scalability is not that great

splashmean

parsecmean

16



DMP-Buffering


Performance

17


end of roundtime

T1

T3

T2

quantum

Execution is completely serialized

18

DMP-TMDMP-TM

end of roundtime

T1

T3

T2

Execution is parallel and transactional

commitimplicit

transactions

19

DMP-TM in SoftwareDMP-TM in Software

Quantum Formation•Count instructions

Requirements:

T1

T3

T2

20

DMP-TM in SoftwareDMP-TM in Software

Quantum Formation•Count instructions Transactional Execution•Instrument every load/store•Use an off-the-shelf STM with minor modifications

Requirements:

This is really hard!

T1

T3

T2

21

DMP-TM: Why not DMP-TM: Why not STM?STM?

STMs make the wrong assumptions:•Most code is not transactional •Transactions are short •Transactions are scoped

void foo() { ... begin_transaction() return}

An unscoped transaction:

25

➡ Speculation makes things hard➡ Good scalability because of versioned

memory(versions are modified in parallel)

DMP-TM: What Can We DMP-TM: What Can We Learn?Learn?

DMP-Buffering: Use versioned memory without speculation

26



DMP-Buffering


Performance

27

DMP-BufferingDMP-Buffering

Parallel mode: no communication (versioned memory via store buffering)

time

T1

T3

T2

ParallelGlobal Global MemoryMemory

(read only)(read only)Thread

x := .. y ..x := .. y ..

Store BufferStore Buffer

28


Parallel mode: no communication (versioned memory via store buffering)Commit mode: deterministically (serially) publish local store buffers

Global Global MemoryMemory

......


Thread

time

T1

T3

T2

Parallel Commit

stores are reordered

x=..

29


Parallel mode: no communication (versioned memory via store buffering)

Global Global MemoryMemory

(read/write)(read/write)

lock(x)lock(x)


Thread

end of roundtime

T1

T3

T2

Parallel SerialCommit

Commit mode: deterministically (serially) publish local store buffersSerial mode: used for synchronization (e.g. atomic ops)

30

DMP-BufferingDMP-BufferingT1

T3

T2

Parallel mode: buffer stores locally

Commit mode: publish local store buffers

Serial mode: used for synchronization (e.g. atomic ops)

•ends at synchronization (atomic ops and fences), and quantum boundaries

•happens serially for determinism•executes in parallel for performance

31

DMP-BufferingDMP-Buffering A = 1A = 1 if (B == 0)if (B == 0) ......

B = 1B = 1 if (A == 0)if (A == 0) ......

Thread 1 Thread 2

Dekker’s Algorithm(there is a data race)

32


A = buffer[A]A = buffer[A] B = buffer[B]B = buffer[B]

Thread 1 Thread 2 buffer[A] = 1buffer[A] = 1 if (B == 0)if (B == 0) ......

buffer[B] = 1buffer[B] = 1 if (A == 0)if (A == 0) ......

33


A = buffer[A]A = buffer[A]

B = buffer[B]B = buffer[B]

Thread 1 Thread 2parallel

comm

it

buffer[A] = 1buffer[A] = 1 if (B == 0)if (B == 0) ......


This is deterministic . . .

reor

dere

d

34


A = buffer[A]A = buffer[A]

B = buffer[B]B = buffer[B]

Thread 1 Thread 2parallel

comm

it

buffer[A] = 1buffer[A] = 1 if (B == 0)if (B == 0) ......


. . . but not sequentially consistent(cycle in the happens-before graph)

35

DMP-BufferingDMP-BufferingThread 1 Thread 2

A = 1A = 1 tmptmp11 = B = B

if (tmpif (tmp11 == 0) == 0) ......

B = 1B = 1 tmptmp22 = A = A

if (tmpif (tmp22 == 0) == 0) ......

Dekker’s Algorithm (again)Let’s remove the data race . . .

36


A = 1A = 1 tmptmp11 = B = B

Thread 1 Thread 2 lock(L)lock(L)

unlock(L)unlock(L)

lock(L)lock(L)

unlock(L)unlock(L)

if (tmpif (tmp11 == 0) == 0) ......

B = 1B = 1 tmptmp22 = A = A

if (tmpif (tmp22 == 0) == 0) ......

Dekker’s Algorithm(no data race)

37

DMP-BufferingDMP-Buffering A = 1A = 1 tmptmp11 = B = B

lock(L)lock(L)

unlock(L)unlock(L)

lock(L)lock(L)

unlock(L)unlock(L)

if (tmpif (tmp11 == 0) == 0) ......

B = 1B = 1 tmptmp22 = A = A

if (tmpif (tmp22 == 0) == 0) ......

parallel +commit

serial

serial

parallel +commit

serial

parallel +commit

Synchronizationhappens sequentially

DMP-BufferingDMP-Buffering A = 1A = 1 tmptmp11 = B = B

lock(L)lock(L)

unlock(L)unlock(L)

lock(L)lock(L)

unlock(L)unlock(L)

if (tmpif (tmp11 == 0) == 0) ......

B = 1B = 1 tmptmp22 = A = A

if (tmpif (tmp22 == 0) == 0) ......

parallel +commit

serial

serial

parallel +commit

serial

parallel +commit

Synchronizationhappens sequentially

Data race free programs are sequentially consistent

(required by C++ and Java memory models)

43

DMP-Buffering: Parallel DMP-Buffering: Parallel CommitCommit

For determinism, the commit order must be deterministic i.e. logically serialFor performance, the commit must happen in parallelBasic idea:•Publish store buffers in parallel•Preserve the commit order on collisions

44

0xA0xA 00

0xB0xB 00

0xC0xC 00

addr value

0xD0xD 00

0xE0xE 00

Global Memory

DMP-Buffering: Parallel DMP-Buffering: Parallel CommitCommit

0xA0xA 11

0xB0xB 11

0xC0xC 11

0xC0xC 22

0xD0xD 22

0xE0xE 22

addr value addr value

Thre

ad 1

Thread 2

Basic idea:•Commit store buffers in parallel•Preserve the commit order on collisions

Collision!Resolve for Thread 2

46



DMP-Buffering


Performance

48

• Redundant accesses y = ... x ... z = ... x ...

Remove Instrumention Remove Instrumention From ...From ...

• Accesses to thread-local (non-escaping) objects

don’t need to instrument this

49



don’t need to instrument this

DMP-Buffering: requires unification-based points-to analysis int local; int *p = (...) ? &local : &global; ...

must access through the store buffer

• Redundant accesses y = ... x ... z = ... x ...

50



don’t need to instrument this• Redundant accesses

y = ... x ... z = ... x ...



51



don’t need to instrument this• Redundant accesses

y = ... x ... z = ... x ...



DMP-Buffering: this does not apply

52

External LibrariesExternal LibrariesWe do not instrument external shared libraries, suchas the system libc•External calls must be serialized

Preventing over-serialization:•We check indirect calls at runtime•We provide deterministic wrappers for common libc functions, e.g. memcpy and malloc•We do not serialize pure libc functions, e.g. sqrt

53



DMP-Buffering


Performance

54

ScalabilityScalability

splash mean parsec mean

55

OverheadsOverheads

splash mean parsec mean

Since we preserve scalability, we can

overcome overheads by adding cores

56

Wrap UpWrap UpCoreDet•execution-level determinism in software of arbitrary multithreaded programs

DMP-Buffering•uses a relaxed memory consistency model•scales comparably to nondeterministic execution

Thank you!The CoreDet source code will eventually be released at

http://sampa.cs.washington.edu

Download - CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear

Top Related