CoreDet: A Compiler and Runtime System for
Deterministic Multithreaded Execution
Tom BerganOwen Anderson, Joe Devietti, Luis Ceze, Dan
GrossmanTo appear at ASPLOS 2010
2
CoreDetCoreDetGoal: Execution-level determinism ...•of arbitrary, unmodified C/C++ pthreads programs•without special hardware•without sacrificing scalability
Implementation:•a compiler pass in LLVM, plus•a custom runtime library
3
CoreDetCoreDetGoal: Execution-level determinism ...•of arbitrary, unmodified C/C++ pthreads programs•without special hardware•without sacrificing scalability
Implementation:•a compiler pass in LLVM, plus•a custom runtime library
4
CoreDetCoreDetAn implementation of DMP in software
• DMP-OwnershipA new protocol
• DMP-Buffering
5
CoreDetCoreDetAn implementation of DMP in software
• DMP-Ownership• straightforward to implement• poorer scalability, fewer overheads
• DMP-Buffering• better scalability, more overheads• no speculation• key insight: relaxed memory
consistency(specifically, Weak Ordering)
A new protocol
6
CoreDetCoreDetAn implementation of DMP in software
• DMP-Ownership• straightforward to implement• poorer scalability, fewer overheads
• DMP-Buffering• better scalability, more overheads• no speculation (easier to implement
than TM)• key insight: relaxed memory
consistency(specifically, Weak Ordering)
A new protocol
7
CoreDet: CoreDet: ImplementationImplementation
A compiler•instruments the code with calls to the runtime•static optimizations to remove instrumentation
A runtime library•scheduling threads•tracks interthread communication•deterministic wrappers for . . .
- pthreads- malloc
8
OutlineOutlineDMP-Ownership in Software
Why not DMP-TM in Software?
DMP-Buffering
A Few Implementation Details
Performance
9
DMP-SerialDMP-Serial
a := xa := x b := yb := y
x := a * 2x := a * 2 y := a + by := a + b
Thread 1 c := xc := x d := yd := y
x := a * 3x := a * 3 y := a - by := a - b
Thread 2
10
DMP-SerialDMP-Serial
a := xa := x b := yb := y
x := a * 2x := a * 2 y := a + by := a + b
Thread 1
c := xc := x d := yd := y
x := a * 3x := a * 3 y := a - by := a - b
Thread 2
quan
tum
11
DMP-SerialDMP-Serial
end of roundtime
T1
T3
T2
quantum
Execution is completely serialized
12
DMP-OwnershipDMP-OwnershipParallel Serial
Parallel mode: no communication (can write only to private data)Serial mode: arbitrary communication
end of roundtime
T1
T3
T2
MOTx owned-by
T1
y shared
z owned-by T2:: ::
13
DMP-Ownership in DMP-Ownership in SoftwareSoftware
T1
T3
T2
x owned-by T1
y shared
z owned-by T2:: ::
Quantum Formation•Count instructions: e.g. at the end of each basic block
Requirements:
14
DMP-Ownership in DMP-Ownership in SoftwareSoftware
T1
T3
T2
x owned-by T1
y shared
z owned-by T2:: ::
Quantum Formation•Count instructions: e.g. at the end of each basic block Ownership Tracking•Instrument every load/store•Manipulate an MOT (in memory) via a runtime library
Requirements:
15
DMP-Ownership in DMP-Ownership in SoftwareSoftware
We have implemented this in CoreDet.. . . but the scalability is not that great
splashmean
parsecmean
16
OutlineOutlineDMP-Ownership in Software
Why not DMP-TM in Software?
DMP-Buffering
A Few Implementation Details
Performance
17
DMP-SerialDMP-Serial
end of roundtime
T1
T3
T2
quantum
Execution is completely serialized
18
DMP-TMDMP-TM
end of roundtime
T1
T3
T2
Execution is parallel and transactional
commitimplicit
transactions
19
DMP-TM in SoftwareDMP-TM in Software
Quantum Formation•Count instructions
Requirements:
T1
T3
T2
20
DMP-TM in SoftwareDMP-TM in Software
Quantum Formation•Count instructions Transactional Execution•Instrument every load/store•Use an off-the-shelf STM with minor modifications
Requirements:
This is really hard!
T1
T3
T2
21
DMP-TM: Why not DMP-TM: Why not STM?STM?
STMs make the wrong assumptions:•Most code is not transactional •Transactions are short •Transactions are scoped
void foo() { ... begin_transaction() return}
An unscoped transaction:
25
➡ Speculation makes things hard➡ Good scalability because of versioned
memory(versions are modified in parallel)
DMP-TM: What Can We DMP-TM: What Can We Learn?Learn?
DMP-Buffering: Use versioned memory without speculation
26
OutlineOutlineDMP-Ownership in Software
Why not DMP-TM in Software?
DMP-Buffering
A Few Implementation Details
Performance
27
DMP-BufferingDMP-Buffering
Parallel mode: no communication (versioned memory via store buffering)
time
T1
T3
T2
ParallelGlobal Global MemoryMemory
(read only)(read only)Thread
x := .. y ..x := .. y ..
Store BufferStore Buffer
28
DMP-BufferingDMP-Buffering
Parallel mode: no communication (versioned memory via store buffering)Commit mode: deterministically (serially) publish local store buffers
Global Global MemoryMemory
......
Store BufferStore Buffer
Thread
time
T1
T3
T2
Parallel Commit
stores are reordered
x=..
29
DMP-BufferingDMP-Buffering
Parallel mode: no communication (versioned memory via store buffering)
Global Global MemoryMemory
(read/write)(read/write)
lock(x)lock(x)
Store BufferStore Buffer
Thread
end of roundtime
T1
T3
T2
Parallel SerialCommit
Commit mode: deterministically (serially) publish local store buffersSerial mode: used for synchronization (e.g. atomic ops)
30
DMP-BufferingDMP-BufferingT1
T3
T2
Parallel mode: buffer stores locally
Commit mode: publish local store buffers
Serial mode: used for synchronization (e.g. atomic ops)
•ends at synchronization (atomic ops and fences), and quantum boundaries
•happens serially for determinism•executes in parallel for performance
31
DMP-BufferingDMP-Buffering A = 1A = 1 if (B == 0)if (B == 0) ......
B = 1B = 1 if (A == 0)if (A == 0) ......
Thread 1 Thread 2
Dekker’s Algorithm(there is a data race)
32
DMP-BufferingDMP-Buffering
A = buffer[A]A = buffer[A] B = buffer[B]B = buffer[B]
Thread 1 Thread 2 buffer[A] = 1buffer[A] = 1 if (B == 0)if (B == 0) ......
buffer[B] = 1buffer[B] = 1 if (A == 0)if (A == 0) ......
33
DMP-BufferingDMP-Buffering
A = buffer[A]A = buffer[A]
B = buffer[B]B = buffer[B]
Thread 1 Thread 2parallel
comm
it
buffer[A] = 1buffer[A] = 1 if (B == 0)if (B == 0) ......
buffer[B] = 1buffer[B] = 1 if (A == 0)if (A == 0) ......
This is deterministic . . .
reor
dere
d
34
DMP-BufferingDMP-Buffering
A = buffer[A]A = buffer[A]
B = buffer[B]B = buffer[B]
Thread 1 Thread 2parallel
comm
it
buffer[A] = 1buffer[A] = 1 if (B == 0)if (B == 0) ......
buffer[B] = 1buffer[B] = 1 if (A == 0)if (A == 0) ......
. . . but not sequentially consistent(cycle in the happens-before graph)
35
DMP-BufferingDMP-BufferingThread 1 Thread 2
A = 1A = 1 tmptmp11 = B = B
if (tmpif (tmp11 == 0) == 0) ......
B = 1B = 1 tmptmp22 = A = A
if (tmpif (tmp22 == 0) == 0) ......
Dekker’s Algorithm (again)Let’s remove the data race . . .
36
DMP-BufferingDMP-Buffering
A = 1A = 1 tmptmp11 = B = B
Thread 1 Thread 2 lock(L)lock(L)
unlock(L)unlock(L)
lock(L)lock(L)
unlock(L)unlock(L)
if (tmpif (tmp11 == 0) == 0) ......
B = 1B = 1 tmptmp22 = A = A
if (tmpif (tmp22 == 0) == 0) ......
Dekker’s Algorithm(no data race)
37
DMP-BufferingDMP-Buffering A = 1A = 1 tmptmp11 = B = B
lock(L)lock(L)
unlock(L)unlock(L)
lock(L)lock(L)
unlock(L)unlock(L)
if (tmpif (tmp11 == 0) == 0) ......
B = 1B = 1 tmptmp22 = A = A
if (tmpif (tmp22 == 0) == 0) ......
parallel +commit
serial
serial
parallel +commit
serial
parallel +commit
Synchronizationhappens sequentially
DMP-BufferingDMP-Buffering A = 1A = 1 tmptmp11 = B = B
lock(L)lock(L)
unlock(L)unlock(L)
lock(L)lock(L)
unlock(L)unlock(L)
if (tmpif (tmp11 == 0) == 0) ......
B = 1B = 1 tmptmp22 = A = A
if (tmpif (tmp22 == 0) == 0) ......
parallel +commit
serial
serial
parallel +commit
serial
parallel +commit
Synchronizationhappens sequentially
Data race free programs are sequentially consistent
(required by C++ and Java memory models)
43
DMP-Buffering: Parallel DMP-Buffering: Parallel CommitCommit
For determinism, the commit order must be deterministic i.e. logically serialFor performance, the commit must happen in parallelBasic idea:•Publish store buffers in parallel•Preserve the commit order on collisions
44
0xA0xA 00
0xB0xB 00
0xC0xC 00
addr value
0xD0xD 00
0xE0xE 00
Global Memory
DMP-Buffering: Parallel DMP-Buffering: Parallel CommitCommit
0xA0xA 11
0xB0xB 11
0xC0xC 11
0xC0xC 22
0xD0xD 22
0xE0xE 22
addr value addr value
Thre
ad 1
Thread 2
Basic idea:•Commit store buffers in parallel•Preserve the commit order on collisions
Collision!Resolve for Thread 2
46
OutlineOutlineDMP-Ownership in Software
Why not DMP-TM in Software?
DMP-Buffering
A Few Implementation Details
Performance
48
• Redundant accesses y = ... x ... z = ... x ...
Remove Instrumention Remove Instrumention From ...From ...
• Accesses to thread-local (non-escaping) objects
don’t need to instrument this
49
Remove Instrumention Remove Instrumention From ...From ...
• Accesses to thread-local (non-escaping) objects
don’t need to instrument this
DMP-Buffering: requires unification-based points-to analysis int local; int *p = (...) ? &local : &global; ...
must access through the store buffer
• Redundant accesses y = ... x ... z = ... x ...
50
Remove Instrumention Remove Instrumention From ...From ...
• Accesses to thread-local (non-escaping) objects
don’t need to instrument this• Redundant accesses
y = ... x ... z = ... x ...
DMP-Buffering: requires unification-based points-to analysis int local; int *p = (...) ? &local : &global; ...
must access through the store buffer
51
Remove Instrumention Remove Instrumention From ...From ...
• Accesses to thread-local (non-escaping) objects
don’t need to instrument this• Redundant accesses
y = ... x ... z = ... x ...
DMP-Buffering: requires unification-based points-to analysis int local; int *p = (...) ? &local : &global; ...
must access through the store buffer
DMP-Buffering: this does not apply
52
External LibrariesExternal LibrariesWe do not instrument external shared libraries, suchas the system libc•External calls must be serialized
Preventing over-serialization:•We check indirect calls at runtime•We provide deterministic wrappers for common libc functions, e.g. memcpy and malloc•We do not serialize pure libc functions, e.g. sqrt
53
OutlineOutlineDMP-Ownership in Software
Why not DMP-TM in Software?
DMP-Buffering
A Few Implementation Details
Performance
54
ScalabilityScalability
splash mean parsec mean
55
OverheadsOverheads
splash mean parsec mean
Since we preserve scalability, we can
overcome overheads by adding cores
56
Wrap UpWrap UpCoreDet•execution-level determinism in software of arbitrary multithreaded programs
DMP-Buffering•uses a relaxed memory consistency model•scales comparably to nondeterministic execution
Thank you!The CoreDet source code will eventually be released at
http://sampa.cs.washington.edu