- 1 -
Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy
University of Michigan, Ann Arbor
Chimera: Hybrid Program Analysis for Determinism
* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html
- 2 -
Deterministic Replay
Goal: record and reproduce multithreaded execution
• Debugging concurrency bugs
• Offline heavyweight dynamic analysis
• Forensics and intrusion detection
• … and many more uses
Problem
• Multithreaded record-and-replay is too slow (>2x) or requires custom hardware
- 3 -
Multithreaded Record-and-Replay is Slow
• Checkpoint memory and register state
• Log non-deterministic program input: interrupts, I/O values, DMA, etc.
• Log shared memory dependencies
[Figure: Thread 1, Thread 2, and Thread 3 with write and read accesses to shared memory]
- 4 -
Replay for Data-Race-Free Programs is Cheap
Data-race-free programs
• Shared memory accesses are well ordered by synchronization operations
• Recording the happens-before order of synchronization operations is sufficient
Problem: Programs with data races
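[Figure: timeline of threads T1, T2, and T3 writing X, Y, and Z (X=0, Y=0, X=1, Y=1, Y=2, Z=1, X=2, Z=2), with Unlock(l)/Lock(l) and Signal(c)/Wait(c) pairs between them. Legend: order of mem. ops.; order of sync. ops.]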
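The record-and-replay idea on this slide can be sketched as a log of synchronization operations. This is a minimal, single-threaded illustration and not Chimera's runtime: the names `sync_log`, `record_sync`, `replay_may_proceed`, and `replay_advance` are hypothetical. It also records one total order across all operations, which is a simplification that is stricter than the happens-before order the slide mentions.

```c
#include <stddef.h>

#define LOG_CAPACITY 1024

/* One entry per synchronization operation, in the order it happened. */
struct sync_event {
    int thread_id;  /* thread that performed the operation */
    int lock_id;    /* synchronization object it operated on */
};

struct sync_log {
    struct sync_event events[LOG_CAPACITY];
    size_t count;   /* number of events recorded so far */
    size_t cursor;  /* next event to reproduce during replay */
};

/* Recording: append the operation to the log.
   Returns 0 on success, -1 if the log is full. */
int record_sync(struct sync_log *sl, int thread_id, int lock_id)
{
    if (sl->count == LOG_CAPACITY)
        return -1;
    sl->events[sl->count].thread_id = thread_id;
    sl->events[sl->count].lock_id = lock_id;
    sl->count++;
    return 0;
}

/* Replay: a thread may perform its operation only when that
   operation is the next one in the recorded order. */
int replay_may_proceed(const struct sync_log *sl, int thread_id, int lock_id)
{
    if (sl->cursor >= sl->count)
        return 0;
    return sl->events[sl->cursor].thread_id == thread_id &&
           sl->events[sl->cursor].lock_id == lock_id;
}

/* Replay: consume the current event once the operation was performed. */
void replay_advance(struct sync_log *sl)
{
    if (sl->cursor < sl->count)
        sl->cursor++;
}
```

In a multithreaded runtime the append would have to happen while the synchronization object is held, and a replaying thread would wait until `replay_may_proceed` returns nonzero. For a data-race-free program this order is all that needs to be logged besides program input.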
- 5 -
Our Contribution: A Hybrid Analysis
Potentially racy program P → Chimera → Data-race-free program P’
Sound static data race analysis
• Add synchronizations for potential data races
• Problem: Too many false positives
Profiling non-concurrent code regions
Symbolic bounds analysis
- 6 -
Roadmap
• Motivation
• Chimera Analysis
1) Static data race analysis
2) Profiling non-concurrent code regions
3) Symbolic bounds analysis
• Weak-lock Design
• Evaluation
• Conclusion
- 8 -
Static Data Race Analysis
• Find potential data-races using a sound static data race detector, RELAY [Voung et al., FSE’07]
• Protect all potential data-races using weak-locks
  − A new time-out lock which may be preempted (discussed later)
• Record and replay the happens-before order of weak-locks
- 9 -
Protect Potential Races using Weak-locks
void foo() {
    X = 0;
    for(i = ... ){
        Y[ tid ][ i ] = 0;
    }
}
void bar() {
    X = 1;
    for(i = … ){
        Y[ tid ][ i ] = 1;
        Z = 1;
    }
}
• Potential racy-pair: X = 0 in foo() and X = 1 in bar()
• Potential racy-pair: Y[ tid ][ i ] = 0 in foo() and Y[ tid ][ i ] = 1 in bar()
• No race report for Z = 1: static analysis helps avoid instrumentation for access to Z
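As a rough illustration of instruction-level protection, the sketch below wraps each reported access from the slide's example in its own weak-lock and leaves `Z` untouched. Everything beyond the slide's code is an assumption: `weak_lock` and `weak_unlock` are hypothetical stand-ins that only count acquisitions, `tid` is passed as a parameter, and `NTHREADS` and `N` replace the bounds the slide elides.

```c
#define NTHREADS 2   /* hypothetical thread count */
#define N 4          /* hypothetical loop bound; the slide elides it */

enum { LOCK_X, LOCK_Y };   /* one weak-lock per potential racy pair */

int X, Z;
int Y[NTHREADS][N];
int weak_lock_ops;         /* counts acquisitions to expose the cost */

/* Stand-ins for a weak-lock runtime: they only count acquisitions.
   A real runtime would block, log the order, and support time-outs. */
static void weak_lock(int lock_id)   { (void)lock_id; weak_lock_ops++; }
static void weak_unlock(int lock_id) { (void)lock_id; }

void foo(int tid)
{
    weak_lock(LOCK_X);
    X = 0;
    weak_unlock(LOCK_X);
    for (int i = 0; i < N; i++) {
        weak_lock(LOCK_Y);
        Y[tid][i] = 0;
        weak_unlock(LOCK_Y);
    }
}

void bar(int tid)
{
    weak_lock(LOCK_X);
    X = 1;
    weak_unlock(LOCK_X);
    for (int i = 0; i < N; i++) {
        weak_lock(LOCK_Y);
        Y[tid][i] = 1;
        weak_unlock(LOCK_Y);
        Z = 1;   /* no race report for Z, so no weak-lock here */
    }
}
```

Calling `foo(0)` and then `bar(1)` performs `2 * (1 + N)` acquisitions: one per instrumented access. That per-access cost is what the coarser-grained weak-locks later in the talk try to reduce.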
- 10 -
Sources of False Positives in RELAY
• Sound data-race detector reports too many false data-races
  − 53x overhead
• Source 1: Non-mutex synchronizations are ignored
  − Lockset based analysis ignores fork-join, barrier, signal-wait, etc.
  − May report a false data-race between memory instructions that can never execute concurrently
• Source 2: Conservative pointer analysis
  − Overestimates the variables accessed by a memory instruction
  − May report a false data-race between memory instructions that can never access the same location
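The first source of false positives follows directly from the lockset rule. The sketch below is a generic illustration of that rule, not RELAY's implementation; `struct access` and `lockset_reports_race` are hypothetical names, and locksets are modeled as bitmasks.

```c
#include <stdbool.h>

/* A memory access as a lockset analysis sees it. */
struct access {
    unsigned lockset;  /* bit i set: mutex i is held at this access */
    bool is_write;
};

/* Lockset rule: two accesses to the same location are a potential race
   if at least one is a write and no common mutex protects both.
   Barriers, fork-join, and signal-wait do not appear in the lockset,
   so the rule cannot use them to rule a pair out. */
bool lockset_reports_race(struct access a, struct access b)
{
    bool conflicting = a.is_write || b.is_write;
    bool common_lock = (a.lockset & b.lockset) != 0;
    return conflicting && !common_lock;
}
```

Two unlocked writes are reported even when a barrier always separates them at run time, because the barrier is invisible to this check.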
- 11 -
Roadmap
• Motivation
• Chimera Analysis
1) Static data race analysis
2) Profiling non-concurrent code regions
3) Symbolic bounds analysis
• Weak-lock Design
• Evaluation
• Conclusion
- 12 -
Profiling Non-concurrent Code Regions
Problem
• Lockset based analysis ignores non-mutex synchronization ops.
Solution
• Profile non-concurrent code regions (e.g., functions)
• Increase the granularity of weak-locks to protect a larger code region instead of each potential racy instruction
• Parallelism is preserved unless mis-profiled
[Figure: T1 runs foo() before the BARRIER and T2 runs bar() after the BARRIER, so the race reported between foo() and bar() is a false race]
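One way to picture the profiling step is as an overlap test on observed function invocations. The slides do not describe the profiler's mechanism, so the sketch below is only an assumption: `struct interval` and `observed_concurrent` are hypothetical, and the timestamps are taken to come from some global logical clock.

```c
#include <stdbool.h>
#include <stddef.h>

/* One profiled invocation of a function: [start, end) on a logical clock. */
struct interval {
    long start;
    long end;
    int thread_id;
};

/* True if any invocation in a overlapped in time with any invocation
   in b that ran on a different thread during the profiling run. */
bool observed_concurrent(const struct interval *a, size_t na,
                         const struct interval *b, size_t nb)
{
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (a[i].thread_id != b[j].thread_id &&
                a[i].start < b[j].end && b[j].start < a[i].end)
                return true;
        }
    }
    return false;
}
```

If `foo()` always finishes before the barrier and `bar()` always starts after it, the test returns false and the pair is a candidate for a function-level weak-lock. Profiling is not sound, which is why the weak-lock is still needed: a mis-profiled pair costs parallelism, not correctness.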
- 13 -
Function-Level Weak-Locks
Applies if the profiler says foo() and bar() are not likely to run concurrently.
[Figure: foo() runs before the BARRIER and bar() after it; the reported race is a false race]
void foo() {
    X = 0;
    for(i = … ){
        Y[ tid ][ i ] = 0;
    }
}
void bar() {
    X = 1;
    for(i = … ){
        Y[ tid ][ i ] = 1;
        Z = 1;
    }
}
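A function-level weak-lock replaces the per-access locks with one acquisition per call. The sketch below shows that shape under the same assumptions as before: `weak_lock` and `weak_unlock` are hypothetical counting stubs, `tid` is a parameter, and `NTHREADS` and `N` are made-up bounds.

```c
#define NTHREADS 2   /* hypothetical thread count */
#define N 4          /* hypothetical loop bound; the slide elides it */

enum { LOCK_FOO_BAR };   /* one weak-lock shared by foo() and bar() */

int X, Z;
int Y[NTHREADS][N];
int weak_lock_ops;       /* counts acquisitions to expose the cost */

/* Stand-ins for a weak-lock runtime: they only count acquisitions. */
static void weak_lock(int lock_id)   { (void)lock_id; weak_lock_ops++; }
static void weak_unlock(int lock_id) { (void)lock_id; }

void foo(int tid)
{
    weak_lock(LOCK_FOO_BAR);     /* one acquisition covers the whole body */
    X = 0;
    for (int i = 0; i < N; i++)
        Y[tid][i] = 0;
    weak_unlock(LOCK_FOO_BAR);
}

void bar(int tid)
{
    weak_lock(LOCK_FOO_BAR);
    X = 1;
    for (int i = 0; i < N; i++) {
        Y[tid][i] = 1;
        Z = 1;
    }
    weak_unlock(LOCK_FOO_BAR);
}
```

Calling `foo(0)` and then `bar(1)` now costs two acquisitions regardless of `N`. The trade-off is that the two functions serialize if they ever do run concurrently.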
- 14 -
Roadmap
• Motivation
• Chimera Analysis
1) Static data race analysis
2) Profiling non-concurrent code regions
3) Symbolic bounds analysis
• Design
• Evaluation
• Conclusion
- 15 -
Imprecision in Conservative Pointer Analysis
[Figure: threads T1 and T2 with foo(), bar(), and BARRIER, annotated "May run concurrently"]
- 16 -
Imprecision in Conservative Pointer Analysis
• RELAY uses Steensgaard’s and Andersen’s pointer analysis
  − Flow-Insensitive and Context-Insensitive (FICI) analysis
  − Naming heap objects is conservative
• Overestimates the variables accessed by a memory instruction
void foo() {
    …
    for(i = 0 to N){
        Y[ tid ][ i ] = 0;
        …
    }
}
void bar() {
    …
    for(i = 0 to N){
        Y[ tid ][ i ] = 1;
        …
    }
}
[Figure: array Y[][] with the parts accessed by Thread 1 and Thread 2]
• The two Y[ tid ][ i ] accesses are reported as a potential racy-pair: a false race
- 17 -
Symbolic Bounds Analysis
Our Solution
• Derive the symbolic lower and upper bounds that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI’00]
• Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression
• Parallelism is preserved if the bounds are precise enough
void foo() {
    …
    for(i = 0 to N){
        Y[ tid ][ i ] = 0;
    }
    …
}
Symbolic bounds analysis → Bounds: &Y[tid][0] to &Y[tid][N]
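Once the bounds are evaluated at run time, deciding whether two loops can conflict reduces to an interval-overlap test. The sketch below illustrates this for the slide's `Y[tid][i]` loop. The names `addr_range`, `loop_bounds`, and `ranges_overlap` are hypothetical, and the array shape assumes the loop runs from 0 to `N` inclusive.

```c
#include <stdbool.h>
#include <stdint.h>

#define NTHREADS 2   /* hypothetical thread count */
#define N 8          /* hypothetical loop bound */

/* N + 1 columns, assuming "i = 0 to N" includes N. */
int Y[NTHREADS][N + 1];

/* Inclusive range of element addresses touched by a code region. */
struct addr_range {
    uintptr_t lo;
    uintptr_t hi;
};

/* The slide's symbolic bounds, &Y[tid][0] to &Y[tid][N], evaluated
   for a concrete tid at loop entry. */
struct addr_range loop_bounds(int tid)
{
    struct addr_range r;
    r.lo = (uintptr_t)&Y[tid][0];
    r.hi = (uintptr_t)&Y[tid][N];
    return r;
}

/* Two inclusive ranges overlap unless one ends before the other starts. */
bool ranges_overlap(struct addr_range a, struct addr_range b)
{
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

Threads with different `tid` values get disjoint rows, so their ranges do not overlap and the loops can run in parallel even though a conservative pointer analysis treats all of `Y` as one object.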
- 18 -
Loop-level Weak-locks
Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]
Weak-lock address range, annotated four times on the slide: (&Y[tid][0], &Y[tid][N])
void foo() {
    X = 0;
    for(i = 0 to N){
        Y[ tid ][ i ] = 0;
    }
}
void bar() {
    X = 1;
    for(i = 0 to N){
        Y[ tid ][ i ] = 1;
        Z = 1;
    }
}
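A loop-level weak-lock can be pictured as a lock over an address range: acquisition is refused only when the requested range overlaps one held by another thread. The sketch below is a single-threaded model of that behavior, not the runtime from the talk. `range_lock`, `range_try_acquire`, and `range_release` are hypothetical names, and logging and time-outs are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_HELD 16

/* An address range currently held by some thread. */
struct held_range {
    uintptr_t lo, hi;   /* inclusive bounds */
    int owner;
    bool in_use;
};

struct range_lock {
    struct held_range held[MAX_HELD];
};

/* Grants [lo, hi] to tid unless it overlaps a range held by another
   thread (or the table is full). Returns true when granted. */
bool range_try_acquire(struct range_lock *rl, int tid,
                       uintptr_t lo, uintptr_t hi)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_HELD; i++) {
        struct held_range *h = &rl->held[i];
        if (!h->in_use) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (h->owner != tid && lo <= h->hi && h->lo <= hi)
            return false;   /* conflicts with another thread's range */
    }
    if (free_slot < 0)
        return false;       /* no room to track another range */
    rl->held[free_slot] = (struct held_range){ lo, hi, tid, true };
    return true;
}

/* Releases the exact range previously granted to tid, if present. */
void range_release(struct range_lock *rl, int tid,
                   uintptr_t lo, uintptr_t hi)
{
    for (int i = 0; i < MAX_HELD; i++) {
        struct held_range *h = &rl->held[i];
        if (h->in_use && h->owner == tid && h->lo == lo && h->hi == hi) {
            h->in_use = false;
            return;
        }
    }
}
```

With precise bounds, two threads working on different rows of `Y` request disjoint ranges and both acquisitions succeed, which is how parallelism is preserved. A real implementation would need its own internal synchronization around the table.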
- 19 -
Imprecise Symbolic Bounds
Sources
• Bounds depend on a value computed inside the code region
• Bounds depend on arithmetic operations not supported in the analysis
  − e.g., modulo operations, logical AND/OR, etc.
Choosing the optimal granularity
• If bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism
void qux() {
    …
    for(i = 0 to N){
        prev = Z[ prev ];
    }
    …
}
Symbolic bounds analysis → Bounds: -INF to +INF
- 20 -
Roadmap
• Motivation
• Chimera Analysis
• Weak-lock Design
• Evaluation
• Conclusion
- 21 -
Deadlock due to Weak-locks
No deadlocks between weak-locks
• function-level > loop-level > instruction-level
Deadlock between weak-locks and original sync. ops. is possible
[Figure: T1 executes … wait(cv) … and T2 executes … signal(cv) …; annotation: Time-out !!]
- 22 -
Weak-lock Time-out
A weak-lock might time out
• Invoke a special system call to handle it
Weak-lock guarantee
• Only one thread holds a given weak-lock at any given time
• Mutual exclusion may be compromised, but this is sufficient for replay
[Figure: T1 (… wait(cv) …) and T2 (… signal(cv) …); after the time-out the "Current owner" label moves from one thread to the other, and the order of weak-lock owners is logged]
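The time-out behavior can be modeled as a forced change of ownership that is itself logged. The sketch below is a single-threaded model under stated assumptions: the function names are hypothetical, the special system call is not modeled, and how the preempted thread later resumes is left out.

```c
#include <stddef.h>

#define NO_OWNER   (-1)
#define MAX_EVENTS 64

struct weak_lock {
    int owner;               /* thread currently holding it, or NO_OWNER */
    int order[MAX_EVENTS];   /* logged order of owners, for replay */
    size_t n_events;
};

void weak_lock_init(struct weak_lock *l)
{
    l->owner = NO_OWNER;
    l->n_events = 0;
}

/* Appends the new owner to the log; silently drops it when full. */
static void log_owner(struct weak_lock *l, int tid)
{
    if (l->n_events < MAX_EVENTS)
        l->order[l->n_events++] = tid;
}

/* Normal path: succeeds only when the lock is free. Returns 1 on success. */
int weak_lock_try_acquire(struct weak_lock *l, int tid)
{
    if (l->owner != NO_OWNER)
        return 0;
    l->owner = tid;
    log_owner(l, tid);
    return 1;
}

/* Time-out path: the waiter takes ownership from the current holder.
   There is still exactly one owner, and the hand-off is logged. */
void weak_lock_timeout_acquire(struct weak_lock *l, int tid)
{
    l->owner = tid;
    log_owner(l, tid);
}

/* A release by a thread that was preempted earlier has no effect. */
void weak_lock_release(struct weak_lock *l, int tid)
{
    if (l->owner == tid)
        l->owner = NO_OWNER;
}
```

In the slide's scenario, T1 holds the weak-lock while blocked in `wait(cv)`, T2 times out, takes ownership, and reaches `signal(cv)`. The two regions may then overlap in time, but the logged owner order `1, 2` is enough to reproduce the same hand-off during replay.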
- 23 -
Roadmap
• Motivation
• Chimera Analysis
• Weak-lock Design
• Evaluation
• Conclusion
- 24 -
Implementation
Source-to-source instrumentation
• Implemented in OCaml using CIL as a front end
Static analysis
• Data race detection: RELAY [Voung et al., FSE’07]
  − Includes all library source code for soundness (uClibc’s libc, libm, etc.)
• Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]
  − Intra-procedural analysis for racy loops only
Runtime system
• Modified Linux kernel to record/replay program input
• Modified pthread library to record/replay happens-before order of original synchronization operations and weak-locks
- 25 -
Evaluation Setup
Test Environment
• 2.66 GHz 8-core Xeon processor with 4 GB of RAM
• Different sets of inputs for profiling and performance evaluation
• Average of five trials with 4 worker threads
• 2, 4, 8 threads for scalability results
Benchmarks
• Desktop applications
  − aget, pfscan, and pbzip2
• Server programs
  − knot and apache
• SPLASH-2 suite
  − ocean, water-nsq, fft, and radix
- 26 -
Record and Replay Performance
[Figure: normalized performance overhead (axis 0 to 2.5) of record and replay for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average. Annotated values: 2.4% slowdown, 86% slowdown, and 39%.]
• Recording: 39% on average
• Replay: similar to recording; much lower for I/O intensive programs
- 27 -
Effectiveness of Coarse-grained Weak-locks
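[Figure: normalized recording overhead (log axis, 1 to 100) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average, under five weak-lock configurations: instr, instr + func, instr + loop, instr + loop + func, and instr + bb + loop + func. Annotated values: 135, 251, >100, and 53x.]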
- 31 -
Effectiveness of Coarse-grained Weak-locks
[Figure: normalized recording overhead (log axis, 1 to 100) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average, under five weak-lock configurations: instr, instr + func, instr + loop, instr + loop + func, and instr + bb + loop + func. Annotated values: 135, 251, >100, and 1.39x.]
• Coarse-grained weak-locks reduce the cost of instrumentation
• Exception: control-flow dependency (e.g., pfscan)
- 32 -
Breakdown of Recording Overhead
[Figure: normalized recording overhead (axis 1 to 2.5) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, and radix, broken down into func locks, loop locks, instr/bb locks, and sync op & system log]
• Weak-lock overhead = contention (waiting) cost + logging cost
- 33 -
Breakdown of Recording Overhead
[Figure: normalized recording overhead (axis 1 to 2.5) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, and radix, broken down into func wait, loop wait, instr/bb wait, func log, loop log, instr/bb log, and sync op & system log]
• Weak-lock overhead = contention (waiting) cost + logging cost
• High loop-lock contention
• High instr/bb-lock contention
- 34 -
Scalability
[Figure: normalized recording overhead (axis 0 to 3.5) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and the average, with series 2p, 4p, and 8p]
• Scientific applications scale worse due to imprecise symbolic bounds analysis
- 35 -
Conclusion
Goal: Software-only deterministic multiprocessor replay systems
Chimera Analysis
• Static data race analysis
  − Find and protect potential data races with weak-locks
  − Instruction/basic-block-level weak-locks
• Profiling non-concurrent code regions
  − Address the inadequacy of lockset-based algorithm
  − Function-level weak-locks
• Symbolic bounds analysis
  − Address the imprecision of conservative pointer analysis
  − Loop-level weak-locks
Low Recording Overhead
• 39% recording overhead for 4 worker threads
- 36 -
Thank you