- 1 -

Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy

University of Michigan, Ann Arbor

Chimera: Hybrid Program Analysis for Determinism

* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

- 2 -

Deterministic Replay

Goal: record and reproduce multithreaded execution
• Debugging concurrency bugs
• Offline heavyweight dynamic analysis
• Forensics and intrusion detection
• … and many more uses

Problem
• Multithreaded record-and-replay is too slow (>2x) or requires custom hardware

- 3 -

Multithreaded Record-and-Replay is Slow

[Figure: Thread 1, Thread 2, and Thread 3 performing writes and reads on shared memory]

• Log shared memory dependencies
• Checkpoint memory and register state
• Log non-deterministic program input: interrupts, I/O values, DMA, etc.

- 4 -

Replay for Data-Race-Free Programs is Cheap

Data-race-free programs
• Shared memory accesses are well ordered by synchronization operations
• Recording the happens-before order of synchronization operations is sufficient

Problem: Programs with data races

[Figure: threads T1, T2, and T3 access X, Y, and Z around Lock(l)/Unlock(l) and Signal(c)/Wait(c) operations; arrows distinguish the order of mem. ops. from the order of sync. ops.]
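
The claim that logging the order of synchronization operations is enough for data-race-free programs can be sketched as follows. This is a minimal illustration, not Chimera's implementation: the `sync_log` type and the `log_record`, `log_may_proceed`, and `log_advance` functions are hypothetical names, and the sketch assumes a single global order of events, which is stronger than the per-object happens-before order a real recorder would need.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_EVENTS = 1024 };  /* assumed fixed capacity, for illustration only */

/* One synchronization operation (e.g., Lock, Unlock, Signal, Wait). */
typedef struct {
    int thread_id;  /* thread that performed the operation */
    int sync_id;    /* lock or condition variable it operated on */
} sync_event;

typedef struct {
    sync_event events[MAX_EVENTS];
    size_t count;   /* number of events recorded */
    size_t cursor;  /* index of the next event to replay */
} sync_log;

/* Recording: append an event. Returns 0 on success, -1 if the log is full.
   A real recorder would call this while holding the sync object, so that the
   logged order matches the order in which the operations took effect. */
int log_record(sync_log *log, int thread_id, int sync_id)
{
    assert(log != NULL);
    if (log->count == MAX_EVENTS)
        return -1;
    log->events[log->count].thread_id = thread_id;
    log->events[log->count].sync_id = sync_id;
    log->count++;
    return 0;
}

/* Replay: returns 1 if (thread_id, sync_id) is the next logged event. */
int log_may_proceed(const sync_log *log, int thread_id, int sync_id)
{
    assert(log != NULL);
    if (log->cursor >= log->count)
        return 0;
    return log->events[log->cursor].thread_id == thread_id &&
           log->events[log->cursor].sync_id == sync_id;
}

/* Replay: consume the next event. Returns 0 on success, -1 if out of order. */
int log_advance(sync_log *log, int thread_id, int sync_id)
{
    if (!log_may_proceed(log, thread_id, sync_id))
        return -1;
    log->cursor++;
    return 0;
}
```

During replay, a thread that reaches a synchronization operation would wait until `log_may_proceed` returns 1 and then call `log_advance`. Memory accesses need no logging as long as the program is data-race-free.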

- 5 -

Our Contribution: A Hybrid Analysis

Potentially racy program P → Chimera → Data-race-free program P’

Sound static data race analysis
• Add synchronizations for potential data races
• Problem: Too many false positives

Profiling non-concurrent code regions

Symbolic bounds analysis

- 6 -

Roadmap

• Motivation

• Chimera Analysis

1) Static data race analysis

2) Profiling non-concurrent code regions

3) Symbolic bounds analysis

• Weak-lock Design

• Evaluation

• Conclusion

- 8 -

Static Data Race Analysis

• Find potential data-races using a sound static data race detector, RELAY [Voung et al., FSE’07]

• Protect all potential data-races using weak-locks
  − A new time-out lock which may be preempted (discussed later)

• Record and replay the happens-before order of weak-locks

- 9 -

Protect Potential Races using Weak-locks

void foo() {
    X = 0;                    /* potential racy-pair with X = 1 in bar() */
    for (i = … ) {
        Y[ tid ][ i ] = 0;    /* potential racy-pair with Y[ tid ][ i ] = 1 in bar() */
    }
}

void bar() {
    X = 1;
    for (i = … ) {
        Y[ tid ][ i ] = 1;
        Z = 1;                /* no race report */
    }
}

Static analysis helps avoid instrumentation for access to Z

- 10 -

Sources of False Positives in RELAY

• Sound data-race detector reports too many false data-races
  − 53x overhead

• Source 1: Non-mutex synchronizations are ignored
  − Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
  − May report a false data-race between memory instructions that can never execute concurrently

• Source 2: Conservative pointer analysis
  − Overestimates the variables accessed by a memory instruction
  − May report a false data-race between memory instructions that can never access the same location

- 12 -

Profiling Non-concurrent Code Regions

Problem
• Lockset-based analysis ignores non-mutex synchronization ops.

Solution
• Profile non-concurrent code regions (e.g., functions)
• Increase the granularity of weak-locks to protect a larger code region instead of each potential racy instruction
• Parallelism is preserved unless mis-profiled

[Figure: T1 runs foo() and T2 runs bar(), separated by a BARRIER; the race reported between foo() and bar() is a False Race]
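
The profiling step can be pictured as a small concurrency tracker. The sketch below is an assumption-laden illustration rather than Chimera's profiler: the `profile` type and the `profile_*` functions are hypothetical, each thread is assumed to be inside at most one profiled function at a time, and the caller is assumed to serialize calls into the profiler.

```c
#include <assert.h>
#include <string.h>

enum { MAX_FUNCS = 8, MAX_THREADS = 8 };  /* assumed small limits */

typedef struct {
    int active[MAX_THREADS];  /* function each thread is in, or -1 if none */
    unsigned char concurrent[MAX_FUNCS][MAX_FUNCS];  /* 1 if ever seen overlapping */
} profile;

void profile_init(profile *p)
{
    memset(p->concurrent, 0, sizeof p->concurrent);
    for (int t = 0; t < MAX_THREADS; t++)
        p->active[t] = -1;
}

/* Thread tid enters function fn: mark fn as concurrent with every function
   that is currently running on another thread. */
void profile_enter(profile *p, int tid, int fn)
{
    assert(tid >= 0 && tid < MAX_THREADS && fn >= 0 && fn < MAX_FUNCS);
    for (int t = 0; t < MAX_THREADS; t++) {
        int other = p->active[t];
        if (t != tid && other >= 0) {
            p->concurrent[fn][other] = 1;
            p->concurrent[other][fn] = 1;
        }
    }
    p->active[tid] = fn;
}

/* Thread tid leaves the function it was in. */
void profile_exit(profile *p, int tid)
{
    assert(tid >= 0 && tid < MAX_THREADS);
    p->active[tid] = -1;
}

/* A pair of functions is a candidate for function-level weak-locks if the
   profile runs never observed the two running at the same time. */
int profile_never_concurrent(const profile *p, int f, int g)
{
    return !p->concurrent[f][g];
}
```

A pair such as foo() and bar() that is always separated by a barrier would never be marked concurrent, so a coarse function-level weak-lock on it costs no parallelism. If the profile is wrong, the weak-lock still keeps replay correct but serializes the two functions.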

- 13 -

Function-Level Weak-Locks

If the profiler says foo() and bar() are not likely to run concurrently, use function-level weak-locks for them.

[Figure: foo() and bar() separated by BARRIERs; the race reported between them is a False Race]

void foo() {
    X = 0;
    for (i = … ) {
        Y[ tid ][ i ] = 0;
    }
}

void bar() {
    X = 1;
    for (i = … ) {
        Y[ tid ][ i ] = 1;
        Z = 1;
    }
}

- 15 -

Imprecision in Conservative Pointer Analysis

[Figure: T1 runs foo() and T2 runs bar(), with BARRIERs on both threads; here foo() and bar() may run concurrently]

- 16 -

Imprecision in Conservative Pointer Analysis

• RELAY uses Steensgaard’s and Andersen’s pointer analyses
  − Flow-insensitive and context-insensitive (FICI) analysis
  − Naming heap objects is conservative

• Overestimates the variables accessed by a memory instruction

void foo() {
    …
    for (i = 0 to N) {
        Y[ tid ][ i ] = 0;
        …
    }
}

void bar() {
    …
    for (i = 0 to N) {
        Y[ tid ][ i ] = 1;
        …
    }
}

[Figure: the array Y[][] with the parts accessed by Thread 1 and Thread 2; the potential racy-pair on Y is a False Race]

- 17 -

Symbolic Bounds Analysis

Our Solution
• Derive the symbolic lower and upper bounds that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI’00]

• Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression

• Parallelism is preserved if the bounds are precise enough

void foo() {
    …
    for (i = 0 to N) {
        Y[ tid ][ i ] = 0;
    }
    …
}

Symbolic bounds analysis → Bounds: &Y[tid][0] to &Y[tid][N]

- 18 -

Loop-level Weak-locks

Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]

void foo() {
    X = 0;
    /* loop-level weak-lock on (&Y[tid][0], &Y[tid][N]) held around the loop */
    for (i = 0 to N) {
        Y[ tid ][ i ] = 0;
    }
}

void bar() {
    X = 1;
    /* loop-level weak-lock on (&Y[tid][0], &Y[tid][N]) held around the loop */
    for (i = 0 to N) {
        Y[ tid ][ i ] = 1;
        Z = 1;
    }
}
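
Why loop-level weak-locks keyed by an address range preserve parallelism can be shown with a short overlap check. This is a sketch under stated assumptions, not the analysis itself: `addr_range`, `loop_bounds`, and `ranges_overlap` are hypothetical names, the array sizes are made up, and the loop is assumed to touch indices 0 through N inclusive, matching the bounds &Y[tid][0] to &Y[tid][N].

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_THREADS = 2, N = 15 };  /* assumed sizes, for illustration only */

static int Y[NUM_THREADS][N + 1];

/* Closed address range [lo, hi] that a loop may access. */
typedef struct {
    uintptr_t lo;
    uintptr_t hi;
} addr_range;

/* Bounds of the Y loop in foo()/bar() for one thread:
   &Y[tid][0] to &Y[tid][N], evaluated with the thread's own tid. */
addr_range loop_bounds(int tid)
{
    assert(tid >= 0 && tid < NUM_THREADS);
    addr_range r;
    r.lo = (uintptr_t)&Y[tid][0];
    r.hi = (uintptr_t)&Y[tid][N];
    return r;
}

/* Two loop-level weak-lock requests conflict only if their ranges overlap. */
int ranges_overlap(addr_range a, addr_range b)
{
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

With distinct `tid` values the two ranges are disjoint, so foo() in one thread and bar() in another need not wait for each other. Only requests whose ranges overlap would have to be serialized and logged.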

- 19 -

Imprecise Symbolic Bounds

Sources
• Bounds depend on a value computed inside the code region
• Bounds depend on arithmetic operations not supported in the analysis
  − e.g., modulo operations, logical AND/OR, etc.

Choosing the optimal granularity
• If bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism

void qux() {
    …
    for (i = 0 to N) {
        prev = Z[ prev ];
    }
    …
}

Symbolic bounds analysis → Bounds: -INF to +INF

- 21 -

Deadlock due to Weak-locks

No deadlocks between weak-locks
• Weak-locks are ordered: function-level > loop-level > instruction-level

Deadlock between weak-locks and original sync. ops. is possible

[Figure: T1 calls wait(cv) and T2 calls signal(cv); the deadlock is resolved by a weak-lock time-out]

- 22 -

Weak-lock Time-out

A weak-lock might time-out
• Invoke a special system call to handle it

Weak-lock guarantee
• Only one thread holds a given weak-lock at any given time
• Mutual exclusion may be compromised, but this is sufficient for replay

[Figure: T1 at wait(cv) and T2 at signal(cv); the time-out changes the current owner of the weak-lock, and the order of weak-locks is logged]
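
The ownership rule behind a weak-lock time-out can be sketched with C11 atomics. The slides do not describe the special system call, so this models only the ownership handoff: `weak_lock`, `weak_lock_acquire`, `weak_lock_preempt`, and `weak_lock_release` are hypothetical names, a bounded spin count stands in for a real time-out, and logging of the handoff is left out.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_int owner;  /* 0 when free, otherwise the id of the current owner */
} weak_lock;

void weak_lock_init(weak_lock *l)
{
    atomic_init(&l->owner, 0);
}

/* Try to acquire for at most max_spins attempts (a stand-in for a time-out).
   Returns 1 if the lock was acquired, 0 if the attempt timed out. */
int weak_lock_acquire(weak_lock *l, int tid, long max_spins)
{
    assert(tid != 0);  /* 0 is reserved for "free" */
    for (long i = 0; i < max_spins; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->owner, &expected, tid))
            return 1;
    }
    return 0;
}

/* Time-out path: take ownership away from the current owner.
   Returns the id of the previous owner (0 if the lock was free).
   A real system would also log this handoff so replay can repeat it. */
int weak_lock_preempt(weak_lock *l, int tid)
{
    assert(tid != 0);
    return atomic_exchange(&l->owner, tid);
}

/* Release only if tid is still the owner.
   Returns 1 on release, 0 if the lock was preempted in the meantime. */
int weak_lock_release(weak_lock *l, int tid)
{
    int expected = tid;
    return atomic_compare_exchange_strong(&l->owner, &expected, 0);
}
```

At any instant the `owner` field names at most one thread, which mirrors the guarantee that only one thread holds a given weak-lock at a time. The preempted thread may still be inside its critical region, which is the sense in which mutual exclusion may be compromised.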

- 24 -

Implementation

Source-to-source instrumentation
• Implemented in OCaml using CIL as a front end

Static analysis
• Data race detection: RELAY [Voung et al., FSE’07]
  − Includes all library source code for soundness (uClibc’s libc, libm, etc.)
• Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]
  − Intra-procedural analysis for racy loops only

Runtime system
• Modified Linux kernel to record/replay program input
• Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks

- 25 -

Evaluation Setup

Test Environment
• 2.66 GHz 8-core Xeon processor with 4 GB of RAM
• Different sets of inputs for profiling and performance evaluation
• Average of five trials with 4 worker threads
• 2, 4, 8 threads for scalability results

Benchmarks
• Desktop applications
  − aget, pfscan, and pbzip2
• Server programs
  − knot and apache
• SPLASH-2 suite
  − ocean, water-nsq, fft, and radix

- 26 -

Record and Replay Performance

[Figure: normalized performance overhead of record and replay for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and their average; annotations: 2.4% slowdown, 86% slowdown, 39%]

• Recording: 39% on average
• Replay: similar to recording; much lower for I/O-intensive programs

- 27 -

Effectiveness of Coarse-grained Weak-locks


[Figure: normalized recording overhead per benchmark for five configurations: instr, instr + func, instr + loop, instr + loop + func, and instr + bb + loop + func; annotations: 135 and 251 (beyond the 100 axis mark), 53x, and 1.39x]

• Coarse-grained weak-locks reduce the cost of instrumentation
• Exception: control-flow dependency (e.g., pfscan)

- 32 -

Breakdown of Recording Overhead

[Figure: breakdown of normalized recording overhead for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, and radix; legend: func locks, loop locks, instr/bb locks, sync op & system log]

• Weak-lock overhead = contention (waiting) cost + logging cost

- 33 -

Breakdown of Recording Overhead

[Figure: breakdown of normalized recording overhead for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, and radix; legend: func wait, loop wait, instr/bb wait, func log, loop log, instr/bb log, sync op & system log]

• Weak-lock overhead = contention (waiting) cost + logging cost
• High loop-lock contention
• High instr/bb-lock contention

- 34 -

Scalability


0.5

1

1.5

2

2.5

3

3.52p 4p 8p

Norm

aliz

ed re

cord

ing

over

head

• Scientific applications scale worse due to imprecise symbolic bounds analysis

- 35 -

Conclusion

Goal: Software-only deterministic multiprocessor replay systems

Chimera Analysis
• Static data race analysis
  − Find and protect potential data races with weak-locks
  − Instruction/basic-block-level weak-locks
• Profiling non-concurrent code regions
  − Address the inadequacy of the lockset-based algorithm
  − Function-level weak-locks
• Symbolic bounds analysis
  − Address the imprecision of conservative pointer analysis
  − Loop-level weak-locks

Low Recording Overhead
• 39% recording overhead for 4 worker threads

- 36 -

Thank you
