Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy

University of Michigan, Ann Arbor

Chimera: Hybrid Program Analysis for Determinism

* Chimera image from http://superpunch.blogspot.com/2009/02/chimera-sketch.html

Deterministic Replay

Goal: record and reproduce multithreaded execution
• Debugging concurrency bugs
• Offline heavyweight dynamic analysis
• Forensics and intrusion detection
• … and many more uses

Problem
• Multithreaded record-and-replay is too slow (>2x slowdown) or requires custom hardware

Multithreaded Record-and-Replay is Slow

[Figure: timelines of Thread 1, Thread 2, and Thread 3 with Write and Read accesses to shared memory]

• Checkpoint memory and register state
• Log non-deterministic program input (interrupts, I/O values, DMA, etc.)
• Log shared memory dependencies

Replay for Data-Race-Free Programs is Cheap

Data-race-free programs
• Shared memory accesses are well ordered by synchronization operations
• Recording the happens-before order of synchronization operations is sufficient

Problem: Programs with data races

[Figure: threads T1, T2, and T3 write X, Y, and Z (X=0, Y=0, X=1, Y=1, Y=2, Z=1, X=2, Z=2) around Lock(l)/Unlock(l) and Signal(c)/Wait(c). Legend: order of memory operations, order of synchronization operations.]
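A minimal sketch of the idea on this slide, under stated assumptions: if a program is data-race-free, it is enough to log which thread performed each synchronization operation and to admit threads in the same order during replay. The names `sync_trace`, `record_acquire`, and `replay_may_acquire` are hypothetical and are not taken from Chimera. A real recorder would hook the pthread library rather than use an explicit array.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_CAPACITY 64

/* One logged synchronization event: which thread acquired which lock. */
struct sync_event {
    int thread_id;
    int lock_id;
};

/* Append-only trace: filled while recording, consumed while replaying. */
struct sync_trace {
    struct sync_event events[TRACE_CAPACITY];
    size_t length;  /* number of events recorded so far */
    size_t cursor;  /* index of the next event to replay */
};

/* Recording: note that thread_id acquired lock_id.
 * Returns 0 on success, -1 if the trace is full. */
int record_acquire(struct sync_trace *trace, int thread_id, int lock_id)
{
    assert(trace != NULL);
    if (trace->length >= TRACE_CAPACITY)
        return -1;
    trace->events[trace->length].thread_id = thread_id;
    trace->events[trace->length].lock_id = lock_id;
    trace->length++;
    return 0;
}

/* Replay: a thread may acquire only if it matches the next logged event.
 * Returns 1 and advances the cursor on a match, 0 if the caller must wait. */
int replay_may_acquire(struct sync_trace *trace, int thread_id, int lock_id)
{
    assert(trace != NULL);
    if (trace->cursor >= trace->length)
        return 0;
    const struct sync_event *next = &trace->events[trace->cursor];
    if (next->thread_id != thread_id || next->lock_id != lock_id)
        return 0;
    trace->cursor++;
    return 1;
}
```

For simplicity the sketch keeps one total order over all locks, which is stricter than the happens-before order the slide asks for. A per-lock order would already be sufficient.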

Our Contribution: A Hybrid Analysis

Chimera: potentially racy program P → data-race-free program P’

Sound static data race analysis
• Add synchronizations for potential data races
• Problem: Too many false positives

Chimera combines the static analysis with
• Profiling non-concurrent code regions
• Symbolic bounds analysis

Roadmap

• Motivation

• Chimera Analysis

1) Static data race analysis

2) Profiling non-concurrent code regions

3) Symbolic bounds analysis

• Weak-lock Design

• Evaluation

• Conclusion

Static Data Race Analysis

• Find potential data races using RELAY, a sound static data race detector [Voung et al., FSE’07]
• Protect all potential data races using weak-locks
  − A new time-out lock which may be preempted (discussed later)
• Record and replay the happens-before order of weak-locks

Protect Potential Races using Weak-locks

void foo() {
  X = 0;                // potential racy pair with X = 1 in bar()
  for (i = ...) {
    Y[tid][i] = 0;      // potential racy pair with Y[tid][i] = 1 in bar()
  }
}

void bar() {
  X = 1;
  for (i = ...) {
    Y[tid][i] = 1;
    Z = 1;              // no race report
  }
}

Static analysis helps avoid instrumentation for access to Z

Sources of False Positives in RELAY

• A sound data-race detector reports too many false data races
  − 53x recording overhead
• Source 1: Non-mutex synchronizations are ignored
  − Lockset-based analysis ignores fork-join, barrier, signal-wait, etc.
  − May report a false data race between memory instructions that can never execute concurrently
• Source 2: Conservative pointer analysis
  − Overestimates the variables accessed by a memory instruction
  − May report a false data race between memory instructions that can never access the same location

Profiling Non-concurrent Code Regions

Problem
• Lockset-based analysis ignores non-mutex synchronization operations

Solution
• Profile non-concurrent code regions (e.g., functions)
• Increase the granularity of weak-locks to protect a larger code region instead of each potential racy instruction
• Parallelism is preserved unless mis-profiled

[Figure: T1 calls foo() and T2 calls bar() on opposite sides of a BARRIER; the race reported between them is a false race]
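One way such a profile could be computed is sketched below, as an assumption rather than a description of Chimera's profiler: record an entry and exit timestamp for every invocation during profiling runs, then declare two functions non-concurrent if none of their invocations overlap in time. The struct `invocation` and the function `never_concurrent` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled invocation: logical timestamps at function entry and exit. */
struct invocation {
    unsigned long enter;
    unsigned long leave;
};

/* Two invocations overlap in time if each one starts before the other ends. */
static int overlaps(const struct invocation *a, const struct invocation *b)
{
    return a->enter < b->leave && b->enter < a->leave;
}

/* Returns 1 if no profiled invocation of f overlaps any invocation of g,
 * 0 otherwise. A result of 1 reflects only the profiled runs, so the
 * functions may still run concurrently on other inputs (mis-profiling). */
int never_concurrent(const struct invocation *f, size_t nf,
                     const struct invocation *g, size_t ng)
{
    assert(f != NULL && g != NULL);
    for (size_t i = 0; i < nf; i++)
        for (size_t j = 0; j < ng; j++)
            if (overlaps(&f[i], &g[j]))
                return 0;
    return 1;
}
```

Because the answer depends on the profiled inputs, it is a hint rather than a proof. This matches the slide's caveat that parallelism is preserved "unless mis-profiled".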

Function-Level Weak-Locks

If the profiler says foo() and bar() are not likely to run concurrently, use function-level weak-locks for them.

[Figure: foo() and bar() separated by a BARRIER; false race]

void foo() {
  X = 0;
  for (i = ...) {
    Y[tid][i] = 0;
  }
}

void bar() {
  X = 1;
  for (i = ...) {
    Y[tid][i] = 1;
    Z = 1;
  }
}

Imprecision in Conservative Pointer Analysis

[Figure: T1 calls foo() and T2 calls bar() between the same two BARRIERs, so they may run concurrently]

• RELAY uses Steensgaard’s and Andersen’s pointer analyses
  − Flow-insensitive and context-insensitive (FICI) analysis
  − Naming heap objects is conservative
• Overestimates the variables accessed by a memory instruction

void foo() { … for (i = 0; i <= N; i++) { Y[tid][i] = 0; … } }

void bar() { … for (i = 0; i <= N; i++) { Y[tid][i] = 1; … } }

[Figure: Thread 1 and Thread 2 write different parts of Y[][]; the two stores are reported as a potential racy pair, which is a false race]

Symbolic Bounds Analysis

Our Solution
• Derive the symbolic lower and upper bounds that a racy code region (e.g., a loop) may access [Rugina and Rinard, PLDI’00]
• Increase the granularity of weak-locks to protect a larger code region for a set of addresses specified by a symbolic expression
• Parallelism is preserved if the bounds are precise enough

void foo() { … for (i = 0; i <= N; i++) { Y[tid][i] = 0; } … }

Symbolic bounds analysis → Bounds: &Y[tid][0] to &Y[tid][N]

Loop-level Weak-locks

Symbolic bounds: &Y[tid][0] ~ &Y[tid][N]

void foo() {
  X = 0;
  // loop-level weak-lock on (&Y[tid][0], &Y[tid][N])
  for (i = 0; i <= N; i++) {
    Y[tid][i] = 0;
  }
}

void bar() {
  X = 1;
  // loop-level weak-lock on (&Y[tid][0], &Y[tid][N])
  for (i = 0; i <= N; i++) {
    Y[tid][i] = 1;
    Z = 1;
  }
}
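The benefit of attaching an address range to a loop-level weak-lock can be sketched as follows. This is an illustration under assumptions, not Chimera's runtime: once the symbolic bounds &Y[tid][0] and &Y[tid][N] are evaluated for a concrete tid, two loops need to be serialized only if their ranges intersect. The names `addr_range`, `loop_bounds`, and `ranges_conflict`, and the array sizes, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NTHREADS 2
#define N 8

/* Shared array from the slide: thread tid writes Y[tid][0..N]. */
static int Y[NTHREADS][N + 1];

/* Inclusive range of element addresses guarded by a loop-level weak-lock. */
struct addr_range {
    uintptr_t lo;
    uintptr_t hi;
};

/* Evaluate the symbolic bounds &Y[tid][0] .. &Y[tid][N] for a concrete tid. */
struct addr_range loop_bounds(int tid)
{
    assert(tid >= 0 && tid < NTHREADS);
    struct addr_range r;
    r.lo = (uintptr_t)&Y[tid][0];
    r.hi = (uintptr_t)&Y[tid][N];
    return r;
}

/* Two loops conflict only if their address ranges intersect. */
int ranges_conflict(struct addr_range a, struct addr_range b)
{
    return a.lo <= b.hi && b.lo <= a.hi;
}
```

With these bounds, threads with different tid values touch disjoint rows and can run their loops in parallel, while two loops with the same tid would conflict.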

Imprecise Symbolic Bounds

Sources
• Bounds depend on a value computed inside the code region
• Bounds depend on arithmetic operations not supported in the analysis
  − e.g., modulo operations, logical AND/OR, etc.

Choosing the optimal granularity
• If bounds are too imprecise and the loop body is long enough, resort to instruction (basic-block) level weak-locks for parallelism

void qux() { … for (i = 0; i <= N; i++) { prev = Z[prev]; } … }

Symbolic bounds analysis → Bounds: -INF to +INF
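The granularity choice described on this slide can be read as a small decision procedure. The sketch below is one possible reading, not the authors' heuristic: the struct fields, the order of the checks, and the threshold `LONG_BODY_THRESHOLD` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum granularity { LOCK_FUNCTION, LOCK_LOOP, LOCK_BASIC_BLOCK };

/* Facts about a racy region gathered by the analyses (hypothetical layout). */
struct region_info {
    int profiled_non_concurrent;  /* profiler saw no concurrent execution */
    int bounds_precise;           /* symbolic bounds are finite and tight */
    long loop_body_instructions;  /* rough size of the loop body */
};

/* Hypothetical cut-off for "the loop body is long enough". */
#define LONG_BODY_THRESHOLD 100

/* Pick the coarsest weak-lock that is still expected to keep parallelism. */
enum granularity choose_granularity(const struct region_info *r)
{
    assert(r != NULL);
    if (r->profiled_non_concurrent)
        return LOCK_FUNCTION;      /* functions rarely overlap anyway */
    if (r->bounds_precise)
        return LOCK_LOOP;          /* disjoint ranges still run in parallel */
    if (r->loop_body_instructions >= LONG_BODY_THRESHOLD)
        return LOCK_BASIC_BLOCK;   /* imprecise bounds, long body: go fine-grained */
    return LOCK_LOOP;              /* imprecise bounds, short body: serialize loop */
}
```

For qux() above, the bounds are -INF to +INF, so the outcome depends only on how long the loop body is.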

Deadlock due to Weak-locks

No deadlocks between weak-locks
• Ordering: function-level > loop-level > instruction-level

Deadlock between weak-locks and original synchronization operations is possible

[Figure: T1 blocks in wait(cv) and T2 has yet to call signal(cv); a weak-lock time-out occurs]

Weak-lock Time-out

A weak-lock might time out
• Invoke a special system call to handle it

Weak-lock guarantee
• Only one thread holds a given weak-lock at any given time
• Mutual exclusion may be compromised, but this is sufficient for replay

[Figure: T1 blocks in wait(cv); after the time-out, T2 proceeds to signal(cv). The figure marks the current owner before and after the time-out and the logged order of weak-locks.]
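The guarantee above can be illustrated with a deliberately simplified state model. It has no real blocking, timer, or system call; it only tracks who owns the lock and in what order ownership changed. All names (`weak_lock`, `weak_lock_timeout_acquire`, and so on) are hypothetical and do not reflect Chimera's kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)
#define MAX_LOG 32

struct weak_lock {
    int owner;           /* thread id of the current owner, or NO_OWNER */
    int order[MAX_LOG];  /* logged order in which threads became owner */
    size_t order_len;
};

void weak_lock_init(struct weak_lock *wl)
{
    assert(wl != NULL);
    wl->owner = NO_OWNER;
    wl->order_len = 0;
}

/* Make tid the single owner and append the change to the log. */
static void become_owner(struct weak_lock *wl, int tid)
{
    wl->owner = tid;
    if (wl->order_len < MAX_LOG)
        wl->order[wl->order_len++] = tid;
}

/* Normal path: succeeds only when the lock is free. Returns 1 on success. */
int weak_lock_try_acquire(struct weak_lock *wl, int tid)
{
    if (wl->owner != NO_OWNER)
        return 0;
    become_owner(wl, tid);
    return 1;
}

/* Time-out path: the waiter takes the lock away from the current owner. */
void weak_lock_timeout_acquire(struct weak_lock *wl, int tid)
{
    become_owner(wl, tid);
}

/* Release has no effect for a thread that has already been preempted. */
void weak_lock_release(struct weak_lock *wl, int tid)
{
    if (wl->owner == tid)
        wl->owner = NO_OWNER;
}
```

After a time-out the preempted thread may still be inside its critical region, so mutual exclusion is not guaranteed. There is still exactly one owner at a time, and the logged ownership order is what a replayer would follow.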

Implementation

Source-to-source instrumentation
• Implemented in OCaml using CIL as a front end

Static analysis
• Data race detection: RELAY [Voung et al., FSE’07]
  − Include all library source code for soundness (uClibc’s libc, libm, etc.)
• Symbolic bounds analysis: [Rugina and Rinard, PLDI’00]
  − Intra-procedural analysis for racy loops only

Runtime system
• Modified Linux kernel to record/replay program input
• Modified pthread library to record/replay the happens-before order of original synchronization operations and weak-locks

Evaluation Setup

Test environment
• 2.66 GHz 8-core Xeon processor with 4 GB of RAM
• Different sets of inputs for profiling and performance evaluation
• Average of five trials with 4 worker threads
• 2, 4, and 8 threads for scalability results

Benchmarks
• Desktop applications
  − aget, pfscan, and pbzip2
• Server programs
  − knot and apache
• SPLASH-2 suite
  − ocean, water-nsq, fft, and radix

Record and Replay Performance

[Figure: bar chart of normalized performance overhead (y-axis 0 to 2.5) for record and replay on aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and their average. Annotations: 2.4% slowdown, 86% slowdown, 39%.]

• Recording: 39% on average
• Replay: similar to recording; much lower for I/O-intensive programs

Effectiveness of Coarse-grained Weak-locks

[Figure: bar chart of normalized recording overhead (log scale, 1 to 100) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and their average under five configurations: instr; instr + func; instr + loop; instr + loop + func; instr + bb + loop + func. Bars exceeding 100 are labeled 135 and 251. Annotations: 53x and 1.39x.]

• Coarse-grained weak-locks reduce the cost of instrumentation
• Exception: control-flow dependency (e.g., pfscan)

Breakdown of Recording Overhead

[Figure: stacked bar chart of normalized recording overhead (y-axis 1 to 2.5) for aget, pfscan, pbzip2, knot, apache, ocean, water, fft, and radix. Components: func, loop, and instr/bb weak-locks, each split into wait and log costs, plus sync op & system log.]

• Weak-lock overhead = contention (waiting) cost + logging cost
• High loop-lock contention
• High instr/bb-lock contention

Scalability

[Figure: bar chart of normalized recording overhead (y-axis 0 to 3.5) for the 2p, 4p, and 8p configurations on aget, pfscan, pbzip2, knot, apache, ocean, water, fft, radix, and their average]

• Scientific applications scale worse due to imprecise symbolic bounds analysis

Conclusion

Goal: Software-only deterministic multiprocessor replay systems

Chimera Analysis
• Static data race analysis
  − Find and protect potential data races with weak-locks
  − Instruction/basic-block-level weak-locks
• Profiling non-concurrent code regions
  − Address the inadequacy of the lockset-based algorithm
  − Function-level weak-locks
• Symbolic bounds analysis
  − Address the imprecision of conservative pointer analysis
  − Loop-level weak-locks

Low Recording Overhead
• 39% recording overhead for 4 worker threads

Thank you