Instrumentation and Sampling Strategies for Cooperative Concurrency Bug Isolation

Guoliang Jin, Aditya Thakur, Ben Liblit, Shan Lu
University of Wisconsin–Madison

Cooperative Concurrency Bug Isolation

• Concurrency bugs are synchronization mistakes in multi-threaded programs.
• Several types:
  – Atomicity violation
  – Data race
  – Deadlock, etc.

[Figure: interleavings of read(x) and write(x) across thread 1 and thread 2, marked ☺ (correct) or ☹ (failure).]

Concurrency bugs are common in the field

• Developers are poor at parallel programming
• Interleaving testing is inefficient
• Applications with concurrency bugs are shipped to the users

Concurrency bugs lead to failures in the field

• Disasters in the past
  – Therac-25, Northeast blackout of 2003
• More threats in the multi-core era

Failure diagnosis is critical

Concurrency Bug Failure Example

Concurrency bug from Apache HTTP Server. Thread 1 and thread 2 both execute log_writer(), which uses the shared index idx:

    …
    log_writer() {
      …
      memcpy(&buf[idx], s, strlen(s));
      …
      temp = idx;
      idx = temp + strlen(s);
      …
      return SUCCESS;
    }
    …

[Figure: two interleavings of log_writer() in thread 1 and thread 2, both accessing idx: a correct run (☺) and a failure run (☹).]
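The effect of a bad interleaving on idx can be replayed deterministically in a single thread. The sketch below is an illustration, not Apache's code: it splits the three statements of log_writer() into hypothetical helpers (step_copy, step_read, step_write) and runs them once serially and once in one possible bad order. The particular interleaving is an assumption; the slides only show that the two executions interleave on idx.

```c
#include <stddef.h>
#include <string.h>

static char buf[64];   /* shared log buffer */
static size_t idx;     /* shared write position */

/* The three statements of log_writer(), split so a caller can interleave them. */
static void step_copy(const char *s) { memcpy(&buf[idx], s, strlen(s)); }
static size_t step_read(void) { return idx; }
static void step_write(size_t temp, const char *s) { idx = temp + strlen(s); }

static void reset(void) {
    memset(buf, 0, sizeof buf);
    idx = 0;
}

/* Correct run: "thread 1" finishes before "thread 2" starts. */
static void run_serial(const char *a, const char *b) {
    reset();
    step_copy(a);
    step_write(step_read(), a);
    step_copy(b);
    step_write(step_read(), b);
}

/* One assumed failing order: thread 2 runs between thread 1's steps. */
static void run_interleaved(const char *a, const char *b) {
    reset();
    step_copy(a);             /* thread 1 copies at idx == 0 */
    size_t t1 = step_read();  /* thread 1 reads idx == 0 */
    step_copy(b);             /* thread 2 also copies at idx == 0 */
    size_t t2 = step_read();  /* thread 2 reads the same stale idx */
    step_write(t1, a);        /* thread 1 advances idx */
    step_write(t2, b);        /* thread 2 overwrites that update */
}
```

With the inputs "AAA" and "BB", the serial order leaves "AAABB" in buf with idx at 5, while the interleaved order leaves "BBA" with idx at 2: one entry is partly overwritten and one update to idx is lost.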


Diagnosing Concurrency Bug Failures is Challenging

• The failure is non-deterministic and rare
  – Programmers have trouble reproducing the failure
• The root cause involves more than one thread

Existing work and its limitations

• Failure replay
  – High runtime overhead
  – Developers need to manually locate faults
• Run-time bug detection
  – (mostly) High runtime overhead
  – Not guided by the failure
    • Many false positives

How to achieve low-overhead and accurate failure diagnosis?

Our work: CCI

• Goal: diagnosing production-run concurrency bug failures
• Major components:
  – predicate instrumentor
  – sampler
  – statistical debugging

[Figure: CCI pipeline with the stages Program Source, Compiler, Predicates, Sampler, Counts & ☺/☹, Statistical Debugging, Predictors.]

Predictors: true in most failure runs, false in most correct runs.

CCI Overview

• Three different types of predicates.
• Each predicate has its supporting sampling strategy.
• Same statistical debugging as in CBI.
• Experiments show CCI is effective in diagnosing concurrency failures.

[Figure: capability vs. overhead plot of the three predicate types: FunRe, Havoc, Prev.]

Outline

• Motivation
• CCI Overview
• CCI Predicates and Sampling Strategies
  – CCI-Prev and its sampling strategy
  – CCI-Havoc and its sampling strategy
  – CCI-FunRe and its sampling strategy
• Evaluation
• Conclusion



CCI-Prev Intuition

[Figure: correct (☺) and failing (☹) interleavings of read(x) and write(x) across thread 1 and thread 2, in two panels: Atomicity Violation and Data Race.]

Just record which thread accessed the location last time.

CCI-Prev Predicate

It tracks whether two successive accesses to a shared memory location were by two distinct threads or by the same thread.

[Figure: capability vs. overhead plot showing Prev.]

CCI-Prev Predicate on the Correct Run

[Figure: thread 1 and thread 2 each execute log_writer(), with instrumentation site I marked in both; the run is correct (☺).]

Predicate counts as the run proceeds (☺ = correct runs, ☹ = failure runs):

    Snapshot   remote_I (☺ ☹)   local_I (☺ ☹)
    1          0 0              0 0
    2          0 0              1 0
    3          0 0              2 0

CCI-Prev Predicate on the Failure Run

[Figure: thread 1 and thread 2 each execute log_writer(), with instrumentation site I marked in both; the run fails (☹).]

Predicate counts as the run proceeds (☺ = correct runs, ☹ = failure runs):

    Snapshot   remote_I (☺ ☹)   local_I (☺ ☹)
    1          0 0              2 0
    2          0 0              2 1
    3          0 1              2 1

CCI-Prev Predicate Instrumentation

Instrumentation at site I in thread 1's log_writer(), shown on the failure run (☹):

    lock(glock);
    remote = test_and_insert(&idx, curTid);
    record(I, remote);
    temp = idx;
    unlock(glock);
    idx = temp + strlen(s);

A global hash table maps each address to the ID of the thread that accessed it last:

    address   ThreadID
    …         …
    &idx      2   (before thread 1's access at I)
    &idx      1   (after thread 1's access at I)
    …         …

Predicate counts (☺ = correct runs, ☹ = failure runs):

    Snapshot   remote_I (☺ ☹)   local_I (☺ ☹)
    before     0 0              2 1
    after      0 1              2 1
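The global table behind test_and_insert can be sketched as follows. This is a minimal illustration under stated assumptions, not the CCI implementation: the fixed-size open-addressing table, the PREV_SLOTS capacity, the prev_slot and instrumented_read helpers, and the C11 atomic_flag spinlock standing in for glock are all hypothetical. It also assumes that the first recorded access to an address counts as local.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PREV_SLOTS 1024  /* hypothetical capacity; must be a power of two */

typedef struct {
    const void *addr;  /* shared location, NULL while the slot is unused */
    int tid;           /* thread that accessed addr most recently */
} prev_entry;

static prev_entry prev_table[PREV_SLOTS];
static atomic_flag glock = ATOMIC_FLAG_INIT;  /* spinlock standing in for glock */

static void lock_glock(void) {
    while (atomic_flag_test_and_set_explicit(&glock, memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

static void unlock_glock(void) {
    atomic_flag_clear_explicit(&glock, memory_order_release);
}

/* Open addressing with linear probing; assumes the table never fills up. */
static prev_entry *prev_slot(const void *addr) {
    size_t i = ((uintptr_t)addr >> 3) & (PREV_SLOTS - 1);
    while (prev_table[i].addr != NULL && prev_table[i].addr != addr) {
        i = (i + 1) & (PREV_SLOTS - 1);
    }
    return &prev_table[i];
}

/* Caller must hold glock. Returns true when the previous recorded access to
 * addr came from a different thread ("remote"), then records cur_tid. */
static bool test_and_insert(const void *addr, int cur_tid) {
    prev_entry *e = prev_slot(addr);
    bool remote = (e->addr != NULL) && (e->tid != cur_tid);
    e->addr = addr;
    e->tid = cur_tid;
    return remote;
}

/* Instrumented read: the table update and the access share one critical
 * section, so the recorded order matches the real access order. */
static int instrumented_read(const int *loc, int cur_tid, bool *remote) {
    lock_glock();
    *remote = test_and_insert(loc, cur_tid);
    int value = *loc;
    unlock_glock();
    return value;
}
```

Holding the lock across both the table update and the access is what makes the predicate trustworthy, and it is also the main source of Prev's overhead.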

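CCI-Prev Sampling Strategy

[Figure: thread 1 and thread 2 each execute log_writer(), with instrumentation site I.]

Does traditional sampling work? NO. A Prev predicate needs two successive accesses, possibly from different threads, to both be observed, so sampling must be:

• Thread-coordinated
• Bursty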



CCI-Havoc Intuition

Just record what value was observed during the last access.

[Figure: thread 1 and thread 2 each execute log_writer(), with instrumentation site I.]

CCI-Havoc Predicate

It tracks whether the value of a given shared location changes between two consecutive accesses by one thread.

Only uses thread-local information.

[Figure: capability vs. overhead plot showing Prev and Havoc.]

CCI-Havoc Predicate on the Correct Run

[Figure: thread 1 and thread 2 each execute log_writer(), with instrumentation site I marked in both; the run is correct (☺).]

Predicate counts as the run proceeds (☺ = correct runs, ☹ = failure runs):

    Snapshot   unchanged_I (☺ ☹)   changed_I (☺ ☹)
    1          0 0                 0 0
    2          1 0                 0 0
    3          2 0                 0 0

CCI-Havoc Predicate on the Failure Run

[Figure: thread 1 and thread 2 each execute log_writer(), with instrumentation site I marked in both; the run fails (☹).]

Predicate counts as the run proceeds (☺ = correct runs, ☹ = failure runs):

    Snapshot   unchanged_I (☺ ☹)   changed_I (☺ ☹)
    1          2 0                 0 0
    2          2 1                 0 0
    3          2 1                 0 1

CCI-Havoc Predicate Instrumentation

Accesses to idx in thread 1's log_writer(), shown on the failure run (☹):

    …
    temp = idx;
    idx = temp + strlen(s);

Instrumentation added at site I:

    changed = test(&idx, temp);
    record(I, changed);
    insert(&idx, temp);

Hash table for thread 1 (two snapshots):

    address   value
    …         …
    &idx      idx
    &idx      idx+len2
    …         …

Predicate counts (☺ = correct runs, ☹ = failure runs):

    Snapshot   unchanged_I (☺ ☹)   changed_I (☺ ☹)
    before     2 1                 0 0
    after      2 1                 0 1
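A per-thread table for test and insert can be sketched like this. It is an illustration under stated assumptions rather than the CCI implementation: the havoc_ names, the HAVOC_SLOTS capacity, the use of C11 _Thread_local storage, and storing values as long are all hypothetical choices. A first access with no recorded value is treated as unchanged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HAVOC_SLOTS 256  /* hypothetical capacity; must be a power of two */

typedef struct {
    const void *addr;  /* shared location, NULL while the slot is unused */
    long value;        /* value this thread saw at its last access */
} havoc_entry;

/* One table per thread, so no lock is needed. */
static _Thread_local havoc_entry havoc_table[HAVOC_SLOTS];

/* Open addressing with linear probing; assumes the table never fills up. */
static havoc_entry *havoc_slot(const void *addr) {
    size_t i = ((uintptr_t)addr >> 3) & (HAVOC_SLOTS - 1);
    while (havoc_table[i].addr != NULL && havoc_table[i].addr != addr) {
        i = (i + 1) & (HAVOC_SLOTS - 1);
    }
    return &havoc_table[i];
}

/* True when this thread has a recorded value for addr and the current
 * value differs from it, meaning some other thread wrote in between. */
static bool havoc_test(const void *addr, long current) {
    const havoc_entry *e = havoc_slot(addr);
    return e->addr != NULL && e->value != current;
}

/* Remember the value this thread just observed (or wrote) at addr. */
static void havoc_insert(const void *addr, long value) {
    havoc_entry *e = havoc_slot(addr);
    e->addr = addr;
    e->value = value;
}
```

Because each thread only consults its own table, this check cannot say which thread changed the value, only that a change happened between two of this thread's accesses.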

CCI-Havoc Sampling Strategy

[Figure: thread 1 and thread 2 each execute log_writer().]

• Bursty
• Thread-independent



CCI-FunRe Predicate

It tracks whether the execution of one function overlaps with the execution of the same function from a different thread.

[Figure: capability vs. overhead plot showing Prev, Havoc, and FunRe.]

CCI-FunRe Predicate Example

[Figure: thread 1 and thread 2 each execute log_writer() { … return SUCCESS; }, shown once for the correct run (☺) and once for the failure run (☹).]

Predicate counts (☺ = correct runs, ☹ = failure runs):

    Predicate             ☺  ☹
    …
    NonReent_log_writer   2  1
    Reent_log_writer      0  1
    …

CCI-FunRe Predicate Instrumentation

    …
    log_writer() {
      oldCount = atomic_inc(Count);
      record("log_writer", oldCount);
      …
      atomic_dec(Count);
      return SUCCESS;
    }
    …

Per-function counter table as thread 1 and thread 2 run log_writer() on the failure run (☹):

    FuncName     Counter
    …            …
    log_writer   0 → 1 → 2 → 0
    …            …

Predicate counts as the run proceeds (☺ = correct runs, ☹ = failure runs):

    Snapshot   Reent_log_writer (☺ ☹)
    1          0 0
    2          0 0
    3          0 1
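The entry and exit accounting can be sketched with C11 atomics. This is an illustration under stated assumptions: funre_enter and funre_exit are hypothetical names, atomic_fetch_add and atomic_fetch_sub stand in for the slide's atomic_inc and atomic_dec, and an old count above zero is taken to mean the call overlaps another execution. A plain counter like this would also count same-thread recursion as overlap.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Number of threads currently inside log_writer(); one counter per
 * instrumented function (hypothetical layout). */
static atomic_int log_writer_count;

/* Call on function entry. Returns true when another execution of the same
 * function was already in progress, i.e. the Reent predicate holds. */
static bool funre_enter(atomic_int *count) {
    int old_count = atomic_fetch_add(count, 1);  /* returns the value before the increment */
    return old_count > 0;
}

/* Call on every exit path of the function. */
static void funre_exit(atomic_int *count) {
    atomic_fetch_sub(count, 1);
}
```

A caller would wrap the function body between funre_enter(&log_writer_count) and funre_exit(&log_writer_count), and record the returned flag as either the Reent or the NonReent observation.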

CCI-FunRe Sampling Strategy

Function execution accounting is not suitable for sampling, so this part is unconditional.

[Figure: failure run (☹) in which thread 1 and thread 2 both execute log_writer(); the FuncName/Counter table shows log_writer at 0 in each of the three snapshots.]

CCI-FunRe Sampling Strategy

• Function execution accounting:
  – unconditional
• FunRe predicate recording:
  – thread-independent
  – non-bursty


Experimental Evaluation

• Implementation
  – Static instrumentor based on the CBI framework
• Real-world concurrency bug failures from:
  – Apache HTTP server, Cherokee
  – Mozilla-JS, PBZIP2
  – SPLASH-2: FFT, LU
• Parameters used
  – Roughly 1/100 sampling rate

Failure Diagnosis Evaluation

• Methodology
  – Uses concurrency bug failures that occurred in the real world
  – Each application runs 3000 times on a multi-core machine
    • Random sleeps are added to get some failure runs
  – Sampling is enabled
  – Statistical debugging then returns a list of predictors
    • Which predictor in the list can diagnose the failure?

Failure Diagnosis Results (with sampling)

    Program        CCI-Prev   CCI-Havoc   CCI-FunRe
    Apache-1       top1       top1        top1
    Apache-2       top1       top1
    Cherokee       top2
    FFT            top1
    LU             top1
    Mozilla-JS-1   top2       top1
    Mozilla-JS-2   top1       top1        top1
    Mozilla-JS-3   top2       top1        top1
    PBZIP2         top1       top1

[Figure: capability axis ordering FunRe, Havoc, Prev.]

Runtime Overhead

                 Prev                     Havoc                    FunRe
    Program      No Sampling  Sampling    No Sampling  Sampling    No Sampling  Sampling
    Apache-1     62.6%        1.9%        27.4%        2.8%        1.1%         1.8%
    Apache-2     8.4%         0.5%        4.2%         0.4%        0.2%         0.2%
    Cherokee     19.1%        0.3%        2.1%         0.0%        0.3%         0.4%
    FFT          169%         24.0%       33.5%        5.5%        72.8%        30.0%
    LU           57857%       949%        1693%        8.9%        1682%        926%
    Mozilla-JS   11311%       606%        7587%        356%        123%         97.0%
    PBZIP2       0.2%         0.2%        0.2%         0.2%        0.3%         0.2%

[Figure: overhead axis ordering FunRe, Havoc, Prev.]

Conclusion

• CCI is capable of diagnosing many production-run concurrency bug failures and suitable for that setting.
• Future predicates can leverage our effective sampling strategies.
• Experiments confirm the design tradeoff.

[Figure: capability vs. overhead plot of Prev, Havoc, and FunRe.]

Questions about CCI?

[Figure: capability vs. overhead plot of Prev, Havoc, and FunRe.]

CBI on Concurrency Bug Failures

[Figure: the concurrency bug from Apache HTTP Server; thread 1 and thread 2 both execute log_writer() and access idx; the run fails (☹).]

CBI does not work!

To diagnose production-run concurrency bug failures, interleaving-related events should be tracked!

CCI-Prev Predicate Instrumentation with Sampling

    if (gsample) {
      lock(glock);
      changed = test_and_insert(&cnt, curTid);
      record(I, changed);
      temp = cnt;
      unlock(glock);
    } else {
      temp = cnt;
      [[ gsample = true; iset = curTid; lLength = gLength = 0; ]]?
    }

CCI-Prev Predicate Instrumentation with Sampling

    if (gsample) {
      lock(glock);
      changed = test_and_insert(&cnt, curTid, &stale);
      record(stale ? P1 : P2, changed);
      temp = cnt;
      unlock(glock);
      lLength++;
      gLength++;
      if ((iset == curTid && lLength > lMAX) || gLength > gMAX) {
        clear();
        iset = unusedTid;
        gsample = false;
      }
    } else {
      temp = cnt;
      [[ gsample = true; iset = curTid; lLength = gLength = 0; ]]?
    }

CCI-Havoc Predicate Instrumentation with Sampling

    if (sample) {
      changed = test(&cnt, cnt, &stale);
      record(stale ? P1 : P2, changed);
      temp = cnt;
      insert(&cnt, cnt);
      length++;
      if (length > lMAX) {
        clear();
        sample = false;
      }
    } else {
      temp = cnt;
      [[ sample = true; length = 0; ]]?
    }

No global lock used!
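The burst control in this scheme can be isolated into a small per-thread state machine. The sketch below is an assumption-laden illustration: the sampler struct, sampler_step, and the L_MAX and PERIOD constants are hypothetical, and the `[[ … ]]?` decision to start a burst is replaced by a deterministic countdown so the behavior is easy to follow, whereas a real sampler would decide randomly. The access that starts a burst is itself not instrumented, matching the else branch on the slide.

```c
#include <stdbool.h>

#define L_MAX  3    /* hypothetical burst bound, playing the role of lMAX */
#define PERIOD 100  /* hypothetical gap between bursts, for a rate near 1/100 */

typedef struct {
    bool sample;    /* true while this thread is inside a burst */
    int length;     /* instrumented accesses so far in the current burst */
    int countdown;  /* unsampled accesses left before the next burst starts */
} sampler;

/* Call once per access. Returns true when this access should run the
 * instrumented path (test, record, insert), false for the fast path. */
static bool sampler_step(sampler *s) {
    if (s->sample) {
        s->length++;
        if (s->length > L_MAX) {
            /* End of burst: a full implementation would also clear the
             * thread-local table here, as clear() does on the slide. */
            s->sample = false;
        }
        return true;
    }
    if (--s->countdown <= 0) {
        /* Start a burst that takes effect from the next access on. */
        s->sample = true;
        s->length = 0;
        s->countdown = PERIOD;
    }
    return false;
}
```

With these constants each burst covers L_MAX + 1 consecutive accesses of one thread, which is what lets two successive accesses by that thread both be observed without any cross-thread coordination.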

Failure Diagnosis Results (with sampling)

    Program        CBI   CCI-Prev   CCI-Havoc   CCI-FunRe
    Apache-1             top1       top1        top1
    Apache-2             top1       top1
    Cherokee             top2
    FFT                  top1
    LU                   top1
    Mozilla-JS-1         top2       top1
    Mozilla-JS-2         top1       top1        top1
    Mozilla-JS-3         top2       top1        top1
    PBZIP2               top1       top1

[Figure: capability axis ordering FunRe, Havoc, Prev.]
