Page 1: dBug: Systematic Evaluation of Distributed Systems

dBug: Systematic Evaluation of Distributed Systems

Jiří Šimša, Randy Bryant, Garth Gibson

PARALLEL DATA LABORATORY

Carnegie Mellon University

Page 2: dBug: Systematic Evaluation of Distributed Systems

Jiri Simsa © October 10 http://www.pdl.cmu.edu/

Concurrency Bugs Everywhere

Page 3: dBug: Systematic Evaluation of Distributed Systems

Why Do Concurrency Bugs Exist?

[Figure: one client and servers 1, ..., i, ..., j, ..., n, with interaction steps numbered 1 to 4]

Page 4: dBug: Systematic Evaluation of Distributed Systems

Why Do Concurrency Bugs Exist?

[Figure: two clients and servers 1, ..., i, ..., j, ..., n; each client performs steps numbered 1 and 2]

Page 5: dBug: Systematic Evaluation of Distributed Systems

Motivating Example Lessons

•  Locking across RPC = bad idea

•  Explosion of possible scenarios

•  Corner case errors easy to miss

•  Testing concurrent systems is hard:
   –  Control / Enumerate possible scenarios
   –  Tackle state space explosion

Page 6: dBug: Systematic Evaluation of Distributed Systems

Need For Better Testing Methods

•  Hardware performance

•  Software complexity

•  Formal specifications impractical

•  New systems rarely written from scratch

•  Common testing mechanism: stress testing
   –  Imprecise, falling behind

Page 7: dBug: Systematic Evaluation of Distributed Systems

Outline

•  Motivation

•  dBug Design

•  dBug Prototype

•  Prototype Case Studies

•  Ongoing & Future Work

•  Conclusion

Page 8: dBug: Systematic Evaluation of Distributed Systems

dBug Design

•  Goal: Enable systematic enumeration of (all) possible execution scenarios of a test

•  Repeated execution of the same test is guaranteed to explore different scenarios

•  Light-weight model checking
   –  Fixed initial state
   –  User provided test as a specification
   –  State space of the actual implementation explored
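To make "systematic enumeration" concrete, the sketch below lists every ordering of events from two threads that respects each thread's program order. It is an illustrative toy in Python, not dBug code: the `interleavings` function and the event labels are assumptions made for this example, and real coordination events (for instance a lock that is already held) would rule some orderings out.

```python
from typing import Iterator, List, Sequence, Tuple

Event = Tuple[str, str]  # (thread name, event label); both are made-up labels


def interleavings(threads: Sequence[Sequence[Event]]) -> Iterator[List[Event]]:
    """Yield every ordering of events that preserves each thread's own order."""
    positions = [0] * len(threads)  # index of the next event of each thread

    def extend(prefix: List[Event]) -> Iterator[List[Event]]:
        runnable = [i for i, t in enumerate(threads) if positions[i] < len(t)]
        if not runnable:
            yield list(prefix)  # all threads finished: one complete scenario
            return
        for i in runnable:
            # Let thread i take its next step, explore, then undo the step.
            prefix.append(threads[i][positions[i]])
            positions[i] += 1
            yield from extend(prefix)
            positions[i] -= 1
            prefix.pop()

    yield from extend([])


t1 = [("T1", "a1"), ("T1", "a2")]
t2 = [("T2", "b1"), ("T2", "b2")]
scenarios = list(interleavings([t1, t2]))
print(len(scenarios))
```

Two threads with two events each already give six scenarios; the count grows combinatorially with more threads and events, which is the state space explosion mentioned on the earlier slide.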

Page 9: dBug: Systematic Evaluation of Distributed Systems

Motivating Example dBug-ed

[Figure: two clients, an arbiter, and servers 1, ..., i, ..., j, ..., n, with interaction steps numbered 1 to 6]

Page 10: dBug: Systematic Evaluation of Distributed Systems

dBug Design Decisions

•  Which events to control, and how?
•  When to signal a request?
•  How to (re)store a state of the system?
•  How to explore the state space?
   –  Parallel exploration
   –  Exploration heuristics
   –  State space reduction
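One common answer to the state (re)storing and exploration questions is stateless re-execution: restart the test from its fixed initial state, replay a recorded prefix of scheduling choices, and branch from there. The slides do not spell out which approach dBug takes, so the following Python sketch is only an assumption-laden illustration; `explore_all` and `run_two_threads` are hypothetical names.

```python
from typing import Callable, List


def explore_all(run: Callable[[List[int]], List[int]]) -> List[List[int]]:
    """Depth-first exploration by re-execution.

    run(prefix) must restart the test from the initial state, follow `prefix`
    for its first decisions, pick option 0 afterwards, and return the number
    of enabled options at every decision point it passed.
    """
    explored: List[List[int]] = []
    pending: List[List[int]] = [[]]
    while pending:
        prefix = pending.pop()
        widths = run(prefix)
        full = prefix + [0] * (len(widths) - len(prefix))
        explored.append(full)
        # Queue the untried alternatives of decisions made beyond the prefix.
        for depth in range(len(prefix), len(widths)):
            for alternative in range(1, widths[depth]):
                pending.append(full[:depth] + [alternative])
    return explored


def run_two_threads(prefix: List[int]) -> List[int]:
    """Toy test: two threads with two steps each; a decision picks who runs."""
    remaining = [2, 2]
    widths: List[int] = []
    step = 0
    while any(remaining):
        enabled = [i for i, r in enumerate(remaining) if r > 0]
        widths.append(len(enabled))
        choice = prefix[step] if step < len(prefix) else 0
        remaining[enabled[choice]] -= 1
        step += 1
    return widths


schedules = explore_all(run_two_threads)
print(len(schedules))
```

Each complete schedule is visited exactly once, and independent prefixes in `pending` could be handed to different workers, which is one way to read the "parallel exploration" item.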

Page 12: dBug: Systematic Evaluation of Distributed Systems

Event Control Mechanism

[Figure: an Application on top of OS + Libraries, shown without and with the dBug interposition layer]

Page 13: dBug: Systematic Evaluation of Distributed Systems

Compile-time Interposition

Source code annotation of:

•  Creation of threads (processes)
•  Destruction of threads (processes)
•  Coordination primitives:
   –  Thread synchronization
   –  Remote procedure calls
   –  “Your coordination primitive here”
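The idea of interposing on a coordination primitive can be sketched with a wrapper that reports each lock operation before performing it. Python has no compile step, so a wrapper class stands in for the source annotations here; `InterposedLock` and its `hook` parameter are hypothetical and not part of dBug.

```python
import threading
from typing import Callable, List, Tuple


class InterposedLock:
    """Wraps threading.Lock and reports each operation to a hook first."""

    def __init__(self, name: str, hook: Callable[[str, str], None]) -> None:
        self._name = name
        self._hook = hook
        self._lock = threading.Lock()

    def acquire(self) -> None:
        # A real tool could block here until a scheduler lets the thread go.
        self._hook("acquire", self._name)
        self._lock.acquire()

    def release(self) -> None:
        self._hook("release", self._name)
        self._lock.release()

    def __enter__(self) -> "InterposedLock":
        self.acquire()
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.release()


events: List[Tuple[str, str]] = []
table_lock = InterposedLock("table_lock", lambda op, name: events.append((op, name)))
with table_lock:
    pass  # the application's critical section would go here
print(events)
```

Application code keeps using the lock as before; only the construction site changes, which is in the spirit of the small annotation counts reported in the case studies.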

Page 14: dBug: Systematic Evaluation of Distributed Systems

Client-Server Architecture

[Figure: threads 1 ... n of the original distributed system, each with a dBug client, and the dBug arbiter with its dBug server]
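A minimal sketch of the client/arbiter split, assuming each thread asks a central arbiter for permission before every controlled event and the arbiter grants requests in a chosen order. The `Arbiter` class, its `request` method, and the fixed `schedule` list are assumptions for illustration; the schedule must be feasible, otherwise the threads would wait forever.

```python
import threading
from typing import List


class Arbiter:
    """Grants one request at a time, in the order given by `schedule`."""

    def __init__(self, schedule: List[str]) -> None:
        self._schedule = list(schedule)
        self._cond = threading.Condition()
        self.trace: List[str] = []

    def request(self, thread_id: str, event: str) -> None:
        """Block the caller until it is this thread's turn, then log the event."""
        with self._cond:
            self._cond.wait_for(
                lambda: bool(self._schedule) and self._schedule[0] == thread_id
            )
            self._schedule.pop(0)
            self.trace.append(f"{thread_id}:{event}")
            self._cond.notify_all()  # wake the thread that is next in line


def client(arbiter: Arbiter, thread_id: str, events: List[str]) -> None:
    """Stands in for an application thread with an embedded client stub."""
    for event in events:
        arbiter.request(thread_id, event)


def run_schedule(schedule: List[str]) -> List[str]:
    arbiter = Arbiter(schedule)
    threads = [
        threading.Thread(target=client, args=(arbiter, tid, ["a", "b"]))
        for tid in ("T1", "T2")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arbiter.trace


trace = run_schedule(["T2", "T1", "T1", "T2"])
print(trace)
```

Because every controlled event passes through `request`, rerunning with a different schedule reproduces a different scenario deterministically.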

Page 15: dBug: Systematic Evaluation of Distributed Systems

When to Signal a Request?

•  Blind Mode:
   –  Uses a timeout
   –  Pros: Easy to implement
   –  Cons: Overhead, Imprecise

•  Informed Mode:
   –  Uses application idle/progress hints
   –  Pros: Fast, Accurate
   –  Cons: Expert knowledge, Annotation
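The trade-off between the two modes can be sketched as two ways of deciding that all pending requests have arrived. This is a simplified Python illustration under stated assumptions: `collect_blind`, `collect_informed`, the `quiet_period` value, and the use of a thread count as the "hint" are all hypothetical.

```python
import queue
from typing import List


def collect_blind(requests: "queue.Queue[str]", quiet_period: float) -> List[str]:
    """Blind mode: keep collecting until no request arrives for `quiet_period`.

    Every decision pays at least one full quiet period (overhead), and a slow
    thread may post its request after the cut-off (imprecision).
    """
    pending: List[str] = []
    while True:
        try:
            pending.append(requests.get(timeout=quiet_period))
        except queue.Empty:
            return pending


def collect_informed(requests: "queue.Queue[str]", expected: int) -> List[str]:
    """Informed mode: a hint says how many threads will block; wait for exactly those."""
    return [requests.get() for _ in range(expected)]


inbox: "queue.Queue[str]" = queue.Queue()
for tid in ("T1", "T2"):
    inbox.put(tid)
blind = collect_blind(inbox, quiet_period=0.05)

for tid in ("T1", "T2"):
    inbox.put(tid)
informed = collect_informed(inbox, expected=2)
print(blind, informed)
```

The informed variant returns as soon as the expected requests are in, but it only works if someone supplies a correct hint, which matches the "expert knowledge, annotation" cost listed above.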

Page 16: dBug: Systematic Evaluation of Distributed Systems

State Space Exploration

[Figure: four copies of the client-server architecture diagram (threads 1 ... n of the original distributed system, each with a dBug client; dBug arbiter with dBug server)]

Page 18: dBug: Systematic Evaluation of Distributed Systems

Fast Array of Wimpy Nodes

•  Energy-efficient architecture
•  FAWN-KV = distributed key-value storage
•  put()/get() interface, strong consistency
•  get() returns value of the last acked put()

[Figure: front-end and back-end nodes connected through a switch; labels: KV Ring, Front-end, Back-end, Switch]
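The consistency rule "get() returns value of the last acked put()" can be turned into a small test oracle. The sketch below assumes a purely sequential history of already-acknowledged operations; the `first_violation` function and the tuple encoding are made up for this example, and a history with overlapping operations would need a more permissive check.

```python
from typing import Dict, List, Optional, Tuple

# (kind, key, value): kind is "put_ack" or "get"; value None means "not found".
Op = Tuple[str, str, Optional[str]]


def first_violation(history: List[Op]) -> Optional[int]:
    """Return the index of the first get() that disagrees with the last acked put()."""
    last_acked: Dict[str, Optional[str]] = {}
    for index, (kind, key, value) in enumerate(history):
        if kind == "put_ack":
            last_acked[key] = value
        elif kind == "get" and last_acked.get(key) != value:
            return index
    return None


ok_history: List[Op] = [("put_ack", "k", "v1"), ("put_ack", "k", "v2"), ("get", "k", "v2")]
bad_history: List[Op] = [("put_ack", "k", "v1"), ("get", "k", None)]
print(first_violation(ok_history), first_violation(bad_history))
```

The second history models a get() that reports "not found" after an acknowledged put(), the kind of outcome a test assertion should reject.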

Page 19: dBug: Systematic Evaluation of Distributed Systems

Case Study 1: Multi-threading

•  Log-structured writes

•  Need for clean-up

•  Rewrite Operation
   –  sequential scan
   –  atomic swap

[Figure legend: obsolete data, up-to-date data]

Page 20: dBug: Systematic Evaluation of Distributed Systems

Integrating FAWN-KV and dBug

•  Creation and destruction of threads
   –  20 lines of annotations

•  Acquiring and releasing locks
   –  Compile-time interposition on pthread interface

•  Test case:

   put(key, value1);
   if (fork() == 0) {
     rewrite();
   } else {
     put(key, value2);
     get(key);
   }

Page 21: dBug: Systematic Evaluation of Distributed Systems

Case Study Results

•  Evaluated with the blind mode for ~24 hours
   –  Over 7000 possible scenarios
   –  Test always executed correctly

•  Introduced and detected a data race bug
   –  The bug showed up in ~700 scenarios

•  Two person-weeks of work
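The shape of this experiment (a put() racing with a rewrite, followed by a get()) can be mimicked on a toy model. Everything in the sketch below is an assumption for illustration: `ToyStore` is not FAWN-KV code, the rewrite is reduced to two steps (scan, swap), and the scenario counts it produces have nothing to do with the figures reported above.

```python
from itertools import combinations
from typing import Iterator, List, Optional, Tuple


def interleave(a: List[str], b: List[str]) -> Iterator[List[str]]:
    """All merges of step lists a and b that keep each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


class ToyStore:
    """Append-only log; rewrite = scan live entries, then swap in the new log."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, str]] = []
        self.snapshot: List[Tuple[str, str]] = []

    def put(self, key: str, value: str) -> None:
        self.log.append((key, value))

    def get(self, key: str) -> Optional[str]:
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None

    def scan(self) -> None:
        self.snapshot = list(dict(self.log).items())  # keeps the latest value per key

    def swap(self) -> None:
        self.log = self.snapshot


def run(schedule: List[str]) -> Optional[str]:
    """Execute one scenario from the fixed initial state; return what get() saw."""
    store = ToyStore()
    store.put("key", "value1")
    result = None
    for step in schedule:
        if step in ("scan", "rewrite"):
            store.scan()
        if step in ("swap", "rewrite"):
            store.swap()
        if step == "put2":
            store.put("key", "value2")
        if step == "get":
            result = store.get("key")
    return result


# Racy variant: scan and swap can be separated by the other thread's put().
racy_failures = [s for s in interleave(["scan", "swap"], ["put2", "get"]) if run(s) != "value2"]
# Locked variant: the whole rewrite runs as one atomic step.
locked_failures = [s for s in interleave(["rewrite"], ["put2", "get"]) if run(s) != "value2"]
print(racy_failures, locked_failures)
```

In this toy, only the ordering scan, put2, swap, get loses the update, so a handful of random runs could easily miss it while exhaustive enumeration cannot.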

Page 22: dBug: Systematic Evaluation of Distributed Systems

Case Study 2: Including RPCs

[Figure: ring of nodes labeled A, B, C, D, F, G, H]

Page 23: dBug: Systematic Evaluation of Distributed Systems

Integrating FAWN-KV with dBug

•  Creation and destruction of agents
   –  20 lines of annotations

•  Issuing remote procedure calls
   –  Modified Apache Thrift library (2 lines)

•  Test case:

   put(key, value1);
   if (fork() == 0) {
     join();
   } else {
     if (fork() == 0)
       put(key, value2);
     else
       get(key);
   }

Page 24: dBug: Systematic Evaluation of Distributed Systems

Case Study Results

•  Evaluated with blind mode for 45 minutes
   –  Total of 173 possible scenarios
   –  Found a bug

•  The bug showed up in only 3 scenarios
   –  get(key) returns “not found”

•  Two person-weeks of work

Page 26: dBug: Systematic Evaluation of Distributed Systems

dBug Evolution

Page 27: dBug: Systematic Evaluation of Distributed Systems

dBug 2nd Generation

•  Open source Autotools project
•  dBug interposition as a shared library
•  Precise and automatic detection of when to signal a request

•  Educational use of dBug:
   –  In use to evaluate student solutions for 15-213
   –  Found bugs in the TA implementation
   –  Available to students to test their solutions

Page 28: dBug: Systematic Evaluation of Distributed Systems

Future Work

[Figure: roadmap chart with rows PAST, PRESENT, FUTURE and columns Interface support, State Space, Case Studies, Event Injection. Labels appearing in the chart: POSIX threads, Local I/O, Network I/O, UNIX processes, Manual, Ad hoc, Parallel Exploration, FAWN-KV, PVFS, 15-213, RAIDTool, Fault Injection, Time Distortion]

Page 30: dBug: Systematic Evaluation of Distributed Systems

Related Work

•  VeriSoft [Godefroid98]
   –  manual, exhaustive, multi-threaded, C and C++ sources

•  MaceMC [Killian07]
   –  automated, selective, distributed, Mace sources

•  CHESS [Musuvathi08]
   –  automated, selective, multi-threaded, Windows binaries

•  MoDist [Yang09]
   –  automated, selective, distributed, Windows binaries

Page 31: dBug: Systematic Evaluation of Distributed Systems

Conclusion

•  Systematic and automatic evaluation of distributed system test cases

•  Open source implementation of dBug

•  Experiments with:
   –  Parallel Virtual File System (C)
   –  FAWN-based key value storage (C++)
   –  CMU student class projects (C and C++)
   –  RAIDTool (Java)

•  Finding real bugs

Page 32: dBug: Systematic Evaluation of Distributed Systems

References

•  [Godefroid98] P. Godefroid. VeriSoft: A Tool for the Automatic Analysis of Concurrent Reactive Software. CAV 1997.

•  [Killian07] C. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, Death, and the Critical Transition: Detecting Liveness Bugs in Systems Code. NSDI 2007.

•  [Musuvathi08] M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and I. Neamtiu. Finding and Reproducing Heisenbugs in Concurrent Programs. OSDI 2008.

•  [Yang09] J. Yang, T. Chen, M. Wu, Z. Xu, X. Liu, H. Lin, M. Yang, F. Long, L. Zhang, and L. Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. NSDI 2009.

•  [Simsa10] J. Simsa, G. Gibson, and R. Bryant. dBug: Systematic Evaluation of Distributed Systems. SSV 2010.