
D3S: Debug Deployed Distributed Systems

Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, Zheng Zhang

Microsoft Research Asia, Tsinghua University, Fudan University, Shanghai Jiaotong University, MIT CSAIL

Debugging distributed systems is difficult

• Bugs are difficult to reproduce
– Many machines executing concurrently
– Machines may fail
– Network may fail

Example: Distributed lock

• Distributed reader-writer locks
– Lock mode: exclusive, shared
– Invariant: only one client can hold a lock in the exclusive mode

• Debugging is difficult because the protocol is complex
– For performance, clients cache locks
– For failure tolerance, locks have a lease

How do people debug?

• Simulation

• Model-checking

• Runtime checking

State-of-the-art of runtime checking

Step 1: add logs

void ClientNode::OnLockAcquired(…) {
  …
  print_log( m_NodeID, lock, mode );
}

Step 2: Collect logs, align them into a globally consistent sequence
• Keep partial order

Step 3: Write checking scripts
• Scan the logs to retrieve lock states
• Check the consistency of locks (a sketch of such a script follows)
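For concreteness, a minimal sketch of such a checking script is shown below. It is illustrative only: it assumes a hypothetical merged log stream of "node lock mode ACQ|REL" records (not the actual log format), replays the acquire/release events, and reports any lock that ends up with conflicting holders.

// Illustrative offline checker over an already-merged log; the record format
// "<node> <lock> <mode> <ACQ|REL>" is an assumption for this sketch.
#include <cstdio>
#include <map>
#include <set>
#include <string>

int main() {
    std::map<std::string, std::set<std::string>> exclusive, shared;
    char node[64], lock[64], mode[64], op[64];
    while (std::scanf("%63s %63s %63s %63s", node, lock, mode, op) == 4) {
        auto &holders = (std::string(mode) == "EXCLUSIVE") ? exclusive : shared;
        if (std::string(op) == "ACQ") holders[lock].insert(node);
        else                          holders[lock].erase(node);
        // Invariant: an exclusive holder excludes every other holder of the lock.
        size_t ex = exclusive[lock].size(), sh = shared[lock].size();
        if (ex > 1 || (ex == 1 && sh > 0))
            std::printf("conflict on lock %s\n", lock);
    }
    return 0;
}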

Problems for large/deployed systems

• Too much manual effort

• Difficult to anticipate what needs to be logged
– Too much information: slows the system down
– Too little information: misses the problem

• Checking a large system is challenging
– A central checker cannot keep up
– Snapshots must be consistent

• Our focus: make runtime checking easier and feasible for deployed, large-scale systems

D3S approach

[Figure: application processes expose their states to D3S checkers; the checkers evaluate the predicate "no conflicting locks" and flag a violation when a conflict is found.]

Our contributions/outline

• A simple language for writing distributed predicates

• Programmers can change what is being checked on-the-fly

• Failure-tolerant consistent snapshots for predicate checking

• Evaluation with five real-world applications

Design goals

• Simplicity: a sequential style for writing predicates
• Parallelism: run in parallel on multiple checkers
• Correctness: check consistent states in spite of failures

• Solution
– MapReduce model
– Failure-tolerant consistent snapshot

Developers write a D3S predicate

V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
V1: V0 { ( conflict: LockID ) } as final
after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

class MyChecker : vertex<V1> {
  virtual void Execute( const V0::Snapshot & snapshot ) {
    …  // invariant logic, written in a sequential style
  }
  static int64 Mapping( const V0::Tuple & t );  // guidance for partitioning
};

Part 1: define the dataflow and types of states, and how states are retrieved

Part 2: define the checking logic and the mapping function for each stage of the predicate

D3S parallel predicate checker

[Figure: lock clients expose states individually, e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S). The checkers reconstruct snapshots SN1, SN2, …, and the exposed states are partitioned across checkers by key (LockID), so (C1, L1, E) and (C5, L1, S) go to one checker and (C2, L3, S) to another.]

States and dataflow

V0: exposer { ( client: ClientID, lock: LockID, mode: LockMode ) }
V1: V0 { ( conflict: LockID ) } as final
after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)

[Figure: dataflow V0 (exposer; triggers in the app; produces a set of (C, L, M) tuples) → V1 (checker; checking function; produces a set of (lock) tuples) → final report.]

Source code for Boxwood client:

class ClientNode {
  ClientID m_NodeID;
  void OnLockAcquired( LockID, LockMode );
  void OnLockReleased( LockID, LockMode );
};

• Insert a hook into the app using binary rewriting at run time
• Triggered at function boundaries to expose app states (see the sketch below)
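To make the hook concrete, here is a conceptual sketch of what the run-time binary rewriting amounts to. The stub types and the d3s_addtuple() call are hypothetical stand-ins, not the real D3S runtime interface.

// Hypothetical illustration of the injected hook; the types and d3s_addtuple()
// are stand-ins for the D3S runtime, not its actual API.
typedef int ClientID;  typedef int LockID;
enum LockMode { SHARED, EXCLUSIVE };

struct ClientNode {
    ClientID m_NodeID;
    void OnLockAcquired(LockID, LockMode) { /* application logic */ }
};

void d3s_addtuple(ClientID, LockID, LockMode) { /* ship the tuple to the V0 exposer */ }

// Hook installed at the function boundary: run the original function, then
// expose the (client, lock, mode) state for checking.
void OnLockAcquired_hooked(ClientNode *self, LockID lock, LockMode mode) {
    self->OnLockAcquired(lock, mode);
    d3s_addtuple(self->m_NodeID, lock, mode);
}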

Checking functions
• Written in C++, reusing app types
• Execute(): runs for each snapshot
• Mapping(): guides the partitioning of snapshots


class MyChecker : vertex<V1> {
  void Execute( const V0::Snapshot & SN ) {
    foreach (V0::Tuple t in SN) {
      if (t.mode == EXCLUSIVE) ex[t.lock]++;
      else sh[t.lock]++;
    }
    foreach (LockID L in ex) {
      if (ex[L] > 1 || (ex[L] == 1 && sh[L] > 0))
        output += V1::Tuple(L);
    }
  }
  int64 Mapping( const V0::Tuple & t ) { return t.lock; }
};

Summary of checking language

• Predicate
– Any property calculated from a finite number of consecutive state snapshots

• Highlights
– Sequential programs (w/ mapping)
– Reuse app types in the script and C++ code
– Support for reducing the overhead (in the paper)
  • Incremental checking
  • Sampling in time or over snapshots

Constructing consistent snapshots

• Use Lamport clocks to totally order states (sketched below)

• Problem: how does the checker know whether it has received all the states needed for a snapshot?

• Solution: detect app node failures and use membership info to construct snapshots
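As a minimal sketch of the timestamping this relies on (the names and structure below are illustrative, not the D3S implementation), each node keeps a Lamport clock, bumps it on local events, merges it on message receipt, and stamps every exposed state with the current value:

// Illustrative Lamport clock; not the actual D3S implementation.
#include <algorithm>
#include <cstdint>

struct LamportClock {
    uint64_t time = 0;
    uint64_t tick()  { return ++time; }          // local event / state exposure
    uint64_t merge(uint64_t received) {          // on message receipt
        time = std::max(time, received) + 1;
        return time;
    }
};

// Every exposed tuple carries the node's current timestamp, so the checker can
// order states from different nodes consistently with causality.
struct ExposedState { /* tuple fields */ uint64_t ts; };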

Constructing consistent snapshots

• Membership: external service or built-in heartbeats
– The snapshot is correct as long as the membership is correct

• When no state is being exposed, an app node reports its timestamp periodically

[Figure: timeline of app nodes A and B and a checker. A exposes { (A, L0, S) } at ts=2, B exposes { (B, L1, E) } at ts=6, A exposes { } at ts=10, a heartbeat arrives at ts=12, and A exposes { (A, L1, E) } at ts=16. With membership M(2)=M(6)=M(10)={A,B}, the checker waits until it has a state or timestamp from every member: SB(2) and SA(6) are initially unknown; once later reports arrive it fills them in from the last exposed states (SA(6)=SA(2), SB(10)=SB(6)) and runs check(6) and check(10). After B's failure is detected, M(16)={A} and check(16) proceeds with A's state only.]
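A rough sketch of the completeness rule the timeline illustrates (the helper types below are assumptions, not the real checker internals): the checker can evaluate the predicate at timestamp t once every node in the membership M(t) has reported a state or heartbeat timestamped at or after t; a node's state at t is then the last state it exposed at or before t.

// Illustrative snapshot-completeness check; NodeID and the containers below
// are assumed helpers, not the actual D3S checker data structures.
#include <cstdint>
#include <map>
#include <set>

typedef int NodeID;

struct CheckerState {
    std::map<NodeID, uint64_t> last_ts;   // latest timestamp heard from each node
                                          // (via exposed states or heartbeats)

    // A snapshot at time t is complete once every live member has reported
    // something at or after t; failed nodes are excluded via membership M(t).
    bool SnapshotComplete(uint64_t t, const std::set<NodeID> &members_at_t) const {
        for (NodeID n : members_at_t) {
            auto it = last_ts.find(n);
            if (it == last_ts.end() || it->second < t) return false;
        }
        return true;
    }
};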

Experimental method

• By debugging 5 real systems, we answer:
– Can D3S help developers find bugs?
– Are predicates simple to write?
– Is the checking overhead acceptable?

• None of the apps were written by us!

Case study: Leader-election

• Predicate
– There is at most one leader in each group of replicas (sketched below)

• Deployment
– 8 machines (1 Gb Ethernet, 2 GHz Intel Xeon CPU, 4 GB memory)
– Test scenario: database app with random I/O (40 MB/s per machine at peak time)
– Randomly crash & restart processes

• Debugging
– 3 checkers, partitioned by replica group
– Time to trigger the violation: several hours
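A rough sketch of how such a predicate could look, in the same style as the lock checker above (the V0 tuple layout and the names below are assumptions for illustration, not the predicate used in the actual case study):

// Illustrative only: assumed exposer V0: ( node: NodeID, group: GroupID, isLeader: bool )
//                    and output V1: ( badGroup: GroupID ) as final.
class LeaderChecker : vertex<V1> {
  void Execute( const V0::Snapshot & SN ) {
    foreach (V0::Tuple t in SN)
      if (t.isLeader) leaders[t.group]++;
    foreach (GroupID g in leaders)
      if (leaders[g] > 1)                  // invariant: at most one leader per group
        output += V1::Tuple(g);
  }
  int64 Mapping( const V0::Tuple & t ) { return t.group; }  // partition by replica group
};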

Root cause of the bug

[Figure: a coordinator, replica nodes, and their failure detectors. After a timeout, two replica nodes each believe they are the leader; the checker catches the violation and reports the nodes involved and the sequence of related states and events.]

• Coordinator crashed and forgot the previous answer
• Must write to disk synchronously!
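A minimal sketch of the kind of fix this implies (POSIX calls; the file name and record layout are made up for the example): persist the coordinator's answer and flush it to disk before replying, so a restarted coordinator can reload it.

// Illustrative only: durably record the answer before replying to the replica.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

bool PersistAnswer(int64_t groupId, int64_t leaderId) {
    int fd = open("coordinator_answers.log", O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0) return false;
    int64_t rec[2] = { groupId, leaderId };
    bool ok = write(fd, rec, sizeof(rec)) == (ssize_t)sizeof(rec)
           && fsync(fd) == 0;               // force the record to disk
    close(fd);
    return ok;                              // only answer the replica when ok
}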

Summary of results

Application | LoC | Predicates | LoP | Results

Data center apps:
• PacificA (structured data storage) | 67,263 | membership consistency; leader election; consistency among replicas | 118 | 3 correctness bugs
• Paxos implementation | 6,993 | consistency in consensus outputs; leader election | 50 | 2 correctness bugs
• Web search engine | 26,036 | unbalanced response time of indexing servers | 81 | 1 performance problem

Wide-area apps:
• Chord (DHT) | 7,640 | aggregate key range coverage; conflicting key holders | 72 | tradeoff between availability & consistency
• BitTorrent client | 36,117 | health of neighbor set; distribution of downloaded pieces; peer contribution rank | 210 | 2 performance bugs; free riders

Performance overhead (stress test of PacificA)

[Figure: time to complete (seconds) vs. number of clients (2, 4, 6, 8, 10), each sending 10,000 requests, with and without D3S; the relative overheads at the five points are 7.21%, 4.38%, 3.94%, 4.20%, and 7.24%.]

• Less than 8%, in most cases less than 4%
• I/O overhead < 0.5%
• Overhead is negligible in the other checked systems

Related work

• Log analysis
– Magpie [OSDI'04], Pip [NSDI'06], X-Trace [NSDI'07]

• Predicate checking at replay time
– WiDS Checker [NSDI'07], Friday [NSDI'07]

• P2-based online monitoring
– P2-monitor [EuroSys'06]

• Model checking
– MaceMC [NSDI'07], CMC [OSDI'04]

Conclusions

• Predicate checking is effective for debugging deployed & large-scale distributed systems

• D3S enables:
– Changing what is monitored on-the-fly
– Checking with multiple checkers
– Specifying predicates in a sequential, centralized manner

Thanks & Q/A

Design goals
• An advanced predicate checker designed for deployment & large scale

• Deployment
– Flexibility: change which states are checked on-the-fly
– Low overhead

• Large scale
– Distributed checking
– Failure tolerance: continue to check correctly when
  • an app node fails
  • a checking machine fails

Case study: PacificA

• A BigTable-like distributed database

• Replica group management
– Perfect failure detection on storage nodes
– Group reconfiguration to handle node failures

• Primary-backup replication
– Two-phase commit for consistent updates
– Data reconciliation when re-joining a node

Case study: PacificA

• A number of invariants stem from the design
– Group consistency
  • a single primary in all replica groups
– Data consistency
  • same data for the same version number
– Reliability
  • when committing, all replicas are already prepared
– Correctness of reconciliation
  • after joining the group, the new node has up-to-date state
– Etc.

• Specify the invariants as predicates, and check them
– Necessary to use multiple checkers

• Result: detected 3 correctness bugs caused by atomicity violation and incorrect failure handling

Bug in RSL (Paxos server in Cosmos)

• RSL
– 1 primary, 4 secondaries
– Two-phase commit
– Leader election / failure detection

[Figure: five RSL nodes A–E. D sends prepare requests to become primary while another node is in the "learning" state; the verifier detects the unstable node status.]

Root cause of the "live-lock":
• The preparing node only re-sends requests to the nodes that have previously responded to it
• A node in "learning" never participates in prepare
• Result: D is stuck in preparing for a long time

Lessons:
• A complete system is error-prone due to optimizations and supporting components
• Bugs are not always visible from the outside
• Always-on checking catches "hidden" bugs

Chord overlay

Perfect Ring:
• No overlap, no hole
• Aggregated key coverage is 100% (coverage computation sketched below)
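A minimal sketch of how an aggregate key-coverage predicate might be computed from exposed (node, key-range) states (the half-open range representation and the 2^32 key-space size are assumptions for illustration, not the actual Chord checker):

// Illustrative aggregate key coverage over exposed ranges on a ring of size
// RING (assumed 2^32); summing the claimed ranges makes overlap show up as
// >100% and holes as <100%, matching the coverage-ratio plot below.
#include <cstdint>
#include <utility>
#include <vector>

static const uint64_t RING = 1ULL << 32;

// Each exposed state claims the half-open key range [start, end) on the ring.
double CoverageRatio(const std::vector<std::pair<uint64_t, uint64_t>> &ranges) {
    uint64_t covered = 0;
    for (const auto &r : ranges)
        covered += (r.second >= r.first) ? (r.second - r.first)
                                         : (RING - r.first + r.second);  // wrap-around
    return double(covered) / double(RING);
}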

[Figure: key range coverage ratio (0%–200%) vs. time (0–80,000 seconds), comparing 3 predecessors and 8 predecessors.]

Consistency vs. availability: cannot get both
• Global measurement of the contributing factors
• See the tradeoff quantitatively for performance tuning
• Capable of checking detailed key coverage

[Figure: number of Chord nodes holding each key (0–4) vs. key serial (0–256), comparing 3 predecessors and 8 predecessors.]