TRANSCRIPT
Gray Failure: The Achilles’ Heel of Cloud-Scale Systems
Ryan Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch,
Yingnong Dang, Murali Chintalapati, Randolph Yao
Outline
Background & the gray failure problem
Real-world gray failure cases in Azure
A model and a definition for gray failure: differential observability
Potential future directions
Rapid Growth of Cloud System Infra
» Software user shift
direct: e.g., Office 365, Google Drive
indirect: e.g., Netflix on AWS
» Workload diversity
website, workflow, big data, machine learning
» Internal composition
more data centers, larger cluster, special h/w
containerization, micro-services
Demanding Requirements on Availability
» Users intolerant of service downtime
» Failure more costly
SLA violation, reputation hit, customer loss, engineering resource waste
» New availability bar
3 nines to 5 or 6 nines
Key: Embrace Fault-tolerance!
Building blocks & steps: redundancy, decomposition, correctness
Rich history since the 1950s: process pair, RAID, primary/backup, state machine replication, transaction, checkpoint, chain replication, virtual synchrony, Paxos, Zab, PBFT, Gossip, Zyzzyva, N-version programming, 2PC, triple modular redundancy, erasure coding, …
Status Quo
» By and large, the efforts paid off
many faults successfully detected, tolerated, and repaired every day
few global outages
99.9% is achievable
» But moving forward…
simplistic assumptions start to break
reasoning about availability becomes hard
frequent bizarre phenomena in production
99.999% and beyond is challenging
[Figure: availability over time as scale & complexity grow; the trend is uncertain]
A common theme: the overlooked gray failure problem
The Elephant in the Cloud - Gray Failure
Fail-stop: a component either works correctly or stops.
Byzantine: a component may behave arbitrarily.
Gray failure: a component appears to be still working but is in fact experiencing severe issues.
symptom
• subtle and ambiguous: e.g., switch random packet loss, non-fatal exceptions, memory thrashing, flaky disk I/O, overload…
occurrence
• across the s/w and h/w stack in the infra due to various defects
• behind most service incidents we’ve seen in Azure
danger
• fault tolerance becomes ineffective or counterproductive
• faults take engineers & designers huge effort to nail down
• teams play the blame game with each other
Real-world Gray Failure Cases in Azure
Case I: Redundancy in Datacenter Network (1)
[Diagram: Core / Aggregation / ToR topology; hosts A and B; requests r1, r2; one core switch crashes]
Increasing the number of core switches helps with availability.
Case I: Redundancy in Datacenter Network (2)
[Diagram: same topology; one core switch randomly drops packets with probability p]
Workload: single round trip
• packets will not be re-routed → application glitches or increased latency
• increasing # of core switches may not affect the chance of being affected
Case I: Redundancy in Datacenter Network (3)
[Diagram: topology with hosts A–E; requests r1–r4; one core switch randomly drops packets with probability p]
Workload: send multiple requests, wait for all to finish (e.g., search)
• high chance to involve every core switch for each front-end request
• gray failure at any core switch will cause delay
• more core switches → worse tail latencies: 1 − (1 − p)^n
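The tail-latency effect above can be checked with a short calculation. This is an illustrative sketch, not from the talk: it assumes each front-end request fans out across all n core switches, any one of which drops a packet with probability p, so the request is affected with probability 1 − (1 − p)^n.

```python
# Probability that a fan-out request touches the gray-failing path.
# Assumption (illustrative): each request traverses all n core switches,
# and a request is "affected" if any traversed switch drops its packet.

def p_affected(p: float, n: int) -> float:
    """P(request affected) = 1 - (1 - p)^n for per-switch drop rate p."""
    return 1.0 - (1.0 - p) ** n

# With a 1% per-switch drop rate, adding core switches makes things worse:
for n in (1, 4, 16, 64):
    print(n, round(p_affected(0.01, n), 4))
```

The trend is the slide's point: under this workload, more redundancy means a higher chance that some request hits the gray-failing switch.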
Case II: Failure Detector in Compute Service
[Diagram: Fabric Controller (primary + replica) monitors a Host Agent on the Host OS; Guest Agents inside VM0–VM3 monitor role instances; hypervisor, vSwitch, and network underneath]
Hierarchical agents to catch failures in different layers
Fail-stop case: VM3 crashes, the report "VM3 dead" propagates up, and the Fabric Controller reboots it.
Case II: Failure Detector in Compute Service
[Diagram: same hierarchy of agents]
Hierarchical agents to catch failures in different layers
Gray failure case: a connectivity issue strikes the host networking stack, but the heartbeat path still reports "VM3 good", so the Fabric Controller takes no action; yet users can't SSH/RDP into the VM.
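A minimal sketch (illustrative, not Azure's actual detector code) of why a liveness-only check misses this case: the heartbeat channel and the user-facing SSH/RDP path are different observation channels, so the detector's verdict can diverge from what users experience. All names below are hypothetical.

```python
# Illustrative sketch: a heartbeat-only failure detector vs. user experience.
# NodeState, heartbeat_verdict, user_experience are all hypothetical names.

from dataclasses import dataclass

@dataclass
class NodeState:
    agent_alive: bool        # guest agent process is running
    network_path_ok: bool    # user-facing connectivity (SSH/RDP path)

def heartbeat_verdict(node: NodeState) -> str:
    # The detector only sees the heartbeat channel.
    return "VM good" if node.agent_alive else "VM dead"

def user_experience(node: NodeState) -> str:
    # Users need the network path, not the heartbeat channel.
    return "reachable" if node.network_path_ok else "can't SSH/RDP"

# Gray failure: agent alive, network path broken.
gray = NodeState(agent_alive=True, network_path_ok=False)
print(heartbeat_verdict(gray), "/", user_experience(gray))
```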
Case III: Recovery in Storage Service
[Diagram: Stream Manager coordinating Extent Nodes EN1–EN5 behind Front Ends]
• EN3 runs low on free blocks, but the Stream Manager's health check still reports "EN1, EN2, EN3 healthy"
• a Front End writes to EN3; EN3 crashes; the Stream Manager marks "EN3 is down" and restarts it
• the cycle repeats: low free blocks, a write, a crash, a restart, …
• eventually the Stream Manager decides "EN3 is broken" and removes it, triggering re-replication and fragmentation
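The degraded-node loop above can be sketched as a toy simulation (illustrative; all names are hypothetical). The point it shows: a health check that only asks "is the node up?" lets a degraded node oscillate between "healthy" and "down" instead of being flagged.

```python
# Toy simulation of the Case III loop: a liveness-only health check
# never sees the underlying degradation (low free blocks).
# All names are made up for the sketch.

def run_loop(rounds: int) -> list:
    verdicts = []
    node_up = True
    for _ in range(rounds):
        # Stream-Manager-style check: liveness only.
        verdicts.append("healthy" if node_up else "down")
        if node_up:
            node_up = False   # a write hits the degraded node; it crashes
        else:
            node_up = True    # the manager restarts it; degradation remains
    return verdicts

print(run_loop(6))  # oscillates instead of flagging degradation
```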
Understanding Gray Failure
The Many Faces of Gray Failure
So, what is a gray failure?
A performance issue.
A problem that some think is a failure but some think is not, e.g., a 2% packet loss. The ambiguity itself defines gray failure. If everyone agrees it is a problem, it is not a gray failure.
A Heisenbug: sometimes it occurs and sometimes it does not.
The system is failing slowly, e.g., a memory leak.
An increasing number of transient errors in the system, which reduces system capacity even if the system still manages to continue working.
What is a formal way to define and study gray failure, one that potentially sheds light on how to address it?
An Abstract Model
[Diagram: a System containing a System Core; apps App1 (web app), App2 (analytics), App3 (system2), …, Appn (user/operator) interact with it; an Observer probes the system and reports to a Reactor]
Example systems: distributed storage system, IaaS platform, data center network, search engine, …
Note: these are logical entities
Gray Failure Trait: Differential Observability
[Diagram: each app forms its own observation of the system; the observer forms another via probes and reports]
Different entities come to different conclusions about whether a system is working or not. Four combinations:
❶ all apps deem the system good, observer deems it good: healthy, or with a minor latent fault
❷ some app deems the system bad, observer deems it good: Gray Failure!
❸ all apps deem the system good, observer deems it bad: fault tolerance at play, or a false positive
❹ app deems the system bad, observer deems it bad: crash, fail-stop
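The four-quadrant classification above is mechanical enough to state in code. A sketch with illustrative names (not from the talk):

```python
# Classify the (apps' view, observer's view) combinations from the
# differential-observability quadrant. Names are illustrative.

def classify(apps_deem_good: bool, observer_deems_good: bool) -> str:
    if apps_deem_good and observer_deems_good:
        return "healthy or minor latent fault"               # quadrant 1
    if not apps_deem_good and observer_deems_good:
        return "gray failure"                                # quadrant 2
    if apps_deem_good and not observer_deems_good:
        return "fault tolerance at play, or false positive"  # quadrant 3
    return "crash / fail-stop"                               # quadrant 4

# The defining case: an app suffers while the observer sees nothing wrong.
print(classify(apps_deem_good=False, observer_deems_good=True))
```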
Applying the Model
Case I
system: data center network
observer: switch peers
reactor: routing protocols
app1: simple web server
app2: search engine
gray failure: app2 observed glitches but neighbors of the bad switch (and app1) didn’t
Applying the Model
Case III
system: Azure storage service
observer: storage master
reactor: storage master
app1..n: Azure VMs
gray failure: some VMs hit remote I/O exceptions while the storage master deems ENs healthy
Timeline:
- EN3 degraded: observation difference
- EN3 crashed & rebooted: no observation difference
- EN3 degraded: observation difference
- EN3 crashed & removed: no observation difference
- more ENs affected: more observation difference
- …
- observation difference completely gone; too late
Direction 1: Close Observation Gap
Doctors can’t use heartbeat as the only signal of a patient’s health
Traditional failure detector → multi-dimensional health monitor
[Figure: heartbeat-based hierarchical failure detector vs. in-VM performance counters before and during the gray failure incident (case II)]
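A minimal sketch of the direction, not any production monitor: combine several health dimensions instead of a single heartbeat, and flag a node whose heartbeat is fine but whose other signals degrade. The signal names and thresholds are illustrative assumptions.

```python
# Illustrative multi-dimensional health monitor: heartbeat alone says
# "healthy"; adding more dimensions exposes the gray failure.
# Signal names and thresholds are made up for the sketch.

THRESHOLDS = {
    "heartbeat_ok": lambda v: v,            # must be True
    "disk_io_ms": lambda v: v < 100,        # flaky disk shows up here
    "packet_loss": lambda v: v < 0.01,      # random drops show up here
    "free_memory_mb": lambda v: v > 256,    # thrashing shows up here
}

def health_verdict(signals: dict) -> str:
    failing = [name for name, ok in THRESHOLDS.items() if not ok(signals[name])]
    if not failing:
        return "healthy"
    if signals["heartbeat_ok"]:
        return "gray: " + ", ".join(failing)  # alive, but degraded
    return "dead"

# Heartbeat fine, disk badly degraded: a heartbeat-only detector sees nothing.
node = {"heartbeat_ok": True, "disk_io_ms": 900,
        "packet_loss": 0.0, "free_memory_mb": 4096}
print(health_verdict(node))
```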
Direction 2: Approximate Application View
• Infeasible to completely eliminate differential observability due to multi-tenancy and modularity constraints
• System sends probes to emulate common application usage
[Figure: PingMesh [SIGCOMM ’15] topology: Spine, Leaf, ToR, and servers grouped into pods and podsets; a failure in the Spine layer]
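A toy version of the probing idea (a sketch under assumptions, not the PingMesh implementation): every server probes every other server, and per-pair results are aggregated so a silently lossy path stands out. The topology and loss numbers are fabricated.

```python
# Toy all-pairs probe mesh in the spirit of PingMesh: aggregate per-pair
# probe outcomes to surface paths that drop traffic without failing outright.
# Servers and loss data are fabricated for illustration.

from itertools import permutations

SERVERS = ["s1", "s2", "s3", "s4"]
# Hypothetical per-pair probe success counts out of 100 probes each.
PROBE_OK = {pair: 100 for pair in permutations(SERVERS, 2)}
# A gray path: s1 -> s3 silently drops ~20% of probes.
PROBE_OK[("s1", "s3")] = 80

def lossy_pairs(threshold: float = 0.95) -> list:
    """Pairs whose probe success rate falls below the threshold."""
    return [pair for pair, ok in PROBE_OK.items() if ok / 100 < threshold]

print(lossy_pairs())
```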
Direction 3: Leverage Power of Scale
Break the observation silos to complement each other
• Observable vs. detectable: although an end-to-end probe may expose differential observability, it may not detect who is responsible for the difference. Example solution: infer signals from many network devices to localize the fault.
• Blame game: A thinks B is problematic, B thinks A is problematic. Example solution: correlate VM failure events with topology info to judge.
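One way to act on "many weak observations localize a fault" is simple voting over shared path components. This is an illustrative sketch with fabricated paths; real systems use more careful statistical inference.

```python
# Illustrative fault localization by voting: each failed end-to-end probe
# blames every component on its path; the component accumulating the most
# blame across many vantage points is the likely culprit.
# Paths and failures are fabricated for the sketch.

from collections import Counter

# Hypothetical routed paths of failed probes (lists of traversed switches).
FAILED_PROBE_PATHS = [
    ["tor1", "agg1", "core2"],
    ["tor3", "agg2", "core2"],
    ["tor4", "agg2", "core2"],
    ["tor1", "agg1", "core1"],   # one unrelated failure
]

def localize(paths: list) -> str:
    votes = Counter()
    for path in paths:
        votes.update(path)      # every component on a failed path gets a vote
    return votes.most_common(1)[0][0]

print(localize(FAILED_PROBE_PATHS))  # the switch most failed paths share
```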
Direction 4: Harness Temporal Patterns
Time series and trend analysis of key health signals
[Figure: observability of a fault over time, annotated with stages ❶–❹; the storage side observed the availability issue]
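A trend check over a health signal can be sketched with a rolling slope (illustrative; the signal and thresholds are assumptions): a slowly leaking resource shows a steady downward trend well before any hard failure.

```python
# Illustrative temporal-pattern check: fit a least-squares slope to the
# recent window of a health signal and alert on a sustained downward trend.
# The signal (free memory) and the limit are made up for the sketch.

def slope(values: list) -> float:
    """Least-squares slope of values over indices 0..n-1."""
    n = len(values)
    mx = (n - 1) / 2
    my = sum(values) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(values))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def trending_down(signal: list, window: int = 8, limit: float = -1.0) -> bool:
    return slope(signal[-window:]) < limit

# A slow leak: free memory drifts down long before the process dies.
free_mb = [4096 - 12 * t for t in range(20)]
print(trending_down(free_mb))
```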
Conclusion
•Cloud systems are adept at handling crash and fail-stop failures
» decades of effort and research advances have paid off
•Gray failure is a major challenge moving forward
» behind many service incidents
» an acute pain for system designers and engineers
» from fault tolerance 1.0 to fault tolerance 2.0
•A first attempt to define and explore this problem domain
» differential observability is a fundamental trait
» addressing this trait is key to tackling gray failure
» potential future directions with open challenges
Discussion
•Why does differential observability occur?
» simplistic models and assumptions about component behavior
» modularity principle, unexpected dependencies, improper error handling
» focus on narrow point availability while dismissing app availability
•Should (can) academia work on this problem?
» practitioners have been troubled by these issues for quite a while, relying on intuition, workarounds, resources, and process
» hungry for a principled approach to understand and tackle the problem
» many issues exist in the open-source distributed system stack as well