TRANSCRIPT
Gray Failure: The Achilles’ Heel of Cloud-Scale Systems
Ryan Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch,
Yingnong Dang, Murali Chintalapati, Randolph Yao
Outline
Background & the gray failure problem
Real-world gray failure cases in Azure
A model and a definition for gray failure: differential observability
Potential future directions
Rapid Growth of Cloud System Infra
» Software user shift
direct: e.g., Office 365, Google Drive
indirect: e.g., Netflix on AWS
» Workload diversity
website, workflow, big data, machine learning
» Internal composition
more data centers, larger cluster, special h/w
containerization, micro-services
Demanding Requirements on Availability
» Users intolerant of service downtime
» Failure more costly
SLA violation, reputation hit, customer loss, engineering resource waste
» New availability bar
3 nines to 5 or 6 nines
Key: Embrace Fault-tolerance!
Building blocks & steps: redundancy, decomposition, correctness
Rich history since the 1950s: process pair, RAID, primary/backup, state machine replication, transaction, checkpoint, chain replication, virtual synchrony, Paxos, Zab, PBFT, Gossip, Zyzzyva, N-version programming, 2PC, triple modular redundancy, erasure coding, …
Status Quo
» By and large, the efforts paid off
many faults successfully detected, tolerated, and repaired every day
few global outages
99.9% is achievable
» But moving forward…
simplistic assumptions start to break
reasoning about availability becomes hard
frequent bizarre phenomena in production
99.999% and beyond is challenging
[Figure: availability over time as scale & complexity grow; the trend is uncertain]
A common theme: the overlooked gray failure problem
The Elephant in the Cloud - Gray Failure
Fail-stop: a component either works correctly or stops.
Byzantine: a component may behave arbitrarily.
Gray failure: a component appears to be still working but is in fact experiencing severe issues.
symptom
• subtle and ambiguous: e.g., switch random packet loss, non-fatal exceptions, memory thrashing, flaky disk I/O, overload…
occurrence
• across the s/w and h/w stack in the infra due to various defects
• behind most service incidents we’ve seen in Azure
danger
• fault tolerance becomes ineffective or counterproductive
• faults take engineers & designers huge effort to nail down
• teams play the blame game with each other
Real-world Gray Failure Cases in Azure
Case I: Redundancy in Datacenter Network (1)
[Diagram: Core / Aggregation / ToR topology; hosts A and B; requests r1, r2; one core switch crashes]
Increasing the number of core switches helps with availability.
Case I: Redundancy in Datacenter Network (2)
[Diagram: same topology; one core switch randomly drops packets with probability p]
Workload: single round trip
• packets will not be re-routed → application glitches or increased latency
• increasing # of core switches may not affect the chance of being affected
Case I: Redundancy in Datacenter Network (3)
[Diagram: topology with hosts A–E; requests r1–r4; one core switch randomly drops packets with probability p]
Workload: send multiple requests, wait for all to finish (e.g., search)
• high chance to involve every core switch for each front-end request
• gray failure at any core switch will cause delay
• more core switches → worse tail latencies: 1 − (1 − p)^n
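The tail-latency effect above can be checked with a short calculation. This is an illustrative sketch, not from the talk: it assumes each front-end request fans out across all n core switches, any one of which drops a packet with probability p, so the request is affected with probability 1 − (1 − p)^n.

```python
# Probability that a fan-out request touches the gray-failing path.
# Assumption (illustrative): each request traverses all n core switches,
# and a request is "affected" if any traversed switch drops its packet.

def p_affected(p: float, n: int) -> float:
    """P(request affected) = 1 - (1 - p)^n for per-switch drop rate p."""
    return 1.0 - (1.0 - p) ** n

# With a 1% per-switch drop rate, adding core switches makes things worse:
for n in (1, 4, 16, 64):
    print(n, round(p_affected(0.01, n), 4))
```

The trend is the slide's point: under this workload, more redundancy means a higher chance that some request hits the gray-failing switch.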
Case II: Failure Detector in Compute Service
[Diagram: Fabric Controller (primary + replica) monitors a Host Agent on the Host OS; Guest Agents inside VM0–VM3 monitor role instances; hypervisor, vSwitch, and network underneath]
Hierarchical agents to catch failures in different layers
Fail-stop case: VM3 crashes, the report "VM3 dead" propagates up, and the Fabric Controller reboots it.
Case II: Failure Detector in Compute Service
[Diagram: same hierarchy of agents]
Hierarchical agents to catch failures in different layers
Gray failure case: a connectivity issue strikes the host networking stack, but the heartbeat path still reports "VM3 good", so the Fabric Controller takes no action; yet users can't SSH/RDP into the VM.
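A minimal sketch (illustrative, not Azure's actual detector code) of why a liveness-only check misses this case: the heartbeat channel and the user-facing SSH/RDP path are different observation channels, so the detector's verdict can diverge from what users experience. All names below are hypothetical.

```python
# Illustrative sketch: a heartbeat-only failure detector vs. user experience.
# NodeState, heartbeat_verdict, user_experience are all hypothetical names.

from dataclasses import dataclass

@dataclass
class NodeState:
    agent_alive: bool        # guest agent process is running
    network_path_ok: bool    # user-facing connectivity (SSH/RDP path)

def heartbeat_verdict(node: NodeState) -> str:
    # The detector only sees the heartbeat channel.
    return "VM good" if node.agent_alive else "VM dead"

def user_experience(node: NodeState) -> str:
    # Users need the network path, not the heartbeat channel.
    return "reachable" if node.network_path_ok else "can't SSH/RDP"

# Gray failure: agent alive, network path broken.
gray = NodeState(agent_alive=True, network_path_ok=False)
print(heartbeat_verdict(gray), "/", user_experience(gray))
```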
Case III: Recovery in Storage Service
[Diagram: Stream Manager coordinating Extent Nodes EN1–EN5 behind Front Ends]
• EN3 runs low on free blocks, but the Stream Manager's health check still reports "EN1, EN2, EN3 healthy"
• a Front End writes to EN3; EN3 crashes; the Stream Manager marks "EN3 is down" and restarts it
• the cycle repeats: low free blocks, a write, a crash, a restart, …
• eventually the Stream Manager decides "EN3 is broken" and removes it, triggering re-replication and fragmentation
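The degraded-node loop above can be sketched as a toy simulation (illustrative; all names are hypothetical). The point it shows: a health check that only asks "is the node up?" lets a degraded node oscillate between "healthy" and "down" instead of being flagged.

```python
# Toy simulation of the Case III loop: a liveness-only health check
# never sees the underlying degradation (low free blocks).
# All names are made up for the sketch.

def run_loop(rounds: int) -> list:
    verdicts = []
    node_up = True
    for _ in range(rounds):
        # Stream-Manager-style check: liveness only.
        verdicts.append("healthy" if node_up else "down")
        if node_up:
            node_up = False   # a write hits the degraded node; it crashes
        else:
            node_up = True    # the manager restarts it; degradation remains
    return verdicts

print(run_loop(6))  # oscillates instead of flagging degradation
```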
Understanding Gray Failure
The Many Faces of Gray Failure
So, what is a gray failure?
A performance issue.
A problem that some think is a failure but some think is not, e.g., a 2% packet loss. The ambiguity itself defines gray failure. If everyone agrees it is a problem, it is not a gray failure.
A Heisenbug: sometimes it occurs and sometimes it does not.
The system is failing slowly, e.g., a memory leak.
An increasing number of transient errors in the system, which reduces system capacity even if the system still manages to continue working.
What is a formal way to define and study gray failure, one that potentially sheds light on how to address it?
An Abstract Model
[Diagram: a System containing a System Core; apps App1 (web app), App2 (analytics), App3 (system2), …, Appn (user/operator) interact with it; an Observer probes the system and reports to a Reactor]
Example systems: distributed storage system, IaaS platform, data center network, search engine, …
Note: these are logical entities
Gray Failure Trait: Differential Observability
[Diagram: each app forms its own observation of the system; the observer forms another via probes and reports]
Different entities come to different conclusions about whether a system is working or not. Four combinations:
❶ all apps deem the system good, observer deems it good: healthy, or with a minor latent fault
❷ some app deems the system bad, observer deems it good: Gray Failure!
❸ all apps deem the system good, observer deems it bad: fault tolerance at play, or a false positive
❹ app deems the system bad, observer deems it bad: crash, fail-stop
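The four-quadrant classification above is mechanical enough to state in code. A sketch with illustrative names (not from the talk):

```python
# Classify the (apps' view, observer's view) combinations from the
# differential-observability quadrant. Names are illustrative.

def classify(apps_deem_good: bool, observer_deems_good: bool) -> str:
    if apps_deem_good and observer_deems_good:
        return "healthy or minor latent fault"               # quadrant 1
    if not apps_deem_good and observer_deems_good:
        return "gray failure"                                # quadrant 2
    if apps_deem_good and not observer_deems_good:
        return "fault tolerance at play, or false positive"  # quadrant 3
    return "crash / fail-stop"                               # quadrant 4

# The defining case: an app suffers while the observer sees nothing wrong.
print(classify(apps_deem_good=False, observer_deems_good=True))
```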
Applying the Model
Case I
system: data center network
observer: switch peers
reactor: routing protocols
app1: simple web server
app2: search engine
gray failure: app2 observed glitches but neighbors of the bad switch (and app1) didn’t
Applying the Model
Case III
system: Azure storage service
observer: storage master
reactor: storage master
app1..n: Azure VMs
gray failure: some VMs hit remote I/O exceptions while the storage master deems ENs healthy
Timeline:
- EN3 degraded: observation difference
- EN3 crashed & rebooted: no observation difference
- EN3 degraded: observation difference
- EN3 crashed & removed: no observation difference
- more ENs affected: more observation difference
- …
- observation difference completely gone; too late
Direction 1: Close Observation Gap
Doctors can’t use heartbeat as the only signal of a patient’s health
Traditional failure detector → multi-dimensional health monitor
[Figure: heartbeat-based hierarchical failure detector vs. in-VM performance counters before and during the gray failure incident (case II)]
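A minimal sketch of the direction, not any production monitor: combine several health dimensions instead of a single heartbeat, and flag a node whose heartbeat is fine but whose other signals degrade. The signal names and thresholds are illustrative assumptions.

```python
# Illustrative multi-dimensional health monitor: heartbeat alone says
# "healthy"; adding more dimensions exposes the gray failure.
# Signal names and thresholds are made up for the sketch.

THRESHOLDS = {
    "heartbeat_ok": lambda v: v,            # must be True
    "disk_io_ms": lambda v: v < 100,        # flaky disk shows up here
    "packet_loss": lambda v: v < 0.01,      # random drops show up here
    "free_memory_mb": lambda v: v > 256,    # thrashing shows up here
}

def health_verdict(signals: dict) -> str:
    failing = [name for name, ok in THRESHOLDS.items() if not ok(signals[name])]
    if not failing:
        return "healthy"
    if signals["heartbeat_ok"]:
        return "gray: " + ", ".join(failing)  # alive, but degraded
    return "dead"

# Heartbeat fine, disk badly degraded: a heartbeat-only detector sees nothing.
node = {"heartbeat_ok": True, "disk_io_ms": 900,
        "packet_loss": 0.0, "free_memory_mb": 4096}
print(health_verdict(node))
```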
Direction 2: Approximate Application View
• Infeasible to completely eliminate differential observability due to multi-tenancy and modularity constraints
• System sends probes to emulate common application usage
[Figure: PingMesh [SIGCOMM ’15] topology: Spine, Leaf, ToR, and servers grouped into pods and podsets; a failure in the Spine layer]
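A toy version of the probing idea (a sketch under assumptions, not the PingMesh implementation): every server probes every other server, and per-pair results are aggregated so a silently lossy path stands out. The topology and loss numbers are fabricated.

```python
# Toy all-pairs probe mesh in the spirit of PingMesh: aggregate per-pair
# probe outcomes to surface paths that drop traffic without failing outright.
# Servers and loss data are fabricated for illustration.

from itertools import permutations

SERVERS = ["s1", "s2", "s3", "s4"]
# Hypothetical per-pair probe success counts out of 100 probes each.
PROBE_OK = {pair: 100 for pair in permutations(SERVERS, 2)}
# A gray path: s1 -> s3 silently drops ~20% of probes.
PROBE_OK[("s1", "s3")] = 80

def lossy_pairs(threshold: float = 0.95) -> list:
    """Pairs whose probe success rate falls below the threshold."""
    return [pair for pair, ok in PROBE_OK.items() if ok / 100 < threshold]

print(lossy_pairs())
```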
Direction 3: Leverage Power of Scale
Break the observation silos to complement each other
• Observable vs. detectable: although an end-to-end probe may expose differential observability, it may not detect who is responsible for the difference. Example solution: infer signals from many network devices to localize the fault.
• Blame game: A thinks B is problematic, B thinks A is problematic. Example solution: correlate VM failure events with topology info to judge.
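One way to act on "many weak observations localize a fault" is simple voting over shared path components. This is an illustrative sketch with fabricated paths; real systems use more careful statistical inference.

```python
# Illustrative fault localization by voting: each failed end-to-end probe
# blames every component on its path; the component accumulating the most
# blame across many vantage points is the likely culprit.
# Paths and failures are fabricated for the sketch.

from collections import Counter

# Hypothetical routed paths of failed probes (lists of traversed switches).
FAILED_PROBE_PATHS = [
    ["tor1", "agg1", "core2"],
    ["tor3", "agg2", "core2"],
    ["tor4", "agg2", "core2"],
    ["tor1", "agg1", "core1"],   # one unrelated failure
]

def localize(paths: list) -> str:
    votes = Counter()
    for path in paths:
        votes.update(path)      # every component on a failed path gets a vote
    return votes.most_common(1)[0][0]

print(localize(FAILED_PROBE_PATHS))  # the switch most failed paths share
```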
Direction 4: Harness Temporal Patterns
Time series and trend analysis of key health signals
[Figure: observability of a fault over time, annotated with stages ❶–❹; the storage side observed the availability issue]
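A trend check over a health signal can be sketched with a rolling slope (illustrative; the signal and thresholds are assumptions): a slowly leaking resource shows a steady downward trend well before any hard failure.

```python
# Illustrative temporal-pattern check: fit a least-squares slope to the
# recent window of a health signal and alert on a sustained downward trend.
# The signal (free memory) and the limit are made up for the sketch.

def slope(values: list) -> float:
    """Least-squares slope of values over indices 0..n-1."""
    n = len(values)
    mx = (n - 1) / 2
    my = sum(values) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(values))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

def trending_down(signal: list, window: int = 8, limit: float = -1.0) -> bool:
    return slope(signal[-window:]) < limit

# A slow leak: free memory drifts down long before the process dies.
free_mb = [4096 - 12 * t for t in range(20)]
print(trending_down(free_mb))
```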
Conclusion
•Cloud systems are adept at handling crash and fail-stop failures
» decades of effort and research advances have paid off
•Gray failure is a major challenge moving forward
» behind many service incidents
» an acute pain for system designers and engineers
» from fault tolerance 1.0 to fault tolerance 2.0
•A first attempt to define and explore this problem domain
» differential observability is a fundamental trait
» addressing this trait is key to tackling gray failure
» potential future directions with open challenges
Discussion
•Why does differential observability occur?
» simplistic models and assumptions about component behavior
» modularity principle, unexpected dependencies, improper error handling
» focus on narrow point availability while dismissing app availability
•Should (can) academia work on this problem?
» practitioners have been troubled by these issues for quite a while, relying on intuition, workarounds, resources, and process
» hungry for a principled approach to understand and tackle the problem
» many issues exist in the open-source distributed system stack as well