netpilot: automating datacenter network failure mitigation xin wu, daniel turner, chao-chih chen,...

36
NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang Presented by: Chen Li

Upload: christian-booker

Post on 30-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

NetPilot: Automating Datacenter Network Failure Mitigation

Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

Presented by: Chen Li

Page 2: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

2

Failures are Common and Harmful

• Network failures are common

10,000+ switches

Page 3: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

3

Failures are Common and Harmful

• Network failures are common

• Failures cause long down times

Time from detection to repair (minutes)

Six-month failure logs of production datacenters

25% of failures take 13+ hours to repair

Page 4: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

4

Failures are Common and Harmful

• Failures are common due to VERY large datacenters

• Failures cause long down times

• Long failure duration large revenue loss

Page 5: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

How to Shorten Failure Recovery Time?

Page 6: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

6

Previous Work

• Conventional failure recovery takes 3 steps

• Failure localization/diagnosis– [M. K. Aguilera, SOSP’03]– [M. Y. Chen, NSDI’04]– [R.R Kompella, NSDI ’05]– [P.Bahl, SIGCOMM’07]– [S. Kandula, SIGCOMM’09]…

Detection Diagnosis Repair

passiveping

active

Page 7: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

7

Automating Failure Diagnosis is Challenging

• Root causes are deep in network stack

• Diagnosis involves multiple parties

Page 8: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

8

Category Failure types Diagnosis & Repair

%

Software 21% Link layer loop Find and fix bugs

19%Imbalance overload 2%

Hardware 18% FCS error Replace cable 13%Unstable power Repair power 5%

Unknown 23% Switch stops forwarding N/A 9%Imbalance overload 7%Lost configuration 5%High CPU utilization 2%

Configuration 38%

Errors on multiple switches

Update configuration

32%

Errors on one switch 6%

• Six-month failure logs from several production DCNs

1. Root causes are deep in the network stack

2. Diagnosis involves multiple partiesFailure Diagnosis Requires

Human Intervention !

Page 9: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

Can we do something other than failure diagnosis?

Page 10: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

10

NetPilot: Mitigating rather than Diagnosing Failures

• Mitigate failure symptoms ASAP, at the cost of reduced capacity

Detection Diagnosis RepairAutomated Mitigation

Page 11: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

11

NetPilot Benefits

• Short recovery time• Small network disruption• Low operation cost

Automated Mitigation

Detection Diagnosis Repair

Page 12: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

12

Failure Mitigation is Effective

• Most failures can be mitigated by simple actions

• Mitigation is feasible due to redundancy

Page 13: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

13

Category Failure types Mitigation Repair %Software 21%

Link layer loop Deactivate port Find and fix bugs

19%

Imbalance-triggered overload

Restart switch2%

Hardware 18%

FCS error Deactivate port Replace cable 13%

Unstable power Deactivate switch Repair power 5%Unknown 23%

Switch stops forwarding

Restart switch N/A 9%

Imbalance-triggered overload

Restart switch 7%

Lost configuration Restart switch 5%High CPU utilization

Restart switch 2%

Configuration 38%

Errors on multiple switches

n/a Update configuration

32%

Errors on single switch

Deactivate switch 6%

68% of failures can be mitigated by simple actions

Page 14: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

14

Mitigation Made Possible by Redundancy

• Redundancy deactivation unlikely to partition / overload the network

ToR

AGG

CORE

Internet

Page 15: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

15

Outline

• Automating failure diagnosis is challenging

• Failure mitigation is effective

• How to automate mitigation?

• NetPilot evaluations

• Conclusion

Page 16: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

16

A Strawman NetPilot: Trial-and-error

Network failure

Roll back if necessary

No Failure mitigated? End

Yes

Execute an action

Localization

Page 17: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

17

NetPilot: Challenges & Solutions

1. Blind trial-and-error takes a long time

Network failure

Roll back if necessary

NoFailure

mitigated? EndYes

Execute an action

LocalizationLocalization

Failure specific localization

Page 18: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

18

NetPilot: Challenges & SolutionsNetwork failure

Roll back if necessary

NoFailure

mitigated? EndYes

Execute an action

Localization

Estimate impact

Localization

2. Partition/overload network

Impact estimation

Page 19: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

19

NetPilot: Challenges & SolutionsNetwork failure

Roll back if necessary

NoFailure

mitigated? EndYes

Execute an action

Localization

Estimate impact

Rank actions

Localization

3. Different actions have different side-effects

Rank actions based on impact

Page 20: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

20

Failure Specific Localization• Limited # of failure types• Domain knowledge improves accuracy

Failure types1. Link layer loop2. Imbalance-triggered overload3. FCS error4. Unstable power5. Switch stops forwarding6. Imbalance-triggered overload

7. Lost configuration8. High CPU utilization9. Errors on multiple switches

10. Errors on single switch

Page 21: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

21

Example: Frame Check Sequence (FCS) Errors

• 13% of all the failures• Cut-through switching– Forward frames before checksums are verified

• Increase application latency

Page 22: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

22

Localizing FCS Errorserror frames seen on L frames corrupted by L

frames corrupted by other links & traverse L

• xL: link corruption rate

• # of variables = # of equations = # of links

• Corrupted links: xL> 0

Page 23: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

23

NetPilot OverviewNetwork failure

Roll back if necessary

NoFailure

mitigated? EndYes

Execute an action

Localization

Estimate impact

Rank actions

Page 24: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

24

Impact Metrics

• Derived from Service Level Agreement (SLA)– Availability: online_server_ratio– Packet loss: total_lost_pkt– latency: max_link_utilization• Small link utilization small (queuing) delay

• Total_lost_pkt & max_link_utilization derived from utilization of individual links

Page 25: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

25

Estimating Link Utilization

• # of flows >> redundant paths– Traffic evenly distributed under ECMP

• Estimate the load contributed by each flow on each link

• Sum up the loads to compute utilization

Impact Estimator

Action

Traffic Link utilization Topology

Page 26: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

26

Link Utilization Estimation is Highly Accurate

• 1-month traffic from a 8000-server network– Log socket events on each server

• Ground truth: SNMP counters

Page 27: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

27

NetPilot OverviewNetwork failure

Roll back if necessary

NoFailure

mitigated? EndYes

Execute an action

Localization

Estimate impact

Rank actionsChoose the action with the least impact

Page 28: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

28

Outline• Automating failure diagnosis is challenging

• Failure mitigation is effective

• How to automate mitigation? – Localization impact estimation ranking

• NetPilot evaluations–Mitigating load imbalance–Mitigating FCS errors–Mitigating overload

• Conclusion

Page 29: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

29

Load Imbalance

• Agga stops receiving traffic • Localize to 4 suspects

corea

Agga

coreb

Aggb

Page 30: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

30

Mitigating Load Imbalance

0:00 0:05 0:10 0:15 0:20 0:250

5000000000100000000001500000000020000000000250000000003000000000035000000000

lag core_a->AR_a lag core_a->AR_b

lag core_b->AR_a lag core_b->AR_b

Time (minutes)

corea -> agga

coreb -> agga

corea -> aggb

coreb -> aggb

Agga stops receiving traffic

Detected & reboot coreb

Reboot corea Reboot Agga

Mitigation confirmed

Load evenly splitted

corea

Agga

coreb

Aggb

Page 31: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

31

Fast FCS Error Mitigation

NetPilot:deactivates 2 links in 1 trial

within 15 minutes

Human operator:

after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated

3.5 hours 15 minutes

Page 32: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

32

Mitigating Link Overload• Mitigate overload by deactivating healthy links

core1

1.5 1.5

3

agg

core2

core1

agg

Page 33: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

33

Mitigating Link Overload• Mitigate overload by deactivating healthy links– Many candidate links in production networks– Choose the link(s) with the least impact

core1

1.5 1.5

3

agg

core2 core1

1 1.5

3

agg

core2 core1

0 3

3

agg

core2

lost 0.5

Page 34: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

34

Action Ranking Lowers Link Utilization

• Replay 97 overload incidents due to link failures

Page 35: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

35

Conclusion

• Mitigation shortens failure recovery time– Simple actions are effective– Made possible by redundancy

• NetPilot: automating failure mitigation – Recovery time: hour minutes– Several mitigation scenarios deployed in Bing

Page 36: NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

36

Thank You!

Detection Diagnosis RepairNetPilot:

Automated Mitigation

[email protected]