netpilot: automating datacenter network failure mitigation xin wu, daniel turner, chao-chih chen,...
TRANSCRIPT
NetPilot: Automating Datacenter Network Failure Mitigation
Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang
Presented by: Chen Li
2
Failures are Common and Harmful
• Network failures are common
10,000+ switches
3
Failures are Common and Harmful
• Network failures are common
• Failures cause long down times
Time from detection to repair (minutes)
Six-month failure logs of production datacenters
25% of failures take 13+ hours to repair
4
Failures are Common and Harmful
• Failures are common due to VERY large datacenters
• Failures cause long down times
• Long failure duration large revenue loss
How to Shorten Failure Recovery Time?
6
Previous Work
• Conventional failure recovery takes 3 steps
• Failure localization/diagnosis– [M. K. Aguilera, SOSP’03]– [M. Y. Chen, NSDI’04]– [R.R Kompella, NSDI ’05]– [P.Bahl, SIGCOMM’07]– [S. Kandula, SIGCOMM’09]…
Detection Diagnosis Repair
passiveping
active
7
Automating Failure Diagnosis is Challenging
• Root causes are deep in network stack
• Diagnosis involves multiple parties
8
Category Failure types Diagnosis & Repair
%
Software 21% Link layer loop Find and fix bugs
19%Imbalance overload 2%
Hardware 18% FCS error Replace cable 13%Unstable power Repair power 5%
Unknown 23% Switch stops forwarding N/A 9%Imbalance overload 7%Lost configuration 5%High CPU utilization 2%
Configuration 38%
Errors on multiple switches
Update configuration
32%
Errors on one switch 6%
• Six-month failure logs from several production DCNs
1. Root causes are deep in the network stack
2. Diagnosis involves multiple partiesFailure Diagnosis Requires
Human Intervention !
Can we do something other than failure diagnosis?
10
NetPilot: Mitigating rather than Diagnosing Failures
• Mitigate failure symptoms ASAP, at the cost of reduced capacity
Detection Diagnosis RepairAutomated Mitigation
11
NetPilot Benefits
• Short recovery time• Small network disruption• Low operation cost
Automated Mitigation
Detection Diagnosis Repair
12
Failure Mitigation is Effective
• Most failures can be mitigated by simple actions
• Mitigation is feasible due to redundancy
13
Category Failure types Mitigation Repair %Software 21%
Link layer loop Deactivate port Find and fix bugs
19%
Imbalance-triggered overload
Restart switch2%
Hardware 18%
FCS error Deactivate port Replace cable 13%
Unstable power Deactivate switch Repair power 5%Unknown 23%
Switch stops forwarding
Restart switch N/A 9%
Imbalance-triggered overload
Restart switch 7%
Lost configuration Restart switch 5%High CPU utilization
Restart switch 2%
Configuration 38%
Errors on multiple switches
n/a Update configuration
32%
Errors on single switch
Deactivate switch 6%
68% of failures can be mitigated by simple actions
14
Mitigation Made Possible by Redundancy
• Redundancy deactivation unlikely to partition / overload the network
ToR
AGG
CORE
Internet
15
Outline
• Automating failure diagnosis is challenging
• Failure mitigation is effective
• How to automate mitigation?
• NetPilot evaluations
• Conclusion
16
A Strawman NetPilot: Trial-and-error
Network failure
Roll back if necessary
No Failure mitigated? End
Yes
Execute an action
Localization
17
NetPilot: Challenges & Solutions
1. Blind trial-and-error takes a long time
Network failure
Roll back if necessary
NoFailure
mitigated? EndYes
Execute an action
LocalizationLocalization
Failure specific localization
18
NetPilot: Challenges & SolutionsNetwork failure
Roll back if necessary
NoFailure
mitigated? EndYes
Execute an action
Localization
Estimate impact
Localization
2. Partition/overload network
Impact estimation
19
NetPilot: Challenges & SolutionsNetwork failure
Roll back if necessary
NoFailure
mitigated? EndYes
Execute an action
Localization
Estimate impact
Rank actions
Localization
3. Different actions have different side-effects
Rank actions based on impact
20
Failure Specific Localization• Limited # of failure types• Domain knowledge improves accuracy
Failure types1. Link layer loop2. Imbalance-triggered overload3. FCS error4. Unstable power5. Switch stops forwarding6. Imbalance-triggered overload
7. Lost configuration8. High CPU utilization9. Errors on multiple switches
10. Errors on single switch
21
Example: Frame Check Sequence (FCS) Errors
• 13% of all the failures• Cut-through switching– Forward frames before checksums are verified
• Increase application latency
22
Localizing FCS Errorserror frames seen on L frames corrupted by L
frames corrupted by other links & traverse L
• xL: link corruption rate
• # of variables = # of equations = # of links
• Corrupted links: xL> 0
23
NetPilot OverviewNetwork failure
Roll back if necessary
NoFailure
mitigated? EndYes
Execute an action
Localization
Estimate impact
Rank actions
24
Impact Metrics
• Derived from Service Level Agreement (SLA)– Availability: online_server_ratio– Packet loss: total_lost_pkt– latency: max_link_utilization• Small link utilization small (queuing) delay
• Total_lost_pkt & max_link_utilization derived from utilization of individual links
25
Estimating Link Utilization
• # of flows >> redundant paths– Traffic evenly distributed under ECMP
• Estimate the load contributed by each flow on each link
• Sum up the loads to compute utilization
Impact Estimator
Action
Traffic Link utilization Topology
26
Link Utilization Estimation is Highly Accurate
• 1-month traffic from a 8000-server network– Log socket events on each server
• Ground truth: SNMP counters
27
NetPilot OverviewNetwork failure
Roll back if necessary
NoFailure
mitigated? EndYes
Execute an action
Localization
Estimate impact
Rank actionsChoose the action with the least impact
28
Outline• Automating failure diagnosis is challenging
• Failure mitigation is effective
• How to automate mitigation? – Localization impact estimation ranking
• NetPilot evaluations–Mitigating load imbalance–Mitigating FCS errors–Mitigating overload
• Conclusion
29
Load Imbalance
• Agga stops receiving traffic • Localize to 4 suspects
corea
Agga
coreb
Aggb
30
Mitigating Load Imbalance
0:00 0:05 0:10 0:15 0:20 0:250
5000000000100000000001500000000020000000000250000000003000000000035000000000
lag core_a->AR_a lag core_a->AR_b
lag core_b->AR_a lag core_b->AR_b
Time (minutes)
corea -> agga
coreb -> agga
corea -> aggb
coreb -> aggb
Agga stops receiving traffic
Detected & reboot coreb
Reboot corea Reboot Agga
Mitigation confirmed
Load evenly splitted
corea
Agga
coreb
Aggb
31
Fast FCS Error Mitigation
NetPilot:deactivates 2 links in 1 trial
within 15 minutes
Human operator:
after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated
3.5 hours 15 minutes
32
Mitigating Link Overload• Mitigate overload by deactivating healthy links
core1
1.5 1.5
3
agg
core2
core1
agg
33
Mitigating Link Overload• Mitigate overload by deactivating healthy links– Many candidate links in production networks– Choose the link(s) with the least impact
core1
1.5 1.5
3
agg
core2 core1
1 1.5
3
agg
core2 core1
0 3
3
agg
core2
lost 0.5
34
Action Ranking Lowers Link Utilization
• Replay 97 overload incidents due to link failures
35
Conclusion
• Mitigation shortens failure recovery time– Simple actions are effective– Made possible by redundancy
• NetPilot: automating failure mitigation – Recovery time: hour minutes– Several mitigation scenarios deployed in Bing