Page 1
NetPoirot: Taking The Blame Game Out of Data Center Operations
Behnaz Arzani, Selim Ciraci, Boon Thau Loo,
Assaf Schuster, Geoff Outhred
Page 2
Datacenters can fail …
2
Page 3
Failures are disruptive
••
•
•
3
Page 4
Why is debugging hard?
4
Penn researcher
Azure VM Azure Network Service X
Network
Page 5
NetworkNetwork
`
Someone accepts responsibility Each blames the other
5
In the case of a failure…
Page 6
A real example… Event X
•
••
•
6
Page 7
Current tools are insufficient
SherlockSIGCOMM-07
NetMedicSIGCOMM-09NSDI-11
TRatSIGCOMM-02 Netprofile
rP2Psys-05
7
Page 8
Can we do better? (Overview)
• Introducing…
8
NetPoirot
Fault injector
Learning Agent
Page 9
The monitoring agent
•
•
•
••
•
•
9
Page 10
What is the TCP event digest?
•
•
•
10
Page 11
Why do we think this can work?
••
•
•
••
11
Page 12
To distinguish failures…
•
••
12
Page 13
Decision trees…
•
13His uncertainty is X
Page 14
Decision trees…
••
14His uncertainty is X-Y
Page 15
Decision trees alone are not enough
15
Page 16
Decision trees alone are not enough
16
Page 17
Decision trees alone are not enough
17Feature 1
Fe
atu
re 2
Page 18
Decision trees alone are not enough
Easiest to
18
Hardest to classify
Fe
atu
re 2
Feature 1
Page 19
What we do to deal with this
19
Fe
atu
re 2
Feature 1
Page 20
Upper portion of an example tree…
20
Mean of max congestion window
Min of the last congestion window
50th percentile of number of triple duplicate ACKs
50th percentile of connection duration
Max of the number of triple duplicate Acks
95th percentile of the max congestion window
Page 21
What we do to deal with this
21
Fe
atu
re 2
Feature 1
Page 22
Upper portion of an example tree…
22
50TH percentile of the max RTT
Number of flows
50th percentile of amount of data received
95th percentile of the number of timeouts
Page 23
Decision trees alone are not enough
23Feature 1
Fe
atu
re 2
Page 24
The upper portion of an example tree…
24
Mean time spent in zero window probing
95th percentile of the ratio of number of bytes posted
to received
Number of flows
Number of flows
95th percentile of connection durations
Minimum of the number of bytes received
Page 25
25
Is it a network failure?
Is it a server problem?
Is it a client side problem?
Page 26
Other details
••
•
•
26
If throughput < x:Open more
connections
If throughput <x:Send more data on the same connection
Page 27
What did we learn from all this?
••
••
•
••
••
27
Page 28
Evaluation
••
•
•
••
•
28
Page 29
How did we get labeled data?
•
•
••
•
•
•
29
Page 30
Worse case application
•
30
Page 31
What if we haven’t seen the failure before?
31
Page 32
Performance on real applications
32
General label
Normal Client Network
Precision
97.78% 99.7% 100%
Recall 99.68% 98.25% 99.37
YouTube
Event X
Page 33
Things we did not talk about
•
•
•
•
•
33
Page 34
What’s next?
••
••
34
Page 35
Conclusion
•
•
35