Differential Provenance: Better Network Diagnostics with Reference Events
Ang Chen, Yang Wu, Andreas Haeberlen, Wenchao Zhou+, Boon Thau Loo
University of Pennsylvania, Georgetown University+
2
Motivation: Finding the root cause of a symptom
• Networks can (and frequently do!) have bugs
• Example: Software-defined networks
• We need a good debugger!
[Figure: Example SDN operated by Bob. Labels: Internet, subnets 4.3.2.0/24 and 4.3.3.0/24, Web server 1, DPI, Web server 2. Annotations: "Overly specific flow entry"; "Traffic arriving at the wrong server!?!"]
3
Debugging networks with provenance
• Typical debuggers tell us what happened:
  • NetSight: Packet histories
  • Y!: Network provenance
• Key benefit: Rich explanation of what, when, and why.
[Figure: Provenance tree for packet P crossing nodes A, B, and C. Vertices: C received packet; B sent packet; B received packet; Rule match on B; A sent packet; A received packet; Rule match on A; Rule installed by controller; Incoming packet at controller.]
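A provenance tree like the one on this slide can be held in a small recursive data structure. The sketch below is illustrative only: the `Vertex` class, the `explain` and `size` helpers, and the exact cause/effect edges are assumptions made for this example, not the data model of NetSight or Y!.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vertex:
    """One event in a provenance tree; `causes` are its direct causes."""
    event: str
    causes: List["Vertex"] = field(default_factory=list)


def explain(v: Vertex, depth: int = 0) -> List[str]:
    """Return the tree as indented lines: effect first, causes below it."""
    lines = ["  " * depth + v.event]
    for cause in v.causes:
        lines.extend(explain(cause, depth + 1))
    return lines


def size(v: Vertex) -> int:
    """Count the vertices in the tree rooted at v."""
    return 1 + sum(size(c) for c in v.causes)


# Assumed edges, loosely following the labels on the slide.
rule_a = Vertex("Rule match on A", [
    Vertex("Rule installed by controller", [
        Vertex("Incoming packet at controller")])])
a_sent = Vertex("A sent packet", [Vertex("A received packet"), rule_a])
b_sent = Vertex("B sent packet", [
    Vertex("B received packet", [a_sent]),
    Vertex("Rule match on B")])
tree = Vertex("C received packet", [b_sent])

print("\n".join(explain(tree)))
```

Reading the printed tree top-down answers "why did C receive the packet?" one cause at a time, which is the kind of explanation a provenance debugger gives.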
4
Problem: Explanation can be too big!
• The problem: Finding the root cause in a large provenance tree.
[Figure: Large provenance tree. Root: "Packet arrives at wrong server". Root cause (faulty rule), deep in the tree: "Rule 7: Next-hop=port2".]
5
Key insight: Use reference events!
• Remember that some packets were routed correctly.
• The same things should have happened to all packets!
• Key insight: If we have both a (bad) symptom and a (good) reference, we only need to reason about the differences between them!
[Figure: Bob's network with switches S1 to S6, Web server 1, DPI, and Web server 2.]
6
A new debugger
• Bob collects both a bad symptom and a good reference
• Bob sends both events to the debugger
• Debugger generates provenance, outputs difference
• Ideally, there is only one diff: the root cause!
[Figure: Bob sends a fault and a reference to the Debugger, which answers: "Field 3 of config entry 4 is wrong!"]
7
Outline
- Motivation: Network diagnostics
- Background
- Key insight
- A new debugger
- Differential provenance
- Are references typically available?
- Strawman approach
- Our approach
- Initial results
- Conclusion
8
Are references typically available?
• Survey:
  • Posts on the ‘Outages’ mailing list in Sept-Dec 2014.
  • 64 posts related to diagnostics.
  • 42/64 (66%) posts involve both a fault and some reference.
• Examples:
  • Some DNS servers have stale records, but others are good
  • Probes sometimes fail, sometimes succeed
  • More examples in the paper
9
Strawman solution
• A strawman solution: Pick out different nodes in trees.
  • Bad provenance: 201 nodes
  • Reference provenance: 156 nodes
  • Naïve diff: 278 nodes!
[Figure: Bad provenance - Reference provenance = ?]
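A minimal sketch of the strawman, assuming each tree is flattened into a set of vertices and the diff keeps every vertex that appears in only one tree. The vertex tuples, rule numbers, and timestamps below are made up for illustration and are not the 201-node and 156-node trees from the talk.

```python
def naive_diff(bad, good):
    """Vertices that appear in exactly one of the two trees."""
    return bad ^ good  # set symmetric difference


# Hypothetical vertices: (event, node, detail, timestamp).
bad = {
    ("link up", "S1", "port 1", 0),
    ("recv", "S1", "packet from 4.3.2.1", 10),
    ("match", "S1", "rule 7", 11),
    ("send", "S1", "port 2", 12),
    ("recv", "web server 2", "packet from 4.3.2.1", 13),
}
good = {
    ("link up", "S1", "port 1", 0),
    ("recv", "S1", "packet from 4.3.3.1", 20),
    ("match", "S1", "rule 9", 21),
    ("send", "S1", "port 1", 22),
    ("recv", "web server 1", "packet from 4.3.3.1", 23),
}

diff = naive_diff(bad, good)
print(len(bad), len(good), len(diff))
```

Only the "link up" vertex is shared, so the diff contains the other four vertices of each tree and ends up larger than either input, even though most of those differences (timestamps, source addresses) say nothing about the fault.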
10
Why does the strawman not work?
• Observation: The diff can be larger than the individual trees.
• Reason #1: Differences that “do not matter”
  • E.g., timestamps, packet payloads, etc.
• Reason #2: “Butterfly effect”
  • A small difference can change later events drastically!
[Figure label: Faulty rule]
11
Differential provenance
• Approach: Change past events, and think about what could have happened.
  • (1) Find some early ‘differences’ in the trees.
  • (2) Change the faulty node to a correct equivalent.
  • (3) Use replay to determine what would have happened.
  • (4) Output the set of changes that align the trees.
[Figure: Bad provenance and Reference provenance trees. Output: "Rule 7: change port"; "Rule 9: change range".]
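The change-and-replay idea in steps (2) to (4) can be sketched on a toy switch. This is a heavily simplified illustration, not the algorithm from the paper: it searches over rule changes by brute force instead of aligning provenance trees, and the rule table, the port-to-server wiring, and the names `replay` and `smallest_fix` are all assumptions.

```python
import ipaddress
from itertools import combinations

# Hypothetical switch state: rule id -> (source prefix, output port).
RULES = {
    7: ("4.3.2.0/24", 2),  # assumed fault: next hop should be port 1
    9: ("4.3.3.0/24", 1),
}
# Assumed wiring of output ports to servers.
PORT_TO_SERVER = {1: "web server 1", 2: "web server 2"}


def replay(rules, src_ip):
    """Re-run forwarding for one packet and return where it ends up."""
    addr = ipaddress.ip_address(src_ip)
    for rule_id in sorted(rules):
        prefix, port = rules[rule_id]
        if addr in ipaddress.ip_network(prefix):
            return PORT_TO_SERVER[port]
    return "dropped"


def smallest_fix(rules, bad_src, good_src):
    """Find the fewest port changes after which the bad packet ends up
    where the reference packet did, without disturbing the reference."""
    expected = replay(rules, good_src)
    candidates = [(rid, port) for rid in rules for port in PORT_TO_SERVER
                  if port != rules[rid][1]]
    for k in range(1, len(candidates) + 1):  # prefer fewer changes
        for changes in combinations(candidates, k):
            trial = dict(rules)
            for rid, port in changes:
                trial[rid] = (trial[rid][0], port)
            if replay(trial, bad_src) == replay(trial, good_src) == expected:
                return list(changes)
    return None


print(smallest_fix(RULES, bad_src="4.3.2.1", good_src="4.3.3.1"))
```

On this toy state the search stops at a single change, setting rule 7 to port 1. Trying small change sets first mirrors the preference for minimal changes, while the replay step is what rules out changes that would break the reference packet.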
12
Technical challenges
• Challenge #1: Where do we start?
  • Heuristics: Change early events, minimum changes…
  • E.g., prefer changing 1 event over changing 1000 events.
• Challenge #2: How should we make the change?
  • Approach: Think about what should have happened.
  • E.g., packet should go to switch 2, not 1.
• Challenge #3: Irrelevant differences?
  • Approach: Equivalence relations between events.
  • E.g., IPs 4.3.2.1 and 4.3.3.1
• See paper for more details.
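One way to picture an equivalence relation between events is a normalization step that strips fields which should not matter before two vertices are compared. The sketch below is an assumption-laden illustration: the vertex layout, the `normalize` and `equivalent` helpers, and the choice of 4.3.2.0/23 as the "same class of client" prefix are invented here and are not taken from the paper.

```python
import ipaddress

# Assumed prefix under which two source addresses count as equivalent.
CLIENT_PREFIX = ipaddress.ip_network("4.3.2.0/23")


def normalize(vertex):
    """Map a vertex (event, node, src_ip, timestamp) to its equivalence
    class: drop the timestamp and generalize client source addresses."""
    event, node, src_ip, _timestamp = vertex
    if ipaddress.ip_address(src_ip) in CLIENT_PREFIX:
        src_ip = "CLIENT"
    return (event, node, src_ip)


def equivalent(v1, v2):
    """Two vertices are equivalent if they normalize to the same class."""
    return normalize(v1) == normalize(v2)


print(equivalent(("recv", "S1", "4.3.2.1", 10), ("recv", "S1", "4.3.3.1", 20)))
print(equivalent(("recv", "S1", "4.3.2.1", 10), ("recv", "S2", "4.3.3.1", 20)))
```

The first pair differs only in timestamp and in which client subnet the packet came from, so it is treated as the same event; the second pair also differs in the switch, which is a difference that does matter.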
13
Setup
• Setup
  • Platform: RapidNet
  • SDN: 6 switches, 2 servers
  • The symptom: misrouted packets from 4.3.2.0/24
  • The reference: packets from 4.3.3.0/24
[Figure: Evaluation network. Labels: Internet, subnets 4.3.2.0/24 and 4.3.3.0/24, Web server 1, DPI. Annotation: "Overly specific flow entry".]
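To make "overly specific flow entry" concrete, here is a small lookup sketch. The slide does not show the actual flow table, so everything in it is assumed: the table contents, the catch-all entry, the actions, and the guess that the intended match was the wider 4.3.2.0/23 prefix.

```python
import ipaddress

# Hypothetical flow table, highest priority first: (match prefix, action).
FLOW_TABLE = [
    ("4.3.3.0/24", "forward to web server 1"),  # assumed overly specific entry
    ("0.0.0.0/0", "forward to web server 2"),   # assumed catch-all entry
]


def lookup(src_ip, table=FLOW_TABLE):
    """Return the action of the first entry whose prefix covers src_ip."""
    addr = ipaddress.ip_address(src_ip)
    for prefix, action in table:
        if addr in ipaddress.ip_network(prefix):
            return action
    return "drop"


print("reference 4.3.3.1:", lookup("4.3.3.1"))
print("symptom   4.3.2.1:", lookup("4.3.2.1"))

# Widening the first entry to the assumed intended prefix covers both subnets.
FIXED_TABLE = [("4.3.2.0/23", "forward to web server 1")] + FLOW_TABLE[1:]
print("after fix 4.3.2.1:", lookup("4.3.2.1", FIXED_TABLE))
```

In this toy table the reference packet hits the narrow entry and is handled as intended, while the symptom packet misses it and falls through to the catch-all, which matches the symptom and reference roles listed in the setup.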
14
Initial results
• Differential provenance finds a single node (the faulty rule) to be the root cause!
[Figure: Fault (201 nodes) and Reference (156 nodes), compared two ways. Naïve diff = (tree image); differential provenance = "Rule 7: next hop should be port 1, not 2!"]
15
Conclusion
• Debugging networks is hard
  • Need good debuggers!
• Provenance can find the causes of an event
  • Problem: Explanation can be too detailed.
• Idea: Use reference events
  • Sufficient to find the (few) differences to the observed symptom
  • New debugger based on differential provenance
• Result: Very precise diagnostics
  • Ideally, can identify a single root cause!
Thanks!