best practices for network troubleshooting
Post on 29-Jan-2018
Simplifying Network Troubleshooting in Data Centers
Aug 3, 2017
Dinesh G Dutt
Cumulus Networks
What’s Changed
Limitations of existing tools and why do we need new ones?
Demo Topology
Spines: spine-1, spine-2, spine-3
ToR switches: torc-11, torc-12, torc-21, torc-22, tor-1, tor-2
Hosts: hostd-11, hostd-21, hosts-11, hosts-21
Addresses shown: 3.0.0.4, 3.0.3.132
Multipathing
● The modern data center is completely multipathed
● Traceroute: the truth, but not the whole truth
● Linux is acquiring 5-tuple based load balancing in kernel 4.12
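To illustrate why a single traceroute shows only part of the picture, here is a minimal sketch of 5-tuple based path selection. The hash (SHA-256), the function name `pick_next_hop`, and the port numbers are assumptions made for illustration; real kernels and switch ASICs use their own hash functions.

```python
import hashlib

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Map a flow's 5-tuple to one of several equal-cost next hops."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

spines = ["spine-1", "spine-2", "spine-3"]

# The same flow always hashes to the same next hop.
first = pick_next_hop("3.0.0.4", "3.0.3.132", 6, 40000, 443, spines)
again = pick_next_hop("3.0.0.4", "3.0.3.132", 6, 40000, 443, spines)

# Flows between the same two hosts that differ only in source port
# can be spread across different spines.
paths = {
    pick_next_hop("3.0.0.4", "3.0.3.132", 6, port, 443, spines)
    for port in range(40000, 40064)
}
print(first, sorted(paths))
```

A probe that uses one 5-tuple exercises only one of these paths, so a healthy traceroute does not prove that every path between the two hosts is healthy.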
Network Virtualization
● Tunnels obscure path and reachability
● Incorrect MTU configuration can cause unexplained connectivity problems
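The MTU point can be made concrete with a little arithmetic. The sketch below assumes VxLAN over an IPv4 underlay with no 802.1Q tag on the inner frame; the function names are illustrative.

```python
# Bytes added to every packet by VxLAN encapsulation over an IPv4 underlay.
INNER_ETHERNET = 14  # inner Ethernet header carried inside the tunnel
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20

VXLAN_OVERHEAD = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4

def max_overlay_mtu(underlay_mtu):
    """Largest overlay MTU that fits in the underlay without fragmentation."""
    return underlay_mtu - VXLAN_OVERHEAD

def min_underlay_mtu(overlay_mtu):
    """Smallest underlay MTU needed to carry a given overlay MTU."""
    return overlay_mtu + VXLAN_OVERHEAD

print(VXLAN_OVERHEAD, max_overlay_mtu(1500), min_underlay_mtu(1500))
```

If hosts send 1500-byte packets into a tunnel whose underlay MTU is also 1500, small packets such as pings succeed while full-sized packets fail, which is the kind of unexplained connectivity problem the slide refers to.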
Microservices
● Applications are more distributed than ever, so connectivity is even more critical than before
● The short lifetime of containers makes post-mortem analysis harder
Deployment Speed
● Operators need the ability to make changes with confidence
● The network needs to be immutable in the face of daily dynamic needs
○ In public/private clouds, for example, there is no adding and deleting of VLANs
Scale
● Automation is key, not a nice-to-have
● Rethink network design and architecture to align with automation
● Network as generic infrastructure
● Push everything not related to connectivity to the edge
○ Security, segmentation endpoints, services
Rise of Whitebox Switching
● Simple, uniform building blocks
● Merchant switching silicon
● Disaggregate hardware and software (NOS), just like servers
What’s Not Changed
Is the Network the Problem?
The Ball of Technologies The Network Admins Deal With
What Network Admins Go To Bat With
Box-by-Box Debugging
What Makes Network Troubleshooting Particularly Difficult?
● Network devices are appliances
○ Lack of a platform approach means the device is generally closed to any non-vendor program
● Lack of programmatic access or structured output
○ CLI screen scraping
● Packet forwarding happens in silicon, which has limited troubleshooting capabilities
○ In comparison to compute, where software is king
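The difference between screen scraping and structured output can be sketched as follows. Both sample outputs are hypothetical and do not reproduce any vendor's CLI; the point is only that the scraper depends on column order and spacing, while the structured form does not.

```python
import json
import re

# Hypothetical CLI text; real output differs by vendor and software release.
cli_text = """\
Interface  Status  MTU
swp1       up      9216
swp2       down    1500
"""

# Hypothetical structured output carrying the same information.
json_text = (
    '{"swp1": {"status": "up", "mtu": 9216},'
    ' "swp2": {"status": "down", "mtu": 1500}}'
)

def scrape(text):
    """Screen scraping: breaks if a column is added, renamed, or reordered."""
    rows = {}
    for line in text.splitlines()[1:]:  # skip the header row
        match = re.match(r"(\S+)\s+(\S+)\s+(\d+)", line)
        if match:
            name, status, mtu = match.groups()
            rows[name] = {"status": status, "mtu": int(mtu)}
    return rows

scraped = scrape(cli_text)
structured = json.loads(json_text)  # no layout assumptions needed
print(scraped == structured)
```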
In Short...
In short, the levels of abstraction in modern data center networks have grown without a corresponding increase in tools that break those levels down
And so, what’s the answer?
No silver bullet in troubleshooting
Many Layers to Peel
From network architecture to configuration to diagnostics to forensics
Key Observations
● Choose the right architecture to limit or eliminate problems
● Use automation to eliminate random errors
● Catch problems close to the source
Three Step Process to Network Troubleshooting
Right Architecture => Eliminate errors due to poor design; simplify the design
Right Configuration => Eliminate errors due to complex configuration and complex automation scripts
Right Telemetry => Catch errors that have slipped past despite the rigor, and catch operational drift caused by changes beyond your control, such as aging effects (for example, cable faults)
The Appropriate Architecture
Key Takeaways
Modern DC architecture: build large networks out of simple building blocks
Simple building blocks significantly reduce the complexity of troubleshooting networks
Benefits of Simple, Common Building Blocks
“Google uses a very common set of building blocks across all of its software, so by instrumenting these building blocks Dapper is able to automatically generate a lot of useful trace information without any application involvement.”
- Dapper Paper, 2015
Tackling Cabling Complexity in Clos Networks
● Catch miscabling errors as miscabling errors, not as protocol or application errors
Verify Cabling: Prescriptive Topology Manager
Spines: S1 S2 S3 S4
Leaves: L1 L2 L3 L4 L5
graph G {
    S1:p1 -- L1:p1; S1:p2 -- L2:p1; S1:p3 -- L3:p1; S1:p4 -- L4:p1; S1:p5 -- L5:p1;
    S2:p1 -- L1:p2; S2:p2 -- L2:p2;
    ...
    S4:p5 -- L5:p4;
}
● Define the expected topology using the DOT language
● Verify connectivity against the topology plan using LLDP
● Take dynamically defined actions based on match or mismatch of expected and actual
● https://github.com/CumulusNetworks/ptm
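The idea behind this kind of check can be sketched in a few lines. This is not PTM's implementation: the parser below handles only the simple `A:p -- B:p;` edge form shown on the slide, and the observed neighbor table is hypothetical data standing in for what LLDP would report.

```python
import re

EDGE = re.compile(r"(\w+):(\w+)\s*--\s*(\w+):(\w+)")

def parse_edges(dot_text):
    """Return {(node, port): (peer, peer_port)} for both ends of every edge."""
    links = {}
    for a, a_port, b, b_port in EDGE.findall(dot_text):
        links[(a, a_port)] = (b, b_port)
        links[(b, b_port)] = (a, a_port)
    return links

def check_cabling(expected, observed):
    """Return (local, expected_peer, observed_peer) for every mismatch."""
    problems = []
    for local, peer in sorted(expected.items()):
        seen = observed.get(local)
        if seen != peer:
            problems.append((local, peer, seen))
    return problems

topology = "graph G { S1:p1 -- L1:p1; S1:p2 -- L2:p1; S2:p1 -- L1:p2; S2:p2 -- L2:p2; }"
expected = parse_edges(topology)

# Hypothetical neighbors seen on S1: the cables to L1 and L2 are swapped.
seen_on_s1 = {("S1", "p1"): ("L2", "p1"), ("S1", "p2"): ("L1", "p1")}
expected_on_s1 = {k: v for k, v in expected.items() if k[0] == "S1"}

problems = check_cabling(expected_on_s1, seen_on_s1)
for local, want, got in problems:
    print(f"{local}: expected {want}, saw {got}")
```

Reporting the mismatch at this level names the swapped cable directly, instead of leaving it to surface later as a routing or application symptom.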
Network Telemetry
What Data Can We Gather?
● Logs
● Network state, which can be configuration or runtime
Logs
Pros
● In theory, catch errors, warnings, and exceptions
● Mature tools are now available to handle logs
○ ELK, Splunk
Cons
● Usually box specific
● Errors that require fabric awareness can be hard to catch, and so can’t easily be logged
○ Examples: duplicate IP address, routing loop
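A duplicate IP address is a good example of an error that no single box can log, because each device sees only its own addresses. A minimal sketch of a fabric-wide check, assuming address tables have already been collected from every device (the data and the host-to-address assignment below are hypothetical):

```python
from collections import defaultdict

def find_duplicate_ips(addresses_by_device):
    """Return {ip: sorted [(device, interface), ...]} for IPs seen more than once."""
    owners = defaultdict(set)
    for device, interfaces in addresses_by_device.items():
        for ifname, ips in interfaces.items():
            for ip in ips:
                owners[ip].add((device, ifname))
    return {ip: sorted(where) for ip, where in owners.items() if len(where) > 1}

# Hypothetical per-device address tables gathered from the whole fabric.
fabric = {
    "hostd-11": {"eth0": ["3.0.0.4"]},
    "hostd-21": {"eth0": ["3.0.3.132"]},
    "hosts-11": {"eth0": ["3.0.0.4"]},  # misconfigured: reuses hostd-11's address
}

duplicates = find_duplicate_ips(fabric)
print(duplicates)
```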
Metrics
The good thing about metrics is that there are so many to gather
Brendan Gregg’s USE method is a good yardstick for deciding which metrics to gather:
“For every resource, check utilization, saturation, and errors.”
For example, applying USE to network interfaces:
● Utilization: basic RX/TX rates
● Saturation: buffer monitor stats per port
● Errors: drops and errors for RX/TX
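Applied to a single interface, the three USE numbers can be derived from two successive counter samples. This is a sketch under stated assumptions: the counter names are hypothetical, counter wrap is ignored, and the buffer peak is assumed to be exported by the platform as a gauge.

```python
def interface_use(prev, curr, interval_s, speed_bps):
    """Derive utilization, saturation, and errors from two counter samples."""
    rx_bps = (curr["rx_bytes"] - prev["rx_bytes"]) * 8 / interval_s
    tx_bps = (curr["tx_bytes"] - prev["tx_bytes"]) * 8 / interval_s
    error_counters = ("rx_errors", "tx_errors", "rx_drops", "tx_drops")
    return {
        # Utilization: busier direction as a fraction of link speed.
        "utilization": max(rx_bps, tx_bps) / speed_bps,
        # Saturation: peak buffer occupancy reported for the interval.
        "saturation": curr["buffer_peak_bytes"],
        # Errors: new errors and drops since the previous sample.
        "errors": sum(curr[name] - prev[name] for name in error_counters),
    }

prev = {"rx_bytes": 0, "tx_bytes": 0, "rx_errors": 0, "tx_errors": 0,
        "rx_drops": 2, "tx_drops": 0, "buffer_peak_bytes": 0}
curr = {"rx_bytes": 3_125_000_000, "tx_bytes": 625_000_000, "rx_errors": 0,
        "tx_errors": 0, "rx_drops": 7, "tx_drops": 0, "buffer_peak_bytes": 180_000}

use = interface_use(prev, curr, interval_s=5, speed_bps=10_000_000_000)
print(use)
```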
Metrics Usage
For troubleshooting:
● For a network operator, network latency is probably the one thing that can be used as an indicator of whether suboptimal performance is due to the network
● Other metrics and mechanisms come into play to isolate the problem in the network
For capacity planning:
● Utilization and saturation metrics help you decide whether you are reaching network capacity
Metrics Dos and Don’ts
● Gather your data as frequently as possible
○ The practical limit may be how quickly the hardware stats are updated
○ 1-5 seconds is quite possible
● Do not use SNMP
○ The first bullet prevents this anyhow
● Do not aggregate data quickly
● Use a good TSDB
○ InfluxDB and Prometheus are the ones I encounter the most
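A small worked example shows why frequent sampling and late aggregation matter. The utilization series below is hypothetical: one minute of per-second samples containing a 3-second burst at line rate.

```python
from statistics import mean

# Hypothetical per-second link utilization (percent) over one minute.
samples = [10] * 57 + [100] * 3

def downsample(values, window):
    """Average consecutive samples in fixed-size windows."""
    return [mean(values[i:i + window]) for i in range(0, len(values), window)]

one_minute = downsample(samples, 60)   # a single 60-second average
five_second = downsample(samples, 5)   # twelve 5-second averages

print(one_minute, max(five_second), max(samples))
```

The one-minute average makes the link look almost idle, the 5-second view shows a clear spike, and only the raw samples show that the link was actually full.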
Packet Capture
Pros
● Useful for identifying what sort of traffic is flowing through
○ For security compliance
○ For things like IDS
Cons
● Relatively expensive: capturing all the data flowing through even a single switch (3.2 Tbps and increasing) is costly
● sFlow and its cousins are better suited for identifying traffic
● Makes the most sense when used reactively, in troubleshooting
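To put the cost in numbers, here is the arithmetic for full capture on one switch. It assumes the worst case of the switch forwarding at its full 3.2 Tbps capacity and ignores capture metadata overhead.

```python
SWITCH_CAPACITY_BPS = 3_200_000_000_000  # 3.2 Tbps, the figure from the slide

bytes_per_second = SWITCH_CAPACITY_BPS // 8
bytes_per_hour = bytes_per_second * 3600

print(f"{bytes_per_second / 1e9:.0f} GB per second")
print(f"{bytes_per_hour / 1e15:.2f} PB per hour")
```

That is 400 GB of capture data every second, or 1.44 PB per hour, for a single fully loaded switch.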
Network State
Properly designed, network state can be a good balance between packet capture and formal verification for answering questions such as:
● Did this change break my network?
● Was there a forwarding loop at 10 pm last night?
● Show me the changes between 1 hour ago and 2 hours ago
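Questions like these reduce to comparing snapshots of state taken at different times. A minimal sketch, assuming route tables are stored as timestamped dictionaries of prefix to next hop (the prefixes, timestamps, and function names are hypothetical):

```python
def state_at(snapshots, when):
    """Return the latest snapshot taken at or before `when`."""
    eligible = [ts for ts in snapshots if ts <= when]
    return snapshots[max(eligible)] if eligible else {}

def diff_state(before, after):
    """Return entries added, removed, and changed between two snapshots."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed

# Hypothetical route snapshots keyed by minutes since the start of collection.
snapshots = {
    0: {"3.0.0.0/26": "spine-1", "3.0.3.128/26": "spine-2"},
    60: {"3.0.0.0/26": "spine-1", "3.0.3.128/26": "spine-3", "3.0.1.0/26": "spine-2"},
}

added, removed, changed = diff_state(state_at(snapshots, 30), state_at(snapshots, 90))
print(added, removed, changed)
```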
Problem Remains
● Lots of data can be gathered
● The ability to correlate across these sources is still mostly lacking
○ Example: graphs show a drop in interface throughput without also showing an annotation indicating that the drop happened because a link failed
● Building actionable alerts remains elusive
○ Many customers tell me that they essentially ignore alerts due to the high false positive rate
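The missing annotation amounts to joining a metric series with an event log on time. A minimal sketch, in which the thresholds, the data, and the function name are all assumptions chosen for illustration:

```python
def annotate_drops(throughput, events, drop_ratio=0.5, window_s=5):
    """Pair each sharp throughput drop with events logged close to it in time."""
    annotated = []
    for (_, v_prev), (t_cur, v_cur) in zip(throughput, throughput[1:]):
        if v_prev > 0 and v_cur < v_prev * drop_ratio:
            nearby = [msg for t_evt, msg in events if abs(t_evt - t_cur) <= window_s]
            annotated.append((t_cur, nearby))
    return annotated

# Hypothetical (seconds, Gbps) samples and (seconds, message) events.
throughput = [(0, 9.5), (10, 9.4), (20, 4.6), (30, 4.7)]
events = [(18, "swp1 link down"), (300, "bgp neighbor up")]

annotations = annotate_drops(throughput, events)
print(annotations)
```

A drop that comes back with an empty event list is the more interesting alert, since nothing in the logs explains it.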
Troubleshooting Tools
The Problem
● Network admins are typically contacted for one of two cases:
○ A can’t talk to B, or
○ A can talk to B, but suboptimally
● Also check for proper network segmentation
What Tool For What Problem?
● A can’t talk to B:
○ Network state may be the most useful thing for identifying the problem
● A can talk to B, but suboptimally:
○ This is a performance issue, and the metrics gathered can be used to identify the problem
● Check compliance, for example that traffic does not leak across virtual networks (VLAN, VRF or VxLAN)
○ Network state is the most helpful for answering this question
One Simple Step: Make Servers Discoverable
● Enable LLDP on the server
● If using lldpd, configure it to send ifname
○ Add a file called portidsubtype.conf to /etc/lldpd.d with contents:
configure lldp portidsubtype ifname
○ Restart lldpd via sudo systemctl restart lldpd (or equivalent)
● With PTM, this enables cabling verification to servers too
Traceroute family
Tools: traceroute, mtr, tracepath, traceroute-paris, traceroute-dublin, scamper
ECMP support: traceroute, traceroute-paris/dublin, scamper
PMTU support: traceroute, tracepath, mtr
NAT detection: traceroute-dublin
IPv6 support: all except traceroute-dublin
VxLAN support: none
NetQ
Designed for Linux-based networking devices and hosts
An open framework with a paid analysis engine
Users can build their own analysis engine or customize
Designed around the modern data center use case
Simplify codifying validation
Simplify troubleshooting
Codify troubleshooting
Time machine debugging (or DVR) included
NetQ Architecture
[Figure: NetQ agents (Q) on network devices and hosts (Ubuntu 16.04, RHEL 7, CentOS 7) and the NetQ Telemetry Server]
NetQ: Fabric Change Log
[Figure: Linux kernel (L3, L2, VxLAN) and NetQ]
Example events: New Route Added, OSPF Neighbor Change, MAC Address Removed
See state now or at any point in the past
NetQ: Analysis Engine
• Validate Current State
▪ BGP
▪ OSPF
▪ MTU
▪ mLAG
▪ VxLAN
• Telemetry Server analyzes entire network state
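As an illustration of what validating state across the whole network means, the sketch below checks that both ends of every link agree on MTU. It is not NetQ's implementation or API: the data structures, interface names, and function name are hypothetical.

```python
def find_mtu_mismatches(links, mtu):
    """Return links whose two ends are configured with different MTUs."""
    mismatches = []
    for end_a, end_b in links:
        if mtu[end_a] != mtu[end_b]:
            mismatches.append((end_a, end_b, mtu[end_a], mtu[end_b]))
    return mismatches

# Hypothetical link list and per-interface MTUs gathered from every device.
links = [
    (("spine-1", "swp1"), ("torc-11", "swp51")),
    (("spine-1", "swp2"), ("torc-12", "swp51")),
]
mtu = {
    ("spine-1", "swp1"): 9216,
    ("torc-11", "swp51"): 9216,
    ("spine-1", "swp2"): 9216,
    ("torc-12", "swp51"): 1500,  # forgotten during the jumbo-frame rollout
}

mismatches = find_mtu_mismatches(links, mtu)
print(mismatches)
```

The check needs data from both devices on each link, which is why it belongs in a component that sees the entire network state rather than on any one box.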
Cumulus Networks Confidential
NetQ: Intelligent Visibility
• View remote information
▪ IPs
▪ MACs
▪ OS
▪ System Specs
• Improve Command Outputs
▪ Resolve hostnames in any Linux command
▪ No need for DNS
NetQ: Advanced Notification
• NetQ Notifier Service
• Automatically Alert on Check Failures
▪ Syslog
▪ ChatOps (Slack)
▪ ELK
▪ Splunk
Summary
● Network troubleshooting remains hard for most people
● The modern data center has the potential to make troubleshooting both simpler and more complex
● Avoiding troubleshooting is better than troubleshooting
○ The right architecture and configuration models go a long way in addressing this
● Correlating across network state, logs and metrics is still beyond the reach of most network operators
○ Newer tools are on the rise to address this
Thank you!
Visit us at cumulusnetworks.com, follow us @cumulusnetworks, or join us at slack.cumulusnetworks.com
© 2017 Cumulus Networks. Cumulus Networks, the Cumulus Networks Logo, and Cumulus Linux are trademarks or registered trademarks of Cumulus
Networks, Inc. or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. The registered trademark
Linux® is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.