best practices for network troubleshooting

Post on 29-Jan-2018


Simplifying Network Troubleshooting in Data Centers

Aug 3, 2017

Dinesh G Dutt

Cumulus Networks

What’s Changed

Limitations of existing tools, and why do we need new ones?


Demo Topology

[Figure: demo topology. Spines: spine-1, spine-2, spine-3. ToR switches: torc-11, torc-12, torc-21, torc-22, tor-1, tor-2. Hosts: hostd-11, hostd-21, hosts-11, hosts-21. Addresses shown: 3.0.0.4 and 3.0.3.132.]


Multipathing

● Modern data center is completely multipathed

● Traceroute: the truth, not the whole truth

● Linux is gaining 5-tuple-based load balancing in kernel 4.12
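To make the multipathing point concrete, here is a minimal sketch of 5-tuple-based next-hop selection. The hash function (SHA-256) and the names `FiveTuple` and `pick_next_hop` are illustrative assumptions, not what the Linux kernel or any switch ASIC actually uses. It shows why a traceroute probe, which uses different ports than the application, may report a different path than the one the application's packets take.

```python
import hashlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def pick_next_hop(flow: FiveTuple, next_hops: Sequence[str]) -> str:
    """Map a flow onto one of several equal-cost next hops.

    Illustrative only: real kernels and ASICs use their own hash functions.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


spines = ["spine-1", "spine-2", "spine-3"]
app_flow = FiveTuple("3.0.0.4", "3.0.3.132", 6, 40000, 443)

# Every packet of the same flow hashes to the same spine.
print(pick_next_hop(app_flow, spines))

# A probe with a different source port is a different flow and may
# hash to a different spine than the application traffic.
probe_flow = app_flow._replace(src_port=40001)
print(pick_next_hop(probe_flow, spines))
```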


Network Virtualization

● Tunnels obscure path and reachability

● Incorrect MTU configuration can cause unexplained connectivity problems
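As a worked example of the MTU point, the sketch below computes how much room a VxLAN tunnel leaves for the inner packet. It assumes an IPv4 underlay and an untagged inner Ethernet frame; the helper names `max_inner_mtu` and `fits_in_tunnel` are made up for illustration.

```python
# Per-packet overhead added by VxLAN encapsulation over an IPv4 underlay.
OUTER_IPV4_HEADER = 20
OUTER_UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14  # untagged inner frame assumed

VXLAN_OVERHEAD = (
    OUTER_IPV4_HEADER + OUTER_UDP_HEADER + VXLAN_HEADER + INNER_ETHERNET_HEADER
)


def max_inner_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits without fragmenting the outer packet."""
    return underlay_mtu - VXLAN_OVERHEAD


def fits_in_tunnel(inner_packet_size: int, underlay_mtu: int) -> bool:
    """True if an inner IP packet of this size fits through the tunnel."""
    return inner_packet_size <= max_inner_mtu(underlay_mtu)


print(VXLAN_OVERHEAD)              # 50
print(max_inner_mtu(1500))         # 1450
print(fits_in_tunnel(1500, 1500))  # False: full-size host packets do not fit
```

With a 1500-byte underlay, a host that still sends 1500-byte packets will see small transfers succeed and large ones fail, which is the "unexplained" symptom the slide describes.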


Microservices

● Applications are more distributed than ever, so connectivity is even more critical than before

● The short lifetimes of containers make post-mortem analysis harder


Deployment Speed

● Need the ability to make changes with confidence

● The network should not have to change to meet daily dynamic needs

○ In public/private clouds, for example, no adding and deleting of VLANs


Scale

● Automation is key, not a nice-to-have

● Rethink network design and architecture to align with automation

● Network as generic infrastructure

● Push everything not related to connectivity to the edge

○ Security, segmentation endpoints, services


Rise of Whitebox Switching

● Simple, uniform building blocks

● Merchant switching silicon

● Disaggregate hardware and software (NOS), just like servers


What’s Not Changed


The Ball of Technologies The Network Admins Deal With


What Network Admins Go To Bat With


What Makes Network Troubleshooting Particularly Difficult?

Network devices are appliances

Lack of a platform approach means the device is generally closed to any non-vendor program

Lack of programmatic access or structured output

CLI screen scraping

Packet forwarding happens in silicon, which has limited troubleshooting capabilities

In comparison to compute, where software is king


In Short...

Levels of abstraction have grown in modern data center networks without a corresponding increase in tools that break down those levels


Many Layers to Peel

From network architecture to configuration to diagnostics to forensics


Key Observations

Choose the right architecture to limit or eliminate problems

Use automation to eliminate random errors

Catch problems close to the source


Three-Step Process to Network Troubleshooting

Right Architecture => Eliminate errors due to poor design; simplify the design

Right Configuration => Eliminate errors due to complex configuration and complex automation scripts

Right Telemetry => Catch errors that have slipped through despite the rigor, and catch operational drift caused by changes beyond your control, such as aging effects (for example, cable faults)


The Appropriate Architecture


Key Takeaways

Modern DC architecture: build large networks out of simple building blocks

Simple building blocks significantly reduce the complexity of troubleshooting networks


Benefits of Simple, Common Building Blocks

“Google uses a very common set of building blocks across all of its software, so by instrumenting these building blocks Dapper is able to automatically generate a lot of useful trace information without any application involvement.”

- Dapper paper, 2010


Tackling Cabling Complexity in Clos Networks

● Catch miscabling errors as miscabling errors, not protocol or application errors


Verify Cabling: Prescriptive Topology Manager

[Figure: two-tier topology with a SPINE layer (S1, S2, S3, S4) and a LEAF layer (L1, L2, L3, L4, L5)]

graph G {
    S1:p1 -- L1:p1;
    S1:p2 -- L2:p1;
    S1:p3 -- L3:p1;
    S1:p4 -- L4:p1;
    S1:p5 -- L5:p1;
    S2:p1 -- L1:p2;
    S2:p2 -- L2:p2;
    ...
    S4:p5 -- L5:p4;
}

● Define expected topology using DOT language

● Verify connectivity per topology plan using LLDP

● Take dynamically defined actions based on match or mismatch of expected and actual

● https://github.com/CumulusNetworks/ptm
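The idea behind prescriptive cabling checks can be sketched in a few lines. This is not PTM's implementation: the regex handles only simple `node:port -- node:port` edges, and the `observed` mapping stands in for whatever LLDP neighbor data you collect. The names `parse_edges` and `check_cabling` are hypothetical.

```python
import re

# Matches simple edges such as "S1:p1 -- L1:p1" (a small subset of DOT).
EDGE = re.compile(r"(\w[\w-]*):(\w+)\s*--\s*(\w[\w-]*):(\w+)")


def parse_edges(dot_text):
    """Return {(node, port): (peer, peer_port)}, recorded in both directions."""
    links = {}
    for a, a_port, b, b_port in EDGE.findall(dot_text):
        links[(a, a_port)] = (b, b_port)
        links[(b, b_port)] = (a, a_port)
    return links


def check_cabling(expected, observed):
    """Return (local, expected_peer, actual_peer) for every mismatch.

    `observed` maps (node, port) to the neighbor reported by LLDP;
    a missing entry means no neighbor was seen on that port.
    """
    problems = []
    for local, want in sorted(expected.items()):
        got = observed.get(local)
        if got != want:
            problems.append((local, want, got))
    return problems


plan = """
graph G {
    S1:p1 -- L1:p1;
    S1:p2 -- L2:p1;
}
"""

# Hypothetical LLDP view: S1:p2 was plugged into L3 instead of L2.
seen = {
    ("S1", "p1"): ("L1", "p1"),
    ("L1", "p1"): ("S1", "p1"),
    ("S1", "p2"): ("L3", "p1"),
}

for local, want, got in check_cabling(parse_edges(plan), seen):
    print(f"{local}: expected {want}, saw {got}")
```

Reporting the mismatch per port is what lets a miscabling show up as a miscabling, rather than as a downstream protocol or application error.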


Network Telemetry


What Data Can We Gather?

Logs

Network state (can be configuration or runtime)


Logs

Pros

In theory, catch errors and warnings or exceptions

Mature tools now available to handle logs

ELK, Splunk

Cons

Usually box-specific

Errors that require fabric awareness can be hard to catch, and so can’t be easily logged. Example: Duplicate IP address, routing loop
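The duplicate IP example shows why a per-box log cannot catch this class of error: each device sees only its own addresses. A minimal sketch of the fabric-wide check follows, assuming you have already collected the configured addresses from every device. The function name and input shape are illustrative.

```python
from collections import defaultdict


def find_duplicate_ips(addresses_by_device):
    """Return {ip: [devices]} for every address configured on more than one device.

    `addresses_by_device` maps a device name to the IP addresses
    configured on it. Intentionally shared addresses (for example,
    anycast gateways) would need to be excluded by the caller.
    """
    owners = defaultdict(set)
    for device, addresses in addresses_by_device.items():
        for ip in addresses:
            owners[ip].add(device)
    return {ip: sorted(devs) for ip, devs in owners.items() if len(devs) > 1}


fabric = {
    "hostd-11": ["3.0.0.4"],
    "hostd-21": ["3.0.3.132"],
    "hosts-11": ["3.0.0.4"],  # misconfigured: same address as hostd-11
}
print(find_duplicate_ips(fabric))  # {'3.0.0.4': ['hostd-11', 'hosts-11']}
```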


Metrics

The good thing about metrics is that there are so many to gather

Brendan Gregg’s USE method is a good yardstick for deciding which metrics to gather:

“For every resource, check utilization, saturation, and errors.”

For example, applying USE to network interfaces:

Utilization: Basic RX/TX rates

Saturation: Buffer monitor stats per port

Errors: Drops, errors for RX/TX
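As a rough sketch of the utilization and errors parts of USE, the code below derives rates from two samples of interface counters. The field names mirror the Linux counters under /sys/class/net/<interface>/statistics, but the `Sample` type and `use_summary` function are assumptions for illustration. Saturation is left out because buffer statistics are hardware-specific, and counter wrap is not handled.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds
    rx_bytes: int
    tx_bytes: int
    rx_errors: int
    rx_dropped: int


def use_summary(prev: Sample, curr: Sample, speed_bps: float) -> dict:
    """Compute utilization and error rates between two counter samples."""
    interval = curr.timestamp - prev.timestamp
    if interval <= 0:
        raise ValueError("samples must be in increasing time order")
    rx_bps = (curr.rx_bytes - prev.rx_bytes) * 8 / interval
    tx_bps = (curr.tx_bytes - prev.tx_bytes) * 8 / interval
    bad_packets = (curr.rx_errors - prev.rx_errors) + (
        curr.rx_dropped - prev.rx_dropped
    )
    return {
        "rx_utilization": rx_bps / speed_bps,
        "tx_utilization": tx_bps / speed_bps,
        "rx_errors_per_sec": bad_packets / interval,
    }


before = Sample(0.0, 0, 0, 0, 0)
after = Sample(5.0, 625_000_000, 3_125_000_000, 10, 5)
print(use_summary(before, after, speed_bps=10e9))
```

Sampling every few seconds, as the later slides recommend, keeps short bursts from being averaged away.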


Metrics Usage

For troubleshooting:

For a network operator, network latency is probably the one thing that can be used as an indicator to determine whether suboptimal performance is due to the network

Other metrics and mechanisms come into play to isolate the problem in the network

For capacity planning:

Utilization and saturation metrics help you decide whether you are reaching network capacity


Metrics Dos and Don’ts

Gather your data as frequently as possible

The practical limit may be how quickly the hardware stats are updated

1-5 seconds is quite possible

Do not use SNMP

The first bullet prevents this anyhow

Do not aggregate data quickly

Use a good TSDB

InfluxDB and Prometheus are the ones I encounter the most


Packet Capture

Pros

Useful for identifying what sort of traffic is flowing through

For security compliance

For things like IDS

Cons

Relatively expensive: capturing all the traffic flowing through even a single switch (3.2 Tbps and increasing) is costly

sFlow and its cousins are better suited for identifying traffic

Makes the most sense when used reactively, in troubleshooting


Network State

Properly designed, can be a good balance between packet capture and formal verification to answer questions such as:

Did this change break my network?

Was there a forwarding loop at 10 pm last night?

Show me the changes between 1h and 2h
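Questions like these become simple lookups once network state is stored as timestamped snapshots. The sketch below keeps route-table snapshots in memory and diffs any two points in time. `StateHistory` and its methods are hypothetical names; a real system would persist the data and cover far more than routes.

```python
import bisect


class StateHistory:
    """Timestamped snapshots of a route table (prefix -> next hop)."""

    def __init__(self):
        self._times = []
        self._snapshots = []

    def record(self, timestamp, routes):
        """Store a snapshot; snapshots must be recorded in time order."""
        self._times.append(timestamp)
        self._snapshots.append(dict(routes))

    def at(self, timestamp):
        """Return the latest snapshot taken at or before `timestamp`."""
        i = bisect.bisect_right(self._times, timestamp)
        return self._snapshots[i - 1] if i else {}

    def changes(self, start, end):
        """Return (added, removed, changed) between two points in time."""
        before, after = self.at(start), self.at(end)
        added = {k: after[k] for k in after.keys() - before.keys()}
        removed = {k: before[k] for k in before.keys() - after.keys()}
        changed = {
            k: (before[k], after[k])
            for k in before.keys() & after.keys()
            if before[k] != after[k]
        }
        return added, removed, changed


history = StateHistory()
history.record(3600, {"10.0.0.0/24": "spine-1", "10.0.1.0/24": "spine-2"})
history.record(7200, {"10.0.0.0/24": "spine-3", "10.0.2.0/24": "spine-2"})

added, removed, changed = history.changes(3600, 7200)
print(added)    # {'10.0.2.0/24': 'spine-2'}
print(removed)  # {'10.0.1.0/24': 'spine-2'}
print(changed)  # {'10.0.0.0/24': ('spine-1', 'spine-3')}
```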


Problem Remains

Lots of data can be gathered

Ability to correlate across these is still mostly lacking

E.g., graphs show a drop in interface throughput without showing, at the same time, an annotation indicating that the drop is because a link failed

Building actionable alerts remains elusive

Many customers tell me that they essentially ignore alerts due to the high false positive rate
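The missing correlation in the throughput example can be sketched as a join between a metric series and an event log on time. The thresholds, the function name `annotate_drops`, and the input shapes are all assumptions chosen for illustration.

```python
def annotate_drops(samples, events, drop_ratio=0.5, window=10.0):
    """Attach nearby events to sudden throughput drops.

    samples: list of (timestamp, bits_per_second), sorted by time
    events:  list of (timestamp, message), e.g. link state changes
    A drop is a sample below `drop_ratio` times the previous sample;
    events within `window` seconds of the drop are attached to it.
    """
    annotated = []
    for (_, prev_bps), (ts, bps) in zip(samples, samples[1:]):
        if prev_bps > 0 and bps < prev_bps * drop_ratio:
            causes = [msg for ev_ts, msg in events if abs(ev_ts - ts) <= window]
            annotated.append((ts, causes))
    return annotated


throughput = [(0, 100), (5, 100), (10, 20), (15, 20)]
log = [(8, "swp1 link down"), (100, "config saved")]
print(annotate_drops(throughput, log))  # [(10, ['swp1 link down'])]
```

A drop that comes back with an empty cause list is the kind worth alerting on; one explained by a known link failure is not a separate incident.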


Troubleshooting Tools


The Problem

● Network admins are typically contacted for one of two cases:

○ A can’t talk to B, or

○ A can talk to B, but sub-optimally

● Also check for proper network segmentation


What Tool for What Problem?

A can’t talk to B:

Network state may be the most useful thing for identifying the problem

A can talk to B, but suboptimally:

This is a performance issue, and the metrics gathered can be used to identify the problem

Checking compliance, for example that traffic does not leak across virtual networks (VLAN, VRF, or VxLAN):

Network state is the most helpful for answering this question


One Simple Step: Make Servers Discoverable

Enable LLDP on server

If using lldpd, configure it to send ifname

Add a file called portidsubtype.conf to /etc/lldpd.d with contents:

    configure lldp portidsubtype ifname

Restart lldpd via sudo systemctl restart lldpd (or equivalent)

With PTM, this enables cabling verification to servers too


Traceroute family

Tools: traceroute, mtr, tracepath, traceroute-paris, traceroute-dublin, scamper

ECMP support: traceroute, traceroute-paris/dublin, scamper

PMTU support: traceroute, tracepath, mtr

NAT detection: traceroute-dublin

IPv6 Support: All except traceroute-dublin

VxLAN support: None
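The support list above is easier to query as a small lookup table. The sketch below encodes exactly what the slide states, using the slide's tool names; the helper `tools_supporting` is a made-up name.

```python
TOOLS = [
    "traceroute",
    "mtr",
    "tracepath",
    "traceroute-paris",
    "traceroute-dublin",
    "scamper",
]

# Feature support as listed on the slide.
SUPPORT = {
    "ecmp": {"traceroute", "traceroute-paris", "traceroute-dublin", "scamper"},
    "pmtu": {"traceroute", "tracepath", "mtr"},
    "nat": {"traceroute-dublin"},
    "ipv6": set(TOOLS) - {"traceroute-dublin"},
    "vxlan": set(),
}


def tools_supporting(*features):
    """Return the tools that support every requested feature, in slide order."""
    return [t for t in TOOLS if all(t in SUPPORT[f] for f in features)]


print(tools_supporting("ecmp", "pmtu"))  # ['traceroute']
print(tools_supporting("ecmp", "ipv6"))  # ['traceroute', 'traceroute-paris', 'scamper']
print(tools_supporting("vxlan"))         # []
```

The empty result for VxLAN is the gap the earlier slide on tunnels points at: none of these tools sees through the overlay.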


NetQ

Designed for Linux-based networking devices and hosts

An open framework with a paid analysis engine

Users can build their own analysis engine or customize

Designed around the modern data center use case

Simplify codifying validation

Simplify troubleshooting

Codify troubleshooting

Time machine debugging (or DVR) included

NetQ Architecture

[Figure: NetQ architecture. Labels: Ubuntu 16.04, RHEL 7, CentOS 7; several nodes marked Q; NetQ Telemetry Server]


NetQ: Fabric Change Log

[Figure: Linux kernel (L3, L2, VxLAN) and NetQ]

Example events: New Route Added, OSPF Neighbor Change, MAC Address Removed

See state now or at any point in the past


NetQ: Analysis Engine

• Validate Current State

▪ BGP

▪ OSPF

▪ MTU

▪ mLAG

▪ VxLAN

• Telemetry Server analyzes entire network state

Cumulus Networks Confidential


NetQ: Intelligent Visibility

• View remote information

▪ IPs

▪ MACs

▪ OS

▪ System Specs

• Improve Command Outputs

▪ Resolve hostnames in any Linux command

▪ No need for DNS


NetQ: Advanced Notification

• NetQ Notifier Service

• Automatically Alert on Check Failures

▪ Syslog

▪ ChatOps (Slack)

▪ ELK

▪ Splunk


Summary

Network troubleshooting remains hard for most people

The modern data center has the potential to make troubleshooting both simpler and more complex

Avoiding troubleshooting is better than troubleshooting

The right architecture and configuration models go a long way in addressing this

Correlating across network state, logs and metrics is still beyond the reach of most network operators

Newer tools are on the rise to address this


Thank you!

Visit us at cumulusnetworks.com, follow us @cumulusnetworks, or join slack.cumulusnetworks.com

© 2017 Cumulus Networks. Cumulus Networks, the Cumulus Networks Logo, and Cumulus Linux are trademarks or registered trademarks of Cumulus

Networks, Inc. or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. The registered trademark

Linux® is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.
