troubleshooting tracebacks

No Valid Host Was Found: Troubleshooting tracebacks and other common failure scenarios

Upload: james-denton

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Troubleshooting Tracebacks

No Valid Host Was Found: Troubleshooting tracebacks and other common failure scenarios

Page 2: Troubleshooting Tracebacks

The presenters…

Wade Lewis, OpenStack Architect

James Denton, Principal Architect

Page 3: Troubleshooting Tracebacks

What we’re here to talk about…

Troubleshooting OpenStack Issues:

• Tracebacks

• Common Nova issues

• Common Neutron Issues

Page 4: Troubleshooting Tracebacks

Slides available at SlideShare

These slides will be available at the following location after this presentation:

http://www.slideshare.net/JamesDenton1

Page 5: Troubleshooting Tracebacks

OpenStack is complex

OpenStack is a complex system:

• Many moving parts

• Limited visibility into problems via the API

Page 6: Troubleshooting Tracebacks

Troubleshooting methods

Page 7: Troubleshooting Tracebacks

What is a traceback?

Page 8: Troubleshooting Tracebacks

Traceback 101

• When errors occur, sometimes exceptions are raised.

• When an exception is caught, the error and the list of functions that led to it are logged. This is a traceback.

• The traceback output can be useful to operators and developers and allows them to trace the steps to the error.

• As you’ll see, a traceback doesn’t always provide clear insight into the real error.
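To make the idea concrete, here is a minimal Python sketch of an exception being raised, caught, and logged together with its traceback. The function names `connect_to_database` and `start_service` and the address are hypothetical stand-ins, not code from any OpenStack service.

```python
import logging
import traceback

logging.basicConfig(level=logging.ERROR)
LOG = logging.getLogger("traceback_demo")


def connect_to_database(host):
    # Hypothetical helper that always fails, standing in for a real driver call.
    raise ConnectionError(f"Can't connect to MySQL server on '{host}'")


def start_service(host):
    # One extra frame so the traceback shows the path that led to the error.
    return connect_to_database(host)


def run(host):
    """Return the formatted traceback text if start_service() fails, else None."""
    try:
        start_service(host)
    except ConnectionError:
        # LOG.exception() logs the message followed by the full traceback.
        LOG.exception("Service failed to start")
        return traceback.format_exc()
    return None
```

Calling `run("192.0.2.10")` returns text that starts with `Traceback (most recent call last):`, lists the `start_service` and `connect_to_database` frames in call order, and ends with the exception itself.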

Page 9: Troubleshooting Tracebacks

Deciphering a traceback is a bit like reading the Matrix

Page 11: Troubleshooting Tracebacks

Tips on reading a traceback

Read from the bottom to the top

– The last few lines are the most relevant

In this case, within the init function the program was unable to connect to MySQL.
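The bottom-up reading habit can be sketched in a few lines of Python. The helper below pulls out the innermost frame and the final error line from traceback text; the `SAMPLE` traceback is invented for illustration (file paths, line numbers, and address are assumptions) and is not the one shown on the slide.

```python
import re

# Matches the frame header lines that Python prints in a traceback.
FRAME_RE = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def summarize_traceback(text):
    """Return (innermost_function, final_error_line) from Python traceback text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty traceback")
    error_line = lines[-1].strip()           # exception type and message
    innermost = None
    for ln in lines:                          # frames are listed oldest first
        match = FRAME_RE.match(ln)
        if match:
            innermost = match.group("func")   # keep overwriting; the last one wins
    return innermost, error_line


# Hypothetical traceback used only to exercise the helper.
SAMPLE = '''Traceback (most recent call last):
  File "/opt/demo/service.py", line 40, in main
    server = Server(conf)
  File "/opt/demo/service.py", line 12, in __init__
    self.db = connect(conf.db_url)
OperationalError: (2003, "Can't connect to MySQL server on '192.0.2.10'")
'''
```

For `SAMPLE`, `summarize_traceback` reports `__init__` as the innermost function and the `OperationalError` line as the error, which are the two facts an operator usually needs first.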

Page 13: Troubleshooting Tracebacks

Nova

Page 14: Troubleshooting Tracebacks

“No valid host was found. What the heck does that mean?!”

Page 15: Troubleshooting Tracebacks

No valid host was found

Page 16: Troubleshooting Tracebacks

No valid host was found

This error is likely seen when booting an instance. Common reasons for failing:

• There really are no hosts available

• Networking issues on compute node

• Lack of resources

Page 17: Troubleshooting Tracebacks

So you spun up an instance…

Page 18: Troubleshooting Tracebacks

Identify the host. If there is one…

Page 19: Troubleshooting Tracebacks

Check the compute logs on the compute node

Page 20: Troubleshooting Tracebacks

Check the networking logs on the compute node

Page 21: Troubleshooting Tracebacks

Bringing it back together

!! WARNING !!

The following example may not utilize Python or Neutron coding best practices.

Page 22: Troubleshooting Tracebacks

Bringing it back together

Let’s take a look at that traceback:

Page 23: Troubleshooting Tracebacks

Bringing it back together

Taking a look at the function, we can see there is no exception handling:

Page 24: Troubleshooting Tracebacks

Bringing it back together

By adding some exception handling to the function…

Page 25: Troubleshooting Tracebacks

Bringing it back together

… we get a nice, clean error that clearly indicates what is wrong
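The code from the slides is not reproduced in this transcript, so the following is only a rough Python sketch of the same idea: one lookup without exception handling and one that turns the failure into a single clear message. Every name here (`InterfaceNotFound`, `get_interface`, the mapping layout) is hypothetical and is not Neutron's actual code.

```python
import logging

LOG = logging.getLogger("agent_demo")


class InterfaceNotFound(Exception):
    """Raised when a mapped physical interface is missing on the host."""


def get_interface_unhandled(mappings, physnet, existing):
    # No exception handling: a bad mapping surfaces as a bare KeyError
    # with no hint about which configuration value caused it.
    iface = mappings[physnet]
    return existing[iface]


def get_interface(mappings, physnet, existing):
    """Same lookup, but failures become one clear, logged error message."""
    try:
        iface = mappings[physnet]
    except KeyError:
        raise InterfaceNotFound(
            f"No interface mapping configured for physical network '{physnet}'"
        ) from None
    if iface not in existing:
        message = f"Interface '{iface}' mapped to '{physnet}' does not exist on this host"
        LOG.error(message)
        raise InterfaceNotFound(message)
    return existing[iface]
```

Here `existing` is assumed to be a dictionary of interface names present on the host. With a mapping that points at a missing interface, the unhandled version raises a bare `KeyError`, while the handled version names both the interface and the physical network.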

Page 26: Troubleshooting Tracebacks

Bringing it back together

Interface mappings can be found in the ML2 configuration file:

If eht2 does not exist on this host, the Neutron agent may be unable to complete the network configuration.
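A check like this can be scripted. The sketch below assumes the Linux bridge agent layout, where the mappings live in a `physical_interface_mappings` option under a `[linux_bridge]` section; the section and option names are parameters because other agents use different ones, and the physical network names in the usage note are made up.

```python
import configparser
import os


def read_interface_mappings(ini_text, section="linux_bridge",
                            option="physical_interface_mappings"):
    """Parse 'physnet:interface' pairs from agent configuration text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw = parser.get(section, option, fallback="")
    mappings = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue
        physnet, _, iface = pair.partition(":")
        mappings[physnet.strip()] = iface.strip()
    return mappings


def missing_interfaces(mappings, exists=None):
    """Return the mapped interfaces that are not present on this host."""
    if exists is None:
        # On Linux every network interface has a directory under /sys/class/net.
        def exists(name):
            return os.path.isdir(os.path.join("/sys/class/net", name))
    return sorted(iface for iface in mappings.values() if not exists(iface))
```

Given a configuration containing `physical_interface_mappings = physnet1:eth1, physnet2:eht2` on a host that only has `eth0` and `eth1`, `missing_interfaces` returns `["eht2"]`. The `exists` argument is there so the check can be tested without touching the real host.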

Page 27: Troubleshooting Tracebacks

Next example: When there isn’t a host…

Page 28: Troubleshooting Tracebacks

Check the scheduler and conductor logs

• /var/log/nova/nova-scheduler.log

• /var/log/nova/nova-conductor.log

Page 29: Troubleshooting Tracebacks

First pass: 2015-10-16 17:14:18

Second pass: 2015-10-16 17:10:10

Page 30: Troubleshooting Tracebacks

NTP! NTP! NTP!

Inconsistent time between hosts causes wonky behavior:

• Services and agents can appear DOWN when they’re UP

• Service and agent flapping can cause scheduling issues
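A quick way to see how large the gap is: the sketch below compares the two timestamps from the scheduler example, on the assumption that they were written by hosts whose clocks disagree. The 60-second threshold is also an assumption, chosen to mirror Nova's default `service_down_time`; check the value in your own configuration.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def clock_skew_seconds(first, second):
    """Return how far apart two log timestamps are, in seconds."""
    delta = datetime.strptime(first, TIME_FORMAT) - datetime.strptime(second, TIME_FORMAT)
    return abs(delta.total_seconds())


# Timestamps from the slide: the "second pass" is stamped earlier than the first.
skew = clock_skew_seconds("2015-10-16 17:14:18", "2015-10-16 17:10:10")
print(skew)  # 248.0

# Assumption: 60 seconds, matching Nova's default service_down_time.
SERVICE_DOWN_TIME = 60
print(skew > SERVICE_DOWN_TIME)  # True
```

A skew of roughly four minutes is well past that threshold, which is consistent with services being reported DOWN while they are actually running.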

Page 31: Troubleshooting Tracebacks

Neutron

Page 32: Troubleshooting Tracebacks

Neutron architecture

Neutron is composed of various services and agents responsible for building and maintaining the virtual network:

Failures can occur at any point.

Page 33: Troubleshooting Tracebacks

DHCP Agent

Page 34: Troubleshooting Tracebacks

Neutron architecture

The DHCP agent is responsible for:

• Creating network namespaces

• Configuring dnsmasq – a DHCP server

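When instances are created, their IPs are statically assigned: dnsmasq hands each port the fixed address Neutron already allocated to it rather than picking one from a dynamic pool.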

Page 35: Troubleshooting Tracebacks

Neutron architecture

Failures of the DHCP agent on a host can result in:

• Instances not getting their initial lease

• Instances not renewing their lease

Page 36: Troubleshooting Tracebacks

Dnsmasq Basics

As subnets and ports are created, the DHCP agent is responsible for configuring the files used by dnsmasq to provide DHCP services to the network:

• /var/lib/neutron/dhcp/<network_uuid>/host

When dnsmasq hands out the lease, it updates its active lease database.
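When checking whether a port made it into the host file, a small parser can help. This sketch assumes each line has the three-field `mac,hostname,ip` layout commonly seen for IPv4 ports; releases that add extra fields, tags, or IPv6 entries would need a more careful parser. The sample addresses are invented.

```python
def parse_dnsmasq_hosts(text):
    """Parse 'mac,hostname,ip' lines into a {mac: ip} dictionary."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        if len(fields) != 3:
            continue  # skip anything that does not match the assumed layout
        mac, _hostname, ip = fields
        entries[mac.lower()] = ip
    return entries


def has_reservation(text, mac):
    """Return True if the host file content contains an entry for this MAC."""
    return mac.lower() in parse_dnsmasq_hosts(text)
```

If an instance never receives a lease, a missing entry for its MAC address points at the DHCP agent rather than at the switching layer.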

Page 37: Troubleshooting Tracebacks

Dnsmasq Basics

By default, dnsmasq writes its logs to:

• /var/log/syslog (Ubuntu, Debian)

• /var/log/messages (RHEL, CentOS, Fedora)

Page 38: Troubleshooting Tracebacks

Troubleshooting DHCP

If there are issues obtaining an IP, start with packet captures on the following devices:

• Compute node:
– Tap interface
– Bridge interface
– Physical interface

• Network node:
– Physical interface
– Bridge interface
– Veth interface
– Namespace interface

Listen on UDP ports 67 and 68. You should see the full DHCP cycle in the packet capture on most interfaces.
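The capture commands for each device can be generated rather than typed by hand. The sketch below builds a `tcpdump` argument list for one interface and optionally wraps it in `ip netns exec` for interfaces that live inside a namespace. The interface and namespace names in the usage note are placeholders, not values from the presentation.

```python
def dhcp_capture_command(interface, namespace=None):
    """Build a tcpdump command line that captures DHCP traffic on one interface."""
    command = [
        "tcpdump", "-n", "-e", "-i", interface,
        "udp", "port", "67", "or", "udp", "port", "68",
    ]
    if namespace:
        # Interfaces inside a network namespace are only visible from within it.
        command = ["ip", "netns", "exec", namespace] + command
    return command


def capture_plan(interfaces, namespace_interfaces=None, namespace=None):
    """Return one printable command per interface, in the order given."""
    plan = [" ".join(dhcp_capture_command(name)) for name in interfaces]
    for name in namespace_interfaces or []:
        plan.append(" ".join(dhcp_capture_command(name, namespace)))
    return plan
```

For example, `capture_plan(["tap1a2b3c4d-5e", "eth1"])` yields two commands for a compute node, and passing `namespace="qdhcp-<network_uuid>"` with the namespace interface covers the last hop on the network node. The `-e` flag prints MAC addresses, which makes it easier to follow one instance through each device.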

Page 39: Troubleshooting Tracebacks

Troubleshooting DHCP – Packet Captures

• Working example:

• Non-working example:

When DHCP isn’t working, investigate the switching layer or dnsmasq.
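The capture screenshots are not part of this transcript, so here is a small Python sketch of the comparison they illustrate: a working capture contains the full Discover, Offer, Request, ACK cycle in order, and a broken one stops partway. The message lists in the usage note are invented examples.

```python
DHCP_CYCLE = ["DISCOVER", "OFFER", "REQUEST", "ACK"]


def missing_dhcp_steps(observed):
    """Return the steps of the DHCP cycle that never appear, in order, in a capture."""
    position = 0
    for message in observed:
        if position < len(DHCP_CYCLE) and message.upper() == DHCP_CYCLE[position]:
            position += 1
    return DHCP_CYCLE[position:]
```

`missing_dhcp_steps(["Discover", "Offer", "Request", "ACK"])` returns an empty list, while a capture showing only repeated Discover messages returns `["OFFER", "REQUEST", "ACK"]`, meaning requests reach that interface but no server reply comes back through it.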

Page 40: Troubleshooting Tracebacks

Now we know how it works…

Page 41: Troubleshooting Tracebacks

… we’ve got a live bug: DHCPNAK!

Page 42: Troubleshooting Tracebacks

Troubleshooting DHCP – DHCPNAK Issues

I see DHCPNAK packets. HELP!

• Likely means the DHCP agent was restarted and the active lease file was deleted

• Instances may receive a DHCPNAK when requesting or renewing an address

• This may result in delayed or no connectivity

• Addressed in the patch for bug #1345947, which makes dnsmasq renew the lease anyway, without sending a NAK, and repopulate its lease file

Page 43: Troubleshooting Tracebacks

Troubleshooting DHCP – DHCPNAK Issues

When a network is scheduled to more than one DHCP agent, there may be issues:

• That fix expected only one DHCP server in the network!

• The DHCPREQUEST packet sent on renewal attempt is received by all DHCP agents (it’s a broadcast, after all)

• The renewal attempt is accepted by the agent that provided the original lease

• At the same time, the renewal attempt is rejected by the agent that didn’t provide the original lease

Page 44: Troubleshooting Tracebacks

Troubleshooting DHCP – DHCPNAK Issues

The end result? The client honors the DHCPNAK and restarts the DHCP process.

Page 45: Troubleshooting Tracebacks

Troubleshooting DHCP – DHCPNAK Issues

However, there is hope!

• Bug 1457900 addresses the multiple DHCP agent issue

• The fix is to pre-populate the dnsmasq leases file on all DHCP agents with all known MACs/IPs for respective networks

• Fixed in Liberty, coming to a backport near you!
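The multi-agent problem and the pre-population fix can be illustrated with a toy model. This is a deliberately simplified simulation, not dnsmasq's real decision logic: each agent is reduced to a lease table, and the class and function names are hypothetical.

```python
class DhcpAgent:
    """Toy model of one dnsmasq instance and its lease database."""

    def __init__(self, name, leases=None):
        self.name = name
        self.leases = dict(leases or {})  # MAC address -> IP address

    def handle_renewal(self, mac, ip):
        # In this model, a server with no record of the lease rejects the renewal.
        return "DHCPACK" if self.leases.get(mac) == ip else "DHCPNAK"


def broadcast_renewal(agents, mac, ip):
    """A renewing client's broadcast DHCPREQUEST reaches every agent on the network."""
    return {agent.name: agent.handle_renewal(mac, ip) for agent in agents}


MAC, IP = "fa:16:3e:aa:bb:01", "10.0.0.5"

# Only agent-a handed out the original lease, so agent-b answers with a NAK.
before = broadcast_renewal(
    [DhcpAgent("agent-a", {MAC: IP}), DhcpAgent("agent-b")], MAC, IP
)

# With the lease pre-populated on every agent, both acknowledge the renewal.
after = broadcast_renewal(
    [DhcpAgent("agent-a", {MAC: IP}), DhcpAgent("agent-b", {MAC: IP})], MAC, IP
)
```

In the `before` case the client receives both an ACK and a NAK for the same request; in the `after` case every agent knows the MAC/IP pair and no NAK is sent.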

Page 46: Troubleshooting Tracebacks

L2 Agent

Page 47: Troubleshooting Tracebacks

Neutron architecture

The L2 agent is responsible for:

• Programming the virtual switching infrastructure

• Applying security groups

Page 48: Troubleshooting Tracebacks

Neutron architecture

Failures of the L2 agent on a host can result in:

• Lack of instance connectivity

• Security group issues

• ERROR state during nova boot

Page 49: Troubleshooting Tracebacks

Troubleshooting OVS connections

When troubleshooting L2 connectivity issues, run packet captures on highlighted interfaces:

Page 50: Troubleshooting Tracebacks

Troubleshooting OVS connections

Every interface plugged into the integration bridge should have a local VLAN ID that is unique to that node, no matter what the network type (VLAN, flat, local, VXLAN, GRE):

If the tag is missing, try restarting the OVS agent to force a rebuild of the integration bridge VLAN tagging and corresponding flows.

Page 51: Troubleshooting Tracebacks

Troubleshooting OVS connections

If you see an OVS port in VLAN 4095, it typically means that the agent was unable to find a corresponding Neutron port in the database:

When this happens, it usually means that the port was deleted from the DB manually or as part of another action that did not complete successfully.
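Both symptoms, a missing tag and a port parked in VLAN 4095, can be spotted by scanning `ovs-vsctl show` output. The sketch below is a rough parser that assumes the usual indented `Bridge` / `Port` / `tag:` layout of that command; the bridge name `br-int` and the port-name prefixes are assumptions, since internal and patch ports legitimately carry no tag.

```python
def port_tags(ovs_vsctl_show_output, bridge="br-int"):
    """Map each port on `bridge` to its local VLAN tag (None when untagged)."""
    tags = {}
    current_bridge = None
    current_port = None
    for raw in ovs_vsctl_show_output.splitlines():
        line = raw.strip()
        if line.startswith("Bridge "):
            current_bridge = line.split(None, 1)[1].strip('"')
            current_port = None
        elif line.startswith("Port ") and current_bridge == bridge:
            current_port = line.split(None, 1)[1].strip('"')
            tags[current_port] = None
        elif line.startswith("tag:") and current_bridge == bridge and current_port:
            tags[current_port] = int(line.split(":", 1)[1])
    return tags


def suspicious_ports(tags, prefixes=("tap", "qvo", "qr-", "qg-")):
    """Return Neutron-facing ports that are untagged or sitting in VLAN 4095."""
    return sorted(
        port for port, tag in tags.items()
        if port.startswith(prefixes) and (tag is None or tag == 4095)
    )
```

Feeding the command's output into `port_tags` and then `suspicious_ports` gives a short list of ports worth checking against the Neutron database before restarting the agent.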

Page 52: Troubleshooting Tracebacks

Troubleshooting OVS connections

Useful commands include:

• ovs-vsctl show
– High-level view of virtual bridges on the respective node
– Shows local VLAN IDs for each port

• ovs-ofctl dump-flows BRIDGE
– Shows the flow rules for the respective bridge
– The flow rules determine how traffic is manipulated and forwarded

• ovs-ofctl show BRIDGE
– Port-level view of the respective virtual switch
– Shows port IDs on the bridge. Useful when reading flows.

Page 53: Troubleshooting Tracebacks

Troubleshooting LinuxBridge connections

When troubleshooting L2 connectivity issues, run packet captures on highlighted interfaces:

Page 54: Troubleshooting Tracebacks

Troubleshooting LinuxBridge connections

In a working environment, every interface will connect to a bridge that corresponds to a Neutron network:

[Figure: bridges for Network A (VXLAN network) and Network B (VLAN network)]

If a bridge is missing, check the agent log to see if there is an error.

Page 55: Troubleshooting Tracebacks

Troubleshooting LinuxBridge connections

Useful commands include:

• brctl show
– High-level view of virtual bridges on the respective node
– One bridge for each network

• bridge fdb show
– Shows the bridge forwarding database
– Useful for knowing how MAC addresses are reached

• ip neigh show
– Shows the ARP cache

Page 56: Troubleshooting Tracebacks

Binding Failed is back!

• Usually seen when booting an instance or attaching an interface

• Typically the result of Neutron misconfiguration or agent issues

• Not limited to just instance ports

Unexpected vif_type=binding_failed

Page 57: Troubleshooting Tracebacks

Binding Failed is back!

In this example, both the DHCP and L3 agent ports were in binding_failed status:

Page 58: Troubleshooting Tracebacks

Binding Failed is back!

In this case, a look at the L2 agent log shows the misconfiguration:

If the agent is stopped or in a restart loop, port bindings will likely fail.

Page 59: Troubleshooting Tracebacks

Binding Failed: The Fallout

For existing DHCP and L3 ports you may need to:

• Fix router port:
– Unschedule tenant network from L3 agent
– Reschedule tenant network to L3 agent
– This creates new port

• Fix DHCP port:
– Unschedule tenant network from DHCP agent
– Delete DHCP port
– Reschedule tenant network to DHCP agent
– This creates new port

Page 60: Troubleshooting Tracebacks

L2 agent troubleshooting tips

Tips:

• Check to make sure the respective L2 agent is configured properly and is running (not restarting!)

• Make sure OVS is running (if applicable)

• Check the Neutron agent logs
– /var/log/neutron/neutron-*-linuxbridge-agent.log
– /var/log/neutron/neutron-*-openvswitch-agent.log

Page 61: Troubleshooting Tracebacks

L3 Agent

Page 62: Troubleshooting Tracebacks

Neutron architecture

The L3 agent is responsible for:

• Creating network namespaces for each router

• Providing routing between networks

• Providing NAT to instances

Page 63: Troubleshooting Tracebacks

Neutron architecture

Failures of the L3 agent on a host can result in:

• Failure to route traffic

• Floating IPs not functioning

Page 64: Troubleshooting Tracebacks

Neutron architecture

Page 65: Troubleshooting Tracebacks

L3 agent troubleshooting tips

Tips:

• Check to make sure the L3 agent is running and configured properly

• Perform packet captures within the router namespace and on other interfaces to observe traffic entering and leaving the router

• Check iptables within the router namespace to verify that the proper rules have been created

• Check the Neutron L3 agent log:
– /var/log/neutron/l3-agent.log
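The namespace checks above all share the same `ip netns exec` prefix, so they are easy to generate. This sketch assumes the conventional `qrouter-<router_id>` namespace name; the router ID and interface name passed in are placeholders you would replace with real values.

```python
def router_namespace_commands(router_id, interface=None):
    """Build commands for inspecting a Neutron router namespace."""
    prefix = ["ip", "netns", "exec", f"qrouter-{router_id}"]
    commands = {
        "addresses": prefix + ["ip", "addr", "show"],
        "routes": prefix + ["ip", "route", "show"],
        "nat_rules": prefix + ["iptables", "-t", "nat", "-S"],
    }
    if interface:
        # Capture on one router interface to watch traffic enter or leave.
        commands["capture"] = prefix + ["tcpdump", "-n", "-i", interface]
    return commands
```

Each value is an argument list that can be joined with spaces for copy and paste or handed to `subprocess.run` on the network node.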

Page 66: Troubleshooting Tracebacks

More Neutron…

Page 67: Troubleshooting Tracebacks

MTU

If the plumbing looks good, but you still experience connectivity issues to instances over certain protocols, it may be worth checking the MTU size.

• Overlay network headers can cause packets to exceed the MTU

• Often manifests itself as SSH issues

• Try ssh -v to see where it hangs

• Pass a lower MTU with DHCP option 26
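The arithmetic behind the lower MTU is simple enough to sketch. The overhead values below assume an IPv4 underlay, and the `dhcp-option-force` line is one way to have dnsmasq advertise option 26; where that line goes in your deployment is outside what the slides cover.

```python
# Bytes consumed by tunnel encapsulation on an IPv4 underlay.
OVERLAY_OVERHEAD = {
    "vxlan": 50,  # outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet 14
    "gre": 42,    # outer IPv4 20 + GRE 4 + GRE key 4 + inner Ethernet 14
}


def instance_mtu(physical_mtu, network_type):
    """Largest MTU an instance can use without exceeding the physical MTU."""
    return physical_mtu - OVERLAY_OVERHEAD[network_type]


def dnsmasq_mtu_option(mtu):
    """Render a dnsmasq configuration line that advertises the MTU via DHCP option 26."""
    return f"dhcp-option-force=26,{mtu}"


print(instance_mtu(1500, "vxlan"))                       # 1450
print(dnsmasq_mtu_option(instance_mtu(1500, "vxlan")))   # dhcp-option-force=26,1450
```

With a standard 1500-byte physical MTU, instances on a VXLAN network need 1450 or lower, which explains why small packets (an SSH handshake) succeed while larger ones (the key exchange) hang.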

Page 68: Troubleshooting Tracebacks

Don’t forget security groups!

When things are plumbed up correctly and everything looks normal, there may be an issue with security group rules.

• Try applying a test rule

• Test connectivity from a namespace

• Verify iptables on compute nodes

• L2 agents are responsible for applying rules

Page 69: Troubleshooting Tracebacks

Neutron architecture

Other issues can only be observed at scale:

• Race conditions

• System limits too low

• No disk space available

• Syslog is your friend

Page 70: Troubleshooting Tracebacks

Takeaways

Page 71: Troubleshooting Tracebacks

Neutron failures

Many common Neutron failures can be traced back to misconfigurations of the:

• Neutron configuration file

• ML2 configuration file

• Interface configuration files

Page 72: Troubleshooting Tracebacks

Takeaways

Get familiar with the underlying technologies:

• KVM

• Open vSwitch

• Linux bridging

• iptables

Page 73: Troubleshooting Tracebacks

Takeaways

Familiarize yourself with a working environment so that you know how to spot an issue.

Page 74: Troubleshooting Tracebacks

Takeaways

• Turn on DEBUG mode

• Check syslog

• Start services by hand

• Start out with simple configurations

• Reach out to the community

• Gather as much information as possible before submitting a bug

Page 75: Troubleshooting Tracebacks

Don’t be afraid to break things

Page 76: Troubleshooting Tracebacks

Stop by the Rackspace booth in the marketplace

Free book giveaways at the Rackspace booth during the morning and afternoon breaks!

Page 78: Troubleshooting Tracebacks

ONE FANATICAL PLACE | SAN ANTONIO, TX 78218 | US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM

© RACKSPACE LTD. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES. | WWW.RACKSPACE.COM

Thank you