troubleshooting tracebacks
NO VALID HOST WAS FOUND
Troubleshooting tracebacks and other common failure scenarios
The presenters…
Wade Lewis, OpenStack Architect
James Denton, Principal Architect
What we’re here to talk about…
Troubleshooting OpenStack Issues:
• Tracebacks
• Common Nova issues
• Common Neutron Issues
Slides available at SlideShare
These slides will be available at the following location after this presentation:
http://www.slideshare.net/JamesDenton1
OpenStack is complex
OpenStack is a complex system:
• Many moving parts
• Limited visibility to problems via API
Troubleshooting methods
What is a traceback?
Traceback 101
• When errors occur, sometimes exceptions are raised.
• When an exception is caught, the error and the list of functions that led to it are logged. This is a traceback.
• The traceback output can be useful to operators and developers and allows them to trace the steps to the error.
• As you’ll see, a traceback doesn’t always provide clear insight into the real error.
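As a minimal sketch of the points above, the following shows an exception being raised, caught, and logged together with its traceback. The function names and the database host are hypothetical and exist only for illustration; `logging.exception` is the standard library call that records the message plus the traceback.

```python
import io
import logging


def connect_to_database(host):
    # Hypothetical failure: pretend the database is unreachable.
    raise ConnectionError(f"cannot reach database at {host}")


def init_service():
    # One more level in the call chain, so the traceback has several frames.
    connect_to_database("db.example.invalid")


def run(log):
    try:
        init_service()
    except ConnectionError:
        # Records the message followed by the full traceback.
        log.exception("service failed to start")


# Send log output to an in-memory buffer so it can be inspected.
buffer = io.StringIO()
logger = logging.getLogger("traceback_demo")
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(buffer))

run(logger)
logged = buffer.getvalue()
print(logged)
```

The logged text lists `run`, `init_service`, and `connect_to_database` in call order, with the exception itself on the last line.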
Deciphering a traceback is a bit like reading the Matrix
Tips on reading a traceback
Read from the bottom to the top: the last few lines are the most relevant.
In this case, within the init function the program was unable to connect to MySQL.
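The bottom-up reading can be sketched in code. The helper below, `summarize_traceback`, is a hypothetical name, and the sample text is invented to resemble standard Python traceback output; it is not the traceback from the slide. It pulls out the final error line and the innermost frame, which are usually the first two things worth reading.

```python
def summarize_traceback(text):
    """Return (error_line, innermost_frame) from a Python traceback string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    error_line = lines[-1]  # bottom line: exception type and message
    frames = [line for line in lines if line.startswith('File "')]
    innermost = frames[-1] if frames else None  # frame closest to the error
    return error_line, innermost


# Hypothetical traceback text, shaped like standard Python output.
SAMPLE = '''
Traceback (most recent call last):
  File "/opt/demo/service.py", line 42, in main
    manager = Manager()
  File "/opt/demo/manager.py", line 17, in __init__
    raise OperationalError("Can't connect to MySQL server")
OperationalError: Can't connect to MySQL server
'''

error, frame = summarize_traceback(SAMPLE)
print(error)
print(frame)
```

Here the last line names the problem (no MySQL connection) and the innermost frame names where it happened (`__init__`).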
Nova
“No valid host was found. What the heck does that mean?!”
No valid host was found
No valid host was found
This error is likely seen when booting an instance. Common reasons for failing:
• There really are no hosts available
• Networking issues on compute node
• Lack of resources
So you spun up an instance...
Identify the host. If there is one…
Check the compute logs on the compute node
Check the networking logs on the compute node
Bringing it back together
!! WARNING !!
The following example may not utilize Python or Neutron coding best practices.
Bringing it back together
Let’s take a look at that traceback:
Bringing it back together
Taking a look at the function, we can see there is no exception handling:
Bringing it back together
By adding some exception handling to the function…
Bringing it back together
… we get a nice, clean error that clearly indicates what is wrong
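The original function is not reproduced in this transcript, so here is a minimal sketch of the same idea under stated assumptions: `HOST_INTERFACES`, both function names, and the log wording are hypothetical, and a dictionary lookup stands in for the real interface lookup. The first version lets a bare exception escape; the second catches it and logs a message that says what is actually wrong.

```python
import logging

LOG = logging.getLogger("agent_sketch")

# Hypothetical stand-in for the interfaces present on this host.
HOST_INTERFACES = {"lo": 1, "eth0": 2, "eth1": 3}


def get_interface_index(name):
    """No exception handling: a missing interface surfaces as a bare KeyError."""
    return HOST_INTERFACES[name]


def get_interface_index_checked(physnet, name):
    """Same lookup, but the failure is caught and reported in plain language."""
    try:
        return HOST_INTERFACES[name]
    except KeyError:
        LOG.error(
            "Interface %s mapped to physical network %s does not exist on "
            "this host. Check the interface mappings.",
            name,
            physnet,
        )
        return None


print(get_interface_index_checked("physnet1", "eth1"))
print(get_interface_index_checked("physnet1", "eht2"))
```

Returning `None` after logging is one design choice among several; raising a more specific exception with the same message would work equally well.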
Bringing it back together
Interface mappings can be found in the ML2 configuration file:
If eht2 does not exist on this host, the Neutron agent may be unable to complete the network configuration.
Next example: When there isn’t a host…
Check the scheduler and conductor logs
• /var/log/nova/nova-scheduler.log
• /var/log/nova/nova-conductor.log
First pass: 2015-10-16 17:14:18
Second pass: 2015-10-16 17:10:10
NTP! NTP! NTP!
Wonky behavior caused by inconsistencies in time between hosts
• Services and agents can appear DOWN when they’re UP
• Service and agent flapping can cause scheduling issues
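A quick way to quantify time drift is to subtract log timestamps. The sketch below uses the two timestamps shown for the first and second pass (2015-10-16 17:14:18 and 17:10:10) and assumes they were logged by different hosts for consecutive steps; `skew_seconds` is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def skew_seconds(first, second):
    """Seconds by which `second` trails `first`; positive means time went backwards."""
    delta = datetime.strptime(first, FMT) - datetime.strptime(second, FMT)
    return delta.total_seconds()


skew = skew_seconds("2015-10-16 17:14:18", "2015-10-16 17:10:10")
print(skew)  # 248.0
```

The later step appears to have happened 4 minutes 8 seconds before the earlier one, which is a strong hint to check NTP on the hosts involved.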
Neutron
Neutron architecture
Neutron is composed of various services and agents responsible for building and maintaining the virtual network:
Failures can occur at any point.
DHCP Agent
Neutron architecture
The DHCP agent is responsible for:
• Creating network namespaces
• Configuring dnsmasq – a DHCP server
When instances are created, IPs are statically assigned.
Neutron architecture
Failures of the DHCP agent on a host can result in:
• Instances not getting their initial lease
• Instances not renewing their lease
Dnsmasq Basics
As subnets and ports are created, the DHCP agent is responsible for configuring the files used by dnsmasq to provide DHCP services to the network:
• /var/lib/neutron/dhcp/<network_uuid>/host
When dnsmasq hands out the lease, it updates its active lease database.
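To check whether a port made it into the dnsmasq host file, the file can be read into a MAC-to-IP map. This is a sketch that assumes the simple comma-separated `MAC,hostname,IP` layout; real files may carry extra fields, which this parser tolerates by taking the first field as the MAC and the last as the IP. The sample entries and the helper name `parse_dnsmasq_hosts` are hypothetical.

```python
def parse_dnsmasq_hosts(text):
    """Map MAC address -> IP address from dnsmasq host-file text."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split(",")
        if len(fields) < 3:
            continue  # not in the assumed MAC,hostname,IP shape
        entries[fields[0].lower()] = fields[-1]
    return entries


# Hypothetical file contents for illustration.
SAMPLE_HOSTS = """\
fa:16:3e:aa:bb:01,host-192-168-1-3.openstacklocal,192.168.1.3
fa:16:3e:aa:bb:02,host-192-168-1-4.openstacklocal,192.168.1.4
"""

hosts = parse_dnsmasq_hosts(SAMPLE_HOSTS)
print(hosts.get("fa:16:3e:aa:bb:01"))
```

If an instance's MAC is absent from the map, dnsmasq on that node has nothing to offer it, which points at the DHCP agent rather than the switching layer.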
Dnsmasq Basics
By default, dnsmasq writes its logs to:
• /var/log/syslog (Ubuntu, Debian)
• /var/log/messages (RHEL, CentOS, Fedora)
Troubleshooting DHCP
If there are issues obtaining an IP, start with packet captures on the following devices:
• Compute node:
  – Tap interface
  – Bridge interface
  – Physical interface
• Network node:
  – Physical interface
  – Bridge interface
  – Veth interface
  – Namespace interface
Listen on UDP ports 67 and 68. You should see the full DHCP cycle in the packet capture on most interfaces.
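When a capture is examined programmatically, each step of the DHCP cycle can be identified by DHCP option 53 (message type). The sketch below decodes that option from a raw UDP payload, based on the standard BOOTP layout (236-byte fixed header, then the 4-byte magic cookie, then options). The helper names and the zero-filled sample payload are hypothetical and for demonstration only.

```python
DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"
MESSAGE_TYPES = {
    1: "DHCPDISCOVER", 2: "DHCPOFFER", 3: "DHCPREQUEST", 4: "DHCPDECLINE",
    5: "DHCPACK", 6: "DHCPNAK", 7: "DHCPRELEASE", 8: "DHCPINFORM",
}


def dhcp_message_type(payload):
    """Return the DHCP message type name from a UDP payload, or None."""
    options_start = 236 + len(DHCP_MAGIC_COOKIE)
    if len(payload) < options_start or payload[236:240] != DHCP_MAGIC_COOKIE:
        return None
    i = options_start
    while i < len(payload):
        code = payload[i]
        if code == 255:  # end option
            break
        if code == 0:  # pad option is a single byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break  # truncated option
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return MESSAGE_TYPES.get(value[0])
        i += 2 + length
    return None


def build_sample(message_type):
    """Build a minimal, mostly zeroed payload for demonstration only."""
    return bytes(236) + DHCP_MAGIC_COOKIE + bytes([53, 1, message_type, 255])


print(dhcp_message_type(build_sample(1)))
print(dhcp_message_type(build_sample(6)))
```

A healthy exchange shows DISCOVER, OFFER, REQUEST, and ACK in that order; a NAK or a missing OFFER narrows down where to look next.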
Troubleshooting DHCP – Packet Captures
• Working example:
• Non-working example:
When DHCP isn’t working, investigate the switching layer or dnsmasq.
Now we know how it works…
… we’ve got a live bug: DHCPNAK!
Troubleshooting DHCP – DHCPNAK Issues
I see DHCPNAK packets. HELP!
• Likely means the DHCP agent was restarted and the active lease file was deleted
• Instances may receive DHCPNAK when requesting / renewing address
• This may result in delayed or no connectivity
• Addressed in the patch for bug #1345947, which makes dnsmasq renew the lease anyway, without sending a NAK, and repopulate its lease file
Troubleshooting DHCP – DHCPNAK Issues
When a network is scheduled to more than 1 DHCP agent, there may be issues:
• That fix expected only 1 DHCP server in the network!
• The DHCPREQUEST packet sent on renewal attempt is received by all DHCP agents (it’s a broadcast, after all)
• The renewal attempt is accepted by the agent that provided the original lease
• At the same time, the renewal attempt is rejected by the agent that didn’t provide the original lease
Troubleshooting DHCP – DHCPNAK Issues
The end result? The client honors the DHCPNAK and restarts the DHCP process.
Troubleshooting DHCP – DHCPNAK Issues
However, there is hope!
• Bug 1457900 addresses the multiple DHCP agent issue
• The fix is to pre-populate the dnsmasq leases file on all DHCP agents with all known MACs/IPs for respective networks
• Fixed in Liberty, coming to a backport near you!
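The multi-agent behavior can be illustrated with a toy model. This is a deliberately simplified sketch, not dnsmasq's actual logic: each hypothetical `ToyDhcpAgent` answers a renewal with an ACK only if it already knows the MAC/IP pair, and a broadcast renewal reaches every agent. All names and addresses are invented for illustration.

```python
class ToyDhcpAgent:
    """Toy stand-in for a DHCP server with its own lease table."""

    def __init__(self, name):
        self.name = name
        self.leases = {}  # MAC -> IP

    def handle_request(self, mac, ip):
        # ACK only when this agent already knows the MAC/IP pair.
        return "DHCPACK" if self.leases.get(mac) == ip else "DHCPNAK"


def renew(agents, mac, ip):
    # A broadcast DHCPREQUEST reaches every agent on the network.
    return {agent.name: agent.handle_request(mac, ip) for agent in agents}


MAC, IP = "fa:16:3e:aa:bb:01", "192.168.1.3"
agent_a, agent_b = ToyDhcpAgent("agent-a"), ToyDhcpAgent("agent-b")

# Only agent-a handed out the original lease.
agent_a.leases[MAC] = IP
before = renew([agent_a, agent_b], MAC, IP)

# Pre-populate every agent with the known MAC/IP pairs.
for agent in (agent_a, agent_b):
    agent.leases[MAC] = IP
after = renew([agent_a, agent_b], MAC, IP)

print(before)
print(after)
```

Before pre-population the client receives one ACK and one NAK; afterwards every agent agrees, so there is no NAK for the client to honor.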
L2 Agent
Neutron architecture
The L2 agent is responsible for:
• Programming the virtual switching infrastructure
• Applying security groups
Neutron architecture
Failures of the L2 agent on a host can result in:
• Lack of instance connectivity
• Security group issues
• ERROR state during nova boot
Troubleshooting OVS connections
When troubleshooting L2 connectivity issues, run packet captures on highlighted interfaces:
Troubleshooting OVS connections
Every interface plugged into the integration bridge should have a local VLAN ID that is unique to that node, no matter what the network type (VLAN, flat, local, VXLAN, GRE):
If the tag is missing, try restarting the OVS agent to force a rebuild of the integration bridge VLAN tagging and corresponding flows.
Troubleshooting OVS connections
If you see an OVS port in VLAN 4095, it typically means that the agent was unable to find a corresponding Neutron port in the database:
When this happens, it usually means that the port was deleted from the DB manually or as part of another action that did not complete successfully.
Troubleshooting OVS connections
Useful commands include:
• ovs-vsctl show
  – High-level view of virtual bridges on the respective node
  – Shows local VLAN IDs for each port
• ovs-ofctl dump-flows BRIDGE
  – Show the flow rules for the respective bridge
  – The flow rules determine how traffic is manipulated and forwarded
• ovs-ofctl show BRIDGE
  – Port-level view of respective virtual switch
  – Shows port IDs on the bridge. Useful when reading flows.
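Checking tags by eye gets tedious on a busy node. As a rough sketch, the text printed by `ovs-vsctl show` can be scanned for each port's local VLAN tag. The sample output below is hypothetical and trimmed, and `parse_port_tags` is an illustrative helper name. Ports such as the bridge's own internal port normally have no tag, so a missing tag only matters for instance, DHCP, and router ports.

```python
def parse_port_tags(output):
    """Map port name -> VLAN tag (None if untagged) from `ovs-vsctl show` text."""
    tags = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Bridge "):
            current = None  # a new bridge section starts
        elif line.startswith("Port "):
            current = line[len("Port "):].strip('"')
            tags[current] = None
        elif line.startswith("tag:") and current is not None:
            tags[current] = int(line.split(":", 1)[1])
    return tags


# Hypothetical, trimmed output for illustration.
SAMPLE_OUTPUT = """\
Bridge br-int
    Port "tap1a2b3c4d-5e"
        tag: 1
        Interface "tap1a2b3c4d-5e"
    Port "tap6f7a8b9c-0d"
        tag: 4095
        Interface "tap6f7a8b9c-0d"
    Port "tapdeadbeef-00"
        Interface "tapdeadbeef-00"
    Port br-int
        Interface br-int
            type: internal
"""

tags = parse_port_tags(SAMPLE_OUTPUT)
dead_vlan_ports = [port for port, tag in tags.items() if tag == 4095]
untagged_ports = [port for port, tag in tags.items() if tag is None]
print(dead_vlan_ports)
print(untagged_ports)
```

Ports in VLAN 4095 and tap ports with no tag at all are the ones to investigate first.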
Troubleshooting LinuxBridge connections
When troubleshooting L2 connectivity issues, run packet captures on highlighted interfaces:
Troubleshooting LinuxBridge connections
In a working environment, every interface will connect to a bridge that corresponds to a Neutron network:
Network A (VXLAN Network), Network B (VLAN Network)
If a bridge is missing, check the agent log to see if there is an error.
Troubleshooting LinuxBridge connections
Useful commands include:
• brctl show
  – High-level view of virtual bridges on the respective node
  – One bridge for each network
• bridge fdb show
  – Shows the bridge forwarding database
  – Useful for knowing how MAC addresses are reached
• ip neigh show
  – Shows the ARP cache
Binding Failed is back!
• Usually seen when booting instance or attaching interface
• Typically result of Neutron misconfiguration or agent issues
• Not limited to just instance ports
Unexpected vif_type=binding_failed
Binding Failed is back!
In this example, both the DHCP and L3 agent ports were in binding_failed status:
Binding failed is back!
In this case, a look at the L2 agent log shows the misconfiguration:
If the agent is stopped or in a restart loop, port bindings will likely fail.
Binding Failed: The Fallout
For existing DHCP and L3 ports you may need to:
• Fix router port:
  – Unschedule tenant network from L3 agent
  – Reschedule tenant network to L3 agent
  – This creates new port
• Fix DHCP port:
  – Unschedule tenant network from DHCP agent
  – Delete DHCP port
  – Reschedule tenant network to DHCP agent
  – This creates new port
L2 agent troubleshooting tips
• Check to make sure the respective L2 agent is configured properly and is running (not restarting!)
• Make sure OVS is running (if applicable)
• Check the Neutron agent logs
  – /var/log/neutron/neutron-*-linuxbridge-agent.log
  – /var/log/neutron/neutron-*-openvswitch-agent.log
L3 Agent
Neutron architecture
The L3 agent is responsible for:
• Creating network namespaces for each router
• Providing routing between networks
• Providing NAT to instances
Neutron architecture
Failures of the L3 agent on a host can result in:
• Failure to route traffic
• Floating IPs not functioning
L3 agent troubleshooting tips
• Check to make sure the L3 agent is running and configured properly
• Perform packet captures within the router namespace and other interfaces to observe traffic entering and leaving the router
• Check iptables within the router namespace to observe the proper rules have been created
• Check the Neutron L3 agent log:
  – /var/log/neutron/l3-agent.log
More Neutron…
MTU
If the plumbing looks good, but you still experience connectivity issues to instances over certain protocols, it may be worth checking out the MTU size.
• Overlay network header can cause packet to exceed MTU
• Often manifests itself as SSH issues
• Try ssh -v to see where it hangs
• Pass lower MTU with DHCP option 26
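The arithmetic behind the lower MTU is simple subtraction. The sketch below assumes an IPv4 underlay and the commonly cited overheads of 50 bytes for VXLAN and 42 bytes for GRE; both figures and the helper name `instance_mtu` are assumptions to verify against your own environment.

```python
# Assumed per-packet encapsulation overhead in bytes (IPv4 underlay).
OVERHEAD = {"vxlan": 50, "gre": 42}


def instance_mtu(physical_mtu, network_type):
    """Largest MTU an instance can use without exceeding the physical MTU."""
    return physical_mtu - OVERHEAD.get(network_type, 0)


print(instance_mtu(1500, "vxlan"))  # 1450
print(instance_mtu(1500, "gre"))    # 1458
print(instance_mtu(1500, "vlan"))   # 1500
```

With a 1500-byte physical MTU, a VXLAN tenant network would therefore need instances set to 1450 or lower.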
Don’t forget security groups!
When things are plumbed up correctly and everything looks normal, there may be an issue with security group rules.
• Try applying a test rule
• Test connectivity from a namespace
• Verify iptables on compute nodes
• L2 agents are responsible for applying rules
Neutron architecture
Other issues can only be observed at scale:
• Race conditions
• System limits too low
• No disk space available
• Syslog is your friend
Takeaways
Neutron failures
Many common Neutron failures can be traced back to misconfigurations of the:
• Neutron configuration file
• ML2 configuration file
• Interface configuration files
Takeaways
Get familiar with the underlying technologies:
• KVM
• Open vSwitch
• Linux bridging
• iptables
Takeaways
Familiarize yourself with a working environment so that you know how to spot an issue.
Takeaways
• Turn on DEBUG mode
• Check syslog
• Start services by hand
• Start out with simple configurations
• Reach out to community
• Gather as much information as possible before submitting a bug
Don’t be afraid to break things
Stop by the Rackspace booth in the marketplace
Free book giveaways at the Rackspace booth during the morning and afternoon breaks!
ONE FANATICAL PLACE | SAN ANTONIO, TX 78218
US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM
© RACKSPACE LTD. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES. | WWW.RACKSPACE.COM
Thank you