TRANSCRIPT
Network Load Balancing (NLB): A Core Technology Overview
Sean B. House, June 18, 2002 (Updated October 31, 2002)
Agenda
NLB architecture and fundamentals
NLB cluster membership protocol
Packet filtering and TCP connection affinity
Limitations of NLB
Advanced NLB topics
Multicast
Bi-Directional Affinity
VPN (PPTP & IPSec/L2TP)
Q&A
Introduction to NLB
Fully distributed, symmetric, software-based TCP/IP load balancing
Cloned services (e.g., IIS) run on each host in the cluster
Client requests are partitioned across all hosts
Load distribution is static, but configurable through load weights (percentages)
Design goals include:
Use commodity hardware
Simple and robust
Highly available
Introduction to NLB
NLB Provides:
Scale-out for IP services
High availability (no single point of failure)
An inexpensive alternative to HW LB devices
NLB is appropriate for load balancing:
Stateless services
Short-lived connections from many clients
Downloads such as HTTP or FTP GETs
In small clusters (less than 10 nodes)
High-end hardware load balancers cover a much broader range of load-balancing scenarios
NLB Architecture
NLB is an NDIS intermediate filter driver inserted between the physical NIC and the protocols in the network stack
Protocols think NLB is a NIC
NICs think NLB is a protocol
One instance of NLB per NIC to which it's bound
All NLB instances operate independently of each other
Fundamental Algorithm
NLB is fundamentally just a packet filter
Via the NLB membership protocol, all hosts in the cluster agree on the load distribution
NLB requires that all hosts see all inbound packets
Each host discards those packets intended for other hosts in the cluster
Each host makes accept/drop decisions independently
Packets accepted on each host are passed up to the protocol(s) and one response is sent back to the client
Host 3
Host 2
Host 1
NLB Cluster
A client initiates a request to an NLB cluster. The network floods the incoming client request. One server accepts the client request. A response is sent back to the client.
Internet
Client(s)
Cluster Operation Modes
NLB modes of operation:
Unicast, multicast and IGMP multicast
Unicast makes up approximately 98% of deployments
For the rest of this talk, assume unicast operation
Advanced Topics covers multicast and IGMP multicast
To project a single system image:
All hosts share the same set of virtual IP addresses
All hosts share a common network (MAC) address
In unicast, NLB actually alters the MAC address of the NIC
This precludes inter-host communication over the NLB NIC
Communication with specific cluster hosts is accomplished through the use of dedicated NICs or dedicated IP addresses
Unicast Mode
Each host in the cluster is configured with the same unicast MAC address
02-bf-WW-XX-YY-ZZ
02 = locally administered address
bf = arbitrary (Bain/Faenov)
WW-XX-YY-ZZ = the primary cluster IP address
All ARP requests for virtual IP addresses resolve to this cluster MAC address automagically
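The address construction just described is mechanical enough to sketch. This is an illustrative helper (the function name is made up; the real derivation happens inside the NLB driver):

```python
# Illustrative sketch of the unicast cluster MAC derivation; the helper
# name is made up, and the real work happens inside the NLB driver.

def cluster_mac(primary_cluster_ip):
    """Build the 02-bf-WW-XX-YY-ZZ cluster MAC from the primary cluster IP."""
    octets = [int(o) for o in primary_cluster_ip.split(".")]
    return "-".join(f"{b:02x}" for b in [0x02, 0xBF] + octets)

assert cluster_mac("10.0.0.1") == "02-bf-0a-00-00-01"
```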
NLB must ensure that all inbound packets are received by all hosts in the cluster
All N hosts in the cluster receive every packet and N-1 hosts discard each packet
Inbound Packet Flooding
NLB is well-suited for (arguably designed for) a hub environment
Hubs forward inbound packets on all ports by their very nature
Switches associate MAC addresses with ports in an effort to eliminate flooding
On each port, switches snoop the source MAC addresses of all packets received
Those source MAC addresses are learned on that port
When packets arrive destined for a learned MAC address, they are forwarded only on the associated port
This breaks NLB compatibility with switches
NLB/Switch Incompatibility
All NLB hosts share the same cluster MAC address
Switches only allow a particular MAC address to be associated with one switch port at a given time
This results in the cluster MAC address / port association thrashing between ports
Host 3
Host 2
Host 1
Switch
Inbound packets are only forwarded to the port with which the switch currently believes the cluster MAC address is associated
Connectivity to the cluster will be intermittent at best
MAC Address Masking
NLB uses MAC address masking to keep switches from learning the cluster MAC address and associating it with a particular port
NLB spoofs the source MAC address of all outgoing packets
The second byte of the source MAC address is overwritten with the host's unique NLB host ID
E.g., 02-bf-0a-00-00-01 -> 02-09-0a-00-00-01
This prevents switches from associating the cluster MAC address with a particular port
Switches only associate the masked MAC addresses
Enables inbound packet flooding
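The masking rule can be sketched in a few lines (illustrative Python only; NLB does this to the raw Ethernet frame inside the driver):

```python
# Sketch of source MAC masking: the second byte of the outgoing source MAC
# is replaced with the host's NLB host ID. Function name is illustrative.

def mask_source_mac(cluster_mac, host_id):
    """Spoof the source MAC so switches never learn the cluster MAC."""
    parts = cluster_mac.split("-")
    parts[1] = f"{host_id:02x}"  # overwrite the 0xbf byte with the host ID
    return "-".join(parts)

# Host 3 masks the cluster MAC like so (02-03-..., as in the 02-03 example):
assert mask_source_mac("02-bf-0a-00-00-01", 3) == "02-03-0a-00-00-01"
```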
Host 3
Host 2
Host 1
10.0.0.1 / 02-bf-0a-00-00-01
From: 02-03-0a-00-00-01/10.0.0.1:80
To: 00-a0-cc-a1-cd-9f/10.0.0.5:29290
NLB Cluster
A client initiates a request to the NLB cluster. An ARP request for 10.0.0.1 resolves to 02-bf-0a-00-00-01. The switch does not know to which port 02-bf-0a-00-00-01 belongs, so it floods the request to all ports. One server accepts the client request. A response is sent back to the client. The source MAC address is masked using the host's unique host ID. The switch associates 02-03-0a-00-00-01, not 02-bf-0a-00-00-01, with this switch port, and will continue to flood packets destined for the cluster MAC address. This enables switch flooding.
Switch
From: 00-a0-cc-a1-cd-9f/10.0.0.5:29290 To: 02-bf-0a-00-00-01/10.0.0.1:80
Client(s): 10.0.0.5 / 00-a0-cc-a1-cd-9f
Load-Balancing Overview
Each host periodically sends a heartbeat packet to announce its presence and distribute load
Load distribution is quantized into 60 buckets that are distributed amongst hosts
Each host owns a subset of the buckets
Incoming packets hash to one of the 60 buckets
Typically using the IP 2-tuple or 4-tuple as input to the hashing function
The owner of the bucket accepts the packet, the others drop the packet
What happens to existing connections if bucket ownership changes?
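The per-packet decision has a simple shape: hash the client tuple into one of 60 buckets and accept only on the owning host. A sketch follows; NLB's real hash function is internal to the driver, so crc32 here is purely a stand-in:

```python
# Shape of the bucket-ownership decision. NLB's real hash is internal to
# the driver; zlib.crc32 is a stand-in used purely for illustration.

import zlib

BUCKETS = 60

def bucket_of(client_ip, client_port=None):
    """Map the IP 2-tuple or 4-tuple input to one of the 60 load buckets."""
    key = client_ip if client_port is None else f"{client_ip}:{client_port}"
    return zlib.crc32(key.encode()) % BUCKETS

# Three hosts own disjoint subsets of the 60 buckets:
ownership = {1: set(range(0, 20)), 2: set(range(20, 40)), 3: set(range(40, 60))}

b = bucket_of("10.0.0.5", 29290)
accepting = [h for h, owned in ownership.items() if b in owned]
assert len(accepting) == 1  # exactly one host accepts; the others drop
```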
Cluster Membership
Each NLB host is assigned a unique host ID in the range from 1 to 32 (the maximum cluster size)
Using Ethernet broadcast, each host sends heartbeat packets to announce its presence and distribute load
Twice per second during convergence
Once per second after convergence completes
Heartbeats are MTU-sized, un-routable Ethernet frames
Registered Ethernet type = 0x886f
They contain configuration information such as host ID, dedicated IP address, port rules, etc.
They contain load-balancing state such as the load distribution, load weights (percentages), activity indicators, etc.
Convergence
Convergence is a distributed mechanism for determining cluster membership and load distribution
Conveyed via the NLB heartbeat messages
Hosts initiate convergence primarily to partition the load
When consensus is reached, the cluster is said to be converged
Misconfiguration can cause perpetual convergence
Network problems can cause periodic convergence
Cluster operations continue during convergence
Can result in disruption to or denial of client service
Triggering Convergence
Joining hosts
New hosts trigger convergence to repartition the load distribution and begin accepting client requests
Departing hosts
The other hosts pick up the slack when a fixed number of heartbeats are missed from the departing host
Configuration changes
Administrative operations that change the configured load of a server (disable, enable, drain, etc.)
The Convergence Algorithm
1. All hosts enter the CONVERGING state
2. The host with the smallest host ID is elected the default host
3. Each host moves from the CONVERGING state to the STABLE state after a fixed number of epochs* in which consistent membership and load distribution are observed
4. The default host enters the CONVERGED state after a fixed number of epochs in which all hosts are observed to be in the STABLE state
5. Other hosts enter the CONVERGED state when they see that the default host has converged
* For all intents and purposes, an epoch is a heartbeat period
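A toy model of the five steps can make the state progression concrete. Everything here is an assumption except the CONVERGING -> STABLE -> CONVERGED progression itself: the epoch threshold is a placeholder, the heartbeat exchange is abstracted into three booleans, and the default host converges as soon as it sees everyone STABLE rather than waiting the fixed number of all-STABLE epochs the real driver uses:

```python
# Toy model of the convergence state machine. Constants and the heartbeat
# exchange are placeholders; only the state progression follows the slides.

CONVERGING, STABLE, CONVERGED = "CONVERGING", "STABLE", "CONVERGED"
STABLE_EPOCHS = 5  # "fixed number of epochs"; the real value is NLB's own

class Host:
    def __init__(self, host_id, cluster_ids):
        self.default = host_id == min(cluster_ids)  # smallest host ID wins
        self.state = CONVERGING
        self.consistent_epochs = 0

    def epoch(self, consistent, all_stable, default_converged):
        if not consistent:  # membership/load changed: restart convergence
            self.state, self.consistent_epochs = CONVERGING, 0
            return
        self.consistent_epochs += 1
        if self.state == CONVERGING and self.consistent_epochs >= STABLE_EPOCHS:
            self.state = STABLE
        if self.state == STABLE:
            if self.default and all_stable:
                self.state = CONVERGED       # step 4 (simplified)
            elif not self.default and default_converged:
                self.state = CONVERGED       # step 5

ids = [1, 2, 3]
hosts = [Host(i, ids) for i in ids]
for _ in range(STABLE_EPOCHS + 3):  # each pass is one heartbeat period
    all_stable = all(h.state in (STABLE, CONVERGED) for h in hosts)
    default_done = any(h.default and h.state == CONVERGED for h in hosts)
    for h in hosts:
        h.epoch(True, all_stable, default_done)

assert all(h.state == CONVERGED for h in hosts)
```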
Bucket Distribution
Load distribution is quantized into buckets
Incoming packets hash to one of 60 buckets
Largely for reasons of equal divisibility
Buckets are distributed amongst hosts, but based on configuration (load weights/percentages), may or may not be shared equally
Buckets are not dynamically re-distributed based on load (no dynamic load balancing)
Goal: minimize disruption to existing connections during bucket transfer
Non-goal: optimize remaps across a series of convergences
Bucket Distribution
During convergence, each host computes identical target load distributions
Based on the existing load distribution and the new membership information
When convergence completes, hosts transfer buckets pair-wise via heartbeats
First, the donor host surrenders ownership of the buckets and notifies the recipient
Soon thereafter, the recipient picks up the buckets, asserts ownership of them and notifies the donor
During the transfer (~2 seconds), nobody is accepting new connections on those buckets
Bucket Distribution
Advantages
Easy method by which to divide the client population among hosts
Convenient for adjusting relative load weights (percentages) between hosts
Avoids state lookup in optimized cases
Disadvantages
Quantized domain has limited granularity
Unbalanced for some cluster sizes, but 60 divides nicely:
1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30
Worst case is a load distribution ratio of 2:1 in 31 and 32 host clusters
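Both divisibility claims can be checked directly:

```python
# Checking the divisibility claims: the cluster sizes that divide 60
# evenly, and the 2:1 worst case at 31 and 32 hosts.

BUCKETS, MAX_HOSTS = 60, 32

even_sizes = [n for n in range(1, MAX_HOSTS + 1) if BUCKETS % n == 0]

def worst_ratio(n_hosts):
    """Most-loaded vs. least-loaded host when 60 buckets split n ways."""
    floor, ceil = BUCKETS // n_hosts, -(-BUCKETS // n_hosts)
    return ceil / floor

assert even_sizes == [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30]
assert worst_ratio(31) == worst_ratio(32) == 2.0
```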
Host 3: Now: 30-59, Next: 30-49
Host 2: Now: 0-29, Next: 0-9, 20-29
Host 1: Now: None, Next: 10-19, 50-59
NLB Cluster
Hosts 2 and 3 are converged. Host 1 joins the cluster. Convergence begins and all three hosts begin sending CONVERGING heartbeats. Each host uses the same algorithm to compute the new load distribution. The cluster converges and the hosts begin sending CONVERGED heartbeats. When convergence completes, each pair of hosts transfers the designated buckets via heartbeats. Buckets are removed from the donating host's bucket map before being handed off to the new owner.
Switch
Internet
Client 1
Packet Filtering
Filtered packets are those for which NLB will make an accept/drop decision (load-balance)
IP protocols that are filtered by NLB:
TCP, UDP
GRE: assumes a relationship with a corresponding PPTP tunnel
ESP/AH (IPSec): assumes a relationship with a corresponding IPSec/L2TP tunnel
ICMP: by default, all hosts accept ICMP; can be optionally filtered
Other protocols and Ethernet types are passed directly up to the protocol(s)
Client Affinity
None
Typically provides the best load balance
Uses both client IP address and port when hashing
Single
Used primarily for session support for SSL and multi-connection protocols (IPSec/L2TP, PPTP, FTP)
Uses only the client IP address when hashing
Class C
Used primarily for session support for users behind scaling proxy arrays
Uses only the class C subnet of the client IP address when hashing
Hashing
Packets hash to one of 60 buckets, which are distributed amongst hosts
NLB employs bi-level hashing
Level 1: Bucket ownership
Level 2: State lookup
NLB hashing operates in one of two modes
Optimized
Level 1 hashing only
The bucket owner accepts the packet unconditionally
Non-optimized
Level 1 and level 2 hashing
State lookup is necessary to resolve ownership ambiguity
Hashing
Protocols such as UDP always operate in optimized mode
No state is maintained for UDP, which eliminates the need for level 2 hashing
Protocols such as TCP can operate in either optimized or non-optimized mode
State is maintained for all TCP connections
When ambiguity arises, state lookup determines ownership
New connections always belong to the bucket owner
Global aggregation determines when other hosts complete service of a lingering connection and optimize out level 2 hashing
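The two-level decision can be sketched as follows. This is illustrative logic, not NLB's actual code: level 1 is bucket ownership, and level 2 (the descriptor lookup) runs only while ownership of a bucket is ambiguous, such as during a transfer:

```python
# Illustrative accept/drop logic for bi-level hashing. Not NLB's code.

def accept(five_tuple, syn, bucket, owned, ambiguous, descriptors):
    owns = bucket in owned
    if bucket not in ambiguous:    # optimized mode: level 1 only
        return owns
    if five_tuple in descriptors:  # level 2: we already service this one
        return True
    return owns and syn            # new connections go to the bucket owner

conn = ("10.0.0.5", 29290, "10.0.0.1", 80, "TCP")
# The old owner keeps servicing an existing connection through a rebalance:
assert accept(conn, syn=False, bucket=14, owned=set(), ambiguous={14},
              descriptors={conn})
# The new owner takes only new (SYN) connections on the transferred bucket:
assert not accept(conn, syn=False, bucket=14, owned={14}, ambiguous={14},
                  descriptors=set())
```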
Host 3
Host 2
Host 1
10.0.0.1 / 02-bf-0a-00-00-01
From: 10.0.0.1:80 To: 10.0.0.5:29290
Bucket ownership: 20-39, 0-19, 40-59
NLB Cluster
A client initiates a request to an NLB cluster. The hash on the IP 5-tuple (10.0.0.5, 29290, 10.0.0.1, 80, TCP) maps to bucket 14, owned by Host 3. Host 3 accepts the request. All other hosts drop the request. A response is sent back to the client.
Switch
Internet
From: 10.0.0.5:29290 To: 10.0.0.1:80
Client(s): 10.0.0.5 / 00-a0-cc-a1-cd-9f
Connection Tracking
Ensures that connections are serviced by the same host for their duration even if a change in bucket ownership occurs
Sessionful vs. sessionless hashing
E.g., UDP is sessionless
If an ownership change occurs, existing streams shift immediately to the new bucket owner
E.g., TCP is sessionful
If an ownership change occurs, existing connections continue to be serviced by the old bucket owner
Requires state maintenance and lookup to resolve packet ownership ambiguity
No TCP Connection Affinity
Host 3 (Client 1 Owner)
Host 2
Host 1
NLB Cluster
A client initiates a TCP connection by sending a SYN to the NLB cluster. The SYN is accepted by Host 1, and a SYN+ACK is sent back to the client. The client completes the three-way handshake by sending an ACK back to the NLB cluster. The ACK is accepted by Host 3, breaking the TCP connection.
Switch
Internet
Client 1
TCP Connection State
NLB maintains a connection descriptor for each active TCP connection
A connection descriptor is basically an IP 5-tuple
(Client IP, Client port, Server IP, Server Port, TCP)
In optimized mode, descriptors are maintained, but not needed when making accept/drop decisions
State and its associated lifetime are maintained by either:
TCP packet snooping
Monitoring TCP packet flags (SYN, FIN, RST)
Explicit notifications from TCP/IP
Using kernel callbacks
TCP Packet Snooping
NLB monitors the TCP packet flags
Upon seeing a SYN
If accepted, a descriptor is created to track the connection
Only one host should have a descriptor for this IP 5-tuple
Upon seeing a FIN/RST
Destroys the associated descriptor, if one is found
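The descriptor lifecycle under snooping can be sketched as (illustrative names, not NLB's internal structures):

```python
# Descriptor lifecycle under TCP packet snooping: create on an accepted
# SYN, destroy on FIN/RST. Names are illustrative.

descriptors = set()  # one entry per active TCP connection (IP 5-tuples)

def snoop(five_tuple, flags, accepted):
    if "SYN" in flags and accepted:
        descriptors.add(five_tuple)      # start tracking the connection
    elif flags & {"FIN", "RST"}:
        descriptors.discard(five_tuple)  # tear down state on close/reset

conn = ("10.0.0.5", 29290, "10.0.0.1", 80, "TCP")
snoop(conn, {"SYN"}, accepted=True)
assert conn in descriptors
snoop(conn, {"FIN"}, accepted=True)
assert conn not in descriptors
```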
Problems include:
Routing and interface metrics can cause traffic flow to be asymmetric
NLB cannot rely on seeing outbound packets
State is created before the connection is accepted
Both of which can result in stale/orphaned descriptors
Wasted memory, performance degradation, packet loss
TCP Connection Notifications
NLB receives explicit notifications from TCP/IP when connection state is created or destroyed
TCP/IP notifies NLB when a connection enters:
SYN_RCVD
A descriptor is created to track the inbound connection
CLOSED
Destroys the associated descriptor, if one is found
Advantages include:
NLB state maintenance remains in very close synchronization with TCP/IP
NLB TCP connection tracking is more reliable
Affords NLB protection against SYN attacks, etc.
TCP Connection Affinity
Host 3 (TCP Connection Descriptor; Client 1 Owner)
Host 2
Host 1
NLB Cluster
A client initiates a TCP connection by sending a SYN to the NLB cluster. The SYN is accepted by Host 3, and a SYN+ACK is sent back to the client. The client completes the three-way handshake by sending an ACK back to the NLB cluster. The ACK is accepted by Host 3 because it knows it has active TCP connections and it has a matching TCP connection descriptor.
Switch
Internet
The ACK is rejected by Host 1 because it knows that other hosts have active TCP connections and it does NOT have a matching TCP connection descriptor.
Client 1
Session Tracking
Session tracking complications in NLB are of one of two forms, or a combination thereof:
The inability of NLB to detect the start or end of a session in the protocol itself
The need to associate multiple seemingly unrelated streams and provide affinity to a single server
PPTP and IPSec/L2TP sessions are supported
Requires specialized support from PPTP and IPSec
UDP sessions are not supported
Session start and end are unknown to NLB
SSL sessions are not supported
SSL sessions span TCP connections
Session Tracking
The classic NLB answer to session support is to use client affinity
Assumes that client identity remains the same throughout the session
Session lifetime is highly indeterminate
Sessions can span many connections
AOL proxy problem (scaling proxy arrays)
Different connections in the same session may have different client IP addresses and/or ports
Using class C affinity can help, but is likely to highly skew the achievable load balance
Terminal Server
Subsequent connections may be from different locations
Scale-Out Limitations
Network limitations
Switch flooding
The pipe to each host in the cluster must be as fat as the uplink pipe
Not allowing the switch to learn the MAC address causes degraded switch performance as well
Incompatible with layer 3 switches
All hosts share the same virtual IP address(es)
CPU limitations
Packet drop overhead
Every host drops (N-1)/N % of all packets on average
Load-Balancing Limitations
The NLB load-balancing algorithm is static
Only the IP 5-tuple is considered when making load-balancing decisions
No dynamic metrics are taken into consideration
E.g., CPU, memory, total number of connections
No application semantics are taken into consideration
E.g., Terminal Server vs. IIS
NLB requires a sufficiently large (and varied) client population to achieve the configured balance
A small number of clients will result in poor balance
Mega proxies can significantly skew the load balance
Other Limitations
No inter-host communication is possible without a second NIC
Hosts are cloned and traffic destined for local MAC addresses doesn't reach the wire
Both multicast modes address this issue, but require a static ARP entry in Cisco routers
NLB generally has no session awareness
TCP connections are preserved during a rebalance
NLB provides connection, not session, affinity
E.g., SSL can/will break during a rebalance
Specialized support from NLB and VPN allows VPN sessions (tunnels) to be preserved during a rebalance
Summary
NLB is fully distributed, symmetric, software-based TCP/IP load balancing
Cloned services run on each host in the cluster and client requests are partitioned across all hosts
NLB provides high availability and scale-out for IP services
NLB is appropriate for load balancing:
Stateless services
Short-lived connections from many clients
Downloads such as HTTP or FTP GETs
In small clusters (less than 10 nodes)
Advanced Topics
Multicast
Bi-directional affinity
VPN session support
Multicast
All hosts share a common multicast MAC address
Each host retains its unique MAC address
Packets addressed to multicast MAC addresses are flooded by switches
NLB munges ARP requests to resolve all virtual IP addresses to the shared multicast MAC address
All ARP requests for the dedicated IP address of a host resolve to the unique hardware MAC address
Does not limit switch flooding
Does allow inter-host communication
Multicast
NLB multicast modes break an Internet RFC
Unicast IP addresses cannot resolve to multicast MAC addresses
Requires a static ARP entry on Cisco routers
Cisco routers won't dynamically add the ARP entry
Cisco plans to eliminate support for static ARP entries for multicast addresses
The ping-pong effect
In a redundant router configuration, multicast packets may be repeatedly replayed onto the network
Typically until the TTL reaches zero
Router utilization skyrockets
Network bandwidth plummets
IGMP Multicast
All hosts share a common IGMP multicast MAC address
IGMP does limit switch flooding
All cluster hosts join the same IGMP group
Hosts periodically send IGMP join messages on the network
ARP requests for all virtual IP addresses resolve to the shared IGMP multicast MAC address
Switches forward packets destined for the IGMP multicast MAC address only on the ports on which the switch has recently received a join message for that IGMP group
Still requires a static ARP entry in Cisco routers
Bi-Directional Affinity
Proxy/firewall scalability and availability
By default, NLB instances on distinct network adapters operate independently
Independently configured
Independently converge and distribute load
Independently make packet accept/drop decisions
Firewall stateful packet inspection requires:
That load-balancers associate multiple packet streams
That all related packet streams get load-balanced to the same firewall server
This is Bi-Directional Affinity (BDA)
No Bi-Directional Affinity
The client initiates a request to the NLB/Firewall cluster. One NLB/Firewall server accepts the client request. A firewall routes the request to the appropriate internal server. The internal server sends a response to the client via the NLB/Firewall cluster. The internal server response may be accepted by a different NLB/Firewall server than the one that handled the initial request. This breaks stateful packet inspection.
NLB/Firewall Cluster / Firewall State (SPI)
Host 1
Host 2
Host 3
Client(s)
Internet / Published Server
Stateful Packet Inspection
Firewalls maintain state [generally] on a per-connection basis
This state is necessary to perform advanced inspection of traffic through the firewall
Requires special load-balancing semantics
Load-balance incoming external requests for internal resources
Load-balance outgoing internal requests for external resources
Maintain firewall server affinity for the responses
Return traffic must pass through the same firewall server as the request
The Affinity Problem
Firewalls/proxies often alter TCP connection parameters
Source and destination ports
Source IP address
If translated, the host's dedicated IP address should be used
Destination IP address
In many scenarios, a published IP address is translated at the firewall into a private IP address
The packets of the request and associated response are often very different
Difficult for load-balancers to associate the two seemingly unrelated streams and provide affinity
The Affinity Problem
Incoming packets utilize the conventional NLB hashing algorithm
Port rule lookup: look up the applicable port rule using the server port
Hashing function: hash on the IP 2-tuple or 4-tuple
Bucket ownership: map that result to an owner, who accepts the packet
For firewalls/proxies, problems include:
The server port is different on the client and server sides of the firewall
Ports and IP addresses have been altered by the firewall
Each NLB instance has independent bucket ownership
BDA Teaming
Abandons some aspects of independence between designated NLB instances
Each member of a BDA team belongs to a different cluster that continues to converge independently
Primarily useful for consistency and failure detection
However, all members of a BDA team share load-balancing state, including:
Connection descriptors
Bucket distribution
Allows all team members to make consistent accept/drop decisions and preserve affinity
Preserving Affinity with BDA
Requirements include:
A single port rule, ports = (0 - 65535)
Eliminates problems with port rule lookup due to port translation
Single or Class C affinity on the only port rule
Eliminates hashing problems due to port translation
Server IP address not used in hashing
Eliminates hashing problems due to IP address translation
The lone common element in hashing is then the client IP address
Use the source IP address on incoming client requests
Use the destination IP address on server responses
This is often called reverse-hashing
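Reverse hashing fits in a few lines. This sketch only shows the direction-dependent choice of input; crc32 again stands in for NLB's internal hash, and the addresses are made up:

```python
# Reverse hashing in miniature: hash only the client IP, taken from the
# source of inbound requests and the destination of outbound responses,
# so both directions of a flow land in the same bucket regardless of
# what the firewall translated. crc32 stands in for NLB's internal hash.

import zlib

def client_bucket(client_ip, buckets=60):
    return zlib.crc32(client_ip.encode()) % buckets

def packet_bucket(src_ip, dst_ip, inbound):
    client_ip = src_ip if inbound else dst_ip  # reverse-hash on responses
    return client_bucket(client_ip)

request = packet_bucket("10.0.0.5", "192.168.0.9", inbound=True)
response = packet_bucket("192.168.0.9", "10.0.0.5", inbound=False)
assert request == response  # same bucket, hence the same firewall host
```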
Bi-Directional Affinity
The client initiates a request to the NLB/Firewall cluster. NLB hashes on the source IP address (the client IP address) of the request, and one NLB/Firewall server accepts the client request. A firewall routes the request to the appropriate internal server. The internal server sends a response to the client via the NLB/Firewall cluster. NLB hashes on the destination IP address (the client IP address) of the response. Bi-Directional Affinity ensures that the response is handled by the same NLB/Firewall server that handled the initial request. The response is sent back to the client.
NLB/Firewall Cluster / Firewall State (SPI)
Host 1
Host 2
Host 3
Client(s)
Internet / Published Server
BDA Miscellaneous
External entities are expected to monitor the health of BDA teams through NLB WMI events
E.g., if one member of a BDA team fails, the entire team should be stopped
All load will then be re-distributed to surviving hosts
Reverse hashing is set on a per-adapter basis
To override the configured hashing scheme on a per-packet basis, NLB provides a kernel-mode hook
Entities register to see all packets in the send and/or receive paths and can influence NLB's decision to accept/drop them
The hook returns ACCEPT, REJECT, FORWARD hash, REVERSE hash or PROCEED with the default hash
Enables extensions to BDA support for more complex firewall/proxy scenarios without explicit changes to NLB
VPN Session Support
Support for clustering VPN servers
PPTP
TCP tunnel
GRE Call IDs
IPSec/L2TP
Notifications from IKE
No FINs
MM and QM SAs
INITIAL_CONTACT