TRANSCRIPT
Network Load Balancing (NLB): A Core Technology Overview
Sean B. House, June 18, 2002 (Updated October 31, 2002)
Agenda
NLB architecture and fundamentals
NLB cluster membership protocol
Packet filtering and TCP connection affinity
Limitations of NLB
Advanced NLB topics
Multicast
Bi-Directional Affinity
VPN (PPTP & IPSec/L2TP)
Q&A
Introduction to NLB
Fully distributed, symmetric, software-based TCP/IP load balancing
Cloned services (e.g., IIS) run on each host in the cluster
Client requests are partitioned across all hosts
Load distribution is static, but configurable through load weights (percentages)
Design goals include:
Use commodity hardware
Simple and robust
Highly available
Introduction to NLB
NLB Provides:
Scale-out for IP services
High availability (no single point of failure)
An inexpensive alternative to HW LB devices
NLB is appropriate for load balancing:
Stateless services
Short-lived connections from many clients
Downloads such as HTTP or FTP GETs
In small clusters (less than 10 nodes)
High-end hardware load balancers cover a much broader range of load-balancing scenarios
NLB Architecture
NLB is an NDIS intermediate filter driver inserted between the physical NIC and the protocols in the network stack
Protocols think NLB is a NIC
NICs think NLB is a protocol
One instance of NLB per NIC to which it's bound
All NLB instances operate independently of each other
Fundamental Algorithm
NLB is fundamentally just a packet filter
Via the NLB membership protocol, all hosts in the cluster agree on the load distribution
NLB requires that all hosts see all inbound packets
Each host discards those packets intended for other hosts in the cluster
Each host makes accept/drop decisions independently
Packets accepted on each host are passed up to the protocol(s) and one response is sent back to the client
Host 3
Host 2
Host 1
NLB Cluster
A client initiates a request to an NLB cluster. The network floods the incoming client request. One server accepts the client request. A response is sent back to the client.
Internet
Client(s)
Cluster Operation Modes
NLB modes of operation:
Unicast, multicast and IGMP multicast
Unicast makes up approximately 98% of deployments
For the rest of this talk, assume unicast operation
Advanced Topics covers multicast and IGMP multicast
To project a single system image:
All hosts share the same set of virtual IP addresses
All hosts share a common network (MAC) address
In unicast, NLB actually alters the MAC address of the NIC
This precludes inter-host communication over the NLB NIC
Communication with specific cluster hosts is accomplished through the use of dedicated NICs or dedicated IP addresses
Unicast Mode
Each host in the cluster is configured with the same unicast MAC address
02-bf-WW-XX-YY-ZZ
02 = locally administered address
bf = arbitrary (Bain/Faenov)
WW-XX-YY-ZZ = the primary cluster IP address
All ARP requests for virtual IP addresses resolve to this cluster MAC address automagically
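The address construction just described is mechanical enough to sketch. This is an illustrative helper (the function name is made up; the real derivation happens inside the NLB driver):

```python
# Illustrative sketch of the unicast cluster MAC derivation; the helper
# name is made up, and the real work happens inside the NLB driver.

def cluster_mac(primary_cluster_ip):
    """Build the 02-bf-WW-XX-YY-ZZ cluster MAC from the primary cluster IP."""
    octets = [int(o) for o in primary_cluster_ip.split(".")]
    return "-".join(f"{b:02x}" for b in [0x02, 0xBF] + octets)

assert cluster_mac("10.0.0.1") == "02-bf-0a-00-00-01"
```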
NLB must ensure that all inbound packets are received by all hosts in the cluster
All N hosts in the cluster receive every packet and N-1 hosts discard each packet
Inbound Packet Flooding
NLB is well-suited for (arguably designed for) a hub environment
Hubs forward inbound packets on all ports by their very nature
Switches associate MAC addresses with ports in an effort to eliminate flooding
On each port, switches snoop the source MAC addresses of all packets received
Those source MAC addresses are learned on that port
When packets arrive destined for a learned MAC address, they are forwarded only on the associated port
This breaks NLB compatibility with switches
NLB/Switch Incompatibility
All NLB hosts share the same cluster MAC address
Switches only allow a particular MAC address to be associated with one switch port at a given time
This results in the cluster MAC address / port association thrashing between ports
Host 3
Host 2
Host 1
Switch
Inbound packets are only forwarded to the port with which the switch currently believes the cluster MAC address is associated
Connectivity to the cluster will be intermittent at best
MAC Address Masking
NLB uses MAC address masking to keep switches from learning the cluster MAC address and associating it with a particular port
NLB spoofs the source MAC address of all outgoing packets
The second byte of the source MAC address is overwritten with the host's unique NLB host ID
E.g., 02-bf-0a-00-00-01 -> 02-09-0a-00-00-01
This prevents switches from associating the cluster MAC address with a particular port
Switches only associate the masked MAC addresses
Enables inbound packet flooding
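The masking rule can be sketched in a few lines (illustrative Python only; NLB does this to the raw Ethernet frame inside the driver):

```python
# Sketch of source MAC masking: the second byte of the outgoing source MAC
# is replaced with the host's NLB host ID. Function name is illustrative.

def mask_source_mac(cluster_mac, host_id):
    """Spoof the source MAC so switches never learn the cluster MAC."""
    parts = cluster_mac.split("-")
    parts[1] = f"{host_id:02x}"  # overwrite the 0xbf byte with the host ID
    return "-".join(parts)

# Host 3 masks the cluster MAC like so (02-03-..., as in the 02-03 example):
assert mask_source_mac("02-bf-0a-00-00-01", 3) == "02-03-0a-00-00-01"
```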
Host 3
Host 2
Host 1
10.0.0.1 / 02-bf-0a-00-00-01
From: 02-03-0a-00-00-01/10.0.0.1:80
To: 00-a0-cc-a1-cd-9f/10.0.0.5:29290
NLB Cluster
A client initiates a request to the NLB cluster. An ARP request for 10.0.0.1 resolves to 02-bf-0a-00-00-01. The switch does not know to which port 02-bf-0a-00-00-01 belongs, so it floods the request to all ports. One server accepts the client request. A response is sent back to the client. The source MAC address is masked using the host's unique host ID. The switch associates 02-03-0a-00-00-01, not 02-bf-0a-00-00-01, with this switch port, and will continue to flood packets destined for the cluster MAC address. This enables switch flooding.
Switch
From: 00-a0-cc-a1-cd-9f/10.0.0.5:29290 To: 02-bf-0a-00-00-01/10.0.0.1:80
Client(s): 10.0.0.5 / 00-a0-cc-a1-cd-9f
Load-Balancing Overview
Each host periodically sends a heartbeat packet to announce its presence and distribute load
Load distribution is quantized into 60 buckets that are distributed amongst hosts
Each host owns a subset of the buckets
Incoming packets hash to one of the 60 buckets
Typically using the IP 2-tuple or 4-tuple as input to the hashing function
The owner of the bucket accepts the packet, the others drop the packet
What happens to existing connections if bucket ownership changes?
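The per-packet decision has a simple shape: hash the client tuple into one of 60 buckets and accept only on the owning host. A sketch follows; NLB's real hash function is internal to the driver, so crc32 here is purely a stand-in:

```python
# Shape of the bucket-ownership decision. NLB's real hash is internal to
# the driver; zlib.crc32 is a stand-in used purely for illustration.

import zlib

BUCKETS = 60

def bucket_of(client_ip, client_port=None):
    """Map the IP 2-tuple or 4-tuple input to one of the 60 load buckets."""
    key = client_ip if client_port is None else f"{client_ip}:{client_port}"
    return zlib.crc32(key.encode()) % BUCKETS

# Three hosts own disjoint subsets of the 60 buckets:
ownership = {1: set(range(0, 20)), 2: set(range(20, 40)), 3: set(range(40, 60))}

b = bucket_of("10.0.0.5", 29290)
accepting = [h for h, owned in ownership.items() if b in owned]
assert len(accepting) == 1  # exactly one host accepts; the others drop
```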
Cluster Membership
Each NLB host is assigned a unique host ID in the range from 1 to 32 (the maximum cluster size)
Using Ethernet broadcast, each host sends heartbeat packets to announce its presence and distribute load
Twice per second during convergence
Once per second after convergence completes
Heartbeats are MTU-sized, un-routable Ethernet frames
Registered Ethernet type = 0x886f
They contain configuration information such as host ID, dedicated IP address, port rules, etc.
They contain load-balancing state such as the load distribution, load weights (percentages), activity indicators, etc.
Convergence
Convergence is a distributed mechanism for determining cluster membership and load distribution
Conveyed via the NLB heartbeat messages
Hosts initiate convergence primarily to partition the load
When consensus is reached, the cluster is said to be converged
Misconfiguration can cause perpetual convergence
Network problems can cause periodic convergence
Cluster operations continue during convergence
Can result in disruption to or denial of client service
Triggering Convergence
Joining hosts
New hosts trigger convergence to repartition the load distribution and begin accepting client requests
Departing hosts
The other hosts pick up the slack when a fixed number of heartbeats are missed from the departing host
Configuration changes
Administrative operations that change the configured load of a server (disable, enable, drain, etc.)
The Convergence Algorithm
1. All hosts enter the CONVERGING state
2. The host with the smallest host ID is elected the default host
3. Each host moves from the CONVERGING state to the STABLE state after a fixed number of epochs* in which consistent membership and load distribution are observed
4. The default host enters the CONVERGED state after a fixed number of epochs in which all hosts are observed to be in the STABLE state
5. Other hosts enter the CONVERGED state when they see that the default host has converged
* For all intents and purposes, an epoch is a heartbeat period
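A toy model of the five steps can make the state progression concrete. Everything here is an assumption except the CONVERGING -> STABLE -> CONVERGED progression itself: the epoch threshold is a placeholder, the heartbeat exchange is abstracted into three booleans, and the default host converges as soon as it sees everyone STABLE rather than waiting the fixed number of all-STABLE epochs the real driver uses:

```python
# Toy model of the convergence state machine. Constants and the heartbeat
# exchange are placeholders; only the state progression follows the slides.

CONVERGING, STABLE, CONVERGED = "CONVERGING", "STABLE", "CONVERGED"
STABLE_EPOCHS = 5  # "fixed number of epochs"; the real value is NLB's own

class Host:
    def __init__(self, host_id, cluster_ids):
        self.default = host_id == min(cluster_ids)  # smallest host ID wins
        self.state = CONVERGING
        self.consistent_epochs = 0

    def epoch(self, consistent, all_stable, default_converged):
        if not consistent:  # membership/load changed: restart convergence
            self.state, self.consistent_epochs = CONVERGING, 0
            return
        self.consistent_epochs += 1
        if self.state == CONVERGING and self.consistent_epochs >= STABLE_EPOCHS:
            self.state = STABLE
        if self.state == STABLE:
            if self.default and all_stable:
                self.state = CONVERGED       # step 4 (simplified)
            elif not self.default and default_converged:
                self.state = CONVERGED       # step 5

ids = [1, 2, 3]
hosts = [Host(i, ids) for i in ids]
for _ in range(STABLE_EPOCHS + 3):  # each pass is one heartbeat period
    all_stable = all(h.state in (STABLE, CONVERGED) for h in hosts)
    default_done = any(h.default and h.state == CONVERGED for h in hosts)
    for h in hosts:
        h.epoch(True, all_stable, default_done)

assert all(h.state == CONVERGED for h in hosts)
```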
Bucket Distribution
Load distribution is quantized into buckets
Incoming packets hash to one of 60 buckets
Largely for reasons of equal divisibility
Buckets are distributed amongst hosts, but based on configuration (load weights/percentages), may or may not be shared equally
Buckets are not dynamically re-distributed based on load (no dynamic load balancing)
Goal: minimize disruption to existing connections during bucket transfer
Non-goal: optimize remaps across a series of convergences
Bucket Distribution
During convergence, each host computes identical target load distributions
Based on the existing load distribution and the new membership information
When convergence completes, hosts transfer buckets pair-wise via heartbeats
First, the donor host surrenders ownership of the buckets and notifies the recipient
Soon thereafter, the recipient picks up the buckets, asserts ownership of them and notifies the donor
During the transfer (~2 seconds), nobody is accepting new connections on those buckets
Bucket Distribution
Advantages
Easy method by which to divide the client population among hosts
Convenient for adjusting relative load weights (percentages) between hosts
Avoids state lookup in optimized cases
Disadvantages
Quantized domain has limited granularity
Unbalanced for some cluster sizes, but 60 divides nicely:
1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30
Worst case is a load distribution ratio of 2:1 in 31 and 32 host clusters
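Both divisibility claims can be checked directly:

```python
# Checking the divisibility claims: the cluster sizes that divide 60
# evenly, and the 2:1 worst case at 31 and 32 hosts.

BUCKETS, MAX_HOSTS = 60, 32

even_sizes = [n for n in range(1, MAX_HOSTS + 1) if BUCKETS % n == 0]

def worst_ratio(n_hosts):
    """Most-loaded vs. least-loaded host when 60 buckets split n ways."""
    floor, ceil = BUCKETS // n_hosts, -(-BUCKETS // n_hosts)
    return ceil / floor

assert even_sizes == [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30]
assert worst_ratio(31) == worst_ratio(32) == 2.0
```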
Host 3: Now: 30-59, Next: 30-49
Host 2: Now: 0-29, Next: 0-9, 20-29
Host 1: Now: None, Next: 10-19, 50-59
NLB Cluster
Hosts 2 and 3 are converged. Host 1 joins the cluster. Convergence begins and all three hosts begin sending CONVERGING heartbeats. Each host uses the same algorithm to compute the new load distribution. The cluster converges and the hosts begin sending CONVERGED heartbeats. When convergence completes, each pair of hosts transfers the designated buckets via heartbeats. Buckets are removed from the donating host's bucket map before being handed off to the new owner.
Switch
Internet
Client 1
Packet Filtering
Filtered packets are those for which NLB will make an accept/drop decision (load-balance)
IP protocols that are filtered by NLB:
TCP, UDP
GRE: assumes a relationship with a corresponding PPTP tunnel
ESP/AH (IPSec): assumes a relationship with a corresponding IPSec/L2TP tunnel
ICMP: by default, all hosts accept ICMP; can be optionally filtered
Other protocols and Ethernet types are passed directly up to the protocol(s)
Client Affinity
None
Typically provides the best load balance
Uses both client IP address and port when hashing
Single
Used primarily for session support for SSL and multi-connection protocols (IPSec/L2TP, PPTP, FTP)
Uses only the client IP address when hashing
Class C
Used primarily for session support for users behind scaling proxy arrays
Uses only the class C subnet of the client IP address when hashing
Hashing
Packets hash to one of 60 buckets, which are distributed amongst hosts
NLB employs bi-level hashing
Level 1: Bucket ownership
Level 2: State lookup
NLB hashing operates in one of two modes
Optimized
Level 1 hashing only
The bucket owner accepts the packet unconditionally
Non-optimized
Level 1 and level 2 hashing
State lookup is necessary to resolve ownership ambiguity
Hashing
Protocols such as UDP always operate in optimized mode
No state is maintained for UDP, which eliminates the need for level 2 hashing
Protocols such as TCP can operate in either optimized or non-optimized mode
State is maintained for all TCP connections
When ambiguity arises, state lookup determines ownership
New connections always belong to the bucket owner
Global aggregation determines when other hosts complete service of a lingering connection and optimize out level 2 hashing
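The two-level decision can be sketched as follows. This is illustrative logic, not NLB's actual code: level 1 is bucket ownership, and level 2 (the descriptor lookup) runs only while ownership of a bucket is ambiguous, such as during a transfer:

```python
# Illustrative accept/drop logic for bi-level hashing. Not NLB's code.

def accept(five_tuple, syn, bucket, owned, ambiguous, descriptors):
    owns = bucket in owned
    if bucket not in ambiguous:    # optimized mode: level 1 only
        return owns
    if five_tuple in descriptors:  # level 2: we already service this one
        return True
    return owns and syn            # new connections go to the bucket owner

conn = ("10.0.0.5", 29290, "10.0.0.1", 80, "TCP")
# The old owner keeps servicing an existing connection through a rebalance:
assert accept(conn, syn=False, bucket=14, owned=set(), ambiguous={14},
              descriptors={conn})
# The new owner takes only new (SYN) connections on the transferred bucket:
assert not accept(conn, syn=False, bucket=14, owned={14}, ambiguous={14},
                  descriptors=set())
```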
Host 3
Host 2
Host 1
10.0.0.1 / 02-bf-0a-00-00-01
From: 10.0.0.1:80 To: 10.0.0.5:29290
Bucket ownership: 20-39, 0-19, 40-59
NLB Cluster
A client initiates a request to an NLB cluster. The hash on the IP 5-tuple (10.0.0.5, 29290, 10.0.0.1, 80, TCP) maps to bucket 14, owned by Host 3. Host 3 accepts the request. All other hosts drop the request. A response is sent back to the client.
Switch
Internet
From: 10.0.0.5:29290 To: 10.0.0.1:80
Client(s): 10.0.0.5 / 00-a0-cc-a1-cd-9f
Connection Tracking
Ensures that connections are serviced by the same host for their duration even if a change in bucket ownership occurs
Sessionful vs. sessionless hashing
E.g., UDP is sessionless
If an ownership change occurs, existing streams shift immediately to the new bucket owner
E.g., TCP is sessionful
If an ownership change occurs, existing connections continue to be serviced by the old bucket owner
Requires state maintenance and lookup to resolve packet ownership ambiguity
No TCP Connection Affinity
Host 3 (Client 1 Owner)
Host 2
Host 1
NLB Cluster
A client initiates a TCP connection by sending a SYN to the NLB cluster. The SYN is accepted by Host 1, and a SYN+ACK is sent back to the client. The client completes the three-way handshake by sending an ACK back to the NLB cluster. The ACK is accepted by Host 3, breaking the TCP connection.
Switch
Internet
Client 1
TCP Connection State
NLB maintains a connection descriptor for each active TCP connection
A connection descriptor is basically an IP 5-tuple
(Client IP, Client port, Server IP, Server Port, TCP)
In optimized mode, descriptors are maintained, but not needed when making accept/drop decisions
State and its associated lifetime are maintained by either:
TCP packet snooping
Monitoring TCP packet flags (SYN, FIN, RST)
Explicit notifications from TCP/IP
Using kernel callbacks
TCP Packet Snooping
NLB monitors the TCP packet flags
Upon seeing a SYN
If accepted, a descriptor is created to track the connection
Only one host should have a descriptor for this IP 5-tuple
Upon seeing a FIN/RST
Destroys the associated descriptor, if one is found
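The descriptor lifecycle under snooping can be sketched as (illustrative names, not NLB's internal structures):

```python
# Descriptor lifecycle under TCP packet snooping: create on an accepted
# SYN, destroy on FIN/RST. Names are illustrative.

descriptors = set()  # one entry per active TCP connection (IP 5-tuples)

def snoop(five_tuple, flags, accepted):
    if "SYN" in flags and accepted:
        descriptors.add(five_tuple)      # start tracking the connection
    elif flags & {"FIN", "RST"}:
        descriptors.discard(five_tuple)  # tear down state on close/reset

conn = ("10.0.0.5", 29290, "10.0.0.1", 80, "TCP")
snoop(conn, {"SYN"}, accepted=True)
assert conn in descriptors
snoop(conn, {"FIN"}, accepted=True)
assert conn not in descriptors
```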
Problems include:
Routing and interface metrics can cause traffic flow to be asymmetric
NLB cannot rely on seeing outbound packets
State is created before the connection is accepted
Both of which can result in stale/orphaned descriptors
Wasted memory, performance degradation, packet loss
TCP Connection Notifications
NLB receives explicit notifications from TCP/IP when connection state is created or destroyed
TCP/IP notifies NLB when a connection enters:
SYN_RCVD
A descriptor is created to track the inbound connection
CLOSED
Destroys the associated descriptor, if one is found
Advantages include:
NLB state maintenance remains in very close synchronization with TCP/IP
NLB TCP connection tracking is more reliable
Affords NLB protection against SYN attacks, etc.
TCP Connection Affinity
Host 3 (TCP Connection Descriptor; Client 1 Owner)
Host 2
Host 1
NLB Cluster
A client initiates a TCP connection by sending a SYN to the NLB cluster. The SYN is accepted by Host 3, and a SYN+ACK is sent back to the client. The client completes the three-way handshake by sending an ACK back to the NLB cluster. The ACK is accepted by Host 3 because it knows it has active TCP connections and it has a matching TCP connection descriptor.
Switch
Internet
The ACK is rejected by Host 1 because it knows that other hosts have active TCP connections and it does NOT have a matching TCP connection descriptor.
Client 1
Session Tracking
Session tracking complications in NLB are of one of two forms, or a combination thereof:
The inability of NLB to detect the start or end of a session in the protocol itself
The need to associate multiple seemingly unrelated streams and provide affinity to a single server
PPTP and IPSec/L2TP sessions are supported
Requires specialized support from PPTP and IPSec
UDP sessions are not supported
Session start and end are unknown to NLB
SSL sessions are not supported
SSL sessions span TCP connections
Session Tracking
The classic NLB answer to session support is to use client affinity
Assumes that client identity remains the same throughout the session
Session lifetime is highly indeterminate
Sessions can span many connections
AOL proxy problem (scaling proxy arrays)
Different connections in the same session may have different client IP addresses and/or ports
Using class C affinity can help, but is likely to highly skew the achievable load balance
Terminal Server
Subsequent connections may be from different locations
Scale-Out Limitations
Network limitations
Switch flooding
The pipe to each host in the cluster must be as fat as the uplink pipe
Not allowing the switch to learn the MAC address causes degraded switch performance as well
Incompatible with layer 3 switches
All hosts share the same virtual IP address(es)
CPU limitations
Packet drop overhead
Every host drops (N-1)/N % of all packets on average
Load-Balancing Limitations
The NLB load-balancing algorithm is static
Only the IP 5-tuple is considered when making load-balancing decisions
No dynamic metrics are taken into consideration
E.g., CPU, memory, total number of connections
No application semantics are taken into consideration
E.g., Terminal Server vs. IIS
NLB requires a sufficiently large (and varied) client population to achieve the configured balance
A small number of clients will result in poor balance
Mega proxies can significantly skew the load balance
Other Limitations
No inter-host communication is possible without a second NIC
Hosts are cloned and traffic destined for local MAC addresses doesn't reach the wire
Both multicast modes address this issue, but require a static ARP entry in Cisco routers
NLB generally has no session awareness
TCP connections are preserved during a rebalance
NLB provides connection, not session, affinity
E.g., SSL can/will break during a rebalance
Specialized support from NLB and VPN allows VPN sessions (tunnels) to be preserved during a rebalance
Summary
NLB is fully distributed, symmetric, software-based TCP/IP load balancing
Cloned services run on each host in the cluster and client requests are partitioned across all hosts
NLB provides high availability and scale-out for IP services
NLB is appropriate for load balancing:
Stateless services
Short-lived connections from many clients
Downloads such as HTTP or FTP GETs
In small clusters (less than 10 nodes)
Advanced Topics
Multicast
Bi-directional affinity
VPN session support
Multicast
All hosts share a common multicast MAC address
Each host retains its unique MAC address
Packets addressed to multicast MAC addresses are flooded by switches
NLB munges ARP requests to resolve all virtual IP addresses to the shared multicast MAC address
All ARP requests for the dedicated IP address of a host resolve to the unique hardware MAC address
Does not limit switch flooding
Does allow inter-host communication
Multicast
NLB multicast modes break an Internet RFC
Unicast IP addresses cannot resolve to multicast MAC addresses
Requires a static ARP entry on Cisco routers
Cisco routers won't dynamically add the ARP entry
Cisco plans to eliminate support for static ARP entries for multicast addresses
The ping-pong effect
In a redundant router configuration, multicast packets may be repeatedly replayed onto the network
Typically until the TTL reaches zero
Router utilization skyrockets
Network bandwidth plummets
IGMP Multicast
All hosts share a common IGMP multicast MAC address
IGMP does limit switch flooding
All cluster hosts join the same IGMP group
Hosts periodically send IGMP join messages on the network
ARP requests for all virtual IP addresses resolve to the shared IGMP multicast MAC address
Switches forward packets destined for the IGMP multicast MAC address only on the ports on which the switch has recently received a join message for that IGMP group
Still requires a static ARP entry in Cisco routers
Bi-Directional Affinity
Proxy/firewall scalability and availability
By default, NLB instances on distinct network adapters operate independently
Independently configured
Independently converge and distribute load
Independently make packet accept/drop decisions
Firewall stateful packet inspection requires:
That load-balancers associate multiple packet streams
That all related packet streams get load-balanced to the same firewall server
This is Bi-Directional Affinity (BDA)
No Bi-Directional Affinity
The client initiates a request to the NLB/Firewall cluster. One NLB/Firewall server accepts the client request. A firewall routes the request to the appropriate internal server. The internal server sends a response to the client via the NLB/Firewall cluster. The internal server response may be accepted by a different NLB/Firewall server than the one that handled the initial request. This breaks stateful packet inspection.
NLB/Firewall Cluster / Firewall State (SPI)
Host 1
Host 2
Host 3
Client(s)
Internet / Published Server
Stateful Packet Inspection
Firewalls maintain state [generally] on a per-connection basis
This state is necessary to perform advanced inspection of traffic through the firewall
Requires special load-balancing semantics
Load-balance incoming external requests for internal resources
Load-balance outgoing internal requests for external resources
Maintain firewall server affinity for the responses
Return traffic must pass through the same firewall server as the request
The Affinity Problem
Firewalls/proxies often alter TCP connection parameters
Source and destination ports
Source IP address
If translated, the host's dedicated IP address should be used
Destination IP address
In many scenarios, a published IP address is translated at the firewall into a private IP address
The packets of the request and associated response are often very different
Difficult for load-balancers to associate the two seemingly unrelated streams and provide affinity
The Affinity Problem
Incoming packets utilize the conventional NLB hashing algorithm
Port rule lookup: look up the applicable port rule using the server port
Hashing function: hash on the IP 2-tuple or 4-tuple
Bucket ownership: map that result to an owner, who accepts the packet
For firewalls/proxies, problems include:
The server port is different on the client and server sides of the firewall
Ports and IP addresses have been altered by the firewall
Each NLB instance has independent bucket ownership
BDA Teaming
Abandons some aspects of independence between designated NLB instances
Each member of a BDA team belongs to a different cluster that continues to converge independently
Primarily useful for consistency and failure detection
However, all members of a BDA team share load-balancing state, including:
Connection descriptors
Bucket distribution
Allows all team members to make consistent accept/drop decisions and preserve affinity
Preserving Affinity with BDA
Requirements include:
A single port rule, ports = (0 - 65535)
Eliminates problems with port rule lookup due to port translation
Single or Class C affinity on the only port rule
Eliminates hashing problems due to port translation
Server IP address not used in hashing
Eliminates hashing problems due to IP address translation
The lone common element in hashing is then the client IP address
Use the source IP address on incoming client requests
Use the destination IP address on server responses
This is often called reverse-hashing
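Reverse hashing fits in a few lines. This sketch only shows the direction-dependent choice of input; crc32 again stands in for NLB's internal hash, and the addresses are made up:

```python
# Reverse hashing in miniature: hash only the client IP, taken from the
# source of inbound requests and the destination of outbound responses,
# so both directions of a flow land in the same bucket regardless of
# what the firewall translated. crc32 stands in for NLB's internal hash.

import zlib

def client_bucket(client_ip, buckets=60):
    return zlib.crc32(client_ip.encode()) % buckets

def packet_bucket(src_ip, dst_ip, inbound):
    client_ip = src_ip if inbound else dst_ip  # reverse-hash on responses
    return client_bucket(client_ip)

request = packet_bucket("10.0.0.5", "192.168.0.9", inbound=True)
response = packet_bucket("192.168.0.9", "10.0.0.5", inbound=False)
assert request == response  # same bucket, hence the same firewall host
```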
Bi-Directional Affinity
The client initiates a request to the NLB/Firewall cluster. NLB hashes on the source IP address (the client IP address) of the request, and one NLB/Firewall server accepts the client request. A firewall routes the request to the appropriate internal server. The internal server sends a response to the client via the NLB/Firewall cluster. NLB hashes on the destination IP address (the client IP address) of the response. Bi-Directional Affinity ensures that the response is handled by the same NLB/Firewall server that handled the initial request. The response is sent back to the client.
NLB/Firewall Cluster / Firewall State (SPI)
Host 1
Host 2
Host 3
Client(s)
Internet / Published Server
BDA Miscellaneous
External entities are expected to monitor the health of BDA teams through NLB WMI events
E.g., if one member of a BDA team fails, the entire team should be stopped
All load will then be re-distributed to surviving hosts
Reverse hashing is set on a per-adapter basis
To override the configured hashing scheme on a per-packet basis, NLB provides a kernel-mode hook
Entities register to see all packets in the send and/or receive paths and can influence NLB's decision to accept/drop them
The hook returns ACCEPT, REJECT, FORWARD hash, REVERSE hash or PROCEED with the default hash
Enables extensions to BDA support for more complex firewall/proxy scenarios without explicit changes to NLB
VPN Session Support
Support for clustering VPN servers
PPTP
TCP tunnel
GRE Call IDs
IPSec/L2TP
Notifications from IKE
No FINs
MM and QM SAs
INITIAL_CONTACT