XS Japan 2008 Networking English

November 2008, [email protected]


DESCRIPTION

Leonid Grossman: Networking Via Direct Function Assignment

TRANSCRIPT

Page 1: XS Japan 2008 Networking English

November 2008, [email protected]

Page 2: XS Japan 2008 Networking English

Direct PCI Function assignment eliminates the performance overhead of virtualization, while preserving virtualization benefits like migration and leaving control over privileged operations with Dom0.
PCI Function assignment adoption is not gated by SR-IOV support – depending upon hardware design, it can work with both IOV-aware and legacy kernels.
With PCI functions implemented as complete netdev interfaces, no split netfront/netback driver model is required – native Linux or Windows drivers can run in guests. This allows leveraging work done for the native OSs:
◦ Features like multi-queue support, bonding, etc.
◦ Driver certification and distribution.

Page 3: XS Japan 2008 Networking English

Performance, Performance, Performance.
◦ The CPU% overhead of virtualizing 10GbE can be 4x-6x, depending upon networking workloads.
◦ Running benchmarks at 10GbE line rate in a guest is possible with multi-queue solutions – but at the expense of driver complexity and additional CPU% overhead.
Use of a single “native” driver in a guest, regardless of which hypervisor is deployed (or none).
◦ The approach offers a number of certification and distribution benefits, as well as the rich feature set of a native stack. The driver “sees” the same resources.
Incumbent solutions (emulation and para-virtualization) are further profiled in backup slides.

Page 4: XS Japan 2008 Networking English

The SR-IOV and MR-IOV specifications were released by the PCI-SIG this year.
A new generation of SR-IOV compliant NICs (like the Neterion 10GbE x3100) is being released in early 2009, with standard support for multiple PCI Functions:
◦ From the system perspective, each function is an independent NIC.
◦ The functions typically share an external physical port and PCI bus – the rest of the NIC resources can be dedicated.
Enabled by standards-based IOMMUs:
◦ The PCI-SIG standardizes the IOMMU with SR-IOV.
◦ Pre-IOMMU deployment is possible for prototyping.

Page 5: XS Japan 2008 Networking English

Multi-Port GbE NICs:
◦ Issues range from physical port/cable limitations to bandwidth limitations.
10GbE NICs with SR-IOV VFs implemented as a subset of a NIC, not as a complete NIC – for example, with VFs implemented as a Level 1 or Level 2 queue pair.
◦ A queue pair is a good way to scale across multiple CPU cores within a single OS image, but not across multiple OS images.
More detailed analysis is included in backup slides.

Page 6: XS Japan 2008 Networking English

This presentation is not a call to add support for multi-function NICs in Xen.
◦ Because the support is already there.
Xen, and the various GOSs, already have almost everything they need to support multi-function NICs:
◦ Xen has PCI Function Delegation.
◦ Xen has migration.
◦ GOSs support bonding/teaming drivers.
◦ GOSs support PCI device insertion/removal.
◦ An example of configuring a multi-function x3100 10GbE NIC for direct access and migration is included in backup slides.

Page 7: XS Japan 2008 Networking English

NIC presents itself as multiple PCI functions.
◦ Xen can assign as many to each guest as it wants to.
Still need to deal with more VMs than direct PCI Functions.
◦ Correct – only I/O-intensive VMs need a dedicated function.
“Needing” a h/w specific driver in the guest.
◦ That’s not a bug, that’s a feature; a native, feature-rich, certified driver is already included in the guest.
Full L2+ switch functionality on the NIC.
◦ No need for a “full switch” – only raw frame forwarding services are required. The functionality is similar to a soft switch in Dom0.
Relies on an Address Translation Solution.
◦ The IOMMU is just the most likely solution.

Page 8: XS Japan 2008 Networking English

Act as a configured switch under Dom0 privileged control.
◦ Forwarding may be statically controlled, not learned.
Direct incoming frames to the correct PCI Function.
Provide internal VNIC-to-VNIC frame forwarding.
Provide VNIC to external port forwarding:
◦ Can prevent forged source addresses.
◦ Can enforce VLAN membership.
Device domain participation is optional, but Dom0 can control things it already knows (VLAN participation, etc.).
More details are included in backup slides.

Page 9: XS Japan 2008 Networking English

Each Guest Driver directly controls a Virtual NIC.
Each Native Driver manages only its VNIC.
The Privileged (Device Domain) Driver also manages the shared resources.
◦ Or whatever portion of them it wants to manage.

Page 10: XS Japan 2008 Networking English

Some of the resources supported by current Neterion x3100 10GbE drivers:
◦ VF MAC addresses
◦ VF VLAN membership
◦ VF promiscuous mode
◦ Bandwidth allocation between VFs
◦ Device-wide errors
◦ Link statistics
◦ Link speed and aggregation modes
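Illustrative only: once a VF is visible as an ordinary netdev in the owning domain, several of the per-VF items above map onto standard Linux tools of the time (the interface name eth1 and VLAN ID 100 below are assumptions); device-wide controls such as inter-VF bandwidth allocation go through the privileged x3100 driver rather than these generic commands.
◦ >>ethtool eth1            # link speed as seen by this VF
◦ >>ethtool -S eth1         # per-interface link statistics
◦ >>vconfig add eth1 100    # VF VLAN membership via a VLAN sub-interface (2008-era vlan utility)
◦ >>ifconfig eth1 promisc   # request promiscuous mode; the device/privileged driver may restrict it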

Page 11: XS Japan 2008 Networking English

Migration Support using Guest OS Services

Page 12: XS Japan 2008 Networking English

Frontend/Backend supplies excellent migration already.
◦ But it requires a Hypervisor-specific frontend driver.
Because it is the only universally supported solution, it plays a critical role in enabling migration.

Page 13: XS Japan 2008 Networking English
Page 14: XS Japan 2008 Networking English

Frontend/Backend is kept in place and is always available.
Direct Assignment is used for the most important Guests.
◦ Each multi-function device will have a limit on how many guests it can directly support.
The Native Driver talks directly to the NIC through its own PCI function, if enabled.
The Bonding Driver uses frontend/backend if a direct NIC is not available.
More details on migration are included in backup slides.

Page 15: XS Japan 2008 Networking English

[Diagram: Server Blade Chassis with an internal Ethernet switch. Server blades with mezzanine ("Mezz") adapters feed an internal Ethernet switch fabric module that uplinks to the external Ethernet fabric; roughly 360 Gb/sec of aggregate adapter bandwidth contends for 40-80 Gb/sec of 'egress' bandwidth. Management is split between server management, internal Ethernet switch management, and Ethernet 'fabric' management. Callouts: unclear management responsibility (Ethernet switches inside the server domain), no link-layer flow control, serious congestion issues, interoperability concerns, non hot-swappable I/O, extra heat on each blade.]

Page 16: XS Japan 2008 Networking English

[Diagram: Server Blade Chassis with an internal PCIe switch. Server blades connect over PCIe links to a hot-swappable fabric module carrying E3100 10GbE ASICs that uplink to the external Ethernet fabric; roughly 360 Gb/sec of aggregate adapter bandwidth feeds 20-40 Gb/sec of 'egress' bandwidth. Clear management responsibility: servers and the network are managed independently (server management vs. Ethernet fabric management). Callouts: no interoperability concerns, PCIe link-layer flow control, the E3100 10GbE ASIC and PCIe switch manage congestion, hot-swappable I/O module, heat/power removed from the blades.]

Page 17: XS Japan 2008 Networking English

Direct PCI Function assignment eliminates the performance overhead of virtualization, while preserving virtualization benefits like migration and leaving control over privileged operations with Dom0.
PCI Function assignment adoption is not gated by SR-IOV support – depending upon hardware design, it can work with both IOV-aware and legacy kernels.
With PCI functions implemented as complete netdev interfaces, no split netfront/netback driver model is required – native Linux or Windows drivers can run in guests. This allows leveraging work done for the native OSs:
◦ Features like multi-queue support, bonding, etc.
◦ Driver certification and distribution.

Page 18: XS Japan 2008 Networking English
Page 19: XS Japan 2008 Networking English

A multi-port NIC has multiple physical ports.
◦ Or possibly simply multiple NICs.
Each port is a distinct PCI function.
◦ Each PCI function can be directly assigned to a Guest.
◦ Provides the benefits of Native Driver usage:
  No virtualization penalty.
  Full features of native mode available.
  Single driver in the OS image.
But an entire port is a big thing to assign:
◦ Not enough space for physical ports and cables.
◦ No granularity on assignment.
◦ Bandwidth wasted if the assignee has nothing to send.

Page 20: XS Japan 2008 Networking English

The same mechanisms that enable direct assignment of stand-alone PCI devices also enable direct assignment of PCI functions that share a port:
◦ MSI-X Mapping.
◦ Page Mapping.
◦ Config Space Mapping.
◦ Ability to use Native Drivers.
Only the issues related to the shared resources need to be solved.
◦ Most do not require any additional management from Xen.
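A minimal sketch of what this looks like in practice, reusing the x3100 function address from the backup slides (the BDF, domain ID, and kernel-parameter placement are assumptions for a 2008-era classic Xen Dom0 kernel): the function is hidden from Dom0 with pciback and then named in the guest configuration, after which the guest loads the vendor's native driver against it.
◦ Dom0 kernel command line, so pciback claims the function at boot: pciback.hide=(0000:05:00.1)
◦ Guest xm configuration file, delegating that function to the DomU: pci = [ '0000:05:00.1' ]
The runtime unbind/bind and xm pci-attach variant of the same delegation is shown step by step in the backup slides (Page 48).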

Page 21: XS Japan 2008 Networking English

A Multi-queue NIC provides multiple independent queues within a PCI function.
The Native OS can use multiple queues itself:
◦ CPU Affinity, QoS, Ethernet Priorities.
DomD can utilize Guest-specific queues:
◦ But this is not true device assignment.
  The Backend must validate/translate each request (WQE/Transmit Descriptor).
  It does not enable the vendor’s native driver, which already knows how to use multiple queues.

Page 22: XS Japan 2008 Networking English

Limited-functionality queues can be set up as “Queue Pairs” in the Guest (or even in Guest applications).
Creation/Deletion/Control is still managed through the Device Domain.
Hardware-specific code is required:
◦ In both the Guest and the Device Domain.
Single-threaded control means entangled control.
◦ The control-plane work done for a guest needs to be separated out before it can be migrated.

Page 23: XS Japan 2008 Networking English

Multi-function NICs present each external port as though it were a multi-queue NIC:
◦ But for each of multiple PCI functions.
◦ Each PCI function is virtual – but that term is reserved.
Each PCI function can be directly assigned:
◦ Allowing Native Drivers to be used as-is.
◦ Allowing support of a large number of Guests with a compact number of ports.
◦ Enabling existing support of multi-queue capabilities already in the Guest OS.

Page 24: XS Japan 2008 Networking English

[Diagram: RDMA-style NIC. Labels include: DomU and Device Domain; Queue Pairs; Packet Input Routing / Output Mixing; External Ports; VNIC; “Verbs” Native Driver; Frontend/Backend; Bridging; Application Backend; and a second DomU with a “Verbs” Frontend and Application.]

Page 25: XS Japan 2008 Networking English

Page Protection is not enough.
◦ Full Function Assignment requires true isolation of each PCI Function; protecting pages is not enough. If a bad configuration register can hang the device, then the functions are not truly independent.
◦ Some devices can only support direct Fastpaths. A direct fastpath does not address driver distribution issues – it still requires two drivers: one when virtualized, one when running in native mode.
Single-path Slowpath Control means entangled Slowpaths.
◦ Untangling for migration is not guaranteed to be easy.
RDMA users complain about the cost of slowpath operations.
◦ Virtualization will only make it worse.

Page 26: XS Japan 2008 Networking English

Availability:
◦ Is the driver installed in the Guest OS image?
Efficiency:
◦ Does the driver interface efficiently with the NIC?
Migration:
◦ Can Guests using this Driver be migrated?
Flexibility:
◦ Can new services be supported?

Page 27: XS Japan 2008 Networking English

Availability:
◦ Excellent – NICs to be emulated are selected based on widespread deployment.
Performance:
◦ Terrible.
Migration:
◦ Not a problem.
Flexibility:
◦ None. You’re emulating a 20th-century NIC.

Page 28: XS Japan 2008 Networking English

Availability:
◦ Good, but there is a lag problem on which frontend has made it into the OS distribution.
Performance:
◦ Tolerable.
Migration:
◦ Not a problem.
Flexibility:
◦ New features require extensive collaboration.

Page 29: XS Japan 2008 Networking English

Availability:
◦ Excellent. The same driver is used whether running natively or under any Hypervisor.
◦ NIC vendors already deal with OS distributions.
Performance:
◦ Same as native.
Migration:
◦ Not really a problem – details to follow.
Flexibility:
◦ Same as native.

Page 30: XS Japan 2008 Networking English

Multi-queue is a valuable feature.
◦ But it does not really compensate for being a single-PCI-Function device.
Multi-function NICs are multi-queue NICs.
◦ But each queue is owned by a specific PCI Function.
◦ It operates within the function-specific I/O map, allowing the GOS to communicate GPAs directly to the NIC.
Each PCI Function has its own:
◦ MSI-X.
◦ PCI Config space.
◦ Function Level Reset.
◦ Statistics.
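As an illustrative check (the bus addresses reuse the x3100 example from the backup slides and are assumptions here), each function shows up in lspci with its own BARs, config space, and MSI-X capability:
◦ >>lspci -v -s 05:00.0     # first function: independent BARs, MSI-X capability, etc.
◦ >>lspci -v -s 05:00.7     # eighth function: a separate config space of its own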

Page 31: XS Japan 2008 Networking English

No performance penalty.
◦ The GOS Driver interacts with the Device the same way it would without virtualization.
◦ There is zero penalty to the host.
Multi-function NICs offload the cost of sharing.
◦ Frontend/Backend solutions always cost more: address translation has a non-zero cost, copying even more; the latency penalty is unavoidable, since an extra step cannot take zero time.
Can support ANY service supported by the Native OS.
◦ Because the Native OS Driver sees the same resources.

Page 32: XS Japan 2008 Networking English

This is not a bug. It is a feature.
There already is a device-specific driver in the Guest OS image.
The vendor worked very hard to get it there.
◦ And to get it tested and certified.
◦ There is already a distribution chain, which customers strongly prefer.
◦ It already integrates hardware capabilities with Native OS requirements.

Page 33: XS Japan 2008 Networking English

“Soft Switches” have traditionally performed switch-like services within the Device Domain.
◦ But they generally have not been fully compliant 802.1Q Bridges.
◦ For example, most do not enable Spanning Tree.
◦ And in fact that is generally a good thing: they don’t have any real information to contribute to topology discovery, and they’d just slow things down.
The Dom0 Soft Switch will still be part of the solution.
◦ It must support VMs not given direct assignments.
◦ Some co-ordination between the two “Switches” is desired.

Page 34: XS Japan 2008 Networking English

At minimum, these services act as a configured switch under privileged control.
◦ Forwarding may be statically controlled, not learned.
Directs incoming frames to the correct PCI Function.
Provides internal VNIC-to-VNIC frame forwarding.
Provides VNIC to external port forwarding:
◦ Some form of traffic shaping is probably required.
◦ Must prevent forged source addresses.
◦ Must enforce VLAN membership.

Page 35: XS Japan 2008 Networking English

Virtual links cannot be monitored.
◦ Need to know where frames really came from.
The MF-NIC MUST prevent spoofing.
Validating the local MAC Address is enough.
◦ IP and DNS can be checked using existing tools, given a validated MAC address.
Only privileged entities can grant a PCI Function permission to use a given MAC Address.

Page 36: XS Japan 2008 Networking English

Manufacturer:
◦ Each directly assigned Ethernet function is a valid Ethernet device.
  It has a manufacturer-assigned, globally unique MAC Address.
  No dynamic action is required to enable use of this address.
Zero-configuration conventions for VLANs:
◦ Membership in the default VLAN (usually VLAN 1) is typically enabled even without network management.

Page 37: XS Japan 2008 Networking English

Both network management and Xen already assign MAC Addresses and VLAN memberships to Ethernet ports.
Xen can perform this function itself, or enable network management daemons in a designated domain to control the multi-function NIC’s “switch ports”.
In either case, the “port configuration” is a privileged operation through the PCI Function driver.
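For comparison, this is the convention Xen already applies to para-virtual interfaces through the xm domain configuration (the MAC value and bridge name below are illustrative placeholders within Xen's reserved 00:16:3e OUI):
◦ vif = [ 'mac=00:16:3e:12:34:56, bridge=xenbr0' ]
A multi-function NIC’s “switch port” configuration can follow the same model, carried out as a privileged operation through the PCI Function driver.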

Page 38: XS Japan 2008 Networking English

Neither the Hypervisor nor Dom0 needs to be involved.
◦ The Shared Device and its PCI Function Drivers can implement their own solution.
◦ The device already knows how to talk with each VF Driver, and which VFs are active.
Dom0 can control things it already knows:
◦ MAC Addresses of VNIC ports.
◦ VLAN Membership.
But Dom0 does not need to deal with new device-specific controls.
◦ All required logic can be implemented in device-specific drivers and/or network management daemons.

Page 39: XS Japan 2008 Networking English

A Device Domain bridge is created for each VLAN.
◦ Each does simple bridging within a VLAN (802.1D).
A script (vif-bridge) is run whenever a VIF is added for a DomU.
◦ It controls which Bridge the VIF is attached to.
◦ It may specify the MAC Address.
  The default is random within a reserved OUI.
The Device Domain’s OS may provide other filtering/controls.
◦ Such as netfilter.
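A rough sketch of that per-VLAN bridge arrangement, using the standard bridge and VLAN utilities of the time (interface names, bridge names, and the VLAN ID are assumptions):
◦ >>vconfig add peth0 100              # VLAN 100 sub-interface on the physical uplink
◦ >>brctl addbr xenbr100               # a dedicated 802.1D bridge for VLAN 100
◦ >>brctl addif xenbr100 peth0.100     # attach the VLAN sub-interface to it
◦ >>ifconfig peth0.100 up
◦ >>ifconfig xenbr100 up
The vif-bridge script then attaches each DomU’s vif to the bridge for its VLAN.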

Page 40: XS Japan 2008 Networking English

Many methods are possible:
◦ As though there were an 802.1Q Bridge per external port.
◦ Static defaults applied, like an unmanaged switch.
  All VNIC MAC Addresses are manufacturer supplied.
◦ Privileged operations via the Native Driver.
  Enabled for DomD or stand-alone Native Drivers.
◦ Combinations of the above.
The existing vif-bridge script could easily configure the vsport matching the VNIC for a directly assigned VIF.
◦ It already has the MAC Address and any VLAN ID.
◦ Suggested naming convention: use the PCI Function number to name the Backend instance. This simplifies pairing with the direct device.

Page 41: XS Japan 2008 Networking English

A multi-function NIC is unlikely to fully support all netfilter rules in hardware.
When considering Direct Assignment:
◦ Determine which netfilter rules are implemented by the multi-function NIC’s frame forwarding services.
◦ Determine if the remaining netfilter rules can be trusted to the DomU.
◦ If there are filters that the hardware cannot implement, and that cannot be trusted to the DomU, then don’t do the direct assignment.
Direct Assignment complements frontend/backend; it is not a replacement.
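An illustrative audit step (the rule sets themselves are site-specific): listing the filtering Dom0 currently enforces on bridged guest traffic gives the baseline that the hardware forwarding services and the DomU would have to cover between them.
◦ >>iptables -L FORWARD -v      # IP-level rules applied to forwarded/bridged guest traffic
◦ >>ebtables -L                 # link-layer rules, e.g. per-vif MAC or VLAN restrictions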

Page 42: XS Japan 2008 Networking English

Same-to-same migration only requires checkpoint/restore of any Device state via the VF Driver.
◦ Once state is checkpointed in VM memory, the Hypervisor knows how to migrate the VM.
Many services do not require migration.
◦ Each VM implements one instance of a distributed service; persistence is a shared responsibility.
◦ Most Web servers fall into this category.
The GOS probably already provides failover between dissimilar devices through bonding drivers.

Page 43: XS Japan 2008 Networking English

Directly Assigned Devices can be migrated using existing services:
◦ GOS Device Bonding / Failover.
  Including netfront/netback in the team enables migrations between platforms with different hardware.
◦ GOS support of PCI device insertion/removal.
  Including check-pointing of any stateful data in host memory.

Page 44: XS Japan 2008 Networking English

The GOS already knows how to fail over between two NICs.
◦ Same-to-same migration requires the same GOS services whether failing over between two physical NICs or as a result of VM migration.
Xen already provides a universally available NIC via para-virtualization.
◦ So a same-to-same migration is always possible.

Page 45: XS Japan 2008 Networking English

Requirement: the device must be able to checkpoint any per-client stateful image in the client’s memory space.
The device is told when to checkpoint any Guest-specific stateful information in the Guest memory image.
Migrating the Guest’s check-pointed memory image is a known problem that is already solved.
The device driver on the new host is told to restore from the check-pointed memory image.
◦ The checkpointed image should be devoid of any absolute (non-VF-relative) references.

Page 46: XS Japan 2008 Networking English

Not all platforms have the same direct-access NICs, but same-to-same migration can be used anyway.
Method A: post-migration makes right.
◦ Just do a same-to-same migration anyway. It will work.
  Of course, because the actual device is missing on the new platform, the re-activated instance will fail, invoking the existing device failover logic within the Guest.
◦ Possible enhancement: provide the PCI Device removal event immediately on migration.

Page 47: XS Japan 2008 Networking English

Method B: migrate same-to-same via netfront.
◦ Fail the Directly Assigned device.
◦ The GOS will fail over to the Frontend device.
◦ Migrate same-to-same to the new target platform, which can always support netfront.
◦ Enable the appropriate Directly Assigned device on the new platform.
◦ The GOS is informed of the newly inserted PCI Function.
◦ The GOS will fail over to the preferred device as though it were being restored to service.

Page 48: XS Japan 2008 Networking English

View of X3100 functions in Dom0:
◦ >>lspci -d 17d5:5833
◦ 05:00.0 Ethernet controller: Neterion Inc.: Unknown device 5833 (rev 01)
◦ --
◦ 05:00.7 Ethernet controller: Neterion Inc.: Unknown device 5833 (rev 01)
Export a function to DomU, so the vxge driver can be loaded:
◦ >>echo -n 0000:05:00.1 > /sys/bus/pci/drivers/vxge/unbind
◦ >>echo -n 0000:05:00.1 > /sys/bus/pci/drivers/pciback/new_slot
◦ >>echo -n 0000:05:00.1 > /sys/bus/pci/drivers/pciback/bind
◦ >>xm pci-attach 1 0000:05:00.1
Configure a bonding interface in DomU with active-backup policy and ARP link monitoring, so the delegated interface (say eth1) can be enslaved to the bonding interface:
◦ >>modprobe bonding mode=1 arp_interval=100 arp_ip_target=17.10.10.1
◦ >>ifconfig bond0 17.10.10.2 up
◦ >>echo +eth1 > /sys/class/net/bond0/bonding/slaves
◦ >>echo eth1 > /sys/class/net/bond0/bonding/primary
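A quick way to confirm that the delegated interface became the active slave (an assumed verification step, not part of the original slide sequence):
◦ >>cat /proc/net/bonding/bond0
This shows the bonding mode, the currently active slave, and per-slave link status as monitored by the ARP target.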

Page 49: XS Japan 2008 Networking English

Create a virtual interface to enslave in DomU as a backup, and remove the delegated interface from DomU to transfer traffic to the virtual interface:
◦ >>xm network-attach 1 bridge=xenbr1
◦ >>echo +eth2 > /sys/class/net/bond0/bonding/slaves
◦ >>echo -eth1 > /sys/class/net/bond0/bonding/slaves
◦ >>ifconfig eth1 down
◦ >>rmmod vxge
Detach the PCI function from DomU and migrate the DomU to the destination Xen machine (say its IP is 172.10.9.7):
◦ >>xm pci-detach 1 0000:05:00.1
◦ >>xm migrate --live 1 172.10.9.7
At this point, network traffic runs on the virtual interface on the destination machine.

Page 50: XS Japan 2008 Networking English

Delegate the function on the destination machine:
◦ >>xm pci-attach 1 0000:02:00.1
Enslave the direct interface:
◦ >>echo +eth3 > /sys/class/net/bond0/bonding/slaves
◦ >>echo eth3 > /sys/class/net/bond0/bonding/primary
◦ >>echo -eth2 > /sys/class/net/bond0/bonding/slaves
Remove the virtual backup in DomU:
◦ >>xm network-list 1
◦ Idx BE MAC Addr.          handle state evt-ch tx-/rx-ring-ref  BE-path
◦ 0   0  00:16:3e:45:de:53  0      4     8      768/769          /local/domain/0/backend/vif/1/0
◦ 2   0  00:16:3e:61:91:0a  2      4     9      1281/1280        /local/domain/0/backend/vif/1/2
◦ >>xm network-detach 1 2

Page 51: XS Japan 2008 Networking English