
B4: Experience with a Globally Deployed Software Defined WAN

Why?

• To save money!
  – WAN hardware and links are over-provisioned
  – But this hardware is expensive!
  – And Google's traffic between DCs is increasing rapidly!

Assumptions/Insights

• Control over applications, servers, switches

• Only a few dozen datacenters

• Applications can:
  – handle failures
  – adapt to changing bandwidth
• Application class and priority indicate traffic patterns/importance

Implementation

• Full control over WAN routing – WAN-scale SDN deployment
• Managing the links in a smart way – Traffic Engineering (TE)

STEP 1: TAKING CONTROL OVER WAN LINKS

Presenter Notes:
Before SDN, Google ran B4 as a single Autonomous System using BGP/IS-IS protocols. With the current implementation, each WAN site is treated as a separate AS, and iBGP is used between them as a backup. At the global level, the SDN Gateway controls flows between sites.
Presenter Notes:
RIB – Routing Information Base. RPC – Remote Procedure Call. RAP – Routing Application Proxy, written as an SDN application for routing updates, handling routing-protocol packets between Quagga and the OF switches, and passing interface updates from the switches up to Quagga. The RAP caches the Quagga RIB and translates RIB entries into Onix's NIB entries.
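As a rough illustration of that translation step, here is a minimal Python sketch that caches incoming RIB entries and rewrites each one into a NIB-style entry for the controller; the RibEntry/NibEntry classes and the interface-to-port mapping are hypothetical stand-ins, not the actual Quagga or Onix data structures:

from dataclasses import dataclass

@dataclass(frozen=True)
class RibEntry:              # what Quagga computed: prefix reachable via a next hop/interface
    prefix: str
    next_hop: str
    interface: str

@dataclass(frozen=True)
class NibEntry:              # what the controller programs: prefix mapped to an output port
    prefix: str
    out_port: int

class RoutingApplicationProxy:
    def __init__(self, port_of_interface):
        self.port_of_interface = port_of_interface   # interface name -> OF port number
        self.rib_cache = {}                          # cached copy of the Quagga RIB

    def on_rib_update(self, entry):
        """Cache the RIB entry and translate it into a NIB entry."""
        self.rib_cache[entry.prefix] = entry
        return NibEntry(prefix=entry.prefix,
                        out_port=self.port_of_interface[entry.interface])

# Example: a route Quagga learned on interface "eth1" becomes a NIB entry on OF port 3.
rap = RoutingApplicationProxy({"eth1": 3})
print(rap.on_rib_update(RibEntry("10.0.0.0/24", "192.168.1.1", "eth1")))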

STEP 2: DECIDING WHO GETS RESOURCES

Traffic Engineering (TE)

• Goal: Share bandwidth among competing applications, possibly using multiple paths.

• Sharing bandwidth is defined by Google as max-min fairness.

• Basics of max-min fairness (see the sketch after this list):
  – No source gets a resource share larger than its demand.
  – Sources with unsatisfied demands get an equal share of the resource.
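A minimal Python sketch of max-min fair allocation over a single shared resource, assuming each application has a fixed demand (the names, demands, and capacity below are illustrative, not B4 figures):

def max_min_fair(demands, capacity):
    """Allocate `capacity` so that no source gets more than its demand and
    sources with unsatisfied demands end up with equal shares."""
    allocation = {src: 0.0 for src in demands}
    remaining = dict(demands)                    # unmet demand per source
    while remaining and capacity > 1e-9:
        share = capacity / len(remaining)        # equal split of what is left
        satisfied = [s for s, d in remaining.items() if d <= share]
        if not satisfied:
            # Nobody's demand fits within the equal share: everyone gets the share.
            for s in remaining:
                allocation[s] += share
            break
        # Sources whose demand fits get exactly their demand; leftover capacity
        # is redistributed among the rest in the next round.
        for s in satisfied:
            allocation[s] += remaining[s]
            capacity -= remaining[s]
            del remaining[s]
    return allocation

# Example: a 10 Gb/s link and three applications demanding 2, 4 and 8 Gb/s
# yields {A: 2, B: 4, C: 4}; C is unsatisfied and takes what is left.
print(max_min_fair({"A": 2, "B": 4, "C": 8}, capacity=10))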

TE Optimization Algorithm

• Traditional solutions are expensive
• Google's solution:
  – Aggregate flows into flow groups and tunnel groups
  – 25x faster, and utilizes at least 99% of the bandwidth
• Three steps (see the sketch after this list):
  – Tunnel Selection: select tunnels for each flow group (FG)
  – Tunnel Group Generation: allocate bandwidth to FGs
  – Tunnel Group Quantization: change split ratios in each TG to match the granularity supported by switch hardware
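The quantization step can be illustrated with a small Python sketch; the 0.25 split granularity and the largest-remainder rounding used here are assumptions for illustration, not B4's exact procedure:

def quantize_splits(ideal_splits, quantum=0.25):
    """Snap per-tunnel split ratios to multiples of `quantum`, keeping the sum at 1."""
    slots = round(1.0 / quantum)                  # discrete units to hand out
    floors = {t: int(r / quantum) for t, r in ideal_splits.items()}
    leftover = slots - sum(floors.values())
    # Hand the remaining units to the tunnels that lost the most in rounding down.
    by_remainder = sorted(ideal_splits,
                          key=lambda t: ideal_splits[t] - floors[t] * quantum,
                          reverse=True)
    for t in by_remainder[:leftover]:
        floors[t] += 1
    return {t: n * quantum for t, n in floors.items()}

# Example: an ideal TE split of 0.55 / 0.30 / 0.15 over three tunnels becomes
# 0.50 / 0.25 / 0.25 when the hardware only supports quarters.
print(quantize_splits({"T1": 0.55, "T2": 0.30, "T3": 0.15}))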

TE State and OpenFlow

• B4 switches operate in three roles (see the sketch after this list):

1. Encapsulating switch initiates tunnels and splits traffic between them.

2. Transit switch forwards packets based on their outer header.

3. Decapsulating switch terminates tunnels then forwards packets using regular routes.
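A minimal Python sketch of the three roles, assuming a hash-based split between tunnels at the encapsulating switch (the Packet fields and the flow-hash scheme are illustrative, not B4's actual encapsulation format):

import hashlib
from dataclasses import dataclass

@dataclass
class Packet:
    five_tuple: tuple      # (src_ip, dst_ip, proto, src_port, dst_port)
    outer_dst: str = None  # tunnel endpoint, set while encapsulated

def encapsulate(pkt, tunnels, weights):
    """Encapsulating switch: pick a tunnel per flow and push an outer header."""
    # Hash the 5-tuple so every packet of one flow takes the same tunnel.
    h = int(hashlib.md5(str(pkt.five_tuple).encode()).hexdigest(), 16) % 100
    cumulative = 0
    for tunnel, weight in zip(tunnels, weights):
        cumulative += int(weight * 100)
        if h < cumulative:
            pkt.outer_dst = tunnel
            return pkt
    pkt.outer_dst = tunnels[-1]  # guard against rounding of the weights
    return pkt

def transit(pkt, next_hops):
    """Transit switch: forward on the outer header only, ignoring the inner flow."""
    return next_hops[pkt.outer_dst]

def decapsulate(pkt):
    """Decapsulating switch: strip the outer header; normal routing takes over."""
    pkt.outer_dst = None
    return pkt

# Example: one flow split 75/25 between two tunnels toward the same site.
pkt = encapsulate(Packet(("10.0.0.1", "10.1.0.1", 6, 4321, 443)),
                  tunnels=["tunnel-1", "tunnel-2"], weights=[0.75, 0.25])
print(pkt.outer_dst)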

Using TE and shortest path together

STEP 3: RESULTS

Link Utilization

Failures

Google conducted experiments to test the recovery time from different types of failures. Their results are summarized below:

Presenter Notes:
Transit switch failure is slow to recover from because the encapsulating switch must update table entries for potentially several tunnels, and each such operation typically takes 100 ms.

Experience with an outage

• During a move of switches from one physical location to another, two switches were manually configured with the same ID.

• This resulted in the network state never converging to the actual topology.

• The system recovered only after all traffic was stopped, buffers were emptied, and the OFCs were restarted from scratch.

Backup
