niky r networking at scale v6 sl - afpif · 03/08/2018  · niky_r_networking_at_scale_v6_sl.key...

Post on 19-Aug-2020


INFRASTRUCTURE

Edge Fabric: Steering Oceans of Content to the World

Robel Kitaba, Network Engineer, Facebook

Locations are for visualization purposes only; they do not reflect the current configuration.

Global Load Balancer: manages ingress traffic

Latency-based telemetry (SONAR)

[Figure labels: PX, Network, Backbone, Transit, PNI]

PoP: Point of Presence (colo facilities)

PNI Links: Direct peering with user networks

PX Links: Peering with networks over shared infrastructure

Transit Links: Peering with intermediate networks that provide global reachability

[Plot over 1 day: total egress capacity at PoP vs. total traffic at PoP]

[Plot over 1 day: total egress capacity at PoP, total traffic at PoP, capacity for iface@PoP, demand for iface@PoP. Annotations: >250%, drops]

Why demand exceeds capacity

Peering with other networks using BGP

Local Preference

MED

AS Path length

Communities

BGP (STATIC)

best BGP path

POP
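
To make the "static" point concrete, here is a minimal Python sketch of a simplified BGP decision process that uses only the attributes listed above. The `Route` type, the `best_path` function, and the example values (including MED 0) are assumptions for illustration; real BGP has more tie-breakers and compares MED only between routes from the same neighboring AS.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of a BGP route."""
    prefix: str
    local_pref: int
    as_path: Tuple[int, ...]
    med: int
    next_hop: str


def best_path(routes: List[Route]) -> Route:
    """Simplified decision: highest LocalPref, then shortest AS path, then lowest MED."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path), r.med))


# Illustrative values only.
routes = [
    Route("1.2.3.0/24", 500, (100,), 0, "42.1.3.1"),
    Route("1.2.3.0/24", 200, (7018, 100), 0, "201.2.4.12"),
]
chosen = best_path(routes)
print(chosen.next_hop)
```

Nothing in `best_path` looks at link capacity or current load, so the same route wins however much traffic is sent toward it.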


Traffic demand changes

Limited capacity

Performance variations

Transient failures

BGP (STATIC) vs. REALITY (DYNAMIC)

[Figure labels at the PoP: best BGP path, unused, overloaded]

Local Edge Controller: Edge Fabric

"Engineering Egress with Edge Fabric: Steering Oceans of Content to the World", Brandon Schlinker et al, SIGCOMM 2017


LOCAL CONTROLLER’S JOURNEY

V0: Manual interventions to change BGP policy when there were failures in PNIs. Not scalable, too slow, error prone.

V1: Set up MPLS paths from end hosts to PRs in order to choose egress links. Restrictions on hardware.

V2: Use DSCP marking at the end hosts to indicate the egress link. Not scalable, coordination of config, rigid assumptions.

V3: Use GRE tunnels from end hosts to PRs. Coordination of config, vendor bug.

V4: Use BGP injections at PRs. Flexible, dynamic, decouples decisions from PoP architecture.

[Figure: edge cluster racks, local controller, and peering router (PR), connected to Network 1 over PNI, PX, and transit links]

BGP INJECTION MODE

[Figure sequence: peering router with PNI and transit links; EF controller connected to the peering router over a BGP session]

Routes for 1.2.3.0/24 at the peering router, also held by the EF controller:

PNI route: Dest 1.2.3.0/24, LocalPref 500, ASPath 100, Nexthop 42.1.3.1, Community 100:1

Transit route: Dest 1.2.3.0/24, LocalPref 200, ASPath 7018,100, Nexthop 201.2.4.12, Community 7018:1

The EF controller takes the transit route and sends it to the peering router over the BGP session with a raised LocalPref:

Injected route: Dest 1.2.3.0/24, LocalPref 50000, ASPath 7018,100, Nexthop 201.2.4.12, Community 7018:1
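
The effect of the injection can be sketched in a few lines of Python, using the route attributes shown on the slides. The `Route` type and the `best_path` and `make_override` helpers are hypothetical names for illustration, and the decision process is reduced to LocalPref and AS path length; this is a sketch of the idea, not the production implementation.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

OVERRIDE_LOCAL_PREF = 50000  # LocalPref shown on the slide for the injected route


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of a BGP route."""
    prefix: str
    local_pref: int
    as_path: Tuple[int, ...]
    next_hop: str
    community: str


def best_path(routes: List[Route]) -> Route:
    """Simplified decision: highest LocalPref, then shortest AS path."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


def make_override(alternate: Route) -> Route:
    """Copy an existing route, changing only its LocalPref."""
    return replace(alternate, local_pref=OVERRIDE_LOCAL_PREF)


pni = Route("1.2.3.0/24", 500, (100,), "42.1.3.1", "100:1")
transit = Route("1.2.3.0/24", 200, (7018, 100), "201.2.4.12", "7018:1")

rib = [pni, transit]
before = best_path(rib)

rib.append(make_override(transit))  # the controller injects the override
after = best_path(rib)

# Withdrawing the injected route restores the ordinary BGP decision.
withdrawn = [r for r in rib if r.local_pref != OVERRIDE_LOCAL_PREF]
print(before.next_hop, after.next_hop, best_path(withdrawn).next_hop)
```

Because the override is an ordinary BGP route, the router needs no special forwarding support, and removing the override falls back to normal BGP behavior.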

Split prefix traffic

Routes for 1:2400::/24 at the EF controller:

PNI route: Dest 1:2400::/24, LocalPref 500, ASPath 100, Nexthop 42.1.3.1, Community 100:1

Transit route: Dest 1:2400::/24, LocalPref 200, ASPath 7018,100, Nexthop 201.2.4.12, Community 7018:1

Injected route for a more specific prefix:

Injected route: Dest 1:2400::/34, LocalPref 50000, ASPath 7018,100, Nexthop 201.2.4.12, Community 7018:1

[Figure: EF controller with peering and transit links, labeled with 1:2400::/24 and 1:2400::/34]
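
Splitting works because routers forward on the longest matching prefix, so a more specific injected route captures only part of the covering prefix's traffic. The sketch below uses Python's standard `ipaddress` module with the prefixes and next hops from the slide; the `fib` table and `lookup` helper are hypothetical and stand in for the router's forwarding table.

```python
import ipaddress
from typing import Dict, Optional

# Toy forwarding table: prefix -> next hop (values taken from the slide).
fib: Dict[ipaddress.IPv6Network, str] = {
    ipaddress.ip_network("1:2400::/24"): "42.1.3.1",    # best path for the /24 (PNI)
    ipaddress.ip_network("1:2400::/34"): "201.2.4.12",  # injected more specific route (transit)
}


def lookup(table: Dict[ipaddress.IPv6Network, str], address: str) -> Optional[str]:
    """Return the next hop of the longest prefix that contains the address."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]


print(lookup(fib, "1:2400::1"))       # inside the /34
print(lookup(fib, "1:2400:8000::1"))  # inside the /24 only
```

Destinations inside the /34 follow the injected route, while the rest of the /24 stays on the original path.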

SYSTEM ARCHITECTURE

Controller inputs: Interface Info (SNMP), Traffic Rates (Netflow/Sflow), BGP Routes (BMP), Policy & Config, Topology Info (FBNet)

Controller -> Route Overrides -> BGP Injector -> Peering Routers (prefix via v.x.y.z)

w/ Audits to make it more robust: BMP Audit, Netflow Audit, Injector Audit, Route Audit
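
The slides do not spell out how the controller turns its inputs into route overrides. As one possible illustration, here is a greedy Python sketch that moves the largest prefixes off any interface whose projected load exceeds a threshold. The function name, the 95% threshold, the input shapes, and the example prefixes are all assumptions, not details from the talk.

```python
from typing import Dict, List, Tuple

UTILIZATION_LIMIT = 0.95  # hypothetical threshold, not from the slides


def compute_overrides(
    capacity: Dict[str, float],            # interface -> capacity
    demand: Dict[str, Dict[str, float]],   # interface -> {prefix: rate on its BGP-preferred interface}
    alternates: Dict[str, List[str]],      # prefix -> alternate interfaces, in preference order
) -> List[Tuple[str, str]]:
    """Greedy sketch: return (prefix, alternate interface) overrides."""
    load = {iface: sum(demand.get(iface, {}).values()) for iface in capacity}
    overrides: List[Tuple[str, str]] = []
    for iface in capacity:
        limit = UTILIZATION_LIMIT * capacity[iface]
        # Consider the largest prefixes first so fewer overrides are needed.
        for prefix, rate in sorted(demand.get(iface, {}).items(), key=lambda kv: -kv[1]):
            if load[iface] <= limit:
                break
            for alt in alternates.get(prefix, []):
                # Only detour if the alternate stays under its own limit.
                if load[alt] + rate <= UTILIZATION_LIMIT * capacity[alt]:
                    load[iface] -= rate
                    load[alt] += rate
                    overrides.append((prefix, alt))
                    break
    return overrides


capacity = {"pni": 100.0, "transit": 400.0}
demand = {
    "pni": {"192.0.2.0/24": 80.0, "198.51.100.0/24": 60.0},
    "transit": {"203.0.113.0/24": 100.0},
}
alternates = {"192.0.2.0/24": ["transit"], "198.51.100.0/24": ["transit"]}
overrides = compute_overrides(capacity, demand, alternates)
print(overrides)
```

In this example the PNI is projected at 140% of capacity, and moving one prefix to transit brings it back under the limit. Each resulting pair would then be handed to the BGP injector as a route override.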

[Plot over 1 day: total egress capacity at PoP, total traffic at PoP, capacity for iface@PoP, demand for iface@PoP, and traffic on iface@PoP w/ Edge Fabric]

Avoid packet drops while maintaining high link utilization


Looking beyond Facebook's network

[Figure: BGP (STATIC) attributes (Local Preference, MED, AS Path length, Communities) vs. REALITY (DYNAMIC) (traffic demand changes, limited capacity, performance variations, transient failures). Labels: Best BGP Path, POP, Facebook’s Network, ?]

Performance Routing: Alternative Path Measurements

[Figure sequence: edge cluster racks connected to Network 1 over PNI, PX, and transit links, with an APM controller]

Collect TCP stats for transactions (RTT, packet loss, throughput)

This allows us to monitor performance only on the primary path

Send a very small portion of traffic over alternate paths

Mark random flows with special DSCP values

Configure alternate routing tables per DSCP value

The APM controller inserts routes into the alternate routing tables
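
The measurement steps above can be sketched as follows: a small random share of flows receives a special DSCP value, and each DSCP value selects its own routing table. The DSCP values, the 1% sampling fraction, the table contents, and the helper names are hypothetical; the slides say only that the portion of traffic is very small.

```python
import random
from typing import Dict

DSCP_PRIMARY = 0            # unmarked traffic follows the primary path
ALTERNATE_DSCPS = [12, 13]  # hypothetical DSCP values, one per alternate path
SAMPLE_FRACTION = 0.01      # hypothetical share of flows to mark


def pick_dscp(rng: random.Random) -> int:
    """Mark a small random share of flows with an alternate-path DSCP value."""
    if rng.random() < SAMPLE_FRACTION:
        return rng.choice(ALTERNATE_DSCPS)
    return DSCP_PRIMARY


# One routing table per DSCP value; a controller would insert the alternate routes.
routing_tables: Dict[int, Dict[str, str]] = {
    DSCP_PRIMARY: {"1.2.3.0/24": "PNI"},
    12: {"1.2.3.0/24": "Transit"},
    13: {"1.2.3.0/24": "PX"},
}


def egress_for(dscp: int, prefix: str) -> str:
    """Use the table for this DSCP value, falling back to the primary table."""
    primary = routing_tables[DSCP_PRIMARY]
    table = routing_tables.get(dscp, primary)
    return table.get(prefix, primary[prefix])


rng = random.Random(7)
marks = [pick_dscp(rng) for _ in range(100_000)]
marked = sum(1 for m in marks if m != DSCP_PRIMARY)
print(marked, egress_for(12, "1.2.3.0/24"))
```

The TCP stats collected for marked flows can then be compared per path, since the DSCP value identifies which path a flow took.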

Interesting Examples

Temporary congestion of the primary path

[Plot over 1 day: throughput of the primary path, alternate path 1, and alternate path 2]

Public Exchange Performance problem

[Figure: AS 32934 connected over a PX to AS 100, AS 200, AS 300, and AS 400]

Peer’s capacity is unknown

Path Performance Monitoring Service

Computes the peer’s effective capacity on the PX

Components: HTTP TCP Stats, BGP Routes, Traffic Rates, Stats Aggregator, Capacity limit computation

Infer how much traffic to send without overwhelming the peer
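
The slides do not say how the capacity limit is computed. As a rough illustration of the idea only, the Python sketch below takes aggregated samples of traffic rate and observed loss toward one peer and returns the highest rate seen before loss crosses a threshold. The threshold, the sample format, and the function name are assumptions.

```python
from typing import List, Optional, Tuple

LOSS_THRESHOLD = 0.01  # hypothetical loss fraction treated as "peer is overwhelmed"


def estimate_capacity_limit(samples: List[Tuple[float, float]]) -> Optional[float]:
    """samples: (traffic rate toward the peer, observed loss fraction).

    Walk the samples from the lowest rate upward and return the highest rate
    reached before loss first exceeds the threshold, or None if there is none.
    """
    limit: Optional[float] = None
    for rate, loss in sorted(samples):
        if loss > LOSS_THRESHOLD:
            break
        limit = rate
    return limit


# Illustrative samples: rate in Gbps, loss as a fraction.
samples = [(2.0, 0.001), (4.0, 0.002), (6.0, 0.004), (8.0, 0.03), (7.0, 0.005)]
limit = estimate_capacity_limit(samples)
print(limit)
```

A limit inferred this way could be given to the controller as an effective capacity for the peer, in place of the unknown physical capacity.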

ENHANCE EDGE FABRIC W/ PERFORMANCE

Controller inputs: Interface Info (SNMP), Traffic Rates (Netflow/Sflow), BGP Routes (BMP), Policy & Config, Topology Info (FBNet), Performance Limits

Controller -> Route Overrides -> BGP Injector -> Peering Routers (prefix via v.x.y.z)

Audits: BMP Audit, Netflow Audit, Injector Audit, Route Audit

Thanks
