Ethernet Routing for Large Scale Distributed Data Center Fabrics
Dave Allan, János Farkas, Panagiotis Saltsidis, Jeff Tantsura
Ericsson
Ethernet Routing for Large Scale Distributed Data Center Fabrics | 2013-11-13 | Page 2
Introduction
› This is a concept and architecture for a distributed Cloud
› One purpose is to illustrate the capabilities and the scalability of the “state of the art” Ethernet
› The components of the proposed architecture are progressing in standards, either complete or in progress
› The architecture is built on
– IEEE Shortest Path Bridging – MAC mode (SPBM)
  › As standardized in IEEE 802.1aq-2012
– IETF Ethernet Virtual Private Network (EVPN) as extended for SPBM interworking
  › This is being standardized in draft-ietf-l2vpn-spbm-evp
A Bit of History
› Key antecedents to SPB
1. Provider Backbone Bridges (PBB) [802.1ah]
– Full MAC-in-MAC encapsulation
– 24-bit I-SID, which is an L2 Virtual Network ID
2. PBB Traffic Engineering (PBB-TE) [802.1Qay]
– Enabled external control of bridge forwarding with complete route freedom, i.e. Software Defined Networking (SDN) with geographical separation
[Figure: evolution of the Ethernet frame format.
802.1D-1990: Dst Addr, Src Addr, Ethertype, Payload.
802.1Q-1998: DA, SA, Ethertype + VID, Ethertype, Payload.
802.1ad-2005 Provider Bridges (PB): C-DA, C-SA, Ethertype + S-VID (S-tag), Ethertype + C-VID (C-tag), Ethertype, Payload.
802.1ah-2008 Provider Backbone Bridges (PBB): B-DA, B-SA (B-MAC), Ethertype + B-VID (B-tag), Ethertype + I-SID (I-tag), C-DA, C-SA, Ethertype + S-VID (S-tag), Ethertype + C-VID (C-tag), Ethertype, Payload; some tags are marked optional.]
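The MAC-in-MAC layering can be sketched in a few lines of Python. This is a minimal illustration, not a conformant 802.1ah implementation: the function names are hypothetical, priority and flag bits are left at zero, and the EtherType values (0x88A8 for the B-tag, 0x88E7 for the I-tag) are quoted from memory of the standard and should be checked against it.

```python
import struct

B_TAG_ETHERTYPE = 0x88A8  # assumed: B-tag reuses the S-tag EtherType
I_TAG_ETHERTYPE = 0x88E7  # assumed: I-tag EtherType from 802.1ah


def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac}")
    return raw


def pbb_encapsulate(b_da: str, b_sa: str, b_vid: int, i_sid: int,
                    customer_frame: bytes) -> bytes:
    """Prepend B-DA, B-SA, B-tag and I-tag to an unmodified customer frame."""
    if not 0 <= b_vid < 2 ** 12:
        raise ValueError("B-VID is a 12-bit value")
    if not 0 <= i_sid < 2 ** 24:
        raise ValueError("I-SID is a 24-bit value")
    # B-tag: EtherType + 16-bit TCI (priority bits zero, low 12 bits = B-VID)
    b_tag = struct.pack("!HH", B_TAG_ETHERTYPE, b_vid)
    # I-tag: EtherType + 32-bit field (flag bits zero, low 24 bits = I-SID)
    i_tag = struct.pack("!HI", I_TAG_ETHERTYPE, i_sid)
    return mac_bytes(b_da) + mac_bytes(b_sa) + b_tag + i_tag + customer_frame
```

The customer frame (C-DA, C-SA, optional S-tag/C-tag, payload) is carried untouched, which is why the backbone only needs to know B-MACs.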
What is Shortest Path Bridging (802.1aq SPB)?
› SPB is a routed Ethernet solution that has been specified by the IEEE: link state for bridges
– IS-IS aspects documented in IETF RFC 6329
› All control functionality has been collapsed into a single protocol (IS-IS)
– Unicast and multicast tree construction, VLAN registration etc.
› Two SPB modes are defined:
› SPBV: SPB VID
– VID based
– Applicable to all types of VLANs
– Flooding and learning
– Plug & play
› SPBM: SPB MAC
– MAC based
– Designed to leverage the scalability provided by PBB MAC-in-MAC
– No flooding and learning
– Managed environments
What is important to understand about SPBM?
› It is compute based: computation instead of signaling
› It uses multiple shortest path trees instead of shared spanning trees
– Unicast and multicast frames follow the same path between any two points in a given VLAN
  › So no frame misordering & you get meaningful OAM support
› It uses loop mitigation AND loop prevention
› It uses edge based load spreading
› It is backwards compatible and consistent with the full body of Ethernet standardization (IEEE 802.1)
– CFM, EVB, lossless Ethernet etc.
› It implements the full MEF 12.1 set of service constructs
– E-LINE, E-LAN, E-TREE
Problems Already Solved
› Ability to utilize more richly connected topologies
– SPBM supports up to 16 way multi-pathing and is extensible to go further
– Each multipath instance is a full mesh of the network
› Large scale virtualization
– PBB data plane scales to billions of virtual networks (24-bit I-SID over 12-bit B-VID: 2^24 * 2^12)
› Operational simplicity
– All information contained in a single control protocol, IS-IS
– Single touch adds/moves and changes
– Computed multicast
– Reduced CP messaging combined with a computation driven convergence of unicast & multicast is a virtuous circle…
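The virtualization figure is plain arithmetic on the two field widths. A small sketch, with a hypothetical function name, that treats every bit pattern as usable (an upper bound, since some values are reserved in practice):

```python
def virtual_network_space(i_sid_bits: int = 24, b_vid_bits: int = 12) -> dict:
    """Upper bound on the identifier space given the I-SID and B-VID widths."""
    i_sids = 2 ** i_sid_bits   # service instances per B-VLAN
    b_vids = 2 ** b_vid_bits   # backbone VLANs
    return {"i_sids": i_sids, "b_vids": b_vids, "combined": i_sids * b_vids}


space = virtual_network_space()
print(f"{space['i_sids']:,} I-SIDs x {space['b_vids']:,} B-VIDs = {space['combined']:,}")
```

With the default widths the combined space is 2^36, about 68.7 billion.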
Solution Objectives
› Ubiquity and reach
– Interconnect different flavors of “Ethernet” across the dominant WAN technology (MPLS)
› Preserve operational simplicity
– Preserve “single touch” add/move/delete automation
– Minimal configuration
› Alignment of BGP and IS-IS control plane paradigms
› Break the scaling barriers of a single routing domain
– Combined SPBM-EVPN allows much larger topologies
– Domain isolation to “divide and conquer” state
– Operate each SPBM domain on a “need to know” basis
– Non-relevant information is excluded from routing advertisements
› Minimize Filtering Database (FDB) state
Solution Overview
› There are a number of aspects of the solution
1. Topology hiding and abstraction
2. “Need to know” filtering
3. Independence of local multi-pathing
4. Multicast summarization
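[Figure: DCN1 and DCN2 are SPBM networks, each with a BEB and a PE, joined by EVPN over MPLS. Tenant Virtual Network I-SID1 is carried on B-VID1 (B-VLAN1) in DCN1, on an LSP across the MPLS core, and on B-VID2 (B-VLAN2) in DCN2. The tenant’s overlay is e.g. an IP subnet or VLAN.]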
SPBM and EVPN
› Shortest Path Trees (SPT) are the basic connectivity construct for SPBM
– They are edge-rooted shortest path trees, much finer grained than shared spanning trees, but they are still TREEs
› Which constrains the set of network interconnect mechanisms
– The set of fine grained MAC based trees is aggregated into Backbone VLANs (B-VLAN), where each B-VLAN delineates full mesh connectivity
› EVPN is IP/MPLS based, and uses BGP to sort out mirroring of attached Ethernet networks
› But once in EVPN we can map SPBM connectivity to any paradigm
› The trick is interconnecting them
Mapping between SPBM & EVPN
› Trees have ROOTs…
– Which means interworking needs to pin waypoints, which can then permit the required design strategies to work
› For SPBM-EVPN interworking, we make the interworking function on the EVPN-PE into a “pinned waypoint”
– This has the desirable effect of keeping “churn” in subtending SPBM networks out of BGP
› An EVPN-PE that is a “pinned waypoint” for a set of VLANs is known as a “designated forwarder”
Designated Forwarder
› The set of EVPN-PEs attached to an SPBM network self-elect which subset of VLANs each will act as Designated Forwarder (DF) for
– This is based on local B-VID
› The DF is then responsible for relaying all required state associated with the subset of VLANs it owns between the two control planes, and for interworking data plane traffic between the SPBM and EVPN networks
– This is simply in the form of a list of I-SID/B-MAC tuples
– No topology information is leaked; the DF condenses all topology behind it down to a single node representation in the peer network
– The DF also “re-roots” all (S,G) multicast trees that transit it by “blindly” rewriting “S” (Source)
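One way to picture the self-election is a deterministic rule that every PE can evaluate independently from the same inputs. The slides only say the election is based on local B-VID; the modulo rule and the function name below are assumptions made for illustration, not the procedure from the draft.

```python
def elect_designated_forwarders(pe_ids, b_vids):
    """Map each local B-VID to the PE that acts as DF for it.

    Assumed rule (not taken from the slides): order the PE identifiers,
    then B-VID v is owned by the PE at position v mod N.
    """
    ordered = sorted(set(pe_ids))
    if not ordered:
        raise ValueError("at least one PE must be attached")
    return {vid: ordered[vid % len(ordered)] for vid in sorted(b_vids)}


# Every PE that sees the same PE list computes the same answer,
# so no extra signaling is needed to agree on ownership.
print(elect_designated_forwarders(["PE2", "PE1"], [10, 11]))
```

Because the result depends only on the sorted PE list and the B-VID, the order in which a PE learns about its peers does not matter.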
DF Control Plane Interworking
› DF has a Control Plane Interworking function
1. It proxies B-MAC/I-SID announcements from ISIS-SPB into BGP for the set of I-SIDs it is DF for
2. It will only proxy B-MAC/I-SID announcements from EVPN into ISIS-SPB if there is already locally registered interest in the I-SID
BGP has the whole picture, IS-IS is “need to know”
[Figure: the PE sits between the DC (PBBN) and the WAN (MPLS). A Control Plane Interworking Function links the IS-IS database, fed by IS-IS PDUs, and the BGP database, fed by BGP PDUs.]
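The two proxy rules amount to a pair of filters. The sketch below models them with hypothetical class and field names; real IS-IS and BGP encodings are not represented, and an announcement is reduced to a B-MAC/I-SID pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Announcement:
    b_mac: str
    i_sid: int


@dataclass
class ControlPlaneInterworking:
    df_i_sids: set        # I-SIDs this PE is DF for
    local_interest: set   # I-SIDs already registered in the local IS-IS domain

    def isis_to_bgp(self, announcements):
        """Rule 1: proxy into BGP only for I-SIDs this PE is DF for."""
        return [a for a in announcements if a.i_sid in self.df_i_sids]

    def bgp_to_isis(self, announcements):
        """Rule 2: proxy into IS-IS only where local interest already exists."""
        return [a for a in announcements if a.i_sid in self.local_interest]


iwf = ControlPlaneInterworking(df_i_sids={100, 200}, local_interest={100})
to_bgp = iwf.isis_to_bgp([Announcement("beb1", 100), Announcement("beb1", 300)])
to_isis = iwf.bgp_to_isis([Announcement("remote1", 100), Announcement("remote2", 200)])
```

The asymmetry is the point: BGP receives everything the DF owns, while IS-IS only hears about I-SIDs somebody local has asked for.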
EVPN-SPBM data plane
[Figure: VM1, BEB1, DCN1 (SPBM, B-VID1), DF1, EVPN over MPLS (LSP), DF2, DCN2 (SPBM, B-VID2), BEB2, VM2; I-SID1 is carried end to end]
Frame headers, VM1 to VM2:
– In DCN1: B-DA: DF1, B-SA: BEB1, B-VID1, I-SID1, C-DA: VM2, C-SA: VM1, Payload
– In the MPLS core: MPLS, B-DA: DF2, B-SA: DF1, I-SID1, C-DA: VM2, C-SA: VM1, Payload
– In DCN2: B-DA: BEB2, B-SA: DF2, B-VID2, I-SID1, C-DA: VM2, C-SA: VM1, Payload
Frame headers, VM2 to VM1:
– In DCN2: B-DA: DF2, B-SA: BEB2, B-VID2, I-SID1, C-DA: VM1, C-SA: VM2, Payload
– In the MPLS core: MPLS, B-DA: DF1, B-SA: DF2, I-SID1, C-DA: VM1, C-SA: VM2, Payload
– In DCN1: B-DA: BEB1, B-SA: DF1, B-VID1, I-SID1, C-DA: VM1, C-SA: VM2, Payload
DF Data Plane Procedures
› Islands are decoupled by keeping B-Tags out of the EVPN core
– What the core sees is MPLS encapsulated B-MACs and I-SIDs
  › B-Tags stripped by PEs on ingress to EVPN
  › B-Tags locally added by PEs on egress from EVPN
– So the core is independent of however multi-pathing is implemented in each subtending island, or whether a PBBN exists at all (e.g. PBB-PEs)
› Multicast MACs are aggregated at SPBM ingress
[Figure: unicast interworking at the DF, between the PBBN and MPLS.
PBBN to MPLS: Ethernet frames, strip tags, B-MAC lookup, add label stack, MPLS packets.
MPLS to PBBN: MPLS packets, strip label stack, B-MAC lookup, add tags, Ethernet frames.]
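The unicast steps (strip tags, B-MAC lookup, add label stack, and the reverse) can be sketched as two small functions. All names and both lookup tables are hypothetical; in particular, keying the egress B-VID on the destination B-MAC is an assumption, and any B-MAC rewriting at the DF is not modelled here.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class PbbFrame:
    b_da: str
    b_sa: str
    b_vid: Optional[int]  # None once the B-tag has been stripped
    i_sid: int
    payload: bytes


@dataclass(frozen=True)
class MplsPacket:
    labels: Tuple[int, ...]
    frame: PbbFrame       # carried without a B-tag


def pbbn_to_mpls(frame: PbbFrame, label_table: dict) -> MplsPacket:
    """Ingress to EVPN: strip the B-tag, look up the B-DA, add a label stack."""
    untagged = replace(frame, b_vid=None)
    labels = tuple(label_table[frame.b_da])  # KeyError means no route is known
    return MplsPacket(labels=labels, frame=untagged)


def mpls_to_pbbn(packet: MplsPacket, b_vid_table: dict) -> PbbFrame:
    """Egress from EVPN: drop the label stack, look up the B-DA, add a local B-tag."""
    frame = packet.frame
    return replace(frame, b_vid=b_vid_table[frame.b_da])


sent = PbbFrame("BEB2", "BEB1", 1, 1, b"data")
in_core = pbbn_to_mpls(sent, {"BEB2": (1001, 2002)})
received = mpls_to_pbbn(in_core, {"BEB2": 2})
```

Note that the B-VID on the far side (2) need not match the one on the near side (1): the core never sees it, which is what keeps each island's multi-pathing independent.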
Add Multicast in the MPLS Core
› Objective is to get away from the inefficiencies of edge based replication in the PEs while minimizing the multicast state impact in the core
› VLAN emulation can use lots of Multicast Distribution Trees (MDTs)
› These can be aggregated into shared MDTs between larger sites
– Shared MDTs can substantially reduce the amount of multicast state in the MPLS core to service large sites
– Smaller sites are more likely to benefit from service specific MDTs
  › So we will support both
Shared Multicast Distribution Trees
› Issue is how to resolve VLANs to shared trees without getting into resolution servers or provisioning
› One way to do this is to algorithmically “name” the tree
– (*,G) or (S,G) where G is a sorted list of leaf node IDs
› Via BGP every PE has sufficient information to construct the names of the MDTs
› mLDP permits arbitrary opaque identifiers for MDTs to be used as a multicast FEC so the algorithmically constructed names can be used directly in signaling
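The naming idea works because sorting makes the name independent of who computes it. A minimal sketch with a hypothetical function name and a purely illustrative string format; the actual mLDP opaque value encoding is not modelled.

```python
from typing import Iterable, Optional


def mdt_name(leaf_pe_ids: Iterable[str], source: Optional[str] = None) -> str:
    """Build a tree name from the sorted set of leaf node IDs.

    With no source the result names a (*,G) tree, otherwise an (S,G) tree.
    Note the sort is lexical, so real node IDs would need a fixed-width form.
    """
    group = "+".join(sorted(set(leaf_pe_ids)))
    return f"({source or '*'},{group})"


# Each PE lists the members in its own order yet derives the same name.
print(mdt_name(["PE3", "PE2", "PE5"]))
print(mdt_name(["PE5", "PE3", "PE2"]))
print(mdt_name(["PE2", "PE3", "PE5"], source="PE3"))
```

Since every PE learns the membership from BGP, each can join the same tree without a resolution server or per-tree provisioning.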
Example
[Figure: an EVPN + mLDP core (BGP) with PE1 to PE6 and PBB PE7, interconnecting three 802.1aq SPBM networks (IS-IS) and one 802.1ad PBN (RSTP), with CEs attached; three PEs are marked DF]
PE2, PE3 and PE5 are DFs for a common set of VLANs
Example
[Figure: an EVPN + mLDP core (BGP) with PE1 to PE6 and PBB PE7, interconnecting three 802.1aq SPBM networks (IS-IS) and one 802.1ad PBN (RSTP), with CEs attached; PE2 and PE3 are labeled DF alongside a third DF, with mLDP in the core]
Example
[Figure: an EVPN + mLDP core (BGP) with PE1 to PE6 and PBB PE7, interconnecting three 802.1aq SPBM networks (IS-IS) and one 802.1ad PBN (RSTP), with CEs attached; PE2 and PE3 are labeled DF alongside a third DF, with mLDP in the core]
I am PE 3, and I have 10 VLANs that need (*,G) multicast to myself and PEs 2 and 5, so the FEC is PE2+PE3+PE5
Example
[Figure: an EVPN + mLDP core (BGP) with PE1 to PE6 and PBB PE7, interconnecting three 802.1aq SPBM networks (IS-IS) and one 802.1ad PBN (RSTP), with CEs attached; PE2 and PE3 are labeled DF alongside a third DF, with mLDP in the core]
I am PE 2, and I have 10 VLANs that need (*,G) multicast to myself and PEs 3 and 5, so the FEC is PE2+PE3+PE5
Example
[Figure: an EVPN + mLDP core (BGP) with PE1 to PE6 and PBB PE7, interconnecting three 802.1aq SPBM networks (IS-IS) and one 802.1ad PBN (RSTP), with CEs attached; PE2, PE3 and PE5 are labeled DF]
I am PE 5, and I have 10 VLANs that need (*,G) multicast to myself and PEs 2 and 3, so the FEC is PE2+PE3+PE5
Example
[Figure: an EVPN + mLDP core (BGP) with PE1 to PE6 and PBB PE7, interconnecting three 802.1aq SPBM networks (IS-IS) and one 802.1ad PBN (RSTP), with CEs attached; PE2, PE3 and PE5 are labeled DF]
Resulting MDT
What does this get me?
› mLDP, like PIM, is rather chatty, and based on transactional convergence
› If I had 10000 VLANs spread across the 3 sites in the example, I WOULD have 10000 (*,G) or 30000 (S,G) trees
› For 3 dual homed sites, there are ONLY 8 possible (*,G) and 24 possible (S,G) shared trees
– It becomes practical to simply “nail them up” and modify the membership set of each tree at the ingress
› Result is both scalable and stable
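The 8 and 24 can be checked by enumeration, under the assumption that each shared tree has exactly one PE per site as a leaf (the DF for that site) and that an (S,G) tree exists for each leaf acting as source. The function name and the PE labels are hypothetical.

```python
from itertools import product


def shared_tree_counts(sites):
    """Count possible shared trees when each tree has one PE per site.

    sites: one tuple of candidate PEs per site (two each if dual homed).
    Returns (number of (*,G) trees, number of (S,G) trees).
    """
    leaf_sets = [frozenset(combo) for combo in product(*sites)]
    star_g = len(leaf_sets)                        # one (*,G) per leaf set
    s_g = sum(len(leaves) for leaves in leaf_sets)  # one (S,G) per leaf per set
    return star_g, s_g


dual_homed = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
print(shared_tree_counts(dual_homed))
```

The count depends only on the number of sites and their homing, not on the number of VLANs, which is why the trees can be set up once and left in place.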
Key Insights & Next steps
› Assumption of rich mesh hidden from SPBM in the first place
– Exposing a large, highly regular Clos topology in link state simply burdens the control plane
› Some topological summarization is required in the first place to usefully scale individual sites to 100,000+ servers with existing technology
› There is a lot that can be done to engineer an SPBM network, both with the vanilla standard and with techniques currently under research
– Deterministic aggregated trees lend themselves to “demand engineering” with automation
› Work needs to be done to seamlessly extend this into the EVPN realm
Summary
› The totality, completeness and self-consistency of IEEE data center networking solutions is impressive
– From OAM to Edge Virtual Bridging
› SPB permits this to scale to orders of magnitude beyond what Ethernet was previously capable of
› Adding EVPN as a form of “multi-area” solution adds orders of magnitude beyond what SPB alone can do…