Juniper Networks

Virtual Chassis: High Availability

November 2012

Juniper Virtual Chassis High Availability Assessment

Executive Summary

Juniper Networks commissioned Network Test to evaluate its Virtual Chassis technology in Juniper EX8200 modular and Juniper EX4200/EX4500/EX4550 fixed-configuration switches. This second installment of a two-part project focuses on the reliability and resiliency of Virtual Chassis technology; Part I focused on Virtual Chassis performance and scalability.

For most enterprise network managers, maintaining maximum uptime is an even more important consideration than raw performance. Application availability is not only expected but demanded of IT infrastructure: enterprises expect access to their data and applications 24/7/365. To ensure round-the-clock access, network infrastructure must be both robust and highly available. The tests described in this document validate that Juniper’s Virtual Chassis technology addresses these requirements. In many cases failovers are hitless, with no disruption during planned or unplanned events.

Among the highlights of high-availability testing:

• In all 46 test cases described here, the Virtual Chassis system recovered from component and/or link failures in less than 1 second, with hitless failover in many cases.

• Virtual Chassis technology offered total protection against a “split-brain” problem where multiple routing engines each try to act as a Virtual Chassis master. Even when test engineers simultaneously disabled multiple components, the Virtual Chassis system correctly migrated all control-plane state between master and backup routing engines.

• Juniper’s Nonstop Software Upgrade (NSSU) feature performed a complete upgrade of all EX8200 Virtual Chassis components, including four switches and two external routing engines, with less than 1 second of downtime. In the Layer-2 test case, user data was “off the air” for less than 1/8 of a second with NSSU.

• Virtual Chassis technology recovered from component and/or link failure far faster than routing protocols. Enterprise routing protocols such as OSPF and Protocol Independent Multicast-Sparse Mode (PIM-SM) take tens of seconds, or longer, to recover from network topology changes. Since recovery times for Virtual Chassis configurations are less than 1 second, transitions are invisible to the routed network.

• Virtual Chassis configurations recovered from component and/or link failures far faster than spanning tree, the dominant switching protocol. Spanning tree is widely used for loop prevention, but even rapid spanning tree typically takes at least 1-3 seconds to converge after a failure. Virtual Chassis technology eliminates the need for spanning tree, and in these tests always recovered from failures in less than 1 second.

Introducing EX8200 Virtual Chassis Technology

Virtual Chassis technology allows up to four EX8200 switches, or any combination of up to 10 EX4200/EX4500/EX4550 switches, to be interconnected to form one logical entity. This unified approach has many advantages:

• Virtual Chassis technology doubles available bandwidth by using active/active redundancy instead of the active/passive model used by the spanning tree protocol.

• Virtual Chassis technology enhances scalability by adding capacity as needed. A Virtual Chassis configuration requires just two EX8200 or EX4200/EX4500/EX4550 chassis to get started; network architects can then add chassis as the network grows. There’s no disruption to existing Virtual Chassis components, and the newly expanded Virtual Chassis system continues to appear as one entity to the rest of the network.

• Virtual Chassis technology simplifies network management by using just one configuration file for all EX8200 or EX4200/EX4500/EX4550 chassis. This reduces the number of network elements seen by external monitoring and management tools, easing the management workload.

• Virtual Chassis technology allows “rightsizing” by combining switches with different port densities. In the tests described here, engineers combined smaller EX8208 and larger EX8216 switches to form a single logical entity. Similarly, engineers connected different combinations of EX4200/EX4500/EX4550 switches, each using different port densities and speeds, to create one logical device.

Test Methodology

Figure 1 shows the Layer-3 test bed used for this project. A key design goal of this project was to represent a typical data center or campus switching architecture, with core and access switches along with a WAN edge router.

In the core is one EX8200 Virtual Chassis instance comprising two Juniper EX8216 and two Juniper EX8208 switches, along with redundant EX8200-XRE200 external routing engines to handle control-plane tasks. The four Juniper EX8200 switches are also called line card chassis (LCCs).

At the access layer, there are two Virtual Chassis instances. One combined a Juniper EX4200 and the new Juniper EX4550 switch, while the other combined Juniper EX4200 and Juniper EX4500 switches. Also at the access layer is a standalone EX8208 deployed as an end-of-row or middle-of-row switch. The WAN edge router is represented by a single Juniper MX80.

As is often the case in modern data centers, most test traffic flowed in an “east-west” direction, between the various access nodes shown at the bottom of Figure 1. This traffic was evenly divided between IPv4 and IPv6 unicast flows.

In addition, a small percentage of “north-south” traffic flowed between the Juniper MX80 WAN edge router and the access nodes shown at the bottom of Figure 1. This north-south traffic also consisted of a mix of IPv4 and IPv6 unicast traffic, with IPv4 multicast added. In this Layer-3 scenario, all devices ran OSPF for unicast routing and Protocol Independent Multicast-Sparse Mode (PIM-SM) for multicast routing.

Figure 1: The Juniper Virtual Chassis Layer-3 high-availability test bed

The Spirent TestCenter traffic generator/analyzer served as the primary test instrument in this project. For the multicast traffic, the Spirent instrument emulated 48 IPv4 hosts sending to 50 multicast groups, for a total of 2,400 multicast routes. For the unicast traffic, the Spirent instrument emulated one IPv4 and one IPv6 host per port. The Spirent instrument connected to switch ports at the network edge via 12 gigabit Ethernet ports and eight 10-Gbit/s Ethernet ports, and to the Juniper MX80 router via four 10-Gbit/s Ethernet ports.
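The multicast route count quoted above can be reproduced with a few lines of Python. This is a minimal sketch that assumes each route corresponds to one (source, group) pair, which appears to be how the 2,400 total is derived. The base addresses are hypothetical placeholders, not the addresses used on the test bed.

```python
from ipaddress import IPv4Address
from itertools import product

SOURCE_COUNT = 48   # emulated IPv4 multicast sources (from the report)
GROUP_COUNT = 50    # multicast groups (from the report)

# Hypothetical base addresses, chosen only for illustration.
first_source = IPv4Address("10.0.0.1")
first_group = IPv4Address("239.1.1.1")

sources = [first_source + i for i in range(SOURCE_COUNT)]
groups = [first_group + i for i in range(GROUP_COUNT)]

# Assumption: one multicast route per (source, group) pair.
sg_routes = list(product(sources, groups))

print(f"{len(sources)} sources x {len(groups)} groups = {len(sg_routes)} routes")
```

Enumerating the pairs rather than simply multiplying makes it easy to feed the same list into a traffic-generator script or a route-table audit.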

To showcase Virtual Chassis support for IEEE 802.3ad link aggregation, engineers used four-member link aggregation groups to connect 10-Gbit/s Ethernet switch ports.

Engineers ran all tests twice, once in Layer-2 mode and once in Layer-3 mode. In the Layer-2 configuration, engineers configured all EX8200 Virtual Chassis ports facing the access layer, along with all switch ports in the access layer, to use a single VLAN and broadcast domain. In the Layer-3 tests, engineers used the routed VLAN interface (RVI) feature in Junos software to place all host-facing ports on the test bed in different IP subnets. Figure 2 shows the Layer-2 configuration of the test bed.

Figure 2: The Juniper Virtual Chassis Layer-2 high-availability test bed

A primary goal of all tests was to validate Juniper’s claim of subsecond recovery from various types of hardware and software failures. For all tests, engineers determined recovery time using the following formula¹:

Recovery time = frame loss / (total transmitted frames / test duration)

¹ Engineers also normalized overall transmit rates by configuring the test instrument’s 10-Gbit/s interfaces to offer traffic at 1/10 the rate of its gigabit Ethernet interfaces.
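The formula above divides the number of lost frames by the offered frame rate. A minimal Python sketch follows; the function name and the sample counter values are illustrative assumptions, not figures taken from the test instrument.

```python
def recovery_time_seconds(frames_lost, frames_sent, duration_seconds):
    """Recovery time = frame loss / (total transmitted frames / test duration)."""
    if frames_sent <= 0 or duration_seconds <= 0:
        raise ValueError("frames_sent and duration_seconds must be positive")
    offered_rate = frames_sent / duration_seconds  # frames per second
    return frames_lost / offered_rate


# Hypothetical counters: 6,000,000 frames offered over 60 s (100,000 frames/s),
# of which 17,400 were lost during the failover.
example = recovery_time_seconds(
    frames_lost=17_400, frames_sent=6_000_000, duration_seconds=60
)
print(f"recovery time: {example * 1000:.0f} ms")
```

The method relies on traffic being offered at a constant rate, so that every lost frame represents the same slice of time. A test with zero frame loss yields a recovery time of zero, which is what the report calls a hitless failover.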

EX8200 Virtual Chassis High Availability

In 24 different test cases, the EX8200 Virtual Chassis configuration recovered from link and component failures in well under 1 second, with hitless failover in many instances. Even in the worst case (a Layer-3 test involving the loss of a line card), the highest recovery time seen was 174 ms, or less than one-fifth of a second.

As described in detail below, these tests involved failure of virtually every possible component attached to the EX8200 chassis on the test bed. Significantly, several of these tests validated Juniper’s claim that even the simultaneous loss of multiple components will not cause a Virtual Chassis system to go into “split brain” mode, where different routing engines each think they are the master controller. In all such test cases, the Virtual Chassis system correctly transferred “mastership” status when a component failure occurred.

Table 1 summarizes test results from high-availability testing of the Juniper EX8200 Virtual Chassis configuration. The remainder of this section discusses the tests in detail.

                                               Recovery time (seconds)
Test case                                      Layer 2     Layer 3

Control-plane tests
Master XRE failure                             0.020       0.000
Backup XRE failure                             0.000       0.000
Master LCC-RE failure                          0.000       0.000
Backup LCC-RE failure                          0.000       0.000
VCP failure (between master XRE and LCC-RE)    0.000       0.000
VCP failure (between backup XRE and LCC-RE)    0.000       0.000
VCP failure (between XREs)                     0.000       0.000
LCC failure                                    0.034       0.040
Line-card failure                              0.088       0.174

Data-plane tests
Link flapping                                  0.024       0.020
LAG member failure                             0.031       0.032

Control- and data-plane tests
Multiple failures                              0.081       0.102

Table 1: Juniper EX8200 Virtual Chassis high-availability test results

1. XRE failure

To increase resiliency, redundant EX8200-XRE200 external routing engines handle all control-plane tasks in an EX8200 Virtual Chassis configuration. To determine the impact of the loss of one of these critical components, engineers rebooted the master XRE200 while offering traffic from Spirent TestCenter at a constant rate. Next, engineers measured failover time using the formula described in the “Test Methodology” section above. Engineers then rebooted the same unit, after verifying it was now in a backup role, and again measured failover time. In all, these tests ran four times: once apiece for the master and backup XRE200s, in each of the Layer-2 and Layer-3 configurations.

2. LCC-RE failure

The redundant line card chassis-routing engines (LCC-REs) in each EX8200 Virtual Chassis configuration also act in master and backup roles. To determine failover time, engineers rebooted the master LCC-RE in one member of the Virtual Chassis while offering test traffic at a constant rate. There was no frame loss in this test. Engineers then repeated the test by again rebooting the same LCC-RE after verifying it had shifted into a backup role. Again, there was zero frame loss. In fact, the EX8200 Virtual Chassis configuration dropped no frames in any LCC-RE failure test, in either Layer-2 or Layer-3 mode.

3. VCP failure (split-brain protection)

The Virtual Chassis Port (VCP) is a key component in any EX8200 Virtual Chassis configuration, since it carries not only Layer-2 and Layer-3 control-plane traffic but also the Virtual Chassis Control Protocol (VCCP) frames needed for Virtual Chassis technology to work. Given its importance, engineers tested three different types of VCP failures.

All three VCP test cases involved the potential risk of “split-brain” configurations, where the loss of a link could cause multiple XREs and/or LCC-REs to claim master status at the same time. Engineers increased the risk of split-brain configurations by disabling multiple sets of links in all three test cases.

In the first test case, engineers disabled two sets of links between master XRE and master LCC-RE ports while offering test traffic at a constant rate. In the second case, engineers disabled two sets of links between backup XRE and backup LCC-RE ports, again while offering test traffic. Finally, engineers disabled multiple links between master and backup XREs, again with test traffic active.

In all three cases, there was no frame loss and no split-brain configuration as a result of multiple VCP failures.

4. LCC failure

The LCC failure test determined the effect of the loss of an entire EX8200 chassis within a Virtual Chassis system. Here, engineers rebooted one Juniper EX8216 switch within the Virtual Chassis configuration while offering test traffic at a constant rate. This had the effect of taking the chassis and its line cards offline, forcing Virtual Chassis state migration.

In Layer-2 and Layer-3 configurations, the EX8200 Virtual Chassis configuration recovered in less than 50 ms from the loss of a switch member.

5. Line card failure

Engineers rebooted one line card in a Juniper EX8216 switch within the Virtual Chassis while offering test traffic at a constant rate. This forced Virtual Chassis state migration for the flows that previously used this line card. (Engineers first verified that the line card carried test traffic.)

In Layer-2 and Layer-3 configurations, the EX8200 Virtual Chassis configuration recovered in 174 ms or less from the loss of a line card, well below Juniper’s stated 1-second ceiling on recovery time.

6. Link flapping (soft failure)

In this scenario, engineers used the Junos command-line interface (CLI) to disable one member of the link aggregation group connecting the EX8200 Virtual Chassis with one of the other Virtual Chassis instances at the edge of the test bed. As in other cases, engineers configured Spirent TestCenter to offer traffic throughout the test, and derived failover time from frame loss.

In both Layer-2 and Layer-3 scenarios, failover time due to link flapping was less than 25 ms.

7. Link flapping (hard failure)

This link-flapping test was similar to the previous one, except that engineers induced the failure by physically removing a cable from one member of the link aggregation group between the EX8200 Virtual Chassis and one of the Virtual Chassis instances at the edge of the network. Here, too, the Spirent test instrument offered traffic at a constant rate.

In Layer-2 and Layer-3 scenarios, failover time due to loss of a physical link was 32 ms or less.

8. Multiple failures

It is unlikely, though not impossible, that several components will fail at once. To model this scenario, engineers created a multiple-failure test case, offering traffic while simultaneously disabling these components:

• Master XRE
• Master LCC-RE
• LCC (EX8216 chassis)
• Link aggregation group member

These failures required multiple concurrent state transitions.

Despite multiple concurrent component failures, the EX8200 Virtual Chassis recovered in less than 110 ms in both Layer-2 and Layer-3 test cases. Both results are well below Juniper’s stated guideline of 1-second recovery times.

9. Nonstop Software Upgrade (NSSU)

In addition to measuring recovery times in various failover scenarios, engineers also exercised Juniper’s Nonstop Software Upgrade capability on the EX8200 Virtual Chassis system. NSSU allows in-place software upgrades with little or no disruption.

In this test, engineers upgraded the entire Virtual Chassis system (comprising two EX8216 chassis, two EX8208 chassis, and two EX8200-XRE200 external routing engines) from Junos version 12.1R2 to 12.1R3. As in other cases, engineers offered a mix of unicast and multicast traffic while conducting the upgrade, and derived recovery time from frame loss.

In the Layer-2 test case, the system recovered in 117 ms from NSSU. In the Layer-3 test case, which involved OSPF and PIM routing on every port, the system recovered in 857 ms from NSSU. Both figures are less than Juniper’s stated guideline of 1-second recovery times.
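The EX8200 figures reported in this section can be cross-checked against the 1-second guideline with a short script. The sketch below transcribes the Table 1 and NSSU values by hand; the variable names and the shortened row labels are illustrative choices, and “hitless” is taken here to mean a reported recovery time of 0.000 seconds in both modes.

```python
# Recovery times in seconds, transcribed from Table 1 as (Layer 2, Layer 3).
TABLE_1 = {
    "Master XRE failure": (0.020, 0.000),
    "Backup XRE failure": (0.000, 0.000),
    "Master LCC-RE failure": (0.000, 0.000),
    "Backup LCC-RE failure": (0.000, 0.000),
    "VCP failure (master XRE to LCC-RE)": (0.000, 0.000),
    "VCP failure (backup XRE to LCC-RE)": (0.000, 0.000),
    "VCP failure (between XREs)": (0.000, 0.000),
    "LCC failure": (0.034, 0.040),
    "Line-card failure": (0.088, 0.174),
    "Link flapping": (0.024, 0.020),
    "LAG member failure": (0.031, 0.032),
    "Multiple failures": (0.081, 0.102),
}
NSSU = (0.117, 0.857)      # NSSU results, reported in the text rather than in Table 1
GUIDELINE_SECONDS = 1.0    # Juniper's stated recovery-time guideline

# Flatten the table into individual Layer-2 and Layer-3 measurements.
measurements = [t for pair in TABLE_1.values() for t in pair]

# The row whose slower mode is the slowest overall.
worst_case = max(TABLE_1, key=lambda name: max(TABLE_1[name]))

# Rows with no measurable loss in either mode.
hitless_cases = [name for name, pair in TABLE_1.items() if max(pair) == 0.0]

all_within_guideline = all(
    t < GUIDELINE_SECONDS for t in measurements + list(NSSU)
)

print(f"{len(measurements)} measurements; worst case: {worst_case}")
print(f"{len(hitless_cases)} rows hitless in both modes")
print(f"all under guideline: {all_within_guideline}")
```

Keeping the results in a plain dictionary makes it straightforward to rerun the same check if the guideline changes or further rows are added.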

EX4200/EX4500/EX4550 Virtual Chassis High Availability

While most tests focused on the Juniper EX8200 core switching platform, engineers also performed high-availability tests on Virtual Chassis instances at the edge of the network: those using the Juniper EX4200, Juniper EX4500, and the new Juniper EX4550 top-of-rack switches. In all test cases, Juniper EX4200/EX4500/EX4550 Virtual Chassis instances recovered from component and/or link failure in less than 1 second.

Table 2 presents results from these tests. Trials involved the same combination of IPv4, IPv6, unicast, and multicast traffic as in the EX8200 tests, with the majority of traffic in the “east-west” direction between Virtual Chassis instances.

Most traffic used a partially meshed pattern between the two Virtual Chassis instances at the edge of the network; as defined in RFC 2285, a partial mesh is one in which all ports on one side of the network exchange traffic with all ports on the other side, but no traffic stays local. That meant all traffic went through the core Virtual Chassis instance.
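The partially meshed pattern described above can be illustrated by generating the list of flows between two groups of ports. This is a minimal sketch that follows the report’s description rather than the full wording of RFC 2285; the function name and port labels are hypothetical.

```python
from itertools import product


def partial_mesh(side_a, side_b):
    """Return directed (tx, rx) flows in which every port on one side
    exchanges traffic with every port on the other side, and no flow
    stays within a single side."""
    a_to_b = [(a, b) for a, b in product(side_a, side_b)]
    b_to_a = [(b, a) for a, b in product(side_a, side_b)]
    return a_to_b + b_to_a


# Hypothetical port labels for the two edge Virtual Chassis instances.
vc1_ports = ["vc1-p1", "vc1-p2", "vc1-p3"]
vc2_ports = ["vc2-p1", "vc2-p2"]

flows = partial_mesh(vc1_ports, vc2_ports)
for tx, rx in flows:
    print(f"{tx} -> {rx}")
```

Because no flow has both endpoints on the same side, every frame must cross whatever sits between the two sides, which in this test bed is the core Virtual Chassis instance.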

                                               Recovery time (seconds)
Test case                                      Layer 2     Layer 3

Control-plane tests
EX4200/EX4500 master failure                   0.000       0.284
EX4200/EX4550 master failure                   0.281       0.294
EX4200/EX4500 backup failure                   0.000       0.258
EX4200/EX4550 backup failure                   0.291       0.303
EX4200/EX4500 remove VCP                       0.000       0.000
EX4200/EX4550 remove VCP                       0.000       0.000

Data-plane tests
EX4200/EX4500 LAG member failure               0.073       0.056
EX4200/EX4550 LAG member failure               0.059       0.064

Table 2: Juniper EX4200/EX4500/EX4550 high-availability test results

As in the EX8200 tests, Juniper’s stated guideline at the network edge is that recovery from component or link failure will take less than 1 second in all cases. As the results show, recovery times always fell within that limit. Even the longest recovery time (303 ms, for Layer-3 recovery from the loss of a Virtual Chassis backup switch) is well below Juniper’s 1-second guideline.

Because Virtual Chassis implementations at the edge of the network involve fewer components (there is no external routing engine, as was the case with the EX8200-XRE200 in the core switching tests), the number of test cases is reduced. Still, the results demonstrate that Virtual Chassis instances made up of Juniper EX4200/EX4500/EX4550 switches recover quickly from failures, in several cases with zero disruption.

1. Master failure

Working from the Junos CLI, engineers rebooted a master switch in each Virtual Chassis instance while offering a mix of unicast and multicast IPv4 and IPv6 traffic. As in the EX8200 Virtual Chassis tests, engineers then derived recovery time from frame-loss measurements. In this and all other EX4200/EX4500/EX4550 Virtual Chassis tests, engineers ensured Spirent test ports were attached only to the device not lost during the failover scenario.

In all four combinations of switches and Layer-2 and Layer-3 configurations, the Virtual Chassis instances recovered in less than 300 ms. With Layer-2 traffic and the loss of a Juniper EX4200/EX4500 Virtual Chassis master, there was zero frame loss and thus zero disruption.

2. Backup failure

In this test, engineers used the Junos CLI to reboot a backup switch in each Virtual Chassis instance.

In all four combinations of switches and Layer-2 and Layer-3 configurations, the Virtual Chassis instances recovered in 303 ms or less. With Layer-2 traffic and the loss of a Juniper EX4200/EX4500 Virtual Chassis backup, there was zero frame loss and thus zero disruption.

3. Link flapping (soft failure)

In this scenario, engineers used the Junos CLI to disable one member of the link aggregation group linking each Virtual Chassis instance at the network edge with the EX8200 Virtual Chassis instance in the network core. As in other test cases, engineers configured Spirent TestCenter to offer test traffic throughout the test, and derived failover time from frame loss.

In all Layer-2 and Layer-3 scenarios, failover time due to a software-initiated link flap was 120 ms or less.

4. Link flapping (hard failure)

This link-flapping test was similar to the previous one, except that engineers induced the failure by removing a cable from one member of the link aggregation group between the core EX8200 Virtual Chassis configuration and one of the Virtual Chassis instances at the edge of the network. Here, too, the Spirent test instrument offered traffic at a constant rate, mainly in an “east-west” direction between Virtual Chassis instances at the edge of the network.

In Layer-2 and Layer-3 scenarios, failover time due to the loss of a physical link was 73 ms or less.

5. VCP failure

In the context of EX4200/EX4500/EX4550 Virtual Chassis instances, VCPs are dedicated ports connecting each switch member, carrying all control-plane traffic including VCCP frames. Engineers assessed the loss of this critical component by physically disconnecting a primary VCP cable while simultaneously offering test traffic.

In all four test cases, the loss of a VCP link caused little or no disruption to user traffic. In three of four cases, the Virtual Chassis system dropped zero frames. In the fourth, involving Layer-2 traffic and a Juniper EX4200/EX4550 Virtual Chassis configuration, the system dropped 16 frames out of more than 700 million total, the equivalent of about 9 microseconds of failover time.
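The 16-frame result can be sanity-checked by running the recovery-time formula in reverse. The back-of-envelope sketch below derives the implied offered rate and test duration. It treats “about 9 microseconds” and “more than 700 million” as exact values, which they are not, so the outputs are rough estimates rather than figures reported by the test.

```python
frames_lost = 16              # from the report
failover_seconds = 9e-6       # "about 9 microseconds" (approximate)
total_frames = 700_000_000    # "more than 700 million" (used as a lower bound)

# recovery time = frames lost / offered rate, so rate = frames lost / recovery time.
implied_rate_fps = frames_lost / failover_seconds

# Test duration implied by the total frame count at that offered rate.
implied_duration_s = total_frames / implied_rate_fps

# Fraction of offered frames that were dropped.
loss_fraction = frames_lost / total_frames

print(f"implied offered rate: {implied_rate_fps / 1e6:.2f} million frames/s")
print(f"implied test duration: {implied_duration_s / 60:.1f} minutes")
print(f"loss fraction: {loss_fraction:.2e}")
```

Under these assumptions the numbers are mutually consistent: an aggregate rate of a little under 2 million frames per second sustained for several minutes yields a total on the order of 700 million frames.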

Conclusion

These tests validated the high-availability features of Juniper’s Virtual Chassis technology as implemented on Juniper EX8200 and Juniper EX4200/EX4500/EX4550 switches. In dozens of test cases involving split Layer-2/Layer-3 and pure Layer-3 scenarios, the systems under test always recovered from failure in less than 1 second. In every case, recovery was far faster than with common enterprise switching or routing protocols. Subsecond recovery also helps with management tasks, such as removing an XRE controller or Virtual Chassis member for maintenance or repair.

These tests also showcased NSSU for nearly hitless upgrades of Juniper EX8200 switches running Virtual Chassis technology. Here again, NSSU recovery times were less than 1 second in both Layer-2 and Layer-3 test cases.

Moreover, the test results also showed how the multiple levels of redundancy in Virtual Chassis technology protect against “split-brain” problems, where different routing engines try to claim a master role. Despite engineers’ best efforts to create split-brain scenarios, Juniper’s Virtual Chassis technology always transferred master and backup roles as expected, with one routing engine playing the master role at any given instant.

For most enterprise network managers, high availability is even more important than high performance; after all, a fast network is of little use if it can’t be reached. With subsecond recovery times in all cases (and zero frame loss in many tests), these results demonstrate how Juniper Virtual Chassis technology can make enterprise networks more reliable.

Appendix A: About Network Test

Network Test is an independent third-party test lab and engineering services consultancy. Our core competencies are performance, security, and conformance assessment of networking equipment and live networks. Our clients include equipment manufacturers, large enterprises, service providers, industry consortia, and trade publications.

Appendix B: Hardware and Software Releases Tested

This appendix describes the software versions used on the test bed. All tests were conducted in September 2012 at Juniper’s headquarters facility in Sunnyvale, CA, USA.

Component: Juniper EX8208, Juniper EX8216, Juniper EX8200-XRE200, Juniper EX4200, Juniper EX4500, Juniper EX4550, Juniper MX80
Version: Junos 12.3I0 (all tests except NSSU); Junos 12.1R2, Junos 12.1R3 (NSSU)

Component: Spirent TestCenter
Version: 4.03.0496.0000

Appendix C: Disclaimer

Network Test Inc. has made every attempt to ensure that all test procedures were conducted with the utmost precision and accuracy, but acknowledges that errors do occur. Network Test Inc. shall not be held liable for damages that may result from the use of information contained in this document. All trademarks mentioned in this document are property of their respective owners.

Version 2012110100. Copyright © 2012 Network Test Inc. All rights reserved.

Network Test Inc. 31324 Via Colinas, Suite 113 Westlake Village, CA 91362-6761 USA +1-818-889-0011 http://networktest.com [email protected]