fabric resiliency best practices

Redpaper

Fabric Resiliency Best Practices

In this IBM® Redpaper™ publication, we discuss best practices for deploying and utilizing advanced Brocade Fabric OS (FOS) features to identify, monitor, and protect Fibre Channel (FC) SANs from problematic device and media behavior.

Introduction Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard FC media can significantly impact the performance of FC fabrics and the applications they support. In most real-world scenarios, these issues cannot be corrected or completely mitigated within the fabric itself. Instead, the behavior must be addressed directly.

However, with the proper knowledge and capabilities, the fabric can often identify and, in some cases, mitigate or protect against the effects of these misbehaving components to provide better fabric resiliency. This document provides a high-level description of the most commonly experienced, detrimental device and link behaviors, and explains how to use features in recent levels of Fabric OS (FOS) to protect your data center.

In FOS 6.1, Brocade introduced Port Fencing as part of the optional Fabric Watch offering. In FOS 6.3, Brocade added a new set of base features referred to as Bottleneck Detection. This was extended in FOS 6.4 with broader monitoring, improved configuration, and detection capabilities for additional types of bottlenecks.

For further details about the features described in this publication, see the following product documents, available with registration at http://my.brocade.com/wps/portal:

� Fabric OS Administrator’s Guide (version 6.4, 53-1001763-02) � Fabric OS Command Reference Manual (version 6.4, 53_1001764_02) � Fabric Watch Administrator’s Guide (version 6.4.1, 53-1001770-01)

Fabric Operating System: This paper covers Brocade Fabric OS command options up to and including version 6.4.2b. If you are using a code level higher than this, the guidance and strategy provided by the paper is still applicable. However, you might find additional options available that allow even greater granularity in establishing specific alerting features.

Ian MacQuarrieJohn Juenemann

Jon Tate

© Copyright IBM Corp. 2012. All rights reserved. ibm.com/redbooks 1

http://www.redbooks.ibm.com/


http://mybrocade.com

� Configuring Performance Monitoring and Thresholds in Brocade Data Center Fabric Manager (DCFM), GA-BP-247-00

� Bottleneck Detection Best Practices Guide, GA-BP-383-00

It is assumed that readers of this document are familiar with the basic functionality of features, such as bottleneck detection, fabric watch, and port fencing.

Fabric resiliency Two primary aspects of fabric resiliency are captured in this document:

� Detecting abnormal behavior in external components (typically servers and hosts, storage devices, or faulty links) so that the negative impact on the fabric can be addressed.

� Providing mechanisms that protect the fabric from adverse effects caused by a faulty component. This includes one or more actions that can be invoked automatically by a switch, when faulty behavior is detected, to contain and isolate the impact of the misbehaving component in the fabric. This should be considered a temporary measure. Ultimately, the faulty or improperly configured component must be addressed to resolve the problem completely and permanently.

There are two common classes of abnormal behavior originating from fabric components:

� Misbehaving high-latency end devices (hosts or storage)

End devices that do not respond as quickly as expected, and therefore cause the fabric to hold frames for excessive periods of time (referred to as slow draining devices), can result in application performance degradation or, in extreme cases, I/O failure.

Common examples of moderate device latency include disk arrays that are overloaded and hosts that cannot process data as fast as they request it. Severe latencies are caused by badly misbehaving devices that stop receiving, accepting, or acknowledging frames for excessive periods of time.

The Bottleneck Detection feature of the Brocade FOS discussed in this document is specifically designed to identify and raise alerts for high-latency or slow draining devices within the SAN.

� Faulty media (fiber optic cables and small form-factor pluggable (SFP) transceivers and optics)

Faulty media can cause frame loss due to excessive cyclic redundancy check (CRC) errors, invalid transmission words, and other conditions. This might result in I/O failure and application performance degradation.

The Fabric Watch monitoring feature of the Brocade FOS discussed in this document is designed to identify and raise alerts for these types of conditions.

Be aware that FC switches cannot correct bad node behavior or faulty media. Instead, they can only attempt to alert and compensate for it. Ultimately, the problems must be addressed at the host, target device, or media where the source of the error actually resides.

Maintaining an optimal FC SAN environment

Although there are many features available in FOS to assist you with monitoring, protecting, and troubleshooting fabrics, several recent enhancements have been implemented that deal exclusively with this area. This document focuses specifically on those newer features and related capabilities that help provide optimum fabric resiliency.

2 Fabric Resiliency Best Practices

Most of those features and capabilities are available and supported on the majority of 4 Gbps and 8 Gbps platforms, provided that the most recent FOS releases are used. Some features might require optional licensing. This section discusses these features, minimum release levels, licensing requirements, and platform limitations.

It is strongly suggested that you review the additional documentation listed the “Introduction ” on page 1 to understand all of the tools available for maintaining a FC SAN environment. Also, be sure to read the FOS Release Notes.

Bottleneck Detection Bottleneck Detection was introduced in FOS 6.3.0 with monitoring for device latency conditions, and then enhanced in FOS 6.4.0 with added support for congestion detection on both E_Ports and F_Ports. FOS 6.4 also added improved reporting options and simplified configuration capabilities. The FOS 6.3.1b release (and later) included enhancement in the algorithm for detecting device latency, making it more accurate.

Bottleneck Detection does not require a license, and it is supported on both 4 and 8 Gbps platforms.

Fabric Watch and Port Fencing Fabric Watch is an optional (licensed) feature that was enhanced in FOS 6.1.0 with the addition of Port Fencing. This capability allows a switch to monitor specific behaviors and protect a switch by blocking a port when specified thresholds have been reached.

Edge Hold Time configuration Edge Hold Time Configuration is a new capability added in the FOS 6.3.1b release. However, it is not documented in the FOS 6.3 or FOS 6.4 Command Reference. See “Configuring Edge Hold Time ” on page 14 for details about its use.

No license is required to configure the Edge Hold Time setting.

Device latencies A device experiencing latencies responds more slowly than expected. The device does not return buffer credits (through R_RDY primitives) to the transmitting switch fast enough to support the offered load, even though the offered load is less than the maximum physical capacity of the link connected to the device, as shown in Figure 1.

Note: To use all of the capabilities described in this document, switches need to be running FOS 6.4.0 or later.

Fabric Resiliency Best Practices 3

Figure 1 Buffer backup on ingress port 6 on B1 causes latency bottleneck upstream on S1, port 3

After it exhausts all available credits, the switch port connected to the device needs to hold additional outbound frames until a buffer credit is returned by the device. When a device is not responding in a timely fashion, the transmitting switch is forced to hold frames for longer periods of time, thus resulting in high buffer occupancy. This, in turn, results in the switch lowering the rate at which it returns buffer credits to other transmitting switches.

This effect propagates through switches (and potentially, multiple switches with devices attempting to send frames to devices attached to the switch with the high-latency device), and ultimately impacts the fabric. See Figure 2.

Figure 2 Latency on a switch can propagate through the fabric

Assessing device latency severityThere are several features and capabilities you can use to assess the severity of device latency.

Note: The impact to the fabric (and other traffic flows) varies based on the severity of the latency exhibited by the device. The longer the delay caused by the device in returning credits to the switch, the more severe the problem.


Moderate device latencies Moderate device latencies are defined as those not severe enough to cause frame loss. If the time between successive credit returns by the device is between tens of milliseconds (ms) up to 100 ms, then the device exhibits moderate latencies because this delay is typically not enough to cause frame loss (frame loss typically occurs above 100 ms). This causes a drop in performance of traffic flows using the fabric, but typically does not cause frame drops or I/O failures.

When a device exhibits mild to moderate latency behavior, applications might see a drop in performance but typically not I/O failure. However, the higher the latency, the greater the chance that a user will experience degraded performance.

Severe device latencies

Devices taking longer than 100 ms between successive credit returns are exhibiting severe latency. Severe device latencies result in frame loss, which triggers the host small computer system interface (SCSI) stack to detect failures and to retry I/Os. This process can take tens of seconds (possibly as long as 30 to 60 seconds), which can cause quite a noticeable application delay and potentially result in application errors. If the time between successive credit returns by the device is in excess of 100 milliseconds, then the device is exhibiting severe latency.

When a device exhibits severe latency, the switch is forced to hold frames for excessively long periods of time (in the order of hundreds of milliseconds). When this time becomes greater than the established timeout threshold (500 ms by default), the switch drops the frame (per FC standards). Frame loss in switches is also known as Class 3 (C3) discards or timeouts.

Dropped frames cause applications to retry I/OsBecause the effect of device latencies often spreads through the fabric, frames can be dropped due to time outs, not just on the F_Port to which the misbehaving device is connected, but also on E_Ports carrying traffic to the F_Port. Dropped frames typically cause I/O errors that result in a host retry and can result in significant decreases in application performance.

The implications of this behavior are compounded and exacerbated because frame drops on the affected F_Port (device) result not only in I/O failures to the misbehaving device (which would be expected), but also because frame drops on E_Ports might cause I/O failures for unrelated traffic flows involving other hosts using the same inter-switch links (ISLs) as the device experiencing severe latency, as depicted in item 5 in Figure 2 on page 4 (which would not typically be expected).

Latency detection Using Bottleneck Detection to detect devices that exhibit latency, and Fabric Watch to detect frame timeouts, is considered best practice and therefore strongly recommended.

Using Bottleneck Detection on F_PortsBottleneck Detection is a comprehensive feature that can be used to detect a wide range of device latencies from mild to severe. See “Configuring Bottleneck Detection ” on page 9 for details about how to enable Bottleneck Detection.

After Bottleneck Detection is enabled, the switch monitors F_Ports for latency symptoms. Specifically, it looks for conditions in which the time delay between successive buffer credit


returns from a device is higher than expected. When the condition is detected, Bottleneck Detection reports latency bottlenecks at F_Ports based on configurable thresholds. These reports can then be leveraged to:

� Determine the specific device port on which device latency is occurring and raise an alert� Determine the severity and duration of the latency behavior � Determine the actual device latency in the range of 100 microseconds to hundreds of

milliseconds

Enter bottleneckmon --show to display the percent congestion on specified ports, as shown in Example 1.

Example 1 Using the bottleneckmon --show command to display percent congestion

switch:admin> bottleneckmon --show -interval 5 -span 30 1/16===========================================================Thu Jun 16 13:24:11 UTC 2011===========================================================Percentage ofFrom To affected secs===========================================================Jun 16 13:24:06 Jun 16 13:24:11 0.00% (no data for 3 seconds)Jun 16 13:24:01 Jun 16 13:24:06 66.67% (no data for 2 seconds)Jun 16 13:23:56 Jun 16 13:24:01 25.00% (no data for 1 seconds)Jun 16 13:23:51 Jun 16 13:23:56 100.00% (no data for 1 seconds)Jun 16 13:23:46 Jun 16 13:23:51 66.67% (no data for 2 seconds)Jun 16 13:23:41 Jun 16 13:23:46 100.00% (no data for 2 seconds)

Timeout notification on F_Ports Use Fabric Watch to detect frame timeouts, that is, frames that have been dropped because of severe latency conditions (the Fabric Watch C3TX_TO area is available in version 6.3 for 8 Gbps ports and in FOS 6.3.1b/6.4.0 and later for 4 Gbps ports). If the number of timed-out frames on an F_Port exceeds the currently effective threshold settings, Fabric Watch can notify the user by:

� Sending an SNMP trap � Logging a RASlog message � Sending an email alert � Logging a SYSlog message

Latency mitigation actionThe following section describes actions you can take to mitigate latency.

Reducing timeouts on unrelated flowsFC standards dictate that frames are dropped in switches if they have been held in the switch buffers for longer than the established Hold Time, which is a value calculated from several configurable fabric parameters. Unless any of these fabric parameters (R_A_TOV, E_D_TOV, WAN_TOV, or MAX_HOPs) have been changed from their defaults, the Hold Time is calculated to be 500 ms.

In most environments, fabric parameters on all switches in a fabric should match, and thus the Hold Time should be consistent throughout a fabric. When congestion conditions cause frames to drop in the core of the fabric, then wherever there tends to be more flows or traffic, there will be more disruption.


To reduce frame drops on E_Ports on core switches, the edge switches that host the end server or storage devices can be configured to have a shorter Hold Time compared to the core switches by using the Edge Hold Time feature (available in FOS 6.3.1b and later). This setting lowers the Hold Time on the edge of the network, which reduces the likelihood of frame loss on the core of the network, effectively mitigating the impact of the misbehaving device. For this reason, it is useful to enable the Edge Hold Time feature.

See “Configuring Edge Hold Time ” on page 14 for details about how to enable the Edge Hold Time feature. Enabling and configuring the Edge Hold Time is a nondisruptive operation.

Fabric configuration Fabrics can be architected to mitigate some impacts of device latency. Isolating the device flows (host or storage pair) that exhibit high latencies, either by putting them in their own fabric or on their own blade or switch, contains the impact of the latencies to the fabric, blade or switch that contains the high-latency device flows. Features such as Integrated Routing (FC Routing) and local switching provide architectural-level solutions that limit the need for more complex monitoring and mitigation capabilities. However, using fabric design as a protection mechanism requires some knowledge of which devices are likely to exhibit latency.

Faulty media In addition to high-latency devices causing disruptions to data centers, fabric problems are often the result of faulty media. Faulty media can include bad cables, SFPs, extension equipment, receptacles, patch panels, improper connections, and so on. Media can fault on any port type (E_Port or F_Port) and fail, often unpredictably and intermittently, making it even harder to diagnose. Faulty media involving F_Ports results in an impact to the end device attached to the F_Port and to devices communicating with this device.

Failures on E_Ports can have an even greater impact. Many flows (host and target pairs) can simultaneously traverse a single E_Port. In large fabrics, this can be hundreds or thousands of flows. In the event of a media failure involving one of these links, it is possible to disrupt some or all of the flows utilizing the path.

Severe cases of faulty media, such as a disconnected cable, can result in a complete failure of the media, which effectively brings a port offline. This is typically easy to detect and identify. When this occurs on an F_Port, the impact is specific to flows involving the F_Port. E_Ports are typically redundant, so severe failures on E_Ports typically only result in a minor drop in bandwidth because the fabric automatically utilizes redundant paths. Also, error reporting built into FOS readily identifies the failed link and port, allowing for simple corrective action and repair.

With moderate cases of faulty media, failures occur but the port can remain online or transition between online and offline. This can cause repeated errors, which can occur indefinitely or until the media fails completely. When these types of failures occur on E_Ports, the result can be devastating because there can be repeated errors that impact many flows. This can result in significant impacts to applications that last for prolonged durations. Signatures of these types of failures include the following:

� CRC errors on frames � Invalid Transfer Words (includes encoder out errors) � State Changes (ports going offline or online repeatedly) � Credit loss (complete loss of credit on a virtual channel (VC) on an E_Port prevents traffic

from flowing on that VC, resulting in frame loss and I/O failures for devices utilizing the VC)


Automatically detecting and mitigating faulty media You can automatically detect and mitigate the impact of faulty media by using Fabric Watch monitoring and quarantine, and the Bottleneck Detection feature, as explained here.

Fabric Watch Enable Fabric Watch to monitor for CRC errors, Invalid Transfer Words, and State Changes. Configure for alerts on reaching low thresholds and fence (disable) a port when reaching high thresholds. See “Configuring Port Fencing” on page 12 for details about how to enable and configure Fabric Watch Port Fencing.

Fabric Watch monitoringFabric Watch monitors can be enabled to automatically detect most of the faulty media conditions previously noted. For example, Fabric Watch can monitor CRC errors (available in FOS 6.1), Invalid Transfer Words (available in FOS 6.1), and State Changes (ports transitioning between offline and online, available in FOS 6.3). Fabric Watch generates alerts based on user-defined thresholds for these conditions.

The most common cause of credit loss is corruption to credit return messages (VC_RDY or R_RDY) due to faulty media. Credit corruption is tracked by an encoding out error, which is an Invalid Transfer Word error. Monitoring and mitigating Invalid Transfer Word issues protects against credit loss.

Fabric Watch quarantineFabric Watch also provides a mechanism that quarantines the badly behaving component with the optional action of Port Fencing. Port Fencing is available for each of the previously noted conditions. Use it to automatically protect the fabric from these error conditions. The thresholds specified in “Configuring Port Fencing” on page 12 have been tested and tuned to quarantine components that are misbehaving to the point at which they are likely to cause a fabric-wide impact. These thresholds do not falsely trigger on normally behaving components.

Bottleneck Detection The Bottleneck Detection feature can detect different types of bottlenecks in a fabric. Lost buffer credits can result in extreme congestion by slowing the aggregate throughput of a connection. Bottleneck Detection can detect ports that are blocked due to lost credits, and generate special lost credit alerts for the E_Port in this condition (available in FOS 6.3.1b and later). Bottleneck Detection can also generate alerts on upstream E_Ports blocked due to a downstream latency condition such as an E_Port with lost credits or a high-latency device.

Bottleneck Detection can optionally create alerts when latencies are detected. Prior to FOS 6.4, alerts are sent to RASlog. Simple Network Management Protocol (SNMP) alerts were introduced in FOS 6.4. See “Configuring Bottleneck Detection ” on page 9 for useful suggestions about configuring and using this feature. FOS version 6.3.2d and above allows for Bottleneck Detection SNMP alerts to be sent through RASLOG AN-1003. AN-1003 itself is not an actual error condition, but rather an indication of a potential bottleneck device. The change in this version 6.3 code level and above provides an easy interface for clients to trap Bottleneck Detection alerts through SNMP.


Summary of best practicesThe following list summarizes how you can use features and capabilities to improve the overall resiliency of Brocade FOS-based FC fabric environments:

� Enable Fabric Watch for monitoring and alerting. � Enable Fabric Watch Port Fencing to fence ports on extreme behavior.� Enable the Edge Hold Time feature. � Enable Bottleneck Detection for congestion conditions.

See “Suggested implementation” on page 14 for specific commands required to implement these features using FOS 6.3.1b and FOS 6.4.

Configuring Bottleneck Detection

This section provides examples showing you how to:

� Enable and disable Bottleneck Detection� Display a list of ports with Bottleneck Detection enabled� Change the Bottleneck Detection setting on a port� Display a history of bottlenecks that occurred on a port

This section also provides an example of a bottleneck alert.

Enabling and disabling Bottleneck Detection When Bottleneck Detection is enabled, RASlog alerts can be enabled to be sent when the bottleneck conditions at a port exceed a specified threshold.

On the switch with target port connections, log in with administrator privileges.

Enter the bottleneckmon --enable command to enable Bottleneck Detection on an F_Port or FL_Port.

bottleneckmon --enable [ -alert ] [ -thresh threshold ] [ -time window ] [ -qtime quiet_time] [slot/]portlist [[slot/]portlist]...

If the alert parameter is not specified, then alerts are not sent, but a history of bottleneck conditions for the port can be viewed. The thresh, time, and qtime parameters are also ignored if the alert parameter is not specified.

Use the default values for the thresh (0.1), time (300), and qtime (300) parameters.

Enabling Bottleneck Detection example (preferred use case)The following example enables bottleneck detection on all F_ and FL_Ports in the switch with RASlog alerts using default values for threshold and time. Alerts are logged when a port is experiencing a bottleneck condition for 10% of the time (default value for thresh/lthresh) over

Note: The following sections are intended to be illustrative of the commands required to configure and enable FOS features. The actual commands and output will vary slightly, depending on the version of the FOS deployed. Refer to the FOS command reference manual for the FOS version in your environment.


any period of 300 seconds (default value for time) with a minimum of 300 seconds (default value for qtime) between alerts.

switch:admin> bottleneckmon --enable -alert *

Enabling Bottleneck Detection on ports 3 - 7 with default values exampleThe following example enables bottleneck detection on ports 3 through 7 using default values for threshold and time. No alerts will be delivered to report bottleneck conditions, but the bottleneck history can be viewed using the CLI.

switch:admin> bottleneckmon --enable 3-7

Example: Disabling Bottleneck Detection You can disable Bottleneck Detection by following these steps:

1. Connect to the switch to which the target port belongs, and log in as administrator. 2. Enter bottleneckmon --disable to disable Bottleneck Detection on a port.

Example: Disabling Bottleneck Detection on port 3You can disable Bottleneck Detection on port 3 by using this command:

switch:admin> bottleneckmon --disable 3

Displaying a list of ports with Bottleneck Detection enabled Follow these steps to display a list of ports that have Bottleneck Detection enabled:

1. Connect to the switch to which the target ports belong and log in as administrator. 2. Enter bottleneckmon --status to display a list of ports on which Bottleneck Detection is

enabled, as shown in Example 2.

Example 2 Results of the bottleneckmon --status command

switch:admin> bottleneckmon --statusPort Alerts? Threshold Time(s) Quiet Time(s)=======================================================================3 N -- -- --4 Y 0.100 300 3005 Y 0.100 300 3006 N -- -- --

Changing Bottleneck Detection settings on a port The default settings for Bottleneck Detection are the preferred settings. These settings are configurable in the event that a user has specific reasons for modifying them, but in most cases, the default settings should not be changed.

Examples of reasons to change the defaults can include transient events that cause moderate congestion that are considered normal. Increasing the time or threshold might accommodate such events. Using the following procedure, RASlog alerts can be enabled or disabled, along with configuration of the following settings:

� Threshold: the percentage of one-second intervals required to generate an alert

Note: When using Virtual Fabrics, the output displays ports that do not belong to the logical switch if the ports were moved out of the logical switch after Bottleneck Detection was enabled on them.


� Time: the time window in seconds in which bottleneck conditions are monitored and compared against the threshold

� Quiet Time (qtime) options

To change settings on a port:

1. Connect to the switch to which the target port belongs and log in as administrator. 2. Enter bottleneckmon --disable to disable Bottleneck Detection on the port. 3. Enter bottleneckmon --enable to enable Bottleneck Detection, specify the new threshold

values, and set the alert option.

Example 3 changes the Bottleneck Detection settings on port 4. In this example, the bottleneck --status commands show the before and after settings.

Example 3 Before and after running the bottleneck --status command

switch:admin> bottleneckmon --status Port Alerts? Threshold Time(s) Quiet Time(s)=======================================================================4 Y 0.800 300 300

switch:admin> bottleneckmon –-disable 4switch:admin> bottleneckmon –-enable –thresh 0.6 –time 420 4switch:admin> bottleneckmon –-status

Port Alerts? Threshold Time(s) Quiet Time(s)=======================================================================4 Y 0.600 420 300

Displaying the history of bottlenecks on a port

Use the bottleneckmon –show command to display a history of bottleneck conditions for an individual port:

1. Connect to the switch to which the target port belongs and log in as administrator. 2. Enter the bottleneckmon --show command to display a history of the bottleneck severity

for a specific port.

Example 4 shows the bottleneck history for port 3 in five-second windows over a period of 30 seconds.

Example 4 Results of the bottleneckmon --show command

fcr_saturn1:root> bottleneckmon --show -interval 5 -span 30 3=============================================================Mon Jun 15 18:54:35 UTC 2010=============================================================From To affected secs=============================================================Jun 15 18:54:30 Jun 15 18:54:35 80.00%Jun 15 18:54:25 Jun 15 18:54:30 40.00%Jun 15 18:54:20 Jun 15 18:54:25 0.00%Jun 15 18:54:15 Jun 15 18:54:20 0.00%

Note: Bottleneck Detection must be disabled on a port before any of the settings can be modified.


Jun 15 18:54:10 Jun 15 18:54:15 20.00%Jun 15 18:54:05 Jun 15 18:54:10 80.00%

Bottleneck alert example The following example illustrates a Bottleneck Detection alert on an F_Port.

Example 5 Example Bottleneck Detection alert on an F_Port

2010/03/16-03:40:47, [AN-1003], 21760, FID 128, WARNING, sw0, Latency bottleneck at slot 0, port 38. 100.00 percent of last 300 seconds were affected. Avg. time b/w transmits 80407.3975 us.

Configuring Port Fencing

Use the portFencing CLI command to enable error reporting for the Fabric Watch Port Fencing feature on all ports of a specified type and to configure the ports to report errors for a specific area. Supported port types include E_Ports, F_Ports, and physical ports. A specified port type can be configured to report errors for one or more areas.

Port Fencing monitors ports for erratic behavior and disables a port if specified error conditions are met. The portFencing CLI command enables or disables the Port Fencing feature for an area of a class. You can customize or tune the threshold of an area by using the portthConfig CLI command.

Use portFencing to configure Port Fencing to detect excessive Invalid Words on all F_Ports. See the following example:

portfencing --enable fop-port –area IW

The same command can be used to configure Port Fencing on link reset:

portfencing --enable fop-port –area LR

Use portFencing to configure Port Fencing to detect excessive State Changes on E_Ports. See the following example:

portfencing --enable e-port –area ST

Note: The following sections are intended to be illustrative of the commands required to configure and enable FOS features. The actual command usage and output will vary slightly, depending on the version of the FOS deployed. Refer to the FOS command reference manual for the FOS version in your environment.

Note: Avoid Port Fencing using time outs on E_Ports, because it can affect traffic for a large number of devices throughout the fabric.

Note: Using the Lossless DPS feature can minimize the number of dropped frames due to E_Ports being fenced because of State Changes.


Use portThconfig to customize Port Fencing thresholds:

switch:admin> portthconfig --set port -area crc -highthreshold -value 2 -trigger above -action email switch:admin> portthconfig --set port -area crc -highthreshold -trigger below -action email switch:admin> portthconfig --set port -ar crc -lowthreshold -value 1 -trigger above -action email switch:admin> portthconfig --set port -ar crc -lowthreshold -trigger below -action email

To apply the new custom settings so that they become effective:

switch:admin> portthconfig --apply port -area crc -action cust -thresh_level custom

To display the port threshold configuration for all port types and areas:

switch:admin> portthconfig --show

Port Fencing suggested thresholds Table 1 lists the suggested thresholds for Port Fencing for three areas.

Table 1 Suggested Port Fencing thresholds

Suggested thresholds for CRC errors and Invalid Transfer WordsCRC errors and Invalid Transfer Words can occur on normal links. They have also been known to occur during certain transitions, such as server reboots. When these errors occur more frequently, they can cause a severe impact.

Although most systems can tolerate infrequent CRC errors or Invalid Transfer Words, other environments can be sensitive to even infrequent instances. The overall quality of the fabric interconnects is also a factor.

When you establish thresholds for CRC errors and Invalid Transfer Words, consider the following: in general, cleaner interconnects can have lower thresholds because they should be less likely to introduce errors on the links. Table 2 on page 13 lists suggestions for establishing moderate (the preferred choice), conservative, and aggressive thresholds.

Table 2 Aggressive and conservative thresholds

Area Suggested threshold

Link Reset 5

State Change 7

C3TX_TO 5

Area Moderate threshold Aggressive threshold Conservative threshold

CRC Low 5High 20

Low 1High 2

Low 5High 40

Invalid Transfer Words

Low 25High 40

Low 12High 25

Low 25High 80


After you select the type of thresholds for an environment, set the low threshold with an action of ALERT (RASlog, email, SNMP trap). The alert will be triggered whenever the low threshold is exceeded. Set the high threshold with an action of Fence. The port will be fenced (disabled) whenever the high threshold is detected.

Configuring Edge Hold Time

Users can configure Edge Hold Time by using the following commands. The switch does not need to be disabled to modify the hold time. Use the Configure Edge Hold Time option to turn this feature on or off.

Output of the configure command is shown in Example 6.

Example 6 Output of configure command

IBM_SAN384B_27_VF:admin> configure

Not all options will be available on an enabled switch.To disable the switch, use the "switchDisable" command.

Configure... Fabric parameters (yes, y, no, n): [no] yes Configure edge hold time (yes, y, no, n): [yes] Edge hold time: (100..500) [100]

The edge_hold_time value is persistently stored in the configuration file. All configuration file operations, such as upload and download, are supported for this feature.

Suggested implementationImplement the resiliency features provided by the Brocade FOS by using a two-phase approach. In Phase 1, enable only the alerting and monitoring features using the suggested threshold values. In Phase 2, enable the mitigation (fencing) features. This approach allows you to gain experience in the environment (over a 30-day period, for example) using these features, and also provides an opportunity for you to identify and address pre-existing conditions that might exist in the environment, prior to enabling mitigation actions.

Note: The following sections are intended to be illustrative of the commands required to configure and enable FOS features. The actual command usage and output will vary slightly, depending on the version of FOS deployed. For the actual command details and output produced, see the FOS command reference manual for the version of FOS in your environment.

Note: This setting is only available in FOS 6.3.1b and later.

Note: The suggested approach and associated thresholds presented have been identified as appropriate for most environments. It is possible that specific environments might require alternate settings to meet specific requirements.


The resiliency features discussed in this paper were introduced and enhanced over several levels of FOS code. This section specifically refers to the capabilities provided by FOS codes 6.3 and later. FOS code 6.3 contains the majority of these new features. FOS 6.4 introduced additional enhancements, most notably in regard to bottleneck detection.

Phase 1

Step 1: Enable Fabric Watch alertingEnable Fabric Watch Alerting on both F_Ports and E_Ports. Table 3 lists suggested values.

Table 3 Suggested Fabric Watch Alerting thresholds for F_Ports and E_Ports

Thresholds can be defined using either the Fabric Watch GUI or the command line. Figure 3 on page 19 shows the Fabric Watch Threshold Configuration tab. Example 7 on page 16 shows the CLI commands used to set the low and high threshold values for F_Ports and E_Ports.

See the following product documents, available with registration at http://my.brocade.com/wps/portal/registration:

Note: Port Fencing for Class 3 (C3) discards or timeouts is supported on 8 Gbps in FOS 6.3 and later. However, it was not supported on 4 Gbps until FOS 6.3.1b.

Condition F_Ports E_Ports

Link Resets Low 3High 5

Low 2

State Change Low 3High 7

Low 2

CRC Low 5High 20

Low 2

Invalid Transfer Word Low 25High 40

Low 10

C3TX_TO (C3 Discard) Low 3High 5

NA

Note: The C3 Discard Frames Threshold cannot be applied to an E_Port.

Note: Thresholds can also be defined using DCFM; however, detailing the use of DCFM is beyond the scope of this paper. For information about using DCFM to set monitoring thresholds, see the following product documents, available with registration at http://my.brocade.com/wps/portal:

� IBM System Storage publication Data Center Fabric Manager User Manual Supporting DCFM version 10.4.x, GC52-1304-03, at http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003231

� Configuring Performance Monitoring and Thresholds in Brocade Data Center Fabric Manager (DCFM), GA-BP-247-00 at http://www.brocade.com/downloads/documents/white_papers/DCFM_PerfMon_Thresholds_GA-BP-247-00.pdf


http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003231

http://www.brocade.com/downloads/documents/white_papers/DCFM_PerfMon_Thresholds_GA-BP-247-00.pdf

http://www.brocade.com/downloads/documents/white_papers/DCFM_PerfMon_Thresholds_GA-BP-247-00.pdf



Example 7 CLI commands used to set strongly suggested threshold values

portthconfig --set fop-port -area LR -lowthreshold -value 3 -trigger above-action raslogportthconfig --set fop-port -area LR -highthreshold -value 5 -trigger above-action raslogportthconfig --set fop-port -area ST -lowthreshold -value 3 -trigger above-action raslogportthconfig --set fop-port -area ST -highthreshold -value 7 -trigger above-action raslogportthconfig --set fop-port -area CRC -lowthreshold -value 5 -trigger above-action raslogportthconfig --set fop-port -area CRC -highthreshold -value 20 -trigger above-action raslogportthconfig --set fop-port -area ITW -lowthreshold -value 25 -trigger above-action raslogportthconfig --set fop-port -area ITW -highthreshold -value 40 -trigger above-action raslogportthconfig --set fop-port -area C3TX_TO -lowthreshold -value 3 -trigger above-action raslogportthconfig --set fop-port -area C3TX_TO -highthreshold -value 5 -trigger above-action raslogportthconfig --set e-port -area LR -lowthreshold -value 2 -trigger above-action raslogportthconfig --set e-port -area ST -lowthreshold -value 2 -trigger above-action raslogportthconfig --set e-port -area CRC -lowthreshold -value 2 -trigger above-action raslogportthconfig --set e-port -area ITW -lowthreshold -value 10 -trigger above-action raslogportthconfig --set e-port -area C3TX_TO -lowthreshold -value 2 -trigger above-action raslog

After the custom thresholds and actions have been defined as shown in Example 7, they must then be applied. Example 8 shows the CLI commands used to apply the custom defined threshold and action for each area.

Example 8 CLI commands used to apply custom defined thresholds and actions

portthconfig --apply fop-port -area LR -action_level cust -threhsold_level custportthconfig --apply fop-port -area LR -action_level cust -threhsold_level custportthconfig --apply fop-port -area ST -action_level cust -threhsold_level custportthconfig --apply fop-port -area ST -action_level cust -threhsold_level custportthconfig --apply fop-port -area CRC -action_level cust -threhsold_level custportthconfig --apply fop-port -area CRC -action_level cust -threhsold_level custportthconfig --apply fop-port -area ITW -action_level cust -threhsold_level custportthconfig --apply fop-port -area ITW -action_level cust -threhsold_level custportthconfig --apply fop-port -area C3TX_TO -action_level cust -threhsold_level custportthconfig --apply fop-port -area C3TX_TO -action_level cust -threhsold_level custportthconfig --apply e-port -area LR -action_level cust -threhsold_level custportthconfig --apply e-port -area ST -action_level cust -threhsold_level cust


portthconfig --apply e-port -area CRC -action_level cust -threhsold_level custportthconfig --apply e-port -area ITW -action_level cust -threhsold_level custportthconfig --apply e-port -area C3TX_TO -action_level cust -threhsold_level cust

Step 2: Enable Bottleneck Detection alerting Use the default values that follow:

� FOS 6.3: thresh=(0.1), time=(300), qtime=(300)� FOS 6.4: lthresh=(0.1), cthresh=(0.8), time=(300), qtime=(300)

Example 9 shows the commands used to enable Bottleneck Detection alerting on FOS 6.3.1b. Be aware that on FOS 6.3.1b, ports must be enabled individually.

The output from bottleneckmon --status shows ports that have been enabled.

Example 9 Enabling Bottleneck Detection alerting (for FOS 6.3.1b)

butane:admin> bottleneckmon --enable -alert 1butane:admin> bottleneckmon --enable -alert 2butane:admin> bottleneckmon --enable -alert 3butane:admin> bottleneckmon --statusPort Alerts? Threshold Time (s) Quiet Time (s)======================================================================1 Y 0.100 300 3002 Y 0.100 300 3003 Y 0.100 300 300butane:admin>

Note: For threshold alerts to be acted upon by writing them to the RASlog, sending an email, or sending an SNMP trap, the -trigger and -action parameters must be specified. Multiple actions can be specified when separated by commas.

Example:

portthconfig --set fop-port -area LR -lowthreshold -value 3 -trigger above -action raslog,email,snmp

Note:

� FOS 6.3 supports Bottleneck Detection on F_Ports only. � FOS 6.4 supports Bottleneck Detection on F_Ports and E_Ports.


Example 10 shows the commands used to enable Bottleneck Detection alerting on FOS 6.4. Be aware that on FOS 6.4, all ports are enabled together on a logical switch basis.

Example 10 Enabling Bottleneck Detection alerting (for FOS 6.4 and later)

max2:admin> bottleneckmon --enable -alert *max2:admin> bottleneckmon --statusBottleneck detection - Enabled==============================Switch-wide alerting parameters:============================Alerts - YesAction requested - NoLatency threshold for alert - 0.100Congestion threshold for alert - 0.800Averaging time for alert - 300 secondsQuiet time for alert - 300 seconds

Step 3: Enable the Edge Hold Timer feature Select a hold time of 100 ms.

Example 11 shows the command used to enable the Edge Hold Timer with a hold time of 100 ms.

Example 11 Setting Edge Hold Timer

max2:admin> configure

Not all options will be available on an enabled switch.To disable the switch, use the "switchDisable" command.

Configure...

Fabric parameters (yes, y, no, n): [no] yes

Configure edge hold time (yes, y, no, n): [no] yes

Edge hold time: (100..500) [100]

max2:admin>

Phase 2

Step 4: Enable Fabric Watch Port FencingEnable Fabric Watch Port Fencing on F_Ports only. Fencing will occur on the high threshold values defined in Phase 1. See Table 3 on page 15 for information about those values.


Example 12 shows the commands used to enable port fencing on F_Ports.

Example 12 Commands used to enable port fencing on F_Ports

portfencing --enable fop-port -area LR portfencing --enable fop-port -area STportfencing --enable fop-port -area CRC portfencing --enable fop-port -area ITW portfencing --enable fop-port -area C3TX_TO

Figure 3 shows the Fabric Watch Threshold Configuration menu.

Figure 3 Fabric Watch Threshold Configuration menu

Step 5: Evaluate and tune Bottleneck Detection alertingAfter gaining experience with Bottleneck Detection enabling using the default values, it might be appropriate to change the threshold values to more aggressive settings. If the default values applied in Step 5 are not generating an excessive volume of alerts, then a more aggressive setting should be applied to improve the effectiveness of the monitoring.

Note: Although the high threshold values used by port fencing can be defined using the Fabric Watch GUI, port fencing can only be enabled from the command line or through DCFM. For information about using DCFM, see the following document, available with registration at http://my.brocade.com/wps/portal: IBM System Storage Data Center Fabric Manager User Manual, Supporting DCFM version 10.4.x, GC52-1304-03 at http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003231.





Suggested initial aggressive values are:

� FOS 6.3: thresh=(0.1), time=(120), qtime=(120)� FOS 6.4: lthresh=(0.1), cthresh=(0.75), time=(120), qtime=(120)

Example 13 shows the commands for changing Bottleneck Detection alerting using FOS 6.3.1b.

Example 13 Changing Bottleneck Detection alerting (for FOS 6.3.1b)

butane:admin> bottleneckmon --enable -alert -thresh 0.1 -time 120 -qtime 120 1butane:admin> bottleneckmon --enable -alert -thresh 0.1 -time 120 -qtime 120 2butane:admin> bottleneckmon --enable -alert -thresh 0.1 -time 120 -qtime 120 3butane:admin> bottleneckmon --statusPort Alerts? Threshold Time (s) Quiet Time (s)======================================================================1 Y 0.100 120 1202 Y 0.100 120 1203 Y 0.100 120 120butane:admin>

Example 14 shows the commands for changing Bottleneck Detection alerting using FOS 6.4.

Example 14 Changing Bottleneck Detection alerting (for FOS 6.4)

max2:admin> bottleneckmon --config -lthresh .1 -cthresh .75 -time 120 -qtime 120 -alertmax2:admin> bottleneckmon --statusBottleneck detection - Enabled==============================

Switch-wide alerting parameters:============================Alerts - YesLatency threshold for alert - 0.100Congestion threshold for alert - 0.750Averaging time for alert - 120 secondsQuiet time for alert - 120 secondsmax2:admin>

Note: The recommended threshold settings defined here were found to be appropriate for most environments. In some cases, these settings will be adjusted for the specific environment. If alerting is triggering on events that do not represent a problem, then the values can either be returned to the default values or tuned through trial and error.


The team who wrote this paper

This paper was produced by a team of specialists from around the world, working at the International Technical Support Organization, San Jose Center.

Ian MacQuarrie is a Senior Technical Staff Member with the IBM Systems and Technology Group located in San Jose, California. He has 26 years of experience in enterprise storage systems in a variety of test and support roles. He is currently a member of the Systems and Technology Group (STG) Field Assist Team (FAST) supporting clients through critical account engagements, availability assessments, and technical advocacy. His areas of expertise include storage area networks (SANs), open systems storage solutions, and performance analysis. Ian co-authored a previous IBM Redbooks® publication, Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933.

John Juenemann is a Senior Technical Staff Member with the IBM Global Technology Services® Delivery Storage Service Line, located in Boulder, Colorado. He has 19 years of experience in strategic outsourcing. During the last nine years, he has been focused on delivering storage solutions to some of the largest, strategically outsourced IBM accounts. His areas of expertise include open systems servers, databases, SANs, and enterprise storage solutions. John previously contributed to the IBM Redbooks publication, Exploiting RS/6000 Security: Keeping it Safe, which is available at http://www.redbooks.ibm.com/abstracts/sg245521.html?Open.

Jon Tate is a Project Manager for IBM System Storage® SAN Solutions at the International Technical Support Organization, San Jose Center. Before joining the ITSO in 1999, he worked in the IBM Technical Support Center providing Level 2 and Level 3 support for IBM storage products. Jon has 26 years of experience in storage software and management, services, and support, and is both an IBM Certified IT Specialist and an IBM SAN Certified Specialist. He is also the UK Chairman of the Storage Networking Industry Association.

Special thanks to Brocade for its unparalleled support of this paper in terms of equipment and support in many areas, and to the following people at Brocade:

� Jim Baldyga� Mansi Botadra� Silviano Gaona� Brian Steffler� Marcus Thordal� Steven Tong

Now you can become a published author, too!

Here's an opportunity to spotlight your skills, grow your career, and become a published author—all at the same time! Join an ITSO residency project and help write a book in your area of expertise, while honing your experience using leading-edge technologies. Your efforts will help to increase product acceptance and customer satisfaction, as you expand your network of technical contacts and relationships. Residencies run from two to six weeks in length, and you can participate either in person or as a remote resident working from your home base.

Find out more about the residency program, browse the residency index, and apply online at:

ibm.com/redbooks/residencies.html


http://www.redbooks.ibm.com/residencies.html

http://www.redbooks.ibm.com/abstracts/sg245521.html?Open

http://www.redbooks.ibm.com/residencies.html

Stay connected to IBM Redbooks� Find us on Facebook:

http://www.facebook.com/IBMRedbooks

� Follow us on Twitter:

http://twitter.com/ibmredbooks

� Look for us on LinkedIn:

http://www.linkedin.com/groups?home=&gid=2130806

� Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks weekly newsletter:

https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm

� Stay current on recent Redbooks publications with RSS Feeds:

http://www.redbooks.ibm.com/rss.html


http://www.facebook.com/IBMRedbooks

http://twitter.com/ibmredbooks

http://www.linkedin.com/groups?home=&gid=2130806



http://www.redbooks.ibm.com/rss.html

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

© Copyright International Business Machines Corporation 2012. All rights reserved.Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. 23

®

Redpaper™

This document REDP-4722-01 was created or updated on July 13, 2012.

Send us your comments in one of the following ways:� Use the online Contact us review Redbooks form found at:

ibm.com/redbooks� Send your comments in an email to:

[email protected]� Mail your comments to:

IBM Corporation, International Technical Support OrganizationDept. HYTD Mail Station P0992455 South RoadPoughkeepsie, NY 12601-5400 U.S.A.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

Global Technology Services®IBM®Redbooks®

Redpaper™Redbooks (logo) ®RS/6000®

System Storage®

The following terms are trademarks of other companies:

Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.



http://www.ibm.com/redbooks/

http://www.ibm.com/redbooks/

http://www.redbooks.ibm.com/contacts.html

http://www.ibm.com/legal/copytrade.shtml

fabric resiliency best practices

Documents