analysis of failures in transport networks - nsn

   © 2007 Nokia Siemens Networks. All rights reserved. Nokia Siemens Networks is expected to start operations in Q1 2007, subject to regulatory approvals and the closing conditions.


    Nokia Siemens Networks

    Analysis of configuration failures in transport networks
    Part II: Configuration failures of the Ethernet and the WDM layer

    Christian Merkle
    Technische Universität München, Lehrstuhl für Kommunikationsnetze, Arcisstraße 21, 80333 München

    Cooperation NSN and Institute of Communication Networks, TU München: Robust Planning of Cost Efficient Next Generation Networks (ROPOCON)
    Dr. Dominic Schupke, Dr. Claus Gruber

    01.08.2008

    Technical report


    Outline

    1. Introduction
    2. Configuration failures in the Ethernet layer
    3. Configuration failures in the WDM layer
    4. Network outages due to software failures
    5. Rating system for configuration failures
    6. Conclusion

    Acronyms
    References


    1. Introduction

    The Technical Report Part I [1] describes configuration failures of IP routers that occur during configuration with a Command Line Interface (CLI). The most difficult configuration tasks are the configuration of filter rules and authentication methods. Failures during these tasks happen more often, because they comprise more individual configuration steps. They also have a larger effect on the network: if the wrong customer packets are filtered or the wrong authentication parameters are used, a customer cannot connect to the network and Service Level Agreements (SLAs) may be violated.

    This report analyzes network failures that arise during the configuration of the Ethernet and WDM layers. The importance of reducing network failures in an IP transport network is shown in [2]. A study from the University of Michigan found that router failures are responsible for 23% of downtime and router operations for 36% of downtime in a wide area IP network. The same study shows that IP routers have a downtime of 1519 minutes per year, whereas carrier-grade switches have a downtime of 5.2 minutes per year.
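To put the two downtime figures into perspective, they can be converted into availability percentages. The following sketch assumes a year of 365 days; the function name `availability` is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored (assumption)


def availability(downtime_minutes_per_year: float) -> float:
    """Return the fraction of the year in which the device is up."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


router_availability = availability(1519)   # IP router figure from the text
switch_availability = availability(5.2)    # carrier-grade switch figure from the text

print(f"IP router:            {router_availability:.3%}")
print(f"Carrier-grade switch: {switch_availability:.4%}")
```

Under this assumption the router figure corresponds to roughly 99.71% availability, while the switch figure corresponds to roughly 99.999%.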

    The authors of [3] and [4] analyzed network outages of a regional network provider between November 1997 and November 1998 and characterized the type, frequency, origin, and duration of these failures. In [4] they report on a software fault that triggered communication failures between different backbone routers. Due to that failure many users experienced a network outage and increased packet loss. The outage lasted several hours until the failure could be resolved.

    In [5] Infonetics Research calculated the costs of enterprise network downtime. It found that the annual total cost of downtime amounts to 40.7 million dollars. Broken down by cause, software downtime accounts for the largest share of the total costs with 36%. The second largest share is human error with 22%. Considering the total hours of downtime (501 hours) of the network, software failures account for the largest part with 33%. Hardware failures and human errors are responsible for 23% and 22%, respectively. This study shows that reducing the network outage time caused by software failures, hardware failures, and human errors is important for reducing the operational expenditures of networks.
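The percentages quoted from [5] can be turned into absolute values with a few lines of arithmetic. This is only a worked calculation on the figures stated above; the variable names are illustrative.

```python
TOTAL_COST_MUSD = 40.7   # annual total cost of downtime in million dollars, from the text
TOTAL_HOURS = 501        # total hours of downtime, from the text

cost_share = {"software": 0.36, "human error": 0.22}
hours_share = {"software": 0.33, "hardware": 0.23, "human error": 0.22}

# Multiply each share by the corresponding total.
cost_by_cause = {cause: TOTAL_COST_MUSD * share for cause, share in cost_share.items()}
hours_by_cause = {cause: TOTAL_HOURS * share for cause, share in hours_share.items()}

for cause, cost in cost_by_cause.items():
    print(f"{cause}: {cost:.2f} million dollars")
for cause, hours in hours_by_cause.items():
    print(f"{cause}: {hours:.1f} hours")
```

Software failures thus stand for about 14.7 million dollars and about 165 hours of downtime per year.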

    Chapter 2 describes possible configuration failures of the Ethernet layer, followed in chapter 3 by the configuration failures of the WDM layer. Finally, chapter 4 describes the impact of faulty software on network elements in transport networks.

    2. Configuration failures in the Ethernet layer

    This chapter describes the configuration tasks in the Ethernet layer. The first part covers general configuration tasks of the Ethernet layer, and the second part covers the configuration of Carrier Ethernet protocols. To analyze which configuration failures can occur, the configuration parameters of the different protocols used in the Ethernet layer are examined to derive possible configuration errors during maintenance tasks and to identify their possible impacts on a network.

    A very helpful feature is the configuration of multiple interfaces with the same parameters via one configuration file. However, a failure in the configuration file then leads to a configuration error on several interfaces at the same time. The misconfiguration can result in loss of connectivity if the configurations of two connected interfaces become incompatible. If the wrong interface range is chosen, interfaces that should not receive these parameters are overwritten as well and may no longer work correctly. Changing the original configuration of an interface can lead to packet loss or performance degradation. Such a misconfiguration can also create a loop in the network, which results in a higher traffic load, or can disconnect switches, which triggers the Spanning Tree Protocol (STP) to reconfigure itself because the network topology has changed.

    Spanning Tree Protocol (STP)

    Switches that are not running STP forward Bridge Protocol Data Unit (BPDU) packets, so that other switches running STP receive and process these control packets. To break loops in the network it is important to run STP on enough switches. If STP is not running on enough switches, loops can occur in the network and cause a broadcast storm. This excessive traffic and the indefinite duplication of packets can reduce the network performance or crash devices if the traffic load at a device is too high for its processor.

    The path cost of links can be configured at switches to influence which interface ports are set to the forwarding state first. Interfaces with lower path costs are preferred. Normally a lower path cost represents a higher bandwidth on the link. If the interface of a switch is misconfigured, for example a higher path cost is set on an interface that is connected to a link with higher bandwidth, then the link with the lower bandwidth is chosen first. This reduces the performance of the network, because the traffic is sent over the link with the smaller bandwidth.
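The effect of a wrong path cost can be illustrated with a deliberately simplified selection rule: the port with the lowest configured cost wins. Real STP compares the accumulated root path cost and uses further tie-breakers, which are omitted here. The port names and the cost values 4 (1 Gbit/s) and 19 (100 Mbit/s) are assumptions taken from the commonly used short path cost scale, not from this report.

```python
def preferred_port(port_costs: dict[str, int]) -> str:
    """Return the port with the lowest path cost (simplified STP preference)."""
    return min(port_costs, key=port_costs.get)


# Hypothetical switch with one Gigabit and one Fast Ethernet uplink.
correct = {"Gi0/1": 4, "Fa0/1": 19}
misconfigured = {"Gi0/1": 40, "Fa0/1": 19}   # cost on the fast link set too high

print(preferred_port(correct))        # the Gigabit port is preferred
print(preferred_port(misconfigured))  # the slower Fast Ethernet port is preferred
```

With the misconfigured cost the traffic ends up on the 100 Mbit/s link although the 1 Gbit/s link is available.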

    The misconfiguration of the STP timers also has a negative impact on the performance of the network and can cause high traffic load because of the indefinite duplication of packets. If the hello timer interval is too short, more hello messages are sent and the traffic load in the network rises. The aging timer determines how long a switch waits without receiving STP configuration messages. The default setting of the aging timer is 20 seconds. If the aging timer is too short, the switch tries to reconfigure the STP because no control packet arrived within the time interval, and blocks the forwarding of all received packets. Hence, the performance of the network is reduced, because the switches stop forwarding packets and start the reconfiguration of the STP. If the aging timer is too long, the STP may detect the outage of a switch too late, because the switch still waits for hello packets of the failed device. Traffic is also lost if the topology information of the STP is not up to date, because the path on which the packets are transmitted no longer exists.
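A simple plausibility check can catch timer combinations of the kind described above before they are applied. The sketch below uses the timer relationship recommended in IEEE 802.1D, which is not part of this report and is brought in here as an assumption; it treats the aging timer as the 802.1D maximum age. The default values of 2 s (hello) and 15 s (forward delay) are likewise the usual 802.1D defaults, not values from the text.

```python
def check_stp_timers(hello_time: float, max_age: float, forward_delay: float) -> list[str]:
    """Return a list of problems; an empty list means the timers are consistent.

    Assumed rule (IEEE 802.1D):
        2 * (hello_time + 1) <= max_age <= 2 * (forward_delay - 1)
    """
    problems = []
    if max_age < 2 * (hello_time + 1):
        problems.append("aging timer too short for the hello interval")
    if max_age > 2 * (forward_delay - 1):
        problems.append("aging timer too long for the forward delay")
    return problems


print(check_stp_timers(hello_time=2, max_age=20, forward_delay=15))  # defaults
print(check_stp_timers(hello_time=2, max_age=4, forward_delay=15))   # aging timer too short
```

The first call reports no problems, while the second one flags the aging timer, which would expire before enough hello messages can arrive.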

    Duplex mode configuration

    Duplex mode configuration is used in Ethernet and Fast Ethernet switching, which is deployed in Local Area Networks and Metro Area Networks to connect customers to backbone networks. The port duplex mode can be set for 10/100 Mbit/s ports. For Gigabit Ethernet the port duplex mode cannot be changed; it is full duplex only.

    Duplex mismatches can result in performance degradation, intermittent connectivity, and loss of communication. They occur if two directly connected devices use an incompatible combination of duplex configurations. Possible duplex mismatches are shown in Table 2-1 [6].

    Device A                   Device B                    Result
    100 Mbit/s, full duplex    Auto                        Duplex mismatch
    Auto                       100 Mbit/s, full duplex     Duplex mismatch
    100 Mbit/s, full duplex    1000 Mbit/s, full duplex    Link established, but speed mismatch
    10 Mbit/s, half duplex     100 Mbit/s, half duplex     Link established, but speed mismatch

    Table 2-1: Duplex mismatches because of wrong duplex configuration

    Duplex mismatches due to auto-negotiation result from hardware incompatibilities or software defects. Hardware incompatibility can result from vendor-specific features that are not described for auto-negotiation. A mismatch can also result if auto-negotiation is disabled on both connected devices and the network interface cards are configured manually. A speed mismatch leads to a lower transmission bit rate, but transmission is still possible, whereas a duplex mismatch leads to a link that is not established, so packets cannot be sent over the affected link.
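The combinations in Table 2-1 can be written down as a small classification function, for example as a pre-check before a configuration is rolled out. This is a sketch under stated assumptions: `PortConfig` and `classify_link` are hypothetical names, and combinations that the table does not list (such as a forced half-duplex port facing an auto port) are reported as not covered rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortConfig:
    auto: bool = False     # True: speed and duplex are auto-negotiated
    speed_mbit: int = 0    # ignored when auto is True
    duplex: str = "full"   # "full" or "half"; ignored when auto is True


def classify_link(a: PortConfig, b: PortConfig) -> str:
    """Classify the link between two directly connected ports."""
    if a.auto and b.auto:
        return "ok"
    if a.auto != b.auto:
        fixed = b if a.auto else a
        # Table 2-1 only lists a forced full-duplex port facing an auto port.
        return "duplex mismatch" if fixed.duplex == "full" else "not covered by Table 2-1"
    if a.duplex != b.duplex:
        return "duplex mismatch"
    if a.speed_mbit != b.speed_mbit:
        return "speed mismatch"
    return "ok"


print(classify_link(PortConfig(speed_mbit=100, duplex="full"), PortConfig(auto=True)))
print(classify_link(PortConfig(speed_mbit=100), PortConfig(speed_mbit=1000)))
```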

    Protocol filtering

    Layer 3 protocol filtering can be activated on Ethernet switches to filter Layer 3 packets of protocols like IP, IPX, and AppleTalk. If the packet filtering option is activated on the wrong switch, the switch does not forward Layer 3 packets anymore. This results in packet loss if the route through this switch is the only available connection to another device, or in performance degradation if there is another link to the destination, but with lower bandwidth.

    Ethernet, Fast Ethernet, and Gigabit Ethernet

    To use Ethernet, Fast Ethernet, or Gigabit Ethernet the parameters port name, port speed, and port duplex mode must be configured. A misconfiguration of these parameters can prevent a connection between two switches from being established. As shown in Table 2-1, the duplex configuration on both switches connected through a link must be compatible. If the duplex mode is incompatible, the speed on this link may degrade or the connection is not established. Gigabit Ethernet is full duplex only and the duplex mode cannot be changed.

    On Ethernet interfaces the size of the maximum transmission unit (MTU) of packets can be configured. The default MTU size for all interfaces is 1548 bytes, and jumbo frames have a size of 9216 bytes. If the MTU size differs on two connected switches, for example one switch has an MTU size of 2000 bytes and the other switch has a smaller one, packets that exceed the smaller MTU cannot be processed by the switch with the smaller configured MTU size. In this case the packets are dropped at the interface. An MTU mismatch can occur, for example, if a jumbo-capable Gigabit Ethernet interface is connected to a non-jumbo interface like Fast Ethernet. It can also occur if a jumbo-capable Ethernet interface is connected to a switch that does not support jumbo frames.
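The asymmetry of an MTU mismatch can be shown in a few lines. The sketch assumes the simple rule from the text, namely that a frame is dropped when it is larger than the MTU configured on the receiving interface; the function name is illustrative.

```python
DEFAULT_MTU = 1548   # default MTU in bytes, as stated in the text
JUMBO_MTU = 9216     # jumbo frame size in bytes, as stated in the text


def frame_is_dropped(frame_bytes: int, receiver_mtu: int) -> bool:
    """A frame larger than the receiver's MTU is dropped at the interface."""
    return frame_bytes > receiver_mtu


# Switch A is jumbo capable, switch B still uses the default MTU.
print(frame_is_dropped(9000, receiver_mtu=DEFAULT_MTU))  # A -> B: large frame is dropped
print(frame_is_dropped(1500, receiver_mtu=JUMBO_MTU))    # B -> A: frame passes
```

Large frames therefore fail in one direction only, which makes this kind of misconfiguration hard to spot with small test packets.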

    Virtual LAN (VLAN) configuration

    To configure a VLAN on an Ethernet switch the parameters VLAN ID, VLAN name, VLAN type, and MTU size must be configured. The possible configuration errors and impacts are similar to the misconfiguration of VLANs on IP routers as described in the Technical Report Part I [1]. A wrong VLAN ID can connect two VLANs that should not be connected, or can block authorized customers that want to connect to a certain VLAN. A mismatch of the MTU size can result in dropped packets on the interface with the smaller MTU size, as described above. The transmission of such packets on this link is then possible in one direction only. If a protocol like TCP, which waits for acknowledgments (ACKs) for sent packets, is running over this layer 2 interface, TCP may try to resend packets because no ACKs are received from the destination. These additional packets cause additional traffic in the network and can be responsible for congestion on links. Duplicate VLAN names are a further possible configuration failure. This misconfiguration can also lead to security problems, because two different VLANs are able to communicate.

    EtherChannel configuration

    EtherChannel bundles different Ethernet links into a single logical link that provides bandwidth up to 1600 Mbit/s or 16 Gbit/s [7]. To configure EtherChannel on a layer 2 interface there exist five different modes for the channel-group interface configuration command [8]. Depending on these modes, Port Aggregation Protocol (PAgP) packets or Link Aggregation Control Protocol (LACP) packets are exchanged between two connected interfaces. With PAgP, a Cisco proprietary protocol, and LACP it is possible to create EtherChannels automatically by exchanging packets between Ethernet interfaces. The five possible modes are active, auto, desirable, on, and passive. Note that an interface in the auto mode cannot form an EtherChannel with another interface that is also in the auto mode, because neither of the two interfaces starts the PAgP negotiation. The same is true if two interfaces are in the passive mode: then neither of them starts the LACP negotiation. In the on mode the EtherChannel uses neither PAgP nor LACP, and a usable EtherChannel only exists when an interface group in the on mode is connected to another interface group in the on mode. All interface ports in the on mode are grouped into the same group with similar characteristics. A misconfiguration of the group can lead to packet loss or spanning tree loops. All interfaces in each EtherChannel must also have the same speed and duplex mode. Otherwise the interfaces cannot communicate with each other.
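The mode rules above can be condensed into a small compatibility check. The sketch assumes the usual pairing of modes and protocols (auto and desirable belong to PAgP, active and passive to LACP) and that a channel forms only when both ends use the same protocol and at least one end initiates the negotiation. The function name `forms_etherchannel` is hypothetical.

```python
PAGP_MODES = {"auto", "desirable"}
LACP_MODES = {"active", "passive"}
INITIATING_MODES = {"desirable", "active"}   # modes that start the negotiation
ALL_MODES = PAGP_MODES | LACP_MODES | {"on"}


def forms_etherchannel(mode_a: str, mode_b: str) -> bool:
    """Return True if the two channel-group modes can form an EtherChannel."""
    for mode in (mode_a, mode_b):
        if mode not in ALL_MODES:
            raise ValueError(f"unknown channel-group mode: {mode}")
    if mode_a == "on" or mode_b == "on":
        # "on" uses neither PAgP nor LACP and only works with "on" at the far end.
        return mode_a == mode_b
    same_protocol = {mode_a, mode_b} <= PAGP_MODES or {mode_a, mode_b} <= LACP_MODES
    initiated = mode_a in INITIATING_MODES or mode_b in INITIATING_MODES
    return same_protocol and initiated


for pair in [("auto", "auto"), ("desirable", "auto"), ("passive", "passive"),
             ("active", "passive"), ("on", "on"), ("on", "active")]:
    print(pair, forms_etherchannel(*pair))
```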

    Quality of Service (QoS)

    Just as for IP routers, QoS parameters can also be configured on switches. Incoming packets are examined by their header fields and either forwarded or dropped depending on the match conditions. There are different classification methods, such as Class of Service (CoS) or Differentiated Services Code Point (DSCP). If the wrong classification rules are configured, packets that should be dropped are forwarded or vice versa. Hence, a wrong classification can lead to congestion on a link if more packets are forwarded through this link than allowed, or to packet loss if packets are dropped at the interface. The dropped packets can be responsible for SLA violations if the QoS agreed by contract with the customer is not achieved.

    A second option to guarantee QoS is the policing and marking of packets. Policers can only be configured on ingress interfaces. If policers are configured incorrectly, the wrong packets are dropped or the bandwidth of a link is wrongly scaled down according to the policy. This can lead to performance degradation in the network and also for certain applications like IPTV or Video on Demand, which require a certain bandwidth for transmission.
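How a wrongly configured policer scales down bandwidth can be sketched with a single-rate token bucket. The report does not name the policing algorithm, so the token bucket is an assumption chosen for illustration, and the class name and parameters are hypothetical.

```python
class TokenBucketPolicer:
    """Single-rate policer: packets that exceed the token budget are dropped."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)   # the bucket starts full
        self.last_time = 0.0

    def allow(self, now: float, packet_bytes: int) -> bool:
        # Refill the bucket according to the elapsed time, capped at the burst size.
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.burst_bytes, self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


# Rate entered as 8 kbit/s instead of a much higher intended value (hypothetical typo).
policer = TokenBucketPolicer(rate_bps=8_000, burst_bytes=1500)
first = policer.allow(now=0.0, packet_bytes=1500)    # fits into the initial burst
second = policer.allow(now=0.5, packet_bytes=1000)   # only 500 bytes refilled: dropped
third = policer.allow(now=1.5, packet_bytes=1000)    # bucket refilled: forwarded
print(first, second, third)
```

With the rate set too low, most packets of a video stream would be dropped in the same way as the second packet here.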

    802.1x port based authentication

    The configuration of an authentication protocol like 802.1x prevents unauthorized users from connecting to a network. First, the interface that should use authentication must be configured. If the authentication protocol is configured on the wrong interface or the authentication protocol itself is misconfigured, users cannot connect to the switch anymore. If authentication is not configured on an interface, all users are able to connect to the network and the protection of the network is no longer ensured. Both configuration failures can lead to security holes, because all users may be able to connect to the network through an unprotected interface.

    The configuration of the RADIUS server parameters is required to enable authentication on the interface. These parameters are the IP address (data type: integer) or the host name (data type: string) of the RADIUS server, the UDP port (data type: integer) for the authentication request, and the key (data type: string). If one of these parameters is misconfigured, for example the encryption key does not match the encryption key on the RADIUS server, the authentication may not work and all users have access to the network. According to [8] the default values of the Switch-to-Client Retransmission Time and the Switch-to-Client Frame Retransmission Number should be changed to prevent problems with other clients and the authentication server if one client cannot authenticate correctly, for example because of a wrong password.
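Some of these parameter errors can be caught by a syntactic check before the configuration is applied. The sketch below is an assumption-laden illustration: the function name is hypothetical, the host name check is intentionally minimal, and a key that does not match the one on the RADIUS server cannot be detected locally at all. Port 1812 in the example is the standard RADIUS authentication port and is not taken from the text.

```python
import ipaddress


def validate_radius_params(server: str, auth_port: int, key: str) -> list[str]:
    """Return a list of syntactic problems; an empty list means none were found."""
    errors = []
    try:
        ipaddress.ip_address(server)
    except ValueError:
        if server.replace(".", "").isdigit():
            # Looks like a dotted address but is not a valid one.
            errors.append("server: invalid IP address")
        elif not server or any(c.isspace() for c in server):
            errors.append("server: invalid host name")
    if not 1 <= auth_port <= 65535:
        errors.append("auth_port: must be between 1 and 65535")
    if not key:
        errors.append("key: must not be empty")
    return errors


print(validate_radius_params("10.0.0.5", 1812, "s3cret"))
print(validate_radius_params("10.0.0.300", 70000, ""))
```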

    The single-host mode allows the connection of only one authorized user per port. If one user is authorized on a port, the packets of all other users are blocked on this port. To configure this mode, the interface-id (data type: integer) on which the mode should run must be specified. If the wrong interface-id is configured, the mode is enabled on the wrong interface and users that want to connect to the network are blocked. In addition, the authentication mode is then not enabled on the port that should run in single-host mode, and security is not ensured.

    Network security with access control lists (ACLs)

    The configuration of ACLs allows the filtering of packets on an interface. There are different ACLs, like IP ACLs to filter IP, TCP, and UDP traffic, Ethernet or MAC ACLs to filter layer 2 packets, and three further kinds of ACLs [8] to filter based on protocol-specific information. A misconfiguration of the ACL in use, for example a wrong IP address or a wrong MAC address, drops packets of customers that are allowed to connect to the network or forwards packets from users that should be dropped. As for the authentication method, this misconfiguration is a security hole in the network. Misconfigured ACL filter rules can also be responsible for performance degradation on specific links or parts of the network, because more traffic than allowed is routed through a certain link or part of the network.
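The double effect of a wrong address in an ACL, blocking the right customer and admitting the wrong one, can be reproduced with a minimal first-match evaluation. The sketch assumes first-match semantics with an implicit deny at the end of the list, as is common on many switches; the function name and the documentation prefixes 192.0.2.0/24 and 192.0.3.0/24 are illustrative.

```python
import ipaddress


def acl_permits(acl: list[tuple[str, str]], source: str) -> bool:
    """Evaluate (action, network) entries in order; the first match decides."""
    address = ipaddress.ip_address(source)
    for action, network in acl:
        if address in ipaddress.ip_network(network):
            return action == "permit"
    return False   # implicit deny when no entry matches (assumption)


intended_acl = [("permit", "192.0.2.0/24")]
mistyped_acl = [("permit", "192.0.3.0/24")]   # one wrong digit in the network

print(acl_permits(intended_acl, "192.0.2.10"))  # customer is admitted
print(acl_permits(mistyped_acl, "192.0.2.10"))  # customer is now blocked
print(acl_permits(mistyped_acl, "192.0.3.10"))  # foreign host is admitted instead
```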

    It is possible to configure a time range for ACLs to determine at which times packets from certain users should be filtered. If the wrong time range is configured, users cannot connect to the network although it should be possible. Alternatively, due to the wrong time range of the ACL, more users than allowed can connect to the network and the performance of the network drops. Both cases can again lead to SLA violations if the guaranteed connectivity or bandwidth is not achieved.
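A typical time-range error is swapping the start and end time. The following sketch shows how such a swap inverts the active period; it assumes a daily range that is allowed to wrap around midnight, and the function name is hypothetical.

```python
from datetime import time


def in_time_range(now: time, start: time, end: time) -> bool:
    """Return True if now lies in [start, end); ranges may wrap around midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


office_start, office_end = time(8, 0), time(18, 0)

print(in_time_range(time(9, 0), office_start, office_end))   # intended: active at 09:00
print(in_time_range(time(9, 0), office_end, office_start))   # swapped: inactive at 09:00
print(in_time_range(time(23, 0), office_end, office_start))  # swapped: active at 23:00
```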

    Ethernet Operations, Administration, and Maintenance (OAM)

    Ethernet OAM is a protocol for monitoring and troubleshooting Ethernet Wide Area Networks (WANs). The OAM features, which are defined in IEEE 802.3ah, are Discovery, Link Monitoring, Remote Fault Detection, and Remote Loopback. Ethernet OAM is disabled by default on an interface and must be enabled with the following configuration tasks. To enable it on an interface, the interface ID and the max rate, min rate, and timeout parameters must be configured. Link Monitoring is enabled by default when Ethernet OAM is enabled.

    A wrong timeout setting can lead to a reset of the state machine of a device, because a device declares its OAM peer down if it does not receive an OAM message within the timeout period. If the wrong interface ID is configured, the Ethernet OAM protocol is enabled on the wrong interface, and it is not activated on the interface that should be monitored.

    Configuring Ethernet Connectivity Fault Management (CFM) in a Service Provider Network

    To activate Ethernet CFM it must be enabled, and the domain level, archive hold time, and continuity check message (CCM) parameters must be configured. For Ethernet CFM two different kinds of maintenance points exist. Maintenance Endpoints (MEPs) are at the edge of a maintenance domain and transmit CCM, traceroute, and loopback messages. Maintenance Intermediate Points (MIPs) are configured within a domain; they stop CCM messages from lower maintenance levels and forward CCM messages from higher maintenance levels. Maintenance levels are useful to determine the relationship between different maintenance domains. The larger the domain, the higher its maintenance level.

    The misconfiguration of a maintenance domain can lead to an intersection of different domains, which is not allowed, because each domain should be managed by only one entity. Also, a device that belongs to the wrong domain cannot be managed by the entity that should be responsible for the device. A misconfiguration of the MEPs and MIPs can lead to dropped control messages, because a MIP does not forward control messages from lower maintenance levels. Correct monitoring, fault verification, and fault isolation may then no longer be possible.
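The level rule for MIPs can be expressed as a small decision function, which also shows how a MIP configured at too high a level silently drops messages. The sketch covers only the two cases stated in the text; the handling of messages at the MIP's own level is not described here and is therefore returned as a separate case. The level range 0 to 7 follows IEEE 802.1ag and is an addition to the text, and the function name is hypothetical.

```python
def mip_action(mip_level: int, message_level: int) -> str:
    """Decide what a MIP does with a CCM message of a given maintenance level."""
    for level in (mip_level, message_level):
        if not 0 <= level <= 7:
            raise ValueError("maintenance levels range from 0 to 7")
    if message_level < mip_level:
        return "drop"
    if message_level > mip_level:
        return "forward"
    return "same level"   # handling at the MIP's own level is not covered by the text


print(mip_action(mip_level=3, message_level=5))  # correct level: message passes
print(mip_action(mip_level=6, message_level=5))  # MIP level set too high: message is dropped
```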

    IEEE 802.3ad Link Bundling

    To configure IEEE 802.3ad Link Bundling on an interface, the Link Aggregation Control Protocol (LACP) must be enabled first and the parameters port channel, channel group, and system priority must be set. Link bundling allows multiple Ethernet links to be aggregated into a single logical channel. LACP supports the automatic creation of EtherChannels by exchanging LACP packets between LAN ports. After LACP identifies correctly matched Ethernet links, it facilitates grouping the links into an EtherChannel. To configure a port channel, the port channel number must be set and an IP address and subnet mask must be assigned to the EtherChannel. To associate a channel group with a port channel, the port channel must be created and the interface type and number must be configured. Additionally, the channel group mode must be set. The configuration of the channel group mode includes the interface as part of the port channel bundle.

    A configuration failure may prevent an interface from being included in an Ethernet bundle. This occurs if the parameters described above are configured wrongly, for example if the port channel and the channel group are configured on the wrong interface. Also, the port channel parameter must be configured before the channel group parameter to enable link bundling correctly. Otherwise the link will not be aggregated into an EtherChannel.

    3. Configuration failures in the WDM layer

    This chapter considers the configuration of WDM components. The configuration tasks of WDM equipment do not include the configuration of protocols as for the Ethernet layer and the IP layer. Instead, parameters like wavelength and power threshold can be set to influence the behavior of the components.

    Tunable Laser

    The transmitting power of a laser must be configured correctly to ensure a certain signal-to-noise ratio (SNR). A lower laser power reduces the transmission range of the signal and the SNR, and increases the bit error rate (BER) because of the degradation of the signal along the path. If the power of the laser reaches an upper threshold, the laser is turned off to prevent it from causing damage to the network [9]. The wavelength transmitted over a link must be configured at the laser. Today it is possible to send 80 wavelengths over one fiber link. If the wrong wavelength is transmitted on a link, for example the same wavelength is sent twice over a link, these wavelengths interfere with each other and the information cannot be received correctly at the receiver.
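A wavelength that is configured twice on the same fiber can be detected from the planned channel assignment before the lasers are tuned. The sketch assumes that each laser on a link is described by a channel number; the function name is hypothetical.

```python
from collections import Counter


def find_wavelength_collisions(channels_on_link: list[int]) -> list[int]:
    """Return the channel numbers that are assigned more than once on one link."""
    counts = Counter(channels_on_link)
    return sorted(channel for channel, n in counts.items() if n > 1)


# Hypothetical channel plan for one fiber; channel 5 was entered twice.
print(find_wavelength_collisions([1, 5, 17, 5, 42]))
```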

    The frequency of a laser can be tuned by modulating either the laser current or the operating temperature. If a wrong temperature is chosen for a laser, the laser emits the wrong light frequency into the fiber. Due to the wrong frequency, a receiver or add/drop multiplexer, for example, drops the wavelength according to its configuration, and the information transmitted on this wavelength is lost. A laser must also be operated in a certain temperature range to guarantee optimal functionality and lifetime. Operating the laser at a temperature significantly higher than room temperature leads to faster aging of the laser and shortens its lifetime. Faster aging means that the laser must be replaced earlier, which leads to higher operating costs of the network. If the laser ages faster, the signal power of the laser and also the SNR at the receiver decrease faster over time. A lower SNR means that the transmission range of the signal decreases and the signal may not be received at the next receiver.

    A laser can also be modulated by direct and external modulation [10]. An example of an external modulator is the Mach-Zehnder interferometer (MZI). To modulate the laser with the MZI, a certain voltage is applied to the MZI so that the two arms of the MZI interfere constructively or destructively. In the first case there is output power at the output of the MZI, and in the second case there is none. If the drive voltage of the MZI is misconfigured, the signal is modulated wrongly. For example, there is output power at the output of the MZI although there should be none, because the pulses of the two arms of the MZI interfere constructively instead of destructively. Hence, the signal cannot be demodulated correctly at the receiver and the sent information is lost.

    Optical Amplifier

    The erbium-doped fiber amplifier (EDFA) amplifies the optical signal directly without converting it into an electrical signal. To amplify the signal, an EDFA uses a pump laser to excite ions to a higher energy level, from where they can decay back to the lower energy level via stimulated emission of a photon. The pump power required for a constant output power depends on the signal wavelength. If the pump power of the amplifier is configured wrongly, the signal is not transmitted correctly over the whole path. If the pump power is too low, the optical signal is not amplified enough and the SNR at the receiver is reduced. Hence, the optical signal may not be detected correctly at the next amplifier or at the destination, because the SNR is too low. Conversely, if the pump power of the amplifier is too high for the wavelength in use, the spontaneous emission increases, because more ions are pumped to a higher energy level and are not available for the stimulated emission. As the spontaneous emission increases, the amplification of the signal becomes lower. The temperature of the amplifier also has an impact on its gain: with increasing temperature the gain of the EDFA decreases. An optical amplifier that is not working in the optimal temperature range has a lower amplification gain; therefore the SNR is lower and the number of transmission failures increases [11].

    The misconfigurations described above can be responsible for the loss of information on a link, because the signal is not amplified correctly. A lower SNR at the output of the EDFA reduces the transmission range of the signal so that the signal cannot be detected on the receiver side. Because EDFAs amplify multiple wavelengths on one fiber at the same time, all information on one link can be lost. Hence, a higher-layer protocol like OSPF recalculates the routes through the network, because the link with the misconfigured EDFA seems to be down. This rerouting can lead to congestion on other links, because they have to carry the additional traffic from the misconfigured link. The misconfiguration of an EDFA can thus affect the behavior of a higher-layer protocol, and this in turn can affect the SLAs between operators and their customers. SLA violations occur if traffic is rerouted in the network and generates congestion on other links, which degrades the QoS parameters of those links.

    Dense Wavelength Division Multiplexing (DWDM) Controller

    DWDM is an optical technology that multiplexes different wavelengths together to increase the bandwidth on fibers. The configuration of a DWDM controller includes the setting of the transponder receive power threshold, the wavelength channel number, and the transmit power. As described for the laser, the transmitting power influences the SNR of the signal. It is therefore important to configure the right transmitting power to ensure an accurate transmission of the signal.

    The transponder receive power threshold values can range between -200 and 0, which corresponds to a loss of signal (LOS) range of -20 dBm to 0 dBm [12]. The default power level is -18 dBm according to [12]. If a received signal is below or equal to this threshold, the LOS alarm is raised. A misconfiguration of this threshold, for example a value that is too small, prevents the alarm from being raised, so that the too weak signal is transmitted on the next link. The degradation of the signal on the next path can result in an SNR that is too low to detect the signal correctly at the next receiver.
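The threshold range implies that the configured value is given in tenths of a dBm, which makes it easy to be off by a factor of ten or to pick a value that never triggers the alarm. The sketch below converts the configured value and applies the alarm rule from the text; the function names are hypothetical and the unit interpretation is inferred from the stated range.

```python
def threshold_to_dbm(value: int) -> float:
    """Convert a configured threshold (-200..0, assumed tenths of a dBm) to dBm."""
    if not -200 <= value <= 0:
        raise ValueError("threshold must be between -200 and 0")
    return value / 10


def los_alarm(received_dbm: float, threshold_value: int) -> bool:
    """The LOS alarm is raised if the received power is at or below the threshold."""
    return received_dbm <= threshold_to_dbm(threshold_value)


print(threshold_to_dbm(-180))                         # default level in dBm
print(los_alarm(-19.0, threshold_value=-180))         # weak signal, alarm raised
print(los_alarm(-19.0, threshold_value=-200))         # threshold too small, no alarm
```

With the threshold at -200 the same weak signal of -19 dBm passes unnoticed and is forwarded to the next link.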

    Receivers

    Tunable receivers are able to convert wavelengths within a given range [10]. If the receiver is misconfigured, the received wavelengths may be filtered by the receiver. Because of the blocked wavelengths a customer may not be able to connect to the network or to communicate with other customers. For example, an office of a company has no connection to other offices of the same company. The misconfiguration of the receiver can also result in rerouted traffic, because the primary path is blocked and the traffic must be switched to the backup path.

    Reconfigurable Optical Add Drop Multiplexer (ROADM)

    ROADMs allow the wavelengths to be dropped and added to be selected on the fly. This makes the planning of networks more flexible in comparison to fixed dropped and added wavelengths. The misconfiguration of ROADMs can lead to the dropping or adding of the wrong wavelengths. If the wrong wavelength is dropped, the traffic transmitted on this wavelength is lost. If a backup path exists, the traffic will be rerouted over this backup path. In the worst case, if no backup path can be calculated, the customer cannot connect to the network and also cannot be reached by other users. A further impact of a wrongly dropped wavelength is that protection can fail in the case of a link failure. This happens if a primary path fails and the backup path uses the wavelength that is dropped at the misconfigured ROADM. Hence, the whole traffic from the primary path is lost.
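The protection failure described above can be checked offline by comparing the wavelength of a backup path with the drop configuration of the ROADMs along it. The sketch assumes a very simple model in which a path is a list of node names and each ROADM has a set of dropped channel numbers; all names are hypothetical.

```python
def backup_path_usable(path: list[str], wavelength: int,
                       dropped_at: dict[str, set[int]]) -> bool:
    """Return False if an intermediate ROADM on the path drops the wavelength.

    The first and last node are excluded, because the wavelength is
    legitimately added at the source and dropped at the destination.
    """
    return all(wavelength not in dropped_at.get(node, set()) for node in path[1:-1])


backup_path = ["A", "B", "C", "D"]
correct_config = {"B": {7}, "C": set()}
misconfigured = {"B": {7}, "C": {12}}   # ROADM C wrongly drops channel 12

print(backup_path_usable(backup_path, 12, correct_config))  # protection works
print(backup_path_usable(backup_path, 12, misconfigured))   # protection would fail
```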

    Wrongly adding a wavelength at the ROADM can cause performance degradation or the loss of the information on the link on which the wavelength is added. If the added wavelength is already in use on the link, the two signals interfere and the receiver cannot receive the signal correctly.
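    The two ROADM misconfigurations described above, a wrongly dropped and a wrongly added wavelength, can be expressed as simple set comparisons. The following Python sketch is an illustration only: wavelengths are modeled as hypothetical channel numbers, and the function names are not taken from any vendor software.

```python
from __future__ import annotations


def colliding_adds(on_link: set[int], to_add: set[int]) -> set[int]:
    """Channels that would be added although the link already carries them."""
    return on_link & to_add


def wrongly_dropped(configured_drops: set[int], planned_drops: set[int]) -> set[int]:
    """Channels the ROADM drops although the plan expects them to pass through."""
    return configured_drops - planned_drops


# Hypothetical channel numbers already transmitted on the outgoing link.
on_link = {21, 22, 23}

# Channel 23 is already in use, so adding it again would cause interference.
print(sorted(colliding_adds(on_link, {23, 30})))
# Channel 22 is dropped by mistake, so its traffic is lost at this node.
print(sorted(wrongly_dropped({21, 22}, {21})))
```

    A check of this kind before a configuration is committed would catch both cases, provided the planned channel assignment is available to compare against.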

    Optical Cross Connect (OXC)

    OXCs are required to handle more complex network topologies and large numbers of wavelengths [10]. OXCs can also set up or take down lightpaths as needed. A wrongly configured OXC therefore has the same impact on the network as described for ROADMs.

    Misconfiguration of patch panels

    During maintenance of routers or switches, for example when an interface card is changed, the technicians have to disconnect the fiber connections at the patch panel. If the fibers are reconnected wrongly at the patch panel after the change, different failures can occur in the network: there could be a loop in the network, or some of the customers could be disconnected from the network.

    4. Network outages due to software failures

    If new, faulty software is installed on a device such as an IP router or Ethernet switch, parts of the device or the whole device can fail. If the software failure affects only a specific protocol, then only the links using this protocol are affected and the other links of the router are not. If only one interface of the device is unavailable, then only the traffic routed through that link is affected, but all customers connected through this failed link have no connection to the network.

    If the router or switch starts to crash after a configuration change, then the problem is probably software-related. Because the router or switch has crashed, all interfaces of the device are down and it is no longer available in the network. All paths that were routed through the failed device need to be rerouted. This can lead to performance degradation, because other links with lower bandwidth are used as backup paths, or to congestion on other links, because all traffic from the failed router is routed in addition to the existing traffic on a certain link. Furthermore, the customers connected directly to the device have no connection to the network.
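    The congestion argument above can be made concrete with a small calculation. The Python sketch below assumes a simple model in which a backup link is congested as soon as the existing plus the rerouted traffic exceeds its capacity; the function names and traffic figures are hypothetical.

```python
def utilization_after_reroute(capacity_mbps: float,
                              existing_mbps: float,
                              rerouted_mbps: float) -> float:
    """Load on a backup link relative to its capacity once rerouted traffic is added."""
    return (existing_mbps + rerouted_mbps) / capacity_mbps


def is_congested(utilization: float) -> bool:
    """A link that is offered more traffic than it can carry is treated as congested."""
    return utilization > 1.0


# Hypothetical backup link: 1000 Mbit/s capacity, 600 Mbit/s of existing traffic,
# plus 700 Mbit/s that was previously routed through the failed router.
u = utilization_after_reroute(capacity_mbps=1000, existing_mbps=600, rerouted_mbps=700)
print(f"{u:.2f}", is_congested(u))
```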

    Because of a software failure, a device may periodically reboot parts of its software or, in the worst case, the operating software. A device that reboots itself periodically has no connection to the network, and again the directly connected customers are not able to communicate over the network. In addition, the rebooting of a device can lead to additional control traffic in the network. Because protocols like Intermediate System to Intermediate System (IS-IS) or Open Shortest Path First (OSPF) exchange information with other routers, the device sends hello messages or down messages to the other devices to inform them about its current status. If this happens within a short time period, there will be much additional traffic in the network caused by control messages sent from the routers, and because the rebooting router is not available in the network, protocols like OSPF and BGP reroute the traffic over backup paths. This rerouting can cause performance degradation on other links or parts of the network, because the additional traffic must be processed. Routers connected directly to the router with the faulty software need more processing time to handle the control messages that are sent because of the rebooting of the router.

    If only one part of the software is rebooted periodically, the effect on the network is smaller because fewer customers are affected by this outage. However, if a part of a company is not reachable, Service Level Agreements (SLAs) may be violated and the Internet Service Provider (ISP) has to pay penalties.

    If a device fails, not only the directly connected links and devices can be affected, but also links and devices in other parts of the backbone network. Protocols like the Border Gateway Protocol (BGP) hold information about possible destinations in the whole network in their routing tables, and if a device that is part of a certain route fails, a new route must be calculated.

    Some possible causes of system crashes are described in [13]. Address errors, arithmetic exceptions, cache error exceptions, and error interrupts are failures that can lead to software crashes. It is also described that if the free memory of a router becomes too low, the router reboots itself and reports this as a software-forced crash. While the router is rebooting, all connections to other devices are down and no communication is possible. Paths through this failed router must therefore be recalculated, which can lead to service outages of a few minutes until the new path is calculated. The performance of a service can also be degraded, because the backup path does not have the same available bandwidth as the primary path.

    A mismatch of software versions on different devices can also cause the communication between two devices to fail. If a newer software version supports a certain feature that is needed to establish a connection and the older version does not support it, the connection between the two devices can fail. If such a software conflict occurs on a device at the edge of a backbone network, none of the users connected to this device can connect to the network.
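    The version conflict described above amounts to a feature that one of the two devices does not support. A minimal Python sketch of such a compatibility check is given below; the feature names and the function name are hypothetical and only serve as an illustration.

```python
from __future__ import annotations


def missing_features(required: set[str],
                     supported_a: set[str],
                     supported_b: set[str]) -> set[str]:
    """Features needed for the connection that at least one of the two devices lacks."""
    return required - (supported_a & supported_b)


# Hypothetical feature sets of a newer and an older software version.
new_version = {"feature-x", "feature-y", "feature-z"}
old_version = {"feature-x", "feature-y"}

# The connection needs feature-z, which only the newer version supports,
# so the connection between the two devices cannot be established.
print(missing_features({"feature-x", "feature-z"}, new_version, old_version))
```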

    Upgrading a switch or router by downloading the wrong files to the device, or deleting the image file, can also corrupt the software of the device [14]. The device then does not pass the power-on self-test and there is no connectivity. All routes going through this device must be rerouted, with the same impact as described for a system crash above.

    Software bugs on routers can enable denial-of-service attacks that shut down the router. As reported in [15], Juniper and Cisco routers had such a software bug and the faulty software needed to be patched. The router packet filters were not able to filter the denial-of-service attack packets. The Cisco router software Internetwork Operating System (IOS) additionally had a flaw in its BGP implementation, and it was possible to shut down the router through this security hole. In [16] a further bug in Cisco's IOS is reported that can be used to create a buffer overflow and to gain control over the router.

    In [17] buffer leaks are described, which are identified as software bugs. The symptom of such a buffer leak is a full input queue. If the input queue of an interface is full, the interface is called a wedged interface, and a router does not forward traffic that comes from a wedged interface. If the traffic from a certain link is not forwarded, the communication through this link is lost. Buffer leaks are often misinterpreted as bursts of traffic [17].
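    Because a full input queue can be caused either by a buffer leak or by a short burst, one simple way to tell the two apart is to poll the queue several times. The Python sketch below assumes, as a heuristic that is not taken from [17], that a queue which stays full over several consecutive polls points to a wedged interface rather than a burst; the sample values and the function name are hypothetical.

```python
from __future__ import annotations


def looks_wedged(samples: list[tuple[int, int]], min_consecutive: int = 3) -> bool:
    """Return True if the input queue was full in each of the most recent polls.

    samples holds (input_queue_size, input_queue_limit) per poll, oldest first.
    """
    recent = samples[-min_consecutive:]
    if len(recent) < min_consecutive:
        return False  # not enough polls to decide
    return all(size >= limit for size, limit in recent)


# Queue stays full over three polls: suspicious, possibly a buffer leak.
print(looks_wedged([(76, 75), (76, 75), (76, 75)]))
# Queue is full only once and then drains again: more likely a traffic burst.
print(looks_wedged([(10, 75), (76, 75), (12, 75)]))
```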

    5. Rating system for configuration failures

    In this chapter a rating system is developed to give an impression of which configuration failures are more critical in terms of impact, frequency, and SLA violations. As in Technical Report I, two different views are considered: the customer view and the operator view. This differentiation is important, because the operator focuses on avoiding SLA violations, whereas a customer wants a high availability of the network. Hence, certain failures have a different weighting depending on the point of view. Table 5-1 shows the rating of the individual configuration failures.

    Configuration task | Impacts | Frequency | SLA violation | Rating

    Ethernet layer

    MTU size | No processing of packets, packet loss | Initial configuration | Yes | (2+1+1)/5; Customer [6/7]; Provider [5/6]
    Filter rules | Filtering or throughput of wrong packets, no connection to other devices, disconnected customer, security hazard | All protocols, every time a new service is installed | Yes | (2+2+1)/5; Customer [7/7]; Provider [6/6]
    Authentication | Disconnected customer, security hazard | All protocols, per interface, depends also on the installation of new services | Yes | (2+2+1)/5; Customer [7/7]; Provider [6/6]
    QoS | Dropped packets, no connection, security hazard | Per interface, every time new service is installed, adaption of the network | Yes | (2+2+1)/5; Customer [7/7]; Provider [6/6]
    STP | Loop, performance degradation, higher path costs, congestion | Initial configuration | Yes | (2+1+1)/5; Customer [6/7]; Provider [5/6]
    Old configuration | No connection, higher delay | Every protocol, modification of the protocol | Yes | (2+2+1)/5; Customer [7/7]; Provider [6/6]

    WDM layer

    Power | Wavelength shift, less signal amplification, faster aging of component | Initial configuration | No | (1+1+0)/5; Customer [3/7]; Provider [2/6]
    Temperature | Wavelength shift, less signal amplification, faster aging of component | Initial configuration | No | (1+1+0)/5; Customer [3/7]; Provider [2/6]
    Patch panel | Disconnected customer, loop, security hazard | Adding new paths, reconfiguration of services | Yes | (2+1+1)/5; Customer [6/7]; Provider [5/6]
    Wavelength assignment | Wavelength filtered at receiver, interference of the same wavelength | Adding new paths, new services | Yes | (2+1+1)/5; Customer [6/7]; Provider [5/6]

    Software

    Software version | Security hazard, new functions not supported | Installing new software upgrade | No | (2+2+0)/5; Customer [6/7]; Provider [4/6]
    Buffer leak | Higher traffic, dropped packets | Installing new software upgrade | Yes | (2+1+1)/5; Customer [6/7]; Provider [5/6]
    Software failure | Rebooting of device, rebooting parts of the device, shutdown of device, higher traffic load, security hazard | Installing new software upgrade | Yes | (2+2+1)/5; Customer [7/7]; Provider [6/6]

    Table 5-1: Rating of the different failures

    For the three influence factors impact, frequency, and SLA violations, the same weighting is used as for the rating system of the IP layer. The impact of a failure is set to 1 if it leads to a higher packet delay or to QoS degradation on a link or path. The loss of a connection between nodes is given a weight of 2, because it could lead to SLA violations. The frequency of configuration errors is weighted similarly. If a configuration task is done once when the network is initially set up, the weight of a failure is 1. Otherwise, for configuration tasks that are performed periodically, the value is set to 2. The value for SLA violations is set to 0 if the configuration error causes no SLA violations and to 1 if it does. To take the two different views into account, the impact is weighted with a factor of 2 to calculate the rating for the customer view, and the value for SLA violations is weighted with a factor of 2 to calculate the rating for the provider view. The general rating of a configuration failure is calculated using equation 1. The value for the customer view is calculated with equation 2 and the value for the provider view with equation 3.

    Rating = (Impact + Frequency + SLA violation) / weighting,  Rating ∈ [0;1]    (1)

    Rating = (2*Impact + Frequency + SLA violation) / weighting,  Rating ∈ [0;1]    (2)

    Rating = (Impact + Frequency + 2*SLA violation) / weighting,  Rating ∈ [0;1]    (3)
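    The three equations can be checked against Table 5-1 with a short Python sketch. It assumes that the weighting in each equation is the maximum attainable score of that view (5, 7, and 6), which is inferred from the denominators shown in Table 5-1 rather than stated explicitly; the function names are hypothetical.

```python
# Maximum value of each factor, taken from the weighting rules in the text.
MAX_IMPACT, MAX_FREQUENCY, MAX_SLA = 2, 2, 1


def rating(impact, frequency, sla, impact_factor=1, sla_factor=1):
    """Return (score, weighting) so that score / weighting lies in [0, 1]."""
    score = impact_factor * impact + frequency + sla_factor * sla
    # Assumption: the weighting is the maximum attainable score of the view.
    weighting = impact_factor * MAX_IMPACT + MAX_FREQUENCY + sla_factor * MAX_SLA
    return score, weighting


def all_views(impact, frequency, sla):
    """Evaluate equations 1 (general), 2 (customer) and 3 (provider)."""
    return {
        "general": rating(impact, frequency, sla),
        "customer": rating(impact, frequency, sla, impact_factor=2),
        "provider": rating(impact, frequency, sla, sla_factor=2),
    }


print(all_views(2, 1, 1))  # "MTU size" row of Table 5-1
print(all_views(1, 1, 0))  # "Power" row of Table 5-1
```

    For the "MTU size" row this gives 4/5, 6/7, and 5/6, and for the "Power" row 2/5, 3/7, and 2/6, which matches the values listed in Table 5-1.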

    At the Ethernet layer, the misconfiguration of filter rules, the authentication method, and the QoS configuration, as well as the use of old configuration files, have the highest rating of the configuration errors. As for the IP layer, these failures can lead to disconnected links and, hence, to disconnected customers. This can result in SLA violations for the provider.

    At the WDM layer, failures during the installation of fiber cables at the patch panel and failures in the wavelength assignment in the network have the highest rating. The wrong installation of fibers at the patch panel can lead to a loop in the network or can be responsible for disconnected links in the network. This kind of failure can also be responsible for security hazards, because customers may gain access to other Virtual Private Networks. A wrong wavelength assignment can lead to the blocking of wavelengths at optical devices or will cause interference on a link if the same wavelength is sent twice on it. Both failure scenarios cause disconnected links between nodes in the network and influence the availability of the network.

    Assigning the wrong power or temperature to a device can be responsible for a higher degradation of signals going through this device, so that the received power at the end of a path is lower than desired. The communication between the devices is still possible as long as the signal power does not fall below a certain threshold. This is the reason why this kind of failure has a lower rating in comparison to the wrong wavelength assignment.

    Finally, the impacts of software failures of nodes were considered. Again, the same rating system as for the other layers was used. For the evaluation of the software, three failure scenarios were considered: wrong software version, software failures, and buffer leaks. Faulty software has the highest rating of the three failure categories. These failures can lead to a multitude of failure scenarios and mostly affect several connections at the same time. For example, faulty software can be responsible for a periodic rebooting of a device. Hence, the node and all links connected to this node are no longer available in the network. The periodic updates of the software, for example to close a security hole or to install a new functionality on the device, increase the probability and the frequency of software failures. A further source of error is the use of different software versions on the different network nodes. The different software versions can lead to disconnected links, because certain functionalities that are used by a new service may not be supported by the older software version.

    Considering the customer and the provider view, it can be seen that most failures have a similar rating, especially the Ethernet and the software failures. As mentioned before, the reason for the similar weighting is the relationship between the impact and possible SLA violations of the considered failures.

    SLA violations occur if a connection is down or the QoS of a service is lower than defined in a contract between a provider and a customer. Most failures in Table 5-1 can lead to an SLA violation, because their impacts are disconnected links or paths. But such effects also result in lower availability, which was the main criterion for the customer view of configuration failures. So there is a correlation between impact and SLA violation.

    6. Conclusion

    In this technical report, possible configuration failures of the Ethernet and WDM layer and faulty software are considered. On the Ethernet layer, similar failures are possible as described for the IP layer in [1]. At the WDM layer, configuration failures are related to wrong power settings, wrong wavelength switching, and wrong operating temperature. More and more reconfigurable devices like ROADMs or OXCs are used in backbone networks, because they give carriers more flexibility when planning their networks. But as described, the more reconfigurable devices are used in the network, the more configuration failures can happen when installing the devices. Hence, it is important to reduce misconfiguration in order to reduce the costs of network outages as shown in [5].

    The rating of the configuration failures gives an impression of the severity of the impact of the failures in the network. The rating includes how many customers could be affected by a failure and how frequently the failure occurs. A rating of 1 means the configuration failure has a larger impact on the network, and a rating of 0 means it has a lower impact. The misconfiguration of the authentication method and of the ACLs has a larger impact on the network, because blocked connections from customers to the network could lead to SLA violations. A wrongly configured filter can also lead to a security hazard and to attacks in the network. The misconfiguration of optical components is also a significant failure in the network. A wrong power or temperature configuration of a laser may lead to a shift in the wavelength and thus to less amplification of the light signal and to a lower SNR.

    To prevent configuration failures, automating the configuration and providing fallback mechanisms for single devices, for example a standard configuration, could be one step to reduce the complexity of configuration tasks. The development of signaling protocols that detect degradation effects of optical elements, like an optical amplifier [18], before the device fails can help to reduce network outages.

    Acronyms

    ACL Access Control List

    Ack Acknowledgment

    BER Bit Error Rate

    BGP Border Gateway Protocol

    CLI Command Line Interface

    DWDM Dense Wavelength Division Multiplexing

    EDFA Erbium Doped Fiber Amplifier

    IOS Internetwork Operating System

    IS-IS Intermediate System to Intermediate System Protocol

    ISP Internet Service Provider

    LACP Link Aggregation Control Protocol

    LOS Loss of Signal

    MTU Maximum Transmission Unit

    MZI Mach-Zehnder Interferometer

    OSPF Open Shortest Path First

    OXC Optical Cross Connect

    PAgP Port Aggregation Protocol

    PLIM Physical Layer Interface Module

    QoS Quality of Service

    ROADM Reconfigurable Optical Add Drop Multiplexer

    SLA Service Level Agreement

    SNR Signal to Noise Ratio

    STP Spanning Tree Protocol

    WDM Wavelength Division Multiplexing

    VLAN Virtual LAN

    References

    [1] Christian Merkle, Analysis of configuration failures in transport networks, Part I: Configuration failures of the IP layer, Technical Report, Nokia Siemens Networks, 2007

    [2] G. Hudson Gilmer, Part 1 in the Reliability Series: Examining the Cost of Poor Quality in IP Networks, White Paper, 2001

    [3] Craig Labovitz, Abha Ahuja, Farnam Jahanian, Experimental Study of Internet Stability and Backbone Failures, Fault-Tolerant Computing, 278-285, 1999

    [4] Craig Labovitz, Abha Ahuja, Farnam Jahanian, Experimental Study of Internet Stability and Wide Area Backbone Failures, University of Michigan CSE-TR-382-98, 1998

    [5] Rob Dearborn, Michael Howard, Susan Klarich, Laura Whitcomb, Richard Webb, Jeff Wilson, The Costs of Enterprise Downtime, North America 2004, Infonetics Research, February 2004

    [6] Cisco, Troubleshooting Cisco Catalyst Switches to NIC Compatibility Issues, http://www.cisco.com/warp/public/473/46.html, 2005

    [7] Cisco Systems, Catalyst 6500 Series Software Configuration Guide, 5.5 http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/catos/5.x/configuration/guide/channel.html, last seen 2007

    [8] Cisco Systems, Catalyst 2950 and Catalyst 2955 Switch Software Configuration Guide, 12.1(14)EA1, http://www.cisco.com/en/US/docs/switches/lan/catalyst2950/software/release/12.1_14_ea1/configuration/guide/Sw8021x.html#wp1025467, last seen 2007

    [9] Carmen Mas, Patrick Thiran, Jean-Yves Le Boudec, Fault Localization at the WDM Layer, Photonic Network Communications, 1999

    [10] Rajiv Ramaswami, Kumar N. Sivarajan, Optical Networks: A Practical Perspective, Second Edition, Morgan Kaufmann Publishers, 2002

    [11] J. Kemtchou, M. Duhamel, P. Lecoy, Gain temperature dependence of erbium-doped silica and fluoride fiber amplifiers in multichannel wavelength-multiplexed transmission systems, Journal of Lightwave Technology, vol. 15, pp. 2083-290, 1997

    [12] Cisco Systems, Dense Wavelength Division Multiplexing Commands on Cisco IOS XR Software, last seen 2007

    [13] Cisco Systems, Less common types of systems crashes http://www.cisco.com/en/US/products/sw/iosswrel/ps1831/products_tech_note09186a008010876d.shtml#ts, last seen 2007

    [14] Cisco Systems, Cisco IOS Desktop Switching Software Configuration Guide, Release 12.0(5)XU, http://www.cisco.com/en/US/docs/switches/lan/catalyst2900xl_3500xl/release12.0_5_xu/scg/kitrbl.html, last seen 2007

    [15] ChannelPartner, Bugs in Router-Software von Cisco und Juniper http://www.channelpartner.de/news/205189/, 2005, last seen 2007

    [16] Computerwoche, Cisco stopft Sicherheitsloch in Router Betriebssystem IOS http://www.computerwoche.de/nachrichten/568215/, 2005, last seen 2007

    [17] Cisco Systems, Troubleshooting Buffer Leaks http://www.cisco.com/en/US/products/hw/iad/ps397/products_tech_note09186a00800a7b85.shtml, last seen 2007

    [18] Lutz Rapp, Quality Surveillance Algorithm for Erbium-Doped Fiber Amplifiers, Workshop on Design of Reliable Communication Networks (DRCN), 2005