

Open Source Development Labs

Carrier Grade Linux Availability

Requirements Definition

Version 3.0

(Working Draft Version – 2 December 2004)

Prepared by the Carrier Grade Linux Specifications Subgroup


Carrier Grade Linux Requirements Definition Version 3.0 Public Draft

Open Source Development Labs, Inc.
12725 SW Millikan Way, Suite 400
Beaverton, OR 97005 USA

Phone: +1-503-626-2455


Copyright (c) 2004 by The Open Source Development Labs, Inc. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is available at http://www.opencontent.org/opl.shtml). Distribution of substantively modified versions of this document is prohibited without the explicit permission of the copyright holder.

Other company, product or service names may be the trademarks of others.

Linux is a Registered Trademark of Linus Torvalds.

Contributors to each section of this Requirements Definition (in alphabetic order) include the following:

Availability

Badovinatz, Peter (IBM)
Chacron, Eric (Alcatel)
Cherry, John (OSDL)
Christopher, Johnson (Sun)
Cress, Andrew (Intel)
Dake, Steven (Monta Vista)
Fleischer, Julie (Intel)
Haddad, Ibrahim (Ericsson)**
Ikebe, Takashi (NTT)*
Ishitsuka, Seiichi (NEC)
Kevin, Fox (Sun)*
Kimura, Masato (NTT Comware)
Kukkonen, Mika (OSDL)
Liu, Bing Wei (Intel)
Manas, Saksena (Timesys)
Nakayama, Mitsuo (NEC)
Sakuma, Junichi (OSDL)

**Indicates specification editors

Comments on the contents of this document should be sent to [email protected]


References

1 Introduction

2 Requirements Definitions

Availability Requirements
AVL.1.0 Robust Mutexes
AVL.2.0 Software ECC Support
AVL.3.0 Forced Device Removal
AVL.3.1 Block Device Removal
AVL.3.2 Forced Unmount
AVL.4.0 Memory Overcommit Actions
AVL.4.1 VM Strict Over-commit
AVL.6.0 Non-intrusive Monitoring of Processes
AVL.6.1 Kernel-level Non-intrusive Application Monitor Without Modifying Application Code
AVL.6.2 Kernel-Level Non-intrusive Application Monitor Using a Defined API
AVL.7.0 Disk Predictive Analysis
AVL.8.0 Redundant Paths to Resources
AVL.8.1 Multi-Path Access to Storage
AVL.10.0 Fast System Startup within kernel space
AVL.10.1 Fast Linux Start bypassing BIOS
AVL.13.0 Boot Image Fallback Mechanism
AVL.14.0 Live patching

3 Availability Roadmap
AVL.3.0 Forced Device Removal
AVL.3.3 Forced Unmount Application Notification
AVL.4.0 Memory Overcommit Actions
AVL.4.2 Replaceable OOM Killer
AVL.4.3 Low-Memory-Condition Monitor
AVL.4.4 Out Of Memory Notification Mechanism
AVL.5.0 Fault Isolation Enabling
AVL.6.0 Non-intrusive Monitoring of Processes
AVL.6.3 Process-level Non-intrusive Application Monitor
AVL.8.0 Redundant Paths to Resources
AVL.8.2 Advanced Multi-Path Access to Storage
AVL.8.3 Redundant Communication Paths
AVL.9.0 NFS Client Protection across Server Failures
AVL.10.0 Fast System Startup within kernel space
AVL.10.2 Fast Linux Start using Known-devices Database
AVL.10.3 Parallel Driver Initialization during Startup
AVL.11.0 Fast System Startup Within User Space
AVL.11.1 Parallel User Initialization during Startup
AVL.12.0 Infinite Loop Detection
AVL.15.0 Fast Application Restart Mechanism


AVL.16.0 Fallback Operation Mechanism
AVL.17.0 Multiple FIB Support
AVL.18.0 iSCSI error handling support
AVL.19.0 Application profiler
AVL.20.0 Kernel Resources Expansion for Threads

Appendices

A.1 General References

A.2 General Systems References


References

Background information useful to readers of this document can be found in the following places:

Open Source Development Labs (OSDL) home page: http://www.osdl.org

The Carrier Grade Linux web page on the OSDL Web site: http://www.osdl.org/projects/cgl

The OSDL “Requirements Definition, Version 1.1”: http://www.osdl.org/docs/cgl_requirements_definition_11.pdf

The OSDL “Requirements Definition, Version 2.0”: http://www.osdl.org/docs/carrier_grade_linux_requirements_definition___version_20.pdf


1 Introduction

Past OSDL Carrier Grade Linux technical documents have contained all requirements in a single document. For OSDL CGL v3.0 draft releases, we are releasing them as more granular sections, roughly split on functional boundaries. This document contains the availability section. The anticipated set of sections to be released is listed below.

1. APIs/Specifications/Standards (aka, “Standards”) – references to useful and necessary existing standards and interface specifications, e.g., POSIX, IETF, etc.

2. Availability – useful and necessary functionality for single node availability and recovery.

3. Clustering – useful and necessary components to build a clustered set of individual systems. Key target is clustering for high availability, although load-balancing and performance are secondary aims. There is recognition that “one size fits all” is not achievable, so not all features will always be used together.

4. Hardware – useful and necessary hardware-specific support, where it affects the expected carrier operating environment.

5. Performance – useful and necessary features that contribute to adequate performance of a system, e.g., real-time capabilities, and also base OS components for supporting performance tools (but not the tools).

6. Security – useful and necessary features for building secure systems. There is recognition that “one size fits all” is not achievable, so not all features will always be used together.

7. Serviceability – useful and necessary features for servicing and maintaining a system, and coverage of tools that support serviceability.

These sections are being developed in parallel and various drafts of each will be released on independent schedules.

2 Requirements Definitions

The availability requirements define capabilities that are related to single system availability. These requirements apply to the carrier grade Linux operating system environment.

This document presents both CGL requirements and CGL roadmap material. Requirements are defined as necessary for a CGL system. Roadmap items are provided to highlight possible future requirements.

Each requirement is described by two header fields and one descriptive text field.

Header fields

ID - Unique identification code associated with a requirement

Name - Short/simple description of the requirement

Text fields


Description - Detailed description of the requirement


Availability Requirements

This section contains requirements that apply to the Linux kernel, core libraries, and tools essential to a carrier-grade system. These Availability requirements are meant to be those related to single system availability, such as support for memory failure detection. Requirements related to clustered availability, such as heartbeat monitoring and failover, will be found in the Clustering requirements section.

Requirements are generally arranged according to priority. However, if a requirement summary is followed by sub-requirements with different priorities, the requirement summary and sub-requirements are placed at the priority level of the highest priority sub-requirement.

ID Name

AVL.1.0 (V2.0: AVL.1.0)

Robust Mutexes

Description: OSDL CGL specifies that carrier grade Linux shall provide an enhancement to the POSIX Thread implementation that provides support for robust mutexes. Robust mutex support shall permit a mutex to synchronize threads, either in the same process or in different processes, even when processes exit or abort unexpectedly.

A robust mutex is initialized with robust mutex attributes. It must be an inter-process shared mutex, allocated in a shared memory segment mapped into the processes that use it. Applications using a robust mutex shall be able to see various return codes that indicate whether the previous holder of the mutex terminated, and also the recovery status of the state of the mutex. The new holder of the robust mutex shall be able to detect a failure, perform cleanup actions, and re-initialize the mutex for continued use.

If a cleanup of the state protected by the mutex can't be completed, the mutex shall be marked “inconsistent” so that any future attempts to lock it will generate a status indicating that it is inconsistent. The following two modes for setting the mutex to an inconsistent state shall be provided:

Automatically mark the mutex “inconsistent” when the owner dies and a subsequent mutex lock is attempted and completed.

Provide an advisory to the next owner that the mutex needs to be explicitly marked inconsistent.

For further details, refer to http://www.humanfactor.com/pthreads/posix-threads.html.

ID Name

AVL.2.0 (V2.0: AVL.2.0)

Software ECC Support

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism for reporting whenever hardware error checking and correcting (ECC) detects and/or recovers from single-bit ECC errors, and a panic trigger mechanism that is activated whenever hardware ECC detects multi-bit ECC errors.


ID Name

AVL.3.0 (V2.0: -, V1.1: -)

Forced Device Removal

Description: OSDL CGL specifies that carrier grade Linux shall provide support for forced unmounting of a file system and block device removal. When a file system is unmounted, processes shall not be able to access or open files on the file system. When a block device is removed, a hot swap signal shall be sent to the storage controller.

ID Name

AVL.3.1 (V2.0: PLT.atca.2)

Block Device Removal

Description: OSDL CGL specifies that Linux shall allow removal of a block device while it is in use without degrading the reliability of the system. The block device shall be removable even if it has been placed in use by an open file command (such as fdisk /dev/sda), it is a member of a RAID-1 volume, a file system is mounted on the device, or a combination of these. If a file is in use and it cannot be serviced by a mirrored disk, the operating system shall return an error to the system calls referencing that file.

ID Name

AVL.3.2 (V2.0: AVL.4.0, V1.1: AVL 4.10)

Forced Unmount

Description: OSDL CGL specifies that carrier grade Linux shall provide support for forced unmounting of a file system. The unmount shall work even if there are open files in the file system. Pending requests shall be ended with the return of an error value when the file system is unmounted.

ID Name

AVL.4.0 (V2.0: AVL.6.0)

Memory Overcommit Actions

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability to configure a global limit on RAM utilization. This limit is a combination of physical memory and swap space. In addition, adequate information and an interface must be provided to allow a middleware component to take action before the system runs out of memory. This requirement is in addition to or a replacement for the kernel out-of-memory killer.


ID Name

AVL.4.1 (V2.0: AVL.6.1)

VM Strict Over-commit

Description: OSDL CGL specifies that carrier grade Linux shall provide the ability to control kernel virtual memory allocation adjustments based on the specific needs of the system. Control of virtual memory shall include but not be limited to the following:

Strict over-commit – The total address space committed for the system is not permitted to exceed swap + a configurable percentage of physical RAM (the default is 50%).

Heuristic over-commit – Obvious over-commits of address space are refused. Allocation is limited to free physical memory + free swap.
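The strict over-commit rule above is simple arithmetic, and a small sketch may make the limit easier to reason about. The function and parameter names below are illustrative assumptions, not part of this specification. On current Linux kernels the corresponding settings are the `vm.overcommit_memory` and `vm.overcommit_ratio` sysctls, and the real accounting includes details (such as huge page reservations) that this sketch ignores.

```python
def strict_commit_limit_kb(ram_kb: int, swap_kb: int, ratio_percent: int = 50) -> int:
    """Ceiling on committed address space: swap + ratio% of physical RAM.

    Hypothetical helper; names and units (kB) are assumptions for illustration.
    """
    if ram_kb < 0 or swap_kb < 0 or ratio_percent < 0:
        raise ValueError("sizes and ratio must be non-negative")
    return swap_kb + (ram_kb * ratio_percent) // 100


def allocation_allowed(committed_kb: int, request_kb: int,
                       ram_kb: int, swap_kb: int, ratio_percent: int = 50) -> bool:
    """Under strict over-commit, refuse any request that would exceed the ceiling."""
    limit = strict_commit_limit_kb(ram_kb, swap_kb, ratio_percent)
    return committed_kb + request_kb <= limit
```

For example, with 8,000,000 kB of RAM, 2,000,000 kB of swap, and the default 50% ratio, the ceiling is 6,000,000 kB; a request that would push the committed total past that value is refused rather than being satisfied optimistically.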

ID Name

AVL.6.0 (V2.0: AVL.8.0)

Non-intrusive Monitoring of Processes

Description: OSDL CGL specifies that carrier grade Linux shall provide a range of capabilities to enable non-intrusive monitoring of processes. To enable monitoring, some configuration actions may have to be taken to specify which processes are to be monitored. Capabilities may be limited in certain cases, as long as the limitations are known. Capabilities to be provided include the following:

Processes must be manageable and controllable even if the actual process code cannot (or will not) be changed to exploit a specified API.

Processes must be manageable and controllable even if they are not a direct child process of the tools and mechanisms provided to enable these capabilities. A carrier system consists of middleware and processes from many sources, which may be difficult to run from a single parent process, as they will usually require different userids, capabilities, permissions, etc.

The latency of event detection while processes are being monitored must be as low as possible, preferably occurring immediately upon complete failure of a process.

The overhead of monitoring the processes should be as low as possible.

inittab does not provide sufficient capabilities to meet this requirement. Therefore, enhancements to inittab must be provided to address the following limitations:

o Monitors only processes inittab starts

o Limited reactions to process death

o No healthcheck capabilities for non-terminating processes

o No controls on respawn loops of processes
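One of the limitations listed above, the lack of controls on respawn loops, can be illustrated with a sliding-window restart limiter. This is only a sketch under assumed names (`RespawnLimiter`, `allow_restart`); the specification does not prescribe any particular policy, and a real monitor would combine this with process-exit detection and health checks.

```python
from collections import deque


class RespawnLimiter:
    """Allow at most max_restarts restarts within any window_seconds interval.

    Hypothetical helper for illustration; timestamps are supplied by the caller
    so the policy can be exercised without a real clock.
    """

    def __init__(self, max_restarts: int, window_seconds: float):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recent permitted restarts

    def allow_restart(self, now: float) -> bool:
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False  # respawn loop suspected: escalate instead of restarting
        self._restarts.append(now)
        return True
```

A monitor would call `allow_restart` each time a supervised process dies. When it returns `False`, the monitor can take a different action (for example, raising an alarm) rather than respawning indefinitely.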


ID Name

AVL.6.1 (V2.0: AVL.8.2)

Kernel-level Non-intrusive Application Monitor Without Modifying Application Code

Description: OSDL CGL specifies that carrier grade Linux shall provide a service to enable non-intrusive monitoring of processes at the kernel level. To enable monitoring, the following capabilities shall be provided:

Communication between the monitoring process and the kernel.

Registering a list of processes.

Ability to define policy based on process events including process/thread creation and exit.

Ability to take action whenever an event occurs.

To enable monitoring, some configuration modifications may need to be made. Capabilities may be limited in certain cases, as long as the limitations are known.

ID Name

AVL.6.2 (V2.0: AVL.8.2)

Kernel-Level Non-intrusive Application Monitor Using a Defined API

Description: OSDL CGL specifies that carrier grade Linux shall provide a service to enable non-intrusive monitoring of processes at the kernel level through a defined API. Any application to be monitored will need to use this API. To enable monitoring, some configuration modifications may need to be made and additional code may need to be added to the processes that will be monitored. Capabilities may be limited in certain cases, as long as the limitations are known.

ID Name

AVL.7.0 (V2.0: AVL.9.0)

Disk Predictive Analysis

Description: OSDL CGL specifies that carrier grade Linux shall provide capabilities to assist in predictive analysis of disks. The aim of this support is to assist in predicting situations likely to lead to failure of disks. This allows preventive action to be taken to avoid the failure and resulting disruption of service.

Note that this could be considered a subset of the requirement SVC. 1.7 4.0 Online Diagnostics, but since isolated mechanisms to support this requirement currently exist, it is listed as a separate requirement.


ID Name

AVL.8.0 (V2.0: -)

Redundant Paths to Resources

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism to enable redundant access paths to system resources.

The software shall handle sending and receiving data via redundant paths without conflicts, and provide high-availability access to resources even if an error occurs in one of the redundant paths.

ID Name

AVL.8.1 (V2.0: AVL.12.0)

Multi-Path Access to Storage

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism to enable multiple access paths from a single cluster node to storage devices. The software shall determine if multiple paths exist to the same port of the I/O device, and, with configurable controls, it will balance I/O requests across multiple host bus adapters. If multiple paths exist to the same device over two separate device ports on the same host bus adapter, those I/Os will not be balanced.

Handling a path failure must be automatic. A mechanism must be provided for the reactivation of failed paths, allowing them to be placed back in service. It must be possible to automatically determine and configure multiple paths. Automatic configuration shall allow automatic multi-path configuration of complete disks and partitions located on those disks.

A multipath device feature that allows multipath detection and mapping early on in the boot process must be provided so that the root file system can exist on a multipathed device.
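The path-handling behavior described in this requirement (balancing across paths, automatic failover, and reactivation of failed paths) can be sketched as a round-robin selector that skips failed paths. The class and method names are hypothetical, each path is assumed to sit on a different host bus adapter, and real multipath implementations also perform path probing and configurable balancing policies that are not modeled here.

```python
class MultipathDevice:
    """Round-robin path selector with failover; illustrative sketch only."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self._paths = list(paths)
        self._failed = set()
        self._next = 0  # index of the next path to try

    def fail_path(self, path):
        """Mark a path as failed so it is skipped automatically."""
        self._failed.add(path)

    def reactivate_path(self, path):
        """Place a previously failed path back in service."""
        self._failed.discard(path)

    def select_path(self):
        """Return the next active path, or raise OSError if none remain."""
        for _ in range(len(self._paths)):
            path = self._paths[self._next]
            self._next = (self._next + 1) % len(self._paths)
            if path not in self._failed:
                return path
        raise OSError("no active path to device")
```

With two paths, I/O alternates between them; after `fail_path("hba0")` every request is routed over the surviving path, and `reactivate_path("hba0")` restores balancing without reconfiguring the device.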


ID Name

AVL.10.0 (V2.0: AVL.fast)

Fast System Startup within kernel space

Description: OSDL CGL specifies that carrier grade Linux shall provide a variety of capabilities to allow a single system to move from power-on to ready in as short a time as possible.

The normal startup sequence includes:

1. Power on and boot (includes BIOS initialization)

2. Load the Linux image

3. Start and initialize Linux

A cold start (BIOS to operating system handoff) comprises steps 1 through 3. A warm start (operating system to operating system handoff) comprises steps 2 and 3.

Fast system startup capabilities include the ability to:

Bypass BIOS initialization by beginning the startup sequence at step 2 (see AVL.10.1).

Bypass initialization of the Linux image in step 3 (see AVL.10.2).

Complete a parallel initialization of device drivers in step 3 (see AVL.10.3).

Startup types can be divided as follows:

Warm restart = a restart in which the memory of the node is kept intact (basically requires kexec/bootimg/...).

Cold restart = a restart in which the memory of the node is not kept intact (a conventional reboot).

ID Name

AVL.10.1 (V2.0: AVL.fast.3)

Fast Linux Restart bypassing BIOS

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism to speed up operating system initialization by bypassing the BIOS when one instance of Linux reboots to another instance of Linux.


ID Name

AVL.13.0 (V2.0: --)

Boot Image Fallback Mechanism

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism that enables a system to fall back to a previous "known good" boot image in the event of a catastrophic boot failure (i.e., failure to boot, panic on boot, or failure to initialize HW/SW). System images are captured from the "known good" system, and the system reboots to the latest good image. This automatic fallback mechanism protects against problems resulting from system changes, such as program updates, installations, kernel changes, and configuration changes.
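The selection step of the fallback mechanism, choosing the latest "known good" image after a failed boot, might look like the following sketch. The `BootImage` record, its fields, and `select_boot_image` are assumptions made for illustration; how images are captured, and how a boot failure is detected, are outside the scope of this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BootImage:
    name: str
    captured_at: int   # hypothetical capture sequence number; larger is newer
    known_good: bool   # set once the image has booted successfully


def select_boot_image(images: Iterable[BootImage],
                      failed: frozenset = frozenset()) -> BootImage:
    """Pick the newest known-good image that has not just failed to boot."""
    candidates = [img for img in images
                  if img.known_good and img.name not in failed]
    if not candidates:
        raise LookupError("no known-good boot image available")
    return max(candidates, key=lambda img: img.captured_at)
```

If the newest image fails catastrophically, the caller adds its name to `failed` and asks again, which yields the next most recent known-good image.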

ID Name

AVL.14.0 (V2.0: --)

Live patching

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism for dynamically replacing the symbols of a running process (without restarting). Dynamic replacement of symbols allows a process to access patched functions or values without restarting, and can improve process availability.


3 Availability Roadmap

This section attempts to capture promising technology directions in availability. The items listed here are not current requirements, but are considered to be desirable future features.

ID Name

AVL.3.0 (V2.0: -, V1.1: -)

Forced Device Removal

See AVL.3.0 in the Availability Requirements section above.


ID Name

AVL.3.3 (V2.0: -, V1.1: -)

Forced Unmount Application Notification

Description: OSDL CGL specifies that carrier grade Linux shall provide a notification mechanism when a forced unmount of a file system occurs. The notification mechanism should send a signal or other message to a process that attempts to access a file on an unmounted volume.

ID Name

AVL.4.0 (V2.0: AVL.6.0)

Memory Overcommit Actions

See AVL.4.0 in the Availability Requirements section above.



ID Name

AVL.4.2 (V2.0: AVL.6.2)

Replaceable OOM Killer

Description: OSDL CGL specifies that carrier grade Linux shall provide mechanisms to allow the replacement of the out-of-memory (OOM) killer algorithm within the kernel. In an environment in which an application is made up of many processes, the act of killing any single process may prevent the application from continuing to provide service while leaving its remaining processes running and preventing proper recovery. Hence it must be possible to provide a replacement algorithm that can take the relationships between processes into account when determining which ones to kill. By default the current algorithm in the kernel is used. The new algorithm can be activated by loading the relevant kernel module.
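The idea of a replaceable selection algorithm that accounts for process relationships can be modeled in user space as a pluggable policy. Everything below is a hypothetical illustration: the `Proc` record, the `app` grouping label, and both policies are assumptions, and `default_policy` is only a stand-in that does not reproduce the kernel's actual heuristic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Proc:
    pid: int
    app: str      # hypothetical label tying a process to its application
    rss_kb: int   # resident memory in kB


def default_policy(procs: List[Proc]) -> List[int]:
    """Stand-in default: pick the single largest process."""
    return [max(procs, key=lambda p: p.rss_kb).pid]


def app_group_policy(procs: List[Proc]) -> List[int]:
    """Replacement: pick every process of the application using the most memory,
    so no application is left half-running."""
    totals = {}
    for p in procs:
        totals[p.app] = totals.get(p.app, 0) + p.rss_kb
    victim_app = max(totals, key=totals.get)
    return sorted(p.pid for p in procs if p.app == victim_app)


class OomHandler:
    """Uses the default policy until a replacement is loaded."""

    def __init__(self):
        self._policy: Callable[[List[Proc]], List[int]] = default_policy

    def load_policy(self, policy: Callable[[List[Proc]], List[int]]) -> None:
        self._policy = policy

    def select_victims(self, procs: List[Proc]) -> List[int]:
        return self._policy(procs)
```

Loading `app_group_policy` plays the role that loading a kernel module plays in the requirement: the default remains in force until a replacement is explicitly activated.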

ID Name

AVL.4.3 (V2.0: --)

Low-Memory-Condition Monitor

Description: OSDL CGL specifies that carrier grade Linux shall provide a low memory condition monitor. To avoid encountering a true out-of-memory (OOM) condition in the Linux kernel, a user-space facility should be provided to monitor memory usage and take action based on a configurable low-memory threshold. This threshold would be set to predict an OOM condition before it becomes critical. The threshold would apply to both physical memory and swap area.

The application should record the top N memory-consuming processes, so that when the threshold is reached, processes that are trending up in memory use and are not on the user-defined do-not-kill list can be killed. This capability would allow the application to tell the kernel to stop allocating memory to user-space processes. When applications run out of pre-allocated memory, the system could remain nominally in service until more memory becomes available.

Another threshold action that should be supported is to notify an enterprise management station, using an SNMP trap, CIM, or other enterprise-level notification.
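A user-space monitor of this kind could be sketched as follows. The functions parse text in the format of `/proc/meminfo`, compare free physical memory plus free swap against a configurable threshold, and choose a candidate process that is trending up and not on the do-not-kill list. The function names, the 10% default threshold, and the shape of the `samples` mapping are assumptions for illustration; notification of a management station is not shown.

```python
def parse_meminfo(text: str) -> dict:
    """Parse '/proc/meminfo'-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def free_fraction(meminfo: dict) -> float:
    """Fraction of physical memory plus swap that is currently free."""
    total = meminfo["MemTotal"] + meminfo["SwapTotal"]
    free = meminfo["MemFree"] + meminfo["SwapFree"]
    return free / total


def below_threshold(meminfo: dict, threshold: float = 0.10) -> bool:
    """True when the low-memory threshold (an assumed 10% by default) is crossed."""
    return free_fraction(meminfo) < threshold


def pick_victim(samples: dict, do_not_kill: set):
    """samples maps process name -> (previous_kb, current_kb).

    Returns the process with the largest growth that is not protected,
    or None if no eligible process is trending up.
    """
    growth = {name: cur - prev
              for name, (prev, cur) in samples.items()
              if name not in do_not_kill and cur > prev}
    if not growth:
        return None
    return max(growth, key=growth.get)
```

On a running system the input text would come from reading `/proc/meminfo` periodically; keeping the parsing and decision logic as pure functions makes the threshold policy easy to test without exhausting real memory.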


ID Name

AVL.4.4 (V2.0: --)

Out Of Memory Notification Mechanism

Description: OSDL CGL specifies that carrier grade Linux shall provide an out of memory notification mechanism.

Whenever an OOM condition is detected, the mechanism shall generate a remote notification to a management station. Notification methods shall support enterprise-level notification protocols such as SNMP MIB or CIM.

See the related Standards requirements: STD 7.0 SNMP (for IPv4 and IPv6) and STD 13.0 CIM.

ID Name

AVL.5.0 (V2.0: AVL.7.0)

Fault Isolation Enabling

Description: OSDL CGL specifies that carrier grade Linux shall provide support to report anomalies it has detected on a compute node. The objective in reporting these anomalies is to provide data for fault isolation mechanisms. Software-related failures may require actions like the restart or termination of a process or the unloading and reinstallation of a kernel module. Hardware-related failures may require actions to restart, turn off, or isolate a failing device.

OSDL CGL specifies that carrier grade Linux shall provide mechanisms to isolate faulty software or hardware components. These mechanisms can be activated by management middleware fault isolation policies.


ID Name

AVL.6.0 (V2.0: AVL.8.0)

Non-intrusive Monitoring of Processes

See AVL.6.0 in the Availability Requirements section above.

Description: OSDL CGL specifies that carrier grade Linux shall provide a range of capabilities to enable non-intrusive monitoring of processes. To enable monitoring, some configuration actions may have to be taken to specify which processes are to be monitored. Capabilities may be limited in certain cases, as long as the limitations are known. Capabilities to be provided include the following:

Processes must be manageable and controllable even if the actual process code cannot (or will not) be changed to exploit a specified API.

Processes must be manageable and controllable even if they are not a direct child process of the tools and mechanisms provided to enable these capabilities. A carrier system consists of middleware and processes from many sources, which may be difficult to run from a single parent process, as they will usually require different userids, capabilities, permissions, etc.

The latency of event detection while processes are being monitored must be as low as possible, preferably occurring immediately upon complete failure of a process.

The overhead of monitoring the processes should be as low as possible.

inittab does not provide sufficient capabilities to meet this requirement. Therefore, enhancements to inittab must be provided to address the following limitations:

o Monitors only processes inittab starts

o Limited reactions to process death

o No healthcheck capabilities for non-terminating processes

o No controls on respawn loops of processes


ID Name

AVL.6.31 V2.0: AVL.8.1

Process-level Non-intrusive Application Monitor

Description: OSDL CGL specifies that carrier grade Linux shall provide control and management capabilities for processes that cannot be altered to incorporate any monitoring API. Such capabilities are known as non-intrusive monitoring. These capabilities must be implemented programmatically using commands or scripts.

Another issue for many such processes is that the start script itself may spawn a child that then becomes the actual application process. This process is then no longer a child under the control of the management process. This sub-requirement assumes that this does not happen, and the child process remains under the control of the management entity.

Capabilities required:

The following capabilities must be enabled for controlling processes:

o The ability to start a process (or a list of processes)

o The ability to stop a process (or a list of processes)

The following capabilities must be enabled for monitoring processes:

o The ability to detect the unexpected exit of a process

o The ability to configure a set of actions in response to an unexpected exit of a process

The following services must be provided beyond those currently provided by inittab:

o The ability to configure whether to restart the application if the process dies

o A configurable amount of time to wait before restarting the application

o A limit on the number of times to restart the application

ID Name

AVL.8.0 V2.0: -

Redundant Paths to Resources

see above

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism to enable redundant access paths to system resources.

The software shall handle sending and receiving data via redundant paths without conflicts, and provide high-availability access to resources even if an error occurs on one of the paths.
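The failover behavior can be illustrated with a small sketch. It assumes each path is a callable that raises OSError when the path has failed; the class name RedundantPaths and this error convention are hypothetical and chosen only for illustration.

```python
class RedundantPaths:
    """Try each access path in order and fail over when one raises OSError."""

    def __init__(self, paths):
        self.paths = list(paths)      # callables: path(request) -> response
        self.active = 0               # index of the currently preferred path

    def send(self, request):
        errors = []
        for offset in range(len(self.paths)):
            index = (self.active + offset) % len(self.paths)
            try:
                response = self.paths[index](request)
            except OSError as exc:
                errors.append((index, exc))
                continue              # this path failed; try the next one
            self.active = index       # stick with the path that worked
            return response
        raise OSError(f"all {len(self.paths)} paths failed: {errors}")
```

Each request is issued on exactly one path at a time, so a successful response is never duplicated across paths, and the caller sees an error only when every path has failed.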


ID Name

AVL.8.2 V2.0: AVL.12.0

Advanced Multi-Path Access to Storage

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism to enable multiple access paths from a single cluster node to storage devices. The mechanism should implement the following features:

- Ability to boot from SAN storage using the multi-path mechanism.

- Ability to use a swap partition on a multi-path disk.

- Kernel support for a path-switching policy.

- Error logs must provide easy device identification

- Interoperability: NBD, udev, volume managers, hotplug

ID Name

AVL.8.3 V2.0: --

Redundant Communication Paths

Description: OSDL CGL specifies that Linux shall provide support for redundant communication paths between nodes to improve network availability. The system should handle sending and receiving data between nodes via redundant communication paths without any conflicts. The paths should form logical or physical end-to-end redundant paths.

ID Name

AVL.9.0 V2.0: AVL.13.0

NFS Client Protection across Server Failures

Description: OSDL CGL specifies that carrier grade Linux shall provide mechanisms that allow an NFS server to have failover capability to provide service continuity upon a node failure. The NFS service has to be resumed on another node without any impact on NFS clients other than the retransmission of pending requests (open files must remain open). Clients authenticated on the old server must remain authenticated on the new server.


ID Name

AVL.10.0 V2.0: AVL.fast

Fast System Startup within kernel space

see above

Description: OSDL CGL specifies that carrier grade Linux shall provide a variety of capabilities to allow a single system to move from power-on to ready in as short a time as possible.

The startup sequence is divided up as follows:

1. power-on

o Boot

2. Linux loading [taken care of already]

3. Linux started (COLD: BIOS -> OS handoff, WARM: OS -> OS handoff)

o Linux start

Startup types can be divided as follows:

Warm restart = a restart in which the memory of the node is kept intact (basically requires kexec/bootimg/...).

Cold restart = a restart in which the memory of the node is not kept intact (ye olde reboot).

ID Name

AVL.10.2 V2.0: AVL.fast.3

Fast Linux Start using Known-devices Database

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism to speed up operating system initialization. The improvement in booting speed could be achieved by leveraging the boot loader to inform the operating system of previously connected devices, or the known devices could be derived from a previously running instance of the operating system.

ID Name

AVL.10.3 V2.0: AVL.fast.2

Parallel Driver Initialization during Startup

Description: OSDL CGL specifies that, if multiple drivers are compiled into the Linux kernel, the initialization or probing routines of those drivers execute in parallel or with decreased timeouts (the timeouts are quite long to cater for slow legacy devices). CGL further specifies that, if multiple drivers are to be loaded as modules, those driver modules are loaded in parallel. CGL further specifies that in either of these two cases, a driver is only initialized once the drivers it depends on have initialized.


ID Name

AVL.11.0 V2.0: AVL.fast

Fast System Startup Within User Space

Description: OSDL CGL specifies that carrier grade Linux shall provide a variety of capabilities to allow a single system to move from a power-on state to an application-ready state in as short a time as possible.

The normal startup sequence includes:

1. Power on and boot (includes BIOS initialization)

2. Load the Linux image

3. Start and initialize Linux

The startup sequence is divided up as follows:

power-on

Linux kernel startup

init started (OS -> init handoff)

init start

application started

ID Name

AVL.11.1 V2.0: --

Parallel User Initialization during Startup

Description: OSDL CGL specifies that the user initialization procedure executed by the program /sbin/init shall provide a mechanism to allow multiple init scripts to run in parallel. CGL further specifies that a service is only started once the services it depends on have started.
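The ordering rule can be illustrated with a short sketch: services with no unmet dependencies start in parallel, and a service becomes eligible only when everything it depends on has finished starting. The function name start_services, the dependency map, and the start callback are hypothetical; the sketch uses Python's standard graphlib module (Python 3.9 or later) and is not a description of how /sbin/init itself works.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def start_services(dependencies, start, max_workers=4):
    """Start services in parallel, each only after its dependencies.

    dependencies maps a service name to the set of services it depends on.
    start(name) performs the actual start. Returns names in completion order.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                       # raises CycleError on circular deps
    finished = []
    running = {}                           # future -> service name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            for name in sorter.get_ready():
                running[pool.submit(start, name)] = name
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()            # re-raise a failed start
                name = running.pop(future)
                finished.append(name)
                sorter.done(name)          # unblocks services waiting on it
    return finished
```

With a map such as {"network": set(), "syslog": set(), "nfs": {"network"}, "app": {"nfs", "syslog"}}, network and syslog start together, nfs waits for network, and app waits for both nfs and syslog.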


ID Name

AVL.12.0 V2.0: --

Excessive CPU Cycle Usage Detection

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism that detects excessive CPU cycle usage by any process or thread. To enable detection, the following capabilities shall be provided:

Communication between the monitoring process and the kernel.

Registering a list of processes or threads and their allowed CPU cycle thresholds.

Ability to define policy based on process events including process/thread creation and exit.

Ability to take action whenever an event occurs.

Ability to set the CPU cycle threshold to a resolution of one millisecond.

The mechanism is an enhancement of setrlimit()/SIGXCPU. The setrlimit() function call is available only to the process itself; however, this mechanism should allow a management process to set the target process's rlimit parameters and receive notification when the target process exceeds its CPU time limit.

OSDL CGL specifies that carrier grade Linux shall provide a mechanism that detects infinite loops in applications. The timeout values of each priority shall be customized for each system. The mechanism is an enhancement of setrlimit()/SIGXCPU. It is desirable that the mechanism provide the functions below:

Clear process time whenever a process is switched by a user system call.

Check process time whenever a process is switched by a basic clock/preemption.

Send a signal whenever an infinite loop (timer overflow) is detected.

Detect infinite loops on any thread, not just processes.

To ensure compatibility with the existing setrlimit() function, it is desirable to use a separate process time table.
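The registration and threshold capabilities can be approximated from user space by a monitoring process that samples CPU time from /proc. The sketch below is illustrative only: the class name CpuWatch and its methods are hypothetical, it polls rather than receiving kernel notifications, and it does not achieve the one-millisecond resolution the requirement asks of a kernel mechanism.

```python
import os


def cpu_seconds(pid):
    """Return user + system CPU time consumed by pid, from /proc (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat:
        data = stat.read()
    # The command name is parenthesised and may contain spaces, so split
    # after the last ')'. utime and stime are fields 14 and 15 of the line.
    fields = data[data.rindex(")") + 2:].split()
    ticks = int(fields[11]) + int(fields[12])
    return ticks / os.sysconf("SC_CLK_TCK")


class CpuWatch:
    """Registry of processes and their allowed CPU time (hypothetical helper)."""

    def __init__(self, read_cpu=cpu_seconds):
        self.read_cpu = read_cpu
        self.limits = {}                  # pid -> (baseline, allowed seconds)

    def register(self, pid, allowed_seconds):
        """Watch pid; CPU time already consumed is taken as the baseline."""
        self.limits[pid] = (self.read_cpu(pid), allowed_seconds)

    def check(self, on_exceed):
        """Call on_exceed(pid, used) for each process over its threshold."""
        for pid, (baseline, allowed) in list(self.limits.items()):
            try:
                used = self.read_cpu(pid) - baseline
            except (FileNotFoundError, ProcessLookupError):
                del self.limits[pid]      # process exited; stop watching it
                continue
            if used > allowed:
                on_exceed(pid, used)
```

A management process would call check() periodically and apply its policy in on_exceed, for example by logging the event or signalling the offending process.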


ID Name

AVL.15.0 V2.0: --

Fast Application Restart Mechanism

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism that enables a quick application restart. Typical applications in a carrier environment use multiple processes with inter-process communications. As applications become more complex, application initialization times become longer.

To speed up application initialization, the mechanism shall provide the functionality to simultaneously save the memory images of multiple processes (including the kernel resources used by each process) and to restore the images.

When the application completes initialization, including making connections between processes and setting up kernel resources for inter-process communication, the application invokes a save function that makes a copy of the memory images of the process and kernel resources. If the application hangs, the mechanism restores the memory images and kernel resources and restarts the application.


When the process status of an application becomes “in service”, the application invokes the Fast Application Restart mechanism, which makes the kernel copy the memory images of the process and the kernel resources for inter-process communication. If the process goes down, the kernel restarts the application from the copied images and restores the kernel resources for inter-process communication. The mechanism reduces the initialization time of the application.

ID Name

AVL.16.0 V2.0: --

Fallback Operation Mechanism

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism that enables or disables specific functions that allow system fallback mode operation when an overload condition is detected. It is desirable that the mechanism provide the functions below:

A softirq-based interrupt handler.

Temporary roll-in/roll-out.

Temporary suspension of low-priority daemon execution.


ID Name

AVL.17.0 V2.0: PLT.atca.2

Multiple FIB Support

Description: OSDL CGL specifies that Linux shall support multiple Forwarding Information Base (FIB) quick look-up tables with forwarding addresses to allow better server virtualization with overlapping addresses.

A Forwarding Information Base (FIB) is a table that contains a copy of the forwarding information in the IP routing table. All the hooks and changes needed to support multiple FIBs shall be added.
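The value of multiple FIBs for overlapping addresses can be shown with a toy model: each table performs its own longest-prefix-match lookup, so the same destination address can resolve to different next hops depending on which table is consulted. The class names, table identifiers, and next-hop labels below are hypothetical, and the sketch models only the lookup, not the kernel hooks.

```python
import ipaddress


class Fib:
    """One forwarding table with longest-prefix-match lookup."""

    def __init__(self):
        self.routes = []                       # (network, next_hop)

    def add(self, prefix, next_hop):
        self.routes.append((ipaddress.ip_network(prefix), next_hop))

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [(net.prefixlen, hop) for net, hop in self.routes if addr in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        return max(matches, key=lambda match: match[0])[1]


class MultiFib:
    """Independent FIBs keyed by a table identifier."""

    def __init__(self):
        self.tables = {}

    def table(self, table_id):
        return self.tables.setdefault(table_id, Fib())

    def lookup(self, table_id, address):
        return self.table(table_id).lookup(address)
```

If two virtual servers both use 10.0.0.0/8, each gets its own table, and a lookup for 10.1.2.3 returns the next hop configured in whichever table the lookup names.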

ID Name

AVL.18.0 V2.0: -

iSCSI Error Handling Support

Description: OSDL CGL specifies that the iSCSI Initiators implemented by carrier grade Linux should support the following iSCSI options:

- Header and Data Digests

- Error recovery level 1 as specified by RFC 3720.
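RFC 3720 specifies CRC32C as the algorithm for iSCSI header and data digests. The following bitwise implementation is a sketch for illustration; the function name is hypothetical, and a real initiator would use a table-driven or hardware-assisted routine rather than this slow loop.

```python
def crc32c(data: bytes) -> int:
    """Compute CRC32C (Castagnoli) over data, one bit at a time."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the reflected form of the CRC32C polynomial.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    # Standard check value for the ASCII string "123456789".
    print(hex(crc32c(b"123456789")))   # prints 0xe3069283
```

A receiver recomputes the digest over the header or data segment and compares it with the transmitted value; a mismatch is the kind of error that the recovery levels are meant to handle.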

ID Name

AVL.19.0 V2.0: - V1.1: 4.1

Application Profiler

Description: OSDL CGL specifies that carrier grade Linux shall provide a mechanism to profile critical resources of the kernel and applications. The critical resources that are profiled by this mechanism shall include (but are not limited to):

Time used

Memory used

Number of semaphores, mutexes, sockets, and threads/child processes in use

Number of open files

Monitoring shall happen at configurable, periodic intervals or as initiated by the user.
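Periodic sampling of a few of these resources can be sketched as follows. This is an illustration under several assumptions: it profiles only the calling process, it relies on /proc/self/fd (Linux only) for the open-file count, the thread count covers only threads started through Python's threading module, and the names snapshot and profile are hypothetical.

```python
import os
import resource
import threading
import time


def snapshot():
    """Collect a few resource counters for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    try:
        open_files = len(os.listdir("/proc/self/fd"))   # Linux only
    except OSError:
        open_files = None
    return {
        "cpu_seconds": usage.ru_utime + usage.ru_stime,
        "max_rss": usage.ru_maxrss,        # peak resident set, kB on Linux
        "threads": threading.active_count(),
        "open_files": open_files,
    }


def profile(interval, samples, sink):
    """Take `samples` snapshots, `interval` seconds apart, passing each to sink."""
    for index in range(samples):
        sink(snapshot())
        if index + 1 < samples:
            time.sleep(interval)
```

Calling profile(5.0, 12, print) would sample every five seconds for about a minute; a single call to snapshot() corresponds to monitoring initiated by the user.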


ID Name

AVL.20.0 V2.0: --

Kernel Resources Expansion for Threads

Description: OSDL CGL specifies that carrier grade Linux shall expand available kernel resources to provide additional support for threads. The existing thread model is defined as a light-weight process model; therefore, some thread kernel resources are missing. Threads are widely used in carrier grade applications, so at least the following additional kernel resource functionality shall be provided to support threads:

1. Full SIGNAL support - The SIGNAL should be sent to each thread.

2. Full rlimit support - The rlimit parameter should be supported for each thread.


Appendices

A.1 General References

The Carrier Grade Linux Web page on OSDL Web site: http://www.osdl.org/projects/cgl

OSDL “Carrier Grade Linux Requirements Specification, Version 1.1”: http://www.osdl.org/docs/cgl_requirements_definition_11.pdf

OSDL “Carrier Grade Linux Architecture Specification, Version 2.0”: http://www.osdl.org/docs/carrier_grade_linux_requirements_definition___version_20.pdf

A.2 General Systems References

POSIX: http://www.opengroup.org/ http://www.unix.org/online.html http://www.opengroup.org/onlinepubs/007908799/

See http://posixtest.sf.net for more POSIX conformance data on Linux.

POSIX Technical Corrigendum 1 text: http://www.opengroup.org/pubs/catalog/u057.htm

POSIX Specification with current Technical Corrigendum: http://www.unix.org/version3/

Linux Standard Base, Free Standards Group: http://www.linuxbase.org/ http://www.freestandards.org/

Service Availability Forum: http://www.saforum.org/

IETF: http://www.ietf.org/rfc.html
