types of faults that need checking pg3 -rtos-for-fault-tolerant-application


Upload: john-leung

Post on 18-Jan-2016


Page 1: Types of Faults That Need Checking Pg3 -RTOS-For-Fault-Tolerant-Application

RTOS for Fault Tolerant Application

Abstract:

The increasing complexity of safety-critical systems that support real-time

multitasking applications requires the concurrency management offered by

real-time operating systems (RTOS). Real-time systems can suffer severe

consequences if the functional as well as the time specifications are not met. In

addition, real-time systems are subject to transient errors originating from

several sources, including the impact of high energy particles on sensitive areas

of integrated circuits. Therefore, the evaluation of the sensitivity of RTOS to

transient faults is a major issue. This paper explores the sensitivity of RTOS kernels

in safety-critical systems. We characterize and analyze the consequences of

transient faults on key components of the kernel of MicroC, a popular RTOS.

We specifically focus on its task scheduling and context switching modules.

Classes of fault syndromes specific to safety-critical real-time systems are

identified. Results reported in this paper demonstrate that 34% of faults that

affect the scheduling and context switching functions led to scheduling

dysfunctions. This represents an important fraction of faults that cannot be

ignored during the design phase of safety-critical applications running under an

RTOS.

Index Terms: Context switch, fault injection, fault syndromes, real-time

operating systems (RTOS), scheduler, safety-critical systems.

Introduction:

Today, many safety-critical embedded systems support real-time

multitasking applications (e.g., nuclear power station applications, aerospace

applications, traffic control, or medical life support). The complexity of

these systems requires real-time operating systems (RTOS). Due to the time

criticality factor, the design of real-time systems becomes challenging. In real-

time systems, critical tasks must never miss their deadlines and never produce

incorrect output results. If their time responses exceed a given time period

(deadline) or if they provide incorrect results, the consequences can be

catastrophic (e.g., loss of human lives or economic disaster). Therefore, the

correct real-time functionality of safety-critical systems is mandatory in order to

guarantee the correctness of output results and the required response time of

critical tasks, even in the worst situations. Real-time systems, like all electronic

systems, are subject to transient errors due to cosmic rays and alpha particles.

These errors can cause undesired modifications of storage memory cells. The

consequences of transient errors are currently a well-known concern in

microelectronic systems. The International Technology Roadmap for Semiconductors

(ITRS) predicts increasing system failure rates due to transient errors for future

generations.


These errors affect applications running on embedded systems as well as

the RTOS under which they execute. Consequently, they affect both the correctness

of output results and the timing of the tasks' responses. In real-time applications,

the time correctness can be more important than the correctness of output

results. For instance, if a system is able to provide correct output results, but

later than some deadline, the system behaviour may be incorrect, with

consequences more significant than if a result with a minor error is provided on

time. The main services provided by an RTOS kernel are task scheduling

(taking into account several factors such as task priorities, resources, and time

management) and context switching. The scheduler decides which task is

to be executed, while the context switch module loads the context (variables,

stack, etc.) of the selected task. If these two services do not work properly, the

task execution order may be affected, and some critical tasks could miss their

deadlines or provide incorrect output results. A real-time scheduler must be

extremely reliable and safe, in order to ensure correctness of the real-time

system response. This is a major concern to RTOS providers, and several

standards for safe and reliable implementations have been proposed. For instance,

RTCA DO-178B is a standard for software used in avionics equipment. This

standard approaches reliability and safety from the software development

perspective, ensuring RTOS fault tolerance in case of software bugs. However,

implementations respecting this standard may also be subject to transient errors,

and the study of their sensitivity to these errors becomes an important issue for

safety-critical real-time applications. The majority of existing works propose

fault injection techniques to evaluate the robustness of kernels that are not

real-time. In one such work, a fault injection tool was developed to study error propagation in

UNIX systems. Reported results show that most injected faults lead to system

failure. A similar result has been reported elsewhere: the authors propose a fault injection

tool that corrupts system call parameters. The results show a high failure

rate of POSIX functions. Other representative studies propose the

MAFALDA tool to inject faults in the microkernel object code and the

application data segment. The results report not only system crashes, but also

error propagation to the application level.

However, none of the cited works addresses the real-time aspect, which is

the key reason for using an RTOS in safety-critical real-time systems. (POSIX,

the Portable Operating System Interface, is a set of standards specified by the IEEE to

define the API (Application Program Interface) for software designed to run on

variants of the UNIX OS.) There is a lack of contributions in the specialized

literature that consider the sensitivity of real-time features of RTOSs subject to

transient faults. One prior work is, to our knowledge, the only existing

research that investigates the temporal aspects of injected faults. In this work,

the authors propose a tool that aims at evaluating the time correctness of the

Chorus microkernel. They study the consequences of faults injected on the


scheduler code. Experimental results show that about 7% of injected faults are

propagated to the application level. It is of interest that modern RTOSs are

PROMable, which means that a CPU can execute the RTOS services directly

from the PROM. Since PROMs are less sensitive to transient errors than RAMs,

faults in the scheduler code are less of a concern.

However, the PROMable RTOSs are still subject to transient errors

during their execution, as the CPU registers are intensively used. Therefore, to

assess the robustness of RTOSs to transient faults, it is mandatory to investigate

their sensitivity to faults injected in CPU registers. With respect to the presented

state-of-the-art, the main contributions of this paper are: 1) the definition of

different types of syndromes caused by transient faults occurring in safety-

critical systems, including RTOS; 2) the proposal of a fault injection

methodology for assessing MicroC RTOS sensitivity to register-level

transient faults taking into account both functional correctness and real-time

aspects; and 3) a detailed analysis of reasons for scheduling dysfunctions caused

by transient errors.

The choice of MicroC in order to evaluate the sensitivity of RTOS kernels in

safety-critical systems was motivated by several aspects. MicroC is an open

source kernel and it is widely used in real-time applications. In addition,

MicroC was certified for use in safety-critical systems (in conformity to RTCA

DO-178B). Moreover, the current trend in real-time systems is to adopt less


complex RTOSs running on multiprocessor-based architectures, instead of

using a complex RTOS running on a single processor.

The transient fault model considered in our experiments is bit-flips in the

processor registers, while the key components of the MicroC kernel (the task

scheduling and the context switch) are active. Comparing our results to those

reported previously, we observed that faults corrupting the CPU registers during the

execution of the scheduling and context switching functions have a significant

impact on the reliability of real-time systems. In our experiments, we recorded that

34% of injected faults caused scheduling dysfunctions while an additional 17%

led to system crashes. This represents an important fraction of faults that

cannot be ignored during the design stage of safety-critical applications running

under an RTOS. The paper is structured as follows. Section II identifies fault

syndromes for safety-critical systems including an RTOS. Section III briefly

describes the main features of the MicroC kernel. The conceptual framework of

the proposed fault injection technique is depicted in Section IV. Fault injection

results are analyzed and discussed in Section V. Section VI provides some

lessons learned concerning the fault injection experiments and results analysis.

Finally, Section VII presents our concluding remarks.


Fault Syndromes For Safety-Critical Systems Including An RTOS:

Transient faults in the RTOS kernel of a safety-critical system may cause

several syndromes. The main classes of syndromes caused by the transient

faults occurring in an RTOS kernel are presented in Fig. 1. As illustrated in the

figure, when affected by transient faults, an RTOS may present two main

classes of syndromes.

• Syndromes that may also be observed in classical systems:

    • Effect-less: no observable effect on system functionality;

    • Application hang: the system application stops responding (e.g., it enters an infinite loop);

    • Exception: the program triggers some exception routine (e.g., illegal instruction, division by zero, etc.);

    • Memory access dysfunction: the system tries to access a non-valid physical memory address;

    • System crash: the system stops functioning. This syndrome may be a consequence of a memory access dysfunction;

    • Incorrect output results: the system provides results, but they are different from the expected ones.

• Syndromes specific to real-time systems using an RTOS, which may be classified as follows:

    • Real-time problem: the real-time constraints specified for the system are not respected;

    • Scheduling dysfunction: the scheduling of the tasks composing the application running on the system is not correct. This syndrome may cause real-time problems, incorrect output results problems, or system crashes.
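
The classification above can be sketched as a comparison between a fault-free reference run and a run with an injected fault. This is only an illustration: the `Run` record, its fields, and the `classify` function are hypothetical names introduced here, not part of the paper's tooling. Because one syndrome may cause another, the sketch returns a set rather than a single label.

```python
from dataclasses import dataclass
from enum import Enum


class Syndrome(Enum):
    EFFECT_LESS = "effect-less"
    APPLICATION_HANG = "application hang"
    EXCEPTION = "exception"
    MEMORY_ACCESS = "memory access dysfunction"
    SYSTEM_CRASH = "system crash"
    INCORRECT_OUTPUT = "incorrect output results"
    REAL_TIME_PROBLEM = "real-time problem"
    SCHEDULING_DYSFUNCTION = "scheduling dysfunction"


@dataclass
class Run:
    """Hypothetical summary of one simulated run (assumed fields)."""
    task_order: list            # sequence of task identifiers as they were scheduled
    outputs: list               # results produced by the application
    deadlines_met: bool = True
    crashed: bool = False
    hung: bool = False
    exception: bool = False
    bad_memory_access: bool = False


def classify(golden, faulty):
    """Return the set of syndromes seen when comparing a faulty run to a fault-free one."""
    found = set()
    if faulty.exception:
        found.add(Syndrome.EXCEPTION)
    if faulty.bad_memory_access:
        found.add(Syndrome.MEMORY_ACCESS)
    if faulty.crashed:
        found.add(Syndrome.SYSTEM_CRASH)
    if faulty.hung:
        found.add(Syndrome.APPLICATION_HANG)
    if faulty.task_order != golden.task_order:
        found.add(Syndrome.SCHEDULING_DYSFUNCTION)
    if not faulty.deadlines_met:
        found.add(Syndrome.REAL_TIME_PROBLEM)
    if faulty.outputs != golden.outputs:
        found.add(Syndrome.INCORRECT_OUTPUT)
    # No observable difference at all: the fault was effect-less.
    return found or {Syndrome.EFFECT_LESS}


golden_run = Run(task_order=[1, 2, 1, 3], outputs=[10, 20])
faulty_run = Run(task_order=[1, 3, 1, 2], outputs=[10, 20], deadlines_met=False)
observed = classify(golden_run, faulty_run)
```

In this example the faulty run produces the right outputs but in the wrong task order and too late, so it is reported as both a scheduling dysfunction and a real-time problem.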


MicroC/OS-II Real-Time Kernel: Basic Considerations:

MicroC is a reliable, flexible, pre-emptive, real-time multitasking kernel.

It has been certified by the Federal Aviation Administration for use in

commercial aircraft. The source code of the MicroC kernel is mainly written in

standard C, which makes it portable to different processor architectures. Only a

small portion of the code has to be adapted to the target processor. The main

services offered by MicroC are task scheduling, intertask communication by

semaphores, message mailboxes and message queues, time management

functions, etc. MicroC can manage up to 64 tasks; each task is associated

with a unique priority. A task can be in one of five states (dormant, ready to run,

running, waiting and interrupted). The dormant state corresponds to a task that

has not been made available to the multitasking kernel. The waiting state

corresponds to a task that waits for the occurrence of an event.

A task is in the interrupted state when an interrupt has occurred and the

CPU is handling the interrupt service routine (ISR). A task is running when it

has exclusive control of the CPU. The ready to run state corresponds to a task

that can be executed once the CPU becomes available (the running task

terminates). Generally, a task is an infinite loop function that executes user

code. MicroC associates with each task a task control block (TCB) that contains

essential information about the task (e.g., delay, state, priority, and the address of the

current top of the stack). MicroC uses the TCB to preserve the task's state

when it is suspended, and to resume its execution exactly where it was when the

task becomes ready to run again. All TCBs are located in RAM. Another

characteristic of the considered multitasking application is that each task has its

own stack, which contains the task's variables and the task's running context (the

content of all the CPU registers).
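
As a rough illustration of how a per-task stack lets a task resume exactly where it stopped, the sketch below models a reduced TCB and the saving and restoring of a running context. It is a simplification under stated assumptions: the `TCB` fields, the register names, and the helper functions are hypothetical, and the stack is stored inside the TCB object, whereas the text above says the TCB holds the address of the top of the stack.

```python
from dataclasses import dataclass, field


@dataclass
class TCB:
    """Hypothetical, reduced task control block."""
    prio: int
    state: str = "ready"
    delay: int = 0
    stack: list = field(default_factory=list)   # the task's private stack


def save_context(tcb, cpu_regs):
    """Push a snapshot of every CPU register onto the task's own stack."""
    tcb.stack.append(dict(cpu_regs))
    tcb.state = "ready"


def restore_context(tcb, cpu_regs):
    """Pop the saved snapshot back into the CPU registers."""
    cpu_regs.update(tcb.stack.pop())
    tcb.state = "running"


cpu = {"r0": 7, "r1": 42, "pc": 0x100}                  # assumed register set
task_a, task_b = TCB(prio=5, state="running"), TCB(prio=3)
task_b.stack.append({"r0": 0, "r1": 0, "pc": 0x200})    # initial context of task B

save_context(task_a, cpu)        # task A is suspended
restore_context(task_b, cpu)     # task B takes over the CPU
cpu["r0"] += 1                   # task B does some work
save_context(task_b, cpu)
restore_context(task_a, cpu)     # task A resumes exactly where it stopped
```

Since the saved context lives in RAM, a bit-flip that reaches a saved register value or the stack pointer would be restored into the CPU on the next switch.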

The scheduling function is activated every time a task calls the kernel's

services and when the system returns from an interrupt service routine. When

invoked, the scheduling function checks whether a task with a higher priority than the

currently running task is ready to run. If so, a context switch is

performed. The context switch saves the context of the task being suspended

and loads into the CPU the values of the registers for the task to resume. The

ready to run tasks are placed in the ready list, which is stored in memory in two

structures, OSRdyGrp and OSRdyTbl. OSRdyTbl is a table in which each bit is

associated with a priority level. OSRdyGrp is an 8-bit vector. Each bit in

OSRdyGrp corresponds to a row in OSRdyTbl. If at least

one of the tasks whose priorities are grouped in a row is ready to run, the

corresponding bit in OSRdyGrp is set to "1." The scheduler uses the OSRdyGrp

and OSRdyTbl structures to determine the highest priority task allowed to run.

The values of OSRdyGrp and of the OSRdyTbl row corresponding to the first '1' in

OSRdyGrp are used as indexes in a lookup table helping to determine the


highest priority task. This operation is deterministic (its execution time is

constant for all contexts). Taking into account this functionality, transient faults

occurring in the OSRdyGrp and OSRdyTbl structures may have major

implications on the correct behavior of the MicroC RTOS and consequently on

the global system (as explained in Section II in the definition of scheduling

dysfunction syndromes).
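
The ready-list lookup described above can be modeled in a few lines. The sketch below is a Python model, not the kernel's C code. It assumes an 8x8 bit table (64 priorities, eight per row), that lower numbers mean higher priority, and that the "first 1" is the lowest set bit. The names `ReadyList`, `UNMAP_TBL`, `rdy_grp`, and `rdy_tbl` are chosen for illustration only.

```python
# Lookup table: for each 8-bit value, the index of its lowest set bit.
# Entry 0 is a placeholder; a healthy scheduler never consults it.
UNMAP_TBL = [0] + [(v & -v).bit_length() - 1 for v in range(1, 256)]


class ReadyList:
    """8-bit group vector plus an 8x8 bit table, one bit per priority (0..63)."""

    def __init__(self):
        self.rdy_grp = 0          # plays the role of OSRdyGrp
        self.rdy_tbl = [0] * 8    # plays the role of OSRdyTbl

    def make_ready(self, prio):
        row, col = prio >> 3, prio & 0x07
        self.rdy_grp |= 1 << row
        self.rdy_tbl[row] |= 1 << col

    def make_not_ready(self, prio):
        row, col = prio >> 3, prio & 0x07
        self.rdy_tbl[row] &= ~(1 << col) & 0xFF
        if self.rdy_tbl[row] == 0:          # last ready task of this row
            self.rdy_grp &= ~(1 << row) & 0xFF

    def highest_ready(self):
        """Two table lookups, independent of how many tasks are ready."""
        row = UNMAP_TBL[self.rdy_grp]
        col = UNMAP_TBL[self.rdy_tbl[row]]
        return (row << 3) + col


ready = ReadyList()
for prio in (12, 35, 60):
    ready.make_ready(prio)
golden = ready.highest_ready()      # priority 12 is selected

# Emulate a single bit-flip in the group vector (bit 0 goes from 0 to 1).
ready.rdy_grp ^= 0x01
faulty = ready.highest_ready()      # row 0 is empty, so the lookup yields priority 0
```

After the flip the scheduler selects priority 0, at which no task is ready in this example. This illustrates how one corrupted bit in these structures can become a scheduling dysfunction.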

Fault Injection Framework:

In order to assess the robustness of the MicroC RTOS kernel scheduler,

we developed an environment able to inject faults that corrupt the CPU's registers at

random instants, while the scheduler and the context switch functions are

executed. The studied system architecture is organized as illustrated in Fig. 3.

The adopted system architecture is simulated by an Instruction Set Simulator

(ISS) tool. The fault injection tool uses temporal breakpoint features available in

the ISS to inject faults by software means. Once a temporal breakpoint is

reached, global execution is suspended and the ISS tool activates a Fault

Injection Manager (FIM) that comprises three modules: a fault parameters

generator, a fault tracer and a results analyzer. After the fault has been injected,

the global execution is resumed.


The injection process is depicted in Fig. 4. The fault parameters generator

calculates when and where the fault will be injected. In our experiments, faults

consist of single bit-flips affecting only the main MicroC kernel features: task

scheduling and context switching. Accordingly, the fault instant must coincide

with the time intervals when these functions are active, as illustrated.
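
A minimal sketch of what a fault parameters generator might compute is given below, assuming the intervals during which the scheduler and context switch are active are known from a reference run. The names `FaultParams`, `generate_fault`, and `inject`, the register names, the 32-bit register width, and the cycle windows are all illustrative assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultParams:
    cycle: int      # instant at which the temporal breakpoint fires
    register: str   # target CPU register
    bit: int        # bit position to flip


def generate_fault(active_windows, registers, width, rng):
    """Pick a random instant inside the active windows, plus a register and a bit.

    active_windows: list of (start, end) cycle ranges, end exclusive, during
    which the scheduler or the context switch is executing.
    """
    total = sum(end - start for start, end in active_windows)
    offset = rng.randrange(total)           # uniform over all active cycles
    for start, end in active_windows:
        length = end - start
        if offset < length:
            cycle = start + offset
            break
        offset -= length
    return FaultParams(cycle, rng.choice(registers), rng.randrange(width))


def inject(register_file, fault):
    """Flip one bit of one register in place (single bit-flip fault model)."""
    register_file[fault.register] ^= 1 << fault.bit


windows = [(100, 140), (900, 960)]          # hypothetical trace of kernel activity
regs = {"r0": 0, "r1": 0xFFFF, "sp": 0x2000, "pc": 0x0400}
fault = generate_fault(windows, sorted(regs), 32, random.Random(1))
before = dict(regs)
inject(regs, fault)
```

Drawing the instant uniformly over the total active time, rather than picking a window first, keeps longer windows proportionally more likely to be hit.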

Conclusion:

Today, many safety-critical embedded systems execute real-time

multitasking applications. The complexity of these systems typically sets a

requirement for an RTOS. These systems are subject to transient errors induced

by parasitic phenomena that may affect both the correctness of logical results

and the timing of the tasks' responses. In this paper, we analyzed the sensitivity of

MicroC RTOS to transient faults. We presented a classification of syndromes

caused by the transient faults occurring in safety-critical systems including

RTOSs. We identified syndromes specific to real-time systems including

RTOS: real-time problems (when real-time constraints specified for the system

are not respected) and scheduling dysfunction (when scheduling of the different

tasks composing the application running on the system is not correct). We also

presented a methodology based on fault injection that makes it possible to assess MicroC

RTOS sensitivity to transient faults taking into account both logical correctness

and real-time aspects. In addition, this paper presents an analysis of reasons

for scheduling dysfunctions, which may allow designers to improve the RTOS

robustness to transient faults.