Unit V: Recovery and Security Mechanism
TRANSCRIPT
-
8/2/2019 Unit v Recovery and Security Mechanism
1/35
-
Recovery refers to restoring a system to its normal
operational state.
For example, if a process fails, the resources allocated to the failed process must be reclaimed.
If one or more cooperating processes fail, then the effects of the interaction of the failed processes with the other processes must be undone, or every failed process would have to restart from an appropriate state.
If a site fails, recovery involves the question of how not to expose the system to data inconsistencies and how to bring the failed site back to an up-to-date state consistent with the rest of the system.
-
System: hardware and software.
Failure: occurs when the system does not perform its services in the manner specified.
An erroneous state of the system is a state that could lead to a system failure by a sequence of valid state transitions.
A fault is an anomalous physical condition. Its causes include design errors, manufacturing problems, damage, and external disturbances.
An error is that part of the system state that differs from its intended value and can lead to a system failure.
Failure recovery is a process that involves restoring an erroneous state to an error-free state.
-
Figure: An error is a manifestation of a fault and can lead to a failure.
-
Types of failures:
Process failure
System failure
Secondary storage failure
Communication medium failure
Process failure:
The computation results in an incorrect outcome, the process causes the system state to deviate from its specification, or the process fails to make progress.
Errors that can cause a process to fail: deadlocks, timeouts, protection violations, wrong input provided by the user, and consistency violations.
-
System failure:
Occurs when the processor fails to execute.
It is caused by software errors and hardware problems. In case of a system failure, the system is stopped and restarted from a correct, predefined state.
System failures are classified as follows:
1) An amnesia failure occurs when a system restarts in a predefined state that does not depend upon the state of the system before its failure.
2) A partial amnesia failure occurs when a system restarts in a state wherein a part of the state is the same as the state before the failure and the rest of the state is predefined, e.g., a file server crash.
-
3) A pause failure occurs when a system restarts in the same state it was in before the failure.
4) A halting failure occurs when a crashed system never restarts.
Secondary storage failure
Occurs when the stored data cannot be accessed.
Cause: parity errors, head crashes, or dust particles settled on the medium.
The storage's contents are corrupted and must be reconstructed from archive systems and log files.
Communication medium failure
Occurs when a site cannot communicate with another operational site in the network.
-
Cause: failure of a switching node (which includes system failure and secondary storage failure) or link failure (which includes physical rupture of, and noise in, the communication channels).
A communication medium failure may not cause a total shutdown of the system.
-
An error is that part of the state that differs from its intended value and can lead to a system failure, and failure recovery is a process that involves restoring an erroneous state to an error-free state.
There are two approaches for restoring an erroneous state to an error-free state.
If the nature of errors and the damage caused by faults can be completely and accurately assessed, then it is possible to remove those errors in the process's (system's) state and enable the process (system) to move forward. This technique is known as forward-error recovery.
-
If it is not possible to foresee the nature of faults and to remove all the errors in the process's (system's) state, then the process's (system's) state can be restored to a previous error-free state of the process (system). This technique is known as backward-error recovery.
Backward-error recovery is simpler than forward-error recovery, as it is independent of the fault and the errors caused by the fault.
Problems with backward-error recovery:
-
Performance penalty: the overhead to restore a process (system) state to a prior state can be quite high.
There is no guarantee that faults will not occur again when processing begins from a prior state.
Some components of the system state may be unrecoverable. For example, cash dispensed at an automatic teller machine cannot be recovered.
The forward-error recovery technique, on the other hand, incurs less overhead, because only those parts of the state that deviate from the intended value need to be corrected.
-
In backward-error recovery, a process is restored to a prior state in the hope that the prior state is free of errors.
The points in the execution of a process to which the process can later be restored are known as recovery points.
Recovery done at the process level is simply a subset of the actions necessary to recover the entire system.
In a system recovery, all the user processes that were active need to be restored to their respective recovery points, and any data (in secondary storage) modified by the processes needs to be restored to a proper state.
-
There are two ways to implement backward-error recovery:
the operation-based approach and
the state-based approach.
System Model
-
Operation-based approach:
Audit trail or log: all the changes made to the state of a process are recorded in sufficient detail so that a previous state of the process can be restored by reversing all the changes made to the state.
Updating-in-place: every update (write) is recorded in a log file on stable storage.
The information recorded includes:
1) the name of the object,
2) the old state of the object (used for UNDO), and
3) the new state of the object (used for REDO).
-
A recoverable update operation can be implemented as a collection of operations as follows:
1) A do operation, which does the action (update) and writes a log record.
2) An undo operation, which, given a log record written by a do operation, undoes the action performed by the do operation.
3) A redo operation, which, given a log record written by a do operation, redoes the action specified by the do operation.
4) An optional display operation, which displays the log record.
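The do/undo/redo operations above can be sketched as follows. This is a minimal illustration only; the in-memory object store and the dictionary layout of the log record are hypothetical stand-ins for a real object store and log file.

```python
# Sketch of a recoverable update: each do() writes a log record holding
# the object's name, its old state (for UNDO), and its new state (for REDO).
objects = {"x": 0}   # the mutable system state (hypothetical store)
log = []             # the log of update records

def do(name, new_value):
    """Perform an update and write a log record."""
    record = {"name": name, "old": objects[name], "new": new_value}
    objects[name] = new_value
    log.append(record)
    return record

def undo(record):
    """Reverse the action described by a log record (uses the old state)."""
    objects[record["name"]] = record["old"]

def redo(record):
    """Repeat the action described by a log record (uses the new state)."""
    objects[record["name"]] = record["new"]

r = do("x", 42)      # x: 0 -> 42; log records old=0, new=42
undo(r)              # x restored to 0
redo(r)              # x set to 42 again
```

Note that undo and redo take only the log record as input, which is why the record must carry both the old and the new state of the object.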
-
The major problem with updating-in-place is that a do operation cannot be undone if the system crashes after an update operation but before the log record is stored.
This problem is overcome by the write-ahead-log protocol, in which the log record is forced to stable storage before the update is performed.
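A minimal sketch of the write-ahead ordering, assuming an in-memory list stands in for synchronously flushed stable storage:

```python
# Write-ahead-log sketch: the log record reaches stable storage *before*
# the in-place update, so a crash between the two steps always leaves a
# record from which the (possibly partial) update can be undone.
stable_log = []              # stands in for stable storage
state = {"balance": 100}     # hypothetical recoverable object

def wal_update(name, new_value):
    # Step 1: force the log record to stable storage first.
    stable_log.append({"name": name, "old": state[name], "new": new_value})
    # Step 2: only then apply the update in place.
    state[name] = new_value

def recover_after_crash():
    # Undo, in reverse order, every logged update whose effect is uncertain.
    for record in reversed(stable_log):
        state[record["name"]] = record["old"]

wal_update("balance", 250)
recover_after_crash()        # balance rolled back to 100
```

Reversing the two steps would reintroduce exactly the window described above: an applied update with no record to undo it.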
-
State-based approach:
The complete state of a process is saved when a recovery point is established, and recovering a process involves reinstating its saved state and resuming the execution of the process from that state.
-
The process of saving state is also referred to as checkpointing or taking a checkpoint.
The recovery point at which checkpointing occurs is often referred to as a checkpoint.
The process of restoring a process to a prior state is referred to as rolling back the process.
A special case of the state-based recovery approach is the technique based on shadow pages.
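A minimal single-process sketch of checkpointing and rollback, using `copy.deepcopy` of an in-memory dictionary to stand in for saving the complete process state to stable storage:

```python
import copy

# State-based recovery sketch: a checkpoint saves the complete state;
# rolling back reinstates the most recent saved state.
process_state = {"counter": 0, "pending": []}   # hypothetical process state
checkpoints = []

def take_checkpoint():
    """Save the complete state (deep copy stands in for stable storage)."""
    checkpoints.append(copy.deepcopy(process_state))

def roll_back():
    """Reinstate the most recent checkpoint and return it."""
    return copy.deepcopy(checkpoints[-1])

take_checkpoint()                  # checkpoint taken at counter == 0
process_state["counter"] = 7       # computation proceeds...
process_state = roll_back()        # ...a failure forces a rollback
```

The deep copy matters: a shallow copy would let later computation mutate the saved checkpoint, defeating the purpose of the recovery point.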
-
If one of the cooperating processes fails and resumes execution from a recovery point, then the effects it has caused at other processes, due to the information it has exchanged with them after establishing the recovery point, will have to be undone.
To undo the effects caused by a failed process at an active process, the active process must also roll back to an earlier state.
Thus, in concurrent systems, all cooperating processes need to establish recovery points.
-
Rolling back processes can cause further problems:
Orphan messages and the domino effect
Lost messages
The problem of livelocks
-
Checkpointing in distributed systems involves the taking of a checkpoint by all the processes (sites), or at least by a set of processes (sites) that interact with one another in performing a distributed computation.
Typically, in distributed systems, all the sites save their local states, which are known as local checkpoints, and the process of saving local states is called local checkpointing.
All the local checkpoints, one from each site, collectively form a global checkpoint.
-
STRONGLY CONSISTENT SET OF CHECKPOINTS:
To overcome the domino effect, a set of local checkpoints is needed (one for each process in the set) such that no information flow takes place (i.e., there are no orphan messages) between any pair of processes in the set, as well as between any process in the set and any process outside the set, during the interval spanned by the checkpoints.
-
Such a set of checkpoints is known as a recovery line
or a strongly consistent set of checkpoints.
-
CONSISTENT SET OF CHECKPOINTS:
A consistent set of checkpoints is similar to a consistent global state in that it requires that each message recorded as received in a checkpoint (state) also be recorded as sent in another checkpoint (state).
Therefore, systems that do not establish a strongly consistent set of checkpoints have to deal with lost messages during rollback recovery.
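The "received implies sent" condition can be checked mechanically. In this sketch, each local checkpoint records the sets of messages it has seen sent and received; the message identifiers and checkpoint layout are hypothetical:

```python
# A set of checkpoints is consistent if every message recorded as received
# in some checkpoint is also recorded as sent in another checkpoint.
def is_consistent(checkpoints):
    sent, received = set(), set()
    for cp in checkpoints:
        sent |= cp["sent"]
        received |= cp["received"]
    # No orphan messages: nothing received that was never recorded as sent.
    return received <= sent

# m1 was sent by P1 and received by P2 before both checkpoints were taken.
p1 = {"sent": {"m1"}, "received": set()}
p2 = {"sent": set(), "received": {"m1"}}
consistent = is_consistent([p1, p2])          # True: no orphans

# If P2's checkpoint also records m2 as received, but no checkpoint
# records m2 as sent, m2 is an orphan message.
p2_late = {"sent": set(), "received": {"m1", "m2"}}
inconsistent = is_consistent([p1, p2_late])   # False: m2 is an orphan
```

The complementary hazard, a message recorded as sent but not received, is a lost message; as noted above, consistent (but not strongly consistent) sets of checkpoints must handle those during recovery.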
-
The checkpointing and recovery technique proposed by Koo and Toueg takes a consistent set of checkpoints and avoids livelock problems during recovery.
The algorithm's approach is said to be synchronous, as the processes involved coordinate their local checkpointing actions such that the set of the most recent checkpoints in the system is guaranteed to be consistent.
-
The checkpoint algorithm assumes the following characteristics for the distributed system:
Processes communicate by exchanging messages through communication channels.
Channels are FIFO in nature.
Communication failures do not partition the network.
The checkpoint algorithm takes two kinds of checkpoints on stable storage: permanent and tentative.
A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global checkpoint.
-
A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the successful termination of the checkpoint algorithm.
Processes roll back only to their permanent checkpoints.
The algorithm has two phases.
First phase: An initiating process Pi takes a tentative checkpoint and requests all the processes to take tentative checkpoints.
Each process informs Pi whether it succeeded in taking a tentative checkpoint.
A process says "no" to a request if it fails to take a checkpoint, which could be due to several reasons, depending upon the underlying application.
-
If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative checkpoints should be discarded.
Second phase: Pi informs all the processes of the decision it reached at the end of the first phase.
A process, on receiving the message from Pi, will act accordingly.
Therefore, either all or none of the processes take permanent checkpoints.
The algorithm requires that every process, once it has taken a tentative checkpoint, not send messages related to the underlying computation until it is informed of Pi's decision.
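The two phases above can be sketched as follows. The process objects and direct method calls are simplified stand-ins for real message exchanges, and the sketch omits the blocking of computation messages while Pi's decision is pending:

```python
# Sketch of the two-phase checkpoint algorithm: tentative checkpoints
# become permanent only if *every* process succeeds (all-or-nothing).
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint   # whether it will say "yes"
        self.state = 0
        self.tentative = None
        self.permanent = None

    def take_tentative(self):
        """Phase 1: try to take a tentative checkpoint; report to Pi."""
        if self.can_checkpoint:
            self.tentative = self.state
            return True     # "yes"
        return False        # "no": failed to take a checkpoint

    def apply_decision(self, commit):
        """Phase 2: act on Pi's decision."""
        if commit:
            self.permanent = self.tentative
        self.tentative = None   # discarded either way

def run_checkpoint(initiator, others):
    procs = [initiator] + others
    # Phase 1: every process takes a tentative checkpoint and replies to Pi.
    all_ok = all([p.take_tentative() for p in procs])
    # Phase 2: Pi broadcasts the decision; all or none become permanent.
    for p in procs:
        p.apply_decision(all_ok)
    return all_ok

pi, pj = Process("Pi"), Process("Pj")
pi.state, pj.state = 5, 9
committed = run_checkpoint(pi, [pj])   # both checkpoints made permanent
```

A single "no" reply makes `all_ok` false, so every tentative checkpoint is discarded, which is exactly the all-or-nothing property stated above.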
-
The rollback recovery algorithm assumes that a single process invokes the algorithm, as opposed to several processes concurrently invoking it to roll back and recover.
It also assumes that the checkpoint and the rollback recovery algorithms are not concurrently invoked.
The rollback recovery algorithm has two phases.
First phase: An initiating process Pi checks to see if all the processes are willing to restart from their previous checkpoints.
-
A process may reply "no" to a restart request if it is already participating in a checkpointing or a recovery process initiated by some other process.
Second phase: Pi propagates its decision to all the processes. On receiving Pi's decision, a process will act accordingly.
The recovery algorithm requires that every process not send messages related to the underlying computation while it is waiting for Pi's decision.
-
Although synchronous checkpointing simplifies recovery (because a consistent set of checkpoints is readily available), it has the following disadvantages:
1) Additional messages are exchanged by the checkpoint algorithm when it takes each checkpoint.
2) Synchronization delays are introduced during normal operations.
3) If failures rarely occur between successive checkpoints, then the synchronous approach places an unnecessary burden on the system in the form of additional messages, delays, and processing overhead.
-
To minimize the amount of computation undone during a rollback, all incoming messages are logged (stored on stable storage) at each processor.
The messages that were received after establishing a recovery point can be processed again in the event of a rollback to the recovery point.
The received messages can be logged in two ways: pessimistic and optimistic.
In pessimistic message logging, an incoming message is logged before it is processed. A drawback of this approach is that it slows down the underlying computation, even when there are no failures.
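A minimal sketch of pessimistic logging and replay; the list `stable_log` and the trivial "add the message value" computation are hypothetical stand-ins for stable storage and a real process:

```python
# Pessimistic message logging: every incoming message is forced to the
# log *before* it is processed, so a rollback can replay all messages
# received after the last recovery point.
stable_log = []      # stands in for stable storage
total = 0            # the process's (trivial) computation state

def process_message(msg):
    global total
    stable_log.append(msg)   # log first (the "pessimistic" step)...
    total += msg             # ...then process

def replay_after_rollback(checkpoint_total, logged):
    """Restore the recovery point, then reprocess the logged messages."""
    result = checkpoint_total
    for msg in logged:
        result += msg
    return result

checkpoint = total           # recovery point: total == 0
for m in (3, 4, 5):
    process_message(m)
# After a failure, roll back to the checkpoint and replay the log:
recovered = replay_after_rollback(checkpoint, stable_log)
```

Because each message hits the log before it affects the state, the replayed state matches the pre-failure state exactly; the price, as noted above, is a synchronous log write on every message even in failure-free runs.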
-
In optimistic message logging, processors continue to perform the computation, and the messages received are stored in volatile storage and logged to stable storage at certain intervals.
A Scheme for Asynchronous Checkpointing and Recovery:
The scheme makes the following assumptions:
1. The communication channels are reliable.
2. The communication channels deliver the messages in the order they were sent.
3. The communication channels are assumed to have infinite buffers.
4. The message transmission delay is arbitrary, but finite.
5. The underlying computation is assumed to be event-driven.