Unit V: Recovery and Security Mechanism
TRANSCRIPT
-
8/2/2019 Unit v Recovery and Security Mechanism
1/35
-
Recovery refers to restoring a system to its normal
operational state.
For example, if a process fails, the resources allocated to the failed process must be reclaimed.
If one or more cooperating processes fail, then the effects of the interaction of the failed processes with the other processes must be undone, or every failed process would have to restart from an appropriate state.
If a site fails, recovery involves the question of how not to expose the system to data inconsistencies and how to bring the failed site back to an up-to-date state consistent with the rest of the system.
-
System: hardware and software.
Failure: occurs when the system does not perform its services in the manner specified.
An erroneous state of the system is a state that could lead to a system failure by a sequence of valid state transitions.
A fault is an anomalous physical condition. Its causes include design errors, manufacturing problems, damage, and external disturbances.
An error is that part of the system state that differs from its intended value and can lead to a system failure.
Failure recovery is a process that involves restoring an erroneous state to an error-free state.
-
Figure: An error is a manifestation of a fault and can lead to a failure.
-
Types of failures:
Process failure
System failure
Secondary storage failure
Communication medium failure
Process failure:
The computation results in an incorrect outcome, the process causes the system state to deviate from its specification, or the process fails to make progress.
Errors that can cause a process to fail: deadlocks, timeouts, protection violations, wrong input provided by the user, and consistency violations.
-
System failure:
Occurs when the processor fails to execute.
It is caused by software errors and hardware problems. In case of a system failure, the system is stopped and restarted from a correct, predefined state.
System failures are classified as follows:
1) An amnesia failure occurs when a system restarts in a predefined state that does not depend upon the state of the system before its failure.
2) A partial amnesia failure occurs when a system restarts in a state wherein a part of the state is the same as the state before the failure and the rest of the state is predefined, e.g., a file server crash.
-
3) A pause failure occurs when a system restarts in the same state it was in before the failure.
4) A halting failure occurs when a crashed system never restarts.
Secondary storage failure
Occurs when the stored data cannot be accessed.
Cause: parity errors, head crashes, or dust particles settled on the medium.
The storage's contents are corrupted and must be reconstructed from archive systems and log files.
Communication medium failure
Occurs when a site cannot communicate with another operational site in the network.
-
Cause: failure of a switching node (which includes system failure and secondary storage failure) or link failure (which includes physical rupture of, and noise in, the communication channels).
A communication medium failure may not cause a total shutdown of the system.
-
An error is that part of the state that differs from its intended value and can lead to a system failure, and failure recovery is a process that involves restoring an erroneous state to an error-free state.
There are two approaches for restoring an erroneous state to an error-free state.
If the nature of errors and the damage caused by faults can be completely and accurately assessed, then it is possible to remove those errors in the process's (system's) state and enable the process (system) to move forward. This technique is known as forward-error recovery.
-
If it is not possible to foresee the nature of faults and to remove all the errors in the process's (system's) state, then the process's (system's) state can be restored to a previous error-free state of the process (system). This technique is known as backward-error recovery.
Backward-error recovery is simpler than forward-error recovery, as it is independent of the fault and the errors caused by the fault.
Problems with backward-error recovery:
-
Performance penalty: the overhead to restore a process (system) state to a prior state can be quite high.
There is no guarantee that faults will not occur again when processing begins from a prior state.
Some components of the system state may be unrecoverable. For example, cash dispensed at an automatic teller machine cannot be recovered.
The forward-error recovery technique, on the other hand, incurs less overhead, because only those parts of the state that deviate from the intended value need to be corrected.
-
In backward-error recovery, a process is restored to a prior state in the hope that the prior state is free of errors.
The points in the execution of a process to which the process can later be restored are known as recovery points.
Recovery done at the process level is simply a subset of the actions necessary to recover the entire system.
In a system recovery, all the user processes that were active need to be restored to their respective recovery points, and any data (in secondary storage) modified by the processes needs to be restored to a proper state.
-
There are two ways to implement backward-error recovery:
the operation-based approach and
the state-based approach.
System Model
-
Operation-based approach:
Audit trail or log: all the changes made to the state of a process are recorded in sufficient detail so that a previous state of the process can be restored by reversing all the changes made to the state.
Updating-in-place: every update (write) is recorded in a log file on stable storage.
The information recorded includes:
1) the name of the object,
2) the old state of the object (used for UNDO), and
3) the new state of the object (used for REDO).
-
A recoverable update operation can be implemented as a collection of operations as follows:
1) A do operation, which does the action (update) and writes a log record.
2) An undo operation, which, given a log record written by a do operation, undoes the action performed by the do operation.
3) A redo operation, which, given a log record written by a do operation, redoes the action specified by the do operation.
4) An optional display operation, which displays the log record.
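The do/undo/redo operations above can be sketched as follows. This is a minimal illustration only; the in-memory object store and the dictionary layout of the log record are hypothetical stand-ins for a real object store and log file.

```python
# Sketch of a recoverable update: each do() writes a log record holding
# the object's name, its old state (for UNDO), and its new state (for REDO).
objects = {"x": 0}   # the mutable system state (hypothetical store)
log = []             # the log of update records

def do(name, new_value):
    """Perform an update and write a log record."""
    record = {"name": name, "old": objects[name], "new": new_value}
    objects[name] = new_value
    log.append(record)
    return record

def undo(record):
    """Reverse the action described by a log record (uses the old state)."""
    objects[record["name"]] = record["old"]

def redo(record):
    """Repeat the action described by a log record (uses the new state)."""
    objects[record["name"]] = record["new"]

r = do("x", 42)      # x: 0 -> 42; log records old=0, new=42
undo(r)              # x restored to 0
redo(r)              # x set to 42 again
```

Note that undo and redo take only the log record as input, which is why the record must carry both the old and the new state of the object.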
-
The major problem with updating-in-place is that a do operation cannot be undone if the system crashes after an update operation but before the log record is stored.
This problem is overcome by the write-ahead-log protocol, in which the log record is forced to stable storage before the update is performed.
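A minimal sketch of the write-ahead ordering, assuming an in-memory list stands in for synchronously flushed stable storage:

```python
# Write-ahead-log sketch: the log record reaches stable storage *before*
# the in-place update, so a crash between the two steps always leaves a
# record from which the (possibly partial) update can be undone.
stable_log = []              # stands in for stable storage
state = {"balance": 100}     # hypothetical recoverable object

def wal_update(name, new_value):
    # Step 1: force the log record to stable storage first.
    stable_log.append({"name": name, "old": state[name], "new": new_value})
    # Step 2: only then apply the update in place.
    state[name] = new_value

def recover_after_crash():
    # Undo, in reverse order, every logged update whose effect is uncertain.
    for record in reversed(stable_log):
        state[record["name"]] = record["old"]

wal_update("balance", 250)
recover_after_crash()        # balance rolled back to 100
```

Reversing the two steps would reintroduce exactly the window described above: an applied update with no record to undo it.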
-
State-based approach:
The complete state of a process is saved when a recovery point is established, and recovering a process involves reinstating its saved state and resuming the execution of the process from that state.
-
The process of saving state is also referred to as checkpointing or taking a checkpoint.
The recovery point at which checkpointing occurs is often referred to as a checkpoint.
The process of restoring a process to a prior state is referred to as rolling back the process.
A special case of the state-based recovery approach is the technique based on shadow pages.
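A minimal single-process sketch of checkpointing and rollback, using `copy.deepcopy` of an in-memory dictionary to stand in for saving the complete process state to stable storage:

```python
import copy

# State-based recovery sketch: a checkpoint saves the complete state;
# rolling back reinstates the most recent saved state.
process_state = {"counter": 0, "pending": []}   # hypothetical process state
checkpoints = []

def take_checkpoint():
    """Save the complete state (deep copy stands in for stable storage)."""
    checkpoints.append(copy.deepcopy(process_state))

def roll_back():
    """Reinstate the most recent checkpoint and return it."""
    return copy.deepcopy(checkpoints[-1])

take_checkpoint()                  # checkpoint taken at counter == 0
process_state["counter"] = 7       # computation proceeds...
process_state = roll_back()        # ...a failure forces a rollback
```

The deep copy matters: a shallow copy would let later computation mutate the saved checkpoint, defeating the purpose of the recovery point.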
-
If one of the cooperating processes fails and resumes execution from a recovery point, then the effects it has caused at other processes, due to the information it has exchanged with them after establishing the recovery point, will have to be undone.
To undo the effects caused by a failed process at an active process, the active process must also roll back to an earlier state.
Thus, in concurrent systems, all cooperating processes need to establish recovery points.
-
Rolling back processes can cause further problems:
Orphan messages and the domino effect
Lost messages
The problem of livelocks
-
Checkpointing in distributed systems involves the taking of a checkpoint by all the processes (sites), or at least by a set of processes (sites) that interact with one another in performing a distributed computation.
Typically, in distributed systems, all the sites save their local states, which are known as local checkpoints, and the process of saving local states is called local checkpointing.
All the local checkpoints, one from each site, collectively form a global checkpoint.
-
STRONGLY CONSISTENT SET OF CHECKPOINTS:
To overcome the domino effect, a set of local checkpoints is needed (one for each process in the set) such that no information flow takes place (i.e., there are no orphan messages) between any pair of processes in the set, as well as between any process in the set and any process outside the set, during the interval spanned by the checkpoints.
-
Such a set of checkpoints is known as a recovery line
or a strongly consistent set of checkpoints.
-
CONSISTENT SET OF CHECKPOINTS:
A consistent set of checkpoints is similar to a consistent global state in that it requires that each message recorded as received in a checkpoint (state) also be recorded as sent in another checkpoint (state).
Therefore, systems that do not establish a strongly consistent set of checkpoints have to deal with lost messages during rollback recovery.
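The "received implies sent" condition can be checked mechanically. In this sketch, each local checkpoint records the sets of messages it has seen sent and received; the message identifiers and checkpoint layout are hypothetical:

```python
# A set of checkpoints is consistent if every message recorded as received
# in some checkpoint is also recorded as sent in another checkpoint.
def is_consistent(checkpoints):
    sent, received = set(), set()
    for cp in checkpoints:
        sent |= cp["sent"]
        received |= cp["received"]
    # No orphan messages: nothing received that was never recorded as sent.
    return received <= sent

# m1 was sent by P1 and received by P2 before both checkpoints were taken.
p1 = {"sent": {"m1"}, "received": set()}
p2 = {"sent": set(), "received": {"m1"}}
consistent = is_consistent([p1, p2])          # True: no orphans

# If P2's checkpoint also records m2 as received, but no checkpoint
# records m2 as sent, m2 is an orphan message.
p2_late = {"sent": set(), "received": {"m1", "m2"}}
inconsistent = is_consistent([p1, p2_late])   # False: m2 is an orphan
```

The complementary hazard, a message recorded as sent but not received, is a lost message; as noted above, consistent (but not strongly consistent) sets of checkpoints must handle those during recovery.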
-
The checkpointing and recovery technique proposed by Koo and Toueg takes a consistent set of checkpoints and avoids livelock problems during recovery.
The algorithm's approach is said to be synchronous, as the processes involved coordinate their local checkpointing actions such that the set of the most recent checkpoints in the system is guaranteed to be consistent.
-
The checkpoint algorithm assumes the following characteristics for the distributed system:
Processes communicate by exchanging messages through communication channels.
Channels are FIFO in nature.
Communication failures do not partition the network.
The checkpoint algorithm takes two kinds of checkpoints on stable storage: permanent and tentative.
A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global checkpoint.
-
A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the successful termination of the checkpoint algorithm.
Processes roll back only to their permanent checkpoints.
The algorithm has two phases.
First phase: An initiating process Pi takes a tentative checkpoint and requests all the processes to take tentative checkpoints.
Each process informs Pi whether it succeeded in taking a tentative checkpoint.
A process says "no" to a request if it fails to take a checkpoint, which could be due to several reasons, depending upon the underlying application.
-
If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative checkpoints should be discarded.
Second phase: Pi informs all the processes of the decision it reached at the end of the first phase.
A process, on receiving the message from Pi, will act accordingly.
Therefore, either all or none of the processes take permanent checkpoints.
The algorithm requires that every process, once it has taken a tentative checkpoint, not send messages related to the underlying computation until it is informed of Pi's decision.
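The two phases above can be sketched as follows. The process objects and direct method calls are simplified stand-ins for real message exchanges, and the sketch omits the blocking of computation messages while Pi's decision is pending:

```python
# Sketch of the two-phase checkpoint algorithm: tentative checkpoints
# become permanent only if *every* process succeeds (all-or-nothing).
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint   # whether it will say "yes"
        self.state = 0
        self.tentative = None
        self.permanent = None

    def take_tentative(self):
        """Phase 1: try to take a tentative checkpoint; report to Pi."""
        if self.can_checkpoint:
            self.tentative = self.state
            return True     # "yes"
        return False        # "no": failed to take a checkpoint

    def apply_decision(self, commit):
        """Phase 2: act on Pi's decision."""
        if commit:
            self.permanent = self.tentative
        self.tentative = None   # discarded either way

def run_checkpoint(initiator, others):
    procs = [initiator] + others
    # Phase 1: every process takes a tentative checkpoint and replies to Pi.
    all_ok = all([p.take_tentative() for p in procs])
    # Phase 2: Pi broadcasts the decision; all or none become permanent.
    for p in procs:
        p.apply_decision(all_ok)
    return all_ok

pi, pj = Process("Pi"), Process("Pj")
pi.state, pj.state = 5, 9
committed = run_checkpoint(pi, [pj])   # both checkpoints made permanent
```

A single "no" reply makes `all_ok` false, so every tentative checkpoint is discarded, which is exactly the all-or-nothing property stated above.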
-
The rollback recovery algorithm assumes that a single process invokes the algorithm, as opposed to several processes concurrently invoking it to roll back and recover.
It also assumes that the checkpoint and the rollback recovery algorithms are not concurrently invoked.
The rollback recovery algorithm has two phases.
First phase: An initiating process Pi checks to see if all the processes are willing to restart from their previous checkpoints.
-
A process may reply "no" to a restart request if it is already participating in a checkpointing or a recovery process initiated by some other process.
Second phase: Pi propagates its decision to all the processes. On receiving Pi's decision, a process will act accordingly.
The recovery algorithm requires that every process not send messages related to the underlying computation while it is waiting for Pi's decision.
-
Although synchronous checkpointing simplifies recovery (because a consistent set of checkpoints is readily available), it has the following disadvantages:
1) Additional messages are exchanged by the checkpoint algorithm when it takes each checkpoint.
2) Synchronization delays are introduced during normal operations.
3) If failures rarely occur between successive checkpoints, then the synchronous approach places an unnecessary burden on the system in the form of additional messages, delays, and processing overhead.
-
To minimize the amount of computation undone during a rollback, all incoming messages are logged (stored on stable storage) at each processor.
The messages that were received after establishing a recovery point can be processed again in the event of a rollback to the recovery point.
The received messages can be logged in two ways: pessimistic and optimistic.
In pessimistic message logging, an incoming message is logged before it is processed. A drawback of this approach is that it slows down the underlying computation, even when there are no failures.
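A minimal sketch of pessimistic logging and replay; the list `stable_log` and the trivial "add the message value" computation are hypothetical stand-ins for stable storage and a real process:

```python
# Pessimistic message logging: every incoming message is forced to the
# log *before* it is processed, so a rollback can replay all messages
# received after the last recovery point.
stable_log = []      # stands in for stable storage
total = 0            # the process's (trivial) computation state

def process_message(msg):
    global total
    stable_log.append(msg)   # log first (the "pessimistic" step)...
    total += msg             # ...then process

def replay_after_rollback(checkpoint_total, logged):
    """Restore the recovery point, then reprocess the logged messages."""
    result = checkpoint_total
    for msg in logged:
        result += msg
    return result

checkpoint = total           # recovery point: total == 0
for m in (3, 4, 5):
    process_message(m)
# After a failure, roll back to the checkpoint and replay the log:
recovered = replay_after_rollback(checkpoint, stable_log)
```

Because each message hits the log before it affects the state, the replayed state matches the pre-failure state exactly; the price, as noted above, is a synchronous log write on every message even in failure-free runs.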
-
In optimistic message logging, processors continue to perform the computation, and the messages received are stored in volatile storage and logged to stable storage at certain intervals.
A Scheme for Asynchronous Checkpointing and Recovery:
The scheme makes the following assumptions:
1. The communication channels are reliable.
2. The communication channels deliver the messages in the order they were sent.
3. The communication channels are assumed to have infinite buffers.
4. The message transmission delay is arbitrary, but finite.
5. The underlying computation is assumed to be event-driven.