

Optimization of Full versus Incremental Periodic Backup Policy

Gregory Levitin, Senior Member, IEEE, Liudong Xing, Senior Member, IEEE, Qingqing Zhai, Student Member, IEEE, and Yuanshun Dai, Member, IEEE

Abstract—This paper models repairable computing systems performing a mission that is successful if the system accomplishes a specified amount of work within the allowed mission time or deadline. During the mission the system is subject to a sequence of full and incremental data backup procedures, which facilitate effective system recovery and avoid repeating the entire mission work from the very beginning when a system failure happens. The repair time is fixed, while the system time-to-failure can follow an arbitrary distribution. This paper makes novel contributions by first developing a new numerical algorithm to evaluate the mission success probability and expected completion time of the considered repairable real-time computing systems subject to mixed full and incremental backups. Correctness of the proposed evaluation algorithm is verified using Monte Carlo simulations. We make another new contribution by formulating and solving the backup schedule optimization problem that finds the full and incremental backup frequencies maximizing the mission success probability. Through illustrative examples, we investigate the effects of different parameters (including the system time-to-failure distribution parameter, maximum allowed mission time, data backup and retrieval times, storage availability, repair time and efficiency) on the mission success probability and expected completion time, as well as on the optimal backup schedule solution.

Index Terms—Full backup, incremental backup, repair, mission success probability, expected completion time, real-time computing system


ACRONYMS

cdf: Cumulative distribution function.
pdf: Probability density function.
jdf: Joint distribution function.
FB: Full backup.
IB: Incremental backup.
SHM: System health management.

NOMENCLATURE

$W$: time needed to complete the mission without failures and backups (mission complexity).
$w$: time during which the system should operate between two consecutive backups.
$B_i$: time needed to perform the $i$th backup.
$b_i$: data retrieval time after $i$ backups.
$p$: FB frequency factor.
$N$: maximal possible number of repairs during the mission.
$\tau$: maximum allowed mission time.
$R$: mission success probability.
$E$: expected mission completion time.
$K$: total number of backups in the mission.
$\varepsilon_F, \varepsilon_I$: availabilities of the FB and the IB storages.
$\langle H_j, X_j \rangle$: event when the $j$th failure happens at time $X_j$ and the index of the last completed backup is $H_j$.
$q_j(h, x)$: jdf of random values $H_j$ and $X_j$.
$q^{FI}_{j+1}(h, x)$: conditional jdf of $H_{j+1}$ and $X_{j+1}$ given FB and IB storages are available.
$q^{F\bar{I}}_{j+1}(h, x)$: conditional jdf of $H_{j+1}$ and $X_{j+1}$ given only FB storage is available.
$q^{\bar{F}I}_{j+1}(h, x)$: conditional jdf of $H_{j+1}$ and $X_{j+1}$ given only IB storage is available.
$q^{\bar{F}\bar{I}}_{j+1}(h, x)$: conditional jdf of $H_{j+1}$ and $X_{j+1}$ given FB and IB storages are unavailable.
$Q_j(h, v)$: probability that the $j$th failure happens in time interval $v$ and the index of the last backup completed before this failure is $h$.
$\delta$: repair time.
$r$: number of intervals needed to perform repair.
$f(t), F(t)$: pdf and cdf of the baseline system time-to-failure distribution.
$g(t_0, t), G(t_0, t)$: pdf and cdf of the after-repair time-to-failure distribution of a system with $t_0$ operation time before the repair.
$\zeta_I, \zeta_F$: times needed to get access to the IB and FB data storages for retrieving data.
$\varphi_I, \varphi_F$: times needed to get access to the IB and FB data storages for performing backups.

• G. Levitin is with the Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China, and The Israel Electric Corporation, PO Box 10, Haifa 31000, Israel. E-mail: [email protected].

• L. Xing is with the Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth, North Dartmouth, MA 02747. E-mail: [email protected].

• Q. Zhai is with the School of Reliability and Systems Engineering, Beihang University, Beijing, China.

• Y. Dai is with the Collaborative Autonomic Computing Laboratory, School of Computer Science, University of Electronic Science and Technology of China.

Manuscript received 15 Sept. 2014; revised 16 Feb. 2015; accepted 7 Mar. 2015. Date of publication 27 Apr. 2015; date of current version 16 Nov. 2016. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TDSC.2015.2413404

644 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 13, NO. 6, NOVEMBER/DECEMBER 2016

1545-5971 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


$C_I, C_F$: times needed to transfer the information generated during the entire mission to the IB and FB data storages.
$Q_I, Q_F$: times needed to transfer the information generated during the entire mission from the IB and FB data storages.
$m$: number of discrete intervals considered in the numerical algorithm.
$\Delta$: duration of a discrete time interval: $\Delta = \tau/m$.
$\lfloor x \rfloor$: floor operation that returns the maximal integer not exceeding $x$.
$Y_j$: event when the mission is completed after $j$ repairs.
$r_j$: $\Pr(Y_j)$.
$e_j$: conditional expected mission completion time given $j$ failures happen during the mission.
$\eta, \beta$: scale, shape parameters of the Weibull time-to-failure distribution.
$\lambda(i)$: index of the last completed FB given that the total number of completed backups is $i$.
$\zeta$: repair efficiency coefficient ($0 \le \zeta \le 1$; $\zeta = 0$ corresponds to “as good as new”, $\zeta = 1$ corresponds to “as bad as old”).
$1(\cdot)$: unity function: $1(\text{FALSE}) = 0$; $1(\text{TRUE}) = 1$.

1 INTRODUCTION

The backup mechanism is commonly applied in computing-related systems to facilitate effective system reconfiguration or recovery when system failures happen due to hardware malfunctions, software errors, human mistakes or natural disasters [1], [2], [3], [4], [5], [6]. In particular, backup procedures are carried out periodically to store data associated with the completed portion of the mission task. When a system failure occurs during the mission, the system can resume the mission task from the last backed-up point through data retrieval; without backups the system has to repeat the entire mission task from the very beginning, which is inefficient in both time and cost [7].

This paper considers a repairable real-time system adopting a backup approach that combines occasional full backups (FB) with more frequent incremental backups (IB) [8], [9]. Such a “full + incremental” scheme has been used in many commercial or research backup systems such as the UNIX dump and tar [10], Amanda backups [11], the Spiralog backup system from DEC [12], and the Petal system [13]. In the considered system, during an FB all the data generated from the beginning of the mission until the backed-up point are saved to the backup storage, while during an IB only data created since the previous backup are saved. Restoring a failed system after the repair requires the system to retrieve the data from the last FB performed before the system failure, and then the data from each of the IBs since then. An FB can be slow and consume significant capacity on the backup medium, so the corresponding data backup and retrieval times are high. An IB is fast to perform and requires only small capacity on the backup storage, so the corresponding data backup and retrieval times are low. Under the mixed full and incremental backup strategy, the backup schedule that defines the frequencies of both FB and IB can significantly affect the probability that a system completes a specified amount of work within some fixed mission time or deadline, i.e., the mission success probability. In this paper we solve the optimal backup schedule problem for repairable real-time systems with the objective of maximizing the mission success probability.

There are three levels of repair models: perfect, minimal, and imperfect repairs [14], [15]. As one extreme model, a perfect repair can restore a system to an “as good as new” condition through maintenance actions (e.g., replacement). As the other extreme model, a minimal repair restores the failed system to an “as bad as old” condition, i.e., the same condition as the system was in immediately before its failure. As a general model, an imperfect repair brings the system to any condition between the former two extreme cases. The difference among these three repair models can also be described in terms of the virtual age concept [16], [17]. The virtual age (also known as effective age) describes the present condition of the system in comparison to a new system; it is redefined at each failure event according to the type of repair conducted. Specifically, under the perfect repair model, the virtual age of the system after the repair is simply reduced to 0; under the minimal repair model, the virtual age is the same as before the repair; under the general imperfect repair model, the virtual age is reduced by a certain amount depending on the repair type or efficiency. In this work the general imperfect repair model is considered, while the perfect and minimal repairs appear as special cases of the proposed methodology.

Intensive research has been conducted for the modeling and optimization of repairable systems under different repair models. For example, a variety of techniques has been proposed to describe the failure process or model the reliability of a repairable system, such as renewal processes including the homogeneous Poisson processes and non-homogeneous Poisson processes [18], [19], geometric processes [20], [21], [22], [23], Markov chains [24], [25], [26], [27], and Bayesian methods [28], [29], [30]. In addition, different optimization problems have been formulated and addressed for repairable systems subject to diverse maintenance behaviors or policies. For example, the problem of finding the optimal periodic or non-periodic inspection scheme has been solved for repairable systems subject to interactions between hard and soft failures or with hidden failures that can only be handled during inspections [31], [32], [33], [34], [35], [36], [37]. The problem of finding the optimal replacement policy has been solved for repairable systems subject to waiting and repair times [38], under free-repair warranty [39], or whose repairman can have vacations [20], [40]. The problem of finding the optimal maintenance and warranty policy has been solved for repairable systems with the consideration of all life cycle phases, where the optimal burn-in period, optimal preventive maintenance intervals and optimal replacement times are decided [41]. Also, a joint optimization problem for determining the optimal schedule and number of preventive maintenance actions as well as corresponding maintenance degrees has been solved in [42].

Despite the rich literature that models and optimizes repairable systems, to the best of our knowledge, none of the aforementioned existing works have addressed the mixed full and incremental backup strategy or the related optimal backup schedule problem. Recently Xia et al. [9] proposed an analytical modeling approach that combines Markov chains, queuing networks and stochastic reward nets for analyzing the performance and availability of a web service system under different backup policies including the mixed backup. The effect of backup frequency on the file service availability and performance was studied through evaluations using a few example backup frequencies. The optimal backup scheduling problem, however, was not considered. In addition, the method of [9] is not directly applicable to non-exponential general distributions, and the metric of expected mission completion time is not considered. In this paper we advance the state of the art by modeling general repairable, real-time systems subject to periodic “full + incremental” backups. The system time-to-failure can follow an arbitrary distribution. A new numerical algorithm is first proposed to analyze the mission success probability and expected mission completion time of the considered system. The optimal backup scheduling problem is then solved, where the frequencies of FB and IB maximizing the mission success probability are determined.

The rest of the paper is organized as follows. Section 2 presents the system model and formulation of the optimal backup schedule problem. Section 3 presents the methodology for analyzing the mission success probability and expected completion time of single-component repairable systems subject to a mixture of FB and IB actions. Section 4 describes the discrete numerical evaluation algorithm. Section 5 presents illustrative examples and verification of the proposed algorithm using Monte Carlo simulations. Effects of different system and mission parameters on the mission success probability, expected mission completion time, as well as the optimal backup schedule are also presented. Section 6 concludes the paper and gives directions of future work.

2 THE MODEL AND PROBLEM DESCRIPTION

The system should complete a specified amount of work within a fixed mission time or a hard deadline $\tau$. To complete this amount of work without failures and backups the system needs time $W$ ($W < \tau$). After completing equal parts of work the system performs backups. If the total number of backups is $K$, the $i$th backup is performed after completion of the $i/(K+1)$-th part of the entire mission task, where $i = 1, \ldots, K+1$. For the sake of further derivations we assume that the last, $(K+1)$-th, dummy backup is performed at the end of the mission.

The time during which the system should perform the mission task between two consecutive backups is $w = W/(K+1)$. Every $p$-th backup is an FB; the rest of the backups are IBs. Thus, the total number of FBs during the mission is $\lfloor K/p \rfloor$.

If the total number of completed backups is $i$, the index of the last completed FB can be obtained as $\lambda(i) = p \lfloor i/p \rfloor$. Thus, the $i$th backup is full if $i = \lambda(i)$ and incremental if $\lambda(i) < i$. The data of IBs performed before the last successfully completed FB are deleted because of the limited IB storage capacity.
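The rule above (every $p$-th backup is full, and the last completed FB has index $p\lfloor i/p \rfloor$) can be sketched in a few lines of Python. The function names are illustrative and do not come from the paper; the values $K = 8$, $p = 3$ are those of the Fig. 1 example.

```python
def last_full_backup(i: int, p: int) -> int:
    """Index of the last completed FB after i completed backups: p * floor(i / p)."""
    return p * (i // p)


def backup_types(K: int, p: int) -> list:
    """Label backups 1..K as 'FB' (every p-th backup) or 'IB' (all others)."""
    return ["FB" if last_full_backup(i, p) == i else "IB" for i in range(1, K + 1)]


if __name__ == "__main__":
    print(backup_types(8, 3))      # schedule with K = 8 backups and p = 3
    print(last_full_backup(5, 3))  # last FB available after the fifth backup
```

A return value of 0 from `last_full_backup` means that no FB has been completed yet.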

When the system fails between completing the $i$th and $(i+1)$th backups, it undergoes a repair, then retrieves from the FB storage the data of the last FB (the $\lambda(i)$th backup), and afterwards retrieves from the IB storage the data of the last $i - \lambda(i)$ IBs performed after the last FB. Then the system continues the mission task from the step that immediately follows the $i$th backup procedure. Consider the example illustrated in Fig. 1A, where $K = 8$ backups are scheduled with the third and the sixth backups being FBs (i.e., $p = 3$) and the rest being IBs.

If the system fails between completing the fifth and sixth backups (indicated by the lightning sign in the figure), then after repair the system retrieves the data of the last FB (i.e., the third backup) as well as the data of the last two IBs (i.e., the fourth and fifth backups). Then the system resumes the mission task from the step following the fifth backup. In Figs. 1, 2, 3 and 4 the dashed arrow illustrates the system performance after the $j$th repair.

If the IB storage is unavailable and the FB storage is available, the system retrieves the data of the $\lambda(i)$th backup and continues the mission task from the step that immediately follows this backup procedure. This case is illustrated in Fig. 1B.

If the FB storage is unavailable and the IB storage is available, the system does not retrieve the backup data if $\lfloor i/p \rfloor > 0$ (Fig. 1C), or retrieves the data of the $i$th IB if no FB has been completed (i.e., when $\lambda(i) = 0$), and continues the mission task from the step that immediately follows the $i \cdot 1(\lambda(i) = 0)$-th backup procedure, where $1(\lambda(i) = 0)$ is a unity function that gives 1 if the condition $\lambda(i) = 0$ is TRUE and 0 otherwise.

Fig. 1. Examples of the mission continuation after a failure. A. Both FB and IB storages are available; B. Only FB storage is available; C. FB storage is unavailable.

Fig. 2. Example of $\langle H_j, X_j \rangle \to \langle H_{j+1}, X_{j+1} \rangle$ event transition when both IB and FB storages are unavailable.

When both IB and FB storages are unavailable, or when only the FB storage is unavailable and at least one FB has been completed before the failure, no data is retrieved and the mission task is performed from scratch after repair (Fig. 1C).

We assume that the data backup and retrieval times are proportional to the amount of the saved information. Each IB saves information corresponding to the $1/(K+1)$th part of the entire task, which is performed between two backups; each FB $i$ saves the information corresponding to the $i/(K+1)$th part of the entire task, which is performed between the mission beginning and this backup. Thus, the time needed to perform the $i$th backup is

$$
B_i = \begin{cases}
C_I/(K+1) + \varphi_I & \text{if } \lambda(i) < i, \text{ i.e., the backup is incremental}, \\
i\,C_F/(K+1) + \varphi_F & \text{if } \lambda(i) = i, \text{ i.e., the backup is full},
\end{cases}
\tag{1}
$$

where $\varphi_I, \varphi_F$ are the constant times needed to get access to the IB and FB data storages, and $C_I, C_F$ are the times needed to transfer the information generated during the entire mission to the IB and FB data storages. By definition $B_0 = B_{K+1} = 0$.

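The minimal time needed to complete the mission (given no failures happen) is

$$
T_{\min} = W + \sum_{i=1}^{K} B_i = W + \left( \frac{C_I}{K+1} + \varphi_I \right) \left( K - \lfloor K/p \rfloor \right) + \frac{p\,C_F \lfloor K/p \rfloor \left( \lfloor K/p \rfloor + 1 \right)}{2(K+1)} + \varphi_F \lfloor K/p \rfloor.
\tag{2}
$$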
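As a sanity check on Eqs. (1) and (2), the following Python sketch computes the backup durations and compares the direct sum of backup times with the closed form. The function and argument names (`phi_I`, `phi_F` for the storage access times, and so on) and the numeric values in the demo are illustrative assumptions, not data from the paper.

```python
def backup_time(i, K, p, C_I, C_F, phi_I, phi_F):
    """Duration B_i of the i-th backup, Eq. (1); B_0 = B_{K+1} = 0 by definition."""
    if i < 1 or i > K:
        return 0.0
    if i % p == 0:                      # every p-th backup is full
        return i * C_F / (K + 1) + phi_F
    return C_I / (K + 1) + phi_I        # all other backups are incremental


def t_min_sum(W, K, p, C_I, C_F, phi_I, phi_F):
    """Failure-free mission time: W plus the sum of all K backup durations."""
    return W + sum(backup_time(i, K, p, C_I, C_F, phi_I, phi_F)
                   for i in range(1, K + 1))


def t_min_closed(W, K, p, C_I, C_F, phi_I, phi_F):
    """Closed form of Eq. (2)."""
    n_fb = K // p                       # number of full backups
    return (W + (C_I / (K + 1) + phi_I) * (K - n_fb)
            + p * C_F * n_fb * (n_fb + 1) / (2 * (K + 1))
            + phi_F * n_fb)


if __name__ == "__main__":
    args = (100.0, 8, 3, 9.0, 18.0, 0.5, 1.0)   # W, K, p, C_I, C_F, phi_I, phi_F
    print(t_min_sum(*args), t_min_closed(*args))
```

The closed form follows because the FBs occur at $i = p, 2p, \ldots, \lfloor K/p \rfloor p$, so their transfer times form an arithmetic series.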

If both data storages are available, the data retrieval time after the $i$th completed backup is

$$
b_i = \lambda(i)\,Q_F/(K+1) + \zeta_F \cdot 1(\lambda(i) > 0) + (i - \lambda(i))\,Q_I/(K+1) + \zeta_I \cdot 1(\lambda(i) < i).
\tag{3}
$$

Here $\zeta_I, \zeta_F$ are the constant times needed to get access to the IB and FB data storages, and $Q_I, Q_F$ are the times needed to transfer the stored information generated during the entire mission from the IB and FB data storages.
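A minimal Python sketch of the retrieval time in Eq. (3) is given below. The names `zeta_I`, `zeta_F` (storage access times) and the numbers in the demo are illustrative assumptions; the schedule $K = 8$, $p = 3$ is that of Fig. 1.

```python
def retrieval_time(i, K, p, Q_I, Q_F, zeta_I, zeta_F):
    """Data retrieval time b_i after i completed backups, Eq. (3)."""
    last_fb = p * (i // p)              # index of the last completed FB
    total = 0.0
    if last_fb > 0:                     # FB part: data of the first last_fb work portions
        total += last_fb * Q_F / (K + 1) + zeta_F
    if last_fb < i:                     # IB part: the i - last_fb IBs made after that FB
        total += (i - last_fb) * Q_I / (K + 1) + zeta_I
    return total


if __name__ == "__main__":
    # failure after the fifth backup: one FB (third backup) plus two IBs are read back
    print(retrieval_time(5, 8, 3, Q_I=9.0, Q_F=18.0, zeta_I=0.5, zeta_F=1.0))
```

Each access time is paid only when the corresponding storage is actually read, which is the role of the two indicator factors in Eq. (3).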

The system time-to-failure distribution is known and determined by the cumulative distribution function (cdf) $F(t)$, which can be of arbitrary type. When the system fails, the repair/replacement procedure starts immediately and takes a fixed time $\delta$. The number of failures in a successful mission cannot exceed the value

$$
N = \lfloor (\tau - T_{\min})/\delta \rfloor,
\tag{4}
$$

which corresponds to the case when the system fails $N$ times immediately after each repair at the mission beginning and after the $N$th repair the system succeeds in completing the entire mission.
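As a worked illustration of Eq. (4), take the hypothetical values (not from the paper) of a deadline of 150, a failure-free mission time $T_{\min} = 129$ and a repair time of 5 time units:

```latex
N = \left\lfloor \frac{150 - 129}{5} \right\rfloor
  = \lfloor 4.2 \rfloor
  = 4 .
```

Four repairs consume 20 time units and leave 130, which still covers $T_{\min}$; a fifth repair would leave only 125, so the mission could no longer succeed.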

Following the repair model in [16], if the system had hazard rate $z(t)$ before repair, its hazard rate after repair is $z(\zeta t_0 + t)$, where $t_0$ and $t$ are the operation times before and after the repair respectively, and $\zeta$ is the repair efficiency coefficient that can vary from 0 (the system after repair is as good as new, which corresponds to the replacement by a brand new system or perfect repair) to 1 (the system is as bad as old, which corresponds to the minimal repair). The pdf $g(t_0, t)$ and cdf $G(t_0, t)$ of the system time-to-failure after the repair can be obtained as $g(t_0, t) = f(\zeta t_0 + t)/[1 - F(\zeta t_0)]$ and $G(t_0, t) = [F(\zeta t_0 + t) - F(\zeta t_0)]/[1 - F(\zeta t_0)]$.
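The after-repair pdf and cdf can be sketched in Python as follows, assuming a Weibull baseline distribution with scale `eta` and shape `beta`. The choice of Weibull here is only for illustration, since the model allows any baseline distribution, and the function names are not from the paper.

```python
import math


def weibull_cdf(t, eta, beta):
    """Baseline cdf F(t) of a Weibull distribution with scale eta and shape beta."""
    return 1.0 - math.exp(-((t / eta) ** beta)) if t > 0 else 0.0


def weibull_pdf(t, eta, beta):
    """Baseline pdf f(t); t = 0 with beta < 1 (infinite density) is not handled."""
    if t < 0:
        return 0.0
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))


def after_repair_pdf(t0, t, zeta, eta, beta):
    """g(t0, t) = f(zeta*t0 + t) / [1 - F(zeta*t0)]."""
    return weibull_pdf(zeta * t0 + t, eta, beta) / (1.0 - weibull_cdf(zeta * t0, eta, beta))


def after_repair_cdf(t0, t, zeta, eta, beta):
    """G(t0, t) = [F(zeta*t0 + t) - F(zeta*t0)] / [1 - F(zeta*t0)]."""
    f0 = weibull_cdf(zeta * t0, eta, beta)
    return (weibull_cdf(zeta * t0 + t, eta, beta) - f0) / (1.0 - f0)


if __name__ == "__main__":
    # perfect repair (zeta = 0) versus minimal repair (zeta = 1), increasing hazard
    print(after_repair_cdf(40.0, 10.0, 0.0, 100.0, 2.0))
    print(after_repair_cdf(40.0, 10.0, 1.0, 100.0, 2.0))
```

With `zeta = 0` the functions reduce to the baseline distribution, and with `beta = 1` (exponential case) the repair efficiency has no effect because the hazard rate is constant.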

Having the repair and data backup and retrieval times, the backup frequencies and the baseline system time-to-failure distribution $F(t)$, one can find the expected mission completion time $E(K, p)$ and the mission success probability $R(K, p)$, defined as the probability that the system completes the mission within time $\tau$. The backup schedule optimization problem consists of finding the IB and FB frequency parameters $K$ and $p$ that maximize the mission success probability:

$$
(K, p) = \arg\max R(K, p).
\tag{5}
$$

The following assumptions are made:

1) The fault detection mechanism of the system is perfect.

Fig. 3. Example of $\langle H_j, X_j \rangle \to \langle H_{j+1}, X_{j+1} \rangle$ event transition when both IB and FB storages are available.

Fig. 4. Example of $\langle H_j, X_j \rangle \to \langle H_{j+1}, X_{j+1} \rangle$ event transition when only FB storage is available.

2) The backup data can always be saved (an available storage can always be found prior to any backup procedure), but can be lost with a fixed probability.

3) The backup storage does not fail during data backup and retrieval processes.

4) The availabilities of the FB and IB storages are statistically independent. This is valid when the two types of backups are made to physically independent or standalone storage space.

5) The mission task is performed evenly in time, i.e., during equal time intervals the system performs equal portions of work. Many safety-critical real-time systems approximate this assumption well, because their tasks have to be designed with predictable timing to meet deadlines (e.g., fly-by-wire flight computing systems [49], [50]).
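To make the model of this section concrete, the following Python sketch estimates the mission success probability by direct simulation and then compares a few candidate schedules. It is one reading of the model under stated assumptions, not the paper's numerical algorithm or the authors' simulation code: the baseline distribution is taken to be Weibull, storage availabilities are sampled independently at every failure, the calendar time of a failure is used as the operation time before repair, and all parameter names and values are hypothetical.

```python
import math
import random


def simulate_success(K, p, prm, trials=5000, seed=1):
    """Monte Carlo estimate of the mission success probability R(K, p).

    prm is a dict with illustrative keys: W, tau, C_I, C_F, phi_I, phi_F, Q_I, Q_F,
    zeta_I, zeta_F, delta (> 0), eps_F, eps_I, eta, beta (Weibull), zeta (repair efficiency).
    """
    rng = random.Random(seed)
    w = prm["W"] / (K + 1)

    def B(i):  # backup time; zero for i = 0 and for the dummy backup K + 1
        if i < 1 or i > K:
            return 0.0
        if i % p == 0:
            return i * prm["C_F"] / (K + 1) + prm["phi_F"]
        return prm["C_I"] / (K + 1) + prm["phi_I"]

    def b(i):  # data retrieval time after i completed backups
        last_fb = p * (i // p)
        fb = last_fb * prm["Q_F"] / (K + 1) + prm["zeta_F"] if last_fb > 0 else 0.0
        ib = (i - last_fb) * prm["Q_I"] / (K + 1) + prm["zeta_I"] if last_fb < i else 0.0
        return fb + ib

    def time_to_failure(t0):  # inverse-cdf sample from G(t0, .), Weibull baseline
        shift = prm["zeta"] * t0
        a = (shift / prm["eta"]) ** prm["beta"]
        u = 1.0 - rng.random()  # uniform on (0, 1]
        return max(0.0, prm["eta"] * (a - math.log(u)) ** (1.0 / prm["beta"]) - shift)

    successes = 0
    for _ in range(trials):
        clock, start = 0.0, 0  # start of the current operating period, restart index
        life = time_to_failure(0.0)
        while True:
            need, done = b(start), start  # retrieval first, then work and backups
            while done <= K and need + w + B(done + 1) <= life:
                need += w + B(done + 1)
                done += 1
            if done == K + 1:  # all work finished before the next failure
                successes += (clock + need <= prm["tau"])
                break
            fail_time = clock + life
            clock = fail_time + prm["delta"]  # fixed repair time
            if clock >= prm["tau"]:
                break
            fb_ok = rng.random() < prm["eps_F"]
            ib_ok = rng.random() < prm["eps_I"]
            if fb_ok and ib_ok:
                start = done  # resume after the last completed backup
            elif fb_ok:
                start = p * (done // p)  # resume after the last completed FB
            elif ib_ok and done < p:
                start = done  # no FB made yet, the IB data suffice
            else:
                start = 0  # restart from scratch
            life = time_to_failure(fail_time)
    return successes / trials


if __name__ == "__main__":
    prm = dict(W=100.0, tau=180.0, C_I=9.0, C_F=18.0, phi_I=0.5, phi_F=1.0,
               Q_I=9.0, Q_F=18.0, zeta_I=0.5, zeta_F=1.0, delta=5.0,
               eps_F=0.95, eps_I=0.9, eta=150.0, beta=1.5, zeta=0.5)
    candidates = [(K, p) for K in range(1, 7) for p in range(1, K + 1)]
    best = max(candidates, key=lambda kp: simulate_success(*kp, prm, trials=1000))
    print("best (K, p) among candidates:", best)
```

The exhaustive comparison at the end is a brute-force stand-in for the optimization problem of Eq. (5); with a noisy estimator the reported schedule can change with the seed and the number of trials.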

3 DETERMINING MISSION SUCCESS PROBABILITY AND EXPECTED COMPLETION TIME

Let $\langle H_j, X_j \rangle$ be a random event when the $j$th system failure happens at time $X_j$ and the index of the last backup completed before this failure is $H_j$, and let $q_j(h, x)$ be the function that describes the joint distribution of random values $H_j$ and $X_j$. This function equals the conditional pdf of the $j$th system failure time $X_j$ given $H_j = h$. Having $q_j(h, x)$ for $j = 1, \ldots, N$ and the system time-to-failure cdf, one can obtain the probability that the system completes the mission given event $\langle H_j, X_j \rangle$ with any possible realizations $H_j = h$, $X_j = x$. The recursive algorithm for determining $q_j(h, x)$ for $j = 1, \ldots, N$ based on analysis of random event transitions is presented in the following section. The numerical realization of the algorithm is presented in Section 4.

3.1 Recursive Determination of jdf $q_j(h, x)$

For the first failure,

$$
q_1(h, x) = \begin{cases}
f(x) & \text{if } x_{\min}(h) \le x < \min\left(\tau, x_{\min}(h+1)\right), \\
0 & \text{otherwise},
\end{cases}
\tag{6}
$$

where $x_{\min}(h) = hw + \sum_{i=0}^{h} B_i$ is the minimal time needed to complete the $h$th backup.

Given the system operates during time $t$ between the $j$th and $(j+1)$th failures, one can distinguish the following four different $\langle H_j, X_j \rangle \to \langle H_{j+1}, X_{j+1} \rangle$ event transition scenarios depending on the availability of the backup storages.

1. Both IB and FB storages are unavailable. In the case of failure the system accomplishes the mission task from scratch without retrieving the backup data. The system transits to the event $\langle H_{j+1}, X_{j+1} \rangle$ from events $\langle H_j, X_j \rangle$ with any value $H_j$ not exceeding $K$ if it operates during time not less than

$$
H_{j+1} w + \sum_{i=0}^{H_{j+1}} B_i,
\tag{7}
$$

needed to complete the $H_{j+1}$th backup starting from the mission beginning, but less than

$$
(H_{j+1} + 1) w + \sum_{i=0}^{H_{j+1}+1} B_i,
\tag{8}
$$

needed to complete the $(H_{j+1}+1)$th backup. Fig. 2 illustrates an example of event transition from $\langle H_j, X_j \rangle$ to $\langle H_{j+1}, X_{j+1} \rangle$ when both IB and FB storages are not available. The dashed arrow in Fig. 2 illustrates the system operation after the $j$th repair. Notice that $H_{j+1}$ can be less than, greater than, or equal to $H_j$ because after the $j$th repair the system performs the mission task from scratch and has to perform all the previously completed backups again.
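The thresholds in Eqs. (7) and (8) are the failure-free completion times of consecutive backups when the mission starts from scratch. A small Python sketch, with illustrative function names and hypothetical backup durations, tabulates these times and locates the number of backups completed after a given operating time:

```python
from bisect import bisect_right
from itertools import accumulate


def completion_times(W, K, backup_times):
    """Time to finish backup h from scratch, h = 0..K+1: h*w + B_0 + ... + B_h.

    backup_times[i] is B_i for i = 0..K+1, with B_0 = B_{K+1} = 0.
    """
    w = W / (K + 1)
    sums = list(accumulate(backup_times))  # sums[h] = B_0 + ... + B_h
    return [h * w + sums[h] for h in range(K + 2)]


def backups_done_from_scratch(t, x_min):
    """Largest h whose completion time does not exceed the operating time t >= 0."""
    return bisect_right(x_min, t) - 1


if __name__ == "__main__":
    B = [0, 1.5, 1.5, 7, 1.5, 1.5, 13, 1.5, 1.5, 0]  # hypothetical B_0..B_9 for K = 8
    x_min = completion_times(100.0, 8, B)
    print(backups_done_from_scratch(43.4, x_min))
```

A returned value of $K + 1$ means that the operating time was long enough to reach the dummy backup, i.e., to finish the whole mission.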

For any realization of time $t$ of the system operation between the $j$th and $(j+1)$th failures, $X_{j+1} = X_j + \delta + t$. Thus, if the $(j+1)$th failure occurs at time $X_{j+1}$, the previous failure must have occurred at time $X_j = X_{j+1} - \delta - t$.

The probability that the system that failed at time $X_j$ fails after working less than time $t$ following the repair is $G(X_j, t)$. Thus, having $q_j(h, x)$ and $f(t)$ one can obtain the conditional jdf $q^{\bar{F}\bar{I}}_{j+1}(h, x)$ for $0 \le h \le K$ and $0 \le x \le \tau$ using the recursive equation

$$
q^{\bar{F}\bar{I}}_{j+1}(h, x) = \sum_{n=0}^{K} \int_{hw + \sum_{i=0}^{h} B_i}^{(h+1)w + \sum_{i=0}^{h+1} B_i} q_j(n, x - \delta - t)\, g(x - \delta - t, t)\, dt,
\tag{9}
$$

where $q_j(h, x) = 0$ if $x < 0$ for any $j$ by definition.

2. Both IB and FB storages are available. The data generated by $H_j$ backups is retrieved and the mission is continued from the step following this backup (see Fig. 3).

The system can transit to the event $\langle H_{j+1}, X_{j+1} \rangle$ only from events $\langle H_j, X_j \rangle$ with $0 \le H_j \le H_{j+1}$. For any given pair of $H_{j+1}$ and $H_j$ the system must operate during time not less than

$$
1(H_j \ne H_{j+1}) \cdot b_{H_j} + (H_{j+1} - H_j) w + \sum_{i=H_j+1}^{H_{j+1}} B_i,
\tag{10}
$$

needed to complete the $H_{j+1}$th backup starting from the step immediately following the $H_j$th backup after the data retrieval, but less than

$$
b_{H_j} + (H_{j+1} - H_j + 1) w + \sum_{i=H_j+1}^{H_{j+1}+1} B_i
\tag{11}
$$

needed to complete the $(H_{j+1}+1)$th backup. The multiplier $1(H_j \ne H_{j+1})$ is used to take into account the fact that when the system fails before the completion of the data retrieval, no further backups are performed and $H_{j+1} = H_j$.
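The two bounds of Eqs. (10) and (11) define, for each pair of backup indices, an interval of post-repair operating times. The Python sketch below computes that interval; the function name and the numeric values in the demo are hypothetical.

```python
def window_both_available(n, h, w, B, b):
    """Operating-time interval [lo, hi) after a repair that leads from H_j = n
    to H_{j+1} = h when both storages are available (Eqs. (10) and (11)).

    B[i] are backup times for i = 0..K+1 and b[i] retrieval times for i = 0..K.
    """
    if not 0 <= n <= h:
        raise ValueError("requires 0 <= n <= h")
    # for n == h the lower bound is zero: the system may fail during retrieval
    lo = (b[n] if n != h else 0.0) + (h - n) * w + sum(B[n + 1:h + 1])
    hi = b[n] + (h - n + 1) * w + sum(B[n + 1:h + 2])
    return lo, hi


if __name__ == "__main__":
    w = 10.0
    B = [0, 1, 1, 4, 0]   # hypothetical B_0..B_4 for K = 3
    b = [0, 2, 3, 5]      # hypothetical b_0..b_3
    print(window_both_available(1, 3, w, B, b))
    print(window_both_available(2, 2, w, B, b))
```

For a fixed starting index the intervals of consecutive target indices are adjacent, so every operating time shorter than the remaining mission maps to exactly one value of $H_{j+1}$.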

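Thus, the conditional jdf $q^{FI}_{j+1}(h, x)$ for $0 \le h \le K$ and $0 \le x \le \tau$ can be obtained using the recursive equation

$$
q^{FI}_{j+1}(h, x) = \sum_{n=0}^{h} \int_{1(n \ne h) \cdot b_n + (h-n)w + \sum_{i=n+1}^{h} B_i}^{b_n + (h-n+1)w + \sum_{i=n+1}^{h+1} B_i} q_j(n, x - \delta - t)\, g(x - \delta - t, t)\, dt,
\tag{12}
$$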

where $\sum_{i=a}^{b} B_i = 0$ if $a > b$ by definition.

3. Only FB storage is available. When the $j$th system failure happens after completion of the $H_j$th backup, the system can retrieve data from the last completed FB, which has index $\lambda(H_j)$ (see Fig. 4). The system can transit to the event $\langle H_{j+1}, X_{j+1} \rangle$ from events $\langle H_j, X_j \rangle$ with $H_j > H_{j+1}$; however, the inequality $0 \le H_j < \lambda(H_{j+1}) + p$ must hold. Otherwise the data of the $(\lambda(H_{j+1}) + p)$th backup, which is an FB, is available after the $j$th failure and there is no need to retrieve the data from the earlier $H_{j+1}$th backup. For any given pair of $H_{j+1}$ and $H_j$ the system must operate for a time not less than

$$
1(\lambda(H_j) \ne H_{j+1}) \cdot b_{\lambda(H_j)} + (H_{j+1} - \lambda(H_j)) w + \sum_{i=\lambda(H_j)+1}^{H_{j+1}} B_i,
\tag{13}
$$

needed to complete the $H_{j+1}$th backup starting from the step immediately following the $\lambda(H_j)$th backup after the data retrieval, but less than time

$$
b_{\lambda(H_j)} + (H_{j+1} - \lambda(H_j) + 1) w + \sum_{i=\lambda(H_j)+1}^{H_{j+1}+1} B_i,
\tag{14}
$$

needed to complete the Hjþ1 backup. Thus, the conditional

jdf qF�I

jþ1ðh; xÞ for 0 � h � K and 0 � x � t can be obtained

using the recursive equation

qF�I

jþ1ðh; xÞ

¼X�ðhÞþp�1

n¼0

Z b�ðnÞþðh��ðnÞþ1ÞwþPhþ1

i¼�ðnÞþ1Bi

1ð�ðnÞ6¼hÞb�ðnÞþðh��ðnÞÞwþPh

i¼�ðnÞþ1Bi

qjðn; x� d� tÞg x�d�t; tð Þdt:

(15)

4. Only IB storage is available. When the failure happens between the $H_j$th and $(H_j+1)$th completed backups and $H_j \ge \pi$, the system cannot retrieve data that was stored on the unavailable FB storage and resumes the mission task from scratch. Thus, for $H_j \ge \pi$ Eq. (9) can be applied for determining the jdf $q^{\bar{F}I}_{j+1}(h, x)$. When $H_j < \pi$, no FBs are completed and the system can retrieve data from the last completed IB. Thus, for $H_j < \pi$ Eq. (12) can be applied for determining the jdf $q^{\bar{F}I}_{j+1}(h, x)$. Summarizing, the conditional jdf $q^{\bar{F}I}_{j+1}(h, x)$ for $0 \le h \le K$ and $0 \le x \le \tau$ can be obtained using the following recursive equation:

$$\begin{aligned}
q^{\bar{F}I}_{j+1}(h, x) ={}& \sum_{n=\pi}^{K} \int_{hw + \sum_{i=0}^{h} B_i}^{\,(h+1)w + \sum_{i=0}^{h+1} B_i} q_j(n, x - \delta - t)\, g(x - \delta - t, t)\, dt \\
&+ \sum_{n=0}^{\min(h,\,\pi-1)} \int_{1(n < h)\cdot b_n + (h-n)w + \sum_{i=n+1}^{h} B_i}^{\,b_n + (h-n+1)w + \sum_{i=n+1}^{h+1} B_i} q_j(n, x - \delta - t)\, g(x - \delta - t, t)\, dt.
\end{aligned} \tag{16}$$

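Having the availabilities of the FB ($\varepsilon_F$) and IB ($\varepsilon_I$) storages one can obtain the unconditional jdf $q_{j+1}(h, x)$ as

$$q_{j+1}(h, x) = \varepsilon_F \varepsilon_I\, q^{FI}_{j+1}(h, x) + (1 - \varepsilon_F)\varepsilon_I\, q^{\bar{F}I}_{j+1}(h, x) + \varepsilon_F (1 - \varepsilon_I)\, q^{F\bar{I}}_{j+1}(h, x) + (1 - \varepsilon_F)(1 - \varepsilon_I)\, q^{\bar{F}\bar{I}}_{j+1}(h, x). \tag{17}$$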
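The weighting in (17) is a mixture over the four storage-availability scenarios, so the four weights must sum to one. The following minimal sketch illustrates this, assuming that FB and IB storage availabilities are independent, as the product form of (17) implies. The dictionary keys and function names are illustrative only; the availabilities 0.98 and 0.7 are the values used in the illustrative example of Section 5.

```python
def scenario_weights(eps_f, eps_i):
    """Probabilities of the four storage scenarios used in (17)."""
    return {
        "FB and IB available": eps_f * eps_i,
        "only IB available": (1.0 - eps_f) * eps_i,
        "only FB available": eps_f * (1.0 - eps_i),
        "none available": (1.0 - eps_f) * (1.0 - eps_i),
    }


def mix(weights, conditional_values):
    """Unconditional value as the weighted sum of conditional values."""
    return sum(weights[name] * conditional_values[name] for name in weights)


weights = scenario_weights(0.98, 0.7)
for name, value in weights.items():
    print(f"{name}: {value:.3f}")
```

With these availabilities the scenario with both storages available carries most of the weight, while the scenario with no storage available contributes well under one percent.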

Note that the availabilities of FB and IB storages are given as fixed values in this work. Refer to [48] for models that evaluate the availability of a storage system using different data protection techniques such as full backup, incremental backup and remote mirroring.

3.2 Evaluating the Mission Success Probability and Expected Completion Time

Let $Y_j$ be the event when the system completes the mission task at time not exceeding $\tau$ given exactly $j$ failures happen during the mission. The entire mission success probability can be obtained as a sum of probabilities of mutually exclusive events $Y_0, Y_1, \ldots, Y_N$.

Without any failure (and thus without repair), the system needs time $T_{\min}$, determined in (2), to complete the entire task. Thus

$$r_0 = \Pr\{Y_0\} = 1 - F(T_{\min}). \tag{18}$$

Given the event $\langle H_j, X_j\rangle$, the system can complete the remaining amount of work by operating during time

$$T_{rem}(\chi) = b_\chi + T_{\min} - \chi w - \sum_{i=0}^{\chi} B_i = b_\chi + (K + 1 - \chi)\,w + \sum_{i=\chi+1}^{K} B_i, \tag{19}$$

where $\chi$ is the number/index of the last available backup after the $j$th failure. When IB and FB storages are available, $\chi = H_j$; when only FB storage is available, $\chi = \lambda(H_j)$; when only IB storage is available, $\chi = H_j \cdot 1(H_j < \pi)$; and when both IB and FB storages are unavailable, $\chi = 0$.
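The case analysis for the last available backup and the remaining time (19) can be sketched as follows. This is an illustration under stated assumptions: the index of the last completed FB is taken to be the largest multiple of the FB period not exceeding the number of completed backups (consistent with the example in Section 5, where every second backup is an FB), and the function names, list layout and numeric durations are hypothetical.

```python
def last_fb_index(h, fb_period):
    """Index of the last completed FB after h backups.

    ASSUMPTION: every fb_period-th backup is a full backup, so the last
    FB is the largest multiple of fb_period that does not exceed h.
    """
    return (h // fb_period) * fb_period


def last_available_backup(h, fb_period, fb_ok, ib_ok):
    """Index of the backup the system can restore from after a failure."""
    if fb_ok and ib_ok:
        return h
    if fb_ok:
        return last_fb_index(h, fb_period)
    if ib_ok:
        # IBs are usable only while no FB has been taken yet.
        return h if h < fb_period else 0
    return 0


def remaining_time(chi, w, K, B, b):
    """Remaining operation time (19); B[0] and b[0] are zero padding."""
    return b[chi] + (K + 1 - chi) * w + sum(B[chi + 1:K + 1])


# Hypothetical durations for K = 3 backups with an FB period of 2.
K, fb_period, w = 3, 2, 12.5
B = [0.0, 1.0, 2.0, 1.0]
b = [0.0, 0.5, 1.0, 0.5]
for fb_ok, ib_ok in [(True, True), (True, False), (False, True), (False, False)]:
    chi = last_available_backup(3, fb_period, fb_ok, ib_ok)
    print(fb_ok, ib_ok, chi, remaining_time(chi, w, K, B, b))
```

For index 0 the remaining time equals the failure-free mission time, since the whole task must be redone.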

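The operation time remaining after the event $\langle H_j, X_j\rangle$ and the subsequent repair is $\tau - X_j - \delta$. Thus, the mission can be completed after $j$ failures only if $T_{rem}(\chi) \le \tau - X_j - \delta$ and the system does not fail during time $T_{rem}(\chi)$ after the $j$th repair. Thus, having jdf $q_j(h, x)$ and cdf $F(t)$ one can obtain the probability of mission completion after $j$ failures as

$$\begin{aligned}
r_j ={}& \sum_{h=0}^{K} \int_{x_{\min}(h)}^{\tau - T_{rem}(\chi) - \delta} q_j(h, x)\,[1 - G(x, T_{rem}(\chi))]\, dx \\
={}& \varepsilon_F \varepsilon_I \sum_{h=0}^{K} \int_{x_{\min}(h)}^{\tau - b_h - (K+1-h)w - \sum_{i=h+1}^{K} B_i - \delta} q_j(h, x) \left[1 - G\!\left(x,\; b_h + (K+1-h)w + \sum_{i=h+1}^{K} B_i\right)\right] dx \\
&+ \varepsilon_F (1 - \varepsilon_I) \sum_{h=0}^{K} \int_{x_{\min}(h)}^{\tau - b_{\lambda(h)} - (K+1-\lambda(h))w - \sum_{i=\lambda(h)+1}^{K} B_i - \delta} q_j(h, x) \left[1 - G\!\left(x,\; b_{\lambda(h)} + (K+1-\lambda(h))w + \sum_{i=\lambda(h)+1}^{K} B_i\right)\right] dx \\
&+ \varepsilon_I (1 - \varepsilon_F) \sum_{h=0}^{\pi-1} \int_{x_{\min}(h)}^{\tau - b_h - (K+1-h)w - \sum_{i=h+1}^{K} B_i - \delta} q_j(h, x) \left[1 - G\!\left(x,\; b_h + (K+1-h)w + \sum_{i=h+1}^{K} B_i\right)\right] dx \\
&+ \varepsilon_I (1 - \varepsilon_F) \sum_{h=\pi}^{K} \int_{x_{\min}(h)}^{\tau - W - \sum_{i=1}^{K} B_i - \delta} q_j(h, x) \left[1 - G\!\left(x,\; W + \sum_{i=1}^{K} B_i\right)\right] dx \\
&+ (1 - \varepsilon_I)(1 - \varepsilon_F) \sum_{h=0}^{K} \int_{x_{\min}(h)}^{\tau - W - \sum_{i=1}^{K} B_i - \delta} q_j(h, x) \left[1 - G\!\left(x,\; W + \sum_{i=1}^{K} B_i\right)\right] dx.
\end{aligned} \tag{20}$$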


The total mission success probability is

$$R = \sum_{j=0}^{N} r_j. \tag{21}$$

The conditional expected mission time, given the mission is completed after $j$ failures, can be determined for $1 \le j \le N$ as

$$e_j = \frac{1}{r_j} \sum_{h=0}^{K} \int_{x_{\min}(h)}^{\tau - T_{rem}(\chi) - \delta} q_j(h, x)\,[1 - G(x, T_{rem}(\chi))]\,(x + T_{rem}(\chi))\, dx, \tag{22}$$

which can be expanded over the storage availability scenarios similarly to (20).

When the mission is completed without failures, the total mission time is $e_0 = T_{\min}$. The probability of such an event is $r_0 = 1 - F(T_{\min})$. Thus the total expected mission completion time is

$$E = \sum_{j=0}^{N} r_j e_j = T_{\min}\,(1 - F(T_{\min})) + \sum_{j=1}^{N} \sum_{h=0}^{K} \int_{x_{\min}(h)}^{\tau - T_{rem}(\chi) - \delta} q_j(h, x)\,[1 - G(x, T_{rem}(\chi))]\,(x + T_{rem}(\chi))\, dx. \tag{23}$$

4 DISCRETE NUMERICAL ALGORITHM

To obtain the mission success probability $R$ and expected completion time $E$ numerically, we divide the maximum allowable mission time $\tau$ into $m$ equal intervals with duration $\Delta = \tau/m$ such that for $i = 0, \ldots, m$ interval $i$ begins at time $i\Delta$ and ends at time $(i+1)\Delta$. The repair takes $r = \delta/\Delta$ time intervals.

Having the baseline cdf $F(t)$ for the time-to-failure of the system, one can obtain the probability that it fails before operating $i$ intervals after repair given it operated $v$ intervals before the repair as $G(\Delta v, \Delta i) = [F(\Delta(\zeta v + i)) - F(\Delta \zeta v)]/[1 - F(\Delta \zeta v)]$, and the probability that it fails in the $i$th interval after the repair given it operated $v$ intervals before the repair as $g^*(v, i) = G(\Delta v, \Delta(i+1)) - G(\Delta v, \Delta i) = [F(\Delta(\zeta v + i + 1)) - F(\Delta(\zeta v + i))]/[1 - F(\Delta \zeta v)]$.
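The two discretized quantities above can be sketched in a few lines. The functions below take interval counts rather than times as arguments, and the names `weibull_cdf`, `cond_fail_prob` and `interval_fail_prob` are illustrative; a Weibull cdf is used only as an example of a baseline distribution, with the parameter values of the example in Section 5.

```python
import math


def weibull_cdf(t, eta, beta):
    """Weibull cdf with scale eta and shape beta."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def cond_fail_prob(v, i, dt, zeta, cdf):
    """Probability of failing within i intervals after repair, given
    v intervals of operation before it (repair efficiency zeta)."""
    age = zeta * v  # effective age, in intervals, right after the repair
    survived = 1.0 - cdf(dt * age)
    return (cdf(dt * (age + i)) - cdf(dt * age)) / survived


def interval_fail_prob(v, i, dt, zeta, cdf):
    """Probability of failing exactly in the i-th interval after repair."""
    return cond_fail_prob(v, i + 1, dt, zeta, cdf) - cond_fail_prob(v, i, dt, zeta, cdf)


# Example: allowed time 150 split into 1500 intervals, repair time 8.
tau, m, delta = 150.0, 1500, 8.0
dt = tau / m
repair_intervals = round(delta / dt)
cdf = lambda t: weibull_cdf(t, eta=150.0, beta=1.2)
print(dt, repair_intervals)
print(cond_fail_prob(100, 200, dt, 0.5, cdf))
```

Summing `interval_fail_prob` over the first `i` intervals reproduces `cond_fail_prob` for `i` intervals, which is a convenient sanity check on an implementation.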

The jdf $q_j(h, x)$ can be approximated by a discrete matrix $Q_j(h, v)$ for $h = 0, \ldots, K$ and $v = 0, \ldots, m-1$ representing the probability that the $j$th failure occurs in time interval $v$ and the system completed $h$ backups before this failure.

According to (6) the nonzero elements of the matrix $Q_1$ can be obtained using the following procedure:

1. $k = w + B_1$; $n = 0$;
2. For $i = 0, \ldots, m-1$:
   2.1. If ($\Delta i > k$) then ($n = n + 1$; $k = k + w + B_{n+1}$).
   2.2. If ($n \le K$) then $Q_1(n, i) = g^*(0, i)$.

Having $Q_{j-1}(h, v)$ for $j = 2, \ldots, N$ one can obtain $Q_j(h, v)$ using the following procedure.

1. Set $Q_j(h, v) = 0$ for $h = 0, \ldots, K$; $v = 0, \ldots, m$;
2. For $h = 0, \ldots, K$:
   2.1. For $v = x_{\min}(h)/\Delta, \ldots, m-1$:
      2.1.1. $k = w + B_1$; $n = 0$;
      2.1.2. $s = w + B_{h+1}$; $a = h$;
      2.1.3. $u = w + B_{\lambda(h)+1}$; $c = \lambda(h)$;
      2.1.4. For $i = 0, \ldots, m - v - r - 1$:
         2.1.4.1. If ($\Delta i > k$) then ($n = n + 1$; $k = k + w + B_{n+1}$).
         2.1.4.2. If ($\Delta i - b_h > s$) then ($a = a + 1$; $s = s + w + B_{a+1}$).
         2.1.4.3. If ($\Delta i - b_{\lambda(h)} > u$) then ($c = c + 1$; $u = u + w + B_{c+1}$).
         2.1.4.4. $p = v + r + i$;
         2.1.4.5. If ($p < m$) then:
            2.1.4.5.1. If ($n \le K$) then $Q_j(n, p) = Q_j(n, p) + (1 - \varepsilon_F)(1 - \varepsilon_I)\,Q_{j-1}(h, v)\,g^*(v, i)$;
            2.1.4.5.2. If ($a \le K$) then $Q_j(a, p) = Q_j(a, p) + \varepsilon_F \varepsilon_I\,Q_{j-1}(h, v)\,g^*(v, i)$;
            2.1.4.5.3. If ($c \le K$) then $Q_j(c, p) = Q_j(c, p) + \varepsilon_F (1 - \varepsilon_I)\,Q_{j-1}(h, v)\,g^*(v, i)$;
            2.1.4.5.4. If ($h \ge \pi$ and $n \le K$) then $Q_j(n, p) = Q_j(n, p) + (1 - \varepsilon_F)\varepsilon_I\,Q_{j-1}(h, v)\,g^*(v, i)$;
            2.1.4.5.5. If ($h < \pi$ and $a \le K$) then $Q_j(a, p) = Q_j(a, p) + (1 - \varepsilon_F)\varepsilon_I\,Q_{j-1}(h, v)\,g^*(v, i)$;

In the above pseudo code $n$, $a$ and $c$ represent the number of the last backup completed when the system operated during $i$ intervals between the $(j-1)$th and $j$th failures given it resumed the operation from the mission beginning, from the last backup data retrieval and from the last FB data retrieval, respectively. $k$, $s$ and $u$ represent the times of the mission task execution needed to complete each next backup starting from the mission beginning, from the last backup data retrieval and from the last FB data retrieval, respectively.
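The bookkeeping with a backup counter and a running completion time is easiest to see in the simpler first-failure procedure for $Q_1$ given earlier in this section. The sketch below follows that procedure under a few assumptions that are not stated in the text: the backup durations are passed as a list padded with zeros at indices 0 and K+1, the loop stops once the mission work is finished, and a Weibull cdf stands in for the baseline distribution. All function and parameter names are illustrative.

```python
import math


def weibull_cdf(t, eta, beta):
    """Weibull cdf with scale eta and shape beta."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def first_failure_matrix(m, dt, w, B, K, cdf):
    """Return Q1 as a (K+1) x m nested list.

    Q1[n][i] approximates the probability that the first failure falls
    in interval i while exactly n backups have been completed.
    B[i] is the duration of backup i; B[0] and B[K+1] are zero padding
    (an assumption of this sketch).
    """
    Q1 = [[0.0] * m for _ in range(K + 1)]
    n = 0
    k = w + B[1]  # operating time needed to complete the first backup
    for i in range(m):
        if dt * i > k:  # step 2.1: one more backup has been completed
            n += 1
            if n > K:  # the whole mission is done, nothing left to record
                break
            k += w + B[n + 1]
        # step 2.2: with no previous operation, g*(0, i) is a cdf increment
        Q1[n][i] = cdf(dt * (i + 1)) - cdf(dt * i)
    return Q1


cdf = lambda t: weibull_cdf(t, eta=150.0, beta=1.2)
Q1 = first_failure_matrix(m=20, dt=0.5, w=1.0, B=[0.0, 0.5, 0.0], K=1, cdf=cdf)
print([round(x, 5) for x in Q1[0][:6]])
print([round(x, 5) for x in Q1[1][:6]])
```

The same counter-and-threshold pattern is repeated three times in the procedure for later failures, once for each possible restart point.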

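Having $Q_j(h, v)$ and $F(t)$ one can obtain $r_j$ and $e_j$ using the following procedure based on (20) and (22).

1. $r_j = e_j = 0$.
2. $\gamma = T_{\min} + w$.
3. For $h = 0, \ldots, K$:
   3.1. $\gamma = \gamma - w - B_h$;
   3.2. If ($h = \lambda(h)$) then $\sigma = \gamma$.
   3.3. For $v = x_{\min}(h)/\Delta, \ldots, m-1$:
      3.3.1. If ($\Delta v + \delta + T_{\min} \le \tau$) then $\{\mu = (1 - \varepsilon_F)(1 - \varepsilon_I)\,Q_j(h, v)\,[1 - G(\Delta v, T_{\min})]$; $r_j = r_j + \mu$; $e_j = e_j + \mu(\Delta v + \delta + T_{\min})\}$.
      3.3.2. If ($\Delta v + \delta + b_h + \gamma \le \tau$) then $\{\mu = \varepsilon_F \varepsilon_I\,Q_j(h, v)\,[1 - G(\Delta v, \gamma + b_h)]$; $r_j = r_j + \mu$; $e_j = e_j + \mu(\Delta v + \delta + \gamma + b_h)\}$.
      3.3.3. If ($\Delta v + \delta + \sigma + b_{\lambda(h)} \le \tau$) then $\{\mu = \varepsilon_F (1 - \varepsilon_I)\,Q_j(h, v)\,[1 - G(\Delta v, \sigma + b_{\lambda(h)})]$; $r_j = r_j + \mu$; $e_j = e_j + \mu(\Delta v + \delta + \sigma + b_{\lambda(h)})\}$.
      3.3.4. If ($h < \pi$ and $\Delta v + \delta + b_h + \gamma \le \tau$) then $\{\mu = (1 - \varepsilon_F)\varepsilon_I\,Q_j(h, v)\,[1 - G(\Delta v, \gamma + b_h)]$; $r_j = r_j + \mu$; $e_j = e_j + \mu(\Delta v + \delta + \gamma + b_h)\}$.
      3.3.5. If ($h \ge \pi$ and $\Delta v + \delta + T_{\min} \le \tau$) then $\{\mu = (1 - \varepsilon_F)\varepsilon_I\,Q_j(h, v)\,[1 - G(\Delta v, T_{\min})]$; $r_j = r_j + \mu$; $e_j = e_j + \mu(\Delta v + \delta + T_{\min})\}$.
4. $e_j = e_j / r_j$.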

In the above pseudo code $T_{\min}$, $\gamma$ and $\sigma$ represent the operation time needed to complete the mission after the $(j-1)$th failure given the operation is resumed from the mission beginning, from the last completed backup data retrieval and from the last completed FB data retrieval, respectively.

Summarizing, we get the algorithm that consecutively obtains matrices $Q_j$ for $j = 1, \ldots, N$ and for each matrix $Q_j$


calculates the corresponding $r_j$ and $e_j$, which are added to $R$ and $E$ respectively. The full pseudo-code of the algorithm is presented in the Appendix.

From the pseudo-code it follows that the algorithm complexity is less than $O(NK/\Delta^2)$. For obtaining each matrix $Q_j(h, v)$, only matrix $Q_{j-1}(h, v)$ is needed. Thus the algorithm only needs the memory required for keeping two matrices of size $m \times (K + 1)$.

5 ILLUSTRATIVE EXAMPLES

Consider a real-time onboard system health management (SHM) component for aerospace applications. The SHM needs to process information or signals from various system components within a certain time so as to detect and diagnose abnormal behavior of the system in a timely manner [51]. Suppose the mission performed by the SHM has parameters of $W = 50$ and maximum allowed time $\tau = 150$. The parameters of the backup storages are $\varepsilon_F = 0.98$, $\varepsilon_I = 0.7$, $\zeta_I = \varphi_I = 0.5$, $\zeta_F = \varphi_F = 1$, $C_I = Q_I = 7$, $C_F = Q_F = 16$. The SHM has a Weibull time-to-failure distribution with scale parameter $\eta = 150$ and shape parameter $\beta = 1.2$. While our suggested numerical algorithm has no restriction on distribution types, the Weibull distribution is chosen here because of its wide application in reliability engineering as well as its flexibility in representing different failure rate behavior [43], [44]. The repair time and efficiency are respectively $\delta = 8$ and $\zeta = 0.5$.

For the backup strategy $K = 9$, $\pi = 4$ the minimal mission completion time is $T_{\min} = 79.6$, the maximal possible number of repairs is $N = 3$, and the mission success probability and expected completion time obtained using the proposed algorithm are $R = 0.977$ and $E = 90$ respectively.
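As a quick illustration of (18) with these numbers, the probability of completing the mission without any failure can be computed directly from the Weibull cdf. The snippet is a minimal sketch; the helper name `weibull_cdf` is illustrative, and the standard two-parameter Weibull form is assumed.

```python
import math


def weibull_cdf(t, eta, beta):
    """Two-parameter Weibull cdf with scale eta and shape beta."""
    return 1.0 - math.exp(-((t / eta) ** beta))


eta, beta = 150.0, 1.2  # scale and shape from the example
t_min = 79.6            # minimal completion time for K = 9, FB period 4

r0 = 1.0 - weibull_cdf(t_min, eta, beta)  # equation (18)
print(f"r0 = {r0:.4f}")
```

The result is about 0.63, so roughly a third of the overall success probability of 0.977 comes from missions that survive one or more failures thanks to repair and backups.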

To investigate the impact of the duration of a discrete time interval $\Delta$ on the accuracy of the obtained results, the values of $R$ and $E$ are obtained for different $\Delta$ ranging from 0.02 to 1. Fig. 5 presents the values of $R$ and $E$ as functions of $1/\Delta$ as well as the running time of the suggested algorithm on an Intel 3.2 GHz PC. It can be seen that as the value of $1/\Delta$ increases, the estimates of $R$ and $E$ converge. For example, the difference between the results obtained for $\Delta = 1$ and $\Delta = 0.02$ is 0.87 percent for $R$ and 0.22 percent for $E$. When comparing results obtained for $\Delta = 0.1$ and $\Delta = 0.02$, these differences drop to 0.08 percent and 0.03 percent, respectively.

To verify the correctness of the proposed algorithm, Monte Carlo simulations with $10^5$ samples are carried out in Matlab for the example system. Fig. 6 presents the results of 50 replications in comparison to the results generated by the proposed algorithm with $\Delta = 0.02$. The mean values of the simulation results are 0.9776 for $R$ and 90.0278 for $E$; the corresponding 95 percent confidence intervals, given as [lower bound, upper bound], are [0.9765, 0.9787] for $R$ and [89.9476, 90.1080] for $E$. The simulation results are consistent with the results of the proposed discrete algorithm, which validates the proposed approach.

The optimal backup strategy parameters $K$ and $\pi$ can be found by brute force enumeration of all the possible values. Table 1 presents the mission success probability values for different $K$ and $\pi$ when the system and mission parameters are the same as those presented above.

It can be seen from the table that the optimal backup strategy that maximizes the mission success probability is $K = 3$ and $\pi = 2$, i.e., the backups are performed each time a portion of work equal to 25 percent of the entire mission task is completed. The first and third backups are IBs; the second backup is an FB.
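The enumeration can be sketched as a double loop over the number of backups and the FB period. In the sketch below, `evaluate` stands for any routine that returns the mission success probability for a given strategy (for instance an implementation of the numerical algorithm of Section 4); here it is replaced by a lookup into the first rows of Table 1, so the function and variable names are illustrative assumptions.

```python
def best_strategy(evaluate, k_max):
    """Enumerate all (K, FB period) pairs and return the best one.

    For each K the FB period runs from 1 to K + 1, where K + 1 means
    that no full backup is performed at all.
    """
    best = None
    for K in range(k_max + 1):
        for fb_period in range(1, K + 2):
            R = evaluate(K, fb_period)
            if best is None or R > best[0]:
                best = (R, K, fb_period)
    return best


# Rows K = 0..4 of Table 1; entry [fb_period - 1] is the success probability.
TABLE_1 = {
    0: [0.9874],
    1: [0.9932, 0.9934],
    2: [0.9889, 0.9894, 0.9941],
    3: [0.9786, 0.9947, 0.9850, 0.9941],
    4: [0.9595, 0.9804, 0.9925, 0.9810, 0.9941],
}


def lookup(K, fb_period):
    return TABLE_1[K][fb_period - 1]


print(best_strategy(lookup, 4))  # (0.9947, 3, 2)
```

Because each evaluation is independent of the others, the enumeration is easy to run in parallel when the evaluation routine is expensive.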

Fig. 7 presents the functions $R(K, \pi)$ for the cases when no FBs are performed ($\pi > K$) and a single FB is performed ($\pi = (K + 1)/2$).

Fig. 5. Mission success probability R, expected mission completion time E and algorithm running time t obtained for the example system as functions of $\Delta$.

Fig. 6. Monte Carlo simulation results for the example system (solid line—simulations, dashed line—suggested algorithm).


The dependence of the optimal backup strategy parameters on the system time-to-failure distribution scale parameter $\eta$ is presented in Fig. 8 (the values of $\pi$ are presented only when at least one FB is performed during the mission in the backup strategy). The $R$ and $E$ obtained for the optimal $K$ and $\pi$ are compared with the values of $R^*$ and $E^*$, obtained for the fixed backup strategy $K = 3$, $\pi = 2$. It can be seen that for $\eta < 100$ the optimal strategy uses no FBs and the number of IBs decreases from 6 to 3. While not improving the mission success probability considerably compared to the $K = 3$, $\pi = 2$ strategy, the optimal strategy reduces the expected mission time. For $100 < \eta < 500$ the $K = 3$, $\pi = 2$ strategy becomes optimal. For $\eta > 500$ the optimal strategy presumes a single FB during the entire mission.

With an increase of the system reliability the mission success probability monotonically increases, while the expected mission completion time behaves non-monotonically because of variations of the backup strategy parameters. When $K$ and $\pi$ remain constant, the expected mission completion time decreases monotonically because the chances that the mission is completed without failures increase.

The dependence of the optimal backup strategy parameters and corresponding $R$ and $E$ on the allowed mission time $\tau$ is presented in Fig. 9 for different values of the IB storage availability $\varepsilon_I$.

For low $\tau$ no backups are used and $R$ and $E$ do not depend on $\varepsilon_I$. The influence of the maximum allowed mission time on the optimal number of backups is twofold. On one hand, with an increase of $\tau$ more backup actions can be allowed; on the other hand, with an increase of $\tau$ the mission has greater chances to succeed even when the number of backups is rather small and in the case of failures a greater part of the mission must be redone. Thus the optimal value of $K$ as a function of $\tau$ behaves non-monotonically. The expected mission completion time also behaves non-monotonically because of the variation of $K$ and $\pi$.

The dependence of the optimal backup strategy parameters and corresponding $R$ and $E$ on the mission complexity $W$ is presented in Fig. 10. Similar to the decrease of the allowed mission time $\tau$, the increase of $W$ has a twofold effect on the optimal number of backups. The mission reliability always decreases with an increase in $W$ and increases with an increase in $\varepsilon_I$, whereas the expected mission completion time behaves non-monotonically because of the variation of $K$ and $\pi$.

Fig. 11 presents the dependence of the optimal backup strategy parameters and corresponding $R$ and $E$ on the data backup and retrieval times. When FB times are low and IB times are high the preferable strategy is to use only FBs ($\pi = 1$). With the increase of FB times the use of FBs becomes unaffordable and a pure IB strategy is used if IB times are moderate. If both FB and IB times are great, no backups are allowed. When the optimal number of FBs becomes zero, the mission success probability and expected completion time do not depend on $C_F$ and $Q_F$. The mission success probability decreases and the expected completion time increases with data backup and retrieval times; however, the expected completion time can behave non-monotonically because of the variation of the backup strategy parameters.

Fig. 12 presents the dependence of the optimal backup strategy parameters and corresponding $R$ and $E$ on the repair time $\delta$ and efficiency $\zeta$.

The optimal number of backups behaves non-monotonically with $\delta$. The influence of $\delta$ on the optimal backup policy is twofold, analogous to the influence of the allowed mission time $\tau$. Indeed, longer repairs leave less time for performing the mission. On one hand, with an increase of $\delta$ fewer backup actions can be allowed; on the other hand, with an increase of $\delta$ the mission needs more backups to succeed in the case of failures as it has less remaining time to redo the work. Eventually, the further increase of $\delta$ leads

TABLE 1
Mission Success Probability for Different Values of Backup Strategy Parameters K and $\pi$ (K < $\pi$ Corresponds to no FBs)

K \ $\pi$   1       2       3       4       5       6       7       8       9       10      11
0       0.9874
1       0.9932  0.9934
2       0.9889  0.9894  0.9941
3       0.9786  0.9947  0.9850  0.9941
4       0.9595  0.9804  0.9925  0.9810  0.9941
5       0.9252  0.9888  0.9946  0.9895  0.9774  0.9937
6       0.8738  0.9621  0.9734  0.9928  0.9865  0.9741  0.9936
7       0.8003  0.9757  0.9827  0.9939  0.9907  0.9833  0.9714  0.9932
8       0.6783  0.9291  0.9874  0.9678  0.9924  0.9882  0.9801  0.9685  0.9929
9       0.4980  0.9503  0.9502  0.9770  0.9933  0.9908  0.9857  0.9771  0.9647  0.9925
10      0.4006  0.8763  0.9633  0.9824  0.9630  0.9919  0.9887  0.9829  0.9742  0.9611  0.9921

Fig. 7. Mission success probability R as a function of the total number of backups when no FBs and a single FB are performed.


Fig. 8. Optimal backup strategy parameters and the corresponding mission success probability and expected completion time as functions of the system time-to-failure distribution scale parameter $\eta$.

Fig. 9. Optimal backup strategy parameters and the corresponding mission success probability and expected completion time as functions of the allowed mission time $\tau$ for different values of the IB storage availability $\varepsilon_I$.

Fig. 10. Optimal backup strategy parameters and the corresponding mission success probability and expected completion time as functions of the mission complexity W for different values of the IB storage availability $\varepsilon_I$.

Fig. 11. Optimal backup strategy parameters and the corresponding mission success probability and expected completion time as functions of the data backup and retrieval times.


to inefficiency of any backups and the optimal value of $K$ becomes zero for $\delta > 42$. The full backups are efficient only for $\delta < 10$. The repair efficiency has almost no effect on the optimal backup policy. However, with an increase of the efficiency (decrease of $\zeta$) the overall mission success probability increases.

6 CONCLUSION AND FUTURE WORK

Data backups play a crucial role in the recovery or reconfiguration of computing/IT systems. By periodically saving copies of data associated with the completed portion of the mission task the system can resume the mission from the latest backed up point instead of repeating the entire task from scratch. This paper considers repairable real-time computing systems subject to a sequence of periodic full and incremental data backups during the mission. The mission succeeds if a specified amount of work can be accomplished within the allowed time. The major contributions of this work include: 1) We developed a numerical algorithm to evaluate the mission success probability and expected completion time of the considered repairable systems subject to a mixture of FB and IB. Systems without backups and with only FB or only IB can be analyzed as special cases of the proposed model by setting the availability of the corresponding FB or IB storage to zero. Correctness of the algorithm is verified using Monte Carlo simulations; 2) Based on the verified evaluation algorithm, we solved the backup schedule optimization problem, which finds the frequencies of FB and IB maximizing the mission success probability; 3) We investigated the influence of a variety of parameters (including system time-to-failure distribution parameter, maximum allowed mission time, data backup and retrieval times, IB storage availability, repair time and efficiency) on the mission success probability, expected completion time and the optimal backup schedule solution. The complicated relationship between these parameters and the mission performance indices demonstrates the significance and necessity of solving the proposed optimization problem.

It was shown that the optimal number of incremental and full backups can change drastically depending on system and mission parameters. Analyzing this dependence can be of great importance in the stage of operation planning and system design. This is especially important for systems performing several missions in parallel or having dynamically changing resources.

One direction of our future research is to extend the proposed methodology to repairable systems involved in multi-phased missions, where system failure behavior and repair time can vary from phase to phase. Another direction is to consider standby sparing systems where different standby modes (hot, cold, and warm) can be implemented. The model presented in this paper can be combined with the models of [7], [45] for this purpose.

The assumptions made in the paper can lead to overestimation of the mission success probability. Relaxing these assumptions is a task for our future work. For example, the assumption about perfect fault detection can be relaxed using the approach suggested in [46], and the assumption about independence of FB and IB storage failures can be relaxed using the approach suggested in [47].

APPENDIX

Full pseudo-code of the mission success probability and expected completion time evaluation algorithm.

1. $R = 1 - F(T_{\min})$;
2. $E = T_{\min}(1 - F(T_{\min}))$.
3. Set $Q_1(h, v) = 0$ for $h = 0, \ldots, K$; $v = 0, \ldots, m$;
4. $k = w + B_1$; $n = 0$;
5. For $i = 0, \ldots, m-1$:
   5.1. If ($\Delta i > k$) then ($n = n + 1$; $k = k + w + B_{n+1}$).
   5.2. If ($n \le K$) then $Q_1(n, i) = g^*(0, i)$.
6. For $j = 2, \ldots, N + 1$:
   6.1. $\gamma = T_{\min} + w$.
   6.2. Set $Q_j(h, v) = 0$ for $h = 0, \ldots, K$; $v = 0, \ldots, m$;
   6.3. For $h = 0, \ldots, K$:
      6.3.1. $\gamma = \gamma - w - B_h$;
      6.3.2. If ($h = \lambda(h)$) then $\sigma = \gamma$;
      6.3.3. For $v = x_{\min}(h)/\Delta, \ldots, m-1$:
         6.3.3.1. If ($\Delta v + \delta + T_{\min} \le \tau$) then $\{\mu = (1 - \varepsilon_F)[(1 - \varepsilon_I) + \varepsilon_I \cdot 1(h \ge \pi)]\,Q_{j-1}(h, v)\,[1 - G(\Delta v, T_{\min})]$; $R = R + \mu$; $E = E + \mu(\Delta v + \delta + T_{\min})\}$.
         6.3.3.2. If ($\Delta v + \delta + \gamma + b_h \le \tau$) then $\{\mu = \varepsilon_I[\varepsilon_F + (1 - \varepsilon_F) \cdot 1(h < \pi)]\,Q_{j-1}(h, v)\,[1 - G(\Delta v, \gamma + b_h)]$; $R = R + \mu$; $E = E + \mu(\Delta v + \delta + \gamma + b_h)\}$.

Fig. 12. Optimal backup strategy parameters and the corresponding mission success probability and expected completion time as functions of the repair time and efficiency.


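         6.3.3.3. If ($\Delta v + \delta + \sigma + b_{\lambda(h)} \le \tau$) then $\{\mu = \varepsilon_F(1 - \varepsilon_I)\,Q_{j-1}(h, v)\,[1 - G(\Delta v, \sigma + b_{\lambda(h)})]$; $R = R + \mu$; $E = E + \mu(\Delta v + \delta + \sigma + b_{\lambda(h)})\}$.
         6.3.3.4. $k = w + B_1$; $n = 0$;
         6.3.3.5. $s = w + B_{h+1} + b_h$; $a = h$;
         6.3.3.6. $u = w + B_{\lambda(h)+1} + b_{\lambda(h)}$; $c = \lambda(h)$;
         6.3.3.7. For $i = 0, \ldots, m - v - r - 1$:
            6.3.3.7.1. If ($\Delta i > k$) then ($n = n + 1$; $k = k + w + B_{n+1}$).
            6.3.3.7.2. If ($\Delta i > s$) then ($a = a + 1$; $s = s + w + B_{a+1}$).
            6.3.3.7.3. If ($\Delta i > u$) then ($c = c + 1$; $u = u + w + B_{c+1}$).
            6.3.3.7.4. $p = v + r + i$;
            6.3.3.7.5. If ($p < m$) then:
               6.3.3.7.5.1. $\theta = Q_{j-1}(h, v)\,g^*(v, i)$;
               6.3.3.7.5.2. If ($n \le K$) then $Q_j(n, p) = Q_j(n, p) + (1 - \varepsilon_F)[(1 - \varepsilon_I) + \varepsilon_I \cdot 1(h \ge \pi)]\,\theta$.
               6.3.3.7.5.3. If ($a \le K$) then $Q_j(a, p) = Q_j(a, p) + \varepsilon_I[\varepsilon_F + (1 - \varepsilon_F) \cdot 1(h < \pi)]\,\theta$.
               6.3.3.7.5.4. If ($c \le K$) then $Q_j(c, p) = Q_j(c, p) + \varepsilon_F(1 - \varepsilon_I)\,\theta$.
7. $E = E/R$.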

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China (No. 61170042) and the Jiangsu Province Development and Reform Commission (No. 2013-883).

REFERENCES

[1] S. Gaonkar, K. Keeton, A. Merchant, and W. H. Sanders, “Designing dependable storage solutions for shared application environments,” IEEE Trans. Dependable Secure Comput., vol. 7, no. 4, pp. 366–380, Oct.–Dec. 2010.

[2] X. Yang, Z. Wang, J. Xue, and Y. Zhou, “The reliability wall for exascale supercomputing,” IEEE Trans. Comput., vol. 61, no. 6, pp. 767–779, Jun. 2012.

[3] P. Koppol, K. S. Namjoshi, T. Stathopoulos, and G. T. Wilfong, “The inherent difficulty of timely primary-backup replication,” Bell Labs Tech. J., vol. 17, no. 2, pp. 15–24, Sep. 2012.

[4] Y. Fu, H. Jiang, N. Xiao, L. Tian, F. Liu, and L. Xu, “Application-aware local-global source deduplication for cloud backup services of personal storage,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 5, pp. 1155–1165, May 2014.

[5] M. Kaczmarski, T. Jiang, and D. A. Pease, “Beyond backup toward storage management,” IBM Syst. J., vol. 42, no. 2, pp. 322–337, 2003.

[6] Y. Tang, P. P. C. Lee, J. C. S. Lui, and R. Perlman, “Secure overlay cloud storage with access control and assured deletion,” IEEE Trans. Dependable Secure Comput., vol. 9, no. 6, pp. 903–916, Nov.–Dec. 2012.

[7] G. Levitin, L. Xing, B. W. Johnson, and Y. Dai, “Mission reliability, cost and time for cold standby computing systems with periodic backup,” IEEE Trans. Comput., vol. 64, no. 4, pp. 1043–1057, Apr. 2015.

[8] X. Yin, J. Alonso, F. Machida, E. Andrade, and K. S. Trivedi, “Availability modeling and analysis for data backup and restore operations,” in Proc. IEEE 31st Symp. Rel. Distrib. Syst., Oct. 2012, pp. 141–150.

[9] R. Xia, X. Yin, J. A. Lopez, F. Machida, and K. S. Trivedi, “Performance and availability modeling of IT systems with data backup and restore,” IEEE Trans. Dependable Secure Comput., vol. 11, no. 4, pp. 375–389, Jul./Aug. 2014.

[10] A. Chervenak, V. Vellanki, and Z. Kurmas, “Protecting file systems: A survey of backup techniques,” in Proc. Joint NASA IEEE Mass Storage Conf., 1998, pp. 2–17.

[11] J. da Silva and Ó. Guðmundsson, “The Amanda network backup manager,” in Proc. USENIX Syst. Administration (LISA VII) Conf., Nov. 1993, pp. 171–182.

[12] R. Green, A. Baird, and C. Davies, “Designing a fast, on-line backup system for a log-structured file system,” Digital Tech. J., vol. 8, no. 2, pp. 32–45, Oct. 1996.

[13] E. K. Lee and C. A. Thekkath, “Petal: Distributed virtual disks,” in Proc. 7th Int. Conf. Archit. Support Programm. Languages Operating Syst., Oct. 1996, pp. 84–92.

[14] Q. Yang, N. Zhang, and Y. Hong, “Reliability analysis of repairable systems with dependent component failures under partially perfect repair,” IEEE Trans. Rel., vol. 62, no. 2, pp. 490–498, Jun. 2013.

[15] M. Yañez, F. Joglar, and M. Modarres, “Generalized renewal process for analysis of repairable systems with limited failure experience,” Rel. Eng. Syst. Safety, vol. 77, no. 2, pp. 167–180, 2002.

[16] B. H. Lindqvist, “On the statistical modeling and analysis of repairable systems,” Statist. Sci., vol. 21, no. 4, pp. 532–551, 2006.

[17] M. Kijima, “Some results for repairable systems with general repair,” J. Appl. Probability, vol. 26, no. 1, pp. 89–102, 1989.

[18] P. L. C. Saldanha, E. A. de Simone, and P. F. Frutuoso e Melo, “An application of non-homogeneous Poisson point processes to the reliability analysis of service water pumps,” Nuclear Eng. Design, vol. 210, nos. 1–3, pp. 125–133, 2001.

[19] G. R. Weckman, R. L. Shell, and J. H. Marvel, “Modelling the reliability of repairable systems in the aviation industry,” Comput. Ind. Eng., vol. 40, no. 1, pp. 51–63, 2001.

[20] J. Jia and S. Wu, “A replacement policy for a repairable system with its repairman having multiple vacations,” Comput. Ind. Eng., vol. 57, no. 1, pp. 156–160, 2009.

[21] Y. Lam, “A note on the optimal replacement problem,” Adv. Appl. Probability, vol. 20, pp. 479–482, 1988.

[22] I. T. Castro and R. Pérez-Ocón, “Reward optimization of a repairable system,” Rel. Eng. Syst. Safety, vol. 91, no. 3, pp. 311–319, 2006.

[23] Y. L. Zhang and G. J. Wang, “A deteriorating cold standby repairable system with priority in use,” Eur. J. Oper. Res., vol. 183, no. 1, pp. 278–295, 2007.

[24] S. Bloch-Mercier, “Optimal restarting distribution after repair for a Markov deteriorating system,” Rel. Eng. Syst. Safety, vol. 74, no. 2, pp. 181–191, 2001.

[25] S. Bloch-Mercier, “A preventive maintenance policy with sequential checking procedure for a Markov deteriorating system,” Eur. J. Oper. Res., vol. 147, no. 4, pp. 548–576, 2002.

[26] A. C. Marquez and A. S. Heguedas, “Models for maintenance optimization: A study for repairable systems and finite time periods,” Rel. Eng. Syst. Safety, vol. 75, no. 3, pp. 367–377, 2002.

[27] I. W. Soro, M. Nourelfath, and D. Aït-Kadi, “Performance evaluation of multi-state degraded systems with minimal repairs and imperfect preventive maintenance,” Rel. Eng. Syst. Safety, vol. 95, no. 2, pp. 65–69, 2010.

[28] D. F. Percy, “Bayesian enhanced strategic decision making for reliability,” Eur. J. Oper. Res., vol. 139, no. 1, pp. 133–145, 2002.

[29] T. Rosqvist, “Bayesian aggregation of experts’ judgements on failure intensity,” Rel. Eng. Syst. Safety, vol. 70, no. 2, pp. 283–289, 2000.

[30] S.-H. Sheu, R. H. Yeh, Y.-B. Lin, and M.-G. Juang, “A Bayesian approach to an adaptive preventive maintenance model,” Rel. Eng. Syst. Safety, vol. 71, no. 1, pp. 33–44, 2001.

[31] S. Taghipour and D. Banjevic, “Periodic inspection optimization models for a repairable system subject to hidden failures,” IEEE Trans. Rel., vol. 60, no. 1, pp. 275–285, Mar. 2011.

[32] H. R. Golmakani and H. Moakedi, “Periodic inspection optimization model for a two-component repairable system with failure interaction,” Comput. Ind. Eng., vol. 63, no. 3, pp. 540–545, 2012.

[33] S. Taghipour and D. Banjevic, “Optimal inspection of a complex system subject to periodic and opportunistic inspections and preventive replacements,” Eur. J. Oper. Res., vol. 220, no. 3, pp. 649–660, 2012.

[34] S. Taghipour, D. Banjevic, and A. K. S. Jardine, “Periodic inspection optimization model for a complex repairable system,” Rel. Eng. Syst. Safety, vol. 95, no. 9, pp. 944–952, 2010.

[35] S. Taghipour and D. Banjevic, “Optimum inspection interval for a system under periodic and opportunistic inspections,” IIE Trans., vol. 44, no. 11, pp. 932–948, 2012.

[36] H. R. Golmakani and H. Moakedi, “Optimal nonperiodic inspection scheme for a multicomponent repairable system with failure interaction using A* search algorithm,” Int. J. Adv. Manuf. Technol., vol. 67, nos. 5–8, pp. 1325–1336, 2013.


[37] H. R. Golmakani and H. Moakedi, “Optimal non-periodic inspection scheme for a multi-component repairable system using A* search algorithm,” Comput. Ind. Eng., vol. 63, no. 4, pp. 1038–1047, 2012.

[38] J. Jia and S. Wu, “Optimizing replacement policy for a cold-standby system with waiting repair times,” Appl. Math. Comput., vol. 214, no. 1, pp. 133–141, 2009.

[39] R. H. Yeh, M.-Y. Chen, and C.-Y. Lin, “Optimal periodic replacement policy for repairable products under free-repair warranty,” Eur. J. Oper. Res., vol. 176, no. 3, pp. 1678–1686, 2007.

[40] L. Yuan and J. Xu, “An optimal replacement policy for a repairable system based on its repairman having vacations,” Rel. Eng. Syst. Safety, vol. 96, no. 7, pp. 868–875, 2011.

[41] A. Monga and M. Zuo, “Optimal system design considering maintenance and warranty,” Comput. Oper. Res., vol. 25, no. 9, pp. 691–705, 1998.

[42] R. H. Yeh and H.-C. Lo, “Optimal preventive-maintenance warranty policy for repairable products,” Eur. J. Oper. Res., vol. 134, no. 1, pp. 59–69, 2001.

[43] W. Weibull, “A statistical distribution function of wide applicability,” J. Appl. Mech.-Trans. ASME, vol. 18, pp. 293–297, 1951.

[44] M. Modarres, M. Kaminskiy, and V. Krivtsov, Reliability Engineering and Risk Analysis: A Practical Guide (2nd Edition). Boca Raton, FL, USA: CRC Press, 2010.

[45] G. Levitin, L. Xing, and Y. Dai, “Heterogeneous 1-out-of-N warm standby systems with dynamic uneven backups,” IEEE Trans. Rel., in press.

[46] G. Levitin, L. Xing, and Y. Dai, “Mission cost and reliability of 1-out-of-N warm standby systems with imperfect switching mechanisms,” IEEE Trans. Syst., Man, Cybern: Syst., vol. 44, no. 9, pp. 1262–1271, Sep. 2014.

[47] G. Levitin, L. Xing, Y. Dai, and H. B. Haim, “Effect of failure propagation on cold vs. hot standby tradeoff in heterogeneous 1-out-of-N: G systems,” IEEE Trans. Rel., vol. 64, no. 1, pp. 410–419, 2015.

[48] K. Keeton and A. Merchant, “A framework for evaluating storage system dependability,” in Proc. Int. Conf. Dependable Syst. Netw., 2004, pp. 877–886.

[49] B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems. Reading, MA, USA: Addison-Wesley, 1989.

[50] G. L. Hartmann, A. D. Hills, G. M. Papadopoulos, T. Anderson, and P. A. Barrett, Fault Tolerant Hardware/Software Architecture for Flight Critical Function, Defense Technical Information Center, 1985.

[51] S. B. Johnson, T. Gormley, S. Kessler, C. Mott, A. Patterson-Hine, K. Reichard, P. Scandura, Jr., Eds., System Health Management: With Aerospace Applications. New York, NY, USA: Wiley, Jul. 2011.

Gregory Levitin (M’97-SM’99) is presently a distinguished visiting professor at the University of Electronic Science and Technology of China and a senior expert at the Reliability Department of the Israel Electric Corporation. His current interests are in operations research and artificial intelligence applications in reliability, defense and power systems. In this field, he has published more than 230 papers and four books. He is an associate editor of the IEEE Transactions on Reliability, area coordinator of the International Journal of Performability Engineering and a member of editorial boards of Reliability Engineering & System Safety, Journal of Risk and Reliability, and Reliability and Quality Performance. He is a senior member of the IEEE and chair of the ESRA Technical Committee on System Reliability.

Liudong Xing (S’00-M’02-SM’07) received the BE degree in computer science from Zhengzhou University, China, in 1996, and the MS and PhD degrees in electrical engineering from the University of Virginia, in 2000 and 2002, respectively. She is currently a professor with the Department of Electrical and Computer Engineering, University of Massachusetts (UMass) Dartmouth. She is an associate editor for the International Journal of Systems Science and International Journal of Systems Science: Operations & Logistics. She is an editorial board member for Reliability Engineering & System Safety, and the Journal of Computational Engineering. She is also an assistant editor-in-chief for the International Journal of Performability Engineering. She received the Leo M. Sullivan Teacher of the Year Award (2014), Scholar of the Year Award (2010), and Outstanding Women Award (2011) of UMass Dartmouth and IEEE Region 1 Technological Innovation (Academic) Award (2007). She also coreceived the Best Paper Award at the IEEE International Conference on Networking, Architecture, and Storage in 2009. She is a senior member of the IEEE.

Qingqing Zhai (S’14) received the BE degree in quality and reliability engineering from Beihang University, China, in 2011. He is currently working toward the PhD degree in system engineering at Beihang University. His research interests are in complex systems reliability modeling and sensitivity analysis techniques. He is a student member of the IEEE.

Yuanshun Dai (S’02-M’03) received the BS degree from Tsinghua University, Beijing, China, in 2000, and the PhD degree from the National University of Singapore, Singapore, in 2003. He is currently a chaired professor and the director of the Collaborative Autonomic Computing (CAC) Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. He has served as chairman of the Professor Committee in the School since 2012, and as the associate director at the Youth Committee of the “National 1000er Plan” in China. He has published 90 papers and five books; 40 of the papers are indexed by SCI, including 20 IEEE/ACM Transactions papers. His current research interests include HPC, Cloud Computing and Grid, Reliability and Security, Modeling and Optimization. He has served as a guest editor of the IEEE Transactions on Reliability. He is currently an editor of the special issue on cloud computing for the Journal of Supercomputing. He is also on the editorial boards of several journals. He is a member of the IEEE.

656 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 13, NO. 6, NOVEMBER/DECEMBER 2016