
Coding Approaches for Maintaining Data in Unreliable

Network Systems

by

Vitaly Abdrashitov

B.S., Moscow Institute of Physics and Technology (2009)
M.S., Moscow Institute of Physics and Technology (2011)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science

March 26, 2018

Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Muriel Médard

Cecil H. Green Professor in Electrical Engineering and Computer Science
Thesis Supervisor

Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Leslie A. Kolodziejski

Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students


Coding Approaches for Maintaining Data in Unreliable Network Systems

by

Vitaly Abdrashitov

Submitted to the Department of Electrical Engineering and Computer Science
on March 26, 2018, in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

In recent years, the explosive growth of data storage demand has made storage cost a critically important factor in the design of distributed storage systems (DSS). At the same time, optimizing the storage cost is constrained by reliability requirements. The goal of this thesis is to further study the fundamental limits of maintaining data fault tolerance in a DSS spread across a communication network. In particular, we focus our attention on performing efficient storage node repair in a redundant erasure-coded storage with a low storage overhead. We consider two operating scenarios of the DSS.

First, we consider a clustered scenario, where individual nodes are grouped into clusters representing data centers, storage clouds of different service providers, racks, etc. The network bandwidth within a cluster is assumed to be cheap relative to the bandwidth between nodes in different clusters. We extend the regenerating codes framework of Dimakis et al. [1] to clustered topologies and introduce generalized regenerating codes (GRCs), which perform node repair using helper data both from the local cluster and from other clusters. We show the optimal trade-off between the storage overhead and the inter-cluster repair bandwidth, along with optimal code constructions. In addition, we find the minimal amount of intra-cluster repair bandwidth required to achieve a given point on the trade-off.

Second, we consider a scenario where the underlying network features a highly varying topology. Such behavior is characteristic of peer-to-peer, content delivery, and ad-hoc mobile networks. Because of the limited and time-varying connectivity, the sources for node repair are scarce. We consider a stochastic model of failures in the storage, which also describes the random and opportunistic nature of selecting the sources for node repair. We show that, even though the repair opportunities are scarce, with a practically high probability the data can be maintained through a large number of failures and repairs, and for time periods far exceeding a typical lifespan of the data. The thesis also analyzes a random linear network coding (RLNC) approach to operating in such variable networks and demonstrates its high achievable rates, outperforming those of regenerating codes, and its robustness across a wide range of model and implementation assumptions and parameters, such as code rate, field size, repair bandwidth, node distributions, etc.

Thesis Supervisor: Muriel Médard
Title: Cecil H. Green Professor in Electrical Engineering and Computer Science


Acknowledgments

Pursuing my Ph.D. degree at MIT has been a long and challenging journey. At the same time, it has allowed me to meet many amazing and extraordinary people, students and faculty, inspiring researchers, innovative thinkers, and dedicated teachers. I am deeply thankful to them for their support and for sharing their expertise.

First and foremost, I would like to thank Muriel Médard, who has been to me not only an extremely knowledgeable and insightful research supervisor, but also a patient mentor, a very supportive adviser in career and life, and a wonderful person. Without her, I definitely would not have become what I am today.

Besides Muriel, I would like to thank Prakash Narayana Moorthy, with whom I enjoyed a close collaboration, and who is a passionate researcher and a great friend. The thesis would not have been possible without his extensive expertise. I am also honored to have David Karger and Viveck Cadambe as my committee members. I am really thankful for their time and commitment, and for their valuable guidance and advice.

I am extremely lucky to be a member of the Network Coding and Reliable Communications research group. I would like to thank the people I met in the group for infinite opportunities to learn and share ideas. In particular, I would like to thank Ali Makhdoumi, Salman Salamatian, Weifei Zeng, Flavio du Pin Calmon, Ahmad Beirami, and Arman Rezaee for being supportive and true friends. I would also like to thank Surat Teerapittayanon, Georgios Angelopoulos, Kerim Fouli, Soheil Feizi, Jason Cloud, and many other members of the group, with whom I have had the pleasure to work and learn together. Very special thanks go to Molly Kruko, who makes sure that the group activities always run smoothly. I also wish to thank a lot of other great people who I met at MIT, many of whom became my dearest friends.

Finally, and most importantly, I thank my parents and my family for their unconditional love, support, and encouragement during all these years in graduate school and in the U.S., and for being my backbone and foundation.

The work in this thesis was supported in part by the Air Force Office of Scientific Research (AFOSR) under awards No. FA9550-14-1-043 and FA9550-13-1-0023, in part by the National Science Foundation (NSF) under Grants No. CCF-1527270 and CCF-1409228, and in part by the Defense Advanced Research Projects Agency (DARPA) under award No. HR0011-17-C-0050.


Dedicated to my parents


Contents

Contents 7

List of Figures 9

List of Tables 13

1 Introduction 15

1.1 Distributed Storage Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.2 Small Repair Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.3 Small Repair Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Preliminaries 25

2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2 Linear Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3 Linear Network Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.4 Regenerating Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.4.1 Information Flow Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.5 Matroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

I Regenerating Codes for Clustered Storage Systems 34

3 Generalized Regenerating Codes (GRCs) and File Size Bounds 35

3.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.1.1 IFG Model for GRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.3 File Size Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.3.1 Proof of the File Size Bound . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.3.2 Storage vs Inter-Cluster Bandwidth Trade-off . . . . . . . . . . . . . . . . 45

3.4 Code Constructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.4.1 Exact Repair Code Construction . . . . . . . . . . . . . . . . . . . . . . . 47

3.4.2 A Functional Repair Code for Arbitrary Number of Failures . . . . . . . . 50

4 GRC for Repair of Multiple Failures 55

4.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.2 Exact Repair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2.1 ER Code Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2.2 File Size Bound Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


4.3 Functional Repair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.3.1 Information Flow Graph Model . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.2 File Size Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.4 Implications of the Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5 Intra-Cluster Bandwidth of GRCs 65

5.1 Local Helper Bandwidth in the Host Cluster . . . . . . . . . . . . . . . . . . . . . 66

5.2 External Helper Cluster Local Bandwidth . . . . . . . . . . . . . . . . . . . . . . 70

5.3 Optimality and Implications of the Intra-cluster Bandwidth Bounds . . . . . . . 75

II Information Survival in Volatile Networks 78

6 Network Coding for Time-Varying Networks 79

6.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.3 Stochastic Rank Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.3.1 Matroid Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.4 Bounding Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6.5 Impact of Repair Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.6 Expected Lifetime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6.7 Error Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

7 Implementation Aspects and Numerical Results 99

7.1 RLNC Recoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7.2 Small Field Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

7.3 Failed and Helper Nodes Distributions . . . . . . . . . . . . . . . . . . . . . . . . 101

7.4 Variable Number of Helpers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7.5 Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

7.6 Effects of Several Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

8 Conclusions 107

8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

8.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

Bibliography 113

9 Appendices 119

9.1 MRGRC Chain Order Lemma 4.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . 119

9.2 Achievability of the FR File Size Bound for MRGRC (Theorem 4.3.1) . . . . . . 121

9.3 Matroid Lemma 6.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

9.4 Matrix Addition Lemma 6.4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126


List of Figures

1-1 A system with n = 4 storage nodes with 2 packets per node, and a regenerating code, which can repair the exact content of any node by downloading 1 packet from each of the 3 other nodes. The plus sign denotes the bitwise XOR operation. . . . 17

1-2 Comparing the storage overhead and the expensive inter-cluster repair bandwidth of three coding options for a clustered DSS. The three options are (i) extra parity check nodes in each cluster, (ii) classical regenerating codes, and (iii) generalized regenerating codes. . . . 20

1-3 An example of an LRC with 6 information nodes and 4 parity nodes, 2 of which are local parities. Every information node (e.g. c1) can be recovered by downloading data from a specific set of d = 3 other nodes in the same local group. The global parities p1, p2 are linear combinations of all 6 information symbols and allow data regeneration when more than 1 node in a local group is lost. . . . 21

1-4 The helper selection vs rate trade-off (higher is better). . . . 22

1-5 Berlekamp's bat phenomenon. . . . 23

2-1 Storage-bandwidth trade-off for RCs with (n = 7, k = 5, d = 6, B = 10). The precise trade-off for exact repair remains unknown. . . . 30

2-2 An example of an information flow graph for n = 4, k = 2, d = 2 and 3 node failures/repairs. Also shown is a sample cut (U, V) between S and Z of capacity α + β. . . . 31

3-1 An example of the IFG model representing the notion of generalized regenerating codes, when intra-cluster bandwidth is ignored. In this figure, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1). . . . 38

3-2 An example of the information flow graph used in the cut-set based upper bound for the file size. In this figure, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1). We also indicate a possible choice of S−Z cut that results in the desired upper bound. . . . 42

3-3 An example of how any S−Z cut in the IFG affects nodes in Fi. In the example, we assume m = 4. With respect to the description in the text, ai = 2. Further, the node Xi,4(ti,4) is a replacement node in the IFG. . . . 43

3-4 Trade-off between storage overhead nmα/B and inter-cluster repair bandwidth overhead dβ/α, for an (n = 5, k = 4, d = 4) clustered storage system, with ℓ = m − 1. . . . 45

3-5 Illustration of the exact repair code construction. We first stack ℓ MDS codes and (m − ℓ) classical regenerating codes, and then transform each row via the invertible matrix A. The first ℓ rows of the matrix A generate an (m, ℓ) MDS code. . . . 48


3-6 An illustration of the node repair process for the exact repair generalized regenerating code obtained in Construction 3.4.1. . . . 50

4-1 An illustration of the information flow graph used in the cut-set based upper bound for the file size under functional repair. We assume (n = 3, k = 2, d = 2)(m = 3, ℓ = 0, t = 2). Only a subset of nodes is named to avoid clutter. Two batches, each of t = 2 nodes, fail and get repaired, first in cluster 1 and then in cluster 3. We also indicate a possible choice of the S − Z cut that results in the desired upper bound. We fail nodes in cluster 3 instead of cluster 2 only to make the figure compact. . . . 59

4-2 Trade-offs for an (n = 5, k = 4, d = 4)(m = 3, ℓ = 0, t = 2) system, plotted between the MSR and the MBR points. . . . 63

4-3 Impact of the number of local helper nodes, ℓ, on file size for an (n = 7, k = 4, d = 5, m = 17, t = 5) clustered storage system at the MBR point (α = 1, β = 1). Local help does not provide any advantage unless ℓ > 2. . . . 63

5-1 An illustration of the evolution of the k-th cluster of the information flow graph used in the cut-set based lower bound for γ in Theorem 5.1.1. In this figure, we assume that m = 4, ℓ = 2. Nodes 3, 4, 1 fail in this respective order. For the repair of node 3, nodes 1 and 2 act as the local helper nodes. For the repair of the remaining two nodes, nodes 2 and 3 act as the local helper nodes. Also indicated is our choice of the S−Z cut used in the bound derivation. . . . 67

5-2 An illustration of the IFG used in the cut-set based lower bound for γ′ in Theorem 5.2.1. In this example, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1)(ℓ′ = 2, γ = α). The second node fails in clusters 1 and 2 in the respective order. Also indicated is our choice of the S−Z cut used in the bound derivation. . . . 71

5-3 An illustration of the IFG used in the cut-set based lower bound for ℓ′ in Theorem 5.2.3. In this example, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 0)(ℓ′ = 1, γ = γ′ = α). The second node fails in clusters 1 and 2 in the respective order. Also indicated is our choice of the S−Z cut used in the bound derivation. . . . 74

5-4 Simulation results for a system with parameters (n = 4, k = 3, d = 3), (ℓ = 1, m = 2), (α, β = 4), showing probability of successful data collection against the number of node repairs performed, for an RLNC-based GRC. The legends indicate parameters (γ, γ′, ℓ′) for each test. For all operating points ℓ∗ = m = 2. . . . 76

5-5 Illustrating the impact of ℓ on the various performance metrics. We operate at the MBR point with parameters {(n = 12, k = 8, d = n − 1)(α = dβ, β = 2)}. We see that while ℓ = m − 1 is ideal in terms of optimizing storage and inter-cluster BW, it imposes the maximum burden on intra-cluster BW. . . . 77

6-1 An example of a system evolution for 3 iterations of failure and repair, n = 6, d = 2, a = b = 1. At t = 0 node i contains packet si. For the 4 considered system states the evolution matrix Wt and its matroid representation M(Wt) are also shown. The most recently changed column of Wt is bold-faced. . . . 81

6-2 An example of an information flow graph for n = 4, d = 2 and t = 4 node failures/repairs. Also shown is a sample cut (U, V) of capacity a + 2b ≥ min{a, 2b} + min{a, b} + b. . . . 93

6-3 Simulated expected lifetime for n = 20, a = b = 1. . . . 96


6-4 Probability of decoding error pe against the coding rate for fixed n = 20, d = 4, a = b = 1. The dots indicate E[rank Wt]/n. . . . 96

6-5 Expected rank of the evolution matrix rt = E[rank Wt], with the upper and lower bounds, for n = 40, d = 4. . . . 97

6-6 Expected rank rt = E[rank Wt] for n = 40 and various values of d. . . . 97

7-1 Performance of various recoding regimes: No recoding (N), Sparse recoding (S), and Full recoding (FR) for a system with parameters n = 20, t = 2000, d = 4, a = 3, b = 2. The legend indicates the recoding regimes (helper, replacement nodes). . . . 100

7-2 Impact of the effective field size q^a on the average rank of Wt for a system with parameters n = 20, d = 4, a = b, t = 1000. The actual field size used is q. . . . 100

7-3 Probability mass functions of the test node distributions for a storage with n = 20 nodes. Given a fixed parameter x, the probability of the i-th atom is p(i) ∝ i^x. Larger values of x lead to stronger concentration of probability at the nodes with high indices. . . . 102

7-4 Impact of the failed and helper node distributions PF, PH on the average rank for n = 20, t = 1000, d = 4. The distributions have p(i) ∝ i^x for x ∈ {xF, xH}. xF < 0 corresponds to p(i) ∝ (n + 1 − i)^|xF|. . . . 102

7-5 Impact of the standard deviation of the number of helper nodes d on the average rank for n = 20, a = b = 1, t = 1000. Beta-binomial distributions with different supports are used. . . . 103

7-6 Decoding error probability p_e^dc = Pr[rank Mt|S < k | rank Mt = k] for a randomly chosen column set S ⊂ [n], |S| = ndc. n = 20, d = 4, a = b = 1, k = nR. . . . 103

7-7 The maximal rate Rε for error probability under ε = 5 × 10^−4, t = 2000, and n = 20. First, tests are performed for the base case a = b = 1, q = 65536; then, various adverse parameter changes are introduced incrementally. The maximal theoretical RC code rate for n = 20, a = 6, b = 4 is provided for comparison. . . . 105

8-1 Mean rank per node for t scaled proportionally to n, with d = 4, a = b = 1. . . . 110


List of Tables

3.1 Notation for the clustered storage system model. . . . . . . . . . . . . . . . . . . 36

6.1 Notation for the time-varying network storage system model. . . . . . . . . . . . 82


Chapter 1

Introduction

1.1 Distributed Storage Systems

In recent years, the demand for cheap and reliable data storage has been driven high by numerous entertainment, industrial, and scientific applications, which overall generate zettabytes (1 ZB ≈ 10^12 GB) of data yearly. In 2018 the size of the global data sphere is measured in tens of zettabytes, and this number doubles every 3 years [2]. The demand for storage capacity grows faster than storage media production and is expected to outstrip production by 6 ZB in 2020 [3]. As a result of these trends, storage cost becomes an increasingly important factor in the design of large storage systems.

Large-scale storage systems follow a distributed approach: the data is spread across several storage nodes, possibly in different locations. Using multiple less capable and cheaper nodes instead of a single powerful node allows cost-efficient scalability. Thus, modern distributed storage systems (DSS) are composed of a large number of individually unreliable nodes. Data loss is unacceptable, and a DSS must deal with failures in the system, i.e. have sufficient fault tolerance.

Since fault tolerance is achieved through redundancy in storage, and the redundancy inevitably increases the storage size per byte of user data, i.e. the storage overhead, it is critically important to find the optimal balance between the required fault tolerance and the storage cost. A simple way to introduce redundancy is replication, in which multiple copies of the same data segments are stored on different physical nodes. Simple to set up and manage, 3-way replication (keeping 3 copies of the data) has been a widely adopted approach. However, its storage overhead of 3 is too high and makes it prohibitively expensive for large amounts of data. An alternative approach to introduce redundancy is to encode the data using erasure codes. Storage node failures can be treated as erasures, and the source data can be decoded from an incomplete set of nodes. Many major cloud service providers, like Amazon [4] and Microsoft Azure [5], employ coding in their storage systems.
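As a back-of-the-envelope illustration (the code parameters here are hypothetical, not drawn from any specific deployment), the sketch below contrasts 3-way replication with an (n = 14, k = 10) MDS erasure code in terms of storage overhead and the number of node losses each scheme survives:

```python
# Storage overhead = raw bytes stored per byte of user data.
# Fault tolerance = number of simultaneous node losses survived.

def replication(copies):
    # Each of the `copies` nodes holds a full copy of the data.
    return {"overhead": copies, "tolerates": copies - 1}

def mds_code(n, k):
    # An (n, k) MDS erasure code stores n coded blocks, any k of
    # which suffice to decode, so it survives any n - k losses.
    return {"overhead": n / k, "tolerates": n - k}

print(replication(3))     # overhead 3, tolerates 2 losses
print(mds_code(14, 10))   # overhead 1.4, tolerates 4 losses
```

The coded system tolerates more failures at less than half the storage cost, which is exactly why large providers moved from replication to erasure coding.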

An important aspect of a DSS is maintaining redundancy. Upon node failures, the data segments they store become unavailable, and the overall redundancy and fault tolerance go down. New replacement nodes need to be introduced into the system to keep the failed portion of the data. This process is called node repair and involves downloading data (helper data) from a set of surviving nodes (helper nodes, or helpers) to generate the data to store on the new node. The efficiency of node repair is mainly associated with the following metrics:

1. repair bandwidth — the number of symbols downloaded from the helper nodes to the replacement node [1];

2. repair locality — the number of helper nodes contacted [6];

3. repair disk I/O — the number of stored symbols read by the helper nodes from their storage media to generate the helper data [7].

These metrics directly affect the repair latency, cost, and extra load on the helper nodes, and ideally, all the metrics should be small. However, the metrics can be simultaneously minimized only for replication-based storage, not for an erasure-coded DSS.
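A rough sketch of how the first two metrics play out for replication versus naive repair of an (n, k) MDS code (the function names and the per-block accounting are illustrative assumptions, not notation from the thesis):

```python
# `block` is the hypothetical number of symbols stored per node.

def repair_metrics_replication(block):
    # A lost replica is restored by copying it from one surviving replica:
    # minimal bandwidth, locality, and disk I/O all at once.
    return {"bandwidth": block, "locality": 1, "disk_io": block}

def repair_metrics_mds_naive(k, block):
    # Naive repair of one block of an (n, k) MDS code: contact k helpers,
    # download k whole blocks (a full file's worth), decode, re-encode.
    return {"bandwidth": k * block, "locality": k, "disk_io": k * block}

print(repair_metrics_replication(256))
print(repair_metrics_mds_naive(10, 256))
```

Regenerating codes, discussed next, reduce the coded case's bandwidth well below k · block while keeping the low storage overhead of an erasure code.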

The thesis is focused on further studies of the fundamental limits of maintaining redundancy and node repair in a DSS with a low storage overhead. We consider two operating scenarios of the DSS. The first one mainly considers a DSS with low repair bandwidth and disk I/O, while the second one studies a DSS with low repair locality.

1.2 Small Repair Bandwidth

Repair bandwidth minimization has been largely studied in the context of regenerating codes (RCs). First introduced by Dimakis et al. [1], RCs require that repair of any node can be performed with an arbitrary set of d helper nodes among the surviving nodes. For a fixed repair locality d, RCs minimize the repair bandwidth and achieve the optimal trade-off between the bandwidth and the storage overhead. Generally, to achieve the minimal repair bandwidth, a helper node needs to send a function of its stored data, rather than a part of it. Consider the example of a DSS with n = 4 storage nodes shown in Figure 1-1. The source file is split into 4 packets, represented by vectors of bits of a fixed dimension. They are encoded into 8 coded packets and stored on 4 nodes, 2 packets per node, so that the source file can be decoded from the content of any 2 nodes. To repair a failed node, e.g. node 4, the replacement node needs to download at least 3 helper packets from the other 3 nodes. It is not sufficient, however, for the helpers to directly send out the packets they store; instead, helper node 3 sends out a linear combination a1 + b1 + a2 + b2 of the packets a1 + b1, a2 + b2 it holds. The addition is performed element-wise in the binary field GF(2). This demonstrates that network coding is necessary to achieve the minimal repair bandwidth.

[Figure: node 1 stores a1, a2; node 2 stores b1, b2; node 3 stores a1 + b1, a2 + b2; node 4 stores a1 + b2, a2 + b1. To repair node 4, node 1 sends a1, node 2 sends b2, and node 3 sends a1 + b1 + a2 + b2.]

Figure 1-1: A system with n = 4 storage nodes with 2 packets per node, and a regenerating code, which can repair the exact content of any node by downloading 1 packet from each of the 3 other nodes. The plus sign denotes the bitwise XOR operation.
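The repair of node 4 in Figure 1-1 can be checked mechanically. The sketch below models the four source packets as 32-bit integers, with bitwise XOR playing the role of element-wise addition over GF(2):

```python
import random

random.seed(1)
# Four source packets modeled as 32-bit integers; ^ (XOR) is
# element-wise addition of bit vectors over GF(2).
a1, a2, b1, b2 = (random.getrandbits(32) for _ in range(4))

# Node contents of Figure 1-1 (2 packets per node; any 2 nodes decode).
node = {
    1: (a1, a2),
    2: (b1, b2),
    3: (a1 ^ b1, a2 ^ b2),
    4: (a1 ^ b2, a2 ^ b1),
}

# Node 4 fails; each of the 3 helpers sends exactly one packet.
h1 = a1                      # node 1 forwards a stored packet as-is
h2 = b2                      # node 2 forwards a stored packet as-is
h3 = (a1 ^ b1) ^ (a2 ^ b2)   # node 3 sends a mixture of its two packets

# The replacement node reconstructs node 4's exact content:
# h1 ^ h2 = a1 + b2, and h1 ^ h2 ^ h3 = a2 + b1.
rebuilt = (h1 ^ h2, h1 ^ h2 ^ h3)
assert rebuilt == node[4]
print("node 4 repaired from 3 helper packets")
```

Had node 3 sent one of its stored packets unmodified instead of the mixture, the three received packets would not determine node 4's content, illustrating why coding at the helper is essential.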

Network coding refers to the operation of coding at a network node over its inputs to produce its outputs. Whereas in the traditional paradigm of network routing a node only stores the incoming packets to forward them to the next nodes without changing them (possibly changing packet metadata), with network coding the node can emit packets which are functions, "mixtures", of the incoming packets. In the context of DSS, the incoming packets correspond to those a node stores, and the outgoing packets to those it sends out when it serves as a helper node.


The seminal work by Ahlswede et al. [8] shows that, unlike routing, network coding achieves the capacity of multicast wireline networks, where a data stream needs to be delivered to multiple destinations. The capacity is equal to the minimal value of a cut between the source node and a destination (a sink node). References [9, 10] showed that for achieving the multicast capacity it is sufficient to use linear network coding, where the outgoing packets are linear combinations of the incoming packets. Works of Ho et al. [11] and Sanders et al. [12] showed that the linear coding coefficients at each node can be picked uniformly at random, and such random linear network codes (RLNC) in a large finite field asymptotically achieve the multicast capacity. Linear network codes can also be constructed deterministically with a polynomial-time algorithm [13].
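A minimal sketch of the RLNC idea, under simplifying assumptions (a small prime field GF(257) and a square coefficient matrix; the field choice is for arithmetic convenience only): k packets mixed with uniformly random coefficients are decodable exactly when the coefficient matrix has full rank, which happens with high probability:

```python
import random

P = 257  # a small prime; GF(257) arithmetic is plain integer math mod 257

def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p) via Gaussian elimination."""
    m = [list(row) for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        piv = next((r for r in range(rank, n_rows) if m[r][col] % p != 0), None)
        if piv is None:
            continue
        m[rank], m[piv] = m[piv], m[rank]
        inv = pow(m[rank][col], p - 2, p)  # inverse via Fermat's little theorem
        m[rank] = [x * inv % p for x in m[rank]]
        for r in range(n_rows):
            if r != rank and m[r][col] % p != 0:
                f = m[r][col]
                m[r] = [(x - f * y) % p for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank

random.seed(0)
k, trials = 8, 200
full = 0
for _ in range(trials):
    # Each output packet carries a uniformly random linear
    # combination of the k source packets.
    coeffs = [[random.randrange(P) for _ in range(k)] for _ in range(k)]
    full += rank_mod_p(coeffs) == k
print("fraction decodable:", full / trials)  # close to 1
```

The full-rank probability approaches 1 as the field grows, which is the mechanism behind RLNC asymptotically achieving the multicast capacity.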

Dimakis et al. [1] considered two alternative repair regimes of RCs: exact repair (ER) and functional repair (FR). In the ER regime, after each node repair the replacement node content should be exactly the same as the content of the failed node. With FR this does not need to hold, as long as the source file can be decoded. The advantage of ER codes is that they can have a systematic form with certain nodes holding the uncoded source packets (the first two nodes in Figure 1-1), which allows very fast data reads. FR codes generally achieve a strictly better bandwidth-storage trade-off and allow simpler and more efficient code constructions. A detailed description of RCs and network coding is presented in Chapter 2.
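For functional repair, the trade-off is governed by the cut-set bound of Dimakis et al. [1], B ≤ Σ_{i=0}^{k−1} min(α, (d − i)β), where α is the per-node storage and β the per-helper download. A small sketch evaluating the bound and its two extreme points, minimum-storage (MSR) and minimum-bandwidth (MBR), using the parameters of Figure 2-1:

```python
def max_file_size(alpha, beta, k, d):
    # Cut-set bound for functional repair (Dimakis et al. [1]):
    # B <= sum_{i=0}^{k-1} min(alpha, (d - i) * beta)
    return sum(min(alpha, (d - i) * beta) for i in range(k))

def msr_point(B, k, d):
    # Minimum-storage regenerating point:
    # alpha = B / k, beta = B / (k * (d - k + 1))
    return B / k, B / (k * (d - k + 1))

def mbr_point(B, k, d):
    # Minimum-bandwidth regenerating point:
    # beta = 2B / (k * (2d - k + 1)), alpha = d * beta
    beta = 2 * B / (k * (2 * d - k + 1))
    return d * beta, beta

B, k, d = 10, 5, 6  # the (k = 5, d = 6, B = 10) system of Figure 2-1
a, b = msr_point(B, k, d)
assert max_file_size(a, b, k, d) == B   # MSR achieves the bound
a, b = mbr_point(B, k, d)
assert max_file_size(a, b, k, d) == B   # MBR achieves the bound
```

Points between MSR and MBR trade a little extra storage α for a smaller repair download dβ, tracing exactly the curve of Figure 2-1.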

Subsequent works on regenerating codes studied code constructions in specific capacity-achieving scenarios [14, 15], repairing multiple node failures [16–19], and security aspects [20, 21]. These works consider a flat network topology, where each node has a direct logical connection to every other node, and all logical links incur the same bandwidth cost for communication per bit.

However, practical large-scale DSSs feature a hierarchical structure; for instance, individual nodes can be grouped into a server rack and connected to the same network switch, while a rack is part of an aisle, and the latter is a composition unit of a data center. Finally, data centers can be grouped into a geographically distributed storage system, employing erasure coding across data centers [22], or into a user-defined cloud-of-clouds along with other storage service providers [23–25]. In a large DSS, the data is protected against failure or unavailability of individual system parts, and repairing a node can require repair traffic across different levels of the system hierarchy. While the communication between the nodes in the same rack is low-delay and spare bandwidth is usually available, the inter-rack bandwidth is shared by many nodes and applications and is a more limited resource, and the inter-data-center bandwidth is even more scarce and expensive. For instance, around 180 TB of inter-rack repair bandwidth is used daily in the Facebook warehouse, which limits resources for other applications [26].

Complete elimination of the expensive repair bandwidth components, which we shall call inter-cluster, is possible, e.g. by using extra parity check nodes in each rack, but it results in an excessively large storage overhead. A better approach is needed to characterize the optimal

system performance in terms of the storage overhead vs repair bandwidth trade-off. The extra

parity check solution is shown by the corresponding point on the trade-off in Figure 1-2. It is

also possible to reduce the expensive bandwidth by employing several RCs for flat topologies

(classical RCs), so that each RC spans across one node in each rack. Any point on the straight

line between the two solutions can be achieved by space-sharing, i.e. applying each of the two

codes only for a fraction of the total stored data.

We introduce Generalized Regenerating codes (GRCs) for clustered topologies, which care-

fully combine repair bandwidth from different hierarchy levels (cheap intra-cluster and expen-

sive inter-cluster bandwidths) to achieve the optimal trade-off. The trade-off is strictly better

than what is achieved with space-sharing between the existing coding schemes, as shown in

Figure 1-2. The analysis of GRCs is presented in Chapter 3. Besides the characterization of the

optimal trade-off, we also provide explicit code constructions for the ER and the FR regime.

The FR construction is optimal in terms of the trade-off, while the ER construction achieves

the most important operating points of the trade-off. In Chapter 4 we extend the results to the

scenario with multiple node failures per cluster and their simultaneous repair.

In Chapter 5 we study the local properties of GRCs. While the goal of GRC is to optimize

the main trade-off between the storage vs the expensive inter-cluster bandwidth, it is also

desirable to minimize the cheap intra-cluster bandwidth to improve the latency and the disk I/O,

without affecting the trade-off. The repair process also gives rise to the intra-cluster bandwidth

in the clusters providing the helper data (helper clusters). The chapter gives the answer to the

following question: what is the minimal intra-cluster bandwidth, in the cluster with the failed

node and in the helper clusters, required to operate on a specific point of the trade-off?


[Figure 1-2 plot: storage overhead (x-axis, 1.3 to 2.0) vs. inter-cluster bandwidth (y-axis, 0.0 to 3.0), with points for local parity check nodes, classical MBR, classical MSR, GRC, the space-sharing line, and classical RC.]

Figure 1-2: Comparing the storage overhead and the expensive inter-cluster repair bandwidth of three coding options for a clustered DSS. The three options are (i) extra parity check nodes in each cluster, (ii) classical regenerating codes, and (iii) generalized regenerating codes.

1.3 Small Repair Locality

Optimizing the repair locality is typically studied in the context of locally repairable codes

(LRCs). LRC-based solutions have been used in very large systems, like the storage for Microsoft

Azure [5]. References [6, 27] first introduced the notion of LRC and demonstrated the trade-off

between the locality, the minimum code distance, and the code rate (the inverse of the storage

overhead) for linear or non-linear codes. Figure 1-3 shows an example of an LRC code which

stores a file of 6 packets on 10 nodes. The 6 source packets are placed in the uncoded form on 6

information nodes, which belong to 2 local groups. Each local group contains d + 1 nodes and

includes a local parity node, which allows exact regeneration of any information node from d = 3

helper nodes. Unlike the setting of RC, LRCs are mainly studied in the exact repair regime

with a predetermined and fixed set of helper nodes (repair set) for each potential node failure.

More recent studies [28–30] also considered LRCs in the functional repair regime and allowed

the replacement node to choose the best repair set. The coding rate achievable by LRCs, where

each symbol can be recovered by downloading data from d helper nodes, is bounded by

    R_LRC ≤ d/(d + 1).    (1.1)

Fixed repair sets constitute a problem for LRCs from the availability perspective. Fast


[Figure 1-3 diagram: local group 1 holds a1, b1, c1, the local parity d1 = a1+b1+c1, and the global parity p1; local group 2 holds a2, b2, c2, the local parity d2 = a2+b2+c2, and the global parity p2.]

Figure 1-3: An example of an LRC with 6 information nodes and 4 parity nodes, 2 of which are local parities. Every information node (e.g. c1) can be recovered by downloading data from a specific set of d = 3 other nodes in the same local group. The global parities p1, p2 are linear combinations of all 6 information symbols and allow data regeneration when more than 1 node in each local group is lost.

access to the specific repair set from the replacement node cannot be guaranteed because of (temporary) helper node unavailability, which may be caused by node maintenance, unresponsiveness under high load, bandwidth over-subscription, network congestion, etc. This issue is of special importance in networks with highly unstable or time-varying topology, such

as peer-to-peer (P2P) and peer-aided edge/fog caching networks, mobile ad hoc, and sensor

networks. The problem does not arise for RCs and MDS codes, where an arbitrary set of d

nodes can serve as helpers, but those codes have a significantly lower coding rate R_RC ≤ d/n for a

storage with n nodes. Works [31, 32] and others considered LRCs with several alternative repair

sets. Specifically, [31] considered local groups of d+δ−1 nodes, such that a node within a group

can be recovered from any subset of d other nodes in a group. Each local group represents an

MDS code with the minimum distance δ ≥ 2. Although this construction increases the maximal

coding rate to d/(d + δ − 1), the repair is still bound to a local set of a relatively small number of nodes.

A different model is considered by reference [32]: the nodes have multiple, τ ≥ 1, disjoint repair

sets of size d, and the resulting coding rate is upper-bounded by ∏_{i=1}^{τ} (1 + 1/(id))^{−1}. While this

gives a coding rate improvement as compared to RCs, τ cannot be large, and the repair set

selection remains limited with many nodes excluded from the consideration to be helpers.
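The rate bounds above can be compared numerically. The sketch below uses illustrative parameters, and the product bound of [32] is written here with the i-th factor denominator read as i·d:

```python
from math import prod

def lrc_rate_bound(d):
    """Bound (1.1): LRC rate with one repair set of size d per symbol."""
    return d / (d + 1)

def multi_set_rate_bound(d, tau):
    """Rate bound for tau disjoint repair sets of size d each [32]."""
    return prod(1 + 1 / (i * d) for i in range(1, tau + 1)) ** -1

d = 3  # illustrative repair set size
# With a single repair set (tau = 1) the two bounds coincide.
assert abs(multi_set_rate_bound(d, 1) - lrc_rate_bound(d)) < 1e-12
# Each additional disjoint repair set pushes the achievable rate down.
assert multi_set_rate_bound(d, 2) < multi_set_rate_bound(d, 1)
```

This reflects the trade-off discussed next: more repair-set freedom costs rate.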

Fundamentally, LRCs can achieve a high rate at the price of a limited helper selection and a

small minimum code distance, while the large minimum distance of RC and MDS codes allows

an arbitrary choice of helpers at the price of a low rate (Figure 1-4). The minimum distance is

often treated as the key code parameter controlling the fault tolerance of the DSS. However, it

only measures the fault tolerance with respect to the worst-case erasure patterns, while these


[Figure 1-4 plot: rate R (x-axis, 0.2 to 0.8) vs. log of the number of repair sets (y-axis), with curves/points for RC, Kamath2014, Tamo2016, LRC, RLNC(pe = 3·10^{-4}, t = 100), and RLNC(pe = 0.15, t = 100).]

Figure 1-4: The helper selection vs. rate trade-off (higher is better).

patterns constitute only a small fraction of the possible patterns. Since the DSS reliability is

typically compared via expected value metrics (e.g. normalized magnitude of data loss — the

expected number of lost bytes per terabyte in the first 5 years of deployment [33]), it is more

natural to consider the expected impact of the potential failure and repair availability patterns

on the storage. In channel coding, this phenomenon is illustrated by Berlekamp's bat [34, Chapter 13] (Figure 1-5). A bat flies inside a nearly-spherical cave, the center of which represents the original codeword surrounded by the neighboring codewords in a high-

dimensional code space. The location of the bat represents the distorted codeword. The bat

tries to avoid touching the spikes on the wall. Since neighboring codewords at exactly the minimum distance are scarce, the range within which the bat can fly in perfect safety is far smaller than the range within which it can fly with a high probability of safety. While the early coding theory

focused on codes with a large minimum distance, which guarantee correction of a large number of erasures, the Shannon capacity can only be reached by codes operating far beyond the

minimum distance of the code: the probability of hitting the vicinity of another codeword is

low even when the number of erasures exceeds the minimum distance. Examples of such codes

are Turbo codes, LDPC, and random linear codes.

To study the expected impact of failures and repairs, we need to introduce a probability

measure on them, i.e. to consider a random selection of the failure and helper nodes. Since in

channel coding random linear codes perform well in decoding beyond the minimum distance, for


[Figure 1-5 diagram: the bat's perfect safety range vs. its larger safety range with high probability.]

Figure 1-5: Berlekamp's bat phenomenon.

a random selection DSS model it is most natural to repair a node by generating random linear

combinations of the packets on the helper nodes, in other words, employing RLNC. In Chapter

6, we introduce a stochastic failure and repair DSS model, equipped with RLNC code

generation and repair, and study its rate-reliability trade-offs. Since the model is probabilistic,

its performance is parametrized by time, which is measured as the number of the failure and

repair iterations.

The time-dependent nature of the model has another interesting aspect. Most existing DSS

codes, such as MDS, RCs, and LRCs, are designed under the assumption that the data needs

to be stored forever. However, the data often has a limited lifespan, after which the data can

be deleted or migrated to another (e.g. archival) storage. The lifespan can be really short: for

instance, in edge caching, a certain content can be popular just for a month, and afterwards it

does not need to be stored in caches any more. For such scenarios, it is reasonable to have a code

which provides guarantees of maintaining the data only for a limited number of node repairs. As a result of this relaxed requirement, we can expect an improved storage overhead and lower

storage costs. The RLNC code considered in Chapter 6 with a gradual degradation over time

under a random node selection model is well-suited for such limited lifetime applications. The

code both realizes the rate gain over RCs and has practically the same helper selection freedom

as RCs, far exceeding that of LRCs (Figure 1-4), which makes it a viable storage solution for

the time-varying networks.


In Chapter 7, we numerically study the performance of RLNC under our model in a wide

range of model and implementation assumptions and parameters. We show RLNC to perform

stably well with the binary field, low repair bandwidth, non-uniform node distributions, variable

number of helper nodes, and sparse RLNC recoding. Even in those adverse conditions, the

achievable code rate is shown to be significantly higher than that of regenerating codes.


Chapter 2

Preliminaries

In this chapter, we provide the notations and tools used throughout the thesis. We give the

basic definitions of coding theory, network coding, and matroid theory. In addition, we

provide an overview of the framework and main results from the theory of RCs, and introduce

information flow graphs (IFG) as an important tool for studying functional repair systems.

2.1 Notations

Unless indicated otherwise, we use capital letters to denote matrices, sets and graph node

labels, bold small letters to denote row vectors, and regular font for scalar variables. We also

use capital letters when we need to highlight the fact that a variable is a random variable.

rowspan A and colspan A denote the linear subspaces spanned by the rows and columns of matrix A, respectively. I_n denotes the n × n identity matrix. A^T denotes the transpose of a matrix or a vector A. Unless specified otherwise, A|_s and A|_S denote the submatrices of A consisting of row s and of the rows in set S, respectively, while A|^s and A|^S denote the submatrices of A consisting of column s and of the columns in set S, respectively.

|A| is the cardinality of set A. We use A ∪ B,A ∩ B,A − B to denote union, intersection,

and set-theoretic difference of sets A, B, and also use A + b, A − b to denote A ∪ {b}, A − {b}. The set of integers between i, j inclusive is denoted [i, j] = {i, i + 1, . . . , j − 1, j}, and [n] ≜ [1, n].

Unless specified otherwise, all data/code symbols are considered elements of a finite field

Fq. We use Fq to denote any finite field with q elements. Whenever needed, a specific


field will be indicated, if multiple fields with the same q exist (for non-prime q). H(X) will

denote the entropy of a discrete random variable or a set of random variables X, computed

with respect to log q. H(X|Y ) denotes the conditional entropy. We shall also use the chain rule

entropy expansion:

    H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi | {Xj : j ∈ [i − 1]}).    (2.1)
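The chain rule (2.1) can be checked numerically on a toy joint distribution (here q = 2, so entropies are in bits; the pmf values are illustrative):

```python
from math import log2

# Toy joint pmf of (X1, X2) over {0, 1}^2; the values are illustrative.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(v * log2(v) for v in pmf.values() if v > 0)

# Marginal of X1 and conditional entropy H(X2 | X1).
p1 = {x1: p[(x1, 0)] + p[(x1, 1)] for x1 in (0, 1)}
H2_given_1 = sum(p1[x1] * H({x2: p[(x1, x2)] / p1[x1] for x2 in (0, 1)})
                 for x1 in (0, 1))

# Chain rule (2.1) for n = 2: H(X1, X2) = H(X1) + H(X2 | X1).
assert abs(H(p) - (H(p1) + H2_given_1)) < 1e-12
```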

For a real number x, (x)+ is used as a shorthand notation for max{x, 0}. For integers a, b, if b

is a multiple of a, we shall write a|b, and a ∤ b otherwise. 1_E is the indicator function, equal to 1 if E is true, and 0 otherwise.

For a system of equations or inequalities, the notation

    [ A
    [ B        (2.2)

implies that at least one among A, B must hold true.

2.2 Linear Codes

A brief overview of the main notions in coding theory is given in this section. For a more

detailed reference on the topic, we refer the readers to [35].

A (block) code C is a map from F_q^k to F_q^n. In this thesis we only consider linear (n, k) codes, for which the map is described by a linear transformation C : u → uG, where the full-rank matrix G ∈ F_q^{k×n} is called the generator matrix of the code. u is referred to as a vector of information

symbols or a message, uG is a codeword corresponding to the message. The set of all possible

codewords forms the codebook, which with a slight abuse of notation will also be denoted by C = {uG : u ∈ F_q^k}. Note that for any codewords c, c′ ∈ C and any a ∈ Fq, ac and c + c′ also belong to C, i.e. the code can also be characterized as a linear subspace of F_q^n with dimension k. The codebook is not uniquely identified by the generator matrix: for any full-rank matrix A ∈ F_q^{k×k}, the code with generator matrix AG corresponds to the same codebook.

The code rate of C is given by k/n, and represents the average information value of one


codeword symbol. The code is called systematic if the generator matrix is of the form G =

[Ik P ]; in this case, the first k codeword symbols contain the original message symbols, and the

remaining n− k are parity check symbols.

The weight |x| of a codeword x is the number of non-zero symbols in it, and the (Hamming)

distance |x−x′| between two codewords x,x′, is the weight of their difference, i.e. the number

of the coordinates where the two codewords differ. The minimum distance of code C is the

shortest distance between two distinct codewords from C, or equivalently, the smallest weight

of a non-zero codeword in C. The minimum distance D ≥ 1 of the code is related to the code's capability to tolerate symbol erasures. If a codeword x = uG suffers erasures at m arbitrary positions, the observed codeword y has m unknown coordinates, but it can be corrected and uniquely mapped back to x and decoded to u as long as no other codeword x′ can be transformed to y by m erasures. A code with minimum distance D can correct arbitrary D − 1 erasures in a codeword. Such a code is called an (n, k, D) code.

Theorem 2.2.1 (Singleton Bound). The minimum distance of an (n, k) linear code is upper bounded by

    D ≤ n − k + 1.    (2.3)

A code that achieves the Singleton bound with D = n− k + 1 is called maximum distance

separable (MDS). For an MDS code every subset of k columns of the generator matrix is linearly

independent, and therefore any k symbols of a codeword x are sufficient to decode the message

u.

Theorem 2.2.2 (MDS Code Generator Matrix). A code is MDS if and only if its codebook

can be represented by a generator matrix in the systematic form G = [Ik P ], and every square

sub-matrix of P is invertible, where a sub-matrix is defined as an intersection of any i columns

with any i rows, i ∈ [min{k, n− k}].

Examples of MDS codes are:

• n-Repetition code (n, 1, n);

• single parity check code (k + 1, k, 2);


• Reed-Solomon (RS) codes (n, k, n− k + 1).

While the first two codes can be constructed over F2, RS codes require Fq with q ≥ n.
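The MDS decoding property (any k codeword symbols suffice) can be checked directly on a small example. The sketch below, with illustrative parameters, builds a (6, 3) Reed-Solomon-style Vandermonde generator over F_7 (q = 7 ≥ n) and verifies that every set of k = 3 columns is full-rank:

```python
from itertools import combinations

def rank_mod_p(rows, p):
    """Rank of a matrix (list of rows) over the prime field F_p."""
    m = [row[:] for row in rows]
    rank, col, ncols = 0, 0, len(m[0])
    while rank < len(m) and col < ncols:
        piv = next((r for r in range(rank, len(m)) if m[r][col] % p), None)
        if piv is None:
            col += 1
            continue
        m[rank], m[piv] = m[piv], m[rank]
        inv = pow(m[rank][col], p - 2, p)          # inverse via Fermat
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col] % p:
                f = m[r][col]
                m[r] = [(a - f * b) % p for a, b in zip(m[r], m[rank])]
        rank += 1
        col += 1
    return rank

# (n, k) = (6, 3) Vandermonde generator over F_7; rows are x^i on points 1..6.
p, n, k = 7, 6, 3
G = [[pow(x, i, p) for x in range(1, n + 1)] for i in range(k)]

# MDS property: every k columns of G are linearly independent, so any k
# codeword symbols suffice to decode the message.
for cols in combinations(range(n), k):
    sub = [[G[i][c] for c in cols] for i in range(k)]
    assert rank_mod_p(sub, p) == k
```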

For systematic codes C1(n1, k1), C2(n2, k2) with generator matrices G1 = [I_{k1} P1], G2 = [I_{k2} P2], the product code C(n2n1, k2k1) of C1, C2 maps a message U ∈ F_q^{k2×k1} to a codeword X ∈ F_q^{n2×n1}, where X is given by

    X = [ U        U P1      ]  =  [ U      ] [I_{k1}  P1]  =  [ I_{k2} ] [U  U P1].    (2.4)
        [ P2^T U   P2^T U P1 ]     [ P2^T U ]                  [ P2^T   ]

Every row of X is a codeword from C1, and every column is a codeword from C2.
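A minimal sketch of the product construction (2.4), with two (3, 2) single parity check component codes over F_2 as an illustrative choice, confirms the row/column property:

```python
import random

# Component codes over F_2 (illustrative): C1 = C2 = (3, 2) single parity
# check code with systematic generator [I2 | P], P = [[1], [1]].
P1 = [[1], [1]]
P2 = [[1], [1]]

def matmul2(A, B):
    """Matrix product over F_2."""
    return [[sum(a * b for a, b in zip(row, col)) % 2
             for col in zip(*B)] for row in A]

def codebook(P, k):
    """All codewords u [I_k | P] for u in F_2^k."""
    book = set()
    for u in range(2 ** k):
        msg = [(u >> i) & 1 for i in range(k)]
        book.add(tuple(msg + matmul2([msg], P)[0]))
    return book

C1, C2 = codebook(P1, 2), codebook(P2, 2)

# Product codeword per (2.4): X = [[U, U P1], [P2^T U, P2^T U P1]].
U = [[random.randint(0, 1) for _ in range(2)] for _ in range(2)]
top = [row + pr for row, pr in zip(U, matmul2(U, P1))]
P2T = [list(col) for col in zip(*P2)]
bot_left = matmul2(P2T, U)
bottom = [row + pr for row, pr in zip(bot_left, matmul2(bot_left, P1))]
X = top + bottom

# Every row of X is a codeword of C1, and every column a codeword of C2.
assert all(tuple(row) in C1 for row in X)
assert all(tuple(col) in C2 for col in zip(*X))
```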

For integer α ≥ 1, a linear (n, k) vector code C is a linear (n, k) code over the symbol alphabet F_q^α, such that its codewords are Fq-linear, i.e. for any codewords c, c′ ∈ C and any a, a′ ∈ Fq, ac + a′c′ ∈ C.

2.3 Linear Network Coding

Network coding in packet networks assumes that intermediate nodes perform coding across the

incoming packets to generate the outgoing packets. Specifically, in this thesis we assume that in

a network with packets of length m′, a node x with k incoming and n outgoing links computes the n outgoing packets as the columns of UGx, where the matrix U ∈ F_q^{m′×k} contains the k incoming packets as columns, and Gx ∈ F_q^{k×n} is the generator matrix of a local linear code at node x;

note that n is determined by the network structure, and may be smaller than k. We will say

that node x recodes the k incoming packets into n outgoing packets. In random linear network

coding (RLNC) the elements of Gx are drawn at random from Fq, rather than deterministically

constructed based on the network topology.

To keep track of the transformations of the original source packets after several recoding

operations, each coded packet has a header with the coordinates in the source packets basis.

To be more precise, let s1, . . . , sr ∈ F_q^m be the uncoded source packets to be transmitted over the


network, let the matrix S ≜ [s1^T . . . sr^T] ∈ F_q^{m×r}, and let

    S′ = [ I_r ]
         [ S   ].    (2.5)

The columns of S′ are injected into the network as packets of length m′ = r + m, with an

r-symbol header. When these packets are recoded in the network, each generated coded packet

is a linear combination of columns of S′, in which the first r symbols are the coordinates of

the m-symbol payload part in the basis of the source packets s1, . . . , sr. The r-symbol header

is called (global) coding vector of the packet. Whenever a node receives r coded packets with

linearly independent coding vectors, i.e. the matrix of the r coding vectors is full-rank, the node

can decode the r source packets by applying the inverse linear transformation to the received

packets.
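The recoding and header mechanism of this section can be sketched end to end: inject r source packets with identity headers, recode them randomly twice, and decode once r packets with linearly independent coding vectors are collected. The field size p = 13 and the dimensions below are illustrative:

```python
import random

p = 13            # prime field size (illustrative)
r, m = 3, 4       # number of source packets and payload length

def solve_mod_p(A, B, p):
    """Solve A X = B over F_p by Gauss-Jordan; returns None if A is singular."""
    n = len(A)
    M = [A[i][:] + B[i][:] for i in range(n)]
    for c in range(n):
        piv = next((i for i in range(c, n) if M[i][c]), None)
        if piv is None:
            return None
        M[c], M[piv] = M[piv], M[c]
        inv = pow(M[c][c], p - 2, p)
        M[c] = [(x * inv) % p for x in M[c]]
        for i in range(n):
            if i != c and M[i][c]:
                f = M[i][c]
                M[i] = [(a - f * b) % p for a, b in zip(M[i], M[c])]
    return [row[n:] for row in M]

# Source packets: columns of S' = [I_r ; S], i.e. an r-symbol identity header
# followed by the m-symbol payload (column i of S).
S = [[random.randrange(p) for _ in range(r)] for _ in range(m)]
packets = [[1 if j == i else 0 for j in range(r)] +
           [S[t][i] for t in range(m)] for i in range(r)]

def recode(pkts, count):
    """RLNC recoding: emit `count` random F_p-linear combinations of pkts."""
    return [[sum(c * pk[s] for c, pk in zip(coeffs, pkts)) % p
             for s in range(r + m)]
            for coeffs in ([random.randrange(p) for _ in pkts]
                           for _ in range(count))]

# Two recoding hops; retry in the unlikely event the headers are singular.
decoded = None
while decoded is None:
    received = recode(recode(packets, 5), r)
    headers = [pk[:r] for pk in received]
    # headers * S^T = payloads, so S^T = headers^{-1} * payloads.
    decoded = solve_mod_p(headers, [pk[r:] for pk in received], p)

assert decoded == [[S[t][i] for t in range(m)] for i in range(r)]
```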

2.4 Regenerating Codes

Next, we overview the model of RCs of [1]. A source file of size B symbols is encoded and

stored in a DSS of n storage nodes. Each node stores α symbols, and the code rate is B/(nα).

Whenever a node failure happens, a new node replaces the failed one and downloads β symbols

of helper data from each node in an arbitrary set of d ≥ k helper nodes to generate its

content. Under exact repair (ER) the generated content should be the same as that on the

failed node; under functional repair (FR) there is no such constraint. To retrieve the source file

(perform data collection) one downloads the content of an arbitrary set of k nodes.

For the model outlined above with parameters (n, k, d, α, β) and FR, the source file of size

B can be stored in the system if and only if

    B ≤ B_FR ≜ ∑_{i=1}^{k} min{α, (d − i + 1)β}.    (2.6)
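Bound (2.6) is easy to evaluate numerically; the sketch below uses the parameters of Figure 2-1 and also evaluates the MSR and MBR operating points discussed later in this section:

```python
def fr_capacity(k, d, alpha, beta):
    """Max file size under functional repair, bound (2.6)."""
    return sum(min(alpha, (d - i + 1) * beta) for i in range(1, k + 1))

# Parameters of Figure 2-1: (n, k, d, B) = (7, 5, 6, 10).
k, d, B = 5, 6, 10

# MSR point: alpha = B/k = (d - k + 1) * beta -> smallest storage.
alpha_msr = B / k                       # 2.0
beta_msr = alpha_msr / (d - k + 1)      # 1.0
assert fr_capacity(k, d, alpha_msr, beta_msr) == B
assert d * beta_msr == 6.0              # repair bandwidth at MSR

# MBR point: d * beta = alpha and B = sum_{i=1}^{k} (d - i + 1) * beta.
beta_mbr = B / sum(d - i + 1 for i in range(1, k + 1))   # 10/20 = 0.5
alpha_mbr = d * beta_mbr                                  # 3.0
assert fr_capacity(k, d, alpha_mbr, beta_mbr) == B
assert (alpha_mbr, d * beta_mbr) == (3.0, 3.0)
```

The two computed points, (α, dβ) = (2, 6) and (3, 3), match the MSR and MBR corners of the trade-off curve in Figure 2-1.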

For a fixed file size B, there exists a trade-off between storage α per node and repair

bandwidth dβ (Figure 2-1). Only the pairs of (α, dβ) on or above the trade-off curve are feasible.


[Figure 2-1 plot: storage per node α (x-axis, 2 to 3.2) vs. repair bandwidth dβ (y-axis, 3 to 6.5), showing the FR trade-off curve, the approximate ER trade-off, and the MBR and MSR points.]

Figure 2-1: Storage-bandwidth trade-off for RCs with (n = 7, k = 5, d = 6, B = 10). The precise trade-off for exact repair remains unknown.

Since α is proportional to the storage overhead nα/B, lower values of α lead to lower storage

costs. The MSR (minimum storage regeneration) point on the trade-off corresponds to the

smallest storage overhead, and is defined by α = B/k = (d − k + 1)β. The MBR (minimum

bandwidth regeneration) point on the trade-off corresponds to the smallest repair bandwidth,

which is defined by dβ = α and B = ∑_{i=1}^{k} (d − i + 1)β. For a bounded number of failures, all points

on the FR trade-off can be achieved by linear network coding over a large enough field, or in

particular by RLNC, or space-sharing of two network coded solutions. Reference [36] presented

randomized and deterministic FR code constructions for an unlimited number of failures. The

general ER trade-off remains an open problem, although it is known that ER can always achieve

the MSR and MBR points, and for most points between those two the ER trade-off is strictly

worse than that of FR. Under ER, the MBR point can be achieved by a product-matrix code

construction for all values of parameters (n, k, d) [14].

2.4.1 Information Flow Graph

The information flow graph (IFG) is a convenient tool for analysis of the maximal achievable file size of RC models in the FR regime. The IFG is a directed acyclic graph with capacitated edges, which

represents the data flows from the uncoded source file to the data collectors via an error-free

network of the original and replacement storage nodes with limited memory. Each original or


[Figure 2-2 diagram: an information flow graph with source S, pairs of in-/out-nodes for the original and replacement storage nodes, a data collector Z, and the edge capacities (α on in-to-out edges, β on repair edges, ∞ elsewhere).]

Figure 2-2: An example of an information flow graph for n = 4, k = 2, d = 2 and 3 node failures/repairs. Also shown is a sample cut (U, V) between S and Z of capacity α + β.

replacement physical node X_i of size α is represented in the IFG by a new pair of in- and out-nodes X_i^in → X_i^out, with an edge of capacity α between them. The in-node represents the single point where data enters the node, possibly from several sources, e.g. helper nodes. The out-node connects to the other nodes or data collectors, to which node X_i sends data. A special IFG node S serves as the source of the source file of size B to be stored in the DSS.

The IFG evolves with the DSS as node failures and repairs happen. Before any node failures,

the IFG contains n pairs of nodes X_i^in, X_i^out, i ∈ [n], and all n out-nodes are considered active. The source node S connects to all n in-nodes X_i^in, i ∈ [n], via edges of infinite capacity. If node X_i fails, its IFG out-node X_i^out becomes inactive, a replacement physical node is introduced to the system, and it downloads helper data from some d surviving nodes. The new node gets a new index, say X_{n+1}, and is represented by a new pair of in- and out-nodes in the IFG. The in-node X_{n+1}^in connects to the corresponding d active out-nodes via edges of capacity β. The new out-node X_{n+1}^out becomes active, so that at any moment there are n active out-nodes in the graph. A data collector is represented by a node Z, which connects to k of the active out-nodes X_j^out at any specific moment via edges of infinite capacity. An example of an IFG is shown in Figure 2-2.

A cut of the IFG is a partitioning of all graph nodes into two sets (U ,V). The cut-set


corresponding to a cut (U ,V) is the set of all edges which go from a node in U to a node in V.

The capacity (or the value) of a cut or the corresponding cut-set is the sum of the capacities of

the edges in the cut-set. A cut between nodes S and Z is any cut (U, V) with S ∈ U, Z ∈ V. If C is a cut-set between S and Z, then any directed path from S to Z has at least one edge in C. A cut between S and Z is called a minimal cut or min-cut if its capacity is minimal among

all cuts between S and Z.

Given parameters (n, k, d, α, β,B), the problem of code repair satisfying the RC model

requirements can be cast as a problem of multicasting B symbols over all possible IFGs with

parameters (n, d, α, β) to an arbitrary number of data collectors, connecting to any k active

out-nodes. The latter problem is solvable if and only if B is no greater than the min-cut value

between S and any data collector node Z for any possible IFG. Moreover, if the condition is

satisfied, then there exists a linear network code solution over a sufficiently large field such that

all data collectors can recover the B source symbols. RLNC also provides a solution with probability arbitrarily close to 1 as the field size increases, for a bounded number of failures/repairs.
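The min-cut condition can be illustrated on a small IFG. The sketch below constructs an IFG for a hypothetical failure sequence (n = 4, α = 2, d = 2, β = 1, two failures/repairs, a data collector reading the k = 2 replacement nodes; this is not the exact sequence of Figure 2-2) and computes the S-to-Z max flow with a hand-rolled Edmonds-Karp routine:

```python
from collections import deque

INF = 10 ** 9

def max_flow(edges, s, t):
    """Edmonds-Karp max flow; `edges` maps (u, v) -> capacity."""
    cap = dict(edges)
    adj = {}
    for u, v in edges:
        cap.setdefault((v, u), 0)            # residual reverse edge
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:         # BFS for an augmenting path
            u = q.popleft()
            for v in adj.get(u, ()):
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= aug
            cap[(v, u)] += aug
        total += aug

# IFG: X5 replaces X1 (helpers X2, X3), then X6 replaces X2 (helpers X5, X3);
# the data collector Z reads the k = 2 active nodes X5 and X6.
alpha, beta = 2, 1
E = {}
for i in (1, 2, 3, 4):
    E[('S', f'X{i}in')] = INF
    E[(f'X{i}in', f'X{i}out')] = alpha
E[('X2out', 'X5in')] = beta
E[('X3out', 'X5in')] = beta
E[('X5in', 'X5out')] = alpha
E[('X5out', 'X6in')] = beta
E[('X3out', 'X6in')] = beta
E[('X6in', 'X6out')] = alpha
E[('X5out', 'Z')] = INF
E[('X6out', 'Z')] = INF

# For this failure sequence the min cut equals the bound
# sum_{i=1}^{k} min(alpha, (d - i + 1) beta) = min(2, 2) + min(2, 1) = 3.
assert max_flow(E, 'S', 'Z') == 3
```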

2.5 Matroids

In this section, we give a brief overview of the matroid theory terminology and results used in

the thesis. We refer the reader to reference [37] for more details.

A matroid M is a pair (E , I) consisting of a finite set E (the ground set), and a collection

I of subsets of E . The elements of I are called independent sets. I should satisfy the following

properties:

• I is non-empty.

• Every subset of an independent set is also independent.

• (Independence augmentation property) If sets I1 and I2 are independent (I1, I2 ∈ I), and

|I2| = |I1|+ 1, then there is an element s ∈ I2 − I1, such that I1 ∪ {s} is independent.

A set S ⊆ E which is not independent is called dependent. The maximal independent sets are

called bases (I ∈ I, I + s ∉ I for all s ∉ I). Every basis has the same cardinality. The minimal dependent sets are called circuits (S ∉ I, S − s ∈ I for all s ∈ S). A matroid on n elements


is uniquely characterized not only by the collection of its independent sets but also by the

collection of its circuits. An element s ∈ E is a loop if it is not an element of any basis, or,

equivalently, if {s} is a circuit. An element s ∈ E is a coloop if it is not an element of any

circuit, or, equivalently, if s is an element of every basis. If for elements s1, s2 of matroid M,

{s1, s2} is a circuit, then s1 and s2 are said to be parallel in M, and the set of all elements

parallel to s1 or s2 is called a parallel class.

Given a matrix A ∈ F_q^{n×m}, the vector matroid M[A] is the matroid defined over the set of

columns of A, where a subset independence is defined as the linear independence of the columns

in the subset.
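The matroid axioms can be verified exhaustively for a small vector matroid over F_2; the matrix A below is an illustrative choice:

```python
from itertools import combinations

# Vector matroid M[A] over F_2: the ground set is the column indices of A,
# and a set is independent iff the corresponding columns are linearly
# independent. A is an illustrative 3x5 binary matrix.
A = [
    [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
]
E = range(5)

def rank2(cols):
    """Rank over F_2 of the selected columns of A (Gaussian elimination)."""
    rows = [[A[r][c] for c in cols] for r in range(len(A))]
    rk = 0
    for c in range(len(cols)):
        piv = next((r for r in range(rk, len(rows)) if rows[r][c]), None)
        if piv is None:
            continue
        rows[rk], rows[piv] = rows[piv], rows[rk]
        for r in range(len(rows)):
            if r != rk and rows[r][c]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[rk])]
        rk += 1
    return rk

def independent(S):
    return rank2(S) == len(S)

# Enumerate the collection I of independent sets.
I = [S for size in range(6) for S in combinations(E, size) if independent(S)]

assert () in I                                   # I is non-empty
assert all(T in I                                # hereditary property
           for S in I if S
           for T in combinations(S, len(S) - 1))
for S1 in I:                                     # independence augmentation
    for S2 in I:
        if len(S2) == len(S1) + 1:
            assert any(independent(tuple(sorted(set(S1) | {s})))
                       for s in set(S2) - set(S1))
```

Columns 0, 1, 3 of this A are dependent (column 3 is the sum of columns 0 and 1), so {0, 1, 3} is a circuit-containing set while {0, 1, 2} is a basis.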

If out of the three properties of I only the first two are satisfied, the resulting object (E , I)

is called an independence system.


Part I

Regenerating Codes for Clustered

Storage Systems


Chapter 3

Generalized Regenerating Codes

(GRCs) and File Size Bounds

3.1 System Model

We propose a natural generalization of the setting of regenerating codes (RC) [1] for clustered

storage networks. The network consists of n clusters, with m nodes in each cluster. The network

is fully connected such that any two nodes within a cluster are connected via an intra-cluster

link, and any two clusters are connected via an inter-cluster link. A node in one cluster that

needs to communicate with another node in a second cluster does so via the corresponding

inter-cluster link. A source file of size B symbols is encoded into nmα symbols and stored

across the nm nodes such that each node stores α symbols. For data collection, we have an

availability constraint such that the entire content of any k clusters should be sufficient to

recover the original data file. Nodes represent points of failure. In this chapter, we restrict

ourselves to the case of efficient recovery from single node failure. In Chapter 4, we generalize

some of our results to the scenario of recovering from multiple node failures within a cluster.

Node repair is parametrized by three parameters d, β and ℓ. We assume that the replacement

of a failed node is in the same cluster (host cluster) as the failed node. The replacement node

downloads β symbols each from any set of d other clusters, dubbed remote helper clusters. The

β symbols from any of the remote helper clusters are a function of the mα symbols present


Table 3.1: Notation for the clustered storage system model.

Symbol   Definition
B        source file size
n        total number of clusters in the system
m        number of storage nodes in each cluster
k        number of clusters required for data collection
d        number of remote helper clusters providing helper data during a node repair
d′       min{d, k}
ℓ        number of local helper nodes providing helper data during node repair
q        finite field size for data symbols
α        number of symbols each storage node holds for one coded file
β        size of helper data downloaded from each remote helper cluster during node repair, in symbols

in the cluster; we assume that a dedicated compute unit in the cluster takes responsibility for

computing these β symbols before passing them outside the cluster. In addition, the replacement

node can download (entire) content from any set of ℓ ∈ [m−1] other nodes, dubbed local helper

nodes, in the host cluster, during the repair process. The quantity dβ represents the inter-

cluster repair-bandwidth. We shall also use notation d′ = min{d, k}. We refer to the overall

code as the generalized regenerating code (GRC) Cm with parameters {(n, k, d)(α, β)(m, ℓ)}. A

summary of the various parameters used in the description of the system model appears in

Table 3.1.

The model reduces to the setup of RCs in [1] when m = 1 (in which case, ℓ = 0 automatically). We shall refer to the setup in [1] as the classical setup or classical regenerating codes.

Our generalization has two additional parameters ℓ and m when compared with the classical

setup. As in the classical setup, we consider both FR and ER regimes. We further note that,

unlike the classical setup, our generalized setup permits d < k.

We will say that a GRC code is locally non-redundant if the encoding function does not

introduce any local dependence among the content of the various nodes of a cluster. For linear

GRC, the coded content of cluster i can be written as uGi, where u is the message vector of

length B, and Gi is a B × mα matrix. In this case, a locally non-redundant code means that Gi has full column rank. Conversely, a locally redundant code can have, for example, a local

parity node within a cluster, which would hold the component-wise sum in Fαq of the data on

the other m− 1 nodes.


The model described above does not consider intra-cluster bandwidth incurred during repair.

Intra-cluster bandwidth is needed, firstly, to compute the β symbols in any remote helper

cluster, and, secondly, to download content from ℓ local helper nodes in the host cluster. The

intra-cluster bandwidth of GRC is studied in detail in chapter 5.

Our goal is to obtain a trade-off between storage overhead nmα/B and inter-cluster repair-

bandwidth dβ for an {(n, k, d)(α, β)(m, ℓ)} GRC.

3.1.1 IFG Model for GRC

In this section, we describe the IFG model of GRC used in this chapter to derive the main file

size bound. Let $X_{i,j}$ denote the physical node $j \in [m]$ in cluster $i \in [n]$. In the IFG, $X_{i,j}$ is represented by a pair of nodes $X^{\mathrm{in}}_{i,j} \xrightarrow{\alpha} X^{\mathrm{out}}_{i,j}$. With a slight abuse of notation, we will let $X_{i,j}$ also denote the pair $(X^{\mathrm{in}}_{i,j}, X^{\mathrm{out}}_{i,j})$ of graph nodes. Cluster $i$ also has an additional external node, denoted $X^{\mathrm{ext}}_i$. Each out-node $X^{\mathrm{out}}_{i,j}$, $j \in [m]$, in the cluster is connected to $X^{\mathrm{ext}}_i$ via an edge of capacity $\alpha$. The external node $X^{\mathrm{ext}}_i$ is used to transfer data outside the cluster, and thus serves two purposes: 1) it represents a single point of contact to the cluster for a data collector which connects to this cluster, and 2) it represents the compute unit which generates the $\beta$ symbols for repair of any node in a different cluster.

The source node $S$ connects to the in-nodes of all physical storage nodes in their original state ($S \xrightarrow{\infty} X^{\mathrm{in}}_{i,j}$, $\forall i \in [n]$, $\forall j \in [m]$). The sink node $Z$ represents a data collector; it connects to the external nodes of an arbitrary subset of $k$ clusters ($X^{\mathrm{ext}}_i \xrightarrow{\infty} Z$).

Each cluster at any moment has $m$ active nodes. When a physical node $X_{i,j}$ fails, it becomes inactive, and its replacement node becomes active instead (see Figure 3-1 for an illustration). The replacement node is regenerated by downloading $\beta$ symbols from any $d$ nodes in the set $\{X^{\mathrm{ext}}_{i'},\ i' \in [n],\ i' \neq i\}$. The replacement node also connects to any subset of $\ell$ nodes in the set $\{X^{\mathrm{out}}_{i,j'},\ j' \in [m],\ j' \neq j\}$ via links of capacity $\alpha$.

Along with the replacement of $X_{i,j}$, we also copy the remaining $m-1$ nodes in cluster $i$ as they are, and represent them with new identical pairs of nodes ($X^{\mathrm{in}}_{i,j'} \xrightarrow{\alpha} X^{\mathrm{out}}_{i,j'}$), $j' \in [m]$, $j' \neq j$. We shall also have a new external node for the cluster, which connects to the new $m$ out-nodes. Thus, in the IFG modeling, we say that the entire old cluster with the failed node becomes inactive, and gets replaced by a new active cluster. For either data collection


Figure 3-1: An example of the IFG model representing the notion of generalized regenerating codes, when intra-cluster bandwidth is ignored. In this figure, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1).

or repair, we connect to the external nodes of the active clusters. Note that, at any point in time, a physical cluster corresponds to exactly one active cluster in the IFG, and to $f_i$ inactive clusters in the IFG, where $f_i \geq 0$ denotes the total number of failures and repairs experienced by the various nodes in the cluster. We shall use the notation $\mathcal{X}_i(t)$, $0 \leq t \leq f_i$, to denote the cluster that appears in the IFG after the $t$th repair associated with cluster $i$. After $f_i$ repairs, the clusters $\mathcal{X}_i(0), \ldots, \mathcal{X}_i(f_i - 1)$ are inactive, while $\mathcal{X}_i(f_i)$ is active. The nodes of $\mathcal{X}_i(t)$ will be denoted by $X^{\mathrm{in}}_{i,j}(t)$, $X^{\mathrm{out}}_{i,j}(t)$, $X^{\mathrm{ext}}_i(t)$, $1 \leq j \leq m$. With a slight abuse of notation, we will let $\mathcal{X}_i(t)$ also denote the collection of all $2m+1$ nodes in this cluster. We write $X_{i,j}(t)$ to denote the pair $(X^{\mathrm{in}}_{i,j}(t), X^{\mathrm{out}}_{i,j}(t))$; again, with a slight abuse of notation, we shall use $X_{i,j}(t)$ to also denote node $j$ in cluster $i$ after the $t$th repair in cluster $i$. We further use the notation $\mathcal{F}_i$ to denote the union (family) of all nodes in all inactive clusters and the active cluster corresponding to the physical cluster $i$ after $t$ repairs in cluster $i$, i.e., $\mathcal{F}_i = \cup_{t=0}^{f_i} \mathcal{X}_i(t)$. We have avoided indexing $\mathcal{F}_i$ with the parameter $t$ as well, to keep the notation simple. The value of $t$ in our usage of the notation $\mathcal{F}_i$ will be clear from the context.
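The IFG construction can be made concrete in code. The sketch below builds the graph of the (n = 3, k = 2, d = 2)(m = 2, ℓ = 1) example for the failure sequence later used in the upper-bound proof, and computes its S–Z min cut with a textbook Edmonds–Karp routine. The wiring is our own reading of the model, and the values α = 4, β = 1 are arbitrary choices satisfying 2β ≤ α:

```python
# Toy IFG for (n=3, k=2, d=2)(m=2, l=1): two failures (node 2 of clusters 1
# and 2), then a data collector on the two repaired clusters. The min cut
# should equal 2*alpha + 3*beta when 2*beta <= alpha.
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """Edmonds-Karp max flow on a capacity dict-of-dicts (mutated in place)."""
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t                 # recover the augmenting path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:               # update residual capacities
            cap[u][v] -= aug
            cap[v][u] += aug
        flow += aug

alpha, beta, INF = 4, 1, 10**9
cap = defaultdict(lambda: defaultdict(int))
def edge(u, v, c): cap[u][v] += c

for i in (1, 2, 3):                     # initial clusters X_i(0)
    for j in (1, 2):
        edge('S', ('in', i, j, 0), INF)
        edge(('in', i, j, 0), ('out', i, j, 0), alpha)
        edge(('out', i, j, 0), ('ext', i, 0), alpha)

# repair 1: node (1,2) fails; new cluster X_1(1)
edge(('out', 1, 1, 0), ('in', 1, 1, 1), alpha)   # surviving node copied
edge(('in', 1, 1, 1), ('out', 1, 1, 1), alpha)
edge(('out', 1, 1, 0), ('in', 1, 2, 1), alpha)   # l = 1 local helper
for h in (2, 3):                                 # d = 2 remote helpers
    edge(('ext', h, 0), ('in', 1, 2, 1), beta)
edge(('in', 1, 2, 1), ('out', 1, 2, 1), alpha)
for j in (1, 2):
    edge(('out', 1, j, 1), ('ext', 1, 1), alpha)

# repair 2: node (2,2) fails; new cluster X_2(1); helpers X_1(1) and X_3(0)
edge(('out', 2, 1, 0), ('in', 2, 1, 1), alpha)
edge(('in', 2, 1, 1), ('out', 2, 1, 1), alpha)
edge(('out', 2, 1, 0), ('in', 2, 2, 1), alpha)
edge(('ext', 1, 1), ('in', 2, 2, 1), beta)
edge(('ext', 3, 0), ('in', 2, 2, 1), beta)
edge(('in', 2, 2, 1), ('out', 2, 2, 1), alpha)
for j in (1, 2):
    edge(('out', 2, j, 1), ('ext', 2, 1), alpha)

edge(('ext', 1, 1), 'Z', INF)           # data collector on k = 2 active clusters
edge(('ext', 2, 1), 'Z', INF)

mincut = max_flow(cap, 'S', 'Z')
print(mincut)                           # 11 = 2*alpha + 3*beta
```

For these values the computed min cut matches the file-size bound 2α + 3β derived for this example in Section 3.3.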


3.2 Previous Work

Regenerating code variations for the data-center-like topologies consisting of racks and nodes

are considered in [38–42]. In [38], [39] and [40], the authors distinguish between inter-rack

(inter-cluster) and intra-rack (intra-cluster) bandwidth costs. Further, the works [38] and [39]

permit pooling of intra-rack helper data to decrease inter-rack bandwidth. Also, all three works

allow taking help from host-rack nodes during repair. Unlike our model, for data collection,

all three works simply require file decodability from any set of k nodes irrespective of the

racks (clusters) to which they belong. In other words, the notion of clustering applies only

to repair, and not data collection, and this is a major difference with respect to our model.

Thus, while these variations are suitable for modeling the node-rack topologies present within

a data center, they do not model the situation of erasure coding across data centers with the

availability requirement as considered in this work. The work in [41] is a variation of that in [40]

for a two-rack model, where the per-node storage capacities of the two racks differ. In [42], the

authors consider a two-layer storage setting like ours, consisting of several blocks (analogous to

clusters as considered in this work) of storage nodes. A different clustering approach is followed

for both data collection and node repair. For data collection, one accesses $k_c$ nodes each from any of $b_c$ blocks. Though [42] focuses on node repair, the model assumes possible unavailability

of the whole block where the failed node resides, and as such uses only nodes from other blocks

for repair. Further, unlike our model in this work, the authors do not differentiate between

inter-block and intra-block bandwidth costs. The framework of twin-codes introduced in [43]

is also related to our model and implicitly contains the notion of clustering. In [43] nodes are

divided into two sets. For data collection, one connects to any k nodes in the same set. Recovery

of a failed node in one set is accomplished by connecting to d nodes in the other set. However,

there is no distinction between intra-set and inter-set bandwidth costs, and this becomes the

main difference with our model.

Several works [44–49] study variations of RCs in varied settings, with different combinations

of node capacities, link costs, and amount of data-download-per-node. The main difference

between our model and these works is that none of them explicitly considers clustering of nodes

while performing data collection. In [44], the authors introduce flexible regenerating codes for


a flat topology of storage nodes, where uniformity of download is enforced neither during data

collection nor during node repair. References [45], [46] consider systems where the storage and

repair-download costs are non-uniform across the various nodes. The authors of [45], as in [44],

allow a replacement node to download an arbitrary amount of data from each helper node. In

[47], nodes are divided into two sets, based on the cost incurred while these nodes aid during

repair. As noted in [41], the repair model of [47] is different from a clustered network, where

the repair cost incurred by a specific helper node depends on which cluster the replacement

node belongs to. The works of [48] and [49] focus on minimizing regeneration time rather

regeneration bandwidth in systems with non-uniform point-to-point link capacities. Essentially,

each helper node is expected to find the optimal path, perhaps via other intermediate nodes,

to the replacement node such that the various link capacities are used in a way to transfer all

the helper data needed for repair in the shortest possible time. It is interesting to note that both of these works permit pooling of data at an intermediate node, which gathers and processes

any relayed data with its own helper data. Recall that our model (and the one in [38]) also

considers pooling of data within a remote helper cluster, before passing on to the target cluster.

3.3 File Size Bound

In this section, we derive a bound on the file size B for an arbitrary set of code parameters. We

further use this bound to characterize the storage overhead vs inter-cluster repair bandwidth

overhead trade-off.

Theorem 3.3.1 (GRC Capacity). The file size B of a GRC with parameters {(n, k, d)(α, β)(m, ℓ)} under the FR regime is upper bounded by

$$B \leq B^* \triangleq \ell k\alpha + (m-\ell)\sum_{i=0}^{k-1}\min\{\alpha,\ (d-i)^+\beta\}. \qquad (3.1)$$

Further, if there is an upper bound on the number of repairs that occur for the duration of

operation of the system, the above bound is sharp, i.e., B∗ gives the functional repair storage

capacity of the system.
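For quick numerical experiments, the bound (3.1) translates directly into a few lines of Python. This is our own helper, using the thesis parameter names; the numeric values are arbitrary:

```python
def grc_capacity(k, d, m, l, alpha, beta):
    """B* from (3.1): l*k*alpha + (m - l) * sum_{i=0}^{k-1} min(alpha, (d-i)^+ beta)."""
    return l * k * alpha + (m - l) * sum(
        min(alpha, max(d - i, 0) * beta) for i in range(k)
    )

# the classical setup (m = 1, l = 0) recovers the bound of [1]
print(grc_capacity(k=2, d=2, m=1, l=0, alpha=4, beta=1))   # 3
# the running example (n = 3, k = 2, d = 2)(m = 2, l = 1)
print(grc_capacity(k=2, d=2, m=2, l=1, alpha=4, beta=1))   # 11 = 2*alpha + 3*beta
```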


3.3.1 Proof of the File Size Bound

The proof consists of two parts: the upper bound and its achievability. To establish the upper bound on the file size, it is enough to exhibit a cut separating the source from the sink in an IFG, for a specific sequence of failures and repairs, whose value equals the desired upper bound. To prove achievability of the bound, we shall show that, for any valid

IFG, independent of the specific sequence of failures and repairs, B∗ is indeed a lower bound on

the minimum possible value of any S − Z cut, and, thus, B∗ symbols can always be multicast

to the data collectors.

Upper Bound

We begin with the proof of the upper bound. We consider a sequence of $k(m-\ell)$ failures and repairs, as follows: physical nodes $X_{i,\ell+1}, X_{i,\ell+2}, \ldots, X_{i,m}$ fail in this order in cluster $i = 1$, then in cluster $i = 2$, and so on, until cluster $i = k$. In the IFG this corresponds to the sequence of failures of nodes $X_{1,\ell+1}(0), X_{1,\ell+2}(1), \ldots, X_{1,m}(m-\ell-1), X_{2,\ell+1}(0), \ldots, X_{2,m}(m-\ell-1), \ldots, X_{k,m}(m-\ell-1)$, in the respective order. The replacement node $X_{i,\ell+t}(t)$ for $X_{i,\ell+t}(t-1)$, $1 \leq t \leq m-\ell$, draws local helper data from $X_{i,1}(t-1), X_{i,2}(t-1), \ldots, X_{i,\ell}(t-1)$, and remote helper data from the clusters $\mathcal{X}_1(m-\ell), \ldots, \mathcal{X}_{i-1}(m-\ell)$ and from some set of $d - \min\{i-1, d\} = (d-i+1)^+$ other active clusters in the IFG. An example is shown in Figure 3-2 for the same set of system parameters as those used in Figure 3-1.

Let the data collector $Z$ connect to clusters $\mathcal{X}_1(m-\ell), \ldots, \mathcal{X}_k(m-\ell)$. Consider the $S - Z$ cut consisting of the following edges of the IFG:

• $\{(X^{\mathrm{in}}_{i,j}(0) \to X^{\mathrm{out}}_{i,j}(0)),\ i \in [k],\ j \in [\ell]\}$. The total capacity of these edges is $k\ell\alpha$.

• For each $i \in [k]$, $t \in [m-\ell]$, either the set of edges $\{(X^{\mathrm{ext}}_{i'}(0) \to X^{\mathrm{in}}_{i,\ell+t}(t)),\ i' \in \{\text{remote helper cluster indices for the replacement node } X^{\mathrm{in}}_{i,\ell+t}(t)\} \setminus [\min\{i-1, d\}]\}$, or the edge $(X^{\mathrm{in}}_{i,\ell+t}(t) \to X^{\mathrm{out}}_{i,\ell+t}(t))$. Between the two possibilities, we pick the one which has smaller capacity. In this case, the total capacity of this part of the cut is given by $\sum_{i=1}^{k}\sum_{j=\ell+1}^{m}\min\{\alpha,\ (d - \min\{i-1, d\})\beta\} = (m-\ell)\sum_{i=1}^{k}\min\{\alpha,\ (d-i+1)^+\beta\}$.

The value of the cut is given by $k\ell\alpha + (m-\ell)\sum_{i=1}^{k}\min\{\alpha,\ (d-i+1)^+\beta\} = B^*$, which proves our upper bound. In the example in Figure 3-2 for (n = 3, k = 2, d = 2)(m = 2, ℓ = 1), first,

Figure 3-2: An example of the information flow graph used in the cut-set-based upper bound for the file size. In this figure, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1). We also indicate a possible choice of S − Z cut that results in the desired upper bound.

$m - \ell = 1$ node fails in cluster 1 and downloads helper data from clusters 2 and 3; second, a node fails in cluster 2 and downloads helper data from clusters 1 and 3. The data collector connects to clusters 1 and 2. A minimal cut for $2\beta \leq \alpha$ is shown in the figure and has value $2\alpha + 3\beta = B^*$.

Achievability

We next show that for any valid IFG (independent of the specific sequence of failures and

repairs), B∗ is indeed a lower bound on the minimum possible value of any S−Z cut. Consider

any $S - Z$ cut $(\mathcal{U}, \mathcal{V})$. Since node $Z$ connects to $k$ external nodes via links of infinite capacity,

we only consider cuts such that V has at least k external nodes corresponding to active clusters.

Next, we observe that the IFG is a directed acyclic graph, and, hence, there exists a topological

sorting of nodes of the graph such that an edge exists between two nodes A and A′ of the IFG

only if A appears before A′ in the sorting [50]. Further, we consider a topological sorting such

that all in-, out- and external nodes of the cluster Xi(τ) appear together in the sorted order,

∀i, τ .

Now, consider the sequence E of all the external nodes (which are part of both active and

inactive clusters) in V in their sorted order. Let Y1 denote the first node in this sequence.


Figure 3-3: An example of how any S − Z cut in the IFG affects nodes in $\mathcal{F}_i$. In the example, we assume m = 4. With respect to the description in the text, $a_i = 2$. Further, the node $X_{i,4}(t_{i,4})$ is a replacement node in the IFG.

Without loss of generality let $Y_1 \in \mathcal{F}_1$. Next, consider the subsequence of $\mathcal{E}$ which is obtained after excluding all the external nodes in $\mathcal{F}_1$ from $\mathcal{E}$. Let $Y_2$ denote the first external node in this subsequence. We continue in this manner until we find the first $k$ external nodes $\{Y_1, Y_2, \ldots, Y_k\}$ in $\mathcal{E}$, such that each of the $k$ nodes corresponds to a distinct physical cluster. Once again, without loss of generality, we assume that $Y_i \in \mathcal{F}_i$, $2 \leq i \leq k$. Let us assume that $Y_i = X^{\mathrm{ext}}_i(t_i)$, for some $t_i$. Now, consider the $m$ out-nodes $X^{\mathrm{out}}_{i,1}(t_i), \ldots, X^{\mathrm{out}}_{i,m}(t_i)$ that connect to $X^{\mathrm{ext}}_i(t_i)$. Among these $m$ out-nodes, let $a_i$, $0 \leq a_i \leq m$, denote the number of out-nodes that appear in $\mathcal{U}$. Without loss of generality, let these be the nodes $X^{\mathrm{out}}_{i,1}(t_i), X^{\mathrm{out}}_{i,2}(t_i), \ldots, X^{\mathrm{out}}_{i,a_i}(t_i)$. Next, corresponding to the out-node $X^{\mathrm{out}}_{i,j}(t_i)$, $a_i + 1 \leq j \leq m$, consider its past versions $\{X^{\mathrm{out}}_{i,j}(t),\ t < t_i\}$ in the IFG, and let $X^{\mathrm{out}}_{i,j}(t_{i,j})$, for some $t_{i,j} \leq t_i$, denote the first sorted node that appears in $\mathcal{V}$. Without loss of generality, let us also assume that the nodes $\{X^{\mathrm{out}}_{i,j}(t_{i,j}),\ a_i + 1 \leq j \leq m\}$ are sorted in the order $X^{\mathrm{out}}_{i,a_i+1}(t_{i,a_i+1}), X^{\mathrm{out}}_{i,a_i+2}(t_{i,a_i+2}), \ldots, X^{\mathrm{out}}_{i,m}(t_{i,m})$. An illustration is provided in Figure 3-3.

To obtain a lower bound on the value of the S−Z cut, we make the following observations:


• The $a_i$ edges $\{(X^{\mathrm{out}}_{i,j}(t_i) \to X^{\mathrm{ext}}_i(t_i)),\ 1 \leq j \leq a_i\}$ are part of the cut. These contribute a total value of $a_i\alpha$.

• For any node $X^{\mathrm{out}}_{i,j}(t_{i,j})$, $a_i + 1 \leq j \leq m$, if the corresponding in-node $X^{\mathrm{in}}_{i,j}(t_{i,j})$ belongs to $\mathcal{U}$, then the edge $(X^{\mathrm{in}}_{i,j}(t_{i,j}) \to X^{\mathrm{out}}_{i,j}(t_{i,j}))$ appears in the cut, and contributes a value of $\alpha$ to the cut. Now, consider the case when the in-node $X^{\mathrm{in}}_{i,j}(t_{i,j})$ belongs to $\mathcal{V}$. In this case, consider the following two sub-cases:

– The node $X_{i,j}(t_{i,j})$ is not a replacement node: This means that either the edge $(X^{\mathrm{out}}_{i,j}(t_{i,j}-1) \to X^{\mathrm{in}}_{i,j}(t_{i,j}))$ appears in the cut, if $t_{i,j} > 0$, or the edge $(S \to X^{\mathrm{in}}_{i,j}(t_{i,j}))$ appears in the cut, if $t_{i,j} = 0$. In either case, the contribution to the overall value of the cut is at least $\alpha$.

– The node $X_{i,j}(t_{i,j})$ is a replacement node of $X_{i,j}(t_{i,j}-1)$: We know that $\ell$ local helper nodes and $d$ external nodes are involved in the repair. It is straightforward to see that out of the $\ell$ local helper nodes, at most $(j-1)$ belong to $\mathcal{V}$. To see this, note that the potential candidates for the local helper nodes that appear in $\mathcal{V}$ correspond to the physical nodes¹ $X_{i,1}, X_{i,2}, \ldots, X_{i,j-1}$. The version of the physical node $X_{i,j'}$, $j' > j$, if it aids in the repair process, appears in $\mathcal{U}$ because of our definition of $X_{i,j'}(t_{i,j'})$. Next, note that out of the $d$ external nodes, at most $(i-1)$ belong to $\mathcal{V}$. In this case, the contribution to the value of the cut, due to the edges that aid in repair, is lower bounded by $(\ell - j + 1)^+\alpha + (d - (i-1))^+\beta$.

¹It may be noted that we count the physical nodes $X_{i,1}, \ldots, X_{i,a_i}$ among the possible set of local helpers, although we assume that $X_{i,j}(t_i)$, $1 \leq j \leq a_i$, appears in $\mathcal{U}$. This is because we cannot discount the possibility that $X_{i,j}(t_{i,j'} - 1)$ appears in $\mathcal{V}$, for $j \leq a_i$, $j' > a_i$.


Figure 3-4: Trade-off between storage overhead nmα/B and inter-cluster repair bandwidth overhead dβ/α, for an (n = 5, k = 4, d = 4) clustered storage system, with ℓ = m − 1. (Curve annotations in the figure: MSR, MBR, classical trade-off, more nodes per cluster.)

Based on the observations above, the value of the cut is lower bounded by

$$\begin{aligned}
\mathrm{mincut}(S-Z) &\geq \sum_{i=1}^{k}\Big[a_i\alpha + \sum_{j=a_i+1}^{m}\min\big(\alpha,\ (\ell-j+1)^+\alpha + (d-(i-1))^+\beta\big)\Big]\\
&= \sum_{i=1}^{k}\Big[a_i\alpha + \sum_{j=a_i+1}^{\ell}\alpha + \sum_{j=\max(\ell,a_i)+1}^{m}\min\big(\alpha,\ (d-(i-1))^+\beta\big)\Big]\\
&= \sum_{i=1}^{k}\Big[\max(a_i,\ell)\,\alpha + \big(m-\max(a_i,\ell)\big)\min\big(\alpha,\ (d-(i-1))^+\beta\big)\Big]\\
&\geq \ell k\alpha + (m-\ell)\sum_{i=1}^{k}\min\big(\alpha,\ (d-(i-1))^+\beta\big) = B^*,
\end{aligned}$$

for any $a_i$, $0 \leq a_i \leq m$, where the last inequality holds since $x\alpha + (m-x)\min(\alpha,\cdot)$ is non-decreasing in $x$ and $\max(a_i,\ell) \geq \ell$. This completes the proof of the achievability.

3.3.2 Storage vs Inter-Cluster Bandwidth Trade-off

For fixed values of B = B∗, n, k, d > 0, ℓ, m, (3.1) gives a normalized FR trade-off (see Figure 3-4) between the storage overhead nmα/B (storage used per source-file symbol) and the inter-cluster repair bandwidth overhead dβ/α (inter-cluster repair bandwidth per repaired symbol). For any m, when ℓ = 0, the trade-off is exactly the same as that of the classical regenerating codes [1]. When ℓ > 0 (which implies m > 1), the trade-off is strictly better than that of the classical setup.
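These normalized quantities can be tabulated directly. The helper below re-implements (3.1) and evaluates both setups at the operating point dβ = α; the (n = 5, k = 4, d = 4) parameters match Figure 3-4, while the remaining numerics are our own:

```python
# Compare the normalized FR trade-off of a GRC against the classical setup
# at the operating point d*beta = alpha (our own illustrative numerics).
def grc_capacity(k, d, m, l, alpha, beta):
    # B* from (3.1)
    return l * k * alpha + (m - l) * sum(
        min(alpha, max(d - i, 0) * beta) for i in range(k)
    )

def overheads(n, k, d, m, l, alpha, beta):
    B = grc_capacity(k, d, m, l, alpha, beta)
    return n * m * alpha / B, d * beta / alpha   # storage, bandwidth overheads

alpha = 4.0
beta = alpha / 4                                  # d*beta = alpha with d = 4
classical = overheads(5, 4, 4, m=1, l=0, alpha=alpha, beta=beta)
clustered = overheads(5, 4, 4, m=2, l=1, alpha=alpha, beta=beta)
print(classical)   # (2.0, 1.0)
print(clustered)   # lower storage overhead at the same bandwidth overhead
```

At the same bandwidth overhead dβ/α = 1, the clustered code with ℓ = 1 stores the file with strictly less storage overhead than the classical code, as the trade-off predicts.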

We shall call the trade-off point dβ = α the MBR operating point. An optimal code $\mathcal{C}_m$ at the MBR operating point will be referred to as an MBR code. For a locally non-redundant GRC, the MBR point identifies the minimal amount of inter-cluster bandwidth required for repair, regardless of the number of local helper nodes. At the MSR operating point, the maximum file size per (3.1) is not bandwidth-constrained, i.e., B = ℓkα + (m − ℓ)d′α, and the code has the lowest possible inter-cluster repair bandwidth, defined by ((d − k)+ + 1)β = α.

Note that the GRC trade-off admits points with inter-cluster repair bandwidth overhead dβ/α below 1. This requires the GRC to be locally redundant. Specifically, the point with zero inter-cluster bandwidth implies having a parity-check node in each cluster. In contrast, all trade-off points on the MBR line and above can be achieved with locally non-redundant codes, with significantly lower storage overhead than locally redundant codes.

3.4 Code Constructions

In this section, we describe optimal GRC code constructions. Two constructions are presented;

the first one is an instance of an exact repair code, and results in optimal codes at the MSR

and MBR points under the setting of generalized regenerating codes; the second construction

is a functional-repair regenerating code. Both codes can withstand any number of repairs for

the duration of operation of the system. The exact repair code withstands any number of

repairs by definition, since after each repair the data on all nodes is the same as at the start

of system operation. This logic does not hold for functional repair codes, because the repaired

node content is generally different from the original one. Network-coding based achievability

proofs for functional-repair work only if there is a known upper bound on the number of repairs

that occur over the lifetime of the system. Our functional repair code relies on the construction

in [36], which allows our code to operate for arbitrarily many repairs. For both constructions,

we rely on existing optimal classical regenerating codes that are linear. By a linear regenerating

code, we mean that both encoding and repair are performed via linear combinations of either

the input or the coded symbols, respectively. The first construction generates an optimal (n,

k, d)(α, β)(m, ℓ) code for any m, ℓ ≤ m− 1, 1 ≤ d ≤ n− 1, whenever an optimal (n, k, d′)(α, β)

classical exact repair linear regenerating code exists. Our functional repair code construction

is limited to the case ℓ = m− 1, d ≥ k.

For a linear (n, k, d′)(α, β) classical regenerating code that encodes a data file of B symbols, one can associate a generator matrix G of size B × nα. Without loss of generality, the first α columns of G generate the content of node 1, and so on. We say that two (n, k, d′)(α, β)

classical linear regenerating codes C1 and C2, having generator matrices G1 and G2 are identical

if G1 = G2.

3.4.1 Exact Repair Code Construction

We begin with a description of the code and then show its data collection and repair properties.

Construction 3.4.1 (Exact repair GRC). Let $\mathcal{C}_j$, $1 \leq j \leq \ell$, denote (n, k) MDS vector codes over $\mathbb{F}_q^\alpha$. The amount of data that can be encoded with these ℓ codes is ℓkα. Next, let $\mathcal{C}_j$, $\ell+1 \leq j \leq m$, denote $m-\ell$ identical (n, k, d′)(α, β) classical exact repair linear regenerating codes over $\mathbb{F}_q$, each having a file size $B' = \sum_{i=0}^{k-1}\min(\alpha, (d'-i)\beta)$. For encoding, we first divide the data file of size $B^* = \ell k\alpha + (m-\ell)B'$ into m stripes, such that the first ℓ have size kα, and the last m − ℓ have size B′. Stripe j, $1 \leq j \leq m$, is encoded by $\mathcal{C}_j$ to generate the coded symbols $c_j = [c_{1,j}, c_{2,j}, \ldots, c_{n\alpha,j}]$. Next, consider an $m \times m$ invertible matrix A over $\mathbb{F}_q$ such that the first ℓ rows of A generate an (m, ℓ) MDS code over $\mathbb{F}_q$. Let the matrix A be decomposed as

$$A_{m\times m} = \begin{bmatrix} E_{\ell\times m} \\ F_{(m-\ell)\times m} \end{bmatrix}. \qquad (3.2)$$

Thus, the matrix $E_{\ell\times m}$ generates an (m, ℓ) MDS code. The coded data stored in the various clusters is generated as follows:

$$[c'^T_1\ c'^T_2\ \cdots\ c'^T_m] = [c^T_1\ c^T_2\ \cdots\ c^T_m]\,A_{m\times m}. \qquad (3.3)$$

The content of node j in cluster i is given by $[c'_{(i-1)\alpha+1,j},\ c'_{(i-1)\alpha+2,j},\ \ldots,\ c'_{i\alpha,j}]^T$, $1 \leq i \leq n$, $1 \leq j \leq m$. This completes the description of the construction. A pictorial overview of the description appears in Figure 3-5.
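The construction can be exercised end-to-end on a toy instance. All parameters below are our own illustrative choices: q = 7, (n = 3, k = 2, d′ = 2)(α = 1, β = 1)(m = 2, ℓ = 1). With α = β = 1 and d′ = k, an (n, k) MDS code doubles as a degenerate exact-repair classical regenerating code, repaired by decode-and-re-encode:

```python
# Toy instance of Construction 3.4.1 (illustrative parameters of our own).
q = 7
G = [[1, 1, 1],    # generator of the (3, 2) MDS code: column i is (1, i+1)
     [1, 2, 3]]

def mds_encode(u):                      # u = (u0, u1) -> 3 coded symbols
    return [(u[0] * G[0][i] + u[1] * G[1][i]) % q for i in range(3)]

def mds_decode(sym):                    # sym = {position: value}, any 2 entries
    (i, a), (j, b) = sorted(sym.items())
    det_inv = pow((G[1][j] - G[1][i]) % q, q - 2, q)
    u1 = ((b - a) * det_inv) % q
    u0 = (a - u1 * G[1][i]) % q
    return [u0, u1]

A = [[1, 1], [1, 2]]                    # invertible; row 1 = (2, 1) MDS generator
Ainv = [[2, 6], [6, 1]]                 # A^{-1} mod 7

data = [3, 1, 4, 5]                     # B* = l*k*alpha + (m-l)*B' = 4 symbols
c1 = mds_encode(data[0:2])              # stripe 1 -> MDS code C_1
c2 = mds_encode(data[2:4])              # stripe 2 -> regenerating code C_2
# (3.3): cluster i, node j stores c'_j[i], where [c'_1 c'_2] = [c_1 c_2] A
store = [[(c1[i] * A[0][j] + c2[i] * A[1][j]) % q for j in range(2)]
         for i in range(3)]

def collect(clusters):
    """Data collection from any k = 2 clusters: undo A, decode each stripe."""
    c1s, c2s = {}, {}
    for i in clusters:
        c1s[i] = (store[i][0] * Ainv[0][0] + store[i][1] * Ainv[1][0]) % q
        c2s[i] = (store[i][0] * Ainv[0][1] + store[i][1] * Ainv[1][1]) % q
    return mds_decode(c1s) + mds_decode(c2s)

assert collect([0, 1]) == data and collect([1, 2]) == data

# repair node j = 2 of cluster i = 1 (0-indexed: store[0][1])
local = store[0][0]                     # l = 1 local helper: the (3.10) symbol
# remote clusters recover their C_2 symbol (A is invertible) and regenerate
# c2[0]; here the correction coefficient f - f_hat = A[1][1] - A[1][0] = 1
helper = {i: (store[i][0] * Ainv[0][1] + store[i][1] * Ainv[1][1]) % q
          for i in (1, 2)}
c2_0 = mds_encode(mds_decode(helper))[0]
print((local + c2_0) % q == store[0][1])   # True: lost node restored
```

The final check retraces the two-phase repair described below: the local helper supplies the combined symbol, and the remote clusters supply the regenerating-code correction.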

We next prove that the code described in Construction 3.4.1 is an optimal exact repair generalized regenerating code, for any m, ℓ < m. The optimal code can be constructed whenever an optimal (n, k, d′)(α, β) exact repair linear regenerating code exists, having a file size $B' = \sum_{i=0}^{k-1}\min(\alpha, (d'-i)\beta)$.

Figure 3-5: Illustration of the exact repair code construction. We first stack ℓ MDS codes and (m − ℓ) classical regenerating codes, and then transform each row via the invertible matrix A. The first ℓ rows of the matrix A generate an (m, ℓ) MDS code.

It is clear that the code in Construction 3.4.1 has a file size B∗, where B∗ is as given in

Theorem 3.3.1. Further, the data collection property of the code is also straightforward to

check, and this essentially follows from the facts that 1) the matrix A is invertible, and 2)

each of the codes Ci, 1 ≤ i ≤ m is uniquely decodable given its coded data belonging to any k

clusters. To examine the repair properties of the code, let us rewrite (3.3) as follows:

$$\begin{aligned}
[c'^T_1\ c'^T_2\ \cdots\ c'^T_m] &= [c^T_1\ c^T_2\ \cdots\ c^T_m]\,A_{m\times m} && (3.4)\\
&= [C_{\mathrm{MDS}}\ \ C_{\mathrm{regen}}]\,A_{m\times m} && (3.5)\\
&= \begin{bmatrix} C^{(1)}_{\mathrm{MDS}} & C^{(1)}_{\mathrm{regen}} \\ C^{(2)}_{\mathrm{MDS}} & C^{(2)}_{\mathrm{regen}} \\ \vdots & \vdots \\ C^{(n)}_{\mathrm{MDS}} & C^{(n)}_{\mathrm{regen}} \end{bmatrix} A_{m\times m}, && (3.6)
\end{aligned}$$

where $C_{\mathrm{MDS}} = [c^T_1\ \cdots\ c^T_\ell]$ and $C_{\mathrm{regen}} = [c^T_{\ell+1}\ \cdots\ c^T_m]$. The matrices $C^{(i)}_{\mathrm{MDS}}$ and $C^{(i)}_{\mathrm{regen}}$, $1 \leq i \leq n$, denote rows $(i-1)\alpha+1, \ldots, i\alpha$ of $C_{\mathrm{MDS}}$ and $C_{\mathrm{regen}}$, respectively. Let us also expand the decomposition of the matrix $A$ in (3.2) further as follows:

$$\begin{aligned}
A_{m\times m} &= \begin{bmatrix} E_{\ell\times m} \\ F_{(m-\ell)\times m} \end{bmatrix} && (3.7)\\
&= \begin{bmatrix} e^T_1 & e^T_2 & \cdots & e^T_m \\ f^T_1 & f^T_2 & \cdots & f^T_m \end{bmatrix}, && (3.8)
\end{aligned}$$

where $e^T_j$ and $f^T_j$, $1 \leq j \leq m$, denote the $j$th columns of the matrices $E$ and $F$, respectively.

Based on (3.6) and (3.8), it can be seen that the content of node $j$ in cluster $i$ is given by

$$\big[C^{(i)}_{\mathrm{MDS}}\ \ C^{(i)}_{\mathrm{regen}}\big]\begin{bmatrix} e^T_j \\ f^T_j \end{bmatrix}. \qquad (3.9)$$

Given the notation above, without loss of generality, consider repairing node ℓ + 1 in cluster

1 with the help of 1) the first ℓ local nodes in cluster 1 and 2) clusters 2, . . . , d′ + 1. Let us

first examine the role of the ℓ local nodes in the repair process. Let E′ and F ′ denote the first

ℓ columns of E and F , respectively. By assumption, E generates an (m, ℓ) MDS code, and

hence the submatrix E′ is invertible. In this case, the content from the ℓ local nodes can be

put together to generate

$$\big[C^{(1)}_{\mathrm{MDS}}\ \ C^{(1)}_{\mathrm{regen}}\big]\begin{bmatrix} E' \\ F' \end{bmatrix}E'^{-1}e^T_{\ell+1} = \big[C^{(1)}_{\mathrm{MDS}}\ \ C^{(1)}_{\mathrm{regen}}\big]\begin{bmatrix} e^T_{\ell+1} \\ \hat{f}^T_{\ell+1} \end{bmatrix}, \qquad (3.10)$$

where $\hat{f}^T_{\ell+1} = F'E'^{-1}e^T_{\ell+1}$. Thus, the local helper nodes serve to recover the MDS-code component $C^{(1)}_{\mathrm{MDS}}e^T_{\ell+1}$. However, the regenerating-code component $C^{(1)}_{\mathrm{regen}}\hat{f}^T_{\ell+1}$ differs from the original $C^{(1)}_{\mathrm{regen}}f^T_{\ell+1}$.
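The identity (3.10) is plain linear algebra and can be checked numerically. The matrices E and F below are arbitrary examples of our own (m = 3, ℓ = 2), evaluated with exact rational arithmetic:

```python
# Check (3.10): stacking [E'; F'] and multiplying by E'^{-1} e_{l+1}
# reproduces e_{l+1} on the MDS rows and f_hat = F' E'^{-1} e_{l+1} below.
from fractions import Fraction as Fr

E = [[1, 1, 1],        # l x m = 2 x 3; any 2 columns independent
     [1, 2, 3]]
F = [[4, 5, 1]]        # (m - l) x m = 1 x 3

e3 = [Fr(E[0][2]), Fr(E[1][2])]               # column l+1 = 3 of E
a, b, c, d = Fr(E[0][0]), Fr(E[0][1]), Fr(E[1][0]), Fr(E[1][1])
det = a * d - b * c                           # E' = first l columns of E
Einv = [[d / det, -b / det], [-c / det, a / det]]
w = [Einv[0][0] * e3[0] + Einv[0][1] * e3[1], # w = E'^{-1} e_3
     Einv[1][0] * e3[0] + Einv[1][1] * e3[1]]

top = [E[0][0] * w[0] + E[0][1] * w[1],       # E' w -> should equal e_3
       E[1][0] * w[0] + E[1][1] * w[1]]
f_hat = F[0][0] * w[0] + F[0][1] * w[1]       # F' w -> the "incorrect" part

print(top == e3)          # True: MDS component recovered exactly
print(f_hat == F[0][2])   # False here: f_hat differs from the true f_3
```

This mirrors the text: the local helpers pin down the MDS component exactly, while the regenerating component comes out with the wrong coefficient f̂ instead of f, to be corrected by the remote helpers.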

Let us next examine the role of the d′ remote helper clusters. We know that the data stored in cluster $i$ is given by $\big[C^{(i)}_{\mathrm{MDS}}\ C^{(i)}_{\mathrm{regen}}\big]A$. Since the matrix $A$ is invertible, the vector $C^{(i)}_{\mathrm{regen}}$ can be recovered from this. Using the regenerating property of the classical RC codes $\mathcal{C}_{\ell+1}, \ldots, \mathcal{C}_m$, cluster $i$ computes from $C^{(i)}_{\mathrm{regen}}$ the helper symbols $H^{(i)} \in \mathbb{F}_q^{\beta\times(m-\ell)}$ for regenerating $C^{(1)}_{\mathrm{regen}}$, and sends out $H^{(i)}(f^T_{\ell+1} - \hat{f}^T_{\ell+1}) \in \mathbb{F}_q^{\beta}$. Since the classical RCs used are identical and linear, the replacement node can regenerate $C^{(1)}_{\mathrm{regen}}(f^T_{\ell+1} - \hat{f}^T_{\ell+1})$ using the helper data from the d′ remote clusters, and combine it with the local helper data (see (3.10)) to correct the regenerating-code components and restore the content of the lost node. A pictorial illustration of the repair process is shown in Figure 3-6.

Figure 3-6: An illustration of the node repair process for the exact repair generalized regenerating code obtained in Construction 3.4.1.

3.4.2 A Functional Repair Code for Arbitrary Number of Failures

In this section, we show the existence of optimal functional repair codes over a finite field that

can tolerate an arbitrary number of repairs for the duration of operation of the system. We

show the existence for any (n, k, d ≥ k)(α, β)(m, ℓ = m − 1). The code construction combines

m−1 MDS vector codes C1, . . . , Cm−1 with an (n, k, d)(α, β) FR code Cm for the classical setting.

The code Cm is a deterministic one that can tolerate an arbitrary number of repairs for the

duration of operation of the system. Reference [36] guarantees the existence of such a code over

Fq whenever q > q0, where q0 is entirely determined by the parameters (n, k, d)(α, β), and is

independent of the number of repairs performed over the lifetime of the code. By a deterministic

regenerating code, we mean that the regenerated data corresponding to a repair operation of a

given physical node is uniquely determined given the content of the helper nodes. As we shall see, the fact that the code is deterministic is important to ensure the data collection property of our

functional repair construction.

We first describe the code construction, along with the repair procedure, and then show the

optimality property of the code.

Construction 3.4.2 (Functional repair GRC). Let $\mathcal{C}_j$, $1 \leq j \leq m-1$, denote (n, k) MDS vector codes over $\mathbb{F}_q^\alpha$. The amount of data that can be encoded with these $m-1$ codes is $(m-1)k\alpha$. Next, let $\mathcal{C}_m$ denote an (n, k, d)(α, β) deterministic classical functional repair linear regenerating code as described above. The code $\mathcal{C}_m$ has a file size $B' = \sum_{i=0}^{k-1}\min(\alpha, (d-i)\beta)$. For encoding, we first divide the data file of size $B^* = \ell k\alpha + (m-\ell)B'$ into m stripes, such that the first $m-1$ have size kα, and the last one has size B′. Stripe j, $1 \leq j \leq m$, is encoded by $\mathcal{C}_j$ to generate the coded symbols $c_j = [c_{1,j}, c_{2,j}, \ldots, c_{n\alpha,j}]$. Node j, $1 \leq j \leq m-1$, in cluster i stores the vector $[c_{(i-1)\alpha+1,j},\ c_{(i-1)\alpha+2,j},\ \ldots,\ c_{i\alpha,j}]$. The content of node m in cluster i is given by $[c_{(i-1)\alpha+1,m} - \sum_{j=1}^{m-1}c_{(i-1)\alpha+1,j},\ c_{(i-1)\alpha+2,m} - \sum_{j=1}^{m-1}c_{(i-1)\alpha+2,j},\ \ldots,\ c_{i\alpha,m} - \sum_{j=1}^{m-1}c_{i\alpha,j}]$, i.e., the sum of the contents of all m nodes in cluster i is equal to $[c_{(i-1)\alpha+1,m},\ c_{(i-1)\alpha+2,m},\ \ldots,\ c_{i\alpha,m}]$. This completes the description of the initial layout of the coded data. Since the code is a functional repair code, the code description is not complete unless we specify the procedure for node repair as well. We do this next.

Node Repair: Let $y_{i,j}(t) \in \mathbb{F}_q^\alpha$ denote the content of node j in cluster i after the tth, $t \geq 0$, repair anywhere in the system. The quantities $\{y_{i,j}(0),\ 1 \leq i \leq n,\ 1 \leq j \leq m\}$ denote the initial content present in the system and are as described above. The repair procedure is such that the vector $[\sum_{j=1}^{m}y_{1,j}(t),\ \sum_{j=1}^{m}y_{2,j}(t),\ \ldots,\ \sum_{j=1}^{m}y_{n,j}(t)]$ remains a valid codeword of the functional repair regenerating code $\mathcal{C}_m$, for every $t \geq 0$ (to be proved below). Clearly, the above statement is true for t = 0. The repair procedure can be described recursively as follows. Let the tth repair be associated with node $j_f$ in cluster $i_f$. Each of the d remote helper clusters, say i, internally computes $\sum_{j=1}^{m}y_{i,j}(t-1)$, and passes the β symbols for repair of $\sum_{j=1}^{m}y_{i_f,j}(t-1)$. The replacement node, first of all, regenerates $\hat{y}_{i_f}(t-1)$, as a replacement to $\sum_{j=1}^{m}y_{i_f,j}(t-1)$, given the helper data from the d remote clusters. Next, since $\ell = m-1$, the replacement node gets access to the local helper data $\{y_{i_f,j}(t-1),\ 1 \leq j \leq m,\ j \neq j_f\}$. The content that is eventually stored in the replacement node is computed as follows:

$$y_{i_f,j_f}(t) = \hat{y}_{i_f}(t-1) - \sum_{\substack{j=1 \\ j \neq j_f}}^{m} y_{i_f,j}(t-1). \qquad (3.11)$$

For any other $(i, j) \neq (i_f, j_f)$, we assume that

$$y_{i,j}(t) = y_{i,j}(t-1). \qquad (3.12)$$

This completes the description of the repair process and the code construction.
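A small simulation of the repair rule (3.11)–(3.12) is easy to set up. Everything below is an illustrative assumption of ours: the field size, the (4, 2) MDS code standing in for $\mathcal{C}_m$ (repaired by decode-and-re-encode, hence deterministic), and the random failure pattern. Because the stand-in repairs exactly, node contents are in fact restored verbatim; a genuine functional-repair $\mathcal{C}_m$ would only preserve the codeword property asserted in the loop:

```python
# Toy simulation of (3.11)-(3.12): n=4 clusters, m=3 nodes, alpha=1, F_11.
import random

q = 11
Gm = [[1, 1, 1, 1], [1, 2, 3, 4]]       # (4, 2) MDS generator over F_11

def encode(u):
    return [(u[0] * Gm[0][i] + u[1] * Gm[1][i]) % q for i in range(4)]

def regenerate(pos, helpers):
    """Decode the C_m codeword from any 2 cluster sums, re-encode position pos."""
    (i, a), (j, b) = sorted(helpers.items())
    det_inv = pow((Gm[1][j] - Gm[1][i]) % q, q - 2, q)
    u1 = ((b - a) * det_inv) % q
    u0 = (a - u1 * Gm[1][i]) % q
    return encode([u0, u1])[pos]

random.seed(1)
cm = encode([7, 3])                     # the C_m codeword (the cluster sums)
# initial layout: nodes 0, 1 hold MDS stripes; node 2 makes each sum equal cm
y = [[random.randrange(q), random.randrange(q), 0] for _ in range(4)]
for i in range(4):
    y[i][2] = (cm[i] - y[i][0] - y[i][1]) % q

for _ in range(20):                     # 20 random single-node failures
    i_f, j_f = random.randrange(4), random.randrange(3)
    helpers = {i: sum(y[i]) % q for i in range(4) if i != i_f}
    helpers = dict(list(helpers.items())[:2])       # any d = 2 remote clusters
    y_hat = regenerate(i_f, helpers)                # replaces the cluster sum
    y[i_f][j_f] = (y_hat - sum(y[i_f][j] for j in range(3) if j != j_f)) % q
    # invariant: the vector of cluster sums is still the C_m codeword
    assert [sum(row) % q for row in y] == cm

print("cluster sums stayed a valid C_m codeword across all repairs")
```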

Next, we argue optimality of the (n, k, d)(α, β)(m, ℓ = m − 1) code described above. Specifically, we show that the code retains the functional repair and data collection properties after

every repair. We assume that the data collector is aware of the entire repair history of the

system. By this we mean that the data collector is aware of 1) the exact sequence of t failures

and repairs that has happened in the system, and 2) the indices of the remote helper clusters

that aided in each of the t repairs.

It is clear that the code in Construction 3.4.2 has a file size $B^*$, as given by Theorem 3.3.1. To show that the code retains the functional repair property, it is sufficient to show that the vector $[\sum_{j=1}^{m}y_{1,j}(t),\ \sum_{j=1}^{m}y_{2,j}(t),\ \ldots,\ \sum_{j=1}^{m}y_{n,j}(t)]$ remains a valid codeword of the FR regenerating code $\mathcal{C}_m$, for every $t \geq 0$. We do this inductively. Clearly, the statement is true for $t = 0$. Let us next assume that the statement is true for $t = t' \geq 0$, and show its validity for $t = t'+1$. Assume that the $(t'+1)$th repair is associated with node $j_f$ in cluster $i_f$. The relation between the content of the various nodes before and after the $(t'+1)$th repair is obtained via (3.11) and (3.12). In this case, the quantities $\{\sum_{j=1}^{m}y_{i,j}(t'+1),\ 1 \leq i \leq n\}$ are given by

$$\sum_{j=1}^{m}y_{i_f,j}(t'+1) \overset{(a)}{=} \hat{y}_{i_f}(t'), \qquad (3.13)$$

$$\sum_{j=1}^{m}y_{i,j}(t'+1) = \sum_{j=1}^{m}y_{i,j}(t'),\quad 1 \leq i \leq n,\ i \neq i_f, \qquad (3.14)$$

where (a) follows from (3.11) and (3.12).

Now, recall that $\hat{y}_{i_f}(t')$ is the replacement of $\sum_{j=1}^{m}y_{i_f,j}(t')$, which is regenerated using the helper data generated from d elements of the set $\{\sum_{j=1}^{m}y_{i,j}(t'),\ 1 \leq i \leq n,\ i \neq i_f\}$. Combining


with the induction hypothesis for t = t′, it follows that the induction statement holds good for

t = t′ + 1 as well. This completes the proof of functional repair property of the code.

Let us next see how data collection is accomplished after t, t ≥ 0 repairs in the system.

Without loss of generality assume that a data collector connects to clusters 1, 2, . . . , k, and

accesses {yi,j(t), 1 ≤ i ≤ k, 1 ≤ j ≤ m}. The data collector as a first step computes the vector

$\left[\sum_{j=1}^{m} y_{1,j}(t), \sum_{j=1}^{m} y_{2,j}(t), \ldots, \sum_{j=1}^{m} y_{k,j}(t)\right]$, and uses this to decode the data corresponding

to the code Cm. Now, recall the fact that the code Cm is deterministic, and also our assumption

that the data collector is aware of the entire repair-history of the system. In this case, having

decoded Cm, using (3.11) and (3.12), the data collector can iteratively recover {yi,j(t′), 1 ≤ i ≤ k,

1 ≤ j ≤ m}, for t ≥ t′ ≥ 0 by starting at t′ = t and proceeding backwards until the content

at t′ = 0 is recovered (essentially, we are rewinding the system by eliminating the effects of

all the repairs, starting from the last one and proceeding backwards in time). Finally, from

Construction 3.4.2, we know that the content {yi,j(0), 1 ≤ i ≤ k, 1 ≤ j ≤ m − 1} is the coded data corresponding to the m − 1 (n, k) MDS codes C1, . . . , Cm−1, and thus these codes can also be decoded. This completes the proof of data collection, and also of the optimality of the construction.
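The rewinding step of the data-collection argument can be illustrated with a small simulation (not from the thesis): it applies rule (3.11) for a few repairs, using a placeholder deterministic "regenerated" cluster value standing in for the FR code Cm, records the repair history known to the data collector, and then undoes the repairs in reverse order, recovering the initial node contents exactly.

```python
import random

P = 257  # illustrative prime field size
m, n = 3, 4

random.seed(7)
# Initial node contents; S[i] tracks the cluster value (sum of node contents),
# which the data collector can recompute after decoding C_m.
state = [[random.randrange(P) for _ in range(m)] for _ in range(n)]
initial = [row[:] for row in state]
S = [sum(row) % P for row in state]

history = []  # one (i_f, j_f, pre-repair cluster value) entry per repair
for t in range(5):
    i_f, j_f = random.randrange(n), random.randrange(m)
    # Under functional repair the regenerated value may differ from S[i_f];
    # we model it as a deterministic placeholder known to the collector.
    regen = (S[i_f] + t) % P
    others = sum(y for j, y in enumerate(state[i_f]) if j != j_f) % P
    state[i_f][j_f] = (regen - others) % P   # rule (3.11)
    history.append((i_f, j_f, S[i_f]))       # remember the OLD cluster value
    S[i_f] = regen                            # cluster sums form the new codeword

# Rewind: starting from the final state, undo the repairs in reverse order.
for i_f, j_f, old_sum in reversed(history):
    others = sum(y for j, y in enumerate(state[i_f]) if j != j_f) % P
    state[i_f][j_f] = (old_sum - others) % P  # invert (3.11)

assert state == initial  # the effects of all repairs are eliminated
```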


Chapter 4

GRC for Repair of Multiple Failures

In this chapter, we extend our GRC model to scenarios of multiple node failures. We consider the problem of recovery from t ∈ [m] node failures that occur in one of the n clusters. While a single-node failure is the most common failure event, correlated failures of nodes within a data center are an important issue reported in practice [51], and this motivates our failure model. The t newcomer nodes are added to the same cluster as replacements for the failed ones. For restoring the content of the t new nodes, as before, we download external helper data from any set of d other clusters, β symbols each, and local helper data from any set of ℓ ≤ m − t surviving nodes in the failed cluster. We also restrict ourselves to the case d ≥ k, even though analysis for the case 0 ≤ d ≤ k − 1 is perfectly feasible.

A code satisfying the above model requirements shall be called a multi-node repair generalized regenerating code (MRGRC) C with parameters {(n, k, d), (α, β), (m, ℓ, t)}. We shall also use the auxiliary notation m − ℓ = at + b, a ≥ 1, 0 ≤ b ≤ t − 1.

4.1 Previous Work

The problem of multiple-node repair for classical RCs has been studied under the frameworks

of cooperative repair [16, 17] and centralized repair [18, 19]. In cooperative repair, each of the t replacement nodes first individually contacts its respective set of d helper nodes, and the replacement nodes then communicate among themselves before restoring the new content. In centralized repair, a

centralized compute node downloads data from some subset of d nodes and generates the data


for all t replacement nodes. Our repair model can be considered as a centralized repair model

for clustered storage systems.

The regenerating code variations for cluster-like topologies listed in section 3.2 all focus on single-node repair.

Repairing t ≥ 1 failures from the same cluster has been partially studied in [52] for the special case of ℓ = m − t, for which the authors show a file-size upper bound. However, as

we show later, the case 0 ≤ ℓ < m − t, t > 1 offers several surprising results which cannot be

inferred from analysis of the case ℓ = m− t, t > 1.

4.2 Exact Repair

4.2.1 ER Code Construction

A simple construction of exact repair MRGRCs for any t > 1 can be directly obtained from

constructions for the case t = 1, whenever t|β. In order to construct an exact repair MRGRC

C with parameters (n, k, d)(α, β)(m, ℓ, t), t|β, we start with an ER GRC C′ from section 3.4.1

with parameters {(n, k, d)(α, β′ = β/t)(m, ℓ, t′ = 1)}, which, as we previously showed, exists whenever a classical ER (n, k, d)(α, β′) RC exists, with file size $\sum_{i=0}^{k-1} \min(\alpha, (d-i)\beta')$. The

code C′ can be used directly as the code C, if we assume that the repair of any group of t nodes in C happens one node at a time via the repair procedure of C′. Also, we use the same set of local and external helpers for the repair of all t failed nodes. The inter-cluster bandwidth for the repair of the entire group, per external helper, amounts to β = tβ′. The file size B that we obtain is

given by

$$B = B' = \ell k\alpha + (m-\ell)\sum_{i=0}^{k-1} \min(\alpha, (d-i)\beta') = \ell k\alpha + (m-\ell)\sum_{i=0}^{k-1} \min\left(\alpha, \frac{(d-i)\beta}{t}\right). \qquad (4.1)$$

As we show next in this section, the file size B achieved by the construction is optimal, as

it reaches the upper bound for {(n, k, d), (α, β), (m, ℓ, t)} ER MRGRC given by the following

theorem.


Theorem 4.2.1 (ER MRGRC File Size Bound). The file size B of GRC with parameters

{(n, k, d), (α, β), (m, ℓ, t)} under ER regime is upper bounded by

$$B \leq B^*_E = \ell k\alpha + (m-\ell)\sum_{i=0}^{k-1} \min\left(\alpha, \frac{(d-i)\beta}{t}\right). \qquad (4.2)$$

The bound is achievable at the minimum storage-overhead (MSR) and the minimum inter-cluster repair-bandwidth-overhead (MBR) points, characterized by B = mkα and tα = dβ, respectively.
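As an illustrative numerical check (the helper below is not from the thesis, and assumes real-valued α, β), the bound (4.2) is straightforward to evaluate; the example point satisfies tα = dβ, i.e., the MBR condition above.

```python
def file_size_ER(n, k, d, alpha, beta, m, ell, t):
    """Exact-repair MRGRC file-size bound B*_E from (4.2)."""
    return ell * k * alpha + (m - ell) * sum(
        min(alpha, (d - i) * beta / t) for i in range(k))

# MBR-style point: t*alpha = d*beta (alpha = 1, beta = t/d = 0.5)
B = file_size_ER(n=5, k=4, d=4, alpha=1, beta=0.5, m=3, ell=0, t=2)
# Terms min(1, (4-i)/4) for i = 0..3 are 1, 0.75, 0.5, 0.25, so B = 3 * 2.5 = 7.5
```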

4.2.2 File Size Bound Proof

In this section, we present the proof of the file-size upper bound in (4.2) for exact repair codes.

We assume the code to be deterministic, i.e. the helper data is uniquely determined given the

indices of the t failed nodes, the local helper nodes, and the helper clusters. We begin with useful

notation. Let F denote the random variable corresponding to the data file that gets stored. We

assume F to be uniformly distributed over $\mathbb{F}_q^B$. Let $Y_{i,j} \in \mathbb{F}_q^{\alpha}$, $1 \leq i \leq n$, $1 \leq j \leq m$, denote the content stored in node j of cluster i. The $Y_{i,j}$ are also random variables, which depend on F. We also use the following notation: $Y_{i,S} = \{Y_{i,j}, \forall j \in S \subseteq [m]\}$, $Y_i = Y_{i,[m]}$, $\mathbf{Y}_S = \cup_{i \in S \subseteq [n]} Y_i$.

Since the file should be completely decodable from any set of k clusters, we have the following

entropy condition:

$$H(F \mid \mathbf{Y}_S) = 0, \quad \forall S \subset [n], |S| = k. \qquad (4.3)$$

Next, consider the repair of t nodes indexed by $R_i$ in cluster i. Let $H \subset [n] - i$, $|H| = d$, and $L \subseteq [m] - R_i$, $|L| = \ell$, respectively denote the indices of the helper clusters and local nodes that aid in the repair process. Let $Z^{H,L}_{i',R_i}$ denote the external helper data passed by cluster $i'$. The property

of exact repair is jointly characterized by the following set of conditions:

$$H\left(Z^{H,L}_{i',R_i} \,\middle|\, Y_{i'}\right) = 0 \qquad (4.4)$$

$$H\left(Z^{H,L}_{i',R_i}\right) \leq \beta \qquad (4.5)$$

$$H\left(Y_{i,R_i} \,\middle|\, \{Z^{H,L}_{i',R_i}, i' \in H\}, Y_{i,L}\right) = 0, \quad \forall H \subset [n] - \{i\}, |H| = d, \ \forall L \subset [m] - R_i, |L| = \ell. \qquad (4.6)$$


Our proof technique for the file-size bound presented here, though it has some similarity with the information-theoretic techniques in works like [19], [53], differs in an important way. The proofs in these other works rely on the chain rule of entropy, and so does our proof; however, here, we demand that the chain be expanded in a specific order. The following lemma is used to determine this order. The lemma is required only when b > 0; when b = 0, the bound proof does not need it.

Lemma 4.2.2 (MRGRC Chain Order). Let b ≥ 1, i.e., t ∤ (m − ℓ). Consider any $S_i \subset [n]$, $|S_i| = i$, $1 \leq i \leq k-1$. Then, for any $i' \in [n] - S_i$, there exists a permutation $\sigma_{i',S_i}$ of $\{\ell+1, \ell+2, \ldots, m\}$ such that

$$H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, \mathbf{Y}_{S_i}, Y_{i',[1,\ell]}, \{Y_{i',\sigma_{i',S_i}(j)}\}_{j \in [\ell+1, j'-1]}\right) \leq \min\left(\alpha, \frac{(d-i)\beta}{t}\right), \qquad (4.7)$$

for all $j' \in \{m-b+1, m-b+2, \ldots, m\}$.

The proof of the lemma is given in Appendix 9.1.

Proof of Exact Repair Upper Bound (4.2). We have

$$B = H(F) \leq H(\mathbf{Y}_{[1,k]}) = \sum_{i'=1}^{k} H(Y_{i'} \mid \mathbf{Y}_{[1,i'-1]})$$
$$= \sum_{i'=1}^{k} \left( H(Y_{i',[1,\ell]} \mid \mathbf{Y}_{[1,i'-1]}) + H(Y_{i',[\ell+1,m]} \mid Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}) \right)$$
$$\leq \ell k\alpha + \sum_{i'=1}^{k} H(Y_{i',[\ell+1,m]} \mid Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}). \qquad (4.8)$$

Now, if we let $\sigma = \sigma_{i',[1,i'-1]}$ be the permutation obtained from Lemma 4.2.2, then we expand the term $H(Y_{i',[\ell+1,m]} \mid Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]})$ in (4.8) using the order determined by the permutation σ, as follows:

$$H(Y_{i',[\ell+1,m]} \mid Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}) = H(\{Y_{i',\sigma(j')}, j' \in [\ell+1,m]\} \mid Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]})$$
$$\leq \sum_{u=0}^{a-1} H(\{Y_{i',\sigma(\ell+ut+v)}, v \in [t]\} \mid Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}) \qquad (4.9)$$
$$+ \sum_{j'=m-b+1}^{m} H\left(Y_{i',\sigma(j')} \,\middle|\, Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}, \{Y_{i',\sigma(j)}\}_{j \in [\ell+1,j'-1]}\right).$$


Figure 4-1: An illustration of the information flow graph used in the cut-set based upper bound for the file size under functional repair. We assume (n = 3, k = 2, d = 2)(m = 3, ℓ = 0, t = 2). Only a subset of nodes is named to avoid clutter. Two batches, each of t = 2 nodes, fail and get repaired, first in cluster 1 and then in cluster 3. We also indicate a possible choice of the S − Z cut that results in the desired upper bound. We fail nodes in cluster 3 instead of cluster 2 only to make the figure compact.

Using (4.6), each term under the first summation in (4.9) is upper bounded by min(tα, (d − i′ + 1)β), while each term under the second summation in (4.9) is upper bounded using Lemma 4.2.2. Thus, we get that

$$H(Y_{i',[\ell+1,m]} \mid Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}) \leq a \min(t\alpha, (d-i'+1)\beta) + b \min\left(\alpha, \frac{(d-i'+1)\beta}{t}\right)$$
$$= (m-\ell)\min\left(\alpha, \frac{(d-i'+1)\beta}{t}\right). \qquad (4.10)$$

The desired bound now follows by combining (4.8) with (4.10).

4.3 Functional Repair

In this section, we present the file-size upper bound under functional repair via IFG analysis.


4.3.1 Information Flow Graph Model

The IFG used here (see Fig. 4-1) is a generalization of the one considered in section 3.1.1 for the case of t = 1. When a cluster, say i, experiences a batch of t failures, the whole cluster becomes inactive and is replaced with a new active cluster. In the new cluster, a special repair node $X^{rep}_i$ is used to combine local and external helper data and generate the content of the replacement nodes. The out-nodes of the ℓ local helper nodes connect to $X^{rep}_i$ via links of capacity α, and the external nodes of the d helper clusters connect to $X^{rep}_i$ via links of capacity β. Also, $X^{rep}_i$ connects to the in-nodes of the replacement nodes via links of capacity α. Further, the m − t nodes which did not experience failure in the inactive cluster are copied as such into the new active cluster. At any point in time, physical cluster i corresponds to one active and $f_i$ inactive clusters in the IFG, where $f_i \geq 0$ denotes the total number of batch failures and repairs in the cluster. We write $\mathcal{X}_i(\tau)$, $0 \leq \tau \leq f_i$, to denote the cluster in the IFG after the τ-th (batch) repair associated with cluster i, and use $R_i(\tau)$, $0 \leq \tau \leq f_i - 1$, to denote the indices of nodes that fail in $\mathcal{X}_i(\tau)$. The clusters $\mathcal{X}_i(0), \ldots, \mathcal{X}_i(f_i-1)$ are inactive, while $\mathcal{X}_i(f_i)$ is active, after $f_i$ repairs. The nodes of $\mathcal{X}_i(\tau)$ will be denoted by $X^{in}_{i,j}(\tau)$, $X^{out}_{i,j}(\tau)$, $X^{ext}_i(\tau)$, $X^{rep}_i(\tau)$ (there is no repair node if τ = 0).

repair node if τ = 0).

4.3.2 File Size Upper Bound

Theorem 4.3.1 (FR MRGRC Capacity). The file size B of a GRC with parameters {(n, k, d), (α, β), (m, ℓ, t)} under the FR regime is upper bounded by

$$B \leq B^*_F = \ell k\alpha + a\sum_{i=0}^{k-1} \min(t\alpha, (d-i)\beta) + \sum_{i=0}^{k-1} \min(b\alpha, (d-i)\beta). \qquad (4.11)$$

The bound is tight if there is a known upper bound on the number of repairs in the system.

Proof. To show the upper bound, it is enough to demonstrate a sequence of batch failures and a set of k clusters used by a data collector, such that there exists a cut between the source and the data collector with capacity no more than $B^*_F$. In the example sequence that we consider, clusters 1 to k are used for data collection and experience node failures. At each of these clusters, a + 1 batch failures occur (recall that a, b are defined by m − ℓ = at + b, a ≥ 1,


0 ≤ b ≤ t − 1). They jointly cover the first m − ℓ nodes of a cluster. Specifically, at cluster i ∈ [k], the first batch failure affects the last t of these nodes: $R_i(0) = \{m-\ell-t+1, \ldots, m-\ell\}$. The remaining batch failures affect disjoint sets of t nodes, starting from the first node $X_{i,1}$: $R_i(1) = \{1, \ldots, t\}$, $R_i(2) = \{t+1, \ldots, 2t\}$, until $R_i(a) = \{(a-1)t+1, \ldots, at\}$.

In all cases, the last ℓ nodes in a cluster provide the local helper data. For repairs in cluster

i, clusters 1, . . . , i − 1 and n − (d − i), . . . , n serve as helper clusters. Failures first occur in

cluster 1, then in clusters 2, 3, etc. until cluster k.

In the IFG corresponding to the described failure sequence, cluster $\mathcal{X}_i(a+1)$ is active for each i ∈ [k]. Let $\tau_j$ be such that the cluster $\mathcal{X}_i(\tau_j)$ appears in the IFG right after the last repair of node $X_{i,j}$ (we say "last repair" since nodes whose indices belong to $R_i(0) \cap R_i(a)$ fail twice in our sequence of failures; other nodes in cluster i fail only once). Consider a cut-set $(\mathcal{U}, \mathcal{V})$ consisting of the following edges:

• $X^{in}_{i,j}(a+1) \xrightarrow{\alpha} X^{out}_{i,j}(a+1)$, ∀i ∈ [k], j ∈ [m − ℓ + 1, m]. The total capacity of these edges is ℓkα.

• For all i ∈ [k]:

– Edge set $X^{rep}_i(\tau_j) \xrightarrow{\alpha} X^{in}_{i,j}(\tau_j)$, j ∈ [at], or edge set $X^{ext}_{i'}(0) \xrightarrow{\beta} X^{rep}_i(\tau_j)$, ∀i′ ∈ [n − (d − i), n], j ∈ {t, 2t, . . . , at}, whichever set capacity is smaller. The total capacity of these edges is a min(tα, (d − i + 1)β).

– If b > 0: edge set $X^{rep}_i(\tau_j) \xrightarrow{\alpha} X^{in}_{i,j}(\tau_j)$, j ∈ [at + 1, m − ℓ], or edge set $X^{ext}_{i'}(0) \xrightarrow{\beta} X^{rep}_i(\tau_j)$, ∀i′ ∈ [n − (d − i), n], j = m − ℓ, whichever set capacity is smaller. The total capacity of these edges is min(bα, (d − i + 1)β).

The value of the cut is given by $\ell k\alpha + a\sum_{i=1}^{k} \min(t\alpha, (d-i+1)\beta) + \sum_{i=1}^{k} \min(b\alpha, (d-i+1)\beta) = B^*_F$, which proves the bound.

We demonstrate the proof by an example for the special case (n = 3, k = 2, d = 2)(α,

β)(m = 3, ℓ = 0, t = 2). Note that for this special case, t ∤ (m − ℓ) and this will help us

illustrate the difference between functional and exact repair. Consider the following sequence

of 4 batches of failures and repairs (see Fig. 4-1). Batches 1 and 2 are associated with cluster

1 with R1(0) = {2, 3} and R1(1) = {1, 2}. Batches 3 and 4 are associated with cluster 3


with $R_3(0) = \{2, 3\}$ and $R_3(1) = \{1, 2\}$. There is no local help in this example. Cluster 1 receives external help from $X^{ext}_2(0)$ and $X^{ext}_3(0)$ for both batches of repairs, while cluster 3 receives external help from $X^{ext}_2(0)$ and $X^{ext}_1(2)$ for its repairs. Consider data collection by connecting to $X^{ext}_1(2)$ and $X^{ext}_3(2)$, and consider the S-Z cut whose edges are found as follows. For disconnecting $X^{out}_{1,1}(2)$ and $X^{out}_{1,2}(2)$, we either remove (based on whichever has smaller capacity) the two edges $X^{in}_{1,1}(2) \to X^{out}_{1,1}(2)$ and $X^{in}_{1,2}(2) \to X^{out}_{1,2}(2)$, or the set of helper edges $X^{ext}_2(0) \to X^{rep}_1(2)$ and $X^{ext}_3(0) \to X^{rep}_1(2)$. For disconnecting $X^{out}_{1,3}(2)$, we either remove the single edge $X^{in}_{1,3}(1) \to X^{out}_{1,3}(1)$ or the set of two helper edges $X^{ext}_2(0) \to X^{rep}_1(1)$ and $X^{ext}_3(0) \to X^{rep}_1(1)$. The set of edges that disconnects cluster 3 is similarly found, except that if we choose to disconnect links from external helpers, we only disconnect those from $X^{ext}_2(0)$ and not $X^{ext}_1(2)$. The value of the cut forms an upper bound for B, and is given by $B \leq \min(2\alpha, d\beta) + \min(\alpha, d\beta) + \min(2\alpha, (d-1)\beta) + \min(\alpha, (d-1)\beta)$, which is the same as the one given by (4.11).

To prove achievability of the bound, we also show that for any valid IFG, regardless of the specific sequence of failures and repairs, $B^*_F$ is indeed a lower bound on the minimum possible value of any S-Z cut. Please see Appendix 9.2 for a proof of this fact, which establishes the system capacity under functional repair.

4.4 Implications of the Bounds

Comparing the bounds

$$B^*_F = \ell k\alpha + at\sum_{i=0}^{k-1} \min\left(\alpha, \frac{(d-i)\beta}{t}\right) + b\sum_{i=0}^{k-1} \min\left(\alpha, \frac{(d-i)\beta}{b}\right), \qquad (4.12)$$

$$B^*_E = \ell k\alpha + (at+b)\sum_{i=0}^{k-1} \min\left(\alpha, \frac{(d-i)\beta}{t}\right), \qquad (4.13)$$

we note that $B^*_E \leq B^*_F$. Specifically, when t | (m − ℓ), the bounds coincide. Furthermore, they give the same storage-overhead vs. inter-cluster-bandwidth-overhead trade-off for any value of t ≥ 1. This means that under FR there is no advantage to jointly repairing multiple nodes instead of repairing one at a time. For ER, at the MSR and MBR points, there is no benefit to jointly repairing multiple nodes for any t > 1, irrespective of whether t | (m − ℓ) or not.
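These observations can be checked numerically. The sketch below (illustrative, not from the thesis) evaluates (4.11) and (4.2) and confirms that the bounds coincide when t | (m − ℓ), while B*_E falls strictly below B*_F at an MBR-style point (tα = dβ) when it does not.

```python
def B_FR(k, d, alpha, beta, m, ell, t):
    """Functional-repair bound B*_F from (4.11), with m - ell = a*t + b."""
    a, b = divmod(m - ell, t)
    return (ell * k * alpha
            + a * sum(min(t * alpha, (d - i) * beta) for i in range(k))
            + sum(min(b * alpha, (d - i) * beta) for i in range(k)))

def B_ER(k, d, alpha, beta, m, ell, t):
    """Exact-repair bound B*_E from (4.2)."""
    return ell * k * alpha + (m - ell) * sum(
        min(alpha, (d - i) * beta / t) for i in range(k))

# t | (m - ell): the two bounds coincide
args = dict(k=4, d=4, alpha=1.0, beta=0.6, m=4, ell=0, t=2)
assert abs(B_FR(**args) - B_ER(**args)) < 1e-9

# t does not divide (m - ell): B*_E < B*_F at the MBR point t*alpha = d*beta
args = dict(k=4, d=4, alpha=1.0, beta=0.5, m=3, ell=0, t=2)
assert B_ER(**args) < B_FR(**args)
```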


Figure 4-2: Trade-offs for an (n = 5, k = 4, d = 4)(m = 3, ℓ = 0, t = 2) system, plotted between the MSR and the MBR points.

Figure 4-3: Impact of the number of local helper nodes, ℓ, on the file size for an (n = 7, k = 4, d = 5, m = 17, t = 5) clustered storage system at the MBR point (α = 1, β = 1). Local help does not provide any advantage unless ℓ > 2.

When t ∤ (m − ℓ), it is possible that $B^*_F > B^*_E$. Specifically, at the MBR point with tα = dβ, we have $B^*_F > B^*_E$ whenever k > 1. This also means that the storage-overhead vs. inter-cluster-bandwidth-overhead trade-off under FR for the case t > 1 (with k > 1) is strictly better than that for the case t = 1. A comparison of the trade-offs between exact and functional repair for the case of {(n = 5, k = 4, d = 4)(m = 3, ℓ = 0, t = 2)} is shown in Fig. 4-2.

Another implication of the bounds relates to the usefulness of the number of local helper

nodes ℓ used in the repair process. Under FR, for the case of t = 1 studied in Chapter 3,

if we fix n, k, d,m, α, β, the optimal file-size increases strictly monotonically with ℓ, whenever

α > (d − k + 1)β (i.e., if we exclude the MSR point). However, strict monotonicity is not

necessarily true when t > 1. Specifically, at the MBR point, it is straightforward to show that

whenever (m mod t) ≤ ⌊(d−k+1)t/d⌋, for any ℓ in the range 0 ≤ ℓ ≤ (m mod t), the capacity

is as good as with no local help at all (see Fig. 4-3).
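The threshold behavior in Fig. 4-3 can be reproduced directly from (4.11). The following sketch (parameters as in the figure caption; the code itself is not from the thesis) confirms that ℓ ∈ {0, 1, 2} all give the same capacity, with a strict increase appearing only at ℓ = 3.

```python
def B_FR(k, d, alpha, beta, m, ell, t):
    """Functional-repair capacity B*_F from (4.11), with m - ell = a*t + b."""
    a, b = divmod(m - ell, t)
    return (ell * k * alpha
            + a * sum(min(t * alpha, (d - i) * beta) for i in range(k))
            + sum(min(b * alpha, (d - i) * beta) for i in range(k)))

# Parameters of Fig. 4-3: MBR point with alpha = beta = 1 (t*alpha = d*beta)
k, d, m, t = 4, 5, 17, 5
caps = [B_FR(k, d, 1, 1, m, ell, t) for ell in range(m - t + 1)]

# m mod t = 2 and floor((d-k+1)*t/d) = 2, so ell in {0, 1, 2} gives no gain
assert caps[0] == caps[1] == caps[2]
assert caps[3] > caps[2]
```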


Chapter 5

Intra-Cluster Bandwidth of GRCs

The GRC model introduced in Chapter 3 does not consider the intra-cluster bandwidth incurred

during repair. Intra-cluster bandwidth is needed to generate the external helper data to be sent

from the helper clusters and to download content from ℓ local helper nodes in the host cluster.

In this chapter, we characterize the amount of intra-cluster bandwidth that is needed to achieve

the optimal trade-off between storage overhead and inter-cluster repair bandwidth identified in

section 3.3.2. We consider the repair model where the replacement node downloads at most

γ, γ ≤ α, symbols from each of the ℓ local helper nodes from the host cluster. We also assume that the β symbols contributed

by a remote helper cluster are only a function of at most ℓ′, ℓ′ ≤ m nodes of the cluster. We

make the assumption that any set of ℓ′ nodes can be used to compute the β symbols. Further,

we limit the amount of data that each of these ℓ′ nodes can contribute to at most γ′ ≤ α

symbols. The goal of this chapter is to identify necessary requirements on the parameters

γ, ℓ′, γ′ that are needed for achieving the optimal trade-off between storage and inter-cluster

bandwidth, defined by the maximum file-size equation

$$B^* \triangleq \ell k\alpha + (m-\ell)\sum_{i=0}^{k-1} \min\{\alpha, (d-i)^+\beta\}. \qquad (5.1)$$

5.1 Local Helper Bandwidth in the Host Cluster

In this section, we focus on the intra-cluster bandwidth in the host cluster, taken by communicating ℓγ helper symbols from the local helper nodes. In the following theorem, we find the minimal value γ* of γ required for the optimal trade-off (5.1), and show that this value γ* is also sufficient for achieving the trade-off.


Figure 5-1: An illustration of the evolution of the k-th cluster of the information flow graph used in the cut-set based lower bound for γ in Theorem 5.1.1. In this figure, we assume that m = 4, ℓ = 2. Nodes 3, 4, 1 fail in this respective order. For the repair of node 3, nodes 1 and 2 act as the local helper nodes. For the repair of the remaining two nodes, nodes 2 and 3 act as the local helper nodes. Also indicated is our choice of the S-Z cut used in the bound derivation.

Theorem 5.1.1 (GRC Local Intra-cluster Bandwidth). For an optimal functional repair GRC

with parameters (n, k, d > 0), (α, β), (m, ℓ), γ′ = α, ℓ′ = m, local helper node bandwidth γ is

lower-bounded by

$$\gamma \geq \gamma^* \triangleq \alpha - (d-k+1)^+\beta. \qquad (5.2)$$

Further, if there is a known upper bound on the number of repairs that occur over the lifetime

of the system, the above bound is tight; i.e., the functional repair capacity of the system remains

as B∗ as long as γ ≥ γ∗.

Proof. We consider an IFG model similar to the main GRC model in section 3.1.1, except that now a replacement in-node $X^{in}_{i,j}$ connects to ℓ inactive helper out-nodes $X^{out}_{i,j'}$ in the same cluster via edges of capacity γ instead of α.

For the lower bound, consider the same system evolution as in the proof of the upper bound

in Theorem 3.3.1, except for the k-th cluster accessed by the data collector. Thus, physical

nodes Xi,ℓ+1, Xi,ℓ+2, . . . Xi,m fail in this order in cluster i = 1, then in cluster i = 2, and so on,

until cluster i = k−1. Note that each of the first k−1 clusters experiences a total of m−ℓ node

failures. For cluster k, we consider the failure of m−ℓ+1 nodes, corresponding to physical nodes

Xk,ℓ+1, Xk,ℓ+2, . . . Xk,m, Xk,1 in this respective order. In terms of the notation introduced in

3.1.1, the sequence of failures in the kth cluster correspond to IFG nodes Xk,ℓ+1(0), Xk,ℓ+2(1),

. . . , Xk,m(m − ℓ − 1), Xk,1(m − ℓ). For the repair of Xk,ℓ+1(0), the local helper nodes used


are Xk,1(0), . . . , Xk,ℓ(0). For the repair of any of the remaining nodes Xk,((ℓ+t) mod m)+1(t),

1 ≤ t ≤ m − ℓ, the local helper nodes used are Xk,2(t), Xk,3(t), . . . , Xk,ℓ+1(t). Also, clusters

X1(m− ℓ),X2(m− ℓ), . . . ,Xmin(d,k−1)(m− ℓ) are included in the set of remote clusters that aid

in the repair of the m − ℓ + 1 nodes in the kth cluster. An illustration of the IFG, for the kth

cluster is shown in Fig. 5-1. Note in this figure that the edges corresponding to local help have

capacity γ.

Let data collector Z connect to clusters X1(m−ℓ), . . . ,Xk−1(m−ℓ),Xk(m−ℓ+1). Consider

an S-Z cut in the IFG that partitions the graph nodes in clusters 1, · · · , k−1 in the same way as

in the proof of Theorem 3.3.1; however it differs in the way the nodes of cluster k are partitioned.

The overall set of edges in the cut-set is given below:

Clusters 1, . . . , k − 1:

• $\{(X^{in}_{i,j}(0) \to X^{out}_{i,j}(0)), i \in [k-1], j \in [\ell]\}$. The total capacity of these edges is (k − 1)ℓα.

• For each i ∈ [k − 1], t ∈ [m − ℓ], either the set of edges $\{(X^{ext}_{i'}(0) \to X^{in}_{i,\ell+t}(t)), i' \in \{\text{remote helper cluster indices for the replacement node } X^{in}_{i,\ell+t}(t)\} - [\min\{i-1, d\}]\}$, or the edge $(X^{in}_{i,\ell+t}(t) \to X^{out}_{i,\ell+t}(t))$. Between the two possibilities, we pick the one which has the smaller sum-capacity. In this case, the total capacity of this part of the cut is given by $\sum_{i=1}^{k-1}\sum_{j=\ell+1}^{m} \min\{\alpha, (d - \min\{i-1, d\})\beta\} = (m-\ell)\sum_{i=1}^{k-1} \min\{\alpha, (d-i+1)^+\beta\}$.

Cluster k:

• $(X^{out}_{k,1}(0) \to X^{in}_{k,\ell+1}(1))$ of capacity γ.

• $(X^{in}_{k,j}(0) \to X^{out}_{k,j}(0))$, ∀j ∈ [2, ℓ]. The total capacity of these edges is (ℓ − 1)α.

• Either the set of edges $\{(X^{ext}_{i'}(0) \to X^{in}_{k,((\ell+t) \bmod m)+1}(t+1)), i' \in \{\text{remote helper cluster indices for the replacement node } X^{in}_{k,((\ell+t) \bmod m)+1}(t+1)\} - [\min\{k-1, d\}], 0 \leq t \leq m-\ell\}$, or the set of edges $\{(X^{in}_{k,((\ell+t) \bmod m)+1}(t+1) \to X^{out}_{k,((\ell+t) \bmod m)+1}(t+1)), 0 \leq t \leq m-\ell\}$. Among the two sets, we pick the one which has the smaller sum-capacity. In this case, the total capacity of these edges is $(m-\ell+1)\min\{\alpha, (d-k+1)^+\beta\}$.


The total cut capacity is given by

$$C_{cut} = (k-1)\ell\alpha + (m-\ell)\sum_{i=0}^{k-2} \min\{\alpha, (d-i)^+\beta\} + \gamma + (\ell-1)\alpha + (m-\ell+1)\min\{\alpha, (d-k+1)^+\beta\}$$
$$= k\ell\alpha + (m-\ell)\sum_{i=0}^{k-1} \min\{\alpha, (d-i)^+\beta\} - \alpha + \min\{\alpha, (d-k+1)^+\beta\} + \gamma$$
$$= B^* - \alpha + \min\{\alpha, (d-k+1)^+\beta\} + \gamma.$$

Since we assume an optimal code, it must be true that $C_{cut} \geq B^*$, which results in

$$\gamma \geq \alpha - \min\{\alpha, (d-k+1)^+\beta\} = \alpha - (d-k+1)^+\beta = \gamma^*,$$

since for an optimal code $\alpha \geq (d-k+1)^+\beta$. This proves the lower bound.

We next prove the tightness of the bound: we show that, as long as γ ≥ γ*, the min-cut of any valid IFG is necessarily lower bounded by B*; in this case, as in the proof of Theorem 3.3.1, we know that the functional repair capacity remains B*, as long as there is a known upper bound on the number of repairs in the system. Consider the proof of the achievability part of Theorem 3.3.1, where we obtained a lower bound on the min-cut of any valid IFG. One can

repeat the same sequence of arguments, except with the change that the edges corresponding

to local help have capacity γ instead of α. In this case, it can be seen that instead of (3.2), we

obtain the following lower bound on min-cut:

$$\mathrm{mincut}(S-Z) \geq \sum_{i=1}^{k}\left[ a_i\alpha + \sum_{j=a_i+1}^{m} \min\left(\alpha, (\ell-j+1)^+\gamma + (d-(i-1))^+\beta\right)\right]. \qquad (5.3)$$

In the above expression, observe that if γ ≥ γ*, for j ≤ ℓ, i ≤ k, we have

$$(\ell-j+1)^+\gamma + (d-(i-1))^+\beta \geq \alpha. \qquad (5.4)$$

Therefore, (5.3) can be written as (3.2). It follows then that mincut(S − Z) is indeed lower


bounded by B∗ as long as γ ≥ γ∗. This completes the proof of the tightness, and also the

theorem.

Note that for α = (d − k + 1)β (the MSR point for d ≥ k), the bound (5.2) gives γ ≥ 0. Indeed, in this case, the optimal file size B* = mkα can be achieved with γ = 0 by using m classical MSR RCs with parameters (n, k, d), (α = (d − k + 1)β, β) and file size $B_{RC} = k\alpha$ each, which perform repairs independently of each other, without using any helper data from the local cluster.
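The boundary case can be checked numerically; the small helper below (illustrative, not from the thesis) evaluates the lower bound (5.2) and confirms that it vanishes exactly at the MSR point α = (d − k + 1)β.

```python
def gamma_star(k, d, alpha, beta):
    """Local helper-node bandwidth lower bound gamma* from (5.2)."""
    return alpha - max(d - k + 1, 0) * beta

# MSR point: alpha = (d - k + 1)*beta, so no local help is needed
assert gamma_star(k=4, d=6, alpha=3.0, beta=1.0) == 0.0

# Away from MSR (e.g. alpha = d*beta), local helpers must contribute
assert gamma_star(k=4, d=6, alpha=6.0, beta=1.0) == 3.0
```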

5.2 External Helper Cluster Local Bandwidth

In this section, we provide lower bounds on the parameters γ′ and ℓ′. Unlike the previous

section, here we do not prove the bound optimality, which allows us to simplify our IFG model

by avoiding replicating the surviving nodes. A new external IFG node for a cluster is added to

the IFG any time the cluster is used for data collection or for generating helper data, so that each external node is used exactly once. Whenever a physical node $X_{i,j}$ fails, we say that it becomes inactive, and its replacement node, say $X_{i,j} = (X^{in}_{i,j}, X^{out}_{i,j})$, becomes active in the same cluster. The remaining m − 1 nodes are not replicated, unlike in the previous IFG model. An external node $X^{ext}_{i',X_{i,j}}$ used for generating helper data connects to a subset of ℓ′ active out-nodes in the cluster via edges of capacity γ′. Note how we index the external node of cluster i′ that aids in the repair of $X_{i,j}$. A data collector, Z, connects to cluster i via the external node $X^{ext}_{i,Z}$, which

in turn connects to all m active out-nodes in the cluster via links of capacity α. In comparison

with the previous model, we do not time-index the sequence of failures in the current model.

This is because, in our proof of bounds for γ′ and ℓ′, we only consider system evolutions in

which each node fails at most once. In this case, we find it convenient simply to denote the

replacement node of Xi,j as Xi,j .

Theorem 5.2.1 (GRC External Helper Cluster Local Bandwidth). For an optimal functional

repair generalized regenerating code with parameters (n, k > 1, d) (α, β), (m, ℓ), γ = α, ℓ′ = m,


Figure 5-2: An illustration of the IFG used in the cut-set based lower bound for γ′ in Theorem 5.2.1. In this example, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1)(ℓ′ = 2, γ = α). The second node fails in clusters 1 and 2, in this respective order. Also indicated is our choice of the S-Z cut used in the bound derivation.

the remote helper-node repair bandwidth γ′ is lower-bounded by

$$\gamma' \geq \gamma'^* \triangleq \max\{\gamma'^*_1, \gamma'^*_2\}, \qquad (5.5)$$

$$\gamma'^*_1 \triangleq \frac{\beta}{m}, \qquad (5.6)$$

$$\gamma'^*_2 \triangleq \frac{\min\{\beta, \alpha - (d-k+1)^+\beta\}}{m-\ell}. \qquad (5.7)$$
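For concreteness, the bound (5.5)-(5.7) can be packaged as a small helper (illustrative, not from the thesis); the example point has α = dβ, where the second bound dominates the first.

```python
def gamma_prime_star(k, d, alpha, beta, m, ell):
    """Lower bound (5.5)-(5.7) on the per-node bandwidth within a helper cluster."""
    g1 = beta / m                                                  # (5.6)
    g2 = min(beta, alpha - max(d - k + 1, 0) * beta) / (m - ell)   # (5.7)
    return max(g1, g2)

# With alpha = d*beta: min(beta, alpha - (d-k+1)*beta)/(m - ell) = min(1, 2)/3
g = gamma_prime_star(k=3, d=4, alpha=4.0, beta=1.0, m=5, ell=2)
assert g == 1.0 / 3  # second bound (1/3) exceeds the first (beta/m = 1/5)
```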

Proof. The first bound (5.6) directly follows from the code optimality. Indeed, for an optimal code, the helper data sent out by a cluster cannot be redundant, and its size β cannot be larger than the sum size ℓ′γ′ = mγ′ of the components from which it is generated.

However, it turns out that on most points of the trade-off, the second bound (5.7) is tighter

than the first. To prove it, we consider data collection from clusters 1 to k. Before data

collection, the system experiences k(m− ℓ) repairs. Nodes ℓ+ 1, . . . ,m fail and get repaired in

cluster 1 in this respective order. This is followed by failure and repair of nodes ℓ + 1, . . . ,m

in cluster 2, and so on, until we consider failure and repair of nodes ℓ + 1, . . . ,m in cluster k.

In terms of physical nodes, it may be noted that this is the same sequence of failures that was


considered in the proof of Theorem 3.3.1; here, however, we will impose additional restrictions

on the choice of the remote helper clusters. The external help is taken from the set of the first

d+ 1 clusters, excluding the cluster where the failed node resides. Thus, for the repair of Xi,j ,

the indices of the remote helper clusters are [1, i − 1] ∪ [i + 1, d + 1]. The choice of the local helper nodes remains the same as in the proof of Theorem 3.3.1, where we used the first ℓ nodes in the cluster. An illustration of the IFG is shown in Figure 5-2.

It can be seen that the following cut-set separates the source from the data collector:

• {(X^in_{i,j} → X^out_{i,j}) : i ∈ [k], j ∈ [ℓ]}. The total capacity of these edges is kℓα.

• For each i, 1 ≤ i ≤ k, the edge set with smaller capacity out of A1(i) ∪ A2(i) and A3(i), where

– A1(i) ≜ {(X^ext_{i′} → X^in_{i,j}) : i′ ∈ [k + 1, d + 1], j ∈ [ℓ + 1, m]}. The total capacity of edges in A1(i) is (d − k + 1)+(m − ℓ)β.

– A2(i) ≜ {(X^out_{i′,j′} → X^ext_{i′,X_{i,j}}) : j ∈ [ℓ + 1, m], i′ ∈ [i + 1, min{k, d + 1}], j′ ∈ [ℓ + 1, m]}. The total capacity of edges in A2(i) is (m − ℓ)(min{k, d + 1} − i)+(m − ℓ)γ′.

– A3(i) ≜ {(X^in_{i,j} → X^out_{i,j}) : j ∈ [ℓ + 1, m]}. The total capacity of edges in A3(i) is (m − ℓ)α.

The capacity of the cut-set is given by

Ccut = kℓα + (m − ℓ) Σ_{i=1}^{k} min{α, (d − k + 1)+β + (min{k, d + 1} − i)+(m − ℓ)γ′}.


Since we consider optimal codes, we necessarily have

Ccut ≥ B∗ = kℓα + (m − ℓ) Σ_{i=1}^{k} min{α, (d − i + 1)+β}

⟹ min{α, (d − k + 1)+β + (min{k, d + 1} − i)+(m − ℓ)γ′} ≥ min{α, (d − i + 1)+β}, ∀i ∈ [k]

⟹ (d − k + 1)+β + (min{k, d + 1} − i)+(m − ℓ)γ′ ≥ (d − i + 1)+β, or
   (d − k + 1)+β + (min{k, d + 1} − i)+(m − ℓ)γ′ ≥ α, ∀i ∈ [k]

⟹ γ′ ≥ β/(m − ℓ), or
   γ′ ≥ (α − (d − k + 1)+β)/((m − ℓ)(min{k, d + 1} − i)), ∀i ∈ [min{k, d + 1} − 1]

⟹ γ′ ≥ min{β, α − (d − k + 1)+β}/(m − ℓ).

Corollary 5.2.2. For an optimal FR GRC with parameters (n, k > 1, d), (α ≥ (d − k + 2)β,

β), (m, ℓ), γ = α, ℓ′ = m, the remote helper-node repair bandwidth γ′ is lower-bounded by

γ′ ≥ γ′∗2 = β/(m − ℓ).   (5.8)

The following theorem establishes a necessary condition on ℓ′ for optimal codes. Recall that, by the definition of ℓ′, the helper cluster must be able to generate the helper symbols from an arbitrary subset of ℓ′ of its nodes.

Theorem 5.2.3 (GRC External Helper Cluster I/O). For an optimal functional repair GRC

with parameters (n, k, d), (α, β > 0), (m, ℓ), γ = γ′ = α, necessarily

ℓ′ ≥ ℓ′∗ ≜ m.   (5.9)

That is, all m nodes in a helper cluster must contribute to the helper data.

Proof. Consider a system evolution with k(m − ℓ) repairs. Nodes with indices in the range [ℓ + 1,

ℓ′] fail and get repaired in cluster 1, then in cluster 2, etc. until cluster d+ 1. This is followed

by failure of nodes [max{ℓ, ℓ′},m] in cluster 1, then in cluster 2, etc. until cluster k. For all the


Figure 5-3: An illustration of the IFG used in the cut-set based lower bound for ℓ′ in Theorem 5.2.3. In this example, we assume (n = 3, k = 2, d = 2), (m = 2, ℓ = 0), (ℓ′ = 1, γ = γ′ = α). The second node fails in clusters 1 and 2, in the respective order. Also indicated is our choice of the S-Z cut used in the bound derivation.

failures the local help is provided by the first ℓ nodes in each host cluster. The external help for

a failed node in cluster i is taken from clusters [1, d+1]− i, and the first ℓ′ nodes in each helper

cluster are always used to generate the β symbols of external helper data. Data collection is

performed from clusters 1 to k. An illustration of the IFG is shown in Figure 5-3.

It can be seen that the following cut-set separates the source from the data collector:

• {(X^in_{i,j} → X^out_{i,j}) : i ∈ [k], j ∈ [ℓ]}. The total capacity of these edges is kℓα.

• For each cluster i, 1 ≤ i ≤ k, the edge set with smaller capacity among A1(i) and A2(i), where

– A1(i) ≜ {(X^ext_{i′} → X^in_{i,j}) : i′ ∈ [i + 1, d + 1], j ∈ [ℓ + 1, ℓ′]}. The total capacity of edges in A1(i) is (d − i + 1)+(ℓ′ − ℓ)β.

– A2(i) ≜ {(X^in_{i,j} → X^out_{i,j}) : j ∈ [ℓ + 1, ℓ′]}. The total capacity of edges in A2(i) is (ℓ′ − ℓ)+α.

The capacity of the cut-set is given by

Ccut = kℓα + (ℓ′ − ℓ)+ Σ_{i=1}^{k} min{α, (d − i + 1)+β}.   (5.10)


Since we consider optimal codes, we necessarily have

Ccut ≥ B∗ = kℓα + (m − ℓ) Σ_{i=1}^{k} min{α, (d − i + 1)+β},   (5.11)

which results in ℓ′ ≥ m.

5.3 Optimality and Implications of the Intra-cluster Bandwidth Bounds

In the previous sections, we provided lower bounds for intra-cluster bandwidth parameters γ,

γ′, ℓ′. We analytically showed that the bounds for ℓ′ and γ (under FR) are optimal. In this

section, we perform numerical RLNC simulation to study optimality of the other bounds and

simultaneous tightness of all the bounds. For a given operating point on the optimal trade-

off (Figure 3-4) with a fixed β, we generate a random B∗ × nmα matrix over F_65537, whose columns are the global coding vectors of the nmα symbols (or packets) stored in the system. We

simulate iterations of failure/repair by replacing the columns corresponding to the failed symbols

with random linear combinations of the corresponding helper symbols, according to parameters

d, β, while the helper symbols are computed according to γ, γ′, ℓ′. After each iteration, we

check that the code satisfies the data collection requirement by computing the rank of several

random subsets of kmα columns corresponding to the data collection clusters. Data collection is

successful if the rank is B∗. The probability of decoding is estimated as a fraction of successful

data collections. If the GRC satisfies the data collection requirement, the estimated probability

of decoding should be 1.
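The decodability check at the heart of this procedure is a rank computation over F_65537. The sketch below is an illustrative reconstruction, not the code used for the thesis simulations; the matrix sizes, the seed, and the `decodable` helper are toy assumptions chosen for demonstration.

```python
import random

P = 65537  # prime field size used in the simulations

def rank_mod_p(mat, p=P):
    """Rank of a matrix (list of rows) over F_p, by Gaussian elimination."""
    m = [row[:] for row in mat]
    rank = 0
    for c in range(len(m[0]) if m else 0):
        piv = next((r for r in range(rank, len(m)) if m[r][c] % p != 0), None)
        if piv is None:
            continue  # no pivot in this column
        m[rank], m[piv] = m[piv], m[rank]
        inv = pow(m[rank][c], p - 2, p)  # inverse via Fermat's little theorem
        m[rank] = [x * inv % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][c] % p != 0:
                f = m[r][c]
                m[r] = [(x - f * y) % p for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank

def decodable(coding_matrix, cols, B_star, p=P):
    """Data collection succeeds iff the selected columns span rank B*."""
    sub = [[row[c] for c in cols] for row in coding_matrix]
    return rank_mod_p(sub, p) == B_star

# Toy check: a random 6 x 12 global coding matrix with B* = 6.
random.seed(1)
M = [[random.randrange(P) for _ in range(12)] for _ in range(6)]
print(decodable(M, range(8), 6))   # 8 random columns suffice w.h.p.
print(decodable(M, range(5), 6))   # 5 columns can never reach rank 6
```

The same rank routine, applied after each simulated repair, yields the estimated probability of decoding described above.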

We simulate a system with parameters (n = 4, k = 3, d = 3), (ℓ = 1,m = 2), (α, β = 4)

at the MBR, MSR, and the near-MSR points, with the latter having α = (d − k + 2)β. At

each operating point, we compute the intra-cluster parameter bounds γ∗, γ′∗, ℓ′∗ from (5.2),

(5.5), (5.9), and perform a test for these values of γ, γ′, ℓ′, followed by tests with one parameter

decreased, while the other parameters maximized. In each test, we estimate the probability of

decoding after each iteration. The results are presented in Figure 5-4.

The plots suggest that at all operating points the bounds (5.2), (5.5), (5.9) are tight for FR,


Figure 5-4: Simulation results for a system with parameters (n = 4, k = 3, d = 3), (ℓ = 1, m = 2), (α, β = 4), showing the probability of successful data collection against the number of node repairs performed, for an RLNC-based GRC. The legends indicate the parameters (γ, γ′, ℓ′) for each test. For all operating points ℓ′∗ = m = 2. (a) MBR point: α = dβ = 12, B∗ = 60, γ∗ = 8, γ′∗ = 4. (b) Near-MSR point: α = 2β = 8, B∗ = 44, γ∗ = 4, γ′∗ = 4. (c) MSR point: α = β = 4, B∗ = 24, γ∗ = 0, γ′∗ = 2.

and can simultaneously be achieved by RLNC. Violating any single bound results in a loss of

the code data collection property after as few as 3 failure/repair iterations.

The bounds on ℓ′ and γ′ highlight the necessary trade-off between the system capacity B∗

and the remote helper intra-cluster bandwidth ℓ′γ′ = mγ′, via parameter ℓ, the key parameter

that distinguishes our model from the classical model. Our bounds reveal the interesting fact

that, while it is beneficial to increase the number of local helper nodes ℓ in order to improve

the trade-off between storage and inter-cluster bandwidth, increasing ℓ not only increases the

intra-cluster repair bandwidth in the host cluster but also increases the intra-cluster repair

bandwidth in the remote helper clusters. For example, for MBR GRC the storage overhead

approaches that of MSR codes for large m as ℓ approaches m. However, a high value of ℓ also

increases the remote helper cluster bandwidth; indeed, mγ′∗ surges as m− ℓ approaches 1. See

Figure 5-5 for an illustration.


Figure 5-5: Illustrating the impact of ℓ on the various performance metrics. We operate at the MBR point with parameters {(n = 12, k = 8, d = n − 1), (α = dβ, β = 2)}. We see that while ℓ = m − 1 is ideal in terms of optimizing storage and inter-cluster BW, it imposes the maximum burden on intra-cluster BW.


Part II

Information Survival in Volatile Networks


Chapter 6

Network Coding for Time-Varying Networks

6.1 System Model

We consider a functional repair DSS with n storage nodes of size α symbols, which stores a

source file of size B symbols. Upon a node failure, the replacement node downloads β symbols

of helper data from each of d helper nodes and generates new α symbols to store. We assume a

stochastic model of node failures and helper node selection. For each failure, the index f of the

failed node is drawn from a probability distribution PF over [n], independently of other node

failures. A helper set H of d helper nodes is drawn (without replacement) from a probability

distribution PHi over [1, n] − i, independently of other failures and the corresponding helper

sets. We assume that the next failure happens only after the previous repair is complete. The

failure/repair iterations are indexed by discrete time t = 1, 2, 3, . . .. After a certain number of

failure and repair iterations, we call the storage operational if the source file can still be decoded

from the coded data on the nodes, and broken otherwise.

We study our system through the prism of RLNC packet storage. The source file is split

into k segments of size B/k = α/a for an integer a. As described in Section 2.3, a k-symbol header (coding vector) (0, . . . , 0, 1, 0, . . . , 0), with 1 in the ith position, is added to the ith uncoded segment to form the ith source packet. During the initial storage setup, the k source packets are


recoded with RLNC into na coded packets which are placed on the storage nodes. Since in

practice k is negligibly small in comparison with segment size B/k, we shall ignore the extra

storage taken by a packet coding vector in the header, and assume that each node of size α

stores a coded packets. Let matrix M0 ∈ F_q^{k×na} be the initial system (global) coding matrix, whose columns are the coding vectors of the initial na coded packets in the storage. The first a columns of the global coding matrix correspond to the a packets on the first node, the next a columns correspond to the second node, and so forth. The elements of M0 are sampled

uniformly independently identically distributed (i.i.d.) from Fq.

At each failure/repair iteration, each helper node generates b ≜ β/(α/a) helper packets of

total size β by recoding over its a packets, and sends the new helper packets to the replacement

node. The latter receives db helper packets and recodes them with RLNC into new a packets

to store. Decoding of the source file is possible if the DSS contains k packets with linearly

independent coding vectors. We assume that α/a and β/(α/a) are integers. Let matrix Mt ∈ F_q^{k×na} be the global coding matrix after t iterations of failure and repair. After the next node failure and repair, the coding matrix becomes Mt+1 = Mt Wt+1, where Wt+1 ∈ F_q^{na×na} is a random

the failed node ft+1 with random linear combinations of the columns corresponding to the

helper nodes from the helper set Ht+1. If Mt|i is the k × a submatrix corresponding to node i,

then

Mt+1|ft+1 = Σ_{j ∈ Ht+1} Mt|j D^H_{t+1}(j) D^R_{t+1}(j),   (6.1)

where D^H_{t+1}(j) is the a × b recoding matrix at helper node j, and D^R_{t+1}(j) is the b × a block of the recoding matrix at the replacement node corresponding to helper j; the entire recoding matrix at the replacement node has dimensions db × a. We shall use W^{t2}_{t1} to denote the cumulative evolution matrix from time t1 to t2: W^{t2}_{t1} = Wt1 Wt1+1 · · · Wt2. Let also W^t = W^t_1. The coding matrix after t iterations is given by Mt = M0 W^t. An example of a 3-iteration system evolution, along with the corresponding evolution matrices, is shown in Figure 6-1.

evolution along with the corresponding evolution matrices is shown in Figure 6-1.

We measure the performance of our system via lifetime and achievable coding rate metrics.

Lifetime L of the system is the index of the first failure/repair iteration that breaks the storage.

In other words, L is the first iteration that decreases the rank of the coding matrix below k


Figure 6-1: An example of a system evolution for 3 iterations of failure and repair, with n = 6, d = 2, a = b = 1. At t = 0, node i contains packet si. For the 4 considered system states, the evolution matrix W^t and its matroid representation M(W^t) are also shown. The most recently changed column of W^t is bold-faced.

and renders the data undecodable:

L ≜ min{t : rank M0 W^t < k}.   (6.2)

The coding rate of our system is R ≜ B/(nα) = k/(na). For a set of parameters (n, d, a, b, t), let the error probability pe(t) ≜ Pr[L ≤ t] = Pr[rank Mt < k] be the probability of storage failure within no more than t iterations. Let

Rε = Rε(n, d, a, b, t) ≜ (1/(na)) max{k : pe ≤ ε},   (6.3)

i.e., Rε is the maximal coding rate such that the probability that the system is operational after t iterations is at least 1 − ε. The main system model parameters are summarized in Table 6.1.

6.2 Previous Work

The key difference of our model from the setups of RC [1] and LRC [6, 27] is the probabilistic selection of the helper nodes and the allowance of a non-zero probability of decoding error. RC allow arbitrary (worst-case) helper selection. LRC and their generalizations to multiple helper sets, like [31, 32], require helper sets to come from a relatively small and limited number of alternatives.

A probabilistic approach to local code symbol repair is considered in locally correctable codes (LCCs) and locally decodable codes (LDCs) [54].

Table 6.1: Notation for the time-varying network storage system model.

Symbol  Definition
n       total number of nodes in the system
a       number of packets each storage node holds for one coded file
b       number of helper data packets downloaded from each helper node during node repair
k       size of the source file, in packets
d       number of helper nodes providing helper data during node repair
q       finite field size for data symbols
t       number of successive failure and repair iterations
ft      index of the failed node at the t-th failure
Ht      helper set at the t-th repair, |Ht| = d
Mt      k × na (global) coding matrix after t iterations, which describes the global coding vectors of all na coded packets in the system
Wt      na × na evolution matrix of the t-th iteration, determines the evolution of Mt
W^t     matrix of cumulative evolution from M0 to Mt
L       lifetime, the first t such that rank Mt < k
pe      error probability Pr[rank Mt < k]
R       coding rate R = k/na
Rε      maximal coding rate (for given (n, d, a, b, t)) such that pe ≤ ε

A code C : F_q^k → F_q^N is a (d, δ, ǫ) LDC, resp. LCC, if there exists a randomized algorithm AD, resp. AC, which reads at most d symbols of a corrupted codeword y and can correctly decode a source message symbol Ui, resp. correct a codeword symbol xi, with probability (w.p.) at least 1 − ǫ, for all u ∈ F_q^k, all i, and all y with |y − x| ≤ δN, where | · | denotes the Hamming distance. For ǫ < 1/2, the decoding/correcting algorithm can

be invoked multiple times, and majority logic can be used to make the probability of successful

decoding/correction arbitrarily close to 1; note that multiple algorithm calls potentially read

many more than d codeword symbols. Although in LDC/LCC the error probability is generally

not strictly zero, our model is different from that of LDC/LCC in several important aspects.

LDC/LCC protect against the worst-case corruption pattern and can repair δN simultaneous

failures, while our model performs repairs one at a time. Also unlike our model, LDCs/LCCs focus on exact repair.

Fitzek et al. [55] consider a model similar to ours; they performed an implementation-based evaluation of RLNCs and showed them to outperform Reed-Solomon-based and uncoded storage

approaches. Mazumdar [56] studies a local repair storage model, in which the network topology

is fixed, and a failed node can get the helper data for repair only from its neighbors according


to the storage network graph. Luby et al. [57] consider a large-code lazy-repair DSS, where

node failures are modeled by Poisson processes in the continuous time, and the repair process is

running for a large fraction of time at a very low repair bandwidth with a large repair locality.

To the best of the authors' knowledge, the literature on highly time-varying networks, such as mobile ad hoc networks, has not specifically studied the problem of distributed storage, focusing

instead on communication between nodes. RLNC has been previously successfully applied to

multicasting in mobile ad hoc networks [58, 59] and delay-tolerant networks [60–62].

6.3 Stochastic Rank Decay

Note that rank Mt and L are random variables determined by the selection of failed/helper nodes and of the RLNC coefficients. In this chapter, we study the main stochastic aspect of our model, namely

the randomness of failed and helper nodes.

When the field size q is large, to determine L it is enough to analyze the rank of evolution

matrix W t, as shown by the following proposition.

Proposition 1. Let LW ≜ min{t : rank W^t < k} ≥ L. Then

Pr[LW ≠ L] < 1/(q − 1).   (6.4)

Proof. If rank M0 < k, LW = L = 0 and the statement holds trivially. If rank M0 = k, LW > L implies that for some t, rank W^t = k while rank Mt ≤ k − 1. This can be true only if

some vector x in the column span of W t is in the kernel of M0, i.e. M0 · xτ = 0. This kernel

has dimension na − k (since M0 is full rank), and is spanned by na − k basis vectors. For

M0 sampled uniformly from Fk×naq , these basis vector components can be considered drawn

uniformly from Fq, independently of W t. The probability that the column space of W t of

dimension k has a non-trivial intersection with the uniformly sampled (na − k)-dimensional

kernel is upper bounded by

q^{na−k}/q^{na} + q^{na−k+1}/q^{na} + · · · + q^{na−1}/q^{na} = Σ_{i=0}^{k−1} q^{−k+i} < 1/(q − 1),


which gives the desired bound, because the column span of W t stays fixed for all values of t,

for which rankW t = k.
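The final step of the proof is a geometric-series bound, which is easy to check numerically; the sketch below (with toy values of na, k, and q, chosen only for illustration) evaluates the sum exactly as rationals.

```python
from fractions import Fraction

def intersection_prob_sum(q, na, k):
    """Exact value of q^(na-k)/q^na + ... + q^(na-1)/q^na from the proof."""
    return sum(Fraction(q) ** (na - k + i) / Fraction(q) ** na
               for i in range(k))

# The sum telescopes to (q^k - 1) / (q^k (q - 1)), strictly below 1/(q - 1).
for q in (2, 16, 65537):
    s = intersection_prob_sum(q, na=12, k=5)
    print(s < Fraction(1, q - 1))  # True for every q
```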

In this chapter, we study the system behavior in the limit of infinite field size, and focus

on rankW t instead of rankMt. Although infinitely large fields are not feasible in practice, the

limiting behavior of the system is important to analyze; as we show in the following chapter,

the system dynamics largely remains the same when q > 100. We shall also assume the failed

and helper node distributions PF ,PHi , ∀i to be uniform over [n], [n]− i, respectively.

Note that for d < k/a, one cannot guarantee the source file decodability for more than first

few iterations. Indeed, there exists a sequence of failures/repairs such that the first d nodes

serve as helper nodes to repair all other nodes [d + 1, n], which results in rankW t ≤ da < k.

Such a sequence has a non-zero probability and is encountered after a finite expected time; thus Pr[L ≤ n − d] > 0, L is finite almost surely, and E[L] < ∞. We shall see, however, that with high

probability the lifetime is much larger than n− d.

6.3.1 Matroid Perspective

Under the large field size assumption and with a = b, the resulting RLNC and, in particular,

the evolution matrix can be conveniently represented by a matroid. When a node failure/repair

happens, the evolution matrix changes from W t to W t+1. The columns of W t corresponding

to the failed node are replaced with linear combinations of the columns corresponding to the

helper nodes. This creates one or more linear dependencies involving a repaired column and all

helper columns. When q → ∞, all linear dependencies between columns arise from the choices

of failed and helper nodes, but not from a specific choice of random linear coefficients, with

probability arbitrarily close to 1 (from now on w.p.a.c. 1). These dependencies are captured by

a matroid representation M(W t) of the evolution matrix. The system matroid M(W t) has n

elements in its ground set E(M(W t)), which correspond to the n storage nodes. Each element

represents the subspace of F_q^{na} spanned by the a columns of the corresponding node in W^t. A set of m elements is considered independent if the corresponding ma columns of W^t are linearly independent vectors in F_q^{na}. The collection of all independent sets, ∀m ∈ [1, n], forms I. Figure

6-1 shows the matroid representation for a sample system evolution. The dots correspond to the


matroid elements and the lines represent the circuits, which span across several elements.

For a node set S ⊆ [n], let W t|S denote the submatrix of W t composed of the a|S| columns

corresponding to the packets on the nodes in S.

To show that (E , I) is indeed a matroid, we use the following lemma.

Lemma 6.3.1 (Matroid Lemma). For q → ∞ and a = b, rank W^t|S = nS a, where nS ≥ 1 is an integer, w.p.a.c. 1, ∀S ⊆ [n], S ≠ ∅, ∀t.

The proof is given in Appendix 9.3.

The lemma essentially says that for s ∉ S, rank W^t|S+s is either rank W^t|S or rank W^t|S + a; the column span of W^t|S either contains the column span of W^t|s or has a trivial intersection with it. Next, we formally show that (E, I) is indeed a matroid.

1. I is non-empty, since by lemma 6.3.1, rankW t|j = a, ∀t, j.

2. Every subset of a set in I is also in I. This is true, since a subset of a set of linearly

independent columns of W t is also linearly independent.

3. If I1, I2 ∈ I with |I2| = |I1| + 1, then there is an element s ∈ I2 − I1 such that I1 + s ∈ I. Since rank W^t|I2 = |I2|a > rank W^t|I1, there is s ∈ I2 such that colspan W^t|s ⊄ colspan W^t|I1. By Lemma 6.3.1, rank W^t|I1+s = (|I1| + 1)a = |I1 + s|a; therefore, I1 + s is independent, and the property is satisfied.

Note that for b < a, the third (independence augmentation) property does not necessarily

hold, and (E , I) is not a matroid, but only an independence system.

Lemma 6.3.1 also implies that (rankW t)/a is equal to the cardinality of a maximal in-

dependent set of M(W t). Since the matroid structure is determined only by the choices of

failure/helper nodes, it follows that (rank W^t)/a is independent of a. Thus, the next theorem

follows.

Theorem 6.3.2. Let W t(n, d, a, b) be the evolution matrix for a system with parameters n, d,

a, b, and let q → ∞. Then for two systems with parameters (n, d, a, a), (n, d, 1, 1) with the same

failed/helper node sequence, w.p.a.c. 1

rankW t(n, d, a, a) = a rankW t(n, d, 1, 1). (6.5)


Each circuit of the matroid represents a parity check relation between the nodes. A failure/repair iteration results in rank W^{t+1} decreasing by a if and only if the columns corresponding

to the failed node cannot be expressed as a linear combination of the other columns, which

happens if and only if the failed node is a coloop of the current matroid M(W t). In Figure

6-1, the coloop elements are indicated by the dots not covered by any line. At each failure/repair

iteration, all circuits involving the failed node are removed, and a new circuit appears, involving

the failed node and the d helper nodes. Additional circuits with d + 1 or more nodes appear

if the helper nodes were previously involved in some circuits without the current failed node:

an example of this situation is shown in the last iteration in Figure 6-1. In other words, any

intersection of two circuits creates an additional circuit. As a result, the total number of cir-

cuits grows very fast with t, and for large t almost any two nodes are involved in some common

circuit.

Every new circuit is constructed on d + 1 or more nodes, and since at t = 0 there are no

circuits with d nodes or fewer, every subset of d columns is independent for any t w.p.a.c. 1.

When d ≥ k/a (this is the regime considered in RCs), this implies that the rank never drops below da ≥ k, and the lifetime is infinite.

At t = 0 every node is a coloop. For low values of t < O(n log n) the number of coloops

ncoloops is typically above zero, and the rank of W^t decreases quickly: it drops by a after the next

iteration with a relatively large probability Pr[next failed node is a coloop] = ncoloops/n. We

shall refer to these early iterations as the burn-in phase. After the burn-in phase, the stability

phase gradually ensues, when there are so many circuits — dependencies among the nodes —

that there are no coloops during most time steps. At each iteration, numerous circuits involving

the failed node are removed, and many new circuits are created. The rank now decreases very

slowly, dropping only on those rare occasions when a coloop appears and is chosen to be the

failed node before being a helper.

6.4 Bounding Processes

Let us assume a = b = 1, q → ∞. Consider a system evolution for τ > 0 iterations. Let

Y^t = W^τ_{τ−t+1}, 0 ≤ t ≤ τ, be the backward cumulative evolution matrix, with Y^0 = In, Y^τ = W^τ.


Consider the transition from t to t + 1: Y^{t+1} = Wτ−t Y^t. Let f = fτ−t be the index of the failed

node corresponding to Wτ−t, and let H = Hτ−t be the helper set. The rows of Yt corresponding

to f and H will be called failure row and helper rows. As Y t is multiplied by Wτ−t to form

Y t+1, the failure row of Y t is chosen to be, first, added with RLNC multiplicative coefficients

to the d helper rows, and, second, replaced with zeros. For example, f = 1 and H = {4, 6} may

correspond to matrix W 1 from Figure 6-1; left multiplication by this matrix would result in

adding 2 times the first row to the fourth row and 3 times the first row to the sixth row, and then replacing

the first row with zeros.

Let Zt be the set of the indices of zero rows in Y t. If the failure row at the next iteration

is a zero row in Y t, then Y t+1 = Y t and Zt+1 = Zt. Otherwise, a non-zero failure row is added

with random coefficients to d helper rows. Let l ∈ [0, d] be the number of zero rows among

the helper rows. Since q → ∞, all helper rows become non-zero in Y t+1 w.p.a.c. 1. The total

number of zero rows in Y t+1 becomes |Zt+1| = |Zt| + 1 − l. Since the number of zero rows

lower-bounds the nullity (the dimension of the kernel) of Y t, which is non-decreasing with t,

rankY τ = rankW τ is upper-bounded by n−maxt≤τ |Zt|.
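The evolution of Zt can be simulated without any linear algebra; the sketch below (uniform failed/helper draws and toy parameters, an illustrative reconstruction rather than thesis code) applies exactly the update just described, with |Zt+1| = |Zt| + 1 − l.

```python
import random

def zero_row_process(n, d, tau, seed=0):
    """Track |Z_t|, the number of zero rows of Y^t, over tau iterations."""
    rng = random.Random(seed)
    Z, history = set(), [0]
    for _ in range(tau):
        f = rng.randrange(n)                 # failed node, uniform over [n]
        if f not in Z:                       # a zero failure row changes nothing
            H = rng.sample([i for i in range(n) if i != f], d)
            Z -= set(H)                      # helper rows become non-zero
            Z.add(f)                         # the failure row is zeroed
        history.append(len(Z))
    return history

hist = zero_row_process(n=10, d=2, tau=60)
print(max(hist))  # n minus this value upper-bounds rank W^tau
```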

Note that l, the number of zero helper rows, is between 0 and d; hence, |Zt| can go from i at step t to anywhere from (i − d)+ + 1 to i + 1 at step t + 1. While it lower-bounds the nullity, |Zt| does not necessarily equal the nullity, because it does not take into account dependent non-zero rows. Such rows arise when a non-zero failure row is added (with a multiplier) to l > 1 zero rows.

Let St ⊆ [n] be a set of row indices of Y t. Let S0 = [n]. We define St+1 as follows:

• If f is not in St, then we let St+1 = St.

• If f and the entire H are in St, then we let St+1 = St − f.

• If f is in St, but at least one helper node, say hj ∈ H, is not in St, then we let St+1 =

St − f + hj .

As we show below, the set of rows of Y t indexed by St is linearly independent w.p.a.c. 1.

Therefore, we have the following theorem.


Theorem 6.4.1. For a = b = 1, q → ∞, the rank of the evolution matrix is, w.p.a.c. 1, bounded by

|Sτ| ≤ rank W^τ ≤ n − max_{t≤τ} |Zt|.   (6.6)

To prove the linear independence of the rows St in Y t we use the following Matrix Addition

Lemma. It formally shows that adding a certain random matrix to a full-rank matrix results in

a full-rank matrix w.p.a.c. 1 in the limit of infinite field size. The proof of the lemma is given

in Appendix 9.4.

Lemma 6.4.2 (Matrix Addition Lemma). Let A ∈ F_q^{m×n}, m ≤ n, be a full-rank matrix with rows a1, . . . , am. Let u, v ∈ F_q^n be arbitrary vectors, and let d′ ∈ [0, m − 1] be an integer. Let A′ ∈ F_q^{m×n} be an additively transformed matrix with rows a′1, . . . , a′m, such that

a′i = ai + αi u,   if i ∈ [1, d′],
a′i = ai,          if i ∈ [d′ + 1, m − 1],
a′m = βm am + v,   if i = m,    (6.7)

where αi, βm are random scalars sampled uniformly i.i.d. from F_q. Then limq→∞ Pr[rank A′ = m] = 1, i.e., A′ is full-rank w.p.a.c. 1 in the limit of infinite field size.

Proof of Theorem 6.4.1. Let Yt|E denote the submatrix of Y t consisting of row(s) indexed by

E. We only need to prove that the rows of Y t|St are linearly independent. We prove it by

induction. The statement is true for t = 0, since Y 0 is the identity matrix. Suppose, it also

holds for some t. We transition from Y t to Y t+1 = Wτ−tYt with f,H corresponding to Wτ−t.

• If f /∈ St, and St+1 = St, we use Lemma 6.4.2 with d′ = |H ∩ St|, m − 1 = |St|, and let

A|[m−1] = Y t|St with the first d′ rows being helper rows in St, and row A|m arbitrary,

provided it makes A full-rank, u = Y t|f . Then, A′|[m−1] = Y t+1|St+1 is full-rank and its

rows are linearly independent w.p.a.c. 1.

• If f ∈ St,H ⊆ St, and St+1 = St − f , we use Lemma 6.4.2 with d′ = |H ∩ St|, m = |St|,

and let A = Y t|St (the first d′ = |H ∩ St| rows are helper rows in St, the last row am is

the failure row), u = am = Y t|f . Then, A′|[m−1] = Y t+1|St+1 is full-rank and its rows are


linearly independent w.p.a.c. 1.

• If f ∈ St, and there is hj ∈ H, hj /∈ St, and St+1 = St − f + hj , we use Lemma 6.4.2 with

d′ = |H ∩ St|, m = |St|, and let A = Y t|St (the first d′ = |H ∩ St| rows are helper rows in

St, the last row am is the failure row), u = am = Y t|f , v = Y t|hj . Then, A′ = Y t+1|St+1

is full-rank and its rows are linearly independent w.p.a.c. 1.

In all cases independence is maintained for rows of Y t+1|St+1 , therefore, by induction for any

t ≤ τ w.p.a.c. 1 rows of Y t|St are linearly independent, and |St| ≤ rankY t.

The next theorem shows that for the case of a single helper node the bounds in Theorem

6.4.1 are tight.

Theorem 6.4.3. For d = 1, the sets St and Zt complement each other:

St ∪ Zt = [n], St ∩ Zt = ∅, ∀t ∈ [τ ]. (6.8)

As a result, the bounds in (6.6) coincide and are tight:

|Sτ | = n− |Zτ | = rankW τ . (6.9)

Proof. Since the rows of Y_t|_{S_t} are linearly independent, none of them is zero; thus, S_t ∩ Z_t = ∅. We show S_t ∪ Z_t = [n] by induction. It holds for t = 0, with Z_0 empty and S_0 = [n]. Assuming it holds for t, consider the transition from Y_t to Y_{t+1}, with H = {h} consisting of a single helper row.

• If f ∈ Z_t, then f ∉ S_t, and both Z_t and S_t remain the same for t + 1.

• If f ∈ S_t and h ∈ S_t, then Z_{t+1} = Z_t + f, S_{t+1} = S_t − f.

• If f ∈ S_t and h ∉ S_t, then, by assumption, h ∈ Z_t, and Z_{t+1} = Z_t − h + f, S_{t+1} = S_t − f + h.

In every case the complementary property remains true for t + 1; by induction, it holds for any t. Thus, |S_τ| = n − |Z_τ|, and the bounds in (6.6) are tight.


For given P_F and {P_{H_i}}_{i∈[n]}, W_t ∈ F_q^{n×n} is a Markov process on a very large state space, which greatly complicates the direct analysis of W_t and its rank. Theorem 6.4.1 bounds the process rank W_τ using two other processes S_t, Z_t, which are functions of {W_t}_{t∈[τ]}. It is not hard to see that, w.p.a.c. 1, S_t and Z_t are also Markov processes on the much smaller state space 2^{[n]}. Indeed, by the construction of Z_{t+1}, S_{t+1} from Z_t, S_t, the probability distributions of Z_{t+1}, S_{t+1} are fully determined by the previous states Z_t, S_t for fixed P_F, {P_{H_i}}_{i∈[n]}.

When the distributions P_F, {P_{H_i}}_{i∈[n]} are uniform, the analysis of the bounds (6.6) is further simplified by the Markov property of |S_t| and |Z_t|, as shown by the following theorem.

Theorem 6.4.4. For uniform P_F, P_{H_i}, ∀i, the processes N_t ≜ |Z_t| and N'_t ≜ n − |S_t| are Markov processes on the state space [0, n] with transition probabilities

\[
p_{i,j}(N) \triangleq \Pr[N_{t+1} = j \mid N_t = i] = \frac{i}{n}\,\mathbf{1}_{i=j} + \frac{n-i}{n}\,\mathrm{Hg}^{\,i-j+1/i}_{\,d/n-1} \tag{6.10}
\]
\[
p_{i,i+1}(N') = \mathrm{Hg}^{\,d+1/n-i}_{\,d+1/n} = \frac{n-i}{n}\,\mathrm{Hg}^{\,0/i}_{\,d/n-1} = \frac{\binom{n-i}{d+1}}{\binom{n}{d+1}} \tag{6.11}
\]
\[
p_{i,i}(N') = 1 - p_{i,i+1}(N') \tag{6.12}
\]

and N_0 = N'_0 = 0, where \( \mathrm{Hg}^{\,k/K}_{\,n/N} = \binom{K}{k}\binom{N-K}{n-k} \big/ \binom{N}{n} \) is the probability mass function of the hypergeometric distribution with n trials, N items, and K possible successes.

Proof. For uniform failure and helper node distributions, at each iteration any row has probability 1/n of being selected as the failure row. For the transition from Y_t to Y_{t+1}, the probability that the failure row is among Z_t (and thus Z_{t+1} = Z_t) is |Z_t|/n. Otherwise, w.p. (n − |Z_t|)/n, a non-zero failure row is added to d helper rows, l of which are zero. In Y_{t+1} these l rows become non-zero and the failure row becomes zero; thus |Z_{t+1}| = |Z_t| + 1 − l. The probability of having l zero rows among d helper rows chosen out of n − 1 non-failure rows is Hg^{l/|Z_t|}_{d/n−1}. Thus, the distribution of the number of zero rows N_{t+1} = |Z_{t+1}| depends only on |Z_t| = N_t, and N is a Markov process with transition probabilities given by (6.10).

|S_t| is changed (decreased) at the next iteration only when the failure and helper rows are all in S_t. This happens with probability p = Hg^{(d+1)/|S_t|}_{(d+1)/n}. This probability is a function only of the value of |S_t| at time t; thus, N'_t = n − |S_t| is a Markov process with N'_0 = 0 and transition probabilities given by (6.11), (6.12). Note that the increase probability p_{i,i+1}(N') for N'_t = i equals ((n−i)/n) Hg^{0/i}_{d/n−1}, which is the same as the increase probability p_{i,i+1}(N) of the process N_t given by (6.10) for i − j + 1 = l = 0. While N_t goes down if l = i − j + 1 ≥ 2, N'_t never decreases with t. For d = 1, l can only be 0 or 1, and N_t is non-decreasing and identical to N'_t.
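As an illustrative sanity check (a sketch of ours, not part of the original analysis), the two forms of the increase probability in (6.11) can be verified to coincide, and the non-decreasing chain N'_t can be simulated directly under the assumptions of the theorem (uniform node selection, a = b = 1):

```python
from math import comb
import random

def p_up_closed(n, d, i):
    # p_{i,i+1}(N') = C(n-i, d+1) / C(n, d+1): the failure row and all
    # d helper rows must fall inside the current set S_t of size n - i
    return comb(n - i, d + 1) / comb(n, d + 1)

def p_up_hypergeometric(n, d, i):
    # ((n-i)/n) * Hg(0 successes | K = i, d draws from a population of n-1)
    return (n - i) / n * comb(n - 1 - i, d) / comb(n - 1, d)

def simulate_N_prime(n, d, steps, rng):
    # N'_t is non-decreasing: it moves up w.p. p_{i,i+1}(N'), else stays
    i = 0
    for _ in range(steps):
        if i <= n - d - 1 and rng.random() < p_up_closed(n, d, i):
            i += 1
    return i
```

With b = a = 1, `simulate_N_prime(n, d, t, rng)` gives one sample of the lower-bound process n − |S_t|.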

The analysis above is carried out for the single packet per node case. Theorem 6.3.2 allows

applying the same results for a = b > 1 by scaling the rank proportionally by a.

6.5 Impact of Repair Bandwidth

We have established the rank bounds for a = b, i.e. for the case of maximal repair bandwidth,

when the helper nodes send out as many packets as they store. In this section, we show how

much the rank may decrease when the contribution of each helper node is limited to b ≤ a

packets.

The following theorem provides a lower bound for the rank for b ≤ a as a function of the

rank with b = a.

Theorem 6.5.1. For q → ∞, consider a system with parameters n, d, a, and an arbitrary sequence of node failures/repairs E ≜ {f_t, H_t}_t. Let W_t be a sample evolution matrix under sequence E and a helper packets per helper node, and let W'_t be a sample evolution matrix under sequence E and b < a helper packets per helper node. Then, w.p.a.c. 1,

\[
\operatorname{rank} W'_t \;\ge\; \sum_{i=0}^{d-1} \min\{a,\,(d-i)b\} \;+\; b\left(\frac{\operatorname{rank} W_t}{a} - d\right). \tag{6.13}
\]

Proof. Consider sequence E and the corresponding IFG after t iterations of failures and repairs under a helper packets per helper node. The IFG for our RLNC packet system is defined according to Section 2.4.1, except that now we measure the edge capacities in packets, rather than symbols: the edges now have capacities a, b instead of α, β. Per Lemma 6.3.1 and Theorem 6.3.2, rank W_t is a multiple of a, rank W_t = ar, and there exists a set of r physical nodes at time t which contain enough packets to decode the source file of size ar packets. Let S_Z be the set of r corresponding active IFG out-nodes connecting to a data collector Z. Therefore, there exists a min-cut between S and Z of capacity ar, and by the max-flow min-cut theorem [50], there exists a flow from S to Z of capacity ar. Since Z connects to r out-nodes, and in- to out-node edges have capacity a, there exist r edge-disjoint paths {P_i}_{i∈[r]} in the IFG from S to the r nodes in S_Z. Since nodes in the IFG cannot have multiple incoming and multiple outgoing edges at the same time, {P_i} are also node-disjoint, except for the common starting node S.

Next, consider sequence E under b < a helper packets per helper node. For the same sequence, the IFG structure remains the same, except that now the connections to helper nodes have capacity b. Consider an IFG cut which partitions all vertices into two sets U, V, with S ∈ U, Z ∈ V. We want to lower-bound the cut capacity, and we assume that the r out-nodes in S_Z are in V, and X^in_i ∈ U, ∀i ∈ [n]. Consider a topological sorting of the nodes in the IFG. Let S_topo = {Y^out_1, Y^out_2, ..., Y^out_d} be the topologically first d out-nodes in V. Node Y^out_i, i ∈ [d], is directly connected to Y^in_i. If Y^in_i ∈ U, the edge Y^in_i → Y^out_i of capacity a crosses the cut and contributes a to the cut value. If Y^in_i ∈ V, then Y^in_i is directly connected to d other out-nodes, corresponding to the d helpers providing repair data to node Y_i. By the construction of S_topo, at most i − 1 of the d other out-nodes can be in V, so at least d − (i − 1) out-nodes are in U and contribute b each to the cut value. Thus, the contribution of Y^out_i to the cut capacity is at least min{a, (d − i + 1)b}, and the contribution of Y^out_1, ..., Y^out_d is at least Σ_{i=1}^{d} min{a, (d − i + 1)b}. Note that this value matches the capacity of RCs in Equation (2.6).

For each path P_i, let V^out_i ∈ V be the topologically first out-node on P_i in V. Let S_path = {V^out_i, i ∈ [r]}, and let V'^out_1, ..., V'^out_r be the nodes of S_path in their topologically sorted order. For each i ∈ [d + 1, r], consider node V'^out_i = V^out_j ∈ V with the two previous nodes in the same path P_j: either U^out_i → V'^in_i → V'^out_i (capacities b, then a) or S → V'^in_i → V'^out_i (capacities ∞, then a). Since U^out_i and S are in U, the considered segment of the path crosses the cut and contributes at least min{b, ∞, a} = b to its value. This contribution has not already been counted in the previous paragraph, because, by construction, {V'^out_i}_{i∈[d+1,r]} is disjoint from {Y^out_i}_{i∈[d]}, and {V'^in_i}_{i∈[d+1,r]} is disjoint from {Y^in_i}_{i∈[d]}.

Figure 6-2 demonstrates the argument for a sample evolution in a system with n = 4, d = 2, t = 4. With a packets per helper node, rank W_t = 3a, and the data collector can decode the source file from the first r = 3 physical nodes. Node Z directly connects to S_Z = {X^out_5, X^out_6, X^out_8}. There exist 3 disjoint paths from S to the 3 nodes of S_Z, e.g. S → X_2 → X_5, S → X_4 → X_6, S → X_3 → X_7 → X_8. For the cut shown in the figure, these paths correspond to S_path = {X^out_5, X^out_6, X^out_7}. The set of the topologically first d out-nodes in V is S_topo = {X^out_5, X^out_6}. For the case b < a, they contribute a + b ≥ min{a, 2b} + min{a, b} to the cut capacity. The path through X^out_7, the topologically last node in S_path, contains the edge X^out_3 → X^in_7, which contributes an extra b to the cut capacity.

Figure 6-2: An example of an information-flow graph for n = 4, d = 2 and t = 4 node failures/repairs. Also shown is a sample cut (U, V) of capacity a + 2b ≥ min{a, 2b} + min{a, b} + b.

Overall, the capacity of any cut is at least Σ_{i=0}^{d−1} min{a, (d − i)b} + b(r − d), which gives a lower bound for the file size achievable by RLNC, and hence for rank W'_t.
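The cut lower bound is easy to evaluate directly. The sketch below (an illustrative check of ours) also confirms that for b = a the right-hand side of (6.13) collapses back to rank W_t itself:

```python
def rank_lower_bound(rank_full, d, a, b):
    # Right-hand side of (6.13): rank_full = rank W_t under b = a helper
    # packets, so r = rank_full / a nodes suffice for decoding
    r = rank_full // a
    return sum(min(a, (d - i) * b) for i in range(d)) + b * (r - d)
```

For example, with a = 3, d = 2, r = 5 the bound evaluates to min{3, 4} + min{3, 2} + 2·3 = 11 for b = 2, against rank W_t = 15 for b = a.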

6.6 Expected Lifetime

In this section we use the rank bounds (6.6) to obtain bounds on the expected lifetime under the assumption of uniform failure and helper node distributions, a = b = 1, and q → ∞. Since L_W = min{t : rank W_t < k}, the upper and lower bounds L^+, L^− on L_W are given by

\[
L^+ \triangleq \min\{t : n - N_t < k\}, \qquad L^- \triangleq \min\{t : n - N'_t < k\}.
\]

In other words, the bounds correspond to the number of time steps needed to reach state n − k + 1 from state 0 (first hit times) for the processes N, N'. Therefore, E[L^−], E[L^+] bound the expected lifetime.


Lower Bound

For the non-decreasing chain N', reaching state r + 1 from state r takes 1/p_{r,r+1}(N') time steps in expectation, as the mean of the geometric distribution with success probability p_{r,r+1}. The expected first hit time of state n − k + 1 is

\[
\mathbb{E}[L^-] = \sum_{r=0}^{n-k} \frac{1}{p_{r,r+1}(N')}
= \sum_{r=0}^{n-k} \frac{n}{n-r}\,\frac{\binom{n-1}{d}}{\binom{n-r-1}{d}}
= \frac{n}{d+1}\binom{n-1}{d} \sum_{r=0}^{n-k} \binom{n-r}{d+1}^{-1}
= \frac{n}{d}\,\frac{\binom{n-1}{d}}{\binom{k-1}{d}} - \frac{n}{d} + 1
= O\!\left(n\Big(\frac{n}{k}\Big)^{d}\right), \tag{6.14}
\]

where the last summation is collapsed using the binomial identity

\[
\sum_{r=m}^{\infty} \binom{n+r}{n}^{-1} = \frac{n}{n-1}\binom{n+m-1}{n-1}^{-1}, \tag{6.15}
\]

provided in reference [63, Corollary 3.7].

Note that, for the case of a single helper node d = 1, the lower bound (6.14) is tight and equals the expected lifetime. On average, it takes only around n²/k iterations for the rank to drop to k, and n² iterations for it to drop to 1.
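The collapsed closed form in (6.14) can be cross-checked against the direct sum of expected sojourn times (a verification sketch of ours; it needs k ≥ d + 1 so that every transition probability is positive):

```python
from math import comb

def expected_L_minus_direct(n, k, d):
    # Sum of mean geometric waiting times 1 / p_{r,r+1}(N'),
    # with p_{r,r+1}(N') = C(n-r, d+1) / C(n, d+1); requires k >= d + 1
    return sum(comb(n, d + 1) / comb(n - r, d + 1) for r in range(n - k + 1))

def expected_L_minus_closed(n, k, d):
    # Closed form after collapsing the sum with the binomial identity (6.15)
    return n / d * comb(n - 1, d) / comb(k - 1, d) - n / d + 1
```

For d = 1 the closed form evaluates to n(n − 1)/(k − 1) − n + 1, i.e. roughly n²/k iterations, matching the remark above.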

Upper Bound

In order to estimate first hit times for N, for d ≥ 2 we approximate N with a birth-death Markov process Ñ with |Ñ_{t+1} − Ñ_t| ≤ 1. Let p̃_{i,j} = p_{i,j}(Ñ) = Pr[Ñ_{t+1} = j | Ñ_t = i]. Similarly to N, let

\[
\tilde{p}_{i,i+1} = \frac{n-i}{n}\,\mathrm{Hg}^{\,0/i}_{\,d/n-1}, \qquad
\tilde{p}_{i,i-1} = \frac{n-i}{n}\,\mathrm{Hg}^{\,2/i}_{\,d/n-1},
\]

and let p̃_{i,i} take the rest of the probability mass. Numerical simulations show that E[Ñ_t] is very close to E[N_t]. Since Ñ is a birth-death chain, it is reversible, and its stationary distribution π_1, π_2, ... can be derived from the detailed balance equation:

\[
\pi_{r+1} = \pi_r\,\frac{\tilde{p}_{r,r+1}}{\tilde{p}_{r+1,r}} = \pi_r\,\frac{2(n-r)(n-r-d)}{d(d-1)r(r+1)},
\]
\[
\pi_{n-k+1} = \pi_1 \prod_{r=1}^{n-k} \frac{\tilde{p}_{r,r+1}}{\tilde{p}_{r+1,r}}
= \pi_1\,\frac{\binom{n-1}{n-k}\binom{n-d-1}{n-k}}{(n-k+1)\binom{d}{2}^{\,n-k}}
\propto \left(\frac{n}{d(n-k)}\right)^{2(n-k)}.
\]

For values of n − k ≳ √(2n/d), the stationary probability π_{n−k+1} becomes negligible w.r.t. π_i, i < n − k, and the first hit time of state n − k + 1, and hence the lifetime bound, can be estimated as

\[
\mathbb{E}[L^+] \approx \frac{1}{\pi_{n-k+1}} = O\!\left(\Big(\frac{d(n-k)}{n}\Big)^{2(n-k)}\right) = O\!\left(\big(d(1-R)\big)^{2n(1-R)}\right). \tag{6.16}
\]

The upper bound grows large when d(n − k)/n > 1, which gives a necessary rate-locality condition for the lifetime to be large:

\[
d(1-R) > 1 \iff R < \frac{d-1}{d}. \tag{6.17}
\]

When the condition is satisfied, the upper bound (6.16) grows super-exponentially with (1 − R) = (n − k)/n for fixed n. Interestingly, the rate bound (6.17) closely matches the upper bound (1.1), the best rate in a strictly easier coding problem, namely locally repairable codes, where helper node selection is allowed.

Figure 6-3 shows the expected (W-based) lifetime for a storage with n = 20 nodes. As predicted by the bounds (6.14), (6.16), E[L] grows super-exponentially with 1 − R for d ≥ 2, but remains relatively small even at the lowest rates for d = 1. When d = 1, all circuits of the system matroid consist of 2 elements, and M(W_t) is divided into parallel classes, without any circuits involving elements from different classes. As a result, the number of circuits is relatively small, and coloops appear quite often, which leads to the sharp difference in behavior of a single-helper system.

6.7 Error Probability

In this section, we use the rank bounds (6.6) to estimate the error probability. We continue to

assume a = b = 1, q → ∞, and the uniform node distributions.


Figure 6-3: Simulated expected lifetime for n = 20, a = b = 1.

Figure 6-4: Probability of decoding error p_e against the coding rate for fixed n = 20, d = 4, a = b = 1. The dots indicate E[rank W_t]/n.

Figure 6-4 shows a sample numerical estimation of the error probability p_e as a function of the coding rate R = k/n for a system with parameters (n = 20, a = b = 1, d = 4). The expected rank of W_t per node, E[rank W_t]/n, is indicated by a dot for each t. The plot suggests that for rates below E[rank W_t]/n, the error probability generally drops exponentially with the coding rate.

Therefore, the expected rank of the evolution matrix provides a baseline for estimating the error probabilities and the achievable rates R_ε below E[rank W_t]/n. To estimate E[rank W_t], we apply expectation to the bounds (6.6):

\[
n - \mathbb{E}[N'_\tau] \;\le\; \mathbb{E}[\operatorname{rank} W_\tau] \;\le\; n - \mathbb{E}\Big[\max_{t \le \tau} N_t\Big]. \tag{6.18}
\]
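The bounding processes in (6.18) are straightforward to simulate. A sketch of ours for the N_t dynamics (assuming uniform failure and helper selection and a = b = 1, as in Theorem 6.4.4):

```python
import random

def simulate_N(n, d, steps, rng):
    """One trajectory of N_t = |Z_t|: each step draws a uniform failure row;
    if the failure row is non-zero, it is repaired from d uniformly chosen
    helper rows, l of which are zero, so |Z| changes by 1 - l."""
    z, hist = 0, []
    for _ in range(steps):
        if rng.random() >= z / n:                 # failure row is non-zero
            helpers = rng.sample(range(n - 1), d)  # d of the n-1 other rows
            l = sum(1 for h in helpers if h < z)   # zero rows among helpers
            z = z + 1 - l
        hist.append(z)
    return hist
```

Averaging n − max(hist) over many trajectories estimates the upper bound in (6.18); the matching lower bound comes from simulating N'_t in the same way.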

Figure 6-5 shows the simulated expected rank of the evolution matrix along with the bounds. For values of t of the order O(n) (the burn-in phase), the lower bound is very tight and can be used for estimating E[rank W_τ]. For higher t, however, the upper bound approximates the expected rank much better. Both E[rank W_τ] and the upper bound go down very slowly, within a constant factor of each other.

Figure 6-5: Expected rank of the evolution matrix r_t = E[rank W_t], with the upper and lower bounds, for n = 40, d = 4.

Figure 6-6: Expected rank r_t = E[rank W_t] for n = 40 and various values of d.

While operating at rate E[rank W_t]/n results in diminishing p_e as t increases, these probabilities (on the order of 0.1–0.3) are not small enough for reliable data storage. To estimate the error probability decrease with rate (i.e. the error exponent), we use the rank lower bound n − N'_τ ≤ rank W_τ, and have

\[
p_e = p_e(t,k) = \Pr[\operatorname{rank} W_t < k] \le \Pr[n - N'_t < k] = \Pr[N'_t > n-k] \triangleq p_e^+(t,k). \tag{6.19}
\]

Since the rank lower bound is tight only for t = O(n), and since p_e decreases with the rate more steeply for larger t (see Figure 6-4), we can use p_e^+(t, k) at the point t = n to upper-bound p_e as follows:

\[
\begin{aligned}
p_e(t \ge n, k) &= p_e\big(t,\ (k - \mathbb{E}[\operatorname{rank} W_t]) + \mathbb{E}[\operatorname{rank} W_t]\big) \\
&\le p_e\big(t, \mathbb{E}[\operatorname{rank} W_t]\big)\,
\frac{p_e\big(n,\ \Delta k + \mathbb{E}[\operatorname{rank} W_n]\big)}{p_e\big(n, \mathbb{E}[\operatorname{rank} W_n]\big)} \\
&\le p_e\big(n,\ \Delta k + \mathbb{E}[\operatorname{rank} W_n]\big) \\
&\le p_e^+\big(n,\ \Delta k + \mathbb{E}[\operatorname{rank} W_n]\big) \\
&= p_e^+\big(n,\ k - \mathbb{E}[\operatorname{rank} W_t] + \mathbb{E}[\operatorname{rank} W_n]\big),
\end{aligned}
\]

where Δk ≜ k − E[rank W_t] ≤ 0. For small values, the tail probability Pr[N'_t > n − k] is very closely approximated by the probability mass function Pr[N'_t = n − k + 1], which can be expressed using the transition probabilities p_{i,i+1} given by (6.11):

\[
p_e^+(t,k) = \Pr[N'_t > n-k] \approx \Pr[N'_t = n-k+1] \tag{6.20}
\]
\[
= \mathbf{1}_{t \ge n-k+1} \prod_{i=0}^{n-k} p_{i,i+1}
\sum_{\substack{(c_0, c_1, \ldots, c_{n-k+1}):\\ \sum_{i=0}^{n-k+1} c_i = t-(n-k+1)}}\ \prod_{i=0}^{n-k+1} (1-p_{i,i+1})^{c_i} \tag{6.21}
\]
\[
= \mathbf{1}_{t \ge n-k+1} \prod_{i=0}^{n-k} p_{i,i+1}
\sum_{i=0}^{n-k+1} \frac{(1-p_{i,i+1})^{t}}{\prod_{\substack{j=0 \\ j \ne i}}^{n-k+1} (p_{j,j+1} - p_{i,i+1})}. \tag{6.22}
\]
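The geometric-sojourn expansion of Pr[N'_t = n − k + 1] can be validated against an exact dynamic program over the chain N'. The sketch below is a verification of ours, not code from the thesis; it assumes k ≥ d + 2 so that all the transition probabilities involved are positive and pairwise distinct:

```python
from math import comb

def p_up(n, d, i):
    # p_{i,i+1}(N') = C(n-i, d+1) / C(n, d+1)
    return comb(n - i, d + 1) / comb(n, d + 1)

def pmf_top_dp(n, k, d, t):
    """Pr[N'_t = n-k+1] by forward propagation over states 0..n-k+1;
    probability mass moving past the top state is discarded."""
    m = n - k + 1
    p = [p_up(n, d, i) for i in range(m + 1)]
    dist = [1.0] + [0.0] * m
    for _ in range(t):
        new = [dist[i] * (1 - p[i]) for i in range(m + 1)]
        for i in range(m):
            new[i + 1] += dist[i] * p[i]
        dist = new
    return dist[m]

def pmf_top_geometric_sum(n, k, d, t):
    # Partial-fraction expansion of the same probability: the hitting time
    # is a sum of independent geometric sojourns with distinct parameters
    m = n - k + 1
    p = [p_up(n, d, i) for i in range(m + 1)]
    prefactor = 1.0
    for i in range(m):
        prefactor *= p[i]
    total = 0.0
    for i in range(m + 1):
        denom = 1.0
        for j in range(m + 1):
            if j != i:
                denom *= p[j] - p[i]
        total += (1 - p[i]) ** t / denom
    return prefactor * total
```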

To get a general sense of how the error exponent depends on the system parameters, we use the following expansion:

\[
\begin{aligned}
p_e^+(t,k) \ \propto\ \prod_{i=0}^{n-k} p_{i,i+1}
&= \prod_{i=0}^{n-k} \frac{\binom{n-i}{d+1}}{\binom{n}{d+1}}
= \prod_{i=0}^{n-k} \frac{\prod_{j=0}^{d}(n-i-j)}{\prod_{j=0}^{d}(n-j)}
\ \le\ \prod_{i=0}^{n-k} \Big(\frac{n-i}{n}\Big)^{d+1} \\
&= \left(\frac{n(n-1)\cdots(k+1)k}{n^{n-k+1}}\right)^{d+1}
= \left(\frac{n!}{(k-1)!\, n^{n-k+1}}\right)^{d+1} \\
&\propto \left(\frac{n^n e^k}{e^n k^k n^{n-k+1}}\right)^{d+1}
\approx \left(\frac{n^k}{k^k e^{n-k}}\right)^{d+1}
= e^{-(d+1)\left(n-k-k\log\frac{n}{k}\right)}
= e^{-n(d+1)\left(1-R(1-\log R)\right)}.
\end{aligned}
\]

The exponent n(d + 1)(1 − R(1 − log R)) increases with both n and d.
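Only the last two steps of the chain above are Stirling-style approximations; the intermediate factorial bound is an actual inequality, which can be checked numerically (a sketch of ours):

```python
from math import comb, factorial

def prod_p(n, k, d):
    # Dominant factor of p_e^+: product of the increase probabilities
    out = 1.0
    for i in range(n - k + 1):
        out *= comb(n - i, d + 1) / comb(n, d + 1)
    return out

def factorial_bound(n, k, d):
    # (n! / ((k-1)! * n^{n-k+1}))^{d+1}, the intermediate upper bound
    return (factorial(n) / (factorial(k - 1) * n ** (n - k + 1))) ** (d + 1)
```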


Chapter 7

Implementation Aspects and

Numerical Results

In this chapter, we consider some aspects of the storage system of Chapter 6 which were not covered by the previous analysis, but which may have an impact on the code performance in practical implementations. Some factors arise because our storage model may be too simplistic for certain scenarios: e.g. the real distributions P_F, P_H of failed and helper nodes may not be uniform, or the number of available helpers d may not be the same across iterations. We shall call these model factors. Other factors are implementation-specific and result from the intention to make the implementation simple and cost-efficient, e.g. a small field size or sparse packet recoding. These will be referred to as implementation factors. We study the impact of these aspects both individually and together in numerical simulations. In addition, we evaluate the fault tolerance of the system in terms of the number of nodes that must be accessed to decode the source file.

7.1 RLNC Recoding

In our model, packet recoding happens at the helper and replacement nodes (matrices D_H, D_R of Equation (6.1), respectively). Dense recoding matrices result in high CPU utilization during recoding of large data blocks, because the number of finite field operations is proportional to the number of non-zero elements in the recoding matrix. Therefore, it is desirable to make the matrices sparse [64, 65]. When n_1 input packets are RLNC-recoded into n_2 output packets with a sparse matrix, each input packet participates only in a small fraction of the n_2 output packets. We evaluate our system performance in the following regimes: full recoding, with an upper-triangular full-rank recoding matrix; no recoding, with the recoding matrix being the matrix with ones on the main diagonal and zeros elsewhere, with the columns shuffled; and sparse recoding, which starts with a no-recoding matrix and picks one more random element in each column uniformly from F_q\{0}. Thus, our sparse recoding ensures that every incoming packet is used in some output linear combination, while most elements of the recoding matrix remain zero.

Figure 7-1: Performance of various recoding regimes: No recoding (N), Sparse recoding (S), and Full recoding (FR) for a system with parameters n = 20, t = 2000, d = 4, a = 3, b = 2. The legend indicates the recoding regimes (helper, replacement nodes).

Figure 7-2: Impact of the effective field size q^a on the average rank of W_t for a system with parameters n = 20, d = 4, a = b, t = 1000. The actual field size used is q.
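The three regimes can be sketched as square m × m recoding matrices over F_q (an illustration of ours, square case for simplicity; the entries are field elements, and actual recoding would multiply packet blocks by these matrices mod q):

```python
import random

def no_recoding_matrix(m, q, rng):
    # Column-shuffled identity: every output packet is a copy of one input
    M = [[0] * m for _ in range(m)]
    cols = list(range(m))
    rng.shuffle(cols)
    for r in range(m):
        M[r][cols[r]] = 1
    return M

def sparse_recoding_matrix(m, q, rng):
    # No-recoding matrix plus one extra element per column from F_q \ {0}
    M = no_recoding_matrix(m, q, rng)
    for c in range(m):
        free = [r for r in range(m) if M[r][c] == 0]
        M[rng.choice(free)][c] = rng.randrange(1, q)
    return M

def full_recoding_matrix(m, q, rng):
    # Random upper-triangular matrix with a non-zero diagonal (full rank)
    return [[rng.randrange(1, q) if c == r else
             (rng.randrange(q) if c > r else 0) for c in range(m)]
            for r in range(m)]
```

Note that for q = 2 the sparse extra element is always 1, so sparse recoding degenerates to "XOR one extra input packet into each output".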

Figure 7-1 shows the testing results for various recoding regimes at the helper and replacement nodes. According to the results, E[rank W_t] is largely determined by the recoding regime at the replacement nodes. With no recoding at the replacement nodes, the number of helper packets is effectively decreased from db to a, which severely affects the average rank. Sparse recoding is sufficient to achieve more than 80% of the full recoding performance. Note that sparse recoding with q = 2 results in a much higher expected rank than no recoding with large field sizes.


7.2 Small Field Size

In this section, we explore numerically the impact of the field size on the code performance. Large-field arithmetic operations are computationally expensive, and we are looking into the possibility of operating with smaller fields. In their work on RLNC, Ho et al. [11] show that if a problem of multicasting to D receivers over an acyclic network is solvable for some fixed network code coefficients, then RLNC with uniformly distributed coefficients from F_q provides a valid solution w.p. at least (1 − D/q)^η, where η is the maximum number of IFG edges originating from the nodes performing coding in any minimum cut-set between the source and the sink. In our model, η grows linearly with the number of failures, which makes the lower bound negligibly low for large t, especially for small q.

We compare the performance for different field sizes for n = 20, d = 4, t = 1000. First, we perform the test for single-packet nodes a = b = 1 and varying field size q. Second, we fix the field size to be q = 2, and perform the test for varying node size a with b = a. We expect that an RLNC code over the field F_q with a packets per node performs similarly to a code over the field extension F_{q^a} with 1 packet per node; in fact, multiplying a packet by a scalar over F_{q^a} can be represented by multiplying a packets by a certain full-rank a × a matrix over F_q. In our case, q^a can be thought of as the effective field size. The plot in Figure 7-2 indicates that for a = 1, field sizes from q = 17 upward are enough to achieve more than 90% of the limiting (q → ∞) expected rank. Operating with multiple packets per node (a > 1) allows using the binary field with simple bitwise XOR addition and AND multiplication: the average rank per node packet closely approaches that with a large field size and a = 1.
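The scalar-to-matrix equivalence can be made concrete for q = 2, a = 2 (an illustration of ours, using the standard representation F_4 = F_2[x]/(x² + x + 1)): multiplication by the generator x acts on a pair of binary packets as the 2 × 2 companion matrix of x² + x + 1 over F_2.

```python
# Multiplication by x in GF(4), acting on coefficient pairs (c0, c1) over F_2
C = [[0, 1],
     [1, 1]]
I = [[1, 0],
     [0, 1]]

def matmul2(A, B):
    # 2x2 matrix product over F_2 (XOR of AND products)
    return [[(A[r][0] & B[0][c]) ^ (A[r][1] & B[1][c]) for c in range(2)]
            for r in range(2)]

def matadd2(A, B):
    # 2x2 matrix sum over F_2
    return [[A[r][c] ^ B[r][c] for c in range(2)] for r in range(2)]
```

Every non-zero element of F_{2^a} likewise corresponds to an invertible a × a binary matrix, which is why a packets per node over F_2 behave like one packet per node over F_{2^a}.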

7.3 Failed and Helper Nodes Distributions

In this section, we study the system behavior with non-uniform distributions of failed and helper nodes P_F, P_{H_i}. In practice, strict uniformity is not achievable, and some nodes will fail, or will be unavailable during repair, more often than others. We assume that P_{H_i} is the same for all nodes i, and denote it simply P_H. Without loss of generality, we assume that the nodes are sorted in non-decreasing order of their probability mass under P_H. We consider a family of power-law distributions where the probability of picking node i is p(i) ∝ i^x, x ≥ 0, i ∈ [1, n], with normalization p(n)/p(1) = 10 for x > 0. x = 0 corresponds to the uniform case, while for x ≫ 1 the first several nodes are much less probable than the others (Figure 7-3).

Figure 7-3: Probability mass functions of the test node distributions for a storage with n = 20 nodes. Given a fixed parameter x, the probability of the i-th atom is p(i) ∝ i^x. Larger values of x lead to stronger concentration of probability at the nodes with high indices.

Figure 7-4: Impact of the failed and helper node distributions P_F, P_H on the average rank for n = 20, t = 1000, d = 4. The distributions have p(i) ∝ i^x for x ∈ {x_F, x_H}. x_F < 0 corresponds to p(i) ∝ (n + 1 − i)^{|x_F|}.

Figure 7-4 shows the numerical evaluation results for scenarios with varying P_F, P_H with probability mass function (pmf) p(i) ∝ i^{x_F}, resp. i^{x_H}, for n = 20, d = 4, t = 1000. The minus sign of the parameter x_F means that the pmf of P_F increases in the node order opposite to that of the pmf of P_H, i.e. p(i) ∝ (n + 1 − i)^{|x_F|}. Positive, resp. negative, values of x_F bias the distribution towards higher, resp. lower, indices. The plot indicates that for the uniform helper node distribution (x_H = 0) the average rank is largely insensitive to the choice of P_F. For non-uniform P_H with x_H > 0, though, the rank drops with x_F and with x_F − x_H. In particular, for uniform P_F, the rank decreases significantly as P_H becomes less uniform. Intuitively, uniform P_H results in the best possible diversity of the helper data and the regenerated packets. As P_H becomes more non-uniform, it is biased towards the nodes with higher indices, and the majority of the helper packets end up coming from those high-probability nodes; an additional negative x_F makes the lower-index nodes fail more often. Both of these effects reduce the packet diversity. On the contrary, when x_F > x_H ≥ 0, the lower-index nodes become helpers more often than failures, while failures mostly happen in a narrow high-index range; as a result, the rank becomes even slightly higher than in the all-uniform scenario.

Figure 7-5: Impact of the standard deviation of the number of helper nodes d on the average rank for n = 20, a = b = 1, t = 1000. Beta-binomial distributions with different supports are used.

Figure 7-6: Decoding error probability p_e^dc = Pr[rank M_t|_S < k | rank M_t = k] for a randomly chosen column set S ⊂ [n], |S| = n_dc. n = 20, d = 4, a = b = 1, k = nR.

7.4 Variable Number of Helpers

In this section, we consider a scenario where the number of helper nodes d at each repair is a random variable. This is a reasonable assumption for ad-hoc and P2P networks, where connectivity changes dynamically and there may be fewer or more nodes connected to a given node at certain moments. We test the average performance for a fixed mean of d and varying standard deviation and support. The distribution of d is chosen to be beta-binomial, which models the number of independent successful connections to helper nodes out of some finite set, such that the probability of a successful connection follows the beta distribution. Figure 7-5 depicts the numerically evaluated results for two different expected values of d. The performance generally worsens as the standard deviation increases. However, the expected rank is much more sensitive to changes in E[d] than to changes in its standard deviation σ_d.
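For reference, the plain beta-binomial pmf can be sketched as follows (an illustration of ours; the supports shown in Figure 7-5, such as d ∈ [E[d] − 2, E[d] + 2], would additionally require shifting or truncating the support, which is not modeled here):

```python
from math import comb, gamma

def beta_binomial_pmf(j, m, alpha, beta):
    """Pr[d = j]: j successful connections out of m attempts, where the
    success probability is itself drawn from Beta(alpha, beta)."""
    B = lambda x, y: gamma(x) * gamma(y) / gamma(x + y)  # Beta function
    return comb(m, j) * B(j + alpha, m - j + beta) / B(alpha, beta)
```

The mean is m·alpha/(alpha + beta), and the variance grows as alpha and beta shrink, which is how a fixed E[d] with varying spread can be realized.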


7.5 Fault Tolerance

As shown previously, and specifically in Figure 7-7, in our model scenario RLNCs significantly outperform RCs. A reasonable question to ask is whether this performance gain comes at the price of low fault tolerance. Indeed, while RCs ensure that the source file can be decoded even when up to n − k nodes are unavailable (for a = 1), i.e. any set of k nodes can be used for decoding, our RLNCs do not provide such a guarantee. It turns out, however, that k or slightly more nodes picked uniformly at random contain enough data for decoding the source file with high probability. Figure 7-6 shows the decoding error probability p_e^dc, i.e. the probability that a uniformly chosen set of n_dc nodes (out of n) does not contain k independent coded packets, conditioned on M_t being full rank. p_e^dc is less than 10^{−1} for n_dc = k, and goes down exponentially fast as n_dc increases.

7.6 Effects of Several Factors

We test the RLNC code under the effects of several factors together. In each test, we find the maximal rate R_ε such that the error probability does not exceed ε = 5·10^{−4} for t = 2000. As a base case we consider a system with n = 20 nodes, a single packet per node (a = b = 1), a large field size q = 65537, and fixed d in the range [2, 8]; the base case does not include any of the factors discussed in this chapter. Then we incrementally modify the model to account for these factors. Specifically,

1. first, the field size is changed to q = 2 and, to compensate, the node sizes a and b are increased to 6;

2. for the next test, in addition to the new q, a, the size of the helper data per node b is reduced to 4;

3. then, the failed and helper node distributions are made non-uniform with x_H = 1, x_F = −1, as per Figures 7-3 and 7-4;

4. then, the number of helpers d is made a random variable distributed uniformly in the range [E[d] − 2, E[d] + 2], with the same integer mean as in the previous step;


Figure 7-7: The maximal rate R_ε for error probability under ε = 5·10^{−4}, t = 2000, and n = 20. First, tests are performed for the base case a = b = 1, q = 65537; then, various adverse parameter changes are introduced incrementally. The maximal theoretical RC code rate for n = 20, a = 6, b = 4 is provided for comparison.

5. finally, the number of recoding operations is reduced by performing no recoding ("No") at the helpers, and sparse recoding ("Sparse") at the replacement nodes.

The factors in steps 1 and 5 are implementation factors, and those in steps 2, 3, 4 are model factors. The resulting rates are shown in Figure 7-7. In addition, the maximal possible rate of regenerating codes for the same values of d, with n = 20, a = 6, b = 4, is shown by another curve for comparison. The maximal RC rate is given by R = (d − 1 + b/a)/n = (d − 1/3)/20, as per the file size equation (2.6). The plots show that our model is operational with high probability in a wide range of system parameters. Even in the presence of multiple factors that adversely affect the system performance, the resulting coding rate can be significantly higher than the rate provided by the best RCs.


Chapter 8

Conclusions

8.1 Summary

In this thesis, we study the fundamental limits of maintaining redundancy in coded network

storage systems in terms of trade-offs between the storage overhead, fault tolerance, and node

repair cost, measured by repair bandwidth, repair locality, and disk I/O.

In the first part of the thesis, we study clustered storage systems, where storage nodes are

grouped into clusters, with relatively cheap network bandwidth within a cluster and expensive

bandwidth between nodes in different clusters. We extend the regenerating codes framework

by Dimakis et al. [1] to clustered topologies, and introduce generalized regenerating codes

(GRC), which perform node repair using helper data both from the local cluster and from

other clusters. We showed the optimal trade-off between storage overhead and inter-cluster

repair bandwidth, and demonstrated explicit code constructions that achieve the operating

points on the trade-off, which are not achievable by applying the existing codes (or their space-

sharing combinations) to the clustered topology. We also analyzed the intra-cluster bandwidth,

incurred at the optimal trade-off operating points, and demonstrated that, although increasing

the number of local helper nodes improves the trade-off, it also greatly increases the required

intra-cluster bandwidth. Therefore, this three-way trade-off between the storage overhead,

inter-cluster bandwidth, and intra-cluster bandwidth provides an important intuition into the

design of clustered storage systems. The results were also extended for joint repair of multiple

node failures within a cluster. Under functional repair, amortized inter-cluster repair bandwidth


per failed node can be reduced by performing joint repair of several failed nodes instead of a

sequential repair of individual nodes.

In the second part of the thesis, we consider storage in time-varying networks with a small

repair locality and random opportunistic helper node selection. We show that, for storage system design, it is important to focus on the average-case (typical) rather than the worst-case failure

patterns and to consider the lifespan of the stored data. This leads to significant improvements

in the storage overhead with very little sacrifice of the fault tolerance, as the worst-case failure

patterns do not take place during the storage lifetime with overwhelmingly high probability.

We demonstrated that RLNC-based storage with random node selection outperforms regenerating codes in terms of achievable rate over a very large number of iterations of node failure and repair. In addition, the performance of the RLNC storage is robust to a wide range of model and implementation assumptions and parameters; in particular, it performs well over the binary field, under heavily skewed node distributions, and with sparse recoding.
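The binary-field robustness claim above can be exercised with a small standalone simulation. The following is a minimal sketch under assumed toy parameters (the node count, packet length, and helper degree are illustrative, and `gf2_rank` is a hypothetical helper), not the evaluation code used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def gf2_rank(mat):
    """Rank of a binary matrix over GF(2) via Gaussian elimination."""
    m = mat.copy() % 2
    rank = 0
    for c in range(m.shape[1]):
        pivot = next((r for r in range(rank, m.shape[0]) if m[r, c]), None)
        if pivot is None:
            continue
        m[[rank, pivot]] = m[[pivot, rank]]
        for r in range(m.shape[0]):
            if r != rank and m[r, c]:
                m[r] ^= m[rank]          # eliminate the pivot column, XOR = GF(2) add
        rank += 1
    return rank

# n nodes, a coded packets per node, B source packets, L bits per packet
n, a, B, L = 10, 2, 12, 32
source = rng.integers(0, 2, size=(B, L), dtype=np.uint8)

# Initial encoding: every stored packet is a random GF(2) combination of sources.
storage = rng.integers(0, 2, size=(n * a, B), dtype=np.uint8) @ source % 2

def repair(storage, failed, d=4):
    """Replace node `failed` by random binary recoding of d helpers' packets."""
    helpers = rng.choice([j for j in range(n) if j != failed], d, replace=False)
    rows = np.concatenate([storage[h * a:(h + 1) * a] for h in helpers])
    mix = rng.integers(0, 2, size=(a, rows.shape[0]), dtype=np.uint8)
    storage[failed * a:(failed + 1) * a] = mix @ rows % 2

for _ in range(50):                       # iterate random failure and repair
    repair(storage, int(rng.integers(n)))
print("retained rank over GF(2):", gf2_rank(storage), "of", B)
```

Repeating the failure/repair loop over many trials gives an empirical estimate of how much of the source rank survives under binary recoding.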

8.2 Future Directions

In this section, we briefly outline potential directions for further research related to the results of this thesis.

Clustered Storage Systems

In Section 3.4.1 we showed an optimal exact-repair GRC construction that employs existing classical RC constructions. However, it uses the maximal amount of intra-cluster bandwidth, with γ = γ′ = α. It turns out that the derived bounds (5.2), (5.5) are not tight for exact repair, i.e., there exist operating points on the trade-off that require strictly more bandwidth than γ∗, γ′∗. It would be of great interest to find tight exact-repair bounds for γ, γ′, as well as exact-repair code constructions that are optimal in intra-cluster bandwidth.

From a practical perspective, it is reasonable to combine the power of LRCs, which perform small-locality repairs of those failures that can be repaired locally without using the inter-cluster bandwidth, with the inter-cluster bandwidth efficiency of GRCs, to be used when a repair cannot be performed with local information only. In contrast to the work of Kamath et al. [31], where regenerating codes are used inside the clusters, this extension would employ LRCs for intra-cluster repairs and GRCs for combined intra- and inter-cluster repairs.

Finally, our GRC model and analysis can be readily generalized to three or more levels of the network hierarchy, e.g., nodes grouped into racks and racks grouped into data centers, with the repair bandwidth at different levels treated separately. The resulting multi-dimensional trade-offs between the storage overhead and the repair bandwidths, along with the bandwidth costs, would provide important network-planning intuition about the optimal repair network utilization at different hierarchy levels.

Information Survival in Time-Varying Networks

The analysis of the lifetime and the achievable rates in Chapter 6 is mainly based on the rank bounds using the processes N_t, N′_t. As demonstrated by Figure 6-5, the bounds are not very tight for large t. It would be useful to find tighter bounds on the rank in order to obtain a better estimate of the achievable rates. The main difficulty in constructing the rank bounds for W_t from the bounds for W_{t−1} is the need to keep track of exponentially many circuits in the system matroid. In fact, the bounds (6.6) are related to the vector matroid M[(W_τ)^T], which captures the dependencies between the rows of the evolution matrix W_τ. Unlike the system matroid, this matroid contains loops, and keeping track of them leads to the bounds (6.6).

One potentially manageable way to analyze the system dynamics is to look at the system in an asymptotic regime. This would also allow studying the Shannon capacity of the storage. An operationally meaningful way to scale the system is to let the number of iterations t grow linearly with the number of nodes n. The total storage size nα and the number of packets per node a can be fixed, while the packet size α/a ∝ 1/n ∝ 1/t goes down as n increases. In this case, the average number of failures per node t/n and the size of the failed and repaired data tα are constant for any value of n. One may expect that if E[rank W_t] = ρ for n nodes, then E[rank W_{2t}] = 2ρ for 2n nodes, because in both cases each node undergoes t/n failures and repairs on average. However, the numerical evaluation results shown in Figure 8-1 indicate that this is the case only in the burn-in phase, when t/n is small. In the stability phase, with larger t/n, the expected rank per node increases with n ∝ t.
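The scaling experiment behind Figure 8-1 can be reproduced in miniature. The sketch below is illustrative (not the thesis's evaluation code): it simulates the column evolution of W_t with real-valued random coefficients standing in for a large field q, so that ranks take their generic values.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_rank_per_node(n, t, a=1, d=4, trials=20):
    """Monte Carlo estimate of E[rank W_t] / n under the failure/repair chain."""
    total = 0.0
    for _ in range(trials):
        W = np.eye(n * a)                          # W_0 = I_{na}
        for _ in range(t):
            i = int(rng.integers(n))               # node that fails
            helpers = rng.choice([j for j in range(n) if j != i], d, replace=False)
            new_cols = np.zeros((n * a, a))
            for j in helpers:                      # new columns: sum over helpers of W_t|_j D_j
                D = rng.standard_normal((a, a))
                new_cols += W[:, j * a:(j + 1) * a] @ D
            W[:, i * a:(i + 1) * a] = new_cols
        total += np.linalg.matrix_rank(W)
    return total / (trials * n)

for n in (8, 16, 32):                              # t grows linearly with n
    print(n, round(mean_rank_per_node(n, t=5 * n), 3))
```

For a fixed ratio t/n in the stability phase, one expects the printed values to drift upward with n, consistent with the trend reported in Figure 8-1.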

A challenging practical aspect of our model is predicting the failure/repair node distributions


[Figure 8-1 appeared here: curves of the mean rank per node, (1/n) E[rank W_t], versus n ∈ [0, 200], one curve per ratio t/n ∈ {0.5, 1.0, 1.5, 2.5, 5.0, 15.0, 50.0, 150.0, 500.0}; the vertical axis spans roughly 0.50–0.75.]

Figure 8-1: Mean rank per node for t scaled proportionally to n, with d = 4, a = b = 1.

in a real system. Different failures are not necessarily independent and identically distributed; they may form a Markov or even a more general non-ergodic process. Therefore, the real distributions need to be estimated empirically, and the storage coding rate should be adapted accordingly by adjusting the number of nodes n in the system.

The General Problem

In practical use cases, the DSS may take advantage of several ways to maintain redundancy upon node failures. For a given replacement node, some helper nodes may always be available; these can be dedicated backup nodes used for repair only, or the local nodes in a clustered DSS. Other nodes may be available on a random basis, e.g., the nodes in the P2P part of the DSS or mobile nodes. For a degrading storage, there can be multiple ways to prevent the degradation, i.e., the decrease of rank W_t in terms of the model of Part II. One way is to increase the storage overhead by introducing extra storage nodes. Another is a smart selection of the helper nodes out of those which are permanently available. For instance, consider a DSS with d = 3, such that 2 helper nodes are random, while the third one can be selected by the code protocol. If any of the n − 1 surviving nodes can be selected as the third helper, the code with d = 3 can achieve a rate as high as 0.7, which is close to the maximal rate 0.75 of LRCs with d = 3. An alternative way to prevent the degradation is to actively introduce dependencies between the nodes in the time intervals between failure iterations. Doing so is equivalent to performing "artificial" node failures, in which the replacement physical node is the same as the "failed" one. Since the only purpose of such "failures" is maintaining redundancy, they can be performed at carefully chosen moments, when the helper node availability favors creating useful dependencies.
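The 0.75 ceiling for LRCs with locality d = 3 quoted above can be recovered from the known locality bound of Gopalan et al. [6]; a brief sketch, assuming the code must tolerate at least one erasure (d_min ≥ 2) and the locality r divides k:

```latex
d_{\min} \le n - k - \left\lceil \tfrac{k}{r} \right\rceil + 2
\;\stackrel{d_{\min}\ge 2}{\Longrightarrow}\;
n \ge k + \frac{k}{r}
\;\Longrightarrow\;
\frac{k}{n} \le \frac{r}{r+1}
\;\stackrel{r=3}{=}\; \frac{3}{4} = 0.75 .
```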

Overall, studying more general methods of maintaining redundancy and their combinations, quantifying the optimal contribution of each method, and characterizing the trade-offs between them is an important future research direction, and we hope that the approaches presented in this thesis provide novel perspectives on this general problem.


Bibliography

[1] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539–4551, 2010.

[2] D. Reinsel, J. Gantz, and J. Rydning, "Data age 2025: The evolution of data to life-critical," Framingham: IDC Analyze the Future, 2017.

[3] L. Rizzatti, "Digital data storage is undergoing mind-boggling growth," EETimes, 2016.

[4] "AWS storage services overview," Amazon Web Services Whitepapers, 2016. [Online]. Available: https://d0.awsstatic.com/whitepapers/Storage/AWS%20Storage%20Services%20Whitepaper-v9.pdf

[5] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, S. Yekhanin et al., "Erasure coding in Windows Azure storage," in Proc. USENIX Annual Tech. Conf. (USENIX ATC), Boston, MA, 2012, pp. 15–26.

[6] P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin, "On the locality of codeword symbols," IEEE Trans. Inf. Theory, vol. 58, no. 11, pp. 6925–6934, 2012.

[7] O. Khan, R. C. Burns, J. S. Plank, and C. Huang, "In search of I/O-optimal recovery from disk failures," in Proc. USENIX Conf. Hot Topics Storage File Systems (HotStorage), 2011.

[8] R. Ahlswede, N. Cai, S.-Y. Li, and R. W. Yeung, "Network information flow," IEEE Trans. Inf. Theory, vol. 46, no. 4, pp. 1204–1216, 2000.

[9] S.-Y. Li, R. W. Yeung, and N. Cai, "Linear network coding," IEEE Trans. Inf. Theory, vol. 49, no. 2, pp. 371–381, 2003.

[10] R. Koetter and M. Medard, "An algebraic approach to network coding," IEEE/ACM Trans. Netw., vol. 11, no. 5, pp. 782–795, 2003.

[11] T. Ho, M. Medard, R. Koetter, D. R. Karger, M. Effros, J. Shi, and B. Leong, "A random linear network coding approach to multicast," IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4413–4430, 2006.

[12] P. Sanders, S. Egner, and L. Tolhuizen, "Polynomial time algorithms for network information flow," in Proc. ACM Symp. Parallel Alg. Archit. ACM, 2003, pp. 286–294.

[13] S. Jaggi, P. Sanders, P. A. Chou, M. Effros, S. Egner, K. Jain, and L. M. Tolhuizen, "Polynomial time algorithms for multicast network code construction," IEEE Trans. Inf. Theory, vol. 51, no. 6, pp. 1973–1982, 2005.


[14] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 5227–5239, 2011.

[15] Y. Hu, P. P. C. Lee, and K. W. Shum, "Analysis and construction of functional regenerating codes with uncoded repair for distributed storage systems," in Proc. IEEE Int. Conf. Comp. Comm. (INFOCOM), April 2013, pp. 2355–2363.

[16] K. W. Shum and Y. Hu, "Cooperative regenerating codes," IEEE Trans. Inf. Theory, vol. 59, no. 11, 2013.

[17] A.-M. Kermarrec, N. Le Scouarnec, and G. Straub, "Repairing multiple failures with coordinated and adaptive regenerating codes," in Proc. IEEE Network Cod. Theory App. Workshop (NetCod). IEEE, 2011.

[18] V. R. Cadambe, S. A. Jafar, H. Maleki, K. Ramchandran, and C. Suh, "Asymptotic interference alignment for optimal repair of MDS codes in distributed storage," IEEE Trans. Inf. Theory, vol. 59, no. 5, pp. 2974–2987, 2013.

[19] A. S. Rawat, O. O. Koyluoglu, and S. Vishwanath, "Centralized repair of multiple node failures with applications to communication efficient secret sharing," ArXiv e-prints, vol. abs/1603.04822, 2016. [Online]. Available: http://arxiv.org/abs/1603.04822

[20] S. Pawar, S. E. Rouayheb, and K. Ramchandran, "Securing dynamic distributed storage systems against eavesdropping and adversarial attacks," IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6734–6753, Oct 2011.

[21] N. B. Shah, K. V. Rashmi, and P. V. Kumar, "Information-theoretically secure regenerating codes for distributed storage," in Proc. IEEE Global Telecomm. Conf. (GLOBECOM), Dec 2011, pp. 1–5.

[22] Y. L. Chen, S. Mu, J. Li, C. Huang, J. Li, A. Ogus, and D. Phillips, "Giza: Erasure coding objects across global data centers," in Proc. USENIX Annual Tech. Conf. (USENIX ATC), 2017, pp. 539–551.

[23] H. Abu-Libdeh, L. Princehouse, and H. Weatherspoon, "RACS: a case for cloud storage diversity," in Proc. ACM Symp. Cloud Computing. ACM, 2010, pp. 229–240.

[24] A. Bessani, M. Correia, B. Quaresma, F. Andre, and P. Sousa, "DepSky: dependable and secure storage in a cloud-of-clouds," ACM Trans. Storage, vol. 9, no. 4, p. 12, 2013.

[25] J. Y. Chung, C. Joe-Wong, S. Ha, J. W.-K. Hong, and M. Chiang, "CYRUS: Towards client-defined cloud storage," in Proc. European Conf. Comp. Sys. ACM, 2015, p. 17.

[26] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, "A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster," in Proc. USENIX Conf. Hot Topics Storage File Systems (HotStorage), 2013.

[27] D. S. Papailiopoulos and A. G. Dimakis, "Locally repairable codes," IEEE Trans. Inf. Theory, vol. 60, no. 10, pp. 5843–5855, 2014.


[28] H. D. Hollmann, "On the minimum storage overhead of distributed storage codes with a given repair locality," in Proc. IEEE Int. Symp. Inf. Theory. IEEE, 2014, pp. 1041–1045.

[29] I. Ahmad and C.-C. Wang, "When and by how much can helper node selection improve regenerating codes?" in Proc. IEEE Annual Allerton Conf. Comm., Control, Computing (Allerton). IEEE, 2014, pp. 459–466.

[30] ——, "When locally repairable codes meet regenerating codes: what if some helpers are unavailable," in Proc. IEEE Int. Symp. Inf. Theory. IEEE, 2015, pp. 849–853.

[31] G. M. Kamath, N. Prakash, V. Lalitha, and P. V. Kumar, "Codes with local regeneration and erasure correction," IEEE Trans. Inf. Theory, vol. 60, no. 8, pp. 4637–4660, 2014.

[32] I. Tamo, A. Barg, and A. Frolov, "Bounds on the parameters of locally recoverable codes," IEEE Trans. Inf. Theory, vol. 62, no. 6, pp. 3070–3083, 2016.

[33] K. M. Greenan, J. S. Plank, J. J. Wylie et al., "Mean time to meaningless: MTTDL, Markov models, and storage system reliability," in Proc. USENIX Conf. Hot Topics Storage File Systems (HotStorage), 2010.

[34] D. J. MacKay, Information theory, inference, and learning algorithms. Cambridge University Press, 2003.

[35] F. J. MacWilliams and N. J. A. Sloane, The theory of error-correcting codes. Elsevier, 1977.

[36] Y. Wu, "Existence and construction of capacity-achieving network codes for distributed storage," IEEE J. Sel. Areas Commun., vol. 28, no. 2, pp. 277–288, February 2010.

[37] J. G. Oxley, Matroid theory. Oxford University Press, USA, 2006, vol. 3.

[38] Y. Hu, P. P.-C. Lee, and X. Zhang, "Double regenerating codes for hierarchical data centers," in Proc. IEEE Int. Symp. Inf. Theory. IEEE, 2016.

[39] J. Sohn, B. Choi, S. W. Yoon, and J. Moon, "Capacity of clustered distributed storage," ArXiv e-prints, vol. abs/1610.04498, 2016. [Online]. Available: http://arxiv.org/abs/1610.04498

[40] B. Gaston, J. Pujol, and M. Villanueva, "A realistic distributed storage system: the rack model," ArXiv e-prints, vol. abs/1302.5657, 2013.

[41] J. Pernas, C. Yuen, B. Gaston, and J. Pujol, "Non-homogeneous two-rack model for distributed storage systems," in Proc. IEEE Int. Symp. Inf. Theory, July 2013, pp. 1237–1241.

[42] G. Calis and O. O. Koyluoglu, "Architecture-aware coding for distributed storage: Repairable block failure resilient codes," ArXiv e-prints, vol. abs/1605.04989, 2016. [Online]. Available: http://arxiv.org/abs/1605.04989

[43] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Enabling node repair in any erasure code for distributed storage," in Proc. IEEE Int. Symp. Inf. Theory, July 2011.

[44] N. B. Shah, K. V. Rashmi, and P. V. Kumar, "A flexible class of regenerating codes for distributed storage," in Proc. IEEE Int. Symp. Inf. Theory, June 2010, pp. 1943–1947.


[45] Q. Yu, K. W. Shum, and C. W. Sung, "Tradeoff between storage cost and repair cost in heterogeneous distributed storage systems," Trans. Emerging Telecomm. Tech., vol. 26, no. 10, pp. 1201–1211, 2015.

[46] T. Ernvall, S. El Rouayheb, C. Hollanti, and H. V. Poor, "Capacity and security of heterogeneous distributed storage systems," IEEE J. Sel. Areas Commun., vol. 31, no. 12, pp. 2701–2709, 2013.

[47] S. Akhlaghi, A. Kiani, and M. R. Ghanavati, "Cost-bandwidth tradeoff in distributed storage systems," Comp. Comm., vol. 33, no. 17, pp. 2105–2115, 2010.

[48] J. Li, S. Yang, X. Wang, and B. Li, "Tree-structured data regeneration in distributed storage systems with regenerating codes," in Proc. IEEE Int. Conf. Comp. Comm. (INFOCOM). IEEE, 2010, pp. 1–9.

[49] Y. Wang, D. Wei, X. Yin, and X. Wang, "Heterogeneity-aware data regeneration in distributed storage systems," in Proc. IEEE Int. Conf. Comp. Comm. (INFOCOM), April 2014, pp. 1878–1886.

[50] J. Bang-Jensen and G. Z. Gutin, Digraphs: theory, algorithms and applications. Springer Science and Business Media, 2008.

[51] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, "Availability in globally distributed storage systems," in Proc. USENIX Conf. Oper. Systems Design Implem. (OSDI), vol. 10, 2010, pp. 1–7.

[52] M. Gerami, M. Xiao, and M. Skoglund, "Two-layer coding in distributed storage systems with partial node failure/repair," IEEE Commun. Lett., vol. PP, no. 99, 2017.

[53] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Distributed storage codes with repair-by-transfer and nonachievability of interior points on the storage-bandwidth tradeoff," IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1837–1852, March 2012.

[54] S. Yekhanin, "Locally decodable codes," Foundations and Trends in Theoretical Computer Science, vol. 6, no. 3, pp. 139–255, 2012.

[55] F. H. Fitzek, T. Toth, A. Szabados, M. V. Pedersen, D. E. Lucani, M. Sipos, H. Charaf, and M. Medard, "Implementation and performance evaluation of distributed cloud storage solutions using random linear network coding," in Proc. IEEE Int. Conf. Comm. Workshop (ICC). IEEE, 2014, pp. 249–254.

[56] A. Mazumdar, "Storage capacity of repairable networks," IEEE Trans. Inf. Theory, vol. 61, no. 11, pp. 5810–5821, 2015.

[57] M. G. Luby, R. Padovani, T. J. Richardson, L. Minder, and P. Aggarwal, "Liquid cloud storage," ArXiv e-prints, vol. abs/1705.07983, 2017.

[58] T. Ho, B. Leong, M. Medard, R. Koetter, Y.-H. Chang, and M. Effros, "On the utility of network coding in dynamic environments," in Proc. IEEE Int. Workshop Wireless Ad-Hoc Net. IEEE, 2004, pp. 196–200.

[59] J.-S. Park, M. Gerla, D. S. Lun, Y. Yi, and M. Medard, "Codecast: a network-coding-based ad hoc multicast protocol," IEEE Trans. Wireless Commun., vol. 13, no. 5, 2006.


[60] J. Widmer and J.-Y. Le Boudec, "Network coding for efficient communication in extreme networks," in Proc. ACM SIGCOMM Workshop on delay-tolerant networking. ACM, 2005, pp. 284–291.

[61] Y. Lin, B. Li, and B. Liang, "Stochastic analysis of network coding in epidemic routing," IEEE J. Sel. Areas Commun., vol. 26, no. 5, 2008.

[62] L. Sassatelli and M. Medard, "Inter-session network coding in delay-tolerant networks under spray-and-wait routing," in Proc. IEEE Int. Symp. on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt). IEEE, 2012, pp. 103–110.

[63] B. Sury, T. Wang, and F.-Z. Zhao, "Identities involving reciprocals of binomial coefficients," J. of Integer Sequences, vol. 7, no. 2, p. 3, 2004.

[64] D. Silva, W. Zeng, and F. R. Kschischang, "Sparse network coding with overlapping classes," in Proc. IEEE Network Cod. Theory App. Workshop (NetCod). IEEE, 2009, pp. 74–79.

[65] S. Feizi, D. E. Lucani, C. W. Sørensen, A. Makhdoumi, and M. Medard, "Tunable sparse network coding for multicast networks," in Proc. IEEE Int. Symp. Network Cod. (NetCod). IEEE, 2014, pp. 1–6.


Chapter 9

Appendices

9.1 MRGRC Chain Order Lemma 4.2.2

MRGRC Chain Order Lemma. Let b > 1, i.e., t ∤ (m − ℓ). Consider any S_i ⊂ [n], |S_i| = i, 1 ≤ i ≤ k − 1. Then, for any i′ ∈ [n] − S_i, there exists a permutation σ_{i′,S_i} of {ℓ+1, ℓ+2, . . . , m} such that

H(Y_{i′,σ_{i′,S_i}(j′)} | Y_{S_i}, Y_{i′,[1,ℓ]}, {Y_{i′,σ_{i′,S_i}(j)}}_{j∈[ℓ+1, j′−1]}) ≤ min(α, (d − i)β / t),    (9.1)

for all j′ ∈ {m − b + 1, m − b + 2, . . . , m}.

Proof. We present a candidate permutation σ_{i′,S_i}. Consider the content of cluster i′, given by {Y_{i′,1}, Y_{i′,2}, . . . , Y_{i′,m}}. Define the quantities (j_m, V_m), (j_{m−1}, V_{m−1}), . . . , (j_{m−b+1}, V_{m−b+1}), in this respective order, as follows:

1. Let U = {Y_{i′,ℓ+1}, Y_{i′,ℓ+2}, . . . , Y_{i′,m}}, and x = 0.

2. Define (j_{m−x}, V_{m−x}) as

(j_{m−x}, V_{m−x}) = argmin_{(j,V) : Y_{i′,j} ∈ U, V ⊂ U − {Y_{i′,j}}, |V| = t−1} H(Y_{i′,j} | V, Y_{S_i}, Y_{i′,[1,ℓ]}).

3. If x < b − 1, update U as U = U − {Y_{i′,j_{m−x}}}, increment x by 1, and return to Step 2.

Additionally, let us define {j_{ℓ+1}, j_{ℓ+2}, . . . , j_{m−b}} ≜ {ℓ+1, . . . , m} − {j_m, j_{m−1}, . . . , j_{m−b+1}}. In the preceding definition, we only need equality as sets; we do not care about any particular ordering of the elements of {ℓ+1, . . . , m} − {j_m, j_{m−1}, . . . , j_{m−b+1}} while associating them with {j_{ℓ+1}, j_{ℓ+2}, . . . , j_{m−b}}. The candidate for the permutation σ_{i′,S_i} on the set {ℓ+1, . . . , m} is now defined as follows:

σ_{i′,S_i}(p) = j_p,  ℓ+1 ≤ p ≤ m.    (9.2)

We will show that the permutation σ_{i′,S_i} is such that

H(Y_{i′,σ_{i′,S_i}(j′)} | Y_{S_i}, Y_{i′,[1,ℓ]}, {Y_{i′,σ_{i′,S_i}(j)}}_{j∈[ℓ+1, j′−1]}) ≤ min(α, (d − i)β / t),    (9.3)

for all j′ ∈ {m − b + 1, m − b + 2, . . . , m}. Consider the variable j′ appearing in (9.3), and let j′ = m − x for some x, 0 ≤ x ≤ b − 1, so that using (9.2) we have σ_{i′,S_i}(j′) = j_{m−x}. Consider the definition of (j_{m−x}, V_{m−x}) in Step 2 above; we then know that

H(Y_{i′,j_{m−x}} | V_{m−x}, Y_{S_i}, Y_{i′,[1,ℓ]}) ≤ H(Y_{i′,j_p} | V, Y_{S_i}, Y_{i′,[1,ℓ]}),    (9.4)

for all V ⊂ {Y_{i′,j_{ℓ+1}}, Y_{i′,j_{ℓ+2}}, . . . , Y_{i′,j_{m−x}}} − {Y_{i′,j_p}} such that |V| = t − 1, and for all p, ℓ + 1 ≤ p ≤ m − x − 1. To prove (9.3), first observe that

H(Y_{i′,σ_{i′,S_i}(j′)} | Y_{S_i}, Y_{i′,[1,ℓ]}, {Y_{i′,σ_{i′,S_i}(j)}}_{j∈[ℓ+1, j′−1]}) ≤ H(Y_{i′,σ_{i′,S_i}(j′)} | Y_{S_i}, V_{m−x}, Y_{i′,[1,ℓ]}).    (9.5)

This follows from the fact that V_{m−x} ⊂ {Y_{i′,j_{ℓ+1}}, Y_{i′,j_{ℓ+2}}, . . . , Y_{i′,j_{m−x−1}}}. Without loss of generality, assume that V_{m−x} = {Y_{i′,j_{ℓ+1}}, Y_{i′,j_{ℓ+2}}, . . . , Y_{i′,j_{ℓ+t−1}}}. Next, from the exact repair condition


given in (4.6), we know that

min(tα, (d − i)β) ≥ H(Y_{i′,σ_{i′,S_i}(j′)}, V_{m−x} | Y_{S_i}, Y_{i′,[1,ℓ]})
= Σ_{p=ℓ+1}^{ℓ+t−1} H(Y_{i′,j_p} | Y_{i′,j_{ℓ+1}}, . . . , Y_{i′,j_{p−1}}, Y_{S_i}, Y_{i′,[1,ℓ]}) + H(Y_{i′,σ_{i′,S_i}(j′)} | V_{m−x}, Y_{S_i}, Y_{i′,[1,ℓ]})
≥ Σ_{p=ℓ+1}^{ℓ+t−1} H(Y_{i′,j_p} | V_{j_p}, Y_{S_i}, Y_{i′,[1,ℓ]}) + H(Y_{i′,σ_{i′,S_i}(j′)} | V_{m−x}, Y_{S_i}, Y_{i′,[1,ℓ]}),

where V_{j_p} = V_{m−x} − {Y_{i′,j_p}} ∪ {Y_{i′,σ_{i′,S_i}(j′)}}. Noting that |V_{j_p}| = t − 1, we see that each term under the summation can be lower bounded using (9.4), i.e.,

H(Y_{i′,j_p} | V_{j_p}, Y_{S_i}, Y_{i′,[1,ℓ]}) ≥ H(Y_{i′,j_{m−x}} | V_{m−x}, Y_{S_i}, Y_{i′,[1,ℓ]}) = H(Y_{i′,σ_{i′,S_i}(j′)} | V_{m−x}, Y_{S_i}, Y_{i′,[1,ℓ]}).    (9.6)

Therefore,

min(tα, (d − i)β) ≥ t · H(Y_{i′,σ_{i′,S_i}(j′)} | V_{m−x}, Y_{S_i}, Y_{i′,[1,ℓ]}).    (9.7)

The proof of the lemma now follows by combining (9.7) with (9.5).

9.2 Achievability of the FR File Size Bound for MRGRC (Theorem 4.3.1)

FR MRGRC Capacity. The file size B of a GRC with parameters {(n, k, d), (α, β), (m, ℓ, t)} under the FR regime is upper bounded by

B ≤ B*_F = ℓkα + a Σ_{i=0}^{k−1} min(tα, (d − i)β) + Σ_{i=0}^{k−1} min(bα, (d − i)β).    (9.8)

The bound is tight if there is a known upper bound on the number of repairs in the system.
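As a sanity check, bound (9.8) is easy to evaluate numerically. A minimal sketch, where `file_size_bound` is a hypothetical helper and a, b come from the split m − ℓ = at + b used in the proof below:

```python
def file_size_bound(n, k, d, alpha, beta, m, ell, t):
    """Evaluate the FR file size bound B*_F of (9.8).
    (n is unused in the formula; listed for completeness of the parameter set.)"""
    a, b = divmod(m - ell, t)       # m - ell = a*t + b, with 0 <= b < t
    B = ell * k * alpha
    B += a * sum(min(t * alpha, (d - i) * beta) for i in range(k))
    B += sum(min(b * alpha, (d - i) * beta) for i in range(k))
    return B

# Illustrative parameters; beta is the inter-cluster bandwidth per helper.
print(file_size_bound(n=10, k=4, d=8, alpha=1, beta=2, m=6, ell=2, t=2))  # → 24
```

When β is very large the min terms saturate and the bound degenerates to kmα, i.e., the full contents of k clusters, as expected.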

Proof. To prove achievability of the bound, we show that for any valid IFG, regardless of the specific sequence of failures and repairs, B*_F is indeed a lower bound on the minimum possible value of any S–Z cut. Consider a cut of the IFG, and let U and V be the two disjoint parts associated with the nodes S and Z, respectively. Without loss of generality, we only consider cuts such that V contains at least k external nodes corresponding to active clusters. Consider a topological sorting of the IFG nodes such that: 1) an edge exists between two nodes A and B only if A appears before B in the sorting, and 2) all in-, out-, external, and repair nodes (if τ > 0) of the cluster X_i(τ) appear together in the sorted order, ∀i, τ.

Consider the sequence E of all the external nodes, in both active and inactive clusters, in V in their sorted order. Let Y_1 denote the first node in E. Without loss of generality, let Y_1 = X^ext_1(τ_1) for some τ_1. In this case, consider the subsequence of E obtained by excluding all the external nodes associated with X_1 from E. Let Y_2 denote the first external node in this subsequence. We continue in this manner until we find the first k external nodes {Y_1, Y_2, . . . , Y_k} in E such that each of the k nodes corresponds to a distinct physical cluster. Without loss of generality, let us also assume that Y_i = X^ext_i(τ_i), 2 ≤ i ≤ k, for some τ_i. If τ_i = 0, then clearly cluster i contributes (at least) mα to the cut. Thus, let us assume that τ_i > 0, 1 ≤ i ≤ k.

Consider the m out-nodes X^out_{i,1}(τ_i), . . . , X^out_{i,m}(τ_i) that connect to X^ext_i(τ_i). For each j ∈ [1, m], either X^out_{i,j}(τ_i) is in U, or there exists a minimal τ_{i,j} ∈ [0, τ_i] such that X^out_{i,j}(τ_{i,j}) ∈ V. Consider those values of j ∈ [1, m] for which all of the following conditions hold:

X^out_{i,j}(τ_i), X^in_{i,j}(τ_{i,j}) ∈ V,  j ∈ R_i(τ_{i,j} − 1),  X^rep_i(τ_{i,j}) ∈ V.    (9.9)

Let there be m_i ∈ [0, m] such values, and, without loss of generality, let them be m − m_i + 1, . . . , m. Also without loss of generality, let the indices j be sorted in the order of increasing τ_{i,j}, i.e., j_1 < j_2 implies τ_{i,j_1} ≤ τ_{i,j_2}. For each j ∈ [m − m_i + 1, m], W_{i,j} ≜ {j′ : τ_{i,j′} = τ_{i,j}, j′ ∈ [m − m_i + 1, m]} is a contiguous set of at most t indices of the nodes with the same τ_{i,j}, which are repaired together from the same repair node. Let S_i = {distinct (min W_{i,j} − 1), ∀j ∈ [m − m_i + 1, m]} ⊆ [m − m_i, m − 1] be the set of indices of the nodes preceding all contiguous groups W_{i,j}; by min W_{i,j} we mean the minimum element contained in the set W_{i,j}. The set S_i is in one-to-one correspondence with the set of the repair nodes in (9.9) for j ∈ [m − m_i + 1, m]. Note that m − m_i is always an element of S_i.

In order to relay helper data to X^in_{i,j}(τ_{i,j}) for all j ∈ [m − m_i + 1, m], the number of these repair nodes should be at least ⌈m_i/t⌉, so |S_i| ≥ ⌈m_i/t⌉. Each of these repair nodes connects to d external nodes in other clusters. By the construction of E, at most i − 1 of those external nodes can be in V. Thus, each repair node contributes at least (d − i + 1)β of external helper data to the cut value. In addition, each repair node X^rep_i(τ_{i,j}) connects to ℓ local nodes. By (9.9) and by the construction of S_i and the sorting of τ_{i,j}, only the nodes with indices {1, 2, . . . , j′} out of these ℓ can be in V, where j′ = min W_{i,j} − 1 is the corresponding element of S_i. Thus, repair node X^rep_i(τ_{i,j}) contributes at least (ℓ − j′)^+ α of local helper data to the cut value.

The contribution to the cut value of those m − m_i indices j ∈ [1, m − m_i] which do not satisfy (9.9) is at least α each.

Based on the observations above, the overall cut value is lower bounded by

mincut(S − Z) ≥ Σ_{i=1}^{k} ( (m − m_i)α + ⌈m_i/t⌉ (d − i + 1)β + Σ_{j′∈S_i} (ℓ − j′)^+ α ).    (9.10)

Consider a particular value of i ∈ [1, k] and the corresponding summation term in (9.10). Let us assume that m − m_i ≥ ℓ, and m_i = a_i t + b_i ≤ m − ℓ, b_i ∈ [0, t − 1]. Then the third term in (9.10) is zero, and

(m − m_i)α + ⌈m_i/t⌉ (d − i + 1)β
= mα − (a_i t + b_i)α + (a_i + 1_{b_i>0})(d − i + 1)β
= mα − a_i (tα − (d − i + 1)β) − b_i α + 1_{b_i>0} (d − i + 1)β
≥(1) ℓα + (m − ℓ)α − a(tα − (d − i + 1)β)^+ − (bα − (d − i + 1)β)^+
= ℓα + a(tα − (tα − (d − i + 1)β)^+) + (bα − (bα − (d − i + 1)β)^+)
= ℓα + a · min(tα, (d − i + 1)β) + min(bα, (d − i + 1)β)
≜ c_i,

where (1) follows because a_i t + b_i = m_i ≤ m − ℓ = at + b, a_i ≤ a, and, if a_i = a, b_i ≤ b.

On the other hand, if m − m_i = ℓ − μ_i < ℓ, i.e., m_i > m − ℓ = at + b and μ_i = ℓ − (m − m_i) > 0, then we have

(m − m_i)α + ⌈m_i/t⌉ (d − i + 1)β + Σ_{j′∈S_i} (ℓ − j′)^+ α
≥ (ℓ − μ_i)α + (a + 1_{b>0})(d − i + 1)β + (ℓ − (m − m_i))α + Σ_{j′∈S_i, j′>m−m_i} (ℓ − j′)^+ α
= ℓα + (a + 1_{b>0})(d − i + 1)β + Σ_{j′∈S_i, j′>m−m_i} (ℓ − j′)^+ α
≥ c_i,

where c_i is the lower bound for the case m − m_i ≥ ℓ.

Since B*_F = Σ_i c_i, it is indeed a lower bound on the minimum cut value, and hence on the achievable file size. This proves the tightness of bound (9.8).

9.3 Matroid Lemma 6.3.1

Matroid Lemma. For q → ∞ and a = b, rank W_t|_S = n_S a, where n_S ≥ 1 is an integer, w.p.a.c. 1, ∀S ⊆ [n], S ≠ ∅, ∀t.

Proof. (By induction.) At t = 0, W_0 = I_{na}, all columns are independent, and rank W_0|_S = a|S| for all S.

Let us assume that the statement is true for some t ≥ 0. At the next iteration, a node i fails and is repaired from a helper set H. Matrix W_{t+1} differs from W_t only in the columns corresponding to node i, so the statement might be violated only for sets of nodes containing i. Consider such a set S ∋ i. By the column evolution equation (6.1),

rank W_{t+1}|_S = rank [ W_t|_{S−i}   Σ_{j∈H} W_t|_j D_j ]    (9.11)
= rank [ W_t|_{S−i}   Σ_{j∈H−cl(S−i)} W_t|_j D_j ],    (9.12)

where D_j = D^H_{t+1}(j) D^R_{t+1}(j) ∈ F_q^{a×a} is full rank w.p.a.c. 1, and cl(S − i) denotes the linear "closure" of S − i:

cl(S − i) = {j ∈ [n] : rank [W_t|_{S−i}  W_t|_j] = rank W_t|_{S−i}},    (9.13)

i.e., the set of all nodes whose packets all lie in the column span of the packets on the nodes in S − i. If H − cl(S − i) = ∅, the set S − i is non-empty, and rank W_{t+1}|_S = rank W_t|_{S−i}, which is a non-zero multiple of a by the inductive assumption. For non-empty H − cl(S − i), note that K ≜ colspan Σ_{j∈H−cl(S−i)} W_t|_j D_j is a random subspace of V ≜ ⊕_{j∈H−cl(S−i)} colspan W_t|_j ⊆ F_q^{na} of dimension at least

dim K ≥ min_{j∈H−cl(S−i)} rank W_t|_j D_j = a, w.p.a.c. 1,    (9.14)

since rank D_j = a w.p.a.c. 1 and rank W_t|_j = a by the inductive assumption. Let U = colspan W_t|_{S−i} ⊆ F_q^{na}. The subspace K can be thought of as the span of a random vectors v_1, . . . , v_a ∈ V. The probability that the intersection of K with U is trivial is given by

Pr[K ∩ U = 0] = Pr[v_1 ∉ V ∩ U] · Pr[v_2 ∉ (V ∩ U) ⊕ span(v_1) or v_2 ∈ span(v_1) | v_1 ∉ V ∩ U] ⋯
≥ Pr[v_1 ∉ V ∩ U] · Pr[v_2 ∉ (V ∩ U) ⊕ span(v_1) | v_1 ∉ V ∩ U] ⋯
= (1 − q^{dim(V∩U) − dim V})(1 − q^{dim(V∩U)+1 − dim V}) ⋯ (1 − q^{dim(V∩U)+a−1 − dim V}).

Since H − cl(S − i) ≠ ∅, there is j ∈ H with colspan W_t|_j ∩ colspan W_t|_{S−i} = 0, and dim V ≥ dim(V ∩ U) + a. Therefore,

Pr[K ∩ U = 0] ≥ (1 − q^{−a})(1 − q^{−a+1}) ⋯ (1 − q^{−1}) → 1 as q → ∞.

As a result,

rank W_{t+1}|_S = rank [ W_t|_{S−i}   Σ_{j∈H−cl(S−i)} W_t|_j D_j ] = rank W_t|_{S−i} + a,

and the inductive statement is true for t + 1.

9.4 Matrix Addition Lemma 6.4.2

Matrix Addition Lemma. Let A ∈ F_q^{m×n}, m ≤ n, be a full-rank matrix with rows a_1, . . . , a_m. Let u, v ∈ F_q^n be arbitrary vectors, and let d′ ∈ [0, m − 1] be an integer. Let A′ ∈ F_q^{m×n} be an additively transformed matrix with rows a′_1, . . . , a′_m, such that

a′_i = a_i + α_i u,  if i ∈ [1, d′];
a′_i = a_i,  if i ∈ [d′ + 1, m − 1];
a′_i = β_m a_m + v,  if i = m,    (9.15)

where α_i, β_m are random scalars sampled uniformly i.i.d. from F_q. Then lim_{q→∞} Pr[rank A′ = m] = 1, i.e., A′ is full rank w.p.a.c. 1 in the limit of infinite field size.

Proof. Let A|_{[m−1]}, A′|_{[m−1]} be the submatrices composed of the first m − 1 rows of A, A′, respectively. Let α = (α_1, . . . , α_{d′}) ∈ F_q^{d′}, and let S = {α : A′|_{[m−1]} is not full rank}. Since A′|_{[m−1]} is a linear function of α, S is an affine subspace of F_q^{d′}. Since before the additive transformation A|_{[m−1]} was full rank, the zero vector 0 ∉ S, and, therefore, dim S < d′. Thus, Pr[A′|_{[m−1]} full rank] = Pr[α ∉ S] = 1 − |S|/|F_q^{d′}| = 1 − q^{dim S − d′} → 1 as q → ∞.

Let S′ = {α : a_m ∈ rowspan A′|_{[m−1]}}. S′ is also an affine subspace of F_q^{d′}. Since before the transformation A was full rank, a_m ∉ rowspan A|_{[m−1]}, so the zero vector 0 ∉ S′, and dim S′ < d′. Thus, Pr[A′|_{[m−1]} full rank and a_m ∉ rowspan A′|_{[m−1]}] = Pr[α ∉ S ∪ S′] ≥ 1 − (|S| + |S′|)/q^{d′} → 1 as q → ∞.

Conditioned on A′|_{[m−1]} being full rank and a_m ∉ rowspan A′|_{[m−1]}, the row a′_m = β_m a_m + v can lie in rowspan A′|_{[m−1]} for at most one value of β_m (otherwise, the difference of the two corresponding values of a′_m would be a multiple of a_m lying in rowspan A′|_{[m−1]}, which contradicts a_m ∉ rowspan A′|_{[m−1]}). Therefore, under this condition, Pr[A′ full rank] = Pr[a′_m ∉ rowspan A′|_{[m−1]}] ≥ 1 − 1/q → 1. Since the condition holds w.p. → 1, the unconditional Pr[A′ full rank] → 1 as q → ∞.
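The lemma admits a quick numerical spot-check over a large prime field, with the prime p standing in for large q. A minimal sketch (`rank_mod_p` is a hypothetical helper, and the dimensions are illustrative):

```python
import random

P = (1 << 31) - 1  # Mersenne prime; plays the role of a large field size q

def rank_mod_p(rows, p=P):
    """Row rank over F_p via Gaussian elimination with modular inverses."""
    rows = [list(r) for r in rows]
    rank = 0
    for c in range(len(rows[0])):
        piv = next((r for r in range(rank, len(rows)) if rows[r][c] % p), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][c], -1, p)            # modular inverse of the pivot
        rows[rank] = [x * inv % p for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][c] % p:
                f = rows[r][c]
                rows[r] = [(x - f * y) % p for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank

random.seed(0)
m, n, d_prime = 4, 6, 2
A = [[random.randrange(P) for _ in range(n)] for _ in range(m)]  # full rank w.h.p.
u = [random.randrange(P) for _ in range(n)]
v = [random.randrange(P) for _ in range(n)]

# Transform rows per (9.15): rows 1..d' gain alpha_i*u; the last row becomes beta_m*a_m + v.
Ap = []
for i, row in enumerate(A):
    if i < d_prime:
        al = random.randrange(P)
        Ap.append([(x + al * y) % P for x, y in zip(row, u)])
    elif i < m - 1:
        Ap.append(row)
    else:
        bm = random.randrange(P)
        Ap.append([(bm * x + y) % P for x, y in zip(row, v)])

print(rank_mod_p(A), rank_mod_p(Ap))  # both equal m with probability 1 - O(1/P)
```

Repeating over many random draws of u, v, and the scalars gives an empirical failure rate on the order of 1/P, in line with the 1 − 1/q term in the proof.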
