Coding Approaches for Maintaining Data in Unreliable
Network Systems
by
Vitaly Abdrashitov
B.S., Moscow Institute of Physics and Technology (2009)
M.S., Moscow Institute of Physics and Technology (2011)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2018
© Massachusetts Institute of Technology 2018. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
March 26, 2018
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Muriel Medard
Cecil H. Green Professor in Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
Coding Approaches for Maintaining Data in Unreliable Network Systems
by
Vitaly Abdrashitov
Submitted to the Department of Electrical Engineering and Computer Science
on March 26, 2018, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
Abstract
In recent years, the explosive growth of the data storage demand has made the storage cost a critically important factor in the design of distributed storage systems (DSS). At the same time, optimizing the storage cost is constrained by reliability requirements. The goal of this thesis is to further study the fundamental limits of maintaining data fault tolerance in a DSS spread across a communication network. In particular, we focus our attention on performing efficient storage node repair in a redundant erasure-coded storage with a low storage overhead. We consider two operating scenarios of the DSS.
First, we consider a clustered scenario, where individual nodes are grouped into clusters representing data centers, storage clouds of different service providers, racks, etc. The network bandwidth within a cluster is assumed to be cheap relative to the bandwidth between nodes in different clusters. We extend the regenerating codes framework of Dimakis et al. [1] to clustered topologies and introduce generalized regenerating codes (GRC), which perform node repair using helper data both from the local cluster and from other clusters. We show the optimal trade-off between the storage overhead and the inter-cluster repair bandwidth, along with optimal code constructions. In addition, we find the minimal amount of intra-cluster repair bandwidth required for achieving a given point on the trade-off.
Second, we consider a scenario where the underlying network features a highly varying topology. Such behavior is characteristic of peer-to-peer, content delivery, and ad-hoc mobile networks. Because of the limited and time-varying connectivity, the sources for node repair are scarce. We consider a stochastic model of failures in the storage, which also describes the random and opportunistic nature of selecting the sources for node repair. We show that, even though the repair opportunities are scarce, with a practically high probability the data can be maintained through a large number of failures and repairs, and for time periods far exceeding a typical lifespan of the data. The thesis also analyzes a random linear network coded (RLNC) approach to operating in such variable networks and demonstrates its high achievable rates, outperforming those of regenerating codes, and its robustness across a wide range of model and implementation assumptions and parameters such as code rate, field size, repair bandwidth, node distributions, etc.
Thesis Supervisor: Muriel Medard
Title: Cecil H. Green Professor in Electrical Engineering and Computer Science
Acknowledgments
Pursuing my Ph.D. degree at MIT has been a long and challenging journey. At the same time, it has allowed me to meet many amazing and extraordinary people, students and faculty, inspiring researchers, innovative thinkers, and dedicated teachers. I am deeply thankful to them for their support and for sharing their expertise.

First and foremost, I would like to thank Muriel Medard, who has been to me not only an extremely knowledgeable and insightful research supervisor, but also a patient mentor, a very supportive adviser in career and life, and a wonderful person. Without her, I definitely would not have become what I am today.

Besides Muriel, I would like to thank Prakash Narayana Moorthy, with whom I enjoyed a close collaboration, and who is a passionate researcher and a great friend. The thesis would not be possible without his extensive expertise. I am also honored to have David Karger and Viveck Cadambe as my committee members. I am really thankful for their time and commitment, and for their valuable guidance and advice.

I am extremely lucky to be a member of my research group of Network Coding and Reliable Communications. I would like to thank the people I met in the group for infinite opportunities to learn and share ideas. In particular, I would like to thank Ali Makhdoumi, Salman Salamatian, Weifei Zeng, Flavio du Pin Calmon, Ahmad Beirami, and Arman Rezaee for being supportive and true friends. I would also like to thank Surat Teerapittayanon, Georgios Angelopoulos, Kerim Fouli, Soheil Feizi, Jason Cloud, and many other members of the group, with whom I have had the pleasure to work and learn together. Very special thanks go to Molly Kruko, who makes sure that the group activities always run smoothly. I also wish to thank a lot of other great people who I met at MIT, many of whom became my dearest friends.

Finally, and most importantly, I thank my parents and my family for their unconditional love, support, and encouragement during all these years in graduate school and in the U.S., and for being my backbone and foundation.
The work in this thesis was supported in part by the Air Force Office of Scientific Research (AFOSR) under awards No. FA9550-14-1-043 and FA9550-13-1-0023, in part by the National Science Foundation (NSF) under Grants No. CCF-1527270 and CCF-1409228, and in part by the Defense Advanced Research Projects Agency (DARPA) under award No. HR0011-17-C-0050.
Dedicated to my parents
Contents
Contents 7
List of Figures 9
List of Tables 13
1 Introduction 15
1.1 Distributed Storage Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Small Repair Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Small Repair Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Preliminaries 25
2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Linear Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Linear Network Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Regenerating Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Information Flow Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Matroids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
I Regenerating Codes for Clustered Storage Systems 34
3 Generalized Regenerating Codes (GRCs) and File Size Bounds 35
3.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.1 IFG Model for GRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 File Size Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Proof of the File Size Bound . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 Storage vs Inter-Cluster Bandwidth Trade-off . . . . . . . . . . . . . . . . 45
3.4 Code Constructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.1 Exact Repair Code Construction . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 A Functional Repair Code for Arbitrary Number of Failures . . . . . . . . 50
4 GRC for Repair of Multiple Failures 55
4.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Exact Repair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.1 ER Code Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.2 File Size Bound Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Functional Repair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.1 Information Flow Graph Model . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 File Size Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Implications of the Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5 Intra-Cluster Bandwidth of GRCs 65
5.1 Local Helper Bandwidth in the Host Cluster . . . . . . . . . . . . . . . . . . . . . 66
5.2 External Helper Cluster Local Bandwidth . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Optimality and Implications of the Intra-cluster Bandwidth Bounds . . . . . . . 75
II Information Survival in Volatile Networks 78
6 Network Coding for Time-Varying Networks 79
6.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Stochastic Rank Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.1 Matroid Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4 Bounding Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.5 Impact of Repair Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.6 Expected Lifetime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.7 Error Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7 Implementation Aspects and Numerical Results 99
7.1 RLNC Recoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2 Small Field Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.3 Failed and Helper Nodes Distributions . . . . . . . . . . . . . . . . . . . . . . . . 101
7.4 Variable Number of Helpers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.5 Fault Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.6 Effects of Several Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8 Conclusions 107
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Bibliography 113
9 Appendices 119
9.1 MRGRC Chain Order Lemma 4.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.2 Achievability of the FR File Size Bound for MRGRC (Theorem 4.3.1) . . . . . . 121
9.3 Matroid Lemma 6.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.4 Matrix Addition Lemma 6.4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
List of Figures
1-1 A system with n = 4 storage nodes with 2 packets per node, and a regenerating code, which can repair the exact content of any node by downloading 1 packet from each of the 3 other nodes. The plus sign denotes the bit xor operation. . . . . . . 17
1-2 Comparing the storage overhead and the expensive inter-cluster repair bandwidth of three coding options for a clustered DSS. The three options are (i) extra parity check nodes in each cluster, (ii) classical regenerating codes, and (iii) generalized regenerating codes. . . . . . . 20
1-3 An example of an LRC with 6 information nodes and 4 parity nodes, 2 of which are local parities. Every information node (e.g. c1) can be recovered by downloading data from a specific set of d = 3 other nodes in the same local group. The global parities p1, p2 are linear combinations of all 6 information symbols and allow data regeneration when more than 1 node in each local group is lost. . . . . . . 21
1-4 The helper selection vs rate trade-off (the higher the better). . . . . . . 22
1-5 Berlekamp's bat phenomenon. . . . . . . 23
2-1 Storage-bandwidth trade-off for RCs with (n = 7, k = 5, d = 6, B = 10). The precise trade-off for exact repair remains unknown. . . . . . . 30
2-2 An example of an information-flow graph for n = 4, k = 2, d = 2 and 3 node failures/repairs. Also shown is a sample cut (U, V) between S and Z of capacity α + β. . . . . . . 31
3-1 An example of the IFG model representing the notion of generalized regenerating codes, when intra-cluster bandwidth is ignored. In this figure, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1). . . . . . . 38
3-2 An example of the information flow graph used in the cut-set based upper bound for the file size. In this figure, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1). We also indicate a possible choice of S−Z cut that results in the desired upper bound. . . . . . . 42
3-3 An example of how any S−Z cut in the IFG affects nodes in Fi. In the example, we assume m = 4. With respect to the description in the text, ai = 2. Further, the node Xi,4(ti,4) is a replacement node in the IFG. . . . . . . 43
3-4 Trade-off between storage overhead nmα/B and inter-cluster repair bandwidth overhead dβ/α, for an (n = 5, k = 4, d = 4) clustered storage system, with ℓ = m − 1. . . . . . . 45
3-5 Illustration of the exact repair code construction. We first stack ℓ MDS codes and (m − ℓ) classical regenerating codes, and then transform each row via the invertible matrix A. The first ℓ rows of the matrix A generate an (m, ℓ) MDS code. . . . . . . 48
3-6 An illustration of the node repair process for the exact repair generalized regenerating code obtained in Construction 3.4.1. . . . . . . 50
4-1 An illustration of the information flow graph used in the cut-set based upper bound for the file size under functional repair. We assume (n = 3, k = 2, d = 2)(m = 3, ℓ = 0, t = 2). Only a subset of nodes is named to avoid clutter. Two batches, each of t = 2 nodes, fail and get repaired first in cluster 1 and then in cluster 3. We also indicate a possible choice of the S−Z cut that results in the desired upper bound. We fail nodes in cluster 3 instead of cluster 2 only to make the figure compact. . . . . . . 59
4-2 Trade-offs for an (n = 5, k = 4, d = 4)(m = 3, ℓ = 0, t = 2) system, plotted between the MSR and the MBR points. . . . . . . 63
4-3 Impact of the number of local helper nodes, ℓ, on the file size for an (n = 7, k = 4, d = 5, m = 17, t = 5) clustered storage system at the MBR point (α = 1, β = 1). Local help does not provide any advantage unless ℓ > 2. . . . . . . 63
5-1 An illustration of the evolution of the k-th cluster of the information flow graph used in the cut-set based lower bound for γ in Theorem 5.1.1. In this figure, we assume that m = 4, ℓ = 2. Nodes 3, 4, 1 fail in this respective order. For the repair of node 3, nodes 1 and 2 act as the local helper nodes. For the repair of the remaining two nodes, nodes 2 and 3 act as the local helper nodes. Also indicated is our choice of the S−Z cut used in the bound derivation. . . . . . . 67
5-2 An illustration of the IFG used in the cut-set based lower bound for γ′ in Theorem 5.2.1. In this example, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1)(ℓ′ = 2, γ = α). The second node fails in clusters 1 and 2 in the respective order. Also indicated is our choice of the S−Z cut used in the bound derivation. . . . . . . 71
5-3 An illustration of the IFG used in the cut-set based lower bound for ℓ′ in Theorem 5.2.3. In this example, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 0)(ℓ′ = 1, γ = γ′ = α). The second node fails in clusters 1 and 2 in the respective order. Also indicated is our choice of the S−Z cut used in the bound derivation. . . . . . . 74
5-4 Simulation results for a system with parameters (n = 4, k = 3, d = 3), (ℓ = 1, m = 2), (α, β = 4), showing the probability of successful data collection against the number of node repairs performed, for an RLNC-based GRC. The legends indicate parameters (γ, γ′, ℓ′) for each test. For all operating points ℓ∗ = m = 2. . . . . . . 76
5-5 Illustrating the impact of ℓ on the various performance metrics. We operate at the MBR point with parameters {(n = 12, k = 8, d = n − 1)(α = dβ, β = 2)}. We see that while ℓ = m − 1 is ideal in terms of optimizing storage and inter-cluster BW, it imposes the maximum burden on intra-cluster BW. . . . . . . 77
6-1 An example of a system evolution for 3 iterations of failure and repair, n = 6, d = 2, a = b = 1. At t = 0 node i contains packet si. For the 4 considered system states the evolution matrix W_t and its matroid representation M(W_t) are also shown. The most recently changed column of W_t is bold-faced. . . . . . . 81
6-2 An example of an information-flow graph for n = 4, d = 2 and t = 4 node failures/repairs. Also shown is a sample cut (U, V) of capacity a + 2b ≥ min{a, 2b} + min{a, b} + b. . . . . . . 93
6-3 Simulated expected lifetime for n = 20, a = b = 1. . . . . . . 96
6-4 Probability of decoding error pe against the coding rate for fixed n = 20, d = 4, a = b = 1. The dots indicate E[rank W_t]/n. . . . . . . 96
6-5 Expected rank of the evolution matrix r_t = E[rank W_t], with the upper and lower bounds, for n = 40, d = 4. . . . . . . 97
6-6 Expected rank r_t = E[rank W_t] for n = 40 and various values of d. . . . . . . 97
7-1 Performance of various recoding regimes: No recoding (N), Sparse recoding (S), and Full recoding (FR) for a system with parameters n = 20, t = 2000, d = 4, a = 3, b = 2. The legend indicates the recoding regimes (helper, replacement nodes). . . . . . . 100
7-2 Impact of the effective field size q^a on the average rank of W_t for a system with parameters n = 20, d = 4, a = b, t = 1000. The actual field size used is q. . . . . . . 100
7-3 Probability mass functions of the test node distributions for a storage with n = 20 nodes. Given a fixed parameter x, the probability of the i-th atom is p(i) ∝ i^x. Larger values of x lead to stronger concentration of probability at the nodes with high indices. . . . . . . 102
7-4 Impact of the failed and helper node distributions PF, PH on the average rank for n = 20, t = 1000, d = 4. The distributions have p(i) ∝ i^x for x ∈ {xF, xH}. xF < 0 corresponds to p(i) ∝ (n + 1 − i)^|xF|. . . . . . . 102
7-5 Impact of the standard deviation of the number of helper nodes d on the average rank for n = 20, a = b = 1, t = 1000. Beta-binomial distributions with different supports are used. . . . . . . 103
7-6 Decoding error probability pdce = Pr[rank Mt|S < k | rank Mt = k] for a randomly chosen column set S ⊂ [n], |S| = ndc. n = 20, d = 4, a = b = 1, k = nR. . . . . . . 103
7-7 The maximal rate Rε for error probability under ε = 5 · 10^−4, t = 2000, and n = 20. First, tests are performed for the base case a = b = 1, q = 65536; then, various adverse parameter changes are introduced incrementally. The maximal theoretical RC code rate for n = 20, a = 6, b = 4 is provided for comparison. . . . . . . 105
8-1 Mean rank per node for t scaled proportionally to n with d = 4, a = b = 1. . . . . . . 110
List of Tables
3.1 Notation for the clustered storage system model. . . . . . . . . . . . . . . . . . . 36
6.1 Notation for the time-varying network storage system model. . . . . . . . . . . . 82
Chapter 1
Introduction
1.1 Distributed Storage Systems
In recent years, the demand for cheap and reliable data storage has been driven high by numerous
entertainment, industrial, and scientific applications, which together generate zettabytes
(1 ZB ≈ 10^12 GB) of data yearly. In 2018 the size of the global data sphere is measured in tens
of zettabytes, and this number doubles every 3 years [2]. The demand for storage capacity
grows faster than storage media production and is expected to outstrip production by 6
ZB in 2020 [3]. As a result of these trends, the storage cost becomes an increasingly important
factor in the design of large storage systems.
Large-scale storage systems follow a distributed approach: the data is spread across several
storage nodes, possibly in different locations. Using multiple less capable and cheaper nodes
instead of a single powerful node allows cost-efficient scalability. Thus, modern distributed
storage systems (DSS) are composed of a large number of individually unreliable nodes. Data
loss is unacceptable, and DSS must deal with failures in the system, i.e. have a sufficient fault
tolerance.
Since fault tolerance is achieved through redundancy in storage, and the redundancy
inevitably increases the storage size per byte of user data, i.e. the storage overhead, it is critically
important to find the optimal balance between the required fault tolerance and the storage
cost. A simple way to introduce the redundancy is replication, in which multiple copies of
the same data segments are stored on different physical nodes. Simple to set up and manage,
3-way replication (keeping 3 copies of data) has been a widely adopted approach. However,
its storage overhead of 3 is too high and makes it prohibitively expensive for large amounts
of data. An alternative approach to introducing redundancy is to encode the data using erasure
codes. Storage node failures can be treated as erasures, and the source data can be decoded
from an incomplete set of nodes. Many major cloud service providers, such as Amazon [4] and Microsoft
Azure [5], employ coding in their storage systems.
An important aspect of a DSS is maintaining redundancy. Upon node failures, the data
segments the failed nodes store become unavailable, and the overall redundancy and fault tolerance go down.
New, replacement nodes need to be introduced into the system to keep the failed portion of
the data. This process is called node repair and involves downloading data (helper data) from
a set of surviving nodes (helper nodes, or helpers) to generate the data to store on the new
node. The efficiency of node repair is mainly associated with the following metrics:
1. repair bandwidth — the number of symbols downloaded from the helper nodes to the
replacement node [1];
2. repair locality — the number of helper nodes contacted [6];
3. repair disk I/O — the number of stored symbols read by the helper nodes from their
storage media to generate the helper data [7].
These metrics directly affect the repair latency, cost, and extra load on the helper nodes;
ideally, all the metrics should be small. However, the metrics can be simultaneously minimized
only in a replication-based storage, not in an erasure-coded DSS.
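As a back-of-the-envelope illustration (the function and parameter names here are hypothetical, not from the thesis), the three metrics above can be compared for 3-way replication and for naive repair of an (n, k) MDS code, where a failed node storing α symbols is rebuilt by decoding the whole file from any k helpers:

```python
# Hypothetical sketch: repair metrics for 3-way replication vs. naive
# repair of an (n, k) MDS code. A node stores "alpha" symbols.

def replication_repair_metrics(alpha):
    # One surviving copy suffices to restore the lost node:
    # contact 1 helper, which reads and sends its alpha symbols.
    return {"bandwidth": alpha, "locality": 1, "disk_io": alpha}

def mds_naive_repair_metrics(alpha, k):
    # Naive repair decodes the full file from any k nodes:
    # each of k helpers reads and sends all alpha symbols it stores.
    return {"bandwidth": k * alpha, "locality": k, "disk_io": k * alpha}

print(replication_repair_metrics(alpha=2))
print(mds_naive_repair_metrics(alpha=2, k=2))
```

Replication minimizes all three metrics at once, at the cost of a storage overhead of 3; the erasure-coded repair trades overhead for higher bandwidth, locality, and disk I/O, which is exactly the tension the following sections address.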
This thesis focuses on further study of the fundamental limits of maintaining redundancy
and node repair in DSSs with a low storage overhead. We consider two operating scenarios of
the DSS. The first mainly considers DSSs with a low repair bandwidth and disk I/O, while
the second studies a DSS with a low repair locality.
1.2 Small Repair Bandwidth
Repair bandwidth minimization has been largely studied in the context of regenerating codes
(RCs). First introduced by Dimakis et al. [1], RCs require that repair of any node can be
[Figure 1-1 shows four nodes: node 1 stores a1, a2; node 2 stores b1, b2; node 3 stores a1+b1, a2+b2; node 4 stores a1+b2, a2+b1. The repair of node 4 downloads a1, b2, and a1+b1+a2+b2.]

Figure 1-1: A system with n = 4 storage nodes with 2 packets per node, and a regenerating code, which can repair the exact content of any node by downloading 1 packet from each of the 3 other nodes. The plus sign denotes the bit xor operation.
performed with an arbitrary set of d helper nodes among the surviving nodes. For a fixed repair
locality d, RCs minimize the repair bandwidth and achieve the optimal trade-off between the
bandwidth and the storage overhead. Generally, to achieve the minimal repair bandwidth, a
helper node needs to send a function of its stored data, rather than a part of it. Consider an
example of a DSS with n = 4 storage nodes shown in Figure 1-1. The source file is split into
4 packets, represented by vectors of bits of a fixed dimension. They are encoded into 8 coded
packets and stored on 4 nodes, 2 packets per node, so that the source file can be decoded from
the content of any 2 nodes. To repair a failed node, e.g. node 4, the replacement node needs
to download at least 3 helper packets from the other 3 nodes. It is not sufficient, however, for
the helpers to directly send out the packets they store; instead helper node 3 sends out a linear
combination a1 + b1 + a2 + b2 of packets a1 + b1, a2 + b2 it holds. The addition is performed
element-wise in the binary field GF (2). This demonstrates that network coding is necessary to
achieve the minimal repair bandwidth.
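The repair in Figure 1-1 can be checked directly. The following sketch models packets as integers and GF(2) addition as bitwise XOR; the concrete bit patterns are arbitrary examples, not values from the thesis:

```python
# Packets are modeled as ints; "+" over GF(2) is bitwise XOR (^).
# Node contents follow Figure 1-1; the bit patterns are arbitrary.
a1, a2, b1, b2 = 0b1010, 0b0110, 0b1100, 0b0011

node3 = (a1 ^ b1, a2 ^ b2)   # parity node: a1+b1, a2+b2
node4 = (a1 ^ b2, a2 ^ b1)   # parity node: a1+b2, a2+b1 (assume it fails)

# One helper packet per surviving node -- 3 packets, the minimum:
h1 = a1                      # node 1 forwards a stored packet
h2 = b2                      # node 2 forwards a stored packet
h3 = node3[0] ^ node3[1]     # node 3 must MIX its packets: a1+b1+a2+b2

# The replacement node reconstructs node 4 exactly:
repaired = (h1 ^ h2, h1 ^ h2 ^ h3)   # (a1+b2, a2+b1)
assert repaired == node4
```

Had node 3 forwarded one of its stored packets unmixed (say a1+b1), the second packet of node 4, a2+b1, could not be formed from three helper packets, since a2 would remain unknown; the mixing at node 3 is what makes the three-packet repair possible.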
Network coding refers to the operation of coding at a network node over its inputs to produce
its outputs. Whereas in the traditional paradigm of network routing a node only stores
the incoming packets to forward them to the next nodes without changing them (except possibly the
packet metadata), with network coding the node can emit packets which are functions, "mixtures",
of the incoming packets. In the context of DSS, the incoming packets correspond to those
a node stores, and the outgoing packets to those it sends out when it serves as a helper node.
The seminal work by Ahlswede et al. [8] shows that, unlike routing, network coding achieves the
capacity of multicast wireline networks, where a data stream needs to be delivered to multiple
destinations. The capacity is equal to the minimal value of a cut between the source node and
a destination (a sink node). References [9, 10] showed that for achieving the multicast capacity
it is sufficient to use linear network coding, where the outgoing packets are linear combinations
of the incoming packets. The works of Ho et al. [11] and Sanders et al. [12] showed that the linear
coding coefficients at each node can be picked uniformly at random, and that such random linear
network codes (RLNC) over a large finite field asymptotically achieve the multicast capacity. Linear
network codes can also be constructed deterministically with a polynomial-time algorithm
[13].
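A minimal sketch of the RLNC idea follows, over GF(2) for simplicity (the capacity results above concern larger fields): each coded packet carries a uniformly random coefficient vector together with the corresponding mixture of the k source packets, and a collector decodes by Gaussian elimination once the coefficient vectors it has gathered reach rank k. All names and parameters here are illustrative.

```python
import random

random.seed(1)
k = 4
source = [random.getrandbits(16) for _ in range(k)]  # k source packets

def rlnc_packet():
    # A random linear combination over GF(2): XOR of a random subset
    # of the sources, sent along with its coefficient vector.
    coeffs = [random.randrange(2) for _ in range(k)]
    payload = 0
    for c, s in zip(coeffs, source):
        if c:
            payload ^= s
    return coeffs, payload

def decode(packets):
    # Gauss-Jordan elimination over GF(2); returns the sources, or None
    # if the collected coefficient vectors do not yet have full rank.
    rows = [(list(c), p) for c, p in packets]
    for col in range(k):
        pivot = next((r for r in range(col, len(rows)) if rows[r][0][col]), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(len(rows)):
            if r != col and rows[r][0][col]:
                merged = [x ^ y for x, y in zip(rows[r][0], rows[col][0])]
                rows[r] = (merged, rows[r][1] ^ rows[col][1])
    return [rows[i][1] for i in range(k)]

received, decoded = [], None
while decoded is None:            # collect packets until the rank reaches k
    received.append(rlnc_packet())
    decoded = decode(received)
assert decoded == source
```

Over GF(2) a random packet fails to be innovative with constant probability, so slightly more than k packets are typically needed; over a large field GF(q) this overhead vanishes, which is the asymptotic statement of the capacity result.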
Dimakis et al. [1] considered two alternative repair regimes of RCs: exact repair (ER) and functional
repair (FR). In the ER regime, after each node repair the replacement node content must
be exactly the same as the content of the failed node. With FR this does not need to hold,
as long as the source file can still be decoded. The advantage of ER codes is that they can
have a systematic form, with certain nodes holding the uncoded source packets (the first two
nodes in Figure 1-1), which allows very fast data reads. FR codes generally achieve a strictly
better bandwidth-storage trade-off and allow simpler and more efficient code constructions. A
detailed description of RCs and network coding is presented in Chapter 2.
Subsequent works on regenerating codes studied code constructions in specific capacity-achieving
scenarios [14, 15], repair of multiple node failures [16–19], and security aspects [20, 21].
These works consider a flat network topology, where each node has a direct logical connection
to every other node, and all logical links incur the same bandwidth cost per
bit of communication.
However, practical large-scale DSSs feature a hierarchical structure; for instance, individual
nodes can be grouped into a server rack and connected to the same network switch, while
a rack is part of an aisle, and the latter is a composition unit of a data center. Finally, data
centers can be grouped into a geographically distributed storage system employing erasure
coding across data centers [22], or into a user-defined cloud-of-clouds along with other storage
service providers [23–25]. In a large DSS, the data is protected against failure or unavailability
of individual system parts, and repairing a node can require repair traffic across different levels
of the system hierarchy. While the communication between nodes in the same rack is low-delay
and spare bandwidth is usually available, the inter-rack bandwidth is shared by many
nodes and applications and is a more limited resource, and the inter-data-center bandwidth is
even more scarce and expensive. For instance, around 180 TB of inter-rack repair bandwidth is
used daily in the Facebook warehouse, which limits the resources available to other applications [26].
Complete elimination of the expensive repair bandwidth components (inter-cluster, as we shall call them),
e.g. by using extra parity check nodes in each rack, is possible but results in
an excessively large storage overhead. A better approach is needed to characterize the optimal
system performance in terms of the storage overhead vs repair bandwidth trade-off. The extra
parity check solution is shown by the corresponding point on the trade-off in Figure 1-2. It is
also possible to reduce the expensive bandwidth by employing several RCs designed for flat topologies
(classical RCs), so that each RC spans one node in each rack. Any point on the straight
line between the two solutions can be achieved by space-sharing, i.e. applying each of the two
codes to a fraction of the total stored data.
We introduce generalized regenerating codes (GRCs) for clustered topologies, which carefully
combine repair bandwidth from different hierarchy levels (cheap intra-cluster and expensive
inter-cluster bandwidth) to achieve the optimal trade-off. The trade-off is strictly better
than what is achieved with space-sharing between the existing coding schemes, as shown in
Figure 1-2. The analysis of GRCs is presented in Chapter 3. Besides the characterization of the
optimal trade-off, we also provide explicit code constructions for the ER and FR regimes.
The FR construction is optimal in terms of the trade-off, while the ER construction achieves
the most important operating points of the trade-off. In Chapter 4 we extend the results to the
scenario with multiple node failures per cluster and their simultaneous repair.
In Chapter 5 we study the local properties of GRCs. While the goal of GRCs is to optimize
the main trade-off between the storage and the expensive inter-cluster bandwidth, it is also
desirable to minimize the cheap intra-cluster bandwidth, to improve the latency and the disk I/O
without affecting the trade-off. The repair process also gives rise to intra-cluster bandwidth
in the clusters providing the helper data (helper clusters). The chapter answers the
following question: what is the minimal intra-cluster bandwidth, in the cluster with the failed
node and in the helper clusters, required to operate at a specific point of the trade-off?
[Figure 1-2 appears here: a plot of inter-cluster bandwidth vs. storage overhead, comparing local parity check nodes, classical MBR, classical MSR, classical RC, space-sharing, and GRC.]
Figure 1-2: Comparing the storage overhead and the expensive inter-cluster repair bandwidth of three coding options for a clustered DSS. The three options are (i) extra parity check nodes in each cluster, (ii) classical regenerating codes, and (iii) generalized regenerating codes.
1.3 Small Repair Locality
Optimizing the repair locality is typically studied in the context of locally repairable codes
(LRCs). LRC-based solutions have been used in very large systems, like the storage for Microsoft
Azure [5]. References [6, 27] first introduced the notion of LRC and demonstrated the trade-off
between the locality, the minimum code distance, and the code rate (the inverse of the storage
overhead) for linear or non-linear codes. Figure 1-3 shows an example of an LRC code which
stores a file of 6 packets on 10 nodes. The 6 source packets are placed in the uncoded form on 6
information nodes, which belong to 2 local groups. Each local group contains d+ 1 nodes and
includes a local parity node, which allows exact regeneration of any information node from d = 3
helper nodes. Unlike the setting of RC, LRCs are mainly studied in the exact repair regime
with a predetermined and fixed set of helper nodes (repair set) for each potential node failure.
More recent studies [28–30] also considered LRCs in the functional repair regime and allowed
the replacement node to choose the best repair set. The coding rate achievable by LRCs, where
each symbol can be recovered by downloading data from d helper nodes, is bounded by
R_LRC ≤ d/(d + 1). (1.1)
Fixed repair sets constitute a problem for LRCs from the availability perspective. Fast
[Figure 1-3 appears here: two local groups; group 1 holds a1, b1, c1 and local parity d1 = a1 + b1 + c1, group 2 holds a2, b2, c2 and local parity d2 = a2 + b2 + c2, plus global parities p1, p2.]
Figure 1-3: An example of LRC with 6 information nodes and 4 parity nodes, 2 of which are local parities. Every information node (e.g. c1) can be recovered by downloading data from a specific set of d = 3 other nodes in the same local group. The global parities p1, p2 are linear combinations of all 6 information symbols and allow data regeneration when more than 1 node in each local group is lost.
access to the specific repair set from the replacement node cannot be guaranteed because of (temporary) unavailability of the helper nodes, which may be caused by node maintenance, unresponsiveness due to high load, bandwidth over-subscription, network congestion, etc. This issue is of special importance in networks with highly unstable or time-varying topology, such
as peer-to-peer (P2P) and peer-aided edge/fog caching networks, mobile ad hoc, and sensor
networks. The problem does not arise for RCs and MDS codes, where an arbitrary set of d
nodes can serve as helpers, but those codes have a significantly lower coding rate R_RC ≤ d/n for a storage with n nodes. The works [31, 32] and others considered LRCs with several alternative repair
sets. Specifically, [31] considered local groups of d+δ−1 nodes, such that a node within a group
can be recovered from any subset of d other nodes in a group. Each local group represents an
MDS code with the minimum distance δ ≥ 2. Although this construction increases the maximal
coding rate to d/(d + δ − 1), the repair is still bound to a local set of a relatively small number of nodes.
A different model is considered by reference [32]: the nodes have multiple, τ ≥ 1, disjoint repair sets of size d, and the resulting coding rate is upper-bounded by ∏_{i=1}^{τ} (1 + 1/(i d))^{−1}. While this gives a coding rate improvement as compared to RCs, τ cannot be large, and the repair set selection remains limited, with many nodes excluded from consideration as helpers.
Fundamentally, LRCs can achieve a high rate at the price of a limited helper selection and a
small minimum code distance, while the large minimum distance of RC and MDS codes allows
an arbitrary choice of helpers at the price of a low rate (Figure 1-4). The minimum distance is
often treated as the key code parameter controlling the fault tolerance of the DSS. However, it
only measures the fault tolerance with respect to the worst-case erasure patterns, while these
[Figure 1-4 appears here: a plot of log(# of repair sets) vs. rate R, with curves for RC, Kamath2014, Tamo2016, LRC, and RLNC at two erasure probabilities p_e with t = 100.]
Figure 1-4: The helper selection vs rate trade-off (the higher the better).
patterns constitute only a small fraction of the possible patterns. Since the DSS reliability is
typically compared via expected value metrics (e.g. normalized magnitude of data loss — the
expected number of lost bytes per terabyte in the first 5 years of deployment [33]), it is more
natural to consider the expected impact of the potential failure and repair availability patterns
on the storage. In channel coding, this phenomenon is illustrated with Berlekamp's bat [34, Chapter 13] (Figure 1-5). A bat flies around the center of a nearly-spherical cave, the center of which represents the original codeword surrounded by the neighboring codewords in a high-dimensional code space. The location of the bat represents the distorted codeword. The bat tries to avoid touching the spikes on the wall. Since neighboring codewords exactly at the minimum distance are scarce, the range at which she can fly in perfect safety is far less than the actual range she can fly with a high probability of safety. While the early coding theory focused on codes with a large minimum distance, which guarantee correction of a large number of erasures, the Shannon capacity can only be reached by codes working far beyond the
minimum distance of the code: the probability of hitting the vicinity of another codeword is
low even when the number of erasures exceeds the minimum distance. Examples of such codes
are Turbo codes, LDPC, and random linear codes.
To study the expected impact of failures and repairs, we need to introduce a probability
measure on them, i.e. to consider a random selection of the failure and helper nodes. Since in
channel coding random linear codes perform well in decoding beyond the minimum distance, for
[Figure 1-5 appears here: two concentric regions labeled "perfect safety range" and "safety range with high probability".]
Figure 1-5: Berlekamp's bat phenomenon.
a random selection DSS model it is most natural to repair a node by generating random linear
combinations of the packets on the helper nodes, in other words, employing RLNC. In Chapter
6, we introduce a stochastic failure and repair DSS model, equipped with an RLNC code
generation and repair, and study its rate-reliability trade-offs. Since the model is probabilistic,
its performance is parametrized by time, which is measured as the number of the failure and
repair iterations.
The time-dependent nature of the model has another interesting aspect. Most existing DSS
codes, such as MDS, RCs, and LRCs, are designed under the assumption that the data needs
to be stored forever. However, the data often has a limited lifespan, after which the data can
be deleted or migrated to another (e.g. archival) storage. The lifespan can be really short: for
instance, in edge caching, a certain content can be popular just for a month, and afterwards it
does not need to be stored in caches any more. For such scenarios, it is reasonable to have a code
which provides the guarantees of maintaining the data only for a limited number of node repairs.
As a result of these relaxed requirements, we can expect an improved storage overhead and lower
storage costs. The RLNC code considered in Chapter 6 with a gradual degradation over time
under a random node selection model is well-suited for such limited lifetime applications. The
code both realizes the rate gain over RCs and has practically the same helper selection freedom
as RCs, far exceeding that of LRCs (Figure 1-4), which makes it a viable storage solution for
the time-varying networks.
In Chapter 7, we numerically study the performance of RLNC under our model in a wide
range of model and implementation assumptions and parameters. We show RLNC to perform
stably well with the binary field, low repair bandwidth, non-uniform node distributions, variable
number of helper nodes, and sparse RLNC recoding. Even in those adverse conditions, the
achievable code rate is shown to be significantly higher than that of regenerating codes.
Chapter 2
Preliminaries
In this chapter, we provide the notations and tools used throughout the thesis. We give the
basic definitions of the coding theory, network coding, and the matroid theory. In addition, we
provide an overview of the framework and main results from the theory of RCs, and introduce
information flow graphs (IFG) as an important tool for studying functional repair systems.
2.1 Notations
Unless indicated otherwise, we use capital letters to denote matrices, sets and graph node
labels, bold small letters to denote row vectors, and regular font for scalar variables. We also
use capital letters when we need to highlight the fact that a variable is a random variable.
rowspanA and colspanA denote the linear subspaces spanned by rows and columns of matrix
A, respectively. In denotes the n×n identity matrix. AT denotes the transposition of a matrix
or a vector A. Unless specified otherwise, A|^s, A|^S denote the submatrices of A consisting of row s, rows in set S, respectively, and A|_s, A|_S denote the submatrices of A consisting of column s, columns in set S, respectively.
|A| is the cardinality of set A. We use A ∪ B,A ∩ B,A − B to denote union, intersection,
and set-theoretic difference of sets A,B, and also use A+ b, A− b to denote A ∪ {b}, A− {b}.
The set of integers between i, j inclusive is denoted [i, j] = {i, i+1, . . . , j−1, j}, and [n] ≜ [1, n].
Unless specified otherwise, all data/code symbols are considered elements of a finite field
Fq. We use Fq to denote any finite field with q elements. Whenever needed, a specific
field will be indicated, if multiple fields with the same q exist (for non-prime q). H(X) will
denote the entropy of a discrete random variable or a set of random variables X, computed
with respect to log q. H(X|Y ) denotes the conditional entropy. We shall also use the chain rule
entropy expansion:
H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi | {Xj , j ∈ [i − 1]}). (2.1)
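The chain rule (2.1) can be verified numerically on a toy joint distribution; the distribution below is an illustrative choice, not taken from the thesis, and entropies are computed in bits (base-2 logarithm) rather than with respect to log q.

```python
from math import log2
from collections import defaultdict

# A joint distribution of two binary random variables X1, X2.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    # Entropy of a distribution given as {outcome: probability}.
    return -sum(pr * log2(pr) for pr in dist.values() if pr > 0)

# Marginal of X1.
p1 = defaultdict(float)
for (x1, _), pr in p.items():
    p1[x1] += pr

# Conditional entropy H(X2 | X1) = sum_x1 p(x1) H(X2 | X1 = x1).
H_cond = sum(px1 * H({x2: p[(x1, x2)] / px1 for x2 in (0, 1)})
             for x1, px1 in p1.items())

# Chain rule: H(X1, X2) = H(X1) + H(X2 | X1).
assert abs(H(p) - (H(p1) + H_cond)) < 1e-12
```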
For a real number x, (x)+ is used as a shorthand notation for max{x, 0}. For integers a, b, if b
is a multiple of a, we shall write a|b, and a ∤ b otherwise. 1E is the indicator function, equal to
1 if E is true, and 0 otherwise.
For a system of equations or inequalities, the notation

[ A
  B ]  (2.2)

implies that at least one among A, B must hold true.
2.2 Linear Codes
A brief overview of the main notions in coding theory is given in this section. For a more
detailed reference on the topic, we refer the readers to [35].
A (block) code C is a map from F_q^k to F_q^n. In this thesis we only consider linear (n, k) codes, for which the map is described by a linear transformation C : u → uG, where the full-rank matrix G ∈ F_q^{k×n} is called the generator matrix of the code. u is referred to as a vector of information symbols or a message, and uG is the codeword corresponding to the message. The set of all possible codewords forms the codebook, which with a slight abuse of notation will be also denoted by C = {uG : u ∈ F_q^k}. Note that for any codewords c, c′ ∈ C and any a ∈ Fq, ac and c + c′ also belong to C, i.e. the code can also be characterized as a linear subspace of F_q^n with dimension k. The codebook is not uniquely identified with the generator matrix: for any full-rank matrix
A ∈ F_q^{k×k}, the code with generator matrix AG corresponds to the same codebook.
The code rate of C is given by k/n, and represents the average information value of one
codeword symbol. The code is called systematic if the generator matrix is of the form G =
[Ik P ]; in this case, the first k codeword symbols contain the original message symbols, and the
remaining n− k are parity check symbols.
The weight |x| of a codeword x is the number of non-zero symbols in it, and the (Hamming)
distance |x−x′| between two codewords x,x′, is the weight of their difference, i.e. the number
of the coordinates where the two codewords differ. The minimum distance of code C is the
shortest distance between two distinct codewords from C, or equivalently, the smallest weight
of a non-zero codeword in C. The minimum distance D ≥ 1 of the code is related to the code's capability to tolerate symbol erasures. If a codeword x = uG suffers erasures at m arbitrary positions, the observed codeword y has m coordinates unknown, but it can be corrected and uniquely mapped back to x and decoded to u as long as no other codeword x′ can be transformed to y by m erasures. A code with minimum distance D can correct arbitrary D − 1 erasures in a codeword. Such a code is called an (n, k, D) code.
Theorem 2.2.1 (Singleton Bound). The minimum distance of an (n, k) linear code is upper
bounded by
D ≤ n− k + 1. (2.3)
A code that achieves the Singleton bound with D = n− k + 1 is called maximum distance
separable (MDS). For an MDS code every subset of k columns of the generator matrix is linearly
independent, and therefore any k symbols of a codeword x are sufficient to decode the message
u.
Theorem 2.2.2 (MDS Code Generator Matrix). A code is MDS if and only if its codebook
can be represented by a generator matrix in the systematic form G = [Ik P ], and every square
sub-matrix of P is invertible, where a sub-matrix is defined as an intersection of any i columns
with any i rows, i ∈ [min{k, n− k}].
Examples of MDS codes are:
• n-Repetition code (n, 1, n);
• single parity check code (k + 1, k, 2);
• Reed-Solomon (RS) codes (n, k, n− k + 1).
While the first two codes can be constructed over F2, RS codes require Fq with q ≥ n.
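The MDS property is easy to exercise with a small Vandermonde-based code over a prime field, in the spirit of RS codes; the parameters and message below are illustrative. Any k codeword symbols determine the message, because any k columns of a Vandermonde generator matrix with distinct evaluation points are linearly independent.

```python
# A k x n Vandermonde generator matrix over the prime field F_p gives an
# MDS code (minimum distance n - k + 1): any k codeword symbols suffice
# to decode. Parameters and message are illustrative.
p, n, k = 11, 6, 3
G = [[pow(x, i, p) for x in range(1, n + 1)] for i in range(k)]

def encode(u):
    return [sum(u[i] * G[i][j] for i in range(k)) % p for j in range(n)]

def decode(positions, symbols):
    # Solve the k x k linear system u . G_S = y by Gauss-Jordan mod p;
    # the submatrix G_S is invertible for any k distinct positions.
    M = [[G[i][positions[r]] for i in range(k)] + [symbols[r]]
         for r in range(k)]
    for col in range(k):
        piv = next(r for r in range(col, k) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], p - 2, p)          # Fermat inverse
        M[col] = [v * inv % p for v in M[col]]
        for r in range(k):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(M[r][c] - f * M[col][c]) % p for c in range(k + 1)]
    return [M[r][k] for r in range(k)]

u = [3, 7, 2]
x = encode(u)
# Erase any n - k = 3 positions; the remaining k = 3 symbols still decode.
assert decode([0, 2, 5], [x[0], x[2], x[5]]) == u
```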
For systematic codes C1(n1, k1), C2(n2, k2) with generator matrices G1 = [I_{k1} P1], G2 = [I_{k2} P2], the product code C(n2 n1, k2 k1) of C1, C2 maps a message U ∈ F_q^{k2×k1} to a codeword X ∈ F_q^{n2×n1}, where X is given by

X = [ U        U P1
      P2^T U   P2^T U P1 ] = [ U ; P2^T U ] [I_{k1} P1] = [ I_{k2} ; P2^T ] [U  U P1], (2.4)

where [A ; B] denotes vertical stacking.
Every row of X is a codeword from C1, and every column is a codeword from C2.
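A minimal instance of (2.4), assuming C1 = C2 is the (3, 2) single parity check code over F2 (so P1 = P2 is an all-ones column), illustrates the row/column structure; the message below is arbitrary.

```python
# Product of two (3, 2) single parity check codes over F_2: the 2 x 2
# message U becomes a 3 x 3 codeword X as in (2.4), with P1 = P2 = [1 1]^T.
U = [[1, 0],
     [0, 1]]

# [U, U P1]: append a row parity to each message row (rows become C1 words).
X = [row + [row[0] ^ row[1]] for row in U]
# [P2^T U, P2^T U P1]: append the column parity row (columns become C2 words).
X.append([X[0][j] ^ X[1][j] for j in range(3)])

# Every row and every column of X has even parity.
for row in X:
    assert sum(row) % 2 == 0
for j in range(3):
    assert sum(X[r][j] for r in range(3)) % 2 == 0
```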
For integer α ≥ 1, a linear (n, k) vector code C is a linear (n, k) code over the symbol alphabet F_q^α, such that its codewords are Fq-linear, i.e. for any codewords c, c′ ∈ C and any a, a′ ∈ Fq, ac + a′c′ ∈ C.
2.3 Linear Network Coding
Network coding in packet networks assumes that intermediate nodes perform coding across the
incoming packets to generate the outgoing packets. Specifically, in this thesis we assume that in
a network with packets of length m′, a node x with k incoming and n outgoing links computes
n outgoing packets as the columns of UGx, where the matrix U ∈ F_q^{m′×k} contains the k incoming packets as columns, and Gx ∈ F_q^{k×n} is the generator matrix of a local linear code at node x;
note that n is determined by the network structure, and may be smaller than k. We will say
that node x recodes the k incoming packets into n outgoing packets. In random linear network
coding (RLNC) the elements of Gx are drawn at random from Fq, rather than deterministically
constructed based on the network topology.
To keep track of the transformations of the original source packets after several recoding
operations, each coded packet has a header with the coordinates in the source packets basis.
To be more precise, let s1, . . . , sr ∈ F_q^m be the uncoded source packets to be transmitted over the network, let the matrix S ≜ [s1^T . . . sr^T] ∈ F_q^{m×r}, and let

S′ = [ I_r
       S ]. (2.5)
The columns of S′ are injected into the network as packets of length m′ = r + m, with an
r-symbol header. When these packets are recoded in the network, each generated coded packet
is a linear combination of columns of S′, in which the first r symbols are the coordinates of
the m-symbol payload part in the basis of the source packets s1, . . . , sr. The r-symbol header
is called (global) coding vector of the packet. Whenever a node receives r coded packets with
linearly independent coding vectors, i.e. the matrix of the r coding vectors is full-rank, the node
can decode the r source packets by applying the inverse linear transformation to the received
packets.
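The header mechanism can be sketched over F_2 (chosen for clarity; practical systems use larger fields). Each injected packet is a column of S′ = [I_r; S], recoding forms random linear combinations, and Gaussian elimination on the headers recovers the sources once the headers span F_2^r. All parameter values are illustrative.

```python
import random

r, m = 3, 4                       # r source packets of payload length m
random.seed(0)
sources = [[random.randint(0, 1) for _ in range(m)] for _ in range(r)]
# Columns of S' = [I_r; S]: an r-symbol unit header followed by the payload.
packets = [[int(t == i) for t in range(r)] + sources[i] for i in range(r)]

def recode(pkts):
    # A random non-zero F_2-linear combination of the given packets; the
    # header part automatically tracks the payload's coordinates.
    while True:
        coeffs = [random.randint(0, 1) for _ in pkts]
        if any(coeffs):
            break
    out = [0] * (r + m)
    for c, pk in zip(coeffs, pkts):
        if c:
            out = [a ^ b for a, b in zip(out, pk)]
    return out

def rref_f2(rows):
    # Gauss-Jordan over F_2, with pivots on the r header columns.
    rows, rank = [row[:] for row in rows], 0
    for col in range(r):
        piv = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rows, rank

# Collect recoded packets until the headers have full rank r.
received = []
while True:
    received.append(recode(packets))
    reduced, rank = rref_f2(received)
    if rank == r:
        break

# With full rank, row i of the reduced matrix has header e_i, so its
# payload is exactly source packet s_i.
for i in range(r):
    assert reduced[i][r:] == sources[i]
```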
2.4 Regenerating Codes
Next, we overview the model of RCs of [1]. A source file of size B symbols is encoded and
stored in a DSS of n storage nodes. Each node stores α symbols, and the code rate is B/(nα).
Whenever a node failure happens, a new node replaces the failed one and downloads β symbols
of helper data from each node from an arbitrary set of d ≥ k helper nodes to generate its
content. Under exact repair (ER) the generated content should be the same as that on the
failed node, under functional repair (FR) there is no such constraint. To retrieve the source file
(perform data collection) one downloads the content of an arbitrary set of k nodes.
For the model outlined above with parameters (n, k, d, α, β) and FR, the source file of size
B can be stored in the system if and only if
B ≤ B_FR ≜ ∑_{i=1}^{k} min{α, (d − i + 1)β}. (2.6)
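The bound (2.6) can be evaluated for the parameters of Figure 2-1, (n, k, d, B) = (7, 5, 6, 10); the MSR and MBR operating points computed from their defining equations both meet the file size exactly.

```python
# The FR file size bound (2.6) for (k, d) = (5, 6), as in Figure 2-1.
k, d = 5, 6

def B_FR(alpha, beta):
    return sum(min(alpha, (d - i + 1) * beta) for i in range(1, k + 1))

# MSR point: alpha = B/k = (d - k + 1) * beta, i.e. alpha = 2, beta = 1.
assert B_FR(2, 1) == 10
# MBR point: d * beta = alpha and B = sum_{i=1}^k (d - i + 1) * beta,
# i.e. beta = 0.5, alpha = 3.
assert B_FR(3, 0.5) == 10
# Reducing storage below the MSR alpha strictly reduces the achievable B.
assert B_FR(1.9, 1) < 10
```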
For a fixed file size B, there exists a trade-off between storage α per node and repair
bandwidth dβ (Figure 2-1). Only the pairs (α, dβ) on or above the trade-off curve are feasible.
[Figure 2-1 appears here: the (α, dβ) trade-off curves for FR and (approximate) ER, with the MBR and MSR points marked.]
Figure 2-1: Storage-bandwidth trade-off for RCs with (n = 7, k = 5, d = 6, B = 10). The precise trade-off for exact repair remains unknown.
Since α is proportional to the storage overhead nα/B, lower values of α lead to lower storage
costs. The MSR (minimum storage regeneration) point on the trade-off corresponds to the
smallest storage overhead, and is defined by α = B/k = (d − k + 1)β. The MBR (minimum
bandwidth regeneration) point on the trade-off corresponds to the smallest repair bandwidth,
which is defined by dβ = α, B = ∑_{i=1}^{k} (d − i + 1)β. For a bounded number of failures, all points
on the FR trade-off can be achieved by linear network coding over a large enough field, or in
particular by RLNC, or space-sharing of two network coded solutions. Reference [36] presented
randomized and deterministic FR code constructions for an unlimited number of failures. The
general ER trade-off remains an open problem, although it is known that ER can always achieve
the MSR and MBR points, and for most points between those two the ER trade-off is strictly
worse than that of FR. Under ER, the MBR point can be achieved by a product-matrix code
construction for all values of parameters (n, k, d) [14].
2.4.1 Information Flow Graph
Information flow graph (IFG) is a convenient tool for analysis of the maximal achievable file size
of RC models in FR regime. The IFG is a directed acyclic graph with capacitated edges, which
represents the data flows from the uncoded source file to the data collectors via an error-free
network of the original and replacement storage nodes with limited memory. Each original or
[Figure 2-2 appears here: an information flow graph with source S, in/out node pairs of capacity α, repair edges of capacity β, infinite-capacity edges, and a data collector Z.]
Figure 2-2: An example of information-flow graph for n = 4, k = 2, d = 2 and 3 node failures/repairs. Also shown is a sample cut (U, V) between S and Z of capacity α + β.
replacement physical node Xi of size α is represented in the IFG by a new pair of in- and out-graph nodes X_i^in → X_i^out, with an edge of capacity α between them. The in-node represents the single point where data enters the node, possibly from several sources, e.g. helper nodes. The out-node connects to the other nodes or data collectors, to which node Xi sends data. A special IFG node S serves as the source of the source file of size B to be stored in the DSS.
The IFG evolves with the DSS as node failures and repairs happen. Before any node failures, the IFG contains n pairs of nodes X_i^in, X_i^out, i ∈ [n], and all n out-nodes are considered active. The source node S connects to all the n in-nodes X_i^in, i ∈ [n] via edges of infinite capacity. If node Xi fails, its IFG out-node X_i^out becomes inactive, a replacement physical node is introduced to the system, and it downloads helper data from some d surviving nodes. The new node has a new index, say X_{n+1}, and is represented by a new pair of in- and out-nodes in the IFG. The in-node X_{n+1}^in connects to the corresponding d active out-nodes via edges of capacity β. The new out-node X_{n+1}^out becomes active, so that at any moment there are n active out-nodes in the graph. A data collector is represented by a node Z, which connects to k active out-nodes X_j^out at any specific moment via edges of infinite capacity. An example of an IFG is shown in Figure 2-2.
A cut of the IFG is a partitioning of all graph nodes into two sets (U ,V). The cut-set
corresponding to a cut (U ,V) is the set of all edges which go from a node in U to a node in V.
The capacity (or the value) of a cut or the corresponding cut-set is the sum of the capacities of
the edges in the cut-set. A cut between nodes S and Z is any cut (U, V) with S ∈ U, Z ∈ V. Any directed path from S to Z then has at least one edge in the corresponding cut-set. A cut between S, Z is called a minimal cut or min-cut if its capacity is minimal among all cuts between S and Z.
Given parameters (n, k, d, α, β,B), the problem of code repair satisfying the RC model
requirements can be cast as a problem of multicasting B symbols over all possible IFGs with
parameters (n, d, α, β) to an arbitrary number of data collectors, connecting to any k active
out-nodes. The latter problem is solvable if and only if B is no greater than the min-cut value
between S and any data collector node Z for any possible IFG. Moreover, if the condition is
satisfied, then there exists a linear network code solution over a sufficiently large field such that
all data collectors can recover the B source symbols. For a bounded number of failures/repairs, RLNC also provides a solution with probability approaching 1 as the field size increases.
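The min-cut condition can be checked computationally on a small IFG. The sketch below builds an illustrative IFG (our own construction, not a figure from the thesis) for n = 4, d = 2, α = 3, β = 1, in which node X4 has failed and been replaced by X5 using helpers X1, X2 (the inactive X4 pair is omitted since it carries no flow), and a collector Z reads k = 2 active nodes, X3 and X5; Edmonds-Karp max-flow then equals α + min(α, dβ).

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    # Edmonds-Karp: repeatedly find a shortest augmenting path by BFS
    # on the residual capacity map and push flow along it.
    flow = 0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= aug
            cap[v][u] += aug
        flow += aug

INF, alpha, beta = 10**9, 3, 1
cap = defaultdict(lambda: defaultdict(int))
for i in range(1, 4):                     # surviving original nodes X1..X3
    cap['S'][f'X{i}in'] = INF
    cap[f'X{i}in'][f'X{i}out'] = alpha
cap['X5in']['X5out'] = alpha              # X5 replaces the failed X4
cap['X1out']['X5in'] = beta               # beta symbols from each of
cap['X2out']['X5in'] = beta               # d = 2 helper nodes
cap['X3out']['Z'] = INF                   # collector reads k = 2 active
cap['X5out']['Z'] = INF                   # out-nodes: X3 and X5

# The min cut limits the storable file size: alpha via X3 plus
# min(alpha, d * beta) via the repaired node X5.
mc = max_flow(cap, 'S', 'Z')
assert mc == alpha + min(alpha, 2 * beta)  # = 5
```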
2.5 Matroids
In this section, we give a brief overview of the matroid theory terminology and results used in
the thesis. We refer the reader to reference [37] for more details.
A matroid M is a pair (E , I) consisting of a finite set E (the ground set), and a collection
I of subsets of E . The elements of I are called independent sets. I should satisfy the following
properties:
• I is non-empty.
• Every subset of an independent set is also independent.
• (Independence augmentation property) If sets I1 and I2 are independent (I1, I2 ∈ I), and
|I2| = |I1|+ 1, then there is an element s ∈ I2 − I1, such that I1 ∪ {s} is independent.
A set S ⊆ E which is not independent is called dependent. The maximal independent sets are
called bases (I ∈ I, I + s /∈ I, ∀s /∈ I). Every basis has the same cardinality. The minimal
dependent sets are called circuits (S /∈ I, S − s ∈ I, ∀s ∈ S). A matroid on n elements
is uniquely characterized not only by the collection of its independent sets but also by the
collection of its circuits. An element s ∈ E is a loop if it is not an element of any basis, or,
equivalently, if {s} is a circuit. An element s ∈ E is a coloop if it is not an element of any
circuit, or, equivalently, if s is an element of every basis. If for elements s1, s2 of matroid M,
{s1, s2} is a circuit, then s1 and s2 are said to be parallel in M, and the set of all elements
parallel to s1 or s2 is called a parallel class.
Given a matrix A ∈ F_q^{n×m}, the vector matroid M[A] is the matroid defined over the set of columns of A, where the independence of a subset is defined as the linear independence of the columns in the subset.
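The matroid axioms can be brute-force checked for a small vector matroid; the binary matrix below is an illustrative example, with independence tested via rank over F_2.

```python
from itertools import combinations

# The vector matroid M[A] of a small binary matrix: a subset of columns
# is independent iff it has full column rank over F_2. Column 3 equals
# column 0 + column 1, so {0, 1, 3} is a circuit.
A = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
n_rows, n_cols = 3, 4

def rank_f2(cols):
    rows = [[A[r][c] for c in cols] for r in range(n_rows)]
    rank = 0
    for c in range(len(cols)):
        piv = next((i for i in range(rank, n_rows) if rows[i][c]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for i in range(n_rows):
            if i != rank and rows[i][c]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def independent(S):
    return rank_f2(sorted(S)) == len(S)

indep = [set(S) for r in range(n_cols + 1)
         for S in combinations(range(n_cols), r) if independent(S)]

assert not independent({0, 1, 3})        # a dependent set (a circuit)
# Independence augmentation: |I2| = |I1| + 1 implies some s in I2 - I1
# keeps I1 independent when added.
for I1 in indep:
    for I2 in indep:
        if len(I2) == len(I1) + 1:
            assert any(independent(I1 | {s}) for s in I2 - I1)
```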
If out of the three properties of I only the first two are satisfied, the resulting object (E , I)
is called an independence system.
Part I
Regenerating Codes for Clustered
Storage Systems
Chapter 3
Generalized Regenerating Codes
(GRCs) and File Size Bounds
3.1 System Model
We propose a natural generalization of the setting of regenerating codes (RC) [1] for clustered
storage networks. The network consists of n clusters, with m nodes in each cluster. The network
is fully connected such that any two nodes within a cluster are connected via an intra-cluster
link, and any two clusters are connected via an inter-cluster link. A node in one cluster that
needs to communicate with another node in a second cluster does so via the corresponding
inter-cluster link. A source file of size B symbols is encoded into nmα symbols and stored
across the nm nodes such that each node stores α symbols. For data collection, we have an
availability constraint such that the entire content of any k clusters should be sufficient to
recover the original data file. Nodes represent points of failure. In this chapter, we restrict
ourselves to the case of efficient recovery from single node failure. In Chapter 4, we generalize
some of our results to the scenario of recovering from multiple node failures within a cluster.
Node repair is parametrized by three parameters d, β and ℓ. We assume that the replacement
of a failed node is in the same cluster (host cluster) as the failed node. The replacement node
downloads β symbols each from any set of d other clusters, dubbed remote helper clusters. The
β symbols from any of the remote helper clusters are a function of the mα symbols present
Table 3.1: Notation for the clustered storage system model.
Symbol   Definition
B        source file size
n        total number of clusters in the system
m        number of storage nodes in each cluster
k        number of clusters required for data collection
d        number of remote helper clusters providing helper data during a node repair
d′       min{d, k}
ℓ        number of local helper nodes providing helper data during node repair
q        finite field size for data symbols
α        number of symbols each storage node holds for one coded file
β        size of helper data downloaded from each remote helper cluster during node repair, in symbols
in the cluster; we assume that a dedicated compute unit in the cluster takes responsibility for
computing these β symbols before passing them outside the cluster. In addition, the replacement
node can download (entire) content from any set of ℓ ∈ [m−1] other nodes, dubbed local helper
nodes, in the host cluster, during the repair process. The quantity dβ represents the inter-
cluster repair-bandwidth. We shall also use notation d′ = min{d, k}. We refer to the overall
code as the generalized regenerating code (GRC) Cm with parameters {(n, k, d)(α, β)(m, ℓ)}. A
summary of the various parameters used in the description of the system model appears in
Table 3.1.
The model reduces to the setup of RCs in [1] when m = 1 (in which case, ℓ = 0 automatically). We shall refer to the setup in [1] as the classical setup or classical regenerating codes.
Our generalization has two additional parameters ℓ and m when compared with the classical
setup. As in the classical setup, we consider both FR and ER regimes. We further note that,
unlike the classical setup, our generalized setup permits d < k.
We will say that a GRC code is locally non-redundant if the encoding function does not
introduce any local dependence among the content of the various nodes of a cluster. For linear
GRC, the coded content of cluster i can be written as uGi, where u is the message vector of
length B, and Gi is a B × mα matrix. In this case, a locally non-redundant code means that Gi has full column rank. Conversely, a locally redundant code can have, for example, a local parity node within a cluster, which would hold the component-wise sum in F_q^α of the data on
the other m− 1 nodes.
The model described above does not consider intra-cluster bandwidth incurred during repair.
Intra-cluster bandwidth is needed, firstly, to compute the β symbols in any remote helper
cluster, and, secondly, to download content from ℓ local helper nodes in the host cluster. The
intra-cluster bandwidth of GRC is studied in detail in Chapter 5.
Our goal is to obtain a trade-off between storage overhead nmα/B and inter-cluster repair-
bandwidth dβ for an {(n, k, d)(α, β)(m, ℓ)} GRC.
3.1.1 IFG Model for GRC
In this section, we describe the IFG model of GRC used in this chapter to derive the main file
size bound. Let Xi,j denote the physical node j ∈ [m] in cluster i ∈ [n]. In the IFG, Xi,j is represented by a pair of nodes X_{i,j}^in → X_{i,j}^out connected by an edge of capacity α. With a slight abuse of notation, we will let Xi,j also denote the pair (X_{i,j}^in, X_{i,j}^out) of the graph nodes. Cluster i also has an additional external node, denoted as X_i^ext. Each out-node X_{i,j}^out, j ∈ [m] in the cluster is connected to X_i^ext via an edge of capacity α. The external node X_i^ext is used to transfer data outside the cluster, and thus serves two purposes: 1) it represents a single point of contact to the cluster, for a data collector which connects to this cluster, and 2) it represents the compute unit which generates the β symbols for repair of any node in a different cluster.
The source node S connects to the in-nodes of all physical storage nodes in their original state (S → X_{i,j}^in with infinite capacity), ∀i ∈ [n], ∀j ∈ [m]. The sink node Z represents a data collector; it connects to the external nodes of an arbitrary subset of k clusters (X_i^ext → Z with infinite capacity).
Each cluster at any moment has m active nodes. When a physical node Xi,j fails, it becomes inactive, and its replacement node becomes active instead (see Figure 3-1 for an illustration). The replacement node is regenerated by downloading β symbols from any d nodes in the set {X_{i′}^ext, i′ ∈ [n], i′ ≠ i}. The replacement node also connects to any subset of ℓ nodes in the set {X_{i,j′}^out, j′ ∈ [m], j′ ≠ j} via links of capacity α.
Along with the replacement of node Xi,j, we will also copy the remaining m − 1 nodes in cluster i as they are, and represent them with new identical pairs of nodes (X_{i,j′}^in → X_{i,j′}^out), j′ ∈ [m], j′ ≠ j. We shall also have a new external node for the cluster, which connects to the new m out-nodes. Thus, in the IFG modeling, we say that the entire old cluster with the failed node becomes inactive, and gets replaced by a new active cluster. For either data collection
[Figure 3-1 appears here: the IFG for one failure and repair, with source S and data collector Z.]
Figure 3-1: An example of the IFG model representing the notion of generalized regenerating codes, when intra-cluster bandwidth is ignored. In this figure, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1).
or repair, we connect to the external nodes of the active clusters. Note that, at any point in time, a physical cluster contains only one active cluster in the IFG, and fi inactive clusters in the IFG, where fi ≥ 0 denotes the total number of failures and repairs experienced by the various nodes in the cluster. We shall use the notation Xi(t), 0 ≤ t ≤ fi to denote the cluster that appears in the IFG after the t-th repair associated with cluster i. The clusters Xi(0), . . . , Xi(fi − 1) are inactive, while Xi(fi) is active, after fi repairs. The nodes of Xi(t) will be denoted by X_{i,j}^in(t), X_{i,j}^out(t), X_i^ext(t), 1 ≤ j ≤ m. With a slight abuse of notation, we will let Xi(t) also denote the collection of all 2m + 1 nodes in this cluster. We write Xi,j(t) to denote the pair (X_{i,j}^in(t), X_{i,j}^out(t)); again, with a slight abuse of notation, we shall use Xi,j(t) to also denote the node j in cluster i after the t-th repair in cluster i. We further use the notation Fi to denote the union (family) of all nodes in all inactive clusters and the active cluster corresponding to the physical cluster i after t repairs in cluster i, i.e., Fi = ∪_{t=0}^{fi} Xi(t). We have avoided indexing Fi with the parameter t as well, to keep the notation simple. The value of t in our usage of the notation Fi will be clear from the context.
3.2 Previous Work
Regenerating code variations for the data-center-like topologies consisting of racks and nodes
are considered in [38–42]. In [38], [39] and [40], the authors distinguish between inter-rack
(inter-cluster) and intra-rack (intra-cluster) bandwidth costs. Further, the works [38] and [39]
permit pooling of intra-rack helper data to decrease inter-rack bandwidth. Also, all three works
allow taking help from host-rack nodes during repair. Unlike our model, for data collection,
all three works simply require file decodability from any set of k nodes irrespective of the
racks (clusters) to which they belong. In other words, the notion of clustering applies only
to repair, and not data collection, and this is a major difference with respect to our model.
Thus, while these variations are suitable for modeling the node-rack topologies present within
a data center, they do not model the situation of erasure coding across data centers with the
availability requirement as considered in this work. The work in [41] is a variation of that in [40]
for a two-rack model, where the per-node storage capacity of the two racks differ. In [42], the
authors consider a two-layer storage setting like ours, consisting of several blocks (analogous to
clusters as considered in this work) of storage nodes. A different clustering approach is followed
for both data collection and node repair. For data collection, one accesses kc nodes each from
any of bc blocks. Though [42] focuses on node repair, the model assumes possible unavailability
of the whole block where the failed node resides, and as such uses only nodes from other blocks
for repair. Further, unlike our model in this work, the authors do not differentiate between
inter-block and intra-block bandwidth costs. The framework of twin-codes introduced in [43]
is also related to our model and implicitly contains the notion of clustering. In [43] nodes are
divided into two sets. For data collection, one connects to any k nodes in the same set. Recovery
of a failed node in one set is accomplished by connecting to d nodes in the other set. However,
there is no distinction between intra-set and inter-set bandwidth costs, and this becomes the
main difference with our model.
Several works [44–49] study variations of RCs in varied settings, with different combinations
of node capacities, link costs, and amount of data-download-per-node. The main difference
between our model and these works is that none of them explicitly considers clustering of nodes
while performing data collection. In [44], the authors introduce flexible regenerating codes for
a flat topology of storage nodes, where uniformity of download is enforced neither during data
collection nor during node repair. References [45], [46] consider systems where the storage and
repair-download costs are non-uniform across the various nodes. The authors of [45], as in [44],
allow a replacement node to download an arbitrary amount of data from each helper node. In
[47], nodes are divided into two sets, based on the cost incurred while these nodes aid during
repair. As noted in [41], the repair model of [47] is different from a clustered network, where
the repair cost incurred by a specific helper node depends on which cluster the replacement
node belongs to. The works of [48] and [49] focus on minimizing regeneration time rather than regeneration bandwidth in systems with non-uniform point-to-point link capacities. Essentially,
each helper node is expected to find the optimal path, perhaps via other intermediate nodes,
to the replacement node such that the various link capacities are used in a way to transfer all
the helper data needed for repair in the shortest possible time. It is interesting to note that both
of these works permit pooling of data at an intermediate node, which gathers and processes
any relayed data with its own helper data. Recall that our model (and the one in [38]) also
considers pooling of data within a remote helper cluster, before passing on to the target cluster.
3.3 File Size Bound
In this section, we derive a bound on the file size B for an arbitrary set of code parameters. We
further use this bound to characterize the storage overhead vs inter-cluster repair bandwidth
overhead trade-off.
Theorem 3.3.1 (GRC Capacity). The file size B of a GRC with parameters {(n, k, d) (α, β) (m, ℓ)} under the FR regime is upper bounded by

B ≤ B* ≜ ℓkα + (m − ℓ) ∑_{i=0}^{k−1} min{α, (d − i)⁺β}. (3.1)
Further, if there is an upper bound on the number of repairs that occur for the duration of
operation of the system, the above bound is sharp, i.e., B∗ gives the functional repair storage
capacity of the system.
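The bound (3.1) is a one-line computation. The sketch below evaluates B* and checks it on a small example (the function name is ours):

```python
def grc_file_size_bound(n, k, d, alpha, beta, m, l):
    """B* per (3.1): l*k*alpha + (m - l) * sum_{i=0}^{k-1} min(alpha, (d-i)^+ beta).
    (n is unused by the formula itself but kept for context.)"""
    return l * k * alpha + (m - l) * sum(
        min(alpha, max(d - i, 0) * beta) for i in range(k)
    )

# Example parameters (n=3, k=2, d=2)(m=2, l=1): whenever 2*beta <= alpha,
# (3.1) evaluates to 2*alpha + 3*beta.
alpha, beta = 2, 1
assert grc_file_size_bound(3, 2, 2, alpha, beta, 2, 1) == 2 * alpha + 3 * beta
# With l = 0 the bound reduces to m copies of the classical cut-set bound.
assert grc_file_size_bound(3, 2, 2, alpha, beta, 2, 0) == 2 * (min(2, 2 * beta) + min(2, beta))
```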
3.3.1 Proof of the File Size Bound
The proof consists of two parts: the upper bound and its achievability. For finding the desired
upper bound on the file size, it is enough to exhibit a cut between the source and the sink in an IFG, for a specific sequence of failures and repairs, such that the value of the cut is the desired upper bound. To prove achievability of the bound, we shall show that, for any valid
IFG, independent of the specific sequence of failures and repairs, B∗ is indeed a lower bound on
the minimum possible value of any S − Z cut, and, thus, B∗ symbols can always be multicast
to the data collectors.
Upper Bound
We begin with the proof of the upper bound. We consider a sequence of k(m− ℓ) failures and
repairs, as follows: Physical nodes Xi,ℓ+1, Xi,ℓ+2, . . . Xi,m fail in this order in cluster i = 1, then
in cluster i = 2, and so on, until cluster i = k. In the IFG this corresponds to the sequence
of failures of nodes X1,ℓ+1(0), X1,ℓ+2(1), . . . , X1,m(m − ℓ − 1), X2,ℓ+1(0), . . . , X2,m(m − ℓ − 1),
. . . , Xk,m(m− ℓ− 1), in the respective order. The replacement node Xi,ℓ+t(t) for Xi,ℓ+t(t− 1),
1 ≤ t ≤ m − ℓ, draws local helper data from Xi,1(t − 1), Xi,2(t − 1), . . . , Xi,ℓ(t − 1), and remote helper data from the clusters X1(m − ℓ), . . . , Xi−1(m − ℓ) and from some set of d − min{i − 1, d} = (d − i + 1)⁺ other active clusters in the IFG. An example is shown in Figure 3-2 for a set of system parameters that is the same as those used in Figure 3-1.
Let data collector Z connect to clusters X1(m− ℓ), . . . ,Xk(m− ℓ). Consider the S − Z cut
consisting of the following edges of the IFG:
• {X^in_{i,j}(0) → X^out_{i,j}(0), i ∈ [k], j ∈ [ℓ]}. The total capacity of these edges is kℓα.

• For each i ∈ [k], t ∈ [m − ℓ], either the set of edges {X^ext_{i′}(0) → X^in_{i,ℓ+t}(t), i′ ∈ {remote helper cluster indices for the replacement node X^in_{i,ℓ+t}(t)} − [min{i − 1, d}]}, or the edge X^in_{i,ℓ+t}(t) → X^out_{i,ℓ+t}(t). Between the two possibilities, we pick the one which has the smaller capacity. In this case, the total capacity of this part of the cut is given by ∑_{i=1}^{k} ∑_{j=ℓ+1}^{m} min{α, (d − min{i − 1, d})β} = (m − ℓ) ∑_{i=1}^{k} min{α, (d − i + 1)⁺β}.

The value of the cut is given by kℓα + (m − ℓ) ∑_{i=1}^{k} min{α, (d − i + 1)⁺β} = B*, which proves our upper bound. In the example in Figure 3-2 for (n = 3, k = 2, d = 2)(m = 2, ℓ = 1), first,
Figure 3-2: An example of the information flow graph used in the cut-set based upper bound for the file size. In this figure, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1). We also indicate a possible choice of S − Z cut that results in the desired upper bound.
m− ℓ = 1 node fails in cluster 1 and downloads helper data from clusters 2, 3, second, a node
fails in cluster 2 and downloads helper data from clusters 1, 3. The data collector connects to
clusters 1, 2. A minimal cut for 2β ≤ α is shown in the figure and has value 2α+ 3β = B∗.
Achievability
We next show that for any valid IFG (independent of the specific sequence of failures and
repairs), B∗ is indeed a lower bound on the minimum possible value of any S−Z cut. Consider
any S − Z cut (U ,V). Since node Z connects to k external nodes via links of infinite capacity,
we only consider cuts such that V has at least k external nodes corresponding to active clusters.
Next, we observe that the IFG is a directed acyclic graph, and, hence, there exists a topological
sorting of nodes of the graph such that an edge exists between two nodes A and A′ of the IFG
only if A appears before A′ in the sorting [50]. Further, we consider a topological sorting such
that all in-, out- and external nodes of the cluster Xi(τ) appear together in the sorted order,
∀i, τ .
Now, consider the sequence E of all the external nodes (which are part of both active and
inactive clusters) in V in their sorted order. Let Y1 denote the first node in this sequence.
Figure 3-3: An example of how any S − Z cut in the IFG affects nodes in Fi. In the example, we assume m = 4. With respect to the description in the text, ai = 2. Further, the node Xi,4(ti,4) is a replacement node in the IFG.
Without loss of generality let Y1 ∈ F1. Next, consider the subsequence of E which is obtained
after excluding all the external nodes in F1 from E . Let Y2 denote the first external node in this
subsequence. We continue in this manner until we find the first k external nodes {Y1, Y2, . . . ,
Yk} in E , such that each of the k nodes corresponds to a distinct physical cluster. Once again,
without loss of generality, we assume that Yi ∈ Fi, 2 ≤ i ≤ k. Let us assume that Yi = X^ext_i(ti), for some ti. Now, consider the m out-nodes X^out_{i,1}(ti), . . . , X^out_{i,m}(ti) that connect to X^ext_i(ti). Among these m out-nodes, let ai, 0 ≤ ai ≤ m, denote the number of out-nodes that appear in U. Without loss of generality, let these be the nodes X^out_{i,1}(ti), X^out_{i,2}(ti), . . . , X^out_{i,ai}(ti). Next, corresponding to the out-node X^out_{i,j}(ti), ai + 1 ≤ j ≤ m, consider its past versions {X^out_{i,j}(t), t < ti} in the IFG, and let X^out_{i,j}(ti,j), for some ti,j ≤ ti, denote the first sorted node that appears in V. Without loss of generality, let us also assume that the nodes {X^out_{i,j}(ti,j), ai + 1 ≤ j ≤ m} are sorted in the order X^out_{i,ai+1}(ti,ai+1), X^out_{i,ai+2}(ti,ai+2), . . . , X^out_{i,m}(ti,m). An illustration is provided in Figure 3-3.
To obtain a lower bound on the value of the S−Z cut, we make the following observations:
• The ai edges {X^out_{i,j}(ti) → X^ext_i(ti), 1 ≤ j ≤ ai} are part of the cut. These contribute a total value of aiα.

• For any node X^out_{i,j}(ti,j), ai + 1 ≤ j ≤ m, if the corresponding in-node X^in_{i,j}(ti,j) belongs to U, then the edge X^in_{i,j}(ti,j) → X^out_{i,j}(ti,j) appears in the cut, and contributes a value of α to the cut. Now, consider the case when the in-node X^in_{i,j}(ti,j) belongs to V. In this case, consider the following two sub-cases:

– The node Xi,j(ti,j) is not a replacement node: This means that either the edge X^out_{i,j}(ti,j − 1) → X^in_{i,j}(ti,j) appears in the cut, if ti,j > 0, or the edge S → X^in_{i,j}(ti,j) appears in the cut, if ti,j = 0. In either case, the contribution to the overall value of the cut is at least α.

– The node Xi,j(ti,j) is a replacement node of Xi,j(ti,j − 1): We know that ℓ local helper nodes and d external nodes are involved in the repair. It is straightforward to see that, out of the ℓ local helper nodes, at most (j − 1) belong to V. To see this, note that the potential candidates for the local helper nodes that appear in V correspond to the physical nodes¹ Xi,1, Xi,2, . . . , Xi,j−1. The version of the physical node Xi,j′, j′ > j, if it aids in the repair process, appears in U because of our definition of Xi,j′(ti,j′). Next, note that out of the d external nodes, at most (i − 1) belong to V. In this case, the contribution to the value of the cut, due to the edges that aid in repair, is lower bounded by (ℓ − j + 1)⁺α + (d − (i − 1))⁺β.

¹It may be noted that we count the physical nodes Xi,1, . . . , Xi,ai among the possible set of local helpers, although we assume that Xi,j(ti), 1 ≤ j ≤ ai, appears in U. This is because we cannot discount the possibility that Xi,j(ti,j′ − 1) appears in V, for j ≤ ai, j′ > ai.
Figure 3-4: Trade-off between the storage overhead nmα/B and the inter-cluster repair bandwidth overhead dβ/α, for an (n = 5, k = 4, d = 4) clustered storage system, with ℓ = m − 1. The plot marks the MSR and MBR points and the classical trade-off; the curves improve with more nodes per cluster.
Based on the observations above, the value of the cut is lower bounded by

mincut(S − Z) ≥ ∑_{i=1}^{k} [ aiα + ∑_{j=ai+1}^{m} min(α, (ℓ − j + 1)⁺α + (d − (i − 1))⁺β) ]

= ∑_{i=1}^{k} [ aiα + ∑_{j=ai+1}^{ℓ} α + ∑_{j=max(ℓ,ai)+1}^{m} min(α, (d − (i − 1))⁺β) ]

= ∑_{i=1}^{k} [ max(ai, ℓ)α + (m − max(ai, ℓ)) min(α, (d − (i − 1))⁺β) ]

≥ ℓkα + (m − ℓ) ∑_{i=1}^{k} min(α, (d − (i − 1))⁺β) = B*,

for any ai, 0 ≤ ai ≤ m. This completes the proof of the achievability.
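The final minimization over the counts ai can be checked numerically: for small parameters, minimizing the cut expression over all admissible (a1, . . . , ak) recovers exactly B*. A sketch (our own, with (x)⁺ written as max(x, 0); names are ours):

```python
from itertools import product

def cut_value(k, m, l, d, alpha, beta, a):
    """Cut value implied by out-node counts a = (a_1, ..., a_k), per the
    third line of the derivation above."""
    total = 0
    for i, ai in enumerate(a, start=1):
        total += max(ai, l) * alpha + (m - max(ai, l)) * min(
            alpha, max(d - (i - 1), 0) * beta)
    return total

k, m, l, d, alpha, beta = 2, 2, 1, 2, 2, 1
b_star = l * k * alpha + (m - l) * sum(
    min(alpha, max(d - i, 0) * beta) for i in range(k))
# Minimizing over all admissible choices of (a_1, ..., a_k) yields exactly B*.
assert min(cut_value(k, m, l, d, alpha, beta, a)
           for a in product(range(m + 1), repeat=k)) == b_star
```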
3.3.2 Storage vs Inter-Cluster Bandwidth Trade-off
For fixed values of B = B*, n, k, d > 0, ℓ, m, (3.1) gives a normalized FR trade-off (see Figure 3-4) between the storage overhead nmα/B (storage used per source-file symbol) and the inter-cluster repair bandwidth overhead dβ/α (inter-cluster repair bandwidth per repaired symbol). For any m, when ℓ = 0, the trade-off is exactly the same as that of the classical regenerating codes [1]. When ℓ > 0 (which implies m > 1), the trade-off is strictly better than that of the classical setup.

We shall call the trade-off point dβ = α the MBR operating point. An optimal code Cm at the MBR operating point will be referred to as an MBR code. For locally non-redundant GRC,
the MBR point identifies the minimal amount of inter-cluster bandwidth required for repair,
regardless of the number of local helper nodes. At the MSR operating point, the maximum file size per (3.1) is not bandwidth-constrained, i.e. B = ℓkα + (m − ℓ)d′α, and the code has the lowest possible inter-cluster repair bandwidth, defined by ((d − k)⁺ + 1)β = α.
Note that the GRC trade-off allows the points with inter-cluster repair bandwidth overhead
dβ/α below 1. This requires the GRC to be locally redundant. Specifically, the point with zero
inter-cluster bandwidth implies having a parity check node in each cluster. On the contrary, all
trade-off points on the MBR line and above can be achieved with locally non-redundant codes,
with significantly lower storage overhead than locally redundant codes.
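The normalized trade-off of Figure 3-4 follows directly from (3.1): fixing the bandwidth overhead ratio r = dβ/α, the scale of α cancels and the storage overhead nmα/B* becomes a function of r alone. A sketch (the function name is ours):

```python
def storage_overhead(n, k, d, m, l, r):
    """Storage overhead n*m*alpha/B* at bandwidth overhead r = d*beta/alpha,
    implied by (3.1); alpha scales out, so set alpha = 1, beta = r/d."""
    alpha, beta = 1.0, r / d
    b_star = l * k * alpha + (m - l) * sum(
        min(alpha, max(d - i, 0) * beta) for i in range(k))
    return n * m * alpha / b_star

# (n=5, k=4, d=4), l = m-1, as in Figure 3-4. At the MSR point
# ((d-k)^+ + 1)*beta = alpha, i.e. r = d = 4 here, every min(...) saturates
# at alpha and the storage overhead collapses to n/k regardless of m.
for m in (1, 2, 4):
    assert abs(storage_overhead(5, 4, 4, m, m - 1, 4.0) - 5 / 4) < 1e-12
# Clustering helps: at the MBR point (r = 1), larger m gives lower overhead.
assert storage_overhead(5, 4, 4, 4, 3, 1.0) < storage_overhead(5, 4, 4, 1, 0, 1.0)
```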
3.4 Code Constructions
In this section, we describe optimal GRC code constructions. Two constructions are presented;
the first one is an instance of an exact repair code, and results in optimal codes at the MSR
and MBR points under the setting of generalized regenerating codes; the second construction
is a functional-repair regenerating code. Both codes can withstand any number of repairs for
the duration of operation of the system. The exact repair code withstands any number of
repairs by definition, since after each repair the data on all nodes is the same as at the start
of system operation. This logic does not hold for functional repair codes, because the repaired
node content is generally different from the original one. Network-coding based achievability
proofs for functional-repair work only if there is a known upper bound on the number of repairs
that occur over the lifetime of the system. Our functional repair code relies on the construction
in [36], which allows our code to operate for arbitrarily many repairs. For both constructions,
we rely on existing optimal classical regenerating codes that are linear. By a linear regenerating
code, we mean that both encoding and repair are performed via linear combinations of either
the input or the coded symbols, respectively. The first construction generates an optimal (n,
k, d)(α, β)(m, ℓ) code for any m, ℓ ≤ m− 1, 1 ≤ d ≤ n− 1, whenever an optimal (n, k, d′)(α, β)
classical exact repair linear regenerating code exists. Our functional repair code construction
is limited to the case ℓ = m− 1, d ≥ k.
For a linear (n, k, d′)(α, β) classical regenerating code that encodes a data file of size B
symbols, one can associate a generator matrix G of size B×nα. Without loss of generality, the
first α columns of G generate the content of node 1, and so on. We say that two (n, k, d′)(α, β)
classical linear regenerating codes C1 and C2, having generator matrices G1 and G2 are identical
if G1 = G2.
3.4.1 Exact Repair Code Construction
We begin with a description of the code and then show its data collection and repair properties.
Construction 3.4.1 (Exact repair GRC). Let Cj, 1 ≤ j ≤ ℓ, denote (n, k) MDS vector codes over F_q^α. The amount of data that can be encoded with these ℓ codes is ℓkα. Next, let Cj, ℓ + 1 ≤ j ≤ m, denote m − ℓ identical (n, k, d′)(α, β) classical exact repair linear regenerating codes over Fq, each having a file size B′ = ∑_{i=0}^{k−1} min(α, (d′ − i)β). For encoding, we first divide the data file of size B* = ℓkα + (m − ℓ)B′ into m stripes, such that the first ℓ have size kα, and the last m − ℓ have size B′. Stripe j, 1 ≤ j ≤ m, is encoded by Cj to generate the coded symbols cj = [c1,j, c2,j, . . . , cnα,j]. Next, consider an m × m invertible matrix A over Fq such that the first ℓ rows of A generate an (m, ℓ) MDS code over Fq. Let the matrix A be decomposed as

A_{m×m} = [ E_{ℓ×m} ; F_{(m−ℓ)×m} ]. (3.2)

Thus, the matrix E_{ℓ×m} generates an (m, ℓ) MDS code. The coded data stored in the various clusters is generated as follows:

[c′_1^T c′_2^T · · · c′_m^T] = [c_1^T c_2^T · · · c_m^T] A_{m×m}. (3.3)

The content of node j in cluster i is given by [c′_{(i−1)α+1,j}, c′_{(i−1)α+2,j}, . . . , c′_{iα,j}]^T, 1 ≤ i ≤ n, 1 ≤ j ≤ m. This completes the description of the construction. A pictorial overview of the description appears in Figure 3-5.
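The role of the invertible matrix A in (3.3) can be illustrated over a small prime field. The snippet below is a toy sketch (the codeword columns are random stand-ins; in the construction, c1 would come from an MDS code and c2 from a regenerating code): it stores [c′1 c′2] = [c1 c2]A and recovers the originals via A⁻¹, the invertibility fact underlying data collection.

```python
import numpy as np

P = 7  # small prime; arithmetic is over F_7 for illustration

def inv_mod_p(M, p):
    """Invert an integer matrix over F_p via the adjugate (fine for tiny m)."""
    det = int(round(np.linalg.det(M))) % p
    adj = np.round(np.linalg.inv(M) * np.linalg.det(M)).astype(int) % p
    return (pow(det, -1, p) * adj) % p

# Stand-ins for the stacked codeword columns c_1, c_2 (n = 3, alpha = 2,
# m = 2, l = 1); only their values matter for the transform step.
rng = np.random.default_rng(0)
C = rng.integers(0, P, size=(6, 2))

# Invertible A over F_7 whose first l = 1 row, [1 1], generates the
# (m, l) = (2, 1) repetition code -- an MDS code, as the construction requires.
A = np.array([[1, 1],
              [0, 1]])

C_stored = (C @ A) % P                     # per-node contents, per (3.3)
C_back = (C_stored @ inv_mod_p(A, P)) % P  # invertibility of A undoes the mix
assert np.array_equal(C_back, C)
```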
We next prove that the code described in Construction 3.4.1 is an optimal exact repair
generalized regenerating code, for any m and ℓ < m. The optimal code can be constructed whenever
Figure 3-5: Illustration of the exact repair code construction. We first stack ℓ MDS codes and (m − ℓ) classical regenerating codes, and then transform each row via the invertible matrix A. The first ℓ rows of the matrix A generate an (m, ℓ) MDS code.
an optimal (n, k, d′)(α, β) exact repair linear regenerating code exists, having a file size B′ = ∑_{i=0}^{k−1} min(α, (d′ − i)β).
It is clear that the code in Construction 3.4.1 has a file size B∗, where B∗ is as given in
Theorem 3.3.1. Further, the data collection property of the code is also straightforward to
check, and this essentially follows from the facts that 1) the matrix A is invertible, and 2)
each of the codes Ci, 1 ≤ i ≤ m is uniquely decodable given its coded data belonging to any k
clusters. To examine the repair properties of the code, let us rewrite (3.3) as follows:
[c′_1^T c′_2^T · · · c′_m^T] = [c_1^T c_2^T · · · c_m^T] A_{m×m} (3.4)
= [C_MDS C_regen] A_{m×m} (3.5)
= [ C^(1)_MDS C^(1)_regen ; C^(2)_MDS C^(2)_regen ; · · · ; C^(n)_MDS C^(n)_regen ] A_{m×m}, (3.6)

where C_MDS = [c_1^T · · · c_ℓ^T] and C_regen = [c_{ℓ+1}^T · · · c_m^T]. The matrices C^(i)_MDS and C^(i)_regen, 1 ≤ i ≤ n, denote rows (i − 1)α + 1, . . . , iα of C_MDS and C_regen, respectively. Let us also expand
the decomposition of the matrix A in (3.2) further, as follows:

A_{m×m} = [ E_{ℓ×m} ; F_{(m−ℓ)×m} ] (3.7)
= [ e_1^T e_2^T · · · e_m^T ; f_1^T f_2^T · · · f_m^T ], (3.8)

where e_j^T and f_j^T, 1 ≤ j ≤ m, denote the jth columns of the matrices E and F, respectively. Based on (3.6) and (3.8), it can be seen that the content of node j in cluster i is given by

[ C^(i)_MDS C^(i)_regen ] [ e_j^T ; f_j^T ]. (3.9)
Given the notation above, without loss of generality, consider repairing node ℓ + 1 in cluster
1 with the help of 1) the first ℓ local nodes in cluster 1 and 2) clusters 2, . . . , d′ + 1. Let us
first examine the role of the ℓ local nodes in the repair process. Let E′ and F ′ denote the first
ℓ columns of E and F , respectively. By assumption, E generates an (m, ℓ) MDS code, and
hence the submatrix E′ is invertible. In this case, the content from the ℓ local nodes can be
put together to generate

[ C^(1)_MDS C^(1)_regen ] [ E′ ; F′ ] E′⁻¹ e_{ℓ+1}^T = [ C^(1)_MDS C^(1)_regen ] [ e_{ℓ+1}^T ; f̂_{ℓ+1}^T ], (3.10)

where f̂_{ℓ+1}^T = F′E′⁻¹e_{ℓ+1}^T. Thus, the local helper nodes serve to recover the part corresponding to the MDS codes' components, given by C^(1)_MDS e_{ℓ+1}^T. However, the regenerating codes' component C^(1)_regen f̂_{ℓ+1}^T differs from the original C^(1)_regen f_{ℓ+1}^T.
Let us next examine the role of the d′ remote helper clusters. We know that the data stored
in cluster i is given by [C^(i)_MDS C^(i)_regen] A. Since the matrix A is invertible, C^(i)_regen can be
Figure 3-6: An illustration of the node repair process for the exact repair generalized regenerating code obtained in Construction 3.4.1.
recovered from this. Using the regenerating property of the classical RC codes Cℓ+1, . . . , Cm, from C^(i)_regen cluster i computes helper symbols H^(i) ∈ F_q^{β×(m−ℓ)} for regenerating C^(1)_regen, and sends out H^(i)(f_{ℓ+1}^T − f̂_{ℓ+1}^T) ∈ F_q^β. Since the classical RCs used are identical and linear, the replacement node can regenerate C^(1)_regen(f_{ℓ+1}^T − f̂_{ℓ+1}^T) using the helper data from the d′ remote clusters, and combine it with the local helper data (see (3.10)) to correct the regenerating codes' component and restore the content of the lost node. A pictorial illustration of the repair process is shown in Figure 3-6.
3.4.2 A Functional Repair Code for Arbitrary Number of Failures
In this section, we show the existence of optimal functional repair codes over a finite field that
can tolerate an arbitrary number of repairs for the duration of operation of the system. We
show the existence for any (n, k, d ≥ k)(α, β)(m, ℓ = m − 1). The code construction combines
m−1 MDS vector codes C1, . . . , Cm−1 with an (n, k, d)(α, β) FR code Cm for the classical setting.
The code Cm is a deterministic one that can tolerate an arbitrary number of repairs for the
duration of operation of the system. Reference [36] guarantees the existence of such a code over Fq whenever q > q0, where q0 is entirely determined by the parameters (n, k, d)(α, β), and is
independent of the number of repairs performed over the lifetime of the code. By a deterministic
regenerating code, we mean that the regenerated data corresponding to a repair operation of a
given physical node is uniquely determined given the content of the helper nodes. As we shall
see, the fact that the code is deterministic is important to ensure the data collection property of our functional repair construction.
We first describe the code construction, along with the repair procedure, and then show the
optimality property of the code.
Construction 3.4.2 (Functional repair GRC). Let Cj, 1 ≤ j ≤ m − 1, denote (n, k) MDS vector codes over F_q^α. The amount of data that can be encoded with these m − 1 codes is (m − 1)kα. Next, let Cm denote an (n, k, d)(α, β) deterministic classical functional repair linear regenerating code as described above. The code Cm has a file size B′ = ∑_{i=0}^{k−1} min(α, (d − i)β). For encoding, we first divide the data file of size B* = ℓkα + (m − ℓ)B′ into m stripes, such that the first m − 1 have size kα, and the last one has size B′. Stripe j, 1 ≤ j ≤ m, is encoded by Cj to generate the coded symbols cj = [c1,j, c2,j, . . . , cnα,j]. Node j, 1 ≤ j ≤ m − 1, in cluster i stores the vector [c_{(i−1)α+1,j}, c_{(i−1)α+2,j}, . . . , c_{iα,j}]. The content of node m in cluster i is given by [c_{(i−1)α+1,m} − ∑_{j=1}^{m−1} c_{(i−1)α+1,j}, c_{(i−1)α+2,m} − ∑_{j=1}^{m−1} c_{(i−1)α+2,j}, . . . , c_{iα,m} − ∑_{j=1}^{m−1} c_{iα,j}], i.e., the sum of the contents of all m nodes in cluster i equals [c_{(i−1)α+1,m}, c_{(i−1)α+2,m}, . . . , c_{iα,m}]. This completes the description of the initial layout of the coded data. Since the code is a functional repair code, the code description is not complete unless we specify the procedure for node repair as well. We do this next.
Node Repair: Let yi,j(t) ∈ F_q^α denote the content of node j in cluster i after the tth repair (t ≥ 0) anywhere in the system. The quantities {yi,j(0), 1 ≤ i ≤ n, 1 ≤ j ≤ m} denote the initial content present in the system and are as described above. The repair procedure is such that the vector [∑_{j=1}^{m} y1,j(t), ∑_{j=1}^{m} y2,j(t), . . . , ∑_{j=1}^{m} yn,j(t)] remains a valid codeword of the functional repair regenerating code Cm, for every t ≥ 0 (to be proved below). Clearly, the above statement is true for t = 0. The repair procedure can be described recursively as follows: Let the tth repair be associated with node jf in cluster if. Each of the d remote helper clusters, say cluster i, internally computes ∑_{j=1}^{m} yi,j(t − 1), and passes the β helper symbols for the repair of ∑_{j=1}^{m} y_{if,j}(t − 1). The replacement node, first of all, regenerates y_{if}(t − 1), as a replacement to ∑_{j=1}^{m} y_{if,j}(t − 1), given the helper data from the d remote clusters. Next, since ℓ = m − 1, the replacement node gets access to the local helper data {y_{if,j}(t − 1), 1 ≤ j ≤ m, j ≠ jf}. The content that is eventually
stored in the replacement node is computed as follows:

y_{if,jf}(t) = y_{if}(t − 1) − ∑_{j=1, j≠jf}^{m} y_{if,j}(t − 1). (3.11)

For any other (i, j) ≠ (if, jf), we set

yi,j(t) = yi,j(t − 1). (3.12)
This completes the description of the repair process and the code construction.
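The invariant behind the repair rule (3.11)–(3.12) can be simulated. In the toy sketch below (our own stand-in, not the thesis' code Cm), the regeneration step is an exact repair stand-in that returns the previous cluster sum, so the per-cluster sums stay literally constant across repairs; with a true functional repair code the sums would change but would remain valid codewords of Cm.

```python
import random

q = 17                  # arithmetic mod a small prime
n, m, alpha = 4, 3, 2   # clusters, nodes per cluster, symbols per node

def vec_add(u, v): return [(a + b) % q for a, b in zip(u, v)]
def vec_sub(u, v): return [(a - b) % q for a, b in zip(u, v)]

def cluster_sum(y, i):
    s = [0] * alpha
    for j in range(m):
        s = vec_add(s, y[i][j])
    return s

rng = random.Random(1)
# y[i][j] = content of node j in cluster i at t = 0 (any initial layout
# works for checking the invariant)
y = [[[rng.randrange(q) for _ in range(alpha)] for _ in range(m)]
     for _ in range(n)]
sums0 = [cluster_sum(y, i) for i in range(n)]

def repair(y, i_f, j_f):
    # Exact repair stand-in for C_m: the "regenerated" value is the previous
    # cluster sum y_{i_f}(t-1) itself.
    regenerated = cluster_sum(y, i_f)
    # (3.11): new content = regenerated sum minus the l = m - 1 local helpers;
    # (3.12): every other node keeps its content.
    new_node = regenerated
    for j in range(m):
        if j != j_f:
            new_node = vec_sub(new_node, y[i_f][j])
    y[i_f][j_f] = new_node

for _ in range(20):      # an arbitrary sequence of single-node repairs
    repair(y, rng.randrange(n), rng.randrange(m))

# Each cluster's node-sum is still a valid codeword of C_m (with this
# exact repair stand-in, literally the same one).
assert [cluster_sum(y, i) for i in range(n)] == sums0
```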
Next, we argue optimality of the (n, k, d)(α, β)(m, ℓ = m− 1) code described above. Specif-
ically, we show that the code retains the functional repair and data collection properties, after
every repair. We assume that the data collector is aware of the entire repair history of the
system. By this we mean that the data collector is aware of 1) the exact sequence of t failures
and repairs that has happened in the system, and 2) the indices of the remote helper clusters
that aided in each of the t repairs.
It is clear that the code in Construction 3.4.2 has a file-size B∗, as given by Theorem
3.3.1. To show that the code retains the functional repair property, it is sufficient to show that
the vector [∑_{j=1}^{m} y1,j(t), ∑_{j=1}^{m} y2,j(t), . . . , ∑_{j=1}^{m} yn,j(t)] remains a valid codeword of the FR
regenerating code Cm, for every t ≥ 0. We do this inductively. Clearly, the statement is true
for t = 0. Let us next assume that the statement is true for t = t′ ≥ 0, and show its validity for
t = t′+1. Assume that the (t′+1)th repair is associated with node jf in cluster if . The relation
between the content of the various nodes before and after the (t′ + 1)th repair is obtained via
(3.11) and (3.12). In this case, the quantities {∑_{j=1}^{m} yi,j(t′ + 1), 1 ≤ i ≤ n} are given by

∑_{j=1}^{m} y_{if,j}(t′ + 1) = y_{if}(t′) (by (3.11)), (3.13)

∑_{j=1}^{m} yi,j(t′ + 1) = ∑_{j=1}^{m} yi,j(t′), 1 ≤ i ≤ n, i ≠ if. (3.14)
Now, recall that y_{if}(t′) is the replacement of ∑_{j=1}^{m} y_{if,j}(t′), which is regenerated using the helper data generated from d elements of the set {∑_{j=1}^{m} yi,j(t′), 1 ≤ i ≤ n, i ≠ if}. Combining
with the induction hypothesis for t = t′, it follows that the induction statement holds for t = t′ + 1 as well. This completes the proof of the functional repair property of the code.
Let us next see how data collection is accomplished after t ≥ 0 repairs in the system. Without loss of generality, assume that a data collector connects to clusters 1, 2, . . . , k, and accesses {yi,j(t), 1 ≤ i ≤ k, 1 ≤ j ≤ m}. The data collector, as a first step, computes the vector [∑_{j=1}^{m} y1,j(t), ∑_{j=1}^{m} y2,j(t), . . . , ∑_{j=1}^{m} yk,j(t)], and uses this to decode the data corresponding to the code Cm. Now, recall the fact that the code Cm is deterministic, and also our assumption that the data collector is aware of the entire repair history of the system. In this case, having decoded Cm, using (3.11) and (3.12), the data collector can iteratively recover {yi,j(t′), 1 ≤ i ≤ k, 1 ≤ j ≤ m}, for t ≥ t′ ≥ 0, by starting at t′ = t and proceeding backwards until the content at t′ = 0 is recovered (essentially, we are rewinding the system by eliminating the effects of all the repairs, starting from the last one and proceeding backwards in time). Finally, from Construction 3.4.2, we know that the content {yi,j(0), 1 ≤ i ≤ k, 1 ≤ j ≤ m − 1} is the coded data corresponding to the m − 1 (n, k) MDS codes C1, . . . , Cm−1, and thus these codes can also be decoded. This completes the proof of data collection, and also of the optimality of the construction.
Chapter 4
GRC for Repair of Multiple Failures
In this chapter, we extend our GRC model to scenarios of multiple node failures. We consider the problem of recovery from t ∈ [m] node failures that occur in one of the n clusters. While a single-node failure is the most common failure event, correlated failures of nodes within a data center are an important issue reported in practice [51], and this motivates our failure model. The t newcomer nodes are added to the same cluster as replacements for those that failed. For restoring the content of the t new nodes, as before, we download external helper data from any set of d other clusters, β symbols each, and local helper data from any set of ℓ ≤ m − t surviving nodes in the failed cluster. We also restrict ourselves to the case d ≥ k, even though analysis of the case 0 ≤ d ≤ k − 1 is perfectly feasible.
A code satisfying the above model requirements shall be called a multi-node repair generalized regenerating code (MRGRC) C with parameters {(n, k, d), (α, β), (m, ℓ, t)}. We shall also use the auxiliary notation m − ℓ = at + b, a ≥ 1, 0 ≤ b ≤ t − 1.
4.1 Previous Work
The problem of multiple-node repair for classical RCs has been studied under the frameworks
of cooperative repair [16, 17] and centralized repair [18, 19]. In cooperative repair, the t replacement nodes first individually contact respective sets of d helper nodes and then communicate among themselves before restoring the new content. In centralized repair, a
centralized compute node downloads data from some subset of d nodes and generates the data
for all t replacement nodes. Our repair model can be considered as a centralized repair model
for clustered storage systems.
The regenerating code variations for cluster-like topologies listed in Section 3.2 all focus on single-node repair.
Repairing t ≥ 1 failures from the same cluster has been partially studied in [52] for the special case of ℓ = m − t, for which the authors show a file size upper bound. However, as we show later, the case 0 ≤ ℓ < m − t, t > 1, offers several surprising results which cannot be inferred from an analysis of the case ℓ = m − t, t > 1.
4.2 Exact Repair
4.2.1 ER Code Construction
A simple construction of exact repair MRGRCs for any t > 1 can be directly obtained from constructions for the case t = 1, whenever t | β. In order to construct an exact repair MRGRC C with parameters (n, k, d)(α, β)(m, ℓ, t), t | β, we start with an ER GRC C′ from Section 3.4.1 with parameters {(n, k, d)(α, β′ = β/t)(m, ℓ, t′ = 1)}, which, as we previously showed, exists whenever a classical ER (n, k, d)(α, β′) RC exists, with file size ∑_{i=0}^{k−1} min(α, (d − i)β′). The code C′ can be used as the code C as it is, if we assume that the repair of any group of t nodes in C happens one node at a time via the repair procedure in C′. Also, we use the same set of local and external helpers for the repair of all t failed nodes. The inter-cluster bandwidth for the repair of the entire group, per external helper, amounts to β = tβ′. The file size B that we obtain is
given by

B = B′ = ℓkα + (m − ℓ) ∑_{i=0}^{k−1} min(α, (d − i)β′)
= ℓkα + (m − ℓ) ∑_{i=0}^{k−1} min(α, (d − i)β/t). (4.1)
As we show next in this section, the file size B achieved by the construction is optimal, as
it reaches the upper bound for {(n, k, d), (α, β), (m, ℓ, t)} ER MRGRC given by the following
theorem.
Theorem 4.2.1 (ER MRGRC File Size Bound). The file size B of an MRGRC with parameters {(n, k, d), (α, β), (m, ℓ, t)} under the ER regime is upper bounded by

B ≤ B*_E = ℓkα + (m − ℓ) ∑_{i=0}^{k−1} min(α, (d − i)β/t). (4.2)

The bound is tight at the minimum storage-overhead (MSR) and the minimum inter-cluster repair-bandwidth-overhead (MBR) points, characterized by B = mkα and tα = dβ, respectively.
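The bound (4.2) and its relation to the construction of Section 4.2.1 can be checked numerically (assuming d ≥ k, so that (d − i)⁺ = d − i; the function name is ours):

```python
def mrgrc_er_bound(n, k, d, alpha, beta, m, l, t):
    """B*_E per (4.2): l*k*alpha + (m - l) * sum_{i=0}^{k-1} min(alpha, (d-i)*beta/t),
    assuming d >= k. (n is unused by the formula but kept for context.)"""
    return l * k * alpha + (m - l) * sum(
        min(alpha, (d - i) * beta / t) for i in range(k)
    )

n, k, d, m, l = 6, 3, 4, 4, 1
alpha, beta, t = 6.0, 3.0, 3
# Repairing the t failed nodes one at a time with beta' = beta/t, as in the
# construction, yields (4.1), which matches the t-failure bound (4.2) exactly.
single = mrgrc_er_bound(n, k, d, alpha, beta / t, m, l, 1)
multi = mrgrc_er_bound(n, k, d, alpha, beta, m, l, t)
assert single == multi == 45.0
# At the MBR point t*alpha = d*beta, the bound is strictly below the MSR
# value m*k*alpha, since the bandwidth terms constrain it.
assert mrgrc_er_bound(n, k, d, alpha, t * alpha / d, m, l, t) < m * k * alpha
```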
4.2.2 File Size Bound Proof
In this section, we present the proof of the file-size upper bound in (4.2) for exact repair codes.
We assume the code to be deterministic, i.e. the helper data is uniquely determined given the
indices of the t failed nodes, the local helper nodes, and the helper clusters. We begin with useful
notation. Let F denote the random variable corresponding to the data file that gets stored. We
assume F to be uniformly distributed over $\mathbb{F}_q^B$. Let $Y_{i,j} \in \mathbb{F}_q^{\alpha}$, $1 \le i \le n$, $1 \le j \le m$, denote the
content stored in node j of cluster i. The $Y_{i,j}$ are also random variables which depend on F. We
also use the following notation: $Y_{i,S} = \{Y_{i,j},\ \forall j \in S \subseteq [m]\}$, $\mathbf{Y}_i = Y_{i,[m]}$, $\mathbf{Y}_S = \cup_{i \in S \subseteq [n]} \mathbf{Y}_i$.
Since the file should be completely decodable from any set of k clusters, we have the following
entropy condition:
$$H(F \,|\, \mathbf{Y}_S) = 0 \quad \forall S \subset [n],\ |S| = k. \qquad (4.3)$$
Next, consider the repair of t nodes indexed by $R_i$ in cluster i. Let $H \subset [n] - \{i\}$, $|H| = d$, and
$L \subseteq [m] - R_i$, $|L| = \ell$, respectively denote the indices of the helper clusters and the local nodes that aid
in the repair process. Let $Z_{i',R_i}^{H,L}$ denote the external helper data passed by cluster i′. The property
of exact repair is jointly characterized by the following set of conditions:
$$H\big(Z_{i',R_i}^{H,L} \,\big|\, \mathbf{Y}_{i'}\big) = 0 \qquad (4.4)$$
$$H\big(Z_{i',R_i}^{H,L}\big) \le \beta \qquad (4.5)$$
$$H\big(Y_{i,R_i} \,\big|\, \{Z_{i',R_i}^{H,L},\ i' \in H\},\ Y_{i,L}\big) = 0, \qquad \forall H \subset [n] - \{i\},\ |H| = d,\ \forall L \subseteq [m] - R_i,\ |L| = \ell. \qquad (4.6)$$
Although our proof technique for the file-size bound has some similarity with the
information-theoretic techniques in works such as [19], [53], it differs in an important way. The
proofs in those works rely on the chain rule of entropy, and so does ours; here, however,
we demand that the chain be expanded in a specific order. The following lemma is used
to determine this order. The lemma is required only when b > 0; when b = 0, the bound proof
does not need it.
Lemma 4.2.2 (MRGRC Chain Order). Let b > 0, i.e., t ∤ (m − ℓ). Consider any $S_i \subset [n]$,
$|S_i| = i$, $1 \le i \le k-1$. Then, for any $i' \in [n] - S_i$, there exists a permutation $\sigma_{i',S_i}$ of
$\{\ell+1, \ell+2, \ldots, m\}$ such that
$$H\big(Y_{i',\sigma_{i',S_i}(j')} \,\big|\, \mathbf{Y}_{S_i},\ Y_{i',[1,\ell]},\ \{Y_{i',\sigma_{i',S_i}(j)}\}_{j \in [\ell+1,j'-1]}\big) \le \min\Big(\alpha, \frac{(d-i)\beta}{t}\Big), \qquad (4.7)$$
for all $j' \in \{m-b+1, m-b+2, \ldots, m\}$.
The proof of the lemma is given in Appendix 9.1.
Proof of Exact Repair Upper Bound (4.2). We have
$$B = H(F) \le H(\mathbf{Y}_{[1,k]}) = \sum_{i'=1}^{k} H(\mathbf{Y}_{i'} \,|\, \mathbf{Y}_{[1,i'-1]})
= \sum_{i'=1}^{k} \Big( H(Y_{i',[1,\ell]} \,|\, \mathbf{Y}_{[1,i'-1]}) + H(Y_{i',[\ell+1,m]} \,|\, Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}) \Big)$$
$$\le \ell k \alpha + \sum_{i'=1}^{k} H(Y_{i',[\ell+1,m]} \,|\, Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}). \qquad (4.8)$$
Now, letting $\sigma = \sigma_{i',[1,i'-1]}$ be the permutation obtained from Lemma 4.2.2, we expand the
term $H(Y_{i',[\ell+1,m]} \,|\, Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]})$ in (4.8) in the order determined by the
permutation σ, as follows:
$$H(Y_{i',[\ell+1,m]} \,|\, Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}) = H(\{Y_{i',\sigma(j')},\ j' \in [\ell+1,m]\} \,|\, Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]})$$
$$\le \sum_{u=0}^{a-1} H(\{Y_{i',\sigma(\ell+ut+v)},\ v \in [t]\} \,|\, Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}) + \sum_{j'=m-b+1}^{m} H\big(Y_{i',\sigma(j')} \,\big|\, \mathbf{Y}_{[1,i'-1]},\ Y_{i',[1,\ell]},\ \{Y_{i',\sigma(j)}\}_{j \in [\ell+1,j'-1]}\big). \qquad (4.9)$$
Figure 4-1: An illustration of the information flow graph used in the cut-set based upper bound
for the file size under functional repair. We assume (n = 3, k = 2, d = 2)(m = 3, ℓ = 0, t = 2).
Only a subset of nodes is named to avoid clutter. Two batches, each of t = 2 nodes, fail and
get repaired, first in cluster 1 and then in cluster 3. We also indicate a possible choice of the
S − Z cut that results in the desired upper bound. We fail nodes in cluster 3 instead of cluster
2 only to make the figure compact.
Using (4.6), each term under the first summation in (4.9) is upper bounded by $\min(t\alpha, (d-i'+1)\beta)$,
while each term under the second summation in (4.9) is upper bounded using Lemma 4.2.2.
Thus, we get that
$$H(Y_{i',[\ell+1,m]} \,|\, Y_{i',[1,\ell]}, \mathbf{Y}_{[1,i'-1]}) \le a \min(t\alpha, (d-i'+1)\beta) + b \min\Big(\alpha, \frac{(d-i'+1)\beta}{t}\Big) = (m - \ell) \min\Big(\alpha, \frac{(d-i'+1)\beta}{t}\Big). \qquad (4.10)$$
The desired bound now follows by combining (4.8) with (4.10).
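The collapsing step in (4.10) uses the identity min(tα, x) = t · min(α, x/t) together with at + b = m − ℓ; a small exact-arithmetic check (illustrative code, our naming):

```python
from fractions import Fraction

def rhs_terms(alpha, beta, d, i, t, a, b):
    # a*min(t*alpha, (d-i+1)*beta) + b*min(alpha, (d-i+1)*beta/t)
    x = (d - i + 1) * Fraction(beta)
    return a * min(t * alpha, x) + b * min(Fraction(alpha), x / t)

def collapsed(alpha, beta, d, i, t, a, b):
    # (m - ell) * min(alpha, (d-i+1)*beta/t), with m - ell = a*t + b
    x = (d - i + 1) * Fraction(beta)
    return (a * t + b) * min(Fraction(alpha), x / t)

# The two expressions agree for all parameter choices, since
# min(t*alpha, x) = t * min(alpha, x/t):
for alpha in range(1, 5):
    for beta in range(1, 5):
        for t in range(1, 4):
            for a in range(1, 3):
                for b in range(t):
                    for d in range(1, 4):
                        for i in range(1, d + 1):
                            assert rhs_terms(alpha, beta, d, i, t, a, b) == \
                                collapsed(alpha, beta, d, i, t, a, b)
```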
4.3 Functional Repair
In this section, we present the file-size upper bound under functional repair via IFG analysis.
4.3.1 Information Flow Graph Model
The IFG used here (see Fig. 4-1) is a generalization of the one considered in section 3.1.1 for the
case of t = 1. When a cluster, say i, experiences a batch of t failures, the whole cluster becomes
inactive and is replaced with a new active cluster. In the new cluster, a special repair node $X_i^{rep}$
is used to combine local and external helper data and generate the content of the replacement
nodes. The out-nodes of the ℓ local helper nodes connect to $X_i^{rep}$ via links of capacity α,
and the external nodes of the d helper clusters connect to $X_i^{rep}$ via links of capacity β. Also,
$X_i^{rep}$ connects to the in-nodes of the replacement nodes via links of capacity α. Further, the
m − t nodes which did not experience failure in the inactive cluster are copied as such into the
new active cluster. At any point in time, physical cluster i corresponds to one active and $f_i$ inactive
clusters in the IFG, where $f_i \ge 0$ denotes the total number of batch failures and repairs in the
cluster. We write $\mathbf{X}_i(\tau)$, $0 \le \tau \le f_i$, to denote the cluster in the IFG after the τ-th (batch)
repair associated with cluster i, and use $R_i(\tau)$, $0 \le \tau \le f_i - 1$, to denote the indices of the nodes
that fail in $\mathbf{X}_i(\tau)$. The clusters $\mathbf{X}_i(0), \ldots, \mathbf{X}_i(f_i - 1)$ are inactive, while $\mathbf{X}_i(f_i)$ is active, after $f_i$
repairs. The nodes of $\mathbf{X}_i(\tau)$ are denoted by $X_{i,j}^{in}(\tau)$, $X_{i,j}^{out}(\tau)$, $X_i^{ext}(\tau)$, $X_i^{rep}(\tau)$ (there is no
repair node if τ = 0).
4.3.2 File Size Upper Bound
Theorem 4.3.1 (FR MRGRC Capacity). The file size B of a GRC with parameters {(n, k, d),
(α, β), (m, ℓ, t)} under the FR regime is upper bounded by
$$B \le B^*_F = \ell k \alpha + a \sum_{i=0}^{k-1} \min(t\alpha, (d-i)\beta) + \sum_{i=0}^{k-1} \min(b\alpha, (d-i)\beta). \qquad (4.11)$$
The bound is tight if there is a known upper bound on the number of repairs in the system.
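The bound (4.11) can be evaluated directly from the parameters; an illustrative sketch (our naming), with a and b computed from m − ℓ = at + b:

```python
def fr_mrgrc_bound(k, d, alpha, beta, m, ell, t):
    # B*_F per eq. (4.11), with m - ell = a*t + b, 0 <= b <= t - 1
    a, b = divmod(m - ell, t)
    return (ell * k * alpha
            + a * sum(min(t * alpha, (d - i) * beta) for i in range(k))
            + sum(min(b * alpha, (d - i) * beta) for i in range(k)))

# e.g. the (n = 3, k = 2, d = 2)(m = 3, ell = 0, t = 2) example with alpha = beta = 2
bound = fr_mrgrc_bound(2, 2, 2, 2, 3, 0, 2)
```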
Proof. To show the upper bound, it is enough to demonstrate a sequence of batch failures and
a set of k clusters used by a data collector, such that there exists a cut between the source
and the data collector with capacity no more than $B^*_F$. In the example sequence that we
consider, clusters 1 to k are used for data collection and experience node failures. At each
of these clusters, a + 1 batch failures occur (recall that a and b are defined by m − ℓ = at + b, a ≥ 1,
0 ≤ b ≤ t − 1). Together, they cover the first m − ℓ nodes of a cluster. Specifically, at cluster
i ∈ [k] the first batch failure affects the last t of these nodes: $R_i(0) = \{m-\ell-t+1, \ldots, m-\ell\}$.
The remaining batch failures affect disjoint sets of t nodes starting from the first node $X_{i,1}$:
$R_i(1) = \{1, \ldots, t\}$, $R_i(2) = \{t+1, \ldots, 2t\}$, up to $R_i(a) = \{(a-1)t+1, \ldots, at\}$.
In all cases, the last ℓ nodes in a cluster provide the local helper data. For repairs in cluster
i, clusters 1, . . . , i − 1 and n − (d − i), . . . , n serve as helper clusters. Failures first occur in
cluster 1, then in clusters 2, 3, etc. until cluster k.
In the IFG corresponding to the described failure sequence, cluster $\mathbf{X}_i(a+1)$ is active for
each i ∈ [k]. Let $\tau_j$ be such that the cluster $\mathbf{X}_i(\tau_j)$ appears in the IFG right after the last repair
of node $X_{i,j}$ (we say "last repair" since nodes whose indices belong to $R_i(0) \cap R_i(a)$ fail twice
in our sequence of failures; the other nodes in cluster i fail only once). Consider a cut-set (U, V)
consisting of the following edges:

• $X_{i,j}^{in}(a+1) \xrightarrow{\alpha} X_{i,j}^{out}(a+1)$, ∀i ∈ [k], j ∈ [m − ℓ + 1, m]. The total capacity of these edges is
ℓkα.

• For all i ∈ [k]:

– The edge set $X_i^{rep}(\tau_j) \xrightarrow{\alpha} X_{i,j}^{in}(\tau_j)$, j ∈ [at], or the edge set $X_{i'}^{ext}(0) \xrightarrow{\beta} X_i^{rep}(\tau_j)$, ∀i′ ∈ [n − (d − i), n], j ∈ {t, 2t, …, at}, whichever set capacity is smaller. The total capacity of these
edges is $a \min(t\alpha, (d-i+1)\beta)$.

– If b > 0: the edge set $X_i^{rep}(\tau_j) \xrightarrow{\alpha} X_{i,j}^{in}(\tau_j)$, j ∈ [at + 1, m − ℓ], or the edge set $X_{i'}^{ext}(0) \xrightarrow{\beta} X_i^{rep}(\tau_j)$, ∀i′ ∈ [n − (d − i), n], j = m − ℓ, whichever set capacity is smaller. The total
capacity of these edges is $\min(b\alpha, (d-i+1)\beta)$.

The value of the cut is given by $\ell k \alpha + a \sum_{i=1}^{k} \min(t\alpha, (d-i+1)\beta) + \sum_{i=1}^{k} \min(b\alpha, (d-i+1)\beta) = B^*_F$, which proves the bound.
We illustrate the proof by an example for the special case (n = 3, k = 2, d = 2)(α,
β)(m = 3, ℓ = 0, t = 2). Note that for this special case t ∤ (m − ℓ), which helps us
illustrate the difference between functional and exact repair. Consider the following sequence
of 4 batches of failures and repairs (see Fig. 4-1). Batches 1 and 2 are associated with cluster
1, with $R_1(0) = \{2,3\}$ and $R_1(1) = \{1,2\}$. Batches 3 and 4 are associated with cluster 3,
with $R_3(0) = \{2,3\}$ and $R_3(1) = \{1,2\}$. There is no local help in this example; cluster 1
receives external help from $X_2^{ext}(0)$ and $X_3^{ext}(0)$ for both batches of repairs, while cluster 3
receives external help from $X_2^{ext}(0)$ and $X_1^{ext}(2)$ for its repairs. Consider data collection by
connecting to $X_1^{ext}(2)$ and $X_3^{ext}(2)$, and consider the S-Z cut whose edges are found as follows.
To disconnect $X_{1,1}^{out}(2)$ and $X_{1,2}^{out}(2)$, we remove (based on whichever has smaller
capacity) either the two edges $X_{1,1}^{in}(2) \to X_{1,1}^{out}(2)$ and $X_{1,2}^{in}(2) \to X_{1,2}^{out}(2)$, or the set of helper edges
$X_2^{ext}(0) \to X_1^{rep}(2)$ and $X_3^{ext}(0) \to X_1^{rep}(2)$. To disconnect $X_{1,3}^{out}(2)$, we either remove
the single edge $X_{1,3}^{in}(1) \to X_{1,3}^{out}(1)$ or the set of two helper edges $X_2^{ext}(0) \to X_1^{rep}(1)$ and
$X_3^{ext}(0) \to X_1^{rep}(1)$. The set of edges that disconnects cluster 3 is found similarly, except that if
we choose to disconnect links from external helpers, we only disconnect those from $X_2^{ext}(0)$ and
not $X_1^{ext}(2)$. The value of the cut forms an upper bound for B, and is given by $B \le \min(2\alpha, d\beta) + \min(\alpha, d\beta) + \min(2\alpha, (d-1)\beta) + \min(\alpha, (d-1)\beta)$, which is the same as the bound given by
(4.11).
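The cut value obtained in this example can be checked against (4.11) numerically; an illustrative sketch (our naming) for the (n = 3, k = 2, d = 2)(m = 3, ℓ = 0, t = 2) case:

```python
# Example case: n=3, k=2, d=2, m=3, ell=0, t=2, hence a=1, b=1
def cut_value(alpha, beta, d=2):
    # cut value read off the example's S-Z cut
    return (min(2 * alpha, d * beta) + min(alpha, d * beta)
            + min(2 * alpha, (d - 1) * beta) + min(alpha, (d - 1) * beta))

def bstar_f(alpha, beta, k=2, d=2, t=2, a=1, b=1, ell=0):
    # B*_F per eq. (4.11)
    return (ell * k * alpha
            + a * sum(min(t * alpha, (d - i) * beta) for i in range(k))
            + sum(min(b * alpha, (d - i) * beta) for i in range(k)))

# the two expressions agree for any alpha, beta:
for alpha in range(1, 10):
    for beta in range(1, 10):
        assert cut_value(alpha, beta) == bstar_f(alpha, beta)
```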
To prove achievability of the bound, we also show that for any valid IFG, regardless of the
specific sequence of failures and repairs, $B^*_F$ is indeed a lower bound on the minimum possible
value of any S-Z cut. Please see Appendix 9.2 for a proof of this fact, which establishes the
system capacity under functional repair.
4.4 Implications of the Bounds
Comparing the bounds
$$B^*_F = \ell k \alpha + at \sum_{i=0}^{k-1} \min\Big(\alpha, \frac{(d-i)\beta}{t}\Big) + b \sum_{i=0}^{k-1} \min\Big(\alpha, \frac{(d-i)\beta}{b}\Big) \qquad (4.12)$$
$$B^*_E = \ell k \alpha + (at + b) \sum_{i=0}^{k-1} \min\Big(\alpha, \frac{(d-i)\beta}{t}\Big), \qquad (4.13)$$
we note that $B^*_E \le B^*_F$. Specifically, when t | (m − ℓ) the bounds coincide; furthermore, in that
case they give the same storage-overhead vs. inter-cluster-bandwidth-overhead trade-off for any value of
t ≥ 1. This means that under FR there is no advantage in jointly repairing multiple nodes rather
than repairing them one at a time. Under ER, at the MSR and MBR points, there is no benefit in jointly repairing
multiple nodes for any t > 1, irrespective of whether t | (m − ℓ) or not.
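The comparison can be checked numerically; the following illustrative sketch (our naming) evaluates B*_F from (4.11) and B*_E from (4.13) with exact rational arithmetic:

```python
from fractions import Fraction

def bounds(k, d, alpha, beta, m, ell, t):
    # B*_E per (4.13) and B*_F per (4.11), with m - ell = a*t + b
    a, b = divmod(m - ell, t)
    bt = Fraction(beta, t)
    s = sum(min(Fraction(alpha), (d - i) * bt) for i in range(k))
    b_e = ell * k * alpha + (a * t + b) * s
    b_f = (ell * k * alpha + a * t * s
           + sum(min(b * alpha, (d - i) * beta) for i in range(k)))
    return b_e, b_f

# t | (m - ell): the bounds coincide.  t not dividing (m - ell): B*_E < B*_F is
# possible, e.g. at the MBR point t*alpha = d*beta with (k=4, d=4, t=2, m=3):
mbr_case = bounds(4, 4, 4, 2, 3, 0, 2)
```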
Figure 4-2: Trade-offs for an (n = 5, k = 4, d = 4)(m = 3, ℓ = 0, t = 2) system, plotted
between the MSR and the MBR points.

Figure 4-3: Impact of the number of local helper nodes, ℓ, on the file size for an (n = 7, k = 4,
d = 5, m = 17, t = 5) clustered storage system at the MBR point (α = 1, β = 1). Local help does
not provide any advantage unless ℓ > 2.
When t ∤ (m − ℓ), it is possible that $B^*_F > B^*_E$. Specifically, at the MBR point with tα = dβ,
we have $B^*_F > B^*_E$ whenever k > 1. This also means that the storage-overhead vs. inter-cluster-bandwidth-overhead trade-off under FR for the case t > 1 (with k > 1) is strictly better than
that for the case t = 1. A comparison of the trade-offs between exact and functional repair for the
case {(n = 5, k = 4, d = 4)(m = 3, ℓ = 0, t = 2)} is shown in Fig. 4-2.
Another implication of the bounds relates to the usefulness of the number of local helper
nodes ℓ used in the repair process. Under FR, for the case of t = 1 studied in Chapter 3,
if we fix n, k, d,m, α, β, the optimal file-size increases strictly monotonically with ℓ, whenever
α > (d − k + 1)β (i.e., if we exclude the MSR point). However, strict monotonicity is not
necessarily true when t > 1. Specifically, at the MBR point, it is straightforward to show that
whenever (m mod t) ≤ ⌊(d−k+1)t/d⌋, for any ℓ in the range 0 ≤ ℓ ≤ (m mod t), the capacity
is as good as with no local help at all (see Fig. 4-3).
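This behavior can be observed numerically by evaluating B*_F as a function of ℓ; the following illustrative sketch (our naming) reproduces the setting of Fig. 4-3:

```python
def fr_bound(k, d, alpha, beta, m, ell, t):
    # B*_F per eq. (4.11), with m - ell = a*t + b
    a, b = divmod(m - ell, t)
    return (ell * k * alpha
            + a * sum(min(t * alpha, (d - i) * beta) for i in range(k))
            + sum(min(b * alpha, (d - i) * beta) for i in range(k)))

# Setting of Fig. 4-3: (k=4, d=5, m=17, t=5) at MBR (alpha=1, beta=1, t*alpha=d*beta)
caps = [fr_bound(4, 5, 1, 1, 17, ell, 5) for ell in range(6)]
# m mod t = 2 = floor((d-k+1)*t/d): the capacity stays flat for ell in [0, 2]
# and starts increasing only from ell = 3 onward
```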
Chapter 5
Intra-Cluster Bandwidth of GRCs
The GRC model introduced in Chapter 3 does not consider the intra-cluster bandwidth incurred
during repair. Intra-cluster bandwidth is needed to generate the external helper data to be sent
from the helper clusters and to download content from ℓ local helper nodes in the host cluster.
In this chapter, we characterize the amount of intra-cluster bandwidth that is needed to achieve
the optimal trade-off between storage overhead and inter-cluster repair bandwidth identified in
section 3.3.2. We consider the repair model where the replacement node downloads at most
γ, γ ≤ α symbols from each of the ℓ local helper nodes from the host-cluster. We also assume that the β symbols contributed
by a remote helper cluster are only a function of at most ℓ′, ℓ′ ≤ m nodes of the cluster. We
make the assumption that any set of ℓ′ nodes can be used to compute the β symbols. Further,
we limit the amount of data that each of these ℓ′ nodes can contribute to at most γ′ ≤ α
symbols. The goal of this chapter is to identify necessary requirements on the parameters
γ, ℓ′, γ′ that are needed for achieving the optimal trade-off between storage and inter-cluster
bandwidth, defined by the maximum file-size equation
$$B^* \triangleq \ell k \alpha + (m - \ell) \sum_{i=0}^{k-1} \min\{\alpha, (d-i)^+ \beta\}. \qquad (5.1)$$
5.1 Local Helper Bandwidth in the Host Cluster
In this section we focus on the intra-cluster bandwidth in the host cluster, consumed by
communicating ℓγ helper symbols from the local helper nodes. In the following theorem, we find the
minimal value γ* of γ required for the optimal trade-off (5.1), and show that this value γ* is
tight, i.e., sufficient for achieving the trade-off.
Figure 5-1: An illustration of the evolution of the k-th cluster of the information flow graph
used in the cut-set based lower bound for γ in Theorem 5.1.1. In this figure, we assume that m = 4,
ℓ = 2. Nodes 3, 4, 1 fail in this respective order. For the repair of node 3, nodes 1 and 2 act
as the local helper nodes. For the repair of the remaining two nodes, nodes 2 and 3 act as the
local helper nodes. Also indicated is our choice of the S-Z cut used in the bound derivation.
Theorem 5.1.1 (GRC Local Intra-cluster Bandwidth). For an optimal functional repair GRC
with parameters (n, k, d > 0), (α, β), (m, ℓ), γ′ = α, ℓ′ = m, the local helper node bandwidth γ is
lower bounded by
$$\gamma \ge \gamma^* \triangleq \alpha - (d-k+1)^+ \beta. \qquad (5.2)$$
Further, if there is a known upper bound on the number of repairs that occur over the lifetime
of the system, the above bound is tight; i.e., the functional repair capacity of the system remains
$B^*$ as long as γ ≥ γ*.
Proof. We consider an IFG model similar to the main GRC model in Section 3.1.1, except that
now a replacement in-node $X_{i,j}^{in}$ connects to ℓ inactive helper out-nodes $X_{i,j'}^{out}$ in the same cluster
via edges of capacity γ instead of α.
For the lower bound, consider the same system evolution as in the proof of the upper bound
in Theorem 3.3.1, except for the k-th cluster accessed by the data collector. Thus, physical
nodes $X_{i,\ell+1}, X_{i,\ell+2}, \ldots, X_{i,m}$ fail in this order in cluster i = 1, then in cluster i = 2, and so on,
until cluster i = k − 1. Note that each of the first k − 1 clusters experiences a total of m − ℓ node
failures. For cluster k, we consider the failure of m − ℓ + 1 nodes, corresponding to physical nodes
$X_{k,\ell+1}, X_{k,\ell+2}, \ldots, X_{k,m}, X_{k,1}$, in this respective order. In terms of the notation introduced in
Section 3.1.1, the sequence of failures in the k-th cluster corresponds to IFG nodes $X_{k,\ell+1}(0), X_{k,\ell+2}(1), \ldots, X_{k,m}(m-\ell-1), X_{k,1}(m-\ell)$. For the repair of $X_{k,\ell+1}(0)$, the local helper nodes used
are $X_{k,1}(0), \ldots, X_{k,\ell}(0)$. For the repair of any of the remaining nodes $X_{k,((\ell+t) \bmod m)+1}(t)$,
$1 \le t \le m - \ell$, the local helper nodes used are $X_{k,2}(t), X_{k,3}(t), \ldots, X_{k,\ell+1}(t)$. Also, clusters
$\mathbf{X}_1(m-\ell), \mathbf{X}_2(m-\ell), \ldots, \mathbf{X}_{\min(d,k-1)}(m-\ell)$ are included in the set of remote clusters that aid
in the repair of the m − ℓ + 1 nodes in the k-th cluster. An illustration of the IFG for the k-th
cluster is shown in Fig. 5-1. Note in this figure that the edges corresponding to local help have
capacity γ.
Let the data collector Z connect to clusters $\mathbf{X}_1(m-\ell), \ldots, \mathbf{X}_{k-1}(m-\ell), \mathbf{X}_k(m-\ell+1)$. Consider
an S-Z cut in the IFG that partitions the graph nodes in clusters 1, …, k − 1 in the same way as
in the proof of Theorem 3.3.1; it differs, however, in the way the nodes of cluster k are partitioned.
The overall set of edges in the cut-set is given below.

Clusters 1, …, k − 1:

• $\{(X_{i,j}^{in}(0) \to X_{i,j}^{out}(0)),\ i \in [k-1],\ j \in [\ell]\}$. The total capacity of these edges is (k − 1)ℓα.

• For each i ∈ [k − 1], t ∈ [m − ℓ], either the set of edges $\{(X_{i'}^{ext}(0) \to X_{i,\ell+t}^{in}(t)),\ i' \in \{\text{remote helper cluster indices for the replacement node } X_{i,\ell+t}^{in}(t)\} - [\min\{i-1, d\}]\}$, or the
edge $(X_{i,\ell+t}^{in}(t) \to X_{i,\ell+t}^{out}(t))$. Between the two possibilities, we pick the one with the
smaller sum-capacity; the total capacity of this part of the cut is then given by
$\sum_{i=1}^{k-1} \sum_{j=\ell+1}^{m} \min\{\alpha, (d - \min\{i-1, d\})\beta\} = (m - \ell) \sum_{i=1}^{k-1} \min\{\alpha, (d-i+1)^+ \beta\}$.

Cluster k:

• $(X_{k,1}^{out}(0) \to X_{k,\ell+1}^{in}(1))$ of capacity γ.

• $(X_{k,j}^{in}(0) \to X_{k,j}^{out}(0))$, ∀j ∈ [2, ℓ]. The total capacity of these edges is (ℓ − 1)α.

• Either the set of edges $\{(X_{i'}^{ext}(0) \to X_{k,((\ell+t) \bmod m)+1}^{in}(t+1)),\ i' \in \{\text{remote helper cluster indices for the replacement node } X_{k,((\ell+t) \bmod m)+1}^{in}(t+1)\} - [\min\{k-1, d\}],\ 0 \le t \le m-\ell\}$, or the set of edges $\{(X_{k,((\ell+t) \bmod m)+1}^{in}(t+1) \to X_{k,((\ell+t) \bmod m)+1}^{out}(t+1)),\ 0 \le t \le m-\ell\}$. Among the two sets, we pick the one with the smaller sum-capacity;
the total capacity of these edges is then $(m - \ell + 1) \min\{\alpha, (d-k+1)^+ \beta\}$.
The total cut capacity is given by
$$C_{cut} = (k-1)\ell\alpha + (m-\ell) \sum_{i=0}^{k-2} \min\{\alpha, (d-i)^+ \beta\} + \gamma + (\ell-1)\alpha + (m-\ell+1) \min\{\alpha, (d-k+1)^+ \beta\}$$
$$= k\ell\alpha + (m-\ell) \sum_{i=0}^{k-1} \min\{\alpha, (d-i)^+ \beta\} - \alpha + \min\{\alpha, (d-k+1)^+ \beta\} + \gamma = B^* - \alpha + \min\{\alpha, (d-k+1)^+ \beta\} + \gamma.$$
Since we assume an optimal code, it must be true that $C_{cut} \ge B^*$, which results in
$$\gamma \ge \alpha - \min\{\alpha, (d-k+1)^+ \beta\} = \alpha - (d-k+1)^+ \beta = \gamma^*,$$
since for an optimal code α ≥ (d − k + 1)⁺β. This proves the lower bound.
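The algebraic simplification of C_cut above can be verified numerically; an illustrative sketch (our naming; `pos` stands for the (x)⁺ = max(x, 0) notation, and ℓ ≥ 1 is assumed as in the cut):

```python
def pos(x):
    return max(x, 0)

def bstar(k, d, alpha, beta, m, ell):
    # optimal trade-off file size (5.1)
    return ell * k * alpha + (m - ell) * sum(
        min(alpha, pos(d - i) * beta) for i in range(k))

def c_cut(k, d, alpha, beta, m, ell, gamma):
    # cut capacity assembled in the proof of Theorem 5.1.1 (assumes ell >= 1)
    return ((k - 1) * ell * alpha
            + (m - ell) * sum(min(alpha, pos(d - i) * beta) for i in range(k - 1))
            + gamma + (ell - 1) * alpha
            + (m - ell + 1) * min(alpha, pos(d - k + 1) * beta))

# C_cut = B* - alpha + min(alpha, (d-k+1)^+ beta) + gamma, for any parameters:
for (k, d, alpha, beta, m, ell, gamma) in [(3, 3, 6, 2, 4, 2, 1),
                                           (4, 5, 7, 3, 6, 1, 0),
                                           (2, 2, 5, 5, 3, 2, 4)]:
    assert c_cut(k, d, alpha, beta, m, ell, gamma) == \
        bstar(k, d, alpha, beta, m, ell) - alpha \
        + min(alpha, pos(d - k + 1) * beta) + gamma
```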
We next prove the tightness of the bound: we show that, as long as γ ≥ γ*, the min-cut
of any valid IFG is necessarily lower bounded by B*. In that case, as in the proof of Theorem 3.3.1,
the functional repair capacity remains B*, as long as there is a known upper bound on the number of
repairs in the system. Consider the proof of the achievability part
of Theorem 3.3.1, where we obtained a lower bound on the min-cut of any valid IFG. One can
repeat the same sequence of arguments, except with the change that the edges corresponding
to local help have capacity γ instead of α. In this case, it can be seen that instead of (3.2), we
obtain the following lower bound on min-cut:
$$\mathrm{mincut}(S - Z) \ge \sum_{i=1}^{k} \Big( a_i \alpha + \sum_{j=a_i+1}^{m} \min(\alpha, (\ell-j+1)^+ \gamma + (d-(i-1))^+ \beta) \Big). \qquad (5.3)$$
In the above expression, observe that if γ ≥ γ*, for j ≤ ℓ, i ≤ k we have
$$(\ell-j+1)^+ \gamma + (d-(i-1))^+ \beta \ge \alpha. \qquad (5.4)$$
Therefore, (5.3) can be written as (3.2). It then follows that mincut(S − Z) is indeed lower
bounded by B* as long as γ ≥ γ*. This completes the proof of the tightness, and of the
theorem.
Note that for α = (d − k + 1)β (the MSR point for d ≥ k), the bound (5.2) gives γ ≥ 0.
Indeed, in this case the optimal file size B* = mkα can be achieved with γ = 0 by using m
classical MSR RCs with parameters (n, k, d), (α = (d − k + 1)β, β) and file size $B_{RC} = k\alpha$ each,
which perform repairs independently of each other, without using any helper data from the
local cluster.
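In code form (illustrative, our naming; `pos` is the (x)⁺ = max(x, 0) notation), the bound (5.2) and the MSR observation read:

```python
def pos(x):
    return max(x, 0)

def gamma_star(k, d, alpha, beta):
    # eq. (5.2): minimal local helper-node bandwidth for an optimal FR GRC
    return alpha - pos(d - k + 1) * beta

# At the MSR point alpha = (d - k + 1) * beta (d >= k), no local help is needed:
assert gamma_star(3, 4, alpha=(4 - 3 + 1) * 5, beta=5) == 0
# away from the MSR point the requirement grows linearly with alpha:
assert gamma_star(3, 4, alpha=12, beta=5) == 2
```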
5.2 External Helper Cluster Local Bandwidth
In this section, we provide lower bounds on the parameters γ′ and ℓ′. Unlike the previous
section, here we do not prove optimality of the bounds, which allows us to simplify our IFG model
by avoiding replication of the surviving nodes. A new external IFG node for a cluster is added to
the IFG any time the cluster is used for data collection or for generating helper data, so that each
external node is used exactly once. Whenever a physical node $X_{i,j}$ fails, we say that it becomes
inactive, and its replacement node, say $X_{i,j} = (X_{i,j}^{in}, X_{i,j}^{out})$, becomes active in the same cluster.
The remaining m − 1 nodes are not replicated, unlike in the previous IFG model. An external
node $X_{i',X_{i,j}}^{ext}$ used for generating helper data connects to a subset of ℓ′ active out-nodes in the
cluster via edges of capacity γ′; note how we index the external node of cluster i′ that aids in
the repair of $X_{i,j}$. A data collector Z connects to cluster i via the external node $X_{i,Z}^{ext}$, which
in turn connects to all m active out-nodes in the cluster via links of capacity α. In comparison
with the previous model, we do not time-index the sequence of failures in the current model.
This is because, in our proofs of the bounds for γ′ and ℓ′, we only consider system evolutions in
which each node fails at most once; in this case, we find it convenient to simply denote the
replacement node of $X_{i,j}$ as $X_{i,j}$.
Figure 5-2: An illustration of the IFG used in the cut-set based lower bound for γ′ in Theorem
5.2.1. In this example, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 1)(ℓ′ = 2, γ = α). The
second node fails in clusters 1 and 2, in the respective order. Also indicated is our choice of the
S-Z cut used in the bound derivation.

Theorem 5.2.1 (GRC External Helper Cluster Local Bandwidth). For an optimal functional
repair generalized regenerating code with parameters (n, k > 1, d), (α, β), (m, ℓ), γ = α, ℓ′ = m,
the remote helper-node repair bandwidth γ′ is lower bounded by
$$\gamma' \ge \gamma'^* \triangleq \max\{\gamma_1'^*, \gamma_2'^*\} \qquad (5.5)$$
$$\gamma_1'^* \triangleq \frac{\beta}{m} \qquad (5.6)$$
$$\gamma_2'^* \triangleq \frac{\min\{\beta,\ \alpha - (d-k+1)^+ \beta\}}{m - \ell}. \qquad (5.7)$$
Proof. The first bound (5.6) follows directly from code optimality. Indeed, for an optimal
code, the helper data sent out by a cluster cannot be redundant, and its size β cannot be larger
than the total size ℓ′γ′ = mγ′ of the components from which it is generated.
However, it turns out that on most points of the trade-off, the second bound (5.7) is tighter
than the first. To prove it, we consider data collection from clusters 1 to k. Before data
collection, the system experiences k(m− ℓ) repairs. Nodes ℓ+ 1, . . . ,m fail and get repaired in
cluster 1 in this respective order. This is followed by failure and repair of nodes ℓ + 1, . . . ,m
in cluster 2, and so on, until we consider failure and repair of nodes ℓ + 1, . . . ,m in cluster k.
In terms of physical nodes, it may be noted that this is the same sequence of failures that was
considered in the proof of Theorem 3.3.1; here, however, we will impose additional restrictions
on the choice of the remote helper clusters. The external help is taken from the set of the first
d+ 1 clusters, excluding the cluster where the failed node resides. Thus, for the repair of Xi,j ,
the indices of the remote helper clusters are [1, i − 1] ∪ [i + 1, d + 1]. The choice of the local helper
nodes remains the same as in the proof of Theorem 3.3.1, where we used the first ℓ nodes in the
cluster. An illustration of the IFG is shown in Figure 5-2.
It can be seen that the following cut-set separates the source from the data collector:

• $\{(X_{i,j}^{in} \to X_{i,j}^{out}),\ i \in [k],\ j \in [\ell]\}$. The total capacity of these edges is kℓα.

• For each i, 1 ≤ i ≤ k, the edge set with smaller capacity out of $A_1(i) \cup A_2(i)$ and $A_3(i)$,
where

– $A_1(i) \triangleq \{(X_{i'}^{ext} \to X_{i,j}^{in}),\ i' \in [k+1, d+1],\ j \in [\ell+1, m]\}$. The total capacity of the edges
in $A_1(i)$ is $(d-k+1)^+ (m-\ell) \beta$.

– $A_2(i) \triangleq \{(X_{i',j'}^{out} \to X_{i',X_{i,j}}^{ext}),\ j \in [\ell+1, m],\ i' \in [i+1, \min\{k, d+1\}],\ j' \in [\ell+1, m]\}$.
The total capacity of the edges in $A_2(i)$ is $(m-\ell)(\min\{k, d+1\} - i)^+ (m-\ell) \gamma'$.

– $A_3(i) \triangleq \{(X_{i,j}^{in} \to X_{i,j}^{out}),\ j \in [\ell+1, m]\}$. The total capacity of the edges in $A_3(i)$ is
$(m-\ell)\alpha$.

The capacity of the cut-set is given by
$$C_{cut} = k\ell\alpha + (m-\ell) \sum_{i=1}^{k} \min\{\alpha,\ (d-k+1)^+ \beta + (\min\{k, d+1\} - i)^+ (m-\ell) \gamma'\}.$$
Since we consider optimal codes, we necessarily have
$$C_{cut} \ge B^* = k\ell\alpha + (m-\ell) \sum_{i=1}^{k} \min\{\alpha, (d-i+1)^+ \beta\},$$
which requires, for all i ∈ [k],
$$\min\{\alpha,\ (d-k+1)^+ \beta + (\min\{k, d+1\} - i)^+ (m-\ell) \gamma'\} \ge \min\{\alpha, (d-i+1)^+ \beta\},$$
i.e., for all i ∈ [k], at least one of the inequalities
$$(d-k+1)^+ \beta + (\min\{k, d+1\} - i)^+ (m-\ell) \gamma' \ge (d-i+1)^+ \beta,$$
$$(d-k+1)^+ \beta + (\min\{k, d+1\} - i)^+ (m-\ell) \gamma' \ge \alpha$$
holds. For i ∈ [min{k, d + 1} − 1], these yield, respectively,
$$\gamma' \ge \frac{\beta}{m-\ell} \qquad \text{or} \qquad \gamma' \ge \frac{\alpha - (d-k+1)^+ \beta}{(m-\ell)(\min\{k, d+1\} - i)}.$$
In particular, taking i = min{k, d + 1} − 1, we obtain
$$\gamma' \ge \frac{\min\{\beta,\ \alpha - (d-k+1)^+ \beta\}}{m-\ell}.$$
Corollary 5.2.2. For an optimal FR GRC with parameters (n, k > 1, d), (α ≥ (d − k + 2)β,
β), (m, ℓ), γ = α, ℓ′ = m, the remote helper-node repair bandwidth γ′ is lower bounded by
$$\gamma' \ge \gamma_2'^* = \frac{\beta}{m - \ell}. \qquad (5.8)$$
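The bounds (5.5)-(5.7) and the corollary's specialization can be checked with exact rational arithmetic; an illustrative sketch (our naming; `pos` is the (x)⁺ notation):

```python
from fractions import Fraction

def pos(x):
    return max(x, 0)

def gamma_prime_star(k, d, alpha, beta, m, ell):
    # max of (5.6) and (5.7), computed exactly with Fraction
    g1 = Fraction(beta, m)                                            # (5.6)
    g2 = Fraction(min(beta, alpha - pos(d - k + 1) * beta), m - ell)  # (5.7)
    return max(g1, g2)

# Corollary 5.2.2: when alpha >= (d-k+2)*beta, the bound reduces to beta/(m-ell)
k, d, beta, m, ell = 3, 4, 2, 5, 2
alpha = (d - k + 2) * beta  # smallest alpha covered by the corollary
assert gamma_prime_star(k, d, alpha, beta, m, ell) == Fraction(beta, m - ell)
```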
The following theorem establishes the necessary condition on ℓ′ for optimal codes. Recall
that, by definition of ℓ′, the helper cluster must be able to generate the helper symbols from an
arbitrary subset of ℓ′ of its nodes.
Theorem 5.2.3 (GRC External Helper Cluster I/O). For an optimal functional repair GRC
with parameters (n, k, d), (α, β > 0), (m, ℓ), γ = γ′ = α, necessarily
$$\ell' \ge \ell'^* \triangleq m. \qquad (5.9)$$
That is, all m nodes in a helper cluster should contribute to the helper data.
Proof. Consider a system evolution with k(m − ℓ) repairs. Nodes with indices in the range [ℓ + 1,
ℓ′] fail and get repaired in cluster 1, then in cluster 2, etc., until cluster d + 1. This is followed
by the failure of nodes [max{ℓ, ℓ′}, m] in cluster 1, then in cluster 2, etc., until cluster k. For all the
failures, the local help is provided by the first ℓ nodes in each host cluster. The external help for
a failed node in cluster i is taken from clusters [1, d + 1] − i, and the first ℓ′ nodes in each helper
cluster are always used to generate the β symbols of external helper data. Data collection is
performed from clusters 1 to k. An illustration of the IFG is shown in Figure 5-3.

Figure 5-3: An illustration of the IFG used in the cut-set based lower bound for ℓ′ in Theorem
5.2.3. In this example, we assume (n = 3, k = 2, d = 2)(m = 2, ℓ = 0)(ℓ′ = 1, γ = γ′ = α). The
second node fails in clusters 1 and 2, in the respective order. Also indicated is our choice of the
S-Z cut used in the bound derivation.
It can be seen that the following cut-set separates the source from the data collector:

• $\{(X_{i,j}^{in} \to X_{i,j}^{out}),\ i \in [k],\ j \in [\ell]\}$. The total capacity of these edges is kℓα.

• For each cluster i, 1 ≤ i ≤ k, the edge set with smaller capacity among $A_1(i)$ and $A_2(i)$,
where

– $A_1(i) \triangleq \{(X_{i'}^{ext} \to X_{i,j}^{in}),\ i' \in [i+1, d+1],\ j \in [\ell+1, \ell']\}$. The total capacity of the edges
in $A_1(i)$ is $(d-i+1)^+ (\ell'-\ell) \beta$.

– $A_2(i) \triangleq \{(X_{i,j}^{in} \to X_{i,j}^{out}),\ j \in [\ell+1, \ell']\}$. The total capacity of the edges in $A_2(i)$ is
$(\ell'-\ell)^+ \alpha$.

The capacity of the cut-set is given by
$$C_{cut} = k\ell\alpha + (\ell'-\ell)^+ \sum_{i=1}^{k} \min\{\alpha, (d-i+1)^+ \beta\}. \qquad (5.10)$$
Since we consider optimal codes, we necessarily have
$$C_{cut} \ge B^* = k\ell\alpha + (m-\ell) \sum_{i=1}^{k} \min\{\alpha, (d-i+1)^+ \beta\}, \qquad (5.11)$$
which results in ℓ′ ≥ m.
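The comparison of (5.10) with (5.11) can be made concrete; the following illustrative sketch (our naming) shows that the cut capacity falls below B* as soon as ℓ′ < m (for β > 0):

```python
def pos(x):
    return max(x, 0)

def bstar(k, d, alpha, beta, m, ell):
    # optimal file size (5.1)
    return ell * k * alpha + (m - ell) * sum(
        min(alpha, pos(d - i) * beta) for i in range(k))

def c_cut_lprime(k, d, alpha, beta, m, ell, lprime):
    # cut capacity (5.10) when helper data is generated from only lprime nodes
    return ell * k * alpha + pos(lprime - ell) * sum(
        min(alpha, pos(d - i) * beta) for i in range(k))

# With lprime = m the cut matches B*; with lprime < m (and beta > 0) it drops
# below B*, contradicting code optimality -- hence l' >= m.
```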
5.3 Optimality and Implications of the Intra-cluster Bandwidth
Bounds
In the previous sections, we provided lower bounds for intra-cluster bandwidth parameters γ,
γ′, ℓ′. We analytically showed that the bounds for ℓ′ and γ (under FR) are optimal. In this
section, we perform numerical RLNC simulations to study the optimality of the other bounds and
the simultaneous tightness of all the bounds. For a given operating point on the optimal trade-off
(Figure 3-4) with a fixed β, we generate a random $B^* \times nm\alpha$ matrix over $\mathbb{F}_{65537}$, whose
columns are global coding vectors of the nmα symbols (or packets) stored in the system. We
simulate iterations of failure/repair by replacing the columns corresponding to the failed symbols
with random linear combinations of the corresponding helper symbols, according to parameters
d, β, while the helper symbols are computed according to γ, γ′, ℓ′. After each iteration, we
check that the code satisfies the data collection requirement by computing the rank of several
random subsets of kmα columns corresponding to the data collection clusters. Data collection is
successful if the rank is B*. The probability of decoding is estimated as the fraction of successful
data collections. If the GRC satisfies the data collection requirement, the estimated probability
of decoding should be 1.
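The verification loop just described can be sketched as follows. This is an illustrative, simplified non-clustered variant (a classical (n, k, d)(α, β) code at the MSR point, with our own function names), not the exact simulation code behind Figure 5-4; it shows the two ingredients: exact rank computation over F_65537 and the column-replacement repair step.

```python
import random

P = 65537  # prime field size used in the simulations

def rank_mod_p(rows, p=P):
    """Rank over F_p of a matrix given as a list of rows (Gaussian elimination)."""
    rows = [r[:] for r in rows if any(r)]
    rank, col = 0, 0
    ncols = len(rows[0]) if rows else 0
    while rank < len(rows) and col < ncols:
        piv = next((r for r in range(rank, len(rows)) if rows[r][col] % p), None)
        if piv is None:
            col += 1
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][col], p - 2, p)  # inverse via Fermat's little theorem
        rows[rank] = [x * inv % p for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col] % p:
                f = rows[r][col]
                rows[r] = [(a - f * b) % p for a, b in zip(rows[r], rows[rank])]
        rank += 1
        col += 1
    return rank

def combine(coeffs, vectors, p=P):
    # random linear combination of global coding vectors
    return [sum(c * v[s] for c, v in zip(coeffs, vectors)) % p
            for s in range(len(vectors[0]))]

def simulate(n=4, k=2, d=3, alpha=2, beta=1, iters=10, seed=1):
    """Failure/repair iterations followed by data-collection rank checks."""
    random.seed(seed)
    B = k * alpha  # MSR file size: here alpha = (d - k + 1) * beta
    # cols[j]: alpha global coding vectors (each of length B) stored at node j
    cols = [[[random.randrange(P) for _ in range(B)] for _ in range(alpha)]
            for _ in range(n)]
    ok = True
    for _ in range(iters):
        f = random.randrange(n)  # index of the failed node
        helpers = random.sample([j for j in range(n) if j != f], d)
        received = [combine([random.randrange(P) for _ in range(alpha)], cols[h])
                    for h in helpers for _ in range(beta)]
        # the replacement stores alpha random combinations of the helper data
        cols[f] = [combine([random.randrange(P) for _ in range(len(received))],
                           received) for _ in range(alpha)]
        # data collection: any k nodes should span F_P^B
        dc = random.sample(range(n), k)
        ok = ok and rank_mod_p([v for j in dc for v in cols[j]]) == B
    return ok
```

Over a field as large as F_65537, each rank check fails only with very small probability, which is why the estimated probability of decoding stays at 1 for a valid code.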
We simulate a system with parameters (n = 4, k = 3, d = 3), (ℓ = 1, m = 2), (α, β = 4)
at the MBR, MSR, and near-MSR points, with the latter having α = (d − k + 2)β. At
each operating point, we compute the intra-cluster parameter bounds γ*, γ′*, ℓ′* from (5.2),
(5.5), (5.9), and perform a test with these values of γ, γ′, ℓ′, followed by tests with one parameter
decreased while the other parameters are maximized. In each test, we estimate the probability of
decoding after each iteration. The results are presented in Figure 5-4.
Figure 5-4: Simulation results for a system with parameters (n = 4, k = 3, d = 3), (ℓ = 1,
m = 2), (α, β = 4), showing the probability of successful data collection against the number of node
repairs performed, for an RLNC-based GRC. The legends indicate the parameters (γ, γ′, ℓ′) for each
test. For all operating points ℓ′* = m = 2. (a) MBR point: α = dβ = 12, B* = 60, γ* = 8, γ′* = 4.
(b) Near-MSR point: α = 2β = 8, B* = 44, γ* = 4, γ′* = 4. (c) MSR point: α = β = 4, B* = 24,
γ* = 0, γ′* = 2.

The plots suggest that at all operating points the bounds (5.2), (5.5), (5.9) are tight for FR,
and can simultaneously be achieved by RLNC. Violating any single bound results in a loss of
the code data collection property after as few as 3 failure/repair iterations.
The bounds on ℓ′ and γ′ highlight the necessary trade-off between the system capacity B∗
and the remote helper intra-cluster bandwidth ℓ′γ′ = mγ′, via parameter ℓ, the key parameter
that distinguishes our model from the classical model. Our bounds reveal the interesting fact
that, while it is beneficial to increase the number of local helper nodes ℓ in order to improve
the trade-off between storage and inter-cluster bandwidth, increasing ℓ not only increases the
intra-cluster repair bandwidth in the host cluster but also increases the intra-cluster repair
bandwidth in the remote helper clusters. For example, for MBR GRC the storage overhead
approaches that of MSR codes for large m as ℓ approaches m. However, a high value of ℓ also
increases the remote helper cluster bandwidth; indeed, mγ′∗ surges as m− ℓ approaches 1. See
Figure 5-5 for an illustration.
[Figure 5-5: four plots of the performance metrics vs. ℓ.]
Figure 5-5: Illustrating the impact of ℓ on the various performance metrics. We operate at the MBR point with parameters {(n = 12, k = 8, d = n − 1), (α = dβ, β = 2)}. We see that while ℓ = m − 1 is ideal in terms of optimizing storage and inter-cluster BW, it imposes the maximum burden on intra-cluster BW.
Part II
Information Survival in Volatile
Networks
Chapter 6
Network Coding for Time-Varying
Networks
6.1 System Model
We consider a functional repair DSS with n storage nodes of size α symbols, which stores a
source file of size B symbols. Upon a node failure, the replacement node downloads β symbols
of helper data from each of d helper nodes and generates new α symbols to store. We assume a
stochastic model of node failures and helper node selection. For each failure, the index f of the
failed node is drawn from a probability distribution PF over [n], independently of other node
failures. A helper set H of d helper nodes is drawn (without replacement) from a probability
distribution PHi over [1, n] − i, independently of other failures and the corresponding helper
sets. We assume that the next failure happens only after the previous repair is complete. The
failure/repair iterations are indexed by discrete time t = 1, 2, 3, . . .. After a certain number of
failure and repair iterations, we call the storage operational if the source file can still be decoded
from the coded data on the nodes, and broken otherwise.
We study our system through the prism of RLNC packet storage. The source file is split
into k segments of size B/k = α/a for an integer a. As described in Section 2.3, a k-symbol header (coding vector) (0, . . . , 0, 1, 0, . . . , 0), with the 1 in the ith position, is added to the ith uncoded segment to form the ith source packet. During the initial storage setup, the k source packets are
recoded with RLNC into na coded packets, which are placed on the storage nodes. Since in practice the header length k is negligibly small in comparison with the segment size B/k, we shall ignore the extra storage taken by a packet's coding vector in the header, and assume that each node of size α stores a coded packets. Let the matrix M0 ∈ F_q^{k×na} be the initial system (global) coding matrix, whose columns are the coding vectors of the initial na coded packets in the storage. The first a columns of the global coding matrix correspond to the a packets on the first node, the next a columns correspond to the second node, and so forth. The elements of M0 are sampled uniformly, independently and identically distributed (i.i.d.), from Fq.
At each failure/repair iteration, each helper node generates b ≜ β/(α/a) helper packets of total size β by recoding over its a packets, and sends the new helper packets to the replacement node. The latter receives db helper packets and recodes them with RLNC into a fresh set of a packets to store. Decoding of the source file is possible if the DSS contains k packets with linearly independent coding vectors. We assume that α/a and β/(α/a) are integers. Let the matrix Mt ∈ F_q^{k×na}
be the global coding matrix after t iterations of failure and repair. After the next node failure
and repair, the coding matrix becomes Mt+1 = MtWt+1, where Wt+1 ∈ Fna×naq is a random
evolution matrix. Multiplying Mt by Wt+1 represents replacing the columns corresponding to
the failed node ft+1 with random linear combinations of the columns corresponding to the
helper nodes from the helper set Ht+1. If Mt|i is the k × a submatrix corresponding to node i,
then
M_{t+1}|_{f_{t+1}} = Σ_{j∈H_{t+1}} M_t|_j D_{t+1}^H(j) D_{t+1}^R(j),   (6.1)

where D_{t+1}^H(j) is the a × b recoding matrix at helper node j, and D_{t+1}^R(j) is the b × a part of the recoding matrix at the replacement node corresponding to helper j; the entire recoding matrix at the replacement node has dimensions db × a. We shall use W_{t1}^{t2} to denote the cumulative evolution matrix from time t1 to t2: W_{t1}^{t2} = W_{t1} W_{t1+1} · · · W_{t2}. Let also W^t = W_1^t. The coding matrix after t iterations is given by Mt = M0 W^t. An example of a 3-iteration system evolution, along with the corresponding evolution matrices, is shown in Figure 6-1.
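The update (6.1) can be made concrete with a minimal Python sketch of one failure/repair iteration. This is an illustrative toy, not the simulator used in the thesis: the prime modulus P standing in for a large field Fq, and the name repair_step, are our assumptions.

```python
import random

P = 2_147_483_647  # prime modulus standing in for a large field F_q (assumption)

def repair_step(M, f, H, a, b, p=P):
    """One failure/repair iteration per Eq. (6.1): the a columns of failed node f
    are replaced by the sum over helpers j of M|j * D^H(j) * D^R(j) (mod p)."""
    k = len(M)
    new_block = [[0] * a for _ in range(k)]  # k x a replacement columns for node f
    for j in H:
        DH = [[random.randrange(p) for _ in range(b)] for _ in range(a)]  # a x b at helper j
        DR = [[random.randrange(p) for _ in range(a)] for _ in range(b)]  # b x a at replacement
        block = [row[a * j:a * j + a] for row in M]                       # M|j, k x a
        for r in range(k):
            # b helper symbols sent by node j (row r of M|j * DH)
            helper = [sum(block[r][x] * DH[x][y] for x in range(a)) % p for y in range(b)]
            for c in range(a):
                new_block[r][c] = (new_block[r][c]
                                   + sum(helper[y] * DR[y][c] for y in range(b))) % p
    M_next = [row[:] for row in M]
    for r in range(k):
        M_next[r][a * f:a * f + a] = new_block[r]
    return M_next
```

Iterating repair_step over random (f, H) pairs reproduces the evolution M_{t+1} = Mt W_{t+1} without ever materializing the na × na matrix W_{t+1} explicitly.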
We measure the performance of our system via the lifetime and the achievable coding rate. The lifetime L of the system is the index of the first failure/repair iteration that breaks the storage. In other words, L is the first iteration that decreases the rank of the coding matrix below k
[Figure 6-1: the four system states with their evolution matrices W^t and matroid representations M(W^t).]
Figure 6-1: An example of a system evolution for 3 iterations of failure and repair, with n = 6, d = 2, a = b = 1. At t = 0, node i contains packet s_i. For the 4 considered system states, the evolution matrix W^t and its matroid representation M(W^t) are also shown. The most recently changed column of W^t is bold-faced.
and renders the data undecodable:
L ≜ min{t : rank M0 W^t < k}.   (6.2)
The coding rate of our system is R ≜ B/(nα) = k/(na). For a set of parameters (n, d, a, b, t), let the error probability p_e(t) ≜ Pr[L ≤ t] = Pr[rank Mt < k] be the probability of storage failure in no more than t iterations. Let

Rε = Rε(n, d, a, b, t) ≜ (1/(na)) · max{k : p_e ≤ ε},   (6.3)

i.e., Rε is the maximal coding rate such that the probability that the system is operational after t iterations is at least 1 − ε. The main system model parameters are summarized in Table 6.1.
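For small parameters, p_e(t) can be estimated by direct Monte Carlo simulation. The sketch below is an illustration only: it fixes a = b = 1, uses an arbitrary prime modulus in place of Fq, and the helper names estimate_pe and rank_mod_p are ours, not the thesis code.

```python
import random

P = 2_147_483_647  # prime modulus standing in for F_q (assumption)

def rank_mod_p(M, p=P):
    """Rank of an integer matrix over GF(p), via Gaussian elimination."""
    M = [[x % p for x in row] for row in M]
    rank = 0
    for c in range(len(M[0])):
        piv = next((r for r in range(rank, len(M)) if M[r][c]), None)
        if piv is None:
            continue
        M[rank], M[piv] = M[piv], M[rank]
        inv = pow(M[rank][c], p - 2, p)          # Fermat inverse, p prime
        M[rank] = [x * inv % p for x in M[rank]]
        for r in range(len(M)):
            if r != rank and M[r][c]:
                f = M[r][c]
                M[r] = [(x - f * y) % p for x, y in zip(M[r], M[rank])]
        rank += 1
    return rank

def estimate_pe(n, k, d, t, trials=100, p=P):
    """Monte Carlo estimate of p_e(t) = Pr[rank M_t < k] for a = b = 1,
    uniform failures and uniformly drawn helper sets."""
    failures = 0
    for _ in range(trials):
        M = [[random.randrange(p) for _ in range(n)] for _ in range(k)]
        for _ in range(t):
            f = random.randrange(n)
            H = random.sample([i for i in range(n) if i != f], d)
            coef = {j: random.randrange(p) for j in H}   # one RLNC coefficient per helper
            for row in M:
                row[f] = sum(coef[j] * row[j] for j in H) % p
        failures += rank_mod_p(M, p) < k
    return failures / trials
```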
6.2 Previous Work
The key difference of our model from the setups of RCs [1] and LRCs [6, 27] is the probabilistic selection of the helper nodes and the allowance of a non-zero probability of decoding error. RCs allow arbitrary (worst-case) helper selection. LRCs and their generalizations to multiple helper sets, like [31, 32], require the helper sets to come from a relatively small and limited number of alternatives.
A probabilistic approach to local code symbol repair is considered in locally correctable codes (LCCs) and locally decodable codes (LDCs) [54]. A code C : F_q^k → F_q^N is a (d, δ, ε)-LDC,
Table 6.1: Notation for the time-varying network storage system model.

Symbol   Definition
n        total number of nodes in the system
a        number of packets each storage node holds for one coded file
b        number of helper data packets downloaded from each helper node during node repair
k        size of the source file, in packets
d        number of helper nodes providing helper data during node repair
q        finite field size for data symbols
t        number of successive failure and repair iterations
f_t      index of the failed node at the t-th failure
H_t      helper set at the t-th repair, |H_t| = d
M_t      k × na (global) coding matrix after t iterations, which describes the global coding vectors of all na coded packets in the system
W_t      na × na evolution matrix of the t-th iteration; determines the evolution of M_t
W^t      matrix of cumulative evolution from M_0 to M_t
L        lifetime, the first t such that rank M_t < k
p_e      error probability Pr[rank M_t < k]
R        coding rate R = k/(na)
R_ε      maximal coding rate (for given (n, d, a, b, t)) such that p_e ≤ ε
resp. an LCC, if there exists a randomized algorithm A_D, resp. A_C, which reads at most d symbols of a corrupted codeword y and can correctly decode a source message symbol U_i, resp. correct a codeword symbol x_i, with probability (w.p.) at least 1 − ε, ∀u ∈ F_q^k, ∀i, ∀y : |y − x| ≤ δN, where | · | denotes the Hamming distance. For ε < 1/2, the decoding/correcting algorithm can
be invoked multiple times, and majority logic can be used to make the probability of successful
decoding/correction arbitrarily close to 1; note that multiple algorithm calls potentially read
many more than d codeword symbols. Although in LDC/LCC the error probability is generally not strictly zero, our model differs from that of LDC/LCC in several important respects. LDC/LCC protect against the worst-case corruption pattern and can repair δN simultaneous failures, while our model performs repairs one at a time. Also unlike our model, LDC/LCC focus on exact repair.
Fitzek et al. [55] consider a model similar to ours; they performed an implementation-based evaluation of RLNC and showed it to outperform Reed-Solomon-based and uncoded storage
approaches. Mazumdar [56] studies a local repair storage model, in which the network topology
is fixed, and a failed node can get the helper data for repair only from its neighbors according
to the storage network graph. Luby et al. [57] consider a large-code lazy-repair DSS, where
node failures are modeled by Poisson processes in the continuous time, and the repair process is
running for a large fraction of time at a very low repair bandwidth with a large repair locality.
To the best of the authors' knowledge, the literature on highly time-varying networks, such as mobile ad hoc networks, has not specifically studied the problem of distributed storage, focusing
instead on communication between nodes. RLNC has been previously successfully applied to
multicasting in mobile ad hoc networks [58, 59] and delay-tolerant networks [60–62].
6.3 Stochastic Rank Decay
Note that rankMt and L are random variables determined by failed/helper nodes and RLNC
coefficients selection. In this chapter, we study the main stochastic aspect of our model, namely
the randomness of failed and helper nodes.
When the field size q is large, to determine L it is enough to analyze the rank of the evolution matrix W^t, as shown by the following proposition.

Proposition 1. Let L_W ≜ min{t : rank W^t < k}; note that L_W ≥ L. Then

Pr[L_W ≠ L] < 1/(q − 1).   (6.4)
Proof. If rank M0 < k, then L_W = L = 0 and the statement holds trivially. If rank M0 = k, L_W > L implies that for some t, rank W^t = k while rank Mt ≤ k − 1. This can be true only if some vector x in the column span of W^t is in the kernel of M0, i.e. M0 · x^⊤ = 0. This kernel
has dimension na − k (since M0 is full rank), and is spanned by na − k basis vectors. For
M0 sampled uniformly from Fk×naq , these basis vector components can be considered drawn
uniformly from Fq, independently of W t. The probability that the column space of W t of
dimension k has a non-trivial intersection with the uniformly sampled (na − k)-dimensional
kernel is upper bounded by

q^{na−k}/q^{na} + q^{na−k+1}/q^{na} + · · · + q^{na−1}/q^{na} = Σ_{i=0}^{k−1} q^{−k+i} < 1/(q − 1),
which gives the desired bound, because the column span of W t stays fixed for all values of t,
for which rankW t = k.
In this chapter, we study the system behavior in the limit of infinite field size, and focus on rank W^t instead of rank Mt. Although infinitely large fields are not feasible in practice, the limiting behavior of the system is important to analyze; as we show in the following chapter, the system dynamics remain largely the same when q > 100. We shall also assume the failed and helper node distributions PF, PHi, ∀i, to be uniform over [n] and [n] − i, respectively.
Note that for d < k/a, one cannot guarantee the source file decodability for more than the first few iterations. Indeed, there exists a sequence of failures/repairs in which the first d nodes serve as helper nodes to repair all other nodes [d + 1, n], which results in rank W^t ≤ da < k. Such a sequence has a non-zero probability and is encountered in finite expected time; thus Pr[L ≤ n − d] > 0 and E[L] < ∞. We shall see, however, that with high probability the lifetime is much larger than n − d.
6.3.1 Matroid Perspective
Under the large field size assumption and with a = b, the resulting RLNC and, in particular,
the evolution matrix can be conveniently represented by a matroid. When a node failure/repair
happens, the evolution matrix changes from W t to W t+1. The columns of W t corresponding
to the failed node are replaced with linear combinations of the columns corresponding to the
helper nodes. This creates one or more linear dependencies involving a repaired column and all
helper columns. When q → ∞, all linear dependencies between columns arise from the choices
of failed and helper nodes, but not from a specific choice of random linear coefficients, with
probability arbitrarily close to 1 (from now on w.p.a.c. 1). These dependencies are captured by
a matroid representation M(W t) of the evolution matrix. The system matroid M(W t) has n
elements in its ground set E(M(W t)), which correspond to the n storage nodes. Each element
represents the subspace of Fnaq , spanned by the a columns of the corresponding node in W t. A
set of m elements is considered independent if the corresponding ma columns of W t are linearly
independent vectors in F_q^{na}. The collection of all independent sets, ∀m ∈ [1, n], forms I. Figure 6-1 shows the matroid representation for a sample system evolution. The dots correspond to the matroid elements and the lines represent the circuits, which span across several elements.
For a node set S ⊆ [n], let W t|S denote the submatrix of W t composed of the a|S| columns
corresponding to the packets on the nodes in S.
To show that (E , I) is indeed a matroid, we use the following lemma.
Lemma 6.3.1 (Matroid Lemma). For q → ∞ and a = b, rank W^t|_S = n_S a, where n_S ≥ 1 is an integer, w.p.a.c. 1, ∀S ⊆ [n], S ≠ ∅, ∀t.
The proof is given in Appendix 9.3.
The lemma essentially says that for s ∉ S, rank W^t|_{S+s} is either rank W^t|_S or rank W^t|_S + a; that is, the column span of W^t|_S either contains the column span of W^t|_s, or has a trivial intersection with it. Next, we formally show that (E, I) is indeed a matroid.
1. I is non-empty, since by lemma 6.3.1, rankW t|j = a, ∀t, j.
2. Every subset of a set in I is also in I. This is true, since a subset of a set of linearly
independent columns of W t is also linearly independent.
3. If I1, I2 ∈ I and |I2| = |I1| + 1, then there is an element s ∈ I2 − I1 such that I1 + s ∈ I. Since rank W^t|_{I2} = |I2|a > rank W^t|_{I1}, there is s ∈ I2 such that colspan W^t|_s ⊄ colspan W^t|_{I1}. By Lemma 6.3.1, rank W^t|_{I1+s} = (|I1| + 1)a = |I1 + s|a; therefore, I1 + s is independent, and the property is satisfied.
Note that for b < a, the third (independence augmentation) property does not necessarily
hold, and (E , I) is not a matroid, but only an independence system.
Lemma 6.3.1 also implies that (rank W^t)/a is equal to the cardinality of a maximal independent set of M(W^t). Since the matroid structure is determined only by the choices of failure/helper nodes, it follows that (rank W^t)/a is independent of a. Thus, the next theorem follows.
Theorem 6.3.2. Let W t(n, d, a, b) be the evolution matrix for a system with parameters n, d,
a, b, and let q → ∞. Then for two systems with parameters (n, d, a, a), (n, d, 1, 1) with the same
failed/helper node sequence, w.p.a.c. 1
rankW t(n, d, a, a) = a rankW t(n, d, 1, 1). (6.5)
Each circuit of the matroid represents a parity-check relation between the nodes. A failure/repair iteration decreases the rank, rank W^{t+1} = rank W^t − a, if and only if the columns corresponding to the failed node cannot be expressed as a linear combination of the other columns, which happens if and only if the failed node is a coloop of the current matroid M(W^t). In Figure 6-1, the coloop elements are indicated by the dots not covered by any line. At each failure/repair iteration, all circuits involving the failed node are removed, and a new circuit appears, involving
the failed node and the d helper nodes. Additional circuits with d + 1 or more nodes appear
if the helper nodes were previously involved in some circuits without the current failed node:
an example of this situation is shown in the last iteration in Figure 6-1. In other words, any
intersection of two circuits creates an additional circuit. As a result, the total number of cir-
cuits grows very fast with t, and for large t almost any two nodes are involved in some common
circuit.
Every new circuit is constructed on d + 1 or more nodes, and since at t = 0 there are no
circuits with d nodes or fewer, every subset of d columns is independent for any t w.p.a.c. 1.
When d ≥ k/a (this regime is considered in RCs), this implies that the rank never drops below
ka, and the lifetime is infinite.
At t = 0 every node is a coloop. For low values of t < O(n log n), the number of coloops n_coloops is typically above zero, and the rank of W^t decreases quickly: it drops by a after the next iteration with a relatively large probability Pr[next failed node is a coloop] = n_coloops/n. We
shall refer to these early iterations as the burn-in phase. After the burn-in phase, the stability
phase gradually ensues, when there are so many circuits — dependencies among the nodes —
that there are no coloops during most time steps. At each iteration, numerous circuits involving
the failed node are removed, and many new circuits are created. The rank now decreases very
slowly, dropping only on those rare occasions when a coloop appears and is chosen to be the
failed node before being a helper.
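The coloop dynamics can be observed directly in a small simulation. The sketch below is illustrative only (a = b = 1, a fixed prime modulus approximating q → ∞, and the helper names are ours): a column is a coloop exactly when removing it lowers the rank of W^t.

```python
P = 2_147_483_647  # prime modulus standing in for a large field (assumption)

def rank_mod_p(M, p=P):
    """Rank over GF(p) via Gaussian elimination."""
    M = [[x % p for x in row] for row in M]
    rank = 0
    for c in range(len(M[0])):
        piv = next((r for r in range(rank, len(M)) if M[r][c]), None)
        if piv is None:
            continue
        M[rank], M[piv] = M[piv], M[rank]
        inv = pow(M[rank][c], p - 2, p)
        M[rank] = [x * inv % p for x in M[rank]]
        for r in range(len(M)):
            if r != rank and M[r][c]:
                f = M[r][c]
                M[r] = [(x - f * y) % p for x, y in zip(M[r], M[rank])]
        rank += 1
    return rank

def coloops(W, p=P):
    """Indices whose column removal decreases the rank, i.e. the coloops of M(W)."""
    full = rank_mod_p(W, p)
    return [i for i in range(len(W[0]))
            if rank_mod_p([row[:i] + row[i + 1:] for row in W], p) < full]
```

Starting from W^0 = I_n (every node a coloop) and applying random repairs, len(coloops(W)) typically falls to zero within the first iterations, marking the end of the burn-in phase.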
6.4 Bounding Processes
Let us assume a = b = 1 and q → ∞. Consider a system evolution for τ > 0 iterations. Let Y^t = W_{τ−t+1}^τ, 0 ≤ t ≤ τ, be the backward cumulative evolution matrix, with Y^0 = I_n and Y^τ = W^τ. Consider the transition from t to t + 1: Y^{t+1} = W_{τ−t} Y^t. Let f = f_{τ−t} be the index of the failed node corresponding to W_{τ−t}, and let H = H_{τ−t} be the helper set. The rows of Y^t corresponding
to f and H will be called the failure row and the helper rows. As Y^t is multiplied by W_{τ−t} to form Y^{t+1}, the failure row of Y^t is, first, added with RLNC multiplicative coefficients to the d helper rows, and, second, replaced with zeros. For example, f = 1 and H = {4, 6} may correspond to matrix W_1 from Figure 6-1; left multiplication by this matrix would result in adding 2× the first row to the fourth row, adding 3× the first row to the sixth row, and then replacing the first row with zeros.
Let Z_t be the set of the indices of the zero rows in Y^t. If the failure row at the next iteration is a zero row in Y^t, then Y^{t+1} = Y^t and Z_{t+1} = Z_t. Otherwise, a non-zero failure row is added with random coefficients to d helper rows. Let l ∈ [0, d] be the number of zero rows among the helper rows. Since q → ∞, all helper rows become non-zero in Y^{t+1} w.p.a.c. 1. The total number of zero rows in Y^{t+1} becomes |Z_{t+1}| = |Z_t| + 1 − l. Since the number of zero rows lower-bounds the nullity (the dimension of the kernel) of Y^t, which is non-decreasing with t, rank Y^τ = rank W^τ is upper-bounded by n − max_{t≤τ} |Z_t|.
Note that l, the number of zero helper rows, is between 0 and d; hence, |Z_t| can go from i at step t to anywhere between (i − d)^+ + 1 and i + 1 at step t + 1. While being a lower bound on the nullity, |Z_t| does not necessarily equal the nullity, because it does not take into account dependent non-zero rows. Such rows arise when a non-zero failure row is added (with a multiplier) to l > 1 zero rows.
Let St ⊆ [n] be a set of row indices of Y t. Let S0 = [n]. We define St+1 as follows:
• If f is not in St, then we let St+1 = St.
• If f and entire H are in St, then we let St+1 = St − f .
• If f is in St, but at least one helper node, say hj ∈ H, is not in St, then we let St+1 =
St − f + hj .
As we show below, the set of rows of Y t indexed by St is linearly independent w.p.a.c. 1.
Therefore, we have the following theorem.
87
Theorem 6.4.1. For a = b = 1 and q → ∞, the rank of the evolution matrix is, w.p.a.c. 1, bounded by

|S_τ| ≤ rank W^τ ≤ n − max_{t≤τ} |Z_t|.   (6.6)
To prove the linear independence of the rows S_t in Y^t, we use the following Matrix Addition Lemma. It formally shows that adding a certain random matrix to a full-rank matrix results in a full-rank matrix w.p.a.c. 1 in the limit of infinite field size. The proof of the lemma is given in Appendix 9.4.

Lemma 6.4.2 (Matrix Addition Lemma). Let A ∈ F_q^{m×n}, m ≤ n, be a full-rank matrix with rows a_1, . . . , a_m. Let u, v ∈ F_q^n be arbitrary vectors, and let d′ ∈ [0, m − 1] be an integer. Let A′ ∈ F_q^{m×n} be an additively transformed matrix with rows a′_1, . . . , a′_m, such that

a′_i = a_i + α_i u   if i ∈ [1, d′],
a′_i = a_i           if i ∈ [d′ + 1, m − 1],
a′_i = β_m a_i + v   if i = m,    (6.7)

where α_i, β_m are random scalars, sampled uniformly i.i.d. from F_q. Then lim_{q→∞} Pr[rank A′ = m] = 1, i.e. A′ is full-rank w.p.a.c. 1 in the limit of infinite field size.
Proof of Theorem 6.4.1. Let Y^t|_E denote the submatrix of Y^t consisting of the row(s) indexed by E. We only need to prove that the rows of Y^t|_{S_t} are linearly independent. We prove it by induction. The statement is true for t = 0, since Y^0 is the identity matrix. Suppose it also holds for some t. We transition from Y^t to Y^{t+1} = W_{τ−t} Y^t, with f, H corresponding to W_{τ−t}.
• If f ∉ S_t and S_{t+1} = S_t, we use Lemma 6.4.2 with d′ = |H ∩ S_t|, m − 1 = |S_t|, and let A|_{[m−1]} = Y^t|_{S_t} with the first d′ rows being the helper rows in S_t, row A|_m arbitrary provided it makes A full-rank, and u = Y^t|_f. Then A′|_{[m−1]} = Y^{t+1}|_{S_{t+1}} is full-rank and its rows are linearly independent w.p.a.c. 1.
• If f ∈ S_t, H ⊆ S_t, and S_{t+1} = S_t − f, we use Lemma 6.4.2 with d′ = |H ∩ S_t|, m = |S_t|, and let A = Y^t|_{S_t} (the first d′ = |H ∩ S_t| rows are the helper rows in S_t, the last row a_m is the failure row) and u = a_m = Y^t|_f. Then A′|_{[m−1]} = Y^{t+1}|_{S_{t+1}} is full-rank and its rows are linearly independent w.p.a.c. 1.
• If f ∈ S_t, there is h_j ∈ H with h_j ∉ S_t, and S_{t+1} = S_t − f + h_j, we use Lemma 6.4.2 with d′ = |H ∩ S_t|, m = |S_t|, and let A = Y^t|_{S_t} (the first d′ = |H ∩ S_t| rows are the helper rows in S_t, the last row a_m is the failure row), u = a_m = Y^t|_f, and v = Y^t|_{h_j}. Then A′ = Y^{t+1}|_{S_{t+1}} is full-rank and its rows are linearly independent w.p.a.c. 1.
In all cases independence is maintained for the rows of Y^{t+1}|_{S_{t+1}}; therefore, by induction, for any t ≤ τ, the rows of Y^t|_{S_t} are linearly independent w.p.a.c. 1, and |S_t| ≤ rank Y^t.
The next theorem shows that for the case of a single helper node the bounds in Theorem
6.4.1 are tight.
Theorem 6.4.3. For d = 1, the sets S_t and Z_t complement each other:

S_t ∪ Z_t = [n],  S_t ∩ Z_t = ∅,  ∀t ∈ [τ].   (6.8)

As a result, the bounds in (6.6) coincide and are tight:

|S_τ| = n − |Z_τ| = rank W^τ.   (6.9)
Proof. Since the rows of Y t|St are linearly independent, none of them is zero, thus, St∩Zt = ∅.
We show St ∪ Zt = [n] by induction. It holds for t = 0, with Z0 empty, S0 = [n]. Assuming it
holds for t, consider the transition from Y t to Y t+1, with H = {h} consisting of a single helper
row.
• If f ∈ Zt, then f /∈ St, and both Zt and St remain the same for t+ 1.
• If f ∈ St, h ∈ St, then Zt+1 = Zt + f , St+1 = St − f .
• If f ∈ St, h /∈ St, then, by assumption, h ∈ Zt, and Zt+1 = Zt − h+ f , St+1 = St − f + h.
In every case the complementary property remains true for t+1; by induction, it holds for any
t. Thus, |Sτ | = n− |Zτ |, and the bounds in (6.6) are tight.
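Theorems 6.4.1 and 6.4.3 can also be checked empirically. The sketch below is an illustration under stated assumptions (a = b = 1, a fixed prime modulus approximating q → ∞, our own helper names): it runs the forward column evolution of W^t and the backward bounding processes Z_t, S_t on the same random failure/helper sequence.

```python
import random

P = 2_147_483_647  # prime modulus standing in for q -> infinity (assumption)

def rank_mod_p(M, p=P):
    """Rank over GF(p) via Gaussian elimination."""
    M = [[x % p for x in row] for row in M]
    rank = 0
    for c in range(len(M[0])):
        piv = next((r for r in range(rank, len(M)) if M[r][c]), None)
        if piv is None:
            continue
        M[rank], M[piv] = M[piv], M[rank]
        inv = pow(M[rank][c], p - 2, p)
        M[rank] = [x * inv % p for x in M[rank]]
        for r in range(len(M)):
            if r != rank and M[r][c]:
                f = M[r][c]
                M[r] = [(x - f * y) % p for x, y in zip(M[r], M[rank])]
        rank += 1
    return rank

def sandwich(n, d, tau, seed=0, p=P):
    """Returns (|S_tau|, rank W^tau, n - max_t |Z_t|) for one random repair
    sequence with a = b = 1; Theorem 6.4.1 says the sandwich holds w.p.a.c. 1."""
    rng = random.Random(seed)
    W = [[int(i == j) for j in range(n)] for i in range(n)]
    events = []
    for _ in range(tau):
        f = rng.randrange(n)
        H = rng.sample([i for i in range(n) if i != f], d)
        events.append((f, H))
        coef = {j: rng.randrange(1, p) for j in H}
        for row in W:                       # column f <- combination of helper columns
            row[f] = sum(coef[j] * row[j] for j in H) % p
    # backward bounding processes: Z_t (zero rows) and S_t (independent rows)
    Z, S, max_z = set(), set(range(n)), 0
    for f, H in reversed(events):
        if f not in Z:                      # zero helper rows fill in, row f zeroes out
            Z -= set(H)
            Z.add(f)
        max_z = max(max_z, len(Z))
        if f in S:
            outside = [h for h in H if h not in S]
            S.discard(f)
            if outside:                     # swap in one helper from outside S
                S.add(outside[0])
    return len(S), rank_mod_p(W, p), n - max_z
```

For d = 1, Theorem 6.4.3 makes the two bounds coincide, so all three returned values agree.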
For given PF and {PHi}_{i∈[n]}, W^t ∈ F_q^{n×n} is a Markov process on a very large state space, which greatly complicates the direct analysis of W^t and its rank. Theorem 6.4.1 bounds the process rank W^τ using two other processes, S_t and Z_t, which are functions of {W_t}_{t∈[τ]}. It is not hard to see that w.p.a.c. 1, S_t and Z_t are also Markov processes on a much smaller state space 2^{[n]}. Indeed, by the construction of Z_{t+1}, S_{t+1} from Z_t, S_t, it follows that the probability distributions of Z_{t+1}, S_{t+1} are fully determined by the previous states Z_t, S_t for fixed PF, {PHi}_{i∈[n]}.
When the distributions PF, {PHi}_{i∈[n]} are uniform, the analysis of the bounds (6.6) is further simplified by the Markov property of |S_t| and |Z_t|, as shown by the following theorem.
Theorem 6.4.4. For uniform PF, PHi, ∀i, the processes N_t ≜ |Z_t| and N′_t ≜ n − |S_t| are Markov processes on the state space [0, n] with transition probabilities

p_{i,j}(N) ≜ Pr[N_{t+1} = j | N_t = i] = (i/n)·1_{i=j} + ((n−i)/n)·Hg^{(i−j+1)/i}_{d/(n−1)},   (6.10)

p_{i,i+1}(N′) = Hg^{(d+1)/(n−i)}_{(d+1)/n} = ((n−i)/n)·Hg^{0/i}_{d/(n−1)} = C(n−i, d+1)/C(n, d+1),   (6.11)

p_{i,i}(N′) = 1 − p_{i,i+1}(N′),   (6.12)

and N_0 = N′_0 = 0, where Hg^{k/K}_{n/N} = C(K, k)·C(N−K, n−k)/C(N, n) is the probability mass function of the hypergeometric distribution with n trials, N items, and K possible successes.
Proof. For uniform failure and helper node distributions, at each iteration any row has probability 1/n to be selected as the failure row. For the transition from Y^t to Y^{t+1}, the probability that the failure row is among Z_t (and, thus, Z_{t+1} = Z_t) is |Z_t|/n. Otherwise, w.p. (n − |Z_t|)/n, a non-zero failure row is added to d helper rows, l of which are zero. In Y^{t+1} these l rows become non-zero and the failure row becomes zero, thus |Z_{t+1}| = |Z_t| + 1 − l. The probability of having l zero rows among d helper rows chosen out of n − 1 non-failure rows is Hg^{l/|Z_t|}_{d/(n−1)}. Thus, the distribution of the number of zero rows N_{t+1} = |Z_{t+1}| depends only on |Z_t| = N_t, and N is a Markov process with transition probabilities given by (6.10).
|S_t| is changed (decreased) at the next iteration only when the failure and helper rows are all in S_t. This happens with probability p = Hg^{(d+1)/|S_t|}_{(d+1)/n}. This probability is a function only of the value of |S_t| at time t; thus, N′_t = n − |S_t| is a Markov process with N′_0 = 0 and transition probabilities given by (6.11), (6.12). Note that the increase probability p_{i,i+1}(N′) for N′_t = i equals ((n−i)/n)·Hg^{0/i}_{d/(n−1)}, which is the same as the increase probability p_{i,i+1}(N) of the process N_t given by (6.10) for i − j + 1 = l = 0. While N_t goes down if l = i − j + 1 ≥ 2, N′_t never decreases with t. For d = 1, l can only be 0 or 1, and N_t is non-decreasing and identical to N′_t.
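The transition kernels (6.10)-(6.12) are straightforward to tabulate with math.comb; the function names below are illustrative, with the hypergeometric pmf written out explicitly.

```python
from math import comb

def hyper_pmf(l, d, K, N):
    """Hypergeometric pmf: P[l successes] when drawing d of N items, K being successes."""
    if l < 0 or l > d or l > K or d - l > N - K:
        return 0.0
    return comb(K, l) * comb(N - K, d - l) / comb(N, d)

def p_N(i, j, n, d):
    """Pr[N_{t+1} = j | N_t = i] from Eq. (6.10)."""
    stay = i / n if i == j else 0.0
    return stay + (n - i) / n * hyper_pmf(i - j + 1, d, i, n - 1)

def p_Nprime(i, j, n, d):
    """Pr[N'_{t+1} = j | N'_t = i] from Eqs. (6.11)-(6.12)."""
    up = comb(n - i, d + 1) / comb(n, d + 1) if n - i >= d + 1 else 0.0
    return up if j == i + 1 else (1.0 - up if j == i else 0.0)
```

Tabulating both kernels makes the remark in the proof easy to verify: the upward step probabilities of N and N′ coincide.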
The analysis above is carried out for the single packet per node case. Theorem 6.3.2 allows
applying the same results for a = b > 1 by scaling the rank proportionally by a.
6.5 Impact of Repair Bandwidth
We have established the rank bounds for a = b, i.e. for the case of maximal repair bandwidth,
when the helper nodes send out as many packets as they store. In this section, we show how
much the rank may decrease when the contribution of each helper node is limited to b ≤ a
packets.
The following theorem provides a lower bound for the rank for b ≤ a as a function of the
rank with b = a.
Theorem 6.5.1. For q → ∞, consider a system with parameters n, d, a, and an arbitrary sequence of node failures/repairs E ≜ {f_t, H_t}_t. Let W^t be a sample evolution matrix under sequence E with a helper packets per helper node, and let W′^t be a sample evolution matrix under sequence E with b < a helper packets per helper node. Then, w.p.a.c. 1,

rank W′^t ≥ Σ_{i=0}^{d−1} min{a, (d − i)b} + b·(rank W^t / a − d).   (6.13)
Proof. Consider sequence E and the corresponding IFG after t iterations of failures and repairs
under a helper packets per helper node. The IFG for our RLNC packet system is defined
according to Section 2.4.1, except that now we measure the edge capacities in packets, rather
than symbols: the edges now have capacities a, b instead of α, β. Per Lemma 6.3.1 and Theorem
6.3.2, rankW t is a multiple of a, rankW t = ar, and there exists a set of r physical nodes at
time t, which contain enough packets to decode the source file of size ar packets. Let SZ be
the set of r corresponding active IFG out-nodes connecting to a data collector Z. Therefore,
there exists a min-cut between S and Z of capacity ar, and by the Max-flow Min-cut theorem
[50], there exists a flow from S to Z of capacity ar. Since Z connects to r out-nodes, and in-
91
to out-node edges have capacity a, there exist r edge-disjoint paths {Pi}i∈[r] in IFG from S to
the r nodes in SZ . Since nodes in IFG cannot have multiple incoming and multiple outgoing
edges at the same time, {Pi} are also node-disjoint, except for the common starting node S.
Next, consider sequence E under b < a helper packets per helper node. For the same sequence, the IFG structure remains the same, except that now the connections to helper nodes have capacity b. Consider an IFG cut which partitions all vertices into two sets U, V, with S ∈ U and Z ∈ V. We want to lower-bound the cut capacity, and we assume that the r out-nodes in S_Z are in V, and X_i^in ∈ U, ∀i ∈ [n]. Consider a topological sorting of the nodes in the IFG. Let S_topo = {Y_1^out, Y_2^out, . . . , Y_d^out} be the topologically first d out-nodes in V. Node Y_i^out, i ∈ [d], is directly connected to Y_i^in. If Y_i^in ∈ U, the edge Y_i^in → Y_i^out of capacity a crosses the cut and contributes a to the cut value. If Y_i^in ∈ V, then Y_i^in is directly connected to d other out-nodes, corresponding to the d helpers providing repair data to node Y_i. By the construction of S_topo, at most i − 1 of the d other out-nodes can be in V, so at least d − (i − 1) out-nodes are in U and contribute b each to the cut value. Thus, the contribution of Y_i^out to the cut capacity is at least min{a, (d − i + 1)b}, and the contribution of Y_1^out, . . . , Y_d^out is at least Σ_{i=1}^d min{a, (d − i + 1)b}.
Note that this value matches the capacity of RCs in Equation (2.6).
For each path P_i, let V_i^out ∈ V be the topologically first out-node on P_i in V. Let S_path = {V_i^out, i ∈ [r]}, and let V′_1^out, . . . , V′_r^out be the nodes of S_path in their topologically sorted order. For each i ∈ [d + 1, r], consider the node V′_i^out = V_j^out ∈ V with its two preceding nodes in the same path P_j: either U_i^out → V′_i^in → V′_i^out (edge capacities b and a) or S → V′_i^in → V′_i^out (edge capacities ∞ and a). Since U_i^out and S are in U, the considered segment of the path crosses the cut and contributes at least min{b, ∞, a} = b to its value. This contribution has not been counted in the previous paragraph, because by construction {V′_i^out}_{i∈[d+1,r]} is disjoint from {Y_i^out}_{i∈[d]}, and {V′_i^in}_{i∈[d+1,r]} is disjoint from {Y_i^in}_{i∈[d]}.
Figure 6-2 demonstrates the argument for a sample evolution in a system with n = 4, d = 2, t = 4. With a packets per helper node, rank W^t = 3a, and the data collector can decode the source file from the first r = 3 physical nodes. Node Z directly connects to S_Z = {X_5^out, X_6^out, X_8^out}. There exist 3 disjoint paths between S and the 3 nodes of S_Z, e.g. S → X_2 → X_5, S → X_4 → X_6, S → X_3 → X_7 → X_8. For the cut shown in the figure, these paths correspond to S_path = {X_5^out, X_6^out, X_7^out}. The set of the topologically first d out-nodes in V is S_topo = {X_5^out, X_6^out}. For the case b < a, they contribute a + b ≥ min{a, 2b} + min{a, b} to the cut capacity. The path through X_7^out, the topologically last node in S_path, contains the edge X_3^out → X_7^in, which contributes an extra b to the cut capacity.

Figure 6-2: An example of an information-flow graph for n = 4, d = 2, and t = 4 node failures/repairs. Also shown is a sample cut (U, V) of capacity a + 2b ≥ min{a, 2b} + min{a, b} + b.
Overall, the capacity of any cut is at least Σ_{i=0}^{d−1} min{a, (d − i)b} + b(r − d), which gives a lower bound for the file size achievable by RLNC, and hence for rank W′^t.
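The right-hand side of (6.13) is easy to evaluate numerically; a small illustrative helper (the function name is an assumption, not thesis code):

```python
def rank_lb_reduced_bw(rank_full, d, a, b):
    """RHS of Eq. (6.13): lower bound on rank W'^t when each helper sends b <= a
    packets, given the rank rank_full = rank W^t under full bandwidth (b = a)."""
    r = rank_full // a  # rank W^t is a multiple of a by Lemma 6.3.1
    return sum(min(a, (d - i) * b) for i in range(d)) + b * (r - d)
```

Setting b = a collapses every min term to a and recovers rank_full itself, a useful consistency check on the bound.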
6.6 Expected Lifetime
In this section we use the rank bounds (6.6) to obtain bounds on the expected lifetime, under the assumption of uniform failure and helper node distributions, a = b = 1, and q → ∞. Since L_W = min{t : rank W^t < k}, upper and lower bounds L^+, L^− on L_W are given by

L^+ ≜ min{t : n − N_t < k},
L^− ≜ min{t : n − N′_t < k}.

In other words, the bounds correspond to the numbers of time steps to reach state n − k + 1 from state 0 (first hit times) for the processes N, N′. Therefore, E[L^−], E[L^+] bound the expected lifetime.
Lower Bound
For the non-decreasing chain N', reaching state r + 1 from state r takes 1/p_{r,r+1}(N') time steps in expectation, the mean of a geometric distribution with success probability p_{r,r+1}. The expected first hit time of state n − k + 1 is

E[L^-] = \sum_{r=0}^{n-k} \frac{1}{p_{r,r+1}(N')} = \sum_{r=0}^{n-k} \frac{n}{n-r}\,\frac{\binom{n-1}{d}}{\binom{n-r-1}{d}} = \frac{n}{d+1}\binom{n-1}{d} \sum_{r=0}^{n-k} \binom{n-r}{d+1}^{-1} = \frac{n}{d}\,\frac{\binom{n-1}{d}}{\binom{k-1}{d}} - \frac{n}{d} + 1 = O\left(n\left(\frac{n}{k}\right)^{d}\right),   (6.14)

where the last summation is collapsed using the binomial identity

\sum_{r=m}^{\infty} \binom{n+r}{n}^{-1} = \frac{n}{n-1}\binom{n+m-1}{n-1}^{-1},   (6.15)

provided in reference [63, Corollary 3.7].
Note that for the case of a single helper node, d = 1, the lower bound (6.14) is tight and equals the expected lifetime. On average, it takes only around n^2/k iterations for the rank to drop to k, and n^2 iterations to drop to 1.
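The collapse of the sum in (6.14) can be spot-checked numerically; the following sketch (with illustrative parameter values) compares the direct sum of expected geometric waiting times against the closed form:

```python
from math import comb

def p_up(r, n, d):
    # p_{r,r+1}(N'), read off the derivation of (6.14):
    # 1/p_{r,r+1} = n/(n-r) * C(n-1,d)/C(n-r-1,d).
    return (n - r) / n * comb(n - r - 1, d) / comb(n - 1, d)

def lifetime_lower_bound(n, k, d):
    # E[L^-] as the sum of mean geometric waiting times over states 0..n-k.
    return sum(1 / p_up(r, n, d) for r in range(n - k + 1))

def closed_form(n, k, d):
    # Collapsed form from (6.14): (n/d) C(n-1,d)/C(k-1,d) - n/d + 1.
    return n / d * comb(n - 1, d) / comb(k - 1, d) - n / d + 1

n, k, d = 20, 10, 3   # illustrative values; requires k - 1 >= d
print(lifetime_lower_bound(n, k, d), closed_form(n, k, d))  # the two values coincide
```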
Upper Bound
In order to estimate first hit times for N, for d ≥ 2 we approximate N with a birth–death Markov process Ñ with |Ñ_{t+1} − Ñ_t| ≤ 1. Let p̃_{i,j} = p_{i,j}(Ñ) = Pr[Ñ_{t+1} = j | Ñ_t = i]. Similarly to N, let

\tilde p_{i,i+1} = \frac{n-i}{n}\,\mathrm{Hg}^{0/i}_{d/n-1}, \qquad \tilde p_{i,i-1} = \frac{n-i}{n}\,\mathrm{Hg}^{2/i}_{d/n-1},

and let p̃_{i,i} take the rest of the probability mass. Numerical simulations show that E[Ñ_t] is very close to E[N_t]. Since Ñ is a birth–death chain, it is reversible, and its stationary distribution
π_1, π_2, … can be derived from the detailed balance equations:

\pi_{r+1} = \pi_r \frac{p_{r,r+1}}{p_{r+1,r}} = \pi_r \frac{2(n-r)(n-r-d)}{d(d-1)r(r+1)},

\pi_{n-k+1} = \pi_1 \prod_{r=1}^{n-k} \frac{p_{r,r+1}}{p_{r+1,r}} = \pi_1 \frac{\binom{n-1}{n-k}\binom{n-d-1}{n-k}}{(n-k+1)\binom{d}{2}^{n-k}} \propto \left(\frac{n}{d(n-k)}\right)^{2(n-k)}.
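The simplification of the detailed-balance ratio can be verified directly against the hypergeometric transition probabilities (a sketch; Hg(m; i, d, N) here denotes the probability of drawing m of the i marked nodes in d uniform draws from a population of N):

```python
from math import comb

def hg(m, i, d, N):
    # Hypergeometric pmf: m marked among d draws from a population of N with i marked.
    return comb(i, m) * comb(N - i, d - m) / comb(N, d)

def p_up(r, n, d):
    # Birth probability of the approximating chain: (n-r)/n * Hg(0; r, d, n-1).
    return (n - r) / n * hg(0, r, d, n - 1)

def p_down(r, n, d):
    # Death probability: (n-r)/n * Hg(2; r, d, n-1).
    return (n - r) / n * hg(2, r, d, n - 1)

n, d = 20, 4  # illustrative values
for r in range(1, n - d):
    ratio = p_up(r, n, d) / p_down(r + 1, n, d)
    closed = 2 * (n - r) * (n - r - d) / (d * (d - 1) * r * (r + 1))
    assert abs(ratio - closed) < 1e-9 * closed
print("detailed-balance ratio verified")
```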
For values of n − k ≳ √(2n/d), the stationary probability π_{n−k+1} becomes negligible w.r.t. π_i, i < n − k, and the first hit time of state n − k + 1, and hence the lifetime bound, can be estimated as

E[L^+] \approx \frac{1}{\pi_{n-k+1}} = O\left(\left(\frac{d(n-k)}{n}\right)^{2(n-k)}\right) = O\left((d(1-R))^{2n(1-R)}\right).   (6.16)
The upper bound grows large when d(n − k)/n > 1, which gives a necessary rate–locality condition for the lifetime to be large:

d(1 − R) > 1 \iff R < \frac{d-1}{d}.   (6.17)
When the condition is satisfied, the upper bound (6.16) grows super-exponentially with (1 − R) = (n − k)/n for fixed n. Interestingly, the rate bound (6.17) closely matches the upper bound (1.1), the best rate in a strictly easier coding problem, namely locally repairable codes, where helper node selection is allowed.
Figure 6-3 shows the expected (W-based) lifetime for a storage with n = 20 nodes. As predicted by the bounds (6.14), (6.16), E[L] grows super-exponentially with 1 − R for d ≥ 2, but remains relatively small even at the lowest rates for d = 1. When d = 1, all circuits of the system matroid consist of 2 elements, and M(W_t) is divided into parallel classes, without any circuits involving elements from different classes. As a result, the number of circuits is relatively small, and coloops appear quite often, which leads to the sharply different behavior of a single-helper system.
6.7 Error Probability
In this section, we use the rank bounds (6.6) to estimate the error probability. We continue to
assume a = b = 1, q → ∞, and the uniform node distributions.
[Figure 6-3: Simulated expected lifetime E[L_W] against the rate R = k/n for n = 20, a = b = 1 and d = 1, 2, 3, 4.]

[Figure 6-4: Probability of decoding error p_e against the coding rate for fixed n = 20, d = 4, a = b = 1 and t = 20, 100, 400, 1000, 5000. The dots indicate E[rank W_t]/n.]
Figure 6-4 shows a sample numerical estimation of the error probability p_e as a function of the coding rate R = k/n for a system with parameters n = 20, a = b = 1, d = 4. The expected rank of W_t per node, E[rank W_t]/n, is indicated by a dot for each t. The plot suggests that for rates below E[rank W_t]/n, the error probability generally drops exponentially with the coding rate.
Therefore, the expected rank of the evolution matrix provides a baseline for estimating the error probabilities and the achievable rates R_ε below E[rank W_t]/n. To estimate E[rank W_t], we apply expectation to the bounds (6.6):

n - E[N'_\tau] \le E[\mathrm{rank}\, W_\tau] \le n - E[\max_{t \le \tau} N_t].   (6.18)
Figure 6-5 shows the simulated expected rank of the evolution matrix along with the bounds. For values of t of the order O(n) (the burn-in phase), the lower bound is very tight and can be used for estimating E[rank W_τ]. For higher t, however, the upper bound approximates the expected rank much better. Both E[rank W_τ] and the upper bound go down very slowly, within a constant factor of each other.

[Figure 6-5: Expected rank of the evolution matrix r_t = E[rank W_t], with the upper and lower bounds r_t^+, r_t^-, for n = 40, d = 4.]

[Figure 6-6: Expected rank r_t = E[rank W_t] for n = 40 and d = 1, …, 10.]

While operating at rate E[rank W_t]/n results in diminishing p_e as t increases, these probabilities (on the order of 0.1–0.3) are not small enough for reliable data storage. To estimate how the error probability decreases with the rate (i.e. the error exponent), we use the rank lower bound n − N'_τ ≤ rank W_τ, and have

p_e = p_e(t, k) = \Pr[\mathrm{rank}\, W_t < k] \le \Pr[n - N'_t < k] = \Pr[N'_t > n - k] \triangleq p_e^+(t, k).   (6.19)
Since the rank lower bound is tight only for t = O(n), and since p_e decreases with the rate more steeply for larger t (see Figure 6-4), we can use p_e^+(t, k) at the point t = n to upper-bound p_e as follows:

p_e(t \ge n, k) = p_e(t, (k - E[\mathrm{rank}\, W_t]) + E[\mathrm{rank}\, W_t])
\le p_e(t, E[\mathrm{rank}\, W_t]) \, \frac{p_e(n, \Delta k + E[\mathrm{rank}\, W_n])}{p_e(n, E[\mathrm{rank}\, W_n])}
\le p_e(n, \Delta k + E[\mathrm{rank}\, W_n])
\le p_e^+(n, \Delta k + E[\mathrm{rank}\, W_n])
= p_e^+(n, k - E[\mathrm{rank}\, W_t] + E[\mathrm{rank}\, W_n]),

where Δk ≜ k − E[rank W_t] ≤ 0. For small values, the tail probability Pr[N'_t > n − k] is very closely approximated by the probability mass function Pr[N'_t = n − k + 1], which can be
expressed using the transition probabilities p_{i,i+1} given by (6.11):

p_e^+(t, k) = \Pr[N'_t > n - k] \approx \Pr[N'_t = n - k + 1]   (6.20)
= \mathbb{1}_{t \ge n-k+1} \prod_{i=0}^{n-k} p_{i,i+1} \sum_{\substack{(c_0, c_1, \ldots, c_{n-k+1}):\\ \sum_{i=0}^{n-k+1} c_i = t - (n-k+1)}} \prod_{i=0}^{n-k+1} (1 - p_{i,i+1})^{c_i}   (6.21)
= \mathbb{1}_{t \ge n-k+1} \prod_{i=0}^{n-k} p_{i,i+1} \sum_{i=0}^{n-k+1} \frac{(1 - p_{i,i+1})^{t-1}}{\prod_{j=0,\, j \ne i}^{n-k+1} (p_{j,j+1} - p_{i,i+1})}.   (6.22)
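The quality of approximating the tail by the pmf can be checked with an exact forward recursion for the pure-birth chain N' (a sketch with illustrative parameters; the transition probability is the one appearing in the lower-bound derivation of Section 6.6):

```python
from math import comb

def p_up(r, n, d):
    # p_{r,r+1} of N'; comb() returns 0 once fewer than d candidate helpers remain.
    return (n - r) / n * comb(n - r - 1, d) / comb(n - 1, d)

def distribution(t, n, d):
    # Exact distribution of N'_t via forward recursion over states 0..n.
    probs = [1.0] + [0.0] * n
    for _ in range(t):
        nxt = [0.0] * (n + 1)
        for s in range(n + 1):
            up = p_up(s, n, d) if s < n else 0.0
            nxt[s] += probs[s] * (1 - up)
            if s < n:
                nxt[s + 1] += probs[s] * up
        probs = nxt
    return probs

n, d, k, t = 20, 4, 10, 100
probs = distribution(t, n, d)
tail = sum(probs[n - k + 1:])   # Pr[N'_t > n - k]
pmf = probs[n - k + 1]          # Pr[N'_t = n - k + 1]
print(pmf / tail)               # close to 1 when the tail probability is small
```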
To have a general sense of how the error exponent depends on the system parameters, we use the following expansion:

p_e^+(t, k) \propto \prod_{i=0}^{n-k} p_{i,i+1} = \prod_{i=0}^{n-k} \frac{\binom{n-i}{d+1}}{\binom{n}{d+1}} = \prod_{i=0}^{n-k} \frac{\prod_{j=0}^{d}(n-i-j)}{\prod_{j=0}^{d}(n-j)} \le \prod_{i=0}^{n-k} \left(\frac{n-i}{n}\right)^{d+1}
= \left(\frac{n(n-1)\cdots(k+1)k}{n^{n-k+1}}\right)^{d+1} = \left(\frac{n!}{(k-1)!\, n^{n-k+1}}\right)^{d+1} \propto \left(\frac{n^n e^k}{e^n k^k n^{n-k+1}}\right)^{d+1} \approx \left(\frac{n^k}{k^k e^{n-k}}\right)^{d+1}
= e^{-(d+1)(n-k-k\log\frac{n}{k})} = e^{-n(d+1)(1-R(1-\log R))}.

The exponent n(d + 1)(1 − R(1 − log R)) increases with both n and d.
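The termwise bound ∏_j (n−i−j)/(n−j) ≤ ((n−i)/n)^{d+1} used in this expansion, and the resulting exponent, can be sanity-checked numerically (a sketch with illustrative parameters):

```python
from math import comb, log

def product_p(n, k, d):
    # Exact product of up-transition probabilities: prod_{i=0}^{n-k} C(n-i,d+1)/C(n,d+1).
    out = 1.0
    for i in range(n - k + 1):
        out *= comb(n - i, d + 1) / comb(n, d + 1)
    return out

def bound(n, k, d):
    # Upper bound: prod_{i=0}^{n-k} ((n-i)/n)^(d+1).
    out = 1.0
    for i in range(n - k + 1):
        out *= ((n - i) / n) ** (d + 1)
    return out

n, k, d = 40, 20, 4
assert 0 < product_p(n, k, d) <= bound(n, k, d)
# The bound's exponent tracks n(d+1)(1 - R(1 - log R)) up to lower-order terms.
R = k / n
print(-log(bound(n, k, d)), n * (d + 1) * (1 - R * (1 - log(R))))
```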
Chapter 7
Implementation Aspects and
Numerical Results
In this chapter, we consider some aspects of the storage system of Chapter 6 which were not covered by the previous analysis, but which may have an impact on the code performance in practical implementations. Some factors arise because our storage model may be too simplistic for certain scenarios, e.g. the real distributions P_F, P_H of failed and helper nodes may not be uniform, or the number of available helpers d may not be the same across iterations. We shall call these model factors. Other factors are implementation-specific and result from the intention to make the implementation simple and cost-efficient, e.g. a small field size or sparse packet recoding. We refer to these as implementation factors. We study the impacts of these aspects both individually and together in numerical simulations. In addition, we evaluate the fault tolerance of the system in terms of the number of nodes that must be accessed to decode the source file.
7.1 RLNC Recoding
In our model, packet recoding happens at the helper and replacement nodes (matrices D_H, D_R of Equation (6.1), respectively). Dense recoding matrices result in high CPU utilization during recoding of large data blocks, because the number of finite-field operations is proportional to the number of non-zero elements in the recoding matrix. Therefore, it is desirable to make the matrices sparse [64, 65]. When n_1 input packets are RLNC-recoded into n_2 output packets with a sparse matrix, each input packet participates in only a small fraction of the n_2 output packets. We evaluate our system performance in the following regimes: full recoding, with an upper-triangular full-rank recoding matrix; no recoding, with the recoding matrix being a column-shuffled identity matrix; and sparse recoding, which starts with a no-recoding matrix and adds one more random element in each column, picked uniformly from F_q \ {0}. Thus, our sparse recoding ensures that every incoming packet is used in some output linear combination, while most elements of the recoding matrix remain zero.

[Figure 7-1: Performance of various recoding regimes: No recoding (N), Sparse recoding (S), and Full recoding (FR) for a system with parameters n = 20, t = 2000, d = 4, a = 3, b = 2. The legend indicates the recoding regimes (helper, replacement nodes).]

[Figure 7-2: Impact of the effective field size q^a on the average rank of W_t for a system with parameters n = 20, d = 4, a = b, t = 1000. The actual field size used is q.]
Figure 7-1 shows the testing results for various recoding regimes at the helper and replacement nodes. According to the results, E[rank W_t] is largely determined by the recoding regime at the replacement nodes. With no recoding at the replacement nodes, the number of helper packets is effectively decreased from db to a, which severely affects the average rank. Sparse recoding is sufficient to achieve more than 80% of the full-recoding performance. Note that sparse recoding with q = 2 results in a much higher expected rank than no recoding with large field sizes.
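The three recoding regimes can be sketched as matrix generators (an illustration of the description above, not the actual simulation code; for simplicity we assume a square n × n recoding matrix over F_q):

```python
import random

def no_recoding(n, rng):
    # Column-shuffled identity: each output packet is a copy of a distinct input packet.
    perm = list(range(n))
    rng.shuffle(perm)
    M = [[0] * n for _ in range(n)]
    for col, row in enumerate(perm):
        M[row][col] = 1
    return M

def sparse_recoding(n, q, rng):
    # Start from a no-recoding matrix and add one extra nonzero entry per column,
    # picked uniformly from Fq \ {0}, so every input appears in some output combination.
    M = no_recoding(n, rng)
    for col in range(n):
        free_rows = [r for r in range(n) if M[r][col] == 0]
        M[rng.choice(free_rows)][col] = rng.randrange(1, q)
    return M

rng = random.Random(1)
n = 8
M = sparse_recoding(n, 2, rng)
for col in range(n):
    # every input packet participates, yet the matrix stays sparse: 2 nonzeros per column
    assert sum(1 for r in range(n) if M[r][col]) == 2
```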
7.2 Small Field Size
In this section, we explore numerically the impact of the field size on the code performance. Large-field arithmetic operations are computationally expensive, and we are looking into the possibility of operating with smaller fields. In their work on RLNC, Ho et al. [11] show that if a problem of multicasting to D receivers over an acyclic network is solvable for some fixed network code coefficients, then RLNC with uniformly distributed coefficients from F_q provides a valid solution w.p. at least (1 − D/q)^η, where η is the maximum number of IFG edges originating from the nodes performing coding in any minimum cut-set between the source and the sink. In our model, η grows linearly with the number of failures, which makes the lower bound negligibly low for large t, especially for small q.
We compare the performance for different field sizes for n = 20, d = 4, t = 1000. First, we perform the test for single-packet nodes a = b = 1 and varying field size q. Second, we fix the field size to q = 2 and perform the test for varying node size a with b = a. We expect an RLNC code over the field F_q with a packets per node to perform similarly to a code over the field extension F_{q^a} with 1 packet per node; in fact, multiplying a packet by a scalar over F_{q^a} can be represented as multiplying a packets by a certain full-rank a × a matrix over F_q. In our case, q^a can be thought of as the effective field size. The plot in Figure 7-2 indicates that for a = 1, field sizes from q = 17 upward are enough to achieve more than 90% of the limiting q → ∞ expected rank. Operating with multiple packets per node a > 1 allows using the binary field, with simple bitwise XOR addition and AND multiplication: the average rank per node packet closely approaches that with a large field size and a = 1.
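The matrix representation of extension-field scalars can be made concrete for the smallest case F_4 = F_2[x]/(x^2 + x + 1) (a self-contained sketch; elements are coefficient pairs (c0, c1) standing for c0 + c1·x):

```python
def f4_mul(u, v):
    # Multiply two F_4 elements given as coefficient pairs over F_2,
    # using the reduction x^2 = x + 1.
    c0 = (u[0] & v[0]) ^ (u[1] & v[1])                    # u0*v0 + u1*v1
    c1 = (u[0] & v[1]) ^ (u[1] & v[0]) ^ (u[1] & v[1])    # u0*v1 + u1*v0 + u1*v1
    return (c0, c1)

def matrix_of(g):
    # The 2x2 matrix over F_2 representing multiplication by g:
    # its columns are the images of the basis elements 1 and x.
    c1, cx = f4_mul(g, (1, 0)), f4_mul(g, (0, 1))
    return [[c1[0], cx[0]], [c1[1], cx[1]]]

def apply(M, v):
    # Matrix-vector product over F_2 (AND for multiply, XOR for add).
    return ((M[0][0] & v[0]) ^ (M[0][1] & v[1]),
            (M[1][0] & v[0]) ^ (M[1][1] & v[1]))

elems = [(a, b) for a in (0, 1) for b in (0, 1)]
for g in elems:
    M = matrix_of(g)
    for v in elems:
        assert apply(M, v) == f4_mul(g, v)   # matrix action = field multiplication
```

For nonzero g the matrix is full rank, which is what makes a packets over F_2 behave like a single packet over the effective field F_4.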
7.3 Failed and Helper Nodes Distributions
In this section, we study the system behavior with non-uniform distributions of failed and helper nodes P_F, P_{H_i}. In practice, strict uniformity is not achievable, and some nodes will fail, or will be unavailable during repair, more often than others. We assume that P_{H_i} is the same for all nodes i, and denote it simply P_H. Without loss of generality, we assume that the nodes are sorted in non-decreasing order of their probability mass under P_H. We consider a family of power-law distributions where the probability of picking node i is p(i) ∝ i^x, x ≥ 0, i ∈ [1, n], with normalization p(n)/p(1) = 10 for x > 0. x = 0 corresponds to the uniform case, while for x ≫ 1 the first several nodes are much less probable than the others (Figure 7-3).

[Figure 7-3: Probability mass functions of the test node distributions for a storage with n = 20 nodes. Given a fixed parameter x, the probability of the i-th atom is p(i) ∝ i^x. Larger values of x lead to stronger concentration of probability at the nodes with high indices.]

[Figure 7-4: Impact of the failed and helper node distributions P_F, P_H on the average rank for n = 20, t = 1000, d = 4. The distributions have p(i) ∝ i^x for x ∈ {x_F, x_H}. x_F < 0 corresponds to p(i) ∝ (n + 1 − i)^{|x_F|}.]
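The test distributions can be generated as follows (a sketch; we simply normalize p(i) ∝ i^x to sum to one, leaving out the p(n)/p(1) calibration used for the thesis plots):

```python
def power_law_pmf(n, x):
    # p(i) proportional to i^x on i = 1..n; x = 0 gives the uniform distribution.
    weights = [i ** x for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pmf = power_law_pmf(20, 1.0)
assert abs(sum(pmf) - 1) < 1e-12
assert all(pmf[i] <= pmf[i + 1] for i in range(19))   # mass concentrates at high indices
```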
Figure 7-4 shows the numerical evaluation results for the scenarios with varying P_F, P_H with probability mass functions (pmf) p(i) ∝ i^{x_F}, resp. i^{x_H}, for n = 20, d = 4, t = 1000. The minus sign of the parameter x_F means that the pmf of P_F increases in the node order opposite to that of the pmf of P_H, i.e. p(i) ∝ (n + 1 − i)^{|x_F|}. Positive, resp. negative values of x_F bias the distribution towards higher, resp. lower indices. The plot indicates that for the uniform helper node distribution x_H = 0, the average rank is largely insensitive to the choice of P_F. For non-uniform P_H with x_H > 0, though, the rank drops with x_F and x_F − x_H. In particular, for uniform P_F, the rank decreases significantly as P_H becomes less uniform. Intuitively, uniform P_H results in the best possible diversity of the helper data and the regenerated packets. As P_H becomes more non-uniform, it gets biased towards the nodes with higher indices, and the majority of the helper packets end up coming from those high-probability nodes; additionally, negative x_F makes the lower-index nodes fail more often. Both of these effects reduce the packet diversity. On the contrary, when x_F > x_H ≥ 0, the lower-index nodes become helpers more often than failures, while failures mostly happen in a narrow high-index range; as a result, the rank becomes even slightly higher than in the all-uniform scenario.

[Figure 7-5: Impact of the standard deviation of the number of helper nodes d on the average rank for n = 20, a = b = 1, t = 1000. Beta-binomial distributions with different supports are used.]

[Figure 7-6: Decoding error probability p_e^dc = Pr[rank M_t|_S < k | rank M_t = k] for a randomly chosen column set S ⊂ [n], |S| = n_dc; n = 20, d = 4, a = b = 1, k = nR.]
7.4 Variable Number of Helpers
In this section, we consider a scenario where the number of helper nodes d at each repair is a random variable. This is a reasonable assumption for ad-hoc and P2P networks, where connectivity changes dynamically and there may be fewer or more nodes connected to a given node at certain moments. We test the average performance for a fixed mean of d and varying standard deviation and support. The distribution of d is chosen to be beta-binomial, which models the number of independent successful connections to helper nodes out of some finite set, such that the probability of a successful connection follows the beta distribution. Figure 7-5 depicts the numerical evaluation results for two different expected values of d. The performance generally worsens as the standard deviation increases. However, the expected rank is much more sensitive to changes in E[d] than to changes in its standard deviation σ_d.
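A beta-binomial d can be sampled with the standard library alone (a sketch; the values d_max, α, β are illustrative and chosen so that E[d] = d_max · α/(α + β) = 4):

```python
import random

def sample_d(d_max, alpha, beta, rng):
    # d ~ BetaBinomial(d_max, alpha, beta): each of d_max potential helpers
    # connects independently with a common random success probability p ~ Beta(alpha, beta).
    p = rng.betavariate(alpha, beta)
    return sum(rng.random() < p for _ in range(d_max))

rng = random.Random(7)
d_max, alpha, beta = 8, 2.0, 2.0   # E[d] = 8 * 2/(2+2) = 4
samples = [sample_d(d_max, alpha, beta, rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # sample mean, close to 4
```

Increasing α + β at a fixed ratio α/(α + β) shrinks σ_d toward the plain binomial case, matching the narrower supports tested in Figure 7-5.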
7.5 Fault Tolerance
As shown previously, and specifically in Figure 7-7, in our model scenario RLNCs significantly outperform RCs. A reasonable question to ask is whether this performance gain comes at the price of low fault tolerance. Indeed, while RCs ensure that the source file can be decoded even when up to n − k nodes are unavailable (for a = 1), i.e. any set of k nodes can be used for decoding, our RLNCs do not provide such a guarantee. It turns out, however, that k or slightly more nodes picked uniformly at random contain enough data for decoding the source file with high probability. Figure 7-6 shows the decoding error probability p_e^dc, i.e. the probability that a uniformly chosen set of n_dc nodes (out of n) does not have k independent coded packets, conditioned on M_t being full rank. p_e^dc is less than 10^{-1} for n_dc = k, and goes down exponentially fast as n_dc increases.
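This behavior matches what one gets for uniformly random matrices: the probability that a random n_dc × k matrix over F_2 has rank k is ∏_{i=0}^{k−1}(1 − 2^{i−n_dc}), which improves rapidly with the surplus n_dc − k. A quick Monte Carlo cross-check under this idealized i.i.d. model (not the actual evolution matrix M_t):

```python
import random

def rank_gf2(rows, width):
    # Gaussian elimination over F_2, rows packed as integers of `width` bits.
    rows = list(rows)
    rank = 0
    for bit in reversed(range(width)):
        pivot = next((i for i in range(rank, len(rows)) if rows[i] >> bit & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i] >> bit & 1:
                rows[i] ^= rows[rank]
        rank += 1
    return rank

rng = random.Random(3)
k, ndc, trials = 5, 7, 2000
hits = sum(rank_gf2([rng.getrandbits(k) for _ in range(ndc)], k) == k
           for _ in range(trials))
exact = 1.0
for i in range(k):
    exact *= 1 - 2.0 ** (i - ndc)
print(hits / trials, exact)   # empirical and exact full-rank probabilities agree closely
```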
7.6 Effects of Several Factors
We test the RLNC code under the effects of several factors together. In each test, we find the maximal rate R_ε such that the error probability does not exceed ε = 5·10^{-4} for t = 2000. As the base case, we consider a system with n = 20 nodes, a single packet per node a = b = 1, a large field size q = 65537, and fixed d in the range [2, 8]; the base case does not include any of the factors discussed above. Then we incrementally modify the model to account for those factors. Specifically,
1. first, the field size is changed to q = 2 and, to compensate, the node sizes a and b are increased to 6;

2. for the next test, in addition to the new q, a, the size of the helper data per node b is reduced to 4;

3. then, the failed and helper node distributions are made non-uniform with x_H = 1, x_F = −1, as per Figures 7-3 and 7-4;

4. then, the number of helpers d is made a random variable distributed uniformly in the range [E[d] − 2, E[d] + 2], with the same integer mean as in the previous step;
[Figure 7-7: The maximal rate R_ε for error probability under ε = 5·10^{-4}, t = 2000, and n = 20. Tests are first performed for the base case a = b = 1, q = 65537; then various adverse parameter changes are introduced incrementally: +a = b = 6, q = 2; +b = 4; +non-uniform P_F, P_H; +d ± 2; +(No, Sparse) recoding. The maximal theoretical RC code rate for n = 20, a = 6, b = 4 is provided for comparison.]
5. finally, the number of recoding operations is reduced by performing no recoding ("No") at the helpers, and sparse recoding ("Sparse") at the replacement nodes.
Factors 1 and 5 are implementation factors, while those in 2, 3, 4 are model factors. The resulting rates are shown in Figure 7-7. In addition, the maximal possible rate of regenerating codes for the same values of d, n = 20, a = 6, b = 4 is shown by another curve for comparison. The maximal RC rate is given by R = (d − 1 + b/a)/n = (d − 1/3)/20, as per the file size equation (2.6). The plots show that our model is operational with high probability in a wide range of system parameters. Even in the presence of multiple factors that adversely affect the system performance, the resulting coding rate can be significantly higher than the rate provided by the best RCs.
Chapter 8
Conclusions
8.1 Summary
In this thesis, we study the fundamental limits of maintaining redundancy in coded network
storage systems in terms of trade-offs between the storage overhead, fault tolerance, and node
repair cost, measured by repair bandwidth, repair locality, and disk I/O.
In the first part of the thesis, we study clustered storage systems, where storage nodes are
grouped into clusters, with relatively cheap network bandwidth within a cluster and expensive
bandwidth between nodes in different clusters. We extend the regenerating codes framework
by Dimakis et al. [1] to clustered topologies, and introduce generalized regenerating codes
(GRC), which perform node repair using helper data both from the local cluster and from
other clusters. We showed the optimal trade-off between storage overhead and inter-cluster
repair bandwidth, and demonstrated explicit code constructions that achieve the operating
points on the trade-off, which are not achievable by applying the existing codes (or their space-
sharing combinations) to the clustered topology. We also analyzed the intra-cluster bandwidth,
incurred at the optimal trade-off operating points, and demonstrated that, although increasing
the number of local helper nodes improves the trade-off, it also greatly increases the required
intra-cluster bandwidth. Therefore, this three-way trade-off between the storage overhead,
inter-cluster bandwidth, and intra-cluster bandwidth provides an important intuition into the
design of clustered storage systems. The results were also extended for joint repair of multiple
node failures within a cluster. Under functional repair, amortized inter-cluster repair bandwidth
per failed node can be reduced by performing joint repair of several failed nodes instead of a
sequential repair of individual nodes.
In the second part of the thesis, we consider storage in time-varying networks with small repair locality and random opportunistic helper node selection. We show that for a storage system design it is important to focus on the average-case (typical) rather than the worst-case failure patterns, and to consider the lifespan of the stored data. This leads to significant improvements in the storage overhead with very little sacrifice of fault tolerance, as the worst-case failure patterns do not occur during the storage lifetime with overwhelmingly high probability. We demonstrated that a random node selection RLNC-based storage outperforms regenerating codes in terms of achievable rate for a very large number of iterations of node failure and repair. In addition, the performance of the RLNC storage is robust to a wide range of model and implementation assumptions and parameters; in particular, it performs well with the binary field, heavily skewed node distributions, and sparse recoding.
8.2 Future Directions
In this section, we briefly outline potential directions of further research related to the results of the thesis.
Clustered Storage Systems
In Section 3.4.1 we showed an optimal exact-repair GRC construction employing existing classical RC constructions. However, it uses the maximal amount of intra-cluster bandwidth, with γ = γ′ = α. It turns out that the derived bounds (5.2), (5.5) are not tight for exact repair, i.e. there exist operating points on the trade-off which require strictly more bandwidth than γ*, γ′*. It would be of great interest to find tight exact-repair bounds for γ, γ′, and intra-cluster-bandwidth-optimal exact-repair code constructions.

From the practical perspective, it is reasonable to combine the power of LRCs, to perform small-locality repairs of those failures which can be repaired locally without using the inter-cluster bandwidth, with the inter-cluster bandwidth efficiency of GRCs, to be used when the repairs cannot be performed with local information only. In contrast to the work of Kamath
et al. [31], where regenerating codes are used inside the clusters, this extension would employ
LRCs for intra-cluster repairs and GRCs for intra-with-inter-cluster repairs.
Finally, our GRC model and the analysis can be readily generalized to 3 or more levels
of the network hierarchy, e.g. nodes grouped into racks, and racks grouped into data centers,
with repair bandwidth on different levels treated separately. The resulting multi-dimensional
trade-offs between the storage overhead and the repair bandwidths, along with the bandwidth
costs, would provide an important network planning intuition about the optimal repair network
utilization on different hierarchy levels.
Information Survival in Time-Varying Networks
The analysis of the lifetime and the achievable rates in Chapter 6 is mainly based on rank bounds using the processes N_t, N'_t. As demonstrated by Figure 6-5, the bounds are not very tight for large t. It would be useful to find tighter bounds on the rank in order to obtain a better estimation of the achievable rates. The main difficulty in constructing the rank bounds for W_t from the bounds for W_{t−1} is the need to keep track of exponentially many circuits in the system matroid. In fact, the bounds (6.6) are related to the vector matroid M[(W_τ)^T], which captures the dependencies between the rows of the evolution matrix W_τ. Unlike the system matroid, this matroid contains loops, and keeping track of them leads to the bounds (6.6).
One potentially manageable way to analyze the system dynamics is to look at the system in an asymptotic regime. This would also allow studying the Shannon capacity of the storage. An operationally meaningful way to scale the system is to let the number of iterations t grow linearly with the number of nodes n. The total storage size nα and the number of packets per node a can be fixed, while the packet size α/a ∝ 1/n ∝ 1/t goes down as n increases. In this case, the average number of failures per node t/n and the size of the failed and repaired data tα are constant for any value of n. One may expect that if E[rank W_t] = ρ for n nodes, then E[rank W_{2t}] = 2ρ for 2n nodes, because in both cases each node undergoes t/n failures and repairs on average. However, the numerical evaluation results shown in Figure 8-1 indicate that this is the case only for the burn-in phase, when t/n is small. In the stability phase, with larger t/n, the expected rank per node increases with n ∝ t.
A practically challenging aspect of our model is predicting the failure/repair node distributions in a real system. Different failures are not necessarily independent and identically distributed; they may form a Markov or an even more general non-ergodic process. Therefore, the real distributions need to be estimated empirically, and the storage coding rate should be adapted accordingly by adjusting the number of nodes n in the system.

[Figure 8-1: Mean rank per node for t scaled proportionally to n, with d = 4, a = b = 1, for t/n ∈ {0.5, 1, 1.5, 2.5, 5, 15, 50, 150, 500}.]
The General Problem
In practical use-cases, the DSS may take advantage of several ways to maintain redundancy upon node failures. For a given replacement node, some helper nodes may always be available; these can be dedicated backup nodes used for repair only, or the local nodes in a clustered DSS. Other nodes may be available on a random basis, e.g. the nodes in the P2P part of the DSS or mobile nodes. For a degrading storage, there can be multiple ways to prevent the degradation, i.e. the decrease of rank W_t in terms of the model of Part II. One way is increasing the storage overhead by introducing extra storage nodes. Another is a smart selection of the helper nodes out of those which are permanently available. For instance, consider a DSS with d = 3, such that 2 helper nodes are random, while the third one can be selected by the code protocol. If any of the n − 1 surviving nodes can be selected as the third helper, the code with d = 3 can achieve a rate as high as 0.7, which is close to the maximal rate 0.75 of LRCs with d = 3. An alternative way to prevent the degradation is to actively introduce dependencies between the nodes in the time intervals between failure iterations. Doing so is equivalent to performing "artificial" node failures, in which the replacement physical node is the same as the "failed" one. Since the only purpose of such "failures" is maintaining redundancy, they can be performed at carefully chosen moments when the helper node availability favors creating useful dependencies.
Overall, studying more general methods of maintaining redundancy, combinations of those methods, quantifying the optimal contribution of each method, and the trade-offs between them is an important future research direction, and we hope that the approaches presented in this thesis provide novel perspectives on this general problem.
Bibliography

[1] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539–4551, 2010.

[2] D. Reinsel, J. Gantz, and J. Rydning, "Data age 2025: The evolution of data to life-critical," Framingham: IDC Analyze the Future, 2017.

[3] L. Rizzatti, "Digital data storage is undergoing mind-boggling growth," EETimes, 2016.

[4] "AWS storage services overview," Amazon Web Services Whitepapers, 2016. [Online]. Available: https://d0.awsstatic.com/whitepapers/Storage/AWS%20Storage%20Services%20Whitepaper-v9.pdf

[5] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, S. Yekhanin et al., "Erasure coding in Windows Azure storage," in Proc. USENIX Annual Tech. Conf. (USENIX ATC), Boston, MA, 2012, pp. 15–26.

[6] P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin, "On the locality of codeword symbols," IEEE Trans. Inf. Theory, vol. 58, no. 11, pp. 6925–6934, 2012.

[7] O. Khan, R. C. Burns, J. S. Plank, and C. Huang, "In search of I/O-optimal recovery from disk failures," in Proc. USENIX Conf. Hot Topics Storage File Systems (HotStorage), 2011.

[8] R. Ahlswede, N. Cai, S.-Y. Li, and R. W. Yeung, "Network information flow," IEEE Trans. Inf. Theory, vol. 46, no. 4, pp. 1204–1216, 2000.

[9] S.-Y. Li, R. W. Yeung, and N. Cai, "Linear network coding," IEEE Trans. Inf. Theory, vol. 49, no. 2, pp. 371–381, 2003.

[10] R. Koetter and M. Medard, "An algebraic approach to network coding," IEEE/ACM Trans. Netw., vol. 11, no. 5, pp. 782–795, 2003.

[11] T. Ho, M. Medard, R. Koetter, D. R. Karger, M. Effros, J. Shi, and B. Leong, "A random linear network coding approach to multicast," IEEE Trans. Inf. Theory, vol. 52, no. 10, pp. 4413–4430, 2006.

[12] P. Sanders, S. Egner, and L. Tolhuizen, "Polynomial time algorithms for network information flow," in Proc. ACM Symp. Parallel Alg. Archit., 2003, pp. 286–294.

[13] S. Jaggi, P. Sanders, P. A. Chou, M. Effros, S. Egner, K. Jain, and L. M. Tolhuizen, "Polynomial time algorithms for multicast network code construction," IEEE Trans. Inf. Theory, vol. 51, no. 6, pp. 1973–1982, 2005.

[14] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 5227–5239, 2011.

[15] Y. Hu, P. P. C. Lee, and K. W. Shum, "Analysis and construction of functional regenerating codes with uncoded repair for distributed storage systems," in Proc. IEEE Int. Conf. Comp. Comm. (INFOCOM), April 2013, pp. 2355–2363.

[16] K. W. Shum and Y. Hu, "Cooperative regenerating codes," IEEE Trans. Inf. Theory, vol. 59, no. 11, 2013.

[17] A.-M. Kermarrec, N. Le Scouarnec, and G. Straub, "Repairing multiple failures with coordinated and adaptive regenerating codes," in Proc. IEEE Network Cod. Theory App. Workshop (NetCod), 2011.

[18] V. R. Cadambe, S. A. Jafar, H. Maleki, K. Ramchandran, and C. Suh, "Asymptotic interference alignment for optimal repair of MDS codes in distributed storage," IEEE Trans. Inf. Theory, vol. 59, no. 5, pp. 2974–2987, 2013.

[19] A. S. Rawat, O. O. Koyluoglu, and S. Vishwanath, "Centralized repair of multiple node failures with applications to communication efficient secret sharing," ArXiv e-prints, vol. abs/1603.04822, 2016. [Online]. Available: http://arxiv.org/abs/1603.04822

[20] S. Pawar, S. E. Rouayheb, and K. Ramchandran, "Securing dynamic distributed storage systems against eavesdropping and adversarial attacks," IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6734–6753, Oct 2011.

[21] N. B. Shah, K. V. Rashmi, and P. V. Kumar, "Information-theoretically secure regenerating codes for distributed storage," in Proc. IEEE Global Telecomm. Conf. (GLOBECOM), Dec 2011, pp. 1–5.

[22] Y. L. Chen, S. Mu, J. Li, C. Huang, J. Li, A. Ogus, and D. Phillips, "Giza: Erasure coding objects across global data centers," in Proc. USENIX Annual Tech. Conf. (USENIX ATC), 2017, pp. 539–551.

[23] H. Abu-Libdeh, L. Princehouse, and H. Weatherspoon, "RACS: a case for cloud storage diversity," in Proc. ACM Symp. Cloud Computing, 2010, pp. 229–240.

[24] A. Bessani, M. Correia, B. Quaresma, F. Andre, and P. Sousa, "DepSky: dependable and secure storage in a cloud-of-clouds," ACM Trans. Storage, vol. 9, no. 4, p. 12, 2013.

[25] J. Y. Chung, C. Joe-Wong, S. Ha, J. W.-K. Hong, and M. Chiang, "CYRUS: Towards client-defined cloud storage," in Proc. European Conf. Comp. Sys., 2015, p. 17.

[26] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, "A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster," in Proc. USENIX Conf. Hot Topics Storage File Systems (HotStorage), 2013.

[27] D. S. Papailiopoulos and A. G. Dimakis, "Locally repairable codes," IEEE Trans. Inf. Theory, vol. 60, no. 10, pp. 5843–5855, 2014.

[28] H. D. Hollmann, "On the minimum storage overhead of distributed storage codes with a given repair locality," in Proc. IEEE Int. Symp. Inf. Theory, 2014, pp. 1041–1045.

[29] I. Ahmad and C.-C. Wang, "When and by how much can helper node selection improve regenerating codes?" in Proc. IEEE Annual Allerton Conf. Comm., Control, Computing (Allerton), 2014, pp. 459–466.

[30] ——, "When locally repairable codes meet regenerating codes: what if some helpers are unavailable," in Proc. IEEE Int. Symp. Inf. Theory, 2015, pp. 849–853.

[31] G. M. Kamath, N. Prakash, V. Lalitha, and P. V. Kumar, "Codes with local regeneration and erasure correction," IEEE Trans. Inf. Theory, vol. 60, no. 8, pp. 4637–4660, 2014.
[32] I. Tamo, A. Barg, and A. Frolov, “Bounds on the parameters of locally recoverable codes,”IEEE Trans. Inf. Theory , vol. 62, no. 6, pp. 3070–3083, 2016.
[33] K. M. Greenan, J. S. Plank, J. J. Wylie et al., “Mean time to meaningless: MTTDL,Markov models, and storage system reliability.” in Proc. USENIX Conf. Hot Topics StorageFile Systems (HotStorage), 2010.
[34] D. J. MacKay, Information theory, inference, and learning algorithms. Citeseer, 2003,vol. 7.
[35] F. J. MacWilliams and N. J. A. Sloane, The theory of error-correcting codes. Elsevier,1977.
[36] Y. Wu, “Existence and construction of capacity-achieving network codes for distributedstorage,” IEEE J. Sel. Areas Commun., vol. 28, no. 2, pp. 277–288, February 2010.
[37] J. G. Oxley, Matroid theory. Oxford University Press, USA, 2006, vol. 3.
[38] Y. Hu, P. P.-C. Lee, and X. Zhang, “Double regenerating codes for hierarchical datacenters,” in Proc. IEEE Int. Symp. Inf. Theory. IEEE, 2016.
[39] J. Sohn, B. Choi, S. W. Yoon, and J. Moon, “Capacity of clustered distributedstorage,” ArXiv e-prints , vol. abs/1610.04498, 2016. [Online]. Available: http://arxiv.org/abs/1610.04498
[40] B. Gaston, J. Pujol, and M. Villanueva, “A realistic distributed storage system: the rackmodel,” ArXiv e-prints , vol. abs/1302.5657, 2013.
[41] J. Pernas, C. Yuen, B. Gastn, and J. Pujol, “Non-homogeneous two-rack model for dis-tributed storage systems,” in Proc. IEEE Int. Symp. Inf. Theory , July 2013, pp. 1237–1241.
[42] G. Calis and O. O. Koyluoglu, “Architecture-aware coding for distributed storage:Repairable block failure resilient codes,” ArXiv e-prints , vol. abs/1605.04989, 2016.[Online]. Available: http://arxiv.org/abs/1605.04989
[43] K. V. Rashmi, N. B. Shah, and P. V. Kumar, “Enabling node repair in any erasure codefor distributed storage,” in Proc. IEEE Int. Symp. Inf. Theory , July 2011.
[44] N. B. Shah, K. V. Rashmi, and P. V. Kumar, “A flexible class of regenerating codes fordistributed storage,” in Proc. IEEE Int. Symp. Inf. Theory , June 2010, pp. 1943–1947.
115
[45] Q. Yu, K. W. Shum, and C. W. Sung, “Tradeoff between storage cost and repair cost inheterogeneous distributed storage systems,” Trans. Emerging Telecomm. Tech., vol. 26,no. 10, pp. 1201–1211, 2015.
[46] T. Ernvall, S. El Rouayheb, C. Hollanti, and H. V. Poor, “Capacity and security of hetero-geneous distributed storage systems,” IEEE J. Sel. Areas Commun., vol. 31, no. 12, pp.2701–2709, 2013.
[47] S. Akhlaghi, A. Kiani, and M. R. Ghanavati, “Cost-bandwidth tradeoff in distributedstorage systems,” Comp. Comm., vol. 33, no. 17, pp. 2105–2115, 2010.
[48] J. Li, S. Yang, X. Wang, and B. Li, “Tree-structured data regeneration in distributedstorage systems with regenerating codes,” in Proc. IEEE Int. Conf. Comp. Comm. (IN-FOCOM). IEEE, 2010, pp. 1–9.
[49] Y. Wang, D. Wei, X. Yin, and X. Wang, “Heterogeneity-aware data regeneration in dis-tributed storage systems,” in Proc. IEEE Int. Conf. Comp. Comm. (INFOCOM), April2014, pp. 1878–1886.
[50] J. Bang-Jensen and G. Z. Gutin, Digraphs: theory, algorithms and applications. SpringerScience and Business Media, 2008.
[51] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, andS. Quinlan, “Availability in globally distributed storage systems,” in Proc. USENIX Conf.Oper. Systems Design Implem. (OSDI), vol. 10, 2010, pp. 1–7.
[52] M. Gerami, M. Xiao, and M. Skoglund, “Two-layer coding in distributed storage systemswith partial node failure/repair,” IEEE Commun. Lett., vol. PP, no. 99, 2017.
[53] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, “Distributed storage codeswith repair-by-transfer and nonachievability of interior points on the storage-bandwidthtradeoff,” IEEE Trans. Inf. Theory , vol. 58, no. 3, pp. 1837–1852, March 2012.
[54] S. Yekhanin, “Locally decodable codes,” Foundations and Trends R© in Theoretical Com-puter Science, vol. 6, no. 3, pp. 139–255, 2012.
[55] F. H. Fitzek, T. Toth, A. Szabados, M. V. Pedersen, D. E. Lucani, M. Sipos, H. Charaf,and M. Medard, “Implementation and performance evaluation of distributed cloud storagesolutions using random linear network coding,” in Proc. IEEE Int. Conf. Comm. Workshop(ICC). IEEE, 2014, pp. 249–254.
[56] A. Mazumdar, “Storage capacity of repairable networks,” IEEE Trans. Inf. Theory , vol. 61,no. 11, pp. 5810–5821, 2015.
[57] M. G. Luby, R. Padovani, T. J. Richardson, L. Minder, and P. Aggarwal, “Liquid cloudstorage,” ArXiv e-prints , vol. abs/1705.07983, 2017.
[58] T. Ho, B. Leong, M. Medard, R. Koetter, Y.-H. Chang, and M. Effros, “On the utility ofnetwork coding in dynamic environments,” in Proc. IEEE Int. Workshop Wireless Ad-HocNet. IEEE, 2004, pp. 196–200.
[59] J.-S. Park, M. Gerla, D. S. Lun, Y. Yi, and M. Medard, “Codecast: a network-coding-basedad hoc multicast protocol,” IEEE Trans. Wireless Commun., vol. 13, no. 5, 2006.
116
[60] J. Widmer and J.-Y. Le Boudec, “Network coding for efficient communication in extremenetworks,” in Proc. ACM SIGCOMM Workshop on delay-tolerant networking. ACM,2005, pp. 284–291.
[61] Y. Lin, B. Li, and B. Liang, “Stochastic analysis of network coding in epidemic routing,”IEEE J. Sel. Areas Commun., vol. 26, no. 5, 2008.
[62] L. Sassatelli and M. Medard, “Inter-session network coding in delay-tolerant networksunder spray-and-wait routing,” in Proc. IEEE Int. Symp. on Modeling and Optimizationin Mobile, Ad Hoc, and Wireless Networks (WiOpt). IEEE, 2012, pp. 103–110.
[63] B. Sury, T. Wang, and F.-Z. Zhao, “Identities involving reciprocals of binomial coefficients,”J. of Integer Sequences, vol. 7, no. 2, p. 3, 2004.
[64] D. Silva, W. Zeng, and F. R. Kschischang, “Sparse network coding with overlappingclasses,” in Proc. IEEE Network Cod. Theory App. Workshop (NetCod). IEEE, 2009,pp. 74–79.
[65] S. Feizi, D. E. Lucani, C. W. Sørensen, A. Makhdoumi, and M. Medard, “Tunable sparsenetwork coding for multicast networks,” in Proc. IEEE Int. Symp. Network Cod. (NetCod).IEEE, 2014, pp. 1–6.
117
118
Chapter 9
Appendices
9.1 MRGRC Chain Order Lemma 4.2.2
MRGRC Chain Order Lemma. Let $b > 1$, i.e. $t \nmid (m-\ell)$. Consider any $S_i \subset [n]$, $|S_i| = i$, $1 \le i \le k-1$. Then, for any $i' \in [n] - S_i$, there exists a permutation $\sigma_{i',S_i}$ of $\{\ell+1, \ell+2, \ldots, m\}$ such that
$$H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, Y_{S_i}, Y_{i',[1,\ell]}, \{Y_{i',\sigma_{i',S_i}(j)}\}_{j \in [\ell+1, j'-1]}\right) \le \min\left(\alpha, \frac{(d-i)\beta}{t}\right), \tag{9.1}$$
for all $j' \in \{m-b+1, m-b+2, \ldots, m\}$.
Proof. We present a candidate permutation $\sigma_{i',S_i}$. Consider the content of cluster $i'$, given by $\{Y_{i',1}, Y_{i',2}, \ldots, Y_{i',m}\}$. Define the quantities $(j_m, \mathcal{V}_m), (j_{m-1}, \mathcal{V}_{m-1}), \ldots, (j_{m-b+1}, \mathcal{V}_{m-b+1})$, in this respective order, as follows:
1. Let $\mathcal{U} = \{Y_{i',\ell+1}, Y_{i',\ell+2}, \ldots, Y_{i',m}\}$, and set $x = 0$.
2. Define $(j_{m-x}, \mathcal{V}_{m-x})$ as
$$(j_{m-x}, \mathcal{V}_{m-x}) = \operatorname*{argmin}_{\substack{(j, \mathcal{V}) :\; Y_{i',j} \in \mathcal{U},\\ \mathcal{V} \subset \mathcal{U} - \{Y_{i',j}\},\; |\mathcal{V}| = t-1}} H\left(Y_{i',j} \,\middle|\, \mathcal{V}, Y_{S_i}, Y_{i',[1,\ell]}\right).$$
3. If $x < b-1$, update $\mathcal{U}$ as $\mathcal{U} = \mathcal{U} - \{Y_{i',j_{m-x}}\}$, increment $x$ by 1, and return to Step 2.
Additionally, let us also define $\{j_{\ell+1}, j_{\ell+2}, \ldots, j_{m-b}\} \triangleq \{\ell+1, \ldots, m\} - \{j_m, j_{m-1}, \ldots, j_{m-b+1}\}$. In this definition we only need equality as sets: we do not care about any particular ordering of the elements of $\{\ell+1, \ldots, m\} - \{j_m, j_{m-1}, \ldots, j_{m-b+1}\}$ while associating them with $\{j_{\ell+1}, j_{\ell+2}, \ldots, j_{m-b}\}$. The candidate permutation $\sigma_{i',S_i}$ on the set $\{\ell+1, \ldots, m\}$ is now defined as
$$\sigma_{i',S_i}(p) = j_p, \quad \ell+1 \le p \le m. \tag{9.2}$$
We will show that the permutation $\sigma_{i',S_i}$ satisfies
$$H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, Y_{S_i}, Y_{i',[1,\ell]}, \{Y_{i',\sigma_{i',S_i}(j)}\}_{j \in [\ell+1, j'-1]}\right) \le \min\left(\alpha, \frac{(d-i)\beta}{t}\right), \tag{9.3}$$
for all $j' \in \{m-b+1, m-b+2, \ldots, m\}$. Consider the variable $j'$ appearing in (9.3), and let $j' = m - x$ for some $x$, $0 \le x \le b-1$, so that by (9.2) we have $\sigma_{i',S_i}(j') = j_{m-x}$. From the definition of $(j_{m-x}, \mathcal{V}_{m-x})$ in Step 2 above, we know that
$$H\left(Y_{i',j_{m-x}} \,\middle|\, \mathcal{V}_{m-x}, Y_{S_i}, Y_{i',[1,\ell]}\right) \le H\left(Y_{i',j_p} \,\middle|\, \mathcal{V}, Y_{S_i}, Y_{i',[1,\ell]}\right), \tag{9.4}$$
for all $\mathcal{V} \subset \{Y_{i',j_{\ell+1}}, Y_{i',j_{\ell+2}}, \ldots, Y_{i',j_{m-x}}\} - \{Y_{i',j_p}\}$ such that $|\mathcal{V}| = t-1$, and for all $p$, $\ell+1 \le p \le m-x-1$. To prove (9.3), first observe that
$$H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, Y_{S_i}, \{Y_{i',\sigma_{i',S_i}(j)}\}_{j \in [\ell+1, j'-1]}\right) \le H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, Y_{S_i}, \mathcal{V}_{m-x}, Y_{i',[1,\ell]}\right). \tag{9.5}$$
This follows from the fact that $\mathcal{V}_{m-x} \subset \{Y_{i',j_{\ell+1}}, Y_{i',j_{\ell+2}}, \ldots, Y_{i',j_{m-x-1}}\}$. Without loss of generality, assume that $\mathcal{V}_{m-x} = \{Y_{i',j_{\ell+1}}, Y_{i',j_{\ell+2}}, \ldots, Y_{i',j_{\ell+t-1}}\}$. Next, from the exact repair condition given in (4.6), we know that
$$\begin{aligned}
\min(t\alpha, (d-i)\beta) &\ge H\left(Y_{i',\sigma_{i',S_i}(j')}, \mathcal{V}_{m-x} \,\middle|\, Y_{S_i}, Y_{i',[1,\ell]}\right) \\
&= \sum_{p=\ell+1}^{\ell+t-1} H\left(Y_{i',j_p} \,\middle|\, Y_{i',j_{\ell+1}}, \ldots, Y_{i',j_{p-1}}, Y_{S_i}, Y_{i',[1,\ell]}\right) + H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, \mathcal{V}_{m-x}, Y_{S_i}, Y_{i',[1,\ell]}\right) \\
&\ge \sum_{p=\ell+1}^{\ell+t-1} H\left(Y_{i',j_p} \,\middle|\, \mathcal{V}_{j_p}, Y_{S_i}, Y_{i',[1,\ell]}\right) + H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, \mathcal{V}_{m-x}, Y_{S_i}, Y_{i',[1,\ell]}\right),
\end{aligned}$$
where $\mathcal{V}_{j_p} = \mathcal{V}_{m-x} - \{Y_{i',j_p}\} \cup \{Y_{i',\sigma_{i',S_i}(j')}\}$. Noting that $|\mathcal{V}_{j_p}| = t-1$, each term under the summation can be lower bounded using (9.4), i.e.,
$$H\left(Y_{i',j_p} \,\middle|\, \mathcal{V}_{j_p}, Y_{S_i}, Y_{i',[1,\ell]}\right) \ge H\left(Y_{i',j_{m-x}} \,\middle|\, \mathcal{V}_{m-x}, Y_{S_i}, Y_{i',[1,\ell]}\right) = H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, \mathcal{V}_{m-x}, Y_{S_i}, Y_{i',[1,\ell]}\right). \tag{9.6}$$
Therefore,
$$\min(t\alpha, (d-i)\beta) \ge t\, H\left(Y_{i',\sigma_{i',S_i}(j')} \,\middle|\, \mathcal{V}_{m-x}, Y_{S_i}, Y_{i',[1,\ell]}\right). \tag{9.7}$$
The proof of the lemma now follows by combining (9.7) with (9.5).
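The greedy construction of $\sigma_{i',S_i}$ in Steps 1–3 amounts to repeatedly picking the variable that is cheapest, in conditional entropy, to repair from $t-1$ peers. The sketch below illustrates that selection rule with a hypothetical toy entropy model — each variable is a set of independent unit-entropy symbols and conditional entropy counts uncovered symbols — not with the thesis's actual random variables; the function names and the example instance are purely illustrative.

```python
from itertools import combinations

def cond_entropy(y, knowns):
    """Toy model: H(y | knowns) = number of symbols of y not covered by knowns."""
    return len(set(y) - set().union(*knowns)) if knowns else len(set(y))

def greedy_order(Y, l, m, t, b, side_info):
    """Steps 1-3: b times, pick the variable cheapest to repair from t-1 peers
    (plus side information, e.g. local symbols), then remove it from U."""
    U = set(range(l, m))              # 0-based stand-in for indices l+1..m
    picked = []                       # j_m, j_{m-1}, ..., j_{m-b+1}
    for _ in range(b):
        best = None
        for j in sorted(U):           # sorted() for a deterministic argmin
            for V in combinations(sorted(U - {j}), t - 1):
                h = cond_entropy(Y[j], [Y[v] for v in V] + side_info)
                if best is None or h < best[0]:
                    best = (h, j)
        picked.append(best[1])
        U.remove(best[1])
    rest = sorted(set(range(l, m)) - set(picked))
    return rest + picked[::-1]        # sigma: unpicked first, then j_{m-b+1}..j_m

# Toy instance: four variables sharing symbols, no side information.
Y = {0: {'s1', 's2'}, 1: {'s2', 's3'}, 2: {'s3'}, 3: {'s1', 's3'}}
order = greedy_order(Y, l=0, m=4, t=2, b=2, side_info=[])
print(order)  # -> [1, 3, 0, 2]
```

In this toy run, variable 2 is picked first (it is fully recoverable from variable 1), so it is placed last in the permutation, matching the intent that the cheapest-to-repair variables occupy the final $b$ positions.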
9.2 Achievability of the FR File Size Bound for MRGRC (Theorem 4.3.1)
FR MRGRC Capacity. The file size $B$ of a GRC with parameters $\{(n,k,d), (\alpha,\beta), (m,\ell,t)\}$ under the FR regime is upper bounded by
$$B \le B_F^* = \ell k \alpha + a \sum_{i=0}^{k-1} \min(t\alpha, (d-i)\beta) + \sum_{i=0}^{k-1} \min(b\alpha, (d-i)\beta), \tag{9.8}$$
where $m - \ell = at + b$ with $b \in [0, t-1]$. The bound is tight if there is a known upper bound on the number of repairs in the system.
Proof. To prove achievability of the bound, we show that for any valid IFG, regardless of the specific sequence of failures and repairs, $B_F^*$ is indeed a lower bound on the minimum possible value of any S-Z cut. Consider a cut of the IFG, and let $\mathcal{U}$ and $\mathcal{V}$ be the two disjoint parts associated with nodes S and Z, respectively. Without loss of generality, we only consider cuts such that $\mathcal{V}$ contains at least $k$ external nodes corresponding to active clusters. Consider a topological sorting of the IFG nodes such that: 1) an edge exists between two nodes A and B only if A appears before B in the sorting, and 2) all in-, out-, external, and repair nodes (if $\tau > 0$) of the cluster $X_i(\tau)$ appear together in the sorted order, $\forall i, \tau$.
Consider the sequence $\mathcal{E}$ of all the external nodes, in both active and inactive clusters, in $\mathcal{V}$ in their sorted order. Let $Y_1$ denote the first node in $\mathcal{E}$. Without loss of generality, let $Y_1 = X_1^{ext}(\tau_1)$ for some $\tau_1$. Next, consider the subsequence of $\mathcal{E}$ obtained by excluding from $\mathcal{E}$ all the external nodes associated with $X_1$, and let $Y_2$ denote the first external node in this subsequence. We continue in this manner until we find the first $k$ external nodes $\{Y_1, Y_2, \ldots, Y_k\}$ in $\mathcal{E}$ such that each of the $k$ nodes corresponds to a distinct physical cluster. Without loss of generality, let us also assume that $Y_i = X_i^{ext}(\tau_i)$, $2 \le i \le k$, for some $\tau_i$. If $\tau_i = 0$, then clearly cluster $i$ contributes (at least) $m\alpha$ to the cut. Thus, let us assume that $\tau_i > 0$, $1 \le i \le k$.
Consider the $m$ out-nodes $X_{i,1}^{out}(\tau_i), \ldots, X_{i,m}^{out}(\tau_i)$ that connect to $X_i^{ext}(\tau_i)$. For each $j \in [1,m]$, either $X_{i,j}^{out}(\tau_i)$ is in $\mathcal{U}$, or there exists a minimal $\tau_{i,j} \in [0, \tau_i]$ such that $X_{i,j}^{out}(\tau_{i,j}) \in \mathcal{V}$. Consider those values of $j \in [1,m]$ for which all the following conditions hold:
$$X_{i,j}^{out}(\tau_i),\, X_{i,j}^{in}(\tau_{i,j}) \in \mathcal{V}, \qquad j \in R_i(\tau_{i,j}-1), \qquad X_i^{rep}(\tau_{i,j}) \in \mathcal{V}. \tag{9.9}$$
Let there be $m_i \in [0,m]$ such values and, without loss of generality, let them be $m - m_i + 1, \ldots, m$. Also without loss of generality, let the indices $j$ be sorted in order of increasing $\tau_{i,j}$, i.e. $j_1 < j_2$ implies $\tau_{i,j_1} \le \tau_{i,j_2}$. For each $j \in [m-m_i+1, m]$, $\mathcal{W}_{i,j} \triangleq \{j' : \tau_{i,j'} = \tau_{i,j},\ j' \in [m-m_i+1, m]\}$ is a contiguous set of at most $t$ indices of nodes that share the same $\tau_{i,j}$ and are repaired together from the same repair node. Let $S_i = \{\text{distinct } (\min \mathcal{W}_{i,j} - 1),\ \forall j \in [m-m_i+1, m]\} \subseteq [m-m_i, m-1]$ be the set of indices of the nodes preceding all contiguous groups $\mathcal{W}_{i,j}$, where $\min \mathcal{W}_{i,j}$ denotes the minimum element of $\mathcal{W}_{i,j}$. The set $S_i$ is in one-to-one correspondence with the set of repair nodes in (9.9) for $j \in [m-m_i+1, m]$. Note that $m - m_i$ is always an element of $S_i$.
In order to relay helper data to $X_{i,j}^{in}(\tau_{i,j})$ for all $j \in [m-m_i+1, m]$, the number of these repair nodes must be at least $\lceil m_i/t \rceil$, so $|S_i| \ge \lceil m_i/t \rceil$. Each of these repair nodes connects to $d$ external nodes in other clusters. By the construction of $\mathcal{E}$, at most $i-1$ of those external nodes can be in $\mathcal{V}$. Thus, each repair node contributes at least $(d-i+1)\beta$ of external helper data to the cut value. In addition, each repair node $X_i^{rep}(\tau_{i,j})$ connects to $\ell$ local nodes. By (9.9) and by the construction of $S_i$ and the sorting of the $\tau_{i,j}$, only nodes with indices $\{1, 2, \ldots, j'\}$ out of these $\ell$ can be in $\mathcal{V}$, where $j' = \min \mathcal{W}_{i,j} - 1$ is the corresponding element of $S_i$. Thus, repair node $X_i^{rep}(\tau_{i,j})$ contributes at least $(\ell - j')^+ \alpha$ of local helper data to the cut value. Finally, each of the $m - m_i$ indices $j \in [1, m-m_i]$ that do not satisfy (9.9) contributes at least $\alpha$ to the cut value.
Based on the observations above, the overall cut value is lower bounded by
$$\operatorname{mincut}(S-Z) \ge \sum_{i=1}^{k} \left( (m - m_i)\alpha + \left\lceil \frac{m_i}{t} \right\rceil (d-i+1)\beta + \sum_{j' \in S_i} (\ell - j')^+ \alpha \right). \tag{9.10}$$
Consider a particular value of $i \in [1,k]$ and the corresponding summation term in (9.10). Assume first that $m - m_i \ge \ell$, so that $m_i = a_i t + b_i \le m - \ell$ with $b_i \in [0, t-1]$. Then the third term in (9.10) is zero, and
$$\begin{aligned}
(m - m_i)\alpha + \lceil m_i/t \rceil (d-i+1)\beta &= m\alpha - (a_i t + b_i)\alpha + (a_i + \mathbb{1}_{b_i>0})(d-i+1)\beta \\
&= m\alpha - a_i\bigl(t\alpha - (d-i+1)\beta\bigr) - b_i\alpha + \mathbb{1}_{b_i>0}(d-i+1)\beta \\
&\overset{(1)}{\ge} \ell\alpha + (m-\ell)\alpha - a\bigl(t\alpha - (d-i+1)\beta\bigr)^+ - \bigl(b\alpha - (d-i+1)\beta\bigr)^+ \\
&= \ell\alpha + a\Bigl(t\alpha - \bigl(t\alpha - (d-i+1)\beta\bigr)^+\Bigr) + \Bigl(b\alpha - \bigl(b\alpha - (d-i+1)\beta\bigr)^+\Bigr) \\
&= \ell\alpha + a\min\bigl(t\alpha, (d-i+1)\beta\bigr) + \min\bigl(b\alpha, (d-i+1)\beta\bigr) \\
&\triangleq c_i,
\end{aligned}$$
where (1) follows because $a_i t + b_i = m_i \le m - \ell = at + b$, so $a_i \le a$ and, if $a_i = a$, $b_i \le b$.
On the other hand, if $m - m_i = \ell - \mu_i < \ell$, i.e. $m_i > m - \ell = at + b$ and $\mu_i = \ell - (m - m_i) > 0$, then we have
$$\begin{aligned}
(m - m_i)\alpha + \lceil m_i/t \rceil (d-i+1)\beta + \sum_{j' \in S_i} (\ell - j')^+ \alpha
&\ge (\ell - \mu_i)\alpha + (a + \mathbb{1}_{b>0})(d-i+1)\beta + \bigl(\ell - (m - m_i)\bigr)\alpha + \sum_{\substack{j' \in S_i \\ j' > m - m_i}} (\ell - j')^+ \alpha \\
&= \ell\alpha + (a + \mathbb{1}_{b>0})(d-i+1)\beta + \sum_{\substack{j' \in S_i \\ j' > m - m_i}} (\ell - j')^+ \alpha \\
&\ge c_i,
\end{aligned}$$
where $c_i$ is the lower bound for the case $m - m_i \ge \ell$.
Since $B_F^* = \sum_{i=1}^{k} c_i$, the value of any S-Z cut is indeed lower bounded by $B_F^*$, so a file of size $B_F^*$ can be supported. This proves the tightness of bound (9.8).
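As a numerical sanity check, the bound $B_F^*$ of (9.8) can be evaluated directly from the code parameters, with $a$ and $b$ obtained from $m - \ell = at + b$, $b \in [0, t-1]$, as used in the proof. The parameter values below are hypothetical, chosen only to exercise both the $b > 0$ and $b = 0$ cases.

```python
def file_size_bound_FR(n, k, d, alpha, beta, m, l, t):
    """Evaluate B*_F from (9.8):
    l*k*alpha + a*sum_i min(t*alpha,(d-i)*beta) + sum_i min(b*alpha,(d-i)*beta),
    where m - l = a*t + b with 0 <= b < t (n is carried for signature fidelity)."""
    a, b = divmod(m - l, t)
    s1 = sum(min(t * alpha, (d - i) * beta) for i in range(k))
    s2 = sum(min(b * alpha, (d - i) * beta) for i in range(k))
    return l * k * alpha + a * s1 + s2

# Hypothetical parameters with t not dividing m - l (so b > 0, as in Lemma 4.2.2):
print(file_size_bound_FR(n=6, k=4, d=5, alpha=4, beta=3, m=7, l=2, t=2))  # -> 108
```

When $t \mid (m-\ell)$, the last summation vanishes ($b = 0$), and the bound reduces to $\ell k \alpha + a \sum_i \min(t\alpha, (d-i)\beta)$.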
9.3 Matroid Lemma 6.3.1
Matroid Lemma. For $q \to \infty$ and $a = b$, $\operatorname{rank} W^t|_S = n_S a$, where $n_S \ge 1$ is an integer, w.p.a.c. 1, $\forall S \subseteq [n]$, $S \ne \emptyset$, $\forall t$.
Proof. (By induction on $t$.) At $t = 0$, $W^0 = I_{na}$; all columns are independent and $\operatorname{rank} W^0|_S = a|S|$ for all $S$.
Let us assume that the statement is true for some $t \ge 0$. At the next iteration, a node $i$ fails and is repaired from a helper set $H$. The matrix $W^{t+1}$ differs from $W^t$ only in the columns corresponding to node $i$, so the statement can be violated only for sets of nodes containing $i$. Consider such a set $S \ni i$. By the column evolution equation (6.1),
$$\operatorname{rank} W^{t+1}|_S = \operatorname{rank}\left[\; W^t|_{S-i} \quad \sum_{j \in H} W^t|_j D_j \;\right] \tag{9.11}$$
$$= \operatorname{rank}\left[\; W^t|_{S-i} \quad \sum_{j \in H - \overline{(S-i)}} W^t|_j D_j \;\right], \tag{9.12}$$
where $D_j = D^{H_{t+1}}(j)\, D^{R_{t+1}}(j) \in \mathbb{F}_q^{a \times a}$ is full rank w.p.a.c. 1, and $\overline{(S-i)}$ is the linear "closure" of $S - i$,
$$\overline{(S-i)} = \left\{ j \in [n] : \operatorname{rank}\left[\, W^t|_{S-i} \;\; W^t|_j \,\right] = \operatorname{rank} W^t|_{S-i} \right\}, \tag{9.13}$$
i.e. the set of all nodes whose packets all lie in the column span of the packets on the nodes in $S - i$. If $H - \overline{(S-i)} = \emptyset$, the set $S - i$ is non-empty, and $\operatorname{rank} W^{t+1}|_S = \operatorname{rank} W^t|_{S-i}$, which is a non-zero multiple of $a$ by the inductive assumption. If $H - \overline{(S-i)}$ is non-empty, note that $K \triangleq \operatorname{colspan} \sum_{j \in H - \overline{(S-i)}} W^t|_j D_j$ is a random subspace of $V \triangleq \bigoplus_{j \in H - \overline{(S-i)}} \operatorname{colspan} W^t|_j \subseteq \mathbb{F}_q^{na}$ of dimension at least
$$\dim K \ge \min_{j \in H - \overline{(S-i)}} \operatorname{rank} W^t|_j D_j = a, \quad \text{w.p.a.c. 1}, \tag{9.14}$$
since $\operatorname{rank} D_j = a$ w.p. 1 and $\operatorname{rank} W^t|_j = a$ by the inductive assumption. Let $U = \operatorname{colspan} W^t|_{S-i} \subseteq \mathbb{F}_q^{na}$. The subspace $K$ can be thought of as the span of $a$ random vectors $\mathbf{v}_1, \ldots, \mathbf{v}_a \in V$. The probability that the intersection of $K$ with $U$ is trivial satisfies
$$\begin{aligned}
\Pr[K \cap U = 0] &= \Pr[\mathbf{v}_1 \notin V \cap U] \cdot \Pr[\mathbf{v}_2 \notin (V \cap U) \oplus \operatorname{span}(\mathbf{v}_1) \text{ or } \mathbf{v}_2 \in \operatorname{span}(\mathbf{v}_1) \mid \mathbf{v}_1 \notin V \cap U] \cdots \\
&\ge \Pr[\mathbf{v}_1 \notin V \cap U] \cdot \Pr[\mathbf{v}_2 \notin (V \cap U) \oplus \operatorname{span}(\mathbf{v}_1) \mid \mathbf{v}_1 \notin V \cap U] \cdots \\
&= \bigl(1 - q^{\dim V \cap U - \dim V}\bigr)\bigl(1 - q^{\dim V \cap U + 1 - \dim V}\bigr) \cdots \bigl(1 - q^{\dim V \cap U + a - 1 - \dim V}\bigr).
\end{aligned}$$
Since $H - \overline{(S-i)} \ne \emptyset$, there is $j \in H$ with $\operatorname{colspan} W^t|_j \cap \operatorname{colspan} W^t|_{S-i} = 0$, and hence $\dim V \ge \dim V \cap U + a$. Therefore,
$$\Pr[K \cap U = 0] \ge (1 - q^{-a})(1 - q^{-a+1}) \cdots (1 - q^{-1}) \xrightarrow{q \to \infty} 1.$$
As a result,
$$\operatorname{rank} W^{t+1}|_S = \operatorname{rank}\left[\; W^t|_{S-i} \quad \sum_{j \in H - \overline{(S-i)}} W^t|_j D_j \;\right] = \operatorname{rank} W^t|_{S-i} + a,$$
and the inductive statement holds for $t + 1$.
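The pivotal step above — a random $a$-dimensional span $K$ almost surely meets a fixed subspace $U$ only in $0$ once $\dim V \ge \dim(V \cap U) + a$ — can be checked empirically over small prime fields. This is an illustrative sketch under simplifying assumptions (the ambient space plays the role of $V$, $U$ is a coordinate subspace, and the plain Gaussian-elimination helper is ours, not the thesis's code); the trend toward 1 as $q$ grows is what matters.

```python
import random

def rank_mod_q(rows, q):
    """Row-reduce a list of row vectors over the prime field F_q; return the rank."""
    rows, rank, col = [r[:] for r in rows], 0, 0
    while rank < len(rows) and col < len(rows[0]):
        piv = next((i for i in range(rank, len(rows)) if rows[i][col] % q), None)
        if piv is None:
            col += 1
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][col], -1, q)          # modular inverse (Python 3.8+)
        rows[rank] = [x * inv % q for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % q:
                f = rows[i][col]
                rows[i] = [(x - f * y) % q for x, y in zip(rows[i], rows[rank])]
        rank, col = rank + 1, col + 1
    return rank

def trivial_intersection_rate(q, N, u, a, trials=2000, seed=0):
    """Fraction of trials in which a random a-dim span K meets the fixed u-dim
    coordinate subspace U only in 0 (checked via dim(U + K) = u + a)."""
    rng = random.Random(seed)
    U = [[1 if j == i else 0 for j in range(N)] for i in range(u)]
    hits = 0
    for _ in range(trials):
        K = [[rng.randrange(q) for _ in range(N)] for _ in range(a)]
        hits += rank_mod_q(U + K, q) == u + a
    return hits / trials

# Pr[K ∩ U = 0] >= (1 - q^{u-N}) ... (1 - q^{u+a-1-N}): approaches 1 as q grows.
for q in (2, 17, 101):
    print(q, trivial_intersection_rate(q, N=6, u=3, a=2))
```

For $q = 2$ the trivial-intersection rate is noticeably below 1, while for $q = 101$ it is already indistinguishable from 1 at this sample size, mirroring the $q \to \infty$ limit in the proof.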
9.4 Matrix Addition Lemma 6.4.2
Matrix Addition Lemma. Let $A \in \mathbb{F}_q^{m \times n}$, $m \le n$, be a full-rank matrix with rows $\mathbf{a}_1, \ldots, \mathbf{a}_m$. Let $\mathbf{u}, \mathbf{v} \in \mathbb{F}_q^n$ be arbitrary vectors, and let $d' \in [0, m-1]$ be an integer. Let $A' \in \mathbb{F}_q^{m \times n}$ be an additively transformed matrix with rows $\mathbf{a}'_1, \ldots, \mathbf{a}'_m$, such that
$$\mathbf{a}'_i = \begin{cases} \mathbf{a}_i + \alpha_i \mathbf{u}, & \text{if } i \in [1, d'] \\ \mathbf{a}_i, & \text{if } i \in [d'+1, m-1] \\ \beta_m \mathbf{a}_m + \mathbf{v}, & \text{if } i = m, \end{cases} \tag{9.15}$$
where $\alpha_i, \beta_m$ are random scalars sampled uniformly i.i.d. from $\mathbb{F}_q$. Then $\lim_{q \to \infty} \Pr[\operatorname{rank} A' = m] = 1$, i.e. $A'$ is full-rank w.p.a.c. 1 in the limit of infinite field size.
Proof. Let $A|_{[m-1]}$, $A'|_{[m-1]}$ be the submatrices composed of the first $m-1$ rows of $A$, $A'$, respectively. Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{d'}) \in \mathbb{F}_q^{d'}$, and let $S = \{\boldsymbol{\alpha} : A'|_{[m-1]} \text{ is not full-rank}\}$. Since $A'|_{[m-1]}$ is a linear function of $\boldsymbol{\alpha}$, $S$ is an affine subspace of $\mathbb{F}_q^{d'}$. Since $A|_{[m-1]}$ was full-rank before the additive transformation, the zero vector $\mathbf{0} \notin S$, and therefore $\dim S < d'$. Thus, $\Pr[A'|_{[m-1]} \text{ full-rank}] = \Pr[\boldsymbol{\alpha} \notin S] = 1 - |S|/|\mathbb{F}_q^{d'}| = 1 - q^{\dim S - d'} \to 1$ as $q \to \infty$.

Let $S' = \{\boldsymbol{\alpha} : \mathbf{a}_m \in \operatorname{rowspan} A'|_{[m-1]}\}$; $S'$ is also an affine subspace of $\mathbb{F}_q^{d'}$. Since $A$ was full-rank before the transformation, $\mathbf{a}_m \notin \operatorname{rowspan} A|_{[m-1]}$, so $\mathbf{0} \notin S'$ and $\dim S' < d'$. Thus, $\Pr[A'|_{[m-1]} \text{ full-rank and } \mathbf{a}_m \notin \operatorname{rowspan} A'|_{[m-1]}] = \Pr[\boldsymbol{\alpha} \notin S \cup S'] \ge 1 - (|S| + |S'|)/q^{d'} \to 1$ as $q \to \infty$.
Conditioned on $A'|_{[m-1]}$ being full-rank and $\mathbf{a}_m \notin \operatorname{rowspan} A'|_{[m-1]}$, the row $\mathbf{a}'_m = \beta_m \mathbf{a}_m + \mathbf{v}$ can lie in $\operatorname{rowspan} A'|_{[m-1]}$ for at most one value of $\beta_m$: otherwise, for two corresponding values of $\mathbf{a}'_m$, their difference would be a non-zero multiple of $\mathbf{a}_m$ lying in $\operatorname{rowspan} A'|_{[m-1]}$, which contradicts $\mathbf{a}_m \notin \operatorname{rowspan} A'|_{[m-1]}$. Therefore, under this condition, $\Pr[A' \text{ full-rank}] = \Pr[\mathbf{a}'_m \notin \operatorname{rowspan} A'|_{[m-1]}] \ge 1 - 1/q \to 1$. Since the condition itself holds w.p. $\to 1$, the unconditional $\Pr[A' \text{ full-rank}] \to 1$ as $q \to \infty$.
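The lemma can likewise be checked by simulation: draw a random full-rank $A$ over a prime field, apply the additive transformation (9.15) with fresh random $\alpha_i$, $\beta_m$, and measure how often $A'$ remains full-rank. The helper routines and parameter choices below are illustrative assumptions, not code from the thesis.

```python
import random

def rank_mod_q(rows, q):
    """Gaussian elimination over the prime field F_q; returns the matrix rank."""
    rows, rank, col = [r[:] for r in rows], 0, 0
    while rank < len(rows) and col < len(rows[0]):
        piv = next((i for i in range(rank, len(rows)) if rows[i][col] % q), None)
        if piv is None:
            col += 1
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][col], -1, q)          # modular inverse (Python 3.8+)
        rows[rank] = [x * inv % q for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % q:
                f = rows[i][col]
                rows[i] = [(x - f * y) % q for x, y in zip(rows[i], rows[rank])]
        rank, col = rank + 1, col + 1
    return rank

def transform(A, u, v, d_prime, q, rng):
    """Apply (9.15): add a random multiple of u to each of the first d' rows,
    and replace the last row by beta_m * a_m + v."""
    m, Ap = len(A), [row[:] for row in A]
    for i in range(d_prime):
        alpha_i = rng.randrange(q)
        Ap[i] = [(x + alpha_i * y) % q for x, y in zip(A[i], u)]
    beta_m = rng.randrange(q)
    Ap[m - 1] = [(beta_m * x + y) % q for x, y in zip(A[m - 1], v)]
    return Ap

def full_rank_rate(q, m, n, d_prime, trials=500, seed=1):
    """Empirical Pr[rank A' = m] over random full-rank A and arbitrary u, v."""
    rng = random.Random(seed)
    rate = 0
    for _ in range(trials):
        while True:   # rejection-sample a full-rank A
            A = [[rng.randrange(q) for _ in range(n)] for _ in range(m)]
            if rank_mod_q(A, q) == m:
                break
        u = [rng.randrange(q) for _ in range(n)]
        v = [rng.randrange(q) for _ in range(n)]
        rate += rank_mod_q(transform(A, u, v, d_prime, q, rng), q) == m
    return rate / trials

for q in (2, 101):
    print(q, full_rank_rate(q, m=4, n=6, d_prime=2))
```

Consistent with the proof's bound of roughly $1 - O(1/q)$, the full-rank rate is already very close to 1 for a moderate prime such as $q = 101$, while small fields like $q = 2$ show visible rank loss.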