Dist Storage Presentation Final

Upload: excoder21

Post on 05-Apr-2018


TRANSCRIPT

  • 7/31/2019 Dist Storage Presentation Final

    1/25

Reliable distributed data storage systems have to employ redundancy codes to tolerate the loss of storage units. Many appropriate codes and algorithms can be found in the literature, but efficient schemes for tolerating several storage failures, and their embedding in a distributed system, are still research issues. In this paper, a variety of redundancy schemes that were implemented in a distributed storage system are compared.

I. INTRODUCTION

The growth in the volume of information handled by modern applications, the falling price of storage units, and the rapid improvement in network speeds have accelerated the research endeavor in distributed storage systems.

These storage systems guarantee high availability of data in the presence of machine failures. The distribution of units can be at various levels; they could be geographically separated nodes connected via the Internet, or nodes distributed over a LAN, or even an array of disks in a RAID-like [6] architecture.

Irrespective of the scale of distribution, the key principle that enables high availability (or fault tolerance) is the redundancy of information across different storage units. The cost of storage has decreased drastically over the years. To harness the ever-growing capacity and decreasing cost of distributed storage, a number of challenges need to be addressed:

(i) volatility of storage availability due to network (dis)connectivity, varying administrative restrictions or user preferences, and nodal mobility (of mobile devices);

(ii) (partial) failures of storage devices. For example, flash media are known to be engineered to trade off error probabilities for cost reduction;


(iii) software bugs or malicious attacks, where an adversary manages to compromise a node and causes it to misbehave.

To ensure availability despite failure or compromise of storage nodes, survivable storage systems spread data redundantly across a set of distributed storage nodes.

At the core of a survivable storage system is a coding scheme that maps information bits to stored bits, and vice versa. Without loss of generality, we call the unit of such a mapping a symbol. A (k, n) coding is defined by the following two primitives:


- encode: c = encode(u, k, n) returns a coded vector c = [c0, c1, ..., c(n−1)] of length n from k information symbols u = [u0, u1, ..., u(k−1)]. The coded symbols can be stored on separate storage nodes.
- decode: u = decode(r, k, n) accesses a subset of storage nodes and returns the information symbols from possibly corrupted symbols.
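As a concrete illustration of these two primitives, the sketch below instantiates them with plain replication, the simplest redundancy scheme (the function names and the round-robin placement are our own choices, not from the paper):

```python
def encode(u, k, n):
    """Replication-based (k, n) encode: spread k information symbols
    over n storage nodes by repeating them round-robin."""
    assert len(u) == k and n >= k
    return [u[i % k] for i in range(n)]

def decode(r, k, n):
    """Replication-based decode. r maps surviving node index -> symbol.
    Succeeds if at least one replica of every information symbol survives."""
    u = [None] * k
    for node, sym in r.items():
        u[node % k] = sym
    if any(s is None for s in u):
        raise ValueError("too many erasures: some symbol lost all replicas")
    return u

# A (2, 4) replication code: each of 2 symbols is stored twice.
c = encode(["a", "b"], 2, 4)
survivors = {1: c[1], 2: c[2]}        # nodes 0 and 3 failed
assert decode(survivors, 2, 4) == ["a", "b"]
```

Replication makes the n/k storage overhead of the primitives easy to see: tolerating f failures of a given symbol requires f + 1 full copies of it.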

Many existing approaches to survivable storage assume crash-stop behaviors, i.e., a storage device becomes unavailable if it fails (also called an erasure).

Thus, typically low-complexity (replication or XOR-based) coding mechanisms are employed to recover from a limited degree of erasures.

For fixed error correction capability, the efficiency of the encode and decode primitives can be evaluated by three metrics:
i) storage overhead, measured as the ratio between the number of stored symbols and information symbols (n/k);
ii) encoding and decoding computation time; and
iii) communication overhead, measured in the number of bits transferred in the network for encode and decode. Communication overhead is of particular importance in wide-area and/or low-bandwidth storage systems.


    In storage systems, ensuring reliability requires the introduction of redundancy.

A file is divided into k pieces, encoded into n coded pieces, and stored at n nodes. One important metric of coding efficiency is the redundancy-reliability tradeoff, defined as n/k.

The simplest form of redundancy is replication. As a generalization of replication, erasure coding offers better storage efficiency. Maximum Distance Separable (MDS) codes, discussed below, achieve the best possible tradeoff between redundancy and the number of failures tolerated.


Strategies for information protection fall into two categories, viz., (i) pure replication-based methods and (ii) transformation-based techniques.

In a replication-based strategy, data is replicated several times and the replicas are stored on different servers (or storage units). In the retrieval phase, a certain number of these replicas are accessed and compared in order to obtain the original document. The overhead in this scheme is that of byte copying during the write phase and that of the comparison cost during the read phase. However, this simple design leads to very high space complexity. Additionally, to ensure confidentiality of data, documents must be encrypted, thus adding the encryption/decryption costs to the write/read latencies.

A transformation-based scheme can be viewed as a mapping from a lower-dimensional space to a higher-dimensional one. A document of length m is inflated in size to length n (n > m). The inflated document is then split into multiple pieces and each piece is stored on one of the storage units.

The original document can be reconstructed even if some of the pieces are missing. It is also known that this scheme is in general highly space optimal, requiring minimal redundancy to enable a certain degree of availability. However, this scheme no longer remains secure when deployed over a completely untrusted set of servers, i.e., when all of them are allowed to collude to extract the document. Encryption has to be added to safeguard the data, increasing the associated access cost.

A variant of the transformation-based approach uses error correction codes (ECC). ECC-based techniques provide redundancy in a space-optimal way, leading to a space-optimal design for reliability.

Storage systems are quickly growing in size through the use of more and bigger disks, and through distribution over a network. With larger systems, the chance of component failure also increases, so techniques to protect data become more important. Single parity as used in RAID systems no longer provides sufficient protection in all cases [1], and k-way replication is much too wasteful in storage space, even for small k. Therefore, new schemes are needed to protect data against multiple failures in a distributed storage system.

Erasure codes [2] have been used traditionally in communication systems, and more recently in storage systems as an alternative to replication.

Erasure Code:

In information theory, an erasure code is a forward error correction (FEC) code for the binary erasure channel, which transforms a message of k symbols into a longer message (code word) with n symbols such that the original message can be recovered from a subset of the n symbols. Erasure codes provide space-optimal data redundancy to protect against data loss. A common use is to reliably store data in a distributed system, where erasure-coded data are kept on different nodes to tolerate node failures without losing data.


    MDS Array codes

Reliability of a storage system is often achieved by storing redundant data in the system using error control codes. Erasure-correcting codes are used since device failures can be marked as erasures. To make redundant data most effective, the code should have the MDS property.

Additionally, the code should have simple encoding and decoding operations so that the computational overhead can be reduced to a minimum. The well-known Reed-Solomon codes are a class of powerful MDS codes, but their encoding and decoding need rather complicated finite field operations. It is useful to design codes that have both the MDS property and simple encoding and decoding operations. MDS array codes are a class of error correcting codes with both properties.

Array codes have been studied extensively. A common property of these codes is that the encoding and decoding procedures use only simple XOR operations. Array codes are defined over an Abelian group with addition operation +. For simplicity we assume the code is binary, i.e., defined over GF(2), so that addition is just a simple bitwise XOR. In a distributed storage system, the bits in the same column are stored on the same disk. If a disk fails, then the corresponding column of the code is considered to be an erasure.
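As a toy illustration of this column-per-disk model (our own sketch, not a construction from the text): one parity disk stores the row-wise XOR of the information columns, and any single failed disk can be rebuilt by XORing the surviving columns row by row.

```python
from functools import reduce

def encode_array(info_columns):
    """Append one parity column: the bitwise XOR of all information columns.
    Each column is a list of bits and is stored on its own disk."""
    parity = [reduce(lambda a, b: a ^ b, bits) for bits in zip(*info_columns)]
    return info_columns + [parity]

def rebuild_column(columns, failed):
    """Recover the single erased column `failed` by XORing, row by row,
    the corresponding bits of all surviving columns."""
    surviving = [c for i, c in enumerate(columns) if i != failed]
    return [reduce(lambda a, b: a ^ b, bits) for bits in zip(*surviving)]

disks = encode_array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])  # 3 data + 1 parity disk
lost = disks[1]
assert rebuild_column(disks, 1) == lost   # rebuild disk 1 from the other three
```

This single-parity array tolerates only one erased column; the MDS array codes discussed below generalize the idea to multiple erasures.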

    1.3. Strategies for information protection


    2. Erasure coding for distributed storage

Erasure recovery is necessary when the failure of a communications channel or storage device prevents the direct retrieval of a previously stored data file. Although such failures are often caused by inherent properties of the device or channel, they may also be the result of malicious or careless actions.

In general, solutions to this longstanding problem involve the construction of a larger file with explicit or implicit redundancy of information. After an erasure (i.e., the loss of a bit), it is this redundancy which enables successful data recovery. One common, simple approach is to create one or more copies of the original data. This results in a level of added protection far below that of alternative methods employing the same amount of redundancy. Figure 3 shows an example of encoding data for distributed storage. A file is broken up into a number of fragments (step 2) and encoded (step 3) as described in the next section. A graph of parity and data nodes is created. Each parity node is computed by performing the exclusive OR of its associated data nodes; the fragments are then distributed to different storage nodes over the network (steps 6 through 9).

    2.1. Related work

Erasure codes have been applied to distributed storage in the OceanStore project at the University of California, Berkeley [3].

    1 Introduction


    2 Erasure codes for distributed networks

Erasure codes are a class of FEC codes, i.e., no retransmission is required when they are employed. An erasure is a corrupted bit or symbol (packet) with an unknown value, but whose location in the codeword is known to the decoder. An erasure code is designed to recover or correct erasures, rather than to correct errors, from the encoded bits or packets correctly received. In a packet-based network, an (n, k) erasure code consists of k information packets and n − k parity packets over a finite field GF(q) (q is a power of a prime). Using a simple taxonomy, the construction of erasure codes can be classified into two categories: the maximum-distance-separable (MDS) code approach and the sparse-graph code approach.


MDS codes are optimal in the sense that they have the maximum possible minimum distance for the given code length and dimension. Array codes naturally suit packet-based systems in terms of data representation and processing. However, the MDS array codes currently available are mostly designed for storage and have limited erasure-control capacity.

MDS array codes are widely used in storage systems to protect data against erasures. We address the rebuilding ratio problem: in the case of erasures, what is the fraction of the remaining information that needs to be accessed in order to rebuild exactly the lost information? It is clear that when the number of erasures equals the maximum number of erasures that an MDS code can correct, the rebuilding ratio is 1 (all the remaining information is accessed). Maximum distance separability is one of the desirable features of linear block codes, achieving the maximum possible minimum distance for fixed n and k, i.e.,

d = n − k + 1.

This result meets the Singleton bound [31], d ≤ n − k + 1, with equality.

Two trivial examples of MDS codes are the (n, k, d) = (n, n − 1, 2) single-parity-check code and the (n, 1, n) repetition code.

The former has very limited error-control capacity though it requires only minimal redundancy, while the latter is equivalent to retransmission of the same data, an error-control mechanism used in ARQ schemes, and is thus inefficient in terms of bandwidth usage and throughput. Most non-trivial MDS codes are non-binary codes [32].


    3 Construction of a class of MDS array erasure codes

    1. Introduction

An (n, k) MDS erasure code, or simply k-of-n code, encodes k blocks of data into n > k blocks, which we call a stripe, such that any k blocks in the stripe can recover the original k blocks. By storing each block on a separate node, data are protected against the simultaneous failure of up to n − k nodes.

It is well known that an (n, k) code can be used to store k units of information on n unit-capacity disks of a distributed data storage system. If the code used is maximum distance separable (MDS), then the system can tolerate any n − k disk failures, since the original information can be recovered from any k surviving disks.


    I. INTRODUCTION

Erasure-correcting codes are the basis of the ubiquitous RAID schemes for storage systems, where disks correspond to symbols in the code. Specifically, RAID schemes are based on MDS (maximum distance separable) array codes that enable optimal storage and efficient encoding and decoding algorithms. With r redundancy symbols, an MDS code is able to reconstruct the original information if no more than r symbols are erased. An array code is a two-dimensional array, where each column corresponds to a symbol in the code and is stored on a disk in the RAID scheme. We refer to a disk/symbol as a node or a column interchangeably, and to an entry in the array as an element. Examples of MDS array codes are EVENODD [1], [2], B-code [3], X-code [4], RDP [5], and STAR-code [6].

Suppose that some nodes are erased in an MDS array code; we rebuild them by accessing (reading) some information in the surviving nodes. The fraction of the accessed information in the surviving nodes is called the rebuilding ratio. If r nodes are erased, then the rebuilding ratio is 1, since we need to read all the remaining information. However, is it possible to lower this ratio for fewer than r erasures?

For example, Figure 1 shows the rebuilding of the first systematic (information) node for an MDS code with 4 information elements and 2 redundancy nodes, which requires the transmission of 3 elements. Thus the rebuilding ratio is 1/2. In [7], [8], a related problem is discussed: the nodes are assumed to be distributed and fully connected in a network, and the repair bandwidth is defined as the minimum amount of data that needs to be transmitted in the network in order to retain the MDS property. Note that one block of data transmitted can be a function of several blocks of data. In addition, retaining the MDS property does not imply rebuilding the original erased node, whereas we restrict our problem to exact rebuilding. Therefore, the repair bandwidth is a lower bound on the rebuilding ratio.

The parity symbols are constructed as linear combinations of a set of information symbols, such that each information symbol is contained exactly once in each parity node. These codes have a variety of advantages: 1) they are systematic codes, and it is easy to retrieve information; 2) they have a high code rate k/n, which is commonly required in storage systems; 3) the encoding and decoding of the codes can be easily implemented (for r = 2, the code uses a finite field of size 3); 4) they match the lower bound (1) when rebuilding a systematic node; 5) the rebuilding of a failed node requires simple computation and access to only 1/(n − k) of the data in each node (no linear combination of data); and 6) they have optimal update, namely, when an information element is updated, only n − k + 1 elements in the array need updating.

    The Repair Problem

Consider the very simple (n = 3, k = 2) binary MDS code shown. This is a single parity code (as used in RAID 5): one disk storing the parity of the others.


Clearly, this code has the property that it can tolerate any single node failure. This means that even after one node fails, a data collector (shown as a laptop) can collect the information from the remaining two nodes and reconstruct the file. This is shown here:
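A minimal sketch of this (3, 2) single-parity code (our own illustration; blocks are modeled as equal-length byte strings):

```python
def xor_bytes(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(a, b):
    """Split a file into blocks a and b; the third node stores their parity."""
    return [a, b, xor_bytes(a, b)]

def reconstruct(nodes):
    """nodes: dict index -> block for any 2 surviving nodes of {0, 1, 2}.
    Returns the original pair (a, b)."""
    if 0 in nodes and 1 in nodes:
        return nodes[0], nodes[1]
    if 0 in nodes:                                    # node 1 failed
        return nodes[0], xor_bytes(nodes[0], nodes[2])
    return xor_bytes(nodes[1], nodes[2]), nodes[1]    # node 0 failed

a, b = b"hell", b"o!!!"
stored = encode(a, b)
for failed in range(3):
    survivors = {i: blk for i, blk in enumerate(stored) if i != failed}
    assert reconstruct(survivors) == (a, b)
```

Whichever single node fails, the data collector contacts the remaining two and XORs where needed.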


EVENODD Codes [On Optimizing XOR-Based Codes for Fault-Tolerant Storage Applications]

    I. INTRODUCTION

Erasure correcting codes are often adopted by storage applications to provide fault tolerance [5]. For such applications, encoding and decoding complexity is the key concern in determining which codes to use. XOR-based codes use pure XOR operations during coding computation, which makes implementation most efficient in both hardware and software. Hence, such codes are highly desirable.

    XOR-based codes can be implemented by transforming from any existing codes,which originally could be defined in finite fields [3].

Since the complexity of XOR-based codes is solely determined by the total number of XOR operations in encoding or decoding, we make a simple yet key observation that common XOR operations should be computed first (the COF rule). Based on the COF rule, we can optimize arbitrary XOR-based codes (not just Reed-Solomon codes). We describe the optimization problem as finding a computation path which computes all required outputs and at the same time minimizes the total number of XOR operations. We relate the problem of optimizing XOR-based codes (OXC for short) to a known NP-complete problem and conjecture that the current problem is also NP-complete.

    A. EVENODD: an example

EVENODD codes [1] are probably the most widely referenced XOR-based codes in fault-tolerant storage applications. Many other schemes adopt a similar concept, where data blocks are arranged in a two-dimensional array and XOR is the only required operation. Such schemes are often referred to as array codes. Low complexity is the key advantage of array codes, which is especially desirable for storage applications. Below we give a simple example of EVENODD codes.

    1) EVENODD encoding:

We examine a (5, 3) EVENODD code. There are 3 data blocks (k = 3) and 2 redundant blocks (r = 2). An EVENODD code is in the form of a (p − 1) × (p + 2) two-dimensional array, where p is a prime number. Hence, each block is segmented into (p − 1) cells.

Figure 1 shows this particular EVENODD code, where p = 3 and each block (corresponding to one column in the figure) is segmented into 2 cells. The encoding is straightforward. The first redundant block is simply the XOR of all the data blocks. In terms of cells, this can be represented as (using + as a simple notation for XOR)

Information bits are placed in the first n − 2 columns and the parity bits are placed in the last 2 columns. Notice that the parity columns can be computed independently. One important parameter of array codes is the average number of parity bits affected by a change of a single information bit; this parameter is called the update complexity.

The update complexity is particularly crucial when the codes are used in storage applications that update information frequently.

It also measures the encoding complexity of the code. The lower this parameter is, the simpler the encoding becomes. If a code is described by a parity check matrix, then this parameter is the average row density, i.e., the number of non-zero entries in a row of the parity check matrix. Research has been done to reduce this parameter, i.e., to make the density of the parity check matrix of codes as low as possible.

The update complexity of EVENODD codes approaches 2 as the length (number of columns) of the code increases.


c1 = d1 + d3 + d5,
c2 = d2 + d4 + d6,

which can be regarded as computing horizontal parities. The second redundant block can be computed as

S = d4 + d5,
c3 = d1 + d6 + S,
c4 = d2 + d3 + S,

which can be regarded as computing diagonal parities (S is called the adjustor). It is easy to count that the total number of XORs is 9.

    2) EVENODD decoding:

EVENODD codes guarantee recoverability when there are no more than two block failures (i.e., two columns completely wiped out). For instance, we examine a particular failure pattern, where the second and the third data blocks are unavailable.

The decoding turns out to be straightforward as well. Using all the remaining parity blocks, the adjustor can first be computed as

    S = c1 + c2 + c3 + c4.

Once S is known, d6 can be computed as d6 = c3 + d1 + S. Then, d4 can be computed as d4 = c2 + d2 + d6. Next, d5 can be computed as d5 = d4 + S. And finally, d3 = d1 + d5 + c1.

The decoding process is completed and all failed blocks are recovered, as shown in Figure 1(b). The total number of XORs is 10.
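The decoding chain above, written out as code (bit-valued cells; the round-trip check below re-derives the parity cells from the encoding equations of the previous subsection):

```python
def evenodd_decode(d1, d2, c1, c2, c3, c4):
    """Recover cells d3..d6 of the (5, 3) EVENODD code when the second and
    third data blocks are erased. Uses the 10-XOR chain from the text."""
    S = c1 ^ c2 ^ c3 ^ c4    # recover the adjustor from the parity cells
    d6 = c3 ^ d1 ^ S
    d4 = c2 ^ d2 ^ d6
    d5 = d4 ^ S
    d3 = d1 ^ d5 ^ c1
    return d3, d4, d5, d6

# Round-trip check for one data pattern:
d1, d2, d3, d4, d5, d6 = 1, 0, 1, 1, 0, 1
c1, c2 = d1 ^ d3 ^ d5, d2 ^ d4 ^ d6
S = d4 ^ d5
c3, c4 = d1 ^ d6 ^ S, d2 ^ d3 ^ S
assert evenodd_decode(d1, d2, c1, c2, c3, c4) == (d3, d4, d5, d6)
```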

B. EVENODD: a matrix perspective

The encoding and decoding of linear block codes can be represented in matrix form. Here, we use the same EVENODD code example to illustrate.

1) encoding with COF: Denote data cells as D = [d1 d2 d3 d4 d5 d6] and parity cells as C = [c1 c2 c3 c4].

Then, the encoding can be represented as C = D · Me, where the encoding matrix Me (rows indexed by d1..d6, columns by c1..c4) follows from the parity equations:

Me =
[1 0 1 0]
[0 1 0 1]
[1 0 0 1]
[0 1 1 1]
[1 0 1 1]
[0 1 1 0]

Note that Me represents a portion of the code's generator matrix. For systematic codes, it is convenient to ignore the remaining systematic part. Given the encoding matrix, a naive approach to computing the redundant blocks is to XOR data cells wherever the encoding matrix has non-zero entries. For example, c1 = d1 + d3 + d5, c3 = d1 + d4 + d5 + d6, and so on. In this way, counting the total number of non-zero entries yields the encoding complexity. Hence, we might conclude that 10 XORs are required (note that three 1s in one column count for 2 XORs). However, if we are slightly more careful, we will observe that some XORs are computed more than once. Indeed, if the EVENODD encoding is mapped onto the matrix representation, it is equivalent to computing d4 + d5 only once (the calculation of the adjustor), which saves 1 XOR and exactly accounts for the difference between the matrix-based naive approach and the original EVENODD encoding.
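This matrix view can be checked mechanically. Below, Me is reconstructed by us from the parity equations (rows d1..d6, columns c1..c4), and the naive GF(2) vector-matrix product is compared against the direct EVENODD equations for every data pattern:

```python
# Encoding matrix Me over GF(2): entry [i][j] = 1 iff data cell d(i+1)
# participates in parity cell c(j+1).
Me = [
    [1, 0, 1, 0],  # d1 appears in c1 and c3
    [0, 1, 0, 1],  # d2 appears in c2 and c4
    [1, 0, 0, 1],  # d3 appears in c1 and c4
    [0, 1, 1, 1],  # d4 appears in c2, c3 and c4
    [1, 0, 1, 1],  # d5 appears in c1, c3 and c4
    [0, 1, 1, 0],  # d6 appears in c2 and c3
]

def matrix_encode(D, Me):
    """Naive GF(2) vector-matrix product C = D * Me (XOR where Me has a 1)."""
    C = [0] * 4
    for i, d in enumerate(D):
        for j in range(4):
            if Me[i][j]:
                C[j] ^= d
    return C

# Agrees with the direct EVENODD equations for all 64 data patterns:
for bits in range(64):
    D = [(bits >> i) & 1 for i in range(6)]
    d1, d2, d3, d4, d5, d6 = D
    S = d4 ^ d5
    assert matrix_encode(D, Me) == [d1 ^ d3 ^ d5, d2 ^ d4 ^ d6,
                                    d1 ^ d6 ^ S, d2 ^ d3 ^ S]
```

Counting XORs column by column (ones per column minus one) gives 2 + 2 + 3 + 3 = 10, the naive cost quoted in the text.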

Figure 2(a) illustrates this. Now, an interesting question to ask is: can we find more shared XORs, which can be computed once and in turn further reduce the total number of operations? Indeed, we observe that d2 + d4 (denoted as d2,4) and d3 + d5 (denoted as d3,5) are shared XORs, as shown in Figure 2(b). If we adopt a simple rule to compute such common operations first (COF), d2,4 and d3,5 will be computed first. Then, c1 = d1 + d3,5, c2 = d2,4 + d6, c3 = d1 + d4 + d5 + d6 (as usual), and c4 = d2,4 + d3,5. The total number of XORs is 8, less than the original EVENODD encoding.

  • 7/31/2019 Dist Storage Presentation Final

    15/25

2) decoding with COF: We consider the same failure pattern, where the second and third data blocks are unavailable (i.e., cells d3, d4, d5 and d6 are erasures). It is straightforward to derive decoding equations from the encoding matrix Me (essentially performing matrix inversion) and obtain D = C · Md, where D = [d3 d4 d5 d6], C = [d1 d2 c1 c2 c3 c4], and the decoding matrix Md (rows indexed by d1, d2, c1..c4, columns by d3..d6) is:

Md =
[0 1 1 1]
[1 1 1 0]
[1 1 0 1]
[1 0 1 1]
[1 0 1 0]
[0 1 0 1]
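The COF decoding for this failure pattern can be sketched as follows (the closed-form decoding equations are our own derivation from the parity definitions; the shared terms d1 + d2, c1 + c4 and c2 + c3 are each computed once, giving 9 XORs versus 12 for the naive entry-by-entry evaluation of Md):

```python
def evenodd_decode_cof(d1, d2, c1, c2, c3, c4):
    """Recover d3..d6 with the COF rule: 3 shared XORs plus 6 more = 9 XORs."""
    t12 = d1 ^ d2            # shared by d4 and d5
    t14 = c1 ^ c4            # shared by d4 and d6
    t23 = c2 ^ c3            # shared by d3 and d5
    d3 = d2 ^ c1 ^ t23
    d4 = t12 ^ t14
    d5 = t12 ^ t23
    d6 = d1 ^ t14 ^ c2
    return d3, d4, d5, d6

# Round-trip check against the encoding equations, for every data pattern:
for bits in range(64):
    d1, d2, d3, d4, d5, d6 = [(bits >> i) & 1 for i in range(6)]
    S = d4 ^ d5
    assert evenodd_decode_cof(d1, d2, d1 ^ d3 ^ d5, d2 ^ d4 ^ d6,
                              d1 ^ d6 ^ S, d2 ^ d3 ^ S) == (d3, d4, d5, d6)
```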

Again, the naive approach requires 12 XORs. But applying the COF rule and computing shared XORs first (e.g., d1 + d2, c1 + c4 and c2 + c3 in this case, also shown in Figure 2(c)), the total number of required XORs is 9. This is also less than the original EVENODD decoding (10 XORs).

III. A 2-FAULT-TOLERANT REED-SOLOMON CODE

In this section, we construct a (5, 3) Reed-Solomon code and apply the COF rule to both encoding and decoding.

X-Codes and B-Codes

The encoding operations are optimal, i.e., their update complexity achieves the theoretical lower bound. A common structure of X-codes and B-codes is that parity bits are no longer placed in separate columns but are mixed with information bits. This is the key to achieving the lower bound on the update complexity. The error model for both codes is as follows: if a column has at least one bit erasure (or error), then this column is considered an erasure or error column. Both codes are of distance 3, i.e., they can either correct two erasures, detect two errors, or correct one error. A common property of these codes is that the encoding and decoding procedures use only simple XOR and cyclic shift operations.

X-Codes:

The X-code has a very simple geometrical structure: the parity bits are constructed along two groups of parallel parity lines of slopes 1 and −1. This is the origin of the name X-code. This simple geometrical structure allows simple erasure-decoding and error-decoding algorithms, using only XORs and vector cyclic shift operations.

The code is viewed as an (n, k, d) code over GF(q^n). Its distance is defined over GF(q^n), i.e., over the columns of the array. The X-code is an MDS code of distance d = 3, i.e., k = n − 2, which meets the Singleton bound: d = n − k + 1.

The redundancy of the X-code is obtained by adding two parity rows rather than two parity columns, which results in the nice property that updating one information symbol affects only two parity symbols, i.e., the update complexity is always two.
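The diagonal-parity idea can be illustrated with a simplified sketch (our own approximation in the spirit of the X-code, not the exact published construction; it checks only the update-complexity property that each information bit feeds exactly one parity bit on each slope):

```python
def xcode_like_encode(info, n):
    """info: (n-2) x n bit array, one column per disk. Append two parity
    rows computed along diagonals of slope +1 and -1 (indices mod n).
    Simplified sketch of the diagonal-parity structure."""
    rows = len(info)                                 # n - 2 information rows
    p_plus = [0] * n
    p_minus = [0] * n
    for r in range(rows):
        for c in range(n):
            p_plus[(c + r + 1) % n] ^= info[r][c]    # slope +1 diagonal
            p_minus[(c - r - 1) % n] ^= info[r][c]   # slope -1 diagonal
    return info + [p_plus, p_minus]

# Update complexity 2: flipping one information bit changes exactly one bit
# in each of the two parity rows.
n = 5
zero = [[0] * n for _ in range(n - 2)]
base = xcode_like_encode(zero, n)
flipped = [row[:] for row in zero]
flipped[1][3] ^= 1
new = xcode_like_encode(flipped, n)
changed = sum(a != b
              for pr, nr in zip(base[-2:], new[-2:])
              for a, b in zip(pr, nr))
assert changed == 2
```

Because every information bit lies on exactly one diagonal of each slope, each update touches exactly two parity bits, matching the lower bound discussed above.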
